TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos
Korawat Charoenpitaks1,*, Van-Quang Nguyen2,*, Masanori Suganuma1, Kentaro Arai3, Seiji Totsuka3, Hiroshi Ino3, Takayuki Okatani1,2
1Tohoku University 2RIKEN Center for AIP 3DENSO CORPORATION
korawat@vision.is.tohoku.ac.jp, quang.nguyen.jz@riken.jp, suganuma@vision.is.tohoku.ac.jp, kentaro.arai.j8n@jp.denso.com, seiji.totsuka.j7z@jp.denso.com, hiroshi.naganawa.j7v@jp.denso.com, okatani@tohoku.ac.jp
*Corresponding authors
Abstract
The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform in these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy of up to 85%, significantly enhancing performance on the tasks. Additionally, we demonstrate performance transfer by co-training TB-100k with another traffic dataset, leading to improved performance on the latter. Overall, this study represents a step forward by introducing a comprehensive benchmark, high-quality datasets, and baselines, thus supporting the gradual integration of MLLMs into the perception, prediction, and planning stages of AD.
Introduction
The application of MLLMs to Autonomous Driving (AD) has gained increasing attention, particularly for predicting risks and planning actions based on images or videos from in-vehicle cameras. Notably, MLLMs have demonstrated their effectiveness in international competitions such as the Autonomous Grand Challenge (Renz et al. 2024) and in specific tasks such as traffic sign detection (Zhang et al. 2024b). However, two major challenges remain.
First, current MLLMs, ranging from proprietary models like GPT-4o (Achiam et al. 2023) and Gemini (Team et al. 2023) to open-source models like LLaVA (Liu et al. 2024b), are not optimized for dashcam images or traffic scenes. These models are primarily trained on vast amounts of web-based text and image-text pairs, with minimal traffic-specific data, limiting their effectiveness in AD scenarios. To improve the generalizability of MLLMs, incorporating high-quality domain-specific datasets into the pre-training data is crucial, as shown in (Li et al. 2024a; Zhang et al. 2024a).
Figure 1: Examples of four tasks from TB-Bench; additional task examples are provided in the supplementary material.
Second, the field lacks a dedicated benchmark for evaluating MLLMs’ abilities in spatiotemporal understanding tasks, given that their capabilities in vision-centric tasks are still developing. While these models are designed to handle diverse vision-language tasks, they struggle with complex visual understanding, such as spatial reasoning and object relationships (Tian et al. 2024). Even in common domains unrelated to traffic scenes, benchmarks are insufficient (e.g., Cambrian-1 (Tong et al. 2024), the V-star benchmark (Wu and Xie 2024)). Given that AD requires sophisticated geometric and spatiotemporal understanding to capture the dynamic interactions between the vehicle and other entities, high-quality and dedicated benchmarks are much needed.
While MLLMs are increasingly applied in AD, the aforementioned challenges remain insufficiently addressed. Recent research has primarily focused on using pretrained MLLMs derived from web data for specific AD tasks, without thoroughly investigating these challenges. Another issue is determining which AD tasks, across different stages and levels, should be addressed by MLLMs. Within the “perception, prediction, and planning” framework, the question becomes: which stages should MLLMs handle?
Focusing solely on the perception stage and its associated tasks, using MLLMs may not appear optimal. Established technologies like LIDAR and CV methods such as object detection and visual odometry can accurately capture the vehicle’s position and spatiotemporal relationships in Euclidean space.
However, when considering the use of MLLMs (or LLMs) in later stages, relying on such “Euclidean geometrically accurate” information in the earlier stage might not be optimal. Firstly, it is unclear how to input this information into LLMs. Moreover, achieving advanced understanding in later stages may require information representations specifically suited for MLLMs to perform higher-level tasks. This suggests the need for MLLM involvement from the perception stage.
This study adopts this perspective, aiming to involve MLLMs in the perception stage by expressing spatiotemporal traffic scene tasks in natural language text. The underlying conjecture, as mentioned above, is that this approach could be important for prediction and planning in later stages by MLLMs.
To address the challenge of lacking dedicated benchmarks, we introduce TB-Bench, one of the first comprehensive benchmarks specifically designed to evaluate MLLMs’ understanding of traffic behaviors. This benchmark assesses MLLMs’ capabilities to perform perception tasks based on dashcam images or videos from the ego-centric views of vehicles, including determining the spatial position or orientation of other vehicles and interpreting the behaviors of both ego-vehicles and surrounding traffic. Compared to existing benchmarks, TB-Bench encompasses a wider array of eight distinct perception tasks, each corresponding to a typical driver maneuver. Figure 1 shows examples of several tasks. To ensure consistent evaluation across a diverse range of MLLMs, we employ a straightforward protocol. Specifically, we pair questions with images or video clips, requiring an MLLM to respond in plain text. Performance of the MLLM is then assessed by measuring response accuracy.
To address the challenge of insufficient training data for AD perception tasks, we introduce a high-quality dataset focused on traffic behavior understanding from ego-centric views. This dataset aligns with the task design of TB-Bench and is used for vision-language instruction tuning (VLIT) of MLLMs. We generate high-quality question-and-answer pairs using samples from established datasets such as KITTI, ONCE, and Argoverse 2. In total, we create TB-Bench comprising 2,000 manually constructed samples, along with two versions of training datasets: TB-250k containing 250,000 samples, and TB-100k (a more balanced version).
In addition to evaluating existing MLLMs, we introduce a generic framework that serves as a strong baseline for our tasks, consisting of three standard components: a pretrained vision encoder, a multi-modal connector, and a pretrained LLM. The vision encoder extracts visual representations from inputs with varying numbers of frames, the connector projects these embeddings into the LLM’s embedding space, and the LLM generates task-specific responses on our benchmark. This lightweight model is designed for efficient fine-tuning on our proposed dataset(s).
Using TB-Bench to evaluate popular proprietary models (GPT-4o and Gemini) and various state-of-the-art open-source MLLMs (LLaVA, Bunny, and InternVL), we find that none of these models excels across all traffic behavior understanding tasks. On average, the open-source models underperform random guessing, while the proprietary models achieve only slightly better results, with average accuracy below 35%. In contrast, when fine-tuned on TB-100k or TB-250k, our proposed baseline models demonstrate strong performance across all tasks, with average accuracy ranging from 77% to 85%. This highlights the effectiveness of our dataset in enhancing MLLM traffic behavior understanding.
Overall, our contributions are fourfold: 1) we introduce TB-Bench, a benchmark for assessing MLLMs on eight perception tasks of traffic behavior understanding; 2) we present the VLIT datasets (TB-100k and TB-250k) for the tasks, along with a generic baseline; 3) we conduct extensive experiments demonstrating the performance gap between existing MLLMs and the fine-tuned baselines; and 4) we show that our VLIT dataset, i.e., TB-100k, can be used as part of a co-training dataset to generalize to other driving benchmarks, such as BDD-X (Kim et al. 2018).
Related Work
A summary of existing studies and benchmarks across various AD tasks is presented in Table 1.
Autonomous Driving Tasks
The majority of evaluations in the AD field focus on end-to-end driving systems, open-loop planning, or standalone task schemes, such as single-round visual question answering (VQA) or captioning. Traditionally, the AD framework consists of perception, prediction, and planning tasks (Nie et al. 2023), although slight variations exist, e.g., predicting intention-level outputs instead of trajectories (Tian et al. 2024).
Generally, perception tasks in end-to-end driving systems are mainly auxiliary tasks, consisting of all available supervision signals provided by the data source. For example, nuScenes (Caesar et al. 2020) provides BEV information, segmentation labels, and more. Consequently, multi-task learning is applied to these tasks, such as object detection, tracking, and segmentation. This approach is consistent across recent similar AD planning datasets, whether in open-loop or simulation scenarios. Occasionally, pretrained VL models are utilized to enhance these modules.
Table 1: Summary of existing studies and benchmarks across AD tasks (brackets indicate tasks involving planning).
| Benchmark | Visual data modality | Perception (planning) tasks |
|---|---|---|
| Standalone tasks in AD: | | |
| DRAMA (Malla et al. 2023) | Single image | PER, REA |
| Rank2Tell (Sachdeva et al. 2024) | Single image | PER, REA, LANE, TLS |
| BDD-X (Kim et al. 2018) | Multi-frame | PER, (AC) |
| BDD-OIA (Xu et al. 2020) | Single image | PER |
| TrafficQA (Xu, Huang, and Liu 2021) | Multi-frame | PER, PRED, REA |
| LingoQA (Marcu et al. 2023) | Multi-frame | |
| NuScenes-QA (Qian et al. 2024) | Multi-view | PER, PRED, REA |
| NuScenes-MQA (Inoue et al. 2024) | Multi-view | OBJ, SP, RD, OD |
| MAPLM-QA (Cao et al. 2024) | Multi-view, BEV image | LANE |
| DriveLM (Sima et al. 2024) | Single image | PER, PRED, (PLAN) |
| MLLM benchmarks: | | |
| SpatialRGPT (Cheng et al. 2024) | Single image | RD, SR, OR |
| SEED (Li et al. 2023a) | Multi-image, multi-frame | PER, PRED, REA, AR |
| MVBench (Li et al. 2024b) | Multi-frame | PER, PRED, REA, LOC, AR |
| MME (Fu et al. 2023) | Single image | PER, PRED |
| MMMU (Yue et al. 2024) | Multi-image | PER, REA, KNOW |
| ELM (Zhou et al. 2024) | Multi-frame | PER, PRED, TLS, OD, OT, AR, (PLAN) |
| Cambrian-1 (Tong et al. 2024) | Single image | RD, SR, D |
| OpenEQA (Majumdar et al. 2024) | Multi-frame | OBJ, SR, KNOW, LOC, REA |
| TB-Bench (Ours) | Single image, multi-frame | RD, SR, OR, EGO-LANE, OBJ-LANE, OBJ-TURN, EGO-TURN, EGO-TRA |

Abbreviations: OD: 2D and 3D object detection; OT: 2D and 3D object tracking; D: depth estimation; OBJ: object presence, category, etc.; KNOW: world knowledge; LOC: location or coordinates; LANE: road, lane, intersection, etc.; PER: general perception; PRED: general prediction; PLAN: general planning; REA: general reasoning; TLS: traffic light or sign; AC: action category; AR: general action recognition; RD: relative distance; SR: spatial reasoning; OR: orientation reasoning; EGO-LANE: other lane to ego; OBJ-LANE: other lane changing; OBJ-TURN: other turning; EGO-TURN: ego turning; EGO-TRA: ego traverse distance.
Other popular traffic planning datasets include KITTI (Geiger et al. 2013), ONCE (Mao et al.), Waymo Open (Sun et al. 2020), and Argoverse2 (Wilson et al. 2021), which are inherently similar to nuScenes in their characteristics.
Pretrained VL models are commonly known for their excellence in scene understanding, details, and visual cues, yet they show limitations in spatial grounding and reasoning (Tian et al. 2024). In detail, most standalone task schemes focus on perception tasks, which include general event VQA (Xu, Huang, and Liu 2021; Marcu et al. 2023), environment and weather conditions, traffic signals, and lane information (Wang et al. 2024; Cao et al. 2024). These tasks also encompass critical object detection (Malla et al. 2023; Sachdeva et al. 2024) or tracking in various forms, such as bounding box coordinates (Tian et al. 2024), region proposals (Deruyttere et al. 2019; Xu et al. 2020), 2D (Wu et al. 2023a) and 3D (Wu et al. 2023b) language-guided object tracking, as well as scene analysis covering attributes or motion of objects such as size, position, direction, distance, spatial position relationships (Qian et al. 2024), and orientation (Cheng et al. 2024). In particular, there is a comprehensive driving-with-language task that integrates all aspects of perception, prediction, and planning in a VQA format (Sima et al. 2024). In prediction tasks, all previous perception inputs are used to predict an object’s future trajectory, such as parking or moving, and its interactions with the ego-vehicle. The planning stage involves combining prior information to generate actions, decision descriptions (Xu et al. 2020), and trajectory waypoints (Tian et al. 2024; Sima et al. 2024).
MLLMs and Benchmarks
VL pre-training and foundation models started with learning from a broader source of supervision, specifically raw text at an internet scale (Radford et al. 2021), enabling zero-shot transfer of the model to downstream tasks. Notably, approaches that connect VL pre-training to existing LLMs, referred to as MLLMs (Li et al. 2023b), enable capabilities similar to those of LLMs, such as image-to-text generation, improved via instruction tuning and in-context learning capabilities. Current frontier families of MLLMs, such as LLaVA (Liu et al. 2024b), VILA (Lin et al. 2024), and InternVL (Chen et al. 2024), utilize a similar architectural paradigm: a vision encoder, a multi-modal projector, and an LLM connected in sequence. Despite some early work attempting resampler techniques like Q-Former (Dai et al. 2023), state-of-the-art models use simpler linear layers with scaling to higher resolutions, focusing on higher-quality VLIT instead. Another line of studies works on lightweight versions of MLLMs, optimizing for more informative, condensed training data and design choices (He et al. 2024; Shao et al. 2024). The latest MLLMs focus on simultaneously tackling multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) scenarios, which show emergent capabilities and enhance overall performance (Li et al. 2024a). Nevertheless, the standard paradigm for MLLMs is to evaluate on multiple general benchmarks, aiming for strong overall performance.
Existing MLLM benchmarks aim to comprehensively evaluate various dimensions, but there is no standardized taxonomy for benchmark design. General benchmarks in the VL space started with simple perception-oriented tasks (Fu et al. 2023), followed by multi-frame benchmarks (Li et al. 2023a, 2024b) with action recognition and VL knowledge-based reasoning (Yue et al. 2024). Spatial or vision-centric benchmarks (Tong et al. 2024; Cheng et al. 2024) are becoming more relevant for addressing previously identified weaknesses. Subsequently, specialized benchmarks gained more attention, introducing tasks from different domains, such as robotics (Majumdar et al. 2024) and AD (Sima et al. 2024). Even so, there is still a lack of studies covering simple yet very important skills and behaviors in the AD context.
Benchmark Design
TB-Bench is created to fill the benchmark gap in evaluating MLLMs for AD, providing a specialized benchmark that rigorously tests their capability to understand complex traffic behaviors from an ego-centric perspective.
Task Design
We generate question-and-answer pairs in a VQA format, where the model takes an image or video paired with a question as input and produces a corresponding answer. Both the question and answer are expressed in a single sentence of free-form text.
To achieve the above goal, we consider multiple types of Q&A pairs, each linked to a specific driver maneuver behavior. We refer to the Pre-crash Scenarios typology from the National Automotive Sampling System (NASS) variables (Najm et al. 2007), which is also utilized in the CARLA simulator (Dosovitskiy et al. 2017). This typology includes a total of 65 pre-crash scenarios, categorized into nine accident types. Each scenario is described in the format of ‘an accident type: a detailed scenario.’ For example, the ‘lane change’ accident type includes scenarios like ‘one vehicle passing while another is turning.’ See the supplementary material for the full list of scenarios.
Focusing on typical maneuver behaviors derived from NASS scenarios, we have identified eight distinct Q&A types, referred to as ‘tasks,’ as shown in Table 2. Some tasks require numerical outputs (e.g., ‘distance in meters’), while others require discrete classes (e.g., ‘back,’ ‘back left,’ etc.). It is important to note that the models are expected to provide these outputs in their natural language responses. Fig. 1 presents examples for four of the eight tasks, each of which consists of input image(s) accompanied by a question and a ground-truth answer. The visual input is either a single image or multiple images (up to eight), depending on the task, as will be explained later.
Referencing Entities
Some tasks require the model to determine the spatial position or orientation of other vehicles, as shown in Fig. 1. When multiple vehicles are present in a scene, it is essential to distinguish between them in both the questions and answers. One approach is to describe the vehicle by its attributes, such as “black compact sedan,” but this can pose challenges in ensuring the model accurately identifies and differentiates similar objects using such descriptions. To avoid these complications and focus on evaluating the model’s spatial understanding, we label each target traffic entity as ‘Entity #n’ in the questions and answers, where $n$ corresponds to its index in the input image(s); see examples in the upper part of Fig. 1. To identify these entities, we draw colored three-dimensional bounding boxes (BBs) directly in the input image(s), using a consistent color for each entity index $n$ throughout the dataset.
Table 2: Tasks and Concepts Addressed in Each. ‘Classes’ column indicates the types of outputs, i.e., the number of discrete classes or numerical outputs (indicated by $\mathcal{R}$ ); ‘Orientation Reasoning’ task contains both output types.
| Task type | Abstracted concepts | Classes |
|---|---|---|
| Spatial Information: | | |
| Relative Distance | distance (in meters) | $\mathcal{R}$ |
| Spatial Reasoning | back, back left, back right, front, front left, front right | 6 |
| Orientation Reasoning | opposite, perpendicular, similar, angle | 3 / $\mathcal{R}$ |
| Object Behavior: | | |
| Other Lane to Ego | front lane, front left lane, front right lane, oncoming lane | 4 |
| Other Lane Changing | left lane change, no change, right lane change | 3 |
| Other Turning | go straight, turn left, turn right | 3 |
| Ego Behavior: | | |
| Ego Turning | go straight, turn left, turn right | 3 |
| Ego Traverse Distance | traverse distance (in meters) | $\mathcal{R}$ |
Specifically, we use cyan and magenta BBs for ‘Entity #1’ and ‘Entity #2,’ respectively. Our dataset includes up to two entities per scene, i.e., $n = 1$ or $2$. An additional advantage of this method is that it requires minimal instruction tuning, or even no extra learning, for MLLMs to adapt. Furthermore, it is compatible with multi-view, multi-frame, and multi-scale modalities, as demonstrated in AnyRes (Liu et al. 2024a), UniRes (Zhang et al. 2024a), and Interleave (Li et al. 2024a).
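As an illustration of this convention, the snippet below draws a fixed color per entity index. It is a minimal sketch assuming OpenCV and axis-aligned 2D boxes; the dataset itself draws projected three-dimensional boxes, and the helper name and box format here are hypothetical.

```python
import cv2

# Fixed colors per entity index (OpenCV uses BGR order): cyan for Entity #1,
# magenta for Entity #2, kept consistent across the whole dataset.
ENTITY_COLORS = {1: (255, 255, 0), 2: (255, 0, 255)}

def draw_entity_boxes(image, boxes):
    """Draw color-coded boxes so questions and answers can refer to 'Entity #n'.

    `boxes` maps an entity index (1 or 2) to an axis-aligned box
    (x1, y1, x2, y2) in pixel coordinates; plain rectangles stand in for the
    projected 3D bounding boxes used in the actual dataset.
    """
    for idx, (x1, y1, x2, y2) in boxes.items():
        cv2.rectangle(image, (x1, y1), (x2, y2), ENTITY_COLORS[idx], thickness=3)
    return image
```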
Evaluation
Our benchmark requires MLLMs to generate plain text outputs. Since the goal is to evaluate the spatiotemporal understanding capabilities of MLLMs, the accuracy of their outputs should be assessed using methods tailored to this requirement.
The questions in the dataset are broadly classified into two categories based on the type of answers expected. One category includes questions about positional relationships or orientation, with typical answers like “positioned at the back right” or “a right-turn maneuver.” The other category involves questions requiring numerical answers, such as “is situated 15.53 meters away.”
For the first category of Q&A, keywords are manually selected for each task or ground-truth answer, and their presence in the output text is identified using rule-based methods (i.e., regular expressions). For the second category, the predicted value is compared to the correct answer, and if the difference falls within a specified range, the prediction is considered correct; otherwise, it is deemed incorrect. In the experiments, thresholds are set such that a difference within 25% of the correct value is considered acceptable for distance, and a difference within 15 degrees is acceptable for angle. Refer to the supplementary material for more details.
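A minimal sketch of such a rule-based checker is shown below. The keyword patterns and function names are illustrative assumptions; the actual per-task keyword lists are selected manually and are not reproduced here.

```python
import re

# Illustrative keyword patterns for the Spatial Reasoning task; compound
# labels come first so that "front left" is not matched as plain "front".
SPATIAL_PATTERNS = [
    ("front left", r"front[\s-]?left"),
    ("front right", r"front[\s-]?right"),
    ("back left", r"back[\s-]?left"),
    ("back right", r"back[\s-]?right"),
    ("front", r"\bfront\b"),
    ("back", r"\bback\b"),
]

def check_class_answer(prediction: str, gt_label: str) -> bool:
    """Return True if the first class keyword found in the model output
    matches the ground-truth label."""
    text = prediction.lower()
    for label, pattern in SPATIAL_PATTERNS:
        if re.search(pattern, text):
            return label == gt_label
    return False

def check_numeric_answer(prediction: str, gt_value: float, tolerance: float) -> bool:
    """Return True if the first number in the output is within the tolerance
    of the ground truth (e.g., 25% of the value for distances, an absolute
    15 degrees for angles)."""
    match = re.search(r"-?\d+(?:\.\d+)?", prediction)
    return match is not None and abs(float(match.group()) - gt_value) <= tolerance

# Example usage:
# check_class_answer("Entity #1 is positioned at the back right.", "back right")  # True
# check_numeric_answer("Entity #1 is situated 15.53 meters away.", 14.0, 0.25 * 14.0)  # True
```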
Generation of VQA Data
Outline
To generate Q&A pairs for the eight tasks mentioned, we repurpose existing datasets, specifically KITTI (Geiger, Lenz, and Urtasun 2012), ONCE (Mao et al.), and Argoverse2 (Wilson et al. 2021). These datasets are originally designed for studying object detection, localization, and tracking in three-dimensional space, providing detailed three-dimensional geometry of traffic entities. KITTI and ONCE, in particular, offer object class information and 3D bounding boxes for each traffic entity, including their position, dimensions, and yaw angle. Argoverse2 further enriches this with lane information relative to the ego vehicle.
Figure 2: Overview of Data Generation Pipeline. Left: Sensory data is processed into higher-level attributes. Middle-Top: Spatial positioning and lane orientation relative to the ego-vehicle are determined. Middle-Bottom: Q&A samples are generated using rules and LLM augmentation. Right: Data is filtered and refined for the final dataset.
To align with the task design mentioned above (Table 2), the quantities provided by these datasets, mostly represented in Euclidean space, are converted into abstract concepts, such as the six discrete angular relations between two vehicles (e.g., front right, back left), lanes relative to the ego-car (e.g., front left lane, oncoming lane), and lane changing.
For the first three tasks—‘Relative Distance,’ ‘Spatial Reasoning,’ and ‘Orientation Reasoning’—we generate Q&A pairs using samples from KITTI and ONCE, as these tasks do not require lane information from the ego vehicle or others. Since these tasks can be performed using a single image, we utilize a static dashcam image as the visual input. For the remaining tasks—‘Other Lane to Ego,’ ‘Other Lane Changing,’ ‘Other Turning,’ ‘Ego Turning,’ and ‘Ego Traverse Distance’—which require lane information and a multi-frame source, we generate Q&A pairs using Argoverse2. Given that these tasks involve temporal changes, we extract eight image frames from the ‘long scenario’ sequences in the dataset for each Q&A pair, using these sequences as the visual input for models.
After generating the data automatically, we conduct a manual screening process. Based on the extent of screening, the data is organized into three distinct datasets. One dataset, comprising 2,000 samples, is designated for evaluation purposes and is referred to as the ‘benchmark’ in this paper. These samples have undergone thorough manual inspection, removing low-quality samples and ensuring an equal number of samples (i.e., 250) per task. The remaining two datasets are intended for model training: the first, TB-250k, contains 250,000 samples; the second, TB-100k, includes over 100,000 samples that have been filtered to balance the number of samples per task. Table 3 summarizes the overall statistics of these datasets.
Table 3: Statistics of TB-Bench, TB-100k, and TB-250K. Source datasets: K (KITTI), O (ONCE), Arv2 (Argoverse2).
| Task type | Source / #frames | TB-Bench | TB-250k | TB-100k |
|---|---|---|---|---|
| Spatial Information: | | | | |
| Relative Distance | [K, O] / 1 | 250 | 35k | 10k |
| Spatial Reasoning | [K, O] / 1 | 250 | 70k | 30k |
| Orientation Reasoning | [K, O] / 1 | 250 | 70k | 30k |
| Object Behavior: | | | | |
| Other Lane to Ego | [Arv2] / 8 | 250 | 50k | 20k |
| Other Lane Changing | [Arv2] / 8 | 250 | 1.5k | 1.5k |
| Other Turning | [Arv2] / 8 | 250 | 1.5k | 1.5k |
| Ego Behavior: | | | | |
| Ego Turning | [Arv2] / 8 | 250 | 1.5k | 1.5k |
| Ego Traverse Distance | [Arv2] / 8 | 250 | 25k | 15.5k |
| Total | | 2,000 | 254k | 110k |
Details of the Pipeline
The Q&A pairs are generated automatically, with manual inspection following the automated process. The only exception is the ‘Other Lane Changing’ task, where we manually generate Q&A pairs due to noisy lane information at intersections. Figure 2 illustrates the pipeline used for generating Q&A pairs from these datasets.
The process unfolds as follows. The input to the pipeline is a single sample from the datasets, which can be either a static image with a set of entity attributes from KITTI/ONCE or a list of sequences with similar data from Argoverse2. The pipeline begins by extracting key information from the input, as depicted in the left panel of Fig. 2. This is followed by a processing step, shown in the middle-top panel of Fig. 2, where spatial positions and facing angles relative to an anchor object are calculated. Additionally, the ‘lane to ego’ task identifies on which side the entity is located relative to the ego vehicle. For turning behaviors, we record the accumulated turning angle of each object to determine its recent motion. For lane changes, a flag is recorded if the lane ID changes compared to the previous step. Similarly, all numerical sensor ground-truth data—such as the position, dimensions, and angle of all entities—are processed into attributed data, such as distance to ego and spatial position.
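The sketch below illustrates, under simplifying assumptions, how such attributes and event flags could be derived from per-frame poses and lane IDs. The function names and the sign convention (counter-clockwise yaw, positive change = left turn) are assumptions; only the 25-degree turn threshold comes from the task definition in the supplementary material.

```python
import math

def spatial_attributes(ego_xy, ego_yaw_rad, target_xy):
    """Distance to the ego vehicle and the signed angle (degrees) from the
    ego's heading to the target position, later mapped to a discrete relation."""
    dx, dy = target_xy[0] - ego_xy[0], target_xy[1] - ego_xy[1]
    distance = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx) - ego_yaw_rad)
    angle = (angle + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return distance, angle

def turning_flag(yaws_rad, threshold_deg=25.0):
    """Accumulated heading change over the clip; a turn is flagged when the
    change exceeds the threshold (25 degrees, per the task definition)."""
    total = math.degrees(yaws_rad[-1] - yaws_rad[0])
    total = (total + 180.0) % 360.0 - 180.0
    if total > threshold_deg:
        return "turn left"    # assumed sign convention
    if total < -threshold_deg:
        return "turn right"
    return "go straight"

def lane_change_flag(lane_ids):
    """Flag a lane change whenever the lane ID differs from the previous frame."""
    return any(prev != cur for prev, cur in zip(lane_ids, lane_ids[1:]))
```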
Figure 3: The overall architecture of our baseline framework.
Finally, a rule-based process, depicted in the middle-bottom panel of Fig. 2, is triggered to identify the corresponding task and generate Q&A pairs. Further details can be found in the supplementary material.
In the next phase, a rule-based system generates Q&A samples from the processed data attributes. This depends on the type of task: tasks other than lane-change and turning behaviors can be created from any frame, without requiring an event to trigger them, so they naturally yield more data samples. The rule-based Q&A is first generated with simple short answers, such as ‘oncoming lane’ or ‘turn left,’ and is then augmented into a more complex sentence using text-only information with an LLM; we used Microsoft-Phi3-medium (Abdin et al. 2024).
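The following sketch shows the two-step idea: a templated short answer followed by a text-only LLM rewrite. The question templates, prompt wording, and the `call_llm` placeholder are hypothetical; only the short-answer-then-augmentation flow comes from the paper.

```python
import random

# Hypothetical question templates for the Ego Turning task; the actual
# templates used to build TB-100k/TB-250k are not reproduced here.
EGO_TURN_QUESTIONS = [
    "What maneuver does the ego vehicle perform in this clip?",
    "Does the ego vehicle go straight, turn left, or turn right?",
]

def make_rule_based_qa(short_answer: str) -> dict:
    """Step 1: pair a templated question with the rule-based short answer,
    e.g. 'turn left' or 'oncoming lane'."""
    return {"question": random.choice(EGO_TURN_QUESTIONS), "answer": short_answer}

def augment_answer(short_answer: str, call_llm) -> str:
    """Step 2: ask a text-only LLM (Phi-3-medium in the paper) to rewrite the
    short answer as one natural sentence without adding new facts."""
    prompt = ("Rewrite the following driving answer as one natural sentence. "
              f"Do not add any new information. Answer: '{short_answer}'")
    return call_llm(prompt)  # `call_llm` is a placeholder for an LLM client
```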
Baseline Framework
We present a generic framework that serves as a strong baseline for our tasks, comprising three standard components: a vision encoder, a multi-modal connector (a two-layer MLP), and an LLM. The vision encoder extracts visual representations from input frames, the multi-modal connector projects these representations into the LLM’s embedding space, and finally the LLM generates a response based on the given question and visual embeddings. Figure 3 illustrates the architecture of our framework.
We now explain how to adaptively extract visual representations from varying numbers of frames and input them into the LLM. Given $N$ frames of size $H\times W$ with color-coded bounding boxes, the vision encoder processes each frame individually to produce $N$ visual representations of size $[H/p\times W/p, C]$, where $p$ is the patch size and $C$ is the embedding dimension of the encoder. These visual representations are then projected into the LLM’s embedding space of dimension $D$ using the multi-modal connector, resulting in $N$ visual embeddings of size $[H/p\times W/p, D]$.
Inputting all visual embeddings of the $N$ frames into the LLM can be computationally expensive. To address this, we spatially sample a subset of these visual embeddings per frame. Specifically, we apply adaptive average pooling to reduce each frame’s embeddings from $[H/p\times W/p, D]$ to $[k = h\times w, D]$, where $k\ll H/p\times W/p$. The value of $k$ is a hyperparameter. The sampled embeddings from all $N$ frames are then reshaped and concatenated, preserving spatial and temporal order, which yields final visual embeddings of size $[N\times k, D]$ that are passed into the LLM along with the textual embeddings.
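A minimal PyTorch sketch of this connector is given below, assuming a square patch grid and a two-layer MLP projector; the module name, MLP hidden width, and pooling call are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PooledConnector(nn.Module):
    """Project per-frame vision features into the LLM embedding space and keep
    only k = h*w pooled tokens per frame, preserving spatial and frame order."""

    def __init__(self, vision_dim: int, llm_dim: int, h: int = 4, w: int = 4):
        super().__init__()
        self.proj = nn.Sequential(  # two-layer MLP connector
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.pool = nn.AdaptiveAvgPool2d((h, w))
        self.h, self.w = h, w

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [N_frames, (H/p)*(W/p), C] from the vision encoder
        n, num_patches, _ = feats.shape
        side = int(num_patches ** 0.5)                    # assumes a square patch grid
        x = self.proj(feats)                              # [N, L, D]
        x = x.transpose(1, 2).reshape(n, -1, side, side)  # [N, D, side, side]
        x = self.pool(x)                                  # [N, D, h, w]
        x = x.flatten(2).transpose(1, 2)                  # [N, h*w, D]
        return x.reshape(1, n * self.h * self.w, -1)      # [1, N*k, D] for the LLM
```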
To process text input, we tokenize the question and its ground-truth response, converting them into textual embeddings. These are then combined with the visual embeddings and input into the LLM. We train the model by minimizing cross-entropy loss on the response token predictions. During inference, only the question is used as text input.
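A sketch of how this response-only supervision could be set up with a Hugging Face-style tokenizer is shown below; -100 is the conventional ignore index for cross-entropy, and the function name is hypothetical.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_training_example(tokenizer, question: str, response: str):
    """Tokenize question + response and supervise only the response tokens."""
    q_ids = tokenizer(question, add_special_tokens=False)["input_ids"]
    r_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor(q_ids + r_ids + [tokenizer.eos_token_id])
    labels = torch.tensor([IGNORE_INDEX] * len(q_ids)
                          + r_ids + [tokenizer.eos_token_id])
    return {"input_ids": input_ids, "labels": labels}

# The visual embeddings from the connector are prepended to the embedded
# input_ids (their label positions are also set to IGNORE_INDEX); at
# inference time only the question and the images are provided.
```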
Experiments
Experimental Settings
Our proposed framework is compatible with any vision encoder and LLM. In this study, we utilize the pretrained SigLIP-L/14 (Zhai et al. 2023) as the vision encoder and the powerful pretrained Qwen 0.5B, either version 1.5 or 2.0 (Bai et al. 2023; Yang et al. 2024), as the LLM, while initializing the parameters of the multi-modal connector randomly. To preserve pretrained LLM capabilities and enable efficient task-specific fine-tuning, we apply LoRA (Hu et al. 2022) with a rank of 64. During training, we freeze the vision encoder and LLM parameters, updating only the parameters of the multi-modal connector and LoRA adapters.
For tasks requiring temporal information, the number of frames $N$ is 8; otherwise $N=1$. Each frame is resized to $384\times384$ as the input to SigLIP-L/14, with the number of sampled visual embeddings $k$ set to 16 (i.e., $h = w = 4$).
We fine-tune our models on either TB-100k or TB-250k, and then report the accuracy on TB-Bench. We use AdamW (Loshchilov and Hutter 2017) with a learning rate of 2e-4 and a batch size of 64 for 10 epochs, with the learning rate adjusted via a cosine scheduler.
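Under these stated hyperparameters, the fine-tuning setup could look roughly like the sketch below, using the `peft` library for LoRA; the target modules, LoRA alpha/dropout, and the function name are assumptions and are not taken from the paper.

```python
import torch
from peft import LoraConfig, get_peft_model

def setup_finetuning(llm, connector, num_training_steps):
    """Attach rank-64 LoRA adapters to the frozen LLM and build the AdamW
    optimizer with a cosine learning-rate schedule.

    `llm` stands for the pretrained Qwen 0.5B model and `connector` for the
    multi-modal connector; the LoRA target modules below are an assumption.
    """
    lora_cfg = LoraConfig(r=64, lora_alpha=128, lora_dropout=0.05,
                          target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
    llm = get_peft_model(llm, lora_cfg)

    # Only the connector and the LoRA adapters receive gradients; the vision
    # encoder and base LLM weights stay frozen.
    params = list(connector.parameters()) + \
             [p for p in llm.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=2e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_training_steps)
    return llm, optimizer, scheduler
```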
Table 4: Results of compared methods on TB-Bench, reported in accuracy (%), where higher indicates better performance. †Random-guess results for numerical-output tasks are considered zero. *In-context learning for single-frame tasks uses three in-context examples, while multi-frame tasks use one. Hugging Face and API names are used for easy reference.
| Model | RD↑ | SR↑ | OR↑ | EGO-LANE↑ OBJ-LANE↑ | OBJ-TURN↑ EGO-TURN↑ EGO-TRA↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| Random† | 0.0 | 16.7 | 17.1 | 25.0 | 33.3 | 19.8 |
| Zero-shot | | | | | | |
| LLaVA-1.5-7B | 10.8 | 16.8 | 28.0 | 28.4 | 20.4 | 18.1 |
| LLaVA-v1.6-Mistral-7B | 4.0 | 25.6 | 30.8 | 20.4 | 26.0 | 19.6 |
| LLaVA-NeXT-Video-7B | 3.6 | 0.8 | 13.2 | 10.4 | 18.8 | 12.4 |
| LLaVA-Interleave-Qwen-7B | 5.6 | 24.8 | 10.8 | 31.6 | 19.2 | 17.4 |
| Bunny-v1.1-4B | 24.4 | 20.4 | 19.6 | 28.4 | 16.0 | 20.4 |
| Bunny-v1.1-Llama-3-8B-V | 7.6 | 16.4 | 30.0 | 26.8 | 18.4 | 17.8 |
| InternVL2-8B | 3.6 | 12.0 | 28.0 | 28.4 | 28.0 | 20.0 |
| Mini-InternVL2-1B-DriveLM | 0.0 | 31.2 | 20.0 | 28.4 | 24.8 | 24.2 |
| DriveLM-mantis-8B | 0.0 | 34.8 | 23.2 | 30.0 | 57.6 | 30.7 |
| Gemini-1.5-flash | 21.2 | 16.8 | 22.0 | 34.8 | 48.0 | 24.8 |
| GPT-4o-2024-08-06 | 8.4 | 32.0 | 40.8 | 54.4 | 39.6 | 34.4 |
| In-context learning* | | | | | | |
| LLaVA-Interleave-Qwen-7B | 14.0 | 3.6 | 10.4 | 24.8 | 29.6 | 19.3 |
| GPT-4o-2024-08-06 | 32.8 | 38.8 | 36.8 | 60.4 | 51.2 | 40.9 |
| VLIT on TB-100k | | | | | | |
| Ours (SigLIP-L-Qwen1.5-0.5B) | 76.4 | 74.4 | 86.8 | 94.0 | 68.8 | 77.5 |
| Ours (SigLIP-L-Qwen2-0.5B) | 80.4 | 74.8 | 88.8 | 93.6 | 65.2 | 77.5 |
| VLIT on TB-250k | | | | | | |
| Ours (SigLIP-L-Qwen1.5-0.5B) | 93.6 | 82.4 | 96.0 | 99.6 | 69.6 | 84.5 |
| Ours (SigLIP-L-Qwen2-0.5B) | 91.2 | 83.2 | 94.8 | 99.6 | 69.6 | 85.1 |
Zero-shot Evaluation for MLLMs
We report the zero-shot performance of various MLLMs on TB-Bench, including two popular proprietary models (GPT-4o, Gemini 1.5), several SOTA open-source general models, including LLaVA (Liu et al. 2024b), Bunny (He et al. 2024), and InternVL (Chen et al. 2024), as well as open-source models with traffic-domain adaptations trained on DriveLM (Sima et al. 2024), i.e., Mantis (Jiang et al. 2024) and Mini-InternVL2 (Gao et al. 2024). For class-output questions, we use a multi-choice template listing all possible class options, while for numerical-output questions, we specify the format, i.e., “Answer in xx.x meters.” See the supplementary material for more details on the models and the prompt design.
Results on TB-Bench
Table 4 shows the results of different methods on TB-Bench tasks, categorized into four groups: zero-shot evaluation, in-context learning evaluation, VLIT on TB-100k, and VLIT on TB-250k.
In the zero-shot evaluation, although the proprietary models (GPT-4o and Gemini) outperform the open-source models overall, none of them excels across all traffic behavior tasks. Many open-source models underperform random guessing, while traffic-domain adaptation models show significantly better performance in certain areas but still lag behind the proprietary models. The proprietary models achieve an average accuracy of less than 35%.
In in-context learning, the provided examples significantly improve performance in specific areas, namely numerical outputs.
For baseline models fine-tuned on TB-100k, both Qwen variants demonstrate strong performance across all tasks, with an average accuracy of 77.5%. Even the lowest-performing task exceeds 60% accuracy, a significant improvement of almost 45 percentage points over GPT-4o and 57 points over random chance. This underscores the effectiveness of VLIT when a high-quality dataset is available, enhancing the traffic behavior understanding of MLLMs.
For baseline models fine-tuned on TB-250k, performance improves across all tasks, particularly those with more data samples. Notably, accuracy on tasks like OBJ-LANE, OBJ-TURN, and EGO-TURN, which have the same number of training samples as in TB-100k, also benefits from the additional samples in other tasks. This suggests that learning from some tasks can be transferred to those with limited training data.
Ablation Study
Table 5: Ablation results on (a) vision encoders, (b) number of visual embeddings per frame, and (c) number of frames.
| (a) Encoder | Acc. | (b) # tokens/frame | Acc. | (c) # frames | Acc. |
|---|---|---|---|---|---|
| CLIP-L/14 | 72.0 | 4 | 72.7 | 2 | 72.1 |
| SigLIP-B/16 | 74.3 | 16 | 77.5 | 4 | 73.8 |
| SigLIP-L/14 | 77.5 | 36 | 76.2 | 8 | 77.5 |
We conduct an ablation study on the visual inputs to the models to identify which factors enhance performance during fine-tuning. All experiments use the same settings unless noted. The results are summarized in Table 5.
Table 6: Quantitative results of action tasks on BDD-X test dataset. We provide evaluation results on action description, action justification, and full-text generation (i.e., combining description and justification). ‘B4’ stands for BLEU4.
| Method | Description (CIDEr / B4 / ROUGE) | Justification (CIDEr / B4 / ROUGE) | Full-text (CIDEr / B4 / ROUGE) |
|---|---|---|---|
| Ours (BDD-X) | 118.6 / 20.0 / 53.8 | 61.3 / 6.9 / 26.1 | 54.2 / 12.0 / 38.4 |
| Ours (BDD-X + TB-100k) | 121.7 / 20.0 / 54.3 | 60.3 / 6.7 / 26.7 | 53.7 / 11.9 / 38.6 |
Table 7: Quantitative results of control signals prediction on BDD-X test dataset. RMSE denotes the root mean squared error, and $A_{\tau}$ measures the proportion of test samples with prediction errors less than $\tau$ .
| Method | Speed (m/s): RMSE↓ / A0.1↑ / A0.5↑ / A1.0↑ / A5.0↑ | Turning angle (degrees): RMSE↓ / A0.1↑ / A0.5↑ / A1.0↑ / A5.0↑ |
|---|---|---|
| Ours (BDD-X) | 1.40 / 26.1 / 55.7 / 75.6 / 98.6 | 11.2 / 44.2 / 62.2 / 71.8 / 89.2 |
| Ours (BDD-X + TB-100k) | 1.38 / 26.3 / 57.6 / 76.1 / 98.8 | 11.3 / 44.5 / 63.7 / 73.0 / 89.3 |
Table 5a compares different pretrained vision encoders, including CLIP-L/14 (Radford et al. 2021) and SigLIP-B/16 (processing $224\times224$ frames). It is seen that the SigLIP encoders outperform the CLIP encoder, with SigLIP-L/14 achieving the highest accuracy.
Table 5b presents the results of using varying numbers of sampled visual embeddings/tokens per frame, $k$ (where $h = w = \sqrt{k}$). We observe that using 16 sampled visual tokens per frame is optimal, and increasing $k$ further can degrade performance.
Finally, we evaluate the impact of varying the number of sampled frames $N$ ($=2, 4, 8$) on the tasks requiring temporal information. We consistently select the first and last frames, with the remaining $N-2$ frames sampled uniformly in between. As shown in Table 5c, increasing temporal information significantly boosts performance. For detailed per-task accuracy and other ablation results, see the supplementary material.
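One straightforward way to realize this sampling scheme is sketched below; the function name is hypothetical.

```python
def sample_frame_indices(num_available: int, n: int = 8) -> list:
    """Always keep the first and last frames and distribute the remaining
    n - 2 indices uniformly in between (n = 2, 4, or 8 in the ablation)."""
    if n >= num_available:
        return list(range(num_available))
    last = num_available - 1
    return [round(i * last / (n - 1)) for i in range(n)]

# sample_frame_indices(40, n=4) -> [0, 13, 26, 39]
```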
Cross-Dataset Generalization
We conduct additional experiments to demonstrate performance transfer from co-training perception-stage tasks with planning tasks, showing improvements on the downstream tasks. Specifically, we co-train the TB-100k dataset with the BDD-X dataset and evaluate on the action and control prediction tasks (Kim et al. 2018).
As the BDD-X dataset involves frame index referencing in both the question and answer text annotations, we employed the Mini-InternVL model (Gao et al. 2024) as the baseline, which formulates frame referencing in a similar manner.
We follow a standard MLLM training regime: Stage 1 focuses on feature alignment, utilizing the pretrained checkpoint of Mini-InternVL, while Stage 2 involves instruction tuning on the main datasets. In the standalone setting, Stage 2 tunes on BDD-X for 20 epochs, while in the co-training setting, we tune on a mixture of BDD-X for 20 epochs and TB-100k for 1 epoch. We apply LoRA (Hu et al. 2022) with a rank of 64. During training, we freeze the vision encoder and LLM parameters, updating only the parameters of the multi-modal connector and LoRA adapters. Overall, training is conducted with a learning rate of 2e-4 and a batch size of 96.
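One simple way to approximate this 20:1 mixture with standard PyTorch utilities is sketched below; the paper does not specify its exact scheduling, so the repetition-based mixing and the function name are assumptions.

```python
from torch.utils.data import ConcatDataset, DataLoader

def build_cotraining_loader(bddx_dataset, tb100k_dataset, batch_size=96):
    """Mix BDD-X (seen roughly 20 times) with TB-100k (seen once) in one loader.

    Repeating BDD-X 20x within a single pass is one way to emulate
    'BDD-X for 20 epochs and TB-100k for 1 epoch'.
    """
    mixed = ConcatDataset([bddx_dataset] * 20 + [tb100k_dataset])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True)
```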
Tables 6 and 7 compare the transfer performance between standard training and additional co-training with TB-100k. Notably, beyond differences in task types within the VQA format, the two tables also differ in the type of outputs, i.e., free-form text and numerical outputs.
Table 6 shows improved performance with the co-training setting in the description data split, which includes annotations about scene perception. However, there are marginal differences in the other splits, which are not directly related to perception tasks.
Table 7 demonstrates consistent performance improvement with co-training across most metrics, except for the RMSE of the turning angle, which shows a slight decrease.
Conclusion
We have introduced TB-Bench, a comprehensive benchmark that rigorously assesses MLLM performance across eight perception tasks, providing a much-needed standard for spatiotemporal evaluation in AD. Alongside TB-Bench, we have developed the vision-language instruction tuning datasets TB-100k and TB-250k, which significantly improve MLLM performance when used to fine-tune our baseline models, resulting in substantial gains over existing models. Additionally, our VLIT datasets serve as valuable assets for mixed training datasets in other driving use cases. Our contributions not only represent incremental progress but also lay a solid foundation for the further integration of MLLMs into the perception, prediction, and planning stages of AD. These resources are poised to accelerate advancements in the field, supporting the development of more capable and reliable autonomous systems. Please refer to the supplementary material for further discussion on broader impact, limitations, and future work.
References
Supplementary Material for TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos
This material includes the following sections:
Discussions
Broader Impact
This study represents progress in enhancing the capabilities of Multi-Modal Large Language Models (MLLMs) by focusing on a limited set of AD perception tasks. Specifically, we introduce a new benchmark to evaluate MLLMs on understanding diverse traffic behaviors and provide high-quality VLIT datasets that enhance MLLMs’ generalizability. We hope this will advance MLLMs’ applications in AD, contributing to the development of more robust autonomous driving systems.
Limitations
Firstly, our study utilizes moderately sized large language models (the Qwen 0.5B series) due to limited computational resources; these can be scaled up as needed.
Secondly, we acknowledge the dataset imbalance arising from the natural occurrence of specific autonomous driving behaviors; please refer to Section Dataset Statistics for more details.
Lastly, the free-form text output templates in TB-100k and TB-250k are limited for certain tasks. However, we believe that the diversity of images is also important for the model to understand visual concepts. That being said, when combined with other (vision-)language instruction tuning datasets, our datasets still enhance the performance of MLLMs, enabling them to generalize better in traffic domains, particularly in understanding traffic behaviors.
Future Work
Future research could expand this work by incorporating a wider range of perception tasks or by exploring subsequent stages, such as prediction and planning.
Additionally, an important direction for future investigation is the optimal application of upstream perception tuning sets, including the TB-100k and TB-250k datasets, to relevant downstream traffic tasks. This approach may enhance model performance in real-world applications.
Furthermore, integrating real-time traffic data, such as video feeds and sensory inputs, could improve the MLLMs’ understanding of dynamic traffic situations. Finally, enhancing the explainability of MLLMs in traffic behavior scenarios will help users understand the rationale behind model predictions.
Access to the Benchmark and Datasets
Availability
The Traffic Behavior Benchmark (TB-Bench) and the training datasets (TB-100k, TB-250k) will be publicly available at the following Github repository:
https://github.com/TB-AD/TB-Bench-110k-250k
The source code for conducting and analyzing the experiments will also be publicly available in the repository upon publication, permitting free use for research purposes.
Future Update
We also plan to establish an evaluation server and leaderboard on Hugging Face in the future. Any updates will be communicated through the above Github repository to ensure users have access to the latest information.
Benchmark and Datasets
Task Definition
Relative Distance (RD). The task is to predict the Euclidean distance in meters between two entities in an image; see Figure 11 for two examples.
Spatial Reasoning (SR). The task is to predict the spatial position of one entity relative to another from the perspective of a reference entity; see Figure 12 for examples. Specifically, the relationship between two objects is defined by the angle $\theta$ , as follows:
$$
\mathrm{Relation}=
\begin{cases}
\text{front} & \text{if } -30^{\circ} < \theta \leq 30^{\circ},\\
\text{front left} & \text{if } 30^{\circ} < \theta \leq 90^{\circ},\\
\text{front right} & \text{if } -90^{\circ} < \theta \leq -30^{\circ},\\
\text{back left} & \text{if } 90^{\circ} < \theta \leq 150^{\circ},\\
\text{back right} & \text{if } -150^{\circ} < \theta \leq -90^{\circ},\\
\text{back} & \text{otherwise}.
\end{cases}
$$
This angular relationship is similar to that defined in (Qian et al. 2024).
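A direct implementation of this mapping could look as follows; the function name and the angle-wrapping step are assumptions.

```python
def spatial_relation(theta_deg: float) -> str:
    """Map the signed angle (degrees) from the reference entity's heading to
    the target's position onto the six discrete spatial relations."""
    theta = (theta_deg + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    if -30.0 < theta <= 30.0:
        return "front"
    if 30.0 < theta <= 90.0:
        return "front left"
    if -90.0 < theta <= -30.0:
        return "front right"
    if 90.0 < theta <= 150.0:
        return "back left"
    if -150.0 < theta <= -90.0:
        return "back right"
    return "back"
```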
Orientation Reasoning (OR). This task is to predict the facing relationship between two entities from the perspective of a reference entity, categorized as: ‘similar’, ‘opposite’, or ‘perpendicular’. Please refer to Figure 13 for examples. The relationship is defined based on the absolute difference in facing angles $|\theta|$ , as follows:
$$
\mathrm{Relation}=
\begin{cases}
\text{similar} & \text{if } 0^{\circ} \leq |\theta| \leq 45^{\circ},\\
\text{opposite} & \text{if } 135^{\circ} \leq |\theta| \leq 180^{\circ},\\
\text{perpendicular} & \text{otherwise}.
\end{cases}
$$
It is noted that this angle is measured from the facing direction of a reference entity to the position of the target entity in Euclidean space, irrespective of the target entity’s facing direction.
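A small sketch of this classification from the two facing angles is given below; the function name and degree-valued inputs are assumptions.

```python
def orientation_relation(yaw_ref_deg: float, yaw_target_deg: float) -> str:
    """Classify the facing relationship from the absolute difference between
    the two entities' facing angles."""
    diff = abs((yaw_ref_deg - yaw_target_deg + 180.0) % 360.0 - 180.0)
    if diff <= 45.0:
        return "similar"
    if diff >= 135.0:
        return "opposite"
    return "perpendicular"
```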
Other Lane to Ego-Vehicle (EGO-LANE). This task is to predict the lane of a target vehicle relative to the ego-vehicle’s perspective; see Figure 14 for examples. The categories include: ‘front lane’, ‘front left lane’, ‘front right lane’, and ‘oncoming traffic lane’ (the lane on the opposite side of the road).
It is noted that when the ego-vehicle is on a road with multiple lanes, the ‘front lane’ is further classified into three fine-grained categories: ‘front lane’, ‘front left lane’, and ‘front right lane’.
Other Lane Changing (OBJ-LANE). This task is to predict whether the target vehicle is changing lanes, categorized as ‘left lane change’, ‘right lane change’, or ‘no change’; see Figure 15 for examples. Lane changes are evaluated based on the target vehicle’s viewpoint. For instance, if the target vehicle in the oncoming traffic lane executes a right lane change, the ego vehicle perceives it as moving to the left.
Other Turning (OBJ-TURN). This task is to predict whether the target vehicle is making a turn, categorized as ‘turning left’, ‘turning right’, or ‘go straight’. The target vehicle is considered to be turning, if it changes direction by more than 25 degrees within