TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos
Korawat Charoenpitaks 1,*, Van-Quang Nguyen 2,*, Masanori Suganuma 1, Kentaro Arai 3, Seiji Totsuka 3, Hiroshi Ino 3, Takayuki Okatani 1,2
1 Tohoku University, 2 RIKEN Center for AIP, 3 DENSO CORPORATION
korawat@vision.is.tohoku.ac.jp, quang.nguyen.jz@riken.jp, suganuma@vision.is.tohoku.ac.jp, kentaro.arai.j8n@jp.denso.com, seiji.totsuka.j7z@jp.denso.com, hiroshi.naganawa.j7v@jp.denso.com, okatani@tohoku.ac.jp
*Corresponding authors
Abstract
The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform on these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy of up to 85%, significantly enhancing performance on the tasks. Additionally, we demonstrate performance transfer by co-training TB-100k with another traffic dataset, leading to improved performance on the latter. Overall, this study represents a step forward by introducing a comprehensive benchmark, high-quality datasets, and baselines, thus supporting the gradual integration of MLLMs into the perception, prediction, and planning stages of AD.
Introduction
The application of MLLMs to Autonomous Driving (AD) has gained increasing attention, particularly for predicting risks and planning actions based on images or videos from in-vehicle cameras. Notably, MLLMs have demonstrated their effectiveness in international competitions such as the Autonomous Grand Challenge (Renz et al. 2024) and in specific tasks such as traffic sign detection (Zhang et al. 2024b). However, two major challenges remain.
First, current MLLMs, ranging from proprietary models like GPT-4o (Achiam et al. 2023) and Gemini (Team et al. 2023) to open-source models like LLaVA (Liu et al. 2024b), are not optimized for dashcam images or traffic scenes. These models are primarily trained on vast amounts of web-based text and image-text pairs, with minimal traffic-specific data, limiting their effectiveness in AD scenarios. To improve the generalizability of MLLMs, incorporating high-quality domain-specific datasets into the pre-training data is crucial, as shown in (Li et al. 2024a; Zhang et al. 2024a).

Figure 1: Examples of four tasks from TB-Bench; additional task examples are provided in the supplementary material.
Second, there is no dedicated benchmark for evaluating MLLMs’ abilities in spatiotemporal understanding tasks, and their capabilities in vision-centric tasks are still developing. While these models are designed to handle diverse vision-language tasks, they struggle with complex visual understanding, such as spatial reasoning and object relationships (Tian et al. 2024). Even in common domains unrelated to traffic scenes, there are insufficient benchmarks (e.g., Cambrian-1 (Tong et al. 2024) and the V-star benchmark (Wu and Xie 2024)). Given that AD requires sophisticated geometric and spatiotemporal understanding to capture the dynamic interactions between the vehicle and other entities, high-quality, dedicated benchmarks are much needed.
While MLLMs are increasingly applied in AD, the aforementioned challenges remain insufficiently addressed. Recent research has primarily focused on using pretrained MLLMs derived from web data for specific AD tasks, without thoroughly investigating these challenges. Another issue is to determine which AD tasks across different stages and levels should be addressed by MLLMs. Within the “perception, prediction, and planning” framework, the question becomes: which stages should MLLMs handle?
If we focus solely on the perception stage and its associated tasks, using MLLMs may not appear optimal. Established technologies like LiDAR and CV methods such as object detection and visual odometry can accurately capture the vehicle’s position and spatiotemporal relationships in Euclidean space.
However, when considering the use of MLLMs (or LLMs) in later stages, relying on such “Euclidean geometrically accurate” information in the earlier stage might not be optimal. Firstly, it is unclear how to input this information into LLMs. Moreover, achieving advanced understanding in later stages may require information representations specifically suited for MLLMs to perform higher-level tasks. This suggests the need for MLLM involvement from the perception stage.
This study adopts this perspective, aiming to involve MLLMs in the perception stage by expressing spatiotemporal traffic scene tasks in natural language text. The underlying conjecture, as mentioned above, is that this approach could be important for prediction and planning in later stages by MLLMs.
To address the lack of dedicated benchmarks, we introduce TB-Bench, one of the first comprehensive benchmarks specifically designed to evaluate MLLMs’ understanding of traffic behaviors. This benchmark assesses MLLMs’ capabilities to perform perception tasks based on dashcam images or videos from the ego-centric views of vehicles, including determining the spatial position or orientation of other vehicles and interpreting the behaviors of both ego-vehicles and surrounding traffic. Compared to existing benchmarks, TB-Bench encompasses a wider array of eight distinct perception tasks, each corresponding to a typical driver maneuver. Figure 1 shows examples of several tasks. To ensure consistent evaluation across a diverse range of MLLMs, we employ a straightforward protocol. Specifically, we pair questions with images or video clips, requiring an MLLM to respond in plain text. Performance of the MLLM is then assessed by measuring response accuracy.
To address the challenge of insufficient training data for AD perception tasks, we introduce a high-quality dataset focused on traffic behavior understanding from ego-centric views. This dataset aligns with the task design of TB-Bench and is used for vision-language instruction tuning (VLIT) of MLLMs. We generate high-quality question-and-answer pairs using samples from established datasets such as KITTI, ONCE, and Argoverse 2. In total, we create TB-Bench comprising 2,000 manually constructed samples, along with two versions of training datasets: TB-250k containing 250,000 samples, and TB-100k (a more balanced version).
In addition to evaluating existing MLLMs, we introduce a generic framework that serves as a strong baseline for our tasks, consisting of three standard components: a pretrained vision encoder, a multi-modal connector, and a pretrained LLM. The vision encoder extracts visual representations from inputs with a varying number of frames, the connector projects these embeddings into the LLM’s embedding space, and the LLM finally generates task-specific responses on our benchmark. This lightweight model is designed for efficient fine-tuning on our proposed dataset(s).
Using TB-Bench to evaluate popular proprietary models (GPT-4o and Gemini) and various state-of-the-art open-source MLLMs (LLaVA, Bunny, and InternVL), we find that none of these models excels across all traffic behavior understanding tasks. On average, the open-source models underperform random guessing, while proprietary models achieve only slightly better results, with average accuracy below 35%. In contrast, when fine-tuned on TB-100k or TB-250k, our proposed baseline models demonstrate strong performance across all tasks, with average accuracy ranging from 77% to 85%. This highlights the effectiveness of our dataset in enhancing MLLM traffic behavior understanding.
Overall, our contributions are fourfold: 1) we introduce TB-Bench, a benchmark for assessing MLLMs on eight perception tasks of traffic behavior understanding; 2) we present the VLIT datasets (TB-100k and TB-250k) for the tasks, along with a generic baseline; 3) we conduct extensive experiments demonstrating the performance gap between existing MLLMs and the fine-tuned baselines; and 4) we show that our VLIT dataset, i.e., TB-100k, can be used as part of a co-training dataset to generalize to other driving benchmarks, such as BDD-X (Kim et al. 2018).
Related Work
A summary of existing studies and benchmarks across various AD tasks is presented in Table 1.
Autonomous Driving Tasks
The majority of evaluations in the AD field focus on end-to-end driving systems, open-loop planning, or standalone task schemes, such as single-round visual question answering (VQA) or captioning. Traditionally, the AD framework consists of perception, prediction, and planning tasks (Nie et al. 2023), although slight variations exist, e.g., predicting intention-level outputs instead of trajectories (Tian et al. 2024).
Generally, perception tasks in end-to-end driving systems are mainly auxiliary tasks, built from whatever supervision signals the data source provides. For example, nuScenes (Caesar et al. 2020) provides BEV information, segmentation labels, and more. Consequently, multitask learning is applied to these tasks, such as object detection, tracking, and segmentation. This approach is consistent across recent similar AD planning datasets, whether in open-loop or simulation scenarios. Occasionally, pretrained VL models are utilized to enhance these modules.
Table 1: Summary of existing studies and benchmarks across AD tasks (brackets indicate tasks involving planning).
| Benchmark | Visual modality | Perception (planning) tasks |
|---|---|---|
| Standalone tasks in AD | | |
| DRAMA (Malla et al. 2023) | Single image | PER, REA |
| Rank2Tell (Sachdeva et al. 2024) | Single image | PER, REA, LANE, TLS |
| BDD-X (Kim et al. 2018) | Multi-frame | PER, (AC) |
| BDD-OIA (Xu et al. 2020) | Single image | PER |
| TrafficQA (Xu, Huang, and Liu 2021) | Multi-frame | PER, PRED, REA |
| LingoQA (Marcu et al. 2023) | Multi-frame | |
| NuScenes-QA (Qian et al. 2024) | Multi-view | PER, PRED, REA |
| NuScenes-MQA (Inoue et al. 2024) | Multi-view | OBJ, SP OBJ, RD, OD |
| MAPLM-QA (Cao et al. 2024) | Multi-view, BEV image | LANE |
| DriveLM (Sima et al. 2024) | Single image | PER, PRED, (PLAN) |
| Benchmarks | | |
| SpatialRGPT (Cheng et al. 2024) | Single image | RD, SR, OR |
| SEED (Li et al. 2023a) | Multi-image, multi-frame | PER, PRED, REA, AR |
| MVBench (Li et al. 2024b) | Multi-frame | PER, PRED, REA, LOC, AR |
| MME (Fu et al. 2023) | Single image | PER, PRED |
| MMMU (Yue et al. 2024) | Multi-image | PER, REA, KNOW |
| ELM (Zhou et al. 2024) | Multi-frame | PER, PRED, TLS, OD, OT, AR, (PLAN) |
| Cambrian-1 (Tong et al. 2024) | Single image | RD, SR, D |
| OpenEQA (Majumdar et al. 2024) | Multi-frame | OBJ, SR, KNOW, LOC, REA |
| TB-Bench (Ours) | Single image, multi-frame | RD, SR, OR, EGO-LANE, OBJ-LANE, OBJ-TURN, EGO-TURN, EGO-TRA |

| Abbreviation | Meaning |
|---|---|
| OD | 2D and 3D object detection |
| OT | 2D and 3D object tracking |
| D | Depth estimation |
| OBJ | Object existence, class, etc. |
| KNOW | World knowledge |
| LOC | Location or coordinates |
| LANE | Road, lane, intersection, etc. |
| PER | General perception |
| PRED | General prediction |
| PLAN | General planning |
| REA | General reasoning |
| TLS | Traffic light or sign |
| AC | Action category |
| AR | General action recognition |
| RD | Relative distance |
| SR | Spatial reasoning |
| OR | Orientation reasoning |
| EGO-LANE | Other lane to ego |
| OBJ-LANE | Other lane changing |
| OBJ-TURN | Other turning |
| EGO-TURN | Ego turning |
| EGO-TRA | Ego traverse distance |
Other popular traffic planning datasets are KITTI (Geiger et al. 2013), ONCE (Mao et al.), Waymo Open (Sun et al. 2020), and Argoverse2 (Wilson et al. 2021), which are inherently similar to nuScenes in their characteristics.
Pretrained VL models are commonly known for their excellence in capturing scene understanding, details, and visual cues. Still, they show limitations in spatial grounding and reasoning (Tian et al. 2024). In detail, most standalone task schemes focus on perception tasks, which include general event VQA (Xu, Huang, and Liu 2021; Marcu et al. 2023), environment and weather conditions, traffic signals, and lane information (Wang et al. 2024; Cao et al. 2024). These tasks also encompass critical object detection (Malla et al. 2023; Sachdeva et al. 2024) or tracking in various forms, such as bounding box coordinates (Tian et al. 2024), region proposals (Deruyttere et al. 2019; Xu et al. 2020), and 2D (Wu et al. 2023a) and 3D (Wu et al. 2023b) language-guided object tracking, as well as scene analysis that includes attributes or motion of objects, such as size, position, direction, distance, spatial position relationships (Qian et al. 2024), and orientation (Cheng et al. 2024). In particular, there is a comprehensive task for driving with language that integrates all aspects of perception, prediction, and planning in a VQA format (Sima et al. 2024). In the prediction tasks, all previous perception inputs are used to predict an object’s future trajectory, such as parking or moving, and its interactions with the ego-vehicle. The planning stage involves combining prior information to generate actions, decision descriptions (Xu et al. 2020), and trajectory waypoints (Tian et al. 2024; Sima et al. 2024).
MLLMs and Benchmarks
VL pre-training and foundation models started with learning from a broader source of supervision, specifically raw text at an internet scale (Radford et al. 2021), enabling zero-shot transfer of the model to downstream tasks. Notably, approaches that connect VL pre-training to existing LLMs, referred to as MLLMs (Li et al. 2023b), enable capabilities similar to those of LLMs, such as image-to-text generation, improved via instruction tuning and in-context learning capabilities. Current frontier families of MLLMs, such as LLaVA (Liu et al. 2024b), VILA (Lin et al. 2024), and InternVL (Chen et al. 2024), utilize a similar architectural paradigm: vision encoder, multi-modal projector, and LLM connected in sequence. Despite some early work attempting resampler techniques like Q-Former (Dai et al. 2023), all state-of-the-art models use simpler linear layers with scaling to higher resolutions, focusing on higher-quality VLIT instead. Another line of studies works on lightweight versions of MLLMs, optimizing for more informative, condensed training data and design choices (He et al. 2024; Shao et al. 2024). The latest MLLMs focus on simultaneously tackling multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) scenarios, which show emergent capabilities and enhance overall performance (Li et al. 2024a). Nevertheless, the standard paradigm is still to evaluate MLLMs on multiple general benchmarks, aiming at strong overall performance.
Existing MLLM benchmarks aim to comprehensively evaluate various dimensions, but there is no standardized taxonomy for benchmark design. General benchmarks in the VL space started with simple perception-oriented tasks (Fu et al. 2023), followed by multi-frame benchmarks (Li et al. 2023a, 2024b) with action recognition and VL knowledge-based reasoning (Yue et al. 2024). Spatial or vision-centric benchmarks (Tong et al. 2024; Cheng et al. 2024) are becoming more relevant to address previously claimed weaknesses. Then, specialized benchmarks gained more attention, introducing tasks from different domains, such as robotics (Majumdar et al. 2024) and AD (Sima et al. 2024). Even so, there is still a lack of studies covering simple yet very important skills and behaviors in the AD context.
Benchmark Design
TB-Bench is created to fill the benchmark gap in evaluating MLLMs for AD, providing a specialized benchmark that rigorously tests their capability to understand complex traffic behaviors from an ego-centric perspective.
Task Design
We generate question-and-answer pairs in a VQA format, where the model takes an image or video paired with a question as input and produces a corresponding answer. Both the question and answer are expressed in a single sentence of free-form text.
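As a rough illustration of the VQA format described above, the following Python sketch models one sample as an image or short frame sequence paired with a single-sentence question and answer. The class name `VQASample`, its field names, the `<image>` placeholder token, and the example question are all hypothetical; the paper does not specify a storage schema or prompt template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VQASample:
    """One hypothetical TB-Bench-style sample (field names are illustrative)."""
    task: str          # e.g. "Spatial Reasoning"
    frames: List[str]  # paths to one image or up to eight video frames
    question: str      # single free-form sentence
    answer: str        # single free-form sentence (ground truth)

    def __post_init__(self) -> None:
        # The benchmark uses either a single image or up to eight frames.
        if not 1 <= len(self.frames) <= 8:
            raise ValueError("expected between 1 and 8 frames")


def build_prompt(sample: VQASample) -> str:
    """Emit one '<image>' placeholder per frame, followed by the question.

    The placeholder token is an assumption borrowed from common MLLM
    prompt templates, not something stated in the paper.
    """
    placeholders = "\n".join("<image>" for _ in sample.frames)
    return f"{placeholders}\n{sample.question}"


sample = VQASample(
    task="Spatial Reasoning",
    frames=["frame_000.png"],
    question="Where is Entity #1 relative to the ego vehicle?",
    answer="Entity #1 is positioned at the back right.",
)
print(build_prompt(sample))
```

Keeping the question and answer as plain strings mirrors the benchmark's requirement that models respond in free-form text rather than with class indices.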
To achieve the above goal, we consider multiple types of Q&A pairs, each linked to a specific driver’s maneuver behavior. We refer to the Pre-crash Scenarios typology from the National Automotive Sampling System (NASS) variables (Najm et al. 2007), which are also utilized in the CARLA simulator (Dosovitskiy et al. 2017). This typology includes a total of 65 pre-crash scenarios, categorized into nine accident types. Each scenario is described in the format of ‘an accident type: a detailed scenario.’ For example, the ‘lane change’ accident type includes scenarios like ‘one vehicle passing while another is turning.’ See the supp. material for the full list of scenarios.
Focusing on typical maneuver behaviors derived from NASS scenarios, we have identified eight distinct Q&A types, referred to as ‘tasks,’ as shown in Table 2. Some tasks require numerical outputs (e.g., ‘distance in meters’), while others require discrete classes (e.g., ‘back,’ ‘back left,’ etc.). It is important to note that the models are expected to provide these outputs in their natural language responses. Fig. 1 presents examples for four of the eight tasks, each of which consists of input image(s) accompanied by a question and a ground-truth answer. The visual input is either a single image or multiple images (up to eight), depending on the task, as will be explained later.
Referencing Entities
Some tasks require the model to determine the spatial position or orientation of other vehicles, as shown in Fig. 1. When multiple vehicles are present in a scene, it is essential to distinguish between them in both the questions and answers. One approach is to describe the vehicle by its attributes, such as “black compact sedan,” but this can pose challenges in ensuring the model accurately identifies and differentiates similar objects using such descriptions. To avoid these complications and focus on evaluating the model’s spatial understanding, we label each target traffic entity as ‘Entity #n’ in the questions and answers, where $n$ corresponds to its index in the input image(s); see examples in the upper part of Fig. 1. To identify these entities, we draw colored three-dimensional bounding boxes (BBs) directly in the input image(s), using a consistent color for each entity index $n$ throughout the dataset. Specifically, we use cyan and magenta BBs for ‘Entity #1’ and ‘Entity #2,’ respectively. Our dataset includes up to two entities per scene, i.e., $n = 1$ or 2. An additional advantage of this method is that it requires minimal instruction tuning or even no extra learning for MLLMs to adapt. Furthermore, it is compatible with multi-view, multi-frame, and multi-scale modalities, as demonstrated in AnyRes (Liu et al. 2024a), UniRes (Zhang et al. 2024a), and Interleave (Li et al. 2024a).
Table 2: Tasks and Concepts Addressed in Each. ‘Classes’ column indicates the types of outputs, i.e., the number of discrete classes or numerical outputs (indicated by $\mathcal{R}$ ); ‘Orientation Reasoning’ task contains both output types.
| Task type | Abstract concepts | Classes |
|---|---|---|
| Spatial information: | | |
| Relative Distance | distance in meters | $\mathcal{R}$ |
| Spatial Reasoning | back, back left, back right, front, front left, front right | 6 |
| Orientation Reasoning | opposite, perpendicular, similar, angle | 3/$\mathcal{R}$ |
| Object behavior: | | |
| Other Lane to Ego | front lane, front left lane, front right lane, oncoming lane | 4 |
| Other Lane Changing | left lane change, no change, right lane change | 3 |
| Other Turning | go straight, left turn, right turn | 3 |
| Ego behavior: | | |
| Ego Turning | go straight, left turn, right turn | 3 |
| Ego Traverse Distance | traverse distance in meters | $\mathcal{R}$ |
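The entity-marking scheme in this subsection can be sketched as follows: the corners of a 3D bounding box are computed, projected into the image with a pinhole camera model, and tagged with the color fixed for that entity index. This is only a minimal sketch under stated assumptions. The camera frame (x right, y down, z forward), the yaw-about-y convention, and every function name are hypothetical, and the actual drawing step (for instance with an imaging library) is omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One fixed colour per entity index, as described in the text (RGB triples).
ENTITY_COLORS: Dict[int, Tuple[int, int, int]] = {
    1: (0, 255, 255),  # cyan for 'Entity #1'
    2: (255, 0, 255),  # magenta for 'Entity #2'
}

Point3 = Tuple[float, float, float]


def box_corners(center: Point3, dims: Point3, yaw: float) -> List[Point3]:
    """Eight corners of a 3D box.

    Assumed convention (not from the paper): camera frame with x right,
    y down, z forward; dims = (length, height, width); yaw about the y axis.
    """
    cx, cy, cz = center
    length, height, width = dims
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (length / 2, -length / 2):
        for dy in (height / 2, -height / 2):
            for dz in (width / 2, -width / 2):
                # Rotate the offset about the y axis, then translate.
                rx = c * dx + s * dz
                rz = -s * dx + c * dz
                corners.append((cx + rx, cy + dy, cz + rz))
    return corners


def project(point: Point3, fx: float, fy: float, u0: float, v0: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + u0, fy * y / z + v0)


def entity_overlay(index: int, center: Point3, dims: Point3, yaw: float,
                   intrinsics: Sequence[float]) -> dict:
    """Label, colour, and projected corner pixels for one entity."""
    pixels = [project(p, *intrinsics) for p in box_corners(center, dims, yaw)]
    return {"label": f"Entity #{index}", "color": ENTITY_COLORS[index], "pixels": pixels}
```

Fixing the color per index, rather than per object class, lets the question text refer to ‘Entity #1’ without describing the vehicle's appearance.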
Evaluation
Our benchmark requires MLLMs to generate plain text outputs. Since the goal is to evaluate the spatiotemporal understanding capabilities of MLLMs, the accuracy of their outputs should be assessed using methods tailored to this requirement.
The questions in the dataset are broadly classified into two categories based on the type of answers expected. One category includes questions about positional relationships or orientation, with typical answers like “positioned at the back right” or “a right-turn maneuver.” The other category involves questions requiring numerical answers, such as “is situated 15.53 meters away.”
For the first category of Q&A, keywords are manually selected for each task or ground truth answer, and their presence in the output text is identified using rule-based methods (i.e., regular expressions). For the second category, the predicted value is compared to the correct answer, and if the difference falls within a specified range, the prediction is considered correct; otherwise, it is deemed incorrect. In the experiments, thresholds are set such that a difference within 25% of the correct value is considered acceptable for distance, and a difference within 15 degrees is acceptable for angle. Refer to the supp. material for more details.
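The two scoring rules above can be sketched in Python as follows. This is a minimal sketch rather than the authors' evaluation script: the function names, the exact regular expressions, and the choice to take the first number in the response (after stripping ‘Entity #n’ mentions so that the entity index is not mistaken for the prediction) are assumptions. Only the 25% and 15-degree thresholds come from the text.

```python
import re
from typing import Optional, Sequence

_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")
_ENTITY_TAG = re.compile(r"entity\s*#\s*\d+", re.IGNORECASE)


def keyword_correct(response: str, keywords: Sequence[str]) -> bool:
    """Category 1: true if any hand-picked keyword occurs as a whole phrase."""
    return any(
        re.search(rf"\b{re.escape(k)}\b", response, re.IGNORECASE)
        for k in keywords
    )


def first_number(response: str) -> Optional[float]:
    """First numeric value in the response, ignoring 'Entity #n' indices."""
    cleaned = _ENTITY_TAG.sub("", response)
    match = _NUMBER.search(cleaned)
    return float(match.group()) if match else None


def distance_correct(response: str, truth_m: float, rel_tol: float = 0.25) -> bool:
    """Category 2 (distance): within 25% of the ground-truth value."""
    pred = first_number(response)
    return pred is not None and abs(pred - truth_m) <= rel_tol * abs(truth_m)


def angle_correct(response: str, truth_deg: float, abs_tol: float = 15.0) -> bool:
    """Category 2 (angle): within 15 degrees of the ground truth.

    Angle wrap-around is not handled; the text does not say whether it should be.
    """
    pred = first_number(response)
    return pred is not None and abs(pred - truth_deg) <= abs_tol


print(keyword_correct("Entity #1 is positioned at the back right.", ["back right"]))
print(distance_correct("Entity #1 is situated 15.53 meters away.", 15.0))
```

A plain keyword match of this kind cannot tell "front" from "front left" when the ground truth is the shorter phrase, so a real implementation would need per-task keyword lists chosen with that in mind.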
Generation of VQA Data
Outline
To generate Q&A pairs for the eight tasks mentioned, we repurpose existing datasets, specifically KITTI (Geiger, Lenz, and Urtasun 2012), ONCE (Mao et al.), and Argoverse2 (Wilson et al. 2021). These datasets were originally designed for studying object detection, localization, and tracking in three-dimensional space, providing detailed three-dimensional geometry of traffic entities. KITTI and ONCE, in particular, offer object class information and 3D bounding boxes for each traffic entity, including their position, dimensions, and yaw angle. Argoverse2 further enriches this with lane information relative to the ego vehicle.

Figure 2: Overview of Data Generation Pipeline. Left: Sensory data is processed into higher-level attributes. Middle-Top: Spatial positioning and lane orientation relative to the ego-vehicle are determined. Middle-Bottom: Q&A samples are generated using rules and LLM augmentation. Right: Data is filtered and refined for the final dataset.
To align with the task design (Table 2), the quantities provided by these datasets, mostly represented in Euclidean space, are converted into abstract concepts, such as six discrete relative positions between two vehicles (e.g., front right, back left), lanes relative to the ego-car (e.g., front left lane, oncoming lane), and lane changing.
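To make the conversion from Euclidean quantities to abstract concepts concrete, the sketch below maps an entity's position in the ego frame to one of the six spatial classes and maps a pair of yaw angles to an orientation class. The axis convention (x forward, y left), the 1.5 m lateral margin, and the 45-degree orientation bins are assumptions chosen for illustration; the paper defers the actual rules to its supplementary material.

```python
import math


def relative_distance(x_forward: float, y_left: float) -> float:
    """Planar distance in meters from the ego vehicle to the entity."""
    return math.hypot(x_forward, y_left)


def spatial_class(x_forward: float, y_left: float, lateral_margin: float = 1.5) -> str:
    """One of: back, back left, back right, front, front left, front right.

    lateral_margin (meters) is a hypothetical band within which the entity
    counts as directly in front of or behind the ego vehicle.
    """
    longitudinal = "front" if x_forward >= 0 else "back"
    if y_left > lateral_margin:
        return f"{longitudinal} left"
    if y_left < -lateral_margin:
        return f"{longitudinal} right"
    return longitudinal


def orientation_class(yaw_a_deg: float, yaw_b_deg: float) -> str:
    """One of: similar, perpendicular, opposite (45-degree bins are assumed)."""
    # Smallest absolute heading difference, wrapped into [0, 180].
    diff = abs((yaw_a_deg - yaw_b_deg + 180.0) % 360.0 - 180.0)
    if diff < 45.0:
        return "similar"
    if diff > 135.0:
        return "opposite"
    return "perpendicular"


print(spatial_class(10.0, 3.0), round(relative_distance(10.0, 3.0), 2))
print(orientation_class(0.0, 180.0))
```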
For the first three tasks (‘Relative Distance,’ ‘Spatial Reasoning,’ and ‘Orientation Reasoning’), we generate Q&A pairs using samples from KITTI and ONCE, as these tasks do not require lane information from the ego vehicle or others. Since these tasks can be performed using a single image, we utilize a static dashcam image as the visual input. For the remaining tasks (‘Other Lane to Ego,’ ‘Other Lane Changing,’ ‘Other Turning,’ ‘Ego Turning,’ and ‘Ego Traverse Distance’), which require lane information and a multi-frame source, we generate Q&A pairs using Argoverse2. Given that these tasks involve temporal changes, we extract eight image frames from the ‘long scenario’ sequences in the dataset for each Q&A pair, using these sequences as the visual input for models.
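One simple way to pick eight frames from a longer sequence is uniform sampling across the clip, sketched below. The text only states that eight frames are extracted per Q&A pair; the uniform spacing and the function name `sample_frame_indices` are assumptions.

```python
from typing import List


def sample_frame_indices(num_available: int, num_samples: int = 8) -> List[int]:
    """Evenly spaced frame indices covering the first and last frame.

    Returns every index when the clip has no more frames than requested.
    """
    if num_available < 1 or num_samples < 1:
        raise ValueError("both arguments must be positive")
    if num_available <= num_samples:
        return list(range(num_available))
    if num_samples == 1:
        return [0]
    step = (num_available - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


print(sample_frame_indices(15))
```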
After generating the data automatically, we conduct a manual screening process. Based on the extent of screening, the data is organized into three distinct datasets. One dataset, comprising 2,000 samples, is designated for evaluation purposes, which we will refer to as ‘benchmark’ in this paper. These samples underwent thorough manual inspection, in which low-quality samples were removed and an equal number (i.e., 250) of samples per task was ensured. The remaining two datasets are intended for model training: the first, TB-250k, contains 250,000 samples; the second, TB-100k, includes over 100,000 samples that have been filtered to balance the number of samples per task. Table 3 summarizes the overall statistics of these datasets.
Table 3: Statistics of TB-Bench, TB-100k, and TB-250K. Source datasets: K (KITTI), O (ONCE), Arv2 (Argoverse2).
表 3: TB-Bench、TB-100k 和 TB-250K 的统计数据。源数据集:K (KITTI)、O (ONCE)、Arv2 (Argoverse2)。
| Task type | Source/frames | TB-Bench | TB-250k | TB-100k |
|---|---|---|---|---|
| Spatial information: | | | | |
| Relative Distance | [K, O]/1 | 250 | 35k | 10k |
| Spatial Reasoning | [K, O]/1 | 250 | 70k | 30k |
| Orientation Reasoning | [K, O]/1 | 250 | 70k | 30k |
| Object behavior: | | | | |
| Other Lane to Ego | [Arv2]/8 | 250 | 50k | 20k |
| Other Lane Changing | [Arv2]/8 | 250 | 1.5k | 1.5k |
| Other Turning | [Arv2]/8 | 250 | 1.5k | 1.5k |
| Ego behavior: | | | | |
| Ego Turning | [Arv2]/8 | 250 | 1.5k | 1.5k |
| Ego Traverse Distance | [Arv2]/8 | 250 | 25k | 15.5k |
| Total | | 2000 | 254k | 110k |
Details of the Pipeline
管道的详细信息
The Q&A pairs are generated automatically, with manual inspection following the automated process. The only exception is the ‘Other Lane Changing’ task, where we manually generate Q&A pairs due to noisy lane information at intersections. Figure 2 illustrates the pipeline used for generating Q&A pairs from these datasets.
问答对是自动生成的,随后会进行人工检查。唯一的例外是“其他车道变换”任务,由于交叉口的车道信息噪声较大,我们手动生成了问答对。图 2 展示了从这些数据集中生成问答对的流程。
The process unfolds as follows: The input to the pipeline is a single sample from the datasets, which could be either a static image with a set of entity attributes from KITTI/ONCE or a list of sequences with similar data from Argoverse2. The pipeline begins by extracting key information from the input, as depicted in the left panel of Fig. 2. This is followed by a processing step shown in the middle-top panel of Fig. 2, where spatial positions and facing angles relative to an anchor object are calculated. Additionally, the ‘lane to ego’ task identifies on which side the entity is located relative to the ego vehicle. For turning behaviors, we record the accumulated turning angle of each object to determine its recent motion. For lane changes, a flag is recorded if there are changes in lane id compared to the previous step. Similarly, all sensor numerical ground truth data—such as position, dimension, and angle of all entities—are processed into attributed data, such as distance to ego and spatial position.
流程如下:管道的输入是数据集中的单个样本,可以是来自 KITTI/ONCE 的带有实体属性集的静态图像,也可以是来自 Argoverse2 的带有类似数据的序列列表。管道首先从输入中提取关键信息,如图 2 左侧面板所示。接着是图 2 中间顶部面板所示的处理步骤,计算相对于锚定对象的空间位置和朝向角度。此外,“车道到自我”任务识别实体相对于自我车辆所在的位置。对于转向行为,我们记录每个对象的累积转向角度以确定其最近的运动。对于车道变更,如果与上一步相比车道 ID 发生变化,则记录一个标志。同样,所有传感器的数值真实数据(如所有实体的位置、尺寸和角度)都被处理为属性数据,例如到自我的距离和空间位置。

Figure 3: The overall architecture of our baseline framework.
图 3: 我们基线框架的整体架构。
Finally, a rule-based process, depicted in the middle-bottom panel of Fig. 2, is triggered to identify the corresponding task and generate Q&A pairs. Further details can be found in the supplementary material.
最后,触发一个基于规则的过程(如图 2 中下面板所示)来识别相应的任务并生成问答对。更多细节可以在补充材料中找到。
In the next phase, a rule-based system generates QA samples from processed data attributes. This depends on the type of task, i.e., tasks aside from lane change and turning behavior can be created based on any frame, without necessarily needing an event to trigger it. Thus, they naturally have more data samples generated. After this, the rule-based QA is generated with simple short answers, such as ‘oncoming lane’ or ‘turn left.’ Then, it is augmented to be a more complex sentence using text-only information with an LLM; we used Microsoft-Phi3-medium (Abdin et al. 2024).
在下一阶段,基于规则的系统从处理后的数据属性中生成问答样本。这取决于任务类型,即除了变道和转向行为之外的任务可以基于任何帧创建,而不一定需要事件触发。因此,它们自然生成了更多的数据样本。之后,基于规则的问答生成简单的短答案,例如“对向车道”或“左转”。然后,使用仅文本信息的大语言模型(LLM)将其增强为更复杂的句子;我们使用了 Microsoft-Phi3-medium (Abdin et al. 2024)。
Baseline Framework
基线框架
We present a generic framework that serves as a strong baseline for our tasks, comprising three standard components: a vision encoder, a multi-modal connector (a two-layer MLP), and an LLM. The vision encoder extracts visual representations from input frames, the multi-modal connector projects these representations into the LLM’s embedding space, and finally the LLM generates a response based on the given question and visual embeddings. Figure 3 illustrates the architecture of our framework.
我们提出了一个通用框架,作为我们任务的强基线,包含三个标准组件:视觉编码器、多模态连接器(一个两层的 MLP)和大语言模型。视觉编码器从输入帧中提取视觉表示,多模态连接器将这些表示投影到大语言模型的嵌入空间中,最后大语言模型根据给定的问题和视觉嵌入生成响应。图 3 展示了我们框架的架构。
We now explain how to adaptively extract visual representations from varying numbers of frames and input them into the LLM. Given $N$ frames of size $H\times W$ with color-coded bounding boxes, the vision encoder processes each frame individually to produce $N$ visual representations of size $[H/p\times W/p,C]$, where $p$ is the patch size and $C$ is the embedding dimension of the encoder. These visual representations are then projected into the LLM’s embedding space of dimension $D$ using the multi-modal connector, resulting in $N$ visual embeddings of size $[H/p\times W/p,D]$.
我们现在解释如何从不同数量的帧中自适应地提取视觉表示并将其输入到大语言模型中。给定 $N$ 帧大小为 $H\times W$ 的带有颜色编码边界框的图像,视觉编码器分别处理每一帧,生成 $N$ 个大小为 $[H/p\times W/p,C]$ 的视觉表示,其中 $p$ 是 patch 大小,$C$ 是编码器的嵌入维度。然后,这些视觉表示通过多模态连接器投影到大语言模型的嵌入空间 $D$ 中,生成 $N$ 个大小为 $[H/p\times W/p,D]$ 的视觉嵌入。
Inputting all visual embeddings of $N$ frames into the LLM can be computationally expensive. To address this, we spatially sample a subset of these visual embeddings per frame. Specifically, we apply adaptive average pooling to reduce each frame’s embeddings from $[H/p\times W/p,D]$ to $[k=h\times w,D]$, where $k\ll H/p\times W/p$. The value of $k$ is a hyperparameter. The sampled embeddings from all $N$ frames are then reshaped and concatenated, preserving spatial and temporal order, which yields final visual embeddings of size $[N\times k,D]$ that are passed into the LLM along with the textual embeddings.
将所有 $N$ 帧的视觉嵌入输入到大语言模型中可能会带来高昂的计算成本。为了解决这个问题,我们对每帧的视觉嵌入进行空间采样。具体来说,我们应用自适应平均池化来减少每帧的嵌入,从 $[H/p\times W/p,D]$ 减少到 $[k=h\times w,D]$,其中 $k\ll H/p\times W/p$。$k$ 的值作为一个超参数来确定。然后,将所有 $N$ 帧的采样嵌入进行重塑和拼接,保留空间和时间顺序,最终生成大小为 $[N,\bar{\times},k,D]$ 的视觉嵌入,这些嵌入与文本嵌入一起输入到大语言模型中。
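The pooling and concatenation step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the bin boundaries follow the common adaptive-pooling rule (start = floor(i·H/h), end = ceil((i+1)·H/h)), and the 27×27 patch grid assumes floor(384/14) patches per side.

```python
def adaptive_avg_pool_2d(grid, out_h, out_w):
    """Average-pool a [H][W][D] nested list down to [out_h][out_w][D]."""
    in_h, in_w, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(out_h):
        # Row bin: start = floor(i*H/h), end = ceil((i+1)*H/h), in integer arithmetic.
        r0, r1 = (i * in_h) // out_h, ((i + 1) * in_h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0, c1 = (j * in_w) // out_w, ((j + 1) * in_w + out_w - 1) // out_w
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append([sum(v[d] for v in cells) / len(cells) for d in range(dim)])
        pooled.append(row)
    return pooled


def pool_and_flatten(frames, out_h, out_w):
    """Pool every frame, then concatenate tokens frame by frame (row-major within a frame)."""
    tokens = []
    for grid in frames:
        for row in adaptive_avg_pool_2d(grid, out_h, out_w):
            tokens.extend(row)
    return tokens


# Toy input: N = 8 frames, a 27 x 27 patch grid, D = 2 (a real D is far larger).
frames = [[[[float(n), 1.0] for _ in range(27)] for _ in range(27)] for n in range(8)]
tokens = pool_and_flatten(frames, 4, 4)
print(len(tokens), len(tokens[0]))  # N * k tokens, each of dimension D
```

Under these assumptions, eight frames shrink from 8 × 729 patch embeddings to 8 × 16 tokens before reaching the LLM, while the frame order and the row-major spatial order are kept.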
To process text input, we tokenize the question and its ground-truth response, converting them into textual embeddings. These are then combined with the visual embeddings and input into the LLM. We train the model by minimizing cross-entropy loss on the response token predictions. During inference, only the question is used as text input.
为了处理文本输入,我们将问题及其真实答案进行 Token 化,并将其转换为文本嵌入。然后,这些嵌入与视觉嵌入结合,并输入到大语言模型 (LLM) 中。我们通过最小化响应 Token 预测的交叉熵损失来训练模型。在推理过程中,仅使用问题作为文本输入。
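The training objective can be sketched as a cross-entropy averaged over response positions only. This is a minimal illustration under assumptions: the function name and the boolean mask `is_response` are hypothetical, and real training code would operate on batched tensors rather than Python lists.

```python
import math


def response_cross_entropy(logits, target_ids, is_response):
    """Mean cross-entropy over positions flagged as response tokens.

    logits[t] holds the scores used to predict target_ids[t]; positions that
    belong to the visual or question part are skipped via is_response[t].
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, target_ids, is_response):
        if not keep:
            continue
        peak = max(scores)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0


# Toy sequence over a 3-token vocabulary: two question positions (ignored)
# followed by two response positions (scored).
logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
targets = [0, 0, 1, 1]
mask = [False, False, True, True]
loss = response_cross_entropy(logits, targets, mask)
print(round(loss, 4))
```

At inference time no response tokens are available, so the same model is simply run autoregressively from the question and the visual embeddings.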
Experiments
实验
Experimental Settings
实验设置
Our proposed framework is compatible with any vision encoder and LLM. In this study, we utilize the pretrained SigLIP-L/14 (Zhai et al. 2023) as the vision encoder and the pretrained Qwen 0.5B, either version 1.5 or 2.0 (Bai et al. 2023; Yang et al. 2024), as the LLM, while initializing the parameters of the multi-modal connector randomly. To preserve pretrained LLM capabilities and enable efficient task-specific fine-tuning, we apply LoRA (Hu et al. 2022) with a rank of 64. During training, we freeze the vision encoder and LLM parameters, updating only the parameters of the multi-modal connector and LoRA adapters.
我们提出的框架兼容任何视觉编码器和大语言模型。在本研究中,我们使用预训练的 SigLIPL/14 (Zhai et al. 2023) 作为视觉编码器,并使用强大的预训练 Qwen 0.5B(版本 1.5 或 2.0)(Bai et al. 2023; Yang et al. 2024) 作为大语言模型,同时随机初始化多模态连接器的参数。为了保留预训练大语言模型的能力并实现高效的任务特定微调,我们应用了 LoRA (Hu et al. 2022),其秩为 64。在训练过程中,我们冻结视觉编码器和大语言模型的参数,仅更新多模态连接器和 LoRA 适配器的参数。
For tasks requiring temporal information, the number of frames $N$ is 8; otherwise $N=1$. Each frame is resized to $384\times384$ as the input to SigLIP-L/14, with the number of sampled visual embeddings $k$ set to 16 (i.e., $h=w=4$).
对于需要时间信息的任务,帧数 $N$ 为 8;否则 $N=1$。每帧调整为 $384\times384$ 作为 SigLIP-L/14 的输入,采样视觉嵌入的数量 $\boldsymbol{\mathrm{k}}$ 设置为 16(即 $h=w=4$)。
We fine-tune our models on either TB-100k or TB-250k, and then report the accuracy on TB-Bench. We use AdamW (Loshchilov and Hutter 2017) with a learning rate of 2e-4 and a batch size of 64 for 10 epochs, with the learning rate adjusted via a cosine scheduler.
我们在 TB-100K 或 TB-250K 上微调我们的模型,然后在 TB-Bench 上报告准确率。我们使用 AdamW (Loshchilov and Hutter 2017) ,学习率为 2e-4,批量大小为 64,训练 10 个 epoch,并通过余弦调度器调整学习率。
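For reference, a cosine schedule with the stated peak learning rate can be written as below. This is a sketch under assumptions: the text does not say whether warmup or a non-zero final learning rate is used, so neither is modeled, and the step count is derived from the approximate TB-100k size of 110k samples.

```python
import math


def cosine_lr(step, total_steps, base_lr=2e-4, min_lr=0.0):
    """Learning rate at `step` under cosine decay from base_lr to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed run length: about 110k samples, batch size 64, 10 epochs.
steps_per_epoch = math.ceil(110_000 / 64)
total_steps = 10 * steps_per_epoch

for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps))
```

The rate starts at 2e-4, passes through half of that value at the midpoint, and decays to the floor at the final step.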
Table 4: Results of compared methods on TB-Bench, reported in accuracy (%), where higher indicates better performance. Random guess† results are considered zero. *In-context learning for single-frame tasks uses three in-context examples, while multi-frame tasks use one. Hugging Face and API names are used for easy reference.
表 4: TB-Bench 上对比方法的结果以准确率 $(%)$ 报告,数值越高表示性能越好。随机猜测†结果被视为零。${}^{\star}\ensuremath{\mathbf{In}}$ -context learning 在单帧任务中使用三个上下文示例,而在多帧任务中使用一个。Hugging face 和 API 名称用于方便参考。
| Model | RD↑ | SR↑ | OR↑ | EGO-LANE↑ OBJ-LANE↑ | OBJ-TURN↑ EGO-TURN↑ EGO-TRA↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| Random† | 0.0 | 16.7 | 17.1 | 25.0 | 33.3 | 19.8 |
| Zero-shot | ||||||
| LLaVA-1.5-7B | 10.8 | 16.8 | 28.0 | 28.4 | 20.4 | 18.1 |
| LLaVA-v1.6-Mistral-7B | 4.0 | 25.6 | 30.8 | 20.4 | 26.0 | 19.6 |
| LLaVA-NeXT-Video-7B | 3.6 | 0.8 | 13.2 | 10.4 | 18.8 | 12.4 |
| LLaVA-Interleave-Qwen-7B | 5.6 | 24.8 | 10.8 | 31.6 | 19.2 | 17.4 |
| Bunny-v1.1-4B | 24.4 | 20.4 | 19.6 | 28.4 | 16.0 | 20.4 |
| Bunny-v1.1-Llama-3-8B-V | 7.6 | 16.4 | 30.0 | 26.8 | 18.4 | 17.8 |
| InternVL2-8B | 3.6 | 12.0 | 28.0 | 28.4 | 28.0 | 20.0 |
| Mini-InternVL2-1B-DriveLM | 0.0 | 31.2 | 20.0 | 28.4 | 24.8 | 24.2 |
| DriveLM-mantis-8B | 0.0 | 34.8 | 23.2 | 30.0 | 57.6 | 30.7 |
| Gemini-1.5-flash | 21.2 | 16.8 | 22.0 | 34.8 | 48.0 | 24.8 |
| GPT-4o-2024-08-06 | 8.4 | 32.0 | 40.8 | 54.4 | 39.6 | 34.4 |
| In-context learning* | ||||||
| LLaVA-Interleave-Qwen-7B | 14.0 | 3.6 | 10.4 | 24.8 | 29.6 | 19.3 |
| GPT-4o-2024-08-06 | 32.8 | 38.8 | 36.8 | 60.4 | 51.2 | 40.9 |
| VLIT on TB-100k | ||||||
| Ours (SigLIP-L-Qwen1.5-0.5B) | 76.4 | 74.4 | 86.8 | 94.0 | 68.8 | 77.5 |
| Ours (SigLIP-L-Qwen2-0.5B) | 80.4 | 74.8 | 88.8 | 93.6 | 65.2 | 77.5 |
| VLIT on TB-250k | ||||||
| Ours (SigLIP-L-Qwen1.5-0.5B) | 93.6 | 82.4 | 96.0 | 99.6 | 69.6 | 84.5 |
| Ours (SigLIP-L-Qwen2-0.5B) | 91.2 | 83.2 | 94.8 | 99.6 | 69.6 | 85.1 |
Zero-shot Evaluation for MLLMs
零样本评估多模态大语言模型 (MLLMs)
We report the zero-shot performance of various MLLMs on TB-Bench, including two popular proprietary models (GPT-4o, Gemini 1.5), several SOTA open-source general models, including LLaVA (Liu et al. 2024b), Bunny (He et al. 2024), and InternVL (Chen et al. 2024), as well as open-source models with traffic domain adaptations trained on DriveLM (Sima et al. 2024), i.e., Mantis (Jiang et al. 2024) and Mini-InternVL2 (Gao et al. 2024). For class output questions, we use a multi-choice template listing all possible class options, while for numerical output questions, we specify the format, i.e., “Answer in xx.x meters.” See the supplementary material for more details on the models and the prompt design.
我们报告了各种多模态大语言模型 (MLLMs) 在 TB-Bench 上的零样本性能,包括两个流行的专有模型 (GPT4o, Gemini 1.5),几个最先进的开源通用模型,包括 LLaVA (Liu et al. 2024b)、Bunny (He et al. 2024) 和 InternVL (Chen et al. 2024),以及在 DriveLM (Sima et al. 2024) 上训练的具有交通领域适应的开源模型,即 Mantis (Jiang et al. 2024) 和 MiniInternVL2 (Gao et al. 2024)。对于类别输出问题,我们使用列出所有可能类别选项的多选模板,而对于数值输出问题,我们指定格式,即“以 xx.x 米为单位回答”。有关模型和提示设计的更多详细信息,请参阅补充材料。
Results on TB-Bench
TB-Bench 上的结果
Table 4 shows the results of different methods on TB-Bench tasks, categorized into four groups: zero-shot evaluation, in-context learning evaluation, VLIT on TB-100k, and VLIT on TB-250k.
表 4 展示了不同方法在 TB-Bench 任务上的结果,分为四组:零样本评估、上下文学习评估、TB-100k 上的 VLIT 以及 TB-250k 上的 VLIT。
In the zero-shot evaluation, although the proprietary models (GPT-4o and Gemini) outperform the open-source models overall, none of them excels across all traffic behavior tasks. Many open-source models underperform random guessing, while traffic domain adaptation models show significantly better performance in certain areas but still lag behind the proprietary models. The proprietary models achieve an average accuracy of less than 35%.
在零样本评估中,尽管专有模型(GPT-4o 和 Gemini)总体上优于开源模型,但它们在所有交通行为任务中都没有表现出色。许多开源模型的表现甚至不如随机猜测,而交通领域适应模型在某些领域表现显著更好,但仍落后于专有模型。专有模型的平均准确率低于 $35%$。
In in-context learning, examples significantly improve performance in specific areas, i.e., numerical outputs.
在上下文学习中,示例显著提高了特定领域的性能,例如数值输出。
For baseline models fine-tuned on TB-100k, both Qwen variants demonstrate strong performance across all tasks, with an average accuracy of 77.5%. Even the lowest-performing task exceeds 60% accuracy, showing an improvement of almost 45% over GPT-4o and 57% over random chance. This underscores the effectiveness of VLIT when a high-quality dataset is available, enhancing the traffic behavior understanding of MLLMs.
对于在 $\mathrm{TB}{-}100\mathrm{k}.$ 上微调的基线模型,Qwen变体在所有任务中均表现出色,平均准确率为 $77.5%$ 。即使是表现最差的任务,准确率也超过了 $60%$ ,相较于 GPT-4o 提升了近 $45%$ ,相较于随机概率提升了 $57%$ 。这凸显了在高质量数据集可用时,VLIT 的有效性,增强了大语言模型对交通行为的理解。
For baseline models fine-tuned on TB-250k, performance improves across all tasks, particularly those with increased data samples. Notably, accuracy in tasks like OBJ-LANE, OBJ-TURN, and EGO-TURN, which have the same number of training samples as in TB-100k, also benefits from the additional samples in other tasks. This suggests that what is learned from data-rich tasks can be transferred to tasks with limited training data.
对于在 TB-250k 上微调的基线模型,所有任务的性能都有所提升,尤其是那些数据样本增加的任务。值得注意的是,OBJ-LANE、OBJ-TURN 和 EGO-TURN 等任务的准确性,尽管训练样本数量与 TB-100k 相同,但也受益于其他任务的额外样本。这表明从任务中学习可以转移到训练数据有限的任务上。
Ablation Study
消融研究
Table 5: Ablation results on (a) vision encoders, (b) number of visual embeddings per frame, and (c) number of frames.
表 5: 消融实验结果:(a) 视觉编码器,(b) 每帧的视觉嵌入数量,(c) 帧数。
| (a) Vision encoder | Accuracy | (b) # tokens/frame | Accuracy | (c) # frames | Accuracy |
|---|---|---|---|---|---|
| CLIP-L/14 | 72.0 | 4 | 72.7 | 2 | 72.1 |
| SigLIP-B/16 | 74.3 | 16 | 77.5 | 4 | 73.8 |
| SigLIP-L/14 | 77.5 | 36 | 76.2 | 8 | 77.5 |
We conduct an ablation study to identify which factors enhance performance during fine-tuning, regarding visual inputs to the models. All experiments use the same settings unless noted. The results are summarized in Table 5.
我们进行了一项消融研究,以确定在微调过程中哪些因素能提升模型对视觉输入的性能。除非另有说明,所有实验均使用相同的设置。结果总结在表 5 中。
Table 6: Quantitative results of action tasks on BDD-X test dataset. We provide evaluation results on action description, action justification, and full-text generation (i.e., combining description and justification). ‘B4’ stands for BLEU4.
表 6: BDD-X 测试数据集上动作任务的定量结果。我们提供了动作描述、动作解释和全文生成(即结合描述和解释)的评估结果。‘B4’代表 BLEU4。
| Method | Description CIDEr | Description B4 | Description ROUGE | Justification CIDEr | Justification B4 | Justification ROUGE | Full-text CIDEr | Full-text B4 | Full-text ROUGE |
|---|---|---|---|---|---|---|---|---|---|
| s.Ino (BDD-X) | 118.6 | 20.0 | 53.8 | 61.3 | 6.9 | 26.1 | 54.2 | 12.0 | 38.4 |
| sIno (BDD-X + TB-100k) | 121.7 | 20.0 | 54.3 | 60.3 | 6.7 | 26.7 | 53.7 | 11.9 | 38.6 |
Table 7: Quantitative results of control signals prediction on BDD-X test dataset. RMSE denotes the root mean squared error, and $A_{\tau}$ measures the proportion of test samples with prediction errors less than $\tau$ .
表 7: BDD-X 测试数据集上控制信号预测的定量结果。RMSE 表示均方根误差,$A_{\tau}$ 表示预测误差小于 $\tau$ 的测试样本比例。
| Method | Speed (m/s) RMSE↓ | Speed $A_{0.1}$↑ | Speed $A_{0.5}$↑ | Speed $A_{1.0}$↑ | Speed $A_{5.0}$↑ | Turning angle (degrees) RMSE↓ | Turning angle $A_{0.1}$↑ | Turning angle $A_{0.5}$↑ | Turning angle $A_{1.0}$↑ | Turning angle $A_{5.0}$↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| s.Ino (BDD-X) | 1.40 | 26.1 | 55.7 | 75.6 | 98.6 | 11.2 | 44.2 | 62.2 | 71.8 | 89.2 |
| ours (BDD-X + TB-100k) | 1.38 | 26.3 | 57.6 | 76.1 | 98.8 | 11.3 | 44.5 | 63.7 | 73.0 | 89.3 |
Table 5a compares different pretrained vision encoders, including CLIP-L/14 (Radford et al. 2021) and SigLIP-B/16 (processing $224\times224$ frames). It is seen that the SigLIP encoders outperform the CLIP encoder, with SigLIP-L/14 achieving the highest accuracy.
表 5a 比较了不同的预训练视觉编码器,包括 CLIP-L/14 (Radford et al. 2021) 和 SigLIP-B/16 (处理 $224\times224$ 帧)。可以看出,SigLIP 编码器优于 CLIP 编码器,其中 SigLIP-L/14 达到了最高的准确率。
Table 5b presents the results of using varying numbers of sampled visual embeddings/tokens per frame, $k$ (where $h=w=\sqrt{k}$). We observe that using 16 sampled visual tokens per frame is optimal, and increasing $k$ further can degrade performance.
表 5b 展示了每帧使用不同数量的采样视觉嵌入/Token (visual embeddings/tokens) 的结果,其中 $p$ (其中 $h,=,w,=,\sqrt{p}$)。我们观察到,每帧使用 16 个采样的视觉 Token 是最优的,增加 $p$ 可能会降低性能。
Finally, we evaluate the impact of varying the number of sampled frames $N$ ($=2,4,8$) on the tasks requiring temporal information. We consistently select the first and last frames, with the remaining $N-2$ frames sampled uniformly in between. As shown in Table 5c, increasing temporal information significantly boosts performance. For detailed task accuracy and other ablation results, see the supplementary material.
最后,我们评估了在不同采样帧数 $N(=2,4,8)$ 下对需要时间信息的任务的影响。我们始终选择第一帧和最后一帧,并在中间均匀采样剩余的 $N-2$ 帧。如表 5c 所示,增加时间信息显著提升了性能。有关详细的任务准确性和其他消融实验结果,请参见补充材料。
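The frame-selection rule (first and last frames fixed, the rest spread uniformly) can be sketched as follows. The function name is hypothetical, the clip is assumed to hold the eight extracted frames, and rounding to the nearest index is one reasonable reading of "sampled uniformly".

```python
def sample_frame_indices(num_available, num_frames):
    """Pick frame indices that always include the first and last frame."""
    if num_frames < 2 or num_frames > num_available:
        raise ValueError("need 2 <= num_frames <= num_available")
    last = num_available - 1
    denom = num_frames - 1
    # Evenly spaced positions i * last / denom, rounded half up in integer arithmetic.
    return [(2 * i * last + denom) // (2 * denom) for i in range(num_frames)]


for n in (2, 4, 8):
    print(n, sample_frame_indices(8, n))
```

With an eight-frame clip, $N=2$ keeps only the two endpoints, while $N=8$ keeps every frame.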
Cross-Dataset Generalization
跨数据集泛化
We conduct additional experiments to demonstrate performance transfer when perception-stage tasks are co-trained with planning tasks, and to show the resulting improvements on downstream tasks. Specifically, we co-train the TB-100k dataset with the BDD-X dataset and evaluate on the action and control prediction tasks (Kim et al. 2018).
我们进行了额外的实验,以展示从感知阶段任务和规划任务的联合训练中获得的性能迁移,并展示其对下游任务的改进。具体来说,我们将 TB-100k 数据集与 BDD-X 数据集联合训练,并在动作和控制预测任务上进行评估 (Kim et al. 2018)。
As the BDD-X dataset involves frame index referencing in both the question and answer text annotations, we employed the Mini-InternVL model (Gao et al. 2024) as the baseline, which formulates frame referencing in a similar manner.
由于BDD-X数据集在问题和答案文本注释中都涉及帧索引引用,我们采用了Mini-InternVL模型(Gao等人,2024)作为基线,该模型以类似的方式处理帧引用。
We follow a standard MLLM training regime: Stage 1 focuses on feature alignment, utilizing the pre-trained checkpoint of Mini-InternVL, while Stage 2 involves instruction tuning on the main datasets. In the standalone setting, the main dataset involves tuning with BDD-X for 20 epochs, while in the co-training setting, we tune the mixed BDD-X dataset for 20 epochs and TB-100k for 1 epoch. We apply LoRA (Hu et al. 2022) with a rank of 64. During training, we freeze the vision encoder and LLM parameters, updating only the parameters of the multi-modal connector and LoRA adapters. Overall, training is conducted with a learning rate of 2e-4 and a batch size of 96.
我们遵循标准的 MLLM 训练流程:第一阶段侧重于特征对齐,使用 Mini-InternVL 的预训练检查点,而第二阶段则涉及在主数据集上进行指令微调。在独立设置中,主数据集涉及使用 BDD-X 进行 20 个 epoch 的微调,而在联合训练设置中,我们对混合的 BDD-X 数据集进行 20 个 epoch 的微调,并对 TB-100k 进行 1 个 epoch 的微调。我们应用了 LoRA (Hu et al. 2022),秩为 64。在训练过程中,我们冻结了视觉编码器和大语言模型的参数,仅更新多模态连接器和 LoRA 适配器的参数。总体而言,训练的学习率为 2e-4,批量大小为 96。
Tables 6 and 7 compare the transfer performance between standard training and additional co-training with TB-100k. Notably, beyond differences in task types within the VQA format, the two tables also differ in the type of outputs, i.e., free-form text and numerical outputs.
表 6 和表 7 比较了标准训练和额外使用 TB-100k 进行联合训练的迁移性能。值得注意的是,除了 VQA 格式中任务类型的差异外,这两个表在输出类型上也有所不同,即自由格式文本和数值输出。
Table 6 shows improved performance with the co-training setting in the description data split, which includes annotations about scene perception. However, there are marginal differences in the other splits, which are not directly related to perception tasks.
表 6 展示了在描述数据分割中,联合训练设置带来的性能提升,该分割包含场景感知的注释。然而,在其他分割中,差异较小,这些分割与感知任务没有直接关联。
Table 7 demonstrates consistent performance improvement with co-training across most metrics, except for the RMSE of the turning angle, which is slightly worse (11.3 vs. 11.2 degrees).
表 7 展示了在大多数指标上,通过协同训练 (co-training) 带来的性能提升是一致的,除了转向角的 RMSE 略有下降。
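The two control-signal metrics of Table 7 follow directly from their definitions in the caption. The sketch below is illustrative: the function names and the toy predictions are made up for the example, and $A_{\tau}$ is returned as a percentage to match the scale of the table.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def tolerance_accuracy(predictions, targets, tau):
    """A_tau: percentage of samples whose absolute error is strictly below tau."""
    hits = sum(1 for p, t in zip(predictions, targets) if abs(p - t) < tau)
    return 100.0 * hits / len(predictions)


# Toy speed predictions in m/s (illustrative numbers, not from the paper).
pred = [10.0, 12.0, 8.0, 20.0]
true = [10.0, 11.0, 8.25, 14.0]
print(rmse(pred, true))
for tau in (0.1, 0.5, 1.0, 5.0):
    print(tau, tolerance_accuracy(pred, true, tau))
```

RMSE is dominated by large errors, whereas $A_{\tau}$ counts how many predictions fall within a tolerance, which is why the two can move in different directions between methods.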
Conclusion
结论
We have introduced TB-Bench, a comprehensive benchmark that rigorously assesses MLLM performance across eight perception tasks, providing a much-needed standard for spatio-temporal evaluation in AD. Alongside TB-Bench, we have developed the vision-language instruction tuning datasets, TB-100k and TB-250k, which significantly improve MLLM performance when used to fine-tune our baseline models, resulting in substantial gains over existing models. Additionally, our VLIT datasets are valuable assets for mixed training datasets in other driving use cases. Our contributions not only represent incremental progress, but also lay a solid foundation for the further integration of MLLMs into the perception, prediction, and planning stages of AD. These resources are poised to accelerate advancements in the field, supporting the development of more capable and reliable autonomous systems. Please refer to the supplementary material for further discussion on broader impact, limitations, and future work.
我们推出了TB-Bench,这是一个全面的基准测试,严格评估MLLM在八种感知任务中的表现,为自动驾驶(AD)中的时空评估提供了急需的标准。与TB-Bench一起,我们还开发了视觉-语言指令调优数据集TB-100k和TB-250k,这些数据集在用于微调我们的基线模型时显著提升了MLLM的性能,相较于现有模型取得了显著的提升。此外,我们的VLIT数据集在其他驾驶用例的混合训练数据集中也提供了宝贵的资源。我们的贡献不仅代表了渐进式的进步,还为MLLM进一步融入AD的感知、预测和规划阶段奠定了坚实的基础。这些资源有望加速该领域的进步,支持开发更强大、更可靠的自动驾驶系统。有关更广泛影响、局限性和未来工作的进一步讨论,请参阅补充材料。
References
参考文献
Supplementary Material for TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos
TB-Bench 补充材料:用于理解行车记录仪图像/视频中时空交通行为的多模态 AI 训练与测试
This material includes the following sections:
本材料包含以下部分:
Discussions
讨论
Broader Impact
更广泛的影响
This study represents progress in enhancing the capabilities of Multi-Modal Large Language Models (MLLMs) by focusing on a limited set of AD perception tasks. Specifically, we introduce a new benchmark to evaluate MLLMs on understanding diverse traffic behaviors and provide high-quality VLIT datasets that enhance MLLMs’ generalizability. We hope this will advance MLLMs’ applications in AD, contributing to the development of more robust autonomous driving systems.
本研究通过关注一组有限的自动驾驶感知任务,展示了在增强多模态大语言模型 (MLLMs) 能力方面的进展。具体而言,我们引入了一个新的基准来评估 MLLMs 在理解多样化交通行为方面的能力,并提供了高质量的 VLIT 数据集,以增强 MLLMs 的泛化能力。我们希望这将推动 MLLMs 在自动驾驶中的应用,为开发更强大的自动驾驶系统做出贡献。
Limitations
局限性
Firstly, our study utilizes moderately sized language models (the Qwen 0.5B series) due to limited computational resources; the framework can be scaled up to larger models as needed.
首先,由于计算资源有限,我们的研究使用了中等规模的大语言模型(Qwen 0.5B 系列),这些模型可以根据需要进行扩展。
Secondly, we acknowledge the dataset imbalance arising from the natural occurrence of specific autonomous driving behaviors; please refer to Section Dataset Statistics for more details.
其次,我们承认由于特定自动驾驶行为的自然发生导致的数据集不平衡;更多详情请参阅数据集统计部分。
Lastly, the free-form text output templates in TB-100k and TB-250k are limited for certain tasks. However, we believe that the diversity of images is also important for the model to understand visual concepts. That being said, when combined with other (vision-)language instruction tuning datasets, our datasets still enhance the performance of MLLMs, enabling them to generalize better in traffic domains, particularly in understanding traffic behaviors.
最后,TB-100k 和 TB-250k 中的自由文本输出模板在某些任务中存在限制。然而,我们认为图像的多样性对于模型理解视觉概念同样重要。尽管如此,当与其他(视觉-)语言指令调优数据集结合时,我们的数据集仍然能够提升多模态大语言模型 (MLLM) 的性能,使其在交通领域,特别是在理解交通行为方面,具有更好的泛化能力。
Future Work
未来工作
Future research could expand this work by incorporating a wider range of perception tasks or by exploring subsequent stages, such as prediction and planning.
未来的研究可以通过纳入更广泛的感知任务或探索后续阶段(如预测和规划)来扩展这项工作。
Additionally, an important direction for future investigation is the optimal application of upstream perception tuning sets, including the TB-100k and TB-250k datasets, to relevant downstream traffic tasks. This approach may enhance model performance in real-world applications.
此外,未来的一个重要研究方向是将上游感知调优集(包括 TB-100k 和 TB-250k 数据集)优化应用于相关的下游交通任务。这种方法可能会提升模型在实际应用中的性能。
Furthermore, integrating real-time traffic data, such as video feeds and sensory inputs, could improve the MLLMs’ understanding of dynamic traffic situations. Finally, enhancing the explainability of MLLMs in traffic behavior scenarios will help users understand the rationale behind model predictions.
此外,整合实时交通数据(如视频流和传感器输入)可以提高 MLLM 对动态交通情况的理解。最后,增强 MLLM 在交通行为场景中的可解释性将有助于用户理解模型预测背后的逻辑。
Access to the Benchmark and Datasets Availability
基准和数据集可用性访问
The Traffic Behavior Benchmark (TB-Bench) and the training datasets (TB-100k, TB-250k) will be publicly available at the following GitHub repository:
交通行为基准 (Traffic Behavior Benchmark, TB-Bench) 和训练数据集 (TB-100k, TB-250k) 将在以下 Github 仓库中公开提供:
https://github.com/TB-AD/TB-Bench-110k-250k
The source code for conducting and analyzing the experiments will also be publicly available in the repository upon publication, permitting free use for research purposes.
进行和分析实验的源代码也将在发布后公开于代码库中,允许免费用于研究目的。
Future Update
未来更新
We also plan to establish an evaluation server and leaderboard on Hugging Face in the future. Any updates will be communicated through the above GitHub repository to ensure users have access to the latest information.
我们还计划未来在 Hugging Face 上建立一个评估服务器和排行榜。任何更新都将通过上述 Github 仓库进行通知,以确保用户能够获取最新信息。
Benchmark and Datasets
基准测试与数据集
Task Definition
任务定义
Relative Distance (RD). The task is to predict the Euclidean distance in meters between two entities in an image; see Figure 11 for two examples.
相对距离 (Relative Distance, RD)。该任务旨在预测图像中两个实体之间的欧几里得距离(以米为单位);图 11 展示了两个示例。
Spatial Reasoning (SR). The task is to predict the spatial position of one entity relative to another from the perspective of a reference entity; see Figure 12 for examples. Specifically, the relationship between two objects is defined by the angle $\theta$ , as follows:
空间推理 (Spatial Reasoning, SR)。该任务是从参考实体的角度预测一个实体相对于另一个实体的空间位置;示例见图 12。具体来说,两个对象之间的关系由角度 $\theta$ 定义,如下所示:
$$
\mathrm{Relation}=\left\{\begin{array}{ll}
\mathrm{front} & \mathrm{if~}-30^{\circ}<\theta\leq30^{\circ},\\
\mathrm{front~left} & \mathrm{if~}30^{\circ}<\theta\leq90^{\circ},\\
\mathrm{front~right} & \mathrm{if~}-90^{\circ}<\theta\leq-30^{\circ},\\
\mathrm{back~left} & \mathrm{if~}90^{\circ}<\theta\leq150^{\circ},\\
\mathrm{back~right} & \mathrm{if~}-150^{\circ}<\theta\leq-90^{\circ},\\
\mathrm{back} & \mathrm{otherwise}.
\end{array}\right.
$$
This angular relationship is similar to that defined in (Qian et al. 2024).
这种角度关系与 (Qian et al. 2024) 中定义的关系类似。
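The angular definition above maps directly onto a small classifier. The sketch below is an illustration under assumptions: the function names are hypothetical, positions are 2D ground-plane coordinates, and $\theta$ is taken as counterclockwise-positive from the reference entity's facing direction, so that positive angles lie to its left.

```python
import math


def relative_angle_deg(ref_xy, ref_heading_deg, target_xy):
    """Angle from the reference heading to the target position, in (-180, 180]."""
    dx, dy = target_xy[0] - ref_xy[0], target_xy[1] - ref_xy[1]
    bearing = math.degrees(math.atan2(dy, dx))
    theta = (bearing - ref_heading_deg) % 360.0  # now in [0, 360)
    return theta - 360.0 if theta > 180.0 else theta


def spatial_relation(theta):
    """Map the angle theta (degrees) to one of the six SR categories."""
    if -30.0 < theta <= 30.0:
        return "front"
    if 30.0 < theta <= 90.0:
        return "front left"
    if -90.0 < theta <= -30.0:
        return "front right"
    if 90.0 < theta <= 150.0:
        return "back left"
    if -150.0 < theta <= -90.0:
        return "back right"
    return "back"


# Reference entity at the origin, facing along +x.
for target in [(10, 0), (10, 10), (10, -10), (-10, 10), (-10, -10), (-10, 0)]:
    theta = relative_angle_deg((0, 0), 0.0, target)
    print(target, spatial_relation(theta))
```

Note that the intervals are half-open, so an angle of exactly $90^{\circ}$ falls into ‘front left’ while exactly $-90^{\circ}$ falls into ‘back right’.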
Orientation Reasoning (OR). This task is to predict the facing relationship between two entities from the perspective of a reference entity, categorized as: ‘similar’, ‘opposite’, or ‘perpendicular’. Please refer to Figure 13 for examples. The relationship is defined based on the absolute difference in facing angles $|\theta|$ , as follows:
方向推理 (Orientation Reasoning, OR)。该任务是从参考实体的角度预测两个实体之间的朝向关系,分为:“相似”、“相反”或“垂直”。请参考图 13 中的示例。该关系基于朝向角度的绝对差值 $|\theta|$ 定义,如下所示:
$$
\mathrm{Relation}=\left\{\begin{array}{ll}
\mathrm{similar} & \mathrm{if~}0^{\circ}\leq|\theta|\leq45^{\circ},\\
\mathrm{opposite} & \mathrm{if~}135^{\circ}\leq|\theta|\leq180^{\circ},\\
\mathrm{perpendicular} & \mathrm{otherwise}.
\end{array}\right.
$$
$$
\mathrm{Relation}=\left\{\begin{array}{ll}
\text{相似} & \text{如果~}0^{\circ}\leq|\theta|\leq45^{\circ},\\
\text{相反} & \text{如果~}135^{\circ}\leq|\theta|\leq180^{\circ},\\
\text{垂直} & \text{其他情况}.
\end{array}\right.
$$
It is noted that this angle is measured from the facing direction of a reference entity to the position of the target entity in Euclidean space, irrespective of the target entity’s facing direction.
需要注意的是,这个角度是从参考实体的朝向方向到目标实体在欧几里得空间中的位置测量的,与目标实体的朝向方向无关。
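The orientation categories can be computed from two facing angles as in the following sketch. The function name is hypothetical, and headings are assumed to be given in degrees in an arbitrary common frame; the difference is wrapped into $[0^{\circ},180^{\circ}]$ before the thresholds are applied.

```python
def orientation_relation(ref_heading_deg, target_heading_deg):
    """Classify the facing relationship from the absolute heading difference."""
    diff = abs(ref_heading_deg - target_heading_deg) % 360.0
    if diff > 180.0:
        diff = 360.0 - diff  # wrap into [0, 180]
    if diff <= 45.0:
        return "similar"
    if diff >= 135.0:
        return "opposite"
    return "perpendicular"


print(orientation_relation(0.0, 10.0))    # small difference
print(orientation_relation(0.0, 90.0))    # right angle
print(orientation_relation(350.0, 10.0))  # difference wraps around 360
```

The wrap step matters for headings on either side of the $0^{\circ}/360^{\circ}$ boundary, which would otherwise be misread as nearly opposite.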
Other Lane to Ego-Vehicle (EGO-LANE). This task is to predict the lane of a target vehicle relative to the ego-vehicle’s perspective; see Figure 14 for examples. The categories include: ‘front lane’, ‘front left lane’, ‘front right lane’, and ‘oncoming traffic lane’ (the lane on the opposite side of the road).
其他车道到自车车道 (EGO-LANE)。该任务是预测目标车辆相对于自车视角的车道;示例见图 14。类别包括:“前车道”、“前左车道”、“前右车道”和“对向车道”(道路的另一侧车道)。
It is noted that when the ego-vehicle is on a road with multiple lanes, the ‘front lane’ is further classified into three fine-grained categories: ‘front lane’, ‘front left lane’, and ‘front right lane’.
值得注意的是,当自车位于多车道道路上时,“前车道”进一步细分为三类:“前车道”、“左前车道”和“右前车道”。
Other Lane Changing (OBJ-LANE). This task is to predict whether the target vehicle is changing lanes, categorized as ‘left lane change’, ‘right lane change’, or ‘no change’; see Figure 15 for examples. Lane changes are evaluated based on the target vehicle’s viewpoint. For instance, if the target vehicle in the oncoming traffic lane executes a right lane change, the ego vehicle perceives it as moving to the left.
其他车道变换 (OBJ-LANE)。该任务是预测目标车辆是否在变换车道,分类为“左车道变换”、“右车道变换”或“无变换”;示例见图 15。车道变换是基于目标车辆的视角进行评估的。例如,如果对向车道中的目标车辆执行右车道变换,自车会将其视为向左移动。
Other Turning (OBJ-TURN). This task is to predict whether the target vehicle is making a turn, categorized as ‘turning left’, ‘turning right’, or ‘go straight’. The target vehicle is considered to be turning, if it changes direction by more than 25 degrees within a period of 1.6 seconds. Please refer to Figure 16 for examples.
其他转向 (OBJ-TURN)。该任务是预测目标车辆是否在转弯,分类为“左转”、“右转”或“直行”。如果目标车辆在1.6秒内改变方向超过25度,则认为它在转弯。请参考图 16 中的示例。
Ego Turning (EGO-TURN). This task is to predict whether the ego-vehicle is making a turn, categorized as turning left, turning right, or going straight. The turning maneuver of the ego-vehicle is also defined by a change in direction of more than 25 degrees within a period of 1.6 seconds. Please refer to Figure 17 for examples.
自我转向 (EGO-TURN)。该任务是预测自车是否正在转向,分类为左转、右转或直行。自车的转向动作也定义为在1.6秒内方向变化超过25度。示例请参见图17。
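The turning rule shared by OBJ-TURN and EGO-TURN (a direction change of more than 25 degrees within 1.6 seconds) can be sketched by accumulating heading changes over one window. This is an illustration under assumptions: the function names are hypothetical, headings are assumed to increase counterclockwise so that a positive accumulated angle means a left turn, and the sampling rate inside the window is left open.

```python
def wrap_deg(angle):
    """Wrap an angle difference into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def classify_turn(headings_deg, threshold_deg=25.0):
    """Classify one 1.6 s window of headings (oldest first)."""
    accumulated = sum(
        wrap_deg(b - a) for a, b in zip(headings_deg, headings_deg[1:])
    )
    if accumulated > threshold_deg:
        return "turning left"   # assumption: counterclockwise is positive
    if accumulated < -threshold_deg:
        return "turning right"
    return "go straight"


print(classify_turn([0.0, 10.0, 20.0, 30.0]))
print(classify_turn([350.0, 0.0, 10.0, 20.0]))  # crosses the 360-degree boundary
print(classify_turn([0.0, 5.0, 10.0]))
```

Accumulating wrapped step-wise differences, rather than subtracting the first heading from the last, keeps the result correct when the heading crosses the $0^{\circ}/360^{\circ}$ boundary.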
Ego Traverse Distance (EGO-TRA). This task is to predict the traverse distance of the ego vehicle in meters over a period of 1.6 seconds. Please see Figure 18 for examples.
自车行驶距离 (EGO-TRA)。该任务是预测自车在1.6秒内的行驶距离(以米为单位)。示例请参见图 18。
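A ground-truth value for this task could be derived from ego positions as in the sketch below. It rests on assumptions: "traverse distance" is read here as the path length summed over consecutive positions inside the 1.6-second window (rather than the straight-line displacement), the function names are hypothetical, and the answer string mirrors the "xx.x meters" format used in the prompts.

```python
import math


def traverse_distance(positions_xy):
    """Path length in meters along consecutive ego positions of one window."""
    return sum(math.dist(a, b) for a, b in zip(positions_xy, positions_xy[1:]))


def format_answer(distance_m):
    """Render the distance in the 'xx.x meters' answer format."""
    return f"{distance_m:.1f} meters"


# Toy ego track (meters) sampled across one 1.6 s window.
track = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(format_answer(traverse_distance(track)))
```

For a vehicle driving straight the two readings coincide; they differ only when the ego vehicle turns within the window.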
Dataset Statistics
数据集统计
Tables 1, 2, and 3 show the distribution of categories for the TB-Bench, TB-100k, and TB-250k datasets, respectively, detailing the count and percentage of samples for various task types.
表 1、表 2 和表 3 分别展示了 TB-Bench、TB-100k 和 TB-250k 数据集的类别分布,详细列出了各种任务类型的样本数量和百分比。
To create the TB-Bench, we manually screened the frames thoroughly to select samples with clearly visible target entities. Each task in TB-Bench has an equal count of 250 samples. We ensure that the distribution of categories in each task closely resembles that of the instruction tuning datasets.
为了创建 TB-Bench,我们手动筛选了帧,以选择目标实体清晰可见的样本。TB-Bench 中的每个任务都有 250 个样本。我们确保每个任务中的类别分布与指令微调数据集中的分布非常相似。
It can be seen from Tables 2 and 3 that TB-250k represents the natural distribution of scene occurrences in real-world scenarios, while TB-100k is a more label-balanced version.
从表 2 和表 3 可以看出,TB-250k 代表了现实世界场景中的正常场景发生分布,而 TB $\cdot100\mathbf{k}$ 是一个更标签平衡的版本。
Table 1: TB-Bench Statistics
表 1: TB-Bench 统计数据
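| Task type | Category | Count | Percentage (%) |
|---|---|---|---|
| Relative Distance | numeric | 250 | 12.5 |
| Spatial Reasoning | back | 61 | 3.0 |
| Spatial Reasoning | back left | 30 | 1.5 |
| Spatial Reasoning | back right | 9 | 0.4 |
| Spatial Reasoning | front | 87 | 4.3 |
| Spatial Reasoning | front left | 45 | 2.2 |
| Spatial Reasoning | front right | 18 | 0.9 |
| Orientation Reasoning | numeric | 122 | 6.1 |
| Orientation Reasoning | opposite | 51 | 2.5 |
| Orientation Reasoning | perpendicular | 16 | 0.8 |
| Orientation Reasoning | similar | 61 | 3.0 |
| Other Lane to Ego | front lane | 71 | 3.5 |
| Other Lane to Ego | front left lane | 40 | 2.0 |
| Other Lane to Ego | front right lane | 31 | 1.6 |
| Other Lane to Ego | oncoming traffic lane | 108 | 5.4 |
| Other Lane Changing | left lane change | 62 | 3.1 |
| Other Lane Changing | no change | 142 | 7.1 |
| Other Lane Changing | right lane change | 46 | 2.3 |
| Other Turning | go straight | 126 | 6.3 |
| Other Turning | turning left | 67 | 3.4 |
| Other Turning | turning right | 57 | 2.9 |
| Ego Turning | go straight | 122 | 6.1 |
| Ego Turning | turning left | 38 | 1.9 |
| Ego Turning | turning right | 90 | 4.5 |
| Ego Traverse Distance | numeric | 250 | 12.5 |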
Data Generation Pipeline Details
Information Extraction
数据生成管道详细信息提取
Figure 1 shows the extraction process. It begins with obtaining raw sensory data from input samples, which may include single-frame data from KITTI or ONCE, or sequential data from Argoverse2. This sensory data is processed to filter out insignificant scene information.
图 1 展示了提取过程。它从获取输入样本的原始感官数据开始,这些数据可能包括来自 ONCE 或 Argoverse2 的顺序数据。这些感官数据经过处理,以过滤掉不重要的场景信息。
Table 2: TB-100k Statistics
表 2: TB-100k 统计数据
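| Task type | Category | Count | Percentage (%) |
|---|---|---|---|
| Relative Distance | numeric | 10000 | 9.1 |
| Spatial Reasoning | back | 3580 | 3.3 |
| Spatial Reasoning | back left | 3183 | 2.9 |
| Spatial Reasoning | back right | 3115 | 2.8 |
| Spatial Reasoning | front | 7873 | 7.2 |
| Spatial Reasoning | front left | 7321 | 6.7 |
| Spatial Reasoning | front right | 4928 | 4.5 |
| Orientation Reasoning | numeric | 10000 | 9.1 |
| Orientation Reasoning | opposite | 10013 | 9.1 |
| Orientation Reasoning | perpendicular | 2387 | 2.2 |
| Orientation Reasoning | similar | 7600 | 6.9 |
| Other Lane to Ego | front lane | 3889 | 3.5 |
| Other Lane to Ego | front left lane | 3231 | 2.9 |
| Other Lane to Ego | front right lane | 4182 | 3.8 |
| Other Lane to Ego | oncoming traffic lane | 8698 | 7.9 |
| Other Lane Changing | left lane change | 414 | 0.4 |
| Other Lane Changing | no change | 807 | 0.7 |
| Other Lane Changing | right lane change | 279 | 0.3 |
| Other Turning | go straight | 744 | 0.7 |
| Other Turning | turning left | 435 | 0.4 |
| Other Turning | turning right | 321 | 0.3 |
| Ego Turning | go straight | 753 | 0.7 |
| Ego Turning | turning left | 331 | 0.3 |
| Ego Turning | turning right | 416 | 0.4 |
| Ego Traverse Distance | numeric | 15500 | 14.1 |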
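Table 3: TB-250k Statistics
| Task type | Category | Count | Percentage (%) |
|---|---|---|---|
| Relative Distance | numeric value | 34721 | 13.7 |
| Spatial Reasoning | back | 17023 | 6.7 |
| Spatial Reasoning | back left | 6247 | 2.5 |
| Spatial Reasoning | back right | 3966 | 1.6 |
| Spatial Reasoning | front | 26917 | 10.6 |
| Spatial Reasoning | front left | 10793 | 4.3 |
| Spatial Reasoning | front right | 4804 | 1.9 |
| Orientation Reasoning | numeric value | 34872 | 13.7 |
| Orientation Reasoning | opposite | 19242 | 7.6 |
| Orientation Reasoning | perpendicular | 3355 | 1.3 |
| Orientation Reasoning | similar | 12283 | 4.8 |
| Other Lane to Ego | front lane | 14312 | 5.6 |
| Other Lane to Ego | front left lane | 4454 | 1.8 |
| Other Lane to Ego | front right lane | 6401 | 2.5 |
| Other Lane to Ego | oncoming traffic lane | 24833 | 9.8 |
| Other Lane Changing | left lane change | 414 | 0.2 |
| Other Lane Changing | no change | 807 | 0.3 |
| Other Lane Changing | right lane change | 279 | 0.1 |
| Other Turning | go straight | 744 | 0.3 |
| Other Turning | left turn | 435 | 0.2 |
| Other Turning | right turn | 321 | 0.1 |
| Ego Turning | go straight | 753 | 0.3 |
| Ego Turning | left turn | 331 | 0.1 |
| Ego Turning | right turn | 416 | 0.2 |
| Ego Traverse Distance | numeric value | 25000 | 9.9 |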
For Argoverse2, lane geometry information is processed concurrently. Lane coordinates are used to create polygons with attributes, such as neighboring, successor, and predecessor lanes. This information helps determine lane direction and angle, which are then projected onto vehicle attributes to obtain the vehicle’s lane ID and relevant lane information. This data is subsequently passed to the next processing step to extract all scene attributes.
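The lane-processing step can be illustrated with a minimal sketch. It assumes each lane is given as a boundary polygon plus an ordered centerline; the `Lane` class, its field names, and the helper functions are hypothetical and are not taken from the Argoverse2 API or from the authors' pipeline.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]

@dataclass
class Lane:
    # Hypothetical container for one lane polygon and its attributes.
    lane_id: int
    polygon: List[Point]      # boundary vertices
    centerline: List[Point]   # ordered along the driving direction
    left_neighbor: Optional[int] = None
    right_neighbor: Optional[int] = None

def lane_heading_deg(lane: Lane) -> float:
    """Lane direction as the angle of the first-to-last centerline vector."""
    (x0, y0), (x1, y1) = lane.centerline[0], lane.centerline[-1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

def contains(polygon: List[Point], point: Point) -> bool:
    """Ray-casting point-in-polygon test."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_lane(lanes: List[Lane], position: Point) -> Optional[Lane]:
    """Return the first lane whose polygon contains the vehicle position."""
    for lane in lanes:
        if contains(lane.polygon, position):
            return lane
    return None

lanes = [
    Lane(1, [(0, 0), (10, 0), (10, 3), (0, 3)], [(0, 1.5), (10, 1.5)], left_neighbor=2),
    Lane(2, [(0, 3), (10, 3), (10, 6), (0, 6)], [(10, 4.5), (0, 4.5)]),
]
vehicle_lane = assign_lane(lanes, (5.0, 1.0))
```

Once a vehicle is matched to a lane, the lane ID, neighbor IDs, and heading can be attached to the vehicle's attributes for the later rule-based steps.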
Rule-based Q&A Generation
The process begins with obtaining attribute data from either the nodes or edges of the relationship graph. This data is then processed through rule-based functions to extract behavioral or spatial information. Next, we generate behavioral attributes in a Q&A format using templates provided in Table 4.
Generation depends on the task type. Tasks 1-4 and task 8 (‘Relative Distance,’ ‘Spatial Reasoning,’ ‘Orientation Reasoning,’ ‘Other Lane to Ego,’ and ‘Ego Traverse Distance’) can be created in any frame, as their attributes are available in all frames.
In contrast, tasks 5-7 (‘Other Lane Changing,’ ‘Other Turning,’ and ‘Ego Turning’) require a triggering event, specifically a change in attributes. The rules for triggering each event are as follows:
Event Triggering: Other Lane Changing
• Check if the current lane id is in the future right neighbor id. If yes, then assign: Right Lane Change.
• Check if the current lane id is in the future left neighbor id. If yes, then assign: Left Lane Change.
• If neither condition is met, assign: No Change.
Note: future right neighbor id refers to the right neighbor id of the next time step; the same applies to the left side.
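The lane-change rule above can be sketched as a small function. This is an illustrative sketch only: the function name and the representation of neighbor ids as sets are assumptions, and the labels follow the rule exactly as it is stated in the text.

```python
def classify_lane_change(current_lane_id, future_right_neighbor_ids, future_left_neighbor_ids):
    """Apply the lane-change trigger rule as written in the text.

    The neighbor-id arguments are the right/left neighbor ids of the lane
    occupied at the next time step (hypothetical representation: sets of ids).
    """
    if current_lane_id in future_right_neighbor_ids:
        return "Right Lane Change"
    if current_lane_id in future_left_neighbor_ids:
        return "Left Lane Change"
    return "No Change"

label = classify_lane_change(7, future_right_neighbor_ids={7}, future_left_neighbor_ids={9})
```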
Event Triggering: Other Turning
• Check if the accumulated object yaw angle is greater than 25 degrees in 1.6 seconds. If yes, then assign: Turn Left.
• Check if the accumulated object yaw angle is less than -25 degrees in 1.6 seconds. If yes, then assign: Turn Right.
• If neither condition is met, assign: Go straight.
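A minimal sketch of the yaw-based turning rule is given below. It assumes that the input is a sequence of yaw samples (in degrees) covering the 1.6-second window and that positive yaw change corresponds to a left turn; the wrap-around handling and the function names are illustrative assumptions rather than the authors' implementation.

```python
def wrap_deg(angle):
    """Map an angle difference into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def accumulated_yaw(yaws_deg):
    """Sum of wrapped frame-to-frame yaw changes over the window."""
    return sum(wrap_deg(b - a) for a, b in zip(yaws_deg, yaws_deg[1:]))

def classify_turn(yaws_deg, threshold_deg=25.0):
    """Label a window of yaw samples (assumed to span 1.6 s)."""
    total = accumulated_yaw(yaws_deg)
    if total > threshold_deg:
        return "Turn Left"
    if total < -threshold_deg:
        return "Turn Right"
    return "Go straight"

label = classify_turn([170.0, -175.0, -160.0])  # crosses the +/-180 boundary
```

The same function applies to the ego-vehicle by passing its own yaw samples.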

Figure 1: Data Extraction Process.
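Table 4: Q&A Templates. The placeholder `<entity>` refers to any entity, such as ‘Entity #1’, ‘Entity #2’, or ‘Ego-vehicle’, ensuring that no sentence contains duplicate entities. ‘Short Answer Template’ denotes a basic class of concise responses that can be expanded into more complex sentences.
| Task type | Question template | Short answer template |
|---|---|---|
| Relative Distance | Can you measure the straight-line distance between `<entity>` and `<entity>` in meters? | xx.xx meters |
| | How far is `<entity>` from `<entity>` in meters? | |
| | How many meters apart are `<entity>` and `<entity>`? | |
| | What is the distance from `<entity>` to `<entity>` along the road surface, in meters? | |
| Spatial Reasoning | From the perspective of `<entity>`, how are `<entity>` and `<entity>` spatially related? | back, back left, back right, front, front left, front right |
| | What is the spatial position of `<entity>` relative to `<entity>`? | |
| | What is the spatial relationship between `<entity>` and `<entity>`? | |
| Orientation Reasoning | How would you describe the orientation of `<entity>` relative to `<entity>`: similar, opposite, or perpendicular? | opposite, perpendicular, similar, xx.xx degrees |
| | What is the direction of `<entity>` relative to `<entity>`: similar, opposite, or perpendicular? | |
| | What is the angle between `<entity>` and `<entity>` in degrees? | |
| | What is the heading angle of `<entity>` relative to `<entity>` in degrees? | |
| | What is the yaw angle difference between `<entity>` and `<entity>` in degrees? | |
| Other Lane to Ego | How would you describe the lane position of Entity#1? Options: front lane, front left lane, front right lane, or oncoming traffic lane. | front lane, front left lane, front right lane, oncoming traffic lane |
| Other Lane Changing | How would you describe the driving scene involving Entity#1? Please explain, focusing on the vehicle's lane change maneuver. | left lane change, no change, right lane change |
| Other Turning | How would you describe the driving scene involving Entity#1? Please explain, focusing on the vehicle's turning maneuver. | go straight, left turn, right turn |
| Ego Turning | How would you describe the driving scene involving our vehicle? Please explain, focusing on our vehicle's turning maneuver. | go straight, left turn, right turn |
| Ego Traverse Distance | In the current scene, how far did our vehicle travel, and which turning maneuver did it perform? | xx.xx meters |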
Event Triggering: Ego Turning
• Check if the accumulated ego-vehicle yaw angle is greater than 25 degrees in 1.6 seconds. If yes, then assign: Turn Left.
• Check if the accumulated ego-vehicle yaw angle is less than -25 degrees in 1.6 seconds. If yes, then assign: Turn Right.
• If neither condition is met, assign: Go straight.
Q&A Augmentation
The augmentation process converts short question-answer (Q&A) pairs into natural language sentences. Each short Q&A pair is expanded into a full sentence using a predefined structure. We employ the Microsoft-Phi3-medium model to generate these sentences, using the following prompt:
Complete Prompt
The parameters `{question}` and `{answer}` are dynamically inserted for each instance. This approach ensures that the augmented data remains concise (up to 15 words) while embedding the original short answer in a more elaborate context, maintaining the correctness and relevance of the response.
Pre-crash Scenarios
Figure 2 presents the full list of 65 pre-crash scenarios described in Section Task Design, based on the National Automotive Sampling System (NASS). Each scenario is categorized into a specific accident type, such as ‘Animal’ or ‘Off-road’.
Evaluation Details
Evaluation Metrics
As mentioned in the main paper, we employ rule-based methods for evaluation. Figure 3 shows the keyword lists and regular expressions used in the evaluation pipeline.
Additional Details on Evaluated Models
In this study, we evaluate open-source state-of-the-art models and proprietary models on TB-Bench in a zero-shot manner. We provide additional information on the evaluated models in Table 5.
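The first category consists of open-source models (LLaVA, Bunny, and InternVL), which are accessible via the Hugging Face API. These models are fully fine-tuned, with the specific settings for each version available in their Hugging Face repositories.
| No. | Scenario definition |
|---|---|
| 1 | Animal: other |
| 2 | Animal: vehicle going straight and animal in road |
| 3 | Animal: vehicle turning and animal in road |
| 4 | Off-road: single vehicle performing an avoidance maneuver |
| 5 | Off-road: single vehicle going straight and departing road edge |
| 6 | Off-road: single vehicle going straight and losing control |
| 7 | Off-road: single vehicle initiating a maneuver and departing road edge |
| 8 | Off-road: single vehicle initiating a maneuver and losing control |
| 9 | Off-road: single vehicle turning and departing road edge |
| 10 | Off-road: single vehicle turning and losing control |
| 11 | Off-road: single vehicle and other loss of control |
| 12 | Off-road: single vehicle due to vehicle failure |
| 13 | Off-road: single vehicle and other road edge departure |
| 14 | Off-road: single vehicle and other/unknown |
| 15 | Off-road: backing |
| 16 | Off-road: no impact |
| 17 | Pedalcyclist: other/unknown |
| 18 | Pedalcyclist: vehicle going straight and crossing paths with cyclist |
| 19 | Pedalcyclist: vehicle going straight and on parallel paths with cyclist |
| 20 | Pedalcyclist: vehicle starting in lane and crossing paths with cyclist |
| 21 | Pedalcyclist: vehicle turning left and crossing paths with cyclist |
| 22 | Pedalcyclist: vehicle turning left and on parallel paths with cyclist |
| 23 | Pedalcyclist: vehicle turning right and crossing paths with cyclist |
| 24 | Pedalcyclist: vehicle turning right and on parallel paths with cyclist |
| 25 | Pedestrian: other |
| 26 | Pedestrian: vehicle backing |
| 27 | Pedestrian: vehicle going straight and pedestrian crossing road |
| 28 | Pedestrian: vehicle going straight and pedestrian darting onto road |
| 29 | Pedestrian: vehicle going straight and pedestrian playing/working on road |
| 30 | Pedestrian: vehicle going straight and pedestrian walking along road |
| 31 | Pedestrian: vehicle turning left and pedestrian crossing road |
| 32 | Pedestrian: vehicle turning right and pedestrian crossing road |
| 33 | Backing: in lane |
| 34 | Backing: at intersection |
| 35 | Backing: other |
| 36 | Lane change: both vehicles going straight and one encroaching into the same lane |
| 37 | Lane change: both vehicles going straight and one encroaching into another lane |
| 38 | Lane change: one vehicle going straight and the other changing lanes |
| 39 | Lane change: one vehicle going straight and the other entering or leaving a parking position |
| 40 | Lane change: one vehicle going straight and the other passing |
| 41 | Lane change: one vehicle going straight and the other turning |
| 42 | Lane change: other two-vehicle combinations |
| 43 | Lane change: one vehicle passing and the other turning |
| 44 | Opposite direction: loss of control |
| 45 | Opposite direction: both vehicles going straight and one encroaching |
| 46 | Opposite direction: both vehicles going straight and in the same lane |
| 47 | Opposite direction: both vehicles turning and one encroaching |
| 48 | Opposite direction: both vehicles turning and in the same lane |
| 49 | Opposite direction: other/unknown |
| 50 | Opposite direction: involving one vehicle passing |
| 51 | Opposite direction: involving vehicle failure |
| 52 | Rear-end: following vehicle changing lanes |
| 53 | Rear-end: lead vehicle accelerating |
| 54 | Rear-end: lead vehicle changing lanes |
| 55 | Rear-end: lead vehicle decelerating |
| 56 | Rear-end: lead vehicle moving at a constant lower speed |
| 57 | Rear-end: lead vehicle stopped |
| 58 | Rear-end: other/unknown |
| 59 | Crossing paths: left turn across path from lateral direction (LTAP/LD) |
| 60 | Crossing paths: left turn across path from opposite direction (LTAP/OD) |
| 61 | Crossing paths: left turn into path (LTIP) |
| 62 | Crossing paths: other/unknown |
| 63 | Crossing paths: right turn across path from lateral direction (RTAP/LD) |
| 64 | Crossing paths: right turn into path (RTIP) |
| 65 | Crossing paths: straight crossing paths (SCP) |
Figure 2: List of pre-crash scenarios based on National Automotive Sampling System (NASS) variables.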
The second category consists of proprietary models (GPT-4o and Gemini), which require specific API calls and image formatting. We evaluated the latest versions of these models available at the time of submission.
Helper Function:
Relative Distance Evaluation:
This function assesses the accuracy of predicted distances by comparing them to the ground truth.
Distance Extraction:
It extracts numerical distances from the predicted and ground-truth texts using a helper function, returning 0 if extraction fails.
Evaluation Logic:
The function checks whether the predicted distance falls within 25% of the ground truth. If it does, it returns a score of 1 for a correct prediction; otherwise, it returns 0.
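The distance check can be sketched as follows. The regular expression, the function names, and the choice to look for a number followed by a meter unit (so that the digit in ‘Entity #1’ is not mistaken for a distance) are assumptions for illustration, not the authors' exact pipeline.

```python
import re

# Hypothetical pattern: a number followed by a meter unit.
_METERS = re.compile(r"(\d+(?:\.\d+)?)\s*(?:meters?|metres?|m)\b", re.IGNORECASE)

def extract_distance(text):
    """Return the first distance in meters found in the text, or None."""
    match = _METERS.search(text)
    return float(match.group(1)) if match else None

def score_distance(pred_text, gt_text, rel_tol=0.25):
    """Score 1 if the prediction is within rel_tol of the ground truth, else 0."""
    pred = extract_distance(pred_text)
    gt = extract_distance(gt_text)
    if pred is None or gt is None:
        return 0  # extraction failure counts as an incorrect prediction
    return 1 if abs(pred - gt) <= rel_tol * abs(gt) else 0

score = score_distance("Entity #1 is 12.50 meters from Entity #2.", "10.00 meters")
```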
Spatial Reasoning Evaluation:
This function uses keyword lists for different spatial positions: front, front right, front left, back, back right, and back left. It checks the predicted text for these keywords.
Keyword Lists:
front right keywords: ['front right', …]
front left keywords: ['front left', …]
front keywords: ['positioned directly ahead of our car', …]
back keywords: ['positioned directly behind', …]
back right keywords: ['back right', …]
back left keywords: ['back left', …]
Checking Logic:
The function verifies whether the predicted text contains keywords from exactly one category. If so, it returns a score of 1 if that category matches the ground truth, otherwise 0. If no match or multiple matches are found, it returns 0 to prevent ambiguity.
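The exactly-one-category check, which is shared by all keyword-based tasks, can be sketched as below. The dictionary contains only the first keyword of each list shown above (the remaining entries are elided in the source), and the function names are hypothetical.

```python
def matched_categories(text, keyword_lists):
    """Return every category that has at least one keyword in the text."""
    lowered = text.lower()
    return [name for name, keywords in keyword_lists.items()
            if any(keyword in lowered for keyword in keywords)]

def score_keyword_task(pred_text, gt_label, keyword_lists):
    """Score 1 only if exactly one category matches and it equals the label."""
    matched = matched_categories(pred_text, keyword_lists)
    if len(matched) != 1:
        return 0  # no match or ambiguous prediction
    return 1 if matched[0] == gt_label else 0

# Only the first keyword of each list is known from the text.
SPATIAL_KEYWORDS = {
    "front right": ["front right"],
    "front left": ["front left"],
    "front": ["positioned directly ahead of our car"],
    "back": ["positioned directly behind"],
    "back right": ["back right"],
    "back left": ["back left"],
}

score = score_keyword_task("Entity #1 is at the front left of our car.",
                           "front left", SPATIAL_KEYWORDS)
```

Passing a different keyword dictionary reuses the same logic for the lane, turning, and orientation categories.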
Other Lane Changing Evaluation:
This function uses keyword lists to identify lane change maneuvers: no change, left lane change, and right lane change. It checks the predicted text for these keywords.
Keyword Lists:
no change list: ['maintains its lane', …]
left lane change list: ['change to the left lane', …]
right lane change list: ['change to the right lane', …]
Checking Logic:
The function verifies whether the predicted text contains keywords from exactly one category. If so, it returns a score of 1 if that category matches the ground truth, otherwise 0. If no match or multiple matches are found, it returns 0 to prevent ambiguity.
Other Turning Evaluation:
This function uses keyword lists to identify vehicle turning maneuvers: left turn, right turn, and go straight. It checks the predicted text for these keywords.
Keyword Lists:
left turn list: ['turn left', …]
right turn list: ['turn right', …]
go straight list: ['go straight', …]
Checking Logic:
The function verifies whether the predicted text contains keywords from exactly one category. If so, it returns a score of 1 if that category matches the ground truth, otherwise 0. If no match or multiple matches are found, it returns 0 to prevent ambiguity.
Orientation Reasoning Evaluation:
This function uses keyword lists to identify vehicle orientations: perpendicular, opposite, and similar to the ego-vehicle. It checks the predicted text for these keywords.
Keyword Lists:
perpendicular list: ['perpendicular', …]
opposite list: ['opposite', …]
similar list: ['similar', …]
Checking Logic:
The function ensures the predicted text contains keywords from exactly one category to prevent ambiguity. If the ground truth is an angle, it calculates the angular difference between the predicted and ground-truth angles. If the difference is within 15 degrees, it returns a score of 1; otherwise, it returns 0. If no match or multiple matches are found, it returns 0.
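For angle-valued ground truths, the 15-degree check can be sketched as follows. The text only states that an angular difference is computed; treating angles as periodic (so that 170 and -175 degrees are 15 degrees apart) is an assumption of this sketch, as are the regular expression and function names.

```python
import re

# Hypothetical pattern: a signed number followed by a degree unit.
_DEGREES = re.compile(r"(-?\d+(?:\.\d+)?)\s*(?:degrees?|°)", re.IGNORECASE)

def extract_angle(text):
    """Return the first angle in degrees found in the text, or None."""
    match = _DEGREES.search(text)
    return float(match.group(1)) if match else None

def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in [0, 180]."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)

def score_angle(pred_text, gt_text, tol_deg=15.0):
    """Score 1 if the predicted angle is within tol_deg of the ground truth."""
    pred = extract_angle(pred_text)
    gt = extract_angle(gt_text)
    if pred is None or gt is None:
        return 0
    return 1 if angular_difference(pred, gt) <= tol_deg else 0

score = score_angle("The angle is about 170.0 degrees.", "-175.0 degrees")
```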
Other Lane to Ego Evaluation:
This function uses keyword lists to identify lane positions: front lane, front left lane, front right lane, and oncoming traffic lane. It checks the predicted text for these keywords.
Keyword Lists:
front lane list: ['front lane', …]
front left lane list: ['front-left lane', …]
front right lane list: ['front-right lane', …]
oncoming traffic lane list: ['oncoming traffic lane', …]
Checking Logic:
The function verifies whether the predicted text contains keywords from exactly one category. If so, it returns a score of 1 if that category matches the ground truth, otherwise 0. If no match or multiple matches are found, it returns 0 to prevent ambiguity.
Ego Turning Evaluation:
This function uses keyword lists to identify turning maneuvers: right turn, left turn, and go straight. It checks the predicted text for these keywords.
Keyword Lists:
right turn list: ['right turn', …]
left turn list: ['left turn', …]
go straight list: ['go straight', …]
Checking Logic:
The function verifies whether the predicted text contains keywords from exactly one category. If so, it returns a score of 1 if that category matches the ground truth, otherwise 0. If no match or multiple matches are found, it returns 0 to prevent ambiguity.
Ego Traverse Distance Evaluation:
It retrieves distances using a helper function, returning 0 if extraction fails.
Evaluation Logic:
The function checks whether the predicted distance is within 25% of the ground truth. If the ground-truth distance is less than 1.0 meter, it checks whether the predicted distance is within the adjusted range. It returns a score of 1 for a correct prediction and 0 otherwise.
Figure 3: Evaluation metric methodology for each task. The method uses rule-based and regular-expression techniques to assess accuracy.
Prompt for Zero-Shot Evaluation
For zero-shot evaluation of existing models, we use an Option Template that presents multiple-choice options to define possible answer classes. This approach accommodates the varied terminology that pre-trained models may employ to describe situations.
Table 5: Details of the evaluated models.
| Model name | Full repository/API name | Vision part | Language part |
|---|---|---|---|
| Open-source models | | | |
| LLaVA-1.5-7B | llava-hf/llava-1.5-7b-hf | CLIP-L/14 | Vicuna-7b-v1.5 |
| LLaVA-v1.6-Mistral-7B | llava-hf/llava-v1.6-mistral-7b-hf | CLIP-L/14 | Mistral-7B-Instruct-v0.2 |
| LLaVA-NeXT-Video-7B | llava-hf/LLaVA-NeXT-Video-7B-hf | CLIP-L/14 | Vicuna-7B-v1.5 |
| LLaVA-Interleave-Qwen-7B | llava-hf/llava-interleave-qwen-7b-hf | SigLIP-L/14 | Qwen1.5-7B-Chat |
| Bunny-v1.1-4B | BAAI/Bunny-v1_1-4B | SigLIP-L/14 | Phi-3-mini-4k-instruct |
| Bunny-v1.1-Llama-3-8B-V | BAAI/Bunny-v1_1-Llama-3-8B-V | SigLIP-L/14 | Llama-3-8B-Instruct |
| InternVL2-8B | OpenGVLab/InternVL2-8B | InternViT-300M-448px | Qwen2-8B-Instruct |
| Mini-InternVL2-1B-DriveLM | OpenGVLab/Mini-InternVL2-1B-DA-DriveLM | InternViT-300M-448px | Qwen2-0.5B |
| DriveLM-mantis-8b | francepfl/DriveLM-mantis-8b-idefics2_8192 | SigLIP | Mistral-7B-v0.1 |
| Proprietary models | | | |
| Gemini-1.5-flash | Gemini-1.5-flash | Unknown | Unknown |
| GPT-4o-2024-08-06 | GPT-4o-2024-08-06 | Unknown | Unknown |
The details of the Option Template, which varies by task type, are as follows:
| Option Template |
|---|
| Distance-related tasks: Answer in xx.x meters format. Angle-related tasks: Answer in xx.x degrees format. Tasks with predefined answer choices: retrieve the answer choices, assign a letter (e.g., A, B, C) to each option, and present the options as: Options: A. Option1, B. Option2, C. Option3, … |
Pre-trained models often use specific vocabularies based on their training data. For instance, a model might say ‘opposite side of the road’ instead of ‘oncoming traffic lane’ if it lacks specific instruction training. By offering explicit choices, the model can select the appropriate terminology despite variations.
For numerical answers, we specify the expected format within the prompt to ensure clarity and consistency, for example by instructing the model to ‘Answer in xx.x meters format.’
This structured approach allows the model to account for variations in wording and select the most appropriate option, demonstrating its understanding.
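A minimal sketch of how such a prompt could be assembled is shown below. The function name, the `task_kind` values, and the way the suffix is appended to the question are assumptions; only the format strings follow the Option Template described above.

```python
import string

def build_zero_shot_prompt(question, task_kind, options=None):
    """Append the task-specific Option Template to a question.

    task_kind is a hypothetical switch: 'distance', 'angle', or 'choice'.
    """
    if task_kind == "distance":
        suffix = "Answer in xx.x meters format."
    elif task_kind == "angle":
        suffix = "Answer in xx.x degrees format."
    elif task_kind == "choice":
        # Assign a letter (A, B, C, ...) to each predefined answer choice.
        lettered = ", ".join(f"{letter}. {option}"
                             for letter, option in zip(string.ascii_uppercase, options))
        suffix = f"Options: {lettered}"
    else:
        raise ValueError(f"unknown task kind: {task_kind}")
    return f"{question} {suffix}"

prompt = build_zero_shot_prompt(
    "How would you describe the lane position of Entity#1?",
    "choice",
    ["front lane", "front left lane", "front right lane", "oncoming traffic lane"],
)
```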
Experiments and Results
Implementation Details
All models are finetuned on an Ubuntu 20.04 server equipped with four A6000 GPUs, each with 48GB of memory. The source code is built on the Transformers library (Wolf et al. 2019) and utilizes the PyTorch 2.4 framework (Paszke et al. 2019).
Additional information on hyper-parameter settings for finetuning our baseline models on TB-100k and TB-250k is presented in Table 6.
Table 6: Hyper-parameter settings for finetuning our models on TB-100k or TB-250k.
| Hyper-parameter | Value |
|---|---|
| Epochs | 10 |
| Warmup steps | 2,000 |
| Learning rate | 1e-5 |
| LoRA learning rate | 1e-4 |
| Effective batch size | 64 |
| AdamW β | (0.9, 0.999) |
| Weight decay | 0.05 |
| Drop path | 0 |
| Attention dropout | 0 |
| Torch data type | bf16 |
| Inference temperature | 0 |
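The settings in Table 6 can be collected in a small configuration object, sketched below. The class and field names are hypothetical, and the per-device batch size is an assumed value: the paper reports only the effective batch size of 64 and the four-GPU setup, so the gradient accumulation count here is purely illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FinetuneConfig:
    # Values from Table 6; field names are illustrative.
    epochs: int = 10
    warmup_steps: int = 2000
    learning_rate: float = 1e-5
    lora_learning_rate: float = 1e-4
    effective_batch_size: int = 64
    adamw_betas: tuple = (0.9, 0.999)
    weight_decay: float = 0.05
    drop_path: float = 0.0
    attention_dropout: float = 0.0
    dtype: str = "bf16"
    inference_temperature: float = 0.0

def grad_accum_steps(cfg, num_gpus, per_device_batch):
    """Accumulation steps needed to reach the effective batch size."""
    per_step = num_gpus * per_device_batch
    if cfg.effective_batch_size % per_step != 0:
        raise ValueError("effective batch size must be divisible by num_gpus * per_device_batch")
    return cfg.effective_batch_size // per_step

cfg = FinetuneConfig()
# Assumed per-device batch of 4 on the four GPUs mentioned in the text.
steps = grad_accum_steps(cfg, num_gpus=4, per_device_batch=4)
```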
Quantitative Analyses
We provide quantitative analyses and qualitative results for the predictions on TB-Bench of our baseline model (SigLIP-L/14 and Qwen1.5-0.5b) finetuned on TB-100k. For numerical output tasks, we visualize error distributions using box plots; for classification tasks, we use confusion matrices.
Relative Distance and Ego Traverse Distance Tasks. Figure 4 shows box plots of the distance errors of our model's predictions on the two tasks. For RD, the errors are generally centered around zero with a narrow interquartile range, indicating consistent performance, though a few outliers suggest overestimation. Predictions on EGO-TRA show a similar error distribution, with the median slightly above zero and more positive outliers, indicating a tendency to overestimate distance.
Orientation Reasoning Task. Figure 5 shows the box plot of the angular errors of our model's predictions on the Orientation Reasoning (OR) task. The median and interquartile range are close to zero, indicating precise and consistent predictions, and the short whiskers further highlight this accuracy. Outliers are grouped near 0, 90, and 180 degrees, suggesting small angle misestimations. Overall, the model shows minimal errors on this task.
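The box-plot statistics discussed above can be computed with a short sketch. It assumes the error is defined as prediction minus ground truth, so that a positive median indicates overestimation; the sample values and the function name are hypothetical.

```python
import statistics

def error_summary(predictions, ground_truths):
    """Median and interquartile range of signed errors (prediction - truth)."""
    errors = [p - g for p, g in zip(predictions, ground_truths)]
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}

# Hypothetical distances in meters, for illustration only.
preds = [9.0, 10.0, 12.0, 21.0, 32.0]
truth = [10.0, 10.0, 12.0, 20.0, 30.0]
summary = error_summary(preds, truth)
```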

Figure 4: Distance error on Relative Distance (RD) and Ego Traverse Distance (EGO-TRA) tasks.

Figure 5: Angular error on Orientation Reasoning (OR) task.
Spatial Reasoning Task. Figure 6 shows the confusion matrix of our model's predictions on the Spatial Reasoning (SR) task. The ‘front’ position is classified most accurately at 85.1%, while the ‘back’ and ‘back left’ positions have lower accuracies of 63.3% and 66.7%. The matrix also shows moderate confusion between positions, such as ‘back left’ being misclassified as ‘front right’ (23.33%) and ‘back’ as ‘front’ (19.67%).
Other Lane to Ego-Vehicle Task. Figure 7 shows the confusion matrix of our model's predictions on the Other Lane to Ego-Vehicle (EGO-LANE) task. The model achieves high accuracy on most categories (over 96%), except for the ‘front lane,’ which has an accuracy of only 81.7%. The primary misclassification pattern is confusion between the ‘front lane’ and its adjacent lanes, with 9.9% of ‘front lane’ samples misclassified as ‘front right lane.’

Figure 6: Confusion matrix on Spatial Reasoning (SR) task.
Other Lane Changing Task. Figure 8 shows the confusion matrix on the Other Lane Changing (OBJ-LANE) task, where samples are categorized into ‘no change,’ ‘left lane change,’ and ‘right lane change.’ The model performs reasonably on the ‘no change’ category, with an accuracy of 78.87%, but struggles with lane change predictions. For both ‘left lane change’ and ‘right lane change,’ the most common error is a ‘no change’ prediction, with 32.3% and 30.4% of samples misclassified this way, respectively. This indicates the model's difficulty in distinguishing lane changes from no change, underscoring the task's challenges.
Other Turning Task. Figure 9 shows the confusion matrix on the Other Turning (OBJ-TURN) task, where samples are categorized as ‘left turn,’ ‘go straight,’ and ‘right turn.’ The model excels at identifying the ‘go straight’ category, achieving an accuracy of 80.16%. However, it shows misclassification rates of over 30% for both ‘left turn’ and ‘right turn.’ Notably, misclassifications of ‘left turn’ are nearly evenly divided between ‘right turn’ and ‘go straight,’ even though ‘right turn’ is the opposite maneuver. This indicates that the model struggles to interpret turns from the perspective of other vehicles, which depends on road orientation and vehicle positioning.
Ego Turning Task. Figure 10 shows the confusion matrix on the Ego Turning (EGO-TURN) task, where the actions are categorized as ‘left turn,’ ‘go straight,’ and ‘right turn.’ The model performs strongly in identifying turns, with accuracies of 86.8% for ‘left turn’ and 86.67% for ‘right turn.’ Interestingly, the turning maneuvers are recognized better than the ‘go straight’ action, with a notable 20.49% of ‘go straight’ samples misclassified as ‘right turn.’
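The per-class accuracies and misclassification rates quoted in these paragraphs are row-normalized confusion-matrix entries: each count is divided by the number of ground-truth samples in its row. The sketch below illustrates the computation; the counts are hypothetical and the function name is an assumption.

```python
def row_normalize(confusion):
    """Convert counts {truth: {prediction: count}} into row percentages."""
    rates = {}
    for truth, row in confusion.items():
        total = sum(row.values())
        rates[truth] = ({pred: 100.0 * count / total for pred, count in row.items()}
                        if total else {})
    return rates

# Hypothetical counts, for illustration only.
confusion = {
    "go straight": {"go straight": 8, "left turn": 1, "right turn": 1},
    "left turn": {"go straight": 1, "left turn": 3, "right turn": 1},
}
rates = row_normalize(confusion)
per_class_accuracy = {label: rates[label][label] for label in rates}
```

For example, with the 61 ‘back’ samples listed in Table 1, a rate of 19.67% corresponds to 12 samples.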
Qualitative Results
For brevity, we present two samples per task, each with input frame(s), the task question, and the ground truth answer.

Figure 7: Confusion Matrix on Other Lane to Ego-Vehicle (EGO-LANE).
Each sample also includes predictions from our fine-tuned baseline model (SigLIP-L/14 and Qwen1.5-0.5b) and from the best-performing zero-shot model, GPT-4o (GPT-4o-2024-08-06 version).
Figures for each task are as follows:
• Figure 11: Relative Distance (RD)
• Figure 12: Spatial Reasoning (SR)
• Figure 13: Orientation Reasoning (OR)
• Figure 14: Other Lane to Ego-Vehicle (EGO-LANE)
• Figure 15: Other Lane Changing (OBJ-LANE)
• Figure 16: Other Turning (OBJ-TURN)
• Figure 17: Ego Turning (EGO-TURN)
• Figure 18: Ego Traverse Distance (EGO-TRA)
Ablation Study Details
We provide detailed ablation results across eight tasks in Table 7.
Results indicate that stronger visual encoders significantly improve performance. For instance, replacing CLIP-L/14 with SigLIP-L/14 improves accuracy by 15.2 percentage points in Relative Distance (RD), 4.0 in Orientation Reasoning (OR), 5.6 in Other Turning (OBJ-TURN), and 10.4 in Ego Turning (EGO-TURN).
The optimal number of visual tokens per frame is 16. Increasing this to 36 tokens improves Ego Traverse Distance (EGO-TRA) by only 2.8 percentage points, while performance on the other tasks declines compared to the 16-token variant.
Using more sequential frames generally enhances performance, especially on the tasks requiring temporal information (tasks 3-8). Single-frame tasks like Spatial Reasoning also benefit from training on multi-frame tasks, showing notable improvements. For ego-focused tasks, using 8 frames instead of 2 yields gains of over 14 percentage points in EGO-TURN and 12.8 in EGO-TRA, indicating that the number of frames is more critical for ego-focused tasks than for object-focused ones.

Figure 8: Confusion Matrix on Other Lane Changing (OBJ-LANE).
Reproducibility Checklist
We answer the questions outlined in the AAAI reproducibility checklist, available at https://aaai.org/aaai-conference/reproducibility-checklist/, as follows:
Table 7: Ablation results per task. All the models are finetuned on the TB-100k dataset, with their performance evaluated on TB-Bench and reported in accuracy (percentage).
| Model | TB-Bench: RD ↑ |
|---|---|
| Visual encoder | |
| CLIP-L/14 | 61.2 |
| SigLIP-B/16 | 65.2 |
| SigLIP-L/14 | 76.4 |
| Visual tokens per frame | |
| 4 | 68.8 |
| 16 | 76.4 |
| 36 | 75.5 |
| Number of frames | |
| 2 | 72.4 |
| 4 | 74.4 |
| 8 | 76.4 |

Figure 9: Confusion matrix on Other Turning (OBJ-TURN).

Figure 10: Confusion matrix on Ego Turning (EGO-TURN).

Figure 11: Examples and predictions from our baseline method and GPT-4o for the Relative Distance (RD) task.
for running experiments (hardware and software), including GPU/CPU models; amount of memory; operating system; names and versions of relevant software libraries and frameworks. (yes/partial/no)
References
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
Qian, T.; Chen, J.; Zhuo, L.; Jiao, Y.; and Jiang, Y.-G. 2024. NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 4542–4550.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Figure 12: Examples and predictions from our baseline method and GPT-4o for the Spatial Reasoning (SR) task.

Figure 13: Examples and predictions from our baseline method and GPT-4o for the Orientation Reasoning (OR) task.


Figure 14: Examples and predictions from our baseline method and GPT-4o for the Other Lane to Ego-Vehicle (EGO-LANE) task.

Figure 15: Examples and predictions from our baseline method and GPT-4o for the Other Lane Changing (OBJ-LANE) task.

Figure 16: Examples and predictions from our baseline method and GPT-4o for the Other Turning (OBJ-TURN) task.

Figure 17: Examples and predictions from our baseline method and GPT-4o for the Ego Turning (EGO-TURN) task.


