UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction

Abstract. Vehicle trajectory prediction has increasingly relied on data-driven solutions, but their ability to scale to different data domains and the impact of larger dataset sizes on their generalization remain underexplored. While these questions can be studied by employing multiple datasets, doing so is challenging due to several discrepancies, e.g., in data formats, map resolution, and semantic annotation types. To address these challenges, we introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria, presenting new opportunities for the vehicle trajectory prediction field. In particular, using UniTraj, we conduct extensive experiments and find that model performance significantly drops when transferred to other datasets. However, enlarging data size and diversity can substantially improve performance, leading to a new state-of-the-art result for the nuScenes dataset. We provide insights into dataset characteristics to explain these findings. The code can be found here: https://github.com/vita-epfl/UniTraj.

Keywords: Vehicle trajectory prediction · Multi-dataset framework · Domain generalization

1 Introduction

Predicting the trajectories of surrounding vehicles is essential for ensuring the safety and collision avoidance of autonomous driving systems. With the advent of deep learning, researchers have turned to data-driven solutions to tackle this prediction task. However, while these models can achieve high accuracy, they are heavily reliant on the specific data domain used for training.

An autonomous driving system may encounter various situations such as diverse geographical locations. These various situations introduce data domain shifts, which can significantly impact the performance of the prediction models. Consequently, it is essential to study the performance of the models across diverse domains, such as datasets and cities. However, despite the importance of the question, the generalization of models to different domains has not been adequately studied yet. Therefore, our first Research Question (RQ1) is to investigate the performance drop of trajectory prediction models when transferred to new domains.

Fig. 1: UniTraj framework. The framework unifies various datasets, forming the largest vehicle trajectory prediction dataset available. It also includes multiple state-of-the-art prediction models and various evaluation strategies, making it suitable for trajectory prediction experimentation. The framework enables the study of diverse Research Questions, including (RQ1) the generalization of trajectory prediction models across different domains and (RQ2) the impact of data size on prediction performance.

A potential solution to improve the generalization ability of prediction models is to scale up the sizes of the datasets to cover a broader spectrum of driving scenarios. While there is a trend in extending datasets’ sizes [8,10,14,52], the impact of dataset size on the performance of trajectory prediction models remains largely unexplored. Our second research question (RQ2) is then to study the impact of increasing dataset sizes on the performance of the prediction models.

Exploring these two research questions involves leveraging multiple trajectory prediction datasets. Firstly, these datasets provide diverse domains, allowing for a thorough examination of model generalization across different domains (RQ1). Secondly, combining these datasets creates a much larger dataset, enabling an exploration of the asymptotic limits of data scaling (RQ2). However, significant challenges exist when attempting to leverage multiple datasets. (1) Each of these datasets has a unique data format, posing practical difficulties for researchers utilizing multiple datasets. (2) Each of the datasets undergoes collection and annotation through distinct strategies, with semi-automatic pre-annotations and manual curations [7,8,14]. This leads to multiple discrepancies such as variations in resolution, sampling rates, and types of semantic annotations. (3) Comparing model performance across datasets is not straightforward due to varying dataset settings (e.g., prediction horizons) and evaluation metrics (e.g., mAP metric is used in WOMD [14] and brier-FDE metric in Argoverse 2 [52]). In short, while each of the datasets contributes to the progress in the field, they have been developed independently, without considering harmonization with existing ones.

As a result, many trajectory prediction studies train and evaluate their models using a single dataset [2, 3, 5, 9, 13, 20, 22, 30, 34, 37, 48].

To tackle these challenges, we introduce ‘UniTraj’, a comprehensive vehicle trajectory prediction framework. UniTraj seamlessly integrates and unifies multiple data sources (including nuScenes [7], Argoverse 2 [52], and the Waymo Open Motion Dataset, WOMD [14]), models (including AutoBot [17], MTR [48], and Wayformer [35]), and evaluations. UniTraj not only serves as a solution to tackle our research questions but also provides a comprehensive and flexible platform for the community. First, it is designed for the effortless inclusion of new datasets by proposing a unified data structure compatible with various datasets. Second, UniTraj supports and simplifies the integration of new methods by providing numerous essential data processing and loss functions relevant to the trajectory prediction task. Lastly, UniTraj offers unified evaluation metrics, as well as diverse and insightful evaluation approaches, such as analyzing performance on long-tail data instances and on different clusters of data samples, to allow a more in-depth understanding of model behavior. Figure 1 shows an overview of the framework.

We conduct extensive experiments using the UniTraj framework to shed light on our two research questions. Our findings reveal a large performance drop when transitioning between data sources, alongside variations in the generalization abilities induced by different datasets (RQ1). We also show that scaling up dataset size and diversity can enhance model performance significantly without any architectural modifications, leading us to rank $1^{\mathrm{st}}$ on the nuScenes public leaderboard. This is accomplished by training models on all existing datasets in the framework. This unified dataset forms the largest public dataset available for training a vehicle trajectory prediction model, with more than 2M samples, 1337 hours of data, and 15 different cities. Finally, by providing an in-depth analysis of the datasets, we offer a more comprehensive understanding of their characteristics. Our analysis reveals that the datasets’ generalization capabilities are attributable not only to their size but also to their intrinsic diversity. We believe that the framework opens up new opportunities in the trajectory prediction field, and we will release the framework to foster further advancements. In summary, our contributions are as follows:

2 Previous work

Trajectory prediction datasets. Many academic and industrial laboratories have paved the way for research development by open-sourcing real-world driving datasets [6–8, 10, 14, 19, 33, 38, 52, 57]. Notably, Argoverse [10] was among the pioneers in releasing lane graph information, nuScenes [7] expanded the variety of scenes, Waymo [14] enriched their dataset with fine-grained information, and recently, Argoverse 2 [52] released the largest dataset in terms of unique roadways. While these datasets contribute to field developments, they have been developed in isolation without considering harmonization with former datasets. Thus, combining them poses multiple challenges due to various incompatibilities. This work addresses the challenges through a unified framework.

Trajectory prediction benchmarks. Multi-dataset benchmarks have already been explored in various domains such as object detection [59], semantic segmentation [24, 47], and pose prediction [41]. In the field of trajectory prediction, such benchmarks have primarily been developed for human trajectory prediction [1, 39, 42]. Notably, Trajnet++ [26] provides an interaction-centric benchmark by categorizing trajectories based on the presence of an interaction. trajdata [21] is a unified interface to multiple human trajectory datasets incorporating scene context into the inputs. A related work for the task of vehicle trajectory planning is ScenarioNet [28], a simulator aggregating multiple real-world datasets into a unified format and providing a planning development and evaluation framework. To the best of our knowledge, we are the first to propose an open-source framework for vehicle trajectory prediction. Our framework is not limited to including multiple datasets; it also integrates a variety of trajectory prediction models and evaluation methodologies, thereby providing a comprehensive resource for advancing research and development in the vehicle trajectory prediction task.

Generalization of trajectory prediction models. The discrepancies in data formats in vehicle trajectory prediction datasets hinder research on cross-dataset generalization, leading to limited studies in this area. In [55], one dataset is divided into different domains to explore model generalization. The authors of [46] propose an epistemic uncertainty estimation approach and perform cross-dataset evaluation. In [16], the authors studied cross-dataset generalization of models and showed a performance gap between datasets. However, they provide limited insights into the sources of the generalization gap. Moreover, their code is not publicly available. Previous works also investigated some generalization aspects of trajectory prediction models when they deal with new scenes and cities [11, 27, 31], new agent types [27], perception outputs instead of curated annotations [51, 53, 58], and adversarial situations [3, 40, 43]. In this work, we conduct more extensive and in-depth cross-dataset and cross-city analyses as well as multi-dataset training. Moreover, we provide insights into the dataset characteristics that explain the findings. We also release an open-source framework to facilitate this line of research.

Table 1: Summary of the discrepancies in data features. The table shows the features for each dataset and the unified version of the features. Most of the unified features are flexible and can be chosen by the user.

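| Feature group | Feature | Argoverse 2 | WOMD | nuScenes | UniTraj |
| --- | --- | --- | --- | --- | --- |
| Coordinate frame | | scene-centric | scene-centric | scene-centric | agent-centric |
| Time length | Past | 5 s | 1 s | 2 s | [0-8] s |
| | Future | 6 s | 8 s | 6 s | [1-8] s |
| Agent features | Annotations | velocity, heading | velocity, heading | velocity, heading | velocity, heading |
| | Other information | 2D | 3D bounding box size | 2D | acceleration, 2D |
| | Coordinates | 2D | 3D | 2D | 2D |
| Map features | Range | ~200 m | ~200 m | ~500 m | [0-500] m |
| | Resolution | 0.2 m ~ 2 m | ~0.5 m | ~1 m | [0.2-2] m |
| | Coordinates | 2D | 3D | 2D | 2D |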

3 UniTraj framework

The UniTraj framework, illustrated in Figure 1, consists of three main components. The first component unifies the format and features of various datasets (see Section 3.1). The second component adapts trajectory prediction models to the unified data format, facilitating their training (see Section 3.2). The final component consists of a comprehensive and shared evaluation process for the models (see Section 3.3). The integration of the components allows for diverse experimentation, such as cross-dataset training, evaluation, and dataset analysis.

3.1 Unified data

Two types of discrepancies are found across trajectory forecasting datasets: data formats and data features. The former amounts to differences in the way data is structured and organized, while the latter stems from differences in the characteristics of the data itself, such as spatio-temporal resolution, range, and agent and map annotation taxonomy. In this section, we present solutions to tackle both types of discrepancies.

Unified data format: To address the issue of different data formats used in trajectory prediction datasets, such as TFRecord in WOMD [14] and Apache Parquet in Argoverse 2 [52], we utilize ScenarioNet [28]. ScenarioNet was initially designed for traffic scenario simulation and modeling, but we repurpose it for the vehicle trajectory prediction task. It provides a unified scenario description format containing HD maps and detailed object annotations, which simplifies decoding datasets stored in different formats. ScenarioNet currently supports converting WOMD, nuScenes, and nuPlan, and we extend its support to Argoverse 2 for our research. This reduces the need for multiple versions of preprocessing code to extract information from raw datasets and create batched data for the training of prediction models.

Unified data features: Despite the data being converted into a unified format, significant discrepancies persist across the datasets, affecting model performance. For example, scenarios are 11 seconds long in Argoverse 2 but 9 seconds long in WOMD, and the precision of map annotations is 1 meter in nuScenes but 0.5 meters in WOMD. Therefore, we aim to harmonize these discrepancies and minimize their impact on the model’s performance. Table 1 summarizes the discrepancies and the unified features. Our data processing approach involves specific harmonizations, including the following:

– Coordinate Frame. Recent trajectory prediction models predominantly utilize vectorized, agent-centric data as input [17, 35, 48, 49, 60]. Our data processing pipeline is designed to transform scene-level raw data into this format. It processes traffic scenarios, which consist of multiple trajectories, and selects trajectories designated as training samples within the datasets. The pipeline then converts the entire scenario into the coordinate frames of these selected agents, and normalizes the input accordingly.

– Time Length. The trajectories in different datasets share the same sampling frequency of $10\mathrm{Hz}$ but have durations ranging from 8 to 20 seconds. To standardize this aspect, we truncate all trajectories to a uniform length of 8 seconds. Within this duration, UniTraj provides the option to flexibly determine a unified length of past and future trajectories for all datasets.

– Agent Features. Among the datasets, WOMD provides the most detailed agent information, including 3D coordinates, velocity, heading, and bounding box size, whereas nuScenes lacks certain data, like bounding box size. We standardize inputs across datasets by using 2D coordinates, velocity, and heading. Our data processing also introduces supplementary features, such as one-hot encoding of agent type and time steps of trajectories, and acceleration. These elements are combined to create a rich, unified input for the model.

– Map Features. Datasets differ in HD map resolution. We normalize this by using linear interpolation to standardize the distance between consecutive points to 0.5 meters, with an option for further downsampling to adjust map resolution. Additionally, we enrich the data with each lane point’s direction and one-hot encode lane types. Our experiments utilize semantic map classes such as center lanes, road lines, crosswalks, speed bumps, and stop signs.
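
The coordinate-frame and map-resolution harmonizations in the list above can be illustrated with a minimal sketch. It is not the UniTraj implementation: the function names `to_agent_frame` and `resample_polyline` are hypothetical, and it assumes 2D points given as `(x, y)` tuples and a heading in radians.

```python
import math


def to_agent_frame(points, origin, heading):
    """Express 2D points in a frame centred on `origin` with +x along `heading`.

    Hypothetical helper: translate by -origin, then rotate by -heading.
    """
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    transformed = []
    for x, y in points:
        dx, dy = x - origin[0], y - origin[1]
        transformed.append((dx * cos_h + dy * sin_h, -dx * sin_h + dy * cos_h))
    return transformed


def resample_polyline(points, spacing=0.5):
    """Linearly interpolate a polyline at a fixed arc-length spacing (metres).

    The first point is kept; the last point is kept only if it falls on the
    sampling grid (a simplification made for this sketch).
    """
    if len(points) < 2:
        return list(points)
    resampled = [points[0]]
    carried = 0.0  # arc length travelled since the last emitted sample
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        if seg_len == 0.0:
            continue  # skip duplicated points
        d = spacing - carried  # offset of the next sample on this segment
        while d <= seg_len + 1e-9:
            t = d / seg_len
            resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            d += spacing
        carried = seg_len - (d - spacing)
    return resampled
```

For instance, a point one meter ahead of an agent heading along +y maps to roughly `(1, 0)` in the agent frame, and a 2-meter straight lane segment resampled at 0.5 meters yields five points.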

Our framework allows specific features to be customized through predefined parameters for focused single-dataset research, while still providing a standardized data format across all datasets. The data processing module supports various parameters, such as the length of historical and future trajectories, number of points per lane, map resolution, types of surrounding agents and lines, and masked attributes. Thanks to its modular structure, our data processing pipeline enables easy integration of new processing methodologies and models. The framework is equipped with multi-processing and caching mechanisms for efficient processing. Our framework currently includes four large-scale, real-world datasets with over 1337 hours of driving data from 15 cities.
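
As an illustration of how such parameters might drive the time-length harmonization, the sketch below splits a 10 Hz track into past and future windows with validity masks. The names `WindowConfig` and `split_track` are hypothetical and do not come from the UniTraj code base; treating the current step as the last past step and padding missing steps with `None` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowConfig:
    """Hypothetical configuration for the temporal window of a sample."""
    past_seconds: float = 2.0
    future_seconds: float = 6.0
    frequency_hz: int = 10


def split_track(track, current_index, cfg):
    """Split `track` into past/future windows around `current_index`.

    Steps outside the recorded track are padded with None and flagged as
    invalid in the returned masks, so every sample has the same length.
    """
    past_steps = round(cfg.past_seconds * cfg.frequency_hz)
    future_steps = round(cfg.future_seconds * cfg.frequency_hz)

    def take(start, count):
        states, mask = [], []
        for i in range(start, start + count):
            valid = 0 <= i < len(track)
            states.append(track[i] if valid else None)
            mask.append(valid)
        return states, mask

    # The current step is treated as the last past step (an assumption).
    past, past_mask = take(current_index - past_steps + 1, past_steps)
    future, future_mask = take(current_index + 1, future_steps)
    return past, past_mask, future, future_mask
```

With a 9-second track whose current step is at index 10 (about 1 second of recorded history), requesting 2 seconds of past yields a 20-step window of which only 11 steps are valid, which is the kind of case a mask is meant to handle.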

3.2 Unified models

Trajectory prediction models are often implemented in different pipelines, making direct comparisons challenging and fairness hard to ensure. We integrate three recent trajectory prediction models within the UniTraj framework. These models were chosen based on their state-of-the-art results on various benchmarks, indicating the research value of their designs. We include:

The models’ capacities cover a large range (1.5M parameters for AutoBot, 60.1M for MTR, and 16.5M for Wayformer), enabling research on model size and scaling.

Integrating new models: The flexibility of UniTraj’s data processing pipeline greatly simplifies the integration of new models. Furthermore, we provide a standardized output format, enabling seamless use of UniTraj’s evaluation and logging tools.

3.3 Unified evaluation

In trajectory prediction, various metrics have been proposed to evaluate the models, yet there is no consensus about them. As a result, each dataset provides a different set of evaluation metrics, making it challenging to compare performances across datasets. For example, WOMD employs the mean average precision (mAP) metric [14] while Argoverse 2 uses the brier minimum Final Displacement Error (brier-minFDE) [52]. Our framework provides a unified set of metrics to allow comprehensive and consistent evaluation across different datasets. To this end, we employ two sets of metrics: general and fine-grained evaluation metrics.

General evaluations: The most common metrics in the literature provide an overall accuracy score by comparing the output with the ground truth in different aspects. We include the following three general metrics in the framework:

1) Minimum Average / Final Displacement Error (minADE/minFDE): It represents the minimum average/final displacement error between the predictions and the ground truth. The minimum is computed over the 6 modes of the output.

2) Miss Rate (MR): It is defined as the ratio of samples whose minFDE exceeds 2 meters, and is useful where a deviation of up to 2 meters is acceptable.

3) Brier Minimum Final Displacement Error (brier-minFDE): While the previous metrics focus on covering the ground truth, they do not account for the probability assigned to each predicted trajectory. The brier-minFDE metric addresses this by adding a penalty term, $(1-p)^{2}$, to the minFDE, where $p$ corresponds to the probability of the trajectory that best matches the ground truth.
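
The three general metrics can be computed for a single sample as in the following minimal sketch. The function names and the input layout (a list of predicted modes, each a list of `(x, y)` points, plus one probability per mode) are assumptions for illustration; here "the trajectory that best matches the ground truth" is taken to be the mode with the lowest final displacement error.

```python
import math


def general_metrics(pred_modes, probs, gt, miss_threshold=2.0):
    """Compute minADE, minFDE, the miss flag, and brier-minFDE for one sample.

    pred_modes: K predicted trajectories, each a list of (x, y) points.
    probs:      K mode probabilities.
    gt:         ground-truth future, a list of (x, y) points.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # Average and final displacement error of every mode.
    ades = [sum(dist(p, g) for p, g in zip(mode, gt)) / len(gt) for mode in pred_modes]
    fdes = [dist(mode[-1], gt[-1]) for mode in pred_modes]

    # Mode closest to the ground truth at the final step.
    best = min(range(len(fdes)), key=fdes.__getitem__)
    min_fde = fdes[best]
    return {
        "minADE": min(ades),
        "minFDE": min_fde,
        "miss": min_fde > miss_threshold,
        "brier-minFDE": min_fde + (1.0 - probs[best]) ** 2,
    }


def miss_rate(per_sample_metrics):
    """Miss Rate: fraction of samples whose minFDE exceeds the threshold."""
    return sum(m["miss"] for m in per_sample_metrics) / len(per_sample_metrics)
```

Note that a prediction whose best mode matches the ground truth exactly but carries probability 0.5 still receives a brier-minFDE of 0.25, which is how the metric rewards confident, correct modes.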

Fine-grained evaluations: We also provide two fine-grained evaluations.

(1) Trajectory types. Datasets usually exhibit a significant prevalence of ‘straight’ trajectories, resulting in heavily imbalanced datasets. Besides, we argue that rare trajectory types can sometimes be the more safety-critical ones. Therefore, it is critical to specifically assess prediction performance on rare situations and trajectory types. To address this, the UniTraj framework enables the stratification of evaluation metrics based on trajectory types. In practice, we adopt the trajectory taxonomy defined in the WOMD challenge [14] to categorize trajectories into the following groups: ‘stationary’, ‘straight’, ‘straight left’, ‘straight right’, ‘left-turn’, ‘right-turn’, ‘left u-turn’, ‘right u-turn’. While this taxonomy provides valuable insights, its scope is limited because it does not account for variations in motion dynamics. For instance, it does not differentiate between straight accelerating and decelerating trajectories, both of which are categorized as ‘straight’. Consequently, we additionally use the notion of ‘Kalman difficulty’ introduced below.

(2) Kalman difficulty. Some situations are more challenging to forecast than others, typically when the future is not a simple extrapolation of the past and when contextual factors play a significant role. The context encompasses various elements such as map data, social interactions, or input signals coming from perception. Moreover, previous works [4, 32, 50] observe that these complex scenarios, while critical, are much less frequent than scenarios that are easier to forecast. To specifically evaluate performance on critical cases, and to reduce evaluation noise coming from the large number of simple scenarios, the authors of [32] propose to identify critical cases as those with a high mismatch between their ground truth and the predictions of a Kalman filter [23]. We follow this idea as it offers a simple method to evaluate how challenging a situation is. Accordingly, UniTraj stratifies evaluation metrics based on Kalman difficulty, which we define as the FDE between the ground-truth trajectory and the prediction of a linear Kalman filter.
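
A rough sketch of this definition follows. It assumes a constant-velocity motion model filtered independently along x and y, with illustrative noise values `q` and `r`; none of these choices are taken from the UniTraj code or from [32], and the function names are hypothetical.

```python
import math


def _filter_axis(zs, dt, q, r):
    """Run a 1D constant-velocity Kalman filter over observed positions.

    Returns the final (position, velocity) estimate. The process noise is
    simplified to diag(q, q) and the measurement noise variance is r.
    """
    p, v = zs[0], 0.0
    # Covariance entries; the initial velocity is unknown, so use a loose prior.
    p00, p01, p10, p11 = r, 0.0, 0.0, 100.0
    for z in zs[1:]:
        # Predict with F = [[1, dt], [0, 1]].
        p = p + v * dt
        p00, p01, p10, p11 = (
            p00 + dt * (p10 + p01) + dt * dt * p11 + q,
            p01 + dt * p11,
            p10 + dt * p11,
            p11 + q,
        )
        # Update with a position measurement, H = [1, 0].
        s = p00 + r
        k0, k1 = p00 / s, p10 / s
        residual = z - p
        p, v = p + k0 * residual, v + k1 * residual
        p00, p01, p10, p11 = (
            (1 - k0) * p00,
            (1 - k0) * p01,
            p10 - k1 * p00,
            p11 - k1 * p01,
        )
    return p, v


def kalman_difficulty(past, future, dt=0.1, q=0.01, r=0.01):
    """FDE between the ground-truth endpoint and a linear Kalman extrapolation."""
    horizon = len(future) * dt
    px, vx = _filter_axis([pt[0] for pt in past], dt, q, r)
    py, vy = _filter_axis([pt[1] for pt in past], dt, q, r)
    pred_end = (px + vx * horizon, py + vy * horizon)
    return math.hypot(pred_end[0] - future[-1][0], pred_end[1] - future[-1][1])
```

Under these assumptions, a vehicle that keeps driving straight at constant speed receives a difficulty close to zero, whereas one that turns after the observed history receives a large value, so samples can be bucketed by this score before computing the metrics.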

4 Experiments

The UniTraj framework opens up new opportunities for research and experimentation. This section presents experiments highlighting these opportunities, focusing on cross-domain (i.e., cross-dataset and cross-city) generalization (RQ1) in Section 4.1, and data scaling impact (RQ2) for trajectory prediction models in Section 4.2. We provide fine-grained dataset analyses and discussions in Section 4.3. Additional experiments in the appendix, such as continual learning and synthetic-to-real transfer, further demonstrate the framework’s research utility.

Experimental settings: We replicate the model configurations and hyperparameters from their original implementations. Throughout the experiments, we limit the training and validation samples to vehicle trajectories. The map range extends to a 100 m radius with a spatial resolution of 0.5 m. The temporal parameters are set to 2 seconds of historical trajectories and 6 seconds of future trajectories. For our multi-dataset training experiments, we utilize the WOMD [14], Argoverse 2 [52], and nuScenes [7] datasets. Since the nuPlan [8] dataset is oriented towards planning tasks and lacks an official training/validation set for prediction tasks, we use it exclusively for the cross-city generalization studies, where its large number of samples for different cities is useful. We only report the results with the brier-minFDE metric and leave other metrics for the appendix.

4.1 Generalization evaluation

Generalization to new domains is a crucial challenge for data-driven models, necessitating diverse data for comprehensive evaluation. The UniTraj framework enables the exploration of model generalization across various datasets and cities.

Cross-dataset evaluation: To assess the generalization capabilities of models, we train models on each individual dataset and evaluate their performance on all other available datasets. The findings are presented in Table 2. Analyzing the data in different columns of the table, the first observation is that all models’ performances decline significantly when models are tested on other datasets. This is a consistent trend across all of the three model architectures, and all of the considered datasets. For instance, the second column under MTR reports the performance evaluated on the validation set of Argoverse 2. It indicates that MTR achieves its peak performance when it is trained on the training set of Argoverse 2 itself, while models trained on nuScenes and WOMD exhibit significantly lower performances.

With a more detailed investigation, we can also compare the generalization capabil

Original paper: https://arxiv.org/pdf/2403.15098v3