[Paper Translation] Building Safer Autonomous Agents by Leveraging Risky Driving Behavior Knowledge


Original paper: https://arxiv.org/pdf/2103.10245v2.pdf



Building Safer Autonomous Agents by Leveraging Risky Driving Behavior Knowledge


Abstract

Simulation environments are well suited for learning different driving tasks such as lane changing, parking, or handling intersections in an abstract manner. However, these simulation environments often restrict themselves to conservative interaction behavior among the different vehicles. In reality, driving tasks often involve very high-risk scenarios in which other drivers do not behave as expected, for reasons such as fatigue or inexperience. Simulation environments do not take this information into account while training the navigation agent. Therefore, in this study we focus on systematically creating such risk-prone scenarios, with heavy traffic and unexpected random behavior, in order to train better model-free learning agents. We generate multiple autonomous driving scenarios by creating new custom Markov Decision Process (MDP) environment iterations in the highway-env simulation package. The behavior policy is learned by agents trained with deep reinforcement learning models, and it is designed to handle collisions and risky randomized driver behavior. We train model-free learning agents with supplemental information from risk-prone driving scenarios and compare their performance with baseline agents. Finally, we causally measure the impact of adding these perturbations to the training process, to precisely account for the performance improvement attained from utilizing the learnings from these scenarios.

Keywords: Autonomous Agents, Driving Simulations, Trajectory Prediction, Causality




Introduction

The arrival of autonomous driving agents has had a great impact on the automobile industry, and it will be responsible for shaping the future of this industry as well. The current industry trend suggests that individual autonomous agents will dominate before connected autonomous vehicles do. That vision, however, is still quite far off for safety infrastructure, security, and public policy reasons. Therefore, our immediate focus should be on making our driving agents safe and effective. Creating safer agents has been explored extensively over the past few years, as it is crucial to know the expected agent behavior in different environments, especially safety-critical ones. In natural driving environments, risk-prone scenarios do not happen frequently, which makes learning from them harder; it is also unethical to create such risky driving scenarios in the real world for experimentation purposes. Therefore, generating and studying these scenarios systematically is a daunting task. Perception systems with adversarial approaches based on noisy labelling do demonstrate a promising path, but the underlying fundamental problem remains centered on safe interactions with other vehicles that are themselves operating independently. Simulation environments with appropriate system dynamics design assumptions overcome this safety issue. In addition, these environments allow us to systematically study risk-prone scenarios with near-realistic driving behavior. For this study we have used the highway-env simulation package, which allows us to simulate different driving tasks. It also provides simple interfacing capabilities for modifying these environments and quickly creating baseline prototypes. We formulate the dynamics of the simulation system as a Markov Decision Process (MDP) in our experimentation, and we model our value function approximation agents with deep reinforcement learning for these defined MDP systems.

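To make this setup concrete, the snippet below is a minimal sketch of instantiating and configuring one of the highway-env tasks through the OpenAI Gym interface. The specific configuration keys and values are illustrative assumptions based on the package's documented options, and the step API shown is the classic Gym 4-tuple; they are not the exact settings used in this study.

```python
# Minimal sketch: create and configure a highway-env task via the Gym API.
# Config keys/values are illustrative assumptions and may differ across versions.
import gym
import highway_env  # registers highway-v0, intersection-v0, roundabout-v0, ...

env = gym.make("highway-v0")
env.configure({
    "lanes_count": 4,       # number of highway lanes
    "vehicles_count": 50,   # number of other IDM/MOBIL-controlled vehicles
    "duration": 40,         # episode length in policy steps
})
obs = env.reset()

done = False
while not done:
    action = env.action_space.sample()  # placeholder for the learned policy
    obs, reward, done, info = env.step(action)
```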


For our experiment design we create two distinct variants of each model architecture for the simulation dynamics stated above. One of these model variants is trained with our increased dangerous driving behavior interventions for all the driving-related tasks present in the highway-env package. The second model variant is the control variable in our experiment, used to define the reward baseline for agents trained on the regular simulations. In our study we methodically dope these environments with randomized dangerous driving scenarios to create more risk-prone driving environments. This is done in two ways: first, by increasing the traffic at strategically important locations, challenging the agent to make dangerous overtakes; second, by increasing the randomization factor, so that the clogged lanes make the environment more collision-prone for our agent. A configuration sketch of such a doping intervention is shown below.

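The following sketch contrasts a baseline (control) environment configuration with a risk-prone "doped" one using denser traffic. The keys and values are illustrative assumptions based on common highway-env options, not the exact settings used in our experiments.

```python
# Illustrative only: control configuration vs. a risk-prone "doped" configuration.
import gym
import highway_env  # noqa: F401  (registers the environments)

baseline_config = {
    "vehicles_count": 20,
    "vehicles_density": 1.0,
}

risky_config = {
    "vehicles_count": 50,     # clogged lanes at strategically important spots
    "vehicles_density": 2.0,  # heavier, more collision-prone traffic
}

def make_env(env_id, config):
    env = gym.make(env_id)
    env.configure(config)
    env.reset()
    return env

control_env = make_env("highway-v0", baseline_config)
treated_env = make_env("highway-v0", risky_config)
```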

This is done to create more robust agents in a given environment for that particular dynamic scenario, agents that are essentially better at post-impact trajectory prediction for the other interacting vehicles. Figure 1 explains our causal analysis experimentation setup in sequential form. We also attempt to understand our experimentation setup from a causal standpoint. We hold complete control over the data generation process in our experimentation, which gives us the unique vantage point of conducting an experimental study that is equivalent to a randomized controlled trial (RCT). We train our agents under the reasonable assumption that there are no unobservable confounding variables, since we have strictly defined the state dynamics and the models governing vehicle behavior. The customization capabilities of the highway-env package allow us to keep every condition the same during training except our collision scenario perturbations. This means the treatment and control groups are the same in all aspects except our treatment, i.e. there is comparability and covariate balance in our experimental setup. With this relation established, we can infer that the association found in our experimental setup is causation. As shown in Figure 2, our treatment is subjecting the agent learning process to risk-prone interacting vehicle dynamics in a given environment. Our test population then involves evaluating the two model variants against regular and perturbed environments with excessive risk-prone scenario doping. Finally, using expectation equations derived from the causal graph above, we estimate the causal effect of our novel learning changes on safety.


Our contributions in this paper include providing benchmark environment simulations for collision robustness predictions. Along with that, we provide an experimentation methodology that creates more robust agents with better on-road safety, and we causally measure the impact of our risky driving behavior doping interventions across different driving environments. In the remainder of the paper we first discuss related work on utilizing risk-prone behavior to create safer autonomous vehicles. Second, we formally define our problem and elaborate on its causal aspects. Third, we explain the experiment setup for creating these robust agents and elaborate on our results. Finally, we conclude our autonomous driving agent study.



Previous Work

Deep reinforcement learning has been used extensively for traffic control tasks. The simulated environment provided by CARLA gave a framework for systems that estimate several affordances from sensors in a simulated environment. Navigation tasks like merging into traffic, which require good exploration capabilities, have also shown promising results in simulators. ChauffeurNet elaborates on the idea of imitation learning for training robust autonomous agents by leveraging worst-case scenarios in the form of realistic perturbations. A clustering-based collision case generation study systematically defines and generates the different types of collisions to effectively identify valuable cases for agent training. The highway-env package specifically focuses on designing safe operational policies for large-scale non-linear stochastic autonomous driving systems. This environment has been extensively studied and used for modelling different variants of the MDP, for example finite MDPs, constrained MDPs, and budgeted MDPs (BMDPs). The BMDP is a variant of the MDP which ensures that a risk notion, implemented as a cost signal, stays below a certain adjustable threshold. The problem formalization for vehicle kinematics, temporal abstraction, partial observability, and the reward hypothesis has been studied extensively as well. Robust optimization planning has been studied in the past for finite MDP systems with uncertain parameters and has shown promising results under conservative driving behavior. For BMDPs, efficacy and safety analysis has been extended from the existing known dynamics and finite state space to continuous kinematic states and unknown human behavior. Model-free learning networks that approximate the value function for these MDPs, such as Deep Q-Learning (DQN) and Dueling Deep Q-Learning (Dueling DQN) networks, have demonstrated promising results in continuous agent learning.



Worst-case scenario knowledge in traffic analysis has been leveraged for model-based algorithms by building a region of high confidence that contains the true dynamics with high probability. Tree-based planning algorithms were used to achieve robust stabilisation and minimax control with generic costs. These studies also leveraged non-asymptotic linear regression and interval prediction for safer trajectory predictions. A behavior-guided action study that uses proximity graphs and safe trajectory computations to handle aggressive and conservative drivers has shown promising results as well. That study used the CMetric measure to generate varying levels of aggressiveness in traffic, whereas in our case we use more randomization and traffic clogging at key areas for risk-prone scenarios in order to obtain more granular observation results. Causal modelling techniques have contributed a lot in terms of providing interpretative explanations in many domains. Fisher's randomized controlled trials (RCTs) have served as the gold standard for causal discovery from observational data, and Sewall Wright's path diagrams were the first attempt at generating causal answers with mathematics. Today, causal diagrams and the different adjustments performed on them offer direct causal relation information about any experimental variables under study. In our experimental study we use these existing mathematical tools to draw direct causal conclusions from our environment interventions.


Problem Formulation

Our goal is to design and build agents on collision-prone MDPs for navigation tasks across different traffic scenarios. The MDP comprises a behavior policy $ \pi(a \mid s) $ that outputs an action $ a $ for a given state $ s $. With this learnt policy, our goal is to predict a discrete, safe, and efficient action from a finite action set for the next time step in the given driving scenario. The simulation platform that we used is compatible with the OpenAI gym package. The highway-env package provides traffic flow governed by the Intelligent Driver Model (IDM) for linear acceleration and the MOBIL model for lane changing. The MOBIL model primarily consists of a safety criterion and an incentive criterion: the safety criterion checks whether, after the lane change, the vehicle has enough acceleration space, and the incentive criterion determines the total advantage of the lane change in terms of total acceleration gain. The given MDP is defined as a tuple $ (S, A, T, R) $ where the action $ a \in A $, the state $ s \in S $, the reward function $ R : S \times A \rightarrow [0,1] $, and the state transition probabilities are $ T(s' \mid s, a) $. With deep RL algorithms we search for the behavior policy $ \pi(a \mid s) $ that helps us navigate across the traffic environments to gather the maximum discounted reward. The state-action value function $ Q^{\pi}(s, a) $ assists in estimating future rewards of the given behavior policy $ \pi $. Therefore, the optimal state-action value function $ Q^{*}(s, a) $ provides maximum value estimates for all $ (s, a) \in S \times A $ and is evaluated by solving the Bellman equation, stated below for reference. From this, the optimal policy is expressed as $ \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a) $. We used a DQN with a dueling network architecture to approximate the state-action value function, which predicts the best possible action as learned from the policy $ \pi $.


$$ Q^{*}(s, a) = \mathop{\mathbb{E}}[ R(s,a) + \gamma\sum\limits_{s'} P (s' | s , a) \max\limits_{a'} Q^{*}(s', a') ] $$
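
The following is a minimal PyTorch sketch of a dueling Q-network of the kind referred to above, splitting the estimate into a state-value stream and an advantage stream. The layer sizes and the flattened observation dimensionality are illustrative assumptions, not the exact architecture used in this study.

```python
# Minimal dueling Q-network sketch (assumed sizes, not the paper's exact model).
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream A(s, a)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.feature(obs)
        v = self.value(h)
        a = self.advantage(h)
        # subtracting the mean advantage keeps V and A identifiable
        return v + a - a.mean(dim=1, keepdim=True)

# Greedy action selection from the learned value estimates, e.g.:
# q_net = DuelingQNet(obs_dim=25, n_actions=5)
# action = q_net(obs_batch).argmax(dim=1)
```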

From our experimentation setup we intend to derive the direct causal impact of our interventions in the traffic environment scenarios. As we can see from Figure 2, our treatment (T) is subjecting the learning agent to more risky driving behavior. Our test sample set involves random agent reward calculation against the perturbed and control environments, which makes our experiment equivalent to an RCT. This means that there is no unobserved confounding present in our experimentation, i.e. the backdoor criterion is satisfied. Also, in RCTs the distribution of all covariates is the same except for the treatment. Covariate balance in observational data also implies that association is equal to causation when calculating the potential outcomes; refer to the equations stated below.


$$ P(X \mid T=1) \stackrel{d}{=} P(X \mid T=0) $$
$$ P(X \mid T=1) \stackrel{d}{=} P(X), \quad T \perp\!\!\!\perp X $$
$$ P(X \mid T=0) \stackrel{d}{=} P(X), \quad T \perp\!\!\!\perp X $$

Essentially, this means we can use the associational difference quantity to infer the effect of treatment on outcomes, so we can use the Average Treatment Effect (ATE) approach to calculate the causal effect by simply subtracting the averaged treatment and control potential outcomes. In the equations below, $ Y(1) $ and $ Y(0) $ denote the potential outcomes under treatment and control respectively, and these equations hold true in the case of RCTs, where the causal difference can be calculated from the associational difference.

$$ \begin{aligned} E[Y(1)-Y(0)] &= E[Y(1)]-E[Y(0)] \\ &= E[Y\mid T=1]-E[Y\mid T=0] \end{aligned} $$
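
A small sketch of this ATE computation is given below: since the setup is equivalent to an RCT, the associational difference of mean returns between the treated and control evaluation runs estimates the causal effect. The variable names are illustrative placeholders.

```python
# ATE sketch: difference of mean evaluation returns (placeholder inputs).
import numpy as np

def average_treatment_effect(returns_treated, returns_control):
    """ATE estimate: E[Y | T=1] - E[Y | T=0] over evaluation episodes."""
    return np.mean(returns_treated) - np.mean(returns_control)

# returns_treated = episode returns of the agent trained with risky-scenario doping
# returns_control = episode returns of the baseline agent
# ate = average_treatment_effect(returns_treated, returns_control)
```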


Model-free learning approaches generally do not have explicit information about the dynamics of the system. Therefore, during training these agents will generalize their policies only to the particular given scenarios. We introduce risky randomized behavior into the environment vehicles, making them collision-prone. Figure 3 shows different probable real-life collisions simulated in different highway-env package environments. This increases generalization to less common but highly critical scenarios, which can save the user from heavy collisions. In our study we critically analyze the performance of our treated agents, in comparison to the control agents, on important tasks such as performing roundabouts, handling intersections, u-turns, two-way traffic overtakes, and lane changing environments, following the evaluation protocol sketched below.
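
A sketch of this evaluation protocol is shown below: each trained agent is rolled out for several episodes in every task environment and the mean undiscounted return is recorded. The environment ids and the `agent.act(obs)` interface are assumptions for illustration, not the exact harness used in our experiments.

```python
# Evaluation sketch: mean return of a trained agent over several episodes.
import gym
import highway_env  # noqa: F401  (registers the task environments)

def evaluate(agent, env_id, config, n_episodes=20):
    """Mean undiscounted return of `agent` over `n_episodes` in one task."""
    env = gym.make(env_id)
    env.configure(config)
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, info = env.step(agent.act(obs))
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)

# Tasks evaluated in the study (environment ids assumed from highway-env):
# for env_id in ["roundabout-v0", "intersection-v0", "u-turn-v0",
#                "two-way-v0", "highway-v0"]:
#     score = evaluate(trained_agent, env_id, risky_config)
```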
