[Paper Translation] Building Safer Autonomous Agents by Leveraging Risky Driving Behavior Knowledge


Download PDF: https://arxiv.org/pdf/2103.10245v2.pdf


Building Safer Autonomous Agents by Leveraging Risky Driving Behavior Knowledge

Abstract

Simulation environments are well suited to learning driving tasks such as lane changing, parking, or handling intersections in an abstract manner. However, these environments often restrict themselves to conservative interaction behavior among vehicles. Real driving, in contrast, often involves very high-risk scenarios in which other drivers do not behave as expected, for reasons such as tiredness or inexperience. Simulation environments do not take this information into account when training the navigation agent. In this study we therefore focus on systematically creating such risk-prone scenarios, with heavy traffic and unexpected random behavior, to create better model-free learning agents. We generate multiple autonomous driving scenarios by creating new custom Markov Decision Process (MDP) environment iterations in the highway-env simulation package. The behavior policy is learnt by agents trained with deep reinforcement learning models and is designed to handle collisions and risky, randomized driver behavior. We train model-free learning agents with supplementary information from risk-prone driving scenarios and compare their performance with baseline agents. Finally, we causally measure the impact of adding these perturbations to the training process, to precisely account for the performance improvement attained by utilizing the learnings from these scenarios.

Keywords: Autonomous Agents, Driving Simulations, Trajectory Prediction, Causality

Introduction

The arrival of autonomous driving agents has had a great impact on the automobile industry, and it will be responsible for shaping the future of this industry as well. Current industry trends suggest that individual autonomous agents will dominate before connected autonomous vehicles do. That vision is still quite far off, however, for reasons of safety infrastructure, security, and public policy. Our immediate focus should therefore be on making driving agents safe and efficacious. Creating safer agents has been well explored over the past few years, because it is crucial to know the expected agent behavior in different environments, especially safety-critical ones. In natural driving environments, risk-prone scenarios do not happen frequently, which makes learning from them harder. It is also unethical to create real risky driving scenarios for experimentation purposes. Generating and studying these scenarios systematically is therefore a daunting task. Perception systems with adversarial approaches based on noisy labelling do demonstrate a promising path, but the underlying fundamental problem remains safe interaction with other vehicles that are themselves operating independently. Simulation environments with appropriate system-dynamics design assumptions overcome this safety issue. In addition, these environments allow us to systematically study risk-prone scenarios with near-realistic driving behaviors. For this study we used the highway-env simulation package, which allows us to simulate different driving tasks. It also provides simple interfaces for modifying these environments and quickly creating baseline prototypes on them. We formulate the dynamics of the simulation system as a Markov Decision Process (MDP) in our experimentation, and we model our value-function-approximation agents with deep reinforcement learning for these MDP systems.

image

For our experiment design we create two distinct variants of each model architecture for the simulation dynamics stated above. The first model variant is trained with our increased dangerous-driving-behavior interventions for all the driving tasks present in the highway-env package. The second model variant is the control in our experiment, and is used to define the reward baseline for agents trained on regular simulations. In our study we methodically dope these environments with randomized dangerous driving scenarios to create more risk-prone driving environments. This is done in two specific ways. First, we increase the traffic at strategically important locations, challenging the agent to make dangerous overtakes. Second, we increase the randomization factor, so that, together with the clogged lanes, the environment becomes more collision-prone for our agent.

This is done to create agents that are more robust in a given environment for that particular dynamic scenario, and that are essentially better at post-impact trajectory prediction of other interacting vehicles. Figure 1 explains our causal analysis experimentation setup in sequential form. We also attempt to understand our experimentation setup from a causal standpoint. We hold complete control of the data generation process in our experimentation, which gives us the unique vantage point of conducting an experimental study that is equivalent to a randomized control trial (RCT). We train our agents under the reasonable assumption that there are no unobservable confounding variables, since we have strictly defined the state dynamics and the models governing vehicle behavior. The customization capabilities of the highway-env package allow us to keep every condition the same during training except for our collision scenario perturbations. This means that the treatment and control groups are the same in all aspects except the treatment, i.e. there is comparability and covariate balance in our experimental setup. With this relation established, we can infer that association found in our experiment setup is causation. As shown in Figure 2, our treatment subjects the agent learning process to risk-prone interacting vehicle dynamics in a given environment. Our test sample population then involves evaluating the two model variants against regular environments and perturbed environments with heavy risk-prone scenario doping. Finally, using expectation equations derived from the causal graph above, we estimate the causal effect of our novel learning changes for enhanced safety.

image

The contributions of this paper include benchmarking environment simulations for collision robustness predictions. Along with that, we provide an experimentation methodology that creates more robust agents offering better on-road safety. We also causally measure the impact of our risky-driving-behavior doping interventions for different driving environments. In the remainder of the paper we first discuss related work on utilizing risk-prone behavior to create safer autonomous vehicles. Second, we formally define our problem and elaborate on its causal aspects. Third, we explain the experiment setup for creating these robust agents and elaborate on our results. Finally, we conclude our autonomous driving agent study.

Previous Work

Deep reinforcement learning has been used extensively for traffic control tasks. The simulated environment provided by CARLA gave a framework for systems that estimate several affordances from sensors in a simulated environment. Navigation tasks such as merging into traffic, which require good exploration capabilities, have also shown promising results in simulators. ChauffeurNet elaborates on the idea of imitation learning for training robust autonomous agents that leverage worst-case scenarios in the form of realistic perturbations. A clustering-based collision case generation study systematically defines and generates different types of collisions to effectively identify valuable cases for agent training. The highway-env package specifically focuses on designing safe operational policies for large-scale non-linear stochastic autonomous driving systems. This environment has been extensively studied and used for modelling different variants of MDP, for example finite MDP, constrained MDP, and budgeted MDP (BMDP). BMDP is a variant of MDP that ensures that a risk notion, implemented as a cost signal, stays below a certain adjustable threshold. The problem formalization for vehicle kinematics, temporal abstraction, partial observability, and the reward hypothesis has been studied extensively as well. Robust optimization planning has been studied in the past for finite MDP systems with uncertain parameters, and has shown promising results under conservative driving behavior. For BMDP, efficacy and safety analysis has been extended from the existing known dynamics and finite state space to continuous kinematic states and unknown human behavior. Model-free learning networks that approximate the value function for these MDPs, such as Deep Q-Learning (DQN) and Dueling Deep Q-Learning (Dueling DQN) networks, have demonstrated promising results in continuous agent learning.

Worst-case scenario knowledge in traffic analysis has been leveraged for model-based algorithms by building a high-confidence region that contains the true dynamics with high probability. Tree-based planning algorithms were used to achieve robust stabilisation and minimax control with generic costs. These studies also leveraged non-asymptotic linear regression and interval prediction for safer trajectory predictions. A behavior-guided action study, which uses proximity graphs and safe trajectory computation for working with aggressive and conservative drivers, has shown promising results as well. That study used the CMetric measure for generating varying levels of aggressiveness in traffic, whereas in our case we use more randomization and traffic clogging at key areas for risk-prone scenarios, to obtain more granular observation results. Causal modelling techniques have contributed a lot by providing interpretable explanations in many domains. Fisher's randomized control trials (RCTs) have served as the gold standard for causal discovery from observational data, and Sewall Wright's path diagrams were the first attempt at generating causal answers with mathematics. Today, causal diagrams and different adjustments on these diagrams offer direct causal relation information about any experimental variables under study. In our experimental study we use these existing mathematical tools to draw direct causal conclusions about what our agents learn from environment interventions.

Problem Formulation

Our goal is to design and build agents on collision-prone MDPs for navigation tasks across different traffic scenarios. The MDP comprises a behavior policy $ \pi(a \mid s) $ that outputs an action $ a $ for a given state $ s $. With this learnt policy, our goal is to predict a discrete, safe, and efficient action from a finite action set for the next time step in a given driving scenario. The simulation platform that we used is compatible with the OpenAI Gym package. The highway-env package provides traffic flow governed by the Intelligent Driver Model (IDM) for linear acceleration and the MOBIL model for lane changing. The MOBIL model primarily consists of a safety criterion and an incentive criterion. The safety criterion checks whether the vehicle has enough acceleration space after the lane change, and the incentive criterion determines the total advantage of the lane change in terms of total acceleration gain. The MDP is defined as a tuple $ (S, A, T, R) $ with action $ a \in A $, state $ s \in S $, reward function $ R \in [0,1] $, and state transition probabilities $ T $. With deep RL algorithms we search for the behavior policy $ \pi(a \mid s) $ that helps us navigate across the traffic environments to gather the maximum discounted reward. The state-action value function $ Q^{\pi}(s, a) $ for a given state-action pair assists in estimating the future rewards of a given behavior policy $ \pi $. The optimal state-action value function $ Q^{*}(s, a) $ therefore provides maximum value estimates for all $ s \in S $ and $ a \in A $, and is evaluated by solving the Bellman equation, stated below for reference. From this, the optimal policy is expressed as $ \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a) $. We used DQN with a dueling network architecture to approximate the state-action value function, which predicts the best possible action as learned from the policy $ \pi $.

$$ Q^{*}(s, a) = \mathop{\mathbb{E}}[ R(s,a) + \gamma\sum\limits_{s'} P (s' | s , a) \max\limits_{a'} Q^{*}(s', a') ] $$
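To make the Bellman optimality equation concrete, the following minimal sketch applies the backup repeatedly to a tiny tabular MDP until the Q-values stop changing. The three states, two actions, transition probabilities, and rewards are invented purely for illustration and are not taken from highway-env or from the experiments; the paper itself approximates $ Q^{*} $ with a neural network rather than a table.

```python
# Toy MDP invented for illustration only (not a highway-env environment).
GAMMA = 0.99

STATES = ["cruise", "tailgate", "crash"]
ACTIONS = ["keep", "brake"]

# P[(s, a)] -> list of (next_state, probability); "crash" is absorbing.
P = {
    ("cruise", "keep"): [("cruise", 0.9), ("tailgate", 0.1)],
    ("cruise", "brake"): [("cruise", 1.0)],
    ("tailgate", "keep"): [("tailgate", 0.5), ("crash", 0.5)],
    ("tailgate", "brake"): [("cruise", 1.0)],
    ("crash", "keep"): [("crash", 1.0)],
    ("crash", "brake"): [("crash", 1.0)],
}

# R[(s, a)] lies in [0, 1], matching the reward range used in the text.
R = {
    ("cruise", "keep"): 1.0,
    ("cruise", "brake"): 0.5,
    ("tailgate", "keep"): 1.0,
    ("tailgate", "brake"): 0.3,
    ("crash", "keep"): 0.0,
    ("crash", "brake"): 0.0,
}


def q_value_iteration(tol=1e-8, max_iters=100_000):
    """Apply Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(max_iters):
        new_q = {}
        delta = 0.0
        for (s, a) in q:
            expected_next = sum(
                prob * max(q[(s2, a2)] for a2 in ACTIONS)
                for s2, prob in P[(s, a)]
            )
            new_q[(s, a)] = R[(s, a)] + GAMMA * expected_next
            delta = max(delta, abs(new_q[(s, a)] - q[(s, a)]))
        q = new_q
        if delta < tol:
            break
    return q


def greedy_policy(q):
    """pi*(s) = argmax_a Q*(s, a)."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


if __name__ == "__main__":
    q_star = q_value_iteration()
    print(greedy_policy(q_star))
```

In this toy problem the greedy policy keeps speed while cruising and brakes when tailgating, because the risk of entering the zero-reward crash state outweighs the higher immediate reward of keeping speed.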

From our experimentation setup we intend to derive the direct causal impact of our interventions in traffic environment scenarios. Referring back to Figure 2, our treatment (T) subjects the learning agent to more risky driving behavior. Our testing sample set involves random agent reward calculation against perturbed and control environments, which makes our experiment equivalent to an RCT. This means that there is no unobserved confounding present in our experimentation, i.e. the backdoor criterion is satisfied. Also, in RCTs the distributions of all covariates are the same except for the treatment. Covariate balance in observational data also implies that association is equal to causation when calculating the potential outcomes; refer to the equations stated below.

$$ P(X \mid T=1) \stackrel{d}{=} P(X \mid T=0) $$
$$ P(X \mid T=1) \stackrel{d}{=} P(X), T \perp \perp X $$
$$ P(X \mid T=0) \stackrel{d}{=} P(X), T \perp \perp X$$

This essentially means that we can use the associative difference to infer the effect of the treatment on outcomes. We can therefore use the Average Treatment Effect (ATE) approach to calculate the causal effect by simply subtracting the averaged control potential outcomes from the averaged treatment potential outcomes. In the equations stated below, $ Y(1) $ and $ Y(0) $ denote the potential outcomes under treatment ($ T=1 $) and control ($ T=0 $) respectively; these equations hold true for RCTs, where the causal difference can be calculated from the associated difference.

$$ \begin{alignedat}{2} E[Y(1)-Y(0)] &= E[Y(1)]-E[Y(0)] \\ E[Y(1)]-E[Y(0)] &= E[Y\mid T=1]-E[Y\mid T=0] \end{alignedat} $$
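The identity above can be checked numerically. The sketch below simulates a randomized trial in which every unit has both potential outcomes, but only one is observed, depending on a coin-flip treatment assignment. The outcome distribution and the effect size of 0.2 are hypothetical values chosen for illustration; they are not results from the experiments.

```python
import random
from statistics import mean


def simulate_rct(n_units=20_000, true_effect=0.2, seed=0):
    """Estimate E[Y | T=1] - E[Y | T=0] under randomized assignment."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_units):
        y0 = rng.gauss(0.5, 0.1)   # potential outcome Y(0)
        y1 = y0 + true_effect      # potential outcome Y(1)
        # Assignment ignores (y0, y1), so T is independent of the outcomes.
        if rng.random() < 0.5:
            treated.append(y1)     # only Y(1) is observed for treated units
        else:
            control.append(y0)     # only Y(0) is observed for control units
    return mean(treated) - mean(control)


if __name__ == "__main__":
    print(f"estimated ATE: {simulate_rct():.3f} (true effect 0.2)")
```

Because assignment is independent of the potential outcomes, the simple difference of group means lands close to the true effect without any further adjustment.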

image

Model-free learning approaches generally do not have explicit information about the dynamics of the system. During training, these agents will therefore generalize their policies only to the particular scenarios they are given. We introduce risky, collision-prone randomized behavior into the other vehicles in these environments. Figure 3 shows different probable real-life collisions simulated in different highway-env environments. This increases generalization to less common but highly critical scenarios, which can save the user from severe collisions. In our study we critically analyze the performance of our treated agents in comparison to control agents for important tasks: performing roundabouts, handling intersections, u-turns, two-way traffic overtakes, and lane changing.

image

A collision between two vehicles is equivalent to the intersection of two polygons in the rendered environment output of highway-env. We detect these collisions between rectangular polygons with the separating axis theorem for two convex polygons. Essentially, the idea is to find a line that separates the two polygons; if such a line exists, then the polygons are separated and a collision has not happened yet. Algorithmically, for each edge of our base rectangular polygon we find the axis perpendicular to the edge under review. We then project both polygons onto that axis; if these projections do not overlap, there is no collision, because the rectangular polygons are not intersecting (refer to Figure 4).
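The separating axis test described above can be sketched in a few lines. This is an illustrative standalone version, not the collision code shipped with highway-env; the function names and the 5 m by 2 m vehicle footprint are assumptions. It tests the edge normals of both polygons, which is what the theorem requires for a complete check on convex shapes.

```python
import math


def rectangle_corners(cx, cy, length, width, heading):
    """Corners of a rectangle centred at (cx, cy), rotated by `heading` radians."""
    c, s = math.cos(heading), math.sin(heading)
    half = [(length / 2, width / 2), (-length / 2, width / 2),
            (-length / 2, -width / 2), (length / 2, -width / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]


def edge_normals(polygon):
    """One perpendicular axis per edge of the polygon."""
    normals = []
    for i in range(len(polygon)):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % len(polygon)]
        normals.append((-(y2 - y1), x2 - x1))
    return normals


def project(polygon, axis):
    """Interval covered by the polygon when projected onto the axis."""
    dots = [x * axis[0] + y * axis[1] for x, y in polygon]
    return min(dots), max(dots)


def polygons_intersect(poly_a, poly_b):
    """True if two convex polygons overlap (separating axis theorem)."""
    for axis in edge_normals(poly_a) + edge_normals(poly_b):
        min_a, max_a = project(poly_a, axis)
        min_b, max_b = project(poly_b, axis)
        if max_a < min_b or max_b < min_a:
            return False  # a separating axis exists, so no collision
    return True


if __name__ == "__main__":
    ego = rectangle_corners(0.0, 0.0, 5.0, 2.0, 0.0)
    other = rectangle_corners(4.0, 0.0, 5.0, 2.0, 0.0)
    print(polygons_intersect(ego, other))
```

The early return makes the common non-colliding case cheap: the loop stops at the first axis on which the two projections are disjoint.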

Experiment Setup

In our experimentation setup we calculate the ATE metric for five tasks, namely lane changing, two-way traffic, roundabout, intersection, and u-turn (refer to Figure 5). These five tasks are evaluated against traffic increasing from the default vehicle count/density up to a 200% increase. For each of these traffic scenarios we create our treatment environment with varying degrees of acceleration and steering parameters that do not comply with the MOBIL model criteria of safety and incentive. This randomization behavior is governed by the equations stated below, which lay the foundation for simulating collision-prone behavior. We create collision-prone behavior by significantly changing the maximum and minimum acceleration parameters in the functions defined in the kinematics rules of the highway-env package. With our experimentation setup we quantify the causal model performance improvements from introducing this risk-factor knowledge into the agent learning process, and compare them with the control baseline models across these five navigation tasks over a spectrum of increasing traffic density.

$$ acc_p = acc_{min} + \textit{rand[0,1]}* (acc_{max} - acc_{min}) $$
$$ str_p = str_{min} + \textit{rand[0,1]}* (str_{max} - str_{min}) $$
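The two sampling rules above amount to drawing acceleration and steering uniformly between their bounds. A minimal sketch follows; the numeric ranges are hypothetical placeholders, since the actual widened bounds used in the treatment environments are not stated in the text.

```python
import math
import random

# Hypothetical bounds for illustration; the paper does not list its values.
ACC_RANGE = (-6.0, 6.0)                     # m/s^2
STEER_RANGE = (-math.pi / 4, math.pi / 4)   # rad


def sample_uniform(low, high, rng):
    """low + rand[0,1] * (high - low), as in the equations above."""
    return low + rng.random() * (high - low)


def sample_risky_controls(rng, acc_range=ACC_RANGE, steer_range=STEER_RANGE):
    """Draw one perturbed (acceleration, steering) pair for a traffic vehicle."""
    acc_p = sample_uniform(acc_range[0], acc_range[1], rng)
    str_p = sample_uniform(steer_range[0], steer_range[1], rng)
    return acc_p, str_p


if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_risky_controls(rng))
```

Passing an explicit `random.Random` instance keeps the perturbations reproducible for a given seed, which matters when the same scenario has to be replayed for treatment and control agents.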

image

We rebuild the highway-env package with our custom changes by altering the environment configurations, randomizing behavior, and adding new vehicles to strategically increase the traffic density and simulate risky driver behavior. In the lane-changing task, for treatment and control model training, we incrementally increase the vehicle count from 50 to 150 vehicles, accompanied by an equivalent stepwise increase of vehicle density by 100% and by increased randomized risky behavior, with an episode duration of 20 seconds. We train a unique agent for navigation in each traffic-count environment in our experimentation setup, one for every vehicle count increase. We also plot a comparative performance graph of the control and treatment agents across these environment iterations and calculate the ATE of our perturbations. Similarly, for the u-turn environment we uniformly increase our vehicle count from 3 to 12 in increments of 3. For two-way traffic we reduce the environment length to 2/3 of the original and incrementally increase the vehicle traffic count from a base of 5 vehicles in the direction of travel and 2 in the opposite direction to 15 in the direction of travel and 6 in the opposite direction. For collision-prone treatment in the intersection driving task, we rewire the randomization behavior to a more risky one with our acceleration and deceleration tuning. We also increase the vehicle count from 10 to 30 in intervals of 5 vehicles, and alongside we incrementally increase the spawning probability by 0.1 until it reaches its maximum value. Finally, for the roundabout task we incrementally increase the traffic from 5 to 15 vehicles, with risk-prone randomization in our treatment environment, for agent training and performance comparison with the control baseline. Each of these configurations requires the environment to be rebuilt and the model to be retrained against each new treatment and control environment. For evaluating our models' performance we kept the traffic constant in our test population set, matching the corresponding treatment and control environment on which the agents were trained. We changed only the risky behavior between the treatment and control environment sets to calculate the ATE for measuring the causal performance improvement.
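The lane-changing traffic sweep described above can be written down as a list of configuration dictionaries, one per trained agent. This is a sketch under stated assumptions: the step of 25 vehicles is a guess (the text gives only the 50 and 150 endpoints), the density is assumed to scale linearly from its base value to twice that value, and the key names `vehicles_count`, `vehicles_density`, and `duration` are meant to mirror highway-env's highway configuration but should be checked against the installed version.

```python
def lane_change_sweep(start_count=50, end_count=150, step=25,
                      base_density=1.0, duration=20):
    """Build one environment config per traffic level for the lane-changing task."""
    n_steps = (end_count - start_count) // step
    configs = []
    for i in range(n_steps + 1):
        # Fraction of the way from the default traffic to the heaviest traffic.
        fraction = i / n_steps if n_steps else 0.0
        configs.append({
            "vehicles_count": start_count + i * step,
            "vehicles_density": base_density * (1.0 + fraction),  # up to +100%
            "duration": duration,  # episode length in seconds
        })
    return configs


if __name__ == "__main__":
    for config in lane_change_sweep():
        print(config)
```

Each dictionary would then be applied to a freshly created environment before training the corresponding treatment or control agent, so that traffic level is the only thing that differs between sweep entries.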

image

We use the DQN reinforcement learning technique with the Adam optimizer and a learning rate of 5e-4 for our experiments. The discount factor is 0.99, and environment observation vectors are fed in batches of 100. Our agents are trained over 3072 episodes, until they converge to the average reward for a given driving environment. We use the dueling network design, which utilizes an advantage function that helps in estimating the state-action value function for a state-action pair more precisely. This is done by splitting the network into two streams, a value stream and an advantage stream, which share some base hidden layers. The shared network consists of 3 fully connected layers of 256, 192, and 128 units respectively. The value and advantage streams each consist of 2 layers of 128 units. The final output of each stream is also fully connected to the network. The value stream has 1 output, the calculated value function for a given state, and the advantage stream has one output per discrete possible action for a given state. The baseline DQN agent implementations are referenced from github.com/eleurent/rl-agents. The output vectors from these two streams are combined to calculate the state-action value function estimate using the equation stated below; refer to Figure 6 for the model architecture.
$$ Q(s, a) = V (s) + A(s, a) $$
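The layer sizes and the combination rule above can be illustrated with a small forward pass written in plain Python. This is a shape-level sketch with untrained random weights, not the rl-agents implementation: the ReLU activations, the weight initialization, the observation size of 25, and the 5 discrete actions are all assumptions made for the example.

```python
import random

# Hyperparameters quoted in the text.
LEARNING_RATE = 5e-4
DISCOUNT = 0.99
BATCH_SIZE = 100
EPISODES = 3072


def make_layer(n_in, n_out, rng):
    """Fully connected layer as (weights, biases) with small random weights."""
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


class DuelingQNetwork:
    def __init__(self, obs_dim, n_actions, seed=0):
        rng = random.Random(seed)
        sizes = [obs_dim, 256, 192, 128]  # shared base layers
        self.shared = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
        self.value_hidden = [make_layer(128, 128, rng) for _ in range(2)]
        self.value_out = make_layer(128, 1, rng)          # V(s)
        self.adv_hidden = [make_layer(128, 128, rng) for _ in range(2)]
        self.adv_out = make_layer(128, n_actions, rng)    # A(s, a)

    def streams(self, obs):
        """Return V(s) and the list of advantages A(s, a)."""
        h = list(obs)
        for layer in self.shared:
            h = relu(linear(layer, h))
        v, a = h, h
        for layer in self.value_hidden:
            v = relu(linear(layer, v))
        for layer in self.adv_hidden:
            a = relu(linear(layer, a))
        return linear(self.value_out, v)[0], linear(self.adv_out, a)

    def forward(self, obs):
        """Q(s, a) = V(s) + A(s, a) for every discrete action."""
        value, advantages = self.streams(obs)
        return [value + adv for adv in advantages]


if __name__ == "__main__":
    net = DuelingQNetwork(obs_dim=25, n_actions=5)
    q_values = net.forward([0.1] * 25)
    print(max(range(len(q_values)), key=q_values.__getitem__))
```

Splitting the estimate this way lets the value stream learn how good a state is regardless of the action taken, while the advantage stream only has to rank the actions relative to each other.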

Results

Another randomization factor in our experimentation setup is the initial random seed value, which adds random behavior to operations such as randomizing vehicle acceleration and deceleration, spawning vehicles, and choosing vehicle locations. We therefore test our trained treatment and control models against the risk-prone and regular driving environments with different randomization seed values to average out any anomalous results. Hence, we measure the ATE by evaluating our agents against several randomized seed values for risk-prone and regular driving environments. The calculation is a difference of four averaged terms: the sum of the first two terms gives the average reward obtained by the treatment models in the risk-prone and regular environments, and the last two terms give the same for the control agents in both environments; the test set sample count is 100 for each. The associated difference of these quantities gives us the ATE for the performance improvements of our robust agents trained on perturbed environments, as explained in the earlier section.
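The four-term calculation described above can be sketched as follows. The exact form of the original equation is not reproduced in this text, so this version assumes that each term is a plain mean over the seeded test episodes and that the treatment and control totals are sums of their per-environment means. The function name and the reward lists are hypothetical.

```python
from statistics import mean


def average_treatment_effect(treat_risky, treat_regular,
                             control_risky, control_regular):
    """ATE as (treatment averages) minus (control averages).

    Each argument is a list of episode rewards, one per random seed,
    collected in the risk-prone or the regular environment.
    """
    treated = mean(treat_risky) + mean(treat_regular)
    control = mean(control_risky) + mean(control_regular)
    return treated - control


if __name__ == "__main__":
    # Made-up rewards for illustration; the paper uses 100 samples per term.
    ate = average_treatment_effect(
        treat_risky=[0.8, 0.6], treat_regular=[0.9, 0.9],
        control_risky=[0.4, 0.6], control_regular=[0.9, 0.7],
    )
    print(f"ATE = {ate:.2f}")
```

Averaging over many seeds before taking the difference is what keeps a single unlucky spawn pattern from dominating the estimate.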

image

The ATE results in Figure 7 clearly demonstrate the advantage of teaching agents these risk-prone critical scenarios. For better readability we converted the calculated ATE values to their respective percentages in Figure 7. Across all tasks we observe that as traffic density increases, as in real-life scenarios in heavily populated cities, the positive effect of knowing the perturbed scenarios becomes more pronounced for our robust treatment agents. More importantly, there is also a declining trend in average reward as traffic increases for every driving task analyzed in highway-env.

This depreciation in agent performance, for both control and treatment models, can be attributed to the constantly decreasing safety distance among all vehicles, which causes more collisions than expected, and to the slow progress of our ego-vehicle across these environments due to heavy traffic. Even with the decreasing average reward trend across our test set environment samples, the performance of the treatment models always exceeded that of the control models. The relative improvement in ATE values also increases further as traffic continues to increase, demonstrating the strong robustness of the treatment agents. Currently our scope of work is limited to a few, but critically important, driving scenarios. We also used only homogeneous agents in our analysis, and attempted to analyze the critical-knowledge-leveraging component on a single agent only, i.e. our ego-vehicle. In addition, our randomization mechanism, though uniform, does not necessarily follow human-like behavior when generating risk-prone scenarios. Nevertheless, our causal effect estimation approach, which quantifies the information learnt from perturbed scenarios, demonstrates promising results and holds vast scope for practical applications in creating more interpretable, metric-oriented, and key-performance-indicator (KPI) driven autonomous agent systems.

Conclusion

The experiments in this paper provide insights into the importance of using deliberate interventions of collision-prone behaviors while training agents for stochastic processes like autonomous driving. By using the MDP formulation of the discussed driving scenarios with collision simulation perturbations, we were able to generate more robust agents. Our treatment model experimentation setup used episode data from traffic-clogged lanes and risky randomized behavior during training, which resulted in positive ATE values. This showed that agents trained with a wider range of collision-prone scenarios perform better than the existing vanilla simulation agents. We also causally quantified the impact of our interventions for the discussed model-free DQN learning technique, which assisted us in accurately estimating the performance improvements. For every driving scenario environment our new agents produced better results and proved to be better collision deterrents. This underscores the importance of learning valuable lessons from risk-prone scenario simulations for creating safe autonomous driving agents.
