Building Safer Autonomous Agents by Leveraging Risky Driving Behavior Knowledge



Simulation environments are well suited for learning different driving tasks, such as lane changing, parking, or handling intersections, in an abstract manner. However, these environments often restrict themselves to conservative interaction behavior among vehicles, whereas real driving frequently involves high-risk scenarios in which other drivers do not behave as expected, for reasons such as fatigue or inexperience. Standard simulation environments do not take this information into account while training the navigation agent. In this study we therefore focus on systematically creating such risk-prone scenarios, with heavy traffic and unexpected random behavior, in order to train better model-free learning agents. We generate multiple autonomous driving scenarios by creating new custom Markov Decision Process (MDP) environment iterations in the highway-env simulation package. The behavior policy is learnt by agents trained with deep reinforcement learning models and is designed to handle collisions and risky randomized driver behavior. We train model-free learning agents with supplementary information about risk-prone driving scenarios and compare their performance with baseline agents. Finally, we causally measure the impact of adding these perturbations to the training process to precisely account for the performance improvement attained from utilizing the learnings from these scenarios.

Keywords: Autonomous Agents, Driving Simulations, Trajectory Prediction, Causality




Introduction

The arrival of autonomous driving agents has had a great impact on the automobile industry and will be responsible for shaping its future as well. Current industry trends indicate that individual autonomous agents will dominate before connected autonomous vehicles do, but that vision is still quite far off for safety infrastructure, security, and public policy reasons. Therefore, our immediate focus should be on making our driving agents safe and efficacious. Creating safer agents has been explored well over the past few years, as it is crucial to know the expected agent behavior in different environments, especially safety-critical ones. In natural driving environments, risk-prone scenarios do not happen frequently, which makes learning from them harder, and it is also unethical to create real risky driving scenarios for experimentation purposes. Therefore, generating and studying these scenarios systematically is a daunting task. Perception systems with adversarial approaches to noisy labelling do demonstrate a promising path, but the underlying fundamental problem remains safe interaction with other vehicles that are themselves operating independently. Simulation environments with appropriate system dynamics design assumptions overcome this safety issue, and additionally allow us to systematically study risk-prone scenarios with near-realistic driving behaviors. For this study we have used the highway-env simulation package, which allows us to simulate different driving tasks and provides simple interfacing capabilities for modifying these environments and quickly creating baseline prototypes. We formulate the dynamics of the simulation system as a Markov Decision Process (MDP) and model our value function approximation agents with deep reinforcement learning for these defined MDP systems.



For our experiment design we create two distinct variants of each model architecture for the simulation dynamics stated above. One variant is trained with our added dangerous driving behavior interventions for all the driving tasks present in the highway-env package. The second variant is the control in our experiment and is used to define the reward baseline for agents trained on regular simulations. In our study we methodically dope these environments with randomized dangerous driving scenarios to create more risk-prone driving environments. This is done in two ways: first, by increasing the traffic at strategically important locations, challenging the agent to make dangerous overtakes; second, by increasing the randomization factor, so that the clogged lanes make the environment more collision prone for our agent.


This is done to create agents that are more robust in a given environment for that particular dynamic scenario, and essentially better at post-impact trajectory predictions of other interacting vehicles. Figure 1 explains our causal analysis experimentation setup in sequential form. We also attempt to understand our experimentation setup from a causal standpoint. We hold complete control of the data generation process, which gives us the unique vantage point of conducting a study equivalent to a randomized control trial (RCT). We train our agents under the reasonable assumption that there are no unobservable confounding variables, as we have strictly defined the state dynamics and the models governing vehicle behavior. The customization capabilities of the highway-env package allow us to keep every training condition the same except our collision scenario perturbations, meaning that the treatment and control groups are identical in all aspects except the treatment itself, i.e. there is comparability and covariate balance in our experimental setup. With this relation established, we can conclude that association found in our experiment setup is causation. As shown in Figure 2, our treatment subjects the agent learning process to risk-prone interacting vehicle dynamics in a given environment. Our test population then involves evaluating the two model variants against regular and perturbed environments with excessive risk-prone scenario doping. Finally, using expectation equations derived from the causal graph, we estimate the causal effect of our novel learning changes for enhanced safety.


Our contributions in this paper include providing benchmarking environment simulations for collision robustness predictions, an experimentation methodology that creates more robust agents with better on-road safety, and a causal measurement of the impact of our risky driving behavior doping interventions in different driving environments. In the remainder of the paper we first discuss related work on utilizing risk-prone behavior to create safer autonomous vehicles. Second, we formally define our problem and elaborate on its causal aspects. Third, we explain the experiment setup for creating these robust agents and elaborate upon our results. Finally, we conclude our autonomous driving agent study.



Previous Work

Deep reinforcement learning has been used extensively for traffic control tasks. The simulated environment provided by CARLA gave a framework for systems that estimate several affordances from sensors in a simulated environment. Navigation tasks requiring good exploration capabilities, such as merging into traffic, have also shown promising results in simulators. ChauffeurNet elaborates on the idea of imitation learning for training robust autonomous agents by leveraging worst-case scenarios in the form of realistic perturbations. A clustering-based collision case generation study systematically defines and generates different types of collisions for effectively identifying valuable cases for agent training. The highway-env package specifically focuses on designing safe operational policies for large-scale non-linear stochastic autonomous driving systems. This environment has been extensively studied and used for modelling different variants of MDPs, for example finite MDPs, constrained MDPs, and budgeted MDPs (BMDPs). A BMDP is a variant of the MDP which ensures that a risk notion, implemented as a cost signal, stays below a certain adjustable threshold. The problem formalization of vehicle kinematics, temporal abstraction, partial observability, and the reward hypothesis has been studied extensively as well. Robust optimization planning has been studied in the past for finite MDP systems with uncertain parameters and has shown promising results under conservative driving behavior. For BMDPs, efficacy and safety analysis has been extended to continuous kinematic states and unknown human behavior from the existing known dynamics and finite state space. Model-free learning networks that approximate the value function for these MDPs, such as Deep Q-Learning (DQN) and Dueling Deep Q-Learning networks, have demonstrated promising results in continuous agent learning.



Worst-case scenario knowledge in traffic analysis has been leveraged for model-based algorithms by building a region of high confidence containing the true dynamics with high probability. Tree-based planning algorithms were used to achieve robust stabilisation and minimax control with generic costs. These studies also leveraged non-asymptotic linear regression and interval prediction for safer trajectory predictions. A behavior-guided action study that uses proximity graphs and safety trajectory computations for working with aggressive and conservative drivers has shown promising results as well; it used the CMetric measure for generating varying levels of aggressiveness in traffic. In our case, by contrast, we use more randomization and traffic clogging at key areas for risk-prone scenarios to obtain more granular observation results. Causal modelling techniques have contributed a lot in terms of providing interpretative explanations in many domains. Fisher's randomized control trials (RCTs) have served as the gold standard for causal discovery from observational data, and Sewall Wright's path diagrams were the first attempt at generating causal answers with mathematics. Today, causal diagrams and different adjustments on these diagrams offer direct causal relation information about any experimental variables under study. In our experimental study we use these existing mathematical tools to draw direct causal conclusions about our learnings from environment interventions.


Problem Formulation

Our goal is to design and build agents on collision-prone MDPs for navigation tasks across different traffic scenarios. The MDP comprises a behavior policy $ \pi(a \mid s) $ that outputs an action $ a $ for a given state $ s $. With this learnt policy our goal is to predict a discrete, safe, and efficient action from a finite action set for the next time step in a given driving scenario. The simulation platform we used is compatible with the OpenAI gym package. The highway-env package provides traffic flow governed by the Intelligent Driver Model (IDM) for linear acceleration and the MOBIL model for lane changing. The MOBIL model primarily consists of a safety criterion and an incentive criterion: the safety criterion checks whether, after a lane change, the vehicle has enough acceleration space, and the incentive criterion determines the total advantage of the lane change in terms of total acceleration gain. The given MDP is defined as a tuple $ (S, A, R, T, \gamma) $ where action $ a \in A $, state $ s \in S $, reward function $ R(s,a) \in [0,1] $ and state transition probabilities $ T = P(s' \mid s, a) $. With deep RL algorithms we search for the behavior policy $ \pi(a \mid s) $ that helps us navigate across the traffic environments to gather the maximum discounted reward. The state-action value function $ Q^{\pi}(s,a) $ assists in estimating the future rewards of a given behavior policy $ \pi $. Therefore, the optimal state-action value function $ Q^{*}(s,a) $ provides maximum value estimates for all $ (s,a) \in S \times A $ and is evaluated by solving the Bellman equation, stated below for reference. From this the optimal policy is expressed as $ \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a) $. We used a DQN with dueling network architecture to approximate the state-action value function, which predicts the best possible action as learned from the policy $ \pi $.


$$ Q^{*}(s, a) = \mathop{\mathbb{E}}[ R(s,a) + \gamma\sum\limits_{s'} P (s' | s , a) \max\limits_{a'} Q^{*}(s', a') ] $$
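As a concrete illustration, the Bellman optimality backup above can be solved by value iteration on a small synthetic MDP. The transition and reward tensors below are random placeholders, not the highway-env dynamics; in the actual experiments the DQN approximates this fixed point from sampled episodes instead.

```python
import numpy as np

# Toy MDP sketch (hypothetical 3-state, 2-action system) illustrating the
# Bellman optimality backup Q*(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q*(s',a').
n_states, n_actions, gamma = 3, 2, 0.9

rng = np.random.default_rng(0)
# P[s, a, s'] — transition probabilities; R[s, a] — rewards in [0, 1].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(500):  # value iteration: repeat the backup until it converges
    Q_new = R + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        Q = Q_new
        break
    Q = Q_new

policy = Q.argmax(axis=1)  # optimal policy pi*(s) = argmax_a Q*(s, a)
```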

From our experimentation setup we intend to derive the direct causal impact of our interventions in traffic environment scenarios. As shown in Figure 2, our treatment (T) subjects the learning agent to riskier driving behavior. Our test sample set involves random agent reward calculation against perturbed and control environments, which makes our experiment equivalent to an RCT. This means there is no unobserved confounding present in our experimentation, i.e. the backdoor criterion is satisfied. Also, in RCTs the distribution of all covariates is identical except for the treatment. Covariate balance in observational data likewise implies that association equals causation when calculating the potential outcomes; refer to the equations stated below.


$$ P(X \mid T=1) \stackrel{d}{=} P(X \mid T=0) $$
$$ P(X \mid T=1) \stackrel{d}{=} P(X), T \perp \perp X $$
$$ P(X \mid T=0) \stackrel{d}{=} P(X), T \perp \perp X$$

Essentially, this means we can use the associational difference to infer the effect of the treatment on outcomes, i.e. we can use the Average Treatment Effect (ATE) approach to calculate the causal effect by simply subtracting the averaged treatment and control potential outcomes. In the equations stated below, $ Y(1) \triangleq Y_{T=1} $ and $ Y(0) \triangleq Y_{T=0} $ denote the potential outcomes under treatment and control, and these equations hold true for RCTs, where the causal difference can be calculated from the associational difference.

$$ \begin{aligned} E[Y(1)-Y(0)] &= E[Y(1)]-E[Y(0)] \\ &= E[Y\mid T=1]-E[Y\mid T=0] \end{aligned} $$
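Under this RCT equivalence, the ATE estimate reduces to a difference of sample means. A minimal sketch with simulated episode rewards (the reward levels and sample size here are illustrative assumptions, not our measured values):

```python
import numpy as np

# Under an RCT (treatment independent of covariates), E[Y(1) - Y(0)] equals
# E[Y | T=1] - E[Y | T=0], estimated by a difference of sample means.
rng = np.random.default_rng(42)
n = 100  # test-set sample count per group

# Simulated average episode rewards (illustrative placeholder values):
reward_treated = rng.normal(loc=0.8, scale=0.1, size=n)  # agents trained with risk doping
reward_control = rng.normal(loc=0.6, scale=0.1, size=n)  # baseline agents

ate = reward_treated.mean() - reward_control.mean()  # associational difference = ATE
```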


Model-free learning approaches generally do not have explicit information about the dynamics of the system, so during training these agents will generalize their policies only to the particular given scenarios. We introduce risky randomized behavior into the environment vehicles, making them collision prone. Figure 3 shows different probable real-life collisions simulated in different highway-env package environments. This increases generalization to less common but highly critical scenarios, which can save the user from hefty collisions. We critically analyze the performance of our treated agents in comparison to control agents for important tasks such as performing roundabouts, handling intersections, u-turns, two-way traffic overtakes, and lane changing environments.


A collision between two vehicles is equivalent to the intersection of two polygons in the rendered environment output of highway-env. We detect these collisions between rectangular polygons with the separating axis theorem for two convex polygons. Essentially, the idea is to find a line that separates both polygons: if such a line exists, the polygons are separated and a collision has not happened yet. Algorithmically, for each edge of our base rectangular polygon we find the axis perpendicular to the current edge under review. We then project both polygons onto that axis, and if these projections do not overlap there is no collision, as the rectangular polygons are not intersecting; refer to Figure 4.
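The procedure described above can be sketched as follows. This is a self-contained illustration of the separating axis test for convex polygons, not the highway-env internals:

```python
import numpy as np

def project(poly, axis):
    """Project polygon vertices onto an axis; return (min, max) scalar extents."""
    dots = poly @ axis
    return dots.min(), dots.max()

def rects_collide(poly_a, poly_b):
    """Separating axis theorem for two convex polygons given as (n, 2) vertex arrays.
    If the projections onto every edge normal overlap, no separating line exists
    and the polygons (vehicles) intersect, i.e. a collision has occurred."""
    for poly in (poly_a, poly_b):
        n = len(poly)
        for i in range(n):
            edge = poly[(i + 1) % n] - poly[i]
            axis = np.array([-edge[1], edge[0]])  # axis perpendicular to this edge
            a_min, a_max = project(poly_a, axis)
            b_min, b_max = project(poly_b, axis)
            if a_max < b_min or b_max < a_min:    # gap found -> separating axis exists
                return False
    return True  # no separating axis on any edge normal -> collision

# Two unit squares, the second offset so that they overlap:
a = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
b = a + 0.5
assert rects_collide(a, b)          # overlapping -> collision detected
assert not rects_collide(a, b + 2)  # far apart -> no collision
```

For rectangles only two distinct edge normals per polygon exist, so the loop could be trimmed to four axes; the general form above works for any convex polygon.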


Experiment Setup

In our experimentation setup we calculate the ATE metric for five tasks, namely lane changing, two-way traffic, roundabout, intersection, and u-turn; refer to Figure 5. These five tasks are evaluated against increasing traffic, from the default vehicle count/density up to a 200% increase. For each of these traffic scenarios we create our treatment environment with varying degrees of acceleration and steering parameters that would not comply with the MOBIL model's safety and incentive criteria. This randomization behavior is governed by the equations stated below, which lay the foundation for simulating collision-prone behavior. We create collision-prone behavior by significantly changing the max-min acceleration parameters in the function defined in the kinematics rules of the highway-env package. With our experimentation setup we quantify the causal model performance improvements from introducing this risk factor knowledge into the agent learning process, and compare it with the control baseline models across these five navigation tasks against a spectrum of increasing traffic density.


$$ acc_p = acc_{min} + \textit{rand[0,1]}* (acc_{max} - acc_{min}) $$
$$ str_p = str_{min} + \textit{rand[0,1]}* (str_{max} - str_{min}) $$
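The perturbation equations above can be sketched as follows. The widened min/max bounds are illustrative assumptions for the treatment environments, not highway-env's default kinematics limits:

```python
import random

# Hypothetical widened bounds that deliberately violate MOBIL safety margins:
ACC_MIN, ACC_MAX = -6.0, 6.0   # m/s^2 (assumed treatment bounds)
STR_MIN, STR_MAX = -0.8, 0.8   # rad   (assumed treatment bounds)

def risky_controls():
    """Sample perturbed acceleration and steering uniformly, per the
    acc_p / str_p equations above: x_p = x_min + rand[0,1] * (x_max - x_min)."""
    acc_p = ACC_MIN + random.random() * (ACC_MAX - ACC_MIN)
    str_p = STR_MIN + random.random() * (STR_MAX - STR_MIN)
    return acc_p, str_p

acc, steer = risky_controls()
```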


We rebuild the highway-env package with our custom changes by altering the environment configurations, randomizing behavior, and adding new vehicles to strategically increase the traffic density and simulate risky driver behavior. In the lane changing task, for treatment and control model training we incrementally increase the vehicle count from 50 to 150 vehicles, accompanied by an equivalent intermittent increase of vehicle density by 100%, with increased randomized risky behavior over an episode duration of 20 seconds. We train a unique navigation agent for each traffic count environment in our experimentation setup, corresponding to every vehicle count increase. We also plot a comparative performance graph of the control and treatment agents across these environment iterations and calculate the ATE of our perturbations. Similarly, for the u-turn environment we uniformly increase the vehicle count from 3 to 12 in increments of 3. For two-way traffic we reduce the original environment length to 2/3 of the original and incrementally increase the vehicle traffic count from a base of 5 vehicles in the ego direction and 2 in the opposite direction to 15 and 6 respectively. For collision-prone treatment in the intersection driving task we rewire the randomization behavior to a riskier one with our acceleration and deceleration tuning; we also increase the vehicle count from 10 to 30 in intervals of 5 vehicles and incrementally increase the spawning probability by 0.1 until it reaches its maximum value. Finally, for the roundabout task we incrementally increase the traffic from 5 to 15 vehicles with risk-prone randomization in our treatment environment for agent training and performance comparison with the control baseline. Each of these configurations requires the environment to be rebuilt.
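A minimal sketch of how one such treatment configuration might be expressed. The key names follow highway-env's configuration dictionary, but the exact values and the pairing with our custom risky-behavior rebuild are illustrative assumptions:

```python
# Hedged sketch of a lane-changing treatment configuration for highway-env.
# Key names follow the package's config dict; values are illustrative.
treatment_config = {
    "vehicles_count": 150,    # upper end of the 50 -> 150 sweep
    "vehicles_density": 2.0,  # 100% density increase over the default
    "duration": 20,           # episode length in seconds
}

# Control uses the identical configuration; in our setup only the risky
# randomization perturbation (applied in the rebuilt kinematics rules)
# differs, preserving covariate balance for the causal analysis.
control_config = dict(treatment_config)
```

With the package installed, such a dict would typically be applied via `env = gym.make("highway-v0")` followed by `env.configure(treatment_config)`, though the rebuilt behavior changes in our setup live in the package source itself.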


Models are then retrained against each new treatment and control environment. For evaluating our models' performance we keep the traffic constant in our test population set, matching the corresponding treatment and control environments on which the agents were trained, and change only the risky behavior between the treatment and control environment sets to calculate the ATE measuring the causal performance improvement.


We use the DQN reinforcement learning modelling technique with the ADAM optimizer for our experiment, with a learning rate of 5e-4. Our discount factor is 0.99 and the environment observation vector data is fed in a batch size of 100. Our agents are trained over 3072 episodes, until they converge to a stable average reward in a given driving environment. We use the dueling network design, which utilizes an advantage function to estimate the state-action value function for each state-action pair more precisely. This is done by splitting the network into two streams, a value stream and an advantage stream, which share some base hidden layers. The shared network consists of 3 fully connected layers of 256, 192 and 128 units respectively. The value and advantage streams each consist of 2 layers of 128 units, and the final outputs of these streams are also fully connected: the value stream has 1 output, the calculated value function for a given state, and the advantage stream has one output per discrete possible action for a given state. The output vectors from these two streams are combined to calculate the state-action value function estimate via the equation stated below; refer to Figure 6 for the model architecture. Baseline DQN agent implementations are referenced from: github.com/eleurent/rl-agents
$$ Q(s, a) = V (s) + A(s, a) $$
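A minimal numpy sketch of this dueling head. The layer sizes follow the architecture described above; the weights are random placeholders rather than trained parameters, and the observation/action dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0)

def dense(n_in, n_out):
    """A fully connected layer as a (weights, bias) pair with placeholder weights."""
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

obs_dim, n_actions = 25, 5  # e.g. a flattened kinematics observation, discrete meta-actions
shared = [dense(obs_dim, 256), dense(256, 192), dense(192, 128)]  # shared 256-192-128 trunk
value_stream = [dense(128, 128), dense(128, 1)]          # V(s): scalar state value
adv_stream = [dense(128, 128), dense(128, n_actions)]    # A(s,a): one output per action

def forward(x, layers, final_linear=True):
    """Apply dense layers with ReLU; optionally leave the last layer linear."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if not (final_linear and i == len(layers) - 1):
            x = relu(x)
    return x

s = rng.normal(size=obs_dim)                 # a single observation vector
h = forward(s, shared, final_linear=False)   # shared hidden representation
V = forward(h, value_stream)                 # shape (1,)
A = forward(h, adv_stream)                   # shape (n_actions,)
Q = V + A                                    # Q(s,a) = V(s) + A(s,a), per the equation above
best_action = int(Q.argmax())
```

Note that many practical dueling DQN implementations combine the streams as Q(s,a) = V(s) + (A(s,a) − mean over a' of A(s,a')) so that the decomposition is identifiable; the simpler additive form above mirrors the equation as stated.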




Another randomization factor in our experimentation setup involves the initial random seed values, which add random behavior to tasks such as randomizing vehicle acceleration and deceleration, spawning vehicles, and vehicle locations. We therefore test our trained treatment and control models against the risk-prone and regular driving environments with different randomization seed values to average out any anomalous results. Hence, we measure the ATE by evaluating our agents against several randomized seed values for both risk-prone and regular driving environments. This calculation averages the treatment model's rewards in the risk-prone and regular environments, subtracts the corresponding average for the control agent in the same two environments, and uses a test set sample count of $ n_1 = n_0 = 100 $. The associational difference of these quantities gives us the ATE for the performance improvements of our robust agents trained on perturbed environments, as explained in the earlier section.
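This seed-averaged ATE computation can be sketched as below. The `evaluate` function is a stand-in for actually rolling out an agent in an environment; its reward levels are illustrative assumptions, not our measured results:

```python
import numpy as np

seeds = range(5)  # randomization seed values per (model, environment) cell
n = 100           # episodes per evaluation cell

def evaluate(model, env, seed):
    """Placeholder for rolling out `model` in `env` under `seed`;
    returns a simulated mean episode reward over n episodes."""
    base = {("treated", "risky"): 0.70, ("treated", "regular"): 0.90,
            ("control", "risky"): 0.50, ("control", "regular"): 0.85}[(model, env)]
    cell_rng = np.random.default_rng(abs(hash((model, env, seed))) % 2**32)
    return base + cell_rng.normal(0.0, 0.2 / np.sqrt(n))  # mean-reward noise

def cell_mean(model, env):
    """Average the cell's reward over all randomization seeds."""
    return np.mean([evaluate(model, env, s) for s in seeds])

# First two terms: treatment model in risk-prone and regular environments;
# last two: control model in the same environments.
ate = (cell_mean("treated", "risky") + cell_mean("treated", "regular")) / 2 \
    - (cell_mean("control", "risky") + cell_mean("control", "regular")) / 2
```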


The ATE results in Figure 7 clearly demonstrate the advantage of teaching agents these risk-prone critical scenarios. For better readability we converted the calculated ATE values to percentages in Figure 7. Across all tasks we observe that as traffic density increases, as in real-life scenarios in heavily populated cities, the positive effect of knowing the perturbed scenarios becomes more pronounced for our robust treatment agents. There is also a declining trend in average reward as traffic increases, for every driving task analyzed in highway-env.




This depreciation in agent performance for both the control and treatment models can be attributed to the constantly decreasing safety distance among all vehicles, causing more collisions than expected, and to the slow progression of our ego-vehicle across these environments due to heavy traffic. Even with this decreasing average reward trend across our test set environment samples, the performance of the treatment models always exceeded that of the control models. Also, the relative improvement in ATE values further increases as the traffic continues to increase, demonstrating the strong robustness of the treatment agents. Currently our scope of work is limited to a few critically important driving scenarios. We have also used only homogeneous agents in our analysis and have analyzed the critical knowledge leveraging component on a single agent only, i.e. our ego-vehicle. Moreover, our randomization mechanism, though uniform, does not necessarily follow human-like behavior while generating risk-prone scenarios. Nevertheless, our causal effect estimation approach, which quantifies the information learnt from perturbed scenarios, demonstrates promising results and holds vast scope for practical applications in creating more interpretable, metric-oriented, key-performance-indicator (KPI) driven autonomous agent systems.



Our experiments provide insights into the importance of deliberate interventions with collision-prone behaviors while training agents for stochastic processes like autonomous driving. By using the MDP formulation of the discussed driving scenarios with collision simulation perturbations, we were able to generate more robust agents. Our treatment model experimentation setup used episode data from traffic-clogged lanes and risky randomized behavior during training, which resulted in positive ATE values, proving that agents trained on a wider range of collision-prone scenarios perform better than the existing vanilla simulation agents. We also causally quantified the impact of our interventions for the discussed model-free learning DQN technique, which assisted us in accurately estimating the performance improvements. For every driving scenario environment our new agents produced better results and proved to be better collision deterrents, underscoring the importance of learning valuable lessons from risk-prone scenario simulations for creating safe autonomous driving agents.