
# Building Safer Agents with Knowledge of Risky Driving Behavior

## Abstract

Simulation environments are well suited for learning driving tasks such as lane changing, parking, or handling intersections in an abstract manner. However, these environments typically restrict themselves to conservative interaction behavior among vehicles. Real driving frequently involves high-risk scenarios in which other drivers do not behave as expected, whether because they are tired, inexperienced, or otherwise impaired. Standard simulation environments do not take this into account while training the navigation agent. In this study we therefore focus on systematically creating risk-prone scenarios with heavy traffic and unexpected, randomized behavior in order to train better model-free learning agents. We generate multiple autonomous driving scenarios by creating new custom Markov Decision Process (MDP) environment iterations in the highway-env simulation package. The behavior policy is learnt by agents trained with deep reinforcement learning models and is deliberately designed to handle collisions and risky, randomized driver behavior. We train model-free learning agents with supplementary information from risk-prone driving scenarios and compare their performance with baseline agents. Finally, we causally measure the impact of adding these perturbations to the training process, to precisely account for the performance improvement attained from utilizing the learnings of these scenarios.

Keywords: Autonomous Agents, Driving Simulations, Trajectory Prediction, Causality

## Introduction

The arrival of autonomous driving agents has had a great impact on the automobile industry, and it will be responsible for shaping the industry's future as well. Current industry trends suggest that individual autonomous agents will dominate before connected autonomous vehicles do, but that vision remains distant for safety-infrastructure, security, and public-policy reasons. Our immediate focus should therefore be on making driving agents safe and efficacious. Creating safer agents has been explored extensively over the past few years, as it is crucial to know the expected agent behavior in different environments, especially safety-critical ones. In natural driving environments, risk-prone scenarios do not occur frequently, which makes learning from them hard, and it is also unethical to create genuinely risky driving scenarios for experimentation purposes. Generating and studying these scenarios systematically is therefore a daunting task. Perception systems with adversarial noisy-labelling approaches demonstrate a promising path, but the underlying fundamental problem remains safe interaction with other vehicles that are themselves operating independently. Simulation environments with appropriate system-dynamics design assumptions overcome this safety issue, and they allow us to systematically study risk-prone scenarios with near-realistic driving behaviors. For this study we used the highway-env simulation package, which lets us simulate different driving tasks and provides simple interfacing capabilities to modify these environments and quickly create baseline prototypes. We formulate the dynamics of the simulation system as a Markov Decision Process (MDP) and model our value-function-approximation agents with deep reinforcement learning on these MDP systems.

For our experiment design we create two distinct variants of each model architecture for the simulation dynamics stated above. One variant is trained with our added dangerous-driving-behavior interventions for all the driving tasks present in the highway-env package. The second variant is the control in our experiment and is used to define the reward baseline for agents trained on regular simulations. We methodically dope these environments with randomized dangerous driving scenarios to create more risk-prone driving environments. This is done in two ways: first, by increasing the traffic at strategically important locations, challenging the agent to make dangerous overtakes; second, by increasing the randomization factor so that, together with the clogged lanes, the environment becomes more collision prone for our agent.

This produces agents that are more robust in a given environment for that particular dynamic scenario, and essentially better at post-impact trajectory prediction for other interacting vehicles. Figure 1 explains our causal-analysis experimentation setup in sequential form. We also examine our setup from a causal standpoint. We hold complete control over the data-generation process, which gives us the unique vantage point of conducting a study equivalent to a randomized controlled trial (RCT). We train our agents under the reasonable assumption that there are no unobservable confounding variables, since we have strictly defined the state dynamics and the models governing vehicle behavior. The customization capabilities of the highway-env package allow us to keep every training condition the same except for our collision-scenario perturbations. The treatment and control groups are thus identical in all respects except the treatment itself, i.e. there is comparability and covariate balance in our experimental setup. With this relationship established, we can conclude that the association found in our experimental setup is causation. As shown in Figure 2, our treatment is subjecting the agent's learning process to risk-prone interacting-vehicle dynamics in a given environment. Our test population then involves evaluating the two model variants against regular environments and perturbed environments doped with excessive risk-prone scenarios. Finally, using expectation equations derived from the causal graph, we estimate the causal effect of our learning changes for enhanced safety.

Our contributions in this paper include benchmarking environment simulations for collision-robustness prediction, an experimentation methodology that creates more robust agents providing better on-road safety, and a causal measurement of the impact of our risky-driving-behavior doping interventions across different driving environments. In the remainder of the paper we first discuss related work on utilizing risk-prone behavior to create safer autonomous vehicles. Second, we formally define our problem and elaborate on its causal aspects. Third, we explain the experimental setup for creating these robust agents and elaborate on our results. Finally, we conclude our autonomous-driving-agent study.

## Previous Work

Deep reinforcement learning has been used extensively for traffic control tasks. The simulated environment provided by CARLA gave a framework for systems that estimate several affordances from sensors in a simulated environment. Navigation tasks that require good exploration capabilities, such as merging into traffic, have also shown promising results in simulators. ChauffeurNet elaborates on the idea of imitation learning for training robust autonomous agents, leveraging worst-case scenarios in the form of realistic perturbations. A clustering-based collision-case generation study systematically defines and generates different types of collisions to effectively identify valuable cases for agent training. Work on the highway-env package specifically focuses on designing safe operational policies for large-scale non-linear stochastic autonomous driving systems. This environment has been extensively studied and used for modelling different variants of the MDP, for example the finite MDP, the constrained MDP, and the budgeted MDP (BMDP). The BMDP is a variant of the MDP which ensures that a risk notion, implemented as a cost signal, stays below a certain adjustable threshold. Problem formalizations covering vehicle kinematics, temporal abstraction, partial observability, and the reward hypothesis have been studied extensively as well. Robust optimization planning has been studied in the past for finite MDP systems with uncertain parameters, and has shown promising results under conservative driving behavior. For the BMDP, efficacy and safety analysis has been extended to continuous kinematic states and unknown human behavior from the existing known dynamics and finite state space. Model-free learning networks that approximate the value function for these MDPs, such as the Deep Q-Network (DQN) and Dueling DQN architectures, have demonstrated promising results in continuous agent learning.


Worst-case-scenario knowledge in traffic analysis has been leveraged for model-based algorithms by building a region of high confidence that contains the true dynamics with high probability. Tree-based planning algorithms were used to achieve robust stabilisation and minimax control with generic costs. These studies also leveraged non-asymptotic linear regression and interval prediction for safer trajectory predictions. A behavior-guided action study, which uses proximity graphs and safe-trajectory computations for working with aggressive and conservative drivers, has shown promising results as well; it used the CMetric measure to generate varying levels of aggressiveness in traffic. In our case, by contrast, we use more randomization and traffic clogging at key areas to create risk-prone scenarios and measure more granular observation results. Causal modelling techniques have contributed greatly to interpretative explanations in many domains. Fisher's randomized controlled trials (RCTs) have served as the gold standard for causal discovery from observational data, and Sewall Wright's path diagrams were the first attempt to generate causal answers with mathematics. Today, causal diagrams and various adjustments on these diagrams offer direct causal-relation information about any experimental variables under study. In our experimental study we use these existing mathematical tools to draw direct causal conclusions about our learnings from environment interventions.

## Problem Formulation

Our goal is to design and build agents on collision-prone MDPs for navigation tasks across different traffic scenarios. The MDP comprises a behavior policy $\pi(a \mid s)$ that outputs an action $a$ for a given state $s$. With this learnt policy, our goal is to predict a discrete, safe, and efficient action from a finite action set for the next time step in a given driving scenario. The simulation platform we use is compatible with the OpenAI Gym package. The highway-env package provides traffic flow governed by the Intelligent Driver Model (IDM) for linear acceleration and the MOBIL model for lane changing. The MOBIL model consists primarily of a safety criterion and an incentive criterion: the safety criterion checks whether, after a lane change, the vehicle has enough acceleration headroom, and the incentive criterion determines the total advantage of the lane change in terms of total acceleration gain. The MDP is defined as a tuple $(S, A, R, T)$ with action $a \in A$, state $s \in S$, reward function $R(s, a) \in [0, 1]$, and state transition probabilities $T = P(s' \mid s, a)$. With deep RL algorithms we search for the behavior policy $\pi(a \mid s)$ that helps us navigate the traffic environments so as to gather the maximum discounted reward. The state-action value function $Q^{\pi}(s, a)$ assists in estimating future rewards of a given behavior policy $\pi$. The optimal state-action value function $Q^{*}(s, a)$ provides maximum value estimates for all $(s, a) \in S \times A$ and is evaluated by solving the Bellman equation, stated below for reference. From this, the optimal policy is expressed as $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$. We used a DQN with a dueling network architecture to approximate the state-action value function, which predicts the best possible action as learned from the policy $\pi$.

$$Q^{*}(s, a) = \mathop{\mathbb{E}}[ R(s,a) + \gamma\sum\limits_{s'} P (s' | s , a) \max\limits_{a'} Q^{*}(s', a') ]$$
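To make the Bellman equation above concrete, the sketch below solves for $Q^{*}$ on a toy two-state, two-action MDP by repeated Bellman backups (value iteration). The transition and reward numbers are invented purely for illustration and are unrelated to highway-env.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate Q <- R + gamma * sum_s' P[s,a,s'] * max_a' Q[s',a']
    until convergence. P: (S, A, S) transitions, R: (S, A) rewards."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    while True:
        # P @ Q.max(axis=1) contracts over s', giving an (S, A) array
        Q_new = R + gamma * (P @ Q.max(axis=1))
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# Toy MDP with illustrative numbers only
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5]])
Q_star = q_value_iteration(P, R)
policy = Q_star.argmax(axis=1)   # greedy policy pi*(s) = argmax_a Q*(s, a)
```

Since $\gamma < 1$ the backup is a contraction, so the iteration converges to the unique fixed point $Q^{*}$ regardless of initialization.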

From our experimentation setup we intend to derive the direct causal impact of our interventions in the traffic environment scenarios. As Figure 2 shows, our treatment (T) is subjecting the learning agent to riskier driving behavior. Our test sample set involves randomized agent-reward calculation against the perturbed and control environments, which makes our experiment equivalent to an RCT. This means there is no unobserved confounding present in our experimentation, i.e. the backdoor criterion is satisfied. Also, in RCTs the distribution of all covariates is the same except for the treatment. Covariate balance in observational data also implies that association equals causation when calculating the potential outcomes; refer to the equations stated below.

$$P(X \mid T=1) \stackrel{d}{=} P(X \mid T=0)$$
$$P(X \mid T=1) \stackrel{d}{=} P(X), \quad T \perp\!\!\!\perp X$$
$$P(X \mid T=0) \stackrel{d}{=} P(X), \quad T \perp\!\!\!\perp X$$

Essentially, this means we can use the associational difference to infer the effect of the treatment on outcomes, so the Average Treatment Effect (ATE) can be calculated by simply subtracting the averaged control potential outcomes from the averaged treatment potential outcomes. In the equations below, $Y(1)$ and $Y(0)$ denote the potential outcomes under treatment and control respectively; the equations hold in RCTs, where the causal difference can be calculated from the associational difference.

$$\begin{aligned} E[Y(1)-Y(0)] &= E[Y(1)]-E[Y(0)] \\ &= E[Y \mid T=1]-E[Y \mid T=0] \end{aligned}$$
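Under randomization the ATE reduces to a difference of group means, as the derivation above shows. The snippet below illustrates this with hypothetical episode-reward arrays (the reward distributions are invented placeholders, not measured results):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-episode rewards: Y under treatment (T=1) and control (T=0),
# each evaluated over 100 randomized test episodes. Numbers are illustrative.
y_treated = rng.normal(loc=22.0, scale=3.0, size=100)
y_control = rng.normal(loc=18.0, scale=3.0, size=100)

# RCT identity: E[Y(1)] - E[Y(0)] = E[Y | T=1] - E[Y | T=0],
# so the ATE is simply the difference of the sample means.
ate = y_treated.mean() - y_control.mean()
```

With confounding present this simple difference would be biased, which is why the comparability of treatment and control environments in our setup matters.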

Model-free learning approaches generally have no explicit information about the dynamics of the system, so during training these agents generalize their policies only to the particular scenarios they are given. We introduce risky, collision-prone randomized behavior into the environment vehicles. Figure 3 shows different probable real-life collisions simulated in different highway-env environments. This improves generalization to less common but highly critical scenarios, which can save the user from severe collisions. We critically analyze the performance of our treated agents in comparison to the control agents for important tasks, namely roundabouts, intersections, u-turns, two-way-traffic overtakes, and lane changing.

A collision between two vehicles is equivalent to an intersection of two polygons in the rendered environment output of highway-env. We detect these collisions between rectangular polygons with the separating axis theorem for two convex polygons. The idea is to find a line that separates both polygons: if such a line exists, the polygons are separated and no collision has happened. Algorithmically, for each edge of our base rectangular polygon we find the axis perpendicular to the edge under review, then project both polygons onto that axis; if the projections do not overlap, the rectangular polygons do not intersect and there is no collision, refer to Figure 4.
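The separating-axis test described above can be sketched as follows. This is a minimal generic implementation for convex polygons given as ordered vertex arrays, not highway-env's internal code:

```python
import numpy as np

def project(poly, axis):
    """Project polygon vertices onto an axis; return the (min, max) interval."""
    dots = poly @ axis
    return dots.min(), dots.max()

def polygons_collide(a, b):
    """Separating axis theorem for two convex polygons, each an (N, 2) array
    of vertices in order. If any edge normal yields non-overlapping
    projections, a separating line exists and there is no collision."""
    for poly in (a, b):
        n = len(poly)
        for i in range(n):
            edge = poly[(i + 1) % n] - poly[i]
            axis = np.array([-edge[1], edge[0]])   # perpendicular to the edge
            amin, amax = project(a, axis)
            bmin, bmax = project(b, axis)
            if amax < bmin or bmax < amin:
                return False   # separating axis found: no collision
    return True

r1 = np.array([[0, 0], [2, 0], [2, 1], [0, 1]], float)
r2 = np.array([[1, 0.5], [3, 0.5], [3, 1.5], [1, 1.5]], float)  # overlaps r1
r3 = np.array([[5, 5], [6, 5], [6, 6], [5, 6]], float)          # far from r1
```

For two rectangles only four distinct axes need checking (two per rectangle), which keeps the per-step collision test cheap.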

## Experiment Setup

In our experimentation setup we calculate the ATE metric for the lane-changing, two-way-traffic, roundabout, intersection, and u-turn tasks, refer to Figure 5. These five tasks are evaluated against increasing traffic, from the default vehicle count/density up to a 200% increase. For each traffic scenario we create our treatment environment with varying acceleration and steering parameters that do not comply with the MOBIL model's safety and incentive criteria. This randomized behavior is governed by the equations stated below, which lay the foundation for simulating collision-prone behavior. We create collision-prone behavior by significantly changing the max-min acceleration parameters in the kinematics rules of the highway-env package. With this setup we quantify the causal model-performance improvement gained from introducing this risk-factor knowledge into the agent learning process, and compare it with the control baseline models across the five navigation tasks over a spectrum of increasing traffic density.

$$acc_p = acc_{min} + \mathrm{rand}[0,1] \cdot (acc_{max} - acc_{min})$$
$$str_p = str_{min} + \mathrm{rand}[0,1] \cdot (str_{max} - str_{min})$$
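The two sampling equations above translate directly into code. In this sketch the bound values are illustrative placeholders, not the package's actual kinematic limits:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative bounds; highway-env's real kinematic limits may differ.
ACC_MIN, ACC_MAX = -5.0, 5.0     # acceleration range, m/s^2
STR_MIN, STR_MAX = -0.3, 0.3     # steering range, rad

def risky_perturbation():
    """Sample uniformly random acceleration and steering commands:
    acc_p = acc_min + rand[0,1] * (acc_max - acc_min), and likewise for
    steering -- the randomized risky behavior injected into other vehicles."""
    u_acc, u_str = rng.random(2)
    acc_p = ACC_MIN + u_acc * (ACC_MAX - ACC_MIN)
    str_p = STR_MIN + u_str * (STR_MAX - STR_MIN)
    return acc_p, str_p
```

Because the draws are uniform over the full interval, the perturbed vehicles regularly command acceleration/steering combinations that violate the MOBIL safety criterion, which is exactly what makes the treatment environments collision prone.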

We rebuild the highway-env package with our custom changes by altering the environment configurations, randomizing behavior, and adding new vehicles to strategically increase the traffic density and simulate risky driver behavior. For treatment and control model training in the lane-changing task, we incrementally increase the vehicle count from 50 to 150, accompanied by an equivalent intermittent increase of vehicle density up to 100%, with increased randomized risky behavior over an episode duration of 20 seconds. We train a unique agent for navigation in each traffic-count environment, corresponding to every vehicle-count increase, plot a comparative performance graph of the control and treatment agents across these environment iterations, and calculate the ATE of our perturbations. Similarly, for the u-turn environment we uniformly increase the vehicle count from 3 to 12 in increments of 3. For two-way traffic we reduce the environment length to 2/3 of the original and incrementally increase the traffic from 5 vehicles in the ego direction and 2 in the opposite direction to 15 and 6 respectively. For the collision-prone treatment in the intersection driving task we rewire the randomization behavior to a riskier one with our acceleration and deceleration tuning; we also increase the vehicle count from 10 to 30 in intervals of 5, and incrementally increase the spawn probability by 0.1 until it reaches its maximum value. Finally, for the roundabout task we incrementally increase the traffic from 5 to 15 vehicles with risk-prone randomization in the treatment environment, for agent training and performance comparison with the control baseline. Each of these configurations requires the environment to be rebuilt continuously.
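A configuration of this kind can be expressed through highway-env's config dictionary rather than a full rebuild. The sketch below mirrors the lane-changing treatment values from the text; the key names follow the highway-env documentation, but the exact API (e.g. `unwrapped.configure` versus passing `config=` to `gym.make`) depends on the installed highway-env/gymnasium versions:

```python
import gymnasium as gym
import highway_env  # noqa: F401  (registers the highway-env environments)

env = gym.make("highway-v0")
# Lane-changing treatment configuration from the text (illustrative values)
env.unwrapped.configure({
    "vehicles_count": 150,      # upper end of the 50 -> 150 sweep
    "vehicles_density": 2.0,    # ~100% density increase over default
    "duration": 20,             # episode length in seconds
})
obs, info = env.reset()
```

The risky randomization itself is not a config key; as described above, it requires modifying the surrounding vehicles' kinematics in the package source.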

It also requires repeated model training for each new treatment and control environment. For evaluating model performance we keep the traffic constant in our test population set, matching the corresponding treatment and control environments on which the agents were trained, and change only the risky behavior between the treatment and control environment sets to calculate the ATE and measure the causal performance improvement.

We use the DQN reinforcement learning technique with the Adam optimizer and a learning rate of 5e-4. The discount factor is 0.99, and the environment observation vector data is fed in batches of 100. Our agents are trained over 3072 episodes, until they converge to the average reward of a given driving environment. We use the dueling network design, which utilizes an advantage function to estimate the state-action value function for each state-action pair more precisely. This is done by splitting the network into two streams, value and advantage, which share some base hidden layers. The shared network consists of 3 fully connected layers of 256, 192, and 128 units respectively. The value and advantage streams each consist of 2 layers of 128 units. The final output of each stream is also fully connected: the value stream has a single output, the calculated value function for a given state, while the advantage stream has one output per discrete possible action. The output vectors from these two streams are combined to calculate the state-action value-function estimate using the equation stated below; refer to Figure 6 for the model architecture. Baseline DQN agent implementations are referenced from github.com/eleurent/rl-agents.
$$Q(s, a) = V(s) + A(s, a)$$
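A minimal numpy sketch of this dueling forward pass is shown below, with the layer sizes from the text (256-192-128 shared trunk, two 128-unit streams). The weights are random placeholders standing in for trained parameters, and the observation size and action count are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def dense(x, w, b):
    """Fully connected layer with ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

n_actions = 5
obs = rng.standard_normal(25)   # flattened observation vector (illustrative)

# Shared hidden layers: 256 -> 192 -> 128 units
h = obs
for n_in, n_out in [(25, 256), (256, 192), (192, 128)]:
    h = dense(h, rng.standard_normal((n_in, n_out)) * 0.05, np.zeros(n_out))

# Value stream -> scalar V(s); advantage stream -> A(s, a) per action
v_h = dense(h, rng.standard_normal((128, 128)) * 0.05, np.zeros(128))
a_h = dense(h, rng.standard_normal((128, 128)) * 0.05, np.zeros(128))
V = v_h @ rng.standard_normal(128) * 0.05               # scalar
A = a_h @ rng.standard_normal((128, n_actions)) * 0.05  # (n_actions,)

# Combine the streams as in the equation above: Q(s, a) = V(s) + A(s, a)
Q = V + A
best_action = int(Q.argmax())
```

Note that the standard dueling formulation of Wang et al. subtracts the mean advantage, $Q = V + A - \frac{1}{|A|}\sum_{a'} A(s, a')$, to make the decomposition identifiable; the simpler sum shown here follows the equation as stated in the text.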


## Results

Another randomization factor in our experimentation setup involves the initial random seed values, which add random behavior such as randomized vehicle acceleration and deceleration, vehicle spawning, and vehicle locations. We therefore test our trained treatment and control models against the risk-prone and regular driving environments with different random seed values to average out any anomalous results, and measure the ATE by evaluating our agents over several random seeds in both environment types. We first compute the average reward of the treatment models in the risk-prone and regular environments, then the same quantities for the control agent in both environments, with a test-set sample count of 100 for each group. The associational difference of these quantities gives us the ATE for the performance improvement of our robust agents trained on perturbed environments, as explained in the earlier section.

The ATE results in Figure 7 clearly demonstrate the advantage of teaching agents these risk-prone critical scenarios. For readability we converted the calculated ATE values to their respective percentages in Figure 7. Across all tasks we observe that as traffic density increases, as in real-life scenarios in heavily populated cities, the positive effect of knowing the perturbed scenarios becomes more pronounced for our robust treatment agents. There is also a declining trend in average reward as traffic increases, for every driving task analyzed in highway-env.


This depreciation in agent performance for both the control and treatment models can be attributed to the constantly decreasing safety distance among all vehicles, causing more collisions than expected, and to slow progression of our ego-vehicle across these environments due to heavy traffic. Even with the decreasing average-reward trend across our test-set environment samples, the performance of the treatment models always exceeded that of the control models, and the relative improvement in ATE values grows further as the traffic continues to increase, demonstrating the strong robustness of the treatment agents. Currently our scope of work is limited to a few, but critically important, driving scenarios. We have also used only homogeneous agents in our analysis, and we analyzed the critical-knowledge-leveraging component on a single agent only, i.e. our ego-vehicle. Moreover, our randomization mechanism, though uniform, does not necessarily follow human-like behavior while generating risk-prone scenarios. Nevertheless, our causal-effect estimation approach, which quantifies the information learnt from perturbed scenarios, demonstrates promising results and holds vast scope for practical applications in creating more interpretable, metric-oriented, and key-performance-indicator (KPI) driven autonomous agent systems.

## Conclusion

Our experiments provide insights into the importance of deliberate interventions with collision-prone behaviors while training agents for stochastic processes like autonomous driving. Using the MDP formulation of the discussed driving scenarios with collision-simulation perturbations, we were able to generate more robust agents. Our treatment-model experimentation setup used episode data from traffic-clogged lanes and risky randomized behavior during training, which ultimately resulted in positive ATE values, proving that agents trained on a wider range of collision-prone scenarios perform better than the existing vanilla simulation agents. We also causally quantified the impact of our interventions for the discussed model-free DQN technique, which assisted us in accurately estimating the performance improvements. For every driving-scenario environment our new agents produced better results and proved to be better collision deterrents, underscoring the importance of learning valuable lessons from risk-prone scenario simulations for creating safe autonomous driving agents.