Safety of the Reinforcement Learning Based Functions, Challenges, Approaches

1 Fakultät für Informatik, Lehrstuhl für Robotik, Künstliche Intelligenz und Echtzeitsysteme. Safety of the Reinforcement Learning Based Functions, Challenges, Approaches. Yingqiang Gao and Korbinian Ederer, Group D. Masterseminar Reinforcement Learning in Autonomous Driving (SS18)

2 The Uber Accident *source: youtube.com 2

4 Where is Uber???? *data source: [1] 4

5 Uber car ignores the red light *source: [2] 5

6 Guidance Safety Definitions & Methodological Challenges Advanced RL for Policy Decision Advanced RL for Path Planning Ensuring safety for RL exploration Conclusion Why autonomous? Safety category Policy decoupling Policy Modeling Inverse RL Learn from expert Collision avoidance Path tracking Summary Future work 6

7 Safety Definition Do you believe that autonomous vehicles are safer? *source: [3] 7

8 Safety Definition Fact 1: Vehicles can perceive much more than humans *source: [4] 8

9 Safety Definition Fact 2: Vehicles are always energetic and never need to take a rest *source: [5] 9

10 Safety Definition *source: [6] Safety categories: System Safety, comprising Behavior Safety, Collision Safety, Operation Safety, Functional Safety and Non-Collision Safety 10

11 Safety Definition "An operating mode of a system or an arrangement of systems can be regarded as safe when there is no unreasonable risk." (ISO 26262, Part I) "A vehicle, which is able to drive without human intervention (autonomous) by use of electronic equipment, shall not entail a hazard to human beings and/or property which is greater than the hazard represented by the conventional (human) driver." (Volkswagen) 11

12 Safety Definition "An operating mode of a system or an arrangement of systems can be regarded as safe when there is no unreasonable risk." (ISO 26262, Part I) We need REASONABLE technologies for autonomous driving! Deterministic approaches? No way. Repeatable approaches? Not likely. Explainable approaches? Maybe! 12

13 Methodological Challenges Conventional baseline approach: Perception-Decision-Control? Challenge Nr. 1: absolute safety. Challenge Nr. 2: time efficiency & a comfortable ride. OK, then: Reinforcement Learning approaches with neural networks? Bad interpretability: the neural network is a famous black-box model; no one knows what it is doing, but somehow it just works so well. Challenge Nr. 3: model stability: a large variance of the model's output simply means low robustness. 13

14 Guidance Safety Definitions & Methodological Challenges Advanced RL for Policy Decision Advanced RL for Path Planning Ensuring safety for RL exploration Conclusion Why autonomous? Safety category Policy decoupling Policy Modeling Inverse RL Learn from expert Collision avoidance Path tracking Summary Future Work 14

15 Advanced RL Approach for Policy Decision What research focuses on: - Enhance stability by reducing the model's variance without the Markovian assumption - Ensure driving comfort and safety by decomposing the policy decision into high-level and low-level parts - Increase the model's interpretability by introducing a hierarchical neural network: the Option Graph 15

16 Advanced RL Approach for Policy Decision Final goal of reinforcement learning: maximize the accumulated reward by finding an optimal sequence of state-action pairs. Adjust the neural network parameters based on the state-action pairs. But how? Policy gradient! Choose the action according to the reward! *source: [7] 16

17 Advanced RL Approach for Policy Decision Policy gradient: unfortunately, the variance of the estimated gradient is proportional to the sequence length T, hence the need for variance reduction: subtract a baseline scalar from the return. *source: [8] 17
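
A minimal sketch of the policy-gradient idea with a scalar baseline, as discussed on this slide. The environment and feature interfaces (env.reset/step, phi) are hypothetical placeholders, not the setup of [8]; it only illustrates the standard REINFORCE estimator where subtracting a baseline keeps the estimate unbiased while reducing its variance.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_episode(theta, phi, env, baseline):
    """One episode of REINFORCE for a linear softmax policy.

    theta: parameter vector; phi(state) -> (num_actions, dim) feature matrix;
    env: hypothetical object with reset() and step(a) -> (state, reward, done);
    baseline: scalar b subtracted from the return.
    """
    grad_log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        feats = phi(state)                               # (A, d) features per action
        probs = softmax(feats @ theta)                   # pi_theta(. | s)
        a = np.random.choice(len(probs), p=probs)
        grad_log_probs.append(feats[a] - probs @ feats)  # grad of log pi(a | s)
        state, reward, done = env.step(a)
        rewards.append(reward)
    G = float(np.sum(rewards))                           # return of the episode
    # Baseline-corrected estimate: sum_t grad log pi(a_t|s_t) * (G - b).
    # The variance of the raw estimate grows with the sequence length T;
    # a good baseline b reduces it without introducing bias.
    return sum(g * (G - baseline) for g in grad_log_probs), G
```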

18 Advanced RL Approach for Policy Decision Policy decomposition: high-level driving policy (e.g. take way, stay in lane), learned from the human driver's experience; low-level driving policy (e.g. acceleration, steering), not learned. 18

19 Advanced RL Approach for Policy Decision Why does policy decomposition ensure safety? - Don't put your money on just one bet! - Conventionally, a wrong final decision = a fatal error! - With decomposition, a wrong high-level decision does not mean you are dead for sure: the low-level layer can still refuse to execute the high-level decision (hard constraints). 19

20 Advanced RL Approach for Policy Decision Why not learn the low-level decision? - The low-level policy contains all fundamental safety issues (e.g. the minimal distance between vehicles, not leaving the lane, an upper bound on the velocity etc.) - The absolute safety conditions never need to be modified. Why split the driving policy at all? - To deal with boundary cases of extremely small probability - Rare cases are handled by learning from human experience rather than by just numerically adjusting their rewards. 20

21 Advanced RL Approach for Policy Decision How to model the policy? Map the high-level policy to an expected action space of Desires: D = [0, v_max] × L × {g, t, o}^n, where [0, v_max] is the desired velocity range, L is the lane the vehicle wants to enter, and {g, t, o}^n assigns an action (g: give way, t: take way, o: offset) to each of the n neighbouring vehicles. The participation of other agents is explicitly involved! The Cartesian product is the combination of velocity, lane and actions. 21
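
A small sketch of how such a Cartesian-product Desires space could be encoded. The names (Desire, enumerate_desires, velocity_grid) are illustrative assumptions; the velocity axis is continuous in the original formulation and is only discretised here to make the product enumerable.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple

# g: give way, t: take way, o: keep offset (one label per neighbouring vehicle)
NEGOTIATION = ("g", "t", "o")

@dataclass(frozen=True)
class Desire:
    velocity: float               # desired velocity in [0, v_max]
    lane: int                     # lane the vehicle wants to enter
    neighbours: Tuple[str, ...]   # one of g/t/o for each of the n neighbours

def enumerate_desires(velocity_grid, lanes, n_neighbours):
    """Discretised enumeration of D = [0, v_max] x L x {g, t, o}^n."""
    for v, lane, labels in product(velocity_grid, lanes,
                                   product(NEGOTIATION, repeat=n_neighbours)):
        yield Desire(v, lane, labels)
```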

22 Advanced RL Approach for Policy Decision Use an Option Graph for the decision flow: - every internal node represents a policy - every internal node is a neural network with three fully connected layers - the parameter θ_i represents the policy of choosing a child node - π_θ(D) is obtained by concatenating all θ_i *source: [8] 22
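
A schematic traversal of such an option graph. Each internal node here is reduced to a single linear layer chosen greedily, purely to show the mechanism; the three-layer node networks and training procedure of [8] are not reproduced.

```python
import numpy as np

class OptionNode:
    """Internal node of the option graph: a tiny policy that picks a child node."""
    def __init__(self, name, children, theta):
        # theta has shape (obs_dim, num_children); children are OptionNodes or leaf actions
        self.name, self.children, self.theta = name, children, theta

    def choose(self, obs):
        logits = obs @ self.theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return self.children[int(np.argmax(probs))]   # greedy choice, for illustration only

def decide(root, obs):
    """Walk from the root to a leaf; the visited node names make the decision path explicit."""
    node, trace = root, []
    while isinstance(node, OptionNode):
        trace.append(node.name)
        node = node.choose(obs)
    return node, trace        # leaf action plus the interpretable decision path
```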

23 Advanced RL Approach for Policy Decision A complete decision is made by walking through the Option Graph from the root node to a leaf node! The whole decision process is now explicitly visible, so interpretability is enhanced! That's quite a REASONABLE technology. *source: [8] 23

24 Advanced RL Approach for Policy Decision Due to the policy decomposition, the cost function is also split into three parts. Velocity cost: Σ_{i=2}^{k} ( v − ‖(x_i, y_i) − (x_{i−1}, y_{i−1})‖ / τ )², where (x_1, y_1), ..., (x_k, y_k) denotes a driving trajectory: the squared difference of the expected and the realised velocities. 24

25 Advanced RL Approach for Policy Decision Lane cost: Σ_{i=1}^{k} dist(x_i, y_i, l), where l denotes the desired lane: the sum of the distances from each point on the trajectory to the desired lane. 25

26 Advanced RL Approach for Policy Decision Interaction cost: defined through τ_i and τ_j, the times at which the master (ego) vehicle and another vehicle reach the intersection point. Give way: you go first, I go at least 0.5 s after you; take way: I go first, you go at least 0.5 s after me; offset: we keep the necessary distance. 26

27 Advanced RL Approach for Policy Decision Combine these three parts in a weighted way to form the high-level policy π(T). Add terms for the smoothness of the trajectory and for lane safety (i.e. the vehicle should always stay in its lane etc.). 27
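
A sketch of the weighted combination of the three cost terms from the previous slides. The velocity and lane terms follow the formulas above; the interaction term is a simple hinge penalty on the arrival-time gap, an illustrative assumption rather than the exact cost of [8], and the straight-lane distance is likewise a simplification.

```python
import numpy as np

def velocity_cost(traj, v_desired, tau):
    """Sum over the trajectory of (desired velocity - realised velocity)^2."""
    steps = np.diff(traj, axis=0)                    # (k-1, 2) displacement vectors
    speeds = np.linalg.norm(steps, axis=1) / tau     # realised speed between samples
    return float(np.sum((v_desired - speeds) ** 2))

def lane_cost(traj, lane_center_y):
    """Sum of distances from every trajectory point to the desired lane (straight lane assumed)."""
    return float(np.sum(np.abs(traj[:, 1] - lane_center_y)))

def interaction_cost(t_ego, t_other, desire, margin=0.5):
    """Hinge penalty on the arrival-time gap at the intersection point (illustrative)."""
    if desire == "g":      # give way: the other vehicle should arrive at least `margin` earlier
        return max(0.0, margin - (t_ego - t_other))
    if desire == "t":      # take way: the ego vehicle should arrive at least `margin` earlier
        return max(0.0, margin - (t_other - t_ego))
    return 0.0             # offset / stay in lane: handled by the lane term

def trajectory_cost(traj, v_desired, tau, lane_center_y, t_ego, t_other, desire,
                    w_v=1.0, w_l=1.0, w_i=1.0):
    """Weighted combination of the three terms that scores a candidate trajectory."""
    return (w_v * velocity_cost(traj, v_desired, tau)
            + w_l * lane_cost(traj, lane_center_y)
            + w_i * interaction_cost(t_ego, t_other, desire))
```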

28 Advanced RL Approach for Policy Decision 28

29 Guidance Safety Definitions & Methodological Challenges Advanced RL for Policy Decision Advanced RL for Path Planning Ensuring safety for RL exploration Conclusion Why autonomous? Safety category Policy decoupling Policy Modeling Inverse RL Learn from expert Collision avoidance Path tracking Summary Future work 29

30 Policy learned from human experience: smooth lane changes and safe actions. BUT: how to consider the passenger's preferences? Imagine two paths A and B for going home: - A takes 20 min, bad road conditions, but short - B takes 30 min, a beautiful country view and a broad lane. How do you choose? What do you do if you are already on A but want to switch to B? *source: [9] 30

31 If you are not in a rush, why not enjoy your way home? Comfort and familiarity of the route are also important once safety is ensured! A natural way is to let the vehicle learn from the human. In the autonomous driving context: show the vehicle how we do it and let it imitate our behaviour. Given an expert's policy, learn a reward function which best explains that policy: Inverse Reinforcement Learning. 31

32 Advanced RL Approach for Path Planning What research focuses on: - The ability to learn the user's path preference and to adapt to alternative route choices - Use inverse RL to learn a reward function associated with the expert's demonstration - Solve the state-space explosion issue by approximating the reward function with an orthogonal polynomial basis 32

33 Advanced RL Approach for Path Planning For inverse RL, learning a reward function means specifying its formulation and computing the corresponding coefficients. How to formulate the to-be-learned reward function? A common formulation is a weighted sum of state features, R(s) = Σ_{i=1}^{n} w_i φ_i(s). Problem: state-space explosion when n becomes big. 33

34 Advanced RL Approach for Path Planning Is there a way to express the reward over the state space with a fixed order? Yes! Use orthogonal polynomials as the basis. Rewrite the reward formulation using Legendre polynomials: R(s) ≈ Σ_i α_i P_i(s), where the α_i are the parameters learned through inverse RL. 34
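
A sketch of what such a fixed-order Legendre reward surface looks like for a 2-D grid-world position, using NumPy's Legendre module. The coefficient values are placeholders; in inverse RL they would be fitted from the expert's demonstrated path, which is not shown here.

```python
import numpy as np
from numpy.polynomial import legendre

# Approximate the reward over a position (x, y), both normalised to [-1, 1], as a
# truncated Legendre series  R(x, y) = sum_{i,j} alpha_ij * P_i(x) * P_j(y).
alpha = np.array([[0.1, -0.3, 0.0],
                  [0.4,  0.0, 0.1],
                  [0.0,  0.2, 0.0]])      # hypothetical (d+1) x (d+1) coefficients

x = np.linspace(-1.0, 1.0, 50)
y = np.linspace(-1.0, 1.0, 50)
R = legendre.leggrid2d(x, y, alpha)       # reward surface evaluated on the 50 x 50 grid

# The number of parameters is (d+1)^2 regardless of the grid resolution,
# which is exactly what avoids the state-space explosion.
print(R.shape)                            # (50, 50)
```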

35 Advanced RL Approach for Path Planning Path given by expert Learned reward function Corresponding optimal policy *Source: [10] 35

36 Advanced RL Approach for Path Planning Corner box. Blue: path given by the expert; Red: path chosen by the agent. Learned reward function, corresponding optimal policy. *Source: [10] 36

37 Advanced RL Approach for Path Planning Different start point. Blue: path given by the expert; Red: path chosen by the agent. Learned reward function, corresponding optimal policy. *Source: [10] 37

38 Advanced RL Approach for Path Planning 38

39 Advanced RL Approach for Path Planning What's good about this approach? - You'll have the feeling that the vehicle is your co-worker rather than just an assistant - Adaptability to route changes and small-range obstacles. What's still not good enough? - The pre-defined human demonstration is deterministic and is assumed to never be wrong! - No consideration of other vehicles in the environment. 39

40 Guidance Safety Definitions & Methodological Challenges Advanced RL for Policy Decision Advanced RL for Path Planning Ensuring safety for RL exploration Conclusion Why autonomous? Safety category Policy decoupling Policy Modeling Inverse RL Learn from expert Collision avoidance Path tracking Summary Future work 40

41 Combining RL and Safety Based Control The DDPG 1) algorithm can produce good results in the continuous-control field. But: - It can't perform well without sufficient training! - Unpredictable results in unfamiliar scenarios - Negative rewards for a crash! 1) Deep Deterministic Policy Gradient algorithm 41

42 Combining RL and Safety Based Control The DDPG RL algorithm. Inputs: vehicle speed, engine speed, track sensors, wheel speeds, track position and vehicle angle *Source: [11] 42
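
A sketch of how the listed signals could form the observation vector and of the actor's input/output shape (steering plus acceleration). The tiny two-layer network and its weight matrices are placeholder assumptions, not the architecture from [11].

```python
import numpy as np

def make_state(speed, engine_rpm, track_sensors, wheel_speeds, track_pos, angle):
    """Concatenate the sensor signals into one observation vector."""
    return np.concatenate(([speed, engine_rpm, track_pos, angle],
                           np.asarray(track_sensors, dtype=float),
                           np.asarray(wheel_speeds, dtype=float)))

def actor(state, W1, b1, W2, b2):
    """Tiny deterministic policy: state -> (steering in [-1, 1], acceleration in [0, 1])."""
    h = np.tanh(state @ W1 + b1)          # hidden layer
    steering, accel = np.tanh(h @ W2 + b2)
    return steering, (accel + 1.0) / 2.0   # rescale acceleration to [0, 1]
```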

43 Combining RL and Safety Based Control Safety Based Control: Artificial Potential Field + Path Tracking *Source: [11] 43

44 Combining RL and Safety Based Control Artificial Potential Field: avoiding collisions by the artificial potential field method. Potential field and potential forces: formulas shown on the slide. *Source: [11] 44
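
Since the potential-field formulas on the slide are not reproduced in this transcript, here is a sketch of the textbook artificial potential field: a quadratic attractive potential towards the goal and a repulsive potential active within an influence radius around each obstacle. The gains and radius are placeholders and the exact formulation in [11] may differ.

```python
import numpy as np

def attractive_force(pos, goal, k_att=1.0):
    """Negative gradient of the attractive potential 0.5 * k_att * ||pos - goal||^2."""
    return k_att * (np.asarray(goal, dtype=float) - np.asarray(pos, dtype=float))

def repulsive_force(pos, obstacle, k_rep=1.0, rho0=5.0):
    """Repulsive force, active only within the influence radius rho0 of the obstacle."""
    diff = np.asarray(pos, dtype=float) - np.asarray(obstacle, dtype=float)
    rho = np.linalg.norm(diff)
    if rho >= rho0 or rho == 0.0:
        return np.zeros_like(diff)
    return k_rep * (1.0 / rho - 1.0 / rho0) / rho ** 2 * (diff / rho)

def potential_force(pos, goal, obstacles):
    """Net force pushing the vehicle towards the goal and away from all obstacles."""
    force = attractive_force(pos, goal)
    for obs in obstacles:
        force = force + repulsive_force(pos, obs)
    return force
```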

45 Combining RL and Safety Based Control Resulting steering and acceleration commands from the potential field: *Source: [11] 45

46 Combining RL and Safety Based Control Path tracking: τ_p is derived in relation to the steering command, e.g. decrease the velocity when the steering command is high. *Source: [11] 46

47 Combining RL and Safety Based Control Balancing of: potential force, path tracking and reinforcement learning. δ = steering action, τ = acceleration action, α, β, γ = weight parameters 47
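
A sketch of the balancing step: the three controllers each propose a steering/acceleration pair and the final command is a weighted blend, with the path-tracking rule reducing acceleration when the steering command is large. The simple convex blend and the 1/(1 + k|δ|) rule are illustrative assumptions; the exact weighting scheme of [11] may differ.

```python
def path_tracking_accel(steering_cmd, tau_max=1.0, k=2.0):
    """Illustrative path-tracking rule: decrease acceleration when steering is large."""
    return tau_max / (1.0 + k * abs(steering_cmd))

def blended_commands(rl_action, apf_action, tracking_action, alpha, beta, gamma):
    """Weighted blend of the potential-field, path-tracking and RL commands.

    Each action is a (steering delta, acceleration tau) pair; alpha, beta, gamma
    weight the potential force, path tracking and reinforcement learning terms.
    """
    d_apf, t_apf = apf_action
    d_trk, t_trk = tracking_action
    d_rl, t_rl = rl_action
    steering = alpha * d_apf + beta * d_trk + gamma * d_rl
    accel = alpha * t_apf + beta * t_trk + gamma * t_rl
    return steering, accel
```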

48 Combining RL and Safety Based Control *Source: [11] 48

49 Guidance Safety Definitions & Methodological Challenges Advanced RL for Policy Decision Advanced RL for Path Planning Ensuring safety for RL exploration Conclusion Why autonomous? Safety category Policy decoupling Policy Modeling Inverse RL Learn from expert Collision avoidance Path tracking Summary Future work 49

50 Summary 1. Functional safety can be ensured by enhancing interpretability 2. Policy decomposition helps us to understand the decision flow more explicitly 3. Inverse RL learns a reward function which best explains the expert's policy 4. Learning from the expert's demonstration takes the user's preferences into consideration during path planning. 50

51 Future 1. How can the gained experience be used to deal with extremely rare cases of danger? Involve cognition? 2. What about the other categories of safety? How can they be improved? 3. Ethical issues raised by the safety solutions? 51

52 References
1: thelastdriverlicenseholder.com
2: theverge.com
3: Virginia Tech
4: GM Safety Report
5: Wikipedia
6: Waymo Safety Report
8: Shalev-Shwartz et al., "Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving", 2016
10: Srinivasan, Chakraborty, "Path planning with user route preference - A reward surface approximation approach using orthogonal Legendre polynomials"
11: Xiong et al., "Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving"
52

53 Thanks for your attention! We're now ready for questions. 53