Modular (agent-agnostic) Human-in-the-loop RL. Owain Evans, University of Oxford


1 Modular (agent-agnostic) Human-in-the-loop RL. Owain Evans, University of Oxford

2 Collaborators: David Abel (Brown), Andreas Stuhlmueller (Stanford), John Salvatier (Oxford)

3 Overview: 1. Autonomous vs. human-controlled / interactive RL. 2. Framework for interactive RL. 3. Applications of the framework: reward shaping and simulations. 4. Case study: prevent catastrophes without side-effects.

4 Overview: 1. Autonomous vs. human-controlled / interactive RL. 2. Framework for interactive RL. 3. Applications of our framework: reward shaping and simulations. 4. Case study: prevent catastrophes without side-effects.

5 Standard RL picture: Environment M sends (s, r) to Agent L; the agent sends action a back to M.

6 Two contrasting research programs for RL: A. Autonomous RL. B. Interactive RL.

7 Picture A: Autonomous RL (DeepMind et al.)

8 Picture A: Autonomous RL (DeepMind et al.): 1. An ML researcher designs a generic RL agent. 2. Real-world environment and sparse rewards. 3. Autonomous learning (no human intervention). 4. Motivation: pragmatic (hand-engineering doesn't scale), biological (animals can learn autonomously).

9 A: Autonomous RL (DeepMind et al.): the human H designs the Agent Class and chooses the Environment M; the agent L then interacts with M via (s, r) and a.

10 A: Autonomous RL (DeepMind et al.): the agent learns autonomously; the human no longer intervenes.

11 Picture B: human-controlled, human-shaped RL

12 Picture B: human-controlled, human-shaped RL. Motivation for interactive RL: useful RL systems should be safe, value-aligned, interpretable. 1. Fine-grained rewards: via a reward function or by demonstration (IRL or apprenticeship learning). 2. The human designs a curriculum: simulations, practice environments, a sequence of real-world environments. 3. The human can intervene during learning (human-in-the-loop).

13 Picture B: human-controlled, human-shaped RL

14 Picture B: human-controlled, human-shaped RL

15 Picture B: human-controlled, human-shaped RL

16 Overview: 1. Autonomous vs. human-controlled / interactive RL. 2. Framework for interactive RL. 3. Applications of our framework: reward shaping and simulations. 4. Case study: prevent catastrophes without side-effects.

17 Framework for interactive RL. Lots of techniques exist for integrating a human into an RL system: reward design/shaping as in TAMER and Active Reward Learning (Knox and Stone 2008, Daniel et al. 2014); avoiding catastrophes by biasing the training distribution (Frank et al. 2008, Paul et al. 2016); providing online advice about Q-values or the policy (Thomaz et al. 2016, Torrey et al. 2013, Loftin et al. 2014).

18 Framework for interactive RL. Current work: lots of techniques for integrating a human into an RL system. GOAL: specify existing techniques in a common/unified framework. Benefit: easier to analyse, generalize, and compose techniques. (AI Safety: we are interested in abstract properties of techniques.)

19 Framework for interactive RL. Current work: lots of techniques for integrating a human into an RL system. GOAL: specify existing techniques in a common/unified framework. SUB-GOAL: the framework should abstract away the details of the agent's algorithm (modular or "agent-agnostic").

20 Overview: 1. Autonomous vs. human-controlled / interactive RL. 2. Framework for interactive RL. 3. Applications of our framework: reward shaping and simulations. 4. Case study: prevent catastrophes without side-effects.

21 Standard RL: Environment M sends (s, r) to Agent L; the agent sends action a back to M.

22 Interactive Framework: Environment M sends (s, r) to a Protocol Program P (connected to the Human H); P sends (s', r') to Agent L; L sends action a to P; P sends a' to M.
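The mediation pattern on this slide can be sketched in code. This is a minimal illustration, not the talk's implementation; all class and function names here are invented for the example. The default protocol is the identity, recovering standard RL.

```python
# Minimal sketch of the interactive-RL framework: a protocol program P
# mediates every message between environment M and agent L.
# All names are illustrative, not from the talk's codebase.

class ProtocolProgram:
    """Identity protocol: passes (s, r) and a through unchanged."""

    def env_to_agent(self, s, r):
        """Transform the environment's (s, r) into the (s', r') the agent sees."""
        return s, r

    def agent_to_env(self, s, a):
        """Transform (or block) the agent's action a into the a' sent to M."""
        return a


def run_step(env_step, agent_act, protocol, s, r):
    """One interaction step routed through the protocol program."""
    s_agent, r_agent = protocol.env_to_agent(s, r)
    a = agent_act(s_agent, r_agent)
    a_env = protocol.agent_to_env(s, a)
    return env_step(a_env)  # environment returns the next (s, r)
```

Specific protocols (reward shaping, pruning, simulation) then subclass this interface and override one or both hooks.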

23 Example 1: Standard RL. The Protocol Program P is the identity: s' = s, r' = r, a' = a.

24 Example 2: Human control. P routes (s, r) to the Human H, who chooses the action; the Agent L only observes s.

25 Overview of protocols: reward shaping. P sets s' = s, r' = r + bonus(s, a).
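The reward-shaping protocol on this slide can be written as a small sketch (illustrative names; the bonus function stands in for whatever shaping signal the human supplies):

```python
class RewardShapingProtocol:
    """Pass the state through unchanged; add a human-designed bonus to r.
    s' = s, r' = r + bonus(s, a), as on the slide."""

    def __init__(self, bonus):
        self.bonus = bonus        # human-supplied shaping function
        self.last_action = None   # remembered so bonus(s, a) can be computed

    def agent_to_env(self, s, a):
        self.last_action = a
        return a                  # a' = a: actions are not modified

    def env_to_agent(self, s, r):
        if self.last_action is None:
            return s, r           # no action taken yet
        return s, r + self.bonus(s, self.last_action)
```

A usage sketch: a human who rewards cautious actions might pass `bonus=lambda s, a: 0.5 if a == "safe" else 0.0`.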

26 Overview of protocols: feature design, pre-processing. P sets s' = features(s), r' = r.

27 Overview of protocols: agent in simulation. P replaces M with a human-built simulated environment M*: s' = T*(s, a), r' = R*(s, a).
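The "agent in simulation" protocol can be sketched as follows: the protocol program ignores the real environment and answers the agent from a simulated model M* = (T*, R*). Names and the toy chain MDP in the usage below are illustrative.

```python
class SimulationProtocol:
    """Route the agent's actions to a simulated environment M* instead of
    the real environment M: s' = T*(s, a), r' = R*(s, a)."""

    def __init__(self, T_star, R_star, s0):
        self.T = T_star   # simulated transition function
        self.R = R_star   # simulated reward function
        self.s = s0       # current simulated state

    def agent_step(self, a):
        """The agent's action updates only the simulated state."""
        r = self.R(self.s, a)
        self.s = self.T(self.s, a)
        return self.s, r
```

For example, a 5-state chain with a goal at state 4 could use `T = lambda s, a: max(0, min(4, s + a))` and `R = lambda s, a: 1.0 if s + a == 4 else 0.0`.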

28 Overview of protocols: action pruning. The protocol between Environment M, Human H, and Agent L:
INPUT: a <- L
if is_safe(s, a): a' = a; output a' -> M
else: output (s, r_bad) -> L

29 Overview of protocols: Environment M, Protocol Program P (with Human H), Agent L. Protocols seen so far: action pruning, agent in simulation, reward shaping, feature design / pre-processing.

30 Overview: 1. Autonomous vs. human-controlled / interactive RL. 2. Framework for interactive RL. 3. Applications of our framework: reward shaping and simulations. 4. Case study: prevent catastrophes without side-effects.


34 Prevent Catastrophes with Interactive RL. (Informally) A state-action (s, a) is catastrophic (w.r.t. ε > 0) if the human requires that: P(agent does (s, a)) < ε.

35 Prevent Catastrophes with Interactive RL. (Informally) A state-action (s, a) is catastrophic (w.r.t. ε > 0) if the human requires that: P(agent does (s, a)) < ε. Examples: irreversible damage to property (robot destroys itself); breaking laws / moral rules; physically harming humans; manipulating or psychologically harming humans.

36 Prevent Catastrophes with Interactive RL. Related work: Safe RL and avoiding SREs (Moldovan and Abbeel, Frank et al., Paul et al., Lipton et al.). Challenge: simulation is often inadequate (esp. for extreme events), and RL agents learn by trial and error (they don't know R and T in advance). Solution: the human blocks catastrophes before they happen.

37 Prevent Catastrophes with Interactive RL. Assumptions: the agent interacts with a real-world environment; the agent might try catastrophic actions (e.g. previous training was insufficient); the human recognizes catastrophic actions before the outcome occurs.

38 Prevent Catastrophes with Interactive RL. Goal: find a protocol program with the following properties: 1. For a given ε, P(catastrophe) < ε, and complexity(ε) scales well. 2. Minimize negative side-effects on the agent's learning: e.g. the agent still converges to the optimal policy (at the same rate). 3. The requirements on the human (time and knowledge) are feasible.

39 Pruning protocol program for discrete MDPs:
1. The agent tries (s, a) = (33, RIGHT).
2. H judges (s, a) to be unsafe.
3. H blocks (s, a) and appends it to memory [memory now: (34, DOWN, 0, 33), (33, RIGHT, -1000, 33)].
4. The agent receives (33, RIGHT, -1000, 33).

40 Action Pruning 1: the human evaluates.
INPUT: a from Agent L
if H(is_safe(s, a)): a' = a; output a' to Env M
else: append (s, a) to Memory; output (s, r_bad) to Agent L

41 Action Pruning 2: the imitator evaluates.
INPUT: a from Agent L
if (s, a) in Memory: output (s, r_bad) to Agent L
else: a' = a; output a' to Env M
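Slides 40 and 41 combine naturally into one protocol: the human judges each (s, a) at most once; every blocked pair goes into Memory, and the memorized imitator blocks repeats without consulting the human. A minimal sketch, with illustrative names and an assumed penalty constant `r_bad = -1000` (matching the slide's example numbers):

```python
class PruningProtocol:
    """Action pruning with a memorized imitator (slides 40-41)."""

    def __init__(self, human_is_safe, r_bad=-1000.0):
        self.human_is_safe = human_is_safe  # callable standing in for H
        self.r_bad = r_bad
        self.memory = set()                 # (s, a) pairs H has blocked
        self.human_retired = False          # once True, only memory blocks

    def filter_action(self, s, a):
        """Return (forward_to_env, feedback). If the action is blocked,
        feedback = (s, r_bad): the agent stays in s and gets the penalty."""
        if (s, a) in self.memory:                   # imitator blocks (slide 41)
            return False, (s, self.r_bad)
        if not self.human_retired and not self.human_is_safe(s, a):
            self.memory.add((s, a))                 # human blocks and memorizes
            return False, (s, self.r_bad)
        return True, None                           # a' = a passes through to M
```

In a discrete MDP the human can retire once each (s, a) has been tried once, since the memory then covers every unsafe pair.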

42 Prevent Catastrophes with a Human Overseer. Why do we need an imitator to block the agent? 1. Some RL algorithms would keep trying a catastrophic action, for example epsilon-greedy and RMAX. 2. Note: there is no guarantee that the agent learns to avoid catastrophes.

43 Goal: find a protocol program with the following properties: 1. Given ε, P(catastrophe) < ε, and complexity(ε) scales well. 2. Minimize negative side-effects on the agent's learning: the agent still converges to the optimal policy (at the same rate). - The transition function is modified but still learnable. 3. The requirements on the human (time and knowledge) are feasible. - The human can retire once each (s, a) has been tried once? - The human can't make mistakes?

44 Result from the paper. Idea: suppose the human identifies the (avoidable) catastrophic actions but otherwise can't tell which actions are better. Then all catastrophic actions are blocked and the agent still learns the optimal policy. Result: suppose the human H has only a β-optimal Q-function Q_h. There exists a protocol program s.t. (1) the agent never takes an action more than 4β from optimal, and (2) no optimal actions are pruned.

45 Prevent Catastrophes in Continuous MDPs

46 Prevent Catastrophes in Continuous MDPs. With a large/infinite state space, the human cannot label all catastrophic actions. Instead, the human trains a classifier on data labeled either catastrophic or not. If the classifier succeeds on a held-out test set, the human retires.
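The train-then-retire loop can be sketched with a toy classifier. The 1-nearest-neighbour model and the 5% error threshold below are illustrative stand-ins, not choices from the talk:

```python
def nearest_neighbor_classifier(train):
    """Toy stand-in for the catastrophe classifier: 1-nearest-neighbour
    over labeled examples. train is a list of (feature_tuple, is_safe)."""
    def classify(x):
        def sq_dist(example):
            return sum((a - b) ** 2 for a, b in zip(example[0], x))
        return min(train, key=sq_dist)[1]
    return classify


def human_can_retire(classifier, held_out, max_error=0.05):
    """The human retires only if the classifier's error rate on a held-out
    test set of (features, label) pairs is below max_error (assumed 5%)."""
    errors = sum(classifier(x) != y for x, y in held_out)
    return errors / len(held_out) < max_error
```

Until `human_can_retire` returns True, every (s, a) the agent proposes is still routed to the human for a safety judgment, and each judgment is appended to the dataset.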

47 Continuous action pruning: overview.
INPUT: a from Agent L
if H("test classifier?", Dataset): test_classifier(Dataset)
else:
  if H(is_safe(s, a)): output a' (= a) to Env M
  else: output (s, r_bad) to Agent L
  append (s, a, H(is_safe(s, a))) to Dataset

48 Continuous action pruning: overview. (Same protocol as slide 47; highlighted step: the human is in control and labels data.)

49 Continuous action pruning: overview. (Same protocol as slide 47; highlighted step: train/test the classifier.)

50 Generalizing about catastrophes. We need a classifier that mislabels a catastrophe with probability < ε. The dataset must contain all classes of catastrophe. A standard problem with RL: the samples (s, a, H(is_safe(s, a))) are not iid.

51 Generalizing about catastrophes. We need a classifier that mislabels a catastrophe with probability < ε (as in PAC or SLT guarantees). A standard problem with RL: the samples (s, a, H(is_safe(s, a))) are not iid. Robust generalization: avoid overconfident errors out of sample.

52 Generalizing about catastrophes. Robust generalization: avoid overconfident errors out of sample. Special case: no false negatives, but false positives are acceptable, and we can ask the human when uncertain (synchronously). The human can then block a broader, easier-to-learn class of actions.

53 Generalizing about catastrophes. Problem 2: this requires the agent to visit each kind of catastrophic state (SLOW). Mitigation: after a catastrophic (s, a) has been observed, the human adds noise to it (without changing the label). A generative model could also help produce data for the classifier.

54 Thanks!

55 Prevent Catastrophes in Continuous MDPs. Problem 3: an advanced agent would realize the human is involved. Current RL agents will not model the human's role, but advanced agents might know when the human retires. However: for any agent, the imitator blocks catastrophic actions (so there is no reason to behave differently once the human retires). Adversarial attacks: the agent generates the dataset for the classifier. Responses: (a) if the agent is too slow to learn, the human will shut it down; (b) the human can run dumber agents to explore the space and compare distributions. [NB: the protocol does involve lots of interaction with the agent.]

56 Relation to Christiano's Bootstrapping: online rewards (TAMER, ARL). Environment M sends s; the Human H supplies the reward r ~ U(s, a); the Agent L receives (s, r).

57 Christiano Bootstrapping vs. Catastrophe Prevention. Left (bootstrapping): Environment M sends s; Human H supplies r ~ U(s, a); Agent L. Right (catastrophe prevention):
INPUT: a <- L
if H(is_safe(s, a)): a' = a; output a' -> M
else: append (s, a) to Memory; output (s, r_bad) -> L

58 Relation to Christiano's Bootstrapping. Christiano: the human gives a reward for every (tiny) action. Infeasible. One approach: the agent uses semi-supervised or active learning. A related approach: use a simpler, more trusted learner to provide reward signals (calling the human when uncertain). Our protocol: every action must be checked for catastrophes. Infeasible. So choose a learner for the task of classifying catastrophic actions (it should be trusted, robustly generalize, be transparent, etc.).

59 Training in Simulation: Environment M; Protocol Program P with Simulated Env M* and Human H; s' = T*(s, a), r' = R*(s, a); Agent L.

60 Training in Simulation. An alternative approach to training against catastrophes is to use simulation (providing lots of dangerous states, maybe generated by other agents). This is not easy to make work (without deluding the agent): the simulation may not be good enough (esp. on weird events), and catastrophes are too rare for the agent to learn them well from the natural distribution. One option: break the black box (Osborne and Whiteson, Precup). Another option: bite the bullet and train the agent on a skewed distribution (which yields a timid, paranoid agent). Or: train a classifier on a dataset skewed toward catastrophes (generated by an agent that seeks out catastrophes in the simulation), and use this classifier to block catastrophes in the real world (via our other protocol).