Reinforcement Learning


1 Reinforcement Learning
5. Off-policy RL and replay buffer
Olivier Sigaud, Sorbonne Université
1 / 10

2 To understand the on-policy versus off-policy distinction, one must consider three objects:
- the behavior policy β(s), used to generate samples;
- the critic, which is generally V(s) or Q(s, a);
- the target policy π(s), used to control the system in exploitation mode.
2 / 10

3 Why prefer off-policy?
- Historically, the on-policy versus off-policy distinction discriminates between sarsa and Q-learning [Sutton & Barto, 1998].
- Off-policy learning refers to learning about one way of behaving, called the target policy, from data generated by another way of selecting actions, called the behavior policy.
- The target policy should be an approximation to the optimal policy. Example: a deterministic target policy with a stochastic behavior policy.
- Advantages: more freedom for exploration; learning from human data (imitation); reusing old data (sample efficiency); transfer between policies in a multitask context.
- Maei, H. R., Szepesvári, C., Bhatnagar, S., & Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In ICML.
3 / 10

4 Introducing a replay buffer
- Helps clarify the on-policy versus off-policy distinction.
- Introduces the sample efficiency discussion.
- Samples can be fed randomly to the critic (see the sketch below).
4 / 10
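A minimal replay buffer sketch, assuming a Python setting; the class name ReplayBuffer and its methods are illustrative choices, not code from the lecture. Transitions are stored as (s_t, a_t, r_t, s_{t+1}) tuples and drawn uniformly at random, so the critic receives them in an arbitrary order.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of transitions, sampled uniformly at random."""

    def __init__(self, capacity=100_000):
        # a deque silently drops the oldest transitions once capacity is reached
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling decorrelates consecutive transitions before they reach the critic
        return random.sample(self.storage, batch_size)

# usage: call buffer.add(...) while interacting, then buffer.sample(32) to update the critic
```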

5 Filling the replay buffer
- General format of samples: (s_t, a_t, r_t, s_{t+1}, a').
- This makes it possible to apply a general update rule:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a') − Q(s_t, a_t)]
- There are three different update rules (sketched in code below):
  - In sarsa, a' = β(s_{t+1}).
  - In Q-learning, a' = argmax_a Q(s_{t+1}, a) (a' does not need to be stored).
  - In actor-critic, a' = π(s_{t+1}) (a' does not need to be stored).
5 / 10
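The three choices of a' can be plugged into a single tabular update. This is an illustrative sketch, not the lecture's code: Q is assumed to be a NumPy array of shape (n_states, n_actions), samples follow the format above, and pi is a callable standing for the target policy.

```python
import numpy as np

def td_update(Q, sample, alpha, gamma, rule, pi=None):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s', a') - Q(s,a)] for one buffer sample."""
    s, a, r, s_next = sample[:4]
    if rule == "sarsa":
        a_prime = sample[4]                  # a' = beta(s_{t+1}), stored with the sample
    elif rule == "q_learning":
        a_prime = int(np.argmax(Q[s_next]))  # a' = argmax_a Q(s_{t+1}, a); nothing to store
    elif rule == "actor_critic":
        a_prime = pi(s_next)                 # a' = pi(s_{t+1}); nothing to store
    else:
        raise ValueError(f"unknown rule: {rule}")
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_prime] - Q[s, a])
```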

6 The Q-learning versus sarsa case
- We take β(s) to be a uniform random sampling policy, a sort of worst-case behavior policy.
- We add a negative reward for hitting walls.
- Q-learning still learns an optimal critic; sarsa fails.
6 / 10

7 Closing the loop
- Quite obviously, Q-learning still works.
- sarsa works too: the behavior policy is Greedy in the Limit of Infinite Exploration (GLIE), as sketched below.
- Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3).
7 / 10
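One standard way to obtain a GLIE behavior policy is ε-greedy action selection with a decaying ε. The sketch below is an assumption (the schedule ε = 1 / number of visits to s is a common choice, not necessarily the lecture's): exploration never stops, yet the policy becomes greedy with respect to the critic in the limit.

```python
import numpy as np

def glie_action(Q, visits, s, rng):
    """Epsilon-greedy in state s with epsilon = 1 / (number of visits to s)."""
    visits[s] += 1
    epsilon = 1.0 / visits[s]
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # keep trying every action forever...
    return int(np.argmax(Q[s]))               # ...while becoming greedy in the limit

# usage: visits = np.zeros(n_states); a = glie_action(Q, visits, s, np.random.default_rng(0))
```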

8 The actor-critic case
- The action a' = π(s_{t+1}) used in the update is not random: it improves over time, so actor-critic converges, if not always reliably, to an optimal π(s).
- Thus actor-critic can be said to be off-policy.
8 / 10

9 Replay buffer and sample efficiency
- Important intuition: in the discrete deterministic case, one sample from each (state, action) pair in the buffer is enough for Q-learning to converge (see the sketch below).
- Thus using a replay buffer can be very sample efficient.
- In the stochastic case, the samples in the replay buffer should reflect the distribution over next states. This may require a large replay buffer (over 1e6 samples).
- In the continuous case, the state (and action) spaces cannot be covered, but off-policy deep RL algorithms using a replay buffer still benefit from the initial intuition.
9 / 10
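This intuition can be checked on a toy problem. The sketch below uses an illustrative environment (a small deterministic chain, not one from the lecture): the buffer holds exactly one transition per (state, action) pair, and repeatedly sweeping Q-learning updates over this fixed buffer converges to the optimal critic without any further interaction.

```python
import numpy as np

N, GOAL, gamma = 5, 4, 0.9            # chain states 0..4, goal at the right end

def step(s, a):                       # deterministic dynamics: a=0 moves left, a=1 moves right
    s_next = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return (1.0 if s_next == GOAL else 0.0), s_next

# the buffer contains exactly one sample per non-goal (state, action) pair
buffer = [(s, a, *step(s, a)) for s in range(N) if s != GOAL for a in range(2)]

Q = np.zeros((N, 2))
for _ in range(100):                  # replay the same fixed buffer over and over
    for s, a, r, s_next in buffer:
        # the goal is treated as terminal; alpha = 1 is fine since dynamics are deterministic
        Q[s, a] = r if s_next == GOAL else r + gamma * Q[s_next].max()
print(np.round(Q, 3))                 # reaches the optimal critic with no new interaction
```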

10 Any questions?
Send mail to:
10 / 10

11 References
- Maei, H. R., Szepesvári, C., Bhatnagar, S., & Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In ICML.
- Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3).
- Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
10 / 10