Monte-Carlo Tree Search


An introduction. Jérémie DECOCK, May 2012

Introduction

MCTS is a recent algorithm for sequential decision making. It applies to Markov Decision Processes (MDP):
- discrete time $t$, with finite horizon $T$
- state $s_t \in S$
- action $a_t \in A$
- transition function $s_{t+1} = P(s_t, a_t)$
- cost function $r_t = R_P(s_t)$
- reward $R = \sum_{t=0}^{T} r_t$
- policy function $a_t = \pi_P(s_t)$
We look for the policy $\pi_P$ that maximizes the expected reward $R$.
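To make the notation concrete, here is a minimal Python sketch of that MDP interface; the class name MDP and the method names actions, transition and reward are illustrative assumptions, not taken from the slides.

    class MDP:
        """Finite-horizon MDP: states S, actions A, transition P, reward R_P."""

        def __init__(self, horizon):
            self.horizon = horizon  # T

        def actions(self, state):
            """Legal actions A available in `state`."""
            raise NotImplementedError

        def transition(self, state, action):
            """Sample the (possibly stochastic) next state s_{t+1} = P(s_t, a_t)."""
            raise NotImplementedError

        def reward(self, state):
            """Immediate reward r_t = R_P(s_t)."""
            raise NotImplementedError


    def cumulated_reward(mdp, policy, state):
        """R = sum of r_t for t = 0..T, following `policy` from `state`."""
        total = 0.0
        for _ in range(mdp.horizon + 1):
            total += mdp.reward(state)
            state = mdp.transition(state, policy(state))
        return total

MCTS estimates, by simulation, which first decision leads to the highest expected value of this cumulated reward.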

MCTS strengths
MCTS is a versatile algorithm: it does not require knowledge about the problem and, in particular, does not require any knowledge about the Bellman value function. It is stable on high-dimensional problems, and it outperforms all other algorithms on some problems (difficult games like Go, general game playing, ...).

Problems are represented as a tree structure:
- blue circles = states
- plain edges + red squares = decisions
- dashed edges = stochastic transitions between two states
[Figure: the tree rooted at the current state at time $t$, showing explored decisions, stochastic transitions and explored states.]
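A minimal Python sketch of this tree structure, keeping the statistics that the later slides back-propagate; the class and attribute names are assumptions made for illustration.

    class DecisionNode:
        """Red square: a decision a explored from a state node."""

        def __init__(self, action):
            self.action = action
            self.n = 0           # n(s, a): simulations that chose this decision in s
            self.q = 0.0         # Q_hat(s, a): mean reward of those simulations
            self.children = {}   # next state -> StateNode (stochastic transitions)


    class StateNode:
        """Blue circle: a state stored in the tree."""

        def __init__(self, state):
            self.state = state
            self.n = 0           # n(s): visits of this state in simulations
            self.decisions = {}  # action -> DecisionNode (explored decisions)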


Main steps of MCTS
1. selection
2. expansion
3. simulation
4. propagation
[Figure: one iteration over the tree, from the current state at time $t_n$ down to the horizon.]

Main steps of MCTS
Starting from an initial state:
1. selection: select the state we want to expand from
2. expansion: add the generated state in memory
3. simulation: evaluate the new state with a default policy until the horizon is reached
4. propagation: back-propagate some information:
   - $n(s, a)$: the number of times decision $a$ has been simulated in $s$
   - $n(s)$: the number of times $s$ has been visited in simulations
   - $\hat{Q}(s, a)$: the mean reward of the simulations where $a$ was chosen in $s$
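The loop below is a compact sketch of one iteration of these four steps, reusing the StateNode/DecisionNode and MDP sketches above. The select_decision and default_policy arguments are caller-supplied strategies (one possible select_decision is sketched later with Progressive Widening), states are assumed hashable, and rewards collected along the selection path are ignored to keep the example short; treat it as an illustration, not the authors' implementation.

    def mcts_iteration(mdp, root, select_decision, default_policy):
        path = []            # (StateNode, DecisionNode) pairs visited during selection
        node, t = root, 0

        # 1. selection: walk down the tree until a new state gets expanded
        while t <= mdp.horizon:
            action, is_new = select_decision(mdp, node)
            decision = node.decisions.setdefault(action, DecisionNode(action))
            path.append((node, decision))
            state = mdp.transition(node.state, action)
            t += 1
            if is_new or state not in decision.children:
                # 2. expansion: add the generated state in memory
                node = decision.children.setdefault(state, StateNode(state))
                break
            node = decision.children[state]

        # 3. simulation: evaluate the new state with a default policy until the horizon
        reward, state = 0.0, node.state
        for _ in range(t, mdp.horizon + 1):
            reward += mdp.reward(state)
            state = mdp.transition(state, default_policy(state))

        # 4. propagation: update n(s), n(s, a) and Q_hat(s, a) along the followed path
        for state_node, decision in path:
            state_node.n += 1
            decision.q = (decision.q * decision.n + reward) / (decision.n + 1)
            decision.n += 1
        return reward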

Main steps of MCTS
[Figure: the tree after many iterations, with the selected decision $a_{t_n}$ at the root.]
The selected decision $a_{t_n}$ is the most visited decision from the current state (root node).
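With the node classes sketched above, that final decision rule is a one-liner:

    def best_decision(root):
        """Return the most visited decision from the root state node."""
        return max(root.decisions.values(), key=lambda d: d.n).action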

Selection and expansion

Selection step
How to select the state to expand?

How to select the state to expand?
The selection phase is driven by the Upper Confidence Bound (UCB) score:
$$\mathrm{score}_{\mathrm{ucb}}(s, a) = \underbrace{\hat{Q}(s, a)}_{1} + \underbrace{\sqrt{\frac{\log(2 + n(s))}{2 + n(s, a)}}}_{2}$$
1. the mean reward of the simulations including action $a$ in state $s$
2. the uncertainty on this estimation of the action's value
The selected action: $a = \arg\max_a \mathrm{score}_{\mathrm{ucb}}(s, a)$
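Below is a sketch of this selection rule using the same notation (decision.q for $\hat{Q}(s, a)$, decision.n for $n(s, a)$, node.n for $n(s)$); note that the square root in the exploration term is my reconstruction of the slide's garbled formula, so treat its exact shape as an assumption.

    import math


    def score_ucb(node, decision):
        # term 1: mean reward of the simulations including this action in this state
        exploitation = decision.q
        # term 2: uncertainty on this estimate (square root assumed, see above)
        exploration = math.sqrt(math.log(2 + node.n) / (2 + decision.n))
        return exploitation + exploration


    def select_by_ucb(node):
        """a = argmax_a score_ucb(s, a) over the decisions already in the tree."""
        return max(node.decisions.values(), key=lambda d: score_ucb(node, d))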


When should we expand?
One standard way of tackling the exploration/exploitation dilemma is Progressive Widening (PW). A new parameter $\alpha \in [0; 1]$ is introduced to choose between exploration (add a new decision to the tree) and exploitation (go to an existing node).

How to select the state to expand?
If $|A_s| < n(s)^{\alpha}$ then we explore a new decision, else we simulate a known decision, with $|A_s|$ the number of legal actions already explored in state $s$.
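A sketch of this Progressive Widening test, reusing select_by_ucb from the UCB sketch above; it returns the pair (action, is_new) expected by the mcts_iteration sketch. The helper untried_action and the default value of alpha are assumptions.

    def untried_action(mdp, node):
        """Assumed helper: some legal action not yet explored from this state."""
        return next(a for a in mdp.actions(node.state) if a not in node.decisions)


    def select_decision_pw(mdp, node, alpha=0.5):
        # Explore a new decision while the number of decisions already explored
        # in s is below n(s)^alpha; otherwise simulate a known decision (UCB).
        if not node.decisions or len(node.decisions) < node.n ** alpha:
            return untried_action(mdp, node), True   # explore a new decision
        return select_by_ucb(node).action, False     # simulate a known decision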

When should we expand?
[Figure: the trees obtained with $\alpha = 0.2$ and with $\alpha = 0.8$, shown at $t = 10$, $t = 50$ and $t = 500$ iterations.]
