BEHAVIOR BASED CONTROL AND FUZZY Q-LEARNING FOR AUTONOMOUS MOBILE ROBOT NAVIGATION

Khairul Anam 1,3, Son Kuswadi 2, Rusdhianto Effendi 3
1 Department of Electrical Engineering, Faculty of Engineering, University of Jember, Jl. Slamet Riyadi 62 Jember, email: kh_anam_sp@elect-eng.its.ac.id
2 EEPIS-ITS, Kampus Keputih ITS Sukolilo, email: sonk@eepis-its.edu
3 Department of Electrical Engineering, Faculty of Industrial Engineering, ITS

ABSTRACT

This paper presents a collaboration of behavior based control and fuzzy Q-learning for mobile robot navigation systems. Many fuzzy Q-learning algorithms have been proposed to yield individual behaviors such as obstacle avoidance or target finding. For complicated tasks, however, all behaviors need to be combined in one control schema using behavior based control. Based on this fact, this paper proposes a control schema that incorporates fuzzy Q-learning into a behavior based schema to handle complicated tasks in the navigation system of an autonomous mobile robot. In the proposed schema, two behaviors are learned by fuzzy Q-learning; the other behaviors are constructed at the design stage. All behaviors are coordinated by hierarchical hybrid coordination nodes. Simulation results demonstrate that the robot with the proposed schema is able to learn the right policy, to avoid obstacles and to find the target. However, fuzzy Q-learning failed to give the right policy for the robot to avoid collision in corner locations.

Keywords: behavior based control, fuzzy Q-learning

1 INTRODUCTION

Autonomous mobile robot navigation is one of the active areas of robotics research. To implement such a robot system, it is important for the system to react properly in an unknown environment by learning its actions through experience. For this purpose, reinforcement learning methods have been receiving increased attention for use in autonomous robot systems. One method that has been widely used is Q-learning.
However, since Q-learning deals with discrete actions and states, an enormous number of states may be necessary for an autonomous robot to learn an appropriate action in a continuous environment. Therefore, Q-learning cannot be applied directly to such a case because of the curse of dimensionality. To overcome this problem, variations of the Q-learning algorithm have been developed. Different authors have proposed generalization by statistical methods (Hamming distance, statistical clustering) [1], or using the generalization ability of feedforward neural networks to store the Q-values [1-3]. Another approach consists in extending Q-learning into fuzzy environments [4, 5] and is called fuzzy Q-learning. In this approach, prior knowledge can be embedded into the fuzzy rules, which can reduce training significantly. Therefore, this approach is used in this paper.

Fuzzy Q-learning (FQL) has been used in various fields of research, such as robot navigation [2, 3], control systems [6], robot soccer [7], games [8], and so on [9]. In mobile robot navigation, FQL has been used to generate tasks for navigation purposes like obstacle avoidance [10] and wall following [11]. However, most of these works address a single task and a simple problem. For more complicated problems, it is necessary to design a control schema that involves more than one FQL to conduct the complicated tasks simultaneously. This paper focuses on the collaboration between FQLs and behavior-based control in autonomous mobile robot navigation.

The rest of the paper is organized as follows. Section 2 describes the theory and the design of the control schema. Simulation results are described in Section 3 and conclusions in Section 4.

2 THEORY AND DESIGN

2.1 Fuzzy Q-learning

Fuzzy Q-learning may be considered as an extension of the original Q-learning. Q-learning [12] is a reinforcement learning method where the learner incrementally builds a Q-value function which attempts to estimate the discounted
future rewards for taking an action from a given state. The Q-value function is updated by the following equation:

Q^(s_t, a_t) = Q(s_t, a_t) + α [ r_{t+1} + γ V(s_{t+1}) − Q(s_t, a_t) ]   (1)

where r is the scalar reinforcement signal, α is the learning rate, and γ is a discount factor.

In order to deal with a large continuous state space, generalization must be incorporated in the state representation. The generalization ability of a fuzzy inference system (FIS) can be used to facilitate generalization in the state space and to generate continuous actions [10]. Each fuzzy rule R_i is a local representation over a region defined in the input space, and it memorizes the parameter vector q_i associated with each of the possible discrete actions. These q-values are then used to select actions so as to maximize the discounted sum of rewards obtained while achieving the task. The rules have the form [4]:

If x is S_i then action = a[i,1] with q[i,1]
                      or a[i,2] with q[i,2]
                      or a[i,3] with q[i,3]
                      ...
                      or a[i,J] with q[i,J]

where the states S_i are fuzzy labels, x is the input vector (x_1, ..., x_n), a[i,j] is a possible action, q[i,j] is the q-value corresponding to action a[i,j], and J is the number of possible actions. The learning robot has to find the best conclusion for each rule, i.e. the action with the best q-value.

In order to explore the set of possible actions and acquire experience through the reinforcement signals, the local actions are selected using an exploration-exploitation strategy based on the state-action quality, i.e. the q-values. Here, the simple ε-greedy method is used for action selection: the greedy action is chosen with probability 1 − ε, and a random action with probability ε. The exploration probability is set to ε = 2 / (10 + T), where T is the trial number.
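As an illustration, the tabular update of Eq. (1) and the ε-greedy selection rule with ε = 2/(10 + T) can be sketched in Python. This is a minimal sketch, not code from the paper; the function and variable names are illustrative.

```python
import random

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular form of Eq. (1): move Q(s,a) toward r + gamma * V(s')."""
    v_next = max(Q[s_next].values())       # V(s') = max over actions in s'
    Q[s][a] += alpha * (r + gamma * v_next - Q[s][a])

def epsilon_greedy(q_values, trial):
    """Pick an action index; epsilon = 2 / (10 + T) decays with trial T."""
    eps = 2.0 / (10.0 + trial)
    if random.random() < eps:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda j: q_values[j])  # exploit
```

With a reward of 1 and γ = 0.9, a single update from a state whose successor has value 1 moves Q(s,a) toward 1 + 0.9 · 1 = 1.9, scaled by the learning rate.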
The exploration probability controls the necessary trade-off between exploration and exploitation, and is gradually reduced after each trial [10].

Let i° be the action selected in rule i by the action selection mechanism described above, and i* the greedy action, such that q[i, i*] = max_{j ∈ J} q[i, j]. The inferred action a is:

a(x) = Σ_i α_i(x) a[i, i°] / Σ_i α_i(x)   (2)

The actual Q-value of the inferred action a is:

Q(x, a) = Σ_i α_i(x) q[i, i°] / Σ_i α_i(x)   (3)

and the value of the state x is:

V(x) = Σ_i α_i(x) q[i, i*] / Σ_i α_i(x)   (4)

where α_i(x) is the firing strength of rule i.

If x is a state, a is the action applied to the system, y the new state and r the reinforcement signal, then Q(x, a) can be updated using equations (1) and (3). The difference between the old and the new Q(x, a) can be thought of as an error signal, ΔQ = r + γ V(y) − Q(x, a), that can be used to update the action q-values. By ordinary gradient descent, we obtain:

Δq[i, i°] = ε ΔQ α_i(x) / Σ_i α_i(x)   (5)

where ε here denotes a learning rate (distinct from the exploration probability). To speed up learning, Q-learning is combined with the temporal difference TD(λ) method [4], which yields the eligibility e[i,j] of an action j:

e[i,j] = λγ e[i,j] + α_i(x) / Σ_i α_i(x)   if j = i°
e[i,j] = λγ e[i,j]                         elsewhere   (6)

Therefore, the updating equation (5) becomes:

Δq[i,j] = ε ΔQ e[i,j]   (7)

The fuzzy Q-learning algorithm explained above is summarized as follows:
1. Observe the state x.
2. For each rule, choose the actual consequent using ε-greedy selection.
3. Compute the global consequent a(x) and its corresponding Q-value Q(x, a).
4. Apply the action a(x). Let y be the new state.
5. Receive the reinforcement r.
6. Update the q-values.
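The inference and update steps above can be sketched as follows; this is a non-authoritative Python sketch of Eqs. (2)-(4) and (7), with illustrative names and data layouts (rules as lists indexed by i, discrete actions by j).

```python
def fql_action(alphas, actions, q, chosen):
    """Eqs. (2)-(3): firing-strength-weighted average of each rule's
    chosen discrete action, and the Q-value of that inferred action."""
    w = sum(alphas)
    a = sum(al * actions[i][chosen[i]] for i, al in enumerate(alphas)) / w
    Q = sum(al * q[i][chosen[i]] for i, al in enumerate(alphas)) / w
    return a, Q

def fql_state_value(alphas, q):
    """Eq. (4): value of state x, using the greedy action of each rule."""
    w = sum(alphas)
    return sum(al * max(q[i]) for i, al in enumerate(alphas)) / w

def fql_update(q, e, dQ, lr=0.01):
    """Eq. (7): adjust every q-value along its eligibility trace, where
    dQ = r + gamma * V(y) - Q(x, a) is the TD error."""
    for i in range(len(q)):
        for j in range(len(q[i])):
            q[i][j] += lr * dQ * e[i][j]
```

The weighted-average form is what produces a continuous action from the finite set of discrete consequents, which is the point of embedding Q-learning in an FIS.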
4th International Conference Information & Communication Technology and System

2.2 Behavior Based Control

This paper considers a hierarchical control structure (Fig. 1) with two layers: a high-level controller and a low-level controller.

Figure 1. Behavior based control schema (high-level controller: behaviors, FQL behaviors and hybrid coordinator; low-level control; mobile robot)

The high-level controller is a behavior-based layer that consists of a set of behaviors and a coordinator. This paper uses the hybrid coordinator proposed by Carreras [13]. The hybrid coordinator takes advantage of both the competitive and the cooperative approaches, and allows the coordination of a large number of behaviors without the need for a complex design or tuning phase. The addition of a new behavior only implies the assignment of its priority with respect to the other behaviors. The hybrid coordinator uses the priority and the behavior activation level to calculate the output of the layer, which is the desired control action input to the low-level control system. Therefore, the response of each behavior is composed of an activation level and a control action, as illustrated in Fig. 2 [13].

Figure 2. Behavior normalization [13]

Before entering the coordinator, each behavior is normalized as described in Figure 2: S_i is the stimulus of the i-th behavior and r_i is the result of behavior normalization, consisting of the expected control action v_i and the activation level (degree of the behavior), 0 ≤ a_i ≤ 1. The behavior coordinator uses the r_i behavior responses to compose the control action of the entire system. This process is executed at each sampling time of the high-level controller.

The coordination system is composed of a set of n nodes. Each node has two inputs and one output: a dominant input and a non-dominant input. The response connected to the dominant input has a higher priority than the response connected to the non-dominant input. The node output consists of an expected control action v and an activation level a.
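Anticipating the node formulation of Fig. 3, a single hybrid coordination node can be sketched as below. This is a minimal illustrative sketch: the function name, the default k = 2, and the clamping of the action to [-1, 1] as a normalized range are assumptions for the example, not taken from the paper.

```python
def hybrid_node(v_d, a_d, v_nd, a_nd, k=2):
    """One hybrid coordination node: the dominant response passes through
    unchanged when fully active (a_d = 1); otherwise the non-dominant
    response is blended in with weight a_nd * (1 - a_d)**k."""
    a = min(1.0, a_d + a_nd * (1.0 - a_d) ** k)
    if a == 0.0:
        return 0.0, 0.0                     # neither behavior is active
    v = (v_d * a_d + v_nd * a_nd * (1.0 - a_d) ** k) / a
    v = max(-1.0, min(1.0, v))              # keep the action normalized
    return v, a
```

Chaining such nodes by priority yields the hierarchical structure: the merged response of one node is fed to the dominant or non-dominant input of the next node according to its priority.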
When the dominant behavior is fully activated, i.e. a_d = 1, the node output equals the dominant behavior; in this case the node behaves like competitive coordination. When the dominant behavior is partly activated, i.e. 0 < a_d < 1, the node output is a combination of the two behaviors, dominant and non-dominant. When a_d = 0, the node output behaves like the non-dominant behavior. A set of nodes constructs a hierarchy called Hierarchical Hybrid Coordination Nodes (HHCN). The node output is computed as:

a = a_d + a_nd (1 − a_d)^k,   k = 1, 2, 3, ...;   if (a > 1) then a = 1
V = V_d a_d / a + V_nd a_nd (1 − a_d)^k / a;      if (|V| > 1) then V = V / |V|

Figure 3. Mathematical formulation of the node output [13]

The low-level controller is a conventional controller, i.e. a PID controller. Its input is derived from the output of the high-level controller, namely the velocity setting that must be accomplished by the motors. This controller is responsible for controlling the motor speed so that the actual motor speed equals, or nearly equals, the velocity setting from the high-level controller.

2.3 Robot Design and Environment Model

To test the proposed schema, a cluttered environment was created as described in Figure 5. The environment of Figure 5 is considered cluttered for several reasons: first, there are many objects with various shapes and positions; second, the position of the target is hidden, which makes it difficult for the robot to find the target directly. The area of the environment is 1.6 m x 1.6 m. Figure 4 shows the robot used in testing the proposed schema. The robot has three range finder sensors, two light sensors and two touch sensors (bumpers).

Figure 4. Robot design
The environment model used in this paper is shown in Figure 5.

Figure 5. Environment model for simulation purposes

2.4 FQL and BBC for Robot Control

This paper presents a collaboration between fuzzy Q-learning and behavior based control. Most authors have developed fuzzy Q-learning to generate a single behavior, constructed by learning continuously to maximize the discounted future reward. However, most of them focus only on generating a behavior for a simple environment, as shown by Deng [10] and Er [11]. For a complex environment, it is necessary to incorporate FQL into a behavior-based schema. Therefore, this paper proposes a behavior based schema that uses hybrid coordination nodes [13] to coordinate the behaviors, whether generated by FQL or designed at the design stage. The proposed schema is adapted from [13] and described in Figure 6.

Figure 6. Fuzzy Q-learning in behavior based control (behaviors: stop, obstacle avoidance-FQL, target searching-FQL, wandering; hybrid coordinator; low-level controller)

In Figure 6, the high-level controller consists of four behaviors and one HHCN. The four behaviors are stop, obstacle avoidance-FQL, target searching-FQL, and wandering. The stop behavior has the highest priority and the wandering behavior the lowest. Each behavior is developed separately and there is no relation between behaviors. The outputs of the high-level controller are the speed setting for the low-level controller and the robot heading.

The wandering behavior's task is to explore the robot's environment to detect the existence of the target. Its activation parameter, a_tm, is 1 at all times. Its output is a speed setting that varies every few seconds.

The obstacle avoidance-FQL behavior is one of the behaviors generated by fuzzy Q-learning. Its task is to avoid every object encountered and detected by the range finder sensors. Its input is the distance data between the robot and the object from the three IR range finder sensors.
The output of the range finder sensors is an integer value from 0 to 1024. A zero value means that the object is far from the robot; on the contrary, a value of 1024 means that the robot has collided with the object. The action set consists of five actions: {turn-right, little turn-right, move-forward, little turn-left, turn-left}. The reinforcement function is derived directly from the task definition, which is to keep a wide clearance from the obstacles. The reinforcement signal r penalizes the robot whenever it collides with or approaches an obstacle. If the robot collides, a bumper is active or the sensor value exceeds 1000, it is penalized by a fixed value, -1. If the sensor value exceeds a certain threshold, d_k = 300, the reward is 0. Otherwise, the robot is rewarded by 1. The component of the reinforcement that teaches the robot to keep away from obstacles is:

      -1   if collision or d_s > 1000
r =    0   if d_s > d_k                  (8)
       1   otherwise

where d_s is the largest reading (i.e. the shortest distance) provided by any of the IR sensors while performing the action. The value of the activation parameter is proportional to the distance between the sensors and the obstacle.

The target searching behavior's task is to find and go to the target. The goal is to follow a moving light source, which is displaced manually. The two light sensors are used to measure the ambient light on different sides of the robot. The sensor values range from 0 to 1024. The action set consists of five actions: {turn-right, little turn-right, move-forward, little turn-left, turn-left}. The robot is rewarded
when it faces toward the light source, and receives a punishment in the other cases:

      -1   if d_s < 300
r =    0   if d_s < 800                  (9)
       1   otherwise

where d_s is the largest value provided by either light sensor while performing the action.

The stop behavior is fully activated when any light sensor value exceeds 1000. Its goal is to stop the robot when it reaches the light source at a certain distance.

3 SIMULATION RESULTS

To test the performance of the proposed control structure, eight experiments were conducted. The main goal is for the robot to find and reach the target, without any collision with the objects it encounters, as quickly as possible in the cluttered environment of Figure 5. From this task definition, there are three performance indicators: the robot's ability to reach the target, its ability to avoid collision with the obstacles, and the time needed to reach the target. The parameter values used in this paper are: α = 0.0001; λ = 0.3; γ = 0.9.

Figure 8. Local reward of FQL-obstacle avoidance

The local reward in Figure 8 gives more information about the performance of FQL-obstacle avoidance: the robot received many rewards and few penalties.

Figure 9. Reward accumulation of FQL-target searching

The performance of FQL-target searching can be analyzed from Figures 9 and 10. The reward accumulation tends toward -1: in this condition, the robot was trying to find the target while the target was still outside the robot's sensing range, so the robot was penalized by -1. After exploring the environment, the robot succeeded in detecting the target.

Figure 7. Reward accumulation of FQL-obstacle avoidance

Figure 7 shows the simulation results of eight trials for the reward accumulation of FQL-obstacle avoidance. In all trials, the robot succeeded in reaching the target, but the time spent to reach it differed. One trial spent more time than the others.
In that trial, the robot collided with more obstacles than in the others.
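The reward traces discussed above follow directly from the reinforcement definitions in Eqs. (8) and (9). A minimal sketch of both, assuming the sensor conventions of Section 2.4 (IR readings of 0 mean far and 1024 mean contact; light readings grow as the robot faces the source); the function names are illustrative:

```python
def obstacle_reward(d_s, collided, d_k=300, d_max=1000):
    """Eq. (8): d_s is the largest IR reading while performing the action."""
    if collided or d_s > d_max:
        return -1        # bumper hit or (near-)contact reading
    if d_s > d_k:
        return 0         # obstacle inside the clearance threshold
    return 1             # wide clearance: reward

def target_reward(d_s):
    """Eq. (9): d_s is the largest light-sensor reading while acting."""
    if d_s < 300:
        return -1        # light source not in view
    if d_s < 800:
        return 0
    return 1             # robot is facing the light source
```

The -1 plateau of the FQL-target searching accumulation corresponds to the first branch of `target_reward`: as long as both light readings stay below 300, every step is penalized.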
Figure 9. Local reward of FQL-target searching

Another test conducted to measure the performance of the FQL was to test the robot's ability to learn to reach the target from different starting points. Three different starting points were used. The simulation result is shown in Figure 10.

Figure 10. Robot trajectory from the different-starting-point test

The trajectories of Figure 10 show that the robot was able to reach the target although it started from different points, and that it avoided almost all of the obstacles it encountered. They also show some points where the robot collided with a wall or an obstacle (red circles).

Figure 11 shows a test of FQL-target searching. There is only one target, but its position was moved to another place each time the robot reached it. Three different target positions were tested, and Figure 11 shows the simulation result.

Figure 11. Robot trajectory from the different-target-position test

In the first run, the robot had to reach the first target position. After it reached the target, the target was moved to the second position, then to the third position once the robot reached the second one; finally, the robot reached the last position. The trajectory shows that the robot was able to track the target position wherever the target was. However, the robot was not able to avoid collision with some walls or obstacles (red circles). As Figure 11 shows, the corner positions are the most difficult for the robot to pass without collision. They confuse the robot in deciding which action to choose from the local discrete actions defined in fuzzy Q-learning: if the robot chooses the turn-left action, it collides with the wall on the left side; if it chooses turn-right, it collides with the wall on the right side. Therefore, the robot inevitably collides with the wall.
4 CONCLUSION

This paper proposes a control schema for the navigation system of an autonomous mobile robot in a complicated environment by incorporating fuzzy Q-learning into behavior based control. Two behaviors were generated by fuzzy Q-learning by learning the environment continuously. Simulation results demonstrate that the robot with the proposed schema is able to learn the right policy, to avoid obstacles and to find the target. However, fuzzy Q-learning failed to give the right policy for the robot to avoid collision in corner locations.
REFERENCES
[1] C. Touzet, "Neural Reinforcement Learning for Behaviour Synthesis", Robotics and Autonomous Systems, Special issue on Learning Robots: the New Wave, N. Sharkey Guest Editor, 1997.
[2] Yang, G.S., Chen, E.R., Wan, C. (2004), "Mobile Robot Navigation Using Neural Q-Learning", Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, China, Vol. 1, pp. 48-52.
[3] Huang, B.Q., Cao, G.Y., Guo, M. (2005), "Reinforcement Learning Neural Network to the Problem of Autonomous Mobile Robot Obstacle Avoidance", IEEE Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, Vol. 1, pp. 85-89.
[4] Jouffe, L., "Fuzzy Inference System Learning by Reinforcement Methods", IEEE Transactions on Systems, Man, and Cybernetics Part C: Applications and Reviews, Vol. 28, No. 3, August 1998.
[5] Glorennec, P.Y., Jouffe, L., "Fuzzy Q-learning", Proceedings of the Sixth IEEE International Conference on Fuzzy Systems, Vol. 2, 1997, pp. 659-662.
[6] Charles W. Anderson, Douglas C. Hittle, Alon D. Katz, and R. Matt Kretchmar, "Synthesis of Reinforcement Learning, Neural Networks, and PI Control Applied to a Simulated Heating Coil", Elsevier: Artificial Intelligence in Engineering, Vol. 11, No. 4, October 1997, pp. 421-429.
[7] Tomoharu Nakashima, Masayo Udo, and Hisao Ishibuchi, "Implementation of Fuzzy Q-Learning for a Soccer Agent", The IEEE International Conference on Fuzzy Systems, 2003.
[8] Ishibuchi, H., Nakashima, T., Miyamoto, H., Chi-Hyon Oh, "Fuzzy Q-Learning for a Multi-Player Non-Cooperative Repeated Game", Proceedings of the Sixth IEEE International Conference on Fuzzy Systems, Vol. 3, 1997, pp. 1573-1579.
[9] Ho-Sub Seo, So-Joeng Youn, Kyung-Whan Oh, "A Fuzzy Reinforcement Function for the Intelligent Agent to Process Vague Goals", 19th International Conference of the North American Fuzzy Information Processing Society (NAFIPS), 2000, pp. 29-33.
[10] C. Deng, M. J. Er and J. Xu, "Dynamic Fuzzy Q-Learning and Control of Mobile Robots", 8th International Conference on Control, Automation, Robotics and Vision, Kunming, China, 6-9 December 2004.
[11] Meng Joo Er and Chang Deng, "Online Tuning of Fuzzy Inference Systems Using Dynamic Fuzzy Q-Learning", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 34, No. 3, June 2004.
[12] Watkins, C., Dayan, P. (1992), "Q-learning, Technical Note", Machine Learning, Vol. 8, pp. 279-292.
[13] Carreras, M., Yuh, J., Batlle, J., Ridao, P., "A Behavior-Based Scheme Using Reinforcement Learning for Autonomous Underwater Vehicles", IEEE Journal of Oceanic Engineering, Vol. 30, No. 2, April 2005.