A Hybrid Neuro-Fuzzy Approach to Intelligent Behavior Acquisition in Complex Multi-Agent Systems


ALI AKHAVAN BITAGHSIR, FATTANEH TAGHIYAREH, AMIR-HOSSEIN SIMJOUR, AMIN KHAJENEJAD
Electrical and Computer Engineering Department, Faculty of Engineering, University of Tehran, IRAN

Abstract: In recent years, multiagent systems (MAS) have emerged as an active subfield of Artificial Intelligence (AI). Because of the inherent complexity of MAS, there is much interest in using Machine Learning (ML) techniques to help build multiagent systems. Moreover, in complex systems for which acquiring a mapping from the system's inputs to the appropriate outputs is not simple, a good paradigm for converging the system's functionality toward the goal is clearly needed. We propose a layered paradigm inspired by the incremental learning model. Our approach to using ML and fuzzy logic as tools for developing intelligent firefighter robots involves layering increasingly complex learned behaviors. In this article, we describe multiple levels of learned behaviors, ranging from low-level environmental behaviors to more complex, high-level behaviors, and we verify empirically that the learned behaviors perform well in disaster situations. The findings suggest that a hybrid solution combining fuzzy logic and artificial neural networks provides both robustness and advanced learning ability.

Keywords: Artificial neural networks, fuzzy logic, layered learning, RoboCup Rescue Simulation System (RCRSS), multi-agent systems.

1 Introduction

In recent years, multiagent systems (MAS) have emerged as an active subfield of Artificial Intelligence (AI). Because of the inherent complexity of MAS, there is much interest in using Machine Learning (ML) techniques to help deal with this complexity [2, 3]. RoboCup Rescue is a particularly good domain for studying MAS [10]. The test-bed has enough complexity to be realistic, and good multiagent ML opportunities have made this domain a challenging area for MAS researchers.
Our approach to the acquisition of intelligent behaviors for fire extinguishment by a team of fire fighters is to break down the complexity of decision making step by step, solving simpler tasks first and going on to acquire higher-level team behaviors (strategies) after learning the low-level behaviors. This idea is mainly inspired by Incremental Evolution (discussed in [4]), a method for avoiding the limitations of direct evolution in difficult problems where the percentage of the search space that constitutes a solution is very small and the fitness landscape is very rugged. In such cases the probability of producing fruitful individuals in the initial random population is low and evolution makes no progress; the population gets trapped in suboptimal regions of the fitness landscape during the early stages of evolution. One way to scale ML algorithms to tasks that are too difficult to evolve directly is to begin by viewing the task we want to solve as a member of a family of

tasks, ranging from simple to complex. As the tasks get more difficult, the solution set becomes smaller; but because successive tasks are related (depending on the chosen abstraction level of the task decomposition), each task positions the population in a good region of the space for solving the next task. Eventually, if the tasks are generated properly, the goal task can be achieved. Our layered paradigm is inspired by the incremental learning model discussed above. Our research focuses on acquiring behaviors for tasks in which a direct mapping from inputs to outputs is intractable. Previously, hierarchical reinforcement learning has been studied, motivated by the well-known "curse of dimensionality" in reinforcement learning (RL). As surveyed in [5], most hierarchical RL approaches use gated behaviors: there is a collection of behaviors mapping environment states into low-level actions, and a gating function decides which behavior must be executed [6, 7]. The MAXQ algorithm [8] and feudal Q-learning [9] learn at all levels of the hierarchy simultaneously. A constant among these approaches is that the behaviors and the gating function are all control tasks with similar inputs and actions; in our research, by contrast, the input representation of different layers may itself be learned in lower levels. Moreover, none of the above methods has been implemented in a large-scale, complex domain. More in line with this type of learning is the work presented by Stone [1]. In that approach a layered model was tested on the RoboCup Soccer server, which is a complex domain; three abstraction levels were used for learning a soccer-playing robot's behavior. In this paper, however, the abstraction levels for learning the robots' behavior are introduced in a different manner: in our domain of discourse the layers need not represent robot behaviors. Instead, we may learn the environment's behavior (model) in a layer, in order to provide more robust decision making in higher levels.
This paper contributes a concrete instantiation of layered learning in a complex multiagent domain, namely the RoboCup Rescue Simulation System. In section 2 the layered paradigm is formalized. A brief specification of the simulated RoboCup Rescue robots is given in section 3. Our choice of layers, together with the implementation, is described in section 4. In section 5 the results of the proposed method are discussed; finally, in section 6, we conclude and discuss directions for future work.

2 The Layered Learning Paradigm

The layered learning paradigm is designed for domains in which a direct mapping from the input representation to the output representation cannot be tractably acquired. Our research involves layering increasingly complex behaviors of both the rescue robot controller and the environment itself. In this section we present a formalism much like the one in [1], but with the modifications required by the robot controller's complex dependency on the environment's behavior. The major characteristic of the paradigm is that the output of each layer can directly affect at least one of the subsequent layers by (I) supplying the features used for learning, and (II) forming the training example set. Besides, the output of each layer can give us more knowledge about the contribution of previous layers to the final goal; for example, we may realize that previous layers are polluted by noisy or irrelevant features, which leads us to revise the feature selection process and repeat the layered approach.

2.1 Formalism

Consider the learning task of acquiring a function f from among a class of functions F which map a set of input features (world-state sensory information) I to a set of outputs O, such that, based on a set of training examples, f is the most likely (of the functions in F) to represent unseen examples and provide the appropriate output. In order to accomplish the task, several layers

{L_1, L_2, ..., L_n} are introduced, based on the designer's prior knowledge of the domain, each complying with the following form:

    L_i = ( I_i, O_i, f_i, (M_i, T_Mi) )    (1)

in which:

I_i is the set of inputs (features) selected by the designer; I_i^j denotes the j-th feature of layer i, acquired either directly from the environment or from previous layers' outputs; each environment-supplied feature is a member of I.

O_i is the set of outputs, which may describe the environment's behavior (model) or the robots' appropriate action for the corresponding subtask of this layer (if any). O_n = O.

f_i is the approximated function which maps I_i into O_i.

M_i : f_i is acquired either by means of a machine learning algorithm or by a manual method which may exploit inherent knowledge of the domain. The method used in this layer is called M_i.

T_Mi : in case a machine learning algorithm is used as M_i, a set of training examples T_Mi is fed into M_i.

It is noteworthy that f_i provides one or more inputs I_k^j for some of the subsequent layers, where i < k <= n and j is an arbitrary index. In the following sections, each layer is described in detail.

3 RoboCup Rescue Simulation Environment

The RoboCup Rescue Simulation System (RCRSS) is designed to simulate the rescue mission problem of the real world [10]. In this simulation system, a communication center and a number of simulators exist to simulate the traffic after an earthquake, fire accidents resulting from gas leakage, road blockages, etc. [10]. The RCRSS environment is a heterogeneous multi-agent system in which the agents correspond to the agents involved in a real rescue mission. The types of agents in this domain are as follows:

- Fire brigade: This agent is responsible for extinguishing burning buildings.
- Fire station: It organizes the work of the fire brigades.
- Rescue, Police and their Center agents: These are the other platoon agents, each responsible for its specific task in the rescue simulation environment.

Our research goal is to arrive at effective fire extinguishment behavior for the fire brigade agents.
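The layer tuple in equation (1) can be written down directly in code. The sketch below is our illustrative rendering of the formalism; the field names and the two example layers are ours, not code from the paper:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

@dataclass
class Layer:
    """One layer L_i = (I_i, O_i, f_i, (M_i, T_Mi)) of the layered paradigm."""
    inputs: List[str]                    # I_i: features from the env or earlier layers
    outputs: List[str]                   # O_i: env model or robot action for this subtask
    f: Optional[Callable] = None         # f_i: the learned mapping I_i -> O_i
    method: str = "manual"               # M_i: ML algorithm or hand-coded method
    training_set: Sequence[Tuple] = ()   # T_Mi: examples fed to M_i (if ML is used)

# Layer 1 in the paper learns the environment's fire-spread model EF(B):
L1 = Layer(inputs=["TotalArea", "Fieryness", "Distance", "BurningTime"],
           outputs=["EF(B)"], method="neural network")
# A later layer may consume L1's output as one of its own input features:
L3 = Layer(inputs=["EF(B)", "EX(B)"], outputs=["P(B)"], method="fuzzy rule base")
assert set(L1.outputs) <= set(L3.inputs)  # output of L1 feeds a subsequent layer
```

The assertion at the end expresses the paradigm's defining constraint: each layer's output must supply features (or training data) to at least one subsequent layer.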
The system simulates the city of Kobe for 300 cycles (each cycle corresponding to one minute in the real world) after the earthquake [10]. In each cycle, the fire simulator simulates fire propagation in the city by means of pre-computed statistical information gathered from the real Kobe earthquake of 1995. The agents' final performance is assigned in proportion to the unburned buildings at the end of the simulation. In each cycle, each fire brigade agent can send one of the following actions to the system's kernel: (I) Extinguish(B), for which the simulator extinguishes (decreases the burning of) building B in proportion to the maximum amount of water a fire brigade can supply; (II) Move(R), in which R is a route plan; the traffic simulator moves the fire brigade along R in the next cycle. By regulation, an unburned building can be ignited by one of its neighboring burned buildings. Let us call a group of neighboring burned buildings a fire site. In order to stop the spread of an existing fire to unburned buildings (and thus achieve a higher final performance), the fire fighters must first try to extinguish the boundary buildings of each fire site (see figure 1).

4 Implementation and Experiment

In this section we illustrate our layered approach via a full-fledged implementation in RCRSS [11]. Here, the high-level goal is for a team of fire brigade agents to achieve complex collaborative behavior.

4.1 The Fire Spread Speed

First, the agents learn a basic environmental behavior: the fire spread speed. As mentioned before, the potential buildings for burning are those neighboring at least one of the border buildings of a fire site. We chose to have our agents learn this behavior by means of a ML algorithm, because the fire simulator's behavior in this case is so complex that fine-tuning an approximating function by hand is difficult. We provided our agents with a large number of training examples and used a supervised learning technique: neural networks (M_1). A fully connected neural network (f_1) with 6 hidden sigmoid units and a learning rate of 0.7 was trained. I_1 consists of the following parameters, gathered from the environment at regular time intervals: the potential building B's Total Area, and {Fieryness of B_i, Distance between B and B_i, Burning Time of B_i} where B_1, B_2, B_3 are the three nearest buildings to B, respectively; Fieryness is the state that specifies how much a building is burning [11]. Also O_1 = {EF(B)}, where EF(B) is the expected time for building B to be ignited. T_M1 was constructed by sampling the environment's parameters at regular time intervals 1,000 times, and the neural network was trained for 5,000 epochs. (Inputs and outputs are normalized to [0, 1].)

The network was trained with Joone [12] (a Java package for training and using neural networks), which gives us the opportunity to serialize the trained neural network's weights and biases into a file for real-time use during the simulation: at the first cycles of the simulation, each agent loads this file into its memory. Our learning method is thus off-line in this case, providing the same knowledge to all the agents in their mission domain. The RMSE of the NN at the end of training was approximately 0.04. For unseen data the trained network also performs well, with an average error of 8 cycles for EF(B), where the range of EF(B) is 50 cycles and the average is taken over 500 patterns.

Figure 1. Border buildings are marked with the * symbol. Four fire brigade agents are extinguishing a building on the border.

4.2 The Effect of Collaboration

Collaboration is a key idea for successful teamwork in RCRSS, as in other multiagent systems. Although the effect of extinguishment is not defined explicitly, it may not be difficult for even a few fire brigades to extinguish an early fire; on the contrary, it is difficult for even many brigades to extinguish a late, large fire. Consequently, the agents should be aware of the effect of collaboration in fire extinguishment for further decision making. In this layer (L_2), we aim to understand the environment's feedback to the joint effort of k fire brigade agents extinguishing a specific building B. As in the previous layer, the complexity leads us to use a ML algorithm, again a neural network (M_2). I_2 comprises both the target building's characteristics and the number of collaborating agents: I_2 = {B's Total Area, B's Fieryness, B's Burning Time, NP, MinDistance, NCol}, where NP is the number of B's neighboring buildings which could potentially ignite B, MinDistance is the minimum distance from such buildings to B, and NCol is the number of collaborating agents. Also O_2 = {EX(B)}, where EX(B) is the expected time for the collaborating agents to extinguish building B.
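Both EF and EX are learned with small feed-forward networks. The paper trains them with Joone in Java; as a rough, hypothetical sketch of the same workflow (one sigmoid hidden layer, squared-error backpropagation, then serializing the weights for the agents to load), here in numpy with synthetic stand-in data:

```python
import pickle

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sizes and learning rate mirror the paper's second network (6 inputs, 5
# sigmoid hidden units, lr = 0.7); inputs and targets are normalized to [0, 1].
n_in, n_hidden, lr = 6, 5, 0.7
W1 = rng.normal(0.0, 0.5, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, (n_hidden, 1));    b2 = np.zeros(1)

X = rng.random((500, n_in))            # stand-in for the 500 sampled patterns
y = X.mean(axis=1, keepdims=True)      # stand-in target, e.g. EX(B)

for epoch in range(5000):              # the paper trains for about 5,000 epochs
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1.0 - out)    # dE/dnet at the output unit
    d_h = (d_out @ W2.T) * h * (1.0 - h)     # backpropagated to the hidden layer
    W2 -= lr * (h.T @ d_out) / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * (X.T @ d_h) / len(X);   b1 -= lr * d_h.mean(axis=0)

rmse = float(np.sqrt(np.mean((out - y) ** 2)))
blob = pickle.dumps((W1, b1, W2, b2))  # agents would load this at the first cycle
```

The serialized weight blob plays the role of the file the paper's agents read at the start of the simulation, which is what makes the learning off-line and the knowledge identical across agents.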

A fully connected neural network (f_2) with 6 input units and 5 hidden sigmoid units was trained; the learning rate was again 0.7. The network was provided with 500 input patterns (T_M2) gathered from several simulation runs. In each run, the agents try to extinguish a specific target building, logging data from the environment at regular time intervals to construct T_M2. The NN was trained for 5,000 epochs; after nearly 4,500 epochs the network's RMSE no longer decreased, so we stopped training at that point, since with further training the NN may start to learn the noise in the input patterns. We provided the NN with training examples in which the NCol parameter ranges from 1 to 6. Our NN has the strength of generalizability (when NCol > 6); however, if the maximum number N of collaborating agents in a simulation is known in advance, one may train N different neural networks separately (providing the i-th NN only training examples with NCol = i) in order to have more specialized NNs. Now that the agents are provided with this primary knowledge, they must be able to plan an intelligent strategy for extinguishing a whole fire site.

4.3 Extinguishing a Fire Site

In order to extinguish a whole fire site, the agents use their learned functions (EF, EX) to decide which building is the most urgent to extinguish in each situation. In this layer (L_3), each building B is assigned a priority value for extinguishment, P(B) (O_3), based on its influence on the unignited buildings. Given this priority, all the agents rush to the building with the maximum priority value by sequentially sending Move commands to the system's kernel. Once all agents are located within a certain distance of the target building, they collaboratively extinguish it by sending Extinguish commands. (In our simulation, the agents work together at all times.) After extinguishing this building, the agents re-evaluate the other buildings' priority values and choose their next target. The agents repeat this process until no burning building remains in the fire site.

Since the parameters used for decision making in this layer can be noisy and inaccurate, a fuzzy rule-based system is used to evaluate the priority of a building. The developed fuzzy system uses a singleton fuzzifier, the Larsen inference engine and the leftmost-maximum defuzzifier [13]. First, the fuzzy system assigns to each unburned building B a danger value D(B), which indicates the potential damage that B can impose on the system's performance when ignited. According to our observations, the more area a building has, the later the fire brigades manage to extinguish it (causing lower performance); consequently, D(B) is in direct proportion to B's total area. On the other hand, the dangerousness (potential damage) of an unburned building depends on its expected time to ignition: the sooner a building is ignited, the more damage it imposes on the system's performance in the long run. The system uses these two linguistic variables (I_3) to determine D(B) as a crisp value between 0 and 100. The membership functions of EF(B), B's area and D(B) in the High, Average and Low sets are depicted in figure 2, and the corresponding values for the labels are given in figure 3. The knowledge base of the fuzzy system consists of 9 fuzzy rules: one rule is generated for each of the 9 membership combinations of the linguistic variables. For example, the entry in the first row and third column of the danger table in figure 3 corresponds to the following rule:

if EF(B) is LOW and B's Total Area is HIGH then D(B) is HIGH
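The danger evaluation can be sketched as a tiny fuzzy system. The triangular membership functions, label values and all rules except the quoted one are invented placeholders (the paper's actual shapes and tables are in figures 2 and 3), and defuzzification is reduced here to picking the crisp value of the strongest rule; rule strength follows the Larsen (product) scheme the text names:

```python
def tri(x, a, b, c):
    """Triangular membership function with corners a <= b <= c, peak at b."""
    if x < a or x > c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x > b:
        return (c - x) / (c - b)
    return 1.0

# Hypothetical label definitions for both inputs, scaled to [0, 1]:
LABELS = {"LOW": (0.0, 0.0, 0.5), "AVERAGE": (0.0, 0.5, 1.0), "HIGH": (0.5, 1.0, 1.0)}
# Hypothetical crisp output value for each danger label (a figure 3 analogue):
D_VALUE = {"LOW": 20.0, "AVERAGE": 50.0, "HIGH": 80.0}

# One rule per combination of the two linguistic variables (9 rules total).
# Only (LOW EF, HIGH area) -> HIGH danger is from the paper; the rest are guesses.
RULES = {("LOW", "HIGH"): "HIGH", ("LOW", "AVERAGE"): "HIGH", ("LOW", "LOW"): "AVERAGE",
         ("AVERAGE", "HIGH"): "HIGH", ("AVERAGE", "AVERAGE"): "AVERAGE",
         ("AVERAGE", "LOW"): "LOW", ("HIGH", "HIGH"): "AVERAGE",
         ("HIGH", "AVERAGE"): "LOW", ("HIGH", "LOW"): "LOW"}

def danger(ef_norm, area_norm):
    """Crisp D(B) in [0, 100] from normalized EF(B) and total area."""
    best_label, best_strength = None, 0.0
    for (l_ef, l_area), l_d in RULES.items():
        s = tri(ef_norm, *LABELS[l_ef]) * tri(area_norm, *LABELS[l_area])
        if s > best_strength:
            best_label, best_strength = l_d, s
    return D_VALUE[best_label] if best_label else 0.0

print(danger(0.1, 0.9))  # soon-to-ignite, large building -> 80.0 (HIGH danger)
```

A large building expected to ignite soon fires the (LOW, HIGH) rule most strongly, so it receives the highest danger value, matching the paper's example rule.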

Figure 2. The general membership function diagrams for all linguistic variables.

Figure 3. The labels' corresponding values used in fuzzy inference (left); the table for evaluating dangerousness (middle); the table for evaluating priority (right).

Now that D(B) has been evaluated for each unburned building, the agents should determine the most urgent burning building for extinguishment. For this purpose, a second fuzzy system is used to evaluate the extinguishment priority P(B) of each burning building B. P(B) is evaluated from three parameters: (1) EX(B): the more time extinguishing building B takes, the lower B's priority, because of the time the agents lose while extinguishing it; (2) the maximum value of D among B's neighbors, max(D(B_i)); (3) the distance between B and its most dangerous neighboring building, Dis(B). The second fuzzy system's configuration is much like the first one's. The membership functions of max(D(B_i)) and Dis(B) are the same as before (figure 2), and the corresponding label values are given in figure 3; the only difference is EX(B), for which only two labels (HIGH, LOW) are used. The knowledge base is constructed like the previous system's (based on max(D(B_i)) and Dis(B)), except for the dashed entries. For these entries the following rules were used:

(1) if Dis(B) is LOW and D(B) is LOW and EX(B) is LOW then P(B) is AVERAGE
(2) if Dis(B) is LOW and D(B) is LOW and EX(B) is HIGH then P(B) is LOW

These rules distinguish the problem features at this entry by the EX(B) variable; the same pattern is used for the other two dashed entries, (1, 2) and (2, 1). Finally, a crisp value for P(B) is derived from the fuzzy system, and the fire brigade agents choose the building with maximum priority as their next target.

5 Results

In order to evaluate the proposed method, the developed team of fire brigade agents was tested several times with different configurations of the city.
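For reference, the overall strategy evaluated in these experiments, greedily extinguishing the building with maximum fuzzy priority, can be recapped in a short hypothetical sketch; `priority`, `move_all` and `extinguish_all` stand in for the fuzzy P(B) evaluation and the kernel's Move/Extinguish commands described above:

```python
class Site:
    """Minimal stand-in for a fire site: just the set of burning buildings."""
    def __init__(self, burning):
        self.burning = set(burning)

def extinguish_fire_site(site, priority, move_all, extinguish_all):
    """Greedy loop: all brigades pick, reach, and extinguish the top building."""
    while site.burning:
        target = max(site.burning, key=priority)  # building with maximum P(B)
        move_all(target)                          # Move commands, sent sequentially
        extinguish_all(target)                    # collaborative Extinguish commands
        site.burning.discard(target)              # then re-evaluate the priorities

# Toy demonstration with fixed priorities instead of the fuzzy system:
order = []
site = Site({"A", "B", "C"})
P = {"A": 10, "B": 30, "C": 20}
extinguish_fire_site(site, P.get, lambda b: None, order.append)
print(order)  # ['B', 'C', 'A'] (highest priority first)
```

In the real system the priorities are recomputed each pass, since D(B) and EX(B) change as the fire evolves; the fixed dictionary here is only for illustration.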
In previous work [14], another "state of the art" approach was presented, in which the fire brigade agents extinguish the fire site by dividing it into several sectors (assuming the fire site to be a circle) and

select the sectors to extinguish based on a cost function implemented in the "Eternity" rescue simulation team. The agents extinguish all of the buildings in one sector before going to the next; the most urgent buildings for extinguishment are found with "state of the art" algorithms that include no machine-learned components. The team took 4th place in the 2003 international RoboCup rescue simulation league and later won the championship in the CIS2003 competitions. In figure 4, our proposed method's performance is compared with Eternity's as a function of the initial size of the fire site. The comparison was done for two configurations: in the first, the city has no road blockages; in the second, the city is simulated with road blockages generated by the RCRSS GIS [11]. The results show the average performance of the two methods on the specified initial city map. Performance is evaluated according to the RoboCup 2003 regulations:

    Performance = B / B_0

where B_0 is the total area of all buildings and B is the area of the unburned buildings at the end of the experiment.

Figure 4. Performance comparison between the proposed method and Eternity's approach. In the first configuration the city is simulated with no road blockages (left); the second configuration considers road blockages (right).

The results show that in the first configuration, our method works better than Eternity's approach for nearly all of the tested initial fire site sizes. As the initial size of the fire site increases, Eternity's performance gets closer to our method's. This is because for larger fire sites the agents' movement cost between the site's buildings increases, which lowers the proposed method's performance; in Eternity's approach, by contrast, the agents do not move to the next sector of the fire site until they have extinguished the current sector's buildings, and thus pay much less in movement costs.
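The scoring formula is straightforward to compute. A minimal helper, with hypothetical building records:

```python
def performance(buildings):
    """Score = unburned building area / total building area (0 if no buildings)."""
    total = sum(b["area"] for b in buildings)
    unburned = sum(b["area"] for b in buildings if not b["burned"])
    return unburned / total if total else 0.0

# Hypothetical end-of-simulation state: 320 of 400 area units survived.
city = [{"area": 120.0, "burned": False},
        {"area": 80.0, "burned": True},
        {"area": 200.0, "burned": False}]
print(performance(city))  # 320 / 400 = 0.8
```

Because the score is area-weighted, saving a few large buildings can outweigh saving many small ones, which is consistent with the fuzzy system's use of total area in D(B).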
In the second configuration, however, there are road blockages in the city, which impose severe movement costs on the agents. As the diagram shows, our proposed method's performance is better for fire sites with initial size less than 7, but as the initial size increases, Eternity's performance becomes better. The road blockages in the second configuration cause more movement cost than in the first; thus Eternity has the opportunity to achieve better performance on large fire sites, while for smaller fire sites our proposed method performs better thanks to its lower movement cost.

6 Conclusion and Future Work

This paper has presented a layered paradigm for acquiring a function which tries to maximize the fire brigade agents' performance, and illustrated it with a fully implemented application in the RoboCup Rescue Simulation System (RCRSS), a challenging and complex multiagent control problem. The results showed that using the layered paradigm for such a complex task leads to significant improvement in the simulated robots' performance. It is noteworthy that this paradigm gives us the opportunity to directly combine different ML algorithms within a hierarchically decomposed task representation; thus one may exploit both the robustness of fuzzy logic and the strong learning capability of neural networks. To find out whether the proposed model is suitable for other domains, applying it to other control problems is necessary; based on the acquired results, there is a strong possibility that applying the proposed method to other domains will bring about similar performance improvements. Important directions for future work are to (I) design, revise and develop further layers in this domain; and (II) apply this paradigm to other complex domains. For instance, it would be beneficial to introduce another layer above the learned layers which is responsible for dividing the fire brigade agents among different fire sites (when several sites exist); this would introduce a higher level of cooperation among the agents. The findings support the conclusion that the layered learning paradigm's power derives from simultaneously using different ML algorithms within a hierarchical task representation.

References:
[1] Peter Stone. Layered Learning in Multi-Agent Systems. PhD thesis, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, December 1998. Available as technical report CMU-CS-98-187.
[2] AAAI. Adaptation, Coevolution and Learning in Multiagent Systems: Papers from the 1996 AAAI Spring Symposium, Menlo Park, CA, March 1996. AAAI Press. AAAI Technical Report SS-96-01. Sandip Sen, chair.
[3] Gerhard Weiß and Sandip Sen, editors. Adaptation and Learning in Multi-Agent Systems. Springer Verlag, Berlin, 1996.
[4] Faustino John Gomez. Robust Non-linear Control through Neuroevolution. Technical Report AI-TR-03-303, The University of Texas at Austin, August 2003.
[5] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, May 1996.
[6] Long-Ji Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1993.
[7] Pattie Maes and Rodney A. Brooks. Learning to coordinate behaviors. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 796-802. Morgan Kaufmann, 1990.
[8] Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, 1998.
[9] Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann, San Mateo, CA, 1993.
[10] Hiroaki Kitano, Satoshi Tadokoro, Itsuki Noda, Hitoshi Matsubara, Tomoichi Takahashi, Atsushi Shinjou, and Susumu Shimada. RoboCup Rescue: Search and rescue in large-scale disasters as a domain for autonomous agents research. In Proc. of IEEE Conf. on Systems, Man and Cybernetics, December 1999.
[11] The RoboCup Rescue Technical Committee. RoboCup Rescue Simulation Manual, version 0, July 2000.
[12] Joone's Technical Reference; see the Joone project web site for more information.
[13] Li-Xin Wang. A Course in Fuzzy Systems and Control. Prentice-Hall International, Inc., 1997.
[14] E. Mahmoudi, S. Mirarab, N. Hakimpour, A. Akhavan Bitaghsir. Eternity's Team Description. In RoboCup 2003, Springer Verlag, 2004 (to appear).