A Hybrid Intelligent Learning Algorithm in MAS

SHUJUN ZHANG, QINGCHUN MENG, WEN ZHANG, CHANGHONG SONG
Computer Science Department, Ocean University of China, CHINA

Abstract: Machine learning is a major branch of AI and a main research direction of multi-agent systems (MAS). With the emergence of various new complex systems, the agent's individual ability and the system's intelligence level urgently need to be improved. Agent learning methods and agent architecture are studied in this paper and a new hybrid intelligent learning algorithm based on multi-agent is proposed. The application and results in the soccer robot simulation system show the feasibility and efficiency of this algorithm.

Key-words: multi-agent systems (MAS), hybrid intelligent learning, robot soccer

1. Introduction

Machine learning has been a main research direction in multi-agent systems (MAS). For a long time, research work focused on single-agent learning techniques, and now, with the development of such disciplines as Artificial Intelligence and Intelligent Control, many mature agent learning techniques have been proposed, such as Artificial Neural Networks (ANN), Genetic Algorithms (GAs), Reinforcement Learning (RL) [1,2] and so on. But these techniques are mainly used for individual agent learning; each agent's learning is separate even in a multi-agent environment. In order to solve more complex problems, collaborative learning in MAS [3] has appeared as a new focus. A system requires all the agents to cooperate to complete complex tasks that cannot be finished by a single agent. Therefore the following two points become the major research content of MAS: how to take effective actions through learning when multiple agents work together, and how this cooperative behavior evolves over time to adapt to the dynamically changing environment [4]. At present, the robot soccer system is the most challenging subject in the AI field, and especially the simulator is a good research platform for MAS and machine learning [5]. Such collaborative, adversarial, real-time and complex systems require every agent to show effective performance not only autonomously but also as a member of a team [6]. Therefore, each agent should have both individual skills and collaborative learning ability. This paper studies individual learning and group learning, proposes a hybrid intelligent learning algorithm based on multi-agent together with the corresponding agent architecture, and applies it to the robot soccer system.

2. Individual learning and group learning

Each agent in MAS has its own cognition and behavior. In general, it is at two correlative layers at the same time: the individual layer and the group layer. Every agent is a balanced combination of both layer roles. Generally, agents' learning methods fall into two categories.

Individual learning: independent learning for a single agent. The agent learns the ability of interacting with the environment, such as basic actions, commands and communication.

Group learning: the agent learns group behaviors, i.e. the interactive and cooperative ability with the other agents in the same system. The training procedure is carried out in a complex environment made up of multiple agents.

Reasoning based on a single agent has many shortcomings, because the independent problem-solving ability of any single agent is limited and there are too many factors to be considered simultaneously. In general, individual learning only handles local objectives and local information relating to the agent itself in order to perform autonomously, so there are no connections between the agents. If the system only focuses on the improvement of a single aim (the agent's individual ability) while ignoring the interaction of agents, collaboration cannot be achieved. However, interaction of agents is an essential condition of learning in MAS. On the other hand, group learning alone sometimes cannot take an agent's particular needs into account. For example, when the improvement of individual ability and the integral performance conflict, group learning may result in the sacrifice of the agent's individual ability, because it stresses the efficiency of the whole system. To overcome their respective shortcomings, a hybrid intelligent learning algorithm based on multi-agent is proposed, allowing for interaction between agents without the loss of the personality and features of each agent. This is a multi-agent interactive learning method and enables every agent to bring its intelligence and autonomous behavior into full play so as to cooperate better with the other agents in the system [7].

3. Hybrid intelligent learning algorithm (HILA) based on multi-agent

3.1 Preliminaries

There are two kinds of behaviors in MAS: individual behaviors and group behaviors. Only when an agent's belief and objective relate to other agents and its behavior is exactly based on such belief and objective should the behavior be called a group behavior. That is, whether a behavior of an agent belongs to the group behaviors does not depend on the external description of the behavior itself but on whether it involves and affects other agents [8]. Otherwise, behaviors that only involve but do not have any effect on other agents belong to individual behaviors. New algorithms should distinguish individual behaviors from group behaviors in order to carry out learning at different levels and complexities. According to their different objects, group behaviors can also be divided into cooperative behaviors and adversarial behaviors. In addition, there are more problems to be considered:

(1) Difference and interaction between agents

Agents in a distributed open environment are usually heterogeneous, dynamic and unpredictable. These features render their collaboration much more difficult and complex. As the whole performance and efficiency of the system have a close relationship with the agents' division of labor and coordination, the learning process must allow for the difference between agents and the dispersiveness of concerted control. Many independent agents gathering together do not necessarily form a system, because a system must be an organism in which the way the agents interact is defined by the environment and the agents' behavior modes. One of the agents' cooperative methods in MAS is distributed control based on local knowledge. It enables each agent to have a certain autonomy. In comparison with centralized control, this method increases flexibility and alleviates the bottleneck of control. But if every agent's performance is constrained by local and incomplete information (such as local objectives and local plans), global concert is hard to achieve. Therefore, it is very important to embed collaborative knowledge in each agent. This knowledge includes the abilities, objectives, plans, interests and behaviors of other agents, and mutual dependency information.
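To make the idea concrete, a minimal sketch of how such collaborative knowledge could be held inside each agent is given below. The class and field names are illustrative assumptions of ours; the paper does not define concrete data structures.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CollaborativeKnowledge:
    """Illustrative per-agent record of what is known about the other agents."""
    abilities: Dict[str, List[str]] = field(default_factory=dict)     # agent id -> known skills
    objectives: Dict[str, str] = field(default_factory=dict)          # agent id -> current goal
    plans: Dict[str, List[str]] = field(default_factory=dict)         # agent id -> announced plan steps
    dependencies: Dict[str, List[str]] = field(default_factory=dict)  # agent id -> agents it depends on

    def update_from_announcement(self, sender: str, goal: str, plan: List[str]) -> None:
        # Merge information another agent announced about itself.
        self.objectives[sender] = goal
        self.plans[sender] = plan

Keeping such a record local to each agent is one way to reconcile distributed control with globally coherent behavior, since every decision can consult what is known about the teammates without a central controller.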
Through information exchange, interactive agents can change the environment in which they exist and influence other agents' learning. In particular, when many agents as a team try to perform a learning task which a single agent cannot, that is, to carry out group learning, interaction becomes the key problem [9]. An effective learning algorithm must consider the difference and interaction between agents in order to assign learning tasks better and adapt to the dynamically changing complex environment.

(2) Multi-agent negotiation and automatic negotiation

Negotiation is a concrete form of agent interaction. Since every agent in the MAS is autonomous, there may be conflicts if each of them executes systematic tasks only according to its own objective, knowledge and ability. Therefore, learning algorithms must take account of negotiation and automatic negotiation between agents in order to enable an agent to reason based on other agents' beliefs and, furthermore, to learn from their behaviors and improve their mutual understanding and cooperation. The embodiment in the algorithm is that the effect on the environment, and the feedback an agent gets from the environment after it has executed an action at a certain time step in a learning process, are related not only to the agent's own behavior but also to other agents' behaviors and states. If different agents have different viewpoints about the present optimal strategic action of the system, each based on its own perceptive information and reasoning, negotiation is required [3]. The process of agents' negotiation and state transformation is as follows. At first, the agent is in the state of waiting for operation. When the operating conditions are satisfied or it receives other agents' requests, it negotiates with them according to certain rules and goes to the negotiation state. There are two kinds of negotiation results. In the first, the terms or rules cannot be met, so the agent refuses to take the negotiated action and goes to the refusal state; after a certain time it returns to the original state. In the second, with all the terms and rules satisfied, the agent takes the corresponding action and goes to the operation state. During the operating process, if the agent comes across an exception it tries to restore operation. If it fails to restore, it goes back to the original state; otherwise, it continues to operate until the task is finished.
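As a rough illustration only, the negotiation and state-transformation process just described (waiting, negotiation, refusal, operation) can be written as a small state machine. The event labels below are hypothetical and merely sketch one possible encoding; the paper itself does not specify them.

from enum import Enum, auto

class NegotiationState(Enum):
    WAITING = auto()      # waiting for operation
    NEGOTIATION = auto()  # negotiating with the requesting agents
    REFUSAL = auto()      # terms or rules not met, action refused
    OPERATION = auto()    # executing the negotiated action

def transition(state: NegotiationState, event: str) -> NegotiationState:
    # Events ('request', 'terms_met', 'terms_failed', 'timeout',
    # 'unrecoverable_exception', 'task_finished') are illustrative labels.
    if state is NegotiationState.WAITING and event == "request":
        return NegotiationState.NEGOTIATION
    if state is NegotiationState.NEGOTIATION:
        return NegotiationState.OPERATION if event == "terms_met" else NegotiationState.REFUSAL
    if state is NegotiationState.REFUSAL and event == "timeout":
        return NegotiationState.WAITING   # return to the original state after a certain time
    if state is NegotiationState.OPERATION and event in ("unrecoverable_exception", "task_finished"):
        return NegotiationState.WAITING   # failure to restore, or the task is done
    return state                          # recovered exceptions keep the agent operating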
Fig.1 Principle of Reinforcement Learning

3.2 Implementation of HILA

The hybrid intelligent learning algorithm proposed in this paper is based on reinforcement learning (RL). Fig.1 illustrates the basic principle of RL. The agent perceives the state information from the environment and chooses proper actions according to its own strategy, thereby changing the environment and getting a reward or stimulus as a contribution function of the new state relative to the agent [9]. The agent's target is to search out a strategy for action choice that can obtain a reward as large as possible on the basis of exploring the state space of the environment. Our learning system is layered and the agent uses RL directly to learn individual behaviors, so the following mainly discusses group-behavior learning. The algorithm extends the idea of RL, considering and making use of multi-agent interaction and its changes while inheriting the same method of adjusting the decision-making probability with feedback. It is therefore a dynamic collaborative model in which purely individual optimization loses its sense, because every agent's feedback depends not only on itself but also on other agents' decisions. The learning target of the system is to look for an appropriate strategy maximizing the overall reward the system can get in the future.

Now we give a mathematical description of the algorithm. Suppose the MAS is made up of n agents: MAS = {Agent_1, Agent_2, ..., Agent_n}. Let E denote the external environment state set of the system.

An agent is denoted in the form <S, A, f>, where S is the agent's state set, A is the action set and f is its planning function. Let P(s, a, s') denote the state transformation probability, i.e. the probability of moving to state s' in S after the agent executes action a in A in state s in S. The agents' collaborative knowledge, i.e. the correlation of the interactive behaviors between the agents and the relationship between individual targets and integral targets, is denoted by the fuzzy matrices H(t) and H'(t) respectively. Because both are matrix functions of time t, they can reflect the dynamic changes of multi-agent interaction and the changes of the relationship between every agent and the continuously changing targets. The hybrid intelligent learning algorithm during a given cycle t is as follows:

(1) Set the external environment state E randomly.
(2) Agent_i (i = 1, 2, ..., n) observes E and accordingly updates its own state S_i based on H(t) and H'(t).
(3) All the agents learn concurrently: Agent_i uses the state transformation probability P(S_i(t), a, s') and its own decision-making function f_i to choose an action a_i.
(4) The environment produces an external reinforcement signal R(t) after all the agents' actions.
(5) Agent_i computes W_i(t) according to H(t) and H'(t), where W_i(t) denotes the weight associated with Agent_i relative to the whole system at time t, satisfying Σ_{i=1}^{n} W_i(t) = 1.
(6) Agent_i calculates its own effective reinforcement signal r_i(t) = W_i(t)·R(t), where r_i(t) represents the reward Agent_i gets at time t when it sets off from state S_i and executes the action selected by its decision-making function.
(7) Update H(t) and H'(t).
(8) Search out a systemic strategy with which every agent's sub-strategy enables the system to get the largest discounted reward sum (total return), that is, to maximize Σ_{i=1}^{n} W_i(t) Σ_{j=0}^{∞} γ^j r_i(t+j), where γ is the discount factor, reflecting the degree of time's influence on the payoff; it is usually assigned a value slightly smaller than 1.0.

This algorithm adopts the layered idea and separates individual-behavior learning from group-behavior learning. Through the collaborative knowledge among agents in the same system, and by using the weighted method to adjust the agents' interaction and their mutual effects with the environment, the agents can learn with different emphases, their strong points are developed, and they are also trained as a whole. Consequently, the agents' individual skills and the system's performance are both improved.
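The cycle (1)-(7) can be summarized in a short sketch; step (8) then amounts to running many such cycles and keeping the joint strategy whose discounted, weighted reward sum is largest. The interfaces below (env, agents, weight(), update_policy(), and the H-update stub) are our own assumptions, since the paper does not fix concrete data structures or an update rule for H(t) and H'(t).

import numpy as np

def update_collaborative_knowledge(H, H_prime, states, actions, R):
    # Placeholder: the paper updates the fuzzy matrices H(t), H'(t) from the observed
    # interaction but gives no closed-form rule, so they are returned unchanged here.
    return H, H_prime

def hila_cycle(env, agents, H, H_prime, gamma=0.95):
    """One learning cycle following steps (1)-(7) of the algorithm sketch."""
    E = env.reset_random()                                              # (1) random external state
    states = [ag.observe(E, H, H_prime) for ag in agents]               # (2) each agent updates S_i
    actions = [ag.choose_action(s) for ag, s in zip(agents, states)]    # (3) concurrent action choice
    R = env.step(actions)                                               # (4) global reinforcement R(t)
    W = np.array([ag.weight(H, H_prime) for ag in agents], dtype=float)
    W = W / W.sum()                                                     # (5) weights with sum_i W_i(t) = 1
    for ag, w in zip(agents, W):
        ag.update_policy(w * R, gamma)                                  # (6) r_i(t) = W_i(t) * R(t)
    return update_collaborative_knowledge(H, H_prime, states, actions, R)  # (7)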

4. Application in robot soccer system

Robot soccer has become a standard research platform for MAS [7], and agent learning in MAS can also be applied to this field. In the following, we introduce multi-agent-based HILA into the robot soccer simulation system and present the experimental results.

First, based on the layered idea, the agent's tasks are decomposed into different layers (Fig.2) so that HILA can be used. From the bottom to the top: (1) is individual learning, where the agent learns individual skills from the world model; (2) and (3) are group learning, where the agent learns group behaviors based on the world model and the skills it has already acquired. In detail, individual behaviors mainly include positioning, dribbling, kicking, etc. Collaborative behaviors (group behaviors within the same team) include passing and interception. Adversarial behaviors include shooting, goaltending, defending and so on.

Fig.2 Task decomposition in layered learning [6]

4.1 Application samples

Passing (involving all the players): Factors such as the passing target, the trajectory and the choice of the passing opportunity are to be considered. In a soccer game passing is the major form of cooperation between the players. Both teams want to keep control of the ball in the front field to help them score. Passing involves the whole situation of the field and there is hardly a criterion to judge whether a pass succeeds or not. In our algorithm, let the fuzzy matrix H denote the probability with which the player handling the ball may pass it to the others, and let the fuzzy matrix H' denote the correlation between the agent and the global target. The environmental state vector is made up of the positions and velocities of the 22 players and the ball, and all the basic commands constitute the action set used to perform multi-agent learning of the team behavior. If the pass succeeds, the external reinforcement signal R is assigned 1.0; the weights connected to the ball-controlling player and the receiving teammate are relatively larger, and so are the shares of their effective reinforcement signals with respect to the overall reward of the system. Therefore, these two are the major learners on this occasion.

Goaltending: The goalie should observe and gather statistics on the shooting habits of the opponent forwards, predict their shooting tendency and adjust his own interception point to the ball. There are four kinds of geometric defending strategies. We use HILA to perform learning and train the goalie in various environments. In the beginning, only one forward is used. Training is carried out many times, with feedback given by the environment each time: a success payoff is given when the goalie catches the ball, and a failure payoff is given when the opponent scores. Then the number of opponent forwards is increased to raise the difficulty of defending. The learning steps of the goaltending strategy are as follows (a small sketch of this training loop is given after the list):

(1) The forward is placed randomly in the field and the goalie is set at his home position.
(2) The forward dribbles and waits for the opportunity to shoot (on the basis of having already acquired the shooting skill).
(3) The goalie chooses a strategy to defend according to the present decision-making probability.
(4) If the ball is caught, the goalie succeeds and a positive reward is given; if the opponents score, a negative penalty signal is given; otherwise (for example, the ball does not enter the goal area) the signal is zero.
(5) Adjust the decision-making probability according to the feedback. Repeat the learning process until the optimal strategy with the largest state evaluation function is found.
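A toy version of this training loop, under assumed names, could look like the following. The strategy labels and the simulate_episode stub are ours, and the probability update is a deliberately simplified illustrative rule rather than the paper's exact method.

import random

STRATEGIES = ["near_post", "far_post", "bisector", "advance"]  # hypothetical labels for the four geometric strategies

def simulate_episode(strategy):
    # Stand-in for one attack by the forward (steps (1)-(2)); a real implementation
    # would query the soccer simulator and report how the episode ended.
    return random.choice(["caught", "goal", "out"])

def train_goalie(episodes=1000, lr=0.1):
    probs = {s: 1.0 / len(STRATEGIES) for s in STRATEGIES}       # initial decision-making probabilities
    for _ in range(episodes):
        strategy = random.choices(list(probs), weights=list(probs.values()))[0]  # step (3)
        outcome = simulate_episode(strategy)
        reward = {"caught": 1.0, "goal": -1.0, "out": 0.0}[outcome]               # step (4)
        probs[strategy] = max(probs[strategy] + lr * reward, 1e-3)                # step (5), simplified
        total = sum(probs.values())
        probs = {s: p / total for s, p in probs.items()}                          # renormalize
    return probs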

4.2 Experimental results

In order to inspect the optimization of the control strategy, we count the number of times the goalie succeeds in defending every 100 learning cycles and obtain the experimental data shown in Fig.3 (the abscissa denotes the time order number in learning). We can see that the goalie's number of successes increases with the learning time. (The curve's oscillation is due to the uncertainty of the environment and the randomization of the algorithm.)

Fig.3 Relationship between the times of success in goaltending and the learning time

In order to inspect the improvement of the whole system's performance, we use simulator team A, without the new algorithm, to play against simulator team B, which adopted HILA. The scores of 10 randomly selected games are shown in Table 1.

Table 1 Results of 10 simulated games between A and B

References
1. Zhuang Xiao-dong, Meng Qing-chun, Wei Tian-bin, Robot Path Planning in Dynamic Environment Based on Reinforcement Learning, Journal of Harbin Institute of Technology, Vol.8, No.3
2. Wang Zhi-jie, Fang Jian-an, Shao Shi-huang, A Self-learning Fuzzy Logic Controller Using Reinforcement, Control and Decision, 1997, Vol.12, No.2
3. Wang Li-chun, Li Hong-bing, Chen Shi-fu, Multi-agent Negotiation and Learning in AODE, Pattern Recognition and Artificial Intelligence, 2001, Vol.14, No.3
4. Wu Jian-bing, Yang Jie, Wu Yue-hua, An Evolutionary Multi-agent Robotic System, Journal of China University of Science and Technology, 1998, Vol.28, No.5
5. Li Shi, Xu Xu-ming, Ye Zhen, International Robot Soccer Tournament and Correlative Technique, Robot, 2000, Vol.22, No.5
6. Peter Stone, Layered Learning in Multi-Agent Systems, doctoral thesis, Carnegie Mellon University, Pittsburgh
7. Meng Wei, Hong Bing-rong, Han Xue-dong, Application of Reinforcement Learning to Robot Soccer, Application Research of Computers
8. Shuai Dian-xun, Gu Jing, A New Algebraic Modeling for Distributed Problem-Solving of Multi-agent Systems (Part 1): Social Behavior, Social Situation and Social Dynamics, No.2
9. Cai Qing-sheng, Zhang Bo, An Agent Team Based on Reinforcement Learning Model and Its Application, Journal of Computer & Development, Vol.37, No.9