Modeling of Suppliers' Learning Behaviors in an Electricity Market Environment

Modeling of Suppliers' Learning Behaviors in an Electricity Market Environment

Nanpeng Yu, Student Member, IEEE; Chen-Ching Liu, Fellow, IEEE; Leigh Tesfatsion, Member, IEEE

This research is sponsored by the Power System Engineering Research Center (PSERC) through a collaborative project involving Iowa State University, Washington State University, and Smith College/Cornell University. The authors would like to thank Professors Gerald Sheblé, Anjan Bose, and Judith Cardell, and Mr. Jim Price, California ISO, for their contributions. Nanpeng Yu and Chen-Ching Liu are with the Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011 USA. Leigh Tesfatsion is with the Department of Economics, Iowa State University, Ames, IA 50011 USA.

Abstract--The Day-ahead electricity market is modeled as a multi-agent system with interacting agents including supplier agents, Load Serving Entities, and a Market Operator. Simulation of the market clearing results under the scenario in which agents have learning capabilities is compared with the scenario where agents report true marginal costs. It is shown that, with Q-Learning, electricity suppliers make more profit than in the scenario without learning, due to strategic gaming. As a result, the LMP at each bus is substantially higher.

Index Terms--Electricity Market, Supplier Modeling, Competitive Markov Decision Process, Q-Learning.

I. INTRODUCTION

Strategic bidding is an important issue in the wholesale electricity market. Electricity prices change as a result of transmission network congestion, which may be caused by strategic bidding or heavy load. For PJM, the total congestion costs were $750 million in 2004 and $2.09 billion in 2005. Learning may also allow larger electricity suppliers to use their market power and bid strategically. In California [1], electricity expenditure in the wholesale market increased from $2.04 billion in the summer of 1999 to $8.98 billion in the summer of 2000. It is estimated that 59% of this increase was due to increased market power. Learning to bid in the wholesale market is also crucial for smaller electricity suppliers, who have a desire to recover the cost of their investment in generation by avoiding over- or under-bidding. Research on the learning behavior of electricity suppliers will provide insights into gaming on the market and the power grid. This may allow market designers to develop appropriate market rules to discourage strategic bidding and enhance market efficiency.

Researchers have used various learning methods to model electricity suppliers' behavior. The learning configuration for suppliers in [2] is a version of the stochastic reactive reinforcement learning developed by Alvin Roth and Ido Erev. In this configuration, agents have finite fixed action domains, are backward looking, and rely entirely on response learning. Average-reward γ-greedy reinforcement learning was used in [3] to model the learning and bidding processes of suppliers. With this scheme, each supplier uses greedy selection as its action choice rule with probability (1 - γ), and random action selection with probability γ. Thus, γ determines the trade-off between exploitation of available information and exploration of untested actions. The trading agents modeled in [4] use GP-Automata to compute their bidding strategies for the current market conditions. Finally, in the area of multi-agent reinforcement learning, Nash Q-Learning [5] was designed specifically as a potential technique to represent agents' learning behavior in a multi-agent context. This paper is focused on how to model electricity suppliers' learning behavior by Q-Learning.
In addition, load serving entities that have demand-side response are considered in this multi-agent electricity market environment.

II. DAY-AHEAD MARKET MODEL

The Day-ahead electricity market is modeled as a multi-agent system with three types of agents interacting with one another. These agents are supplier agents, Load Serving Entities (LSEs), and a Market Operator (MO). On the morning of day D, supplier agents submit supply offers and LSEs submit demand bids for the Day-ahead Market to the MO. During the afternoon, the MO runs a market-clearing algorithm (similar to an optimal power flow) to match supply to demand and determine dispatch schedules and LMPs. At the end of the process, the MO sends the dispatch schedules and LMPs for day D+1 to the supplier agents and LSEs. The interaction among the MO, LSEs, and supplier agents is shown in Fig. 1.

A. Load Serving Entity Model

LSEs purchase bulk power from the Day-ahead market to serve load. Without loss of generality, it is assumed that LSEs do not have generation units and that each LSE serves load at only one location in the power system. Suppose that the number of LSEs in the Day-ahead market is J. On day D, LSE j submits a load profile for day D+1. This load profile specifies 24 hours of MW power demand P_Lj(H), H = 0, ..., 23.
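As a small illustration only (the numbers and variable names below are hypothetical, not the load data used in this study), an LSE's day-ahead demand bid can be held as a 24-element MW vector indexed by hour:

    % Hypothetical 24-hour load profile P_Lj(H), H = 0,...,23, for one LSE (MW)
    normalProfile = [300 292 288 285 284 290 318 355 392 420 441 455 ...
                     466 474 483 495 516 530 508 476 448 410 368 330];
    [peakMW, idx] = max(normalProfile);        % locate the daily peak hour
    fprintf('Peak demand of %.0f MW at hour %d\n', peakMW, idx - 1);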

Fig. 1: Multi-Agent Day-Ahead Market Environment

It is assumed that demand-side response is available to LSEs. The demand-side response works as follows. If the day-D peak hour LMP, LMP_Lj(H_peak), at the bus where LSE j is serving load is higher than a critical value, then LSE j reduces its peak hour demand for day D+1 by 2%. If this LMP does not exceed the critical value, LSE j will not curtail its peak hour demand. Therefore, each LSE has two states. If the LMP at its node is below the critical value, it is in state 0, i.e., S_Lj = 0, and it will submit a normal load profile for day D+1. If the LMP at its node is above the critical value, it is in state 1, i.e., S_Lj = 1, and it will submit a curtailed load profile for day D+1.

B. Supplier Agent Model

Supplier agents sell bulk power to the Day-ahead market. For simplicity, it is assumed that each supplier agent has only one generation unit. However, this model can be extended to permit suppliers with multiple generation units. Suppose the number of supplier agents in the Day-ahead market is I, and the MW power output of generator i in some hour H is p_Gi. Generator i has lower and upper limits, denoted by p_min,i and p_max,i, for its hourly MW power output. For generator i, the hourly total production cost C_i(p_Gi) at production level p_Gi is represented by a quadratic form:

$C_i(p_{Gi}) = a_i p_{Gi} + b_i p_{Gi}^2 + F_i$   (1)

where a_i, b_i, and F_i (pro-rated fixed cost) are given constants. By taking the derivative on both sides of (1), the marginal cost function for generator i is obtained, i.e.,

$MC_i(p_{Gi}) = a_i + 2 b_i p_{Gi}$   (2)

On each day D, supplier agent i submits to the Day-Ahead market a supply offer for day D+1 that includes two components. The first component is its reported marginal cost function, given by:

$MC_i(p_{Gi}) = a_i + 2 b_i p_{Gi}$   (3)

The second component is its reported hourly MW power output upper limit, denoted by p_max,i.

Suppose that, on day D, supplier agents submit their supply offers for day D+1 to the MO, and the market clearing program calculates LMPs and dispatch schedules. Let LMP_Gi(H) denote the LMP for hour H at the bus where supplier i's generation unit is located, and let p*_Gi(H) denote the MW power output for hour H in the dispatch schedule posted by the MO. Supplier agent i's profit on day D is obtained by summing the 24 hourly profits on that day:

$\pi_i^D = \sum_{H=0}^{23} [\, p_{Gi}^*(H) \cdot LMP_{Gi}(H) - C_i(p_{Gi}^*(H)) \,]$   (4)

The accumulated profit of generator i on day D is given by:

$AP_i(D) = AP_i(D-1) + \pi_i^D$   (5)

C. Market Operator Model

The MO for this Day-ahead market is responsible for clearing the market based on the information submitted by LSEs and supplier agents. The MO uses a market clearing algorithm to determine the LMP at each bus and the MW power output of each generation unit at each hour. Since only MW power is considered in this model, a DCOPF problem can be formulated as follows:

$\min \sum_{i=1}^{I} ( a_i p_{Gi} + b_i p_{Gi}^2 )$   (6)

subject to

$P_k - P_{gk} + P_{dk} = 0, \quad k = 1, \ldots, Nb$   (7)

$H \delta \le F^{max}$   (8)

$p_{min,i} \le p_{Gi} \le p_{max,i}, \quad i = 1, \ldots, I$   (9)

where Nb denotes the total number of buses in the system, P_k represents the net power injection at bus k, P_gk denotes the total MW power generation at bus k, P_dk is the total MW demand at bus k, H denotes the line flow matrix, δ denotes the vector of voltage angle differences, and F^max is the vector of maximum line flows.
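A DCOPF of this form can be solved with standard tools; as noted below, the DCOPF program of MATPOWER [6] is used in this research. The following is only a minimal sketch, assuming a current MATPOWER release that provides rundcopf and a bundled 5-bus case file (case5); it is not the data or code used in this study.

    define_constants;                  % MATPOWER named column indices (PD, PG, LAM_P, ...)
    mpc = loadcase('case5');           % a 5-bus example case shipped with MATPOWER
    results = rundcopf(mpc);           % solve the DC optimal power flow, cf. eqs. (6)-(9)
    lmp  = results.bus(:, LAM_P);      % locational marginal price at each bus ($/MWh)
    pgen = results.gen(:, PG);         % cleared MW output of each generator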

The objective of the DCOPF is to minimize the total variable generation cost based on supplier offers and LSE bids. The constraints are MW power balance constraints for each bus k = 1, ..., Nb, MW thermal limit constraints for each line, and MW production limits for each generator i = 1, ..., I. The DCOPF program of MATPOWER [6], applicable to large-scale power systems, is used in this research. The simulation platform is programmed in MATLAB.

III. MODEL FOR SUPPLIERS' LEARNING BEHAVIOR

Q-Learning, developed by Watkins [7], is a form of anticipatory reinforcement learning that allows agents to learn how to act in a controlled Markovian domain. A controlled Markovian domain implies that the environment is Markovian in the sense that the state transition probability from any state x to another state y depends only on x, y, and the action a taken by the agent, and not on other historical information. Q-Learning works by successively updating estimates of the Q-values of state-action pairs. The Q-value Q(x, a) is the expected discounted reward for taking action a in state x and following an optimal decision rule thereafter. If these estimates converge to the correct Q-values, the optimal action to take in any state is the one with the highest Q-value.

By the procedure of Q-Learning, in the n-th step the agent observes the current system state x_n, selects an action a_n, receives an immediate payoff r_n, and observes the next system state y_n. The agent then updates its Q-value estimates using a learning parameter α_n and a discount factor γ [7] as follows:

If x = x_n and a = a_n,

$Q_n(x, a) = (1 - \alpha_n) Q_{n-1}(x, a) + \alpha_n [\, r_n + \gamma V_{n-1}(y_n) \,]$   (10)

Otherwise,

$Q_n(x, a) = Q_{n-1}(x, a)$   (11)

where

$V_n(y) \equiv \max_b \{ Q_n(y, b) \}$   (12)

It is proven by Watkins in [8] that if (1) the states and actions are discrete, (2) all actions are sampled repeatedly in all states, (3) the reward is bounded, (4) the environment is Markovian, and (5) the learning rate decays appropriately, then the Q-value estimates converge to the correct Q-values with probability 1. In a multi-agent context such as the Day-ahead market model presented in this paper, the system might not be Markovian because state transition probabilities might depend on actions taken by other agents. Therefore, there is no guarantee that Q-Learning will converge to the correct Q-values.

A Generation Company (GENCO) usually has several generation plants located at different buses of the system. For simplicity, Q-Learning is used here to model electricity suppliers that are assumed to have only one generation unit. Nevertheless, by a similar approach, Q-Learning could be implemented for supplier agents with multiple generation units at different locations.

A novel approach to the implementation of Q-Learning for a supplier agent is presented here. The supplier agent views the Day-ahead market as a complex system with different states. The system state on day D, X_D, is defined as a vector of the states of all LSEs. Hence the state vector on day D can be expressed as X_D = {S_L1, S_L2, ..., S_LJ}, where J is the number of LSEs. The cardinality of the state space is 2^J since each LSE has two states, i.e., reduced peak load or not, based on demand-side response. Electricity suppliers might have market power. Thus, it is assumed that supplier agents are capable of forecasting the LSEs' states. In other words, the state vector is predictable by the supplier agents.

The action domain of supplier agent i, A_i, is defined as a vector of bidding information. This vector consists of the marginal cost function parameters a_i and 2b_i, and the hourly MW output upper limit p_max,i.
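As a small illustrative sketch (the discretization below is hypothetical and is not the action grid used in this study), the 2^J LSE-state space and a supplier's discrete action domain might be enumerated as follows:

    J = 3;                                        % number of LSEs
    states = dec2bin(0:2^J-1) - '0';              % all 2^J vectors {S_L1,...,S_LJ} of 0/1 states
    aMult  = [1.0 1.2 1.5 2.0];                   % hypothetical multipliers on the true a_i
    bMult  = [1.0 1.2 1.5 2.0];                   % hypothetical multipliers on the true b_i
    pFrac  = [0.97 0.98 0.99 1.00];               % hypothetical fractions of the true p_max,i
    [A, B, P] = ndgrid(aMult, bMult, pFrac);      % full factorial action grid
    actions = [A(:) B(:) P(:)];                   % each row is one action (a-mult, b-mult, p-frac)
    nActions = size(actions, 1);                  % cardinality M_a * M_b * M_max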
The cardinality of the action domain, M_a × M_b × M_max, is given by the product of the number of possible a_i, 2b_i, and p_max,i values.

Consider the beginning of each day D. A supplier agent first makes a prediction of the system state, which is represented by x. It next chooses an action according to a Gibbs/Boltzmann probability distribution, i.e.,

$p_i(x, a) = \frac{e^{Q_i(x,a)/T}}{\sum_{b \in A_i} e^{Q_i(x,b)/T}}$   (13)

where T, which depends on D, is a temperature parameter that models a decay over time. Having chosen an action a, the supplier agent submits its supply offer to the MO. Once the market is cleared, the supplier agent receives its reward, which is the profit for day D+1. The agent then uses this reward to update its Q-value estimates according to equations (10) to (12). The Q-value estimates of an agent are said to have converged if, under every state x, the agent chooses some action with probability 0.99 or higher. If the Q-value estimates of all the agents have converged, the simulation terminates.
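A minimal sketch of this daily learning step is given below; the function and variable names are illustrative only, not the authors' implementation. It separates the Gibbs/Boltzmann action choice of (13) from the Q-value update of (10)-(12), since the reward and next state are only observed after the market has cleared.

    function a = chooseAction(Q, x, T)
    % Sample an action for state x from the Gibbs/Boltzmann distribution of eq. (13).
    % Q : state-by-action table of Q-value estimates;  T : temperature parameter.
    q = Q(x,:) / T;
    p = exp(q - max(q));               % shift by max(q) for numerical stability
    p = p / sum(p);                    % action probabilities p_i(x,a)
    a = find(rand <= cumsum(p), 1);    % inverse-CDF sampling of an action index
    end

    function Q = updateQ(Q, x, a, r, y, alpha, gamma)
    % Update the estimate for the state-action pair (x,a) per eqs. (10)-(12), after the
    % market has cleared and the reward r (profit for day D+1) and next state y are known.
    V = max(Q(y,:));                                       % V(y) = max_b Q(y,b), eq. (12)
    Q(x,a) = (1-alpha)*Q(x,a) + alpha*(r + gamma*V);       % eq. (10); other entries unchanged, eq. (11)
    end

As the temperature T decays over the simulated days, the choice rule concentrates probability on the highest-valued action, which is consistent with the 0.99 convergence criterion above.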

The parameters used to implement the Q-Learning algorithm are set in the following way:

- Discount factor γ = 0.7.
- The learning parameter α for a state-action pair (x, a) is set to α = 1 / T(x,a)^ω, where T(x,a) is the number of times that action a has been taken in state x, and ω = 0.77.
- The temperature parameter T_D decays with the number of days D that have currently been simulated.
- The cardinality of the action domain is M_a × M_b × M_max = 4 × 4 × 4, in which a_i and b_i range from 1 to 3 times their true values, and p_max,i ranges from 97% to 100% of the true upper limit.

IV. NUMERICAL STUDY

A. Test Case

The 5-bus transmission grid used here for simulation is taken from ISO-NE/PJM training manuals, where it is used to illustrate the determination of Day-ahead market LMP solutions. A one-line diagram of the grid is shown in Fig. 2. Daily LSE load profiles are adopted from the dynamic 5-bus example in [2]. Line capacities, reactance levels, and generator cost data are also adopted from [2].

Fig. 2: 5-Bus Transmission Grid

Detailed solution values for the scenario in which suppliers submit their true production data to the MO (the no-learning scenario) are given in [2]. This study simulates two Q-Learning scenarios for this 5-bus test case. In the first scenario the LSEs have relatively low critical values for curtailing demand, whereas in the second scenario they have relatively high critical values. Simulation results for these learning scenarios are compared with the no-learning scenario.

B. Review of Results from the No-Learning Scenario

In the no-learning scenario analyzed in [2], each generator submits a supply offer that includes its true marginal cost function and its true generation upper limit. The MW production level of each generator and the LMP at each bus that are cleared by the MO based on true cost data from the generators are depicted in Fig. 3(a) and Fig. 4(a).

Fig. 3(a): 5-Bus Transmission Grid Simulation Results for 24-Hour MW Production (No-Learning Scenario)
Fig. 3(b): 5-Bus Transmission Grid Simulation Results for 24-Hour MW Production (Learning Scenario 1)
Fig. 3(c): 5-Bus Transmission Grid Simulation Results for 24-Hour MW Production (Learning Scenario 2)

Generators 3 and 5 are the two largest units in the system, with a combined capacity of 1120 MW. The combined capacity of the three other, smaller units is 410 MW. The large units, together with the high peak-hour demand, give generators 3 and 5 potential market power. Note that the congestion between bus 1 and bus 2 exists for all 24 hours. This causes LMP separation between bus 1 and bus 2.

During hour 17, the power flow on the line between buses 1 and 2 hits its upper thermal limit, and Generator 3 is dispatched at its upper production limit. Therefore, generator 4, which has the highest variable generation cost, has to be dispatched to meet the demand. This results in a large price spike at buses 2 and 3 at hour 17 that is about double their LMP values at hour 16.

Fig. 4(a): 5-Bus Transmission Grid Simulation Results for 24-Hour LMPs (No-Learning Scenario)
Fig. 4(b): 5-Bus Transmission Grid Simulation Results for 24-Hour LMPs (Learning Scenario 1)
Fig. 4(c): 5-Bus Transmission Grid Simulation Results for 24-Hour LMPs (Learning Scenario 2)

C. Results from the Two Learning Scenarios

Assume that the generators do not have to report their true marginal costs to the MO. Instead, the profit-seeking generators use Q-Learning to learn how to bid strategically and make more profit. Since the system can be in several states, it does not have to stay in one single state in the long term. Rather, it may visit some states periodically, or it may not even converge to a periodic pattern. Therefore, convergence has to be defined in a different way. The Day-ahead market is said to be convergent if, in any state, each generator chooses one action in that state with probability 0.99 or higher. Due to the probabilistic nature of the learning algorithm, the simulation does not converge to the same values in each run. In order to average out the random effects across different runs, several simulation runs are performed for each scenario and the mean values from these runs are reported.

In scenario one, the LSEs have little tolerance for high LMPs. Their critical values for curtailing demand are only slightly higher than the LMPs that they would pay in the no-learning scenario. The critical values for the LSEs are 5.5 $/MWh, 98.0 $/MWh, and 47.5 $/MWh. Simulation results show that most of the time the system stays in state 8, in which every LSE is curtailing demand every day. This implies that the generators are using very aggressive bidding strategies and making full use of their market power. In this case, the generators actually make more profit by moving the system to state 8 because, even with less demand in the peak hour, they are still able to raise prices above the critical values of the LSEs. In all simulation runs, all five generators converge by day 23. The average number of days before convergence is 7. In some cases the system moves back and forth between two states in a cyclical pattern of convergence.

In scenario two, the LSEs have a high tolerance for high electricity prices. Their critical values for curtailing demand are higher than the critical values in scenario one. The critical values for the LSEs in this case are 35.5 $/MWh, 5.5 $/MWh, and 55.5 $/MWh. In all runs, all five generators converge by day 325. The average number of days before convergence is also larger than in scenario one. Simulation results show that most of the time the system ends up visiting state 1 and state 8 in turn. The day of convergence comes later if the system keeps visiting more than one state. It can be seen from the simulation results that, in fact, Q-Learning allows the generators to take advantage of the LSEs, whose demand-side response has only a one-day memory. First, by submitting low supply offers, the generators make sure that the LSEs do not curtail their demand tomorrow. Afterward, they submit high supply offers and profit significantly from the LSEs, which then decrease their peak-hour demand for the following day. Then the generators submit low supply offers again, and so on.
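The reward that drives this behavior is the daily profit of equations (4) and (5). A minimal bookkeeping sketch is given below; the dispatch, LMP, and cost numbers are hypothetical and are not taken from the reported simulation runs.

    % Hypothetical hourly data for one supplier on one day (hours 0-23)
    pG  = 80*ones(1,24);  pG(18)  = 110;      % dispatched MW output p*_Gi(H); index 18 is hour 17
    LMP = 30*ones(1,24);  LMP(18) = 70;       % LMP at the supplier's bus in $/MWh
    a_i = 14;  b_i = 0.005;  F_i = 200;       % cost coefficients of eq. (1), F_i pro-rated per hour
    hourlyCost  = a_i*pG + b_i*pG.^2 + F_i;   % C_i(p*_Gi(H)), eq. (1)
    dailyProfit = sum(pG.*LMP - hourlyCost);  % daily profit, eq. (4)
    AP = 0;  AP = AP + dailyProfit;           % accumulated profit, eq. (5)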
The simulation results show that Q-Learning helps the generators make more profit by sacrificing today's benefit for more profit in the future. This scenario is a good illustration of anticipatory reinforcement learning.

Differences between the learning scenarios and the no-learning scenario are discussed below. Furthermore, it is desirable to know to what extent Q-Learning is capable of helping generators exercise market power.

Fig. 3(b) and (c) depict the mean values of MW production in learning scenarios 1 and 2, along with the corresponding simulation results obtained in the no-learning scenario. In the no-learning scenario, generator 4 is only dispatched at the peak hour. In both learning scenarios, generator 4 is not dispatched in some simulation runs. This occurs when each generator submits an aggressive supply offer such that generator 4 remains the most expensive unit. However, in some simulation runs generator 4 chooses to submit less aggressive supply offers so that it becomes a relatively cheaper unit.

The 24-hour mean LMP values for learning scenarios 1 and 2 are shown in Fig. 4(b) and (c), along with the 24-hour LMP values for the no-learning scenario. In the no-learning scenario, the price spike at hour 17 is obvious. Although the LMPs in learning scenarios 1 and 2 are substantially higher than without learning, the price fluctuation around the peak hour is much smaller. This finding is similar to the finding of Sun and Tesfatsion [2], who used reactive reinforcement learning to model the learning process of generators. However, since the sets of actions are different, one cannot draw a definitive conclusion about the learning techniques used in the two studies.

Figure 5 shows that the mean total profit gained by the generators in each learning scenario is much higher than what they made in the no-learning scenario. In fact, in the no-learning scenario the generators are not able to recover their fixed costs because they only cover their variable costs in their supply offers. This demonstrates that Q-Learning helps the generators learn to exercise their potential market power to maximize their profits. It can be observed in Fig. 5 that, during peak hour 17, the generators make more profit in learning scenario 2 than in learning scenario 1. The high tolerance of the LSEs for price spikes in learning scenario 2 gives the generators more opportunities to manipulate the market.

Fig. 5: 5-Bus Transmission Grid Simulation Results for 24-Hour Total Profits (No-Learning Scenario Compared with Learning Scenarios 1 and 2)

V. CONCLUSION

This paper presents a novel application of Q-Learning to model electricity suppliers' learning behavior in a multi-agent electricity market environment. Simulation results show that Q-Learning helps electricity suppliers learn how to bid strategically under the condition of a simple demand-side response model. With Q-Learning capabilities, electricity suppliers find a way to make more profit in the long term by sacrificing their immediate profits.

Q-Learning has some limitations. It assumes a finite domain of actions. Also, the Q-Learning model developed in this research assumes that electricity suppliers do not explicitly take into account the presence of other electricity suppliers in their choice environments. These limitations will be relaxed in future extensions of this research by adopting more advanced learning algorithms that enable agents to learn about other agents' strategies. If the bidding data of electricity suppliers are publicly released by the MO, this should help each electricity supplier to form conjectures regarding other electricity suppliers' bidding behaviors.

VI. REFERENCES

[1] S. Borenstein, J. Bushnell, and F. A. Wolak, "Measuring Market Inefficiencies in California's Restructured Wholesale Electricity Market," Center for the Study of Energy Markets, Paper CSEMWP-102, June 2002.
[2] J. Sun and L. Tesfatsion, "Dynamic Testing of Wholesale Power Market Designs: An Open-Source Agent-Based Framework," to appear in Computational Economics.
[3] V. Nanduri and T. K. Das, "A Reinforcement Learning Model to Assess Market Power under Auction-Based Energy Pricing," IEEE Transactions on Power Systems, vol. 22, no. 1, Feb. 2007.
[4] C. W. Richter, G. B. Sheblé, and D. Ashlock, "Comprehensive Bidding Strategies with Genetic Programming/Finite State Automata," IEEE Transactions on Power Systems, vol. 14, Nov. 1999.
[5] J. Hu and M. P. Wellman, "Nash Q-Learning for General-Sum Stochastic Games," Journal of Machine Learning Research, vol. 4, pp. 1039-1069, 2003.
[6] R. D. Zimmerman and D. Gan, "MATPOWER: A MATLAB Power System Simulation Package (Version 2.0)," Cornell University, New York.
[7] C. J. C. H. Watkins, "Learning from Delayed Rewards," Ph.D. Thesis, University of Cambridge, England, 1989.
[8] C. J. C. H. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, pp. 279-292, 1992.

VII. BIOGRAPHIES

Nanpeng Yu received his B.Eng. from Tsinghua University, Beijing, China, in 2006. He is currently pursuing the Ph.D. degree at Iowa State University.

Chen-Ching Liu is currently Palmer Chair Professor of Electrical and Computer Engineering at Iowa State University. Dr. Liu serves as President of the Council on Intelligent System Applications to Power Systems (ISAP). He is a Fellow of the IEEE.

Leigh Tesfatsion is Professor of Economics and Mathematics at Iowa State University. She serves as Associate Editor for several economics and mathematics journals and is a Member of the IEEE.