Cross Channel Optimized Marketing by Reinforcement Learning

Naoki Abe, Naval Verma and Chid Apte
Mathematical Sciences Dept., IBM T. J. Watson Research Center, Yorktown Heights, NY

Robert Schroko
Database Marketing, Saks Fifth Avenue, 12 E. 49th Street, New York, NY

ABSTRACT

The issues of cross channel integration and customer life time value modeling are two of the most important topics surrounding customer relationship management (CRM) today. In the present paper, we describe and evaluate a novel solution that treats these two important issues in a unified framework of Markov Decision Processes (MDP). In particular, we report on the results of a joint project between IBM Research and Saks Fifth Avenue to investigate the applicability of this technology to real world problems. The business problem we use as a testbed for our evaluation is that of optimizing direct mail campaign mailings for maximization of profits in the store channel. We identify a problem common to cross-channel CRM, which we call the Cross-Channel Challenge, due to the lack of explicit linking between the marketing actions taken in one channel and the customer responses obtained in another. We provide a solution for this problem based on old and new techniques in reinforcement learning. Our in-laboratory experimental evaluation using actual customer interaction data shows that as much as a 7 to 8 per cent increase in the store profits can be expected by employing a mailing policy automatically generated by our methodology. These results confirm that our approach is valid in dealing with cross channel CRM scenarios in the real world.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning

General Terms: Experimentation

Keywords: customer life time value, CRM, cost sensitive learning, reinforcement learning, targeted marketing

KDD'04, August 22-25, 2004, Seattle, Washington, USA.

1. INTRODUCTION

The issues of cross channel integration and customer life time value modeling are undoubtedly two of the most important topics surrounding customer relationship management (CRM) today. Despite the wide-spread acknowledgement of their importance, there has not been a satisfactory solution to these issues in the market. In many cases, vendors provide infrastructure that enables cross channel integration of customer data, but provide no special-purpose analytics. Applying existing data analytics tools will not fully leverage the integrated data, because existing tools do not help optimize decision making in the respective channels for maximization of future profits across different channels. In the present paper, we propose a novel solution that treats these two important issues in a unified framework. The proposed solution is based on our earlier work on the application of reinforcement learning to sequential cost-sensitive decision making [8]. In that paper, it was demonstrated that by combining reinforcement learning and scalable data mining technologies, decision rules that are optimized with respect to long term benefits can be automatically generated solely from data analytics. We use variants of this basic technology to perform customer life time value modeling in a cross-channel setting, and optimize marketing actions with respect to long term, multi-channel profits.
We have conducted a joint study between IBM Research and Saks Fifth Avenue to investigate the applicability of this technology to a practical problem, which we report on in the current paper. The business problem we elected as a testbed for our investigation is that of optimizing interactions between the direct mail channel and the store channel. Of the various channels of customer interaction that Saks Fifth Avenue owns and operates, such as the web, telemarketing, direct mailing and the store, we chose to focus on the interactions between the latter two channels, based on the relative ease of evaluation and readiness of the relevant data.

The largest obstacle that we faced in our effort is one that is characteristic of a cross-channel scenario, and is of general interest to most applications involving cross-channel interactions. The problem, which we call the Cross Channel Challenge, is the lack of explicit linking between the marketing actions taken in one channel and the responses (profits or rewards) obtained in another. This translates, in practice, to very low correlations observed between marketing actions and their effects across channels. Therefore, applying off-the-shelf regression methods to model the rewards as a function of various variables is likely to produce a model that is independent of the marketing actions, thus leading to useless marketing rules. We resolve this challenge by invoking reinforcement learning technology and devising a number of modifications to it. One of these modifications, based on an existing reinforcement learning method called advantage updating [3], is particularly notable. It manages to solve the cross channel challenge by focusing on learning the difference in the effects on the rewards of competing actions, thereby bypassing the accurate estimation of the noisy reward function.

We conducted an experimental evaluation of the proposed methods using actual customer interaction data from Saks Fifth Avenue. The results of our in-laboratory evaluation experiments suggest that we can expect as much as a 7 to 8 per cent increase in the store profits by employing targeting rules automatically produced by our methodology, as compared to the current mailing policy used at Saks. These results seem to confirm that our approach is valid in dealing with cross channel CRM scenarios in the real world.

The rest of the paper is organized as follows. We begin by describing the business problem that we address in Section 2. Section 3 presents the methodology, including detailed descriptions of the newly devised methods. Section 4 describes the experiments we conducted and the results we obtained. Section 5 concludes with a summary.

2. PROBLEM DESCRIPTION

2.1 The business problem

The business problem we address is that of optimizing direct mail catalogue mailings over multiple campaigns, to maximize the effect on the profits/revenue obtained in the store channel. At Saks Fifth Avenue over 60 major direct mail campaigns are conducted each year. These campaigns vary from mailings of general store catalogues to those specific to particular product groups, such as women's apparel and cosmetics. Some are seasonal campaigns, such as Christmas season campaigns. Some may involve store coupons, while others may provide information on upcoming sales and their contents and durations. Many of these campaign features are available for use in data analysis. Currently, the generation of the mailing list for each of these campaigns is based on a number of criteria and constraints, and is not fully automated. There are intricate issues surrounding the process of generating these mailing lists. Our goal is not necessarily a full and immediate automation of this process, but rather to demonstrate the potential use of sophisticated data analytics in assisting and improving it. As a step in this endeavor, our technical task is to analyze the past data and automatically generate targeting rules that can be used to construct mailing lists for future campaigns.

2.2 The cross channel challenge

The problem just described is a challenging one, and turns out not to be solvable by straightforward applications of existing modeling techniques. For example, the simplest solution would be to model the short term profits (or revenues) in the store generated by a particular customer, say in a window of one month, as a function of various features of that customer, including the control variable of whether a given catalogue is mailed to that customer. As we elaborate in the section on experiments, this will result in a model that is independent of the control variable, thus giving no interesting information on the effect of the mailing action. At the heart of the problem is the credit assignment problem. That is, there is no explicit information in the data linking the actions taken in the direct mail channel to the responses observed in the store channel.
Such linking would be possible if a good part of the transactions were associated with coupons issued in the outbound channel of interest, and this fact were recorded. This is rarely the case in practice, and in particular is not the situation we face in our current problem. To place a further burden on the modeling task, our problem setting involves events with variable length time intervals, i.e. the intervals between the decision points (campaign mailings) are variable in length. This adds a considerable amount of noise to the data, and makes the task of modeling the responses to marketing actions even more difficult.

3. METHODOLOGY

3.1 Cross channel life time value maximization and MDP

Common practice in database marketing and CRM today is to organize customer data into a table consisting of fields representing various attributes of customers and response fields, model a response field of interest as a function of those attributes, and then optimize marketing actions against the obtained model. Here we go a step further: we use time stamped sequences of such data to represent time-varying sequences of customer attributes (state) and marketing actions. We then model the process of customer interactions as a dynamic process, and optimize marketing actions with respect to such a model. The technical framework we employ is the so-called Markov Decision Process (MDP) model, popular in dynamic programming and reinforcement learning. We refer the reader to the literature for detailed descriptions of MDP theory and methods, e.g. [5, 9].

In the current application, we can use the attribute vector corresponding to a customer at a given point in time to represent the state for that customer at that point in time. Cross-channel integration of data allows us to represent the entire history of interactions between a given customer and the enterprise, across all channels, thereby providing a unified view of the customer. The maximization of the life time value of a given customer, across all channels, can then be naturally formulated as the maximization of the discounted cumulative rewards in the standard MDP terminology. It follows that life time value maximization in the cross-channel setting reduces to solving for the optimum policy of the MDP with the cross-channel state representation. Since the obtained policy is a mapping from customer attribute vectors to actions, it can be readily translated into generic if-then style rules for use in any rules engine.

3.2 Q-learning with variable time intervals

As we mentioned in the Introduction, for our problem it is necessary to extend the MDP framework to a formulation involving variable time intervals. The variable time interval MDP we consider here is identical to the standard discrete time MDP, except that every event is timed. We assume that the time at the initial state is 0, and that all subsequent events have positive times associated with them. The process starts in some initial state s_1 at t_1 = 0, and the learner then repeatedly takes actions, resulting in a sequence of state, action, reward and time quadruples, {(s_i, a_i, r_i, t_i)}_{i≥1}. The goal of the learner is then defined as the maximization of the total discounted rewards, with the discount factors determined as a function of the time durations. That is, it is to maximize the cumulative reward R,

    R = Σ_{i≥1} γ^{t_i} r_i    (1)

We note that the model we introduce here is different from, and simpler than, extensions of MDPs to the continuous time setting, e.g. the SMDP formulation proposed by [4], in that we still assume discrete time steps, though with variable interval lengths.

The challenge in devising a learning method for the variable time interval MDP is in determining how the rewards in the various time intervals should be normalized, in order to lessen the effect of the noise introduced by the varying interval lengths. For estimating the immediate rewards, it is clear that the reward received in a time interval can be normalized by dividing it by the length of that interval. For estimating the Q-value function (the expected cumulative reward), which is the objective function in an MDP, the situation is significantly more complicated. In particular, we must account for the fact that, at each learning iteration, the interval over which the discounted rewards are summed (for approximating the value function) is incremented. Furthermore, the effective interval is affected by the learning rate, since a larger learning rate corresponds to assigning greater weight to the rewards received far into the future. Here we propose a normalization scheme in which both of these factors are taken into account in determining the effective time interval used for normalization. More specifically, we use an update rule for the normalization factor analogous to that for the Q-value estimate. The resulting variant of the batch Q-learning method for the variable time interval set-up, Var-RL(Q), is presented in Figure 1. (The update rules for both the Q-value targets, v_{i,j}^(k), and the normalization factors, Z_{i,j}^(k), are in block 4.2 of the pseudo-code.)

Procedure Var-RL(Q)
Premise: A base learning module, Base, for regression is given.
Input data: D = {e_i | i = 1, ..., N} where e_i = {(s_{i,j}, a_{i,j}, r_{i,j}, t_{i,j}) | j = 1, ..., l_i}
  (e_i is the i-th episode, and l_i is the length of e_i).
1. For all e_i in D
   1.1 For j = 1 to l_i - 1, Δt_{i,j} = t_{i,j+1} - t_{i,j}
2. For all e_i in D
   2.1 For j = 1 to l_i - 1
       Z_{i,j}^(0) = Δt_{i,j}
       v_{i,j}^(0) = r_{i,j}
   2.2 D_i^(0) = {((s_{i,j}, a_{i,j}), v_{i,j}^(0) / Z_{i,j}^(0)) | j = 1, ..., l_i - 1}
3. Q^(0) = Base(Union_{i=1,...,N} D_i^(0))
4. For k = 1 to K
   4.1 Set α_k, e.g. α_k = 1/k
   4.2 For all e_i in D
       For j = 1 to l_i - 1
         Z_{i,j}^(k) = (1 - α_k) Z_{i,j}^(k-1) + α_k (Δt_{i,j} + γ^{Δt_{i,j}} Z_{i,j+1}^(k-1))
         v_{i,j}^(k) = (1 - α_k) Q^(k-1)(s_{i,j}, a_{i,j}) Z_{i,j}^(k-1)
                       + α_k (r_{i,j} + γ^{Δt_{i,j}} max_a Q^(k-1)(s_{i,j+1}, a) Z_{i,j+1}^(k-1))
       D_i^(k) = {((s_{i,j}, a_{i,j}), v_{i,j}^(k) / Z_{i,j}^(k)) | j = 1, ..., l_i - 1}
   4.3 Q^(k) = Base(Union_{i=1,...,N} D_i^(k))
5. Output the final unnormalized model, i.e. Q^(K) · Z^(K).

Figure 1: Variable time interval batch Q-learning.

3.3 Batch Advantage Updating

A number of past papers have addressed the problem of extending Q-learning and other related learning methods to variable time intervals and, more generally, to the continuous time setting, e.g. [3, 4]. Of these, the work by Baird [3] is of particular interest to us, since his solution addresses a problem closely related to the one we currently face: function approximation in Q-learning involving variable time intervals is often difficult due to the noise introduced by the varying intervals. Baird proposes a novel method, called advantage updating, which tries to learn the relative advantage of competing actions in any given state. This procedure avoids having to explicitly estimate the Q-value function, thereby bypassing the noisy estimation problem. We briefly describe this method and some modifications we made to make it work for our problem.

Advantage updating is based on the notion of the advantage of an action a relative to the optimal action at a given state s, written A*(s, a). The following is one of several alternative definitions of this quantity.

    A*(s, a) = (1 / t_s) ( r + γ^{t_s} E_{s'}[V*(s')] - V*(s) )    (2)

In the above, note that V* is defined as V*(s) = max_a Q*(s, a), where Q*(s, a) is the Q-value of an optimal policy. We use t_s to denote the time interval immediately following state s, and s' to denote the state reached after s. Alternatively, the advantage can be written in terms of the Q-value by:

    A*(s, a) = (1 / t_s) ( Q*(s, a) - max_{a'} Q*(s, a') )    (3)

This quantity is extremely interesting for us for two reasons: it factors out the dependence of the value function on the time interval, and on the state. Given this notion of advantage, advantage updating is an on-line learning method that learns this function iteratively, by a coupled set of update rules for the estimates of A and V, and a normalization step for A(s, a) which drives max_a A(s, a) towards zero. We exhibit a batch version of this method in Figure 2.

A couple of other modifications were necessary before the two methods just presented, Var-RL(Q) and Batch-AU, could be made to work satisfactorily. One modification has to do with the initialization of the quantities being estimated in the two methods, the Q-value and the A-value, using the empirical cumulative rewards observed in the data, rather than the immediate rewards as in the original on-line methods. The other modification is allowing optional application of function approximation, e.g. applying function approximation only in every n-th iteration. Due to space limitations, we omit the details of these modifications and refer the reader to our technical report [1].
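As a concrete illustration of the quantities introduced in this section, the following is a minimal Python sketch, not the implementation used in our experiments (which employs the ProbE regression engine described in Section 4), of the variable-interval discounted return of Eq. (1), the interval-normalized immediate rewards used to initialize the targets in Figure 1, and the time-normalized advantage of Eq. (3). The names, types, and numbers are illustrative assumptions only.

# Minimal sketch, assuming episodes are lists of (state, action, reward, time)
# quadruples with time measured from the initial event (t_1 = 0), and that some
# estimate q(state, action) is available.
from typing import Callable, Hashable, List, Sequence, Tuple

Event = Tuple[Hashable, Hashable, float, float]  # (state, action, reward, time)

def discounted_return(episode: Sequence[Event], gamma: float) -> float:
    """Eq. (1): R = sum_i gamma^{t_i} * r_i, with variable event times t_i."""
    return sum((gamma ** t) * r for (_, _, r, t) in episode)

def normalized_rewards(episode: Sequence[Event]) -> List[float]:
    """Immediate rewards divided by the length of the interval they cover
    (the initialization targets v^(0)/Z^(0) in Figure 1)."""
    out = []
    for j in range(len(episode) - 1):
        _, _, r, t = episode[j]
        dt = episode[j + 1][3] - t  # interval until the next decision point
        out.append(r / dt)
    return out

def advantage(q: Callable[[Hashable, Hashable], float],
              state: Hashable, action: Hashable,
              candidate_actions: Sequence[Hashable], dt: float) -> float:
    """Eq. (3): A(s, a) = (Q(s, a) - max_a' Q(s, a')) / dt, i.e. the advantage
    of an action relative to the best action, normalized by the interval."""
    best = max(q(state, a) for a in candidate_actions)
    return (q(state, action) - best) / dt

# Tiny usage example with a two-event episode and actions {0: no mail, 1: mail}.
episode = [("s1", 1, 40.0, 0.0), ("s2", 0, 10.0, 2.0)]
print(discounted_return(episode, gamma=0.9))
print(normalized_rewards(episode))
print(advantage(lambda s, a: 1.0 if a == 1 else 0.2, "s1", 0, [0, 1], dt=2.0))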

Procedure Batch-AU
Premise: A base learning module, Base, for regression is given.
Input data: D = {e_i | i = 1, ..., N} where e_i = {(s_{i,j}, a_{i,j}, r_{i,j}, t_{i,j}) | j = 1, ..., l_i}
  (e_i is the i-th episode, and l_i is the length of e_i).
1. For all e_i in D
   1.1 For j = 1 to l_i - 1, Δt_{i,j} = t_{i,j+1} - t_{i,j}
2. For all e_i in D
   D_i^(0) = {((s_{i,j}, a_{i,j}), r_{i,j} / Δt_{i,j}) | j = 1, ..., l_i - 1}
3. A^(0) = Base(Union_{i=1,...,N} D_i^(0))
4. For all e_i in D and for j = 1 to l_i - 1, initialize
   4.1 A_{i,j}^(0) = A^(0)(s_{i,j}, a_{i,j})
   4.2 Amax_{i,j}^(0) = max_a A^(0)(s_{i,j}, a)
   4.3 V_{i,j}^(0) = Amax_{i,j}^(0)
5. For k = 1 to K
   5.1 Set α_k, β_k and ω_k, e.g. α_k = β_k = ω_k = 1/k
   5.2 For all e_i in D
       For j = 1 to l_i - 1
         A_{i,j}^(k) = (1 - α_k) A_{i,j}^(k-1)
                       + α_k (Amax_{i,j}^(k-1) + (r_{i,j} + γ^{Δt_{i,j}} V_{i,j+1}^(k-1) - V_{i,j}^(k-1)) / Δt_{i,j})
       D_i^(k) = {((s_{i,j}, a_{i,j}), A_{i,j}^(k)) | j = 1, ..., l_i - 1}
   5.3 A^(k) = Base(Union_{i=1,...,N} D_i^(k))
   5.4 For all e_i in D and for j = 1 to l_i - 1, update
       A_{i,j}^(k) = A^(k)(s_{i,j}, a_{i,j})
       Amax_{i,j}^(k) = max_a A^(k)(s_{i,j}, a)
       V_{i,j}^(k) = (1 - β_k) V_{i,j}^(k-1) + β_k ((Amax_{i,j}^(k) - Amax_{i,j}^(k-1)) / α_k + V_{i,j}^(k-1))
   5.5 For all e_i in D and for j = 1 to l_i - 1, normalize
       A_{i,j}^(k) = (1 - ω_k) A_{i,j}^(k) + ω_k (A_{i,j}^(k) - Amax_{i,j}^(k))
6. Output the final advantage model, A^(K).

Figure 2: Batch reinforcement learning based on advantage updating.

4. EXPERIMENTS

We evaluated the proposed methods using actual customer interaction data from Saks Fifth Avenue from the past years. Below we describe the data we used for this experimentation in some detail. We then discuss the challenge we face in evaluating the methods using past data, and describe the experimental results.

4.1 Data

As briefly explained in the Introduction, the data we used for our analysis can be categorized into the following four types.

1. Customer data for 1.6 million customers whose recent annual spendings exceeded a certain threshold. These contain demographic and other types of information on the customers. We note that privacy sensitive information was stripped off before the data were used for analytics.

2. Transaction data for the said 1.6 million customers for the past three years. The transaction data are time stamped and contain the entire point of sales data, including the categories of purchased items and sales prices, among other things.

3. Campaign data for the major campaigns of the year; there were 69 such campaigns. These data contain the timing of the mailing, the duration of sales, the types of catalogues sent, and the product groups (divisions) targeted by the campaigns.

4. Product data for the purchased items in the transaction data. These contain taxonomy information on the items, ranging in granularity from product groups down to the SKU level.

These data were then used to generate time stamped sequences of feature vectors containing summarized information on the history of cross-channel interactions (a small illustrative sketch of this kind of feature construction is given at the end of this subsection). The features we generated and elected to use, representing the state of a customer at any given point in time, are summarized in Table 1. (Note that the features also fall into four types according to the data sources.) The third column of Table 1 listed the correlation coefficients between each of the features and the response variable (reward). Here, the response variable was calculated simply by summing the observed profits in the data over a fixed period of time from the time of the mailing in question, so it does not exactly correspond to the cumulative discounted profits that we wish to maximize. It is worth noting, nonetheless, that a very low correlation is observed between the control variable (mailing action) and the response variable, as compared to some of the other features. The control variable has the third lowest correlation coefficient (at 0.008), and is magnitudes lower than those of typical transaction and campaign features. This explains the nature of the cross-channel challenge, and in particular why we tend to get a model that is independent of the control variable when we run a standard regression engine to model the response variable as a function of these features.

Table 1: Features used in our experiments, grouped into (1) customer features, (2) transaction features, (3) campaign features, and (4) product group specific campaign features.

  full line store of residence  - whether a full-line store exists in the area
  off fifth store of residence  - whether an off-fifth store exists in the area
  loyalty program level         - loyalty program level
  fvrt store channel            - favorite store channel (web/store)
  purchase amt 1m               - amount of purchase in the last month
  purchase amt 2 3m             - amount of purchase in the last 2-3 month period
  purchase amt 6m               - amount of purchase in the last 4-6 month period
  purchase amt 1y               - amount of purchase in the last year
  purchase amt tot              - total amount of purchase (in 3 years)
  promo purchase ratio          - ratio of purchases in promotion periods
  cur div purchase amt 1m       - purchase amount in the last month in the current division
  cur div purchase amt 2 3m     - purchase amount in the last 2-3 months in the current division
  cur div purchase amt 6m       - purchase amount in the last 4-6 months in the current division
  cur div purchase amt 1y       - purchase amount in the last year in the current division
  cur div purchase amt tot      - total purchase amount in the current division
  div purchase amt tot j        - total purchase amount in division j
  n cat 1m                      - number of catalogues sent in the last month
  n cat 2 3m                    - number of catalogues sent in the last 2-3 months
  n cat 4 6m                    - number of catalogues sent in the last 4-6 months
  n cat tot                     - total number of catalogues sent
  cur div n cat 1m              - number of catalogues sent in the last month targeting the current division
  cur div n cat 2 3m            - number of catalogues sent in the last 2-3 months targeting the current division
  cur div n cat 4 6m            - number of catalogues sent in the last 4-6 months targeting the current division
  cur div n cat tot             - total number of catalogues sent targeting the current division
  action                        - to mail or not to mail
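As an illustration of how state features of the kind listed in Table 1 can be derived from time stamped transaction and campaign records, the following is a minimal, self-contained Python sketch. It is not the feature-generation code used in the project; the record formats, field names, and window lengths are hypothetical assumptions made for the example.

# Hypothetical record formats: transactions as (time, amount, division),
# catalogue mailings as (time, targeted division).  Times and windows are in days.
from typing import Dict, List, Tuple

Transaction = Tuple[float, float, str]   # (time, purchase amount, division)
Mailing = Tuple[float, str]              # (time, targeted division)

def window_sum(events: List[Tuple[float, float]], now: float, lo: int, hi: int) -> float:
    """Total of event values with timestamps in the window (now - hi, now - lo]."""
    return sum(v for (t, v) in events if now - hi < t <= now - lo)

def state_features(transactions: List[Transaction], mailings: List[Mailing],
                   now: float, current_division: str) -> Dict[str, float]:
    """A few recency-windowed features in the spirit of Table 1."""
    purchases = [(t, amt) for (t, amt, _) in transactions]
    div_purchases = [(t, amt) for (t, amt, d) in transactions if d == current_division]
    catalogues = [(t, 1.0) for (t, _) in mailings]
    div_catalogues = [(t, 1.0) for (t, d) in mailings if d == current_division]
    return {
        "purchase_amt_1m": window_sum(purchases, now, 0, 30),
        "purchase_amt_2_3m": window_sum(purchases, now, 30, 90),
        "purchase_amt_tot": sum(amt for (_, amt) in purchases),
        "cur_div_purchase_amt_1m": window_sum(div_purchases, now, 0, 30),
        "n_cat_1m": window_sum(catalogues, now, 0, 30),
        "cur_div_n_cat_1m": window_sum(div_catalogues, now, 0, 30),
    }

# Usage: features for one customer at the time of a women's-apparel campaign.
txns = [(10.0, 120.0, "cosmetics"), (40.0, 80.0, "womens_apparel")]
mails = [(35.0, "womens_apparel")]
print(state_features(txns, mails, now=60.0, current_division="womens_apparel"))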
4.2 Evaluation

A common problem in the performance evaluation of reinforcement learning methods is that it is often difficult or impossible to conduct a real life experiment in which the learning methods have access to on-line interactions with the MDP. Our current application domain of CRM and database marketing is no exception. Conducting such a performance evaluation reliably using only static historical data is itself a big challenge. The problem is that we need to evaluate the policy generated by the learning procedure using only past data, which presumably were collected using some policy that is different from it. Here we propose a solution to this problem, and use it to conduct the performance evaluation for our methodology. Our solution is based on a notion recently proposed by Kakade and Langford [6] called the policy advantage, and a bias correction technique based on importance sampling, c.f. [10]. We elaborate on this below.

First, we define a discrete time version of the notion of advantage introduced earlier (in Eq. 3), with respect to an arbitrary policy π.

    A_π(s, a) = Q_π(s, a) - max_{a'} Q_π(s, a')    (4)

Then the policy advantage of a new policy π' with respect to an old (or sampling) policy π and initial state distribution µ, written A_{π,µ}(π'), is defined as follows.

    A_{π,µ}(π') = E_{s ~ π,µ}[ E_{a ~ π'(·|s)}[ A_π(s, a) ] ]    (5)

Intuitively, the policy advantage measures how much advantage can result from replacing the action of the old policy by that of the new policy at a random state selected by the sampling policy, while all other actions remain unchanged (i.e. are specified by the sampling policy). In some sense, this measure quantifies how much local improvement is attained by changing from the old policy to the new policy, assuming that the overall state distribution is not significantly affected by that change. (Kakade and Langford [6] have established a theoretical result which implies that a policy with a positive policy advantage can be used to define a new policy that provably performs better than the old policy.)

The policy advantage can be estimated using only data collected by the sampling policy π, using a bias correction technique based on importance sampling, as follows.

    A_{π,µ}(π') = E_{s,a ~ π,µ}[ (π'(a|s) / π(a|s)) · A_π(s, a) ]    (6)

Note that π'(a|s) is known, since it is the (possibly stochastic) policy generated by reinforcement learning, but π(a|s) needs to be estimated from the data, since we do not know the sampling policy explicitly. It should be pointed out that this quantity becomes impossible to estimate for a deterministic sampling policy, since the bias correction factor π'(a|s)/π(a|s) diverges for actions a never chosen by the sampling policy π. In real world settings, however, the state information is often not sufficient to determine the chosen action deterministically, as is the case in our setting.
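As an illustration of the bias-corrected estimate in Eq. (6), the following minimal Python sketch approximates the unknown sampling policy π(a|s) with a logistic regression fitted to historical (state, action) pairs and then forms the importance-weighted average of the advantages. This is an assumption-laden illustration, not the procedure used in our experiments; the use of scikit-learn, the toy numbers standing in for a learned advantage model, and all names are assumptions made for the example.

# Sketch of Eq. (6): estimate the policy advantage of a new policy pi' from data
# collected under an unknown sampling policy pi.  Assumes scikit-learn is
# available; feature values and the toy new policy are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical data: state feature vectors, the action taken (1 = mail, 0 = no mail),
# and the advantage A_pi(s, a) of that action under the old policy (Eq. 4),
# here filled with made-up numbers in place of a learned advantage model.
states = np.array([[120.0, 2.0], [0.0, 0.0], [300.0, 5.0], [15.0, 1.0], [60.0, 3.0]])
actions = np.array([1, 0, 1, 0, 1])
advantages = np.array([0.4, 0.0, 0.9, -0.1, 0.2])

# Step 1: estimate the sampling policy pi(a|s) with a probabilistic classifier,
# and read off the probability of the action actually taken in each record.
sampling_policy = LogisticRegression().fit(states, actions)
p_sampling = sampling_policy.predict_proba(states)[np.arange(len(actions)), actions]

# Step 2: the new (possibly stochastic) policy pi'(a|s) is known, since it is the
# output of the learner; here a toy rule that mails with high probability when
# recent spending is large.
def new_policy_prob(state: np.ndarray, action: int) -> float:
    p_mail = 0.9 if state[0] > 50.0 else 0.1
    return p_mail if action == 1 else 1.0 - p_mail

p_new = np.array([new_policy_prob(s, a) for s, a in zip(states, actions)])

# Step 3: importance-weighted estimate of the policy advantage (Eq. 6).
policy_advantage = np.mean((p_new / p_sampling) * advantages)
print(policy_advantage)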
4.3 Experimental Results

We used the evaluation method just introduced, namely the bias corrected estimation of the policy advantage in the data, to validate the performance of the proposed methods. We used both of the proposed methods, Var-RL(Q) and Batch-AU. In both cases, we used IBM's scalable regression engine, ProbE [7, 2], as the basic regression module. For both methods, we used the option of initializing the Q-value and A-value estimates with the empirical life time values. Also, for both methods, we used the option of applying function approximation at every fourth learning iteration.

In our evaluation, we randomly sampled approximately 1.0 percent of the individual customers from the entire data set (approximately 16 thousand customers out of 1.6 million). The episodic data used for training were then generated by randomly selecting a sub-episode of length 10 (consisting of 10 events) for each of the sampled individuals. A separate test data set consisting of 5,000 randomly selected individuals was also sampled for calculating the policy advantage. The policy advantage was calculated using these individuals' data over the entire 68 campaigns, and for each learning iteration the results were averaged over 10 random runs.

The results of this evaluation are exhibited in Figures 3 and 4. Figure 3 plots how the policy advantage of the policy output by the Var-RL(Q) method changes as the learning iterations progress in a typical run. Figure 4 shows the analogous graph for the Batch-AU method. In each of the graphs, the y-axis is the policy advantage shown as a percentage over the value of the old policy. Strictly speaking, what is shown on the x-axis is not the number of iterations, but rather the number of times function approximation is performed. In both cases, we chose to run function approximation at every fourth learning iteration.

Figure 3: The policy advantage for the variable time interval Q-learning method (in a typical run).

Figure 4: The policy advantage for the Batch Advantage Updating method (in a typical run).

A definite trend can be read off these graphs. For both methods, a typical run starts with a relatively uninformative policy that does not show any advantage over the sampling policy. It is worth noting that, since both methods were initialized with the empirical life time values observed in the data, this shows that direct modeling with empirical LTV does not lead to any advantage over the existing mailing policy. At the third function approximation, or after 9 learning iterations, the policy advantage peaks, and then it starts declining again. This behavior is thought to be attributable, in part, to the nature and limitations of the evaluation method. The policy advantage measures the advantage of a new policy with respect to an old policy, using the old policy as the sampling policy. It is therefore more reliable when the two policies are relatively similar. As learning progresses and the two policies start diverging, the measure becomes less and less reliable.

Even with the limitation mentioned above, the obtained results are quite encouraging. Under the assumption that the new policy does not significantly change the state distribution, the results imply that as much as a 7 to 8 percent increase in the store profits can be expected by employing the policy output by our methodology, over the current mailing policy used at Saks.

5. CONCLUSIONS

We validated our reinforcement learning based approach to life time value modeling and cross-channel optimized marketing on a real world problem. In the course of our investigation, we identified a general problem common to modeling cross-channel interactions, and proposed a solution based on old and new techniques of reinforcement learning. We also provided a solution to the sampling bias problem in the evaluation of learned policies, and used it to evaluate the proposed approach. Some issues for future investigation include the following: (1) easing deployment by reducing the need to customize; (2) handling various channel constraints, including budget constraints.

6. ACKNOWLEDGMENTS

We wish to thank Bill Franks and Sheri Wilson-Gray of Saks Fifth Avenue for their executive leadership in making the joint project possible. We also wish to thank Edwin Pednault and Bianca Zadrozny of IBM Research and John Langford of TTI at Chicago for helpful discussions and assistance.

7. REFERENCES

[1] N. Abe, N. Verma, C. Apte, and R. Schroko. Cross channel optimized marketing by reinforcement learning. Technical Report RC23132, IBM Research, March 2004.
[2] C. Apte, E. Bibelnieks, R. Natarajan, E. Pednault, F. Tipu, D. Campbell, and B. Nelson. Segmentation-based modeling for advanced targeted marketing. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 2001.
[3] L. C. Baird. Reinforcement learning in continuous time: Advantage updating. In Proceedings of the International Conference on Neural Networks, June 1994.
[4] S. Bradtke and M. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems, volume 7. The MIT Press, 1995.
[5] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 1996.
[6] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, July 2002.
[7] R. Natarajan and E. Pednault. Segmented regression estimators for massive data sets. In Second SIAM International Conference on Data Mining, Arlington, Virginia, 2002.
[8] E. Pednault, N. Abe, B. Zadrozny, H. Wang, W. Fan, and C. Apte. Sequential cost-sensitive decision making with reinforcement learning. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2002.
[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[10] B. Zadrozny. Policy mining: Learning decision policies from fixed sets of data. PhD thesis, University of California, San Diego, 2003.