Active Learning for Decision-Making


Maytal Saar-Tsechansky, Foster Provost

Department of Management Science and Information Systems, Red McCombs School of Business, The University of Texas at Austin, Austin, Texas 78712, USA
Department of Information, Operations, and Management Sciences, Leonard N. Stern School of Business, New York University, 44 West Fourth Street, New York, NY 10012, USA
maytal.saar-tsechansky@mccombs.utexas.edu, fprovost@stern.nyu.edu

Working Paper CeDER-04-06, Stern School of Business, New York University, NY, NY, November 2004

Abstract

This paper addresses focused information acquisition for predictive data mining. As businesses strive to cater to the preferences of individual consumers, they often employ predictive models to customize marketing efforts. Building accurate models requires information about consumer preferences that often is costly to acquire. Prior research has introduced many active learning policies for identifying information that is particularly useful for model induction, the goal being to reduce the acquisition cost necessary to induce a model with a given accuracy. However, predictive models often are used as part of a decision-making process, and costly improvements in model accuracy do not always result in better decisions. This paper develops a new approach for active information acquisition that targets decision-making specifically. The method we introduce departs from the traditional error-reducing paradigm and places emphasis on acquisitions that are more likely to affect decision-making. Empirical evaluations with direct marketing data demonstrate that for a fixed information acquisition cost the method significantly improves the targeting decisions. The method is designed to be generic, not based on a single model or induction algorithm, and we show that it can be applied effectively to various predictive modeling techniques.

Key words: active learning, information acquisition, decision-making, class probability estimation, cost-sensitive learning.

1. Introduction

Because of advances in computing power, network reach, availability of data, and the maturity of induction algorithms, businesses increasingly take advantage of automated predictive modeling, or predictive data mining, to influence repetitive decisions. Consider an example: Telecommunications companies face severe customer retention problems, as customers switch back and forth between carriers (the problem of "churn"). For each customer, at each point in time, the company faces a decision between doing nothing and intervening in an attempt to retain the customer. Increasingly, decision-making is based on predictive models built from data on the effectiveness of offers and of inaction. For this example, ideally the predictive model would estimate the probability of imminent loss of the customer; the probability estimate would then be combined with utility information to maximize expected profit. [1]

[1] For this paper we ignore issues pertinent to this example, like calculations of lifetime value, but see (Rosset et al. 2003) for a treatment from the data-mining perspective.

Acquiring additional customer feedback can improve modeling, but this information comes at a cost. Firms collect information about customers directly via solicitations, e.g., surveys of the customers themselves, or via direct acquisition from a third party. For example, Acxiom provides detailed consumer demographic and lifestyle data to a variety of firms, including Fortune 500 firms, in support of their marketing efforts; other direct marketing firms such as Abacus Direct maintain and sell specialized consumer purchase information (New York Times, 1999). Firms also collect information indirectly, via interactions initiated by the firm for the purpose of collecting relevant data, and via normal business interactions (e.g., Amazon's acquisition of customer preferences via purchases and product ratings). All these acquisitions involve costs to the firm.

For this paper, we consider the acquisition of a particular kind of information. Following the terminology used by Hastie et al. (2001), we refer to the data used to induce models as training data. Importantly for this paper, in the usual ("supervised learning") scenario training data must be labeled, meaning that the value of the target variable is known (e.g., whether or not a particular customer would respond positively to the current offer). However, acquiring labels may be costly.

For example, obtaining preference information for individual customers involves solicitation costs, incentives required for revealing their preferences, negative reactions to solicitations, etc. Firms also incur opportunity costs when information is acquired over time through normal business interactions. For example, making a particular offer to a random sample of web site visitors, for the purpose of acquiring training labels, may preclude making another offer already known to be profitable.

Because of the inefficiencies imposed by these label-acquisition costs, researchers have studied the selective acquisition of labels for inductive model building (e.g., optimal experimental design (Kiefer, 1959; Fedorov, 1972) and active learning (Cohn et al., 1994)). The motivation is that focused selection of cases for label acquisition should result in better models for a given acquisition budget, as compared to the standard approach where labels are acquired for cases sampled uniformly at random, and therefore should reduce the cost of inducing a model with a particular level of performance. Research to date offers various label-acquisition strategies for inducing statistically accurate predictive models (e.g., Fedorov, 1972; Cohn et al., 1994; Lewis and Gale, 1994; Roy and McCallum, 2001). However, business applications employ predictive models to help make particular business decisions. Of course, a more accurate model may lead to better decisions, but concentrating on the decisions themselves has the potential to produce a more economical allocation of the information acquisition budget. Prior work has not addressed how labels should be acquired to facilitate decision-making directly.

We consider the decision of whether or not to initiate a business action. A characteristic of such decisions is that they require an estimation of the expected utility for each action, hence (in the presence of uncertainty) an estimation of the probabilities of different outcomes. Consider for example a model for predicting whether a mobile service customer will terminate her contract, where the model supports the decision of whether or not to offer the customer incentives to renew her contract. Prior work provides label acquisition strategies to improve (for a given budget) the estimation of the probability of renewal. However, such a strategy may not be best for improving intervention decisions. As we show later, the ability to identify potentially wasteful investments can result in considerable economies.

The contribution of this paper is the development and demonstration of a new method for selecting cases for label acquisition that (1) targets decision-making directly (it is decision-centric), and (2) can be applied to various predictive modeling techniques (it is generic). The goal is to allow the induction of better models for decision-making, given a fixed label-acquisition budget. We demonstrate the method using data from a direct-marketing campaign, with the objective of acquiring customer feedback that will increase profits from future customer solicitations. The decision-centric approach results in significantly higher decision-making accuracy and profit (for a given number of acquisitions) compared to the usual strategy: sampling cases for label acquisition uniformly at random. Moreover, the decision-making accuracy and profit obtained with the new method are significantly higher than those obtained by acquiring labels to reduce model error. Notably, even though the decision-centric method does result in superior decision-making, the average statistical accuracy of the model induced is lower than that obtained with the error-reducing method. Each method is better at the task for which it was designed. To demonstrate the generic nature of the method, we apply it to three different model induction algorithms and show that the decision-centric method consistently results in superior performance compared to the error-reduction method.

The rest of the paper is organized as follows. Section 2 discusses the current paradigm for selective label acquisition for induction (active learning). In Section 3 we analyze the impact of traditional active learning on decision-making efficacy and lay the theoretical foundation for the new decision-centric approach. The new method is presented in Section 4. Then, in Section 5 we demonstrate the proposed approach: we estimate the costs and benefits of direct mailing decisions and analyze the performance of the proposed and existing label acquisition methods. We present some limitations of the work in Section 6, and we discuss managerial implications and conclude in Section 7.

2. Active Learning: Terminology, Framework and Prior Work

We first introduce the notation and the terminology we employ. A firm wants to induce a probabilistic classification model to estimate the probability of alternative outcomes. A categorical classification model is a mapping of an input vector x ∈ X to a label y ∈ Y from a set of discrete labels, or classes, Y.

The model is constructed through induction, whereby a training set of labeled example (x, y) pairs is generalized into a concise model M: X → Y. Differently from a categorical classification model, a probabilistic classification model also assigns a probability distribution over Y for a given input vector x. A maximum a posteriori decision rule would then map x to the label y with the highest estimated probability. Examples of probabilistic classifiers are the Naïve Bayes classifier (Mitchell, 1997), suitably designed tree-based models (Breiman et al., 1984; Provost & Domingos, 2003) and logistic regression. For brevity, we refer to the estimation of the probability that an input vector x belongs to a class y, p̂(y|x), as class probability estimation (CPE).

2.1 Active Learning

A signature of a modeling technique's predictive performance for a particular domain of application is captured by its learning curve, depicting the model's predictive accuracy [2] as a function of the number of training data used for its induction. A prototypical learning curve is shown in Figure 1, where the model improves with the number of training examples available for induction, steeply at first, but with decreasing marginal improvement (cf. Cortes et al., 1994). For this paper we assume that the cost of acquiring data is uniform, so the learning curve also shows the cost of learning a model with any given accuracy. Consider for example a company modeling consumer preferences to predict the probability of response to various offers. The customer preference model can be improved as more feedback on various offers is acquired, resulting in more effective product recommendations and a potential increase in profit. The cost of acquiring customer feedback corresponds to the graph's x-axis, and hence the learning curve characterizes the model's performance as a function of the information acquisition cost.

Consider a typical setting where there are many potential training examples for which labels can be acquired at a cost; for example, customers to whom we can send an offer to determine whether they will respond.

[2] Model accuracy here refers to a model's predictive performance on out-of-sample data. This measure is sometimes referred to as generalization performance.

Let us refer to examples whose labels are not (yet) acquired as unlabeled examples, and to examples whose labels already have been acquired as labeled examples. The goal of active learning is to acquire the labels of unlabeled examples in order to produce a better model. Specifically, for a given number of acquisitions, we would like the model's generalization performance to be better than if we had used the alternative strategy of acquiring labels for a representative sample of examples (via uniform random sampling).

Figure 1: The learning curve and the effect of active learning. (a) A learning curve describes a model's performance as a function of the number of training examples, or information-acquisition cost. (b) Active learning economizes on information-acquisition cost for a particular model accuracy.

Let us examine the learning curve that results from traditional active learning. The thin learning curve in Figure 1b corresponds to acquiring the labels of examples that were sampled randomly and then using these labeled examples for model induction. The thick-lined curve in Figure 1b is an idealized learning curve resulting from active learning, where fewer labeled training examples are needed to induce a model with any given accuracy. Active learning attempts to label examples that are particularly informative for reducing the model error, so ideally it results in a steeper learning curve. Similarly, for a given acquisition budget (a point on the x-axis), the acquisitions directed by active learning produce a model with lower prediction error.

Active learning methods operate iteratively. At each phase they (i) estimate the expected contribution of potential unlabeled examples if their labels were to be acquired, [3] (ii) label some examples, and (iii) add these examples to the training set. Figure 2 describes an algorithm framework outlining the prevailing active learning paradigm. Specifically, an induction method first is applied to an initial set L of labeled examples (usually selected at random or provided by an expert). Subsequently, sets of M ≥ 1 examples are selected by the active learning method in phases from the set of unlabeled examples UL, until a predefined condition is met (e.g., the labeling budget is exhausted). To select the best examples for labeling in the next phase, each candidate unlabeled example x_i ∈ UL is assigned an effectiveness score ES_i based on an objective function, reflecting its estimated contribution to subsequent induction. Examples then are selected based on their effectiveness scores and their labels are acquired before being added to the training set L. (And the process iterates.)

Input: an initial labeled set L, an unlabeled set UL, a model induction algorithm I, a stopping criterion, and an integer M specifying the number of actively selected examples in each phase.
1. While stopping criterion not met /* perform next phase: */
2.   Apply inducer I to L to induce model E
3.   For each example {x_i | x_i ∈ UL} compute ES_i, the effectiveness score
4.   Select a subset S of size M from UL based on ES
5.   Remove S from UL, label the examples in S, and add S to L
Output: Model E induced with I from the final labeled set L

Figure 2: A Generic Active Learning Algorithm

[3] An often-tacit assumption of active learning methods is that acquiring labels for certain training examples will affect similar examples when the model is used. We will revisit this below.
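To make the iterative framework in Figure 2 concrete, the following is a minimal Python sketch (not part of the original paper) of the generic loop. The inducer is assumed to be any object with scikit-learn-style fit and predict_proba methods, and the effectiveness score shown is the uncertainty-sampling heuristic discussed below (distance of the CPE from 0.5); all function and variable names are illustrative.

import numpy as np

def uncertainty_scores(model, X_unlabeled):
    """Effectiveness scores: higher when the CPE is closer to 0.5 (uncertainty sampling)."""
    cpe = model.predict_proba(X_unlabeled)[:, 1]   # estimated P(y=1 | x)
    return 1.0 - np.abs(cpe - 0.5)                 # maximal when the estimate equals 0.5

def active_learning(inducer, X_lab, y_lab, X_unl, y_oracle, M=10, budget=100,
                    score_fn=uncertainty_scores):
    """Generic active-learning loop following Figure 2.
    y_oracle holds the (costly) labels, revealed only when an example is queried."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    X_unl, y_oracle = X_unl.copy(), y_oracle.copy()
    spent = 0
    while spent < budget and len(X_unl) > 0:
        model = inducer().fit(X_lab, y_lab)        # (1)-(2) induce model E from L
        es = score_fn(model, X_unl)                # (3) effectiveness score for each x in UL
        picked = np.argsort(-es)[:M]               # (4) select M examples (direct selection)
        X_lab = np.vstack([X_lab, X_unl[picked]])  # (5) acquire labels, move S from UL to L
        y_lab = np.concatenate([y_lab, y_oracle[picked]])
        keep = np.setdiff1d(np.arange(len(X_unl)), picked)
        X_unl, y_oracle = X_unl[keep], y_oracle[keep]
        spent += M
    return inducer().fit(X_lab, y_lab)

Plugging in a different score_fn yields other members of the framework; GOAL (Section 4) changes both the score and the direct top-M selection.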

The framework in Figure 2 highlights the challenge of an active learning method: to estimate the relative contribution of possible training examples (the effectiveness score) prior to acquiring their labels. Most existing methods compute effectiveness scores based on some notion of the uncertainty of the currently held model. For example, uncertainty sampling (Lewis and Gale, 1994) is a generic active learning method designed for inducing binary classifiers. Uncertainty sampling defines the most informative examples (whose labels should be acquired) as those examples for which the current classification model assigns a CPE that is closest to 0.5. The rationale is that the classification model is most uncertain regarding the class membership of these examples, and so the estimation of the classification boundary can be improved most by acquiring their labels for training.

2.2 Prior Work

The role of information acquisition in decision support has been studied by many in the management literature. For example, Allen and Gale (1999) examine the role of increasing information costs in the emergence of financial intermediary institutions, demonstrating that the rising cost of the information necessary to participate successfully in sophisticated financial markets was the key factor in the formation of intermediaries. Makadok and Barney (2001) examine the creation of informational advantages by firms through the acquisition of information about their competition. They focus on the acquisition of information for supporting strategy-formulation decisions. This paper concentrates on information acquisition to improve operational decisions that must be made repeatedly, so even modest marginal improvements can have a large cumulative effect on profit.

Many organizations employ predictive models effectively, often as key tools for extracting customer, competitor and market intelligence (Wall Street Journal, 1997; Resnick and Varian, 1997; New York Times 2003a; New York Times, 2003b). Research on predictive models for business intelligence has focused primarily on modeling techniques (e.g., West et al., 1997; Moe and Fader, 2004). However, such intelligence relies on information that requires significant time and/or money to obtain. Therefore, it is important to understand the fundamental properties of information that will be particularly effective for inducing accurate predictive models, so as to direct the acquisition of such information.

Mookerjee and Mannino (1997) consider the cost of retrieving examples for inductive learning. They argue for the important role of information cost in learning, and aim to reduce the cost of the attribute specification required to retrieve cases relevant for classification. They demonstrate that incorporating such considerations can reduce information acquisition costs.

Another related stream of research builds on the (classic) multi-armed bandit problem originally proposed by Robbins (1952). Given k slot machines with different rates of return, a gambler has to decide which to play in a sequence of trials. There are various formulations of the goal, but generally the gambler wants to maximize the overall reward. An important difference between the multi-armed bandit setting and that of active learning is that for the gambler it is sufficient to estimate the success probability of each machine, whereas an active learner must induce a predictive model over the dependent-variable domain space.

The challenge of data acquisition specifically for modeling has been studied extensively in the statistical community. In particular, the problem of optimal experimental design (Kiefer, 1959; Fedorov, 1972), or OED, examines the choice of observations for inducing parametric statistical models when observations are costly to acquire. The objective is to devise a distribution over the independent variables reflecting the contribution of label acquisition for these examples. Although there are substantial similarities between work calling itself active learning and work on optimal experimental design (and not many cross-references), there is an important difference between the two. OED studies parametric statistical modeling, whereas active learning is concerned primarily with non-parametric machine-learning modeling or with generic methods that apply (in principle) to a variety of modeling methods. This is an important distinction because methods for OED depend upon closed-form formulations of the objective function that cannot be derived for non-parametric models.

The fundamental notion of active learning has a considerable history in the literature. Simon and Lea (1974) describe conceptually how induction involves simultaneous search of two spaces: the hypothesis space and the example space. The results of searching the hypothesis space can affect how the example space will be sampled. In the context of acquiring examples for classification problems, Winston (1975) suggests that the best examples to select for learning are "near misses," instances that miss being class members for only a few reasons.

This notion underlies most active learning methods, which address classification models (e.g., Seung et al., 1992; Lewis and Gale, 1994; Roy and McCallum, 2001) and are designed to improve classification accuracy (rather than the accuracy of the probability estimations, to which we will return presently).

As mentioned previously, most existing active-learning methods address categorical classification problems and compute effectiveness scores based on some notion of the uncertainty of the currently held model. To our knowledge, this idea was introduced in the active-learning literature by the Query By Committee (QBC) algorithm (Seung et al., 1992). In the QBC algorithm each potential example is sampled at random, generating a stream of training examples, and an example is considered informative and its label is acquired if classification models sampled from the current version space [4] (Mitchell, 1997) disagree regarding its class membership. The QBC algorithm employs disagreement among different classification models as a binary effectiveness score capturing uncertainty in the current predictions of each unlabeled example's class membership. Subsequently, authors proposed a variety of alternative effectiveness scores for this uncertainty (e.g., Lewis and Gale, 1994; Roy and McCallum, 2001).

A different approach to active learning for categorical classification problems attempts to estimate directly the expected improvement in accuracy if an example's label were acquired. Roy and McCallum (2001) present an active-learning approach for acquiring labeled documents and subsequently using them for inducing a Naïve Bayes document classifier. Their method estimates the expected improvement in class entropy obtained from acquiring the label of each potential learning example; it acquires the example that brings about the greatest estimated expected reduction in entropy.

Decision-making situations often require more than categorical classification. In particular, for evaluating different courses of action under uncertainty it is necessary to estimate the probability distribution over possible outcomes, which enables the decision-making procedure to incorporate the costs and benefits associated with different actions.

[4] The version space refers to the set of all hypotheses or models that predict the correct class of all the examples in the training set.

In targeted marketing, for example, the estimated probability that a customer will respond to an offer is combined with the corresponding costs and revenues to estimate the expected profits from alternative offers. More generally, accurate estimations of response probabilities enable a decision maker to rank alternatives correctly, to identify the actions with the highest expected benefits, and to maximize utility over multiple courses of action. To our knowledge there is only one study of generic active learning methods for inducing accurate class probability estimation (CPE) models (Saar-Tsechansky and Provost 2004), in which the effectiveness score is based on uncertainty in the CPEs rather than in the classifications (we return to this below). However, as we discuss in more detail next, improving the CPEs generally may not be as effective as focusing on the particular decision-making task.

3. Active Learning for Decision-Making

The objective of all prior active learning methods has been to lower the cost of learning accurate models, be they accurate models for categorical classification or accurate models for class probability estimation. Therefore, these methods employ strategies that identify and acquire labels for training examples that are estimated to produce the largest reductions in the model's prediction error. From a management perspective, it is important to ask whether these strategies are best when the learned models will be used in a particular decision-making context. In particular, an error-reducing strategy may waste resources on acquisitions that improve model accuracy but produce little or no improvement in decision-making. More accurate CPEs do not necessarily imply better decision-making. How should active learning strategies be designed to avoid such wasteful investments? We next analyze the relationship between costly label acquisitions and decision-making efficacy, deriving the fundamentals for new active learning approaches designed specifically for decision support.

3.1 The Impact of Label Acquisition on Decision-Making Quality

As described above, we consider the decision of whether or not to initiate a business action, such as mailing a direct marketing solicitation or offering a costly incentive for contract renewal. We would like to estimate whether the expected utility from action would exceed that of inaction.

Let x_i be an example (e.g., the description of a customer) and let f_i denote the (unknown) probability that the action with respect to x_i will be successful (e.g., customer x_i will respond to the marketing campaign, or will renew her contract). Given that action is taken, let the utility of success and the utility of failure with respect to instance x_i be U_S and U_F, respectively. Let the corresponding utility of inaction be Ψ. Finally, let C denote the cost of action. To maximize utility, action should be initiated if the probability of a successful outcome satisfies f_i U_S + (1 − f_i) U_F − C ≥ Ψ, or equivalently, if f_i exceeds the threshold f* given by

    f* = (Ψ + C − U_F) / (U_S − U_F).    (1)

For a decision maker to act optimally it is necessary to estimate the probability of success. Because training information is costly, we would like to reduce the cost of inducing an estimation model that will render decisions of a given quality. One approach to reducing the cost of learning accurate CPEs is via traditional active learning methods, which are designed to improve the model's average performance over the instance space. However, improvement of the class probability estimations may not always be justified. Consider the case in which the actual probability of success exceeds the threshold f* (suggesting action is better than inaction). For the induced model to allow a decision maker to act optimally it is sufficient that the estimated probability of success f̂_i exceed the threshold as well, even if it is highly inaccurate. Improvement of the probability estimation when the current estimation already specifies the correct decision would not affect decision-making, and therefore the cost of the improvement would be wasted. In fact, as we will illustrate, if the true probability is just above the threshold and the estimate has a non-negligible variance, improving the estimate may adversely affect decision-making (cf. Friedman 1997).
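As a small illustration of the decision rule in equation (1) (a sketch, not part of the original paper; the numbers are hypothetical), the functions below compute the threshold f* from the utilities and compare it with a model's estimated probability of success:

def action_threshold(u_success, u_failure, u_inaction, cost):
    """Equation (1): f* = (Psi + C - U_F) / (U_S - U_F)."""
    return (u_inaction + cost - u_failure) / (u_success - u_failure)

def should_act(f_hat, u_success, u_failure, u_inaction, cost):
    """Act when the estimated success probability reaches the threshold."""
    return f_hat >= action_threshold(u_success, u_failure, u_inaction, cost)

# Illustrative numbers: acting costs 1, success pays 20, failure and inaction pay 0,
# so f* = 1/20 = 0.05.
print(action_threshold(20.0, 0.0, 0.0, 1.0))   # 0.05
print(should_act(0.08, 20.0, 0.0, 0.0, 1.0))   # True: 0.08 exceeds 0.05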

Since a model is induced from a sample, the model's probability estimation f̂_i can be treated as a random variable. Let Γ_i be the best action and let Γ̂_i be the estimated best action derived using the model's probability estimation. Similarly to Friedman's analysis of incorrect classification decisions (Friedman, 1997), the probability of making a wrong decision, i.e., a decision that is inconsistent with the decision derived using the true probability of success, is given by

    P(Γ̂_i ≠ Γ_i) = I(f_i < f*) ∫_{f̂_i ≥ f*} p(f̂_i) df̂_i + I(f_i ≥ f*) ∫_{f̂_i < f*} p(f̂_i) df̂_i,    (2)

where the indicator function I(·) is 1 if its argument is true and 0 otherwise. For example, if the actual probability were smaller than the threshold f*, the expected utility of action would not exceed that of inaction; a sub-optimal decision would result if the estimated probability were larger than the threshold. In order to reduce the cost of inducing a probability estimation model that will allow for satisfactory decision-making, it is important to understand the circumstances under which costly improvements in CPE accuracy should be avoided. If we approximate p(f̂_i) with a normal distribution, the probability of making an inconsistent decision is given by

    P(Γ̂_i ≠ Γ_i) = Φ( sgn(f_i − f*) (E[f̂_i] − f*) / sqrt(var f̂_i) ),    (3)

where Φ denotes the right-hand tail of the standard normal distribution, E denotes an expectation, and var denotes the variance of a random variable.

Assume for illustration that a learner is used to induce a model from a training sample for estimating the probability that customers would respond to a certain offer. For a given customer x_i the model produces (on average over different samples) a CPE such that the expected profit from an offer solicitation to x_i is higher than the expected profit of not making the offer, i.e., E[f̂_i] > f*. Also assume that the true probability of response suggests the same (f_i > f*). So we expect the model to lead to the correct decision: make the offer. Under such circumstances it may not be cost-effective to acquire additional labels (customer feedback) to improve the estimation; improving the estimation may increase the chance of decision-making error! From (3) we see that, indeed, the larger the average CPE produced by the learner, and hence the more biased the estimates are, the more likely it is that the model would produce the correct decision. This is because the larger bias reduces the chance that, due to estimation variance, the estimated expected profit from action would (mistakenly) fail to exceed that of inaction.
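Taking equation (3) as reconstructed above (with Φ implemented as the standard normal upper tail, scipy.stats.norm.sf), the short sketch below, which is not from the paper and uses illustrative numbers, evaluates the probability of an inconsistent decision and illustrates the point just made: when the expected estimate already lies on the correct side of the threshold, a larger (more biased) estimate makes an inconsistent decision less likely.

from scipy.stats import norm
import numpy as np

def p_inconsistent(f_true, f_star, e_fhat, var_fhat):
    """Equation (3); norm.sf is the right-hand tail of the standard normal."""
    z = np.sign(f_true - f_star) * (e_fhat - f_star) / np.sqrt(var_fhat)
    return norm.sf(z)

# f_true and E[f_hat] both above the threshold: the decision is correct on average.
print(p_inconsistent(f_true=0.07, f_star=0.05, e_fhat=0.06, var_fhat=0.01**2))  # ~0.16
# A more "biased" (larger) average estimate is even less likely to flip the decision:
print(p_inconsistent(f_true=0.07, f_star=0.05, e_fhat=0.09, var_fhat=0.01**2))  # ~3e-5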

There is an incentive, however, to remove such CPE bias when the estimated probability and the true probability of response lead to different decisions. [5] For example, assume that the true expected benefits from inaction exceed those of action, but that on average the learner induces a model that suggests otherwise. In this scenario improving the CPEs reduces the likelihood of decision-making error.

Existing active learning methods employ a greedy strategy, acquiring examples for model improvement whenever such improvement is deemed possible. The above analysis suggests that for cost-effective acquisition of examples to support decision-making, an active learner is well advised to take a decision-centric strategy, acquiring labels for training examples when an improvement in the estimation is likely to lead to better decisions, and avoiding acquisitions otherwise, even if they might produce a more-accurate model. Unfortunately, the true probability and thus the right decision are unknown, and therefore it is impossible to determine whether or not an improvement is called for.

In summary, active label acquisition targeting improved accuracy generally may not be best for cost-effective decision-making. In fact, somewhat counter-intuitively, in certain cases improving CPEs can be detrimental to decision-making. Ideally, we would like to improve CPEs only when the decision is wrong; however, because f_i is unknown we cannot determine whether or not the model's prediction is correct. In the following section we develop an approach for cost-effective acquisition of examples that offers an alternative.

4. Goal-Oriented Active Learning

Instead of estimating directly which decisions are erroneous, we propose an alternative method based on a related property that avoids the need to know f_i. We propose acquiring labels for examples where a relatively small change in the probability estimation can affect decision-making, and avoiding acquiring labels otherwise.

[5] There is an incentive whenever the estimated probability f̂_i and the true probability f_i are on different sides of the threshold f*.

Specifically, we will prefer acquisitions when f̂_i is closer to f*_i. For example, consider two scenarios concerning a given decision. In scenario A the estimated class probability is considerably higher than the threshold probability. In scenario B the estimated probability of response is only marginally greater than the threshold probability. In scenario A the evidence in the training data is strongly in favor of action. As a result, a more substantial change in the estimated probabilities is necessary to affect the decision in scenario A as compared to scenario B, requiring more training examples to sway the estimation in favor of inaction (all else being equal). The approach we propose here acquires labeled examples pertaining to decisions that are likely to be less costly to affect, i.e., decisions for which a relatively small change in the estimation can change the preference order of choice. Of course, although the design is suggested by the theoretical development above, this is a heuristic method.

The new method we propose operates within the active learning framework presented in Figure 2. At each phase, M ≥ 1 examples are selected from the set of unlabeled examples UL; their labels are acquired and the examples are added to the set of labeled training examples L. The effectiveness score is calculated as follows. Each example x_i ∈ UL is assigned a score that reflects the relative effect the example is expected to have on decision-making if its label were acquired and the example added to the training set. In particular, the score is inversely proportional to the minimum absolute change in the probability estimation that would result in a decision different from the decision implied by the current estimation, i.e., the score of example x_i is inversely proportional to |f̂_i − f*_i|.

For selection, rather than selecting the examples with the highest scores ("direct selection," as is common in active learning), a sampling distribution is created. Specifically, the effectiveness scores are considered to be weights on the examples, and examples are drawn from a distribution in which the probability of an example being selected for labeling is proportional to its weight. Earlier work (Iyengar et al., 2000; Saar-Tsechansky and Provost 2004) has demonstrated that sampling from a distribution of effectiveness scores is preferable to direct selection.

It reduces the chance of acquiring labels of outliers or other atypical examples (Saar-Tsechansky and Provost 2004).

Figure 3: The Goal-Oriented Active Learning (GOAL) Algorithm
Input: a set of unlabeled examples UL, an initial set of labeled examples L, an inducer (learner) I, and a stopping criterion.
While (stopping criterion not met)
1. Apply inducer I to L, resulting in estimator E
2. Apply estimator E to UL
3. For all examples {x_i | x_i ∈ UL} compute D(x_i) = λ / (β + |f̂_i − f*_i|), where λ is a normalizing factor such that D is a distribution
4. Sample from the probability distribution D a subset S of M examples from UL, without replacement
5. Remove S from UL, label the examples in S, and add them to L
End While
Output: Estimator E induced from L

Formally, the sampling-distribution weight D(x_i) assigned to example x_i ∈ UL is given by D(x_i) = λ / (β + |f̂_i − f*_i|), where β is some small real number used to avoid division by zero, f*_i = (Ψ + C − U_F) / (U_S − U_F) as above, and λ is a normalizing factor, λ = 1 / Σ_{i=1..size(UL)} [1 / (β + |f̂_i − f*_i|)], that makes D a distribution. Figure 3 presents pseudocode of the method, which we call Goal-Oriented Active Learning (GOAL).
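A minimal Python sketch (not from the paper) of steps 3 and 4 of Figure 3: turning the decision-sensitivity weights into a sampling distribution and drawing M examples without replacement. The value of β and the function names are illustrative; numpy's Generator.choice performs the weighted draw.

import numpy as np

def goal_sampling_weights(f_hat, f_star, beta=1e-3):
    """D(x_i) proportional to 1 / (beta + |f_hat_i - f*_i|): examples whose estimates
    sit close to their decision thresholds receive more weight.
    beta: small constant to avoid division by zero (value here is illustrative)."""
    w = 1.0 / (beta + np.abs(f_hat - f_star))
    return w / w.sum()                      # lambda normalization: D is a distribution

def goal_select(f_hat, f_star, M, rng=np.random.default_rng(0)):
    """Sample M unlabeled-example indices from the GOAL distribution, without replacement."""
    D = goal_sampling_weights(f_hat, f_star)
    return rng.choice(len(f_hat), size=M, replace=False, p=D)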

5. A Direct Marketing Campaign Case Study

We evaluate GOAL using data from a direct-marketing campaign. We use these data for evaluation because they comprise real consumer interactions along with information-acquisition costs. The data pertain to a charity's periodic solicitations to potential donors and are publicly available through the University of California at Irvine repository (Blake et al., 1998). The challenge of direct marketing stems from the non-negligible cost of solicitation; hence the organization seeks to maximize campaign profits by better targeting potential donors. For these data, each solicitation costs 68 cents (printing, mailing costs, etc.) and response amounts range from $1 to $200, with an average of $15. The average response rate is approximately 5%. Because of the low response rate and the cost of solicitation, informed decisions that minimize wasteful solicitations are critical to the success of the campaigns. Importantly for this paper, the estimated probabilities come from an induced predictive model that in turn requires costly acquisitions of customer responses. For a cost-effective utilization of the charity's donor base, it is important to reduce the number of solicitations necessary to allow for effective targeting, or alternatively, to increase the effectiveness of targeting for a given solicitation budget.

5.1 Acquisition Strategies for the Direct Marketing Problem

Let us first describe the context in which active acquisition of consumer responses takes place. Given an acquisition budget, an acquisition strategy solicits potential donors and acquires their responses (i.e., whether or not a given consumer responded to the solicitation, and if so, in what amount). These become the labeled training data. Once the labels are acquired, a targeting model is induced from the training data and is subsequently employed to target potential donors for a new campaign. In the new campaign, a successful solicitation is one that results in a contribution that exceeds the solicitation cost. So, the objective is to reduce the acquisition cost necessary to achieve a particular level of profit, or alternatively to increase the profit for a particular acquisition investment. We will compare three label-acquisition strategies:

(1) Acquisition of responses from a representative set of donors, using random sampling from a uniform distribution. Uniform random sampling is the most widely applied practice for acquiring labels for a set of unlabeled training examples. In spite of its simplicity of implementation, random sampling is remarkably effective because it attempts (implicitly) to obtain a representative sample of the example space. We will refer to this label-acquisition strategy as RANDOM.

(2) An active-learning method that focuses on error reduction: BOOTSTRAP-LV. Because probability estimates are used to evaluate the expected profitability of alternative solicitations, an acquisition strategy that improves these estimations is also likely to improve targeting decisions. To our knowledge BOOTSTRAP-LV is the only generic method designed specifically to reduce class probability estimation error. BOOTSTRAP-LV follows the traditional paradigm of using uncertainty in the estimations to calculate effectiveness scores. Specifically, BOOTSTRAP-LV estimates the variance in learned models' response probabilities for each potential example, and assigns a higher score to the acquisition of responses from examples with higher variance. BOOTSTRAP-LV was shown to result in lower probability estimation error for a given acquisition cost compared both to random acquisition of responses and to active learning designed for improving categorical classification accuracy (Saar-Tsechansky and Provost 2004).

(3) GOAL. Let the estimated probability that a potential donor x_i would respond to a mailing be f̂_i, the estimated contribution amount (described below) be Û_S, and the mailing cost be C. The profit from inaction is zero; hence a solicitation is initiated if f̂_i Û_S − C ≥ 0, and the threshold probability is f*_i = C / Û_S. Therefore, the weight assigned to acquiring donor i's response in GOAL is given by λ / (β + |f̂_i − f*_i|) = λ / (β + |f̂_i − C/Û_S|).
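As a small worked illustration of strategy (3), not from the paper: the 68-cent mailing cost is the figure quoted above, while the estimated gift amount, estimated response probability, and β are hypothetical.

C = 0.68                     # mailing cost in dollars (from the case study)
u_hat_s = 15.0               # estimated gift amount for donor i (illustrative)
f_hat = 0.05                 # model's estimated response probability (illustrative)
f_star = C / u_hat_s         # decision threshold, about 0.045
mail = f_hat * u_hat_s > C   # True: expected revenue $0.75 exceeds the $0.68 cost
beta = 1e-3                  # small constant to avoid division by zero (illustrative)
weight = 1.0 / (beta + abs(f_hat - f_star))   # GOAL acquisition weight, before normalization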

5.2 Experimental Setting

In order to evaluate the three acquisition strategies, we compare the decision-making efficacy and the profits generated from solicitation decisions derived from the models induced with each. We now describe the induction methods examined, the data partitioning, and the method for calculating generated profits.

For estimating the probability of response, we use three induction methods. [6] Our first experiments focus on Probability Estimation Trees (PETs): unpruned C4.5 classification trees (Quinlan, 1993) for which the Laplace correction (Cestnik 1990) is applied at the leaves. Not pruning and using the Laplace correction have been shown to improve the CPEs (Provost and Domingos 2003; Perlich et al. 2003). Subsequently, in order to demonstrate the generic nature of the methods, we also compare the three acquisition strategies using logistic regression and Naïve Bayes (Mitchell 1997). For this application, revenues from successful solicitations are not known in advance and therefore also must be estimated from the data. We use a linear regression model based on a set of predictors that was identified in earlier studies. [7]

On a separate (holdout) set of potential donors, we compare the profits generated by each method for an increasing number of acquired, labeled training examples. More specifically, at each phase the responses of M additional donors are acquired by each method and added to its respective set of training examples. Each point on each curve shown hereafter is an average over 10 independent experiments.

[6] The predictors are: household income range, date of first donation, date of most recent donation, number of donations per solicitation, number of donations given in the last 18 months, amount of last donation, and whether the donor responded to three consecutive promotions.
[7] The predictors for the regression model are: the amount of the most recent gift, the number of donations per solicitation, the average donation amount in response to the last 22 promotions, and the estimated probability of donation as estimated by the CPE model. Following Zadrozny and Elkan (2001), the CPE estimation is incorporated as a predictor in the linear regression model to remove a sample selection bias. Because large gifts are rare, there exists a selection bias toward one group of frequent donors who donate small amounts, resulting in the regression model underestimating gifts by donors who contribute large amounts infrequently. To alleviate such a bias, Heckman (1979) recommends incorporating the probability of belonging to either group (i.e., the probability of making a donation) as a predictor in the regression model.
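For reference, the Laplace correction mentioned above replaces the raw leaf frequency k/n with a smoothed estimate; for a two-class problem this is (k+1)/(n+2), which pulls small-sample leaf estimates toward 1/2. A one-line sketch (the function name is ours, not from the paper):

def laplace_leaf_cpe(k, n, num_classes=2):
    """Laplace-corrected class probability estimate at a tree leaf:
    k positive examples out of n training examples reaching the leaf."""
    return (k + 1.0) / (n + num_classes)

print(laplace_leaf_cpe(k=2, n=2))   # 0.75 rather than the raw, overconfident 1.0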

For each experiment, the data are randomly partitioned into: an initial set L of labeled training examples selected at random (used to build the first model); an unlabeled pool of donors UL from which the three strategies acquire additional responses, which then are added to L (cumulatively as the curves progress); and an independent out-of-sample test set T of potential donors whose responses and donations are known, for evaluating the three methods. To reduce variance, the same data partitions are used by all methods.

The profit for each method is calculated via the following simulated process (recall, responses are known to the experimenters for the entire test set). For each potential donor in the test set, either a solicitation is mailed or no action is taken. The solicitation is mailed if the expected revenue exceeds the solicitation cost, i.e., if f̂_i Û_S > C. The cost of mailing is subtracted from the total profit whenever a solicitation is made; if a donor responds to a solicitation, the actual donated amount is added to the overall profit. This profit calculation is depicted in Figure 4.

Figure 4: Decision-making profitability calculation from charity solicitations
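The simulated profit computation depicted in Figure 4 can be written as a short sketch over the holdout set (this is our reading of the procedure described above, not code from the paper; array names are illustrative):

import numpy as np

def campaign_profit(f_hat, u_hat, actual_donation, cost=0.68):
    """Profit of targeting decisions on a test set.
    f_hat: estimated response probabilities; u_hat: estimated gift amounts;
    actual_donation: true donated amount (0 for non-responders); cost: per-mailing cost."""
    mail = f_hat * u_hat > cost                 # mail only when expected revenue exceeds cost
    return np.sum(actual_donation[mail]) - cost * np.sum(mail)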

5.3 Results

In order to evaluate the effectiveness of the GOAL acquisition policy we first measure the accuracy of the mailing decisions enabled by each acquisition strategy. Specifically, we measure decision-making efficacy as the proportion of targeting decisions made correctly by each model. Ultimately, the capability to avoid non-profitable mailings as well as to identify profitable ones is critical to a campaign's success. Figure 5 shows the proportion of mailing decisions made correctly when GOAL, BOOTSTRAP-LV and RANDOM are used to acquire donor responses for model induction. Mailing decision accuracy is shown for an increasing cost of label (response) acquisitions. Each method was given (the same) 2000 initial training examples. At each phase 10 responses were acquired by each method, and in total 5000 responses were acquired actively by each. Note that initially all methods have access to the same, small set of responses. Therefore, the same probability estimation model is induced by all methods, resulting in the same performance. As additional donors' responses are acquired by each method, the sets of responses available for training begin to differ in composition, resulting in different learned models. As Figure 5 shows, as more responses are acquired and the composition of the training sets diverges, the relative advantage of GOAL becomes more apparent. GOAL improves (on average) more decisions per acquisition than either of the other methods. For a given cost, a model trained with donor responses acquired by GOAL obtains a higher proportion of correct targeting decisions when compared with BOOTSTRAP-LV's CPE-error-reduction policy or with the acquisition of responses uniformly at random. Similarly, RANDOM acquisitions are clearly inferior to those obtained with BOOTSTRAP-LV. GOAL's superiority with respect to BOOTSTRAP-LV is statistically significant (p=0.05) once 200 responses are acquired by each method.

Figure 5: Mailing accuracy rate (proportion of accurate mailing decisions for GOAL, BOOTSTRAP-LV and RANDOM) as a function of response acquisition cost
Table 1: Percent improvement in mailing effectiveness obtained using GOAL with respect to BOOTSTRAP-LV (p=0.05); columns: number of response acquisitions, percentage improvement

The difficulty of improving model accuracy sufficiently to improve targeting decisions is well demonstrated by the number of acquisitions required to obtain a given improvement in mailing-decision performance.

For example, BOOTSTRAP-LV must acquire more than 2000 responses in order to increase the mailing-decision accuracy from 15.1% to 16.4%; GOAL must acquire only about 300 responses to exhibit the same improvement in performance. On average over all acquisition phases, GOAL increases the mailing-decision accuracy rate by 3.66%. The largest improvements are exhibited in the early acquisition phases, where GOAL results in more than a 9% improvement compared to BOOTSTRAP-LV. Table 1 shows the improvements in the proportion of correct mailing decisions obtained with GOAL over BOOTSTRAP-LV, for an increasing number of response acquisitions. The reported improvements are significant according to a paired t-test (p=0.05).

Figure 6: Proportion of profitable and non-profitable donors targeted with GOAL and BOOTSTRAP-LV, as a function of response acquisition cost. (a) Proportion of profitable donors targeted. (b) Proportion of non-profitable donors targeted.

One element of campaign profitability is the model's effectiveness at targeting profitable donors, and similarly at avoiding targeting non-profitable ones. Figure 6a reports, for an increasing number of response acquisitions, the proportion of the set of profitable donors targeted with GOAL and BOOTSTRAP-LV. Figure 6b reports the proportion of non-profitable donors targeted by each. Clearly, the training responses acquired with GOAL produce a model that identifies more of the profitable donors and that avoids targeting more of the non-profitable donors than do the training responses acquired with the error-reducing approach. GOAL's performance is already statistically significantly superior (p=0.05) before 200 responses are acquired.

Taken together, these results strongly support our contention that for this problem, GOAL's decision-centric response acquisitions are more informative and effective (on average) than acquiring training responses to improve CPE accuracy generally (using BOOTSTRAP-LV).

Of course, it is possible that the improved decision accuracy afforded by GOAL simply is a result of improved class probability estimation. GOAL is designed to improve decisions directly, while BOOTSTRAP-LV's acquisition strategy is designed to improve the model's CPEs. However, perhaps BOOTSTRAP-LV is not effective at its intended purpose. Figure 7a compares the error of the probability estimates produced by GOAL with those generated by BOOTSTRAP-LV (as always, on out-of-sample test sets). Probability estimation accuracy is measured with BMAE (Best-estimate Mean Absolute Error), computed as

    BMAE = (1/N) Σ_{i=1..N} |p_Best(x_i) − p(x_i)|,

where p(x_i) is the probability estimated by the model under evaluation (which was induced from the selected subset of the available examples); N is the number of test examples on which the models are evaluated; and p_Best is a surrogate for the best estimated probability, estimated by a best model induced using the entire set of available examples L ∪ UL (and using a more complicated modeling approach, a bagged PET, which generally produces superior CPEs as compared to a single PET (Provost and Domingos, 2003; Perlich et al. 2003)).

Figure 7: Comparison of CPE accuracy and mailing profitability using a PET model, as a function of response acquisition cost. (a) Error rates of estimated probabilities of response (CPE error) for GOAL and BOOTSTRAP-LV. (b) Profits from direct mailing solicitations for GOAL and BOOTSTRAP-LV.
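A short sketch of the BMAE computation defined above (array names are illustrative):

import numpy as np

def bmae(p_best, p_model):
    """Best-estimate Mean Absolute Error between the surrogate 'best' CPEs
    (from a bagged PET trained on all available data) and the evaluated model's CPEs."""
    return np.mean(np.abs(np.asarray(p_best) - np.asarray(p_model)))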

In contrast to the pattern shown in Figure 5, on average the class probability estimations obtained with GOAL for a given acquisition cost are considerably worse than those obtained with BOOTSTRAP-LV. BOOTSTRAP-LV's improved average error is statistically significant (p=0.05) after both strategies have acquired 600 examples. BOOTSTRAP-LV acquires responses that improve the accuracy of response probability estimation, regardless of the subsequent impact on decision-making. As the discussion in Section 3 suggests, some improvements in CPE accuracy may not impact decision-making. Because these improvements come at a cost, they result in wasteful solicitations that are not rewarded with improved mailing decisions. GOAL is designed to avoid such acquisitions, and it is able to exhibit improved decision-making for a given cost; however, the average probability estimations it produces are inferior.

Figure 7b explores whether the improved decision accuracy also results in superior profitability. The graph shows the profits generated from direct marketing mailings for an increasing cost of response acquisitions. From 200 response acquisitions onward, and until 2500 responses are acquired, GOAL results in statistically significantly higher profits than BOOTSTRAP-LV, according to a paired t-test (p=0.05); again, GOAL produces models yielding better targeting decisions.

In summary, Figures 6 and 7a show that GOAL and BOOTSTRAP-LV each excels at the task for which it was designed: BOOTSTRAP-LV to improve the average CPEs and GOAL to improve decision-making. In particular, while BOOTSTRAP-LV obtains considerably better average probability estimations for a given cost, these improvements often do not result in more accurate targeting. GOAL, on the other hand, avoids many costly CPE improvements that are not likely to alter decisions, and thereby reduces the cost of obtaining a given level of decision-making efficacy.

We designed GOAL to be generic: it does not depend on the form of the model or on the induction algorithm. Hence, it can be applied with any model for estimating class probabilities. Whether it will be effective for various models must be demonstrated empirically. Figure 8a compares the accuracy of targeting decisions using logistic regression models induced with GOAL and with BOOTSTRAP-LV. Again, GOAL is able to acquire more-informative responses for inducing donor response models.

GOAL's superior performance is statistically significant (p=0.05) once 200 donor responses are acquired. Figure 8b shows the direct-mailing decision accuracy for GOAL and BOOTSTRAP-LV when the base model is a Naïve Bayes classifier. For this model as well, GOAL results in better decision-making for a given investment in response acquisition. For the Naïve Bayes model, GOAL's superiority is statistically significant (p=0.05) once more than 750 donor responses have been acquired by each policy.

Figure 8: Mailing decision accuracy rate (proportion of accurate mailing decisions) using GOAL and BOOTSTRAP-LV, as a function of response acquisition cost. (a) Mailing accuracy rate with logistic regression. (b) Mailing accuracy rate with a Naïve Bayes model.

Figure 9 compares the profitability resulting from GOAL's acquisitions to the profitability obtained with BOOTSTRAP-LV. For the logistic regression model, GOAL's improvement is significant (p=0.05) once 200 acquisitions are made by each method. Once 4000 responses are acquired, the two acquisition policies again exhibit comparable performance. GOAL results in statistically superior profitability with the Naïve Bayes model once more than 2600 responses have been acquired.