Modeling Consumer Footprints on Search Engines: An Interplay with Social Media 1


Anindya Ghose
Stern School of Business, New York University, aghose@stern.nyu.edu

Panagiotis G. Ipeirotis
Stern School of Business, New York University, panos@stern.nyu.edu

Beibei Li 2
Heinz College, Carnegie Mellon University, beibeili@andrew.cmu.edu

(Forthcoming at Management Science)

Abstract

It is now well understood that social media plays an increasingly important role in consumers' decision making. However, an overload of social media content in product search engines can hinder consumers from efficiently seeking information. We propose a structural econometric model to understand consumers' preferences and costs on search engines to improve user experience under unstructured social media. Our model combines an optimal stopping framework with an individual-level random utility choice model and analyzes click behavior in conjunction with purchase choices. Our model takes into account three major constraints in a consumer's decision-making process: (1) interdependency in decision making for different alternatives; (2) sequential arrival of information revealed by click-throughs; (3) non-negligible search cost. Our approach allows us to jointly estimate consumers' heterogeneous preferences and search costs under the interplay of social media and search engines, and predict search and purchase behavior for each consumer. We validate the model using an individual session-level dataset of approximately 7 million observations resulting in room bookings in 2,117 U.S. hotels. Interestingly, our analysis allows us to quantify the trade-off between consumers' benefits and cognitive costs from using large-scale unstructured social media information during decision making. Our policy experiments show that providing a carefully curated digest of social media content during the earlier stages of consumer search (i.e., on the search results summary page) can lead to a 12.01% increase in the overall search engine revenue.

1 The authors thank Travelocity for providing data and support to this research. The authors thank the department editor, the anonymous associate editor, and three anonymous reviewers for their constructive comments. For valuable feedback and suggestions, the authors also thank Ravi Bapna, Pradeep Chintagunta, Daria Dzyabura, Alok Gupta, Elisabeth Honka, Sam Hui, Sergei Koulayev, Ramayya Krishnan, Michael D. Smith, Rahul Telang, Russell S. Winer, and audiences at the 2012 International Conference on Information Systems, 2012 Workshop for Information Systems and Economics, 2012 INFORMS Annual Conference, 2014 Marketing Science Conference, Carnegie Mellon University, Stern School of Business at New York University, and Carlson School of Management at University of Minnesota. An earlier conference version of this paper won the Best Theme Paper Award at the International Conference on Information Systems.

2 Author names are in alphabetical order.

1. Introduction

With the growing pervasiveness of social media, the volume and complexity of information that product search engines need to access from their own platforms have been increasing rapidly. For example, websites such as Amazon.com, TripAdvisor.com, or Yelp.com can attract hundreds or even thousands of review postings that compete for users' attention. The onslaught of exploding social media content can lead to a significant information overload for consumers during product search. Such excess content can hinder consumers from efficiently seeking information and making decisions (e.g., Iyengar and Lepper 2000). What is worse, it may discourage consumers from searching and cause unexpected termination of search (e.g., session drop-out).

During the past decade, product search engines have been trying to combine advanced techniques from information retrieval (e.g., Google Product Search) and recommender systems (e.g., Amazon.com) into their ranking design to improve the user search experience. Recent studies show that product search engines can improve the ranking design and the user search experience based on the prediction of consumer preferences (e.g., Ghose et al. 2012, De los Santos and Koulayev 2014). Because consumers want the most desirable results early on, search engines can reorder the results by the predicted probabilities of consumer preferences (e.g., clicks or purchases).

Previous studies have examined how to estimate customer preferences based on online purchase information only (e.g., Ghose et al. 2012). However, consumer footprints on search engines provide us with a tremendous amount of information that reveals their preferences, even in the absence of purchases (e.g., Koulayev 2014, Kim et al. 2010, De los Santos and Koulayev 2014). When this search behavior is combined with purchases, the signals become even more comprehensive and useful.
However, although many studies have used either historical click-throughs or conversions separately to estimate consumer preferences, there is little work on jointly analyzing search and purchase behavior to infer individual consumer preferences and identify the products that best satisfy user needs. With the deluge of structured and unstructured social media content, consumers' cognitive costs in searching and evaluating product information become non-negligible. As a result, search costs also play an important role in affecting consumers' choices in product search engines. Therefore, a major goal of our study is to better understand consumers' online footprints by taking into account consumers' heterogeneous preferences and search costs, using both click and purchase information.

However, this task can be challenging, because the cause of an observed search behavior by a consumer is hard to identify. For example, the fact that a consumer prefers to click product A over product B may be because of a higher valuation for A, or because the consumer has incurred a lower search cost in searching for A than for B. More generally, the challenge in predicting consumer choice with search cost is to simultaneously identify consumers' heterogeneous preferences and search costs (Hortacsu and Syverson 2004). A consumer may stop searching either because of a high valuation for the products already found or because of a high search cost. Either the preferences for product characteristics or the moments of the search cost distribution can explain the same observed search outcome (Koulayev 2014). Keeping the above in mind, another major goal of our study is to identify heterogeneous search costs under the social media context, examine their effect on consumer search behavior, and provide insights to product search engines on better design and management of social media content to improve user experience.

The key identification strategy for consumer search cost in our study relies on the exclusion restriction that consumer preferences enter the decision-making processes of both search and purchase, whereas consumer search cost enters only the search decision-making process. Once the consideration set is generated after search, the conditional purchase decision should depend only on the consumer preferences. Our unique dataset containing both consumer search data and purchase data allows us to identify these effects. In addition, we model search cost as a function of an exclusive set of variables. From an empirical identification perspective, we can simply view the search cost variables as additional product characteristics.

In summary, we propose a structural econometric model to understand consumers' preferences and search costs on product search engines to improve user experience under large-scale, unstructured social media. It combines an optimal stopping framework with an individual-level random utility choice model. It allows us to jointly estimate consumers' heterogeneous preferences and search costs. Based on the results, we are able to predict the probability that a consumer clicks or purchases a certain product and provide a better understanding of what drives consumer engagement. Our analysis also allows us to quantify the trade-off between consumers' benefits and cognitive costs from using large-scale unstructured social media information during decision making. Our policy experiments offer insights to search engines on what product information they should display during different stages of consumer search (i.e., on the search result summary page vs. product landing page), to improve user experience, click/purchase probabilities, and search engine revenues.

Our model is validated by a unique dataset from the online hotel search industry. We have detailed individual consumer session-level search and transaction data from November 2008 through January 2009, containing approximately seven million observations resulting in room bookings in 2,117 hotels in the United States on Travelocity.com. Our model provides more precise measures of consumer price elasticity and heterogeneous preferences than does a static Mixed Logit model that does not account for consumer search cost or the sequence of the prior clicks. Our model also provides better predictive performance than does a click model that relies purely on the click information. More specifically, our model demonstrates the best performance in predicting the consumer click and purchase probabilities compared to other benchmark models. We see a % and an 18.77% improvement in the out-of-sample prediction using our model compared to the next best performing model, with respect to click-through and conversion probabilities, respectively.

Our policy experiments show that providing additional product information, especially the location-related information, on the travel search engine summary page will lead to a 22.16% increase in the overall search engine revenue. By contrast, although hiding all hotels' price information from the search summary page may lead to higher user engagement (when engagement is measured by number of clicks), it can eventually hurt the travel search engine by leading to a 7.08% drop in the overall search engine revenue. In contrast, providing a carefully curated digest of social media textual content on the search results summary page can lead to a 12.01% increase in the overall search engine revenue. This finding suggests that it is important for product search engines to leverage the economic value of large-scale unstructured social media information, while at the same time reducing the cognitive burden on consumers by automating the extraction of such information and presenting it to consumers during the earlier stages of the decision-making process.

2. Prior Literature

Our paper draws from multiple streams of work. We summarize them in this section.

2.1 Search Cost and Consumer Information Search

First, our work builds on the literature on search cost and consumer information search. Recent studies have found that consumers have cognitive limitations, and search costs exist during the information search processes. Disregarding consumers' cognitive limitations and the limited nature of choice sets can lead to biased estimates of demand (e.g., Mehta et al. 2003, Kim et al. 2010, Brynjolfsson et al. 2010).

The existing literature in this field holds two different views of the nature of consumer search: nonsequential and sequential search. The former strand of research follows Stigler's (1961) original model, assuming consumers first sample a fixed number of alternatives and then choose the best from among them (e.g., Mehta et al. 2003, Moraga-Gonzalez et al. 2012, Honka 2014). By contrast, the other view, arising from the job-search literature (e.g., Mortensen 1970), argues that actual consumer search should follow a sequential model in which consumers keep searching until the marginal cost of an extra search exceeds the expected marginal benefit. Weitzman (1979), in single-agent scenarios, and Reinganum (1982), in multi-agent scenarios, have laid theoretical foundations for sequential search models. In our paper, we assume consumers search sequentially on product search engines. This assumption is consistent with the mainstream research by the web search community (e.g., Chapelle and Zhang 2009).
In addition, many recent studies in economics and marketing have also adopted the sequential search strategy for examining consumer search in an online environment (e.g., Kim et al. 2010, Koulayev 2014, Chen and Yao 2016).

With growing interest and recent developments in information technology that have made many computation-intensive tasks more tractable, empirical work has increased. Hong and Shum (2006) were the first to develop a structural methodology to recover search cost from price data only. Moraga-Gonzalez and Wildenbeest (2008) extend the approach of Hong and Shum to the oligopoly case and provide a maximum likelihood estimate of the search cost distribution. Both papers focus on markets for homogeneous goods, using both sequential and non-sequential search models. Hortacsu and Syverson (2004) examine markets with differentiated goods and develop a sequential search model to recover search cost from the utility distribution. More recent empirical studies on non-sequential search tend to focus on the offline market with search frictions to study price dispersion (e.g., Wildenbeest 2011), endogenous choice sets and demand (e.g., Moraga-Gonzalez et al. 2012), or the identification of search cost from switching cost (Honka 2014). Recent empirical work on sequential search examines consumers' limited search and the associated demand, with a focus on the online search market (Koulayev 2014, Kim et al. 2010). Meanwhile, De los Santos et al. (2012) use web browsing and purchasing behavior based on book-price distribution across 14 online bookstores to compare the extent to which consumers are searching under non-sequential and sequential search models.

One common practice in the existing empirical studies on both types of search models is that they typically model search cost as an inherent attribute of the consumer. Two exceptions are Kim et al. (2010), who model search cost as a function of the product's appearance frequency on Amazon.com, and Moraga-Gonzalez et al. (2012), who consider explanatory variables such as geographic distance from a consumer's home to different car dealerships. In our model, search cost is not only an inherent attribute of a consumer, but also a consequence of the social media context in which today's consumers are embedded. Note that, consistent with prior literature, the search cost in our study is modeled as exogenous to the consumer's search. By modeling consumer search cost as a random-coefficient function of the textual variables that are related to the unstructured social media content, we aim to examine the nature of search cost given the increasing interplay between product search engines and social media.

Finally, another related stream of consumer search literature has analyzed optimal search behavior when consumers are uncertain about the distribution of the product price or utility (e.g., Rothschild 1974, Rosenfield and Shapiro 1981, Bikhchandani and Sharma 1996, Koulayev 2013, De los Santos et al. 2013). For example, the recent work by De los Santos et al. (2013) has relaxed the assumption that consumers know the distribution of offerings while deciding on their search strategy, and allows for learning of the utility distribution.
More specifically, consumers learn the utility distribution by Bayesian updating of their Dirichlet process priors while sampling information about products and retailers. Our study is related to this stream of work in that we also consider the sequential arrival of information during different search stages, which allows consumers to update their initial belief about product utility via information search.

2.2 Search Engine Ranking and User-Generated Content (UGC)

Our work is also related to the literature on search engine ranking. Examining the rank-position effect on the click-through rate (CTR) and conversion rate (CR) on search engines has attracted a lot of attention. A number of recent studies focus on the context of search-engine-based keyword advertising and find significant empirical evidence on the rank-order effect (e.g., Ghose and Yang 2009, Goldfarb and Tucker 2011, Agarwal et al. 2011, Yao and Mela 2011). Other studies focus on search engine ranking for commercial products. For example, Baye et al. (2009) use a unique dataset on clicks from one of Yahoo's price comparison sites to estimate the search engine ranking effect on clicks received by online retailers. Ellison and Ellison (2009) focus on the competition of retailers ranked on price search engines and find that easy price search makes demand highly price sensitive for some products. Ghose et al. (2012) propose a utility-gain-based ranking (using data from past purchases only, and not browsing behavior) that recommends products with the highest expected utility. The lab experiments indicate a strong preference for utility-based ranking compared to existing state-of-the-art alternatives. Ghose et al. (2014) combined a Hierarchical Bayesian model and randomized user experiments to examine the search engine ranking and personalization effects from a causal perspective.

Finally, our work also relates to the stream of research on social media and User-Generated Content (UGC) (e.g., Godes and Mayzlin 2004, Chevalier and Mayzlin 2006, Dellarocas et al. 2007, Duan et al. 2008, Forman et al. 2008). In particular, it builds on the recent research that takes a multidimensional view of customer reviews (e.g., Archak et al. 2011, Ghose et al. 2012, Netzer et al. 2012, Chen et al. 2017). In this paper, we aim to examine the role of social media from multiple dimensions in affecting not only the product utility evaluation but also the search cost of consumers.

2.3 Comparisons with Recent Literature

Our model builds on Weitzman's (1979) optimal sequential search framework. To the best of our knowledge, five existing studies use similar methodologies to ours: Kim et al. (2010), Koulayev (2014), De los Santos and Koulayev (2014), Kim et al. (2014), and Chen and Yao (2016). However, our research differs from these studies in the following ways:

(i) Our model incorporates not only consumers' search behavior, but also their purchases. Kim et al. (2010), De los Santos and Koulayev (2014), and Koulayev (2014) consider only consumers' search information as an approximation of their actual purchase decisions.

(ii) Our observations include detailed click-throughs from each ranking position on a page, which allows us to precisely model the individual click probability for each product, rather than for a page with a bundle of products (i.e., a page of 15 hotels as in Koulayev 2014). More broadly speaking, Koulayev (2014) and our paper are complementary: Koulayev models the costly process of discovering new hotels by flipping pages, but stops short of modeling what happens between click and booking.
Our paper focuses on the second stage, starting from the costly click to the final booking.

(iii) We conduct our analysis at the individual-consumer level as opposed to at the aggregate market level (Kim et al. 2010, 2014). Such individual-level data allow us to leverage the detailed information of the sequence of clicks per session, rather than only the independent click-throughs.

(iv) Chen and Yao (2016), De los Santos and Koulayev (2014), and Koulayev (2014) focus on constructing models that examine the joint use of search refinement tools (e.g., sorting) during consumer search. However, search refinement is not our focus in this paper.

(v) Kim et al. (2010, 2014) and Chen and Yao (2016) assume a simpler information structure where consumers do not update their information set during search. By contrast, our paper allows for a more realistic information structure by allowing consumers to update their information set before and after click-through.

(vi) Most importantly, our paper places a special focus on the interplay between consumer search and social media. Our goal is to use the structural econometric approach as a tool for analytics by product search engines to improve the user experience, especially under an overload of unstructured social media content. We model the trade-off between the value and the cognitive cost associated with large-scale unstructured social media information. We aim to examine how search engine policies regarding social media content, such as what information to show on the search summary page versus the product landing page, may affect consumer search/purchase behaviors and search engine revenues.

In addition, two recent papers, Ghose et al. (2012) and (2014), have also focused on the interplay of search engines and social media. The current paper differs from these two studies in the following ways:

(i) Ghose et al. (2012) studied only the consumer purchase decisions, not search/click decisions, whereas this paper jointly studies the click and purchase decisions.

(ii) Both Ghose et al. (2012) and (2014) focused only on the benefit of social media for consumer evaluation of product quality in the purchase decision, but did not consider the cognitive cost associated with processing social media information during consumer search. This is one major unique advantage of this paper. None of the previous work has studied the cost of social media content in affecting consumer search and purchase decisions on product search engines.

(iii) Both Ghose et al. (2012) and (2014) used aggregated data on click/purchase share at the product level, while this paper models consumer decisions at the individual level.

(iv) From a methodology perspective, unlike Ghose et al. (2012) and (2014), this paper takes into account three unique constraints in the model: (1) interdependency in clicks/purchases among different products; (2) sequential arrival of information revealed by click-throughs; (3) non-negligible search costs.

A summary of the differences between this paper and the existing studies is in Table 1.

Table 1. Comparison with Recent Literature

| | Kim et al. (2010) | Ghose et al. (2012) | Ghose et al. (2014) | Kim et al. (2014) | Koulayev (2014) | De los Santos & Koulayev (2014) | Chen & Yao (2016) | This Paper |
|---|---|---|---|---|---|---|---|---|
| Data | Amazon, View-Rank, 18 months | Hotels, Purchase, ~8k observations, 3 months | Hotels, Click, Purchase, ~30k observations, 3 months | Amazon, View-Rank, Sale-Rank, 18 months | Hotels, Click, Search Refinement, 1 month (Chicago) | Hotels, Click, Search Refinement, 1 month (Chicago) | Hotels, Click, Search Refinement, Purchase, 215 sessions, 15 days | Hotels, Click, Purchase, ~7M observations, ~1M sessions, 2,117 hotels, 3 months |
| Level of Analysis | Market | Market | Market | Market | Individual Clicks at Page Level | Individual Clicks | Individual Clicks and Purchases | Individual Clicks and Purchases |
| Real Transactions | No | Yes | Yes | No | No | No | Yes | Yes |
| Click Sequence | No | No | No | No | Yes | Yes | Yes | Yes |
| Search Refinements | No | No | Yes | No | Yes | Yes | Yes | No |
| Interplay with Unstructured Social Media | No | Yes | No | No | No | No | No | Yes |
| Update of Consumer Information Set | No | No | No | No | No | No | No | Yes |
| Major Objectives | Consumer Welfare, Market Structure | Design a Novel Consumer Surplus-based Ranking for Search Engine | Causal Effect of Search Engine Ranking and Personalization | Market Structure, Innovation | Price Sensitivity Decrease | Design Search Engine Ranking by Maximizing Aggregate CTR | Search Refinement Welfare | Interplay between Search & Social Media, Cognitive Cost of Unstructured Social Media |

3. Data

Clickstream and Transaction Data: Our dataset comes from Travelocity.com, a leading online travel search agency. The dataset contains detailed information on session-level consumer search, click, and purchase events from November 2008 through January 2009, with a total of 7,059,122 observations from 969,033 individual sessions resulting in room bookings in 2,117 hotels in the United States. 3 A typical online session observed in our dataset involves the following events: the initialization of the session, the search query, the hotel listings returned from that search query in a particular rank order, whether the consumer has used any special sorting criteria to rerank the hotels, clicks on any hotel listing, the login and actual transactions in a given hotel, and the termination of the session. We observe the hotel listings displayed to the consumer during the search session (regardless of whether any click occurs). If a click occurs, we observe the hotel listings the consumer observed prior to that click. Moreover, we also observe the sequence of the clicks. Note that we also have detailed information associated with each event for every corresponding hotel, such as nightly room prices and the hotel's position in the set of listings returned by the search engine (i.e., "Page" and "Rank"). We have the detailed transaction-level information from Travelocity.com that is linked to the entire session-level consumer search data, including the final transaction price and the number of room units and nights purchased in each transaction. This information allows us to model consumer preferences for both the search and the purchase processes.

Hotel General Information: We collected hotel-related information from Travelocity.com, such as hotel class, hotel brand, number of amenities, number of rooms, reviewer rating, number of reviews, and the textual content of all the reviews up to January 31, 2009 (the last date of transactions in our database).
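
As a rough illustration of the session structure described in the clickstream paragraph of this section, the record below holds the displayed listings (with page, rank, and price), the ordered clicks, and an optional booking. This is only a sketch: the class names, field names, and the `hotel_outcomes` helper are hypothetical and are not the schema of the Travelocity data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Listing:
    """One hotel as displayed in the search results (hypothetical fields)."""
    hotel_id: str
    page: int      # results page on which the hotel appeared
    rank: int      # position of the hotel within that page
    price: float   # nightly room price shown to the consumer


@dataclass
class Session:
    """One search session: displayed listings, ordered clicks, optional booking."""
    session_id: str
    listings: List[Listing]
    clicks: List[str] = field(default_factory=list)  # hotel_ids in click order
    purchase: Optional[str] = None                    # booked hotel_id, if any


def hotel_outcomes(session: Session) -> Dict[str, str]:
    """Label each displayed hotel as not_clicked, clicked, or purchased."""
    clicked = set(session.clicks)
    outcomes: Dict[str, str] = {}
    for listing in session.listings:
        if listing.hotel_id == session.purchase:
            outcomes[listing.hotel_id] = "purchased"
        elif listing.hotel_id in clicked:
            outcomes[listing.hotel_id] = "clicked"
        else:
            outcomes[listing.hotel_id] = "not_clicked"
    return outcomes
```

Keeping the clicks as an ordered list rather than a set preserves the click sequence, which the text notes is observed in the data.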
Hotel Location Information: In addition, we have independently collected supplemental data on hotel location-related characteristics using automatic social geo-mapping techniques together with image data mining. We use geo-mapping search tools (in particular the Bing Maps API) and social geo-tags (from geonames.org) to identify the number of external amenities (e.g., shops, bars) in the area around the hotel. We use image classification methods together with human annotations (from Amazon Mechanical Turk, AMT) to extract whether a beach, lake, or downtown area is nearby, and whether the hotel is close to a highway or public transportation. We extract these characteristics from different zoom levels of the satellite images of a hotel location within a 0.25-, 0.5-, 1-, and 2-mile radius. We also collect local crime rates from FBI statistics.

Hotel Service Quality Information Extracted from Social Media: To fully exploit the information about hotel service quality, we combine text mining and sentiment analysis to examine the natural-language text of the customer reviews. For example, the helpfulness of the hotel staff is a service feature one can assess by reading the consumer opinions. Toward extracting such information, we build on the previous work of Archak et al. (2011) and Ghose et al. (2012). First, we extract the important hotel features. Following the automated approach introduced previously (Archak et al. 2011, Ghose et al. 2012), we use a part-of-speech tagger to identify the frequently mentioned nouns and noun phrases, which we consider candidate hotel features. We then use WordNet (Fellbaum 1998) and a context-sensitive hierarchical agglomerative clustering algorithm (Manning and Schutze 1999) to further cluster the identified nouns and noun phrases into clusters of similar nouns and noun phrases. The resulting set of clusters corresponds to the set of identified product features mentioned in the reviews. For our analysis, we kept the top six most frequently mentioned features, which were hotel staff, food quality, bathroom, parking facilities, bedroom quality, and check-in/out front desk efficiency. For sentiment analysis, we extracted all the evaluation phrases (adjectives and adverbs) that were used to evaluate the individual service features (for example, for the feature "hotel staff," we extracted phrases such as "helpful," "smiling," "rude," "responsive"). The process of extracting user evaluation phrases can be automated. To measure the meaning of these evaluation phrases, we used AMT to exogenously assign explicit polarity semantics to each word. To compute the scores, we used AMT to create our ontology, with the scores for each evaluation phrase. Our process for creating these external scores followed the methodology of Archak et al. (2011). Finally, to handle negation (e.g., "I didn't think the staff was helpful"), we built a dictionary database to store all the negation words (e.g., "not," "hardly") using an approach similar to NegEx (accessed Sept 10, 2015).

3 In our dataset, 2,117 hotels had at least one booking during the data collection period. A total of 13,546 hotels had at least one display in consumer search sessions.

Consumer Cognitive Cost Indicators Extracted from Social Media: Although the textual content of customer reviews can reveal important information about hotel quality, there is a non-negligible cognitive cost associated with processing such information.
To capture consumers' cognitive costs in reading the user-generated reviews, we analyzed two sets of review text features that are likely to affect consumers' intellectual efforts in internalizing review content: readability (i.e., textual complexity, syllables, and spelling errors) and subjectivity (i.e., mean and standard deviation). Research has shown that both have had a significant impact on consumer online shopping behavior in the past (e.g., Ghose and Ipeirotis 2011). To derive the probability of subjectivity in the review's textual content, we apply text-mining techniques. In particular, we train a classifier using the hotel descriptions of each of the hotels in our dataset as objective documents. We randomly retrieved 1,000 reviews to construct the subjective examples in the training set. We conduct the training process by using a 4-gram Dynamic Language Model classifier provided by the LingPipe toolkit (accessed Sept 10, 2015). Thus we are able to acquire a subjectivity confidence score for each sentence in a review, and then derive the mean and variance of this score, which represent the probability of the review being subjective.

In addition to review textual readability and subjectivity, we also extracted an additional cognitive cost indicator based on the topic complexity of the customer reviews. In particular, building on prior literature (Gong et al. 2016), we analyzed the entropy value for the distribution of topics extracted from all customer reviews for each hotel ("Topic Entropy"). This entropy value measures the diversity of topics covered by the customer reviews for each hotel. Prior literature suggests the diversity in search results affects consumer search behavior (e.g., Weitzman 1979, Dellaert and Haubl 2012). In addition, consumer psychology theories suggest that as the information becomes noisier, users are more likely to abandon their search (e.g., Jacoby et al. 1974; Dhar and Simonson 2003), because users tend to get overwhelmed and discouraged by the complexity of information, and therefore lose their interest or trust in the search results. Therefore, we derived a Topic Entropy score using probabilistic topic models from machine learning and natural language processing to capture the noisiness of information provided by the customer reviews. Topic models are unsupervised algorithms that aim to extract hidden topics from unstructured text data. In particular, we measure the topic complexity of reviews for each product by estimating a topic model using the Latent Dirichlet Allocation model (LDA; Blei et al. 2003), and subsequently computing the entropy (i.e., diversity) of the topic distribution of reviews for that product. We provide more technical details on the topic modeling in Online Appendix E.

For a better understanding of the variables, we present the definitions and summary statistics of all variables in Table 2. Note that the dataset in this paper uses not only the transaction data (i.e., purchases), but the complete session-level data (i.e., both clicks and purchases). The resulting dataset contains approximately seven million observations from one million individual user sessions.

Figure 1a. Distribution of # Pages Browsed (Session Level)
Figure 1b. Distribution of # Click-throughs Per Page (Session Level)

3.1 Model-free Evidence of Limited Search by Consumers

Before we describe our model, we seek from the data suggestive evidence that could motivate our assumption of consumers' limited search. First, we plot the distribution of the total number of pages a consumer browses in a search session. Figure 1a illustrates this distribution in detail, with the x-axis representing the page counts and the y-axis representing the density.
We notice that over 25% of consumers browse only one page; over 50% of consumers browse fewer than three pages; and less than 10% of consumers browse more than 15 pages during their search for hotels. This finding is consistent with prior industry evidence that consumers seldom search more than three pages (e.g., Iprospect 2008). Second, we further look into the distribution of the average number of click-throughs made per page during each search session. Figure 1b illustrates this distribution, with the x-axis representing the click-throughs per page and the y-axis representing the density. We find that on average, consumers click fewer than one hotel (out of a total of 25 hotels) per page during their search. In fact, a large majority of consumers click fewer than 0.5 hotels per page, on average. Besides, over 97% of clicks occurred on the first page. These two figures provide us preliminary evidence that consumers incur nontrivial search costs and that consumer search is limited. 4

4 For some cities, the number of hotels might be small and therefore no additional page is available for searching. We find that 56 out of 4,845 cities (approximately 1.15%) in our data have fewer than 25 hotels (which means only one page is available for searching). After excluding these small cities, the model-free evidence shows a similar trend that consumer search is highly limited.

4. A Structural Model of Consumer Sequential Search

Our dataset contains the complete information on the browsing session (e.g., list of hotels displayed, sequence of clicks) and the purchasing decisions that consumers made. Consumers have three options for a hotel during a search session: (A) Do not click on the hotel at all; (B) Click on the hotel but do not purchase it; and (C) Click on the hotel and also purchase it. To distinguish option A from options B and C, we need to model consumers' click decision making. To distinguish option B from option C, we need to model consumers' purchase decision making. As a key contribution of this analytical study, we build a holistic model of user behavior that models both the clicking and purchasing behavior. Our model, in summary, works as follows:

Before Click:

1. A consumer session starts with the consumer browsing hotels on the search results summary page. A consumer can obtain any hotel information provided on the search results summary page (with no clicks needed) at zero cost.

2. Before clicking on a hotel, the consumer does not observe the exact information shown on the details landing page for that hotel. Instead, she forms a belief about what information would appear on the landing page, conditional on the information observed in the search results summary page. Because no click is needed to form the belief, we assume the consumer incurs zero cost at this step.

3. Given the observed information on the search results summary page and the conditional belief of the unobserved information on the landing page, the consumer is able to infer the expected utility of each hotel before the click-through at zero cost.

4. Meanwhile, before clicking on a hotel, the consumer also forms a belief about what the expected search cost would be if she were to click on the hotel (e.g., due to the additional cognitive efforts needed for processing the unstructured information on the landing page), conditional on the information observed from the search results summary page. Again, no click is required to form the belief of search cost, and we assume the consumer incurs zero cost at this step.

After Click:

5. The consumer session continues with a series of clicks, where the consumer decides to click on the landing pages of some hotels and to find out the exact utilities from these hotels. The goal of search (i.e., via click-through) is to resolve any uncertainty in the utility (i.e., uncertainty in the landing-page characteristics as well as the unobserved error). The set of clicked hotels and the order of the clicks reveal information about the preferences and search costs of the user.

6. The consideration set is generated during the search process. It contains all the hotels the consumer has clicked. After the costly click-through, the consumer knows the actual utilities (rather than the expected utilities) of the clicked hotels, which form the consideration set.
7. The consumer stops searching new hotels (and hence stops clicking) when the expected marginal benefit of doing so is less than the expected search cost. We adopt the concept of reservation utility from Weitzman (1979) to define when the consumer stops searching. The decision of whether to continue searching or to stop relies on the actual utilities of the hotels in the consideration set at that moment⁵ and her expected utilities and expected search costs of the upcoming hotels.
8. Once the consumer stops searching, the consideration set is fixed. Based on the final consideration set, the consumer makes a purchase decision (or skips purchasing anything at all).

4.1 Model Setting

(1) Product Utility. Assume the utility of hotel j for consumer i to be a random-coefficient model as follows:

$$u_{ij} = V_{ij}^{S} + V_{ij}^{L} + e_{ij}, \tag{1}$$

where $V_{ij} = V_{ij}^{S} + V_{ij}^{L}$ represents the hotel utility from the hotel characteristics displayed on the website. It consists of two conceptual components: (i) a deterministic component, $V_{ij}^{S}$, the exact utility from summary-page hotel characteristics consumers can directly observe on the search summary page, and (ii) a stochastic component, $V_{ij}^{L}$, the additional utility from landing-page hotel characteristics consumers cannot directly observe before the click-through but can observe after the click-through. To evaluate the overall expected utility before the click-through, a consumer forms a belief of the distribution of the unobserved landing-page utility, $f(V_{ij}^{L})$, based on $V_{ij}^{S}$. This belief comes from the consumer's knowledge about the utility distribution for hotel j conditional on the observed summary-page characteristics for this hotel. The consumer makes the click decision based on the exact value of the summary-page utility $V_{ij}^{S}$ and the expected value of the landing-page utility $E(V_{ij}^{L})$. Once the consumer decides to click on the hotel, the click-through will reveal the actual value of the landing-page characteristics, and the consumer updates the expected value $E(V_{ij}^{L})$ with the actual value $V_{ij}^{L}$.

Moreover, we let $e_{ij}$ represent the unobserved uncertainty in the consumer's evaluation. The consumer does not know the realization of $e_{ij}$ unless she clicks on hotel j and visits its landing page. In particular, we assume $e_{ij}$ to be i.i.d. across consumers and hotels, and to follow a Type I Extreme Value distribution, $e_{ij} \sim \text{Type I EV}(0,1)$.⁶

⁵ In particular, the decision of whether to continue searching or to stop depends on the actual utility of the hotel with the maximum utility in the consideration set.

⁶ Note that, different from Kim et al. (2010), who assume a standard normal distribution of the error, we allow for a logit distribution of the error term in our model, as we assume that the consumers may optimize their utility over unobserved (to the econometrician) variables. In our estimation, we transform the logit error into standard normal disturbances using an inverse standard normal CDF

In summary, our utility setting assumes the consumer does not know the full realization of the utility of hotel j before the click-through. However, the consumer knows the distribution of the utility. This assumption is critical and is consistent with Weitzman (1979) and many recent studies that have examined consumers' sequential search behavior in online search contexts (e.g., Kim et al. 2010, Chen and Yao 2016, Koulayev 2014). Hence the goal of search (i.e., click) is to resolve the uncertainty in the consumer's evaluation toward both the landing-page characteristics and the unobserved error to reveal the true utility of a hotel.

More specifically, let $X_j$ be a vector of summary-page characteristics for hotel j. Let $P_j$ represent the price for hotel j that is also directly available to consumers on the search results summary page. Thus, we can model the summary-page utility as $V_{ij}^{S} = X_j \beta_i + \alpha_i P_j$, where $\alpha_i$ and $\beta_i$ are consumer-specific parameters capturing the heterogeneous preferences of consumers. We assume $\beta_i \sim N(\bar{\beta}, \Sigma_{\beta})$, where $\bar{\beta}$ is a vector containing the means of the random effects and $\Sigma_{\beta}$ is a diagonal matrix containing the variances of the random effects. Similarly, we assume $\alpha_i \sim N(\bar{\alpha}, \sigma_{\alpha}^{2})$.

Meanwhile, we model the expected value of the pre-click stochastic part of the utility as $E(V_{ij}^{L}) = \tilde{L}_j \lambda_i$, where $\tilde{L}_j$ represents the consumer expectation toward the landing-page characteristics for hotel j conditional on the observed summary-page characteristics $(X_j, P_j)$. Note that $\tilde{L}_j$ may not equal the actual values of the landing-page characteristics. We use the tilde sign to distinguish $\tilde{L}_j$ from the realization of its deterministic value, $L_j$. Using an approach similar to the one proposed by Koulayev (2014), we approximate $\tilde{L}_j$ by taking the mean of the bootstrap samples from the actual information of the landing pages of the hotels that present the same summary-page characteristics. This approach allows consumers to infer knowledge about the utility distribution of a hotel based on the average knowledge from the population with similar experience (i.e., who are also exposed to $(X_j, P_j)$).

The consumer estimates the expected utility of the landing page based on $\tilde{L}_j$. She updates $\tilde{L}_j$ with the deterministic value $L_j$ only after she chooses to click on hotel j and reveals the actual deterministic values of the landing-page characteristics. Let $\lambda_i$ represent the consumer-specific parameter that captures the heterogeneity. Consistent with previous assumptions, we assume it follows a normal distribution, $\lambda_i \sim N(\bar{\lambda}, \Sigma_{\lambda})$. Therefore, we have the overall utility function as follows. Before the click-through, the expected utility from hotel j for consumer i is

$$u_{ij} = X_j \beta_i + \alpha_i P_j + \tilde{L}_j \lambda_i + e_{ij}. \tag{2a}$$

After the click-through, the realization of the actual utility becomes

$$u_{ij} = X_j \beta_i + \alpha_i P_j + L_j \lambda_i + e_{ij}. \tag{2b}$$

⁶ (continued) function. This transformation approach was proposed and widely used by previous studies to compute the inverse Mill's ratio for the logit distribution (e.g., Lee (1983), Greene (2002)). We provide more details in Online Appendix C. In addition, we have also tried the normal distribution assumption for the error term. We find our final results stay very consistent.
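The utility specification in Equations (2a) and (2b), together with the bootstrap approximation of the expected landing-page characteristics, can be illustrated with a minimal sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout of `hotels`, the number of bootstrap replications, and the toy numbers are all hypothetical, and the price coefficient is assumed to enter additively (so a price-sensitive consumer has a negative `alpha`).

```python
import random
from statistics import mean

def expected_landing_features(hotels, summary_key, n_boot=200, seed=0):
    """Bootstrap-mean approximation of the expected landing-page vector
    for hotels sharing one summary-page profile (hypothetical helper)."""
    rng = random.Random(seed)
    # Pool the actual landing-page vectors of hotels with the same summary profile.
    pool = [h["landing"] for h in hotels if h["summary"] == summary_key]
    if not pool:
        raise ValueError("no hotel shares this summary-page profile")
    dims = range(len(pool[0]))
    boot_means = []
    for _ in range(n_boot):
        # Resample the pool with replacement and record the per-feature mean.
        resample = [rng.choice(pool) for _ in pool]
        boot_means.append([mean(row[d] for row in resample) for d in dims])
    # Average the bootstrap means feature by feature.
    return [mean(m[d] for m in boot_means) for d in dims]

def utility(x, price, landing, beta, alpha, lam, e=0.0):
    """Linear utility X*beta + alpha*P + L*lambda + e.
    Pass the expected landing vector for Eq. (2a), the actual one for Eq. (2b)."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    return dot(x, beta) + alpha * price + dot(landing, lam) + e

# Toy data (assumed): two hotels share a summary profile, one does not.
hotels = [
    {"summary": ("3-star", 120), "landing": [10.0, 0.0]},
    {"summary": ("3-star", 120), "landing": [14.0, 1.0]},
    {"summary": ("5-star", 300), "landing": [30.0, 1.0]},
]
l_tilde = expected_landing_features(hotels, ("3-star", 120))
pre_click = utility([3.0], 120.0, l_tilde, [0.4], -0.01, [0.05, 0.2])
post_click = utility([3.0], 120.0, hotels[0]["landing"], [0.4], -0.01, [0.05, 0.2])
```

In this sketch the pre-click and post-click utilities differ only through the landing-page vector, which mirrors the way a click replaces the expected landing-page characteristics with their realized values.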

(2) Search Cost. We model a consumer's search cost as a result of the landing-page-evaluation behavior associated with a click (i.e., the cognitive cost of processing additional unstructured information). More specifically, let $Q_j$ denote the set of actual cognitive cost variables for evaluating the landing-page unstructured information of hotel j. We model the actual search cost of consumer i after clicking on hotel j to follow a lognormal distribution:⁷

$$c_{ij} = \exp(Q_j \gamma_i), \tag{3a}$$

where $\gamma_i \sim N(\bar{\gamma}, \Sigma_{\gamma})$, $\bar{\gamma}$ is a vector containing the means of the random effects, and $\Sigma_{\gamma}$ is a diagonal matrix containing the variances of the random effects. To model the consumer's cognitive cost of evaluating the unstructured information on the landing page, we consider different dimensions in the cognitive-cost variables $Q_j$, including both the readability and the subjectivity of the textual content of online reviews.

However, because the landing-page information is not directly observable to the consumer before the click, to decide whether to click on a hotel, the consumer needs to form a belief of her expected search cost conditional on the observed summary-page characteristics of that hotel. This means that in our model, $Q_j$ is not directly observable to the consumer before the click-through. Similarly, the consumer forms an expectation based on the observed summary-page characteristics. Let $\tilde{Q}_j$ capture the consumer's expectation toward the unobserved cognitive-cost variables of hotel j. We approximate this expectation value by taking the mean of the bootstrap samples from the actual information of the hotels with the same summary-page characteristics. Based on the discussion above, we can write the (expected) search cost of consumer i for hotel j before the click-through as the following:

$$c_{ij} = \exp(\tilde{Q}_j \gamma_i). \tag{3b}$$

Thus, before the click-through of hotel j, a consumer makes the click decision based on the expected search cost for j. Note that a consumer's search cost is a sunk cost. It enters only the consumer's click decision process but not the purchase decision process. Once the consumer forms an evaluation about the expected search cost, she will make a click decision based on this evaluation one time, and will not need it again in the future. Therefore, the realized actual search cost after the click-through in Equation (3a) does not enter either the click model or the purchase model. Only the expected search cost before the click-through in Equation (3b) enters the model estimation process (i.e., the click model). Hence, we can treat the consumer's expected search cost as a deterministic value in modeling her search (click) decision, which is consistent with Weitzman (1979). For simplicity of notation, we therefore keep the same notation $c_{ij}$ to denote the expected search cost, although the expected search cost in Equation (3b) represents an expectation value (based on $\tilde{Q}_j$, not $Q_j$).

⁷ The log-normal assumption of search cost is consistent with the prior literature (e.g., Kim et al. 2010, Wildenbeest 2011).
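As a small illustration of the search-cost specification in Equations (3a) and (3b), the sketch below draws one consumer's cost coefficients from independent normals (matching a diagonal covariance matrix) and exponentiates the linear index, which guarantees a strictly positive, lognormally distributed cost. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math
import random

def draw_gamma(gamma_mean, gamma_sd, rng):
    """One consumer's cost coefficients: independent normal draws,
    i.e. a diagonal covariance matrix (illustrative helper)."""
    return [rng.gauss(m, s) for m, s in zip(gamma_mean, gamma_sd)]

def search_cost(q, gamma):
    """Search cost exp(Q * gamma). Pass the actual cost variables for
    Eq. (3a) or their pre-click expectation for Eq. (3b)."""
    return math.exp(sum(qk * gk for qk, gk in zip(q, gamma)))

rng = random.Random(1)
# Assumed means and standard deviations for two cognitive-cost variables.
gamma_i = draw_gamma([0.2, 0.1], [0.05, 0.05], rng)
# Assumed expected cost variables of one hotel (e.g., complexity, subjectivity).
expected_cost = search_cost([1.5, 0.4], gamma_i)
```

Because only the pre-click expectation enters the click decision, a sketch of the click model would call `search_cost` with the expected cost variables rather than the realized ones.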

4.2 Problem Description and the Optimal Search Framework

In general, our consumer search problem can be described as follows. Assume a consumer searches sequentially (i.e., examines alternatives one by one) to find a hotel. At each stage of the search, the consumer has two options: continue to search for the next alternative, or stop and purchase the current best alternative (including purchasing nothing, i.e., an outside good). Consider that the consumer is forward looking. This situation implies that at any stage during her search, she always tries to choose an action that maximizes her expected utility from the current stage going forward, meaning she tries to maximize the marginal benefits from both the current stage and all potential future stages. Therefore, the key problem here is to determine the optimal point for the consumer to choose the stop option.

More formally, let $S_i$ be the current search-generated consideration set (i.e., including all hotels consumer i has clicked). Let $u_i^{*}$ denote the current highest value obtained by consumer i so far. We define

$$u_i^{*} = \max_{j \in S_i} \{u_{ij}, 0\}. \tag{4}$$

Note we define $u_i^{*}$ as the highest value consumer i obtains from the hotels in her consideration set. Given the current best value $u_i^{*}$, the expected marginal benefits for consumer i from searching j are

$$B_{ij}(u_i^{*}) = \int_{u_i^{*}}^{\infty} (u_{ij} - u_i^{*}) f(u_{ij}) \, du_{ij}, \tag{5}$$

where $f(\cdot)$ is the probability density function of hotel utility and is individual specific. The expected marginal benefits $B_{ij}(u_i^{*})$ represent the expectation of the utility for hotel j, given that it is higher than $u_i^{*}$, multiplied by the probability that $u_{ij}$ exceeds $u_i^{*}$. As we notice, the benefits of search depend only on the distribution of utility above $u_i^{*}$. Thus, for any hotel j, the reservation utility $z_{ij}$ meets the following boundary condition, where the expected search cost equals the expected marginal benefits from searching the hotel:

$$c_{ij} = B_{ij}(z_{ij}) = \int_{z_{ij}}^{\infty} (u_{ij} - z_{ij}) f(u_{ij}) \, du_{ij}. \tag{6}$$

Note that in Equations (4)-(6), because the actual search cost and actual utility for an upcoming unsearched hotel j are not observable to consumer i before the click-through, her decision of whether to click on hotel j is based on her expected search cost and expected utility. By contrast, $u_i^{*}$ is derived based on the actual hotel utilities because after the click-through the consumer can observe the exact information about each hotel in her consideration set. Thus, when consumer i's current best value is equal to the reservation utility of hotel j, $u_i^{*} = z_{ij}$, she is indifferent between searching for j or stopping (and accepting $u_i^{*}$). Consumer i will continue to search for hotel j if her current best value is lower than the reservation utility of hotel j, $u_i^{*} < z_{ij}$, and she will stop otherwise. More details on the derivation of the optimal search strategy and the reservation utility are provided in Appendices B and C.
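The boundary condition in Equation (6) and the resulting stopping rule can be sketched numerically. The example below is a simplified illustration rather than the paper's procedure: it assumes the pre-click utility of a hotel is normally distributed with mean `mu` and standard deviation `sigma` (the paper uses a Type I EV error and an interpolation-based method), in which case the expected marginal benefit has the closed form $\sigma[\phi(x) - x(1 - \Phi(x))]$ with $x = (z - \mu)/\sigma$. All function names and numbers are hypothetical.

```python
import math

def expected_benefit(z, mu, sigma):
    """E[max(u - z, 0)] for u ~ Normal(mu, sigma^2), the analogue of B(z) in Eq. (5)."""
    x = (z - mu) / sigma
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    upper_tail = 0.5 * math.erfc(x / math.sqrt(2.0))  # 1 - Phi(x)
    return sigma * (pdf - x * upper_tail)

def reservation_utility(c, mu, sigma, tol=1e-10):
    """Solve c = B(z) for z by bisection; B is decreasing in z."""
    if c <= 0:
        raise ValueError("search cost must be positive")
    lo = mu - c - sigma            # B(lo) >= mu - lo > c
    hi = mu + sigma
    while expected_benefit(hi, mu, sigma) > c:
        hi += sigma                # widen the bracket until B(hi) <= c
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_benefit(mid, mu, sigma) > c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sequential_search(hotels):
    """Stopping rule: visit hotels in descending reservation utility 'z',
    click while the current best realized utility is below the next 'z'."""
    best_u, best = 0.0, None       # the outside option has utility 0
    clicked = []
    for h in sorted(hotels, key=lambda h: h["z"], reverse=True):
        if best_u >= h["z"]:
            break                  # stop: current best is at least the next reservation utility
        clicked.append(h["name"])
        if h["u"] > best_u:        # 'u' is the utility revealed by the click
            best_u, best = h["u"], h["name"]
    return clicked, best

# Toy session (assumed values): C is never clicked because B already beats its z.
session = [
    {"name": "A", "z": 2.0, "u": 0.5},
    {"name": "B", "z": 1.0, "u": 1.5},
    {"name": "C", "z": 0.8, "u": 3.0},
]
clicked, purchased = sequential_search(session)
```

The toy session shows why observed clicks are informative: hotel C would have delivered the highest realized utility, but it is never inspected because its reservation utility is below the best utility already found.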

4.3 Click Probability

We define the click probability in a fashion similar to Kim et al. (2010). Let $r(j)$ denote the hotel with the jth highest-ranked reservation utility $z_{i,r(j)}$. Let $\pi_{i,r(j)}$ be the probability that consumer i will click hotel $r(j)$. This probability equals the probability that the current highest value $u_{i,r(j)}^{*}$ within the consumer's current consideration set is lower than the reservation utility of hotel $r(j)$. Let $S_{i,r(j)}$ be the current consideration set generated prior to hotel $r(j)$. It includes all hotels the consumer has clicked before hotel $r(j)$. For a consumer to click hotel $r(j)$, $z_{i,r(j)}$ has to exceed the maximum value from the clicked set of hotels. Thus we model the click probability of hotel $r(j)$ for consumer i as

$$\begin{aligned}
\pi_{i,r(j)} &= \Pr\big(r(j) \text{ is clicked by consumer } i\big) = \Pr\big(u_{i,r(j)}^{*} < z_{i,r(j)}\big) \\
&= \Pr\Big(\max_{m \in S_{i,r(j)}} \big(V_{i,r(m)}^{S} + V_{i,r(m)}^{L} + e_{i,r(m)}\big) < z_{i,r(j)}\Big) \\
&= \prod_{m \in S_{i,r(j)}} F_e\big(z_{i,r(j)} - V_{i,r(m)}^{S} - V_{i,r(m)}^{L}\big), \quad \forall j > 1,
\end{aligned} \tag{7}$$

where $F_e(\cdot)$ is the CDF of e, which in our case is $e \sim \text{Type I EV}(0,1)$.⁸

4.4 Conditional Purchase Probability

Conditional on the sequence of clicks consumer i has made in the search session, we can derive the conditional probability that she purchases hotel $r(j)$ in her consideration set as the following:

$$\begin{aligned}
\rho_{i,r(j)} &= \Pr\big(r(j) \text{ is booked by consumer } i \mid \text{all clicks by consumer } i\big) \\
&= \Pr\big(u_{i,r(j)} > u_{i,r(j')},\ \forall r(j) \neq r(j');\ r(j), r(j') \in S_i \mid \text{all clicks by consumer } i\big) \\
&= \Pr\big(V_{i,r(j)}^{S} + V_{i,r(j)}^{L} + e_{i,r(j)} > V_{i,r(j')}^{S} + V_{i,r(j')}^{L} + e_{i,r(j')},\ \forall r(j) \neq r(j');\ r(j), r(j') \in S_i \mid \text{all clicks by consumer } i\big),
\end{aligned} \tag{8}$$

where $S_i$ is the click-generated consideration set for consumer i. Note that because the consideration set $S_i$ is selected by consumer i based on her search decisions, e does not follow a full Type I EV distribution. Instead, it follows a truncated Type I EV distribution based on the optimality conditions used by the consumer. Unfortunately, under such circumstances the conditional choice probability does not have a closed-form expression (e.g., Logit form). To address this issue, we applied a simulation approach. Similar methods have been adopted by previous studies (Chen and Yao 2016, Honka 2014, McFadden 1989). McFadden (1989) proposed a kernel-smoothed frequency simulator to sample the random draws from a truncated Type I EV distribution by smoothing the probabilities using a multivariate scaled logistic CDF (Gumbel 1961). Honka (2014) applied McFadden's approach to sample the error term from a truncated Type I EV distribution by

⁸ Note that $u_{i,r(j)}^{*}$ is the maximum utility value from the current consideration set $S_{i,r(j)}$. Hence, the value of $u_{i,r(j)}^{*}$ depends on what products are included in the current stage of the consideration set.

taking into account the composition of the click-generated consideration set and the utility optimality of the final choice to model consumer simultaneous search. Chen and Yao (2016) applied a similar simulation approach to sample the error term from a truncated normal distribution by further accounting for not only the choice set composition and the utility optimality of the final choice, but also the sequence of the click-generated consideration set to model consumer sequential search. Our simulation approach builds on the methods from Chen and Yao (2016) and Honka (2014). It allows us to simulate the error term from a truncated Type I EV distribution by satisfying the following three optimality conditions: 1) sequence of the click-generated consideration set; 2) composition of the click-generated consideration set; 3) utility optimality of the final choice. We provide the full details on how we use the simulated method to construct the conditional purchase probability in Online Appendix D.

4.5 Likelihood Function

We model the overall likelihood as the product of the probabilities of all the observed consumer clicks and purchases:

$$\text{Likelihood}_i = \Pr(CLICK_i, PURCHASE_i) = \Pr(CLICK_i) \cdot \Pr(PURCHASE_i \mid CLICK_i), \tag{9}$$

where $PURCHASE_i$ represents the observed purchase event by consumer i, and $CLICK_i$ represents the observed sequence of all click events by consumer i.

We can then model $\Pr(CLICK_i)$ and $\Pr(PURCHASE_i \mid CLICK_i)$ as follows. First, let N be the total number of hotels that consumer i has clicked (i.e., the size of the consideration set) and J be the total number of hotels available in the market. We can model the joint probability of the sequence of click events for consumer i as the following:

$$\begin{aligned}
\Pr(CLICK_i) &= \Pr\big(\text{click } r(1), \text{ then click } r(2), \ldots, \text{ then click } r(N), \text{ then all unclicks}\big) \\
&= \prod_{r(j) \in S_i^{clicked}}^{N} \Pr\big(u_{i,n} < z_{i,r(j)},\ \forall n \in S_i^{clicked\_before\_r(j)}\big) \prod_{r(m) \in S_i^{unclicked}}^{J-N} \Pr\Big(\max_{n \in S_i^{clicked}} u_{i,n} \geq z_{i,r(m)}\Big) \\
&= \prod_{r(n) \in S_i^{clicked}}^{N} \Pr\big(u_{i,n-1}^{*} < z_{i,r(n)}\big) \prod_{r(m) \in S_i^{unclicked}}^{J-N} \Big[1 - \Pr\big(u_{i,N}^{*} < z_{i,r(m)}\big)\Big] \\
&= \prod_{r(n) \in S_i^{clicked}}^{N} \pi_{i,r(n)} \prod_{r(m) \in S_i^{unclicked}}^{J-N} \big(1 - \pi_{i,r(m)}\big),
\end{aligned} \tag{10}$$

where $S_i^{clicked}$ represents the set of all hotels that have been clicked by consumer i, $S_i^{unclicked}$ represents the set of all hotels that have not been clicked by consumer i, and $S_i^{clicked\_before\_r(j)}$ represents the set of hotels that have been clicked by consumer i before $r(j)$.

Second, conditional on the sequence of click events, we can derive the conditional probability of the purchase event from Equation (8). Again, $S_i$ is the click-generated consideration set for consumer i.

$$\Pr(PURCHASE_i \mid CLICK_i) = \Pr\big(u_{i,r(j)} > u_{i,r(j')},\ \forall r(j) \neq r(j');\ r(j), r(j') \in S_i \mid CLICK_i\big) = \rho_{i,r(j)}. \tag{11}$$

Finally, based on Equations (10) and (11), we can rewrite the likelihood function as follows:

$$\text{Likelihood}_i = \rho_{i,r(j)} \prod_{r(n) \in S_i^{clicked}}^{N} \pi_{i,r(n)} \prod_{r(m) \in S_i^{unclicked}}^{J-N} \big(1 - \pi_{i,r(m)}\big). \tag{12}$$

With this model setting, we are able to account for the fact that the decision-making processes for the hotels in the same session are not completely independent of each other. Instead, the click and purchase decisions for a hotel depend not only on its own utility, but also on the prior sequence of clicks associated with the consideration set.⁹

4.6 Estimation

To model the utility of a hotel, we consider X to contain all hotel characteristics that are directly available on the search summary page, including Hotel Class, Hotel Brand, Customer Rating, Total Review Count, Page, and Rank. We consider L to contain all additional characteristics that can only be revealed from the hotel landing page, including Amenity Count, Number of Rooms, Number of External Amenities, the top-6 service characteristics extracted from the social media textual content (hotel staff, food quality, bathroom, parking facilities, bedroom quality, and check-in/out front desk efficiency), as well as locational factors such as Beach, Lake, Downtown, Highway, Public Transportation, and Crime Rate.

To analyze consumers' search costs, we consider Q to contain different factors that capture the cognitive cost of the unstructured hotel information on the landing page. In particular, we consider both the Readability (i.e., complexity, syllables, and spelling errors) and Subjectivity (i.e., mean and standard deviation of the linguistic subjectivity) of the textual content of online reviews.¹⁰

Note the website also provides a sorting mechanism for consumers to refine their search by sorting the results under criteria other than the default sorting algorithm. Technically, if a consumer chooses to customize the sorting algorithm, her search cost for each hotel in the ranking list may also change, hence becoming endogenous to her own search behavior. However, in reality, we find that with approximately one million online search sessions in our dataset, more than 90% of these sessions do not involve any customized sorting behavior at all. This finding is consistent with previous randomized experimental results (Ghose et al. 2014) that the majority of users tend to stick with the default sorting method during online product search.

⁹ Note that based on the model framework, we do not explicitly model the selection rule for the search order, but take it as pre-calculated (i.e., based on the Weitzman optimal search model, this search order is pre-calculated based on the descending order of the reservation utility of each product).

¹⁰ Note that search cost on a product search engine might be partly associated with the search engine design. To better account for this factor, in the main analysis we have controlled for the online position of a hotel (i.e., Page and Rank on the search engine website), by which we aim to control for the search engine design efficiency to a large extent. Under such circumstances, our model-estimated search costs indicate the cognitive cost of searching a certain product conditional on the same online position. In addition, we have conducted additional robustness tests by controlling for the sorting methods in a consumer's search session. This is to control for additional factors introduced by search engine design. We find in all these cases our model-estimated results remain highly consistent.

Therefore, for simplicity in this study, we focus on only those search sessions conducted under the default sorting algorithm.¹¹

To estimate our model, we derive the overall log-likelihood function as the following:

$$LL(\theta) = \sum_{i} \Big[ \ln \rho_{i,r(j)} + \sum_{\substack{r(n) \in S_i^{clicked} \\ n = 1..N}} \ln \pi_{i,r(n)} + \sum_{\substack{r(m) \in S_i^{unclicked} \\ m = N+1..J}} \ln\big(1 - \pi_{i,r(m)}\big) \Big], \tag{13}$$

where $\theta$ represents the set of parameters of the random coefficients we aim to estimate: $\{\theta\} = \{(\bar{\alpha}, \sigma_{\alpha}), (\bar{\beta}, \Sigma_{\beta}), (\bar{\lambda}, \Sigma_{\lambda}), (\bar{\gamma}, \Sigma_{\gamma})\}$. We iteratively estimate the model using a Maximum Simulated Likelihood (MSL) method. In particular, we apply the Monte Carlo method for numerical simulation, where for each individual observation, we simulate 250 random draws from the joint distribution of the individual heterogeneous parameters $\{\alpha_i, \beta_i, \lambda_i, \gamma_i\}$ and compute the corresponding individual-level click probability $\pi_{ij}$ and conditional purchase probability $\rho_{ij}$. To maximize the log-likelihood function $LL(\theta)$, we use a non-derivative-based optimization algorithm (i.e., the Nelder-Mead simplex method) for heuristic search.¹² This procedure iteratively searches for the optimal set of parameters $\{\theta\}$ until the log-likelihood function is maximized:

$$\{\theta^{*}\} = \arg\max_{\{\theta\}} LL(\theta). \tag{14}$$

The main computational complexity of the estimation comes from the calculation of the reservation values. During each iteration of the optimization algorithm, for each observation and each value of the search cost, we need to solve $z_{ij} = B_{ij}^{-1}(c_{ij})$ numerically. To improve the estimation efficiency, we apply an interpolation-based method to compute the reservation values (Kim et al. 2010, Koulayev 2014). We provide more details of this computation procedure in Online Appendix C.

4.7 Identification

One of the major challenges is to simultaneously identify consumers' heterogeneous preferences and search costs. A person may stop searching either because she has a high valuation for the products already found or because she has a high search cost. Therefore, either the preferences for product characteristics or the moments of the search cost distribution can explain an observed search outcome. In our study, we need to identify four major effects: Consumer Preferences (Mean and Heterogeneity) and Consumer Search Cost (Mean and Heterogeneity).

The key identification strategy of our estimation relies on the exclusion restriction that consumer preferences enter the decision-making processes of both search and purchase, whereas consumer search cost enters only the search decision-making process. Once the consideration set is generated after search,

¹¹ In principle, consumers can choose from various search strategies to customize their search results. To study this direction, one needs to separately look into the data under each different strategy. In this paper, our main focus is not on the search refinement strategies. We refer the readers to Chen and Yao (2016) and Koulayev (2014) for a more in-depth analysis on that front.

¹² As a robustness check, we also tried derivative-based optimization algorithms (e.g., the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and the Nested Fixed Point algorithm (NFXP)). We found that different optimization algorithms are able to recover consistent structural parameters.

the conditional purchase decision should depend only on the consumer preferences. Our unique dataset containing both consumer search data and purchase data allows us to identify these effects.

Moreover, the identification also relies on search-cost shifters. Recall that we model search cost as a function of a completely different set of variables compared to the consumer-preferences variables. When we choose different sets of covariates for search cost and consumer preferences, the covariates that enter the search cost function but not the utility function serve as the exclusion restrictions for identification. Conditional on the exclusion restrictions, the utility and the search cost can be separated. From an empirical identification perspective, we can simply view the search-cost variables as additional product characteristics (i.e., similar to Kim et al. (2010)). Thus we can identify the search-cost and consumer-preferences variables simultaneously.¹³ We provide more detailed discussions below.

(1) Mean Effects. We identify the mean effects of consumer preferences variables based on the correlation between the observed click/purchase frequencies and the frequencies of underlying preferences variables. In other words, we measure the mean effect of a consumer preference variable by how often the same (or similar) variable appears in the hotels consumers click or purchase. For example, if on average people tend to click (or purchase) low-price hotels, we may conclude that people have a high price sensitivity. This identification is similar to the one in most traditional choice models, except that it takes into consideration not only the observed purchases, but also the clicks, to infer the mean effect of consumer preferences.

We identify the mean effect of search cost partially based on the observed average size of the consumer's search-generated consideration set. Importantly, note that we model the search cost as a function of completely different variables compared to the consumer-preferences variables, which can be viewed simply as additional hotel characteristics. Thus, similar to the identification of consumer mean preferences, we can identify the mean search cost coefficients based on the correlation between the observed click frequencies and the frequencies of underlying search cost characteristics.

(2) Heterogeneous Effects. Note that across both purchase data and search data, we have multiple observations per consumer. For a given consumer and her search cost, we observe the deviation of observed purchases and searches from those predicted decisions based on the mean preferences and search cost parameters. The distribution of these deviations across individual consumers allows us to identify the heterogeneity distribution parameters.

More specifically, we identify consumer heterogeneous preferences from two perspectives. First, we partially identify them from the search data based on the distribution of the deviations across individual

¹³ One important fact to note is that we also observe rich variation in the characteristics of hotels that enter the consumer's consideration set. In particular, we find that among all the sessions in which consumers incur click-throughs, 8,731 sessions are associated with a (click-generated) consideration set size that is larger than 5, and 3,506 sessions are associated with a size that is larger than 10. These observations are critical for our model identification.

consumers between our model's predicted click probabilities (based solely on the mean effects) and individual consumers' observed click probabilities. Second, since we also observe individual consumers' final purchases, the purchase data allow us to identify the heterogeneous preferences based on the distribution of the deviations across individual consumers between the model's predicted purchase probabilities (based solely on the mean effects) and individual consumers' observed purchase probabilities.

We identify the heterogeneous search cost through two sources. First, our identification relies on the exclusion restriction that search cost variables do not enter purchase decision processes. After identifying the consumer heterogeneous preferences through the conditional purchase probabilities, we can then identify the heterogeneous search cost by the joint variation of the consideration set size and the click probabilities. In particular, at each point during a consumer's search, based on mean parameters, her reservation utility, and the products already searched in the consideration set, we can predict the mean probability of her stopping the search. The deviation of her search activities from the predicted values gives us information about her heterogeneity in search cost. The distribution of these deviations across individual consumers identifies the search cost heterogeneity distribution parameters. Second, the nonlinear functional form in the reservation utility (i.e., Equation (6)) can also help identify consumer preference and search cost parameters (Kim et al. 2010). Since the consumer preferences enter the equation in a nonlinear manner (i.e., we need to integrate over the utility), whereas the search cost enters the equation in a linear manner, this mathematical nonlinearity also helps us separately identify consumer heterogeneous preferences and search cost.

5. Empirical Results

5.1 Main Results

Our main results are shown in Table 3. First, we find the majority of the coefficients are statistically significant at the 5% level, including both the mean effects $(\bar{\alpha}, \bar{\beta}, \bar{\lambda}, \bar{\gamma})$ and the heterogeneity parameters $(\sigma_{\alpha}, \Sigma_{\beta}, \Sigma_{\lambda}, \Sigma_{\gamma})$ (for price, summary hotel characteristics, landing-page hotel characteristics, and cost of absorbing social media content, respectively). Consistent with theory, PRICE has a negative effect on hotel demand. CLASS, AMENITYCNT, ROOMS, RATING, and REVIEWCNT each have a positive effect on hotel demand. For hotel-location characteristics, we find that BEACH, TRANS, HIGHWAY, and DOWNTOWN each have a positive effect on hotel demand, whereas LAKE and CRIME each show a negative effect. Consistent with prior literature, online position has a significant effect on consumer clicks and demand (e.g., Yao and Mela 2011, Ghose and Yang 2009). In particular, PAGE and RANK each lead to a decrease in hotel demand. Moreover, we find that three service variables that are extracted from social media textual content demonstrate a significant effect on hotel demand. In particular, food quality presents the highest positive impact, followed by hotel staff and parking.

On the other hand, we find the additional unstructured information from the landing page indeed leads to an increase in consumer search cost. In particular, the readability-related review features such as

22 COMPLEXITY, SYLLABLES, and SPELLERR each have a postve sgn, suggestng that long and complex sentences, words wth many syllables, or spellng errors n user revews dscourage consumers from contnung to search on product search engnes. Moreover, SUB has a postve sgn, mplyng that hghly subjectve and opnonated content that lacks objectve nformaton creates a cogntve burden for consumers durng hotel search and may lead to early termnaton of ther search. Fnally, SUBDEV also has a postve sgn, whch suggests that a mxture of both objectve and subjectve messages s lkely to lead to hgher cogntve costs. In other words, SUBDEV represents the standard devaton of the subjectvty value and t captures the level of heterogenety n the type of nformaton provded n the revews. The hgher the heterogenety, the hgher the cogntve cost assocated wth processng such nformaton (.e., when a revew s a mx of both subjectve and objectve messages, t adds to the cogntve costs because readers mght have to ncur addtonal effort when swtchng between dfferent types of nformaton). To get a handle on the actual magntude of the search cost, we quanttatvely derve the dollar value of dfferent search cost varables. Ths value represents how much a certan varable effect can be translated nto prce. We fnd that on average, the effort of contnung to search an addtonal hotel costs $6.18. The search costs dffer across hotels from $3.43 to $7.75. Our fndngs are consstent wth prevous fndngs suggestng a non-trval search cost n onlne markets. For example, Koulayev (2014) found the page-level medan search costs rse from $4 per frst search to $16 per ffth on a travel search engne. Brynjolfsson et al. (2010) found the benefts from searchng lower screens equal $6.55 for the medan consumer. Hann and Terwesch (2003) quantfed rebddng costs to be $4.00-$7.50 n a reverse-aucton channel. Hong and Shum (2006) found consumers medan search costs to be $1.31-$2.90 for a sample of text books. 
In addition, de los Santos (2008) found search costs ranging from $0.90 to $1.80 per search in the online book industry. Meanwhile, a one-word increase in the average sentence length increases consumer search cost by $0.44. One more syllable or one more spelling error per review can cost consumers $0.56 or $0.28, respectively, during the product search.

Importantly, our empirical analysis allows us to quantify the trade-off between consumers' benefits and costs in leveraging social media information for decision making. Our results indicate that more social media information (especially textual content) may not always improve consumer decision making. Certain service- and quality-related information extracted from review textual content can indeed facilitate consumer decision making and impact product demand. However, due to the size and the unstructured nature of such information, it also brings non-negligible cognitive costs to consumers. Our study aims to explore a more effective and scalable way of managing social media information, which can help search engines extract and provide useful information to consumers without introducing high cognitive costs. Moreover, our model and policy experiments (in Section 6) allow us to evaluate the associated economic outcomes for consumers as well as for product search engine revenues.

To further analyze the robustness of our model performance, and how social media and consumer heterogeneity (e.g., travel purposes) may affect the search cost and decisions of a consumer, we conduct three
robustness tests by (1) excluding the social media variables from the main model, (2) including an additional Topic Entropy variable in the main model, and (3) adding interaction effects between consumer travel purposes and summary-page variables. We find the estimated coefficients are qualitatively consistent with the main results. Interestingly, we notice that the model that does not account for social media textual variables presents significantly higher price elasticity. This result indicates that the unstructured social media information plays an important role in consumer decision making, and that consumers' cognitive costs to digest such information are non-negligible. Failing to account for such unstructured information during consumer search can lead to an overestimation of price elasticity. We provide more details on the robustness tests in Online Appendix F.

5.2 Model Comparisons

Furthermore, to understand how the type and scale of data or modeling mechanisms may affect the performance of our analysis, we conducted model comparison analyses with a set of alternative benchmark models using different data sets or modeling mechanisms.

Alternative Models

In particular, we considered four alternative benchmark models: (1) Alternative Model I: Use the purchase data only (Mixed Logit Model), (2) Alternative Model II: Use the purchase data only (Mixed Logit Model + Additional Search Cost Variables), (3) Alternative Model III: Use the click data only (Click Model) 14, and (4) Alternative Model IV: Use both the click and the purchase data (Joint Probabilistic Model of Click and Purchase + Additional Search Cost Variables, But No Click Sequence Information). 15 Due to space limitations, we provide the details on the alternative model mechanisms in Online Appendix G. Overall, we find the estimation results are qualitatively consistent with our main findings. Interestingly, we find that using a static model without accounting for consumers' search behavior can lead to an overestimation of the price elasticity. This finding can be attributed to the nature of the hotel search market.
A model that captures consumers' actual search behaviors finds lower price elasticity, implying that consumers in the hotel search market tend to place a high value on hotel quality and put weight on non-price factors during search (e.g., class, amenities, or reviews). Our finding on price elasticity is consistent with prior findings by Koulayev (2014) and Brynjolfsson et al. (2010). Both studies show that when consumers face a highly differentiated market (e.g., product differentiation or retailer differentiation), they are more likely to focus on non-price factors during search. Hence the estimated price elasticity is lower when incorporating consumers' search behaviors into the model. By contrast, when a market is less differentiated, consumers become more price-sensitive and focus more on price search. Thus a search model that incorporates consumers' search behaviors may find a higher price elasticity of demand than a static model (e.g., de los Santos et al. 2012).

14 In particular, we include only the click-sequence-related information in the likelihood function using the click data only. We estimate this click model using a similar simulated maximum likelihood approach based on only the click probability.

15 The major difference between this joint probabilistic model and our main search model is that instead of capturing the sequence of clicks and allowing clicks to be interdependent, the joint model assumes each click decision to be independent. Correspondingly, it models the click decisions independently as following a discrete choice process (e.g., Logit model).

Furthermore, from Alternative Models (I)-(III), we find that using only the click data or only the purchase data is likely to overestimate the price elasticity, and therefore it is important to consider both click and purchase decisions when modeling consumer preferences. Interestingly, however, from Alternative Model (IV), we find that although incorporating both click and purchase decision information can improve the model estimation, the joint probabilistic model without the click sequence information can still lead to an overestimation of price elasticity. This result indicates that not only do the final click and purchase decisions matter, but the sequential click path is also critical in revealing consumer preferences. Failing to capture consumers' search paths can lead to an overestimation of price elasticity in the online search market. For more details on the alternative model results, see Tables G1 and G2 in Online Appendix G.

Model Prediction Experiments

Based on the model-estimated coefficients, our final goal is to predict the click and purchase probabilities for a hotel by an individual consumer. The two individual probabilities can be predicted by substituting the model-estimated coefficients into Equations (7) and (8). To obtain individual-level consumer heterogeneity, we apply the Monte Carlo simulation method. In particular, we use the same random draws we simulated previously (i.e., in Sec. 4.6) from the joint distribution of the individual heterogeneous parameters. Based on the steps above, we are able to compute the corresponding individual click and purchase probabilities for each hotel for an individual consumer. To examine the predictive performance of our model, we conduct a set of model-prediction experiments. We first compute the predicted individual click and purchase probabilities for each hotel as described above. Then we compare the predicted individual click and purchase probabilities with the observed click and purchase probabilities (i.e., observed search and choice shares for the hotels).
We calculate the prediction error for each hotel at the individual-session level for both click and purchase probabilities. Then we compute the root mean square error (RMSE) and mean absolute deviation (MAD). We consider all four alternative models discussed above as our baseline models. Furthermore, we are interested in examining how the use of unstructured data (social media textual variables) may affect the model's predictive power. Therefore, we consider a fifth baseline model: the main model without the social media textual variables (Robustness Test 1) for both click- and purchase-probability predictions. We randomly partition our dataset into two subsets: one with 70% of the total observations as the estimation sample and the other with 30% of the total observations as the holdout sample. To minimize any potential bias from the partition process, we perform a 10-fold cross validation. We conduct both in-sample and out-of-sample estimation using our model and the baseline models. We then compare the predictive performance for both the click and the purchase probabilities of a hotel. The prediction results are illustrated in Tables 5a and 5b (click probability) and Tables 6a and 6b (purchase probability) in Appendix A. Our model-prediction results demonstrate that our model has the highest overall predictive power. Our model outperforms the
baseline models in both in- and out-of-sample predictive power for both click and purchase predictions. Similar trends in improvement in predictive power occur with respect to RMSE and MAD. For example, with regard to the click-probability prediction, the out-of-sample results in Table 5b show that with respect to the RMSE, our proposed model improves the prediction performance by 14.92% compared to the search model without the social media textual variables. It improves the prediction performance by 33.20% compared to the click model with only click data. We find a similar trend with regard to the purchase-probability prediction. For example, the out-of-sample results in Table 6b demonstrate that with respect to the RMSE, our main model improves the prediction performance by 18.77% compared to the next best model, the search model without the social media textual variables. It improves the prediction performance by 37.36% compared to the Mixed Logit model with a limited consideration set, and by 32.81% compared to the Mixed Logit model with a limited consideration set plus the additional search cost variables.

Notice that the model-prediction experiments indicate our model is better able to predict the individual click and purchase probabilities for each hotel than the click model and the static Mixed Logit model. Even after considering various extensions of the Mixed Logit models accounting for the limited consideration set and the additional search cost variables, or considering other alternative behavioral models, our search model still provides the best predictive performance. The potential reasons are the following. Our proposed search model is a holistic model that captures both the click and the purchase decision making processes for a consumer. Therefore, our model is able to account for the following three unique features of consumer search: (1) Interdependency in decision making.
The search model predicts that a click decision depends on the ordered list of previously clicked products, and consequently, a purchase decision depends on the previous click-generated consideration set; however, static models assume independent decision making during consumer evaluation. (2) Information arrives sequentially. Our model assumes that detailed product landing-page attributes become available to a consumer only after she clicks on the product; however, static models tend to ignore the fact that information arrives sequentially and assume both landing-page and summary-page attributes are available to the consumer at the beginning. (3) Non-negligible search cost. The search model predicts that a click decision depends on the (expected) search cost associated with this click, and the formation of the final consideration set depends on the search cost for each product; however, the static model ignores such opportunity cost.

In summary, three major indications from our model comparison experiments are: (i) Both the click and the purchase data reveal significant information about consumer preferences and search cost, and both are critical to improving the model's predictive power. (ii) The sequence of the clicks reveals significant information about consumer preferences and search cost. Our main search model incorporates not only the click decisions but also the sequential order of these clicks. However, the static Mixed Logit models ignore the sequence of the clicks and simply take the final consideration set as exogenously given. (iii) Unstructured social media data play an important role in consumer decision making. Incorporating such information into the model can lead to a significant improvement in the model's predictive power.
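The two error measures used in the prediction experiments can be sketched as follows. This is a minimal illustration, assuming that MAD here means the mean absolute difference between predicted and observed probabilities, and that a reported "improvement" is the relative reduction in error versus a baseline. The function names and the toy numbers are hypothetical and are not taken from the paper's tables.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed probabilities."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def mad(predicted, observed):
    """Mean absolute deviation of predictions from observations (assumed definition)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to a baseline model (assumed definition)."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Toy example with made-up click probabilities for two hotels in one session.
predicted_clicks = [0.1, 0.2]
observed_clicks = [0.1, 0.4]
print(rmse(predicted_clicks, observed_clicks))
print(mad(predicted_clicks, observed_clicks))
print(relative_improvement(0.20, 0.15))
```

In a holdout comparison, the same two functions would be applied to each model's out-of-sample predictions, and the relative improvement computed against each baseline in turn.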

6. Policy Experiment

Based on our model estimation results, we conduct counterfactual analyses under various policy experiments to explore what-if questions. More specifically, considering the amount and type of different information, we are interested in what information product search engines should present during different stages of consumer search (i.e., on the search results summary page vs. product landing page).

6.1 Information Shown on the Search Summary Page vs. Landing Page

As we notice in our dataset, most online products contain a large number of characteristics. However, due to the limitation in screen space, search engines are unable to show all product information on the search summary page. Instead, search engines choose to highlight a snapshot of some product information on the summary page, while leaving the majority of information to the landing page. The information selected for the search summary page for a product becomes critical because it can influence both consumers' perceptions of the utility of the product and their expectations regarding the search costs associated with further evaluation of the product (i.e., via click-through). 16 To explore what information should be shown on the search summary page, we conduct a policy experiment using our model. In particular, we assume search engines show different sets of hotel characteristics on the summary page, meaning these chosen characteristics are directly observable to consumers before the click-through. We re-estimate consumers' conditional belief regarding the unobserved characteristics using bootstrap samples, and then compute the individual session-hotel-level predicted click and purchase probabilities based on the parameter estimates from the original model estimation. We compute the overall click and purchase probabilities for a hotel by taking an average of the click and purchase probabilities across all sessions for that hotel. Finally, we sum over all hotels based on the prices and the predicted purchase probabilities to compute the predicted search engine revenue.
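The aggregation step described above (average the session-level purchase probabilities per hotel, then sum price times probability over hotels) can be sketched as follows. This is only an illustration under stated assumptions: the input layout, the function names, and the toy numbers are hypothetical, and any scaling from a per-session expectation to a total revenue figure is left out because the text does not specify it.

```python
from collections import defaultdict


def average_purchase_probability(session_hotel_probs):
    """Average predicted purchase probability per hotel across sessions.

    session_hotel_probs: iterable of (hotel_id, probability) pairs, one pair
    per session-hotel observation (hypothetical input layout).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hotel_id, prob in session_hotel_probs:
        totals[hotel_id] += prob
        counts[hotel_id] += 1
    return {hotel_id: totals[hotel_id] / counts[hotel_id] for hotel_id in totals}


def predicted_revenue(session_hotel_probs, prices):
    """Sum of price times average purchase probability over all hotels."""
    averages = average_purchase_probability(session_hotel_probs)
    return sum(prices[hotel_id] * prob for hotel_id, prob in averages.items())


# Toy example: hotel "a" appears in two sessions, hotel "b" in one.
toy_probs = [("a", 0.02), ("a", 0.04), ("b", 0.01)]
toy_prices = {"a": 100.0, "b": 200.0}
print(predicted_revenue(toy_probs, toy_prices))
```

Running the same function on the predicted probabilities from each counterfactual summary-page design would give comparable revenue figures across designs.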
By doing so, we aim to examine the following question: Holding consumers' preferences for product characteristics and search cost variables constant, how would consumers' click and purchase behavior change if the search engine websites were to provide different sets of information on the search results summary page? Moreover, we are interested in exploring a better strategy for search engines to design the search results summary page such that it improves the overall click/purchase probabilities and the search engine revenue. More specifically, we focus on six alternative sets of product information that may be potentially useful to show on the search results summary page: (1) Existing summary-page characteristics (i.e., price, hotel class, hotel brand, customer rating, review count, page, rank); (2) Existing summary-page characteristics plus additional location-related characteristics (i.e., # of external amenities, beach, lake, downtown, highway, public transportation, crime rate); (3) Existing summary-page characteristics plus additional service-related information (i.e., amenity count); (4) Existing summary-page characteristics plus additional review-text-related information (i.e., textual review features); (5) Existing summary-page characteristics plus additional review-topic-related information (i.e., Topic Entropy score derived from the entropy measurement); (6) Existing summary-page characteristics minus the product price information.

16 Note that we do not focus on the supply side model in the paper, and we make the implicit assumption that the hotels' information-providing practices are exogenous and not necessarily optimal. We believe that this is a reasonable assumption for our model, but more research in that direction could shed more light on the information provision decisions of the hotels and examine whether there is any strategic rationale for their actions on that front.

Figure 2a. Predicted Click Probabilities with Different Information Provided on Search Summary Page. Average click probability per hotel per session (confidence interval in parentheses): Existing (4.15%-4.69%); Existing+Location (2.12%-2.46%); Existing+Service (3.49%-3.88%); Existing-Price (5.96%-6.28%); Existing+Review (Text Feature) (2.83%-3.09%); Existing+Review (Topic Entropy) (3.01%-3.24%).

Figure 2b. Predicted Purchase Probabilities with Different Information Provided on Search Summary Page. Average purchase probability per hotel per session (confidence interval in parentheses): Existing (0.009% %); Existing+Location (0.042%-0.048%); Existing+Service (0.021%-0.028%); Existing-Price (0.007%-0.012%); Existing+Review (Text Feature) (0.037%-0.040%); Existing+Review (Topic Entropy) (0.030%-0.033%).

We compute the average predicted click and purchase probabilities per hotel per session under each of the above six assumptions. We provide our results in Figures 2a and 2b. Our findings demonstrate that the type
of information search engines choose to show on the summary page has a statistically significant effect on consumers' click and purchase probabilities. In particular, we find that providing additional location-, service-, or review-related information for products on the search results summary page will lead to a significant decrease in click probability. This finding is intuitive. Because providing more information on the summary page will reduce the variance of the product utility (i.e., reducing uncertainty in consumer expectation) before click, it lowers the reservation utility of the product (hence making the product less attractive for a consumer to click). Interestingly, however, we find that providing additional product information, especially the location-related information, on the travel search engine summary page will lead to a significant increase in the purchase probability. A potential reason for this finding is that providing additional product information on the search summary page can reduce the potential error in consumers' expectations of product utility and search costs before click. As a consequence, consumers are more likely to click on the best set of products that will provide them the highest utility. Hence, the maximum utility discovered from this click-generated consideration set is more likely to exceed the utility of the outside good. As a result, consumers are less likely to miss a good-value deal (i.e., leave without purchase). Meanwhile, we find that although excluding price information from the search summary page can lead to a significantly higher click probability, it does not seem to increase the purchase probability in the end. This finding indicates that strategically hiding price information (i.e., price obfuscation) from the search summary page can make further searching (i.e., clicking) for products on a search engine more attractive. However, this strategy may not increase the overall purchase probability. Finally, we compute the overall search engine revenue based on the hotel prices and the predicted purchase probabilities.
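The reservation-utility argument above (less pre-click variance implies a lower reservation utility, hence fewer clicks) can be illustrated numerically. The sketch below assumes a Weitzman-style reservation rule, in which the reservation utility z of an unclicked product solves c = E[max(u - z, 0)] for search cost c, and further assumes the pre-click utility u is normally distributed. Both are illustrative assumptions, as are all function names and numbers; this is not the paper's estimation code.

```python
import math


def expected_gain(z, mu, sigma):
    """E[max(u - z, 0)] for u ~ Normal(mu, sigma^2)."""
    a = (z - mu) / sigma
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    upper_tail = 0.5 * math.erfc(a / math.sqrt(2.0))  # P(standard normal > a)
    return sigma * (pdf - a * upper_tail)


def reservation_utility(mu, sigma, cost, iterations=200):
    """Solve cost = E[max(u - z, 0)] for z by bisection.

    The expected gain is decreasing in z, so a simple bracket suffices.
    """
    if sigma <= 0 or cost <= 0:
        raise ValueError("sigma and cost must be positive")
    lo = mu - cost - 10.0 * sigma  # expected gain here is at least cost + 10 * sigma
    hi = mu + 10.0 * sigma
    while expected_gain(hi, mu, sigma) > cost:
        hi += 10.0 * sigma
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_gain(mid, mu, sigma) > cost:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Same mean utility and search cost, two levels of pre-click uncertainty.
z_high_uncertainty = reservation_utility(mu=0.0, sigma=1.0, cost=0.1)
z_low_uncertainty = reservation_utility(mu=0.0, sigma=0.5, cost=0.1)
print(z_high_uncertainty, z_low_uncertainty)
```

Under these assumptions, the product with the smaller standard deviation gets the lower reservation utility, which matches the direction of the click-probability result described in the text.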
Our results show that the location-related information is the most influential, compared to the service- and review-related information, when the travel search engine presents this information on the search summary page. It can lead to a 22.16% increase in the overall search engine revenue. Providing service-related information, such as the total number of hotel amenities, on the search summary page can lead to a 3.22% increase in the overall search engine revenue. By contrast, strategically hiding price information from the search summary page can hurt the search engine revenue, leading to a 7.08% drop in the overall revenue. We provide more details on the corresponding results in Table 4. 17 Interestingly, providing a carefully curated digest of social media textual content on the product summary page (e.g., top-6 most frequently mentioned product features extracted from the customer reviews, customers' attitudes towards these popular features, readability of the review's textual content) can lead to a 12.01% increase in the overall search engine revenue. Meanwhile, providing an overall Topic Entropy score of the review content (i.e., derived from topic models to measure the complexity of the review topic content) can lead to an 8.23% increase in the overall search engine revenue. These findings suggest that it is important for product search engines to leverage the economic value of large-scale unstructured social media information, while at the same time reducing the cognitive burden of consumers by automating the extraction of such information and providing it to consumers during the earlier stages of decision making.

17 We conducted additional analysis to examine the statistical significance of the differences across the simulated revenues in the policy experiments. In particular, given each different set of information on the search summary page, we replicated our simulation experiments 200 times (i.e., via bootstrapping) to obtain the confidence interval of the corresponding simulated platform revenue. We found the predicted revenues under different scenarios are statistically different from the existing case (i.e., confidence intervals do not overlap).

Table 4. Predicted Overall Search Engine Revenue with Different Information on Search Summary Page

Information on summary page | Overall search engine revenue | 95% confidence interval
Existing | $452,781 | $445,263 to $458,260
Existing + Location Information | $553,136 | $538,026 to $561,989
Existing + Service Information | $467,369 | $460,031 to $474,112
Existing - Price Information | $420,132 | $411,585 to $429,203
Existing + Review Information (Text Features) | $507,160 | $500,327 to $514,278
Existing + Review Information (Topic Entropy) | $490,063 | $481,314 to $498,157

The confidence interval is calculated by bootstrapping the policy simulation experiments 200 times.

Furthermore, to examine where the revenue increase came from, we conducted an additional analysis on the breakdown of the revenue in the simulation. Interestingly, we found that the revenue increase came from both existing consumers and expansion of market coverage. In addition, we found that the revenue increase occurred for both existing hotels and new hotels. This finding provides further support that with carefully designed information on the search summary page, search engines can improve the market coverage of consumers as well as the diversity of products consumed, which can lead to a potential increase in consumer surplus. For more details, we provide the complete revenue breakdown analysis in Online Appendix H. In sum, our policy experiment offers critical insights on the potential of analyzing large historical user behavioral data for search engines to improve the landing-page design strategy for better user experience and higher overall business revenues.

7. Managerial Implications and Conclusion

In this paper, we propose a structural econometric model for product search engines to understand consumers' search and purchase behavior as well as to quantify the search costs incurred by consumers. Our model combines an optimal stopping framework with an individual-level random utility choice model. It allows us to jointly estimate consumers' heterogeneous preferences and search costs in a product search engine context where unstructured social media information is pervasive, and to identify the key driver of a consumer's decision at each stage of the search and purchase process. Our results suggest that both the historical clicking decisions and the purchase decisions reveal significant information about consumer preferences and search costs. Moreover, the paths of searches (i.e., sequence of clicks) also reveal significant information about consumer
preferences and search costs. Our analyses can help search engines predict consumers' online footprints and design the search result summary page to improve user experience and search engine revenues.

On a broader note, our research makes two key contributions. First, we show the advantage of incorporating multiple and large-scale data sources to analyze how humans search, evaluate information, and make decisions under cognitive constraints in response to the emerging interplay between social media and search engines. Moreover, we are able to quantify the effects of unstructured social media content on user search cost. Our empirical analysis aims to provide an approach on which future studies can build, with the goal of exploring the potential of Big Data and sophisticated customer analytics tools for managerial decision making. Second, we demonstrate the value of search engines using digital analytics based on structural econometric methods to find solutions for important business problems. Our structural model of consumer search combines the optimal stopping framework with an individual-level random utility choice model. It allows us to harness the advantage of multistage consumer behavioral data on search engines to identify the drivers of consumer decisions in electronic markets. It enables the prediction of consumers' future search behavior on search engines. Moreover, it offers insights to search engines on the design of the search results summary page (i.e., what information to show on the summary page vs. the landing page) to improve the user experience and the search engine revenues. Importantly, this approach can be generalized to any electronic market with an in-house search engine (e.g., Amazon.com), especially in a mobile search environment (e.g., Apple's iTunes or App Store), given the commonality in the goal of improving user experience.

Our work has several limitations, some of which can serve as fruitful areas for future research.
First, our model assumes the consumer knows the general distribution of utilities of alternatives, and each alternative follows the same distribution. However, when the alternatives are sorted on search engines under certain criteria including the default method, they are presented in order of their predicted attractiveness to a consumer. Such recommendations can alter the distribution of the expected utilities of alternatives and may induce a shift in consumers' decision making (Dellaert and Häubl 2012). Examining this from an empirical perspective would be interesting. Second, testing other alternative consumer behavioral models would be interesting. For example, instead of searching sequentially, consumers may search in a non-sequential fashion by first choosing a fixed size of a consideration set (e.g., Honka 2014). Comparing the differences in the corresponding model predictions of consumer search strategy would be interesting. Third, in this study, we assume each online consumer session to be an independent search process. Due to the data limitation, we cannot identify the possibility that a consumer may leave a session without booking but come back at a later time to resume the search. In this case, we treat these searches as two separate results in our estimation. Distinguishing such repeated searchers and more precisely estimating the search costs would be an interesting avenue for future research. Meanwhile, due to the data limitation, we do not have consumer-level demographic information. Because the search cost is likely to relate to the opportunity cost of time, including such information (e.g., age, income) in the future would be useful. Finally, it would be very interesting for future research to consider the supply
side (e.g., how the hotels/advertisers may respond to the search engine's policy change) in addition to the demand side to examine the effects of policy change on search engines.

References

Agarwal, A., K. Hosanagar, M. Smith. Location, Location, Location: An Analysis of Profitability of Position in Online Advertising Markets. Journal of Marketing Research. 48(6).
Archak, N., A. Ghose, P. G. Ipeirotis. Deriving the pricing power of product features by mining consumer reviews. Management Sci. 57(8).
Baye, M.R., Gatti, J.R.J., Kattuman, P. and Morgan, J. Clicks, Discontinuities, and Firm Demand Online. Journal of Economics & Management Strategy. 18(4).
Bikhchandani, S. and S. Sharma. Optimal Search with Learning, Journal of Economic Dynamics and Control, 20.
Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation, Journal of Machine Learning Research (3).
Brynjolfsson, E., A. Dick and M. Smith. A nearly perfect market? Differentiation vs. price in consumer choice. Quantitative Marketing and Economics, vol. 8, no. 1.
Chapelle, O., Zhang, Y. A Dynamic Bayesian Network Click Model for Web Search Ranking. Proceedings of WWW.
Chen, P., Y. Hong and Y. Liu. The Value of Multi-dimensional Rating Systems: Evidence from a Natural Experiment and Randomized Experiments. Management Science, forthcoming.
Chen, Y. and S. Yao. Sequential Search with Refinement: Model and Application with Click-stream Data. Forthcoming in Management Science.
Chevalier, J. A., D. Mayzlin. The effect of word of mouth on sales: Online book reviews. J. Marketing Res. 43(3).
De los Santos, B. 2008. Consumer search on the internet, PhD dissertation, Chicago University.
De los Santos, B., A. Hortacsu, and M. Wildenbeest. Testing models of consumer search using data on web browsing and purchasing behavior. American Economic Review, 102(6).
De los Santos, B., A. Hortacsu, and M. Wildenbeest. Search with Learning. Working Paper.
De los Santos, B. and Koulayev, S. Optimizing Click-Through in Online Rankings for Partially Anonymous Consumers. Working Paper.
Dellaert, B. G. C., G. Häubl. Searching in Choice Mode: Consumer Decision Processes in Product Search with Recommendations. Journal of Marketing Research. Vol. 49, No. 2.
Dellarocas, C., N. Awad, M. Zhang. Exploring the value of online product reviews in forecasting sales: The case of motion pictures. J. Interactive Marketing 21(4).
Duan, W., B. Gu, A. B. Whinston. Do online reviews matter? An empirical investigation of panel data. Decision Support Systems, 45(4).
Ellison, G. and Ellison, S.F. Search, Obfuscation, and Price Elasticities on the Internet. Econometrica.
Erdem, T. and Keane, M.P. Decision-Making Under Uncertainty: Capturing Dynamic Brand Choice Processes in Turbulent Consumer Goods Markets. Marketing Science. vol. 15.
Fellbaum, C. Wordnet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
Forman, C., A. Ghose, B. Wiesenfeld. Examining the relationship between reviews and sales: The role of reviewer identity disclosure in electronic markets. Inform. Systems Res. 19(3).
Ghose, A. and Ipeirotis, P. G. Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering, 23(10).
Ghose, A., Ipeirotis, P. and Li, B. Designing Ranking Systems for Hotels on Travel Search Engines by Mining User-Generated and Crowdsourced Content. Marketing Science. 31(3).
Ghose, A., Ipeirotis, P. and Li, B. Examining the Impact of Ranking on Consumer Behavior and Search Engine Revenue. Management Science. 60(7).
Ghose, A. and Yang, S. An Empirical Analysis of Search Engine Advertising: Sponsored Search in Electronic Markets. Management Science. 55(10).
Godes, D., D. Mayzlin. Using online conversations to study word-of-mouth communication. Mkt. Sci. 23(4).
Goldfarb, A. and Tucker, C. Search Engine Advertising: Channel Substitution When Pricing Ads to Context. Management Science, 57.
Gong, J., V. Abhishek, and B. Li. Examining the Impact of Contextual Ambiguity on Search Advertising Keyword Performance: A Topic Model Approach. Working Paper.
Greene, W. H. Limdep Manual. Version 8.0. Econometric Software, Inc.
Hann, I., & Terwiesch, C. Measuring the frictional cost of online transactions: The case of a name-your-own-price channel. Management Science, 49.
Hong, H. and M. Shum. Can search cost rationalize equilibrium price dispersion in online markets? Rand Journal of Economics, 37(2).
Honka, E. Quantifying search and switching costs in the U.S. auto insurance industry. Rand Journal of Economics.
Hortacsu, A. and C. Syverson. Product Differentiation, Search Costs, and Competition in the Mutual Fund Industry: A Case Study of S&P 500 Index Funds, Quarterly Journal of Economics, 119.
Iprospect. iProspect Blended Search Results Study.
JupiterResearch. Retail Web Site Performance.
Kim, J., P. Albuquerque, B. Bronnenberg. Online Demand under Limited Consumer Search, Marketing Science, 29(6).
Kim, J., P. Albuquerque, B. Bronnenberg. The Effects of Product Innovation on Online Search and Choice. Working Paper.
Koulayev, S. Search with Dirichlet Priors: Estimation and Implications for Consumer Demand, Journal of Business & Economic Statistics 31.
Koulayev, S. Estimating Demand in Online Search Markets, with Application to Hotel Bookings. RAND J. of Economics.
Lee, Lung-Fei. Generalized Econometric Models with Selectivity. Econometrica 51(2).
Manning, C., H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
MarketingLand. Top Retail Websites Not Getting Faster: Average Web Page Load Time Is 7.25 Seconds.
McFadden, D. A Method of Simulated Moment for Estimation of Discrete Response Models without Numerical Integration. Econometrica, Vol. 57.
McFadden, D. and K. Train. Mixed MNL Models of Discrete Response. Journal of Applied Econometrics.
Mehta, N., S. Rajiv, and K. Srinivasan. Price uncertainty and consumer search: a structural model of consideration set formation, Marketing Science, 22(1).
Moraga-Gonzalez, J. L., Wildenbeest, M. R. Maximum likelihood estimation of search costs. European Economic Review, 52.
Moraga-Gonzalez, J.L., Sandor, Z. and Wildenbeest, M.R. Consumer Search and Prices in the Automobile Market. Working Paper.
Mortensen, D.T. Job search, the duration of unemployment and the Phillips curve, American Economic Review.
Netzer, O., R. Feldman, J. Goldenberg, M. Fresko. Mine your own business: Market-structure surveillance through text mining. Marketing Sci. 31(3).
Reinganum, J. F. Strategic search theory. International Economic Review. 23(1).
Rosenfield, D. B. and R. D. Shapiro. Optimal Adaptive Price Search, Journal of Economic Theory.
Rothschild, M. Searching for the Lowest Price when the Distribution of Prices is Unknown, J. of Political Economy 82.
Stigler, G.J. The Economics of Information. The Journal of Political Economy, 69(3).
Weitzman, M. L. Optimal search for the best alternative. Econometrica 47(3).
Wildenbeest, M.R. An Empirical Model of Search with Vertically Differentiated Products. RAND J. of Economics, 42(4).
Yao, S., C. F. Mela. A Dynamic Model of Sponsored Search Advertising. Marketing Science, 30(3).

Table 2. Definitions and Summary Statistics of Variables

Variable       Definition                                                    Mean   Std.   Min   Max

PRICE_DISP     Displayed price per room per night
PRICE_TRANS    Transaction price per room per night
CLASS          Hotel class
AMENITYCNT     Total # hotel amenities
ROOMS          Total number of hotel rooms
BRAND          Dummies for 9 hotel brands: Accor, Best Western, Cendant, Choice, Hilton, Hyatt, Intercontinental, Marriott, and Starwood
PAGE           Page number of the hotel
RANK           Screen position of the hotel
SPECIALSORT    Dummy for a special sorting method
BEACH          Beachfront within 0.6 miles
LAKE           Lake or river within 0.6 miles
TRANS          Public transportation within 0.6 miles
HIGHWAY        Highway exits within 0.6 miles
DOWNTOWN       Downtown area within 0.6 miles
EXTAMENITY     Number of external amenities within 1 mile, i.e., restaurants, shopping malls, or bars
CRIME          City annual crime rate

Social Media Variables (Cognitive Cost)
COMPLEXITY     Average sentence length per review
SYLLABLES      Average # syllables per review
SPELLERR       Average # spelling errors per review
SUB            Review subjectivity - mean
SUBDEV         Review subjectivity - standard deviation
TOPICENTROPY   Entropy score to measure topic complexity

Social Media Variables (Hotel Quality)
REVIEWCNT      Total # reviews
RATING         Overall reviewer rating
STAFF          Sentiment score for helpfulness of staff
FOOD           Sentiment score for food quality
BATHROOM       Sentiment score for bathroom quality
PARKING        Sentiment score for parking facilities
BEDROOM        Sentiment score for bedroom quality
FRONTDESK      Sentiment score for check-in/out front desk efficiency

Model Computed Search Cost (in US Dollar $)
c_ij           Search cost for a hotel j derived from the model estimation

Total # Sessions: 969,033
Total # Hotels: 2117
Total # Observations: 7,059,122
Time Period: 11/1/2008-1/31/

Table 3. Estimation Results - Main Model

Variable                      Mean Effect (Std. Err) M    Heterogeneity (Std. Err) M

(Preferences)
PRICE (L)                     ... (.022)                  .417 (.074)
PAGE                          ... (.003)                  .080 (.133)
RANK                          ... (.008)                  .132 (.067)
CLASS                         ... (.023)                  .935 (.181)
AMENITYCNT (L)                .146 (.034)                 .066 (.070)
ROOMS (L)                     .394 (.024)                 .195 (.287)
EXTAMENITY (L)                .165 (.036)                 .041 (.046)
BEACH                         ... (.028)                  .561 (.099)
LAKE                          ... (.116)                  ... (.389)
TRANS                         ... (.140)                  .192 (.064)
HIGHWAY                       .447 (.093)                 .068 (.061)
DOWNTOWN                      ... (.061)                  .471 (.093)
CRIME                         ... (.043)                  .015 (.034)
RATING                        ... (.015)                  ... (.091)
REVIEWCNT (L)                 ... (.107)                  .369 (.069)
STAFF                         .139 (.027)                 .034 (.088)
FOOD                          .225 (.038)                 .136 (.002)
BATHROOM                      .290 (.271)                 .060 (.103)
PARKING                       .097 (.008)                 .075 (.011)
BEDROOM                       ... (.232)                  .253 (.269)
FRONTDESK                     .065 (.103)                 .021 (.076)
BRAND                         Yes

(Search Cost) γ
Search Base Cost (Constant)   ... (.089)                  .971 (.176)
COMPLEXITY                    .541 (.094)                 .398 (.115)
SYLLABLES (L)                 .678 (.115)                 .721 (.106)
SPELLERR (L)                  .329 (.082)                 .033 (.101)
SUB                           .196 (.045)                 .057 (.229)
SUBDEV                        .342 (.056)                 .119 (.273)

Maximum LL: -405,418
Price Elasticity: ...
(L) Logarithm of the variable. Statistically significant at 5%. M: Main Model.

Appendix A. Model Comparisons

Table 5a: In-sample Model Prediction Results (Click Probability)
Columns: Main Model; Main Model w/o Social Media Textual Variables; Click Model (with Only Click Data); Joint Model of Click and Purchase (No Click Sequence)
Rows: RMSE; MAD

Table 5b: Out-of-sample Model Prediction Results (Click Probability)
Columns: Main Model; Main Model w/o Social Media Textual Variables; Click Model (with Only Click Data); Joint Model of Click and Purchase (No Click Sequence)
Rows: RMSE; MAD

Table 6a: In-sample Model Prediction Results (Purchase Probability)
Columns: Main Model; Main Model w/o Social Media Textual Variables; Mixed Logit Model (Limited Consideration Set); Mixed Logit Model (Limited Consideration Set + Additional Search Cost Variables); Joint Model of Click and Purchase (No Click Sequence)
Rows: RMSE; MAD

Table 6b: Out-of-sample Model Prediction Results (Purchase Probability)
Columns: Main Model; Main Model w/o Social Media Textual Variables; Mixed Logit Model (Limited Consideration Set); Mixed Logit Model (Limited Consideration Set + Additional Search Cost Variables); Joint Model of Click and Purchase (No Click Sequence)
Rows: RMSE; MAD

Online Appendix B. Optimal Search Framework

Our model builds on the optimal sequential search framework. Consider that consumers are forward-looking and trying to maximize the expected present value of utility over a planning horizon (e.g., Erdem and Keane 1996). The expected present value in our setting can be computed as follows. First, we partition the set of available alternatives into $S_i \cup \bar{S}_i$, with $S_i$ containing all the ones that have been searched and $\bar{S}_i$ containing all the non-searched ones. Let $u_i^*$ be the highest net value searched so far; thus, we have

$$u_i^* = \max_{j \in S_i} \{u_{ij}, 0\}. \tag{B1}$$

Note that Equation (B1) is the same as Equation (4) in the paper. The state of the system at any time during the search is given by $(u_i^*, \bar{S}_i)$. Define $\Psi(u_i^*, \bar{S}_i)$ as the expected present discounted value of following an optimal search policy, from the current state $(u_i^*, \bar{S}_i)$ going forward. Therefore, for each $u_i^*$ and $\bar{S}_i$, the state valuation function $\Psi(u_i^*, \bar{S}_i)$ must satisfy the Bellman equation:

$$\Psi(u_i^*, \bar{S}_i) = \max\left\{ u_i^*,\ \max_{j \in \bar{S}_i}\left[ -c_{ij} + d\left( \Psi(u_i^*, \bar{S}_i - \{j\}) \int_{-\infty}^{u_i^*} f(u_{ij})\,du_{ij} + \int_{u_i^*}^{\infty} \Psi(u_{ij}, \bar{S}_i - \{j\})\, f(u_{ij})\,du_{ij} \right) \right] \right\}, \tag{B2}$$

where $F(\cdot)$ is the CDF of $u_{ij}$ and $f(\cdot)$ is the probability density function of $u_{ij}$. Therefore, at the current state $(u_i^*, \bar{S}_i)$, the consumer can either terminate search and collect reward $u_i^*$, or search any $j \in \bar{S}_i$ to maximize $\Psi(u_i^*, \bar{S}_i)$. Given the short time span in online search, we set the discount factor $d$ to 1.

Equation (B2) is the principle of optimality for dynamic programming. As pointed out by Weitzman (1979) and Lippman & McCall (1976) in the classical economic literature of search, this dynamic programming problem has a myopic solution: the consumer needs only compare her return from stopping and accepting reward $u_i^*$ with the expected return from exactly one more search. More formally, let the expected marginal utility for consumer $i$ from the search of product $j$ be

$$B_{ij}(u_i^*) = \int_{u_i^*}^{\infty} (u_{ij} - u_i^*)\, f(u_{ij})\,du_{ij}. \tag{B3}$$

Thus, consumer $i$ will continue to search if there exists at least one $j$ such that the expected marginal benefit from searching product $j$ exceeds its corresponding search cost:

$$c_{ij} < B_{ij}(u_i^*). \tag{B4}$$

Therefore, the optimal search strategy for a consumer is to continue searching until a value $u_i^*$ is found that violates Equation (B4).
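The stopping rule in Equations (B3) and (B4) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the utility of product j is a known mean `v_j` plus a Type I EV(0,1) error, approximates the integral in (B3) with a trapezoid rule on a truncated range, and uses hypothetical function names throughout.

```python
import math

def gumbel_pdf(x, loc=0.0):
    """Density of a Type I extreme value variable with location `loc` and unit scale."""
    t = x - loc
    if t < -30.0:  # density is numerically zero here; the guard also avoids overflow
        return 0.0
    return math.exp(-t - math.exp(-t))

def expected_marginal_benefit(u_star, v_j, width=40.0, steps=20000):
    """Trapezoid approximation of B_ij(u*) = integral over u > u* of (u - u*) f(u) du."""
    upper = max(u_star, v_j) + width  # finite stand-in for the infinite upper limit (assumption)
    h = (upper - u_star) / steps
    total = 0.0
    for k in range(steps + 1):
        u = u_star + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * (u - u_star) * gumbel_pdf(u, v_j)
    return total * h

def should_continue(u_star, unsearched):
    """True if some unsearched product (v_j, c_j) satisfies c_j < B_j(u*), as in (B4)."""
    return any(c_j < expected_marginal_benefit(u_star, v_j) for v_j, c_j in unsearched)
```

With a low current best value the expected gain from one more search is large, so search continues; once the best value found is high, the expected gain falls below the search cost and search stops.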

Online Appendix C. Computation of Reservation Utility

Our model builds on the optimal sequential search framework by Weitzman (1979). Define the reservation utility $z_{ij}$ as the utility value that satisfies the following boundary condition, where the search cost equates the expected marginal utility from searching product $j$ (same as Equation (6)):

$$c_{ij} = B_{ij}(z_{ij}) = \int_{z_{ij}}^{\infty} (u_{ij} - z_{ij})\, f(u_{ij})\,du_{ij}. \tag{C1}$$

The optimal search strategy for a consumer is to continue searching until she finds a value $u_i^*$ larger than the boundary solution $z_{ij}$. The reservation utility $z_{ij}$ can be solved from Equation (C1) given the search cost, $z_{ij} = B_{ij}^{-1}(c_{ij})$. Weitzman (1979) has proved that the function $B_{ij}(z_{ij})$ is continuous and monotonic. Therefore, there exists a unique solution $z_{ij}$ to the equation $c_{ij} = B_{ij}(z_{ij})$.

Let $\mu_{u_{ij}}$ be the mean and $\sigma_{u_{ij}}^2$ be the variance of the utility distribution $f(u_{ij})$. Based on our model setting, before the click-through we can write down the mean and the variance of the expected utility as follows:

$$\mu_{u_{ij}} = E[u_{ij}] = E(X_j\beta_i + \alpha_i P_j + \bar{L}_j\lambda_i + e_{ij}) = X_j\beta_i + \alpha_i P_j + \bar{L}_j\lambda_i + E(e_{ij}), \tag{C2}$$

and

$$\sigma_{u_{ij}}^2 = VAR[u_{ij}] = VAR(X_j\beta_i + \alpha_i P_j + \bar{L}_j\lambda_i + e_{ij}) = VAR(e_{ij}). \tag{C3}$$

$\bar{L}_j$ is the expected value of the unobserved landing-page characteristics before click. It can be estimated based on the mean of the bootstrapping samples drawn from the empirical distribution of landing-page characteristics for hotel $j$ conditional on the observed summary-page characteristics $(X_j, P_j)$. Meanwhile, based on the assumption $e_{ij} \sim$ Type I EV(0,1), we can derive $E(e_{ij}) \approx 0.5772$ (the Euler-Mascheroni constant) and $VAR(e_{ij}) = \pi^2/6$. Therefore, $\mu_{u_{ij}}$ and $\sigma_{u_{ij}}^2$ can be derived accordingly.

We rewrite Equation (C1) as follows:

$$c_{ij} = \int_{z_{ij}}^{\infty} (u_{ij} - z_{ij})\, f(u_{ij})\,du_{ij} = (1 - F(z_{ij}))\left( \int_{z_{ij}}^{\infty} u_{ij}\,\frac{f(u_{ij})}{1 - F(z_{ij})}\,du_{ij} - z_{ij} \right), \tag{C4}$$

where $F(z_{ij})$ is the CDF of $u_{ij}$ evaluated at $z_{ij}$. We compute the reservation utility using a similar approach as Kim et al. (2010). 18 Let $\eta_{ij} = (z_{ij} - \mu_{u_{ij}})/\sigma_{u_{ij}}$ and $\zeta_{ij} = c_{ij}/\sigma_{u_{ij}}$; we can then rewrite Equation (C4) as follows:

$$c_{ij} = (1 - F(\eta_{ij}))\left( \mu_{u_{ij}} + \sigma_{u_{ij}}\frac{f(\eta_{ij})}{1 - F(\eta_{ij})} - z_{ij} \right) \;\Rightarrow\; \zeta_{ij} = g(\eta_{ij}) = (1 - F(\eta_{ij}))\left( \frac{f(\eta_{ij})}{1 - F(\eta_{ij})} - \eta_{ij} \right). \tag{C5}$$

If we can solve $\eta_{ij} = g^{-1}(\zeta_{ij})$ from Equation (C5), then we can solve $z_{ij} = \mu_{u_{ij}} + \eta_{ij}\sigma_{u_{ij}}$. Note that Equation (C5) does not involve any model parameters. Therefore, in practice we only need to solve it once and use the results in the model estimation (Kim et al. 2010, Koulayev 2014). For computational tractability, we apply an interpolation approach to solve Equation (C5). More specifically, we compute the reservation utility $z_{ij}$ in the following four steps:

1) Pre-construct a lookup table for each pair $(\eta_{ij}, \zeta_{ij})$ based on Equation (C5).
2) At any stage in the estimation, use the current values of $c_{ij}$ and $\sigma_{u_{ij}}$ to compute $\zeta_{ij}$.
3) Based on the value of $\zeta_{ij}$, look up the corresponding value $\eta_{ij}$ in the pre-constructed table.
4) Based on the current values of $\eta_{ij}$, $\sigma_{u_{ij}}$, and $\mu_{u_{ij}}$, solve the reservation utility $z_{ij} = \mu_{u_{ij}} + \eta_{ij}\sigma_{u_{ij}}$.

18 Note that, different from Kim et al. (2010), who assume a standard normal distribution of the error, we allow for a logit distribution of the error term in our model. To calculate the function with regard to the inverse Mills ratio ($f(u_{ij})/(1 - F(z_{ij}))$) from Equation (C4) to Equation (C5), we first need to transform the logit error into standard normal disturbances using an inverse standard normal CDF function. This transformation approach was proposed and widely used by previous studies to compute the inverse Mills ratio for the logit distribution (e.g., Lee (1983), Greene (2002)).
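The four lookup-table steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it takes f and F in Equation (C5) to be the standard normal density and CDF (in the spirit of the transformation described in footnote 18), and the grid range, grid size, and function names are hypothetical choices.

```python
import math
from bisect import bisect_left

def g(eta):
    """g(eta) = (1 - F(eta)) * (f(eta) / (1 - F(eta)) - eta) = f(eta) - eta * (1 - F(eta))."""
    pdf = math.exp(-0.5 * eta * eta) / math.sqrt(2.0 * math.pi)
    upper_tail = 0.5 * math.erfc(eta / math.sqrt(2.0))  # 1 - F(eta), stable for large eta
    return pdf - eta * upper_tail

def build_lookup_table(lo=-6.0, hi=6.0, n=2401):
    """Step 1: tabulate (zeta, eta) pairs, sorted by zeta so the table can be searched."""
    etas = [lo + k * (hi - lo) / (n - 1) for k in range(n)]
    pairs = sorted((g(eta), eta) for eta in etas)
    return [p[0] for p in pairs], [p[1] for p in pairs]

def reservation_utility(c, mu, sigma, table):
    """Steps 2-4: compute zeta = c / sigma, interpolate eta, return z = mu + eta * sigma."""
    zetas, etas = table
    zeta = c / sigma
    i = bisect_left(zetas, zeta)
    if i == 0:
        eta = etas[0]            # zeta below the tabulated range: clamp
    elif i == len(zetas):
        eta = etas[-1]           # zeta above the tabulated range: clamp
    else:
        w = (zeta - zetas[i - 1]) / (zetas[i] - zetas[i - 1])
        eta = etas[i - 1] + w * (etas[i] - etas[i - 1])
    return mu + sigma * eta
```

Because the table depends on no model parameters, it is built once and reused for every consumer-product pair during estimation; a higher search cost maps to a lower reservation utility.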

Online Appendix D. More Details on Using the Simulated Approach to Construct the Conditional Purchase Probability

As we discussed in Section 4.4, conditional on the sequence of clicks consumer $i$ has made in the search session, we can derive the conditional probability that she purchases hotel $r(j)$ in her consideration set as follows:

$$P(r(j) \text{ is booked by consumer } i) = \Pr\left(u_{i,r(j)} > u_{i,r(j')},\ \forall r(j) \neq r(j'),\ r(j), r(j') \in S_i\right) = \Pr\left(V^S_{i,r(j)} + V^L_{i,r(j)} + e_{i,r(j)} > V^S_{i,r(j')} + V^L_{i,r(j')} + e_{i,r(j')},\ \forall r(j) \neq r(j'),\ r(j), r(j') \in S_i\right), \tag{D1}$$

where $S_i$ is the click-generated choice set for consumer $i$. Equation (D1) is identical to Equation (8) in the paper. Note that because the consideration set $S_i$ is selected by consumer $i$ based on her search decisions, $e$ does not follow a full Type I EV distribution. Instead, it follows a truncated Type I EV distribution based on the optimality conditions used by the consumer. Unfortunately, under such circumstances the conditional choice probability does not have a closed-form expression (e.g., Logit form).

To address this selection issue, we applied a simulation approach. Similar methods have been adopted by previous studies (Chen and Yao 2016, Honka 2014, McFadden 1989). Our simulation approach builds on the methods from Chen and Yao (2016) and Honka (2014). It allows us to simulate the error term from a truncated Type I EV distribution by satisfying the following three optimality conditions: 1) Sequence of the click-generated choice set; 2) Composition of the click-generated choice set; 3) Utility optimality of the final choice. More specifically, the simulated purchase probability has to satisfy the following conditions:

1) At any moment during the consumer search process, the utility of the product $j$ currently being clicked, $u_{i,r(j)}$, is smaller than the reservation utilities of those products clicked after $j$. This is because the consumer continues to search afterwards. Here, $S_i^{clicked}$ denotes the set of all products that have been clicked by consumer $i$.

$$u_{i,r(j)} < \min\left(z_{i,r(j')},\ \forall r(j') \in S_i^{clicked} \text{ and } r(j') > r(j)\right) \tag{D2}$$

2) The utility of the final purchased product, $u_{i,r(j)}$, is greater than the reservation utilities of all the remaining unsearched products, $z_{i,r(j')}$. This is because the consumer stops searching afterwards. Here, $S_i^{unclicked}$ represents the set of all products that have not been clicked by consumer $i$.

$$u_{i,r(j)} > z_{i,r(j')},\quad \forall r(j') \in S_i^{unclicked} \tag{D3}$$

3) The utility of the final purchased product, $u_{i,r(j)}$, is greater than the utility of any other product in the click-generated choice set, $u_{i,r(j')}$. This is the final choice utility optimality condition.

$$u_{i,r(j)} > u_{i,r(j')},\quad \forall r(j') \in S_i^{clicked} \text{ and } r(j) \neq r(j') \tag{D4}$$

Hence, when simulating the error term in the utility function to construct the conditional purchase probability, we need to draw from the truncated Type I EV distribution by taking into consideration all three optimality conditions (D2)-(D4) above. As discussed in Chen and Yao (2016), for different products in the click-generated choice set, the error terms are truncated differently. For clicked products that are not purchased, the error term is right truncated. For the purchased product, the error term is left truncated if it is the final click during the search process; if it is not the final search, then it is truncated on both sides.

To construct the conditional purchase probability, an intuitive approach is to draw the error term from the Type I EV distribution based on the three truncation optimality conditions in (D2)-(D4), and count the frequency with which the three optimality conditions are satisfied. More specifically, we adopted an approach similar to that of Chen and Yao (2016).
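The left, right, and two-sided truncation described above can be illustrated with an inverse-CDF draw. This is a minimal sketch with hypothetical names, not the procedure of Chen and Yao (2016): it assumes the truncation bounds on the error are already known, for example a bound on utility minus the deterministic part of utility.

```python
import math
import random

def gumbel_cdf(x):
    """CDF of a Type I EV(0,1) variable, F(x) = exp(-exp(-x))."""
    if x < -30.0:  # CDF is numerically zero here; the guard also avoids overflow
        return 0.0
    return math.exp(-math.exp(-x))

def draw_truncated_gumbel(lower=-math.inf, upper=math.inf, rng=random):
    """Draw e from Type I EV(0,1) restricted to (lower, upper) by inverting the CDF."""
    p_lo, p_hi = gumbel_cdf(lower), gumbel_cdf(upper)
    p = p_lo + (p_hi - p_lo) * rng.random()      # uniform on [F(lower), F(upper))
    p = min(max(p, 1e-300), 1.0 - 1e-16)         # keep both logarithms finite
    return -math.log(-math.log(p))

rng = random.Random(0)
right_truncated = draw_truncated_gumbel(upper=0.3, rng=rng)        # clicked, not purchased
left_truncated = draw_truncated_gumbel(lower=1.2, rng=rng)         # purchased at final click
both_sides = draw_truncated_gumbel(lower=0.5, upper=2.0, rng=rng)  # purchased, not final click
```

Drawing directly from the truncated range avoids discarding draws, which matters when the admissible region has small probability.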
The step-by-step implementation of our simulation method can be summarized as follows:

i) Conditional on a given set of other parameters in our model, for each consumer $i$ and each product $j$ in the consideration set, draw 200 $e_{i,j}$ from the Type I EV distribution, depending on the three truncation optimality conditions;
ii) Count the frequency of the three optimality conditions (D2)-(D4) being satisfied across the 200 random draws of the $e_{i,j}$'s;
iii) Iterate the above two steps 100 times, repeatedly making new draws of other parameters during each round of iteration;
iv) Average the simulated frequencies from step ii) across the 100 iterations to calculate the final simulated purchase probability of consumer $i$.
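Steps i) and ii) can be sketched as a crude frequency simulator. This is an assumption-laden illustration, not the authors' estimator: it draws untruncated Type I EV errors and counts how often conditions (D2)-(D4) hold, takes the deterministic utilities and reservation utilities as given inputs listed in click order, and omits the outer loop over parameter draws in steps iii) and iv). All names are hypothetical.

```python
import math
import random

def draw_gumbel(rng):
    """One draw from Type I EV(0,1) by inverting the CDF."""
    p = min(max(rng.random(), 1e-300), 1.0 - 1e-16)
    return -math.log(-math.log(p))

def conditions_hold(u, z_clicked, z_unclicked, bought):
    """Check (D2)-(D4) for utilities u of the clicked products, listed in click order."""
    n = len(u)
    for k in range(n - 1):                       # (D2): search continued after click k
        if not u[k] < min(z_clicked[k + 1:]):
            return False
    if z_unclicked and not u[bought] > max(z_unclicked):   # (D3): search stopped
        return False
    return all(u[bought] > u[k] for k in range(n) if k != bought)  # (D4): best choice

def simulated_frequency(v_clicked, z_clicked, z_unclicked, bought, draws=200, seed=0):
    """Share of error draws for which the three optimality conditions are satisfied."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        u = [v + draw_gumbel(rng) for v in v_clicked]
        hits += conditions_hold(u, z_clicked, z_unclicked, bought)
    return hits / draws
```

For example, `simulated_frequency([0.5, 1.0], [3.0, 2.5], [0.2], bought=1)` returns the share of 200 draws consistent with clicking two hotels in that order and booking the second.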

However, this approach can be computationally expensive. As a robustness check, we also tried an alternative method with a kernel-smoothed frequency simulator, which was proposed by McFadden (1989) and suggested by Honka (2014). In this approach, we smoothed the probabilities using a multivariate scaled logistic CDF (Gumbel 1961) with all the scaling factors equal to 15. 19

Notice that due to data limitations, Honka (2014) considers only the composition of the click-generated choice set, but not the sequence of clicks, as the optimality condition, whereas Chen and Yao (2016) observe the sequence of the search process, which allows them to consider both the composition and the sequence of the click-generated choice set. In our study, similarly to Chen and Yao (2016), we observe both types of information. Therefore, we are able to account for the additional optimality condition regarding the sequence of consumer clicks compared to Honka (2014).

19 For more details on the kernel-smoothed frequency simulator, we refer interested readers to Online Appendix B in Honka (2014). The main idea of this simulator is to calculate the smoothed conditional purchase probability using a multivariate scaled logistic CDF (Gumbel 1961). In our estimation, we set all the scaling factors equal to 15, as suggested by Honka (2014). We have also tried other values for the scaling factors and found that our estimation results stay qualitatively consistent.
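The smoothing idea can be sketched in a few lines. This is a hedged illustration: it assumes the multivariate scaled logistic CDF takes the commonly used form 1 / (1 + sum_k exp(-s * w_k)), where each w_k is the slack in one optimality condition (positive when the condition holds) and s is the common scaling factor; the exact expression used in the estimation is given in Honka (2014), Online Appendix B, and the function name here is hypothetical.

```python
import math

def smoothed_indicator(slacks, scale=15.0):
    """Smooth stand-in for the 0/1 event 'every slack is positive'.

    slacks: one number per optimality condition, e.g. (min later z) - u for (D2),
            u_bought - z for (D3), and u_bought - u_other for (D4).
    scale:  common scaling factor; larger values approach the hard indicator.
    """
    total = 0.0
    for w in slacks:
        total += math.exp(min(-scale * w, 700.0))  # cap the exponent to avoid overflow
    return 1.0 / (1.0 + total)
```

Averaging this quantity over error draws, instead of a hard 0/1 count, yields a simulated probability that varies smoothly with the model parameters, which helps gradient-based optimizers.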

Online Appendix E. Using Topic Modeling to Generate the Topic Entropy Score for Each Hotel

In addition to review textual readability and subjectivity, we also extracted an additional cognitive cost indicator based on the topic complexity of the customer reviews. In particular, building on prior literature (Gong et al. 2016), we analyzed the entropy value for the distribution of topics extracted from all customer reviews for each hotel. This topic entropy measures the diversity of topics covered by the customer reviews for each hotel. Prior literature suggests that the diversity in search results affects consumer search behavior (e.g., Weitzman 1979, Dellaert and Haubl 2012). In addition, consumer psychology theories suggest that as the information becomes noisier, users are more likely to abandon their search (e.g., Jacoby et al. 1974; Dhar and Simonson 2003), because users tend to get overwhelmed and discouraged by the complexity of information, and therefore lose their interest or trust in the search results. Therefore, we derived a Topic Entropy score using probabilistic topic models from machine learning and natural language processing to capture the noisiness of information provided by the customer reviews.

Topic models are unsupervised algorithms that aim to extract hidden topics from unstructured text data. The intuition behind topic models is that a topic is a cluster of words that frequently occur together, and that documents, consisting of words, may belong to multiple topics with different probabilities. A probabilistic topic model tries to discover the underlying topic structure in a statistical framework. In particular, we measure the topic complexity of reviews for each product by estimating a topic model using the Latent Dirichlet Allocation model (LDA; Blei et al. 2003), and subsequently computing the entropy (i.e., diversity) of the topic distribution of reviews for that product. We discuss below the details of how we use topic modeling to generate the Topic Entropy score for each hotel.

1. Corpus Construction and Document Pre-processing.

We first construct a corpus of documents that describe the information content conveyed by the hotel reviews. In particular, we collect all the customer reviews for each hotel. Hence, each hotel is associated with a review document, and all documents together construct the overall corpus. After constructing the corpus of review documents, we pre-process the documents following a standard procedure (e.g., Aral et al. 2011, Gong et al. 2016). We first remove annotations and tokenize the sentences into distinct terms. Then we remove stop words using a standard dictionary.

2. Latent Dirichlet Allocation (LDA).

We use topic models to automatically infer semantic interpretations of keyword meanings. The most widely used topic model is the Latent Dirichlet Allocation model (LDA; Blei et al. 2003), which is a hierarchical Bayesian model that describes a generative process of document creation. Previous research shows that humans tend to agree with the coherence of the topics generated by LDA, which provides strong support for the use of topic models for information retrieval applications.

The goal of LDA is to infer topics as latent variables from the observed distribution of words in each document. In particular, a topic is defined as a multinomial distribution over a vocabulary of words, a document is a collection of words drawn from one or more topics, and a corpus is the set of all documents. Based on the discussion above, we construct a document for each hotel that best reflects the information of the hotel reviews. We now discuss how we use LDA to infer the topics from the corpus of documents.

Formally, let $T$ be the number of topics related to the corpus, let $D$ be the number of documents in the corpus, and let $W$ be the total number of words in the corpus. We assume that each document in the corpus is generated according to the following process:

Step 1. For each topic $t$, choose $\phi_t \sim \text{Dirichlet}(\beta)$, where $\phi_t$ describes the word distribution of topic $t$ over the vocabulary of words.
Step 2. For each document $d$, choose $\theta_d \sim \text{Dirichlet}(\alpha)$, where $\theta_{d,t}$ is the probability of topic $t$ to which document $d$ belongs.
Step 3. For each word $n$ in document $d$, (1) choose a topic $z_{d,n} \sim \text{Multinomial}(\theta_d)$, and (2) choose a word $w_{d,n} \sim \text{Multinomial}(\phi_{z_{d,n}})$.

$\alpha$ and $\beta$ are hyper-parameters for the two prior distributions: $\beta$ for the prior distribution of $\phi$ (word distribution in a topic) and $\alpha$ for the prior distribution of $\theta$ (topic distribution in a document). We use the values suggested by Steyvers and Griffiths (2007) ($\beta = 0.01$ and $\alpha = 50/T$). Based on the generative process described above, we use a Markov chain Monte Carlo (MCMC) algorithm to estimate $\phi$ and $\theta$. Specifically, we use a collapsed Gibbs sampler to sequentially sample the topic of each word token in the corpus conditional on the current topic assignments of all other word tokens. We run a collapsed Gibbs sampler using MALLET (McCallum 2002) with 2,000 iterations. For each hotel, we obtain the posterior topic probabilities inferred from its corresponding document of customer reviews. In our study, we estimate the LDA model with different numbers of topics, T = 20, 50, and 100.

3. Topic Entropy as a Measure for Keyword Ambiguity.

We propose using Topic Entropy to measure the complexity of hotel reviews.
It captures the uncertainty of a document's topic distribution. In information theory, entropy measures the unpredictability of a random variable. In our context, each hotel is associated with its own review topic distribution inferred from the hotel-specific document. Therefore, we treat the topic assignment as a multinomial random variable, and use topic entropy to quantify how noisy the customer reviews for a hotel are in terms of underlying topics. The higher the entropy is, the more complex or noisier the reviews for that hotel. In other words, hotels with higher Topic Entropy tend to relate to a broader range of topics (more complex), whereas hotels with lower Topic Entropy tend to relate to fewer dominant topics (less complex). More formally, let $p_{k,t}$ denote the posterior probability that hotel $k$ belongs to topic $t$. We therefore define the topic entropy of hotel $k$ as follows:

$$\text{Entropy}_k = -\sum_{t=1}^{T} p_{k,t} \log p_{k,t}, \tag{E1}$$

where $T$ is the total number of topics. We present the summary statistics for the estimated Topic Entropy in Table E1. As we can see, the maximum entropy value depends on the number of topics chosen. A simple calculation also shows that with $T$ topics, entropy ranges from 0 to $\ln(T)$. 20 The high correlations among entropy values derived from different numbers of topics also suggest that entropy is fairly robust to the number of topics specified in the LDA model.

Table E1. Summary Statistics of Topic Entropy
Columns: Mean; Std. Dev.; Min.; Max.; Correlation with 20 Topics; Correlation with 50 Topics
Rows: 20 Topics; 50 Topics; 100 Topics

In our model estimation (i.e., Robustness Test 2) and policy experiment, we used the Topic Entropy values derived based on 20 topics. We illustrate the distribution of the Topic Entropy in Figure E1 based on 20 topics (T=20). We also tried using 50 topics and 100 topics, and the results are qualitatively consistent.

Figure E1. Distribution of Topic Entropy (T=20)

References:
Steyvers, M., and Griffiths, T. 2007. Probabilistic Topic Models. Handbook of Latent Semantic Analysis (427:7).
McCallum, A. K. 2002. MALLET: A Machine Learning for Language Toolkit.

20 The number of topics $T$ is pre-specified before estimating the LDA model. Entropy for hotel $k$ is the smallest when there exists $t \in \{1, \ldots, T\}$ such that $p_{k,t} = 1$; entropy is the largest when, for all $t \in \{1, \ldots, T\}$, $p_{k,t} = 1/T$.
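The entropy in Equation (E1) and the bounds noted in footnote 20 can be checked with a short sketch. It assumes the posterior topic probabilities for a hotel are available as a plain list (for instance, exported from a topic-model run); the function name and the renormalization step are illustrative choices, not part of the original procedure.

```python
import math

def topic_entropy(topic_probs):
    """Entropy of one hotel's topic distribution: -sum_t p_t * log(p_t).

    Probabilities are renormalized to sum to one, and zero entries contribute
    nothing, following the convention 0 * log(0) = 0.
    """
    total = sum(topic_probs)
    probs = [p / total for p in topic_probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

T = 20
uniform = [1.0 / T] * T                # reviews spread evenly over all topics
single = [1.0] + [0.0] * (T - 1)       # reviews concentrated on one topic
skewed = [0.81] + [0.01] * (T - 1)     # one dominant topic plus background noise
```

The uniform distribution attains the upper bound ln(T), the single-topic distribution attains zero, and the skewed case falls strictly in between.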