Airline Itinerary Choice Modeling Using Machine Learning


A. Lhéritier a, M. Bocamazo a, T. Delahaye a, R. Acuna-Agost a

a Innovation and Research Division, Amadeus S.A.S., 485 Route du Pin Montard, Sophia Antipolis Cedex, France

Abstract

Understanding how customers choose between different itineraries when searching for flights is very important for the travel industry. This knowledge can help travel providers, either airlines or travel agents, better adapt their offer to market conditions and customer needs. This is particularly important for pricing and for ranking the suggestions shown to travelers searching for flights. The problem has historically been handled using Multinomial Logit (MNL) models. While MNL models offer the dual advantages of simplicity and readability, they lack the flexibility to deal with correlations between alternatives or non-linearity in the effect of alternative attributes. In this work, we present an alternative modeling approach based on machine learning techniques that address the main weaknesses of MNL models for this particular application. These methods are also better adapted to applications in the travel industry thanks to robust data-parallelized implementations working on Big Data platforms. We test these models on a dataset consisting of flight searches and bookings on European markets. The results show the machine learning models consistently outperforming MNL on prediction accuracy, log likelihood, and market share estimation.

Keywords: multinomial logit model, machine learning, decision tree, random forest, gradient boosting machines

Email addresses: alix.lheritier@amadeus.com (A. Lhéritier), mrbocamazo@gmail.com (M. Bocamazo), delahaye.thierry@gmail.com (T. Delahaye), rodrigo.acunaagost@amadeus.com (R. Acuna-Agost)

Preprint submitted to Elsevier, March 22, 2017

1. Introduction

This paper deals with the airline itinerary choice problem. Consider, for example, a customer searching for flights from London to New York, traveling next week on Tuesday and coming back on Saturday. This search request is processed by a travel provider (e.g., an online travel agent) that proposes between 1 and 50 different alternatives (itineraries) to the customer. The itineraries have different attributes, among others: number of stops, total trip duration, and price. The question is: which one is (probably) going to be selected by the customer?

There is a growing interest within the travel industry in better understanding how customers choose between different itineraries when searching for flights. Such an understanding can help travel providers, either airlines or travel agents, better adapt their offer to market conditions and customer needs, thus increasing their revenue. This can be used for filtering alternatives, sorting them, or even changing some attributes in real time (e.g., changing the price).

The field of customer choice modeling is dominated by traditional statistical approaches, such as the Multinomial Logit (MNL) model, that are linear with respect to features and tightly bound to their assumptions about the distribution of error. While these models offer the dual advantages of simplicity and readability, they lack the flexibility to handle correlations between alternatives or non-linearity in the effect of alternative attributes. Another strong limitation is their inability to model different behavior according to individual-specific variables. A large part of the existing modeling work focuses on adapting these modeled distributions so that they can match observed behavior. Nested (NL) and Cross-Nested (CNL) Logit models are good examples of this: they add terms for highly specific feature interactions, so that substitution patterns between sub-groups of alternatives can be captured.
In this work, we present an alternative modeling approach based on machine learning techniques. The selected machine learning methods can model non-linear relationships between feature values and the target class, allow collinear features, and have more modeling flexibility to automatically learn implicit customer segments. In particular, we have chosen to work with machine learning methods based on ensembles of decision trees, namely Random Forests (RF) [1] and tree-based Gradient Boosting Machines (GBM) [2]. In fact, decision trees are well adapted to our problem, as model bifurcations (branches) are well disposed to automatically partition the customers into segments and, at the same time, capture non-linear relationships within attributes of alternatives and characteristics of the decision maker, if this has a positive impact on prediction accuracy.

Indeed, there are two main segments to be taken into account for our particular problem: business and leisure air passengers behave very differently when it comes to booking flights. The business passenger tends to favor alternatives with convenient schedules, such as shorter connection times and preferred departure times, while leisure passengers are very price sensitive, which means that they can accept a longer connection time if this is reflected in a lower ticket price. The problem is that the segment is not explicitly known when the customer is searching; however, it can be derived by combining different factors. For example, industry experts know that business passengers tend to book with less anticipation and are not predisposed to stay over Saturday nights. However, these are not black-or-white rules, which reinforces the need for a model able to detect such rules from the data and actual customer behavior. Another observed advantage is that RF and GBM are fairly quick to train and very quick to predict, which enables fast iteration.

The rest of this paper is organized as follows: Section 2 describes the discrete choice modeling background. Section 3 introduces the proposed modeling approach based on machine learning. In Section 4, we present numerical experiments and their results. Finally, Section 5 draws the main conclusions and presents some perspectives of this work.

2. Discrete choice modeling

In this section we introduce the basic setting of discrete choice modeling and the classical MNL approach considered in our experiments as the baseline methodology.

2.1. Setting

Our choice situations are structured in air shopping sessions. For each session i, we know the basic search request information provided by the passenger (or decision maker / individual): the origin and destination cities and the departure/arrival dates of the travel. The individual considers a choice set of n_i flight alternatives from which she must choose exactly one. It should be noted that only sessions finishing with a booking (sale) were considered in the experiments, as it was not possible to collect data on sessions with non-purchase choices. We denote by C_i the random variable representing the index of the chosen flight itinerary alternative. The choice set therefore verifies the three basic conditions of any discrete choice modeling problem: it is a) mutually exclusive, b) exhaustive, and c) composed of a finite number of alternatives.

In each session, two kinds of feature vectors are considered:¹ x_{i0} ∈ R^{d_0}, characterizing the decision maker (also called characteristics), and x_{ij} ∈ R^{d_1}, characterizing the alternatives (also called attributes). We denote the vector of all the features that can be considered for alternative j of session i as a_{ij} ≜ (x_{i0}, x_{ij}).

2.2. Multinomial Logit Model

The Multinomial Logit Model (MNL) is derived under the assumption that a decision maker chooses the alternative that maximizes his utility U. In general it is not possible to know the decision maker's utility nor its form. However, one can determine some attributes of the alternatives as faced by

¹ In practice, there can be features that characterize both the decision maker and the alternatives. For example, the time before departure characterizes, at a day/week scale, what the user looks for but, at an hour scale, it can differentiate the alternatives.
See Section 4.5 for more details.

the passenger, in addition to some characteristics of the decision maker. In general we consider:

- alternative-specific features x_{ij} with generic coefficients β;
- individual-specific features x_{i0} with alternative-specific coefficients γ_j;
- alternative-specific features x_{ij} with alternative-specific coefficients δ_j;
- alternative-specific constants α_j.

We can now specify a model, or function, that relates these observed factors to the unknown individual's utility. This function is often called the representative utility, and is defined as:

V_{ij} = α_j + β x_{ij} + γ_j x_{i0} + δ_j x_{ij}    (1)

At this point, it should be noted that there are aspects of utility that cannot be observed nor derived from the available data, therefore V_{ij} ≠ U_{ij}. The utility can thus be decomposed as:

U_{ij} = V_{ij} + ε_{ij}    (2)

where ε_{ij} encapsulates all the factors that impact utility but are not considered in V_{ij}. It should be noted that we do not know ε_{ij}, which is why these terms are treated in the literature as random. In particular, we assume the ε_{ij} are i.i.d. with an extreme value distribution. Under these assumptions, the resulting probability of choosing alternative a_{ij} is given by [3]:

P(C_i = j) = e^{V_{ij}} / Σ_{j′} e^{V_{ij′}}    (3)

For our particular application, alternatives vary drastically from one session to another; for this reason, alternative-specific coefficients and constants were ruled out (i.e., α_j = γ_j = δ_j = 0 ∀j).
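As a concrete illustration, the choice probabilities of Eq. (3) for a single session reduce to a softmax over linear utilities. The following sketch uses made-up feature values and coefficients, not values fitted on the paper's data:

```python
import numpy as np

def mnl_choice_probabilities(X, beta):
    """Eq. (3): softmax choice probabilities for one session.

    X    : (n_alternatives, d) matrix of alternative-specific features x_ij
    beta : (d,) vector of generic coefficients
    """
    v = X @ beta                # representative utilities V_ij
    v = v - v.max()             # stabilize the exponentials
    e = np.exp(v)
    return e / e.sum()

# Hypothetical session: 3 itineraries described by (log price, n stops).
X = np.array([[5.0, 0.0],
              [4.6, 1.0],
              [4.8, 1.0]])
beta = np.array([-1.0, -0.5])   # illustrative coefficients, not estimated
p = mnl_choice_probabilities(X, beta)
```

Note that subtracting the maximum utility before exponentiating leaves the probabilities unchanged while avoiding overflow, a standard trick for softmax computations.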

3. Machine Learning Approach

In this section, we describe how we apply supervised learning to the choice modeling problem and give a brief description of the considered methods.

3.1. Supervised Learning

In supervised learning, each sample point consists of a set of features and a target, denoted x_i and y_i respectively. The goal is, after seeing a set of sample points, to predict new unseen targets given their associated features. In the soft classification problem, the targets are discrete (and usually called labels) and the goal is to estimate their conditional probability distribution P(Y | X) while trying to minimize some measure of the expected error when predicting a new unseen label. In the regression problem, the outputs are continuous and the goal is to estimate their conditional expected value E(Y | X). More precisely, we assume that the sample points are realizations of a pair of random variables (X, Y), X ∈ Ω, Y ∈ A, where Ω is some feature space (e.g., Ω ⊆ R^d) and A is some alphabet (e.g., A = {0, 1} for binary classification and A = R for a regression problem).

A central problem in the estimation of these quantities is the bias-variance trade-off (see [4] for a thorough presentation). The bias is the error generated by erroneous assumptions in the model. High bias can cause underfitting, i.e., missing relevant relations between the features X and the targets Y. The variance is the error generated by the sensitivity of the model to small fluctuations in the training data. High variance can cause overfitting, i.e., capturing noise or relations that do not generalize. Given the true model and infinite data to estimate it, it is possible to reduce both the bias and variance terms to 0.
However, in real life, we deal with imperfect models and finite data, and there is a trade-off between minimizing the bias and minimizing the variance.

3.2. Choice modeling as a soft classification supervised learning problem

Our approach to choice modeling is to consider each alternative independently and treat it as a soft classification problem, i.e., given its set of feature values, predict the probability of its being chosen. That is, X will be instantiated with the vectors a_{ij} and Y with 1_{j = C_i}. Therefore, we train the classification models as if the pairs (X, Y) were independent and identically distributed, which is a strong hypothesis that does not hold in our case, nor in many practical applications. In particular, in every session there is one and only one alternative chosen, and thus the probabilities within a session should add to one. In order to fix this, we normalize the probabilities yielded by the classification model to make them add to one. More precisely, given a soft classifier P̂_ML(Y | X), we define a model that assigns the following probability of being chosen to alternative j of session i:

P̂(C_i = j) ≜ P̂_ML(Y = 1 | X = a_{ij}) / Σ_{j′} P̂_ML(Y = 1 | X = a_{ij′})    (4)

In contrast to MNL, where the main interest is providing understandable insights about the factors driving the choices, our focus is on different performance metrics measured on a hold-out dataset. This is the case in many successful machine learning applications where the theory is not completely understood: their good empirical performance fully justifies them.

3.3. Decision trees

Let us first review our basic building block. A decision tree is a tree where each node represents a local estimate of the quantity of interest (e.g., the conditional distribution for soft classification) based on a subset of the sample points. The subsets are recursively split by applying a linear threshold to a single variable, chosen to optimize the change in some impurity measure in the resulting splits of the data. The impurity is meant to measure the homogeneity of the target variable within the subsets. Different quantities can be used as impurity; in the case of soft classification, we use the entropy of the conditional distribution. Different criteria can be used to stop splitting nodes: building full trees, setting a maximum tree depth, or setting a minimum number of points per leaf. In our case, we set a maximum tree depth, which also controls the computational complexity. In order to predict the target of a new sample point, its features are used to go down the tree according to the linear thresholds; when a leaf is reached, the local estimate is used to predict the target. If the leaves contain too few training points, the model will suffer from statistical scarcity, leading to high variance.
If the tree is not deep enough, the model will suffer from high bias. Next, we present some strategies to find a good balance between bias and variance when using decision trees.
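Whichever tree model (or ensemble, below) provides the soft classifier, its per-alternative outputs must still be renormalized within each session as in Eq. (4). A minimal sketch, with made-up classifier scores:

```python
import numpy as np

def renormalize_per_session(p_ml, session_ids):
    """Eq. (4): turn independent per-alternative scores P_ML(Y=1 | X=a_ij)
    into per-session choice probabilities that sum to one."""
    p = np.asarray(p_ml, dtype=float)
    out = np.empty_like(p)
    for s in np.unique(session_ids):
        mask = session_ids == s
        out[mask] = p[mask] / p[mask].sum()
    return out

# Scores from a hypothetical soft classifier, for two sessions of
# sizes 3 and 2 respectively.
scores   = np.array([0.2, 0.1, 0.1, 0.6, 0.2])
sessions = np.array([0,   0,   0,   1,   1])
probs = renormalize_per_session(scores, sessions)
```

After the renormalization, each session's probabilities form a proper choice distribution, which is what the likelihood and market-share metrics of Section 4 consume.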

3.4. Random forests

Random forests (RF) [1] follow the idea of bagging (for "bootstrap aggregating"), which consists in averaging the predictions of many noisy but approximately unbiased models, built over a collection of bootstrap samples, in order to reduce the variance. In particular, RF build an ensemble of decision trees by using two elements of randomness. The first one consists in using a bootstrap sample for each tree, obtained by sampling with replacement from the training data. The idea is to build multiple weak trees (each one capturing only a part of the information) that vote together to give the resulting prediction; in our case, voting is done by averaging their predictions. The second element of randomness is introduced when splitting each node: only a random subset of the features (usually of size √#features) is considered to optimize the split. If the trees are grown sufficiently deep, they have relatively low bias and, since they are identically distributed, so does the ensemble. Thus, the error is reduced by reducing the variance as trees are aggregated. See, e.g., [4, Chap. 15] for more details.

An interesting outcome of RF building is feature importance. As a matter of fact, some features are more informative than others, or have splits in the ensemble that reduce the impurity more. These reductions in impurity, weighted by the number of samples passing through each node, are summed over the forest and divided by the total reduction in impurity. This yields the share of each feature in informativeness, which is one way of understanding feature importance. As a final remark, due to the way predictions are computed, it is possible that RF assign a null probability to some label value, which is harmful for metrics like log likelihood if it happens in the test set. This is not the case for the next method we consider.

3.5. Gradient boosting machines

An alternative to bagging is boosting.
At first sight, boosting is similar to bagging in the sense that it aggregates many weak classifiers, but it is fundamentally different. The idea of boosting is to sequentially train boosting functions that correspond to weak classifiers. Let us denote the t-th boosting function by f_t(x). These functions are additively combined, i.e.:

ŷ_t(x) = ŷ_{t−1}(x) + ν f_t(x)    (5)

where 0 < ν ≤ 1 is the learning rate, i.e., a shrinkage parameter that controls overfitting. Gradient Boosting Machines² (GBM) perform a gradient-descent-based optimization by building these boosting functions from regression trees. The t-th regression tree learns the gradient of a loss function L, evaluated on ŷ_{t−1}(x) applied to the training data, in order to add a step in the direction of the negative gradient with a magnitude that is optimized. If we consider the deviance loss

−2 Σ_i log P̂(Y = y_i | X = x_i),    (6)

then the t-th regression tree for the symbol k is trained on the following gradients:

g_{ikt} ≜ 1_{y_i = k} − P̂_GBM^{t−1}(Y = k | X = x_i).    (7)

In fact, the algorithm works for discrete alphabets of any cardinality: one boosting function f_{kt}(x) is built per alphabet symbol k ∈ A at round t. Then, after t rounds, the probability assignment is obtained using the softmax transformation, as in MNL:

P̂_GBM^t(Y = k | X = x) ≜ e^{ŷ_{kt}(x)} / Σ_{k′∈A} e^{ŷ_{k′t}(x)}.    (8)

A stochastic GBM [5] improves generalization by sampling columns (per split) and rows (per tree) during the model building process. For more details, see [4, Chap. 10, Algorithms 10.3 & 10.4].

² In our case, the machines are trees.
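A minimal numpy sketch of the softmax assignment (Eq. 8) and the pseudo-residuals (Eq. 7), with toy scores and labels; this illustrates the quantities involved, not the H2O implementation:

```python
import numpy as np

def softmax_probs(scores):
    """Eq. (8): per-class probabilities from boosting scores yhat_kt(x);
    scores has shape (n_samples, n_classes)."""
    z = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def deviance_gradients(scores, y):
    """Eq. (7): pseudo-residuals 1{y_i = k} - P_GBM(Y=k | X=x_i), i.e. the
    negative gradient of the deviance loss (Eq. 6); the t-th regression
    tree for class k would be fit to column k of this array."""
    p = softmax_probs(scores)
    onehot = np.eye(scores.shape[1])[y]
    return onehot - p

# One toy round: scores start at zero (uniform probabilities), so the
# gradients are +/- 0.5 for a binary alphabet.
scores = np.zeros((4, 2))
y = np.array([0, 1, 1, 0])
g = deviance_gradients(scores, y)
```

Each boosting round would fit one regression tree per class to these gradients and add it with the learning rate ν as in Eq. (5).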

4. Experimental results

4.1. Data

We trained and tested our models on a dataset consisting of flight searches and bookings on a set of European origin/destination markets and airlines, extracted from GDS (global distribution system) logs. The choice set consists of the results of a flight search request. Each search session includes between 1 and 50 different itineraries, one of which has been booked by the customer. In total, there are choice situations (i.e., sessions) in the dataset, which are divided into training and test sets containing and 6791 sessions respectively.

The available features are presented in Table 1: there are numerical and categorical ones, such as ticket price, number of connecting flights, and origin/destination. For MNL, departure and arrival times were decomposed into a sum of sine/cosine functions (see [6, Chap. 7]); for ML, they were simply represented as the number of seconds from midnight. For ML, we made some of the features relative to the minimum in their session, since absolute values make generalization more difficult for partition-based methods. For MNL, categorical variables were dummy encoded and numerical features were standardized³ to allow assessing their relative importance by looking at their coefficients.

4.2. Metrics

We compare MNL vs. machine learning methods using the following metrics on the test set consisting of m sessions, the i-th session containing n_i alternatives:

TOP-N accuracy: we rank the alternatives decreasingly by assigned probability (ties are broken randomly), and we consider a prediction as TOP-N accurate if the rank of the chosen alternative is less than or equal to N.

³ By subtracting the mean and dividing by two times the standard deviation of the original feature, as suggested in [7], to make them comparable to binary features' coefficients.
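The feature transformations described in Section 4.1 — session-relative ratios for ML, sin/cos time encodings for MNL, and two-standard-deviation scaling — can be sketched as follows. The helper names and values are ours, for illustration only:

```python
import numpy as np

def price_over_cheapest(price, session_ids):
    """Session-relative feature, e.g. 'Price / Cheapest Price':
    divide each alternative's value by the minimum in its session."""
    out = np.empty_like(price, dtype=float)
    for s in np.unique(session_ids):
        m = session_ids == s
        out[m] = price[m] / price[m].min()
    return out

def cyclic_time_features(seconds, k):
    """MNL encoding of a time of day: the sin/cos pair of the
    2*pi*k harmonic (k = 1, 2, 3, cf. Table 1)."""
    theta = 2.0 * np.pi * k * seconds / 86400.0
    return np.sin(theta), np.cos(theta)

def two_sd_standardize(x):
    """Standardize by subtracting the mean and dividing by two standard
    deviations, as suggested in [7], so numerical coefficients are
    comparable to binary ones."""
    return (x - x.mean()) / (2.0 * x.std())

# Two hypothetical sessions of two alternatives each.
price    = np.array([100.0, 120.0, 100.0, 150.0])
sessions = np.array([0, 0, 1, 1])
rel = price_over_cheapest(price, sessions)
```

The session-relative form is what allows partition-based models to reuse the same split thresholds across markets with very different absolute price levels.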

Table 1: Features: marked with , alternative specific features (used in experiments 1, 2 and 3) and, marked with , business/leisure predictors (used in experiment 2 and, for ML, in 3 also). The rest of the features are used in experiment 3 only. For more details, see Sections 4.4, 4.5 and 4.6.

Feature | Type | Range/Card. | Used for
Log(Price) | Num | [4.35, 9.73] | MNL
Price / Cheapest Price | Num | [1, 45.64] | ML
Trip duration (minutes) | Num | [105, 4314] | MNL
Trip duration / shortest trip duration | Num | [1, 24.65] | ML
Stay duration (minutes) | Num | [120, 434000] | MNL
Stay duration / shortest stay duration | Num | [1, 1819] | ML
Number of legs (nflights) | Num | [2, 6] | MNL, ML
Number of airlines | Num | [1, 4] | MNL, ML
Contains Low-Cost Carrier (LCC) | Bin | {0, 1} | MNL, ML
{cos,sin}{2π,4π,6π} Outbound departure time | Num | [-1, 1] | MNL
{cos,sin}{2π,4π,6π} Outbound arrival time | Num | [-1, 1] | MNL
Outbound departure time | Num | [0, 86400] | ML
Outbound arrival time | Num | [0, 86400] | ML
Stay duration (minutes) (median) | Num | [120, 434000] | ML, B/L classif.
Stay Saturday (median) | Bin | {0, 1} | ML, B/L classif.
Days to departure (DTD) (median) | Num | [0, 343] | ML, B/L classif.
Origin/Destination (OD) | Cat | 97 | ML
Continental Trip | Bin | {0, 1} | ML
Domestic Trip | Bin | {0, 1} | ML
Departure weekday | Cat | 7 | MNL, ML

Normalized log likelihood: let c_i be the index of the chosen alternative of the i-th session. The normalized log likelihood is

(1/m) Σ_{i=1}^m log P̂(C_i = c_i).    (9)

Fractional market shares: for a given feature f with discrete values V_f, the true market share is an empirical distribution s over V_f such that, for a given value v ∈ V_f:

s(v) ≜ Σ_i 1_{f(a_{ic_i}) = v} / #sessions    (10)

where f(a) denotes the value of the feature f for the alternative a and 1 denotes the indicator function. The estimated market share given by a predictor P̂ is defined as:

ŝ(v) ≜ Σ_{ij} 1_{f(a_{ij}) = v} P̂(C_i = j) / #sessions    (11)

Finally, we compare them by computing the sum of absolute errors:

SAE_f(P̂) ≜ Σ_{v∈V_f} |ŝ(v) − s(v)|    (12)

In particular, we are interested in the airline market share. Notice that we do not use the airline as a training feature, since we want to test the ability of the models to generalize, i.e., we want these models to be applicable to new airlines and to predict their market share.

4.3. Methods

We consider the following methods and corresponding implementations:

MNL: we used the Larch open toolbox.⁴ This implementation has been shown to be faster than traditional commercial and academic MNL software [8].

⁴ Available at

RF: we used Distributed Random Forests from the H2O library.⁵ An interesting feature of this library is its ability to run on a simple personal computer as well as on popular Big Data platforms such as Hadoop or Spark clusters, which makes it suitable for industrial big data applications without additional effort.

GBM: we used the H2O implementation as well [9].

Uniform: as a reference, we consider a uniform probability assignment, i.e., P̂_unif(C_i = j) = 1/n_i.

In order to optimize the parameters of the ML methods, we use 10-fold cross-validation. This means that, for each combination of parameter values, 10 models are trained, one per fold, and assessed on the corresponding validation set. The best combination of parameter values, in terms of average log likelihood, is used to train the final model on the full training set.

RF are meant to be used with deep (overfitted) decision trees. Nevertheless, in practice, better results can be obtained by tuning the maximum depth of the trees. We tuned the maximum tree depth by trying the values {3, 5, 7, 9, 11, 13, 15, 17, 19, 21}. The number of trees can go as high as desired, but for computational reasons it is important to find a number that is not too large yet good enough. Scoring is performed every 10 trees on each validation set in terms of log likelihood, and a moving average of window size 2 is computed. A maximum number of trees of is specified, but our training procedure stops if the relative improvement of the moving average is less than for two scoring rounds. For the other parameters, we used their default values.

GBM are based on weak (underfitted) models, i.e., shallow trees. A too large number of trees can make the model overfit. This parameter was tuned in the same way as for RF.
Using a Cartesian grid search, we tuned the following parameters:

- learning rate: {0.01, 0.1}
- maximum tree depth: {3, 5, 7, 9, 11}

The other parameters were used with their default values, except for the sample rate, where we followed Friedman's advice (stochastic gradient boosting [5]) by setting it to

⁵ Available at
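Before turning to the experiments, the evaluation metrics of Section 4.2 can be sketched as follows. A session is represented here as a (probabilities, chosen index) pair, and ties are broken deterministically rather than randomly, for simplicity:

```python
import numpy as np

def normalized_log_likelihood(sessions):
    """Eq. (9): mean log probability assigned to the chosen alternative;
    sessions is a list of (probs, chosen_index) pairs."""
    return float(np.mean([np.log(p[c]) for p, c in sessions]))

def top_n_accuracy(sessions, n):
    """TOP-N accuracy: share of sessions whose chosen alternative is
    among the n highest assigned probabilities."""
    hits = 0
    for p, c in sessions:
        order = np.argsort(-np.asarray(p))   # best-first ranking
        hits += c in order[:n]
    return hits / len(sessions)

def sum_absolute_errors(true_share, est_share):
    """Eq. (12): SAE between two market-share distributions, given as
    dicts mapping a feature value to its share."""
    keys = set(true_share) | set(est_share)
    return sum(abs(true_share.get(k, 0.0) - est_share.get(k, 0.0))
               for k in keys)

# Two hypothetical sessions: predicted probabilities and the chosen index.
sessions = [(np.array([0.7, 0.2, 0.1]), 0),
            (np.array([0.5, 0.3, 0.2]), 1)]
top1 = top_n_accuracy(sessions, 1)
nll  = normalized_log_likelihood(sessions)
err  = sum_absolute_errors({'AA': 0.6, 'BB': 0.4}, {'AA': 0.5, 'BB': 0.5})
```

In this toy example the chosen alternative is ranked first in one of the two sessions, so TOP-1 accuracy is 0.5 while TOP-2 accuracy is 1.0.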

4.4. Experiment 1: Using only alternative specific features

In this experiment, we use only alternative specific features (see Table 1) to show how the flexibility of tree models allows capturing non-linear dependencies. Notice that we do not include the absolute stay duration, since it helps identify the kind of traveler (see Experiment 2). Table 2 shows the best parameters obtained from the grid search.

Table 2: Best parameters resulting from grid search with alternative specific features only.

    | max depth | # trees | learn rate
RF  |           |         |
GBM |           |         |

Table 3 summarizes the results. We observe that RF outperforms MNL on all the considered metrics, while GBM outperforms it everywhere except on SAE. In terms of TOP-N (N ∈ {1..30}) accuracy, we observe absolute improvements of up to 4.7%.

Table 3: Summary of results using alternative specific features only.

method  | NormLogLikelihood | SAE | TOP-1 | TOP-5 | TOP-15
uniform |                   |     |       |       |
MNL     |                   |     |       |       |
RF      |                   |     |       |       |
GBM     |                   |     |       |       |

In order to get more insight from these models, we analyze the feature importance given by RF (Figure 1) and the magnitude of the significant (p-value < 5%) MNL coefficients (Figure 2). For MNL, the top 3 features correspond to trip duration, price and number of legs, which is expected. RF gives higher importance to stay duration and to departure and arrival times, which suggests that the model is capturing convenience preferences.

Figure 1: RF feature importance using alternative specific features only.

Figure 2: MNL significant (p-value < 5%) coefficient magnitude using alternative specific features only.

4.5. Experiment 2: Adding business/leisure segmentation

As suggested in [10], a shorter time between the booking date and the flight departure (Days to Departure, DTD) yields a higher probability of the customer being business, and therefore a lower sensitivity to price. Since DTD is almost constant within a session, this cannot be straightforwardly exploited by MNL. In this experiment, we exploit attributes that help characterize the type of user, more precisely the purpose of the trip, i.e., business or leisure. We follow the method proposed in [10] to allow the MNL to take advantage of this, by splitting each alternative specific feature into two new ones. More precisely, a feature f is split into f_B and f_L:

f_B ≜ f if the trip is classified as Business, 0 otherwise    (13)

f_L ≜ f if the trip is classified as Leisure, 0 otherwise    (14)

All the features from Experiment 1 were split. For this purpose, a classifier (GBM) was trained on a separate dataset of trips that have been unequivocally identified as business or leisure. This classifier was trained using

only three business/leisure predictors: Stay Saturday, Stay Duration, and Days to Departure. When tested on the survey data used in [10], it yielded an accuracy of 79.4%. Since, within a session, there can be some variation in these features between alternatives, we in fact consider their median value when applying the classification model, in order to get a consistent result. On the other hand, the ML algorithms were trained using the features of Experiment 1 and the three segment-predictor features, but without using the extra dataset or the obtained business/leisure label.⁶ Table 4 shows the best parameters obtained from the grid search.

Table 4: Best parameters resulting from grid search with alternative specific features and segment predictors.

    | max depth | # trees | learn rate
RF  |           |         |
GBM |           |         |

Table 5 summarizes the results. We observe that this splitting helps MNL improve its performance in terms of TOP-N accuracy and likelihood, as expected. In terms of log likelihood, GBM outperforms MNL, with a relative improvement of 5.2%. In terms of TOP-N (N ∈ {1..30}) accuracy, we observe absolute improvements of up to 6.2%.

Table 5: Summary of results using alternative specific features and segment predictors.

method  | NormLogLikelihood | SAE | TOP-1 | TOP-5 | TOP-15
uniform |                   |     |       |       |
MNL     |                   |     |       |       |
RF      |                   |     |       |       |
GBM     |                   |     |       |       |

Finally, we also compare the RF feature importance (Figure 3) with the MNL coefficient magnitude (Figure 4). As expected, we see that the price for a leisure traveler has the highest importance, followed by the trip duration for business travelers. Interestingly, for RF, DTD is now among the most important features, which is consistent with the reported importance of this feature for predicting the kind of trip (see [10]). We observe again the high importance of departure and arrival times.

⁶ For ML, we use the original stayduration and the median value of DTD and SaturdayStay.

Figure 3: RF feature importance using alternative specific features and segment predictors.

4.6. Experiment 3: Using all the features

In this experiment, we compare MNL and ML methods using all the available features but without the business/leisure label, in order to compare them in a scenario without business knowledge about possible segments. Table 6 shows the best parameters obtained from the grid search.

Table 6: Best parameters resulting from grid search with all available features.

    | max depth | # trees | learn rate
RF  |           |         |
GBM |           |         |
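The feature splitting of Eqs. (13)–(14) used in Experiment 2 can be sketched as follows. In practice the business/leisure flags come from the separate GBM trip-purpose classifier applied per session; here they are hard-coded for illustration:

```python
import numpy as np

def split_business_leisure(x, is_business):
    """Eqs. (13)-(14): duplicate a feature into f_B and f_L so that MNL
    can learn segment-specific coefficients. is_business holds the
    per-session output of a trip-purpose classifier."""
    x = np.asarray(x, dtype=float)
    is_business = np.asarray(is_business, dtype=bool)
    f_b = np.where(is_business, x, 0.0)   # feature active for business trips
    f_l = np.where(is_business, 0.0, x)   # feature active for leisure trips
    return f_b, f_l

# Hypothetical log-price values and trip-purpose labels.
price = np.array([5.0, 4.6, 4.8])
biz   = np.array([True, False, True])
f_b, f_l = split_business_leisure(price, biz)
```

Since exactly one of f_B and f_L is non-zero for each observation, fitting one coefficient per split feature is equivalent to fitting a separate price (or duration, etc.) sensitivity for each segment.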

Figure 4: MNL significant (p-value < 5%) coefficient magnitude using alternative specific features and segment predictors. The B/L prefix indicates the business/leisure split.

Table 7 summarizes the results. We observe that ML outperforms MNL on all the considered metrics. In terms of TOP-N (N ∈ {1..30}) accuracy, we observe absolute improvements of up to 8.7%. In terms of log likelihood, GBM outperforms MNL by 7.3%. With respect to airline market share prediction, GBM yields an SAE 38% smaller than MNL's.

Table 7: Summary of results using all available features.

method  | NormLogLikelihood | SAE | TOP-1 | TOP-5 | TOP-15
uniform |                   |     |       |       |
MNL     |                   |     |       |       |
RF      |                   |     |       |       |
GBM     |                   |     |       |       |

Finally, we also compare the RF feature importance (Figure 5) with the MNL coefficient magnitude (Figure 6). Again we see the high importance of trip duration, price and number of legs for MNL. Interestingly, for RF, the most important feature is now OD (the origin-destination pair), suggesting a geographic market segmentation.

Figure 5: RF feature importance using all available features.

Figure 6: MNL significant (p-value < 5%) coefficient magnitude using all available features. FirstAirline dummy coefficients have been aggregated and the average value is shown (the estimated standard deviation is 2.33).

5. Conclusions

In this paper we addressed the air itinerary choice modeling problem, traditionally dominated in the literature by MNL-based models. We proposed an alternative approach based on machine learning (ML) techniques in order to tackle some identified weaknesses of MNL for this particular application. In contrast to previous applications of MNL to air itinerary choice (e.g., [8]), our dataset is much more complex. We consider round-trip alternatives instead of one-way only, multiple markets (O&Ds), different traveler profiles, and different points of sale, all together in a single dataset of sessions. Thus, alternatives differ between sessions and cannot easily be labeled with a few discrete categories, in contrast to classical MNL examples such as {car, bus, air}. Moreover, the number of alternatives is not constant, reaching up to 50 per session, and some alternatives can be highly correlated (e.g., same outbound but different inbound flights). This increases the complexity and makes the weaknesses of MNL more evident. Based on the experimental results, our main takeaways are:

- Machine learning yields better performance than MNL on all the selected metrics.
- MNL achieves very good results in simple settings but, in more complex ones, requires feature engineering to capture non-linearities and customer segmentation. For example, it requires another model, trained on another dataset, to identify business and leisure passengers [10].
- ML allows automatic segmentation and non-linear modeling, thus requiring less research effort than MNL for comparable performance.
- ML scales better thanks to the many available libraries for big data platforms.

The selection of metrics was driven by the industrial uses of this model. It should be noted that for applications such as dynamic pricing of flight tickets, a small difference in Top-1 and Top-5 prediction accuracy can lead to a significant increase in profit [10]. Top-15 is particularly important for ranking the results of flight searches, since most websites show approximately 15 results per page and users usually examine the first page in more detail. We have also introduced a metric to measure the ability to predict airline market shares (SAE).

On all these metrics the two ML models, Random Forest and Gradient Boosting Machines, significantly outperform MNL. For example, in terms of Top-1, ML gives a prediction accuracy of 27.8% compared to 22.1% for MNL. In terms of predicting airline market shares, ML yields 38% smaller average errors than MNL. We think this is possible because machine learning methods provide clear benefits in modeling non-linearities of features and in automatically segmenting customers. Another clear practical advantage of ML compared to MNL is the possibility of fast and scalable learning on big data platforms.

As for future work, several research directions remain unexplored:

- Application of ML methods to larger datasets using big data technologies. The availability of big data platforms opens the door to exploring the huge datasets available in the air travel industry, collected for example by GDSs. This is also an opportunity to use other recent ML methods, such as deep learning, that require huge volumes of data to be effective.
- Better theoretical understanding, in particular of how the non-i.i.d. nature of the data affects ML methods. Treating a choice situation as a supervised soft classification problem may not be the most appropriate approach, since the different alternatives are clearly not independent; as a consequence, the outcome probabilities need to be normalized to one artificially. We think there is room to further analyze the theoretical consequences of this choice, which could lead to possible improvements.
- Improvements of the models. We plan to modify GBM's loss function to take sessions into account as a whole instead of looking at single alternatives.
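The artificial normalization mentioned above can be sketched in a few lines. The function name and scores are illustrative, not the paper's implementation: the classifier produces an independent booking score per alternative, and the scores of each session are rescaled to form a proper choice distribution.

```python
import numpy as np

def session_choice_probs(scores):
    """Renormalize per-alternative booking scores within one search
    session so that they sum to one, yielding a choice distribution
    over the alternatives shown in that session."""
    s = np.asarray(scores, dtype=float)
    return s / s.sum()

p = session_choice_probs([0.2, 0.1, 0.1])
# p is [0.5, 0.25, 0.25]: relative order is preserved, total is 1.
```

Because the rescaling is done per session, scores are only comparable within a session, which is exactly why a session-aware loss function is an appealing next step.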

References

[1] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5-32.
[2] J. H. Friedman, Greedy function approximation: a gradient boosting machine, Annals of Statistics 29 (5) (2001) 1189-1232.
[3] D. McFadden, Conditional logit analysis of qualitative choice behavior, in: P. Zarembka (Ed.), Frontiers in Econometrics, Academic Press, 1973.
[4] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition, Springer, 2009.
[5] J. H. Friedman, Stochastic gradient boosting, Computational Statistics & Data Analysis 38 (4) (2002) 367-378.
[6] L. A. Garrow, Discrete Choice Modelling and Air Travel Demand: Theory and Applications, Routledge, 2010.
[7] A. Gelman, Scaling regression inputs by dividing by two standard deviations, Statistics in Medicine 27 (15) (2008) 2865-2873.
[8] J. P. Newman, V. Lurkin, L. A. Garrow, Computational methods for estimating multinomial, nested, and cross-nested logit models that account for semi-aggregate data, in: 96th Annual Meeting of the Transportation Research Board, Washington, DC, 2016.
[9] C. Click, M. Malohlava, A. Candel, H. Roark, V. Parmar, Gradient Boosted Models with H2O, H2O.ai booklet.
[10] T. Delahaye, R. Acuna-Agost, N. Bondoux, A. Nguyen, M. Boudia, Data-driven models for itinerary preferences of air travelers and application for dynamic pricing optimization, under review at Journal of Revenue and Pricing Management.


More information

Test lasts for 120 minutes. You must stay for the entire 120 minute period.

Test lasts for 120 minutes. You must stay for the entire 120 minute period. ECO220 Mid-Term Test (June 29, 2005) Page 1 of 15 Last Name: First Name: Student ID #: INSTRUCTIONS: DO NOT OPEN THIS EAM UNTIL INSTRUCTED TO. Test lasts for 120 minutes. You must stay for the entire 120

More information

Empirics of Airline Pricing

Empirics of Airline Pricing Empirics of Airline Pricing [Think about a interesting title that will motivate people to read your paper] [you can use this file as a template for your paper. The letters in green are comments and the

More information

Online appendix for THE RESPONSE OF CONSUMER SPENDING TO CHANGES IN GASOLINE PRICES *

Online appendix for THE RESPONSE OF CONSUMER SPENDING TO CHANGES IN GASOLINE PRICES * Online appendix for THE RESPONSE OF CONSUMER SPENDING TO CHANGES IN GASOLINE PRICES * Michael Gelman a, Yuriy Gorodnichenko b,c, Shachar Kariv b, Dmitri Koustas b, Matthew D. Shapiro c,d, Dan Silverman

More information

Customer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara

Customer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara Customer Relationship Management in marketing programs: A machine learning approach for decision Fernanda Alcantara F.Alcantara@cs.ucl.ac.uk CRM Goal Support the decision taking Personalize the best individual

More information

Evaluating predictive models for solar energy growth in the US states and identifying the key drivers

Evaluating predictive models for solar energy growth in the US states and identifying the key drivers IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Evaluating predictive models for solar energy growth in the US states and identifying the key drivers To cite this article: Joheen

More information

Predicting the profitability level of companies regarding the five comparability factors

Predicting the profitability level of companies regarding the five comparability factors VU University Amsterdam MSc. Business Analytics Research Paper Predicting the profitability level of companies regarding the five comparability factors March 31, 2017 Manon Wintgens (2558262) Manon Wintgens

More information

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy AGENDA 1. Introduction 2. Use Cases 3. Popular Algorithms 4. Typical Approach 5. Case Study 2016 SAPIENT GLOBAL MARKETS

More information

Appendix A Mixed-Effects Models 1. LONGITUDINAL HIERARCHICAL LINEAR MODELS

Appendix A Mixed-Effects Models 1. LONGITUDINAL HIERARCHICAL LINEAR MODELS Appendix A Mixed-Effects Models 1. LONGITUDINAL HIERARCHICAL LINEAR MODELS Hierarchical Linear Models (HLM) provide a flexible and powerful approach when studying response effects that vary by groups.

More information

Environmental correlates of nearshore habitat distribution by the critically endangered Māui dolphin

Environmental correlates of nearshore habitat distribution by the critically endangered Māui dolphin The following supplements accompany the article Environmental correlates of nearshore habitat distribution by the critically endangered Māui dolphin Solène Derville*, Rochelle Constantine, C. Scott Baker,

More information

Our MCMC algorithm is based on approach adopted by Rutz and Trusov (2011) and Rutz et al. (2012).

Our MCMC algorithm is based on approach adopted by Rutz and Trusov (2011) and Rutz et al. (2012). 1 ONLINE APPENDIX A MCMC Algorithm Our MCMC algorithm is based on approach adopted by Rutz and Trusov (2011) and Rutz et al. (2012). The model can be written in the hierarchical form: β X,ω,Δ,V,ε ε β,x,ω

More information

Survey of Behavioral Segmentation Methods

Survey of Behavioral Segmentation Methods Survey of Behavioral Segmentation Methods Written by Rhonda Carraway Petty Marketing Insights Data Scientist rpetty@mathnetix.com There is no magic rule or formula for deciding how to classify customers

More information

UPDATE OF THE NEAC MODAL-SPLIT MODEL Leest, E.E.G.A. van der Duijnisveld, M.A.G. Hilferink, P.B.D. NEA Transport research and training

UPDATE OF THE NEAC MODAL-SPLIT MODEL Leest, E.E.G.A. van der Duijnisveld, M.A.G. Hilferink, P.B.D. NEA Transport research and training UPDATE OF THE NEAC MODAL-SPLIT MODEL Leest, E.E.G.A. van der Duijnisveld, M.A.G. Hilferink, P.B.D. NEA Transport research and training 1 INTRODUCTION The NEAC model and information system consists of models

More information

Choice Based Revenue Management for Parallel Flights

Choice Based Revenue Management for Parallel Flights Choice Based Revenue Management for Parallel Flights Jim Dai Cornell University, jd694@cornell.edu, Weijun Ding Georgia Institute of Technology, wding34@gatech.edu, Anton J. Kleywegt Georgia Institute

More information

Preface to the third edition Preface to the first edition Acknowledgments

Preface to the third edition Preface to the first edition Acknowledgments Contents Foreword Preface to the third edition Preface to the first edition Acknowledgments Part I PRELIMINARIES XXI XXIII XXVII XXIX CHAPTER 1 Introduction 3 1.1 What Is Business Analytics?................

More information

CORPORATE FINANCIAL DISTRESS PREDICTION OF SLOVAK COMPANIES: Z-SCORE MODELS VS. ALTERNATIVES

CORPORATE FINANCIAL DISTRESS PREDICTION OF SLOVAK COMPANIES: Z-SCORE MODELS VS. ALTERNATIVES CORPORATE FINANCIAL DISTRESS PREDICTION OF SLOVAK COMPANIES: Z-SCORE MODELS VS. ALTERNATIVES PAVOL KRÁL, MILOŠ FLEISCHER, MÁRIA STACHOVÁ, GABRIELA NEDELOVÁ Matej Bel Univeristy in Banská Bystrica, Faculty

More information

sed/star metrics record linkage

sed/star metrics record linkage sed/star metrics record linkage Joshua Tokle, Christina Jones, and Michelle Yin 17 July 2015 American Institutes for Research Outline 1. Introduction to the problem 2. The data 3. Methodology of record

More information

Deep Dive into High Performance Machine Learning Procedures. Tuba Islam, Analytics CoE, SAS UK

Deep Dive into High Performance Machine Learning Procedures. Tuba Islam, Analytics CoE, SAS UK Deep Dive into High Performance Machine Learning Procedures Tuba Islam, Analytics CoE, SAS UK WHAT IS MACHINE LEARNING? Wikipedia: Machine learning, a branch of artificial intelligence, concerns the construction

More information

Methodological challenges of Big Data for official statistics

Methodological challenges of Big Data for official statistics Methodological challenges of Big Data for official statistics Piet Daas Statistics Netherlands THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Content Big Data: properties

More information

Adaptive Time Series Forecasting of Energy Consumption using Optimized Cluster Analysis

Adaptive Time Series Forecasting of Energy Consumption using Optimized Cluster Analysis Adaptive Time Series Forecasting of Energy Consumption using Optimized Cluster Analysis Peter Laurinec, Marek Lóderer, Petra Vrablecová, Mária Lucká, Viera Rozinajová, Anna Bou Ezzeddine 12.12.2016 Slovak

More information

Bid rent model for simultaneous determination of location and rent in land use microsimulations. Ricardo Hurtubia Michel Bierlaire

Bid rent model for simultaneous determination of location and rent in land use microsimulations. Ricardo Hurtubia Michel Bierlaire Bid rent model for simultaneous determination of location and rent in land use microsimulations Ricardo Hurtubia Michel Bierlaire STRC 2011 May 2011 STRC 2011 Bid rent model for simultaneous determination

More information