Airline Itinerary Choice Modeling Using Machine Learning


A. Lhéritier a, M. Bocamazo a, T. Delahaye a, R. Acuna-Agost a

a Innovation and Research Division, Amadeus S.A.S., 485 Route du Pin Montard, Sophia Antipolis Cedex, France

Abstract

Understanding how customers choose between different itineraries when searching for flights is very important for the travel industry. This knowledge can help travel providers, either airlines or travel agents, better adapt their offer to market conditions and customer needs. This is particularly important for pricing and for ranking the suggestions shown to travelers searching for flights. The problem has historically been handled using Multinomial Logit (MNL) models. While MNL models offer the dual advantages of simplicity and readability, they lack the flexibility to deal with correlations between alternatives or non-linearity in the effect of alternative attributes. In this work, we present an alternative modeling approach based on machine learning techniques that address the main weaknesses of MNL models for this particular application. These methods are also better adapted to applications in the travel industry thanks to robust data-parallelized implementations working on Big Data platforms. We test these models on a dataset consisting of flight searches and bookings on European markets. The results show the machine learning models consistently outperforming MNL on prediction accuracy, log likelihood, and market share estimation.

Keywords: multinomial logit model, machine learning, decision tree, random forest, gradient boosting machines

Email addresses: alix.lheritier@amadeus.com (A. Lhéritier), mrbocamazo@gmail.com (M. Bocamazo), delahaye.thierry@gmail.com (T. Delahaye), rodrigo.acunaagost@amadeus.com (R. Acuna-Agost)

Preprint submitted to Elsevier, March 22, 2017

1. Introduction

This paper deals with the airline itinerary choice problem. Consider, for example, a customer searching for flights from London to New York, traveling next week on Tuesday and coming back on Saturday. This search request is processed by a travel provider (e.g., an online travel agent) that proposes between 1 and 50 different alternatives (itineraries) to the customer. The itineraries have different attributes, among others: number of stops, total trip duration, and price. The question is: which one is (probably) going to be selected by the customer?

There is a growing interest within the travel industry in better understanding how customers choose between different itineraries when searching for flights. Such an understanding can help travel providers, either airlines or travel agents, better adapt their offer to market conditions and customer needs, thus increasing their revenue. This can be used for filtering alternatives, sorting them, or even changing some attributes in real time (e.g., changing the price).

The field of customer choice modeling is dominated by traditional statistical approaches, such as the Multinomial Logit (MNL) model, that are linear with respect to features and tightly bound to their assumptions about the distribution of error. While these models offer the dual advantages of simplicity and readability, they lack the flexibility to handle correlations between alternatives or non-linearity in the effect of alternative attributes. Another strong limitation is their inability to model different behavior according to individual-specific variables. A large part of the existing modeling work focuses on adapting these modeled distributions so that they can match observed behavior. Nested (NL) and Cross-Nested (CNL) Logit models are good examples of this: they add terms for highly specific feature interactions, so that substitution patterns between sub-groups of alternatives can be captured.
In this work, we present an alternative modeling approach based on machine learning techniques. The selected machine learning methods can model non-linear relationships between feature values and the target class, allow collinear features, and have more modeling flexibility to automatically learn implicit customer segments. In particular, we have chosen to work with machine learning methods based on ensembles of decision trees, namely Random Forests (RF) [1] and tree-based Gradient Boosting Machines (GBM) [2]. In fact, decision trees are well adapted to our problem, as model bifurcations (branches) are well disposed to automatically partition the customers into segments and, at the same time, capture non-linear relationships within attributes of alternatives and characteristics of the decision maker, if this has a positive impact on prediction accuracy.

Indeed, there are two main segments to be taken into account for our particular problem: business and leisure air passengers behave very differently when it comes to booking flights. The business passenger tends to favor alternatives with convenient schedules, such as shorter connection times and preferred departure times, while leisure passengers are very price sensitive, which means that they can accept a longer connection time if this is reflected in a lower ticket price. The problem is that the segment is not explicitly known when the customer is searching; however, it can be derived by combining different factors. For example, industry experts know that business passengers tend to book with less anticipation and are not predisposed to stay over Saturday nights. However, these are not black-or-white rules, which reinforces the need for a model able to detect such rules from the data and actual customer behavior. Another observed advantage is that RF and GBM are fairly quick to train and very quick to predict, which enables fast iteration.

The rest of this paper is organized as follows: Section 2 describes the discrete choice modeling background. Section 3 introduces the proposed modeling approach based on machine learning. In Section 4, we present numerical experiments and their results. Finally, Section 5 draws the main conclusions and presents some perspectives of this work.

2. Discrete choice modeling

In this section we introduce the basic setting of discrete choice modeling and the classical MNL approach considered in our experiments as the baseline methodology.

2.1. Setting

Our choice situations are structured in air shopping sessions. For each session i, we know the basic search request information provided by the passenger (or decision maker / individual): the origin and destination cities and the departure/arrival dates of the travel. The individual considers a choice set of n_i flight alternatives from which she must choose exactly one. It should be noted that only sessions finishing with a booking (sale) were considered in the experiments, as it was not possible to collect data on sessions with non-purchase choices. We denote by C_i the random variable representing the index of the chosen flight itinerary alternative. The choice set therefore verifies the three basic conditions of any discrete choice modeling problem: it is a) mutually exclusive, b) exhaustive, and c) composed of a finite number of alternatives.

In each session, two kinds of feature vectors are considered:¹ x_{i0} ∈ R^{d_0}, characterizing the decision maker (also called characteristics), and x_{ij} ∈ R^{d_1}, characterizing the alternatives (also called attributes). We denote the vector of all the features that can be considered for alternative j of session i as a_{ij} ≜ (x_{i0}, x_{ij}).

2.2. Multinomial Logit Model

The Multinomial Logit Model (MNL) is derived under the assumption that a decision maker chooses the alternative that maximizes his utility U. In general it is not possible to know the decision maker's utility nor its form. However, one can determine some attributes of the alternatives as faced by

¹ In practice, there can be features that characterize both the decision maker and the alternatives. For example, the time before departure characterizes, at a day/week scale, what the user looks for but, at an hour scale, it can differentiate the alternatives.
See Section 4.5 for more details.

the passenger, in addition to some characteristics of the decision maker. In general we consider:

- alternative-specific features x_{ij} with generic coefficients β;
- individual-specific features x_{i0} with alternative-specific coefficients γ_j;
- alternative-specific features x_{ij} with alternative-specific coefficients δ_j;
- alternative-specific constants α_j.

We can now specify a model, or function, that relates these observed factors to the unknown individual's utility. This function is often called the representative utility, and is defined as:

V_{ij} = α_j + β x_{ij} + γ_j x_{i0} + δ_j x_{ij}    (1)

At this point, it should be noted that there are aspects of utility that cannot be observed nor derived from the available data, therefore V_{ij} ≠ U_{ij}. The utility can thus be decomposed as:

U_{ij} = V_{ij} + ε_{ij}    (2)

where ε_{ij} encapsulates all the factors that impact utility but are not considered in V_{ij}. It should be noted that we do not know ε_{ij}, which is why these terms are treated in the literature as random. In particular, we assume the ε_{ij} are i.i.d. with an extreme value distribution. Under these assumptions, the resulting probability of choosing alternative a_{ij} is given by [3]:

P(C_i = j) = e^{V_{ij}} / Σ_{j′} e^{V_{ij′}}    (3)

For our particular application, alternatives vary drastically from one session to another; for this reason, alternative-specific coefficients and constants were ruled out (i.e., α_j = γ_j = δ_j = 0 ∀j).
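As a concrete illustration, the choice probabilities of Eq. (3) for a single session reduce to a softmax over linear utilities. The following sketch uses made-up feature values and coefficients, not values fitted on the paper's data:

```python
import numpy as np

def mnl_choice_probabilities(X, beta):
    """Eq. (3): softmax choice probabilities for one session.

    X    : (n_alternatives, d) matrix of alternative-specific features x_ij
    beta : (d,) vector of generic coefficients
    """
    v = X @ beta                # representative utilities V_ij
    v = v - v.max()             # stabilize the exponentials
    e = np.exp(v)
    return e / e.sum()

# Hypothetical session: 3 itineraries described by (log price, n stops).
X = np.array([[5.0, 0.0],
              [4.6, 1.0],
              [4.8, 1.0]])
beta = np.array([-1.0, -0.5])   # illustrative coefficients, not estimated
p = mnl_choice_probabilities(X, beta)
```

Note that subtracting the maximum utility before exponentiating leaves the probabilities unchanged while avoiding overflow, a standard trick for softmax computations.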

3. Machine Learning Approach

In this section, we describe how we apply supervised learning to the choice modeling problem and give a brief description of the considered methods.

3.1. Supervised Learning

In supervised learning, each sample point consists of a set of features and a target, denoted x_i and y_i respectively. The goal is, after seeing a set of sample points, to predict new unseen targets given their associated features. In the soft classification problem, the targets are discrete (and usually called labels) and the goal is to estimate their conditional probability distribution P(Y | X) while trying to minimize some measure of the expected error when predicting a new unseen label. In the regression problem, the outputs are continuous and the goal is to estimate their conditional expected value E(Y | X). More precisely, we assume that the sample points are realizations of a pair of random variables (X, Y), X ∈ Ω, Y ∈ A, where Ω is some feature space (e.g., Ω ⊆ R^d) and A is some alphabet (e.g., A = {0, 1} for binary classification and A = R for a regression problem).

A central problem in the estimation of these quantities is the bias-variance trade-off (see [4] for a thorough presentation). The bias is the error generated by erroneous assumptions in the model. High bias can cause underfitting, i.e., missing relevant relations between the features X and the targets Y. The variance is the error generated by the sensitivity of the model to small fluctuations in the training data. High variance can cause overfitting, i.e., capturing noise or relations that do not generalize. Given the true model and infinite data to estimate it, it is possible to reduce both the bias and variance terms to 0.
However, in real life, we deal with imperfect models and finite data, and there is a trade-off between minimizing the bias and minimizing the variance.

3.2. Choice modeling as a soft classification supervised learning problem

Our approach to choice modeling is to consider each alternative independently and treat it as a soft classification problem, i.e., given its set of feature values, predict the probability of its being chosen. That is, X will be instantiated with the vectors a_{ij} and Y with 1_{j = C_i}. Therefore, we train the classification models as if the pairs (X, Y) were independent and identically distributed, which is a strong hypothesis that does not hold in our case, nor in many practical applications. In particular, in every session there is one and only one alternative chosen, and thus the probabilities within a session should add to one. In order to fix this, we normalize the probabilities yielded by the classification model to make them add to one. More precisely, given a soft classifier P̂_ML(Y | X), we define a model that assigns the following probability of being chosen to alternative j of session i:

P̂(C_i = j) ≜ P̂_ML(Y = 1 | X = a_{ij}) / Σ_{j′} P̂_ML(Y = 1 | X = a_{ij′})    (4)

In contrast to MNL, where the main interest is providing understandable insights about the factors driving the choices, our focus is on different performance metrics measured on a hold-out dataset. This is the case in many successful machine learning applications where the theory is not completely understood: their good empirical performance fully justifies them.

3.3. Decision trees

Let us first review our basic building block. A decision tree is a tree where each node represents a local estimate of the quantity of interest (e.g., the conditional distribution for soft classification) based on a subset of the sample points. The subsets are recursively split by applying a linear threshold to a single variable, chosen to optimize the change in some impurity measure in the resulting splits of the data. The impurity is meant to measure the homogeneity of the target variable within the subsets. Different quantities can be used as impurity; in the case of soft classification, we use the entropy of the conditional distribution. Different criteria can be used to stop splitting nodes: building full trees, setting a maximum tree depth, or setting a minimum number of points per leaf. In our case, we set a maximum tree depth, which also controls the computational complexity. In order to predict the target of a new sample point, its features are used to go down the tree according to the linear thresholds; when a leaf is reached, the local estimate is used to predict the target. If the leaves contain too few training points, the model will suffer from statistical scarcity, leading to high variance.
If the tree is not deep enough, the model will suffer from high bias. Next, we present some strategies to find a good balance between bias and variance when using decision trees.
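Whichever tree model (or ensemble, below) provides the soft classifier, its per-alternative outputs must still be renormalized within each session as in Eq. (4). A minimal sketch, with made-up classifier scores:

```python
import numpy as np

def renormalize_per_session(p_ml, session_ids):
    """Eq. (4): turn independent per-alternative scores P_ML(Y=1 | X=a_ij)
    into per-session choice probabilities that sum to one."""
    p = np.asarray(p_ml, dtype=float)
    out = np.empty_like(p)
    for s in np.unique(session_ids):
        mask = session_ids == s
        out[mask] = p[mask] / p[mask].sum()
    return out

# Scores from a hypothetical soft classifier, for two sessions of
# sizes 3 and 2 respectively.
scores   = np.array([0.2, 0.1, 0.1, 0.6, 0.2])
sessions = np.array([0,   0,   0,   1,   1])
probs = renormalize_per_session(scores, sessions)
```

After the renormalization, each session's probabilities form a proper choice distribution, which is what the likelihood and market-share metrics of Section 4 consume.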

3.4. Random forests

Random forests (RF) [1] follow the idea of bagging (for "bootstrap aggregating"), which consists in averaging the predictions of many noisy but approximately unbiased models, built over a collection of bootstrap samples, in order to reduce the variance. In particular, RF build an ensemble of decision trees by using two elements of randomness. The first one consists in using a bootstrap sample for each tree, obtained by sampling with replacement from the training data. The idea is to build multiple weak trees (each one capturing only a part of the information) that vote together to give the resulting prediction; in our case, voting is done by averaging their predictions. The second element of randomness is introduced when splitting each node: only a random subset of the features (usually of size √#features) is considered to optimize the split. If the trees are grown sufficiently deep, they have relatively low bias and, since they are identically distributed, so does the ensemble. Thus, the error is reduced by reducing the variance as trees are aggregated. See, e.g., [4, Chap. 15] for more details.

An interesting outcome of RF building is feature importance. As a matter of fact, some features are more informative than others, or have splits in the ensemble that reduce the impurity more. These reductions in impurity, weighted by the number of samples passing through each node, are summed over the forest and divided by the total reduction in impurity. This yields the share of each feature in informativeness, which is one way of understanding feature importance. As a final remark, due to the way predictions are computed, it is possible that RF assign a null probability to some label value, which is harmful for metrics like log likelihood if it happens in the test set. This is not the case for the next method we consider.

3.5. Gradient boosting machines

An alternative to bagging is boosting.
At first sight, boosting is similar to bagging in the sense that it aggregates many weak classifiers, but it is fundamentally different. The idea of boosting is to sequentially train boosting functions that correspond to weak classifiers. Let us denote the t-th boosting function by f_t(x). These functions are additively combined, i.e.:

ŷ_t(x) = ŷ_{t−1}(x) + ν f_t(x)    (5)

where 0 < ν ≤ 1 is the learning rate, i.e., a shrinkage parameter that controls overfitting. Gradient Boosting Machines² (GBM) perform a gradient-descent-based optimization by building these boosting functions from regression trees. The t-th regression tree learns the gradient of a loss function L, evaluated on ŷ_{t−1}(x) applied to the training data, in order to add a step in the direction of the negative gradient with a magnitude that is optimized. If we consider the deviance loss

−2 Σ_i log P̂(Y = y_i | X = x_i),    (6)

then the t-th regression tree for the symbol k is trained on the following gradients:

g_{ikt} ≜ 1_{y_i = k} − P̂_GBM^{t−1}(Y = k | X = x_i).    (7)

In fact, the algorithm works for discrete alphabets of any cardinality: one boosting function f_{kt}(x) is built per alphabet symbol k ∈ A at round t. Then, after t rounds, the probability assignment is obtained using the softmax transformation, as in MNL:

P̂_GBM^t(Y = k | X = x) ≜ e^{ŷ_{kt}(x)} / Σ_{k′∈A} e^{ŷ_{k′t}(x)}.    (8)

A stochastic GBM [5] improves generalization by sampling columns (per split) and rows (per tree) during the model building process. For more details, see [4, Chap. 10, Algorithms 10.3 & 10.4].

² In our case, the machines are trees.
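A minimal numpy sketch of the softmax assignment (Eq. 8) and the pseudo-residuals (Eq. 7), with toy scores and labels; this illustrates the quantities involved, not the H2O implementation:

```python
import numpy as np

def softmax_probs(scores):
    """Eq. (8): per-class probabilities from boosting scores yhat_kt(x);
    scores has shape (n_samples, n_classes)."""
    z = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def deviance_gradients(scores, y):
    """Eq. (7): pseudo-residuals 1{y_i = k} - P_GBM(Y=k | X=x_i), i.e. the
    negative gradient of the deviance loss (Eq. 6); the t-th regression
    tree for class k would be fit to column k of this array."""
    p = softmax_probs(scores)
    onehot = np.eye(scores.shape[1])[y]
    return onehot - p

# One toy round: scores start at zero (uniform probabilities), so the
# gradients are +/- 0.5 for a binary alphabet.
scores = np.zeros((4, 2))
y = np.array([0, 1, 1, 0])
g = deviance_gradients(scores, y)
```

Each boosting round would fit one regression tree per class to these gradients and add it with the learning rate ν as in Eq. (5).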

4. Experimental results

4.1. Data

We trained and tested our models on a dataset consisting of flight searches and bookings on a set of European origin/destination markets and airlines, extracted from GDS (global distribution system) logs. The choice set consists of the results of a flight search request. Each search session includes between 1 and 50 different itineraries, one of which has been booked by the customer. In total, there are choice situations (i.e., sessions) in the dataset, which are divided into training and test sets containing and 6791 sessions respectively.

The available features are presented in Table 1: there are numerical and categorical ones, such as ticket price, number of connecting flights, and origin/destination. For MNL, departure and arrival times were decomposed into a sum of sine/cosine functions (see [6, Chap. 7]); for ML, they were simply represented as the number of seconds from midnight. For ML, we made some of the features relative to the minimum in their session, since absolute values make generalization more difficult for partition-based methods. For MNL, categorical variables were dummy encoded and numerical features were standardized³ to allow assessing their relative importance by looking at their coefficients.

4.2. Metrics

We compare MNL vs. machine learning methods using the following metrics on the test set consisting of m sessions, the i-th session containing n_i alternatives:

TOP-N accuracy: we rank the alternatives decreasingly by assigned probability (ties are broken randomly), and we consider a prediction as TOP-N accurate if the rank of the chosen alternative is less than or equal to N.

³ By subtracting the mean and dividing by two times the standard deviation of the original feature, as suggested in [7], to make them comparable to binary features' coefficients.
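The feature transformations described in Section 4.1 — session-relative ratios for ML, sin/cos time encodings for MNL, and two-standard-deviation scaling — can be sketched as follows. The helper names and values are ours, for illustration only:

```python
import numpy as np

def price_over_cheapest(price, session_ids):
    """Session-relative feature, e.g. 'Price / Cheapest Price':
    divide each alternative's value by the minimum in its session."""
    out = np.empty_like(price, dtype=float)
    for s in np.unique(session_ids):
        m = session_ids == s
        out[m] = price[m] / price[m].min()
    return out

def cyclic_time_features(seconds, k):
    """MNL encoding of a time of day: the sin/cos pair of the
    2*pi*k harmonic (k = 1, 2, 3, cf. Table 1)."""
    theta = 2.0 * np.pi * k * seconds / 86400.0
    return np.sin(theta), np.cos(theta)

def two_sd_standardize(x):
    """Standardize by subtracting the mean and dividing by two standard
    deviations, as suggested in [7], so numerical coefficients are
    comparable to binary ones."""
    return (x - x.mean()) / (2.0 * x.std())

# Two hypothetical sessions of two alternatives each.
price    = np.array([100.0, 120.0, 100.0, 150.0])
sessions = np.array([0, 0, 1, 1])
rel = price_over_cheapest(price, sessions)
```

The session-relative form is what allows partition-based models to reuse the same split thresholds across markets with very different absolute price levels.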

Table 1: Features: marked with , alternative specific features (used in experiments 1, 2 and 3) and, marked with , business/leisure predictors (used in experiment 2 and, for ML, in 3 also). The rest of the features are used in experiment 3 only. For more details, see Sections 4.4, 4.5 and 4.6.

Feature | Type | Range/Card. | Used for
Log(Price) | Num | [4.35, 9.73] | MNL
Price / Cheapest Price | Num | [1, 45.64] | ML
Trip duration (minutes) | Num | [105, 4314] | MNL
Trip duration / shortest trip duration | Num | [1, 24.65] | ML
Stay duration (minutes) | Num | [120, 434000] | MNL
Stay duration / shortest stay duration | Num | [1, 1819] | ML
Number of legs (nflights) | Num | [2, 6] | MNL, ML
Number of airlines | Num | [1, 4] | MNL, ML
Contains Low-Cost Carrier (LCC) | Bin | {0, 1} | MNL, ML
{cos,sin}{2π,4π,6π} Outbound departure time | Num | [-1, 1] | MNL
{cos,sin}{2π,4π,6π} Outbound arrival time | Num | [-1, 1] | MNL
Outbound departure time | Num | [0, 86400] | ML
Outbound arrival time | Num | [0, 86400] | ML
Stay duration (minutes) (median) | Num | [120, 434000] | ML, B/L classif.
Stay Saturday (median) | Bin | {0, 1} | ML, B/L classif.
Days to departure (DTD) (median) | Num | [0, 343] | ML, B/L classif.
Origin/Destination (OD) | Cat | 97 | ML
Continental Trip | Bin | {0, 1} | ML
Domestic Trip | Bin | {0, 1} | ML
Departure weekday | Cat | 7 | MNL, ML

Normalized log likelihood: let c_i be the index of the chosen alternative of the i-th session. The normalized log likelihood is

(1/m) Σ_{i=1}^m log P̂(C_i = c_i).    (9)

Fractional market shares: for a given feature f with discrete values V_f, the true market share is an empirical distribution s over V_f such that, for a given value v ∈ V_f:

s(v) ≜ Σ_i 1_{f(a_{ic_i}) = v} / #sessions    (10)

where f(a) denotes the value of the feature f for the alternative a and 1 denotes the indicator function. The estimated market share given by a predictor P̂ is defined as:

ŝ(v) ≜ Σ_{ij} 1_{f(a_{ij}) = v} P̂(C_i = j) / #sessions    (11)

Finally, we compare them by computing the sum of absolute errors:

SAE_f(P̂) ≜ Σ_{v∈V_f} |ŝ(v) − s(v)|    (12)

In particular, we are interested in the airline market share. Notice that we do not use the airline as a training feature, since we want to test the ability of the models to generalize, i.e., we want these models to be applicable to new airlines and to predict their market share.

4.3. Methods

We consider the following methods and corresponding implementations:

MNL: we used the Larch open toolbox.⁴ This implementation has been shown to be faster than traditional commercial and academic MNL software [8].

⁴ Available at

RF: we used Distributed Random Forests from the H2O library.⁵ An interesting feature of this library is its ability to run on a simple personal computer as well as on popular Big Data platforms such as Hadoop or Spark clusters, which makes it suitable for industrial big data applications without additional effort.

GBM: we used the H2O implementation as well [9].

Uniform: as a reference, we consider a uniform probability assignment, i.e., P̂_unif(C_i = j) = 1/n_i.

In order to optimize the parameters of the ML methods, we use 10-fold cross-validation. This means that, for each combination of parameter values, 10 models are trained, one per fold, and assessed on the corresponding validation set. The best combination of parameter values, in terms of average log likelihood, is used to train the final model on the full training set.

RF are meant to be used with deep (overfitted) decision trees. Nevertheless, in practice, better results can be obtained by tuning the maximum depth of the trees. We tuned the maximum tree depth by trying the values {3, 5, 7, 9, 11, 13, 15, 17, 19, 21}. The number of trees can go as high as desired, but for computational reasons it is important to find a number that is not too large yet good enough. Scoring is performed every 10 trees on each validation set in terms of log likelihood, and a moving average of window size 2 is computed. A maximum number of trees of is specified, but our training procedure stops if the relative improvement of the moving average is less than for two scoring rounds. For the other parameters, we used their default values.

GBM are based on weak (underfitted) models, i.e., shallow trees. A too large number of trees can make the model overfit. This parameter was tuned in the same way as for RF.
Using a Cartesian grid search, we tuned the following parameters:

- learning rate: {0.01, 0.1}
- maximum tree depth: {3, 5, 7, 9, 11}

The other parameters were used with their default values, except for the sample rate, where we followed Friedman's advice (stochastic gradient boosting [5]) by setting it to

⁵ Available at
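Before turning to the experiments, the evaluation metrics of Section 4.2 can be sketched as follows. A session is represented here as a (probabilities, chosen index) pair, and ties are broken deterministically rather than randomly, for simplicity:

```python
import numpy as np

def normalized_log_likelihood(sessions):
    """Eq. (9): mean log probability assigned to the chosen alternative;
    sessions is a list of (probs, chosen_index) pairs."""
    return float(np.mean([np.log(p[c]) for p, c in sessions]))

def top_n_accuracy(sessions, n):
    """TOP-N accuracy: share of sessions whose chosen alternative is
    among the n highest assigned probabilities."""
    hits = 0
    for p, c in sessions:
        order = np.argsort(-np.asarray(p))   # best-first ranking
        hits += c in order[:n]
    return hits / len(sessions)

def sum_absolute_errors(true_share, est_share):
    """Eq. (12): SAE between two market-share distributions, given as
    dicts mapping a feature value to its share."""
    keys = set(true_share) | set(est_share)
    return sum(abs(true_share.get(k, 0.0) - est_share.get(k, 0.0))
               for k in keys)

# Two hypothetical sessions: predicted probabilities and the chosen index.
sessions = [(np.array([0.7, 0.2, 0.1]), 0),
            (np.array([0.5, 0.3, 0.2]), 1)]
top1 = top_n_accuracy(sessions, 1)
nll  = normalized_log_likelihood(sessions)
err  = sum_absolute_errors({'AA': 0.6, 'BB': 0.4}, {'AA': 0.5, 'BB': 0.5})
```

In this toy example the chosen alternative is ranked first in one of the two sessions, so TOP-1 accuracy is 0.5 while TOP-2 accuracy is 1.0.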

4.4. Experiment 1: Using only alternative specific features

In this experiment, we use only alternative specific features (see Table 1) to show how the flexibility of tree models allows capturing non-linear dependencies. Notice that we do not include the absolute stay duration, since it helps identify the kind of traveler (see Experiment 2). Table 2 shows the best parameters obtained from the grid search.

Table 2: Best parameters resulting from grid search with alternative specific features only.

    | max depth | # trees | learn rate
RF  |           |         |
GBM |           |         |

Table 3 summarizes the results. We observe that RF outperforms MNL on all the considered metrics, while GBM outperforms it everywhere except on SAE. In terms of TOP-N (N ∈ {1..30}) accuracy, we observe absolute improvements of up to 4.7%.

Table 3: Summary of results using alternative specific features only.

method  | NormLogLikelihood | SAE | TOP-1 | TOP-5 | TOP-15
uniform |                   |     |       |       |
MNL     |                   |     |       |       |
RF      |                   |     |       |       |
GBM     |                   |     |       |       |

In order to get more insight from these models, we analyze the feature importance given by RF (Figure 1) and the magnitude of the significant (p-value < 5%) MNL coefficients (Figure 2). For MNL, the top 3 features correspond to trip duration, price and number of legs, which is expected. RF gives higher importance to stay duration and to departure and arrival times, which suggests that the model is capturing convenience preferences.

Figure 1: RF feature importance using alternative specific features only.

Figure 2: MNL significant (p-value < 5%) coefficient magnitude using alternative specific features only.

4.5. Experiment 2: Adding business/leisure segmentation

As suggested in [10], a shorter time between the booking date and the flight departure (Days to Departure, DTD) yields a higher probability of the customer being business, and therefore a lower sensitivity to price. Since DTD is almost constant within a session, this cannot be straightforwardly exploited by MNL. In this experiment, we exploit attributes that help characterize the type of user, more precisely the purpose of the trip, i.e., business or leisure. We follow the method proposed in [10] to allow the MNL to take advantage of this, by splitting each alternative specific feature into two new ones. More precisely, a feature f is split into f_B and f_L:

f_B ≜ f if the trip is classified as Business, 0 otherwise    (13)

f_L ≜ f if the trip is classified as Leisure, 0 otherwise    (14)

All the features from Experiment 1 were split. For this purpose, a classifier (GBM) was trained on a separate dataset of trips that have been unequivocally identified as business or leisure. This classifier was trained using

only three business/leisure predictors: Stay Saturday, Stay Duration, and Days to Departure. When tested on the survey data used in [10], it yielded an accuracy of 79.4%. Since, within a session, there can be some variation in these features between alternatives, we in fact consider their median value when applying the classification model, in order to get a consistent result. On the other hand, the ML algorithms were trained using the features of Experiment 1 and the three segment-predictor features, but without using the extra dataset or the obtained business/leisure label.⁶ Table 4 shows the best parameters obtained from the grid search.

Table 4: Best parameters resulting from grid search with alternative specific features and segment predictors.

    | max depth | # trees | learn rate
RF  |           |         |
GBM |           |         |

Table 5 summarizes the results. We observe that this splitting helps MNL improve its performance in terms of TOP-N accuracy and likelihood, as expected. In terms of log likelihood, GBM outperforms MNL, with a relative improvement of 5.2%. In terms of TOP-N (N ∈ {1..30}) accuracy, we observe absolute improvements of up to 6.2%.

Table 5: Summary of results using alternative specific features and segment predictors.

method  | NormLogLikelihood | SAE | TOP-1 | TOP-5 | TOP-15
uniform |                   |     |       |       |
MNL     |                   |     |       |       |
RF      |                   |     |       |       |
GBM     |                   |     |       |       |

Finally, we also compare the RF feature importance (Figure 3) with the MNL coefficient magnitude (Figure 4). As expected, we see that the price for a leisure traveler has the highest importance, followed by the trip duration for business travelers. Interestingly, for RF, DTD is now among the most important features, which is consistent with the reported importance of this feature for predicting the kind of trip (see [10]). We observe again the high importance of departure and arrival times.

⁶ For ML, we use the original stayduration and the median value of DTD and SaturdayStay.

Figure 3: RF feature importance using alternative specific features and segment predictors.

4.6. Experiment 3: Using all the features

In this experiment, we compare MNL and ML methods using all the available features but without the business/leisure label, in order to compare them in a scenario without business knowledge about possible segments. Table 6 shows the best parameters obtained from the grid search.

Table 6: Best parameters resulting from grid search with all available features.

    | max depth | # trees | learn rate
RF  |           |         |
GBM |           |         |
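The feature splitting of Eqs. (13)–(14) used in Experiment 2 can be sketched as follows. In practice the business/leisure flags come from the separate GBM trip-purpose classifier applied per session; here they are hard-coded for illustration:

```python
import numpy as np

def split_business_leisure(x, is_business):
    """Eqs. (13)-(14): duplicate a feature into f_B and f_L so that MNL
    can learn segment-specific coefficients. is_business holds the
    per-session output of a trip-purpose classifier."""
    x = np.asarray(x, dtype=float)
    is_business = np.asarray(is_business, dtype=bool)
    f_b = np.where(is_business, x, 0.0)   # feature active for business trips
    f_l = np.where(is_business, 0.0, x)   # feature active for leisure trips
    return f_b, f_l

# Hypothetical log-price values and trip-purpose labels.
price = np.array([5.0, 4.6, 4.8])
biz   = np.array([True, False, True])
f_b, f_l = split_business_leisure(price, biz)
```

Since exactly one of f_B and f_L is non-zero for each observation, fitting one coefficient per split feature is equivalent to fitting a separate price (or duration, etc.) sensitivity for each segment.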

Figure 4: MNL significant (p-value < 5%) coefficient magnitude using alternative specific features and segment predictors. The B/L prefix indicates the business/leisure split.

Table 7 summarizes the results. We observe that ML outperforms MNL on all the considered metrics. In terms of TOP-N (N ∈ {1..30}) accuracy, we observe absolute improvements of up to 8.7%. In terms of log likelihood, GBM outperforms MNL by 7.3%. With respect to airline market share prediction, GBM yields an SAE 38% smaller than MNL's.

Table 7: Summary of results using all available features.

method  | NormLogLikelihood | SAE | TOP-1 | TOP-5 | TOP-15
uniform |                   |     |       |       |
MNL     |                   |     |       |       |
RF      |                   |     |       |       |
GBM     |                   |     |       |       |

Finally, we also compare the RF feature importance (Figure 5) with the MNL coefficient magnitude (Figure 6). Again we see the high importance of trip duration, price and number of legs for MNL. Interestingly, for RF, the most important feature is now OD (the origin-destination pair), suggesting a geographic market segmentation.

Figure 5: RF feature importance using all available features.

Figure 6: MNL significant (p-value < 5%) coefficient magnitude using all available features. FirstAirline dummy coefficients have been aggregated and the average value is shown (the estimated standard deviation is 2.33).

5. Conclusions

In this paper we addressed the air itinerary choice modeling problem, traditionally dominated in the literature by MNL-based models. We proposed an alternative approach based on machine learning (ML) techniques in order to tackle some identified weaknesses of MNL for this particular application. In contrast to previous applications of MNL to air itinerary choice (e.g., [8]), our dataset is much more complex. We consider round-trip alternatives instead of one-way only, multiple markets (O&Ds), different traveler profiles, and different points of sale, all together in a single dataset of sessions. Thus, alternatives differ between sessions and cannot easily be labeled with a few discrete categories, in contrast to classical MNL examples such as {car, bus, air}. Moreover, the number of alternatives is not constant, reaching up to 50 per session, and some alternatives can be highly correlated (e.g., same outbound but different inbound flights). This increases the complexity and makes the weaknesses of MNL more evident. Based on the experimental results, our main takeaways are:

- Machine learning yields better performance than MNL on all the selected metrics.
- MNL achieves very good results in simple settings but, in more complex ones, requires feature engineering to capture non-linearities and customer segmentation. For example, it requires another model, trained on another dataset, to identify business and leisure passengers [10].
- ML allows automatic segmentation and non-linear modeling, thus requiring less research effort than MNL for comparable performance.
- ML scales better thanks to the many available libraries for big data platforms.

The selection of metrics was driven by the industrial uses of this model. It should be noted that for applications such as dynamic pricing of flight tickets, a small difference in Top-1 and Top-5 prediction accuracy can lead to a significant increase in profit [10]. Top-15 is particularly important for ranking the results of flight searches, since most websites show approximately 15 results per page and users usually examine the first page in more detail. We have also introduced a metric to measure the ability to predict airline market shares (SAE).

On all these metrics the two ML models, Random Forest and Gradient Boosting Machines, significantly outperform MNL. For example, in terms of Top-1, ML gives a prediction accuracy of 27.8% compared to 22.1% for MNL. In terms of predicting airline market shares, ML yields 38% smaller average errors than MNL. We think this is possible because machine learning methods provide clear benefits in modeling non-linearities of features and in automatically segmenting customers. Another clear practical advantage of ML compared to MNL is the possibility of fast and scalable learning on big data platforms.

As for future work, several research directions remain unexplored:

- Application of ML methods to larger datasets using big data technologies. The availability of big data platforms opens the door to exploring the huge datasets available in the air travel industry, collected for example by GDSs. This is also an opportunity to use other recent ML methods, such as deep learning, that require huge volumes of data to be effective.
- Better theoretical understanding, in particular of how the non-i.i.d. nature of the data affects ML methods. Treating a choice situation as a supervised soft classification problem may not be the most appropriate approach, since the different alternatives are clearly not independent; as a consequence, the outcome probabilities need to be normalized to one artificially. We think there is room to further analyze the theoretical consequences of this choice, which could lead to possible improvements.
- Improvements of the models. We plan to modify GBM's loss function to take sessions into account as a whole instead of looking at single alternatives.
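The artificial normalization mentioned above can be sketched in a few lines. The function name and scores are illustrative, not the paper's implementation: the classifier produces an independent booking score per alternative, and the scores of each session are rescaled to form a proper choice distribution.

```python
import numpy as np

def session_choice_probs(scores):
    """Renormalize per-alternative booking scores within one search
    session so that they sum to one, yielding a choice distribution
    over the alternatives shown in that session."""
    s = np.asarray(scores, dtype=float)
    return s / s.sum()

p = session_choice_probs([0.2, 0.1, 0.1])
# p is [0.5, 0.25, 0.25]: relative order is preserved, total is 1.
```

Because the rescaling is done per session, scores are only comparable within a session, which is exactly why a session-aware loss function is an appealing next step.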

References

[1] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5-32.
[2] J. H. Friedman, Greedy function approximation: a gradient boosting machine, Annals of Statistics 29 (5) (2001) 1189-1232.
[3] D. McFadden, Conditional logit analysis of qualitative choice behavior, in: P. Zarembka (Ed.), Frontiers in Econometrics, Academic Press, 1973.
[4] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition, Springer, 2009.
[5] J. H. Friedman, Stochastic gradient boosting, Computational Statistics & Data Analysis 38 (4) (2002) 367-378.
[6] L. A. Garrow, Discrete Choice Modelling and Air Travel Demand: Theory and Applications, Routledge, 2010.
[7] A. Gelman, Scaling regression inputs by dividing by two standard deviations, Statistics in Medicine 27 (15) (2008) 2865-2873.
[8] J. P. Newman, V. Lurkin, L. A. Garrow, Computational methods for estimating multinomial, nested, and cross-nested logit models that account for semi-aggregate data, in: 96th Annual Meeting of the Transportation Research Board, Washington, DC, 2016.
[9] C. Click, M. Malohlava, A. Candel, H. Roark, V. Parmar, Gradient Boosted Models with H2O, H2O.ai booklet.
[10] T. Delahaye, R. Acuna-Agost, N. Bondoux, A. Nguyen, M. Boudia, Data-driven models for itinerary preferences of air travelers and application for dynamic pricing optimization, under review at Journal of Revenue and Pricing Management.


More information

Test lasts for 120 minutes. You must stay for the entire 120 minute period.

Test lasts for 120 minutes. You must stay for the entire 120 minute period. ECO220 Mid-Term Test (June 29, 2005) Page 1 of 15 Last Name: First Name: Student ID #: INSTRUCTIONS: DO NOT OPEN THIS EAM UNTIL INSTRUCTED TO. Test lasts for 120 minutes. You must stay for the entire 120

More information

Empirics of Airline Pricing

Empirics of Airline Pricing Empirics of Airline Pricing [Think about a interesting title that will motivate people to read your paper] [you can use this file as a template for your paper. The letters in green are comments and the

More information

Online appendix for THE RESPONSE OF CONSUMER SPENDING TO CHANGES IN GASOLINE PRICES *

Online appendix for THE RESPONSE OF CONSUMER SPENDING TO CHANGES IN GASOLINE PRICES * Online appendix for THE RESPONSE OF CONSUMER SPENDING TO CHANGES IN GASOLINE PRICES * Michael Gelman a, Yuriy Gorodnichenko b,c, Shachar Kariv b, Dmitri Koustas b, Matthew D. Shapiro c,d, Dan Silverman

More information

Customer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara

Customer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara Customer Relationship Management in marketing programs: A machine learning approach for decision Fernanda Alcantara F.Alcantara@cs.ucl.ac.uk CRM Goal Support the decision taking Personalize the best individual

More information

Evaluating predictive models for solar energy growth in the US states and identifying the key drivers

Evaluating predictive models for solar energy growth in the US states and identifying the key drivers IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Evaluating predictive models for solar energy growth in the US states and identifying the key drivers To cite this article: Joheen

More information

Predicting the profitability level of companies regarding the five comparability factors

Predicting the profitability level of companies regarding the five comparability factors VU University Amsterdam MSc. Business Analytics Research Paper Predicting the profitability level of companies regarding the five comparability factors March 31, 2017 Manon Wintgens (2558262) Manon Wintgens

More information

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy AGENDA 1. Introduction 2. Use Cases 3. Popular Algorithms 4. Typical Approach 5. Case Study 2016 SAPIENT GLOBAL MARKETS

More information

Appendix A Mixed-Effects Models 1. LONGITUDINAL HIERARCHICAL LINEAR MODELS

Appendix A Mixed-Effects Models 1. LONGITUDINAL HIERARCHICAL LINEAR MODELS Appendix A Mixed-Effects Models 1. LONGITUDINAL HIERARCHICAL LINEAR MODELS Hierarchical Linear Models (HLM) provide a flexible and powerful approach when studying response effects that vary by groups.

More information

Environmental correlates of nearshore habitat distribution by the critically endangered Māui dolphin

Environmental correlates of nearshore habitat distribution by the critically endangered Māui dolphin The following supplements accompany the article Environmental correlates of nearshore habitat distribution by the critically endangered Māui dolphin Solène Derville*, Rochelle Constantine, C. Scott Baker,

More information

Our MCMC algorithm is based on approach adopted by Rutz and Trusov (2011) and Rutz et al. (2012).

Our MCMC algorithm is based on approach adopted by Rutz and Trusov (2011) and Rutz et al. (2012). 1 ONLINE APPENDIX A MCMC Algorithm Our MCMC algorithm is based on approach adopted by Rutz and Trusov (2011) and Rutz et al. (2012). The model can be written in the hierarchical form: β X,ω,Δ,V,ε ε β,x,ω

More information

Survey of Behavioral Segmentation Methods

Survey of Behavioral Segmentation Methods Survey of Behavioral Segmentation Methods Written by Rhonda Carraway Petty Marketing Insights Data Scientist rpetty@mathnetix.com There is no magic rule or formula for deciding how to classify customers

More information

UPDATE OF THE NEAC MODAL-SPLIT MODEL Leest, E.E.G.A. van der Duijnisveld, M.A.G. Hilferink, P.B.D. NEA Transport research and training

UPDATE OF THE NEAC MODAL-SPLIT MODEL Leest, E.E.G.A. van der Duijnisveld, M.A.G. Hilferink, P.B.D. NEA Transport research and training UPDATE OF THE NEAC MODAL-SPLIT MODEL Leest, E.E.G.A. van der Duijnisveld, M.A.G. Hilferink, P.B.D. NEA Transport research and training 1 INTRODUCTION The NEAC model and information system consists of models

More information

Choice Based Revenue Management for Parallel Flights

Choice Based Revenue Management for Parallel Flights Choice Based Revenue Management for Parallel Flights Jim Dai Cornell University, jd694@cornell.edu, Weijun Ding Georgia Institute of Technology, wding34@gatech.edu, Anton J. Kleywegt Georgia Institute

More information

Preface to the third edition Preface to the first edition Acknowledgments

Preface to the third edition Preface to the first edition Acknowledgments Contents Foreword Preface to the third edition Preface to the first edition Acknowledgments Part I PRELIMINARIES XXI XXIII XXVII XXIX CHAPTER 1 Introduction 3 1.1 What Is Business Analytics?................

More information

CORPORATE FINANCIAL DISTRESS PREDICTION OF SLOVAK COMPANIES: Z-SCORE MODELS VS. ALTERNATIVES

CORPORATE FINANCIAL DISTRESS PREDICTION OF SLOVAK COMPANIES: Z-SCORE MODELS VS. ALTERNATIVES CORPORATE FINANCIAL DISTRESS PREDICTION OF SLOVAK COMPANIES: Z-SCORE MODELS VS. ALTERNATIVES PAVOL KRÁL, MILOŠ FLEISCHER, MÁRIA STACHOVÁ, GABRIELA NEDELOVÁ Matej Bel Univeristy in Banská Bystrica, Faculty

More information

sed/star metrics record linkage

sed/star metrics record linkage sed/star metrics record linkage Joshua Tokle, Christina Jones, and Michelle Yin 17 July 2015 American Institutes for Research Outline 1. Introduction to the problem 2. The data 3. Methodology of record

More information

Deep Dive into High Performance Machine Learning Procedures. Tuba Islam, Analytics CoE, SAS UK

Deep Dive into High Performance Machine Learning Procedures. Tuba Islam, Analytics CoE, SAS UK Deep Dive into High Performance Machine Learning Procedures Tuba Islam, Analytics CoE, SAS UK WHAT IS MACHINE LEARNING? Wikipedia: Machine learning, a branch of artificial intelligence, concerns the construction

More information

Methodological challenges of Big Data for official statistics

Methodological challenges of Big Data for official statistics Methodological challenges of Big Data for official statistics Piet Daas Statistics Netherlands THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Content Big Data: properties

More information

Adaptive Time Series Forecasting of Energy Consumption using Optimized Cluster Analysis

Adaptive Time Series Forecasting of Energy Consumption using Optimized Cluster Analysis Adaptive Time Series Forecasting of Energy Consumption using Optimized Cluster Analysis Peter Laurinec, Marek Lóderer, Petra Vrablecová, Mária Lucká, Viera Rozinajová, Anna Bou Ezzeddine 12.12.2016 Slovak

More information

Bid rent model for simultaneous determination of location and rent in land use microsimulations. Ricardo Hurtubia Michel Bierlaire

Bid rent model for simultaneous determination of location and rent in land use microsimulations. Ricardo Hurtubia Michel Bierlaire Bid rent model for simultaneous determination of location and rent in land use microsimulations Ricardo Hurtubia Michel Bierlaire STRC 2011 May 2011 STRC 2011 Bid rent model for simultaneous determination

More information