A random forests approach to prioritize Highway Safety Manual (HSM) variables for data collection


JOURNAL OF ADVANCED TRANSPORTATION, J. Adv. Transp. (2016); 50. Published online 29 December 2015 in Wiley Online Library (wileyonlinelibrary.com)

Dibakar Saha*, Priyanka Alluri and Albert Gan
Department of Civil and Environmental Engineering, Florida International University, West Flagler Street, EC 3680, Miami, FL 33174, U.S.A.

SUMMARY

The Highway Safety Manual (HSM) recommends using the empirical Bayes method with locally derived calibration factors to predict an agency's safety performance. The data needs for deriving these local calibration factors are significant, requiring very detailed roadway characteristics information. Many of these data variables are currently unavailable in most agencies' databases. Furthermore, it is not economically feasible to collect and maintain all the HSM data variables. This study aims to prioritize the HSM calibration variables based on their impact on crash predictions. Prioritization helps identify the influential variables for which data could be collected and maintained for continued updates, thereby reducing intensive data collection efforts. Data were first collected for all the HSM variables from over 2400 miles of urban and suburban arterial road networks in Florida. Using 5 years (2008 through 2012) of crash data, a random forests data mining approach was then applied to measure the importance of each variable in crash frequency predictions for five different urban and suburban arterial facilities: two-lane undivided, three-lane with a two-way left-turn lane, four-lane undivided, four-lane divided, and five-lane with a two-way left-turn lane. Two heuristic approaches were adopted to prioritize the variables: (i) simple ranking based on the individual relative influence of variables; and (ii) clustering based on the relative influence of variables within a specific range.
Traffic volume was found to be the most influential variable. Roadside object density, minor commercial driveway density, and minor residential driveway density were the other variables with significant influence on crash predictions. Copyright 2015 John Wiley & Sons, Ltd.

KEY WORDS: Highway Safety Manual; random forests; variable importance; variable prioritization; data mining

1. INTRODUCTION

The Highway Safety Manual (HSM), published in 2010 by the American Association of State Highway and Transportation Officials, recommends that agencies employ statistically advanced quantitative analyses to improve highway safety. Almost 5 years after its release, state and local agencies are still struggling with its implementation. Meeting the data requirements is the most challenging task in the initial stages of HSM implementation. The HSM provides analytical tools for quantifying the safety effects of potential changes at individual sites on rural two-lane roads, rural multilane highways, and urban and suburban arterials. The manual recommends using the empirical Bayes (EB) method to predict an agency's safety performance. The EB method accounts for the effect of the regression-to-the-mean (RTM) bias [1]. The RTM bias occurs when sites for safety improvements are selected based on short-term observed crash frequency, resulting in a biased estimate of the effectiveness of safety programs. The EB method requires the use of predictive models that estimate the predicted average crash frequency of a site, facility, or network.

*Correspondence to: Dibakar Saha, Department of Civil and Environmental Engineering, Florida International University, West Flagler Street, EC 3720, Miami, FL 33174, U.S.A. E-mail: dsaha003@fiu.edu

Part C of the HSM presents predictive models to estimate the predicted average crash frequency at individual sites on different roadway facilities, including rural two-lane two-way roads, rural multilane highways, and urban and suburban arterials. The general form of the predictive models in the HSM can be expressed as follows [1]:

N_predicted,i = N_spf,i × CMF_1,i × CMF_2,i × ... × CMF_n,i × C_i    (1)

where N_predicted,i is the predicted average crash frequency for a specific year for site type i, N_spf,i is the predicted average crash frequency determined for base conditions with the safety performance function (SPF) for a specific year for site type i, CMF_1,i, ..., CMF_n,i are the crash modification factors (CMFs) for n geometric conditions or traffic control features for site type i, and C_i is the calibration factor that adjusts the SPF to local conditions for site type i.

The predictive models discussed in the HSM include three components: base SPFs, CMFs, and calibration factors. Base SPFs are statistical models used to estimate the average crash frequency for a site type with specified base conditions. CMFs account for the effects of non-base conditions on predicted crashes. Calibration factors account for differences between the jurisdiction and time period for which the predictive models were developed and the jurisdiction and time period to which they are applied by HSM users [1]. The calibration factor is the ratio of the total number of observed crashes to the total number of predicted crashes calculated using the SPFs and CMFs provided in the HSM. Predicted crash frequency is, therefore, a function of roadway geometry, environment, and traffic characteristics. The predictive models are most effective when calibrated to local conditions [2, 3]. Very detailed roadway characteristics information is required to derive local calibration factors for each facility type.
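The predictive-model form in Equation (1) and the ratio definition of the calibration factor can be illustrated with a short sketch. This is a minimal illustration with hypothetical numbers, not values from the study; `predicted_crashes` and `calibration_factor` are names introduced here for clarity.

```python
import math

def predicted_crashes(n_spf, cmfs, calibration=1.0):
    """HSM predictive-model form (Equation 1): base SPF estimate adjusted
    by crash modification factors and a local calibration factor."""
    return n_spf * math.prod(cmfs) * calibration

def calibration_factor(observed, uncalibrated_predicted):
    """Calibration factor: total observed crashes divided by total crashes
    predicted with the SPFs and CMFs (before calibration)."""
    return sum(observed) / sum(uncalibrated_predicted)

# Hypothetical example: three sites of one facility type
n_spf = [2.0, 3.5, 1.5]                       # base SPF predictions per site
cmfs = [[1.1, 0.9], [1.0, 1.2], [0.8, 1.0]]   # CMFs per site
observed = [3, 4, 1]                          # observed crash counts

uncalibrated = [predicted_crashes(n, c) for n, c in zip(n_spf, cmfs)]
C = calibration_factor(observed, uncalibrated)
calibrated = [u * C for u in uncalibrated]
```

By construction, calibrating with C forces the total predicted crashes to match the total observed crashes, which is exactly the ratio definition above.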
Several of these variables are often unavailable in most agencies' databases. While deriving calibration factors for different jurisdictions, several studies identified data collection as one of the most challenging tasks [2, 4–6]. Collecting and maintaining all the data variables on the entire road network for the purpose of HSM implementation is not considered cost feasible. Furthermore, some variables are likely to have more impact on crash predictions than others. The Florida Department of Transportation (FDOT) therefore desires a process to streamline the data requirements while minimizing the potential impact on the quality of analysis. This initiative works on the assumption that not all of the HSM variables have a high impact on safety predictions. It thus becomes beneficial to assess and rank the impact of each variable on safety predictions. The ranking could then be used to identify the variables for which data should be collected and maintained for continued updates. This paper is based on an FDOT study to identify and rank the data variables discussed in the HSM for five different urban and suburban arterial facilities in Florida [7]. The facilities are two-lane undivided, three-lane with a two-way left-turn lane (TWLTL), four-lane undivided, four-lane divided, and five-lane with a TWLTL. The study applies the random forests algorithm to prioritize variables in order to optimize data collection efforts.

2. LITERATURE REVIEW

Very few studies have focused on the influence of the HSM data variables on safety predictions [8–10]. These studies employed the typical sensitivity analysis approach to assess the influence of the HSM-recommended variables on crash frequency predictions. The sensitivity of a variable was measured by comparing the outputs obtained using the variable's maximum, minimum, and average values with the output generated using the actual values of the variable.
The variables that showed significant variation in outputs were identified as influential. The main limitation of this approach is that only a single variable is evaluated at a time; possible associations between variables are ignored while measuring the effect of a single variable on crash predictions [11]. Data mining procedures assist in learning and extracting useful information from data [12]. Different learning algorithms are increasingly being applied in safety studies to capture complex and more accurate relations between data variables and crash characteristics [13–18]. For example, the decision tree,

also known as classification and regression tree (CART) [19], is a popular data mining approach in traffic safety studies [20, 21]. Unlike statistical regression, the CART procedure does not require a prespecified functional form, probability distribution, variable transformation, or error structure to fit models [22]. Furthermore, a major advantage of the CART method is that it provides interpretable results, contrary to the so-called black box phenomenon typically attributed to data mining techniques. Several transportation safety studies have explored the benefits of the CART procedure to quantify the influence of contributing factors on crash occurrence [23, 24], injury severity [25, 26], and crash patterns [27, 28]. CART models, however, can be unstable, particularly when used to predict new data. The ensemble-of-trees approach is a step forward that provides more stability and improves prediction accuracy over a single decision tree or CART model [29, 30]. In a tree-based ensemble, a large number of decision trees are fitted and their predictions are combined, yielding more robust prediction estimates. There are two ensemble approaches based on decision trees: random forests and boosted regression trees. Both ensemble methods fit a sufficiently large number of trees but with different learning procedures. Random forests yield predictions with low variance [30]. While a random forests model involves tuning only a few parameters, and the default parameterization often produces the best performance, boosted regression trees require an increased number of trials with tuning of several parameters for an optimal solution, with no suggested default parameter values [30–32]. In addition, the boosting procedure is more susceptible to overfitting [33].
The random forests technique thus appears to be less complicated, less sensitive to outliers and parameter settings, and faster compared with boosted regression trees. In transportation, Harb et al. [34] used the random forests technique to rank the variables of angle, head-on, and rear-end crashes associated with drivers' crash avoidance maneuvers. Siddiqui et al. [35] adopted the random forests approach to identify and rank macro-level crash and planning variables so as to incorporate proactive safety measures in transportation planning. Hossain and Muromachi [36] developed random forests of logit models to identify the influential factors associated with traffic crashes on basic freeway segments and ramp areas. Hasan and Abdel-Aty [37] adopted the random forests method to predict crashes associated with reduced visibility conditions using real-time traffic flow data on freeways. Xu et al. [38] used the random forests technique to analyze the propensity of crashes on freeways with traffic flow variables for each level of service (A through F). Some studies applied the random forests technique primarily to prescreen the most important variables from a large data set, which were then used as inputs in developing specific models [39–41]. Haleem and Gan [39] selected nine variables from a group of 17 continuous and categorical variables related to roadway geometry, traffic, environment, vehicle, and driver. Shi and Abdel-Aty [40] selected 20 variables from a total of 37 variables collected from traffic detectors installed on 75 miles of an expressway network. Yu and Abdel-Aty [41] selected the four most critical variables from the variable importance ranking of the random forests model and used them as inputs to perform severity analysis using three other modeling techniques.
The review of the aforementioned studies revealed that random forests is a promising data mining approach because of its ability to consider many variables, inherently model their relationships, and determine variable importance with no prior specification of potential model forms. The application of the random forests data mining procedure in this study aims to identify and prioritize calibration variables by measuring their influence on crash frequency predictions.

3. DATA SET

This section discusses the data needs, data collection, and data preparation efforts undertaken in this study. The study collected the HSM-specific data variables for urban and suburban arterials in Florida. The process adopted to collect the necessary data associated with roadway characteristics and crash attributes is first discussed. The preparation of the final data set is then summarized.

3.1. Roadway characteristics data

Table I lists the data variables required for classifying urban and suburban arterials in Florida and the variables identified in the HSM for estimating crash predictions for these facilities [1]. The primary

Table I. Variables identified in the HSM for urban and suburban arterials.

For segment classification:
- Area type: categorical (rural and urban)
- Functional class: categorical (rural principal arterial interstate, rural principal arterial other, rural minor arterial, rural major collector, rural minor collector, rural local, urban principal arterial interstate, urban principal arterial other freeways and expressways, urban principal arterial other, urban minor arterial, urban collector, and urban local)
- Number of through traffic lanes
- Roadway type: categorical (divided, not divided, and one-way)
- Median type: categorical (TWLTL*, concrete, turf, gravel, paved, and others)

For calibration of predictive models:
- Annual average daily traffic (AADT)
- Segment length
- Median width: categorical (10 levels: 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 ft)
- Number of major commercial driveways
- Number of major residential driveways
- Number of major industrial driveways
- Number of minor commercial driveways
- Number of minor residential driveways
- Number of minor industrial driveways
- Number of other driveways
- Number of roadside objects
- Speed limit: categorical (two levels: 30 mph or less and more than 30 mph)
- Presence of on-street parking: categorical (two levels: absent and present)
- Presence of lighting: categorical (two levels: absent and present)
- Presence of automated speed enforcement: categorical (two levels: absent and present)
- Type of on-street parking: categorical (two levels: angle parking and parallel parking)
- Parking by land use type: categorical (two levels: residential/other and commercial or industrial/institutional)
- Curb length with on-street parking
- Offset to roadside objects: categorical (seven levels: 2, 5, 10, 15, 20, 25, and 30 ft)

HSM, Highway Safety Manual. *TWLTL indicates two-way left-turn lane. Median width is applicable only for four-lane divided arterials. Type of on-street parking, parking by land use type, and curb length with on-street parking pertain to the presence of on-street parking. Offset to roadside objects pertains to the number of roadside objects.

The primary source of information for these data variables in Florida is the roadway characteristics inventory (RCI) database maintained by FDOT. First, all the segments that are part of Florida's State Highway System (SHS) were extracted from the RCI database. The variables area type, functional classification, roadway type, number of lanes, and median type were then used to categorize segments on the SHS network into the following facilities:

- urban and suburban two-lane undivided arterials;
- urban and suburban three-lane arterials with a TWLTL;
- urban and suburban four-lane undivided arterials;
- urban and suburban four-lane divided arterials; and
- urban and suburban five-lane arterials with a TWLTL.

Figure 1 shows examples of the five types of urban and suburban arterial segments. Of the calibration variables listed in Table I, annual average daily traffic (AADT), median width, and speed limit were extracted from the RCI. These variables were used to divide the roadway sections

into homogeneous segments, and the new segment length was calculated. Each homogeneous segment is unique in terms of its feature values; that is, AADT, median width, and speed limit do not vary throughout the extent of a homogeneous segment. Based on the minimum segment length used for calibration of the predictive models for urban and suburban arterials in the study by Srinivasan et al. [42], segments shorter than 0.04 miles were excluded from further analysis. The generated homogeneous segments of length equal to or greater than 0.04 miles were used as the reference points for collecting data on the remaining variables. Note that the RCI provides information on the type of parking and the number of luminaires along a segment. However, these variables are not at the required level of detail and hence were not extracted from the RCI.

Figure 1. Urban and suburban arterial facility types.

An in-house web-based data collection application was developed and used to facilitate the data collection process. The application works as follows. It first reads a linear-referenced roadway segment, converts its coordinates to the Google Maps projection on the fly, and then displays the segment on Google Maps, in either the street view or the aerial view, depending on the user selection. Similar to a typical video log system, the application allows the user to scan from the beginning to the end milepost of a segment for smooth observation of the roadway features. Figure 2 shows a screen capture of this application. As can be seen from the figure, the application has two panels. The left panel displays the segments along with begin and end mileposts on Google Maps. The right panel includes an interactive section to record the observed data pertaining to each segment.
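The homogeneous segmentation step can be sketched as follows. The function and the milepost data are hypothetical; the sketch assumes consecutive pieces merge only while their (AADT, median width, speed limit) tuple is unchanged, and it applies the 0.04-mile minimum length from the text.

```python
MIN_LENGTH = 0.04  # minimum segment length in miles, after Srinivasan et al.

def homogeneous_segments(sections):
    """Merge consecutive route pieces whose feature tuple is identical into
    homogeneous segments, then drop segments shorter than the minimum length.
    Each input section is (begin_milepost, end_milepost, features)."""
    merged = []
    for begin, end, features in sections:
        if merged and merged[-1][2] == features and merged[-1][1] == begin:
            merged[-1] = (merged[-1][0], end, features)  # extend current segment
        else:
            merged.append((begin, end, features))
    return [(b, e, f) for b, e, f in merged if e - b >= MIN_LENGTH]

# Hypothetical route pieces with (AADT, median_width_ft, speed_limit_mph)
sections = [
    (0.00, 0.10, (21000, 20, 45)),
    (0.10, 0.25, (21000, 20, 45)),   # same features: merged with the previous piece
    (0.25, 0.27, (23000, 20, 45)),   # 0.02 mi after merging: dropped as too short
    (0.27, 0.60, (23000, 30, 45)),
]
segments = homogeneous_segments(sections)
```

Each surviving segment is homogeneous in AADT, median width, and speed limit over its full extent, mirroring the description above.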

Figure 2. Web-based application to collect data.

3.2. Crash data

Five years of crash data (2008 through 2012) were extracted from FDOT's Unified Basemap Repository system. The system archives crash records separately for on-system and off-system roads. The on-system crash repository provided information on crashes that occurred only on state roads. Segment-related crashes were then obtained by excluding from the total crashes those that occurred at intersections, within intersection influence areas, and at entrance or exit ramps.

3.3. Final data set

Once the data collection process was completed, the next step was to prepare a final data set by merging the roadway characteristics data and the crash data. The crash data files contain crash location information, including the route number and the milepost at which each crash occurred. This information was used to assign crashes to segments; in other words, crashes were assigned to segments by matching their location information. Crashes that occurred at the boundary point of two roadway segments were consistently assigned to the beginning segment. The number and types of crashes on each segment were counted using a structured query language (SQL) query. The merged data set was thoroughly checked for possible outliers and inconsistencies. For example, sites with extremely high or extremely low AADT values were not considered in the analysis. Individual sites with a large difference in AADT between consecutive years were also discarded. Furthermore, sites identified with any type of construction work were excluded. The final data set covered over 2400 miles of roadways and the crashes observed on them over the 5 years. Table II gives the descriptive statistics of the final data set for each facility.

4. METHODOLOGY

This section discusses the methodology of the random forests data mining approach.
It first includes an overview of the single decision tree, followed by the ensemble method, the random forests procedure, the optimization of random forests models, and how variable importance is measured using the random forests algorithm.

Table II. Descriptive statistics.

For each facility (two-lane undivided, three-lane with a TWLTL, four-lane undivided, four-lane divided, and five-lane with a TWLTL), the table reports:

- Sample size: number of segments (total), roadway length (total), number of crashes (total), and number of crashes per mile per year (mean).
- Input variables for the random forests models, each summarized by Min, Max, Mean, and SD: AADT (veh/day); segment length (miles); major commercial, major residential, major industrial, minor commercial, minor residential, minor industrial, and other driveway densities; and roadside fixed object density.
- Categorical input variables, summarized by the number of segments in each category*: speed limit (30 mph or less, more than 30 mph), presence of on-street parking (yes, no), presence of lighting (yes, no), and presence of automated speed enforcement (yes, no).

(Continues)

Table II. (Continued)

- Median width, by category: 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 ft.

AADT, annual average daily traffic; SD, standard deviation; TWLTL, two-way left-turn lane; mph, miles per hour. *The value represents the number of segments attributed to each category of these categorical variables. Median width is not applicable to two-lane undivided, three-lane with a TWLTL, four-lane undivided, and five-lane with a TWLTL arterials.

4.1. Single decision tree

A single decision tree model partitions the predictor space into a number of mutually exclusive regions. The partitioning occurs recursively: the space is first divided into two regions, each region is further divided into two more regions, and so on, with each subdivided region containing fewer and fewer observations. The process continues until a prespecified stopping criterion (e.g., a minimum number of observations in a node) is reached [29, 30, 43]. Figure 3 demonstrates the concept of a decision tree model with a simple data set consisting of two predictor variables X_1 and X_2. Figure 3(a) shows the recursive partition of the two-dimensional space formed by X_1 and X_2 into five regions R_1, R_2, R_3, R_4, and R_5 at different values of X_1 (s_1, s_4) and X_2 (s_2, s_3). Each region yields a constant term by averaging the outputs of the observations belonging to it (i.e., the average of the y_m in R_m, m = 1, 2, ..., 5). The mean predictions from the regions are then combined to give the overall model predictions. Figure 3(b) represents the partitioning of the predictor space in an inverted tree-based structure.
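The region-averaging prediction described for Figure 3 can be sketched in one dimension: each region's constant is the mean response of the observations that fall in it. The split point and data below are hypothetical, and `fit_regions`/`predict` are names introduced here for illustration.

```python
import bisect

def fit_regions(split_points, xs, ys):
    """Partition a one-dimensional predictor at the given split points and
    store the mean response of each region (the constant term per region)."""
    means = []
    for i in range(len(split_points) + 1):
        lo = split_points[i - 1] if i > 0 else float("-inf")
        hi = split_points[i] if i < len(split_points) else float("inf")
        region_ys = [y for x, y in zip(xs, ys) if lo <= x < hi]
        means.append(sum(region_ys) / len(region_ys))
    return means

def predict(split_points, means, x):
    """Predict by returning the mean response of the region containing x."""
    return means[bisect.bisect_right(split_points, x)]

# Toy data: one predictor split at s = 5 into two regions
xs = [1, 2, 3, 7, 8, 9]
ys = [1.0, 1.2, 0.8, 4.0, 4.2, 3.8]
splits = [5]
means = fit_regions(splits, xs, ys)
```

A real CART fit also chooses the split points themselves, by searching for the split that most reduces node impurity; here the split is fixed to keep the region-averaging step visible.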
The node at the top, also called the root, corresponds to the entire data set, and the terminal nodes at the bottom, also called leaves, correspond to the regions (e.g., R_1 through R_5).

Figure 3. Schematic representation of a single decision tree.

4.2. Ensemble method

The single decision tree model, however, is very unstable; a slight change in the training data can lead to a very different decision tree [44]. Ensemble methods provide an efficient way to improve prediction accuracy over a single decision tree. The underlying principle is to grow a large number of single decision trees and combine their results by either voting or averaging, depending on the type of response variable. The ensemble technique is, therefore, more stable and reduces the variance [45, 46]. Note that ensemble methods work successfully when the individual trees in the ensemble are different from, or not correlated with, each other [30, 44, 46].

4.3. Random forests

Breiman [47] introduced the random forests ensemble technique based on two powerful machine-learning ideas: bagging by Breiman [48] and the random features subspace by Ho [49]. To understand how the random forests approach works, these two concepts are described as follows. In bagging, or bootstrap aggregation, each tree in the ensemble is grown with a different bootstrap training sample that randomly draws data from the original data set with replacement. For each tree, the observations that are not in the bootstrap sample constitute the test data set, also known as the out-of-bag (OOB) data set. For K trees, bagging constructs each tree with bootstrap data, predicts the output using the OOB data, and combines the results from the K trees through averaging. However, it might happen that essentially the same regression tree is grown throughout the ensemble, providing no improvement in prediction accuracy over the single decision tree. For example, when a particular variable is stronger than the other variables at a node for splitting, that variable is likely to be selected again and again in subsequent trees. This situation prevails even when the importance of a particular variable is only slightly higher than that of some other variables, and it leads to inadequate learning from the trees [44]. To avoid possible correlation between individual trees, the idea of the random features subspace is incorporated into the bagging principle [29, 30, 44]. Instead of considering all the variables, the random features subspace selects a subset of features (i.e., a fraction of the predictor variables) at each node. The best split point is then determined by applying the splitting algorithm to the selected subset of variables. For each node, the splitting algorithm searches over the selected variables within the range of all their possible values and picks the variable and split value that produce maximum homogeneity in the successor nodes [30, 50]. The random forests algorithm thus employs two levels of randomness to ensure the diversity of individual trees: (i) a different training data set of the same sample size (i.e., bootstrapped data) for each tree; and (ii) a different set of candidate inputs to split each node.
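The two levels of randomness can be sketched with NumPy; the sample sizes below are arbitrary. On average about e^-1 ≈ 37% of the observations are never drawn into a bootstrap sample, and those observations form that tree's OOB set.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                                   # observations in the original data set

# Bagging: each tree gets a bootstrap sample of size n drawn with replacement;
# observations never drawn form that tree's out-of-bag (OOB) set.
bootstrap_idx = rng.integers(0, n, size=n)
oob_mask = np.ones(n, dtype=bool)
oob_mask[bootstrap_idx] = False
oob_fraction = oob_mask.mean()               # expected ≈ (1 - 1/n)^n → e^-1 ≈ 0.368

# Random features subspace: at each node, only m_try of the p predictors are
# candidates for the split (for regression, Breiman suggests m_try ≈ p/3).
p, m_try = 12, 4
candidates = rng.choice(p, size=m_try, replace=False)
```

In a full implementation, both draws are repeated independently for every tree and every node, which is what decorrelates the trees in the forest.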
The randomization ensures improved prediction performance through the decorrelation of trees [45].

4.4. Parameter optimization

To optimize the performance of a random forests model, the following parameters require fine-tuning [30, 43, 50, 51]: the number of trees in the forest (n_tree) and the number of features selected at each node for splitting (m_try).

Liaw and Wiener [44] suggested growing forests with a good number of trees (n_tree), until adding more trees no longer improves prediction performance. Breiman [45] suggested the following three trials with m_try for random forests regression: (1) m_try equal to one-third of the number of predictor variables (i.e., m_try = p/3), (2) m_try equal to half of that (i.e., m_try = (1/2)(p/3)), or (3) m_try equal to twice that (i.e., m_try = 2(p/3)). Two performance measures help determine the optimal pair of n_tree and m_try values for random forests models: the mean squared error (MSE) and the percent variance explained (R^2) [50, 52]. The estimates of MSE and R^2, shown in Equations (2) and (3), respectively, are based on the OOB data set [50, 53]:

MSE_OOB = (1/n) * Σ_{i ∈ OOB} (y_i - ŷ_i)^2    (2)

R^2 = 1 - MSE_OOB / Var(y_i)    (3)

where MSE_OOB is the mean squared error using the OOB sample, y_i is the observed value of the ith observation in the OOB sample, ŷ_i is the predicted value of the ith observation in the OOB sample, n is the number of observations in the OOB sample, and Var(y_i) is the variance of the response variable Y, computed as (1/n) * Σ_{i ∈ OOB} (y_i - ȳ)^2, where ȳ is the mean of the y_i in the OOB sample.

4.5. Variable importance

Variable importance is a measure used to rank the variables in the predictor set based on their importance in producing accurate predictions [52]. The importance of a variable indicates its contribution to the output prediction when all other variables are present in the model. Variable importance in a random forests regression model is measured by the decrease in node impurity, otherwise called the increase in node purity. The node purity of a variable corresponds to the summation of the improvements due to splits on the variable in a tree.
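Equations (2) and (3) can be computed directly from an OOB sample. The observed and predicted values below are hypothetical toy numbers, not outputs of any model in the study.

```python
def oob_mse(y, y_hat):
    """Equation (2): mean squared error over the OOB observations."""
    n = len(y)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / n

def oob_r2(y, y_hat):
    """Equation (3): percent variance explained, R^2 = 1 - MSE_OOB / Var(y),
    with Var(y) the population variance of the OOB responses."""
    n = len(y)
    y_bar = sum(y) / n
    var = sum((yi - y_bar) ** 2 for yi in y) / n
    return 1 - oob_mse(y, y_hat) / var

# Toy OOB sample (hypothetical crash counts and model predictions)
y = [0, 2, 4, 6]
y_hat = [1, 1, 5, 5]
```

Here MSE_OOB = 1 and Var(y) = 5, so R^2 = 0.8; a lower MSE_OOB mechanically gives a higher R^2, which is why the two criteria agree when selecting the optimal (n_tree, m_try) pair.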
The cumulative improvements over all the trees are then averaged to give the overall increase in node purity, or decrease in node impurity. The following paragraphs describe how variable importance is measured from the node impurity values. In the case of a regression tree, the impurity of a node t, i(t), is measured by the sample variance [46, 54]:

i(t) = (1/N_t) * Σ_{y_i ∈ t} (y_i - ȳ(t))^2    (4)

where y_i is the observed value of the response variable Y, ȳ(t) is the empirical mean of the y_i at node t, and N_t is the number of data points at node t. The decrease in node impurity is measured by comparing the impurity of the parent node before splitting with the impurity of the child nodes after splitting [55]. Let a variable X_m split the node t into two child nodes t_L and t_R at split point s, such that t_L contains all the data points with X_m ≤ s and t_R contains all the remaining data points. The decrease in impurity of the node t due to the split s on the variable X_m, Δi(s, t)_{X_m}, is given by [46, 54, 55]:

Δi(s, t)_{X_m} = i(t) - (N_{t_L} / N_t) * i(t_L) - (N_{t_R} / N_t) * i(t_R)    (5)

where i(·) is the impurity measure of a given node, N_t is the number of data points at the parent node t, and N_{t_L} and N_{t_R} are the numbers of data points in the child nodes t_L and t_R, respectively. The importance of the variable X_m, Imp(X_m), is then estimated by adding up the weighted decreases in node impurity, Δi(s, t)_{X_m}, over all nodes t where X_m is used for splitting, averaged over all the trees in the forest [46]:

Imp(X_m) = (1/N_T) * Σ_T Σ_{t ∈ T} (N_t / N) * Δi(s, t)_{X_m}    (6)

where N_T is the number of trees in the forest, N_t is the number of data points at node t, and N is the total sample size.

5. ANALYSIS

The randomForest package in the statistical software R was used to develop the random forests regression models [50, 56]. The analysis for this study was carried out based on the total number of crashes in 5 years for the following five urban and suburban arterial facilities: two-lane undivided, three-lane with a TWLTL, four-lane undivided, four-lane divided, and five-lane with a TWLTL. The variables listed in Table I were used as input variables, with necessary adjustments. For example, type of on-street parking, parking by land use type, and curb length with on-street parking are only applicable for locations with on-street parking; these features have values only when a segment has on-street parking, so they are by construction highly correlated with the presence of on-street parking variable. Similarly, offset to roadside objects is only applicable for locations with roadside objects. To avoid the known effects of such correlation, these variables were not included in the analysis. Additionally, driveways were included in the model as densities, that is, driveways per mile, instead of numbers of driveways. A reasonable assumption was made that roadway geometry characteristics did not change during the analysis period.

First, random forests regression models were developed with different sets of n_tree and m_try values. Forests were grown with 500, 1000, 5000, and trees, based on the three m_try values suggested by Breiman [47]. A total of 12 models were therefore built from all the pairs of n_tree and m_try values for each urban and suburban arterial facility type. Note that m_try is equal to 4 when one-third of the predictor variables forms the subset and equal to 2 when half of one-third of the predictor variables forms the subset.
When the subset is formed by two times one-third of the predictor variables, m_try is equal to 9 for urban and suburban four-lane divided arterial facilities and equal to 8 for the other facilities. The models were evaluated with the MSE and R^2 measures shown in Equations (2) and (3). The optimized random forests model for each facility was chosen as the model that yielded the lowest MSE and, correspondingly, the highest R^2 among all the models. The importance of the variables was then estimated using the optimized random forests model. Note that although prediction models are usually validated through cross validation or with a validation sample different from the training sample, a random forests model does not require a cross-validation procedure or an exclusive test set to measure its performance [51]. Each individual tree in a random forest is fitted with a separate bootstrapped sample for training and an OOB sample (i.e., the observations not included in the bootstrapped sample) for estimating prediction performance and variable importance scores. An OOB sample is composed of approximately one-third of the total number of observations. In a random forests model formed by a large number of trees, an observation is highly likely to be sampled into the training set in some trees and into the OOB set elsewhere. Cross validation, or the simple validation of a test data set, is therefore an intrinsic feature of the random forests algorithm, and no separate sample is required to assess prediction performance.

6. RESULTS

This section presents the results of the analysis performed in this study. First, the random forests models were evaluated to optimize the parameter values. Based on the optimal parameter values, the importance of the variables was then determined.
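The study's parameter search used the randomForest package in R; the sketch below reproduces the same n_tree/m_try grid idea with scikit-learn's RandomForestRegressor, which also supports OOB scoring. This is an analogue, not the study's code: the data are synthetic, the tree counts are kept small for speed, and all names and settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 400, 12
X = rng.normal(size=(n, p))
# Synthetic response driven mostly by the first two predictors
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)

results = {}
for n_tree in (100, 300):                               # study used up to thousands
    for m_try in (p // 3, max(1, p // 6), 2 * p // 3):  # Breiman's three trials
        rf = RandomForestRegressor(n_estimators=n_tree, max_features=m_try,
                                   oob_score=True, random_state=0, n_jobs=-1)
        rf.fit(X, y)
        results[(n_tree, m_try)] = rf.oob_score_        # OOB R^2, as in Equation (3)

# Pick the (n_tree, m_try) pair with the highest OOB R^2
best = max(results, key=results.get)
```

Because `oob_score_` is computed from the out-of-bag predictions, no held-out test set is needed to compare the parameter pairs, mirroring the validation argument made in the text.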

6.1. Performance evaluation

Table III shows the performance measures of the random forests models for each pair of n_tree and m_try values. A lower MSE together with a higher R² indicates a better-performing model. For a given m_try, the MSE and R² values did not change significantly as the number of trees was increased from 500 upward; the change due to n_tree is less pronounced than that due to m_try. In most cases, the models with m_try = (1/2)(p/3) performed worse than the models with the two other m_try values (i.e., m_try = p/3 and m_try = 2(p/3)). A large ensemble of trees is considered suitable for stable estimates of variable importance. Correspondingly, the models with the largest n_tree setting and m_try = p/3 were found to have the lowest MSE and the highest R² for the two-lane undivided, four-lane undivided, four-lane divided, and five-lane with a TWLTL facilities. For three-lane with a TWLTL, the model with the largest n_tree setting and m_try = 2(p/3) performed better than the models with the other parameter pairs.

6.2. Variable importance

Variable importance in the random forests models was measured by the increase in node purity. Table IV shows the increase in node purity of each variable for urban and suburban two-lane undivided, three-lane with a TWLTL, four-lane undivided, four-lane divided, and five-lane with a TWLTL facilities. Because the increase in node purity varies greatly with sample size, it is difficult to set a specific cut-off point or threshold value for a variable's significance level. The relative contribution of a variable was therefore computed by dividing its node purity value by the sum of the node purity values of all the variables. Annual average daily traffic (AADT) is observed to be the most influential variable for crash predictions in all five facilities.
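The relative-contribution normalization just described (each node-purity increase divided by the sum over all variables) can be sketched as follows; the node-purity values here are hypothetical, not those reported in Table IV.

```python
def relative_contribution(node_purity):
    """Convert raw increase-in-node-purity scores into percentage
    contributions, then rank variables from most to least influential."""
    total = sum(node_purity.values())
    contrib = {v: 100.0 * p / total for v, p in node_purity.items()}
    ranked = sorted(contrib, key=contrib.get, reverse=True)
    return contrib, ranked

# Hypothetical increase-in-node-purity values for four variables
purity = {"AADT": 520.0, "roadside_density": 130.0,
          "minor_commercial": 90.0, "speed_limit": 12.0}
contrib, ranked = relative_contribution(purity)
```

Normalizing this way makes importance scores comparable across facility types with very different sample sizes, which is what motivates the percentage-based clusters introduced later in the paper.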
Table III. Performance of random forests models: MSE and R² for each facility type (two-lane undivided; three-lane with a TWLTL; four-lane undivided; four-lane divided; five-lane with a TWLTL) across the n_tree settings and the three m_try values (p/3, (1/2)(p/3), and 2(p/3)). TWLTL, two-way left-turn lane.

Table IV. Variable importance: increase in node purity, relative contribution (%), and rank of each variable, by facility type. Variables: AADT; roadside fixed object density; major commercial, major residential, major industrial, minor commercial, minor residential, minor industrial, and other driveway densities; median width; speed limit; presence of on-street parking; presence of lighting; and use of automated speed enforcement. TWLTL, two-way left-turn lane; AADT, annual average daily traffic. N/a indicates that a particular variable takes the same value (yes or no) for all the segments of a particular facility. Median width is not applicable to two-lane undivided, three-lane with a TWLTL, four-lane undivided, and five-lane with a TWLTL facilities.

The roadside object density variable is ranked as the second most influential variable for the two-lane undivided, four-lane undivided, and four-lane divided facilities, while it is ranked third for both the three-lane and five-lane with a TWLTL facilities. For two-lane undivided facilities, minor commercial and minor residential driveway densities are the most influential variables after AADT and roadside object density. These four most significant variables together contributed approximately 80% in modeling crash frequency predictions. The variables presence of on-street parking, presence of lighting, major commercial driveway density, and minor industrial driveway density are the next most influential, with almost equal contributions (on the order of 3%) to the crash predictions. Among the other variables, speed limit, other driveway density, and major industrial driveway density each contributed slightly more than 1%, while major residential driveway density and use of automated speed enforcement had trifling contributions (i.e., less than 1%).

For three-lane with a TWLTL facilities, AADT is followed by minor commercial driveway density, roadside object density, and major commercial driveway density in the ranking of influential variables for predicting crashes. Among the other variables, minor residential and minor industrial driveway densities are ranked higher than their major counterparts. Presence of lighting and speed limit are the other two variables with contributions above 1%.

For four-lane undivided facilities, AADT, roadside object density, and major residential and minor commercial driveway densities are the four most influential variables, each contributing more than 10% in modeling crash frequency predictions. Minor residential and minor industrial driveway densities each had good contributions (i.e., between 5% and 10%), while each of the remaining variables except other driveway density contributed approximately 2% to 3%.

For four-lane divided facilities, in addition to AADT and roadside object density, both major and minor commercial driveway densities played a significant role in crash frequency predictions; their contributions were almost equal and together accounted for approximately 25% of the total. Median width is another significant variable, with a good contribution (i.e., greater than 5%).
Among the remaining nine variables, only minor and major residential driveway densities and presence of lighting contributed more than 1% each.

For five-lane with a TWLTL facilities, it is evident that AADT, major commercial driveway density, roadside object density, and minor commercial driveway density showed substantial contributions compared with the other variables. Cumulatively, these four variables contributed approximately 93% to the crash predictions. Minor residential driveway density made a fair contribution (i.e., between 2% and 3%) to the model fit, while the contributions of major residential driveway density, speed limit, and presence of lighting were similar (approximately 1%).

In summary, AADT, the major and minor driveway densities, and roadside fixed object density were more important, while speed limit, presence of on-street parking, and use of automated speed enforcement were relatively less important.

7. VARIABLE PRIORITIZATION

Based on the variable importance results of the random forests models, two heuristic approaches can be used to prioritize variables and optimize data collection efforts. The first approach is to follow the direct ranking of the variables, as shown in Table IV. Based on the resources available, a local agency (e.g., FDOT) might intend to collect data for the first x variables. The second approach is to cluster the variables by their relative contribution to crash predictions, grouping them as follows:

Cluster 1: variables individually contributing at least 25%;
Cluster 2: variables individually contributing at least 5% but less than 25%;
Cluster 3: variables individually contributing at least 1% but less than 5%; and
Cluster 4: variables individually contributing less than 1%.

While Saha et al.
[22] prioritized the HSM calibration variables into clusters of high-priority and low-priority variables only, the second approach in our study adds two more cluster layers to provide more flexibility and additional decision criteria for data collection. Table V summarizes the results of clustering the variables. Table V shows that AADT is the only variable that belongs to Cluster 1. Note that AADT data are readily available in FDOT's RCI database and also in other agencies' databases. Cluster 2

Table V. Prioritization of variables by clustering, for each facility type (two-lane undivided; three-lane with a TWLTL; four-lane undivided; four-lane divided; and five-lane with a TWLTL). Cluster 1 contains AADT alone for every facility. Cluster 2 contains roadside object density and minor commercial driveway density for all facilities, together with additional facility-specific variables (such as major commercial and minor residential driveway densities). Clusters 3 and 4 contain the remaining variables (e.g., presence of on-street parking, presence of lighting, speed limit, the other driveway densities, median width, and use of automated speed enforcement) according to their relative contributions in Table IV. TWLTL, two-way left-turn lane; AADT, annual average daily traffic.

consists of three variables for two-lane undivided, four variables each for three-lane with a TWLTL and four-lane divided, and five variables for four-lane undivided. Of the variables in Cluster 2, roadside object density and minor commercial driveway density are common to all the facilities. In Cluster 3, the two-lane undivided arterial facility has seven variables, the three-lane with a TWLTL and four-lane undivided arterial facilities have five variables each, the five-lane with a TWLTL arterial facility has four variables, and the four-lane divided arterial facility has only three. An agency can therefore easily decide whether to collect data for the variables up through Cluster 2 or Cluster 3, if not for all the variables.

8. CONCLUSIONS

Calibration factors are required to adjust crash frequencies predicted by the HSM-default SPFs to local site conditions. The HSM requires very detailed roadway geometry, traffic, and crash characteristics data to derive local calibration factors. Unfortunately, many of the data variables needed to derive the local calibration factors are uncommon and are currently unavailable in states' roadway inventory databases. Because some variables are likely to have more impact on safety predictions than others, ranking the data variables will help prioritize the additional data to be collected and maintained for continued updates. As such, this study aimed to prioritize the HSM variables for calibration purposes by determining their influence on crash predictions. The study applied the random forests algorithm to determine the impact of the HSM-recommended variables on crash predictions for urban and suburban arterials in Florida. The analysis was based on data collected from over 2400 miles of segments on the state road network in Florida. The models identified AADT as the most influential variable.
In addition to AADT, roadside object density and minor commercial driveway density were invariably found within the top four influential variables in all the models. Two heuristic approaches were used to prioritize the variables based on their relative importance: one by simple ranking of the variables and the other by clustering them. The ranking can guide data collection by selecting the first few variables (e.g., the first 5 or first 7), while the clustering can assist in collecting data for groups of variables with similar contributions to crash predictions. Furthermore, the variable importance scores reduce the likelihood of erroneously substituting default assumptions for sensitive variables. For example, the HSM considers the roadside object density variable less sensitive and recommends making suitable assumptions when data are not available; however, this study ranks roadside object density as one of the major influential variables for crash predictions on urban and suburban arterials in Florida.

One limitation of this study is that the results apply only to urban and suburban arterial facilities in Florida. Because roadway geometry, environment, traffic, and crash characteristics vary considerably across jurisdictions, the results are not, in general, transferable to other jurisdictions without validating the models on local data. Another limitation is associated with the random forests algorithm itself: although the technique aggregates results (e.g., variable importance) from thousands of individual model outputs to provide stable and accurate predictions, it does not provide confidence intervals for the importance scores. Despite these limitations, the random forests approach is a useful technique for data prioritization that can minimize the challenging and resource-intensive data collection task.
Transportation agencies can adopt the strategy presented in this study for efficient data collection. The procedure can also be extended to prioritize data variables for other facility types (e.g., intersections, freeways, ramps, interchanges, and rural highways). The study recommends investigating the actual effect of roadside fixed object density in other jurisdictions, since the HSM characterizes it as less sensitive to crash frequencies while this study ranks it among the most influential variables. Future research can apply other modeling techniques to identify influential variables and evaluate the sensitivity of the variable importance results across techniques.
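As a minimal illustration of the four-threshold clustering heuristic from Section 7 (at least 25%; 5% to 25%; 1% to 5%; below 1%), the sketch below assigns variables to clusters from their relative contributions. The contribution percentages are hypothetical examples, not values from Tables IV or V.

```python
def cluster_variables(contrib_pct):
    """Assign each variable to one of the four priority clusters of
    Section 7, based on its relative contribution (in percent)."""
    clusters = {1: [], 2: [], 3: [], 4: []}
    for var, pct in sorted(contrib_pct.items(), key=lambda kv: -kv[1]):
        if pct >= 25.0:
            clusters[1].append(var)
        elif pct >= 5.0:
            clusters[2].append(var)
        elif pct >= 1.0:
            clusters[3].append(var)
        else:
            clusters[4].append(var)
    return clusters

# Hypothetical relative contributions (%) for one facility type
contrib = {"AADT": 62.0, "roadside_density": 11.0,
           "minor_commercial": 8.5, "presence_of_lighting": 2.4,
           "automated_speed_enforcement": 0.3}
clusters = cluster_variables(contrib)
```

An agency could then budget data collection by walking down the clusters, stopping at Cluster 2 or Cluster 3 as resources allow, mirroring the decision process the paper describes.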