The Pennsylvania State University. The Graduate School. College of Engineering. A Thesis in. Civil Engineering. Seunghwan Shin Seunghwan Shin


The Pennsylvania State University
The Graduate School
College of Engineering

SELECTION BIAS AND HETEROGENEITY IN SEVERITY MODELS - SOME INSIGHTS FROM AN INTERSTATE ANALYSIS

A Thesis in Civil Engineering
by Seunghwan Shin

© 2012 Seunghwan Shin

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science

December 2012

The thesis of Seunghwan Shin was reviewed and approved* by the following:

Venky N. Shankar
Associate Professor of Civil and Environmental Engineering
Thesis Adviser

Swagata Banerjee
Assistant Professor of Civil and Environmental Engineering

Vikash Gayah
Assistant Professor of Civil and Environmental Engineering

Peggy A. Johnson
Professor of Civil and Environmental Engineering
Head of the Department of Civil and Environmental Engineering

*Signatures are on file in the Graduate School

ABSTRACT

This paper addresses the potential effects of selection bias in the estimation of severity distributions in accident severity modeling. In particular, we address this issue in the context of frequency by severity models. Prior literature on frequency by severity models has focused on the use of discrete outcome models and count models as the baseline frameworks (Lord and Mannering 2010; Anastasopoulos and Mannering 2011; Milton, Shankar and Mannering 2008; Park and Lord 2007; Kweon and Kockelman 2003; Ye, Pendyala, Shankar and Konduri 2008). In discrete outcome models, the severity distribution is modeled as a proportions variable that is a function of geometric, traffic volume and potentially environmental factors. In count models, the outcome variables are either univariate or multivariate counts of severity, modeled as functions of geometric, traffic volume and potentially environmental effects (see for example Venkataraman et al. 2011). While these modeling efforts provide significant insight into the unconditional probability of severity occurrence in terms of segment level measurements of geometry and volume, they ignore the impact of selection bias. The models are constructed on the observed histories of segments, which means that segments with no crash histories are omitted from the estimation procedure. Due to this omission, the estimated severity distributions may be biased. We provide a method to account for this selection bias via an examination of interstate crash histories in Washington State.

In conventional severity analysis, as summarized in the literature cited above, there are two main approaches. The first is conditional analysis of severity, where crash specific factors related to collision types, vehicle types, and occupant information are mainly used to estimate severity models. Roadway geometry is used in the form of dummy variables to describe, for example, the presence or absence of curvature in the neighborhood of a crash site. The second is unconditional analysis at the segmental level, where frequencies of severities are estimated as a function of roadway geometry and traffic volume. In this case, collision type is ignored in conventional analysis because such information is not available in aggregate form. As a result of these approaches, gaps exist in terms of hybrid data including both geometrics and collision type. This restricts the formulation of comprehensive models of severity. This thesis addresses this gap by using hybrid data including segmental geometry and collision information at the segmental aggregation level. Severity distributions are estimated as a function of interstate geometry and traffic volume factors, and observed collision type proportions. A comparison of parameters with and without the selection bias effect is provided. We also explore the effect of selection bias in terms of heterogeneity in the mean of parameters, where parameters are estimated to be random. We note in concluding that the extant literature on selection bias in the conditional context of severity modeling is scant (Tarko et al. 2010). It is a fruitful area of future research, which can lead to opportunities for integrating insights from conditional and unconditional severity analysis. Some evidence

of this prospect can be seen in recently published papers (see for example, Anastasopoulos and Mannering 2011; Shankar et al. 2006). The original contribution of this thesis is two-fold: a) it addresses a gap in the published literature on the accommodation of potential information from segments that are observed to have no crashes, the omission of which can affect the estimated severity distributions; and b) by accounting for such selection effects, the thesis also makes an original contribution on the nature of the impact of selection effects on parameters associated with severity models. In particular, the severity models are formulated as two-stage models where information on both geometrics and collision type is incorporated to provide for comprehensive analysis of segment level severity distributions.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENT

Chapter 1 INTRODUCTION

Chapter 2 RELATED WORKS AND RESEARCH QUESTIONS
    Conventional Severity Analysis
    Statistical Modeling of Severity
    Panel Data and Random Parameters Framework

Chapter 3 ANALYSIS PROCESS AND EMPIRICAL SETTINGS
    Analysis Process
    Study Area
    Crash Cluster Construction
    Data Collection
    Descriptive Statistics

Chapter 4 STATISTICAL MODELING OF SELECTIVITY EFFECTS IN SEVERITY ANALYSIS
    Random Parameters Model Specification
    Random Parameters Approach for Modeling
    Estimation Results for Random Parameters Unconditional Severity Model
    Comparison of Parameters with and without the Selection Bias Effect

Chapter 5 CONCLUSION AND DIRECTIONS FOR FUTURE RESEARCH
    Discussion on Selection Effect
    Conclusions and Recommendations

References
APPENDIX

LIST OF TABLES

Table 1: Multinomial logit estimates of crash versus no-crash probabilities
Table 2: Multinomial logit estimates of crash versus no-crash probabilities
Table 3: Descriptive statistics of key variables
Table 4: Route specific selectivity induced heterogeneity in means models of rural Interstate crash severity in Washington State
Table 5: Route specific selectivity induced heterogeneity in means models of urban Interstate crash severity in Washington State
Table 6: Comparative results between with selection and without selection in rural segments
Table 7: Comparative results between with selection and without selection in rural segments

LIST OF FIGURES

Figure 1: Sample Combined Severity Distribution with Crash and Non-Crash Segment
Figure 2: Characteristics in conditional and unconditional severity analysis
Figure 3: Modeling structure for severity analysis
Figure 4: Washington state highway map including interstates
Figure 5: Parameter distribution of I-5 sideswipe collision (%) and multi-vehicle collision (%) in rural segments
Figure 6: Parameter distribution of I-5 sideswipe collision (%) and multi-vehicle collision (%) in urban segments

ACKNOWLEDGEMENT

Most of all, I would like to thank Professor Venky Shankar for his encouragement as well as his theoretical and technical guidance. Beyond being my academic advisor, he has become a lifetime mentor, like my parents. I would also like to express my gratitude to my company (Korea Expressway Corporation) and my family members. With the company's support for my master's course, I have had a truly rewarding academic life at the Pennsylvania State University. Without my family members' endless love and unlimited support and encouragement, I would not have been able to accomplish this degree. Lastly, I thank God. I want to mention a biblical passage from Ephesians in the New Testament: "Finally, be strong in the Lord and in his mighty power. Put on the full armor of God." With the countless prayers to God from me, my family and my friends in State College Korean Church, I was able to complete the research work and this thesis.

Chapter 1 INTRODUCTION

This thesis explores the impact of selection bias in severity analysis for traffic safety evaluation concerning crash data. Crash data are reported by state patrol or police personnel, and then assembled in raw form by state departments of transportation for statistical analysis and monitoring. Oftentimes, the state traffic safety commission is involved in this effort. Dedicated funding allows for the continual monitoring of crash severities so that key routes on the state transportation system do not trend toward high injury crash occurrences, thereby inflicting heavy social costs. The average cost of a traffic fatality has risen to over 4 million dollars, and the total cost of traffic crashes has exceeded 300 billion dollars annually in the United States. While fatalities and higher end severities such as disabling crashes inflict a majority of the burden from a social cost standpoint, the sheer number of low severity crashes (in excess of 75%) contributes as well to the social cost burden. Therefore, it is imperative that an analysis of severity distributions include the expected social cost burden due to locations where crashes have not occurred. A major reason for this expectation is that crash occurrence distributions are probabilistic (see for example, Shankar et al. 1996), and locations where crashes were not reported may have crashes in the future. In particular, if these locations are ignored in the analysis of severity, then estimates of severity model parameters could be biased.
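As a rough illustration of why zero-crash locations matter, the sketch below combines a crash probability with severity shares that are conditional on a crash having occurred. It is not the estimation method of this thesis, only the law of total probability applied to one segment. The conditional shares are the approximate Washington State figures quoted in Chapter 2 (WSDOT, 2010); the crash probability of 0.30 and the function name `unconditional_distribution` are assumptions made purely for illustration.

```python
# Approximate severity shares given that a crash occurred (Chapter 2, WSDOT 2010).
CONDITIONAL_SHARES = {
    "property_damage": 0.65,
    "possible_injury": 0.15,
    "evident_injury": 0.15,
    "disabling_injury": 0.04,
    "fatality": 0.01,
}


def unconditional_distribution(p_crash, conditional_shares):
    """Combine a crash probability with severity shares given a crash.

    p_crash is a hypothetical segment-level crash probability; it is not
    an estimate taken from the thesis.
    """
    if not 0.0 <= p_crash <= 1.0:
        raise ValueError("p_crash must lie in [0, 1]")
    # The "no crash" outcome absorbs the remaining probability mass.
    dist = {"no_crash": 1.0 - p_crash}
    for severity, share in conditional_shares.items():
        # P(severity) = P(crash) * P(severity | crash)
        dist[severity] = p_crash * share
    return dist


# Hypothetical segment with a 30% chance of at least one reported crash.
example = unconditional_distribution(0.30, CONDITIONAL_SHARES)
```

Under these assumed inputs, the conditional fatality share of 1% corresponds to a much smaller unconditional probability once the no-crash outcome is counted, which is the kind of shift that ignoring zero-crash segments hides.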

Figure 1: Sample Combined Severity Distribution with Crash and Non-Crash Segment

For example, Figure 1 shows how non-crash segments affect the severity distribution of the whole set of segments under investigation. The rest of this thesis is organized as follows, moving toward a detailed assessment of the severity model in the later chapters, followed by conclusions and recommendations. In Chapter 2, I discuss the background of severity studies, where I lay out the gaps in the state of knowledge and the research questions that still remain in the area of accurate estimation of network wide severity distributions. In Chapter 3, I discuss the materials and methods used to address the research questions, thereby providing a data-model foundation for laying out the original contribution of my thesis. In Chapter 4, I discuss the results of the statistical models of injury severity, which help illustrate the effectiveness of my method for using hybrid geometric and collision data for estimating severity distributions. In Chapter 5, I discuss conclusions and recommendations, including strengths and limitations of my thesis, and also address the scope for further work that can build on insights from my thesis.

Chapter 2 RELATED WORKS AND RESEARCH QUESTIONS

2.1 Conventional Severity Analysis

Conventional severity analysis, by modern literature standards, either examines the severity of crashes conditioned upon the crash having occurred, or the severity of crashes by frequency distributions. These are common templates which can be used to evaluate contemporary severity analyses as well, although the methods used tend to vary. In the conditional models, the severity of a crash is analyzed in terms of the most severe outcome of the crash, or the driver severity, or the occupant severities of all vehicles involved. In the case of pedestrian or bicycle severity analysis, similar extensions to participants versus the main rider can be applied. When the most severe outcome of a crash is being analyzed, it results in one record per crash. The severity scale is usually defined as: a) property damage, b) possible injury, c) evident injury, d) disabling injury, or e) fatality. In terms of the distribution of these severities, property damage accounts for around 65% of crashes, possible injury for around 15%, evident injury for around 15%, disabling injury for around 4%, and fatalities for around 1% (WSDOT, 2010). That is to say, given that a crash has occurred, the probability that the most severe outcome of the crash would be a fatality is 1%. Evidently, this excludes consideration of the probability of the crash occurring in the first place, which has not been considered in conditional severity analysis. To offset this limitation, conventional severity analysis has in the past looked at frequency by severity analysis where the count of fatal crashes, property

damages, or possible injury crashes, or evident injury crashes is modeled at a segmental level, for example. At the unconditional level, however, other constraints remain. We do not have information on the proportion of drunk drivers in the vehicle fleet in each segment of the roadway network, nor do we have information about environmental conditions at a continuous level along the roadway. These factors have been found to be statistically significant in conditional severity analysis (see for example, Shankar et al. 1996a; 1996b). If the severity analysis of crashes is to be considered at a multiple observation level per crash, then one can analyze, for example, the severities of all occupants involved, or the severity of the drivers involved in the collision. In either case, the statistical aspects get more complex, since we now require information on the occupants in all vehicles and information on vehicle types and models, factors that have been found to be significant in most severe outcome models of severity (see for example Shankar, Mannering and Barfield, 1996; Ulfarsson and Mannering 2004). In summary, one can conclude based on the extant literature to date that conventional severity analysis looks at two vastly different vectors of input variables for modeling the severity outcomes of a crash (Savolainen, Mannering and Quddus, 2012; Yamamoto, Shankar, 2004).

2.2 Statistical Modeling of Severity

Multivariate severity analysis that exploited the use of abundant crash specific data was attempted in the mid-1990s (Shankar, Mannering and Barfield, 1996a; Shankar and Mannering 1996b), followed by some work on ordered models by Kockelman et al. (2002) and Abdel-Aty et al. (2003). The former used unordered logit models and

evaluated crash data on rural interstate freeways to examine the correlation between vehicle speeds, vehicle types, driver age, seat belt use, environmental conditions at the time of the crash, as well as specific roadway geometry at the scene of the crash. Kockelman et al. (2002) document the first attempted approach to severity modeling using ordered models. They argue that ordered models take advantage of the increasing nature of the severity scale, but counter arguments have been proposed in which the assumption of an increasing severity scale is called into question. For example, the unordered approach (see for example, Ulfarsson and Mannering 2004) shows convincingly that injuries in the middle of the scale, especially evident injury, can behave quite differently from the ordered severity assumption in terms of significant variable effects. In other instances, Ulfarsson and Mannering also show that variables can have opposite effects at either end of the spectrum. A significant body of work has started to emerge in the area of severity analysis since the work of Shankar, Mannering and Barfield. Savolainen et al. (2012) provide a lucid description of methods and models to date in the severity area.

2.3 Panel Data and Random Parameters Framework

In this study, data are gathered at different locations (groups) on the Washington State interstate network. Crash clusters are identified to define the groups, and therefore the cluster size varies in terms of segment lengths. Then, geometric and crash data are assembled for the clusters and collated into complete records. Crash collision type counts and crash collision severities are included in these records. As a result, the dataset begins to take on the shape of a panel dataset, where multiple years of data can be appended to each segment if data become available. In recent years, panel data analysis

has been widely favored and used in various areas of transportation including safety (Venkataraman et al. 2011), planning (Su 2010), operations (Karato et al. 2009) etc. The primary benefit of panel data in safety is that having multiple observations at the same location allows the modeler to capture the heterogeneity across individuals (or groups) in a time dynamic fashion. Additionally, panel analyses also allow accounting for latent dynamic effects across cross-sections (Wooldridge 2003, Greene 2004). The concept of heterogeneity across individuals (groups) was introduced into the safety literature by an influential paper written by Milton, Shankar and Mannering in 2008. The authors of that paper looked at frequency by severity of crashes on divided highways in Washington State, and developed a mixed logit model to capture the impacts of roadway geometry on severity distributions. They found that significant parameter heterogeneity existed in various geometric effects. Their argument is that heterogeneity in parameters can be the result of driver level interaction with geometry, and of unobserved effects affecting the severity outcomes. The upshot of this argument is that statistical parameters associated with geometric effects cannot be assumed to be constant across all roadway segments. Rather, they would follow a distribution with an estimated mean and standard deviation as applicable. The random parameters model is a highly versatile hierarchical model that allows not only the parameters to vary across individuals, but also the mean value of the parameter distribution to be individual specific. In this research, NLOGIT 4.0 software has been used for statistical modeling. To exploit the usefulness of the random parameters model, I develop a heterogeneity in means random parameters model to examine the hierarchical effect of selection bias on severity outcomes. The argument for this is that since it has been established that heterogeneity

in parameters exists, the rationale for some known geometric or route specific effects governing this heterogeneity cannot be ignored. For example, an interstate in a predominantly urban area can have vastly different effects in terms of the distribution of the parameter around its mean, compared to an interstate in a rural region. In addition, the distribution of severities due to the urban and rural nature of the regions also affects the distribution of the parameters. For this reason, I extend the random parameter framework to include heterogeneity due to route and direction specific effects, items that are discussed in some detail in the methodology and modeling sections of this thesis. It is reasonable to expect directional effects to influence severity parameter distributions because the increasing and decreasing directions of travel on interstates are divided. First, the alignment is not the same; second, the directional distribution of flow is not the same; third, unobserved effects due to upstream and downstream portions of the segment are different. For these reasons, the models I estimate include directional dummies as significant variables. Another point that is useful for consideration of the material in the following chapters involves the concept of probability of crash occurrence. Abdel-Aty et al. (2003) looked at statistical models of crash occurrence in Florida. Importantly, this approach, which I will call the stage 1 model in my thesis, has crucial implications for stage 2, which actually models the severities. In stage 1, I model the geometric effects in terms of their impact on the probability of a crash. In stage 2, I use this estimated probability as an additional variable to be input into the severity models. In doing so, this approach includes geometry in a predicted fashion. To date, to my knowledge, this

approach in terms of two-stage analysis of severity has not been conducted, as is evident in the extant literature. The literature as summarized illustrates the gaps in the state of knowledge in terms of methods for accurately estimating severity distributions. In particular, the gaps can be framed in terms of important research questions as follows:

a) What is the scope for selection bias if zero crash segments are accounted for? In particular, are there parametric ways to capture this in a statistical model?

b) How does the scope of this bias vary across scale? In particular, is the selection bias contribution significant at all scales, including micro-scales, and not just at conventional scales of safety analysis, which involve homogeneous segments or project level segmentations such as interchange versus non-interchange levels?

c) Given the absence of methodological treatments for selection in a multinomial-multinomial framework, what methods are plausible to explore the scope of selection bias for analyzing safety related severity?

In the context of selection bias, much of the motivation for a methodological approach comes from the seminal work of Heckman (1979), and the subsequent work of Dubin and McFadden (1984). Heckman showed that there are methods to develop consistent estimates in second stage regressions using a binary selection scheme in the first stage for observations that do not have outcome data. Dubin and McFadden extended this framework to multinomial selection schemes. However, these applications use a least-squares second-stage regression to examine the impact of selection bias on the outcome variable. To the author's knowledge, no published work exists to date that details the

properties of second stage parameters in terms of selection bias when a multinomial selection scheme is used in concert with a multinomial outcome regression in the second stage. That said, there is some evidence that the Dubin-McFadden approach shows considerable robustness for second stage transfer of the selection probability even if the first stage is not truly multinomial (for example, because of shared unobservables). In fact, Dubin and McFadden report that the first stage multinomial selection scheme can be used to develop instrumented probabilities to apply to the second stage for consistent estimation of second stage parameters. They report their own parametric adjustment, which is a nonlinear form of the first stage probabilities, but this is a finding that is specific to the second stage regression being a linear model. What I therefore attempt to do is to exploit the finding that instrumented first stage probabilities can provide consistent estimates in the second stage without any adjustment. It is likely, however, that the second stage estimates may be inefficient. To limit the extent of inefficiency, I use stringent t-statistics to select second stage variables, in order to compensate for the potential of inflated standard errors.
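The two-stage idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the NLOGIT specification estimated in this thesis: it only shows how multinomial logit probabilities are formed from first stage utilities and how one of them is carried, unadjusted, into the second stage as an extra regressor. The four outcome labels follow the stage 1 categories used in Chapter 3, with the urban crash outcome as the baseline; the coefficient values, the covariate names (`log_adt`, `lanes`), the collision share names, and both function names are hypothetical.

```python
import math

# Stage 1 outcome categories; the urban crash equation is the baseline.
OUTCOMES = ["rural_no_crash", "rural_crash", "urban_no_crash", "urban_crash"]
BASELINE = "urban_crash"


def stage1_probabilities(x, betas):
    """Multinomial logit probabilities for one segment.

    x     : dict of segment covariates (hypothetical names)
    betas : dict mapping each non-baseline outcome to its coefficients
            (hypothetical values); the baseline utility is fixed at zero.
    """
    utilities = {}
    for outcome in OUTCOMES:
        if outcome == BASELINE:
            utilities[outcome] = 0.0
        else:
            coefs = betas[outcome]
            utilities[outcome] = coefs["constant"] + sum(
                coefs[name] * value for name, value in x.items()
            )
    # Subtract the largest utility before exponentiating, for stability.
    u_max = max(utilities.values())
    exp_u = {k: math.exp(v - u_max) for k, v in utilities.items()}
    total = sum(exp_u.values())
    return {k: v / total for k, v in exp_u.items()}


def stage2_regressors(collision_shares, probs, outcome):
    """Append the unadjusted stage 1 probability to the stage 2 inputs."""
    row = dict(collision_shares)
    row["selection_probability"] = probs[outcome]
    return row


# Hypothetical coefficients and covariates, for illustration only.
betas = {
    "rural_no_crash": {"constant": 1.0, "log_adt": -0.2, "lanes": 0.1},
    "rural_crash": {"constant": 0.5, "log_adt": -0.1, "lanes": 0.05},
    "urban_no_crash": {"constant": 0.2, "log_adt": -0.05, "lanes": 0.0},
}
segment = {"log_adt": 10.0, "lanes": 2.0}
probs = stage1_probabilities(segment, betas)
row = stage2_regressors(
    {"rear_end_share": 0.4, "sideswipe_share": 0.1}, probs, "rural_crash"
)
```

The sketch passes the probability through without the nonlinear Dubin-McFadden correction, mirroring the choice described above; which stage 1 probability is paired with which route level model is left as an assumption here.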

Chapter 3 ANALYSIS PROCESS AND EMPIRICAL SETTINGS

This chapter describes the data collection and the analytical process that were carried out in order to generate the statistical models in the chapters that follow.

3.1 Analysis Process

As I discussed briefly in the introduction, the comparative issues related to conditional and unconditional severity analysis need to be weighed carefully in the evaluation of selection bias impacts on severity modeling. Figure 2 below shows the characteristics in terms of the detail regarding predictor variables, scale of analysis, and modeling limitations.

Figure 2: Characteristics in conditional and unconditional severity analysis

As Figure 2 shows, there is a major issue that shows up in unconditional analysis which is of relevance to the conditional analysis shown on the left hand side. This issue relates to the identification of factors that cause crash occurrence in the first place. In unconditional analysis, this identification is automatic, for it by default requires the use of predictor variables such as exposure, geometry and interactions. In conditional severity analysis, the detail in terms of the types of crash related variables, especially in terms of collision types, lends a great amount of richness to the types of statistical models we can develop. In particular, the impact of collision type when compared to geometry is not trivial. Considering the impact of collision type at the conditional level makes the analysis predicated on a statistical approach with the goal of using collision types as major influencers of severity, from a road safety planning standpoint. Collision types are correctible events, and therefore lend engineering insight into the types of roadway improvements that can be tried. Furthermore, due to their heterogeneous effect on severity, one can expect the impact of the corrective measures to vary from segment to segment. It is this motivation that leads to the development of the following analytical process in this thesis. This motivation has not been explored in the current literature (Savolainen et al. 2012). With this in mind, Figure 3 below lays out the analytical process I employed in setting up the modeling components of this thesis. As Figure 3 points out, the two-level approach captures the propensity of crash occurrence as an explicit measure at the first stage, but does so at a regional level, by distinguishing crash occurrence probabilities between urban and rural sections.
The idea behind factoring in this probability is that we don't know what the severities would have been in an urban or rural location if a crash had been observed. One could also expect

that as histories of observation increase, the selection probabilities would converge to a stable level and the observed severity distributions would suffer from minimal heterogeneity and selection bias.

Figure 3: Modeling structure for severity analysis

As Figure 3 shows, the main aspect of stage 1 is to capture the impact of road geometry on crash occurrence in a manner that is consistent with the rural and urban aspects of roadway effects. Stage 1 is indicated by the four boxes (at the top of the figure) titled rural no crash observed, rural crash observed, urban no crash observed, and urban crash observed. These boxes represent empirical categories defined on the basis of crash and non-crash clusters based on 9 years of continuous observation of interstate segments. If a crash is reported at a certain milepost, then it becomes a starting cluster, and the cluster grows with the reporting of crashes every 0.01 miles. As soon as no crash is reported, the cluster terminates, and the next cluster begins as a

non-crash cluster. The reporting of crashes is evaluated for the next 0.01 miles, and if no crashes are reported, then the non-crash cluster grows in size. In essence, cluster size is defined on a continuous basis in the direction of travel at 0.01-mile increments. The interstates in my Washington State dataset span the entire network that physically exists, and therefore have rural and urban effects that require a crash occurrence branching structure that reflects this regional division. When the probability of a crash occurring is predicted, it is predicted on the entire set of the rural and urban roadways. Table 1 shows the results of stage 1 as the crash probability model.

Table 1: Multinomial logit estimates of crash versus no-crash probabilities

Variable    Coefficient    Std. Err.    T-Statistic

Rural No-Crash Propensity Equation
Constant
Interstate dummy 1 (1 if segment is on interstate 5; 0 otherwise)
Interstate dummy 2 (1 if segment is on interstate 90; 0 otherwise)
Interstate dummy 3 (1 if segment is on interstate 82; 0 otherwise)
Interstate dummy 4 (1 if segment is on interstate 205; 0 otherwise)
Direction of travel dummy (1 if direction of travel is along decreasing milepost; 0 otherwise)
Number of vertical curves in segment
Number of horizontal curves in segment
Number of lanes
Logarithm of left shoulder width measured in feet
Logarithm of right shoulder width measured in feet
Logarithm of directional ADT

Rural Crash Propensity Equation
Constant
Interstate dummy 1 (1 if segment is on interstate 5; 0 otherwise)
Interstate dummy 2 (1 if segment is on interstate 90; 0 otherwise)
Interstate dummy 3 (1 if segment is on interstate 82; 0 otherwise)
Direction of travel dummy (1 if direction of travel is along decreasing milepost; 0 otherwise)
Number of vertical curves in segment
Number of horizontal curves in segment
Number of lanes
Logarithm of left shoulder width measured in feet
Logarithm of right shoulder width measured in feet

Table 2 (Continued). Multinomial logit estimates of crash versus no-crash probabilities

Variable    Coefficient    Std. Err.    T-Statistic

Logarithm of directional ADT

Urban No-Crash Propensity Equation
Constant
Interstate dummy 1 (1 if segment is on interstate 5; 0 otherwise)
Interstate dummy 2 (1 if segment is on interstate 90; 0 otherwise)
Interstate dummy 3 (1 if segment is on interstate 82; 0 otherwise)
Interstate dummy 4 (1 if segment is on interstate 205; 0 otherwise)
Direction of travel dummy (1 if direction of travel is along decreasing milepost; 0 otherwise)
Number of vertical curves in segment
Number of horizontal curves in segment
Number of lanes
Logarithm of left shoulder width measured in feet
Logarithm of right shoulder width measured in feet
Logarithm of directional ADT

Urban Crash Propensity Equation
Set as Baseline

Log-likelihood with constants only -45,
Log-likelihood at convergence -24,
Number of observations 38,273

This probability is then used in stage 2 at the appropriate route level, so that the route specific models include heterogeneity due to expected effects from stage 1 based on the entire network. In stage 2, the severity is modeled as a function of collision variables with the goal of identifying the effects of vehicle involvement and type of collision. By doing so, we are able to identify the relative magnitudes of collision type effects. The collision variables are measures of counts of collision types, with seven categories of collision types and five categories of vehicle involvement. The seven categories of collision types are rear-end, sideswipe, same direction, head-on, overturn, fixed object, and other, while the five vehicle involvement categories are single vehicle, two-vehicle, three-vehicle, four-vehicle, and five or more vehicles. The

heterogeneity in collision type effects on severity results from various factors, including but not limited to: angle of impact, vehicle roof integrity, vehicle trip hazard resulting in severe roadside collisions, vehicle defects and, interacting with all of these factors, the interior structural integrity of the vehicle. Vehicle involvement, on the other hand, captures factors relating to exposure of occupants, vehicle mass differences and vehicle at fault effects on severity. The extant literature does not model crash severities at the unconditional level in this manner; rather, it tends to focus on geometry as a direct effect. I argue that geometry should be an interacting effect, one that enters severity as an influencing variable at the crash occurrence probability level. In addition, I argue that given specific crash information, the statistical association of collision types and vehicle involvement will lead to a more focused model for severity mitigation. In modeling stage 2, I also introduce a hierarchical effect where the collision type parameters are treated as random functions of selection probabilities interacting with route specific and direction specific effects. This means that I estimate route specific models of severity for urban and rural segments. In so doing, I am assuming that the selection probabilities are functions of geometry that interact with the collision types in stage 2. At its core, this thesis models the interactive effects on severity of crash occurrence selection with collision type. A significant point to note in the development of my methodology relates to the assembly of hybrid data and the level of segmentation. To begin with, I use a micro-scale approach to segmentation so as to explore the impact of low probability crash scenarios in very small segments that are observed to have zero crashes. This impact is potentially significant, because the instrumented probabilities are likely to occupy the

entire 0-1 spectrum, whereas at higher scales, the instrumented probabilities are more likely to be closer to 1 than 0. Second, the segmentation I use can be viewed as a method to address spatial selection, which is more significant than temporal selection. Spatial selection occurs when adjacent zero crash segments potentially exert geometric influences on neighboring crash sites due to spatial correlation. This makes the assumption that zero crash segments are ignorable a strong and unreasonably restrictive one. To this end, a micro-scale segmentation procedure similar to the one I adopt in my thesis helps address the scope of spatial selection on severity distribution estimation.

In temporal selection problems, by contrast, selection mainly occurs due to zero crash years for a particular segment. This is more likely to be the case when larger scales are used, for example, interchange and non-interchange segment levels. In such contexts, no segment will be reported to have zero crashes cumulatively in a 9-year period; however, some segments may have zero crash years. Therefore, the question of selection becomes: what if crashes were observed in those years, and how would they skew the analysis of severity? The methods I use here can also address temporal selection, potentially with one refinement: the choice of year needs to be added to the spatial segmental choice level in the first stage, after which the instrumented probabilities from stage 1 can be used in the stage 2 model.

The hybrid data I use in my thesis include aggregated collision type proportions, vehicle involvement proportions, as well as aggregated interchange level geometric proportions. I then assign the neighborhood interchange level geometry to the micro segment, because even though one can argue that micro-scale crash propensity should be predicted as a function of micro-scale geometry, the influence of downstream and

upstream geometry cannot be ignored. In order to avoid spatial correlation related bias due to this, I assign the entire interchange level geometry surrounding the micro segment. In this manner, I can ensure the segment is adequately represented by all related neighboring geometry.

3.2 Study Area

The study area covers the entire interstate network in Washington State. This coverage results in nearly 1,530 miles of interstate being analyzed in terms of numerous crash types and the five associated collision severities. The study area is shown in Figure 3.

Figure 4: Washington state highway map including interstates

The interstates studied in this thesis are I-5, I-82, I-182, I-90, I-205, I-405 and I Interstates 5, 82 and 90 are the roadways that occupy a majority of the 1,528

directional miles of travel. I-405 is entirely urban in nature, while I-205 is entirely rural in nature. I-5 traverses the west side of Washington State in the north-south direction, and therefore is influenced primarily by rain as an environmental effect. By comparison, I-90 is a major route that traverses the state east-west and is influenced by adverse environmental conditions due to altitude effects near Snoqualmie Pass to the east, rain at the west end, and dry climate at the east end. Environmental heterogeneity is greatest for I-90 and I-82, while I-405 is relatively uniform due to its focused urban length between I-5 on the south end and I-5 on the north end. Topographic and gradient effects are observed on all interstates. Speed limits on rural portions of the interstate network are different from those on urban segments (70 versus 65 mph), and this might be reflected in collision type heterogeneity effects on severity. For example, lower speed limits may influence the degree to which cross-median collisions occur, as well as rollover/overturn and fixed object collision types.

3.3 Crash Cluster Construction

In this section, the various steps involved in constructing crash clusters are discussed. The steps involved are crash data collection, construction of micro clusters, and collation of geometric data and traffic volume data to reduce the dataset to an observation-consistent sample. In other words, all records in the sample are consistent at the observation level in terms of the number of variables, with all variables having full column rank. Therefore, the matrix used for the dataset does not have missing data.

Data Collection

The following sources of data were utilized in this study:

Crash data from the Washington State Department of Transportation (WSDOT) for the nine-year study period were assembled for the seven interstates. The crash data contained raw information on the most severe outcome, driver specific injury, occupant specific injury and, although not complete, crash specific information relating to contributing factors.

Information about the number of lanes, lane and roadway width, shoulder widths, alignment data and traffic volumes was obtained from WSDOT state highway logs.

Each crash was categorized by collision type: rear-end, sideswipe, same direction, head-on, overturn, fixed object, or other. Using the crash cluster beginning and ending adjusted route mileposts, I built crash counts for each cluster by collision type. The crash clusters are built based on the 0.01 milepost separations in the location of crashes. As one progresses along increasing or decreasing milepost, if a cluster of crashes is reported at one milepost, no crashes are reported at the next two mileposts, and crashes are again reported at the milepost after that, then I have three clusters: the first at the initial milepost with a length of 0 miles (as in a point crash cluster), the second spanning the two crash-free mileposts with a length of 0.01 miles, and the third at the final milepost with a length of 0 miles. I then collated geometric data for the crash cluster by using composite, weighted geometry for the interchange segment containing the crash cluster. The

argument for this is that the interchange level geometry affects crash occurrence in the neighborhood of the crash cluster and therefore captures aggregate effects that are not so micro-scale that I might introduce aggregate heterogeneity. The above procedure resulted in a total of 29,657 observations for crash cluster level analysis. It should be noted that these observations contain cumulative counts of crashes over the nine-year period. In this manner, I capture severity distributions over a longer period of observation than is typically done in severity analysis.

Descriptive Statistics

Table 3 shows the descriptive statistics for the assembled dataset for the unconditional random parameter heterogeneity in means severity model.

Table 3: Descriptive statistics of key variables

Variable   Mean   Min   Max
Rural crash containing segment length
Rural crash-free segment length
Urban crash containing segment length
Urban crash-free segment length
Property damage frequency
Possible injury frequency
Injury frequency
Average daily traffic 87, , ,769
Number of horizontal curves
Number of vertical curves
Number of lanes
Right shoulder width
Left shoulder width

As can be seen from Table 3, the mean crash containing segment length is longer than the mean crash-free segment length for both rural and urban segments. The maximum lengths are likewise smaller for crash-free segments than for crash containing segments. It must be noted here that the crash-free and crash containing segment lengths are based on 9-year histories of crashes, and therefore could be viewed as lengths that are fairly robust to the length of crash histories. About percent of the crash counts are property damage only, while percent are possible injury crashes, and percent consist of evident injury, disabling and fatal injury crashes. The injury component is a single composite component consisting of evident, disabling and fatal injuries as most severe outcomes. The main reason for using a composite injury component is that finer resolutions did not yield tractable random parameter models of severity, a fact confirmed by prior research (Milton, Shankar and Mannering, 2008).

Average daily traffic varied from a minimum of 4, to a maximum of 241,769. ADT is weighted by segment length over the interchange length definition; as a result, ADT takes fractional values. The number of vertical curves varies from zero to a maximum of 6, whereas the number of horizontal curves varies from zero to a maximum of 4. The number of lanes reaches a maximum of 5, while left shoulder width reaches a maximum of 24 feet and right shoulder width a maximum of 26 feet. These maxima for shoulder widths are not commonplace, but they show that geometry varies enough that a geometry based model of the probability of crash occurrence is likely to include multiple shoulder width, lane cross section and curvature values. A total of 29,657 observations were assembled to cover the rural and urban portions of the interstate network.
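The cluster construction described in Section 3.3 can be illustrated with a short sketch. This is only one possible reading of the procedure, not the code used in the thesis: it assumes mileposts are stored as integers in hundredths of a mile, that crashes at adjacent 0.01-mile mileposts belong to the same crash cluster, and that each crash-free run forms its own cluster. The function name `build_clusters`, the record layout and the example mileposts are hypothetical.

```python
from collections import Counter


def build_clusters(crash_mileposts, start, end):
    """Split the section [start, end] into alternating crash and crash-free clusters.

    Mileposts are integers in hundredths of a mile (1000 means milepost 10.00),
    which avoids floating point comparisons on the 0.01-mile grid.
    """
    counts = Counter(crash_mileposts)  # crashes reported at each milepost
    clusters = []
    run_start = start
    run_has_crash = counts[start] > 0
    # Walk one step past the end so the final run is closed by the sentinel.
    for mp in range(start + 1, end + 2):
        has_crash = counts[mp] > 0 if mp <= end else None
        if has_crash != run_has_crash:
            last = mp - 1
            clusters.append({
                "begin": run_start / 100,
                "end": last / 100,
                "length_miles": (last - run_start) / 100,
                "crashes": sum(counts[m] for m in range(run_start, last + 1)),
            })
            run_start, run_has_crash = mp, has_crash
    return clusters


# Hypothetical example: two crashes at milepost 10.00, none at 10.01 and 10.02,
# and one crash at 10.03.
for cluster in build_clusters([1000, 1000, 1003], 1000, 1003):
    print(cluster)
```

With these hypothetical inputs the sketch yields a point crash cluster, a crash-free cluster 0.01 miles long, and a second point crash cluster, which mirrors the three-cluster example given in the text.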

Chapter 4

STATISTICAL MODELING OF SELECTIVITY EFFECTS IN SEVERITY ANALYSIS

The statistical model in stage 2 is constructed as a model with stochastic parameters, while the stage 1 model is built as a single level multinomial logit model. The stage 1 model is discussed as part of the more general random parameters specification discussed below.

4.1 Random Parameters Model Specification

The application of the mixed logit model (also called the random parameters logit model) is undertaken by considering injury-severity proportions for individual roadway segments. Severity is defined as the resulting injury level of the most severely injured person in the observed accident. To develop the modeling approach, a severity function determining the proportion of injury severities (of all reported accidents per year) on a roadway segment is defined as,

$$S_{in} = \beta_i X_{in} + \varepsilon_{in} \qquad (1)$$

where $S_{in}$ is a severity function determining the injury-severity category $i$ proportion (property damage only, possible injury, evident injury, disabling injury and fatality) on roadway segment $n$; $X_{in}$ is a vector of explanatory variables (weather, geometric, pavement, roadside and traffic variables); $\beta_i$ is a vector of estimable parameters; and $\varepsilon_{in}$ is an error term. If the $\varepsilon_{in}$ are assumed to be generalized extreme value distributed, McFadden (1981) has shown that the multinomial logit model results such that,

$$P_n(i) = \frac{\mathrm{EXP}[\beta_i X_{in}]}{\sum_{I} \mathrm{EXP}[\beta_I X_{In}]} \qquad (2)$$

where $P_n(i)$ is the proportion of injury-severity category $i$ (from the set of all injury-severity categories $I$) on roadway segment $n$. To generalize this to allow for parameter variations across roadway segments (variations in $\beta$), a mixing distribution is introduced giving injury-severity proportions (see Train 2003),

$$P_n(i) = \int \frac{\mathrm{EXP}[\beta_i X_{in}]}{\sum_{I} \mathrm{EXP}[\beta_I X_{In}]} \, f(\beta \mid \varphi) \, d\beta \qquad (3)$$

where $f(\beta \mid \varphi)$ is the density function of $\beta$ with $\varphi$ referring to a vector of parameters of the density function (mean and variance), and all other terms are as previously defined. Equation 3 is the formulation for the mixed logit model. For model estimation, $\beta$ can now account for segment-specific variations of the effect of $X$ on injury-severity proportions, with the density function $f(\beta \mid \varphi)$ used to determine $\beta$. Mixed logit proportions are then a weighted average for different values of $\beta$ across roadway segments where some elements of the vector $\beta$ may be fixed and some may be randomly distributed. If the parameters are random, the mixed logit weights are determined by the density function $f(\beta \mid \varphi)$. Most studies have used a continuous form of this density function in model estimation (such as a normal distribution).

Maximum likelihood estimation of mixed logit models is computationally cumbersome because of the required numerical integration of the logit formula over the distribution of the random, unobserved parameters. As a result, simulation-based maximum likelihood methods are typically employed using Halton draws, which have

been shown to provide a more efficient distribution of draws for numerical integration than purely random draws (see Bhat, 2003 and Train, 1999). Details of the evolution of simulation-based maximum likelihood methods for estimating mixed logit models are provided in numerous references including McFadden and Ruud (1994), Geweke, Keene and Runkle (1994), Boersch-Supan and Hajivassiliou (1993), Stern (1997) and Brownstone and Train (1999).

As a final point, note that in traditional multinomial logit models the error terms $\varepsilon_{in}$ (unobserved effects) are assumed to be extreme-value independent and identically distributed. However, in functions that determine the injury proportions on individual roadway segments, it is important to accommodate the possibility of shared unobservables among injury outcomes. Traditional multinomial logit models assume that the alternate severity outcomes are independent and, if they are not, a model specification error will result. Some past research has shown that lower severity accidents, such as property damage and possible injury, may share unobserved effects (resulting in error term correlation). Previously this problem has been resolved in the accident-severity literature by using nested logit formulations (Shankar et al. 1996, Lee and Mannering, 2002, Savolainen and Mannering, 2006). The mixed logit averts this error term problem by allowing for a more general error-correlation structure, while obviating the need for making a priori assumptions about the structure of shared unobservables (i.e., nest structures). Uncorrelated errors in the mixed logit model would render the model a multinomial logit.
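To make the simulation idea concrete, the sketch below approximates the mixed logit proportions of Equation 3 by averaging the multinomial logit proportions of Equation 2 over Halton draws. It is a minimal illustration under simplifying assumptions that are not taken from the thesis: each severity function has a single explanatory variable with one normally distributed parameter, a separate prime base is used for each random parameter, and all numerical values are hypothetical. It evaluates proportions for given parameters only; it does not estimate them.

```python
import math
from statistics import NormalDist

PRIMES = [2, 3, 5, 7, 11, 13]  # one Halton base per random parameter (assumption)


def halton(index, base):
    """Return the index-th element (index >= 1) of the Halton sequence for a base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def logit_proportions(utilities):
    """Equation 2: logit proportions from the systematic part of each severity function."""
    peak = max(utilities)  # subtract the maximum for numerical stability
    exps = [math.exp(u - peak) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_logit_proportions(x, beta_mean, beta_sd, draws=200):
    """Equation 3 approximated by averaging Equation 2 over Halton draws.

    x[i] is the (scalar) explanatory variable in severity function i, and the
    parameter in function i is normal with mean beta_mean[i] and standard
    deviation beta_sd[i]; a standard deviation of 0 keeps that parameter fixed.
    """
    std_normal = NormalDist()
    totals = [0.0] * len(x)
    for r in range(1, draws + 1):
        utilities = []
        for i, (x_i, mean_i, sd_i) in enumerate(zip(x, beta_mean, beta_sd)):
            # Map the uniform Halton point to a standard normal deviate.
            z = std_normal.inv_cdf(halton(r, PRIMES[i]))
            utilities.append((mean_i + sd_i * z) * x_i)
        for i, p in enumerate(logit_proportions(utilities)):
            totals[i] += p
    return [t / draws for t in totals]


# Hypothetical inputs for three severity categories.
print(mixed_logit_proportions([1.0, 2.0, 0.5], [0.2, -0.1, 0.0], [0.5, 0.3, 0.0]))
```

Setting every standard deviation to zero collapses the average to the ordinary multinomial logit proportions, which is a convenient check on the sketch.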

The random parameters are specified as

$$\beta_n = \beta + \Delta z_n + \Gamma v_n,$$

where the data consist of observations on $(y_n, X_n, z_n)$, where $n = 1, \ldots, N$, $y_n$ is the response variable, $X_n$ contains the main explanatory variables in the rows of the $T \times K$ matrix, $z_n$ is an $L \times 1$ vector of time invariant, individual (group) specific variables that influence the means of the random parameters, and $v_n$ is a $K \times 1$ vector of random latent individual effects, with the following assumption: $v_n \sim N[0, I]$.

The structural parameters of the model are: $\beta$, a $K \times 1$ vector of constant terms in the means of the random parameters; $\Delta$, a $K \times L$ matrix of unknown parameters that multiply the covariates in the distribution of random parameters; and $\Gamma$, a $K \times K$ lower triangular matrix of unknown variance parameters. It follows that $\beta_n$ is normally distributed with the moments $E[\beta_n \mid z_n] = \beta + \Delta z_n$ and $\mathrm{Var}[\beta_n \mid z_n] = \Gamma \Gamma'$.
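A small numerical sketch may help fix ideas about heterogeneity in the means of random parameters. It assumes the common two-level form in which a segment-specific parameter vector equals a constant vector, plus a coefficient matrix times group specific covariates, plus a lower triangular matrix times standard normal effects. The names `beta`, `delta`, `gamma` and `z`, and all numerical values, are hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def conditional_mean(beta, delta, z):
    """Mean of the random parameters given group covariates z: beta + delta z."""
    return [b + shift for b, shift in zip(beta, matvec(delta, z))]


def covariance(gamma):
    """Covariance of the random parameters: gamma times its transpose."""
    k = len(gamma)
    return [[sum(gamma[i][m] * gamma[j][m] for m in range(k)) for j in range(k)]
            for i in range(k)]


def draw_parameters(beta, delta, gamma, z, rng):
    """One draw of the parameters: beta + delta z + gamma v, with v standard normal."""
    v = [rng.gauss(0.0, 1.0) for _ in beta]
    return [m + s for m, s in zip(conditional_mean(beta, delta, z), matvec(gamma, v))]


# Hypothetical two-parameter example with one group covariate
# (for instance, a selection probability of 0.25).
beta = [0.5, -1.0]
delta = [[0.8], [0.0]]            # the second parameter's mean does not shift
gamma = [[0.3, 0.0], [0.1, 0.2]]  # non-zero off-diagonal term: correlated parameters
z = [0.25]

print(conditional_mean(beta, delta, z))
print(covariance(gamma))
print(draw_parameters(beta, delta, gamma, z, random.Random(0)))
```

Setting a row of `delta` and the corresponding row of `gamma` to zero makes that parameter non-random in this sketch, since its value then equals the constant term for every draw.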

In the two-level model specification above, the first level estimates the response variable as a function of the main covariates. In the second level, the segment specific coefficient vector ($\beta_n$) is treated as a draw from a distribution that is assumed to be normal. The means of these normal distributions are modeled as a function of group specific effects. This method provides a means to capture and analyze the observed heterogeneous effects in the means of the random parameters. In the correlated version of the model, the random parameters are allowed to be correlated, in which case the matrix of variance parameters, $\Gamma$, is a lower triangular matrix with non-zero off-diagonal elements and the full covariance matrix of the random coefficients is given by $\Gamma \Gamma'$. In the random parameter model framework, non-random parameters can be accommodated by forcing the corresponding elements in $\Delta$ (the matrix of parameters on the group specific covariates) and $\Gamma$ to contain zeros. Given the above specifications, the parameters of the random parameters model are estimated by the method of simulated maximum likelihood (Train 2003) using Halton draws.

4.2 Random Parameters Approach for Modeling

The dependent variable is the ratio of the observed frequency of a particular severity to the total of all severity type frequencies, defined for each segment, for each interstate route r = 1, 2, ..., and for each area type g, where g represents urban versus rural. So, in general, the severity proportion variable,

for a given segment, is defined for each member of the set containing counts of PDOs, PINJs and INJs. Its value is bounded by 0 and 1 and can be coded as a proportions response variable. The main explanatory variables in the first level of the random parameters model are expressed in matrix notation; it is to be noted here that the vector of explanatory variables is equation specific and, given that we have three equations, we have up to 2 constants.

In the first level of the model, the random parameters associated with sideswipe proportions and multi-vehicle proportions appear to be statistically plausible as stochastic effects. Overturn proportions in a segment, on the other hand, appear to be fixed effects. It should also be noted that the baseline effect involving a stochastic constant was found to be statistically plausible. All constants in the rural and urban severity equations were found to be statistically significant at the 90 or 95% confidence level. At the second level, the heterogeneity in the means of the random parameters associated with sideswipe and multi-vehicle proportions is captured by route specific selection probabilities and directional selection probabilities. Clearly, the statistical plausibility of directional selection raises further questions about the importance of design consistency for divided highways. This is not a result that was fully expected.
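The proportions response variable described at the start of this section can be computed directly from segment level severity counts. The sketch below is illustrative only: the function name `severity_proportions`, the dictionary keys and the example counts are hypothetical, and the sketch simply refuses segments with no crashes, for which the ratio is undefined.

```python
def severity_proportions(counts):
    """Convert severity counts for one segment into proportions bounded by 0 and 1."""
    total = sum(counts.values())
    if total == 0:
        # A crash-free segment has no observed severity distribution.
        raise ValueError("segment has no crashes; proportions are undefined")
    return {severity: n / total for severity, n in counts.items()}


# Hypothetical segment with 6 property damage only, 3 possible injury
# and 1 injury crash over the observation period.
print(severity_proportions({"PDO": 6, "PINJ": 3, "INJ": 1}))
```

Because the proportions for a segment sum to one, only two of the three severity equations can carry their own constant, which is consistent with the "up to 2 constants" noted above.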