ASSESSING PROBABILITY DISTRIBUTIONS FROM DATA


VICTOR RICHMOND R. JOSE
McDonough School of Business, Georgetown University, Washington, D.C.

INTRODUCTION

The task of assessing probabilities for uncertain events can be cognitively challenging even for the most skilled forecasters. This is most evident when one has limited prior experience with such events. The presence of data, however, can ease the burden on the decision maker by providing a good starting point for understanding the nature of the uncertainty, and such data hopefully would lead to accurate and reliable assessments. In many cases, it is almost impossible to find information directly related to the uncertainty of concern. Some related data, however, are typically available that could be used to improve our understanding of the uncertainty.

For example, consider an engineer who is studying the failure time of a certain component. She is mainly concerned about whether this component will last until some ordered spares arrive, so as to prevent a disruption in the company's operations. She may study data on the same component or a closely related product for which information is available. This information may be used to generate a probability assessment for this component, since it seems reasonable to assume that the life of this component would be similar to that of other identical or similarly designed items.

On the other hand, consider an insurance company that routinely assesses the likelihood that a customer would be involved in a motor vehicle accident. It would be realistic to assume that, based on certain demographics such as age, occupation, and driving experience, some individuals are more likely than others to be involved in an accident. Here, the related information can be used to help predict whether or not the customer would be involved in an accident in, say, the next year. In these examples, data become useful in the probability assessment process.
In the engineering example, assessments can be made by using a historical approach; that is, probabilities are assigned to events based on the assumption that past information is representative of current and future behavior. It works on the premise that the earlier events are drawn from a common distribution or process and that the event being assessed will follow it as well. In the insurance example, however, it may no longer be plausible to assume that the chance of a particular individual getting into a motor vehicle accident is a simple random draw from the accident history of everyone in the population. Here, additional explanatory variables are used to make inferences about the event probability.

Wiley Encyclopedia of Operations Research and Management Science, edited by James J. Cochran. Copyright 2010 John Wiley & Sons, Inc.

ASSESSMENTS USING HISTORICAL DATA

In the historical data approach, a probability assessment is made by generating a distribution based on past information and using this for analysis related to the uncertainty of interest. Framed this way, the problem is addressed by a large literature in statistics.

Parametric Approach

Traditionally, constructing probability distributions from data often starts by considering a subset of probability measures that are indexed by some parameter $\theta \in \mathbb{R}^n$. By considering functions characterized by this parameter $\theta$, the assessment task is somewhat easier since the focus is then simply on estimating appropriate values for $\theta$.

Choosing a Model. Often, problems can be approached in many different ways and

choosing an appropriate model can be somewhat confusing. Though there is no acid test to determine which distribution or parametric family is appropriate on every occasion, research [1] suggests that considering simple questionnaires about the uncertainty (such as the one listed below) can be helpful in limiting the set of distributions that we need to consider.

1. Nature of the Sample Space. Is the random variable discrete, continuous, or mixed (i.e., can it take only discrete values, is it continuous over some interval(s), or both)? Are we dealing with a univariate or multivariate distribution?
2. Bounds. Is the range of values bounded above? Or below? If so, what are reasonable upper and lower bounds?
3. Shape. Is the distribution symmetric? If it is skewed, in which direction is it skewed?
4. Concentration. What range of values is of primary interest to us? Is the modeling of the tails critical? Is the distribution unimodal? Or multimodal?
5. Underlying Process and Dependencies. Is there an underlying process? Do we expect the variable to be dependent on other variables? Are the variables correlated? Are other forms of dependencies known?

Although questionnaires like this one can be used to limit the parametric families that we should consider, they often do not single out a particular parametric family. Often, we still have to choose among several distributions.

Working with a Model. For each of the models that we might consider, the next task for the decision maker is to generate the probability assessments that he or she needs. Consider our engineering example: suppose the records show that the last 15 units of the same component used in the plant lasted 12, 14, 15, 6, 9, 10, 11, 12, 14, 13, 13, 10, 14, 9, and 18 days. If the spares are scheduled to arrive in 10 days, then we can use this to estimate the probability that there will be a disruption.
If we assume that the data we have observed come from independent draws from a common distribution with the same probability $\theta$ of lasting less than 10 days, then we can say that a binomial distribution for the number $x$ of the $n$ units that last less than 10 days, that is,

$f(x) = \binom{n}{x} \theta^{x} (1 - \theta)^{n - x}$,

might be appropriate. Having decided on a parametric model then limits the problem to estimating an appropriate value for $\theta$.

Suppose that over a period of time or a series of experiments, you were able to collect a random sample $X_1, \ldots, X_n$ coming from a distribution function $F(x; \theta)$, where $\theta$ is an unknown parameter belonging to some subset of $\mathbb{R}^m$ ($m \geq 1$). There are many ways to estimate $\theta$. Perhaps the most commonly used approach today is the maximum likelihood (ML) approach. Under this method, the observed data are assumed to be generated by some joint distribution (characterized by the parameter $\theta$), which, viewed as a function of $\theta$, is called the likelihood function

$L(\theta) = f(x_1, \ldots, x_n \mid \theta)$. (1)

The estimation is then made by finding the $\hat{\theta}$ that maximizes the value of Equation (1), that is,

$\hat{\theta} = \arg\max_{\theta} L(\theta)$. (2)

The estimate in Equation (2) can be interpreted as the parameter value that gives the highest probability to the observed data (or the one that makes it "most likely" that the data come from the said process). A large literature has shown many interesting and useful statistical properties for the ML estimator, making it desirable to many researchers and practitioners. However, one challenge still encountered in this approach is that solving Equation (2) for certain functions can be computationally difficult.

Going back to our example, it can be verified that the ML approach yields an estimate for $\theta$ of 3/15 = 20%, so the estimated probability that the component lasts until the spares arrive is 12/15 = 80%. Intuitively, this makes sense since 3 of the 15 observed lifetimes were less than the delivery time of 10 days. Alternative estimation techniques, which may yield significantly different results, are also available. Lehmann and Casella [2] provide a detailed discussion of different statistical estimation techniques.
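The ML calculation for the engineering example can be checked numerically. The following is a minimal sketch rather than a prescribed implementation: it takes θ to be the probability that a unit lasts fewer than 10 days, and the variable and function names are chosen here for illustration only. It compares the closed-form maximizer (the sample proportion) with a simple grid search over the binomial log-likelihood.

```python
import math

# Component lifetimes (days) from the engineering example in the text.
lifetimes = [12, 14, 15, 6, 9, 10, 11, 12, 14, 13, 13, 10, 14, 9, 18]
delivery_time = 10  # days until the spares arrive

n = len(lifetimes)
# Number of units that lasted fewer than 10 days.
failures = sum(1 for t in lifetimes if t < delivery_time)

def log_likelihood(theta, x=failures, n=n):
    """Binomial log-likelihood in theta; the constant binomial
    coefficient is dropped because it does not affect the maximizer."""
    return x * math.log(theta) + (n - x) * math.log(1.0 - theta)

# Closed-form ML estimate for a binomial proportion.
theta_hat = failures / n

# Numerical check: grid search over the open interval (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(f"early failures: {failures} of {n}")
print(f"closed-form estimate: {theta_hat:.3f}, grid estimate: {theta_grid:.3f}")
print(f"estimated probability of lasting until delivery: {1 - theta_hat:.3f}")
```

The grid search is only a sanity check. For models without a closed-form maximizer, a numerical optimizer would play the same role.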

Naturally, an alternative model could also make the estimates significantly different. For example, if we assume that the lifetimes of the component follow, say, a normal distribution, then the estimates for the mean and standard deviation using a moment-matching approach would be 12 and 2.95, respectively. This would mean that the probability that the component survives until the spares arrive would be

$P(\text{Component Life} \geq 10) = P(Z \geq -0.678) = 75.1\%$.

An alternative to computing point estimates is to compute the posterior distribution of the parameter $\theta$ and to use this information to generate an estimate for the parameter. Given prior information on $\theta$ encoded in the distribution $f(\theta)$, Bayes' rule allows us to compute the posterior distribution of $\theta$ as follows:

$f(\theta \mid x) = \dfrac{f(x_1, \ldots, x_n \mid \theta)\, f(\theta)}{\int f(x_1, \ldots, x_n \mid \theta)\, f(\theta)\, d\theta}$. (3)

In the many instances where no information on $\theta$ is available, the statistical literature often relies on the use of noninformative (or diffuse) priors. Once a posterior distribution is obtained, we can then use it for analysis. Retrospectively, we can also use it to make probabilistic statements about the uncertainty. The posterior mean of $\theta$ could be used as an estimate for $\theta$, similar to what the ML approach tries to achieve. Other statistics (such as the median or other specific quantiles) could be used as well, depending on the problem. Compared to the earlier approach, this Bayesian treatment is a more careful analysis of the event uncertainty since the inherent uncertainty in $\theta$ is incorporated in the analysis through the prior distribution $f(\theta)$.

Selecting among Models. If only one model is considered appropriate, then there is no ambiguity about which probability estimate to use. However, when multiple models are considered, we often have to choose which one should be used and reported.
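Before turning to how models are compared, the two alternative calculations described above can be reproduced in a few lines. This is a sketch under stated assumptions: the normal model uses the sample mean and the sample (n - 1) standard deviation, and the Bayesian part assumes a uniform Beta(1, 1) prior on θ (the probability of lasting fewer than 10 days). That prior is chosen here purely for illustration and is not specified in the text. With a Beta prior and binomial data, the posterior is again a Beta distribution, so no numerical integration is needed.

```python
from statistics import NormalDist, mean, stdev

lifetimes = [12, 14, 15, 6, 9, 10, 11, 12, 14, 13, 13, 10, 14, 9, 18]
delivery_time = 10

# --- Normal model fitted by moment matching ---
mu = mean(lifetimes)        # sample mean
sigma = stdev(lifetimes)    # sample standard deviation (n - 1 divisor)
p_survive_normal = 1.0 - NormalDist(mu, sigma).cdf(delivery_time)

# --- Bayesian estimate of theta = P(life < 10 days) ---
# Assumed prior: Beta(1, 1), i.e. uniform on [0, 1] (illustrative choice).
a_prior, b_prior = 1.0, 1.0
failures = sum(t < delivery_time for t in lifetimes)
survivals = len(lifetimes) - failures

# Conjugate update: Beta(a + failures, b + survivals).
a_post = a_prior + failures
b_post = b_prior + survivals
posterior_mean = a_post / (a_post + b_post)

print(f"mean = {mu}, sd = {sigma:.2f}")
print(f"P(life >= 10) under the normal model: {p_survive_normal:.3f}")
print(f"posterior mean of theta: {posterior_mean:.3f}")
```

Under these assumptions the posterior mean of θ is 4/17, slightly above the sample proportion 3/15, because the uniform prior pulls the estimate toward 1/2.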
The typical approach in practice is to compare models based on their fit with the empirical data. Models are ranked using goodness-of-fit statistics. Some commonly used statistics include Pearson's chi-squared ($\chi^2$), Kolmogorov-Smirnov (KS), and Anderson-Darling (AD). D'Agostino and Stephens [3] provide a more detailed treatise on this subject.

Current statistical software can fit data to certain parametric families. In addition, users have the option of ranking the distributions based on fit using some common goodness-of-fit statistics. For example, Crystal Ball, a commonly used Monte Carlo simulation package, estimates parameters for distributions using an ML approach, and users have the option to rank distributions based on the $\chi^2$, KS, or AD statistic. Figure 1 shows a sample output generated by Crystal Ball for a small financial data set.

For these data-fitting techniques, one important issue that often arises in practice is dealing with small data sets. Research [4] has shown that for several well-known distributions, small differences in the estimates have a large effect on decision and risk analysis studies, especially when the uncertainty has a high relative standard deviation. Extra caution is needed when relying on assessments and analysis that are generated from such small data sets.

Nonparametric Approach

Once in a while, an individual may not have sufficient information to select a class of parametric families to use. In these cases, making certain assumptions on the functional form of the distribution might lead to large errors instead of improvements in the model. In a risk analysis study [5], it was shown that incorrect substitutions in functional forms of probability distributions generated errors that can grow by a factor on the order of n for a data set of n measurements. An alternative to forcing a choice for a parametric model is to allow the empirical data to determine the distribution on its own.
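Both the goodness-of-fit ranking and the empirical alternative rest on the empirical distribution function of the data. The sketch below builds that function for the component lifetimes from the engineering example and computes the KS distance to two candidate models. The exponential candidate is a hypothetical alternative introduced here only to have something to rank against the normal model, and all function names are illustrative. Because the parameters are estimated from the same data, the distances are suitable for ranking candidates but not for a textbook KS significance test.

```python
import math
from statistics import NormalDist, mean, stdev

lifetimes = [12, 14, 15, 6, 9, 10, 11, 12, 14, 13, 13, 10, 14, 9, 18]

def ecdf(sample):
    """Return the empirical CDF: F_n(t) = fraction of observations <= t."""
    ordered = sorted(sample)
    n = len(ordered)
    def F(t):
        return sum(1 for v in ordered if v <= t) / n
    return F

def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance sup_t |F_n(t) - cdf(t)|,
    evaluated just before and at each jump of the empirical CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, v in enumerate(ordered, start=1):
        c = cdf(v)
        d = max(d, i / n - c, c - (i - 1) / n)
    return d

mu, sigma = mean(lifetimes), stdev(lifetimes)
normal_cdf = NormalDist(mu, sigma).cdf
expon_cdf = lambda t: 1.0 - math.exp(-t / mu)  # exponential with mean mu

d_normal = ks_statistic(lifetimes, normal_cdf)
d_expon = ks_statistic(lifetimes, expon_cdf)

# Nonparametric assessment: lifetimes are whole days, so P(life < 10) = F_n(9).
F_n = ecdf(lifetimes)
p_fail_empirical = F_n(9)

print(f"KS distance, normal: {d_normal:.3f}; exponential: {d_expon:.3f}")
print(f"empirical P(life < 10 days): {p_fail_empirical:.3f}")
```

A smaller KS distance indicates a closer fit, so in this illustration the normal candidate would be ranked ahead of the exponential one.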
In this nonparametric approach, the empirical distribution is used as the probability distribution in the analysis. This practice is common in Monte Carlo simulation for specifying a distribution when none of the standard distributions seems to


be appropriate. Many Monte Carlo and statistical software packages now allow users to use empirical distributions in their analysis. Figure 2 shows the tool in Crystal Ball that could be used to create a custom distribution.

Figure 1. Data-fitting option of Crystal Ball.

Figure 2. Using empirical distributions in a Monte Carlo setting.

Though freedom from choosing a parametric family seems to be a better alternative, the nonparametric approach has its own drawbacks. The main disadvantage of this approach is that the quality and reliability of the analysis deteriorate quickly as the number of data points drops. To a lesser extent, tractability may also pose some problems, but with modern computing this is not a big issue unless the dimension or size of the problem is large.

ASSESSMENTS USING EXPLANATORY VARIABLES

A different and equally valid approach to using data to assess probability distributions is to use historical and current information to predict certain events. For example, consider a firm that has decided to provide a new type of small short-term loan to its customers.

In this case, one useful assessment question that might be asked by this firm is how likely it is that an individual will default on a loan if it is approved. Here, historical information on default rates may not be as useful, since customers with different backgrounds and profiles may not be likely to default on their loans at the same rate. If there are historical data on loans of a similar type, then some analyses could be done to estimate the probability that an individual would default. In particular, information such as age, household income, level of education, amount of credit card debt, and current credit rating score might be useful in predicting the default rate for each individual applicant.

Tools from Categorical Data Analysis

One class of models that could be used to analyze explanatory variables is called discrete choice models, which predict the category in which an uncertain event will materialize as a function of any number of variables that are believed to have an influence on the event.

Binary Models. Consider a Bernoulli random variable $Y_i$, which depends on a set of explanatory variables $(X_{i1}, \ldots, X_{in})$. An initial model that could be considered is a linear probability model, which tries to linearly model the probability through the explanatory variables, that is,

$p_i \equiv E(Y_i) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_n X_{in}$. (4)

Two problems are typical in an approach such as this. First, the estimate that you might get may fall outside the interval [0, 1]. The other issue is that the event probability may not be linearly related to the explanatory variables. Typically, you may expect that as you go to the extremes, changes in the probability would start to become smaller. An alternative approach that addresses these issues and has also been popular in the literature is logistic regression.
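The two problems with the linear probability model are easy to see numerically. The sketch below uses a single explanatory variable (credit card debt in thousands of dollars) and coefficients that are entirely hypothetical; they are not estimated from any data in the text. It compares the linear form with the standard logistic function 1 / (1 + exp(-x)).

```python
import math

def linear_probability(x, b0, b1):
    """Linear probability model: p = b0 + b1 * x (can leave [0, 1])."""
    return b0 + b1 * x

def logistic_probability(x, b0, b1):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1 * x))), always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients chosen only to illustrate the shapes.
LIN_B0, LIN_B1 = 0.05, 0.06
LOG_B0, LOG_B1 = -3.0, 0.3

debts = range(0, 21, 5)  # credit card debt in $1000s (illustrative values)
for x in debts:
    p_lin = linear_probability(x, LIN_B0, LIN_B1)
    p_log = logistic_probability(x, LOG_B0, LOG_B1)
    flag = "  <- outside [0, 1]" if not 0.0 <= p_lin <= 1.0 else ""
    print(f"debt={x:2d}  linear={p_lin:5.2f}  logistic={p_log:5.3f}{flag}")

# Diminishing changes toward the extremes under the logistic model:
mid_step = logistic_probability(10, LOG_B0, LOG_B1) - logistic_probability(9, LOG_B0, LOG_B1)
tail_step = logistic_probability(20, LOG_B0, LOG_B1) - logistic_probability(19, LOG_B0, LOG_B1)
print(f"change near the middle: {mid_step:.3f}, change in the tail: {tail_step:.3f}")
```

With these illustrative coefficients the linear model returns a value above 1 at the largest debt level, while the logistic curve stays inside the unit interval and flattens in the tail.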
Logistic regressions are models that fit the log-odds ratio (or "logit") of a binomial random variable as a linear function of the predictor variables $\{X_{ij}\}_{j=1}^{n}$, that is,

$\operatorname{logit}(p_i) = \ln\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_n X_{in}$. (5)

Equation (5) is equivalent to

$p_i = P(Y_i = 1) = \dfrac{1}{1 + \exp\left(-(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_n X_{in})\right)}$. (6)

The term logistic comes from the fact that Equation (6) is similar in form to a logistic curve/function, which has the form $f(x) = (1 + \exp(-x))^{-1}$. Figure 3 shows an example comparing a linear probability model with a logistic regression.

Figure 3. A comparison of the linear probability and logit models. [Plot of probability against the dependent variable, with one curve for the linear model and one for the logit model.]

To solve for the coefficients in a logistic model, ML estimation is performed. In certain instances the ML estimation procedure can be computationally tedious, especially for

large problems. However, standard statistical software packages such as Stata, SAS, and SPSS now have a built-in logistic regression tool that performs these computations quickly and efficiently.

Multinomial Models. In cases where there are more than two possible states or categories, the binary model can be extended to the multinomial case. If $Y_i$ can take on the values $\{1, 2, \ldots, m\}$, then we can write the logit equations as

$\operatorname{logit}(p_{ij}) = \beta_{0j} + \beta_{1j} X_{i1} + \cdots + \beta_{nj} X_{in}$, (7)

for each category $j$. Since we also know that $\sum_j p_{ij} = 1$, we only need $m - 1$ equations for Equation (7). Hence, one category (without loss of generality, say state 1) is typically chosen to be the reference state; that is, the log-odds ratio is always compared to the first state, making things easier to compare, or mathematically,

$\operatorname{logit}(p_{ij}) = \ln\left[\dfrac{P(Y_i = j)}{P(Y_i = 1)}\right]$. (8)

Then, Equation (7) can be rewritten as

$p_{i1} = P(Y_i = 1) = \dfrac{1}{1 + \sum_{k=2}^{m} \exp(\beta_{0k} + \beta_{1k} X_{i1} + \cdots + \beta_{nk} X_{in})}$, (9)

$p_{ij} = P(Y_i = j) = \dfrac{\exp(\beta_{0j} + \beta_{1j} X_{i1} + \cdots + \beta_{nj} X_{in})}{1 + \sum_{k=2}^{m} \exp(\beta_{0k} + \beta_{1k} X_{i1} + \cdots + \beta_{nk} X_{in})}$ for $j \neq 1$. (10)

Again, many software packages available today incorporate a multinomial logistic regression tool. In addition, other multinomial logistic models (ordered state space models, conditional logistic models) are available in these packages. An excellent discussion of these more sophisticated models is provided in Refs 6 and 7.

Tools from Time Series Models and Forecasting

Alternatively, we can also use some inferential techniques to generate probability assessments over continuous intervals. Consider a fashion retailer who is trying to provide probabilities for sales for the next few quarters. Given the presence of past data, projections for future revenue streams can be generated by using, say, some time series forecasting technique that provides a predictive distribution for future sales.
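As a rough illustration of how a forecast can yield a predictive distribution, the sketch below fits a straight-line trend to a short quarterly sales series by least squares and treats the next quarter's sales as normally distributed around the trend forecast. Everything specific here is an assumption made for illustration: the sales figures are invented, the target of 160 is arbitrary, seasonality is ignored, and the predictive standard deviation uses only the residual scatter (it omits the uncertainty in the fitted coefficients). A statistical package would normally handle these refinements.

```python
from statistics import NormalDist

# Hypothetical quarterly sales (illustrative numbers, not from the text).
sales = [102, 110, 108, 121, 119, 130, 127, 141, 138, 150, 149, 160]
t = list(range(len(sales)))
n = len(sales)

# Least-squares fit of sales = intercept + slope * t.
t_bar = sum(t) / n
y_bar = sum(sales) / n
s_xy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, sales))
s_xx = sum((ti - t_bar) ** 2 for ti in t)
slope = s_xy / s_xx
intercept = y_bar - slope * t_bar

# Residual standard deviation (n - 2 degrees of freedom).
residuals = [yi - (intercept + slope * ti) for ti, yi in zip(t, sales)]
resid_sd = (sum(r * r for r in residuals) / (n - 2)) ** 0.5

# Simplified predictive distribution for the next quarter.
point_forecast = intercept + slope * n
predictive = NormalDist(point_forecast, resid_sd)

target = 160  # hypothetical sales target
p_exceed = 1.0 - predictive.cdf(target)
low, high = predictive.inv_cdf(0.05), predictive.inv_cdf(0.95)

print(f"point forecast: {point_forecast:.1f}")
print(f"90% predictive interval: ({low:.1f}, {high:.1f})")
print(f"P(sales > {target}): {p_exceed:.3f}")
```

The quantities of interest for a probability assessment are the interval and the exceedance probability, not the point forecast on its own.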
Figure 4 provides an example of a data series where seasonal patterns are quite evident; draws from historical values might be misleading because of the growth trend that has been experienced by this firm.

Figure 4. A sample time series forecast output from a statistical package. [Plot over quarters from Q3/96 to Q3/11.]

Moreover, one may be interested in using other predictors outside the historical data for making predictions. Generalized linear models might be used to predict, say, the sales, using other variables such as the advertising

budget, lagged sales, market share, economic indicators, and the like. The predictive distribution of the forecast can then be used as the probability distribution for the unknown quantity. Here, the emphasis is not on the point forecast generated by the model but rather on the probabilistic information that is contained in the predictive distribution.

SOME CONSIDERATIONS

The use of data can be very useful in determining an appropriate probability distribution for an uncertain event. Historical data about the uncertainty can provide a glimpse of the past behavior of the uncertainties being studied, while related information may provide a good basis for making inferences about these uncertainties. Despite the many advantages of using data, one must still be cautious of falling into the following traps that are common in this process:

1. Framing Traps. Frame blindness refers to the tendency to approach a problem in an inappropriate manner because one has already accepted a specific framework with little thought, leading to the discounting of better alternatives [8]. This narrowness in perspective often blinds decision makers. A common tendency is to lock in on a data set and discard any other theory or model that may reject this data set before doing a reasonable exploration of options. This can be due to a number of reasons, such as having an initial cost associated with the data set, the data set being accepted as a common standard in the past, or perhaps its being attributable to a specific individual/source/group.

2. Illusion of Objectivity. One particular phenomenon with the use of data is the illusion of objectivity [9]. This refers to the distortion in the stakeholder's understanding of the situation by making him or her believe that the assessments made using the data are purely objective and free from any input from individuals.
In many cases, the data used in the generation of probability assessments are devoid of human inputs, but the choice of which data to use and how to model the data (including, for example, distributional assumptions or choice of a model in a regression setting) often involves a decision maker who ascertains why the use of such data over others is appropriate.

3. Purity of Data. Related to the notion of objectivity is the purity of data phenomenon, where decision makers often eliminate all sources of information that are attributable to individuals. In many instances, however, information from individuals can be very useful and insightful. Research has shown that combining forecasts (see Bayesian Aggregation of Experts' Forecasts; Combining Forecasts) can lead to improvements in forecasting accuracy, and this is true regardless of whether the forecasts come from individuals, groups, models, or data.

Data can be a powerful tool in providing us more information about uncertainties that we have limited information on. They are useful as long as we keep in mind that data can never be used without a sprinkling of caution and a good dose of critical thought.

REFERENCES

1. Lipton J, Shaw WD, Holmes J, et al. Short communication: Selecting input distributions for use in Monte Carlo simulations. Regul Toxicol Pharm 1995;21:
2. Lehmann EL, Casella G. Theory of point estimation. Springer texts in statistics. 2nd ed. New York: Springer;
3. D'Agostino RB, Stephens M. Goodness-of-fit techniques. Statistics: a series of textbooks and monographs, Volume 68. New York: Marcel Dekker;
4. Haas C. Importance of distributional form in characterizing inputs in Monte Carlo risk assessments. Risk Anal 1997;17(1):
5. Seiler FA, Alvarez JL. On the selection of distributions for stochastic variables. Risk Anal 1996;16(1):5-18.

6. Agresti A. Categorical data analysis. 2nd ed. Wiley series in probability and statistics. Hoboken (NJ): Wiley-Interscience;
7. Hosmer DW, Lemeshow S. Applied logistic regression. Wiley series in probability and statistics. Hoboken (NJ): Wiley-Interscience;
8. Russo JE, Schoemaker PJH. Decision traps: ten barriers to brilliant decision making and how to overcome them. New York: Doubleday;
9. Berger JO, Berry DA. Statistical analysis and the illusion of objectivity. Am Sci 1988;76(2):

FURTHER READING

Frey HC, Burmaster DE. Methods for characterizing variability and uncertainty: comparison of bootstrap simulation and likelihood-based approaches. Risk Anal 1999;19(1):
Greene WH. Econometric analysis. 6th ed. New Jersey: Prentice Hall;
Hamed M, Bedient P. On the effect of probability distributions of input variables in public health risk assessment. Risk Anal 1997;17(1):
Hamilton JD. Time series analysis. New Jersey: Princeton University Press;
Hattis D, Burmaster DE. Assessment of variability and uncertainty distributions for practical risk analysis. Risk Anal 1994;14(5):
Kleinbaum DG, Klein M. Logistic regression. 2nd ed. New York: Springer;
Lloyd CJ. Statistical analysis of categorical data. New York: Wiley Interscience;
Thompson KM. Software review of distribution fitting programs: Crystal Ball and BestFit Add-In. Hum Ecol Risk Assess 1999;5(3):
