CS626 Data Analysis and Simulation
Instructor: Peter Kemper, R 14A, phone 221-3462, email: kemper@cs.wm.edu
Office hours: Monday, Wednesday 2-4 pm
Today: Stochastic Input Modeling
Based on the WSC 2010 tutorial by Biller and Gunes, CMU; slides used with permission
Reference: Law/Kelton, Simulation Modeling and Analysis, Ch. 6
Big Picture: Model-based Analysis of Systems
[Diagram labels: real world (portion/facet); perception; description; real-world problem; solution to real-world problem; transfer; decision; formal model; transformation; presentation; probability model, stochastic process; formal / computer-aided analysis; solution, rewards, qualitative and quantitative properties]
What is input modeling?
- Input modeling: deriving a representation of the uncertainty or randomness in a stochastic simulation.
- Common representations:
  - Measurement data
  - Distributions derived from measurement data <-- focus of input modeling
    - usually requires that samples are i.i.d. and that the corresponding random variables in the simulation model are i.i.d. (i.i.d. = independent and identically distributed)
    - theoretical distributions
    - empirical distribution
  - Time-dependent stochastic process
  - Other stochastic processes
- Examples include: time to failure for a machining process; demand per unit time for inventory of a product; number of defective items in a shipment of goods; times between arrivals of calls to a call center.
Why are input models stochastic?
- We just cannot assume randomness away.
- Example (Nelson and Biller 23): Suppose you are a supplier of a component that you know has a mean time to failure of 2 years. A client is willing to pay $100 for your component, but wants you to pay a penalty of $500 if failure occurs in less than one year. Should you take this contract?
- No uncertainty: every component lasts exactly 2 years, so you pocket $100 for each component you sell.
- Uncertainty:
  - If the time to failure is well modeled as exponentially distributed (an input model) with mean 2 years, then F(1) = 1 - exp(-1/2) = .39, and you can expect to lose 500 * .39 - 100 = $95 on each component you sell.
  - If the time to failure is well modeled as uniformly distributed (an input model) between 0 and 4 years (so that the mean lifetime is 2 years), then F(1) = 1/4 = .25, and the expected loss on each component is 500 * .25 - 100 = $25.
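The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: money is measured in units of the sale price (so the penalty is 5 units), and the helper name `expected_profit` is made up for illustration. The exact value F(1) = 0.3935 gives a slightly larger loss than the rounded .39 used on the slide.

```python
import math

def expected_profit(price, penalty, prob_early_failure):
    """Expected profit per component: collect `price`, pay `penalty` on early failure."""
    return price - penalty * prob_early_failure

MEAN_LIFE = 2.0   # years, from the example
DEADLINE = 1.0    # years, failure before this time triggers the penalty

# Exponential with mean 2: F(t) = 1 - exp(-t / mean)
f_exp = 1.0 - math.exp(-DEADLINE / MEAN_LIFE)
# Uniform on [0, 2 * mean] = [0, 4]: F(t) = t / 4
f_unif = DEADLINE / (2.0 * MEAN_LIFE)

# Assumption: amounts in units of the sale price, penalty = 5 units.
profit_exp = expected_profit(1.0, 5.0, f_exp)
profit_unif = expected_profit(1.0, 5.0, f_unif)

print(f"exponential: F(1) = {f_exp:.4f}, expected profit = {profit_exp:+.3f}")
print(f"uniform:     F(1) = {f_unif:.4f}, expected profit = {profit_unif:+.3f}")
```

Both input models have the same mean, yet both turn the deterministic profit into an expected loss, and the size of the loss depends on the chosen distribution.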
Learning objectives
- Concept of input modeling and its fit in simulation model development.
- Input modeling with data:
  - Physical basis for distributions.
  - Fitting and checking.
- Input modeling without data:
  - Sources of information.
  - Incorporating expert opinion.
What is input modeling?
- Input modeling: deriving a representation of the uncertainty or randomness in a stochastic simulation.
- Randomness? A way to describe the behavior of a subsystem that
  - (lack of knowledge) we cannot describe as a deterministic system, or
  - (lack of interest, abstraction from details) we do not want to describe as a deterministic system.
What is input modeling?
Example model: G/G/n/m FCFS queue
- Customers (tasks) arrive according to some general distribution G.
- Customers are served for a time according to some distribution G.
- n servers are available to serve customers in parallel.
- Customers are scheduled following first-come-first-serve (FCFS).
- m is the capacity of the queue (customers hitting a full system are turned away).
- Design question: What values of n and m are necessary to limit the waiting time for 9% of all customers to 1 min and to limit the fraction of customers that get turned away to 5% in the long run?
- What pieces of information does input modeling contribute to this simulation study?
Photo: Stuart Richards (Left-hand), Flickr, Creative Commons
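To make the role of the input models concrete, here is a minimal sketch of such a queue in Python. It is an illustration under stated assumptions, not the course's simulator: the function name `simulate_ggnm`, the reading of m as the number of waiting places (system capacity n + m), the two placeholder distributions, and the quantile level 0.9 are all assumptions. The two `draw_...` callables are exactly the pieces that input modeling has to supply.

```python
import heapq
import random

def simulate_ggnm(n_servers, queue_cap, draw_interarrival, draw_service,
                  n_customers, rng):
    """FCFS multi-server queue with finite waiting room.

    Returns (waits of accepted customers, fraction of customers turned away).
    """
    free_at = [0.0] * n_servers   # min-heap: times at which servers become free
    in_system = []                # min-heap: departure times of accepted customers
    waits, rejected, t = [], 0, 0.0
    for _ in range(n_customers):
        t += draw_interarrival(rng)
        while in_system and in_system[0] <= t:   # forget customers who already left
            heapq.heappop(in_system)
        if len(in_system) >= n_servers + queue_cap:
            rejected += 1                          # full system: customer is turned away
            continue
        start = max(t, heapq.heappop(free_at))     # FCFS: next customer takes earliest free server
        depart = start + draw_service(rng)
        heapq.heappush(free_at, depart)
        heapq.heappush(in_system, depart)
        waits.append(start - t)
    return waits, rejected / n_customers

rng = random.Random(626)
waits, loss = simulate_ggnm(
    n_servers=2, queue_cap=5,
    draw_interarrival=lambda r: r.expovariate(1.0),    # placeholder input model
    draw_service=lambda r: r.gammavariate(2.0, 0.8),   # placeholder input model
    n_customers=20_000, rng=rng)
waits.sort()
q90 = waits[int(0.9 * len(waits))]                     # assumed quantile level
print(f"fraction turned away: {loss:.3f}, 0.9-quantile of waiting time: {q90:.2f}")
```

Answering the design question then amounts to repeating such runs for different n and m, which only makes sense once the two input distributions are trustworthy.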
Cookbook recipe for conducting a simulation study
[Flow diagram; labels as extracted: Statement of the decision problem and objectives; System Analysis; Data Collection; Verification; Output Analysis; Validation; Input Modeling Development; Removal of initial-condition bias; Experimental Design; Model Building; Design and coding of the simulation program; Determination of the replication number for error control; Simulation runs; Rough-cut Model Development; Static (Spreadsheet) Simulation; Dynamic System Simulation; Comparison via Simulation; Statistical analysis of results and system design comparison; Static Models; Dynamic Models; Simulation Optimization; Recommendation for decisions and implementation of the model; Final documentation]
Simulation model development
[Diagram: Real-World Process or Phenomenon -> (Simulation Modeling) -> Simulation Model -> (Simulation Programming) -> Simulation Program; Real-World Process or Phenomenon -> (Simulation Input Modeling) -> Random Input Model -> (Random Variate Programming) -> Random Variate Generator]
G/G/n/m FCFS queueing model revisited
Conceptual model
- Customers (tasks) arrive according to some general distribution G.
- Customers are served for a time according to distribution G.
- n servers are available to serve customers in parallel.
- Customers are scheduled following first-come-first-serve (FCFS).
- m is the capacity of the queue (customers hitting a full system are turned away).
- Design question: What values of n and m are necessary to limit the waiting time for 9% of all customers to 1 min and to limit the fraction of customers that get turned away to 5% in the long run?
Input model
- Measurement data for task arrivals and service times for a certain time.
- Option 1: Trace-driven simulation
  - use measurement data to feed a simulation run
- Option 2: Simulation draws from a probability distribution
  - needs selection/configuration of a distribution (distribution fitting)
  - alternative: empirical distribution
- Option 3: Simulation executes a stochastic process (later)
Input model development
- There is no true model for any stochastic input. The best that we can hope is to obtain an approximation that yields useful results.
  "Essentially, all models are wrong, but some are useful." Box, George E. P.; Norman R. Draper. Empirical Model-Building and Response Surfaces. Wiley 1987.
- A key distinction in input modeling problems is the presence or absence of data:
  - When we have data, then we fit a model to the data. Software support:
    - Special-purpose software, e.g., ExpertFit by A. Law
    - Simulation environments include this, e.g., Arena by Rockwell Automation
    - Statistics packages provide key functionality, e.g., R (www.r-project.org)
  - When no data are available, then we have to creatively use what we can get to construct an input model.
Collecting data
- Generally hard, expensive, frustrating, boring:
  - System might not exist.
  - Data available on the wrong things: might have to change the model according to what is available.
  - Incomplete, dirty data.
  - Too much data (!)
- Sensitivity of outputs to uncertainty in inputs.
- Match model detail to quality of data.
- Cost should be budgeted in project.
- Capture variability in data: model validity.
Example: Traffic measured at a node in a network
- Plot shows the sequence of time stamps for a series of requests (arrival stream).
- Observations: the data are a concatenation of several measurements, visible as restarts close to 0 or as unreasonably wide gaps to higher values of the time stamps.
- Thresholds are needed to automate subsequence detection: x = 2s for a drop of time, y = 1s for an increase.
- Note: Check consistency ahead of any numerical analysis!
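The threshold-based subsequence detection can be sketched as follows. This is a hypothetical illustration: the function name `split_measurements`, the parameter names `max_drop` and `max_gap` (playing the roles of the thresholds x and y), the threshold values, and the sample data are all made up.

```python
def split_measurements(timestamps, max_drop, max_gap):
    """Split a time-stamp sequence wherever time drops by more than `max_drop`
    (a restart) or jumps forward by more than `max_gap` (an unreasonable gap)."""
    runs, current = [], []
    for ts in timestamps:
        if current:
            delta = ts - current[-1]
            if delta < -max_drop or delta > max_gap:
                runs.append(current)   # close the current measurement run
                current = []
        current.append(ts)
    if current:
        runs.append(current)
    return runs

# Made-up time stamps: one restart near 0 and one wide gap.
stamps = [0.1, 0.4, 0.9, 1.3, 0.0, 0.2, 0.7, 250.0, 250.3]
runs = split_measurements(stamps, max_drop=1.0, max_gap=100.0)
for run in runs:
    print(len(run), run)
```

Small backward steps below `max_drop` are deliberately kept inside a run, so out-of-order entries within one measurement remain visible for a later consistency check.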
Example: Traffic measured at a node in a network
- Plot shows the sequence of time differences for the first 2k of events.
- Observations: a closer look reveals that the subsequences are not necessarily accurately ordered.
- Options?
  1. remove out-of-order entries
  2. consider ordered subsequences
  3. sort each subsequence
- Note: Check consistency ahead of any numerical analysis!
Input model development
[Diagram: Real-World Process -> Collecting Data -> Approaches (Fitting Probability Distributions; Using Data Itself; Expert Opinion) -> Input Model (Fit); further labels: Goodness of the Fit, Validation]
Notes on using data itself: Trace-driven simulation
- Example: the simulator needs the arrival of the i-th customer: pick the i-th arrival from the data.
- Limitations and challenges:
  - Can never go outside your observed data. No tail and nothing in the gaps.
  - Difficult to reflect dependencies in the inputs.
  - Need to change the data when the input process changes.
  - May not have enough data for long or many runs.
  - Difficult to configure, e.g., customers arrive twice as fast...
  - A huge amount of data requires a huge amount of space.
- On the positive side:
  - measurement data can naturally incorporate all kinds of qualitative and quantitative constraints and necessary details for a realistic run
  - allows for a direct comparison of the real system with the simulated system, and for validation
Fitting Probability Distributions
- Precondition: i.i.d. assumption for the sample data used in fitting
  - i.i.d. assumption for the RVs in the real system must be validated
  - Corresponding graphical techniques/statistical tests... later!
- Focus: univariate distributions (i.e. just one RV)
- Most probability distributions were invented to represent a particular physical situation.
- If we know the physical basis for a distribution, then we can match it to the situation we have to model.
- Examples:
  - Binomial
  - Poisson and Exponential
  - Normal and Lognormal
  - Beta, Pert, and Triangular
  - Uniform
(See Law, Chapter 6 (2007), for a detailed list)
G/G/m/n FCFS Example refined (from Law, Example 6.1)
Does the selection of the distribution really matter?
- Arrivals: exponential, rate λ = 1, m = 1, n = ∞
- Service times: given 200 samples, distribution unknown
- Exercise different distributions, with parameters fitted to match the data.
- Make 100 independent simulation runs using each of the 5 distributions; continue each of the 500 runs to collect 1000 delays; observe the impact of the selected distribution:

Distribution     Delay in queue   Number in queue   Prop. delays ≥ 20
Exponential      6.71             6.78              0.064
Gamma            4.54             4.60              0.019
Weibull (best)   4.36             4.41              0.013
Lognormal        7.19             7.30              0.078
Normal           6.04             6.13              0.045
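The effect can be reproduced qualitatively with a small experiment. The sketch below is not Law's experiment: it has no measured service times, so it simply compares two placeholder service-time models with the same assumed mean (0.8) but different shape, using Lindley's recursion for the delay in a single-server FCFS queue. The function name `mean_delay_mg1` and all parameter values are assumptions.

```python
import random

def mean_delay_mg1(draw_service, n_customers, rng, arrival_rate=1.0):
    """Average delay in queue of a single-server FCFS queue via Lindley's recursion:
    D[i+1] = max(0, D[i] + S[i] - A[i+1])."""
    delay, total = 0.0, 0.0
    for _ in range(n_customers):
        total += delay
        interarrival = rng.expovariate(arrival_rate)
        delay = max(0.0, delay + draw_service(rng) - interarrival)
    return total / n_customers

MEAN_SERVICE = 0.8   # assumed value: utilization 0.8 at arrival rate 1
candidates = {
    "exponential": lambda r: r.expovariate(1.0 / MEAN_SERVICE),
    "gamma(shape=4)": lambda r: r.gammavariate(4.0, MEAN_SERVICE / 4.0),
}
results = {name: mean_delay_mg1(draw, 200_000, random.Random(1))
           for name, draw in candidates.items()}
for name, d in results.items():
    print(f"{name:15s} average delay in queue = {d:.2f}")
```

Although both models agree on the mean service time, the more variable exponential model produces a clearly longer average delay, which is the point of the table: the choice of distribution matters, not just its mean.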
Some Distributions
[Plots: Exponential, Gamma, Weibull, Lognormal, Normal]
Parameterization of distributions
Parameters of 3 basic types:
- Location
  - specifies an x-axis location point of a distribution's range of values
  - usually the midpoint (e.g. mean for the normal distribution) or the lower end point of the distribution's range
  - sometimes called shift parameter, since changing its value shifts the distribution to the left or right, e.g., for Y = X + γ
- Scale
  - determines the scale (unit) of measurement of the values in the range of the distribution (e.g. std deviation σ for the normal distribution)
  - changing its value compresses/expands the distribution but does not alter its basic form, e.g., for Y = β X
- Shape
  - determines the basic form/shape of a distribution
  - changing its value alters a distribution's properties, e.g. skewness, more fundamentally than a change in location or scale
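A short numerical check of the location and scale transformations, as a sketch with made-up values for γ and β: shifting changes the mean but not the spread, while scaling multiplies both. The Weibull sample is only a convenient skewed example; `random.weibullvariate(alpha, beta)` takes the scale first and the shape second.

```python
import random
import statistics

rng = random.Random(0)
x = [rng.weibullvariate(1.0, 1.5) for _ in range(10_000)]   # scale 1, shape 1.5

gamma_shift, beta_scale = 3.0, 2.0          # illustrative values only
shifted = [xi + gamma_shift for xi in x]    # location change: Y = X + gamma
scaled = [beta_scale * xi for xi in x]      # scale change:    Y = beta * X

for name, data in [("X", x), ("X + gamma", shifted), ("beta * X", scaled)]:
    print(f"{name:10s} mean = {statistics.fmean(data):.3f}  "
          f"stdev = {statistics.stdev(data):.3f}")
```

Changing the shape argument (1.5 above) is the only one of the three changes that cannot be undone by re-centering and re-scaling the sample.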
Physical basis for binomial distribution
- Binomial: models the number of successes in n independent Bernoulli trials, with probability p of success in each trial.
- Example: the number of defective components found in a lot of n components, with probability p of picking a defective component.
- E[X] = np, Var[X] = np(1-p)
[Figure: probability mass functions of Binomial(5, .2) and Binomial(5, .8)]
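The moment formulas can be verified directly from the probability mass function. A minimal sketch for the Binomial(5, .2) case shown in the figure; the helper name `binomial_pmf` is made up.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.2
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"sum of pmf = {sum(pmf):.6f}")
print(f"mean = {mean:.4f} (np = {n * p:.4f})")
print(f"var  = {var:.4f} (np(1-p) = {n * p * (1 - p):.4f})")
```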
Physical basis for Poisson distribution
- Poisson: models the number of independent events that occur in a fixed amount of time.
- Example: number of customers arriving at a store during 1 hr.
- E[X] = λ, Var[X] = λ
[Figure: probability mass functions of Poisson(1) and Poisson(5)]
Physical basis for exponential distribution
- Exponential: models the time between independent events, or a process time which is memoryless.
- Example: the time to failure for a system that has a constant failure rate over time.
- Note: If the time between events is exponential, then the number of events in a fixed time interval is Poisson.
- E[X] = λ, Var[X] = λ², where λ denotes the mean time between events (the parameter of Expon in the plots), i.e. the event rate is 1/λ.
[Figure: densities of Expon(1) and Expon(3), each with Shift = -2.5]
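The link between exponential gaps and Poisson counts can be illustrated by simulation. In this sketch the function name `counts_per_interval` and the values are assumptions; note that Python's `expovariate` takes the rate (events per unit time), which is the reciprocal of the mean gap. For Poisson counts, mean and variance should both be close to the rate.

```python
import random
import statistics

def counts_per_interval(rate, n_intervals, rng):
    """Count events in unit-length intervals when the gaps between events
    are exponential with the given rate (mean gap = 1 / rate)."""
    counts = [0] * n_intervals
    t = rng.expovariate(rate)
    while t < n_intervals:
        counts[int(t)] += 1          # event falls into interval [int(t), int(t) + 1)
        t += rng.expovariate(rate)
    return counts

counts = counts_per_interval(rate=5.0, n_intervals=20_000, rng=random.Random(42))
count_mean = statistics.fmean(counts)
count_var = statistics.pvariance(counts)
print(f"mean count = {count_mean:.3f}, variance of count = {count_var:.3f}")
```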
Physical basis for normal distribution
- Normal distribution: models quantities that are the sum of a large number of other quantities.
  - Example: time to assemble a product.
  - Normal: E[X] = µ, Var[X] = σ²
- Student t distribution: very similar to normal, but with heavier tails.
[Figure: densities of Normal(0, 1) vs Student(6), with markers P(X <= -1.645) = 5.0% and P(X <= 1.645) = 95.0%]
Physical basis for lognormal distribution
- Lognormal: models the distribution of a process that can be thought of as the product of a number of component processes.
- Example: the rate of return on an investment, when interest is compounded, is the product of the returns for a number of periods.
- Also used for:
  - time to perform some task
  - quantities that are the product of a large number of others (by virtue of the central limit theorem)
[Figure: densities of Lognorm(2.5, 2) and Lognorm(2.5, 5), each with Shift = -2.5]
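Why products lead to the lognormal: the logarithm of a product is a sum, and sums of many independent terms are approximately normal. The sketch below uses made-up per-period growth factors, uniform on [0.9, 1.1], and a hand-written `skewness` helper; it shows that the products are right-skewed while their logarithms are nearly symmetric.

```python
import math
import random
import statistics

def skewness(data):
    """Sample skewness: mean of the cubed standardized values."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

rng = random.Random(7)
# Each value is the product of 50 independent growth factors (illustrative choice).
products = [math.prod(rng.uniform(0.9, 1.1) for _ in range(50))
            for _ in range(10_000)]
logs = [math.log(v) for v in products]

skew_products = skewness(products)
skew_logs = skewness(logs)
print(f"skewness of products: {skew_products:.2f}")
print(f"skewness of logs:     {skew_logs:.2f}")
```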
Physical basis for beta distribution
- Beta: an extremely flexible distribution used to model bounded (fixed upper and lower limits) random variables in the absence of data.
- Typical uses:
  - rough model in the absence of data
  - distribution of a random proportion, such as the proportion of defective items in a shipment
  - time to complete a task, e.g. in a PERT network
- Example: proportion of defective items in a shipment.
[Figure: densities of Beta(1.5, 5) and Beta(5, 1.5)]
Physical basis for Pert (Beta) distribution
- Pert (Beta in disguise): used to model the activity times in project management problems; defined by three point estimates: min, mode, max.
- Example: time to complete a task in a PERT network.
- PERT is a method to analyze the tasks involved in completing a given project, especially the time needed to complete each task, and to identify the minimum time needed to complete the total project.
[Figure: densities of Pert(5, 6, 15) and Pert(5, 13, 15)]
Physical basis for triangular distribution
- Triangular: models a process when only the minimum, most likely, and maximum values of the distribution are known.
- Example: the minimum, most likely, and maximum inflation rate we will have this year.
[Figure: densities of Triang(5, 6, 15) and Triang(5, 13, 15)]
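Both the Pert and the triangular distribution are specified by the same three-point estimate, so they are easy to compare side by side. The sketch below assumes the common PERT parameterization as a rescaled Beta with shape parameters 1 + 4(mode - min)/(max - min) and 1 + 4(max - mode)/(max - min); the slides do not state which parameterization the plotting tool uses, and the function name `pert_variate` is made up. Note that `random.triangular` takes its arguments in the order (low, high, mode).

```python
import random
import statistics

def pert_variate(low, mode, high, rng, lam=4.0):
    """Sample a PERT variate as a rescaled Beta (assumed standard weight lam = 4)."""
    a = 1.0 + lam * (mode - low) / (high - low)
    b = 1.0 + lam * (high - mode) / (high - low)
    return low + (high - low) * rng.betavariate(a, b)

LOW, MODE, HIGH = 5.0, 6.0, 15.0           # three-point estimate from the plots
rng = random.Random(3)
pert_sample = [pert_variate(LOW, MODE, HIGH, rng) for _ in range(50_000)]
tri_sample = [rng.triangular(LOW, HIGH, MODE) for _ in range(50_000)]

pert_mean = statistics.fmean(pert_sample)
tri_mean = statistics.fmean(tri_sample)
print(f"Pert mean       = {pert_mean:.3f} (theory {(LOW + 4 * MODE + HIGH) / 6:.3f})")
print(f"Triangular mean = {tri_mean:.3f} (theory {(LOW + MODE + HIGH) / 3:.3f})")
```

With the same inputs, the Pert distribution puts more weight near the mode, so its mean lies closer to the most likely value than the triangular mean does.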
Physical basis for uniform distribution
- Discrete Uniform: models complete uncertainty, since all outcomes are equally likely.
- Example: a first model for a quantity that is varying among the integers 1 through 4, but about which little else is known.
[Figure: probability mass function of DUniform({x}) on the integers 1 through 4]
Distributions
- Many theoretical distributions with nice properties:
  - experience with scenarios when to apply those
  - well-studied properties, parameters, characteristics
  - compact representation of data
  - software support for sampling in simulation runs
  - software support to perform parameter fitting
  - easy to vary by modification of parameters
  - some allow for closed-form analytical formulas for system analysis (queueing networks)
  - may allow for numbers beyond reasonable limits, e.g. negative values or very high values, such that truncation may be necessary
  - less sensitive to data irregularities than an empirical distribution
- For distributions and their relationships see also:
  - Wheyming Song and Yi-Chun Chen, Simulation Input Models: Relationships Among Eighty Univariate Distributions Displayed in a Matrix Format, Proceedings Winter Simulation Conference 2010.
  - Larry Leemis: Univariate Distribution Relationships, www.math.wm.edu/~leemis/chart/udr/udr.html
Overview of fitting with data
1. Select one or more candidate distributions, based on physical characteristics of the process and graphical examination of the data.
2. Fit the distribution to the data: determine values for its unknown parameters.
3. Check the fit to the data via statistical tests and via graphical analysis.
4. If the distribution does not fit, select another candidate and repeat the process, or use an empirical distribution.
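The select, fit, and check loop can be sketched with the Python standard library alone. Everything here is illustrative: the data are synthetic (drawn from a lognormal as a stand-in for measurements), the two candidates are chosen because their maximum-likelihood estimates have closed forms, and the Kolmogorov-Smirnov distance is used only as a descriptive measure of fit. A formal test would need critical values that account for the estimated parameters.

```python
import math
import random
import statistics

def ks_statistic(sorted_data, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and a fitted CDF."""
    n = len(sorted_data)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n)
               for i, x in enumerate(sorted_data))

rng = random.Random(11)
# Synthetic stand-in for measurement data.
data = sorted(rng.lognormvariate(0.0, 0.5) for _ in range(500))

# Step 2: fit each candidate by maximum likelihood.
mean_hat = statistics.fmean(data)                      # exponential: MLE of the mean
logs = [math.log(x) for x in data]
mu_hat = statistics.fmean(logs)                        # lognormal: MLE of mu
sigma_hat = statistics.pstdev(logs)                    # lognormal: MLE of sigma

candidates = {
    "exponential": lambda x: 1.0 - math.exp(-x / mean_hat),
    "lognormal": lambda x: 0.5 * (1.0 + math.erf(
        (math.log(x) - mu_hat) / (sigma_hat * math.sqrt(2.0)))),
}

# Step 3: check the fit; a smaller distance means a closer fit.
ks = {name: ks_statistic(data, cdf) for name, cdf in candidates.items()}
for name, d in ks.items():
    print(f"{name:12s} KS distance = {d:.3f}")
```

A candidate with a large distance would be rejected in step 4 and replaced by another candidate or by the empirical distribution.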