Module 1: Fundamentals of Data Analysis


Using Statistical Data to Make Decisions
Module 1: Fundamentals of Data Analysis
Dr. Tom Ilvento, University of Delaware
Dr. Mugdim Pašić, Sarajevo Graduate School of Business

Statistics are an important tool for many fields - business, the physical sciences, economics and the social sciences, engineering and the biological sciences. They enable us to examine and test important research questions concerning individual variables and the relationships among a set of variables. The results, if used properly, can help make difficult decisions.

Many students approach statistics with some fear and trepidation. Common concerns involve anxiety over math skills and a feeling of distrust for the relevance of statistics. In terms of the former concern, modern desktop and laptop computers have made most of the calculations of statistics easy and painless. These tools enable us to focus on the more important aspects of good data analysis practice and interpreting results. As for the latter concern of relevance, I believe we see examples of the importance of statistics each and every day. For example:

How do they know how much snow has fallen given snow is so difficult to measure? They actually have a standard protocol and take an average of several measurements.
Can a business make decisions about the future by analyzing data from the past? Yes! Prediction of the future, though always with some uncertainty, is a core use of statistics by businesses. Making informed decisions with data is a value-added opportunity for businesses.
Can we ever get a good measurement of crowd size at war or policy protests? I don't have an answer for this one, but estimates vary wildly!
Can a sales team make decisions on new products from a sample of consumers? Yes, we can make estimates. Marketing research is a big user of statistics to help make decisions on price, product attributes, and the formation of new products.
How do drug trials lead to the acceptance of a new drug?
Statistics are a big part of this process! Ultimate acceptance of a new drug for sale requires an elaborate experimental trial which is analyzed by statistical models. Millions of dollars ride on the outcome of the statistical analysis.

Key Objectives
Understand the difference between the descriptive and inferential aspects of statistics
Understand the concept of a random sample, measurement, and levels of measurement
Understand the use of basic summary statistics of measures of central tendency and measures of dispersion

In this Module We Will Be:
Describing data using summary measures of Central Tendency and Dispersion
Looking at graphical displays of data: box plots, stem and leaf, and time series smoothing
Transforming data with logs, inversions, trimming and dealing with outliers

For more information, contact: Tom Ilvento, 213 Townsend Hall, Newark, DE, ilvento@udel.edu

The focus of this course is on understanding the basics of statistics. I would like you to gain an appreciation for how descriptive and inferential statistics are used in your business or field; how to analyze a set of data; how to present the data and make meaningful and coherent conclusions to others; and how to critique the use of statistics by others.

WHAT ARE STATISTICS?

There are many concepts of statistics and what it means. Statistics are thought to be the data itself, as in "the government released the latest statistics on unemployment"; a field of study in mathematics; and a set of tools used by many disciplines to analyze data. In its broadest sense, statistics is the science of data. It refers to:

Collecting data
Classifying, summarizing, and organizing data
Analysis of data
Interpretation of data

Descriptive versus Inferential Statistics. We make a distinction between two main approaches in the use of statistics for data analysis: descriptive versus inferential statistics. Both approaches are related to each other, but the distinction between them is important to note.

Descriptive statistics uses measures and graphs to summarize the data with an emphasis on parsimony. Our strategy is to find summary measures which describe the data adequately and succinctly, be they a percentage, an average, or a standard deviation. Descriptive statistics also involve describing the relationships between variables or sets of variables through the use of very sophisticated techniques, such as correlation, regression, factor analysis, logistic regression and probit analysis.

Inferential statistics involves many of the same techniques used in descriptive statistics, but takes it a step further.
Now we use these techniques to make estimates, decisions, predictions, or generalizations about a population from a smaller subset of data called a sample. The sample can be a subset of a population at a point in time or a sample of the population in time or space.
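The idea of generalizing from a sample to a population can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated, and the names `population` and `sample_mean`, along with the sizes used, are hypothetical choices made for the example.

```python
import random
import statistics

def sample_mean(population, sample_size, seed=None):
    """Draw a simple random sample (without replacement) and return its mean."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    return statistics.mean(sample)

# Hypothetical population: 100,000 simulated values centered near 50.
_rng = random.Random(42)
population = [_rng.gauss(50, 10) for _ in range(100_000)]

population_mean = statistics.mean(population)
estimate = sample_mean(population, 1_000, seed=1)
print(f"population mean = {population_mean:.2f}, sample estimate = {estimate:.2f}")
```

With a sample of 1,000 the estimate should land close to the population mean, though never exactly on it; that remaining uncertainty is what inferential statistics tries to quantify.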

Inferential statistics are a powerful tool for research. They enable us to make statements about a large group from a much smaller sample. Thus, we can survey a sample of 1,000 people and make good generalizations about 280 million people in the U.S.

Sampling. A census is when we collect data on all elements in a population. Sometimes it is difficult or impossible to get information on the entire population. An alternative is to take a sample of the population. A sample is a subset of the units or elements of a population. Sampling saves time, money, and other resources (such as computation time). In some cases, it may actually be impossible to collect information on every element of the population and sampling becomes a reasonable alternative.

A valuable property of a sample is that it is representative of the population. By this we mean that the sample characteristics resemble those possessed by the population. Inferential statistics require a sample to be representative of the population, and that can be done when the sample is drawn through a random process. A random sample is one in which each element or unit has the same chance of being selected. Classic statistical inference requires that the sample be selected through a random process. Only by taking a random sample can we have confidence when we make an inference from a sample to the population.

Measurement. Measurement is the process of assigning a number to variables of the individual elements of the population (or sample). Measurement is a bigger issue than many think. Some measurement seems relatively straightforward - distance, weight, dollars spent. However, measurement always comes with some error and perhaps even bias.
With measurement we must also deal with issues of validity (are we measuring what we think we are measuring?) and reliability (is the measuring device consistent?). A user of data is responsible for asking questions and in some cases doing preliminary analysis to determine if the measurement is valid and consistent. The process of measurement is often complex - don't take it for granted.

Levels of Measurement. There are various ways to characterize measurement of variables. An easy dichotomy in measurement is qualitative versus quantitative data. Qualitative data do not follow a natural numerical scale and thus are classified into categories such as male or female; customers versus noncustomers; and race (white, African American, Asian, and so forth). Quantitative data use measures that are recorded on a naturally occurring scale, such as age, income, or time.

A more elaborate description involves three levels of measurement - nominal, ordinal, and continuous. Nominal (or categorical) measures have no implied order or superiority and can be thought of as qualitative. A middle ground is ordinal data, where there is an implied order or rank, but the distance between units is not well specified. Rankings, opinion questions that use ordered categories such as strongly agree to strongly disagree, and variables that use an ordered scale from one to ten are examples of ordinal data. Continuous data are the same as quantitative data.

Levels of measurement are not trivial to the use of statistics. Many statistical techniques are predicated on certain levels of measurement of the variables involved. Some techniques or formulas assume a certain level is used, and misusing a statistical technique can lead to results that are biased or misleading.

GRAPHING DATA

Excel and other data management software allow us numerous ways to graph and display data. Along with summary statistics, graphs help tell the story of the data and lead to insight or explanation. Like all statistics, the validity of a graph depends upon the user. It is easy to distort a graph by manipulating the scale, collapsing data, or choosing misleading ways to represent the data. For example, even a small change in a measurement over time can look large if you adjust the scale on the axis. A good strategy for all graphing of data is to provide sufficient information and context to let the reader judge for him or herself. A good protocol to follow is to:
Give a caption or title describing the graph
Identify the source of the data
Label the axes, bars, or pie slices
Give an indication of the measurement level (e.g., in $ or $1,000s)
Identify the scale of the axes, including the starting point
Provide a context for the graph in the narrative of the report

Graphs of Qualitative Data. Pie charts and bar charts are the most frequently used graphs of qualitative data. The graph will depict the frequency or relative frequency of the categories in the variable. For example, a pie chart can depict the percentages of customers in different credit card bins at a period in time. Pie charts have limitations on how many categories can be represented (more than five begins to be a problem). Both pie charts and bar charts can present a categorical variable broken down by a second variable so that you can compare the distribution of the categories across the groups.

Graphs of Quantitative Data. There are several useful graphing techniques to show the distribution of a quantitative variable. These include a histogram, a box plot, and a stem and leaf plot. Many of these graphs require some decisions from the user that may affect the shape of the distribution. A final graph that we will look at is the scatter plot, which shows the relationship between two quantitative variables.

Histograms. A histogram is a depiction of a quantitative variable broken down into categories reflecting the range of the variable. The histogram bars represent the relative number (or percentage) of observations in each category. By taking a continuous variable and breaking it into ordinal categories we lose some information, but the loss may be tempered by the potential gain in insight provided by the graph. However, decisions made by the user, such as the width and number of the categories, can influence the shape of the histogram. Care must be taken not to distort the data with too few or too many categories. Histograms provide a good visual of the distribution of a variable and can help identify outliers, multiple modes in the data, and the skew of the data.
The easiest approach in deciding the width of the category intervals (also referred to as bins in Excel) is to determine the range of the data (maximum value minus the minimum value) and divide by the number of categories desired (a minimum of five and a maximum of 15). The number of categories is constrained by the number of observations in your variable - the more observations, the more categories possible (the default in Excel is to take the square root of the number of observations). This approach provides equal width intervals, but the frequencies within each interval will not necessarily be equal. A limitation of this approach is that the intervals may not reflect key thresholds or values important to decision-making. You may have to tweak the bin ranges to better reflect your needs.

The following is an example of a histogram from Excel. The data are miles per gallon (mpg) of 100 sub-compact cars. The default under Excel is 10 intervals (square root of 100).
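The equal-width rule can be sketched in Python as follows. This is a minimal illustration, not Excel's actual implementation: the function name `equal_width_bins` is hypothetical, and the inputs are the summary values of the mpg example (100 observations, minimum 30.0, maximum 44.9).

```python
import math

def equal_width_bins(minimum, maximum, n_obs):
    """Return (number of bins, bin width, lower bin edges) for equal-width bins.

    The number of bins follows the square-root rule described in the text.
    """
    n_bins = round(math.sqrt(n_obs))
    width = (maximum - minimum) / n_bins
    edges = [minimum + i * width for i in range(n_bins)]
    return n_bins, width, edges

n_bins, width, edges = equal_width_bins(30.0, 44.9, 100)
print(n_bins, round(width, 2))  # 10 bins, each about 1.49 mpg wide
```

Each entry of `edges` is the lower bound of one interval; in practice you might round these to values that are easier to read or that match thresholds relevant to a decision.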

The interval width for the histogram is determined from the range of the data. The maximum is 44.9 and the minimum is 30. Thus, the interval width calculated by Excel is:

Interval Width = (44.9 - 30)/10 = 1.49

Excel created a bin table and the resulting histogram. The bin values represent the minimum value in the interval and the frequencies are the number of observations up to the next bin value. For example, one of the intervals contains 9 observations.

[Bin table (columns: Bin, Frequency) and Figure 1. Excel Histogram Example of MPG Using System Defaults for Intervals. Axes: MPG (x), Frequency (y)]

I made one modification to this table. Because of rounding, there was one extra category which Excel labeled More. I simply added that last value into the final interval, and the graph reflects this change. Notice that the graph labels for the X-axis in Excel are identical to the bin values in the table. These labels could be cleaned up in Excel if desired. You are also free to modify the graph in other ways, including setting the width of the gap between bars.

There are many other ways to set the number of categories and the interval width. For example, intervals could be set at particular thresholds, so that the frequencies for each interval are equal, or based on positional measures such as percentiles. There isn't a single right way, but the choice of intervals will influence the shape of the graph and care should be taken.

Excel can help you design Histograms as part of the Data Analysis feature under the Tools menu.

Pros of Histograms
Good visual depiction of the distribution of a variable, showing shape, modes, skew and outliers
Can be graphed for a small sample size or a large sample size
Most software programs (including Excel) provide an easy means to construct and display a histogram

Cons of Histograms
Requires a user decision of the number of intervals and the width of the intervals
Choices made by a user can distort the graph

Box Plots. Box Plots (also known as Box and Whisker Plots) are a graphical way to depict the center and spread of the data based on the median, quartiles, and the interquartile range. These are called positional measures of the center and the spread. A box plot provides a convenient way to look at the spread of a variable and is especially useful in comparing the spread of a continuous variable for two or more groups. The calculation and formation of a box plot is best left to computer programs. Unfortunately, Excel does not provide a box plot graph, but many add-in programs for Excel do provide this feature.

The box plot is based on a five number summary - the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and the maximum. From Q3 and Q1 we can calculate a sixth number, the Interquartile Range (IQR). These numbers provide the way to formulate the box and the whiskers of the plot. The five numbers for the mpg data set of 100 sub-compact cars are:

Five-number Summary
Minimum 30.0
First Quartile 35.6
Median 37.0
Third Quartile 38.4
Maximum 44.9

The box represents the IQR, and goes from the first quartile (Q1) to the third quartile (Q3). The box often has a center line which represents the median value, and occasionally the mean is depicted to show the difference between the mean and the median. The whiskers, lines on either side of the box, reflect a distance of 1.5 IQR to the left of Q1 and to the right of Q3. For a variable that follows a symmetrical, bell-shaped curve, most of the values will fall within 1.5 IQR of the first and third quartiles. Values outside of the whiskers are considered outliers. Some programs will depict mild outliers (between 1.5 and 3 IQR from Q1 and Q3) and extreme outliers (more than 3 IQR from Q1 and Q3).
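The outlier limits described above follow directly from Q1 and Q3. The sketch below works them out for the mpg five-number summary; the function names `box_plot_fences` and `classify` are hypothetical, and the code illustrates the 1.5 IQR and 3 IQR rules rather than the behavior of any particular add-in.

```python
def box_plot_fences(q1, q3):
    """Return the IQR and the mild (1.5 IQR) and extreme (3 IQR) outlier limits."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def classify(value, fences):
    """Label a value as 'inside', 'mild outlier', or 'extreme outlier'."""
    low_mild, high_mild = fences["mild"]
    low_ext, high_ext = fences["extreme"]
    if value < low_ext or value > high_ext:
        return "extreme outlier"
    if value < low_mild or value > high_mild:
        return "mild outlier"
    return "inside"

# Five-number summary of the mpg data given in the text.
minimum, q1, median, q3, maximum = 30.0, 35.6, 37.0, 38.4, 44.9
fences = box_plot_fences(q1, q3)
print(classify(minimum, fences), classify(maximum, fences))
```

With an IQR of about 2.8, the 1.5 IQR limits are roughly 31.4 and 42.6 mpg, so both the minimum (30.0) and the maximum (44.9) fall outside them but inside the 3 IQR limits.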
Pros of Box Plots
Good visual depiction of the distribution of a variable, showing shape, modes, skew and outliers
Can be graphed for a small or large sample size, although it may be difficult to show outliers when the data set is large
There is a uniform approach to constructing box plots - no user decisions
Excellent approach for comparing the distribution of a variable across two or more sub-groups

Cons of Box Plots
Excel cannot construct a Box Plot without an add-in program

[Figure 2. Box Plot of MPG of 100 Sub-Compact Cars]

[Figure 3. Box Plots of Amount Spent on Catalog Sales by Home Ownership. Groups: Total, Renter, Home Owner; vertical axis: amount spent, $0 to $7,000. Value labels shown on the plots: $1217, $962, $869, $623, $1543, $1359]

Box plots are particularly good at comparing the center and spread of a variable for two or more groups. Figure 3 shows a box plot of catalog sales for audio and video electronic entertainment for customers who own their home and those that rent. The Box Plot is constructed by XLSTAT, an Excel add-in. XLSTAT Box Plots show the mild outliers as open points and extreme outliers as solid points. XLSTAT allows several user options, such as how the plots are oriented and the inclusion of the median and mean values (mean values are on the top in this graph).

The graph shows the distribution for all customers, those who rent, and those who own their homes. The plots show the data are skewed, with several extreme outliers of customers that have made large purchases of equipment. The group that owns their home has higher expenditures, more spread in the data, and more outliers.

Stem and Leaf Plots. Another approach to graphing continuous data is the Stem and Leaf plot. This approach tends to work best with small to mid-sized data sets (up to 150 observations). The Stem and Leaf plot uses a clever approach by using the data itself to make the graph. Figure 4 is a depiction of the MPG data of 100 subcompact cars. For this graph the stems are the whole numbers and the leaves are a single decimal place.

[Figure 4. Stem-and-Leaf Display for MPG. Stem unit: whole number]

The Stem and Leaf plot provides a good graphical picture of a variable's distribution, showing the shape, range, skew, and outliers. In order to construct a Stem and Leaf plot the user (or a software program) must make some decisions as well as manipulate the data. Some data do not lend themselves to a stem and leaf plot, particularly when the choices for leaves are limited. The three key steps in constructing a Stem and Leaf plot are:

1. Sort the data
2. Choose the stems
3. Add the leaves

The stems are the initial digit or digits in the values, such as 1 in the number 10; 10 in the number 10.6; or 2 in the number 215. It is helpful to look at the sorted data and the range of the variable to decide the appropriate stems. The stems can be one, two, or more digits. For example, the stem for 215 could be 2 or 21. Once the stems are set, the leaves are simply the remaining digits in the numbers. In most cases it will be one digit, but it is possible to use more than one digit for leaves. If you are constructing a stem and leaf plot by hand, make sure the distance between digits is uniform and large enough to show the separate observations.

Pros of Stem and Leaf Plots
Good visual depiction of the distribution of a variable, showing shape, modes, skew and outliers
The plot actually uses the data itself to make the graph
There are fewer user or program decisions when compared to a histogram

Cons of Stem and Leaf Plots
Excel cannot construct a Stem and Leaf Plot without an add-in program
Limited to small and medium sized data sets - difficult to produce when the sample size is over 150
The user (or program) must make some decisions that can influence the shape of the graph
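The three construction steps (sort, choose stems, add leaves) can be sketched in Python. This is a minimal illustration that assumes non-negative values recorded to one decimal place, with whole-number stems as in the MPG display; the function names and the sample values are hypothetical and are not the data set from the text.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build a stem-and-leaf display for values recorded to one decimal place.

    Stems are the whole-number parts and leaves are the single decimal digits.
    """
    plot = defaultdict(list)
    for value in sorted(values):          # 1. sort the data
        tenths = round(value * 10)        # work in tenths to avoid float issues
        stem, leaf = divmod(tenths, 10)   # 2. choose the stem, 3. add the leaf
        plot[stem].append(leaf)
    return dict(plot)

def render(plot):
    """Return the display as text, one stem per line."""
    return "\n".join(
        f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(plot.items())
    )

# Hypothetical mpg values, not the data set from the text.
sample = [36.3, 30.0, 37.1, 35.6, 37.0, 44.9, 35.9]
print(render(stem_and_leaf(sample)))
```

This sketch only prints stems that actually occur. A hand-drawn display would normally list every stem in the range, including empty ones, so that gaps in the distribution stay visible.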

Scatter Plots. When graphing two continuous variables we often use a scatter plot to show how the variables vary together. Scatter Plots also provide a useful way to show how data vary over time. Most spreadsheet programs provide an easy mechanism to make a scatter plot (also called XY scatterplot).

In scatter plots we tend to think of one of the variables as a dependent variable and label it Y. The dependent variable is the variable you wish to explain or understand by knowing something about the independent variable (denoted as X). The dependent variable (Y) tends to be on the vertical axis and the independent variable (X) is on the horizontal axis of the plot.

Scatter plots provide a visual representation of the relationship between two variables. As such, they provide a useful first step in more sophisticated analysis strategies, such as regression. Excel provides mechanisms to include a trend line or best fitting line based on regression or an alternative curve fitting procedure. The following graph shows the relationship between 2001 average state SAT scores and the percent of high school students that take the SAT test. The graph clearly shows a linear relationship between the two variables: the greater the percentage of high school seniors who take the test, the lower the average state SAT score. Using options in Excel, we added a regression line, the regression equation, and R^2 (a measure of the fit of the data).

Average State SAT scores by Percent Taking the Test, 2001

Scatter Plots are an effective way to graph the relationship between two continuous variables. They show the strength and direction of the relationship. However, with too many observations (e.g., more than 1,000) it is difficult to plot the data and still see a relationship.
Excel can provide you with many ways to dress up the scatter plot, such as adding detail to the chart or a regression or trend line.

[Figure 5. A Scatter Plot using Excel of the relationship between the 2001 average state SAT scores and the percent of high school seniors taking the test. Axes: Percent Taking (x), Average SAT, Math + Verbal (y); the plot includes the fitted regression line, its equation, and R^2]
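The regression line and R^2 that Excel adds to a scatter plot come from ordinary least squares. The sketch below computes them for a small made-up data set; the function name `least_squares_line` and all of the numbers are hypothetical and are not the state SAT data behind Figure 5.

```python
def least_squares_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares; also return R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical data: percent taking a test (x) and average score (y).
percent_taking = [5, 10, 30, 50, 70, 80]
average_score = [1180, 1150, 1080, 1040, 1010, 1000]
slope, intercept, r2 = least_squares_line(percent_taking, average_score)
print(f"y = {intercept:.1f} + {slope:.2f}x, R^2 = {r2:.3f}")
```

A negative slope corresponds to the pattern described in the text: the higher the percentage taking the test, the lower the average score. R^2 near 1 indicates the points lie close to the fitted line.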

Scatter plots are also very useful in graphing data over time. In these plots the x-axis is typically the time element. The data can be left as a scatter of data points or the points can be connected by a line. Figure 6 shows the annual percentage change in the Consumer Price Index over the last century. The graph also includes a five-year running average to show how this approach smooths the data.

[Figure 6. A graph of Annual Percentage Change in the Consumer Price Index Including 5-Year Smoothing. Axes: Year (x), Percent Change (y); series: Raw Data, 5-Yr Avg]

CENTRAL TENDENCY OF DATA

A useful concept when summarizing data is to find some way to measure the center of the data. The central tendency of a variable is the tendency of the data to cluster or center about certain numerical values. Central tendency is in contrast to another concept which will be discussed shortly, variability or the spread of the data. For central tendency we will focus on the mean, the mode, and the median.

The Mean. The arithmetic mean or mean is the sum of the measurements divided by the number of measurements contained in the data set. For a sample we use x with a bar over it, $\bar{x}$. For a population, we use the Greek letter $\mu$ (mu). The formula for the mean is given as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \sum_{i=1}^{n} \left( x_i / n \right)$$

The first formula is the more familiar formula and reflects that the mean is the average observation. The second formula yields the same result and emphasizes that the mean is a weighted summation, with the weights being the probability of each observation in the data set (i.e., 1/n), and as such is an expectation of a probability distribution.

As a measure of central tendency the mean has several advantages (and disadvantages) over other measures. The first is that the mean uses information of all the values in a variable - all the values of the variable are added together and divided by the sample size. We can make inferences from a sample to a population for the mean - some descriptive statistics of central tendency do not have inferential properties. The mean forms the basis for a number of other statistics known as Product Moment Statistics, which include the variance, correlation, and regression coefficients. But the mean is sensitive to outliers and extremes in the data. The mean is pulled toward extreme values in the data and it is not as resistant as other measures of central tendency.

The mean has two mathematical properties that are important in statistics. The first is that the sum of the deviations about the mean equals zero. The second property is that the sum of squared deviations about the mean is a minimum. The latter is called the Least Squares property. It means that the sum of squared deviations around the mean is smaller than around any other value. The least squares property is exploited when looking at the spread of the data and in regression.

The Median. The median is the middle value when the measurements are arranged in ascending order. It is a positional measure because it is based on the middle case in a variable. In order to find the median value, we first must sort the data in ascending or descending order, find the position of the middle value, and then read that value.
The median is an intuitive measure of central tendency - the value at the middle of the ordered data. However, the median is laborious to compute by hand because it requires you to sort the data. Fortunately, spreadsheets and statistical software packages calculate the median for us rather easily.

The median is a positional measure of the center of a variable and is preferred when the data contain extreme values. For example, it is common to report the median income rather than mean income. The median has very limited inferential properties, so it is not used when making inferences from a sample or in hypothesis testing. Nonetheless, the median is often used in skewed data because it is not as sensitive to outliers.

Properties of the Mean
The mean uses information of all the values in a variable
We can make inferences from a sample to a population for the mean
The mean forms the basis for a number of other statistics known as Product Moment Statistics
The mean is sensitive to outliers and extremes in the data
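The two properties of the mean, and the contrast with the median when an extreme value is present, can be checked numerically. The income figures below are hypothetical and chosen only so the arithmetic is easy to follow; the helper names are likewise assumptions for the example.

```python
import statistics

def sum_of_deviations(values, center):
    """Sum of (x - center) over all values."""
    return sum(x - center for x in values)

def sum_of_squared_deviations(values, center):
    """Sum of (x - center)^2 over all values."""
    return sum((x - center) ** 2 for x in values)

# Hypothetical incomes (in $1,000s) with one extreme value.
incomes = [30, 35, 40, 45, 50, 400]
mean = statistics.mean(incomes)      # 100, pulled up by the value 400
median = statistics.median(incomes)  # 42.5, the middle of the sorted data

# Property 1: deviations about the mean sum to zero.
print(sum_of_deviations(incomes, mean))

# Property 2 (least squares): the mean gives a smaller sum of squared
# deviations than any other center, including the median.
print(sum_of_squared_deviations(incomes, mean)
      < sum_of_squared_deviations(incomes, median))
```

Five of the six incomes are 50 or below, yet the mean is 100. The median of 42.5 is much closer to what most of the values look like, which is why it is often preferred for variables such as income.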

The median is also referred to as the 50th percentile. Other ordered measures include percentiles, deciles, quintiles, and quartiles. Quartiles, used in box plots, represent the values at the 25th, 50th, and 75th percentiles. Some software programs use Q1 for the 25th percentile, Q3 for the 75th percentile, and Q2 (50th percentile) for the median.

Mode. The mode is the most frequently occurring value in a variable. As a measure of the center, the mode is less useful than the mean or median. However, it can provide some insights to the most common value in a variable and the shape of a distribution. In some cases there are multiple modes, referred to as Bi-Modal or Tri-Modal. Multiple modes or groupings around a value may reflect different groups within a variable. Figure 7 shows a bimodal distribution with a histogram of student weight.

In continuous level data, there may not be any single value that is the most frequent. The mode may make more sense in reference to qualitative data. With a qualitative variable, we refer to the Modal Class or Category, which represents the category with the most responses.

[Figure 7. Histogram of weight of 312 students showing a bi-modal distribution that reflects differences between males and females. Vertical axis: Frequency]
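A short sketch of finding the modal category of a qualitative variable and the modes of a quantitative one, using Python's standard library. Both data sets are hypothetical, and `modal_category` is an illustrative helper name rather than a standard function.

```python
import statistics
from collections import Counter

def modal_category(responses):
    """Return the most frequent category (the modal class) and its count."""
    category, count = Counter(responses).most_common(1)[0]
    return category, count

# Hypothetical qualitative data.
responses = ["owner", "renter", "owner", "owner", "renter"]
print(modal_category(responses))

# Hypothetical quantitative data with two equally frequent values.
weights = [120, 120, 135, 150, 180, 180, 195]
print(statistics.multimode(weights))
```

`statistics.multimode` returns every value tied for the highest frequency, which is one simple way to notice more than one mode. In real continuous data, exact ties are rare, so a histogram is usually a better tool for spotting a bimodal shape.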

Comparing the Mean, Median, and Mode. If we have a variable with a distribution that reflects a symmetrical, bell shaped curve, the mean, median, and mode would be very similar to one another. The normal distribution is a very special bell shaped curve where the mean, median, and mode are equal to each other by definition. The symmetrical, bell shaped curve is important in statistics because it more easily allows us to make probability statements about the distribution of a variable.

The skew of the data reflects a tail in the distribution pulled by extreme values, either high or low. The following three simple rules are useful in making a quick assessment of the distribution of a variable.

1. If the mean is larger than the median, the data are skewed to the right with some extreme high values. In this case the mean is being pulled up by extreme values in the data.
2. If the mean is smaller than the median, the data are skewed to the left with some extreme low values. In this case the mean is being pulled down by the extreme low values in the data.
3. If the mean and the median are very close to each other it is likely (though not guaranteed) that the distribution is symmetric and mound shaped.

Simply comparing the mean to the median can give us a sense of the presence of extreme values or outliers, and in which direction we can expect the skew.
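The three rules can be written as a small helper. This is a rough sketch, not a formal test of skewness: the function name `skew_hint` and the `tolerance` cutoff for "very close" are assumptions introduced for the example, and the data are hypothetical.

```python
import statistics

def skew_hint(values, tolerance=0.05):
    """Compare the mean and median and return a rough description of the skew.

    `tolerance` is an assumed cutoff: the gap between the mean and median,
    relative to the standard deviation, below which they count as "very close".
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0 or abs(mean - median) / spread <= tolerance:
        return "roughly symmetric"
    return "skewed right" if mean > median else "skewed left"

print(skew_hint([30, 35, 40, 45, 50, 400]))   # mean well above the median
print(skew_hint([1, 2, 3, 4, 5]))             # mean equals the median
```

As rule 3 notes, a mean close to the median makes symmetry likely but does not guarantee it, so a graph of the distribution is still worth a look.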

MEASURES OF VARIABILITY OF DATA

Central tendency only tells part of the story when describing a variable. Another aspect of data is the spread or variability of data. Variability of the data reflects the spread of the data around some center value, usually the mean. There are several intuitive measures of spread of data, including the range, the Inter-Quartile Range (IQR), the variance, the standard deviation, and the coefficient of variation.

Range. The range is the difference between the highest and lowest value in the data. The range provides a sense of the extremes in the data. It is an order statistic and depends upon the two most extreme values in the data. As such, the range may be seriously influenced by outliers.

Inter-Quartile Range. An alternative to the range is the Inter-quartile range, which is the difference between the 3rd quartile (Q3 or 75th percentile) and the 1st quartile (Q1 or 25th percentile). The inter-quartile range provides a sense of the range in the middle of the data and is not as sensitive to extreme values in the data.

Variance. The variance is the average squared deviation around the mean. By deviation we refer to the difference of a particular value of a variable from the mean of the variable. The concept of deviations around the mean can be intuitively appealing as a measure of spread of the data. If the mean is a good measure of central tendency, then it is reasonable to ask how different (or how far away) a particular value of a variable (X) is from the mean of X. Taking this a step further, we might ask what is the average distance of all values in the variable from the mean. However, because of the property that the sum of deviations around the mean always equals zero, we need to square the deviations around the mean and take an average squared deviation. Let's look at the formula for the variance.
We will use the Greek symbol $\sigma^2$ (sigma squared) to represent the variance of a population. The sample term for the variance will be $s^2$.

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

To put it into words, the numerator is the sum of the squared differences between each value of the variable and the mean. The numerator is called the Total Sum of Squares (a term we will see later in regression). Since we take the square of the deviations around the mean, the numerator can never be negative. Once we divide by n (or n-1 as we will show later), the number of observations, the variance reflects the average squared deviation around the mean. Another way to describe the variance is that it is the Mean Squared Deviation.

Like the mean, the variance is sensitive to outliers in the data. In fact, because the terms are squared, the variance can be extremely sensitive to outliers. When you square large numbers you get much larger numbers. Care should be taken when using and interpreting the variance with data that are highly skewed by high or low outliers.

Standard Deviation. Average squared deviations around the mean are awkward to discuss and interpret. However, if we take the square root of the variance we have a value that is no longer in squared terms. This new term is the standard deviation, which can be thought of as a typical deviation around the mean. We use the Greek term $\sigma$ (sigma) to represent the population standard deviation and the term $s$ to represent the sample standard deviation.

Properties of the Variance
- It is also known as the Mean Squared Deviation
- The numerator is known as the Total Sum of Squares
- When dealing with a sample we divide by (n-1) to adjust for degrees of freedom
- The variance is sensitive to extreme values: outliers have a large effect on the variance

The Variance and Standard Deviation with Sample Data. When we are dealing with a sample of the population, and our ultimate goal is some sort of inference, the formula for the variance and standard deviation must change. As noted earlier, when we are dealing with a population we use the Greek term $\sigma^2$ (sigma squared) and when we are dealing with a sample we use $s^2$. However, when dealing with a sample the formula must change to reflect an adjustment due to degrees of freedom.
The adjustment involves using n-1 in the denominator of the formula. We will use the following formula for the variance (and the square root of this formula for the standard deviation) almost exclusively for the rest of this course.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Degrees of freedom is an important concept in inferential statistics and it will be seen again in regression analysis. While it is a difficult concept to comprehend at this level, think of it as an adjustment when dealing with a sample. Using n in the formula for $s^2$ tends to underestimate $\sigma^2$ for the population. Note that the adjustment makes more of a difference when the sample size is small (less than 30) than when the sample is large (greater than 1,000). Using (n-1) in the formula is the default in Excel and most calculators. When dealing with a sample of data, we adjust the formula for the variance for the degrees of freedom by dividing by n-1.
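The measures of spread discussed so far can be computed by hand in a short Python sketch, which makes the difference between dividing by n and dividing by n-1 concrete. The data values and variable names are hypothetical, chosen only so the arithmetic is easy to follow. The quartiles come from Python's `statistics.quantiles`, whose default convention may differ slightly from the one used by Excel or another package.

```python
import math
from statistics import mean, pvariance, quantiles, variance

# Hypothetical sample, made up for illustration.
data = [4, 8, 6, 5, 3, 7, 9, 6]

n = len(data)
x_bar = mean(data)

# Range: highest value minus lowest value.
data_range = max(data) - min(data)

# Inter-quartile range: Q3 minus Q1. Quartile conventions differ between
# software packages, so this value may not match Excel exactly.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Total Sum of Squares: the numerator of both variance formulas.
total_ss = sum((x - x_bar) ** 2 for x in data)

population_variance = total_ss / n     # divide by n
sample_variance = total_ss / (n - 1)   # divide by n - 1 (degrees of freedom)
sample_sd = math.sqrt(sample_variance)

# The standard library agrees with the hand calculation.
assert math.isclose(population_variance, pvariance(data))
assert math.isclose(sample_variance, variance(data))

print("range:", data_range, "IQR:", iqr)
print("population variance:", population_variance)
print("sample variance:", sample_variance, "sample sd:", sample_sd)
```

With these eight values the Total Sum of Squares is 28, so dividing by n gives 3.5 while dividing by n-1 gives 4.0. The gap is noticeable because the sample is small, and it shrinks as n grows.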

Coefficient of Variation. Another way to express the standard deviation is in relation to the mean. The Coefficient of Variation (CV) is the ratio of the standard deviation to the absolute value of the mean, usually expressed as a percentage. By taking a ratio, we express the standard deviation relative to the mean, which provides a way to say how much variability there is in a variable relative to the size of the mean. The higher the percentage, the more variability. The CV is particularly useful when comparing the variability of different variables. For example, suppose we had a data set on customers and we want to compare the variability of their education level and their income. It would not be useful to compare the standard deviations because the metric for income is so much larger. However, we could compare the CVs of the two variables and talk about which variable has more variability. The CV formula is given below.

$$CV = \frac{s}{|\bar{x}|} \times 100$$

Interpreting the Standard Deviation. If our variable is symmetrical and mound shaped in its distribution, we can use the Empirical Rule to interpret the standard deviation and to get an indication of whether a value is an outlier. By symmetrical we mean that the distribution is the same (or reasonably close) to the left and right of the mean. By mound shaped we mean that the largest proportion of the observations are centered around the middle of the distribution, and the mean, median, and mode of the variable are close in value. The following histogram (Figure 8) of miles per gallon (MPG) of 100 compact cars can be thought of as a symmetrical, mound shaped distribution with a mean, median, and mode of 37 and a standard deviation of 2.4.

Figure 8. Histogram of MPG of compact cars that represents a symmetrical, mound shaped distribution.
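The education versus income comparison described for the Coefficient of Variation can be sketched as follows. The six customers and their values are invented for illustration, and `coefficient_of_variation` is a helper name assumed for this sketch.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = s / |mean| * 100, using the sample standard deviation."""
    return stdev(values) / abs(mean(values)) * 100

# Hypothetical customer data, made up for illustration.
education_years = [12, 12, 14, 16, 16, 18]
income_dollars = [30000, 38000, 45000, 52000, 61000, 98000]

cv_education = coefficient_of_variation(education_years)
cv_income = coefficient_of_variation(income_dollars)

# The raw standard deviations are not comparable (years versus dollars),
# but the CVs are both percentages of their own means.
print(f"education: sd = {stdev(education_years):.2f}, CV = {cv_education:.1f}%")
print(f"income:    sd = {stdev(income_dollars):.2f}, CV = {cv_income:.1f}%")
```

Note that the CV is undefined when the mean is zero, so the helper would raise an error for such a variable.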

If our variable is symmetrical and mound shaped, the Empirical Rule tells us that approximately 68% of the observations should be within plus or minus one standard deviation of the mean (34% above and 34% below), 95% should be within plus or minus 2 standard deviations, and nearly all the observations (99.7%) should be within plus or minus 3 standard deviations of the mean. We can express this as:

- 68% of the observations are within ± 1*s
- 95% of the observations are within ± 2*s
- 99.7% of the observations are within ± 3*s

This rule allows us to say how likely or unlikely it would be to find a value that is a certain number of standard deviations away from the mean. For the MPG example, we can say that we would expect 68% of the cars to be within:

37 ± 2.4 = 34.6 mpg to 39.4 mpg

We could also say that we would expect 34% of the cars (one half of the 68%) to have an mpg between 37 and 39.4.

The Empirical Rule also gives us a rule of thumb to determine if a value is an outlier. A value more than three standard deviations above or below the mean is unusual, especially if the distribution is symmetrical and mound shaped. In a probabilistic framework, we would say that such a value is possible, but not very probable. Thus, if we had a compact car that gets less than 29.8 mpg or more than 44.2 mpg, we might ask questions. Perhaps it is a performance car that is part of a different population of compact cars or, if it is on the high end, a specialty hybrid that is unique. Or someone could have made a mistake in measuring mpg or in entering the data in a computer. The fact that a value is extreme does not make it wrong or bad, but it should cause us to ask questions and examine it further.
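The Empirical Rule arithmetic for the MPG example can be reproduced with a small sketch. The mean of 37 and standard deviation of 2.4 come from the text. The function names and the three cars tested at the end are assumptions made for illustration.

```python
def empirical_rule_intervals(mean, sd):
    """Return the mean ± 1, 2, and 3 standard deviation intervals."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

def is_possible_outlier(value, mean, sd, k=3):
    """Flag a value more than k standard deviations from the mean."""
    return abs(value - mean) > k * sd

# Mean and standard deviation from the MPG example in the text.
MPG_MEAN, MPG_SD = 37, 2.4
EXPECTED_SHARE = {1: "68%", 2: "95%", 3: "99.7%"}

intervals = empirical_rule_intervals(MPG_MEAN, MPG_SD)
for k, (low, high) in intervals.items():
    print(f"about {EXPECTED_SHARE[k]} of cars: {low:.1f} to {high:.1f} mpg")

# Hypothetical cars to check against the three standard deviation rule.
for mpg in (29.0, 36.5, 45.0):
    flagged = is_possible_outlier(mpg, MPG_MEAN, MPG_SD)
    print(mpg, "worth a closer look" if flagged else "not unusual")
```

A flagged value is only a prompt to investigate, in keeping with the point that an extreme value is not necessarily wrong.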
TRANSFORMATIONS OF DATA

There are times when we will want to transform our data into another form that is more usable. Reasons to transform data are to reduce the impact of extreme values in the data, to make a nonlinear relationship linear, to present data in a more easily interpretable manner, or to make adjustments based on a third factor. The latter reason involves weighting schemes such as per capita measures, seasonal adjustments, or adjustments for inflation. Each of these methods has strengths and weaknesses, and no method will perfectly solve all the data problems it was designed to address. However, transforming data before analysis can be a useful way to better see what is going on in the data and to present a clearer picture. When transforming data, be careful in the interpretation of the newly transformed data, especially if the transformation changes the order of the data. I will discuss three methods to transform the data: creating z-scores, log transformation, and weighting the data.

Z-scores. The z-score approach is a method of transforming data to reflect the relative standing of a value in relation to the mean. A z-score is calculated by subtracting the mean from a value and then dividing by the standard deviation.

$$z_i = \frac{x_i - \bar{x}}{s}$$

The result represents the distance between a given measurement X and its mean, expressed in standard deviations. A positive z-score means that the measurement is larger than the mean, while a negative z-score means that it is smaller than the mean. By dividing through by the standard deviation we are able to say how far away a value is from its mean in a relative way. Z-scores are an effective way to re-express a value to reflect its position relative to the mean, in units of the standard deviation of the data.

If we were to convert an entire variable to z-scores (take each value, subtract the mean, and divide by the standard deviation), we would create a new variable that has a mean equal to zero and a standard deviation equal to one. The new variable would be in standardized units and thus would allow us to compare different values to each other in terms of how many standard deviations away from the mean they are. A z-score transformation does not change the order of the data or the shape of the distribution of the data. This is because we are subtracting and dividing through by constant values (i.e., the mean and standard deviation).
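The properties of z-scores described above (a mean of zero, a standard deviation of one, and unchanged order) can be checked with a short sketch. The MPG values below are hypothetical and `z_scores` is a helper name assumed for this example.

```python
from statistics import mean, stdev

def z_scores(values):
    """Subtract the mean from each value and divide by the sample sd."""
    x_bar, s = mean(values), stdev(values)
    return [(x - x_bar) / s for x in values]

# Hypothetical MPG readings, made up for illustration.
mpg = [33, 35, 36, 37, 37, 38, 39, 41]
z = z_scores(mpg)

for raw, standardized in zip(mpg, z):
    print(raw, round(standardized, 2))

# The standardized variable has mean 0 and standard deviation 1
# (up to floating-point rounding), and the ordering is unchanged.
print("mean of z:", round(mean(z), 10))
print("sd of z:", round(stdev(z), 10))
print("order preserved:", sorted(z) == z)
```

Values below the mean come out negative and values above it come out positive, which matches the interpretation given in the text.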
Using a z-score transformation can help in the interpretation of a variable, in the comparison of variables measured on different scales, and in cases of variables whose measurement is somewhat contrived and arbitrary, such as an index.

Log Transformation. A popular transformation of data is a log transformation. In most cases we use the natural logarithm, or base e. The value of e is approximately 2.718, and the natural log of a number is the power to which e must be raised to equal that number. For example:

The log of 10 = 2.3026, which means e^2.3026 = 10. Look at the following natural logs of numbers:

log 10 = 2.3026
log 100 = 4.6052
log 1,000 = 6.9078
log 10,000 = 9.2103
log 100,000 = 11.5129
log 1,000,000 = 13.8155

With the use of the natural logarithm we can reduce the variability of a variable while still maintaining the original order of the data. By this I mean that the largest numbers are still larger, but we have greatly reduced the variability between larger numbers and smaller numbers. Thus, log transformations can be used to reduce the variability in variables that have a large range. Logarithmic transformations are useful when the data have a large variability and as a result are skewed toward high or low values.

Let's look at an example of the price of 25 apartment buildings in a city. The goal of the analysis is to better understand the price of the building as influenced by such factors as square footage, number of apartments, and age and condition of the building. The original data for price show a large variability, from a minimum of $79,300 to a maximum of $950,000. The summary statistics are given in the table below, along with the same statistics for the natural log of price.

Table. Summary Statistics for Apartment Price and Log of Apartment Price. Columns: Price, Ln Price. Rows: Mean, Standard Error, Median, Mode (#N/A for both), Standard Deviation, Sample Variance, Kurtosis, Skewness, Range, Minimum, Maximum, Sum, Count.

If we look at a histogram of the original data we can see that the price data are skewed and that the mean is being pulled by several extreme values in the data set.

Figure 9. Histogram of Apartment Building Price showing a skewed distribution.

By taking the natural logarithm of the price we change the distribution of the variable, reduce the variability, and often make the variable more normally distributed. If we look at the transformed price in Figure 10 we can see that the influence of the extreme values is reduced. The main problem with a log transformation is that we change the measurement units and make it harder to interpret the statistics, because they are now expressed in log units.

Figure 10. Histogram of the Natural Log of Apartment Price
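The effect of a natural log transformation can be sketched in Python. Only the minimum ($79,300) and maximum ($950,000) below come from the apartment example. The other six prices are invented for illustration, so the numbers printed here are not the module's summary statistics.

```python
import math
from statistics import mean, median

# Natural logs of powers of ten: each step up adds the same amount.
for value in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"ln({value:,}) = {math.log(value):.4f}")

# Hypothetical prices; only the first and last values appear in the text.
prices = [79_300, 95_000, 120_000, 150_000, 185_000, 240_000, 410_000, 950_000]
log_prices = [math.log(p) for p in prices]

def relative_gap(values):
    """Distance between mean and median as a share of the range."""
    return abs(mean(values) - median(values)) / (max(values) - min(values))

print("order preserved:", sorted(log_prices) == log_prices)
print("max/min ratio, raw:", round(max(prices) / min(prices), 2))
print("max/min ratio, log:", round(max(log_prices) / min(log_prices), 2))
print("mean-median gap, raw:", round(relative_gap(prices), 3))
print("mean-median gap, log:", round(relative_gap(log_prices), 3))
```

The smaller mean-to-median gap after the transformation reflects the reduced pull of the extreme high prices, while the unchanged sort order shows that the largest buildings are still the largest in log units.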

Log transformations are very useful tools for data analysis. They are used in population growth models, instantaneous rates of change in finance, and constant elasticity models in economics. Log transformations can be used in regression analysis to transform a nonlinear relationship into a relationship that is linear in the parameters. The main caution with a log transformation is that the data are changed, and it is not easy to interpret or compare results to the original data.

Weighting Schemes. The last transformation discussed here covers various weighting schemes. These are strategies to weight the data, usually by multiplying or dividing by another variable, to adjust the original data. Examples include putting a value on a per capita basis (dividing by the population), adjusting for inflation by using a Consumer Price Index (CPI), or adjusting time series data by dividing through by seasonal averages.

If the weights are a constant (every data value is adjusted by the same weight), the impact is small. An example is expressing dollar figures per $1,000 or per $1,000,000. In fact, this transformation will have no meaningful impact on most analysis techniques, such as correlation and regression. By this we mean that the analysis will not change in substance or conclusion simply because we express a variable in thousands of dollars rather than in dollars. This is a good thing and is a strength of regression analysis.

However, if there is a unique weight for each value, as is the case when adjusting for annual inflation rates, the impact of this transformation can be substantial. Because of this, using weights as an adjustment should be justified on the basis of past practice or theory.
This transformation can change the distribution and order of the data, so the change is more radical than other transformations.

Let's look at an example of flood insurance payments over time in the U.S., from 1978 to 2002. There has been an upward trend in payments over time, from nearly $148 million in 1978 to $339 million in 2002. There is a great deal of fluctuation from year to year based on rainfall and natural occurrences. However, part of the trend over time is also due to inflation and the change in the value of money. In fact, one might ask if there really is an upward trend in payments once we adjust for inflation. The graph below (Figure 11) shows the trend since 1978. You can clearly see an upward trend in payments, but there are fluctuations from year to year. Although not shown, a model fit to the data shows that a linear trend explains about 33% of the variability in payments. The line in the graph is generated from a regression model.
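The difference between a constant weight and a unique weight for each value can be sketched as follows. The payments and price index values are entirely hypothetical (they are not the flood insurance or CPI figures), and the helper names `to_real_dollars` and `rank_order` are assumptions made for this illustration.

```python
def to_real_dollars(nominal, price_index, base_index):
    """Deflate each nominal value by its own period's price index."""
    return [value * base_index / index
            for value, index in zip(nominal, price_index)]

def rank_order(values):
    """Positions of the values from smallest to largest."""
    return sorted(range(len(values)), key=values.__getitem__)

# Hypothetical payments (in $ millions) and price index for four periods.
nominal = [100.0, 110.0, 105.0, 130.0]
price_index = [100.0, 120.0, 125.0, 150.0]

# Constant weight: re-express millions as thousands. The order is unchanged.
in_thousands = [value * 1000 for value in nominal]

# Unique weight per value: adjust for inflation using the first period as base.
real = to_real_dollars(nominal, price_index, base_index=price_index[0])

print("nominal order:     ", rank_order(nominal))
print("in thousands order:", rank_order(in_thousands))
print("real dollars:      ", [round(v, 2) for v in real])
print("real order:        ", rank_order(real))
```

In this made-up series the nominal payments rise overall, but once each period is deflated by its own index the first period becomes the largest, which is the kind of change in order and trend the text warns about.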

Figure 11. Linear trend of U.S. Flood Insurance Payments, 1978 to 2002

The next graph (Figure 12) shows the same trend, only this time the payments are adjusted for inflation using the Consumer Price Index (CPI) from the Bureau of Economic Analysis. The shape of the graph is very similar to the previous graph, but there are some important differences. Figure 12 also shows an upward trend, but the trend line is not as steep and the amount of variability the model explains drops to only 7%. This result shows that the upward trend in payments is not nearly as steep or noticeable once we adjust for inflation. In other words, part of the trend was due to the CPI and not simply an upward payout trend. Adjusting for inflation helped clarify the trend.

Figure 12. Linear Trend of U.S. Flood Insurance Adjusted for Inflation, 1978 to 2002

CONCLUSIONS

This module was designed to help you gain a basic understanding of descriptive statistics, graphing, and transformations as a way to better understand your data. These techniques serve as basic building blocks for analysis and form the foundation of more sophisticated techniques such as correlation and regression. In fact, much of what regression is about is explaining variability in a dependent variable, and how independent variables influence or explain that variability. Throughout this course we continually use some of the techniques in Module 1 as a starting point for analysis.


Copyright K. Gwet in Statistics with Excel Problems & Detailed Solutions. Kilem L. Gwet, Ph.D. Confidence Copyright 2011 - K. Gwet (info@advancedanalyticsllc.com) Intervals in Statistics with Excel 2010 75 Problems & Detailed Solutions An Ideal Statistics Supplement for Students & Instructors Kilem

More information

WINDOWS, MINITAB, AND INFERENCE

WINDOWS, MINITAB, AND INFERENCE DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 816 WINDOWS, MINITAB, AND INFERENCE I. AGENDA: A. An example with a simple (but) real data set to illustrate 1. Windows 2. The importance

More information

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations Transformations Transformations & Data Cleaning Linear & non-linear transformations 2-kinds of Z-scores Identifying Outliers & Influential Cases Univariate Outlier Analyses -- trimming vs. Winsorizing

More information

Pareto Charts [04-25] Finding and Displaying Critical Categories

Pareto Charts [04-25] Finding and Displaying Critical Categories Introduction Pareto Charts [04-25] Finding and Displaying Critical Categories Introduction Pareto Charts are a very simple way to graphically show a priority breakdown among categories along some dimension/measure

More information

Summary Statistics Using Frequency

Summary Statistics Using Frequency Summary Statistics Using Frequency Brawijaya Professional Statistical Analysis BPSA MALANG Jl. Kertoasri 66 Malang (0341) 580342 Summary Statistics Using Frequencies Summaries of individual variables provide

More information

Missouri Standards Alignment Grades One through Six

Missouri Standards Alignment Grades One through Six Missouri Standards Alignment Grades One through Six Trademark of Renaissance Learning, Inc., and its subsidiaries, registered, common law, or pending registration in the United States and other countries.

More information

Getting Started with OptQuest

Getting Started with OptQuest Getting Started with OptQuest What OptQuest does Futura Apartments model example Portfolio Allocation model example Defining decision variables in Crystal Ball Running OptQuest Specifying decision variable

More information

Identify sampling methods and recognize biased samples

Identify sampling methods and recognize biased samples 9-1 Samples and Surveys Identify sampling methods and recognize biased samples Vocabulary population (p. 462) sample (p. 462) biased sample (p. 463) random sample (p. 462) systematic sample (p. 462) stratified

More information

Data Analysis in Empirical Research - Overview. Prof. Dr. Hariet Köstner WS 2017/2018

Data Analysis in Empirical Research - Overview. Prof. Dr. Hariet Köstner WS 2017/2018 Data Analysis in Empirical Research - Overview Prof. Dr. Hariet Köstner WS 2017/2018 Contents 1. Data Collection Methods 2. Scales 3. Data Analysis 2 1. Data collection methods Business Research Primary

More information

Who Are My Best Customers?

Who Are My Best Customers? Technical report Who Are My Best Customers? Using SPSS to get greater value from your customer database Table of contents Introduction..............................................................2 Exploring

More information

Gasoline Consumption Analysis

Gasoline Consumption Analysis Gasoline Consumption Analysis One of the most basic topics in economics is the supply/demand curve. Simply put, the supply offered for sale of a commodity is directly related to its price, while the demand

More information

CH 2 - Descriptive Statistics

CH 2 - Descriptive Statistics 1. A quantity of interest that can take on different values is known as a(n) a. variable. b. parameter. c. sample. d. observation. a A characteristic or a quantity of interest that can take on different

More information

Credit Card Marketing Classification Trees

Credit Card Marketing Classification Trees Credit Card Marketing Classification Trees From Building Better Models With JMP Pro, Chapter 6, SAS Press (2015). Grayson, Gardner and Stephens. Used with permission. For additional information, see community.jmp.com/docs/doc-7562.

More information

SAMPLING FROM POPULATIONS

SAMPLING FROM POPULATIONS CHAPTER 14 SAMPLING FROM POPULATIONS Up to this point, we have mainly been interested in providing some description of one or more data sets such as test scores or heights. Generally speaking, if all you

More information

A SIMULATION STUDY OF THE ROBUSTNESS OF THE LEAST MEDIAN OF SQUARES ESTIMATOR OF SLOPE IN A REGRESSION THROUGH THE ORIGIN MODEL

A SIMULATION STUDY OF THE ROBUSTNESS OF THE LEAST MEDIAN OF SQUARES ESTIMATOR OF SLOPE IN A REGRESSION THROUGH THE ORIGIN MODEL A SIMULATION STUDY OF THE ROBUSTNESS OF THE LEAST MEDIAN OF SQUARES ESTIMATOR OF SLOPE IN A REGRESSION THROUGH THE ORIGIN MODEL by THILANKA DILRUWANI PARANAGAMA B.Sc., University of Colombo, Sri Lanka,

More information

Basic Statistics, Sampling Error, and Confidence Intervals

Basic Statistics, Sampling Error, and Confidence Intervals 02-Warner-45165.qxd 8/13/2007 5:00 PM Page 41 CHAPTER 2 Introduction to SPSS Basic Statistics, Sampling Error, and Confidence Intervals 2.1 Introduction We will begin by examining the distribution of scores

More information

Soci Statistics for Sociologists

Soci Statistics for Sociologists University of North Carolina Chapel Hill Soci708-001 Statistics for Sociologists Fall 2009 Professor François Nielsen Stata Commands for Module 11 Multiple Regression For further information on any command

More information

THE GUIDE TO SPSS. David Le

THE GUIDE TO SPSS. David Le THE GUIDE TO SPSS David Le June 2013 1 Table of Contents Introduction... 3 How to Use this Guide... 3 Frequency Reports... 4 Key Definitions... 4 Example 1: Frequency report using a categorical variable

More information

Using Excel s Analysis ToolPak Add In

Using Excel s Analysis ToolPak Add In Using Excel s Analysis ToolPak Add In Introduction This document illustrates the use of Excel s Analysis ToolPak add in for data analysis. The document is aimed at users who don t believe that Excel s

More information

Bivariate Data Notes

Bivariate Data Notes Bivariate Data Notes Like all investigations, a Bivariate Data investigation should follow the statistical enquiry cycle or PPDAC. Each part of the PPDAC cycle plays an important part in the investigation

More information

9.7 Getting Schooled. A Solidify Understanding Task

9.7 Getting Schooled. A Solidify Understanding Task 35 9.7 Getting Schooled A Solidify Understanding Task In Getting More $, Leo and Araceli noticed a difference in men s and women s salaries. Araceli thought that it was unfair that women were paid less

More information

Effective Applications Development, Maintenance and Support Benchmarking. John Ogilvie CEO ISBSG 7 March 2017

Effective Applications Development, Maintenance and Support Benchmarking. John Ogilvie CEO ISBSG 7 March 2017 Effective Applications Development, Maintenance and Support Benchmarking John Ogilvie CEO ISBSG 7 March 2017 What is the ISBSG? ISBSG (International Software Benchmarking Standards Group) is a not-forprofit

More information

Chapter 1 INTRODUCTION TO SIMULATION

Chapter 1 INTRODUCTION TO SIMULATION Chapter 1 INTRODUCTION TO SIMULATION Many problems addressed by current analysts have such a broad scope or are so complicated that they resist a purely analytical model and solution. One technique for

More information

PRESENTING DATA ATM 16% Automated or live telephone 2% Drive-through service at branch 17% In person at branch 41% Internet 24% Banking Preference

PRESENTING DATA ATM 16% Automated or live telephone 2% Drive-through service at branch 17% In person at branch 41% Internet 24% Banking Preference Presenting data 1 2 PRESENTING DATA The information that is collected must be presented effectively for statistical inference. Categorical and numerical data can be presented efficiently using charts and

More information

LabGuide 13 How to Verify Performance Specifications

LabGuide 13 How to Verify Performance Specifications 1 LG 13 Comments? Feedback? Questions? Email us at info@cola.org or call us at 800 981-9883 LabGuide 13 How to Verify Performance Specifications The verification of performance specifications confirms

More information

Quality Control Charts

Quality Control Charts Quality Control Charts General Purpose In all production processes, we need to monitor the extent to which our products meet specifications. In the most general terms, there are two "enemies" of product

More information

How to Lie with Statistics Darrell Huff

How to Lie with Statistics Darrell Huff How to Lie with Statistics Darrell Huff Meredith Mincey Readings 5050 Spring 2010 In Darrell Huffs famous book, he teaches us how to lie with statistics so as to protect ourselves from false information.

More information

AcaStat How To Guide. AcaStat. Software. Copyright 2016, AcaStat Software. All rights Reserved.

AcaStat How To Guide. AcaStat. Software. Copyright 2016, AcaStat Software. All rights Reserved. AcaStat How To Guide AcaStat Software Copyright 2016, AcaStat Software. All rights Reserved. http://www.acastat.com Table of Contents Frequencies... 3 List Variables... 4 Descriptives... 5 Explore Means...

More information

Predictive Modeling Using SAS Visual Statistics: Beyond the Prediction

Predictive Modeling Using SAS Visual Statistics: Beyond the Prediction Paper SAS1774-2015 Predictive Modeling Using SAS Visual Statistics: Beyond the Prediction ABSTRACT Xiangxiang Meng, Wayne Thompson, and Jennifer Ames, SAS Institute Inc. Predictions, including regressions

More information

The Institute of Chartered Accountants of Sri Lanka

The Institute of Chartered Accountants of Sri Lanka The Institute of Chartered Accountants of Sri Lanka Postgraduate Diploma in Business Finance and Strategy Quantitative Methods for Business Studies Handout 01: Basic Statistics Statistics The statistical

More information

Engenharia e Tecnologia Espaciais ETE Engenharia e Gerenciamento de Sistemas Espaciais

Engenharia e Tecnologia Espaciais ETE Engenharia e Gerenciamento de Sistemas Espaciais Engenharia e Tecnologia Espaciais ETE Engenharia e Gerenciamento de Sistemas Espaciais SITEMA DE GESTÃO DA QUALIDADE SEIS SIGMA 14.12.2009 SUMÁRIO Introdução ao Sistema de Gestão da Qualidade SEIS SIGMA

More information

Tutorial Segmentation and Classification

Tutorial Segmentation and Classification MARKETING ENGINEERING FOR EXCEL TUTORIAL VERSION v171025 Tutorial Segmentation and Classification Marketing Engineering for Excel is a Microsoft Excel add-in. The software runs from within Microsoft Excel

More information

Outliers and Their Effect on Distribution Assessment

Outliers and Their Effect on Distribution Assessment Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016 Topics of Discussion What is an Outlier? (Definition) How can we use outliers Analysis of Outlying Observation The Standards

More information

Standardised Scores Have A Mean Of Answer And Standard Deviation Of Answer

Standardised Scores Have A Mean Of Answer And Standard Deviation Of Answer Standardised Scores Have A Mean Of Answer And Standard Deviation Of Answer The SAT is a standardized test widely used for college admissions in the An example of a "grid in" mathematics question in which

More information

Modified Ratio Estimators for Population Mean Using Function of Quartiles of Auxiliary Variable

Modified Ratio Estimators for Population Mean Using Function of Quartiles of Auxiliary Variable Bonfring International Journal of Industrial Engineering and Management Science, Vol. 2, No. 2, June 212 19 Modified Ratio Estimators for Population Mean Using Function of Quartiles of Auxiliary Variable

More information

Introduction to Analytics Tools Data Models Problem solving with analytics

Introduction to Analytics Tools Data Models Problem solving with analytics Introduction to Analytics Tools Data Models Problem solving with analytics Analytics is the use of: data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based

More information

Multiple Regression. Dr. Tom Pierce Department of Psychology Radford University

Multiple Regression. Dr. Tom Pierce Department of Psychology Radford University Multiple Regression Dr. Tom Pierce Department of Psychology Radford University In the previous chapter we talked about regression as a technique for using a person s score on one variable to make a best

More information

TRIAGE: PRIORITIZING SPECIES AND HABITATS

TRIAGE: PRIORITIZING SPECIES AND HABITATS 32 TRIAGE: PRIORITIZING SPECIES AND HABITATS In collaboration with Jon Conrad Objectives Develop a prioritization scheme for species conservation. Develop habitat suitability indexes for parcels of habitat.

More information

ENGG1811: Data Analysis using Excel 1

ENGG1811: Data Analysis using Excel 1 ENGG1811 Computing for Engineers Data Analysis using Excel (weeks 2 and 3) Data Analysis Histogram Descriptive Statistics Correlation Solving Equations Matrix Calculations Finding Optimum Solutions Financial

More information

Lesson-9. Elasticity of Supply and Demand

Lesson-9. Elasticity of Supply and Demand Lesson-9 Elasticity of Supply and Demand Price Elasticity Businesses know that they face demand curves, but rarely do they know what these curves look like. Yet sometimes a business needs to have a good

More information

Managerial Economics

Managerial Economics Managerial Economics Estimating Demand Functions Rudolf Winter-Ebmer Johannes Kepler University Linz Winter Term 2014 Winter-Ebmer, Managerial Economics: Unit 2 - Demand Estimation 1 / 21 Why do you need

More information

ASTEROID. Profiler Module. ASTEROID Support: Telephone

ASTEROID. Profiler Module. ASTEROID Support: Telephone ASTEROID Profiler Module ASTEROID Support: Telephone +61 3 9223 2428 Email asteroid.support@roymorgan.com April 2017 Course Objectives At the end of this session you will be able to understand and confidently

More information

Examiner s report F5 Performance Management December 2017

Examiner s report F5 Performance Management December 2017 Examiner s report F5 Performance Management December 2017 General comments The F5 Performance Management exam is offered in both computer-based (CBE) and paper formats. The structure is the same in both

More information

IT portfolio management template User guide

IT portfolio management template User guide IBM Rational Focal Point IT portfolio management template User guide IBM Software Group IBM Rational Focal Point IT Portfolio Management Template User Guide 2011 IBM Corporation Copyright IBM Corporation

More information

Overview of Statistics used in QbD Throughout the Product Lifecycle

Overview of Statistics used in QbD Throughout the Product Lifecycle Overview of Statistics used in QbD Throughout the Product Lifecycle August 2014 The Windshire Group, LLC Comprehensive CMC Consulting Presentation format and purpose Method name What it is used for and/or

More information

Control Charts for Customer Satisfaction Surveys

Control Charts for Customer Satisfaction Surveys Control Charts for Customer Satisfaction Surveys Robert Kushler Department of Mathematics and Statistics, Oakland University Gary Radka RDA Group ABSTRACT Periodic customer satisfaction surveys are used

More information

Consumer Math Unit Lesson Title Lesson Objectives 1 Basic Math Review Identify the stated goals of the unit and course

Consumer Math Unit Lesson Title Lesson Objectives 1 Basic Math Review Identify the stated goals of the unit and course Consumer Math Unit Lesson Title Lesson Objectives 1 Basic Math Review Introduction Identify the stated goals of the unit and course Number skills Signed numbers and Measurement scales A consumer application

More information

Consumer and Producer Surplus and Deadweight Loss

Consumer and Producer Surplus and Deadweight Loss Consumer and Producer Surplus and Deadweight Loss The deadweight loss, value of lost time or quantity waste problem requires several steps. A ceiling or floor price must be given. We call that price the

More information

To provide a framework and tools for planning, doing, checking and acting upon audits

To provide a framework and tools for planning, doing, checking and acting upon audits Document Name: Prepared by: Quality & Risk Unit Description Audit Process The purpose of this policy is to develop and sustain a culture of best practice in audit through the use of a common framework

More information

Chapter 8 Designing Pay Levels, Mix, and Pay Structures

Chapter 8 Designing Pay Levels, Mix, and Pay Structures Chapter 8 Designing Pay Levels, Mix, and Pay Structures Major Decisions -Some major decisions in pay level determination: -determine pay level policy (specify employers external pay policy) -define purpose

More information

KeyMath Revised: A Diagnostic Inventory of Essential Mathematics (Connolly, 1998) is

KeyMath Revised: A Diagnostic Inventory of Essential Mathematics (Connolly, 1998) is KeyMath Revised Normative Update KeyMath Revised: A Diagnostic Inventory of Essential Mathematics (Connolly, 1998) is an individually administered, norm-referenced test. The 1998 edition is a normative

More information

Impact of Market Segmentation Practices on the Profitability of Small and Medium Scale Enterprises in Hawassa City: A Case Study

Impact of Market Segmentation Practices on the Profitability of Small and Medium Scale Enterprises in Hawassa City: A Case Study Impact of Market Segmentation Practices on the Profitability of Small and Medium Scale Enterprises in Hawassa City: A Case Study * Hailemariam Gebremichael, Lecturer, Wolaita Sodo University, Ethiopia.

More information

RESIDUAL FILES IN HLM OVERVIEW

RESIDUAL FILES IN HLM OVERVIEW HLML09_20120119 RESIDUAL FILES IN HLM Ralph B. Taylor All materials copyright (c) 1998-2012 by Ralph B. Taylor OVERVIEW Residual files in multilevel models where people are grouped into some type of cluster

More information