FAQ: Collecting and Analyzing Data


Question 1: How do you choose a tool for collecting data?

Answer 1: A variety of tools exist for collecting data, including direct observation or interviews, surveys, questionnaires, and experiments. The choice of a data collection tool depends on the type, quantity, quality, timeliness, and cost of the data desired. Data collectors can rarely optimize quantity, quality, and timeliness all at once, so they must understand the purpose of the data collection to choose the best tool.

If a consumer goods company needs an immediate reaction to a product, taste tests and solicitation of direct opinions might be the best choice of tools. The data are timely but may be biased toward those individuals who typically engage in taste tests, so the quality may suffer. A company needing a large amount of data on its competitor may have to sacrifice timeliness or quality. Experiments may take months to plan and execute, but the results may be of higher quality than opinions or surveys. Users of the data also need to understand the potential for measurement error and bias that comes with the chosen data collection tool.

Consider the example of a police department trying to achieve 90% availability of its police vehicles, which are currently at 75% availability. To understand why the vehicles were not more available, it surveyed the maintenance staff. This type of data collection is quick and relatively simple, but the quantity is limited and the quality rests on opinions. If the police department wants to fix what is most broken, improving the quality of the data may mean collecting data over the next month and recording the actual reasons for availability issues. In fact, once it collected data on actual availability issues, the department learned that its preventative maintenance processes were inefficient and slow. Relying on the presumed reasons would instead have led it to examine the transmissions and engines of the fleet.
Question 2: How do you choose a sampling technique?

Answer 2: A sample is part of a population. Ideally, to make inferences about the population as a whole, the sample should resemble the population as much as possible. Again, however, perfection is not possible, so the choice of a sampling technique depends on quality (i.e., allowable measurement error), accessibility, and the type of the data population itself. Common sampling techniques include nonstatistical sampling, random statistical sampling, stratified statistical sampling, systematic random sampling, and cluster sampling.
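The statistical techniques named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1,000 outlets, their region labels, and every function name here are hypothetical, and the sample sizes are arbitrary.

```python
import random

# Hypothetical sampling frame: 1,000 outlets, each tagged with a made-up region.
REGIONS = ["Northeast", "South", "Midwest", "West"]
outlets = [{"id": i, "region": REGIONS[i % 4]} for i in range(1000)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def simple_random_sample(frame, n):
    """Random sampling: every outlet has the same chance of selection."""
    return rng.sample(frame, n)


def systematic_sample(frame, n):
    """Systematic sampling: random start, then every k-th outlet."""
    k = len(frame) // n
    start = rng.randrange(k)
    return frame[start::k][:n]


def stratified_sample(frame, per_stratum, key="region"):
    """Stratified sampling: draw at random from every stratum."""
    strata = {}
    for item in frame:
        strata.setdefault(item[key], []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample


def cluster_sample(frame, n_clusters, key="region"):
    """Cluster sampling: pick whole clusters at random, keep all their outlets."""
    clusters = sorted({item[key] for item in frame})
    chosen = set(rng.sample(clusters, n_clusters))
    return [item for item in frame if item[key] in chosen]


if __name__ == "__main__":
    print(len(simple_random_sample(outlets, 40)))
    print(len(systematic_sample(outlets, 40)))
    print(len(stratified_sample(outlets, 10)))
    print(len(cluster_sample(outlets, 2)))
```

The stratified version guarantees that every region is represented, while the cluster version studies only the regions that happen to be drawn, which is cheaper but leaves the other regions out of the sample.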

Assume that a beverage manufacturer would like to understand the change in sales of bottled water over a 12-month period at its 1,000 nationwide retail outlets. There are several sampling technique choices, as follows:

1. It could select the sales at the retail outlets closest to the corporate office. This is convenient (accessible), but unless those outlets are identical to the other outlets, the quality of the result would suffer with this type of sampling for this population.
2. It could throw darts at a map of the nation and select the outlets closest to where each dart landed. This is random sampling, where every point has an equal chance of being selected. It allows outlets from other geographic regions to be included and thus better represents trends across the population of water sales. Still, it does not specifically account for geographic or other differences among outlets.
3. If water sales are expected to differ by geographic region, the company might randomly select several outlets from each geographic region. Because every region contributes to the sample, this is stratified sampling with the regions as strata. (Cluster sampling, by contrast, would randomly select a few regions and study only the outlets within them.) It might be less convenient, but the quality of the resulting data should be improved for this population.
4. Sales of bottled water might instead be a function of population density. In this case, the groups (called strata) would represent different levels of population density, and a sample would be selected from each group in order to draw conclusions.

Each technique has its own set of advantages and disadvantages. The final choice depends on the objectives with respect to timeliness, quality, and quantity, and on the characteristics of the data.

Question 3: When do you need to create a joint frequency distribution?

Answer 3: A frequency distribution describes the frequency of observations along a single dimension divided into discrete groups or classes. For example, a police department publishes the number of crimes recorded by zip code.
A frequency distribution would measure the relative frequencies across each of six zip codes. When additional information may be obtained by further describing that single dimension, a joint frequency distribution may be helpful. For example, a crime frequency distribution may become more meaningful if you understand the frequency of crimes in houses versus apartments. Although you could separately prepare a zip code-based frequency distribution for apartments and another one for houses, an alternative is to prepare a joint distribution. The results will suggest whether the crime level differs between apartments and houses within a given zip code.

Question 4: How do you calculate standard deviation?

Answer 4: Standard deviation measures how spread out a set of data is around its mean. For example, the price of a gallon of milk over the last 5 months would most likely have a smaller standard deviation than the price of a gallon of gasoline over the same 5-month period. Assume the following simple example on the price of gasoline over the past 5 months.

Month    Avg. price
1        $…

There are 5 steps to calculating the standard deviation.

1. Compute the average (the sum of the items divided by the number of items).

   Average = $1.52

   In Excel, the AVERAGE function computes the average.

2. Compute how far away each item is from the average (the difference or deviation).

Month    Avg. price    Difference from average
1        $…

3. To remove the effect of whether an item is higher or lower than the average, square the difference.

Month    Avg. price    Difference from average    Difference squared
1        $…

4. Sum the squared differences and divide by the sample size less one.

Month    Avg. price    Difference from average    Difference squared
1        $…

   Sum of squares / (5 - 1) = …

   The variance of the sample is …

5. Compute the square root of the variance to determine the standard deviation: … = …

   In Excel, the STDEV function computes the standard deviation for a sample of observations.

What does the magnitude of the resulting standard deviation tell the user of the data? The greater the magnitude, the more spread out the data are around the center point, or mean.
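The five steps above can be sketched in Python as follows. The prices for months 1 to 3 are the gasoline prices quoted elsewhere in this FAQ; the prices for months 4 and 5 are assumptions chosen only so that the average comes out to the stated $1.52, so the intermediate values are illustrative rather than the FAQ's own figures.

```python
import math
import statistics

# Months 1-3: gasoline prices quoted in this FAQ.
# Months 4-5: ASSUMED values, picked so the mean equals the stated $1.52.
prices = [1.55, 1.45, 1.56, 1.60, 1.44]

# Step 1: compute the average.
mean = sum(prices) / len(prices)

# Step 2: compute each item's difference (deviation) from the average.
deviations = [p - mean for p in prices]

# Step 3: square each difference so the sign no longer matters.
squared = [d ** 2 for d in deviations]

# Step 4: sum the squares and divide by the sample size less one (variance).
variance = sum(squared) / (len(prices) - 1)

# Step 5: take the square root of the variance.
std_dev = math.sqrt(variance)

if __name__ == "__main__":
    print(f"mean     = {mean:.2f}")
    print(f"variance = {variance:.5f}")
    print(f"std dev  = {std_dev:.4f}")
    # The standard library's sample standard deviation should agree.
    print(f"statistics.stdev = {statistics.stdev(prices):.4f}")
```

Dividing by the sample size less one mirrors what Excel's STDEV does for a sample; Python's `statistics.stdev` uses the same convention, which is why it serves as a cross-check here.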

Question 5: How do you calculate the coefficient of variation?

Answer 5: Neither the standard deviation nor the mean by itself adequately describes the data. Recall the example in which mean restaurant sales were the same before and after a smoking ban. When the variance was measured, it was relatively easy to conclude that post-smoking ban sales were more variable than pre-smoking ban sales. When distribution means are not identical, the coefficient of variation standardizes the extent of variation for a distribution with a given mean so that comparisons may be made. The coefficient of variation is defined simply as the distribution's standard deviation divided by its mean. A greater coefficient of variation suggests greater relative variability of one set of data compared with another.

Consider the standard deviation that was computed for the price of gasoline over a 5-month period. To understand whether the price of gasoline is more or less variable than the price of milk over the same period, you can first compute the mean and standard deviation for the price of milk. As the results presented below show, milk has a higher standard deviation. However, because the mean price of milk is different, you must compute the coefficient of variation to appropriately compare the variability of the two data groups. Scaled by the mean, the average monthly price of gasoline has greater relative variability than the average monthly price of milk.

Month                       Avg. price of gasoline    Avg. price of milk
1                           $1.55                     $…
…
Mean                        $1.52                     $2.56
Standard deviation          $…                        $…
Coefficient of variation    …                         …
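A minimal sketch of the comparison described above follows. The means are the ones stated in this FAQ. The standard deviations are assumptions: they are approximate values back-calculated from the z-scores quoted in Question 6, not figures taken from the table, and the function name is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation scaled by (divided by) the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean


# Means are stated in the FAQ; standard deviations are ASSUMED approximations.
GASOLINE_MEAN, GASOLINE_STD = 1.52, 0.071
MILK_MEAN, MILK_STD = 2.56, 0.076

gasoline_cv = coefficient_of_variation(GASOLINE_STD, GASOLINE_MEAN)
milk_cv = coefficient_of_variation(MILK_STD, MILK_MEAN)

if __name__ == "__main__":
    print(f"gasoline CV = {gasoline_cv:.3f}")
    print(f"milk CV     = {milk_cv:.3f}")
    print("gasoline is relatively more variable:", gasoline_cv > milk_cv)
```

With these inputs milk has the larger standard deviation but the smaller coefficient of variation, which is the pattern the answer describes.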

Question 6: How do you calculate z-scores?

Answer 6: Z-scores are another tool for comparing multiple groups of data that may not be similarly distributed. More specifically, a data point can be described and compared with other data points using z-scores. A z-score is a data point reduced by the sample (or population) mean and divided by the standard deviation. The result is a standardized value that can be compared with other data points.

The z-score for the price of gasoline in month 2 is computed as

   ($1.45 - $1.52) / (standard deviation of gasoline prices) = -0.99

The z-score for the price of gasoline in month 3 is computed as

   ($1.56 - $1.52) / (standard deviation of gasoline prices) = 0.56

The z-score for the month 2 price of gasoline means that the price is about 0.99 standard deviations lower than the mean. The z-score for the month 3 price of gasoline means that the price is about 0.56 standard deviations higher than the mean.

The z-score for the price of milk in month 5 is computed as

   ($2.67 - $2.56) / (standard deviation of milk prices) = 1.45

The price of milk in month 5 is almost 1.5 standard deviations greater than its mean. Comparing data points between the two groups, the z-scores allow you to state that the price of milk in month 5 lies farther from its mean, measured in standard deviations, than the price of gasoline in months 2 and 3 lies from its own mean.
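The z-score calculations above can be sketched as a small Python function. The prices and means are the ones quoted in the answer. The two standard deviations are assumptions, approximated so that the results match the quoted z-scores, and the function name is hypothetical.

```python
def z_score(value, mean, std_dev):
    """Distance of a data point from the mean, in standard deviations."""
    if std_dev == 0:
        raise ValueError("z-score is undefined when the standard deviation is zero")
    return (value - mean) / std_dev


# ASSUMED approximate standard deviations (not taken from the FAQ's tables).
GASOLINE_STD = 0.071
MILK_STD = 0.076

gasoline_month_2 = z_score(1.45, 1.52, GASOLINE_STD)
gasoline_month_3 = z_score(1.56, 1.52, GASOLINE_STD)
milk_month_5 = z_score(2.67, 2.56, MILK_STD)

if __name__ == "__main__":
    print(f"gasoline, month 2: {gasoline_month_2:.2f}")
    print(f"gasoline, month 3: {gasoline_month_3:.2f}")
    print(f"milk, month 5:     {milk_month_5:.2f}")
```

Comparing absolute values, the milk price in month 5 sits farther from its mean than either gasoline price does from the gasoline mean.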