Lab 9: Sampling Distributions


Emery Thompson

Sampling from Ames, Iowa

In this lab, we will investigate the ways in which the estimates that we make based on a random sample of data can inform us about what the population might look like. We're interested in formulating a sampling distribution of our estimate in order to get a sense of how good an estimate it is.

The Data

The dataset that we'll be considering comes from the town of Ames, Iowa. The Assessor's Office records information on all real estate sales, and the data set we're considering contains information on all residential home sales between 2006 and [...]. We will consider these data as our statistical population. In this lab we would like to learn as much as we can about these homes by taking smaller samples from the full population. Let's load the data.

    ames = read.delim("
    source("

We see that there are quite a few variables in the data set, but we'll focus on the number of rooms above ground (TotRms.AbvGrd) and sale price (SalePrice). Let's look at the distribution of the number of rooms in homes in Ames by calculating some summary statistics and making a histogram.

    summary(ames$TotRms.AbvGrd)
    hist(ames$TotRms.AbvGrd)

Exercise 1 How would you describe this population distribution?

The Unknown Sampling Distribution

In this lab, we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or even impossible. Because of this, we often take a smaller sample survey of the population and use that to make educated guesses about the properties of the population. If we were interested in estimating the mean number of rooms in homes in Ames based on a sample, we could use the following command to survey the population.

    samp1 = sample(ames$TotRms.AbvGrd, 75)

This command creates a new vector called samp1 that is a simple random sample of size 75 from the population vector ames$TotRms.AbvGrd.
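The behavior of sample() can be tried out without the Ames file. The following is a minimal sketch, assuming a made-up population vector called pop (a hypothetical stand-in, not the Ames data); the names pop and samp are introduced only for illustration.

```r
set.seed(42)  # fixed seed so the draw is reproducible

# Hypothetical stand-in population (NOT the Ames data): 2000 room counts
pop <- sample(3:12, size = 2000, replace = TRUE)

# Simple random sample of 75 values, drawn without replacement (the default)
samp <- sample(pop, 75)

length(samp)  # number of sampled homes
mean(samp)    # sample mean, an estimate of mean(pop)
mean(pop)     # population mean, known here only because pop is simulated
```

Because sample() draws without replacement by default, no home can appear twice in the sample, which matches the phonebook picture of calling 75 distinct households.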
At a conceptual level, you can imagine randomly choosing 75 entries from the Ames phonebook, calling them up, and recording the number of rooms in their houses. You would be correct in objecting that the phonebook probably doesn't contain phone numbers for all homes and that there will almost certainly be people who don't pick up the phone or refuse to give this information. These are issues that can make gathering data very difficult and are a strong incentive to collect a high-quality sample.

Exercise 2 How would you describe the distribution of this sample? How does it compare to the distribution of the population?

If we're interested in estimating the average number of rooms in homes in Ames, our best guess is going to be the sample mean from this simple random sample.

    mean(samp1)

Exercise 3 How does your sample mean compare to your neighbors'? Are the sample means the same? Why or why not?

Depending on which 75 homes you selected, your estimate could be a bit above or a bit below the true population mean of 6.443. But in general, the sample mean turns out to be a pretty good estimate of the average number of rooms, and we were able to get it by sampling less than 3% of the population.

Exercise 4 Take a second sample, also of size 75, and call it samp2. How does the mean of samp2 compare with the mean of samp1? If we took a third sample of size 150, intuitively would you expect the sample mean to be a better or worse estimate of the population mean?

Not surprisingly, every time we take another random sample, we get a different sample mean. It's useful to get a sense of just how much variability we should expect when estimating the population mean this way. This is what is captured by the sampling distribution. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps 5000 times. We will use the function gen_samp_means to do this. This function takes three arguments: pop, the population data; samp_size, the size of the sample to take when generating the samples; and niter, the number of sample means to generate.
    samp_means = gen_samp_means(ames$TotRms.AbvGrd, samp_size = 75, niter = 5000)
    hist(samp_means, probability = TRUE)

Here we rely on the computational ability of R to quickly take 5000 samples of size 75 from the population, compute each of those sample means, and store them in a vector called samp_means.

Exercise 5 How would you describe this sampling distribution? On what value is it centered? Would you expect the distribution to change if we instead collected 50,000 sample means?
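The function gen_samp_means comes from the script loaded with source(), which is not reproduced here. As a rough idea of what such a function might do, here is a sketch under stated assumptions: gen_samp_means_sketch is a hypothetical re-implementation, pop is a made-up population, and the lab's own function may differ in its details.

```r
# Hypothetical re-implementation; the lab's own gen_samp_means may differ.
gen_samp_means_sketch <- function(pop, samp_size, niter) {
  # Draw niter simple random samples and keep the mean of each one
  replicate(niter, mean(sample(pop, samp_size)))
}

set.seed(42)
pop <- sample(3:12, size = 2000, replace = TRUE)  # made-up population, not Ames
sketch_means <- gen_samp_means_sketch(pop, samp_size = 75, niter = 5000)

length(sketch_means)  # one mean per iteration
mean(sketch_means)    # should sit close to mean(pop)
sd(sketch_means)      # spread of the sampling distribution
```

Raising niter gives a smoother histogram of the same distribution; it is samp_size, not niter, that controls how spread out the sample means are.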

Approximating the Sampling Distribution

The sampling distribution that we computed tells us everything that we could hope for about the average number of rooms in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average number of rooms of the population, and the spread of the distribution indicates how much variability is induced by sampling only 75 of the homes.

We computed the sampling distribution for the mean number of rooms by drawing 5000 samples from the population and calculating 5000 sample means. This was only possible because we had access to the population. In most cases you don't (if you did, there would be no need to estimate!). Therefore, you have only your single sample to rely upon... that, and the Central Limit Theorem.

The Central Limit Theorem states that, under certain conditions, the sample mean follows an approximately normal distribution. This allows us to make the inferential leap from our single sample to the full sampling distribution that describes every possible sample mean you might come across. But we need to look before we leap.

Exercise 6 Does samp1 meet the conditions for the sample mean to be approximately normally distributed according to the Central Limit Theorem?

If the conditions are met, then we can find the approximate sampling distribution by plugging in our best estimates for the population mean and standard error: $\bar{x}$ and $s/\sqrt{n}$.

    xbar = mean(samp1)
    se = sd(samp1)/sqrt(75)

We can add a curve representing this approximation to our existing histogram using the command hist_curve. This function takes three arguments: sample_means, the sample means used to generate the histogram; mean, the mean of the normal curve to draw; and sd, the standard deviation of the normal curve to draw.

    hist_curve(samp_means, mean = xbar, sd = se)

We can see that the line does a decent job of tracing the histogram that we derived from having access to the population.
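The comparison between the simulated sampling distribution and the normal curve can also be sketched with base R alone. The code below is an illustration under assumptions: pop is a made-up population rather than the Ames data, and hist() plus curve() stand in for hist_curve, whose actual implementation is not shown in this lab.

```r
set.seed(42)
pop <- sample(3:12, size = 2000, replace = TRUE)     # made-up population, not Ames
sim_means <- replicate(5000, mean(sample(pop, 75)))  # simulated sampling distribution

one_samp <- sample(pop, 75)                  # the single sample we would have in practice
xbar <- mean(one_samp)                       # plug-in estimate of the population mean
se <- sd(one_samp) / sqrt(length(one_samp))  # plug-in estimate of the standard error

# Density-scaled histogram with the normal curve N(xbar, se) drawn on top
hist(sim_means, probability = TRUE)
curve(dnorm(x, mean = xbar, sd = se), add = TRUE)

# The two numbers should be of similar size if the approximation is reasonable
c(estimated_se = se, simulated_sd = sd(sim_means))
```

The curve is centered at xbar rather than at the true mean, so it will usually be shifted slightly relative to the histogram; its width is what the single sample estimates well.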
In this case, our approximation based on the CLT is a good one.

Confidence Intervals

In class this week we discussed how we can use the Central Limit Theorem and the resulting normal distribution to describe a plausible range of values for the true population mean; we called these ranges confidence intervals. In the case of a sample mean we calculate the confidence interval using the following formula:

$$\mathrm{CI} = \bar{X} \pm z_{CL} \frac{s}{\sqrt{n}}$$

where $\bar{X}$ is the sample mean, $z_{CL}$ is the z-score for the appropriate confidence level (i.e., 1.96 for a 95% CL), $s$ is the sample standard deviation, and $n$ is the sample size. We can calculate a 95% confidence interval in R for samp1 using the following code:

    mean(samp1) + c(-1, 1) * 1.96 * sd(samp1) / sqrt(length(samp1))

Exercise 7 Does the confidence interval for samp1 include the true population mean 6.443? Does your neighbor's confidence interval contain it?

In class we also mentioned that the definition of a confidence level is that if we were to collect additional samples of the same size and calculate a confidence interval from each sample's mean and standard deviation, then we would expect CL% of those confidence intervals to contain the true population mean. We will confirm this by taking multiple samples and examining the resulting confidence intervals. We will do this using the check_ci function, which produces a graphical representation of 100 confidence intervals relative to the true population mean.

    check_ci(ames$TotRms.AbvGrd, samp_size = 75, CL = 0.95)

Note that we can change both the size of the sample used and the confidence level.

Exercise 8 What happens to the size of the confidence intervals when you increase the sample size? When you decrease it? What about when you change the confidence level?

You will hopefully also have noticed that the color of each confidence interval changes depending on whether it includes the true population mean, which is indicated by the vertical black line. The confidence interval is drawn in blue if it contains the population mean and in red if it does not. In practice, when we can only take a single sample, we would not know the value of the true population mean, which is why we have to use the language of confidence intervals and confidence levels.
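The fixed value 1.96 only applies to a 95% confidence level. For other levels the z-score can be obtained from qnorm(). The sketch below assumes a hypothetical helper called mean_ci (not part of the lab's script) and a made-up sample x rather than samp1.

```r
# Hypothetical helper (not part of the lab's script): z-based CI for a mean
mean_ci <- function(x, CL = 0.95) {
  z <- qnorm(1 - (1 - CL) / 2)   # two-sided critical value, about 1.96 for CL = 0.95
  se <- sd(x) / sqrt(length(x))  # estimated standard error of the mean
  mean(x) + c(-1, 1) * z * se    # lower and upper limits
}

set.seed(42)
x <- sample(3:12, size = 75, replace = TRUE)  # made-up sample, not Ames data
mean_ci(x, CL = 0.90)
mean_ci(x, CL = 0.95)
mean_ci(x, CL = 0.99)
```

For the same sample, a higher confidence level gives a larger z-score and therefore a wider interval.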
Based on the resulting plot(s) it is possible to count the number of confidence intervals that do and do not include the true population mean. If our definition of confidence level is correct, the proportion of intervals that contain the true mean should be close to the confidence level you used when running the function (for example, roughly 95 of 100 intervals at a confidence level of 0.95).

Exercise 9 Run the check_ci function several times with different values for the confidence level, CL. Does the number of confidence intervals that contain the true population mean agree with the specified confidence level?
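The counting described above can also be done numerically instead of by reading a plot. The following sketch assumes a hypothetical function ci_coverage and a made-up population pop; it is not the lab's check_ci, which draws the intervals rather than returning a proportion.

```r
# Hypothetical numeric counterpart to check_ci; the lab's function draws a plot instead.
ci_coverage <- function(pop, samp_size, CL, nsim = 1000) {
  z <- qnorm(1 - (1 - CL) / 2)
  mu <- mean(pop)  # true population mean, known only because we hold the population
  hits <- replicate(nsim, {
    s <- sample(pop, samp_size)
    half_width <- z * sd(s) / sqrt(samp_size)
    abs(mean(s) - mu) <= half_width  # TRUE when the interval contains mu
  })
  mean(hits)  # proportion of intervals that contain mu
}

set.seed(42)
pop <- sample(3:12, size = 2000, replace = TRUE)  # made-up population, not Ames
ci_coverage(pop, samp_size = 75, CL = 0.95)
ci_coverage(pop, samp_size = 75, CL = 0.90)
```

With 1000 simulated intervals the observed proportion will not match the confidence level exactly, but it should land near it; with only 100 intervals, as in the plot, the count will vary more from run to run.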

On Your Own

So far we have only focused on estimating the mean number of rooms of the homes of Ames. Now we'll try to estimate the mean sale price.

1. Take a random sample of size 30 from ames$SalePrice. Using this sample, what is your best point estimate of the population mean? Include a histogram of this sample in your answer.
2. Check the conditions for the sampling distribution of $\bar{x}_{SalePrice}$ to be nearly normal.
3. Since you have access to the population, compute the sampling distribution for $\bar{x}_{SalePrice}$ by taking 5000 samples of size 30 from the population and computing 5000 sample means. Describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean sale price of homes in Ames to be? Include a histogram of the sampling distribution.
4. Change your sample size from 30 to 150, then compute the sampling distribution using the same method as above. Describe the shape of this sampling distribution (where n = 150) and compare it to the sampling distribution from earlier (where n = 30). Based on this sampling distribution, what would you guess the mean sale price of the homes in Ames to be? Include a histogram of the sampling distribution.
5. Based on their shape, which sampling distribution would you feel more comfortable approximating by the normal model?
6. Which sampling distribution has a smaller spread? If we're concerned with making estimates that are consistently close to the true value, is having a sampling distribution with a smaller spread more or less desirable?
7. Generate plots of the confidence intervals for sample sizes of 30 and 150 at confidence levels of 0.90, 0.95 and [...] (6 plots in total).
8. Based on your plots, how would you describe the relationship of sample size and confidence level to the size of the confidence interval?
Notes

This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported license (creativecommons.org/licenses/by-nc-nd/3.0/). This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.
