Understanding Inference: Confidence Intervals II

Questions about the Assignment

Part I
The z-score is not the same as the percentile (e.g., a z-score of 98 does not equal the 98th percentile). The z-score is the number of standard deviations the value is above or below the mean.

Part II
Some variables appear to be quantitative, but they are really categorical (e.g., year born, income bracket, occupation code). Good quantitative variables include phrases such as "how many," "how often," etc., and the respondent answers with a number.
Who was asked the question (everyone or a subset of the sample)?

Understanding Inference: Confidence Intervals II
Confidence intervals
Bootstrap distribution
Other levels of confidence

Summary (From Last Class)
To create a plausible range of values for a population parameter:
1. Take many random samples from the population, and compute the sample statistic for each sample.
2. Compute the standard error (i.e., the standard deviation of all these statistics).
3. The plausible range of values = sample statistic ± 2 × standard error.

One small problem: often we only have one sample! How can we calculate the variation in sample statistics if we only have one sample?

The Problem
We have one random sample of the population.
We need multiple (i.e., >1,000) random samples from the population to calculate the standard error of the sample statistic.
We cannot afford to conduct 1,000+ random samples.
Is there a way to use the information from the one random sample we have to simulate 1,000+ random samples?
Yes, the simulation method enables us to generate a sampling distribution.

Random Sample / Population
When we conducted our study yesterday, we treated this bag of Reese's Pieces as the population, and we drew random samples of size 10 from this population. If our sample size were 500 (the total number of pieces in the bag), we would know the exact proportion of orange pieces in the population.
N = 500, n = 10

Random Sample / Population
Hershey Factory: N = 1 billion+; this bag: N = 500
Is the proportion of orange pieces in this bag pretty close to the proportion in the entire population of Reese's Pieces? Why? Because this bag is a random sample of the entire population of all Reese's Pieces produced by Hershey's.

Random Sample / Population
Hershey Factory: N = 1 billion+, p ≈ 0.52; this bag: p̂ = 0.52
Because this bag is a random sample of the entire population, it contains approximately the same proportion of orange Reese's Pieces as the entire population.

Random Sample / Population
Our goal is to determine the population proportion of orange pieces. Let's say that we've determined that a sample size of 500 is sufficiently large to adequately estimate the population proportion. This bag is the one random sample (n = 500) of the population we have. It gives us one sample statistic, but we need multiple sample statistics. We can use this sample to generate multiple samples.

Random Sample / Population
Is drawing additional samples from this bag equivalent to drawing additional samples from the vat of Reese's Pieces at the Hershey Factory? Reaching into this bag to draw more samples is equivalent to reaching into the vat of Reese's Pieces at the Hershey Factory. To generate a sampling distribution, we can take repeated random samples from this population. Thus, we can draw 1,000+ samples (n = 500) from this bag to generate the equivalent of 1,000+ random samples from the population.

Random Sample / Population
But won't all of the subsequent samples of size 500 we draw from this bag produce the same sample statistic as the original sample? Yes, unless we sample with replacement.

Sampling with Replacement
Why bootstrap?
After we sample a unit, we put it back into the population, so that each unit can be selected more than once. By sampling with replacement, we ensure that the bag keeps the same proportion of orange pieces on every draw, a proportion that approximates the one in the true population. This type of sampling process is known as bootstrapping.

"Pull yourself up by your bootstraps": lift yourself up into the air simply by pulling up on the laces of your boots. It is a metaphor for accomplishing an impossible task without any outside help.
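The difference between drawing with and without replacement can be checked with a short simulation. This is only a sketch: it assumes the bag holds 260 orange pieces out of 500 (the 0.52 proportion used in these slides), and the helper name `bootstrap_sample` and the seed are illustrative choices, not part of the lecture.

```python
import random


def bootstrap_sample(sample, rng):
    """Draw one bootstrap sample: same size as the original, with replacement."""
    return rng.choices(sample, k=len(sample))


# Assumed bag: 260 orange pieces out of 500 (1 = orange, 0 = other color).
bag = [1] * 260 + [0] * 240

rng = random.Random(0)  # arbitrary seed so the run is repeatable

# Without replacement, drawing all 500 pieces just reshuffles the bag,
# so the proportion of orange pieces is always the same.
reshuffled = rng.sample(bag, k=len(bag))
prop_without = sum(reshuffled) / len(reshuffled)

# With replacement, some pieces repeat and others are left out,
# so the proportion changes from one bootstrap sample to the next.
props_with = [sum(bootstrap_sample(bag, rng)) / len(bag) for _ in range(5)]

print("without replacement:", prop_without)
print("with replacement:   ", props_with)
```

Each bootstrap sample has the same size as the bag, and only the version drawn with replacement produces statistics that vary.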

Bootstrapping Terms
Bootstrap sample: A random sample taken with replacement from the original sample. It needs to be the same size as the original sample.
Bootstrap sample statistic: The sample statistic computed on the bootstrap sample.
Bootstrap sampling distribution: The sampling distribution of many bootstrap sample statistics.

[StatKey (www.lock5stat.com/statkey): original sample, count = 260, and its bootstrap sampling distribution]

Standard Error
The variability of the bootstrap statistics is similar to the variability of the sample statistics. The standard error of our sample statistic can be estimated using the standard deviation of the bootstrap sampling distribution.

Reese's Pieces
Based on this sample, give a 95% confidence interval for the true proportion of Reese's Pieces that are orange.
A. (0.50, 0.54)
B. (0.48, 0.56)
C. (0.48, 0.52)
D. (0.46, 0.54)
Bootstrap distribution: Standard Deviation = [...], Mean = 0.52

Bootstrap Distribution
You have a sample of size [...]. You sample with replacement 1000 times to get 1000 bootstrap samples. What is the sample size of each bootstrap sample?
A. 500
B. 1,000
Bootstrap samples are the same size as the original sample.
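The terms above can be made concrete with a small sketch that builds a bootstrap sampling distribution for the Reese's Pieces sample and uses its standard deviation as the standard error. It assumes the original sample is count = 260 out of n = 500; the function name `bootstrap_proportions` and the seed are hypothetical, and StatKey's own procedure may differ in detail.

```python
import random
import statistics


def bootstrap_proportions(count, n, reps, rng):
    """Return `reps` bootstrap sample proportions for `count` successes out of `n`."""
    sample = [1] * count + [0] * (n - count)
    # Each bootstrap sample is drawn with replacement and has size n.
    return [sum(rng.choices(sample, k=n)) / n for _ in range(reps)]


rng = random.Random(1)  # arbitrary seed so the run is repeatable

# Assumed original sample: 260 orange pieces out of 500.
boot = bootstrap_proportions(count=260, n=500, reps=1000, rng=rng)

p_hat = 260 / 500
se = statistics.stdev(boot)  # SD of the bootstrap statistics estimates the SE
ci = (p_hat - 2 * se, p_hat + 2 * se)  # sample statistic +/- 2 standard errors

print("bootstrap statistics:", len(boot))
print("estimated standard error:", round(se, 3))
print("95% CI:", tuple(round(x, 3) for x in ci))
```

One bootstrap sample yields one bootstrap statistic, so 1000 resamples give 1000 values, and the interval comes from the spread of those values around the original 0.52.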

Bootstrap Distribution
You have a sample of size [...]. You sample with replacement 1000 times to get 1000 bootstrap samples. How many bootstrap sample statistics will you have?
A. 1
B. 500
C. 1,000
Each bootstrap sample yields one bootstrap sample statistic.

Bootstrap Distribution
You have a sample of size [...]. You sample with replacement 1000 times to get 1000 bootstrap samples. How many dots will be in a dotplot of the bootstrap sampling distribution?
A. 50
B. 1,000
C. 50,000
Each dot in the bootstrap sampling distribution corresponds to one bootstrap sample statistic.

Atlanta Commutes
Random Sample of 500 Commutes
What's the mean commute time for workers in metropolitan Atlanta?
The Original Sample (CommuteAtlanta, dot plot of Time): x̄ = 29.11 minutes, s = 20.72 minutes
This dotplot is
A. the sample distribution of commute times
B. the sampling distribution of sample statistics

Random Sample of 500 Commutes
The Original Sample (CommuteAtlanta, dot plot of Time): x̄ = 29.11 minutes, s = 20.72 minutes
Hint: CI = sample statistic ± 2 × standard error
The confidence interval for the point estimate is
A. [...]
B. [...]
C. cannot be determined with the data available
29.11 ± [...] is the interval that contains 95% of the commute times in the original sample. The variability of the sample statistic is not known.

How can we determine the variability of the sample statistic so that we can calculate the confidence interval for the population parameter (μ)?
Generate a Bootstrap Distribution using The Original Sample.

Bootstrap Distribution (www.lock5stat.com/statkey/)

The 95% Confidence Interval
point estimate ± the margin of error
sample statistic ± 2 × standard error
sample statistic ± 2 × SD of the bootstrap sampling distribution
29.11 ± [...]

The Beauty of Bootstrapping
We can use bootstrapping to assess the uncertainty surrounding any sample statistic. If we have sample data, we can use bootstrapping to estimate a 95% confidence interval for the population parameter.

The 95% confidence interval for the average commute time is
A. (28.2, 30.0)
B. (27.3, 30.9)
C. (26.6, 31.8)

Obama's Approval Rating
Gallup surveyed 1,500 Americans between June 9th and 11th, 2012, and 49% of these people approved of the job Barack Obama is doing as president.
Sample statistic: p̂ (sample proportion) = 0.49
Calculate a 95% CI for the sample proportion (www.lock5stat.com/statkey).
Original sample: Count = 735, n = 1,500
CI = sample proportion ± 2 × standard error = 0.49 ± 0.026 = (0.464, 0.516)
(Remember, Gallup's margin of error was ± 0.03.)
We are 95% confident that the true percentage of all Americans who approve of Obama's job performance is between 46.4% and 51.6%.

Obama's Approval Rating (www.lock5stat.com/statkey)
Original sample: Count = 735, N = 1500
CI = (0.464, 0.516), the middle 95% of the bootstrap statistics

Two Methods for Calculating a 95% CI
The Standard Error Method: CI = sample statistic ± 2 × standard error = (0.464, 0.516)
The Percentile Method: the middle 95% of the bootstrap statistics
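Both methods can be tried on the Gallup numbers from the slides (count = 735, n = 1,500). The following is a minimal sketch rather than a reproduction of StatKey: the seed is arbitrary, and `statistics.quantiles` with 40 groups is simply one convenient way to read off the 2.5th and 97.5th percentiles.

```python
import random
import statistics

rng = random.Random(2012)  # arbitrary seed so the run is repeatable

count, n, reps = 735, 1500, 1000
sample = [1] * count + [0] * (n - count)  # 1 = approves, 0 = does not

# Bootstrap sampling distribution of the sample proportion.
boot = [sum(rng.choices(sample, k=n)) / n for _ in range(reps)]

# Standard error method: sample statistic +/- 2 standard errors.
p_hat = count / n
se = statistics.stdev(boot)
se_ci = (p_hat - 2 * se, p_hat + 2 * se)

# Percentile method: keep the middle 95% of the bootstrap statistics.
# quantiles(..., n=40) returns cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(boot, n=40)
pct_ci = (cuts[0], cuts[-1])

print("standard error:", round(se, 4))
print("SE method CI:  ", tuple(round(x, 3) for x in se_ci))
print("percentile CI: ", tuple(round(x, 3) for x in pct_ci))
```

When the bootstrap distribution is roughly symmetric and bell-shaped, as it is here, the two intervals should nearly coincide.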

Other Levels of Confidence
What if we want to be more than 95% confident? How might you produce a 99% confidence interval for the point estimate?

Percentile Method
For a P% confidence interval, keep the middle P% of bootstrap statistics. For a 99% confidence interval, keep the middle 99%, leaving 0.5% in each tail. The 99% confidence interval would be (0.5th percentile, 99.5th percentile), where the percentiles refer to the bootstrap distribution (www.lock5stat.com/statkey).

Level of Confidence
Which is wider, a 90% confidence interval or a 95% confidence interval?
A. 90% CI
B. 95% CI
A 95% interval captures the middle 95%, which is a wider range than the middle 90%.

The Effects of Sample Size
Are these bootstrap distributions the same?
n = 100: SE = 0.05
n = 1500: SE = 0.013
A bootstrap using a sample size of 1500 generates a standard error that is much smaller than the standard error generated from a sample size of 100. The width of the interval decreases from 0.200 to 0.052 (a margin of error of ± 0.100 versus ± 0.026).

Assignment
Part I: Graded Problems 3.74 and 3.76 (a, c, d, and e)
Part II: Go to [...]. For the following 3 categorical variables, calculate the confidence interval for the point estimate (i.e., sample proportion) using:
1. The Standard Error Method (show your work)
2. The Percentile Method (print/[...] your screenshot from www.lock5stat.com/statkey)
GENDER: calculate the confidence interval for the proportion who are female
DIVORCE: calculate the confidence interval for the proportion who've been divorced
GUNLAW: calculate the confidence interval for the proportion who favor gun laws

Finding Proportions from the GSS
Enter the variable name here
Check Column
Check Confidence Intervals
Check Unweighted
Click on Run Table

Finding Proportions from the GSS
This is the sample proportion.
This is the confidence interval.
This is the number of respondents who indicated being female.
This is the total number of respondents.

StatKey and the Percentile Method
On the StatKey home page, click on CI for Single Proportion to get to this page.
Click on Edit Data and a window will pop up to enter:
n (the sample size)
count (the number of respondents who are in the category you are interested in, e.g., female)
Click here to generate 1000 bootstrap samples.
Click on Two-Tail to get the 95% confidence interval.
These values represent the 95% confidence interval.

Summary
The standard error of a statistic is the standard deviation of the sampling distribution, which can be estimated from a bootstrap distribution.
Confidence intervals for population parameter estimates can be calculated using the standard error or the percentiles of a bootstrap distribution.
Confidence intervals can be calculated this way as long as the bootstrap distribution is approximately symmetric and continuous.
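The effects of confidence level and sample size described in this lecture can be explored numerically. The sketch below is illustrative only: it assumes a sample proportion of 0.49 at both sample sizes (49 of 100 and 735 of 1,500), and the helper names `bootstrap_proportions` and `percentile_interval` are hypothetical.

```python
import random
import statistics


def bootstrap_proportions(count, n, reps, rng):
    """Return `reps` bootstrap sample proportions for `count` successes out of `n`."""
    sample = [1] * count + [0] * (n - count)
    return [sum(rng.choices(sample, k=n)) / n for _ in range(reps)]


def percentile_interval(stats, level):
    """Middle `level` percent of the statistics; `level` is a whole number below 100."""
    # 199 cut points at 0.5%, 1.0%, ..., 99.5% of the bootstrap distribution.
    cuts = statistics.quantiles(stats, n=200)
    tail = 100 - level  # size of each tail, counted in half-percent steps
    return cuts[tail - 1], cuts[-tail]


rng = random.Random(7)  # arbitrary seed so the run is repeatable

boot_1500 = bootstrap_proportions(735, 1500, 1000, rng)
boot_100 = bootstrap_proportions(49, 100, 1000, rng)

# Higher confidence keeps more of the bootstrap distribution: wider interval.
widths = {}
for level in (90, 95, 99):
    low, high = percentile_interval(boot_1500, level)
    widths[level] = high - low
    print(f"{level}% CI: ({low:.3f}, {high:.3f})")

# Larger samples give less variable statistics: smaller standard error.
se_100 = statistics.stdev(boot_100)
se_1500 = statistics.stdev(boot_1500)
print("SE with n = 100: ", round(se_100, 3))
print("SE with n = 1500:", round(se_1500, 3))
```

For the 99% level the function returns the 0.5th and 99.5th percentiles, matching the rule of leaving 0.5% in each tail.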