Lab 9: Sampling Distributions

Sampling from Ames, Iowa

In this lab, we will investigate how estimates based on a random sample of data can inform us about what the population might look like. We're interested in formulating a sampling distribution of our estimate in order to get a sense of how good an estimate it is.

The Data

The dataset that we'll be considering comes from the town of Ames, Iowa. The Assessor's Office records information on all real estate sales, and the data set we're considering contains information on all residential home sales between 2006 and 2010. We will consider these data as our statistical population. In this lab we would like to learn as much as we can about these homes by taking smaller samples from the full population. Let's load the data.

    ames = read.delim("http://www.amstat.org/publications/jse/v19n3/decock/ameshousing.txt")
    source("http://stat.duke.edu/courses/spring12/sta10.1/labs/sampmean.r")

We see that there are quite a few variables in the data set, but we'll focus on the number of rooms above ground (TotRms.AbvGrd) and sale price (SalePrice). Let's look at the distribution of the number of rooms in homes in Ames by calculating some summary statistics and making a histogram.

    summary(ames$TotRms.AbvGrd)
    hist(ames$TotRms.AbvGrd)

Exercise 1
How would you describe this population distribution?

The Unknown Sampling Distribution

In this lab, we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or even impossible. Because of this, we often take a smaller sample survey of the population and use that to make educated guesses about the properties of the population. If we were interested in estimating the mean number of rooms in homes in Ames based on a sample, we could use the following command to survey the population.
    samp1 = sample(ames$TotRms.AbvGrd, 75)

This command creates a new vector called samp1 that is a simple random sample of size 75 from the population vector ames$TotRms.AbvGrd. At a conceptual level, you can imagine randomly choosing 75 entries from the Ames phonebook, calling them up, and recording the number of rooms in their houses. You would be correct in objecting that the phonebook probably doesn't contain phone numbers for all homes and that there will almost certainly be people who don't pick up the phone or refuse to give this information. These are issues that can make gathering data very difficult and are a strong incentive to collect a high quality sample.

Exercise 2
How would you describe the distribution of this sample? How does it compare to the distribution of the population?

If we're interested in estimating the average number of rooms in homes in Ames, our best guess is going to be the sample mean from this simple random sample.

    mean(samp1)

Exercise 3
How does your sample mean compare to your neighbors'? Are the sample means the same? Why or why not?

Depending on which 75 homes you selected, your estimate could be a bit above or a bit below the true population mean of 6.443. But in general, the sample mean turns out to be a pretty good estimate of the average number of rooms, and we were able to get it by sampling less than 3% of the population.

Exercise 4
Take a second sample, also of size 75, and call it samp2. How does the mean of samp2 compare with the mean of samp1? If we took a third sample of size 150, intuitively would you expect the sample mean to be a better or worse estimate of the population mean?

Not surprisingly, every time we take another random sample, we get a different sample mean. It's useful to get a sense of just how much variability we should expect when estimating the population mean this way. This is what is captured by the sampling distribution. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps 5000 times. We will use the function gen_samp_means to do this. This function takes three arguments: pop, the population data; samp_size, the size of the sample to take when generating the samples; and niter, the number of sample means to generate.
    samp_means = gen_samp_means(ames$TotRms.AbvGrd, samp_size = 75, niter = 5000)
    hist(samp_means, probability = TRUE)

Here we rely on the computational ability of R to quickly take 5000 samples of size 75 from the population, compute each of those sample means, and store them in a vector called samp_means.

Exercise 5
How would you describe this sampling distribution? On what value is it centered? Would you expect the distribution to change if we instead collected 50,000 sample means?
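For readers who cannot load the sourced helper script, the repeated-sampling step can be sketched in base R. This is only a sketch under stated assumptions: the name sketch_samp_means is hypothetical, the lab's own gen_samp_means may be implemented differently, and the simulated vector pop is a stand-in population used so that the example runs offline.

```r
# A minimal sketch, assuming the helper simply repeats sample() and mean().
# `sketch_samp_means` is a hypothetical name, not the lab's gen_samp_means.
sketch_samp_means <- function(pop, samp_size, niter) {
  # replicate() evaluates the expression niter times and
  # collects the resulting sample means in a numeric vector
  replicate(niter, mean(sample(pop, samp_size)))
}

set.seed(1)  # make the simulation reproducible

# Simulated stand-in population (an assumption made for this sketch);
# in the lab you would pass ames$TotRms.AbvGrd instead.
pop <- sample(2:12, size = 3000, replace = TRUE)

samp_means <- sketch_samp_means(pop, samp_size = 75, niter = 5000)

mean(pop)         # population mean
mean(samp_means)  # centre of the simulated sampling distribution
sd(samp_means)    # spread induced by sampling only 75 values
```

Comparing mean(pop) with mean(samp_means) is one way to see where the sampling distribution is centered before answering Exercise 5.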

Approximating the Sampling Distribution

The sampling distribution that we computed tells us a great deal about how well a sample mean estimates the average number of rooms in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average number of rooms of the population, and the spread of the distribution indicates how much variability is induced by sampling only 75 of the homes.

We computed the sampling distribution for the mean number of rooms by drawing 5000 samples from the population and calculating 5000 sample means. This was only possible because we had access to the population. In most cases you don't (if you did, there would be no need to estimate!). Therefore, you have only your single sample to rely upon... that, and the Central Limit Theorem.

The Central Limit Theorem states that, under certain conditions, the sample mean follows an approximately normal distribution. This allows us to make the inferential leap from our single sample to the full sampling distribution that describes every possible sample mean you might come across. But we need to look before we leap.

Exercise 6
Does samp1 meet the conditions for the sample mean to be approximately normally distributed according to the Central Limit Theorem?

If the conditions are met, then we can find the approximate sampling distribution by plugging in our best estimates for the population mean and the standard error: $\bar{x}$ and $s/\sqrt{n}$.

    xbar = mean(samp1)
    se = sd(samp1)/sqrt(75)

We can add a curve representing this approximation to our existing histogram using the command hist_curve. This function takes three arguments: sample_means, the sample means used to generate the histogram; mean, the mean of the normal curve to draw; and sd, the standard deviation of the normal curve to draw.

    hist_curve(samp_means, mean = xbar, sd = se)

We can see that the line does a decent job of tracing the histogram that we derived from having access to the population.
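The overlay step can be reproduced with base R graphics when the helper script is unavailable. The sketch below is an assumption about what such an overlay might look like, not the code behind hist_curve: sketch_hist_curve, mu and sigma are hypothetical names, and the simulated pop is a stand-in population rather than the Ames data.

```r
# A minimal sketch of the overlay; `sketch_hist_curve` is a hypothetical
# name and may differ from the lab's hist_curve.
sketch_hist_curve <- function(sample_means, mu, sigma) {
  # Density-scaled histogram so the bars and the curve share a y-axis
  hist(sample_means, probability = TRUE, main = "Sampling distribution")
  # Add the normal density with the supplied centre and spread
  curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, lwd = 2)
  invisible(NULL)
}

set.seed(2)
# Simulated stand-in population (an assumption made for this sketch)
pop <- sample(2:12, size = 3000, replace = TRUE)

# Sampling distribution built from the population, as in the lab
samp_means <- replicate(5000, mean(sample(pop, 75)))

# Normal approximation built from a single sample
samp1 <- sample(pop, 75)
xbar <- mean(samp1)
se <- sd(samp1) / sqrt(length(samp1))

sketch_hist_curve(samp_means, mu = xbar, sigma = se)

# Compare the single-sample standard error with the simulated spread
c(estimated_se = se, simulated_sd = sd(samp_means))
```

The curve is centered at the mean of one sample, so it will usually sit slightly to the left or right of the histogram; its width is the more informative comparison.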
In this case, our approximation based on the CLT is a good one.

Confidence Intervals

In class this week we discussed how we can use the Central Limit Theorem and the resulting normal distribution to describe a plausible range of values for the true population mean. We called these ranges confidence intervals. In the case of a sample mean we calculate the confidence interval using the following formula:

$$CI = \bar{X} \pm z_{CL} \frac{s}{\sqrt{n}}$$

where $\bar{X}$ is the sample mean, $z_{CL}$ is the z-score for the appropriate confidence level (e.g. 1.96 for a 95% CL), $s$ is the sample standard deviation, and $n$ is the sample size. We can calculate a 95% confidence interval in R for samp1 using the following code:

    mean(samp1) + c(-1, 1) * 1.96 * sd(samp1) / sqrt(length(samp1))

Exercise 7
Does the confidence interval for samp1 include the true population mean, 6.443? Does your neighbor's confidence interval contain it?

In class we also mentioned that the definition of a confidence level is that if we were to collect additional samples of the same size and calculate a confidence interval from each sample's mean and standard deviation, then we would expect CL% of those confidence intervals to contain the true population mean. We will confirm this by taking multiple samples and examining the resulting confidence intervals. We will do this using the check_ci function, which produces a graphical representation of 100 confidence intervals relative to the true population mean.

    check_ci(ames$TotRms.AbvGrd, samp_size = 75, CL = 0.95)

Note that we can change both the size of the sample used and the confidence level.

Exercise 8
What happens to the size of the confidence intervals when you increase the sample size? When you decrease it? What about when you change the confidence level?

You will hopefully also have noticed that the color of each confidence interval changes depending on whether it includes the true population mean, which is indicated by the vertical black line. The confidence interval is drawn in blue if it contains the population mean and in red if it does not. In practice, when we can only take a single sample, we would not know the value of the true population mean, which is why we have to use the language of confidence intervals and confidence levels.
Based on the resulting plot(s) it is possible to count the number of confidence intervals that do not include the true population mean. If our definition of confidence level is correct, the proportion of intervals that do include it should roughly match the confidence level you used when running the function.

Exercise 9
Run the check_ci function several times with different values for the confidence level, CL. Does the number of confidence intervals that contain the true population mean agree with the specified confidence level?
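The coverage claim can also be checked numerically instead of by counting colored intervals. The sketch below is an assumption about what such a check might look like, not the code behind check_ci: z_for_cl and coverage are hypothetical names, and the simulated pop is a stand-in population rather than the Ames data.

```r
# z-score that leaves (1 - CL)/2 in each tail of the standard normal
z_for_cl <- function(CL) qnorm(1 - (1 - CL) / 2)

# Fraction of nsim normal-approximation intervals that contain mean(pop)
# (`coverage` is a hypothetical helper written for this sketch)
coverage <- function(pop, samp_size, CL, nsim = 1000) {
  mu <- mean(pop)
  z <- z_for_cl(CL)
  hits <- replicate(nsim, {
    s <- sample(pop, samp_size)
    half_width <- z * sd(s) / sqrt(samp_size)
    # TRUE when the interval mean(s) +/- half_width covers mu
    abs(mean(s) - mu) <= half_width
  })
  mean(hits)
}

round(z_for_cl(c(0.90, 0.95, 0.99)), 3)  # 1.645 1.960 2.576

set.seed(4)
# Simulated stand-in population (an assumption made for this sketch)
pop <- sample(2:12, size = 3000, replace = TRUE)

coverage(pop, samp_size = 75, CL = 0.90)
coverage(pop, samp_size = 75, CL = 0.95)
```

With 1000 simulated intervals rather than 100, the observed fraction fluctuates less from run to run, which makes the comparison with the nominal confidence level easier to judge.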

On Your Own

So far we have only focused on estimating the mean number of rooms of the homes in Ames. Now we'll try to estimate the mean sale price.

1. Take a random sample of size 30 from ames$SalePrice. Using this sample, what is your best point estimate of the population mean? Include a histogram of this sample in your answer.

2. Check the conditions for the sampling distribution of $\bar{x}_{SalePrice}$ to be nearly normal.

3. Since you have access to the population, compute the sampling distribution for $\bar{x}_{SalePrice}$ by taking 5000 samples of size 30 from the population and computing 5000 sample means. Describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean sale price of homes in Ames to be? Include a histogram of the sampling distribution.

4. Change your sample size from 30 to 150, then compute the sampling distribution using the same method as above. Describe the shape of this sampling distribution (where n = 150) and compare it to the sampling distribution from earlier (where n = 30). Based on this sampling distribution, what would you guess the mean sale price of the homes in Ames to be? Include a histogram of the sampling distribution.

5. Based on their shape, which sampling distribution would you feel more comfortable approximating by the normal model?

6. Which sampling distribution has a smaller spread? If we're concerned with making estimates that are consistently close to the true value, is having a sampling distribution with a smaller spread more or less desirable?

7. Generate plots of the confidence intervals for sample sizes of 30 and 150 at confidence levels of 0.90, 0.95 and 0.99 (6 plots in total).

8. Based on your plots, how would you describe the relationship of sample size and confidence level to the size of the confidence interval?
Notes

This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported license (creativecommons.org/licenses/by-nc-nd/3.0/). This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.