
Confidence Intervals in Statistics with Excel 2010
75 Problems & Detailed Solutions
An Ideal Statistics Supplement for Students & Instructors
Kilem L. Gwet, Ph.D.
Copyright 2011 - K. Gwet (info@advancedanalyticsllc.com)

Printed Books by Kilem L. Gwet

The Practical Guide to Statistics: Basic Concepts, Methods and Meaning. Application With MS Excel, R, and OpenOffice Calc
HANDBOOK OF INTER-RATER RELIABILITY (Second Edition): The Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters
INTER-RATER RELIABILITY USING SAS: A Practical Guide for Nominal, Ordinal, and Interval Data

CONFIDENCE INTERVALS IN STATISTICS: 100 Problems & Solutions Excel 2010-Assisted Solutions

Copyright © 2011 by Kilem Li Gwet, Ph.D. All rights reserved.

Published by Advanced Analytics, LLC

A single copy of this document may be printed and the printed copy may be shared with other interested parties. However, this document is NOT to be transmitted in any other form, including electronic or mechanical, photocopying, recording, or by an information storage and retrieval system, without written permission from the publisher, except by a reviewer who may quote brief passages in a review to be printed in a magazine or a newspaper. For information, please contact Advanced Analytics, LLC at the following address:

Advanced Analytics, LLC
PO BOX 2696
Gaithersburg, MD 20886-2696
e-mail: info@advancedanalyticsllc.com

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. However, it is sold with the understanding that the publisher assumes no responsibility for errors, inaccuracies or omissions. The publisher is not engaged in rendering any professional services. A competent professional person should be sought for expert assistance.

Publisher's Cataloguing in Publication Data:
Gwet, Kilem Li
Confidence Intervals in Statistics: 75 Problems and Solutions
A Practical Self-Study Guide for Students and Professionals / By Kilem Li Gwet
p. cm.
1. Biostatistics 2. Statistical Methods 3. Statistics - Study - Learning. I. Title.

Contents

1. Introduction
2. Confidence Interval for a Population Mean
   2.1 Population Standard Deviation (σ) is Known
   2.2 Population Standard Deviation (σ) is Unknown
   2.3 Sample Size Determination
3. Confidence Interval for a Population Proportion
   3.1 Confidence Interval Calculation
   3.2 Sample Size Determination
4. Excel's Analysis ToolPak
   4.1 Installing the Analysis ToolPak for Excel 2010
   4.2 Analyzing Raw Data with the Descriptive Statistics Module

1 Introduction

The confidence interval is one of the many techniques used in statistical inference for projecting findings from a limited amount of data to the more general population of interest. It represents a range of values expected to include an unknown quantity at the specified confidence level. Claiming, for example, that the percent of Americans who support their president's foreign policy is 45% ± 3% with 95% certainty means two things:

(i) That 45% is our best guess of the percent of Americans supporting the president's foreign policy, based on the limited number of persons who participated in our study. 45% is called the Point Estimate of the true population percent of Americans supporting the president's foreign policy.

(ii) That our approximation is subject to an error margin of 3%, which stems from the fact that our study did not cover the entire population of Americans that constitutes our primary interest. Moreover, we can claim with 95% certainty that the values ranging from 42% through 48% contain the true population percentage that we would have obtained had we surveyed the entire US population. 95% is called the Confidence Level.

The US president's foreign policy example illustrates the use of confidence intervals for population proportions or percentages. However, confidence intervals are also often used for population means of quantitative or measurement variables such as income, height or weight.

This booklet features numerous carefully selected exercises, all related to the construction of confidence intervals, and all accompanied by detailed and commented solutions. Readers interested in a comprehensive and student-friendly exposition of the notion of confidence interval can find a detailed and vivid account in Gwet (2011)¹.

¹ K. Gwet (2011), The Practical Guide to Statistics: Applications with Excel, R, and Calc, Advanced Analytics, LLC

This booklet contains a series of 75 exercises with solutions. The exercises in chapter 2 aim at showing how confidence intervals for the population mean of a measurement variable can be constructed, while those of chapter 3 focus on the construction of confidence intervals for population proportions. While many exercises illustrate the techniques for constructing confidence intervals, the reader will also find numerous exercises on the sample size calculation. The sample size calculation exercises illustrate the techniques for determining the number of units required in the sample to ensure a specified error margin, or a specified interval length.

Occasionally, you will need to construct confidence intervals or calculate the required sample sizes based on a series of raw numbers representing the sample data. Summary statistics will have to be generated before you can construct the confidence intervals, or calculate the sample sizes. My recommendation is to use the Descriptive Statistics module of the Excel 2010 Analysis ToolPak to produce these summary statistics. The Excel Analysis ToolPak is an Excel Add-In that comes with Microsoft Office, and that must be activated before it can be used. Chapter 4 provides detailed instructions for activating the ToolPak and for using the Descriptive Statistics module.

If you have comments or questions, do not hesitate to contact the author using one of the following 2 methods:

E-Mail: info@advancedanalyticsllc.com
Mail: Advanced Analytics, LLC, PO BOX 2696, Gaithersburg, MD 20886-2696

2 Confidence Intervals for a Population Mean

Many statistics textbooks present the material on confidence intervals for means by treating separately the situation when the population standard deviation \(\sigma\) is known and the situation when it is unknown. I find it more practical to treat separately the situation when the sample size \(n\) is large¹ and the situation when it is small. Whether the sample size is large or small, the construction of a confidence interval for a population mean \(\mu\) relies heavily upon the sampling distribution of the sample mean \(\bar{x}\). The probability distribution of the sample mean must be known (at least approximately) for you to be able to construct the confidence interval.

¹ For all practical purposes, a sample size \(n\) is considered large when it equals or exceeds 30, and small otherwise.

Large Sample Size (i.e. \(n \geq 30\))

Let \(s\) be the sample standard deviation. If the population standard deviation \(\sigma\) is available, then use it in place of the sample standard deviation (note that this interchangeability applies only when the sample size \(n\) is large). The confidence interval of the population mean \(\mu\) at a pre-specified confidence level is given by:

\[
\mathrm{C.I}(\mu) =
\begin{cases}
\left(\bar{x} - z_{\alpha/2}\,\dfrac{s}{\sqrt{n}}\ ;\ \bar{x} + z_{\alpha/2}\,\dfrac{s}{\sqrt{n}}\right), & \text{if } \sigma \text{ is not available (a)},\\[2ex]
\left(\bar{x} - z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}\ ;\ \bar{x} + z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}\right), & \text{if } \sigma \text{ is available (b)},
\end{cases}
\tag{2.1}
\]

where \(\alpha = 1 - (\text{Confidence Level})\), and \(z_{\alpha/2}\) is the \(100(1-\alpha/2)\)th percentile of the standard Normal distribution, which is calculated with Excel 2010 as follows:

\[
z_{\alpha/2} = \text{NORM.S.INV}(1-\alpha/2) \tag{2.2}
\]

The confidence intervals of equation 2.1 are sometimes presented in the form

\[
\mathrm{C.I}(\mu) = \bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}, \quad \text{or} \quad \mathrm{C.I}(\mu) = \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}.
\]

Small Sample Size (i.e. \(n < 30\))

When the size \(n\) of your sample is small (i.e. the number of sample elements is below 30), you must assume that the input observations follow the Normal distribution (at least approximately)², and treat separately the situations where the population standard deviation \(\sigma\) is known and where it is unknown.

² Gwet (2011), in the book The Practical Guide to Statistics: Applications with Excel, R, and Calc, discusses the situation where this assumption cannot be made. However, all exercises in this section are based on the assumption of Normality or approximate Normality.

(i) Known Population Standard Deviation

When the population standard deviation \(\sigma\) is known, the confidence interval associated with the population mean \(\mu\) is given by:

\[
\mathrm{C.I}(\mu) = \left(\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\ ;\ \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right), \tag{2.3}
\]

where \(\alpha = 1 - (\text{Confidence Level})\), and \(z_{\alpha/2}\) is the \(100(1-\alpha/2)\)th percentile of the standard Normal distribution, which is computed with Excel 2010 as shown in equation 2.2.

(ii) Unknown Population Standard Deviation

When the population standard deviation is unknown, the confidence interval associated with the population mean \(\mu\) at a given confidence level is given by:

\[
\mathrm{C.I}(\mu) = \left(\bar{x} - t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}\ ;\ \bar{x} + t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}\right), \tag{2.4}
\]

where \(\alpha = 1 - (\text{Confidence Level})\), and \(t_{\alpha/2,\,n-1}\) is the \(100(1-\alpha/2)\)th percentile of the Student's t-distribution with \(n-1\) degrees of freedom. This quantity is calculated using Excel 2010 as follows:

\[
t_{\alpha/2,\,n-1} = \text{T.INV}(1-\alpha/2,\ n-1) \tag{2.5}
\]

The Finite-Population Correction Factor

Occasionally in practice, you may have to select a sample from a well-defined and specific finite population. The sample may well represent a sizeable portion of the entire population. In this case, the standard error of the sample mean, calculated as the ratio of the standard deviation to the square root of the sample size, will overestimate the true standard error of the mean. The solution to this problem is to multiply the classical standard error by the finite-population correction (FPC) factor defined as follows:

\[
\mathrm{FPC} = \sqrt{\frac{N-n}{N-1}}, \tag{2.6}
\]

where \(N\) is the population size, and \(n\) the sample size. The rule of thumb for deciding whether or not to use the FPC is to use it whenever the sampling fraction, that is the ratio \(n/N\) of the sample size to the population size, exceeds 0.05. When using the FPC is deemed appropriate, all ratios of the form \(\sigma/\sqrt{n}\) or \(s/\sqrt{n}\) must be replaced with their adjusted versions \(\mathrm{FPC}\cdot\sigma/\sqrt{n}\) and \(\mathrm{FPC}\cdot s/\sqrt{n}\). The corresponding confidence intervals must be computed accordingly.

2.1 Population Standard Deviation (σ) is Known

Exercise 2.1
A sample of 49 observations is taken from a normal population with a standard deviation of 10. The sample mean is 55. Determine the 99 percent confidence interval for the population mean.

Solution

Table 2.1: Input information
Population mean: \(\mu\) is unknown (a)
Population standard deviation: \(\sigma = 10\)
Sample size: \(n = 49\)
Sample mean: \(\bar{x} = 55\)
Confidence level: 0.99
(a) This quantity has to be unknown, otherwise there would not be any need to develop a confidence interval for a known quantity.

Since the probability distribution of the raw data is normal, so is the probability distribution of the sample mean \(\bar{x}\). Equation 2.1(b) will be used because the sample size exceeds 30 and the population standard deviation \(\sigma\) is available.

Of all the elements needed to construct the confidence interval (cf. equation 2.1(b)), only \(z_{\alpha/2}\) is yet to be obtained. It follows from the confidence level 0.99 that \(\alpha = 1 - 0.99 = 0.01\) and \(\alpha/2 = 0.005\). Consequently, \(z_{0.005}\) represents the 99.5th percentile³ of the Normal distribution needed to construct the 99% confidence interval. It is obtained with Excel 2010 as shown in Figure 2.1, and equals 2.576.

³ Note that \(99.5 = 100(1-\alpha/2)\) is the recommended percentile in equation 2.1.

The two confidence bounds of the interval are given by

\[
\text{Lower Bound} = \bar{x} - z_{\alpha/2}\,\sigma/\sqrt{n} = 55 - 2.576 \times 10/\sqrt{49} = 51.32,
\]
\[
\text{Upper Bound} = \bar{x} + z_{\alpha/2}\,\sigma/\sqrt{n} = 55 + 2.576 \times 10/\sqrt{49} = 58.68.
\]

The 99% confidence interval of the population mean \(\mu\) is CI(\(\mu\)) = (51.32 ; 58.68).

It is generally recommended to round the lower bound down and the upper bound up. That is, if for example the lower bound is 13.268 and you want to display only 2 digits after the decimal point, then you should present 13.26 (and not 13.27). On the other hand, an upper bound of 13.262 would be rounded up to 13.27 (and not 13.26). The rationale is to ensure the validity of the confidence level with respect to the final interval.
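The calculation of Exercise 2.1 can also be cross-checked outside Excel. The following is only a minimal Python sketch under stated assumptions: the function names `z_interval` and `round_outward` and the `population_size` parameter are hypothetical names introduced for illustration, and the standard library's `NormalDist().inv_cdf` stands in for Excel's NORM.S.INV. The optional argument applies the finite-population correction of equation 2.6.

```python
import math
from statistics import NormalDist


def z_interval(mean, sd, n, confidence, population_size=None):
    """Normal-based confidence interval for a mean (equations 2.1 and 2.3).

    `sd` is sigma when it is known, or s for a large sample.
    If `population_size` is given, the standard error is multiplied by
    the finite-population correction of equation 2.6.
    """
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # plays the role of NORM.S.INV
    std_error = sd / math.sqrt(n)
    if population_size is not None:
        std_error *= math.sqrt((population_size - n) / (population_size - 1))
    margin = z * std_error
    return mean - margin, mean + margin


def round_outward(lower, upper, digits=2):
    """Round the lower bound down and the upper bound up."""
    scale = 10 ** digits
    return math.floor(lower * scale) / scale, math.ceil(upper * scale) / scale


# Inputs of Exercise 2.1: n = 49, sigma = 10, sample mean = 55, 99% level.
lower, upper = z_interval(mean=55, sd=10, n=49, confidence=0.99)
print(round_outward(lower, upper))  # (51.32, 58.68)
```

Passing, say, `population_size=500` (an illustrative value that does not come from the exercise) would shrink the interval slightly, since 49/500 exceeds the 0.05 sampling fraction at which the correction is recommended.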

Figure 2.1. Calculating the 99.5th percentile of the Normal distribution with Excel 2010

Exercise 2.2
A sample of 81 observations is taken from a normal population with a standard deviation of 5. The sample mean is 40. Determine the 95 percent confidence interval for the population mean.

Solution

Table 2.2: Input information
Population mean: \(\mu\) is unknown (a)
Population standard deviation: \(\sigma = 5\)
Sample size: \(n = 81\)
Sample mean: \(\bar{x} = 40\)
Confidence level: 0.95
(a) This quantity has to be unknown, otherwise there would not be any need to develop a confidence interval for a known quantity.

Since the probability distribution of the raw data is normal, so is the probability distribution of the sample mean \(\bar{x}\). Equation 2.1(b) will be used because the sample size exceeds 30 and the population standard deviation \(\sigma\) is available.

It follows from the confidence level 0.95 that \(\alpha = 1 - 0.95 = 0.05\) and \(\alpha/2 = 0.025\). Consequently, \(z_{0.025}\) represents the 97.5th percentile of the Normal distribution needed to construct the 95% confidence interval. It is obtained with Excel 2010 as 1.96 = NORM.S.INV(1-0.05/2). The two confidence bounds of the interval are given by

\[
\text{Lower Bound} = \bar{x} - z_{\alpha/2}\,\sigma/\sqrt{n} = 40 - 1.96 \times 5/\sqrt{81} = 38.91,
\]
\[
\text{Upper Bound} = \bar{x} + z_{\alpha/2}\,\sigma/\sqrt{n} = 40 + 1.96 \times 5/\sqrt{81} = 41.09.
\]

The 95% confidence interval of the population mean \(\mu\) is CI(\(\mu\)) = (38.91 ; 41.09).

Exercise 2.3
A sample of 10 observations is selected from a normal population for which the population standard deviation is known to be 5. The sample mean is 20.
a. Determine the standard error of the mean.
b. Explain why we can use formula (2.3) to determine the 95 percent confidence interval even though the sample is less than 30.
c. Determine the 95 percent confidence interval for the population mean.

Solution

Table 2.3: Input information
Population mean: \(\mu\) is unknown (a)
Population standard deviation: \(\sigma = 5\)
Sample size: \(n = 10\)
Sample mean: \(\bar{x} = 20\)
(a) This quantity has to be unknown, otherwise there would not be any need to develop a confidence interval for a known quantity.

(a) Standard Error of the Mean: s.e(\(\bar{x}\))

The s.e. of the mean is defined as s.e(\(\bar{x}\)) = \(\sigma/\sqrt{n} = 5/\sqrt{10} = 1.581\).

(b) Although the sample size \(n\) is smaller than 30, we can use formula 2.3 because the population is normal and the population standard deviation \(\sigma\) is known to be 5.

(c) 95% confidence interval for \(\mu\)

Confidence level \(= 1 - \alpha = 0.95\).
The lower bound is \(\text{LB} = \bar{x} - z_{0.025}\,\sigma/\sqrt{n} = 20 - 1.96 \times 5/\sqrt{10} = 16.9\).
The upper bound is \(\text{UB} = \bar{x} + z_{0.025}\,\sigma/\sqrt{n} = 20 + 1.96 \times 5/\sqrt{10} = 23.1\).
The 95% confidence interval of the population mean \(\mu\) is CI(\(\mu\)) = (16.9 ; 23.1).

Exercise 2.4
A research firm conducted a survey to determine the mean amount steady smokers spend on cigarettes during a week. They found that the distribution of amounts spent per week followed the normal distribution with a standard deviation of $5. A sample of 49 steady smokers revealed that the sample mean was $20.
a. What is the point estimate of the population mean? Explain what it indicates.
b. Using the 95 percent level of confidence, determine the confidence interval for \(\mu\). Explain what it indicates.

Solution

Table 2.4: Input information
Population mean: \(\mu\) is unknown
Distribution of data: Normal
Population standard deviation: \(\sigma\) = $5
Sample size: \(n = 49\)
Sample mean: \(\bar{x}\) = $20

(a) The point estimate of the population mean is the sample mean \(\bar{x}\) = $20. It represents our best guess of the magnitude of the actual and unknown population mean \(\mu\).

2.2 Population Standard Deviation (σ) is Unknown

2.5 as follows:

\[
t = t_{\alpha/2,\,n-1} = t_{0.05,\,15} = \text{T.INV}(1-0.05,\ 15) = 1.75.
\]

Note that \(\alpha = 1 - 0.90 = 0.10\), and \(0.05 = \alpha/2\). For a confidence level of 90%, the value of \(t\) represents the 95th percentile of the t distribution with \(n - 1 = 16 - 1 = 15\) degrees of freedom.

(d) We will develop the 90% confidence interval for the population mean using expression 2.4. The lower bound LB and the upper bound UB of the confidence interval are obtained as follows:

\[
\text{LB} = \bar{x} - t_{0.05,\,15}\,\frac{s}{\sqrt{n}} = 60 - 1.75 \times \frac{20}{\sqrt{16}} = 51.2,
\]
\[
\text{UB} = \bar{x} + t_{0.05,\,15}\,\frac{s}{\sqrt{n}} = 60 + 1.75 \times \frac{20}{\sqrt{16}} = 68.8.
\]

The 90% confidence interval of the population mean number of eggs per chicken \(\mu\) is CI(\(\mu\)) = (51.2 ; 68.8).

(e) Yes, it would be reasonable to conclude that the population mean is 63 pounds, because 63 falls inside the 90% confidence interval. This interval is supposed to contain the true population mean with 90% certainty.

Exercise 2.11
Merrill Lynch Securities and Health Care Retirement, Inc., are two large employers in downtown Toledo, Ohio. They are considering jointly offering child care for their employees. As a part of the feasibility study, they wish to estimate the mean weekly child-care cost of their employees. A sample of 10 employees who use child care reveals the following amounts spent last week.

$107 $92 $97 $95 $105 $101 $91 $99 $95 $104

Develop a 90 percent confidence interval for the population mean. Interpret the result.

Solution

The population mean is \(\mu\) = the mean weekly child-care cost of the population of Merrill Lynch Securities and Health Care Retirement, Inc. employees. It is unknown and must be estimated with a 90% confidence interval.

Capture the 10 data points of this exercise in Excel. The data may be organized either vertically or horizontally. Following the directives of section 4.2 of chapter 4, use the Descriptive Statistics module of the Excel 2010 Analysis ToolPak to produce Table 2.10 below (make sure to specify the correct confidence level of 90%).

Table 2.10: Output of the Descriptive Statistics Module
Mean                        98.6
Standard Error              1.752
Median                      98
Mode                        95
Standard Deviation          5.542
Sample Variance             30.711
Kurtosis                    -1.304
Skewness                    0.165
Range                       16
Minimum                     91
Maximum                     107
Sum                         986
Count                       10
Confidence Level(90.0%)     3.212

The second column of the last row (entitled "Confidence Level(90.0%)") contains the error margin⁴ \(E = 3.212\) associated with the sample mean \(\bar{x} = 98.6\). These two numbers should be used to obtain the 90% confidence interval as follows:

\[
\text{Lower Bound} = \bar{x} - E = 98.6 - 3.212 = 95.388,
\]
\[
\text{Upper Bound} = \bar{x} + E = 98.6 + 3.212 = 101.812.
\]

⁴ The error margin is expressed as \(E = t_{\alpha/2,\,n-1}\, s/\sqrt{n}\).
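For readers who want to check the key entries of Table 2.10 without the ToolPak, here is a rough Python sketch using only the standard library. It is not the booklet's method. The standard library has no t-distribution, so the critical value 1.833 is hard-coded as an assumption: it is the rounded value of T.INV(0.95, 9), the 95th percentile of the t distribution with 9 degrees of freedom. The variable names are illustrative.

```python
import math
import statistics

# Weekly child-care costs (in dollars) from Exercise 2.11.
costs = [107, 92, 97, 95, 105, 101, 91, 99, 95, 104]

n = len(costs)
mean = statistics.mean(costs)
sd = statistics.stdev(costs)       # sample standard deviation (n - 1 denominator)
std_error = sd / math.sqrt(n)

# Assumption: t_{0.05, 9} rounded to 3 decimals, i.e. T.INV(0.95, 9) in Excel.
T_CRITICAL = 1.833
margin = T_CRITICAL * std_error    # error margin E = t * s / sqrt(n)

print(f"mean={mean:.1f}  sd={sd:.3f}  se={std_error:.3f}  E={margin:.3f}")
print(f"bounds: {mean - margin:.3f} to {mean + margin:.3f}")
```

The printed mean, standard deviation, standard error and error margin should agree with the corresponding rows of Table 2.10 to three decimals.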

The confidence interval is C.I(\(\mu\)) = (95.38 ; 101.82).

Interpretation
Although the feasibility study estimated the mean weekly child-care cost of the employees to be about $98.6 based on a small sample of 10 employees, the actual value of that mean is between $95.38 and $101.82 with 90% certainty, as suggested by the confidence interval.

Exercise 2.12
The Greater Pittsburgh Area Chamber of Commerce wants to estimate the mean time workers who are employed in the downtown area spend getting to work. A sample of 15 workers reveals the following number of minutes spent traveling.

29 38 38 33 38 21 45 34 40 37 37 42 30 29 35

Develop a 98 percent confidence interval for the population mean. Interpret the result.

Solution

The population mean is \(\mu\) = the mean commute time of workers to work. It is unknown and must be estimated with a 98% confidence interval.

Capture the 15 data points of this exercise in Excel. The data may be organized either vertically or horizontally. Following the directives of section 4.2 of chapter 4, use the Descriptive Statistics module of the Excel 2010 Analysis ToolPak to produce Table 2.11 below (make sure to specify the correct confidence level of 98%).

2.3 Sample Size Determination

The narrower the confidence interval, the more information it provides on the real magnitude of the population mean. An overly wide confidence interval, on the other hand, provides us with a wide range of possible values for the population mean, and little information about the magnitude of the mean itself. Consequently, when designing a research study, researchers often want to determine the sample size needed to obtain a confidence interval with a pre-specified length \(L\).

We have seen that the confidence interval of the population mean \(\mu\) generally has the form \((\bar{x} - E\ ;\ \bar{x} + E)\), where \(\bar{x}\) is the sample mean and \(E\) the margin of error, defined as \(E = z_{\alpha/2}\, s/\sqrt{n}\) when \(n\) is reasonably large. Here \(n\) is the sample size, \(s\) the standard deviation¹⁰, and \(z_{\alpha/2}\) the \(100(1-\alpha/2)\)th percentile of the standard Normal distribution, with \(\alpha = 1 - \text{Confidence Level}\). For a specified error margin \(E\) and confidence level \(1-\alpha\), the desired sample size is given by

\[
n = \left(\frac{z_{\alpha/2}\, s}{E}\right)^{2}. \tag{2.7}
\]

If it is the length \(L\) of the confidence interval that is provided, then the desired sample size will be given by:

\[
n = \left(\frac{z_{\alpha/2}\, s}{L/2}\right)^{2}. \tag{2.8}
\]

The predicted sample size \(n\) obtained by either equation must be rounded UP¹¹ to the nearest integer value to ensure a confidence interval that matches the pre-determined error margin or is narrower.

Note that the sample mean is always the center of the confidence interval. Therefore the population mean is always within the error margin of the sample mean at the specified confidence level.

¹⁰ Note that \(s\) could be the true population standard deviation \(\sigma\) when it is known, or could be its estimated value obtained from a pilot study. Even a small pilot may yield an estimated standard deviation sufficiently precise for the purpose of calculating the sample size.
¹¹ Rounding up to the nearest integer means that 56.02, for example, will be rounded up to 57.
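Equations 2.7 and 2.8 can be sketched in a few lines of Python. This is an illustrative sketch rather than a definitive implementation: the function name `sample_size` and its parameters are hypothetical, `NormalDist().inv_cdf` from the standard library stands in for Excel's NORM.S.INV, and the input values in the usage lines are chosen only for illustration.

```python
import math
from statistics import NormalDist


def sample_size(sd, confidence, error_margin=None, length=None):
    """Sample size from equation 2.7 (error margin E) or 2.8 (interval length L).

    Exactly one of `error_margin` and `length` must be supplied.
    The result is rounded UP to the next integer.
    """
    if (error_margin is None) == (length is None):
        raise ValueError("supply exactly one of error_margin or length")
    if error_margin is None:
        error_margin = length / 2.0          # equation 2.8 uses L/2 in place of E
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((z * sd / error_margin) ** 2)


# Illustrative inputs: s = 10, 95% confidence, error margin 2 (or length 4).
print(sample_size(sd=10, confidence=0.95, error_margin=2))  # 97
print(sample_size(sd=10, confidence=0.95, length=4))        # 97
```

Because the sketch uses the unrounded percentile instead of a two-decimal value such as 1.96, the pre-rounding result can differ slightly from a hand calculation; rounding up usually absorbs that difference.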

Exercise 2.40
A population is estimated to have a standard deviation of 10. We want to estimate the population mean within 2, with a 95 percent level of confidence. How large a sample is required?

Solution

Table 2.36: Input information
Standard deviation: \(s = 10\)
Error margin: \(E = 2\)
Confidence level: 0.95 (or \(\alpha = 0.05\))

The 97.5th percentile¹² of the standard Normal distribution is \(z_{\alpha/2} = 1.96\), which leads to the predicted sample size of

\[
n = \left(\frac{1.96 \times 10}{2}\right)^{2} = 96.04, \text{ which is rounded up to } 97.
\]

Exercise 2.41
We want to estimate the population mean within 5, with a 99 percent level of confidence. The population standard deviation is estimated to be 15. How large a sample is required?

Solution

Table 2.37: Input information
Standard deviation: \(s = 15\)
Error margin: \(E = 5\)
Confidence level: 0.99 (or \(\alpha = 0.01\))

The 99.5th percentile¹³ of the standard Normal distribution is \(z_{\alpha/2} = 2.58\), which leads to the predicted sample size of

\[
n = \left(\frac{2.58 \times 15}{5}\right)^{2} = 59.91, \text{ which is rounded up to } 60.
\]

¹² Note that \(97.5 = 100(1-\alpha/2)\).
¹³ Note that \(99.5 = 100(1-\alpha/2)\).

4 Excel's Analysis ToolPak

The objective of this chapter is two-fold: (i) to provide detailed instructions for installing the Excel 2010 Analysis ToolPak add-in, and (ii) to show how the ToolPak is used to construct confidence intervals from raw input data.

4.1 Installing the Analysis ToolPak for Excel 2010

The installation of the Excel 2010 Analysis ToolPak can be carried out in 4 easy steps, which are described as follows:

1. Opening the Excel Options Form
From the main Excel menu, select the File tab from the menu bar, then Options, as shown in Figure 4.1. These actions should open the Excel Options form of Figure 4.2.

Figure 4.1. Opening the Excel Options Form