Lecture 10. Introduction to Statistics


Outline: Introduction to Statistics
1-1 Introduction
1-2 Descriptive and Inferential Statistics
1-3 Variables and Types of Data
1-4 Sampling Techniques
1-5 Observational and Experimental Studies
1-6 Computers and Calculators

1-1 Introduction
Most people become familiar with probability and statistics through radio, television, newspapers, and magazines. Typical statements:
A typical one-a-day vitamin and mineral pill boosted certain immune responses in older people by 64 percent.
Of 1,000 households polled nationwide, 40% said they owned at least one cordless phone; 9% had two or more.

Statistics is used in almost all fields of human endeavor. It is also used as a tool in scientific research to make decisions based on controlled experiments.

Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.

You should know the vocabulary, symbols, concepts, and statistical procedures used in various studies.

1-2 Descriptive and Inferential Statistics
To gain knowledge about seemingly haphazard events, statisticians collect information for variables, which describe the event.

Data are the values (measurements or observations) that the variables can assume. Variables whose values are determined by chance are called random variables.

A collection of data values forms a data set. Each value in the data set is called a data value or a datum.

Depending on how data are used, two areas are distinguished: descriptive statistics and inferential statistics.

Descriptive statistics consists of the collection, organization, summarization, and presentation of data.

A population consists of all subjects (human or otherwise) that are being studied. A sample is a subgroup of the population.

Inferential statistics consists of generalizing from samples to populations, performing hypothesis testing, determining relationships among variables, and making predictions.

1-3 Variables and Types of Data
Qualitative variables are variables that can be placed into distinct categories, according to some characteristic or attribute. Examples: gender (male or female), blood type.

Quantitative variables are numerical in nature and can be ordered or ranked. Examples: age, height, weight, blood pressure.

Quantitative variables can be further classified into two groups: discrete or continuous.

Discrete variables are variables whose possible values are isolated numbers (mostly integers). They involve counting rather than measuring. Example: the number of kittens in a litter.

Continuous variables can assume all values between any two specific values (i.e., in an interval). They are obtained by measuring. Examples: length, weight, temperature, volume, and time.

Variables can also be classified by how they are categorized, counted, or measured. This type of classification uses measurement scales, and four common types of scales are used: nominal, ordinal, interval, and ratio.

The nominal level of measurement classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data. Examples: classifying professors according to subject taught (e.g., English, anatomy, or mathematics); survey subjects as male or female; residents according to zip codes.

The ordinal level of measurement classifies data into categories that can be ranked; precise differences between the ranks do not exist. Examples: letter grades (A, B, C, D, F); professors might be ranked as superior, average, or poor.

The interval level of measurement ranks data; precise differences between units of measure do exist; there is no meaningful zero. Examples: IQ, temperature.

The ratio level of measurement possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist for the same variable. Examples: height; weight; area; the number of phone calls received.

1-4 Sampling Techniques
Data can be collected in a variety of ways. One of the most common methods is through the use of surveys. Surveys can be done by using a variety of methods; examples are telephone surveys, mail questionnaires, personal interviews, surveys of records, and direct observation.

To obtain samples that are unbiased, statisticians use four methods of sampling: random, systematic, stratified, and cluster sampling.

Random samples are selected by using chance methods or random numbers:
Number each subject in the population.
Select numbered cards from a bowl.
The subjects whose numbers are selected constitute the sample.

Systematic samples are obtained by numbering each value in the population and then selecting every k-th value. Example: the population has 2000 subjects and we need a sample of 50. Then k = 2000/50 = 40, so every 40th subject is selected; the first subject (numbered between 1 and 40) is selected at random.

Stratified samples are selected by dividing the population into groups (strata) according to some characteristic and then taking samples from each group.

Cluster samples are selected by dividing the population into groups and then taking samples of the groups. Example: suppose a researcher wishes to survey apartment dwellers in a large city. If there are 10 apartment buildings in the city, the researcher can select at random 2 buildings from the 10 and interview all the residents of these buildings.

1-5 Observational and Experimental Studies
In an observational study, the researcher merely observes what is happening or what has happened in the past and tries to draw conclusions based on these observations.

In an experimental study, the researcher manipulates one of the variables and tries to determine how the manipulation influences other variables. An example follows.
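The systematic sampling rule from Section 1-4 (every k-th subject after a random start among the first k) can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name `systematic_sample` and the fixed random seed are assumptions made here, and the sketch assumes the population size is an exact multiple of the sample size, as in the 2000/50 example.

```python
import random

def systematic_sample(population, sample_size, rng=None):
    """Select every k-th subject after a random start within the first k."""
    rng = rng or random.Random()
    k = len(population) // sample_size   # sampling interval, e.g. 2000 // 50 = 40
    start = rng.randrange(k)             # 0-based random start among the first k subjects
    return population[start::k][:sample_size]

# Subjects numbered 1..2000; a sample of 50 is needed, so k = 40.
subjects = list(range(1, 2001))
sample = systematic_sample(subjects, 50, random.Random(0))  # seed fixed only for repeatability
print(len(sample), sample[:3])
```

The first selected subject always falls between 1 and 40, and consecutive selections are exactly 40 apart.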

Example: students were divided into two groups; they had to perform as many sit-ups as possible in 90 seconds. The first group was told: "Do your best." The second group was told to try to increase the actual number of sit-ups they did each day by 10%.

After four days, the first group averaged 43 sit-ups and the second group averaged 56 sit-ups by the last day's session. Conclusion: athletes who were given specific goals performed better than those who were not given specific goals.

The group that received the special instruction is called the treatment group, while the other is called the control group. The treatment group receives a specific treatment (in this case, instructions for improvement) while the control group does not.

1-6 Computers and Calculators
Computers and calculators make numerical computation easier. Many statistical packages are available. Examples: SPSS, MINITAB, EXCEL. Data must still be understood and interpreted by you!

Outline: Frequency Distributions and Graphs
2-1 Introduction
2-2 Organizing Data
2-3 Histograms
2-4 Other Types of Graphs

2-1 Introduction
In a statistical study, the researcher must gather data for the particular variable under study. The data must be organized in some meaningful way. The most convenient method is to construct a frequency distribution.

When data are collected in original form, they are called raw data. When the raw data are organized into a frequency distribution, the frequency is the number of values in a specific class of the distribution.

2-2 Organizing Data
A frequency distribution is the organization of raw data in table form, using classes and frequencies. An example of a frequency distribution:

Blood Type Frequency Distribution
Class    Frequency    Percent
A        5            20
B        7            28
O        9            36
AB       4            16

Three Types of Frequency Distributions
Categorical frequency distributions can be used for data that can be placed in specific categories, such as nominal- or ordinal-level data. Examples: political affiliation, religious affiliation, blood type.

Ungrouped frequency distributions can be used for data that can be enumerated and when the range of values in the data set is not large. Examples: the number of miles your instructors have to travel from home to campus, the number of girls in a 4-child family.
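A categorical frequency distribution such as the blood-type table can be built by counting each class and converting counts to percentages. The sketch below is illustrative only: the raw observations are an assumption (25 subjects with counts A 5, B 7, O 9, AB 4, chosen to be consistent with the percentages 20, 28, and 36 in the table), and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(values):
    """Map each class to (frequency, percent of all observations)."""
    counts = Counter(values)
    n = sum(counts.values())
    return {cls: (f, 100 * f / n) for cls, f in counts.items()}

# Assumed raw data: 25 blood types with counts consistent with the table percentages.
raw = ["A"] * 5 + ["B"] * 7 + ["O"] * 9 + ["AB"] * 4
table = frequency_table(raw)
for cls, (freq, pct) in table.items():
    print(f"{cls:<3} {freq:>3} {pct:>6.1f}")
```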

[Table: Number of Miles Traveled (columns: Class, Frequency); values not recoverable.]

Grouped Frequency Distributions
Grouped frequency distributions can be used when the range of values in the data set is very large. The data must be grouped into classes that are more than one unit in width. Example: the life of boat batteries in hours.

[Table: Lifetimes of Boat Batteries (columns: Class Limits, Class Boundaries, Frequency, Cumulative Frequency); values not recoverable.]

Terms Associated with a Grouped Frequency Distribution
Class limits represent the smallest and largest data values that can be included in a class. In the lifetimes of boat batteries example, the values 24 and 37 of the first class are the class limits. The lower class limit is 24 and the upper class limit is 37.

The class boundaries are used to separate the classes so that there are no gaps in the frequency distribution.

The class width for a class in a frequency distribution is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.

Guidelines for Constructing a Frequency Distribution
There should be between 5 and 20 classes.
The class width should be an odd number.
The classes must be mutually exclusive.
The classes must be continuous.
The classes must be exhaustive.
The classes must be equal in width.

Procedure for Constructing a Grouped Frequency Distribution
Find the highest and lowest values.
Find the range.
Select the number of classes desired.
Find the width by dividing the range by the number of classes and rounding up.
Select a starting point (usually the lowest value); add the width to get the lower limits.
Find the upper class limits.
Find the boundaries.
Tally the data, find the frequencies, and find the cumulative frequencies.

Grouped Frequency Distribution - Example
In a survey of 20 patients who smoked, the following data were obtained. Each value represents the number of cigarettes the patient smoked per day. Construct a frequency distribution using six classes.

[Data table not reproduced in this transcription.]

Grouped Frequency Distribution - Example (continued)
Step 1: Find the highest and lowest values: H = 22 and L = 5.
Step 2: Find the range: R = H - L = 22 - 5 = 17.
Step 3: Select the number of classes desired. In this case it is equal to 6.
Step 4: Find the class width by dividing the range by the number of classes. Width = 17/6 = 2.83. This value is rounded up to 3.
Step 5: Select a starting point for the lowest class limit. For convenience, this value is chosen to be 5, the smallest data value. The lower class limits will be 5, 8, 11, 14, 17, and 20.
Step 6: The upper class limits will be 7, 10, 13, 16, 19, and 22. For example, the upper limit for the first class is computed as 8 - 1 = 7.
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.
Step 8: Tally the data, write the numerical values for the tallies in the frequency column, and find the cumulative frequencies.
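Steps 2 through 7 are mechanical once the highest value, lowest value, and number of classes are known, so they can be checked with a short sketch. The function name `class_limits` is hypothetical, and the sketch assumes whole-number data (so the upper limit is the next lower limit minus 1 and the boundaries sit 0.5 beyond the limits). Tallying (Step 8) is omitted because the raw data are not available here.

```python
import math

def class_limits(low, high, num_classes):
    """Return (width, lower limits, upper limits, boundaries) for integer data."""
    data_range = high - low                        # Step 2: R = H - L
    width = math.ceil(data_range / num_classes)    # Step 4: divide and round up
    lowers = [low + i * width for i in range(num_classes)]          # Step 5
    uppers = [lo + width - 1 for lo in lowers]                       # Step 6
    boundaries = [(lo - 0.5, up + 0.5) for lo, up in zip(lowers, uppers)]  # Step 7
    return width, lowers, uppers, boundaries

# Values from the cigarette example: H = 22, L = 5, six classes.
width, lowers, uppers, boundaries = class_limits(5, 22, 6)
print(width)       # 3
print(lowers)      # [5, 8, 11, 14, 17, 20]
print(uppers)      # [7, 10, 13, 16, 19, 22]
print(boundaries[0], boundaries[-1])
```

Note that `math.ceil` leaves the width unchanged when the range divides evenly by the number of classes; whether to round up further in that case is a judgment call the sketch does not make.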

[Table: grouped frequency distribution for the number of cigarettes smoked per day (columns: Class Limits, Class Boundaries, Frequency, Cumulative Frequency); values not recoverable. Note: the dash "-" in the class columns represents "to".]

2-3 Histograms
The histogram is a graph that displays the data by using vertical bars of various heights to represent the frequencies.

[Figure: example of a histogram. Horizontal axis: Number of Cigarettes Smoked per Day. Vertical axis: Frequency.]

2-4 Other Types of Graphs
Pareto charts: a Pareto chart is used to represent a frequency distribution for a categorical variable.

When constructing a Pareto chart:
Make the bars the same width.
Arrange the data from largest to smallest according to frequencies.
Make the units that are used for the frequency equal in size.

[Figure: example of a Pareto chart for the number of crimes investigated by law enforcement officers in U.S. National Parks during 1995. Categories: Assault, Rape, Robbery, Homicide. Axes: Count, Percent.]

Pie graph: a pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.

[Figure: pie chart of the number of crimes investigated by law enforcement officers in U.S. National Parks during 1995. Assaults (164, 68.3%), Robbery (29, 12.1%), Rape (34, 14.2%), Homicide (13, 5.4%).]

Outline: Data Description
3-1 Introduction
3-2 Measures of Central Tendency
3-3 Measures of Variation
3-4 Measures of Position
3-5 Exploratory Data Analysis

3-1 Introduction
This chapter shows the statistical methods that can be used to summarize data.
Measures of central tendency: the mean, median, mode, and midrange.
Measures of variation: the range, variance, and standard deviation.
Position of data: quartiles.
Techniques of exploratory data analysis: box plots and five-number summaries.

3-2 Measures of Central Tendency
A statistic is a characteristic or measure obtained by using the data values from a sample. A parameter is a characteristic or measure obtained by using the data values from a specific population.

The Mean (arithmetic average)
The mean is defined to be the sum of the data values divided by the total number of values. We will compute two means: one for the sample and one for a finite population of values. The mean, in most cases, is not an actual data value.

The Sample Mean
The symbol \(\bar{X}\) represents the sample mean and is read as "X-bar". The Greek symbol \(\Sigma\) is read as "sigma" and means "to sum".

\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum X}{n} \]

Example: the ages in weeks of a random sample of six kittens at an animal shelter are 3, 8, 5, 12, 14, and 12. Find the average age of this sample. The sample mean is

\[ \bar{X} = \frac{\sum X}{n} = \frac{3 + 8 + 5 + 12 + 14 + 12}{6} = \frac{54}{6} = 9 \text{ weeks.} \]

The Population Mean
The Greek symbol \(\mu\) represents the population mean and is read as "mu". N is the size of the finite population.

\[ \mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N} \]

Example: a small company consists of the owner, the manager, the salesperson, and two technicians. The salaries are listed as $50,000, $20,000, $12,000, $9,000, and $9,000 respectively. (Assume this is the population.) Then the population mean is

\[ \mu = \frac{\sum X}{N} = \frac{50{,}000 + 20{,}000 + 12{,}000 + 9{,}000 + 9{,}000}{5} = \$20{,}000. \]
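Both means use the same arithmetic (sum divided by count); only the interpretation differs. As a quick check of the two examples, the sketch below takes the kitten ages as 3, 8, 5, 12, 14, 12 and the salaries as $50,000, $20,000, $12,000, $9,000, $9,000, values that agree with the stated results of 9 weeks and $20,000. The helper name `mean` is chosen here for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

kitten_ages = [3, 8, 5, 12, 14, 12]                  # sample, in weeks
salaries = [50_000, 20_000, 12_000, 9_000, 9_000]    # treated as the whole population

sample_mean = mean(kitten_ages)      # X-bar
population_mean = mean(salaries)     # mu
print(sample_mean, population_mean)
```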

The Sample Mean for an Ungrouped Frequency Distribution
The mean for an ungrouped frequency distribution is given by

\[ \bar{X} = \frac{\sum (f \cdot X)}{n}. \]

Here f is the frequency for the corresponding value of X, and \(n = \sum f\).

Example: the scores for 25 students on a 4-point quiz are given in the table. Find the mean score.

[Table: Score X, Frequency f, and f · X; the values and the resulting mean are not recoverable from this transcription.]

The Sample Mean for a Grouped Frequency Distribution
The mean for a grouped frequency distribution is given by

\[ \bar{X} = \frac{\sum (f \cdot X_m)}{n}. \]

Here \(X_m\) is the corresponding class midpoint.

Example: given the table below, find the mean.

[Table: Class, Frequency f, class midpoint X_m, and f · X_m; values not recoverable.]
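The two formulas above are the same weighted average, applied either to the data values themselves or to the class midpoints. Because the original tables are not available, the numbers in this sketch are purely hypothetical and serve only to show the arithmetic; `frequency_mean` is likewise an illustrative name.

```python
def frequency_mean(values, frequencies):
    """Mean from a frequency table: sum(f * X) / sum(f)."""
    n = sum(frequencies)
    return sum(f * x for x, f in zip(values, frequencies)) / n

# Hypothetical ungrouped table: quiz scores 0..4 with made-up frequencies (n = 20).
scores = [0, 1, 2, 3, 4]
score_freqs = [1, 3, 6, 7, 3]
ungrouped = frequency_mean(scores, score_freqs)        # (0 + 3 + 12 + 21 + 12) / 20

# Hypothetical grouped table: class midpoints X_m stand in for the data values.
midpoints = [10, 20, 30]
class_freqs = [2, 5, 3]
grouped = frequency_mean(midpoints, class_freqs)       # (20 + 100 + 90) / 10

print(ungrouped, grouped)
```

For grouped data the result is an approximation, since every value in a class is treated as if it sat at the midpoint.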

For the grouped example, \(\sum (f \cdot X_m) = 456\) and \(n = 17\), so

\[ \bar{X} = \frac{\sum (f \cdot X_m)}{n} = \frac{456}{17} \approx 26.82. \]

The Median
When a data set is ordered, it is called a data array. The median is defined to be the midpoint of the data array. The symbol used to denote the median is MD.

Example: the weights (in pounds) of seven army recruits are 180, 201, 220, 191, 219, 209, and 186. Find the median. Arrange the data in order and select the middle point.

Data array: 180, 186, 191, 201, 209, 219, 220. The median is MD = 201.

In the previous example, there was an odd number of values in the data set. In this case it is easy to select the middle number in the data array.

When there is an even number of values in the data set, the median is obtained by taking the average of the two middle numbers.

Example: the ages of 10 college students are 18, 24, 20, 35, 19, 23, 26, 23, 19, 20. Find the median. Arrange the data in order and compute the middle point.

Data array: 18, 19, 19, 20, 20, 23, 23, 24, 26, 35. The median is MD = (20 + 23)/2 = 21.5.

The Mode
The mode is defined to be the value that occurs most often in a data set. A data set can have more than one mode. A data set is said to have no mode if all values occur with equal frequency.

Example: the following data represent the duration (in days) of U.S. space shuttle voyages. Find the mode.
Data set: 8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11.
Ordered set: 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 11, 11, 14, 14, 14.
Mode = 8.

Example: six strains of bacteria were tested to see how long they could remain alive outside their normal environment. The time, in minutes, is given below. Find the mode.
Data set: 2, 3, 5, 7, 8, 10.
There is no mode, since each data value occurs equally with a frequency of one.

Example: eleven different automobiles were tested at a speed of 15 mph for stopping distances. The distance, in feet, is given below. Find the mode.
Data set: 15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26.
There are two modes (bimodal). The values are 18 and 24. Why? Each occurs three times, more often than any other value.
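The median and mode rules can be checked against the examples with a small sketch. The function names are illustrative, returning an empty list for "no mode" is a convention chosen here, and the data sets assume the damaged values read 35 (oldest student), 5 (bacteria), and 15 (first stopping distance).

```python
from collections import Counter

def median(values):
    """Middle value of the data array; average of the two middle values if n is even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

def modes(values):
    """All most-frequent values; empty list if every value occurs equally often."""
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

recruits = [180, 201, 220, 191, 219, 209, 186]
ages = [18, 24, 20, 35, 19, 23, 26, 23, 19, 20]
shuttle = [8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11]
bacteria = [2, 3, 5, 7, 8, 10]
stopping = [15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]

print(median(recruits), median(ages))
print(modes(shuttle), modes(bacteria), modes(stopping))
```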

The Midrange
The midrange is found by adding the lowest and highest values in the data set and dividing by 2. The midrange is a rough estimate of the middle value of the data. The symbol used to represent the midrange is MR.

Example: last winter, the city of Brownsville, Minnesota, reported the following number of water-line breaks per month: 2, 3, 6, 8, 4, 1. Find the midrange.
MR = (1 + 8)/2 = 4.5.

Note: extreme values influence the midrange, so it may not be a typical description of the middle.
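A minimal sketch of the midrange calculation, using the water-line data from the example. The second data set is hypothetical and is included only to illustrate the note about extreme values; `midrange` is an illustrative name.

```python
def midrange(values):
    """Average of the lowest and highest values in the data set."""
    return (min(values) + max(values)) / 2

breaks = [2, 3, 6, 8, 4, 1]
mr = midrange(breaks)                 # (1 + 8) / 2

# Hypothetical: one extreme month pulls the midrange far from the typical values.
with_outlier = breaks + [40]
mr_outlier = midrange(with_outlier)   # (1 + 40) / 2

print(mr, mr_outlier)
```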