Chapter 3. Displaying and Summarizing Quantitative Data. 1 of 66 05/21/ :00 AM

Similar documents
A is used to answer questions about the quantity of what is being measured. A quantitative variable is comprised of numeric values.

STA Module 2A Organizing Data and Comparing Distributions (Part I)

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 2 Organizing Data

Chapter 1 Data and Descriptive Statistics

Math 1 Variable Manipulation Part 8 Working with Data

Math 1 Variable Manipulation Part 8 Working with Data

Statistics Chapter Measures of Position LAB

Chapter 1. * Data = Organized collection of info. (numerical/symbolic) together w/ context.

STAT/MATH Chapter3. Statistical Methods in Practice. Averages and Variation 1/27/2017. Measures of Central Tendency: Mode, Median, and Mean

1. Contingency Table (Cross Tabulation Table)

STAT 2300: Unit 1 Learning Objectives Spring 2019

36.2. Exploring Data. Introduction. Prerequisites. Learning Outcomes

CEE3710: Uncertainty Analysis in Engineering

DDBA8437: Central Tendency and Variability Video Podcast Transcript

Chapter 8 Script. Welcome to Chapter 8, Are Your Curves Normal? Probability and Why It Counts.

Super-marketing. A Data Investigation. A note to teachers:

BAR CHARTS. Display frequency distributions for nominal or ordinal data. Ej. Injury deaths of 100 children, ages 5-9, USA,

Elementary Statistics Lecture 2 Exploring Data with Graphical and Numerical Summaries

Biostat Exam 10/7/03 Coverage: StatPrimer 1 4

Introduction to Control Charts

AP Statistics Test #1 (Chapter 1)

Module - 01 Lecture - 03 Descriptive Statistics: Graphical Approaches

Introduction to Statistics. Measures of Central Tendency and Dispersion

Slide 1. Slide 2. Slide 3. Interquartile Range (IQR)

Biostatistics 208 Data Exploration

Data Visualization. Prof.Sushila Aghav-Palwe

Statistics Chapter 3 Triola (2014)

Social Studies 201 Fall Answers to Computer Problem Set 1 1. PRIORITY Priority for Federal Surplus

Test Name: Test 1 Review

11-1 Descriptive Statistics

Gush vs. Bore: A Look at the Statistics of Sampling

AP Statistics Part 1 Review Test 2

Chapter 5. Statistical Reasoning

Module 1: Fundamentals of Data Analysis

Quantitative Methods. Presenting Data in Tables and Charts. Basic Business Statistics, 10e 2006 Prentice-Hall, Inc. Chap 2-1

MAS187/AEF258. University of Newcastle upon Tyne

Mathematics in Contemporary Society - Chapter 5 (Spring 2018)

Sample Exam 1 Math 263 (sect 9) Prof. Kennedy

The Dummy s Guide to Data Analysis Using SPSS

Chapter 2 Part 1B. Measures of Location. September 4, 2008

Computing Descriptive Statistics Argosy University

Descriptive Statistics Tutorial

Measurement and sampling

Capability on Aggregate Processes

Dr. Allen Back. Aug. 26, 2016

Online Student Guide Types of Control Charts

Fundamental Elements of Statistics

GETTING READY FOR DATA COLLECTION

Statistics is the area of Math that is all about 1: collecting, 2: analysing and 3: reporting about data.

Introduction to Statistics. Measures of Central Tendency

CHAPTER 7: Central Limit Theorem: CLT for Averages (Means)

Students will understand the definition of mean, median, mode and standard deviation and be able to calculate these functions with given set of

Descriptive Statistics

Business Statistics: A Decision-Making Approach 7 th Edition

Central Tendency. Ch 3. Essentials of Statistics for the Behavior Science Ch.3

CHAPTER 4. Labeling Methods for Identifying Outliers

Pareto Charts [04-25] Finding and Displaying Critical Categories

= = Intro to Statistics for the Social Sciences. Name: Lab Session: Spring, 2015, Dr. Suzanne Delaney

Lecture 10. Outline. 1-1 Introduction. 1-1 Introduction. 1-1 Introduction. Introduction to Statistics

Complete Week 11 Package

VIII. STATISTICS. Part I

Unit 1 Analyzing One-Variable Data

Math227 Sample Final 3

Bar graph or Histogram? (Both allow you to compare groups.)

CHAPTER 4, SECTION 1

= = Name: Lab Session: CID Number: The database can be found on our class website: Donald s used car data

STA 2023 Test 1 Review You may receive help at the Math Center.

Chapter 2 Ch2.1 Organizing Qualitative Data

JMP TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING

Displaying Bivariate Numerical Data

AP Statistics Scope & Sequence

DSC 201: Data Analysis & Visualization

Making Sense of Data

CHAPTER 21A. What is a Confidence Interval?

ISSN (Online)

Chapter 9 Assignment (due Wednesday, August 9)

Why Learn Statistics?

Ordered Array (nib) Frequency Distribution. Chapter 2 Descriptive Statistics: Tabular and Graphical Methods

Advanced Higher Statistics

Statistics 201 Summary of Tools and Techniques

Core vs NYS Standards

To provide a framework and tools for planning, doing, checking and acting upon audits

Correlation and Simple. Linear Regression. Scenario. Defining Correlation

Statistical Process Control

Statistical Observations on Mass Appraisal. by Josh Myers Josh Myers Valuation Solutions, LLC.

Section Sampling Techniques. What You Will Learn. Statistics. Statistics. Statisticians

Learning Area: Mathematics Year Course and Assessment Outline. Year 11 Essentials Mathematics COURSE OUTLINE

of a student s grades for the period is a better method than using the mean. Suppose the table at the right shows your test grades.

Operations and Supply Chain Management Prof. G. Srinivisan Department of Management Studies Indian Institute of Technology, Madras

Summary Statistics Using Frequency

Chapter 4: Foundations for inference. OpenIntro Statistics, 2nd Edition

Statistics Definitions ID1050 Quantitative & Qualitative Reasoning

Review Materials for Test 1 (4/26/04) (answers will be posted 4/20/04)

Chapter 12 Module 3. AMIS 310 Foundations of Accounting

Assignment 1 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran

PRINCIPLES AND APPLICATIONS OF SPECIAL EDUCATION ASSESSMENT

Rounding a method for estimating a number by increasing or retaining a specific place value digit according to specific rules and changing all

SPSS 14: quick guide

Overview. Presenter: Bill Cheney. Audience: Clinical Laboratory Professionals. Field Guide To Statistics for Blood Bankers

Section 9: Presenting and describing quantitative data

Transcription:

Chapter 3 Displaying and Summarizing Quantitative Data D. Raffle 5/19/2015 1 of 66 05/21/2015 11:00 AM

Intro In this chapter, we will discuss summarizing the distribution of numeric or quantitative variables. Recall that numeric data is made up of measurements or numbers that describe our individuals with units. Consider a data set that describes the rate of arrests per 100,000 residents for various types of crime in the 50 US states in 1973. ## Murder Assault UrbanPop Rape ## Alabama 13.2 236 58 21.2 ## Alaska 10.0 263 48 44.5 ## Arizona 8.1 294 80 31.0 ## Arkansas 8.8 190 50 19.5 ## California 9.0 276 91 40.6 ## Colorado 7.9 204 78 38.7 2/66 2 of 66 05/21/2015 11:00 AM

Summarizing a Single Numeric Variable When we look for numerical summaries of a quantitave variable, we are primarily interested in three things: Shape what does the overall pattern of values look like? Measures of Center what is a typical value? Measures of Spread how far, on average, do individuals stray from the center? These properties tell us about how the variable is distributed among our individuals. The distribution helps us answer questions like: What was the national average rate of arrests for murder in 1973? What percent of states arrested more people for assault per capita than Arizona? What was the arrest rate for rape in the middle 90% of states? 3/66 3 of 66 05/21/2015 11:00 AM

Shape To see the shape of the distribution, we typically rely on graphs. There are three main types of graphs for examining a single numeric variable: Histograms Steam and Leaf Boxplots We will discuss the first two here and the third in a little while. 4/66 4 of 66 05/21/2015 11:00 AM

Histograms 5/66 5 of 66 05/21/2015 11:00 AM

Histograms Histograms are similar to bar charts, but the two are not interchangeable. Histograms are made by: 1. Creating an axis by dividing the variable into equal-sized "bins" 2. Drawing a bar in each bin whose height is given by the frequency of observations in that bin. What do we look for? Any peaks in the distribution. The peaks are called modes. Symmetry is the distribution symmetric around the center, or is it skewed to one side? Outliers are there any unusual observations that are far away from the rest? 6/66 6 of 66 05/21/2015 11:00 AM

Modes When we describe a distribution by its modes, we use the terms: Unimodal if there is one peak Bimodal if there are two Multimodal or polymodal if there are more than two Uniform if the distribution is flat What do modes tell us? Usually, having multiple modes mean there are distinct groups in the data For example, if we are looking at a histogram of heights, two modes might suggest we have a mix of men and women or adults and children in our sample 7/66 7 of 66 05/21/2015 11:00 AM

Unimodal Histogram 8/66 8 of 66 05/21/2015 11:00 AM

Bimodal Histogram 9/66 9 of 66 05/21/2015 11:00 AM

Multimodal Histogram 10/66 10 of 66 05/21/2015 11:00 AM

Uniform Histogram 11/66 11 of 66 05/21/2015 11:00 AM

Symmetry When we describe the symmetry of a distribution, we use the terms: Symmetric when the distribution is symmetric about the center Right-Skewed if there are some values higher than the majority Left-Skewed if there are some values lower than the majority We call the areas at either end the tails of the distribution. So, in terms of the tails: Symmetric: the right and left tails have equal length Right-Skewed: the right tail is longer Left-Skewed: the left tail is longer Remember that when we talk about skew we're talking about the unusual points, not the majority. 12/66 12 of 66 05/21/2015 11:00 AM

Symmetric Histogram 13/66 13 of 66 05/21/2015 11:00 AM

Right-Skewed Histogram 14/66 14 of 66 05/21/2015 11:00 AM

Left-Skewed Histogram 15/66 15 of 66 05/21/2015 11:00 AM

Outliers We'll talk about outliers in more detail later, but generally: Outliers are any points that are substantially far away from the rest. As a general rule-of-thumb, Outliers cause skew on whatever side they appear on. 16/66 16 of 66 05/21/2015 11:00 AM

Histogram with Outliers 17/66 17 of 66 05/21/2015 11:00 AM

Cautions Histograms are generally very good, but they have some things to be careful of: With smaller data sets they can be unstable (the shape can change drastically as you adjust the size of the bins). Interactive Example. It can be hard to find the true shape with small data sets Deciding on the bin width is a balancing act You lose information: we only know how many observations fall into a certain range, but we don't know anything about what they look like within the bin. 18/66 18 of 66 05/21/2015 11:00 AM

Your Turn Describe the distribution of the variables from the US Arrests data set. The variables are: Murder: number of people arrested for murder (per 100,000 residents) Assault: number of people arrested for assault (per 100,000 residents) Rape: num of people arrested for rape (per 100,000 residents) UrbanPop: percent of population living in large cities 19/66 19 of 66 05/21/2015 11:00 AM

Murder 20/66 20 of 66 05/21/2015 11:00 AM

Assault 21/66 21 of 66 05/21/2015 11:00 AM

Rape 22/66 22 of 66 05/21/2015 11:00 AM

UrbanPop 23/66 23 of 66 05/21/2015 11:00 AM

Stem-and-Leaf Plots ## ## The decimal point is at the ## ## 0 0000011111122222222233333333333444444444 ## 0 55555666666666777788999999 ## 1 000000001111111122223334 ## 1 555689 ## 2 134 ## 2 9 24/66 24 of 66 05/21/2015 11:00 AM

Stem-and-Leaf Plots Stem-and-Leaf plots are similar to histograms, visually. We construct stem-and-leaf plots by: Breaking up each value at some decimal place (at the decimal point, at the ones place, at the tens place, etc) Grouping numbers that share the value to the left of that place and writing it as a "stem" Writing each value to the right of the chosen decimal place as a "leaf" The leaf should be a single digit, so round anything past the first digit to the right of the decimal place you've chosen. Choosing the decimal place for the stem is similar to choosing the number of bins in a histogram: it's a balancing act. 25/66 25 of 66 05/21/2015 11:00 AM

Stem-and-Leaf Plot: Tens Example Consider the data set:. Almost all the values are in the double digits, so breaking them up at the tens place seems reasonable. Our stem-and-leaf plot would be: 5, 10, 11, 12, 12, 22 ## ## The decimal point is 1 digit(s) to the right of the ## ## 0 5 ## 1 0122 ## 2 2 26/66 26 of 66 05/21/2015 11:00 AM

Stem-and-Leaf Plot: Ones Example 1.2, 1.3, 0.5, 0.6, 3.7, 10.2 Consider the data set:. The most interesting differences in the numbers are probably at the ones place. ## ## The decimal point is at the ## ## 0 56 ## 1 23 ## 2 ## 3 7 ## 4 ## 5 ## 6 ## 7 ## 8 ## 9 ## 10 2 27/66 27 of 66 05/21/2015 11:00 AM

Stem-and-Leaf Plot: Hundreds place What about the data set: 320, 302, 102, 150, 175, 504, 40? ## ## The decimal point is 2 digit(s) to the right of the ## ## 0 4 ## 1 058 ## 2 ## 3 02 ## 4 ## 5 0 28/66 28 of 66 05/21/2015 11:00 AM

Stem-and-Leaf Notes A couple important points: If your data doesn't have values for the stem you've chosen, include the stem but leave the leaf blank. We're looking for the shape of the distribution, so gaps in the data are just as important as the values themselves If you have a repeated value, make sure each one gets a leaf Stem-and-Leaf Plots allow us to completely reconstruct our data set from the plot, unlike histograms They are increasingly hard to read for large data sets Generally, we might prefer stem-and-leaf plots to histograms for small data sets, but histograms for larger sets. 29/66 29 of 66 05/21/2015 11:00 AM

Measures of Center and Spread There are many techniques to measure the center and spread of a variable's distribution. Usually measures of center and spread go together based on how they work. In this class, we will focus on two pairs: Robust Measures: these measures are very reliable, no matter the shape of the distribution Center: The Median Spread: The Interquartile Range (IQR) For Symmetric Distributions: these measures have more useful properties if the distribution is symmetric Center: The Mean Spread: The Standard Deviation 30/66 30 of 66 05/21/2015 11:00 AM

The Median The Median: Is the point in the distribution of our variable that cuts the data in half. Divides the area of a histogram into two equal regions Finding the Median: Start by sorting the data n+1 If n is odd: count over from either end 2 n n If n is even: count over and + 1 and average these values 2 2 Note: If a value is repeated, include it when finding the median. 31/66 31 of 66 05/21/2015 11:00 AM

Finding the Median: Consider the data set: n is Odd 1, 7, 5, 30, 25, 18, 12 Sort: 1, 5, 7, 12, 18, 25, 30 n+1 7+1 8 Count over = = = 4 2 2 2 1, 5, 7, 12, 18, 25, 30 12 The median is. 32/66 32 of 66 05/21/2015 11:00 AM

Finding the Median: Consider the data set: n is Even 1, 7, 5, 18, 30, 25, 18, 12 Sort: 1, 5, 7, 12, 18, 18, 25, 30 n 8 Count over = = 4 2 2 1, 5, 7, 12, 18, 18, 25, 30 n 8 Count Over + 1 = + 1 = 4 + 1 = 5 2 2 1, 5, 7, 12, 18, 18, 25, 30 12+18 30 Average These Values: = = 15 2 2 The median is 15 33/66 33 of 66 05/21/2015 11:00 AM

Visualization of the Median 34/66 34 of 66 05/21/2015 11:00 AM

Measures of Spread: The Range The most natural thing to look at to measure the spread of a sample might be to look at the overall range of the data. The range is the overall spread of the data range = max min Consider our data from earlier: range = 30 1 = 29 What about the second set? range = 30 1 = 29 1, 7, 5, 30, 25, 18, 12 1, 7, 5, 18, 30, 25, 18, 12 Note that adding a value in the middle of the distribution doesn't change the range. 35/66 35 of 66 05/21/2015 11:00 AM

The Range (cont.) But what happens if we have an outlier? range = 60 1 = 59 What does this mean? 1, 7, 5, 18, 60, 25, 18, 12 Most of the data is the same, but the range has doubled. Does this seem like a good property? In most circumstances, no. We need a measure of spread that is less sensitive to outliers. 36/66 36 of 66 05/21/2015 11:00 AM

Measures of Spread: The Interquartile Range Instead of looking at the extreme values in our distribution, we can look points closer to the center After sorting our data, we divide it into fourths, based around the quartiles. These four groups are defined by five points, which make up the five number summary The Min: 0% of observations fall below this point The 1st Quartile ( Q1): 25% of observations fall below this point The 2nd Quartile: aka The Median, 50% of observations fall below this point The 3rd Quartile ( Q3): 75% of observations fall below this point The Max: 100% of observations fall below this point 37/66 37 of 66 05/21/2015 11:00 AM

Finding the Quartiles: We already know how to find three of them: the min, median, and max. So how can we find the 1st and 3rd quartiles? The Median divides the data into two equal halves The 1st Quartile is the median of the lower half The 3rd Quartile is the median of the upper half Note: If n is odd, include the median in both halves 38/66 38 of 66 05/21/2015 11:00 AM

Finding the Quartiles: So our Five Number Summary is: n is Odd 1, 7, 5, 30, 25, 18, 12 Sorted: 1, 5, 7, 12, 18, 25, 30 Recall that the median was 12 Q1 is the median of 1, 5, 7, 12 5+7 12 Q1 = = = 6 2 2 Q3 is the median of 12, 18, 25, 30 18+25 2 43 2 Q3 = = = 21.5 1, 6, 12, 21.5, 30 39/66 39 of 66 05/21/2015 11:00 AM

Finding the Quartiles: So our Five Number Summary is: n is Even 1, 7, 5, 18, 30, 25, 18, 12 Sorted: 1, 5, 7, 12, 18, 18, 25, 30 Recall that the median was 15 Q1 is the median of 1, 5, 7, 12 5+7 12 Q1 = = = 6 2 2 Q3 is the median of 18, 18, 25, 30 18+25 2 43 2 Q3 = = = 21.5 1, 6, 15, 21.5, 30 40/66 40 of 66 05/21/2015 11:00 AM

The IQR Now that we know what quartiles are, we can discuss the interquartile range (IQR). The IQR is defined as: IQR = Q3 Q1 Q1 Q3 Since 25% fall below and 75% below, the IQR defines the range of the middle 50% of the data Using either our previous examples: IQR = 21.5 6 = 15.6 41/66 41 of 66 05/21/2015 11:00 AM

The IQR and Outliers 1, 7, 5, 18, 60, 25, 18, 12 Using our data with an outlier from before:. Note that this is exactly the same as our most recent example, but the 30 was changed to a. The IQR in the previous example was 60 15.6 Sorted: 1, 5, 7, 12, 18, 18, 25, 60 59 Recall that the range was together. Q1 = 6, Q3 = 21.5 IQR = Q3 Q1 = 21.5 6 = 15.6 Despite the outlier, the IQR hasn't changed. We say that the IQR is robust to outliers, even though most of our values are fairly close 42/66 42 of 66 05/21/2015 11:00 AM

Visualizing the IQR 43/66 43 of 66 05/21/2015 11:00 AM

Defining Outliers Giving a firm definition to outliers is a tricky thing. In different contexts, we may expect to see more extreme values than in others. As a general rule: An outlier is an observation that falls more than 1.5 IQR's away from the median. In our most recent example ( and the IQR was 15 15.6 Any observation less than be an outlier Any observation greater than would be an outlier 1, 5, 7, 12, 18, 18, 25, 60 ), the median was 15 (1.5 15.6) = 15 23.4 = 8.4 15 + (1.5 15.6) = 15 + 23.4 = 38.4 would 44/66 44 of 66 05/21/2015 11:00 AM

Visualizing the 5 Number Summary As an alternative to histograms and stem-and-leaf plots, which show the shape of a distribution in great detail, we can get a quick idea of the shape by plotting the five number summary. Recall that the five number summary is made up of: The Min The First Quartile (Q1) The Median The Third Quartile (Q3) The Max The graph of choice for the five number summary is the boxplot 45/66 45 of 66 05/21/2015 11:00 AM

Boxplots 46/66 46 of 66 05/21/2015 11:00 AM

Constructing Boxplots Note that boxplots can be drawn horizontally (as pictured in the previous slide) or vertically. Boxplots, also called box-and-whisker plots, have two major elements: The Box The Whiskers Use one axis for the variable. If it is the horizontal axis, we call it a horizontal boxplot. If the variable's values are on the vertical axis, we call it a vertical boxplot. To draw the box: The sides of the box that are perpendicular to the variable axis are drawn at Q1 and Q3, respectively The median is drawn in the box as a line perpendicular to the variable axis 47/66 47 of 66 05/21/2015 11:00 AM

Drawing Boxplots (cont.) The whiskers are a bit more complicated. In either case, they extend outward from the box on both sides along the variable axis. Two possibilities exist for each side. If there are no outliers on that side: They extend from the box to the max or min If there are outliers on that side: They extend out 1.5 IQRs from the mean Outliers are drawn as points past the end of the whisker 48/66 48 of 66 05/21/2015 11:00 AM

Boxplot with Outliers 49/66 49 of 66 05/21/2015 11:00 AM

Symmetric Boxplot (Unimodal) 50/66 50 of 66 05/21/2015 11:00 AM

Symmetric Boxplot (Bimodal) 51/66 51 of 66 05/21/2015 11:00 AM

Interpreting a Boxplot The range of the data is the distance between the tips of the whiskers (or outliers) on either side The box represents the middle 50% of the data If the median is exactly in the center of the box, the data is symmetric Outliers can easily be seen The outliers, lengths of the whiskers, and position of the median tell us about the skew Downsides: We lose some detail about the shape of the distribution (e.g., we can't see modes) Slight skew is harder to detect 52/66 52 of 66 05/21/2015 11:00 AM

Measure of Center for Symmetric Distributions The Median and IQR work for all distributions, symmetric or skewed. They are also very robust to outliers. In the case of a symmetric distribution, we have a better alternative: the sample mean. The sample mean is what most people think of when they find an average As you've been doing for years, add up all the values and divide by how many there are In more formal notation: Total(Sum) n x = = x n If you picture the histogram as a physical object, the sample mean is the "balancing point" 53/66 53 of 66 05/21/2015 11:00 AM

Calculating a Mean Find the Mean of our data from earlier: Finding the mean: x = x = 116 x = 8 x = 14.5 x n 1+5+7+12+18+18+25+30 8 Recall that the median was 15 1, 5, 7, 12, 18, 18, 25, 30, so our two measures are fairly close. 54/66 54 of 66 05/21/2015 11:00 AM

Means and Outliers Now consider our data with an outlier: Finding the mean: x = x = 146 x = 8 x = 18.5 x n 1+5+7+12+18+18+25+60 8 Recall that the median of this data was still. 1, 5, 7, 12, 18, 18, 25, 60 15 The mean is always pulled towards outliers 55/66 55 of 66 05/21/2015 11:00 AM

The Mean vs. the Median Image you have ten people in a room and want to know about their income. Nine of the people make $30 thousand per year, but one person makes $10 million. In thousands of dollars, our data set looks like: 30, 30, 30, 30, 30, 30, 30, 30, 30, 10000 What do the mean and median look like? Median: $30,000 Mean: $1,027,000 If I asked you what the average person makes, which measure of center is better? 56/66 56 of 66 05/21/2015 11:00 AM

Visualizing the Mean and Median Which is the mean and which is the median? 57/66 57 of 66 05/21/2015 11:00 AM

Measure of Spread for Symmetric Data Just like the IQR is based on the concept of the median, we have a measure of spread based on the same ideas as calculated the mean. We call the measure of spread based on the mean the standard deviation. The standard deviation is related to a similar measure, the variance The variance and standard deviation tell us how far, on average, observations are from the mean They are essentially measures of uncertainty If the mean is a typical value, the variance tells us how far away from this we can generally expect to see observations 58/66 58 of 66 05/21/2015 11:00 AM

The Variance The formula for the variance is given by: What does this formula mean? s 2 (x x ) 2 = n 1 After finding the mean, subtract it from every value Square each difference to make it positive Add up all of the squared differences Divide by the number of observations minus one So the variance gives is the average squared distance from the mean. 59/66 59 of 66 05/21/2015 11:00 AM

The Standard Deviation Note that the variance ends up being in the squared units of the variable. If measured in feet, variance would be in square feet. To correct the problem of having squared units, we take the square root of the variance. This measurement is called the standard deviation. s = s 2 (x x ) 2 = n 1 We interpret the standard deviation as the average distance of the points from the mean. Variances and standard deviations are always positive (s 0) The closer together the values are, the smaller the standard deviation (and variance) will be x was 60/66 60 of 66 05/21/2015 11:00 AM

The Standard Deviation and Outliers 1, 5, 7, 12, 18, 18, 25, 30 s = 10.07 1, 5, 7, 12, 18, 18, 25, 60 s = 18.62 As a rule: Making any value further away from the mean will increase s Making any value closer to the mean will decrease s 61/66 61 of 66 05/21/2015 11:00 AM

Visualizing the Standard Deviation 62/66 62 of 66 05/21/2015 11:00 AM

The Standard Deviation vs. the IQR Recall the data set: 15.6 The IQR is far apart 1, 5, 7, 12, 18, 18, 25, 30. This means that the middle 50% of the observations were this The Standard Deviation: s = 10.07 from the mean by 10.07 What about the version with an outlier? The IQR is still 15.6 s = 18.62, which means that the observations differ units on average. 1, 5, 7, 12, 18, 18, 25, 60 Earlier we said that the mean is worse than the median for skewed data. Since the standard deviation is calculated using the mean, if we throw out the mean we also need to throw out the standard deviation. 63/66 63 of 66 05/21/2015 11:00 AM

Mean and Standard Deviation vs Median and IQR So if both pairs of measures are good for symmetric data, and the median and IQR beat out the mean and standard deviation for skewed data, aren't the median and IQR always the better choices? Not necessarily. The mean and standard deviation have properties that make them more useful The mean is faster to calculate (sorting numbers takes a relatively long time, even for computers, compared to adding them up) The median and IQR only take the order of the values into account, while the mean and standard deviation weigh large and small values differently 64/66 64 of 66 05/21/2015 11:00 AM

Summary The shape of a distribution tells us how values are spread out across the individuals in our sample We can see the shape of a distribution using Histograms, Stem-and-Leaf Plots, and Boxplots The Five Number Summary gives us a quick overview of the distribution The Median and Mean tell us about typical values of the variable The Standard Deviation and IQR tell us about the spread of our variables The Median and IQR are robust to outliers and skew The Mean and Standard Deviation are generally preferred when the distribution is symmetric 65/66 65 of 66 05/21/2015 11:00 AM

Important Notes These techniques are only for numeric variables (the mean area code is meaningless) Bar Charts and Histograms are not interchangeable Choosing the stems in a Stem-and-Leaf plot and number of bins in a Histogram are balancing acts Try to avoid rounding while calculating values, but don't report more decimal places than make sense Graph the data first, it will let you know which measures are appropriate to use 66/66 66 of 66 05/21/2015 11:00 AM