Outliers and Their Effect on Distribution Assessment

Similar documents
CHAPTER 4. Labeling Methods for Identifying Outliers

STAT 2300: Unit 1 Learning Objectives Spring 2019

Chapter 1 Data and Descriptive Statistics

SECTION 11 ACUTE TOXICITY DATA ANALYSIS

Super-marketing. A Data Investigation. A note to teachers:

energy usage summary (both house designs) Friday, June 15, :51:26 PM 1

Chapter 3. Displaying and Summarizing Quantitative Data. 1 of 66 05/21/ :00 AM

MAS187/AEF258. University of Newcastle upon Tyne

A is used to answer questions about the quantity of what is being measured. A quantitative variable is comprised of numeric values.

FUNDAMENTALS OF QUALITY CONTROL AND IMPROVEMENT. Fourth Edition. AMITAVA MITRA Auburn University College of Business Auburn, Alabama.

Attachment 1. Categorical Summary of BMP Performance Data for Solids (TSS, TDS, and Turbidity) Contained in the International Stormwater BMP Database

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

Biostatistics 208 Data Exploration

Table. XTMIXED Procedure in STATA with Output Systolic Blood Pressure, use "k:mydirectory,

Improve Trash Pick up Route 11. Project Start: 10/26/09 Project Revision: 11/18/09 Project Champion: Dan Brotton Black/Green Belt: Drew Brown

These slides and the data they contain are the sole property of the author(s) of the slide set.

Coal Combustion Residual Statistical Method Certification for the CCR Landfill at the Boardman Power Plant Boardman, Oregon

Graphical Tools - SigmaXL Version 6.1

STATISTICALLY SIGNIFICANT EXCEEDANCE- UNDERSTANDING FALSE POSITIVE ERROR

FUNDAMENTALS OF QUALITY CONTROL AND IMPROVEMENT

JMP TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING

Chapter 2 Part 1B. Measures of Location. September 4, 2008

rat cortex data: all 5 experiments Friday, June 15, :04:07 AM 1

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

Using Excel s Analysis ToolPak Add-In

Ch. 16 Exploring, Displaying, and Examining Data

Bar graph or Histogram? (Both allow you to compare groups.)

ISO 13528:2015 Statistical methods for use in proficiency testing by interlaboratory comparison

Statistics 201 Summary of Tools and Techniques

Applying the Principles of Item and Test Analysis

Groundwater Statistical Methods Certification. Neal South CCR Monofill Permit No. 97-SDP-13-98P Salix, Iowa. MidAmerican Energy Company

Groundwater Statistical Methods Certification. Neal North Impoundment 3B Sergeant Bluff, Iowa. MidAmerican Energy Company

Comparative analysis of forecasting techniques with intermittent demand

CEE3710: Uncertainty Analysis in Engineering

Groundwater Statistical Methods Certification. Neal North CCR Monofill Permit No. 97-SDP-12-95P Sergeant Bluff, Iowa. MidAmerican Energy Company

Social Studies 201 Fall Answers to Computer Problem Set 1 1. PRIORITY Priority for Federal Surplus

AP Statistics Scope & Sequence

Composite Structure Engineering Safety Awareness Course

Biostat Exam 10/7/03 Coverage: StatPrimer 1 4

Advanced Higher Statistics

Energy Efficiency Impact Study

Statistical Observations on Mass Appraisal. by Josh Myers Josh Myers Valuation Solutions, LLC.

Learn What s New. Statistical Software

What s New in Minitab 17

Feroze Chowdary. Clipper Downtime Improvement

!! NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA ! NOTE: The SAS System used:!

Performance of the Sievers 500 RL On-Line TOC Analyzer

Effective Applications Development, Maintenance and Support Benchmarking. John Ogilvie CEO ISBSG 7 March 2017

Clovis Community College Class Assessment

DETECTION MONITORING STATISTICAL METHODS CERTIFICATION U.S. EPA COAL COMBUSTION RESIDUAL RULE

Untangling Correlated Predictors with Principle Components

Sample Exam 1 Math 263 (sect 9) Prof. Kennedy

AP Statistics Test #1 (Chapter 1)

Revisione della norma ISO 4259 Andrea Gallonzelli

36.2. Exploring Data. Introduction. Prerequisites. Learning Outcomes

Measurement Systems Analysis

Elementary Statistics Lecture 2 Exploring Data with Graphical and Numerical Summaries

Module - 01 Lecture - 03 Descriptive Statistics: Graphical Approaches

QUICK & DIRTY GRR PROCEDURE TO RANK TEST METHOD VARIABILITY

Gasoline Consumption Analysis

Chart Recipe ebook. by Mynda Treacy

Dr. Allen Back. Aug. 26, 2016

There are also further opportunities to develop the Core Skill of Information and Communication Technology at SCQF level 5 in this Unit.

Fundamental Elements of Statistics

Groundwater Statistical Analysis Plan

Opening SPSS 6/18/2013. Lesson: Quantitative Data Analysis part -I. The Four Windows: Data Editor. The Four Windows: Output Viewer

-SQA-SCOTTISH QUALIFICATIONS AUTHORITY HIGHER NATIONAL UNIT SPECIFICATION GENERAL INFORMATION

AP Statistics Part 1 Review Test 2

Identification and treatment of outliers in the monitoring data for the monitoringbased exercise under the WFD Review of Priority Substances

August 24, Jean-Philippe Mathevet IAQG Performance Stream Leader SAFRAN Paris, France. Re: 2014 Supplier Performance Data Summary

This document is a preview generated by EVS

CREDIT RISK MODELLING Using SAS

MSU Scorecard Analysis - Executive Summary #2

To provide a framework and tools for planning, doing, checking and acting upon audits

Displaying Bivariate Numerical Data

Statistics and Pharmaceutical Quality

Understanding and accounting for product

Why Learn Statistics?

Who Are My Best Customers?

Discussion Solution Mollusks and Litter Decomposition

ROADMAP. Introduction to MARSSIM. The Goal of the Roadmap

Cycle Time Forecasting. Fast #NoEstimate Forecasting

I. INTRODUCTION II. LITERATURE REVIEW. Ahmad Subagyo 1, Armanto Wijaksono 2. Lecture Management at GCI Business School, Indonesia

Limitations Statement

Computing Descriptive Statistics Argosy University

Capability on Aggregate Processes

Characterization of resolution cycle times of corrective actions in mobile terminals

The Dummy s Guide to Data Analysis Using SPSS

BAR CHARTS. Display frequency distributions for nominal or ordinal data. Ej. Injury deaths of 100 children, ages 5-9, USA,

Course on Data Analysis and Interpretation P Presented by B. Unmar. Sponsored by GGSU PART 1

Extracting and analyzing information from a large volume of aircraft repair messages

Introduction to Research

Data Visualization. Prof.Sushila Aghav-Palwe

Students will understand the definition of mean, median, mode and standard deviation and be able to calculate these functions with given set of

Design of Experiments (DOE) Instructor: Thomas Oesterle

Rational Approach in Applying Reliability Theory to Pavement Structural Design

Four Innovative Methods to Evaluate Attribute Measurement Systems

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Statistics Chapter Measures of Position LAB

Day 1: Confidence Intervals, Center and Spread (CLT, Variability of Sample Mean) Day 2: Regression, Regression Inference, Classification

Transcription:

Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016

Topics of Discussion What is an Outlier? (Definition) How can we use outliers Analysis of Outlying Observation The Standards Tests for Outliers in Minitab Importance of Data Distributions How Outliers Affect Distribution Assessment Examples

What are Outlying Observations? Do they have value?

What are Outliers? An outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error: the latter are sometimes excluded from the data set. An outlier may be caused by a defective unit or a problem in the process. ISO 16269 defines it as member of a small subset of observations that appears to be inconsistent with the remainder of a given sample."

Anomaly Detection Anomaly detection (also known as outlier detection)is the search for items or events which do not conform to an expected pattern. Anomalies are also referred to as outliers, change, deviation, surprise, aberrant, intrusion, etc. An outlier may be caused by a defective unit or a problem in the process.

Basic Understanding of Outliers There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise. Every effort must be made to determine what is causing the outlier to exist. Testing, equipment, operator error, materials, etc. all must be reviewed and assessment made.

What to do with outliers Outliers must be investigated Use the tools found in root cause analysis Interview operators Evaluate the measuring system Try to duplicate the reading Don t make assumptions or jump to conclusions Do not remove outliers simply because they are outliers

The Importance of Outliers Note that outliers are not necessarily "bad" data-points; indeed they may well be the most important, most information rich, part of the dataset. Under no circumstances should they be automatically removed from the dataset. Outliers may deserve special consideration: they may be the key to the phenomenon under study or the result of human blunders.

From ASTM E178-08 1.1.1 An outlying observation may be merely an extreme manifestation of the random variability inherent in the data. If this is true, the value should be retained and processed in the same manner as the other observations in the sample. 1.1.2 On the other hand, an outlying observation may be the result of gross deviation from prescribed experimental procedure or an error in calculating or recording the numerical value. In such cases, it may be desirable to institute an investigation to ascertain the reason for the aberrant value.

From The Complete Idiot s Guide to Statistics Outliers Extreme values in a data set that should be discarded before analysis. WHAT???

Considerations before collecting data

What distribution is expected for continuous data? 3 distributions cover most circumstances Random variables which are added together follow a normal distribution Human height, weight, many dimensions Random variables which are multiplied together follow a lognormal distribution Also useful for data which are bounded at zero like tensile strength, burst strength. These are skewed right because there are no negative values. Cycle time and microbiologic data often follow the lognormal distribution. Weibull distribution is versatile for skewed data. Time to failure, fatigue life.

Analysis of Outliers The Standards and Minitab

Standards Relating to Outliers ASTM E178-08 Standard Practice for Dealing with Outlying Observations ISO 16269-4:2010 Statistical interpretation of data Part 4: Detection and treatment of outliers

ASTM E178-08 18 page document that focuses on Dixon Test. Includes table for Tietjen-Moore Critical Values (One-Sided Test) for Lk Offers an example with aas laboratory analysis References Wilk-Shapiro W Statistic as a possible test for outliers

ISO 16269-4:2010 54 Page document that offers a myriad of methods for detecting and treating outliers. Methods include Box plot, Cochran s test, Greenwood s test. Full explanation of GESD (Generalized Extreme Studentized Deviate) method Demonstrates various graphical representations of data (Box plot, histogram, stem-and-leaf and probability plot) Much emphasis put on distribution understanding including Box-Cox plot and analysis.

Testing for outliers Other methods flag observations based on measures such as the interquartile range. For example, if Q1 and Q3 are the lowest and upper quartiles respectively, then one could define an outlier to be any observation outside the range: [Q 1 k(q 3 -Q 1 ), Q 3 + k(q 3 Q 1 )]

The Box Plot http://www.physics.csbsju.edu/stats/box2.html

Interpreting the Box Plot

Testing for outliers Model-based methods which are commonly used for identification assume that the data are from a normal distribution, and identify observations which are deemed unlikely based on mean and standard deviation.

Outlier Testing in Minitab Three good tests for outliers found in Minitab: Box plot Grubbs Test Dixon s Test Dixon s Q ratio Dixon s 11 ratio Dixon s 12 ratio Dixon s 20 ratio Dixon s 21 ratio Dixon s 22 ratio

Outlier Testing in Minitab Stat> Basic Statistics> Outlier Test First determine the test you will use. The Dixon s numbered ratios are based on sample size Next select if you want to evaluate the smallest, largest or smallest or largest value to evaluate

Grubbs test 20.28, 18.75, 21.07, 20.33,18.96, 19.35, 19.24, 20.45, 20.61, 18.64, 20.15, 18.26, 20.55, 22.76, 26.10, 19.66, 20.50, 19.93, 22.02, 20.51, 18.67, 20.06, 21.37, 21.05, 22.32 Outlier Plot of Outlier Grubbs' Test Min Max G P 1 8.26 26.1 0 3.42 0.001 1 8 1 9 20 21 22 23 24 25 26 27 Outlier

Dixon s Q Ratio Outlier Plot of Outlier Dixon's Q Test Min Max r1 0 P 1 8.26 26.1 0 0.43 0.002 Dixon s Q Ratio test for the same data set. 1 8 1 9 20 21 22 23 24 25 26 27 Outlier

Outlier Simple Boxplot 27 Boxplot of Outlier 26 25 24 23 22 21 20 1 9 1 8

Other Tests for Outliers Chauvenet s criterion Cochran s C Test Variance Outlier Test Peircece s criterion The Jackknife Method Greenwood s Test

The Importance of Distribution Fitting Impact on Statistical Tests

Why test for normality? Reliability predictions are based on model (distribution) If model is inappropriate, misleading conclusions can be drawn.

Sensitive to normality Non-Sensitive Confidence intervals on means T-tests ANOVA, including DOE and Regression X-bar Charts Sensitive Confidence Intervals on standard deviations Tolerance Intervals, Reliability Intervals Variance tests C pk, C p, P pk, P p I-Charts 29

Reasons for failing normality Measurement resolution Data shift Multiple sources of data Truncated data Outliers Too much data The underlying distribution is not normal

The Effect of Outliers on Distribution Assessment Some Examples

Difference of a Data Point Summary for EW Data 2 A nderson-darling Normality Test A -Squared 0.70 P-V alue 0.060 Mean 110.71 StDev 7.99 V ariance 63.87 Skew ness -1.17564 Kurtosis 1.99926 N 26 90 100 110 120 M inimum 86.20 1st Q uartile 104.40 M edian 112.90 3rd Q uartile 116.80 M aximum 122.40 95% C onfidence Interv al for Mean 107.48 113.94 95% C onfidence Interv al for Median 9 5 % Confidence Intervals 107.73 116.08 95% C onfidence Interv al for StDev Mean 6.27 11.03 Median 108 110 112 114 116 Project: NORMALITY EXAMPLES 2011-06-22.MPJ

With Outlier Removed Summary for EW Data 2 No Outlier A nderson-darling N ormality Test A -Squared 0.61 P-V alue 0.099 Mean 111.69 StDev 6.36 V ariance 40.51 Skew ness -0.353437 Kurtosis -0.992428 N 25 100 105 110 115 120 M inimum 100.40 1st Q uartile 104.70 M edian 113.00 3rd Q uartile 116.80 M aximum 122.40 95% C onfidence Interv al for Mean 109.06 114.32 95% C onfidence Interv al for Median 9 5 % Confidence Intervals 109.24 116.44 95% C onfidence Interv al for StDev Mean 4.97 8.85 Median 110 112 114 116 Project: Normality Examples 2011-06-22.MPJ

Summary Statistics Output Summary for C2 A nderson-darling N ormality Test A -S quared 2.77 P-V alue < 0.005 Mean 3.9883 S tdev 4.1086 V ariance 16.8803 S kew ness 1.90297 Kurtosis 3.02385 N 30 0 4 8 12 16 M inimum 0.3286 1st Q uartile 1.3757 M edian 2.5724 3rd Q uartile 4.8224 M aximum 16.4155 95% C onfidence Interv al for Mean 2.4541 5.5224 95% C onfidence Interv al for Median 9 5 % Confidence Intervals 1.8680 3.7050 95% C onfidence Interv al for StDev Mean 3.2721 5.5232 Median 2 3 4 5 6 Project: NORMALITY EXAMPLES 2011-06-22.MPJ

Percent Percent Investigate Non-Normal Distributions Probability Plot for C2 99 Normal - 95% CI 99 Lognormal - 95% CI Goodness of F it Test Normal A D = 2.774 P-V alue < 0.005 95 80 95 80 Lognormal A D = 0.208 P-V alue = 0.852 50 50 20 20 5 5 1-10 0 C2 10 20 1 0.1 1 C2 10 100 Project: NORMALITY EXAMPLES 2011-06-22.MPJ

Outlier or Not? Data set 200, 440, 220, 1100, 80, 55, 60, 1700, 100, 200, 160, 110, 170, 210, 3600, 90, 85, 35, 430, 180, 5, 15, 1, 30, 10, 10, 20, 15, 20, 15. 4000 Boxplot of Results 3000 Results 2000 1 000 0

Dixon s Q Ratio Outlier Plot of Results Dixon's Q Test Min Max r1 0 P 1.00 3600.00 0.53 0.000 0 1 000 2000 3000 4000 Results

Grubbs Test Outlier Plot of Results Grubbs' Test Min Max G P 1.00 3600.00 4.60 0.000 0 1 000 2000 3000 4000 Results

Outlier or Not? Data set 200, 440, 220, 1100, 80, 55, 60, 1700, 100, 200, 160, 110, 170, 210, 3600, 90, 85, 35, 430, 180, 5, 15, 1, 30, 10, 10, 20, 15, 20, 15. Summary for Aerobes A nderson-darling Normality Test A -Squared 6.23 P-V alue < 0.005 Mean 312.17 StDev 715.08 V ariance 511335.66 Skew ness 3.8720 Kurtosis 16.3024 N 30 Minimum 0.00 0 1000 2000 3000 1st Q uartile 18.75 Median 87.50 3rd Q uartile 202.50 Maximum 3600.00 95% C onfidence Interv al for Mean 45.15 579.18 95% C onfidence Interv al for Median 31.14 177.71 95% Confidence Intervals 95% C onfidence Interv al for StDev 569.49 961.29 Mean Median 0 100 200 300 400 500 600

Review the Probability Plot

Try a transformation Summary for Log Base 10 A nderson-darling Normality Test A -Squared 0.25 P-V alue 0.710 Mean 1.8753 StDev 0.7713 V ariance 0.5950 Skew ness -0.069983 Kurtosis 0.319587 N 30 Minimum 0.0000 0 1 2 3 1st Q uartile 1.2698 Median 1.9418 3rd Q uartile 2.3063 Maximum 3.5563 95% C onfidence Interv al for Mean 1.5873 2.1633 95% C onfidence Interv al for Median 1.4924 2.2496 95% Confidence Intervals 95% C onfidence Interv al for StDev 0.6143 1.0369 Mean Median 1.50 1.65 1.80 1.95 2.10 2.25 Conclusion The sample with a reading of 3600 is expected to occur approximately 1 out of every 67 readings on an average and should not be considered extreme based on the distribution of the data. The sample with this count of Aerobes should be identified through a gram stain to determine the nature of its source, but unless abnormal concerns are raised based on the type of microorganism, no additional action should be taken.

Probability Plot with and without Outliers The plot on the left shows a typical normal distribution. The one on the right contains two data points as outlying observations. Probability plots can give us great insight into the distribution from our process.

Conclusion We make conclusions about processes based on the distribution models we use. Know your process distribution. Outliers can be a useful source of data, so investigate causes of outliers. Slow down on your assumptions. Look at the data several ways including a probability plot. Try to understand why your data is the way it is. When writing up your results, reference standards or technical literature.

References ISO 16269-4:2010 Statistical interpretation of data Part 4 Detection and treatment of outliers ASTM E178-08 Standard Practice for Dealing With Outlying Observations Minitab 17 Training Material Analysis of Nonnormal Data for Quality Acheson J. Duncan Quality Control and Industrial Statistics Fourth Edition 1974 pp 206-207, pp 977 W.J. Dixon Processing Data for Outliers, Biometrics, Vol. IX (1953), pp. 74-89.

Questions?