Selecting the Appropriate Outlier Treatment for Common Industry Applications

Kunal Tiwari, Krishna Mehta, Nitin Jain, Ramandeep Tiwari, Gaurav Kanda
Inductis Inc., 571 Central Avenue #105, New Providence, NJ

ABSTRACT

Outlier detection and treatment is a very important part of any modeling exercise. A failure to detect outliers, or their incorrect treatment, can have serious ramifications for the validity of the inferences drawn from the exercise. A large number of techniques are available to perform this task, and selecting the most appropriate one often poses a big challenge to the practitioner. In making the selection, the sophistication of the technique needs to be balanced against the ease of implementation and the impact on model performance. In this paper we focus on the impact some of these techniques have on the more common applications of modeling. We evaluate five different outlier detection and treatment techniques: Capping/Flooring, the Sigma approach, a percentile curve-extrapolation approach, the Mahalanobis distance approach, and the Robust Regression approach. The impact of these techniques is evaluated in a linear and a logistic regression framework, the two modeling approaches most relevant to a large share of industry applications.

1. INTRODUCTION

Outlier detection is an important task in data mining and involves identifying a set of observations whose values deviate from the expected range. These extreme values can unduly influence the results of an analysis and lead to incorrect conclusions. It is therefore very important to give proper attention to this problem before embarking on any data mining exercise. Research in this area is voluminous, and whole arrays of techniques are available to solve the problem. Many of these techniques look at each variable individually, while others take a multivariate approach. A challenge often faced by the practitioner is to select optimally among these techniques.
(Correspondence: kmehta@inductis.com. For an extensive discussion of outlier detection, see [1].)

In this paper, we evaluate the appropriateness of five outlier detection techniques for two of the most popular industry applications, linear and logistic regression. In shortlisting the techniques for the study, we paid close attention to how readily each technique can be implemented in an industry setting.

2. OUTLIER APPROACHES

As mentioned above, we selected five different outlier treatment approaches for closer evaluation. This section briefly describes each of them.

2.1 Capping/Flooring Approach

A value is identified as an outlier if it exceeds the 99th percentile (P99) of the variable by some factor, or if it falls below the 1st percentile (P1) by some factor. The factor is determined after considering the variable's distribution and the business case. The outlier is then capped at a certain value above P99 or floored at a factor below P1. The capping/flooring factor is again obtained by studying the distribution of the variable and accounting for any special business considerations.

2.2 Sigma Approach

With the sigma approach, a value is identified as an outlier if it lies more than x times sigma above or below the mean, where x is an integer and sigma is the standard deviation of the variable. The outlier is then capped or floored at a distance of y times sigma from the mean, where y is greater than or equal to x and is determined by the practitioner.

2.3 Curve-Extrapolation Approach

To detect outliers with this approach, the curve between P95 and P99 is extrapolated beyond P99 by a factor x; any values lying outside this extended curve are identified as outliers. Similarly, the curve between the 5th and 1st percentiles is extended toward the minimum value by some factor, and values lying below this extended curve are outliers.
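The capping/flooring and sigma approaches can be sketched in a few lines. The paper itself works in SAS; the sketch below is a minimal Python illustration of the same logic, and the band-widening factor and sigma multipliers are illustrative placeholders rather than the paper's actual settings:

```python
import numpy as np

def cap_floor(x, factor=0.0):
    """Cap/floor at an offset beyond P99/P1.

    `factor` widens the band beyond [P1, P99] as a fraction of the
    inter-percentile range -- an illustrative choice; the paper picks
    the factor from the variable's distribution and the business case.
    """
    p1, p99 = np.percentile(x, [1, 99])
    pad = factor * (p99 - p1)
    return np.clip(x, p1 - pad, p99 + pad)

def sigma_cap(x, x_detect=3, y_cap=3):
    """Flag values lying outside mean +/- x_detect * sigma and cap or
    floor them at mean +/- y_cap * sigma (y_cap >= x_detect)."""
    mu, sigma = x.mean(), x.std()
    return np.clip(x, mu - y_cap * sigma, mu + y_cap * sigma)
```

Both functions return a treated copy of the variable, so the original data remain available for comparison.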

Any value lying beyond the extended curve is treated by an appropriate function that maintains the monotonicity of the values while bringing them into an acceptable range.

2.4 Mahalanobis Distance Approach

The Mahalanobis distance is a multivariate measure and is calculated for every observation in the dataset. Each observation is then given a weight equal to the inverse of its Mahalanobis distance, so observations with extreme values receive lower weights. Finally, a weighted regression is run to minimize the effect of outliers. The Mahalanobis distance differs from the Euclidean distance in that it is based on the correlations between variables, by which different patterns can be identified and analyzed, and it is scale-invariant, i.e. not dependent on the scale of measurement. It is derived as follows:

    D^2 = y'y = sum_i y_i^2 = (x - x_avg)' C^-1 (x - x_avg)

where D = +sqrt(D^2) is the Mahalanobis distance and y = C^-1/2 (x - x_avg), with

    x     = original (log-transformed) data vector
    C     = estimated covariance matrix
    y     = transformed data vector
    x_avg = estimated vector of averages

For more on the Mahalanobis technique, see [2], [3].

2.5 Robust-Reg Approach

Robust regression is also a multivariate approach. A regression is first run on the dataset and the residuals are calculated. The median of the residuals is computed, each residual's deviation from that median is taken, and the MAD (median absolute deviation) is converted to a robust scale estimate as MAD/0.6745 (0.6745 being the constant that makes the MAD consistent with sigma for normally distributed data). Each residual is then divided by this scale, and a weight is assigned based on the resulting absolute value. The process is run iteratively, giving a weight to each observation; outliers receive a weight of 0 and are deleted from the final regression. For more on robust regression, see [4], [5].

3. THE EVALUATION PROCESS

Our applications of interest are in the direct marketing and consumer finance areas.
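The Mahalanobis weighting scheme of Section 2.4 can be sketched as follows. This is a minimal NumPy illustration rather than the paper's SAS implementation; the `eps` guard against zero distances and the plain inverse-distance weights are simplifying assumptions:

```python
import numpy as np

def mahalanobis_weights(X, eps=1e-6):
    """Weight each row by the inverse of its Mahalanobis distance,
    so multivariate outliers get small weights."""
    mu = X.mean(axis=0)
    C_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # D^2 = (x - x_avg)' C^-1 (x - x_avg), computed row-wise
    d2 = np.einsum('ij,jk,ik->i', diff, C_inv, diff)
    return 1.0 / (np.sqrt(d2) + eps)

def weighted_ols(X, y, w):
    """Weighted least squares: solve (X'WX) b = X'Wy."""
    Xd = np.column_stack([np.ones(len(X)), X])  # prepend an intercept
    W = np.diag(w)
    return np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ y)
```

Running `weighted_ols` with the weights from `mahalanobis_weights` gives the outlier-downweighted fit the section describes.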
In these areas, as in many other industry and research settings, the problems analyzed involve either a binary or a continuous dependent variable. We therefore evaluated the outlier treatment approaches on both linear and logistic regression problems and, owing to the different nature of these problems, used separate but appropriate performance metrics for each. Four different datasets were used to derive the results, and the results presented are averages across the datasets. Wherever appropriate, results were calculated on validation datasets. For the logistic regressions, the performance metrics were lift, percentage concordance, and c-value; for the linear regressions, lift, R-square, percentage error bands, and the coefficient of variation were used. These metrics are fairly standard; a brief description follows.

Lift at x is defined as the percentage of total responders captured (for logistic regression) or the percentage of total volume captured (for linear regression) in the top x percent of observations ranked by model output. In our experimental study we chose lift at the 10th percentile, as it is a very common cutoff in marketing exercises; the results reported broadly hold for lift at the 20th and 30th percentiles as well.

Percentage concordance gives the percentage of the population pairs that are concordant according to the model. A pair of observations with different observed responses is said to be concordant if the observation with the lower ordered response value has a lower predicted mean score than the observation with the higher ordered response value. For more on concordance, see [6].

The c-value is equivalent to the well-known ROC measure. It ranges from 0.5 to 1, where 0.5 corresponds to a model that predicts the response at random and 1 corresponds to a model that discriminates the response perfectly.
For more on the c-value, see [6].

R-square is the proportion of variability in a dataset that is accounted for by the statistical model. It indicates how well the model fits the data: the higher its value, the better the method used to build the model. It is one of the most common metrics for evaluating a very broad class of models.

The coefficient of variation is a measure of the dispersion of a probability distribution; the lower its value, the better the method used to build the model. For more on the coefficient of variation, see [7].

Percentage error bands are formed by breaking the percentage error in prediction into bands. The percentages of the population falling into the different error bands indicate the efficiency of the model: it is desirable to have more of the observations fall into the lower error bands.
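Lift at a percentile and the c-value can be computed directly from scores and outcomes. The sketch below follows the standard pairwise definitions rather than the paper's SAS implementation; note that the paper's "percentage concordance" figures are reported on a different scale than the pairwise proportion shown here:

```python
import numpy as np

def lift_at(scores, y, pct=10):
    """Share of all responders (logistic) or total volume (linear)
    captured in the top `pct` percent of score-ranked observations,
    expressed as a percentage."""
    n_top = max(1, int(len(scores) * pct / 100))
    order = np.argsort(scores)[::-1]          # highest scores first
    captured = y[order[:n_top]].sum() / y.sum()
    return 100.0 * captured

def c_value(scores, y):
    """c-statistic (area under the ROC curve): the proportion of
    concordant pairs among all pairs with different responses,
    counting tied scores as half-concordant."""
    pos, neg = scores[y == 1], scores[y == 0]
    grid = pos[:, None] - neg[None, :]        # all pos/neg score gaps
    return (grid > 0).mean() + 0.5 * (grid == 0).mean()
```

A random model yields lift near `pct` and a c-value near 0.5; a perfect ranking yields a c-value of 1.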

4. RESULTS

In this section we present the findings of our empirical study. We first discuss the results for the logistic regression model, followed by the linear regression results.

4.1 Logistic Regression

(Robust regression is more appropriate for linear regression and was not considered for the logistic regressions.)

Lift curve: The sigma approach gave the highest lift at the top deciles, followed by the capping/flooring approach. Figure 1 illustrates the results.

[Figure 1: Lift comparison curves for the various methods using logistic regression -- cumulative lift plotted against cumulative percentage of population (10-50%) for the Flooring, Sigma, Curve-Extrapolation, and Distance approaches.]

Concordance: The sigma approach provided the best concordance; the Mahalanobis distance measure again gave the worst results.

[Figure 2: Percentage concordance comparison for the various methods using logistic regression -- values 41.2, 41.6, 44.6, and 40.6 across the four methods, with the sigma approach highest at 44.6 and the distance approach lowest at 40.6.]

C-value: On the c-values, the sigma approach again came out best.

[Figure 3: C-value comparison for the various methods using logistic regression -- values 0.683, 0.669, 0.671, and 0.664 across the methods, with the sigma approach highest at 0.683.]

4.2 Linear Regression

Lift curve: When the results from the different approaches were compared on lift, the distance approach came out best (see Figure 4).

[Figure 4: Lift comparison curves for the various methods using linear regression -- cumulative lift plotted against cumulative percentage of population (10-50%) for the Flooring, Distance, Sigma, and Robust-Reg approaches.]

R-square: On the R-square values, the distance approach again came out slightly better than the other methods (see Figure 5).

[Figure 5: R-square values for the various methods using linear regression -- Flooring 0.26, Distance 0.27, Sigma Approach 0.25.]

Coefficient of variation: On the coefficient-of-variation metric, the distance approach performed much better than the other methods (see Figure 6).

[Figure 6: Coefficient of variation values for the various methods using linear regression -- values 98.87, 98.76, 63.58, 93.55, and 94.93 across the five methods, with the distance approach lowest at 63.58.]

Error bands: The Mahalanobis distance method outperforms all the other methods on the error-band criterion, capturing a substantially larger share of the population in the lower error bands (see Figure 7).

[Figure 7: Population distribution across percentage error bands (<5% up to >100%) for the various methods using linear regression.]

For linear regression, therefore, the Mahalanobis distance method does marginally better than the other methods on some criteria and substantially improves model performance on the coefficient-of-variation and error-band measures.

5. CONCLUSIONS

Our analysis provides evidence that the type of technique used for outlier treatment does influence model performance.
For linear regression problems, the Mahalanobis distance approach comes out better than any other technique considered. For logistic regression problems, however, simpler techniques such as the sigma approach perform best, and the Mahalanobis distance approach performs worst.

These results have serious implications for industry applications. For example, in direct marketing it is fairly common to build customer enrollment and profitability models from the same dataset as a two-stage exercise: the first stage involves a logistic model and the second stage a linear regression model. The common practice is to treat outliers once at the beginning of the analysis and then proceed with no additional thought given to them. Our findings suggest that this practice is suboptimal and that outliers should be treated differently for the two modeling exercises.

6. REFERENCES

[1] Pyle, D. Data Preparation for Data Mining, Morgan Kaufmann Publishers, 1999.
[2] Rasmussen, J.L. Evaluating Outlier Identification Tests: Mahalanobis D Squared and Comrey Dk, Multivariate Behavioral Research 23:2, 189-202, 1988.
[3] http://en.wikipedia.org/wiki/mahalanobis_distance
[4] Rousseeuw, P.J., Leroy, A.M. Robust Regression and Outlier Detection, Wiley-Interscience, 1987.
[5] Chernobai, A., Rachev, S.T. Applying Robust Methods to Operational Risk Modeling, UCSB Technical Report, 2006.
[6] Allison, P. Logistic Regression Using the SAS System, SAS Publishing, 1999.
[7] http://en.wikipedia.org/wiki/coefficient_of_variation

7. ACKNOWLEDGEMENTS

SAS is a registered trademark of the SAS Institute, Inc. of Cary, North Carolina.
Thanks to Amit Sharma, Inductis, for his guidance on the Mahalanobis distance approach.

8. CONTACT INFORMATION

Any comments and questions are valued and encouraged. Contact the author at:

Krishna Mehta
Inductis Inc.
571 Central Avenue #105
New Providence, NJ 07310
Work Phone: 1-908-743-1105
Fax: 1-908-508-7811
Email: kmehta@inductis.com
Web: www.inductis.com