CHAPTER 4. Labeling Methods for Identifying Outliers


4.1 Introduction

Data mining is the extraction of hidden predictive knowledge from large databases, and outlier detection is one of its powerful techniques. Many authors have defined outliers in different words. Hawkins (1980) defined an outlier as "an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism." Outliers are also referred to as discordant observations, deviants, abnormalities or anomalies in the data mining and statistics literature (Aggarwal, 2005). Some outlier labeling methods, such as the Standard Deviation (SD) rule, the MADe method and the Median rule, are commonly used. These methods are quite reasonable when the data distribution is normal, but outliers can themselves reduce the apparent normality of the data. When data depart from a normal distribution, a transformation to normality is a common step before identifying outliers with a method that is effective under a normal distribution.

Quesenberry & David (1961), discussing the rejection and location of outlying observations, noted that there might be several ways of approaching the problem, depending to a large extent on the object in view. One might be particularly interested in identifying the genuinely exceptional observations in order to gain new insight into the phenomena under study, judging the problem on the basis of risks of misclassification rather than of estimation errors. Grubbs (1969) gave procedures for statistically determining whether the highest observation, the highest and lowest observations, the two highest observations, the two lowest observations, or more of the observations in a sample are statistical outliers.

Most outlier labeling methods are informal tests: they generate an interval or criterion for outlier detection instead of carrying out a hypothesis test, and any observation beyond the interval or criterion is considered an outlier. Various location and scale parameters are employed in each labeling method to define a reasonable interval or criterion. There are two reasons for using an outlier labeling method:

1. To find possible outliers as a screening device before conducting a formal test.

2. To find the extreme values away from the majority of the data, regardless of the distribution.

While formal tests usually require test statistics based on distributional assumptions and a hypothesis to determine whether the largest extreme value is a true outlier of the distribution, most outlier labeling methods construct an interval from the location and scale parameters of the data.

Although a labeling method is usually simple to use, some observations outside its interval may turn out to be falsely identified outliers after a formal test, when outliers are defined as only those observations that deviate from the assumed distribution. However, if the purpose of the outlier detection is not a preliminary step for finding extreme values that violate the distributional assumptions of the main statistical analyses, such as the t-test, ANOVA and regression, but mainly to find the extreme values away from the majority of the data regardless of the distribution, the outlier labeling methods are applicable. In addition, for a large data set that is statistically problematic, e.g. when it is difficult to identify the distribution of the data or to transform it into a proper distribution such as the normal distribution, labeling methods can be used to detect outliers.

4.1.1 Issues of Outliers

Iglewicz & Hoaglin (1993) categorized the following three issues with regard to outliers.

Outlier labeling: flag potential outliers, which may be erroneous data or indicative of an inappropriate distributional model, for further investigation.

Outlier accommodation: use robust statistical techniques that will not be unduly affected by outliers. That is, if we cannot determine that potential outliers are erroneous observations, do we need to modify our statistical analysis to account for these observations more appropriately?

Outlier identification: formally test whether observations are outliers.

This chapter focuses on the outlier labeling techniques and on issues of outlier identification. Many real data sets contain outliers that have unusually large or small values when compared with the others in the data set. Outliers may have a negative effect on analyses based on distributional assumptions, such as ANOVA and regression, or they may provide useful information about the data when we look into an unusual response to a given study. Thus, outlier detection is an important part of data analysis in both of these cases. Several outlier labeling methods have been developed. Some methods are sensitive to extreme values, like the SD method, and others are resistant to extreme values, like Tukey's method. Although these methods are quite powerful with large normal data, it may be problematic to apply them to non-normal data or small sample sizes without knowing their characteristics in these circumstances. This is because each labeling method uses a different measure to detect outliers, and the expected percentage of flagged outliers changes with the sample size and the distribution type of the data. Many kinds of data regarding public health are skewed, usually to the right, and lognormal distributions can often be applied to such skewed data, for instance surgical procedure times, blood pressure, and measurements of toxic compounds in environmental analysis.

4.2 Methods of Analysis

Several outlier labeling methods are available; among them, four are used in this study: the Z-score, the modified Z-score, the Median Absolute Deviation (MADe) method and Tukey's method (boxplot).

FIGURE 4.1: Flowchart for outlier labeling methods

4.2.1 Z-Scores

The Z-score is a statistical measure of a score's relationship to the mean of a group of scores. A Z-score of 0 means the score equals the mean; a positive or negative Z-score indicates how many standard deviations the value lies above or below the mean. The Z-score, built from the mean and standard deviation, can therefore be used to identify outliers in a data set:

$$ Z_i = \frac{x_i - \bar{x}}{s}, \qquad (4.1) $$

where

$$ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}. $$

The method is based on the property that if $X$ follows a normal distribution $N(\mu, \sigma^2)$, then $Z = \frac{X-\mu}{\sigma} \sim N(0,1)$, and Z-scores that exceed 3 in absolute value are generally considered outliers. The method is simple, and it coincides with the 3 SD rule when the criterion for an outlier is an absolute Z-score of at least 3. According to Shiffler (1988), the maximum possible Z-score depends on the sample size and is computed as $(n-1)/\sqrt{n}$. Since no Z-score can exceed 3 in a sample of size 10 or less, the Z-score method is not very good for outlier labeling in small data sets. Another limitation of this rule is that the standard deviation can be inflated by a few, or even a single, extreme observation. This can cause a masking problem, i.e. the less extreme outliers go undetected because of the most extreme outlier(s). Although it is common practice to use Z-scores to identify possible outliers, this can be misleading, particularly for small sample sizes, because the maximum Z-score is at most $(n-1)/\sqrt{n}$.
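
As a minimal sketch of the rule in Equation (4.1), not part of the original analysis, the Z-score criterion can be written as a small R function; the function name z_outliers and the default cutoff of 3 are illustrative choices.

```r
# Flag observations whose absolute Z-score exceeds a cutoff (default 3), Eq. (4.1).
z_outliers <- function(x, cutoff = 3) {
  z <- (x - mean(x)) / sd(x)                       # Z-score of each observation
  data.frame(x = x, z = round(z, 2), outlier = abs(z) > cutoff)
}
```

Calling, for example, z_outliers(x) on a numeric vector x returns each value, its Z-score, and a logical flag indicating whether it lies beyond the cutoff.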

4.2.2 Interpretation of Z-Scores

Z-scores are interpreted as follows.

1. A z-score less than 0 represents an element less than the mean.

2. A z-score greater than 0 represents an element greater than the mean.

3. A z-score equal to 0 represents an element equal to the mean.

4. A z-score equal to 1 represents an element that is 1 standard deviation greater than the mean; a z-score equal to 2, 2 standard deviations greater than the mean; and so on.

5. A z-score equal to -1 represents an element that is 1 standard deviation less than the mean; a z-score equal to -2, 2 standard deviations less than the mean; and so on.

6. If the number of elements in the set is large, about 68% of the elements have a z-score between -1 and 1, about 95% have a z-score between -2 and 2, and about 99% have a z-score between -3 and 3 (these percentages can be checked numerically, as sketched below).

Standard scores are also called z-values, z-scores, normal scores and standardized variables; the use of "Z" comes from the fact that the normal distribution is also known as the "Z distribution".
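
The percentages in item 6 above are simply normal-distribution probabilities, so they can be verified with a one-line check in R (an illustrative aside, not part of the thesis computations).

```r
# Proportion of a standard normal distribution within k standard deviations of the mean
k <- 1:3
round(pnorm(k) - pnorm(-k), 4)
# [1] 0.6827 0.9545 0.9973   (approximately 68%, 95% and 99.7%)
```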

FIGURE 4.2: Comparison of various grading methods in a normal distribution: standard deviations, cumulative percentages, percentile equivalents, Z-scores, T-scores, stanines, and percent in each stanine

Z-scores are most frequently used to compare a sample to a standard normal deviate, though they can be defined without assumptions of normality. The z-score is only defined when the population parameters are known; if one only has a sample, the analogous computation with the sample mean and sample standard deviation yields Student's t-statistic. The z-score is often used in the z-test in standardized testing, the analogue of Student's t-test for a population whose parameters are known rather than estimated. As it is very unusual to know the entire population, the t-test is much more widely used. A few applications of z-scores include the following:

1. What percentage of people fall below a specific value?

2. What values can be deemed extreme? For example, in an IQ test, what scores represent the top 5%?

3. What is the relative score of one distribution versus another? For example, Michael is taller than the average male and Emily is taller than the average female, but who is relatively taller within their own gender?

These types of questions can be answered using a z-score. As a general rule, z-scores less than -1.96 or greater than 1.96 are considered unusual and interesting; such values are often associated with statistical significance and treated as potential outliers.

4.2.3 Modified Z-Scores

The problem with Z-scores is that the two estimators they use, the sample mean $\bar{x}$ and the sample standard deviation $s$, can be affected by a few extreme values, or even by a single extreme value. To overcome this, the median and the median absolute deviation (MAD) are employed in the modified Z-score in place of the mean and standard deviation of the sample (Iglewicz & Hoaglin, 1993):

$$ MAD = \mathrm{median}\{\,|x_i - \tilde{x}|\,\}, \qquad (4.2) $$

where $\tilde{x}$ is the sample median, and

$$ M_i = \frac{0.6745\,(x_i - \tilde{x})}{MAD}, \qquad (4.3) $$

where $E(MAD) = 0.675\,\sigma$ for large normal data. Iglewicz & Hoaglin (1993) suggested labeling observations as outliers when $|M_i| > 3.5$, based on a simulation of pseudo-normal observations for sample sizes of 10, 20 and 40. The $M_i$ score is effective for normal data in the same way as the Z-score.
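
A minimal sketch of Equations (4.2) and (4.3) in R follows; the function name modified_z and the default cutoff of 3.5 (the value suggested by Iglewicz & Hoaglin) are illustrative.

```r
# Modified Z-scores, Eqs. (4.2)-(4.3); label |M_i| > 3.5 as potential outliers.
modified_z <- function(x, cutoff = 3.5) {
  med <- median(x)
  MAD <- median(abs(x - med))            # Eq. (4.2)
  Mi  <- 0.6745 * (x - med) / MAD        # Eq. (4.3)
  data.frame(x = x, Mi = round(Mi, 4), outlier = abs(Mi) > cutoff)
}
```

Because the median and MAD are resistant to extreme values, this score does not suffer from the masking problem described for the ordinary Z-score.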

4.2.4 Median Absolute Deviation (MADe)

The Median Absolute Deviation (MADe) method is one of the basic robust methods, largely unaffected by the presence of extreme values in the data set. The approach is similar to the SD method, but the median and MADe are employed instead of the mean and standard deviation. It is defined as follows:

$$ \text{2 MADe method}: \ \mathrm{Median} \pm 2\,\mathrm{MADe}, \qquad (4.4) $$

$$ \text{3 MADe method}: \ \mathrm{Median} \pm 3\,\mathrm{MADe}, \qquad (4.5) $$

where $\mathrm{MADe} = 1.483 \times MAD$ for large normal data is an estimator of the spread of the data similar to the standard deviation, and

$$ MAD = \mathrm{median}\{\,|x_i - \mathrm{median}(x)|\,\}, \quad i = 1, 2, \ldots, n. \qquad (4.6) $$

The MAD is scaled by the factor 1.483 so that MADe is comparable to the standard deviation of a normal distribution; the absolute deviation around the median is itself a robust measure of dispersion.
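
A short sketch of the MADe interval in R; the function name made_limits and the argument k (2 for the 2 MADe method, 3 for the 3 MADe method) are illustrative choices.

```r
# MADe interval, Eqs. (4.4)-(4.6): Median +/- k * MADe, with MADe = 1.483 * MAD.
made_limits <- function(x, k = 2) {
  med  <- median(x)
  MADe <- 1.483 * median(abs(x - med))   # Eq. (4.6), scaled to mimic the SD
  c(lower = med - k * MADe, upper = med + k * MADe)
}
```

Values of x outside made_limits(x, k = 2) are flagged by the 2 MADe method, and likewise with k = 3 for the 3 MADe method.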

4.2.5 Tukey's Method (Box Plot)

Tukey's (1977) method of constructing a boxplot is a well-known, simple graphical tool for displaying information about continuous univariate data, such as the median, lower quartile, upper quartile, lower extreme and upper extreme of a data set. This method finds outliers by using the interquartile range to filter out very large or very small values. The formulas are:

$$ \text{Low outliers} < Q_1 - 1.5\,(Q_3 - Q_1) = Q_1 - 1.5\,\mathrm{IQR}, \qquad (4.7) $$

$$ \text{High outliers} > Q_3 + 1.5\,(Q_3 - Q_1) = Q_3 + 1.5\,\mathrm{IQR}, \qquad (4.8) $$

where $Q_1$ is the first quartile, $Q_3$ is the third quartile and IQR is the interquartile range. These equations give two values, or fences, which cordon off the outliers from the bulk of the data. The steps for finding outliers using the IQR are as follows (a small code sketch follows the list):

Step 1: Find the median.

Step 2: Find $Q_1$ and $Q_3$. $Q_1$ can be thought of as the median of the lower half of the data, and $Q_3$ as the median of the upper half. The interquartile range is $\mathrm{IQR} = Q_3 - Q_1$.

Step 3: Calculate $1.5 \times \mathrm{IQR}$ and subtract it from $Q_1$ to get the lower fence.

Step 4: Add $1.5 \times \mathrm{IQR}$ to $Q_3$ to get the upper fence.

Step 5: Compare the data with the fences; any value outside the fences is flagged as a potential outlier.
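
The steps above translate directly into R; this is a sketch (the function name tukey_fences is mine), and it uses R's default quartile definition, which can differ slightly from hand-computed textbook quartiles.

```r
# Tukey's fences, Eqs. (4.7)-(4.8), with the usual multiplier k = 1.5.
tukey_fences <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75))      # Q1 and Q3 (R's default quantile type)
  iqr <- q[2] - q[1]                     # interquartile range
  c(lower = unname(q[1] - k * iqr), upper = unname(q[2] + k * iqr))
}
```

Setting k = 3 gives the outer fences used to label far-out values, and boxplot.stats(x)$out returns the points that a base-R boxplot would draw beyond the whiskers.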

4.3 Computation Results and Discussion

In this study, diabetes data were obtained from the primary health center in Tirunelveli. The data set contains 50 observations of patients' diabetes levels, and the outputs of the different labeling methods were computed with the open-source R software package. Each of the labeling methods employed in this study uses a different measure for identifying outliers in the data set, and together they illustrate how the behaviour of the methods depends on the skewness of the data and on the sample size.

4.3.1 Computation of Z-Scores

In Table 4.1 (Case 1), with all the data included, observation 50 (the value 525) stands out, and it is the only observation whose absolute Z-score exceeds 3. In Table 4.2 (Case 2), where the most extreme value 525 is excluded, the values 236 and 236 are also identified as outliers. This is because the extreme value artificially inflates the standard deviation in Case 1 and masks the less extreme outliers.

4.3.2 Computation of Modified Z-Scores

For this method the computation results are tabulated below and compared with the Z-scores. Table 4.3 shows the modified Z-scores; three observations (236, 236 and 525) exceed 3.5 in absolute value and may well be outliers.
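
As a reproducibility sketch (not the thesis's own script; the vector name glucose is mine), the following R code re-computes the Case 1 and Case 2 Z-scores and the modified Z-scores from the values listed in Table 4.1.

```r
# Diabetes values transcribed from Table 4.1 (50 observations).
glucose <- c(70, 75, 81, 84, 84, 84, 85, 85, 86, 92, 93, 95, 95, 96, 96, 96,
             98, 99, 101, 101, 105, 106, 108, 109, 110, 110, 113, 114, 117, 117,
             119, 121, 121, 127, 130, 131, 132, 134, 135, 136, 139, 153, 155,
             166, 169, 172, 175, 236, 236, 525)

# Case 1: all 50 observations -- only 525 exceeds |Z| > 3 (Z = 5.90).
z1 <- (glucose - mean(glucose)) / sd(glucose)
glucose[abs(z1) > 3]

# Case 2: the most extreme value removed -- the masked values 236, 236 now exceed 3.
g2 <- glucose[-which.max(glucose)]
z2 <- (g2 - mean(g2)) / sd(g2)
g2[abs(z2) > 3]

# Modified Z-scores on the full data: 236, 236 and 525 exceed 3.5 (cf. Table 4.3).
Mi <- 0.6745 * (glucose - median(glucose)) / median(abs(glucose - median(glucose)))
glucose[abs(Mi) > 3.5]
```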

TABLE 4.1: Computation and masking problem of the Z-scores (Case 1)

Obs.  x_i   Z-score  |  Obs.  x_i   Z-score
 1     70   -0.83    |   26   110   -0.24
 2     75   -0.76    |   27   113   -0.20
 3     81   -0.67    |   28   114   -0.18
 4     84   -0.63    |   29   117   -0.14
 5     84   -0.63    |   30   117   -0.14
 6     84   -0.63    |   31   119   -0.11
 7     85   -0.61    |   32   121   -0.08
 8     85   -0.61    |   33   121   -0.08
 9     86   -0.60    |   34   127    0.01
10     92   -0.51    |   35   130    0.05
11     93   -0.49    |   36   131    0.07
12     95   -0.46    |   37   132    0.08
13     95   -0.46    |   38   134    0.11
14     96   -0.45    |   39   135    0.13
15     96   -0.45    |   40   136    0.14
16     96   -0.45    |   41   139    0.19
17     98   -0.42    |   42   153    0.39
18     99   -0.40    |   43   155    0.42
19    101   -0.38    |   44   166    0.59
20    101   -0.38    |   45   169    0.63
21    105   -0.32    |   46   172    0.68
22    106   -0.30    |   47   175    0.72
23    108   -0.27    |   48   236    1.62
24    109   -0.26    |   49   236    1.62
25    110   -0.24    |   50   525    5.90

4.3.3 Computation of MADe

For this data set the method gives MADe = 28.1694, Median = 110 and MAD = 19. The 2 MADe method identifies six outliers: 169, 172, 175, 236, 236 and 525. The 3 MADe method identifies three outliers: 236, 236 and 525.
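
These figures can be reproduced with the glucose vector defined in the earlier sketch; again this is an illustrative check rather than the original analysis code.

```r
med  <- median(glucose)                   # 110
MAD  <- median(abs(glucose - med))        # 19
MADe <- 1.483 * MAD                       # approximately 28.17

glucose[glucose < med - 2 * MADe | glucose > med + 2 * MADe]  # 169 172 175 236 236 525
glucose[glucose < med - 3 * MADe | glucose > med + 3 * MADe]  # 236 236 525
```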

TABLE 4.2: Computation and masking problem of the Z-scores (Case 2)

Obs.  x_i   Z-score  |  Obs.  x_i   Z-score
 1     70   -1.35    |   26   110   -0.23
 2     75   -1.21    |   27   113   -0.15
 3     81   -1.04    |   28   114   -0.12
 4     84   -0.96    |   29   117   -0.03
 5     84   -0.96    |   30   117   -0.03
 6     84   -0.96    |   31   119   -0.02
 7     85   -0.93    |   32   121   -0.08
 8     85   -0.93    |   33   121   -0.08
 9     86   -0.90    |   34   127    0.25
10     92   -0.73    |   35   130    0.33
11     93   -0.70    |   36   131    0.36
12     95   -0.65    |   37   132    0.39
13     95   -0.65    |   38   134    0.44
14     96   -0.62    |   39   135    0.47
15     96   -0.62    |   40   136    0.50
16     96   -0.62    |   41   139    0.58
17     98   -0.56    |   42   153    0.97
18     99   -0.54    |   43   155    1.03
19    101   -0.48    |   44   166    1.34
20    101   -0.48    |   45   169    1.42
21    105   -0.37    |   46   172    1.50
22    106   -0.34    |   47   175    1.59
23    108   -0.29    |   48   236    3.29
24    109   -0.26    |   49   236    3.29
25    110   -0.23    |   50     -       -

Dot Plot for MADe

A dotplot is made up of dots plotted on a graph; one way of drawing such a plot is sketched after this list. It is interpreted as follows.

1. Each dot represents a specific number of observations from a set of data. (Unless otherwise indicated, assume that each dot represents one observation; if a dot represents more than one observation, that should be noted explicitly on the plot.)

2. The dots are stacked in a column over a category, so that the height of the column represents the relative or absolute frequency of observations in the category.

3. The pattern of data in a dotplot can be described in terms of symmetry and skewness only if the categories are quantitative. If the categories are qualitative (as they often are), a dotplot cannot be described in those terms.
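
One way such a dot plot could be produced in base R for the diabetes data (assuming the glucose vector from the earlier sketch) is shown below, with the 2 MADe limits added as reference lines; the exact plot in Figure 4.3 may have been drawn differently.

```r
# Stacked dot plot of the data with the 2 * MADe cutoffs marked as dashed lines.
stripchart(glucose, method = "stack", pch = 16, xlab = "Diabetes level")
MADe <- 1.483 * mad(glucose, constant = 1)          # constant = 1 gives the raw MAD
abline(v = median(glucose) + c(-2, 2) * MADe, col = "red", lty = 2)
```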

TABLE 4.3: Computation of Z-scores compared with the modified Z-scores

 i   x_i   Z (Case 1)   Z (Case 2)   Modified Z  |   i   x_i   Z (Case 1)   Z (Case 2)   Modified Z
 1    70     -0.83        -1.35        -1.42     |  26   110     -0.24        -0.23        0.0000
 2    75     -0.76        -1.21        -1.2425   |  27   113     -0.20        -0.15        0.1065
 3    81     -0.67        -1.04        -1.0295   |  28   114     -0.18        -0.12        0.1420
 4    84     -0.63        -0.96        -0.923    |  29   117     -0.14        -0.03        0.2485
 5    84     -0.63        -0.96        -0.923    |  30   117     -0.14        -0.03        0.2485
 6    84     -0.63        -0.96        -0.923    |  31   119     -0.11         0.02        0.3195
 7    85     -0.61        -0.93        -0.8875   |  32   121     -0.08         0.08        0.3905
 8    85     -0.61        -0.93        -0.8875   |  33   121     -0.08         0.08        0.3905
 9    86     -0.60        -0.90        -0.852    |  34   127      0.01         0.25        0.6035
10    92     -0.51        -0.73        -0.639    |  35   130      0.05         0.33        0.7100
11    93     -0.49        -0.70        -0.6035   |  36   131      0.07         0.36        0.7455
12    95     -0.46        -0.65        -0.5325   |  37   132      0.08         0.39        0.7810
13    95     -0.46        -0.65        -0.5325   |  38   134      0.11         0.44        0.8520
14    96     -0.45        -0.62        -0.497    |  39   135      0.13         0.47        0.8875
15    96     -0.45        -0.62        -0.497    |  40   136      0.14         0.50        0.9230
16    96     -0.45        -0.62        -0.497    |  41   139      0.19         0.58        1.0295
17    98     -0.42        -0.56        -0.426    |  42   153      0.39         0.97        1.5265
18    99     -0.40        -0.54        -0.3905   |  43   155      0.42         1.03        1.5975
19   101     -0.38        -0.48        -0.3195   |  44   166      0.59         1.34        1.9880
20   101     -0.38        -0.48        -0.3195   |  45   169      0.63         1.42        2.0945
21   105     -0.32        -0.37        -0.1775   |  46   172      0.68         1.50        2.2010
22   106     -0.30        -0.34        -0.142    |  47   175      0.72         1.59        2.3075
23   108     -0.27        -0.29        -0.071    |  48   236      1.62         3.29        4.4730
24   109     -0.26        -0.26        -0.0355   |  49   236      1.62         3.29        4.4730
25   110     -0.24        -0.23        0.0000    |  50   525      5.90           -        14.7325

Compared to other types of graphical display, dotplots are most often used to plot frequency counts within a small number of categories, usually with small sets of data.

FIGURE 4.3: Dot plot for visualizing the data with outliers

In Figure 4.3 the extreme value at x = 525 has dragged the outlier cutoff $\bar{x} + 2s$ above the two points at x = 236, so only the point at x = 525 is caught as an outlier, even though the points at x = 236 are clearly also outliers.

4.3.4 Computation of Tukey's Method (Box Plot)

The results obtained for the data set with this method are summarized in Table 4.4.

TABLE 4.4: Tukey method outlier detection using the IQR

Sample size                         50
Lowest value                        70.0000
Highest value                       525.0000
Arithmetic mean                     126.3400
Median                              110.0000
Standard deviation                  67.5533
Coefficient of skewness             4.4967  (P < 0.0001)
Coefficient of kurtosis             25.1797 (P < 0.0001)
Suspected outliers (Tukey, 1977)
  Outside values                    236, 236
  Far-out values                    525

FIGURE 4.4: Box-and-whisker plot for visualizing outliers

For these data $Q_1 = 95.25$, $Q_3 = 133.5$ and $\mathrm{IQR} = Q_3 - Q_1 = 38.25$. Thus the inner fences are [37.875, 190.875] and the outer fences are [-19.5, 248.25]. The extreme values 236 and 525 are identified as probable outliers by this method: the two observations at 236 as outside values and 525 as a far-out value. Figure 4.4 is a boxplot of the data set. The central box represents the values from the lower to the upper quartile (25th to 75th percentile), and the middle line represents the median. The whiskers extend from the minimum to the maximum value, excluding outside and far-out values, which are displayed as separate points. An outside value is defined as a value that is smaller than the lower quartile minus 1.5 times the interquartile range, or larger than the upper quartile plus 1.5 times the interquartile range (inner fences). A far-out value is defined as a value that is smaller than the lower quartile minus 3 times the interquartile range, or larger than the upper quartile plus 3 times the interquartile range (outer fences).
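
The quartiles and fences quoted above can be checked with the glucose vector from the earlier sketch; note that boxplot.stats uses hinges rather than R's default quantiles, but for these data it flags the same three points.

```r
q     <- quantile(glucose, c(0.25, 0.75))        # Q1 = 95.25, Q3 = 133.50
iqr   <- unname(q[2] - q[1])                     # 38.25
inner <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)   #  37.875 190.875
outer <- c(q[1] - 3.0 * iqr, q[2] + 3.0 * iqr)   # -19.5   248.25

glucose[glucose < inner[1] | glucose > inner[2]] # beyond the inner fences: 236 236 525
boxplot.stats(glucose)$out                       # points a boxplot draws past the whiskers
```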

TABLE 4.5: Number of outliers detected by different outlier labeling methods

Method              Case / Rule        Cutoff value            Outliers
Z-Scores            I                  |Z_i| > 3               525
                    II                 |Z_i| > 3               236, 236
Modified Z-Scores   MAD                |M_i| > 3.5             525, 236, 236
MADe                2 MADe             Median +/- 2 MADe       169, 172, 175, 236, 236, 525
                    3 MADe             Median +/- 3 MADe       525, 236, 236
Tukey's Method      Outside values     [37.875, 190.875]       236, 236
                    Far-out values     [-19.5, 248.25]         525

4.4 Conclusion

The performance of the outlier labeling methods Z-score, modified Z-score, MADe and Tukey's method has been studied on a real data set to evaluate which of them detects and handles outliers most effectively. Most of these labeling methods use intervals to identify possible outliers and are effective under the normal distribution. The Z-score and Tukey methods are affected by the masking problem, and for this reason their detection sensitivity is lower. The MADe method is one of the most common ways of finding outliers in one-dimensional data: it marks as a potential outlier any point lying more than two (or three) MADe units, a robust analogue of standard deviations, away from the median. Both the MADe method and the modified Z-score are based on the MAD, and both identify the same three values, 525, 236 and 236, as outliers, while all of the methods agree that the most distant value, 525, is an outlier. Within the MADe method, the 2 MADe rule identifies six outliers (169, 172, 175, 236, 236 and 525) and the 3 MADe rule identifies three (525, 236 and 236). In the univariate case the median absolute deviation is one of the most robust measures of dispersion in the presence of outliers, and we therefore recommend the MADe method for outlier detection.