Identification of local multivariate outliers

Identification of local multivariate outliers
Anne Ruiz-Gazen and Christine Thomas-Agnan
Gremaq, TSE and IMT, Toulouse, France (in collaboration with Peter Filzmoser)
SSIAB - Avignon - 11/05/12

Introduction
In robust statistics, an observation is considered outlying if it differs from the main bulk of the data set.
$$F_\varepsilon = (1 - \varepsilon) F + \varepsilon G$$
In the case of continuous attributes, the main bulk of the data set is assumed to follow an elliptical distribution $F$ (e.g. Gaussian), and the outlying observations follow a distribution $G$ (e.g. a point mass).
Objective: identify gross errors and atypical observations, taking into account the multivariate and spatial nature of the data.
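The contamination model above can be illustrated by simulation. The following is only a sketch under stated assumptions: the choice of a standard bivariate Gaussian for $F$, a point mass at (6, 6) for $G$, and the function name `sample_contaminated` are illustrative and do not come from the slides.

```python
import numpy as np

def sample_contaminated(n, eps, rng):
    """Draw n points from F_eps = (1 - eps) F + eps G.

    Assumptions (illustrative only): F is a standard bivariate Gaussian
    and G is a point mass at (6, 6).
    """
    is_outlier = rng.random(n) < eps        # each point comes from G with probability eps
    x = rng.standard_normal((n, 2))         # draws from the bulk distribution F
    x[is_outlier] = np.array([6.0, 6.0])    # contaminated points are replaced by the point mass
    return x, is_outlier

rng = np.random.default_rng(0)
x, is_outlier = sample_contaminated(1000, 0.05, rng)
print("number of contaminated points:", int(is_outlier.sum()))
```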

Introduction
[Map of the Kola project area, covering parts of Norway, Finland and Russia between the Barents Sea and the White Sea. Legend: mine, in production; mine, closed down; important mineral occurrence, not developed; smelter, production of mineral concentrate; city, town, settlement; project boundary.]
The Kola project: concentration measurements for more than 50 chemical elements in four layers and 617 observations.
Data available in the R package mvoutlier by M. Gschwandtner and P. Filzmoser.

1 Detection of outliers in a non spatial context
   Detection of univariate outliers
   Detection of multivariate outliers
2 Spatial outliers
   Global and local outliers
   Identification of univariate spatial outliers
3 Identification of multivariate spatial outliers
   Variocloud of pairwise Mahalanobis distances
   Toy example
   Quantile geographical-variate plot

Detection of univariate outliers
Let us consider a data set $x$ of size $n \times p$, with $n$ observations $x_i$ and $p$ variables.
In one dimension ($p = 1$), the detection of outliers is often based on the standardized value (Grubbs, 1969):
$$\frac{x_i - \bar{x}}{\sigma_x}$$
Masking effect: outliers may distort the empirical mean and standard deviation to such an extent that the outliers themselves are not detected.
Robust version: $\bar{x}$ and $\sigma_x$ are replaced by robust estimators such as the median and the MAD.
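The masking effect and its robust remedy can be seen on a small made-up sample. This is a minimal sketch, not the authors' code: the data values, the cut-off of 2 and the Gaussian consistency factor 1.4826 applied to the MAD are assumptions chosen for illustration.

```python
import numpy as np

def classical_scores(x):
    """Standardize with the empirical mean and standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

def robust_scores(x):
    """Standardize with the median and the MAD.

    The factor 1.4826 (an assumption here) makes the MAD consistent
    with the standard deviation under a Gaussian model.
    """
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))
    return (x - med) / mad

# Eight regular values around 10 and two gross errors (made-up data).
x = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0, 25.0, 26.0])

cutoff = 2.0
classical_flags = np.where(np.abs(classical_scores(x)) > cutoff)[0]
robust_flags = np.where(np.abs(robust_scores(x)) > cutoff)[0]

print("flagged by mean/sd:   ", classical_flags)   # the two gross errors mask each other
print("flagged by median/MAD:", robust_flags)
```

In this sample the two large values inflate the standard deviation enough that neither exceeds the classical cut-off, while the median/MAD version flags both.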

Detection of univariate outliers
[Fig.: Histogram of log(arsenic), x-axis log10(As), with the limits given by mean ± 2s, median ± 2 MAD and the boxplot.]

Detection of multivariate outliers
In the multivariate case ($p > 1$), detection is based on Mahalanobis distances to the center of the data set.
Let $t$ be a location estimator ($p \times 1$) and $C$ a dispersion matrix estimator ($p \times p$) of the distribution of the main bulk of the data.
$$\mathrm{MD}(x_i, t, C) = \left\{ (x_i - t)^\top C^{-1} (x_i - t) \right\}^{1/2}$$
Multivariate outliers are associated with large values of the Mahalanobis distance.
In the Gaussian case $N(\mu, \Sigma)$, the $\mathrm{MD}^2(x_i, \mu, \Sigma)$ follow a chi-square distribution with $p$ degrees of freedom, and a commonly used cut-off value is the 95% quantile of this chi-square distribution.
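The distance and the chi-square cut-off can be computed as follows. This is a sketch under assumptions: the simulated data, the five planted outliers and the helper name `mahalanobis_sq` are illustrative, and the classical mean and covariance are used for $t$ and $C$ only to keep the example short.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_sq(x, t, C):
    """Squared Mahalanobis distance of each row of x to the center t."""
    diff = x - t
    # Solve C z = diff^T instead of forming the explicit inverse of C.
    return np.einsum("ij,ji->i", diff, np.linalg.solve(C, diff.T))

rng = np.random.default_rng(1)
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=500)
x[:5] = [4.0, -4.0]                      # planted outliers, against the correlation

t = x.mean(axis=0)                       # classical location estimate
C = np.cov(x, rowvar=False)              # classical dispersion estimate
md2 = mahalanobis_sq(x, t, C)

cutoff = chi2.ppf(0.95, df=x.shape[1])   # 95% quantile of chi-square with p degrees of freedom
flagged = np.where(md2 > cutoff)[0]
print("cut-off on MD^2:", round(float(cutoff), 3))
print("number of flagged observations:", len(flagged))
```

Because the cut-off is the 95% quantile, roughly 5% of regular Gaussian observations are also expected to exceed it.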

Detection of multivariate outliers (Rousseeuw and Van Zomeren, 1990)
Use robust estimators $t$ and $C$, such as the Minimum Covariance Determinant (MCD) estimators: look for the subset of data points (e.g. 75%) whose covariance matrix has the smallest determinant.
R packages mvoutlier and rrcov.
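The idea of searching for a low-determinant subset can be sketched with concentration steps: repeatedly keep the $h$ points closest to the current estimate. This is a simplified illustration, not the algorithm implemented in mvoutlier or rrcov. The function name `mcd_sketch`, the random-restart scheme and the simulated data are assumptions, and no consistency or reweighting correction is applied to the raw estimate.

```python
import numpy as np

def mahalanobis_sq(x, t, C):
    """Squared Mahalanobis distance of each row of x to the center t."""
    diff = x - t
    return np.einsum("ij,ji->i", diff, np.linalg.solve(C, diff.T))

def mcd_sketch(x, h, n_starts=20, max_iter=50, seed=0):
    """Approximate raw MCD location and scatter by concentration steps."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    best_det, best_t, best_C = np.inf, None, None
    for _ in range(n_starts):
        subset = rng.choice(n, size=h, replace=False)   # random starting subset of size h
        for _ in range(max_iter):
            t = x[subset].mean(axis=0)
            C = np.cov(x[subset], rowvar=False)
            # Concentration step: keep the h points closest to (t, C).
            new_subset = np.argsort(mahalanobis_sq(x, t, C))[:h]
            if np.array_equal(np.sort(new_subset), np.sort(subset)):
                break                                   # subset is stable
            subset = new_subset
        det = np.linalg.det(C)
        if det < best_det:                              # keep the smallest determinant
            best_det, best_t, best_C = det, t, C
    return best_t, best_C

rng = np.random.default_rng(2)
clean = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=170)
outliers = rng.multivariate_normal([5.0, -5.0], 0.25 * np.eye(2), size=30)
x = np.vstack([clean, outliers])

h = int(0.75 * len(x))                                  # 75% subset, as on the slide
t_mcd, C_mcd = mcd_sketch(x, h)
t_cls, C_cls = x.mean(axis=0), np.cov(x, rowvar=False)
print("classical center:", t_cls.round(2), " MCD-type center:", t_mcd.round(2))
```

In this simulated setting the classical center is pulled toward the outlying cluster, while the subset-based center stays near the bulk of the data.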

Detection of multivariate outliers
In two dimensions: scatterplot with ellipsoids for a quantile of order 95%, corresponding to the non-robust estimators $t$ and $C$ (blue) and to the MCD estimators (red).

Global and local outliers
Global outlier: relative to the distribution of the whole data set (whole area of study).
Local outlier: relative to the sub-distribution associated with the observation and its neighborhood.
Underlying assumption: positive spatial autocorrelation.

Exploratory plots for identifying univariate spatial outliers
Neighbor plot
Moran plot
Drift map
Angle plot
Variocloud
R package GeoXp by C. Thomas et al. (2012).

Variocloud of pairwise Mahalanobis distances
$$\mathrm{MD}(x_i, x_j, C) = \left\{ (x_i - x_j)^\top C^{-1} (x_i - x_j) \right\}^{1/2}$$
Draw a variocloud, replacing the absolute pairwise differences with robust pairwise Mahalanobis distances (MCD covariance estimator).
Draw only part of the cloud and summarize the rest by conditional quantile curves.
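The quantities plotted in such a variocloud can be computed as in the sketch below. Everything specific here is an assumption made for illustration: the 6 x 6 grid of sites, the attribute values, the altered unit 14, the neighborhood radius 1.2 and the 95% threshold. For brevity the classical covariance stands in for the MCD estimator named on the slide.

```python
import numpy as np

def variocloud_pairs(coords, x, C):
    """For all pairs i < j, return indices, spatial distance and pairwise MD."""
    i, j = np.triu_indices(len(x), k=1)
    spatial = np.linalg.norm(coords[i] - coords[j], axis=1)
    diff = x[i] - x[j]
    md = np.sqrt(np.einsum("ij,ji->i", diff, np.linalg.solve(C, diff.T)))
    return i, j, spatial, md

rng = np.random.default_rng(4)
gx, gy = np.meshgrid(np.arange(6.0), np.arange(6.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])     # 36 sites on a regular grid
x = coords + 0.1 * rng.standard_normal(coords.shape)   # spatially smooth attributes
x[14] += [3.0, -3.0]                                   # one unit made locally atypical

C = np.cov(x, rowvar=False)                            # stand-in for a robust (MCD) estimate
i, j, spatial, md = variocloud_pairs(coords, x, C)

# Pairs of interest: small spatial distance but large pairwise Mahalanobis distance.
local = spatial < 1.2
selected = local & (md > np.quantile(md[local], 0.95))

# The unit involved in most selected pairs is the main candidate local outlier.
counts = np.bincount(np.concatenate([i[selected], j[selected]]), minlength=len(x))
suspect = int(counts.argmax())
print("most frequently selected unit:", suspect)
```

Counting how often a unit appears among the selected pairs is one simple way to move from atypical pairs to atypical units; it is a choice made for this sketch, not a step described on the slides.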

Example of multivariate variocloud
Selected units: small spatial distances and high pairwise Mahalanobis distance.
[Fig.: Multivariate variocloud. Panels: pairwise Mahalanobis distances and pairwise spatial distances; map of the units by X coordinate and Y coordinate.]

A small toy example
[Fig.: Toy example. Panels: the units by X coordinate and Y coordinate; the units in the space of the variables.]
Comparing pairwise geographical distance and pairwise distance in the space of the non-spatial attributes.

Variocloud for the toy example
[Fig.: Toy example. Panels: pairwise Mahalanobis distances and pairwise spatial distances; map of the units by X coordinate and Y coordinate.]

Distribution of the pairwise Mahalanobis distance
When the observations $X_1, \dots, X_n$ are i.i.d. with a normal distribution $N(\mu, \Sigma)$, we can prove that, conditional on one observation $X_i$, the squared pairwise Mahalanobis distance $\mathrm{MD}^2(X_i, X_j, \Sigma)$ of $X_j$, $j \neq i$, follows a non-central chi-square distribution with $p$ degrees of freedom and non-centrality parameter $\mathrm{MD}^2(X_i, \mu, \Sigma)$.
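This distributional result can be checked numerically. The following sketch fixes one conditioning observation and compares simulated squared pairwise distances with the non-central chi-square quantiles from `scipy.stats.ncx2`. The values of $\mu$, $\Sigma$, the conditioning point and the sample size are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(3)
p = 2
mu = np.zeros(p)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

x_i = np.array([2.0, -1.0])                        # fixed conditioning observation
nc = (x_i - mu) @ Sigma_inv @ (x_i - mu)           # non-centrality: MD^2(x_i, mu, Sigma)

# Simulate the other observations and their squared pairwise MD to x_i.
x_j = rng.multivariate_normal(mu, Sigma, size=20000)
diff = x_j - x_i
md2 = np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff)

probs = [0.25, 0.5, 0.75]
q_theory = ncx2.ppf(probs, df=p, nc=nc)            # non-central chi-square quantiles
q_empirical = np.quantile(md2, probs)
print("theoretical quantiles:", q_theory.round(2))
print("empirical quantiles:  ", q_empirical.round(2))
```

The non-centrality grows with the distance of the conditioning observation from the center, so an observation far from the bulk is expected to have large pairwise distances to most other observations, even when it is not a local outlier.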

Distribution of pairwise MD: toy example
[Fig.: Toy example. Three histograms (density of pairwise MD^2): filled square, filled triangle, filled circle.]

Quantile geographical-variate plot
[Fig.: Toy example. Two panels in the space of the variables (x-axis: Variable 1).]

Quantile geographical-variate plot
[Fig.: Toy example. Two panels, regular observations and outlying observations. Axes: degree of isolation and percentage of next 10 neighbors.]

Quantile geographical-variate plot on Kola example
[Fig.: Kola example. Two panels, regular observations and outlying observations. Axes: quantile of non-central chi-square distribution and percentage of neighbors inside tolerance ellipse.]

Conclusion
Thank you for your attention!