Aristotle University Thessaloniki Dept. of Mechanical Engineering Preprocessing, analyzing and modelling of AQ measurement data Kostas Karatzas Informatics Systems and Applications Environmental Informatics Research Group Dept. of Mechanical Engineering, Aristotle University, Thessaloniki, Greece Tel/Fax: +30 2310 994176 kkara@eng.auth.gr http://isag.meng.auth.gr
Contents: Preprocessing, Analysis, Modelling
Informatics Systems & Applications Group (ISAG) http://isag.meng.auth.gr Aristotle University Thessaloniki Dept. of Mechanical Engineering I. Preprocessing
Preprocessing. The goal is to identify and remove errors, and to prepare a reference data set that is available for further analysis and modelling.
Heterogeneity and quality of data; outlier identification; missing data handling.
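Missing values graphs. [Figure: missing values by day for each variable. x-axis: days (0 to 350). Variables: CO, NO2, O3, PM10, SO2, temperature (Θερμ.), humidity (Υγρ.), wind speed (Ταχ. Αν.), wind direction (Δ. Αν.), Στιγμ.]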
Missing values handling: removal and replacement of missing values; calculation of missing value(s). [Figure: PM10 concentration [μg/m³] around a gap marked "Missing value(s)". Panel (a): Data, Linear, Nearest, Splines. Panel (b): Data, LinReg, Nearest, SOM.] Missing value calculation examples: (a) interpolation methods and (b) predictive modelling.
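Two of the interpolation options named for panel (a), linear and nearest, can be sketched in a few lines. This is a minimal pure-Python illustration and not the procedure used to produce the figure: the function names `fill_linear` and `fill_nearest`, the use of `None` as the missing-value marker and the sample PM10 values are all assumptions made for the example.

```python
def fill_linear(values):
    """Fill None gaps by linear interpolation between the nearest valid neighbours.

    Gaps at the very start or end (no valid neighbour on one side) stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def fill_nearest(values):
    """Fill None gaps with the nearest valid sample (ties go to the earlier one)."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            nearest = min(known, key=lambda k: (abs(k - i), k))
            filled[i] = values[nearest]
    return filled


# Hypothetical hourly PM10 concentrations [ug/m3] with a two-sample gap.
pm10 = [80.0, 100.0, None, None, 160.0, 140.0]
print("linear: ", fill_linear(pm10))
print("nearest:", fill_nearest(pm10))
```

Linear interpolation gives a smooth bridge across the gap, while nearest-neighbour filling produces a step; which one is preferable depends on the gap length and on how quickly the pollutant varies.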
Graphs for identifying outliers. [Figure, panels (a) and (b): values of a variable plotted with the mean value (μ) and the μ-3σ and μ+3σ limits.] 2-D outlier detection. [Figure: (a) values of Variable 1 and of Variable 2; (b) values of Variable 2 against values of Variable 1.]
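The μ±3σ rule shown in the outlier graphs can be sketched as follows. The function name `three_sigma_outliers` and the sample series are hypothetical, and the population standard deviation is used here as an assumption; the slides do not say which estimator was applied.

```python
from statistics import mean, pstdev


def three_sigma_outliers(values):
    """Return (index, value) pairs lying outside [mu - 3*sigma, mu + 3*sigma]."""
    mu = mean(values)
    sigma = pstdev(values, mu)  # population standard deviation (assumption)
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical series: 19 ordinary readings and one spike at index 7.
series = [50.0, 52.0, 48.0, 51.0, 49.0] * 4
series[7] = 250.0
print(three_sigma_outliers(series))
```

Because μ and σ are themselves pulled by extreme values, a single large spike in a short series can widen the limits enough to hide smaller outliers, so the rule works best on reasonably long records.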
Harmonized data overview: descriptive statistics. The main goal is to calculate a basic set of statistical measures describing the measurements. Basic analysis and central tendency measures: mean value, median and most frequent value. Variation or dispersion measures: standard deviation. Shape measures: skewness, kurtosis.
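The two shape measures can be computed from central moments. The sketch below uses the common moment-based definitions (skewness m3/m2^1.5 and non-excess kurtosis m4/m2^2) as an assumption, since the slides do not state which variant is meant; the function names are hypothetical.

```python
from statistics import mean


def central_moment(values, order):
    """Population central moment of the given order."""
    mu = mean(values)
    return sum((v - mu) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (zero for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment-based, non-excess kurtosis m4 / m2**2 (close to 3 for normal data)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical right-skewed sample, as is typical for concentration data.
sample = [10.0, 12.0, 11.0, 13.0, 12.0, 40.0]
print("skewness:", round(skewness(sample), 2))
print("kurtosis:", round(kurtosis(sample), 2))
```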
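Descriptive statistics
Measure | Formula | Comment
Mean value | $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ | Sensitive to outliers
Median value | Middle value of the sorted observations | Not influenced by outliers
Trimmed mean | Mean of the observations that remain after discarding a fixed share of the lowest and highest values | Useful for measurements following the normal distribution. Robust to outliers
Mode | The most frequent value among observations | Useful in the presence of outliers
Geometric mean | $\left(\prod_{i=1}^{N} x_i\right)^{1/N}$ | Useful when measurements follow logarithmic or asymmetric distributions. Sensitive to outliers
Harmonic mean | $N \Big/ \sum_{i=1}^{N} \frac{1}{x_i}$ | Useful when measurements follow logarithmic or asymmetric distributions. Sensitive to outliers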
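The central tendency measures in this table are available in Python's standard `statistics` module, apart from the trimmed mean. The sketch below adds a hypothetical `trimmed_mean` helper that drops the same fraction of samples from each end; the 10% fraction and the PM10 readings are assumptions chosen for illustration.

```python
from statistics import geometric_mean, harmonic_mean, mean, median, mode


def trimmed_mean(values, fraction=0.1):
    """Mean after dropping `fraction` of the samples from each end of the sorted data."""
    ordered = sorted(values)
    cut = int(len(ordered) * fraction)
    return mean(ordered[cut:len(ordered) - cut])


# Hypothetical PM10 readings [ug/m3]; the last value is an outlier.
pm10 = [40.0, 42.0, 42.0, 45.0, 47.0, 50.0, 52.0, 55.0, 58.0, 300.0]

summary = {
    "mean": mean(pm10),
    "median": median(pm10),
    "trimmed mean (10%)": trimmed_mean(pm10, 0.1),
    "mode": mode(pm10),
    "geometric mean": geometric_mean(pm10),
    "harmonic mean": harmonic_mean(pm10),
}
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With the single outlier included, the arithmetic mean moves well above the median and the trimmed mean, which is the sensitivity the comments in the table describe.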
Descriptive statistics
Measure | Formula | Comment
Standard deviation (σ) | $\left(\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2\right)^{1/2}$ or $\left(\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2\right)^{1/2}$ | Useful for measurements following the normal distribution. Sensitive to outliers
Variance | $\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2$ or $\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2$ | Useful for measurements following the normal distribution. Sensitive to outliers
Mean absolute deviation | $\frac{1}{N}\sum_{i=1}^{N}\lvert x_i-\bar{x}\rvert$ | Useful for measurements following the normal distribution. Less sensitive to outliers than the standard deviation
Interquartile range | Range covering the middle 50% of the sorted values of x | Less representative if values follow the normal distribution. Robust to outliers
Range | $\max x_i - \min x_i$ | Very sensitive to outliers
$\bar{x}$: arithmetic mean
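The dispersion measures in this table can be computed with the standard `statistics` module plus three small helpers. The helper names and the sample data are hypothetical, and the interquartile range relies on the default quartile method of `statistics.quantiles`, which is one of several conventions.

```python
from statistics import mean, pstdev, pvariance, quantiles, stdev, variance


def mean_absolute_deviation(values):
    """(1/N) * sum(|x_i - mean|)."""
    centre = mean(values)
    return sum(abs(v - centre) for v in values) / len(values)


def interquartile_range(values):
    """Q3 - Q1: the spread of the central 50% of the sorted values."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def value_range(values):
    """max(x_i) - min(x_i)."""
    return max(values) - min(values)


# Hypothetical measurements.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

report = {
    "std (N-1)": stdev(data),
    "std (N)": pstdev(data),
    "variance (N-1)": variance(data),
    "variance (N)": pvariance(data),
    "mean absolute deviation": mean_absolute_deviation(data),
    "interquartile range": interquartile_range(data),
    "range": value_range(data),
}
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```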
Visual inspection. Basic time series graphs: per parameter; per group of parameters (AQ and meteo groups); one parameter, all institutes, versus reference measurements. We can thus identify common behavior between sensors and use it in the next step to group sensors and proceed with further analysis. Dispersion plots.
Time series analysis: de-trending. [Figure: O3 concentration [μg/m³] against days (0 to 400). Panel (a): time series and trend. Panel (b): de-trended series.] Identification and removal of the trend (2nd-degree polynomial) from the O3 time series: (a) before, (b) after.
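Removing a 2nd-degree polynomial trend amounts to a least-squares fit followed by a subtraction. The sketch below solves the normal equations directly in pure Python; the function names `fit_polynomial` and `detrend` and the synthetic O3 series are assumptions for illustration, and for long records a numerical library with a better-conditioned solver would normally be preferred.

```python
def fit_polynomial(t, y, degree=2):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    size = degree + 1
    # Normal equations A c = b with A[r][c] = sum(t^(r+c)) and b[r] = sum(y * t^r).
    a = [[sum(ti ** (r + c) for ti in t) for c in range(size)] for r in range(size)]
    b = [sum(yi * ti ** r for ti, yi in zip(t, y)) for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in reversed(range(size)):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def detrend(t, y, degree=2):
    """Return (trend, residual), where residual = y - fitted polynomial trend."""
    coeffs = fit_polynomial(t, y, degree)
    trend = [sum(c * ti ** p for p, c in enumerate(coeffs)) for ti in t]
    residual = [yi - tr for yi, tr in zip(y, trend)]
    return trend, residual


# Hypothetical daily O3 series: a quadratic trend plus an alternating +/-5 term.
days = [float(d) for d in range(20)]
o3 = [60.0 + 3.0 * d - 0.1 * d * d + (5.0 if int(d) % 2 == 0 else -5.0) for d in days]
trend, residual = detrend(days, o3)
print([round(r, 1) for r in residual])
```

The residual series fluctuates around zero, as in panel (b), and is the input for the periodicity and smoothing steps.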
Periodicity identification: identify and isolate periodicities. Running mean: $s_i = \frac{1}{k}\sum_{n=0}^{k-1} x_{i-n}$. [Figure: O3 concentration against days (0 to 400). Panel (a): k=5, k=15. Panel (b): α=0.1, α=0.5.] Smoothed O3 time series (after de-trending): (a) running mean (k=5, k=15) and (b) exponential smoothing (α=0.1, α=0.5).
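Both smoothers can be sketched briefly. The running mean follows the trailing-window formula on the slide; the exponential smoothing recursion is not given in the slides, so the common form s_0 = x_0, s_i = α·x_i + (1 − α)·s_{i−1} is used here as an assumption. Function names and sample data are hypothetical.

```python
def running_mean(x, k):
    """Trailing running mean s_i = (1/k) * sum_{n=0}^{k-1} x_{i-n}, for i >= k-1."""
    if not 1 <= k <= len(x):
        raise ValueError("k must satisfy 1 <= k <= len(x)")
    return [sum(x[i - k + 1:i + 1]) / k for i in range(k - 1, len(x))]


def exponential_smoothing(x, alpha):
    """Simple exponential smoothing: s_0 = x_0, s_i = alpha*x_i + (1-alpha)*s_{i-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [x[0]]
    for value in x[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical de-trended O3 values.
series = [4.0, -6.0, 5.0, -3.0, 8.0, -7.0, 2.0, -1.0]
print(running_mean(series, 5))
print(exponential_smoothing(series, 0.1))
print(exponential_smoothing(series, 0.5))
```

A larger window k or a smaller α gives a smoother curve that reacts more slowly to changes, which matches the contrast between the two curves in each panel.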
Normalization & useful transformations
Variance normalization (zero mean, unit variance): $x' = \frac{x - \mu_x}{\sigma_x}$
Logarithmic (for big differences): $x' = \ln(x - x_{\min} + 1)$
Trigonometric transformation (cyclic nature): $x' = 1 + \tan\left(x + \frac{\pi}{4}\right)$
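The first two transformations can be sketched as below, taking them as x' = (x − μ_x)/σ_x and x' = ln(x − x_min + 1). The function names are hypothetical and the population standard deviation is an assumption. Note that standardization yields zero mean and unit variance rather than a fixed value range.

```python
import math
from statistics import mean, pstdev


def standardize(x):
    """x' = (x - mu_x) / sigma_x: zero mean and unit variance."""
    mu = mean(x)
    sigma = pstdev(x, mu)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in x]


def log_transform(x):
    """x' = ln(x - x_min + 1): compresses large differences; the minimum maps to 0."""
    x_min = min(x)
    return [math.log(v - x_min + 1) for v in x]


# Hypothetical concentrations spanning two orders of magnitude.
values = [1.0, 10.0, 100.0, 5.0, 50.0]
print([round(v, 2) for v in standardize(values)])
print([round(v, 2) for v in log_transform(values)])
```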
II. Analysis
AQ data analysis. The goal is to identify the most important parameters and their basic relationships. Methods: covariance matrix, correlation coefficient matrix, information gain criterion, PCA, SOM, K-means clustering.
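The first two items, the covariance and correlation coefficient matrices, can be sketched in pure Python. The function names and the sample values for NO2, O3 and temperature are hypothetical; the sample covariance (N-1 denominator) and the Pearson coefficient are assumed, since the slides do not specify the variants.

```python
from statistics import mean


def covariance_matrix(columns):
    """Sample covariance matrix (N-1 denominator) for a dict of equally long series."""
    names = list(columns)
    n = len(columns[names[0]])
    centred = {}
    for name in names:
        centre = mean(columns[name])
        centred[name] = [v - centre for v in columns[name]]
    return {
        a: {b: sum(p * q for p, q in zip(centred[a], centred[b])) / (n - 1) for b in names}
        for a in names
    }


def correlation_matrix(columns):
    """Pearson coefficients r_ab = cov_ab / sqrt(cov_aa * cov_bb)."""
    cov = covariance_matrix(columns)
    return {a: {b: cov[a][b] / (cov[a][a] * cov[b][b]) ** 0.5 for b in cov} for a in cov}


# Hypothetical daily values.
measurements = {
    "NO2": [40.0, 55.0, 35.0, 60.0, 30.0, 50.0],
    "O3": [70.0, 50.0, 80.0, 45.0, 90.0, 60.0],
    "Temp": [22.0, 18.0, 25.0, 17.0, 28.0, 20.0],
}
for name, row in correlation_matrix(measurements).items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Strongly correlated pairs in such a matrix point to parameters that carry overlapping information, which is a useful screen before PCA or clustering.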
III. Modelling
Modelling: data-oriented modelling. The goal is behavior reproduction (descriptive modelling) and forecasting (predictive modelling). Algorithms: linear regression (just for reference), decision trees, ANNs, SVMs.
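As a reference point for the predictive case, a one-predictor linear regression can be sketched as follows. The choice of predictor (today's mean PM10 used to forecast tomorrow's), the data and the function names are all hypothetical and are not taken from the slides; the model is fitted on the first part of the record and evaluated on the rest.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) model to new predictor values."""
    intercept, slope = model
    return [intercept + slope * xi for xi in x]


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)) ** 0.5


# Hypothetical daily mean PM10 [ug/m3]; predict day t+1 from day t.
pm10 = [60.0, 64.0, 70.0, 66.0, 72.0, 80.0, 76.0, 82.0, 90.0, 85.0]
today, tomorrow = pm10[:-1], pm10[1:]
split = 6  # first 6 pairs for fitting, the remaining pairs for evaluation
model = fit_line(today[:split], tomorrow[:split])
forecast = predict(model, today[split:])
print("intercept, slope:", model)
print("test RMSE:", round(rmse(tomorrow[split:], forecast), 2))
```

Comparing observed and predicted values on rows the model has not seen is the same check that an observed-versus-predicted plot provides visually.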
[Figure: observed versus predicted values against row index (50 to 300) for two targets, PM10mean and O3max.]
And the final goal should be the development of new, innovative, personalized, georeferenced, quality-of-life-related, everyday-activity-associated services!
The facts (1/2). We already have services providing information on current air quality (monitoring stations) and AQ forecasts.
The facts (2/2). What we would like to receive as a service: everything else! That is: How will the quality of my life, in the place where I live, develop tomorrow? What have others like myself (biking like I do, commuting like I do, having everyday habits similar to mine) reported, and what are they expected to experience tomorrow? Which places and times of the day can I use to move around and, if possible, feel better?
Collaboration is the only way. Consider joint design, informatics and environmental workshops with hands-on sessions for service designers.
Thank you for your attention!