Preprocessing, analyzing and modelling of AQ measurement data

Aristotle University Thessaloniki, Dept. of Mechanical Engineering
Preprocessing, analyzing and modelling of AQ measurement data
Kostas Karatzas
Informatics Systems and Applications, Environmental Informatics Research Group
Dept. of Mechanical Engineering, Aristotle University, Thessaloniki, Greece
Tel/Fax: +30 2310 994176
kkara@eng.auth.gr
http://isag.meng.auth.gr

Contents
- Preprocessing
- Analysis
- Modelling

Informatics Systems & Applications Group (ISAG), http://isag.meng.auth.gr
Aristotle University Thessaloniki, Dept. of Mechanical Engineering
I. Preprocessing

Preprocessing
The goal is to:
- Identify and remove errors
- Prepare a reference data set, available for further analysis and modelling

- Heterogeneity and quality of data
- Outlier identification
- Missing data handling

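Missing values graphs
[Figure: missing values per parameter against days (0 to 350). Parameters: CO, NO2, O3, PM10, SO2, temperature, humidity, wind speed, wind direction, and a further series labelled "Στιγμ.".]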

Missing values handling
- Removal and replacement of missing values
- Calculation of missing value(s)
[Figure: PM10 concentration [μg/m³] with the missing value(s) marked. Panel (a): Data, Linear, Nearest, Splines. Panel (b): Data, LinReg, Nearest, SOM.]
Missing value calculation examples: (a) interpolation methods and (b) predictive modelling
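The "Linear" and "Nearest" interpolation options named on this slide can be sketched in a few lines. This is only an illustrative sketch, not the method used to produce the figure: the function names `fill_linear` and `fill_nearest`, the use of `None` to mark a missing value, and the sample PM10 numbers are all assumptions.

```python
from typing import List, Optional


def fill_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None gaps by linear interpolation between the closest known samples.

    Gaps at the start or end of the series copy the closest known value.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return filled  # nothing to interpolate from
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def fill_nearest(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None gaps with the closest known sample (ties go to the earlier one)."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return filled
    for i, v in enumerate(values):
        if v is None:
            nearest = min(known, key=lambda k: (abs(k - i), k))
            filled[i] = values[nearest]
    return filled


# Hypothetical hourly PM10 values with a two-sample gap.
pm10 = [80.0, None, None, 110.0, 120.0]
pm10_linear = fill_linear(pm10)
pm10_nearest = fill_nearest(pm10)
```

Linear interpolation suits short gaps in a smoothly varying series; for longer gaps the predictive-modelling route in panel (b) is usually the more defensible choice.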

Graphs for identifying outliers
[Figure: panels (a) and (b): values of a variable with the mean value (μ) and the μ-3σ and μ+3σ limits.]
2-D outlier detection
[Figure: panel (a): Variable 1 and Variable 2; panel (b): values of Variable 2 against values of Variable 1.]
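The μ±3σ limits drawn in these graphs correspond to a simple labelling rule, sketched below. The function name `three_sigma_outliers`, the use of the population standard deviation, and the sample data are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def three_sigma_outliers(values: List[float], n_sigma: float = 3.0) -> List[int]:
    """Return the indices of values lying outside mean +/- n_sigma * std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    lower = mu - n_sigma * sigma
    upper = mu + n_sigma * sigma
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical series: 30 ordinary readings followed by one spike.
readings = [48.0, 52.0, 50.0, 49.0, 51.0] * 6 + [500.0]
outlier_indices = three_sigma_outliers(readings)
```

Because the mean and σ are themselves inflated by the outliers they are meant to detect, this rule can miss points in short or heavily contaminated series; median-based limits are a common alternative.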

Harmonized data overview: descriptive statistics
The main goal is to calculate a basic set of statistical measures describing the measurements:
- Basic analysis and central tendency measures: mean value, median and most frequent value (mode)
- Variation or dispersion measures: standard deviation
- Shape measures: skewness, kurtosis

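Descriptive statistics: central tendency measures (measure, formula, comment; $x_i$, $i = 1, \dots, N$, are the observations)
- Mean value: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$. Sensitive to outliers.
- Median value: the middle value of the sorted observations. Not influenced by outliers.
- Trimmed mean: the mean after discarding a fixed fraction of the smallest and largest observations. Useful for measurements following the normal distribution. Robust to outliers.
- Mode: the most frequent value among the observations. Useful in the presence of outliers.
- Geometric mean: $\left(\prod_{i=1}^{N} x_i\right)^{1/N}$. Useful when measurements follow logarithmic or asymmetric distributions. Sensitive to outliers.
- Harmonic mean: $N \big/ \sum_{i=1}^{N} \frac{1}{x_i}$. Useful when measurements follow logarithmic or asymmetric distributions. Sensitive to outliers.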
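To see how differently these measures react to a single extreme value, the sketch below computes all six for a small sample using Python's standard `statistics` module (Python 3.8 or later for `geometric_mean`). The helper `trimmed_mean`, its 10% default and the PM10 numbers are assumptions for illustration.

```python
import statistics
from typing import List


def trimmed_mean(values: List[float], proportion: float = 0.1) -> float:
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut at each end
    return statistics.mean(ordered[k:len(ordered) - k])


# Hypothetical PM10 sample with one extreme value (300).
pm10 = [40.0, 42.0, 42.0, 45.0, 47.0, 50.0, 52.0, 55.0, 58.0, 300.0]

summary = {
    "mean": statistics.mean(pm10),
    "median": statistics.median(pm10),
    "trimmed_mean": trimmed_mean(pm10, 0.1),
    "mode": statistics.mode(pm10),
    "geometric_mean": statistics.geometric_mean(pm10),
    "harmonic_mean": statistics.harmonic_mean(pm10),
}

for name, value in summary.items():
    print(f"{name:>15}: {value:.2f}")
```

In this sample the single value of 300 pulls the mean well above the median and the trimmed mean, which is the sensitivity the comments above describe.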

Descriptive statistics: dispersion measures (measure, formula, comment; $\bar{x}$ is the arithmetic mean)
- Standard deviation: $s = \left[\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2\right]^{1/2}$ or $\sigma = \left[\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2\right]^{1/2}$. Useful for measurements following the normal distribution. Sensitive to outliers.
- Variance: $\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$ or $\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$. Useful for measurements following the normal distribution. Sensitive to outliers.
- Mean absolute deviation: $\frac{1}{N}\sum_{i=1}^{N}\lvert x_i - \bar{x}\rvert$. Useful for measurements following the normal distribution. Less sensitive to outliers than the standard deviation.
- Interquartile range: the range covered by the central 50% of the sorted values of $x$. Less representative if values follow the normal distribution. Robust to outliers.
- Range: $\max x_i - \min x_i$. Very sensitive to outliers.
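The dispersion measures in this table map directly onto Python's standard `statistics` module, as the sketch below shows. The helper names, the O3 sample values and the quartile method (the default "exclusive" method of `statistics.quantiles`, Python 3.8 or later) are assumptions; other quartile conventions give slightly different interquartile ranges.

```python
import statistics
from typing import List


def mean_absolute_deviation(values: List[float]) -> float:
    """(1/N) * sum(|x_i - mean|)."""
    centre = statistics.mean(values)
    return sum(abs(v - centre) for v in values) / len(values)


def interquartile_range(values: List[float]) -> float:
    """Distance between the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical O3 sample.
o3 = [60.0, 62.0, 65.0, 67.0, 70.0, 72.0, 75.0, 78.0]

dispersion = {
    "sample_std": statistics.stdev(o3),               # 1/(N-1) form
    "population_std": statistics.pstdev(o3),          # 1/N form
    "sample_variance": statistics.variance(o3),
    "population_variance": statistics.pvariance(o3),
    "mean_absolute_deviation": mean_absolute_deviation(o3),
    "interquartile_range": interquartile_range(o3),
    "range": max(o3) - min(o3),
}
```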

Visual inspection
- Basic time series graphs
  - Per parameter
  - Per group of parameters (AQ and meteo groups)
  - One parameter, all institutes, versus reference measurements
We can thus identify common behavior between sensors and use it in the next step to group sensors and proceed with further analysis.
- Dispersion plots

Time series analysis: de-trending
[Figure: O3 concentration [μg/m³] against days (0 to 400). Panel (a): time series and trend. Panel (b): de-trended series.]
Identification and removal of the trend (2nd-degree polynomial) from the O3 time series: (a) before, (b) after
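Removing a 2nd-degree polynomial trend amounts to a least-squares fit followed by a subtraction. The sketch below does this with the normal equations in plain Python; it is an illustration under assumptions, not the procedure behind the figure. The function names, the rescaling of time to [0, 1] (added here for numerical stability) and the synthetic series are all hypothetical, and the time values are assumed not to be all identical.

```python
from typing import List, Sequence, Tuple


def fit_polynomial(t: Sequence[float], y: Sequence[float], degree: int = 2) -> List[float]:
    """Least-squares polynomial coefficients c[0] + c[1]*t + ... via normal equations."""
    n = degree + 1
    a = [[sum(ti ** (j + k) for ti in t) for k in range(n)] for j in range(n)]
    b = [sum(yi * ti ** j for ti, yi in zip(t, y)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def detrend(t: Sequence[float], y: Sequence[float], degree: int = 2) -> Tuple[List[float], List[float]]:
    """Return (trend, residual) where residual = y - fitted polynomial trend."""
    t0, span = min(t), max(t) - min(t)
    u = [(ti - t0) / span for ti in t]  # rescale time to [0, 1]
    coeffs = fit_polynomial(u, y, degree)
    trend = [sum(c * ui ** p for p, c in enumerate(coeffs)) for ui in u]
    residual = [yi - tr for yi, tr in zip(y, trend)]
    return trend, residual


# Synthetic series that is exactly quadratic in time (no noise).
days = list(range(0, 400, 10))
o3 = [30.0 + 0.4 * d - 0.001 * d * d for d in days]
o3_trend, o3_residual = detrend(days, o3)
```

On real measurements the residual keeps the seasonal and short-term variation, which is the input for the periodicity analysis.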

Periodicity identification
Identify and isolate periodicities
Running mean over a window of k values: $s_i = \frac{1}{k}\sum_{n=0}^{k-1} x_{i-n}$
[Figure: O3 concentration against days (0 to 400). Panel (a): k=5, k=15. Panel (b): α=0.1, α=0.5.]
Smoothed O3 time series (after de-trending): (a) running mean (k=5, k=15) and (b) exponential smoothing (α=0.1, α=0.5).
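Both smoothers named on this slide are short enough to write out. The running mean below follows the trailing-window formula given on the slide. The exponential smoother uses the common recursion $s_0 = x_0$, $s_i = \alpha x_i + (1-\alpha) s_{i-1}$, which is an assumption because the slide gives only the α values. Function names are hypothetical and the input is assumed to be non-empty.

```python
from typing import List, Sequence


def running_mean(x: Sequence[float], k: int) -> List[float]:
    """Trailing running mean: s_i = (1/k) * sum of x[i-k+1 .. i], for i >= k-1."""
    if k < 1 or k > len(x):
        raise ValueError("k must be between 1 and len(x)")
    return [sum(x[i - k + 1:i + 1]) / k for i in range(k - 1, len(x))]


def exponential_smoothing(x: Sequence[float], alpha: float) -> List[float]:
    """Simple exponential smoothing, seeded with the first observation."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [x[0]]
    for value in x[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


window_means = running_mean([1.0, 2.0, 3.0, 4.0, 5.0], k=3)
exp_smoothed = exponential_smoothing([10.0, 20.0, 20.0], alpha=0.5)
```

A larger k or a smaller α gives a smoother curve that reacts more slowly, which matches the contrast between the k=5 and k=15 curves and between the α=0.5 and α=0.1 curves.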

Normalization & useful transformations
- Variance normalization (zero mean and unit variance): $x' = \frac{x - \mu_x}{\sigma_x}$
- Logarithmic (for big differences): $x' = \ln(x - x_{\min} + 1)$
- Trigonometric transformation (cyclic nature): $x' = 1 + \tan\left(x + \frac{\pi}{4}\right)$
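The first two transformations can be sketched as follows. Two assumptions are made here: the variance normalization uses the population standard deviation, and the logarithmic transform is read as ln(x - x_min + 1), which keeps the argument of the logarithm at 1 or above. The function names are hypothetical, and the input to `standardize` is assumed not to be constant.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(x: Sequence[float]) -> List[float]:
    """Variance normalization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(x), pstdev(x)
    return [(v - mu) / sigma for v in x]


def log_transform(x: Sequence[float]) -> List[float]:
    """Logarithmic transform ln(x - x_min + 1); the smallest value maps to 0."""
    x_min = min(x)
    return [math.log(v - x_min + 1.0) for v in x]


z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
logged = log_transform([10.0, 10.0 + math.e - 1.0])
```

After `standardize` the values have zero mean and unit standard deviation but are not confined to a fixed interval; if a strict [0, 1] range is needed, min-max scaling is the usual alternative.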

II. Analysis

AQ data analysis
The goal is to identify the most important parameters and their basic relationships:
- Covariance matrix
- Correlation coefficient matrix
- Information gain criterion
- PCA
- SOM
- K-means clustering
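As a small illustration of the first two items, the sketch below computes a sample covariance and a Pearson correlation coefficient matrix for a few parameters. The function names, the dictionary layout and the three short series are hypothetical; each series is assumed to have the same length and a non-zero variance.

```python
from statistics import mean
from typing import Dict, List, Sequence


def covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample covariance with the 1/(N-1) normalization."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pearson correlation coefficient for every pair of named series."""
    std = {name: covariance(col, col) ** 0.5 for name, col in columns.items()}
    return {
        a: {
            b: covariance(columns[a], columns[b]) / (std[a] * std[b])
            for b in columns
        }
        for a in columns
    }


# Hypothetical, perfectly (anti)correlated series chosen to keep the example readable.
measurements = {
    "NO2": [20.0, 30.0, 40.0, 50.0],
    "O3": [80.0, 60.0, 40.0, 20.0],
    "CO": [0.2, 0.3, 0.4, 0.5],
}
corr = correlation_matrix(measurements)
```

Strongly correlated pairs in such a matrix point to parameters that carry overlapping information, which helps when selecting inputs for the later modelling step.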

III. Modelling

Modelling
Data-oriented modelling. The goal is:
- Behavior reproduction (descriptive modelling)
- Forecasting (predictive modelling)
Algorithms:
- Linear regression (just for reference)
- Decision trees
- ANNs
- SVMs
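The linear regression reference model can be as small as a one-predictor least-squares fit scored with the root-mean-square error. The sketch below is a minimal baseline under assumptions: the function names, the single predictor and the toy numbers are hypothetical and do not come from the slides.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(x: Sequence[float], slope: float, intercept: float) -> List[float]:
    """Apply the fitted line to new predictor values."""
    return [slope * a + intercept for a in x]


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted values."""
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)) ** 0.5


# Toy data lying exactly on y = 2x + 1.
predictor = [1.0, 2.0, 3.0, 4.0]
target = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(predictor, target)
baseline_error = rmse(target, predict(predictor, slope, intercept))
```

Scoring the decision trees, ANNs and SVMs with the same error measure on the same held-out data shows how much they improve on this baseline.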

[Figure: observed and predicted values against row index (50 to 300). Panel titles: PM10mean *[+1u]* and O3max *[+1u]*.]

And the final goal should be the development of new, innovative, personalized, georeferenced services that relate to quality of life and are associated with everyday activity!

The facts (1/2)
We already have services providing information on:
- Current air quality (monitoring stations)
- AQ forecasts

The facts (2/2)
What we would like to receive as a service: everything else! That is:
- How will the quality of my life develop tomorrow in the place where I live?
- What have others like myself reported, and what are they expected to experience tomorrow?
  - Biking like I do
  - Commuting like I do
  - Having everyday habits similar to mine
- Which places and times of the day can I use to move around and, if possible, feel better?

Collaboration is the only way
Consider joint design, informatics and environmental workshops with hands-on sessions for service designers

Thank you for your attention!