Methodological challenges of Big Data for official statistics

Piet Daas, Statistics Netherlands

THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Content
- Big Data: properties
- Big Data: strengths and weaknesses
- Ways to include Big Data in official statistics
- Important methodological issues
- And other relevant issues
- Examples
- Concluding remarks

Big Data: properties
These highly affect the methodological issues:
- Kind of data
- Characteristics
- Types of data

Big Data: Kind of data
For an NSI there are two kinds of data:
- Primary data: our own questionnaires
- Secondary data: data from others
  - Administrative sources
  - Big Data
Many Big Data sources are predominantly composed of events.

Big Data: Characteristics
There are more V's than the usual Volume, Velocity and Variety.

Big Data: Types of data
In principle, 3 types of Big Data can be discerned:
1. Social network data (human sourced)
   - Facebook, Twitter, blogs, text messages, etc.
2. Traditional system data (process mediated)
   - Bank transaction data, credit cards, medical data
3. Internet of things data (machine generated)
   - Sensor data, GPS data, satellite pictures, etc.
However, this is still work in progress.
Important message: don't think all Big Data sources are alike (do not generalize a priori).

Big Data: strengths and weaknesses (1)
Pros:
- Quickly available (near real time)
- Lots and lots of data
- High-frequency measurements
- Usually includes many units

Big Data: strengths and weaknesses (2)
Cons:
- Often composed of events (caused by units)
- Even these units may not be similar to statistical units
- (Sometimes) indirect measurements of concepts
- Volatile and noisy (signal-to-noise ratio)
- Can be selective (what part of the target population is included?)
- Stability of source and data (effect of maintainers and users)
- Metadata is not always well described or accurate
More in: The Parable of Google Flu: Traps in Big Data Analysis.

How can Big Data be included in official statistics?
1. As the only source (replacement/new statistics)
   - Traffic intensity statistics (NL) and the Billion Prices Project (MIT)
2. As the main source, with survey/admin. data as benchmark
   - Google Trends-like approaches; (regular) benchmarking needed
3. As an additional source for survey/admin. data based statistics
   - For example, to enable small area estimation
4. As a supplier of missing data
   - For example, use data on level of education from the internet to fill gaps in the education register
   - But also for nowcasting and to reduce timeliness
5. Don't use it

Google Flu prediction
We can learn from this: models, correlations and changing realities.

Most important methodological issues
From the above it follows that the essential issues are:
1. Transform events into units (the population frame)
2. Combine Big Data findings with traditional data
   - At the unit level or by correlating, etc.
3. Measure selectivity and correct for it
   - Also related to the stability of the data
4. Determine and improve quality & reduce noise
Is this all?

Statistics is all about populations
But: not every Big Data source provides information on the units in the data source.
- Background characteristics (auxiliary variables) are often absent
- For example: what are the background characteristics of social media users? Take a moment to look this up online!

Most important methodological issues (continued)
1. Transform events into units (the population frame)
2. Combine Big Data findings with traditional data
   - At the unit level or by correlating, etc.
3. Measure selectivity and correct for it
   - Also related to the stability of the data
4. Determine and improve quality & reduce noise
5. Find ways to obtain/derive background characteristics of units in Big Data sources (when needed)

Examples of the current methodological state of the art
1. Transforming events into units: road sensor data
- Sensors measure the number of vehicles passing on a specific lane of a specific road, in a specific direction, during a specific minute
- What is the population frame of traffic intensity statistics? The roads!
- Therefore one needs to link the sensors to the roads, so that the number of vehicles per km of road can be calculated

Road sensor data example
- There are 20,000 sensors active on Dutch highways
- They count the number of passing vehicles every minute
- Used to produce road intensity statistics
- Output: vehicle-kilometres (= number of vehicles × distance travelled) per highway per COROP area
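A minimal sketch of the vehicle-kilometre computation, assuming made-up counts and a hypothetical sensor-to-segment link table (column names such as sensor_id, segment_id and length_km are illustrative, not the actual Statistics Netherlands schema):

    import pandas as pd

    # Per-minute vehicle counts per sensor (made-up numbers)
    counts = pd.DataFrame({
        "sensor_id": [1, 1, 2, 2],
        "minute": pd.to_datetime(["2014-06-02 08:00", "2014-06-02 08:01",
                                  "2014-06-02 08:00", "2014-06-02 08:01"]),
        "vehicles": [30, 28, 45, 40],
    })

    # Link table: which road segment each sensor measures, and its length in km
    segments = pd.DataFrame({
        "sensor_id": [1, 2],
        "segment_id": ["A2-left", "A2-right"],
        "length_km": [2.5, 3.0],
    })

    # Sum the counts per sensor, attach the segment length, compute vehicle-kilometres
    totals = counts.groupby("sensor_id", as_index=False)["vehicles"].sum()
    totals = totals.merge(segments, on="sensor_id")
    totals["vehicle_km"] = totals["vehicles"] * totals["length_km"]
    print(totals[["segment_id", "vehicle_km"]])
    print("total vehicle-km:", totals["vehicle_km"].sum())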

[Figure: data per road sensor]

Correction method for road sensors
Compared with statistics on persons:
- Frame: target population (unit = person) vs. roads (unit = km)
- Sample: a sample of persons vs. all road sensors
- Data collection: questionnaire vs. road sensor data
- Weights: based on (demographic) background characteristics vs. based on road segment length

[Figure: road sensors along a main route; the route is divided into road segments, whose lengths serve as weights]

Correction
For each traffic flow, multiply the number of vehicles by the length of the road segment it covers and sum over the segments; in the worked example on the slide this adds up to 107,500 vehicle-km.

Examples of the current methodological state of the art
2. Combine Big Data findings with traditional data
A. Compare series of developments (macro level)
B. Link data sources (micro level)
Examples of A:
- Consumer confidence and social media sentiment (monthly)
- GDP and traffic intensities (quarterly)
Both Big Data sources measure a related phenomenon (correlation 0.92/0.91).

Correlation example: Dutch GDP and Dutch traffic
- A 3% increase in GDP corresponds to a 12% increase in traffic
- Traffic is ahead of GDP by 1 quarter
- Correlation: 82% from 2010-Q3 till 2014-Q4, 91% from 2011-Q2 till 2014-Q4
- (A code sketch of this lag-and-correlate check follows after this slide)
[Figure: GDP and traffic series, and GDP vs. traffic scatter plot]

Examples of the current methodological state of the art
3. Measure selectivity and correct for it
Background characteristics are needed (see issue 5 when they are absent).
1. In the case of the road sensors, not all roads were found to be covered by a sufficient number of sensors
   - One road segment had a maximum of 4 sensors, but the number of active sensors was found to vary over time
   - Another road had 8 sensors, with 7 on the right and 1 on the left lane
   Poor coverage and occasionally malfunctioning sensors (missing data) cause bad estimates when not corrected for.
2. Apply model-based and algorithmic ways of inference
   - Use (advanced) model-based approaches (statistics meets data science)
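A minimal sketch of the lag-and-correlate check mentioned above, on synthetic quarterly series; the real GDP and traffic intensity figures are not reproduced here:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    quarters = pd.period_range("2011Q2", "2014Q4", freq="Q")

    # Synthetic quarterly growth rates, constructed so that traffic leads GDP by one quarter
    gdp = pd.Series(rng.normal(0.5, 1.0, len(quarters)), index=quarters)
    traffic = 4 * gdp.shift(-1) + rng.normal(0.0, 0.5, len(quarters))

    # Align the series by lagging traffic one quarter, then correlate
    print("correlation (traffic lagged 1 quarter):", gdp.corr(traffic.shift(1)))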

1. Road sensors: poor coverage
The number of active sensors may vary over time; poor coverage reduces estimates of the number of vehicles.
[Figure: active sensors on a road at times t1 and t2]

Algorithmic inference (1)
Does the soup taste good? A non-probability sample (Big Data) has unknown inclusion probabilities; the result can be precise but biased. How to do inference?

2. Algorithmic inference (2)
Compared the results of pseudo-design based, model based and algorithmic inference methods on a generated non-probability sample of vehicle odometer values:
- Sample mean (SAM)
- Pseudo-design based (PDB)
- Generalized Linear Model (GLM)
- k-Nearest Neighbours (KNN)
- Artificial Neural Network (ANN)
- Regression Tree (RTR)
- Support Vector Machine (SVM)
See Buelens et al.

2. Algorithmic inference (3)
- Sample mean (SAM): the average of all observed units, $\hat{\bar{y}}_{\mathrm{SAM}} = \frac{1}{n}\sum_{i \in s} y_i$
- Pseudo-design based (PDB): the average of the units observed in each stratum h, weighted by the population stratum sizes, $\hat{\bar{y}}_{\mathrm{PDB}} = \frac{1}{N}\sum_{h} N_h \bar{y}_{s_h}$
- Generalized Linear Model (GLM): a linearized combination of auxiliary variables, $\hat{y}_i = g^{-1}(x_i'\hat{\beta})$, with the predictions averaged over the population
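A rough sketch of the SAM, PDB and GLM estimators on made-up data (an ordinary least squares fit stands in for the GLM, and the data are not the odometer values used in the actual study):

    import numpy as np
    import pandas as pd

    # Hypothetical population frame U with auxiliary variable x and a stratum,
    # plus a selective non-probability sample s drawn from it
    rng = np.random.default_rng(1)
    U = pd.DataFrame({"x": rng.uniform(0, 20, 5000)})
    U["stratum"] = pd.cut(U["x"], bins=4, labels=False)
    U["y"] = 1000 + 800 * U["x"] + rng.normal(0, 500, len(U))   # true values, unknown in practice
    s = U.sample(800, weights=U["x"] + 1, random_state=1)        # overrepresents high-x units

    def sam(s):
        # Sample mean: the average of all observed units
        return s["y"].mean()

    def pdb(s, U):
        # Pseudo-design based: sample stratum means weighted by population stratum sizes
        # (assumes every stratum contains at least one observed unit)
        strat_means = s.groupby("stratum")["y"].mean()
        strat_sizes = U.groupby("stratum").size()
        return (strat_means * strat_sizes).sum() / len(U)

    def glm(s, U):
        # Model based: fit y on the auxiliary variable(s), predict for the whole frame
        X_s = np.column_stack([np.ones(len(s)), s["x"]])
        beta, *_ = np.linalg.lstsq(X_s, s["y"], rcond=None)
        X_U = np.column_stack([np.ones(len(U)), U["x"]])
        return float((X_U @ beta).mean())

    print("true mean:", U["y"].mean())
    print("SAM:", sam(s), "PDB:", pdb(s, U), "GLM:", glm(s, U))

Because the sample is selective, the plain sample mean is biased upwards here, while PDB and GLM use the auxiliary information to correct for it.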

2. Algorithmic inference (4)
- k-Nearest Neighbours (KNN): the average of the k most closely associated observed units in the space of the auxiliary variables X
- Artificial Neural Network (ANN): the result of a trained network of artificial neurons
[Figure: an artificial neuron and a network of artificial neurons (Wikipedia)]

2. Algorithmic inference (5)
- Regression Tree (RTR): construct a binary tree that maximizes the between-node variance, with a stop criterion; the estimate is the average of the observed units in each leaf l, $\hat{\bar{y}}_{\mathrm{RTR}} = \frac{1}{N}\sum_{l} N_l \bar{y}_{s_l}$, an algorithmic version of PDB
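A sketch of the KNN and RTR estimators with scikit-learn, assuming DataFrames s (the observed sample) and U (the population frame) with columns x and y as in the previous sketch:

    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor

    def knn(s, U, k=5):
        # Average of the k nearest observed units in the space of the auxiliary variable(s)
        model = KNeighborsRegressor(n_neighbors=k).fit(s[["x"]], s["y"])
        return model.predict(U[["x"]]).mean()

    def rtr(s, U, min_leaf=25):
        # Regression tree: predictions are leaf averages of observed units (algorithmic PDB)
        model = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=0)
        model.fit(s[["x"]], s["y"])
        return model.predict(U[["x"]]).mean()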

2. Algorithmic inference (6)
- Support Vector Machine (SVM): represents the data as points in space, mapped so that they are divided by a clear gap made as wide as possible

2. Algorithmic inference (7)
Model optimisation:
- Split the sample into a training set (70%) and a test set (30%)
- Train the model with the training set data
- Use the trained model to predict the test set
- Compare the predictions with the observed test data
- Choose the parameter with the smallest MSE
Estimation:
- Train the model with the whole sample
- Predict for the rest of the data set (the part of the target population outside the sample)
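A sketch of this optimisation-and-estimation workflow, again on made-up data; the k of a k-nearest-neighbours model plays the role of the parameter tuned by test-set MSE:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical frame U and selective sample s, as in the earlier sketches
    rng = np.random.default_rng(2)
    U = pd.DataFrame({"x": rng.uniform(0, 20, 5000)})
    U["y"] = 1000 + 800 * U["x"] + rng.normal(0, 500, len(U))
    s = U.sample(800, weights=U["x"] + 1, random_state=2)

    # Model optimisation: 70/30 split, keep the parameter with the smallest test MSE
    train, test = train_test_split(s, test_size=0.3, random_state=0)
    best_k, best_mse = None, float("inf")
    for k in (1, 5, 10, 25, 50):
        m = KNeighborsRegressor(n_neighbors=k).fit(train[["x"]], train["y"])
        mse = mean_squared_error(test["y"], m.predict(test[["x"]]))
        if mse < best_mse:
            best_k, best_mse = k, mse

    # Estimation: retrain on the whole sample and predict for the rest of the frame
    final = KNeighborsRegressor(n_neighbors=best_k).fit(s[["x"]], s["y"])
    rest = U.loc[~U.index.isin(s.index)]
    estimate = (s["y"].sum() + final.predict(rest[["x"]]).sum()) / len(U)
    print("estimated population mean:", estimate, "true mean:", U["y"].mean())

Note that the observed units keep their own values in the estimate; only the unobserved rest of the frame is predicted.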

2. Algorithmic inference (8)
Illustration of findings: a non-probability sample of odometer values of cars is used to estimate the kilometres driven per registration year, with extrapolation beyond the observed range and variance obtained by bootstrapping.
[Figure: estimated kilometres driven per registration year]

Examples of the current methodological state of the art
4. Determine and improve quality & reduce noise
- Determining the quality of massive amounts of data is hard
  - Event oriented or unit oriented?
- Improve quality by removing noise/increasing the signal
  - Remove the non-relevant part of the population (persons/companies), impute missing data, aggregate data over a longer period, etc. (combine approaches)
  - Road sensors often miss data for various minutes; imputing these values and applying a filter to smooth the data improves quality
  - Aggregating social media sentiment data reduces noise, as does applying a (Kalman) filter; both improve the correlation with consumer confidence
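A minimal sketch of imputing missing minutes and smoothing a sensor series; a simple rolling mean stands in here for the Bayesian/Kalman filter used in practice, and the counts are made up:

    import numpy as np
    import pandas as pd

    # One sensor, one morning of per-minute counts with some minutes missing
    rng = np.random.default_rng(3)
    minutes = pd.date_range("2014-06-02 07:00", periods=120, freq="min")
    counts = pd.Series(rng.poisson(35, len(minutes)).astype(float), index=minutes)
    counts.iloc[rng.choice(len(minutes), 15, replace=False)] = np.nan

    imputed = counts.interpolate(method="time")                               # impute missing minutes
    smoothed = imputed.rolling(window=5, center=True, min_periods=1).mean()   # reduce noise
    print(smoothed.head())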

Measuring quality
Event-based quality indicators (sensor data) should be determined rapidly, given the huge amounts of data, e.g. (a sketch follows below):
- number of measurements per day for each sensor (L)
- number of blocks of missing data per day for each sensor (B)
- mean number of vehicles detected per day for each sensor (M)
- number of zero measurements per day for each sensor (O)

Improving quality
Imputing missing values and smoothing (Bayesian filter). The filter does not introduce extra errors: precision 3.6%, accuracy +0.13%.
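A sketch of the four event-based indicators, computed per sensor per day from a long table of minute measurements (the column names sensor_id, timestamp and vehicles are assumptions):

    import pandas as pd

    def daily_indicators(df):
        # df: one row per sensor per minute; NaN in 'vehicles' marks a missing measurement
        df = df.assign(day=df["timestamp"].dt.date, missing=df["vehicles"].isna())
        # A block of missing data starts where 'missing' flips from False to True
        prev = df.groupby(["sensor_id", "day"])["missing"].shift(fill_value=False)
        df["new_block"] = df["missing"] & ~prev
        g = df.groupby(["sensor_id", "day"])
        return pd.DataFrame({
            "L": g["vehicles"].count(),                               # measurements per day
            "B": g["new_block"].sum(),                                # blocks of missing data
            "M": g["vehicles"].mean(),                                # mean vehicles detected
            "O": g["vehicles"].apply(lambda v: int((v == 0).sum())),  # zero measurements
        })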

Examples of the current methodological state of the art
5. Find ways to obtain/derive background characteristics of units in Big Data sources
- Make use of the massive amount of data to find clues indicative of important background characteristics of units
- Use AI/machine learning approaches
- Example: determining the gender of social media users; we studied a sample of 1000 Twitter user accounts in the Netherlands

Background characteristics of units: an example with Dutch Twitter users
- Only a part of the Dutch population is active on Twitter, but which part?
- Determine background characteristics such as gender, age, income, level of education, etc.
- What are the possibilities? Feature extraction is the way to go
- Let's look at gender

Four sources of clues on a Twitter profile: 1) name, 2) short bio, 3) message content, 4) picture.

Studied a Twitter sample
From a list of Dutch Twitter users (~ ), a random sample of 1000 unique IDs was drawn. Of the sample:
- 844 profiles still existed
- 844 had a name
- 583 provided a short bio
- 473 created tweets
- 804 had a non-default picture (i.e. not the default Twitter picture)
Sample composition: 409 men (49%), 282 women (33%), 153 others (18%): companies, organizations, dogs, cats, bots.

Gender findings: 1) First name
- Used the Dutch Voornamenbank website (first name database)
- Score between 0 and 1 (female to male); 676 of 844 (80%) names were registered
- Unknown names scored -1 (usually companies/organizations)

Gender findings: 2) Short bio
- If a short bio is provided, some people mention their position in the family: mother, father, papa, mama, son of, etc.
- 155 of 583 (27%) indicated their gender in the short bio (especially women!)
- Need to check both English and Dutch texts
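A toy sketch of the first-name and short-bio rules; the name scores below are invented (the real approach queried the Voornamenbank) and the keyword lists are only a small illustrative subset:

    # Invented example scores: 0 = female, 1 = male, as in the Voornamenbank scale
    name_scores = {"jan": 0.98, "piet": 0.97, "anna": 0.02, "sanne": 0.03}

    def name_score(first_name):
        # Unknown names score -1 (often companies or organizations)
        return name_scores.get(first_name.lower(), -1)

    FEMALE_WORDS = {"mother", "mama", "moeder", "dochter"}   # check English and Dutch terms
    MALE_WORDS = {"father", "papa", "vader", "zoon"}

    def bio_gender(bio):
        # Look for family-position words in the short bio
        words = set(bio.lower().split())
        if words & FEMALE_WORDS:
            return "female"
        if words & MALE_WORDS:
            return "male"
        return None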

Gender findings: 3) Tweet content
- In cooperation with the University of Twente (Dong Nguyen)
- A machine learning approach that determines a gender-specific writing style score
- Language specific: messages need to be in Dutch
- 437 of 473 (92%) persons that created tweets could be classified

Gender findings: 4) Profile picture
Use OpenCV to process the pictures:
1) Face recognition
2) Standardisation of faces (resize & rotate)
3) Classify faces according to gender
Of the 804 profile pictures, 75% had one or more faces on them.
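A sketch of the face-detection and standardisation steps with OpenCV (cv2), assuming the opencv-python package, which ships the standard Haar cascade files; rotation and the gender classification of the detected faces are not shown:

    import cv2

    face_detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def faces_in_profile_picture(path, size=(64, 64)):
        img = cv2.imread(path)
        if img is None:
            return []
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        # Standardise: crop each detected face and resize it to a fixed size
        return [cv2.resize(gray[y:y + h, x:x + w], size) for (x, y, w, h) in faces]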

Gender findings: overall results
Diagnostic Odds Ratio (log):
- First name: 4.33
- Short bio
- Tweet content
- Picture (faces): 0.57
Diagnostic Odds Ratio = (TP/FN) / (FP/TN); random guessing corresponds to log(DOR) = 0.
These are multi-agent findings: clever ways are needed to combine them, taking the processing efficiency of each agent into consideration.

Gender findings: combining approaches
Combine the findings in the best possible way:

  Unassigned     Approach used
  844 (100%)     1. Use short bio scores (very precise for females)
  689 (82%)      2. Use first name scores
  153 (18%)      3. Use tweet content
  29 (3.4%)      4. Use picture
  20 (2.4%)      5. Assign male gender

The final log(DOR) is 7.02, an accuracy of 96.5%.
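A sketch of the cascade described above (fall back to the next source whenever the current one gives no verdict) together with the diagnostic odds ratio; the guess arguments are assumed to be "male", "female" or None:

    import math

    def combined_gender(bio_guess, name_guess, tweet_guess, picture_guess):
        # Use the most precise source first; fall back to the next one if it gives no verdict
        for guess in (bio_guess, name_guess, tweet_guess, picture_guess):
            if guess is not None:
                return guess
        return "male"          # final fallback on the slide: assign male gender

    def log_dor(tp, fn, fp, tn):
        # Diagnostic Odds Ratio = (TP/FN) / (FP/TN); log(DOR) = 0 is random guessing
        return math.log((tp / fn) / (fp / tn))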

Concluding remarks
- Big Data has great potential for official statistics
- There are many challenges: using Big Data is not like using survey or admin. data
- There are new methodological challenges, such as extracting features and extracting information from texts and images/videos
- Learn from others by looking at scientific areas such as Artificial Intelligence and Machine Learning

Questions?