Big Data. Methodological issues in using Big Data for Official Statistics

Giulio Barcaroli, Istat

Big Data: Effective Processing and Analysis of Very Large and Unstructured Data for Official Statistics. Methodological issues in using Big Data for Official Statistics

2. Integrated use of Big Data with survey data

Under this scenario, the classical statistical survey is still used, but Big Data are used extensively alongside it (possibly together with census and administrative data). Big Data can be used:

1. as auxiliary information, added to the information already present in the frame, in order to increase the efficiency of the sample design;
2. as linked information (record linkage), in order to edit and impute statistical and administrative data;
3. as auxiliary information in model-assisted or model-based estimation procedures:
   a. to produce known totals in calibration estimation procedures (model-assisted);
   b. to define models linking auxiliary information (X) (Big Data, administrative data, census data) to the target variables (Y) of the survey (model-based), in order to increase the reliability of estimates.
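
As an illustration of use 3a, a minimal sketch of the simplest form of calibration (ratio calibration on a single auxiliary variable) is given below. The sample, the design weights and the known total are hypothetical values chosen for the example; they do not come from the survey, and real calibration procedures usually handle several auxiliary variables at once.

```python
def ratio_calibrate(design_weights, x_values, known_total_x):
    """Rescale design weights so that the weighted total of x equals a known total."""
    estimated_total_x = sum(d * x for d, x in zip(design_weights, x_values))
    factor = known_total_x / estimated_total_x
    return [d * factor for d in design_weights]


# Hypothetical toy sample: 4 enterprises, each representing 50 population units.
design_weights = [50.0, 50.0, 50.0, 50.0]
has_website = [1, 0, 1, 1]            # auxiliary variable X (e.g. obtained from Big Data)
sells_online = [1, 0, 0, 1]           # target variable Y (collected by questionnaire)
known_websites_in_population = 120    # hypothetical known total of X

weights = ratio_calibrate(design_weights, has_website, known_websites_in_population)

# Calibrated estimate of the population total of Y.
calibrated_total_y = sum(w * y for w, y in zip(weights, sells_online))
print(weights, calibrated_total_y)
```

After calibration, the weighted total of X reproduces the known total exactly, and the same weights are then applied to Y.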

Integrated use of Survey and Big Data: the case of the Survey on ICT usage

To illustrate this scenario in more detail, we consider a specific survey, the «Survey on the use of ICT by Enterprises», carried out in all Member States of the European Union. This is a subsection of the questionnaire:

[Figure: questionnaire excerpt, not preserved in the transcription]

Integrated use of Survey and Big Data: the case of the Survey on ICT usage

In Italy, the survey covers a universe of 211,851 enterprises with at least 10 employees by means of a sample survey involving 19,186 of them (2011). In the 2013 round of the survey, 8,687 enterprises indicated their website (45% of responding sample units). Accessing the indicated websites to gather information directly from them offers several opportunities.

Integrated use of Survey and Big Data: the case of the Survey on ICT usage

1. Action: substitute the traditional questionnaire-based collection technique with a new one based on the Internet as Data Source (IaD), for all suitable questions. Target: reduction of respondent burden.
2. Action: integrate the information collected via questionnaire with the information collected via IaD. Target: increase of accuracy of estimates.
3. Action: collect additional information. Target: increase the offer of statistical information.

Predictive approach

We assume, from now on, that our target is to increase the accuracy of estimates by using data originating from the Internet as auxiliary data. This particular case study is based on the use of textual data as auxiliary data. Texts are a perfect example of unstructured data, which is one of the characteristics of most Big Data. The same approach, possibly with variants, can also be adopted for other types of data, in particular categorical data.

First, the usual model-based approach is followed, which requires the prediction of values at unit level: under this approach, the target is to maximise the correctness of classification for each unit in the reference population. Next, a different approach is illustrated, in which the prediction of values at unit level is no longer required and the target becomes to directly maximise accuracy at the aggregate level (accuracy of estimates).

Predictive approach

In a predictive approach, the subset of data related to sampled respondent units can be considered as the labeled data, and supervised learning methods can be applied. In other words, the subset of 8,687 enterprises that indicated they have a website or a home page, and that also responded to questions [B8a : B8g], can be used as the training and test set on which different models are estimated in order to predict the answers to questions [B8a : B8g] for the whole reference population.

[Diagram elements: Texts (websites content); Text and data mining; Model; Survey microdata]

Predictive approach

In our case, we can apply any of the following supervised learning methods: Classification Trees; ensembles (Bagging, Boosting, Random Forests); Supervised Latent Dirichlet Allocation for classification (SLDA); Naïve Bayes; Neural Networks; Logistic Regression; Support Vector Machines.

Prediction using Naïve Bayes

Brett Lantz (Machine Learning with R): Classifiers based on Bayesian methods utilize training data to calculate an observed probability of each class based on feature values. When the classifier is used later on unlabeled data, it uses the observed probabilities to predict the most likely class for the new features. It's a simple idea, but it results in a method that often has results on par with more sophisticated algorithms. In fact, Bayesian classifiers have been used for text classification, such as junk (spam) filtering, author identification, or topic categorization.

Prediction using Naïve Bayes

Conditional probability with Bayes' Theorem

Bayes' Theorem defines the relationship between the conditional (posterior) probability, the likelihood, the prior probability and the marginal likelihood:

P(A | B) = P(B | A) P(A) / P(B)

For example, for a message containing the word Viagra:

P(spam | Viagra) = P(Viagra | spam) P(spam) / P(Viagra)

Prediction using Naïve Bayes

The Naïve Bayes algorithm

The Naïve Bayes algorithm is not the only statistical learning method based on Bayes' Theorem, but it is the most widely used, especially in text classification, where it can be considered a standard choice. It is called naïve because of its (simplistic) assumptions about the data. In particular, it assumes that all the features in a dataset are independent and equally important, a condition that seldom holds in real situations. In fact, the words in a text are not equally important for predicting the category to be associated with the text, and words are not independent of each other. Nevertheless, Naïve Bayes works well even though its basic assumptions are very seldom fulfilled.
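
A small numerical sketch of Bayes' Theorem may help here. The three input probabilities are hypothetical values chosen for illustration (a prior spam share of 20%, a word appearing in 20% of spam mails and in 5% of all mails); they are not taken from the slides.

```python
def posterior(likelihood, prior, marginal_likelihood):
    """Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / marginal_likelihood


# Hypothetical inputs for a single word observed in a message.
p_word_given_spam = 0.20   # likelihood P(word | spam)
p_spam = 0.20              # prior P(spam)
p_word = 0.05              # marginal likelihood P(word)

p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word)
print(round(p_spam_given_word, 3))
```

With these inputs the posterior probability of spam is about 0.8, four times the prior, because the word is four times more frequent in spam than in mail overall.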

The Naïve Bayes classification

As an example, suppose we want to build a classifier able to detect spam mail. As training set, we have 100 mails, with four words selected as the most important. As a first step, we construct a likelihood table for the appearance of these four words (W1, W2, W3, and W4):

[Likelihood table not preserved in the transcription]

Using Bayes' Theorem, we can estimate the probability that a new message is spam. Suppose that this new message contains the words Viagra and Unsubscribe, but not the other two words. Writing E = (e_1, e_2, e_3, e_4) for this evidence, where e_i records whether word W_i appears in the message, the probability that the message is spam is:

P(spam | E) = P(E | spam) P(spam) / P(E)

The Naïve Bayes classification

With only four different words, this formula is easy to compute. But when we have to consider, as in our case, hundreds or even thousands of different words, the computation becomes too onerous. In particular, we would have to consider all the possible interactions among the events, and an enormous training dataset would be required in order to estimate the marginal likelihoods. The independence among events assumed by Naïve Bayes makes it possible to simplify the computation. The formula becomes:

P(spam | E) ∝ P(e_1 | spam) P(e_2 | spam) P(e_3 | spam) P(e_4 | spam) P(spam)
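
The computation under the independence assumption can be sketched as follows. The likelihood table of the slide is not available, so the word counts below are hypothetical, and "Money" and "Groceries" are placeholder names for the two words that the slide does not name. The sketch computes the likelihood for each class and then normalises by the sum of the two likelihoods.

```python
# Hypothetical word counts in a training set of 20 spam and 80 non-spam ("ham") mails.
WORD_COUNTS = {
    "spam": {"Viagra": 4, "Money": 10, "Groceries": 0, "Unsubscribe": 12},
    "ham": {"Viagra": 1, "Money": 14, "Groceries": 8, "Unsubscribe": 23},
}
CLASS_TOTALS = {"spam": 20, "ham": 80}


def naive_likelihood(label, words_present):
    """Prior times the product of per-word probabilities, assuming independence."""
    n_class = CLASS_TOTALS[label]
    result = n_class / sum(CLASS_TOTALS.values())   # prior P(class)
    for word, count in WORD_COUNTS[label].items():
        p_word = count / n_class                    # P(word appears | class)
        result *= p_word if word in words_present else (1.0 - p_word)
    return result


message = {"Viagra", "Unsubscribe"}   # the other two words are absent
like_spam = naive_likelihood("spam", message)
like_ham = naive_likelihood("ham", message)

# Normalise so that the two class probabilities sum to one.
p_spam = like_spam / (like_spam + like_ham)
print(round(like_spam, 4), round(like_ham, 4), round(p_spam, 3))
```

Note that a word never observed in a class (a zero count) would force the likelihood of that class to zero whenever the word is present; practical implementations usually apply some smoothing to avoid this, which the sketch omits.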

The Naïve Bayes classification

To correctly calculate the probability that a message is spam, once we have calculated its likelihood with the previous formula, we also have to calculate the likelihood that it is not spam with the same formula. The probability that the message is spam is then given by the ratio of the likelihood of being spam to the sum of the two likelihoods. In general:

P(C_L | F_1, ..., F_n) = (1/Z) p(C_L) ∏_{i=1..n} p(F_i | C_L)

That is, the probability of level L of a categorical variable C, given the evidence provided by features F_1, ..., F_n, is given by the product of the probabilities of each piece of evidence F_i conditioned on the class level, the prior probability of the class level, and a scaling factor 1/Z (which converts the result into a probability).

Evaluation of the prediction model

As with other statistical learning methods, Naïve Bayes can be trained on a subset of the data (training dataset) and applied to another subset (test dataset) in order to verify its ability to correctly predict the values of the target variable (cross-validation). True values in the test dataset are known, and they can be compared with the predicted ones to produce the confusion (or error) matrix:

[Confusion matrix of observed against predicted values: TP = true positive, FN = false negative, FP = false positive, TN = true negative]
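
Building the confusion matrix from the test dataset can be sketched as below. The observed and predicted values are hypothetical, and the coding 1 = Yes, 2 = No follows the coding used for question B8a.

```python
def confusion_matrix(observed, predicted, positive=1):
    """Count TP, FN, FP and TN by comparing observed and predicted labels."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for obs, pred in zip(observed, predicted):
        if obs == positive and pred == positive:
            counts["TP"] += 1
        elif obs == positive:
            counts["FN"] += 1      # positive case predicted as negative
        elif pred == positive:
            counts["FP"] += 1      # negative case predicted as positive
        else:
            counts["TN"] += 1
    return counts


# Hypothetical test dataset (1 = Yes, 2 = No).
observed = [1, 1, 1, 2, 2, 2, 2, 2]
predicted = [1, 2, 1, 2, 1, 2, 2, 2]

cm = confusion_matrix(observed, predicted)
print(cm)
```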

Evaluation of the prediction model

From the error matrix it is possible to compute the following indicators:

Accuracy (precision) = (TP + TN) / Total: rate of correctly classified cases
False negatives rate = FN / (TP + FN): rate of positive cases misclassified as negative
False positives rate = FP / (FP + TN): rate of negative cases misclassified as positive

Evaluation of the prediction model

The application of Naïve Bayes to predict question B8a, "Online ordering or reservation or booking" (Yes/No):

[Confusion matrix of observed values (1 = Yes, 2 = No) against predicted values; cell counts not preserved in the transcription]

Accuracy: 0.77
False negatives rate: 0.49
False positives rate: [value not preserved in the transcription]
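
The three indicators can be computed directly from the four cells of the error matrix, as in the following sketch. The cell counts are hypothetical and are not the counts obtained for question B8a.

```python
def indicators(tp, fn, fp, tn):
    """Accuracy, false negatives rate and false positives rate from an error matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "false_negatives_rate": fn / (tp + fn),
        "false_positives_rate": fp / (fp + tn),
    }


# Hypothetical error matrix with 100 test cases.
result = indicators(tp=12, fn=8, fp=10, tn=70)
print(result)
```

In this hypothetical matrix the accuracy is high (0.82) even though 40% of the positive cases are missed, which is the kind of imbalance discussed for B8a.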

Prediction and estimation

The results of applying Naïve Bayes to the test data are acceptable in terms of accuracy (77%), but not in terms of the false negatives rate (49%). In any case, even when the misclassification rates are acceptable, if false positives and false negatives are not balanced in absolute terms, the distribution of predicted values can differ significantly from the distribution of observed values. For instance, in our case the proportion of websites enabling e-commerce is 19.3% among observed cases, while it is 23.3% among predicted cases.

In statistical surveys, we are not interested in predicting individual cases; the usual target is to estimate parameters of the population. In general, other limits of the full predictive approach are:

1. if the labeled dataset (used as training dataset) is not a random sample of the target population, final estimates can be severely biased;
2. standard supervised learners assume that the data generation process is of the kind P(D | S) (where D is the target variable and S the set of explanatory variables), but in the real world it is the reverse: the values of D generate the values of S;
3. another assumption is that the class of learners chosen for modeling P(D | S) includes the true model, but there is no guarantee of that.
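
The link between misclassification rates and the distribution of predicted values can be made explicit with a short calculation. The sketch below assumes that the reported 19.3% and 23.3% refer to the same test data as the reported accuracy and false negatives rate; under that assumption, the false positives rate and the accuracy implied by the other figures can be derived. The derived values are a consistency check, not figures taken from the slides.

```python
# Figures reported for question B8a.
observed_share = 0.193        # share of "Yes" among observed cases
predicted_share = 0.233       # share of "Yes" among predicted cases
false_negatives_rate = 0.49

sensitivity = 1.0 - false_negatives_rate

# Predicted "Yes" cases are true positives plus false positives:
#   predicted_share = sensitivity * observed_share + fpr * (1 - observed_share)
implied_fpr = (predicted_share - sensitivity * observed_share) / (1.0 - observed_share)

# Accuracy is the share of true positives plus the share of true negatives.
implied_accuracy = (sensitivity * observed_share
                    + (1.0 - implied_fpr) * (1.0 - observed_share))

print(round(implied_fpr, 3), round(implied_accuracy, 3))
```

The implied accuracy is close to the reported 0.77, so the reported figures are mutually consistent, and the implied false positives rate is roughly 0.17. Because negative cases are far more numerous than positive ones, this modest rate still produces more false positives than there are false negatives, which is why the predicted share exceeds the observed one.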

Corrected aggregations of individual predictions

[Table: confusion matrix of observed values (1 = Yes, 2 = No) against predicted values, with sensitivity, specificity, the proportion of e-commerce from the aggregation of observed values, the proportion of e-commerce from the aggregation of predicted values, and the corrected proportion of e-commerce; numerical values not preserved in the transcription]
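
The formula used on the slide is not preserved. A common way to correct an aggregated proportion using sensitivity and specificity is sketched below; it is an assumption that this is the correction the slide refers to. The sensitivity of 0.51 follows from the reported false negatives rate of 0.49, while the specificity of 0.833 is an assumed value, not one reported in the slides.

```python
def corrected_proportion(predicted_share, sensitivity, specificity):
    """Invert predicted = sens * p + (1 - spec) * (1 - p) to recover p."""
    false_positives_rate = 1.0 - specificity
    denominator = sensitivity - false_positives_rate
    if denominator <= 0:
        raise ValueError("the classifier must do better than random guessing")
    estimate = (predicted_share - false_positives_rate) / denominator
    return min(1.0, max(0.0, estimate))   # keep the result inside [0, 1]


# 0.233 is the reported share among predicted cases; specificity is assumed.
corrected = corrected_proportion(0.233, sensitivity=0.51, specificity=0.833)
print(round(corrected, 3))
```

With these inputs the corrected proportion is about 0.19, close to the share among observed cases. The correction relies on sensitivity and specificity measured on labeled data remaining valid for the data to which the classifier is applied.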

Estimation of proportions without individual predictions

This method has been evaluated (with simulated datasets) against classical statistical learners and has outperformed them, in particular in situations where the distribution of the target variable in the test dataset was significantly different from that in the training dataset. This means that the method is particularly robust with respect to the characteristics of the labeled dataset, which is of the utmost importance when this dataset is not a random sample of the reference population (the typical case for Big Data).
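
The idea behind this kind of method can be sketched for a binary target variable. The distribution of text profiles S in the unlabeled data is written as P(S) = P(S | D) P(D), where P(S | D) is estimated from the labeled data, and the equation is solved for P(D) by least squares, without classifying any unit. The sketch is a strong simplification under stated assumptions: the profiles and datasets are hypothetical, and the published Hopkins-King method is more elaborate than this.

```python
from collections import Counter


def profile_shares(profiles):
    """Relative frequency of each text profile in a list of profiles."""
    counts = Counter(profiles)
    return {profile: n / len(profiles) for profile, n in counts.items()}


def estimate_positive_share(labeled, unlabeled_profiles):
    """Least-squares solution of P(S) = p * P(S | D=1) + (1 - p) * P(S | D=0)."""
    pos = profile_shares([s for s, d in labeled if d])
    neg = profile_shares([s for s, d in labeled if not d])
    target = profile_shares(unlabeled_profiles)
    numerator = denominator = 0.0
    for s in set(pos) | set(neg) | set(target):
        a, b, y = pos.get(s, 0.0), neg.get(s, 0.0), target.get(s, 0.0)
        numerator += (y - b) * (a - b)
        denominator += (a - b) ** 2
    if denominator == 0:
        raise ValueError("profiles do not distinguish the two classes")
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical profiles: presence (1) or absence (0) of two word stems.
A, B = (1, 0), (0, 1)
labeled = [(A, True)] * 3 + [(B, True)] + [(A, False)] + [(B, False)] * 3
unlabeled = [A] * 4 + [B] * 6

estimated_share = estimate_positive_share(labeled, unlabeled)
print(round(estimated_share, 3))
```

In this toy example half of the labeled units are positive, yet the estimate for the unlabeled data is 0.3, the share that actually generates the mix of 40% A and 60% B profiles. This illustrates why the approach does not require the labeled dataset to have the same distribution of D as the population.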

Application to the survey on ICT usage (1)

We assumed that our target is to increase the accuracy of estimates by using data originating from the Internet as auxiliary data. It is then possible to proceed as follows:

1. the whole set of about 8,600 texts scraped from the websites indicated by the roughly 19,000 enterprises participating in the ICT usage survey can be considered as the labeled dataset;
2. web scraping is also applied to all websites owned by the roughly 211,000 enterprises in the population (a similar percentage of websites, 45%, is to be expected);
3. the total of about 95,000 expected texts can be used to estimate P(D) for all suitable variables in the questionnaire.

It is possible to calculate the Root Mean Square Error for each of the estimated proportions. An increase in the quality of these estimates is to be expected, compared with the estimates obtained directly from the survey, which are affected by sampling errors.

[Diagram: Application to the survey on ICT usage to produce estimates of proportions in the population. Elements: Big Data, Internet as Data Source; population frame (ASIA); reference population of 212,000 enterprises; 95,000 websites; sample selection; sampled units (19,000 enterprises); data collection; microdata; 1. web scraping (websites); texts + labels; HK method; estimates]

Application to the survey on ICT usage (2)

Another important potential target is the characterisation of units in the Business Register. For instance, we may want to identify all the enterprises that own a website and use it as an instrument for e-commerce. This subset can be used as a new (sub)population frame (of the enterprises offering e-commerce) in order to investigate them with a new, specific survey.

With this different target, the usual method based on prediction at individual level (classification of single units) must be resumed. It can be improved, however, by taking into account the proportions produced by the previously described method of estimating proportions without individual predictions. In fact, the aggregation of all units with a particular predicted value (e-commerce = yes), resulting from the use of a learner, can be compared with the estimated proportion obtained by applying the Hopkins-King method. The threshold for the assignment of values (Yes or No) can then be tuned so that the aggregations become coherent with the estimated proportions.

[Diagram: Application to the population frame to identify the sub-population of interest. Elements: Big Data, Internet as Data Source; population frame (ASIA); reference population of 212,000 enterprises; 95,000 websites; 1. web scraping (websites); texts + labels; HK method; estimates; 2. text and data mining; predictive model]
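
One simple way to tune the threshold is sketched below: units are ranked by the score produced by the learner, and the threshold is set so that the share of units assigned "Yes" matches the proportion estimated at the aggregate level. The scores and the target proportion are hypothetical, and this ranking rule is only one possible way to make the two results coherent.

```python
def threshold_for_share(scores, target_share):
    """Score threshold such that about target_share of units score at or above it."""
    if not scores:
        raise ValueError("scores must not be empty")
    n_positive = round(target_share * len(scores))
    if n_positive <= 0:
        return float("inf")           # no unit is assigned "Yes"
    ranked = sorted(scores, reverse=True)
    return ranked[min(n_positive, len(ranked)) - 1]


# Hypothetical predicted probabilities of e-commerce = yes for ten enterprises.
scores = [0.91, 0.15, 0.42, 0.77, 0.08, 0.55, 0.30, 0.66, 0.21, 0.12]
estimated_proportion = 0.2            # hypothetical aggregate estimate

threshold = threshold_for_share(scores, estimated_proportion)
assigned_yes = [score >= threshold for score in scores]
print(threshold, sum(assigned_yes))
```

With tied scores at the threshold, slightly more units than intended may be assigned "Yes"; a real application would need a rule for breaking such ties.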

References

Brett Lantz. Machine Learning with R. Packt Publishing Limited (2013).
D. J. Hopkins, G. King. A Method of Automated Nonparametric Content Analysis for Social Science. American Journal of Political Science, Vol. 54, No. 1, January 2010.
G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics (2013).


More information

Web Customer Modeling for Automated Session Prioritization on High Traffic Sites

Web Customer Modeling for Automated Session Prioritization on High Traffic Sites Web Customer Modeling for Automated Session Prioritization on High Traffic Sites Nicolas Poggi 1, Toni Moreno 2,3, Josep Lluis Berral 1, Ricard Gavaldà 4, and Jordi Torres 1,2 1 Computer Architecture Department,

More information
