Data Mining Applications with R


Yanchang Zhao
Senior Data Miner, RDataMining.com, Australia

Yonghua Cen
Associate Professor, Nanjing University of Science and Technology, China

ELSEVIER
AMSTERDAM, BOSTON, HEIDELBERG, LONDON, NEW YORK, OXFORD, PARIS, SAN DIEGO, SAN FRANCISCO, SYDNEY, TOKYO

Academic Press is an imprint of Elsevier

Contents

Preface
Acknowledgments
Review Committee
Foreword

Chapter 1: Power Grid Data Analysis with R and Hadoop
1.1 Introduction
1.2 A Brief Overview of the Power Grid
1.3 Introduction to MapReduce, Hadoop, and RHIPE
  1.3.1 MapReduce
  1.3.2 Hadoop
  1.3.3 RHIPE: R with Hadoop
  1.3.4 Other Parallel R Packages
1.4 Power Grid Analytical Approach
  1.4.1 Data Preparation
  1.4.2 Exploratory Analysis and Data Cleaning
  1.4.3 Event Extraction
1.5 Discussion and Conclusions
Appendix
References

Chapter 2: Picturing Bayesian Classifiers: A Visual Data Mining Approach to Parameters Optimization
2.1 Introduction
2.2 Related Works
2.3 Motivations and Requirements
  2.3.1 R Packages Requirements
2.4 Probabilistic Framework of NB Classifiers
  2.4.1 Choosing the Model
  2.4.2 Estimating the Parameters
2.5 Two-Dimensional Visualization System
  2.5.1 Design Choices
  2.5.2 Visualization Design
2.6 A Case Study: Text Classification
  2.6.1 Description of the Dataset
  2.6.2 Creating Document-Term Matrices
  2.6.3 Loading Existing Term-Document Matrices
  2.6.4 Running the Program
2.7 Conclusions
Acknowledgments
References

Chapter 3: Discovery of Emergent Issues and Controversies in Anthropology Using Text Mining, Topic Modeling, and Social Network Analysis of Microblog Content
3.1 Introduction
3.2 How Many Messages and How Many Twitter-Users in the Sample?
3.3 Who Is Writing All These Twitter Messages?
3.4 Who Are the Influential Twitter-Users in This Sample?
3.5 What Is the Community Structure of These Twitter-Users?
3.6 What Were Twitter-Users Writing About During the Meeting?
3.7 What Do the Twitter Messages Reveal About the Opinions of Their Authors?
3.8 What Can Be Discovered in the Less Frequently Used Words in the Sample?
3.9 What Are the Topics That Can Be Algorithmically Discovered in This Sample?
3.10 Conclusion
References

Chapter 4: Text Mining and Network Analysis of Digital Libraries in R
4.1 Introduction
4.2 Dataset Preparation
4.3 Manipulating the Document-Term Matrix
  4.3.1 The Document-Term Matrix
  4.3.2 Term Frequency-Inverse Document Frequency
  4.3.3 Exploring the Document-Term Matrix
4.4 Clustering Content by Topics Using the LDA
  4.4.1 The Latent Dirichlet Allocation
  4.4.2 Learning the Various Distributions for LDA
  4.4.3 Using the Log-Likelihood for Model Validation
  4.4.4 Topics Representation
  4.4.5 Plotting the Topics Associations
4.5 Using Similarity Between Documents to Explore Document Cohesion
  4.5.1 Computing Similarities Between Documents
  4.5.2 Using a Heatmap to Illustrate Clusters of Documents
4.6 Social Network Analysis of Authors
  4.6.1 Constructing the Network as a Graph
  4.6.2 Author Importance Using Centrality Measures
4.7 Conclusion
References

Chapter 5: Recommender Systems in R
5.1 Introduction
5.2 Business Case
5.3 Evaluation
5.4 Collaborative Filtering Methods
5.5 Latent Factor Collaborative Filtering
5.6 Simplified Approach
5.7 Roll Your Own
5.8 Final Thoughts
References

Chapter 6: Response Modeling in Direct Marketing: A Data Mining-Based Approach for Target Selection
6.1 Introduction/Background
6.2 Business Problem
6.3 Proposed Response Model
6.4 Modeling Detail
  6.4.1 Data Collection
  6.4.2 Data Preprocessing
  6.4.3 Feature Construction
  6.4.4 Feature Selection
  6.4.5 Data Sampling for Training and Test
  6.4.6 Class Balancing
  6.4.7 Classifier (SVM)
6.5 Prediction Result
6.6 Model Evaluation
6.7 Conclusion
References

Chapter 7: Caravan Insurance Customer Profile Modeling with R
7.1 Introduction
7.2 Data Description and Initial Exploratory Data Analysis
  7.2.1 Variable Correlations and Logistic Regression Analysis
7.3 Classifier Models of Caravan Insurance Holders
  7.3.1 Overview of Model Building and Validating
  7.3.2 Review of Four Classifier Methods
  7.3.3 RP Model
  7.3.4 Bagging Ensemble
  7.3.5 Support Vector Machine
  7.3.6 LR Classification
  7.3.7 Comparison of Four Classifier Models: ROC and AUC
  7.3.8 Model Comparison: Recall-Precision, Accuracy-v-Cut-off, and Computation Times
7.4 Discussion of Results and Conclusion
Appendix A: Details of the Full Data Set Variables
Appendix B: Customer Profile Data-Frequency of Binary Values
Appendix C: Proportion of Caravan Insurance Holders vis-a-vis other Customer Profile Variables
Appendix D: LR Model Details
Appendix E: R Commands for Computation of ROC Curves for Each Model Using Validation Dataset
Appendix F: Commands for Cross-Validation Analysis of Classifier Models
References

Chapter 8: Selecting Best Features for Predicting Bank Loan Default
8.1 Introduction
8.2 Business Problem
8.3 Data Extraction
8.4 Data Exploration and Preparation
  8.4.1 Null Value Detection
  8.4.2 Outlier Detection
8.5 Missing Imputation
  8.5.1 Relevance Analysis
  8.5.2 Data Set Balancing
  8.5.3 Feature Selection
8.6 Modeling
8.7 Model Evaluation
8.8 Finding and Model Deployment
8.9 Lessons and Discussions
Appendix: Selecting Best Features for Predicting Bank Loan Default
References

Chapter 9: A Choquet Integral Toolbox and Its Application in Customer Preference Analysis
9.1 Introduction
9.2 Background
  9.2.1 Aggregation Functions
  9.2.2 Choquet Integral
  9.2.3 Fuzzy Measure Representation
  9.2.4 Shapley Value and Interaction Index
9.3 Rfmtool Package
  9.3.1 Installation
  9.3.2 Toolbox Description
  9.3.3 Preference Analysis Example
9.4 Case Study
  9.4.1 Traveler Preference Study and Hotel Management
  9.4.2 Data Collection and Experiment Design
  9.4.3 Model Evaluation
  9.4.4 Result Analysis
  9.4.5 Discussion
9.5 Conclusions
References

Chapter 10: A Real-Time Property Value Index Based on Web Data
10.1 Introduction
10.2 Housing Prices and Indices
10.3 A Data Mining Approach
  10.3.1 Data Capture
  10.3.2 Geocoding
  10.3.3 Price Evolution
10.4 Real Estate Pricing Models
  10.4.1 Model 1: Hedonic Model Plus Smooth Term
  10.4.2 Model 2: GWR Plus a Smooth Term
  10.4.3 Relationship to Other Work
10.5 Conclusion
Acknowledgments
References

Chapter 11: Predicting Seabed Hardness Using Random Forest in R
11.1 Introduction
11.2 Study Region and Data Processing
  11.2.1 Study Region
  11.2.2 Data Processing of Seabed Hardness
  11.2.3 Predictors
11.3 Dataset Manipulation and Exploratory Analyses
  11.3.1 Features of the Dataset
  11.3.2 Exploratory Data Analyses
11.4 Application of RF for Predicting Seabed Hardness
11.5 Model Validation Using rfcv
11.6 Optimal Predictive Model
11.7 Application of the Optimal Predictive Model
11.8 Discussion and Conclusions
  11.8.1 Selection of Relevant Predictors and the Consequences of Missing the Most Important Predictors
  11.8.2 Issues with Searching for the Most Accurate Predictive Model Using RF
  11.8.3 Predictive Accuracy of RF and Prediction Maps of Seabed Hardness
  11.8.4 Limitations
Acknowledgments
Appendix A: A Dataset of Seabed Hardness and 15 Predictors
Appendix B: A R Function, if.cv, Shows the Cross-Validated Prediction Performance of a Predictive Model
References

Chapter 12: Supervised Classification of Images, Applied to Plankton Samples Using R and Zooimage
12.1 Background
12.2 Challenges
12.3 Data Extraction and Exploration
12.4 Data Preprocessing
12.5 Modeling
12.6 Model Evaluation
12.7 Model Deployment
12.8 Lessons, Discussion, and Conclusions
Acknowledgments
References

Chapter 13: Crime Analyses Using R
13.1 Introduction
13.2 Problem Definition
13.3 Data Extraction
13.4 Data Exploration and Preprocessing
13.5 Visualizations
13.6 Modeling
13.7 Model Evaluation
13.8 Discussions and Improvements
References

Chapter 14: Football Mining with R
14.1 Introduction to the Case Study and Organization of the Analysis
14.2 Background of the Analysis: The Italian Football Championship
14.3 Data Extraction and Exploration
  14.3.1 Data Extraction
  14.3.2 Data Exploration
14.4 Data Preprocessing
  14.4.1 Variable Importance Evaluation
  14.4.2 Composite Indicators Construction
14.5 Model Development: Building Classifiers
  14.5.1 Learning Step
  14.5.2 Model Selection
  14.5.3 Model Refinement
14.6 Model Deployment
14.7 Concluding Remarks
Acknowledgments
References

Chapter 15: Analyzing Internet DNS(SEC) Traffic with R for Resolving Platform Optimization
15.1 Introduction
15.2 Data Extraction from PCAP to CSV File
15.3 Data Importation from CSV File to R
15.4 Dimension Reduction Via PCA
15.5 Initial Data Exploration Via Graphs
15.6 Variables Scaling and Samples Selection
15.7 Clustering for Segmenting the FQDN
15.8 Building Routing Table Thanks to Clustering
15.9 Building Routing Table Thanks to Mixed Integer Linear Programming
15.10 Building Routing Table Via a Heuristic
15.11 Final Evaluation
15.12 Conclusion
References

Index