Data Mining Applications with R

Yanchang Zhao
Senior Data Miner, RDataMining.com, Australia

Yonghua Cen
Associate Professor, Nanjing University of Science and Technology, China

AMSTERDAM, BOSTON, HEIDELBERG, LONDON, NEW YORK, OXFORD, PARIS, SAN DIEGO, SAN FRANCISCO, SYDNEY, TOKYO
ELSEVIER
Academic Press is an imprint of Elsevier

Contents

Preface xiii
Acknowledgments xv
Review Committee xvii
Foreword xix

Chapter 1: Power Grid Data Analysis with R and Hadoop 1
  1.1 Introduction 1
  1.2 A Brief Overview of the Power Grid 2
  1.3 Introduction to MapReduce, Hadoop, and RHIPE 5
    1.3.1 MapReduce 6
    1.3.2 Hadoop 7
    1.3.3 RHIPE: R with Hadoop 8
    1.3.4 Other Parallel R Packages 13
  1.4 Power Grid Analytical Approach 14
    1.4.1 Data Preparation 15
    1.4.2 Exploratory Analysis and Data Cleaning 16
    1.4.3 Event Extraction 25
  1.5 Discussion and Conclusions 31
  Appendix 32
  References 34

Chapter 2: Picturing Bayesian Classifiers: A Visual Data Mining Approach to Parameters Optimization 35
  2.1 Introduction 35
  2.2 Related Works 36
  2.3 Motivations and Requirements 37
    2.3.1 R Packages Requirements 38
  2.4 Probabilistic Framework of NB Classifiers 39
    2.4.1 Choosing the Model 40
    2.4.2 Estimating the Parameters 44
  2.5 Two-Dimensional Visualization System 47
    2.5.1 Design Choices 48
    2.5.2 Visualization Design 49
  2.6 A Case Study: Text Classification 52
    2.6.1 Description of the Dataset 52
    2.6.2 Creating Document-Term Matrices 53
    2.6.3 Loading Existing Term-Document Matrices 54
    2.6.4 Running the Program 55
  2.7 Conclusions 59
  Acknowledgments 60
  References 60

Chapter 3: Discovery of Emergent Issues and Controversies in Anthropology Using Text Mining, Topic Modeling, and Social Network Analysis of Microblog Content 63
  3.1 Introduction 63
  3.2 How Many Messages and How Many Twitter-Users in the Sample? 65
  3.3 Who Is Writing All These Twitter Messages? 66
  3.4 Who Are the Influential Twitter-Users in This Sample? 67
  3.5 What Is the Community Structure of These Twitter-Users? 72
  3.6 What Were Twitter-Users Writing About During the Meeting? 75
  3.7 What Do the Twitter Messages Reveal About the Opinions of Their Authors? 80
  3.8 What Can Be Discovered in the Less Frequently Used Words in the Sample? 84
  3.9 What Are the Topics That Can Be Algorithmically Discovered in This Sample? 86
  3.10 Conclusion 88
  References 91

Chapter 4: Text Mining and Network Analysis of Digital Libraries in R 95
  4.1 Introduction 95
  4.2 Dataset Preparation 96
  4.3 Manipulating the Document-Term Matrix 97
    4.3.1 The Document-Term Matrix 97
    4.3.2 Term Frequency-Inverse Document Frequency 99
    4.3.3 Exploring the Document-Term Matrix 100
  4.4 Clustering Content by Topics Using the LDA 101
    4.4.1 The Latent Dirichlet Allocation 101
    4.4.2 Learning the Various Distributions for LDA 102
    4.4.3 Using the Log-Likelihood for Model Validation 104
    4.4.4 Topics Representation 105
    4.4.5 Plotting the Topics Associations 106
  4.5 Using Similarity Between Documents to Explore Document Cohesion 108
    4.5.1 Computing Similarities Between Documents 108
    4.5.2 Using a Heatmap to Illustrate Clusters of Documents 109
  4.6 Social Network Analysis of Authors 109
    4.6.1 Constructing the Network as a Graph 109
    4.6.2 Author Importance Using Centrality Measures 113
  4.7 Conclusion 115
  References 115

Chapter 5: Recommender Systems in R 117
  5.1 Introduction 117
  5.2 Business Case 117
  5.3 Evaluation 117
  5.4 Collaborative Filtering Methods 118
  5.5 Latent Factor Collaborative Filtering 127
  5.6 Simplified Approach 143
  5.7 Roll Your Own 145
  5.8 Final Thoughts 149
  References 151

Chapter 6: Response Modeling in Direct Marketing: A Data Mining-Based Approach for Target Selection 153
  6.1 Introduction/Background 153
  6.2 Business Problem 155
  6.3 Proposed Response Model 156
  6.4 Modeling Detail 158
    6.4.1 Data Collection 158
    6.4.2 Data Preprocessing 158
    6.4.3 Feature Construction 160
    6.4.4 Feature Selection 164
    6.4.5 Data Sampling for Training and Test 169
    6.4.6 Class Balancing 171
    6.4.7 Classifier (SVM) 172
  6.5 Prediction Result 174
  6.6 Model Evaluation 175
  6.7 Conclusion 177
  References 178

Chapter 7: Caravan Insurance Customer Profile Modeling with R 181
  7.1 Introduction 181
  7.2 Data Description and Initial Exploratory Data Analysis 182
    7.2.1 Variable Correlations and Logistic Regression Analysis 184
  7.3 Classifier Models of Caravan Insurance Holders 185
    7.3.1 Overview of Model Building and Validating 185
    7.3.2 Review of Four Classifier Methods 188
    7.3.3 RP Model 190
    7.3.4 Bagging Ensemble 192
    7.3.5 Support Vector Machine 193
    7.3.6 LR Classification 195
    7.3.7 Comparison of Four Classifier Models: ROC and AUC 199
    7.3.8 Model Comparison: Recall-Precision, Accuracy-v-Cut-off, and Computation Times 201
  7.4 Discussion of Results and Conclusion 206
  Appendix A: Details of the Full Data Set Variables 209
  Appendix B: Customer Profile Data-Frequency of Binary Values 212
  Appendix C: Proportion of Caravan Insurance Holders vis-a-vis other Customer Profile Variables 220
  Appendix D: LR Model Details 222
  Appendix E: R Commands for Computation of ROC Curves for Each Model Using Validation Dataset 225
  Appendix F: Commands for Cross-Validation Analysis of Classifier Models 225
  References 226

Chapter 8: Selecting Best Features for Predicting Bank Loan Default 229
  8.1 Introduction 229
  8.2 Business Problem 230
  8.3 Data Extraction 230
  8.4 Data Exploration and Preparation 231
    8.4.1 Null Value Detection 231
    8.4.2 Outlier Detection 232
  8.5 Missing Imputation 235
    8.5.1 Relevance Analysis 235
    8.5.2 Data Set Balancing 237
    8.5.3 Feature Selection 239
  8.6 Modeling 240
  8.7 Model Evaluation 243
  8.8 Finding and Model Deployment 243
  8.9 Lessons and Discussions 244
  Appendix: Selecting Best Features for Predicting Bank Loan Default 244
  References 245

Chapter 9: A Choquet Integral Toolbox and Its Application in Customer Preference Analysis 247
  9.1 Introduction 247
  9.2 Background 248
    9.2.1 Aggregation Functions 248
    9.2.2 Choquet Integral 249
    9.2.3 Fuzzy Measure Representation 251
    9.2.4 Shapley Value and Interaction Index 252
  9.3 Rfmtool Package 253
    9.3.1 Installation 253
    9.3.2 Toolbox Description 254
    9.3.3 Preference Analysis Example 255
  9.4 Case Study 258
    9.4.1 Traveler Preference Study and Hotel Management 258
    9.4.2 Data Collection and Experiment Design 259
    9.4.3 Model Evaluation 260
    9.4.4 Result Analysis 263
    9.4.5 Discussion 269
  9.5 Conclusions 270
  References 271

Chapter 10: A Real-Time Property Value Index Based on Web Data 273
  10.1 Introduction 273
  10.2 Housing Prices and Indices 273
  10.3 A Data Mining Approach 274
    10.3.1 Data Capture 275
    10.3.2 Geocoding 277
    10.3.3 Price Evolution 280
  10.4 Real Estate Pricing Models 283
    10.4.1 Model 1: Hedonic Model Plus Smooth Term 284
    10.4.2 Model 2: GWR Plus a Smooth Term 287
    10.4.3 Relationship to Other Work 293
  10.5 Conclusion 295
  Acknowledgments 295
  References 295

Chapter 11: Predicting Seabed Hardness Using Random Forest in R 299
  11.1 Introduction 299
  11.2 Study Region and Data Processing 300
    11.2.1 Study Region 301
    11.2.2 Data Processing of Seabed Hardness 301
    11.2.3 Predictors 304
  11.3 Dataset Manipulation and Exploratory Analyses 305
    11.3.1 Features of the Dataset 306
    11.3.2 Exploratory Data Analyses 306
  11.4 Application of RF for Predicting Seabed Hardness 307
  11.5 Model Validation Using rfcv 313
  11.6 Optimal Predictive Model 315
  11.7 Application of the Optimal Predictive Model 319
  11.8 Discussion and Conclusions 321
    11.8.1 Selection of Relevant Predictors and the Consequences of Missing the Most Important Predictors 321
    11.8.2 Issues with Searching for the Most Accurate Predictive Model Using RF 323
    11.8.3 Predictive Accuracy of RF and Prediction Maps of Seabed Hardness 324
    11.8.4 Limitations 325
  Acknowledgments 326
  Appendix A: A Dataset of Seabed Hardness and 15 Predictors 326
  Appendix B: An R Function, rf.cv, Shows the Cross-Validated Prediction Performance of a Predictive Model 326
  References 327

Chapter 12: Supervised Classification of Images, Applied to Plankton Samples Using R and Zooimage 331
  12.1 Background 331
  12.2 Challenges 332
  12.3 Data Extraction and Exploration 336
  12.4 Data Preprocessing 341
  12.5 Modeling 344
  12.6 Model Evaluation 348
  12.7 Model Deployment 355
  12.8 Lessons, Discussion, and Conclusions 359
  Acknowledgments 362
  References 363

Chapter 13: Crime Analyses Using R 367
  13.1 Introduction 367
  13.2 Problem Definition 368
  13.3 Data Extraction 369
  13.4 Data Exploration and Preprocessing 369
  13.5 Visualizations 375
  13.6 Modeling 385
  13.7 Model Evaluation 392
  13.8 Discussions and Improvements 394
  References 395

Chapter 14: Football Mining with R 397
  14.1 Introduction to the Case Study and Organization of the Analysis 397
  14.2 Background of the Analysis: The Italian Football Championship 398
  14.3 Data Extraction and Exploration 399
    14.3.1 Data Extraction 399
    14.3.2 Data Exploration 400
  14.4 Data Preprocessing 403
    14.4.1 Variable Importance Evaluation 403
    14.4.2 Composite Indicators Construction 408
  14.5 Model Development: Building Classifiers 412
    14.5.1 Learning Step 413
    14.5.2 Model Selection 421
    14.5.3 Model Refinement 424
  14.6 Model Deployment 426
  14.7 Concluding Remarks 430
  Acknowledgments 431
  References 431

Chapter 15: Analyzing Internet DNS(SEC) Traffic with R for Resolving Platform Optimization 435
  15.1 Introduction 435
  15.2 Data Extraction from PCAP to CSV File 436
  15.3 Data Importation from CSV File to R 437
  15.4 Dimension Reduction Via PCA 438
  15.5 Initial Data Exploration Via Graphs 440
  15.6 Variables Scaling and Samples Selection 442
  15.7 Clustering for Segmenting the FQDN 443
  15.8 Building Routing Table Thanks to Clustering 446
  15.9 Building Routing Table Thanks to Mixed Integer Linear Programming 448
  15.10 Building Routing Table Via a Heuristic 451
  15.11 Final Evaluation 452
  15.12 Conclusion 454
  References 455

Index 457