DATA MINING AND BUSINESS ANALYTICS WITH R

Johannes Ledolter
Department of Management Sciences
Tippie College of Business
University of Iowa
Iowa City, Iowa

Copyright © 2013 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Ledolter, Johannes.
Data mining and business analytics with R / Johannes Ledolter, University of Iowa.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-44714-7 (cloth)
1. Data mining. 2. R (Computer program language) 3. Commercial statistics. I. Title.
QA76.9.D343L44 2013
006.3'12-dc23
2013000330

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

CONTENTS

Preface
Acknowledgments

1. Introduction
    Reference
2. Processing the Information and Getting to Know Your Data
    2.1 Example 1: 2006 Birth Data
    2.2 Example 2: Alumni Donations
    2.3 Example 3: Orange Juice
    References
3. Standard Linear Regression
    3.1 Estimation in R
    3.2 Example 1: Fuel Efficiency of Automobiles
    3.3 Example 2: Toyota Used-Car Prices
    Appendix 3.A The Effects of Model Overfitting on the Average Mean Square Error of the Regression Prediction
    References
4. Local Polynomial Regression: a Nonparametric Regression Approach
    4.1 Model Selection
    4.2 Application to Density Estimation and the Smoothing of Histograms
    4.3 Extension to the Multiple Regression Model
    4.4 Examples and Software
    References
5. Importance of Parsimony in Statistical Modeling
    5.1 How Do We Guard Against False Discovery
    References
6. Penalty-Based Variable Selection in Regression Models with Many Parameters (LASSO)
    6.1 Example 1: Prostate Cancer
    6.2 Example 2: Orange Juice
    References
7. Logistic Regression
    7.1 Building a Linear Model for Binary Response Data
    7.2 Interpretation of the Regression Coefficients in a Logistic Regression Model
    7.3 Statistical Inference
    7.4 Classification of New Cases
    7.5 Estimation in R
    7.6 Example 1: Death Penalty Data
    7.7 Example 2: Delayed Airplanes
    7.8 Example 3: Loan Acceptance
    7.9 Example 4: German Credit Data
    References
8. Binary Classification, Probabilities, and Evaluating Classification Performance
    8.1 Binary Classification
    8.2 Using Probabilities to Make Decisions
    8.3 Sensitivity and Specificity
    8.4 Example: German Credit Data
9. Classification Using a Nearest Neighbor Analysis
    9.1 The k-nearest Neighbor Algorithm
    9.2 Example 1: Forensic Glass
    9.3 Example 2: German Credit Data
    Reference
10. The Naïve Bayesian Analysis: a Model for Predicting a Categorical Response from Mostly Categorical Predictor Variables
    10.1 Example: Delayed Airplanes
    Reference
11. Multinomial Logistic Regression
    11.1 Computer Software
    11.2 Example 1: Forensic Glass
    11.3 Example 2: Forensic Glass Revisited
    Appendix 11.A Specification of a Simple Triplet Matrix
    References
12. More on Classification and a Discussion on Discriminant Analysis
    12.1 Fisher's Linear Discriminant Function
    12.2 Example 1: German Credit Data
    12.3 Example 2: Fisher Iris Data
    12.4 Example 3: Forensic Glass Data
    12.5 Example 4: MBA Admission Data
    Reference
13. Decision Trees
    13.1 Example 1: Prostate Cancer
    13.2 Example 2: Motorcycle Acceleration
    13.3 Example 3: Fisher Iris Data Revisited
14. Further Discussion on Regression and Classification Trees, Computer Software, and Other Useful Classification Methods
    14.1 R Packages for Tree Construction
    14.2 Chi-Square Automatic Interaction Detection (CHAID)
    14.3 Ensemble Methods: Bagging, Boosting, and Random Forests
    14.4 Support Vector Machines (SVM)
    14.5 Neural Networks
    14.6 The R Package Rattle: A Useful Graphical User Interface for Data Mining
    References
15. Clustering
    15.1 k-means Clustering
    15.2 Another Way to Look at Clustering: Applying the Expectation-Maximization (EM) Algorithm to Mixtures of Normal Distributions
    15.3 Hierarchical Clustering Procedures
    References
16. Market Basket Analysis: Association Rules and Lift
    16.1 Example 1: Online Radio
    16.2 Example 2: Predicting Income
    References
17. Dimension Reduction: Factor Models and Principal Components
    17.1 Example 1: European Protein Consumption
    17.2 Example 2: Monthly US Unemployment Rates
18. Reducing the Dimension in Regressions with Multicollinear Inputs: Principal Components Regression and Partial Least Squares
    18.1 Three Examples
    References
19. Text as Data: Text Mining and Sentiment Analysis
    19.1 Inverse Multinomial Logistic Regression
    19.2 Example 1: Restaurant Reviews
    19.3 Example 2: Political Sentiment
    Appendix 19.A Relationship Between the Gentzkow-Shapiro Estimate of Slant and Partial Least Squares
    References
20. Network Data
    20.1 Example 1: Marriage and Power in Fifteenth Century Florence
    20.2 Example 2: Connections in a Friendship Network
    References

Appendix A: Exercises
    Exercise 1
    Exercise 2
    Exercise 3
    Exercise 4
    Exercise 5
    Exercise 6
    Exercise 7
Appendix B: References
Index

PREFACE

This book is about useful methods for data mining and business analytics. It is written for readers who want to apply these methods so that they can learn about their processes and solve their problems. My objective is to provide a thorough discussion of the most useful data-mining tools that goes beyond the typical "black box" description, and to show why these tools work.

Powerful, accurate, and flexible computing software is needed for data mining, and Excel is of little use. Although excellent data-mining software is offered by various commercial vendors, proprietary products are usually expensive. In this text, I use the R Statistical Software, which is powerful and free. But the use of R comes with start-up costs. R requires the user to write out instructions, and the writing of program instructions will be unfamiliar to most spreadsheet users. This is why I provide R sample programs in the text and on the webpage that is associated with this book. These sample programs should smooth the transition to this very general and powerful computer environment and help keep the start-up costs to using R small.

The text combines explanations of the statistical foundation of data mining with useful software so that the tools can be readily applied and put to use. There are certainly better books that give a deeper description of the methods, and there are also numerous texts that give a more complete guide to computing with R. This book tries to strike a compromise that does justice to both theory and practice, at a level that can be understood by the MBA student interested in quantitative methods.

This book can be used in courses on data mining in quantitative MBA programs and in upper-level undergraduate and graduate programs that deal with the analysis and interpretation of large data sets. Students in business, the social and natural sciences, medicine, and engineering should benefit from this book.
The majority of the topics can be covered in a one-semester course. But not every covered topic will be useful for all audiences, and for some audiences, the coverage of certain topics will be either too advanced or too basic. By omitting some topics and by expanding on others, one can make this book work for many different audiences.

Certain data-mining applications require an enormous amount of effort to just collect the relevant information, and in such cases, the data preparation takes a lot more time than the eventual modeling. In other applications, the data collection effort is minimal, but often one has to worry about the efficient storage and retrieval of high volume information (i.e., the "data warehousing"). Although it is very important to know how to acquire, store, merge, and best arrange the information, this text does not cover these aspects very deeply. This book concentrates on the modeling aspects of data mining.

The data sets and the R-code for all examples can be found on the webpage that accompanies this book (http://www.biz.uiowa.edu/faculty/jledolter/datamining). Supplementary material for this book can also be found by entering ISBN 9781118447147 at booksupport.wiley.com. You can copy and paste the code into your own R session and rerun all analyses. You can experiment with the software by making changes and additions, and you can adapt the R templates to the analysis of your own data sets.

Exercises and several large practice data sets are given at the end of this book. The exercises will help instructors when assigning homework problems, and they will give the reader the opportunity to practice the techniques that are discussed in this book. Instructions on how to best use these data sets are given in Appendix A.

This is a first edition. Although I have tried to be very careful in my writing and in the analyses of the illustrative data sets, I am certain that much can be improved. I would very much appreciate any feedback you may have, and I encourage you to write to me at johannes-ledolter@uiowa.edu. Corrections and comments will be posted on the book's webpage.

ACKNOWLEDGMENTS

I got interested in developing materials for an MBA-level text on Data Mining when I visited the University of Chicago Booth School of Business in 2011. The outstanding University of Chicago lecture materials for the course on Data Mining (BUS41201) taught by Professor Matt Taddy provided the spark to put this text together, and several examples and R-templates from Professor Taddy's notes have influenced my presentation. Chapter 19 on the analysis of text data draws heavily on his recent research. Professor Taddy's contributions are most gratefully acknowledged.

Writing a text is a time-consuming task. I could not have done this without the support and constant encouragement of my wife, Lea Vandervelde. Lea, a law professor at the University of Iowa, conducts historical research on the freedom suits of Missouri slaves. She knows first-hand how important and difficult it is to construct data sets for the mining of text data.