DATA MINING AND BUSINESS ANALYTICS WITH R


Johannes Ledolter
Department of Management Sciences
Tippie College of Business
University of Iowa
Iowa City, Iowa

Copyright 2013 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) , fax (978) , or on the web at Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) , fax (201) , or online at

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) , outside the United States at (317) or fax (317)

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at

Library of Congress Cataloging-in-Publication Data:
Ledolter, Johannes.
Data mining and business analytics with R / Johannes Ledolter, University of Iowa.
pages cm
Includes bibliographical references and index.
ISBN (cloth)
1. Data mining. 2. R (Computer program language) 3. Commercial statistics. I. Title.
QA76.9.D343L dc

Printed in the United States of America

CONTENTS

Preface
Acknowledgments

1. Introduction
   Reference

2. Processing the Information and Getting to Know Your Data
   2.1 Example 1: 2006 Birth Data
   2.2 Example 2: Alumni Donations
   2.3 Example 3: Orange Juice
   References

3. Standard Linear Regression
   3.1 Estimation in R
   3.2 Example 1: Fuel Efficiency of Automobiles
   3.3 Example 2: Toyota Used-Car Prices
   Appendix 3.A The Effects of Model Overfitting on the Average Mean Square Error of the Regression Prediction
   References

4. Local Polynomial Regression: a Nonparametric Regression Approach
   4.1 Model Selection
   4.2 Application to Density Estimation and the Smoothing of Histograms
   4.3 Extension to the Multiple Regression Model
   4.4 Examples and Software
   References

5. Importance of Parsimony in Statistical Modeling
   5.1 How Do We Guard Against False Discovery
   References

6. Penalty-Based Variable Selection in Regression Models with Many Parameters (LASSO)
   6.1 Example 1: Prostate Cancer
   6.2 Example 2: Orange Juice
   References

7. Logistic Regression
   7.1 Building a Linear Model for Binary Response Data
   7.2 Interpretation of the Regression Coefficients in a Logistic Regression Model
   7.3 Statistical Inference
   7.4 Classification of New Cases
   7.5 Estimation in R
   7.6 Example 1: Death Penalty Data
   7.7 Example 2: Delayed Airplanes
   7.8 Example 3: Loan Acceptance
   7.9 Example 4: German Credit Data
   References

8. Binary Classification, Probabilities, and Evaluating Classification Performance
   8.1 Binary Classification
   8.2 Using Probabilities to Make Decisions
   8.3 Sensitivity and Specificity
   8.4 Example: German Credit Data

9. Classification Using a Nearest Neighbor Analysis
   9.1 The k-nearest Neighbor Algorithm
   9.2 Example 1: Forensic Glass
   9.3 Example 2: German Credit Data
   Reference

10. The Naïve Bayesian Analysis: a Model for Predicting a Categorical Response from Mostly Categorical Predictor Variables
   10.1 Example: Delayed Airplanes
   Reference

11. Multinomial Logistic Regression
   11.1 Computer Software
   11.2 Example 1: Forensic Glass
   11.3 Example 2: Forensic Glass Revisited
   Appendix 11.A Specification of a Simple Triplet Matrix
   References

12. More on Classification and a Discussion on Discriminant Analysis
   12.1 Fisher's Linear Discriminant Function
   12.2 Example 1: German Credit Data
   12.3 Example 2: Fisher Iris Data
   12.4 Example 3: Forensic Glass Data
   12.5 Example 4: MBA Admission Data
   Reference

13. Decision Trees
   13.1 Example 1: Prostate Cancer
   13.2 Example 2: Motorcycle Acceleration
   13.3 Example 3: Fisher Iris Data Revisited

14. Further Discussion on Regression and Classification Trees, Computer Software, and Other Useful Classification Methods
   14.1 R Packages for Tree Construction
   14.2 Chi-Square Automatic Interaction Detection (CHAID)
   14.3 Ensemble Methods: Bagging, Boosting, and Random Forests
   14.4 Support Vector Machines (SVM)
   14.5 Neural Networks
   14.6 The R Package Rattle: A Useful Graphical User Interface for Data Mining
   References

15. Clustering
   15.1 k-means Clustering
   15.2 Another Way to Look at Clustering: Applying the Expectation-Maximization (EM) Algorithm to Mixtures of Normal Distributions
   15.3 Hierarchical Clustering Procedures
   References

16. Market Basket Analysis: Association Rules and Lift
   16.1 Example 1: Online Radio
   16.2 Example 2: Predicting Income
   References

17. Dimension Reduction: Factor Models and Principal Components
   17.1 Example 1: European Protein Consumption
   17.2 Example 2: Monthly US Unemployment Rates

18. Reducing the Dimension in Regressions with Multicollinear Inputs: Principal Components Regression and Partial Least Squares
   18.1 Three Examples
   References

19. Text as Data: Text Mining and Sentiment Analysis
   19.1 Inverse Multinomial Logistic Regression
   19.2 Example 1: Restaurant Reviews
   19.3 Example 2: Political Sentiment
   Appendix 19.A Relationship Between the Gentzkow-Shapiro Estimate of Slant and Partial Least Squares
   References

20. Network Data
   20.1 Example 1: Marriage and Power in Fifteenth Century Florence
   20.2 Example 2: Connections in a Friendship Network
   References

Appendix A: Exercises
   Exercise 1
   Exercise 2
   Exercise 3
   Exercise 4
   Exercise 5
   Exercise 6
   Exercise 7

Appendix B: References

Index

PREFACE

This book is about useful methods for data mining and business analytics. It is written for readers who want to apply these methods so that they can learn about their processes and solve their problems. My objective is to provide a thorough discussion of the most useful data-mining tools that goes beyond the typical "black box" description, and to show why these tools work.

Powerful, accurate, and flexible computing software is needed for data mining, and Excel is of little use. Although excellent data-mining software is offered by various commercial vendors, proprietary products are usually expensive. In this text, I use the R Statistical Software, which is powerful and free. But the use of R comes with start-up costs. R requires the user to write out instructions, and the writing of program instructions will be unfamiliar to most spreadsheet users. This is why I provide R sample programs in the text and on the webpage that is associated with this book. These sample programs should smooth the transition to this very general and powerful computer environment and help keep the start-up costs of using R small.

The text combines explanations of the statistical foundation of data mining with useful software so that the tools can be readily applied and put to use. There are certainly better books that give a deeper description of the methods, and there are also numerous texts that give a more complete guide to computing with R. This book tries to strike a compromise that does justice to both theory and practice, at a level that can be understood by the MBA student interested in quantitative methods.

This book can be used in courses on data mining in quantitative MBA programs and in upper-level undergraduate and graduate programs that deal with the analysis and interpretation of large data sets. Students in business, the social and natural sciences, medicine, and engineering should benefit from this book.

The majority of the topics can be covered in a one-semester course. But not every covered topic will be useful for all audiences, and for some audiences, the coverage of certain topics will be either too advanced or too basic. By omitting some topics and by expanding on others, one can make this book work for many different audiences.

Certain data-mining applications require an enormous amount of effort just to collect the relevant information, and in such cases, the data preparation takes a lot more time than the eventual modeling. In other applications, the data collection effort is minimal, but often one has to worry about the efficient storage and retrieval of high-volume information (i.e., the "data warehousing"). Although it is very important to know how to acquire, store, merge, and best arrange the information, this text does not cover these aspects very deeply. This book concentrates on the modeling aspects of data mining.

The data sets and the R-code for all examples can be found on the webpage that accompanies this book. Supplementary material for this book can also be found by entering ISBN at booksupport.wiley.com. You can copy and paste the code into your own R session and rerun all analyses. You can experiment with the software by making changes and additions, and you can adapt the R templates to the analysis of your own data sets.

Exercises and several large practice data sets are given at the end of this book. The exercises will help instructors when assigning homework problems, and they will give the reader the opportunity to practice the techniques that are discussed in this book. Instructions on how to best use these data sets are given in Appendix A.

This is a first edition. Although I have tried to be very careful in my writing and in the analyses of the illustrative data sets, I am certain that much can be improved. I would very much appreciate any feedback you may have, and I encourage you to write to me at johannes-ledolter@uiowa.edu. Corrections and comments will be posted on the book's webpage.

ACKNOWLEDGMENTS

I got interested in developing materials for an MBA-level text on Data Mining when I visited the University of Chicago Booth School of Business in The outstanding University of Chicago lecture materials for the course on Data Mining (BUS41201) taught by Professor Matt Taddy provided the spark to put this text together, and several examples and R-templates from Professor Taddy's notes have influenced my presentation. Chapter 19 on the analysis of text data draws heavily on his recent research. Professor Taddy's contributions are most gratefully acknowledged.

Writing a text is a time-consuming task. I could not have done this without the support and constant encouragement of my wife, Lea Vandervelde. Lea, a law professor at the University of Iowa, conducts historical research on the freedom suits of Missouri slaves. She knows first-hand how important and difficult it is to construct data sets for the mining of text data.