Introduction to Data Mining


1 Introduction to Data Mining

5 Equivalent Buzz Words
- Data Mining
- Machine Learning
- Analytics
- Big Data (not actually equivalent, but treated that way in 90% of uses)
- Data Science
- Business Intelligence (often means reporting)
- Predictive Analytics
- Pattern Recognition

6 Stanford

7 CIT

8 MIT

9 The take-away?
- Don't be daunted by buzz words.
- Don't be afraid to ask someone to be more specific.
- You're learning ~90% of the mainstream methods for data analysis in this program.

10 Step 1: Preparing for Model Validation
Splitting into Training/Validation/Test Sets
Deciding on Cross-Validation

11 Data Preprocessing
- Let's start at the beginning.
- You've been handed a problem and some data.
- You start exploring to see what you have.
BEFORE YOU LOOK AT ANY RELATIONSHIPS BETWEEN VARIABLES, YOU SHOULD SPLIT INTO TRAINING, VALIDATION AND TEST SAMPLES. (Or decide on cross-validation.)

12 The Problem of Overfitting
- Overfitting: left unchecked, models will capture nuances/characteristics of the data on which they're built.
- Example: regression.
  - Every additional predictor variable increases R².
  - Naive interpretation: every additional predictor variable helps explain more of the target's variance.
  - That can't be true! Modeling requires subject-matter expertise.

13 The Problem of Overfitting
- Error on the training data does not predict future performance.
- Complexity can undermine a model's performance on future data.
[Figure: training error keeps falling with complexity ("Yay! My model is perfect!") while error on new data rises ("OMG! I'm going to get fired.")]

14 The Bias-Variance Tradeoff
- Bias = underfitting.
  - The modeled value's distance from the truth.
  - Want a model with low bias.
- Variance = overfitting.
  - The model's parameters will vary greatly on different training samples.
  - Want a model with low variance.
Unfortunately, the concepts are inversely related:
  lower bias -> higher variance
  lower variance -> higher bias
(Hence the term "bias-variance tradeoff.")

15 Training/Validation/Test
- Want to make sure your models are generalizable:
  - not just good models of the training sample;
  - able to predict equally well on out-of-sample data.
- Splitting into training + validation + test sets is necessary.
  - Somewhere around two-thirds training, one-third validation/test is typical.
- Lots of data? Split! Not so much data? Split! Not enough data? Use cross-validation.
- The test set is not always emphasized, but it is crucial.
  - It gives an impartial estimate of how your final model will perform on future data.
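The split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not course code: the function name and the 60/20/20 proportions are my own choices (the slides only suggest roughly two-thirds for training).

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle the rows, then carve off validation and test sets.

    Illustrative 60/20/20 split; adjust the fractions to taste.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Note the split happens before any exploration of relationships between variables, exactly as slide 11 insists.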

16 Training/Validation
- Use the training set to build your model.
- Evaluate and tune the model based on how it performs on the validation data.
- Never report accuracy metrics from the training set!

17 Testing
- Continually adapting a model to perform better on validation data essentially trains the model to the validation data.
- Once you've chosen a final model, re-run it on the (training + validation) data to finalize your parameters, and report accuracy on the test data.
- Before deploying that final model to the customer, you can update the parameters using the entire dataset.

18 10-Fold Cross-Validation
[Figure: illustration of 10-fold cross-validation. Image by Chris McCormick.]

19 K-Fold Cross-Validation
- Divide your data into k equally sized samples (folds).
  - k = 10 or k = 100 are common.
  - The choice depends on the time complexity of the model and the size of the dataset!
- For each fold, train the model on all the other data, using that fold as a validation set.
  - Record measures of error/goodness of fit.
- In the end, report a summary of the error/goodness-of-fit measures (average, std. deviation, etc.).
- Use that summary to choose a model.
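The k-fold loop just described can be written generically, with pluggable `fit` and `score` functions. The helper below and its toy mean-predictor example are illustrative assumptions, not code from the course.

```python
import random
import statistics

def kfold_scores(rows, k, fit, score, seed=0):
    """Shuffle, split into k folds; for each fold, fit on the other
    k-1 folds and score on the held-out fold. Returns per-fold scores."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]   # k roughly equal folds
    scores = []
    for i in range(k):
        train = [r for j in range(k) if j != i for r in folds[j]]
        model = fit(train)                   # train on everything else
        scores.append(score(model, folds[i]))  # validate on fold i
    return scores

# Toy example: the "model" is just the training mean, scored by MSE.
data = [float(x) for x in range(1, 21)]
errs = kfold_scores(
    data, k=5,
    fit=statistics.mean,
    score=lambda m, fold: statistics.mean((x - m) ** 2 for x in fold),
)
print(round(statistics.mean(errs), 2), round(statistics.stdev(errs), 2))
```

Reporting the mean and standard deviation of the per-fold errors matches the slide's advice to summarize before choosing a model.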

20 Cross-Validation
- Can use cross-validation in any situation.
- It is necessary if you do not have sufficient observations to split into training/validation/test sets.
- What is sufficient? It depends!
  - Rule of thumb: at minimum, 10 observations per input variable in the training set.
- For nontraditional problems (and solutions), often anything goes.

21 Leave-One-Out Cross-Validation (Jackknife)
- Use only one observation as the validation set.
- Repeat for every observation in the dataset.
- Compute the accuracy of the model by aggregating those predictive results.
- The results are similar to having your entire dataset as a validation set.
- Can be extremely time consuming! Only use when necessary.
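Leave-one-out can be sketched directly: each observation in turn is held out and predicted from all the others. The mean-prediction "model" here is a stand-in of my own, chosen only to keep the sketch self-contained.

```python
import statistics

def loocv_errors(xs):
    """Leave-one-out: predict each observation with the mean of all
    the others, and record the squared error."""
    errors = []
    for i in range(len(xs)):
        rest = xs[:i] + xs[i + 1:]        # everything except xs[i]
        pred = statistics.mean(rest)      # "fit" on the rest
        errors.append((xs[i] - pred) ** 2)
    return errors

# One model fit per observation: n fits total, hence the slide's
# warning that this can be extremely time consuming.
print(statistics.mean(loocv_errors([2.0, 4.0, 6.0, 8.0])))
```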

22 Step 2: Data Cleaning: Handling Missing Values
You will have missing values in your data.

23 Handling Missing Values
[Decision flowchart:]
Is the variable more than 40-50% missing?
- No -> Consider imputation.
- Yes -> Is it a numeric variable?
  - Yes -> Throw it away. Recommend better collection practices.
  - No -> Create a level "missing." Then probably throw it away.

24 Missing Value Imputation
- Replacing missing values with a substitute value, typically a guess at what you think the value should have been.

25 Imputing Missing Values
- Always, always, always create a binary flag (= 1 indicating that the value has been imputed) and include the flag in your model.

  Before:
  Obs.  Gender  Q1 Response
  1     M       5
  2     M       4
  3     F       NA
  4     M       1
  5     F       NA

  After (NAs imputed with 3):
  Obs.  Gender  Q1 Response  Q1 Flag
  1     M       5            0
  2     M       4            0
  3     F       3            1
  4     M       1            0
  5     F       3            1

- Nonresponse might be an important indicator of the target or relate to another variable.
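The flag-and-impute step might look like the following in plain Python. The function name is hypothetical, and the mean is used as the fill value (the slide's Q1 column appears to fill the NAs with 3, roughly the mean of the observed responses 5, 4, and 1).

```python
def impute_with_flag(values, missing=None):
    """Replace missing entries (None sentinel) with the mean of the
    observed values, and return a parallel 0/1 flag column marking
    every imputed cell."""
    observed = [v for v in values if v is not missing]
    fill = sum(observed) / len(observed)
    imputed = [fill if v is missing else v for v in values]
    flags = [1 if v is missing else 0 for v in values]
    return imputed, flags

q1 = [5, 4, None, 1, None]          # the slide's Q1 responses; NA -> None
q1_imputed, q1_flag = impute_with_flag(q1)
print(q1_flag)  # [0, 0, 1, 0, 1]
```

The flag column is what lets a later model discover whether nonresponse itself carries information.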

26 Categorical Variables
- Option one: replace missing values with the mode (the most frequent value).
- Option two: create a new level of the variable, "missing." No flag is necessary in this case.
- Option three: try to predict the missing value using other attributes.
  - Decision trees and KNN methods are popular for missing values.
  - These types of classification models will be covered in this module.
- Option four: randomly choose a level of the variable with the probability of its occurrence (will not change the distribution).

27 Numeric Variables
- Option one: replace missing values with the average (mean).
- Option two: replace missing values with the median (for skewed distributions).
- Option three: predict the missing values using other attributes.
  - Multiple regression or regression trees are popular.
- Option four: discretize (bin) the numeric variable into categories and create a "missing" category.
- Option five: randomly choose a value of the variable using its probability distribution (will not change the distribution).
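Two of the options above, median fill (option two) and random draws from the observed values (option five), can be sketched as follows; the function names are illustrative, not from the course.

```python
import random
import statistics

def impute_median(values, missing=None):
    """Option two: fill missing cells with the median, which is
    robust to skewed distributions."""
    observed = [v for v in values if v is not missing]
    med = statistics.median(observed)
    return [med if v is missing else v for v in values]

def impute_random_draw(values, missing=None, seed=0):
    """Option five: fill each missing cell with a random draw from
    the observed values, preserving the empirical distribution."""
    observed = [v for v in values if v is not missing]
    rng = random.Random(seed)
    return [rng.choice(observed) if v is missing else v for v in values]

incomes = [30, 35, None, 40, 1000, None]   # skewed: one huge value
print(impute_median(incomes))              # median ignores the outlier
```

With the outlier present, the mean would have pulled the fill value far above most observations, which is why the slide recommends the median for skewed data.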

28 Ordinal Variables
- Treatment will depend on the variable.
  - You would likely treat level of education differently than a Likert-scale response.
- Use one of the options prescribed for numeric and categorical variables.

29 So what should I DO?

30 So what should I DO?
- Only the person closest to the data and to the problem can make these judgment calls!
- You can try several methods to see what works best.
- The binary flag indicating an imputed value will show you if there is something special about the missing values.

31 Variable Transformations
Discretizing (Binning) and Other Transformations

32 Binning Numeric Variables
Unsupervised Approach 1: Equal Width
- Each bin has the same width in variable values.
- Each bin has a different number of observations.

33 Binning Numeric Variables
Unsupervised Approach 2: Equal Depth
- Take percentiles of the population.
- Each bin has the same number of observations.
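The two unsupervised approaches can be contrasted in a short sketch (function names are my own). On a skewed variable, equal width crowds most observations into one bin, while equal depth balances the counts.

```python
def equal_width_bins(values, k):
    """Approach 1: split the value range into k bins of equal width;
    bin counts may differ wildly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp so the maximum lands in bin k-1, not a phantom bin k.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Approach 2: cut at percentiles so each bin holds (roughly)
    the same number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

vals = [1, 2, 3, 4, 5, 6, 7, 8, 100]       # one extreme value
print(equal_width_bins(vals, 3))  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_depth_bins(vals, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The outlier at 100 stretches the equal-width bins so that everything else collapses into bin 0, while equal depth still puts three observations in each bin.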

34 Binning Numeric Variables
Supervised Approach
- Use target-variable information to optimally bin numeric variables for prediction.
- Typically used in classification problems.
- Want bins that result in the most pure set of target classes.

35 Binning Numeric Variables
Supervised Approach
[Figure: incomes on a number line, each labeled with its vehicle-color target class:
  15K mixed, 20K mixed, 50K blue, 55K green, 60K blue,
  65K blue, 85K green, 150K mixed, 995K mixed.
Bin boundaries are chosen so that each bin is as pure in vehicle color as possible.]

36 Binning Numeric Variables
Supervised Approach
- Decision tree methods will be helpful to create these bins.
- More on this technique later.

37 Standardization and Normalization
- Standardizing a variable in statistics: transform it to the number of standard deviations away from the mean:
    z = (x - x̄) / s
- This removes the units from the variable.
- Avoids having a variable with large values (e.g., income) dominate a calculation.
- There are many other ways to standardize/normalize:
  - Range (0-1) standardization: subtract the minimum and divide by the (shifted) maximum, i.e., the range.
  - Divide by the 2-norm; divide by the sum.
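Both standardizations named on the slide can be written out directly; the helper names are illustrative.

```python
import statistics

def z_score(values):
    """Standardize: (x - mean) / std, so the variable has mean 0 and
    unit variance regardless of its original units."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)       # population standard deviation
    return [(v - mu) / sd for v in values]

def min_max(values):
    """Range (0-1) standardization: subtract the minimum, divide by
    the range (the maximum of the shifted values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000.0, 40_000.0, 60_000.0, 80_000.0]
print([round(z, 2) for z in z_score(incomes)])
print(min_max(incomes))
```

After either transformation, income's large raw magnitudes can no longer dominate a distance or similarity calculation.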

38 Transformation Considerations
- Transformations change the nature of the data.
  - Ex: x = {1, 2, 3} transformed to 1/x = {1, 1/2, 1/3}.
  - The sorting order of the observations reverses.
  - Observations close to 0 will become very large.
- Always consider the following questions:
  - Does the order of the data need to be maintained?
  - Does the transformation apply to all values, especially negative values and 0? (Think log(x) and 1/x.)
  - What is the effect on values between 0 and 1?

39 Log Transformation and Percent Change
- Logarithmic transformations are common because they remain interpretable.
    log(y) = 2*log(x)
  - A 1% increase in x yields a 2% increase in y.
  - A 5% decrease in x yields a 10% decrease in y.
(This interpretation is only valid for changes of up to +/- 20%.)

40 Log Transformation and Percent Change
- A more concrete example:
    log(oil_consumption) = 2*log(GDP)
  - A 1% increase in GDP results in a 2% increase in oil consumption.
  - A 5% decrease in GDP results in a 10% decrease in oil consumption.
(This interpretation is only valid for changes of up to +/- 20%.)

41 Log Transformation and Percent Change
For math folks: log(1+r) is what gets multiplied by the coefficient to get the change in y. Starting from
    y = b*log(x),
increase x by a fraction r:
    y' = b*log(x*(1+r)) = b*[log(x) + log(1+r)],
so
    y' - y = b*log(1+r)
is the net increase/decrease in y for an r increase in x. (For small r, log(1+r) ≈ r, which is why the percent-change reading works.)
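The ±20% caveat on the previous slides can be checked numerically: for a log-log relationship with coefficient b = 2, the exact fractional change in y when x grows by a fraction r is (1+r)^b - 1, while the rule-of-thumb reading is b*r. This little comparison is my own illustration, not from the slides.

```python
b = 2.0  # the slides' log-log coefficient
for r in (0.01, 0.05, 0.20, 0.50):
    exact = (1 + r) ** b - 1   # true fractional change in y
    approx = b * r             # the "b percent per 1 percent" reading
    print(f"r={r:.2f}  exact={exact:.4f}  rule-of-thumb={approx:.4f}")
```

At r = 0.01 the two readings agree to within a rounding error; by r = 0.50 the exact change (1.25) is well above the rule of thumb (1.0), which is exactly why the interpretation is limited to changes of up to about ±20%.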

42 So what should I DO?

43 So what should I DO?
- Only the person closest to the data and to the problem can make these judgment calls!
- You can try several methods to see what works best.
- Transformations are typically either required to meet the assumptions of a model, or applied in hindsight to improve the performance of a given model.