
1 Statistical Methods in Data Mining. Data Mining Process (Part 3). CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment. Professor Dr. Gholamreza Nakhaeizadeh

2 Short review of the last lecture. Pre-processing: observation and attribute reduction. Sampling: representative sample; random sampling; sampling with and without replacement; systematic sampling; stratified sampling. Attribute reduction: supervised and unsupervised learning; why attribute reduction, and what are the benefits; curse of dimensionality; attribute reduction by generating new attributes or by selecting attribute subsets; first elementary steps (consideration of background knowledge, screening); attribute ranking according to importance, supervised and unsupervised; embedded approaches and wrappers; PCA
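A minimal sketch of a few of these sampling schemes with pandas (the frame `df` and the stratum column `segment` are invented for the example; `GroupBy.sample` assumes pandas 1.1 or newer):

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "A", "B", "B", "C"] * 10,
    "income":  range(60),
})

# Random sampling without replacement (10 rows).
without_repl = df.sample(n=10, replace=False, random_state=0)

# Random sampling with replacement (bootstrap-style).
with_repl = df.sample(n=10, replace=True, random_state=0)

# Systematic sampling: every k-th observation.
k = 5
systematic = df.iloc[::k]

# Stratified sampling: 20% from each segment, preserving the strata.
stratified = df.groupby("segment", group_keys=False).sample(frac=0.2, random_state=0)
```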

3 Data Mining Process, CRISP-DM: Data Preparation (Cleaning). Dealing with missing values: ignore the observation; ignore the attribute; use the attribute mean; predict the missing value (decision tree, regression). Inaccurate data: use background knowledge (rules). Duplicates: Straße, Strasse, Str.; Robert X, Bob X; Professor, Prof., Dr.
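As a rough illustration of the mean-substitution and model-based strategies, here is a sketch in Python with pandas and scikit-learn (the columns `age` and `income` are made up; the slide does not prescribe a library or model):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "age":    [25, 38, None, 52, 41, None, 33],
    "income": [1800, 2400, 2100, 3200, 2700, 2300, 2000],
})

# Strategy 1: replace missing values with the attribute mean.
df["age_mean_filled"] = df["age"].fillna(df["age"].mean())

# Strategy 2: predict the missing value from the other attributes,
# here with a decision tree (a regression model would work the same way).
known   = df[df["age"].notna()]
unknown = df[df["age"].isna()]
tree = DecisionTreeRegressor(random_state=0).fit(known[["income"]], known["age"])
df["age_tree_filled"] = df["age"]
df.loc[df["age"].isna(), "age_tree_filled"] = tree.predict(unknown[["income"]])
```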

4 Data Mining Process, CRISP-DM: Data Preparation (Cleaning). Dealing with outliers: an outlier can be noise, or outlier detection can itself be the interesting finding. Outlier analysis methods: model-based outlier detection; using distance measures; density-based local outlier detection.
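A small sketch of the distance-based and density-based ideas, using scikit-learn's `LocalOutlierFactor` for the latter (the data and the z-score threshold are invented for the example):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(100, 2))
X[:3] += 8  # plant three obvious outliers

# Distance-based view: flag points far from the sample mean (z-score style).
z = np.abs(X - X.mean(axis=0)) / X.std(axis=0)
distance_outliers = np.where((z > 3).any(axis=1))[0]

# Density-based local outlier detection (LOF): fit_predict returns -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)
lof_outliers = np.where(labels == -1)[0]
```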

5 Data Mining Process, CRISP-DM: Data Preparation (Transformation). Adding new attributes according to new facts or background knowledge. Creation of new attributes from available attributes, e.g. surface area instead of length and width. Aggregation and generalization of attribute values, e.g. monthly sales instead of daily sales, or city instead of streets.

6 Data Mining Process, CRISP-DM: Data Preparation (Transformation). Example: monthly sales instead of daily sales. The daily table (Company 1: S_11 .. S_1,30; Company 2: S_21 .. S_2,30; ...; Company n: S_n1 .. S_n,30) is collapsed to a monthly table (Company i: M_i), where $M_i = \sum_{j=1}^{30} S_{ij}$.
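This aggregation could be sketched with pandas as follows (the daily-sales frame is invented):

```python
import pandas as pd

# One row per company and day; S_ij = sales of company i on day j.
daily = pd.DataFrame({
    "company": ["c1"] * 30 + ["c2"] * 30,
    "day":     list(range(1, 31)) * 2,
    "sales":   range(60),
})

# M_i = sum of the 30 daily values S_i1 .. S_i30 per company.
monthly = daily.groupby("company")["sales"].sum().rename("monthly_sales")
```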

7 Data Mining Process, CRISP-DM: Data Preparation (Transformation). Binarization of categorical attributes. Discretization of continuous-valued attributes. Normalization of attribute values, e.g. to values between 0 and 1.
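A minimal sketch of these three transformations with pandas (the column names are made up; the normalization shown is min-max scaling to [0, 1]):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "income": [1800.0, 3200.0, 2400.0],
})

# Binarization of a categorical attribute: one 0/1 column per category.
binarized = pd.get_dummies(df["gender"], prefix="gender")

# Discretization of a continuous-valued attribute into three intervals.
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])

# Min-max normalization to the interval [0, 1].
df["income_norm"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)
```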

8 Data Mining Process, CRISP-DM: Data Preparation (Integration). Sources such as DB 1, DB 2, flat file 1, and flat file 2 are merged into one dataset. Integration is a good occasion for cleaning.
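One way such an integration step might look with pandas, sketched on two invented tables that share a customer key; the merge also surfaces cleaning opportunities such as missing values and duplicates:

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
db1 = pd.DataFrame({"customer_id": [1, 2, 3], "income": [1800, 2400, 3200]})
db2 = pd.DataFrame({"customer_id": [2, 3, 4], "credit_rating": ["good", "bad", "good"]})

# Integrate on the shared key; an outer join keeps customers known to only one source.
integrated = db1.merge(db2, on="customer_id", how="outer")

# Good occasion for cleaning: inspect duplicates and missing values after the merge.
duplicates = integrated[integrated.duplicated("customer_id", keep=False)]
missing = integrated[integrated.isna().any(axis=1)]
```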

9 Data Mining Process. 1. Task identification: classification, prediction, clustering, ... 2. Determining the DM-algorithms: decision trees, neural networks, association rules, ... 3. Choosing the evaluation function and evaluation method: sum of squared errors, accuracy rates, loss functions, cross-validation, ... 4. Choosing the search (optimization) method: analytical methods, greedy search, gradient descent, ... 5. Choosing the data management method (not always necessary).

10 Data Mining Process, Task Identification. Tasks with examples: classification (credit scoring: good customer vs. bad customer); clustering (customer loyalty, based e.g. on age, income, education); concept description; prediction; deviation detection; dependency analysis (A and B imply C); sequence patterns.

11 Data Mining Process, Task Identification: Classification. Examples: credit scoring (good customer, bad customer); quality management (device defect, not defect, perhaps defect); marketing (customer buys, customer doesn't buy); university entrance exam (successful, unsuccessful). Suppose a tuple X is represented by n attributes A1, A2, ..., An and is assigned to another predefined attribute called the target variable or class label attribute. For m tuples we then have a matrix representation with columns A1, A2, ..., An and the class label, one row per tuple. Goal: using this data, build a classifier that can predict the class label of a new tuple only from its attribute values.

12 Data Mining Process, Task Identification: Classification. Simple fictive example: credit rating in a bank. In classification, the class label (target variable) is a nominal variable.

Customer | Income > 2000 | Car | Gender | Credit Rating
1 | no | yes | F | bad
2 | no statement | no | F | bad
3 | no statement | yes | M | good
4 | no | yes | M | bad
5 | yes | yes | M | good
6 | yes | yes | F | good
7 | no statement | yes | F | good
8 | yes | no | F | good
9 | no statement | no | M | bad
10 | no | no | F | bad

New customer: income = 3000, car = yes, gender = female. Credit rating?
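A minimal sketch of building such a classifier on this toy table; scikit-learn's decision tree is an assumption here, since the slide does not fix an algorithm:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "income_gt_2000": ["no", "no statement", "no statement", "no", "yes",
                       "yes", "no statement", "yes", "no statement", "no"],
    "car":    ["yes", "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "no"],
    "gender": ["F", "F", "M", "M", "M", "F", "F", "F", "M", "F"],
    "rating": ["bad", "bad", "good", "bad", "good", "good", "good", "good", "bad", "bad"],
})

# Encode the nominal attributes as 0/1 indicator columns.
X = pd.get_dummies(data.drop(columns="rating"))
y = data["rating"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Classify the new customer: income = 3000 (> 2000), car = yes, gender = F.
new = pd.DataFrame([{"income_gt_2000": "yes", "car": "yes", "gender": "F"}])
new_encoded = pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)
print(clf.predict(new_encoded))
```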

13 Data Mining Process, Task Identification: Prediction. Examples: monthly sales, hourly exchange rate, average daily temperature. Suppose a tuple X is represented by n attributes A1, A2, ..., An and is assigned to another predefined attribute called the target attribute. For m tuples we then have a matrix representation with columns A1, A2, ..., An and the target attribute, one row per tuple. Goal: using this data, build a predictor that can predict the target value of a new tuple only from its attribute values.

14 Data Mining Process, Task Identification: Prediction. Simple fictive example: prediction of annual income from a table with columns ID, income three years ago, education (High School, BSc, BA, MA, MS, PhD), and age; the target attribute is the current annual income. In prediction, the target variable is a continuous-valued variable. New tuple: income three years ago = 60000, education = BA, age = 35. Annual income?
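As a hedged sketch, a predictor for this task could be built as follows; the training rows and the choice of linear regression are invented for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented training rows with the slide's attribute layout.
train = pd.DataFrame({
    "income_3y_ago": [30000, 45000, 60000, 52000, 70000],
    "education":     ["High School", "BSc", "PhD", "BA", "MS"],
    "age":           [28, 34, 41, 35, 45],
    "income":        [34000, 52000, 75000, 58000, 80000],
})

# Encode the nominal education attribute and fit the predictor.
X = pd.get_dummies(train.drop(columns="income"))
y = train["income"]
model = LinearRegression().fit(X, y)

# Predict for: income three years ago = 60000, education = BA, age = 35.
query = pd.DataFrame([{"income_3y_ago": 60000, "education": "BA", "age": 35}])
Xq = pd.get_dummies(query).reindex(columns=X.columns, fill_value=0)
print(model.predict(Xq))
```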

15 Data Mining Process: Determining the DM-algorithms. Data mining draws on three base technologies: statistics, AI (machine learning), and database technology. Mining algorithms by base technology: machine learning (rule-based induction, decision trees, neural networks, ...); statistics (discriminant analysis, cluster analysis, regression analysis, logistic regression analysis, ...); database technology (association rules, sequence mining, ...).

16 Data Mining Process: Choosing the evaluation function and evaluation method. General remarks: results produced by the model are normally worse than the real facts. Reasons: errors in the data, model misspecification, structural change, ... To evaluate the results produced by the model we need evaluation methods and evaluation functions.

17 Data Mining Process: Choosing the evaluation function and evaluation method. Shaping training and test data and choosing the evaluation function: the training set drives model generation, and the test set is used to calculate the model's accuracy.

18 Data Mining Process: Choosing the evaluation function and evaluation method. Choose a certain portion of the data for training and test, for example 70% training and 30% test. Alternatives: cross-validation and leave-one-out.
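A sketch of these three evaluation schemes with scikit-learn; the iris data and the decision tree are stand-ins for any dataset and classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: 70% training, 30% test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_accuracy = clf.fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross-validation.
cv_scores = cross_val_score(clf, X, y, cv=10)

# Leave-one-out: as many folds as observations, each of size 1.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
```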

19 Data Mining Process: Choosing the evaluation function and evaluation method. In some cases, besides the training and test datasets, a validation dataset is used too: the training set drives model generation, the validation set drives model calibration, and the test set is used to calculate accuracy.

20 Data Mining Process: Choosing the evaluation function and evaluation method. Training set: a set of examples used for learning, that is, to fit the parameters [i.e., weights] of the classifier. Validation set: a set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network. Test set: a set of examples used only to assess the performance [generalization] of a fully-specified classifier. Ripley (1996), p. 354.

21 Data Mining Process, Classification: Choosing the evaluation function and evaluation method.

Confusion matrix:

Model \ Actual | Class 1 | Class 2
Class 1 | N1 | M2
Class 2 | M1 | N2

N1 is the number of correctly classified observations of class 1; N2 is the number of correctly classified observations of class 2; M1 is the number of incorrectly classified observations (from class 1 to class 2); M2 is the number of incorrectly classified observations (from class 2 to class 1).

Accuracy rate: $AR = \frac{N_1 + N_2}{N_1 + N_2 + M_1 + M_2}$. Error rate: $ER = \frac{M_1 + M_2}{N_1 + N_2 + M_1 + M_2}$, so $AR = 1 - ER$.
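These two rates are straightforward to compute from the matrix entries; the counts below are invented:

```python
import math

# Entries of the slide's confusion matrix (invented counts):
#                 actual class 1   actual class 2
# model class 1        N1               M2
# model class 2        M1               N2
N1, N2, M1, M2 = 40, 35, 15, 10

accuracy_rate = (N1 + N2) / (N1 + N2 + M1 + M2)
error_rate = (M1 + M2) / (N1 + N2 + M1 + M2)
assert math.isclose(accuracy_rate, 1 - error_rate)  # AR = 1 - ER
```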

22 Data Mining Process, Classification: Choosing the evaluation function and evaluation method. Example, using the credit rating data from slide 12:

Customer | Income > 2000 | Car | Gender | Actual | Predicted
1 | no | yes | F | bad | bad
2 | no statement | no | F | bad | bad
3 | no statement | yes | M | good | bad
4 | no | yes | M | bad | bad
5 | yes | yes | M | good | bad
6 | yes | yes | F | good | good
7 | no statement | yes | F | good | good
8 | yes | no | F | good | bad
9 | no statement | no | M | bad | good
10 | no | no | F | bad | bad

Confusion matrix (model vs. actual): model good / actual good = 2, model good / actual bad = 1, model bad / actual good = 3, model bad / actual bad = 4. Accuracy rate = 6/10 = 60%; error rate = 40%.

23 Data Mining Process, Classification: Choosing the evaluation function and evaluation method. With class 1 = positive and class 2 = negative, the confusion matrix entries get the usual names:

Model \ Actual | Class 1 (positive) | Class 2 (negative)
Class 1 (positive) | true positive (N1) | false positive (M2)
Class 2 (negative) | false negative (M1) | true negative (N2)

24 Data Mining Process, Prediction: Choosing the evaluation function and evaluation method.

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |Y_i - \hat{Y}_i|$ (mean absolute error)
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ (mean squared error)
$\mathrm{RAE} = \sum_{i=1}^{n} |Y_i - \hat{Y}_i| \Big/ \sum_{i=1}^{n} |Y_i - \bar{Y}|$ (relative absolute error)
$\mathrm{RSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \Big/ \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ (relative squared error)

n = number of observations in the test data; $Y_i$ = actual value, $\hat{Y}_i$ = predicted value, $\bar{Y}$ = mean of the $Y_i$.
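A small sketch computing the four measures with NumPy (the actual and predicted values are invented):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values Y_i
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # predicted values Y^_i
y_bar = y_true.mean()                      # mean of the actual values

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_bar))
rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_bar) ** 2)
```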

25 Data Mining Process, CRISP-DM: Evaluation. Evaluate results, review the process, determine next steps.

26 Data Mining Process, CRISP-DM: Deployment. Plan deployment, plan monitoring and maintenance, produce the final report, review the project.

27 Data Mining Process. Another example: the SAS data mining process, SEMMA. Sample the data by creating one or more data tables; the sample should be large enough to contain the significant information, yet small enough to process. Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas. Modify the data by creating, selecting, and transforming the variables to focus the model selection process. Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome. Assess the data by evaluating the usefulness and reliability of the findings from the data mining process. Source:

28 Data Mining Algorithms. Mining algorithms by base technology: machine learning (rule-based induction, decision trees, neural networks, conceptual clustering, ...); statistics (discriminant analysis, cluster analysis, regression analysis, logistic regression analysis, ...); database technology (association rules, ...).