Professor Dr. Gholamreza Nakhaeizadeh

Transcription:

Statistical Methods in Data Mining: Mining Process (Part 3)
Professor Dr. Gholamreza Nakhaeizadeh
[CRISP-DM phase diagram: Business Understanding, Preparation, Modelling, Evaluation, Deployment]

Short review of the last lecture: Pre-Processing
Observation and attribute reduction
Sampling:
- Representative sample
- Random sampling
- Sampling with and without replacement
- Systematic sampling
- Stratified sampling
Attribute reduction:
- Supervised and unsupervised learning
- Why attribute reduction? What are the benefits?
- Curse of dimensionality
- Attribute reduction: generating new attributes, selection of attribute subsets
- First elementary steps: consideration of background knowledge, screening
- Attribute ranking according to importance: supervised and unsupervised ranking
- Embedded approaches, wrapper
- PCA

Mining Process
CRISP-DM: Preparation (Cleaning)
Dealing with:
Missing values
- Ignore the observation
- Ignore the attribute
- Use the attribute mean
- Predict the missing value (decision tree, regression, ...)
Inaccurate data
- Use background knowledge (rules)
Duplicates
- Straße, Strasse, Str.
- Robert X, Bob X
- Professor, Prof. Dr.
[illustration: garbled numeric records before and after cleaning]
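As a concrete illustration of the "use the attribute mean" strategy, here is a minimal Python sketch using pandas; the column names and values are invented for the example.

```python
# Mean imputation: replace each missing value with the mean of the
# observed values of the same attribute. Data are invented.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [2400.0, np.nan, 3100.0, 2800.0, np.nan],
    "age":    [34.0, 41.0, np.nan, 52.0, 29.0],
})

df_imputed = df.fillna(df.mean(numeric_only=True))
print(df_imputed)
```

More sophisticated variants predict the missing value from the other attributes, e.g. with a regression model or a decision tree, as the slide notes.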

Mining Process
CRISP-DM: Preparation (Cleaning)
Dealing with outliers
- Outlier as noise
- Outlier detection as an interesting finding
Outlier analysis methods:
- Model-based outlier detection
- Using distance measures
- Density-based local outlier detection
[illustration: garbled numeric records before and after cleaning]
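A minimal sketch of density-based local outlier detection, using scikit-learn's LocalOutlierFactor on made-up two-dimensional data:

```python
# Density-based local outlier detection (LOF): points whose local density
# is much lower than that of their neighbours are flagged as outliers.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),  # dense cluster
               [[8.0, 8.0]]])                    # one isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                      # -1 marks outliers
print(np.where(labels == -1)[0])                 # the isolated point is flagged
```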

Mining Process
CRISP-DM: Preparation (Transformation)
Adding new attributes
- According to new facts or background knowledge
Creating new attributes from available attributes
- Surface area instead of length and width
Aggregation and generalization of attribute values
- Monthly sales instead of daily sales
- City instead of streets

Mining Process
CRISP-DM: Preparation (Transformation)
Example: monthly sales instead of daily sales

Daily table: each Company i (i = 1, ..., n) has daily sales S_{i1}, ..., S_{i,30}.
Aggregated table: each Company i has a single monthly total M_i, where

$M_i = \sum_{j=1}^{30} S_{ij}$
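The same aggregation in pandas, with invented sales figures:

```python
# Aggregation: monthly sales M_i as the sum of the 30 daily sales S_ij.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
daily = pd.DataFrame(rng.integers(100, 500, size=(3, 30)),
                     index=["Company 1", "Company 2", "Company 3"],
                     columns=[f"Day{j}" for j in range(1, 31)])

monthly = daily.sum(axis=1).rename("Month1")   # M_i = sum over j of S_ij
print(monthly)
```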

Mining Process
CRISP-DM: Preparation (Transformation)
- Binarization of categorical attributes
- Discretization of continuous-valued attributes
- Normalization of attribute values (values between 0 and 1)
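A short sketch of all three transformations with pandas, on invented data:

```python
# Binarization, discretization, and min-max normalization to [0, 1].
import pandas as pd

df = pd.DataFrame({"gender": ["F", "M", "F"],
                   "income": [24552.0, 88282.0, 53542.0]})

binarized = pd.get_dummies(df["gender"], prefix="gender")     # 0/1 columns
discretized = pd.cut(df["income"], bins=3,
                     labels=["low", "mid", "high"])           # 3 intervals
normalized = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min())                  # values in [0, 1]
print(binarized, discretized, normalized, sep="\n\n")
```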

Mining Process
CRISP-DM: Preparation (Integration)
- Combine the sources (DB 1, DB 2, flat files 1, flat files 2) into one dataset
- A good occasion for cleaning
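A minimal sketch of integration as a join of two sources; the tables and the shared customer_id key are invented for illustration:

```python
# Integration: combining two sources on a shared key. Keys missing on one
# side surface here, which is why integration is a good occasion for cleaning.
import pandas as pd

db1 = pd.DataFrame({"customer_id": [1, 2, 3], "income": [2400, 3100, 2800]})
db2 = pd.DataFrame({"customer_id": [2, 3, 4], "rating": ["good", "bad", "good"]})

merged = db1.merge(db2, on="customer_id", how="inner")
print(merged)
```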

Mining Process
1. Task identification: classification, prediction, clustering, ...
2. Determining the DM-algorithms: decision trees, neural networks, association rules, ...
3. Choosing the evaluation function and evaluation method: sum of squared errors, accuracy rates, loss function, cross-validation, ...
4. Choosing the search (optimization) method: analytical methods, greedy search, gradient descent, ...
5. Choosing the data management method: not always necessary

Mining Process
Task Identification
- Classification: credit scoring (good customer / bad customer)
- Clustering
- Concept description: customer loyalty (age, income, education, ...)
- Prediction
- Deviation detection
- Dependency analysis: A and B -> C
- Sequence patterns

Mining Process
Task Identification: Classification
Examples:
- Credit scoring: good customer / bad customer
- Quality management: device defect / device not defect / perhaps defect
- Marketing: customer buys / customer doesn't buy
- University entrance exam: successful / unsuccessful

Suppose that a tuple x is represented by n attributes A1, A2, ..., An and is assigned to another predefined attribute called the target variable or class label attribute. For m tuples we then have a matrix representation with the columns A1, A2, ..., An and the class label.

Goal: using this data, build a classifier that can predict the class label of a new tuple from its attribute values alone.

Mining Process
Task Identification: Classification
Simple fictive example: credit rating in a bank

Customer      Income >2000   Car   Gender   Credit Rating
Customer 1    no             yes   F        bad
Customer 2    no statement   no    F        bad
Customer 3    no statement   yes   M        good
Customer 4    no             yes   M        bad
Customer 5    yes            yes   M        good
Customer 6    yes            yes   F        good
Customer 7    no statement   yes   F        good
Customer 8    yes            no    F        good
Customer 9    no statement   no    M        bad
Customer 10   no             no    F        bad

In classification, the class label (target variable) is a nominal variable.
New customer: income=3000, car=yes, gender=female. Credit rating?
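A sketch of the stated goal on exactly this table: fit a classifier on the ten customers and predict the rating of the new one (income=3000, i.e. ">2000" = yes). The choice of a decision tree is illustrative, not prescribed by the slide.

```python
# Fit a decision tree on the fictive credit-rating table and classify
# the new customer (income > 2000 = yes, car = yes, gender = F).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "income_gt_2000": ["no", "no statement", "no statement", "no", "yes",
                       "yes", "no statement", "yes", "no statement", "no"],
    "car":    ["yes", "no", "yes", "yes", "yes",
               "yes", "yes", "no", "no", "no"],
    "gender": ["F", "F", "M", "M", "M", "F", "F", "F", "M", "F"],
    "rating": ["bad", "bad", "good", "bad", "good",
               "good", "good", "good", "bad", "bad"],
})

X = pd.get_dummies(data.drop(columns="rating"))
clf = DecisionTreeClassifier(random_state=0).fit(X, data["rating"])

new = pd.DataFrame({"income_gt_2000": ["yes"], "car": ["yes"], "gender": ["F"]})
new_X = pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)
print(clf.predict(new_X))   # predicted class label for the new customer
```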

Mining Process
Task Identification: Prediction
Examples:
- Monthly sales: 2000, 2560, 1947, ...
- Hourly exchange rate: 1.3918, 1.3917, 1.3914, ...
- Average daily temperature: 23.4, 25.6, 24.6, ...

Suppose that a tuple x is represented by n attributes A1, A2, ..., An and is assigned to another predefined attribute called the target attribute. For m tuples we then have a matrix representation with the columns A1, A2, ..., An and the target attribute.

Goal: using this data, build a predictor that can predict the target value of a new tuple from its attribute values alone.

Mining Process
Task Identification: Prediction
Simple fictive example: prediction of annual income

ID   Income three years ago   Education     Age   Income
1A   24552                    High School   32    27026
2A   88282                    BSc           52    93725
3B   82902                    PhD           41    82356
4A   39838                    High School   56    36828
5C   53542                    PhD           32    62542
6M   63826                    MS            28    64882
7D   82783                    MA            43    89025
8A   72886                    High School   33    74925
9Q   21383                    BA            37    62572
1R   63552                    BA            41    66427
1T   62522                    High School   25    63552
1E   65254                    PhD           56    67252

In prediction, the target variable is a continuous-valued variable.
New tuple: income three years ago=60000, education=BA, age=35. Annual income?
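In the same spirit, a sketch that fits a predictor on these twelve tuples and predicts the new one; linear regression is an illustrative choice, not prescribed by the slide:

```python
# Fit a linear regression on the fictive income table and predict the
# annual income of the new tuple (60000, BA, 35).
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "income_3y_ago": [24552, 88282, 82902, 39838, 53542, 63826,
                      82783, 72886, 21383, 63552, 62522, 65254],
    "education": ["High School", "BSc", "PhD", "High School", "PhD", "MS",
                  "MA", "High School", "BA", "BA", "High School", "PhD"],
    "age": [32, 52, 41, 56, 32, 28, 43, 33, 37, 41, 25, 56],
    "income": [27026, 93725, 82356, 36828, 62542, 64882,
               89025, 74925, 62572, 66427, 63552, 67252],
})

X = pd.get_dummies(data.drop(columns="income"))
reg = LinearRegression().fit(X, data["income"])

new = pd.DataFrame({"income_3y_ago": [60000], "education": ["BA"], "age": [35]})
new_X = pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)
print(reg.predict(new_X))   # predicted annual income (continuous target)
```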

Mining Process
Determining the DM-algorithms
Base technologies of the mining algorithms:
- Machine learning (AI): rule-based induction, decision trees, neural networks, ...
- Statistics: discriminant analysis, cluster analysis, regression analysis, logistic regression analysis, ...
- Database technology: association rules, sequence mining, ...

Mining Process
Choosing the evaluation function and evaluation method
General remarks:
- Results produced by the model are normally worse than the real facts.
- Reasons: errors in the data, model misspecification, structural change, ...
To evaluate the results produced by the model we need:
- Evaluation methods
- Evaluation functions

Mining Process
Choosing the evaluation function and evaluation method
- Shaping the training and test data
- Choosing the evaluation function
[diagram: training data -> model generation; test data -> calculate accuracy]

Mining Process
Choosing the evaluation function and evaluation method
Choosing a certain portion of the data for training and test
- Example: 70% training, 30% test
Cross-validation
- [diagram: the data split into 5 equal parts; each part serves once as the test set while the remaining parts form the training set]
Leave-one-out
- [diagram: with 1000 observations, each single observation serves once as the test set and the model is trained on the remaining 999]
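These three evaluation schemes in scikit-learn; a built-in dataset and a decision tree are used here only to keep the sketch self-contained:

```python
# Holdout (70/30), 5-fold cross-validation, and leave-one-out evaluation.
from sklearn.datasets import load_iris
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 70% training, 30% test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(clf.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: each fold serves once as the test set
cv5 = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv5).mean())

# Leave-one-out: n rounds, each with a single test observation
print(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```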

Mining Process
Choosing the evaluation function and evaluation method
In some cases, besides the training and test datasets, a validation dataset is used too.
[diagram: training data -> model generation; validation data -> model calibration; test data -> calculate accuracy]

Mining Process
Choosing the evaluation function and evaluation method
- Training set: a set of examples used for learning, that is to fit the parameters [i.e., weights] of the classifier.
- Validation set: a set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.
- Test set: a set of examples used only to assess the performance [generalization] of a fully-specified classifier.
Ripley (1996), p. 354
[diagram: training data -> model generation; validation data -> model calibration; test data -> calculate accuracy]

Mining Process: Classification
Choosing the evaluation function and evaluation method

Confusion matrix:

                 Actual class 1   Actual class 2
Model class 1    N1               M2
Model class 2    M1               N2

- N1 is the number of correctly classified observations of class 1
- N2 is the number of correctly classified observations of class 2
- M1 is the number of incorrectly classified observations (from class 1 to class 2)
- M2 is the number of incorrectly classified observations (from class 2 to class 1)

Accuracy rate: AR = (N1 + N2) / (N1 + N2 + M1 + M2)
Error rate: ER = (M1 + M2) / (N1 + N2 + M1 + M2)
AR = 1 - ER
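These definitions are easy to compute directly; the four counts below are invented:

```python
# Accuracy and error rates from the four confusion-matrix counts.
N1, N2 = 45, 40   # correctly classified observations of class 1 / class 2
M1, M2 = 10, 5    # misclassified: class 1 -> class 2, class 2 -> class 1

total = N1 + N2 + M1 + M2
accuracy_rate = (N1 + N2) / total
error_rate = (M1 + M2) / total
assert abs(accuracy_rate - (1 - error_rate)) < 1e-12   # AR = 1 - ER
print(accuracy_rate, error_rate)
```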

Mining Process: Classification
Choosing the evaluation function and evaluation method
Example (actual vs. predicted credit rating):

Customer      Income >2000   Car   Gender   Actual   Predicted
Customer 1    no             yes   F        bad      bad
Customer 2    no statement   no    F        bad      bad
Customer 3    no statement   yes   M        good     bad
Customer 4    no             yes   M        bad      bad
Customer 5    yes            yes   M        good     bad
Customer 6    yes            yes   F        good     good
Customer 7    no statement   yes   F        good     good
Customer 8    yes            no    F        good     bad
Customer 9    no statement   no    M        bad      good
Customer 10   no             no    F        bad      bad

Confusion matrix:

              Actual good   Actual bad
Model good    2             1
Model bad     3             4

Accuracy rate = 6/10 = 60%, error rate = 40%

Mining Process: Classification
Choosing the evaluation function and evaluation method
Class 1 = positive, class 2 = negative

Confusion matrix:

                            Actual class 1 (positive)   Actual class 2 (negative)
Model class 1 (positive)    true positive (N1)          false positive (M2)
Model class 2 (negative)    false negative (M1)         true negative (N2)

Mining Process: Prediction
Choosing the evaluation function and evaluation method

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert Y_i - \hat{Y}_i \rvert$   (Mean Absolute Error)

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$   (Mean Squared Error)

$\mathrm{RAE} = \dfrac{\sum_{i=1}^{n} \lvert Y_i - \hat{Y}_i \rvert}{\sum_{i=1}^{n} \lvert Y_i - \bar{Y} \rvert}$   (Relative Absolute Error)

$\mathrm{RSE} = \dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$   (Relative Squared Error)

where n is the number of observations in the test data, $Y_i$ the actual value, $\hat{Y}_i$ the predicted value, and $\bar{Y}$ the mean of the actual values.
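The same four measures written out in NumPy; the actual and predicted values are invented test data:

```python
# MAE, MSE, RAE, and RSE as defined above.
import numpy as np

y     = np.array([27026.0, 93725.0, 82356.0, 36828.0])   # actual Y_i
y_hat = np.array([26500.0, 91000.0, 84000.0, 40000.0])   # predicted values
y_bar = y.mean()                                          # mean of the Y_i

mae = np.mean(np.abs(y - y_hat))                          # Mean Absolute Error
mse = np.mean((y - y_hat) ** 2)                           # Mean Squared Error
rae = np.abs(y - y_hat).sum() / np.abs(y - y_bar).sum()   # Relative Absolute Error
rse = ((y - y_hat) ** 2).sum() / ((y - y_bar) ** 2).sum() # Relative Squared Error
print(mae, mse, rae, rse)
```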

Mining Process
CRISP-DM: Evaluation
- Evaluate results
- Review process
- Determine next steps
Source: http://www.crisp-dm.org/crispwp-0800.pdf

Mining Process
CRISP-DM: Deployment
- Plan deployment
- Plan monitoring and maintenance
- Produce final report
- Review project
Source: http://www.crisp-dm.org/crispwp-0800.pdf

Mining Process
Another example: the SAS data mining process SEMMA
- Sample the data by creating one or more data tables. The sample should be large enough to contain the significant information, yet small enough to process.
- Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
- Modify the data by creating, selecting, and transforming the variables to focus the model selection process.
- Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.
- Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.
Source: http://support.sas.com/documentation/onlinedoc/miner/getstarted.pdf

Mining Algorithms
Base technologies of the mining algorithms:
- Machine learning: rule-based induction, decision trees, neural networks, conceptual clustering, ...
- Statistics: discriminant analysis, cluster analysis, regression analysis, logistic regression analysis, ...
- Database technology: association rules, ...