Credibility: Evaluating What's Been Learned


1  Credibility: Evaluating What's Been Learned (Chapter 5 of Data Mining)

Evaluation: the Key to Success
How predictive is the model we learned? Accuracy on the training data is not a good indicator of performance on future data.
A simple solution, usable if lots of (labeled) data is available, is the holdout method: split the data into a training set and a test set, usually 2/3 for training and 1/3 for testing. However, labeled data is usually limited, so more sophisticated techniques need to be used.
The natural performance measure for classification problems is the error rate:
- Success: the instance's class is predicted correctly.
- Error: the instance's class is predicted incorrectly.
- Error rate: the proportion of errors made over the whole set of instances.
The resubstitution error, i.e. the error rate obtained on the training data, is (hopelessly) optimistic.
The test set consists of independent instances that have played no part in the formation of the classifier. The assumption is that both training instances and test instances are representative samples of the underlying problem; this calls for stratification (more later).
Test and training data may also differ in nature. Example: classifiers are built using customer data from two different towns A and B. To estimate the performance of the classifier from town A in a completely new town, test it on the data from town B.
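As a concrete illustration of the holdout split and of the difference between the resubstitution error and the test-set error, here is a minimal sketch in Python, assuming scikit-learn and its bundled iris data as a stand-in dataset (neither appears in the slides):

```python
# Holdout evaluation sketch: 2/3 for training, 1/3 for testing.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # stand-in labeled dataset

# Hold out one third of the data; stratify so class proportions match.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Error rate = proportion of instances whose class is predicted incorrectly.
test_error = (model.predict(X_test) != y_test).mean()

# Resubstitution error (error rate on the training data) is usually optimistic.
resub_error = (model.predict(X_train) != y_train).mean()

print(f"holdout error rate: {test_error:.3f}, resubstitution error: {resub_error:.3f}")
```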

2-3  [Test-example tables with columns ID#, Throat, Glands, Congestion, Headache, Diagnosis, and Model's Diagnosis; the six rows of cell values are not shown here.]

4  (cont.)
The error rate of a model is the percentage of test examples that it misclassifies. In our example, the error rate = 2/6 = 33.3%.
The model's accuracy = 100% - error rate = 66.7%.
Problem: these metrics treat all misclassifications as being equal, which isn't always the case. Example: it is more problematic to misclassify strep throat than to misclassify a cold or an allergy.
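A small worked sketch of this arithmetic in plain Python; the two diagnosis lists are hypothetical stand-ins for the six test examples, chosen only so that two of the six predictions are wrong, matching the 2/6 figure above:

```python
# Error rate and accuracy on a small test set (plain Python).
# Hypothetical diagnoses for six test examples: two predictions are wrong.
actual    = ["strep", "cold", "allergy", "cold", "strep",   "allergy"]
predicted = ["strep", "cold", "cold",    "cold", "allergy", "allergy"]

errors = sum(a != p for a, p in zip(actual, predicted))
error_rate = errors / len(actual)   # 2/6 = 0.333...
accuracy = 1 - error_rate           # equivalently, 100% - error rate

print(f"error rate = {error_rate:.1%}, accuracy = {accuracy:.1%}")
```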

5  (cont.)
To provide a more detailed picture of the model's accuracy, we can use a confusion matrix, with one row per actual class and one column per predicted class (cold, allergy, strep throat). The diagonal of the matrix shows the cases that were correctly classified. [Example confusion matrix for the six test cases: entries not shown here.]
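A sketch of how such a matrix can be built, assuming Python with scikit-learn and reusing the same hypothetical six examples as in the previous sketch:

```python
# Building a three-class confusion matrix (assumes scikit-learn).
from sklearn.metrics import confusion_matrix

labels = ["cold", "allergy", "strep"]

# Hypothetical actual/predicted diagnoses for six test examples.
actual    = ["strep", "cold", "allergy", "cold", "strep",   "allergy"]
predicted = ["strep", "cold", "cold",    "cold", "allergy", "allergy"]

# Rows are actual classes, columns are predicted classes;
# the diagonal holds the correctly classified cases.
cm = confusion_matrix(actual, predicted, labels=labels)
print(cm)
# [[2 0 0]    row "cold":    both colds diagnosed correctly
#  [1 1 0]    row "allergy": one misclassified as a cold
#  [0 1 1]]   row "strep":   one misclassified as an allergy
```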

6  Interpreting a Confusion Matrix
Let's say that we had a larger number of test examples and obtained a confusion matrix with 106 test cases in total (the individual entries are not shown here).
What is the accuracy of the model?
- total # of test cases = 106
- # of accurate classifications (the sum of the diagonal) = 73
- accuracy = 73/106 = .6886, or 68.9%
What is its error rate?
- error rate = 100% - accuracy = 100% - 68.9% = 31.1%

7  Interpreting a Confusion Matrix (cont.)
How many test cases of strep throat are there?
- the sum of the strep throat row = 42
How many colds were misdiagnosed?
- the off-diagonal entries of the cold row = 15
What percentage of the colds were correctly diagnosed?
- total # of colds = 40, correctly diagnosed = 25
- % correctly diagnosed = 25/40 = .625, or 62.5%
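The same row/column/diagonal arithmetic as a NumPy sketch. The individual matrix entries below are illustrative assumptions, chosen only so that the totals match the figures above (106 cases, 73 on the diagonal, 40 colds with 25 correct, 42 strep throat cases):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted,
# class order: cold, allergy, strep throat). Entries are illustrative but
# consistent with the totals quoted on the slides.
cm = np.array([[25, 10,  5],    # actual cold
               [ 4, 18,  2],    # actual allergy
               [ 7,  5, 30]])   # actual strep throat

total = cm.sum()                              # 106 test cases
correct = np.trace(cm)                        # 73 on the diagonal
accuracy = correct / total                    # 73/106 = 0.689
error_rate = 1 - accuracy                     # 0.311

strep_cases = cm[2].sum()                     # sum of the strep row = 42
colds_misdiagnosed = cm[0].sum() - cm[0, 0]   # 40 - 25 = 15
pct_colds_correct = cm[0, 0] / cm[0].sum()    # 25/40 = 0.625

print(f"accuracy = {accuracy:.1%}, error rate = {error_rate:.1%}, "
      f"colds correctly diagnosed = {pct_colds_correct:.1%}")
```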

8  Two-Class Confusion Matrices
When there are only two classes, the classification problem is often framed as a yes/no judgement:
- fraudulent / not fraudulent
- has cancer / doesn't have cancer
The terms positive/negative are often used in place of yes/no.
In such cases, there are four possible types of classifications:
- true positive (TP): the model correctly predicts "yes"
- false positive (FP): the model incorrectly predicts "yes"
- true negative (TN): the model correctly predicts "no"
- false negative (FN): the model incorrectly predicts "no"

              predicted
              yes    no
actual: yes   TP     FN
        no    FP     TN
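A sketch of how the four counts map onto a two-class confusion matrix, assuming scikit-learn; the yes/no labels below are hypothetical:

```python
# Extracting TP, FN, FP, TN from a two-class confusion matrix
# (assumes scikit-learn; the example labels are hypothetical).
from sklearn.metrics import confusion_matrix

actual    = ["yes", "no", "yes", "yes", "no",  "no", "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "yes", "no", "yes", "no"]

# With labels=["yes", "no"], rows/columns are ordered yes-then-no,
# matching the slide's layout:  [[TP, FN],
#                                [FP, TN]]
cm = confusion_matrix(actual, predicted, labels=["yes", "no"])
(tp, fn), (fp, tn) = cm
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")
```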

9  Comparing Models Using Confusion Matrices
Let's say we're trying to detect credit-card fraud. We use two different classification-learning techniques and get two different models. Which model is better?
Performance on 400 test examples (rows = actual fraud / not fraud, columns = predicted):
- model A: overall accuracy = .875; better at classifying actual fraud (fewer false negatives)
- model B: overall accuracy = (80 + 270)/400 = .875; better at classifying non-fraud (fewer false positives)
[The full confusion-matrix entries for the two models are not shown here.]
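A NumPy sketch of this comparison. Model B's diagonal (80 and 270 correct) is taken from the slide; all remaining entries, including everything in model A's matrix, are illustrative assumptions chosen so that both models reach 350/400 = 87.5% accuracy on the same 400 test examples:

```python
import numpy as np

# Two hypothetical confusion matrices on the same 400 test examples
# (rows = actual [fraud, not fraud], columns = predicted).
model_a = np.array([[95,  5],
                    [45, 255]])
model_b = np.array([[80, 20],
                    [30, 270]])

for name, cm in [("A", model_a), ("B", model_b)]:
    accuracy = np.trace(cm) / cm.sum()
    fn, fp = cm[0, 1], cm[1, 0]        # missed fraud vs. false alarms
    print(f"model {name}: accuracy = {accuracy:.3f}, FN = {fn}, FP = {fp}")

# Same overall accuracy, but A misses less actual fraud (fewer FNs),
# while B raises fewer false alarms on legitimate purchases (fewer FPs).
```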

10  Overall Accuracy Isn't Enough
Someone tells you that they have a fraud-detection classifier with an overall accuracy of 99%. Should you use it?
It depends on the test examples used to compute the accuracy!
Example:
- assume 1% of actual credit-card purchases are fraudulent
- assume the test examples reflect this: 10 examples of fraud, 990 examples of not fraud
- on these examples, a model can be 99% accurate by always predicting "not fraud"!

                     predicted
                     fraud   not fraud
actual: fraud          0        10
        not fraud      0       990
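The same point as a NumPy sketch: on a hypothetical test set with the slide's 1% fraud rate, a "model" that always predicts the majority class reaches 99% accuracy without detecting a single fraud case:

```python
import numpy as np

# Hypothetical imbalanced test set: 10 fraud cases, 990 legitimate ones.
y_test = np.array(["fraud"] * 10 + ["not fraud"] * 990)

# A "model" that always predicts the majority class.
y_pred = np.array(["not fraud"] * len(y_test))

accuracy = (y_pred == y_test).mean()
print(f"accuracy = {accuracy:.1%}")   # 99.0%, yet no fraud is ever detected
```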

11  Overall Accuracy Isn't Enough (cont.)
Test examples should include an adequate number of all possible classifications, especially the ones you're most concerned about getting right. In our example, we need to include enough examples of fraud.
It's also important that your training examples include all possible classifications. If you're primarily concerned about getting one of the classifications right, it may make sense to artificially inflate the number of examples of that class in the training data.

Beyond Holdout Estimation: the Repeated Holdout Method
What to do if the amount of data is limited?
Problem with the holdout method: the samples might not be representative. For example, a class might be missing entirely in the test data. An advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets.
The holdout estimate can be made more reliable by repeating the process with different subsamples:
- In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
- The error rates from the different iterations are averaged to yield an overall error rate.
This is called the repeated holdout method. It is still not optimal: the different test sets overlap. Can we prevent the overlapping?
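A repeated stratified holdout sketch, assuming scikit-learn and one of its bundled datasets as a stand-in:

```python
# Repeated (stratified) holdout estimation sketch.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in labeled dataset

error_rates = []
for seed in range(10):                        # 10 different random subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    error_rates.append((model.predict(X_te) != y_te).mean())

# Average the per-iteration error rates to get the repeated-holdout estimate.
print(f"repeated holdout error: {np.mean(error_rates):.3f} "
      f"(+/- {np.std(error_rates):.3f})")
# Note: the test sets of different iterations can still overlap,
# which is the motivation for cross-validation on the next slide.
```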

12  Cross-Validation
Cross-validation avoids overlapping test sets:
- First step: split the data into k subsets of equal size.
- Second step: use each subset in turn for testing and the remainder for training.
This is called k-fold cross-validation. Often the subsets are stratified before the cross-validation is performed. The error estimates are averaged to yield an overall error estimate.

More on Cross-Validation
The standard method for evaluation is stratified ten-fold cross-validation. Why ten? Extensive experiments have shown that this is the best choice for getting an accurate estimate, and there is also some theoretical evidence for it. Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation. For example, ten-fold cross-validation is repeated ten times and the results are averaged (which further reduces the variance).
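A sketch of stratified ten-fold cross-validation and its repeated variant, again assuming scikit-learn and a stand-in dataset:

```python
# Stratified 10-fold cross-validation and repeated (10 x 10-fold) variant.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import (StratifiedKFold, RepeatedStratifiedKFold,
                                     cross_val_score)
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in labeled dataset
model = DecisionTreeClassifier(random_state=0)

# Each instance is used exactly once for testing, so test folds never overlap.
cv10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv10)
print(f"stratified 10-fold accuracy: {scores.mean():.3f}")

# Repeat the whole 10-fold procedure ten times and average the results,
# which further reduces the variance of the estimate.
rcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=rcv)
print(f"10 x 10-fold accuracy: {scores.mean():.3f}")
```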