Graphical Methods for Classifier Performance Evaluation

Maria Carolina Monard and Gustavo E. A. P. A. Batista

University of São Paulo (USP)
Institute of Mathematics and Computer Science (ICMC)
Department of Computer Science and Statistics (SCE)
Laboratory of Computational Intelligence (LABIC)
P. O. Box 668, São Carlos, SP, Brazil
{gbatista,

Abstract. Evaluating the performance of classifiers is not as trivial as it might seem at first glance. Even the most widely used methods, such as measuring accuracy or error rate on a test set, have severe limitations. Two of the most prominent limitations of these measures are that they do not consider misclassification costs and that they can be misleading when the classes have very different prior probabilities. In recent years, several researchers have pointed out alternative methods to evaluate the performance of learning systems. Some of those methods are based on graphical evaluation of classifiers. Usually, a graphical evaluation lets the user analyze the performance of a classifier under different scenarios, for instance, with different misclassification costs, and select the classifier parameter setting that provides the best result. The objective of this paper is to survey some of the most used graphical methods for performance evaluation that do not rely on precise class and cost distribution information.

1 Introduction

In supervised learning, a set of $n$ training examples is given to an inducer. Each example $E_i$ is a tuple $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is a vector of $m$ feature values and $y_i$ is the class value. The main objective in supervised learning is to induce a general mapping from the vectors $\mathbf{x}$ to the class values $y$. Thus, the inducer should build a model, $y = f(\mathbf{x})$, of an unknown function $f$, also known as the concept function, which predicts $y$ values for previously unseen examples. However, in most cases, the number of examples used to induce a model is not sufficient to completely characterize the function $f$.
In fact, inducers are usually only able to induce a function $h$ that approximates $f$, i.e., $h(\mathbf{x}) \approx f(\mathbf{x})$, where $h$ is known as the hypothesis of the concept function $f$.

For classification problems, the $y$ values are drawn from a discrete set of classes $C = \{C_1, C_2, \ldots, C_{N_{cl}}\}$, where $N_{cl}$ is the number of classes. Given a set of training examples, the learning algorithm outputs a classifier such that, given a new unlabelled example, it accurately predicts the label $y$. Assuming the vectors $\mathbf{x}$ correspond to points in an $m$-dimensional space, $\mathbb{R}^m$, the objective is to find a function $h$ that approximates the function $f : \mathbb{R}^m \to C$. Thus, $h$ is a classifier that outputs a class value $C_k \in C$ for each new example $\mathbf{x}$.

In this work, we restrict our discussion to concept-learning¹ domains, so $y$ can assume one of two mutually exclusive values. We use the general labels positive and negative to discriminate between the two class values.

Evaluating the performance of classifiers is not as trivial as it might seem at first glance. Even the most widely used methods, such as measuring accuracy or error rate on a test set, have severe limitations. Two of the most prominent limitations of these measures are that they do not consider misclassification costs and that they can be misleading when the classes have very different prior probabilities. In recent years, several researchers have pointed out alternative methods to evaluate the performance of learning systems. Some of those methods are based on a graphical evaluation of classifiers. Usually, a graphical evaluation lets the user analyze the performance of a classifier under different scenarios, for instance, with different misclassification costs, and select the classifier's parameter setting that provides the best result. The objective of this paper is to survey some of the most used graphical methods for performance evaluation that do not rely on precise class and cost distribution information.

This work is organized as follows: Section 2 describes the conditions under which accuracy and error rate are appropriate measures of classifier performance; Section 3 discusses cost-sensitive learning and Section 4 introduces probabilistic classifiers; Section 5 presents three graphical methods that can be used when accuracy is not an appropriate measure of classifier performance; and Section 6 concludes this work.
2 Accuracy and Error Rate

Accuracy and error rate are appropriate measures when the misclassification costs and the prior probabilities of each class are the same. However, these assumptions can rarely be confirmed in practice. In most application domains, each type of error that a classifier can make has a different cost. For instance, in fraud detection for financial applications, the cost of generating a false alarm is usually lower than the cost of not detecting a fraud. Also, several researchers have reported that a large intrinsic difference in the prior probability of each class is common in a number of domains. In other words, there is a large difference in the number of examples belonging to each class. In the same fraud detection example, the number of fraudulent transactions is usually much smaller than the number of regular transactions.

When there is a significant difference in the prior probability of each class, error rate and accuracy can be very misleading metrics. For instance, it is straightforward to create a classifier that is 99% accurate if the data set has a majority class with 99% of all examples, by simply classifying every new case as belonging to the majority class.

The different types of errors and hits made by a classifier can be summarized in a confusion matrix. Table 1 illustrates a confusion matrix for a two-class problem.

¹ However, the methods discussed here can be adapted to multi-class problems by considering the class under study as the positive class, and the remaining classes as the negative class.
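The majority-class example above can be checked with a few lines of Python. This is only a sketch: the data set (990 regular transactions and 10 frauds) and the function name `majority_class_accuracy` are hypothetical and chosen to match the 99% figure in the text.

```python
def majority_class_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    majority = max(counts, key=counts.get)
    correct = sum(1 for y in labels if y == majority)
    return majority, correct / len(labels)


# Hypothetical test set: 99% regular (negative) and 1% fraudulent (positive).
labels = ["negative"] * 990 + ["positive"] * 10
majority, accuracy = majority_class_accuracy(labels)

# The trivial classifier flags a fraud only if the majority class is "positive".
frauds_detected = sum(
    1 for y in labels if y == "positive" and majority == "positive"
)
print(majority, accuracy, frauds_detected)
```

Under these assumed counts the trivial classifier reaches high accuracy while detecting none of the fraudulent transactions, which is the sense in which accuracy is misleading here.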

                  Positive Prediction    Negative Prediction
Positive Class    True Positive (a)      False Negative (b)
Negative Class    False Positive (c)     True Negative (d)

Table 1: Different types of errors and hits for a two-class problem.

For a multi-class problem with $N_{cl}$ classes, the confusion matrix has $N_{cl}^2$ entries. The correct classifications lie on the diagonal, and the off-diagonal entries contain the various cross-classification errors.

Several metrics for measuring the performance of learning systems can be extracted from a two-class confusion matrix, such as the error rate $Err = \frac{c+b}{a+b+c+d}$ and the accuracy $Acc = \frac{a+d}{a+b+c+d} = 1 - Err$. We can also derive other metrics that separate the errors, or hits, that occur in each class. These metrics measure the classification performance on the positive and negative classes independently. Some of these metrics are:

- False negative rate: $FN = \frac{b}{a+b}$ is the percentage of positive cases misclassified as belonging to the negative class;
- False positive rate: $FP = \frac{c}{c+d}$ is the percentage of negative cases misclassified as belonging to the positive class;
- True negative rate: $TN = \frac{d}{c+d} = 1 - FP$ is the percentage of negative cases correctly classified as belonging to the negative class;
- True positive rate: $TP = \frac{a}{a+b} = 1 - FN$ is the percentage of positive cases correctly classified as belonging to the positive class.

3 Cost Sensitive Learning

A cost-sensitive learning system can be used in applications where the misclassification costs are known. A misclassification cost is simply a value assigned as a penalty for making a mistake. In this case, misclassification cost can be used in place of error rate, and a cost-sensitive learning system attempts to reduce the cost of misclassified examples instead of the classification error.

Usually, a cost matrix is used to define the costs associated with a domain. A cost matrix is similar to a confusion matrix.
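The rates defined in Section 2 can be computed directly from the four cells of Table 1. The following is a minimal sketch; the function name `confusion_rates` and the example counts are hypothetical, and the code assumes that both classes have at least one example so that no denominator is zero.

```python
def confusion_rates(a, b, c, d):
    """Rates from a two-class confusion matrix laid out as in Table 1.

    a = true positives, b = false negatives,
    c = false positives, d = true negatives.
    Assumes a + b > 0 and c + d > 0.
    """
    total = a + b + c + d
    return {
        "Err": (c + b) / total,
        "Acc": (a + d) / total,
        "FN": b / (a + b),
        "FP": c / (c + d),
        "TN": d / (c + d),
        "TP": a / (a + b),
    }


# Hypothetical counts: 10 positive and 990 negative test examples.
rates = confusion_rates(a=2, b=8, c=5, d=985)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the accuracy is high while the true positive rate is low, which illustrates why the per-class rates are reported separately.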
Each entry of a cost matrix defines a constant cost for each type of error that can be committed by a classifier. Given a confusion matrix and a cost matrix, the total misclassification cost, $TC$, can be computed using Equation 1.

$$TC = \sum_{i=1}^{N_{cl}} \sum_{j=1}^{N_{cl}} \mathit{Conf}_{ij} \cdot \mathit{Cost}_{ij} \qquad (1)$$

where $\mathit{Conf}_{ij}$ is the number of examples in entry $(i, j)$ of the confusion matrix and $\mathit{Cost}_{ij}$ is the cost for that type of classification. If the values on the diagonal are represented with negative costs, then these values can be interpreted as gains or benefits.

So far, fixed numerical values have been used to measure costs. In a utility model of performance analysis, measures of cost can be modified by a function called a utility function. The nature of this function is part of the specification of the problem under study. Utility theory is widely used in economic analysis. For instance, a utility function based on wealth might be used to modify the cost values of an uncertain investment decision, because the risk in investing $10,000 is much greater for a small investor than for a large one [10].

Some learning systems are not able to integrate cost information into the learning process. However, there is a simple and general method to make any learning system cost-sensitive for a concept-learning problem if the costs are known and constant [2]. The idea is to change the class distribution in the training set towards the most costly class. Suppose that errors on the positive class are five times more costly than errors on the negative class. If the number of positive examples is artificially increased by a factor of five, then the learning system, aiming to reduce the number of classification errors, will come up with a classifier that is skewed towards avoiding errors on the positive class, since any such errors are penalized five times more. A theorem in [4] shows how to change the proportion of positive and negative examples in order to make optimal cost-sensitive classifications for a concept-learning problem. Moreover, a general method to make a learning system cost-sensitive is presented in [3]. This last method has the advantage of being applicable to multi-class problems.

4 Probabilistic Classifiers

Most learning algorithms can be adapted to produce probabilistic classifiers, i.e., to induce a classifier that produces the probability of an example being in each class. In this scenario, given a new example $\mathbf{x}$, the classifier does not output a class value, but a tuple $(P(C_1), P(C_2), \ldots, P(C_{N_{cl}}))$, where $P(C_k)$ is the probability that $\mathbf{x}$ belongs to class $C_k$. Naive Bayes is an intrinsically probabilistic classifier, but other learning systems can be adapted to produce such posterior probability estimates. For instance, in decision trees, the class distributions at the leaves can be used as an estimate.
Rule learning systems can make similar estimates with the class distributions in each rule, and neural networks produce continuous outputs that can be mapped to probability estimates. In fact, it might be more natural to take the probability of each class into account when judging correctness. For instance, an outcome predicted with a probability of 99% should perhaps weigh more heavily than one predicted with a probability of 51%.

Several works have shown that symbolic Machine Learning algorithms, especially decision trees, produce poor probability estimates. In [8], it is concluded that the limitation of decision tree algorithms for probability estimation lies not in the tree structure but in the tree-building algorithm. The Laplace correction is a very simple and effective method for improving the quality of a tree's probability estimates [1]. Even though decision trees do not produce good probability estimates, it is shown in [7] that they produce surprisingly good probability rankings. Probability rankings are the basis for building graphical methods for performance analysis, as discussed in the next section.

5 Graphical Performance Analysis with Probabilistic Classifiers

In practice, costs are rarely known precisely, and the analyst might want to consider different scenarios with different classification costs. For instance, in direct mailing, the number of respondents is much smaller than the number of non-respondents; respondents are usually less than 1%. Suppose that a mail campaign with a promotional offer is going to be sent to 100,000 households, and it is expected that 1% of them will respond, i.e., 1,000 respondents. Using a predictive model with a certain parameter setting, an analyst may be able to select 40,000 households (40%) for which the response rate is estimated to be 2%, i.e., 800 respondents. With another parameter setting, the analyst may be able to select a smaller set of 10,000 households (10%) with an expected response rate of 3%, i.e., 300 respondents. Which setting should be selected? The answer depends on the cost of sending each offer and the profit obtained by selling each product.

Table 2² shows a hypothetical scenario in which the cost of sending each offer is estimated to be $0.70 and the profit obtained by selling each product is $50.00. In mass mailing, with a response rate of 1%, the cost of the mail campaign is greater than the profit obtained by selling the offered product. Among the scenarios analyzed in Table 2, mailing 40% of the customer base provided the best result. However, how could we find the best number of customers to be mailed, given the cost of mailing and the profit per product sold? This question can be answered with the aid of graphical methods for performance analysis, described next.

                                             Mass mailing   Direct mailing   Direct mailing
                                             (100%)         (40%)            (10%)
Number of customers mailed                   100,000        40,000           10,000
Cost of printing and mailing ($0.70 each)    70,000         28,000           7,000
Response rate                                1%             2%               3%
Number of products sold                      1,000          800              300
Profit from sale ($50.00 each)               50,000         40,000           15,000
Net profit                                   -20,000        12,000           8,000

Table 2: Profit analysis for a direct mail campaign.

5.1 Lift Graph

Assume the class of interest is the positive class. In the previous example, the positive class is the class of households that will purchase the product on offer.
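The arithmetic behind Table 2 can be sketched as follows. The function name `campaign_profit` and its parameter names are hypothetical; the default cost per offer ($0.70) and profit per sale ($50.00) are the values used in the table.

```python
def campaign_profit(customers_mailed, response_rate,
                    cost_per_offer=0.70, profit_per_sale=50.00):
    """Net profit of a mail campaign under the simple model of Table 2."""
    products_sold = customers_mailed * response_rate
    mailing_cost = customers_mailed * cost_per_offer
    sales_profit = products_sold * profit_per_sale
    return sales_profit - mailing_cost


# The three scenarios of Table 2: (customers mailed, response rate).
scenarios = {
    "mass mailing (100%)": (100_000, 0.01),
    "direct mailing (40%)": (40_000, 0.02),
    "direct mailing (10%)": (10_000, 0.03),
}
for name, (mailed, rate) in scenarios.items():
    print(f"{name}: net profit {campaign_profit(mailed, rate):,.0f}")
```

The function only evaluates a given scenario; it does not search for the best number of customers to mail.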
Given a classifier that outputs probabilities, each example in the test set can be labelled with the probability that the example belongs to the positive class, i.e., $P(\text{positive})$. If the test set is sorted in descending order of the predicted probability, then it should be similar to the data represented in Table 3. If the learning system is able to identify some predictive patterns, then it is expected that there are more positive examples than negative ones among the top ranked examples. This ranking from most likely to least likely makes it possible to choose any number of examples from the test set. For instance, the top 10 ranked examples could be selected for the mail campaign, and 8 of them would respond. This is the basic idea behind the lift graph.

The lift graph is a widely used method in database marketing [6], and it is built over a test set. The lift graph shows the relationship between the set of the top X% ranked examples and the number of positive examples in this set. The number of positive examples can be expressed as a percentage of the total number of positive examples in the test set. Figure 1 shows a lift graph for the hypothetical direct mailing example.

² The cost of mining the data is not considered for simplicity.
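The construction described above can be sketched in Python as follows. The function name `lift_curve`, the scores, and the labels are all hypothetical; the sketch assumes the test set contains at least one positive example, and tied scores keep their original order.

```python
def lift_curve(scores, labels, positive="positive"):
    """Points (fraction of test set selected, fraction of positives captured).

    Examples are ranked by descending predicted probability of the
    positive class; point k describes the top k ranked examples.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, y in ranked if y == positive)
    points = [(0.0, 0.0)]
    found = 0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == positive:
            found += 1
        points.append((k / len(ranked), found / total_positives))
    return points


# Hypothetical test set of 10 examples, 4 of them positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = ["positive", "positive", "negative", "positive", "negative",
          "negative", "positive", "negative", "negative", "negative"]
points = lift_curve(scores, labels)
print(points[2])  # top 20% of the ranking
```

A random selection would, on average, follow the diagonal from (0, 0) to (1, 1), so points above the diagonal indicate a ranking that concentrates positives near the top.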

Table 3 (columns: Rank, Predicted Probability, Actual Class): Hypothetical test set with examples ranked by the probability of belonging to the positive class.

Figure 1: A hypothetical lift graph.

The lift graph shows a diagonal line and a curve. The curve, also called the lift curve, represents the performance obtained by the classifier. The x-axis represents the percentage of examples of the test set that were selected according to the probabilistic ranking generated by the classifier. The y-axis represents the percentage of positive examples in the subset of selected examples. This percentage is calculated over the total number of positive examples in the test set. The diagonal line represents a random classifier, i.e., a classifier that selects a random subset of examples from the test set. For instance, if 50% of the test set examples were selected at random, it is expected that 50% of the positive examples would be in this set.

The graph in Figure 1 emphasizes two points on the lift curve. The first one represents the selection of the top 10% ranked examples of the test set, and the second one represents the selection of the top 40%. These selections result in mailing 30% and 80% of the positive examples, respectively. These choices are the same as those shown in Table 2.

Lift graphs are independent of costs and class distribution. This property allows the user to analyze different scenarios in which the selection of a larger subset of examples results in a larger number of contacted buyers. Through the selection of subsets of different sizes and a profit analysis similar to the one shown in Table 2, the user may decide which subset size will provide an appropriate result.
With the addition of cost information to a lift graph, it is possible to obtain a more direct answer to the question posed at the beginning of Section 5, i.e., to find the number of customers to be mailed that gives the highest profit, given the cost of mailing and the profit per product sold.

5.2 ROI Graph

A ROI (Return on Investment) graph is similar to a lift graph. However, the gain obtained by the classifier is expressed in terms of profit instead of percentage of positive examples. In order to build a ROI graph, the same procedure used to build a lift graph is applied, i.e., selecting the top X% ranked examples in a test set and calculating the profit obtained on these examples. Figure 2 shows a ROI graph for the direct marketing example. As in lift graphs, ROI graphs usually present a diagonal line and a curve, which is called the ROI curve. The ROI curve represents the profit obtained by the classifier under analysis, and the diagonal line represents the profit obtained by a random classifier.

Figure 2: A hypothetical ROI graph.

The profit is usually calculated using the total cost given by Equation 1, associating negative costs with correct classifications. Consequently, a ROI curve depends on a specific cost matrix, and in order to analyze the behavior of a classifier under different cost scenarios, it is necessary to plot one curve for each cost matrix.

Frequently, a ROI curve presents a maximum point that provides the maximum return on investment. From this point, the percentage of top ranked test set examples that yields the best net profit for a certain cost scenario can be identified. The graph in Figure 2 shows the point having the maximum return on investment, as well as the returns obtained by selecting 10% and 40% of the test set. The latter two points were used in the analysis presented in Table 2.
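A ROI curve can be sketched by accumulating profit down the ranking. The sketch below uses the simple per-offer cost and per-sale profit model of Table 2 rather than a full cost matrix, which is an assumption of this example; the function names `roi_curve` and `best_selection`, the scores, the labels, and the monetary values are hypothetical.

```python
def roi_curve(scores, labels, cost_per_offer, profit_per_sale,
              positive="positive"):
    """Points (fraction of test set selected, cumulative net profit)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    profit = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        profit -= cost_per_offer        # every selected example is mailed
        if y == positive:
            profit += profit_per_sale   # only respondents generate a sale
        points.append((k / len(ranked), profit))
    return points


def best_selection(points):
    """Point of the ROI curve with the highest net profit."""
    return max(points, key=lambda point: point[1])


# Hypothetical test set of 10 examples, 4 of them positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = ["positive", "positive", "negative", "positive", "negative",
          "negative", "positive", "negative", "negative", "negative"]
points = roi_curve(scores, labels, cost_per_offer=10.0, profit_per_sale=25.0)
print(best_selection(points))
```

Because the curve depends on the chosen costs, changing `cost_per_offer` or `profit_per_sale` yields a different curve and possibly a different maximum, in line with the need to plot one curve per cost scenario.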

5.3 ROC Graph

$TP$, $TN$, $FP$ and $FN$ are four performance measures that have the advantage of being independent of class costs and prior probabilities. Clearly, the main objective of a classifier is to minimize the false positive and false negative rates or, equivalently, to maximize the true negative and true positive rates. Unfortunately, for most real-world applications, there is a tradeoff between $FN$ and $FP$ and, similarly, between $TN$ and $TP$. ROC³ graphs [9] can be used to analyze the relationship between $FN$ and $FP$ (or $TN$ and $TP$) for a classifier.

Let us continue considering the positive class as the class under study. On a ROC graph, $TP$ is plotted on the y-axis and $FP$ is plotted on the x-axis. One approach to plotting a ROC graph is to use a probabilistic classifier. A threshold parameter determines the final classification. For instance, the threshold can be set to 0.90, so that only the examples with a positive class probability higher than the threshold are labelled as positive, and the remaining examples are labelled as negative. We can construct more or less strict classifiers by varying the threshold. Plotting all the ROC points that can be produced by varying this threshold produces a ROC curve for the classifier. Typically this is a discrete set of points, including (0,0) and (1,1), which are connected by line segments. Figure 3 illustrates a ROC graph of three classifiers: A, B and C.

Several points on a ROC graph should be noted. The lower left point (0,0) represents a strategy that classifies every example as belonging to the negative class. The upper right point (1,1) represents a strategy that classifies every example as belonging to the positive class. The point (0,1) represents perfect classification, and the line $x = y$ represents the strategy of randomly guessing the class.

Figure 3: A ROC graph for 3 classifiers.
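The threshold-varying procedure above can be sketched as follows. The function names `roc_points` and `area_under`, the scores, and the labels are hypothetical; the sketch assumes both classes are present in the test set, and it also integrates the area under the resulting piecewise-linear curve, which is the AUC measure discussed in the text that follows.

```python
def roc_points(scores, labels, positive="positive"):
    """ROC points (FP rate, TP rate) obtained by lowering the threshold."""
    total_pos = sum(1 for y in labels if y == positive)
    total_neg = len(labels) - total_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]   # threshold above every score: all negative
    tp = fp = 0
    previous_score = None
    for score, y in ranked:
        # Emit a point only when the score changes, so that tied
        # examples cross the threshold together.
        if previous_score is not None and score != previous_score:
            points.append((fp / total_neg, tp / total_pos))
        if y == positive:
            tp += 1
        else:
            fp += 1
        previous_score = score
    points.append((fp / total_neg, tp / total_pos))  # all positive: (1, 1)
    return points


def area_under(points):
    """Area under the line segments joining the points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical test set of 10 examples, 4 of them positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = ["positive", "positive", "negative", "positive", "negative",
          "negative", "positive", "negative", "negative", "negative"]
points = roc_points(scores, labels)
print(points)
print(area_under(points))
```

A ranking that places every positive example above every negative one passes through the point (0, 1) and has an area of 1, while random guessing is expected to follow the line x = y.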
From a ROC graph it is possible to calculate an overall measure of quality, the area under the ROC curve (AUC). The AUC is the fraction of the total area that falls under the ROC curve. This measure is equivalent to several other statistical measures for evaluating classification and ranking models [5]. The AUC effectively factors in the performance of a classifier over all costs and distributions. However, it is important to note that, for a specific cost matrix, the classifier with maximum AUC may not be the best classifier.

³ ROC is an acronym for Receiver Operating Characteristic, a term used in signal detection to characterize the tradeoff between hit rate and false alarm rate over a noisy channel.

6 Conclusion

The traditional way to build a classification system consists of experimenting with many different classifiers, comparing their performance in terms of accuracy and choosing the classifier that performs best. However, accuracy is often not an appropriate measure of classifier performance, especially in classification problems with heavily imbalanced classes and asymmetric misclassification costs. In practice, costs are rarely known precisely, so it is useful to consider several different scenarios. In this work we have described three methods, the lift, ROI and ROC graphs, that can be applied whenever there is a learning scheme that outputs probabilities, as Naive Bayes does, for the predicted class of each member of the set of test instances. Tools of this sort can help free researchers from the need to have precise class and cost distribution information.

Acknowledgements. The authors would like to thank Ronaldo C. Prati for his helpful comments on the draft of this paper. This research is partially supported by Brazilian Research Councils CAPES and FAPESP.

References

[1] E. Bauer and R. Kohavi. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants. Machine Learning, 36.
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove, CA.
[3] Pedro Domingos. MetaCost: A General Method for Making Classifiers Cost-Sensitive. In Knowledge Discovery and Data Mining.
[4] Charles Elkan. The Foundations of Cost-Sensitive Learning. In Seventeenth International Joint Conference on Artificial Intelligence.
[5] David J. Hand. Construction and Assessment of Classification Rules. John Wiley and Sons.
[6] Charles X. Ling and Chenghui Li. Data Mining for Direct Marketing: Problems and Solutions. In Fourth International Conference on Knowledge Discovery and Data Mining, pages 73-79.
[7] D. D. Margineantu and T. G. Dietterich. Improved Class Probability Estimates from Decision Tree Models. In Nonlinear Estimation and Classification, Lecture Notes in Statistics, 171.
[8] Foster J. Provost and Pedro Domingos. Tree Induction for Probability-based Ranking. Machine Learning, 52(3).
[9] Foster J. Provost and Tom Fawcett. Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions. In Knowledge Discovery and Data Mining, pages 43-48.
[10] S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn. Morgan Kaufmann, San Mateo, CA, 1991.
