An Experimental Study of Three Different Rule Ranking Formulas in Associative Classification

Neda Abdelhamid, Informatics Dept, De Montfort University, Leicester
Aladdin Ayesh, Informatics Dept, De Montfort University, Leicester
Fadi Thabtah, MIS Dept, Philadelphia University, Amman, Jordan

Abstract - Associative classification (AC) is a combination of classification and association rule mining in data mining that has attracted several scholars due to the simplicity of its models and its effectiveness in predicting test cases. This paper investigates the impact of rule ranking before constructing the classifier in AC mining. We experimentally compare three different rule ranking formulas used while building the classifier in order to determine the most appropriate one, that is, the one that can positively impact the classification accuracy of the derived classifiers. We believe that rule ranking may play a significant role in determining the accuracy of the classifiers and can also be considered a pre-pruning step for the rules. Sixteen different data sets from the UCI data repository have been used in the experiments, and the bases of the comparisons are the error rate and the number of rules. The results reveal that rule ranking plays a major role in determining the subset of rules to be utilised in the prediction step and that it indeed affects the predictive power of that subset.

Keywords: Associative classification, Classification, Data Mining, Prediction, Rule Ranking

I. INTRODUCTION

The integration of association rule and classification data mining came to the surface as a promising research discipline named associative classification (AC) in 1998 [8]. In AC mining, the training phase is about discovering useful hidden knowledge, primarily using association rule mining algorithms; a classification model (classifier) is then constructed after ranking the knowledge with regard to certain criteria and pruning redundant knowledge. Many research studies, including [4] [13] [15] [16], showed that AC often derives higher quality classifiers with reference to classification accuracy than other classification data mining approaches like probabilistic [9], decision tree [11], and rule induction [2].

Normally, an AC algorithm operates in three main phases. During the first phase, it searches for hidden correlations among the attribute values and the class attribute in the training data set and generates them as "IF-THEN" rules [12]. After the complete set of rules is found, ranking and pruning procedures (phase 2) start operating: the ranking procedure sorts the rules according to certain thresholds, and during pruning, contradicting and duplicating rules are discarded from the complete set of discovered rules. The output of phase two is the set of rules which form the classifier. Lastly, the classifier derived in the previous phase is tested on a new, independent data set to measure its effectiveness in forecasting the class labels of test cases. The output of the last phase is the accuracy or the error rate of the classifier.

Research studies, for instance [15], have shown that AC mining has two distinguishing features over other classification approaches. The first is that it produces very simple rules that can be easily interpreted and manually updated by the end-user. Secondly, this approach often finds additional useful hidden knowledge missed by other classification algorithms, and therefore the error rate of the resulting classifier is minimised.
However, in many cases the number of derived rules may become excessive [7], in which case the end-user becomes unable to control or maintain them. One way to control the number of rules in AC mining is to select the right rule ranking formula before constructing the classifier. As a matter of fact, the rule ranking process can be considered an early form of rule pruning, since rules of higher rank are usually kept in the classifier and later utilised for class assignment of test cases. On the other hand, rules with lower rank are often discarded during pruning simply because higher rank rules have already classified their training cases while the classifier was being constructed [12]. The reason that higher rank rules cover the training cases of lower rank rules is that rules share attribute values in their antecedent (body/left-hand side). Therefore, the rule preference step is crucial in AC, since the classifier accuracy depends heavily on the rules with higher ranks.

To show the significance of the rule ranking process, consider for example the "German" data set from the UCI data repository [10], which consists of 1000 tuples and 20 attributes before discretisation. Assume that we apply the MCAR AC algorithm [14] on this data set using 2% minimum support and 40% minimum confidence. The number of rules derived by MCAR on the "German" data set without pruning that share similar confidence values is 3443, of which 323 rules have identical confidence and support values, which makes rule preference a hard task. In fact, after pruning, the MCAR classifier on this data set retains only 563 rules. In Section 3 we show how AC algorithms discriminate among rules while building the classifier.

There are several different rule ranking formulas containing different criteria considered by scholars in AC.

For instance, the CBA algorithm [8] and its successors [17] [12] consider the rule's confidence and support as the main criteria for rule favouring, and MCAR [14] adds to this sorting procedure the class distribution when two or more rules have similar confidence and support values.

This paper investigates the rule ranking step in AC mining, aiming to:

1) Determine the impact of the rule ranking step on the classifiers' accuracy by experimentally contrasting different rule ranking formulas. We mainly focus on three criteria, confidence, support and rule length, and use them in different orders during the ranking process. In the rule length criterion we favour rules having a smaller number of items (attribute values) in their body (general rules).

2) Consider rule ranking as a pre-pruning step for rules in order to select only the most appropriate rules in the classifier for the prediction step. In particular, we investigate the impact of three formulas that can be utilised in rule ranking while constructing the classifier, (CONFIDENCE-SUPPORT-RULE CARDINALITY), (CONFIDENCE-RULE CARDINALITY-SUPPORT) and (SUPPORT-CONFIDENCE-RULE CARDINALITY), on 16 different data sets from the UCI data repository. The comparison of the rule ranking formulas is based on the classification accuracy and the number of rules of the resulting classifiers. Moreover, we evaluate the effect of rule ranking on the number of rules derived, aiming to select the rule ranking formula that reduces the classifier size without deteriorating its predictive power.

The rest of the paper is organised as follows: AC and its main definitions are presented in Section 2. Different methods used to discriminate among rules are given in Section 3. Section 4 is devoted to experimental results, and finally conclusions are presented in Section 5.

II. ASSOCIATIVE CLASSIFICATION PROBLEM

In this section, we define the AC mining problem and shed light on its related definitions. In general, the aim in AC is to construct a classification system that consists of simple rules (knowledge) learned from a training data set in order to forecast the class labels in a test data set accurately. Given a training data set D, which has n distinct attributes A1, A2, ..., An, and a list of classes C, the number of cases in D is denoted |D|. An attribute may be categorical, where the attribute takes a value from a known set of possible values, or continuous, where the attribute takes a value from an infinite set, e.g. real or integer. For categorical attributes, all possible values are mapped to a set of positive integers. In the case of continuous attributes, any discretisation method can be utilised. The goal is to construct a classifier from D, e.g. Cl: A → C, where A is a set of attribute values and C is a class, in order to forecast the classes of test cases. AC can be formalised through the following definitions:

Definition 1: An AttributeValue can be described as an attribute name Ai and its value ai, denoted (Ai, ai).

Definition 2: The jth row or training case in D can be described as a list of attribute values (Aj1, aj1), ..., (Ajk, ajk), plus a class denoted by cj.

Definition 3: An AttributeValueSet can be described as a set of disjoint attribute values contained in a training case, denoted <(Ai1, ai1), ..., (Aik, aik)>.

Definition 4: A ruleitem r is of the form <antecedent, c>, where antecedent is an AttributeValueSet and c is a class.

Definition 5: The actual occurrence (actoccr) of a ruleitem r in D is the number of cases in D that match r's antecedent.

Definition 6: The support count (suppcount) of a ruleitem r = <antecedent, c> is the number of cases in D that match r's antecedent and belong to the class c.

Definition 7: A ruleitem r passes the minsupp threshold if suppcount(r)/|D| ≥ minsupp. Such a ruleitem is said to be a frequent ruleitem.

Definition 8: A ruleitem r passes the minimum confidence (minconf) threshold if suppcount(r)/actoccr(r) ≥ minconf.

Definition 9: A rule is represented as: antecedent → c, where the antecedent is an AttributeValueSet and the consequent is a class. In other words, the left-hand side of the rule is a set of disjoint attribute values and the right-hand side is the class label.
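As an illustration of Definitions 5-8, the following minimal Java sketch counts the actual occurrence and support count of a ruleitem over a small hand-made training set and tests it against the minsupp and minconf thresholds. The class names, the attribute names and the toy data are our own illustration and are not taken from any cited implementation.

import java.util.*;

// A minimal sketch of Definitions 5-8: a ruleitem <antecedent, class> is scored
// against a toy training set. Names and data are illustrative only.
public class RuleitemSupport {

    // A training case: attribute-value pairs plus a class label (Definition 2).
    record Case(Map<String, String> values, String label) {}

    public static void main(String[] args) {
        List<Case> d = List.of(
            new Case(Map.of("Outlook", "sunny", "Windy", "false"), "play"),
            new Case(Map.of("Outlook", "sunny", "Windy", "true"),  "no-play"),
            new Case(Map.of("Outlook", "rainy", "Windy", "false"), "play"),
            new Case(Map.of("Outlook", "sunny", "Windy", "false"), "play"));

        // Ruleitem r = <{(Outlook, sunny)}, play> (Definition 4).
        Map<String, String> antecedent = Map.of("Outlook", "sunny");
        String c = "play";

        // actoccr(r): cases matching the antecedent (Definition 5).
        long actoccr = d.stream().filter(t -> matches(t, antecedent)).count();
        // suppcount(r): cases matching the antecedent AND labelled c (Definition 6).
        long suppcount = d.stream()
                .filter(t -> matches(t, antecedent) && t.label().equals(c)).count();

        double support = (double) suppcount / d.size();    // Definition 7
        double confidence = (double) suppcount / actoccr;   // Definition 8

        double minsupp = 0.02, minconf = 0.40;               // thresholds used in the paper
        System.out.printf("support=%.2f frequent=%b%n", support, support >= minsupp);
        System.out.printf("confidence=%.2f passes minconf=%b%n", confidence, confidence >= minconf);
    }

    // A case matches an antecedent if it contains every attribute value in it (Definition 3).
    static boolean matches(Case t, Map<String, String> antecedent) {
        return antecedent.entrySet().stream()
                .allMatch(e -> e.getValue().equals(t.values().get(e.getKey())));
    }
}

For the toy data above, the ruleitem matches three cases (actoccr = 3), two of which carry the class "play" (suppcount = 2), giving a support of 0.50 and a confidence of about 0.67.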
III. THE PROCEDURES OF RULE RANKING IN AC MINING

Classification algorithms are able to generalise their performance to test data cases through inductive biases, since they make implicit assumptions that favour one rule over another. For instance, decision tree algorithms like C5 [11] have a clear bias in their search for the best attribute decision node, namely the attribute selection method based on information gain. Moreover, these algorithms favour smaller effective subtrees over complex ones by using backward pruning. Probabilistic classification algorithms like Naive Bayes [9] compute the probability of each class in the training data set using the joint probabilities of the attribute values of a test case. The inductive bias in the Naive Bayes algorithm is the assumption that the conditional probability of an attribute value given a class is independent of the probabilities of the other attribute values given the same class [7]. In the next subsections, we shed light on the different rule ranking procedures in AC mining.

A. CONFIDENCE, SUPPORT AND RULE CARDINALITY PROCEDURE

The first rule preference procedure in AC was introduced in [8]; it is based on the rule's confidence, support and the number of attributes in the rule antecedent. This sorting method is displayed in Fig. 1.

Given two rules, R1 and R2, R1 precedes R2 if:
1. The confidence of R1 is larger than that of R2.
2. The confidences of R1 and R2 are identical, but the support of R1 is larger than that of R2.
3. The confidence and support of R1 and R2 are identical, but R1 contains fewer attributes in its antecedent than R2.

Figure 1. CBA rule sorting method

Using this rule preference procedure has derived good quality classifiers with respect to accuracy according to some empirical studies, e.g. [7] [8], though the number of rules with similar confidence and support values is still massive. Consider for example two data sets, "Auto" and "Glass", from the UCI data repository, and assume that minsupp and minconf are set to 2% and 40%, respectively. If we apply a common AC algorithm such as MCAR, the numbers of discovered rules with identical confidence from the "Auto" and "Glass" data sets are 2660 and 759 respectively, without rule pruning. When we apply the rule support as a tie breaking condition, we still end up with 2492 and 624 rules with similar confidence and support values. This example, although limited, clearly shows that in AC mining there is a great number of rules that share common confidence and support, and thus additional tie breaking conditions are needed to minimise the chance of arbitrary rule choices. A number of AC algorithms employ the rule sorting procedure shown in Fig. 1, including ACN [3], ACCF [5], CAAR [17], and others.

In 2005, the MCAR algorithm added the class distribution in the training data set as a tie breaking condition beside the rule confidence, support, and antecedent length. In particular, if two rules have identical confidence, support, and antecedent length, MCAR favours the rule associated with the class that has the larger frequency in the training data set. Experimental tests [14] on different data sets from the UCI data repository [10] showed that the rule ranking procedure of MCAR positively impacted the accuracy of the classifiers produced and reduced random rule selection during the ranking step.

B. LAZY RANKING PROCEDURE

Live and Let Live (L3) is one of the early lazy AC algorithms, developed in 2002 by [1]. Lazy AC algorithms often prefer rules that hold a larger number of attribute values in their antecedent. These kinds of rules are named specific rules. In fact, these algorithms try to hold almost all knowledge discovered during the training step, even if redundancy exists, aiming to maximise the predictive power of the resulting classifiers. Unlike the CBA rule ranking procedure, the L3 ranking procedure (Fig. 2) prefers specific rules over general ones in order to give the specific rules a higher chance in the prediction step, since they are often more accurate than general rules according to [1]. In the prediction phase, when the specific rules are unable to assign a class to a test case, the general rules, those with a smaller number of attributes in their antecedent, are considered.

Given two rules, R1 and R2, R1 precedes R2 if:
1. The confidence of R1 is larger than that of R2.
2. The confidences of R1 and R2 are identical, but the support of R1 is larger than that of R2.
3. The confidence and support of R1 and R2 are identical, but R1 contains more attributes in its antecedent than R2.

Figure 2. L3 rule ranking method

C. DISCUSSION ON RULE RANKING

Rule sorting is considered one of the main phases in AC mining, since it may impact 1) the classifier building process and 2) the prediction of test cases. In fact, without rule sorting the algorithm would not be able to easily choose the rules to be employed in the prediction step. CBA and its successors considered confidence and support the main criteria for rule preference, and MCAR adds to CBA the class distribution of the rules if two or more rules have identical confidence, support and length.
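To make the two orderings in Fig. 1 and Fig. 2 concrete, the short Java sketch below sorts a few hand-made rules with a comparator that prefers higher confidence, then higher support, and then either fewer antecedent items (CBA-style, Fig. 1) or more antecedent items (L3-style, Fig. 2). The Rule record and the sample values are our own illustration, not code from any of the cited algorithms.

import java.util.*;

// A minimal sketch of the CBA (Fig. 1) and L3 (Fig. 2) rule orderings.
public class RuleRanking {

    record Rule(String name, double confidence, double support, int antecedentSize) {}

    // preferGeneral = true  -> CBA-style: fewer antecedent items wins the final tie (Fig. 1)
    // preferGeneral = false -> L3-style: more antecedent items wins the final tie (Fig. 2)
    static Comparator<Rule> ranking(boolean preferGeneral) {
        Comparator<Rule> byConfidence = Comparator.comparingDouble(Rule::confidence).reversed();
        Comparator<Rule> bySupport = Comparator.comparingDouble(Rule::support).reversed();
        Comparator<Rule> byLength = Comparator.comparingInt(Rule::antecedentSize);
        if (!preferGeneral) byLength = byLength.reversed();
        return byConfidence.thenComparing(bySupport).thenComparing(byLength);
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>(List.of(
            new Rule("r1", 0.80, 0.10, 2),
            new Rule("r2", 0.80, 0.10, 1),   // ties with r1 on confidence and support
            new Rule("r3", 0.80, 0.15, 3),
            new Rule("r4", 0.90, 0.05, 2)));

        rules.sort(ranking(true));
        System.out.println("CBA-style (Fig. 1): " + rules);   // r4, r3, r2, r1

        rules.sort(ranking(false));
        System.out.println("L3-style  (Fig. 2): " + rules);   // r4, r3, r1, r2
    }
}

The only difference between the two orderings is the direction of the final, cardinality-based tie breaker: r2 (one antecedent item) precedes r1 (two items) under the CBA-style ordering, and follows it under the L3-style ordering.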
On the other hand, unlike CBA and MCAR, the L3 algorithm prefers specific rules over general ones, since the antecedent of a specific rule contains those of multiple general rules. An experimental study [14] revealed that using confidence, support and rule antecedent cardinality is an effective approach. However, recent studies and the example discussed in Section 3.1 showed that imposing more tie breaking conditions beside confidence and support may reduce the chance of randomisation in rule ranking, which consequently limits the use of the default class in the prediction step. Moreover, approaches that favour specific rules may sometimes gain a slight improvement in the prediction phase, but they suffer from holding a large number of rules, many of which are never used, thus consuming memory as well as training time. Lastly, the employment of mathematical measures such as Entropy and Chi-Square seems to be a direction toward possibly improving the process of sorting the rules.

IV. EXPERIMENTAL RESULTS

Different criteria in rule ranking have been evaluated on 16 data sets from the UCI data repository to measure their impact on the classification accuracy and the number of rules generated. Specifically, the (CONFIDENCE-SUPPORT-CARDINALITY), (SUPPORT-CONFIDENCE-CARDINALITY) and (CONFIDENCE-CARDINALITY-SUPPORT) rule ranking formulas are tested on the data sets we consider in the experiments. The cardinality criterion favours rules that have fewer attributes in their antecedent (general rules). We have implemented the MCAR AC algorithm along with the three rule ranking formulas in Java. Ten-fold cross validation has been utilised as the testing method to produce the number of rules and the error rate of the classifiers. In ten-fold cross validation, the training data set is partitioned into 10 blocks; the classifier is learned from 9 blocks and tested on the holdout block to measure its predictive power. The process is repeated 10 times, and the error rates produced on the holdout blocks in the 10 runs are averaged to output an average error rate for each data set.
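As a rough illustration of this testing procedure, the Java sketch below splits a toy data set into 10 blocks, trains on 9, scores on the held-out block, and averages the 10 error rates. The Example record and the simple majority-class learner are illustrative stand-ins of our own, not the MCAR implementation.

import java.util.*;
import java.util.function.Function;

// A minimal sketch of ten-fold cross validation as described above.
public class TenFoldCV {

    record Example(String feature, String label) {}

    public static void main(String[] args) {
        List<Example> data = new ArrayList<>();
        Random rnd = new Random(1);
        for (int i = 0; i < 100; i++) {               // toy data: label mostly follows the feature
            String f = rnd.nextBoolean() ? "a" : "b";
            String lbl = rnd.nextDouble() < 0.8 ? f : (f.equals("a") ? "b" : "a");
            data.add(new Example(f, lbl));
        }
        System.out.printf("average error rate = %.3f%n", crossValidate(data, 10));
    }

    static double crossValidate(List<Example> data, int folds) {
        List<Example> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(42));
        double totalError = 0;
        for (int k = 0; k < folds; k++) {
            List<Example> train = new ArrayList<>(), test = new ArrayList<>();
            for (int i = 0; i < shuffled.size(); i++)
                (i % folds == k ? test : train).add(shuffled.get(i));   // block k is held out
            Function<Example, String> classifier = learnMajorityPerFeature(train);
            long wrong = test.stream().filter(e -> !classifier.apply(e).equals(e.label())).count();
            totalError += (double) wrong / test.size();   // error rate on the holdout block
        }
        return totalError / folds;                        // averaged over the 10 runs
    }

    // Stand-in learner: predict the most frequent label seen for the example's feature value.
    static Function<Example, String> learnMajorityPerFeature(List<Example> train) {
        Map<String, Map<String, Long>> counts = new HashMap<>();
        for (Example e : train)
            counts.computeIfAbsent(e.feature(), f -> new HashMap<>())
                  .merge(e.label(), 1L, Long::sum);
        return e -> counts.getOrDefault(e.feature(), Map.of("?", 1L)).entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
    }
}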

Following research studies carried out by [12], minsupp was set to 2% and minconf was set to 40%. All experiments were conducted on a Pentium IV 1.7 GHz Duo Centrino machine with 1 GB RAM, using Java under Windows 7.

Fig. 3 shows the error rate produced by the three rule ranking formulas on the data sets. It is obvious from the graph that (CONFIDENCE-SUPPORT-CARDINALITY) outperformed both (SUPPORT-CONFIDENCE-CARDINALITY) and (CONFIDENCE-CARDINALITY-SUPPORT), by 7.2% and 9.9% respectively. Further, the (SUPPORT-CONFIDENCE-CARDINALITY) formula outperformed (CONFIDENCE-CARDINALITY-SUPPORT) on the data sets we consider. These results reveal that the rule's confidence is the most fundamental criterion for favouring among rules when it comes to the impact of rule ranking on the classifier's predictive power, followed by the rule support and finally the rule cardinality.

The effect of rule ranking on the number of rules is investigated in Fig. 4, which displays the number of rules generated when using the above three rule ranking formulas. It is clear from the graph that (CONFIDENCE-SUPPORT-CARDINALITY) generates the smallest number of rules and (CONFIDENCE-CARDINALITY-SUPPORT) derives the largest number of rules. In particular, for the 16 data sets, the average numbers of rules generated by the (CONFIDENCE-SUPPORT-CARDINALITY), (SUPPORT-CONFIDENCE-CARDINALITY) and (CONFIDENCE-CARDINALITY-SUPPORT) ranking formulas are 85.81, , and , respectively. This means that the (CONFIDENCE-SUPPORT-CARDINALITY) method not only produces high quality classifiers with reference to accuracy but also moderate size ones when contrasted with the (SUPPORT-CONFIDENCE-CARDINALITY) and (CONFIDENCE-CARDINALITY-SUPPORT) methods. It should be noted that we use the words "formula" and "method" interchangeably when talking about rule ranking; they refer to the same meaning.

To gain an insight into the behaviour of the above rule ranking methods, Table 1 shows the number of times each criterion does not break a tie between rules for the 16 data sets.

Figure 3. Error rate (%) derived by the rule ranking formulas

Figure 4. Number of rules derived by the rule ranking formulas

Column "Conf" indicates the number of rules with identical confidence values, column "Conf&Supp" represents the number of rules with the same confidence and support values, and column "Conf&Supp&Card" depicts the number of rules that have similar confidence, support and cardinality. The values shown in Table 1 represent the candidate rules tested by the MCAR algorithm during rule ranking and before constructing the classifier. Table 1 shows that support and confidence are not effective enough in distinguishing among rules in most of the data sets we consider. For the "Austrad" data set, for instance, there are 2421 rules with the same confidence as some other rule, 233 rules with identical confidence and support, and 10 rules with the same confidence, support and cardinality as some other rule. These results necessitate considering new tie breaking criteria to further discriminate between rules during the ranking process.

TABLE I. NUMBER OF TIMES EACH CONDITION IN THE RULE RANKING FORMULA DOES NOT BREAK A TIE BETWEEN RULES DURING SORTING
(columns: Conf, Conf&Supp, Conf&Supp&Card; data sets: Austrad, Balance-scale, Breast, Cleved, Diabetesd, Germand, Glassd, Heart-s, Irisd, Led, Mushroom, Pimad, Tic-tac, Vote, Wined, Zoo)

V. CONCLUSIONS

Recent research studies suggest that the associative classification (AC) approach often derives more predictive classification models than other classification approaches, including decision trees, probabilistic, covering and rule induction. However, AC suffers from an exponential growth of rules, which consequently results in large classifiers when contrasted with those produced by decision tree and rule induction algorithms. This indeed limits the use and applicability of AC in real-world applications, since domain experts are unable to control and maintain the massive number of rules generated. Rule ranking is a crucial step that has a significant impact on the order of the rules in the final classifier. This paper investigated different rule ranking criteria, aiming to identify the formula that can positively impact the accuracy and output a concise set of rules. We focused on three main parameters associated with a rule (confidence, support, cardinality/size) to determine a rule's position during the ranking process. The experimentation on sixteen UCI data sets revealed that using the (CONFIDENCE-SUPPORT-CARDINALITY) rule ranking to discriminate among rules improved the accuracy of the resulting classifiers and minimised rule redundancy by outputting moderate size classifiers. On the other hand, the (CONFIDENCE-CARDINALITY-SUPPORT) rule ranking produced very large classifiers without improving the predictive power. In particular, the (CONFIDENCE-SUPPORT-CARDINALITY) method outperformed both the (SUPPORT-CONFIDENCE-CARDINALITY) and (CONFIDENCE-CARDINALITY-SUPPORT) rule ranking methods on average by 7.2% and 9.9% respectively with regard to classification accuracy on the 16 data sets we consider. Moreover, the results showed that rule criteria such as the rule's support and cardinality are frequently used in breaking ties among rules, and therefore imposing additional criteria to further distinguish among rules is essential.

REFERENCES

[1] Baralis, E., and Torino, P. (2002) A lazy approach to pruning classification rules. Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), pp. 35.
[2] Jensen, D., and Cohen, P. (2000) Multiple comparisons in induction algorithms. Machine Learning, 38(3).
[3] Kundu, G., Islam, M., Munir, S., and Bari, M. (2008)
ACN: An Associative Classifier with Negative Rules. 11th IEEE International Conference on Computational Science and Engineering.
[4] Lan, Y., Janssens, D., Chen, G., and Wets, G. (2006) Improving associative classification by incorporating novel interestingness measures. Expert Systems with Applications.
[5] Li, X., Qin, D., and Yu, C. (2008) ACCF: Associative Classification Based on Closed Frequent Itemsets. Proceedings of the Fifth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD).
[6] Li, W., Han, J., and Pei, J. (2001) CMAR: Accurate and efficient classification based on multiple class-association rules. Proceedings of the IEEE International Conference on Data Mining (ICDM).
[7] Liu, B., Ma, Y., and Wong, C-K. (2001) Classification using association rules: weaknesses and

enhancements. In Vipin Kumar et al. (eds), Data mining for scientific applications.
[8] Liu, B., Hsu, W., and Ma, Y. (1998) Integrating classification and association rule mining. Proceedings of the Knowledge Discovery and Data Mining Conference (KDD). New York, NY.
[9] Meretakis, D., and Wüthrich, B. (1999) Extending naive Bayes classifiers using long itemsets. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, California.
[10] Merz, C., and Murphy, P. (1996) UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science.
[11] Quinlan, J. (1998) Data mining tools See5 and C5.0. Technical Report, RuleQuest Research.
[12] Thabtah, F., Hadi, W., Abdelhamid, N., and Issa, A. (2011) Prediction phase in associative classification. Journal of Knowledge Engineering and Software Engineering, Volume 21, Issue 6. WorldSciNet.
[13] Thabtah, F., Mahmood, Q., McCluskey, L., and Abdeljaber, H. (2010) A new classification based on association algorithm. Journal of Information and Knowledge Management, Vol. 9, No. 1. World Scientific.
[14] Thabtah, F., Cowling, P., and Peng, Y. (2005) MCAR: Multi-class classification based on association rule approach. Proceedings of the 3rd IEEE International Conference on Computer Systems and Applications, pp. 1-7. Cairo, Egypt.
[15] Veloso, A., Meira Jr., W., Gonçalves, M., Almeida, H., and Zaki, M. (2011) Calibrated lazy associative classification. Information Sciences, Volume 181, Issue 13, 1 July 2011.
[16] Wang, L. Z., and Duwu, C. L. (2011) Associative classification with evolutionary autonomous agents. International Journal of Modelling, Identification and Control, Volume 14, Number 4, October 2011.
[17] Xu, X., Han, G., and Min, H. (2004) A novel algorithm for associative classification of image blocks. Proceedings of the Fourth IEEE International Conference on Computer and Information Technology.