Integration of Rules from a Random Forest

2011 International Conference on Information and Electronics Engineering
IPCSIT vol.6 (2011) © (2011) IACSIT Press, Singapore

Naphaporn Sirikulviriya 1 and Sukree Sinthupinyo 2
1 Department of Computer Engineering, Chulalongkorn University, Bangkok, Thailand
E-mail: naphaporn.s@student.chula.ac.th
2 Department of Computer Engineering, Chulalongkorn University, Bangkok, Thailand
E-mail: sukree.s@chula.ac.th

Abstract. Random Forests is an effective prediction tool widely used in data mining. However, using and comprehending the rules obtained from a forest is difficult because of the large number of rules, which are patterns of the data, produced by the many trees. Moreover, some rules conflict with other rules. This paper thus proposes a new method which integrates rules from multiple trees in a Random Forest and helps improve the comprehensibility of the rules. The experiments show that the rules obtained from our method yielded better results and also reduced the inconsistent conditions between rules from different decision trees in the same forest.

Keywords: Random Forests, rules integration, Decision Trees.

1. Introduction

The ensemble method is a popular machine learning technique which has attracted interest in the data mining community. It is widely accepted that the accuracy of an ensemble of several weak classifiers is usually better than that of a single classifier given the same amount of training information. A number of effective ensemble algorithms have been invented during the past 15 years, such as Bagging (Breiman, 1996), Boosting (Freund and Schapire, 1996), Arcing (Breiman, 1998) and Random Forests (Breiman, 2001).

Random Forests [1] is an ensemble classifier proposed by Breiman. It constructs a series of classification trees which are used to classify a new example. The idea is to construct multiple decision trees, each of which uses a subset of attributes randomly selected from the whole original set of attributes.
However, the rules generated by existing ensemble techniques sometimes conflict with the rules generated from another classifier. This may lead to a problem when we want to combine all rule sets into a single rule set. Therefore, several works intend to increase the accuracy of the classifiers.

In this paper, we present an approach which can integrate rules from multiple decision trees. Our method aims at incrementally integrating pairs of rules. Each newly integrated rule replaces its original rules. The replacement process is repeated until a stopping criterion is met. Finally, the new set of rules is used to classify new data.

2. Random Forests

Random Forests [1] is an effective prediction tool in data mining. It employs the Bagging method to produce a randomly sampled set of training data for each of the trees. The Random Forests method also semi-randomly selects splitting features: a random subset of a given size is produced from the space of possible splitting features, and the best splitting feature is deterministically selected from that subset. Pseudo code for random forest construction is shown in Figure 1.

To classify a test instance, Random Forests simply combines the results from all of the trees in the forest. The method used to combine the results can be as simple as predicting the class obtained from the highest number of trees.
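The three ingredients described in this section (bootstrap sampling of the training data, random selection of candidate splitting features, and majority voting) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `fraction` parameter and the use of Python's `random` module are all hypothetical choices made here.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Bagging: draw len(data) instances from data with replacement."""
    return [rng.choice(data) for _ in data]


def random_feature_subset(features, fraction, rng):
    """Pick a random subset of the candidate splitting features.

    `fraction` is an assumed parameter giving the share of features to keep;
    at least one feature is always returned.
    """
    k = max(1, round(fraction * len(features)))
    return rng.sample(features, k)


def majority_vote(tree_predictions):
    """Return the class predicted by the highest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)                      # fixed seed for repeatability
sample = bootstrap_sample(list(range(10)), rng)
subset = random_feature_subset(["a", "b", "c", "d"], 0.5, rng)
winner = majority_vote(["yes", "no", "yes"])
```

In a full forest each tree would be trained on its own `bootstrap_sample`, each split would choose the best feature from a fresh `random_feature_subset`, and `majority_vote` would combine the per-tree predictions for a test instance.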

Algorithm 1: Pseudo code for the random forest algorithm

To generate c classifiers:
for i = 1 to c do
    Randomly sample the training data D with replacement to produce D_i
    Create a root node, N_i, containing D_i
    Call BuildTree(N_i)
end for

BuildTree(N):
if N contains instances of only one class then
    return
else
    Randomly select x% of the possible splitting features in N
    Select the feature F with the highest information gain to split on
    Create f child nodes of N, N_1, ..., N_f, where F has f possible values (F_1, ..., F_f)
    for i = 1 to f do
        Set the contents of N_i to D_i, where D_i is all instances in N that match F_i
        Call BuildTree(N_i)
    end for
end if

Fig. 1: The pseudo code of Random Forest algorithm [17]

3. Methodology

3.1. Extracting Rules from Decision Tree

A method for extracting rules from a decision tree [11] is quite simple. A rule can be extracted from a path linking the root to a leaf node. All nodes in the path are gathered and connected to each other using conjunctive operations.

Fig. 2: An example of decision tree

For example, a decision tree for classifying whether a person gets sunburned after sunbathing is shown in Fig. 2. A rule for a sunburned person can be obtained from the root node "hair color" and its value "blonde", linking to the node "lotion used" and its value "no". The extracted rule is thus IF hair color is blonde AND lotion used is no THEN the person gets sunburned. All rules obtained from the tree in Fig. 2 are listed below.

(1) IF hair color is blonde AND lotion used is no THEN the person gets sunburned
(2) IF hair color is blonde AND lotion used is yes THEN nothing happens
(3) IF hair color is red THEN the person gets sunburned
(4) IF hair color is brown THEN nothing happens

3.2. Integration of Rules from Random Forests
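The integration procedure works on rules extracted as described in Section 3.1, so it may help to make that extraction step concrete first. The following is a minimal sketch, assuming a tree stored as nested Python dictionaries; this representation and the names `extract_rules` and `format_rule` are hypothetical and are not taken from the paper. The leaf labels follow the worked sentence in Section 3.1 (blonde hair with no lotion leads to sunburn).

```python
def extract_rules(node, path=()):
    """Yield (conditions, label) for every root-to-leaf path of the tree."""
    if not isinstance(node, dict):
        # A leaf is stored directly as its class label.
        yield list(path), node
        return
    attribute = node["attribute"]
    for value, child in node["branches"].items():
        # Extend the conjunction with the test taken on this branch.
        yield from extract_rules(child, path + ((attribute, value),))


def format_rule(conditions, label):
    """Render a rule in the IF ... AND ... THEN ... form used in the text."""
    body = " AND ".join(f"{attr} is {value}" for attr, value in conditions)
    return f"IF {body} THEN {label}"


# Assumed encoding of the sunburn tree of Fig. 2.
tree = {
    "attribute": "hair color",
    "branches": {
        "blonde": {
            "attribute": "lotion used",
            "branches": {"no": "sunburned", "yes": "nothing happens"},
        },
        "red": "sunburned",
        "brown": "nothing happens",
    },
}

rules = [format_rule(conds, label) for conds, label in extract_rules(tree)]
```

Each path contributes exactly one rule, so a tree with four leaves yields four rules, and a forest yields the union of the rule lists of its trees.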

We propose a new method to integrate rules from random forests, which has the following steps.

1. Remove redundant conditions
   In this step, we remove the more general conditions which appear in the same rule as more specific conditions. For example:
       IF weight>40 AND weight>70 AND weight>80 AND weight<150 AND height<180 THEN figure=fat
   We can see that the condition weight>80 is more specific than weight>40 and weight>70, so weight>40 and weight>70 are removed. The final rule will be
       IF weight>80 AND weight<150 AND height<180 THEN figure=fat
2. For every pair of decision trees
   2.1 Remove redundant rules. For example:
       Rule 1: IF thickness=thin AND lace=glue THEN report=minor
       Rule 2: IF thickness=thin AND lace=glue THEN report=minor
       New Rule: IF thickness=thin AND lace=glue THEN report=minor
   2.2 Remove all conflicting rules. The rules with the same conditions but different consequences must be removed. For example:
       Rule 1: IF face=soft AND age>3 THEN toy=doll
       Rule 2: IF face=soft AND age>3 THEN toy=elastic
       New Rule: -
   2.3 Remove more specific rules. A rule whose condition set is a superset of that of another rule should be removed. For example:
       Rule 1: IF fur=short AND nose=yes AND tail=yes THEN type=bear
       Rule 2: IF fur=short AND ear=yes AND nose=yes AND tail=yes THEN type=bear
       Rule 3: IF nose=yes AND tail=yes THEN type=bear
       New Rule: IF nose=yes AND tail=yes THEN type=bear
   2.4 Extend the range of continuous conditions. Rules with ranges on the same attribute can be combined into the widest one. For example:
       Rule 1: IF duty=recording AND period<3 AND period>1.5 THEN wage=1500
       Rule 2: IF duty=recording AND period<2 AND period>1 THEN wage=1500
       New Rule: IF duty=recording AND period<3 AND period>1 THEN wage=1500
   2.5 Divide range of conditions. Rules of different classes on the same attribute with overlapping ranges should be divided into several parts.
       For example:
       Rule 1: IF credit=yes AND money>20000 THEN allow=yes
       Rule 2: IF credit=yes AND money<40000 THEN allow=no
       New Rule 1: IF credit=yes AND money>=40000 THEN allow=yes
       New Rule 2: IF credit=yes AND money<=20000 THEN allow=no

       Rule 3: IF usage>100 AND payment=paid THEN promotion=false
       Rule 4: IF usage>=200 AND usage<400 AND payment=paid THEN promotion=true

       New Rule 3: IF usage>100 AND usage<200 AND payment=paid THEN promotion=false
       New Rule 4: IF usage>=200 AND usage<400 AND payment=paid THEN promotion=true
       New Rule 5: IF usage>=400 AND payment=paid THEN promotion=false
   2.6 If the accuracy of the new rules on the validation set still improves, repeat 2.1-2.5.
3. Output the new rule set

4. Experiments

4.1. Data Sets

We used seven datasets from the UCI Machine Learning Repository [15], namely Balance Scale, Blood Transfusion, Haberman's Survival, Iris, Liver Disorders, Pima Indians Diabetes Database, and Statlog. We compared our proposed method to Random Forests and C4.5 [7] using standard 10-fold cross validation. In each training set, a validation set, which was used to find the best new rule set, consisted of 20% of the training examples in the original training set. The remainder was used to train a Random Forest. In this experiment, we used WEKA [14] as our learning tool.

4.2. Experimental Results

Because the order in which the rules are integrated can affect the final results, we divided our rule integration method into two variants, i.e. integrating the highest-accuracy rule first (RFh) and integrating the lower-accuracy rule first (RFl). The results obtained from our experiments are shown in Table 1.

Table 1. Average accuracy (%) of prediction by integrating rules, compared with Random Forests and C4.5

Data Set                        RFh     RFl     Random Forests   C4.5
Balance Scale                   90.98   91.68   80.48            76.64
Blood Transfusion               94.79   97.60   72.19            77.81
Haberman's Survival             92.17   94.80   66.67            72.87
Iris                            96.00   98.67   95.33            96.00
Liver Disorders                 97.71   98.86   68.95            68.69
Pima Indians Diabetes Database  97.13   97.27   73.82            73.83
Statlog                         97.97   98.12   86.96            85.22

5. Conclusion

We have proposed a new method which can integrate rules obtained from several trees in a Random Forest. The results from seven datasets from the UCI Machine Learning Repository show that our method yields better classification results than the original random forest and the ordinary decision tree.
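To make the redundancy and conflict handling behind these results concrete, steps 1 and 2.1 to 2.3 of Section 3.2 can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the encoding of a condition as an (attribute, operator, value) tuple, the names `simplify_conditions`, `rule` and `prune_rules`, and the restriction of step 2.3 to rules that predict the same class are all choices made here for illustration. Steps 2.4 to 2.6 are not covered.

```python
def simplify_conditions(conditions):
    """Step 1: keep only the most specific bound per (attribute, operator)."""
    tightest = {}
    for attr, op, value in conditions:
        key = (attr, op)
        if key not in tightest:
            tightest[key] = value
        elif op == ">":
            tightest[key] = max(tightest[key], value)   # larger lower bound
        elif op == "<":
            tightest[key] = min(tightest[key], value)   # smaller upper bound
    return [(attr, op, value) for (attr, op), value in tightest.items()]


def rule(label, *conditions):
    """Build a rule as (frozenset of conditions, class label)."""
    return frozenset(conditions), label


def prune_rules(rules):
    """Steps 2.1-2.3: drop duplicates, conflicts and more specific rules."""
    unique = list(dict.fromkeys(rules))                  # 2.1 duplicates
    labels = {}
    for conds, label in unique:
        labels.setdefault(conds, set()).add(label)
    consistent = [(c, l) for c, l in unique if len(labels[c]) == 1]  # 2.2
    # 2.3: remove a rule if another rule of the same class (an assumption)
    # uses a proper subset of its conditions.
    return [(c, l) for c, l in consistent
            if not any(o < c and ol == l for o, ol in consistent)]


fat = simplify_conditions([("weight", ">", 40), ("weight", ">", 70),
                           ("weight", ">", 80), ("weight", "<", 150),
                           ("height", "<", 180)])

fur, ear = ("fur", "=", "short"), ("ear", "=", "yes")
nose, tail = ("nose", "=", "yes"), ("tail", "=", "yes")
soft, older = ("face", "=", "soft"), ("age", ">", 3)
kept = prune_rules([
    rule("bear", fur, nose, tail),
    rule("bear", fur, ear, nose, tail),
    rule("bear", nose, tail),
    rule("bear", nose, tail),        # exact duplicate
    rule("doll", soft, older),
    rule("elastic", soft, older),    # conflicts with the doll rule
])
```

With the example rules of Section 3.2, `fat` keeps only weight>80, weight<150 and height<180, and `kept` contains the single general bear rule, since the duplicate, the two conflicting toy rules and the two more specific bear rules are all discarded.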
Moreover, the rule set from our integration method can help users when they use the rule set. The rules from different decision trees may conflict with rules from another tree. Our method can remove these inconsistent conditions and output a new rule set which can be better applied to classify unseen data.

6. Acknowledgements

This work was supported by the Thailand Research Fund (TRF).

7. References

[1] L. Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.

[2] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
[3] Y. Zhang, S. Burer, and W. N. Street. Ensemble Pruning via Semi-definite Programming. Journal of Machine Learning Research, 7:1315-1338, 2006.
[4] A.V. Assche, and H. Blockeel. Seeing the Forest through the Trees: Learning a Comprehensible Model from an Ensemble. In Proceedings of ECML, 418-429, 2007.
[5] G. Seni, E. Yang, and S. Akar. Yield Modeling with Rule Ensembles. 18th Annual IEEE/SEMI Advanced Semiconductor Manufacturing Conference, Stresa, Italy, 2007.
[6] G. Seni, and J. Elder. From Trees to Forest and Rule Sets, A Unified Overview of Ensemble Methods. 13th International Conference on Knowledge Discovery and Data Mining (KDD), 2007.
[7] J. R. Quinlan. Generating Production Rules from Decision Trees. In Proceedings of the 10th International Conference on Artificial Intelligence, 1987.
[8] Z.-H. Zhou, and W. Tang. Selective Ensemble of Decision Trees. Nanjing 210093, National Laboratory for Novel Software Technology, China, 2003.
[9] D. Opitz, and R. Maclin. Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research, 11:169-198, 1999.
[10] I.H. Witten, and E. Frank. Attribute-Relational File Format. University of Waikato, New Zealand, 2002.
[11] B. Kijsirikul. Artificial Intelligence. Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, 2003 (in Thai).
[12] W. Thongmee. An Approach of Soft Pruning for Decision Tree Using Fuzzification. In Proceedings of the Third International Conference on Intelligent Technologies, 2002.
[13] L. Breiman, and A. Cutler. Random Forests. Available: http://www.stat.berkeley.edu/~breiman/randomforests
[14] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1, 2009.
[15] A. Asuncion, and D.J. Newman. UCI Machine Learning Repository. Department of Information and Computer Science, University of California, 2007. Available: http://archive.ics.uci.edu/ml/
[16] R. E. Banfield, O. Lawrence, K.W. Bowyer, and W. P. Kegelmeyer. A Comparison of Decision Tree Ensemble Creation Techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[17] G. Anderson. PhD thesis: Random Relational Rules. 2009.