A Pruning method based on conditional misclassification


Applied Mechanics and Materials Online: ISSN: , Vols , pp doi: / Trans Tech Publications, Switzerland

A Pruning Method Based on Conditional Misclassification

Xu Weixiang 1,a, Xu Jing 2,b, Liu Xumin 3 and Dong Rui 4

1 School of Traffic and Transportation, Beijing Jiaotong University, Beijing, China
2 China Mobile Group Design Institute Co., Ltd., Beijing, China
3 Department of Information Engineering, Capital Normal University, Beijing, China
4 Bin Zhou Hydrology and Water Resources Office, Binzhou, China
a xu_wexang@163.com, b luxumn@126.com

Keywords: Decision tree, Misclassification pruning, Conditional misclassification

Abstract: The pruning method has a great influence on the quality of a decision tree. By studying pruning methods based on misclassification, we introduce the concept of conditional misclassification and improve the pruning criterion. We propose the conditional misclassification pruning method for decision tree optimization and apply it to the C4.5 algorithm. The experimental results show that conditional misclassification pruning can avoid the over-pruning and under-pruning problems to some extent and improve classification accuracy.

Introduction

The decision tree is a widely used model for classification in data mining. However, overfitting is inevitable when a classification model is built from a training data set, and it usually has a negative effect on the resulting model. The number of samples available at each node decreases as the tree grows. Therefore, the bottom levels of the tree become statistically unstable, and generalization degrades when predicting unseen data sets; that is, the accuracy of the tree is low. Besides, the decision tree grows large when the training set is huge, which hurts the intelligibility of the tree. It is therefore important to optimize the decision tree effectively.

Many methods for optimizing decision trees have been proposed, among which post-pruning methods are the most popular. CCP (Cost-Complexity Pruning) was proposed by Breiman in 1984 and is used in the CART algorithm. REP (Reduced Error Pruning) and PEP (Pessimistic Error Pruning) were introduced by Quinlan in 1987 and are used in ID3 and C4.5. MEP (Minimum-Error Pruning) was first suggested by Niblett and Bratko in 1986 and prunes the tree without an extra data set.
Mingyu Zhong [1] studied the 2-norm error rate pruning method, which decides whether a branch should be cut off according to the combination of the experiment and covariance. After studying the algorithms mentioned above, we propose the conditional misclassification pruning method and compare it with the PEP method in the experiment. The results show that conditional misclassification pruning can alleviate over-pruning and under-pruning and improve accuracy.

PEP and EBP

The C4.5 algorithm applies post-pruning, which prunes redundant branches according to a certain rule after the whole tree has been built. The concrete methods are PEP [2] and EBP, which are based on the number of misclassifications and the size of the training data set. These methods decide whether to cut off a branch by estimating the misclassification rate and do not need a separate data set for pruning. However, estimating misclassification from the training data is excessively optimistic and introduces considerable bias. Therefore, PEP introduces a correction factor to overcome the bias [3].

All rights reserved. No part of contents of this paper may be reproduced or transmitted in any form or by any means without the written permission of Trans Tech Publications.

Let $n(t)$ be the number of training samples at node $t$ and let $e(t)$ be the number of misclassifications at the node. The criterion keeps a subtree only if its corrected number of misclassifications is at least one standard error better than that of the node. The standard error is defined by the following formula:

$$SE[n'(T_t)] = \sqrt{\frac{n'(T_t)\,\bigl(n(t) - n'(T_t)\bigr)}{n(t)}} \qquad (1)$$

The corrected number of misclassifications of a node is $n'(t) = e(t) + 1/2$, and that of the subtree $T_t$ rooted at $t$ is $n'(T_t) = \sum_i e(i) + N_T/2$, where $i$ ranges over the leaves of $T_t$ and $N_T$ is the number of leaves. Therefore, if the corrected number of misclassifications of the subtree plus one standard error is not smaller than that of the node, the branch is cut off and replaced by the node, which becomes a leaf. That is, the branch is cut off if the following inequality holds:

$$n'(t) \le n'(T_t) + SE[n'(T_t)] \qquad (2)$$

EBP applies subtree raising [4]. If a subtree has a smaller number of misclassifications than the node, it can be chosen to replace the original branch.

The conditional misclassification pruning (CMP)

PEP prunes the decision tree with the training data set and overcomes the bias, but there is no theoretical basis for introducing the continuity correction factor. The constant 1/2 can only express the contribution of one leaf to the complexity of the whole decision tree [5]. Statistically, the normal distribution is usually used in place of the binomial distribution, but this cannot correct the optimistic estimate effectively. In this paper, we introduce the concept of the conditional misclassification rate and apply it to PEP.

The conditional misclassification rate. The node condition is the ratio of the number of samples in a node to the number of samples in its parent node. We combine this concept with the quantities used in PEP, namely the misclassification rate $r(t)$, the continuity-corrected misclassification rate $r'(t)$, the misclassification rate of the subtree $r(T_t)$ and the continuity-corrected misclassification rate of the subtree $r'(T_t)$, to introduce the conditional misclassification rate.
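As a reference point for the PEP test in (1) and (2), on which CMP builds, the arithmetic can be worked through in a few lines of Python. This is only an illustrative sketch: the function name, the argument names and the example counts are assumptions made here, and the formulas follow the usual statement of PEP, with a corrected node count of e(t) + 1/2, a corrected subtree count of the summed leaf errors plus half the number of leaves, and a binomial-style standard error.

```python
from math import sqrt
from typing import Sequence


def pep_should_prune(node_errors: int, leaf_errors: Sequence[int], n_samples: int) -> bool:
    """Return True when the PEP inequality says the subtree should become a leaf.

    node_errors: misclassifications at the node if it were a leaf, e(t)
    leaf_errors: misclassifications at each leaf of the subtree, e(i)
    n_samples:   training samples reaching the node, n(t)
    """
    corrected_node = node_errors + 0.5
    corrected_subtree = sum(leaf_errors) + len(leaf_errors) / 2
    standard_error = sqrt(corrected_subtree * (n_samples - corrected_subtree) / n_samples)
    return corrected_node <= corrected_subtree + standard_error


# Hypothetical node with 40 samples, 15 of them misclassified if it were a leaf.
print(pep_should_prune(15, [8, 6], 40))  # True: the subtree is not one standard error better
print(pep_should_prune(15, [2, 1], 40))  # False: the subtree is clearly better, so it is kept
```

In the first call the corrected counts are 15.5 for the node and 15 for the subtree, with a standard error of about 3.06, so the branch is cut off. In the second call the subtree count drops to 4 with a standard error of about 1.90, so the branch is kept.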
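The continuity-corrected conditional misclassification rate of node $t$ is

$$r''(t) = \frac{\bigl(e(t) + 1/2\bigr)\,P(d_t)}{n(t)} \qquad (3)$$

According to the above formula, the continuity-corrected conditional misclassification rate of the subtree is

$$r''(T_t) = \frac{\sum_i \bigl(e(i) + 1/2\bigr)\,P(d_i)}{\sum_i n(i)} \qquad (4)$$

where $e(\cdot)$ and $n(\cdot)$ are the numbers of misclassifications and training samples at a node, and $i$ ranges over the leaves of the subtree $T_t$.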

Frontiers of Manufacturing and Design Science

Here, $P(d)$ denotes the node condition of node $d$.

The main idea of CMP. The conditional misclassification rate is added as a weight to the pruning of the decision tree [6], so that the influence of node splitting on pruning is fully taken into account. Node splitting is an important part of constructing a decision tree model; introducing the node condition into pruning therefore accounts for the distribution of samples among the split nodes. As in PEP, if the corrected number of misclassifications of the subtree is at least one standard error better than that of the node, the branch is not cut off. The standard error is defined as

$$SE'[n''(T_t)] = \sqrt{\frac{n''(T_t)\,\bigl(n(t) - n''(T_t)\bigr)}{n(t)}} \qquad (5)$$

Here, for the node,

$$n''(t) = \bigl(e(t) + 1/2\bigr)\,P(d_t) \qquad (6)$$

and for the subtree the number of conditional misclassifications is

$$n''(T_t) = \sum_i \bigl(e(i) + 1/2\bigr)\,P(d_i) \qquad (7)$$

When (8) is satisfied, the subtree should be cut off:

$$n''(t) \le n''(T_t) + SE'[n''(T_t)] \qquad (8)$$

The description of CMP. While the tree grows, the node condition is calculated, and the conditional misclassification rate is evaluated when the tree is pruned. Applying CMP to a decision tree algorithm proceeds as follows:

Input: training data set
Output: the decision tree pruned by CMP
Algorithm:
Step 1: Create the node N, which represents all of the samples.
Step 2: If N satisfies the condition of a terminal node, generate it as a leaf and assign it to class C. Meanwhile, record the node condition of this node.
Step 3: Otherwise, analyze the attributes according to the split criterion to decide which attribute to split the node on. Then construct a branch for every value of the chosen attribute, together with the corresponding subset of samples.
Step 4: Recursively construct the decision tree as in Steps 2 and 3.
Step 5: Calculate the number of conditional misclassifications n''(leaf) of every leaf and n''(T) of every subtree.
Step 6: Calculate the conditional misclassification standard error SE' of the node. If (8) is satisfied, cut off the branch and replace it with the node.
Step 7: Otherwise, if one of the subtrees of the branch satisfies (8), replace the branch with that subtree.
Step 8: Recursively prune the decision tree as in Steps 6 and 7.
Step 9: Return N.
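The pruning part of the procedure (Steps 5, 6 and 8) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class and all function names are hypothetical, the conditional count is read as (e + 1/2) weighted by the node condition, the root is given a node condition of 1, the standard error reuses the PEP form, the tree is pruned bottom-up, and the subtree replacement of Step 7 is left out.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List


@dataclass
class Node:
    n: int                      # training samples reaching this node
    e: int                      # samples misclassified if the node were a leaf
    children: List["Node"] = field(default_factory=list)
    condition: float = 1.0      # node condition; assumed to be 1 for the root


def set_conditions(node: Node) -> None:
    """Record the node condition: samples in the node / samples in its parent."""
    for child in node.children:
        child.condition = child.n / node.n
        set_conditions(child)


def leaves(node: Node) -> List[Node]:
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def cond_errors_node(node: Node) -> float:
    # Assumed reading: corrected error count weighted by the node condition.
    return (node.e + 0.5) * node.condition


def cond_errors_subtree(node: Node) -> float:
    return sum((leaf.e + 0.5) * leaf.condition for leaf in leaves(node))


def should_prune(node: Node) -> bool:
    subtree = cond_errors_subtree(node)
    # PEP-style standard error; max() guards against a negative radicand.
    se = sqrt(max(0.0, subtree * (node.n - subtree) / node.n))
    return cond_errors_node(node) <= subtree + se


def prune(node: Node) -> Node:
    """Prune children first, then test the node itself (Step 7 is omitted)."""
    for child in node.children:
        prune(child)
    if node.children and should_prune(node):
        node.children = []      # the node becomes a leaf
    return node


# Hypothetical tree: the right branch splits 40 samples into 25 and 15.
root = Node(100, 40, [Node(60, 10), Node(40, 15, [Node(25, 8), Node(15, 6)])])
set_conditions(root)
prune(root)
print(len(root.children))         # 2: the root split is kept
print(root.children[1].children)  # []: the right branch was collapsed into a leaf
```

In this example the right branch has a conditional count of 6.2 as a node against 7.75 plus a standard error of about 2.5 as a subtree, so it is cut off; the root has 40.5 against 12.5 plus about 3.3, so its split is kept.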

Experiment and Analysis

The experiment examines the properties of the proposed CMP algorithm by comparing its results with those of the EBP algorithm. The data sets are taken from UCI, and all of them are 2-class; they are listed in Table 1. One third of the items in each data set are chosen as the test set to examine the decision tree after pruning.

Table 1 The data sets of experiments

| Data sets | Instances | Test instances | Attributes |
| Monk's problem | | | |
| Heart-statlog | | | |
| Sonar | | | |
| Vote | | | |

Figure 1 shows the result for the Monk's problem data set after pruning the decision tree with the EBP algorithm, and Figure 2 shows the result after CMP pruning. The figures report the number of nodes (Size) and the number of misclassification errors (Error).

Fig. 1 Monk's problem EBP

Fig. 2 Monk's problem CMP

The same experiments are carried out on the other data sets in Table 1: the test data are chosen randomly, the Size and Error are recorded, and the results are averaged. Table 2 shows the results of the experiments.

Table 2 The results for all the data sets in the experiments

| Data sets | EBP Size | EBP Errors | EBP Error rate | CMP Size | CMP Errors | CMP Error rate |
| Monks | | | % | | | % |
| Heart-statlog | | | % | | | % |
| Sonar | | | % | | | % |
| Vote | | | % | | | % |

For the Monks data set, the decision tree pruned by CMP is larger than the tree pruned by EBP, but the accuracy is improved and over-pruning is relieved to some extent. For the other data sets, the decision trees become smaller or keep the same size. That is to say, the CMP algorithm can improve accuracy and alleviate the problems of over-pruning and under-pruning.

Conclusions

The conditional misclassification is introduced in this paper, and CMP is proposed. By introducing CMP, the decision tree is optimized. When CMP is applied to the C4.5 algorithm, the splitting of nodes is taken into account, and over-pruning and under-pruning are balanced effectively. However, the algorithm is limited to 2-class data sets; multi-class data sets will be the subject of future research.

Acknowledgments

This research was supported by the Beijing Municipal Science & Technology Commission key project (Z ), National Science and Technology Support Program topics (2009BAG12A10), the State Key Laboratory of Rail Traffic Control and Safety (RCS2009ZT007), Beijing Jiaotong University, and partially supported by the MOE Key Laboratory for Transportation Complex Systems Theory and Technology, School of Traffic and Transportation, Beijing Jiaotong University.

References

[1] Mingyu Zhong, Michael Georgiopoulos. A k-norm pruning algorithm for decision tree classifiers based on error rate estimation [J]. Machine Learning, 2008, 71:
[2] J.R. Quinlan. Simplifying decision trees [J]. Human-Computer Studies, 1999, 51:
[3] Xindong Wu, Vipin Kumar, J. Ross Quinlan et al. Top 10 algorithms in data mining [J]. Knowledge and Information System, 2008, 14: 1-37
[4] T. Elomaa, M. Kaariainen. An Analysis of Reduced Error Pruning [J]. Journal of Artificial Intelligence Research (15):
[5] Floriana Esposito, Donato Malerba, Giovanni Semeraro. A comparative analysis of methods for pruning decision tree [J]. IEEE Transaction on Pattern Analysis and Machine Intelligence, 1997, 19(5):
[6] Luwe, Wangzong. Optimization and Comparison of Decision Tree Algorithm [J]. Computer Engineering, 2007, 16(33):