
Applied Mechanics and Materials, Trans Tech Publications, Switzerland, 2011.

Application research on traffic modal choice based on decision tree algorithm

Zhenghong Peng 1,a, Xin Luan 2,b

1 School of Urban Design, Wuhan University, Wuhan, China
2 School of Urban Design, Wuhan University, Wuhan, China
a laopeng19@vip.sina.com, b dly @163.com

Keywords: decision tree; information gain ratio; traffic modal

Abstract: With the rapid development of urbanization in China, the contradiction between transport, the environment and population growth is becoming more and more pronounced, which places higher demands on transport planning. This article describes the application of the decision tree learning algorithm to traffic modal choice: the sample data are first preprocessed, the information gain ratios are then calculated and analyzed, and finally a decision tree model is built. The results show that the rules obtained by the decision tree method have practical value in the analysis of traffic modal choice.

Introduction

As a common and important method in data mining classification, the decision tree is an example-based inductive learning algorithm. Once the training set is determined, learning depends entirely on the data themselves, and the mining result is a tree structure that resembles a flow chart and is easy to understand. Decision trees have been widely applied in fields such as company project decision-making, prediction of disastrous weather and land planning. As an important part of transportation planning and policy-making, traffic modal choice affects people's travel efficiency in the city [1]. During the analysis, several factors that may influence residents' traffic modal choice have to be taken into account, and as the amount of collected data grows, the complex algorithms and reasoning of traditional mathematical methods cannot extract the knowledge completely. This article attempts to use the decision tree method to analyze these transportation problems and thus to provide some new ideas for future research on traffic modal choice.

Decision tree learning algorithm

Decision tree learning originated with the Concept Learning System (CLS), reached a peak with Quinlan's ID3, and later evolved into C4.5, which can handle continuous attributes [2]. Other commonly used decision tree algorithms include CART, SPRINT and QUEST. The main purpose of a decision tree is to extract classification rules that can be used for prediction. The basic decision tree method is a greedy algorithm that builds the tree gradually in a top-down recursive manner; Figure 1 shows the generation process. A decision tree classifies an instance by sorting it from the root down to a leaf node, so the result is a tree structure: the internal (non-leaf) nodes generally express a logical test, the branches carry the outcomes of that test, and the leaf nodes carry the class labels. Choosing a good test is the key to constructing a decision tree; in general, the smaller the tree, the stronger its predictive power. This paper mainly uses the C4.5 algorithm to construct the tree model. C4.5 uses the information gain ratio from information theory as the selection criterion: the attribute with the highest gain ratio becomes the test attribute of the given set.
It overcomes the shortcoming of the ID3 algorithm, which uses information gain as the selection criterion and is therefore biased toward attributes with many values.
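To make the root-to-leaf classification concrete, the following is a minimal Python sketch; the nested-dictionary tree representation, the classify helper and the toy attribute values are illustrative assumptions of this edition, not the paper's implementation.

```python
def classify(tree, instance):
    """Sort an instance from the root down to a leaf.

    A tree is either a class label (a leaf) or a one-key dict
    {test_attribute: {attribute_value: subtree, ...}} (an internal node).
    """
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]   # follow the matching branch
    return tree

# Toy tree shaped like the rules discussed later; the values are illustrative.
toy_tree = {"ownership of private car": {
    "yes": "private car",
    "no": {"trip distance": {"(0,2)": "walk", "(2,5)": "bicycle"}},
}}
print(classify(toy_tree, {"ownership of private car": "no",
                          "trip distance": "(0,2)"}))   # -> walk
```

The same walk applies unchanged to any tree produced by the learning procedure described below.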

Fig. 1 Decision tree generation process

Assume that S is a set of data samples and that the class label attribute takes m different values, defining m classes C_i (i = 1, ..., m); let s_i be the number of samples in class C_i and s the total number of samples. The expected information needed to classify a given sample is

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)   (1)

where p_i is the probability that a sample belongs to C_i, estimated by s_i / s.

Suppose attribute A has k different values {a_1, a_2, ..., a_k}. Then A divides S into k subsets {S_1, S_2, ..., S_k}, where S_j contains the samples of S whose value on A is a_j. If A is chosen as the test attribute (the best splitting attribute), these subsets correspond to the branches grown from the node containing S. Let s_{ij} be the number of samples of class C_i in subset S_j. The entropy, or expected information, produced by splitting on A is

E(A) = \sum_{j=1}^{k} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})   (2)

where the fraction (s_{1j} + ... + s_{mj}) / s acts as the weight of subset j: the number of samples in the subset (those with value a_j on A) divided by the total number of samples in S. The smaller the entropy, the higher the purity of the subsets. The information gain, which measures the classification ability of A, is

Gain(S, A) = I(s_1, s_2, \ldots, s_m) - E(A)   (3)

In other words, Gain(S, A) is the expected reduction in entropy caused by knowing the value of attribute A. The split information of A is

SplitInformation(S, A) = -\sum_{j=1}^{k} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|}   (4)

Finally, the information gain ratio is

GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}   (5)

The higher this value, the more useful information the attribute carries. For each node, we first calculate the entropy of the training samples, then each attribute's information gain, and finally the corresponding gain ratio. The attribute with the largest gain ratio becomes the node, each of its values becomes a branch, and the samples are partitioned along the branches into subsets. Applying this strategy recursively, a subset whose samples all belong to one class becomes a leaf node.
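Formulas (1)-(5) translate directly into code. Below is a minimal Python sketch of the gain-ratio computation; the names info and gain_ratio and the toy data are our own illustrative choices, and the attribute values are assumed to be already discretized, as in the experiment that follows.

```python
import math
from collections import Counter

def info(labels):
    """Eq. (1): I(s1,...,sm) = -sum_i p_i log2(p_i) over the class counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Eqs. (2)-(5) for one candidate attribute.

    values[i] is sample i's value on attribute A; labels[i] is its class.
    """
    n = len(labels)
    subsets = {}
    for v, c in zip(values, labels):            # partition S by the value of A
        subsets.setdefault(v, []).append(c)
    e_a = sum(len(s) / n * info(s) for s in subsets.values())    # Eq. (2)
    gain = info(labels) - e_a                                    # Eq. (3)
    split = -sum((len(s) / n) * math.log2(len(s) / n)
                 for s in subsets.values())                      # Eq. (4)
    return gain / split if split > 0 else 0.0                    # Eq. (5)

# Tiny example: ownership of private car against the chosen mode.
values = ["yes", "yes", "no", "no", "no"]
labels = ["private car", "private car", "walk", "bus", "walk"]
print(gain_ratio(values, labels))   # close to 1.0: the split is very pure
```

The guard on split information matters in practice: an attribute with a single value partitions nothing, and Eq. (5) would otherwise divide by zero.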

Application ideas

The research ideas for analyzing traffic modal choice with a decision tree are shown in Figure 2:

1. Data preprocessing. According to the specific conditions of traffic modal choice, choose and construct the training set to be used in decision tree construction; that is, collect and determine a typical data set.

2. Construct the decision tree and extract evaluation rules. The factors influencing traffic modal choice include both discrete attributes, such as trip purpose and transportation comfort level, and continuous attributes, such as trip time, household income and travel cost. This research therefore uses the C4.5 algorithm to construct the traffic modal choice decision tree and then extracts rules, organized in IF-THEN form, from the pruned tree.

3. Rule application. According to the extracted rules, first establish the traffic modal choice database, then reason over it by rule matching, and finally obtain the result.

Fig. 2 Research ideas
Fig. 3 Decision tree flow chart

Calculation instance

1. Data preparation. This paper selects survey data from Shijiazhuang City, October 2007, as the input vectors. According to the characteristics of traffic modal choice, five modes (walk, bicycle, bus, taxi and private car) are selected as feature modes, and six attributes, or impact factors (age, income, ownership of private car, trip time, trip purpose and trip distance), as test attributes. Their value definitions are shown in Table 1, where "continuous" indicates that the attribute is continuous and has been discretized.

Table 1. Definition of traffic mode and attribute values
mode: walk, bicycle, bus, taxi, private car
age: continuous; (10,19), (20,29), (30,39), (40,49), (50,59)
income: continuous; (0,1000), (1000,2000), (2000,3000), (3000,+∞)
ownership of private car: yes, no
trip time: continuous; early morning (6:00~8:00), morning (8:00~11:00), noon (11:00~2:00), afternoon (2:00~5:30), evening (5:30~12:00)
trip purpose: go to work, go to school, official business, shopping, go home, entertainment and sports, other
trip distance: continuous, in km; (0,2), (2,5), (5,10), (10,20), (20,+∞)

Table 2. Data set of traffic mode selection (part of the samples)
age | income | ownership of private car | trip time | trip purpose | trip distance | mode
(20,29) | (0,1000) | no | morning | shopping | (0,2) | walk
(10,19) | (0,1000) | no | early morning | school | (0,2) | walk
(20,29) | (1000,2000) | no | morning | work | (2,5) | bicycle
(20,29) | (1000,2000) | no | evening | home | (5,10) | bus
(40,49) | (3000,+∞) | no | noon | other | (5,10) | bus
(50,59) | (2000,3000) | no | evening | entertainment | (0,2) | walk
(40,49) | (3000,+∞) | yes | morning | other | (2,5) | private car
(10,19) | (0,1000) | no | evening | home | (0,2) | walk
(20,29) | (2000,3000) | no | afternoon | business | (10,20) | bus

This experiment extracted 1000 data samples; Table 2 shows part of them. The experiment uses SPSS Clementine 11.1 to mine the classification rules, with the traffic modal as the output variable and the other fields as input variables. Samples were drawn from the data set at random, 80% as training samples and 20% as test samples; the valid training data comprise 788 records. The flow chart is shown in Figure 3.
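As a sketch of this preprocessing step, the code below discretizes the continuous factors into the bands of Table 1 and makes the 80/20 split, using pandas and scikit-learn rather than Clementine; the file name shijiazhuang_2007_trips.csv and all column names are assumptions, and trip time is assumed to be recorded already in its time-of-day bands.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical survey file with columns mirroring Table 1 plus a "mode" target.
df = pd.read_csv("shijiazhuang_2007_trips.csv")

# Discretize the continuous factors into the bands of Table 1.
df["age"] = pd.cut(df["age"], bins=[10, 20, 30, 40, 50, 60], right=False,
                   labels=["(10,19)", "(20,29)", "(30,39)",
                           "(40,49)", "(50,59)"])
df["income"] = pd.cut(df["income"],
                      bins=[0, 1000, 2000, 3000, float("inf")],
                      labels=["(0,1000)", "(1000,2000)",
                              "(2000,3000)", "(3000,+inf)"])
df["trip distance"] = pd.cut(df["trip distance"],
                             bins=[0, 2, 5, 10, 20, float("inf")],
                             labels=["(0,2)", "(2,5)", "(5,10)",
                                     "(10,20)", "(20,+inf)"])

# Random 80% training / 20% test split, as in the experiment.
train, test = train_test_split(df, test_size=0.2, random_state=0)
```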

2. Generate decision tree. Set S is the collection of 788 training samples, so the sample size is s = 788. The class label attribute, traffic mode, has 5 different values, and classes C_1 to C_5 correspond to walk, bicycle, bus, taxi and private car. From the data, s_1 = 204, s_2 = 86, s_3 = 292, s_4 = 31 and s_5 = 175. According to Eq. (1), the expected information needed to classify a given sample is

I(s_1, \ldots, s_5) = -\frac{204}{788}\log_2\frac{204}{788} - \frac{86}{788}\log_2\frac{86}{788} - \frac{292}{788}\log_2\frac{292}{788} - \frac{31}{788}\log_2\frac{31}{788} - \frac{175}{788}\log_2\frac{175}{788} \approx 2.050

First, take age as a candidate splitting attribute. Age has five values (10,19), (20,29), (30,39), (40,49) and (50,59), so S is divided into five subsets {S_1, S_2, S_3, S_4, S_5}: S_1 contains the samples of S aged (10,19) (110 samples), S_2 those aged (20,29) (169 samples), S_3 those aged (30,39) (202 samples), S_4 those aged (40,49) (161 samples) and S_5 those aged (50,59) (146 samples). Here s_{ij} is the number of samples of class C_i in subset S_j: s_{11} is the number of samples choosing walk in age (10,19), s_{12} the number choosing walk in age (20,29), and so on. For the (40,49) subset, for example, s_{14} = 11, s_{24} = 31, s_{34} = 48, s_{44} = 13 and s_{54} = 58; for the (50,59) subset, s_{15} = 61, s_{25} = 1, s_{35} = 39, s_{45} = 3 and s_{55} = 42. Each I(s_{1j}, \ldots, s_{5j}) is computed from such counts by Eq. (1).

According to Eq. (2), the entropy of the partition induced by age is

E(age) = \sum_{j=1}^{5} \frac{s_{1j}+s_{2j}+s_{3j}+s_{4j}+s_{5j}}{788} \, I(s_{1j}, \ldots, s_{5j}) = \frac{110}{788} I(s_{11}, \ldots, s_{51}) + \cdots + \frac{146}{788} I(s_{15}, \ldots, s_{55}) \approx 1.346

According to Eq. (3), the information gain is

Gain(S, age) = I(s_1, \ldots, s_5) - E(age) \approx 2.050 - 1.346 = 0.704

According to Eq. (4), the split information of age is

SplitInformation(S, age) = -\sum_{j=1}^{5} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|} = -\frac{110}{788}\log_2\frac{110}{788} - \cdots - \frac{146}{788}\log_2\frac{146}{788} \approx 2.296

According to Eq. (5), the information gain ratio of age is

GainRatio(S, age) = \frac{Gain(S, age)}{SplitInformation(S, age)} = \frac{0.704}{2.296} \approx 0.306

The gain ratios of income, ownership of private car, trip time, trip purpose and trip distance are calculated in the same way. The comparison shows that GainRatio(S, ownership of private car) is the largest. Therefore, ownership of private car is the best splitting attribute, and splitting on it divides the data set into two subsets, as shown in Figure 4.
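This root-node selection can be reproduced with the gain_ratio helper from the earlier sketch and the discretized train frame from the preprocessing sketch; all names remain illustrative assumptions.

```python
attributes = ["age", "income", "ownership of private car",
              "trip time", "trip purpose", "trip distance"]

# Gain ratio of every candidate attribute over the training samples.
ratios = {a: gain_ratio(train[a].tolist(), train["mode"].tolist())
          for a in attributes}
root = max(ratios, key=ratios.get)
print(root)   # the paper finds "ownership of private car" to be largest
```

Repeating this selection on each branch's subset, with the chosen attribute removed, yields exactly the recursive construction described above.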

Fig. 4 Decision tree after selecting the node for the first time
Fig. 5 Final rules of traffic modal choice

For each sub-tree, the information gain ratio of each attribute is calculated recursively. The finished decision tree can then be converted into the rules shown in Figure 5, from which residents' traffic modal choice can be predicted. For instance, the private car is the first choice of residents who own one. Of those without a private car whose trip distance is in (0,2), most choose to walk. For those without a private car whose trip distance is in (2,5) and whose age is in (10,19) or (50,59), the probability of walking is also higher. All of this can be read off the obtained rules.

3. Assessment of the decision tree. Tested on the Clementine data mining platform with 10-fold cross-validation, the correct prediction rate is 89.97%, so the model's forecasting performance is good. The result is shown in Figure 6.

Fig. 6 Model prediction validation

In Figure 6, the coincidence (confusion) matrix shows the concrete prediction results: the entries on the diagonal are the numbers of correctly predicted samples, and the remaining entries are the numbers of wrongly predicted samples. The first row of the matrix shows that, of the samples whose actual class was bicycle, 69 were predicted correctly, 11 were wrongly predicted as bus, 0 as private car, and 8 as walk. The confidence-level report shows that the stability of the results is not very good and their spread is large; this is where the experiment still needs improvement.
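A comparable validation can be sketched with scikit-learn, though only as an approximation: scikit-learn grows CART-style trees and does not implement C4.5's gain ratio, so an entropy criterion is used as the closest stand-in. The train frame and attributes list are assumed from the earlier sketches.

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Encode the categorical attribute bands as integers for the tree learner.
X = OrdinalEncoder().fit_transform(train[attributes])
y = train["mode"]

# Entropy-based CART as a stand-in for C4.5's gain-ratio criterion.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
pred = cross_val_predict(clf, X, y, cv=10)   # 10-fold cross-validation
print(accuracy_score(y, pred))               # the paper reports 89.97%
print(confusion_matrix(y, pred))             # cf. Fig. 6's coincidence matrix
```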

Conclusions

The experimental result is quite satisfying. However, because the data set is small and no pretreatment was applied to redundant attributes or noisy data, the forecasting analysis inevitably has certain limitations. In determining the traffic modal choice model, few factors were chosen, so factors such as land use and psychological factors are not reflected in the model. Quantifying such factors scientifically and adopting them as input vectors is one way to improve the forecasting model. In researching traffic forecasting based on decision tree algorithms, this paper tries only a small part of decision tree theory, yet the results are conspicuous, so there is much improvement work to be done in this field. In a word, combining decision tree algorithms with transport forecasting is one way forward to accurate and reasonable traffic planning and design.

References

[1] Lu Huapu: Transportation Planning Theories and Methods, 2nd ed. Beijing: Tsinghua University Press (2006) (in Chinese).
[2] Han Jiawei, Micheline Kamber: Data Mining: Concepts and Techniques. Beijing: Machine Industry Press (2001), 70-18 (in Chinese).
[3] J. Ross Quinlan: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993), 63-91.
[4] Huang Aihui: C4.5 decision tree algorithm and application. Science Technology and Engineering (2009), 9(1), 34-4 (in Chinese).
