Superimposing Activity-Travel Sequence Conditions on GPS Data Imputation

Tao Feng 1, Harry J.P. Timmermans 2
1 Urban Planning Group, Department of the Built Environment, Eindhoven University of Technology, The Netherlands, t.feng@tue.nl
2 Urban Planning Group, Department of the Built Environment, Eindhoven University of Technology, The Netherlands, h.j.p.timmermans@tue.nl

Abstract

This paper proposes an algorithm to decrease the discrepancies between imputed and real GPS data. Based on the activity-travel pattern obtained using a Bayesian Belief Network model, the algorithm takes into account the consistency of the full activity-travel pattern within a day, in the sense that the activity-travel sequence is represented as a hierarchical set of tours and the transportation modes within a tour are logically consistent. We explore three different approaches, based on the number of epochs and imputation probabilities, to identify the transportation mode for each trip period between two consecutive activities. In principle, the mode with the highest number of epochs for which it has the highest probability is selected. The algorithm was tested using GPS data recently collected in the Netherlands. Results show that the new algorithm significantly improves the imputation accuracy of transportation modes.

Keywords: Bayesian Belief Network; Imputation Algorithm; Prompt Recall

1. Introduction

Using GPS for collecting activity-travel data has been shown to significantly reduce respondent burden and to capture unreported trips and activities more precisely than conventional survey methods (Wolf, 2004). However, the accuracy of the imputed agenda depends heavily on the performance of the data processing algorithms (Titheridge and Simpson, 2011).
Because most imputation algorithms are not perfect in the sense that the imputed data do not necessarily match completely the real data, a prompted recall survey is often used for respondents to verify the imputed activity-travel diaries. However, prompted recalls involve extra effort and consequently the time gains of using GPS devices are partially lost, while results are not necessarily error free. Decreasing the gap between imputed data and real activity-travel data would be beneficial not only in terms of improving the ease of data verification, but also in terms of developing fully automatic data imputation systems. Therefore, enhancing the predictability of current imputation algorithms to provide more reliable data is important.

Various imputation algorithms have been employed in previous research (Bohte and Maat, 2009; Stopher and Wargelin, 2010). In general, these methods first identify activity or travel episodes, and then identify the transportation mode between two consecutive activities. The key determinant in this process is the speed-related variable of the different transport modes. This can be problematic when multiple transportation modes have similar speed values. The advanced learning algorithms that have been developed and applied (Moiseeva et al., 2010; Rudloff and Ray, 2010) have the advantage that they can identify transport modes and activity types simultaneously and recognize the transport modes based on conditional probabilities. However, like other algorithms, they focus on individual epochs only, rather than on the whole activity-travel sequence.

The consistency of the imputed activity sequence with respect to transportation modes has scarcely been addressed (Stopher et al., 2011). The issue is that existing algorithms tend to examine individual epochs with a limited time horizon to impute the transportation mode. However, individuals tend to use the same transportation mode throughout a tour, and often the same mode for the return part of their daily schedule. For instance, if people drive from home to work in the morning, they are most likely to return home by car as well. This can be considered a home-based tour, or a sequence loop, within which all trips involve the same mode. The example can easily be extended to the more general case in which the transportation mode used for the first trip coincides with the one used at the end of the tour. In this case, combinations of different transportation modes in different travel segments are not very likely. Therefore, it is potentially valuable to check and evaluate the consistency of transportation modes across a tour when imputing them.
This paper therefore suggests an improved algorithm that takes into account the consistency of the whole imputed activity-travel sequence within a day. It is assumed that daily activity-travel sequences consist of a set of hierarchically ordered tours and that the transportation modes on the two legs of these tours are consistent, depending on the kind of transportation mode involved. The main travel mode is identified separately for each trip period between two activities according to the number of epochs and the imputation probabilities. In principle, the mode with the highest number of epochs for which it has the highest probability is selected and assigned to the tour.

The paper is organized as follows: Section 2 describes the suggested algorithm to superimpose the activity-travel sequence. Section 3 introduces the GPS data and Section 4 presents the results on the performance of the new algorithm. The paper is concluded in Section 5.

2. The Improved Algorithm

The method is based on a Bayesian Belief Network (BBN) model, which replaces ad hoc rules with a dynamic structure, leading to improved classification if consistent evidence is obtained over time from more samples. The BBN represents the multiple relationships between different spatial, temporal and other factors, including errors in the technology

itself (input), and the facet of activity-travel patterns that we wish to impute from the GPS traces (output).

Figure 1 Example of the activity-travel sequence and consistency of transportation modes on tours

2.1 The Improved Algorithm

The BBN model was applied previously to single epochs of 1-3 seconds. Because speed data and other variables used in the classification change over time, the classification result also differs between epochs. The challenge for the algorithm therefore is to detect whether multiple transportation modes were used during the trip, which is usually done by setting some threshold for the length of a stop, and to determine the type of transportation mode(s) given the classification results of all epochs belonging to the same trip. This is usually accomplished by predefined merge rules with respect to the time threshold. However, this process does not guarantee the consistency of imputed transportation modes across tours and the legs of a tour. Even though individuals may choose different transportation modes for each trip, in reality the majority will use the same transportation mode on the different legs of a tour, especially in the case of a transport mode which is difficult to leave behind, like car or bike. Although superimposing consistency in transportation modes may add some error to the imputation, we contend that it avoids more errors than it introduces, so that the overall result is an improvement in the imputation.

If this reasoning is accepted, any improved algorithm should address two issues: (i) how to identify the set of hierarchically related tours, and (ii) how to identify the (main) transportation mode for each tour. The identification of tours involves the comprehensive

investigation of the sequence of activities and trips. The activity location and activity type, in conjunction with the timing information, provide useful references to identify the tours. Furthermore, an algorithm is needed to address the consistency of transportation modes for each tour. As in the BBN, we could further extend the learning feature of conditional probabilities to determine the transportation modes.

Figure 1 illustrates the concept used to ensure that transportation modes are consistent for different legs of tours. Assume that a trip (Trip 1) is made by car from Home (L1) to Work (L2). If it is enforced that the car must be returned home, we identify the chained trips or trip stages where Home is revisited (=L4). To make sure that the car is part of a tour that starts and ends at home, we identify L3, the start of the trip (stage) that ends at L4. We then create a network consisting of all points between L1 and L4. The shortest path from L3 to L2 (L3-L2 as L2-L5-L2) is a loop starting and terminating at L2, creating a circuit (in red) starting and ending at Home (L1). The trips indicated in black still have to be assigned a transport mode, which can be tracked by updating a logical flag indicating whether or not a transport mode has been assigned. As shown in the example, the trips made later during the day to the sports facilities still need to be assigned a mode. By searching for the next trip that requires a transport mode to be assigned, these two remaining journeys will be processed.

In general, the transportation mode used to conduct a certain activity can be a single mode or a combination of multiple modes. In most cases where the main transportation mode dominates the traveling episode, the sections of short travel can be ignored. One example is people walking before and after driving a car from one location to another. In case of multi-modal trips, every transportation mode covers a section of the whole tour.
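The tour identification step described above can be illustrated with a minimal sketch. It assumes that each activity of the day is reduced to a location label and that a tour is closed whenever a location is revisited; the function names `find_tours` and `trips_by_tour` and the example day are hypothetical and are not taken from the paper's implementation, which also uses activity type and timing information.

```python
from typing import Dict, List, Tuple

Tour = Tuple[int, int]  # (index of first activity, index of last activity)


def find_tours(locations: List[str]) -> List[Tour]:
    """Return tours as index pairs; a tour starts and ends at the same location."""
    tours: List[Tour] = []
    open_path: List[int] = []  # activities not yet enclosed by a closed tour
    for idx, loc in enumerate(locations):
        # Search backwards for the most recent open visit to this location.
        match = next(
            (k for k in range(len(open_path) - 1, -1, -1)
             if locations[open_path[k]] == loc),
            None,
        )
        if match is not None:
            tours.append((open_path[match], idx))
            del open_path[match:]  # the closed loop collapses to one point
        open_path.append(idx)
    return tours


def trips_by_tour(tours: List[Tour], n_activities: int) -> Dict[Tour, List[int]]:
    """Assign trip t (from activity t to t+1) to the innermost tour containing it."""
    assigned: Dict[Tour, List[int]] = {tour: [] for tour in tours}
    for trip in range(n_activities - 1):
        containing = [(end - start, (start, end))
                      for start, end in tours if start <= trip < end]
        if containing:
            assigned[min(containing)[1]].append(trip)
    return assigned


# Hypothetical day: a work-based loop nested in a home-based tour,
# followed by a separate home-based tour to the sports facilities.
day = ["home", "work", "lunch", "work", "home", "sports", "home"]
tours = find_tours(day)                 # [(1, 3), (0, 4), (4, 6)]
legs = trips_by_tour(tours, len(day))   # outer home tour owns trips 0 and 3
```

In this sketch the outer home-based tour keeps only its two legs (trips 0 and 3), which is the pair of travel episodes on which mode consistency would be imposed, while the nested loop is treated as a tour of its own.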
Therefore, without loss of generality, we represent the sequence of transportation modes between two activities as follows:

$$A_i,\; T_{l_1,i,i+1},\; T_{l_2,i,i+1},\; \ldots,\; T_{L_{i,i+1},i,i+1},\; A_{i+1} \tag{1}$$

where $A_i$ and $A_{i+1}$ are two adjacent activities; $T_{l_i,i,i+1}$ is the $l$-th transportation mode of the tour starting from activity $i$, $l_i \in [1, L_{i,i+1}]$; and $L_{i,i+1}$ is the total number of transportation modes between activity $i$ and activity $i+1$. The whole activity-travel sequence can then be described as follows:

$$A_1,\; T_{l_1,1,2},\; T_{l_2,1,2},\; \ldots,\; T_{L_{1,2},1,2},\; A_2,\; \ldots,\; A_i,\; T_{l_1,i,i+1},\; T_{l_2,i,i+1},\; \ldots,\; T_{L_{i,i+1},i,i+1},\; A_{i+1},\; \ldots,\; A_{N-1},\; T_{l_1,N-1,N},\; T_{l_2,N-1,N},\; \ldots,\; T_{L_{N-1,N},N-1,N},\; A_N \tag{2}$$

where $N$ represents the total number of activities, and $i \in [1, N]$. In order to analyze the interrelationship between different trips, we identify the tours from the sequence. A tour is an activity-travel sequence which starts and ends at the same location. Assume there is only one tour in the sequence, starting from $A_1$ and ending at $A_N$; then the sequence of the tour can be represented as:

$$C = (A_1, \ldots, A_N) \tag{3}$$

where activity $A_1$ and activity $A_N$ happen at the same location. If there is another tour within this tour, then the tours can be represented as

$$C_1 = (A_i, \ldots, A_j) \tag{4}$$

$$C_2 = (A_1, \ldots, A_i;\; A_j, \ldots, A_N) \tag{5}$$

where the activities $A_i$ and $A_j$ happen at the same location. In case there are multiple transportation modes between activity $i$ and activity $j$, we assume that one main transportation mode accounts for most of the travel distance or travel time. Taking the above tour $C_2$ as an example, if $T_{l',1,i}$ and $T_{l',j,N}$ are the main modes of the trips from activity 1 to $i$ and from activity $j$ to $N$, respectively, then

$$T_{l',1,i} = T_{l',j,N} \tag{6}$$

Equation (6) indicates that the transportation modes of different travel episodes which belong to the same tour should be the same. This example formalizes a simple case, which can be extended to more complex cases of multiple trips. Let $T_{i,C_k}$ represent the main mode of the $i$-th trip in the $k$-th tour $C_k$. Since the transportation mode is identified in terms of the level of conditional probabilities, the selected transportation mode for each epoch is the one with the highest probability. Assume the imputation probability of transportation mode $T_i$ for epoch $e$ is $P_{e,T_i}$; then

$$P_{T_i} = \max(P_{i1}, P_{i2}, \ldots, P_{iS}) \tag{7}$$

where $S$ indicates the total number of transportation modes. For each travel episode, the transportation mode which most frequently has the highest probability is treated as the main mode for the trip. We use $F_s$ to indicate the number of epochs in which transportation mode $s$ has the highest probability. Then, the final transportation mode for the travel episode is the one that attains the maximum value of $F_s$,

$$\max(F_1, F_2, \ldots, F_S) \tag{8}$$

Once the transportation modes are identified for each trip segment, we need to confirm the transportation mode for tours.
Here, we propose the following three methods:

Method 1: The frequency of the transportation mode which has the highest probability is identified for each trip episode separately. The transportation mode which has the highest frequency for all trips is selected.

Method 2: The frequencies of all transportation modes of all trip episodes which belong to the same tour are put together. Then, the one which has the highest frequency with the highest probabilities is selected to replace the others.

Method 3: In case of three or more trips within the same tour, we identify the transportation mode using Method 1 for all trips excluding the first and the last trips. Then, we use the confirmed mode as the replacement for the first and last trips.
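The epoch-level selection of Equations (7) and (8) and the three tour-level methods can be sketched as follows. This is one possible reading rather than the authors' implementation: it assumes each epoch is given as a dictionary of mode probabilities, interprets Method 1 as picking the trip-level main mode with the largest epoch count, and falls back to Method 1 when a tour has fewer than three trips. All names (`epoch_mode`, `method1`, `win`, the example tour) are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Epoch = Dict[str, float]  # mode -> imputation probability for one epoch
Trip = List[Epoch]        # all epochs of one travel episode


def epoch_mode(epoch: Epoch) -> str:
    """Eq. (7): the mode with the highest probability in one epoch."""
    return max(epoch, key=epoch.get)


def mode_frequencies(trip: Trip) -> Counter:
    """F_s: number of epochs in which mode s has the highest probability."""
    return Counter(epoch_mode(e) for e in trip)


def main_mode(trip: Trip) -> str:
    """Eq. (8): the mode with the largest F_s for a travel episode."""
    return mode_frequencies(trip).most_common(1)[0][0]


def method1(tour: List[Trip]) -> str:
    """Assumed reading: the per-trip main mode with the highest frequency wins."""
    winners = (mode_frequencies(trip).most_common(1)[0] for trip in tour)
    return max(winners, key=lambda mode_count: mode_count[1])[0]


def method2(tour: List[Trip]) -> str:
    """Pool the frequencies of all trips in the tour, then take the maximum."""
    pooled: Counter = Counter()
    for trip in tour:
        pooled.update(mode_frequencies(trip))
    return pooled.most_common(1)[0][0]


def method3(tour: List[Trip]) -> str:
    """Method 1 on the interior trips; the result replaces the first and last trips."""
    if len(tour) < 3:
        return method1(tour)  # assumption: no interior trips to rely on
    return method1(tour[1:-1])


MODES = ["car", "bus", "walk"]


def win(mode: str, n: int) -> Trip:
    """Hypothetical helper: n epochs in which `mode` has the highest probability."""
    return [{m: (0.7 if m == mode else 0.15) for m in MODES} for _ in range(n)]


example_tour = [
    win("car", 4) + win("walk", 3),  # first leg
    win("bus", 5) + win("walk", 1),  # interior trip
    win("car", 4) + win("walk", 1),  # last leg
]
```

For this artificial tour the three readings disagree: `method1` and `method3` return "bus" because the interior trip has the single largest winning count, whereas `method2` returns "car" because car wins the most epochs once all trips are pooled. Ties are broken by first occurrence here, which is an arbitrary choice the paper does not specify.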

2.2 Data Processing and Comparison

In the process of prompted recall, people may have deleted, modified or merged imputed activity and travel episodes. Some people prefer to change the content of the imputation results without changing the sequence frame, while others remove the incorrect data and create brief agendas. Moreover, people may either keep short walking trips as true or ignore them completely. All of this increases the complexity of the comparison, since people behave differently when validating agendas. Therefore, in the subsequent analyses, we only compared observed and imputed activity-travel patterns for which the activity sequences were identical. The performance of the algorithms was compared by calculating and comparing hit ratios.

3. GPS Data

The GPS data were collected recently in a large-scale data collection project conducted in Eindhoven and Rotterdam, The Netherlands. A web-based data collection system and a data processing system, both developed earlier, were improved and combined to collect and process the personal profile data and the activity-travel sequence data. Participants were provided with user accounts and passwords for system login to upload and validate their data. On the follow-up prompted recall page, people are requested to fill in and/or correct information which they consider inaccurate or missing. The system allows respondents to change, remove and merge the imputed data, and to create new activity/travel data. Both the originally uploaded data and the data confirmed by respondents are automatically saved in a background database.

4. Results and Analyses

4.1 Facts from the validated data

The basic principle of the designed algorithm is to improve the performance of the imputation algorithm. Because the speed profiles of different transportation modes are most likely blurred during peak hours, in addition to an overall comparison of the algorithms, we specifically compared their performance for the work commute during peak hours.
Figure 2(a) shows the distribution of the different transportation modes. Here, we define the morning peak as 7:00 to 9:00 and the evening peak as 16:00 to 18:30. As shown, the frequencies of transportation modes in the morning peak and the evening peak are almost identically distributed along the trend line. This result partly supports the applicability of the suggested algorithm to commuter trips. Figure 2(b) shows that the majority of changed modes relate to the car (42%) and walking (34%). Public transportation modes are not changed that often. It is understandable that the car is the most frequently used transportation mode, and traveling by car is heavily influenced by traffic conditions, which results in much noise for data imputation. In addition, walking is

frequently changed due to people's different validation behavior and the reliability of the imputed agendas.

Figure 2 (a) Transportation modes of peak hour (b) Frequency of transportation modes that people changed in the prompt recall

4.2 Comparative analyses

Apart from differences in perception and understanding of the survey, respondents also have different attitudes towards validating the data. Respondents differ in their motivation to validate the imputed data and in the precision with which they do so. In some cases, the imputed data are exactly the same as the validated data even when the imputed data look irrational. In this case, the data cannot be guaranteed to be true. Therefore, before the comparison, we excluded the sequences which were imported in full without any changes. Trips that could not be identified because of non-identified activity episodes were disregarded in the analysis.

Table 1 shows the hit ratios for the different transportation modes. Since the values are calculated using only data which were incorrectly imputed, the percentages represent the level of improvement of the new algorithms. As one can see, the imputation accuracy of almost all transportation modes is significantly improved. Relative to the originally imputed level of 4.3% for bike, the correctly imputed percentages using the new algorithms all increased significantly, to 34.0% (Method 1), 19.1% (Method 2) and 17.0% (Method 3), respectively. The imputation of the car mode is also significantly improved, from 2.3% (original) to 26.2% (Method 1), 26.7% (Method 2) and 13.7% (Method 3), respectively. The accuracy improvement is also significant for walking, where the hit ratio increases from 0.6% to 6.9% (Method 1), 8.4% (Method 2) and 11.3% (Method 3), respectively.
Table 1 Hit ratio of different transportation modes

                    BIKE   BUS   CAR    METRO  TRAIN  TRAM   WALKING
Originally imputed  4,3%   -     2,3%   27,0%  28,6%  79,6%  0,6%
Method 1            34,0%  4,8%  26,2%  23,8%  -      72,2%  6,9%
Method 2            19,1%  -     26,7%  20,6%  14,3%  77,8%  8,4%
Method 3            17,0%  -     13,7%  27,0%  28,6%  68,5%  11,3%

Table 2 Confusion matrix of original imputed data and new methods

BIKE BUS CAR METRO TRAIN TRAM WALKING

Original
BIKE 4,3% - 6,4% 4,8% - 5,6% 20,9%
BUS 4,3% - 34,6% 9,5% ,3%
CAR 4,3% 42,9% 2,3% 6,3% 57,1% - 24,4%
METRO - - 0,5% 27,0% - - 2,2%
RUNNING 48,9% - 0,3% ,5%
TRAIN - 4,8% 42,7% 34,9% 28,6% - 17,2%
TRAM - 47,6% 1,8% ,6% 0,9%
WALKING 38,3% 4,8% 11,5% 17,5% 14,3% 14,8% 0,6%

Method 1
BIKE 34,0% - 2,8% 4,8% - 1,9% 14,1%
BUS 4,3% 4,8% 22,6% 9,5% - - 9,7%
CAR - 28,6% 26,2% 11,1% 85,7% - 44,4%
METRO - 4,8% 0,8% 23,8% - - 1,9%
RUNNING 34,0% - 0,3% ,3%
TRAIN - 9,5% 28,8% 33,3% ,4%
TRAM - 38,1% 1,3% ,2% 3,4%
WALKING 27,7% 14,3% 17,3% 17,5% 14,3% 25,9% 6,9%

Method 2
BIKE 19,1% - 3,1% 4,8% - 1,9% 15,0%
BUS 4,3% - 19,6% 9,5% - - 7,8%
CAR 2,1% 33,3% 26,7% 11,1% 71,4% - 44,1%
METRO - - 0,8% 20,6% - - 1,6%
RUNNING 34,0% - 0,3% ,3%
TRAIN - 9,5% 31,6% 36,5% 14,3% - 14,4%
TRAM - 47,6% 2,0% ,8% 2,5%
WALKING 40,4% 9,5% 16,0% 17,5% 14,3% 20,4% 8,4%

Method 3
BIKE 17,0% - 4,8% 4,8% - 1,9% 13,8%
BUS 4,3% - 23,2% 9,5% ,4%
CAR 2,1% 38,1% 13,7% 6,3% 57,1% - 29,7%
METRO - - 1,3% 27,0% - - 1,6%
RUNNING 29,8% - 0,3% - - 5,6% 10,6%
TRAIN - 9,5% 34,4% 36,5% 28,6% - 16,3%
TRAM - 38,1% 0,8% ,5% 2,5%
WALKING 46,8% 14,3% 21,6% 15,9% 14,3% 24,1% 11,3%

Total 100,0% 100,0% 100,0% 100,0% 100,0% 100,0% 100,0%

Table 3 Hit ratios of car mode during morning and evening peak time

                  Morning peak  Evening peak
Original imputed  60,5%         71,1%
Method 1          65,8%         76,3%
Method 2          76,3%         65,4%
Method 3          63,2%         68,4%
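The relation between the hit ratios of Table 1 and the confusion matrix of Table 2 can be made concrete with a short sketch. It assumes, consistent with the columns summing to 100%, that columns are validated modes and rows are imputed modes, so that the hit ratio of a mode is the diagonal share of its column. The function names and the episode list are hypothetical illustrations, not the survey data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def column_confusion(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each validated mode to the shares of imputed modes observed for it.

    `pairs` holds (imputed_mode, validated_mode) for each travel episode.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for imputed, validated in pairs:
        counts[validated][imputed] += 1
    return {
        validated: {mode: n / sum(column.values()) for mode, n in column.items()}
        for validated, column in counts.items()
    }


def hit_ratios(confusion: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Hit ratio per validated mode: share of episodes imputed as that same mode."""
    return {validated: column.get(validated, 0.0)
            for validated, column in confusion.items()}


# Hypothetical episodes, listed as (imputed, validated).
episodes = [
    ("running", "bike"), ("running", "bike"), ("walking", "bike"), ("bike", "bike"),
    ("car", "car"), ("car", "car"), ("car", "car"), ("train", "car"),
]
confusion = column_confusion(episodes)
ratios = hit_ratios(confusion)  # {"bike": 0.25, "car": 0.75}
```

A mode that never appears on the diagonal of its column, such as the imputed RUNNING row in Table 2, simply gets a hit ratio of zero under this definition.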

It is interesting to compare the transportation modes which are significantly improved with Figure 2(b). Both results show that exactly those transportation modes which respondents changed most are the ones most improved after applying the new algorithm. This indicates that the suggested algorithm could substantially improve the accuracy of the imputation. In addition, an imputation improvement for the bus mode is only obtained by Method 1, indicating that Method 1 can correctly recognize the bus mode with an increased accuracy of 4.8%. For the other public transportation modes, like metro, train and tram, only Method 3 reaches levels of imputation accuracy similar to the original levels (except for tram).

A detailed overview of the imputation results is shown in Table 2 in the form of a confusion matrix. Taking the bike mode under Method 1 as an example, 48.9% of bike episodes were recognized as running and 38.3% as walking in the original imputation data. After applying Method 1, the incorrectly recognized percentages decrease to 34.0% and 27.7% for running and walking, respectively.

The results discussed above are based on data that include the trip episodes for all types of activities. They may not represent the specifics of commute trips during peak hours. To explore this issue, we calculated the hit ratio specifically for the car mode during the morning peak and the evening peak. As shown in Table 3, all improved methods lead to increased accuracy for morning peak trips relative to the originally imputed data. The hit ratio increases from 60.5% (original) to 65.8% (Method 1), 76.3% (Method 2) and 63.2% (Method 3), respectively. For the evening peak, only Method 1 leads to a significant improvement, increasing the accuracy level from 71.1% (original) to 76.3%. The levels of the other two methods are slightly lower than the original results.
In the end, this probably indicates that Method 1 is better than the other two methods, especially for the prediction of motorized commute trips during peak times.

5. Summary and Conclusions

Although the application of GPS data collection and imputation methods has increased lately, imputation algorithms are still not perfect. In some cases, the gap between imputation results and validation data is still substantial. Although a prompted recall process may compensate for imputation errors, it involves additional respondent burden and human error. Therefore, there is still a need to further improve imputation algorithms. Existing imputation algorithms are typically based on epoch data, with the potential disadvantage that classification at this temporal scale may contain errors and one needs a way to aggregate the data. To compensate for this potential error, in the present paper we suggest superimposing aggregate activity-travel pattern conditions. More specifically, the improved algorithm identifies a set of hierarchically ordered tours making up a daily activity-travel pattern and uses probabilistic rules and a learning algorithm to systematically and consistently impute the same transportation mode for the two legs of

the tours. Although the consistency can be violated in reality, we contend that on balance imputation results will improve by superimposing these conditions. A test of this approach confirms this contention. We found from the validation data that transportation modes during morning and evening peak times are almost equally distributed. The majority of the transportation modes that respondents corrected are car and walking, which account for 76% in total. Furthermore, the performance of the proposed algorithm was examined through comparative analyses based on imputed data and validation data. Results indicate that the suggested algorithm substantially improves the accuracy of imputed data. For commute trips by car, all three methods yield significant improvements for the morning peak, while Method 1 leads to an accuracy increase for both the morning peak and the evening peak.

References

Bohte, W. and Maat, K. (2009) Deriving and validating trip purposes and travel modes for multi-day GPS-based travel surveys: A large-scale application in the Netherlands. Transportation Research Part C, 17,

Moiseeva, A., Jessurun, A.J. and Timmermans, H.J.P. (2010) Semi-automatic imputation of activity-travel diaries using GPS traces, prompted recall and context-sensitive learning algorithms. Transportation Research Record: Journal of the Transportation Research Board, 2183,

Rudloff, C. and Ray, M. (2010) Detecting travel modes and profiling commuter habits solely based on GPS data. Compendium of Papers DVD, Transportation Research Board 89th Annual Meeting, January 10~14, Washington D.C.

Stopher, P.R. and Wargelin, L. (2010) Conducting a household travel survey with GPS: Reports on a pilot study. Proceedings of the 12th WCTRS, July 11~15, Lisbon, Portugal.

Stopher, P.R., Zhang, J. and Prasad, C. (2011) Evaluating and improving software for identifying trips, occupancy, mode and purpose from GPS traces. Proceedings of the 9th International Conference on Transport Survey Methods, November 14~18, Termas de Puyehue, Chile.

Titheridge, H. and Simpson, D.J. (2011) Travel surveys: Measuring compliance over an eight week GPS Survey. Proceedings of the 9th International Conference on Transport Survey Methods, Chile.

Wolf, J. (2004) Applications of new technologies in travel surveys. Proceedings of the International Conference on Transport Survey Quality and Innovation, August, Costa Rica.