Relational Data Mining and Web Mining

1 Relational Data Mining and Web Mining Prof. Dr. Daning Hu Department of Informatics University of Zurich Nov 20th, 2012

2 Outline Introduction: Big Data; Relational Data Mining; Web Mining. Reference book: Web Intelligence, Zhong et al.

3 Introduction: Big Data 90% of the data in the world today has been created in the last two years alone (IBM). Big data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, etc. In response, everyone from marketers to policymakers has begun embracing a loosely defined term for today's massive data sets and the challenges they present: Big Data. Issues: a lack of efficient and effective methods; "Big Brother".

4 A Brief History of Big Data Herman Hollerith: census data (electric hole punching). FDR's Social Security Act: 26 million working Americans and 3 million employers; IBM, field investigators. WWII and Cold War: Colossus Project (deciphering Nazi codes); 742M U.S. tax returns and 175M fingerprints -> Privacy Act. 1990s, 2000s, 2012: Internet age and 9/11; NSA: 1.7 billion e-mails and phone calls daily; retailers amassing information on shopping habits (Wal-Mart: 460 TB cache in 2004); social networks proliferate; U.S. Open Government Initiative: data.gov?

5 Introduction: Data Mining Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD) is the process of discovering patterns in large data sets. It is a field at the intersection of computer science and statistics, drawing on AI, machine learning, statistics, and database systems. The goal is to extract information from a large data set and transform it into an understandable structure for further use: Data -> Information -> Knowledge. Beyond the analysis itself, it involves data preprocessing and management, model and inference considerations, complexity considerations, post-processing of discovered structures, visualization, and online (real-time) updating.

6 Data Mining and Web Mining The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages: (1) Collection and Selection, (2) Pre-processing, (3) Transformation, (4) Data Mining (Analysis), (5) Interpretation/Evaluation. [Diagram label: Web Mining]

7 Data Mining Tasks Major data mining tasks: Association rule learning (dependency modeling): searches for relationships between variables, e.g., a supermarket might gather data on customer purchasing habits. Clustering: discovering groups and structures in the data that are in some way "similar", without using known structures. Classification: generalizing known structure to apply to new data, e.g., software classifying an e-mail as "spam" (requires a training dataset). Regression: attempts to find a function that models the data with the least error. Summarization: providing a more compact representation of the data set, including visualization and report generation.

9 Relational Data Mining and Web Mining Relational data mining differs from regular DM in several ways. Network-based representation: it often involves large-scale relational data that can be modeled with network measures/metrics. Network-based models and algorithms (e.g., HITS). The tasks are often similar (classification, regression, etc.), but the application goal often requires analytical insights about the relations among entities in the data set. Web mining: collecting large-scale Web-based data or data from the Internet; DM analysis on the Web itself (e.g., Google's PageRank).
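
As an illustration of a network-based algorithm of the kind named on this slide, the following is a minimal sketch of PageRank computed by power iteration. It is not taken from the lecture: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the choice to spread the rank of pages without out-links uniformly over all pages are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to
    (a hypothetical toy representation of a web graph).
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its rank uniformly.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Toy graph: pages "a" and "b" both link to "c"; "c" links back to "a".
toy_ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In the toy graph, "c" collects links from two pages and ends up with the highest score, while "b", which nothing links to, ends up with the lowest.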

10 Applications of MI for U.S. Border Safety Border-crossing records can be considered as a stream of text (license plates) ordered by the time of crossing. Mutual information (MI) can be used to identify frequent co-occurrence between a pair of vehicle crossings. If one vehicle in the pair has a criminal record, some inferences may be made about the second vehicle if they cross together frequently. We use conditional probability to include domain heuristics in the MI formulation. The heuristics are derived from information recorded in multiple law-enforcement databases.

11 Case Study: Association Rule Mining in COPLINK The COPLINK dataset contains data from multiple law enforcement agencies, drawn from incident reports numbering in the millions. It includes individuals' personal and sociological information (age, ethnicity, etc.) and time information (when two individuals co-offend). Sources: TPD, PCSD, CBP (six ports between AZ and Mexico). An integrated criminal dataset: 1.44 million criminals, 662,000 vehicles. [Table 1. Summary of the COPLINK vehicle dataset. Columns: TPD, PCSD, CBP. Rows: Number of People, Time Span. Surviving entries: "662,"; "M record (2.6 M vehicles)".]

12 Association Rule Mining Inferring associations between items in the database was motivated by decision support problems faced by retail organizations (Stonebraker 1993). An association rule (AR) is a relationship of the form A => B, where A is the antecedent item-set and B is the consequent item-set. The antecedent and consequent item-sets can contain multiple items. A => B holds in a transaction set D with confidence c if c% of transactions in D that contain A also contain B, and with support s if s% of transactions in D contain both A and B. Association mining identifies all the rules that have support and confidence greater than user-specified thresholds.
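
The support and confidence definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names and the toy shopping baskets are hypothetical, and the values are returned as fractions between 0 and 1 rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        raise ValueError("transactions must not be empty")
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    if n_a == 0:
        return 0.0  # the rule never applies, so report zero confidence
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a


# Hypothetical toy data: four shopping baskets.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of 3 bread baskets
```

A rule such as {bread} => {milk} would be reported only if both values exceed the user-specified thresholds.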

13 Mutual Information Mutual information is an information-theoretic measure that can be used to identify interesting co-occurrences of objects. It can be considered a special case of AR mining with 1-item antecedent and consequent item-sets. The earliest definition of MI was given by Shannon et al. (1949) and Fano (1961) as the amount of information provided by the occurrence of an event (y) about the occurrence of another event (x):
$$I(x; y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
Intuitively, this measures whether the co-occurrence of x and y, P(x, y), is more likely than would be expected from their separate occurrences, P(x) P(y).
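
A minimal sketch of the MI formula on this slide, estimated from raw counts. The function name, its parameters, and the example counts are hypothetical; probabilities are estimated as simple relative frequencies, which is an assumption of this sketch rather than something the slide specifies.

```python
import math


def mutual_information(n_xy, n_x, n_y, n_total):
    """I(x; y) = log2( P(x, y) / (P(x) * P(y)) ), with each probability
    estimated as a count divided by the total number of observations."""
    if min(n_xy, n_x, n_y, n_total) <= 0:
        raise ValueError("all counts must be positive")
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: out of 1000 observation windows, x occurs in 10,
# y in 20, and both together in 8.
mi_together = mutual_information(n_xy=8, n_x=10, n_y=20, n_total=1000)

# If x and y were independent, the joint count would match the product of
# the marginals and the score would be (numerically) zero.
mi_independent = mutual_information(n_xy=20, n_x=100, n_y=200, n_total=1000)
```

In the first case the pair co-occurs 40 times more often than independence would predict, so the score is log2(40), a little above 5.3 bits.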

14 Research Design (cont.) [Flow diagram. Data sources: border wait times (web spider, Internet Archive); law enforcement data (TPD, PCSD); border crossing data (six ports). Steps: splitting into training data (2/3) and testing data (1/3); heuristic calculation; MIW/MIC scores between Set A (criminal vehicles with crossings) and Set B (potential target vehicles); evaluation by overlap with a law enforcement data subset (narcotics vehicles; TPD, PCSD).] The research design and process are explained in the following slides.

15 Estimating Border Wait Times [Figure: an aerial photograph of a typical U.S. port of entry (southern border), with the port of entry (check points) and turn-out points marked. Imagery: 2006 Google, 2006 DigitalGlobe; map data: 2006 NAVTEQ.] Vehicle lanes are backed up with dozens of vehicles during peak times. Criminal vehicles operate in groups: if one is caught, the others turn back into Mexico. They may join the lines one at a time or use turn-out points. Thus, the time interval between two related vehicles is likely to be less than or equal to the waiting time if the second vehicle doesn't join the line until the first vehicle goes through. This needs to be taken into consideration in the calculation of MI.
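
The waiting-time heuristic can be sketched as a simple co-crossing test: two crossings count as "together" when the gap between them does not exceed the estimated wait time. This is an illustrative sketch only, not the MIW formulation from the lecture; the function names, the timestamps, and the 30-minute wait time are hypothetical.

```python
from datetime import datetime, timedelta


def crossed_together(t_first, t_second, wait_minutes):
    """True if the two crossing times are no further apart than the wait time."""
    return abs(t_second - t_first) <= timedelta(minutes=wait_minutes)


def count_co_crossings(crossings_a, crossings_b, wait_minutes):
    """Number of crossings of vehicle A that have at least one crossing of
    vehicle B within the wait-time window."""
    return sum(
        1
        for ta in crossings_a
        if any(crossed_together(ta, tb, wait_minutes) for tb in crossings_b)
    )


# Hypothetical crossing times for two vehicles.
vehicle_a = [datetime(2006, 1, 15, 21, 0), datetime(2006, 1, 25, 22, 30)]
vehicle_b = [datetime(2006, 1, 15, 21, 20), datetime(2006, 1, 25, 23, 45)]

# With an assumed 30-minute wait, only the first pair (20 minutes apart) counts.
co_crossings = count_co_crossings(vehicle_a, vehicle_b, wait_minutes=30)
```

In practice the wait time would vary by port and hour rather than being a single constant.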

16 Estimating Border Wait Times CBP publishes hourly border wait times (BWT) on its website. The information is posted only for the current day, and no publicly available archive is maintained. A web spider was used to systematically download the web page every hour over several days in April 2006. However, the average waiting times thus obtained cannot be generalized to the entire year. The Internet Archive (IA) contained snapshots of the BWT web page from April 10, 2004 to March 31 of a later year, which were used to obtain waiting time statistics for various days over many months in 2004 and 2005. The statistics from the spidering process and the IA were then used to calculate average waiting times for each port on an hourly basis, which were used in MIW.
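
The final aggregation step, averaging observed wait times per port and hour of day, might look like the sketch below. The record layout (port, hour of day, wait in minutes), the port names, and the numbers are hypothetical; the lecture does not describe the actual data format.

```python
from collections import defaultdict


def hourly_average_wait(observations):
    """Average wait time in minutes for each (port, hour_of_day) pair.

    observations is an iterable of (port, hour_of_day, wait_minutes) tuples,
    e.g. pooled from spidered pages and archived snapshots.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (port, hour) -> [sum, count]
    for port, hour, wait in observations:
        bucket = totals[(port, hour)]
        bucket[0] += wait
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical observations from two ports.
samples = [
    ("port_A", 8, 30),
    ("port_A", 8, 50),
    ("port_A", 9, 20),
    ("port_B", 8, 10),
]
average_wait = hourly_average_wait(samples)
```

Here `average_wait[("port_A", 8)]` is the mean of the two 8 a.m. observations for that port.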

17 Temporal Patterns of Border Crossings Figure (a) shows the percentage of all crossings over six time periods of the day: Midnight-5am 12%, 5am-10am 10%, 10am-2pm 20%, 2pm-4pm 13%, 4pm-8pm 22%, 8pm-Midnight 23%. Figure (b) shows the percentage of crossings by vehicles with police contacts over the same six time periods: Midnight-5am 15%, 5am-10am 10%, 10am-2pm 14%, 2pm-4pm 10%, 4pm-8pm 24%, 8pm-Midnight 27%. The figures suggest that a large share (roughly 50%) of crossings with police contacts happen after dark. MIW uses this information to assign more weight to time periods with more criminal crossings.

18 Comparative Evaluation (cont.) For hypothesis testing, thirty data points (ranging from the top 5 to the top 3500 pairs) were taken for each of the measures, and a t-test was done for the differences in the mean number of potentially criminal vehicles identified. [Table: MIW - MIC differences for the TPD dataset, PCSD dataset *, and Tucson met. dataset *.] MIW performed significantly better (at the 99% level) than MIC in all but one dataset in identifying potentially criminal vehicles. The hypothesis that MIW performs better was partially supported.
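
A t-test on differences between two measures can be sketched as follows. The slide does not say which variant of the t-test was used; this sketch assumes a paired test, where each data point (e.g., a top-k cutoff) yields one count per measure. The function name and the sample counts are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean of the pairwise differences a - b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if spread == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Hypothetical counts of vehicles identified by two measures at four cutoffs.
measure_1 = [5, 7, 9, 12]
measure_2 = [4, 5, 8, 9]
t_value = paired_t_statistic(measure_1, measure_2)
```

The statistic is then compared against a critical value for n - 1 degrees of freedom; with thirty data points (29 degrees of freedom), the two-sided critical value at the 99% level is about 2.76.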

19 Case 1: Vehicle Pair Identified by MIW [Figure: crossing patterns of a pair of vehicles (Vehicle C and Vehicle D) with a high MIW score. Axes: time of day, crossing date (Jan 15, Jan 25, Jan 26, Jan 29, Feb 6, Feb 7, Feb 14). Annotation: after dark / no fixed work schedule.] Vehicle C, from Arizona, and its occupant were arrested in Tucson for the sale of narcotics. Vehicle C crossed 7 times in a one-month period and crossed within a few minutes of Vehicle D. The crossings may be considered suspicious since they are almost always after dark and do not fit a standard work schedule.

20 Criminal Activity of Vehicle C & D [Figure: diagram with panels "Tucson met. area Narcotics Network", "Customs and Border Protection", and "Tucson met. area Criminal Network". Labels: MIW, Vehicle A, Vehicle B, Vehicle C, Vehicle D, "Frequent Crossers at Night". Dates: Jan 15, Jan 25, Jan 26, Jan 29, Feb 6, Feb 7.] Vehicle C was found to have strong connections to a narcotics network in the Tucson metropolitan area. It had links to other people and vehicles that had been arrested for or suspected of narcotics sales and possession in the region. Vehicle D was also involved in criminal activity in the Tucson region. MIW identified many other such strong cases.

21 A Suspect Vehicle Triple Identified [Figure: crossing patterns of a vehicle triple (Vehicle E, Vehicle F, Vehicle G) identified by the transitive use of MIW with support constraints. Axes: time of day, dates (2005): Sep 6, Sep 11, Sep 17, Sep 18, Sep 25, Oct 4, Oct 5. Annotation: after dark.] Vehicle F crossed 7 times in a one-month period, 5 of them within a few minutes of Vehicle E. Vehicle E was also found to have been involved in multiple narcotics crimes in the Tucson region in recent times. MIW scores were calculated between Vehicle F and other crossing vehicles, and a promising transitive association with Vehicle G was found: Vehicle G had crossed 3 times within minutes of Vehicle F over a 12-day period.

22 Crime Involvement of Vehicles E and G [Figure: diagram with panels "Tucson met. area Narcotics Crimes", "Customs and Border Protection", and "Tucson met. area Crimes". Labels: MIW, Vehicle A, Vehicle B, Vehicle C, Vehicle E, Vehicle F, Vehicle G. Dates: Sep 5, Sep 6, Sep 11, Sep 17, Sep 18, Sep 25, Oct 2, Oct 4, Oct 5; Nov 12, Nov 26, Jan 14, Feb 4.] Vehicle E was involved in narcotics crimes, and Vehicle G was found to be involved in suspicious activity and forgery. Since the procedure used MIW, the vehicles may have been waiting in line simultaneously at the same port of entry. This example shows that the transitive use of MIW is promising for identifying potentially criminal vehicles.