Seminar Data & Web Mining Christian Groß Timo Philipp
Agenda Application types Introduction to Applications Evaluation 19.07.2007 Applications 2
Overview
  Crime enforcement
    Crime Link Explorer
    Money laundering
  Peer Group Analysis
    Open Source Software Development
    Adolescent cigarette smoking
Crime Link Explorer (1)
  Software developed by the University of Arizona
  Goal: enable crime investigators to conduct effective and efficient link analysis automatically
  Link analysis: identification, analysis and visualization of associations between entities:
    persons
    locations
    criminal incidents
Crime Link Explorer (2)
  Three techniques included:
    Concept space approach
    Shortest path algorithm
    Heuristic approach
Data source for Link Analysis
  Data sources:
    crime incident reports (criminal complaints)
      Uniform Crime Report (UCR), established 1930
    surveillance logs
    telephone records
    financial transactions
  Link: two entities are linked if they appear in the same document, log or telephone record
Problems
  Costs much time and human effort
  Information overload
    Information buried in large volumes of raw data
  High branching factors
    Branching factor: the number of direct links an entity has
  Determining the importance of links
    Relies heavily on domain knowledge
System design
  GUI for visualizing the association paths found
  Dijkstra's shortest path algorithm is used for finding strong associations between entities
  Associations are identified and extracted from the dataset using the concept space approach
  Heuristics capturing domain knowledge are used for identifying criminal associations
Co-occurrence weights
Concept Space
  Network consisting of
    domain-specific concepts (nodes)
    weighted co-occurrence relationships (links)
  Example: COPLINK
    Concepts (nodes): persons, organizations, locations, vehicles
    Link: two concepts are linked if they appear in the same criminal incident
Co-occurrence weights (1)
  [Figure: incident reports A and B and the entities they mention (Persons A, B, C; Locations A, B)]
Co-occurrence weights (2)
  The co-occurrence weight is computed from the frequency with which two persons appear together in the same incident report
  [Figure: Person A and Person B connected by a weighted link]
  Con: the computed weights are only a minor assistance in terms of uncovering investigative leads
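The slides do not give the exact weighting formula, so the following is only a minimal sketch of a frequency-based co-occurrence weight. The function name `co_occurrence_weights`, the sample reports and the Jaccard-style normalization (reports naming both persons divided by reports naming either) are assumptions for illustration, not the COPLINK definition.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_weights(reports):
    """reports: dict mapping a report id to the persons named in that report.

    Returns (pair_counts, weights). The normalization used for `weights`
    is an assumed Jaccard-style ratio, not the formula from the slides.
    """
    pair_counts = Counter()    # how often two persons share a report
    person_counts = Counter()  # in how many reports each person appears
    for persons in reports.values():
        unique = sorted(set(persons))
        person_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    weights = {}
    for (a, b), together in pair_counts.items():
        either = person_counts[a] + person_counts[b] - together
        weights[(a, b)] = together / either
    return pair_counts, weights


# Hypothetical incident reports.
reports = {
    "report A": ["Person A", "Person B"],
    "report B": ["Person A", "Person B", "Person C"],
}
pair_counts, weights = co_occurrence_weights(reports)
print(pair_counts[("Person A", "Person B")], weights[("Person A", "Person C")])
```

Sorting the persons within a report gives every pair one canonical key, so (A, B) and (B, A) are counted together.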
Heuristic approach
Heuristic approach
  Three criteria:
    Relationship between crime type and person roles
    Shared addresses or telephone numbers
    Repeated co-occurrence in incident reports
  A scale from 1 to 100 % indicates the strength of associations
Heuristic approach: crime type / person role (1)
  Construction of a matrix for each crime type:
    Homicide
    Robbery
    Auto theft
    Sexual assault
  Each matrix contains a strength estimate for each role combination, e.g.
    victim <-> witness
    witness <-> suspect
    suspect <-> victim
Heuristic approach: crime type / person role (2)
  Table for crime type homicide:
    Homicide   Victim  Witness  Suspect  Arrestee  Other
    Victim     ...     ...      98       ...       ...
    Witness    ...     ...      ...      ...       ...
    Suspect    98      ...      ...      ...       ...
    Arrestee   ...     ...      ...      ...       ...
    Other      ...     ...      ...      ...       ...
  Each entry estimates, out of every 100 incidents, how often an association occurs for that role combination and crime type
  The heuristic score could be improved by including statistical analysis
Heuristic approach: shared address / phone
  Important indicator for associations
  But: phone numbers are often erroneous
    contribute only 5 % to the final weight
  Addresses are more accurate than phone numbers
    contribute 10 % to the final weight
Heuristic approach: co-occurrence
  Same idea as the concept space approach
  But: co-occurrence weights are estimated from an empirically derived probability distribution
    Co-occurrence count   Association probability (%)
    1                     1
    2                     45
    3                     98
    4                     100
Heuristic approach: final heuristic weight
  P1 = crime type / person role score
  P2 = shared phone score
  P3 = shared address score
  P4 = association probability based on co-occurrence counts
  $w_{final} = \max(0.85\,P_1 + 0.05\,P_2 + 0.10\,P_3,\; P_4)$
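Taking the final weight as the maximum of the weighted sum 0.85 P1 + 0.05 P2 + 0.10 P3 and the co-occurrence probability P4 (the 5 % and 10 % shares match the phone and address slide), the combination can be sketched as follows. All scores are assumed to be percentages between 0 and 100. The function names, the value 0 for a count of zero and the saturation at 100 % for counts above 4 are assumptions, since the table only lists counts 1 to 4.

```python
# Co-occurrence count -> association probability (%), as listed on the slide.
CO_OCCURRENCE_PROBABILITY = {1: 1, 2: 45, 3: 98, 4: 100}


def association_probability(count):
    """P4: probability (%) of an association given the co-occurrence count."""
    if count <= 0:
        return 0  # assumption: no shared report, no evidence
    # Assumption: counts above 4 stay at the last listed value.
    return CO_OCCURRENCE_PROBABILITY[min(count, 4)]


def final_weight(p1, p2, p3, co_occurrence_count):
    """Combine the heuristic scores (all in %) into the final weight."""
    p4 = association_probability(co_occurrence_count)
    weighted_sum = 0.85 * p1 + 0.05 * p2 + 0.10 * p3
    return max(weighted_sum, p4)


# Hypothetical pair: role score 10 %, shared phone and address,
# seen together in two incident reports.
print(final_weight(10, 100, 100, 2))
```

In this example the weighted sum is only 23.5, so the co-occurrence probability of 45 determines the result: repeated co-occurrence can outweigh a weak role-based score.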
Association Path
  [Figure: association graph over Persons A to F with edge weights w1 to w5]
Association path search
  A logarithmic transformation is applied to the weights w_i
  A modified Dijkstra algorithm finds the strongest association path between two or more persons
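The slides do not spell out the transformation. A common reading, assumed here, is that each weight is a strength in (0, 1], the strength of a path is the product of its weights, and replacing each weight w by -log(w) turns "maximize the product" into an ordinary shortest path problem. The sketch below uses plain Dijkstra on the transformed weights; the function name `strongest_path` and the sample graph are hypothetical.

```python
import heapq
import math


def strongest_path(graph, source, target):
    """graph: dict node -> dict neighbour -> weight in (0, 1].

    Returns (path, strength), where strength is the product of the
    weights along the path, or (None, 0.0) if target is unreachable.
    """
    dist = {source: 0.0}  # sum of -log(w) along the best known path
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d - math.log(w)  # -log(w) >= 0 because 0 < w <= 1
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-dist[target])


# Hypothetical undirected association graph (each edge stored both ways).
graph = {
    "A": {"B": 0.9, "C": 0.5},
    "B": {"A": 0.9, "C": 0.9},
    "C": {"A": 0.5, "B": 0.9, "D": 0.8},
    "D": {"C": 0.8},
}
print(strongest_path(graph, "A", "D"))
```

Here the three-step path A, B, C, D (0.9 x 0.9 x 0.8 = 0.648) beats the shorter path A, C, D (0.5 x 0.8 = 0.4), which is the behaviour one wants from a "strongest association" search.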
System Evaluation
  Data set:
    239,780 incident reports
    229,938 persons involved
    Attributes: age, gender, race, address, phone number
  10 crime analysts
  Heuristic approach more accurate than the concept space approach
    The heuristic approach uses domain knowledge
  Reduced time and effort needed for link analysis
FAIS
  The Financial Crimes Enforcement Network AI System
  Maintained by FinCEN (U.S.)
  Aim: detection of money laundering
Information collection
  Money transactions over 10,000 into or out of the US
    A Currency Transaction Report (CTR) is filled in
  Information is entered into the DB (U.S. Customs Services Data Centre)
FAIS: load and prepare data
  Data flow: DB (U.S. Customs Services Data Centre) -> transactions -> Load Program -> consolidated data -> Suspicious Rating Program -> data with rating -> FAIS DB (Sybase) -> Link Analysis
  FAIS DB (Sybase)
    Subjects, accounts (linked with transactions)
  Analysis rules (336)
    Heuristic knowledge for text fields
    Rules result in individual positive/negative pieces of evidence
    Bayesian transform to a single rating
  NEXPERT
    GUI for investigating how a result was obtained
    Allows what-if statements
FAIS Data Analysis
  Data-driven mode
  User-directed mode
  Apply filters
  Create SQL query
FAIS Data Analysis (cont.)
FAIS Use and Payoff
  Introduced in 1993
  As of the beginning of 1995:
    20 million transactions
    3,000 subjects detected
    2.5 million accounts
  As of the beginning of 1997 (see Strategy Plan of FinCEN 1997-2002):
    39 million (Bank Secrecy Act, including CTR)
    3,500 new subjects revealed
    5,000 bank accounts of Colombian/Mexican money launderers detected
  Feedback received:
    50 % known hits
    50 % hits with similar behaviour
    90 % of leads are correct
Consequences
  Revised form
OSS development phenomenon
  OSS := Open Source Software
  Hypothesis: Open Source Software development can be modeled as a self-organizing, collaborative network
  Collaborative network
    Variation of a social network
    Edge between two nodes if they are part of a collaboration
    Linchpins connect disparate groups into larger clusters
  Motivation:
    Better understanding of how the OSS community works
    IT planners can better assess the risk of OSS usage
OSS development (1)
  Recent studies showed:
    OSS development produces better software with fewer bugs
    Most developers work for enjoyment and the pride of being part of a successful OSS project, not for monetary return
    Developers collaborate from around the world
    Developers rarely meet face-to-face
    Developers are self-organized
OSS development (2)
  The OSS movement is an example of a decentralized, self-organizing process
    No central control or planning
  Threatens the traditional proprietary software business
  Open questions:
    Intellectual property rights
    Role of the government concerning OSS
    Software licensing
Power Law Networks
  Collaborative networks often show a power law distribution
  Examples of power law distributions:
    City size distribution
    Word ranking in languages and writing
    Internet
  [Figure: example of a power law distribution]
Data Collection and Analysis
  A web crawler collected data from SourceForge (mailing lists, project sites, forums) from Jan 2001 to March 2002
    Project number
    Developer ID
  SourceForge
    Number of projects: 39,000 (2002), 152,000 (2007)
    Number of developers: 33,000 (2002)
    Number of registered users: 1,600,000 (2007)
Modeling approach
  Modeling the OSS community as a collaborative social network
  Hypothesis: the OSS movement displays power law relationships in its structure
    Cluster size
    Degree of nodes
Graph modeling
  Developer network: node = developer, edge = two developers work on the same project
  Project network: node = project, edge = the same developer works on both projects
  [Figure: example with developers Dev[53], Dev[14], Dev[75] and projects emule, GIMP, Azureus]
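Both networks can be derived from the same list of (developer, project) pairs. The sketch below illustrates this; the function name `build_networks` is hypothetical, and the memberships are invented for illustration because the edges of the slide's figure are not recoverable.

```python
from collections import defaultdict
from itertools import combinations


def build_networks(memberships):
    """memberships: iterable of (developer, project) pairs.

    Returns (developer_edges, project_edges) as sets of sorted pairs.
    """
    developers_of = defaultdict(set)  # project -> developers
    projects_of = defaultdict(set)    # developer -> projects
    for developer, project in memberships:
        developers_of[project].add(developer)
        projects_of[developer].add(project)

    # Developer network: two developers work on the same project.
    developer_edges = set()
    for developers in developers_of.values():
        developer_edges.update(combinations(sorted(developers), 2))

    # Project network: the same developer works on both projects.
    project_edges = set()
    for projects in projects_of.values():
        project_edges.update(combinations(sorted(projects), 2))

    return developer_edges, project_edges


# Hypothetical memberships using the names from the slide.
memberships = [
    ("Dev[53]", "emule"),
    ("Dev[53]", "GIMP"),
    ("Dev[14]", "GIMP"),
    ("Dev[75]", "GIMP"),
    ("Dev[75]", "Azureus"),
]
developer_edges, project_edges = build_networks(memberships)
print(sorted(developer_edges))
print(sorted(project_edges))
```

In this toy data Dev[53] and Dev[75] each work on two projects and therefore create the only edges in the project network, which is the role the slides attribute to linchpins.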
Results
  Both figures show that the two modeled networks satisfy the power law property
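A power law means that the number of nodes with degree k falls off roughly as k to the power of a negative exponent, which shows up as a straight line in a log-log plot. As a rough illustration of how such a check could look (not the authors' procedure), the sketch below computes a degree distribution from an edge list and fits the log-log slope by ordinary least squares. The function names are hypothetical, and a least-squares fit is only a crude estimate of the exponent.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree k to the number of nodes that have degree k."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def loglog_slope(distribution):
    """Least-squares slope of log(node count) against log(degree).

    Needs at least two distinct degrees.
    """
    xs = [math.log(k) for k in distribution]
    ys = [math.log(n) for n in distribution.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic distribution following count = 1000 * k**-2 exactly.
synthetic = {1: 1000, 2: 250, 5: 40, 10: 10}
print(loglog_slope(synthetic))
```

For the synthetic distribution the fitted slope is -2 up to rounding. Real degree data is noisy in the tail, so the slope there should be read as a descriptive summary only.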
Clustering Analysis (1)
  [Figure: clusters of the network with linchpins marked]
Clustering Analysis (2)
  [Figure: axes "Number of clusters" and "Cluster size"]
Conclusions (1)
  The OSS developer network fits the power law relationship
  The OSS developer network is not a random network
  The graph displays preferential attachment of new nodes
    Initial success of an OSS project -> more developers join the project
  Important role of linchpins
    Attractors for other developers
    Facilitate the diffusion of ideas and technology between clusters
Conclusions (2)
  A long-term study is needed because of the high fluctuation rate of nodes
  Further research should be done on the OSS network
    Additional graph-theoretic properties could be computed (clustering coefficients, network diameter, etc.)
    Deeper understanding of how nodes join and leave
    Role of SourceForge?
Adolescent Cigarette Smoking
  Social network theory and analysis are applied to examine whether adolescents differ in the prevalence of current smoking depending on their peer group social position
  Research project with 1,092 ninth graders at 5 schools:
    Each chose 3 best friends (closest friend first)
    Aim: classify each adolescent as
      Clique member
      Clique liaison
      Isolate
    Additional information provided
Building Link Graph
  [Figure: link graph with liaisons and isolates marked]
  Clique members:
    Belong to a group of at least 3
    50 % or more of their links are within their group
    Connected by some path lying entirely within the clique
  Clique liaisons:
    2 or more links with clique members or other liaisons
    Not in a clique
  Isolates:
    Few or no links to others
  Weight of arcs: 1 if the friendship is not reciprocated, otherwise 2
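Two of these rules are easy to illustrate: the arc weight (2 for a reciprocated friendship, 1 otherwise) and the share of a person's links that fall inside a candidate group. The sketch below does not implement the clique detection used in the study. The function names and sample nominations are hypothetical, and counting links unweighted for the 50 % rule is an assumption.

```python
def arc_weights(nominations):
    """nominations: dict person -> list of friends that person named.

    Returns a dict mapping an unordered pair (frozenset) to its weight:
    2 if both named each other, 1 if only one did.
    """
    weights = {}
    for person, friends in nominations.items():
        for friend in friends:
            if friend == person:
                continue  # ignore self-nominations
            reciprocated = person in nominations.get(friend, [])
            weights[frozenset((person, friend))] = 2 if reciprocated else 1
    return weights


def within_group_share(person, group, weights):
    """Fraction of the person's links that lie inside `group`.

    Assumption: every link counts once, regardless of its weight.
    """
    links = [pair for pair in weights if person in pair]
    if not links:
        return 0.0
    inside = [pair for pair in links if pair <= group]
    return len(inside) / len(links)


# Hypothetical friendship nominations.
nominations = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "cat"],
    "cat": ["ann", "dan"],
    "dan": [],
    "eve": ["dan"],
}
weights = arc_weights(nominations)
group = frozenset({"ann", "bob", "cat"})
print(within_group_share("cat", group, weights))
```

In this toy data cat has two of her three links inside the group, so she passes the 50 % rule, while dan and eve are only linked by unreciprocated nominations.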
Test data
Cigarette Smoking
  Defined by self-report (current smoker and 1 or more packs of cigarettes) and by the carbon monoxide content of alveolar breath samples
Result
  Smokers
    tend to be white more often than black (significant at 2 schools)
    tend to come from families with mothers having lower education
Additional Analysis
  Significance in interaction at 4 schools
  School E: significant interactions between social position and the variables gender and mother's education (?)
  Including non-surveyed subjects leads to 5 schools with a significant relationship between social position and current smoking (not shown)
    Underestimation of the relationship
Additional Analysis (cont.)
  The possibility remains that isolates are integrated into peer groups outside the school social network
Friend Smoking Behaviour
  Isolates have more smoking friends than clique members/liaisons (1.5 to 4 times as many)
  Isolates have fewer friends than other subjects
  Add the attribute "friend smoking" to the graph (average over the 3 friends: smoking / non-smoking)
    -> not significant
    -> friend smoking is strongly related to subject smoking
  Friend smoking is not a proxy for peer group social position
Isolates tend to be smokers
  Possible explanations:
    1. Social isolation causes smoking
    2. Smoking causes social isolation
    3. No relationship between smoking and isolation (both caused by the same factors)
    4. Isolates are members of cliques outside the school environment
  Regardless of the explanation, smoking is not a peer group phenomenon!
Similarities / Differences between Applications
Evaluation
  Link analysis offers great potential for crime investigation
    Reduces time and human effort
  Domain knowledge can improve link analysis
    More accurate results with link analysis based on domain knowledge
  Peer group analysis is a helpful tool for social network analysis