Seminar Data & Web Mining. Christian Groß Timo Philipp


Agenda Application types Introduction to Applications Evaluation 19.07.2007 Applications 2

Overview
- Crime enforcement
- Crime Link Explorer
- Money laundering
- Peer Group Analysis
- Open Source Software Development
- Adolescent cigarette smoking

Crime Link Explorer (1)
- Software developed by the University of Arizona
- Goal: enable crime investigators to conduct effective and efficient link analysis automatically
- Link analysis: identification, analysis and visualization of associations between entities (persons, locations, criminal incidents)

Crime Link Explorer (2)
Three techniques included:
- Concept space approach
- Shortest path algorithm
- Heuristic approach

Data source for Link Analysis
Data sources:
- Crime incident reports (German: "Anzeige")
- Uniform Crime Report (UCR), established 1930
- Surveillance logs
- Telephone records
- Financial transactions
Link: two entities are linked if they appear in the same document, log or telephone record

Problems
- Link analysis costs much time and human effort
- Information overload: information is buried in a large volume of raw data
- High branching factors (the branching factor is the number of direct links an entity has)
- Determining the importance of links relies heavily on domain knowledge

System design
- GUI for visualizing the association paths found
- Dijkstra's shortest path algorithm is used to find strong associations between entities
- Associations are identified and extracted from the dataset using the concept space approach
- Heuristics capturing domain knowledge are used to identify criminal associations

Co-occurrence weights

Concept Space
- Network consisting of domain-specific concepts (nodes) and weighted co-occurrence relationships (links)
- Example: COPLINK
  - Concepts (nodes): persons, organizations, locations, vehicles
  - Link: two concepts are linked if they appear in the same criminal incident

Co-occurrence weights (1)
[Figure: incident reports A and B with the persons (A, B, C) and locations (A, B) they mention]

Co-occurrence weights (2)
- The co-occurrence weight is computed from the frequency with which two persons appear together in the same incident report
- [Figure: Person A linked to Person B by a weight]
- Con: the computed weights are only of minor assistance in uncovering investigative leads
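The slide does not give the exact weighting formula, so the following is only a minimal sketch that uses the raw number of shared incident reports as the weight. The function name `cooccurrence_counts` and the sample reports are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(reports):
    """Count how often each pair of persons appears in the same report.

    `reports` is a list of person lists, one list per incident report.
    Assumption: the raw count serves as the weight; any normalisation
    used by the original system is not stated on the slides.
    """
    counts = Counter()
    for persons in reports:
        # sorted(set(...)) removes duplicates and fixes the pair order
        for a, b in combinations(sorted(set(persons)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical incident reports
reports = [
    ["Person A", "Person B"],
    ["Person A", "Person B", "Person C"],
]
counts = cooccurrence_counts(reports)
```

Here Person A and Person B share two reports, so their pair receives the highest count.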

Heuristic approach

Heuristic approach
Three criteria:
- Relationship between crime type and person roles
- Shared addresses or shared telephone numbers
- Repeated co-occurrence in incident reports
A scale of 1-100 % indicates the strength of associations

Heuristic approach: Crime type / person role (1)
- A matrix is constructed for each crime type: homicide, robbery, auto theft, sexual assault
- Each matrix contains a strength estimate for each role combination, e.g. victim <-> witness, witness <-> suspect, suspect <-> victim

Heuristic approach: Crime type / person role (2)
Table for crime type homicide (cells elided on the slide are shown as "..."):

Homicide   Victim   Witness   Suspect   Arrestee   Other
Victim     ...      ...       98        ...        ...
Witness    ...      ...       ...       ...        ...
Suspect    98       ...       ...       ...        ...
Arrestee   ...      ...       ...       ...        ...
Other      ...      ...       ...       ...        ...

- Each entry estimates, out of every 100 incidents of this crime type, how often an association occurs for the role combination
- The heuristic score could be improved by including statistical analysis

Heuristic approach: Shared address / phone
- Important indicator for associations
- But: phone numbers are often erroneous, so they contribute only 5 % to the final weight
- Addresses are more accurate than phone numbers and contribute 10 % to the final weight

Heuristic approach: Co-occurrence
- Same idea as the concept space approach
- But: the co-occurrence weights are estimated from an empirically derived probability distribution

Co-occurrence count   Association probability (%)
1                     1
2                     45
3                     98
4                     100

Heuristic approach: Final heuristic weight
- P1 = crime type / person role score
- P2 = shared phone score
- P3 = shared address score
- P4 = association probability based on co-occurrence counts

$$w_{final} = \max\left(0.85\,P_1 + 0.05\,P_2 + 0.10\,P_3,\; P_4\right)$$
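As a rough sketch, the final weight can be computed as follows, reading the slide's formula as the maximum of the weighted sum 0.85 P1 + 0.05 P2 + 0.10 P3 and the co-occurrence probability P4. The function names are hypothetical, the lookup table repeats the co-occurrence probabilities listed on the previous slide, and treating counts above 4 like a count of 4 is an assumption.

```python
# Association probability (%) per co-occurrence count, as listed on the slides
ASSOCIATION_PROBABILITY = {1: 1, 2: 45, 3: 98, 4: 100}


def cooccurrence_probability(count):
    """Return P4 in percent for a given co-occurrence count."""
    if count <= 0:
        return 0
    # Assumption: counts above 4 are treated like a count of 4
    return ASSOCIATION_PROBABILITY[min(count, 4)]


def final_weight(p1, p2, p3, cooccurrence_count):
    """Combine the heuristic scores (all on a 0-100 % scale).

    p1: crime type / person role score
    p2: shared phone score (5 % of the weighted sum)
    p3: shared address score (10 % of the weighted sum)
    """
    p4 = cooccurrence_probability(cooccurrence_count)
    return max(0.85 * p1 + 0.05 * p2 + 0.10 * p3, p4)


# Victim/suspect pair in a homicide (score 98), shared address, one report
example = final_weight(98, 0, 100, 1)
```

Taking the maximum means that persons who repeatedly appear together are rated as strongly associated even when their role scores are low.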

Association Path
[Figure: graph of Persons A to F connected by links with weights w1 to w5]

Association path search
- A logarithmic transformation is applied to the weights w_i
- A modified Dijkstra algorithm is used to find the strongest association path between two or more persons
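The slides do not spell out the modification, so the following is only a sketch under one common assumption: link weights lie in (0, 1] and the strength of a path is the product of its link weights. Taking -log(w) as the edge cost then turns the strongest path into an ordinary shortest path. The function name `strongest_path` and the sample graph are hypothetical.

```python
import heapq
import math


def strongest_path(graph, source, target):
    """Dijkstra on -log(weight) costs.

    `graph` maps node -> {neighbor: weight}, weights in (0, 1].
    Returns (path, strength) or (None, 0.0) if target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, w in graph.get(node, {}).items():
            nd = d - math.log(w)  # logarithmic transformation of the weight
            if nd < dist.get(nbr, math.inf):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-dist[target])


# Hypothetical association weights between three persons
graph = {
    "A": {"B": 0.9, "C": 0.5},
    "B": {"A": 0.9, "C": 0.9},
    "C": {"A": 0.5, "B": 0.9},
}
path, strength = strongest_path(graph, "A", "C")
```

In this example the indirect path A, B, C (strength about 0.81) beats the direct link A, C (strength 0.5).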

System Evaluation
- Data set: 239,780 incident reports; 229,938 persons involved (age, gender, race, address, phone number)
- 10 crime analysts
- Heuristic approach more accurate than concept space approach
- Heuristic approach uses domain knowledge
- Reduced time and effort needed for link analysis

Overview
- Crime enforcement
- Link Exploration
- Money laundering
- Peer Group Analysis
- Open Source Software Development
- Adolescent cigarette smoking

FAIS
- The Financial Crimes Enforcement Network AI System, maintained by FinCEN (U.S.)
- Aim: detection of money laundering

Information collection
- Money transactions over 10,000 USD into or out of the U.S. require a Currency Transaction Report (CTR) to be filled in
- The reported information is entered into a DB (U.S. Customs Services Data Centre)

FAIS: Load and prepare data
Data flow: DB (U.S. Customs Services Data Centre) -> transaction load program -> consolidated data -> suspicious rating program -> data with rating -> FAIS DB (Sybase)
- Analysis rules (336)
- Subjects, accounts (linked with transactions)
- Link analysis
- NEXPERT: GUI for investigating how a result was obtained; allows what-if statements
- Heuristic knowledge for text fields
- Rules result in individual positive/negative evidence
- Bayesian transform to a single rating

FAIS Data Analysis
- Data-driven mode
- User-directed mode
- Apply filters
- Create SQL query

FAIS Data Analysis (cont.)

FAIS Use and Payoff
- Introduced in 1993
- Beginning of 1995: 20 million transactions; 3,000 subjects detected; 2.5 million accounts
- Beginning of 1997 (see Strategy Plan of FinCEN 1997-2002): 39 million (Bank Secrecy Act reports, including CTRs); 3,500 new subjects revealed; 5,000 bank accounts of Colombian/Mexican money launderers detected
- Feedback received: 50% known hits; 50% hits with similar behaviour; 90% of leads are correct

Consequences
- Revised form


OSS development phenomenon
- OSS := Open Source Software
- Hypothesis: open source software development can be modeled as a self-organizing, collaborative network
- Collaborative network: a variation of a social network; there is an edge between two nodes if they are part of a collaboration; linchpins connect disparate groups into larger clusters
- Motivation: a better understanding of how the OSS community works; IT planners can better assess the risk of OSS usage

OSS development (1)
Recent studies showed:
- OSS development produces better, more bug-free software
- Most developers work for enjoyment and the pride of being part of a successful OSS project, not for monetary return
- Developers collaborate from around the world and rarely meet face-to-face
- Developers are self-organized

OSS development (2)
- The OSS movement is an example of a decentralized, self-organizing process with no central control or planning
- It threatens the traditional proprietary software business
- Open questions: intellectual property rights; role of the government concerning OSS; software licensing

Power Law Networks
- Collaborative networks often show a power law distribution
- Examples of power law distributions: city size distribution; word ranking in languages and writing; Internet
- Example: [figure]

Data Collection and Analysis
- A web crawler collected data from SourceForge (mailing lists, project sites, forums) from January 2001 to March 2002
- Collected fields: project number, developer id
- SourceForge: 39,000 projects (2002), 152,000 projects (2007); 33,000 developers (2002); 1,600,000 registered users (2007)

Modeling approach
- The OSS community is modeled as a collaborative social network
- Hypothesis: the OSS movement displays power law relationships in its structure, namely in cluster size and in the degree of nodes

Graph modeling
- Developer network: node = developer; edge = two developers work on the same project
- Project network: node = project; edge = the same developer works on both projects
- [Figure: example with developers Dev[53], Dev[14], Dev[75] and projects emule, GIMP, Azureus]
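Both networks can be derived from a list of (project, developer) pairs. The following is a minimal sketch of that projection; the function name `project_networks` is hypothetical, and the memberships below are invented for illustration because the slide's figure does not survive in the transcript.

```python
from collections import defaultdict
from itertools import combinations


def project_networks(memberships):
    """Build the developer network and the project network.

    `memberships` is an iterable of (project, developer) pairs.
    Returns two sets of undirected edges (frozensets of two nodes).
    """
    devs_of = defaultdict(set)   # project -> developers
    projs_of = defaultdict(set)  # developer -> projects
    for proj, dev in memberships:
        devs_of[proj].add(dev)
        projs_of[dev].add(proj)
    # Developers are linked if they work on the same project
    dev_edges = {
        frozenset(pair)
        for devs in devs_of.values()
        for pair in combinations(sorted(devs), 2)
    }
    # Projects are linked if the same developer works on both
    proj_edges = {
        frozenset(pair)
        for projs in projs_of.values()
        for pair in combinations(sorted(projs), 2)
    }
    return dev_edges, proj_edges


# Hypothetical memberships (not taken from the SourceForge data)
memberships = [
    ("emule", "Dev[53]"),
    ("GIMP", "Dev[53]"),
    ("GIMP", "Dev[14]"),
    ("Azureus", "Dev[75]"),
]
dev_edges, proj_edges = project_networks(memberships)
```

In this made-up example Dev[53] acts as the connecting developer: it links emule and GIMP in the project network and is tied to Dev[14] in the developer network.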

Results
- Both figures show that the two modeled networks satisfy the power law property
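The slides do not say how the power law property was checked. One rough diagnostic, sketched here as an assumption rather than the authors' method, is to fit a straight line to the degree distribution on a log-log scale; the negated slope estimates the exponent. The function name `fit_power_law_exponent` and the sample degrees are hypothetical.

```python
import math
from collections import Counter


def fit_power_law_exponent(degrees):
    """Least-squares fit of log(count) against log(degree).

    Returns the estimated exponent alpha in count(k) ~ k ** (-alpha).
    This is only a rough check and needs at least two distinct degrees.
    """
    hist = Counter(d for d in degrees if d > 0)
    xs = [math.log(k) for k in hist]
    ys = [math.log(c) for c in hist.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic degrees whose counts follow k ** -2 exactly
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
alpha = fit_power_law_exponent(degrees)
```

For the synthetic data the estimate is close to 2; for real data a straight line on the log-log plot is suggestive, but not a proof, of a power law.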

Clustering Analysis (1)
[Figure: clustered network with linchpins marked]

Clustering Analysis (2)
[Figure: number of clusters plotted against cluster size]

Conclusions (1)
- The OSS developer network fits the power law relationship
- The OSS developer network is not a random network
- The graph displays preferential attachment of new nodes: after the initial success of an OSS project, more developers join the project
- Important role of linchpins: they are attractors for other developers and facilitate the diffusion of ideas and technology between clusters

Conclusions (2)
- A long-term study is needed because of the high fluctuation rate of nodes
- Further research should be done on the OSS network:
  - Additional graph-theoretic properties could be computed (clustering coefficients, network diameter, etc.)
  - Deeper understanding of how nodes join and leave
  - Role of SourceForge?


Adolescent Cigarette Smoking
- Social network theory and analysis are applied to examine whether adolescents differ in the prevalence of current smoking depending on their peer group social position
- Research project on 1,092 ninth graders at 5 schools: each chose 3 best friends (best friend first)
- Aim: classify each adolescent as clique member, clique liaison or isolate
- Additional information provided

Building Link Graph
Clique members:
- Belong to a group of at least 3 persons
- 50+ % of their links lie within their group
- Connected by some path lying entirely within the clique
Clique liaisons:
- 2+ links with clique members or other liaisons
- Not in a clique
Isolates:
- Few or no links to others
Weight of arcs: 1 if the friendship is not reciprocated, otherwise 2
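The arc weighting rule can be sketched as follows: a friendship gets weight 2 if both adolescents name each other and weight 1 otherwise. The function name `arc_weights` and the sample nominations are hypothetical; the sketch covers only the weighting, not the clique detection itself.

```python
def arc_weights(nominations):
    """Weight each friendship link from the best-friend nominations.

    `nominations` maps a student to the list of friends they named.
    Returns {frozenset({a, b}): weight} with weight 2 for reciprocated
    friendships and 1 for one-sided ones.
    """
    weights = {}
    for student, friends in nominations.items():
        for friend in friends:
            if friend == student:
                continue  # ignore self-nominations
            reciprocated = student in nominations.get(friend, [])
            weights[frozenset((student, friend))] = 2 if reciprocated else 1
    return weights


# Hypothetical nominations
nominations = {
    "Ann": ["Ben", "Cid"],
    "Ben": ["Ann"],
    "Cid": ["Dee"],
}
weights = arc_weights(nominations)
```

Ann and Ben name each other, so their link gets weight 2, while the one-sided nominations Ann to Cid and Cid to Dee get weight 1.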

Test data

Cigarette Smoking
- Defined by self-report (current smoker and 1+ packs of cigarettes) and by carbon monoxide content in alveolar breath samples

Result
- Smokers tend more often to be white than black (significant at 2 schools)
- Smokers tend to come from families in which the mother has a lower level of education

Additional Analysis
- Significant interaction at 4 schools
- School E: significant interactions between social position and the variables gender and mother's education?!?
- Including nonsurveyed subjects leads to 5 schools with a significant relationship between social position and current smoking (not shown)
- Underestimation of the relationship

Additional Analysis (cont.)
- The possibility remains that isolates are integrated into peer groups outside the school social network.

Friend Smoking Behaviour
- Isolates have more smoking friends than clique members/liaisons (1.5 to 4 times as many)
- Isolates have fewer friends than other subjects
- Attribute "friend smoking" added to the graph (average over the 3 friends, smoking/non-smoking) -> not significant
- Friend smoking is strongly related to subject smoking
- Friend smoking is not a proxy for peer group social position

Isolates tend to be smokers
Possible explanations:
1. Social isolation causes smoking
2. Smoking causes social isolation
3. No relationship between smoking and isolation (both caused by the same factors)
4. Isolates are members of cliques outside the school environment
Regardless of the explanation, smoking is not a peer group phenomenon!

Similarities / Differences between Applications

Evaluation
- Link analysis offers great potential for crime investigation: it reduces time and human effort
- Domain knowledge could improve link analysis: link analysis based on domain knowledge gives more accurate results
- Peer group analysis is a helpful tool for social network analysis
