Uncovering the Small Community Structure in Large Networks: A Local Spectral Approach

Yixuan Li (1), Kun He (2), David Bindel (1) and John E. Hopcroft (1)
(1) Cornell University, USA
(2) Huazhong University of Science and Technology, China
May 20th, 2015

Outline: Introduction, Preliminaries, Algorithm Overview, Seeding, Comparison and Discussion, Miscellaneous

Uncover Small Community Structure
How can we find a community of hundreds in a network of billions?
Figure: Visualization of Facebook friendship data [1].
[1] www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

Motivation: from global structure to local structure
Reduce complexity: enables finding a community in time that depends on the size of the community (~100) rather than on the size of the entire graph (~billions).
Improve accuracy: enables finding how many communities an individual belongs to, and what those communities are.

Local expansion
What is a seed set [2, 3, 4]?
What is local expansion useful for?
How can we conduct local expansion? (short-step random walks)
[2] K. Kloster and D. F. Gleich. Heat kernel based community detection. In KDD '14.
[3] I. M. Kloumann and J. M. Kleinberg. Community membership identification from small seed sets. In KDD '14.
[4] J. J. Whang, D. F. Gleich, and I. S. Dhillon. Overlapping community detection using seed set expansion. In CIKM '13.

Datasets
Synthetic datasets: LFR benchmark graphs [5]
Built-in community structure, power-law distribution.
om (overlapping membership): 2 to 8.
µ (mixing parameter): controls the fraction of each vertex's links that connect outside its community (µ = 0.1, µ = 0.3).
Real datasets [6]: product networks (Amazon), collaboration networks (DBLP), social networks (YouTube, Orkut).
[5] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008.
[6] Stanford Network Analysis Project: http://snap.stanford.edu

Existing Spectral Methods
Consider an undirected graph $G = (V, E)$ with $n$ nodes and $m$ edges. $A$ is the adjacency matrix; $N = D^{-1} A$ is the transition matrix, where $D$ is the diagonal matrix of node degrees. Find the dominant eigenvectors of $N$.
Figure: Spectral clustering methods
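The transition matrix on this slide can be illustrated with a small sketch. The toy graph, the function names `transition_matrix` and `walk_step`, and the use of repeated walk steps (power iteration) to approximate a dominant eigenvector are assumptions made for illustration; the slides do not say how the eigenvectors are computed.

```python
def transition_matrix(adj):
    """Row-stochastic N = D^{-1} A for an undirected graph given as a dense 0/1 matrix."""
    n_mat = []
    for row in adj:
        deg = sum(row)
        if deg == 0:
            raise ValueError("isolated vertex: D is not invertible")
        n_mat.append([a / deg for a in row])
    return n_mat


def walk_step(n_mat, p):
    """One random-walk step on a probability vector: p_next = N^T p."""
    n = len(p)
    return [sum(n_mat[i][j] * p[i] for i in range(n)) for j in range(n)]


# Hypothetical toy graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
N = transition_matrix(A)

# Power iteration: repeated walk steps approach the dominant left eigenvector
# of N, which for a connected non-bipartite graph is degree / total degree.
p = [1.0, 0.0, 0.0, 0.0]
for _ in range(200):
    p = walk_step(N, p)
```

On a graph with billions of nodes, each such step touches every edge, which is the cost that the local approach in the following slides tries to avoid.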

Existing Spectral Methods
Drawbacks: computing the eigenvectors is inefficient, and the methods are unable to partition overlapping communities.

Local Spectral Clustering
Find local spectra: random walks starting from a seed set $S$. Consider the span of $l$-dimensional probability vectors $P_{0,l} = [p_0, p_1, \ldots, p_l]$. Initial invariant subspace: $V_{0,l}$. Use the following recurrence to calculate the local spectra $V_{k,l}$ after $k$ steps of random walk:
$$V_{k,l} R_{k,l} = V_{k-1,l} \bar{A}, \qquad (1)$$
where $\bar{A} = D^{-1/2} (A + I) D^{-1/2}$ is the normalized adjacency matrix of the graph.
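A minimal sketch of this recurrence follows, under several assumptions that are not stated on the slide: the basis is stored as a list of length-$n$ column vectors, the product in (1) is applied as $\bar{A}$ times each basis column so that the dimensions agree, $R_{k,l}$ is read as the triangular factor of an orthonormalization step (done here with modified Gram-Schmidt), and $p_0$ is uniform over the seeds. The function names and the toy graph are hypothetical.

```python
import math


def orthonormalize(cols):
    """Modified Gram-Schmidt on a list of column vectors (the Q factor of a thin QR)."""
    basis = []
    for v in cols:
        w = list(v)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, w))
            w = [a - proj * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm < 1e-12:
            raise ValueError("columns are (numerically) linearly dependent")
        basis.append([a / norm for a in w])
    return basis


def matvec(m, v):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def local_spectral_basis(adj, seeds, l, k):
    """Orthonormal basis after k applications of A_bar, started from p_0 .. p_l."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Random-walk vectors p_0 .. p_l with p_{t+1} = N^T p_t and N = D^{-1} A.
    p = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    walk = [p]
    for _ in range(l):
        p = [sum(adj[i][j] * p[i] / deg[i] for i in range(n)) for j in range(n)]
        walk.append(p)
    basis = orthonormalize(walk)  # plays the role of V_{0,l}
    # A_bar = D^{-1/2} (A + I) D^{-1/2}
    a_bar = [[(adj[i][j] + (i == j)) / math.sqrt(deg[i] * deg[j])
              for j in range(n)] for i in range(n)]
    for _ in range(k):
        basis = orthonormalize([matvec(a_bar, q) for q in basis])
    return basis


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
V = local_spectral_basis(A, seeds={0}, l=2, k=3)
```

The dense matrices are only for readability; a practical version would use sparse storage and operate on a sampled subgraph.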

Local Spectral Clustering
Finding the members of the community that the seed set $S$ belongs to is equivalent to finding rows in $V_{k,l}$ that point in nearly the same direction as the nodes in $S$. We seek a sparse vector $y$ in the span of $V_{k,l}$ such that the seed nodes are in its support:
$$\min\; e^T y = \|y\|_1 \quad \text{s.t.} \quad y = V_{k,l}\, x, \quad y \ge 0, \quad y(S) \ge 1$$
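To hand this problem to a general linear programming solver, one option is to substitute $y = V_{k,l} x$ and optimize over $x$ alone. The form below is a sketch under assumptions: $V_{k,l}$ is taken to be a matrix with one row per node and $d$ columns, and $V_S$ (a name introduced here) denotes its rows for the seed nodes. The variable $x$ is free in sign, and the objective equals $\|y\|_1$ only because the constraint $y \ge 0$ holds.

```latex
\begin{aligned}
\min_{x \in \mathbb{R}^{d}} \quad & \left(V_{k,l}^{T} e\right)^{T} x \\
\text{s.t.} \quad & V_{k,l}\, x \ge 0, \\
                  & V_{S}\, x \ge 1 .
\end{aligned}
```

Recovering $y = V_{k,l} x$ from the optimal $x$ and sorting its entries in decreasing order gives the ranking of candidate members.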

Community Size Determination
We can obtain a candidate community by truncating the sorted sparse vector $\hat{y}$. But we still need to answer the following question:
Q.1 What defines good communities, and when do they emerge as we expand the seed set?

Community Size Determination
Random walk techniques produce communities with conductance guarantees [7]:
$$\psi(V) = \frac{\partial(V)}{\min(\mathrm{Vol}(V), \mathrm{Vol}(\bar{V}))}, \qquad (2)$$
where $\partial(V)$ denotes the cut size and $\mathrm{Vol}(V)$ is the sum of the node degrees in the set $V$.
A.1 The expansion of the seed set can stop and form a natural community when it encounters a low-conductance cut.
[7] R. Andersen and K. J. Lang. Communities from seed sets. In Proceedings of the 15th International Conference on World Wide Web. ACM, 2006.
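A small sketch of equation (2) and of one way to use it for truncation follows. The function names, the toy graph, and the scores in `y_hat` are hypothetical, and choosing the prefix with the globally lowest conductance is an assumption; the slides only say that expansion stops at a low-conductance cut.

```python
def conductance(graph, community):
    """psi(V) = cut(V) / min(Vol(V), Vol(complement of V)) for an undirected graph.

    graph maps each node to the set of its neighbours.
    """
    community = set(community)
    cut = sum(1 for u in community for v in graph[u] if v not in community)
    vol_in = sum(len(graph[u]) for u in community)
    vol_out = sum(len(graph[u]) for u in graph) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else float("inf")


def sweep_cut(graph, scores):
    """Scan prefixes of nodes sorted by decreasing score; return the best prefix."""
    order = sorted(scores, key=lambda u: (-scores[u], u))
    best, best_psi = None, float("inf")
    for size in range(1, len(order)):  # skip the full node set
        psi = conductance(graph, order[:size])
        if psi < best_psi:
            best, best_psi = set(order[:size]), psi
    return best, best_psi


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
# Hypothetical sorted scores standing in for the sparse vector y-hat.
y_hat = {0: 1.0, 1: 0.9, 2: 0.8, 3: 0.3, 4: 0.1, 5: 0.05}
community, psi = sweep_cut(graph, y_hat)
```

Here the sweep stops at the first triangle, where a single edge leaves a set of volume 7.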

Community Size Determination
Comparison of the average F1 score with ground truth size (With g.t.) and automatic size determination (Auto).
Figure: F1 score vs. overlapping membership (2 to 8) on LFR benchmark graphs (µ = 0.3), and F1 score on real datasets (Amazon, DBLP, Orkut, YouTube).

Complexity Reduction by Sampling
If one wants to uncover a small community within a network with billions of vertices, it would be very costly to take the whole graph into account.
Q.2 How can we find a small community in time that depends on the size of the community rather than on that of the entire graph?

Complexity Reduction by Sampling
In practice, the unknown members of the target community are more likely to be near the seed members, usually a few steps away from the seeds.
A.2 Sample the graph by cutting off the redundant nodes that have a low probability of being reached after a short random walk.

Dataset   Coverage ratio   Sample rate   C avg   Subgraph size
Amazon    1.00             0.0087        39      2913
DBLP      0.98             0.0076        251     2409
YouTube   0.66             0.0033        79      3745
Orkut     0.64             0.0011        83      3379

Table: Statistics of the mean values for the sampling method on real datasets.
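The sampling idea in A.2 can be sketched as follows. The slides do not give the exact rule, so this is an illustration under assumptions: probability mass is accumulated over a few walk steps from a uniform start on the seeds, and the `keep` highest-scoring nodes (plus the seeds) form the sampled subgraph. The function name and parameters are hypothetical.

```python
def sample_subgraph(graph, seeds, steps, keep):
    """Keep the nodes with the highest probability mass accumulated over a short walk.

    graph maps each node to the set of its neighbours.
    """
    prob = {s: 1.0 / len(seeds) for s in seeds}
    score = dict(prob)
    for _ in range(steps):
        nxt = {}
        for u, mass in prob.items():
            share = mass / len(graph[u])
            for v in graph[u]:
                nxt[v] = nxt.get(v, 0.0) + share
        prob = nxt
        for v, mass in prob.items():
            score[v] = score.get(v, 0.0) + mass
    ranked = sorted(score, key=lambda u: (-score[u], u))
    kept = set(seeds) | set(ranked[:keep])
    # Induced subgraph on the kept nodes.
    return {u: graph[u] & kept for u in kept}


# Hypothetical toy graph: a path 0-1-2-...-7.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 7} for i in range(8)}
sub = sample_subgraph(path, seeds=[0], steps=2, keep=3)
```

Only nodes within `steps` hops of the seeds are ever visited, so the work depends on the local neighbourhood rather than on the size of the whole graph.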

Seeding Method
The quality of the seed set is crucial to the detection accuracy. Domain experts can apply the alternative seeding methods strategically in different scenarios, based on the availability of candidate seeds.
Q.3 What defines a good seed set, and how many seeds are needed to define a community?

Seeding Method
Adopt |S| = 3 seeds for each of the seeding methods: high degree seeding, low degree seeding, random seeding, triangle seeding, and high inward-edge ratio seeding.
Figure: F1 score of each seeding method vs. overlapping membership (2 to 8) on LFR benchmark graphs (µ = 0.1), and F1 score of each seeding method on real datasets (Amazon, DBLP, Orkut, YouTube).
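Four of the seeding strategies named above can be sketched as simple rankings over the members of a known community. The slide only names the strategies, so the definitions here are assumptions: degree is the full degree in the graph, and the inward-edge ratio is read as the fraction of a node's edges that stay inside the community. Triangle seeding is omitted because the slide gives too little detail to sketch it. All function names are hypothetical.

```python
import random


def inward_edge_ratio(graph, community, node):
    """Fraction of a node's edges that end inside the community (assumed definition)."""
    return len(graph[node] & community) / len(graph[node])


def pick_seeds(graph, community, method, size=3, rng=None):
    """Return `size` seeds from `community` according to the named strategy."""
    community = set(community)
    members = sorted(community)  # fixed order so ties break by node id
    if method == "random":
        rng = rng or random.Random(0)
        return sorted(rng.sample(members, size))
    keys = {
        "high_degree": lambda u: -len(graph[u]),
        "low_degree": lambda u: len(graph[u]),
        "high_inward_ratio": lambda u: -inward_edge_ratio(graph, community, u),
    }
    if method not in keys:
        raise ValueError(f"unknown seeding method: {method}")
    return sorted(members, key=keys[method])[:size]


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
truth = {0, 1, 2}
high = pick_seeds(graph, truth, "high_degree", size=1)
inward = pick_seeds(graph, truth, "high_inward_ratio", size=2)
```

In this toy example the highest-degree member is the bridge node 2, while the inward-edge ratio prefers the two nodes whose edges all stay inside the community.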

Seed Set Size
For LFR benchmark graphs, we test five different seeding ratios: 2%, 4%, 6%, 8% and 10%. For real networks, varying the seed set size has little effect on the performance.
Figure: F1 score vs. overlapping membership (2 to 8) for each seeding ratio on LFR benchmark graphs with µ = 0.1.

Comparison with localized algorithms
Heat Kernel, HK (Kloster et al., KDD '14)
PageRank, PR (Kloumann et al., KDD '14)
Seed Set Expansion, SSE (Whang et al., CIKM '13)
Local Expansion via Minimum One Norm, LEMON
Figure: Comparison of the average F1 score with state-of-the-art local detection algorithms (LEMON, LEMON-auto, HK, PR, SSE) on Amazon, DBLP, Orkut and YouTube.

Discussion
Networks are not all alike, and we cannot assume that an algorithm that works for finding communities in one network will behave the same way on other networks.
Q.4 How well is the local expansion approach suited to uncovering communities in different types of networks?

Discussion
Empirical comparison between synthetic and real networks:
A.4 LEMON is less sensitive to the random walk step and subspace dimension on real networks than on LFR benchmark graphs. LEMON is less sensitive to the seed set size on real networks than on LFR benchmark graphs. LEMON is more sensitive to high-degree seeds on real networks than on LFR benchmark graphs.

Future Work
Hierarchical structure: look further into some larger low-conductance communities and see whether a hierarchical structure exists. In this case, a large social group consisting of several small cliques is likely to be discovered.
Membership detection: the local spectral clustering method could potentially be applied to the membership detection problem, i.e., finding all the communities that an arbitrary vertex belongs to.

Q & A

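Evaluation Metric
Use the F1 score to quantify the similarity between the algorithmic community $C$ and the ground truth community $C^*$:
$$F_1(C, C^*) = \frac{2 \cdot \mathrm{Precision}(C, C^*) \cdot \mathrm{Recall}(C, C^*)}{\mathrm{Precision}(C, C^*) + \mathrm{Recall}(C, C^*)}, \qquad (3)$$
where the precision and recall are defined as:
$$\mathrm{Precision}(C, C^*) = \frac{|C \cap C^*|}{|C|}, \qquad (4)$$
$$\mathrm{Recall}(C, C^*) = \frac{|C \cap C^*|}{|C^*|}. \qquad (5)$$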
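Equations (3) to (5) translate directly into a few lines of set arithmetic. The function name `f1_score` and the example sets are hypothetical, and returning 0 when the two sets do not overlap is a convention assumed here to avoid division by zero.

```python
def f1_score(found, truth):
    """F1 between a detected community and a ground truth community (node sets)."""
    found, truth = set(found), set(truth)
    overlap = len(found & truth)
    if overlap == 0:
        return 0.0  # assumed convention when precision and recall are both zero
    precision = overlap / len(found)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 4 detected nodes are correct, 2 of 3 true nodes are found.
score = f1_score({1, 2, 3, 4}, {3, 4, 5})
```

In this example precision is 1/2 and recall is 2/3, so the F1 score is 4/7.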

Parameter Selection
Fix k = 3, vary l from 1 to 15. Fix l = 3, vary k from 1 to 15. The observation holds for the remaining datasets as well.
Figure: F1 score on Amazon vs. dimension, and F1 score on Amazon vs. random walk step.

Parameters of LFR graphs
Figure: Parameters for the LFR datasets.

Stats of Real Datasets
Figure: Statistics for the real networks.

Further Extension
We enforce a bias towards the high-degree vertices at the beginning of the random walk by normalizing the initial probability vector:
$$p_0(v_i) = \begin{cases} d(v_i)/\mathrm{Vol}(S) & \text{if } v_i \in S \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$
Figure: F1 score with normalized vs. unnormalized initial vectors, against overlapping membership (2 to 8) on LFR benchmark graphs (µ = 0.1 and µ = 0.3), and on real datasets (Amazon, DBLP, Orkut, YouTube).
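Equation (6) can be sketched as below. The function name and toy graph are hypothetical, and the unnormalized variant is assumed here to be a uniform distribution over the seeds, which the slide does not state.

```python
def initial_vector(graph, seeds, normalized=True):
    """Initial walk distribution p_0 over all nodes of the graph.

    graph maps each node to the set of its neighbours.
    """
    seeds = set(seeds)
    if normalized:
        # Degree-weighted start: d(v) / Vol(S) on the seeds, as in equation (6).
        vol = sum(len(graph[s]) for s in seeds)
        return {v: (len(graph[v]) / vol if v in seeds else 0.0) for v in graph}
    # Assumed baseline: equal mass on every seed.
    return {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in graph}


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
p0 = initial_vector(graph, seeds={0, 2})
```

With seeds 0 (degree 2) and 2 (degree 3), the normalized start puts more mass on the higher-degree seed.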