Uncovering the Small Community Structure in Large Networks: A Local Spectral Approach Yixuan Li 1, Kun He 2, David Bindel 1 and John E. Hopcroft 1 1 Cornell University, USA 2 Huazhong University of Science and Technology, China May 20th, 2015
Introduction Preliminaries Algorithm Overview Seeding Comparison and Discussion Miscellaneous Uncover Small Community Structure How can we find a community of hundreds in a network of billions? Figure : Visualization of Facebook friendship data1. 1 www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
Motivation Global Structure → Local Structure
Reduce Complexity: enables finding communities in time proportional to the size of the community (~100) rather than the size of the entire graph (~billions).
Improve Accuracy: enables finding how many communities an individual belongs to, and what those communities are.
Local expansion
What is a seed set 2,3,4?
What is local expansion useful for?
How can we conduct local expansion? (short-step random walks)
2 K. Kloster and D. F. Gleich. Heat kernel based community detection. In KDD '14.
3 I. M. Kloumann and J. M. Kleinberg. Community membership identification from small seed sets. In KDD '14.
4 J. J. Whang, D. F. Gleich, and I. S. Dhillon. Overlapping community detection using seed set expansion. In CIKM '13.
Datasets
Synthetic datasets: LFR benchmark graphs 5
Built-in community structure, power-law degree distribution.
om: overlapping membership (2–8).
µ (mixing parameter): controls the fraction of links of each vertex that connect outside its community (µ = 0.1, µ = 0.3).
Real datasets 6: product networks, collaboration networks, social networks.
5 A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008.
6 Stanford Network Analysis Project: http://snap.stanford.edu
Existing Spectral Methods Consider an undirected graph G = (V, E) with n nodes and m edges. A is the adjacency matrix; N = D^{-1}A is the transition matrix, where D is the diagonal matrix of node degrees. Find the dominant eigenvectors of N. Figure : Spectral clustering methods
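As a rough sketch of this global approach (not the talk's code; `dominant_eigenvectors` is an illustrative name, and A is assumed to be a dense symmetric 0/1 matrix), one can compute the spectral embedding via the symmetric normalization D^{-1/2}AD^{-1/2}, which shares its spectrum with N:

```python
import numpy as np

def dominant_eigenvectors(A, k=2):
    """Global spectral embedding: top-k eigenvectors of the symmetrically
    normalized adjacency D^{-1/2} A D^{-1/2}, which has the same spectrum
    as the transition matrix N = D^{-1} A."""
    d = A.sum(axis=1)                      # node degrees
    M = A / np.sqrt(np.outer(d, d))        # elementwise D^{-1/2} A D^{-1/2}
    w, U = np.linalg.eigh(M)               # eigenvalues in ascending order
    # Take the last k columns (largest eigenvalues), in descending order.
    # In practice one would cluster the rows of this embedding (e.g. k-means).
    return U[:, -k:][:, ::-1]
```

The rows of the returned matrix embed the nodes; clustering them recovers the global partition, which is exactly the step the talk argues is too expensive for billion-node graphs.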
Existing Spectral Methods Drawbacks: Inefficient computation of eigenvectors Unable to partition overlapping communities
Local Spectral Clustering
Find local spectra: random walks starting from a seed set S.
Consider the span of the probability vectors P_{0,l} = [p_0, p_1, ..., p_l].
Initial invariant subspace: V_{0,l}.
Use the following recurrence to calculate the local spectra V_{k,l} after k steps of random walk:
V_{k,l} R_{k,l} = V_{k-1,l} Ā,   (1)
where Ā = D^{-1/2}(A + I)D^{-1/2} is the normalized adjacency matrix of the graph.
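The recurrence (1) is subspace iteration with QR orthonormalization. The snippet below is a minimal illustration, not the authors' implementation: it assumes a dense symmetric 0/1 adjacency matrix and applies Ā to a column basis (equivalent, since Ā is symmetric, to the row form in (1)); the function name and default parameters are assumptions.

```python
import numpy as np

def local_spectral_subspace(A, seeds, l=3, k=3):
    """Approximate the l-dimensional local spectral subspace around a seed
    set by k steps of subspace iteration on Abar = D^{-1/2}(A+I)D^{-1/2}."""
    n = A.shape[0]
    Al = A + np.eye(n)                       # A + I: add self-loops
    d = Al.sum(axis=1)
    Abar = Al / np.sqrt(np.outer(d, d))      # D^{-1/2}(A+I)D^{-1/2}

    # Initial subspace: span of l short-random-walk probability vectors.
    p = np.zeros(n)
    p[seeds] = 1.0 / len(seeds)              # uniform mass on the seeds
    W = Al / d[:, None]                      # row-stochastic transition matrix
    P = [p]
    for _ in range(l - 1):
        P.append(P[-1] @ W)                  # one more walk step
    V, _ = np.linalg.qr(np.array(P).T)       # orthonormal basis V_{0,l}

    # k steps of subspace iteration: orthonormalize Abar V_{k-1,l}.
    for _ in range(k):
        V, _ = np.linalg.qr(Abar @ V)
    return V                                 # shape (n, l)
```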
Local Spectral Clustering Finding the members that belong to the same community as the seed set S is equivalent to finding the rows of V_{k,l} that point in nearly the same direction as the rows of the nodes in S. We therefore seek a sparse vector y in the span of V_{k,l} such that the seed nodes are in its support:
min e^T y = ||y||_1
s.t. y = V_{k,l} x, y ≥ 0, y(S) ≥ 1
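Since y = V_{k,l}x, this l1-minimization is a small linear program in the coefficient vector x. A minimal sketch using scipy.optimize.linprog (SciPy is an assumed dependency here, and `sparse_indicator` is an illustrative name, not the paper's API):

```python
import numpy as np
from scipy.optimize import linprog

def sparse_indicator(V, seeds):
    """Solve  min 1'y  s.t.  y = Vx,  y >= 0,  y(s) >= 1 for each seed s,
    over the coefficient vector x; return y = Vx."""
    n, l = V.shape
    c = V.sum(axis=0)                            # objective 1'Vx
    A_ub = np.vstack([-V,                        # -Vx <= 0   (y >= 0)
                      -V[seeds]])                # -(Vx)[s] <= -1  (y(s) >= 1)
    b_ub = np.concatenate([np.zeros(n),
                           -np.ones(len(seeds))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * l)     # x itself is unconstrained
    return V @ res.x
```

Sorting y and truncating it then yields the candidate community discussed on the next slide.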
Community Size Determination We can obtain a candidate community by truncating the sorted sparse vector ŷ. But we still need to answer the following question: Q.1 What defines a good community, and when does one emerge as we expand the seed set?
Community Size Determination The random walk techniques produce communities with conductance guarantees 7.
ψ(V) = ∂(V) / min(Vol(V), Vol(V̄)),   (2)
where ∂(V) denotes the cut size and Vol(V) is the sum of the node degrees in the set V.
A.1 The expansion of the seed set can stop and form a natural community when it encounters a low-conductance cut.
7 R. Andersen and K. J. Lang. Communities from seed sets. In WWW '06.
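Conductance as in (2) is straightforward to compute; a small sketch, assuming the graph is given as an adjacency dictionary mapping each node to its neighbor set:

```python
def conductance(adj, S):
    """Conductance of node set S in an undirected graph.
    adj: dict mapping node -> set of neighbors; S: iterable of nodes."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)   # edges leaving S
    vol_S = sum(len(adj[u]) for u in S)                     # degree sum in S
    vol_rest = sum(len(adj[u]) for u in adj) - vol_S        # degree sum outside
    return cut / min(vol_S, vol_rest)
```

Scanning the sorted ŷ for a prefix with low conductance is then one way to pick the community size automatically.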
Community Size Determination Figure : Comparison of the average F1 score between using the ground-truth community size and the automatic size determination, on LFR benchmark graphs (µ = 0.3, overlapping membership 2–8) and on the real datasets (Amazon, DBLP, Orkut, YouTube).
Complexity Reduction by Sampling If one wants to uncover a small community within a network of billions of vertices, it would be very costly to take the whole graph into account. Q.2 How can we find a small community in time proportional to the size of the community rather than the size of the entire graph?
Complexity Reduction by Sampling In practice, the unknown members of the target community tend to be near the seed members, usually only a few steps away. A.2 Sample the graph by cutting off redundant nodes that have low probability of being reached after a short random walk.

Dataset | Coverage ratio | Sample rate | C_avg | Subgraph size
Amazon  | 1.00 | 0.0087 | 39  | 2913
DBLP    | 0.98 | 0.0076 | 251 | 2409
YouTube | 0.66 | 0.0033 | 79  | 3745
Orkut   | 0.64 | 0.0011 | 83  | 3379

Table : Mean statistics for the sampling method on the real datasets.
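A simple stand-in for this sampling step, using BFS depth as a proxy for short-random-walk reachability (the actual method ranks nodes by walk probability; the function name, `steps`, and `max_size` here are illustrative assumptions):

```python
from collections import deque

def sample_subgraph(adj, seeds, steps=3, max_size=5000):
    """Keep only nodes within `steps` hops of the seeds, discarding
    remote nodes that a short random walk would be unlikely to reach.
    adj: dict mapping node -> iterable of neighbors."""
    kept = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier and len(kept) < max_size:
        u, depth = frontier.popleft()
        if depth == steps:                    # do not expand past the horizon
            continue
        for v in adj[u]:
            if v not in kept:
                kept.add(v)
                frontier.append((v, depth + 1))
    return kept
```

The local spectral steps then run on the induced subgraph over `kept`, whose size (a few thousand nodes in the table above) is independent of the full graph.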
Seeding Method The quality of the seed set is crucial to detection accuracy. Alternative seeding methods can be applied strategically by domain experts in different scenarios, based on the availability of candidate seeds. Q.3 What defines a good seed set, and how many seeds are needed to define a community?
Seeding Method Adopt |S| = 3 seeds for each of the seeding methods:
High-degree seeding
Low-degree seeding
Random seeding
Triangle seeding
High inward-edge ratio seeding
Figure : F1 scores for the seeding methods on LFR benchmark graphs (µ = 0.1, overlapping membership 2–8) and on the real datasets (Amazon, DBLP, Orkut, YouTube).
Seed Set Size For LFR benchmark graphs, we test five seeding ratios: 2%, 4%, 6%, 8% and 10%. For real networks, varying the seed set size has little effect on performance. Figure : F1 score vs. overlapping membership for each seeding ratio, on LFR benchmark graphs with µ = 0.1.
Comparison with localized algorithms
Heat Kernel (Kloster et al., KDD '14)
PageRank (Kloumann et al., KDD '14)
Seed Set Expansion (Whang et al., CIKM '13)
Local Expansion via Minimum One Norm (LEMON)
Figure : Comparison of the average F1 score with state-of-the-art local detection algorithms (LEMON, LEMON-auto, HK, PR, SSE) on Amazon, DBLP, Orkut, and YouTube.
Discussion Networks are not all alike: we cannot assume that an algorithm that finds communities well in one network will behave the same on other networks. Q.4 How well is the local expansion approach suited to uncovering communities in different types of networks?
Discussion Empirical comparison between synthetic and real networks: A.4 LEMON is less sensitive to the random walk step and the subspace dimension on real networks than on LFR benchmark graphs. LEMON is less sensitive to the seed set size on real networks than on LFR benchmark graphs. LEMON is more sensitive to high-degree seeds on real networks than on LFR benchmark graphs.
Future Work Hierarchical Structure Look further into some larger low-conductance communities and see if a hierarchical structure exists. In this case, some large social group consisting of several small cliques is likely to be discovered. Membership Detection The local spectral clustering method could be potentially applied to the membership detection problem, i.e., finding all the communities that an arbitrary vertex belongs to.
Q & A
Evaluation Metric Use the F1 score to quantify the similarity between the algorithmic community C and the ground-truth community C*:
F1(C, C*) = 2 · Precision(C, C*) · Recall(C, C*) / (Precision(C, C*) + Recall(C, C*)),   (3)
where the precision and recall are defined as:
Precision(C, C*) = |C ∩ C*| / |C|,   (4)
Recall(C, C*) = |C ∩ C*| / |C*|.   (5)
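The metric can be sketched in a few lines (illustrative helper; communities are assumed to be given as collections of node ids):

```python
def f1_score(found, truth):
    """F1 between a detected community and a ground-truth community."""
    found, truth = set(found), set(truth)
    inter = len(found & truth)
    if inter == 0:
        return 0.0                           # no overlap: precision = recall = 0
    precision = inter / len(found)           # |C ∩ C*| / |C|
    recall = inter / len(truth)              # |C ∩ C*| / |C*|
    return 2 * precision * recall / (precision + recall)
```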
Parameter Selection Fix k = 3, vary l from 1 to 15; fix l = 3, vary k from 1 to 15. The observation holds for the remaining datasets as well. Figure : F1 score on Amazon as a function of the subspace dimension l (left) and the random walk step k (right).
Parameters of LFR graphs Figure : Parameters for the LFR datasets.
Stats of Real Datasets Figure : Statistics for the real networks.
Further Extension We enforce a bias towards high-degree vertices at the beginning of the random walk by normalizing the initial probability vector:
p_0(v_i) = d(v_i)/Vol(S) if v_i ∈ S, and 0 otherwise.   (6)
Figure : F1 scores with normalized vs. unnormalized initial vectors, on LFR benchmark graphs (µ = 0.1 and µ = 0.3) and on the real datasets (Amazon, DBLP, Orkut, YouTube).
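The degree-normalized initial vector of (6) in a small sketch (illustrative function; degrees are assumed to be passed in as a dictionary):

```python
import numpy as np

def initial_prob(degrees, seeds, n):
    """Degree-biased initial distribution: p0(v) = d(v)/Vol(S) for seed
    vertices and 0 elsewhere, so high-degree seeds receive more mass."""
    p0 = np.zeros(n)
    vol_S = sum(degrees[s] for s in seeds)   # Vol(S): degree sum of the seeds
    for s in seeds:
        p0[s] = degrees[s] / vol_S
    return p0
```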