Social group dynamics in networks


Gergely Palla 1, Péter Pollner 1, Albert-László Barabási 3 and Tamás Vicsek 1,2

Abstract

The rich set of interactions between individuals in society results in complex community structure, capturing highly connected circles of friends, families, or professional cliques in a social network. Due to frequent changes in the activity and communication patterns of individuals, the associated social and communication network is subject to constant evolution. The cohesive groups of people in such networks can grow by recruiting new members, or contract by losing members; two (or more) groups may merge into a single community, while a large enough social group can split into several smaller ones; new communities are born and old ones may disappear. We discuss a new algorithm based on a clique percolation technique that makes it possible to investigate in detail the time dependence of communities on a large scale and, as such, to uncover basic relationships of the statistical features of community evolution. According to the results, the behaviour of smaller collaborative or friendship circles and larger communities, e.g., institutions, shows significant differences. Social groups containing only a few members persist longer on average when the fluctuations of the members are small. In contrast, we find that the condition for stability for large communities is continuous change in their membership, allowing for the possibility that after some time practically all members are exchanged.

1 Statistical and Biological Physics Research Group of HAS, H-1117 Budapest, Pázmány Péter sétány 1/A, Hungary
2 Dept. of Biological Physics, Eötvös University, H-1117 Budapest, Pázmány Péter sétány 1/A, Hungary
3 Center for Complex Network Research and Dept. of Physics, Biology and Computer Science, Northeastern University, Boston, MA 02115, USA
Corresponding author: Tamás Vicsek, vicsek@angel.elte.hu

1 Introduction

Mapping social relations between people onto a network has a long tradition in sociology [76, 20, 72]. The standard method for revealing the topology of the connections is to use questionnaires and personal interviews. The advantage of this approach is that it can provide very detailed information about the social ties, e.g., the type of acquaintance behind a given connection, what sort of emotions the examined pair of people induce in each other, whether the relation is mutual or not, etc. The drawback of this data collection framework is that the typical size of the examined sample is of the order of $N \approx 10^2$ individuals and the strength associated with the links between people is subjective.

In the last decade a change of paradigm took place due to the rapid development of complex network theory [75, 4, 2, 39]. This new interdisciplinary field is devoted to the analysis of the statistical features of systems ranging from protein interaction networks through stock correlation graphs to the Internet. Since the size of the investigated networks can grow to more than $N \approx 10^6$ nodes, the underlying data must be collected in an automated way, extracting the relevant information from large electronic databases. This approach has been successfully used to create large social networks as well [51, 50, 73]. Databases [12, 13, ], phone-call records [1, 51, 50] and scientific co-authorship data [23, 22, 42, 5] provide good examples for the starting point of a social network analysis on a large scale. Although the range of social interactions that can be detected using databases of this type is narrow compared to the questionnaires, in some cases the strength of the connection (e.g., the number of phone-calls between two individuals in a certain time period) may be more objectively quantifiable. In this chapter we present a study concerning the statistical properties of two large social networks of major interest, capturing the collaboration between scientists and the calls between mobile phone users.
Our focus is on the community dynamics, where the communities (also called modules, clusters or cohesive groups) can correspond to families, friendship circles, work groups [63, 74], etc. These structural sub-units have no widely accepted unique definition; however, we can assume that a community member is usually more tightly connected to its group than to other parts of the network, and that most people in a community know each other [64, 15, 33, 43, 57] (the groups are dense). Although most empirical studies have focused on snapshots of these communities, thanks to frequent changes in the activity and communication patterns of individuals, the associated social and communication network is subject to constant evolution [38, 5, 31, 11, 70, 78, 47]. Our knowledge of the mechanisms governing the underlying community dynamics is limited, but is essential for a deeper understanding of the development and self-optimisation of society as a whole [25, 32, 28, 37, 56, 34].

Typically, the communities in a complex system are not isolated from each other; instead, they have overlaps, e.g., people can be members of different social groups at the same time [72]. This observation naturally leads to the definition of the community graph: a network representing the connections between the communities, with the nodes referring to communities and links corresponding to shared members between the communities. Accordingly, the community degree $d^{\mathrm{com}}$ of a community is given by the number of other communities it overlaps with, and is equal to the degree of the corresponding node in the community graph. So far, in the networks investigated, the community degree distribution was shown to decay exponentially for low and as a power law for higher community degree values. This means that fat-tailed degree distributions appear at two levels in the hierarchy of these systems: both at the level of nodes (the underlying networks are scale free), and at the level of the communities as well.

Preferential attachment is a key concept in the field of scale-free networks. In a wide range of graph models the basic mechanism behind the emerging power-law degree distribution is that the new nodes attach to the old ones with probability proportional to their degree [4, 2, 39]. Furthermore, in earlier works the occurrence of preferential attachment was directly demonstrated in several real-world networks with scale-free degree distribution [5, 41]. The observed fat tails in the degree distribution of the community graph indicate that the mechanism of preferential attachment could be present at the level of communities as well. One of our aims in the present chapter is to examine the attachment statistics of communities in order to clarify this question.

We further develop a new algorithm based on the clique percolation method (CPM) [53, 9] that makes it possible to investigate in detail the time dependence of overlapping communities on a large scale and, as such, to uncover basic relationships of the statistical features of community evolution [52, 55]. According to our results, the behaviour of large and small communities shows an interesting difference.
We find that large groups persist longer if they are capable of dynamically altering their membership, suggesting that an ability to change the composition results in better adaptability and a longer lifetime for social groups. Remarkably, the behaviour of small groups displays the opposite tendency, the condition for stability being that their composition remains unchanged. We also show that the time commitment of members to a given community can be used for estimating the community's lifetime.

This chapter is organised as follows. We begin with the construction of the investigated networks from the basic data sets in Sect. 2 and continue with the main aspects of the CPM in Sect. 3. We detail the algorithm for building evolving communities from subsequent snapshots of the community structure in Sect. 4. The main results are discussed in Sect. 5, whereas the concluding remarks are drawn in Sect. 6.

2 Construction of the network

The data sets we consider contain the monthly roster of articles in the Los Alamos cond-mat archive spanning 142 months, with over [...] authors [71], and the complete record of phone-calls between the customers of a mobile phone company spanning 52 weeks (accumulated over two-week-long periods), and containing the communication patterns of over 4 million users [51, 50]. Both types of collaboration events (a new article or a phone-call) document the presence of social interaction between the involved individuals (nodes), and can be represented as (time-dependent) links. We assumed that in both cases the social connection between people had started some time before the collaboration/communication event and lasted for some time after these events as well. (E.g., the submission of an article to the archive is usually preceded by intense collaboration and reconciliation between the authors, which is in most cases prolonged after the submission as well.) Collaboration/communication events between the same people can be repeated from time to time, and a higher frequency of collaboration/communication acts usually indicates a closer relationship [58]. Furthermore, weights can be assigned to the collaboration and communication events quite naturally: an article with $n$ authors corresponds to a collaboration act of weight $1/(n-1)$ between every pair of its authors, whereas the cost of the phone-calls provides the weight in case of the phone-call network. Based on this, we define the link weight between two nodes $a$ and $b$ at time $t$ as

$$ w_{a,b}(t) = \sum_i \left[ w_i\,\Theta(t-t_i)\,\exp\!\left(-\lambda_+ (t-t_i)/w_i\right) + w_i\,\Theta(t_i-t)\,\exp\!\left(-\lambda_- (t_i-t)/w_i\right) \right], \tag{1} $$

where the summation runs over all collaboration events in which $a$ and $b$ are involved (e.g., a phone-call between $a$ and $b$), and $w_i$ denotes the weight of the event $i$ occurring at $t_i$. (The constants $\lambda_+$ and $\lambda_-$ are decay time characteristics for the particular social system we study. The function $\Theta(t)$ is the step function taking 0 at negative $t$ values and 1 for positive.) Thus, in this approach the time evolution of the network is manifested in the changing of the link weights. However, if the links weaker than a certain threshold $w^*$ are neglected, the network becomes truly restructuring in the sense that links appear only in the vicinity of the events and disappear further away in time. The above method of weighting ties between people is very useful in capturing the continuous time dependence of the strength of connections when the information about them is available only at discrete time steps. Except for our analysis of the preferential attachment of communities (Sect. 3.2) we used symmetric decay characteristics $\lambda_- = \lambda_+$, whereas in Sect. 3.2 we applied a special choice corresponding to a simple growing network.

Fig. 1 The link weight as a function of time for a connection in the phone-call network. If a weight threshold of $w^* = 1$ is introduced, the link is absent outside the shaded interval. Here $\lambda_- = \lambda_+$. Figure from the Suppl. of [52].

3 Finding communities

3.1 The clique percolation method

The study of the intermediate-scale substructures in networks, made up of vertices more densely connected to each other than to the rest of the network, has become one of the most highlighted topics in complex network theory. These structural subunits can correspond to multi-protein functional units in molecular biology [59, 65], a set of tightly coupled stocks or industrial sectors in economy [49, 30], groups of people [63, 74, 52], cooperative players [67, 69, 66], etc. The location of such building blocks can be crucial to the understanding of the structural and functional properties of the systems under investigation. Furthermore, a reliable method to pinpoint such objects has many potential industrial applications, e.g., it can help service providers (phone, banking, Internet, etc.) identify meaningful groups of customers (users), or support biomedical researchers in their search for individual target molecules and novel protein complex targets [35, 3].

Since communities have no widely accepted unique definition, the number of available methods to pinpoint them is vast [63, 64, 15, 33, 19, 43, 62, 53, 16, 54, 60, 61, 26, 24, 27, 40]. The majority of these algorithms classify the nodes into disjoint communities, and in most cases a global quantity called modularity [45, 44] is used to evaluate the quality of the partitioning. However, as pointed out in [17, 36], the modularity optimisation introduces a resolution limit in the clustering, and communities containing a smaller number of edges than $\sqrt{M}$ (where $M$ is the total number of edges) cannot be resolved.
One of the big advantages of the clique percolation method (CPM) is that it provides a local algorithm for detecting the communities, and therefore it does not suffer from resolution problems of this type [53, 9]. In this approach the communities are built up from k-cliques, corresponding to complete (fully connected) sub-graphs of size k. Two k-cliques are said to be adjacent if they share k−1 nodes [9, 14, 6], and a k-clique community corresponds to a set of k-cliques in which all k-cliques can reach each other through chains of k-clique adjacency. In other words, the communities defined in this way are equivalent to k-clique percolation clusters. These objects can be best visualised with the help of k-clique templates (Fig. 2), which are objects isomorphic to a complete graph of k vertices. Such a template can be placed onto any k-clique in the graph, and rolled to an adjacent k-clique by relocating one of its vertices and keeping its other k−1 vertices fixed. Thus, the k-clique percolation clusters (k-clique communities) of a graph are all those subgraphs that can be fully explored by rolling a k-clique template in them but cannot be left by this template.

Fig. 2 Illustration of k-clique template rolling at k = 4. Initially the template is placed on A-B-C-D (left panel) and it is rolled onto the subgraph A-C-D-E (middle panel). The position of the k-clique template is marked with thick black lines and black nodes, whereas the already visited edges are represented by thick gray lines and gray nodes. Observe that in each step only one of the nodes is moved and the two 4-cliques (before and after rolling) share k−1 = 3 nodes. At the final step (right panel) the template reaches the subgraph C-D-E-F, and the set of nodes visited during the process (A-B-C-D-E-F) is considered as a k-clique percolation cluster.

The further advantages of the community definition above (besides its locality) are that it is not too restrictive, it is based on the density of the links and it allows overlaps between the communities: a node can be part of several k-clique percolation clusters at the same time. Revealing overlaps between communities has attracted significant attention in the recent literature devoted to community detection [77, 7, 29, 68, 18, 79, 46, 60, 40]. Indeed, communities in real-world graphs are often inherently overlapping: each person in a social web usually belongs to several groups (family, colleagues, friends, etc.), proteins in a protein interaction network may participate in multiple complexes [29] and a large portion of web-pages can be classified under multiple categories. Prohibiting overlaps during module identification strongly increases the percentage of false negative co-classified pairs. As an example, in a social web a group of colleagues might end up in different modules, each corresponding to, e.g., their families. In this case, the network module corresponding to their work-group is bound to become lost.
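The k-clique community definition of this section can be illustrated with a short Python sketch. It is a brute-force toy version written for this text, not the CPM implementation of [53, 9]: it enumerates all k-cliques by testing every k-subset of nodes, which is only feasible for very small graphs. The function names (`k_clique_communities`, `community_degrees`) and the two example edge lists are assumptions introduced here. The first edge list is chosen to contain the three 4-cliques named in the caption of Fig. 2; the second is a hypothetical pair of triangles sharing one node.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Return the k-clique percolation clusters of an undirected graph.

    Brute-force sketch: enumerates every k-subset of nodes, so it is
    suitable for toy graphs only.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # All complete subgraphs of size k.
    cliques = [
        frozenset(c)
        for c in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(c, 2))
    ]

    # Union-find over cliques: two k-cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of the nodes of all cliques in one cluster.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return [frozenset(g) for g in groups.values()]


def community_degrees(communities):
    """Community degree: number of other communities sharing at least one node."""
    return [
        sum(1 for j, other in enumerate(communities) if j != i and c & other)
        for i, c in enumerate(communities)
    ]


# Edges giving the 4-cliques A-B-C-D, A-C-D-E and C-D-E-F of Fig. 2.
FIG2_EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
    ("A", "E"), ("C", "E"), ("D", "E"),
    ("C", "F"), ("D", "F"), ("E", "F"),
]

# Hypothetical example: two triangles overlapping in node 3.
TWO_TRIANGLES = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]

if __name__ == "__main__":
    print(k_clique_communities(FIG2_EDGES, 4))
    overlapping = k_clique_communities(TWO_TRIANGLES, 3)
    print(overlapping, community_degrees(overlapping))
```

For the Fig. 2 graph at k = 4 the three cliques chain together into a single community containing all six nodes. In the two-triangle example at k = 3 the triangles share only one node, fewer than k−1 = 2, so they remain separate communities that overlap in node 3, and each has community degree 1.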
3.2 Preferential attachment at the level of communities

In this section we examine whether the fat tails observed earlier in the community distribution could result from preferential attachment mechanisms at the level of communities. The method presented below can be applied in general to any empirical study of an attachment process where the main goal is to decide whether the attachment is uniform or preferential with respect to a certain property (e.g., degree, size, etc.) of the attached objects (e.g., nodes, communities, etc.).

3.2.1 Method for detecting preferential attachment

If the studied process is uniform with respect to a property $\rho$, then objects with a given $\rho$ are chosen at a rate given by the distribution of $\rho$ amongst the available objects. However, if the attachment mechanism prefers high (or low) $\rho$ values, then objects with high (or low) $\rho$ are chosen at a higher rate compared to the $\rho$ distribution of the available objects. To monitor this enhancement, one can construct the cumulative $\rho$ distribution $P_t(\rho)$ of the available objects at each time step $t$, together with the un-normalised cumulative $\rho$ distribution of the objects chosen by the process between $t$ and $t+1$, denoted by $w_{t\to t+1}(\rho)$. The value of $w_{t\to t+1}(\rho)$ at a given $\rho$ equals the number of objects chosen in the process between $t$ and $t+1$ that had a $\rho$ value larger than the given $\rho$ at $t$. To detect deviations from uniform attachment, it is best to accumulate the ratio of $w_{t\to t+1}(\rho)$ and $P_t(\rho)$ during the time evolution to obtain

$$ W(\rho) = \sum_{t=0}^{t_{\max}-1} \frac{w_{t\to t+1}(\rho)}{P_t(\rho)}. \tag{2} $$

If the attachment is uniform with respect to $\rho$, then $W(\rho)$ becomes a flat function. However, if $W(\rho)$ is an increasing function, then the objects with large $\rho$ are favoured; if it is a decreasing function, the objects with small $\rho$ are favoured in the attachment process. The advantage of this approach is that the rate-like variable $w_{t\to t+1}(\rho)$ associated with the time step between $t$ and $t+1$ is always compared to the $P_t(\rho)$ distribution at $t$. Therefore $W(\rho)$ is able to indicate preference (or the absence of preference) even when $P_t(\rho)$ is slowly changing in time (as in the case of the community degree in the co-authorship network under investigation).
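The accumulation in Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `accumulate_W` and the input layout (one pair of lists per time step) are hypothetical. Both cumulative counts use "greater than or equal to" rather than the strict "larger than" of the text, so that $P_t$ stays nonzero at the largest observed value. Terms with $P_t(\rho) = 0$ are skipped.

```python
def accumulate_W(snapshots, rho_values):
    """Accumulate W(rho) = sum_t w_{t->t+1}(rho) / P_t(rho).

    snapshots:  list of (available, chosen) pairs, one per time step, where
                `available` holds the rho values of all available objects at t
                and `chosen` holds the rho values (measured at t) of the
                objects chosen between t and t+1.
    rho_values: the rho values at which W is evaluated.
    """
    W = {rho: 0.0 for rho in rho_values}
    for available, chosen in snapshots:
        if not available:
            continue
        n = len(available)
        for rho in rho_values:
            # Normalised cumulative distribution of the available objects
            # (assumption: >= instead of the strict > used in the text).
            P = sum(1 for x in available if x >= rho) / n
            if P > 0:
                # Un-normalised cumulative count of the chosen objects.
                w = sum(1 for x in chosen if x >= rho)
                W[rho] += w / P
    return W


# Toy data, one time step each (hypothetical values).
# Chosen objects mirror the available ones: uniform attachment.
UNIFORM = [([1, 1, 2, 3], [1, 1, 2, 3])]
# High-rho objects are over-represented among the chosen ones.
PREFERENTIAL = [([1, 1, 2, 3], [3, 3, 2, 1])]

if __name__ == "__main__":
    print(accumulate_W(UNIFORM, [1, 2, 3]))       # flat
    print(accumulate_W(PREFERENTIAL, [1, 2, 3]))  # increasing
```

In the first toy data set the chosen objects have the same $\rho$ distribution as the available ones, so $W$ takes the same value (4.0) at every $\rho$. In the second, large $\rho$ values are over-represented among the chosen objects and $W$ increases with $\rho$ (4.0, 6.0, 8.0).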
We have tested the above method on simulated graphs grown with known attachment mechanisms: i) uniform attachment (new nodes are attached to a randomly selected old node), ii) linear preferential attachment (new nodes are attached to old ones with a probability proportional to the degree), and iii) anti-preferential attachment (new nodes are attached to the old ones with a probability proportional to $\exp(-d)$, where $d$ is the degree). In these cases the degree $d$ of the individual nodes plays the role of the parameter $\rho$. For each time step, we recorded the cumulative degree distribution of the nodes $P_t(d)$, together with the number of nodes gaining new links with a degree higher than a given $d$, labelled by $w_{t\to t+1}(d)$. By summing the ratio of these two functions along the time evolution of the system one gets $W(d) = \sum_{t=0}^{t_{\max}-1} w_{t\to t+1}(d)/P_t(d)$. In Fig. 3a we show the empirical results for $W(d)$ obtained for the simulated networks grown with the three different attachment rules. The curves reflect the difference between the three cases very well: for the uniform attachment probability $W(d)$ is flat, for the preferential attachment $W(d)$ is clearly increasing, and for the anti-preferential attachment $W(d)$ is decreasing.

We have also calculated the attachment statistics of the nodes in the studied co-authorship network. In this case we used extremely asymmetric decay characteristics in (1): $\lambda_- = \infty$ and $\lambda_+ = 0$. This results in a simply growing network, where every collaboration event gives rise to a set of links between each pair of collaborators at the very moment of the collaboration act, and the strength of these links remains constant from then on. As can be seen in Fig. 3b, the corresponding $W(d)$ curve is increasing, therefore preferential attachment is present at the level of nodes in the system.

Fig. 3 a) The $W(d)$ function for networks grown with known attachment rules: uniform probability (squares), linear preferential attachment (open circles), and anti-preferential attachment (diamonds). b) The $W(d)$ function in the co-authorship network of the Los Alamos cond-mat archive. Figure from [56].

3.2.2 Community growth in the co-authorship network

The two properties to be substituted in place of $\rho$ in Eq. 2 are the community degree $d^{\mathrm{com}}$ and the community size $s$; therefore, the cumulative community size distribution $P_t(s)$ and the cumulative community degree distribution $P_t(d^{\mathrm{com}})$ were recorded at each time step $t$. To study the establishment of the new community links, we constructed the un-normalised cumulative size distribution $w_{t\to t+1}(s)$ and the un-normalised cumulative degree distribution $w_{t\to t+1}(d^{\mathrm{com}})$ of the communities gaining new community links to previously unlinked communities. The value of these distributions at a given $s$ (or given $d^{\mathrm{com}}$) is equal to the number of unlinked communities at $t$ that establish a community link between $t$ and $t+1$ with a community larger than $s$ (or having larger degree than $d^{\mathrm{com}}$) at $t$. By accumulating the ratio of the rate-like variables and the corresponding distributions we obtain

$$ W(s) = \sum_{t=0}^{t_{\max}-1} \frac{w_{t\to t+1}(s)}{P_t(s)}, \qquad W(d^{\mathrm{com}}) = \sum_{t=0}^{t_{\max}-1} \frac{w_{t\to t+1}(d^{\mathrm{com}})}{P_t(d^{\mathrm{com}})}. \tag{3} $$

For the investigation of the appearance of new members in the communities, we recorded the un-normalised community size distribution $\hat{w}_{t\to t+1}(s)$ and the un-normalised community degree distribution $\hat{w}_{t\to t+1}(d^{\mathrm{com}})$ of the communities gaining new members (belonging previously to none of the communities) between $t$ and $t+1$.
The corresponding distributions that can be used to detect deviations from the uniform attachment are

$$ \hat{W}(s) = \sum_{t=0}^{t_{\max}-1} \frac{\hat{w}_{t\to t+1}(s)}{P_t(s)}, \qquad \hat{W}(d^{\mathrm{com}}) = \sum_{t=0}^{t_{\max}-1} \frac{\hat{w}_{t\to t+1}(d^{\mathrm{com}})}{P_t(d^{\mathrm{com}})}. \tag{4} $$

In Fig. 4a we show the empirical $W(s)$ and $\hat{W}(s)$ functions, whereas in Fig. 4b the empirical $W(d^{\mathrm{com}})$ and $\hat{W}(d^{\mathrm{com}})$ are displayed. All four functions are clearly increasing, therefore we can draw the following important conclusions:

- When a previously unlinked community establishes a new community link, communities with large size and large degree are selected with enhanced probability from the available other communities.
- When a node previously belonging to none of the communities joins a community, communities with large size and large degree are selected with enhanced probability from the available communities.

Fig. 4 a) The $W(s)$ and $\hat{W}(s)$ functions for the communities of the co-authorship network of the Los Alamos cond-mat e-print archive. b) The $W(d^{\mathrm{com}})$ and $\hat{W}(d^{\mathrm{com}})$ functions of the same network. The increasing nature of these functions indicates preferential attachment at the level of communities in the system. Figure from [56].

3.2.3 Model for growth of community network

In this section we outline a simple model for the growth of overlapping communities. Our goal is to demonstrate that preferential attachment of the nodes to communities with respect to the community size, together with minor additional assumptions, is enough for the emergence of a community system with scaling community size and community degree distributions. In our model the underlying network between the nodes is left unspecified; the focus is on the content of the communities. During the time evolution, similarly to the models published in [48, 28, 58], new members may join the already existing communities, and new communities may emerge as well. The new nodes introduced to the system choose their community preferentially with respect to the community size, therefore the size distribution of the communities is expected to develop into a power law. The appearance of the new community links originates in new nodes joining several communities at the same time. The detailed rules of the model are the following:

- The initial state of the model is a small set of communities with random sizes.
- The new nodes are added to the system separately. For each new node $i$, a membership $m_i$ is drawn from a Poissonian distribution with an expectation value of $\mu$.
- If $m_i \geq 1$, communities are subsequently chosen with probabilities proportional to their sizes, until $m_i$ is reached, and the node $i$ joins the chosen communities simultaneously.
- If $m_i = 0$, the node $i$ joins the group of unclassified vertices.
- When the ratio $r$ of the group of unclassified nodes compared to the total number of nodes $N$ exceeds a certain limit $r^*$, a number of $q$ vertices from the group establish a new community. (Obviously, $q$ must be smaller than $Nr^*$ even in the initial state.)

To be able to compare the results of the model with the community structure of the co-authorship network, the runs were stopped when the number of nodes in the model reached the size of the co-authorship network. Our experience showed that the model is quite insensitive to changes in $r^*$ or $q$, and $\mu$ is the only important parameter. For small values ($\mu < 0.3$) the resulting community degree distribution is truncated, whereas when $\mu$ is too large ($\mu > 1$), a giant community with abnormally large community degree appears. For intermediate $\mu$ values ($0.3 < \mu < 1$), the community size and community degree distributions become fat tailed, similarly to the co-authorship network. In Fig. 5 we show the cumulative community size distribution $P(s)$ and the cumulative community degree distribution $P(d^{\mathrm{com}})$ of the communities obtained in our model at $\mu = 0.6$. (Changes in the parameters $r^*$ and $q$ only shift these curves; their shape remains unchanged.) Our model grasps the relevant statistical properties of the community structure in the co-authorship network [53] quite well: the community size distribution and the tail of the community degree distribution follow a power law with the same exponent.

Fig. 5 a) The cumulative community size distribution $P(s)$ (open circles) in our model at $\mu = 0.6$ follows a power law with an exponent of 1.4 (straight line). b) The cumulative community degree distribution $P(d^{\mathrm{com}})$ (filled circles) in our model at the same $\mu$. The tail of this distribution follows the same power law as the community size distribution (straight line), similarly to the communities found in the co-authorship network [53]. Figure from [56].

3.3 The static communities

Turning back to the study of the community evolution (where links corresponding to abandoned social connections may disappear with time, $0 < \lambda_- = \lambda_+ < \infty$), the communities at each time step were extracted with the CPM for both the co-authorship and the phone-call network. When applied to weighted networks, the CPM has two parameters: the k-clique size $k$ (in Fig. 6a-b we show the communities for $k = 4$), and the weight threshold $w^*$ (links weaker than $w^*$ are ignored). By increasing $k$ or $w^*$, the communities start to shrink and fall apart, but at the same time they also become more cohesive. In the opposite case, at low $k$ there is a critical $w^*$, under which a giant community appears in the system that smears out the details of the community structure by merging (and making invisible) many smaller communities. The criterion used to fix these parameters is based on finding a community structure as highly structured as possible: at the highest $k$ value for which a giant community may emerge, $w^*$ is decreased just below the critical point. The actual values of these parameters in our studies were $k = 3$, $w^* = 0.1$ in case of the co-authorship network, and $k = 4$, $w^* = 1.0$ in case of the phone-call network. In Fig. 6a-b we show the local structure at a given time step in the two networks in the vicinity of a randomly chosen individual (marked by a black frame).
The communities (social groups represented by more densely interconnected parts within a network of social links) are coloured with different shades of gray, so that white nodes (and dashed edges) do not belong to any community, and those that simultaneously belong to two or more communities are shown in black. The two networks have rather different local structure: due to its bipartite nature, the collaboration network is quite dense and the overlap between communities is very significant, whereas in the phone-call network the communities are less interconnected and are often separated by one or more inter-community nodes/edges. Indeed, while the phone record captures the communication between two people, the publication record assigns to all individuals that contribute to a paper a fully connected clique. As a result, the phone data is dominated by single links, while the co-authorship data has many dense, highly connected neighbourhoods. Furthermore, the links in the phone network correspond to instant communication events, capturing a relationship as it happens. In contrast, the co-authorship data records the results of a long-term collaboration process. These fundamental differences suggest that any common features of the community evolution in the two networks potentially represent generic characteristics of community formation, rather than being rooted in the details of the network representation or data collection process.
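The static snapshots discussed in this section rest on the link weight of Eq. (1) combined with the weight threshold. The following Python sketch shows how such a thresholded weight could be evaluated for a single pair of nodes. It is an illustration only, not the authors' code: the function names `link_weight` and `link_present`, the representation of events as `(t_i, w_i)` pairs, and the example numbers are all assumptions. Since the text defines the step function only for negative and positive arguments, an event occurring exactly at `t` is assumed to contribute its full weight once.

```python
import math


def link_weight(events, t, lam_plus, lam_minus):
    """Evaluate the link weight of Eq. (1) at time t.

    events:    list of (t_i, w_i) pairs for one pair of nodes
               (event time and event weight, w_i > 0).
    lam_plus:  decay constant applied after an event (t > t_i).
    lam_minus: decay constant applied before an event (t < t_i).
    """
    total = 0.0
    for t_i, w_i in events:
        if t >= t_i:
            # Assumption: an event exactly at t counts once, with full weight.
            total += w_i * math.exp(-lam_plus * (t - t_i) / w_i)
        else:
            total += w_i * math.exp(-lam_minus * (t_i - t) / w_i)
    return total


def link_present(events, t, lam_plus, lam_minus, w_star):
    """A link is kept in the snapshot at time t if it is not weaker than w_star."""
    return link_weight(events, t, lam_plus, lam_minus) >= w_star


if __name__ == "__main__":
    # Hypothetical single event of weight 2 at t = 10, symmetric decay.
    events = [(10.0, 2.0)]
    for t in (8.0, 10.0, 12.0):
        print(t, link_weight(events, t, 1.0, 1.0),
              link_present(events, t, 1.0, 1.0, w_star=1.0))
    # Simply growing network: no weight before the event, constant afterwards.
    print(link_weight(events, 5.0, 0.0, math.inf),
          link_weight(events, 50.0, 0.0, math.inf))
```

With symmetric decay the weight peaks at the event and falls off equally on both sides, so with a threshold of 1 the link exists only in a window around the event. Setting `lam_plus = 0` and `lam_minus = math.inf` gives zero weight before the event and a constant weight afterwards, which corresponds to the simply growing network used for the attachment statistics.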

Fig. 6 a) The local community structure at a given time step in the vicinity of a randomly selected node in case of the co-authorship network. b) The same picture in the phone-call network. Figure from [52].

3.4 Validating the communities

When validating the communities found, as a first step it is important to check if the uncovered communities correspond to groups of individuals with a shared common activity pattern. For this purpose we compared the average weight of the links inside communities, $w_c$, to the average weight of the inter-community links, $w_{ic}$. For the co-authorship network $w_c/w_{ic}$ is about 2.9, while for the phone-call network the difference is even more significant, since $w_c/w_{ic} \approx 5.9$, indicating that the intensity of collaboration/communication within a group is significantly higher than with contacts belonging to a different group [21, 8, 50, 51].

While for coauthors the quality of the clustering can be directly tested by studying their publication records in more detail, in the phone-call network personal information is not available. In this case the zip-code and the age of the users provide additional information for checking the homogeneity of the communities. In Fig. 7a we show the size of the largest subset of people having the same zip-code in the communities, $\langle n_{\mathrm{real}} \rangle$, averaged over the time steps, as a function of the community size $s$, divided by $\langle n_{\mathrm{rand}} \rangle$, representing the average over random sets of users. The significantly higher number of people with the same zip-code in the CPM communities as compared to random sets indicates that the communities usually correspond to individuals living relatively close to each other. It is of specific interest that $\langle n_{\mathrm{real}} \rangle / \langle n_{\mathrm{rand}} \rangle$ has a prominent peak at community size 35, suggesting that communities of this size are geographically the most homogeneous ones.
However, as Fig. 7b shows, the situation is more complex: on average, the smaller communities are more homogeneous, but there is still a noticeable peak at [...]. In Fig. 7a we also show the average size of the largest subset of members with an age falling into a three-year-wide time window, divided by the same quantity obtained for randomly selected groups of individuals. The fact that the ratio is larger than one indicates that communities have a tendency to contain people from the same generation, and the $\langle n_{\mathrm{real}} \rangle / s$ plot indicates that the homogeneity of small groups is on average larger than that of the big groups.

Fig. 7 a) The black symbols correspond to the average size of the largest subset of members with the same zip-code, $\langle n_{\mathrm{real}} \rangle$, in the phone-call communities divided by the same quantity found in random sets, $\langle n_{\mathrm{rand}} \rangle$, as a function of the community size $s$. Similarly, the white symbols show the average size of the largest subset of community members with an age falling in a three-year time window, divided by the same quantity in random sets. The error bars in both cases correspond to $\langle n_{\mathrm{real}} \rangle / (\langle n_{\mathrm{rand}} \rangle + \sigma_{\mathrm{rand}})$ and $\langle n_{\mathrm{real}} \rangle / (\langle n_{\mathrm{rand}} \rangle - \sigma_{\mathrm{rand}})$, where $\sigma_{\mathrm{rand}}$ is the standard deviation in case of the random sets. b) The $\langle n_{\mathrm{real}} \rangle / s$ as a function of $s$, for both the zip-code (black symbols) and the age (white symbols). Figure from [52].

Fig. 8 a) The probability distribution of the age difference between community members in the phone-call network. The most probable values are zero and 25, indicating that a pair of members from a community are most likely to be of the same age, or to be a generation apart from each other. b) The number of communities divided by the average number of random sets containing the same number $n_u$ of people using a given service. Each sample of the random sets was prepared with the size distribution of the communities determined for the phone-call network. Figure from the Suppl. of [52].

Another interesting feature of Fig. 7 is that the difference in the homogeneity of the age is less pronounced than in case of the zip-code.
A plausible reason for this effect is that, due to the strong social relations between parents and children, many communities contain members coming from different generations. This is supported by the distribution of the age difference in communities, shown in Fig.8a: there is a

major peak at zero, corresponding to members with the same age; however, there is also another peak at 25, corresponding to the typical age difference between parents and children.

Besides the zip-code and the age, the statistics of the service usage of the customers support the validity of the communities as well. In our primary data, the number of times people have used a certain service in one of the two-week-long periods was also available. (There were altogether 34 available services for the customers.) However, for most services, the probability that a randomly selected customer uses the service at all is very low. For this reason, instead of comparing the average number of members using the same service in communities and random sets, we compared the number of communities having $n_u$ members using the same service, $N_u^{\mathrm{com}}(n_u)$, to the same quantity in random sets, denoted by $N_u^{\mathrm{rand}}(n_u)$. For each service, random sets with the same size distribution as the communities were constructed …000 times, and $N_u^{\mathrm{rand}}(n_u)$ was averaged over the samples. As can be seen from Fig.8b, for 13 services the number $N_u^{\mathrm{com}}(n_u)$ of communities having $n_u$ members using the service is significantly larger than in the case of random sets. In fact, the $N_u^{\mathrm{com}}(n_u)/N_u^{\mathrm{rand}}(n_u)$ ratio in some cases reaches infinity, indicating that there were no random sets at all containing such a high number of service users as some communities. In summary, the phone-call communities uncovered by the CPM tend to contain individuals living in the same neighbourhood and of comparable age, a homogeneity that supports the validity of the uncovered community structure.

4 Evolving communities

Our focus is on the statistical properties of evolving communities; therefore, we need a reliable method for matching the static snapshots of the community structure at subsequent time steps. The basic events that may occur in the life of a community are shown in Fig.9: a community can grow by recruiting new members, or contract by losing members; two (or more) groups may merge into a single community, while a large enough social group can split into several smaller ones; new communities are born and old ones may disappear.

Fig. 9 Possible events in the community evolution (growth, contraction, merging, splitting, birth and death between time steps t and t+1). When new members are introduced, the community grows, whereas leaving members cause a decay in the size. Communities can merge and split, new groups may emerge and old ones can disappear. Figure from [52].

Given the huge number of groups present at each time step, it is a significant algorithmic and computational challenge to match communities uncovered at different time steps. The fact that the communities obtained by the CPM can overlap makes the problem even more complicated. A simple approach would be to match communities from consecutive time steps in descending order of their relative overlap. The relative overlap between communities $A$ and $B$ can be defined as

$$ C(A,B) \equiv \frac{|A \cap B|}{|A \cup B|}, \qquad (5) $$

where $|A \cap B|$ is the number of common nodes in $A$ and $B$, and $|A \cup B|$ is the number of nodes in the union of the two communities. However, the nodes shared between the communities can undermine this type of community conjugation between consecutive time steps: if a small community $A$ is inflated by a large magnitude between time steps $t$ and $t+1$, and at $t+1$ it overlaps with a small static community $B = B_t = B_{t+1}$, then the relative overlap (5) between $A_{t+1}$ and $B_t$ can be larger than the relative overlap between $A_{t+1}$ and $A_t$. To overcome this difficulty, we refine the identification of communities as shown in Fig.10. For each pair of consecutive time steps $t$ and $t+1$ we construct a joint graph consisting of the union of links from the corresponding two networks, and extract the CPM community structure of this joint network (we thank I. Derényi for pointing out this possibility). When new links are introduced in a network, the CPM communities may remain unchanged, they may grow, or a group of CPM communities may become joined into a single community; however, no CPM community may decay by losing members.
From this it follows that if we merge two networks, any CPM community in any of the original networks will be contained in exactly one community of the joint network. Let us denote the set of communities from $t$ by $\mathcal{A}$, the set of communities from $t+1$ by $\mathcal{B}$, and the set of communities from the joint network by $\mathcal{V}$. For any community $A_i \in \mathcal{A}$ or $B_j \in \mathcal{B}$ we can find exactly one community $V_k \in \mathcal{V}$ containing it. When matching the communities in $\mathcal{A}$ and in $\mathcal{B}$, first, for every community $V_k \in \mathcal{V}$ in the joint system we extract the lists of communities $A^k_i \in \mathcal{A}$ and $B^k_j \in \mathcal{B}$ that are contained in $V_k$ (this means $A^k_i \subseteq V_k$ and $B^k_j \subseteq V_k$). (Note that either of the lists may be empty.) Then the relative overlap between every possible $(A^k_i, B^k_j)$ pair can be obtained as

$$ C^k_{ij} = \frac{|A^k_i \cap B^k_j|}{|A^k_i \cup B^k_j|}, \qquad (6) $$

and we match the pairs of communities in descending order of their relative overlap.
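The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: communities are given as sets of node ids, the communities of the joint network are computed elsewhere (the CPM itself is not implemented here), and the names `jaccard` and `match_communities` are hypothetical helpers, not part of the original method's code.

```python
from itertools import product


def jaccard(a, b):
    """Relative overlap |a & b| / |a | b| of two node sets."""
    return len(a & b) / len(a | b)


def match_communities(comms_t, comms_t1, joint_comms):
    """Match communities of step t to those of step t+1.

    All arguments are lists of frozensets of node ids; joint_comms is
    assumed to hold the communities of the joint (union) network.
    Returns (matches, ended, born) as lists of indices.
    """
    matches, used_a, used_b = [], set(), set()
    for v in joint_comms:
        # Communities of either step that lie inside this joint community.
        inside_a = [i for i, a in enumerate(comms_t) if a <= v]
        inside_b = [j for j, b in enumerate(comms_t1) if b <= v]
        # Candidate pairs in descending order of relative overlap.
        pairs = sorted(
            product(inside_a, inside_b),
            key=lambda ij: jaccard(comms_t[ij[0]], comms_t1[ij[1]]),
            reverse=True,
        )
        for i, j in pairs:
            if i not in used_a and j not in used_b:
                matches.append((i, j))
                used_a.add(i)
                used_b.add(j)
    ended = [i for i in range(len(comms_t)) if i not in used_a]
    born = [j for j in range(len(comms_t1)) if j not in used_b]
    return matches, ended, born


# Toy example: two communities at t are contained in one community at t+1.
a1 = frozenset(range(0, 6))    # 6 nodes
a2 = frozenset(range(6, 15))   # 9 nodes
b1 = frozenset(range(0, 15))   # 15 nodes
print(match_communities([a1, a2], [b1], [b1]))  # ([(1, 0)], [0], [])
```

In the toy example the larger of the two earlier communities is matched to the merged one, and the smaller one is reported as ended. Pairs are matched greedily, which is one straightforward reading of "in descending order of their relative overlap".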

Fig. 10 Simple scenarios in the community evolution of the phone-call network for k = 4. The communities at t are coloured black on light gray, the communities at t+1 are coloured white on light gray, and the communities in the joint network are coloured dark gray on light gray. a) A community simply propagates, b) the larger community swallows the smaller one, c) a small community is detached from a larger one. Figure from the Suppl. of [52].

As an illustration of the above process, in Fig.10 we show three simple scenarios occurring in the community evolution of the phone-call network. In Fig.10a both lists $A^k_i$ and $B^k_j$ consist of only a single community, therefore these can be matched right away. However, in Fig.10b the $A^k_i$ list contains two elements: let us denote the smaller community of size $s = 6$ at $t$ by $A^k_1$ and the larger community consisting of nine nodes at $t$ by $A^k_2$. The corresponding $B^k_j$ list contains a single community $B^k_1$ having 15 members. The relative overlaps between the communities are $C^k_{1,1} = 2/5$ and $C^k_{2,1} = 3/5$. Since the relative overlap $C^k_{2,1}$ of $B^k_1$ with $A^k_2$ is larger than its relative overlap $C^k_{1,1}$ with $A^k_1$, we assign $B^k_1$ to $A^k_2$. As a consequence, the $A^k_1$ community comes to the end of its life at $t$, and it is swallowed by $A^k_2$. The opposite process is shown in Fig.10c: in this case the $A^k_i$ list consists of a single community $A^k_1$ of size $s = 15$, whereas the $B^k_j$ list has two elements, the community with six members labelled $B^k_1$ and the community containing ten nodes labelled $B^k_2$. The relative overlaps are $C^k_{1,1} = 2/5$ and $C^k_{1,2} = 2/3$; therefore $A^k_1$ is matched to $B^k_2$, and $B^k_1$ is treated as a newborn community.
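The quoted overlap values can be checked by hand under one assumption that the text does not state explicitly, namely that each smaller community lies entirely inside the 15-node community it is compared with. The intersection is then the smaller set and the union is the larger one, so the relative overlap reduces to a ratio of sizes:

```latex
\begin{align*}
\text{merging case:} \quad & C^k_{1,1} = \tfrac{6}{15} = \tfrac{2}{5}, \quad C^k_{2,1} = \tfrac{9}{15} = \tfrac{3}{5}, \\
\text{splitting case:} \quad & C^k_{1,1} = \tfrac{6}{15} = \tfrac{2}{5}, \quad C^k_{1,2} = \tfrac{10}{15} = \tfrac{2}{3}.
\end{align*}
```

In both cases the larger of the two competing communities wins the match, simply because it covers a larger share of the 15-node community.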
In general, whenever the community $V_k$ contains more communities from $\mathcal{A}$ than from $\mathcal{B}$, the communities $A^k_i$ left with no counterpart from $B^k_j$ finish their life at $t$, and when $V_k$ contains more communities from $\mathcal{B}$ than from $\mathcal{A}$, the communities $B^k_j$ left with no counterpart from $A^k_i$ are considered as newborn communities.

In some cases we can observe that although a community was disintegrated, after a few steps it suddenly reappears in the network. Our conjecture is that this is more likely to be the consequence of a temporarily lower publishing rate or calling rate of the people in question than of a real disassembly and re-assembly of the corresponding social community. Therefore, whenever a newborn community includes a formerly disintegrated one, the last state of the old community is elongated to fill the gap before the reappearance, and the newborn community is treated as the continuation of the old one, as shown in Fig.11.

Fig. 11 a) A community is disintegrated after step $t_5$, and it is reborn at step $t_8$. b) We treat the community as if it was alive at steps $t_6$ and $t_7$ too, with the same nodes as at step $t_5$. Figure from the Suppl. of [52].

5 Statistical properties of the community dynamics

5.1 Basic statistics

One of the most basic properties characterising the partitioning of a network is the overall coverage of the community structure, i.e. the ratio of nodes contained in at least one community. In the case of the co-authorship network the average value of this ratio was above 59%, which is a reasonable coverage for the CPM. In contrast, we could only achieve a significantly smaller ratio for the phone-call network. At such a large system size, in order to be able to match the communities at subsequent time steps in reasonable time, we had to decrease the number of communities by choosing higher k and w parameters (k = 4 and w = 1.0), and by keeping only the communities having a size larger than or equal to s = 6. Therefore, in the end the ratio of nodes contained in at least one community was reduced to 11%.
However, this still means more than … customers in the communities on average, providing a representative sampling of the system. By lowering k to k = 3, the fraction of nodes included in the communities is raised to 43%. Furthermore, a significant number of additional nodes can also be classified into the discovered communities. For example, if a node not yet classified has link(s) only to a single community (and has no links connecting to nodes in any other community), it can be safely added to that community. Carrying out this process iteratively, the fraction of nodes

that can be classified into communities increases to 72% for the k=3 co-authorship network, and to 72% (61%) for the k=3 (k=4) mobile phone network, which, in principle, allows us to classify over 2.4 million users into communities.

Fig. 12 a) The cumulative community size distribution P(s) in the phone-call network at different time steps. b) The time evolution of the cumulative community size distribution in the co-authorship network. c) The number of communities of a given size, N(s), at different time steps in the phone-call network. d) The time evolution of the number of communities with a given size in the co-authorship network. Figure from [55].

Another important statistic describing the community system is the community size distribution. In Fig.12a we show the community size distribution in the phone-call network at different time steps. They all resemble a power law with a high exponent. In the case of t = 0, the largest communities are somewhat smaller than in the later time steps. This is due to the fact that the events before the actual time step cannot contribute to the link weights in the case of t = 0, whereas they can if t > 0. In Fig.12b we can follow the time evolution of the community size distribution in the co-authorship network. In this case t = 0 corresponds to the birth of the system itself as well (whereas in the case of the phone calls it does not); therefore the network and the communities in it are small in the first few time steps. Later on, the system is enlarged, and the community size distribution stabilises close to a power law. In Fig.12c-d we show the number of communities as a function of the community size at different time steps in the examined systems. For the phone-call network (Fig.12c), this distribution is more or less constant in time.
In contrast (due to the growth of the underlying network), we can see an overall growth in the number

of communities with time in the co-authorship network (Fig.12d). Since the number of communities drops to only a few at large community sizes in both systems, we used size binning when calculating the statistics shown in Fig.… and Fig.17. As for evolving communities, we first consider two basic quantities characterising a community: its size s and its age τ, representing the time passed since its birth. s and τ are positively correlated: larger communities are on average older (Fig.13a), which is quite natural, as communities are usually born small, and it takes time to recruit new members to reach a large size.

Fig. 13 a) The average age $\langle \tau(s) \rangle$ of communities with a given size s (number of people), divided by the average age of all communities $\langle \tau \rangle$, as a function of s, indicating that larger communities are on average older. b) The average auto-correlation function $\langle C(t) \rangle$ of communities with different sizes s (the unit of time, t, is one month). The C(t) of larger communities decays faster. Figure from [52].

Next we used the auto-correlation function, C(t), to quantify the relative overlap between two states of the same community A(t) at t time steps apart:

$$ C_A(t) \equiv \frac{|A(t_0) \cap A(t_0+t)|}{|A(t_0) \cup A(t_0+t)|}, \qquad (7) $$

where $|A(t_0) \cap A(t_0+t)|$ is the number of common nodes (members) in $A(t_0)$ and $A(t_0+t)$, and $|A(t_0) \cup A(t_0+t)|$ is the number of nodes in the union of $A(t_0)$ and $A(t_0+t)$. Fig.13b shows the average time-dependent auto-correlation function for communities born with different sizes. We find that in both networks, the auto-correlation function decays faster for the larger communities, indicating that the membership of the larger communities is changing at a higher rate. On the contrary, small communities change at a smaller rate, their composition being more or less static.
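As a rough illustration of the auto-correlation function above, the following Python sketch computes it for a community stored as a list of membership sets, one per time step. The data layout and the function names are assumptions made for the example, and the first entry of each list is taken to be the state at birth.

```python
def autocorrelation(states, t):
    """C_A(t) for one community; states[0] is its node set at birth."""
    first, later = states[0], states[t]
    return len(first & later) / len(first | later)


def mean_autocorrelation(histories, t):
    """Average C(t) over the communities that lived at least t more steps."""
    values = [autocorrelation(h, t) for h in histories if len(h) > t]
    return sum(values) / len(values) if values else float("nan")


# A nearly static small community and a larger one with high turnover.
small = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3, 4}]
large = [{1, 2, 3, 4, 5, 6}, {3, 4, 5, 6, 7, 8}]

print(autocorrelation(small, 2))  # 0.75
print(autocorrelation(large, 1))  # 0.5
```

Averaging such values over all communities born with a similar size gives curves of the kind shown in the figure; the faster the membership is replaced, the faster the curve decays.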
5.2 Stationarity and lifetime

According to the results of Sect.5.1, a difference can be observed in the versatility of small and large communities. To quantify this aspect of community evolution, we define the stationarity ζ of a community as the average correlation between

subsequent states:

$$ \zeta \equiv \frac{\sum_{t=t_0}^{t_{\max}-1} C(t,t+1)}{t_{\max}-t_0-1}, \qquad (8) $$

where $t_0$ denotes the birth of the community, and $t_{\max}$ is the last step before the extinction of the community. In other words, 1 − ζ represents the average ratio of members changed in one step; larger ζ corresponds to smaller change (more stationary membership). We observe a very interesting effect when we investigate the relationship between the lifetime τ (the number of steps between the birth and disintegration of a community), the stationarity and the community size. The lifetime can be viewed as a simple measure of fitness: communities having higher fitness have an extended life, while the ones with small fitness quickly disintegrate, or are swallowed by another community. In Fig.14a-b we show the average life-span $\langle \tau \rangle$ as a function of the stationarity ζ and the community size s (both s and ζ were binned). In both networks, for small community sizes the highest average life-span is at a stationarity value very close to one, indicating that for small communities it is optimal to have static, time-independent membership. On the other hand, the peak in $\langle \tau \rangle$ is shifted towards low ζ values for large communities, suggesting that for these the optimal regime is to be dynamic, i.e., to have a continually changing membership. In fact, large communities with a ζ value equal to the optimal ζ for small communities have a very short life, and similarly, small communities with a low ζ (being optimal at large sizes) disappear quickly as well.

Fig. 14 a) The average life-span $\langle \tau \rangle$ of the communities as a function of the stationarity ζ and the community size s for the co-authorship network. The peak in $\langle \tau \rangle$ is close to ζ = 1 for small sizes, whereas it is shifted towards lower ζ values for large sizes. b) Similar results found in the phone-call network. Figure from [52].

To illustrate the difference in the optimal behaviour (a pattern of membership dynamics leading to extended lifetime) of small and large communities, in Fig.15
we show the time evolution of four communities from the co-authorship network. As Fig.15 indicates, a typical small and stationary community undergoes minor changes, but lives for a long time. This is well illustrated by the snapshots of the community structure, showing that the community's stability is conferred by a core

of three individuals representing a collaborative group spanning over 52 months. While new co-authors are added occasionally to the group, they come and go. In contrast, a small community with a high turnover of its members (several members abandon the community at the second time step, followed by three new members joining in at time step three) has a lifetime of nine time steps only (Fig.15b). The opposite is seen for large communities: a large stationary community disintegrates after four time steps (Fig.15c). In contrast, a large non-stationary community whose members change dynamically, resulting in significant fluctuations in both size and composition, has a quite extended lifetime (Fig.15d). Indeed, while the community undergoes dramatic changes, gaining (Fig.15e) or losing a high fraction of its membership, it can easily withstand these changes.

Fig. 15 Time evolution of four communities in the co-authorship network. The height of the columns corresponds to the actual community size, and within one column the light gray colour indicates the number of old nodes (that have been present in the community at least in the previous time step as well), while newcomers are shown with black. The members abandoning the community in the next time step are shown with mid gray colours, the shade depending on whether they are old or new. (This latter type of member joins the community for only one time step.) From top to bottom, we show a small and stationary community (a), a small and non-stationary community (b), a large and stationary community (c) and, finally, a large and non-stationary community (d). A mainly growing stage (two time steps) in the evolution of the latter community is detailed in panel e). Figure from [52].
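The stationarity used throughout this subsection can be illustrated with a short Python sketch. It averages the relative overlap over all consecutive pairs of snapshots of one community. Dividing by the number of consecutive pairs is an assumption about how the index range of the definition is counted, chosen here so that a completely static community gets ζ = 1; the function name and the list-of-sets layout are likewise hypothetical.

```python
def stationarity(states):
    """Mean relative overlap between consecutive snapshots of a community.

    states is a list of node sets, one per time step in which the
    community was alive (assumed layout for this example).
    """
    if len(states) < 2:
        raise ValueError("need at least two snapshots")
    overlaps = [len(a & b) / len(a | b) for a, b in zip(states, states[1:])]
    return sum(overlaps) / len(overlaps)


static = [{1, 2, 3}] * 4
churning = [{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}]

print(stationarity(static))    # 1.0
print(stationarity(churning))  # close to 1/3
```

In the second example half of the members are replaced at every step, so each consecutive overlap is 2/6 and 1 − ζ, the average share of membership that changes per step, is about 2/3.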

5.3 Predicting community break up

The quite different stability rules followed by small and large communities raise an important question: could an inspection of the community itself predict its future? To address this question, for each member in a community we measured the total weight of this member's connections to outside of the community ($w_{\mathrm{out}}$) as well as to members belonging to the same community ($w_{\mathrm{in}}$). We then calculated the probability that the member will abandon the community as a function of the $w_{\mathrm{out}}/(w_{\mathrm{in}} + w_{\mathrm{out}})$ ratio. As Fig.16a shows, for both networks this probability increases monotonically, suggesting that if the relative commitment of a user to individuals outside a given community is higher, then it is more likely that he/she will leave the community.

Fig. 16 a) The probability $p_l$ for a member to abandon its community in the next step as a function of the ratio of its aggregated link weights to other parts of the network ($w_{\mathrm{out}}$) and its total aggregated link weight ($w_{\mathrm{in}} + w_{\mathrm{out}}$). The inset shows the average time spent in the community by the nodes, $\langle \tau_n \rangle$, as a function of $w_{\mathrm{out}}/(w_{\mathrm{in}} + w_{\mathrm{out}})$. b) The probability $p_d$ for a community to disintegrate in the next step as a function of the ratio of the aggregated weights of links from the community to other parts of the network ($W_{\mathrm{out}}$) and the aggregated weights of all links starting from the community ($W_{\mathrm{in}} + W_{\mathrm{out}}$). The inset shows the average lifetime $\langle \tau \rangle$ of communities as a function of $W_{\mathrm{out}}/(W_{\mathrm{in}} + W_{\mathrm{out}})$. Figure from [52].

In parallel, the average time spent in the community by the nodes, $\langle \tau_n \rangle$, is a decreasing function of the above ratio (Fig.16a inset). Individuals that are the most likely to stay are those that commit most of their time to community members, an effect that is particularly prominent for the phone network.
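The member-level ratio discussed above is simple to compute once the weighted links are available. The sketch below assumes, purely for illustration, that undirected links are stored in a dictionary keyed by two-element frozensets and that there are no self-loops; the function name is hypothetical.

```python
def outside_ratio(node, community, weights):
    """w_out / (w_in + w_out) for one member of a community.

    weights maps frozenset({u, v}) -> link weight (assumed layout,
    undirected links, no self-loops).
    """
    w_in = w_out = 0.0
    for link, w in weights.items():
        if node not in link:
            continue
        (other,) = link - {node}  # the node at the other end of the link
        if other in community:
            w_in += w
        else:
            w_out += w
    total = w_in + w_out
    return w_out / total if total else float("nan")


community = {"a", "b", "c"}
weights = {
    frozenset({"a", "b"}): 3.0,
    frozenset({"a", "c"}): 1.0,
    frozenset({"a", "x"}): 4.0,  # link leaving the community
    frozenset({"b", "c"}): 2.0,
}

print(outside_ratio("a", community, weights))  # 0.5
print(outside_ratio("b", community, weights))  # 0.0
```

Binning members by this ratio and counting how many of them are gone at the next time step would give an empirical estimate of the leaving probability as a function of outside commitment.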
As Fig.16a shows, those with the least commitment have a quickly growing likelihood of leaving the community. Taking this idea from individuals to communities, we measured for each community the total weight of links (a measure of how much a member is committed) from the members to others, outside of the community ($W_{\mathrm{out}}$), as well as the aggregated link weight inside the community ($W_{\mathrm{in}}$). We find that the probability for a community to disintegrate in the next step increases as a function of $W_{\mathrm{out}}/(W_{\mathrm{in}} + W_{\mathrm{out}})$ (Fig.16b), and the lifetime of a community decreases with the $W_{\mathrm{out}}/(W_{\mathrm{in}} + W_{\mathrm{out}})$ ratio (Fig.16b inset). This indicates that self-focused communities have a significantly longer lifetime than those that are open to the outside world. However, an

interesting observation is that, while the lifetime of the phone-call communities for moderate levels is relatively insensitive to outside commitments, the lifetime of the collaboration communities possesses a maximum at intermediate levels of inter-collaboration (collaboration between colleagues who belong to different communities). These results suggest that tracking the relative commitment of the individuals, as well as of the community as a whole, to the other members of the community provides a clue for predicting the community's fate.

5.4 Merging of communities

Finally, we investigate a special aspect of the merging process between communities. During such events, a pair (or a larger group) of initially distinct communities join together and form a single community. A very interesting question connected to this is whether we can find a simple relation between the size of a community and the likelihood that it will take part in such a process. To investigate this issue we carried out measurements similar to those in [56] and presented in Section …. The basic idea is that if the merging process is uniform with respect to the size of the communities, then communities with a given s are chosen at a rate given by the size distribution of the available communities. However, if the merging mechanism prefers large (or small) sizes, then communities with large (or small) s are chosen at a higher rate compared to the size distribution of the available communities. To monitor this enhancement we used the indicator function defined in Eq.2, substituting the size-pair object $\rho = (s_1, s_2)$. At each time step t the cumulative size-pair distribution $P_t(s_1, s_2)$ was recorded. Simultaneously, the un-normalised cumulative size-pair distribution of the communities merging between t and t+1 was constructed; we shall denote this distribution by $w_{t \to t+1}(s_1, s_2)$. The value of this rate-like variable at a given value of $s_1$ and $s_2$ is equal to the number of pairs of communities that merged between t and t+1 and had sizes larger than $s_1$ and $s_2$, respectively.
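The two cumulative quantities introduced above can be tabulated directly from lists of community sizes. The Python sketch below is only one possible reading: it assumes that each pair is ordered with the larger size first, that the cumulative distribution is the fraction of all community pairs exceeding both thresholds, and that the indicator is obtained by summing the ratio of the two quantities over time steps. All function names and the input layout are hypothetical.

```python
from itertools import combinations


def cumulative_pair_fraction(sizes, s1, s2):
    """Fraction of community pairs with larger size > s1 and smaller size > s2."""
    pairs = [(max(a, b), min(a, b)) for a, b in combinations(sizes, 2)]
    if not pairs:
        return 0.0
    hits = sum(1 for big, small in pairs if big > s1 and small > s2)
    return hits / len(pairs)


def merged_pair_count(merged_pairs, s1, s2):
    """Un-normalised count of merged pairs exceeding the two thresholds."""
    return sum(1 for a, b in merged_pairs if max(a, b) > s1 and min(a, b) > s2)


def merging_indicator(steps, s1, s2):
    """Sum over time steps of (merged count) / (available fraction).

    steps is a list of (sizes, merged_pairs) tuples, one per time step.
    Steps in which no available pair exceeds the thresholds are skipped.
    """
    total = 0.0
    for sizes, merged_pairs in steps:
        p = cumulative_pair_fraction(sizes, s1, s2)
        if p > 0:
            total += merged_pair_count(merged_pairs, s1, s2) / p
    return total


# One time step: four communities, and the two largest ones merge.
steps = [([6, 6, 12, 20], [(20, 12)])]
low = merging_indicator(steps, 10, 5)
high = merging_indicator(steps, 10, 10)
print(low < high)  # True: the toy merge is concentrated at large sizes
```

In this toy case the same single merging event is divided by a smaller available fraction at the higher thresholds, so the indicator grows with size, which is the signature of a preference for large communities.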
Here the resulting indicator function

$$ W(s_1, s_2) \equiv \sum_{t=0}^{t_{\max}-1} \frac{w_{t \to t+1}(s_1, s_2)}{P_t(s_1, s_2)} \qquad (9) $$

is defined on a two-dimensional plane. When the merging process is uniform with respect to the community size, $W(s_1, s_2)$ becomes a flat function: on average we see pairs of communities merging with sizes $s_1$ and $s_2$ at a rate equal to the probability of finding a pair of communities of these sizes. However, if the merging process prefers large (or small) communities, then pairs with large (or small) sizes merge at a higher rate than the probability of finding such pairs, and $W(s_1, s_2)$ becomes increasing (or decreasing) with the size. The reason for using the un-normalised $w_{t \to t+1}(s_1, s_2)$ distribution is that in this way each merging event contributes to $W(s_1, s_2)$ with equal weight, and the time steps with a lot of merging events count more than those with only a few events. In the

opposite case (when $w_{t \to t+1}(s_1, s_2)$ is normalised for each pair of subsequent time steps t, t+1), the merging events occurring between time steps with a lot of other merging events are suppressed compared to the events with only a few other parallel events, as each pair of consecutive time steps t, t+1 contributes to the $W(s_1, s_2)$ function with equal weight. This difference between normalised and un-normalised $w_{t \to t+1}(s_1, s_2)$ becomes important in the case of the co-authorship network, where in the beginning the system is small and merging is rare, and later on, as the system develops, merging between communities becomes a regular event. In Fig.17 we show $W(s_1, s_2)$ for both networks, and the picture suggests that large sizes are preferred in the merging process. This is consistent with our finding that the content of large communities is changing at a faster rate compared to the small ones. Swallowing other communities is an efficient way to bring numerous new members into the community in just one step; therefore taking part in merging is beneficial for large communities following a survival strategy based on constantly changing their members.

Fig. 17 The merging of communities. a) The $W(s_1, s_2)$ function for the co-authorship network, b) the $W(s_1, s_2)$ function for the phone-call network, c) the region with smaller $W(s_1, s_2)$ in (a) enlarged, d) the region with smaller $W(s_1, s_2)$ in (b) enlarged. Figure from the Suppl. of [52].

Another interesting aspect of the results shown in Fig.17 is that they are analogous to the attachment mechanism of links between already existing nodes in collaboration networks [5]: the probability for a new link to appear between two nodes with degrees $d_1$ and $d_2$ is roughly proportional to $d_1 d_2$. Similarly, the probability that two communities of sizes $s_1$ and $s_2$ will merge is proportional to $s_1 s_2$,