PROXIMITY, INTERACTIONS, AND COMMUNITIES IN SOCIAL NETWORKS: PROPERTIES AND APPLICATIONS.


By Tommy Nguyen

A Thesis Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY
Major Subject: COMPUTER SCIENCE

Examining Committee:
Boleslaw K. Szymanski, Thesis Adviser
Sibel Adalí, Member
James A. Hendler, Member
Gyorgy Korniss, Member
Mohammed J. Zaki, Member

Rensselaer Polytechnic Institute
Troy, New York
October 2014 (For Graduation December 2014)

CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENT
ABSTRACT

1. INTRODUCTION
  Ranking Information in Social Networks
  Small Worlds and Social Stratification
  Summary of Contributions & Organization
  Organization

2. LITERATURE REVIEW
  Ranking Techniques
  Web Conceptualization
  User Data & Trust Models
  Learning to Rank
  Small-world Problem
  Six Degrees of Separation
  Social Stratification

3. SOCIAL NETWORK ANALYSIS
  Geography, Co-Appearance, & Interactions
  Data Collection
  Notations & Definitions
  Data Analysis & Results
  Limitations
  Incorporating Geography into Community Detection
  Clique Percolation Method
  Modularity Maximization
  Speaker-Label Propagation (GANXiS)
  Contrasting Communities to Null Models
  Techniques for Generating Covers
  3.3.2 Measuring Covers & Communities
  Examining Covers in Gowalla
  Examining Detected Communities
  Network Community Profile (NCP)
  Link Connectivity Measurements
  Face-to-Face Interactions Measurements
  Application: Social Relationships & Human Mobility
  Network Congestion in MANETs
  Mobility Generation
  Experimental Congestion Design
  Congestion Simulation Results
  Application: Long Ties & Economic Development
  A Stochastic Model of Economic Development
  Experimental Results & Discussion
  Summary of Results

4. SOCIAL RANKING TECHNIQUES
  Google Buzz & Twitter
  Categories of URLs
  Spreaders & Affected Sets
  Information Distances
  Geographical Distances
  Densities of Social Relationships
  Keyword Similarity
  Social Ranking Techniques
  PageRank on Social Network
  HITS on Social Network
  Ranking with Maximum Flow
  Variants of Maximum Flow
  Social Ranking Experiments
  Comparing PageRank & HITS
  Flow Ranking
  Rank Differences
  Rank Distributions
  Rank Validation
  Summary of Results

5. SOCIAL SEARCHING EXPERIMENTS
  Attrition, Geography, & Communities
  Modeling Attrition
  Geographical Analysis
  Detecting Communities
  Experimental Design
  Routing Strategies
  Starter & Target Selections
  Experimental Results
  Selection & Routing Combinations
  Friends-of-Friends Knowledge Densities
  Distributions of Successful Chains
  Effects of Hubs and Connectors
  Individual and Community Prominence
  Summary of Results

6. CONCLUSION AND FUTURE WORK

REFERENCES

LIST OF TABLES

1.1 Aspects of SNA & applications
Data summary of Gowalla network
Six techniques for generating covers
Measurements for cover C of the size k
Detected communities and their sizes
Measuring spatial conductance
Measuring face-to-face interactions
Network simulator ns-2 parameters
Measuring economic development (Gowalla)
Measuring economic development (FourSquare)
Data summary of Google Buzz
Data summary of Twitter
Google Buzz (left) & Twitter (right) with geography
Social relationships densities in Google Buzz
Social relationships densities in Twitter
Ranking results of 3 popular URLs in Google Buzz
Ranking results of 3 random URLs in Google Buzz
Avg. ranking differences in Google Buzz
Avg. ranking differences in Twitter
Summaries of online social networks datasets
Communities detected by GANXiS
Prominence of individuals and communities
Experimental results for Gowalla
Experimental results for FourSquare
Aspects of SNA & applications

LIST OF FIGURES

3.1 Geographical spread of 1K checkins in Gowalla
Friendship is bounded by geographical distance
Densities of pairs as a function of geographical distance
Measuring face-to-face interactions (t_ɛ = 3mins, d_ɛ = 1km)
Generating CTA & FTA covers
Intra-edge count, boundary-edge count, and geographic diameter of covers
Contraction, expansion, conductance, and geographic distance of covers
Communities detected by Clique Percolation Method
Communities detected by Inference Algorithm
Communities detected by GANXiS
Measuring face-to-face interactions among members
Generating a Markov Model using checkins
Design of simulation overview
Traffic congestion in FMM and RWP
Frequency of pauses using the RWP
Scaling laws of short and long ties
Face-to-face interactions of short ties and long ties
The collective strength of long ties in a simple contagion model
Distribution of long ties for adopters and non-adopters
Economic development as a function of idea flow (Gowalla)
Economic development as a function of idea flow (FourSquare)
Speedy idea flow as a function of social diversity
Conceptualization of social ranking
Categories of popular (a,c) and random (b,d) URLs
4.3 Shortest paths to URLs in Google Buzz (a) and Twitter (b)
Ultra small-world property from starters to information
Densities of shortest path lengths from starters to URLs
Two degrees of spatial concentration
Four dimensions of social relationships
CKS for friendship, following, peers, and random pairs
Graph G_p for ranking URLs {u_1, u_2} with respect to node p
Ranking URLs on Google Buzz
Ranking URLs on Twitter
Social ranking with popular URLs on Google Buzz
Social ranking with random URLs on Google Buzz
Social ranking with popular URLs on Twitter
Social ranking with random URLs on Twitter
Densities of rank correlation coefficient
Ranking quality results
Stratification graph of communities in Gowalla
Distributions of shortest path lengths & average path lengths
Densities of geographical distances
Friends-of-friends knowledge densities
Path length of successful chains & drop rates
Effects of routing to connectors & hubs
Prominence of individuals & communities on reachability
Prominence of individuals & communities correlations

ACKNOWLEDGMENT

I would like to thank everyone who mentored me during my undergraduate and graduate studies. This dissertation would not have been possible without their guidance.

First, I would like to thank my dissertation chair for his guidance, ideas, and intellectual contributions to this dissertation. From seeking research problems to career planning, he was always encouraging and supportive throughout my graduate studies. To quote a previous graduate student, his pleasant and friendly personality made this graduate study more enjoyable. I would also like to thank the committee members for providing their feedback and helping me organize the structure of this thesis.

Second, I would like to thank the entire staff of the CS department. Ms. Coonrad and Ms. Hayden are always responsive to my questions regarding classes, graduation requirements, etc., even when there are hundreds of questions from other students. Mr. Lindsay is always around and ready to help whenever a server crashes. It was always a pleasure to interact with them throughout my graduate studies.

Last but not least, I would like to acknowledge the graduate students and postdocs in our center and computer science department. Some of them are talented scientists and experts in their areas of research; others are going to become experts one day. They make me feel proud of being a member of our center and an alumnus of the university.

ABSTRACT

Social network analysis, in the form of network theory where nodes represent humans and edges represent social relationships between them, has a wide range of applications in information science, political science, social science, economics, and other fields. The availability of data from location-based social media such as Gowalla and FourSquare has helped scientists model and analyze human relationships and their interactions. In this thesis, we use such data to analyze multiple dimensions of social relationships in terms of three specific aspects: geographical proximity of nodes, their face-to-face interactions, and the structure of their communities. Then we incorporate these three aspects of social relationships into the following applications.

First, we propose techniques for analyzing human relationships in terms of geographical proximity, face-to-face interactions, and communities. We show how geographical proximity shapes the structure of the social network by limiting face-to-face interactions among distant users. We also incorporate the geographical locations that users visited into a few community detection algorithms for the purpose of detecting communities whose members are on average separated by a few friendship links, are close to each other geographically, and are likely to interact with each other face-to-face. These aspects of social network analysis enable the study of the first two applications: human mobility patterns and the spread of ideas.

Second, we use URLs that people share with their followers on social media to personalize the ranking of information by looking at who follows whom, the geographical location of the users, and the structure of their detected communities. This allows us to analyze how social media tunnels the flow of information in the network. More importantly, personalized ranking based on these aspects allows users to see information through the eyes of other users whom they consider important (neighbors, friends, peers, etc.) and provides an opportunity for them to interact with information that was used by the people they care about, resulting in the third application studied in this thesis.

Finally, we replicate the small-world experiment by emulating the process of searching for targets by routing a folder among their acquaintances. Geographical information and community structure allow us to selectively choose starters and targets based on the knowledge of where users are located and to which community they belong. In addition, we examine various routing strategies based on geographical proximity and community structure that were likely used by participants in the small-world experiment to reach a target. In doing so, we discover which combinations of routing strategies and selection techniques are likely to make the small-world experiment successful, in terms of the small number of hops required to reach the target and the percentage of such successful chains, resulting in the last application studied in this thesis.

CHAPTER 1
INTRODUCTION

Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, Social Ranking Techniques for the Web, in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Ontario, 2013, pp
Portions of this chapter have been submitted as: T. Nguyen et al., Small Worlds and Social Stratification, PLoS ONE, (under review).

Social network analysis examines human relationships in terms of graph theory, where nodes represent humans and edges represent their social relationships. In addition, social network analysis can examine the geographical proximity of the nodes, their face-to-face interactions, and the structure of their detected communities. This thesis examines these three aspects of social network analysis in detail.

Within the last five years, the proliferation of smartphones has provided a new type of social networking where people can share their current location with their friends and tag the activities that they are doing. This new type of social networking has provided a much richer dataset of human behavior because geographical locations and face-to-face interactions were not previously available. More importantly, it provides a bridge that connects the digital world with the physical world, where physical aspects of human behavior such as proximity and face-to-face interactions are recorded and shared instantly.

Before location-based social media, scientists used the CDRs (call detail records) of telephone companies to study spatial properties, infer friendship topology, and guess face-to-face interactions. However, a problem with CDRs is that call volume is not a good proxy for friendship because people also make phone calls to order food, request technical support, seek medical help, and so on. More importantly, using calling patterns to infer friendship is biased towards strong ties, since weak ties are by definition those that are contacted infrequently; hence using CDRs to infer friendship leaves out an important dimension of social relationships in the study of social network analysis.

Therefore, location-based social media is valuable for the study of social network analysis because it provides a network that is embedded into physical space - the

surface of the earth, and its nodes - humans, are constantly moving. In addition, the links have different characteristics depending on the frequency of interactions. The questions that immediately arise are what the ramifications are of embedding this type of social graph in physical space, and what roles ties (weak/strong, long/short) play in human behavior. The data collected from Gowalla and FourSquare allows the investigation of these issues, which are studied in detail in this thesis.

Chapter 3 addresses the issue of face-to-face interactions and finds that friendship still requires both face-to-face interactions and geographical proximity. Moreover, the desire to interact face-to-face motivates strong ties to travel together, impacting human mobility patterns with ramifications for transportation traffic and wireless bandwidth infrastructure management (one of the applications studied in Chapter 3). However, this does not mean that weak ties are unimportant. The last section of Chapter 3 shows that weak ties that are geographically distant tunnel the flow of ideas and are a strong predictor of economic development in the US in terms of GDP, patents, and startups.

Chapter 4 returns to strong ties and examines the social influence that people have on each other in terms of interests, geographical distance, and communities. Chapter 4 exploits this influence to improve the relevancy of responses to queries by individualizing them for users based on the ranking of web pages shared on social networks. The evidence of increased relevancy reported in this thesis may indicate the level of influence that friends exert on the interests of others.

Chapter 5 expands the last section of Chapter 3 by examining how the spatial embedding of social networks, long-distance ties, and communities underlie strategies of social search. These aspects of social network analysis are used to examine whether social networks are small-world, stratified, or both simultaneously. Results show that while social networks have small topological path lengths, there is no evidence that people with limited knowledge can find a designated target within a small number of hops, even when attrition is completely eliminated.

1.1 Ranking Information in Social Networks

Over the last decade, scientists examined the structure of the web [1]-[4] and proposed algorithms to rank web pages based on significance and relevance to a given

query [5]-[9]. One conceptualization of the web looks at patterns in the topology of hyperlinks among web pages to separate prominent websites that serve as authorities for trusted information from malicious pages created by spammers [10]. This conceptualization of the web eliminates the complexity of textual analysis and creates a pot-pourri of information that gets incorporated into search engines or other information retrieval systems for the purpose of finding information on personal computers, mobile devices, and any other computing platforms [10]. In the case of a search engine, billions of web pages containing rich context are organized so that end users can find their target quickly. This need for speed makes ranking crucial in information retrieval systems. Ranking also has many other applications in the social sciences, such as the citation analysis of legal and scientific documents [11].

Advances in social network analysis and the proliferation of online social media have provided a different perspective for examining ranking [12]-[18]. Algorithms for ranking and organizing information in hybrid networks, such as social search engines, show promising improvements when social network analysis is incorporated into them; for example, incorporating personal information containing social relationships on G+ to personalize search results on Google. As social media continues to expand, we want to be able to use techniques from social network analysis to personalize the ranking of information for a given user. This is important because social relevance allows users to see information through the eyes of other users whom they consider important and provides an opportunity for them to interact with the information accessed by the people they care about.

Social media such as Twitter and Google Buzz can be characterized as web services that allow users to share information with their followers. While a lot of research has been devoted to examining text in hashtags and messages [19]-[21], we focus on URLs because the information contained in URLs is not restricted by a length limitation, is less likely to be informally written, and contains less slang and fewer abbreviations. Analyzing URLs provides a unique opportunity to infer the interests of users based on their reading habits. We assume that the URLs people share concentrate on selected topics of interest to them. It is important to notice that our purpose here is not to rank a set of URLs based on a given query but instead to rank a

set of URLs based on whether we think a user is likely to engage with the information contained within the URLs. Such engagement could be clicking, commenting, resharing, or spending time reading them.

The problem we want to solve is to provide a framework for ranking URLs shared on social media based on social relationships, where some URLs are ranked higher if they are shared via certain types of social relationships. The social relationships we examine for ranking URLs include but are not limited to neighbors (nodes that are within geographical proximity [22]) and peers (nodes that are within a detected community [23]). The literature review on this subject is provided in Chapter 2 (Section 1) and the contribution is discussed in Chapter 6.

Some data-driven questions that we examine are whether pairs of users that are geographically close are more likely to have similar interests than pairs that are distant, and whether reciprocal relationships have higher keyword similarity in web pages than non-reciprocal relationships. We also examine the densities of friends, peers, neighbors, and people with similar interests, since these social relationships are the building blocks for understanding social relevance.

1.2 Small Worlds and Social Stratification

Data scientists have recently calculated the distribution of the shortest path lengths between randomly selected pairs of users in online social networking sites and confirmed that the majority of people are on average within six degrees of separation (e.g., 4.7 in Facebook [24], 2.7 in MySpace [25], 4.2 in Twitter [26], and so on [27]). However, empirical research on social stratification, such as racial segregation and income inequality, undermines the premise that we live in a small world where short paths connect people with culturally and economically diverse backgrounds.
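The degrees-of-separation figures quoted above come from measuring shortest path lengths between randomly chosen pairs of users. A minimal sketch of that measurement follows, assuming an undirected graph stored as adjacency lists; the function names and the toy graph are hypothetical and are not taken from the cited studies, which worked at a far larger scale with their own methods.

```python
import random
from collections import deque


def bfs_distance(graph, source, target):
    """Return the hop count of a shortest path from source to target, or None."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def sample_path_lengths(graph, num_pairs, seed=0):
    """Shortest-path lengths for randomly sampled pairs of distinct nodes."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    lengths = []
    for _ in range(num_pairs):
        source, target = rng.sample(nodes, 2)
        dist = bfs_distance(graph, source, target)
        if dist is not None:  # skip pairs that lie in different components
            lengths.append(dist)
    return lengths


# Hypothetical toy friendship graph: a simple path a-b-c-d-e.
toy_graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
lengths = sample_path_lengths(toy_graph, num_pairs=200)
average_degrees_of_separation = sum(lengths) / len(lengths)
```

Sampling pairs rather than computing all-pairs distances is the usual compromise on large graphs; note that a short average here says nothing about whether a person with only local knowledge could actually find such a path.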
In [28], Kleinfeld mentioned that Beck and Cadamagnani were unable to replicate the small-world experiment with high success rates when they attempted to reach a high-income target starting from a low-income person, suggesting that the world we live in is divided by wealth as a result of income inequality.

Before the availability of data from online social networking sites, Milgram and his colleagues performed an experiment to demonstrate the small-world phenomenon

by recruiting randomly selected starters from Nebraska and Oklahoma to reach a broker in Boston [29]. In their experiment, starters were asked to mail a folder to an acquaintance whom they knew on a first-name basis and who would be likely to reach the target using the fewest hops. The process repeats until the chain stops, either when the folder eventually reaches the target or when its current holder drops out of the experiment for lack of qualified acquaintances or unwillingness to participate. Hence, the expected number of hops required for a starter to successfully reach a target is an upper bound on, and also a loose estimate of, the length of the shortest path connecting them. Travers and Milgram reported that 64% of the chains successfully reached the designated target within 5.2 hops [29], suggesting that the diameter of the network of social connections is small.

The problem we want to solve is finding out whether the network of our social connections is small, stratified, or both simultaneously. We investigate this problem by replicating the process of routing a folder from selected starters to randomly chosen targets, using data containing the geographical locations and social relationships of hundreds of thousands of users from location-based social media. The advantage of incorporating large-scale and multi-dimensional data into the small-world experiment is that many aspects of the experiment can be controlled, such as determining how to strategically route a folder between acquaintances and having real data on who is actually connected to whom for hundreds of thousands of users. Unlike other social experiments that require incentives for human subjects to participate, we can control the effect of participation by supposing that everyone who receives a chain letter participates in the experiment once, since long chains are not likely to exist at the average participation rate of 37% reported by Dodds et al. [30] (e.g., 0.37^5 < 0.01).
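As a rough check on the attrition arithmetic above, suppose each holder forwards the folder independently with probability p = 0.37. The independence assumption is made here only for illustration; the source states just the average rate. The probability that a chain survives n consecutive hand-offs is then:

```latex
\[
P(\text{chain survives } n \text{ hand-offs}) = p^{n},
\qquad
0.37^{5} \approx 0.0069 < 0.01 .
\]
```

Under that assumption, fewer than one chain in a hundred would survive five hand-offs, which is why removing attrition in simulation is needed to observe long chains at all.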
These advantages of the data help us focus on how two factors of the experiment, the geographical locations and the community structure of users' connections, make it possible for social networks to be either small-world, stratified, or both simultaneously. Geographical proximity and community structure allow us to strategically route a folder between acquaintances and also to select starters and targets based on geographical distance or on a fixed number of community hops connecting them. We used community detection algorithms to partition a social network so that

starters and targets can be selected in the following ways. We define the network distance from the community of the starter C_s to the community of the target C_t as the length of the shortest path connecting nodes from C_s to C_t. The question we ask is how many hops it takes to reach a target t originating from a starter s if the length of the shortest path connecting their communities is fixed at k. When k is small, we expect to capture the small-world phenomenon, where it is easy to find short paths connecting people. On the other hand, when k >> 0, we expect that while short paths connecting people might exist, it is much harder to find them with the limited information available to the participants. This is due to the stratified nature of society, where some people have little social capital compared to others, making it difficult for people to reach targets outside of their communities and social class.

Besides the debate over whether we live in a small world or a stratified one, the techniques that participants in the experiment used to select an acquaintance have practical applications in search and rescue operations [31] and job searching via personal contacts [32]. Dodds et al. reported that the successful techniques used by the participants included forwarding the folder to a selected acquaintance such as a friend (67%), relative (10%), co-worker (9%), sibling (5%), significant other (3%), or others (6%), based on geographical proximity and occupation for at least half of the decisions [30]. In addition, the results from the small-world experiment led to an avalanche of network models that have certain properties resembling real social networks, such as a short diameter and a high clustering coefficient [33].
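The community distance k defined earlier in this section can be sketched with a multi-source breadth-first search. This is one reading of the definition, taken here as the minimum node-to-node hop count between any member of C_s and any member of C_t; the thesis may instead count hops in a graph whose nodes are whole communities. The function name and the toy network are hypothetical.

```python
from collections import deque


def community_distance(graph, community_s, community_t):
    """Shortest hop count from any node in community_s to any node in community_t.

    Returns 0 if the communities share a node and None if no path exists.
    """
    targets = set(community_t)
    visited = set(community_s)
    if visited & targets:
        return 0
    # Multi-source breadth-first search: every member of C_s starts at distance 0.
    queue = deque((node, 0) for node in visited)
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in targets:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical toy network: three communities joined in a chain.
toy_graph = {
    1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
    4: [3, 5], 5: [4, 6],
    6: [5, 7], 7: [6],
}
c_s, c_mid, c_t = {1, 2, 3}, {4, 5}, {6, 7}
k_adjacent = community_distance(toy_graph, c_s, c_mid)
k_far = community_distance(toy_graph, c_s, c_t)
```

Fixing k and then sampling starter and target pairs from communities at that distance is how such a measure could be used to compare routing success for nearby and distant communities.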
The literature review on this subject is included in Chapter 2 (Section 2) and the contribution is discussed in Chapter 6.

1.3 Summary of Contributions & Organization

First, this thesis collects terabytes of data that users shared on social media and analyzes their relationship dynamics in terms of three specific aspects: geography, face-to-face interactions, and communities. Such data allows us to analyze human behavior in terms of social network analysis, such as the interplay between interactions, geographical proximity, and community structure. An example of an interesting behavior we notice is that the creation of friendship between two people is more likely to occur when they are geographically close, and friends-of-friends are also more

likely than not to be within proximity of each other. Also, geography limits users' face-to-face interactions as well as their interests in terms of what they read on social media. For more details on the data analysis of human behavior and social relationships, see Chapter 3.

Second, this thesis proposes techniques for incorporating social relevance into the process of ranking URLs. Personalized ranking results using variants of network flow are highly independent from PageRank. The four dimensions of social relationships that we use for ranking URLs are friends, neighbors, peers, and users with similar interests. Results from the experiments show that social relevance can improve ranking quality by up to 19% compared to the baseline and 5% compared to PageRank. For more details on the personalization of information, see Chapter 4.

Third, this thesis examines the effects of social stratification in the small-world problem. Results show that while using both geographical and community information in modeling social routing for the small-world problem is more realistic than using either one alone, average path lengths are 3 times longer than in the Travers-Milgram experiments when attrition is eliminated. Community distance is more effective and robust than geographical distance at predicting the probability of reaching targets, in terms of average path lengths and percentage of successful chains. Finally, results show that prominent targets and targets in prominent communities can be reached much more quickly than average. Our results can be summarized as follows: the small-world property holds for the prominent, but everyone else is lost in the crowd except when being reached by members of their own community. For more details on the effects of stratification in searching for people, see Chapter 5.

1.3.1 Organization

Table 1.1: Aspects of SNA & applications.

                      Geography        Interactions       Communities
Human Mobility        Congestion       Communication      Group
Spreading Ideas       Long Ties        Weak Ties          Bridge Ties
Personalized Ranking  Geo. Influence   Peer Influ.        Collective Influ.
Small-world           Selection        Cognitive Biases   Routing

The organization of this thesis can be summarized by using Table 1.1. The

three aspects of social network analysis are the geographical proximity of nodes (Chapter 3 Section 1), their face-to-face interactions (Chapter 3 Section 1), and the structure of their communities (Chapter 3 Section 2). The four applications studied in this thesis are human mobility & congestion modeling (Chapter 3 Section 5), spreading ideas & economic development (Chapter 3 Section 6), personalized ranking (Chapter 4), and the small-world experiment (Chapter 5). Each element in Table 1.1 describes how the corresponding aspect of social network analysis can be used to analyze the corresponding application.

For the first application (human mobility), geography, in terms of the geographical proximity of friends, shows that human mobility traces can be used to study wireless bandwidth infrastructure management. As we later see, network congestion is concentrated in a few geographical locations, which impacts throughput when studying mobile ad-hoc networks. In Chapter 3 Section 5, face-to-face interactions are treated as analogous to establishing wireless connections, since the purpose of establishing connections in wireless networks is to communicate, and establishing a connection is only possible when nodes are within geographical proximity, just like face-to-face interactions. Last but not least, this can be extended to incorporate communities, where mobility traces are simulated based on a group of nodes belonging to the same community and moving together.

For the second application (spreading ideas), geography plays a role in distinguishing between short and long ties, where the effects of long ties are examined in simple contagion models for the purpose of measuring the economic development of large geographical areas. The analysis of face-to-face interactions shows that long ties are especially weak. In addition to long ties, ties that connect different communities are also examined in Chapter 3 Section 6.

For the third application (personalized ranking), three elements are incorporated into the process of ranking URLs. Geography allows selecting users based on geographical distance (neighbors). Reciprocal interactions in terms of social relationships (friends instead of followers) allow us to select nodes based on their interactions. Last but not least, community structures allow us to select nodes that belong to the same community.

For the last application (small-world), geography allows selecting a starter and

a target in the simulations based on their geographical distance. Face-to-face interactions could affect the statistics of average path lengths because the folder holder is likely to pass the folder to the next holder based on the number of their interactions, independent of the target. Finally, community structures allow the nodes in the simulations to pass the folder based on community awareness.

CHAPTER 2
LITERATURE REVIEW

Portions of this chapter have been submitted as: T. Nguyen et al., Small Worlds and Social Stratification, PLoS ONE, (under review).

This chapter provides a literature review on ranking techniques and the small-world problem.

2.1 Ranking Techniques

The literature review on ranking techniques is broken down into three parts. The first part looks at the conceptualization of the web (Sec. 2.1.1), the second part looks at incorporating more sources of data and modeling trust (Sec. 2.1.2), and the third part looks at data mining techniques for learning how to rank (Sec. 2.1.3).

2.1.1 Web Conceptualization

Early search engines rated information on the web by using the text embedded in the page rather than the hypertext, which contains information invisible to end users. Previous work on the ranking of web pages incorporated both text and hypertext to determine the rank of a page, since hypertext by itself does not contain information related to the query, and a page containing a lot of text is not necessarily authoritative [34]. In a sense, ranking pages by counting the number of inlinks is like voting, where the number of inlinks is the number of votes for a page, and additional textual analysis can be applied to a query to retrieve a subset of related pages ranked by the number of votes.

Advances came from Page and Brin when they devised an algorithm now known as PageRank to capture not only the number of inlinks, as in voting, but also the quality of those links [5]. The initial score of a web page is equal to 1/n, where n is the number of pages containing a link to that page. At the first iteration, each page sends its score divided by the number of its links pointing to other pages. Then each page replaces its current score with the sum of the scores that were sent to it by the pointing links. The process of sending and updating scores repeats until convergence

or a pre-defined number of iterations is reached. The final scores determined by PageRank are used to rank pages across the web graph. Kleinberg proposed a ranking algorithm known as HITS (Hypertext-Induced Topic Search) based on the idea that good hubs point to good authoritative pages and vice versa [35]. This query-dependent algorithm first retrieves a subset of pages that are related to a query. Then it applies an update technique to recalculate the scores of hubs and authorities, and it uses the scores of the authorities to rank the pages. Initially, the score of an authority is the number of backlinks coming from hubs, and the score of a hub is the sum of the scores of the authorities that it points to. At the second iteration, the algorithm updates the score of an authority by taking the sum of the scores of the hubs pointing to it. The score update is then repeated, and the algorithm stops after reaching some number of iterations. The Stochastic Approach for Link-Structure Analysis, abbreviated SALSA, was proposed by Lempel and Moran; it applies two independent random walks to a bipartite graph consisting of hubs and authorities [2]. Instead of repeatedly calculating and updating scores for hubs and authorities as is done in HITS, the number of times a page is visited by the surfer in the random walk is used to extrapolate the quality of the pages. The TKC (tightly knit community) effect arises when every hub points to every authority, forming a tightly knit community of hubs and authorities: the pages of such a community are scored relatively high even though some of them are not authoritative or relevant to the topic.

User Data & Trust Models

While link analysis of the web structure is a powerful tool for ranking pages, further algorithms and ideas emerged from different sources of data in which additional information about end users is taken into consideration.
For instance, how long on average do users stay on a page, and how often are two pages consecutively visited? BrowseRank is proposed to capture the number of page visits and the amount of time a user stays on a page modeled as a continuous time Markov process [8]. Another technique is taken from the principle of isolation or the disconnectivity of trustworthy pages from spam pages where trust is propagated from trustworthy pages to other trustworthy pages [6]. EdgeRank is proposed by

researchers from Facebook to consider the interactions between two people or social associates when ranking updated messages, photos, URLs, etc. on the news feed [36]. Last but not least, the annotations of web pages created by users on Delicious are used to rank pages in SocialSimRank by considering the structure of annotators and annotated pages [12].

Liu et al. proposed BrowseRank, a technique that uses personal data to rank pages; it relies on the browsing graph, in which vertices represent visited pages and edges between vertices represent transitions from one page to another [8]. The novelty of BrowseRank is that it incorporates the amount of time an average user stays on a page, which is an indicator of the page's quality and cannot be captured by discrete-time link analysis techniques such as PageRank, HITS, and SALSA. Also, as mentioned by the authors, the web graph is not the most reliable source of data because of its large size and decentralized architecture: spammers create link farms to increase the visibility of their pages, and webmasters constantly change the content of their pages. Empirical results suggest that BrowseRank outperforms PageRank when independently hired researchers evaluated the ranked pages according to a linear combination of relevance and importance.

The TrustRank algorithm proposed by Gyongyi et al. relies on the principle of isolation, under the assumption that it is unlikely for trustworthy pages to link to spam pages [6]. Seed detection is a process that determines a small set of pages to be evaluated, where these pages are likely to point to other trustworthy pages. First, a small set of seed pages is evaluated by using an oracle function to determine whether a page is trustworthy or not. In practice, the oracle function represents human judgment and would be too costly to use on a large set of pages.
Second, each trustworthy page propagates its trust to the pages that it points to, and the value of the trust is divided equally among all pointed-to pages. The propagation process repeats until convergence or until some predefined number of iterations is reached. Additional advances came from Facebook's interest in ranking items such as photos, messages, URLs, etc. on each individual news feed. In EdgeRank, the affinity score of two users, the weight of the posted item, and time decay are taken into consideration for the ranking of items on personalized news feeds [36]. The

affinity score of the viewing user and the item creator is calculated by looking at their online interactions; the more they have interacted, the more likely the item is to be shown or ranked higher. Time decay decreases the relevance of a posted item as time goes on, and the edge weight increases the score of items that have a high level of potential interaction such as photo albums, messages embedded with URLs, etc. In addition to EdgeRank, Bao et al. proposed SocialSimRank, which uses social annotations on Delicious to rank pages according to the observation that popular pages are annotated by up-to-date users and up-to-date users annotate popular pages [12]. The novelty of SocialSimRank comes from using the annotations of users to match search queries to the corresponding annotated pages and applying the PageRank algorithm to the annotated pages as a means to rank pages corresponding to the view of the annotator.

Learning to Rank

Learning to rank lies at the intersection of information retrieval and machine learning, where machine learning techniques are used to model the process of ranking documents. Techniques are based on the idea of computing a function that maximizes quality measures in ranking or minimizes the sum of differences between the computed function and human-defined ratings. The advantage of using machine learning techniques is that parameters in the proposed learning models are tuned automatically. In pointwise comparison, the objective is to minimize the difference between the calculated score of a document and its human-defined rating. In pairwise comparison, the objective is to determine whether the first document in a pair of documents is ranked higher than the second document or vice versa. One of the challenges in learning to rank is to go from pointwise to pairwise comparison, where the goal is to predict the ranking positions of two given documents.
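The difference between the two objectives can be illustrated with a minimal Python sketch. The function names, the squared-error form of the pointwise loss, and the toy ratings and scores are assumptions made for illustration only; they are not taken from any of the cited systems.

```python
from itertools import combinations


def pointwise_loss(scores, ratings):
    """Sum of squared differences between model scores and human ratings."""
    return sum((s - r) ** 2 for s, r in zip(scores, ratings))


def pairwise_inconsistency(scores, ratings):
    """Count document pairs whose score order contradicts their rating order."""
    inconsistent = 0
    for i, j in combinations(range(len(scores)), 2):
        if ratings[i] == ratings[j]:
            continue  # tied ratings express no preference
        # A pair is consistent only if both differences have the same sign.
        if (scores[i] - scores[j]) * (ratings[i] - ratings[j]) <= 0:
            inconsistent += 1
    return inconsistent


ratings = [3, 2, 1]           # hypothetical human-defined ratings
scores_a = [2.0, 1.9, 1.8]    # correct order, values far from the ratings
scores_b = [3.0, 1.4, 1.5]    # values closer to the ratings, one pair swapped

for name, scores in (("a", scores_a), ("b", scores_b)):
    print(name, pointwise_loss(scores, ratings),
          pairwise_inconsistency(scores, ratings))
```

In this toy example, scores_b has the smaller pointwise loss yet misorders one pair, while scores_a orders every pair correctly, which is why the two objectives can prefer different models.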
Another challenge is to optimize non-continuous and non-differentiable objective functions. Fortunately, previous work in the machine learning literature has developed techniques to handle such cases. RankNet learns how to rank pages by using a neural network with pairwise comparison [37], SoftRank approximates the non-continuous and non-differentiable objective function [9], and SVMRank uses support vector machines to minimize pairwise inconsistency [38].

In RankNet, Burges et al. proposed to use a two-layer neural network for learning the process of ranking pages [37]. Given a pair of pages represented as vectors, the ranking problem that the authors proposed is to compute the probability that the first page is ranked higher than or equal to the second page. One advantage in the learning stage is that the pairwise rankings need not be complete or even consistent, which accommodates missing information and noise in the data. First, they proposed using the cross-entropy cost function where ranking probabilities are modeled by using the logistic function. Second, they proposed using the backward propagation algorithm to calculate the weights and offsets in a two-layer neural network such that the difference between the computed function and human-defined ratings is minimized. They conducted their learning, testing, and validation experiments by using data from a proprietary search engine consisting of 17, searched queries where each query contains the top 1, ranked pages. A page is represented as a vector consisting of 569 features. Query-dependent features are extracted from the anchor text, URL representations, title, and content. The remaining features are taken from log files in the proprietary search engine [37]. Empirical results suggested that RankNet outperformed the other learning models (RankProp [39], PRank [4]) in the validation stage.

Taylor et al. proposed SoftRank, where the idea is to consider ranking scores as random variables, map score distributions to rank distributions, calculate the expected SoftNDCG (normalized discounted cumulative gain), and use gradient techniques to optimize parameters in a two-layer neural network with respect to SoftNDCG as a cost function. While it is possible to use the cost function proposed in RankNet, there are many other metrics in information retrieval such as MAP (mean average precision), precision, and NDCG that reflect the experience of end users.
As mentioned, using these metrics as objective functions for training is challenging: small parameter changes yield different scores, but ranking positions change only when one score passes another, which makes the function non-differentiable. SoftNDCG is a proposed metric based on the approximation of NDCG by mapping scores to random variables. Also, as in RankNet, backward propagation uses gradient techniques to optimize parameters in a two-layer neural network where the cost function is the approximated NDCG metric. Last but not least, SVMRank is an algorithm proposed by Joachims based on the

idea of using SVMs (support vector machines) to construct a function that maximizes the empirical Kendall's tau between the target function determined from click-through data and the system function computed by the SVM [38]. Click-through data provides constructive feedback on the ranking system, where a clicked URL implies an estimate of relevance relative to the query. While a clicked link does not represent absolute judgment, it provides useful insights about the ranking positions of the unclicked items. For instance, clicking on the link that is ranked 7th implies that the 7th link is more relevant to the query than the unclicked links ranked first through sixth. This motivates the use of pairwise comparison, where the objective is to minimize pairwise inconsistency between a computed function and the target function derived from click-through data.

2.2 Small-world Problem

This literature review on the small-world problem is broken down into two parts. The first part provides an overview of the small-world phenomenon in terms of six degrees of separation (Sec ). The second part looks at effects of inequality and stratification that undermine the small-world property (Sec ).

Six Degrees of Separation

Milgram and his colleagues proposed an experiment to demonstrate the small-world property by recruiting starters from Nebraska and Oklahoma to reach a broker in Boston [29]. Starters in the experiments were asked to mail a folder to an acquaintance who would be likely to reach the target quickly. Previous folder holders were recorded in the folder roster so that they would not be selected twice in a mail-forwarding chain. The process repeats until the chain stops, either when the folder reaches the target or when the current holder drops out of the experiment for various reasons. The expected number of hops required for a starter to successfully reach a target is an upper bound on the shortest path length connecting them.
Travers and Milgram reported that 64% of the chains successfully reached the designated target within 5.2 hops [29], which gave rise to the name six degrees of separation. The idea of six degrees of separation is that if we pick any two people on this planet, there are on average 5 unique individuals who are connected in such a way that the first person

knows the second person, who knows the third person, who eventually knows the last person. Besides the debate over whether we live in a small world or a stratified one, the techniques that were used by the participants in the experiment to select an acquaintance have practical applications in search and rescue operations [31] and job searching via personal contacts [32]. Dodds et al. reported that the successful techniques used by the participants included forwarding the folder to a selected acquaintance such as a friend (67%), relative (10%), co-worker (9%), sibling (5%), significant other (3%), or miscellaneous tie (6%), with geographical proximity and occupation driving at least half of the decisions [3]. In addition, the results from the small-world experiment led to an avalanche of network models that have certain properties resembling real social networks, such as a short diameter and a high clustering coefficient [33].

Social Stratification

Research in stratification, such as racial segregation in neighborhoods and income inequality, undermines the premise that we live in a small world. For instance, are there really short paths connecting random people together? What about people who are isolated from the rest of the world? Clearly, isolated people are much harder to reach than prominent individuals such as politicians, CEOs, religious leaders, celebrities, etc. In [28], Kleinfeld mentioned that Beck and Cadamagnani were unsuccessful in replicating the small-world experiment with high success rates when they attempted to reach a high-income target starting from a low-income person. This suggests that one cause of stratification is income inequality, where people are segregated into economic classes. This leads to the question: what are the elements that cause stratification? What attributes do we associate with other people?
Since people have an inclination to associate with people of the same ethnicity, cultural heritage, and economic class, how do such tendencies affect the small-world property? The small-world property has been accepted in the research literature because possible routing strategies have been proposed to show how people strategically make routing decisions. A routing strategy proposed by Kleinberg relies on participants passing the folder to the acquaintance who is closest in terms of geography to the

target [41]. This makes sense since people have the cognitive ability to remember where their acquaintances live. Also, it is common to have a few acquaintances who are geographically close and a few who are distant due to relocation for a new job, studying at a university, retiring, etc.
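The greedy geographic strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name greedy_route, the planar Euclidean distance, the max_hops cutoff, and the toy network are all hypothetical, and excluding previous holders mirrors the folder roster used in the original experiment rather than any specific published model.

```python
import math


def greedy_route(start, target, neighbors, position, max_hops=100):
    """Pass a folder hop by hop to the acquaintance closest to the target.

    neighbors maps a node to its acquaintances; position maps a node to
    (x, y) coordinates. Returns the chain of holders, or None if it breaks.
    """
    def dist(a, b):
        (x1, y1), (x2, y2) = position[a], position[b]
        return math.hypot(x1 - x2, y1 - y2)

    path = [start]
    current = start
    while current != target and len(path) <= max_hops:
        # Previous holders are excluded, as with the folder roster.
        candidates = [v for v in neighbors[current] if v not in path]
        if not candidates:
            return None  # the chain stops without reaching the target
        current = min(candidates, key=lambda v: dist(v, target))
        path.append(current)
    return path if current == target else None


# Hypothetical five-person network laid out along a line.
position = {"a": (0, 0), "b": (1, 0), "c": (4, 0), "d": (5, 0), "t": (6, 0)}
neighbors = {
    "a": ["b", "c"],   # "c" is a distant acquaintance of "a"
    "b": ["a", "d"],
    "c": ["a", "t"],
    "d": ["b", "t"],
    "t": ["c", "d"],
}
print(greedy_route("a", "t", neighbors, position))
```

The distant acquaintance "c" plays the role of a long tie: the starter prefers it over the nearby "b" because it is geographically closer to the target.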

CHAPTER 3
SOCIAL NETWORK ANALYSIS

Typically, social network analysis examines relationships among people in terms of graph theory, where nodes represent actors and edges represent their relationships. In this chapter, we examine three important aspects of social network analysis. The first is understanding the effect of geography, in terms of the location of actors, on the structure of the social network. The second is measuring the face-to-face interactions of the actors and their social relationships. The third is detecting hidden communities that are well-connected in terms of social relationships and highly active in terms of face-to-face interactions. We examine these three aspects in detail using data collected from a location-based social network called Gowalla. Besides ranking and searching, these three aspects of social network analysis can also be used to model human mobility in mobile ad-hoc networks (see Sec. 3.5) and predict the economic development of large geographical areas (see Sec. 3.6).

In section 3.1, we examine the geography, co-appearance, and interactions of users in Gowalla, focusing on the effect of geography on the structure of the network and on face-to-face interactions. In section 3.2, we incorporate geographical information about users into three selected community detection algorithms, namely a modified version of the Clique Percolation Method (CPM), the Inference Algorithm (IA), and GANXiS, to detect disjoint and overlapping communities that are well-connected in terms of social relationships and highly active in terms of face-to-face interactions. In section 3.3, we design an experiment in which we generate different types of covers by using a combination of social and geographic information. In section 3.4, we use quality measurements based on the link connectivity, geographical proximity, and physical interactions among members to examine detected communities as a function of their sizes, with covers as a baseline.
Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, Using Location-Based Social Networks to Validate Human Mobility and Relationships Models, in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Istanbul, 2012, pp

This chapter previously appeared as: T. Nguyen et al., Analyzing the Proximity and Interactions of Friends in Communities in Gowalla, in Proc. IEEE/ACM Int. Conf. Advances on Data Mining Workshops, Dallas, TX, 2013, pp

We conclude this chapter in section 3.7

with a summary of the results and potential applications that might benefit from the analysis of geography and spatially-aware community detection.

Figure 3.1: Geographical spread of 1K checkins in Gowalla.

3.1 Geography, Co-Appearance, & Interactions

Data Collection

We collected data from a location-based social networking provider called Gowalla that allowed people to use their internet-enabled and sensing-capable mobile phones to record and share their current location with their friends. By using Gowalla's API, we were able to retrieve 391,223 users with public profiles (friends and checkins) from mid-September 2011 to late October of that year. Unfortunately, Gowalla has since been purchased by Facebook and no longer operates independently. The data for FourSquare, Twitter, and Google Buzz were collected in a similar manner using breadth-first search. To collect the data, we start with a randomly chosen user and process all the public information available about that user. Then we store the IDs of the user's friends and put them into a processing queue in FIFO order. After that, we retrieve the next user from the queue and repeat the process. In other words, we crawled Gowalla breadth-first, a standard technique in the social networking literature often referred to as Breadth First Search (BFS) sampling. As shown in Table 3.1, the users accumulated a total of around 26 million checkins and 8 million friendship links. The average day of the checkins is 3.14 which

represents Wednesday. The earliest checkin is on Jan 21, 2009. The average distance between two consecutive checkins of a user is km. The average time interval between two consecutive checkins of a user is 6.41 days with a standard deviation of . The geographical spread of the checkins is shown in Fig. The checkins from Gowalla allow us to measure the face-to-face interactions between friends by inferring how often friends checked into the same location at approximately the same time.

Table 3.1: Data summary of Gowalla network.
(columns: x, σ, X)
Users 391,223
Checkins ,33,58
Friends ,176,384
Weekday Jan. 21, 2009
Distance ,565,644
Time

Notations & Definitions

Given a set of users $U$, let $u \in U$ be a particular user, $L_u$ be the set of its shared locations, known as checkins, and $F_u$ be the set of its friends. A shared location $l \in L_u$ of the user $u$ is a tuple of three elements denoted as $l_1$, $l_2$, and $l_3$ corresponding to the latitude, longitude, and timestamp of the location $l$, respectively. The friendship network denoted as $F = (U, E_U)$ is an undirected and unweighted graph where an edge represents reciprocal friendship; that is, $e = (u, u') \in E_U$ means $u \in F_{u'}$ and $u' \in F_u$. The geographic distance $d(u, u')$ between two users $u$ and $u'$ is estimated by averaging the locations in $L_u$ and $L_{u'}$ and using the haversine formula to calculate arc distances. The checkin similarity $CS(u, u')$ of users $u$ and $u'$ is defined as:

$$CS(u, u') = \frac{|L_u \cap L_{u'}|}{|L_u \cup L_{u'}|}. \qquad (3.1)$$

The level of physical interaction between users $u$ and $u'$, denoted as $I(u, u')$, is calculated from their shared locations as follows. Two locations $l \in L_u$ and $l' \in L_{u'}$ are equivalent if they are within geographic proximity, $d(l, l') < d_\epsilon$, and occurred within a time interval, $|l_3 - l'_3| < t_\epsilon$. Having two such equivalent locations $l \in L_u$ and $l' \in L_{u'}$ means we infer that $u$ and $u'$ have gone to the place $l$ together.

The maximum pairwise equivalence between $L_u$ and $L_{u'}$ is defined as the longest sequence of equivalent location pairs $((l_1, l'_1), \ldots, (l_k, l'_k))$, such that for each $1 \le i \le k$, $l_i \in L_u$, $l'_i \in L_{u'}$, and $l_i$ is equivalent to $l'_i$. The level of physical interaction $I(u, u')$ is defined as the length $k$ of the maximum pairwise equivalence divided by the size of the smallest location set:

$$I(u, u') = \frac{k}{\min(|L_u|, |L_{u'}|)}. \qquad (3.2)$$

Finding the maximum pairwise equivalence can be reduced to a network flow problem, where polynomial running time algorithms such as Ford-Fulkerson can be used to calculate the maximum number of matches.

Data Analysis & Results

Figure 3.2: Friendship is bounded by geographical distance. (Axes: distance similarity in log(km) versus checkin similarity; legend: Not Friends, Friends.)

In Fig. 3.2, there are 71 blue points that represent two randomly selected users who are friends and 62 red points that represent two randomly selected users who are not friends within the dataset. The shaded region is drawn by using the k-nearest neighbor algorithm for classifying whether two users are friends given their average distance apart and checkin similarity. In Fig. 3.2, we notice that co-appearance, represented by checkin similarity, is a poor indicator of friendship; that is, people who are temporarily in the same place at the same time are not likely to be friends. Intuitively, co-appearance happens often

at popular spots, like concerts and cafes, that attract people living in a great variety of locations. Even if a group of friends goes to a concert together, they would not be friends with the thousands of other attendees; hence, the chance that a random pair of attendees are friends is low. Occasional co-appearances are not sufficient, but geo-proximity helps in establishing and maintaining friendship, as seen in Fig.

Figure 3.3: Densities of pairs as a function of geographical distance. (Panels: (a) Hop=1-3, (b) Hop=4-6; axes: fraction versus avg. distance of separation in km.)

In Fig. 3.3, we plotted the density of friends (hop=1), friends-of-friends (hop=2), and pairs of users up to six degrees of separation as a function of the average geographic distance between two users in km. For each level $1 \le k \le 6$ of indirection (measured in the number of hops), we randomly selected 5, non-cyclic paths of length $k$ and created from the ends of these paths 5, pairs from the Gowalla dataset, each pair with $k$ levels of indirection of friendship. We analyzed pairs that were within 4, km distance from each other. In Fig. 3.3(a), the density of direct friends (4,317 total) reaches the highest value of 0.35 (in other words, 1,511 pairs) at the lowest geographic separation, in the range from 0 to 16 km (each point at distance x represents users with distances from x-16 km to x+16 km), and continues to decrease as the distance between them increases. At the second level of indirection, the density of friends-of-friends (3,464 total) achieves the highest value of 0.19 in the range from 0 to 16 km and continues to decrease as the geographic distance between them increases. Geographic proximity has an effect where friends (hop=1) and friends-of-friends (hop=2) are more likely, but not necessarily required, to be within proximity of each

other. For instance, 61% of friends are within 48 km and 47% of friends-of-friends are within 64 km of each other. Another way of looking at the results is that people who are separated by three or more hops are unlikely to be within geographic proximity of each other. In Fig. 3.3(b), we plotted pairs of users who are separated by four, five, and six hops. We noticed that they are not likely to be within geographic proximity of each other. The density of those pairs reaches the highest value of 0.7 at the 16 km range centered at 1,2 km and continues to decrease regardless of their degrees of separation.

Figure 3.4: Measuring face-to-face interactions ($t_\epsilon$ = 3mins, $d_\epsilon$ = 1km). (Panels: Hop=1, Hop=2; axes: level of interaction versus avg. distance of separation in km.)

In Fig. 3.4, we plotted the average level of face-to-face interactions $I(u, u')$ of friends (hop=1) and friends-of-friends (hop=2) as a function of their geographic distance in km. The larger the geographic distance between friends, the less likely they are to physically interact by going to the same places together. The highest peak (0.27) is at the lowest geographic separation, from 0 to 266 km, and the level continues to decrease gradually (with some small fluctuations) as the distance between them increases. For friends-of-friends, the physical interactions reflect the probability that they happened to be together.
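The interaction level of Eq. 3.2 can be sketched as follows. This is an illustrative reading rather than the implementation used for the measurements: it assumes that each checkin may be paired at most once, it finds the maximum matching with simple augmenting paths (the unit-capacity special case of Ford-Fulkerson), and the function names, the Earth radius constant, the thresholds, and the sample checkins are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (arc) distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def equivalent(l, m, d_eps_km, t_eps_s):
    """Checkins are (latitude, longitude, timestamp in seconds) tuples."""
    return (haversine_km(l[0], l[1], m[0], m[1]) < d_eps_km
            and abs(l[2] - m[2]) < t_eps_s)


def interaction_level(L_u, L_v, d_eps_km, t_eps_s):
    """k / min(|L_u|, |L_v|), with k the size of a maximum matching."""
    if not L_u or not L_v:
        return 0.0
    # adj[i] lists the checkins of v that are equivalent to checkin i of u.
    adj = [[j for j, m in enumerate(L_v)
            if equivalent(l, m, d_eps_km, t_eps_s)] for l in L_u]
    match_of = [-1] * len(L_v)  # which checkin of u each checkin of v pairs with

    def augment(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_of[j] == -1 or augment(match_of[j], seen):
                match_of[j] = i
                return True
        return False

    k = sum(augment(i, set()) for i in range(len(L_u)))
    return k / min(len(L_u), len(L_v))


# Hypothetical checkins; thresholds of 1 km and 30 minutes are illustrative.
L_u = [(40.0, -73.0, 0), (40.0, -73.0, 10000), (41.0, -73.0, 20000)]
L_v = [(40.001, -73.0, 600), (50.0, 0.0, 20000)]
print(interaction_level(L_u, L_v, d_eps_km=1.0, t_eps_s=1800))
```

Only the first checkin of each user is close in both space and time, so one pair is matched out of a possible two.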

Limitations

We note that it is possible that the locations of some users are irrelevant to their distant friends. This may be a source of potential bias where the geographic proximity of friends may be enlarged by a friendship selection process in Gowalla in which users subjectively add friends who are within their geographic proximity. However, we noticed that 38% of friends are geographically separated by more than 52 km. Also, the Gowalla data and other social media indicate that distant friends are selected, perhaps for the purpose of keeping in contact [42]. In addition, Mislove et al. mentioned that the population of users who tweet on Twitter is unbalanced [43]. Therefore, we believe that the users who check in on Gowalla do not make a representative sample of the entire population, as shown in the concentration of checkins in Fig.

Incorporating Geography into Community Detection

A common approach in community detection is to divide a network into multiple partitions by maximizing the number of edges within each partition and minimizing the number of edges between them. The most commonly used quality measurement for the partitions is modularity, which compares the difference between the fraction of edges inside a partition and the fraction of edges across it with the difference expected if the edges in the network were randomly distributed [44]. Greedy approaches like hierarchical clustering [45] and spectral approaches such as minimum cuts [46] divide a network into disjoint partitions by combining or separating clusters of nodes so that modularity is maximized at every step. As studied by the authors of [47], [48], a problem with this modularity maximization approach is that it tends to merge two separate communities together, increasing the value of modularity but creating a merger that does not reflect the ground truth.
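As a rough illustration of the modularity measure, the Python sketch below computes it for an unweighted, undirected edge list. It assumes the commonly used form in which each community contributes the fraction of edges inside it minus the square of the fraction of edge ends attached to it; the function name and the toy graph of two triangles joined by a single edge are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph.

    edges is a list of (u, v) pairs; community_of maps a node to a label.
    """
    m = len(edges)
    inside = defaultdict(int)  # edges with both ends in the community
    ends = defaultdict(int)    # edge ends (degree sum) in the community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)


# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {node: "A" for node in range(6)}

print(modularity(edges, split))
print(modularity(edges, merged))
```

Placing each triangle in its own community gives a positive value, while placing all nodes in one community gives zero, since that partition is indistinguishable from the random baseline.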
Another approach to community detection is to divide a network into multiple partitions so that the majority of members within each partition shares a common attribute [49]. A proposed attribute is based on friendship similarity defined as the density of common friends between pairs of nodes [49]. A problem with this proposed attribute is that it allows for a community consisting of people who have a lot of friends in common but are not friends of each other. However, this imperfect definition works

well in practice because people who have a lot of friends in common are likely to be friends themselves. Since community detection is an active area of research, our goal is not to provide another technique that detects communities (many have been proposed) but to incorporate the spatial information of nodes into existing algorithms for analyzing Gowalla and to propose a null model (generating covers) to benchmark the detected communities. We combine these two approaches to community detection by incorporating the location information of users and the geographic distances between them into three selected algorithms taken from the rich literature. First, we want to minimize the number of edges between communities and maximize the number of edges within them. Second, we want members inside a community to be within spatial proximity, which we achieve by giving geographically correlated friends more weight than distant friends during the detection process. This combined approach applies a natural interpretation of a friendship community in which members are well connected and also likely to be geographically close. Also, geographically correlated nodes are more likely to interact with each other face-to-face, as seen previously. We selected three community detection algorithms based on their popularity (CPM), promising experimental results (IA), and ability to scale to millions of nodes and edges (GANXiS) for the purpose of capturing and measuring the interactions of users inside a community. In the following subsections, we summarize the selected algorithms and describe how we incorporated geographic information about users into the process of detecting friendship communities in Gowalla, since the level of interaction is correlated with distance, as seen previously.

Clique Percolation Method

The CPM algorithm was proposed to detect overlapping communities by combining cliques, or fully connected subgraphs [5].
Given an undirected graph $F = (U, E_U)$, let $H_m$ denote the set of all cliques in $F$ of size $m$. The clique-graph $G = (H_m, E)$ consists of the cliques in $H_m$ represented as nodes, with an edge between a pair of cliques if they have $m - 1$ overlapping members. Each connected component of the graph $G$ is a community consisting of many fully connected subgraphs of $F$. A problem of the CPM algorithm is its lack of scalability because the number

of cliques explodes as $m$ increases for large networks. Unfortunately, the problem of finding the clique with the largest size in a given graph is NP-hard [51], which prevents the algorithm from using cliques of near-largest size. We modified CPM to incorporate geographic information about nodes and made the algorithm scalable as follows. Instead of finding cliques of large sizes, we find triangles ($m = 3$) since they can be efficiently identified in parallel using map-reduce. To limit the number of triangles, we select a subset of disjoint triangles from all possible triangles by using the geographic distances between pairs of nodes as follows. The average geographic distance of a triangle $t$ is defined as $\frac{1}{3} \sum d(u, u')$ over the pairs $u \neq u' \in t$. We take triangles one at a time from a list of triangles sorted by this distance until all possible disjoint triangles have been taken. If a user is not part of any disjoint triangle, we assign it to the triangle that maximizes the number of edges between this user and the triangle, and we use geographic distances to break ties by assigning the user to the geographically closest triangle. The clique-graph $G$ is defined as $G = (T, E_T)$ where $T$ is the set of modified triangles and $E_T$ is the set of edges between triangles that are assigned as follows. For each triangle, we create a single clique edge from this triangle to the one that maximizes the number of friendship edges between them, and we use geographic distances to break ties if necessary. As in the original CPM algorithm, each connected component of $G$ is a community consisting of geographically correlated and well-connected subgraphs of $F$.

Modularity Maximization

Modularity maximization is a popular technique used to find communities, proposed in [44], [45].
Given a graph $F = (U, E_U)$ and a set $P$ containing disjoint partitions or subsets of $U$, the modularity $Q$ of the partitions in $P$ is defined as:

$$Q = \sum_{p_i \in P} \left( e_{ii} - a_i^2 \right) \qquad (3.3)$$

where $e_{ij}$ is the fraction of edges between nodes in the partitions $p_i$ and $p_j$, and $a_i = \sum_j e_{ij}$ is the fraction of edges leaving the partition $p_i$ [44]. A positive value of

Q correlates with the difference between densities of edges inside and edges leaving the partitions compared to a null model. To maximize modularity, a greedy approach based on hierarchical clustering was proposed in [45], [52]. Initially, every node in U belongs to its own community. Then the pair of communities whose merger yields the highest increase in modularity is merged. The merging process repeats $n - 1$ times, where $n = |U|$. The partition with the highest overall modularity across the iterations is taken as the set of communities.

For weighted networks, Newman proposed a simple technique that maps integer weights to multigraphs [53]. For every edge of weight $w_{ij}$, $w_{ij} - 1$ additional unweighted edges are added between nodes i and j, and the weight $w_{ij}$ is set to 1. The definition of modularity remains the same, since the fraction of edges $e_{ij}$ between partitions $p_i$ and $p_j$ can simply incorporate multiple edges between nodes. We incorporated geographic information about users into the Inference Algorithm (IA) by assigning weights to edges based on spontaneousness and typical means of travel: walking up to 1.6 km, biking or using public transportation up to 25 km, a short car/train ride up to 100 km, a long car/train ride up to 500 km, and a plane flight above 500 km. Friends who are within walking distance (1.6 km) get the highest weight of $2^4$. Friends who are within biking distance (25 km) get the second highest weight of $2^3$. Friends who are within driving distance get a weight of $2^2$, and so on.

3.2.3 Speaker-Label Propagation (GANXiS)

GANXiS was proposed in [54], based on a probabilistic propagation process that spreads labels between speakers and listeners. Given a graph $F = (U, E_U)$, each node $u_i \in U$ initially carries a unique label i in its pocket $p_i = \{i\}$. When a node u is randomly selected to speak, it requests all members of its neighborhood, that is, the nodes adjacent to u, to each randomly send a label from their pocket to u.
The probability of a label being chosen by u from its pocket $p_u$ is proportional to the number of times the label was added; the more times a label was added, the more likely it is to be chosen. The probability of a speaker $u_i$ choosing a label from a listener $u_j$ is based on the weight $w_{ij}/w_i$, where $w_i$ is the sum of the weights of all edges coming out of $u_i$. For unweighted networks, $w_{ij} = 1$. The algorithm repeats until the maximum number of iterations is completed

where in each iteration everyone gets to speak exactly once in a random order. At the end, labels whose probability of being chosen to send to a speaker is less than a threshold r are deleted. Finally, the labels that a node carries determine the communities to which it belongs. For instance, nodes that carry a label i belong to the community $c_i$.

Time to live (TTL) has recently been proposed to limit the number of labels that nodes propagate. TTL defines the number of times a label can be sent (so it reaches only a limited number of nodes within TTL hop distance). The advantage of GANXiS is that it scales linearly with the number of edges; the disadvantage is that the relationship between convergence and the number of iterations is not yet known. GANXiS is capable of discovering overlapping communities, but we selected its running parameters so that the results included only disjoint communities, making them comparable with the results of the other algorithms.

We incorporated geographic information of users into GANXiS by assigning weights based on spontaneousness and typical means of travel, as in weighted IA. Friends who are within walking distance (1.6 km) get the highest weight of $2^4$. Friends who are within biking distance (25 km) get the second highest weight of $2^3$. Friends who are within driving distance get a weight of $2^2$, and so on. This extends the interpretation of the speaker-listener propagation algorithm: a listener is more likely to be able to hear a speaker if they are within spatial proximity.

3.3 Contrasting Communities to Null Models

We proposed to integrate spatial and friendship information of nodes into a process of generating covers. The purpose of the covers is to serve as a baseline for analyzing the performance of various community detection algorithms under a quality measurement.
In Section 3.3.1, we describe how we generated six covers by using a combination of spatial and friendship information in traversing the network. In Section 3.3.2, we select a few quality measurements for examining covers and detected communities. In Section 3.3.3, we examine the covers using the selected quality measurements.
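The distance-based edge weights used for weighted IA and GANXiS in Section 3.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `haversine_km` and `geo_weight` are hypothetical, the Earth radius is the usual mean value, the 100 km and 500 km cut-offs are the reading adopted here for the two car/train tiers, and the weights $2^1$ and $2^0$ for the last two tiers are our interpretation of "and so on".

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# (upper bound in km, weight). Walking and biking tiers follow the text;
# the car/train cut-offs and the two lowest weights are assumptions.
TIERS = [(1.6, 2 ** 4), (25.0, 2 ** 3), (100.0, 2 ** 2), (500.0, 2 ** 1)]


def geo_weight(distance_km, tiers=TIERS, far_weight=2 ** 0):
    """Map the distance between two friends to an edge weight by travel tier."""
    for limit, weight in tiers:
        if distance_km <= limit:
            return weight
    return far_weight  # beyond the last tier (plane flight)


# Example: two friends one degree of longitude apart on the equator.
d = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(d, 1), geo_weight(d))
```

Keeping the tiers in a list makes it easy to try other cut-offs without touching the lookup logic.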

3.3.1 Techniques for Generating Covers

Given a graph $F = (U, E_U)$, a cover $C \subseteq U$ of size k is a subgraph of F with k nodes selected in a specific way. A completely random cover (CR) is one where each user $u \in U$ has the same probability of being added during the selection. In a random walk cover (RW), we first randomly add a seed into the cover, then randomly select a friend of the most recently added user, and continue selecting friends until the cover reaches the size k. The closest-friend-first cover (CFF) is similar to RW, but instead of adding a random friend, we add the spatially closest friend of the last added user who is not yet in the cover. If all of that user's friends have already been added into the cover, we go back one step to the previously added user and branch out from there. We call this the roll-back mechanism. The farthest-friend-first cover (FFF) is similar to CFF except that we take the spatially farthest friend instead of the closest one. The closest-to-all cover (CTA) is similar to CFF, but instead of adding the friend spatially closest to the last added user, we add the friend who is spatially closest with respect to all members already in the cover. Finally, the farthest-to-all cover (FTA) is one where we take the spatially farthest friend with respect to all members already in the cover. The cover generation algorithms for CTA and FTA are described in Fig. 3.5, without the roll-back mechanism for simplicity. We list the covers and their details in Table 3.2.

Table 3.2: Six techniques for generating covers.
Algorithm               Abbreviation   Spatial Info.?   Social Info.?
Completely Random       CR             no               no
Random Walk             RW             no               yes
Closest Friend First    CFF            yes              yes
Farthest Friend First   FFF            yes              yes
Closest to All          CTA            yes              yes
Farthest to All         FTA            yes              yes

3.3.2 Measuring Covers & Communities

We use three types of quality measurements, based on the link connectivity and location of members, to measure covers and communities.
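As a rough sketch of the link-connectivity measurements that this section defines (intra-edge count, boundary-edge count, and the ratios derived from them, summarized in Table 3.3), the Python function below evaluates them for a cover given as a collection of nodes. The function name `cover_measures`, the edge-list representation, and the toy graph are assumptions made for illustration; the geographic measures are omitted.

```python
def cover_measures(edges, cover):
    """Link-connectivity measures of a cover over an undirected edge list.

    edges: list of (u, v) pairs, one entry per undirected edge.
    cover: iterable of nodes that form the cover.
    """
    members = set(cover)
    k = len(members)
    # Intra-edge count: both ends inside the cover.
    iec = sum(1 for u, v in edges if u in members and v in members)
    # Boundary-edge count: exactly one end inside the cover.
    bec = sum(1 for u, v in edges if (u in members) != (v in members))
    pairs = 0.5 * k * (k - 1)  # edges in a completely connected cover of size k
    return {
        "IEC": iec,
        "BEC": bec,
        "CONT": iec / k if k else 0.0,
        "EXP": bec / k if k else 0.0,
        "IND": iec / pairs if pairs else 0.0,
        "COND": bec / (2 * iec + bec) if (2 * iec + bec) else 0.0,
    }


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(cover_measures(EDGES, [0, 1, 2]))
```

For the first triangle of the toy graph, the cover is fully connected internally and has a single boundary edge, so the intra-density is at its maximum and the conductance is small.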
The first type of measurements is based on the intra-edge count IEC, defined as the number of edges with both ends inside the cover. The contraction CONT of

a cover is computed by dividing the intra-edge count by the size of the cover. The intra-density IND of a cover is calculated by dividing the intra-edge count by the intra-edge count of a completely connected cover of the same size. For these three measures (IEC, CONT, IND), the higher the value, the better formed the community.

procedure CoverGeneration(k)
    F = (U, E_U)
    seed = rand(1, |U|), cover = [seed]
    while len(cover) < k do
        distances = [ ], m = len(cover)
        for u in F[seed] do
            // Average haversine distance from u to the members of the cover.
            d_u = (1/m) * sum_{i=1..m} d(cover[i], u)
            distances.append((u, d_u))
        end for
        // Sort by d_u: least to greatest for CTA, greatest to least for FTA.
        distances = sort(distances, key = lambda x: x[1])
        for u, d_u in distances do
            if u not in cover then
                cover.append(u)
                seed = u
                break    // add one friend per iteration
            end if
        end for
    end while
    return cover
end procedure

Figure 3.5: Generating CTA & FTA covers.

The second type of measurements is based on the boundary-edge count BEC, defined as the number of edges with one end inside the cover and the other outside. This metric is useful for taking into account the effect of adding high-degree users into covers of large sizes, since such users are likely to increase both the intra- and boundary-edge counts. The expansion EXP of a cover is computed by dividing the boundary-edge count by the size of the cover. The conductance COND of a cover is defined as $COND(C) = \frac{BEC(C)}{2\,IEC(C) + BEC(C)}$. For these three measures (BEC, EXP, COND), the lower the value, the better formed the community.

The third type of measurements is based on pair-similarity, which measures a given metric such as friendship similarity among pairs of nodes. This is applicable to the definition of a community whose members have a lot of commonality [49]. We

replace the friendship similarity ratio with three additional measurements based on the geographic proximity and location of nodes. The first one is the geographic diameter of a cover GDI, defined as the geographic distance between the two farthest nodes. The second one is the average geographic distance AGD among pairs of nodes. For these two measures (GDI and AGD), the lower the value, the better formed the community. The third one is the sum of the levels of physical interactions SLI among pairs of nodes, for which the higher the value, the better formed the community.

Table 3.3: Measurements for cover C of the size k.
Measurement      Definition
IEC [55]         $|\{(v_i, v_j) \in E : v_i \in C \wedge v_j \in C\}|$
BEC [56]         $|\{(v_i, v_j) \in E : v_i \in C \vee v_j \in C\}| - IEC$
CONT             $IEC / k$
EXP [57]         $BEC / k$
IND [55]         $IEC / (0.5\,k(k-1))$
COND [56, 57]    $BEC / (2\,IEC + BEC)$
GDI              $\max d(u, u')$, $u, u' \in C$
AGD              $\sum_{u \neq u' \in C} d(u, u') / (0.5\,k(k-1))$
SLI              $\sum_{u \neq u' \in C} I(u, u')$

3.3.3 Examining Covers in Gowalla

For each technique, we generated covers of fixed sizes from 5 to 1 with an increment of 1. For each cover size, we generated 1 covers and calculated the average intra-edge count, boundary-edge count, geographic distance, and geographic diameter. We then derived the remaining measurements. In Fig. 3.6(a), we noticed that FFF outgrows the other techniques in terms of intra-edge count as the cover size increases. In Fig. 3.6(b), we noticed that FFF and FTA outgrow the other techniques in terms of boundary-edge count by a great margin, suggesting that they strategically add users with very large degrees. While RW is decent at generating covers with high intra-edge counts, as seen in Fig. 3.6(a), it is also biased, since users with high degrees are more likely to be added, which increases the intra-edge count as the cover continues to grow. However, FFF and FTA are even more biased than RW, and FFF outgrows the other five techniques because the radius of the farthest friend would cover everyone, including common friends in between. On the other hand, we noticed that CFF and CTA are most

effective out of the six techniques at increasing the intra-edge count while minimizing the boundary-edge count at the same time.

Figure 3.6: Intra-edge count, boundary-edge count, and geographic diameter of covers. Panels: (a) intra-edge count, (b) boundary-edge count, and (c) geographic diameter (km), each plotted against cover size for CR, RW, CFF, FFF, CTA, and FTA.

In Fig. 3.6(c), we measure the geographic diameter of a cover as a function of its size. As expected from how covers are generated, FFF and FTA are most effective at maximizing the geographic diameter, while CFF and CTA are most effective at minimizing this measurement. The geographic diameter of FFF and FTA reaches the limit within 2 iterations, while the diameter for CTA and CFF slowly continues to grow. A similar trend is seen in Fig. 3.7(c), which shows the average geographic distance, in contrast to the growth rate of intra- and boundary-edge counts seen in Fig. 3.7(a). Last but not least, conductance is a measurement used to determine the quality of a community by considering both the intra- and boundary-edge counts. As seen in Fig. 3.7(b), CFF is the most effective out of the six covers at minimizing conductance

since it preserves some geographic structure of the social network by traversing the edges based on who is the geographically closest friend, and adding friends who are likely to be friends with the members already in the cover. CTA is not as effective as CFF because geographic distances get diluted as the size of the cover increases. FFF and FTA are worse than RW at minimizing conductance. We later use the physical interactions of users to compare and contrast the results generated by the CFF cover with the results detected by the community detection algorithms.

Figure 3.7: Contraction, expansion, conductance, and geographic distance of covers. Panels: (a) contraction and expansion, (b) conductance, and (c) average geographic distance (km), each plotted against cover size for CR, RW, CFF, FFF, CTA, and FTA.

3.4 Examining Detected Communities

We first examined the results by looking at the total number of communities detected and the number of members in each one. The modified CPM algorithm with geographic information detected 2.6K communities whose average size was 60, with the size of the largest one being 69K. We did not run the original CPM algorithm

because of the long execution time required to generate the clique graph. IA without geographic information detected 1.2K communities with the average size of 134 and the size of the largest one being 52K. IA with geographic information detected 349 communities with the average size of 442 and the size of the largest one being 45K. GANXiS without geographic information detected 7.2K communities with the average size of 21 and the size of the largest one being 3K. Finally, GANXiS with geographic information detected 4.6K communities with the average size of 33 and the size of the largest one being about 48K. Additional information relating to community sizes is listed in Table 3.4.

Table 3.4: Detected communities and their sizes.
Algorithms               Avg.   Std.   Smallest   Largest   Total
CPM                      60     1,                ,671      2,572
IA                       134    1,                ,315      1,151
IA w (w for weighted)    442    2,                ,         349
GANXiS TTL               21                       ,139      7,236
GANXiS TTL w             33                       ,29       4,636

3.4.1 Network Community Profile (NCP)

We used the network community profile (NCP) proposed in [56] to examine detected communities as a function of their size. The authors proposed to take the best partition, as defined by a quality feature, for a given community size because it represents the potential of a partition in a community detection algorithm. By inspecting all communities of the same size, we find for this set the lowest conductance or the highest intra-density among its members, one quality metric at a time. For intra-density and conductance without geographic information, we use the classical definitions from Table 3.3 and include all existing intra- and boundary-edges in the counts. For intra-density and conductance with geographic information, we only include edges that are within a geographic proximity of 160 km, or roughly 2 hours of driving. A low value of conductance is good because it means that the fraction of edges leading outside the community is low, but a value of 0 is rare since it would indicate that

the community is isolated. However, for conductance with geographic information, a value of 0 means there are no edges that connect to other communities that are geographically close, so all bridge edges are long. This also means that, on seeing a short bridge edge, the community detection algorithm tends to merge the communities connected by that edge, following the insight that neighbors tend to be friends.

The potential issues resulting from using this approach are discussed below. First, in many situations, taking the average value of a community quality gives a more representative picture and is probably less sensitive to outliers. Second, the number of communities for a given size might vary from a large number for small communities to very few for large communities. Last but not least, there might be no communities of a particular size, and taking the average quality might give a smooth function that is easier to extrapolate at the missing points, as seen with the covers. Figs. 3.8-3.10 present the results for communities detected by CPM, IA, and GANXiS, respectively.

3.4.2 Link Connectivity Measurements

First, intra-density rapidly decreases as the size of the community increases, because adding another member into a large community requires everyone already in it to be connected with this new member, as seen in Figs. 3.8-3.10(a). Second, unlike intra-density, conductance is not correlated with the community size, because there are both small and large communities of varying values, as seen in Figs. 3.8-3.10(b). Third, GANXiS and IA are a little better than CPM at maximizing intra-edges that are within geographic proximity, as seen in Figs. 3.8-3.10(c). IA is the best at minimizing boundary-edges that are within geographic proximity, as seen in Fig. 3.9(d).
Last but not least, GANXiS and IA benefited from incorporating the geographic information of users, as seen in Figs. 3.9(d) and 3.10(d), where geographically correlated friends are captured in the community detection process.

Figure 3.8: Communities detected by Clique Percolation Method. Panels: (a) intra-density, (b) conductance, (c) intra-density with spatial info, and (d) conductance with spatial info, each plotted against community size (log scale).

Figure 3.9: Communities detected by Inference Algorithm. Panels (a)-(d) as in Fig. 3.8, for the weighted and unweighted networks.

3.4.3 Face-to-Face Interactions Measurements

Comparing panel (d) with panel (b) in Figs. 3.8-3.10, we noticed that some detected communities had a conductance value of 0. This means that every potential node within geographic proximity of a community has already been included in it. For the IA without geographic information, out of the 78 community sizes, 20 have a geographic conductance of 0, yielding a ratio of 20/78 ≈ 0.26. For the IA with geographic information, out of the 84 community sizes, 19 have a geographic conductance of 0, yielding a ratio of 19/84 ≈ 0.23. The remaining values are listed in Table 3.5. Results in Table 3.5 show that GANXiS has the highest ratio of communities with a spatial conductance of 0 to communities detected. From this perspective, a good community detection algorithm yields many communities with a spatial conductance of 0, as the result of merging connected and geographically close communities together.

Table 3.5: Measuring spatial conductance. Columns: Algorithm, # Spatial Cond. of 0, Total, Ratio. Rows: CPM, IA, IA w (w for weighted), GANXiS TTL, GANXiS TTL w.

Table 3.6: Measuring face-to-face interactions. Columns: Algorithm, Count, Total, Ratio. Rows: CPM, IA, IA w (w for weighted), GANXiS TTL, GANXiS TTL w.

We examined small-size communities because humans have limited resources and cognitive abilities to keep and maintain social relationships, resulting in a limited

number of friendships known as Dunbar's number [58]. We measured and then plotted in Fig. 3.11 the NCP level of physical interactions in communities and covers by summing the level of physical interactions among pairs. From the plots, we observed that CPM has small communities whose members are statistically more likely than members of covers to physically interact with each other by going to the same places together. In Fig. 3.11(a), out of 95 communities detected by CPM of the size up to 1, 84 have a higher amount of physical interaction among members than the null model, CFF. In Fig. 3.11(b), out of 41 communities detected by IA under the size of 1, 38 have a higher amount of physical interaction among members than CFF. The remaining values are listed in Table 3.6.

Figure 3.10: Communities detected by GANXiS. Panels: (a) intra-density, (b) conductance, (c) intra-density with spatial info, and (d) conductance with spatial info, each plotted against community size for the weighted and unweighted networks.

While CPM is the most effective at detecting communities that are intrinsically small (95 total) and where the physical interaction among members is likely to be

higher than CFF (88%), IA is the most effective at detecting communities with more physical interaction than the null model: 93% of its communities have a higher amount of physical interaction, as seen in Table 3.6. Incorporating geographical information into GANXiS improves its overall performance (91%, versus 69% without geography).

Figure 3.11: Measuring face-to-face interactions among members. Panels: (a) CPM, (b) IA, and (c) GANXiS TTL, each showing the amount of physical interaction against community size, compared with the CFF cover (weighted and unweighted networks in (b) and (c)).

3.5 Application: Social Relationships & Human Mobility

Random mobility models have been popular among applied researchers for generating synthetic movements. Random walk is commonly used for graph traversals, clustering analysis, and many other applications to model unpredictable behavior. Random waypoint is a mobility model on Cartesian coordinate systems, where two dimensions are commonly used in simulations and higher dimensions are used for theoretical analysis and generalization. Not only are these random models useful for applications, but they are also powerful tools for analytical understanding

of many networking applications, such as routing in decentralized architectures where mobility plays a large role. A typical ad-hoc network is a decentralized network formed by mobile agents in a dynamic process without any fixed infrastructure. It is dynamic because the topology of who is connected to whom is constantly changing due to the mobility and connection preferences of the agents and the physical limitations of communication devices. If two mobile agents are outside of transmission range, then the connection is dropped. If they are within transmission range, then the connection can be established. Hence, the topology of ad-hoc networks depends on a complex combination of agent mobility, connection preferences, and environmental factors that could disrupt services or enhance communication.

Some of these networks are uncoordinated, with each agent acting selfishly on its own behalf, while other networks are coordinated, with all agents collaborating to accomplish a particular goal, task, or mission. For instance, peer-to-peer networks are uncoordinated networks where the architecture is designed for robustness, to reduce the damage caused by selfish users who consume resources but are reluctant to contribute, and where anti-choking algorithms are designed to distribute pieces of a file effectively to maximize throughput and efficiency. On the other hand, military ad-hoc networks are coordinated networks where soldiers communicate through a network channel to rescue innocent civilians or capture fugitives in a mission.

Outside of computer networks, human mobility is important for studying the spread of contagious diseases, traffic engineering, methods of large-scale emergency evacuation, and so on [59]. While individual mobility is important at a micro-level, it serves as a building block for population mobility, which has many potential applications in studying the population at large scale.
Using data to observe statistical patterns that capture, characterize, and predict trajectories of human movements during daily activities is important for health organizations, civil engineers, and national interests. For instance, health organizations may want to study the spread of transmitted diseases, while traffic and civil engineers may want to incorporate human mobility analysis into their transportation models, where travellers can use a transportation system consisting of bikes, buses, and subways to get from one place to another. Understanding population mobility allows the design of effective transportation systems where traffic congestion is controlled and reduced. Last but not least, national security agencies might be interested in knowing how social relationships impact population mobility, so that guidelines can be provided during emergency evacuations in disasters such as Hurricane Irene and the Japan nuclear meltdown of 2011, where evacuating 45,000 people within a six-mile radius of two malfunctioning nuclear power plants required optimal efficiency, since every second could potentially count toward saving a life.

3.5.1 Network Congestion in MANETs

The backoff timer in the MAC protocol is an algorithm designed to prevent collisions of wireless signals. If two or more concurrent wireless transmissions are within radio range, one will randomly back off to let the other one talk. Suppose we are interested in measuring the throughput of a wireless network where people are working on their laptops and moving from location to location with some hidden attributes. Since human beings do not move randomly, we know that there will be more congestion at popular locations. If we use the random waypoint model (RWP), most of the congestion occurs in the middle due to the stationary distribution, as shown in Fig. 3.15.

Figure 3.12: Generating a Markov Model using checkins. (Diagram: the locations Home, School, Work, Lunch, and Mall as states, with transition probabilities P_ij.)

3.5.2 Mobility Generation

We propose the following algorithm for generating mobility traces using social networking data from Gowalla. For our Friendship Mobility Model (FMM), which uses a Markov Model as an underpinning, we first randomly select a user from the dataset and include his or her friends in the selected group of users. For each user selected, we calculate the patterns of checkin activities from the datasets. To define the set of

locations, we look at how many unique places this user has checked in at. For each pair of subsequent locations, we calculate the shortest haversine route. For the probability in the Markov Model of moving from location a to location b, we calculate how many times the user checks in at location b immediately after checking in at location a, divided by the number of times the user checks in at location a. Finally, we calculate the time it takes for a given user to go from one checkin to another. The entire process is depicted in Fig. 3.12. After we have built the empirical Markov Model for each user, we use Miller's coordinate projection to convert geographic space into a Cartesian coordinate system that preserves the triangle law of distances.

Finally, for the mobility simulation, each node is randomly assigned to one of its checkins. Then each node randomly picks, with the assigned probability, the location of the next checkin and moves directly to it along a straight-line trajectory. Once the node reaches the new checkin, it repeats the process until the end of the simulation. Hence, the difference between the RWP mobility model and our FMM is that in the latter the space of travel is limited to the area of the checkins for each individual node. Moreover, each node moves differently based on its training set of checkins. For instance, an adult might be inclined to check in at work more often than a student.

3.5.3 Experimental Congestion Design

We designed a controlled experiment in a MANET using ns-2 to compare the traffic congestion between the RWP and the FMM. In the experiment, there are 15 mobile nodes constantly sending out packets to their neighbors within the transmission range. Other simulation parameters are listed in Table 3.7. When two or more nodes are within radio range of each other, at most one can make a successful transmission and the rest have to pause.
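The per-user Markov model built in the mobility-generation step described above can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the function names, the list-of-location-ids input, and the toy checkin sequence are hypothetical, and the sketch normalizes by the number of departures from a location (so each row sums to 1) rather than by the total number of checkins at it.

```python
import random
from collections import Counter, defaultdict


def transition_probabilities(checkins):
    """Estimate P(a -> b) from one user's chronologically ordered checkins.

    checkins: list of location ids, oldest first.
    Returns a dict {a: {b: probability}}.
    """
    pair_counts = Counter(zip(checkins, checkins[1:]))  # consecutive (a, b) pairs
    departures = Counter(checkins[:-1])                 # times the user left a
    probs = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        probs[a][b] = n / departures[a]
    return dict(probs)


def next_location(probs, current, rng=random):
    """Draw the next checkin location; stay put if no transition was observed."""
    choices = probs.get(current)
    if not choices:
        return current
    locations = list(choices)
    weights = [choices[b] for b in locations]
    return rng.choices(locations, weights=weights, k=1)[0]


# Hypothetical checkin history for one user.
history = ["home", "work", "lunch", "work", "home", "work"]
model = transition_probabilities(history)
print(model)
print(next_location(model, "work"))
```

In a simulation, `next_location` would be called each time a node reaches its destination, and the node would then move in a straight line to the returned location.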
We measure the overall congestion of the network by counting how many times a node needed to pause, given that we know its geographic location throughout the simulation. (We use "user" when referring to the dataset and "node" when referring to the simulation; a node is built from the social network data provided by the users.) Fig. 3.13 outlines how a simulated node moves and how it causes congestion. Suppose a node starts at $p_1$ and travels to $p_2$ with some speed dictated

by the mobility model. A mobile node cannot transmit if there is already a concurrent transmission within some nearby range. Therefore, it pauses until it detects no concurrent transmissions. The pause time duration in a subarea is the total amount of time that all the nodes spend pausing or suspending their transmissions due to the backoff timer of the MAC protocol. During the trip from $p_1$ to $p_2$, the node pauses in 3 subareas, (1,2), (2,2), and (3,3), represented by the dashed line, meaning that the transmission was suspended for some time. The length of the dashed line in a subarea represents the duration of pause time for that particular trip.

Figure 3.13: Design of simulation overview.

3.5.4 Congestion Simulation Results

Table 3.7: Network simulator ns-2 parameters.
Parameters             RWP          FMM
Simulation Time (t)    1,s          1,s
MAC Layer              802.11Ext    802.11Ext
Width (x)              2m           2m
Length (l)             2m           2m
Nodes (n)
Pause Time
Min Speed              5
Max Speed              5            5
Total Backoffs         598,316      1,654,967

With the FMM (see [22]), we were surprised to find 2.77 times more congestion than with the RWP. However, this agrees with our intuition that in the FMM, friends

like to maintain their relationships by being closer to each other. Economic factors such as the cost of transportation and mobility have a great impact on how we choose with whom to be friends.

Figure 3.14: Traffic congestion in FMM and RWP. (Axes: X (m) and Length (m).)

Fig. 3.14 displays the simulation results of network congestion in a controlled MANET. We took a sample of locations with traffic congestion. The points represent places where at least one node had to back off during the simulation. Notice how traffic congestion is dispersed for RWP and clustered for FMM. Please note that this graph only shows places of congestion, not the density or total volume of communications. Fig. 3.15 displays the frequency of pauses caused by the backoff timer in the MAC protocol using the RWP. We noticed that congestion is concentrated in the middle, which is correlated with the stationary distribution of the RWP.

3.6 Application: Long Ties & Economic Development

A number of results in economic sociology suggest that human relationships affect economic opportunities because information often spreads between people [60]-[65]. In addition, information coming from interpersonal relationships is often richer than that from traditional broadcast media such as television, newspapers, and radio, because acquaintances can interact face-to-face and influence one another in terms of adopting new behavior and ideas [66]. Therefore, social networks can be portrayed as

a transportation system where individuals are drivers for generating ideas and the links between people are vehicles for transporting ideas from one person to another. Metaphorically, some links are faster than others at transporting ideas to a larger number of people, because not all vehicles are created equal.

Figure 3.15: Frequency of pauses using the RWP.

It has been argued that information coming from weak ties is often richer than information arriving via strong ties because "those to whom we are weakly tied are more likely to move in circles different from our own... and have access to information different from what we [usually] receive" [65]. Weak ties have been shown to be valuable sources of information because individuals can use them to find jobs [32], [60], solicit feedback on starting new ventures [63], and search for people, as in the small-world experiment [31], [41], [67], [68]. In other settings such as workplaces, structural holes can affect the productivity and innovation of employees and could lead to higher compensation, more promotion opportunities, and better performance evaluations [61]-[64]. Structural holes are those social relationships that connect non-redundant contacts together [61]. An example of a structural hole is a bridge that connects non-redundant contacts from two communities. The effect of weak ties on economic opportunities [69] suggests that perhaps information coming from weak ties can also be used for measuring economic development on a

57 46 larger scale. Contemporary development in the science of urbanization has provided scaling laws for innovation and wealth creation as a power function of the population size in the equation: y(t) = cx(t) m where x(t) is the population size and y(t) is the metric of innovation at time t [7]. These results show that as the population size increases, GDP, wages, patents, private research employment & development increase at superliner rates where 1.3 m 1.46 [7]. A plausible explanation for the superliner scaling of wealth creation is that as the population size increases, the number of social relationships between people increases because there are more choices for establishing relationships; therefore, increasing the connectivity between people and decreasing the time for ideas to spread as long as the rate of establishing connections is faster than the rate of population growth. Following this line of thinking, recent results in [71] suggest that a generative model for tie formation as a function of population density yields results very similar to the model based on population size [7]. Results show that algorithmically generated social ties based on population density, assuming that nodes are distributed uniformly on a Euclidean space and they establish connections similar to the rank friendship model [67], can be used to model urban characteristics of cities such as GDP, HIV transmissions, and communication volume. Here we extend this line of thinking by focusing on characteristics of economic development as a function of speedy idea flow emulated on real social relationships - using long ties as the main component enabling such flow. This was accomplished by using data containing geographical locations and friendship information of hundreds of thousands of people from location-based social media such as Gowalla and FourSquare [22]. 
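As a minimal sketch of how an exponent $m$ in $y = c\,x^{m}$ can be estimated: taking logarithms gives $\log y = \log c + m \log x$, so an ordinary least-squares line in log-log space has slope $m$. The function name `fit_power_law` and the synthetic data below are assumptions made for illustration; they are not the procedure or the data of the cited studies.

```python
import math

def fit_power_law(xs, ys):
    """Estimate c and m in y = c * x**m by least squares on log-transformed data.

    Returns (c, m, r), where r is the Pearson correlation of log(x) and log(y).
    Assumes all values are strictly positive and the xs are not all equal.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    syy = sum((b - mean_y) ** 2 for b in log_y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    m = sxy / sxx                          # slope of the log-log line
    c = math.exp(mean_y - m * mean_x)      # intercept, mapped back from log space
    r = sxy / math.sqrt(sxx * syy)
    return c, m, r

# Synthetic example (hypothetical numbers): data generated with c = 2, m = 1.2.
population = [1e3, 1e4, 1e5, 1e6]
output = [2.0 * x ** 1.2 for x in population]
c, m, r = fit_power_law(population, output)
print(f"c = {c:.3f}, m = {m:.3f}, r = {r:.3f}")
```

A fitted exponent above 1 corresponds to superlinear scaling, and an exponent close to 1 to linear scaling.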
More importantly, these datasets allow us to infer face-to-face interactions [23] and measure the strength of ties in terms of not only interactions but also geographical distance (i.e., short or long ties [72], [73]). Other approaches for measuring the economic development of large geographical areas include examining the diversity of social contacts (i.e., call records as a proxy for social relationships), since more contacts imply more channels for receiving information [74]. However, using calling patterns to infer social contacts is biased towards strong ties, since weak ties are by definition those that are contacted infrequently. While these approaches [71], [74] vary in their complexity, ranging from mathematically oriented to data-driven, what they share is the use of social network analysis to predict innovation, wealth creation, and even patterns of complex human behavior. The novelty of our approach lies at the intersection of economic sociology (i.e., the interplay of weak ties and economic opportunities) and simple contagion models (i.e., the spread of good ideas from one place to another). Results show that the speed of access to ideas is a near-perfect measure of social diversity and also a signature of economic development in the US, without needing to tune parameters or incorporate secondary factors such as the level of educational attainment and internal transportation infrastructure.

3.6.1 A Stochastic Model of Economic Development

We propose a simple stochastic model that uses long ties as the main component for measuring the economic development of large geographical areas. Let $G = (V, E, L)$ be a social network where $V$ is the set of nodes, $E$ is the set of their undirected relationships, and $L$ is the mapping of users to the locations of their residences. Let $A_i$ denote the set of nodes that reside in area $i$; i.e., $A_i = \{v \in V \mid L(v) = i\}$.
The flow of ideas is described by the matrix $F = (f_{ij})$, where $f_{ij}$ is the probability of an idea going from $A_i$ to $A_j$ in one step, defined as the number of long ties connecting nodes from $A_i$ to $A_j$ divided by the number of long ties originating from $A_i$; i.e.,

$$f_{ij} = \frac{LT(A_i, A_j)}{\sum_{k=1}^{m} LT(A_i, A_k)} \tag{3.4}$$

where $m$ is the total number of areas and $LT(A_i, A_j)$ ($1 \le i \ne j \le m$) denotes the number of long ties connecting nodes from $A_i$ to $A_j$; i.e.,

$$LT(A_i, A_j) = \left|\{(s, t) \in E \mid (s \in A_i \text{ and } t \in A_j) \text{ or } (t \in A_i \text{ and } s \in A_j)\}\right| \tag{3.5}$$

If we assume that innovative ideas travel randomly between areas, and the probability of an idea spreading from $A_i$ to $A_j$ depends only on the present area and not the previous areas, then $\{X_t, t \ge 0\}$ is a discrete-time Markov chain where $X_t$ denotes where the idea is located at time $t$. Let $H_{ij}$ denote the expected time it takes for an idea originating at $A_i$ to arrive at $A_j$. Then the average expected time for an idea originating from anywhere to arrive at $A_i$, denoted as $\varphi_i$, is defined as:

$$\varphi_i = \frac{1}{m-1} \sum_{k=1}^{m} H_{ki} \tag{3.6}$$

where $H_{ii} = 0$. Hence, we expect $\varphi_i$ to be inversely correlated with economic development, since areas that receive information quicker can act faster. Suppose an innovative idea travels indefinitely; then the fraction of time the idea stays in $A_i$ is denoted as:

$$\lambda_i = P(X_t = A_i) \tag{3.7}$$

$\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_m)$ is known as the stationary distribution, and there exists a unique stationary distribution of $X_t$ since it is irreducible [24]. Since $\lambda_i$ denotes the fraction of time the idea spends in area $i$, $1/\lambda_i$ denotes the expected time needed for the idea to come back to $i$; therefore, $\varphi_i \approx 1/\lambda_i$.

3.6.2 Experimental Results & Discussion

We extracted users and their social relationships in Gowalla and FourSquare and kept those that are confined to the US. We partitioned the US into 51 areas where each area corresponds to a federal state. Figure 3.16 shows scaling laws of the number of short and long ties as a function of the population size. Short ties are defined as those relationships where both users live in the same state, while long ties are defined as those where the users live in separate states. The total number of ties (i.e., all ties) is the sum of the number of short and long ties. A point is a state where the x-axis corresponds to the number of users that live there, and the y-axis corresponds to the number of their ties. Results show that as the population size increases, the number of short ties increases at superlinear rates where $m \approx 1.34$ for Gowalla (a) and $m \approx 1.43$ for FourSquare (b). This result supports the claim that increasing the population size increases the number of relationships between people and decreases their path lengths so that ideas can spread quicker. However, long ties do not increase at superlinear rates but instead at approximately linear rates where $m \approx 0.95$ for Gowalla (a) and $m \approx 1.0$ for FourSquare (b). Therefore, long ties do not explain the superlinear scaling of innovation and wealth creation as a function of population size.

Figure 3.16: Scaling laws of short and long ties. [Axes: Population Size (log) versus Number of Ties (log). Panel (a): Short Ties (m=1.34, r=.97), Long Ties (m=.95, r=.98), All Ties (m=1.2, r=.99). Panel (b): Short Ties (m=1.43, r=.95), Long Ties (m=1., r=.94), All Ties (m=1.7, r=.96).]

Figure 3.17: Face-to-face interactions of short ties and long ties. [Panel (a): number of face-to-face interactions for short and long ties. Panel (b): P(k > K) versus K, both on log scales, for short and long ties.]

Figure 3.17 shows that most long ties are weak because face-to-face interactions occur more often when people are geographically close. In this experiment, we selected all pairs of long and short ties and calculated the number of their face-to-face interactions by matching their checkins. The average number of interactions for short ties is 3.95 (std=43.2) while this number for long ties is 0.73 (std=9.19). While not all short ties are strong, most long ties are weak since 90% of them have no more than two interactions. The x-axis in (b) represents the number of interactions K, and the y-axis represents the probability that a tie has more than K interactions. We did not repeat the same experiment for FourSquare because its API did not provide access to users' checkins.

We emulated a simple contagion process using the social relationships of users to examine the effects of short and long ties on adopting versus not adopting a contagion (similar to the process of spreading ideas in [71]). Following Rogers' work on the diffusion of innovations [75], we assume that 2.5% of the population, randomly selected, is responsible for generating innovative ideas (i.e., the seed set). In each step, each initiator randomly selects one of its acquaintances to propagate the contagion, and that acquaintance decides whether to adopt it with some fixed probability $p_c$. If the acquaintance decides to adopt the contagion, then it later becomes an initiator for spreading it. The process stops when 13.5% of the population has adopted the contagion. Those 13.5% of the population would be considered early adopters in the diffusion of innovations [75].

Figure 3.18: The collective strength of long ties in a simple contagion model. [Panel (a): short ties and long ties; panel (b): weak ties and long ties. Bars show the number of ties for adopters and non-adopters.]

Figure 3.18 shows that early adopters have on average more long than short ties. For Gowalla (a), the average adopter has (std=38.67) short ties and 23.9 (std=111.57) long ties compared to 3.81 (std=6.58) short ties and 2.99 (std=6.2) long ties for non-adopters. For FourSquare, the average adopter has (std=27.67) short ties and (std=54.11) long ties compared to 1.62 (std=3.45) short ties and 1.63 (std=6.74) long ties for non-adopters. For the distribution of short and long ties of adopters and non-adopters, see Fig. 3.19 for Gowalla (a,b) and FourSquare (c,d). Since nodes in the social networks are more likely to adopt if they have more acquaintances, the point is that a job source, valuable idea, or even a social contagion is more likely to come from a weak tie because people have a limited number of strong ties but many more weak ties [61]. This experiment shows the collective strength of long ties by showing that people have a higher chance of adopting a new idea if they have more long ties.

Figure 3.19: Distribution of long ties for adopters and non-adopters. [Panels (a), (c): adopters; panels (b), (d): non-adopters. Axes: Long Ties (log scale) versus Fraction.]

We generate the flow matrix $F = (f_{ij})$ and calculate $\lambda_i$ as a proxy for $\varphi_i$. Figures 3.20 and 3.21 show the economic development of US states as a function of the speed of access to ideas for Gowalla and FourSquare, respectively. The metrics we used for economic development are gross GDP [76], the number of patents issued [77], and the number of startups, defined as non-profit firms with fewer than 2 employees [78]. Overall, results show that $\varphi_i$ is highly correlated with economic development in the US. Tables 3.8 and 3.9 show results using other techniques that have been proposed in the literature for measuring economic development. The population density of a state is defined as the number of residents [79] divided by the state's land area in sq. mi (excluding water) [8]. The social diversity of a state $i$, denoted as $D_i$, is defined as:

$$D_i = \frac{-\sum_{j=1}^{m} p_{ij} \log(p_{ij})}{\log(m-1)} \tag{3.8}$$

where $p_{ij}$ is the number of edges connecting $A_i$ and $A_j$ divided by the number of edges leaving $A_i$ [74].

Figure 3.20: Economic development as a function of idea flow (Gowalla). [Panels: (a) Gross GDP, (b) Patents, (c) Startups, each versus $\varphi_i$, all on log scales, with one fitted line per year.]

Table 3.8: Measuring economic development (Gowalla).
                     GDP        Patents    Startups
Population Density   r = .5     r = .45    r = .38
Social Diversity     r = .88    r = .74    r = .83
Ideas Flow           r = .92    r = .77    r = .86

In Table 3.8, results show that the speed of access to ideas $\varphi_i$ in Gowalla is more correlated with economic development than population density and social diversity.
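The social diversity of Eq. 3.8 is a normalized entropy, and a short sketch may make its range concrete. The function name `social_diversity`, the input format (one edge count per other area), and the handling of degenerate cases are assumptions made for illustration; the convention that a zero count contributes nothing to the sum is the usual one for entropy.

```python
import math

def social_diversity(edge_counts, m):
    """Normalized entropy of how one area's edges are spread over the other areas.

    edge_counts: number of edges from area i to each other area (hypothetical
                 input format: one non-negative count per other area).
    m:           total number of areas, so area i has m - 1 possible partners.
    """
    total = sum(edge_counts)
    if total == 0 or m < 3:
        return 0.0  # assumption: report zero where the ratio is undefined
    entropy = 0.0
    for count in edge_counts:
        if count > 0:                  # a zero count contributes 0 * log(0) := 0
            p = count / total
            entropy -= p * math.log(p)
    return entropy / math.log(m - 1)

# Hypothetical example with m = 4 areas (three possible partner areas).
print(social_diversity([5, 5, 5], 4))  # edges spread evenly over partners
print(social_diversity([9, 0, 0], 4))  # all edges go to a single partner
```

Under this definition an area whose edges are spread evenly over all other areas has diversity 1, and an area whose edges all go to one other area has diversity 0.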

Figure 3.21: Economic development as a function of idea flow (FourSquare). [Panels: (a) Gross GDP, (b) Patents, (c) Startups, each versus $\varphi_i$, all on log scales, with one fitted line per year.]

Table 3.9: Measuring economic development (FourSquare).
                     GDP        Patents    Startups
Population Density   r = .5     r = .45    r = .38
Social Diversity     r = .88    r = .74    r = .83
Ideas Flow           r = .92    r = .77    r = .86

Figure 3.22: Speedy idea flow as a function of social diversity. [Panels (a) and (b): Speedy Idea Flow $\varphi_i$ versus Social Diversity $D_i$ with linear fits; (a) r=.98, (b) r=.99.]

In Table 3.9, there are two instances where social diversity is more correlated with economic development in FourSquare but still less correlated than the results in Table 3.8. Results show that the speed of access to ideas is correlated with economic development in the US from 2009 to 2012 because it is a near-perfect measure of social diversity, as shown in Fig. 3.22 for Gowalla (a) and FourSquare (b). The causality between the two is still unknown, but the results suggest that combining long ties and the spread of ideas might be an important indicator of economic development in addition to population size, density, and social diversity. Aggregating and normalizing hundreds of thousands of long ties across the US removes the potential effect of ideas not traveling randomly. Unlike social diversity, population density did not perform as well as the other measures because it was designed to measure characteristics of cities and not geographical areas with diverse ranges of population densities (e.g., New York consists of dense NYC and sparse NYS, which limits its predictive power). Finally, we focus only on a very specific dimension of social relationships (i.e., long ties) and ignore other ties that could lead to better correlations with economic development. While there are many more dimensions of human relationships (e.g., short ties, strong ties, friends from different communities, etc.), one particular dimension that could lead to better results within a geographical area is friends with different interests or skills, since they would complement each other in collaborations such as solving a difficult problem. Understanding the interplay of human relationships and economic development may suggest radical socially-driven alternatives in addition to the traditional stimulus packages for growing the economy [74] and a direction for studying urban growth [71].
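The stochastic model above can be illustrated on a toy example. The sketch below builds the flow matrix of Eq. 3.4 from a symmetric table of long-tie counts, approximates the stationary distribution of Eq. 3.7 by power iteration, and approximates $\varphi_i$ of Eq. 3.6 by iterating the standard hitting-time recurrence $H_{ki} = 1 + \sum_{j \ne i} f_{kj} H_{ji}$. The three-area counts, the function names, and the iterative solvers are assumptions made for illustration, not the implementation or data used in the experiments. The sketch further assumes that every area has at least one long tie and that the chain is irreducible and aperiodic.

```python
def flow_matrix(long_ties):
    """Row-normalize a symmetric matrix of long-tie counts (Eq. 3.4)."""
    return [[count / sum(row) for count in row] for row in long_ties]

def stationary_distribution(F, max_iters=10_000, tol=1e-12):
    """Approximate lambda (Eq. 3.7) by repeatedly applying lambda <- lambda F."""
    m = len(F)
    lam = [1.0 / m] * m
    for _ in range(max_iters):
        nxt = [sum(lam[i] * F[i][j] for i in range(m)) for j in range(m)]
        converged = max(abs(a - b) for a, b in zip(nxt, lam)) < tol
        lam = nxt
        if converged:
            break
    return lam

def mean_access_time(F, target, max_iters=10_000, tol=1e-12):
    """Approximate phi_target (Eq. 3.6): mean expected hitting time of `target`."""
    m = len(F)
    H = [0.0] * m                      # H[k] estimates H_{k,target}; H[target] = 0
    for _ in range(max_iters):
        nxt = [0.0 if k == target else
               1.0 + sum(F[k][j] * H[j] for j in range(m) if j != target)
               for k in range(m)]
        converged = max(abs(a - b) for a, b in zip(nxt, H)) < tol
        H = nxt
        if converged:
            break
    return sum(H) / (m - 1)

# Hypothetical long-tie counts between three areas (symmetric, zero diagonal).
long_ties = [[0, 2, 1],
             [2, 0, 1],
             [1, 1, 0]]
F = flow_matrix(long_ties)
lam = stationary_distribution(F)
phi = [mean_access_time(F, i) for i in range(len(F))]
print(lam)
print(phi)
```

In this toy network the third area has the fewest long ties; it receives the smallest stationary probability and has the largest mean access time, which is the inverse relationship the model relies on.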
3.7 Summary of Results

Contrary to the belief in the "death of distance" as a barrier to forming social ties [81], we find that the creation of friendship between two people in Gowalla is more likely to occur when they are geographically closer, and the likelihood of users being friends rapidly decreases as the geographic distance between them increases. Such geographic effects may help in designing spatially-aware community detection algorithms where on average every two people in a community are separated by a few hops and are also likely to be within spatial proximity.

First, our data analysis of the Gowalla friendship network reveals two degrees of geographical concentration, where friends and friends-of-friends are more likely to be within geographic proximity. Conversely, pairs of users who are separated by three or more hops of friendship relation are unlikely to be within geographic proximity. Also, friends who are within geographic proximity are more likely than distant friends to physically interact by going to the same places together. Yet, the likelihood of physical interactions among friends-of-friends is minuscule even though they are geographically concentrated.

Second, we showed that covers can serve as a null model for examining community structures. For most quality metrics, small communities are more likely to outperform large ones because it is much easier to find a small group that maximizes a particular metric. Therefore, comparing detected communities to covers tells us how much better the algorithm is performing than a proposed null model for a given community size.

Finally, we used the results from the covers and compared them to the communities detected by modified CPM, unweighted and weighted IA, and GANXiS. By incorporating spatial information into CPM to make the algorithm scalable, it detected meaningful communities of a large online social network where members are more likely to physically interact than members of a cover used as a null model. From the NCP plots, we noticed the importance of small-size communities in large social networks, in which it is much harder to find a large community because humans have limited resources to create and maintain relationships. We used the level of physical interactions among members in a community as the final quality measure to compare and validate the performance of the community detection algorithms against the closest-friend-first cover.
Other applications that we foresee might benefit from such spatial effects include recommendation systems and link prediction, by designing systems based on the knowledge of users' geographical locations, their social connections, and the structure of their friendship communities. For instance, recommendation systems could be enriched by incorporating geographical information of users, their friends, and location-based ratings to increase the quality of the recommended items [82]. Link prediction could be enriched by using pairs of users that are geographically close and belong to the same community to predict how likely they are to become friends or connected in the future [83].

CHAPTER 4
SOCIAL RANKING TECHNIQUES

Figure 4.1: Conceptualization of social ranking. [Social graph with users P1 to P6; web graph with pages ABC, Yahoo, Digg, CNN, Fox, and MSNBC.]

Previous work on the ranking of pages conceptualized the web as a network in which pages are nodes and links are directed edges, as illustrated in Fig. 4.1. Advances in social networks enabled a different perspective of ranking pages from a relationship point of view. For simplicity, the social network of users illustrated in the top rectangular box in Fig. 4.1 consists of nodes $P_1, P_2, \ldots, P_6$ where an undirected edge between $P_1$ and $P_2$ represents a social relationship of the two users and an undirected edge from $P_1$ to CNN represents $P_1$ broadcasting a CNN URL to its ties $P_2$, $P_3$, and $P_4$. Note that the edge from $P_1$ to CNN is not a part of the social network, but a connection between the web and the social network.

4.1 Google Buzz & Twitter

We collected data from two networks on the web. The first one is Google Buzz, a platform that combines social relationships and mini-blogging for information dissemination. The second network is Twitter, where users choose to follow sources of information. These two networks have messages containing URLs that provide us clues into how users would rank the quality of the information coming from URLs by using the techniques we describe later.

Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, "Social Ranking Techniques for the Web," in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Ontario, 2013, pp.

We collected the Google Buzz data from early September of 2011 to the middle of October of the same year. There were around 2.5M users who shared approximately 1M messages of which about 3M messages had URLs embedded in them. We collected the Twitter data from early September of 2011 to late December of that year. There were around 1M users who shared approximately 3M messages and 5M of them had URLs embedded in them. Additional details of the datasets for Google Buzz and Twitter are provided in Table 4.1 and Table 4.2. Note that "All URLs" refers to all representations of URLs embedded in messages; two different representations could be the same URL when they are masked by redirect services. "*URLs" refers to the final destinations of URLs that have been shared by at least two users within the network. In addition, we reduced the size of the datasets by keeping only users whose geographical locations were known. To pinpoint the geographical location of a user, we extracted locations from their geo-tagged messages and used the most frequent location as the location of their residence. The reduced networks are shown in Table 4.3.

The data collection has several limitations. First, parsing URLs from messages is prone to errors because humans have multiple ways of writing supposedly the same link. Examples are URLs containing typos and spelling mistakes, URLs masked by redirect services, and so on. Second, with limits on hardware resources, bandwidth sharing, and data access, we attempted to collect as much as we could for the purpose of ranking URLs on social media. Third, we were able to collect the entire connected component with BFS sampling for Google Buzz, which resulted in the sum of indegree being equal to the sum of outdegree.
Twitter is a much larger network that consists of hundreds of millions of accounts [26]. When calculating the data summary of Twitter, we counted only users who had been processed (i.e., whose information had been collected) and not users who were still waiting to be processed, which resulted in the sum of indegree not being equal to the sum of outdegree.
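The degree-sum observation can be reproduced on a toy graph. The sketch below runs a breadth-first crawl with an optional budget and then sums the in-degrees and out-degrees reported for the processed users only. The `follows` dictionary is a hypothetical stand-in for a platform API, and the function names are illustrative assumptions, not the crawler used for the datasets.

```python
from collections import Counter, deque

def bfs_crawl(follows, seed, budget=None):
    """Breadth-first crawl along 'follows' edges.

    follows: dict mapping a user to the list of users they follow
             (hypothetical stand-in for a platform API).
    budget:  maximum number of users to process; None means no limit.
    Returns (processed, waiting): users whose data was collected, and users
    discovered but still queued when the crawl stopped.
    """
    processed, seen, queue = [], {seed}, deque([seed])
    while queue and (budget is None or len(processed) < budget):
        user = queue.popleft()
        processed.append(user)
        for followee in follows.get(user, []):
            if followee not in seen:
                seen.add(followee)
                queue.append(followee)
    return processed, list(queue)

def degree_sums(follows, processed):
    """Sum of in-degrees and of out-degrees over the processed users only."""
    indegree = Counter(v for followees in follows.values() for v in followees)
    total_in = sum(indegree[u] for u in processed)
    total_out = sum(len(follows.get(u, [])) for u in processed)
    return total_in, total_out

# Toy network (hypothetical): a follows b and c, b follows c, c follows a.
follows = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

full, _ = bfs_crawl(follows, "a")
print(degree_sums(follows, full))        # complete crawl: the two sums agree

partial, waiting = bfs_crawl(follows, "a", budget=2)
print(degree_sums(follows, partial))     # truncated crawl: the sums differ
```

When the whole component is processed, every edge is counted once as an outlink and once as an inlink, so the sums match; when some users are still waiting in the queue, edges touching them are counted on one side only.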

Table 4.1: Data summary of Google Buzz.
x σ X
Users 2,522,19
Inlinks ,566,67
Outlinks ,566,67
Messages , ,439,19
All URLs , ,472,25
*URLs ,647,561

Table 4.2: Data summary of Twitter.
x σ X
Users 1,57,163
Inlinks 17, , B
Outlinks , ,421,23
Messages , ,31,683
All URLs , ,532,43
*URLs ,294,

4.1.1 Categories of URLs

Figure 4.2 shows categories of 1 most popular and 1 randomly selected URLs for Google Buzz (a,b) and Twitter (c,d). Popular URLs are defined by the number of spreaders, that is, the users who shared or re-shared a given URL. In Google Buzz (a), 24% of popular URLs are from social media, 16% are about technological products such as Apple, 15% are videos from YouTube, and so on. In Twitter (c), 41% of popular URLs are from social media, 19% are videos from YouTube, 11% are image related, and so on. Google Buzz has more URLs relating to technological products, while Twitter has more popular URLs relating to social media. For random URLs, Google Buzz has 27% of URLs from social media while for Twitter this number is 53%.

Table 4.3: Google Buzz (left) & Twitter (right) with geography.
x σ X x σ X
Users 24,813 15,36
Inlinks K M
Outlinks K M
Extracted URLs M M

Figure 4.2: Categories of popular (a,c) and random (b,d) URLs.

4.1.2 Spreaders & Affected Sets

From both the Google Buzz and Twitter datasets, we randomly chose 2,000 URLs with equal probability, denoted as the random set of URLs. We also chose the top 2,000 shared URLs, denoted as the popular set of URLs. There are two sets of URLs in each network, giving us four sets of URLs in total. For each URL, we calculated the size of the affected set, which consists of nodes that received the URL from the spreaders but chose not to spread it further. We also computed the average length of all shortest paths from 1 randomly chosen users to members of a random subset of spreaders. The results are shown in Fig. 4.3(a) for Google Buzz and Fig. 4.3(b) for Twitter. A point on the plot is a URL where the x-axis corresponds to the size of the affected set in logarithmic scale, and the y-axis corresponds to the average length of shortest paths from randomly chosen users to the spreaders. A red point is a URL from the random set, and a blue star is a URL from the popular set. The black line is a linear classifier that separates popular URLs from random URLs, and crosses are points that have been misclassified. We substitute the entire spreader set with a randomly selected subset simply as a matter of efficiency, because shortest-path computations are expensive in large networks, as mentioned by the authors in [84].

Figure 4.3: Shortest paths to URLs in Google Buzz (a) and Twitter (b). [Axes: Size of Affected Set (log scale) versus Avg. Distance, for random and popular URLs.]

In Fig. 4.3, we noticed that as the size of the affected set increases, the average distance from randomly selected users to the information on the web page decreases for random and popular sets of URLs in Google Buzz. This is because very large affected sets increase the likelihood that a randomly chosen user has a path through an affected user reaching a spreader. This agrees with our intuition that information collectively shared by users with high outdegrees has a greater coverage of dissemination. However, this correlation is weaker in Twitter due to the celebrity effect of some users having millions of followers and creating large affected sets, for instance when a URL was shared in the network only by a celebrity. More importantly, affected sets influence our social ranking techniques, where the structure of the network instead of the web topology is used to rank pages or URLs.

4.1.3 Information Distances

Figure 4.4 shows the ultra small-world property of the distance from a randomly selected starter to popular and random URLs in Google Buzz (a) and Twitter (b). For each URL, we randomly selected 1 starters and calculated the length of the shortest path from the starter to the closest spreader of the URL. We show the densities of the number of hops in Fig. 4.5 and the average shortest path lengths in Fig. 4.4.
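A minimal sketch of the distance from a starter to the closest spreader of a URL: a single breadth-first search from the starter that stops at the first spreader it reaches. The adjacency format (each user mapped to the users reachable in one hop) and the function name are assumptions made for illustration; which edge direction counts as a hop depends on how the network is modeled.

```python
from collections import deque

def distance_to_closest_spreader(adj, starter, spreaders):
    """Number of hops from `starter` to the nearest node in `spreaders`.

    adj:       dict mapping a user to an iterable of users one hop away
               (hypothetical format).
    Returns None if no spreader is reachable from the starter.
    """
    spreaders = set(spreaders)
    if starter in spreaders:
        return 0
    seen = {starter}
    queue = deque([(starter, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor in seen:
                continue
            if neighbor in spreaders:
                return dist + 1        # BFS reaches the closest spreader first
            seen.add(neighbor)
            queue.append((neighbor, dist + 1))
    return None

# Toy network (hypothetical): s -> a -> c -> d, and s -> b.
adj = {"s": ["a", "b"], "a": ["c"], "b": [], "c": ["d"]}
print(distance_to_closest_spreader(adj, "s", {"c", "d"}))
```

Averaging this value over many randomly selected starters gives one point per URL; stopping at the first spreader avoids computing a separate shortest path to every spreader.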

Figure 4.4: Ultra small-world property from starters to information. [Panels (a), (b): Avg. Path Length per category: YO, TW, YF, FA, IM, NE, TE, RA.]

Figure 4.5: Densities of shortest path lengths from starters to URLs. [Panels (a), (b): Density versus Hop for Facebook, Images, News, Tech, Twitter, Youtube, and Random.]

Results show that a randomly selected starter in Google Buzz is about one hop away from a popular URL compared to 2.5 hops from a random URL. For Twitter, a randomly selected starter is about 2 hops away from a popular URL and a little further from a random URL. These average shortest path lengths to popular and random URLs are much shorter than the six degrees of separation in the Travers-Milgram small-world experiment [29], demonstrating that the distance from human to information is sometimes shorter than the distance from human to human.

4.1.4 Geographical Distances

Figure 4.6 shows the geographical concentration of pairs of users who are separated by a fixed number of hops in Google Buzz and Twitter, and in two additional networks: Gowalla and FourSquare. We noticed that these four social networks have two degrees of spatial concentration, where users who are separated by one or two hops are more geographically concentrated than pairs who are separated by 3 hops or more. For instance, 69% of friendship pairs (hops=1, shown in a) are within 56 km, 47% of friends-of-friends pairs (hops=2, shown in b) are within 56 km, 25% of pairs with hops=3 (shown in c) are within 56 km, 20% of pairs with hops=4 (shown in d) are within 56 km, 17% of pairs with hops=5 (shown in e) are within 56 km, and 17% of pairs with hops=6 (shown in f) are within 56 km.

Figure 4.6: Two degrees of spatial concentration. [Panels (a) to (f): hops 1 to 6. Axes: Geographic Distances (km) versus Density, for networks B, T, G, and F.]

An explanation for these two degrees of concentration is the effect of the local clustering coefficient of a user, defined as the fraction of its friends who are friends with each other. For the probability that two people with a friend in common are friends themselves to be high, they need to be within some geographical proximity, or else the opportunity for them to interact is small. The average local clustering coefficients of $10^4$ randomly selected pairs of users in Google Buzz, Twitter, Gowalla, and FourSquare are 0.31, 0.36, 0.3, and 0.34, respectively.
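The local clustering coefficient can be sketched in a few lines. The version below uses the common formulation: among all pairs of a user's friends, the fraction of pairs that are themselves friends. The adjacency format (each user mapped to a set of friends, with friendship symmetric) and the choice to return 0 for users with fewer than two friends are assumptions made for illustration.

```python
from itertools import combinations

def local_clustering(adj, user):
    """Fraction of pairs of `user`'s friends that are friends with each other.

    adj: dict mapping each user to the set of their friends (undirected, so
         b in adj[a] implies a in adj[b]); hypothetical input format.
    """
    friends = adj[user]
    k = len(friends)
    if k < 2:
        return 0.0  # assumption: no friend pairs means coefficient 0
    linked_pairs = sum(1 for a, b in combinations(friends, 2) if b in adj[a])
    return linked_pairs / (k * (k - 1) / 2)

# Toy network (hypothetical): a triangle a-b-c plus a pendant friend d of a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_clustering(adj, "a"))  # one of three friend pairs is linked
print(local_clustering(adj, "b"))  # the single friend pair is linked
```

Averaging this quantity over a random sample of users gives a network-level value comparable in kind to the averages reported above.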


75 64 Figure 4.7: Four dimensions of social relationships Densities of Social Relationships Four dimensions of social relationships are visualized in Fig Friends are defined as reciprocal following relationships. Neighbors are users that are geographically close. Peers are users that belong in the same community. Interests are users that have similar interests measured by the keyword similarity in URLs they share. The intersection of circles represents pairs of users with multiple dimensions of social relationships. Two represents pairs of users with two dimensions of social relationships such as being friends and neighbors. Table 4.4: Social relationships densities in Google Buzz. Buzz Friends Peers Interests Neighbors Among Friends Among Peers Among Interests Among Neighbors Among Random Tables show the densities of friends, peers, neighbors, and users with similar interests. The left column represents relationships of the pairs and the top row represents the density of the relationships. For example, among friends in Table 4.4 for Google Buzz, 99% of are also peers, 9% of them have similar interests, 58% of

76 65 Table 4.5: Social relationships densities in Twitter. Twitter Friends Peers Interests Neighbors Among Friends Among Peers Among Interests < Among Neighbors Among Random < a) b) Avg. CKS Friends Followings Peers Random Avg. CKS Friends Followings Peers Random Geographical Distance (km) Geographical Distance (km) Figure 4.8: CKS for friendship, following, peers, and random pairs. them are neighbors. For Twitter, among friends, 85% of them are peers, 11% have similar interests, and 3% are neighbors. The densities of friends, peers, interests, and neighbors are consistent in Google Buzz and Twitter. For example, most of the friends are among peers, most of the peers are among friends, most of people with similar interests are among peers, and most of the neighbors are among friends Keyword Similarity Figure 4.8 shows cosine keyword similarity (CKS) of selected friendship, following, peers, and random pairs of users in Google Buzz (a) and Twitter (b). The CKS of two users is the cosine of the angle between the two vectors consisting of keyword frequencies extracted from webpages shared by these two users. Let W v and W v be lists of words in web pages that users v and v have shared. Let A v be a vector of word frequencies where the i th index in A v represents the number of times the word w i appears in W v The keyword cosine similarity for v and v is defined as:

77 66 cos(u, u ) = A ua u A B. (4.1) A pair of nodes (v, v ) represents friendship if they follow each other, following if v follows v but not vice-versa and is a random pair if there is no following in either direction. We calculated the average CKS of friendship, following, peers, and random pairs as a function of geographical distance separating members of these pairs. For random pairs, we noticed that CKS decreases as the geographical distance increases. On the other hand, the effect of geography on cosine keyword similarity is negligible when comparing friendship, peer, and following pairs. However, they have a higher cosine keyword similarity than random pairs. 4.2 Social Ranking Techniques Let G U = (V, E) be a directed multi-labeled graph where V is the set of nodes, E is the set of edges where e = (v i, v j ) represents a directed edge from node v i to node v j, and U is the set of URLs with subsets of which nodes in V are labeled. For URL u U, let S(u) denotes the set of all spreaders of the URL u; in other words all nodes in V who has posted u PageRank on Social Network We extend the PageRank algorithm to rank URLs on a social network (PRSN) as follows. Given a multi-labeled graph G U = (V, E), let F = (f ij ) be a n n weighted adjacency matrix where n is the number of nodes (i.e, n = V ), f ij = if there is no directed edge from v i to v j, and f ij = 1/deg(i) otherwise. Let R be a vector consisting of n elements where the i th element of R denoted as r i corresponds to the PageRank score of the i th node. Let k be the maximum number of iterations that the PageRank algorithm runs. At the first iteration, every node sends its score divided by the number of links pointing from this node to other nodes through each outgoing link. After that, each node updates its score to the sum of scores that it has received: r i = f 1i r 1 + f 2i r f ni r n. (4.2) If there is an edge from node j to node i, then f ji > and node j will send

the fraction $f_{ji} = 1/\deg(j)$ of its score $r_j$ to node $i$. Equation 4.2 can be compactly written as $R^{<1>} = F^T R^{<0>}$, where $F^T$ is the transpose of the matrix $F$, the superscript $<1>$ denotes the scores of all nodes after the first iteration, and $R^{<0>}$ is the initial vector. Let $R^{<k>}$ be the scores of the nodes at the $k$-th ($k > 0$), or last, iteration, defined by induction as:

\[ R^{<k>} = F^T R^{<k-1>}. \tag{4.3} \]

If there are sinks in the graph $G$, that is, nodes without outgoing edges, then for large enough $k$ they will absorb all scores, since scores can enter but cannot leave the sinks. One way to fix this problem is to scale the strength of the links by a constant factor $0 < \sigma < 1$ and to compensate for this scaling by adding an artificial flow between any two nodes with the weight $\frac{1-\sigma}{n}$. This solution is known as the scaled version of PageRank [85]. The score of the $i$-th node is then denoted as $r_i$ and is defined as:

\[ r_i = \sum_{j=1}^{n} \left( \sigma f_{ji} + \frac{1-\sigma}{n} \right) r_j. \tag{4.4} \]

The scaled update can be written compactly using the matrix $F' = \sigma F + \frac{1-\sigma}{n}$, in which the constant is added to every entry of $\sigma F$. By the Perron-Frobenius theorem [85], the scaled PageRank scores converge to a stable solution:

\[ R^{<i>} = F'^T R^{<i-1>} \quad \text{where } 0 < i \le k. \tag{4.5} \]

Given a subset of URLs $U' \subseteq U$, the PageRank score of a URL $u \in U'$ on a social network (PRSN) is defined as:

\[ PRSN(u) = \frac{\sum_{v_i \in S(u)} r_i^{<k>}}{\sum_{u' \in U'} \sum_{v_i \in S(u')} r_i^{<k>}}. \tag{4.6} \]

HITS on Social Network

The HITS algorithm used to rank URLs on a social network (HSN) is defined as follows [35], [85]. Given $G_U = (V, E)$, let $M = (m_{ij})$ be an $n \times n$ adjacency matrix where $n$ is the number of nodes, $m_{ij} = 1$ if there is a directed edge from node $v_i$ to

node $v_j$, and $m_{ij} = 0$ otherwise. Let $k$ be the maximum number of iterations. Given a set of URLs $U' \subseteq U$, let $H$ and $A$ be the vectors of scores for hubs and authorities, respectively. Authorities are the URLs (i.e., $u \in U'$) and hubs are the nodes that share these URLs. The $i$-th element of $H$ is the score of the $i$-th hub, and the $j$-th element of $A$ is the score of the $j$-th authority. At the first iteration, the score $h_i$ of a hub is set to the number of authorities to which it points, and the score $a_j$ of an authority is set to the sum of the scores of the hubs pointing to it. More formally, $h_i$ and $a_j$ are defined as:

\[ h_i^{<0>} = m_{i1} + m_{i2} + \cdots + m_{in}, \tag{4.7} \]
\[ a_j^{<0>} = m_{1j} h_1^{<0>} + m_{2j} h_2^{<0>} + \cdots + m_{nj} h_n^{<0>}. \tag{4.8} \]

Let $H^{<l>}$ and $A^{<l>}$ be the scores of hubs and authorities at iteration $l$. The HITS algorithm [85] can then be written as:

\[ H^{<l>} = (M M^T)^{l} H^{<0>} \quad \text{where } 0 < l \le k, \tag{4.9} \]
\[ A^{<l>} = (M^T M)^{l-1} M^T H^{<0>} \quad \text{where } 0 < l \le k. \tag{4.10} \]

Finally, the score of a URL among the authorities is the value $a_j^{<k>}$ normalized by the sum of the scores in the vector $A$.

Ranking with Maximum Flow

We define the following maximum flow algorithm to rank URLs on a social network. Given a graph $G_U = (V, E)$ and a subset of URLs $U' \subseteq U$, let $p$ represent a node. We want to rank the URLs in $U'$ with respect to $p$ and $G$ by constructing a directed flow graph denoted as $G_p = (V', E')$. The first part of the construction copies the social structure of $G$ to $G_p$. For every node $v_i$ that $p$ follows, we add $v_i$ to $V'$ and the edge $e = (p, v_i)$ to $E'$. At each subsequent iteration, we repeat the same process for every node that was added to $V'$ in the previous iteration; that is, if $v_i$ was added into $V'$ and

there is an edge $e = (v_i, v_j)$, then we add $v_j$ to $V'$ if $v_j$ has not been added before. The edge $e = (v_i, v_j)$ is added into $E'$ even if $v_j$ has been added before. This process of constructing the graph $G_p$ continues until all nodes from $V$ that are reachable from $p$ have been added into $V'$. For practical reasons, it is wise to stop while the diameter of $G_p$ is still small (e.g., three) to reflect the influence of nodes that are within network proximity. At the end of the process, an edge originating from node $v$ gets a weight equal to the inverse of the degree of $v$ in $G_p$.

Figure 4.9: Graph $G_p$ for ranking URLs $\{u_1, u_2\}$ with respect to node $p$. The source $p$ and the nodes P1 to P5 form the information and social network, $u_1$ and $u_2$ are the web pages, and $t$ is the super sink.

The second part of constructing $G_p$ introduces some additional nodes and edges. For every URL $u' \in U'$, we add $u'$ into $V'$. For every spreader $s \in S(u')$ of the URL $u'$, we add an edge $e = (s, u')$ with a weight of 1 into $E'$ if $s \in V'$. We add a super sink denoted $t$ into $V'$ and add an edge $e = (u', t)$ with an edge weight of 1 for every URL $u'$ in $U'$. The maximum flow of the graph $G_p$ from source $p$ to super sink $t$ is a function $F$ that assigns a non-negative value to each edge so that it maximizes the total flow from $p$ to $t$ while satisfying two conditions. First, the flow does not exceed the weight of an edge, i.e., $F(e) \le c_e$. Second, it obeys conservation of flow at every node except the source $p$ and the super sink $t$, i.e.,

\[ F_{out}(v) = \overbrace{\sum c_e}^{\text{Flow out to social ties}} + \overbrace{\sum c'_e}^{\text{Flows out to pages}} = F_{in}(v) \tag{4.11} \]

where $c_e$ is the assigned flow for the edge $e = (v_i, v_j)$ between two nodes, and $c'_e$ is the assigned flow for the edge $e = (v_i, u_j)$ between the node $v_i$ and the URL $u_j$. The construction of the graph $G_p$ is illustrated in Fig. 4.9. Polynomial running time algorithms, such as the Edmonds-Karp algorithm with running time $O(V E^2)$, for finding the maximum

flow can be found in [85], [86].

Variants of Maximum Flow

The second variant of network flow incorporates social relationships and geography by assigning weights to edges based on the geographical distance between the nodes. We assign the edge weight for nodes $v_i$ and $v_j$ as:

\[ w_{ij} = \frac{g_d(v_i, v_j)^{-1}}{\sum_{v_k \in v_i^{out}} g_d(v_i, v_k)^{-1}} \tag{4.12} \]

where $g_d(v_i, v_j)$ is the geographical distance from $v_i$ to $v_j$. The third variant uses cosine keyword similarity to assign the weights. The edge weight for nodes $v_i$ and $v_j$ is defined as:

\[ w_{ij} = \frac{CKS(v_i, v_j)^{-1}}{\sum_{v_k \in v_i^{out}} CKS(v_i, v_k)^{-1}}. \tag{4.13} \]

The last variant of network flow uses community structure by replacing the social network with the community group and connecting the source to all members of the community. The (binary) edge weights within the community do not take geography or cosine keyword similarity into account.

Social Ranking Experiments

Comparing PageRank & HITS

We selected 30 URLs from each of the popular and random URL sets. For each selected URL, we calculated its score using PageRank and HITS, and ranked the URLs (i.e., 1st, 2nd, 3rd, etc.) with respect to the set. The ranking results of PageRank and HITS for popular and random URLs are compared in Fig. 4.10 for Google Buzz and Fig. 4.11 for Twitter. The ranking results for Google Buzz are listed in Table 4.6 and Table 4.7. The rankings of popular URLs by PageRank and HITS are more consistent than the rankings of random URLs. We measured the ranking consistency as the average difference of two ranking algorithms on a set of URLs (i.e., $\frac{1}{w} \sum_{u \in U'} |P_{HSN}(u) - P_{PRSN}(u)|$) and as the sum of differences (i.e., $\sum_{u \in U'} |P_{HSN}(u) - P_{PRSN}(u)|$), where $P_x(u)$ is the position of the URL $u$ determined by the algorithm $x$ and $w$ is the number of URLs.
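The two consistency measures just defined can be sketched in a few lines of Python. This is a minimal illustration, assuming each algorithm's output is available as a dictionary from URL to score; the function names and the three placeholder URLs with made-up scores are hypothetical and are not taken from the thesis data.

```python
def rank_positions(scores):
    """Return the 1-based position of each URL when sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {url: position for position, url in enumerate(ordered, start=1)}


def rank_differences(positions_a, positions_b):
    """Return (average difference, sum of differences) over the shared URLs."""
    urls = positions_a.keys() & positions_b.keys()
    total = sum(abs(positions_a[u] - positions_b[u]) for u in urls)
    return total / len(urls), total


# Made-up scores for three placeholder URLs (not data from the thesis).
hsn_scores = {"a.example": 0.5, "b.example": 0.3, "c.example": 0.2}
prsn_scores = {"a.example": 0.2, "b.example": 0.3, "c.example": 0.5}

average, total = rank_differences(rank_positions(hsn_scores),
                                  rank_positions(prsn_scores))
print(average, total)
```

Here the two rankings are exact reverses of each other, so the first and last URLs each move by two positions and the middle one does not move.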

The average difference is more appropriate than the sum of differences when ranking a large number of pages. An example is ranking 1 pages instead of 5 pages. The average gives the average difference of two ranking algorithms in the 1 pages, and the sum difference gives the difference in ranks of the two algorithms. For a smaller number of pages, the sum might be more appropriate for quantifying the difference between two ranking algorithms. For the popular URLs in Google Buzz, the average difference was 2.9, meaning that HITS and PageRank were off by about 3 positions on average, and the sum of differences between them was 86. For the random URLs in Google Buzz, the average difference was 9.6 and the sum of differences was 288. For the popular URLs in Twitter, the average difference was 5.9 and the sum of differences was 178. For the random URLs in Twitter, the average difference was 7.2 and the sum of differences was 216. In both networks, popular URLs are ranked more consistently than random URLs, which makes HITS more suitable than PageRank for ranking viral information because it is computationally more efficient.

Figure 4.10: Ranking URLs on Google Buzz. (a) Popular URLs. (b) Random URLs. Each panel plots the labelled URLs by their position under PageRank on Social Network and under HITS on Social Network.

Flow Ranking

We noticed that the rankings that maximum flow produces for individual users are less correlated with one another than the rankings computed by PageRank and HITS. First, we compared the ranking results of maximum flow with

PageRank and HITS using popular and random URLs for Google Buzz, shown in Fig. 4.12 for popular URLs and Fig. 4.13 for random URLs. The first and second plots on the left are ranking results of popular URLs and the third and fourth plots on the right are ranking results of random URLs, as labelled by their sub-captions. A point on the graph is a URL, where the x-axis is the ranking position of the URL determined by maximum flow and the y-axis is the ranking position determined by either PageRank or HITS, as labelled on the y-axis. The identical layout for Twitter is shown in Fig. 4.14 for popular URLs and Fig. 4.15 for random URLs.

Figure 4.11: Ranking URLs on Twitter. (a) Popular URLs. (b) Random URLs. Each panel plots the labelled URLs by their position under PageRank on Social Network and under HITS on Social Network.

Figure 4.12: Social ranking with popular URLs on Google Buzz. (a) Max. Flow vs. PageRank. (b) Max. Flow vs. HITS. Axes: Personalized Ranking with Maximum Flow against PageRank or HITS on Social Graph; series: Person 1 to Person 4 and the line y = x.
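The PageRank positions plotted in these figures come from the PRSN score of Eqs. 4.4 and 4.6. The following is a minimal sketch of that computation, assuming the social graph is given as a dictionary mapping each user to the list of distinct users they follow. The value of sigma, the uniform initial vector, the iteration count, and all names in the toy example are assumptions made for illustration, since the text does not fix them.

```python
def scaled_pagerank(follows, sigma=0.85, iterations=50):
    """Iterate Eq. 4.4: r_i = sum_j (sigma * f_ji + (1 - sigma) / n) * r_j."""
    nodes = set(follows)
    for targets in follows.values():
        nodes.update(targets)
    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}  # assumed uniform initial vector
    for _ in range(iterations):
        # The (1 - sigma) / n term of Eq. 4.4 is the same for every node.
        uniform_share = (1.0 - sigma) / n * sum(scores.values())
        updated = {v: uniform_share for v in nodes}
        for v, targets in follows.items():
            for w in targets:
                # f_vw = 1 / deg(v): v splits its score evenly over its out-links.
                updated[w] += sigma * scores[v] / len(targets)
        scores = updated
    return scores


def prsn(scores, spreaders):
    """Eq. 4.6: each URL's share of the total score held by spreaders."""
    mass = {url: sum(scores.get(v, 0.0) for v in users)
            for url, users in spreaders.items()}
    total = sum(mass.values())
    return {url: (m / total if total else 0.0) for url, m in mass.items()}


# Hypothetical toy network: a and b follow each other, c follows a.
toy_follows = {"a": ["b"], "b": ["a"], "c": ["a"]}
toy_spreaders = {"u1": {"a"}, "u2": {"c"}}
toy_prsn = prsn(scaled_pagerank(toy_follows), toy_spreaders)
print(toy_prsn)
```

In the toy network nobody follows c, so the URL posted by c ends up with a smaller share than the URL posted by a.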

Figure 4.13: Social ranking with random URLs on Google Buzz. (a) Max. Flow vs. HITS. (b) Max. Flow vs. PageRank.

Figure 4.14: Social ranking with popular URLs on Twitter. (a) Max. Flow vs. HITS. (b) Max. Flow vs. PageRank.

Figure 4.15: Social ranking with random URLs on Twitter. (a) Random URLs. (b) Random URLs.

Axes in Figures 4.13 to 4.15: Personalized Ranking with Maximum Flow against PageRank or HITS on Social Graph; series: Person 1 to Person 4 and the line y = x.
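The personalized rankings on the x-axes of these figures come from the flow graph $G_p$ described under Ranking with Maximum Flow. The sketch below builds $G_p$ and runs a plain Edmonds-Karp search on it. Several details are assumptions rather than statements from the text: the breadth-first cutoff `max_depth` stands in for the small-diameter stopping rule, the edge weight uses the out-degree of a node within $G_p$, and a URL is scored by the amount of flow that reaches the super sink through it. All function names and the toy data are hypothetical.

```python
from collections import defaultdict, deque

SINK = ("sink",)  # hypothetical label for the super sink t


def build_flow_graph(follows, spreaders, p, max_depth=3):
    """Capacities of G_p: social edges near p, then spreader->URL and URL->sink."""
    depth = {p: 0}
    queue = deque([p])
    social_edges = []
    while queue:
        v = queue.popleft()
        if depth[v] == max_depth:
            continue
        for w in follows.get(v, ()):
            social_edges.append((v, w))  # kept even if w was already reached
            if w not in depth:
                depth[w] = depth[v] + 1
                queue.append(w)
    out_degree = defaultdict(int)
    for v, _ in social_edges:
        out_degree[v] += 1
    capacity = defaultdict(dict)
    for v, w in social_edges:
        capacity[v][w] = 1.0 / out_degree[v]  # assumption: out-degree in G_p
    for url, users in spreaders.items():
        for s in users:
            if s in depth:
                capacity[s][("url", url)] = 1.0
        capacity[("url", url)][SINK] = 1.0
    return capacity


def max_flow(capacity, source, sink, eps=1e-12):
    """Edmonds-Karp: augment along shortest residual paths; return the residual graph."""
    residual = defaultdict(dict)
    for u, neighbours in capacity.items():
        for v, c in neighbours.items():
            residual[u][v] = residual[u].get(v, 0.0) + c
            residual[v].setdefault(u, 0.0)
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > eps and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return residual
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck


def rank_urls(follows, spreaders, p, max_depth=3):
    """Score each URL by the flow entering the sink through it (an assumption)."""
    residual = max_flow(build_flow_graph(follows, spreaders, p, max_depth), p, SINK)
    scores = {url: residual[SINK].get(("url", url), 0.0) for url in spreaders}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy input: p follows a, b, c; a and b posted u1, c posted u2.
toy_ranking = rank_urls({"p": ["a", "b", "c"]},
                        {"u1": {"a", "b"}, "u2": {"c"}}, "p")
print(toy_ranking)
```

In the toy input each of the three social edges carries a third of a unit, so the URL posted by two of the followed users receives twice the flow of the URL posted by one.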

Table 4.6: Ranking results of 30 popular URLs in Google Buzz.

URLs PRSN HSN MF
abcnews.go 1 1 9/12/1/15
youtube 2 2 5/7/5/6
yahoo 3 1 1/2/2/4
businessweek /14/12/14
bloomberg 5 9 1/14/13/12
wordpress 6 7 5/5/7/9
nytimes 7 4 1/14/6/1
appleinsider 8 3 1/14/13/16
facebook 9 8 1/1/1/1
wired 1 5 9/14/13/15
lockerz /6/6/6
apple /8/9/8
pcworld /13/1/7
guardian /14/8/1
reuters /14/1/16
ted /13/7/1
amazon /9/8/1
techcrunch /13/9/14
engadget /13/7/7
reddit /13/8/11
empireavenue /14/11/15
boston /3/3/3/
xkcd /4/8/2
whitehouse /14/11/14
gizmodo /1/12/12
pingchat /12/12/14
thesocialnetwork-movie /14/13/14
bbc /11/4/13
photofocus /14/13/16
stackoverflow 3 3 6/11/12/

Rank Differences

For personalized ranking, we measured the ranking consistency as the average difference of a pair of users with respect to a URL set. For instance, in Table 4.8, the left column and the top row list the four selected users, and the element $a_{ij}$ is the average difference between users $i$ and $j$. Note that the upper triangle (elements above the diagonal) refers to the random URLs and the lower triangle (elements below the diagonal) refers to the popular URLs. The right column refers to the outdegree of users in the random URLs, and the last row refers to the outdegree

of users in the popular URLs. For Twitter, the ranking results in the same format are given in Table 4.9.

Table 4.7: Ranking results of 30 random URLs in Google Buzz.

URLs PRSN HSN MF
networkedblogs /5/7/2
picasaweb.google /3/1/5
ping.fm 3 1 5/4/4/4
thenextweb 4 3 8/7/8/3
twitter /17/13/1
income4free /1/2/1
fastestwaylosebellyfat /9/1/1
digg /19/12/5
sports.espn.go 9 4 4/6/6/6
wired /21/9/9
businessinsider /2/3/8
forbes /12/12/9
foxnews /13/5/9
behance /23/13/8
huffingtonpost /2/11/7
entrepreneur /21/13/1
puntogov /23/13/1
addictivefonts /14/13/9
theprism /2/13/1
telegraph /1/13/1
npr /19/13/1
popsci /11/13/1
economist /16/13/1
marketwatch /8/13/1
opencog /23/13/8
dslreports /15/13/1
last.fm /23/13/1
tech.slashdot /22/13/1
wimp /18/13/1
socialturns /18/13/1

For random URLs in Google Buzz, persons $p_1$ and $p_3$ have an average difference of 1.7, whereas $p_2$ and $p_4$ have an average difference of 6.7. For popular URLs, the variability is smaller: $p_4$ and $p_2$ have an average difference of 2.0, and $p_1$ and $p_2$ have an average difference of 3.2. Outdegree measures the number of people a user follows, and it is reported because the ranking results are based on these people. Finally, ties are expected when using maximum flow since the number of URLs shared among

friends is minuscule compared to the number of pages in the deep Web. Therefore, we simply use PageRank or HITS to break ties among pages when necessary.

Table 4.8: Avg. ranking differences in Google Buzz. Rows and columns: $p_1$, $p_2$, $p_3$, $p_4$, with an additional outdegree column and outdegree row.

Table 4.9: Avg. ranking differences in Twitter. Rows and columns: $p_1$, $p_2$, $p_3$, $p_4$, with an additional outdegree column and outdegree row.

Rank Distributions

We examine variants of flow ranking as follows. We selected a user in Twitter and then selected the top 25 URLs, in terms of CKS, shared by the people that this user follows. These 25 URLs contain keywords similar to those of the URLs that this user has previously shared. Once we have the candidate URLs, we use network flow to re-rank them, taking into account social relationships, the effect of geography, and community structure. The results show a re-ordering in which geography reduces the number of URLs with positive scores, because only spreaders who are geographically close are considered. Community structure, on the other hand, distributes the scores of URLs more evenly, since more spreaders are taken into consideration. This flexibility allows users to select information that is locally relevant when appropriate, or to select information of potential interest from their community members. Figure 4.16 shows the rank correlation coefficient of URLs between variants of network flow and PageRank. For a selected user, we selected 25 URLs from its neighborhood and ranked these URLs using variants of network flow: without geography

(a), with geography (b), and with community (c). Given a set of URLs $U$, let $R_U(v)$ and $R_U(v')$ be the ranking results for nodes $v$ and $v'$. The rank correlation coefficient, denoted as $\tau$, is defined as:

\[ \tau = \frac{n_c - n_d}{0.5\,k(k-1)} \tag{4.14} \]

where $k = |U|$, $n_c$ is the number of concordant pairs, and $n_d$ is the number of discordant pairs in $R_U(v)$ and $R_U(v')$. A value of $\tau = 1$ means that the two rankings are identical, $-1$ that they are in reverse order, and $0$ that they are independent. The results show that personalized ranking using network flow is largely independent of PageRank.

Figure 4.16: Densities of rank correlation coefficient. Panels (a), (b), and (c) plot P(x) against Tau for Twitter and Buzz.

Figure 4.17: Ranking quality results. Panels (a) and (b) plot Avg. NCDG for Flow O, Flow G, Flow I, Flow C, PR, and BL.

Rank Validation

Fig. 4.17 shows ranking quality results for Google Buzz (a) and Twitter (b) using the four variants of network flow, PageRank applied to the social/information network, and the baseline. The y-axis is the normalized cumulative discounted gain

(NCDG), which is used to benchmark the quality of ranking results and is defined below. For this experiment, we selected 5 users and 1 URLs from a user's neighborhood. Then we ranked these URLs by using the six ranking techniques. NCDG is defined as follows. Let $p$ be a source node and $R$ a list of ranked URLs for $p$. The discounted cumulative gain (DCG) for $R$ with respect to $p$ is:

\[ DCG(R, p) = \delta(R_1, p) + \sum_{i=2}^{w} \frac{\delta(R_i, p)}{\log(i)} \tag{4.15} \]

where $\delta(R_i, p)$ is 1 if $R_i$ is relevant to $p$ and 0 otherwise, and $w$ is the number of pages to be ranked. We assume $R_i$ is relevant to $p$ if $p$ has shared $R_i$ before. The NCDG is the DCG divided by the DCG of the optimal ordering of $R$ with respect to $p$. The optimal ordering is defined by using the pages that the user later shared. To capture any effect of social relevance, we randomly rank these URLs and use this random ranking as the baseline. The results in Fig. 4.17 confirm that social relevance can improve ranking results by up to 19% in Google Buzz and 17% in Twitter. The improvement is defined as the difference between the average NCDG of flow rank and the average NCDG of PageRank, divided by the average NCDG of PageRank (see Fig. 4.17). It is interesting that peers in the community have a stronger effect on ranking quality than friends in Google Buzz. This is consistent with the densities of social relationships in Table 4.4, where 25% of peers have similar interests compared to 9% of friends. For Twitter, the densities in Table 4.5 align with the ranking quality results in Fig. 4.17(b), where the densities of interests among friends and peers are almost identical. Recall that PageRank is calculated using the social network and not the web graph.

4.4 Summary of Results

Information shared between users in online social networks, such as URLs, provides a unique perspective on the ranking of web pages. In our approach, humans rather than pages rank the URLs by sharing them, and the social network of the users, instead of the web graph topology, is used to propagate the ranking. First, we collected two large-scale information networks of online users to study

how users in these networks share URLs, which impacts the distance between a person and a URL. For instance, the researchers in [3] estimated the number of hops between any two pages to be 19 on average, while Milgram estimated that the number of hops between any two people is no more than 6 [87]. Since information propagates differently in social networks, the social structure bounds how far a person is from a shared URL. Second, we reinterpreted the ranking techniques of PageRank and HITS and proposed using maximum network flow to personalize the ranking of pages for each individual user. Maximum flow detects the popularity of a shared URL among friends, but popularity does not necessarily reflect endorsement, which could impact ranking because one could share something that was not meant to be positive (e.g., sad news). We expected that each individual would rank the URLs differently, since no two people on a social network are the same. Interestingly, the rankings of popular URLs produced by PageRank and HITS are more correlated than those of random URLs, suggesting that the overall view of users on ubiquitous information is more consistent, even though everyone has their own opinion in the end. Instead of attempting to socially rank the entire web, we re-ranked a selected set of URLs to make the approach scalable and efficiently executable for search engines. If the size of the web doubles in the next few years, it would not affect our approach, since only the subset of URLs that users have shared is actually re-ranked. Third, experimental results show that personalization can improve ranking quality by up to 19% compared to the baseline and 5% compared to PageRank in Google Buzz. For Twitter, personalization improves ranking quality by up to 17% compared to the baseline, but it is not better than PageRank.
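The ranking-quality percentages above are based on the NCDG measure of Eq. 4.15 (more commonly abbreviated NDCG). A minimal sketch of that measure follows. It assumes a base-2 logarithm, which the text does not specify, and it takes the optimal ordering to be the one that places all relevant URLs first; the function names and the toy ranking are hypothetical.

```python
import math


def dcg(relevance):
    """Eq. 4.15: the first item is undiscounted, item i >= 2 is divided by log(i)."""
    return sum(rel if i == 1 else rel / math.log2(i)  # base 2 is an assumption
               for i, rel in enumerate(relevance, start=1))


def ncdg(ranked_urls, relevant_urls):
    """DCG of the ranking divided by the DCG of the best possible ordering."""
    relevance = [1 if url in relevant_urls else 0 for url in ranked_urls]
    best = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / best if best > 0 else 0.0


def relative_improvement(avg_ncdg_new, avg_ncdg_reference):
    """Difference of two average NCDG values divided by the reference value."""
    return (avg_ncdg_new - avg_ncdg_reference) / avg_ncdg_reference


# Hypothetical ranking of four URLs, of which the user shared "b" and "c".
toy_quality = ncdg(["a", "b", "c", "d"], {"b", "c"})
print(toy_quality)
```

A ranking that already places every relevant URL first scores 1, and any ranking that pushes relevant URLs down the list scores less.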
More importantly, we believe that personalizing the ranking is useful for social searching because it provides a mechanism for interaction between the searcher and the sharer: the searcher can discuss with the sharer an item related to a query on a search engine, for instance, a new product that the sharer posted from appleinsider.com or a piece of political news from nytimes.com. This potential interaction is valuable because, in many (though not all) non-technical and social situations, the influence of the sharer on the searcher is stronger than the influence of the authorities detected by HITS and PageRank. This feature could be implemented in search engines, where the pages returned for a given query are re-ranked via social networks if pages related to the query have been shared among friends or other associates of the searcher.

CHAPTER 5 SOCIAL SEARCHING EXPERIMENTS

Portions of this chapter have been submitted as: T. Nguyen et al., Small Worlds and Social Stratification, PLoS ONE (under review).

We collected friendship, checkin, and location data from two location-based social media, Gowalla and FourSquare, that allowed people to use their internet-enabled and sensing-capable smart phones to record and share their current location. Gowalla no longer operates independently since it has been integrated into Facebook. Unlike Gowalla, FourSquare does not provide an automated mechanism for collecting publicly shared checkins through its API. We have also collected two additional social networks containing social relationships, Flickr and Last.fm, but without the geographical locations of their users. The reason for collecting data from these four diverse networks is that we can directly calculate the hop length of the shortest path between randomly selected pairs of users and use these path lengths as an estimate of the ground truth in the small-world experiment. We use Gowalla and FourSquare for the emulation of the small-world experiment, in which knowing the geographical distance between users is essential. Even though the data collected from online social media is not a representative sample of the entire population, it still provides one of the best estimates of social distance [88] and one of the best environments for analyzing the small-world experiment at large scale.

Table 5.1: Summaries of online social networks datasets.

Social Networks | Number of Users | Number of Edges | Period
Gowalla | 154,557 | 1,139,11 | Sept Oct. 12
FourSquare | 251,621 | 8,21 | Jun Aug. 13
Flickr | 2,435, | ,11,479 | Jun Aug. 13
Last.fm | 4,355,516 | 3,325,89 | Jun Aug. 13

In Table 5.1, we list the number of users and edges collected for each network over the specified time period. In the case of Gowalla and FourSquare, these numbers refer to the subset of the collected network that remained after data cleaning. In Gowalla, we removed users that did not have any publicly shared checkins. In FourSquare,

we kept only users that were successfully geocoded by Google Maps. This subtle difference between Gowalla and FourSquare is important because checkins in Gowalla directly pinpoint users' locations, making connections between users in Gowalla denser than in FourSquare. However, the advantage of FourSquare is that it provides a different perspective on some of the questions being asked, such as the effect of network sparsity on the small-world problem.

5.1 Attrition, Geography, & Communities

Let $G = (V, E)$ be a social network where $V$ is the set of users and $E$ is the set of edges representing undirected relationships among users. The great-circle distance between two users $s$ and $t$ is denoted as $g_d(s, t)$ and is estimated from the users' self-entered location of residence (FourSquare) or from the most frequent checkin that they have shared (Gowalla). The network distance between $s$ and $t$ is denoted as $n_d(s, t)$ and is defined as the smallest number of hops needed to reach $t$ starting from $s$. Let $A$ be a community detection algorithm that partitions the nodes in $G$ into $m$ overlapping clusters denoted as $\{C_1, C_2, \ldots, C_m\}$. An edge-bridge is an edge $e = (u, u')$ such that $u \in C_i$ and $u' \in C_j$ for $i \ne j$. A node-bridge is a node $u$ such that, for certain $i \ne j$, $u \in C_i$ and $u \in C_j$. The stratification graph of $G$, denoted as $S = (s_{ij})$, is defined as:

\[ s_{ij} = \frac{eb(i, j) + nb(i, j)}{\sum_{k=1}^{m} \left( eb(i, k) + nb(i, k) \right)} \tag{5.1} \]

where $eb(i, j)$ and $nb(i, j)$ are the numbers of edge-bridges and node-bridges connecting communities $i$ and $j$, respectively. We extend the definition of network distance from users to communities, denoted as $n_d(C_i, C_j)$, and define it as the smallest number of node- or edge-bridges needed to reach $C_j$ starting from $C_i$. We later use $s_{ij}$ to define the prominence of community $C_i$. Fig. 5.1 shows the stratification graph of communities for Gowalla.

Modeling Attrition

Let $p_k$ denote the probability of getting from a source to a target in $k$ hops in chains that are of length at least $k$, and let $p$ denote the probability of dropping out of

the experiment for nodes that are not adjacent to a target. Let $N$ denote the number of folders sent, $D_k$ the number of folders delivered to the target at the $k$-th hop, and $C_k$ the number of chains continuing for at least $k$ hops. If participants do not drop out of the experiment, then the number of deliveries in $k$ hops is $E_k = p_k \left( N - \sum_{i=1}^{k-1} E_i \right)$. The expected number of deliveries for one-hop targets is $D_1 = N p_1$, and the number of chains continuing for two or more hops is $C_2 = N (1 - p_1) p$. For $k > 1$, $D_k = p_k C_k$ and $C_{k+1} = C_k (1 - p_k) p$. In the Travers-Milgram experiment, we know $N$, $C_k$, and $D_k$. Then, the number of deliveries including drops for $k > 1$ is:

\[ E_k = D_k + \left( N - C_k - \sum_{i=1}^{k-1} E_i \right) p_k. \tag{5.2} \]

Figure 5.1: Stratification graph of communities in Gowalla.
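The correction in Eq. 5.2 can be applied directly once $N$, $C_k$, and $D_k$ are known, using $p_k = D_k / C_k$ from the relation $D_k = p_k C_k$. The sketch below is a minimal illustration under that reading. The function name and the input numbers are hypothetical and are not the Travers-Milgram data.

```python
def deliveries_without_attrition(n_sent, continuing, delivered):
    """Apply Eq. 5.2 hop by hop.

    continuing[k-1] is C_k and delivered[k-1] is D_k for k = 1..K.
    Returns the list E_1..E_K of deliveries expected without drop-outs.
    """
    expected = []
    for c_k, d_k in zip(continuing, delivered):
        p_k = d_k / c_k if c_k else 0.0  # from D_k = p_k * C_k
        e_k = d_k + (n_sent - c_k - sum(expected)) * p_k
        expected.append(e_k)
    return expected


# Made-up counts for illustration only: 100 folders sent, three hops observed.
toy_expected = deliveries_without_attrition(
    n_sent=100, continuing=[100, 60, 32], delivered=[10, 12, 8])
print(toy_expected)
```

Each value agrees with the no-attrition form $E_k = p_k (N - \sum_{i<k} E_i)$ given in the text, which is a quick way to check the reconstruction: for the second hop, $p_2 = 0.2$ and $0.2 \times (100 - 10) = 18$.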
