Bridge analysis in a Social Internetworking Scenario
Abstract
The rapid development of the number and the size of Online Social Networks (OSNs) makes the analysis of Social Internetworking Scenarios (SISs) extremely challenging. In a SIS, a user can join multiple OSNs and two users can interact with each other even though they joined different OSNs and did not know each other. While OSNs have been extensively studied in the last years, the most peculiar aspects of Social Internetworking Scenarios have not been yet investigated, especially from the Social Network Analysis perspective. Our paper tries to give a first important contribution in this field by deeply studying the core elements of a SIS, i.e., bridges. Bridges are those users who joined more OSNs and allow users of different OSNs to cooperate. We investigate the main features of this category of users by means of a Social Network Analysis campaign. In particular, we define several specific crawling strategies and extract several samples from a SIS by applying each of them. The experimental results define a clear “identikit” of bridges allowing us to state a number of non-trivial conclusions about their role in a SIS.
Keywords
- Social Network;
- Social Network Analysis;
- Social Internetworking Scenario;
- Bridge users;
- Crawling strategies
1. Introduction
Online Social Networks (OSNs, for short) represent one of the main actors of the Web 2.0 and have been showing an enormous growth in the last years. They attracted the interest of many researchers from disparate fields, such as computer science, psychology and sociology [39]. Many researchers started to collect large amounts of data from OSNs and to apply techniques of classical Social Network Analysis [20] on them. The results they obtained are numerous and extremely interesting (see, for instance, [24], [48], [28] and [57]). Other researchers modeled an OSN as a graph and investigated its structural properties also to infer knowledge about user behavior [31] and [17]. They are based on the intuition that there is a strong correspondence between the user behavior in an OSN and the structural properties of the corresponding graph. This intuition is confirmed by many past studies, as shown in Section 2.
An important aspect to consider is that nowadays users tend to spread their activities among more OSNs and, often, to show a different behavior in different OSNs [36]. As a consequence, different Social Networks are interconnected with each other, thus resulting in a global graph whose structural features could be in principle very different from the ones of each Social Network seen as a graph. This complex topology represents the substrate of an emergent scenario, called Social Internetworking Scenario (SIS), where a user can join multiple OSNs and two users can interact with each other even though they joined different OSNs and did not know each other. This concept is very recent and only a few commercial attempts to implement Social Internetworking Systems have been proposed (think, for instance, of Google Open Social [4], Power.com [6], Gathera [3] and Friendfeed [2]).
Despite the great attention given by scientists towards Social Networks, Social Internetworking Scenarios have been little investigated in the scientific literature [27], also due to their young age. Some papers focus on cross-folksonomies [37] and [60]. They analyze the tagging behavior of users in multiple folksonomies and try to relate information about these behaviors. Other ones, such as [9] and [59], assume users can join heterogenous social systems (e.g., a folksonomy and a blogging platform in [59], or Social Networks like Facebook, and social media, like Flickr, in [9]). Their goal is to aggregate user information in such a way as to build a global user profile. However, no relevant work studying the most peculiar aspects of SIS’s from the Social Network Analysis perspective exists in the scientific literature. In other words, the comprehension of even basilar structural properties of SIS’s is still a fully open problem.
Our goal is to provide a first important contribution in this setting by deeply studying the core elements of a SIS, i.e. bridges. Bridges are those users who joined more OSNs and allow users of different OSNs to cooperate. A bridge generally operates at the borders of two or more OSNs, is very esteemed in them, is very active in her interactions with other users and, therefore, can favor information exchange among different OSNs. As a consequence, bridges certainly represent the most basic structural aspect of a SIS, which requires a specific careful analysis.
Our paper attacks this problem by following an experimental approach based on suitable crawling tasks on a real-life SIS. In particular, we run the following steps: (i) extraction of data concerning the connections among user accounts in a set of OSNs and (ii) identification and analysis of bridges.
In order to carry out the first step it is necessary to derive the connections explicitly declared by users. Furthermore, it is necessary to consider not only connections among accounts of the same OSN but also connections among accounts of different OSNs. For this task, technological standards encoding human relationships, such as XFN (XHTML Friends Network) [8] and FOAF (Friend-Of-A-Friend) [7], can be exploited for those cases when users explicitly declared these relationships. In all the other cases, human relationships should be inferred by suitable approaches (for instance, the one described in [19]). Our extraction task considers these standards but, instead of handling and processing them directly, it exploits the functionalities provided by Google Social Graph (GSG), a tool recently proposed by Google [5]. It is worth noting that the above extraction step is not a simple task, since the crawling strategies adopted in the classical context cannot be trivially run in a SIS context by assuming that they capture its specific peculiarities. Therefore, both an extension of classical crawling strategies to SIS’s and/or the design of new ad hoc strategies are necessary. From this point of view, our paper offers a first interesting contribution since, besides the results of the analysis on bridges, it includes a first investigation about the issue of crawling SIS’s.
But the main contribution of our paper concerns step (ii) above. The analysis of bridges (whose identification, among the data extracted in step (i), is straightforward), aiming at estimating both classical Social Network Analysis parameters and new specific ones, is conducted in such a way as to discover the nature of bridges in a very deep fashion, by running a large number of experiments over large data sets driven by the following questions:
- 1.
Is there a sort of “backbone” among bridges, aiming at favoring the direct links among them?
- 2.
Which probabilistic distribution is followed by the bridge degree?
- 3.
Which is the relationship among the fraction of bridges, the average node degree, and its standard deviation in the OSNs of a SIS?
- 4.
What about the average degree of bridges? Is there some difference with non-bridges?
- 5.
Is there a correlation between bridges and power users?
- 6.
What about the number of OSNs typically connected by bridges?
The result of our analysis provides an answer to each question above with a strong experimental support and discovers even unexpected conclusions about bridges and, in general, a complete knowledge of this crucial elements of Social Internetworking Scenarios.
Summarizing, the main contributions of this paper are the following:
- •
we extended two classical crawling techniques (namely, BFS and MH) in order to make them more suitable to operate in a SIS;
- •
we tested five crawling techniques (including the two old ones) and derived several data sets about SIS’s that we made available to the Social Network Analysis community for further experiments; and
- •
we defined the main characteristics of bridges, which represent the key factor in a SIS, by giving an answer to the above questions.
The plan of this paper is as follows: Section 2 describes related literature. Section 3 presents the crawling strategies we have implemented and used, the test bed we have adopted and the basic characteristics of collected data. Bridge analysis, which represents the core of this paper, is presented in Section 4. Finally, in Section 5 we draw our conclusions and identify some future directions of our research.
Throughout the paper, with a little abuse of notation, we indifferently talk of users referring to both their human role and the corresponding nodes of a graph.
2. Related literature
Initially, studies on Social Networks attracted mainly sociologists. For instance, [61] introduced the 6-degrees of separation and the small-world theories. The effects of these theories are analyzed in [29]. Ref. [35] showed that a Social Network can be partitioned into “strong” and “weak” ties, and that strong ties are tightly clustered. In a second time, with the development of OSNs, Social Network Analysis also attracted computer scientists and, as pointed out in the introduction, many studies have been proposed which investigate the features of one OSN or compare more OSNs. Most of them collect data from one or more OSNs, map these data onto graphs and analyze their structural properties. These approaches are based on the observation that topological properties of graphs may be reliable indicators of the behaviors of the corresponding users [39]. However, to the best of our knowledge, no study that analyzes the main features of bridges in a SIS was presented in the past.
In [44], the authors discuss the problem of sampling from large graphs. They aim at answering questions like: (i) which sampling method to use; (ii) how small can the sample size be; (iii) how to scale up the measurements of the sample to get estimates for the larger graph; and (iv) how success can be measured. For this purpose, they consider several sampling methods and check the goodness of their sampling strategies on several different datasets. Other approaches that consider the same problem can be found in [32], [55], [41] and [12]. Specifically, [32] and [55] use sampling to improve the visualization of a graph. Ref. [41] develops methods to produce a small realistic sample from a large real network. The authors of this work prove that some of these methods maintain key properties of the initial graph also with a sample size down to 30%. Ref. [12] considers different graph generation algorithms and, for each of them, analyzes separability and stability properties.
Ref. [64] investigates the OSN graph crawling problem. Its analysis aims at answering questions like: (i) how fast crawlers into consideration discover nodes/links; (ii) how different OSNs and the number of protected users affect crawlers; (iii) how major graph properties are studied. All these investigations are performed by analyzing samples derived from four OSNs, namely Flickr, LiveJournal, Orkut and YouTube. Another approach that considers the same problem can be found in [18].
In [49], the authors present a deep investigation of the structure of multiple OSNs. For this purpose, they examine data derived from four popular OSNs, namely Flickr, YouTube, LiveJournal and Orkut. Crawled data regard publicly accessible user links on each site. Obtained results confirm the power law, small-world and scale-free properties of OSNs and show that these ones contain a densely connected core of high-degree nodes. In [22], the authors describe a parallel crawler based on Breadth First Search (BFS) and operating on eBay. Studies about how an attacker discovers a social graph can be found in [40] and [16]. The sole purpose of the attacker is to maximize the number of nodes/links it can discover. As a consequence, these two papers do not examine other issues such as biases. In [43], the authors analyze the impact of different graph traversal techniques (namely, BFS, DFS, Forest Fire, Snowball Sampling) on the computation of the average node degree of a network. In particular, they quantify the bias of BFS in estimating the node degree w.r.t. the fraction of sampled nodes. Furthermore, they show that this bias can be corrected reasonably well even in case of very small sample size.
In [42], the authors focus on analyzing the giant component of a graph. Moreover, they define a generative model to describe the evolution of the network. Finally, they introduce techniques to verify the reliability of this model. In [13], the authors investigate the main features of groups in LiveJournal and propose models that represent the growth of user groups over time. The authors of [11] analyze the topological properties of Cyworld (a South Korean OSN), Myspace and Orkut. They obtained data of the whole Cyworld and some samples of the other two OSNs. For the crawling task they use BFS. They show that this technique is capable of accurately approximating some network properties (e.g., degree distribution and correlation) even with small samples. In [46], data crawled from LiveJournal are examined to investigate the possible correlations between friendship and geographic location in OSNs. Moreover, the authors show that this correlation is strong. Ref. [20] proposes a methodology to discover possible aggregations of nodes covering specific positions in a graph (e.g., central nodes), as well as very relevant clusters. Still on clustering, [26] recently proposed an efficient community detection algorithm, particularly suited for OSNs, and tested its performance against a large sample of Facebook (among other OSN samples), observing the emergence of a strong community structure. In [54], the authors propose Social Action, a system based on attribute ranking and coordinated views to help users to systematically examine numerous Social Network Analysis measures. In [34], the authors present an analysis of the Facebook friendship graph devoted to estimate any user property and some topological properties. In this activity they examine and compare several candidate crawling strategies, namely BFS, Random Walk, Metropolis-Hasting Random Walk and Re-Weighted Random Walk. The authors investigate also diagnostics to assess sample quality during the data collection process. In [21], the authors present an analysis of Facebook devoted to investigate the friendship relationships in this OSN. To this purpose, they examine the topological properties of graphs representing data crawled from this OSN by exploiting two crawling strategies, namely BFS and Uniform Sampling. A further analysis of Facebook can be found in [62]. In this paper, the authors crawled Facebook by means of BFS and formalized some properties such as assortativity and interaction. These can be verified in small regions but cannot be generalized to the whole graph.
Ref. [50], [30], [53] and [56] present some approaches for the identification of influential users, i.e. users capable of stimulating other ones to join OSN activities and/or to actively operate in them. In [58], [10] and [45], the authors suitably model the blogosphere to perform leader identification. In [47], the authors first introduce the concept of starters (i.e., users who generate information that catches the interest of fellow users/readers) and, then, adopt a Random Walk technique to find starters. The authors of [52] analyze the main properties of the nodes within a single OSN that connect the peripheral nodes and the peripheral groups with the rest of the network. The authors call these nodes bridging nodes or, simply, bridges. Clearly, here, the term “bridge” is used with a meaning totally different from the one adopted in our paper. The authors base their analysis on the study of the theoretical properties of their model. In [33], the authors propose a predictive model that maps social media data to tie strength. This model is built on a dataset of social media ties and is capable of distinguishing between strong and weak ties with a high accuracy. Moreover, the authors illustrate how tie strength modeling can improve social media design elements, such as privacy controls, message routing, friend introductions and information prioritization. The authors of [63] present a model for predicting the closeness of professional and personal relationships of OSN users on the basis of their behavior in the OSNs joined by them. In particular, they analyze how the behavior of a user on an OSN reflects the strength of her relationships with other users w.r.t. several factors, like profile commenting and mutual connections.
Finally, [14], [15], [25], [51] and [38] present some approaches in the field of multidimensional networks. These networks can be seen as a specific case of a Social Internetworking Scenario in which each social network is specific for one kind of relationship and social networks strongly overlap. Multidimensional social networks are also known as multislice networks in the literature [51]. The first definition of a model for multidimensional networks has been proposed in [14], where a set of measures to extract useful knowledge on multidimensional networks, along with their experimental validation on real-life datasets, is presented. Another model for multidimensional social networks has been proposed in [38]. This model, which resembles the concept of Data Warehouse, defines three main dimensions: relation layers, time windows and groups. A view describes the state of one social group, linked by only one kind of relationship (one layer), derived from within only one time period. The paper discusses also several aggregation possibilities and illustrates some possible uses of the proposed model. The concept of hub, intended as a node highly connected to the other ones, in multidimensional social networks is studied in [15]. The authors redefine the concept of degree in the multidimensional context and introduce a new class of measures, called Dimension Relevance, aiming at analyzing the importance of different dimensions for the capability of a node to act as a hub. Experiments performed on real networks showed that such hubs really exist and they can be found and studied by using suitable measures of interplay of the different dimensions. In the same paper, the authors show how to detect the most influential dimensions that cause the different hub behaviors. In [25], a Generalized Stochastic Block Model (GSBM, for short) for performing positional and role analysis on multi-relational networks is proposed. The authors adopt several Multivariate Probability Distribution Functions to model different kinds of multi-relational network. They experimentally show that their model is able to discover the ground truth grouping in synthetic networks and to predict relationships in real networks.
3. Methods and data for analysis
In this section, we first describe the crawling techniques exploited to collect real-life datasets from the considered SIS, and then we report the basic statistics of collected data.
3.1. Adopted crawlers
As pointed out in the introduction, in order to perform our analyses on bridges, we have to extract some samples from a SIS. These samples registered connections among user accounts. For carrying out the sampling activity, we have to realize some crawlers capable of operating on a SIS, instead of on a single OSN. Specifically, these crawlers have to be able to extract not only connections among the accounts of different users in the same OSN but also connections among the accounts of the same users in different OSNs.
Two standards encoding human relationships are generally exploited to define these last connections. The former is XFN (XHTML Friends Network) [8]. XFN simply uses an attribute, called rel, to specify the kind of relationship between two users. Some possible values of rel are friend, contact, co-worker, parent, and so on. A (presumably) more complex alternative to XFN is FOAF (Friend-Of-A-Friend). A FOAF profile is essentially an XML file describing a person, her links with other people and the links to the objects created by her. It is worth pointing out that the technicalities concerning these two standards are not to be handled manually by the user. As a matter of fact, each OSN has suitable mechanisms to automatically manage them in a way transparent to the user, who has just to specify her relationships in a user-friendly fashion. In order to simplify the management of these standards, our crawlers adopt the functionalities provided by Google Social Graph (hereafter, GSG) [5] which is capable of directly handling them in an efficient way. GSG is a tool proposed by Google to find information among people of those OSNs handling XFN and FOAF protocols. Given a user u who joins several OSNs, GSG returns both: (i) a list of public URLs associated with her and (ii) a list of publicly declared connections between her and other users. GSG is provided with a suitable method called lookup, allowing us to explore connections among users. Given a user u represented by a node nu, a call to lookup on nu generates two lists of nodes. The former contains the nodes pointing to nu, whereas the latter consists of the nodes referenced by nu. The answer to a lookup call is a JSON file. JSON is a lightweight data exchange format which can be easily understood by humans and processed by machines. Finally, a nice feature of GSG is its capability of handling me edges, i.e., of identifying accounts (often located in different Social Networks) which refer to the same individual. Also the management of the technicalities concerning me edges is automatically performed by the involved OSNs. The user must at most specify the corresponding URLs in a friendly fashion.
As stated in the introduction, sometimes me connections are not explicitly specified by users. However, in the literature, several approaches for detecting implicit me connections have been proposed. For instance, the approach described in [19] is based on a notion of node similarity obtained by combining two contributions: a string similarity between the associated user names, and a contribution based on a suitable recursive notion of common-neighbor similarity. The latter is extremely important because it allows synonymy and homonymy errors to be detected and avoided.
As pointed out in Section 2, several crawling strategies for a single OSN have been proposed in the literature. For each of them their features have been investigated. This investigation showed that there exists no crawling strategy which is always better than the other ones. By contrast, each technique could be the optimal one for a specific set of analyses.
In order to carry out our analyses we needed crawling strategies capable of operating on a SIS, instead of on a single OSN. Clearly, it was not possible to a priori rely on the fact that the behavior of a crawling strategy in a single OSN does not change when it operates in a SIS. A verification of this issue was in order. On the basis of this reasoning, it appeared convenient to exploit several crawling strategies; some of them are classical ones extended to a SIS, whereas other ones have been defined to take into account the typical aspects of a SIS. As a consequence, our paper provides a further contribution, in addition to the main one consisting in bridge analysis. Indeed, it starts to investigate the features and the peculiarities of several crawling strategies in a SIS (which, as previously pointed out, is quite a specific scenario).
The crawling strategies we have implemented are the following:
- •
BFS : It implements the classical Breadth First Search visit (stopped after a suitable number nit of iterations).
- •
BFSB: Its underlying philosophy is derived from BFS by including some features which favor the presence of bridges in the sample. In particular, it starts with a BFS visit on a seed node s . When it finds a bridge, it inserts this bridge in a set called B . During each iteration, if B is empty, it continues the current BFS visit. By contrast, if B is not empty, it continues the current BFS visit with a probability p=0.85, whereas, with a probability 1-p, it starts a new BFS visit taking a bridge of B as seed. It stops after nit iterations. The pseudocode of this algorithm is shown in Algorithm 1.
Algorithm 1.
BFSB
Input:s: a seed node Output:SeenNode, VisitedNode: a set of nodes Constantnit {The number of iterations} Variablev: a node Variablep: a number in the real interval (0, 1) VariableB: set of nodes VariableNodeList: an ordered list of nodes accessed according to a FIFO policy 1: SeenNode ≔ ∅, VisitedNode ≔ ∅, B ≔ ∅, NodeList ≔ ∅
2: insert all the nodes adjacent to s into NodeList
3: insert s into SeenNode and VisitedNode
4: insert all the nodes adjacent to s into SeenNode
6: generate uniformly at random a number p in the real interval (0,1)
8: extract uniformly at random a node v from B
9: NodeList ≔ ∅
10: else
11: extract a node v from NodeList
12: ifv is a bridge then
13: insert v into B
14: end if
15: end if
16: insert all the nodes adjacent to v into NodeList
17: insert v into VisitedNode
18: insert all the nodes adjacent to v into SeenNode
19: end for
- Full-size table
Observe that BFS and M are classic crawling strategies, whereas and MB1 have been introduced in this paper in order to take the specific context into account.
A crawling strategy generally uses one or more accounts as seeds. In order to obtain samples as variegated and representative as possible, for each sample to obtain, we run the corresponding crawler from 10 different seeds and, then, merged data crawled from each seed. Moreover, in the definition of the starting seeds, we considered two options. In a first one, we chose seeds in a totally random fashion. In a second one, we chose seeds randomly, but only among bridges. As a consequence, we have defined the following 10 crawling running modes:
- •
and MB1: in these running modes we have applied the BFS ,
and MB1 strategies and we have chosen seeds in a random way, among all nodes;
- •
and
: in these running modes we have applied the
and MB1 strategies and we have chosen seeds randomly, but only among bridges.
Since, to the best of our knowledge, our analysis is the first one specifically conceived for investigating SIS’s, in order to better “dominate” (and, therefore, understand) the context of interest, we limited our crawlers to extract samples from only five OSNs, among the ones compliant with the XFN and FOAF standards. Specifically, selected OSNs are: Twitter, LiveJournal, YouTube, MySpace and Google+. We chose Twitter, LiveJournal and YouTube because they have been largely analyzed in the past in Social Network Analysis papers devoted to study a single OSN or to compare different OSNs. We selected MySpace because, differently from the other OSNs into consideration, it presents symmetrical connections among its accounts and we fstate the following implication, answering Question. Finally, we selected Google+ because it is a recent OSN and we judged that an analysis involving this OSN could have been very interesting for the Social Network Analysis research field.
3.2. Collected data
For our experiments, we exploited a server equipped with a 2 Quad-Core E5440 processor and 16 GB of RAM with the CentOS 6.0 Server operating system. We performed the crawling tasks from September 5, 2011 to October 20, 2011. For each running mode we run the corresponding crawler and we performed 15,000 iterations. In this way, we obtained 10 samples. These samples can be found at the URL address http://www.ursino.unirc.it/bridges.html. For each sample we preliminarily measured the following basic statistics, i.e.:
- •
the number of seen nodes and visited nodes3;
- •
the number of links;
- •
the number of me links; and
- •
the distributions of seen nodes and visited nodes for the five considered OSNs.
BFS BFSB M MB MB1 Seen nodes Total 3,523,588 6,572,238 949,181 6,178,649 312,598 Twitter 1,378,727 5,463,220 734,648 3,239,278 195,774 YouTube 209,858 178,828 31,729 258,618 13,868 MySpace 1,362,129 502,413 95,928 2,562,037 52,045 LiveJournal 147,100 33,170 51,923 4694 29,487 Google+ 425,775 394,609 34,953 114,023 21,424 Visited nodes Total 13,605 10,978 10,933 12,565 6997 Twitter 2909 5072 5,265 5,113 3162 YouTube 2544 222 851 498 531 MySpace 2998 1943 3171 6220 2389 LiveJournal 1891 231 1090 164 688 Google+ 3263 3510 556 570 227 Links 9,630,958 12,248,937 1,307,484 12,206,992 443,967 Me Links 3043 4779 526 4138 1150 - Full-size table
- Full-size table
4. Bridge analysis
In this section, we describe our experimental analysis aiming at answering the questions stated in the introduction. Each sub-section provides the answer to one or more questions given in the introduction. Furthermore, in the last sub-section we report an interesting additional result whose scope is related to the OSNs chosen in our SIS.
4.1. Bridges and backbones
In this analysis, we aim at verifying whether there is a sort of “backbone” among bridges, favoring the direct links among them. To do this, we analyze the percentage of bridges in the samples returned by the crawling tasks. We recall that, for each crawling strategy, we consider two running modes. In the first one we choose seed nodes in a random way among all nodes, whereas in the second one we choose seed nodes randomly, but only among bridges. We argued that, if a “backbone” exists among bridges, a sample produced by a running mode starting from bridges (i.e., and
) should have a percentage of bridges higher than the one occurring in the sample obtained by the same crawling strategy started from seeds selected among all nodes. The percentages of bridges occurring in each crawled sample are reported in Table 3.
- Table 3.
Percentage of bridges in each crawled sample.
BFS BFSB M MB MB1 Bridge (%) 15.83 19.45 23.21 23.25 4.34 3.52 14.91 13.18 7.96 5.98 - Full-size table
From the analysis of this table we can see that, given a crawling strategy, the percentage of bridges in the corresponding samples does not significantly vary whenever the seed nodes are chosen among all nodes or they are selected only from bridges. This result allows us to state the following implication, answering Question 1 in the introduction:
Implication 1.
In a SIS, it does not exist a sort of “backbone” among bridges, aiming at favoring the direct links among them.
4.2. Indegree and outdegree distributions of bridges
The purpose of this analysis is to investigate the distributions of the indegree and the outdegree of bridges, comparing them with the corresponding ones of nodes. In order to maximize the size of the sample exploited in this investigation, we consider a unique sample obtained by the union of the ten samples considered in the other experiments. As a first step we analyze the distributions of the indegree and the outdegree of the bridges and the nodes in the sample. The corresponding Probability Distribution Functions (PDFs) are shown in Fig. 1 and Fig. 2, respectively.
A simple visual analysis of these diagrams leads us to guess that the indegree and the outdegree of both bridges and nodes follow a power law distribution. In order to verify this conjecture we compute the best power law fit using the maximum likelihood method [23]. Table 4 shows the estimated power law coefficient, along with the Kolmogorov–Smirnov goodness-of-fit metric [23], for the four distributions into consideration.
- Table 4.
Power law coefficient estimates and corresponding Kolmogorov–Smirnov goodness-of-fit metrics for indegrees and outdegrees of nodes and bridges.
Indegrees Outdegrees α D α D Nodes 1.64 0.127 1.58 0.144 Bridges 2.23 0.147 2.92 0.159 - Full-size table
From the analysis of this table it is evident that both the indegree and the outdegree of the bridges and the nodes of a SIS follow a power law distribution very well. In fact, in all the considered cases, the values of the Kolmogorov–Smirnov goodness-of-fit metric are low [49]. Interestingly, the power law coefficients of bridges are higher than (even though close to) the ones of nodes for both the indegree and the outdegree distributions. In sum, we may conclude that:
Implication 2.
In a SIS, the indegree and the outdegree of bridges follow a power law distribution.
This implication, along with the results about the power law coefficients of bridges and non-bridges, is very important not only because it answers Question 2 of the introduction, but also because it allows us to conjecture an answer to Question 3. Indeed, from a theoretical point of view, Implication 2 should, in its turn, imply that an increase of the fraction of bridges in an OSN should lead to an increase of the average indegree and outdegree of nodes as well as of the standard deviation of node degree. We perform an experiment to verify this conjecture. In particular, we compute the average indegree and outdegree of nodes,4 along with the corresponding standard deviations, for each OSN of the SIS. Obtained results are shown in Table 5. In order to facilitate the analysis, in this table we report also the fraction of bridges for each crawled sample, as already shown in Table 3.
- Table 5.
Average indegree and outdegree of nodes, and corresponding standard deviations, for each crawling strategy and each OSN into examination.
- Full-size table
From the analysis of this table, it is possible to observe that, in a SIS, an increase of the fraction of bridges generally leads to an increase of the average indegree and outdegree of nodes, along with the corresponding standard deviations, for each OSN. This reasoning allows us to state the following corollary:
Corollary 2.1.
In a SIS, the higher the fraction of bridges, the higher the average indegree and outdegree of nodes, and the higher the corresponding standard deviations of the associated OSNs.
Therefore, the above conjecture holds. Consider that, in presence of a power law distribution, the concepts of average indegree and outdegree of bridges and non-bridges are little significant as absolute values. However, they can become interesting if compared with each other. With regard to this aspect, from a theoretical point of view, Corollary 2.1 implies a further one, which allows us to conjecture an answer to Question 4 of the introduction. In fact, Corollary 2.1 implies, in its turn, that the average indegree and outdegree of bridges are generally higher than the corresponding ones of non-bridges. In order to experimentally verify this last conjecture, for each crawled sample, we compute the average indegree and outdegree of the bridges and non-bridges for each OSN of the sample. Obtained results are shown in Table 6.
- Table 6.
Average indegree and outdegree of bridges and non-bridges for each crawled sample.
- Full-size table
By comparing indegrees and outdegrees of bridges with the ones of non-bridges, it immediately follows the next corollary:
Corollary 2.2.
In a SIS, the average indegree and outdegree of bridges are generally higher than the corresponding ones of non-bridges.
Thus, the latter conjecture holds too. In sum, Implication 2 and its corollaries answer Questions 2, 3 and 4 of the introduction.
4.3. Relationship between bridges and power users
Power users are those nodes which can facilitate information spread in the network. Node indegree and outdegree are considered good indicators to detect whether a node is a power user. Our notion of power user is indeed based on these indicators. In fact, for power users, we use the notion adopted in [1] (one of the most famous libraries which provides features for OSN/graph analysis). According to this notion, a power user is a node of an OSN with a degree higher than the average degree of the nodes (see below).
Implication 2, along with its corollaries, states that the presence of bridges leads to an increment of the average degree of nodes in a SIS and that the average degree of bridges is higher than the one of non-bridges. If the degree probability distribution were not taken into consideration, one would erroneously conclude that it is probable to find bridges with high degree, thus, that bridges are typically power users. But, thanks to Implication 2, we know that the degree of bridges follows a power law distribution. This allows us to exclude the above conclusion, since the power law distribution means that only a few bridges have a very high degree, whereas most of bridges have a low degree. In other words, the effect of increasing the average degree for bridges depends on just a few bridges, where the high degree is concentrated.
The question now is whether the power users of the SIS are just those bridges with high degree and vice versa. In other words, we have to understand how much the set of power users and the one of bridges are overlapping. We distinguish between in power users and out power users, depending on which degree we consider between indegree and outdegree, respectively. In particular, we denote by InPU (OutPU, resp.) the set of nodes of the sample having an indegree (outdegree, resp.) higher than the average indegree (outdegree, resp.) of the nodes of the sample. We also denote by B the set of bridges of the sample.
From the reasoning above, we expect that both and
(expressing the fraction of bridges who are also power users) are low. However, it could happen that
or
(expressing the fraction of power users who are also bridges) is high. This would mean that most of the power users are bridges. The experiment described in this section aims at confirming the expectation about InPB and OutPB and to study InBP and OutBP.
The results of the experiment are shown in Table 7. Here, it is possible to see that, in general, all the coefficients are low. On the one hand, this confirms our expectation that the probability that a bridge is a power user is low (basically due to Implication 2). On the other hand, the experiment allows us to discover that even the probability that a power user is a bridge is low. In sum, we may conclude that there does not exist a meaningful correlation between bridges and power users. The only exception regards the crawling strategy MB. This can be explained by considering the philosophy underlying this strategy, which tends to penalize power users except when they are bridges (since bridges are always selected in this strategy – see Section 3.1). As a consequence, most of the power users selected by this strategy are bridges. Even for MB1 we obtained values slightly higher than the other techniques, but less than MB. This can be explained by recalling that also MB1 is biased towards bridges, but less than MB (see, again, Section 3.1).
- Table 7.
Values of InPB,OutPB,InBP and OutBP for each crawled sample.
- Full-size table
In order to complete the study of the correlation between power users and bridges, we come back to a previous consideration derived from Implication 2. We said above that the bridge effect of increasing the average degree, derived from Implication 2 and its corollaries, depends on just a few bridges, where the high degree is concentrated. This is, in fact, what we have verified through our experiment. At this point it becomes interesting to understand “how much” these few bridges with high degree are power users. In other words, we want to study whether a correlation between bridges and power users emerges if we “stress” the definition of power user. In particular, we computed the above coefficients , and OutBP by considering the sets InPUx and OutPUx, representing the top x % of power users having the highest indegree and outdegree, respectively. x ranges from 10 to 100 with granularity 10. We denote by
, and OutBPx the four coefficients. Obviously, InPU100 reduces to InPU and OutPU100 to OutPU . As a consequence,
, InBP100 and OutBP100 coincide with
, and OutBP, respectively. The lower the value of x, the “stronger” (in terms of degree) the considered power users. We want to understand whether a correlation between power users and bridges arises by increasing the strongness of power users. To this aim, we are interested in studying how the four coefficients vary when x increases. Indeed, we expect that if no correlation between bridges and power users exists (also for increasing strongness), we should obtain:
- •
A quasi-linear dependence of the coefficients InPBx and OutPBx on the value of x , showing that the intersection between bridges and power users decreases proportionally in the same measure as x reduces the cardinality of the power user sets (observe that the power user sets InPUx and OutPUx occur only in the numerator of the respective coefficients).
- •
A quasi-constant behavior of the coefficients InBPx and OutBPx when the value of x varies. It is due to the same reasons of the previous case, but considering that the power user sets InPUx and OutPUx occur in both the numerator and the denominator of the respective coefficients.
As it can be derived by analyzing Fig. 3, Fig. 4, Fig. 5 and Fig. 6, experimental results show that we are in the scenario above hypothesized. The only exception regards the crawling strategies MB and MB1. The biased behavior of these strategies is more evident for the coefficients InBPx and OutBPx, where it is amplified by the reduction of the denominator, which is more rapid than the one of the numerator – observe that the denominator , resp.) of InBPx and (OutBPx, resp.) decreases as x decreases since the number of considered power users decreases. The bias detected for MB and MB1 (more evident for MB1) can be explained by considering how these strategies work. About MB, we have observed in Section 3.1 that the higher the degree of a non-bridge, the higher the probability that MB discards it. As a consequence, the stronger (in terms of degree) a non-bridge, the lower the probability it belongs to the set occurring in the numerator of the coefficients. This works in favor of the increase of the proportion of bridges belonging to the sets occurring in the coefficient numerators, thus contrasting their decrease (as x decreases). This explains why, for decreasing values of x (i.e., increasing strongness), the coefficients InBPx and OutBPx are biased towards higher values w.r.t. the expected behavior (i.e., constant). This effect is more evident in MB1 since, as observed in Section 3.1, this strategy is more selective than MB on non-bridges nodes. From all the above reasonings the following implication arises, answering Question 5 of the introduction:
Implication 3.
In a SIS, there is no correlation between bridges and power users.
This result is important from the structural point of view due to the role of bridges in the interconnection of the OSNs of the SIS. Moreover, it could provide interesting hints for the analysis of user behavior, which is anyway out of the scope of this work.
4.4. Capability of bridges of connecting more OSNs
In this experiment we measure the number of OSNs connected by the available bridges in each crawled sample. Observe that, since our SIS consists of 5 OSNs, the number of OSNs that can be connected by a bridge ranges between 2 and 5. The results of this experiment are shown in Fig. 7. From the analysis of this figure it is possible to observe that, for each crawling strategy, most bridges connect a few OSNs and a few bridges connect many OSNs or all of them. For instance, if we consider all crawled samples, then the percentage of bridges which connect only 2 OSNs ranges from 79.01% to 92.70%, whereas the fraction of bridges which connect 5 OSNs ranges from 0% to 2.25%. The trends shown in the previous figure qualitatively look like a power law distribution. However, the number of OSNs composing our SIS is small. As a consequence, the number of values considered in the axes of the previous figure is small too. This fact makes little meaningful to quantitatively verify whether the number of OSNs connected by bridges follows a power law distribution and, in this case, to compute the corresponding α and D coefficients. On the other hand, the huge volume of analyses performed in our experiments made it prohibitive to extend the number of considered OSNs beyond 5, besides the reasons expressed in Section 3.1 for this choice. Therefore, in the wide-spectrum analysis on bridges faced in this paper, we considered satisfactory a qualitative analysis of this aspect, delaying a quantitative one to a specific future work where taking all the OSNs handling XFN and FOAF into account is feasible. The conclusion we can draw from the above analysis can be thus summarized by the following implication, answering Question 6 of the introduction:
Implication 4.
In a SIS, most of the bridges connect a few OSNs, whereas a few bridges connect many OSNs.
4.5. Fraction of bridges
In this analysis we describe a result whose scope is related to the OSNs chosen in our SIS. In particular, for each crawled sample, we computed the percentage of nodes which are bridges in each OSN of the SIS. The results are shown in Table 8.
- Table 8.
Bridge percentage in the SIS and in the corresponding OSNs.
- Full-size table
From the analysis of this table it emerges an interesting result. Indeed, it is possible to observe that, in Google+, the percentage of nodes which are bridges is generally higher than the one in the other OSNs. This non-obvious result can be explained by considering the characteristics of Google+. In fact, this OSN is quite recent and, thus, most of its users have already joined other OSNs. Furthermore, Google+ provides its users with very friendly utilities allowing them to specify their accounts in other OSNs and to import the corresponding data. All these facts favor the presence of bridges in Google+. This analysis leads us to claim the following implication:
Implication 5.
Among the five OSNs of our SIS, Google+ generally presents the highest percentage of users who are bridges.
5. Conclusion and future work
This paper explores the emergent scenario of Social Internetworking from the perspective of Social Network Analysis. Being aware that the complete investigation of all the aspects of SIS’s is an extremely large task, we have identified the most basic structural peculiarity of a SIS, i.e. bridges, and we have deeply studied it. We argue that most of the knowledge about the structural properties of SIS’s, and possibly also about the behavioral aspects of users, starts from the adequate knowledge of bridges, which are the structural pillars of SIS’s.
Our work was conducted with strong attention on the crawler strategy to adopt, through both the adaptation of existing techniques and the definition of new specific ones. This was also a chance for giving some initial results about the behavior of the various strategies in this particular context, preparing us for a further study on crawling a SIS. But the main reason of the application of a multiple-crawling-strategy approach was to guarantee an adequate scientific support to the implications we have derived about several structural features concerning bridges.
Let us summarize now these implications, and try to draw some conclusions. First, we have discovered that bridges, like power users, are nodes that contribute to increase the average degree of the SIS (Implication 2 and its corollaries), but they are not so “strong” like power users. Then, we have shown that there is no correlation between power users and bridges (Implication 3), and this analysis resulted in a number of very interesting aspects. We have seen that the results above are not in contradiction, because bridges follow a power law distribution, making very probable to have a low degree for a bridge (Implication 2) even though the average degree is higher than the one of non-bridges. A possible interpretation of the above knowledge is that bridges are users who are active in at least one OSN involved by their me links, so that they form a population of users not including a significant percentage of fake (or “one-time”) users. Of course, many future investigations can be focused on this and other behavioral aspects. But coming back to the results of this papers, we have also seen that bridges do not form backbones among the involved OSNs (Implication 1), as one could intuitively expect. Finally, we have seen that most of the bridges connect a few OSNs, whereas a few bridges connect many OSNs, enabling a number of possible considerations from the behavioral point of view.
In conclusion, we think that this paper gives interesting results on the emergent scenario of Social Internetworking which open a number of interesting issues about both structural and behavioral aspects. We plan to move our future research towards these directions, including a deep study on the crawling of SIS’s, and possibly applying Data Warehousing and Data Mining techniques on crawled samples.
References
- [1]
- Stanford Network Analysis Package, 2011. <http://snap.stanford.edu/snap/>
- [2]
- FriendFeed, 2012. <http://friendfeed.com/>
- [3]
- Gathera, 2012. <http://www.gathera.com/>
- [4]
- Google Open Social, 2012. <http://code.google.com/intl/it-IT/apis/opensocial/>
- [5]
- Google Social Graph, 2012. <http://code.google.com/p/itswhoyouknow/wiki/SocialGraph>
- [6]
- Power.com, 2012. <http://power.com>
- [7]
- The Friend of a Friend (FOAF) Project, 2012. <http://www.foaf-project.org/>
- [8]
- XFN – XHTML Friends Network, 2012. <http://gmpg.org/xfn>
- [9]
- F. Abel, N. Henze, E. Herder, D. Krause, Interweaving public user profiles on the web, in: Proc. of the International Conference on User Modeling, Adaptation, Personalization (UMAP’10), Big Island, Hawaii, USA, Lecture Notes in Computer Science, Springer, 2010, pp. 16–27.
- [10]
WisColl: collective wisdom based blog clustering
Information Sciences, 180 (1) (2010), pp. 39–61
- [11]
Analysis of topological characteristics of huge online social networking services
Proc. of the International Conference on World Wide Web (WWW’07), ACM, Banff, Alberta, Canada (2007), pp. 835–844,
- [12]
Sampling algorithms for pure network topologies: a study on the stability and the separability of metric embeddings
ACM SIGKDD Explorations Newsletter, 7 (2) (2005), pp. 13–22
- [13]
Group formation in large social networks: membership, growth, and evolution
Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06), ACM, Philadelphia, USA (2006), pp. 44–54
- [14]
- M. Berlingerio, M. Coscia, F. Giannotti, A. Monreale, D. Pedreschi, Foundations of multidimensional network analysis, in: Proc. of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2011), IEEE, Kaohsiung, Taiwan, 2011, pp. 485–489.
- [15]
The pursuit of hubbiness: analysis of hubs in large multidimensional networks
Journal of Computational Science, 2 (3) (2011), pp. 223–237
- [16]
- J. Bonneau, J. Anderson, G. Danezis, Prying data out of a social network, in: Proc. of the International Conference on Advances in Social Network Analysis and Mining (ASONAM’09), IEEE, Athens, Greece, 2009, pp. 249–254.
- [17]
subgraph matching on huge social networks
Proc. of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2011), IEEE, Kaohsiung, Taiwan (2011), pp. 271–278
- [18]
Crawling social internetworking systems
Proc. of the International Conference on Advances in Social Analysis and Mining (ASONAM 2012), IEEE, Istanbul, Turkey (2012), pp. 505–509
- [19]
Discovering links among social networks
Proc. of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2012), Lecture Notes in Computer Science, Springer, Bristol, UK (2012), pp. 467–482
- [20]
Models and Methods in Social Network Analysis
Cambridge University Press (2005)
- [21]
Crawling Facebook for social network analysis purposes
Proc. of the International Conference Series on Web Intelligence, Mining and Semantics (WIMS’11), ACM, Sogndal, Norway (2011), pp. 52–59
- [22]
Parallel crawling for online social networks
Proc. of the International Conference on World Wide Web (WWW’07), ACM, Banff, Alberta, Canada (2007), pp. 1283–1284
- [23]
Power-Law Distributions in Empirical Data
SIAM Review, 51 (4) (2009), pp. 661–703
- [24]
A classification for community discovery methods in complex networks
Statistical Analysis and Data Mining, 4 (5) (2011), pp. 512–546
- [25]
Structural analysis in multi-relational social networks
Proc. of the International SIAM Conference on Data Mining (SDM 2012), Omnipress, Anaheim, CA, USA (2012), pp. 451–462
- [26]
Generalized Louvain method for community detection in large networks
Proc. of the International Conference on Intelligent Systems Design and Applications (ISDA 2011), IEEE, Cordoba, Spain (2011), pp. 88–93
- [27]
Recommendation of similar users, resources and social networks in a Social Internetworking Scenario
Information Sciences, 181 (7) (2011), pp. 1285–1305 (Elsevier)
- [28]
Exploratory Social Network Analysis with Pajek
Cambridge University Press (2011)
- [29]
Contacts and influence
Social Networks, 1 (1978), pp. 5–51
- [30]
Predicting influential users in online social networks
Proc. of the KDD International Workshop on Social Network Analysis (SNA-KDD’10), ACM, San Diego, CA, USA (2010)
- [31]
Visual exploration of collaboration networks based on graph degeneracy
Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2012), ACM, Beijing, China (2012), pp. 1512–1515
- [32]
Compressing network graphs
Proc. of the International Workshop on Link Analysis and Group Detection (LinkKDD’04), ACM, Seattle, WA, USA (2004)
- [33]
Predicting tie strength with social media
Proc. of the International Conference on Human Factors in Computing Systems (CHI’09), ACM, Boston, MA, USA (2009), pp. 211–220
- [34]
Walking in Facebook: A case study of unbiased sampling of OSNs
Proc. of the International Conference on Computer Communications (INFOCOM’10), IEEE, San Diego, CA, USA (2010), pp. 1–9
- [35]
The strength of weak ties
American Journal of Sociology, 78 (6) (1973), pp. 1360–1380
- [36]
- OFCOM The independent regulator and competition authority for the UK communications industries, Social Networking: A Quantitative and Qualitative Research Report into Attitudes, behaviours and use, 2009. <http://www.ofcom.org.uk/advice/medialiteracy/medlitpub/medlitpubrss/socialnetworking/annex3.pdf>
- [37]
- J. Iturrioz, O. Diaz, C. Arellano. Towards federated Web2.0 sites: the TAGMAS approach, in: Proc. of the International Workshop on Tagging and Metadata for Social Information Organization, Banff, Alberta, Canada, 2007. <http://www2007.org/workshops/paper34.pdf>
- [38]
Multidimensional social network: model and analysis
Proc. of the International Conference on Computational Collective Intelligence. Technologies and Applications (ICCCI 2011), Lecture Notes in Computer Science, Elsevier, Gdynia, Poland (2011), pp. 378–387
- [39]
The convergence of social and technological networks
Communications of the ACM, 51 (11) (2008), pp. 66–72
- [40]
Link privacy in social networks
Proc. of the ACM International Conference on Information and Knowledge Management (CIKM’08), ACM, Napa Valley, CA, USA (2008), pp. 289–298
- [41]
Reducing large internet topologies for faster simulations
Proc. of the International Conference on Networking (Networking 2005), Springer, Waterloo, Ontario, Canada (2005), pp. 165–172
- [42]
Structure and evolution of online social networks
Link Mining: Models, Algorithms, and Applications (2010), pp. 337–357
- [43]
On the bias of BFS (Breadth First Search)
Proc. of the International Teletraffic Congress (ITC 22), IEEE, Amsterdam, The Netherlands (2010), pp. 1–8
- [44]
Sampling from large graphs
Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06), ACM, Philadelphia, PA, USA (2006), pp. 631–636
- [45]
Discovering influencers for marketing in the blogosphere
Information Sciences, 181 (23) (2011), pp. 5143–5157
- [46]
Geographic routing in social networks
Proc. of the National Academy of Sciences of the United States of America (PNAS), 102 (33) (2005), pp. 11623–11628
- [47]
Efficient identification of starters and followers in social media
Proc. of the International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’09), ACM, Saint Petersburg, Russian Federation (2009), pp. 708–719
- [48]
,in: N. Memon, R. Alhajj (Eds.), From Sociology to Computing in Social Networks – Theory, Foundations and Applications, vol. 1Lecture Notes in Social Networks, Springer (2010)
- [49]
Measurement and analysis of online social networks
Proc. of the ACM SIGCOMM International Conference on Internet Measurement (IMC’07), ACM, San Diego, CA, USA (2007), pp. 29–42
- [50]
MEK: using spatial–temporal information to improve social networks and knowledge dissemination
Information Sciences, 179 (15) (2009), pp. 2524–2537
- [51]
Community structure in time-dependent, multiscale, and multiplex networks
Science, 328 (5980) (2010), pp. 876–878
- [52]
Properties of bridge nodes in social networks
Computational Collective Intelligence, Semantic Web, Social Networks and Multiagent Systems, vol. 5796, Lecture Notes in Computer Science, Springer (2009), pp. 357–364
- [53]
Spontaneous emergence of social influence in online systems
Proceedings of the National Academy of Sciences, 107 (43) (2010), p. 18375
- [54]
Balancing systematic and flexible exploration of social networks
IEEE Transactions on Visualization and Computer Graphics, 12 (5) (2006), pp. 693–700
- [55]
Effectively visualizing large networks through sampling
IEEE Visualization Conference 2005 (VIS’05), IEEE, Minneapolis, MN, USA (2005), p. 48
- [56]
Influence and passivity in social media
Proc. of the International Conference Companion on World Wide Web (WWW’11), ACM, Hyderabad, India (2011), pp. 113–114
- [57]
Introduction to stochastic actor-based models for network dynamics
Social Networks, 32 (1) (2010), pp. 44–60
- [58]
Identifying opinion leaders in the blogosphere
Proc. of the ACM International Conference on Information and Knowledge Management (CIKM’07), ACM, Lisbon, Portugal (2007), pp. 971–974
- [59]
Cross-tagging for personalized open social networking
Proc. of the ACM Conference on Hypertext and Hypermedia (HT’09), ACM, Torino, Italy (2009), pp. 271–278
- [60]
Correlating user profiles from multiple folksonomies
Proc. of the ACM Conference on Hypertext and hypermedia (HT ’08), ACM, Pittsburgh, PA, USA (2008), pp. 33–42
- [61]
An experimental study of the small world problem
Sociometry (1969), pp. 425–443
- [62]
User interactions in social networks and their implications
Proc. of the ACM European Conference on Computer systems (EuroSys’09), ACM, Nuremberg, Germany (2009), pp. 205–218
- [63]
Detecting professional versus personal closeness using an enterprise social network site
Proc. of the International Conference on Human Factors in Computing Systems (CHI’10), ACM, Atlanta, GA, USA (2010), pp. 1955–1964
- [64]
Crawling online social graphs
Proc. of the International Asia-Pacific Web Conference (APWeb’10), IEEE, Busan, Korea (2010), pp. 236–242
Copyright © 2012 Elsevier Inc. All rights reserved.
- •
MB: Its underlying philosophy is similar to the one of M . However, it includes some features which favor the presence of bridges in the sample. It starts its visit from a seed node s . During each iteration it considers the currently visited node v and randomly selects a node w from the neighbors of v . If w is a bridge, then it moves from v to w ; otherwise, it behaves as M . The pseudocode of this algorithm is identical to the one of M , shown in Algorithm 2, except that, in row 9, the statement if
is replaced by the statement if (w is a bridge) or
. Observe that, in this case, the higher the degree of a non-bridge, the higher the probability that MB discards it. This property does not hold for bridges who are always selected by MB.
- •
MB1: Analogously to MB, it has been conceived to correct M by favoring bridges. However, the effects of MB1 on bridges are different from the ones of MB. In words, the preference in favor of bridges is less drastic than MB (see below). MB1 starts by randomly generating two real numbers a and b belonging to the interval (0,1). Then it sets p1=min(a,b) and p2=max(a,b). During each iteration it considers the currently visited node v and a node w randomly selected from the neighbors of v . If w is a bridge, then it moves from v to w if
. Otherwise, if w is not a bridge, then it moves from v to w if
. In all the other cases it stays in v . Since p1⩽p2, bridges are favored. It terminates after nit iterations. The pseudocode of this algorithm is shown in Algorithm 3. Clearly, if we force p1 to 0, then MB1 reduces to MB. Observe that MB1 is more selective on bridges than MB, since bridges are visited only if the degree condition (i.e.,
) is verified. On the contrary, MB1 is less selective on bridges than M (in the sense that the probability of selecting a high-degree bridge is higher than in M ). This happens because the probability distribution of p1, differently from the uniform distribution of p in M , is biased towards 0. Indeed, it is easy to see that the probability density function fp1(x), where x∈(0,1), of the random variable p1 – i.e., the minimum between two real numbers randomly extracted in the interval (0,1) – is fp1(x)=2·(1-x) (linearly decreasing 2)against the density function fp(x)=1 (constant). Moreover, MB1 is more selective also on non-bridges than MB (and therefore M ), since the degree condition (i.e.,
) operates with a bound p2 whose probability distribution, differently from the uniform distribution of p in MB and M , is biased towards 1. Indeed, it is easy to see that the probability density function fp2(x), where x∈(0,1), of the random variable p2 – i.e., the maximum between two real numbers extracted randomly in the interval (0,1) – is fp2(x)=2·x (linearly increasing) against the density function fp(x)=1 (constant).
Algorithm 3.
MB1
- Full-size table
- •
M: It implements the Metropolis-Hasting Random Walk that proved to perform very well in past analyses of single OSNs [34]. This crawling strategy has been conceived to unfavor power users who, instead, are favored by BFS-like crawling strategies that, owing to this fact, can present bias in some network parameters (e.g., the average number of contacts per user) [43]. 1
M starts its visit from a seed node s . During each iteration it considers the currently visited node v and randomly selects a node w from the neighbors of v . Then, it randomly generates a number p belonging to the real interval (0,1). If
, where kv(kw, resp.) is the outdegree of v (w , resp.), then it moves from v to w . Otherwise, it stays in v . It terminates after nit iterations. The pseudocode of this algorithm is shown in Algorithm 2. Observe that the higher the degree of a node, the higher the probability that M discards it.
Algorithm 2.
M
Inputs: a seed node OutputSeenNode, VisitedNode: a set of nodes Constantsnit {The number of iterations} Variablev,w: a node Variablep: a number in the real interval (0,1) 1: SeenNode ≔ ∅, VisitedNode ≔ ∅
2: insert s into SeenNode and VisitedNode
3: insert all the nodes adjacent to s into SeenNode
4: v=s
6: let w be one of the nodes adjacent to v selected uniformly at random
7: generate uniformly at random a number p in the real interval (0,1)
8: let kv and kw be the outdegree of v and w, respectively
10: v=w
11: insert w into VisitedNode
12: insert all the nodes adjacent to w into SeenNode
13: end if
14: end for
- Full-size table