Modularity of Complex Networks Models
- First Online:
Abstract
Modularity is designed to measure the strength of division of a network into clusters (known also as communities). Networks with high modularity have dense connections between the vertices within clusters but sparse connections between vertices of different clusters. As a result, modularity is often used in optimization methods for detecting community structure in networks, and so it is an important graph parameter from practical point of view. Unfortunately, many existing non-spatial models of complex networks do not generate graphs with high modularity; on the other hand, spatial models naturally create clusters. We investigate this phenomenon by considering a few examples from both sub-classes. We prove precise theoretical results for the classical model of random d-regular graphs as well as the preferential attachment model, and contrast these results with the ones for the spatial preferential attachment (SPA) model that is a model for complex networks in which vertices are embedded in a metric space, and each vertex has a sphere of influence whose size increases if the vertex gains an in-link, and otherwise decreases with time.
1 Introduction and Definitions
Many social, biological, and information systems can be represented by networks, whose vertices are items and links are relations between these items [2, 4, 6, 12]. That is why the evolution of complex networks attracted a lot of attention in recent years and there has been a great deal of interest in modeling of these networks [9, 17, 30]. The hyperlinked structure of the Web, citation patterns, friendship relationships, infectious disease spread, these are seemingly disparate linked data sets which have fundamentally very similar natures. Indeed, it turns out that many real-world networks have some typical properties: power-law degree distribution, small diameter, high clustering coefficient, and others [27, 29, 33]. Such properties are well-studied both in real-world networks and in many theoretical models.
Another important property of complex networks is their community structure, that is, the organization of vertices in clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters [14, 18]. In social networks communities may represent groups by interest, in citation networks they correspond to related papers, in the Web communities are formed by pages on related topics, etc. Being able to identify communities in a network could help us to exploit this network more effectively. For example, clusters in citation graphs may help to find similar scientific papers, discovering users with similar interests is important for targeted advertisement, clustering can also be used for network compression and visualization.
The key ingredient for many clustering algorithms is modularity, which is at the same time a global criterion to define communities, a quality function of community detection algorithms, and a way to measure the presence of community structure in a network. Modularity was introduced by Newman and Girvan [31] and it is based on the comparison between the actual density of edges inside a community and the density one would expect to have if the vertices of the graph were attached at random, regardless of community structure.
Unfortunately, modularity is not a well studied parameter for the existing random graph models, at least from a rigorous, theoretical point of view. We are only aware about results for binomial random graphs G(n, p) and random d-regular graphs (see Sect. 2.3 for more details). In this paper, we continue investigating random d-regular graphs and obtain new upper bounds for their modularity. Then we move to the preferential attachment model, introduced by Barabási and Albert [5], which is probably the most well-studied model of complex networks. For this model no results on modularity are known and we obtain some preliminary results, both lower and upper bounds, and will investigate this model more in the journal version of this paper. In fact, the lower bound we present holds for all graphs with average degree d and sublinear maximum degree.
As expected, the models discussed above, as well as many others, have a common weakness of low modularity. One family of models which overcomes this deficiency is the family of spatial (or geometric) models, wherein the vertices are embedded in a metric space such that similar vertices are closer to each other than dissimilar ones. The underlying geometry of spatial models naturally leads to the emergence of clusters. We prove this statement rigorously for one example of a geometric model, the Spatial Preferential Attachment model introduced in [1].
The paper is structured as follows. In the next section, we formally define modularity, discuss several random graph models and present known results on modularity in these models. In Sects. 3, 4 and 5 we analyze modularity in random d-regular graphs, preferential attachment and SPA models, respectively.
Due to the space limitations, proofs of our results are omitted in this short proceeding version but will be included in the longer journal one.
2 Preliminaries
2.1 Modularity
2.2 Random Graph Models
Classical Models. Let us recall two classical models of random graphs that are extensively studied in the literature. The binomial random graph\({\mathcal {G}}(n,p)\) is the random graph G with the vertex set \([n] := \{ 1, 2, \ldots , n\}\) in which every pair \(\{i,j\} \in \left( {\begin{array}{c}[n]\\ 2\end{array}}\right) \) appears independently as an edge in G with probability p. Note that \(p=p(n)\) may (and usually does) tend to zero as n tends to infinity.
However, in this paper we concentrate on another probability space, the probability space of random d-regular graphs with uniform probability distribution. This space is denoted \(\mathcal {G}_{n,d}\), and asymptotics are for \(n\rightarrow \infty \) with \(d\ge 2\) fixed, and n even if d is odd.
We say that an event in a probability space holds asymptotically almost surely (or a.a.s.) if the probability that it holds tends to 1 as n goes to infinity. Since we aim for results that hold a.a.s., we will always assume that n is large enough.
Preferential Attachment. The Preferential Attachment (PA) model, introduced by Barabási and Albert [5], was an early stochastic model of complex networks. We will use the following precise definition of the model, as considered by Bollobás and Riordan in [10] as well as Bollobás et al. [11].
For \(m \in \mathbb N \setminus \{1\}\), the process \((G_m^t)_{t \ge 0}\) is defined similarly with the only difference that m edges are added to \(G_m^{t-1}\) to form \(G_m^t\) (one at a time), counting previous edges as already contributing to the degree distribution. Equivalently, one can define the process \((G_m^t)_{t \ge 0}\) by considering the process \((G_1^t)_{t \ge 0}\) on a sequence \(v'_1, v'_2, \ldots \) of vertices; the graph \(G_m^t\) is formed from \(G_1^{tm}\) by identifying vertices \(v'_1, v'_2, \ldots , v'_m\) to form \(v_1\), identifying vertices \(v'_{m+1}, v'_{m+2}, \ldots , v'_{2m}\) to form \(v_2\), and so on. Note that in this model \(G_m^t\) is in general a multigraph, possibly with multiple edges between two vertices (if \(m\ge 2\)) and self-loops.
It was shown in [11] that for any \(m \in \mathbb N \) a.a.s. the degree distribution of \(G_m^n\) follows a power law: the number of vertices with degree at least k falls off as \((1+o(1)) ck^{-2} n\) for some explicit constant \(c=c(m)\) and large \(k \le n^{1/15}\). Also, in the case \(m=1\), \(G_1^n\) is a forest. Each vertex sends an edge either to itself or to an earlier vertex, so the graph consists of components which are trees, each with a loop attached. The expected number of components is then \(\sum _{t=1}^n 1/(2t-1) \sim (1/2) \log n\) and, since events are independent, we derive that a.a.s. there are \((1/2+o(1)) \log n\) components in \(G_1^n\) by Chernoff’s bound. Moreover, Pittel [32] essentially showed that a.a.s. the largest distance between two vertices in the same component of \(G_1^n\) is \((\gamma ^{-1} + o(1)) \log n\), where \(\gamma \) is the solution of \(\gamma e^{1+\gamma } = 1\). In contrast, for the case \(m \ge 2\) it is known that a.a.s. \(G_m^n\) is connected and its diameter is \((1+o(1)) \log n / \log \log n\) [10].
Spatial Preferential Attachment. The Spatial Preferential Attachment (SPA) model [1], designed as a model for the World Wide Web, combines geometry and preferential attachment, as its name suggests. Setting the SPA model apart is the incorporation of ‘spheres of influence’ to accomplish preferential attachment: the greater the degree of a vertex, the larger its sphere of influence, and hence the higher the likelihood of the vertex gaining more neighbors.
The process begins at \(t=0\), with \(G_0\) being the null graph. Time-step t, \(t\ge 1\), is defined to be the transition between \(G_{t-1}\) and \(G_t\). At the beginning of each time-step t, a new vertex \(v_t\) is chosen uniformly at random from S, and added to \(V_{t-1}\) to create \(V_{t}\). Next, independently, for each vertex \(u\in V_{t-1}\) such that \(v_t \in S(u,t-1)\), a directed link \((v_{t},u)\) is created with probability p. Thus, the probability that a link \((v_t,u)\) is added in time-step t equals \(p\,|S(u,t-1)|\).
The SPA model produces scale-free networks, which exhibit many of the characteristics of real-life networks (see [1, 13]). In [19], it was shown that the SPA model gave the best fit, in terms of graph structure, for a series of social networks derived from Facebook. In [20], some properties of common neighbors were used to explore the underlying geometry of the SPA model and quantify vertex similarity based on distance in the space. However, the distribution of vertices in space was assumed to be uniform [20] and so in [21] non-uniform distributions were investigated which is clearly a more realistic setting.
2.3 Previous Results on Modularity
In this section we discuss known bounds for modularity in different random graph models.
Upper bounds for \(q^*({\mathcal {G}}_{n,d})\) and for \(q_{\delta }({\mathcal {G}}_{n,d})\) (\(U_2\))
d | \(U_1\) | \(U_2\) | \(U_3\) |
---|---|---|---|
3 | 0.9386 | 0.8771 | 0.8038 |
4 | 0.8900 | 0.7800 | 0.6834 |
5 | 0.8539 | 0.7078 | 0.6024 |
6 | 0.8261 | 0.6521 | 0.5435 |
7 | 0.8038 | 0.6076 | 0.4984 |
8 | 0.7855 | 0.5710 | 0.4624 |
9 | 0.7702 | 0.5403 | 0.4330 |
10 | 0.7570 | 0.5140 | 0.4083 |
Investigating random d-regular graphs continues in [24], a very recent paper. In fact, some of our results for this model mentioned below are obtained independently there. Moreover, they investigate the class of graphs whose product of treewidth and maximum degree is much less than the number of edges. This shows, for example, that random planar graphs typically have modularity close to 1, which is another indication that clusters naturally emerge where geometry is included.
3 Random d-regular Graphs
3.1 Lower Bound
3.2 Slightly Improved, Numerical Upper Bound
Theorem 1
3.3 Explicit but Weaker Upper Bound
Theorem 1 provides an upper bound that can be easily numerically computed for a given \(d \in \mathbb N \setminus \{1,2\}\). Next, we present a slightly weaker but an explicit bound that can be obtained using the expansion properties of random d-regular graphs that follow from their eigenvalues. In particular, it will imply that a.a.s. \(q^*({\mathcal {G}}_{n,d}) = O(1/\sqrt{d})\) and so \(q^*({\mathcal {G}}_{n,d}) \rightarrow 0\) as \(d \rightarrow \infty \).
Theorem 2
4 The Preferential Attachment Model
4.1 Constant Average Degree Graphs
In order to obtain a lower bound for modularity of Preferential Attachment graphs, we first analyze graphs with constant average degree in general. In this section, we extend the results of [26] and we start with the analysis of trees. It was proven in [26] that trees with maximum degree \(\varDelta = o(\root 5 \of {n})\) have asymptotic modularity 1. We generalize this result in two ways: first, we relax the condition on maximum degree; second, we allow our graphs to be disconnected, that is, we consider forests instead of trees. We prove the following theorem.
Theorem 3
Let \(\{F_n \}\) be a sequence of forests, \(F_n\) is a forest on n vertices with no isolated ones and \(\varDelta = \varDelta (F_n) = o(n)\). Then \(q^*(F_n) \ge 1 - O\left( \sqrt{\frac{\varDelta }{n}}\right) = 1-o(1)\) as \(n \rightarrow \infty \).
Note that it is also known that the asymptotic modularity of trees with maximum degree \(\varDelta = \Omega (n)\) is strictly less than 1 [26]. Hence, the assumption \(\varDelta = o(n)\) cannot be eliminated. Now we can use the previous theorem to get the following result for graphs of bounded average degree.
Theorem 4
Let \(\{G_n\}\) be a sequence graphs, \(G_n\) is a connected graph on n vertices with the average degree \(\frac{2|E(G_n)|}{n} \le D\) for some constant D, and \(\varDelta = \varDelta (G_n) = o(n)\). Then \(q^*(G_n) \ge \frac{2}{D} - O\left( \sqrt{\frac{\varDelta }{n}}\right) = \frac{2}{D}-o(1)\).
4.2 Lower Bound
The following theorem easily follows from the above result.
Theorem 5
As in the case of random d-regular graphs, it is natural to conjecture that the above lower bound is not sharp. Let \(c \in (0,1)\) and consider the following partition: \(A_1 = \{v_1, \ldots , v_{cn} \}\), \(A_2 = V(G_m^n) \setminus A_1 = \{ v_{cn+1}, \ldots , v_n\}\). Using martingales, it is possible to show that a.a.s. \(\sum _{v \in A_1} \deg (v, n) \sim 2mn\sqrt{c}\) (and so \(\sum _{v \in A_2} \deg (v, n) \sim 2mn(1-\sqrt{c})\)). Clearly, \(e(A_1) = mcn\) and so a.a.s. \(e(A_1,A_2) \sim 2mn(\sqrt{c}-c)\) and \(e(A_2) \sim mn(1+c-2\sqrt{c})\). The edge contribution and the degree tax are then both asymptotic to \(1+2c-2\sqrt{c}\). Not surprisingly, such partition cannot be used to get a non-trivial lower bound for the modularity but, similarly to the situation for random d-regular graphs, we may try to use it as a starting point to get slightly better partition. The basic idea is very simple: one can start with a given partition (or partition the vertices randomly into two classes), and if a vertex has more neighbors in the other class than in its own, then we randomly decide whether to shift it to the other class or leave it where it is. This approach proved to be useful to get a bound for the bisection width in random d-regular graphs [3] which, in turn, yields a lower bound for the modularity [24]. We plan to investigate it further in the journal version of this paper.
4.3 Upper Bound
Theorem 6
Much stronger expansion property was recently obtained in [15]. We are currently working on using this property to obtain general upper bound for \(q^*(G_m^n)\) that holds for any integer m as well as specific stronger bounds for small values of m. Details will be provided in the journal version of this paper.
5 The Spatial Preferential Attachment Model
Theorem 7
Acknowledgements
This work is supported by Russian Science Foundation (grant number 16-11-10014), NSERC, The Tutte Institute for Mathematics and Computing, and Ryerson University.