Acknowledgements
The authors would like to thank Stefan Braunewell, Michele Leone, Ionas Erb and Andreas Engel for many helpful hints and discussions.
Volume 224, Issues 1–2, December 2006, Pages 20–26
Dynamics on Complex Networks and Applications
Edited By Adilson E. Motter, Manuel A. Matı´as, Jürgen Kurths and Edward Ott
The study of cluster or community structure of complex networks contributes to the understanding of networks at a functional level. In many networks, latent classes of nodes are suspected which manifest themselves as communities, i.e. groups of nodes with a high link density among the nodes of the same class and low link density between nodes of different classes. Community detection algorithms are used to infer these classes, e.g. by finding a partition of the network which maximizes a quality function such as the network modularity Q [M. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2004) 026113]. However, it is known from numerical experiments that even purely random networks display intrinsic modularity and may be partitioned yielding high values of Q. Extending on our earlier results [J. Reichardt, S. Bornholdt, Statistical mechanics of community detection, Phys. Rev. E 74 (2006) 016110], the mapping of the community detection problem onto finding the ground state of a spin glass is exploited in order to derive analytical expressions for the expected modularity in random graphs and assess the theoretical limits to community detection. The results are independent of any specific community detection algorithm and allow for differentiation between modularity arising purely due to the search process in the large configuration space of possible partitionings on the one hand, or due to the actual presence of different classes of nodes on the other hand.
With the increasing availability and steadily increasing size of relational datasets or networks the need for appropriate methods for exploratory data analysis arises. For general statistical properties such as the degree distribution, degree correlations, clustering etc. a number of well established methods and models to explain their origin exist [1] and [2]. However, a standard analysis for the higher order structure in graphs has not been established so far. Currently, the problem of the cluster or community structure is the subject of intense study [3] and [4]. Cluster analysis is an important technique that allows for data abstraction and dimensionality reduction or aids in data visualization. It is used in the life sciences [5], over bibliometrics [6], to market research [7], and has implications for experiment planning, funding policies or marketing.
The above examples have also illustrated that such exploratory analysis is often the starting point to further work. It is therefore important to assess the statistical validity of the findings and avoid the “deception of randomness” [8], i.e. to ensure that the output of a community detection algorithm is statistically significant and not the mere result of the search process. To illustrate this, we consider the following problem. Given is an Erdös–Rényi (ER) network with average degree 〈k〉=5. Given is further an assignment of the nodes into two types A and B with 50 nodes each. Between nodes of different type, 42 links are found, the remaining links being equally distributed among nodes of type A or B alone. If nodes are connected independently of their type, the total number of links between type A and B nodes is Poisson-distributed with a mean of 〈k〉N/4=125 and a standard deviation of σ=11. Hence, finding only 42 links between A and B is statistically highly significant with a p-value of p=2.8×10−18. Now assume that the type of each node was not given and an assignment into two equal sized groups A and B was found through an exhaustive search of the possible assignments into two equal sized groups. Applying a Bonferroni correction [9] for this number of different “experiments” would lead to the situation that only less than 22 connections between nodes of type A and B would be significant at the 5% level. Hence, the initial situation with 42 links between nodes of different type could not be called significant. As will be shown below, any ER random network with 100 nodes and 〈k〉=5 can be partitioned into two equal sized groups such that only 42 links connect the two parts. Thus, statistical significance starts much earlier than the limit given by the Bonferroni correction. The Bonferroni correction fails here because it assumes independent experiments. The different assignments into groups produced by a search process, however, are not independent.
In addition to the problem of statistical significance, another aspect of community detection is still under intense discussion, namely the definition of the term community or cluster in a network. This article aims at contributing to both of these questions.
Despite the many applications of community detection across the sciences, it remains remarkably unclear what a community actually is. In addition to the many definitions that are given in sociology [10], the physics community has contributed a fair number as well [3] and [4]. All authors agree that a community should be a group of nodes that is more densely connected among each other than with the rest of the network, but still these definitions differ largely in the details. Below, we give a short overview of the different aspects that have been emphasized by different authors.
The initial work on communities by Girvan and Newman [11] gives an algorithmic definition. They design a community detection algorithm which recursively partitions the graph to produce a hierarchy of communities from the entire network down to single nodes. At each point, the nodes belonging to distinct sub-trees in the resulting dendrogram are considered as communities.
Radicchi et al. [12] tried to improve this heuristic definition by coining the term of “community in a strong sense” such that
Palla et al. [6] and [13] have given a definition based on reachability. They define a subgraph percolation process based on k-cliques (fully connected subgraphs with k nodes). Two k-cliques are connected, if they share a (k−1)-clique, e.g. two triangles (which are 3-cliques) are connected if they share an edge (a 2-clique). A community, or k-clique percolation cluster, is then defined as the group of nodes that can be reached via adjacent k-cliques. Communities may overlap, i.e. nodes may belong to more than one percolation cluster, but communities corresponding to (k+1)-clique percolation clusters always lie completely within k-clique clusters.
Newman and Girvan have further defined a quantitative measure of the quality of an assignment of nodes into communities. This so-called “modularity” [14] can be used to compare different assignments of nodes into communities quantitatively. The modularity is defined as:
We see the diversity of definitions and approaches of which we have described only a few. Refs. [3] and [4] give a more comprehensive overview. Because of this controversy of opinions, we have set out from a first principles approach in the next section that will shed some light on the general properties of the problem.
As outlined in Ref. [16] the problem of community detection can be addressed from a first principles perspective by adhering to a simple principle: to group nodes that are not linked in different communities and to put nodes which are linked in the same community. This principle is expressed in the following Hamiltonian:
Both of these choices render pij positive and smaller than one, hence we are dealing with a spin glass which has ferromagnetic couplings between connected nodes and anti-ferromagnetic couplings between unconnected nodes. The ground state of this spin glass defines the optimal assignment of nodes into communities. For γ=1 and pij=kikj/2M, we recover the modularity Q defined by Newman and Girvan [14] from (4) via [17].
It is instructive to rewrite (4) as a sum over spin states s:
With our mapping couplings between all pairs of nodes exist. Fortunately, the particular choice of pij allows us to implement efficient optimization routines [16] that only need to consider interactions along the links and treat the anti-ferromagnetic interactions along the non-existing links in a mean field manner, which is, however, not an approximation but accounts exactly for the repulsive interactions. One only needs to keep track of the occupation numbers of the spin states or the total sum of degrees in each group.
A definition of community follows directly from the properties of the ground state as a global minimum of (4) [16]:
Every proper subset n1 of a community ns has a maximum coefficient of adhesion with its complement in the community compared to the coefficient of adhesion with any other community (a1,s∖1=max).
The coefficient of cohesion is non-negative for all communities ().
The coefficient of adhesion between any two communities is non-positive ().
The ground state depends on the value γ chosen. Recall the definitions of adhesion and cohesion in Eq. (5), where γ measures the relative influence of the number of internal and external links over the respective expectation values. Further, compare the ground state properties (1) to (3). For γ=1, a set of nodes has non-negative cohesion, if it has at least as many internal links as expected from a random assignment and for γ=2, a set of nodes needs at least twice as many internal links than expected in order to be considered a community. Therefore, larger values of γ will lead to denser communities which are also smaller. The opposite is true for smaller values of γ. Hence, γ determines a threshold for the link density contrast in the community structure. The highest sensible value is γ≈1/p, since then only complete subgraphs (cliques) will have a non-negative cohesion. The lowest sensible value of γ will be the largest value which leads to a ferromagnetic ground state. Of course, an initial investigation should be performed at γ=1, which corresponds to the natural partition of the graph. If one then wants to find additional structures, for most practical applications a sensible parameter range is 0.01≤γ≤100, which should be explored starting around γ=1.
The value of γ at which the community structure was obtained should always be quoted. Changing the value of γ allows us to detect hierarchies in the assignment of nodes into communities [18] and [16].
The performance of this approach to community detection was benchmarked on computer generated test networks [18] and [4] and the results compared to those obtained by Girvan and Newman’s betweenness algorithm [11]. The networks are Erdös–Rényi (ER) graphs [19] with an average degree of 〈k〉=16 and 128 nodes. The nodes were divided into 4 groups of 32 nodes each. Keeping the average degree fixed, the links per node were distributed into an average of to members of the same group and an average of
to members of the 3 remaining groups in the network such that
. Obviously, increasing
at the expense of
makes the recovery of the designed community structure more difficult. At
the network should be completely random and any trace of the built-in community structure is lost since at this point the probability to link to a member of a different node equals the probability to link to a member of the same group
.
Fig. 1 shows the results of the benchmarks. We measured the success of the two methods in terms of sensitivity and specificity. Sensitivity measures the fraction of pairs of nodes that are correctly classified as being in the same cluster, while specificity measures the fraction of nodes correctly classified as belonging to different clusters. In other words, the two measures indicate how good the algorithms are in grouping together what belongs together and in keeping apart what does not belong together. From Fig. 1 we see that both algorithms are rather conservative in terms of grouping things together as indicated by the high levels of specificity. The change in sensitivity is much more drastic and we find that the Potts model approach outperforms the algorithm of Girvan and Newman [11]. The critical value of , at which the ability to recover the built-in community structure vanishes, seems to be
.
Fig. 1. Benchmarks of a community detection algorithm based on finding the ground state of a spin glass and comparison with Girvan–Newman’s algorithm [11]. Tests were run on computer generated test networks with known community structure. “Sensitivity” denotes the fraction of all pairs of nodes that are classified correctly in the same community, while “specificity” denotes the fraction of all pairs of nodes classified correctly in different communities.
In our introductory paragraphs, we have already raised the question of when one may call a network truly modular. Obviously, running a clustering algorithm over a set of randomly generated data points will always produce clusters which, however, have little meaning. Similarly, minimizing the modularity Hamiltonian on a random graph results in a community structure which has all the desired properties. This does, of course, not mean that the graph we studied was in fact modular. A differentiation between graphs which are truly modular and those which are not can hence only be made if we gain an understanding of the intrinsic modularity of random graphs. By comparing the modularity of random graphs with that of real world graphs, we can assess whether the graphs under study are truly modular.
Such a comparison can of course always be made by randomizing the network under study keeping the degree distribution invariant. Such algorithms then remove all correlations and community structures possibly present in the data. Comparing the results of clustering the empirical data and a randomized version of it can always give a clue to what extent the data shows modularity above that of a random network with the same degree distribution. Nevertheless, such analysis is biased by the algorithm used to detect the community structure. Much more desirable would be a measure of modularity that can be used to compare with any algorithm.
In mapping the problem of finding a community structure onto finding the ground state of an infinite range spin glass, we have defined a coupling matrix Jij with the following distribution of couplings:
For the number of edges to cut when equi-partitioning a random graph, a number of results exist since the 1980’s, beginning with the paper by Fu and Anderson [20] about bi-partitioning a random graph. Kanter and Sompolinsky [21] have given an expression for the minimum total number of inter community edges C, also called cut-size, when partitioning a random graph into q equal sized parts. From this, we can immediately write an expectation value for the modularity of random graphs [16]:
Table 1. Values of U(q)/q for various values of q obtained from [21], which can be used to approximate the expected modularity with Eq. (7)
q | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
U(q)/q | 0.384 | 0.464 | 0.484 | 0.485 | 0.479 | 0.471 | 0.461 | 0.452 | 0.442 |
We see that maximum modularity is obtained at q=5, though the value of U(q)/q for q=4 is not much different from it. This qualitative behavior of dense random graphs tending to cluster into only a few large communities is confirmed by our numerical experiments. Using the largest value of Table 1, we finally arrive at an expression for the expected modularity in any ER random graph with average degree 〈k〉=pN:
The above approximation using a Potts spin glass, however, cannot explain the number of communities found experimentally in random graphs of varying connectivity since it always assumes 5 communities. Therefore, we try to approximate the ground state of a q-state Potts model by recursively bi-partitioning the network and continuing as long as the modularity increases. For every bi-partition we use the expression of the cut-size as a function of the number of N nodes and average degree 〈k〉=pN given by Fu and Anderson [20]. After every partition, the number of links connecting to nodes in the same part and to nodes in the rest of the network is given by:
Though Eq. (10) only allows numbers of communities that are powers of 2, the agreement of Q with the experimental data is surprisingly good as Fig. 2 shows. Also, the number of communities is predicted almost perfectly by (10) as shown in Fig. 2.
With the expressions (8) and (10), we are adequately able to calculate expectation values of Q for random graphs which can be used in the assessment of the statistical significance of the modularity in real world networks. Note that our analytical results improve those empirically found in Ref. [22], which reports a scaling of modularity in ER graphs as Q∝〈k〉−2/3. We have shown that random graphs may exhibit considerable values of modularity even without any built-in group structure. Recall that the modularity Q has an upper bound of Q<1. Significant community structure can hence only be attributed to graphs with values of modularity higher than those calculated for equivalent random null models. The sparser a graph, the higher the expected modularity. It is therefore particularly difficult to detect true modularity in sparse graphs. Also, the sparser a graph, the more modules it will show, while dense random graphs tend to cluster into only a handful of communities.
With the results of the last section we are now prepared to explain Fig. 1 and to give a limit to what extent a designed community structure in a network can be recovered. As we have seen, for any random network we can find an assignment of spins in communities that leads to a modularity Q>0. For our computer-generated test networks with 〈k〉=16 we have a value of p=〈k〉/(N−1)=0.126 and expect a value of Q=0.227 according to (8) and Q=0.262 according to (10). The modularity of the community structure built-in by design is given by:
Fig. 3. Ratio of internal links to external links in the ground state of the Hamiltonian. Shown are the experimental values from clustering random graphs with N=10,000 nodes and the expectation values calculated from using a Potts model (8) or an Ising model (10) recursively. The dotted line represents the Radicchi et al. definition of community in “strong sense” [12]. Note that sparse graphs will, on average, always exhibit such communities, while dense graphs will not, even though their modularity may be well above the expectation value for an equivalent random graph.
This also means that the point at which we cannot distinguish between a random and a modular network is not defined by for the internal and external link densities as one may have intuitively expected. Rather, it is determined by the ratio of
in the ground state of a random network and depends on the connectivity of the network 〈k〉. This result is important whenever benchmark networks are constructed and the results of various studies on possibly different networks are to be compared. It also shows that the Potts model clustering technique is already close to optimal in the sense that it approaches the theoretical limit. Therefore, research emphasis should be laid upon finding efficient minimization routines for the Potts model Hamiltonian suggested.
From Fig. 3 we observe that sparse random graphs all show communities in the strong sense of Radicchi et al. [12]. Further, it is very difficult to find communities in the strong sense in dense graphs, even though they may exhibit a modularity well above that of a random graph.
Finally, let us examine the most widely studied network for community detection, the Zachary Karate Club [23], which was used by Girvan and Newman in their seminal paper [11] as an example. The network depicts the friendships of the members of a Karate club which eventually broke apart into two almost equally sized groups due to a dispute between the manager and the instructor. It has been used widely as an example of community detection, generally with the goal to reproduce the actual split of the club. The network contains 34 nodes and 77 links, which leads to 〈k〉=4.53 and p=0.137. According to Eq. (8), this yields an expectation value of Q=0.423. This is surprisingly close to the highest value of 0.419 found in this network as reported by Duch and Arenas [24]. Alternatively, Eq. (9) yields an estimate of 12.8 for the number of links between two equal sized groups of 17 members each. A recent refinement of this estimate from Ref. [25] gives an estimated cut-size of 12.5 links. The actually observed split by Zachary into a group of 18 and one of 16 had a cut-size of 10. The split found by Girvan and Newman was a partition into a group of 19 and 15 with the same cut-size of 10. It must be understood that the analytical results obtained in the thermodynamic limit are not exactly applicable to such a small network and that a rewiring technique would be more appropriate for such a small system. This said, it still seems that the Karate club network is not truly modular in the sense that one cannot find partitions that considerably exceed the expectation values of Q for equivalent random null models. There exist only few alternative assignments of comparable modularity which overlap only in a small number of nodes. This means the community structure of maximum modularity is very stable. For such a small system, this is, however, expected. Also, one cannot say that the cut-size of the observed split differs drastically from the expected in an equivalent random network. This leads to the following conclusion: from a pure network perspective, we cannot say that there were different groups of people in the club that formed communities into which the club eventually broke apart. Rather, it was the central position of the conflicting individuals, their being highly connected and their specific positions as manager and instructor which lead to the breakup of the club. This breakup then happened along a minimal cut which is only natural. A dispute between any other pair of members would most likely not have led to a splitting of the group.
In this article, we have examined the problem of assessing statistically significant modular structure in networks. We exploited a mapping of the problem of community detection onto finding the ground state of an infinite range spin glass. The quality function of the clustering, or modularity, is identified as the ground state energy of this spin glass. Benchmarks show the good performance of algorithms based on this mapping. Expectation values for the modularity of Erdös–Rényi random graphs were given and the dependence of these expectation values on the link density in the network was discussed. The theoretical limits of community detection were addressed. We found that only those community structures can be recovered reliably that lead to modularities larger than the expectation values of random graphs. This allowed us to quantitatively explain the observed deterioration of benchmark results for computer generated test networks and shed some light on the interpretation of widely used real world example networks. Our findings are universally applicable and independent of any algorithm used to find the community structure.
The authors would like to thank Stefan Braunewell, Michele Leone, Ionas Erb and Andreas Engel for many helpful hints and discussions.
Statistical mechanics of complex networks
Rev. Modern Phys., 74 (2002), pp. 47–97
The structure and function of complex networks
SIAM Rev., 45 (2) (2003), pp. 167–256
Detecting community structure in networks
Eur. Phys. J. B, 38 (2004), p. 321
Comparing community structure identification
J. Stat. Mech. (2005), p. P09008
Functional cartography of complex metabolic networks
Nature, 433 (2005), pp. 895–900
Uncovering the overlapping community structure of complex networks in nature and society
Nature, 435 (2005), p. 814
Statistical Mechanics of Learning
Cambridge University Press (2001)
Social Network Analysis
Cambridge University Press (1994)
Community structure in social and biological networks
Proc. Natl. Acad. Sci. USA, 99 (12) (2003), pp. 7821–7826
Defining and identifying communities in networks
Proc. Natl. Acad. Sci. USA, 101 (2004), p. 2658
Clique percolation in random networks
Phys. Rev. Lett., 94 (2005), p. 160202
Finding and evaluating community structure in networks
Phys. Rev. E, 69 (2004), p. 026113
Fast algorithm for detecting community structure in networks
Phys. Rev. E, 69 (2004), p. 066133
Statistical mechanics of community detection
Phys. Rev. E, 74 (2006), p. 016110
Finding community structure in very large networks
Phys. Rev. E, 70 (2004), p. 066111
Detecting fuzzy community structures in complex networks with a Potts model
Phys. Rev. Lett., 93 (2004), p. 218701
On the evolution of random graphs
Publ. Math. Inst. Hung. Acad. Sci., 5 (1960), pp. 17–61
Application of statistical mechanics to NP-complete problems in combinatorial optimisation
J. Phys. A: Math. Gen., 19 (1986), pp. 1605–1620
Graph optimisation problems and the Potts glass
J. Phys. A: Math. Gen., 20 (1987), pp. 636–679
Modularity from fluctuations in random graphs and complex networks
Phys. Rev. E, 70 (2004), p. 025101R
An information flow model for conflict and fission in small groups
J. Anthropol. Res., 33 (1977), pp. 452–473
Community detection in complex networks using extremal optimization
Phys. Rev. E, 72 (2005), p. 027104
Copyright © 2006 Elsevier Ltd. All rights reserved.