Abstract:
We develop an information-theoretic view of the stochastic block model, a popular statistical model for the large-scale structure of complex networks. A graph G from such a model is generated by first assigning vertex labels at random from a finite alphabet, and then connecting vertices with edge probabilities depending on the labels of the endpoints. In the case of the symmetric two-group model, we establish an explicit 'single-letter' characterization of the per-vertex mutual information between the vertex labels and the graph, when the graph average degree diverges. The explicit expression of the mutual information is intimately related to estimation-theoretic quantities, and, in particular, reveals a phase transition at the critical point for community detection. Below the critical point the per-vertex mutual information is asymptotically the same as if edges were independent of the vertex labels. Correspondingly, no algorithm can estimate the partition better than random guessing. Conversely, above the threshold, the per-vertex mutual information is strictly smaller than the independent-edges upper bound. In this regime there exists a procedure that estimates the vertex labels better than random guessing.
Date of Conference: 10-15 July 2016
Publisher: IEEE
Conference Location: Barcelona, Spain

SECTION I.

Introduction and Main Results

The stochastic block model is the simplest statistical model for networks with a community (or cluster) structure. As such, it has attracted a considerable amount of work across statistics, machine learning, and theoretical computer science [1][2][3][4][5]. A random graph G=(V, E) from this model has its vertex set V partitioned into r groups, which are assigned r distinct labels. The probability of edge (i, j) being present depends on the group labels of vertices i and j.

In the context of social network analysis, groups correspond to social communities [1]. For other data-mining applications, they represent latent attributes of the nodes [6]. In all of these cases, we are interested in inferring the vertex labels from a single realization of the graph.

In this paper we develop an information-theoretic viewpoint on the stochastic block model. Namely, we develop an explicit ('single-letter') expression for the per-vertex conditional entropy of the vertex labels given the graph. Equivalently, we compute the asymptotic per-vertex mutual information between the graph and the vertex labels. Our results hold in the large-network limit, under suitable assumptions on the model parameters. In particular, we require that the mean vertex degree in the graph diverge (albeit arbitrarily slowly). While the asymptotic mutual information is of independent interest, as we will see it is intimately related to estimation-theoretic quantities.

In this paper, we focus on the symmetric binary model. Namely, we assume the vertex set V = [n] ≡ {1, 2, ..., n} to be partitioned into two sets V = V₊ ∪ V₋, with P(i ∈ V₊) = P(i ∈ V₋) = 1/2 independently across vertices i. Conditional on the vertex labels, edges are independent with

\mathbb{P}\big((i,j)\in E \mid V_+, V_-\big) =
\begin{cases}
p_n & \text{if } \{i,j\}\subseteq V_+ \text{ or } \{i,j\}\subseteq V_-,\\
q_n & \text{otherwise.}
\end{cases} \qquad (1)

Throughout we will denote by X = (X_i)_{i∈V} the set of vertex labels, X_i ∈ {+1, −1}, and we will be interested in the conditional entropy H(X|G) or, equivalently, the mutual information I(X;G) in the limit n → ∞. We will write G ∼ SBM(n; p, q) (or (X, G) ∼ SBM(n; p, q)) to indicate that the graph G is distributed according to the stochastic block model with n vertices and parameters p, q.
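As a concrete illustration of the model just described, the following Python sketch samples (X, G) ∼ SBM(n; p, q). The function name sample_sbm and its parameters are ours, chosen for illustration only; this is a minimal sketch, not code from [10].

import numpy as np

def sample_sbm(n, p, q, rng=None):
    """Sample (X, G) from the symmetric two-group stochastic block model.

    X[i] is +1 or -1 uniformly at random; edge (i, j) is present with
    probability p if X[i] == X[j] and with probability q otherwise (Eq. (1)).
    Returns the label vector X and the symmetric 0/1 adjacency matrix G.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = rng.choice([+1, -1], size=n)
    # Edge probability of each pair depends only on whether the labels agree.
    same = np.equal.outer(X, X)
    prob = np.where(same, p, q)
    # Sample the upper triangle and symmetrize; no self-loops.
    U = rng.random((n, n)) < prob
    G = np.triu(U, k=1)
    G = (G | G.T).astype(int)
    return X, G

# Example: a regime with diverging average degree.
X, G = sample_sbm(n=2000, p=0.06, q=0.04)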

Since we are interested in the large n behavior, two preliminary remarks are in order:

  1. Normalization. We obviously have 0 ≤ H(X|G) ≤ n log 2. It is therefore natural to study the per-vertex entropy H(X|G)/n. As we will see, depending on the model parameters, this will take a value between 0 and log 2.

  2. Scaling. The reconstruction problem becomes easier when p_n and q_n are well separated, and more difficult when they are close to each other. For instance, in an early contribution, Dyer and Frieze [2] proved that the labels can be reconstructed exactly (modulo an overall flip) if p_n = p > q_n = q are distinct and independent of n. In particular, such exact recovery results imply that H(X|G)/n → 0. In this regime, the 'signal' is so strong that the conditional entropy is trivial. Let p̄_n = (p_n + q_n)/2 denote the average edge probability. It turns out that the relevant 'signal-to-noise ratio' (SNR) is given by the following parameter:

\lambda_n = \frac{n\,(p_n - q_n)^2}{4\,\bar{p}_n(1 - \bar{p}_n)}. \qquad (2)

Indeed, we will see that H(X|G)/n is of order one, and has a strictly positive limit, when λ_n converges to a constant λ of order one. In this regime, the fraction of incorrectly labeled vertices has a limit in (0, 1).
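For intuition, consider the sparse parametrization p_n = α/n, q_n = b/n that appears in Remark I.3 and in Section II. The following computation is ours, but it follows directly from Eq. (2):

\lambda_n = \frac{n\,\big((\alpha-b)/n\big)^2}{4\cdot\frac{\alpha+b}{2n}\,\big(1+o(1)\big)} = \frac{(\alpha-b)^2}{2(\alpha+b)}\,\big(1+o(1)\big).

In particular, λ > 1 corresponds exactly to the detection threshold (α − b)² > 2(α + b) discussed in Section II.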

A. Main Results: Limiting Mutual Information and MMSE

Our main result provides a single-letter characterization for the per-vertex mutual information. This is given in terms of an effective Gaussian scalar channel. Namely, define the Gaussian channel

Y_0 = Y_0(\gamma) = \sqrt{\gamma}\, X_0 + Z_0, \qquad (3)

where X₀ ∼ Uniform({+1, −1}) independent of Z₀ ∼ N(0, 1). We denote by mmse(γ) and I(γ) the corresponding minimum mean square error and mutual information:
\mathsf{I}(\gamma) = \mathbb{E}\,\log\left\{ \frac{\mathrm{d}p_{Y|X}\big(Y_0(\gamma)\,\big|\,X_0\big)}{\mathrm{d}p_{Y}\big(Y_0(\gamma)\big)} \right\}, \qquad (4)
\mathrm{mmse}(\gamma) = \mathbb{E}\Big\{ \big(X_0 - \mathbb{E}\{X_0 \mid Y_0(\gamma)\}\big)^2 \Big\}. \qquad (5)

In the present case, these quantities can be written explicitly as Gaussian integrals of elementary functions:

\mathsf{I}(\gamma) = \gamma - \mathbb{E}\,\log\cosh\big(\gamma + \sqrt{\gamma}\, Z_0\big), \qquad (6)
\mathrm{mmse}(\gamma) = 1 - \mathbb{E}\Big\{ \tanh\big(\gamma + \sqrt{\gamma}\, Z_0\big)^2 \Big\}. \qquad (7)
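The Gaussian expectations in (6)-(7) are one-dimensional integrals and are easy to evaluate numerically. The sketch below is our own illustration (the helper names are ours), using Gauss-Hermite quadrature from NumPy:

import numpy as np

# Gauss-Hermite nodes/weights: E[f(Z)], Z ~ N(0,1), is sum_k w_k f(sqrt(2) x_k) / sqrt(pi).
_nodes, _weights = np.polynomial.hermite.hermgauss(60)

def _gauss_expect(f):
    return np.dot(_weights, f(np.sqrt(2.0) * _nodes)) / np.sqrt(np.pi)

def mutual_info(gamma):
    """I(gamma) of Eq. (6): gamma - E log cosh(gamma + sqrt(gamma) Z)."""
    def log_cosh(u):
        return np.logaddexp(u, -u) - np.log(2.0)   # numerically stable log cosh
    return gamma - _gauss_expect(lambda z: log_cosh(gamma + np.sqrt(gamma) * z))

def mmse(gamma):
    """mmse(gamma) of Eq. (7): 1 - E tanh(gamma + sqrt(gamma) Z)^2."""
    return 1.0 - _gauss_expect(lambda z: np.tanh(gamma + np.sqrt(gamma) * z) ** 2)

# Sanity checks: gamma = 0 carries no information about X_0.
assert abs(mutual_info(0.0)) < 1e-8 and abs(mmse(0.0) - 1.0) < 1e-8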

The following theorem gives a single-letter characterization of the limiting per-vertex mutual information.

Theorem I.1

For any λ>0, let γ=γ(λ) be the largest non-negative solution of the equation:

\gamma = \lambda\,\big(1 - \mathrm{mmse}(\gamma)\big). \qquad (8)

We refer to γ(λ) as the effective signal-to-noise ratio. Further, define Ψ(γ, λ) by:

\Psi(\gamma, \lambda) = \frac{\lambda}{4} + \frac{\gamma^2}{4\lambda} - \frac{\gamma}{2} + \mathsf{I}(\gamma). \qquad (9)

Let the graph G and the vertex labels X be distributed according to the stochastic block model with n vertices and parameters p_n, q_n (i.e., (G, X) ∼ SBM(n; p_n, q_n)), and define λ_n ≡ n(p_n − q_n)²/(4 p̄_n(1 − p̄_n)).

Assume that, as n → ∞, (i) λ_n → λ and (ii) n p̄_n(1 − p̄_n) → ∞. Then,

\lim_{n\to\infty} \frac{1}{n}\, I(X;G) = \Psi(\gamma(\lambda), \lambda) = \inf_{\gamma \in [0,\infty)} \Psi(\gamma, \lambda). \qquad (10)
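Theorem I.1 reduces the computation of the limit to the scalar fixed point equation (8). A minimal numerical sketch (ours; it assumes the mmse and mutual_info helpers from the sketch after Eqs. (6)-(7)) iterates (8) from a large starting value, which converges to the largest non-negative solution, and then evaluates Ψ(γ(λ), λ) together with the limiting matrix MMSE of Theorem I.2 below:

# Assumes mmse(gamma) and mutual_info(gamma) from the sketch after Eqs. (6)-(7).

def gamma_star(lam, iters=200):
    """Largest non-negative solution of gamma = lam * (1 - mmse(gamma)), Eq. (8).

    Starting from gamma = lam (an upper bound, since 1 - mmse <= 1) and iterating
    the monotone map keeps the iterates above the largest fixed point, so the
    sequence decreases to it.
    """
    gamma = lam
    for _ in range(iters):
        gamma = lam * (1.0 - mmse(gamma))
    return gamma

def psi(gamma, lam):
    """Psi(gamma, lambda) of Eq. (9)."""
    return lam / 4.0 + gamma ** 2 / (4.0 * lam) - gamma / 2.0 + mutual_info(gamma)

def limits(lam):
    """Limiting per-vertex mutual information (10) and matrix MMSE (12)."""
    g = gamma_star(lam)
    return psi(g, lam), 1.0 - (g / lam) ** 2

# Below the threshold gamma(lam) = 0, so the limit (10) equals lam/4 and the MMSE limit is 1.
info_limit, mmse_limit = limits(0.5)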

This theorem and its proof have implications for the minimum error that can be achieved in estimating the labels X from the graph G. For reasons that will become clear below, a natural metric is given by the matrix minimum mean square error

\mathrm{MMSE}_n(\lambda) \equiv \frac{1}{n(n-1)}\, \mathbb{E}\Big\{ \big\| X X^{\mathsf{T}} - \mathbb{E}\{X X^{\mathsf{T}} \mid G\} \big\|_F^2 \Big\}. \qquad (11)

Theorem I.2

Under the assumptions of Theorem I.1 (in particular assuming λnλ as n), the following limit holds for the matrix minimum mean square error

\lim_{n\to\infty} \mathrm{MMSE}_n(\lambda_n) = 1 - \frac{\gamma(\lambda)^2}{\lambda^2}. \qquad (12)

Further, this implies lim_{n→∞} MMSE_n(λ_n) = 1 for λ ≤ 1 and lim_{n→∞} MMSE_n(λ_n) < 1 for λ > 1.

A couple of remarks are in order.

Remark I.3

Notice that our assumptions require n p̄_n(1 − p̄_n) → ∞ at any, arbitrarily slow, rate. In words, this corresponds to the graph average degree diverging at any, arbitrarily slow, rate. Recently (see Section II for a discussion of this literature), there has been considerable interest in the case of bounded average degree, namely

p_n = \frac{\alpha}{n}, \qquad q_n = \frac{b}{n}, \qquad (13)

with α, b bounded. Our proof gives an explicit error bound in terms of the problem parameters even when n p̄_n(1 − p̄_n) is of order one. Hence we are able to characterize the asymptotic mutual information for large-but-bounded average degree up to an offset:

\limsup_{n\to\infty} \left| \frac{1}{n}\, I(X;G) - \Psi(\gamma(\lambda), \lambda) \right| \le \frac{C\,\lambda^{3}}{\sqrt{\alpha + b}}. \qquad (14)

Further, our mild condition on diverging average degree should be contrasted with the phase transition of naive spectral methods. It is well understood that the community structure can be estimated by the principal eigenvector of the centered adjacency matrix G − E{G} = (G − p̄_n 11ᵀ). (We denote by G the graph as well as its adjacency matrix.) This approach is successful for λ > 1, but requires the average degree to satisfy n p̄_n ≥ (log n)^c for c a constant [7], [8].
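As an illustration of the naive spectral approach just mentioned, the following sketch (ours, not code from [7], [8]) estimates the labels by the sign of the principal eigenvector of the centered adjacency matrix; it can be run on the output of our illustrative sample_sbm helper from above:

import numpy as np

def spectral_labels(G, p_bar=None):
    """Estimate the labels (up to a global sign flip) from the adjacency matrix G.

    Centers G by the average edge probability and returns the signs of the
    eigenvector associated with the largest eigenvalue of the centered matrix.
    """
    n = G.shape[0]
    if p_bar is None:
        # Plug-in estimate of the average edge probability.
        p_bar = G.sum() / (n * (n - 1))
    centered = G - p_bar * np.ones((n, n))
    eigvals, eigvecs = np.linalg.eigh(centered)
    leading = eigvecs[:, np.argmax(eigvals)]
    return np.sign(leading).astype(int)

# Overlap with the true labels, accounting for the sign ambiguity:
# X_hat = spectral_labels(G); overlap = abs(np.dot(X, X_hat)) / len(X)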

Remark I.4

The reader will notice that the limiting expressions for the mutual information and MMSE are related by the identity:

\frac{\mathrm{d}\,\Psi(\gamma(\lambda), \lambda)}{\mathrm{d}\lambda} = \frac{1}{4}\left(1 - \frac{\gamma(\lambda)^2}{\lambda^2}\right). \qquad (15)

Such an identity relating mutual information and MMSE is well-known for estimation in Gaussian noise [9], and its appearance in the setting of the SBM is a priori surprising. However, it is hardly coincidental, as our proof strategy involves reducing the SBM to an analogous "Gaussian noise" setting (see Section IV).
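To see why (15) holds at the level of the variational formula (the full justification is in [10]), note that by (10) the point γ(λ) is a stationary point of γ ↦ Ψ(γ, λ). Using the scalar I-MMSE relation dI(γ)/dγ = mmse(γ)/2, stationarity reproduces the fixed point equation (8), and the envelope theorem then gives the identity. In formulas (this short derivation is ours):

\frac{\partial \Psi}{\partial \gamma}\Big|_{\gamma=\gamma(\lambda)} = \frac{\gamma(\lambda)}{2\lambda} - \frac{1}{2} + \frac{\mathrm{mmse}(\gamma(\lambda))}{2} = 0
\quad\Longrightarrow\quad
\frac{\mathrm{d}\,\Psi(\gamma(\lambda),\lambda)}{\mathrm{d}\lambda} = \frac{\partial \Psi}{\partial \lambda}\Big|_{\gamma=\gamma(\lambda)} = \frac{1}{4} - \frac{\gamma(\lambda)^2}{4\lambda^2}.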

The rest of the paper is organized as follows. We discuss related work on the stochastic block model in Section II. In Section III, we discuss some consequences of our main results for other estimation metrics. Finally, in Section IV, we outline our proof strategy. We refer the reader to the full version [10] for proofs of the results.

SECTION II.

Related Work

The stochastic block model was first introduced within the social science literature in [1]. Around the same time, it was studied within theoretical computer science [11], [2], [12], under the name of 'planted partition model'. A large part of the literature has focused on the problem of exact recovery, providing algorithms and conditions on the gap between p_n and q_n that guarantee exact recovery of the vertex labels (up to a sign flip), with phase transitions recently discovered [13][14][15]. On the other hand, recent works have also studied the detection problem, i.e., recovering a positively correlated community structure. This was highlighted by [16], who conjectured the following intriguing phase transition phenomenon: the detection problem is solvable if and only if (α − b)² > 2(α + b), where p_n = α/n and q_n = b/n. The two parts of the conjecture were settled in [17] and [18], [19], respectively. Results for detection with more than two communities were also obtained in [20], [21], [15], [22].

In a sense, the present paper bridges detection and exact recovery, by characterizing the minimum estimation error when this is non-zero but, for λ > 1, smaller than for random guessing. Note that [23] introduced an information-theoretic view of the SBM, showing that I(X;G)/n admits a limit as n → ∞ when p_n = α/n, q_n = b/n. The key argument establishes a certain sub-additivity property for the conditional entropy H(X|G). While these techniques might extend to a broad family of planted models, the sub-additivity property does not establish the limiting value itself.

In terms of proof techniques, our arguments are closest to [24], [25]. We use the well-known Lindeberg strategy to reduce computation of mutual information in the SBM to mutual information of the Gaussian observation model. We then compute the latter mutual information by developing sharp algorithmic upper bounds, which are then shown to be asymptotically tight via an area theorem. The Lindeberg strategy builds from [24], [26] while the area theorem argument also appeared in [27].

Lesieur, Krzakala and Zdeborová [28] studied estimation of low-rank matrices observed through noisy memoryless channels. They conjectured that the resulting minimal estimation error is universal across a variety of channel models. Our proof establishes universality across two such models: the Gaussian and the binary output channels. We expect that similar techniques can be useful to prove universality for other models as well.

Finally, we expect that the results obtained in this paper are likely to extend to more general models, such as censored or labelled models [23], [29], [30], [31], [21].

SECTION III.

Consequences for Estimation

Theorem I.2 establishes that a phase transition takes place at λ = 1 for the matrix minimum mean square error MMSE_n(λ_n) defined in Eq. (11). Throughout this section, we will omit the subscript n to denote the n → ∞ limit (for instance, we write MMSE(λ) ≡ lim_{n→∞} MMSE_n(λ_n)).

Fig. 1: Illustration of the fixed point equation (8). The 'effective signal-to-noise ratio' γ(λ) is given by the intersection of the curve γ ↦ G(γ) ≡ 1 − mmse(γ) and the line γ/λ.

Fig. 2: Left frame: asymptotic mutual information per vertex of the two-groups stochastic block model, as a function of the signal-to-noise ratio λ. The dashed lines are simple upper bounds (see [10]): lim_{n→∞} I(X;G)/n ≤ λ/4 and I(X;G)/n ≤ log 2. Right frame: asymptotic estimation error under different metrics. Note the phase transition at λ = 1 in both frames.

Figure 2 reports the asymptotic prediction for MMSE(λ) stated in Theorem I.2, and evaluated as discussed above. The error decreases rapidly to 0 for λ>1.

In this section we discuss two other estimation metrics. In both cases we define these metrics by optimizing a suitable risk over a class of estimators: it is understood that randomized estimators are admitted as well.

  • The first metric is the vector minimum mean square error:

\mathrm{vmmse}_n(\lambda_n) = \frac{1}{n}\, \inf_{\hat{x}: \mathcal{G}_n \to \mathbb{R}^n} \mathbb{E}\Big\{ \min_{s \in \{\pm 1\}} \big\| X - s\, \hat{x}(G) \big\|_2^2 \Big\}. \qquad (16)

  Note the minimization over the sign s: this is necessary because the vertex labels can be estimated only up to an overall flip. Of course vmmse_n(λ_n) ∈ [0, 1], since it is always possible to achieve vector mean square error equal to one by returning x̂(G) = 0.

  • The second metric is the overlap:

\mathrm{Overlap}_n(\lambda_n) = \frac{1}{n}\, \sup_{\hat{s}: \mathcal{G}_n \to \{+1,-1\}^n} \mathbb{E}\Big\{ \big| \langle X, \hat{s}(G) \rangle \big| \Big\}. \qquad (17)

Again, Overlap_n(λ_n) ∈ [0, 1] (but now a large overlap corresponds to good estimation). Indeed, by returning ŝ(G) ∈ {+1, −1}^n uniformly at random, we obtain E{|⟨X, ŝ(G)⟩|}/n = O(n^{-1/2}) → 0.

Note that the main difference between the overlap and the vector minimum mean square error is that in the latter case we consider estimators x̂ : 𝒢_n → ℝ^n taking arbitrary real values, while in the former we consider estimators ŝ : 𝒢_n → {+1, −1}^n taking binary values.

The next lemma clarifies the relationship between matrix, vector minimum mean square error and the overlap.


Lemma III.1

With the above definitions, we have

\mathrm{vmmse}_n(\lambda) \ge 1 - \sqrt{1 - (1 - n^{-1})\, \mathrm{MMSE}_n(\lambda)}, \qquad (18)
\mathrm{vmmse}_n(\lambda) \le \mathrm{MMSE}_n(\lambda), \qquad (19)
\mathrm{Overlap}_n(\lambda) \ge 1 - \mathrm{MMSE}_n(\lambda), \qquad (20)
\mathrm{Overlap}_n(\lambda) \le \sqrt{1 - \mathrm{MMSE}_n(\lambda)}. \qquad (21)

As an immediate corollary of this lemma (together with Theorem I.2), we obtain that λ = 1 is the critical point for the other estimation metrics as well.


Corollary III.2

The vector minimum mean square error and the overlap exhibit a phase transition at λ = 1. Namely, under the assumptions of Theorem I.1 (in particular, λ_n → λ and n p̄_n(1 − p̄_n) → ∞), we have

  • If λ ≤ 1, then estimation cannot be performed asymptotically better than without any information:

    \lim_{n\to\infty} \mathrm{vmmse}_n(\lambda_n) = 1, \qquad (22)
    \lim_{n\to\infty} \mathrm{Overlap}_n(\lambda_n) = 0. \qquad (23)

  • If λ > 1, then estimation can be performed better than without any information, even in the limit n → ∞:

    \liminf_{n\to\infty} \mathrm{vmmse}_n(\lambda_n) \ge 1 - \frac{\gamma(\lambda)}{\lambda} > 0, \qquad (24)
    \limsup_{n\to\infty} \mathrm{vmmse}_n(\lambda_n) \le 1 - \frac{\gamma(\lambda)^2}{\lambda^2} < 1, \qquad (25)
    \liminf_{n\to\infty} \mathrm{Overlap}_n(\lambda_n) \ge \frac{\gamma(\lambda)^2}{\lambda^2}. \qquad (26)

SECTION IV.

Proof Strategy

In this section we describe the main elements used in the proof of Theorems I.1 and I.2. As an intermediate step, we introduce the Gaussian observation model:

Y = \sqrt{\frac{\lambda_n}{n}}\, X X^{\mathsf{T}} + Z, \qquad (27)

where Z is a symmetric matrix with independent Gaussian entries Z_ij ∼ N(0, 1 + δ_ij) (δ_ij is the Kronecker delta). Note that this model matches the first two moments of the SBM. More precisely, if G^res_ij ≡ (G_ij − p̄_n)/√(p̄_n(1 − p̄_n)), then E{G^res_ij | X} = E{Y_ij | X} and Var(G^res_ij | X) = Var(Y_ij | X) + O(n^{-1/2}).
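The moment-matching claim above is easy to check empirically. The sketch below is ours and reuses the illustrative sample_sbm helper from Section I; it compares the conditional mean of the rescaled adjacency matrix with the conditional mean ±√(λ_n/n) of Y_ij given X:

import numpy as np

n, p, q = 2000, 0.05, 0.03
X, G = sample_sbm(n, p, q)                 # illustrative helper from Section I

p_bar = (p + q) / 2.0
lam = n * (p - q) ** 2 / (4.0 * p_bar * (1.0 - p_bar))   # Eq. (2)
G_res = (G - p_bar) / np.sqrt(p_bar * (1.0 - p_bar))

# Empirical conditional means over off-diagonal pairs with equal / opposite labels.
same = np.equal.outer(X, X)
off_diag = ~np.eye(n, dtype=bool)
mean_same = G_res[same & off_diag].mean()
mean_diff = G_res[~same & off_diag].mean()

# Both should be close to +/- sqrt(lam / n), the conditional mean of Y_ij given X.
print(mean_same, mean_diff, np.sqrt(lam / n))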

Our strategy consists of two main steps:

  1. Show that the mutual information in the SBM and the Gaussian observation model (i.e. I(X;G) and I(X;Y)) coincide to leading order.

  2. Prove an asymptotic characterization of the mutual information I(X;Y) via an appropriately defined approximate message passing (AMP) algorithm.

The first step is established in the following proposition that also gives explicit error bounds on the difference between the two mutual information terms.


Proposition IV.1

Assume that, as n → ∞, (i) λ_n → λ and (ii) n p̄_n(1 − p̄_n) → ∞. Then there is a constant C, independent of n, such that

\frac{1}{n}\, \big| I(X;G) - I(X;Y) \big| \le \frac{C\,\lambda^{3/2}}{\sqrt{n\,\bar{p}_n(1 - \bar{p}_n)}} + C\,|\lambda_n - \lambda|. \qquad (28)

The proof of this proposition relies on the celebrated Lindeberg method, whereby we replace the non-Gaussian random variables Gij with Gaussian counterparts Yij and control the error incurred. Similar arguments have been used to establish universality of various macroscopic properties across microscopic details of the underlying probabilistic models (see [24], [26] for applications to compressed sensing, information theory and spin glass theory).

The second step of characterizing the Gaussian model of Eq. (27) involves developing algorithmic upper bounds via an appropriately defined Approximate Message Passing (AMP) algorithm. This algorithm yields estimates of XXᵀ, and a corresponding mean squared error MSE_AMP(λ) incurred by these estimates. This mean squared error is shown to be sharp, via an area theorem argument and the I-MMSE identity. Such an area theorem argument has been used in coding theory [32], multiuser detection [27] and sparse PCA [25]. Finally, we state the characterization of the Gaussian model, since it is of independent interest.
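For concreteness, a minimal AMP sketch for the rank-one Gaussian observation model (27) is given below. This is our own illustrative implementation of the generic AMP recursion for this model (tanh denoiser, Onsager correction, spectral initialization), not the exact algorithm analyzed in [10]; the function name and the commented example are hypothetical.

import numpy as np

def amp_labels(Y, lam, iters=30):
    """Illustrative AMP sketch for Y = sqrt(lam/n) X X^T + Z, Eq. (27).

    For this model the posterior-mean denoiser reduces to m = tanh(sqrt(lam) * x),
    and state evolution predicts that the effective SNR of the iterates follows
    gamma_{t+1} = lam * (1 - mmse(gamma_t)), i.e. the fixed point iteration of Eq. (8).
    Returns a +/-1 estimate of the labels (up to a global sign flip).
    """
    n = Y.shape[0]
    A = Y / np.sqrt(n)
    # Spectral initialization (informative when lam > 1).
    _, eigvecs = np.linalg.eigh(A)
    x = np.sqrt(n) * eigvecs[:, -1]
    m_prev = np.zeros(n)
    for _ in range(iters):
        m = np.tanh(np.sqrt(lam) * x)
        onsager = np.sqrt(lam) * np.mean(1.0 - m ** 2)   # Onsager correction term
        x, m_prev = A @ m - onsager * m_prev, m
    return np.sign(x).astype(int)

# Example (hypothetical): generate the Gaussian observation model and run AMP.
# n, lam = 2000, 2.0
# rng = np.random.default_rng(0)
# X = rng.choice([+1, -1], size=n)
# Z = rng.normal(size=(n, n)); Z = (Z + Z.T) / np.sqrt(2)   # Z_ij ~ N(0, 1 + delta_ij)
# Y = np.sqrt(lam / n) * np.outer(X, X) + Z
# overlap = abs(np.dot(X, amp_labels(Y, lam))) / n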


Theorem IV.2

For any λ > 0, let γ(λ) be the largest non-negative solution of the equation:

\gamma = \lambda\,\big(1 - \mathrm{mmse}(\gamma)\big). \qquad (29)

Further, define Ψ(γ, λ) by:

\Psi(\gamma, \lambda) = \frac{\lambda}{4} + \frac{\gamma^2}{4\lambda} - \frac{\gamma}{2} + \mathsf{I}(\gamma). \qquad (30)

Then, we have

\lim_{n\to\infty} \frac{1}{n}\, I(X;Y) = \Psi(\gamma(\lambda), \lambda). \qquad (31)

ACKNOWLEDGMENTS

Y.D. and A.M. were partially supported by NSF grants CCF-1319979 and DMS-1106627 and the AFOSR grant FA9550-13-1-0036. E.A. was partially supported by NSF CAREER Award CCF-1552131. Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing, UC Berkeley.

Authors

Electrical Engineering, Stanford University, United States of America
Electrical Engineering and PACM, Princeton University, United States of America
Electrical Engineering and Statistics, Stanford University, United States of America