The stochastic block model (SBM) is a canonical model of networks with communities. The terminology SBM comes from the machine learning and statistics literature [1], while the model is typically called the planted partition model in theoretical computer science [2], [3], and the inhomogeneous random graphs model in the mathematical literature [4]. Although the model was defined as far back as the 80s, it has resurged in recent years due in part to the following fascinating conjecture established in [5] (and backed in [10]) from deep but non-rigorous statistical physics arguments:
Let (X,G) be drawn from SBM(n,k,a,b), i.e., X is uniformly drawn among partitions of [n] into k balanced clusters, and G is a random graph on the vertex set [n] where edges are placed independently with probability a/n inside the clusters and b/n across. Define $\mathrm{SNR}=\frac{(a-b)^{2}}{k(a+(k-1)b)}$ and say that an algorithm `detects communities' if it takes G as input and outputs $\hat{X}$ that is positively correlated with X with high probability. Then,
(i) irrespective of k, if SNR > 1, it is possible to detect communities in polynomial time, i.e., the KS threshold can be achieved efficiently;
(ii) if k ≥ 4 (k ≥ 5 in the assortative case), it is possible to detect communities information-theoretically for some SNR strictly below 1.
We recently proved part (i) of this conjecture in [6], and this paper shows that for k=5, it is indeed possible to cross the KS threshold using information theory in the disassortative case. For part (i), i.e., achieving the KS threshold efficiently, [6] relies on a linearized version of BP that can handle cycles and runs in O(n log n) time. This approach is related to spectral methods based on non-backtracking operators [7].
To cross the KS threshold information-theoretically, we rely on a non-efficient algorithm that samples a typical clustering. Upon observing a graph drawn from the SBM, the algorithm builds the set of all partitions of the n nodes that have a typical fraction of edges inside and across clusters, and then samples a partition uniformly at random from that set. Our analysis of this algorithm reveals two different regimes, which reflect two layers of refinement in the bounds on the typical set's size. In a first regime, bad clusterings (i.e., partitions of the nodes that agree with the true partition on no more than a fraction close to 1/k of the vertices) are with high probability not typical, by a union bound, and the algorithm samples only good clusterings with high probability. This allows us to cross the KS threshold at a=0, and shows that detection is then information-theoretically solvable if $b > ck\ln k + o_k(1)$, $c\in[1,2]$. The gap between the information-theoretic and KS thresholds can thus be large, since the KS threshold reads $b>k(k-1)$ in this case. However, the union bound does not recover the correct bound at b=0, where it yields a>2k, which is suboptimal. In fact, as soon as a>k, each cluster in the SBM graph has a giant component of linear size, and an algorithm that simply separates these components and randomly assigns the remaining vertices detects the communities. Of course, such an algorithm only applies to the strict case b=0.

To obtain a tighter bound in the general case, we next exploit the large number of tree-like components in the SBM graph, reaching a regime where some bad clusterings are typical but unlikely to be sampled. This shows that the algorithm succeeds with the right bound at b=0, i.e., a>k.
A. Related Literature
For the case of k=2, it was proved in [8], [9] that the KS threshold can be achieved efficiently. An alternative proof was later given in [7]. For k=2, no information-computation gap takes place, as shown with a tight converse in [10]. It was also shown in [7] that for SBMs with multiple slightly asymmetric communities, the KS threshold can be achieved, but [7] does not resolve Conjecture 1 for k≥3. Note that crossing the KS threshold with information theory demonstrates a gap between the information-theoretic and computational thresholds only under the non-formal evidence that the KS threshold is the computational limit [5]. Note also that standard clustering methods are not believed/known to detect clusters down to the KS threshold. This includes spectral methods based on the adjacency matrix or Laplacians, as well as SDPs. For standard spectral methods, a first issue is that the fluctuations in the node degrees produce high-degree nodes that prevent the eigenvectors from concentrating on the clusters. One possibility is to trim such high-degree nodes, throwing away some information, but this does not suffice to achieve the KS threshold [11].
A few papers have studied information-theoretic bounds in SBMs with a growing number of communities [12], two unbalanced communities [13], and a single community [14]. No results seemed so far to be known for the symmetric SBM and part (ii) of Conjecture 1. Shortly after the posting of this paper, [15] obtained, in an independent effort, bounds on the information-theoretic threshold that cross the KS threshold at k=5 (in the disassortative case), using moment methods. The bound in [15] does however not approach the correct threshold at b=0.
The SBM can be defined with a uniform or a Binomial model for the communities. This means that, for a probability vector p=(p1,…,pk), the communities may be drawn uniformly at random among all partitions of [n] having npi vertices in community i (with an arbitrary rounding rule on npi to obtain integers adding up to n), or each vertex may be assigned a label in [k] independently at random according to p. These are equivalent for the purpose of this paper, due to standard concentration arguments. In the case where p=(1/k,…,1/k), we say that the communities are balanced.
(X,G) is drawn under SBM(n,k,a,b) if X is a balanced n-dimensional vector with components valued in [k], and G is a random graph on the vertex set [n] where each edge $(i,j)\in\binom{[n]}{2}$ is drawn with probability $\mathbb{1}(X_i=X_j)\,a/n+\mathbb{1}(X_i\neq X_j)\,b/n$, independently of the other edges.
Note that we often talk about G being drawn under the SBM without specifying the community variables X.
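For concreteness, here is a minimal sketch (our own illustrative code, not from the paper; the function name sbm_sample is ours) of how a sample (X,G) from SBM(n,k,a,b) can be drawn:

```python
import random
from itertools import combinations

def sbm_sample(n, k, a, b, seed=0):
    """Draw (X, G) from SBM(n, k, a, b): balanced labels, sparse edge probabilities."""
    rng = random.Random(seed)
    # Uniformly random balanced community assignment of [n] into k clusters.
    X = [i % k for i in range(n)]
    rng.shuffle(X)
    # Each pair is an edge with probability a/n within clusters, b/n across.
    edges = [(u, v) for u, v in combinations(range(n), 2)
             if rng.random() < (a / n if X[u] == X[v] else b / n)]
    return X, edges

# Example in the disassortative regime (a < b):
X, G = sbm_sample(n=300, k=5, a=0.0, b=30.0)
```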
Let $x\in[k]^n$ and ε>0. We define the set of bad clusterings with respect to x by $B_\varepsilon(x)=\{y\in[k]^n : \frac{1}{n}d_*(x,y) > 1-\frac{1}{k}-\varepsilon\}$, where $d_*(x,y)$ is the minimum Hamming distance between x and any relabelling of y (i.e., any mapping of the components of y under a fixed permutation of [k]).
Relabellings need to be considered since only the partition needs to be detected and not the actual labels. It is simply convenient to work with labels.
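For small k, the relabelling-invariant distance $d_*$ and membership in $B_\varepsilon(x)$ can be computed by brute force over all k! permutations, as in the following sketch (ours, for illustration only):

```python
from itertools import permutations

def d_star(x, y, k):
    """Minimum Hamming distance between x and any relabelling of y."""
    return min(sum(xi != perm[yi] for xi, yi in zip(x, y))
               for perm in permutations(range(k)))

def is_bad(x, y, k, eps):
    """True if y agrees with x on at most a fraction ~1/k of the vertices."""
    return d_star(x, y, k) / len(x) > 1 - 1 / k - eps
```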
An algorithm $\hat{x}: 2^{\binom{[n]}{2}}\to[k]^n$ solves detection (or weak recovery) in SBM(n,k,a,b) if for some ε>0, $\mathbb{P}_{X,G}\{\hat{x}(G)\in B_\varepsilon(X)\}=o_n(1)$, where (X,G)∼SBM(n,k,a,b). Detection is solvable efficiently if the algorithm runs in polynomial time in n, and information-theoretically otherwise.
Note that if $\hat{X}$ is a randomized algorithm (i.e., it takes the graph as input and outputs various clusterings with different probabilities), and if for some ε>0, $\mathbb{P}_{X,G,\hat{X}}\{\hat{X}(G)\in B_\varepsilon(X)\}=o_n(1)$, then detection is solvable (information-theoretically).
We next present the algorithm used to detect below the KS threshold.
Typicality Sampling Algorithm
Given an n-vertex graph g and δ>0, the algorithm draws $\hat{x}_{\mathrm{typ}}(g)$ uniformly at random in
\begin{align*}
T_{\delta}(g)=\Big\{x\in\mathrm{Balanced}(n,k):\ &\sum_{i=1}^{k}\big\vert \{g_{u,v} : (u,v)\in\tbinom{[n]}{2}\ \mathrm{s.t.}\ x_{u}=i,\ x_{v}=i\}\big\vert \geq\frac{an}{2k}(1-\delta),\\
&\sum_{i,j\in[k],\, i<j}\big\vert \{g_{u,v} : (u,v)\in\tbinom{[n]}{2}\ \mathrm{s.t.}\ x_{u}=i,\ x_{v}=j\}\big\vert \leq\frac{bn(k-1)}{2k}(1+\delta)\Big\}
\end{align*}
if the SBM is assortative (i.e., a≥b); otherwise flip the direction of the above two inequalities.
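The following sketch (ours) illustrates the algorithm in the assortative case by rejection sampling: uniform proposals over balanced clusterings are accepted exactly when they fall in $T_\delta(g)$, so accepted samples are uniform over $T_\delta(g)$. It is only meant to make the definition concrete; it is of course not efficient.

```python
import random

def typicality_sample(n, k, a, b, edges, delta, rng, max_tries=10**6):
    """Draw a clustering uniformly from T_delta(g) (assortative case a >= b;
    flip both inequalities for the disassortative case)."""
    base = [i % k for i in range(n)]
    for _ in range(max_tries):
        x = base[:]
        rng.shuffle(x)                       # uniform balanced clustering proposal
        inside = sum(1 for u, v in edges if x[u] == x[v])
        across = len(edges) - inside
        if (inside >= a * n / (2 * k) * (1 - delta)
                and across <= b * n * (k - 1) / (2 * k) * (1 + delta)):
            return x                          # uniform over the typical set
    return None                               # no typical clustering found
```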
Theorem 1
Let $d:=\frac{a+(k-1)b}{k}$, assume d>1, and let $\tau=\tau_d$ be the unique solution in (0,1) of $\tau e^{-\tau}=de^{-d}$, i.e., $\tau:=\sum_{j=1}^{+\infty}\frac{j^{j-1}}{j!}(de^{-d})^{j}$. The Typicality Sampling Algorithm detects communities in SBM(n,k,a,b) if
\begin{align*}
\frac{1}{2\ln k}\left(\frac{a\ln a+(k-1)b\ln b}{k}-d\ln d\right) > 1-\frac{\tau}{d}\left(1-\frac{\tau}{2}\right).
\tag{1}\end{align*}
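As an illustrative numerical check (ours, not from the paper), $\tau_d$ can be computed by bisection and condition (1) evaluated directly; for instance, one can scan the a=0 line at k=5 for the smallest b satisfying (1):

```python
import math

def tau_of(d, tol=1e-12):
    """Unique solution tau in (0,1) of tau*exp(-tau) = d*exp(-d) for d > 1
    (t*exp(-t) is increasing on (0,1), so bisection applies)."""
    target = d * math.exp(-d)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if mid * math.exp(-mid) < target else (lo, mid)
    return (lo + hi) / 2

def condition_1(a, b, k):
    """Left-hand side minus right-hand side of (1); positive means detection."""
    d = (a + (k - 1) * b) / k
    if d <= 1:
        return float("-inf")
    tau = tau_of(d)
    xlogx = lambda x: x * math.log(x) if x > 0 else 0.0   # convention 0*log 0 = 0
    lhs = ((xlogx(a) + (k - 1) * xlogx(b)) / k - d * math.log(d)) / (2 * math.log(k))
    return lhs - (1 - (tau / d) * (1 - tau / 2))

# Smallest b (on a coarse grid) satisfying (1) at a = 0, k = 5.
b_star = min(b / 100 for b in range(130, 5000) if condition_1(0.0, b / 100, 5) > 0)
```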
The regime where, with high probability, no bad clustering is typical corresponds to a plain union bound, which is not tight in the extreme regime of b=0: it reads a>2k as opposed to a>k (the threshold for the presence of a giant component in each cluster). The correction term in τ on the right-hand side of (1) comes from the refinement described in Section III.
Defining $a_k(b)$ as the solution in a of $\frac{1}{2\ln k}\left(\frac{a\ln a+(k-1)b\ln b}{k}-d\ln d\right)=g(\tau,d)$, where $g(\tau,d):=1-\frac{\tau}{d}\left(1-\frac{\tau}{2}\right)$ denotes the right-hand side of (1), and expanding the bound in Theorem 1 gives the following.
Corollary 1
Detection is solvable
\begin{align*}
&\text{in }\mathrm{SBM}(n,k,0,b)\ \text{ if }\ b > \frac{2k\ln k}{(k-1)\ln\frac{k}{k-1}}\, g\Big(\tau,\frac{b(k-1)}{k}\Big),\tag{3}\\
&\text{in }\mathrm{SBM}(n,k,a,b)\ \text{ if }\ a > k+\Delta_{k}(b),\tag{4}
\end{align*}
where (3) is strictly stronger than the KS threshold at k=5, and where $\Delta_k(b):=a_k(b)-k$ is such that $\Delta_k(0)=0$.
Remark 2
Note that (4) approaches the optimal bound given by the presence of the giant at b=0. Note also that (3) improves significantly on the KS threshold given by b>k(k−1) at a=0. By continuity arguments, we can also cross the KS threshold for some positive values of a at k=5.
SECTION III.
Proof Technique
A first question is to estimate the likelihood that a bad clustering, i.e., one whose overlap with the true clustering is close to 1/k, belongs to the typical set. That is, the probability that a clustering which splits each of the true clusters into k groups, one assigned to each community, still manages to keep the right proportions of edges inside and across the clusters. This is unlikely to take place, but we care about the exponent of this rare event's probability.
As illustrated in Figure 1, the number of edges that are contained in the clusters of a bad clustering is roughly distributed as the sum of two Binomial random variables,
\begin{equation*}
E_{\mathrm{in}}\ \dot{\sim}\ \mathrm{Bin}\left(\frac{n^{2}}{2k^{2}}, \frac{a}{n}\right)+\mathrm{Bin}\left(\frac{(k-1)n^{2}}{2k^{2}}, \frac{b}{n}\right),
\end{equation*}
where we use $\dot{\sim}$ to emphasize that this is an approximation. Note that the expectation of the above distribution is $\frac{n}{2k}\cdot\frac{a+(k-1)b}{k}$. In contrast, the true clustering would have a distribution given by $\mathrm{Bin}\left(\frac{n^{2}}{2k},\frac{a}{n}\right)$, which has expectation $\frac{an}{2k}$. In turn, the number of edges that are crossing the clusters of a bad clustering is roughly distributed as
\begin{equation*}
E_{\mathrm{out}}\ \dot{\sim}\ \mathrm{Bin}\left(\frac{n^{2}(k-1)}{2k^{2}}, \frac{a}{n}\right)+\mathrm{Bin}\left(\frac{n^{2}(k-1)^{2}}{2k^{2}}, \frac{b}{n}\right),
\end{equation*}
which has expectation $\frac{n(k-1)}{2k}\cdot\frac{a+(k-1)b}{k}$. In contrast, the true clustering would have the above replaced by $\mathrm{Bin}\left(\frac{n^{2}(k-1)}{2k},\frac{b}{n}\right)$, with expectation $\frac{bn(k-1)}{2k}$.
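A quick Monte Carlo sketch (ours, purely illustrative): draw an SBM graph, relabel the vertices by an independent uniform balanced clustering (a "bad" clustering), and compare its number of within-cluster edges to the mean of the Binomial-sum approximation.

```python
import random
from itertools import combinations

n, k, a, b = 600, 5, 0.0, 20.0
rng = random.Random(1)

X = [i % k for i in range(n)]              # true balanced labels
rng.shuffle(X)
edges = [(u, v) for u, v in combinations(range(n), 2)
         if rng.random() < (a / n if X[u] == X[v] else b / n)]

Y = [i % k for i in range(n)]              # an independent "bad" clustering
rng.shuffle(Y)

E_in = sum(1 for u, v in edges if Y[u] == Y[v])
approx_mean = (n / (2 * k)) * (a + (k - 1) * b) / k   # mean of the Binomial sum
print(E_in, approx_mean)                   # the two numbers should be close
```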
Thus, we need to estimate the rare event that the Binomial sum deviates from its expectation. While there is a large list of bounds on Binomial tail events, the number of trials here is quadratic in n and the success probability decays linearly in n, which requires particular care to ensure tight bounds. We derive these in [6], obtaining that for a bad clustering x,
\begin{equation*}
\mathbb{P}\{x\ \mathrm{is\ typical}\} \approx\exp\left(-\frac{n}{k}A\right),
\end{equation*}
where
\begin{equation*}
A:=\frac{a+b(k-1)}{2}\ln\frac{k}{a+(k-1)b}+\frac{a}{2}\ln a+\frac{b(k-1)}{2}\ln b.
\end{equation*}
One can then use a union bound, since there are at most k^n bad clusterings, to obtain a first regime where no bad clustering is typical with high probability. This already allows us to cross the KS threshold in some regime of the parameters when k≥5. However, it does not interpolate the correct behavior of the information-theoretic bound in the extreme regime of b=0. In fact, for b=0, the union bound requires a>2k to imply no bad typical clustering with high probability, whereas as soon as a>k, an algorithm that simply separates the giant components in SBM(n,k,a,0) and assigns communities uniformly at random to the other vertices solves detection. Thus when a∈(k,2k], the union bound is loose. To remedy this, we next take into account the topology of the SBM graph.
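For intuition, the a>2k requirement at b=0 mentioned above follows from a back-of-the-envelope computation (ours), using the convention $b\ln b\to 0$ as $b\to 0$: at b=0,
\begin{align*}
A &= \frac{a}{2}\ln\frac{k}{a}+\frac{a}{2}\ln a=\frac{a}{2}\ln k,\\
k^{n}\,e^{-\frac{n}{k}A} &= \exp\Big(n\ln k-\frac{n}{k}\cdot\frac{a}{2}\ln k\Big)\longrightarrow 0
\quad\Longleftrightarrow\quad \frac{a}{2k}>1\quad\Longleftrightarrow\quad a>2k.
\end{align*}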
Since the algorithm samples a typical clustering, we only need the number of bad and typical clusterings to be small compared to the total number of typical clusterings, in expectation. Thus, we seek to better estimate the total number of typical clusterings. The main topological property of the SBM graph that we exploit is the large fraction of nodes that are in tree-like components outside of the giant. Conditioned on being on a tree, the SBM labels are distributed as in a broadcasting problem on a (Galton-Watson) tree. Specifically, for a uniformly drawn root label, each edge in the tree acts as a k-ary symmetric channel. Thus, labelling the nodes in the trees according to the above distribution and freezing the giant to the correct labels leads to a typical clustering with high probability.
We hence need to count the number T of isolated trees and the number M of edges contained in such trees in the SBM graph. This is done in a series of lemmas in our arXiv paper [6], and requires combinatorial estimates similar to those carried out for the Erdős–Rényi case [16]. The fractions are shown to concentrate around
\begin{align*}
&T/n\approx\frac{\tau}{d}\left(1-\frac{\tau}{2}\right),\tag{5}\\
&M/n\approx\frac{\tau^{2}}{2d},
\tag{6}\end{align*}
where τ is as in the theorem. This in turn gives a bound on the typical set size (see Theorem 2 below). With this bound, we can better estimate the probability of sampling a good clustering, reaching the tight bound at b=0.
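An illustrative simulation sketch (ours) of (5)-(6): draw an SBM graph, extract connected components by depth-first search, and count how many of them are trees, together with their edges.

```python
import random
from itertools import combinations
from collections import defaultdict

n, k, a, b = 2000, 5, 0.0, 2.5             # here d = (a + (k-1)b)/k = 2
rng = random.Random(0)

X = [i % k for i in range(n)]
rng.shuffle(X)
adj = defaultdict(set)
for u, v in combinations(range(n), 2):
    if rng.random() < (a / n if X[u] == X[v] else b / n):
        adj[u].add(v)
        adj[v].add(u)

seen, T, M = set(), 0, 0
for s in range(n):
    if s in seen:
        continue
    comp, stack = {s}, [s]                  # explore the component of s
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in comp:
                comp.add(w)
                stack.append(w)
    seen |= comp
    m_comp = sum(len(adj[u]) for u in comp) // 2
    if m_comp == len(comp) - 1:             # the component is an isolated tree
        T += 1                              # one more tree component
        M += m_comp                         # its edges
print(T / n, M / n)                         # compare with (5) and (6)
```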
Theorem 2
Let $T_\delta(G)$ denote the typical set for G drawn under SBM(n,k,a,b). Then, for any ε>0,
\begin{equation*}
\mathbb{P}\{\vert T_{\delta}(G)\vert < k^{(\psi-\varepsilon)n}\}=o(1),
\end{equation*}
where
\begin{equation*}
\psi:=\frac{\tau}{d}\left(1-\frac{\tau}{2}\right)+\frac{\tau^{2}}{2d\ln k}H(\nu),\qquad
\nu:=\left(\frac{a}{a+(k-1)b},\ \frac{b}{a+(k-1)b},\ldots,\ \frac{b}{a+(k-1)b}\right),
\end{equation*}
and H(⋅) is the entropy in nats.
Let G∼SBM(n,k,a,b), and let T be the number of isolated trees in G, M the number of edges in those trees, and F the number of edges in the planted trees of the largest connected component of G (i.e., the giant). We now build a typical assignment on these trees:
Pick an arbitrary node in each isolated tree, denote these by $\{v_1,\ldots,v_T\}$, and denote the set of edges contained in these trees by $\{E_1,\ldots,E_M\}$;
Assign the labels $U_1^T:=(U_{v_1},\ldots,U_{v_T})$ uniformly at random in [k]. Then broadcast each of these labels in its corresponding tree by forwarding the labels on each edge with an independent k-ary symmetric channel of flip probability $\frac{b}{a+(k-1)b}$. This means that the variables $Z_1,\ldots,Z_M$ are drawn i.i.d. from the distribution ν above on $\mathbb{F}_k:=\{0,1,\ldots,k-1\}$, and that for each edge e in the trees, the input label is forwarded by adding to it the variable $Z_e$ modulo k;
Assign every other vertex (not contained in the trees) its true community assignment. Define $Z_1^M:=(Z_1,\ldots,Z_M)$, and denote by $\hat{X}(U_1^T,Z_1^M)$ the previously defined assignment.
Note that the above gives the induced label distribution on trees in SBM(n,k,a,b), with possibly a global flip for the isolated trees. Thus, with high probability on G, as T and M grow (linearly) with n, this assignment is typical with high probability on $U_1^T,Z_1^M$:
\begin{equation*}
\mathbb{P}_{U_{1}^{T},Z_{1}^{M}}\{\hat{X}(U_{1}^{T},Z_{1}^{M})\in T_{\delta}(G)\}=1-o(1).
\tag{7}\end{equation*}
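A minimal sketch (ours) of the label broadcast on one isolated tree, as described in the construction above; the assignment $\hat{X}$ then freezes the giant to the true labels and applies this to each isolated tree.

```python
import random

def broadcast_on_tree(tree_adj, root, k, a, b, rng):
    """Uniform root label, then forward labels over each tree edge through a
    k-ary symmetric channel: Z_e ~ nu on {0,...,k-1} is added modulo k, so the
    label is kept w.p. a/(a+(k-1)b) and shifted to each other value w.p. b/(a+(k-1)b)."""
    nu = [a / (a + (k - 1) * b)] + [b / (a + (k - 1) * b)] * (k - 1)
    labels = {root: rng.randrange(k)}       # U: uniform label at the tree root
    stack = [root]
    while stack:
        u = stack.pop()
        for v in tree_adj[u]:
            if v not in labels:
                z = rng.choices(range(k), weights=nu)[0]   # Z_e ~ nu
                labels[v] = (labels[u] + z) % k            # add Z_e modulo k
                stack.append(v)
    return labels
```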
Denote by $A_{\varepsilon,M}(\nu)$ the ε-typical set for sequences of length M under the distribution ν on [k] (as defined in [17]). Define similarly $A_{\varepsilon,T}(\eta)$ for the uniform distribution η on [k]. By the AEP theorem, for any ε>0,
\begin{equation*}
\mathbb{P}\{U_{1}^{T}\in A_{\varepsilon,T}(\eta),\ Z_{1}^{M}\in A_{\varepsilon,M}(\nu)\}\rightarrow 1.
\end{equation*}
Therefore,
\begin{equation*}\begin{split}
&\mathbb{P}_{U_{1}^{T},Z_{1}^{M}}\{\hat{X}(U_{1}^{T},Z_{1}^{M})\in T_{\delta}(G)\}\\
&\quad\leq \sum_{u_{1}^{T}\in A_{\varepsilon,T}(\eta),\,z_{1}^{M}\in A_{\varepsilon,M}(\nu)}
\mathbb{1}(\hat{X}(u_{1}^{T},z_{1}^{M})\in T_{\delta}(G))\,k^{-T}k^{-M(H(\nu)-\varepsilon)}+o(1).
\end{split}\end{equation*}
Since the left-hand side is 1−o(1) by (7), and the sum on the right-hand side counts a subset of the typical clusterings, we have with high probability on G,
\begin{equation*}
\vert T_{\delta}(G)\vert \geq(1-o(1))k^{T+M(H(\nu)-\varepsilon)}.
\end{equation*}
Further, from the topological lemmas derived in our arXiv paper [6], for any ε>0, with high probability on G,
\begin{equation*}\begin{split}
&T/n\in\left[\frac{\tau}{d}\left(1-\frac{\tau}{2}\right)-\varepsilon,\ \frac{\tau}{d}\left(1-\frac{\tau}{2}\right)+\varepsilon\right],\\
&M/n\in\left[\frac{\tau^{2}}{2d}-\varepsilon,\ \frac{\tau^{2}}{2d}+\varepsilon\right].
\end{split}\end{equation*}
The claim follows from algebraic manipulations. ■
This work is partly supported by NSF CAREER Award CCF-1552131 and ARO grant W911NF-16-1-0051.