Introduction
In many data science applications, data appears in the form of large-scale graphs. For example, in social networks, vertices represent users and an edge between vertices represents friendship; in the World Wide Web, vertices are websites and edges indicate hyperlinks from one site to another; in biological systems, vertices can be proteins and edges illustrate protein-to-protein interaction. Such graphs may contain billions of vertices. In addition, edges tend to be correlated with each other since, for example, two people sharing many common friends are likely to be friends as well. How to efficiently compress such large-scale structural information to reduce the I/O and communication costs in storing and transmitting such data is a persistent challenge in the era of big data.
The problem of graph compression has been studied in different communities, following various methodologies. Several methods exploited combinatorial properties such as cliques and cuts in the graph [1], [2]. Many works targeted domain-specific graphs such as web graphs [3], biological networks [4], [5], and social networks [6]. Various representations of graphs were proposed, such as the text-based method, where the neighbour list of each vertex is treated as a "word" [7], [8], among others.
In this paper, we take an information theoretic approach to study lossless compression of graphs with vertex labels. We assume the graph is generated by some random graph model and investigate lossless compression schemes that achieve the theoretical limit, i.e., the entropy of the graph, asymptotically as the number of vertices goes to infinity. When the underlying distribution/statistics of the random graph model is known, optimal lossless compression can be achieved by methods like Huffman coding. However, in most real-world applications, the exact distribution is usually hard to obtain and the data we are given is a single realization of this distribution. This motivates us to consider the framework of universal compression, in which we assume the underlying distribution belongs to a known family of distributions and require that the encoder and the decoder should not be a function of the underlying distribution. The goal of universal compression is to design a single compression scheme that universally achieves the optimal theoretical limit, for every distribution in the family, without knowing which distribution generates the data. For this paper, we focus on the family of stochastic block models, which are widely used random graph models that capture the clustering effect in social networks. Our goal is to develop a universal graph compression scheme for a family of stochastic block models with as wide a range of parameters as possible.
How to design a computationally efficient universal compression scheme is a fundamental question in information theory. In the past several decades, a large number of universal compressors were proposed for one-dimensional sequences with fixed alphabet size, whose entropy is linear in the number of variables. Prominent results include the Laplace and Krichevsky–Trofimov (KT) compressors for i.i.d. processes [12], [13], the Lempel–Ziv compressor [14], [15] and the Burrows–Wheeler transform [16] for stationary ergodic processes, and context tree weighting [17] for finite memory processes. Many of these have been adopted in standard data compression applications such as gzip and bzip2.
For the first challenge, a natural question arises: can we convert the two-dimensional adjacency matrix of the graph into a one-dimensional sequence in some order and apply a universal compressor for the sequence? For some simple graph models such as the Erdős–Rényi graph, where each edge is generated i.i.d. with probability $p$, any ordering of the entries yields an i.i.d. sequence and this approach works. For stochastic block models, however, the entries of the adjacency matrix are correlated, and natural orderings need not produce a stationary process, so universal compressors designed for stationary sequences do not directly apply (see Section VII). A second challenge is that the entropy of a graph may have an unknown leading order, so the standard notion of universality, which normalizes the compression length by the number of variables, is no longer adequate.
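As a toy illustration of this flattening idea (a sketch under illustrative assumptions, not the scheme developed in this paper), one can list the upper triangle of an Erdős–Rényi adjacency matrix row-wise and feed it to an off-the-shelf sequence compressor; the graph size, edge probability, and the choice of zlib below are all our own assumptions.

```python
import zlib

import numpy as np

rng = np.random.default_rng(1)
n, p = 512, 0.05  # illustrative Erdős–Rényi parameters

# Sample the upper triangle and flatten it row-wise into a bit sequence.
upper = rng.random((n, n)) < p
bits = upper[np.triu_indices(n, k=1)]

# Pack the bits into bytes and apply a generic sequence compressor.
raw = np.packbits(bits).tobytes()
compressed = zlib.compress(raw, level=9)
print(f"{len(raw)} bytes -> {len(compressed)} bytes")
```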
We alleviate these challenges by designing a decomposition of the adjacency matrix into submatrices. We then show in Proposition 2 that with a carefully chosen parameter, the submatrix decomposition converts two-dimensional correlated entries into a sequence of almost i.i.d. submatrices with slowly growing alphabet size. To address the second challenge, we introduce a new definition of universality that accommodates data with an unknown leading order in its entropy expression. The new definition normalizes the compression length by the entropy of the data, requiring the ratio to tend to one, as compared to normalizing by the number of variables in the standard definition. Normalizing by the number of variables implicitly assumes that the entropy is linear in the number of variables, which need not hold for graph data.
Notation: For an integer $m \ge 1$, let $[m] := \{1, 2, \ldots, m\}$.
A. Problem Setup
1) Universal Graph Compressor:
For simplicity, we focus on simple (undirected, unweighted, no self-loop) graphs with labeled vertices in this paper, though our compression scheme and the corresponding analysis can be extended to more general graphs. Let $A_n \in \{0,1\}^{n \times n}$ denote the adjacency matrix of the graph on $n$ labeled vertices.
2) Stochastic Block Model:
A stochastic block model with $L$ communities is specified by a probability vector $\mathbf{p} = (p_1, \ldots, p_L)$ over the communities, a symmetric $L \times L$ matrix $\mathbf{Q}$, and a scaling function $f(n)$. Each vertex $i \in [n]$ is independently assigned a community label $X_i \in [L]$ according to $\mathbf{p}$; conditioned on the labels $X_1, \ldots, X_n$, each pair of vertices $i < j$ is connected by an edge independently with probability $f(n)\, Q_{X_i X_j}$.
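To fix notation, here is a minimal NumPy sampler for this model; the function name and the parameter values in the usage example are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sbm(n, p, Q, f):
    """Sample an adjacency matrix A_n: labels X_i ~ p i.i.d., and each
    pair i < j is connected independently with probability f * Q[X_i, X_j]."""
    X = rng.choice(len(p), size=n, p=p)        # community labels
    P = f * Q[np.ix_(X, X)]                    # pairwise edge probabilities
    A = np.triu((rng.random((n, n)) < P).astype(np.uint8), 1)
    return A + A.T, X

# Illustrative parameters: two communities, dense within, sparse across.
A, X = sample_sbm(1000, np.array([0.5, 0.5]),
                  np.array([[0.8, 0.2], [0.2, 0.8]]), f=1.0)
```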
3) Minimax Redundancy:
Besides universality, which only concerns the first-order term in the expected length of the compressor and the entropy of the graph, the redundancy of a universal compressor is another important metric we wish to optimize in the design of universal graph compressors. Define the redundancy of a lossless compressor $C$ as the gap $\mathsf{E}[\ell(C(A_n))] - H(A_n)$ between its expected length and the graph entropy; the minimax redundancy of a family of distributions is the worst-case redundancy over the family, minimized over all lossless compressors.
B. Literature in the Information-Theoretic Framework
Under the information-theoretic framework, lossless graph compression has been studied in [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], and [40] and lossy compression of directed graphs was studied in [41]. In [23] and [31], universal compression of unlabeled graphs (isomorphism classes) is investigated. For lossless compression of labeled graphs, [24] and [25] studied the compression of graphs generated from SBMs; however, they do not consider universal compression schemes. In [22], a universal compression algorithm is proposed for two-dimensional arrays of data by first converting the two-dimensional data into a one-dimensional sequence using a Hilbert–Peano curve and then applying the Lempel–Ziv algorithm [15] for sequences. In [26], [27], [28], [32], and [33], universal compression of tree graphs was studied. In [37], [38], [39], and [40], a minimum description length based method was proposed to infer network structures in stochastic block models. Recently, universal compression of graphs with marked edges and vertices was studied by Delgosha and Anantharam [34], [35]. They focus on the sparse graph regime, where the number of edges is of the same order as the number of vertices.
C. Contribution and Organization
The main contributions of the paper are summarized as follows. Firstly, we propose a polynomial-time graph compressor and establish its universality over the corresponding family of stochastic block models. Secondly, we present upper and lower bounds for the minimax redundancy of stochastic block models. Finally, we analyze the proposed compressor under the local weak convergence framework considered in [34] and [42]. We show that the proposed compressor achieves the same performance guarantee (BC entropy) as the compressor in [34].
The rest of the paper is organized as follows. In Section I-A, we define universality over a family of graph distributions, the stochastic block model, and the minimax redundancy of a family of distributions. We present our main result in Section II, which is a polynomial-time graph compressor that is universal for a family containing most of the non-trivial stochastic block models. We describe the proposed graph compressor in Section II-A. In Section II-B, we state the main theorems of the paper: two theorems on the universality of the proposed compressor as well as a theorem on upper and lower bounds for the minimax redundancy. In Section II-C, we implement our compressor on four benchmark graph datasets and compare its empirical performance to four competing algorithms. We illustrate the key steps in establishing universality in Section III and elaborate on the proof of each step in Section IV. In Section V, we prove upper and lower bounds for the minimax redundancy. In Section VI, we provide the second-order analysis of the expected length of our compressor and compare it to the one in [34]. In Section VII, we explain why some natural attempts to compress stochastic block models are not applicable to our problem or not universal. In Section VIII, we conclude and discuss open problems and potential future work.
Main Results
We present our compression scheme in Section II-A and state its performance guarantee in Theorems 1 and 2. The key idea in our design is a decomposition of the adjacency matrix into submatrices, which, with a carefully chosen parameter, converts the two-dimensional correlated entries in the adjacency matrix into a sequence of almost i.i.d. submatrices. We then compress the submatrices using a Krichevsky–Trofimov or Laplace compressor and generalize their length analyses from i.i.d. processes to arbitrarily correlated but identically distributed processes.
A. Algorithm: Universal Graph Compressor
In this section, we describe our universal graph compression scheme. For each integer $k \ge 1$, the compressor $C_k$ proceeds as follows.
Submatrix decomposition. Let $n' = \lfloor n/k \rfloor$ and $\tilde{n} = \lfloor n/k \rfloor k$. For $1 \le i, j \le n'$, let $\mathbf{B}_{ij}$ be the submatrix of $A_n$ formed by the rows $(i-1)k+1, (i-1)k+2, \ldots, ik$ and the columns $(j-1)k+1, (j-1)k+2, \ldots, jk$. For example, we have
$$\mathbf{B}_{12} = \begin{bmatrix} A_{1,k+1} & A_{1,k+2} & \cdots & A_{1,2k}\\ A_{2,k+1} & A_{2,k+2} & \cdots & A_{2,2k}\\ \vdots & \vdots & \ddots & \vdots \\ A_{k,k+1} & A_{k,k+2} & \cdots & A_{k,2k} \end{bmatrix}. \tag{8}$$
We then write the top-left $\tilde{n} \times \tilde{n}$ submatrix of $A_n$ in the block-matrix form
$$\begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} & \cdots & \mathbf{B}_{1,n'}\\ \mathbf{B}_{21} & \mathbf{B}_{22} & \cdots & \mathbf{B}_{2,n'}\\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{B}_{n',1} & \mathbf{B}_{n',2} & \cdots & \mathbf{B}_{n',n'} \end{bmatrix}. \tag{9}$$
Denote
$$\mathbf{B}_{\mathrm{ut}} := \mathbf{B}_{12}, \mathbf{B}_{13}, \mathbf{B}_{23}, \ldots, \mathbf{B}_{1,n'}, \ldots, \mathbf{B}_{n'-1,n'} \tag{10}$$
as the sequence of off-diagonal submatrices in the upper triangle and
$$\mathbf{B}_{\mathrm{d}} := \mathbf{B}_{11}, \mathbf{B}_{22}, \ldots, \mathbf{B}_{n',n'} \tag{11}$$
as the sequence of diagonal submatrices.
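A minimal NumPy sketch of this decomposition (the function name is ours; the orderings follow (10) and (11)):

```python
import numpy as np

def submatrix_decomposition(A, k):
    """Decompose the top-left (n'k x n'k) part of adjacency matrix A into
    k x k blocks B_ij, returning the off-diagonal upper-triangle sequence
    B_ut ordered as in (10) and the diagonal sequence B_d as in (11)."""
    n_prime = A.shape[0] // k
    block = lambda i, j: A[i*k:(i+1)*k, j*k:(j+1)*k]
    # B_12, B_13, B_23, ..., B_{1,n'}, ..., B_{n'-1,n'} (0-indexed here).
    B_ut = [block(i, j) for j in range(1, n_prime) for i in range(j)]
    # B_11, B_22, ..., B_{n',n'}.
    B_d = [block(i, i) for i in range(n_prime)]
    return B_ut, B_d
```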
Binary to $m$-ary conversion. Let $m := 2^{k^2}$. Each $k \times k$ submatrix with binary entries in the two submatrix sequences $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$ is converted into a symbol in $[m]$.

KT probability assignment. Apply the KT sequential probability assignment to the two $m$-ary sequences $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$ respectively. Given an $m$-ary sequence $x_1, x_2, \ldots, x_N$, the KT sequential probability assignment defines $N$ conditional probability distributions over $[m]$ as follows. For $j = 0, 1, 2, \ldots, N-1$, assign conditional probability
$$q_{\mathrm{KT}}(i|x^j) := q_{\mathrm{KT}}(X_{j+1} = i \mid X^j = x^j) = \frac{N_i(x^j) + 1/2}{j + m/2} \quad \text{for each } i \in [m], \tag{12}$$
where $X^j := (X_1, \ldots, X_j)$, $x^j := (x_1, x_2, \ldots, x_j)$, and $N_i(x^j) := \sum_{k=1}^{j} \mathbb{1}\{x_k = i\}$ counts the number of occurrences of symbol $i$ in $x^j$.

Adaptive arithmetic coding. With the KT sequential probability assignments, compress the two sequences $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$ separately using adaptive arithmetic coding [44] (see description in Algorithm 1). In case $k = 1$, the diagonal sequence $\mathbf{B}_{\mathrm{d}}$ becomes an all-zero sequence since we assume the graph is simple, so we only compress the off-diagonal sequence $\mathbf{B}_{\mathrm{ut}}$.

Encoding the remaining bits. The above process compresses the top-left $\tilde{n} \times \tilde{n}$ block of the adjacency matrix. For the remaining $(n-\tilde{n})\tilde{n} + \binom{n-\tilde{n}}{2}$ entries in the upper triangle of the adjacency matrix, we simply use $2\lceil \log n \rceil$ bits to encode the row and column number of each one.
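The KT update (12) reduces to maintaining symbol counts. The sketch below (our own helper, in exact rational arithmetic) returns the conditional probabilities that an adaptive arithmetic coder would consume, together with the implied joint probability $q_{\mathrm{KT}}(x^N)$; the arithmetic-coding length is within a small constant of $\log(1/q_{\mathrm{KT}}(x^N))$ bits.

```python
from fractions import Fraction

def kt_probabilities(x, m):
    """KT sequential probability assignment (12) for an m-ary sequence x
    with symbols in {0, ..., m-1}: q(i | x^j) = (N_i(x^j) + 1/2)/(j + m/2).
    Returns the sequence of conditionals and their product (the joint)."""
    counts = [0] * m
    conditionals, joint = [], Fraction(1)
    for j, symbol in enumerate(x):
        q = Fraction(2 * counts[symbol] + 1, 2 * j + m)  # (N_i + 1/2)/(j + m/2)
        conditionals.append(q)
        joint *= q
        counts[symbol] += 1
    return conditionals, joint
```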
Decompression. Given the compressed graph sequence, the decoder recovers the adjacency matrix $A_n$ by inverting each encoding step as follows.
Adaptive arithmetic decoding. With the KT sequential probability assignments defined in (12), decompress the two code sequences for $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$ separately using adaptive arithmetic decoding (see Algorithm 2). The lengths of the data sequences $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$ are $n'(n'-1)/2$ and $n'$, respectively.

$m$-ary to binary conversion. Each $m$-ary symbol in the sequence is converted to a $k^2$-bit binary number and further converted into a $k \times k$ submatrix with binary entries.

Adjacency matrix recovery. With the submatrices in $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$, recover the top-left $\tilde{n} \times \tilde{n}$ submatrix of $A_n$ in the order described in (9), (10), and (11).

Decoding the remaining bits. Recover the remaining $(n-\tilde{n})\tilde{n} + \binom{n-\tilde{n}}{2}$ entries in $A_n$ using the row and column numbers of the ones.
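Both conversion steps are fixed-width base-2 maps; a minimal sketch (function names ours):

```python
import numpy as np

def block_to_symbol(B):
    """Read a k x k binary submatrix in row-major order as a k^2-bit
    integer, i.e., a symbol in {0, ..., 2^{k^2} - 1}."""
    return int("".join(str(int(b)) for b in B.flatten()), 2)

def symbol_to_block(s, k):
    """Inverse map: expand a symbol back into a k x k binary submatrix."""
    bits = [int(b) for b in format(s, f"0{k * k}b")]
    return np.array(bits, dtype=np.uint8).reshape(k, k)
```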
Remark 1 (Complexity):
The computational complexity and the space complexity of the proposed compressor are both polynomial in the number of vertices $n$.
To see this, we consider the time and space complexity of the encoder step by step. The steps of submatrix decomposition and binary to $m$-ary conversion require a single pass over the adjacency matrix and hence take $O(n^2)$ time and space.
Remark 2:
One can check that
Remark 3:
The orders in
Remark 4 (Laplace Probability Assignment):
As an alternative to the KT sequential probability assignment, one can also use the Laplace sequential probability assignment. Given an $m$-ary sequence $x_1, x_2, \ldots, x_N$, the Laplace assignment replaces the additive constant $1/2$ in (12) by 1, assigning conditional probability $q_{\mathrm{L}}(i|x^j) = \frac{N_i(x^j) + 1}{j + m}$ for each $i \in [m]$.
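In code, the Laplace rule differs from the KT sketch above only in its pseudo-counts; a one-line helper (ours) mirroring (12):

```python
from fractions import Fraction

def laplace_probability(counts, j, m, symbol):
    """Laplace (add-one) conditional probability q_L(symbol | x^j)
    = (N_i(x^j) + 1)/(j + m), where counts holds the N_i(x^j)."""
    return Fraction(counts[symbol] + 1, j + m)
```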
B. Theoretical Performance
We now state the main result of this paper: the proposed compressor $C_k$ is universal over two broad families of stochastic block models, denoted $\mathcal{P}_1$ and $\mathcal{P}_2$ below.
Theorem 1 (Universality Over $\mathcal{P}_1$):
For every
Remark 5:
Recall that
It turns out that for
Theorem 2 (Universality Over $\mathcal{P}_2$):
For every
Remark 6:
Any compressor universal over the class
The above results establish the universality of the proposed compressor over the families $\mathcal{P}_1$ and $\mathcal{P}_2$.
Theorem 3 (Minimax Redundancy):
Let
Remark 7:
From Theorem 3, we can see that the lower bound scales as
In Section VI, we analyze the performance of the proposed compressor under the local weak convergence framework considered in [34] and [42]. It turns out that our proposed compressor achieves the optimal compression rate under the stronger universality criterion considered in that framework. Since stating this result requires several additional definitions, we defer it to Theorem 5 in Section VI.
C. Empirical Performance
We implement the proposed universal graph compressor (UGC) on four widely used benchmark graph datasets: a protein-to-protein interaction network (PPI) [46], the BlogCatalog social network (BlogCatalog) [47], the Flickr user network (Flickr) [47], and the YouTube user network (YouTube) [48]. The submatrix decomposition size $k$ is a tunable parameter of UGC. We compare UGC against the following four baseline compressors.
CSR: Compressed sparse row is a widely used sparse matrix representation format. In the experiment, we further optimize its default compressor by exploiting the fact that the graph is simple and its adjacency matrix is symmetric with binary entries (see the sketch after this list).
Ligra+: This is another powerful sparse matrix representation format [49], [50], which improves upon CSR using byte codes with run-length coding.
LZ: This is an implementation of the algorithm proposed in [22], which first transforms the two-dimensional adjacency matrix into a one-dimensional sequence using the Hilbert–Peano space-filling curve and then compresses the sequence using the Lempel–Ziv 78 algorithm [15].
PNG: The adjacency matrix of the graph is treated as a grayscale image and the PNG lossless image compressor is applied.
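For concreteness, the following toy SciPy snippet shows what the CSR baseline stores; the optimization mentioned above amounts to keeping only the index structure of the upper triangle, since all stored values equal 1 and the matrix is symmetric. The example graph is our own.

```python
import numpy as np
from scipy.sparse import csr_matrix

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=np.uint8)

# Keep only the upper triangle: values are all 1, so the index
# structure (indptr, indices) is the entire representation.
upper = csr_matrix(np.triu(A, k=1))
print(upper.indptr)   # [0 2 3 3 3]
print(upper.indices)  # [1 2 3]
```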
Main Ideas in Establishing Universality
In this section, we establish Theorems 1 and 2 in Section II-B based on four intermediate propositions. We defer the proofs of the propositions to Section IV.
A. Graph Entropy
We first calculate the entropy of the (random) graph
Proposition 1 (Graph Entropy):
Let
Then
Remark 8:
In the regime
Remark 9:
Proposition 1 can be used to calculate the entropy of the graph for certain important regimes of
B. Asymptotic i.i.d. via Submatrix Decomposition
To compress the matrix
Proposition 2 (Submatrix Decomposition):
Let
Remark 10:
In Proposition 2, we provide a more general result than what we need to prove Theorem 1. In Theorem 1, the range of
Remark 11:
Proposition 2 implies that any probability assignment that is universal for i.i.d. sequences may be used to compress the blocks, provided that the compression length analysis can be extended to asymptotically i.i.d. sequences as in Propositions 3 and 4. The reason we choose the KT/Laplace probability assignment is that it is known to achieve the minimax redundancy and leads to better finite-length performance as compared to, for example, Lempel–Ziv coding, which, while still universal, incurs larger redundancy. Moreover, the KT-based approach leads to better empirical performance, as observed in Table II.
C. Length Analysis for Correlated Sequences
Thanks to the asymptotic i.i.d. property of the submatrix decomposition, we hope to compress these submatrices as if they were independent using a Laplace probability assignment (which, recall, is universal for the class of all i.i.d. processes over a fixed alphabet).
Proposition 3 (Laplace Probability Assignment for Correlated Sequence):
Consider arbitrarily correlated
We provide a similar result for the KT probability assignment.
Proposition 4 (KT Probability Assignment for Correlated Sequence):
Consider arbitrarily correlated
D. Proof of Theorem 1
To prove Theorem 1, we first state a technical lemma, the proof of which is deferred to Section IV-A.
Lemma 1:
Let
We have
Proof of Theorem 1:
We will prove the universality of
We use the bound in Proposition 3 to establish the upper bound on the length of the code. Recall that here we compress the diagonal submatrices
For the last term in (26), we have
E. Proof of Theorem 2
Proof:
To establish the universality of
Remark 12:
When
Proof of Propositions 1–4
A. Graph Entropy
In this section, we establish Proposition 1. Towards that goal, we state and prove a stronger result on the graph entropy in Lemma 2, which will also be useful for the proof of Theorem 3 later. Proposition 1 will follow easily from Lemma 2. The proofs of Lemmas 1 and 2 are provided after the proof of Proposition 1.
Lemma 2 (Graph entropy):
Let
Then
In particular, when
When
Now, we are ready to prove Proposition 1, which is restated as follows.
Proposition 5 (Graph Entropy):
Let
Then
Proof of Proposition 1:
Given Lemma 2, it suffices to prove equation (16) in Proposition 1. We consider two different cases. Firstly, assume that
Finally, we establish Lemmas 1 and 2.
Proof of Lemma 1:
We note that
To see that
Now, consider the case of
Proof of Lemma 2:
Note that
For
Next, consider the case when
Since
Finally, consider the case when
B. Asymptotic i.i.d. via Submatrix Decomposition
We first invoke a known property of stochastic block models (see, for example, [55], [56]). We include the proof here for completeness.
Lemma 3 (Exchangeability of SBM):
Let
Proof:
Let
Now we are ready to establish Proposition 2, which is restated as follows.
Proposition 6 (Submatrix Decomposition):
Let
Proof of Proposition 2:
For any
Now, clearly
Case $f(n) = \Theta(1)$: We have
$$H(\mathbf{B}_{12}) \stackrel{(a)}{\le} H(\mathbf{B}_{12}|X_1^{2k}) + H(X_1^{2k}) = H(\mathbf{B}_{12}|X_1^{2k}) + 2kH(\mathbf{p}) \stackrel{(b)}{=} k^2 H(A_{1,k+1}|X_1, X_{k+1}) + 2kH(\mathbf{p}) \stackrel{(c)}{\le} k^2\left(\mathbf{p}^T h(\mathbf{Q})\mathbf{p} + 2\frac{\log L}{k}\right), \tag{51}$$
where (a) follows from the chain rule, (b) follows since all elements of the matrix $\mathbf{B}_{12}$ are independent given $X_1, \cdots, X_{2k}$, and (c) follows by Lemma 1. Plugging this into the right-hand side of (50), we obtain
$$\frac{H(\mathbf{B}_{\mathrm{ut}})}{\binom{n'}{2} H(\mathbf{B}_{12})} \ge \frac{\binom{n}{2}\left(\mathbf{p}^T h(\mathbf{Q})\mathbf{p} - \frac{2k\, h(\mathbf{p}^T \mathbf{Q} \mathbf{p})}{n-1}\right)}{\binom{n'}{2}\, k^2\left(\mathbf{p}^T h(\mathbf{Q})\mathbf{p} + 2\frac{\log L}{k}\right)}. \tag{52}$$
Since $k = o(n)$ and $k = \omega(1)$, we have $\binom{n'}{2}k^2 \sim \binom{n}{2}$, so (52) gives
$$\liminf_{n \to \infty} \frac{H(\mathbf{B}_{\mathrm{ut}})}{\binom{n'}{2} H(\mathbf{B}_{12})} \ge 1, \tag{53}$$
which together with (49) yields the required result.

Case $f(n) = \Omega\left(\frac{1}{n^2}\right)$, $f(n) = o(1)$: Since $\mathbf{B}_{12}$ is a matrix of $k^2$ identically distributed Bernoulli random variables, we have
$$H(\mathbf{B}_{12}) \le k^2 H(A_{1,k+1}) = k^2 h\left(f(n)\, \mathbf{p}^T \mathbf{Q} \mathbf{p}\right). \tag{54}$$
Plugging this into the RHS of (50) then yields
$$\frac{H(\mathbf{B}_{\mathrm{ut}})}{\binom{n'}{2} H(\mathbf{B}_{12})} \ge \frac{\binom{n}{2}\left(\mathbf{p}^T h(f(n)\mathbf{Q})\mathbf{p} - \frac{2k\, h(f(n)\, \mathbf{p}^T \mathbf{Q} \mathbf{p})}{n-1}\right)}{\binom{n'}{2}\, k^2\, h\left(f(n)\, \mathbf{p}^T \mathbf{Q} \mathbf{p}\right)}. \tag{55}$$
We first observe that in this parameter range, since $f(n) = o(1)$, by Lemma 1, we have
$$\mathbf{p}^T h(f(n)\mathbf{Q})\mathbf{p} = (1+o(1))\, h\left(f(n)\, \mathbf{p}^T \mathbf{Q} \mathbf{p}\right). \tag{56}$$
Finally, using that $k = o(n)$ and $\binom{n'}{2}k^2 \sim \binom{n}{2}$, we establish
$$\liminf_{n \to \infty} \frac{H(\mathbf{B}_{\mathrm{ut}})}{\binom{n'}{2} H(\mathbf{B}_{12})} \ge 1, \tag{57}$$
which together with (49) yields the required result.
C. Length of the Laplace Probability Assignment
In this section, we prove Proposition 3, which is restated as follows.
Proposition 7 (Laplace Probability Assignment for Correlated Sequence):
Consider arbitrarily correlated
Proof of Proposition 3:
The joint probability induced by the Laplace probability assignment, given in equation (18), is a classic result. Nonetheless, we provide the derivation of (18) in Appendix B.
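As a sanity check on this classic result, the closed form can be verified numerically against the chained conditionals; the sketch below assumes symbols in $\{0, \ldots, m-1\}$ and the standard identity $q_{\mathrm{L}}(x^N) = (m-1)!\, \prod_i N_i(x^N)!\, /\, (N+m-1)!$.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def laplace_joint_closed_form(x, m):
    """(m-1)! * prod_i N_i! / (N+m-1)!  -- the classic closed form."""
    num = factorial(m - 1)
    for c in Counter(x).values():
        num *= factorial(c)
    return Fraction(num, factorial(len(x) + m - 1))

def laplace_joint_sequential(x, m):
    """The same probability obtained by chaining (N_i + 1)/(j + m)."""
    counts, joint = Counter(), Fraction(1)
    for j, s in enumerate(x):
        joint *= Fraction(counts[s] + 1, j + m)
        counts[s] += 1
    return joint

x = [0, 1, 0, 2, 0]
assert laplace_joint_closed_form(x, 3) == laplace_joint_sequential(x, 3)
```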
Now, we move on to prove the expected length upper bound (19). Let us first elaborate on the relation between probability assignment and compression length. In Algorithm 1, the terms
Now we analyze the compression length of the Laplace compressor for the sequence
D. Length of the KT Probability Assignment
To prove Proposition 4, we first state a technical lemma.
Lemma 4:
For any integer
We defer the proof of Lemma 4 to Appendix A.
Remark 13:
Equivalently, consider an urn containing a known number of balls with
Now, we are ready to prove Proposition 4, which is restated as follows.
Proposition 8 (KT Probability Assignment for Correlated Sequence):
Consider arbitrarily correlated
Proof of Proposition 4:
The joint probability induced by the KT probability assignment, given in equation (20), is a classic result. Nonetheless, we provide the derivation of (20) in Appendix B.
Now, we move on to prove the expected length upper bound (21). In this proof, we define a generalized form of the factorial function. Let
Minimax Redundancy Analysis
So far, we have proved the universality of our proposed algorithms. For this, we showed that the expected length of the proposed compressor matches the first-order term in the graph entropy. In this section, we study the redundancy of the proposed compressor $C_k$, i.e., the gap between its expected length and the graph entropy.
Theorem 4 (Minimax Redundancy):
Let
Towards that goal, we first state a proposition that upper bounds the redundancy of compressor $C_k$.
Proposition 9:
For every
If $f(n) = o(1)$ and $f(n) = \Omega\left(\frac{1}{n^{2-\epsilon}}\right)$, then the expected length of the compressor $C_k$ defined in Section II-A is upper bounded as
$$\mathsf{E}[\ell(C_k(A_n))] - H(A_n) \le \binom{n}{2} f(n)\left(\mathbf{p}^T\mathbf{Q}\mathbf{p}\log\left(\frac{1}{\mathbf{p}^T\mathbf{Q}\mathbf{p}}\right) - \mathbf{p}^T\mathbf{Q}^*\mathbf{p}\right) + o(n^2 f(n)) = o(H(A_n)), \tag{63}$$
where $\mathbf{Q}^*$ denotes an $L \times L$ matrix whose $(i,j)$ entry is $Q_{ij}\log\left(\frac{1}{Q_{ij}}\right)$ when $Q_{ij} \ne 0$ and 0 when $Q_{ij} = 0$.

If $f(n) = \Theta(1)$, then the expected length of the compressor $C_k$ defined in Section II-A is upper bounded as
$$\mathsf{E}[\ell(C_k(A_n))] - H(A_n) \le H(\mathbf{p})\frac{n^2}{k} + o\left(\frac{n^2}{k}\right) = o(H(A_n)), \tag{64}$$
where $H(\mathbf{p}) = \sum_{i=1}^{L} p_i \log\left(\frac{1}{p_i}\right)$.
Because
Proof of Theorem 3:
Firstly, we prove the lower bound for the minimax redundancy. The proof essentially follows from the redundancy-capacity theorem [57]. To start with, we will define a vector
We now move on to evaluate the mutual information
Now, we prove the upper bound for the minimax redundancy. Let
Proof of Proposition 9:
We discuss the redundancy in two cases: (1) $f(n) = o(1)$ and $f(n) = \Omega\left(\frac{1}{n^{2-\epsilon}}\right)$; (2) $f(n) = \Theta(1)$.
First assume
Now, we will upper bound the redundancy of
Now assume
Performance Under the Local Weak Convergence Framework
In this section, we analyze the performance of the proposed compressor under the local weak convergence framework introduced in [34] and [42]. We focus on a subfamily of SBM in the sparse regime, where the number of edges grows linearly in the number of vertices.
We first introduce some basic definitions of rooted graphs in Subsection VI-A. Then, we define the local weak convergence of graphs and derive the local weak convergence limit of the subfamily of stochastic block models in Subsection VI-B. Finally, we review the definition of BC entropy in Subsection VI-C and state the performance guarantee of our compression algorithm in Subsection VI-D.
A. Basic Definitions on Rooted Graphs
Let
For a sequence of graphs
B. Local Weak Convergence
In this section, we want to study the local weak limit of a sequence of graphs generated from stochastic block models. However, we notice that the definition for rooted neighbourhood distribution given in (78) is limited to deterministic graphs. Therefore, we first extend this definition to the case of random graphs. Suppose we have a random graph
To state the limiting distribution, we need to define the Galton–Watson tree probability distribution on rooted trees
Remember that the total variation distance between two probability measures $\mu$ and $\nu$ defined on a common measurable space is $d_{\mathrm{TV}}(\mu, \nu) := \sup_{A} |\mu(A) - \nu(A)|$, where the supremum is over measurable sets $A$.
Proposition 10:
Let
The proof of Proposition 10 follows a similar idea as the proof of Proposition 2 in [53]. The key idea essentially follows from the fact that
To upper bound the total variation distance between these two distributions, we first review the definition of coupling between two probability distributions. A coupling of two probability distributions $\mu$ and $\nu$ is a pair of random variables $(X, Y)$ defined on a common probability space such that $X \sim \mu$ and $Y \sim \nu$. Any coupling yields the bound $d_{\mathrm{TV}}(\mu, \nu) \le \mathsf{P}(X \ne Y)$.
To formally state the proof of Proposition 10, we need a few more definitions. Let
Lemma 5:
If $G_{r-1} = T_{r-1}$; $Y_u = Z_u$ for every $u \in \partial G_{r-1}$; and $C_r$ and $D_r$ hold
Proof of Lemma 5:
First of all, events
Next, we define a set of auxiliary events in order to complete the proof of our induction step. For any
Lemma 6:
For all
Proof of Lemma 6:
First of all, remember that for two random variables
For the second part, on
Lemma 7:
For any
Proof:
For the first claim, fix
Moreover, we state a lemma to bound the total variation distance between the Binomial distribution and the Poisson distribution.
Lemma 8 (Lemma 5 in [53]):
If
Proof of Proposition 10:
For any random variable
Next, let us analyze the distribution of
With Proposition 10, we are ready to establish the local weak convergence of the stochastic block model.
Proposition 11 (Local weak limit of sparse SBMs):
Consider a sparse stochastic block model
Remark 14:
When
Proof of Proposition 11:
We want to show that for any uniformly continuous and bounded function
C. BC Entropy
In this section, we review the notion of BC entropy introduced in [42], which is shown to be the fundamental limit of universal lossless compression for a certain graph family [66].
For a Polish space
The following lemma states the BC entropy of the Galton–Watson tree distribution.
Lemma 9 (Corollary 1.4 of [42]):
The BC entropy of the Galton–Watson tree distribution
D. Achieving BC Entropy in the Sparse Regime
With the lemma above, we can give a performance guarantee for our algorithm in terms of the BC entropy. It is a theorem analogous to Proposition 1 in [66].
Theorem 5:
Let
Proof:
By our proof of Proposition 9, we have
Theorem 5 shows that in the sparse regime
Comparison to Alternative Methods
In this section, we consider three alternative compression methods which are natural attempts at designing universal graph compressors for stochastic block models. We demonstrate why they are not applicable to our problem or not universal in the same sense as the proposed algorithm.
One intuitive attempt is to convert the entries in the adjacency matrix into a one-dimensional sequence according to some ordering and then apply a universal compressor to the sequence. Let us take a closer look at the correlation among entries in the adjacency matrix and explain why existing universal sequence compressors developed for stationary processes may not be immediately applicable for certain orderings of the entries.
For example, consider listing the bits in the upper triangle row-wise (i.e., first listing the bits in the first row, followed by the bits in the second row, and so on, ending with the single bit $A_{n-1,n}$ in row $n-1$).
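A quick simulation (with illustrative parameters of our choosing, not from the paper) makes the non-stationarity visible: adjacent bits within a row share a vertex and are positively correlated when the community sizes are unequal, while bits straddling a row boundary share no vertex and are nearly uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_in, p_out = 400, 0.8, 0.2                 # illustrative two-community SBM
labels = rng.choice(2, size=n, p=[0.7, 0.3])   # unequal community sizes

P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
A = np.triu((rng.random((n, n)) < P).astype(int), 1)
A = A + A.T

# Adjacent bits within a row share vertex i.
within = np.array([(A[i, j], A[i, j + 1])
                   for i in range(n - 1) for j in range(i + 1, n - 1)])
# Bits straddling a row boundary share no vertex.
across = np.array([(A[i, n - 1], A[i + 1, i + 2]) for i in range(n - 3)])

print(np.corrcoef(within.T)[0, 1])  # noticeably positive
print(np.corrcoef(across.T)[0, 1])  # close to zero
```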
Another potential approach is to first recover the community labels of all vertices, which allows one to rearrange the entries in the adjacency matrix into i.i.d. groups, and then apply a universal sequence compressor to each group of entries. There are two main issues with this approach. First, even when the exact recovery of labels is possible, to the best of our knowledge, all algorithms for community detection require knowledge of the graph parameters such as the number of communities and/or the edge probabilities [67]. Second, it is known that exact recovery is information-theoretically impossible when the edge probabilities are $o(\log n / n)$, whereas we target universality over families that include much sparser graphs.
Another approach is based on run-length coding for the adjacency matrix, which encodes the lengths of consecutive runs of identical bits.
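A minimal sketch of such a run-length encoder (our own illustration):

```python
def run_length_encode(bits):
    """Encode a binary sequence as its first bit plus the lengths of
    maximal runs of identical bits."""
    runs, count = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return bits[0], runs

# A sparse row of an adjacency matrix is dominated by long zero-runs.
print(run_length_encode([0, 0, 0, 1, 0, 0, 1, 1]))  # (0, [3, 1, 2, 2])
```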
Concluding Remarks
In this paper, we constructed a compressor that is universal for the class of stochastic block models with connection probabilities ranging from $\Omega(1/n^2)$ up to $O(1)$.
ACKNOWLEDGMENT
The authors would like to thank the associate editor and the anonymous reviewers for their invaluable contributions and constructive suggestions, which have significantly enhanced the quality and clarity of this article. The author Chi Wang would like to thank Richard Peng for his suggestions on the writing. The author Lele Wang would like to thank Emmanuel Abbe and Tsachy Weissman for stimulating discussions in the initial phase of the work. She is grateful to Young-Han Kim and Ofer Shayevitz for their interest and encouragement in this result.
Appendix A
Proof of Lemma 4
Let
Appendix B
Joint Distributions Induced by Laplace and KT Probability Assignments
In this section, we re-derive the well-known joint distributions of the Laplace and KT probability assignments for completeness. First, we consider the Laplace probability assignment. Recall that we defined
Now, we consider the KT probability assignment. Recall that