Universal Graph Compression: Stochastic Block Models


Abstract:
Motivated by the prevalent data science applications of processing large-scale graph data such as social networks and biological networks, this paper investigates lossless compression of data in the form of a labeled graph. In particular, we consider a widely used random graph model, the stochastic block model (SBM), which captures the clustering effects in social networks. An information-theoretic universal compression framework is applied, in which one aims to design a single compressor that achieves the asymptotically optimal compression rate for every SBM distribution, without knowing the parameters of the SBM. Such a graph compressor is proposed in this paper; it universally achieves the optimal compression rate with polynomial time complexity for a wide class of SBMs. Existing universal compression techniques are developed mostly for stationary ergodic one-dimensional sequences, whereas the adjacency matrix of an SBM has complex two-dimensional correlations. The challenge is alleviated through a carefully designed transform that converts two-dimensional correlated data into almost i.i.d. submatrices. The sequence of submatrices is then compressed by a Krichevsky–Trofimov compressor, whose length analysis is generalized to identically distributed but arbitrarily correlated sequences. On four benchmark graph datasets, the compressed files from competing algorithms take 2.4 to 27 times the space needed by the proposed scheme.
Published in: IEEE Transactions on Information Theory ( Volume: 70, Issue: 2, February 2024)
Page(s): 1473 - 1497
Date of Publication: 12 September 2023

SECTION I.

Introduction

In many data science applications, data appears in the form of large-scale graphs. For example, in social networks, vertices represent users and an edge between vertices represents friendship; in the World Wide Web, vertices are websites and edges indicate the hyperlinks from one site to the other; in biological systems, vertices can be proteins and edges illustrate protein-to-protein interaction. Such graphs may contain billions of vertices. In addition, edges tend to be correlated with each other since, for example, two people sharing many common friends are likely to be friends as well. How to efficiently compress such large-scale structural information to reduce the I/O and communication costs in storing and transmitting such data is a persisting challenge in the era of big data.

The problem of graph compression has been studied in different communities, following various methodologies. Several methods exploited combinatorial properties such as cliques and cuts in the graph [1], [2]. Many works targeted domain-specific graphs such as web graphs [3], biological networks [4], [5], and social networks [6]. Various representations of graphs were proposed, such as the text-based method, where the neighbour list of each vertex is treated as a "word" [7], [8], and the $k^2$-tree method, where the adjacency matrix is recursively partitioned into $k^2$ equal-size submatrices [9]. Succinct graph representations that enable certain types of fast computation, such as adjacency queries or vertex degree queries, were also widely studied [10]. We refer the readers to [11] for a survey on lossless graph compression and space-efficient graph representations.

In this paper, we take an information-theoretic approach to study lossless compression of graphs with vertex labels. We assume the graph is generated by some random graph model and investigate lossless compression schemes that achieve the theoretical limit, i.e., the entropy of the graph, asymptotically as the number of vertices goes to infinity. When the underlying distribution/statistics of the random graph model is known, optimal lossless compression can be achieved by methods like Huffman coding. However, in most real-world applications, the exact distribution is usually hard to obtain and the data we are given is a single realization of this distribution. This motivates us to consider the framework of universal compression, in which we assume the underlying distribution belongs to a known family of distributions and require that neither the encoder nor the decoder be a function of the underlying distribution. The goal of universal compression is to design a single compression scheme that universally achieves the optimal theoretical limit, for every distribution in the family, without knowing which distribution generates the data. In this paper, we focus on the family of stochastic block models, which are widely used random graph models that capture the clustering effect in social networks. Our goal is to develop a universal graph compression scheme for a family of stochastic block models with as wide a range of parameters as possible.

How to design a computationally efficient universal compression scheme is a fundamental question in information theory. In the past several decades, a large number of universal compressors were proposed for one-dimensional sequences with fixed alphabet size, whose entropy is linear in the number of variables. Prominent results include the Laplace and Krichevsky–Trofimov (KT) compressors for i.i.d. processes [12], [13], Lempel–Ziv compressor [14], [15] and Burrows–Wheeler transform [16] for stationary ergodic processes, and context tree weighting [17] for finite memory processes. Many of these have been adopted in standard data compression applications such as compress, gzip, GIF, TIFF, and bzip2. Despite these exciting developments, existing universal compression techniques fall short of establishing optimality results for graph data due to the following challenges. Firstly, graph data generated from a stochastic block model has non-stationary two-dimensional correlations, so existing techniques do not immediately apply here. Secondly, in many practical applications, where the graph is sparse, the entropy of the graph may be sublinear in the number of entries in the adjacency matrix.

For the first challenge, a natural question arises: can we convert the two-dimensional adjacency matrix of the graph into a one-dimensional sequence in some order and apply a universal compressor to the sequence? For some simple graph models such as the Erdős–Rényi graph, where each edge is generated i.i.d. with probability $p$, this would indeed work. For more complex graph models, including stochastic block models, it is unclear whether there is an ordering of the entries that results in a stationary process. We will show in Section VII that several orderings, including row-by-row, column-by-column, and diagonal-by-diagonal, fail to produce a stationary process.

We alleviate these challenges by designing a decomposition of the adjacency matrix into submatrices. We then show in Proposition 2 that, with a carefully chosen parameter, the submatrix decomposition converts two-dimensional correlated entries into a sequence of almost i.i.d. submatrices with slowly growing alphabet size. To address the second challenge, we introduce a new definition of universality that accommodates data with an unknown leading order in its entropy expression. The new universality normalizes the compression length by the entropy of the data, requiring the ratio to tend to one, as compared to normalizing by the number of variables as in the standard definition. Normalizing by the number of variables implicitly assumes that the entropy is linear in $n$.

Notation: For an integer $n$, let $[n]=\{1,2,\ldots,n\}$. Let $\log(\cdot)=\log_2(\cdot)$. We follow the standard order notation: $f(n)=O(g(n))$ if $\lim_{n\to\infty} \frac{|f(n)|}{g(n)} < \infty$; $f(n)=\Omega(g(n))$ if $\lim_{n\to\infty} \frac{f(n)}{g(n)} > 0$; $f(n)=\Theta(g(n))$ if $f(n)=O(g(n))$ and $f(n)=\Omega(g(n))$; $f(n)=o(g(n))$ if $\lim_{n\to\infty} \frac{f(n)}{g(n)} = 0$; $f(n)=\omega(g(n))$ if $\lim_{n\to\infty} \frac{|f(n)|}{|g(n)|} = \infty$; and $f(n)\sim g(n)$ if $\lim_{n\to\infty} \frac{f(n)}{g(n)} = 1$. For $0\le p\le 1$, let $h(p)=-p\log(p)-(1-p)\log(1-p)$ denote the binary entropy function. For a matrix $W$ with entries in $[0,1]$, let $h(W)$ be a matrix of the same dimension whose $(i,j)$ entry is $h(W_{ij})$.
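To make the notation concrete, here is a minimal Python sketch (ours, not from the paper; the helper name `h` is hypothetical) of the binary entropy function $h(p)$ and its entrywise extension $h(W)$:

```python
import numpy as np

def h(p):
    """Binary entropy in bits; by the usual convention h(0) = h(1) = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    m = (p > 0) & (p < 1)
    out[m] = -p[m] * np.log2(p[m]) - (1 - p[m]) * np.log2(1 - p[m])
    return out

W = np.array([[0.5, 0.1],
              [0.1, 0.5]])
print(h(W))      # entrywise: h(W)_{ij} = h(W_{ij})
print(h(0.11))   # scalar use, ~0.4999 bits
```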

A. Problem Setup

1) Universal Graph Compressor:

For simplicity, we focus on simple (undirected, unweighted, no self-loop) graphs with labeled vertices in this paper, but our compression scheme and the corresponding analysis can be extended to more general graphs. Let $\mathcal{A}_n$ be the set of all labeled simple graphs on $n$ vertices. Let $\{0,1\}^i$ be the set of binary sequences of length $i$, and set $\{0,1\}^* = \bigcup_{i=0}^{\infty}\{0,1\}^i$. A lossless graph compressor $C:\mathcal{A}_n\to\{0,1\}^*$ is a one-to-one function that maps a graph to a binary sequence. Let $\ell(C(A_n))$ denote the length of the output sequence. When $A_n$ is generated from a distribution, it is known that the entropy $H(A_n)$ is a fundamental lower bound on the expected length of any lossless compressor [18, Theorem 10.5]:

$$H(A_n) - \log\bigl(e(H(A_n)+1)\bigr) \le E[\ell(C(A_n))], \tag{1}$$
and therefore
$$\liminf_{n\to\infty}\frac{E[\ell(C(A_n))]}{H(A_n)} \ge 1.$$
Thus, a graph compressor is said to be universal for the family of distributions $\mathcal{P}$ if for all distributions $P\in\mathcal{P}$ and $A_n\sim P$, we have
$$\limsup_{n\to\infty}\frac{E[\ell(C(A_n))]}{H(A_n)} = 1. \tag{2}$$

2) Stochastic Block Model:

A stochastic block model $\mathrm{SBM}(n,L,p,W)$ defines a probability distribution over $\mathcal{A}_n$. Here $n$ is the number of vertices and $L$ is the number of communities. Each vertex $i\in[n]$ is associated with a community assignment $X_i\in[L]$. The length-$L$ column vector $p=(p_1,p_2,\ldots,p_L)^T$ is a probability distribution over $[L]$, where $p_i$ indicates the probability that any vertex is assigned community $i$. $W$ is an $L\times L$ symmetric matrix, where $W_{ij}$ represents the probability of having an edge between a vertex with community assignment $i$ and a vertex with community assignment $j$. We say $A_n\sim\mathrm{SBM}(n,L,p,W)$ if the community assignments $X_1,X_2,\ldots,X_n$ are generated i.i.d. according to $p$ and, for every pair $1\le i<j\le n$, an edge is generated between vertex $i$ and vertex $j$ with probability $W_{X_i,X_j}$. In other words, in the adjacency matrix $A_n$ of the graph, $A_{ij}\sim\mathrm{Bern}(W_{X_i,X_j})$ for $i<j$; the diagonal entries $A_{ii}=0$ for all $i\in[n]$; and $A_{ij}=A_{ji}$ for $i>j$. We write $W=f(n)Q$, where $Q$ is an $L\times L$ symmetric matrix with $\max_{i,j}Q_{ij}=\Theta(1)$. We assume all entries in $p$ are non-zero constants and $L=\Theta(1)$. Furthermore, to make sure that the model is non-trivial, we assume that all entries in $W$ are at most 1 and at least one entry is strictly less than 1. Also, because we can always scale the function $f(n)$, we assume without loss of generality that $\max_{i,j}Q_{ij}<1$. We will consider two families of stochastic block models: for $0<\epsilon<1$,

$$\mathcal{P}_1(\epsilon): \quad \mathrm{SBM}(n,L,p,W) \text{ with } f(n)=O(1),\ f(n)=\Omega\left(\frac{1}{n^{2-\epsilon}}\right), \tag{3}$$
$$\mathcal{P}_2(\epsilon): \quad \mathrm{SBM}(n,L,p,W) \text{ with } f(n)=o(1),\ f(n)=\Omega\left(\frac{1}{n^{2-\epsilon}}\right). \tag{4}$$
Note that the edge probability $\frac{1}{n^2}$ is the threshold for a random graph to contain an edge with high probability [19]. Thus, the family $\mathcal{P}_1(\epsilon)$ covers most non-trivial SBM graphs. Clearly, $\mathcal{P}_2(\epsilon)$ is a strict subset of $\mathcal{P}_1(\epsilon)$, as it does not contain the constant regime $f(n)=1$.
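To make the model concrete, the following minimal Python sketch (ours, not from the paper; `sample_sbm` is a hypothetical helper) samples an adjacency matrix exactly as defined above: i.i.d. labels from $p$, then independent $\mathrm{Bern}(W_{X_i,X_j})$ edges in the upper triangle:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sbm(n, p, W):
    """Sample A_n ~ SBM(n, L, p, W); W = f(n) * Q in the paper's notation."""
    L = len(p)
    X = rng.choice(L, size=n, p=p)            # i.i.d. community labels
    P = W[np.ix_(X, X)]                       # P[i, j] = W_{X_i, X_j}
    U = rng.random((n, n))
    A = np.triu(U < P, k=1).astype(np.uint8)  # independent upper-triangular edges
    return A + A.T                            # symmetric, zero diagonal

A = sample_sbm(6, p=np.array([0.5, 0.5]),
               W=np.array([[0.9, 0.1],
                           [0.1, 0.9]]))
```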

3) Minimax Redundancy:

Besides universality, which only concerns the first-order term in the expected length of the compressor and the entropy of the graph, the redundancy of a universal compressor is another important metric we wish to optimize in the design of universal graph compressors. Define the redundancy of a lossless compressor $C$ for $A_n\sim P$ as

$$R(C,A_n) = E[\ell(C(A_n))] - H(A_n). \tag{5}$$
Then the minimax redundancy over a distribution family $\mathcal{P}$ is defined as
$$R^*_n(\mathcal{P}) = \inf_C\,\sup_{P\in\mathcal{P}} R(C,A_n). \tag{6}$$
The minimax redundancy represents the lowest achievable redundancy of any lossless compressor for the family $\mathcal{P}$. Define the family of SBMs in regime $f(n)$ with maximum entry $Q_{\max}$ in $Q$ as
$$\mathcal{P}_3(f(n),Q_{\max}): \quad \mathrm{SBM}(n,L,p,f(n)Q) \text{ with } Q_{ij}\le Q_{\max} \text{ for all } i,j\in[L]. \tag{7}$$
We comment that if $f(n)$ satisfies $f(n)=o(1)$ and $f(n)=\Omega\left(\frac{1}{n^{2-\epsilon}}\right)$ for some $0<\epsilon<1$, then $\mathcal{P}_3(f(n),Q_{\max})$ is a subset of $\mathcal{P}_2(\epsilon)$ with fixed regime $f(n)$ and entries in $Q$ upper bounded by $Q_{\max}$. Hence, it is also a subset of $\mathcal{P}_1(\epsilon)$. In this work, we will study the minimax redundancy over the family $\mathcal{P}_3(f(n),Q_{\max})$ instead of $\mathcal{P}_1(\epsilon)$ or $\mathcal{P}_2(\epsilon)$. The reason is that both the expected length of the compressor and the entropy of the graph scale with $f(n)$ and $Q_{\max}$. By fixing $f(n)$ and $Q_{\max}$ instead of taking the supremum over them, we can better understand the minimax redundancy of SBMs in each regime.

B. Literature in the Information-Theoretic Framework

Under the information-theoretic framework, lossless graph compression has been studied in [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], and [40], and lossy compression of directed graphs was studied in [41]. In [23] and [31], universal compression of unlabeled graphs (isomorphism classes) is investigated. For lossless compression of labeled graphs, [24] and [25] studied the compression of graphs generated from SBMs; however, they do not consider universal compression schemes. In [22], a universal compression algorithm is proposed for two-dimensional arrays of data by first converting the two-dimensional data into a one-dimensional sequence using a Hilbert–Peano curve and then applying the Lempel–Ziv algorithm [15] for sequences. In [26], [27], [28], [32], and [33], universal compression of tree graphs was studied. In [37], [38], [39], and [40], a minimum description length based method was proposed to infer network structures in stochastic block models. Recently, universal compression of graphs with marked edges and vertices was studied by Delgosha and Anantharam [34], [35]. They focus on the sparse graph regime, where the number of edges is of the same order as the number of vertices $n$. They employ the framework of local weak convergence, which provides a technique to view a sequence of graphs as a sequence of distributions on neighbourhood structures. Built on this framework, they propose an algorithm that compresses graphs by describing the local neighbourhood structures. Moreover, they introduce a universality/optimality criterion through a notion of entropy for graph sequences under the local weak convergence framework, known as the Bordenave–Caputo (BC) entropy [42]. In [36] and [43], a sparse graphon framework is proposed for graphs with edge numbers in $\Theta(n^\alpha)$ for $1<\alpha\le 2$. A new universality criterion is introduced and the corresponding universal compressor is proposed. The two universality criteria in [34], [35], [36], and [43] are both stronger than the one used in this paper: they require the asymptotic length of the compressor to match the constants in both the first- and second-order terms of the Shannon entropy, whereas the universality criterion we use only requires matching the first-order term. As a consequence of the stronger criteria, the compressors in [34], [35], [36], and [43] are universal over random graph families with a smaller range of edge numbers. Moreover, the algorithm in [36] and [43] is a combination of two different compressors and has no polynomial-time implementation. The first compressor is universal over a family of graphs with $\Theta(n)$ edges that possess the so-called local weak convergence property. The second compressor is universal over the family of random graphs generated from sparse graphons with $\Theta(n^\alpha)$ edges for $1<\alpha<2$. We note that the sparse graphon framework includes stochastic block models with $f(n)=\omega\left(\frac{1}{n}\right)$ and $f(n)=o(1)$ as a special case. In comparison, the compressor we propose is a single polynomial-time algorithm that applies to graphs with edge numbers in $\Theta(n^\alpha)$ for every $0<\alpha\le 2$. In Section VI, we evaluate the proposed compressor under the criterion in [34] for the family of stochastic block models. The proposed compressor achieves the same universality performance in terms of BC entropy.

C. Contribution and Organization

The main contributions of the paper are summarized as follows. Firstly, we propose a polynomial-time graph compressor and establish its universality over the corresponding family of stochastic block models. Secondly, we present upper and lower bounds for the minimax redundancy of stochastic block models. Finally, we analyze the proposed compressor under the local weak convergence framework considered in [34] and [42]. We show that the proposed compressor achieves the same performance guarantee (BC entropy) as the compressor in [34].

The rest of the paper is organized as follows. In Section I-A, we defined universality over a family of graph distributions, the stochastic block models, and the minimax redundancy of a family of distributions. We present our main result in Section II, which is a polynomial-time graph compressor that is universal for a family containing most of the non-trivial stochastic block models. We describe the proposed graph compressor in Section II-A. In Section II-B, we state the main theorems of the paper: two theorems on the universality of the proposed compressor, as well as a theorem on upper and lower bounds for the minimax redundancy. In Section II-C, we implement our compressor on four benchmark graph datasets and compare its empirical performance to four competing algorithms. We illustrate the key steps in establishing universality in Section III and elaborate on the proof of each step in Section IV. In Section V, we prove upper and lower bounds for the minimax redundancy. In Section VI, we provide the second-order analysis of the expected length of our compressor and compare it to the one in [34]. In Section VII, we explain why some natural attempts to compress stochastic block models are not applicable to our problem or are not universal. In Section VIII, we summarize our results and discuss open problems and potential future work.

SECTION II.

Main Results

We present our compression scheme in Section II-A and state its performance guarantee in Theorems 1 and 2. The key idea in our design is a decomposition of the adjacency matrix into submatrices, which, with a carefully chosen parameter, converts the two-dimensional correlated entries in the adjacency matrix into a sequence of almost i.i.d. submatrices. We then compress the submatrices using a Krichevsky–Trofimov or Laplace compressor and generalize their length analyses from i.i.d. processes to arbitrarily correlated but identically distributed processes.

A. Algorithm: Universal Graph Compressor

In this section, we describe our universal graph compression scheme. For each integer $k\le n$, the graph compressor $C_k:\mathcal{A}_n\to\{0,1\}^*$ is defined as follows.

  • Submatrix decomposition. Let $n'=\lfloor n/k\rfloor$ and $\tilde{n}=\lfloor n/k\rfloor k$. For $1\le i,j\le n'$, let $B_{ij}$ be the submatrix of $A_n$ formed by rows $(i-1)k+1,(i-1)k+2,\ldots,ik$ and columns $(j-1)k+1,(j-1)k+2,\ldots,jk$. For example, we have

    $$B_{12} = \begin{pmatrix} A_{1,k+1} & A_{1,k+2} & \cdots & A_{1,2k} \\ A_{2,k+1} & A_{2,k+2} & \cdots & A_{2,2k} \\ \vdots & \vdots & \ddots & \vdots \\ A_{k,k+1} & A_{k,k+2} & \cdots & A_{k,2k} \end{pmatrix}. \tag{8}$$

    We then write the top-left $\tilde{n}\times\tilde{n}$ submatrix of $A_n$ in block-matrix form as

    $$\begin{pmatrix} B_{11} & B_{12} & \cdots & B_{1,n'} \\ B_{21} & B_{22} & \cdots & B_{2,n'} \\ \vdots & \vdots & \ddots & \vdots \\ B_{n',1} & B_{n',2} & \cdots & B_{n',n'} \end{pmatrix}. \tag{9}$$

    Denote by
    $$B_{\mathrm{ut}} := B_{12}, B_{13}, B_{23}, \ldots, B_{1,n'}, \ldots, B_{n'-1,n'} \tag{10}$$
    the sequence of off-diagonal submatrices in the upper triangle, and by
    $$B_{\mathrm{d}} := B_{11}, B_{22}, \ldots, B_{n',n'} \tag{11}$$
    the sequence of diagonal submatrices.

  • Binary to $m$-ary conversion. Let $m:=2^{k^2}$. Each $k\times k$ submatrix with binary entries in the two submatrix sequences $B_{\mathrm{ut}}$ and $B_{\mathrm{d}}$ is converted into a symbol in $[m]$.

  • KT probability assignment. Apply the KT sequential probability assignment to the two $m$-ary sequences $B_{\mathrm{ut}}$ and $B_{\mathrm{d}}$ separately. Given an $m$-ary sequence $x_1,x_2,\ldots,x_N$, the KT sequential probability assignment defines $N$ conditional probability distributions over $[m]$ as follows. For $j=0,1,2,\ldots,N-1$, assign conditional probability

    $$q_{\mathrm{KT}}(i|x^j) := q_{\mathrm{KT}}(X_{j+1}=i|X^j=x^j) = \frac{N_i(x^j)+1/2}{j+m/2} \quad \text{for each } i\in[m], \tag{12}$$

    where $X^j:=(X_1,\ldots,X_j)$, $x^j:=(x_1,x_2,\ldots,x_j)$, and $N_i(x^j):=\sum_{k=1}^{j}1\{x_k=i\}$ counts the number of occurrences of symbol $i$ in $x^j$.

  • Adaptive arithmetic coding. With the KT sequential probability assignments, compress the two sequences $B_{\mathrm{ut}}$ and $B_{\mathrm{d}}$ separately using adaptive arithmetic coding [44] (see the description in Algorithm 1). In the case $k=1$, the diagonal sequence $B_{\mathrm{d}}$ is an all-zero sequence since we assume the graph is simple, so we only compress the off-diagonal sequence $B_{\mathrm{ut}}$.

  • Encoding the remaining bits. The above process compresses the top-left $\tilde{n}\times\tilde{n}$ block of the adjacency matrix. For the remaining $(n-\tilde{n})\tilde{n}+\binom{n-\tilde{n}}{2}$ entries in the upper triangle of the adjacency matrix, we simply use $2\log n$ bits to encode the row and column number of each one. (A sketch of the full pipeline follows Algorithm 1 below.)

Algorithm 1: $m$-Ary Adaptive Arithmetic Encoding With KT Probability Assignment
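For concreteness, here is a minimal Python sketch (ours, not the paper's implementation) of the first three encoder steps on a toy graph. Helper names such as `blocks_to_symbols` are hypothetical, symbols are 0-indexed rather than in $[m]$, the remaining-entries step is omitted, and the arithmetic coder is replaced by its ideal output length $\lceil\log(1/q_{\mathrm{KT}}(x^N))\rceil+1$:

```python
import numpy as np

def blocks_to_symbols(A, k):
    """Decompose the top-left (n' k) x (n' k) part of A into k x k blocks,
    each read off (row-major) as a symbol in {0, ..., 2^(k^2) - 1}."""
    n_blocks = A.shape[0] // k                 # n' = floor(n / k)
    weights = 2 ** np.arange(k * k)            # binary-to-m-ary conversion
    sym = lambda B: int(B.flatten() @ weights)
    # off-diagonal blocks of the upper triangle, in the order of (10)
    B_ut = [sym(A[i*k:(i+1)*k, j*k:(j+1)*k])
            for j in range(n_blocks) for i in range(j)]
    # diagonal blocks, in the order of (11)
    B_d = [sym(A[i*k:(i+1)*k, i*k:(i+1)*k]) for i in range(n_blocks)]
    return B_ut, B_d

def kt_codelength(seq, m):
    """Ideal arithmetic-coding length (bits) under the KT assignment (12)."""
    counts = np.zeros(m)
    bits = 0.0
    for j, x in enumerate(seq):
        bits -= np.log2((counts[x] + 0.5) / (j + m / 2))  # q_KT(x_{j+1} | x^j)
        counts[x] += 1
    return int(np.ceil(bits)) + 1

# toy adjacency matrix: symmetric, binary, zero diagonal
rng = np.random.default_rng(1)
U = np.triu(rng.random((8, 8)) < 0.3, k=1)
A = (U | U.T).astype(np.uint8)

k = 2
B_ut, B_d = blocks_to_symbols(A, k)
m = 2 ** (k * k)
total_bits = kt_codelength(B_ut, m) + kt_codelength(B_d, m)
```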

Given the compressed graph sequence $y^L$, the number of vertices $n$, and the submatrix size $k$, the graph decompressor $D_k:\{0,1\}^*\to\mathcal{A}_n$ is defined as follows.

  • Adaptive arithmetic decoding. With the KT sequential probability assignments defined in (12), decompress the two code sequences for $B_{\mathrm{ut}}$ and $B_{\mathrm{d}}$ separately using adaptive arithmetic decoding (see Algorithm 2). The lengths of the data sequences $B_{\mathrm{ut}}$ and $B_{\mathrm{d}}$ are $n'(n'-1)/2$ and $n'$ respectively.

  • $m$-ary to binary conversion. Each $m$-ary symbol in the sequence is converted to a $k^2$-bit binary number and further converted into a $k\times k$ submatrix with binary entries.

  • Adjacency matrix recovery. With the submatrices in $B_{\mathrm{ut}}$ and $B_{\mathrm{d}}$, recover the top-left $\tilde{n}\times\tilde{n}$ submatrix of $A_n$ in the order described in (9), (10), and (11).

  • Decoding the remaining bits. Recover the remaining $(n-\tilde{n})\tilde{n}+\binom{n-\tilde{n}}{2}$ entries of $A_n$ using the row and column numbers of the ones.

Algorithm 2: $m$-Ary Adaptive Arithmetic Decoding With KT Probability Assignment
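As a complement, a minimal sketch (ours) of the $m$-ary to binary conversion step; it inverts the row-major conversion used in the encoder sketch above (`symbol_to_block` is a hypothetical helper):

```python
import numpy as np

def symbol_to_block(s, k):
    """Recover the k x k binary submatrix encoded by the symbol s."""
    bits = (int(s) >> np.arange(k * k)) & 1
    return bits.reshape(k, k).astype(np.uint8)

# round trip: block -> symbol -> block
k = 3
B = (np.arange(k * k).reshape(k, k) % 2).astype(np.uint8)
s = int(B.flatten() @ (2 ** np.arange(k * k)))
assert np.array_equal(symbol_to_block(s, k), B)
```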

Remark 1 (Complexity):

The computational complexity and the space complexity of the proposed compressor are $O(2^{k^2}n^2/k^2)$ and $O(n^2)$ respectively. For the choice of $k$ that achieves universality over the $\mathcal{P}_1(\epsilon)$ family in Theorem 1, $O(2^{k^2}n^2/k^2)=O(n^{2+\delta}/k^2)$ for $\delta<\epsilon$. For the choice of $k$ that achieves universality over the $\mathcal{P}_2(\epsilon)$ family in Theorem 2, $O(2^{k^2}n^2/k^2)=O(n^2)$.

To see this, we consider the time and space complexity of the encoder step by step. The submatrix decomposition and binary to $m$-ary conversion steps require reading the whole adjacency matrix and computing the $m$-ary value of each block; they take $O(n^2)$ time and $O(n^2/k^2)$ space. Regarding the complexity of adaptive arithmetic coding with the KT probability assignment, consider first the ideal case of a computer with infinite floating-point precision. In this case, the encoder runs for $O(n^2/k^2)$ iterations, and each iteration requires $O(2^{k^2})$ time to compute the conditional probabilities, so the total time complexity is $O(2^{k^2}n^2/k^2)$. The space complexity is bounded by the total number of output bits, which is $O(n^2)$. In practice, a computer cannot have infinite floating-point precision, and this is a known difficulty in implementing an arithmetic encoder. In [45], the authors proposed a practical method for implementing arithmetic coding with finite floating-point precision by outputting the known most significant bits of the code during the encoding process; we refer interested readers to their paper for more details. Finally, encoding the remaining bits requires at most $O(nk)$ time and $O(1)$ space. Overall, the encoder requires $O(2^{k^2}n^2/k^2)$ time and $O(n^2)$ space. The time and space complexity of the decoder can be analyzed similarly.

Remark 2:

One can check that $C_k$ is well-defined. The submatrix decomposition and the binary to $m$-ary conversion are clearly one-to-one. It is also known that for any valid probability assignment, arithmetic coding produces a prefix code, which is also one-to-one.

Remark 3:

The orders within $B_{\mathrm{ut}}$ and $B_{\mathrm{d}}$ do not matter for establishing universality. The current orders in (10) and (11), together with arithmetic coding, enable a horizon-free implementation. That is, the encoder does not need to know the horizon $n$ to start processing the data and can output partially coded bits on the fly before receiving all the data. This leads to short encoding and decoding delays. In some real-world applications, for example when the number of users increases in a large social network, this compressor has the advantage of not requiring one to re-process existing data and re-compress the whole graph from scratch.

Remark 4 (Laplace Probability Assignment):

As an alternative to the KT sequential probability assignment, one can also use the Laplace sequential probability assignment. Given an $m$-ary sequence $x_1,x_2,\ldots,x_N$, the Laplace sequential probability assignment defines $N$ conditional probability distributions over $[m]$ as follows. For $j=0,1,2,\ldots,N-1$, we assign conditional probability

$$q_{\mathrm{L}}(X_{j+1}=i|X^j=x^j) = \frac{N_i(x^j)+1}{j+m} \quad \text{for each } i\in[m]. \tag{13}$$

Both methods can be shown to be universal, and the Laplace probability assignment has a much cleaner derivation. However, the KT probability assignment produces better empirical performance. For this reason, we keep both in the paper.
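For a side-by-side view, a small sketch (ours; helper names hypothetical) of the two conditional probability assignments (12) and (13):

```python
def laplace_prob(counts, j, m, i):
    """q_L(X_{j+1} = i | x^j) from (13)."""
    return (counts[i] + 1) / (j + m)

def kt_prob(counts, j, m, i):
    """q_KT(X_{j+1} = i | x^j) from (12)."""
    return (counts[i] + 0.5) / (j + m / 2)

# after observing x^4 = (1, 1, 1, 2) over an alphabet of size m = 4
counts, j, m = {1: 3, 2: 1, 3: 0, 4: 0}, 4, 4
print(laplace_prob(counts, j, m, 1))   # (3 + 1)   / (4 + 4) = 0.5
print(kt_prob(counts, j, m, 1))        # (3 + 0.5) / (4 + 2) ≈ 0.583
```

The smaller additive constant (1/2 versus 1) lets KT track skewed empirical frequencies slightly more aggressively, which is one plausible reading of its better empirical performance.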

B. Theoretical Performance

We now state the main result of this paper: the proposed compressor $C_k$, for a carefully chosen $k$, is universal over the classes $\mathcal{P}_1(\epsilon)$ and $\mathcal{P}_2(\epsilon)$ for every $0<\epsilon<1$.

Theorem 1 (Universality Over $\mathcal{P}_1$):

For every $0<\epsilon<1$, the graph compressor $C_k$ is universal over the family $\mathcal{P}_1(\epsilon)$ provided that

$$0<\delta<\epsilon, \qquad k\le\sqrt{\delta\log n}, \qquad \text{and}\quad k=\omega(1).$$

Remark 5:

Recall that $\mathcal{P}_1(\epsilon)$ is the family of SBMs with edge probability in the regime between $\Omega\left(\frac{1}{n^{2-\epsilon}}\right)$ and $O(1)$. Moreover, $\frac{1}{n^2}$ is the threshold for a random graph to contain an edge with high probability [19]. Thus, the family $\mathcal{P}_1(\epsilon)$ covers most non-trivial SBM graphs.

It turns out that for $k=1$, the compressor $C_1$ is universal over the family $\mathcal{P}_2(\epsilon)$, which we state formally below.

Theorem 2 (Universality Over $\mathcal{P}_2$):

For every $0<\epsilon<1$, the graph compressor $C_1$ is universal over the family $\mathcal{P}_2(\epsilon)$.

Remark 6:

Any compressor universal over the class $\mathcal{P}_1(\epsilon)$ is also universal over the class $\mathcal{P}_2(\epsilon)$, but our compressor designed specifically for the class $\mathcal{P}_2(\epsilon)$ has a lower computational complexity.

The above results establish the universality of the proposed compressor over $\mathcal{P}_1$ and $\mathcal{P}_2$. The minimax redundancy represents the lowest achievable redundancy of any lossless compressor for the family of SBMs. We bound the minimax redundancy of the family $\mathcal{P}_3$ with fixed regime $f(n)$ and fixed largest entry $Q_{\max}$ in $Q$ in the following theorem.

Theorem 3 (Minimax Redundancy):

Let $f(n)=o(1)$ and $f(n)=\Omega(1/n^{2-\epsilon})$ for some $0<\epsilon<1$. The minimax redundancy of the family $\mathcal{P}_3(f(n),Q_{\max})$ is bounded as

$$\frac{1}{2}\log\left(\frac{\binom{n}{2}f(n)Q_{\max}}{\pi e}\right) \le R^*(\mathcal{P}_3(f(n),Q_{\max})) \le \frac{\log e}{e}\,Q_{\max}\binom{n}{2}f(n).$$

Remark 7:

From Theorem 3, we can see that the lower bound scales as $\log(n^2f(n))$, while the upper bound scales as $n^2f(n)$. Closing the gap between the upper and lower bounds is an interesting future research question. Recall that the minimax redundancy is the optimal second-order term in the expected length of a universal compressor; we will show in Lemma 2 that $H(A_n)$, the first-order term in the length, is $\Theta(n^2f(n)\log(1/f(n)))$.

In Section VI, we analyze the performance of the proposed compressor under the local weak convergence framework considered in [34]. It turns out that our proposed compressor achieves the optimal compression rate under the stronger universality criterion considered in that framework. This result requires several more definitions to state, so we defer it to Theorem 5 in Section VI.

C. Empirical Performance

We implement the proposed universal graph compressor (UGC) on four widely used benchmark graph datasets: protein-to-protein interaction network (PPI) [46], LiveJournal friendship network (Blogcatalog) [47], Flickr user network (Flickr) [47], and YouTube user network (YouTube) [48]. The submatrix decomposition size $k$ is chosen to be 1, 2, 3, and 4, and we present in Table I the compression ratios (the ratio between the output length and the input length of the encoder) of UGC for these choices of $k$. We present in Table II the compression ratios of four competing algorithms. In Fig. 1, we combine the comparisons of the compression ratios in Tables I and II on a logarithmic scale.

  • CSR: Compressed sparse row is a widely used sparse matrix representation format. In the experiment, we further optimize its default compressor by exploiting the fact that the graph is simple and its adjacency matrix is symmetric with binary entries.

  • Ligra+: This is another powerful sparse matrix representation format [49], [50], which improves upon CSR using byte codes with run-length coding.

  • LZ: This is an implementation of the algorithm proposed in [22], which first transforms the two-dimensional adjacency matrix into a one-dimensional sequence using the Peano-Hilbert space-filling curve and then compresses the sequence using Lempel–Ziv 78 algorithm [15].

  • PNG: The adjacency matrix of the graph is treated as a grey-scale image and the PNG lossless image compressor is applied.

The compression ratios of the five algorithms on the four datasets are reported in Tables I and II. The proposed UGC outperforms all competing algorithms on all datasets: the compression ratios of the competing algorithms are 2.4 to 27 times that of the universal graph compressor.
TABLE I: Compression Ratio of UGC Under Different $k$ Values

TABLE II: Compression Ratios of Competing Algorithms

Fig. 1: Log-scale comparisons of the compression ratios for the proposed universal graph compressor with other competing compressors. See Tables I and II for the exact compression ratios. The column numbers of the highlighted entries in Table I indicate the optimal $k$ value for UGC used in this plot.

SECTION III.

Main Ideas in Establishing Universality

In this section, we establish Theorems 1 and 2 in Section II-B based on four intermediate propositions. We defer the proofs of the propositions to Section IV.

A. Graph Entropy

We first calculate the entropy of the (random) graph $A_n$, which, recall, is the fundamental lower bound on the expected compression length of any compression scheme. Since we want to bound the redundancy of a universal compressor, we will be concerned with both the first- and the second-order terms in $H(A_n)$. Recall that for $p\in[0,1]$, $h(p)$ denotes the binary entropy function, and for a matrix $W$ with entries in $[0,1]$, $h(W)$ denotes a matrix of the same dimension whose $(i,j)$ entry is $h(W_{ij})$.

Proposition 1 (Graph Entropy):

Let $A_n\sim\mathrm{SBM}(n,L,p,f(n)Q)$ with $f(n)=O(1)$, $f(n)=\Omega\left(\frac{1}{n^2}\right)$, and $L=\Theta(1)$. Recall that $X_i\in[L]$ denotes the community label of vertex $i$.

Then

$$H(A_n) = \binom{n}{2}H(A_{12}|X_1,X_2)(1+o(1)) \tag{14}$$
$$\phantom{H(A_n)} = \binom{n}{2}p^Th(f(n)Q)p + o(n^2h(f(n))). \tag{15}$$
In particular, when $f(n)=\Omega\left(\frac{1}{n^2}\right)$ and $f(n)=o(1)$, expression (15) can be further simplified as
$$H(A_n) = \binom{n}{2}f(n)\log\left(\frac{1}{f(n)}\right)(p^TQp+o(1)). \tag{16}$$

Remark 8:

In the regime $f(n)=\Omega\left(\frac{1}{n}\right)$ and $f(n)=O(1)$, equation (15) was established in [24]. We extend the analysis to the regime $f(n)=\Omega\left(\frac{1}{n^2}\right)$ and $f(n)=o\left(\frac{1}{n}\right)$.

Remark 9:

Proposition 1 can be used to calculate the entropy of the graph in several important regimes of $f(n)$, in which the SBM displays characteristic behavior. For $f(n)=1$, we have $H(A_n)=\binom{n}{2}(p^Th(Q)p+o(1))$; for $f(n)=\frac{\log n}{n}$ (the regime where the phase transition for exact recovery of the community assignments occurs [51], [52]), we have $H(A_n)=\frac{n\log^2 n}{2}(p^TQp+o(1))$; when $f(n)=\frac{1}{n}$ (the regime where the phase transition for detection between the SBM and the Erdős–Rényi model occurs [53]), we have $H(A_n)=\frac{n\log n}{2}(p^TQp+o(1))$; when $f(n)=\frac{1}{n^2}$ (the regime where the phase transition for the existence of an edge occurs), we have $H(A_n)=\log n\,(p^TQp+o(1))$.
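As a worked instance of Remark 9 (our computation), take the exact-recovery regime $f(n)=\frac{\log n}{n}$ and start from (16):

$$H(A_n) = \binom{n}{2}\cdot\frac{\log n}{n}\cdot\log\frac{n}{\log n}\,(p^TQp+o(1)) = \frac{n-1}{2}\,\log n\,(\log n-\log\log n)(p^TQp+o(1)) = \frac{n\log^2 n}{2}(p^TQp+o(1)),$$

recovering the expression stated above.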

B. Asymptotic i.i.d. via Submatrix Decomposition

To compress the matrix $A_n$, we wish to decompose it into a large number of components that have little correlation between them. This leads to the idea of submatrix decomposition described previously. Since the sequence of submatrices is used to compress $A_n$, the next proposition shows that these submatrices are identically distributed and asymptotically independent in a precise sense.

Proposition 2 (Submatrix Decomposition):

Let $A_n\sim\mathrm{SBM}(n,L,p,f(n)Q)\in\mathcal{P}_1(\epsilon)$ for some $0<\epsilon<1$. Let $k$ be an integer and $n'=\lfloor n/k\rfloor$. Consider the $k\times k$ submatrix decomposition in (9). Then all the off-diagonal submatrices share the same joint distribution, and all the diagonal submatrices share the same joint distribution. In other words, for any $1\le i_1,i_2,j_1,j_2\le n'$ with $i_1\ne j_1$, $i_2\ne j_2$, and $1\le l_1,l_2\le n'$, we have

$$B_{i_1,j_1}\overset{d}{=}B_{i_2,j_2}, \qquad B_{l_1,l_1}\overset{d}{=}B_{l_2,l_2}.$$

In addition, if $k=\omega(1)$ and $k=o(n)$, the off-diagonal submatrices are asymptotically i.i.d. in the sense that
$$\lim_{n\to\infty}\frac{H(B_{\mathrm{ut}})}{\binom{n'}{2}H(B_{12})}=1. \tag{17}$$

Remark 10:

In Proposition 2, we provide a more general result than what we need to prove Theorem 1. In Theorem 1, the range of $k$ for the compressor to achieve universality over the family $\mathcal{P}_1(\epsilon)$ is given by

$$0<\delta<\epsilon, \qquad k\le\sqrt{\delta\log n}, \qquad \text{and}\quad k=\omega(1),$$

while in Proposition 2, we show the asymptotic i.i.d. property for the wider range $k=\omega(1)$ and $k=o(n)$.

Remark 11:

Proposition 2 implies that any probability assignment that is universal for i.i.d. sequences may be used to compress the blocks, provided that the compression length analysis can be extended to asymptotically i.i.d. sequences as in Propositions 3 and 4. The reason we choose the KT/Laplace probability assignment is that it is known to achieve the minimax redundancy and to have better finite-length performance compared to, for example, Lempel–Ziv coding, which, while still universal, incurs larger redundancy. Moreover, the KT-based approach leads to better empirical performance, as observed in Table II.

C. Length Analysis for Correlated Sequences

Thanks to the asymptotic i.i.d. property of the submatrix decomposition, we hope to compress these submatrices as if they were independent, using a Laplace probability assignment (which, recall, is universal for the class of all $m$-ary i.i.d. processes). However, since these submatrices are still correlated (albeit weakly), we need a result on the performance of the Laplace probability assignment on correlated sequences with identical marginals, which we give next.

Proposition 3 (Laplace Probability Assignment for Correlated Sequence):

Consider arbitrarily correlated $Z_1,Z_2,\ldots,Z_N$, where each $Z_i$ is identically distributed over an alphabet of size $m\ge 2$. Let $\ell_{\mathrm{L}}(z^N)=\left\lceil\log\frac{1}{q_{\mathrm{L}}(z^N)}\right\rceil+1$, where

$$q_{\mathrm{L}}(z^N) := \frac{N_1!\,N_2!\cdots N_m!}{N!}\cdot\frac{1}{\binom{N+m-1}{m-1}} \tag{18}$$

with $N_i\equiv N_i(z^N)=\sum_{j=1}^{N}1\{z_j=i\}$, is the joint distribution induced by the sequential Laplace probability assignment in (13). We then have
$$E[\ell_{\mathrm{L}}(Z^N)] \le m\log(2eN)+NH(Z_1)+2. \tag{19}$$

We provide a similar result for the KT probability assignment.

Proposition 4 (KT Probability Assignment for Correlated Sequence):

Consider arbitrarily correlated $Z_1,Z_2,\ldots,Z_N$, where each $Z_i$ is identically distributed over an alphabet of size $m\ge 2$. Let $\ell_{\mathrm{KT}}(z^N)=\left\lceil\log\frac{1}{q_{\mathrm{KT}}(z^N)}\right\rceil+1$, where

$$q_{\mathrm{KT}}(z^N) = \frac{(2N_1-1)!!\,(2N_2-1)!!\cdots(2N_m-1)!!}{m(m+2)\cdots(m+2N-2)} \tag{20}$$

with $(-1)!!\equiv 1$ and $N_i\equiv N_i(z^N)=\sum_{j=1}^{N}1\{z_j=i\}$, is the joint distribution induced by the KT probability assignment in (12). We then have
$$E[\ell_{\mathrm{KT}}(Z^N)] \le \frac{m}{2}\log\left(e\left(1+\frac{2N}{m}\right)\right)+\frac{1}{2}\log(\pi N)+NH(Z_1)+2. \tag{21}$$
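Both closed forms can be sanity-checked numerically against the sequential assignments (13) and (12); a short sketch (ours; all helper names hypothetical):

```python
from math import factorial, comb, prod, isclose

def seq_joint(seq, m, q):
    """Product of the sequential conditionals q(x_{j+1} | x^j)."""
    counts = [0] * m
    p = 1.0
    for j, x in enumerate(seq):
        p *= q(counts, j, m, x)
        counts[x] += 1
    return p

def laplace_closed(seq, m):
    """Closed form (18)."""
    N, Ni = len(seq), [seq.count(i) for i in range(m)]
    return prod(factorial(c) for c in Ni) / factorial(N) / comb(N + m - 1, m - 1)

def kt_closed(seq, m):
    """Closed form (20), with (-1)!! = 1."""
    N, Ni = len(seq), [seq.count(i) for i in range(m)]
    dfact = lambda x: prod(range(x, 0, -2)) if x > 0 else 1
    num = prod(dfact(2 * c - 1) for c in Ni)
    den = prod(m + 2 * t for t in range(N))   # m (m+2) ... (m + 2N - 2)
    return num / den

seq, m = [0, 2, 2, 1, 0, 2], 3
laplace = lambda c, j, m, i: (c[i] + 1) / (j + m)
kt = lambda c, j, m, i: (c[i] + 0.5) / (j + m / 2)
assert isclose(seq_joint(seq, m, laplace), laplace_closed(seq, m))
assert isclose(seq_joint(seq, m, kt), kt_closed(seq, m))
```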

D. Proof of Theorem 1

To prove Theorem 1, we first state a technical lemma, the proof of which is deferred to Section IV-A.

Lemma 1:

Let $A_n\sim\mathrm{SBM}(n,L,p,f(n)Q)$. Recall that in the stochastic block model, $X_i$ denotes the community label of vertex $i$.

We have

$$H(A_{12}) = h(f(n)p^TQp), \tag{22}$$
and
$$H(A_{12}|X_1,X_2) = p^Th(f(n)Q)p. \tag{23}$$
Moreover, if $f(n)=O(1)$, then
$$h(f(n)p^TQp) = \Theta\left(p^Th(f(n)Q)p\right), \tag{24}$$
and if $f(n)=o(1)$, we further have
$$h(f(n)p^TQp) = (1+o(1))\,p^Th(f(n)Q)p. \tag{25}$$
We are now ready to prove Theorem 1.

Proof of Theorem 1:

We will prove the universality of $C_k$ for both the KT probability assignment and the Laplace probability assignment. Note that the upper bound (21) on the expected length under KT is dominated by the upper bound (19) on the expected length under Laplace, so it suffices to show that the Laplace probability assignment is universal.

We use the bound in Proposition 3 to establish the upper bound on the length of the code. Recall that here we compress the diagonal submatrices $B_{\mathrm{d}}$ (alphabet size $m=2^{k^2}$, $N=n'$ submatrices) and the off-diagonal submatrices $B_{\mathrm{ut}}$ (alphabet size $m=2^{k^2}$, $N=\binom{n'}{2}$ submatrices) separately, and encode the position of each 1 in the remaining $(n-\tilde{n})\tilde{n}+\binom{n-\tilde{n}}{2}$ entries using $2\log n$ bits. Let $N_r$ denote the number of ones in the remaining $(n-\tilde{n})\tilde{n}+\binom{n-\tilde{n}}{2}$ entries. We have

$$\frac{E(\ell(C_k(A_n)))}{H(A_n)} = \frac{E(\ell_{\mathrm{L}}(B_{\mathrm{ut}}))+E(\ell_{\mathrm{L}}(B_{\mathrm{d}}))+E(2N_r\log n)}{H(A_n)} \\
\le \frac{\binom{n'}{2}H(B_{12})+2^{k^2}\log\left(2e\binom{n'}{2}\right)+n'H(B_{11})+2^{k^2}\log(2en')+4+E(2N_r\log n)}{H(A_n)} \\
\overset{(a)}{\le} \frac{\binom{n'}{2}H(B_{12})+2^{k^2}\log(en^2)+n'H(B_{11})+2^{k^2}\log(2en)+4+E(2N_r\log n)}{H(A_n)} \\
\overset{(b)}{\le} \frac{\binom{n'}{2}H(B_{12})+2^{k^2}\log(2e^2n^3)+n'k^2H(A_{12})+4+E(2N_r\log n)}{H(A_n)} \\
= \frac{\binom{n'}{2}H(B_{12})}{H(A_n)}+\frac{2^{k^2}\log(2e^2n^3)+4}{H(A_n)}+\frac{n'k^2H(A_{12})}{H(A_n)}+\frac{E(2N_r\log n)}{H(A_n)}, \tag{26}$$

where in (a) we bound $2\binom{n'}{2}\le n^2$ and $n'\le n$, and in (b) we note that $H(B_{11})\le k^2H(A_{12})$, since the $k^2-k$ elements of the matrix apart from the diagonal elements are distributed identically to $A_{12}$. We now analyze each of these four terms separately. Firstly, using Proposition 2 yields that $\frac{\binom{n'}{2}H(B_{12})}{H(A_n)}\to 1$. Next, since $f(n)=\Omega\left(\frac{1}{n^{2-\epsilon}}\right)$, we have $H(A_n)=\Omega(n^{\epsilon}\log n)$, and subsequently, substituting $k\le\sqrt{\delta\log n}$, we have
$$\frac{2^{k^2}\log(2e^2n^3)+4}{H(A_n)} = O\left(\frac{n^{\delta}\log n}{n^{\epsilon}\log n}\right) = O(n^{\delta-\epsilon}) = o(1) \tag{27}$$
since $\delta<\epsilon$. Moreover, we have
$$\frac{n'k^2H(A_{12})}{H(A_n)} \le \frac{n'k^2H(A_{12})}{H(A_n|X^n)} = \frac{n'k^2H(A_{12})}{\binom{n}{2}H(A_{12}|X_1,X_2)} = O\left(\frac{k^2}{n}\right) = o(1), \tag{28}$$
where the penultimate equality uses the fact that $H(A_{12})=\Theta(H(A_{12}|X_1,X_2))$ by Lemma 1.

For the last term in (26), we have

$$\frac{E(2N_r\log n)}{H(A_n)} = \frac{f(n)p^TQp\left((n-\tilde{n})\tilde{n}+\binom{n-\tilde{n}}{2}\right)2\log n}{H(A_n)} \overset{(c)}{=} O\left(\frac{knf(n)\log n}{\binom{n}{2}f(n)\log\frac{1}{f(n)}}\right) = o(1), \tag{29}$$

where (c) follows because $n-\tilde{n}=n-\lfloor n/k\rfloor k<k$, $\tilde{n}\le n$, and $k=O(\log n)$. We have then established that
$$\frac{E(\ell(C_k(A_n)))}{H(A_n)} \le \frac{\binom{n'}{2}H(B_{12})}{H(A_n)}+\frac{2^{k^2}\log(2e^2n^3)+4}{H(A_n)}+\frac{n'k^2H(A_{12})}{H(A_n)}+\frac{E(2N_r\log n)}{H(A_n)} = 1+o(1),$$
which finishes the proof.

E. Proof of Theorem 2

Proof:

To establish the universality of $C_1$ over the family $\mathcal{P}_2$, we follow a similar argument as in the proof of Theorem 1. In the case $k=1$, we do not need to compress the diagonal submatrix sequence, since it is all zeros, and there are no remaining bits to compress separately. By Proposition 3 with $N=\binom{n}{2}$, we have

$$\frac{E(\ell(C_1(A_n)))}{H(A_n)} \le \frac{2\log(2eN)+NH(A_{12})+2}{H(A_n)} \le \frac{2\log(2eN)+NH(A_{12})+2}{H(A_n|X_1^n)} \\
= \left(\frac{2\log(2eN)+NH(A_{12})+2}{NH(A_{12})}\right)\frac{H(A_{12})}{H(A_{12}|X_1,X_2)} \\
= \left(1+\frac{2\log(2eN)+2}{Nh(f(n)p^TQp)}\right)\frac{h(f(n)p^TQp)}{p^Th(f(n)Q)p} \overset{(a)}{=} 1+o(1).$$

Here, (a) is justified by writing
$$\frac{2\log(2eN)+2}{Nh(f(n)p^TQp)} \le \frac{2\log(2en^2)+2}{\binom{n}{2}h\left(n^{-(2-\epsilon)}p^TQp\right)}\cdot\frac{h\left(n^{-(2-\epsilon)}p^TQp\right)}{h(f(n)p^TQp)},$$
and then noting that $\frac{2\log(2en^2)+2}{\binom{n}{2}h\left(n^{-(2-\epsilon)}p^TQp\right)}=o(1)$ and $\frac{h\left(n^{-(2-\epsilon)}p^TQp\right)}{h(f(n)p^TQp)}=O(1)$ when $f(n)=\Omega\left(\frac{1}{n^{2-\epsilon}}\right)$, and that $h(f(n)p^TQp)=(1+o(1))\,p^Th(f(n)Q)p$ when $f(n)=o(1)$ by Lemma 1.

Remark 12:

When $f(n)=1$, the compressor $C_1$ is strictly suboptimal. This is because the length achieved by $C_1$ is $\binom{n}{2}h(f(n)p^TQp)(1+o(1))$, whereas the first-order term in the entropy is $\binom{n}{2}p^Th(f(n)Q)p$. When $f(n)$ is $o(1)$, these two have the same first-order term. However, when $f(n)$ is constant, $p^Th(f(n)Q)p$ is strictly smaller than $h(f(n)p^TQp)$ by concavity of entropy.

SECTION IV.

Proof of Propositions 1–4

A. Graph Entropy

In this section, we establish Proposition 1. Towards that goal, we state and prove a stronger result on the graph entropy in Lemma 2, which will also be useful for the proof of Theorem 3 later. Proposition 1 then follows easily from Lemma 2. The proofs of Lemmas 1 and 2 are provided after the proof of Proposition 1.

Lemma 2 (Graph entropy):

Let $A_n\sim\mathrm{SBM}(n,L,p,f(n)Q)$ with $f(n)=O(1)$, $f(n)=\Omega\left(\frac{1}{n^2}\right)$, and $L=\Theta(1)$.

Then

$$H(A_n) = \binom{n}{2}H(A_{12}|X_1,X_2)(1+o(1)) \tag{30}$$
$$\phantom{H(A_n)} = \binom{n}{2}p^Th(f(n)Q)p + o(n^2h(f(n))). \tag{31}$$

In particular, when $f(n)=\omega\left(\frac{1}{n}\right)$ and $f(n)=o(1)$, expression (31) can be simplified as

$$H(A_n) = \binom{n}{2}f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^T\tilde{Q}p + o(1)\right), \tag{32}$$

where $\tilde{Q}$ denotes an $L\times L$ matrix whose $(i,j)$ entry is $Q_{ij}\log\left(\frac{1}{Q_{ij}}\right)$ when $Q_{ij}\ne 0$ and $0$ when $Q_{ij}=0$.

When $f(n)=\Omega\left(\frac{1}{n^2}\right)$ and $f(n)=O\left(\frac{1}{n}\right)$, the entropy $H(A_n)$ can be upper bounded as

$$H(A_n) \le \binom{n}{2}f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^TQp\log\frac{1}{p^TQp} + o(1)\right), \tag{33}$$

and lower bounded as
$$H(A_n) \ge \binom{n}{2}f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^T\tilde{Q}p + o(1)\right). \tag{34}$$

Now, we are ready to prove Proposition 1, which is restated as follows.

Proposition 1 (Graph Entropy, Restated):

Let $A_n\sim\mathrm{SBM}(n,L,p,f(n)Q)$ with $f(n)=O(1)$, $f(n)=\Omega\left(\frac{1}{n^2}\right)$, and $L=\Theta(1)$. Recall that $X_i\in[L]$ denotes the community label of vertex $i$.

Then

$$H(A_n) = \binom{n}{2}H(A_{12}|X_1,X_2)(1+o(1)) \tag{14}$$
$$\phantom{H(A_n)} = \binom{n}{2}p^Th(f(n)Q)p + o(n^2h(f(n))). \tag{15}$$
In particular, when $f(n)=\Omega\left(\frac{1}{n^2}\right)$ and $f(n)=o(1)$, expression (15) can be further simplified as
$$H(A_n) = \binom{n}{2}f(n)\log\left(\frac{1}{f(n)}\right)(p^TQp+o(1)). \tag{16}$$

Proof of Proposition 1:

Given Lemma 2, it suffices to prove equation (16) in Proposition 1. We consider two different cases. Firstly, assume that $f(n)=\omega\left(\frac{1}{n}\right)$ and $f(n)=o(1)$. Recall that the case of $f(n)=o(1)$ is particularly interesting, as the simple compressor $C_1$ attains universality there. In this case, by equation (32), we have

$$H(A_n) = \binom{n}{2}f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^T\tilde{Q}p + o(1)\right) = \binom{n}{2}f(n)\log\left(\frac{1}{f(n)}\right)(p^TQp+o(1)),$$

because $p^TQp=\Theta(1)$, $p^T\tilde{Q}p=\Theta(1)$ and $\log\left(\frac{1}{f(n)}\right)=\omega(1)$. Secondly, assume that $f(n)=O\left(\frac{1}{n}\right)$ and $f(n)=\Omega\left(\frac{1}{n^2}\right)$. In this case, the right-hand sides of equations (33) and (34) can both be written as $\binom{n}{2}f(n)\log\left(\frac{1}{f(n)}\right)(p^TQp+o(1))$, because $p^TQp=\Theta(1)$, $p^T\tilde{Q}p=\Theta(1)$ and $\log\left(\frac{1}{f(n)}\right)=\omega(1)$. Therefore, equation (16) follows.

Finally, we establish Lemmas 1 and 2.

Proof of Lemma 1:

We note that

$$P(A_{12}=1) = \sum_{1\le i,j\le L}P(A_{12}=1,X_1=i,X_2=j) = \sum_{1\le i,j\le L}P(A_{12}=1|X_1=i,X_2=j)P(X_1=i,X_2=j) = \sum_{1\le i,j\le L}f(n)Q_{ij}p_ip_j = f(n)p^TQp.$$

Therefore, $A_{12}\sim\mathrm{Bern}(f(n)p^TQp)$, which implies that $H(A_{12})=h(f(n)p^TQp)$. For $H(A_{12}|X_1,X_2)$, by the definition of conditional entropy, we have

$$H(A_{12}|X_1,X_2) = \sum_{i,j}H(A_{12}|X_1=i,X_2=j)p_ip_j \overset{(a)}{=} \sum_{i,j}h(f(n)Q_{ij})p_ip_j = p^Th(f(n)Q)p,$$

where (a) follows because $A_{12}|\{X_1=i,X_2=j\}\sim\mathrm{Bern}(f(n)Q_{ij})$.

To see that $h(f(n)p^TQp)=\Theta(p^Th(f(n)Q)p)$ when $f(n)=\Theta(1)$, we notice that $p^TQp=\Theta(1)$ because all entries in $p$ and $Q$ are constants. Moreover, $p^TQp<1$ because $\max_{i,j}Q_{ij}<1$ and all entries in $p$ are non-zero. As a result, we have $h(f(n)p^TQp)=\Theta(1)$. For $p^Th(f(n)Q)p$, similarly, we observe that $\max_{i,j}Q_{ij}=\Theta(1)$ and $\max_{i,j}Q_{ij}<1$, so $p^Th(f(n)Q)p=\Theta(1)$ since all entries in $p$ are constants. This completes the proof of (24).

Now, consider the case of $f(n)=o(1)$. By the properties of the binary entropy function (see for example [54]), if $f(n)=o(1)$, we can expand $h(f(n)p^TQp)$ and $p^Th(f(n)Q)p$ as

$$h(f(n)p^TQp) = f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^TQp\log\frac{1}{p^TQp} + o(1)\right), \tag{35}$$

and
$$p^Th(f(n)Q)p = f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^T\tilde{Q}p + o(1)\right), \tag{36}$$

where $\tilde{Q}$ denotes an $L\times L$ matrix whose $(i,j)$ entry is $Q_{ij}\log\left(\frac{1}{Q_{ij}}\right)$ when $Q_{ij}\ne 0$ and $0$ when $Q_{ij}=0$. To see (25), we subtract (36) from (35) and get
$$h(f(n)p^TQp)-p^Th(f(n)Q)p = f(n)\left(p^TQp\log\frac{1}{p^TQp}-p^T\tilde{Q}p+o(1)\right) \overset{(b)}{=} o\left(h(f(n)p^TQp)\right),$$

where (b) follows because $\log\left(\frac{1}{f(n)}\right)=\omega(1)$, $p^TQp\log\frac{1}{p^TQp}=\Theta(1)$ and $p^T\tilde{Q}p=\Theta(1)$. This completes the proof of (25).

Proof of Lemma 2:

Note that

$$H(A_n) = H(A_n|X^n)+I(X^n;A_n) = \binom{n}{2}H(A_{12}|X_1,X_2)+I(X^n;A_n) \tag{37}$$
$$\phantom{H(A_n)} = \binom{n}{2}p^Th(f(n)Q)p+I(X^n;A_n), \tag{38}$$
where the last equality follows by Lemma 1 and the fact that all the $\binom{n}{2}$ edges are identically distributed and independent given $X^n$. Also, notice that
$$h(f(n)) = f(n)\log\frac{1}{f(n)}+(1-f(n))\log\frac{1}{1-f(n)} = \Theta(1) = \Theta\left(f(n)\log\frac{1}{f(n)}\right)$$

when $f(n)=\Theta(1)$, and
$$h(f(n)) = f(n)\log\frac{1}{f(n)}+(1-f(n))\log\frac{1}{1-f(n)} \le f(n)\log\frac{1}{f(n)}+(1-f(n))\cdot\frac{f(n)}{1-f(n)}\log e = f(n)\log\frac{1}{f(n)}(1+o(1))$$

when $f(n)=o(1)$. As a result, to prove (31), it suffices to prove
$$H(A_n) = \binom{n}{2}p^Th(f(n)Q)p+o\left(n^2f(n)\log\frac{1}{f(n)}\right).$$

In the following, we first establish statement (31) for $f(n)=\Theta(1)$, then prove the simplified expression (32) for $f(n)=\omega\left(\frac{1}{n}\right)$ and $f(n)=o(1)$, and finally prove the refined upper and lower bounds on $H(A_n)$ in (33) and (34) for $f(n)=\Omega\left(\frac{1}{n^2}\right)$ and $f(n)=O\left(\frac{1}{n}\right)$, which imply statement (31) in the same regime.

For $f(n)=\Theta(1)$, we have

$$0\le I(X^n;A_n)\le H(X^n) = nH(X_1)\le n\log L = o\left(n^2f(n)\log\frac{1}{f(n)}\right), \tag{39}$$

which shows that
$$H(A_n) = \binom{n}{2}p^Th(f(n)Q)p+o(n^2) = \binom{n}{2}p^Th(f(n)Q)p+o\left(n^2f(n)\log\frac{1}{f(n)}\right).$$

Next, consider the case when $f(n)=\omega\left(\frac{1}{n}\right)$ and $f(n)=o(1)$. By equation (36), we have

$$p^Th(f(n)Q)p = f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^T\tilde{Q}p + o(1)\right) \overset{(a)}{=} f(n)\log\left(\frac{1}{f(n)}\right)p^TQp\,(1+o(1)), \tag{40}$$

where (a) follows since $\log\left(\frac{1}{f(n)}\right)=\omega(1)$, $p^TQp\log\frac{1}{p^TQp}=\Theta(1)$ and $p^T\tilde{Q}p=\Theta(1)$.

Since $f(n)=\omega\left(\frac{1}{n}\right)$, we have

$$I(X^n;A_n)\le H(X^n) = nH(X_1)\le n\log L = o\left(n^2f(n)\log\frac{1}{f(n)}\right), \tag{41}$$

which proves (31). Substituting (40) and (41) into (38) yields
$$H(A_n) = \binom{n}{2}f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^T\tilde{Q}p + o(1)\right).$$

Finally, consider the case when $f(n)=\Omega\left(\frac{1}{n^2}\right)$ and $f(n)=O\left(\frac{1}{n}\right)$. By properties of the entropy, we have

$$H(A_n|X^n)\le H(A_n)\le\binom{n}{2}H(A_{12}). \tag{42}$$

By Lemma 1, we know that
$$\binom{n}{2}p^Th(f(n)Q)p \le H(A_n) \le \binom{n}{2}h(f(n)p^TQp). \tag{43}$$

By (35) and (36), we can rewrite the upper and lower bounds as
$$\binom{n}{2}h(f(n)p^TQp) = \binom{n}{2}f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^TQp\log\frac{1}{p^TQp} + o(1)\right) \tag{44}$$

and
$$\binom{n}{2}p^Th(f(n)Q)p = \binom{n}{2}f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp + p^TQp\log e + p^T\tilde{Q}p + o(1)\right). \tag{45}$$

Substituting (45) and (44) into (43) gives bounds (33) and (34) in this regime. Next, we verify (31) in this regime. To see this, we subtract (45) from (44) and get
$$\binom{n}{2}h(f(n)p^TQp)-\binom{n}{2}p^Th(f(n)Q)p = \binom{n}{2}f(n)\left(p^TQp\log\frac{1}{p^TQp}-p^T\tilde{Q}p+o(1)\right) \overset{(a)}{=} o\left(n^2f(n)\log\frac{1}{f(n)}\right),$$

where (a) follows because $f(n)=o(1)$ and $p^TQp\log\frac{1}{p^TQp}-p^T\tilde{Q}p=\Theta(1)$.

B. Asymptotic i.i.d. via Submatrix Decomposition

We first invoke a known property of stochastic block models (see, for example, [55], [56]). We include the proof here for completeness.

Lemma 3 (Exchangeability of SBM):

Let $A_n\sim\mathrm{SBM}(n,L,p,W)$. For a permutation $\pi:[n]\to[n]$, let $\pi(A_n)$ be the $n\times n$ matrix whose $(i,j)$ entry is $A_{\pi(i),\pi(j)}$. Then, for any permutation $\pi:[n]\to[n]$, the joint distribution of $A_n$ is the same as the joint distribution of $\pi(A_n)$, i.e.,

$$A_n\overset{d}{=}\pi(A_n). \tag{46}$$

Proof:

Let $a_n$ be a realization of the random matrix $A_n$ and $\pi(X^n)$ be the permuted vector $(X_{\pi(1)},\ldots,X_{\pi(n)})$. For any symmetric binary matrix $a_n$ with zero diagonal entries, we have

$$P(A_n=a_n) = \sum_{x^n\in[L]^n}P(A_n=a_n,X^n=x^n) \\
= \sum_{x^n\in[L]^n}P(A_n=a_n|X^n=x^n)\prod_{i=1}^nP(X_i=x_i) \\
\overset{(a)}{=} \sum_{x^n\in[L]^n}\prod_{1\le i<j\le n}P(A_{ij}=a_{ij}|X_i=x_i,X_j=x_j)\prod_{i=1}^nP(X_{\pi(i)}=x_i) \\
\overset{(b)}{=} \sum_{x^n\in[L]^n}\prod_{1\le i<j\le n}(W_{x_i,x_j})^{a_{ij}}(1-W_{x_i,x_j})^{1-a_{ij}}\prod_{i=1}^nP(X_{\pi(i)}=x_i) \\
\overset{(c)}{=} \sum_{x^n\in[L]^n}\prod_{1\le i<j\le n}P(A_{\pi(i),\pi(j)}=a_{ij}|X_{\pi(i)}=x_i,X_{\pi(j)}=x_j)\prod_{i=1}^nP(X_{\pi(i)}=x_i) \\
= \sum_{x^n\in[L]^n}P(\pi(A_n)=a_n,\pi(X^n)=x^n) = P(\pi(A_n)=a_n),$$

where (a) follows since the $X_i$ are i.i.d., and thus $P(X_i=x_i)=P(X_{\pi(i)}=x_i)$, and (b) follows since $A_{ij}\sim\mathrm{Bern}(W_{X_i,X_j})$, and thus
$$P(A_{ij}=a_{ij}|X_i=x_i,X_j=x_j) = \begin{cases} W_{x_i,x_j} & \text{if } a_{ij}=1 \\ 1-W_{x_i,x_j} & \text{if } a_{ij}=0 \end{cases} \tag{47}$$
$$\phantom{P(A_{ij}=a_{ij}|X_i=x_i,X_j=x_j)} = (W_{x_i,x_j})^{a_{ij}}(1-W_{x_i,x_j})^{1-a_{ij}}. \tag{48}$$
The step in (c) follows since $A_{\pi(i),\pi(j)}\sim\mathrm{Bern}(W_{X_{\pi(i)},X_{\pi(j)}})$ and the conditional probability has the same expression as in (48).

Now we are ready to establish Proposition 2, which is restated as follows.

Proposition 2 (Submatrix Decomposition, Restated):

Let $A_n\sim\mathrm{SBM}(n,L,p,f(n)Q)\in\mathcal{P}_1(\epsilon)$ for some $0<\epsilon<1$. Let $k$ be an integer and $n'=\lfloor n/k\rfloor$. Consider the $k\times k$ submatrix decomposition in (9). Then all the off-diagonal submatrices share the same joint distribution, and all the diagonal submatrices share the same joint distribution. In other words, for any $1\le i_1,i_2,j_1,j_2\le n'$ with $i_1\ne j_1$, $i_2\ne j_2$, and $1\le l_1,l_2\le n'$, we have

$$B_{i_1,j_1}\overset{d}{=}B_{i_2,j_2}, \qquad B_{l_1,l_1}\overset{d}{=}B_{l_2,l_2}.$$

In addition, if $k=\omega(1)$ and $k=o(n)$, the off-diagonal submatrices are asymptotically i.i.d. in the sense that
$$\lim_{n\to\infty}\frac{H(B_{\mathrm{ut}})}{\binom{n'}{2}H(B_{12})}=1. \tag{17}$$

Proof of Proposition 2:

For any $i_1\ne j_1$ and $i_2\ne j_2$ with $1\le i_1,j_1,i_2,j_2\le n'$, consider a permutation $\pi_1:[n]\to[n]$ with

$$\pi_1(x) = \begin{cases} x+(i_2-i_1)k & \text{for } (i_1-1)k+1\le x\le i_1k, \\ x+(j_2-j_1)k & \text{for } (j_1-1)k+1\le x\le j_1k, \end{cases}$$

which maps the remaining $n-2k$ arguments to the $n-2k$ values in $[n]\setminus\{(i_2-1)k+1,\ldots,i_2k,(j_2-1)k+1,\ldots,j_2k\}$ in any order. Lemma 3 implies that $B_{i_1,j_1}$, which is the submatrix formed by the rows $(i_1-1)k+1,\ldots,i_1k$ and the columns $(j_1-1)k+1,\ldots,j_1k$, has the same distribution as the submatrix formed by the rows $\pi_1((i_1-1)k+1),\ldots,\pi_1(i_1k)$ and the columns $\pi_1((j_1-1)k+1),\ldots,\pi_1(j_1k)$. From the definition of $\pi_1$, we see that the latter submatrix is $B_{i_2,j_2}$, and we conclude that $B_{i_1,j_1}\overset{d}{=}B_{i_2,j_2}$. Similarly, defining a permutation $\pi_2:[n]\to[n]$ with
$$\pi_2(x) = x+(l_2-l_1)k \quad \text{for } (l_1-1)k+1\le x\le l_1k$$
(and mapping the remaining arguments bijectively in any order) and invoking Lemma 3 establishes $B_{l_1,l_1}\overset{d}{=}B_{l_2,l_2}$.

Now, clearly $H(B_{\mathrm{ut}})\le\binom{n'}{2}H(B_{12})$, and therefore we have

$$\limsup_{n\to\infty}\frac{H(B_{\mathrm{ut}})}{\binom{n'}{2}H(B_{12})}\le 1. \tag{49}$$

Moreover, we have $H(A_n)=H(B_{\mathrm{ut}},B_{\mathrm{d}})\le H(B_{\mathrm{ut}})+H(B_{\mathrm{d}})\le H(B_{\mathrm{ut}})+n'H(B_{11})\le H(B_{\mathrm{ut}})+n'k^2H(A_{12})$, where the last inequality follows by noting that, except for the diagonal elements of $B_{\mathrm{d}}$ (which are zero and thus have zero entropy), all other elements have the same distribution as $A_{12}$. We therefore obtain
$$H(B_{\mathrm{ut}}) \ge H(A_n)-n'k^2H(A_{12}) \ge H(A_n)-nkH(A_{12}) \ge H(A_n|X_1^n)-nkH(A_{12}) = \binom{n}{2}p^Th(f(n)Q)p-nk\,h(f(n)p^TQp),$$
where the last equality follows by Lemma 1. Consequently,
$$\frac{H(B_{\mathrm{ut}})}{\binom{n'}{2}H(B_{12})} \ge \frac{\binom{n}{2}\left(p^Th(f(n)Q)p-\frac{2k\,h(f(n)p^TQp)}{n-1}\right)}{\binom{n'}{2}H(B_{12})}. \tag{50}$$
We will now analyze the right-hand side of (50) in two parameter regimes.

  • $f(n)=\Theta(1)$: We have

    $$H(B_{12}) \overset{(a)}{\le} H(B_{12}|X_1^{2k})+H(X_1^{2k}) = H(B_{12}|X_1^{2k})+2kH(p) \overset{(b)}{=} k^2H(A_{1,k+1}|X_1,X_{k+1})+2kH(p) \overset{(c)}{\le} k^2\left(p^Th(Q)p+\frac{2\log L}{k}\right), \tag{51}$$

    where (a) follows from the chain rule, (b) follows since all elements of the matrix $B_{12}$ are independent given $X_1,\ldots,X_{2k}$, and (c) follows by Lemma 1. Plugging this into the right-hand side of (50), we obtain
    $$\frac{H(B_{\mathrm{ut}})}{\binom{n'}{2}H(B_{12})} \ge \frac{\binom{n}{2}\left(p^Th(Q)p-\frac{2k\,h(p^TQp)}{n-1}\right)}{\binom{n'}{2}k^2\left(p^Th(Q)p+\frac{2\log L}{k}\right)}. \tag{52}$$
    Since $k=o(n)$, $k=\omega(1)$ and $\binom{n'}{2}k^2\sim\binom{n}{2}$, we have from (52)
    $$\liminf_{n\to\infty}\frac{H(B_{\mathrm{ut}})}{\binom{n'}{2}H(B_{12})}\ge 1, \tag{53}$$
    which together with (49) yields the required result.

  • $f(n)=\Omega\left(\frac{1}{n^2}\right)$, $f(n)=o(1)$: Since $B_{12}$ is a matrix of $k^2$ identically distributed Bernoulli random variables, we have

    $$H(B_{12}) \le k^2H(A_{1,k+1}) = k^2h(f(n)p^TQp). \tag{54}$$

    Plugging this into the right-hand side of (50) then yields
    $$\frac{H(B_{\mathrm{ut}})}{\binom{n'}{2}H(B_{12})} \ge \frac{\binom{n}{2}\left(p^Th(f(n)Q)p-\frac{2k\,h(f(n)p^TQp)}{n-1}\right)}{\binom{n'}{2}k^2h(f(n)p^TQp)}. \tag{55}$$
    We first observe that in this parameter range, since $f(n)=o(1)$, by Lemma 1 we have
    $$p^Th(f(n)Q)p = (1+o(1))\,h(f(n)p^TQp). \tag{56}$$
    Finally, using that $k=o(n)$ and $\binom{n'}{2}k^2\sim\binom{n}{2}$, we establish
    $$\liminf_{n\to\infty}\frac{H(B_{\mathrm{ut}})}{\binom{n'}{2}H(B_{12})}\ge 1, \tag{57}$$
    which together with (49) yields the required result.

C. Length of the Laplace Probability Assignment

In this section, we prove Proposition 3, which is restated as follows.

Proposition 3 (Laplace Probability Assignment for Correlated Sequence, Restated):

Consider arbitrarily correlated $Z_1,Z_2,\ldots,Z_N$, where each $Z_i$ is identically distributed over an alphabet of size $m\ge 2$. Let $\ell_{\mathrm{L}}(z^N)=\left\lceil\log\frac{1}{q_{\mathrm{L}}(z^N)}\right\rceil+1$, where

$$q_{\mathrm{L}}(z^N) := \frac{N_1!\,N_2!\cdots N_m!}{N!}\cdot\frac{1}{\binom{N+m-1}{m-1}} \tag{18}$$

with $N_i\equiv N_i(z^N)=\sum_{j=1}^{N}1\{z_j=i\}$, is the joint distribution induced by the sequential Laplace probability assignment in (13). We then have
$$E[\ell_{\mathrm{L}}(Z^N)] \le m\log(2eN)+NH(Z_1)+2. \tag{19}$$

Proof of Proposition 3:

The joint probability of the Laplace probability assignment given in equation (18) is a classic result. Nonetheless, we provide the derivation of (18) in Appendix B.

Now, we move on to prove the expected length upper bound (19). Let us first elaborate on the relation between probability assignment and compression length. In Algorithm 1, the terms $-\log q(x_{j+1}|x^j)$ are added up, which yields the joint probability implied by the sequential probability assignment:

$$\sum_{j=0}^{N-1}-\log q(x_{j+1}|x^j) = -\log\left(\prod_{j=0}^{N-1}q(x_{j+1}|x^j)\right) = -\log q(x^N). \tag{58}$$

The compression output length of Algorithm 1 is $\left\lceil\log\frac{1}{q(x^N)}\right\rceil+1$.

Now we analyze the compression length of the Laplace compressor for the sequence $Z_1,Z_2,\ldots,Z_N$. Define $\theta_i:=P(Z_1=i)$ and $N_i:=\sum_{k=1}^{N}1\{Z_k=i\}$ for $i\in[m]$. We have

$$\log\frac{1}{q_{\mathrm{L}}(z^N)} = \log\frac{\theta_1^{N_1}\theta_2^{N_2}\cdots\theta_m^{N_m}}{q_{\mathrm{L}}(z^N)}+\log\frac{1}{\theta_1^{N_1}\theta_2^{N_2}\cdots\theta_m^{N_m}} \\
= \log\binom{N+m-1}{m-1}+\log\left(\frac{N!}{N_1!\,N_2!\cdots N_m!}\theta_1^{N_1}\theta_2^{N_2}\cdots\theta_m^{N_m}\right)+\log\frac{1}{\theta_1^{N_1}\theta_2^{N_2}\cdots\theta_m^{N_m}} \\
\overset{(a)}{\le} \log\binom{N+m-1}{m-1}+\log\frac{1}{\theta_1^{N_1}\theta_2^{N_2}\cdots\theta_m^{N_m}} \\
\overset{(b)}{\le} (m-1)\log\left(e\left(\frac{N}{m-1}+1\right)\right)+\log\frac{1}{\theta_1^{N_1}\theta_2^{N_2}\cdots\theta_m^{N_m}} \\
\le m\log(2eN)+\sum_{i=1}^mN_i\log\frac{1}{\theta_i}, \tag{59}$$

where (a) follows since $\frac{N!}{N_1!\,N_2!\cdots N_m!}\theta_1^{N_1}\theta_2^{N_2}\cdots\theta_m^{N_m}$ is a multinomial probability, which is always upper bounded by 1, and (b) follows since $\binom{n}{k}\le\left(\frac{en}{k}\right)^k$. Taking expectations on both sides of (59), we obtain

$$E\left[\log\frac{1}{q_{\mathrm{L}}(Z^N)}\right] \le m\log(2eN)+\sum_{i=1}^mE[N_i]\log\frac{1}{\theta_i} \tag{60} \\
= m\log(2eN)+\sum_{i=1}^mE\left[\sum_{k=1}^N1\{Z_k=i\}\right]\log\frac{1}{\theta_i} \overset{(c)}{=} m\log(2eN)+\sum_{i=1}^m\left(\sum_{k=1}^NE\left[1\{Z_k=i\}\right]\right)\log\frac{1}{\theta_i} \tag{61} \\
\overset{(d)}{=} m\log(2eN)+\sum_{i=1}^mN\,P(Z_1=i)\log\frac{1}{\theta_i} = m\log(2eN)+NH(Z_1), \tag{62}$$

where (c) follows by the linearity of expectation and (d) follows because all the $Z_i$ are identically distributed. Finally, we have
$$E[\ell_{\mathrm{L}}(Z^N)] \le E\left[\log\frac{1}{q_{\mathrm{L}}(Z^N)}\right]+2 \le m\log(2eN)+NH(Z_1)+2,$$
as required.

D. Length of the KT Probability Assignment

To prove Proposition 4, we first state a technical lemma.

Lemma 4:

For any integer $m>0$, $N_1,N_2,\ldots,N_m\in\mathbb{N}$, and any probability distribution $(\theta_1,\ldots,\theta_m)$,

$$\frac{\binom{N}{N_1,N_2,\ldots,N_m}\,\theta_1^{N_1}\cdots\theta_m^{N_m}}{\binom{2N}{2N_1,2N_2,\ldots,2N_m}\,\theta_1^{2N_1}\cdots\theta_m^{2N_m}}\ge1,$$
where $N=\sum_{i=1}^{m}N_i$.

We defer the proof of Lemma 4 to Appendix A.

Remark 13:

Equivalently, consider an urn containing a known number of balls of $m$ different colours, where a ball of colour $i$ is drawn with probability $\theta_i$. The lemma claims that the probability of getting $N_1$ balls of colour 1, $N_2$ balls of colour 2, $\ldots$, $N_m$ balls of colour $m$ out of $N$ draws with replacement is always at least the probability of getting $2N_1$ balls of colour 1, $2N_2$ balls of colour 2, $\ldots$, $2N_m$ balls of colour $m$ out of $2N$ draws with replacement.
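As a quick sanity check on Lemma 4 (not part of the proof), the following short Python snippet, with an arbitrary choice of $\theta$ and a few count vectors (our own test cases), evaluates both multinomial probabilities directly and verifies that the first is at least the second.

from math import factorial

def multinomial_prob(counts, theta):
    """Probability of the exact count vector under N = sum(counts) draws."""
    N = sum(counts)
    coef = factorial(N)
    for c in counts:
        coef //= factorial(c)
    p = float(coef)
    for c, t in zip(counts, theta):
        p *= t ** c
    return p

theta = (0.5, 0.3, 0.2)
for counts in [(3, 2, 1), (1, 0, 4), (2, 2, 2)]:
    doubled = tuple(2 * c for c in counts)
    p1 = multinomial_prob(counts, theta)
    p2 = multinomial_prob(doubled, theta)
    assert p1 >= p2                 # the inequality claimed by Lemma 4
    print(counts, p1, p2)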

Now, we are ready to prove Proposition 4, which is restated as follows.

Proposition 8 (KT Probability Assignment for Correlated Sequence):

Consider arbitrarily correlated $Z_1,Z_2,\ldots,Z_N$, where each $Z_i$ is identically distributed over an alphabet of size $m\ge2$. Let $\ell_{KT}(z^N)=\lceil\log\frac{1}{q_{KT}(z^N)}\rceil+1$, where

$$q_{KT}(z^N)=\frac{(2N_1-1)!!\,(2N_2-1)!!\cdots(2N_m-1)!!}{m(m+2)\cdots(m+2N-2)}\tag{20}$$
with $(-1)!!\equiv1$ and $N_i\equiv N_i(z^N)=\sum_{j=1}^{N}\mathbb{1}\{z_j=i\}$, is the joint distribution induced by the KT probability assignment in (12). We then have
$$\mathbb{E}[\ell_{KT}(Z^N)]\le\frac{m}{2}\log\left(e\Big(1+\frac{2N}{m}\Big)\right)+\frac12\log(\pi N)+NH(Z_1)+2.\tag{21}$$

Proof of Proposition 4:

The joint probability of the KT probability assignment given in (20) is a classic result; nonetheless, we provide the derivation of (20) in Appendix B.

Now, we move on to prove the expected length upper bound (21). In this proof, we use a generalized form of the factorial: for a positive integer $x$, let $\big(x+\frac12\big)!=\frac12\cdot\frac32\cdots\big(x+\frac12\big)$. Since $(2N_i-1)!!=\frac{(2N_i)!}{2^{N_i}N_i!}$, we have

$$m(m+2)\cdots(m+2N-2)=2^N\cdot\frac{m}{2}\cdot\frac{m+2}{2}\cdots\frac{m+2N-2}{2}=2^N\,\frac{\big(\frac{m}{2}+N-1\big)!}{\big(\frac{m}{2}-1\big)!}.$$
Therefore, we can rewrite the KT probability assignment (20) as
$$\begin{aligned}
q_{KT}(z^N)&=\frac{\big(\frac{m}{2}-1\big)!}{2^N\big(\frac{m}{2}+N-1\big)!}\cdot\frac{\binom{2N}{N}}{\binom{2N}{N}}\prod_{i=1}^{m}\frac{(2N_i)!}{N_i!\,2^{N_i}}\\
&=\frac{\big(\frac{m}{2}-1\big)!}{2^N\big(\frac{m}{2}+N-1\big)!}\,\binom{2N}{N}\,\frac{N!\,N!}{(2N)!}\prod_{i=1}^{m}\frac{(2N_i)!}{N_i!\,2^{N_i}}\\
&\overset{(a)}{\ge}\frac{\big(\frac{m}{2}-1\big)!\,\binom{2N}{N}}{4^N\big(N+\frac{m-1}{2}\big)^{\frac{m-1}{2}}}\cdot\frac{N!}{(2N)!}\prod_{i=1}^{m}\frac{(2N_i)!}{N_i!}\\
&\overset{(b)}{=}\theta_1^{N_1}\cdots\theta_m^{N_m}\,\frac{\big(\frac{m}{2}-1\big)!\,\binom{2N}{N}}{4^N\big(N+\frac{m-1}{2}\big)^{\frac{m-1}{2}}}\cdot\frac{\binom{N}{N_1,N_2,\ldots,N_m}\,\theta_1^{N_1}\cdots\theta_m^{N_m}}{\binom{2N}{2N_1,2N_2,\ldots,2N_m}\,\theta_1^{2N_1}\cdots\theta_m^{2N_m}},
\end{aligned}$$
where (a) follows since, when $m$ is even, $\frac{N!}{(\frac{m}{2}+N-1)!}=\frac{1}{(N+1)\cdots(\frac{m}{2}+N-1)}\ge\frac{1}{\big(N+\frac{m-1}{2}\big)^{\frac{m-1}{2}}}$, and when $m$ is odd, $\frac{N!}{(\frac{m}{2}+N-1)!}\ge\frac{N!}{(\frac{m}{2}+N-\frac12)!}=\frac{1}{(N+1)\cdots(\frac{m}{2}+N-\frac12)}\ge\frac{1}{\big(N+\frac{m-1}{2}\big)^{\frac{m-1}{2}}}$, and (b) follows from $\binom{N}{N_1,N_2,\ldots,N_m}=\frac{N!}{\prod_{i=1}^{m}N_i!}$ and $\theta_i\equiv P(Z_1=i)$. By Lemma 4, we have
$$q_{KT}(z^N)\ge\theta_1^{N_1}\cdots\theta_m^{N_m}\,\frac{\big(\frac{m}{2}-1\big)!\,\binom{2N}{N}}{4^N\big(N+\frac{m-1}{2}\big)^{\frac{m-1}{2}}}.$$
Thus,
$$\begin{aligned}
\log\frac{1}{q_{KT}(z^N)}&\le\log\frac{1}{\theta_1^{N_1}\cdots\theta_m^{N_m}}+\log\frac{4^N\big(N+\frac{m-1}{2}\big)^{\frac{m-1}{2}}}{\big(\frac{m}{2}-1\big)!\,\binom{2N}{N}}\\
&=\log\frac{1}{\theta_1^{N_1}\cdots\theta_m^{N_m}}+\frac{m-1}{2}\log\left(N+\frac{m-1}{2}\right)+\log\frac{4^N}{\binom{2N}{N}}-\log\Big(\frac{m}{2}-1\Big)!\\
&\overset{(a)}{\le}\log\frac{1}{\theta_1^{N_1}\cdots\theta_m^{N_m}}+\frac{m-1}{2}\log\left(N+\frac{m-1}{2}\right)+\log\frac{4^N}{\binom{2N}{N}}-\Big(\frac{m}{2}-1\Big)\log\left(\frac{\frac{m}{2}-1}{e}\right)\\
&\overset{(b)}{\le}\log\frac{1}{\theta_1^{N_1}\cdots\theta_m^{N_m}}+\frac{m-1}{2}\log\left(N+\frac{m-1}{2}\right)+\log\sqrt{\pi N}-\Big(\frac{m}{2}-1\Big)\log\left(\frac{\frac{m}{2}-1}{e}\right)\\
&\le\frac{m}{2}\log\frac{e\big(\frac{m}{2}+N\big)}{m/2}+\log\sqrt{\pi N}+\log\frac{1}{\theta_1^{N_1}\cdots\theta_m^{N_m}}\\
&=\frac{m}{2}\log\left(e\Big(1+\frac{2N}{m}\Big)\right)+\frac12\log(\pi N)+\sum_{i=1}^{m}N_i\log\frac{1}{\theta_i},
\end{aligned}$$
where (a) follows from Stirling's approximation $k!\ge\sqrt{2\pi k}\,(\frac{k}{e})^k e^{\frac{1}{12k+1}}\ge(\frac{k}{e})^k$ and (b) follows from Stirling's approximation for the central binomial coefficient, i.e., $\binom{2N}{N}\ge\frac{4^N}{\sqrt{\pi N}}$. Taking expectations as in the proof of Proposition 3, we have
$$\mathbb{E}\left[\log\frac{1}{q_{KT}(Z^N)}\right]\le\frac{m}{2}\log\left(e\Big(1+\frac{2N}{m}\Big)\right)+\frac12\log(\pi N)+NH(Z_1).$$
Finally, it follows that
$$\mathbb{E}[\ell_{KT}(Z^N)]\le\mathbb{E}\left[\log\frac{1}{q_{KT}(Z^N)}\right]+2\le\frac{m}{2}\log\left(e\Big(1+\frac{2N}{m}\Big)\right)+\frac12\log(\pi N)+NH(Z_1)+2.$$
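Analogously to the Laplace sketch above, a minimal Python illustration (not the paper's code; names and the 0-indexed alphabet are ours) of the KT add-half rule, checked against the closed form (20):

from math import log2, ceil

def kt_sequential(seq, m):
    """Product of the sequential KT conditionals (N_i(x^j)+1/2)/(j+m/2)."""
    counts = [0] * m
    q = 1.0
    for j, z in enumerate(seq):
        q *= (counts[z] + 0.5) / (j + m / 2)  # KT (add-half) rule
        counts[z] += 1
    return q

def double_factorial(n):
    """n!!, with (-1)!! defined to be 1."""
    out = 1
    while n > 1:
        out *= n
        n -= 2
    return out

def kt_closed_form(seq, m):
    """Closed-form joint probability from (20)."""
    N = len(seq)
    counts = [seq.count(i) for i in range(m)]
    num = 1
    for c in counts:
        num *= double_factorial(2 * c - 1)
    den = 1
    for j in range(N):
        den *= m + 2 * j                      # m (m+2) ... (m+2N-2)
    return num / den

seq = [0, 1, 0, 2, 0, 1]
m = 3
print(kt_sequential(seq, m), kt_closed_form(seq, m))  # the two agree
print(ceil(log2(1 / kt_closed_form(seq, m))) + 1)     # code length in bits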

SECTION V.

Minimax Redundancy Analysis

So far, we have proved the universality of our proposed algorithms: we showed that the expected length of the proposed compressor matches the first-order term of the graph entropy. In this section, we study the redundancy of the proposed compressor $C_k$ and prove the minimax redundancy result, Theorem 3, which is restated below as Theorem 4.

Theorem 4 (Minimax Redundancy):

Let $f(n)=o(1)$ and $f(n)=\Omega(1/n^{2-\epsilon})$ for some $0<\epsilon<1$. The minimax redundancy of the family $\mathcal{P}_3(f(n),Q_{\max})$ is bounded as

$$\frac12\log\left(\frac{\binom{n}{2}f(n)Q_{\max}}{\pi e}\right)\le R\big(\mathcal{P}_3(f(n),Q_{\max})\big)\le\frac1e(\log e)\,Q_{\max}\binom{n}{2}f(n).$$

Towards that goal, we first state a proposition that upper bounds the redundancy of the compressor $C_k$.

Proposition 9:

For every $0<\epsilon<1$, let $A^n\sim\mathrm{SBM}(n,L,p,f(n)Q)\in\mathcal{P}_1(\epsilon)$. Let $k=\omega(1)$ and $k^2\le\delta\log n$ for some $0<\delta<\epsilon$.

  • If $f(n)=o(1)$ and $f(n)=\Omega(\frac{1}{n^{2-\epsilon}})$, then the expected length of the compressor $C_k$ defined in Section II-A is upper bounded as

    $$\mathbb{E}[\ell(C_k(A^n))]-H(A^n)\le\binom{n}{2}f(n)\left(p^TQp\,\log\frac{1}{p^TQp}-p^T\widetilde{Q}p\right)+o(n^2f(n))=o(H(A^n)),\tag{63}$$
    where $\widetilde{Q}$ denotes an $L\times L$ matrix whose $(i,j)$ entry is $Q_{ij}\log\frac{1}{Q_{ij}}$ when $Q_{ij}\ne0$ and $0$ when $Q_{ij}=0$.

  • If $f(n)=\Theta(1)$, then the expected length of the compressor $C_k$ defined in Section II-A is upper bounded as

    $$\mathbb{E}[\ell(C_k(A^n))]-H(A^n)\le H(p)\,\frac{n^2}{k}+o\!\left(\frac{n^2}{k}\right)=o(H(A^n)),\tag{64}$$
    where $H(p)=\sum_{i=1}^{L}p_i\log\frac{1}{p_i}$.

Because $C_k$ is a valid compressor for $\mathcal{P}_3$, we can obtain an upper bound on the minimax redundancy by maximizing the right-hand side of equation (63). For the lower bound, we will apply the redundancy–capacity theorem of [57].

Proof of Theorem 3:

Firstly, we prove the lower bound on the minimax redundancy. The proof essentially follows from the redundancy–capacity theorem [57]. To start with, we define a vector $\Theta$ of the parameters $L$, $p$, and $Q$ of the SBM. Consider the stochastic block model $\mathrm{SBM}(n,L,p,f(n)Q)$. Recall that $L$ is a scalar. The vector $p=(p_1,\ldots,p_L)^T$ represents a distribution over $[L]$, so it can be written as $p=(p_1,\ldots,p_{L-1},1-\sum_{i=1}^{L-1}p_i)^T$. For the $L\times L$ symmetric matrix $Q$, we can write the upper-triangle entries as a vector $(Q_{11},\ldots,Q_{1L},Q_{22},\ldots,Q_{2L},\ldots,Q_{LL})$. We define the parameter vector

$$\Theta\big(\mathrm{SBM}(n,L,p,f(n)Q)\big)=\big(L,p_1,\ldots,p_{L-1},Q_{11},\ldots,Q_{1L},Q_{22},\ldots,Q_{2L},\ldots,Q_{LL}\big).$$
Notice that by our definition, $\Theta$ is a variable-length vector: its length $L+\frac{L(L+1)}{2}$ is a function of its first entry $L$. For some fixed $f(n)$ and $Q_{\max}$, let
$$W:=\big\{\Theta\big(\mathrm{SBM}(n,L,p,f(n)Q)\big):\mathrm{SBM}(n,L,p,f(n)Q)\in\mathcal{P}_3(f(n),Q_{\max})\big\}$$
denote the set of all parameter vectors for the family of SBMs in $\mathcal{P}_3(f(n),Q_{\max})$. Let $\mu(\cdot)$ denote a probability distribution supported on $W$. Suppose $A^n\sim\mathrm{SBM}(n,L,p,f(n)Q)$ with random SBM parameters following the prior $\mu$. Then, by the redundancy–capacity theorem (see, e.g., [57]), we have
$$R_n\big(\mathcal{P}_3(f(n),Q_{\max})\big)=\sup_{\mu}I_\mu(\Theta;A^n),$$
where $I_\mu(\Theta;A^n)$ denotes the mutual information between $\Theta$ and $A^n$ under the prior $\mu$, and the supremum is taken over all probability distributions supported on $W$. To lower bound $\sup_\mu I_\mu(\Theta;A^n)$, we consider a particular distribution $\nu$ on $W$ such that $L$ is fixed to $1$ and $Q_{11}\sim\mathrm{Unif}(0,Q_{\max})$, i.e., the Erdős–Rényi model with edge probability drawn uniformly from $(0,f(n)Q_{\max}]$. Since $\nu$ is supported on $W$, we have $\sup_\mu I_\mu(\Theta;A^n)\ge I_\nu(\Theta;A^n)$.

We now move on to evaluate the mutual information $I_\nu(\Theta;A^n)$. For an adjacency matrix $A^n$, define $\hat Q_{11}:=\frac{K}{\binom{n}{2}f(n)}$, where $K=\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}A_{ij}$ denotes the number of ones in the upper triangle of $A^n$. We have

$$\begin{aligned}
I_\nu(\Theta;A^n)&=I_\nu(Q_{11};A^n)=h(Q_{11})-h(Q_{11}\mid A^n)\\
&\overset{(a)}{\ge}\log Q_{\max}-h(Q_{11}\mid\hat Q_{11})\\
&=\log Q_{\max}-h(Q_{11}-\hat Q_{11}\mid\hat Q_{11})\\
&\ge\log Q_{\max}-h(Q_{11}-\hat Q_{11})\\
&\overset{(b)}{\ge}\log Q_{\max}-\frac12\log\big(2\pi e\,\mathrm{Var}(Q_{11}-\hat Q_{11})\big),
\end{aligned}$$
where (a) follows since $\hat Q_{11}$ is a function of $A^n$ and conditioning reduces differential entropy, and (b) follows since the Gaussian distribution maximizes differential entropy for a given variance. To evaluate the variance $\mathrm{Var}(Q_{11}-\hat Q_{11})$, we consider the conditional variance
$$\begin{aligned}
\mathrm{Var}\big(Q_{11}-\hat Q_{11}\mid Q_{11}=q\big)&=\mathbb{E}\left[\left(\frac{K}{\binom{n}{2}f(n)}-Q_{11}\right)^2\,\middle|\,Q_{11}=q\right]\\
&=\frac{1}{\binom{n}{2}^2f(n)^2}\,\mathbb{E}\left[\left(K-\binom{n}{2}f(n)q\right)^2\,\middle|\,Q_{11}=q\right]\\
&\overset{(c)}{=}\frac{\binom{n}{2}f(n)q\,(1-f(n)q)}{\binom{n}{2}^2f(n)^2}=\frac{q(1-f(n)q)}{\binom{n}{2}f(n)},
\end{aligned}$$
where (c) follows since $K\mid Q_{11}=q\sim\mathrm{Binom}\big(\binom{n}{2},f(n)q\big)$. By the law of total variance, and since $\mathbb{E}[\hat Q_{11}-Q_{11}\mid Q_{11}=q]=0$, we have
$$\mathrm{Var}(Q_{11}-\hat Q_{11})=\mathbb{E}\left[\frac{Q_{11}(1-f(n)Q_{11})}{\binom{n}{2}f(n)}\right]\le\frac{\mathbb{E}[Q_{11}]}{\binom{n}{2}f(n)}=\frac{Q_{\max}}{2\binom{n}{2}f(n)}.$$
Finally, we have
$$I_\nu(\Theta;A^n)\ge\log Q_{\max}-\frac12\log\big(2\pi e\,\mathrm{Var}(Q_{11}-\hat Q_{11})\big)\ge\log Q_{\max}-\frac12\log\left(\frac{Q_{\max}\pi e}{\binom{n}{2}f(n)}\right)=\frac12\log\left(\frac{\binom{n}{2}f(n)Q_{\max}}{\pi e}\right),$$
which proves the lower bound.

Now, we prove the upper bound on the minimax redundancy. Let $S(f(n),Q_{\max})$ denote the set of all triples $(L,p,Q)$ such that $\mathrm{SBM}(n,L,p,f(n)Q)\in\mathcal{P}_3(f(n),Q_{\max})$. Recall that the minimax redundancy is defined as $R_n(\mathcal{P}_3(f(n),Q_{\max}))=\inf_C\sup_{(L,p,Q)\in S(f(n),Q_{\max})}\big(\mathbb{E}[\ell(C(A^n))]-H(A^n)\big)$. Since our proposed compressor $C_k$ is a valid lossless compressor for $\mathcal{P}_3$, we can upper bound the minimax redundancy by the maximum redundancy of $C_k$. By Proposition 9, we have

$$\begin{aligned}
R_n\big(\mathcal{P}_3(f(n),Q_{\max})\big)&=\inf_C\sup_{(L,p,Q)\in S(f(n),Q_{\max})}\big(\mathbb{E}[\ell(C(A^n))]-H(A^n)\big)\\
&\le\sup_{(L,p,Q)\in S(f(n),Q_{\max})}\big(\mathbb{E}[\ell(C_k(A^n))]-H(A^n)\big)\\
&\overset{(d)}{\le}\sup_{(L,p,Q)\in S(f(n),Q_{\max})}\binom{n}{2}f(n)\left(p^TQp\log\frac{1}{p^TQp}-p^T\widetilde{Q}p\right)+o(n^2f(n)),
\end{aligned}$$
where (d) follows since $\mathcal{P}_3$ is a sub-family of $\mathcal{P}_2$, so $f(n)=o(1)$. Therefore, it suffices to maximize the constant term $\big(p^TQp\log\frac{1}{p^TQp}-p^T\widetilde{Q}p\big)$ over all valid choices of $L$, $p$, $Q$. We can rewrite the term as
$$\begin{aligned}
p^TQp\log\frac{1}{p^TQp}-p^T\widetilde{Q}p&=\sum_{i,j\in[L]}p_ip_jQ_{ij}\log\frac{1}{\sum_{k,l\in[L]}p_kp_lQ_{kl}}-\sum_{i,j\in[L]}p_ip_jQ_{ij}\log\frac{1}{Q_{ij}}\\
&=Q_{\max}\sum_{i,j\in[L]}p_ip_j\frac{Q_{ij}}{Q_{\max}}\log\frac{Q_{ij}}{\sum_{k,l\in[L]}p_kp_lQ_{kl}}\\
&=Q_{\max}\left(\sum_{i,j\in[L]}p_ip_j\bar Q_{ij}\log\frac{1}{\sum_{k,l\in[L]}p_kp_l\bar Q_{kl}}-\sum_{i,j\in[L]}p_ip_j\bar Q_{ij}\log\frac{1}{\bar Q_{ij}}\right)\\
&=Q_{\max}\left(p^T\bar Qp\log\frac{1}{p^T\bar Qp}-p^T\widetilde{\bar Q}p\right),
\end{aligned}$$
where $\bar Q$ denotes the matrix $Q/Q_{\max}$ and $\widetilde{\bar Q}$ denotes an $L\times L$ matrix whose $(i,j)$ entry is $\bar Q_{ij}\log\frac{1}{\bar Q_{ij}}$ (the $Q_{\max}$ factors inside the logarithms cancel between the two sums). By our assumptions, the entries of $\bar Q$ are all non-negative and the largest entry is $1$. Since $p$ is a probability distribution, we have $0<p^T\bar Qp\le1$. Now, consider the function $g(x)=x\log\frac1x$. For $0\le x\le1$, its supremum is attained at $x=1/e$ and its infimum at $x=0$ or $x=1$. With these properties, and since $p^T\widetilde{\bar Q}p\ge0$, we can bound the constant term as
$$Q_{\max}\left(p^T\bar Qp\log\frac{1}{p^T\bar Qp}-p^T\widetilde{\bar Q}p\right)\le Q_{\max}\,\frac1e\log e.$$
In particular, equality can be achieved when all the entries of $\bar Q$ are either zero or one and $p$ is chosen such that $p^T\bar Qp=1/e$. This completes the proof of the upper bound.□

Proof of Proposition 9:

We discuss the redundancy in two cases: (1) $f(n)=o(1)$ and $f(n)=\Omega(1/n^{2-\epsilon})$, and (2) $f(n)=\Theta(1)$.

First, assume $f(n)=o(1)$ and $f(n)=\Omega(1/n^{2-\epsilon})$. By Lemma 2, we can lower bound the graph entropy as

$$H(A^n)\ge\binom{n}{2}f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp+p^TQp\log e+p^T\widetilde{Q}p+o(1)\right).\tag{65}$$

Now, we upper bound the redundancy of $C_k$ for both the KT and the Laplace probability assignments. By equation (26), we have

$$\mathbb{E}[\ell(C_k(A^n))]\le\binom{n/k}{2}H(B_{12})+2^{k^2}\log(2e^2n^3)+\big(nk^2H(A_{12})+4\big)+\mathbb{E}(2N_r\log n).\tag{66}$$
We now analyze each of these four terms separately. Firstly, we have
$$\binom{n/k}{2}H(B_{12})\le\binom{n/k}{2}k^2H(A_{1,k+1})=\binom{n/k}{2}k^2\,h\big(f(n)p^TQp\big)\overset{(a)}{=}\binom{n}{2}f(n)\left(\left(\log\frac{1}{f(n)}\right)p^TQp+p^TQp\log e+p^TQp\log\frac{1}{p^TQp}+o(1)\right),\tag{67}$$
where (a) follows from equation (35). Next, since $k^2\le\delta\log n$ and $\delta<\epsilon$, we have
$$2^{k^2}\log(2e^2n^3)\le n^\delta\log(2e^2n^3)=o(n^2f(n)).\tag{68}$$
Moreover, we have
$$nk^2H(A_{12})+4\le n\delta\log n\,H(A_{12})+4=n\delta\log n\cdot O\!\left(f(n)\log\frac{1}{f(n)}\right)+4=o(n^2f(n)).\tag{69}$$
Finally, by equation (29), we have
$$\mathbb{E}(2N_r\log n)=O\big(knf(n)\log n\big)=o(n^2f(n)).\tag{70}$$
Substituting (67), (68), (69), (70) into the upper bound (66) and combining with the entropy lower bound (65) gives
$$\begin{aligned}
\mathbb{E}[\ell(C_k(A^n))]-H(A^n)&\le\binom{n}{2}f(n)\left(\left(\log\frac{1}{f(n)}\right)p^TQp+p^TQp\log e+p^TQp\log\frac{1}{p^TQp}+o(1)\right)+o(n^2f(n))\\
&\quad-\binom{n}{2}f(n)\left(\log\left(\frac{1}{f(n)}\right)p^TQp+p^TQp\log e+p^T\widetilde{Q}p+o(1)\right)\\
&=\binom{n}{2}f(n)\left(p^TQp\log\frac{1}{p^TQp}-p^T\widetilde{Q}p\right)+o(n^2f(n))\\
&=o(H(A^n)),
\end{aligned}\tag{71}$$
where (71) follows since $H(A^n)=\Theta\big(n^2f(n)\log(1/f(n))\big)$ by Lemma 2.

Now assume $f(n)=\Theta(1)$. Then $H(A^n)$ can be lower bounded as

$$H(A^n)\ge H(A^n\mid X^n)=\binom{n}{2}H(A_{12}\mid X_1,X_2)=\binom{n}{2}\,p^Th(f(n)Q)p.\tag{72}$$
Still, we can upper bound the expected length of the compressor $C_k$ using (66), and we bound the four terms separately. We have
$$\begin{aligned}
\binom{n/k}{2}H(B_{12})&=\binom{n/k}{2}\big(H(B_{12}\mid X_1^{2k})+I(X_1^{2k};B_{12})\big)\\
&=\binom{n/k}{2}\big(k^2H(A_{1,k+1}\mid X_1,X_{k+1})+I(X_1^{2k};B_{12})\big)\\
&\overset{(b)}{\le}\binom{n/k}{2}\big(k^2\,p^Th(f(n)Q)p+H(X_1^{2k})\big)\\
&=\binom{n/k}{2}\big(k^2\,p^Th(f(n)Q)p+2kH(p)\big)\\
&\le\frac{n(n-k)}{2}\,p^Th(f(n)Q)p+n\left(\frac{n}{k}-1\right)H(p),
\end{aligned}\tag{73}$$
where (b) follows by Lemma 1,
$$2^{k^2}\log(2e^2n^3)\le n^\delta\log(2e^2n^3)=o\!\left(\frac{n^2}{k}\right),\tag{74}$$
$$nk^2H(A_{12})+4\le n\delta\log n\,H(A_{12})+4=o\!\left(\frac{n^2}{k}\right),\tag{75}$$
and
$$\mathbb{E}(2N_r\log n)=f(n)\,p^TQp\left((n-\tilde n)\tilde n+\binom{n-\tilde n}{2}\right)2\log n=O(kn\log n)=o\!\left(\frac{n^2}{k}\right).\tag{76}$$
Combining the bounds (72), (73), (74), (75), (76) gives
$$\begin{aligned}
\mathbb{E}[\ell(C_k(A^n))]-H(A^n)&\le\frac{n(n-k)}{2}\,p^Th(f(n)Q)p+n\left(\frac{n}{k}-1\right)H(p)-\frac{n(n-1)}{2}\,p^Th(f(n)Q)p+o\!\left(\frac{n^2}{k}\right)\\
&=n\left(\frac{n}{k}-1\right)H(p)+\frac{n(1-k)}{2}\,p^Th(f(n)Q)p+o\!\left(\frac{n^2}{k}\right)\\
&\le H(p)\,\frac{n^2}{k}+o\!\left(\frac{n^2}{k}\right)=o(H(A^n)),
\end{aligned}$$
where the last line follows since $k=\omega(1)$ and $H(A^n)=\Theta(n^2)$ when $f(n)=\Theta(1)$.

SECTION VI.

Performance Under the Local Weak Convergence Framework

In this section, we analyze the performance of the proposed compressor under the local weak convergence framework introduced in [34] and [42]. We focus on a subfamily of SBMs in the $f(n)=1/n$ regime whose expected degree is identical for vertices in all communities, i.e., $Qp=(\lambda,\ldots,\lambda)^T$. We will show that the proposed compressor achieves the same performance in terms of BC entropy as the compressor proposed in [34].

We first introduce some basic definitions of rooted graphs in Subsection VI-A. Then, we define the local weak convergence of graphs and derive the local weak convergence limit of the subfamily of stochastic block model in Subsection VI-B. Finally, we review the definition of BC entropy in Subsection VI-C and state the performance guarantee of our compression algorithm in Subsection VI-D.

A. Basic Definitions on Rooted Graphs

Let $G=(V,E)$ be a simple graph (undirected, unweighted, no self-loops), with $V$ a countable set of vertices and $E$ a countable set of edges. Let $u\sim_G v$ denote that vertices $u$ and $v$ are connected in $G$. $G$ is said to be locally finite if, for all $v\in V$, the degree of $v$ in $G$ is finite. A rooted graph $(G,o)$ is a locally finite and connected graph $G=(V,E)$ with a distinguished vertex $o\in V$, called the root. Two rooted graphs $(G_1,o_1)=(V_1,E_1,o_1)$ and $(G_2,o_2)=(V_2,E_2,o_2)$ are isomorphic, denoted $(G_1,o_1)\cong(G_2,o_2)$, if there exists a bijection $\pi:V_1\to V_2$ such that $\pi(o_1)=o_2$ and $u\sim_{G_1}v$ if and only if $\pi(u)\sim_{G_2}\pi(v)$ for all $u,v\in V_1$. One can verify that this notion of isomorphism defines an equivalence relation on rooted graphs. Let $[G,o]$ denote the equivalence class corresponding to $(G,o)$. Let $\mathcal{G}$ denote the set of all locally finite and connected rooted graphs. For $(G,o)\in\mathcal{G}$ and $h\in\mathbb{N}$, we write $(G,o)_h$ for the truncation of $(G,o)$ at depth $h$, in other words, the subgraph induced on the vertices whose distance from the root is at most $h$. The equivalence classes $[G,o]_h$ follow a similar definition. Let $\mathcal{G}_h$ denote the set of all $[G,o]_h$. Now, we define a metric $d$ on $\mathcal{G}$. For any $[G_1,o_1]$ and $[G_2,o_2]$, let

$$\hat h:=\sup\big\{h\in\mathbb{Z}_+:(G_1',o_1')_h\cong(G_2',o_2')_h\ \text{for some}\ (G_1',o_1')\in[G_1,o_1],\ (G_2',o_2')\in[G_2,o_2]\big\}$$
and define the metric $d$ as
$$d\big([G_1,o_1],[G_2,o_2]\big):=\frac{1}{1+\hat h}.$$
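As an illustration (our own construction, not from the paper), the following Python sketch computes the metric $d$ for two explicitly given rooted graphs using the networkx package: it compares depth-$h$ truncations under rooted isomorphism, marking the root with a node attribute, and caps the search for $\hat h$ at a finite depth. For simplicity it compares the given representatives directly rather than optimizing over the equivalence classes.

import networkx as nx
from networkx.algorithms.isomorphism import categorical_node_match

def truncation(G, o, h):
    """(G, o)_h: subgraph induced on vertices within distance h of the root o."""
    H = nx.ego_graph(G, o, radius=h)
    nx.set_node_attributes(H, False, "root")
    H.nodes[o]["root"] = True       # isomorphisms must map root to root
    return H

def rooted_distance(G1, o1, G2, o2, h_max=20):
    """d([G1,o1],[G2,o2]) = 1/(1 + h_hat), with the search capped at h_max."""
    match = categorical_node_match("root", False)
    h_hat = 0                       # depth-0 truncations (a single root) always match
    for h in range(1, h_max + 1):
        if nx.is_isomorphic(truncation(G1, o1, h), truncation(G2, o2, h),
                            node_match=match):
            h_hat = h
        else:
            break
    return 1.0 / (1 + h_hat)

P = nx.path_graph(5)                 # the path 0-1-2-3-4
print(rooted_distance(P, 0, P, 4))   # isomorphic rooted graphs: 1/(1 + h_max)
print(rooted_distance(P, 0, P, 2))   # roots have different degrees: h_hat = 0, d = 1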
As shown in [58], equipped with the metric defined above, $\mathcal{G}$ is a Polish space, i.e., a complete separable metric space. For this Polish space, let $\mathcal{P}(\mathcal{G})$ denote the set of Borel probability measures on it. We say that a sequence of measures $\mu_n\in\mathcal{P}(\mathcal{G})$ converges weakly to $\mu\in\mathcal{P}(\mathcal{G})$, written $\mu_n\Rightarrow\mu$, if for any bounded continuous function $f$ on $\mathcal{G}$ we have $\int f\,d\mu_n\to\int f\,d\mu$. It was shown in [59] that $\mu_n\Rightarrow\mu$ if for any uniformly continuous and bounded function $f$ we have $\int f\,d\mu_n\to\int f\,d\mu$. For $\mu\in\mathcal{P}(\mathcal{G})$, $h\in\{0,1,2,\ldots\}$, and $[G,o]\in\mathcal{G}_h$, let $\mu_h$ denote the $h$-neighbourhood marginal of $\mu$:
$$\mu_h([G,o])=\sum_{[G',o']\in\mathcal{G}:\,[G',o']_h=[G,o]}\mu([G',o']).\tag{77}$$
For a locally finite graph $G=(V,E)$ and a vertex $v\in V$, let $G(v)$ denote the connected component of $v$ in $G$. By our previous definitions, $(G(v),v)$ denotes the connected component of $v$ rooted at $v$, and $[G(v),v]$ denotes the corresponding equivalence class. Now, the rooted neighbourhood distribution of $G$ is defined as the distribution of the rooted graph when the root is chosen uniformly at random over $V$:
$$U(G):=\frac{1}{|V|}\sum_{v\in V}\delta_{[G(v),v]},\tag{78}$$
where $\delta$ denotes the Dirac measure.

For a sequence of graphs $\{G_n\}_{n=1}^\infty$, we say a probability distribution $\mu$ on $\mathcal{G}$ is the local weak limit of $\{G_n\}_{n=1}^\infty$ if $U(G_n)\Rightarrow\mu$. Here we quote an interpretation of the term local in this definition from [34]: an equivalent definition of the local weak limit $\mu$ is that, for any finite $h\ge0$, if we randomly select a vertex $i$ and examine the $h$-hop local neighborhood centered at $i$, the resulting distribution on the neighborhood structure converges to $\mu$.

B. Local Weak Convergence

In this section, we study the local weak limit of a sequence of graphs generated from stochastic block models. However, the definition of the rooted neighbourhood distribution given in (78) is limited to deterministic graphs, so we first extend it to random graphs. Suppose we have a random graph $G_n\sim\Lambda_n$, where $\Lambda_n$ is a probability distribution supported on graphs with $n$ vertices. Let $\mathrm{supp}(\Lambda_n)$ denote the support of $\Lambda_n$, and let $B$ denote a measurable subset of $\mathcal{G}$. We define

$$U(G_n)(B)=\sum_{g\in\mathrm{supp}(\Lambda_n)}P_{\Lambda_n}(G_n=g)\,\frac1n\sum_{i=1}^{n}\mathbb{1}\{[g(i),i]\in B\},\tag{79}$$
i.e., we first generate a random graph $G_n$ according to the law $\Lambda_n$ and then pick a vertex $i$ uniformly at random, independently of the graph generation process, and $U(G_n)(B)$ represents the probability that $[G_n(i),i]\in B$ under this process. By rearranging the order of the sums in (79), we have
$$U(G_n)(B)=\frac1n\sum_{i=1}^{n}\sum_{g\in\mathrm{supp}(\Lambda_n)}P_{\Lambda_n}(G_n=g)\,\mathbb{1}\{[g(i),i]\in B\}=\frac1n\sum_{i=1}^{n}P_{\Lambda_n}\big([G_n(i),i]\in B\big).\tag{80}$$
Now consider the case $\Lambda_n=\mathrm{SBM}(n,L,p,f(n)Q)$ and a graph $A^n\sim\Lambda_n$. By the exchangeability of the stochastic block model shown in Lemma 3, we know that for any $B\subseteq\mathcal{G}$, $P_{\Lambda_n}([A^n(i),i]\in B)$ is equal for all vertices $i\in[n]$. Therefore, by (80), we have
$$U(A^n)(B)=P_{\Lambda_n}\big([A^n(\rho),\rho]\in B\big)\tag{81}$$
for any vertex $\rho\in V(A^n)$, i.e., the rooted neighbourhood distribution $U(A^n)$ reduces to the neighbourhood distribution at an arbitrary vertex $\rho$ of the graph $A^n$ under the law $\Lambda_n$. We denote this neighbourhood distribution by $\mathrm{Nbr}(\rho)$. Next, we study the limiting behaviour of $\mathrm{Nbr}_R(\rho)$, i.e., the $R$-neighbourhood marginal of $\mathrm{Nbr}(\rho)$, under certain assumptions on the parameters of the stochastic block model. Since this distribution is identical for every vertex in $V(A^n)$, we simply write $\mathrm{Nbr}_R$ for the $R$-neighbourhood distribution of any vertex.

To state the limiting distribution, we need to define the Galton–Watson tree probability distribution on rooted trees, $\mathrm{GWT}(\mathrm{Poi}(\lambda))$. Let $\mathrm{Poi}(\lambda)$ denote the Poisson distribution with mean $\lambda$. We take a vertex as the root and generate $Z^{(1)}\sim\mathrm{Poi}(\lambda)$ as the number of children of the root, forming the first generation. For the first generation, we generate $\xi_1^{(1)},\ldots,\xi_{Z^{(1)}}^{(1)}$ i.i.d. according to $\mathrm{Poi}(\lambda)$ as the numbers of children of the vertices in the first generation. Let $Z^{(2)}=\sum_{i=1}^{Z^{(1)}}\xi_i^{(1)}$ denote the total number of vertices in the second generation. In general, for the $j$th generation, $j=1,2,\ldots$, we generate the numbers of children of the vertices in the $j$th generation, $\xi_1^{(j)},\ldots,\xi_{Z^{(j)}}^{(j)}$, i.i.d. according to $\mathrm{Poi}(\lambda)$, independent of all previous variables $\{\xi_1^{(i-1)},\ldots,\xi_{Z^{(i-1)}}^{(i-1)},Z^{(i)},\ \text{for all}\ i\le j\}$. Let $Z^{(j+1)}=\sum_{k=1}^{Z^{(j)}}\xi_k^{(j)}$ denote the total number of vertices in the $(j+1)$th generation. In this way, we iteratively define a measure on rooted trees.
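For intuition, here is a small Python sampler (illustrative only; the function and variable names are ours) that generates the depth-$h$ truncation of $\mathrm{GWT}(\mathrm{Poi}(\lambda))$ exactly as described: each vertex of the current generation receives an independent $\mathrm{Poi}(\lambda)$ number of children.

import random
from math import exp

def sample_gwt_poisson(lam, depth, seed=0):
    """Sample the depth-`depth` truncation of GWT(Poi(lam)) as {vertex: [children]}."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's method; adequate for the small means used here
        L, k, p = exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1

    tree = {0: []}                    # vertex 0 is the root
    current_gen = [0]
    next_label = 1
    for _ in range(depth):
        next_gen = []
        for v in current_gen:         # xi_v ~ Poi(lam), independent across vertices
            for _ in range(poisson(lam)):
                tree[v].append(next_label)
                tree[next_label] = []
                next_gen.append(next_label)
                next_label += 1
        current_gen = next_gen        # Z^{(j+1)} is the total number of children
    return tree

tree = sample_gwt_poisson(lam=1.5, depth=3)
print(len(tree), "vertices in the depth-3 truncation")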

Recall that the total variation distance between two probability measures $\mu_1$ and $\mu_2$ is defined as $\delta(\mu_1,\mu_2):=\sup_{g:\mathcal{G}\to[-1,1]}\big|\int g\,d\mu_1-\int g\,d\mu_2\big|$. With the definitions above, we are ready to state the proposition that upper bounds the total variation distance between $\mathrm{Nbr}_R$ and the $R$-neighbourhood marginal of the Galton–Watson tree distribution.

Proposition 10:

Let $A^n\sim\mathrm{SBM}(n,L,p,\frac1nQ)$ with $Qp=\lambda\mathbf{1}$ for some positive constant $\lambda$, where $\mathbf{1}$ is the all-ones vector of dimension $L\times1$. Let $R=\log\log\log n$. Recall that $\mathrm{Nbr}_R$ and $\mathrm{GWT}(\mathrm{Poi}(\lambda))_R$ denote the $R$-neighbourhood marginals of $\mathrm{Nbr}$ and $\mathrm{GWT}(\mathrm{Poi}(\lambda))$, respectively, as defined in (77). Then, as $n\to\infty$, the total variation distance satisfies

$$d_{TV}\big(\mathrm{Nbr}_R,\ \mathrm{GWT}(\mathrm{Poi}(\lambda))_R\big)\to0.$$

The proof of Proposition 10 follows a similar idea as the proof of Proposition 2 in [53]. The key idea is that $\mathrm{GWT}(\mathrm{Poi}(\lambda))$ can be constructed from a sequence of Poisson random variables with mean $\lambda$, while $\mathrm{Nbr}_R$ can be constructed from a sequence of binomial random variables with approximately the same mean $\lambda$. The case $R=1$ follows essentially from the Poisson approximation of the binomial in the $1/n$ probability regime; this proposition generalizes it to $R=O(\log n)$.

To upper bound the total variation distance between these two distributions, we first review the definition of a coupling between two probability distributions. Let $\mu$ and $\nu$ be probability measures on the same measurable space $(S,\mathcal{S})$. A coupling of $\mu$ and $\nu$ is a probability measure $\gamma$ on the product space $(S\times S,\mathcal{S}\times\mathcal{S})$ such that the marginals of $\gamma$ coincide with $\mu$ and $\nu$, i.e.,

$$\gamma(A\times S)=\mu(A)\quad\text{and}\quad\gamma(S\times A)=\nu(A),\qquad\forall A\in\mathcal{S}.$$
It is known that for two probability distributions $\mu$ and $\nu$ on $(S,\mathcal{S})$ and a pair of random variables $(X,Y)\sim\gamma$, where $\gamma$ is a coupling of $\mu$ and $\nu$, it holds that
$$\delta(\mu,\nu)\le P(X\ne Y)$$
(see, e.g., [60]). Therefore, it suffices to construct a coupling $(G_R,T_R)$ such that the marginal distribution of $G_R$ is $\mathrm{Nbr}_R$, the marginal distribution of $T_R$ is $\mathrm{GWT}(\mathrm{Poi}(\lambda))_R$, and $P(G_R\ne T_R)\to0$. To construct such a coupling, we present an induction in which each step holds with high probability.

To formally state the proof of Proposition 10, we need a few more definitions. Let $V=V(A^n)$ denote the set of vertices of the graph $A^n$. For a vertex $\rho\in V$ and an integer $r\in[R]$, let $G_r(\rho)$ denote the (random) $r$-neighbourhood of $\rho$, i.e., $G_r(\rho)$ is the subgraph induced on the vertex set $\{v\in V:d(v,\rho)\le r\}$. We simply write $G_r$ for $G_r(\rho)$. Let $V_r\equiv V\setminus V(G_r)$ denote the set of vertices of $V$ that are not in $V(G_r)$. Let $\partial G_r\equiv\{v\in V:d(v,\rho)=r\}$ denote the depth-$r$ boundary of the neighbourhood of $\rho$. For a vertex $v\in\partial G_r$, let $Y_v$ denote the number of neighbours $v$ has in $V_r$. Let $T_R$ be a random tree generated from $\mathrm{GWT}(\mathrm{Poi}(\lambda))_R$, i.e., $T_R\sim\mathrm{GWT}(\mathrm{Poi}(\lambda))_R$. For a vertex $u\in V(T_R)$, let $Z_u$ denote the number of children of $u$ in $T_R$. In order to couple $G_R$ with $T_R$ such that $P(G_R\ne T_R)\to0$, a necessary condition is that $G_R$ is a tree with high probability. Now, we define two sets of events that guarantee that $G_R$ is a tree. For any $r\in[R]$, let $C_r$ denote the event that no vertex in $V_{r-1}$ has more than one neighbour in $\partial G_{r-1}$, and let $D_r$ denote the event that there are no edges within $\partial G_r$. Clearly, if $C_r$ and $D_r$ hold for every $r\in[R]$, then $G_R$ is indeed a tree. With these definitions, we state a lemma that provides the induction step.

Lemma 5:

  1. If $G_{r-1}=T_{r-1}$;

  2. $Y_u=Z_u$ for every $u\in\partial G_{r-1}$; and

  3. $C_r$ and $D_r$ hold,

then $G_r=T_r$.

Proof of Lemma 5:

First of all, the events $C_r$ and $D_r$ guarantee that $G_r$ is a tree. Moreover, $G_{r-1}=T_{r-1}$ and $Y_u=Z_u$ for every vertex $u$ in the last generation of $G_{r-1}$ and $T_{r-1}$, so we have $G_r=T_r$.

Next, we define a set of auxiliary events to complete the induction step. For any $0\le r\le R$, we define $E_r$ to be the event

$$E_r=\big\{|\partial G_s|\le2^sQ_{\max}^s\log n,\ \forall s\le r+1\big\}.$$
The utility of this auxiliary event is shown in the following two lemmas.

Lemma 6:

For all $r\le R$ and some constant $c>0$,

$$P(E_r\mid E_{r-1})\ge1-n^{-c}.$$
Moreover, $|G_r|=O(\log^2n)$ when $E_{r-1}$ holds true.

Proof of Lemma 6:

First of all, recall that for two random variables $X_1$ and $X_2$, we say $X_1$ stochastically dominates $X_2$ if $P(X_1\ge x)\ge P(X_2\ge x)$ for all $x$. By our definitions, $Y_v$ is stochastically dominated by $\mathrm{Binom}(n,Q_{\max}/n)$ for any $v$. On $E_{r-1}$, $|\partial G_r|\le2^rQ_{\max}^r\log n$, and so $|\partial G_{r+1}|$ is stochastically dominated by

$$Z\sim\mathrm{Binom}\Big(2^rQ_{\max}^r\,n\log n,\ \frac{Q_{\max}}{n}\Big).$$
Thus,
$$P(E_r^c\mid E_{r-1})=P\big(|\partial G_{r+1}|>2^{r+1}Q_{\max}^{r+1}\log n\ \big|\ E_{r-1}\big)\le P(Z\ge2\,\mathbb{E}Z)\overset{(a)}{\le}\left(\frac{e}{4}\right)^{\mathbb{E}Z},$$
where (a) follows by a multiplicative version of Chernoff's inequality [61],
$$P\big(X>(1+\delta)\mathbb{E}X\big)\le\left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{\mathbb{E}X}$$
for a binomial random variable $X$ and any $\delta>0$. Because $R=\log\log\log n$, we have
$$\mathbb{E}Z=2^rQ_{\max}^{r+1}\log n=\Omega(\log^{c_1}n)$$
for some positive constant $c_1$, which proves the first part of the lemma.

For the second part, on $E_{r-1}$,
$$|G_r|=\sum_{s=1}^{R}|\partial G_s|\le\sum_{s=1}^{R}2^sQ_{\max}^s\log n\le(2Q_{\max})^{R+1}\log n=O(\log^2n),$$
since $R=\log\log\log n$.

Lemma 7:

For any $r$,

$$P(C_r\mid E_{r-1})\ge1-O(n^{-3/4}),\qquad P(D_r\mid E_{r-1})\ge1-O(n^{-3/4}).$$

Proof:

For the first claim, fix $u,v\in\partial G_r$. For any $w\in V_r$, the probability that the edges $(u,w)$ and $(v,w)$ both appear is $O(n^{-2})$. Now, $|V_r|\le n$ and Lemma 6 implies that $|\partial G_r|^2=O(\log^4n)$. Hence the result follows from the union bound over all triples $u,v,w$. For the second claim, the probability of an edge between any particular pair of vertices $u,v\in\partial G_r$ is $O(n^{-1})$. Lemma 6 implies that $|\partial G_r|^2=O(\log^4n)$, and the result follows from a union bound over all pairs $u,v$.

Moreover, we state a lemma to bound the total variation distance between the Binomial distribution and the Poisson distribution.

Lemma 8 (Lemma 5 in [53]):

If $m$ and $n$ are positive integers and $c>0$ is a positive constant, then

$$d_{TV}\big(\mathrm{Binom}(m,c/n),\ \mathrm{Poi}(c)\big)=O\left(\frac{\max\{1,|m-n|\}}{n}\right).$$
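As a numerical illustration of Lemma 8 (not a proof), the following Python snippet computes the total variation distance between $\mathrm{Binom}(m,c/n)$ and $\mathrm{Poi}(c)$ by direct summation of the two pmfs, evaluated via their standard recurrences; for $m=n$ the distance decays roughly like $1/n$.

from math import exp

def tv_binom_poisson(m, n, c, k_max=100):
    """Total variation distance between Binom(m, c/n) and Poi(c), truncated at k_max."""
    p = c / n
    b = (1 - p) ** m                  # Binom(m, p) pmf at 0
    q = exp(-c)                       # Poi(c) pmf at 0
    tv = abs(b - q)
    for k in range(k_max):
        # pmf recurrences: move from k to k+1
        b = b * (m - k) / (k + 1) * (p / (1 - p)) if k < m else 0.0
        q = q * c / (k + 1)
        tv += abs(b - q)
    return tv / 2

c = 2.0
for n in [100, 1000, 10000]:
    print(n, tv_binom_poisson(m=n, n=n, c=c))   # decays roughly like 1/n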
Now, we are ready to prove Proposition 10.

Proof of Proposition 10:

For any random variable $X$, let $\mathrm{dist}(X)$ denote its probability distribution. For a vertex $v\in V$, let $X_v$ denote its community label. Fix $r$ and suppose $E_{r-1}$ holds and $T_{r-1}=G_{r-1}$. Let $V^{(1)},\ldots,V^{(L)}$ denote the sets of vertices in $V$ with community labels $1,\ldots,L$, respectively, and similarly, let $V_{r-1}^{(1)},\ldots,V_{r-1}^{(L)}$ denote the sets of vertices in $V_{r-1}$ with community labels $1,\ldots,L$, respectively. Let us first provide a high-probability bound for $|V^{(1)}|,\ldots,|V^{(L)}|$. For each $i\in[L]$, by Hoeffding's inequality [62], we have

$$P\big(\big||V^{(i)}|-np_i\big|\ge n^{3/4}\big)\le2e^{-\Omega(n^{1/2})}.$$
It follows from the union bound that
$$P\big(\forall i\in[L],\ \big||V^{(i)}|-np_i\big|\le n^{3/4}\big)\ge1-2Le^{-\Omega(n^{1/2})}.$$
Combining with Lemma 6, conditioned on the event $E_{r-1}$ and the event $\{\forall i\in[L],\ ||V^{(i)}|-np_i|\le n^{3/4}\}$, the number $|V_{r-1}^{(i)}|$ of vertices in $V_{r-1}$ with community label $i$ can be upper and lower bounded as
$$np_i+n^{3/4}\ge|V^{(i)}|\ge|V_{r-1}^{(i)}|\ge|V^{(i)}|-|G_{r-1}|\ge np_i-n^{3/4}-O(\log^2n)$$
for any $i\in[L]$.
for any i[L] .

Next, let us analyze the distribution of $Y_v$, i.e., the number of neighbours in $V_{r-1}$ of a vertex $v\in\partial G_{r-1}$. Let $Y_v^{(1)},\ldots,Y_v^{(L)}$ denote the numbers of neighbours $v$ has in $V_{r-1}$ with community labels $1,\ldots,L$, respectively. Then $Y_v=\sum_{i=1}^{L}Y_v^{(i)}$, and $Y_v^{(1)},\ldots,Y_v^{(L)}$ are independent of each other given the community label $X_v$. We know that the conditional distribution of $Y_v^{(i)}$ given $X_v=j$ is $\mathrm{Binom}(|V_{r-1}^{(i)}|,Q_{ji}/n)$. By Lemma 8, we have $d_{TV}\big(\mathrm{Binom}(|V_{r-1}^{(i)}|,Q_{ji}/n),\ \mathrm{Poi}(p_iQ_{ji})\big)=O(n^{-1/4})$. For any $i\in[L]$, let $Z^{(i)}$ denote a Poisson random variable with mean $p_iQ_{ji}$, i.e., $Z^{(i)}\sim\mathrm{Poi}(p_iQ_{ji})$. Therefore, conditioned on $X_v=j$, for any $i\in[L]$ we can couple $Y_v^{(i)}$ with $Z^{(i)}$ such that $P(Y_v^{(i)}\ne Z^{(i)})=O(n^{-1/4})$. By a union bound over the $L$ communities, we can couple the $L$-tuples $(Y_v^{(1)},\ldots,Y_v^{(L)})$ and $(Z^{(1)},\ldots,Z^{(L)})$ such that $P\big((Y_v^{(1)},\ldots,Y_v^{(L)})\ne(Z^{(1)},\ldots,Z^{(L)})\big)=O(n^{-1/4})$. By Poisson superposition, it follows that $d_{TV}\big(\mathrm{dist}(Y_v\mid X_v=j),\ \mathrm{Poi}(\sum_{i=1}^{L}p_iQ_{ji})\big)=O(n^{-1/4})$. Recall our assumption that $\sum_{i=1}^{L}p_iQ_{ji}=\lambda$ for any $j\in[L]$ and that $Z_v\sim\mathrm{Poi}(\lambda)$. Moreover, $Y_v$ is conditionally independent of $G_{r-1}$ given the community label $X_v$. Therefore, we have

$$d_{TV}\big(\mathrm{dist}(Y_v\mid G_{r-1}=g_{r-1}),\ \mathrm{dist}(Z_v)\big)=d_{TV}\left(\mathrm{dist}(Y_v\mid X_v=j),\ \mathrm{Poi}\Big(\sum_{i=1}^{L}p_iQ_{ji}\Big)\right)=O(n^{-1/4})$$
for any possible realization $g_{r-1}$. Thus, we can couple $Y_v$ with $Z_v$ such that $P(Y_v\ne Z_v)=O(n^{-1/4})$. Then, by a union bound over all vertices in $\partial G_{r-1}$, we have
$$P\big(\exists v\in\partial G_{r-1}:Y_v\ne Z_v\big)\le|\partial G_{r-1}|\,P(Y_v\ne Z_v)=O(\log^2n)\cdot O(n^{-1/4})=O(n^{-1/5}).$$
The argument above shows that we can find a coupling such that, with probability at least $1-O(n^{-1/5})$, $Y_u=Z_u$ for every $u\in\partial G_{r-1}$. Moreover, Lemmas 6 and 7 imply that $C_r$, $D_r$, and $E_r$ hold simultaneously with probability at least $1-n^{-c}-O(n^{-3/4})$. Putting these together, we see that the hypothesis of Lemma 5 holds with probability at least $1-O(n^{-\min(c,1/5)})$. Thus,
$$P\big(G_r=T_r,\ E_r\ \big|\ G_{r-1}=T_{r-1},\ E_{r-1}\big)\ge1-O\big(n^{-\min(c,1/5)}\big).$$
For the base case, we have $P(E_0)\to1$ as $n\to\infty$ by the multiplicative Chernoff inequality, and we can certainly couple $G_0$ with $T_0$ since each is a single vertex. Therefore, applying a union bound over $r=1,\ldots,R$,
$$P(G_R\ne T_R)=P\big(\exists r\in[R]\ \text{s.t.}\ G_r\ne T_r\ \text{or}\ E_r^c\ \big|\ G_{r-1}=T_{r-1},E_{r-1}\big)\le R\cdot O\big(n^{-\min(c,1/5)}\big)=\Theta(\log\log\log n)\cdot O\big(n^{-\min(c,1/5)}\big)\to0$$
as $n\to\infty$, i.e., we can couple $G_R$ and $T_R$ such that $G_R=T_R$ with high probability. Therefore, $d_{TV}\big(\mathrm{Nbr}_R,\ \mathrm{GWT}(\mathrm{Poi}(\lambda))_R\big)\to0$ as $n\to\infty$.

With Proposition 10, we are ready to establish the local weak convergence of the stochastic block model.

Proposition 11 (Local weak limit of sparse SBMs):

Consider a sparse stochastic block model $\mathrm{SBM}(n,L,p,\frac1nQ)$ with $Qp=\lambda\mathbf{1}$ for some positive constant $\lambda$, where $\mathbf{1}$ denotes the all-ones vector of dimension $L\times1$. Let $\{A^n\}_{n=1}^\infty$ denote a sequence of random graphs with $A^n\sim\mathrm{SBM}(n,L,p,\frac1nQ)$. Let $U(A^n)$ be the rooted neighbourhood distribution of $A^n$. Then, the local weak limit of the sequence of graphs $\{A^n\}_{n=1}^\infty$ is given by $\mathrm{GWT}(\mathrm{Poi}(\lambda))$, i.e.,

$$U(A^n)\Rightarrow\mathrm{GWT}(\mathrm{Poi}(\lambda)).$$

Remark 14:

When $Q_{i,j}=c$ for all $i,j\in[L]$, the stochastic block model recovers the well-known local weak convergence result for the Erdős–Rényi model (see, e.g., [65, Theorem 3.12]).

Proof of Proposition 11:

We want to show that for any uniformly continuous and bounded function $f$,

$$\int f\,dU(A^n)-\int f\,d\,\mathrm{GWT}(\mathrm{Poi}(\lambda))\to0$$
as $n\to\infty$. Since $f$ is uniformly continuous on $\mathcal{G}$, for every $\epsilon>0$ there exists $\delta>0$ such that, for any pair of rooted graphs $[G_1,o_1]$ and $[G_2,o_2]\in\mathcal{G}$ with $d([G_1,o_1],[G_2,o_2])<\delta$, we have $|f([G_1,o_1])-f([G_2,o_2])|<\epsilon$. Recall that $d([G_1,o_1],[G_2,o_2]):=\frac{1}{1+\hat h}$, where $\hat h$ denotes the maximum number of matching layers between $[G_1,o_1]$ and $[G_2,o_2]$. Therefore, as long as $h>\frac1\delta-1$, we have $|f((G,o)_h)-f((G,o))|<\epsilon$. It follows that $|f([i,o])-f([g,o])|<\epsilon$ if $[i,o]_h=[g,o]$. Let $\mu\in\mathcal{P}(\mathcal{G})$ and assume $h>\frac1\delta-1$. We have
$$\begin{aligned}
\int f\,d\mu_h-\int f\,d\mu&=\sum_{[g,o]\in\mathcal{G}_h}f([g,o])\,\mu_h([g,o])-\sum_{[i,o]\in\mathcal{G}}f([i,o])\,\mu([i,o])\\
&\overset{(a)}{=}\sum_{[g,o]\in\mathcal{G}_h}\ \sum_{[i,o]\in\mathcal{G}:\,[i,o]_h=[g,o]}\big(f([g,o])-f([i,o])\big)\,\mu([i,o])\\
&\le\sum_{[g,o]\in\mathcal{G}_h}\ \sum_{[i,o]\in\mathcal{G}:\,[i,o]_h=[g,o]}\big|f([g,o])-f([i,o])\big|\,\mu([i,o])\\
&\le\sum_{[g,o]\in\mathcal{G}_h}\ \sum_{[i,o]\in\mathcal{G}:\,[i,o]_h=[g,o]}\epsilon\,\mu([i,o])=\epsilon,
\end{aligned}$$
where (a) follows since $\mu_h([g,o])=\sum_{[i,o]\in\mathcal{G}:\,[i,o]_h=[g,o]}\mu([i,o])$. Therefore, $|\int f\,dU(A^n)_h-\int f\,dU(A^n)|<\epsilon$ and $|\int f\,d\,\mathrm{GWT}(\mathrm{Poi}(\lambda))_h-\int f\,d\,\mathrm{GWT}(\mathrm{Poi}(\lambda))|<\epsilon$. By Proposition 10, there exists $n_0$ such that for all $n\ge n_0$ and $R=\log\log\log n$, we have $\delta\big(\mathrm{GWT}(\mathrm{Poi}(\lambda))_R,\ U(A^n)_R\big)<\epsilon$. Since $f$ is bounded, we have $|\int f\,d\,\mathrm{GWT}(\mathrm{Poi}(\lambda))_R-\int f\,dU(A^n)_R|<\epsilon$ for $n$ large enough. Therefore, taking $n$ large enough that $\log\log\log n>\frac1\delta-1$ and $|\int f\,d\,\mathrm{GWT}(\mathrm{Poi}(\lambda))_h-\int f\,dU(A^n)_h|<\epsilon$ with $h=R$, we have
$$\begin{aligned}
\int f\,dU(A^n)-\int f\,d\,\mathrm{GWT}(\mathrm{Poi}(\lambda))&\le\left|\int f\,dU(A^n)_h-\int f\,dU(A^n)\right|+\left|\int f\,d\,\mathrm{GWT}(\mathrm{Poi}(\lambda))_h-\int f\,d\,\mathrm{GWT}(\mathrm{Poi}(\lambda))\right|\\
&\quad+\left|\int f\,d\,\mathrm{GWT}(\mathrm{Poi}(\lambda))_h-\int f\,dU(A^n)_h\right|<3\epsilon,
\end{aligned}$$
which completes the proof.

C. BC Entropy

In this section, we review the notion of BC entropy introduced in [42], which was shown to be the fundamental limit of universal lossless compression for a certain family of graphs [66].

For a Polish space $\Omega$, let $\mathcal{P}(\Omega)$ denote the set of all Borel probability measures on $\Omega$. For a Borel set $A$ in $\Omega$, we define the $\epsilon$-extension of $A$, denoted $A^\epsilon$, as the union of the open balls of radius $\epsilon$ centered at the points of $A$. For two probability measures $\mu$ and $\nu$ in $\mathcal{P}(\Omega)$, we define the Lévy–Prokhorov distance $d_{LP}(\mu,\nu):=\inf\{\epsilon>0:\mu(A)\le\nu(A^\epsilon)+\epsilon\ \text{and}\ \nu(A)\le\mu(A^\epsilon)+\epsilon,\ \forall A\in\mathcal{B}(\Omega)\}$, where $\mathcal{B}(\Omega)$ denotes the Borel sigma algebra of $\Omega$. Let $\rho\in\mathcal{P}(\mathcal{G})$, let $\bar d$ be the expected number of neighbours of the root under the law $\rho$, and let $m=m(n)$ be a sequence such that $m/n\to\bar d/2$ as $n\to\infty$. Define $\mathcal{G}_{n,m}$ to be the set of graphs with $n$ vertices and $m$ edges. For $\epsilon>0$, define

$$\mathcal{G}_{n,m}(\rho,\epsilon)=\big\{G\in\mathcal{G}_{n,m}:U(G)\in B(\rho,\epsilon)\big\},$$
where $B(\rho,\epsilon)$ denotes the open ball of radius $\epsilon$ around $\rho$ with respect to the Lévy–Prokhorov metric. Now, we define the $\epsilon$-upper BC entropy of $\rho$ as
$$\overline\Sigma(\rho,\epsilon)=\limsup_{n\to\infty}\frac{\log|\mathcal{G}_{n,m}(\rho,\epsilon)|-m\log n}{n}$$
and define the upper BC entropy of $\rho$ as
$$\overline\Sigma(\rho)=\lim_{\epsilon\downarrow0}\overline\Sigma(\rho,\epsilon).$$
Similarly, we define the $\epsilon$-lower BC entropy $\underline\Sigma(\rho,\epsilon)$ and the lower BC entropy $\underline\Sigma(\rho)$ with $\limsup$ replaced by $\liminf$ in the definitions above. If $\rho$ is such that $\overline\Sigma(\rho)=\underline\Sigma(\rho)$, then this common limit is called the BC entropy of $\rho$:
$$\Sigma(\rho):=\overline\Sigma(\rho)=\underline\Sigma(\rho).$$

The following lemma states the BC entropy of the Galton–Watson tree distribution.

Lemma 9 (Corollary 1.4 of [42]):

The BC entropy of the Galton–Watson tree distribution $\mathrm{GWT}(\mathrm{Poi}(\lambda))$ is given by

$$\Sigma\big(\mathrm{GWT}(\mathrm{Poi}(\lambda))\big)=\frac{\lambda}{2}\log\frac{e}{\lambda}\ \text{bits}.$$

D. Achieving BC Entropy in the Sparse Regime

With the lemma above, we can state a performance guarantee for our algorithm in terms of the BC entropy. It is a theorem analogous to Proposition 1 in [66].

Theorem 5:

Let $A^n\sim\mathrm{SBM}(n,L,p,\frac1nQ)$ with $Qp=\lambda\mathbf{1}$ for some positive constant $\lambda$ and the all-ones vector $\mathbf{1}$ of dimension $L\times1$. Let $m=\binom{n}{2}\frac{\lambda}{n}$ be the expected number of edges in the model. Then, our compression algorithm achieves the BC entropy of the local weak limit of stochastic block models in the sense that

$$\limsup_{n\to\infty}\frac{\mathbb{E}[\ell(C_k(A^n))]-m\log n}{n}\le\Sigma\big(\mathrm{GWT}(\mathrm{Poi}(\lambda))\big).$$

Proof:

By our proof of Proposition 9, we have

$$\mathbb{E}[\ell(C_k(A^n))]\le\binom{n/k}{2}H(B_{12})+2^{k^2}\log(2en^3)+nk^2H(A_{12})+4+\mathbb{E}(2N_r\log n).$$
Notice that
$$\begin{aligned}
\binom{n/k}{2}H(B_{12})&\le\binom{n/k}{2}k^2H(A_{12})=\binom{n/k}{2}k^2\,h(\lambda/n)\\
&\overset{(a)}{=}\binom{n/k}{2}k^2\left(\frac1n\lambda\log\frac{ne}{\lambda}+o\Big(\frac1n\Big)\right)\\
&\overset{(b)}{\le}\binom{n}{2}\left(\frac1n\lambda\log n+\frac1n\lambda\log\frac{e}{\lambda}+o\Big(\frac1n\Big)\right)\\
&=\binom{n}{2}\frac1n\lambda\log n+\frac{\lambda\log\frac{e}{\lambda}}{2}\,n+o(n)\\
&\overset{(c)}{=}m\log n+n\,\Sigma\big(\mathrm{GWT}(\mathrm{Poi}(\lambda))\big)+o(n),
\end{aligned}$$
where (a) follows since $h(p)=p\log\frac{e}{p}-\frac{\log e}{2}p^2+o(p^2)$, (b) follows since $\binom{n/k}{2}k^2\le\binom{n}{2}$, and (c) follows from Lemma 9. It then suffices to show that the remaining terms in the upper bound on $\mathbb{E}[\ell(C_k(A^n))]$ are all $o(n)$. Indeed, we have
$$2^{k^2}\log(2en^3)\le2^{\delta\log n}\log(2en^3)=n^\delta\log(2en^3)=o(n)$$
since $\delta<1$, and
$$nk^2H(A_{12})+4=nk^2h(\lambda/n)+4=nk^2\left(\frac1n\lambda\log\frac{ne}{\lambda}+o\Big(\frac1n\Big)\right)+4\le n\delta\log n\left(\frac1n\lambda\log\frac{ne}{\lambda}+o\Big(\frac1n\Big)\right)+4=\delta\log n\left(\lambda\log\frac{ne}{\lambda}\right)+o(\log n)=o(n),$$
and
$$\mathbb{E}(2N_r\log n)=\frac1n\,p^TQp\left((n-\tilde n)\tilde n+\binom{n-\tilde n}{2}\right)2\log n=O(k\log n)=o(n).$$

Theorem 5 shows that in the sparse regime $f(n)=\frac1n$, our compressor achieves the BC entropy of the Galton–Watson tree, which is the local weak convergence limit of the stochastic block model.

SECTION VII.

Comparison to Alternative Methods

In this section, we consider three alternative compression methods which are natural attempts at designing universal graph compressors for stochastic block models. We demonstrate why they are not applicable to our problem or not universal in the same sense as the proposed algorithm.

One intuitive attempt is to convert the entries of the adjacency matrix into a one-dimensional sequence according to some ordering and then apply a universal compressor to the sequence. Let us take a closer look at the correlation among the entries of the adjacency matrix and explain why existing universal sequence compressors developed for stationary processes may not be immediately applicable for certain orderings of the entries. Compressing $A^n$ entails compressing $A_{12},\ldots,A_{1,n},A_{23},\ldots,A_{n-1,n}$, i.e., the bits in the upper triangle of $A^n$. Clearly, these are not independent (because of the dependency through $X_1^n$), so one cannot use any of the compressors universal for the class of i.i.d. processes to compress $A^n$. Moreover, we can show that some of the most natural orders of listing these $\binom{n}{2}$ bits result in a nonstationary sequence.

For example, consider listing the bits in the upper triangle row-wise (i.e., first listing the bits in the first row, followed by the bits in the second, and so on, ending with $A_{n-1,n}$), resulting in the sequence $A_{12},\ldots,A_{1,n},A_{23},\ldots,A_{2,n},\ldots,A_{n-1,n}$. This can be seen to be nonstationary, for example, by considering the case $n=4$, $L=2$, $Q_{11}=Q_{22}=1$, $Q_{12}=0$. In this case the row-wise ordering is $A_{12},A_{13},A_{14},A_{23},A_{24},A_{34}$, and we have $P(A_{12}=1,A_{13}=0,A_{14}=1)>0$ but $P(A_{23}=1,A_{24}=0,A_{34}=1)=0$. Similar counterexamples establish the nonstationarity of the column-by-column and diagonal-by-diagonal orderings. Therefore, it is unclear whether there exists an ordering that converts the entries of the adjacency matrix into a stationary sequence.
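The counterexample can also be checked by simulation. The following Python sketch (with the community prior $p=(1/2,1/2)$, which the text leaves unspecified, assumed for illustration) estimates the probability of the pattern $(1,0,1)$ at the two triples of positions: the first is close to $1/8$, the second is exactly zero.

import random

rng = random.Random(0)
Q = [[1.0, 0.0], [0.0, 1.0]]          # Q11 = Q22 = 1, Q12 = 0

def sample_adjacency(n=4):
    X = [rng.randrange(2) for _ in range(n)]           # community labels, p = (1/2, 1/2)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = int(rng.random() < Q[X[i]][X[j]])
    return A

hits_first, hits_second, trials = 0, 0, 100000
for _ in range(trials):
    A = sample_adjacency()
    if (A[0][1], A[0][2], A[0][3]) == (1, 0, 1):       # positions (A12, A13, A14)
        hits_first += 1
    if (A[1][2], A[1][3], A[2][3]) == (1, 0, 1):       # positions (A23, A24, A34)
        hits_second += 1
print(hits_first / trials)    # close to 1/8 > 0
print(hits_second / trials)   # exactly 0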

Another potential approach is to first recover the community labels of all vertices, which allows one to rearrange the entries of the adjacency matrix into i.i.d. groups, and then apply a universal sequence compressor to each group of entries. There are two main issues with this approach. First, even when exact recovery of the labels is possible, to the best of our knowledge, all algorithms for community detection require knowledge of graph parameters such as the number of communities and/or the edge probabilities [67]. Second, it is known that exact recovery is information-theoretically impossible when the edge probabilities are $o(\frac{\log n}{n})$, but our compressor is universal even in this regime. This implies that universal compression is a fundamentally easier task than community detection for stochastic block models.

Another approach is based on run-length coding for the adjacency matrix $A^n$, which stores the run lengths of one of the symbols for the entire sequence, while storing the other symbol uncoded. For example, the sequence AAAABAABBAA, when encoded using the run lengths of the As, would be stored as (4, B, 2, B, 0, B, 2), so that the run lengths of As between consecutive Bs are stored; see the sketch after this paragraph. When applied to our problem, the expected length of this method can be shown to be upper bounded by $p^TQp\binom{n}{2}f(n)\big[\log\big(1/(f(n)Q_{\min})\big)+O(\log\log n)\big]$ (by suitably modifying [68]), where $Q_{\min}=\min_{i,j}Q_{i,j}$. Thus it fails to achieve the leading term in the entropy when $f(n)=\Omega(1/\log n)$, since the minimum edge probability is $\Theta(f(n))$ and the term with $\log\log n$ then contributes to the leading term. The analysis also fails when there are zero entries in $Q$ (since then $Q_{\min}=0$), which is a reasonable choice of parameters for the SBM. In contrast, the analysis for $C_k$ is valid even when there are zero entries in $Q$, as long as $\max_{ij}Q_{ij}=\Theta(1)$.
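For concreteness, a minimal Python sketch of the run-length encoding described above (our own illustration, not a compressor from the literature) reproduces the stated example.

def run_length_encode(s, run_symbol="A", other="B"):
    """Store run lengths of `run_symbol` between occurrences of `other`."""
    out, run = [], 0
    for ch in s:
        if ch == run_symbol:
            run += 1
        else:
            out.extend([run, other])
            run = 0
    out.append(run)
    return out

print(run_length_encode("AAAABAABBAA"))
# [4, 'B', 2, 'B', 0, 'B', 2]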

SECTION VIII.

Concluding Remarks

In this paper, we constructed a compressor that is universal for the class of stochastic block models with connection probabilities ranging from $\Omega(\frac{1}{n^{2-\epsilon}})$ to $O(1)$. Many intriguing open problems remain. Firstly, even though our algorithm takes polynomial time, $O(n^2)$ may be restrictive when $n$ is very large, and a lower-complexity method is desirable. Secondly, a tight characterization of the minimax redundancy of this problem is of interest. Finally, we have considered the problem of universal compression in the stochastic setting, where we assumed that our graph is generated from a random graph model. This leads naturally to the question of the individual-graph setting, where the redundancy must be established relative to a class of compressors. Each of these is a promising avenue for further study.

ACKNOWLEDGMENT

The authors would like to thank the associate editor and the anonymous reviewers for their invaluable contributions and constructive suggestions, which have significantly enhanced the quality and clarity of this article. The author Chi Wang would like to thank Richard Peng for his suggestion in writing. The author Lele Wang would like to thank Emmanuel Abbe and Tsachy Weissman for stimulating discussions in the initial phase of the work. She is grateful to Young-Han Kim and Ofer Shayevitz for their interest and encouragement in this result.

Appendix A

SECTION A.

Proof of Lemma 4

Let $p_1=N_1/N,\ p_2=N_2/N,\ \ldots,\ p_m=N_m/N$. Notice that $\sum_{i=1}^{m}p_i=1$, so $(p_1,\ldots,p_m)$ can be viewed as a probability distribution, whose entropy is $H(p_1,\ldots,p_m)=-\sum_{i=1}^{m}p_i\log p_i$. First, we consider the case when $N_1,N_2,\ldots,N_m$ are all positive and none of them equals $N$. By Stirling's approximation for the factorial, $\sqrt{2\pi n}\,(\frac{n}{e})^ne^{1/(12n+1)}\le n!\le\sqrt{2\pi n}\,(\frac{n}{e})^ne^{1/12n}$, we can bound

$$\begin{aligned}
\binom{N}{N_1,N_2,\ldots,N_m}&\ge\frac{\sqrt{2\pi N}\,N^N\exp\left(\frac{1}{12N+1}-\frac{1}{12N_1}-\frac{1}{12N_2}-\cdots-\frac{1}{12N_m}\right)}{(2\pi)^{m/2}\,(N_1N_2\cdots N_m)^{1/2}\,N_1^{N_1}N_2^{N_2}\cdots N_m^{N_m}}\\
&=\exp\left(\frac{1}{12N+1}-\frac{1}{12N_1}-\frac{1}{12N_2}-\cdots-\frac{1}{12N_m}\right)(2\pi)^{-\frac{m-1}{2}}\,(p_1p_2\cdots p_m)^{-1/2}\,N^{-\frac{m-1}{2}}\,2^{NH(p_1,p_2,\ldots,p_m)}.
\end{aligned}$$
Similarly, we have
$$\binom{2N}{2N_1,\ldots,2N_m}\le\exp\left(\frac{1}{24N}-\frac{1}{24N_1+1}-\frac{1}{24N_2+1}-\cdots-\frac{1}{24N_m+1}\right)(2\pi)^{-\frac{m-1}{2}}\,2^{-\frac{m-1}{2}}\,(p_1p_2\cdots p_m)^{-1/2}\,N^{-\frac{m-1}{2}}\,2^{2NH(p_1,\ldots,p_m)}.$$
Consider the function
$$f(N_1,N_2,\ldots,N_m)=\frac{1}{12N+1}-\frac{1}{24N}+\left(\frac{1}{24N_1+1}-\frac{1}{12N_1}\right)+\left(\frac{1}{24N_2+1}-\frac{1}{12N_2}\right)+\cdots+\left(\frac{1}{24N_m+1}-\frac{1}{12N_m}\right)$$
and the function
$$g(n)=\frac{1}{24n+1}-\frac{1}{12n},$$
where $n$ is a positive integer. The function $g(n)$ is minimized at $n=1$ with $\min g(n)=1/25-1/12$, so we can bound $f(N_1,N_2,\ldots,N_m)\ge\frac{1}{12N+1}-\frac{1}{24N}+(1/25-1/12)m$. Finally, we are ready to prove the lemma:
$$\begin{aligned}
\frac{\binom{N}{N_1,N_2,\ldots,N_m}\,\theta_1^{N_1}\cdots\theta_m^{N_m}}{\binom{2N}{2N_1,2N_2,\ldots,2N_m}\,\theta_1^{2N_1}\cdots\theta_m^{2N_m}}&\ge2^{\frac{m-1}{2}}\exp\big(f(N_1,N_2,\ldots,N_m)\big)\,\frac{2^{-NH(p_1,\ldots,p_m)}}{\theta_1^{N_1}\cdots\theta_m^{N_m}}\\
&\ge2^{\frac{m-1}{2}}\exp\left(\frac{1}{12N+1}-\frac{1}{24N}+\Big(\frac{1}{25}-\frac{1}{12}\Big)m\right)2^{ND_{KL}(p\|\theta)}\\
&=2^{\frac{m-1}{2}}\,2^{ND_{KL}(p\|\theta)}\,2^{\log e\left(\frac{1}{12N+1}-\frac{1}{24N}+\left(\frac{1}{25}-\frac{1}{12}\right)m\right)}.
\end{aligned}$$
Notice that $\frac{1}{12N+1}-\frac{1}{24N}$ goes to zero as $N\to\infty$, $\frac{m-1}{2}>\big(\frac{1}{12}-\frac{1}{25}\big)m\log e$, and $D_{KL}(p\|\theta)\ge0$. Therefore, in this case,
$$\frac{\binom{N}{N_1,N_2,\ldots,N_m}\,\theta_1^{N_1}\cdots\theta_m^{N_m}}{\binom{2N}{2N_1,2N_2,\ldots,2N_m}\,\theta_1^{2N_1}\cdots\theta_m^{2N_m}}\ge1.$$
When one of $\{N_i\}_{i=1}^{m}$ equals $N$, without loss of generality we assume that $N_1=N$. We have
$$\frac{\binom{N}{N_1,N_2,\ldots,N_m}\,\theta_1^{N_1}\cdots\theta_m^{N_m}}{\binom{2N}{2N_1,2N_2,\ldots,2N_m}\,\theta_1^{2N_1}\cdots\theta_m^{2N_m}}=\frac{1}{\theta_1^{N_1}\cdots\theta_m^{N_m}}\ge1.$$
When $k$ of the numbers $N_1,N_2,\ldots,N_m$ equal zero, we can simply remove these values and consider the case with alphabet size $m-k$, which yields the same result.

SECTION B.

Joint Distributions Induced by the Laplace and KT Probability Assignments

In this section, we re-derive the well-known joint distributions of the Laplace and KT probability assignments for completeness. First, we consider the Laplace probability assignment. Recall that we defined

$$q_L(X_{j+1}=i\mid X^j=x^j)=\frac{N_i(x^j)+1}{j+m}\qquad\text{for each }i\in[m].$$
Consider a sequence $x^N$ in which symbol $i$ occurs $N_i$ times for each $i\in[m]$. It follows that
$$q_L(x^N)=\prod_{j=0}^{N-1}q_L(X_{j+1}=x_{j+1}\mid X^j=x^j)=\prod_{j=0}^{N-1}\frac{N_{x_{j+1}}(x^j)+1}{j+m}\overset{(a)}{=}\frac{N_1!\,N_2!\cdots N_m!}{m(m+1)\cdots(m+N-1)}.\tag{82}$$
To see (a), notice that symbol $i$ appears $N_i$ times in the sequence $x^N$, so the factors $1,2,\ldots,N_i$ appear in the numerator for each $i\in[m]$. Dividing both the numerator and the denominator of (82) by $N!$, we finally get
$$q_L(x^N)=\frac{N_1!\,N_2!\cdots N_m!}{N!}\cdot\frac{1}{\binom{N+m-1}{m-1}}.$$

Now, we consider the KT probability assignment. Recall that

$$q_{KT}(X_{j+1}=i\mid X^j=x^j)=\frac{N_i(x^j)+1/2}{j+m/2}\qquad\text{for each }i\in[m].$$
Consider a sequence $x^N$ in which symbol $i$ occurs $N_i$ times for each $i\in[m]$. It follows that
$$q_{KT}(x^N)=\prod_{j=0}^{N-1}q_{KT}(X_{j+1}=x_{j+1}\mid X^j=x^j)=\prod_{j=0}^{N-1}\frac{N_{x_{j+1}}(x^j)+1/2}{j+m/2}=\prod_{j=0}^{N-1}\frac{2N_{x_{j+1}}(x^j)+1}{2j+m}\overset{(b)}{=}\frac{(2N_1-1)!!\,(2N_2-1)!!\cdots(2N_m-1)!!}{m(m+2)\cdots(m+2N-2)}.\tag{83}$$
To see (b), notice that symbol $i$ appears $N_i$ times in the sequence $x^N$, so the factors $1,3,\ldots,2N_i-1$ appear in the numerator for each $i\in[m]$.

    References

    1.
    R. A. Rossi and R. Zhou, "GraphZIP: A clique-based sparse graph compression method", J. Big Data, vol. 5, no. 1, pp. 1-14, Dec. 2018.
    2.
    Y. Lim, U. Kang and C. Faloutsos, "SlashBurn: Graph compression and mining beyond caveman communities", IEEE Trans. Knowl. Data Eng., vol. 26, no. 12, pp. 3077-3089, Dec. 2014.
    3.
    P. Boldi and S. Vigna, "The webgraph framework I: Compression techniques", Proc. 13th Int. Conf. World Wide Web, pp. 595-602, May 2004.
    4.
    T. C. Conway and A. J. Bromage, "Succinct data structures for assembling large genomes", Bioinformatics, vol. 27, no. 4, pp. 479-486, Feb. 2011.
    5.
    M. Hayashida and T. Akutsu, "Comparing biological networks via graph compression", BMC Syst. Biol., vol. 4, no. S2, pp. 1-11, Sep. 2010.
    6.
    F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi and P. Raghavan, "On compressing social networks", Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, pp. 219-228, Jun. 2009.
    7.
    G. Navarro, "Compressing web graphs like texts", 2007.
    8.
    K. Sadakane, "New text indexing functionalities of the compressed suffix arrays", J. Algorithms, vol. 48, no. 2, pp. 294-313, Sep. 2003.
    9.
    N. R. Brisaboa, S. Ladra and G. Navarro, "k2-trees for compact web graph representation", Proc. 16th Int. Symp. String Process. Inf. Retr. (SPIRE), pp. 18-30, 2009.
    10.
    A. Farzan and J. I. Munro, "Succinct encoding of arbitrary graphs", Theor. Comput. Sci., vol. 513, pp. 38-52, Nov. 2013.
    11.
    M. Besta and T. Hoefler, "Survey and taxonomy of lossless graph compression and space-efficient graph representations", arXiv:1806.01799, 2018.
    12.
    P.-S. Laplace, Pierre-Simon Laplace Philosophical Essay on Probabilities, New York, NY, USA:Springer, 1995.
    13.
    R. Krichevsky and V. Trofimov, "The performance of universal encoding", IEEE Trans. Inf. Theory, vol. IT-27, no. 2, pp. 199-207, Mar. 1981.
    14.
    J. Ziv and A. Lempel, "A universal algorithm for sequential data compression", IEEE Trans. Inf. Theory, vol. IT-23, no. 3, pp. 337-343, May 1977.
    15.
    J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding", IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530-536, Sep. 1978.
    16.
    M. Effros, K. Visweswariah, S. R. Kulkarni and S. Verdú, "Universal lossless source coding with the burrows wheeler transform", IEEE Trans. Inf. Theory, vol. 48, no. 5, pp. 1061-1081, May 2002.
    17.
    F. M. J. Willems, Y. M. Shtarkov and T. J. Tjalkens, "The context-tree weighting method: Basic properties", IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 653-664, May 1995.
    18.
    Y. Polyanskiy and Y. Wu, "Lecture notes on information theory", vol. 6, pp. 7, 2014.
    19.
    A. Frieze and M. Karonski, Introduction to Random Graphs, Cambridge, U.K.:Cambridge Univ. Press, 2015.
    20.
    G. Turán, "On the succinct representation of graphs", Discrete Appl. Math., vol. 8, no. 3, pp. 289-294, Jul. 1984.
    21.
    M. Naor, "Succinct representation of general unlabeled graphs", Discrete Appl. Math., vol. 28, no. 3, pp. 303-307, Sep. 1990.
    22.
    A. Lempel and J. Ziv, "Compression of two-dimensional data", IEEE Trans. Inf. Theory, vol. IT-32, no. 1, pp. 2-8, Jan. 1986.
    23.
    Y. Choi and W. Szpankowski, "Compression of graphical structures: Fundamental limits algorithms and experiments", IEEE Trans. Inf. Theory, vol. 58, no. 2, pp. 620-638, Feb. 2012.
    24.
    E. Abbe, "Graph compression: The effect of clusters", Proc. 54th Annu. Allerton Conf. Commun. Control Comput. (Allerton), pp. 1-8, Sep. 2016.
    25.
    A. R. Asadi, E. Abbe and S. Verdú, "Compressing data on graphs with clusters", Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 1583-1587, Jun. 2017.
    26.
    J. C. Kieffer, E.-H. Yang and W. Szpankowski, "Structural complexity of random binary trees", Proc. IEEE Int. Symp. Inf. Theory, pp. 635-639, Jun. 2009.
    27.
    Z. Golębiewski, A. Magner and W. Szpankowski, "Entropy and optimal compression of some general plane trees", ACM Trans. Algorithms, vol. 15, no. 1, pp. 1-23, Oct. 2018.
    28.
    A. Magner, K. Turowski and W. Szpankowski, "Lossless compression of binary trees with correlated vertex names", IEEE Trans. Inf. Theory, vol. 64, no. 9, pp. 6070-6080, Sep. 2018.
    29.
    T. Łuczak, A. Magner and W. Szpankowski, "Compression of preferential attachment graphs", Proc. IEEE Int. Symp. Inf. Theory, pp. 1697-1701, Jul. 2019.
    30.
    K. Turowski, A. Magner and W. Szpankowski, "Compression of dynamic graphs generated by a duplication model", Proc. 56th Annu. Allerton Conf. Commun. Control Comput. (Allerton), pp. 1089-1096, Oct. 2018.
    31.
    I. Kontoyiannis, Y. H. Lim, K. Papakonstantinopoulou and W. Szpankowski, "Symmetry and the entropy of small-world structures and graphs", Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 3026-3031, Jul. 2021.
    32.
    J. Zhang, E.-H. Yang and J. C. Kieffer, "A universal grammar-based code for lossless compression of binary trees", IEEE Trans. Inf. Theory, vol. 60, no. 3, pp. 1373-1386, Mar. 2014.
    33.
    M. Ganardi, D. Hucke, M. Lohrey and L. Seelbach Benkner, "Universal tree source coding using grammar-based compression", IEEE Trans. Inf. Theory, vol. 65, no. 10, pp. 6399-6413, Oct. 2019.
    34.
    P. Delgosha and V. Anantharam, "Universal lossless compression of graphical data", IEEE Trans. Inf. Theory, vol. 66, no. 11, pp. 6962-6976, Nov. 2020.
    35.
    P. Delgosha and V. Anantharam, "A universal low complexity compression algorithm for sparse marked graphs", Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 2349-2354, Jun. 2020.
    36.
    P. Delgosha and V. Anantharam, "A universal lossless compression method applicable to sparse graphs and heavy-tailed sparse graphs", Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 2792-2797, Jul. 2021.
    37.
    T. P. Peixoto, "Parsimonious module inference in large networks", Phys. Rev. Lett., vol. 110, no. 14, Apr. 2013, [online] Available: https://link.aps.org/doi/10.1103/PhysRevLett.110.148701.
    38.
    T. P. Peixoto, "Hierarchical block structures and high-resolution model selection in large networks", Phys. Rev. X, vol. 4, no. 1, Mar. 2014, [online] Available: https://link.aps.org/doi/10.1103/PhysRevX.4.011047.
    39.
    T. P. Peixoto, "Model selection and hypothesis testing for large-scale network models with overlapping groups", Phys. Rev. X, vol. 5, no. 1, Mar. 2015, [online] Available: https://link.aps.org/doi/10.1103/PhysRevX.5.011033.
    40.
    T. P. Peixoto, "Nonparametric Bayesian inference of the microcanonical stochastic block model", Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 95, no. 1, Jan. 2017, [online] Available: https://link.aps.org/doi/10.1103/PhysRevE.95.012317.
    41.
    R. Bustin and O. Shayevitz, "On lossy compression of directed graphs", IEEE Trans. Inf. Theory, vol. 68, no. 4, pp. 2101-2122, Apr. 2022.
    42.
    C. Bordenave and P. Caputo, "Large deviations of empirical neighborhood distribution in sparse random graphs", Probab. Theory Rel. Fields, vol. 163, no. 1, pp. 149-222, Nov. 2014.
    43.
    P. Delgosha and V. Anantharam, "A universal lossless compression method applicable to sparse graphs and heavy-tailed sparse graphs", arXiv:2107.08350, 2021.
    44.
    D. Marpe, H. Schwarz and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard", IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 620-636, Jul. 2003.
    45.
    I. H. Witten, R. M. Neal and J. G. Cleary, "Arithmetic coding for data compression", Commun. ACM, vol. 30, no. 6, pp. 520-540, Jun. 1987.
    46.
    A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks", Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, pp. 855-864, Aug. 2016.
    47.
    L. Tang and H. Liu, "Relational learning via latent social dimensions", Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, pp. 817-826, Jun. 2009.
    48.
    S. Nandanwar and M. N. Murty, "Structural neighborhood based classification of nodes in a network", Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, pp. 1085-1094, Aug. 2016.
    49.
    J. Shun and G. E. Blelloch, "Ligra: A lightweight graph processing framework for shared memory", Proc. 18th ACM SIGPLAN Symp. Princ. Pract. Parallel Program. (PPoPP), pp. 135-146, 2013.
    50.
    J. Shun, L. Dhulipala and G. E. Blelloch, "Smaller and faster: Parallel processing of compressed graphs with Ligra+", Proc. Data Compress. Conf., pp. 403-412, Apr. 2015.
    51.
    E. Abbe, A. S. Bandeira and G. Hall, "Exact recovery in the stochastic block model", IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 471-487, Jan. 2016.
    52.
    E. Abbe and C. Sandon, "Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery", Proc. IEEE 56th Annu. Symp. Found. Comput. Sci., pp. 670-688, Oct. 2015.
    53.
    E. Mossel, J. Neeman and A. Sly, "Reconstruction and estimation in the planted partition model", Probab. Theory Rel. Fields, vol. 162, no. 3, pp. 431-461, Aug. 2015.
    54.
    T. Courtade, Properties of the Binary Entropy Function, 2012, [online] Available: https://blogs.princeton.edu/blogit/2012/10/26/properties-of-the-binary-entropy-function/.
    55.
    S. Lauritzen, A. Rinaldo and K. Sadeghi, "Random networks graphical models and exchangeability", J. Roy. Stat. Soc. B Stat. Methodol., vol. 80, no. 3, pp. 481-508, Jun. 2018.
    56.
    P. Orbanz and D. M. Roy, "Bayesian models of graphs arrays and other exchangeable random structures", IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 437-461, Feb. 2015.
    57.
    N. Merhav and M. Feder, "A strong version of the redundancy-capacity theorem of universal coding", IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 714-722, May 1995.
    58.
    D. Aldous and R. Lyons, "Processes on unimodular random networks", Electron. J. Probab., vol. 12, pp. 1454-1508, Mar. 2006.
    59.
    P. Billingsley, Convergence of Probability Measures, New York, NY, USA:Wiley, 1999.
    60.
    S. Roch, Modern Discrete Probability: An Essential Toolkit, 2020, [online] Available: https://www.math.wisc.edu/~roch/mdp/.
    61.
    H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations", Ann. Math. Statist., vol. 23, no. 4, pp. 493-507, Dec. 1952.
    62.
    W. Hoeffding, "Probability inequalities for sums of bounded random variables", J. Amer. Stat. Assoc., vol. 58, no. 301, pp. 13-30, Mar. 1963, [online] Available: https://www.jstor.org/stable/2282952.
    63.
    A. Bhatt, Z. Wang, C. Wang and L. Wang, "Universal graph compression: Stochastic block models", arXiv:2006.02643, 2020.
    64.
    R. van der Hofstad, Random Graphs and Complex Networks, vol. 2, 2023, [online] Available: https://www.win.tue.nl/~rhofstad/NotesRGCNII_colleagues_25_04_2022.pdf and https://www.win.tue.nl/rhofstad/NotesRGCNII.pdf.
    65.
    C. Bordenave, Lecture Notes on Random Graphs and Probabilistic Combinatorial Optimization, 2016, [online] Available: https://www.math.univ-toulouse.fr/~bordenave/coursRG.pdf.
    66.
    P. Delgosha and V. Anantharam, "Universal lossless compression of graphical data", Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 1578-1582, Jun. 2017.
    67.
    E. Abbe, "Community detection and stochastic block models: Recent developments", J. Mach. Learn. Res., vol. 18, no. 1, pp. 6446-6531, Jan. 2017.
    68.
    M. F. Schilling, "The longest run of heads", College Math. J., vol. 21, no. 3, pp. 196-207, May 1990.