Many classes of real-world signals, including speech [1], financial time series [2], and natural images [3], [4], exhibit complex and non-Gaussian statistical dependencies. In such settings, it is well known that classical approaches to denoising and detection, many of which are based on assumptions of independence or joint Gaussianity, may lead to markedly suboptimal performance. It is therefore of considerable interest to develop and explore signal processing methods that are capable of modeling and exploiting a broader class of dependency structures. One such class of methods is provided by graphical models, a formalism in which random variables are associated with the nodes of a graph and the edges of the graph represent statistical dependencies among these variables. Graphical models have proven useful in a wide range of signal processing problems; we refer the reader to the survey papers [5] and [6] for an overview.
In the graphical models most commonly encountered in signal processing applications, the underlying graph is a chain or a tree. For such cycle-free graphs, efficient recursive algorithms [5]–[7] are available for calculating various statistical quantities of interest (e.g., likelihoods and other marginal probabilities). The elegance and familiarity of these recursive algorithms should not, however, obscure the fact that chains and trees capture rather limited forms of statistical dependency, and there are numerous applications—among them image denoising [4] and sensor fusion [8]—that would be better served by the richer class of graphical models in which cycles are allowed in the underlying graph. Accordingly, the algorithmic treatment of such graphical models with cycles is the focus of this paper.
As a specific example of these issues, and as motivation for the experimental results that we present later in the paper, let us consider a graphical modeling approach to statistical signal processing in the wavelet domain. Crouse et al. [9] presented a statistical model for wavelets in which each wavelet coefficient is modeled as a finite mixture of Gaussians, typically with two mixture components indexed by a binary {0,1}-valued random variable. The wavelet coefficients are coupled together by introducing statistical dependencies among the binary variables underlying each local Gaussian mixture model; this setup is sufficiently flexible to capture the non-Gaussian dependencies present in signal classes such as natural images (e.g., [3] and [4]). Working within the graphical modeling framework, Crouse et al. investigated two types of dependency among the binary mixture component labels. The first class of model involves linking variables across space, separately for each scale [see Fig. 1(a)]. The second class of model involves linking components across scale according to a tree structure [see Fig. 1(b)]. The latter model is known as a hidden Markov tree ("hidden" because the states of the mixture component labels are not observed in the data), whereas each of the chains composing the former model is known as a hidden Markov model (perhaps better referred to as a hidden Markov chain).
The advantage of hidden Markov chains and hidden Markov trees is that they permit the use of fast recursive algorithms for computing marginal probabilities. The classical algorithm for chains is known as the "forward-backward" algorithm [10]. Crouse et al. [9] presented an analog of this algorithm for trees (see also [11] and [12]). The general algorithm for computing marginal probabilities on cycle-free graphs is known as the sum-product algorithm [13], also referred to as belief propagation [7].
While hidden Markov chains and hidden Markov trees are associated with fast algorithms, they are not necessarily faithful models of the statistical dependencies among wavelet coefficients. In particular, tree-structured models are well known to introduce artifacts in the spatial domain [5], [14]. Consider for example the variables s, t, and u in Fig. 1(b). While the spatial separation of s and t is the same as the spatial separation between t and u, the vertices s and t are nearest neighbors in the tree whereas t and u are separated by a large distance in the tree. Although this type of boundary artifact is not present in the chain-structured model in Fig. 1(a), a chain-structured model fails to capture dependencies across scale. More desirable is the hybrid model shown in Fig. 1(c), in which both vertical and horizontal edges are present. This graphical model has cycles, however, and the computation of marginal probabilities no longer reduces to a straightforward recursion on a tree.
The need to couple multiple Markov chains arises not only in wavelet modeling but also in other application domains [8], [15]. For example, many sensor fusion problems involve statistical dependencies among multiple time series, and one approach to treat such problems is to make use of a coupled set of hidden Markov chains, one for each time series. For example, Reyes et al. [8] described a model for separation of multiple speakers in which a hidden Markov chain is used for each speaker and the observable spectra are a function of the states of all of the chains [see Fig. 1(d)]. This graphical model has cycles, and exact marginalization is feasible only for small numbers of chains.
Although introducing cycles into a graphical model leads to a more expressive class of probability distributions, it also raises the fundamental computational challenge of computing marginal probabilities in the presence of cycles. In principle, any Markov model on a graph with cycles can be converted, by a procedure of clustering nodes and augmenting the associated states, into the so-called junction tree form [16], to which exact recursive algorithms akin to the sum-product algorithm can be applied. However, the overall algorithm of the junction tree approach has computational cost exponential in the size of the augmented state space, a quantity which is unacceptably large in many applications. Thus, a key problem to be addressed—if graphical models with cycles are to be applied fruitfully to signal processing problems—is the development of efficient methods for computing approximate marginal distributions.
The sum-product algorithm involves a simple message-passing protocol in which each node in the graph computes outgoing messages based on transformations (sums of products) of the messages arriving from its neighbors. When there are cycles in the graph, the algorithm can still be implemented, but it is no longer guaranteed to converge, and the answers obtained (assuming convergence occurs) must be viewed as approximations to the underlying marginal probabilities. Despite these serious problems, this “loopy” form of the sum-product algorithm is widely used in practice and indeed it is the state-of-the-art approach to various signal processing problems involving graphical models with cycles [6], [13], [17]. Interestingly, the “loopy” sum-product algorithm can be characterized in terms of optimization: in particular, Yedidia et al. [18] showed its fixed points correspond to stationary points of the “Bethe free energy.” This important result not only provides an analysis tool, but also motivates seeking alternative algorithms via other optimization-based formulations.
The framework that we pursue in this paper begins by formulating the problem of exact marginalization as an optimization problem; our approximate marginalization algorithm is based on solving a relaxed version of this exact formulation. More precisely, we show in Section III how the general problem of computing marginal probabilities in graphical models for discrete random variables (and for a more general class of models known as “exponential family models”) can be formulated as a convex optimization problem—the problem involves the maximization of a certain concave cost function over a convex set. Both of these mathematical objects—the cost function and the constraint set—can be complex, however, and the optimization problem is intractable in general. The “Bethe free energy” approach involves approximating the cost function and relaxing to a simpler constraint set. However, as the Bethe free energy is often nonconvex, the loopy sum-product algorithm may have multiple fixed points, and may converge to a nonglobal optimum.
Rather than replacing a convex problem with a nonconvex one, it would seem desirable to maintain convexity in any relaxation. This is the contribution of the current paper: we propose a convex relaxation of the general problem of computing marginal probabilities on graphs with cycles. At the foundation of our method is a conjugate dual relation [19] that holds for any graphical model. The natural constraint set arising from this duality is the marginal polytope of all globally realizable marginal vectors. Our convex relaxation involves a semidefinite outer bound on the marginal polytope together with an upper bound on the conjugate dual function. The resulting problem is strictly convex, and its unique optimum can be found by efficient interior point methods [20]. We illustrate our relaxation in the context of denoising using the coupled mixture-of-Gaussians model of Crouse et al. [9] for graphs with cycles. As we will show, the performance of our method is comparable or superior to the sum-product algorithm over a wide range of experimental conditions [i.e., coupling strengths, signal-to-noise ratio (SNR)]. A significant fact is that the improvement in performance over sum-product is particularly large for more strongly coupled problems, the regime in which accounting for statistical dependency is most relevant.
The remainder of this paper is organized as follows. Section II is devoted to background on graphical Markov models, their use in modeling coupled mixture of Gaussians (MOGs), and the role of marginalization in signal processing applications. In Section III, we show how the problem of computing marginal probabilities can be reformulated as a low-dimensional optimization problem. In Section IV, we develop a convex relaxation of this optimization problem. Experimental results from applying this relaxation as a technique for performing denoising in the coupled mixture-of-Gaussians model are described in Section V.
SECTION II.
Background and Problem Setup
In this section, we begin by providing some background on graphical Markov models; we refer the reader to [16] and [21] and the survey papers [5] and [6] for further details. We then describe the coupled mixture-of-Gaussians model and its use in signal modeling and noisy prediction.
A. Graphical Markov Models
There exist various but essentially equivalent formalisms for describing graphical models, depending on the type of graph used. In this paper, we make use of an undirected graph G=(V,E), where V={1,…,n} is the vertex set and E is a set of edges joining pairs of vertices. We say that a set S⊂V is a vertex cutset in G if removing S from V breaks the graph into two or more disconnected components—say, A and B. To each node s∈V, we associate a random variable Xs taking values in some configuration space Xs. For any subset A⊆V, we define XA:={Xs|s∈A} with configuration space XA. The link between the random vector X≡XV and the graphical structure arises from Markov properties that are imposed by the graph. In particular, the random vector X is a Markov random field with respect to the graph G if, for all subsets A and B that are separated by some vertex cutset S, the random variables XA and XB are conditionally independent given XS. Fig. 1(a) provides a simple example of a graphical Markov model, in which the random variables X1,…,Xn of a Markov chain are associated with the nodes of a chain. In this chain, each vertex is a cutset; this property implies the familiar conditional independence properties of a Markov chain, in which the past and future are conditionally independent given the present.
An alternative specification of a Markov random field (MRF) is in terms of a particular factorization of the distribution that respects the structure of the graph. In this paper, we focus on pairwise MRFs, in which the factorization is specified by terms associated with nodes and edges of the underlying graph. It is convenient to specify the factorization as an additive decomposition in the exponential domain, as we now describe for a discrete random vector X (i.e., for which Xs = {0,1,…,m−1}). For each s∈V and j∈Xs, let us define an indicator function Ij[xs] (equal to one if xs=j and zero otherwise). We then consider the singleton functions θs at each node, and coupling functions θst on each edge (s,t), that are weighted combinations of these indicator functions:
$$
\theta_s(x_s) = \sum_{j \in \mathcal{X}_s} \theta_{s;j}\,\mathbb{I}_j[x_s], \qquad
\theta_{st}(x_s, x_t) = \sum_{(j,k)} \theta_{st;jk}\,\mathbb{I}_j[x_s]\,\mathbb{I}_k[x_t]. \tag{1}
$$

The collection of functions ϕ(x) := {Ij[xs]} ∪ {Ij[xs]Ik[xt]} are known as the sufficient statistics, and the vector θ with elements {θs;j} ∪ {θst;jk} is the (canonical) parameter vector. As we have described it, the vector θ is d′-dimensional, where d′ = |V|m + |E|m². In fact, since the indicator functions are linearly dependent (e.g., ∑j Ij[xs] = 1 for all xs), it is possible to describe the same model family using only a total of d = |V|(m−1) + |E|(m−1)² parameters.
For a pairwise MRF, the distribution of the random vector X, denoted by p(x;θ)≡p(X=x;θ), decomposes as
$$
p(x;\theta) = \exp\left\{\sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) - A(\theta)\right\} \tag{2a}
$$

where the quantity

$$
A(\theta) := \log \sum_{x \in \mathcal{X}^n} \exp\left\{\sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t)\right\} \tag{2b}
$$

is the log partition function that serves to ensure that the distribution is properly normalized. Note that A is a function from Rd to R; from the definition (2b), it can be seen that A is both convex and continuous.
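For concreteness, the following sketch (an illustration only, assuming Python with NumPy; not part of the original development) makes the representation (1)-(2b) explicit by building a small pairwise MRF in the indicator parameterization and computing A(θ) and the exact node marginals by exhaustive summation, which is feasible only for very small graphs.

```python
import itertools
import numpy as np

# Toy pairwise MRF in the overcomplete indicator parameterization of (1)-(2b):
# a 3-node cycle with m = 2 states per node.
n, m = 3, 2
edges = [(0, 1), (1, 2), (0, 2)]
rng = np.random.default_rng(0)
theta_s = rng.normal(scale=0.5, size=(n, m))                        # theta_{s;j}
theta_st = {e: rng.normal(scale=0.5, size=(m, m)) for e in edges}   # theta_{st;jk}

def exponent(x):
    """Exponent of (2a) without the normalization term A(theta)."""
    val = sum(theta_s[s, x[s]] for s in range(n))
    val += sum(theta_st[(s, t)][x[s], x[t]] for (s, t) in edges)
    return val

# Log partition function (2b) by exhaustive summation over all m^n configurations.
configs = list(itertools.product(range(m), repeat=n))
vals = np.array([exponent(x) for x in configs])
A = np.log(np.sum(np.exp(vals)))

# Probabilities (2a) and the exact single-node marginals p(X_s = j; theta).
probs = np.exp(vals - A)
marg = np.zeros((n, m))
for p, x in zip(probs, configs):
    for s in range(n):
        marg[s, x[s]] += p

print("A(theta) =", A)
print("node marginals:\n", marg)   # each row sums to one
```

Exact enumeration of this kind is precisely what becomes intractable for large graphs, which motivates the variational treatment of Section III.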
B. Coupled Mixture-of-Gaussians and Denoising
We now use the Markov random field to define a more general graphical model involving mixture variables. Let us view the random variable Xs∈{0,1,…,m−1} as indexing a Gaussian mixture with m components. More concretely, we define a random variable Zs whose conditional distribution is given by
$$
p\left(Z_s = z_s \mid X_s = j;\, \nu_s, \sigma_s^2\right)
= \frac{1}{\sqrt{2\pi\sigma_{s;j}^2}} \exp\left\{-\frac{(z_s - \nu_{s;j})^2}{2\sigma_{s;j}^2}\right\},
\qquad \text{for } j = 0, 1, \ldots, m-1 \tag{3}
$$

where σ²s := {σ²s;j | j ∈ Xs} and νs := {νs;j | j ∈ Xs} are m-vectors specifying the Gaussian variances and means, respectively. Summing the joint distribution over the values of Xs yields a Gaussian mixture distribution for Zs. See Fig. 2(a) for a simple graphical representation of this model.
The coupled mixture-of-Gaussians model is a joint probability model over (X,Z), defined by the distribution
$$
p(X = x, Z = z;\, \theta, \sigma^2, \nu) = p(x;\theta) \prod_{s \in V} p\left(z_s \mid x_s;\, \nu_s, \sigma_s^2\right). \tag{4}
$$

The wavelet signal processing framework of Crouse et al. [9] involves a model of the form (4), in which the underlying graph is a tree and each variable Zs is a mixture of m=2 Gaussian components. Our main focus in this paper is the generalization of this model to the case in which the underlying graph has cycles, such as the lattice shown in Fig. 2(b).
One important application of the model (4) is to signal denoising. The problem setup is as follows: given a vector Y of noisy observations of the Gaussian mixture vector Z, we would like to use the noisy observations to form an optimal prediction of Z. When Zs is no longer directly observed [as it was in Fig. 2(a)], we have the new local model illustrated in Fig. 2(c), in which the third (shaded) node represents the noisy observation variable Ys. One common observation model, and the one that we consider in this paper, takes the form
$$
Y = \alpha Z + \sqrt{1 - \alpha^2}\, W \tag{5}
$$

where W ∼ N(0, I) is a Gaussian random noise vector and α ∈ [0,1] controls the SNR. For continuous random variables, it is common to assess prediction performance using the mean-squared error (MSE), in which case it is well known that the optimal predictor of Zs, given observations Y = y, is the conditional mean ẑs(y) := E[Zs | Y = y]. For an observation model of the form (5), it is straightforward to derive that the conditional mean takes the form

$$
\widehat{z}_s(y) = \sum_{j \in \mathcal{X}_s} p(X_s = j \mid y;\theta)
\left\{\nu_{s;j} + \frac{\alpha\,\sigma_{s;j}^2}{\alpha^2 \sigma_{s;j}^2 + (1 - \alpha^2)}\,[y_s - \nu_{s;j}]\right\}. \tag{6}
$$

Note that ẑs is a combination of linear least squares estimators (LLSEs), in which the LLSE for Gaussian component j is weighted by the marginal probability p(Xs = j | y; θ). Thus, the main challenge associated with performing optimal prediction is the computation of these marginal probabilities. For our development to follow, it is convenient to observe that since y is an observed quantity (and hence fixed), the computation of the conditional marginal distribution p(Xs = j | y; θ) can be reformulated as the computation of an ordinary marginal distribution p(Xs = j; θ̃), where θ̃ is a modified set of exponential parameters {θ̃s, θ̃st} obtained by incorporating the observations into the model. Explicitly, the modified terms at each node have the form θ̃s;k = θs;k + log p(ys | Xs = k; θ, ν, σ) for each k ∈ {0, 1, …, m−1}; the coupling terms remain unchanged (i.e., θ̃st = θst).
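The reduction from conditional marginals p(Xs=j|y;θ) to ordinary marginals p(Xs=j;θ̃) is simple to implement. The sketch below is illustrative only; the helper names modified_node_params and predict are ours, NumPy is assumed, and the Gaussian likelihoods follow from the observation model (5), under which Ys given Xs=j is Gaussian with mean ανs;j and variance α²σ²s;j + (1−α²).

```python
import numpy as np

def modified_node_params(theta_s, y, alpha, nu, sigma2):
    """theta_tilde_{s;k} = theta_{s;k} + log p(y_s | X_s = k), per Section II-B.

    theta_s, nu, sigma2: arrays of shape (n, m); y: array of shape (n,).
    """
    var_y = alpha ** 2 * sigma2 + (1.0 - alpha ** 2)
    loglik = -0.5 * np.log(2.0 * np.pi * var_y) \
             - 0.5 * (y[:, None] - alpha * nu) ** 2 / var_y
    return theta_s + loglik

def predict(marginals, y, alpha, nu, sigma2):
    """Approximate conditional mean (6), with p(X_s = j | y) replaced by
    whatever marginals (exact, BP, or log-determinant) are supplied."""
    gain = alpha * sigma2 / (alpha ** 2 * sigma2 + (1.0 - alpha ** 2))
    llse = nu + gain * (y[:, None] - nu)      # component-wise LLSE term of (6)
    return np.sum(marginals * llse, axis=1)
```

Any marginalization routine can then be plugged in through the marginals argument, which is how the comparisons of Section V are organized.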
SECTION III.
Exact Variational Formulation
In this section, we show how the problem of computing marginals and the log partition function can be reformulated as the solution of a low-dimensional convex optimization problem, which we refer to as a variational formulation. We discuss the challenges associated with an exact solution of this optimization problem for general Markov random fields.
At a high level, our strategy for obtaining the desired variational representation can be summarized as follows. Recall that the log partition function A maps parameter vectors θ∈Rd to real numbers and is a convex function of θ. The convexity of A means that its epigraph {(θ,t)|A(θ)≤t} is a convex subset of Rd+1, and therefore can be characterized as the intersection of all half-spaces that contain it [19]. This half-space representation of the epigraph is equivalent to saying that A can be written in a variational fashion as follows:
$$
A(\theta) = \sup_{\mu \in \mathbb{R}^d} \left\{\langle\theta, \mu\rangle - A^{*}(\mu)\right\}. \tag{7}
$$

Here ⟨θ, μ⟩ denotes the Euclidean inner product between the vectors θ, μ ∈ Rd, and A* is an auxiliary function, known as the conjugate dual, that we describe in more detail in Section III-B. In geometric terms, the dual vector μ represents the slope of the hyperplane describing a half-space, whereas the dual value A*(μ) represents the (negative) intercept of the hyperplane. In principle, this conjugate dual relation allows us to compute A(θ) by solving the optimization problem (7). Accordingly, we first turn to an investigation of the form of the dual function A*. Of particular importance is characterizing the subset of Rd on which A* is finite-valued (known as its effective domain), because we can always restrict the optimization (7) from Rd to this set.
Although the variational principle (7) is related to the "free energy" principle of statistical physics [18], it differs from this classical approach in important ways. More specifically, it is a low-dimensional convex problem, in which the optimization variables μ have a natural interpretation as realizable marginal probabilities. An important consequence, as we will see later in this section, is that an effective solution to the optimization problem (7) yields not only the value of the log partition function A(θ) but also the marginal probabilities (i.e., p(Xs=j;θ) and p(Xs=j,Xt=k;θ)) that we aim to compute. Moreover, in contrast to the free energy approach, this perspective clarifies that there are two distinct components to any relaxation of the variational principle (7): namely, an approximation of the dual function and an outer bound on the effective domain.
A. Marginal Polytopes
We begin by defining the set that is to play the role of the constraint set in our variational principle. Recall the sufficient statistics in a discrete exponential family: they are the collection of indicator functions ϕ(x):={Ij[xs]}∪{Ij[xs]Ik[xt]}. For each discrete configuration x∈Xn, the quantity ϕ(x) is a {0–1}-valued vector contained within [0,1]d; of interest is the convex hull of this set of vectors. More formally, we define
$$
\mathrm{MARG}(G;\phi) := \left\{\mu \in \mathbb{R}^d \;\Big|\; \sum_{x \in \mathcal{X}^n} \phi(x)\, p(x) = \mu \;\text{ for some distribution } p(\cdot)\right\} \tag{8}
$$

where p(⋅) is any valid distribution. The elements of μ, which can be indexed as {μs;j} ∪ {μst;jk}, have a very concrete interpretation. For instance, the element μs;j = ∑x p(x) Ij[xs] is simply the marginal probability that Xs = j (under the distribution p(⋅)). Similarly, the element μst;jk corresponds to a particular joint marginal probability. Accordingly, we refer to the set MARG(G;ϕ) as the marginal polytope associated with the graph G and the potentials ϕ. We refer to elements μ ∈ MARG(G;ϕ) as mean parameters associated with the Markov random field defined by G and ϕ.

Fig. 3 provides a geometric illustration of a marginal polytope. Since it corresponds to the convex hull of a finite number of vectors, it must be a polytope (and hence can be characterized by a finite number of hyperplane constraints). Although MARG(G;ϕ) has a very simple definition, it is actually a rather complicated set. In particular, the number of hyperplane constraints required to specify this polytope grows at least exponentially in the graph size n (see [22] and [23] for further discussion of this point).
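As a concrete illustration of definition (8), the following sketch (assuming NumPy; a toy example rather than anything used in the paper) enumerates the vertices ϕ(x) of the marginal polytope for a binary 3-cycle and forms a mean parameter vector μ as a convex combination of them.

```python
import itertools
import numpy as np

# Vertices of the marginal polytope (8) for a binary 3-cycle: one vector phi(x)
# per configuration x, using the sufficient statistics {x_s} U {x_s x_t}.
n = 3
edges = [(0, 1), (1, 2), (0, 2)]

def phi(x):
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, [x[s] * x[t] for (s, t) in edges]])

vertices = np.array([phi(x) for x in itertools.product([0, 1], repeat=n)])

# Any distribution p(.) over the 2^n configurations yields a point mu of
# MARG(G; phi) as the corresponding convex combination of the vertices.
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(2 ** n))
mu = p @ vertices
print("mu =", mu)   # entries are the node marginals mu_s and edge marginals mu_st
```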
B. Conjugate Dual Function
Given the convexity of A, it is natural to consider its conjugate dual [19], which is a function A∗:Rd→R∪{+∞} defined as follows:
$$
A^{*}(\mu) := \sup_{\theta \in \mathbb{R}^d} \left\{\langle\mu, \theta\rangle - A(\theta)\right\}. \tag{9}
$$

Here the vector of dual variables μ ∈ Rd has the same dimension as the vector θ of exponential parameters.

Our choice of notation is deliberately suggestive, in that the dual variables μ turn out to be closely associated with the marginal polytope defined in the previous section. In order to see the connection, consider the gradient of the log partition function. A straightforward calculation using the definition (2b) of A yields

$$
\nabla A(\theta) = \sum_{x \in \mathcal{X}^n} p(x;\theta)\, \phi(x) = \mathbb{E}_{\theta}\left[\phi(x)\right] \tag{10}
$$

so that the elements of this gradient correspond to particular marginal probabilities (under the distribution p(⋅;θ)). This fact implies that the image of the gradient mapping (i.e., ∇A(Rd)) is contained within the marginal polytope (8).
Our goal now is to compute a more explicit form for the dual function A∗. A quantity that plays a key role in this context is the discrete entropy [24] associated with a distribution p(⋅), defined as
$$
H(p) := -\sum_{x \in \mathcal{X}^n} p(x) \log p(x). \tag{11}
$$

We begin by observing that for a fixed μ ∈ Rd, the function J(θ) := ⟨μ, θ⟩ − A(θ) is strictly concave and differentiable. Therefore, if there exists a solution θ to the equation ∇J(θ) = 0, then the supremum (9) is attained at this point. Accordingly, we compute the gradient using (10) and set it equal to zero:

$$
\nabla J(\theta) = \mu - \mathbb{E}_{\theta}\left[\phi(x)\right] = 0. \tag{12}
$$

It can be shown [23] that this equation has a unique solution θ(μ) whenever μ belongs to the interior of MARG(G;ϕ). Substituting the relation μ = Eθ(μ)[ϕ(x)] into the definition (9) of A* yields, for any μ in the interior of MARG(G;ϕ), the important relation A*(μ) = ⟨μ, θ(μ)⟩ − A(θ(μ)). To interpret A*(μ), we compute the negative entropy of p(x;θ(μ)) as follows:

$$
-H\bigl(p(x;\theta(\mu))\bigr)
= \sum_{x \in \mathcal{X}^n} p\bigl(x;\theta(\mu)\bigr) \log p\bigl(x;\theta(\mu)\bigr)
= \sum_{x \in \mathcal{X}^n} p\bigl(x;\theta(\mu)\bigr) \bigl\{\langle\theta(\mu), \phi(x)\rangle - A(\theta(\mu))\bigr\}
= A^{*}(\mu). \tag{13}
$$

Thus, we recognize the dual function A*(μ) as the negative entropy specified as a function of the mean parameters.
We have established that for μ in the interior of MARG (G;ϕ), the value of the conjugate dual A∗(μ) is given by the negative entropy of the distribution p(x;θ(μ)), where the pair θ(μ) and μ are dually coupled via (12). Moreover, it can be shown [23] that when μ lies outside the closure of the marginal polytope, then the value of the dual function is +∞. We summarize as follows:
$$
A^{*}(\mu) = \begin{cases}
-H\bigl(p(x;\theta(\mu))\bigr), & \text{for } \mu \text{ in the interior of } \mathrm{MARG}(G;\phi)\\
+\infty, & \text{otherwise.}
\end{cases} \tag{14}
$$

Since the function A is differentiable on Rd, we are guaranteed that taking the conjugate dual twice recovers the original function [19]. By applying this fact to A* and A, we obtain the following relation:

$$
A(\theta) = \max_{\mu \in \mathrm{MARG}(G;\phi)} \left\{\langle\theta, \mu\rangle - A^{*}(\mu)\right\}. \tag{15}
$$

Note that the optimization here, in contrast to (7), is restricted to the marginal polytope MARG(G;ϕ), since the function A* is infinite outside of this set. Moreover, it can be shown [23] that the optimum (15) is attained uniquely at the vector μ(θ) of marginals associated with p(x;θ). Consequently, the optimization problem (15) is a variational representation in two senses. First, the optimal value of the optimization problem yields the value A(θ) of the log partition function; and second, the optimal arguments yield the mean parameters or marginals μ = Eθ[ϕ(x)]. This variational formulation plays a central role in our development in the sequel.
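The two senses of the variational representation can be checked numerically on a small example. The sketch below (NumPy assumed, binary minimal parameterization {xs, xsxt}; purely illustrative) verifies that the objective of (15), evaluated at the exact marginals μ(θ), equals A(θ), using the identity A*(μ(θ)) = −H(p(⋅;θ)) from (13) and (14).

```python
import itertools
import numpy as np

# Check of (15) at its optimum: for the exact mean parameters mu(theta) we have
#   A(theta) = <theta, mu(theta)> + H(p(.; theta)),
# since A*(mu(theta)) = -H(p(.; theta)) by (13)-(14).
n = 3
edges = [(0, 1), (1, 2), (0, 2)]
rng = np.random.default_rng(2)
theta_s = rng.normal(size=n)                  # theta_s for x_s = 1 (binary case)
theta_st = {e: rng.normal() for e in edges}   # theta_st for x_s = x_t = 1

def exponent(x):
    return sum(theta_s[s] * x[s] for s in range(n)) + \
           sum(theta_st[e] * x[e[0]] * x[e[1]] for e in edges)

configs = list(itertools.product([0, 1], repeat=n))
vals = np.array([exponent(x) for x in configs])
A = np.log(np.exp(vals).sum())
probs = np.exp(vals - A)

inner = float(np.dot(probs, vals))                 # <theta, mu(theta)>
entropy = -float(np.dot(probs, np.log(probs)))     # H(p(.; theta))
print(A, inner + entropy)                          # the two values agree
```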
SECTION IV.
Log-Determinant Relaxation
In this section, we derive an algorithm for approximating marginal probabilities based on the solution of a relaxed variational problem involving determinant maximization and semidefinite constraints. We first provide a high-level description of our approach. There are two challenges associated with the variational formulation (15). First, actually evaluating the dual function A∗(μ) for some vector μ of marginal probabilities is a very challenging problem, since it requires first computing the exponential family distribution p(x;θ(μ)) with those marginals, and then computing its entropy. Indeed, with the exception of trees and more general junction trees, it is typically impossible to specify an explicit form for the dual function A∗(μ). Second, for a general graph with cycles, an exact description of the marginal polytope MARG (G;ϕ) requires a number of inequalities that grows rapidly with the size of the graph [22]. Accordingly, our approach is to relax the exact variational formulation (15) as follows: we replace the marginal polytope by a convex outer bound, and bound the intractable dual function A∗ with a convex surrogate. In the following two sections, we describe each of these steps in turn.
Although the ideas and methods described here are more generally applicable, for the sake of clarity in exposition we focus here on the case of a binary random vector X∈{0,1}n. In this case, it is necessary only to consider the indicator functions I1[xs]=xs at each node and I1[xs]I1[xt]=xsxt on each edge. Thus, for a given graph G, the parameter vector θ has a total of d=|V|+|E| elements (i.e., its elements have the form {θs, s∈V}∪{θst, (s,t)∈E}). We refer the reader to Appendix A for discussion of the general multinomial case.
A. Outer Bounds on the Marginal Polytope
We first show how to derive various outer bounds on the marginal polytope. In this context, it is convenient to consider the marginal polytope MARG (Kn;ϕ) associated with the complete graph Kn (i.e., the graph in which each node is joined by an edge to all (n−1) other nodes). It should be noted that this assumption entails no loss of generality, since an arbitrary pairwise Markov random field can be embedded into the complete graph by setting to zero a subset of the {θst} parameters. (In particular, for a graph G=(V,E), we simply set θst=0 for all pairs (s,t)∉E).
On the complete graph, the model dimension is d = n + n(n−1)/2. Given an arbitrary vector μ ∈ Rd, consider the following (n+1)×(n+1) matrix:

$$
M_1[\mu] := \begin{bmatrix}
1 & \mu_1 & \mu_2 & \cdots & \mu_{n-1} & \mu_n\\
\mu_1 & \mu_1 & \mu_{12} & \cdots & \cdots & \mu_{1n}\\
\mu_2 & \mu_{21} & \mu_2 & \cdots & \cdots & \mu_{2n}\\
\vdots & \vdots & \vdots & \ddots & & \vdots\\
\mu_{n-1} & \vdots & \vdots & & \mu_{n-1} & \mu_{(n-1),n}\\
\mu_n & \mu_{n1} & \mu_{n2} & \cdots & \mu_{n,(n-1)} & \mu_n
\end{bmatrix}. \tag{16}
$$

The motivation underlying this definition is the following: suppose that the given dual vector μ actually belongs to MARG(Kn;ϕ), in which case there exists some distribution p(x;θ) such that μs = ∑x p(x;θ) xs and μst = ∑x p(x;θ) xs xt. Under this condition, the matrix M1[μ] can be interpreted as the matrix of second-order moments for the vector (1, x), as computed under p(x;θ). An important point here is that in computing these second-order moments, we use the fact that xs² = xs when xs ∈ {0,1}. It is for this reason that the elements (μ1, …, μn) appear along the diagonal.
Given a symmetric matrix S∈Rd×d, we use S⪰0 to mean that S is positive semidefinite. The significance of the moment matrix M1[μ] is illustrated in the following.
Lemma 1 [Semidefinite Outer Bound]:
The binary marginal polytope MARG(Kn) is contained within the semidefinite constraint set SDEF1(Kn):={μ∈Rd|M1[μ]⪰0}.
Proof:
This result follows from the fact that any second-order moment matrix must be positive semidefinite, as can be verified by the following simple argument. Letting y := (1, x), for any vector a ∈ Rn+1 we have aᵀM1[μ]a = aᵀE[yyᵀ]a = E[(aᵀy)²], which is certainly nonnegative.□
This semidefinite relaxation can be further strengthened by including higher order terms in the moment matrices, as described by Lasserre [25].
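Lemma 1 is easy to check numerically. The following sketch (NumPy assumed; illustrative only) assembles M1[μ] as in (16) from the exact first and second moments of an arbitrary distribution over {0,1}^n and confirms that its smallest eigenvalue is nonnegative.

```python
import itertools
import numpy as np

def moment_matrix(mu_s, mu_st):
    """Assemble the (n+1) x (n+1) matrix M1[mu] of (16) for binary variables."""
    n = len(mu_s)
    M = np.empty((n + 1, n + 1))
    M[0, 0] = 1.0
    M[0, 1:] = M[1:, 0] = mu_s
    M[1:, 1:] = mu_st
    np.fill_diagonal(M[1:, 1:], mu_s)      # since x_s^2 = x_s for x_s in {0,1}
    return M

# The exact moments of any distribution over {0,1}^n lie in MARG(K_n),
# so the resulting M1[mu] must be positive semidefinite (Lemma 1).
n = 4
configs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
p = np.random.default_rng(3).dirichlet(np.ones(2 ** n))
mu_s = p @ configs                           # E[X_s]
mu_st = configs.T @ np.diag(p) @ configs     # E[X_s X_t]
M1 = moment_matrix(mu_s, mu_st)
print(np.linalg.eigvalsh(M1).min() >= -1e-10)    # True: M1[mu] is PSD
```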
In addition to such semidefinite constraints, there are various linear constraints that any member μ of the marginal polytope MARG(G;ϕ) must satisfy. Here we consider some linear constraints that are relevant to the sum-product algorithm. For a given edge (s,t) ∈ E, there are three mean parameters associated with the pair of random variables (Xs, Xt): namely, μs = E[Xs], μt = E[Xt], and μst = E[XsXt]. Collectively, these three mean parameters specify the joint marginal distribution over (Xs, Xt) as
$$
p(x_s, x_t; \mu) = \begin{bmatrix}
1 + \mu_{st} - \mu_s - \mu_t & \mu_t - \mu_{st}\\
\mu_s - \mu_{st} & \mu_{st}
\end{bmatrix}. \tag{17}
$$

Therefore, the four inequality constraints obtained by requiring that each entry of p(xs, xt; μ) be nonnegative are necessary and sufficient to ensure that the pairwise marginals on each edge (s,t) are valid. For a binary-valued Markov random field defined on a general graph with cycles, these constraints are equivalent to the constraints that are enforced by the sum-product algorithm [18]. Thus, these constraints provide a complete characterization of the marginal polytope for any tree-structured graph.
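For reference, the local constraints associated with (17) are straightforward to encode; the sketch below (illustrative, NumPy assumed, helper names ours) builds the implied pairwise table for an edge and tests the four nonnegativity constraints.

```python
import numpy as np

def pairwise_table(mu_s, mu_t, mu_st):
    """Joint marginal of (X_s, X_t) implied by the mean parameters, as in (17)."""
    return np.array([[1.0 + mu_st - mu_s - mu_t, mu_t - mu_st],
                     [mu_s - mu_st,              mu_st        ]])

def locally_consistent(mu_s, mu_t, mu_st, tol=1e-12):
    """The four nonnegativity constraints enforced (implicitly) by sum-product."""
    return bool(np.all(pairwise_table(mu_s, mu_t, mu_st) >= -tol))

print(locally_consistent(0.5, 0.5, 0.4))   # True
print(locally_consistent(0.2, 0.2, 0.5))   # False: mu_st cannot exceed min(mu_s, mu_t)
```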
B. Gaussian Entropy Bound
We now describe how to upper bound the (negative) dual function −A∗(μ) using a Gaussian-type approximation. Our starting point is the well-known fact [24] that the (differential) entropy of any continuous random vector X˜ is upper bounded as
$$
h(\widetilde{X}) \le \frac{1}{2} \log\det \mathrm{cov}(\widetilde{X}) + \frac{n}{2} \log(2\pi e). \tag{18}
$$

This upper bound corresponds to the differential entropy of a Gaussian matched to the covariance matrix (denoted cov(X̃)) of the continuous random vector X̃. However, it is not directly applicable to a discrete random vector, for which the differential entropy is not well defined. Our approach is to "smooth" X by adding an independent noise variable and then apply the bound (18). After some derivation, the end result is the following.
Lemma 2:
The negative dual function is upper bounded as
$$
-A^{*}(\mu) \le \frac{1}{2} \log\det\left(M_1[\mu] + \frac{1}{12}\,\mathrm{blkdiag}[0, I_n]\right) + \frac{n}{2} \log(2\pi e) \tag{19}
$$

where the (n+1)×(n+1) matrix M1[μ] is defined in (16), and blkdiag[0, In] is an (n+1)×(n+1) block-diagonal matrix with a 1×1 zero block and an n×n block equal to the identity matrix In.
See Appendix B for the proof of this claim.
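A quick numerical sanity check of Lemma 2 is possible on tiny models. The sketch below (NumPy assumed; illustrative only) draws an arbitrary distribution over {0,1}^n, computes its exact discrete entropy, and confirms that it stays below the right-hand side of (19) evaluated at its mean parameters, as the argument in Appendix B in fact guarantees for any distribution with those mean parameters.

```python
import itertools
import numpy as np

# Check of the Gaussian entropy bound (19): for a distribution over {0,1}^n with
# mean parameters mu, the discrete entropy H is at most
# 0.5 * logdet(M1[mu] + (1/12) blkdiag[0, I_n]) + (n/2) * log(2*pi*e).
n = 4
configs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
p = np.random.default_rng(4).dirichlet(np.ones(2 ** n))

H = -np.sum(p * np.log(p))                 # exact discrete entropy

mu_s = p @ configs
second = configs.T @ np.diag(p) @ configs  # E[X_s X_t]; diagonal equals E[X_s]
M1 = np.block([[np.ones((1, 1)), mu_s[None, :]],
               [mu_s[:, None],   second       ]])
D = np.diag(np.concatenate([[0.0], np.ones(n) / 12.0]))
bound = 0.5 * np.linalg.slogdet(M1 + D)[1] + 0.5 * n * np.log(2 * np.pi * np.e)

print(H <= bound + 1e-10)                  # True
```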
C. Log-Determinant Relaxation
Equipped with these building blocks, we are now ready to state our log-determinant relaxation for the log partition function.
Theorem 1:
Consider a binary Markov random field p(x;θ) over the vector X∈{0,1}n. Then for any compact convex outer bound OUT (Kn) on the marginal polytope MARG(Kn), the log partition function A(θ) is upper bounded by the solution of the following variational problem:
$$
A(\theta) \le \max_{\mu \in \mathrm{OUT}(K_n),\; M_1[\mu] \succeq 0}
\left\{\langle\theta, \mu\rangle + \frac{1}{2} \log\det\left[M_1[\mu] + \frac{1}{12}\,\mathrm{blkdiag}[0, I_n]\right]\right\}
+ \frac{n}{2} \log(2\pi e). \tag{20}
$$

This problem is strictly concave, and so has a unique global optimum.
Given our development thus far, the proof is straightforward. In particular, by examining the variational representation (15) of A, we see that an upper bound on A can be obtained via an upper bound on the negative dual function −A∗ and an outer bound on the marginal polytope MARG(Kn;ϕ). The bound thus follows by applying Lemmas 1 and 2. Strict concavity and uniqueness follow by standard results on the log-determinant function [20].
The simplest form of the relaxation (20) is obtained when the semidefinite constraint (see Lemma 1) M1[μ]⪰0 is the only constraint imposed. In this case, the relaxation has a natural interpretation as optimizing over a subset of valid covariance matrices. It is also straightforward to strengthen the relaxation via additional linear constraints on the marginal polytope, such as those associated with the Bethe problem and the sum-product algorithm [see (17)]. Overall, our approach will be to proceed in analogy to the exact variational principle (15): in particular, we will solve (20) and then make use of the optimizing arguments μˆ as approximations to the exact marginals. Insofar as our relaxation is relatively tight, this procedure can be expected to provide reasonable approximations, as we will see in Section V.
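In practice, off-the-shelf convex solvers suffice for moderate problem sizes. The sketch below is illustrative only; it assumes the cvxpy package with an exponential-cone-capable solver such as SCS, which is not what the paper itself uses. It solves the slightly weakened form (21) of the relaxation, augmented with the pairwise constraints of (17), on a small complete graph and reads off the optimizing μ̂s as approximate marginals p(Xs=1;θ).

```python
import numpy as np
import cvxpy as cp

# Log-determinant relaxation (21) with the pairwise linear constraints (17)
# added, for a small binary MRF on the complete graph K_n.
n = 5
edges = [(s, t) for s in range(n) for t in range(s + 1, n)]
rng = np.random.default_rng(5)
theta_s = rng.normal(scale=0.2, size=n)
theta_st = {e: rng.normal(scale=0.3) for e in edges}

mu = cp.Variable(n)
mu_e = {e: cp.Variable() for e in edges}

def unit(i, j, size=n + 1):
    E = np.zeros((size, size)); E[i, j] = 1.0; return E

# Assemble M1[mu] of (16) as a symmetric affine expression in (mu, mu_e).
M1 = unit(0, 0)
for s in range(n):
    M1 = M1 + mu[s] * (unit(0, s + 1) + unit(s + 1, 0) + unit(s + 1, s + 1))
for (s, t) in edges:
    M1 = M1 + mu_e[(s, t)] * (unit(s + 1, t + 1) + unit(t + 1, s + 1))

D = np.diag(np.concatenate([[0.0], np.ones(n) / 12.0]))
objective = sum(theta_s[s] * mu[s] for s in range(n)) \
    + sum(theta_st[e] * mu_e[e] for e in edges) \
    + 0.5 * cp.log_det(M1 + D)

constraints = []
for (s, t) in edges:   # the four nonnegativity constraints of (17) on each edge
    constraints += [mu_e[(s, t)] >= 0, mu[s] - mu_e[(s, t)] >= 0,
                    mu[t] - mu_e[(s, t)] >= 0,
                    1 + mu_e[(s, t)] - mu[s] - mu[t] >= 0]

cp.Problem(cp.Maximize(objective), constraints).solve(solver=cp.SCS)
print("approximate marginals p(X_s = 1):", np.round(mu.value, 3))
```

For the larger problems considered in Section V, the dual Newton scheme of Appendix C is preferable to a generic solver.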
D. Efficient Solution of Log-Determinant Relaxation
An important fact is that the unique optimum of problem (20) can be obtained in polynomial time by interior point methods specialized for log-determinant problems [e.g., [20]]. However, the complexity of a generic interior point method is O(n6), which (though polynomial time) is too large to be practically viable. Accordingly, here we describe how a suitable dual reformulation leads to very efficient methods for solving a slightly weakened form of the log-determinant relaxation in Theorem 1.
Our starting point is the observation that the log-determinant term in (20) acts naturally as a barrier function to enforce the constraint M1[μ]⪰−(1/12)blkdiag[0,In], which is a somewhat weaker constraint than M1[μ]⪰0. This observation leads to the relaxed problem
$$
\max_{\mu} \left\{\langle\theta, \mu\rangle + \frac{1}{2} \log\det\left[M_1[\mu] + \frac{1}{12}\,\mathrm{blkdiag}[0, I_n]\right]\right\}. \tag{21}
$$

By an appropriate dual reformulation described in Appendix C, we can convert (21) into an unconstrained log-determinant problem that can be solved with (i) complexity O(n³) per iteration for a full Newton method on an arbitrary graph, or (ii) complexity O(n^1.5) per iteration for a diagonally scaled quasi-Newton method on grid-structured problems. By comparison, the computational complexity per iteration of sum-product is O(|E|); as particular illustrations, this amounts to a per-iteration complexity of O(n²) for complete graphs and O(n) for grid-structured problems. The overall complexity is determined by the per-iteration cost as well as the convergence rate, which determines the total number of iterations required to reach a prespecified error tolerance. The convergence rate of the sum-product algorithm (assuming that it converges) is at best linear [26]. In contrast, an appropriately scaled gradient method, such as Newton's method, has a superlinear rate of convergence [27].
SECTION V.
Experimental Results
In this section, we describe experimental results of applying the log-determinant relaxation (21) to the noisy prediction problem in a coupled mixture-of-Gaussians (MOG) model (see Section II-B). Recall that the MOG model is specified by a discrete MRF p(x;θ) over the vector X of discrete mixing variables, a vector Z of Gaussian mixture variables, and a vector Y of noisy observations, as described by (5). The noisy prediction problem is to compute for each vertex s∈V the conditional mean zˆs(y), as defined in (6). We consider the approximation zˆs(y;μLD) to this conditional mean, in which the true marginals p(Xs=j|y;θ) are replaced by the approximations μLD obtained from the log-determinant relaxation. For all experiments reported here, we fix the variances σs;0=σs;1=0.25 and means νs;0=1=−νs;1 of the Gaussian mixture components. We study the behavior of the log-determinant (LD) relaxation (21) as the coupling strengths θ and the SNR α are varied. For purposes of comparison, we also show prediction results based on the approximate predictor zˆs(y;μBP) computed using the vector μBP of approximate marginals obtained from the sum-product or belief propagation (BP) algorithm (see [18, (4) and (5)]).
Our experiments cover two types of graphs: the complete graph Kn, in which each node is connected to every other, and the four nearest neighbor lattice graph [see Fig. 2(b) for an illustration]. For each set of trials on a given graph, we generate the parameter θ that specifies the coupling distribution (2a) p(X;θ) in the following way. In all trials, we set the single node parameters θs;0=θs;1=0 for all vertices s∈V. The edge parameters θst are chosen differently depending on the coupling strength dcoup≥0; here dcoup=0 corresponds to an independence model, whereas larger dcoup generates increasing amounts of dependence among the indicator variables Xs. Let U[a,b] denote the uniform distribution on the interval [a,b]. For a fixed coupling strength dcoup, we sample γst independently from the U[0,dcoup] distribution and then set
$$
\theta_{st} = \begin{bmatrix}\theta_{st;00} & \theta_{st;01}\\ \theta_{st;10} & \theta_{st;11}\end{bmatrix}
= \begin{bmatrix}\gamma_{st} & -\gamma_{st}\\ -\gamma_{st} & \gamma_{st}\end{bmatrix}.
$$

Note that this procedure for choosing the parameter vector θ produces a distribution p(x;θ) in which each variable Xs is equally likely to be zero or one, and for which neighboring variables are more likely to take the same value. This probabilistic structure is consistent with the coupled mixture models used in practice (e.g., in wavelet denoising [9]).
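A single experimental trial can thus be generated as follows (an illustrative sketch assuming NumPy, with exact sampling by enumeration and therefore limited to small n; the helper name generate_trial is ours).

```python
import itertools
import numpy as np

def generate_trial(n, edges, d_coup, alpha, rng):
    """Generate one synthetic trial as in Section V (small n only)."""
    gamma = {e: rng.uniform(0.0, d_coup) for e in edges}   # gamma_st ~ U[0, d_coup]
    # With theta_st = [[gamma, -gamma], [-gamma, gamma]] and theta_s = 0, the
    # exponent of (2a) is sum_e gamma_e * (+1 if x_s == x_t else -1).
    configs = np.array(list(itertools.product([0, 1], repeat=n)))
    expo = np.array([sum(g if x[s] == x[t] else -g for (s, t), g in gamma.items())
                     for x in configs])
    p = np.exp(expo - expo.max()); p /= p.sum()
    x = configs[rng.choice(len(configs), p=p)]             # sample X ~ p(x; theta)

    nu = np.where(x == 0, 1.0, -1.0)                       # nu_{s;0} = 1 = -nu_{s;1}
    z = nu + np.sqrt(0.25) * rng.standard_normal(n)        # variance 0.25 per component
    y = alpha * z + np.sqrt(1.0 - alpha ** 2) * rng.standard_normal(n)   # model (5)
    return gamma, x, z, y

rng = np.random.default_rng(6)
edges = [(s, t) for s in range(4) for t in range(s + 1, 4)]
gamma, x, z, y = generate_trial(4, edges, d_coup=0.9, alpha=0.3, rng=rng)
```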
We solved the log-determinant relaxation (21) via Newton's method, as described in Appendix C. We used the standard parallel message-passing form of the sum-product algorithm with a damping factor β=0.05; if the sum-product algorithm failed to converge, we switched to a convergent double-loop alternative [28]. For each graph (fully connected or grid), we examined a range of coupling strengths dcoup and the full range of SNRs parameterized by α∈[0,1]. For purposes of comparison, we computed the exact marginal values either by exhaustive summation on the complete graph or by applying the junction tree algorithm to grid-structured problems.
Due to the computational complexity of these exact calculations, we performed our experiments on n=16 nodes for complete graphs and n=100 nodes for the lattices. We used the following procedure to assess the error in the approximations. Let MSE_Bayes := E{(1/n) ∑_{s=1}^n [ẑs(y) − zs]²} denote the MSE of the optimal Bayes estimator (6). Similarly, let MSE(μLD) and MSE(μBP) denote the MSEs of the log-determinant and BP-based predictors, respectively. For any given experimental trial, our evaluation is based on the percentage increase in MSE; for instance, for the log-determinant predictor, we compute 100×(MSE(μLD)−MSE_Bayes)/MSE_Bayes.
Fig. 4 shows the results for the grid with n=100 nodes; each plot displays the percentage increase in MSE versus the SNR parameter α for a fixed coupling strength dcoup. Each point in each solid line corresponds to the mean taken over 30 trials; the plus marks show the standard errors associated with these estimates. Shown in panel (a) is the case of low coupling strength (dcoup=0.45), for which the mixing variables on the grid interact only weakly. For these types of problems, the performance of BP is slightly but consistently better than the LD method; however, note that both methods lead to a percentage increase in MSE of less than 0.25% over all α. Panel (b), in contrast, shows the case of stronger coupling (dcoup=0.90). Here the percentage loss in MSE can be quite substantial for BP in the low SNR region (α small), whereas its behavior improves for high SNR. In contrast, the behavior of the LD method is more stable, with the percentage MSE loss remaining less than 2% over the entire range of α. The degradation of BP performance for strong couplings can be attributed to the nonconvexity of the Bethe problem and the appearance of multiple local optima.
Fig. 5 shows analogous results for the fully connected graph. In panel (a), we see the weakly coupled case (dcoup=0.075); as with the lattice model in Fig. 4(a), the performance loss for either method is less than 2.5% when the dependencies are weak. The results in panel (b), where the couplings are stronger (dcoup=0.15), are markedly different: for harder problems with low SNR, the BP method can show MSE percentage increases upwards of 25%. Over the same range of α, the loss in the LD method remains less than 5%. It should be noted that the relatively poor performance of BP for this fully connected graph is to be expected in some sense, since it is an approximation that is essentially tree-based. Nonetheless, it is interesting that the performance of the LD method remains reasonable even for this densely connected model over the same range of coupling strengths and SNRs.
Fig. 6 shows two cross-sections taken from a grid for coupling strength dcoup=0.90; panel (a) shows the low SNR behavior (α=0.30), whereas panel (b) shows the behavior for higher SNR (α=0.70). The cross-sections were chosen to illustrate a step discontinuity in the underlying signal, at which point the mixing variable switches from Xs=0 (Gaussian mean νs;0=1) to Xs=1 (Gaussian mean νs;1=−1). In the low SNR example, note that the BP algorithm outputs approximate marginals that are strongly skewed toward state Xs=1. As a result, the noisy signal reconstruction is biased toward νs;1=−1. In contrast, the LD method returns approximate marginals that are more balanced between the two mixture components, and thus reconstructs a smoother version of the step that is centered about zero. For the high SNR example in panel (b), both methods perform quite well, as would be expected.
Graphical models have proven useful in a variety of signal processing applications. Although exact algorithms for cycle-free graphs are widely used, the algorithmic treatment of graphs with cycles presents a number of challenges that must be addressed in order for these models to be applied to signal processing problems. The foundation of this paper is an exact variational representation of the problem of computing marginals in general Markov random fields. We demonstrated the utility of semidefinite constraints in developing convex relaxations of this exact principle, which can be solved efficiently to obtain approximations of marginal distributions. The method presented here is based on a Gaussian entropy bound in conjunction with both linear and semidefinite constraints on the marginal polytope. An attractive feature of the resulting log-determinant maximization problem is that it can be solved rapidly by efficient interior point methods [20]. Moreover, we showed how a slightly modified log-determinant relaxation can be solved even more quickly by conventional Newton-like methods. As an illustration of our methods, we applied our relaxation to a noisy prediction problem in a coupled mixture-of-Gaussians model and found that it performed well over a range of coupling strengths and SNR settings. An important open question, not addressed in this paper, is how to estimate model parameters for such prediction problems.
There are a number of additional research directions suggested by the methods proposed here. It remains to develop a deeper understanding of the interaction between the two choices involved in these approximations (i.e., the entropy bound, and the outer bound on the marginal polytope), as well as how to tailor approximations to particular graph structures. It is certainly possible to combine semidefinite constraints with entropy approximations (preferably convex) other than the Gaussian bound used in this paper. For instance, it would be interesting to investigate the behavior of “convexified” Bethe/Kikuchi entropy approximations [29] in conjunction with semidefinite constraints.
ACKNOWLEDGMENT
The authors would like to thank S. Boyd, C. Caramanis, L. El Ghaoui, and L. Vandenberghe for helpful discussions.
Appendix
SECTION A
Generalization to Multinomial States
This Appendix discusses the extension of our techniques to discrete spaces Xs = {0, 1, …, m−1} with m > 2; our treatment is necessarily brief due to space constraints. Given a random variable Xs ∈ Xs, one set of sufficient statistics is the vector of monomials f(xs) := {xs^j, j = 1, …, m−1}. The overall distribution on the random vector X can be represented in terms of these monomials, together with the cross-terms f(xs) ⊗ f(xt) := {xs^j xt^k, j, k = 1, …, m−1} for each edge (s,t) ∈ E. Note that this is a natural generalization of the representation {xs, xs xt} that we used in the binary case. Let μs^j = E[Xs^j] and μst^{jk} = E[Xs^j Xt^k] be the associated mean parameters. Let us define a q×q matrix, where q = 1 + (m−1)n, by taking the covariance cov{1, f(X1), …, f(Xn)}. The elements of this matrix are given in terms of the mean parameters defined above; it is the natural generalization of the matrix M1[μ] defined previously in (16) for binary variables. As in Lemma 1, imposing a semidefinite constraint on this matrix generates an outer bound on the associated marginal polytope. Similarly, we can also derive an upper bound on the entropy of X based on the characterization of the Gaussian distribution as the maximum entropy distribution for a fixed covariance matrix. A first step in doing so is the observation that H(X1, …, Xn) = H(f(X1), …, f(Xn)); we can then upper bound the entropy in terms of the covariance matrix cov{1, f(X1), …, f(Xn)} as we did in the binary case.
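To make the construction concrete, the following sketch (NumPy assumed; illustrative only) builds the vector of monomial statistics for m = 3 and forms the q×q second-order moment matrix of {1, f(X1), …, f(Xn)} by exhaustive enumeration, confirming that it is positive semidefinite, as required for the outer bound.

```python
import itertools
import numpy as np

# Monomial statistics f(x_s) = (x_s, x_s^2) for m = 3 states, stacked into the
# vector (1, f(x_1), ..., f(x_n)); its q x q moment matrix (q = 1 + (m-1) n)
# must be PSD for any distribution over the discrete configurations.
n, m = 3, 3
configs = np.array(list(itertools.product(range(m), repeat=n)), dtype=float)
p = np.random.default_rng(7).dirichlet(np.ones(m ** n))

def features(x):
    f = [1.0]
    for xs in x:
        f.extend(xs ** j for j in range(1, m))   # f(x_s) = (x_s, ..., x_s^{m-1})
    return np.array(f)

F = np.array([features(x) for x in configs])     # one row per configuration
moment = F.T @ np.diag(p) @ F                    # E[(1, f(X_1), ..., f(X_n)) (.)^T]
print(moment.shape, np.linalg.eigvalsh(moment).min() >= -1e-10)
```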
SECTION B.
Proof of Lemma 2
We first convert our discrete random vector X (taking values in {0,1}n) to a continuous version X˜ with an equivalent differential entropy. Define a new continuous random vector via X˜:=X+U, where U is a random vector independent of X, with each element independently and identically distributed as Us∼U[−(1/2),1/2].
Lemma 3:
We have h(X˜)=H(X), where h and H denote the differential and discrete entropies of X˜ and X, respectively.
Proof:
Throughout this proof, we use p(⋅) to denote the probability density function of the continuous random vector X˜ and P(⋅) to denote the probability mass function of the discrete random vector X. By definition [24], the differential entropy of X˜ is given by the integral h(X˜):=−∫Sp(x˜)logp(x˜)dx˜, where S={x˜∈Rn|p(x˜)>0} is the support of X˜. By our construction of X˜, the support set S can be decomposed into a disjoint union of hyperboxes B(e) of unit volume, one for each configuration e∈{0,1}n. Using this decomposition, we write the differential entropy as
$$
h(\widetilde{X}) = -\sum_{\mathbf{e} \in \{0,1\}^n} \int_{B(\mathbf{e})} p(\widetilde{x}) \log p(\widetilde{x})\, d\widetilde{x}.
$$

Now by our construction of X̃, the quantity p(x̃) log p(x̃) is equal to the constant P(e) log P(e) for all x̃ in the hyperbox B(e), where P(e) is the probability of the discrete configuration e ∈ {0,1}^n. Accordingly, we have h(X̃) = −∑_{e∈{0,1}^n} P(e) log P(e) [∫_{B(e)} 1 dx̃], which is equal to H(X), since the volume of each box B(e) is unity.□
Now to establish Lemma 2, let μ∈MARG(Kn) and let X be a random vector with these marginals. Consider the continuous-valued random vector X˜:=X+U. From Lemma 3, we have H(X)=h(X˜). Combining this equality with the Gaussian entropy bound (18) yields the upper bound
$$
H(X) \le \frac{1}{2} \log\det \mathrm{cov}(\widetilde{X}) + \frac{n}{2} \log(2\pi e).
$$

We now express the log-determinant quantity in an alternative form. First, using the independence of X and U, we write cov(X̃) = cov(X) + (1/12) In, where we have used the fact that the covariance matrix of a uniform random vector U on a unit box is (1/12) In. Next, we use the Schur complement formula [20] to express log det cov(X̃) in terms of the second-order moment matrix M1[μ] defined in (16) as follows:

$$
\log\det\left[\mathrm{cov}(X) + \frac{1}{12} I_n\right]
= \log\det\left\{M_1[\mu] + \frac{1}{12}\,\mathrm{blkdiag}[0, I_n]\right\} \tag{22}
$$

where blkdiag[0, In] is an (n+1)×(n+1) block-diagonal matrix. Combining (22) with the Gaussian upper bound on H(X) = −A*(μ) yields

$$
-A^{*}(\mu) \le \frac{1}{2} \log\det\left(M_1[\mu] + \frac{1}{12}\,\mathrm{blkdiag}[0, I_n]\right) + \frac{n}{2} \log(2\pi e)
$$

which is the statement of Lemma 2.
SECTION C.
Derivation of Dual Updates
To derive Newton updates for the Lagrangian dual of the weakened relaxation described in Section IV-D, it is more convenient to work with the same relaxation, but as applied to "spin" variables V∈{−1,+1}^n. The interaction among these spin variables can be captured by a Markov random field of the form p(v;γ) ∝ exp{∑_{s∈V} γs vs + ∑_{(s,t)∈E} γst vs vt}. Let ηs = E[vs] and ηst = E[vs vt] be the associated moments of the spin vector, and let M1[η] denote the second-order moment matrix associated with V; it has the form of the matrix in (16), but with diagonal elements all equal to one, since vs² = 1 for spin variables in {−1,+1}. In this spin representation, we convert to a continuous version V˜ := (1/2)V + U, where again U is uniformly distributed on [−(1/2),+(1/2)]. (The rescaling by 1/2 is necessary to adjust the impulses to be within distance one of each other.) The log-determinant of the covariance of V˜ takes the form logdet{M1[η] + (1/3) blkdiag[0,In]} + n log(1/4). Introducing the matrix variable Y := M1[η] + (1/3) blkdiag[0,In], the weakened semidefinite relaxation corresponds to max_{Y⪰0}{⟨−A,Y⟩ + logdet Y} such that diag(Y) = d, where d = [1 4/3 … 4/3]ᵀ, A is an (n+1)×(n+1) matrix involving the weight vector γ, and ⟨A,B⟩ := trace(AB) is the Frobenius inner product. (As a side note, we have dropped a multiplicative factor of 1/2, as well as the additive constants, in this form of the cost function; the motivation for introducing a negative sign in A will become clear momentarily.)
Let λ∈Rn+1 be a vector of Lagrange multipliers associated with the linear constraints diag(Y)=d. Computing the (Lagrangian) dual function of the weakened relaxation yields
$$
Q(\lambda) = -(n+1) - \log\det\left[A + \mathrm{diag}(\lambda)\right] + \left\langle \mathrm{diag}(\lambda), \mathrm{diag}(d)\right\rangle. \tag{23}
$$

Thus, the dual problem corresponds to optimizing a convex and continuously differentiable function over Rn+1, which can be performed very efficiently by Newton's method [27]. In particular, using well-known properties of matrix derivatives, we can compute the gradient ∇Q(λ) = −diag{[A + diag(λ)]⁻¹} + d and the Hessian ∇²Q(λ) = ([A + diag(λ)]⁻¹) ⊙ ([A + diag(λ)]⁻¹), where ⊙ denotes the Hadamard product; this Hessian is positive semidefinite by the Schur product theorem, consistent with the convexity of Q. Thus, we can apply Newton's method or other scaled gradient methods to solve the dual problem. Given the optimal dual solution λ*, we obtain the optimal primal solution as Y* = [A + diag(λ*)]⁻¹. Finally, the optimal moment matrix M1[η*] is given by M1[η*] = Y* − (1/3) blkdiag[0, In].
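A direct implementation of these dual updates is short. The sketch below (NumPy assumed; an illustration of the update equations rather than the authors' code, with a simple damped step to remain in the domain A + diag(λ) ≻ 0) runs Newton's method on Q and recovers the optimal primal matrix Y*.

```python
import numpy as np

def solve_dual(A, d, num_iters=50, tol=1e-10):
    """Newton's method on the dual (23): minimize Q(lambda) over R^{n+1}.

    A is the (n+1) x (n+1) symmetric matrix built from the weights gamma, and
    d = [1, 4/3, ..., 4/3]; returns the optimal primal matrix Y* = [A + diag(l*)]^{-1}.
    """
    # Crude diagonally dominant starting point, so that A + diag(lam) > 0.
    lam = np.ones(len(d)) * (1.0 + np.abs(A).sum())
    for _ in range(num_iters):
        Binv = np.linalg.inv(A + np.diag(lam))
        grad = -np.diag(Binv) + d              # gradient of Q
        if np.linalg.norm(grad) < tol:
            break
        hess = Binv * Binv                     # Hadamard product: Hessian of Q
        step = np.linalg.solve(hess, grad)
        t = 1.0                                # damped Newton step: backtrack to
        while not np.all(np.linalg.eigvalsh(A + np.diag(lam - t * step)) > 0):
            t *= 0.5                           # stay inside A + diag(lam) > 0
        lam = lam - t * step
    return np.linalg.inv(A + np.diag(lam))     # optimal Y*

# The optimal moment matrix then follows as M1[eta*] = Y* - (1/3) blkdiag[0, I_n].
```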
In the absence of any particular graph structure, the overall computational complexity of full Newton updates is O(n³), which arises from the inversion of (n+1)×(n+1) matrices. If we restrict our attention to a grid-structured graphical model, then the matrix [A+diag(λ)] will be sparse and grid-structured and thus can be inverted with complexity O(n^1.5) using nested dissection [30]. Thus, we can perform diagonally scaled gradient descent with a complexity of O(n^1.5) on grid-structured problems.