With the growing size of datasets collected from real-world problems, graphical models have become a powerful statistical tool for extracting and representing the underlying hidden structure of the variables. These models, commonly known as Markov Random Fields (MRFs), have a simple and intuitive interpretation: the conditional dependencies between variables are captured via weighted edges in the associated graph representation. The applications of MRFs span from computer vision and natural language processing to economics [10], [24], [26]. One of the most widely used classes of MRFs is the class of Gaussian Graphical Models (GGMs). The applicability of GGMs is contingent upon estimating the covariance or inverse covariance matrix (also known as the concentration matrix) from a limited number of independent and identically distributed (i.i.d.) samples drawn from a multivariate Gaussian distribution [5], [13], [28], [45].
In particular, we consider the problem of estimating an $n\times n$ covariance matrix $\Sigma$ (or its inverse $\Sigma^{-1}$) of an $n$-variate probability distribution from $N$ i.i.d. samples $x_1,x_2,\ldots,x_N$ drawn from that distribution. In many applications [10], [24], [26], the matrix $\Sigma^{-1}$ is sparse, meaning that most of its elements are zero. For Gaussian distributions, the statistical interpretation of sparsity in $\Sigma^{-1}$ is that most of the variables are pairwise conditionally independent [13].
Imposing sparsity upon $\Sigma^{-1}$ gives rise to a shrinkage estimator for the inverse covariance matrix. This is particularly important in high-dimensional settings where $n$ is large, often significantly larger than the number of samples ($N\ll n$). One popular approach is to solve the following $\ell_1$-regularized problem
\begin{equation*} \underset{X\succ 0}{\text{minimize }} \mathrm{tr}\, CX-\log\det X+\lambda\sum_{i=1}^{n}\sum_{j=1}^{n}|X_{i,j}|. \tag{1}\end{equation*}
Here, $C=\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})(x_i-\bar{x})^{T}$ is the sample covariance matrix with the sample mean $\bar{x}=\frac{1}{N}\sum_{i=1}^{N}x_i$, and $X$ is the resulting estimator for $\Sigma^{-1}$. This approach, commonly known as the graphical lasso [13], stems from maximum likelihood estimation of the inverse covariance matrix in the Gaussian setting. For more general distributions, (1) corresponds to the $\ell_1$-regularized log-determinant Bregman divergence, which is a widely used method for measuring the distance between the true and estimated second-order moments of a probability distribution [35]. The graphical lasso is known to enjoy surprising statistical guarantees [35], [38], some of which are direct extensions of earlier work on the classical lasso [22], [31], [32], [43]. Variations on this theme are to impose the $\ell_1$ penalty only on the off-diagonal elements of $X$, or to place different weights $\lambda$ on the elements of the matrix $X$, as in the classical weighted lasso.
The decision variable is an $n\times n$ matrix, so simply fitting all $O(n^2)$ variables into memory is already a significant issue. General-purpose algorithms have either prohibitively high complexity or slow convergence. In practice, (1) is solved using problem-specific algorithms. The state of the art includes GLASSO [13], QUIC [20], and its "big-data" extension BIG-QUIC [21]. These algorithms use between $O(n)$ and $O(n^3)$ time and between $O(n)$ and $O(n^2)$ memory per iteration, but the number of iterations needed to converge to an accurate solution can be very large. Therefore, while the $\ell_1$-regularized problem (1) is technically convex, it is considered intractable for real large-scale datasets. For instance, graphical lasso is used in gene regulatory networks to infer the interactions between various genes in response to different disease processes [29]. The number of genes in humans can easily reach tens of thousands, a size that is far beyond the capability of existing solvers. Graphical lasso is also a popular method for understanding the partial correlations between different brain regions in response to physical and mental activities. According to the MyConnectome project, the full brain mask can include hundreds of thousands of voxels, making the problem prohibitive for available numerical solvers.
This paper is organized as follows. In the rest of this section, we review existing results on graphical lasso and thresholding, summarize our contributions, and discuss their connection to state-of-the-art methods. In Section II, we provide sufficient conditions for the equivalence between the generalized graphical lasso and a soft-thresholding method. Section III is devoted to the statement of our recursive closed-form solution for graphical lasso with chordal structures. Our main iterative algorithm for solving graphical lasso with general structures is provided in Section IV. The simulation results are presented in Section V, and Section VII contains the sketches of the proofs and the technical details of intermediate lemmas.
Notations: Let $\mathbb{R}^n$ be the set of $n\times 1$ real vectors, and $\mathbb{S}^n$ be the set of $n\times n$ real symmetric matrices. (We denote $x\in\mathbb{R}^n$ using lower-case, denote $X\in\mathbb{S}^n$ using upper-case, and index the $(i,j)$-th element of $X$ as $X_{i,j}$.) We endow $\mathbb{S}^n$ with the usual matrix inner product $X\bullet Y=\mathrm{tr}\,XY$ and Frobenius norm $\|X\|_F^2=X\bullet X$. Let $\mathbb{S}_+^n\subset\mathbb{S}^n$ and $\mathbb{S}_{++}^n\subset\mathbb{S}_+^n$ be the associated sets of positive semidefinite and positive definite matrices. We will frequently write $X\succeq 0$ to mean $X\in\mathbb{S}_+^n$ and write $X\succ 0$ to mean $X\in\mathbb{S}_{++}^n$. Sets are denoted by calligraphic letters. Given a sparsity set $\mathcal{G}$, we define $\mathbb{S}_{\mathcal{G}}^n\subseteq\mathbb{S}^n$ as the set of $n\times n$ real symmetric matrices whose $(i,j)$-th element is zero if $(i,j)\notin\mathcal{G}$. For a square matrix $M$, let $\mathrm{diag}(M)$ be the same-size diagonal matrix collecting the diagonal elements of $M$.
A. Graphical Lasso, Soft-Thresholding, and MDMC
The high computational cost of graphical lasso has inspired a number of heuristics that are significantly cheaper to use. One simple idea is to threshold the sample covariance matrix $C$: to examine all of its elements and keep only the ones whose magnitudes exceed some threshold. Thresholding can be fast, even for very large-scale datasets, because it is embarrassingly parallel; its quadratic $O(n^2)$ total work can be spread over thousands or millions of parallel processors, in a GPU or distributed on cloud servers. When the number of samples $N$ is small, i.e. $N\ll n$, thresholding can also be performed using $O(n)$ memory, by working directly with the $n\times N$ centered matrix-of-samples $[x_1-\bar{x},x_2-\bar{x},\ldots,x_N-\bar{x}]$.
In a recent line of work [12], [27], [40], the simple heuristic of thresholding was shown to enjoy some surprising guarantees. In particular, [40] and [12] proved that when the $\ell_1$ regularization is imposed over only the off-diagonal elements of $X$, under some assumptions, the sparsity pattern of the associated graphical lasso estimator can be recovered by performing a soft-thresholding operation on $C$, as in
\begin{equation*} (C_{\lambda})_{i,j}=\begin{cases} C_{i,j} & i=j,\\ C_{i,j}-\lambda & C_{i,j}>\lambda,\;i\ne j,\\ 0 & |C_{i,j}|\le\lambda,\;i\ne j,\\ C_{i,j}+\lambda & C_{i,j}<-\lambda,\;i\ne j, \end{cases} \tag{2}\end{equation*}
and recovering its sparsity set
\begin{equation*} \mathcal{G}=\{(i,j)\in\{1,\ldots,n\}^{2}:(C_{\lambda})_{i,j}\ne 0\}. \tag{3}\end{equation*}
The support graph of $\mathcal{G}$, denoted by $\mathrm{supp}(\mathcal{G})$, is obtained by viewing each element $(i,j)$ in $\mathcal{G}$ as an edge between the $i$-th and $j$-th vertices of an undirected graph on $n$ nodes. Moreover, [40] and [12] showed that the estimator $X$ can be recovered by solving a version of (1) in which the sparsity set $\mathcal{G}$ is explicitly imposed, as in
\begin{align*}&\underset{X\succ 0}{\text{minimize }} ~ \mathrm{tr}\, C_{\lambda}X-\log\det X \\&\text{subject to } ~ X_{i,j}=0\quad\forall (i,j)\notin\mathcal{G}.\tag{4}\end{align*}
Recovering the exact value of $X$ (and not just its sparsity pattern) is important because it provides a quantitative measure of the conditional dependence between variables.
The assumptions needed for the sparsity pattern of the graphical lasso to be equivalent to that of thresholding are hard to check but relatively mild. Indeed, [12] proved that these assumptions are automatically satisfied whenever $\lambda$ is sufficiently large relative to the sample covariance matrix. Their numerical study found "sufficiently large" to be a fairly loose criterion in practice, particularly in view of the fact that large values of $\lambda$ are needed to induce a sufficiently sparse estimate of $\Sigma^{-1}$, e.g. with $\approx 10n$ nonzero elements. Building upon this result, [12] introduces a closed-form solution for graphical lasso that is proven to be optimal when the soft-thresholded sample covariance matrix induces an acyclic graph.
B. Summary of Contributions
The purpose of this paper is three-fold.
Goal 1: We derive an extension of the guarantees derived in [12], [27], and [40] for a slightly more general version of the problem that we call generalized graphical lasso (GGL):
\begin{align*} \hat{X}=&\underset{X\succ 0}{\text{minimize }} ~ \mathrm{tr}\, CX-\log\det X+\sum_{i=1}^{n}\sum_{j=i+1}^{n}\lambda_{i,j}|X_{i,j}| \\&\text{subject to } ~ X_{i,j}=0\quad\forall (i,j)\notin\mathcal{H}.\tag{5}\end{align*}
GGL is (1) penalized by weighted regularization coefficients $\lambda_{i,j}$ on the off-diagonal elements, and with an a priori sparsity set $\mathcal{H}$ imposed as an additional constraint. We use the sparsity set $\mathcal{H}$ to incorporate prior information on the structure of the graphical model. For example, if the sample covariance $C$ is collected over a large-scale graph, such as a gene regulatory network, a social network, or a transportation system, then far-away variables can be assumed to be pairwise conditionally independent [7], [19], [34]. Including these neighborhood relationships in $\mathcal{H}$ can regularize the statistical problem, as well as reduce the numerical cost of a solution.
In Section II, we describe a procedure to transform the GGL (5) into a maximum determinant matrix completion (MDMC) problem, which is the dual of problem (4), in the same style as the prior results of [12] for graphical lasso. More specifically, we soft-threshold the sample covariance $C$ and then project this matrix onto the sparsity set $\mathcal{H}$. We give conditions for the resulting sparsity pattern to be equivalent to the one obtained by solving (5). Furthermore, we prove that the resulting estimator $X$ can be recovered by solving an appropriately designed MDMC problem.
Goal 2: Upon converting the GGL to an MDMC problem, we derive recursive closed-form solutions for the cases where the soft-thresholded sample covariance has a chordal structure. More specifically, we show that the optimal solution of (5) can be obtained from a closed-form recursive formula when the soft-thresholded sample covariance matrix has a chordal structure. As previously pointed out, most numerical algorithms require a large number of iterations (at least $O(n)$) to solve the graphical lasso and have a worst-case per-iteration complexity of $O(n^3)$. We show that our proposed recursive formula requires a number of iterations growing linearly in the size of the problem, and that the complexity of each iteration depends on the size of the maximum clique in its support graph. Therefore, given the soft-thresholded sample covariance matrix (which can be obtained while constructing the sample covariance matrix), the complexity of solving the graphical lasso for sparse and chordal structures reduces from $O(n^4)$ to $O(n)$.
Goal 3: Based on the connection between the GGL and MDMC problems and the closed-form recursive solution for chordal structures, we develop an efficient iterative algorithm for the GGL with nonchordal structures. In particular, instead of directly solving the GGL, we solve its MDMC counterpart based on the chordal embedding approach of [2], [3], and [8] and then recover the optimal solution. We embed $\mathcal{G}_{\mathcal{H}}=\mathcal{G}\cap\mathcal{H}$ within a chordal $\tilde{\mathcal{G}}\supset\mathcal{G}_{\mathcal{H}}$, which results in a convex optimization problem over $\mathbb{S}_{\tilde{\mathcal{G}}}^n$, namely the space of real symmetric matrices with the sparsity pattern $\tilde{\mathcal{G}}$. This way, the constraint $X\in\mathbb{S}_{\tilde{\mathcal{G}}}^n$ is implicitly imposed, meaning that we simply ignore the elements not in $\tilde{\mathcal{G}}$. This significantly reduces the number of variables in the problem. After the embedding step, we solve the GGL (or, to be precise, its associated MDMC) on $\mathbb{S}_{\tilde{\mathcal{G}}}^n$, as opposed to the significantly larger cone of positive definite matrices, using a custom Newton-CG method. Roughly speaking, this enables us to take advantage of fast operations on chordal structures, such as inversion, multiplication, and projection, in order to achieve a cheap per-iteration complexity, thereby significantly reducing the running time. The main idea behind our proposed algorithm is to use an inner conjugate gradients (CG) loop to solve the Newton subproblem of an outer Newton's method. The proposed algorithm has a number of features designed to exploit problem structure, including the sparse chordal property of $\tilde{\mathcal{G}}$, duality, and the ability of CG and Newton to converge superlinearly; these are outlined in Section IV.
Assuming that the chordal embedding is sparse with $|\tilde{\mathcal{G}}|=O(n)$ nonzero elements, we prove in Section IV-D that, in the worst case, our algorithm converges to an $\epsilon$-accurate solution of MDMC (4) in
\begin{equation*} O(n\cdot\log\epsilon^{-1}\cdot\log\log\epsilon^{-1}) ~\text{time and }O(n)~\text{memory,} \tag{6}\end{equation*}
from which the optimal solution of the GGL can be efficiently recovered. Most importantly, the algorithm is highly efficient in practice. In Section V, we present computational results on a suite of test cases. Both synthetic and real-life graphs are considered. Using our approach, we solve sparse inverse covariance estimation problems as large as $n=200{,}000$ (more than 20 billion variables in graphical lasso) in less than an hour on a laptop computer, while state-of-the-art methods cannot even start their iterations for these instances.
C. Related Work
1) Graphical Lasso With Prior Information
A number of approaches are available in the literature to incorporate prior information into graphical lasso. The weighted version of graphical lasso mentioned above is one example. [11] introduced a class of graphical lasso problems in which the true graphical model is assumed to have a Laplacian structure; this structure commonly appears in signal and image processing [30]. The pathway graphical lasso is introduced in [17], which takes into account a priori knowledge of the conditional independence between different variables.
2) Algorithms for Graphical Lasso
Algorithms for graphical lasso are usually based on some mixture of Newton [33], proximal Newton [20], [21], iterative thresholding [36], and (block) coordinate descent [13], [41]. All of these suffer fundamentally from the need to keep track of and act on all $O(n^2)$ elements of the matrix decision variable $X$. Even if the final solution matrix is sparse with $O(n)$ nonzeros, it is still possible for the algorithm to traverse a "dense region" in which the iterate $X$ must be fully dense. Thresholding heuristics have been proposed to address this issue, but these may adversely affect the outer algorithm and prevent convergence. It is generally impossible to guarantee a figure lower than $O(n^2)$ time per iteration, even if the solution contains only $O(n)$ nonzeros. Most of the algorithms mentioned above actually have worst-case per-iteration costs of $O(n^3)$.
3) Algorithms for MDMC
Our algorithm is inspired by a line of results [2], [3], [8], [23] for minimizing the log-det penalty on chordal sparsity patterns, culminating in the CVXOPT package [4]. These are Newton algorithms that solve the Newton subproblem by explicitly forming and factoring the fully-dense Newton matrix. When $|\tilde{\mathcal{G}}|=O(n)$, these algorithms cost $O(nm^2+m^3)$ time and $O(m^2)$ memory per iteration, where $m$ is the number of edges added to $\mathcal{G}$ to yield the chordal $\tilde{\mathcal{G}}$. In practice, $m$ is usually a factor of 0.1 to 20 times $n$, so these algorithms require cubic $O(n^3)$ time and $O(n^2)$ memory. Our algorithm instead solves the Newton subproblem iteratively using CG. We prove that CG requires just $O(n)$ time to compute the Newton direction to machine precision (see Section IV-D). In practice, CG converges much faster than its worst-case bound, because it is able to exploit eigenvalue clustering to achieve superlinear convergence. The preconditioned CG (PCG) algorithm of [8] is superficially similar to our algorithm, but applies PCG directly to the log-det minimization. Its convergence is much slower than the companion sparse interior-point method, as demonstrated by their numerical results.
SECTION II. Equivalence Between GGL and Thresholding
Let $P_{\mathcal{H}}(X)$ denote the projection operator from $\mathbb{S}^n$ onto $\mathbb{S}_{\mathcal{H}}^n$, i.e. the operator that sets $X_{i,j}=0$ for all $(i,j)\notin\mathcal{H}$. Let $C_{\lambda}$ be the sample covariance matrix $C$ individually soft-thresholded by $[\lambda_{i,j}]$, as in
\begin{equation*} (C_{\lambda})_{i,j}=\begin{cases} C_{i,j} & i=j,\\ C_{i,j}-\lambda_{i,j} & C_{i,j}>\lambda_{i,j},\;i\ne j,\\ 0 & |C_{i,j}|\le\lambda_{i,j},\;i\ne j,\\ C_{i,j}+\lambda_{i,j} & C_{i,j}<-\lambda_{i,j},\;i\ne j. \end{cases} \tag{7}\end{equation*}
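For concreteness, the following MATLAB sketch applies the entrywise soft-thresholding rule (7) followed by the projection $P_{\mathcal{H}}$. The variable names (C, Lam, H) are ours, not from the paper's code, and $\mathcal{H}$ is assumed to contain the diagonal.

```matlab
% Minimal sketch of the weighted soft-thresholding (7) and the projection P_H.
% Inputs (our notation): C   - n-by-n sample covariance matrix,
%                        Lam - n-by-n symmetric matrix of weights lambda_{i,j},
%                        H   - n-by-n 0/1 matrix encoding the sparsity set H (diagonal included).
Clam = sign(C) .* max(abs(C) - Lam, 0);   % soft-threshold every entry
Clam(1:size(C,1)+1:end) = diag(C);        % (7) leaves the diagonal untouched
CH   = sparse(Clam .* H);                 % projection P_H: zero out entries outside H
G    = CH ~= 0;                           % resulting sparsity set G_H, as a logical mask
```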
In this section, we state the conditions for $P_{\mathcal{H}}(C_{\lambda})$, the projection of the soft-thresholded matrix $C_{\lambda}$ in (7) onto $\mathcal{H}$, to have the same sparsity pattern as the GGL estimator $\hat{X}$ in (5). For brevity, the technical proofs are omitted; they can be found in Section VII. Before stating the exact conditions, we begin by adopting some definitions and notations from the literature.
Definition 1 [12]:
Given a matrix $M\in\mathbb{S}^n$, define $\mathcal{G}_M=\{(i,j):M_{i,j}\ne 0\}$ as its sparsity pattern. Then, $M$ is called inverse-consistent if there exists a matrix $N\in\mathbb{S}^n$ such that
\begin{align*} M+N\succ&\;0\tag{8a}\\ N_{i,j}=&\;0\quad\forall (i,j)\in\mathcal{G}_M\tag{8b}\\ (M+N)^{-1}\in&\;\mathbb{S}_{\mathcal{G}_M}^{n}\tag{8c}\end{align*}
The matrix $N$ is called an inverse-consistent complement of $M$ and is denoted by $M^{(c)}$. Furthermore, $M$ is called sign-consistent if, for every $(i,j)\in\mathcal{G}_M$, the $(i,j)$-th elements of $M$ and $(M+M^{(c)})^{-1}$ have opposite signs.
Example 1:
Consider the matrix
\begin{equation*} M=\left[\begin{array}{cccc} 1 & 0.3 & 0 & 0\\ 0.3 & 1 & -0.4 & 0\\ 0 & -0.4 & 1 & 0.2\\ 0 & 0 & 0.2 & 1 \end{array}\right]\tag{9}\end{equation*}
We show that $M$ is both inverse- and sign-consistent. Consider the matrix $M^{(c)}$ defined as
\begin{equation*} M^{(c)}=\left[\begin{array}{cccc} 0 & 0 & -0.12 & -0.024\\ 0 & 0 & 0 & -0.08\\ -0.12 & 0 & 0 & 0\\ -0.024 & -0.08 & 0 & 0 \end{array}\right]\tag{10}\end{equation*}
Then $(M+M^{(c)})^{-1}$ can be written as
\begin{equation*} \left[\begin{array}{cccc} \dfrac{1}{0.91} & \dfrac{-0.3}{0.91} & 0 & 0\\[0.5pc] \dfrac{-0.3}{0.91} & 1+\dfrac{0.09}{0.91}+\dfrac{0.16}{0.84} & \dfrac{0.4}{0.84} & 0\\[0.5pc] 0 & \dfrac{0.4}{0.84} & 1+\dfrac{0.16}{0.84}+\dfrac{0.04}{0.96} & \dfrac{-0.2}{0.96}\\[0.5pc] 0 & 0 & \dfrac{-0.2}{0.96} & \dfrac{1}{0.96} \end{array}\right]\tag{11}\end{equation*}
Note that:
$M+M^{(c)}$ is positive definite.
The sparsity pattern of $M$ is the complement of that of $M^{(c)}$.
The sparsity patterns of $M$ and $(M+M^{(c)})^{-1}$ are equivalent.
The nonzero off-diagonal entries of $M$ and $(M+M^{(c)})^{-1}$ have opposite signs.
Therefore, it can be inferred that $M$ is both inverse- and sign-consistent, and $M^{(c)}$ is its inverse-consistent complement.
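Example 1 is easy to verify numerically; the following MATLAB script (our own check, not part of the paper) confirms the three properties used in Definition 1.

```matlab
% Quick numerical check of Example 1: M + M^(c) is positive definite, its
% inverse vanishes outside the support of M, and matching off-diagonal
% entries of M and (M + M^(c))^{-1} have opposite signs.
M  = [1  0.3  0    0 ; 0.3 1 -0.4 0 ; 0 -0.4 1 0.2 ; 0 0 0.2 1];
Mc = [0  0  -0.12 -0.024 ; 0 0 0 -0.08 ; -0.12 0 0 0 ; -0.024 -0.08 0 0];
W  = M + Mc;
assert(min(eig(W)) > 0)                               % (8a): positive definiteness
Winv = inv(W);
offd = ~eye(4);
assert(all(abs(Winv(offd & M == 0)) < 1e-12))         % (8c): inverse supported on G_M
idx  = offd & M ~= 0;
assert(all(sign(Winv(idx)) == -sign(M(idx))))         % sign-consistency
```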
Definition 2:
We take the usual element-wise max-norm to exclude the diagonal, as in $\|M\|_{\max}=\max_{i\ne j}|M_{i,j}|$, and adopt the $\beta(\mathcal{G},\alpha)$ function defined with respect to the sparsity set $\mathcal{G}$ and scalar $\alpha>0$:
\begin{align*} \beta(\mathcal{G},\alpha)=&\max_{M\succ 0} ~ \|M^{(c)}\|_{\max}\tag{12}\\&\text{s.t. }~ M\in\mathbb{S}_{\mathcal{G}}^{n} ~\text{ and }~\|M\|_{\max}\le\alpha\tag{13}\\&\hphantom{\text{s.t. }} M_{i,i}=1\quad\forall i\in\{1,\ldots,n\}\tag{14}\end{align*}
We are now ready to state the conditions for soft-thresholding to be equivalent to GGL.
Theorem 1:
Define $C_{\lambda}$ as in (7), define $C_{\mathcal{H}}=P_{\mathcal{H}}(C_{\lambda})$, and let $\mathcal{G}_{\mathcal{H}}=\{(i,j):(C_{\mathcal{H}})_{i,j}\ne 0\}$ be its sparsity pattern. Assume that the normalized matrix $\tilde{C}=D^{-1/2}C_{\mathcal{H}}D^{-1/2}$, where $D=\mathrm{diag}(C_{\mathcal{H}})$, satisfies the following conditions:
1) $\tilde{C}$ is positive definite;
2) $\tilde{C}$ is sign-consistent;
3) we have
\begin{equation*} \beta\left(\mathcal{G}_{\mathcal{H}},\|\tilde{C}\|_{\max}\right)\leq \underset{(k,l)\notin\mathcal{G}_{\mathcal{H}}}{\min} \frac{\lambda_{k,l}-|C_{k,l}|}{\sqrt{C_{k,k}\cdot C_{l,l}}}. \tag{15}\end{equation*}
Then, $C_{\mathcal{H}}$ has the same sparsity pattern as, and opposite signs to, $\hat{X}$ in (5), i.e.
\begin{align*}&(C_{\mathcal{H}})_{i,j}=0\qquad\iff\qquad \hat{X}_{i,j}=0,\\&(C_{\mathcal{H}})_{i,j}>0\qquad\iff\qquad \hat{X}_{i,j}<0,\\&(C_{\mathcal{H}})_{i,j}<0\qquad\iff\qquad \hat{X}_{i,j}>0.\end{align*}
Theorem 1 is an extension of Theorem 12 in [12], which shows that, under similar conditions, the soft-thresholded sample covariance matrix has the same sparsity pattern as the optimal solution of graphical lasso. We extend these results to the GGL by considering an intermediate weighted graphical lasso and showing that, by carefully selecting the weights of the regularization coefficients, one can adopt the same arguments as in [12] to show the equivalence between the GGL and thresholding; a detailed discussion can be found in Section VII.
Note that the diagonal elements of $\tilde{C}$ are 1 and its off-diagonal elements lie between −1 and 1. A sparse solution for GGL requires large regularization coefficients. This leads to numerous zero elements in $\tilde{C}$ and forces the magnitude of the nonzero elements to be small. This means that, in most instances, $\tilde{C}$ is positive definite or even diagonally dominant. Furthermore, the next two propositions reveal that the second and third conditions of Theorem 1 are not restrictive when the goal is to recover a sparse solution for the GGL.
Proposition 1:
There exists a strictly positive constant $\zeta(\mathcal{G}_{\mathcal{H}})$ such that
\begin{equation*} \beta\left(\mathcal{G}_{\mathcal{H}},\|\tilde{C}\|_{\max}\right)\leq \zeta(\mathcal{G}_{\mathcal{H}})\|\tilde{C}\|_{\max}^{2},\tag{16}\end{equation*}
and therefore, Condition 3 of Theorem 1 is satisfied if
\begin{equation*} \zeta(\mathcal{G}_{\mathcal{H}})\|\tilde{C}\|_{\max}^{2}\leq \underset{\substack{k\ne l\\ (k,l)\notin\mathcal{G}_{\mathcal{H}}}}{\min} \frac{\lambda_{k,l}-|C_{k,l}|}{\sqrt{C_{k,k}\cdot C_{l,l}}}.\tag{17}\end{equation*}
Proof:
The proof is similar to that of [12, Lemma 13]. The details are omitted for brevity.
Proposition 2:
There exist strictly positive constants $\alpha_0(\mathcal{G}_{\mathcal{H}})$ and $\gamma(\mathcal{G}_{\mathcal{H}})$ such that $\tilde{C}$ is sign-consistent if $\|\tilde{C}\|_{\max}\leq\alpha_0(\mathcal{G}_{\mathcal{H}})$ and
\begin{equation*} \gamma(\mathcal{G}_{\mathcal{H}})\cdot\|\tilde{C}\|_{\max}^{2}\leq\|\tilde{C}\|_{\min},\tag{18}\end{equation*}
where $\|\tilde{C}\|_{\min}$ is defined as the minimum absolute value of the nonzero off-diagonal entries of $\tilde{C}$.
Proof:
The proof is similar to that of [12, Lemma 14]. The details are omitted for brevity.
Both of these propositions suggest that, when $\lambda$ is large, i.e., when a sparse solution is sought for the GGL, $\|\tilde{C}\|_{\max}\ll 1$ and hence Conditions 2 and 3 of Theorem 1 are satisfied.
Theorem 1 leads to the following corollary, which asserts that the optimal solution of GGL can be obtained by maximum determinant matrix completion: computing the matrix $Z\succeq 0$ with the largest determinant that "fills in" the zero elements of $C_{\mathcal{H}}=P_{\mathcal{H}}(C_{\lambda})$.
Corollary 1:
Suppose that the conditions in Theorem 1 are satisfied. Define $\hat{S}$ as the solution to the following MDMC problem
\begin{align*} \hat{S}=&\underset{S\succeq 0}{\text{maximize }} ~ \log\det S \\&\text{subject to}~ S_{i,j}=(C_{\mathcal{H}})_{i,j}\quad\text{for all }(i,j)\text{ where }(C_{\mathcal{H}})_{i,j}\ne 0.\tag{19}\end{align*}
Then, $\hat{S}=\hat{X}^{-1}$, where $\hat{X}$ is the solution of (5).
Proof:
Under the conditions of Theorem 1, the $(i,j)$-th element of the solution, $\hat{X}_{i,j}$, has the opposite sign to the corresponding element $(C_{\mathcal{H}})_{i,j}$, and hence also to $C_{i,j}$. Replacing each $|X_{i,j}|$ term in (5) with $\mathrm{sign}(\hat{X}_{i,j})X_{i,j}=-\mathrm{sign}(C_{i,j})X_{i,j}$ yields
\begin{align*} \hat{X}=&\underset{X\succ 0}{\text{minimize }} ~ \underbrace{\mathrm{tr}\, CX-\sum_{(i,j)\in\mathcal{G}}\mathrm{sign}(C_{i,j})\lambda_{i,j}X_{i,j}}_{\equiv\mathrm{tr}\, C_{\lambda}X}-\log\det X \\&\text{subject to}~ X_{i,j}=0\qquad\forall (i,j)\notin\mathcal{H}.\tag{20}\end{align*}
The constraint $X\in\mathbb{S}_{\mathcal{H}}^{n}$ further makes $\mathrm{tr}\, C_{\lambda}X=\mathrm{tr}\, C_{\lambda}P_{\mathcal{H}}(X)=\mathrm{tr}\, P_{\mathcal{H}}(C_{\lambda})X\equiv\mathrm{tr}\, C_{\mathcal{H}}X$. Taking the dual of (20) yields (19); complementary slackness yields $\hat{S}=\hat{X}^{-1}$.
Recall that the main goal of the GGL is to promote sparsity in the estimated inverse covariance matrix. Corollary 1 shows that, instead of merely identifying the sparsity structure of the optimal solution using Theorem 1, the soft-thresholded sample covariance matrix can be further exploited to convert the GGL to the MDMC problem; this conversion is at the core of our subsequent algorithms for solving the GGL to optimality.
SECTION III. Warm-Up: Chordal Structures
In this section, we propose a recursive closed-form solution for the GGL when the soft-thresholded sample covariance has a chordal structure. To this end, we first review the properties of sparse and chordal matrices and their connections to the MDMC problem.
A. Sparse Cholesky Factorization
Consider solving an $n\times n$ symmetric positive definite linear system
\begin{equation*} Sx=b\end{equation*}
by Gaussian elimination. The standard procedure comprises a factorization step, where $S$ is decomposed into the (unique) Cholesky factors $LDL^{T}$, in which $D$ is diagonal and $L$ is lower-triangular with a unit diagonal, and a substitution step, where the two triangular systems of linear equations $Ly=b$ and $DL^{T}x=y$ are solved to yield $x$.
In the case where $S$ is sparse, the Cholesky factor $L$ is often also sparse. It is common to store the sparsity pattern of $L$ in the compressed column storage format: a set of index sets $\mathcal{I}_1,\ldots,\mathcal{I}_n\subseteq\{1,\ldots,n\}$ in which
\begin{equation*} \mathcal{I}_{j}=\{i\in\{1,\ldots,n\}:i>j,\;L_{i,j}\ne 0\} \tag{21}\end{equation*}
encodes the locations of off-diagonal nonzeros in the $j$-th column of $L$. (The diagonal elements are not included because the matrix $L$ has a unit diagonal by definition.)
After storage has been allocated and the sparsity structure has been determined, the numerical values of $D$ and $L$ are computed using a sparse Cholesky factorization algorithm. This requires the associated elimination tree $T=\{\mathcal{V},\mathcal{E}\}$, which is a rooted tree (or forest) on $n$ vertices, with edges $\mathcal{E}=\{\{1,p(1)\},\ldots,\{n,p(n)\}\}$ defined to connect each $j$-th vertex to its parent, the $p(j)$-th vertex (except root nodes, which have "0" as their parent), as in
\begin{equation*} p(j)=\begin{cases} \min\mathcal{I}_{j} & |\mathcal{I}_{j}|>0,\\ 0 & |\mathcal{I}_{j}|=0, \end{cases} \tag{22}\end{equation*}
in which $\min\mathcal{I}_{j}$ denotes the (numerically) smallest index in the index set $\mathcal{I}_{j}$ [25]. The elimination tree encodes the dependency information between different columns of $L$, thereby allowing information to be passed without explicitly forming the matrix.
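As a small illustration, the index sets (21) and the parent function (22) can be read directly off the sparsity pattern of a sparse Cholesky factor. The MATLAB sketch below uses our own variable names and assumes a sparse unit-lower-triangular factor L is available.

```matlab
% Index sets I_j of (21) and elimination tree parents p(j) of (22) from a
% sparse lower-triangular Cholesky factor L (our notation).
n = size(L, 1);
p = zeros(n, 1);                      % p(j) = 0 marks a root of the elimination forest
for j = 1:n
    Ij = find(L(j+1:n, j)) + j;       % off-diagonal nonzero rows in column j, i.e. I_j
    if ~isempty(Ij)
        p(j) = min(Ij);               % parent = smallest index in I_j
    end
end
% MATLAB's built-in etree computes the same tree directly from the pattern of the matrix.
```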
B. Chordal Sparsity Patterns
For a sparsity set $\mathcal{S}$, recall that its support graph, denoted by $\mathrm{supp}(\mathcal{S})$, is defined as the graph with vertex set $\mathcal{V}=\{1,2,\ldots,n\}$ and edge set $\mathcal{S}$. Moreover, $\mathcal{S}$ is said to be chordal if its support graph does not contain an induced cycle of length greater than three. $\mathcal{S}$ is said to factor without fill if every $U\in\mathbb{S}_{\mathcal{S}}^{n}$ can be factored into $LDL^{T}$ such that $L+L^{T}\in\mathbb{S}_{\mathcal{S}}^{n}$. If $\mathcal{S}$ factors without fill, then its support graph is chordal. Conversely, if the support graph is chordal, then there exists a permutation matrix $Q$ such that $QUQ^{T}$ factors without fill for every matrix $U\in\mathbb{S}_{\mathcal{S}}^{n}$ [14]. This permutation matrix $Q$ is called a perfect elimination ordering of the chordal set $\mathcal{S}$.
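The "factors without fill" property is easy to check numerically. The MATLAB sketch below (our own construction, not from the paper) compares the nonzero count of a Cholesky factor with that of the lower triangle of the pattern after a candidate ordering has been applied.

```matlab
% Check whether a symmetric 0/1 pattern A (ones on the diagonal) factors
% without fill under a given permutation vector q (our notation).
n    = size(A, 1);
Aq   = A(q, q);
R    = chol(Aq + n*speye(n));           % diagonal shift guarantees positive definiteness
nofill = (nnz(R) == nnz(triu(Aq)));     % no fill-in iff the factor adds no new nonzeros
```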
C. Recursive Solution of the MDMC Problem.
An important application of chordal sparsity patterns is the efficient solution of the MDMC problem (19). The first-order optimality conditions for the GGL entail that
\begin{equation*} P_{\mathcal{G}_{\mathcal{H}}}(\hat{X}^{-1})=C_{\lambda}.\tag{23}\end{equation*}
On the other hand, Corollary 1 shows that
\begin{equation*} \hat{S}=\hat{X}^{-1}.\tag{24}\end{equation*}
In the case that the sparsity set $\mathcal{G}_{\mathcal{H}}$ factors without fill, [2] showed that (23) is actually a linear system of equations over the Cholesky factors $L$ and $D$ of the solution $\hat{X}=LDL^{T}$; their numerical values can be explicitly computed using a recursive formula.
Note that finding a perfect elimination ordering is NP-hard in general. However, if $\mathcal{G}_{\mathcal{H}}$ is chordal, then a perfect elimination ordering $Q$ can be found in linear time [42], and we can apply the above algorithm to the matrix $QC_{\mathcal{H}}Q^{T}$, whose sparsity set does indeed factor without fill.
The algorithm takes $n$ steps, and the $j$-th step requires a size-$|\mathcal{I}_j|$ linear solve and a vector-vector product. Define $w=\max_j|\mathcal{I}_j|$, which is commonly referred to as the treewidth of $\mathrm{supp}(\mathcal{G}_{\mathcal{H}})$ and has the interpretation of the size of the largest clique in $\mathrm{supp}(\mathcal{G}_{\mathcal{H}})$ minus one. Combined, the algorithm has time complexity $\mathcal{O}(w^3 n)$. This means that the matrix completion algorithm runs in linear time whenever the treewidth of $\mathrm{supp}(\mathcal{G}_{\mathcal{H}})$ is of order $\mathcal{O}(1)$. Note that, for acyclic graphs, we have $w=1$.
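Under the fill-free assumption, the bound $w$ can be read off the column counts of the factor; a short MATLAB sketch (our notation) follows.

```matlab
% Clique-size / treewidth bound w = max_j |I_j| from a fill-free factor L (our notation).
Isz = full(sum(tril(L, -1) ~= 0, 1));   % |I_j|: off-diagonal nonzero count per column
w   = max(Isz);                         % treewidth of supp(G_H); total cost is O(w^3 n)
```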
Next, we show that if the equivalence between the thresholding method and GGL holds, the optimal solution of the GGL can be obtained using Algorithm 1.
Input: Matrix $C_{\lambda}\in\mathbb{S}_{\mathcal{G}_{\mathcal{H}}}^{n}$ that has a positive definite completion.
Output: The Cholesky factors $L$ and $D$ of $\hat{X}=LDL^{T}\in\mathbb{S}_{\mathcal{G}_{\mathcal{H}}}^{n}$ that satisfy $P_{\mathcal{G}_{\mathcal{H}}}(\hat{X}^{-1})=C_{\lambda}$.
Algorithm: Iterate over $j\in\{1,2,\ldots,n\}$ in reverse, i.e. starting from $n$ and ending at 1. For each $j$, compute $D_{j,j}$ and the $j$-th column of $L$ from
\begin{align*} L_{\mathcal{I}_{j},j}=&-V_{j}^{-1}(C_{\lambda})_{\mathcal{I}_{j},j}, \\ D_{j,j}=&\left((C_{\lambda})_{j,j}+(C_{\lambda})_{\mathcal{I}_{j},j}^{T}L_{\mathcal{I}_{j},j}\right)^{-1},\end{align*}
and compute the update matrices
\begin{equation*} V_{i}=P_{\mathcal{I}_{i}\cup\{i\},\mathcal{I}_{i}}^{T}\begin{bmatrix}(C_{\lambda})_{j,j} & (C_{\lambda})_{\mathcal{I}_{j},j}^{T}\\ (C_{\lambda})_{\mathcal{I}_{j},j} & V_{j}\end{bmatrix}P_{\mathcal{I}_{i}\cup\{i\},\mathcal{I}_{i}}\end{equation*}
for each $i$ satisfying $p(i)=j$, i.e. each child of $j$ in the elimination tree.
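For readers who prefer code, here is a minimal MATLAB sketch of Algorithm 1 under the stated assumptions (the pattern of the input already factors without fill in the given ordering). The function and variable names are ours, and the selection step follows our reading of the update formula: it picks the rows and columns of the parent's block indexed by $\mathcal{I}_i$.

```matlab
% Minimal sketch of Algorithm 1 (our notation). Cl is the sparse matrix
% C_lambda whose pattern factors without fill; the output satisfies
% X = L*spdiags(Dd,0,n,n)*L' with P_G(inv(X)) = Cl.
function [L, Dd] = mdmc_chordal(Cl)
    n = size(Cl, 1);
    L = speye(n);  Dd = zeros(n, 1);
    I = cell(n, 1);  parent = zeros(n, 1);  V = cell(n, 1);
    for j = 1:n                                   % index sets (21) and tree (22)
        I{j} = find(Cl(j+1:n, j) ~= 0) + j;
        if ~isempty(I{j}), parent(j) = min(I{j}); end
    end
    for j = n:-1:1                                % reverse elimination order
        Ij = I{j};
        if isempty(Ij)
            Dd(j) = 1 / Cl(j, j);
        else
            Lj = -V{j} \ full(Cl(Ij, j));         % L_{I_j,j} = -V_j^{-1} (C_lambda)_{I_j,j}
            L(Ij, j) = Lj;
            Dd(j) = 1 / (Cl(j, j) + full(Cl(Ij, j))' * Lj);
        end
        Wj = [full(Cl(j, j)), full(Cl(Ij, j))'; full(Cl(Ij, j)), V{j}];  % indexed by [j; Ij]
        for i = find(parent == j)'                % children of j in the elimination tree
            [~, pos] = ismember(I{i}, [j; Ij]);   % positions of I_i inside [j; Ij]
            V{i} = Wj(pos, pos);
        end
    end
end
```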
Proposition 3:
Suppose that the conditions in Theorem 1 are satisfied. Then, Algorithm 1 can be used to find $\hat{X}$ if $\mathcal{G}_{\mathcal{H}}$ is chordal.
Proof:
By construction, the output of Algorithm 1 satisfies $P_{\mathcal{G}_{\mathcal{H}}}(\hat{X}^{-1})=C_{\lambda}$, which is the first-order optimality condition for the GGL. Together with the uniqueness of the optimal solution, this concludes the proof.
Proposition 3 suggests that solving the GGL through its MDMC counterpart is much easier, and can be performed in linear time using Algorithm 1, when the soft-thresholded sample covariance matrix has a sparse and chordal structure. Inspired by this observation, in the next section we propose a highly efficient iterative algorithm for solving the GGL when the soft-thresholded sample covariance matrix is nonchordal, by converting it to an MDMC problem and embedding its nonchordal structure within a chordal pattern. Through this chordal embedding, we show that the number of iterations of the proposed algorithm is essentially constant, each with linear time complexity.
SECTION IV. General Case: Nonchordal Structures
This section describes our algorithm for solving MDMC (4) in which the sparsity set $\mathcal{G}$ is nonchordal. If we assume that the input matrix $C_{\lambda}$ is sparse, and that sparse Cholesky factorization is able to solve $C_{\lambda}x=b$ in $O(n)$ time, then our algorithm is guaranteed to compute an $\epsilon$-accurate solution in $O(n\log\epsilon^{-1})$ time and $O(n)$ memory.
The algorithm is fundamentally a Newton-CG method, i.e. Newton’s method in which the Newton search directions are computed using conjugate gradients (CG). It is developed from four key insights:
Chordal embedding is easy via sparse matrix heuristics. State-of-the-art algorithms for (4) begin by computing a chordal embedding $\tilde{\mathcal{G}}$ for $\mathcal{G}$. The optimal chordal embedding with the fewest number of nonzeros $|\tilde{\mathcal{G}}|$ is NP-hard to compute, but a good-enough embedding with $O(n)$ nonzeros is sufficient for our purposes. Obtaining a good $\tilde{\mathcal{G}}$ with $|\tilde{\mathcal{G}}|=O(n)$ is exactly the same problem as finding a sparse Cholesky factorization $C_{\lambda}=LL^{T}$ with $O(n)$ fill-in. Using heuristics developed for numerical linear algebra, we are able to find sparse chordal embeddings for graphs containing millions of edges and hundreds of thousands of nodes in seconds.
Optimize directly on the sparse matrix cone. Using log-det barriers for sparse matrix cones [2], [3], [8], [42], we can optimize directly in the space $\mathbb{S}_{\tilde{\mathcal{G}}}^{n}$, while ignoring all matrix elements outside of $\tilde{\mathcal{G}}$. If $|\tilde{\mathcal{G}}|=O(n)$, then only $O(n)$ decision variables must be explicitly optimized. Moreover, each function evaluation, gradient evaluation, and matrix-vector product with the Hessian can be performed in $O(n)$ time, using the numerical recipes in [2].
The dual is easier to solve than the primal. The primal problem starts with a feasible point $X\in\mathbb{S}_{\tilde{\mathcal{G}}}^{n}$ and seeks to achieve first-order optimality. The dual problem starts with an infeasible point $X\notin\mathbb{S}_{\tilde{\mathcal{G}}}^{n}$ satisfying first-order optimality, and seeks to make it feasible. We will show that feasibility is easier to achieve than optimality for the GGL, so the dual problem is easier to solve than the primal.
Conjugate gradients (CG) converges in $O(1)$ iterations. Our main result (Theorem 2) bounds the condition number of the Newton subproblem by $O(1)$, independent of the problem dimension $n$ and the current accuracy $\epsilon$. It is therefore cheaper to solve this subproblem using CG to machine precision $\delta_{\mathrm{mach}}$ in $O(n\log\delta_{\mathrm{mach}}^{-1})$ time than it is to solve it directly in $O(nm^{2}+m^{3})$ time, as in existing methods [2], [3], [8]. Moreover, CG is an optimal Krylov subspace method, and as such it is often able to exploit clustering in the eigenvalues to converge superlinearly. Finally, computing the Newton direction to high accuracy allows the outer Newton method to converge quadratically.
The remainder of this section describes each consideration in further detail. We state the algorithm explicitly in Section IV-E. For simplicity of notation, in the rest of this section we write $\mathcal{G}$ instead of $\mathcal{G}_{\mathcal{H}}$, the intersection of the sparsity sets $\mathcal{G}$ and $\mathcal{H}$.
A. Efficient Chordal Embedding
Following [8], we begin by reformulating (4) as a sparse chordal matrix program
\begin{align*} \hat{X}=&\;\text{minimize} ~ \mathrm{tr}\, CX-\log\det X \\&\;\text{subject to} ~ X_{i,j}=0\quad\forall (i,j)\in\tilde{\mathcal{G}}\backslash\mathcal{G}, \\&\;\hphantom{\text{subject to} ~} X\in\mathbb{S}_{\tilde{\mathcal{G}}}^{n},\tag{25}\end{align*}
in which $\tilde{\mathcal{G}}$ is a chordal embedding for $\mathcal{G}$: a sparsity pattern $\tilde{\mathcal{G}}\supset\mathcal{G}$ whose support graph contains no induced cycles of length greater than three. This reformulation can be implemented using standard algorithms for large-and-sparse linear equations, due to the following result.
Proposition 4:
Let $C\in\mathbb{S}_{\mathcal{G}}^{n}$ be a positive definite matrix, and compute its unique lower-triangular Cholesky factor $L$ satisfying $C=LL^{T}$. Ignoring perfect numerical cancellation, the sparsity set of $L+L^{T}$ is a chordal embedding $\tilde{\mathcal{G}}\supset\mathcal{G}$.
Proof:
The original proof is due to [37]; see also [42].
Note that $\tilde{\mathcal{G}}$ can be determined directly from $\mathcal{G}$ using a symbolic Cholesky algorithm, which simulates the steps of Gaussian elimination using Boolean logic. Moreover, we can substantially reduce the number of elements added to $\mathcal{G}$ by reordering the columns and rows of $C$ using a fill-reducing ordering.
Corollary 2:
Let $\Pi$ be a permutation matrix. For the same $C\in\mathbb{S}_{\mathcal{G}}^{n}$ as in Proposition 4, compute the unique Cholesky factor satisfying $\Pi C\Pi^{T}=LL^{T}$. Ignoring perfect numerical cancellation, the sparsity set of $\Pi^{T}(L+L^{T})\Pi$ is a chordal embedding $\tilde{\mathcal{G}}\supset\mathcal{G}$.
The problem of finding the best choice of $\Pi$ is known as the fill-minimizing problem and is NP-complete [44]. However, good orderings are easily found using heuristics developed for numerical linear algebra, like minimum degree ordering [15] and nested dissection [1], [16]. In fact, [16] proved that nested dissection is $\mathcal{O}(\log(n))$-suboptimal for bounded-degree graphs, and noted that "we do not know a class of graphs for which [nested dissection is suboptimal] by more than a constant factor."
If $\mathcal{G}$ admits sparse chordal embeddings, then a good-enough embedding with $|\tilde{\mathcal{G}}|=\mathcal{O}(n)$ will usually be found using minimum degree or nested dissection. In MATLAB, the minimum degree ordering and symbolic factorization steps can be performed in a few lines of code; see the snippet in Figure 1.
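Figure 1 is not reproduced here; the following is a hypothetical reconstruction of that kind of snippet, using an approximate minimum degree ordering and a Cholesky factorization of the (shifted) pattern matrix. The variable names are ours.

```matlab
% Chordal embedding of the sparsity pattern of C_lambda via a fill-reducing
% ordering and a Cholesky factorization of the pattern (a sketch, not the
% paper's Figure 1). Cl is the sparse soft-thresholded sample covariance.
n = size(Cl, 1);
A = spones(abs(Cl) + speye(n));       % 0/1 pattern of C_lambda, diagonal included
q = amd(A);                           % approximate minimum degree ordering
R = chol(A(q, q) + n*speye(n));       % diagonal shift makes the pattern matrix PD
Gtilde = spones(R + R');              % chordal embedding, in the permuted order
fprintf('fill-in: %d added nonzero entries\n', nnz(Gtilde) - nnz(A));
```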
B. Logarithmic Barriers for Sparse Matrix Cones
Define the cone of sparse positive semidefinite matrices $\mathcal{K}$, and the cone of sparse matrices with positive semidefinite completions $\mathcal{K}_*$, as
\begin{equation*} \mathcal{K}=\mathbb{S}_{+}^{n}\cap\mathbb{S}_{\tilde{\mathcal{G}}}^{n},\qquad \mathcal{K}_{*}=\{X\in\mathbb{S}_{\tilde{\mathcal{G}}}^{n}:S\bullet X\ge 0~\forall S\in\mathcal{K}\}=P_{\tilde{\mathcal{G}}}(\mathbb{S}_{+}^{n}).\tag{26}\end{equation*}
Then, (25) can be posed as the primal-dual pair
\begin{align*}&\arg\min_{X\in\mathcal{K}} ~ \{C\bullet X+f(X):A^{T}(X)=0\}, \tag{27}\\&\arg\max_{S\in\mathcal{K}_{*},\,y\in\mathbb{R}^{m}} ~ \{-f_{*}(S):S=C-A(y)\}, \tag{28}\end{align*}
in which $f$ and $f_*$ are the "log-det" barrier functions on $\mathcal{K}$ and $\mathcal{K}_*$ introduced by [2], [3], [8]:
\begin{equation*} f(X)=-\log\det X,\qquad f_{*}(S)=-\min_{X\in\mathcal{K}}\{S\bullet X-\log\det X\}.\end{equation*}
The linear map $A:\mathbb{R}^{m}\to\mathbb{S}_{\tilde{\mathcal{G}}\backslash\mathcal{G}}^{n}$ converts a list of $m$ variables into the corresponding matrix with the sparsity set $\tilde{\mathcal{G}}\backslash\mathcal{G}$. The gradients of $f$ are simply the projections of their usual values onto $\mathbb{S}_{\tilde{\mathcal{G}}}^{n}$, as in
\begin{equation*} \nabla f(X)=-P_{\tilde{\mathcal{G}}}(X^{-1}),\qquad \nabla^{2}f(X)[Y]=P_{\tilde{\mathcal{G}}}(X^{-1}YX^{-1}).\end{equation*}
Given any $S\in\mathcal{K}_{*}$, let $X\in\mathcal{K}$ be the unique matrix satisfying $P_{\tilde{\mathcal{G}}}(X^{-1})=S$. Then, we have
\begin{align*} f_{*}(S)=&\;n+\log\det X,\tag{29a}\\ \nabla f_{*}(S)=&-X,\tag{29b}\\ \nabla^{2}f_{*}(S)[Y]=&\;\nabla^{2}f(X)^{-1}[Y].\tag{29c}\end{align*}
Assuming that $\tilde{\mathcal{G}}$ is sparse and chordal, all six operations can be efficiently evaluated in $O(n)$ time and $O(n)$ memory, using the numerical recipes described in [2].
C. Solving the Dual Problem
Our algorithm actually solves the dual problem (28), which can be rewritten as the unconstrained optimization problem
\begin{equation*} \hat{y}\equiv\arg\min_{y\in\mathbb{R}^{m}}g(y)\equiv f_{*}(C_{\lambda}-A(y)). \tag{30}\end{equation*}
After the solution $\hat{y}$ is found, we can recover the optimal estimator for the primal problem via $\hat{X}=-\nabla f_{*}(C_{\lambda}-A(\hat{y}))$. The dual problem (28) is easier to solve than the primal (27) because the origin $y=0$ often lies very close to the solution $\hat{y}$. To see this, note that $y=0$ produces a candidate estimator $\tilde{X}=-\nabla f_{*}(C_{\lambda})$ that solves the chordal matrix completion problem
\begin{equation*} \tilde{X}=\arg\min\{\mathrm{tr}\, C_{\lambda}X-\log\det X:X\in\mathbb{S}_{\tilde{\mathcal{G}}}^{n}\},\end{equation*}
which is a relaxation of the nonchordal problem posed over $\mathbb{S}_{\mathcal{G}}^{n}$, since $\tilde{\mathcal{G}}\supset\mathcal{G}$. As observed by several previous authors [8], this relaxation is a high-quality guess, and $\tilde{X}$ is often "almost feasible" for the original nonchordal problem posed over $\mathbb{S}_{\mathcal{G}}^{n}$, in the sense that $\tilde{X}\approx P_{\mathcal{G}}(\tilde{X})$. Simple algebra shows that the gradient $\nabla g$ evaluated at the origin has Euclidean norm $\|\nabla g(0)\|=\|\tilde{X}-P_{\mathcal{G}}(\tilde{X})\|_{F}$, so if $\tilde{X}\approx P_{\mathcal{G}}(\tilde{X})$ holds true, then the origin $y=0$ is close to optimal. Starting from this point, we can expect Newton's method to rapidly converge at a quadratic rate.
D. CG Converges in $O(1)$ Iterations
The most computationally expensive part of Newton's method is the solution of the Newton direction $\Delta y$ via the $m\times m$ system of equations
\begin{equation*} \nabla^{2}g(y)\Delta y=-\nabla g(y). \tag{31}\end{equation*}
The Hessian matrix $\nabla^{2}g(y)$ is fully dense, but matrix-vector products with it cost just $O(n)$ operations. This insight motivates solving (31) using an iterative Krylov subspace method like conjugate gradients (CG). We defer to standard texts [6] for implementation details, and only note that CG requires a single matrix-vector product with the Hessian $\nabla^{2}g(y)$ at each iteration, in $O(n)$ time and memory when $|\tilde{\mathcal{G}}|=O(n)$. Starting from the origin $p=0$, the method converges to an $\epsilon$-accurate search direction $p$ satisfying
\begin{equation*} (p-\Delta y)^{T}\nabla^{2}g(y)(p-\Delta y)\le\epsilon|\Delta y^{T}\nabla g(y)|\end{equation*}
in at most
\begin{equation*} \left\lceil\sqrt{\kappa_{g}}\log(2/\epsilon)\right\rceil \text{ CG iterations,} \tag{32}\end{equation*}
where $\kappa_{g}=\|\nabla^{2}g(y)\|\|\nabla^{2}g(y)^{-1}\|$ is the condition number of the Hessian matrix [18], [39]. In many important convex optimization problems, the condition number $\kappa_{g}$ grows like $O(1/\epsilon)$ or $O(1/\epsilon^{2})$ as the outer Newton iterates approach an $\epsilon$-neighborhood of the true solution. As a consequence, Newton-CG methods typically require $O(1/\sqrt{\epsilon})$ or $O(1/\epsilon)$ CG iterations.
It is therefore surprising that we are able to bound $\kappa_{g}$ globally for the MDMC problem, over the entire trajectory of Newton's method. Below, we state our main result, which asserts that $\kappa_{g}$ depends polynomially on the problem data and the quality of the initial point, but is independent of the problem dimension $n$ and the accuracy $\epsilon$ of the current iterate.
Theorem 2:
The condition number of the Hessian matrix is bounded by
\begin{equation*} \mathrm{cond}(\nabla^{2}g(y_{k}))\le\frac{\lambda_{\max}(\mathbf{A}^{T}\mathbf{A})}{\lambda_{\min}(\mathbf{A}^{T}\mathbf{A})}\left(2+\frac{\phi_{\max}^{2}\lambda_{\max}(X_{0})}{\lambda_{\min}(\hat{X})}\right)^{2},\end{equation*}
where:
$\phi_{\max}=g(y_{0})-g(\hat{y})$ is the initial infeasibility,
$\mathbf{A}=[\mathrm{vec}\,A_{1},\ldots,\mathrm{vec}\,A_{m}]$ is the vectorized data matrix,
$X_{0}=-\nabla f_{*}(C-A(y_{0}))$ satisfies $P_{\tilde{\mathcal{G}}}(X_{0}^{-1})=C-A(y_{0})$, $X_{0}\in\mathcal{K}$,
and $\hat{X}=-\nabla f_{*}(C-A(\hat{y}))$ satisfies $P_{\tilde{\mathcal{G}}}(\hat{X}^{-1})=C-A(\hat{y})$, $\hat{X}\in\mathcal{K}$.
Note that, without loss of generality, we assume $\lambda_{\min}(\mathbf{A}^{T}\mathbf{A})>0$; otherwise, the linearly-dependent rows of $\mathbf{A}^{T}$ can be eliminated until the reduced $\mathbf{A}^{T}$ has full row rank. Applying Theorem 2 to (32) shows that CG solves each Newton subproblem to $\epsilon$-accuracy in $O(\log\epsilon^{-1})$ iterations. Multiplying this figure by the $O(\log\log\epsilon^{-1})$ Newton steps needed to converge yields a global iteration bound of
\begin{equation*} O(\log\epsilon^{-1}\cdot\log\log\epsilon^{-1})\approx O(1)\text{ CG iterations.}\end{equation*}
Multiplying this bound by the $O(n)$ cost of each CG iteration proves the time complexity claimed in (6). The associated memory complexity is $O(n)$, i.e. the number of decision variables in $\tilde{\mathcal{G}}$.
In practice, CG typically converges much faster than this worst-case bound, due to its ability to exploit the clustering of the eigenvalues of $\nabla^{2}g(y)$; see [18], [39]. Moreover, accurate Newton directions are only needed to guarantee quadratic convergence close to the solution. During the initial Newton steps, we may loosen the error tolerance for CG for a significant speed-up; such inexact Newton steps typically yield a speed-up by a factor of 2-3.
E. The Full Algorithm
In this subsection, we assemble the previously-presented steps and summarize the full algorithm. To begin, we compute a chordal embedding $\tilde{\mathcal{G}}$ for the sparsity pattern $\mathcal{G}$ of $C_{\lambda}$, using the code snippet in Figure 1. We use the embedding to reformulate (4) as (25), and solve the unconstrained problem $\hat{y}=\arg\min_{y}g(y)$ defined in (30) using Newton's method
\begin{align*} y_{k+1}=&\;y_{k}+\alpha_{k}\Delta y_{k},\\ \Delta y_{k}\equiv&-\nabla^{2}g(y_{k})^{-1}\nabla g(y_{k}),\end{align*}
starting at the origin $y_{0}=0$. The function value $g(y)$, gradient $\nabla g(y)$, and Hessian matrix-vector products are all evaluated using the numerical recipes described by [2].
At each $k$-th Newton step, we compute the Newton search direction $\Delta y_{k}$ using conjugate gradients. A loose tolerance is used when the Newton decrement $\delta_{k}=|\Delta y_{k}^{T}\nabla g(y_{k})|$ is large, and a tight tolerance is used when the decrement is small, implying that the iterate is close to the true solution.
Once a Newton direction $\Delta y_{k}$ is computed with a sufficiently large Newton decrement $\delta_{k}$, we use a backtracking line search to determine the step-size $\alpha_{k}$. In other words, we select the first element of the sequence $\{1,\rho,\rho^{2},\rho^{3},\ldots\}$ that satisfies the Armijo-Goldstein condition
\begin{equation*} g(y+\alpha\Delta y)\le g(y)+\gamma\alpha\Delta y^{T}\nabla g(y),\end{equation*}
in which $\gamma\in(0,0.5)$ and $\rho\in(0,1)$ are line search parameters. Our implementation uses $\gamma=0.01$ and $\rho=0.5$. We complete the step and repeat the process until convergence.
We terminate the outer Newton's method when the Newton decrement $\delta_{k}$ falls below a threshold. This implies either that the solution has been reached or that CG is not converging to a good enough $\Delta y_{k}$ to make significant progress. The associated estimator for $\Sigma^{-1}$ is recovered by evaluating $\hat{X}=-\nabla f_{*}(C_{\lambda}-A(\hat{y}))$.
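To summarize the outer loop in code, here is a minimal MATLAB sketch of the Newton-CG iteration described above. The handles g, grad_g, and hessmul (a Hessian-vector product) stand in for the chordal numerical recipes of [2]; they, the iteration limits, and the tolerance schedule are assumptions of this sketch, not the paper's implementation.

```matlab
% Newton-CG sketch for min_y g(y) = f_*(C_lambda - A(y)) (our notation).
% g, grad_g, hessmul(y, v) are assumed function handles; m = |G~ \ G| is given.
y = zeros(m, 1);                                  % start at the origin, y_0 = 0
gamma = 0.01;  rho = 0.5;  tol_newton = 1e-7;     % parameters quoted in the text
for k = 1:100
    gk = grad_g(y);
    cgtol = max(1e-12, min(0.1, norm(gk)));       % loose tolerance far away, tight near the solution
    [dy, ~] = pcg(@(v) hessmul(y, v), -gk, cgtol, 500);   % inexact Newton direction via CG
    delta = abs(dy' * gk);                        % Newton decrement
    if delta < tol_newton, break; end             % converged (or CG has stalled)
    alpha = 1;                                    % backtracking (Armijo-Goldstein) line search
    while g(y + alpha*dy) > g(y) + gamma * alpha * (dy' * gk)
        alpha = rho * alpha;
    end
    y = y + alpha * dy;
end
% The estimator is then Xhat = -grad f_*(C_lambda - A(y)), via the recipes of [2].
```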
SECTION V. Numerical Results
Finally, we benchmark our algorithm against QUIC [20], commonly considered the fastest solver for graphical lasso or RGL. (Another widely-used algorithm is GLASSO [13], but we found it to be significantly slower than QUIC.) We consider three case studies:
Banded graphs without thresholding, to verify the claimed O(n)
complexity of our MDMC algorithm on problems with a nearly-banded structure.
Banded graphs, to verify the ability of our threshold-MDMC procedure to recover the correct GL solution on problems with a nearly-banded structure.
Real-life graphs, to benchmark the full threshold-MDMC procedure on graphs collected from real-life applications.
Experiments 1 and 3 are performed on a laptop computer with an Intel Core i7 quad-core 2.50 GHz CPU and 16GB RAM. Experiment 2 is performed on a desktop workstation with a slower Intel Core CPU, but 48 GB of RAM. The reported results are based on a serial implementation in MATLAB-R2017b. Both our Newton decrement threshold and
QUIC’s convergence threshold are 10^{-7}.
We implemented the soft-thresholding operation (7) as a serial routine that uses O(n)
memory by taking the n\times N
matrix-of-samples \mathbf {X}=[\mathbf {x}_{1},\mathbf {x}_{2},\ldots,\mathbf {x}_{N}]
satisfying C=\frac {1}{N}\mathbf {X}\mathbf {X}^{T}
as input. The routine implicitly partitions C
into submatrices of size 4000\times 4000
, and iterates over the submatrices one at a time. For each submatrix, it explicitly forms the submatrix, thresholds it using dense linear algebra, and then stores the result as a sparse matrix.
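A simplified serial sketch of this routine is shown below. It assumes the samples have already been centered and that the diagonal of C is left un-thresholded; the exact convention is fixed by the definition in (7), and the incremental sparse assignment used here is kept only for brevity.

function Clam = soft_threshold_blocks(Xs, lambda, blk)
% Blockwise soft-thresholding of C = (1/N)*Xs*Xs' without forming C densely.
% Xs is the n-by-N matrix of (centered) samples, lambda the threshold,
% blk the block size (4000 in our experiments).
[n, N] = size(Xs);
Clam = sparse(n, n);
for i = 1:blk:n
    I = i:min(i+blk-1, n);
    for j = 1:blk:n
        J = j:min(j+blk-1, n);
        B = (Xs(I,:) * Xs(J,:)') / N;             % dense submatrix of C
        S = sign(B) .* max(abs(B) - lambda, 0);   % elementwise soft-thresholding
        if i == j
            S(1:length(I)+1:end) = diag(B);       % keep diagonal entries un-thresholded
        end
        Clam(I, J) = sparse(S);                   % store only the surviving nonzeros
    end
end
end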
After thresholding, we use our Newton-CG algorithm to solve the MDMC problem with respect to the thresholded matrix C_{\lambda }
. We measure the quality of the resulting X
by the following two metrics:\begin{align*} \text {relative optimality gap}\equiv&\frac {\|P_{G}(C_{\lambda }-X^{-1})\|_{F}}{\|C_{\lambda }\|_{F}},\tag{33}\\ \text {relative infeasibility}\equiv&\frac {\|P_{G}(X)-X\|_{F}}{\|X\|_{F}},\tag{34}\end{align*}
where
G
is the sparsity pattern associated with
C_{\lambda }
. If
X
is an exact solution to MDMC
(4), then the first-order optimality condition would read
P_{G}(X^{-1})=C_{\lambda }
while the associated feasibility condition
X\in { \mathbb {S}}_{G}^{n}
would imply
P_{G}(X)=X
. In this case, both the relative optimality gap and the relative infeasibility are identically zero.
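The two metrics can be computed directly from their definitions; the sketch below is a naive transcription that forms X^{-1} explicitly, so it is only suitable for moderate n, and the function name is ours.

function [opt_gap, infeas] = mdmc_metrics(X, Clam)
% Relative optimality gap (33) and relative infeasibility (34) of a candidate X,
% with G taken to be the sparsity pattern of the thresholded matrix Clam.
G       = spones(Clam);                            % mask implementing the projection P_G
R       = G .* (Clam - inv(full(X)));              % P_G(C_lambda - X^{-1})
opt_gap = norm(R, 'fro') / norm(Clam, 'fro');      % (33)
infeas  = norm(G.*X - X, 'fro') / norm(X, 'fro');  % (34)
end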
A. Case Study 1: Banded Patterns Without Thresholding
The first case study aims to verify the claimed O(n)
complexity of our algorithm for MDMC. Here, we avoid the proposed thresholding step, and focus solely on the MDMC problem (4). Each sparsity pattern G is that of a corrupted banded matrix with bandwidth 101. The off-diagonal nonzero elements of C
are selected from the uniform distribution in [−2, 0) and then corrupted to zero with probability 0.3. The diagonal elements are fixed to 5. Our numerical experiments fix the bandwidth and vary the number of variables n
from 1,000 to 200,000. A time limit of 2 hours is set for both algorithms.
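The construction of these test matrices can be sketched as follows; we interpret "bandwidth 101" as 50 sub- and super-diagonals plus the main diagonal, and the generator below is one plausible reading of the description rather than our exact script.

function C = banded_test_matrix(n)
% Corrupted banded test matrix: off-diagonal entries uniform in [-2, 0),
% corrupted to zero with probability 0.3, diagonal fixed to 5.
bw = 50;                                    % half-bandwidth (total bandwidth 101)
I = []; J = []; V = [];
for k = 1:bw
    i = (1:n-k)';  j = i + k;               % indices of the k-th super-diagonal
    v = -2 + 2*rand(n-k, 1);                % uniform in [-2, 0)
    v(rand(n-k, 1) < 0.3) = 0;              % corrupt to zero with probability 0.3
    I = [I; i]; J = [J; j]; V = [V; v];     %#ok<AGROW>
end
B = sparse(I, J, V, n, n);                  % strictly upper-triangular banded part
C = B + B' + 5*speye(n);                    % symmetrize and set the diagonal to 5
end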
Figure 2a compares the running time of both algorithms. A log-log regression results in an empirical time complexity of O(n^{1.1})
for our algorithm, and O(n^{2})
for QUIC. The extra 0.1 in the exponent is most likely an artifact of our MATLAB implementation. In either case, QUIC’s quadratic complexity limits it to n=1.5\times 10^{4}
. By contrast, our algorithm solves an instance with n=2\times 10^{5}
in less than 33 minutes. The resulting solutions are extremely accurate, with optimality and feasibility gaps of less than 10^{-16} and 10^{-7}, respectively.
B. Case Study 2: Banded Patterns
In the second case study, we repeat our first experiment over banded patterns, but also include the initial thresholding step. We restrict our attention to the RGL version of the problem, meaning that the bandwidth is known ahead of time, but the exact locations of the nonzeros are unknown. The quality of an estimator \hat {X}
compared to the true inverse covariance matrix \Theta \equiv \Sigma ^{-1}
is measured using the following three standard metrics:\begin{align*}&\hspace {-2.8pc}\text {relative Frobenius loss}\equiv \|\hat {X}-\Theta \|_{F}/\|\Theta \|_{F},\tag{35}\\ \text {TPR}\equiv&\frac {|\{(i,j):i\ne j,\hat {X}_{i,j}\ne 0,\Theta _{i,j}\ne 0\}|}{|\{(i,j):i\ne j,\Theta _{i,j}\ne 0\}|},\tag{36}\\ \text {FPR}\equiv&\frac {|\{(i,j):i\ne j,\hat {X}_{i,j}\ne 0,\Theta _{i,j}=0\}|}{|\{(i,j):i\ne j,\Theta _{i,j}=0\}|}.\tag{37}\end{align*}
In words, the relative Frobenius loss is the normalized error of the estimator. The true positive rate (TPR) is a measure of
sensitivity, and is defined as the proportion of actual nonzeros that are correctly identified as such. The false positive rate (FPR) is a measure of
specificity, and is defined as the proportion of zero elements that are misidentified as being nonzero.
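These support-recovery metrics can be computed as below; the nonzero test uses a small tolerance, and the dense masks are for clarity only (for the largest instances one would work with the sparse index sets directly). The function name is ours.

function [tpr, fpr] = support_recovery(Xhat, Theta, tol)
% True positive rate (36) and false positive rate (37) for an estimator Xhat
% against the ground truth Theta, restricted to off-diagonal entries.
if nargin < 3, tol = 1e-8; end
n   = size(Theta, 1);
off = ~eye(n);                                    % mask selecting i ~= j
Sx  = abs(full(Xhat))  > tol;                     % estimated support
St  = abs(full(Theta)) > tol;                     % true support
tpr = nnz(Sx & St  & off) / nnz(St  & off);       % (36)
fpr = nnz(Sx & ~St & off) / nnz(~St & off);       % (37)
end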
The results are shown in Table 1. The threshold-MDMC procedure is able to achieve the same statistical recovery properties as QUIC in a fraction of the solution time. In larger instances, both methods are able to almost exactly recover the true sparsity pattern. However, only our procedure continues to work with n
up to 200,000.
C. Case Study 3: Real-Life Graphs
The third case study aims to benchmark the full thresholding-MDMC procedure for sparse inverse covariance estimation on real-life graphs. The actual graphs (i.e. the sparsity patterns) for \Sigma ^{-1}
are chosen from the SuiteSparse Matrix Collection [9], a publicly available repository of large sparse matrices collected from real-world applications. Our chosen graphs vary in size from n=3918
to n=201062
, and are taken from applications in chemical processes, material science, graph problems, optimal control and model reduction, thermal processes and circuit simulations.
For each sparsity pattern G
, we design a corresponding \Sigma ^{-1}
as follows. For each (i,j)\in G
, we select (\Sigma ^{-1})_{i,j}=(\Sigma ^{-1})_{j,i}
from the uniform distribution in [−1, 1], and then corrupt it to zero with probability 0.3. Then, we set each diagonal to (\Sigma ^{-1})_{i,i}=1+\sum _{j}|(\Sigma ^{-1})_{i,j}|
. Using this \Sigma
, we generate N=5000
samples i.i.d. as \mathbf {x}_{1},\ldots,\mathbf {x}_{N}\sim \mathcal {N}(0,\Sigma)
. This results in a sample covariance matrix C=\frac {1}{N}\sum _{i=1}^{N}\mathbf {x}_{i}\mathbf {x}_{i}^{T}
.
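The instance generation just described can be sketched in a few lines of MATLAB; the function name and interface are ours, and the sparse Cholesky factor is used to draw the Gaussian samples without ever forming \Sigma explicitly. For the largest patterns, a fill-reducing permutation should be applied before the factorization to control fill.

function [Xs, invSigma] = make_instance(G, N)
% Build Sigma^{-1} on a given symmetric sparsity pattern G (sparse, n-by-n)
% as described above, then draw N i.i.d. samples from N(0, Sigma).
n = size(G, 1);
[i, j] = find(triu(G, 1));                        % upper-triangular positions of G
v = 2*rand(numel(i), 1) - 1;                      % uniform in [-1, 1]
v(rand(numel(i), 1) < 0.3) = 0;                   % corrupt to zero with probability 0.3
B = sparse(i, j, v, n, n);  B = B + B';           % symmetric off-diagonal part
d = 1 + full(sum(abs(B), 2));                     % (Sigma^{-1})_{ii} = 1 + sum_j |(Sigma^{-1})_{ij}|
invSigma = B + spdiags(d, 0, n, n);               % diagonally dominant, hence positive definite
L  = chol(invSigma, 'lower');                     % Sigma^{-1} = L*L'
Xs = L' \ randn(n, N);                            % x = L'^{-1} z with z ~ N(0,I) has covariance Sigma
end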
We solve graphical lasso and RGL with the C
described above using our proposed soft-thresholding-MDMC algorithm and QUIC, in order to estimate \Sigma ^{-1}
. In the case of RGL, we assume that the graph G
is known a priori, while noting that 30% of the elements of \Sigma ^{-1}
have been corrupted to zero. Our goal here is to discover the location of these corrupted elements. In all of our simulations, the threshold \lambda
is set so that the number of nonzero elements in the estimator is roughly the same as in the ground truth. We limit both algorithms to 3 hours of CPU time.
Figure 2b compares the CPU time of the two algorithms for this case study; the specific details are provided in Table 2. A log-log regression results in an empirical time complexity of O(n^{1.64})
and O(n^{1.55})
for graphical lasso and RGL using our algorithm, and O(n^{2.46})
and O(n^{2.52})
for the same using QUIC. The exponents of our algorithm exceed 1 because of the initial soft-thresholding step, which is quadratic-time on a serial computer, but remain below 2 because the overall procedure is dominated by the solution of the MDMC. Both algorithms solve graphs with n\le 1.5\times 10^{4}
within the allotted time limit, though our algorithm is 11 times faster on average. Only our algorithm is able to solve the estimation problem with n\approx 2\times 10^{5}
in a little more than an hour.
To check whether thresholding-MDMC really does solve graphical lasso and RGL, we substitute the two sets of estimators back into their original problems (1) and (5). The corresponding objective values have a relative difference \le 4\times 10^{-4}
, suggesting that both sets of estimators are about equally optimal. This observation verifies our claims in Theorem 1 and Corollary 1 about (1) and (5): thresholding-MDMC does indeed solve graphical lasso and RGL.
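This substitution check amounts to evaluating the objective of (1) at each estimator and comparing the values; a minimal MATLAB sketch (for a uniform penalty \lambda on all entries, as in (1); the RGL objective (5) is handled analogously) is given below. The function name is ours.

function val = glasso_objective(C, X, lambda)
% Graphical lasso objective (1) evaluated at a positive definite estimator X.
R   = chol(X);                                    % fails if X is not positive definite
val = full(sum(sum(C .* X))) ...                  % tr(C*X) for symmetric C and X
      - 2*sum(log(full(diag(R)))) ...             % log det X = 2*sum(log(diag(chol(X))))
      + lambda * full(sum(abs(X(:))));            % l1 penalty on all entries of X
end

Two estimators X1 and X2 can then be compared through the relative difference of glasso_objective(C,X1,lambda) and glasso_objective(C,X2,lambda).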
In this section, we present the technical proofs of Theorems 1 and 2. To this end, we need a number of lemmas.
To prove Theorem 1, first we consider the optimality (KKT) conditions for the unique solution of the GGL with \mathbb {S}_{ \mathcal {H}}^{n}=\mathbb {S}^{n}
.
Lemma 1:
\hat {X}
is the optimal solution of the GGL with \mathbb {S}_{ \mathcal {H}}^{n}=\mathbb {S}^{n}
if and only if it satisfies the following conditions for every i,j\in \{1,2, \ldots ,n\}
: \begin{align*} (\hat {X})_{i,j}^{-1}=&C_{i,j} \quad \text {if}\quad i=j\tag{38a}\\ (\hat {X})_{i,j}^{-1}=&C_{i,j}+\lambda _{i,j}\times \mathrm {sign}(\hat {X}_{i,j}) \quad \text {if}~ \hat {X}_{i,j}\not =0\tag{38b}\\ C_{i,j}-\lambda _{i,j}\leq&(\hat {X})_{i,j}^{-1}\leq C_{i,j}+\lambda _{i,j} \quad \text {if}~\hat {X}_{i,j}=0\tag{38c}\end{align*}
Proof:
The proof is straightforward and omitted for brevity.
Now, consider the following optimization:\begin{align*}&\hspace {-2pc}\underset {X\succ 0}{\text {minimize }} \mathrm {tr}\,\tilde {C}X-\log \det X \\&\qquad \qquad \qquad +\sum _{(i,j)\in \mathcal {H} }\tilde {\lambda }_{i,j}|{X}_{i,j}|+2\sum _{(i,j)\notin \mathcal {H} }|{X}_{i,j}| \tag{39}\end{align*}
where
\begin{equation*} \tilde {C}_{i,j}=\frac {C_{i,j}}{\sqrt {C_{i,i}\times C_{j,j}}}\quad \tilde {\lambda }_{i,j}=\frac {\lambda _{i,j}}{\sqrt {C_{i,i}\times C_{j,j}}}\tag{40}\end{equation*}
Let
\tilde {X}
denote the optimal solution of
(39) and recall
D = \mathrm {diag}(C)
. The following lemma relates
\tilde {X}
to
\hat {X}
.
Lemma 2:
We have \hat {X}=D^{-1/2}\tilde {X} D^{-1/2}
.
Proof:
To prove this lemma, we define an intermediate optimization problem as \begin{align*} \min _{X\in \mathbb {S}_{+}^{n}}f(X)=&\mathrm {tr}\,{C}X-\log \det X+\sum _{(i,j)\in \mathcal {H} }{\lambda }_{i,j}|{X}_{i,j}| \\&\qquad \qquad \quad +\,2\max _{k}\{{C}_{k,k}\}\sum _{(i,j)\notin \mathcal {H} } |{X}_{i,j}| \tag{41}\end{align*}
Denote
X^{\sharp }
as the optimal solution for
(41). First, we show that
X^{\sharp }=\hat {X}
. Trivially,
\hat {X}
is a feasible solution for
(41) and hence
f(X^{\sharp })\leq f(\hat {X})
. Now, we prove that
X^{\sharp }
is a feasible solution for the GGL. To this end, we show that
X_{i,j}^{\sharp }=0
for every
(i,j)\notin \mathcal {H}
. By contradiction, suppose
X_{i,j}^{\sharp }\not =0
for some
(i,j)\notin \mathcal {H}
. Note that, due to the positive definiteness of
{X^{\sharp }}^{-1}
, we have
\begin{equation*} (X^{\sharp })_{i,i}^{-1}\times (X^{\sharp })_{j,j}^{-1}-((X^{\sharp })_{i,j}^{-1})^{2}>0 \tag{42}\end{equation*}
Now, based on
Lemma 1, one can write
\begin{equation*} (X^{\sharp })_{ij}^{-1}=C_{i,j}+2\max _{k}\{C_{k,k}\}\times \mathrm {sign}(X_{i,j}^{\sharp }) \tag{43}\end{equation*}
Considering the fact that
C\succeq 0
, we have
|C_{i,j}|\leq \max _{k}\{C_{k,k}\}
. Together with
(43), this implies that
|(X^{\sharp })_{i,j}^{-1}|\geq \max _{k}\{C_{k,k}\}
. Furthermore, due to
Lemma 1, one can write
(X^{\sharp })_{i,i}^{-1}=C_{i,i}
and
(X^{\sharp })_{j,j}^{-1}=C_{j,j}
. This leads to
\begin{align*} (X^{\sharp })_{i,i}^{-1}\times (X^{\sharp })_{j,j}^{-1}-((X^{\sharp })_{i,j}^{-1})^{2}\leq&C_{i,i}\times C_{j,j}-(\max _{k}\{C_{k,k}\})^{2} \\\leq&0\tag{44}\end{align*}
contradicting
(42). Therefore,
X^{\sharp }
is a feasible solution for the GGL. This implies that
f(\hat {X})=f(X^{\sharp })
. Due to the uniqueness of the solution of
(41), we have
\hat {X}=X^{\sharp }
. Now, note that
(41) can be reformulated as
\begin{align*}&\hspace {-3pc}\min _{X\in \mathbb {S}_{+}^{n}} ~ \mathrm {tr}\,{\tilde {C}}D^{1/2}{X}D^{1/2}-\log \det X \\&+\sum _{(i,j)\in \mathcal {H} }{\lambda }_{i,j}|{X}_{i,j}|+2\max _{k}\{{C}_{k,k}\}\sum _{(i,j)\notin \mathcal {H} }|{X}_{i,j}| \tag{45}\end{align*}
Upon defining
\begin{equation*} \tilde {X}=D^{1/2}XD^{1/2} \tag{46}\end{equation*}
and following some algebra, one can verify that (41) is equivalent to
\begin{align*}&\hspace {-2.5pc}\min _{\tilde {X}\in \mathbb {S}_{+}^{n}} ~ \mathrm {tr}\,{\tilde {C}}\tilde {X}-\log \det \tilde {X} \\&+\sum _{(i,j)\in \mathcal {H} }\tilde {\lambda }_{i,j}|\tilde {X}_{i,j}|+2\sum _{(i,j)\notin \mathcal {H} }|\tilde {X}_{i,j}|+\log \det (D) \tag{47}\end{align*}
Dropping the constant term in
(47) gives rise to the optimization
(39). Therefore,
\hat {X}=D^{-1/2}\tilde {X} D^{-1/2}
holds in light of (46).
Define \tilde {C}_{\lambda } = D^{-1/2}C_{\lambda }D^{-1/2}
and note that, due to Lemma 2, \tilde {C}_{\lambda }
and \tilde {X}
have the same signed sparsity pattern as {C}_{\lambda }
and \hat {X}
, respectively. Therefore, it suffices to show that the signed sparsity structures of \tilde {C}_{\lambda }
and \tilde {X}
are the same, which can be done by analyzing the optimization problem (39) and its connection to the GGL (explained in Lemma 2). Since (39) is the weighted analog of graphical lasso, the arguments made in the proof of Theorem 12 in [12] can be adapted to prove Theorem 1. The details are omitted for brevity.
Recall the definition of the cone \mathcal {K}
and its dual \mathcal {K}_{*}
as in (26). Being dual cones, \mathcal {K}
and \mathcal {K}_{*}
satisfy Farkas’ lemma.
Lemma 3 (Farkas’ Lemma):
Given an arbitrary Y\in { \mathbb {S}}_{V}^{n}:
1) Either Y\in \mathcal {K}, or there exists a separating hyperplane S\in \mathcal {K} _{*} such that S\bullet Y< 0.
2) Either Y\in \mathcal {K} _{*}, or there exists a separating hyperplane X\in \mathcal {K} such that Y\bullet X< 0.
From the definition of g(y)
in (30) and the relations (29), we see that the Hessian matrix \nabla ^{2}g(y)
can be written in terms of the Hessian \nabla ^{2}f(X)
and the unique X\in \mathcal {K}
satisfying P_{\tilde { \mathcal {G}}}(X^{-1})=C-A(y)
, as in \begin{equation*} \nabla ^{2}g(y) =A^{T}(\nabla ^{2}f(X)^{-1}[A(y)])= \mathbf {A}^{T}\nabla ^{2}f(X)^{-1} \mathbf {A},\end{equation*}
in which
\mathbf {A}=[\mathrm {vec}\, A_{1},\ldots, \mathrm {vec}\, A_{m}]
. Moreover, using the theory of Kronecker products, we can write
\begin{equation*} \mathrm {vec}\,\nabla ^{2}f(X)[Y]=Q^{T}(X^{-1}\otimes X^{-1})Q \mathrm {vec}\, Y\end{equation*}
in which the
\frac {1}{2}n(n+1)\times |\tilde { \mathcal {G}}|
matrix
Q
is the orthogonal basis matrix of
{ \mathbb {S}}_{\tilde { \mathcal {G}}}^{n}
in
{ \mathbb {S}}^{n}
. Because of this, we see that the eigenvalues of the Hessian
\nabla ^{2}g(y)
are bounded as
\begin{align*} \lambda _{\min }(\mathbf {A}^{T} \mathbf {A})\lambda _{\min }^{2}(X^{-1})\le&\lambda _{i}(\nabla ^{2}g(y)) \\\le&\lambda _{\max }(\mathbf {A}^{T} \mathbf {A})\lambda _{\max }^{2}(X^{-1}),\tag{48}\end{align*}
and therefore its condition number is bounded in terms of the eigenvalues of
X
\begin{equation*} \mathrm {cond}\,(\nabla ^{2}g(y))\le \frac {\lambda _{\max }(\mathbf {A}^{T} \mathbf {A})}{\lambda _{\min }(\mathbf {A}^{T} \mathbf {A})} \left ({\frac {\lambda _{\max }(X)}{\lambda _{\min }(X)}}\right)^{2}. \tag{49}\end{equation*}
Consequently, most of our effort will be expended in bounding the eigenvalues of
X
.
To simplify notation, we will write y_{0},~\hat {y}
, and y
as the initial point, the solution, and any k
-th iterate. From this, we define S_{0},~\hat {S},~S
, with each satisfying S=C-A(y)
, and X_{0},~\hat {X}
, and X
, the points in \mathcal {K}
each satisfying P_{V}(X^{-1})=S
.
Our goal is to bound the extremal eigenvalues of X
. To do this, we make two observations. The first is that the sequence of objective values generated by Newton’s method is monotonically decreasing, as in \begin{equation*} g(\hat {y})\le g(y)\le g(y_{0}).\end{equation*}
Evaluating each
f_{*}(S)
as
n+\log \det X
yields our first key inequality
\begin{equation*} \log \det \hat {X}\le \log \det X\le \log \det X_{0}. \tag{50}\end{equation*}
Our second observation is the fact that every Newton direction
\Delta y
points away from the boundary of
\mathrm {dom}\, g
.
Lemma 4 (Newton Direction Is Positive Definite):
Given y\in \mathrm {dom}\, g
, define the Newton direction \Delta y=-\nabla ^{2}g(y)^{-1}\nabla g(y)
. Then, \Delta S=-A(\Delta y)\in \mathcal {K} _{*}
.
Proof:
The Newton direction is explicitly written as \begin{equation*} \Delta y=-[\mathbf {A}^{T}\nabla ^{2}f(X)^{-1} \mathbf {A}]^{-1} \mathbf {A}^{T} \mathrm {vec}\, X\end{equation*}
and
\Delta S=-A(\Delta y)
. For any
W\in \mathcal {K}
, we have by substituting
\begin{align*} W\bullet \Delta S=&(\mathrm {vec}\, W)^{T}(\mathrm {vec}\,\Delta S)\\=&(\mathrm {vec}\, W)^{T}(\mathbf {A}[\mathbf {A}^{T}\nabla ^{2}f(X)^{-1} \mathbf {A}]^{-1} \mathbf {A}^{T}) \mathrm {vec}\, X\\\ge&c_{1}(\mathrm {vec}\, W)^{T}(\mathrm {vec}\, X)=c_{1}(W\bullet X)\\\ge&c_{2} \mathrm {tr}\, W>0\end{align*}
where
c_{1}=\sigma _{\min }(\mathbf {A}[\mathbf {A}^{T}\nabla ^{2}f(X)^{-1} \mathbf {A}]^{-1} \mathbf {A}^{T})\ge \mathrm {cond}^{-1}(\mathbf {A}^{T} \mathbf {A})\,\lambda _{\max }^{-2}(X)>0
and
c_{2}=\lambda _{\min }(X)>0
. Since there can be no separating hyperplane
W\bullet \Delta S< 0
, we have
\Delta S\in \mathcal {K} _{*}
by Farkas’ lemma.
Finally, we introduce the following function, which often appears in the study of interior-point methods \begin{equation*} \phi (M)= \mathrm {tr}\, M-\log \det M-n\ge 0,\end{equation*}
and is well-known to provide a control on the arithmetic and geometric means of the eigenvalues of
M
. Indeed, the function attains its unique minimum at
\phi (I)=0
, and it is nonnegative precisely because of the arithmetic-geometric inequality. Let us show that it can also bound the arithmetic-geometric means of the extremal eigenvalues of
M
.
Lemma 5:
Denote the n
eigenvalues of M as \lambda _{1}\ge \cdots \ge \lambda _{n}. Then \begin{equation*} \phi (M)\ge \lambda _{1}+\lambda _{n}-2\sqrt {\lambda _{1}\lambda _{n}}=(\sqrt {\lambda _{1}}-\sqrt {\lambda _{n}})^{2}.\end{equation*}
Proof:
Noting that x-\log x-1\ge 0
for all x>0
, we have \begin{align*} \phi (M)=&\sum _{i=1}^{n}(\lambda _{i}-\log \lambda _{i}-1)\\\ge&(\lambda _{1}-\log \lambda _{1}-1)+(\lambda _{n}-\log \lambda _{n}-1)\\=&\lambda _{1}+\lambda _{n}-2\log \sqrt {\lambda _{1}\lambda _{n}}-2\\=&\lambda _{1}+\lambda _{n}-2\sqrt {\lambda _{1}\lambda _{n}}+2(\sqrt {\lambda _{1}\lambda _{n}}-\log \sqrt {\lambda _{1}\lambda _{n}}-1)\\\ge&\lambda _{1}+\lambda _{n}-2\sqrt {\lambda _{1}\lambda _{n}}.\end{align*}
Completing the square yields
\phi (M)\ge (\sqrt {\lambda _{1}}-\sqrt {\lambda _{n}})^{2}
.
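As a quick numerical sanity check of Lemma 5 (not part of the proof), one can verify the inequality on a random positive definite matrix in a few lines of MATLAB:

% Check phi(M) = tr(M) - log(det(M)) - n >= (sqrt(lambda_1) - sqrt(lambda_n))^2
% on a random positive definite matrix M.
n   = 20;
A   = randn(n);
M   = A*A' + eye(n);                              % random positive definite matrix
lam = sort(eig(M), 'descend');
phi = trace(M) - 2*sum(log(diag(chol(M)))) - n;   % log det computed via Cholesky
assert(phi >= (sqrt(lam(1)) - sqrt(lam(end)))^2 - 1e-10)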
The following upper bounds are specific to our problem, and are the key to our intended final claim.
Lemma 6:
Define the initial suboptimality \phi _{\max }=\log \det X_{0}-\log \det \hat {X}
. Then we have\begin{equation*} \phi (\hat {X}X^{-1})\le \phi _{\max },\quad \phi (XX_{0}^{-1})\le \phi _{\max }.\end{equation*}
Proof:
To prove the first inequality, we take first-order optimality at the optimal point \hat {y}
\begin{equation*} \nabla g(\hat {y})=A^{T}(\hat {X})=0.\end{equation*}
Noting that
\hat {X}\in { \mathbb {S}}_{V}^{n}
, we further have
\begin{align*} X^{-1}\bullet \hat {X}=P_{V}(X^{-1})\bullet \hat {X}=&[C-A(y)]\bullet \hat {X}=C\bullet \hat {X}\\&-\,y^{T}\underbrace {A^{T}(\hat {X})}_{=0}\\=&[P_{V}(\hat {X}^{-1})+A(\hat {y})]\bullet \hat {X}\\=&P_{V}(\hat {X}^{-1})\bullet \hat {X}=n\end{align*}
and hence
\phi (X^{-1}\hat {X})
has the value of the suboptimality at
X
, which is bound by the initial suboptimality in
(50):
\begin{align*} \phi (X^{-1}\hat {X})=&\underbrace {X^{-1}\bullet \hat {X}}_{n}-\log \det X^{-1}\hat {X}-n\\=&\log \det X-\log \det \hat {X}\\\le&\log \det X_{0}-\log \det \hat {X}=\phi _{\max }.\end{align*}
We begin with the same steps to prove the second inequality:
\begin{align*} X_{0}^{-1}\bullet X=&P_{V}(X_{0}^{-1})\bullet X =[C-A(y_{0})]\bullet X\\=&[P_{V}(X^{-1})+A(\hat {y})]\bullet X-A(y_{0})\bullet X\\=&n+A(y-y_{0})\bullet X.\end{align*}
Now, observe that we have arrived at
y
by taking Newton steps
y=y_{0}+\sum _{j=1}^{k}\alpha _{j}\Delta y_{j}
, and that each Newton direction points away from the boundary of the feasible set, as in
-A(\Delta y)\in \mathcal {K} _{*}
(See
Lemma 4). Since
X\in \mathcal {K}
, we must always have
\begin{align*} X\bullet A(y-y_{0})=&\sum _{j=1}^{k}\alpha _{j}\,X\bullet A(\Delta y_{j})\\=&-\sum _{j=1}^{k}\alpha _{j}\,X\bullet \Delta S_{j}\le 0.\end{align*}
Substituting this inequality and applying (50) yields the second bound:
\begin{align*} \phi (X_{0}^{-1}X)=&X_{0}^{-1}\bullet X-\log \det X_{0}^{-1}X-n\\=&(n+A(y-y_{0})\bullet X-\log \det X_{0}^{-1}X)-n\\=&\underbrace {\log \det X_{0}X^{-1}}_{\le \phi _{\max }}+\underbrace {A(y-y_{0})\bullet X}_{\le 0}\le \phi _{\max }.\end{align*}
Using the two upper-bounds to bound the eigenvalues of their arguments is enough to derive a condition number bound on X
, which immediately translates into a condition number bound on \nabla ^{2}g(y)
.
Here, we will prove \begin{equation*} \frac {\lambda _{\max }(X)}{\lambda _{\min }(X)}\le 2+\frac {\phi _{\max }^{2}\lambda _{\max }(X_{0})}{\lambda _{\min }(\hat {X})},\end{equation*}
which yields the desired condition number bound on
\nabla ^{2}g(y)
by substituting into
(49). Writing
\lambda _{1}=\lambda _{\max }(X)
and
\lambda _{n}=\lambda _{\min }(X)
, we have from the two lemmas above:
\begin{align*} \phi _{\max }\ge&\lambda _{\min }(\hat {X})(\sqrt {\lambda _{n}^{-1}}-\sqrt {\lambda _{1}^{-1}})^{2}>0,\\ \phi _{\max }\ge&\lambda _{\min }(X_{0}^{-1})(\sqrt {\lambda _{1}}-\sqrt {\lambda _{n}})^{2}>0.\end{align*}
Multiplying the two upper bounds and substituting
\lambda _{\min }(X_{0}^{-1})=1/\lambda _{\max }(X_{0})
yields
\begin{equation*} \frac {\phi _{\max }^{2}\lambda _{\max }(X_{0})}{\lambda _{\min }(\hat {X})}\ge \left ({\sqrt {\frac {\lambda _{1}}{\lambda _{n}}}-\sqrt {\frac {\lambda _{n}}{\lambda _{1}}}}\right)^{2}=\frac {\lambda _{1}}{\lambda _{n}}+\frac {\lambda _{n}}{\lambda _{1}}-2.\end{equation*}
Finally, bounding
\lambda _{n}/\lambda _{1}\ge 0
yields the desired bound.