
Linear-Time Algorithm for Learning Large-Scale Sparse Graphical Models

Publisher: IEEE

Abstract:
We consider the graphical lasso, a popular optimization problem for learning the sparse representations of high-dimensional datasets, which is well known to be computationally expensive for large-scale problems. A recent line of results has shown, under mild assumptions, that the sparsity pattern of the graphical lasso estimator can be retrieved by soft-thresholding the sample covariance matrix. Based on this result, a closed-form solution has been obtained that is optimal when the thresholded sample covariance matrix has an acyclic structure. In this paper, we prove an extension of this result to the generalized graphical lasso (GGL), where additional sparsity constraints are imposed based on prior knowledge. Furthermore, we describe a recursive closed-form solution for the problem when the thresholded sample covariance matrix is chordal. Building upon this result, we describe a novel Newton-Conjugate Gradient algorithm that can efficiently solve the GGL with general structures. Assuming that the thresholded sample covariance matrix is sparse with a sparse Cholesky factorization, we prove that the algorithm converges to an $\epsilon$-accurate solution in $O(n\log(1/\epsilon))$ time and $O(n)$ memory. The algorithm is highly efficient in practice: we solve instances with as many as 200,000 variables to 7-9 digits of accuracy in less than an hour on a standard laptop computer running MATLAB.
Published in: IEEE Access ( Volume: 7)
Page(s): 12658 - 12672
Date of Publication: 01 January 2019
Electronic ISSN: 2169-3536
INSPEC Accession Number: 18609301
Publisher: IEEE

SECTION I.

Introduction

With the growing size of the datasets collected from real-world problems, graphical models have become a powerful statistical tool for extracting and representing the underlying hidden structure of the variables. These models, commonly known as Markov Random Fields (MRFs), have a simple and intuitive interpretation: the conditional dependencies between different variables are captured via weighted edges in their associated graph representation. The applications of MRFs span from computer vision and natural language processing to economics [10], [24], [26]. One of the most commonly known classes of MRFs is the Gaussian Graphical Model (GGM). The applicability of GGMs is contingent upon estimating the covariance or inverse covariance matrix (also known as the concentration matrix) from a limited number of independent and identically distributed (i.i.d.) samples drawn from a multivariate Gaussian distribution [5], [13], [28], [45].

In particular, we consider the problem of estimating an $n\times n$ covariance matrix $\Sigma$ (or its inverse $\Sigma^{-1}$) of an $n$-variate probability distribution from $N$ i.i.d. samples $x_1, x_2, \ldots, x_N$ drawn from the same probability distribution. In many applications [10], [24], [26], the matrix $\Sigma^{-1}$ is often sparse, meaning that its matrix elements are mostly zero. For Gaussian distributions, the statistical interpretation of sparsity in $\Sigma^{-1}$ is that most of the variables are pairwise conditionally independent [13].

Imposing sparsity upon $\Sigma^{-1}$ gives rise to a shrinkage estimator for the inverse covariance matrix. This is particularly important in high-dimensional settings where $n$ is large, often significantly larger than the number of samples $N$, i.e., $N\ll n$. One popular approach is to solve the following $\ell_1$-regularized problem

\begin{equation*} \underset{X\succ 0}{\text{minimize}}\;\; \mathrm{tr}\,CX-\log\det X+\lambda\sum_{i=1}^{n}\sum_{j=1}^{n}|X_{i,j}|.\tag{1}\end{equation*}
Here, $C=\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})(x_i-\bar{x})^T$ is the sample covariance matrix with the sample mean $\bar{x}=\frac{1}{N}\sum_{i=1}^{N}x_i$, and $X$ is the resulting estimator for $\Sigma^{-1}$. This approach, commonly known as the graphical lasso [13], stems from maximum likelihood estimation of the inverse covariance matrix in the Gaussian setting. For more general distributions, (1) corresponds to the $\ell_1$-regularized log-determinant Bregman divergence, which is a widely used method for measuring the distance between the true and estimated second-order moments of a probability distribution [35]. The graphical lasso is known to enjoy surprising statistical guarantees [35], [38], some of which are direct extensions of earlier work on the classical lasso [22], [31], [32], [43]. Variations on this theme are to impose the $\ell_1$ penalty on only the off-diagonal elements of $X$, or to place different weights $\lambda$ on the elements of the matrix $X$, as in the classical weighted lasso.

The decision variable is an $n\times n$ matrix, so simply fitting all $O(n^2)$ variables into memory is already a significant issue. General-purpose algorithms have either prohibitively high complexity or slow convergence. In practice, (1) is solved using problem-specific algorithms. The state of the art includes GLASSO [13], QUIC [20], and its “big-data” extension BIG-QUIC [21]. These algorithms use between $O(n)$ and $O(n^3)$ time and between $O(n)$ and $O(n^2)$ memory per iteration, but the number of iterations needed to converge to an accurate solution can be very large. Therefore, while the $\ell_1$-regularized problem (1) is technically convex, it is considered intractable for real large-scale datasets. For instance, graphical lasso is used in gene regulatory networks to infer the interactions between various genes in response to different disease processes [29]. The number of genes in humans can easily reach tens of thousands, a size that is far beyond the capability of existing solvers. Graphical lasso is also a popular method for understanding the partial correlations between different brain regions in response to physical and mental activities. According to the MyConnectome project,1 the full brain mask can include hundreds of thousands of voxels, thus making it prohibitive to solve using available numerical solvers.

This paper is organized as follows. In the rest of this section, we review the existing results on graphical lasso and thresholding, summarize our contributions, and discuss their connection to state-of-the-art methods. In Section II, we provide sufficient conditions for the equivalence between the generalized graphical lasso and a soft-thresholding method. Section III is devoted to the statement of our recursive closed-form solution for graphical lasso with chordal structures. Our main iterative algorithm for solving graphical lasso with general structures is provided in Section IV. The simulation results are presented in Section V, and Section VII provides the sketch of the proofs and the technical details of the intermediate lemmas.

Notations: Let $\mathbb{R}^n$ be the set of $n\times 1$ real vectors, and $\mathbb{S}^n$ be the set of $n\times n$ real symmetric matrices. (We denote $x\in\mathbb{R}^n$ using lower-case, denote $X\in\mathbb{S}^n$ using upper-case, and index the $(i,j)$-th element of $X$ as $X_{i,j}$.) We endow $\mathbb{S}^n$ with the usual matrix inner product $X\bullet Y=\mathrm{tr}\,XY$ and Frobenius norm $\|X\|_F^2=X\bullet X$. Let $\mathbb{S}^n_+\subset\mathbb{S}^n$ and $\mathbb{S}^n_{++}\subset\mathbb{S}^n_+$ be the associated sets of positive semidefinite and positive definite matrices. We will frequently write $X\succeq 0$ to mean $X\in\mathbb{S}^n_+$ and write $X\succ 0$ to mean $X\in\mathbb{S}^n_{++}$. Sets are denoted by calligraphic letters. Given a sparsity set $\mathcal{G}$, we define $\mathbb{S}^n_{\mathcal{G}}\subseteq\mathbb{S}^n$ as the set of $n\times n$ real symmetric matrices whose $(i,j)$-th element is zero if $(i,j)\notin\mathcal{G}$. For a square matrix $M$, let $\mathrm{diag}(M)$ be a same-size diagonal matrix collecting the diagonal elements of $M$.

A. Graphical Lasso, Soft-Thresholding, and MDMC

The high computational cost of graphical lasso has inspired a number of heuristics that are significantly cheaper to use. One simple idea is to threshold the sample covariance matrix $C$: to examine all of its elements and keep only the ones whose magnitudes exceed some threshold. Thresholding can be fast—even for very large-scale datasets—because it is embarrassingly parallel; its quadratic $O(n^2)$ total work can be spread over thousands or millions of parallel processors, in a GPU or distributed on cloud servers. When the number of samples $N$ is small, i.e., $N\ll n$, thresholding can also be performed using $O(n)$ memory, by working directly with the $n\times N$ centered matrix-of-samples $[x_1-\bar{x},\; x_2-\bar{x},\;\ldots,\; x_N-\bar{x}]$.

In a recent line of work [12], [27], [40], the simple heuristic of thresholding was shown to enjoy some surprising guarantees. In particular, [40] and [12] proved that when the $\ell_1$ regularization is imposed over only the off-diagonal elements of $X$, under some assumptions, the sparsity pattern of the associated estimator from the graphical lasso can be recovered by performing a soft-thresholding operation on $C$, as in

\begin{equation*} (C_{\lambda})_{i,j}=\begin{cases} C_{i,j} & i=j,\\ C_{i,j}-\lambda & C_{i,j}>\lambda,\;i\ne j,\\ 0 & |C_{i,j}|\le\lambda,\;i\ne j,\\ C_{i,j}+\lambda & -\lambda>C_{i,j},\;i\ne j, \end{cases}\tag{2}\end{equation*}
and recovering its sparsity set
\begin{equation*} \mathcal{G}=\{(i,j)\in\{1,\ldots,n\}^{2}:(C_{\lambda})_{i,j}\ne 0\}.\tag{3}\end{equation*}
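For concreteness, the soft-thresholding operation (2) and the sparsity set (3) take only a few lines of MATLAB. The snippet below is a minimal sketch, assuming a dense sample covariance C and a scalar threshold lambda are already in memory; the variable names are illustrative.

  % Soft-threshold the off-diagonal entries of C as in (2); the diagonal is kept as-is.
  offdiag = C - diag(diag(C));
  Clam = sign(offdiag) .* max(abs(offdiag) - lambda, 0) + diag(diag(C));
  Clam = sparse(Clam);            % store the (typically sparse) result compactly
  % Sparsity set G of (3): index pairs of the nonzero entries of C_lambda.
  [gi, gj] = find(Clam);
  G = [gi, gj];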

The support graph of $\mathcal{G}$, denoted by $\mathrm{supp}(\mathcal{G})$, is obtained by viewing each element $(i,j)$ in $\mathcal{G}$ as an edge between the $i$-th and $j$-th vertices of an undirected graph on $n$ nodes. Moreover, [40] and [12] showed that the estimator $X$ can be recovered by solving a version of (1) in which the sparsity set $\mathcal{G}$ is explicitly imposed, as in

\begin{align*} \underset{X\succ 0}{\text{minimize }}&~ \mathrm{tr}\,C_{\lambda}X-\log\det X\\ \text{subject to }&~ X_{i,j}=0\quad\forall (i,j)\notin\mathcal{G}.\tag{4}\end{align*}
Recovering the exact value of $X$ (and not just its sparsity pattern) is important because it provides a quantitative measure of the conditional dependence between variables.

The assumptions needed for the sparsity pattern of the graphical lasso to be equivalent to that of the thresholding are hard to check but relatively mild. Indeed, [12] proved that these assumptions are automatically satisfied whenever $\lambda$ is sufficiently large relative to the sample covariance matrix. Their numerical study found “sufficiently large” to be a fairly loose criterion in practice, particularly in view of the fact that large values of $\lambda$ are needed to induce a sufficiently sparse estimate of $\Sigma^{-1}$, e.g., one with $10n$ nonzero elements. Building upon this result, [12] introduced a closed-form solution for graphical lasso that is proven to be optimal when the soft-thresholded sample covariance matrix induces an acyclic graph.

B. Summary of Contributions

The purpose of this paper is three-fold.

Goal 1: We extend the guarantees derived in [12], [27], and [40] to a slightly more general version of the problem that we call the generalized graphical lasso (GGL):

\begin{align*} \hat{X}=\underset{X\succ 0}{\text{minimize }}&~ \mathrm{tr}\,CX-\log\det X+\sum_{i=1}^{n}\sum_{j=i+1}^{n}\lambda_{i,j}|X_{i,j}|\\ \text{subject to }&~ X_{i,j}=0\quad\forall (i,j)\notin\mathcal{H}.\tag{5}\end{align*}
GGL is (1) with a weighted regularization coefficient $\lambda_{i,j}$ on the off-diagonal elements, and with an a priori sparsity set $\mathcal{H}$ imposed as an additional constraint. We use the sparsity set $\mathcal{H}$ to incorporate prior information on the structure of the graphical model. For example, if the sample covariance $C$ is collected over a large-scale graph–such as gene regulatory networks, social networks, and transportation systems–then far-away variables can be assumed to be pairwise conditionally independent [7], [19], [34]. Including these neighborhood relationships in $\mathcal{H}$ can regularize the statistical problem, as well as reduce the numerical cost of a solution.

In Section II, we describe a procedure to transform the GGL (5) into a maximum determinant matrix completion (MDMC) problem–which is the dual of problem (4)–in the same style as the prior results of [12] for graphical lasso. More specifically, we soft-threshold the sample covariance $C$ and then project this matrix onto the sparsity set $\mathcal{H}$. We give conditions for the resulting sparsity pattern to be equivalent to the one obtained by solving (5). Furthermore, we prove that the resulting estimator $\hat{X}$ can be recovered by solving an appropriately designed MDMC problem.

Goal 2: Upon converting the GGL to an MDMC problem, we derive recursive closed-form solutions for the cases where the soft-thresholded sample covariance has a chordal structure. More specifically, we show that the optimal solution of (5) can be obtained using a closed-form recursive formula when the soft-thresholded sample covariance matrix has a chordal structure. As previously pointed out, most of the numerical algorithms require a large number of iterations (at least $O(n)$) for solving the graphical lasso and have a worst-case per-iteration complexity of $O(n^3)$. We show that our proposed recursive formula requires a number of iterations growing linearly in the size of the problem, and that the complexity of each iteration depends on the size of the maximum clique in its support graph. Therefore, given the soft-thresholded sample covariance matrix (which can be obtained while constructing the sample covariance matrix), the complexity of solving the graphical lasso for sparse and chordal structures reduces from $O(n^4)$ to $O(n)$.

Goal 3: Based on the connection between the GGL and MDMC problems and the closed-form recursive solution for chordal structures, we develop an efficient iterative algorithm for the GGL with nonchordal structures. In particular, instead of directly solving the GGL, we solve its MDMC counterpart based on the chordal embedding approach of [2], [3], and [8] and then recover the optimal solution. We embed $\mathcal{G}_{\mathcal{H}}=\mathcal{G}\cap\mathcal{H}$ within a chordal $\tilde{\mathcal{G}}\supseteq\mathcal{G}_{\mathcal{H}}$, resulting in a convex optimization problem over $\mathbb{S}^n_{\tilde{\mathcal{G}}}$, namely the space of real symmetric matrices with the sparsity pattern $\tilde{\mathcal{G}}$. This way, the constraint $X\in\mathbb{S}^n_{\tilde{\mathcal{G}}}$ is implicitly imposed, meaning that we simply ignore the matrix elements not in $\tilde{\mathcal{G}}$. This significantly reduces the number of variables in the problem. After the embedding step, we solve the GGL (or, to be precise, its associated MDMC) on $\mathbb{S}^n_{\tilde{\mathcal{G}}}$–as opposed to the significantly larger cone of positive definite matrices–using a custom Newton-CG method.2 Roughly speaking, this enables us to take advantage of fast operations on chordal structures, such as inversion, multiplication, and projection, in order to achieve a cheap per-iteration complexity, thereby significantly reducing the running time. The main idea behind our proposed algorithm is to use an inner conjugate gradients (CG) loop to solve the Newton subproblem of an outer Newton's method. The proposed algorithm has a number of features designed to exploit problem structure, including the sparse chordal property of $\tilde{\mathcal{G}}$, duality, and the ability of CG and Newton to converge superlinearly; these are outlined in Section IV.

Assuming that the chordal embedding is sparse with $|\tilde{\mathcal{G}}|=O(n)$ nonzero elements, we prove in Section IV-D that, in the worst case, our algorithm converges to an $\epsilon$-accurate solution of MDMC (4) in

\begin{equation*} O(n\log\epsilon^{-1}\log\log\epsilon^{-1})\ \text{time and}\ O(n)\ \text{memory,}\tag{6}\end{equation*}
from which the optimal solution of the GGL can be efficiently recovered. Most importantly, the algorithm is highly efficient in practice. In Section V, we present computational results on a suite of test cases. Both synthetic and real-life graphs are considered. Using our approach, we solve sparse inverse covariance estimation problems as large as $n=200{,}000$ (more than 20 billion variables in graphical lasso) in less than an hour on a laptop computer, while the state-of-the-art methods cannot even start their iterations for these instances.

C. Related Work

1) Graphical Lasso With Prior Information

A number of approaches are available in the literature for introducing prior information into the graphical lasso. The weighted version of graphical lasso mentioned before is one example. [11] introduced a class of graphical lasso problems in which the true graphical model is assumed to have a Laplacian structure; this structure commonly appears in signal and image processing [30]. The pathway graphical lasso, introduced in [17], takes into account a priori knowledge of the conditional independence between different variables.

2) Algorithms for Graphical Lasso

Algorithms for graphical lasso are usually based on some mixture of Newton's method [33], proximal Newton [20], [21], iterative thresholding [36], and (block) coordinate descent [13], [41]. All of these suffer fundamentally from the need to keep track of, and act on, all $O(n^2)$ elements of the matrix decision variable $X$. Even if the final solution matrix were sparse with $O(n)$ nonzeros, it is still possible for the algorithm to traverse a “dense region” in which the iterate $X$ must be fully dense. Thresholding heuristics have been proposed to address this issue, but these may adversely affect the outer algorithm and prevent convergence. It is generally impossible to guarantee a figure lower than $O(n^2)$ time per iteration, even if the solution contains only $O(n)$ nonzeros. Most of the algorithms mentioned above actually have worst-case per-iteration costs of $O(n^3)$.

3) Algorithms for MDMC

Our algorithm is inspired by a line of results [2], [3], [8], [23] for minimizing the log-det penalty on chordal sparsity patterns, culminating in the CVXOPT package [4]. These are Newton algorithms that solve the Newton subproblem by explicitly forming and factoring the fully-dense Newton matrix. When $|\tilde{\mathcal{G}}|=O(n)$, these algorithms cost $O(nm^2+m^3)$ time and $O(m^2)$ memory per iteration, where $m$ is the number of edges added to $\mathcal{G}$ to yield the chordal $\tilde{\mathcal{G}}$. In practice, $m$ is usually a factor of 0.1 to 20 times $n$, so these algorithms are cubic $O(n^3)$ time and $O(n^2)$ memory. Our algorithm solves the Newton subproblem iteratively using CG. We prove that CG requires just $O(n)$ time to compute the Newton direction to machine precision (see Section IV-D). In practice, CG converges much faster than its worst-case bound, because it is able to exploit eigenvalue clustering to achieve superlinear convergence. The preconditioned CG (PCG) algorithm of [8] is superficially similar to our algorithm, but instead applies PCG directly to the log-det minimization. Its convergence is much slower than the companion sparse interior-point method, as demonstrated by their numerical results.

SECTION II.

Equivalence Between GGL and Thresholding

Let P_{ \mathcal {H}}(X) denote the projection operator from { \mathbb {S}}^{n} onto { \mathbb {S}}_{ \mathcal {H}}^{n} , i.e. by setting all X_{i,j}=0 if (i,j)\notin \mathcal {H} . Let C_{\lambda } be the sample covariance matrix C individually soft-thresholded by [\lambda _{i,j}] , as in \begin{equation*} (C_{\lambda })_{i,j}=\begin{cases} C_{i,j} & i=j,\\ C_{i,j}-\lambda _{i,j} & C_{i,j}>\lambda _{i,j},\;i\not =j,\\ 0 & |C_{i,j}|\le \lambda _{i,j}\;i\not =j,\\ C_{i,j}+\lambda _{i,j} & -\lambda _{i,j}> C_{i,j}\,i\not =j, \end{cases} \tag{7}\end{equation*}

In this section, we state the conditions for P_{ \mathcal {H}}(C_{\lambda }) —the projection of the soft-thresholded matrix C_{\lambda } in (7) onto \mathcal {H} —to have the same sparsity pattern as the GGL estimator \hat {X} in (5). For brevity, the technical proofs are omitted; these can be found in Section VII.
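As a minimal illustration (not the implementation used in our experiments), the weighted soft-thresholding (7) and the projection P_{\mathcal{H}} can be written in MATLAB as follows, assuming C is the sample covariance, Lam is a matrix of weights with Lam(i,j) = \lambda_{i,j}, and H is a 0/1 mask for the sparsity set \mathcal{H} that includes the diagonal; all three names are placeholders.

  % Weighted soft-thresholding (7): off-diagonal entries are shrunk by Lam, the diagonal is kept.
  offdiag = C - diag(diag(C));
  Clam = sign(offdiag) .* max(abs(offdiag) - Lam, 0) + diag(diag(C));
  % Projection onto the prior sparsity set H: zero out every entry outside H.
  CH = sparse(Clam .* H);
  GH = spones(CH);                % sparsity pattern G_H of C_H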

Before we state the exact conditions, we begin by adopting some definitions and notations from the literature.

Definition 1 ([12]):

Given a matrix M\in { \mathbb {S}}^{n} , define \mathcal {G}_{M}=\{(i,j):M_{i,j}\ne 0\} as its sparsity pattern. Then, M is called inverse-consistent if there exists a matrix N\in { \mathbb {S}}^{n} such that \begin{align*} M+N\succ&0\tag{8a}\\ N=&0\quad \forall (i,j)\in \mathcal {G} _{M}\tag{8b}\\ (M+N)^{-1}\in&\mathbb {S}_{ \mathcal {G}_{M}}^{n}\tag{8c}\end{align*}

The matrix N is called an inverse-consistent complement of M and is denoted by M^{(c)} . Furthermore, M is called sign-consistent if for every (i,j)\in \mathcal {G} _{M} , the (i,j) -th elements of M and (M+M^{(c)})^{-1} have opposite signs.

Example 1:

Consider the matrix:\begin{equation*} M=\left [{\begin{array}{cccc} 1&\quad 0.3 &\quad 0 &\quad 0\\ 0.3 &\quad 1 &\quad -0.4 &\quad 0\\ 0 &\quad -0.4 &\quad 1 &\quad 0.2\\ 0 &\quad 0 &\quad 0.2 &\quad 1 \end{array}}\right]\tag{9}\end{equation*}

We show that M is both inverse- and sign-consistent. Consider the matrix M^{(c)} defined as \begin{equation*} M^{(c)}=\left [{\begin{array}{cccc} 0&\quad 0 &\quad -0.12 &\quad -0.024\\ 0&\quad 0&\quad 0 &\quad -0.08\\ -0.12 &\quad 0 &\quad 0 &\quad 0\\ -0.024 &\quad -0.08 &\quad 0 &\quad 0 \end{array}}\right]\tag{10}\end{equation*}
(M+M^{(c)})^{-1} can be written as \begin{align*} \left [{\!\begin{array}{cccc} \dfrac {1}{0.91}&\quad \dfrac {-0.3}{0.91} &\quad 0 &\quad 0\\[0.5pc] \dfrac {-0.3}{0.91} &\quad 1+\dfrac {0.09}{0.91}+\dfrac {0.16}{0.84} &\quad \dfrac {0.4}{0.84} &\quad 0\\[0.5pc] 0 &\quad \dfrac {0.4}{0.84} &\quad 1+\dfrac {0.16}{0.84}+\dfrac {0.04}{0.96} &\quad \dfrac {-0.2}{0.96}\\[0.5pc] 0 &\quad 0 &\quad \dfrac {-0.2}{0.96} &\quad \dfrac {1}{0.96} \end{array}\!}\right]\!\!\!\! \\\tag{11}\end{align*}
Note that:

  • M+M^{(c)} is positive-definite.

  • The sparsity pattern of M is the complement of that of M^{(c)} .

  • The sparsity pattern of M and (M+M^{(c)})^{-1} are equivalent.

  • The nonzero off-diagonal entries of M and (M+M^{(c)})^{-1} have opposite signs.

Therefore, it can be inferred that M is both inverse- and sign-consistent, and M^{(c)} is its inverse-consistent complement.
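The claims in Example 1 are easy to check numerically; a short MATLAB verification:

  % Verify Example 1: M + M^(c) is positive definite and its inverse has the
  % sparsity pattern of M, with opposite off-diagonal signs.
  M  = [1 0.3 0 0; 0.3 1 -0.4 0; 0 -0.4 1 0.2; 0 0 0.2 1];
  Mc = [0 0 -0.12 -0.024; 0 0 0 -0.08; -0.12 0 0 0; -0.024 -0.08 0 0];
  min(eig(M + Mc))                % positive, so M + M^(c) is positive definite
  round(inv(M + Mc), 4)           % matches (11): zeros off the pattern of M, signs flipped on it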

Definition 2:

We take the usual element-wise max-norm to exclude the diagonal, as in \|M\|_{\max }=\max _{i\not =j}|M_{ij}| , and adopt the \beta (\mathcal {G},\alpha) function defined with respect to the sparsity set \mathcal {G} and scalar \alpha >0 :\begin{align*} \beta (\mathcal {G},\alpha)=&\max _{M\succ 0} ~ \|M^{(c)}\|_{\max }\tag{12}\\& {s.t.}~ M\in { \mathbb {S}}_{ \mathcal {G}}^{n} ~ {and }~\|M\|_{\max }\le \alpha \tag{13}\\&\hphantom {\text {s.t. }} M_{i,i}=1\quad \forall i\in \{1,\ldots,n\}\tag{14}\end{align*}


We are now ready to state the conditions for soft-thresholding to be equivalent to GGL.

Theorem 1:

Define C_{\lambda } as in (7), define C_{ \mathcal {H}}=P_{ \mathcal {H}}(C_{\lambda }) and let \mathcal {G}_{ \mathcal {H}}=\{(i,j):(C_{ \mathcal {H}})_{i,j}\ne 0\} be its sparsity pattern. Assume that the normalized matrix \tilde {C}=D^{-1/2}C_{ \mathcal {H}}D^{-1/2} where D=\mathrm {diag}(C_{ \mathcal {H}}) satisfies the following conditions:

  1. \tilde {C} is positive definite,

  2. \tilde {C} is sign-consistent,

  3. We have \begin{equation*} \beta \left ({\mathcal {G}_{ \mathcal {H}},\|\tilde {C}\|_{\max }}\right)\leq \underset{(k,l)\notin \mathcal {G} _{ \mathcal {H}}}{\min} \frac {\lambda _{k,l}-|C_{k,l}|}{\sqrt {C_{k,k}\cdot C_{l,l}}} \tag{15}\end{equation*}


Then, C_{ \mathcal {H}} has the same sparsity pattern and opposite signs as \hat {X} in (5), i.e.\begin{align*}&(C_{ \mathcal {H}})_{i,j}=0\qquad \iff \qquad \hat {X}_{i,j}=0,\\&(C_{ \mathcal {H}})_{i,j}>0\qquad \iff \qquad \hat {X}_{i,j}< 0,\\&(C_{ \mathcal {H}})_{i,j}< 0\qquad \iff \qquad \hat {X}_{i,j}>0.\end{align*}

Proof:

See Section VII.

Remark 1:

Theorem 1 is an extension of Theorem 12 in [12], which shows that, under similar conditions, the soft-thresholded sample covariance matrix has the same sparsity pattern as the optimal solution of graphical lasso. We have extended these results to the GGL by considering an intermediate weighted graphical lasso and showing that, by carefully selecting the weights of the regularization coefficients, one can adopt the same arguments as in [12] to show the equivalence between the GGL and thresholding; a detailed discussion can be found in Section VII.

Note that the diagonal elements of \tilde {C} are 1 and its off-diagonal elements are between −1 and 1. A sparse solution for GGL requires large regularization coefficients. This leads to numerous zero elements in \tilde {C} and forces the magnitude of the nonzero elements to be small. This means that, in most instances, \tilde {C} is positive definite or even diagonally dominant. Furthermore, the next two propositions reveal that the second and third conditions of Theorem 1 are not restrictive when the goal is to recover a sparse solution for the GGL.

Proposition 1:

There exists a strictly positive constant number \zeta (\mathcal {G}_{ \mathcal {H}}) such that \begin{equation*} \beta \left ({\mathcal {G}_{ \mathcal {H}},\|\tilde {C}\|_{\max }}\right)\leq \zeta (\mathcal {G}_{ \mathcal {H}}) \|\tilde {C}\|_{\max }^{2},\tag{16}\end{equation*}

and therefore, Condition 3 of Theorem 1 is satisfied if \begin{equation*} \zeta (\mathcal {G}_{ \mathcal {H}}) \|\tilde {C}\|_{\max }^{2}\leq \underset{ \substack { k\not =l\\ (k,l)\notin \mathcal {G} _{ \mathcal {H}} } }{\min} \frac {\lambda _{k,l}-|C_{k,l}|}{\sqrt {C_{k,k}\cdot C_{l,l}}}\tag{17}\end{equation*}

Proof:

The proof is similar to that of [12, Lemma 13]. The details are omitted for brevity.

Proposition 2:

There exist strictly positive constant numbers \alpha _{0}(\mathcal {G}_{ \mathcal {H}}) and \gamma (\mathcal {G}_{ \mathcal {H}}) such that \tilde {C} is sign-consistent if \|\tilde {C}\|_{\max }\leq \alpha _{0}(\mathcal {G}_{ \mathcal {H}}) and \begin{equation*} \gamma (\mathcal {G}_{ \mathcal {H}})\times \|\tilde {C}\|_{\max }^{2}\leq \|\tilde {C}\|_{\min }\tag{18}\end{equation*}

where \|\tilde {C}\|_{\min } is defined as the minimum absolute value of the nonzero off-diagonal entries of \tilde {C} .

Proof:

The proof is similar to that of [12, Lemma 14]. The details are omitted for brevity.

Both of these propositions suggest that, when \lambda is large, i.e., when a sparse solution is sought for the GGL, \|\tilde {C}\|_{\max }\ll 1 and hence, Conditions 2 and 3 of Theorem 1 are satisfied.

Theorem 1 leads to the following corollary, which asserts that the optimal solution of GGL can be obtained by maximum determinant matrix completion: computing the matrix S\succeq 0 with the largest determinant that “fills in” the zero elements of C_{ \mathcal {H}} = P_{ \mathcal {H}}(C_{\lambda }) .

Corollary 1:

Suppose that the conditions in Theorem 1 are satisfied. Define \hat {S} as the solution to the following MDMC problem \begin{align*} \hat {S}=&\underset {S\succeq 0}{{maximize }} ~ \log \det S \\& {subject~to}~ S_{i,j}=C_{ \mathcal {H}}\quad ~ {for~all }~(i,j) ~ {where }~(C_{ \mathcal {H}})_{i,j}\ne 0 \\\tag{19}\end{align*}

Then, \hat {S}=\hat {X}^{-1} , where \hat {X} is the solution of (5).

Proof:

Under the conditions of Theorem 1, the (i,j) -th element of the solution \hat {X}_{i,j} has the opposite sign to the corresponding element (C_{ \mathcal {H}})_{i,j} , and hence also to C_{i,j} . Replacing each |X_{i,j}| term in (5) with \mathrm {sign}(\hat {X}_{i,j})X_{i,j}=-\mathrm {sign}(C_{i,j})X_{i,j} yields \begin{align*} \hat {X}=&\underset {X\succ 0}{\text {minimize }} ~ \underbrace { \mathrm {tr}\, CX-\sum _{(i,j)\in \mathcal {G} }\mathrm {sign}(C_{i,j})\lambda _{i,j}X_{i,j}}_{\equiv \mathrm {tr}\, C_{\lambda }X}-\log \det X \\&\text {subject to}~ X_{i,j}=0\qquad \forall (i,j)\notin \mathcal {H}.\tag{20}\end{align*}

The constraint X\in { \mathbb {S}}_{ \mathcal {H}}^{n} further makes \mathrm {tr}\, C_{\lambda }X= \mathrm {tr}\, C_{\lambda }P_{ \mathcal {H}}(X)= \mathrm {tr}\, P_{ \mathcal {H}}(C_{\lambda })X\equiv \mathrm {tr}\, C_{ \mathcal {H}}X . Taking the dual of (20) yields (19); complementary slackness yields \hat {S}=\hat {X}^{-1} .

Recall that the main goal of the GGL is to promote sparsity in the estimated inverse covariance matrix. Corollary 1 shows that, instead of merely identifying the sparsity structure of the optimal solution using Theorem 1, the soft-thresholded sample covariance matrix can be further exploited to convert the GGL to the MDMC problem; this conversion is at the core of our subsequent algorithms for solving the GGL to optimality.

SECTION III.

Warm-Up: Chordal Structures

In this section, we propose a recursive closed-form solution for the GGL when the soft-thresholded sample covariance has a chordal structure. To this end, we first review the properties of sparse and chordal matrices and their connections to the MDMC problem.

A. Sparse Cholesky Factorization

Consider solving an n\times n symmetric positive definite linear system \begin{equation*} Sx=b\end{equation*}

by Gaussian elimination. The standard procedure comprises a factorization step, where S is decomposed into the (unique) factorization S=LDL^{T} , in which D is diagonal and L is lower-triangular with a unit diagonal, and a substitution step, where the two triangular systems of linear equations Ly=b and DL^{T}x=y are solved to yield x .

In the case where S is sparse, the Cholesky factor L is often also sparse. It is common to store the sparsity pattern of L in the compressed column storage format: a set of indices \mathcal {I}_{1},\ldots, \mathcal {I}_{n}\subseteq \{1,\ldots,n\} in which \begin{equation*} \mathcal {I}_{j} =\{i\in \{1,\ldots,n\}:i>j,L_{i,j}\ne 0\}, \tag{21}\end{equation*}

encodes the locations of off-diagonal nonzeros in the j^{\text {th}} column of L . (The diagonal elements are not included because the matrix L has a unit diagonal by definition).

After storage has been allocated and the sparsity structure has been determined, the numerical values of D and L are computed using a sparse Cholesky factorization algorithm. This requires the use of the associated elimination tree T=\{\mathcal {V}, \mathcal {E}\} , which is a rooted tree (or forest) on n vertices, with edges \mathcal {E}=\{\{1,p(1) \},\ldots,\{n,p(n)\}\} defined to connect each j^{\text {th}} vertex to its parent at the p(j)^{\text {th}} vertex (except root nodes, which have “0” as their parent), as in \begin{equation*} p(j)=\begin{cases} \min \mathcal {I} _{j} & | \mathcal {I}_{j}|>0,\\ 0 & | \mathcal {I}_{j}|=0, \end{cases} \tag{22}\end{equation*}

in which \min \mathcal {I} _{j} indicates the (numerically) smallest index in the index set \mathcal {I}_{j} [25]. The elimination tree encodes the dependency information between different columns of L , thereby allowing information to be passed without explicitly forming the matrix.
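In MATLAB, the column index sets (21) and the elimination tree (22) can be read off directly from a sparse Cholesky factor. The following is a small sketch, assuming S is sparse, symmetric, and positive definite:

  % Column sparsity sets I_j of (21) and the elimination tree (22).
  n = size(S,1);
  parent = etree(S);                  % parent(j) = p(j); roots have parent 0
  L = chol(S, 'lower');               % S = L*L'; same pattern as the unit-diagonal LDL' factor
  Ij = cell(n,1);
  for j = 1:n
      idx   = find(L(:,j));
      Ij{j} = idx(idx > j);           % off-diagonal nonzero rows in column j, cf. (21)
  end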

B. Chordal Sparsity Patterns

For a sparsity set \mathcal {S} , recall that its support graph, denoted by \text {supp}(\mathcal {S}) , is defined as a graph with the vertex set \mathcal {V} = \{1,2, \ldots ,n\} and the edge set \mathcal {S} . Moreover, \mathcal {S} is said to be chordal if its support graph does not contain an induced cycle with length greater than three. \mathcal {S} is said to factor without fill if every U\in { \mathbb {S}}_{\mathcal {S}}^{n} can be factored into LDL^{T} such that L+L^{T}\in { \mathbb {S}}_{\mathcal {S}}^{n} . If \mathcal {S} factors without fill, then its support graph is chordal. Conversely, if the support graph is chordal, then there exists a permutation matrix Q such that QUQ^{T} factors without fill for every matrix U\in { \mathbb {S}}_{\mathcal {S}}^{n} [14]. This permutation matrix Q is called the perfect elimination ordering of the chordal set \mathcal {S} .

C. Recursive Solution of the MDMC Problem.

An important application of chordal sparsity patterns is the efficient solution of the MDMC problem (19). The first-order optimality conditions for the GGL entail that \begin{equation*} P_{ \mathcal {G}_{\mathcal {H}}}(\hat {X}^{-1}) = C_{\lambda }\tag{23}\end{equation*}

On the other hand, Corollary 1 shows that \begin{equation*} \hat {S}=\hat {X}^{-1}\tag{24}\end{equation*}
View SourceRight-click on figure for MathML and additional features.
In the case that the sparsity set \mathcal {G}_{\mathcal {H}} factors without fill, [2] showed that (23) is actually a linear system of equations over the Cholesky factors L and D of the solution \hat {X}=LDL^{T} ; their numerical values can be explicitly computed using a recursive formula.

Note that finding the perfect elimination ordering is NP-hard in general. However, if \mathcal {G}_{\mathcal {H}} is chordal, then we may find the perfect elimination ordering Q in linear time [42], and apply the above algorithm to the matrix QC_{ \mathcal {H}}Q^{T} , whose sparsity set does indeed factor without fill.

The algorithm takes n steps, and the j^{\text {th}} step requires a size-| \mathcal {I}_{j}| linear solve and vector-vector product. Define w =\max _{j} | \mathcal {I}_{j}| , which is commonly referred to as the treewidth of \mathrm {supp}(\mathcal {G}_{ \mathcal {H}}) and has the interpretation of the size of the largest clique in \text {supp}(\mathcal {G}_{\mathcal {H}}) minus one. Combined, the algorithm has the time complexity \mathcal {O}(w^{3}n) . This means that the matrix completion algorithm is linear-time if the treewidth of \text {supp}(\mathcal {G}_{\mathcal {H}}) is on the order of \mathcal {O}(1) . Note that, for acyclic graphs, we have w = 1 .

Next, we show that if the equivalence between the thresholding method and GGL holds, the optimal solution of the GGL can be obtained using Algorithm 1.

Algorithm 1 [2, Algorithm 4.2]

Input.

Matrix C_{\lambda }\in { \mathbb {S}}_{ \mathcal {G}_{\mathcal {H}}}^{n} that has a positive definite completion.

Output.

The Cholesky factors L and D of \hat {X}=LDL^{T}\in { \mathbb {S}}_{ \mathcal {G}_{\mathcal {H}}}^{n} that satisfy P_{ \mathcal {G}_{\mathcal {H}}}(\hat {X}^{-1}) = C_{\lambda } .

Algorithm.

Iterate over j\in \{1,2,\ldots,n\} in reverse, i.e. starting from n and ending in 1. For each j , compute D_{j,j} and the j^{\text {th}} column of L from \begin{align*} L_{I_{j},j}=&-V_{j}^{-1}(C_{\lambda })_{ \mathcal {I}_{j},j}, \\ D_{j,j}=&((C_{\lambda })_{j,j}+(C_{\lambda })_{ \mathcal {I}_{j},j}^{T}L_{ \mathcal {I}_{j},j})^{-1}\end{align*}

and compute the update matrices \begin{equation*} V_{i}=P_{ \mathcal {I}_{i}\cup \{i\}, \mathcal {I}_{i}}^{T}\begin{bmatrix}C_{j,j} &\quad C_{ \mathcal {I}_{j},j}^{T}\\ C_{ \mathcal {I}_{j},j} &\quad V_{j} \end{bmatrix}P_{ \mathcal {I}_{i}\cup \{i\}, \mathcal {I}_{i}}\end{equation*}
for each i satisfying p(i)=j , i.e. each child of j in the elimination tree.

Proposition 3:

Suppose that the conditions in Theorem 1 are satisfied. Then, Algorithm 1 can be used to find \hat {X} if \mathcal {G}_{\mathcal {H}} is chordal.

Proof:

According to Algorithm 1, the output satisfies P_{ \mathcal {G}_{ \mathcal {H}}}(\hat {X}^{-1}) = C_{\lambda } which is the first-order optimality condition for the GGL. Together with the uniqueness of the optimal solution, this concludes the proof.

Proposition 3 suggests that solving the GGL through its MDMC counterpart is much easier and can be performed in linear time using Algorithm 1 when the soft-thresholded sample covariance matrix has a sparse and chordal structure. Inspired by this observation, in the next section, we propose a highly efficient iterative algorithm for solving the GGL with nonchordal soft-thresholded sample covariance matrix, which is done by converting it to an MDMC problem and embedding its nonchordal structure within a chordal pattern. Through this chordal embedding, we show that the number of iterations of the proposed algorithm is essentially constant, each with a linear time complexity.

SECTION IV.

General Case: Nonchordal Structures

This section describes our algorithm to solve MDMC (4) in which the sparsity set \mathcal {G} is nonchordal. If we assume that the input matrix C_{\lambda } is sparse, and that sparse Cholesky factorization is able to solve C_{\lambda }x=b in O(n) time, then our algorithm is guaranteed to compute an \epsilon -accurate solution in O(n\log \epsilon ^{-1}) time and O(n) memory.

The algorithm is fundamentally a Newton-CG method, i.e. Newton’s method in which the Newton search directions are computed using conjugate gradients (CG). It is developed from four key insights:

  1. Chordal embedding is easy via sparse matrix heuristics. State-of-the-art algorithms for (4) begin by computing a chordal embedding \tilde { \mathcal {G}} for \mathcal {G} . The optimal chordal embedding with the fewest number of nonzeros |\tilde { \mathcal {G}}| is NP-hard to compute, but a good-enough embedding with O(n) nonzeros is sufficient for our purposes. Obtaining a good \tilde { \mathcal {G}} with |\tilde { \mathcal {G}}|=O(n) is exactly the same problem as finding a sparse Cholesky factorization C_{\lambda }=LL^{T} with O(n) fill-in. Using heuristics developed for numerical linear algebra, we are able to find sparse chordal embeddings for graphs containing millions of edges and hundreds of thousands of nodes in seconds.

  2. Optimize directly on the sparse matrix cone. Using log-det barriers for sparse matrix cones [2], [3], [8], [42], we can optimize directly in the space { \mathbb {S}}_{\tilde { \mathcal {G}}}^{n} , while ignoring all matrix elements outside of \tilde { \mathcal {G}} . If |\tilde { \mathcal {G}}|=O(n) , then only O(n) decision variables must be explicitly optimized. Moreover, each function evaluation, gradient evaluation, and matrix-vector product with the Hessian can be performed in O(n) time, using the numerical recipes in [2].

  3. The dual is easier to solve than the primal. The primal problem starts with a feasible point X\in { \mathbb {S}}_{\tilde { \mathcal {G}}}^{n} and seeks to achieve first-order optimality. The dual problem starts with an infeasible optimal point X\notin { \mathbb {S}}_{\tilde { \mathcal {G}}}^{n} satisfying first-order optimality, and seeks to make it feasible. We will show that feasibility is easier to achieve than optimality for the GGL, so the dual problem is easier to solve than the primal.

  4. Conjugate gradients (CG) converges in O(1) iterations. Our main result (Theorem 2) bounds the condition number of the Newton subproblem to be O(1) , independent of the problem dimension n and the current accuracy \epsilon . It is therefore cheaper to solve this subproblem using CG to machine precision \delta _{ \mathrm {mach}} in O(n\log \delta _{ \mathrm {mach}}^{-1}) time than it is to solve for it directly in O(nm^{2}+m^{3}) time, as in existing methods [2], [3], [8]. Moreover, CG is an optimal Krylov subspace method, and as such, it is often able to exploit clustering in the eigenvalues to converge superlinearly. Finally, computing the Newton direction to high accuracy further allows the outer Newton method to also converge quadratically.

The remainder of this section describes each consideration in further detail. We state the algorithm explicitly in Section IV-E. For the sake of simplicity of notation, we use \mathcal {G} instead of \mathcal {G}_{\mathcal {H}} –which is the intersection of the sparsity sets \mathcal {G} and \mathcal {H} –in the rest of this section.

A. Efficient Chordal Embedding

Following [8], we begin by reformulating (4) into a sparse chordal matrix program \begin{align*} \hat {X}=&\text {minimize} ~ \mathrm {tr}\, CX-\log \det X \\&\text {subject to} ~ X_{i,j}=0\quad \forall (i,j)\in \tilde { \mathcal {G}}\backslash \mathcal {G}. \\&\hphantom {\text {subject to} ~} X\in { \mathbb {S}}_{\tilde { \mathcal {G}}}^{n}.\tag{25}\end{align*}

in which \tilde { \mathcal {G}} is a chordal embedding for \mathcal {G} : a sparsity pattern \tilde { \mathcal {G}}\supset \mathcal {G} whose support graph contains no induced cycles of length greater than three. This can be implemented using standard algorithms for large-and-sparse linear equations, due to the following result.

Proposition 4:

Let C\in { \mathbb {S}}_{ \mathcal {G}}^{n} be a positive definite matrix. Compute its unique lower-triangular Cholesky factor L satisfying C=LL^{T} . Ignoring perfect numerical cancellation, the sparsity set of L+L^{T} is a chordal embedding \tilde { \mathcal {G}}\supset \mathcal {G} .

Proof:

The original proof is due to [37]; see also [42].

Note that \tilde { \mathcal {G}} can be determined directly from \mathcal {G} using a symbolic Cholesky algorithm, which simulates the steps of Gaussian elimination using Boolean logic. Moreover, we can substantially reduce the number of elements added to \mathcal {G} by reordering the columns and rows of C using a fill-reducing ordering.

Corollary 2:

Let \Pi be a permutation matrix. For the same C\in { \mathbb {S}}_{ \mathcal {G}}^{n} in Proposition 4, we compute the unique Cholesky factor satisfying \Pi C\Pi ^{T}=LL^{T} . Ignoring perfect numerical cancellation, the sparsity set of \Pi (L+L^{T})\Pi ^{T} is a chordal embedding \tilde { \mathcal {G}}\supset \mathcal {G} .

Proof:

See [42].

The problem of finding the best choice of \Pi is known as the fill-minimizing problem, and is NP-complete [44]. However, good orderings are easily found using heuristics developed for numerical linear algebra, like minimum degree ordering [15] and nested dissection [1], [16]. In fact, [16] proved that nested dissection is \mathcal {O}(\log (n)) suboptimal for bounded-degree graphs, and noted that “we do not know a class of graphs for which [nested dissection is suboptimal] by more than a constant factor.”

If \mathcal {G} admits sparse chordal embeddings, then a good-enough embedding with |\tilde { \mathcal {G}}|=\mathcal {O}(n) will usually be found using minimum degree or nested dissection. In MATLAB, the minimum degree ordering and symbolic factorization steps can be performed in a few lines of code; see the snippet in Figure 1.

FIGURE 1. MATLAB code for chordal embedding. Given a sparse matrix (C), compute a chordal embedding (Gt) and the number of added edges (m).
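Since the snippet in Figure 1 is not reproduced in this extract, the following is a minimal sketch of the same computation based on Proposition 4 and Corollary 2; the names C, Gt, and m follow the caption, and the diagonal shift used to obtain a positive definite matrix with the same pattern is an implementation convenience rather than part of the published code.

  % Chordal embedding of the pattern of a sparse symmetric matrix C.
  n  = size(C,1);
  A  = spones(C) + n*speye(n);          % positive definite proxy with the same off-diagonal pattern
  p  = amd(A);                          % fill-reducing (approximate minimum degree) ordering
  L  = chol(A(p,p), 'lower');           % sparse Cholesky factor of the permuted pattern
  Gt = spones(L + L');                  % chordal embedding of the permuted pattern C(p,p)
  m  = (nnz(Gt) - nnz(C)) / 2;          % number of edges added to the support graph

The edge count above assumes C has a fully nonzero diagonal, which holds for a thresholded covariance matrix.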

B. Logarithmic Barriers for Sparse Matrix Cones

Define the cone of sparse positive semidefinite matrices \mathcal {K} , and the cone of sparse matrices with positive semidefinite completions \mathcal {K}_{*} , as the following \begin{equation*} \mathcal {K}= { \mathbb {S}}_{+}^{n}\cap { \mathbb {S}}_{\tilde { \mathcal {G}}}^{n},\quad \mathcal {K} _{*}=\{S\in { \mathbb {S}}_{\tilde { \mathcal {G}}}^{n}:S\bullet X\ge 0\ \ \forall X\in \mathcal {K} \}=P_{\tilde { \mathcal {G}}}({ \mathbb {S}}_{+}^{n})\tag{26}\end{equation*}

View SourceRight-click on figure for MathML and additional features. Then, (25) can be posed as the primal-dual pair:\begin{align*}&\arg \min _{X\in \mathcal {K} } ~ \{C\bullet X+f(X):A^{T}(X)=0\}, \tag{27}\\&\arg \!\!\!\!\!\max _{_{S\in \mathcal {K} _{*},y\in \mathbb {R}^{m}}} ~ \{-f_{*}(S):S=C-A(y)\}, \tag{28}\end{align*}
in which f and f_{*} are the “log-det” barrier functions on \mathcal {K} and \mathcal {K}_{*} as introduced by [2], [3], [8]:\begin{equation*} f(X)=-\log \det X,\qquad f_{*}(S)=-\min _{X\in \mathcal {K} }\{S\bullet X-\log \det X\}.\end{equation*}
The linear map A: \mathbb {R}^{m}\to { \mathbb {S}}_{\tilde { \mathcal {G}}\backslash \mathcal {G} }^{n} converts a list of m variables into the corresponding matrix with the sparsity set \tilde { \mathcal {G}}\backslash \mathcal {G} . The gradients of f are simply the projections of their usual values onto { \mathbb {S}}_{\tilde { \mathcal {G}}}^{n} , as in \begin{equation*} \nabla f(X)=-P_{\tilde { \mathcal {G}}}(X^{-1}),\qquad \nabla ^{2}f(X)[Y]=P_{\tilde { \mathcal {G}}}(X^{-1}YX^{-1}).\end{equation*}
Given any S\in \mathcal {K} _{*} let X\in \mathcal {K} be the unique matrix satisfying P_{\tilde { \mathcal {G}}}(X^{-1})=S . Then, we have \begin{align*} f_{*}(S)=&n+\log \det X,\tag{29a}\\ \nabla f_{*}(S)=&-X,\tag{29b}\\ \nabla ^{2}f_{*}(S)[Y]=&\nabla ^{2}f(X)^{-1}[Y].\tag{29c}\end{align*}

Assuming that \tilde { \mathcal {G}} is sparse and chordal, all six operations can be efficiently evaluated in O(n) time and O(n) memory, using the numerical recipes described in [2].
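For intuition only, the projected gradient and Hessian-vector product of f can be spelled out naively in MATLAB by forming X^{-1} densely; the O(n) recipes of [2] used in our implementation compute the same quantities without ever forming the inverse. Gt is the 0/1 chordal pattern and X is a positive definite matrix with that pattern (both assumed in memory).

  % Naive (dense, O(n^3)) illustration of the chordal barrier derivatives.
  fX     = -2*sum(log(full(diag(chol(X)))));        % f(X) = -log det X, via the Cholesky factor
  Xi     = inv(full(X));                            % X^{-1}, formed densely only for illustration
  gradf  = -sparse(Xi .* full(Gt));                 % \nabla f(X)      = -P_Gt(X^{-1})
  hessfY = @(Y) sparse((Xi*full(Y)*Xi) .* full(Gt));% \nabla^2 f(X)[Y] = P_Gt(X^{-1} Y X^{-1})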

C. Solving the Dual Problem

Our algorithm actually solves the dual problem (28), which can be rewritten as an unconstrained optimization problem \begin{equation*} \hat {y}\equiv \arg \min _{y\in \mathbb {R} ^{m}}g(y)\equiv f_{*}(C_{\lambda }-A(y)). \tag{30}\end{equation*}

After the solution \hat {y} is found, we can recover the optimal estimator for the primal problem via \hat {X}=-\nabla f_{*}(C_{\lambda }-A(\hat {y})) . The dual problem (28) is easier to solve than the primal (27) because the origin y=0 often lies very close to the solution \hat {y} . To see this, note that y=0 produces a candidate estimator \tilde {X}=-\nabla f_{*}(C_{\lambda }) that solves the chordal matrix completion problem \begin{equation*} \tilde {X}=\arg \min \{ \mathrm {tr}\, C_{\lambda }X-\log \det X:X\in { \mathbb {S}}_{\tilde { \mathcal {G}}}^{n}\},\end{equation*}
View SourceRight-click on figure for MathML and additional features.
which is a relaxation of the nonchordal problem posed over { \mathbb {S}}_{ \mathcal {G}}^{n} since \tilde { \mathcal {G}}\supset \mathcal {G} . As observed by several previous authors [8], this relaxation is a high quality guess, and \tilde {X} is often “almost feasible” for the original nonchordal problem posed over { \mathbb {S}}_{ \mathcal {G}}^{n} , as in \tilde {X}\approx P_{ \mathcal {G}}(\tilde {X}) . Simple algebra shows that the gradient \nabla g evaluated at the origin has the Euclidean norm \|\nabla g(0)\|=\|\tilde {X}-P_{ \mathcal {G}}(\tilde {X})\|_{F} , so if \tilde {X}\approx P_{ \mathcal {G}}(\tilde {X}) holds true, then the origin y=0 is close to optimal. Starting from this point, we can expect Newton’s method to rapidly converge at a quadratic rate.

D. CG Converges in O(1) Iterations

The most computationally expensive part of Newton’s method is the solution of the Newton direction \Delta y via the m\times m system of equations \begin{equation*} \nabla ^{2}g(y)\Delta y=-\nabla g(y). \tag{31}\end{equation*}

The Hessian matrix \nabla ^{2}g(y) is fully dense, but matrix-vector products with it cost just O(n) operations. This insight motivates the solution of (31) using an iterative Krylov subspace method like conjugate gradients (CG). We defer to standard texts [6] for implementation details, and only note that CG requires a single matrix-vector product with the Hessian \nabla ^{2}g(y) at each iteration, in O(n) time and memory when |\tilde { \mathcal {G}}|=O(n) . Starting from the origin p=0 , the method converges to an \epsilon -accurate search direction p satisfying \begin{equation*} (p-\Delta y)^{T}\nabla ^{2}g(y)(p-\Delta y)\le \epsilon |\Delta y^{T}\nabla g(y)|\end{equation*}
View SourceRight-click on figure for MathML and additional features.
in at most \begin{equation*} \left \lceil{ \sqrt {\kappa _{g}}\log (2/\epsilon)}\right \rceil \text {CG iterations,} \tag{32}\end{equation*}
View SourceRight-click on figure for MathML and additional features.
where \kappa _{g}=\|\nabla ^{2}g(y)\|\|\nabla ^{2}g(y)^{-1}\| is the condition number of the Hessian matrix [18], [39]. In many important convex optimization problems, the condition number \kappa _{g} grows like O(1/\epsilon) or O(1/\epsilon ^{2}) as the outer Newton iterates approach an \epsilon -neighborhood of the true solution. As a consequence, Newton-CG methods typically require O(1/\sqrt {\epsilon }) or O(1/\epsilon) CG iterations.

It is therefore surprising that we are able to bound \kappa _{g} globally for the MDMC problem, over the entire trajectory of Newton’s method. Below, we state our main result, which asserts that \kappa _{g} depends polynomially on the problem data and the quality of the initial point, but is independent of the problem dimension n and the accuracy of the current iterate \epsilon .

Theorem 2:

The condition number of the Hessian matrix is bounded as \begin{equation*} \mathrm {cond}\,(\nabla ^{2}g(y_{k}))\le \frac {\lambda _{\max }(\mathbf {A}^{T} \mathbf {A})}{\lambda _{\min }(\mathbf {A}^{T} \mathbf {A})} \left ({2+\frac {\phi _{\max }^{2}\lambda _{\max }(X_{0})}{\lambda _{\min }(\hat {X})}}\right)^{2}\end{equation*}

where:

  • \phi _{\max } =g(y_{0})-g(\hat {y}) is the initial infeasibility,

  • \mathbf {A}=[\mathrm {vec}\, A_{1},\ldots, \mathrm {vec}\, A_{m}] is the vectorized data matrix,

  • X_{0} =-\nabla f_{*}(C-A(y_{0})) satisfies P_{\tilde { \mathcal {G}}}(X_{0}^{-1})=C-A(y_{0}), X_{0}\in \mathcal {K} ,

  • and \hat {X} =-\nabla f_{*}(C-A(\hat {y})) satisfies P_{\tilde { \mathcal {G}}}(\hat {X}^{-1})=C-A(\hat {y}), \hat {X}\in \mathcal {K} .

Proof:

See Section VII.

Note that, without loss of generality, we assume \lambda _{\min }(\mathbf {A}^{T} \mathbf {A})> 0 ; otherwise, the linearly-dependent rows of A^{T} can be eliminated until the reduced A^{T} is full row-rank. Applying Theorem 2 to (32) shows that CG solves each Newton subproblem to \epsilon -accuracy in O(\log \epsilon ^{-1}) iterations. Multiplying this figure by the O(\log \log \epsilon ^{-1}) Newton steps to converge yields a global iteration bound of \begin{equation*} O(\log \epsilon ^{-1}\cdot \log \log \epsilon ^{-1})\approx O(1) \text {CG iterations.}\end{equation*}

Multiplying this by the O(n) cost of each CG iteration proves the claimed time complexity in (6). The associated memory complexity is O(n) , i.e. the number of decision variables in \tilde { \mathcal {G}} .

In practice, CG typically converges much faster than this worst-case bound, due to its ability to exploit the clustering of eigenvalues in \nabla ^{2}g(y) ; see [18], [39]. Moreover, accurate Newton directions are only needed to guarantee quadratic convergence close to the solution. During the initial Newton steps, we may loosen the error tolerance for CG for a significant speed-up. Inexact Newton steps can be used to obtain a speed-up of a factor of 2-3.

E. The Full Algorithm

In this subsection, we assemble the previously-presented steps and summarize the full algorithm. To begin with, we compute a chordal embedding \tilde { \mathcal {G}} for the sparsity pattern \mathcal {G} of C_{\lambda } , using the code snippet in Figure 1. We use the embedding to reformulate (4) as (25), and solve the unconstrained problem \hat {y}=\min _{y}g(y) defined in (30), using Newton’s method \begin{align*} y_{k+1}=&y_{k}+\alpha _{k}\Delta y_{k},\\ \Delta y_{k}\equiv&-\nabla ^{2}g(y_{k})^{-1}\nabla g(y_{k})\end{align*}

starting at the origin y_{0}=0 . The function value g(y) , gradient \nabla g(y) and Hessian matrix-vector products are all evaluated using the numerical recipes described by [2].

At each k -th Newton step, we compute the Newton search direction \Delta y_{k} using conjugate gradients. A loose tolerance is used when the Newton decrement \delta _{k}=|\Delta y_{k}^{T}\nabla g(y_{k})| is large, and a tight tolerance is used when the decrement is small, implying that the iterate is close to the true solution.
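A sketch of this inner solve using MATLAB's pcg is shown below; gradg(y) and hessvec(y,v) are assumed oracles for \nabla g(y) and for products with \nabla^2 g(y) (they would wrap the recipes of [2] and are not standard functions), and cgtol and maxit are the tolerance and iteration cap.

  % Newton direction (31) by conjugate gradients; only Hessian-vector products are needed.
  g          = gradg(y);                         % \nabla g(y)
  Hv         = @(v) hessvec(y, v);               % v -> \nabla^2 g(y) v, O(n) per product
  [dy, flag] = pcg(Hv, -g, cgtol, maxit);        % loose cgtol early on, tight near the solution
  delta      = abs(dy' * g);                     % Newton decrement used for the stopping test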

Once a Newton direction \Delta y_{k} is computed with a sufficiently large Newton decrement \delta _{k} , we use a backtracking line search to determine the step-size \alpha _{k} . In other words, we select the first instance of the sequence \{1,\rho,\rho ^{2},\rho ^{3}, {\dots }\} that satisfies the Armijo–Goldstein condition \begin{equation*} g(y+\alpha \Delta y)\le g(y)+\gamma \alpha \Delta y^{T}\nabla g(y),\end{equation*}

in which \gamma \in (0,0.5) and \rho \in (0,1) are line search parameters. Our implementation used \gamma =0.01 and \rho =0.5 . We complete the step and repeat the process until convergence.
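The backtracking loop itself is only a few lines; a sketch with the stated parameters, where gval and gradg are the assumed oracles for g and \nabla g:

  % Backtracking line search satisfying the Armijo-Goldstein condition.
  gamma = 0.01;  rho = 0.5;  alpha = 1;
  g0 = gval(y);  slope = gamma * (dy' * gradg(y));   % slope < 0 for a descent direction
  while gval(y + alpha*dy) > g0 + alpha*slope
      alpha = rho * alpha;                           % shrink the step until sufficient decrease
  end
  y = y + alpha*dy;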

We terminate the outer Newton’s method if the Newton decrement \delta _{k} falls below a threshold. This implies either that the solution has been reached or that CG is not converging to a good enough \Delta y_{k} to make significant progress. The associated estimator for \Sigma ^{-1} is recovered by evaluating \hat {X}=-\nabla f_{*}(C_{\lambda }-A(\hat {y})) .

SECTION V.

Numerical Results

Finally, we benchmark our algorithm3 against QUIC [20], commonly considered the fastest solver for graphical lasso or RGL.4 (Another widely-used algorithm is GLASSO [13], but we found it to be significantly slower than QUIC.) We consider three case studies:

  1. Banded graphs without thresholding, to verify the claimed O(n) complexity of our MDMC algorithm on problems with a nearly-banded structure.

  2. Banded graphs, to verify the ability of our threshold-MDMC procedure to recover the correct GL solution on problems with a nearly-banded structure.

  3. Real-life Graphs, to benchmark the full threshold-MDMC procedure on graphs collected from real-life applications.

Experiments 1 and 3 are performed on a laptop computer with an Intel Core i7 quad-core 2.50 GHz CPU and 16GB RAM. Experiment 2 is performed on a desktop workstation with a slower Intel Core CPU, but 48 GB of RAM. The reported results are based on a serial implementation in MATLAB-R2017b. Both our Newton decrement threshold and QUIC’s convergence threshold are 10^{-7}.

We implemented the soft-thresholding operation (7) as a serial routine that uses O(n) memory by taking the n\times N matrix-of-samples \mathbf {X}=[\mathbf {x}_{1},\mathbf {x}_{2},\ldots,\mathbf {x}_{N}] satisfying C=\frac {1}{N}\mathbf {X}\mathbf {X}^{T} as input. The routine implicitly partitions C into submatrices of size 4000\times 4000 , and iterates over the submatrices one at a time. For each submatrix, it explicitly forms the submatrix, thresholds it using dense linear algebra, and then stores the result as a sparse matrix.
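A simplified sketch of such a routine is given below; the block size, variable names, and bookkeeping are illustrative and do not reproduce the exact published implementation. X is the centered n-by-N matrix-of-samples and lambda is a scalar threshold.

  % Blockwise soft-thresholding of C = (1/N) X*X' using O(n) memory.
  [n, N] = size(X);  blk = 4000;
  rows = {};  cols = {};  vals = {};
  for i1 = 1:blk:n
      i2 = min(i1+blk-1, n);
      for j1 = 1:blk:n
          j2  = min(j1+blk-1, n);
          Cij = (X(i1:i2,:) * X(j1:j2,:)') / N;            % dense submatrix of C
          Tij = sign(Cij) .* max(abs(Cij) - lambda, 0);    % soft-threshold every entry
          if i1 == j1                                      % restore the diagonal of C on diagonal blocks
              Tij(1:size(Tij,1)+1:end) = diag(Cij);
          end
          [r, c, v]   = find(sparse(Tij));
          rows{end+1} = r + i1 - 1;  cols{end+1} = c + j1 - 1;  vals{end+1} = v;
      end
  end
  Clam = sparse(vertcat(rows{:}), vertcat(cols{:}), vertcat(vals{:}), n, n);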

After thresholding, we use our Newton-CG algorithm to solve the MDMC problem with respect to the thresholded matrix C_{\lambda } . We measure the quality of the resulting X by the following two metrics:\begin{align*} \text {relative optimality gap}\equiv&\frac {\|P_{G}(C_{\lambda }-X^{-1})\|_{F}}{\|C_{\lambda }\|_{F}},\tag{33}\\ \text {relative infeasibility}\equiv&\frac {\|P_{G}(X)-X\|_{F}}{\|X\|_{F}},\tag{34}\end{align*}

where G is the sparsity pattern associated with C_{\lambda } . If X is an exact solution to MDMC (4), then the first-order optimality condition would read P_{G}(X^{-1})=C_{\lambda } while the associated feasibility condition X\in { \mathbb {S}}_{G}^{n} would imply P_{G}(X)=X . In this case, both the relative optimality gap and the relative infeasibility are identically zero.
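Both metrics translate directly into MATLAB. The sketch below forms X^{-1} densely for clarity; in practice only P_G(X^{-1}) is needed and is computed with the sparse recipes of [2]. Clam denotes the thresholded matrix C_lambda and X the candidate solution (both assumed in memory).

  % Relative optimality gap (33) and relative infeasibility (34) for a candidate X.
  G      = spones(Clam);                                   % sparsity pattern of C_lambda
  PGXinv = sparse(inv(full(X)) .* full(G));                % P_G(X^{-1}), dense inverse for illustration
  relgap = norm(Clam - PGXinv, 'fro') / norm(Clam, 'fro'); % (33)
  relinf = norm(G .* X - X, 'fro')    / norm(X, 'fro');    % (34), with P_G(X) = G .* X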

A. Case Study 1: Banded Patterns Without Thresholding

The first case study aims to verify the claimed O(n) complexity of our algorithm for MDMC. Here, we avoid the proposed thresholding step, and focus solely on the MDMC problem (4). Each sparsity pattern G is a corrupted banded pattern with bandwidth 101. The off-diagonal nonzero elements of C are selected from the uniform distribution on [−2, 0) and then corrupted to zero with probability 0.3. The diagonal elements are fixed to 5. Our numerical experiments fix the bandwidth and vary the number of variables n from 1,000 to 200,000. A time limit of 2 hours is set for both algorithms.
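The following sketch generates one such instance; reading “bandwidth 101” as 50 nonzero diagonals on each side of the main diagonal is our interpretation, and the uniform draw on (−2, 0] approximates the stated [−2, 0) interval.

  % Synthetic corrupted banded instance for case study 1.
  n  = 10000;  bw = 50;                                % n is varied from 1e3 to 2e5 in the experiments
  V  = -2*rand(n, bw) .* (rand(n, bw) > 0.3);          % band values, corrupted to zero with prob. 0.3
  U  = spdiags(V, 1:bw, n, n);                         % strictly upper-triangular band
  C  = U + U' + 5*speye(n);                            % symmetric, diagonal elements fixed to 5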

Figure 2a compares the running time of both algorithms. A log-log regression results in an empirical time complexity of O(n^{1.1}) for our algorithm, and O(n^{2}) for QUIC. The extra 0.1 in the exponent is most likely an artifact of our MATLAB implementation. In either case, QUIC’s quadratic complexity limits it to n=1.5\times 10^{4} . By contrast, our algorithm solves an instance with n=2\times 10^{5} in less than 33 minutes. The resulting solutions are extremely accurate, with optimality and feasibility gaps of less than 10^{-16} and 10^{-7}, respectively.

FIGURE 2. CPU time of Newton-CG vs. QUIC: (a) case study 1; (b) case study 2.

B. Case Study 2: Banded Patterns

In the second case study, we repeat our first experiment over banded patterns, but also include the initial thresholding step. We restrict our attention to the RGL version of the problem, meaning that the bandwidth is known ahead of time, but the exact locations of the nonzeros are unknown. The quality of an estimator \hat {X} compared to the true inverse covariance matrix \Theta \equiv \Sigma ^{-1} is measured using the following three standard metrics:\begin{align*} \text {relative Frobenius loss}\equiv&\|\hat {X}-\Theta \|_{F}/\|\Theta \|_{F},\tag{35}\\ \text {TPR}\equiv&\frac {|\{(i,j):i\ne j,\hat {X}_{i,j}\ne 0,\Theta _{i,j}\ne 0\}|}{|\{(i,j):i\ne j,\Theta _{i,j}\ne 0\}|},\tag{36}\\ \text {FPR}\equiv&\frac {|\{(i,j):i\ne j,\hat {X}_{i,j}\ne 0,\Theta _{i,j}=0\}|}{|\{(i,j):i\ne j,\Theta _{i,j}=0\}|}.\tag{37}\end{align*}

In words, the relative Frobenius loss is the normalized error of the estimator. The true positive rate (TPR) is a measure of sensitivity, and is defined as the proportion of actual nonzeros that are correctly identified as such. The false positive rate (FPR), equal to one minus the specificity, is defined as the proportion of zero elements that are misidentified as being nonzero.
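
A direct MATLAB transcription of (35)-(37) is given below for completeness; the function name and the numerical tolerance tol used to declare an entry nonzero are our own conventions.

    function [floss, tpr, fpr] = recovery_metrics(Xhat, Theta, tol)
    % Relative Frobenius loss (35), TPR (36) and FPR (37) over off-diagonal entries.
    if nargin < 3, tol = 1e-8; end
    n = size(Theta, 1);
    floss = norm(Xhat - Theta, 'fro') / norm(Theta, 'fro');
    Shat = abs(Xhat)  > tol;  Shat(1:n+1:end) = false;   % estimated off-diagonal support
    S    = abs(Theta) > tol;  S(1:n+1:end)    = false;   % true off-diagonal support
    tp  = nnz(Shat & S);
    tpr = tp / nnz(S);                                   % correctly identified nonzeros
    fpr = (nnz(Shat) - tp) / (n*(n-1) - nnz(S));         % misidentified zeros (ordered pairs)
    end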

The results are shown in Table 1. The threshold-MDMC procedure is able to achieve the same statistical recovery properties as QUIC in a fraction of the solution time. In larger instances, both methods are able to almost exactly recover the true sparsity pattern. However, only our procedure continues to work for n up to 200,000.

TABLE 1 Details of Case Study 2. Here, “ n ” is the Size of the Covariance Matrix, “ m ” is the Number of Edges Added to Make its Sparsity Graph Chordal, “ \ell_{F} ” is the Relative Frobenius Loss, “TPR” is the True Positive Rate, “FPR” is the False Positive Rate, “Sec” is the Running Time in Seconds, “Gap” is the Relative Optimality Gap, “Infeas” is the Relative Infeasibility, “Diff. Gap” is the Difference in Duality Gaps for the Two Different Methods, and “Speed-Up” is the Speed-Up Factor Over QUIC Achieved by Our Algorithm

C. Case Study 3: Real-Life Graphs

The third case study aims to benchmark the full thresholding-MDMC procedure for sparse inverse covariance estimation on real-life graphs. The actual graphs (i.e., the sparsity patterns) for \Sigma ^{-1} are chosen from the SuiteSparse Matrix Collection [9], a publicly available dataset of large, sparse matrices collected from real-world applications. Our chosen graphs vary in size from n=3918 to n=201062 , and are taken from applications in chemical processes, material science, graph problems, optimal control and model reduction, thermal processes, and circuit simulations.

For each sparsity pattern G , we design a corresponding \Sigma ^{-1} as follows. For each (i,j)\in G , we select (\Sigma ^{-1})_{i,j}=(\Sigma ^{-1})_{j,i} from the uniform distribution on [−1, 1], and then corrupt it to zero with probability 0.3. Then, we set each diagonal entry to (\Sigma ^{-1})_{i,i}=1+\sum _{j}|(\Sigma ^{-1})_{i,j}| . Using this \Sigma , we generate N=5000 i.i.d. samples \mathbf {x}_{1},\ldots,\mathbf {x}_{N}\sim \mathcal {N}(0,\Sigma) . This results in a sample covariance matrix C=\frac {1}{N}\sum _{i=1}^{N}\mathbf {x}_{i}\mathbf {x}_{i}^{T} .
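
A MATLAB sketch of this construction, for a given 0/1 pattern G (e.g., loaded from the SuiteSparse collection), is shown below; the sampling route through a sparse Cholesky factor of \Sigma^{-1} is our choice, and the authors' generator may differ in such details.

    function [C, invSigma] = make_case3_instance(G, N)
    % Build a random Sigma^{-1} supported on G, then draw N samples from N(0, Sigma).
    n = size(G, 1);
    [i, j] = find(tril(G, -1));                      % strictly lower-triangular part of G
    v = -1 + 2*rand(numel(i), 1);                    % uniform in [-1, 1]
    v(rand(numel(i), 1) < 0.3) = 0;                  % corrupt to zero with probability 0.3
    L = sparse(i, j, v, n, n);
    invSigma = L + L';
    d = 1 + sum(abs(invSigma), 2);                   % diagonal: 1 + sum_j |(Sigma^{-1})_{ij}|
    invSigma = invSigma + spdiags(d, 0, n, n);       % diagonally dominant => positive definite
    R = chol(invSigma);                              % R'*R = Sigma^{-1}
    X = R \ randn(n, N);                             % x = R^{-1} z  =>  cov(x) = Sigma
    C = (X * X') / N;                                % sample covariance (dense; for the largest n
    end                                              % one would keep X and form C blockwise)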

We solve graphical lasso and RGL with the C described above using our proposed soft-thresholding-MDMC algorithm and QUIC, in order to estimate \Sigma ^{-1} . In the case of RGL, we assume that the graph G is known a priori, while noting that 30% of the elements of \Sigma ^{-1} have been corrupted to zero. Our goal here is to discover the location of these corrupted elements. In all of our simulations, the threshold \lambda is set so that the number of nonzero elements in the estimator is roughly the same as in the ground truth. We limit both algorithms to 3 hours of CPU time.
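
We do not spell out here how \lambda is tuned to hit a target nonzero count; one simple possibility (a sketch of our own, not necessarily the procedure used for the reported experiments) is a bisection on \lambda over the off-diagonal magnitudes of C :

    function lambda = pick_lambda(C, target_nnz, iters)
    % Bisection so that roughly target_nnz off-diagonal entries survive thresholding at lambda.
    % Suitable for moderate n; for very large n the counts would be accumulated blockwise.
    if nargin < 3, iters = 50; end
    n   = size(C, 1);
    Off = abs(C);  Off(1:n+1:end) = 0;            % off-diagonal magnitudes
    lo = 0;  hi = full(max(Off(:)));              % at hi, no off-diagonal entry survives
    for t = 1:iters
        lambda = (lo + hi) / 2;
        if nnz(Off > lambda) > target_nnz
            lo = lambda;                          % too many survivors: increase lambda
        else
            hi = lambda;                          % too few survivors: decrease lambda
        end
    end
    end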

Figure 2b compares the CPU time of the two algorithms for this case study; the specific details are provided in Table 2. A log-log regression results in an empirical time complexity of O(n^{1.64}) and O(n^{1.55}) for graphical lasso and RGL using our algorithm, and O(n^{2.46}) and O(n^{2.52}) for the same using QUIC. The exponents of our algorithm are ≥1 due to the initial soft-thresholding step, which is quadratic-time on a serial computer, but ≤2 because the overall procedure is dominated by the solution of the MDMC. Both algorithms solve graphs with n\le 1.5\times 10^{4} within the allotted time limit, though our algorithm is 11 times faster on average. Only our algorithm is able to solve the estimation problem with n\approx 2\times 10^{5} in a little more than an hour.

TABLE 2 Details of Case Study 3. Here, “ n ” is the Size of the Covariance Matrix, “ m ” is the Number of Edges Added to Make its Sparsity Graph Chordal, “Sec” is the Running Time in Seconds, “Gap” is the Relative Optimality Gap, “Feas” is the Relative Infeasibility, “Diff. Gap” is the Difference in Duality Gaps for the Two Different Methods, and “Speed-Up” is the Speed-Up Factor Over QUIC Achieved by Our Algorithm

To check whether thresholding-MDMC really does solve graphical lasso and RGL, we substitute the two sets of estimators back into their original problems (1) and (5). The corresponding objective values have a relative difference \le 4\times 10^{-4} , suggesting that both sets of estimators are about equally optimal. This observation verifies our claims in Theorem 1 and Corollary 1 about (1) and (5): thresholding-MDMC does indeed solve graphical lasso and RGL.
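
The substitution check amounts to evaluating the graphical lasso objective at each estimator. A sketch is shown below; it assumes the \ell_{1} penalty in (1) is applied to the off-diagonal entries with a single scalar \lambda , which is our reading rather than a confirmed detail of (1).

    function val = gl_objective(C, X, lambda)
    % tr(C*X) - log det(X) + lambda * sum_{i~=j} |X_ij|, for symmetric positive definite X.
    n = size(X, 1);
    R = chol(X);                                   % fails if X is not positive definite
    logdetX = 2 * sum(log(full(diag(R))));
    A = abs(X);  A(1:n+1:end) = 0;                 % off-diagonal magnitudes
    val = full(sum(sum(C .* X))) - logdetX + lambda * full(sum(A(:)));
    end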

SECTION VI.

Conclusions

Graphical lasso is a widely-used approach for estimating a covariance matrix with a sparse inverse from limited samples. In this paper, we consider a slightly more general formulation called restricted graphical lasso (RGL), which additionally enforces a prior sparsity pattern on the estimator. We describe an efficient approach that substantially reduces the cost of solving RGL: 1) soft-thresholding the sample covariance matrix and projecting onto the prior pattern, to recover the estimator’s sparsity pattern; and 2) solving a maximum determinant matrix completion (MDMC) problem, to recover the estimator’s numerical values. The first step is quadratic O(n^{2}) time and memory but embarrassingly parallelizable. If the resulting sparsity pattern is sparse and chordal, then the second step can be solved in closed form. More generally, if the resulting sparsity pattern has a sparse chordal embedding, then the second step can be performed using the Newton-CG algorithm described in this paper. In either case, the complexity of this second step is linear O(n) time and memory. We tested the algorithm on both synthetic and real-life data, solving instances with as many as 200,000 variables to 7–9 digits of accuracy within an hour on a standard laptop computer. Our algorithm achieves the same Frobenius loss, true positive rate, and false positive rate as the existing state of the art, but using just a fraction of the computation time.

SECTION VII.

Proofs

In this section, we present the technical proofs of Theorems 1 and 2. To this end, we need a number of lemmas.

A. Proof of Theorem 1

To prove Theorem 1, first we consider the optimality (KKT) conditions for the unique solution of the GGL with \mathbb {S}_{ \mathcal {H}}^{n}=\mathbb {S}^{n} .

Lemma 1:

\hat {X} is the optimal solution of the GGL with \mathbb {S}_{ \mathcal {H}}^{n}=\mathbb {S}^{n} if and only if it satisfies the following conditions for every i,j\in \{1,2, \ldots ,n\} : \begin{align*} (\hat {X})_{i,j}^{-1}=&C_{i,j} \quad \text {if}\quad i=j\tag{38a}\\ (\hat {X})_{i,j}^{-1}=&C_{i,j}+\lambda _{i,j}\times \mathrm {sign}(\hat {X}_{i,j}) \quad \text {if}~ \hat {X}_{i,j}\not =0\qquad \tag{38b}\\ C_{i,j}-\lambda _{i,j}\leq&(\hat {X})_{i,j}^{-1}\leq C_{i,j}+\lambda _{i,j} \quad \text {if}~\hat {X}_{i,j}=0\tag{38c}\end{align*}


Proof:

The proof is straightforward and omitted for brevity.

Now, consider the following optimization:\begin{align*}&\hspace {-2pc}\underset {X\succ 0}{\text {minimize }} \mathrm {tr}\,\tilde {C}X-\log \det X \\&\qquad \qquad \qquad +\sum _{(i,j)\in \mathcal {H} }\tilde {\lambda }_{i,j}|{X}_{i,j}|+2\sum _{(i,j)\notin \mathcal {H} }|{X}_{i,j}| \tag{39}\end{align*}

where \begin{equation*} \tilde {C}_{i,j}=\frac {C_{i,j}}{\sqrt {C_{i,i}\times C_{j,j}}}\quad \tilde {\lambda }_{i,j}=\frac {\lambda _{i,j}}{\sqrt {C_{i,i}\times C_{j,j}}}\tag{40}\end{equation*}
Let \tilde {X} denote the optimal solution of (39) and recall D = \mathrm {diag}(C) . The following lemma relates \tilde {X} to \hat {X} .

Lemma 2:

We have \hat {X}=D^{-1/2}\tilde {X} D^{-1/2} .

Proof:

To prove this lemma, we define an intermediate optimization problem as \begin{align*} \min _{X\in \mathbb {S}_{+}^{n}}f(X)=&\mathrm {tr}\,{C}X-\log \det X+\sum _{(i,j)\in \mathcal {H} }{\lambda }_{i,j}|{X}_{i,j}| \\&\qquad \qquad \quad +\,2\max _{k}\{{C}_{k,k}\}\sum _{(i,j)\notin \mathcal {H} } |{X}_{i,j}| \tag{41}\end{align*}

View SourceRight-click on figure for MathML and additional features. Denote X^{\sharp } as the optimal solution for (41). First, we show that X^{\sharp }=\hat {X} . Trivially, \hat {X} is a feasible solution for (41) and hence f(X^{\sharp })\leq f(\hat {X}) . Now, we prove that X^{\sharp } is a feasible solution for the GGL. To this goal, we show that X_{i,j}^{\sharp }=0 for every (i,j)\notin \mathcal {H} . By contradiction, suppose X_{i,j}^{\sharp }\not =0 for some (i,j)\notin \mathcal {H} . Note that, due to the positive definiteness of {X^{\sharp }}^{-1} , we have \begin{equation*} (X^{\sharp })_{i,i}^{-1}\times (X^{\sharp })_{j,j}^{-1}-((X^{\sharp })_{i,j}^{-1})^{2}>0 \tag{42}\end{equation*}
Now, based on Lemma 1, one can write \begin{equation*} (X^{\sharp })_{ij}^{-1}=C_{i,j}+2\max _{k}\{C_{k,k}\}\times \mathrm {sign}(X_{i,j}^{\sharp }) \tag{43}\end{equation*}
Considering the fact that C\succeq 0 , we have |C_{i,j}|\leq \max _{k}\{C_{k,k}\} . Together with (43), this implies that |(X^{\sharp })_{i,j}^{-1}|\geq \max _{k}\{C_{k,k}\} . Furthermore, due to Lemma 1, one can write (X^{\sharp })_{i,i}^{-1}=C_{i,i} and (X^{\sharp })_{j,j}^{-1}=C_{j,j} . This leads to \begin{align*} (X^{\sharp })_{i,i}^{-1}\times (X^{\sharp })_{j,j}^{-1}-((X^{\sharp })_{i,j}^{-1})^{2}\leq&C_{i,i}\times C_{j,j}-(\max _{k}\{C_{k,k}\})^{2} \\\leq&0\tag{44}\end{align*}
contradicting (42). Therefore, X^{\sharp } is a feasible solution for the GGL. Since f coincides with the GGL objective on GGL-feasible points and \hat {X} is optimal for the GGL, this implies that f(\hat {X})=f(X^{\sharp }) . Due to the uniqueness of the solution of (41), we have \hat {X}=X^{\sharp } . Now, note that (41) can be reformulated as \begin{align*}&\hspace {-3pc}\min _{X\in \mathbb {S}_{+}^{n}} ~ \mathrm {tr}\,{\tilde {C}}D^{1/2}{X}D^{1/2}-\log \det X \\&+\sum _{(i,j)\in \mathcal {H} }{\lambda }_{i,j}|{X}_{i,j}|+2\max _{k}\{{C}_{k,k}\}\sum _{(i,j)\notin \mathcal {H} }|{X}_{i,j}| \tag{45}\end{align*}
Upon defining \begin{equation*} \tilde {X}=D^{1/2}XD^{1/2} \tag{46}\end{equation*}
and following some algebra, one can verify that (41) is equivalent to \begin{align*}&\hspace {-2.5pc}\min _{\tilde {X}\in \mathbb {S}_{+}^{n}} ~ \mathrm {tr}\,{\tilde {C}}\tilde {X}-\log \det \tilde {X} \\&+\sum _{(i,j)\in \mathcal {H} }\tilde {\lambda }_{i,j}|\tilde {X}_{i,j}|+2\sum _{(i,j)\notin \mathcal {H} }|\tilde {X}_{i,j}|+\log \det (D) \tag{47}\end{align*}
Dropping the constant term in (47) gives rise to the optimization (39). Therefore, \hat {X}=D^{-1/2}\tilde {X} D^{-1/2} holds in light of (46).

Proof of Theorem 1:

Define \tilde {C}_{\lambda } = D^{-1/2}C_{\lambda }D^{-1/2} and note that, due to Lemma 2, \tilde {C}_{\lambda } and \tilde {X} have the same signed sparsity patterns as {C}_{\lambda } and \hat {X} , respectively. Therefore, it suffices to show that the signed sparsity structures of \tilde {C}_{\lambda } and \tilde {X} are the same, which can be done by analyzing the optimization problem (39) and its connection to the GGL (explained in Lemma 2). Since (39) is the weighted analog of graphical lasso, the arguments made in the proof of Theorem 12 in [12] can be adapted to prove Theorem 1. The details are omitted for brevity.

B. Proof of Theorem 2

Recall the definition of the cone \mathcal {K} and its dual \mathcal {K}_{*} as in (26). Being dual cones, \mathcal {K} and \mathcal {K}_{*} satisfy Farkas’ lemma.

Lemma 3 (Farkas’ Lemma):

Given an arbitrary Y\in { \mathbb {S}}_{V}^{n} :

  1. Either Y\in \mathcal {K} , or there exists a separating hyperplane S\in \mathcal {K} _{*} such that S\bullet Y< 0 .

  2. Either Y\in \mathcal {K} _{*} , or there exists a separating hyperplane X\in \mathcal {K} such that Y\bullet X< 0 .

From the definition of g(y) in (30) and the relations (29), we see that the Hessian matrix \nabla ^{2}g(y) can be written in terms of the Hessian \nabla ^{2}f(X) and the unique X\in \mathcal {K} satisfying P_{\tilde { \mathcal {G}}}(X^{-1})=C-A(y) , as in \begin{equation*} \nabla ^{2}g(y) =A^{T}(\nabla ^{2}f(X)^{-1}[A(y)])= \mathbf {A}^{T}\nabla ^{2}f(X)^{-1} \mathbf {A},\end{equation*}

in which \mathbf {A}=[\mathrm {vec}\, A_{1},\ldots, \mathrm {vec}\, A_{m}] . Moreover, using the theory of Kronecker products, we can write \begin{equation*} \mathrm {vec}\,\nabla ^{2}f(X)[Y]=Q^{T}(X^{-1}\otimes X^{-1})Q \mathrm {vec}\, Y\end{equation*}
in which the \frac {1}{2}n(n+1)\times |\tilde { \mathcal {G}}| matrix Q is the orthogonal basis matrix of { \mathbb {S}}_{\tilde { \mathcal {G}}}^{n} in { \mathbb {S}}^{n} . Because of this, we see that the eigenvalues of the Hessian \nabla ^{2}g(y) are bounded as \begin{align*} \lambda _{\min }(\mathbf {A}^{T} \mathbf {A})\lambda _{\min }^{2}(X^{-1})\le&\lambda _{i}(\nabla ^{2}g(y)) \\\le&\lambda _{\max }(\mathbf {A}^{T} \mathbf {A})\lambda _{\max }^{2}(X^{-1}),\tag{48}\end{align*}
and therefore its condition number is bounded in terms of the eigenvalues of X :\begin{equation*} \mathrm {cond}\,(\nabla ^{2}g(y))\le \frac {\lambda _{\max }(\mathbf {A}^{T} \mathbf {A})}{\lambda _{\min }(\mathbf {A}^{T} \mathbf {A})} \left ({\frac {\lambda _{\max }(X)}{\lambda _{\min }(X)}}\right)^{2}. \tag{49}\end{equation*}
Consequently, most of our effort will be expended in bounding the eigenvalues of X .
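
The Kronecker-product identity underlying the Hessian expression above, namely (X^{-1}\otimes X^{-1})\mathrm {vec}(Y)=\mathrm {vec}(X^{-1}YX^{-1}) , is what allows Hessian-vector products to be formed without ever materializing an n^{2}\times n^{2} matrix. A tiny MATLAB check of this identity (ignoring the basis matrix Q , which restricts to the pattern) is:

    n = 5;
    A = randn(n);  X = A*A' + n*eye(n);        % a random positive definite X
    Y = randn(n);  Y = (Y + Y')/2;             % a symmetric direction
    Xi = inv(X);
    lhs = kron(Xi, Xi) * Y(:);                 % explicit Kronecker form (only viable for tiny n)
    rhs = Xi * Y * Xi;                         % matrix form used in practice
    disp(norm(lhs - rhs(:)))                   % numerically zero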

To simplify notation, we will write y_{0},~\hat {y} , and y as the initial point, the solution, and any k -th iterate. From this, we define S_{0},~\hat {S},~S , with each satisfying S=C-A(y) , and X_{0},~\hat {X} , and X , the points in \mathcal {K} each satisfying P_{V}(X^{-1})=S .

Our goal is to bound the extremal eigenvalues of X . To do this, we make two observations. The first is that the sequence generated by Newton’s method is monotonically decreasing, as in \begin{equation*} g(\hat {y})\le g(y)\le g(y_{0}).\end{equation*}

Evaluating each f_{*}(S) as n+\log \det X yields our first key inequality \begin{equation*} \log \det \hat {X}\le \log \det X\le \log \det X_{0}. \tag{50}\end{equation*}
Our second observation is the fact that every Newton direction \Delta y points away from the boundary of \mathrm {dom}\, g .

Lemma 4 (Newton Direction Is Positive Definite):

Given y\in \mathrm {dom}\, g , define the Newton direction \Delta y=-\nabla ^{2}g(y)^{-1}\nabla g(y) . Then, \Delta S=-A(\Delta y)\in \mathcal {K} _{*} .

Proof:

The Newton direction is explicitly written as \begin{equation*} \Delta y=-[\mathbf {A}^{T}\nabla ^{2}f(X)^{-1} \mathbf {A}]^{-1} \mathbf {A}^{T} \mathrm {vec}\, X\end{equation*}

and \Delta S=-A(\Delta y) . For any W\in \mathcal {K} , we have by substituting \begin{align*} W\bullet \Delta S=&(\mathrm {vec}\, W)^{T}(\mathrm {vec}\,\Delta S)\\=&(\mathrm {vec}\, W)^{T}(\mathbf {A}[\mathbf {A}^{T}\nabla ^{2}f(X)^{-1} \mathbf {A}]^{-1} \mathbf {A}^{T}) \mathrm {vec}\, X\\\ge&c_{1}(\mathrm {vec}\, W)^{T}(\mathrm {vec}\, X)=c_{1}(W\bullet X)\\\ge&c_{2} \mathrm {tr}\, W>0\end{align*}
where c_{1}=\sigma _{\min }(\mathbf {A}[\mathbf {A}^{T}\nabla ^{2}f(X)^{-1} \mathbf {A}]^{-1} \mathbf {A}^{T})\ge \mathrm {cond}^{-1}(\mathbf {A}^{T} \mathbf {A})\lambda _{\max }^{-2}(X)>0 and c_{2}=c_{1}\lambda _{\min }(X)>0 . Since there can be no separating hyperplane W\bullet \Delta S< 0 , we have \Delta S\in \mathcal {K} _{*} by Farkas’ lemma.

Finally, we introduce the following function, which often appears in the study of interior-point methods \begin{equation*} \phi (M)= \mathrm {tr}\, M-\log \det M-n\ge 0,\end{equation*}

and is well-known to provide a control on the arithmetic and geometric means of the eigenvalues of M . Indeed, the function attains its unique minimum at \phi (I)=0 , and it is nonnegative precisely because of the arithmetic-geometric mean inequality. Let us show that it can also bound the gap between the arithmetic and geometric means of the extremal eigenvalues of M .

Lemma 5:

Denote the n eigenvalues of M as \lambda _{1}\ge \cdots \ge \lambda _{n} . Then \begin{equation*} \phi (M)\ge \lambda _{1}+\lambda _{n}-2\sqrt {\lambda _{1}\lambda _{n}}=(\sqrt {\lambda _{1}}-\sqrt {\lambda _{n}})^{2}.\end{equation*}


Proof:

Noting that x-\log x-1\ge 0 for all x\ge 0 , we have \begin{align*} \phi (M)=&\sum _{i=1}^{n}(\lambda _{i}-\log \lambda _{i}-1)\\\ge&(\lambda _{1}-\log \lambda _{1}-1)+(\lambda _{n}-\log \lambda _{n}-1)\\=&\lambda _{1}+\lambda _{n}-2\log \sqrt {\lambda _{1}\lambda _{n}}-2\\=&\lambda _{1}+\lambda _{n}-2\sqrt {\lambda _{1}\lambda _{n}}+2(\sqrt {\lambda _{1}\lambda _{n}}-\log \sqrt {\lambda _{1}\lambda _{n}}-1)\\\ge&\lambda _{1}+\lambda _{n}-2\sqrt {\lambda _{1}\lambda _{n}}.\end{align*}

Completing the square yields \phi (M)\ge (\sqrt {\lambda _{1}}-\sqrt {\lambda _{n}})^{2} .
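
As a quick sanity check of Lemma 5, take M=\mathrm {diag}(4,1) : then \phi (M)=(4-\log 4-1)+(1-\log 1-1)=3-\log 4\approx 1.61 , while (\sqrt {\lambda _{1}}-\sqrt {\lambda _{n}})^{2}=(2-1)^{2}=1 , consistent with the bound.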

The following upper bounds are specific to our problem, and are the key to our final claim.

Lemma 6:

Define the initial suboptimality \phi _{\max }=\log \det X_{0}-\log \det \hat {X} . Then we have\begin{equation*} \phi (\hat {X}X^{-1})\le \phi _{\max },\quad \phi (XX_{0}^{-1})\le \phi _{\max }.\end{equation*}


Proof:

To prove the first inequality, we take first-order optimality at the optimal point \hat {y} \begin{equation*} \nabla g(\hat {y})=A^{T}(\hat {X})=0.\end{equation*}

Noting that \hat {X}\in { \mathbb {S}}_{V}^{n} , we further have \begin{align*} X^{-1}\bullet \hat {X}=P_{V}(X^{-1})\bullet \hat {X}=&[C-A(y)]\bullet \hat {X}=C\bullet \hat {X}\\&-\,y^{T}\underbrace {A^{T}(\hat {X})}_{=0}\\=&[P_{V}(\hat {X}^{-1})+A(\hat {y})]\bullet \hat {X}\\=&P_{V}(\hat {X}^{-1})\bullet \hat {X}=n\end{align*}
and hence \phi (X^{-1}\hat {X}) has the value of the suboptimality at X , which is bounded by the initial suboptimality in (50):\begin{align*} \phi (X^{-1}\hat {X})=&\underbrace {X^{-1}\bullet \hat {X}}_{n}-\log \det X^{-1}\hat {X}-n\\=&\log \det X-\log \det \hat {X}\\\le&\log \det X_{0}-\log \det \hat {X}=\phi _{\max }.\end{align*}
We begin with the same steps to prove the second inequality:\begin{align*} X_{0}^{-1}\bullet X=&P_{V}(X_{0}^{-1})\bullet X =[C-A(y_{0})]\bullet X\\=&[P_{V}(X^{-1})+A(\hat {y})]\bullet X-A(y_{0})\bullet X\\=&n+A(y-y_{0})\bullet X.\end{align*}
Now, observe that we have arrived at y by taking Newton steps y=y_{0}+\sum _{j=1}^{k}\alpha _{j}\Delta y_{j} , and that each Newton direction points away from the boundary of the feasible set, as in -A(\Delta y)\in \mathcal {K} _{*} (see Lemma 4). Since X\in \mathcal {K} , we must always have \begin{align*} X\bullet A(y-y_{0})=&\sum _{j=1}^{k}\alpha _{j}\,X\bullet A(\Delta y_{j})\\=&-\sum _{j=1}^{k}\alpha _{j}\,X\bullet \Delta S_{j}\le 0.\end{align*}
Substituting this into \phi (X_{0}^{-1}X) and applying (50) yields the second bound:\begin{align*} \phi (X_{0}^{-1}X)=&X_{0}^{-1}\bullet X-\log \det X_{0}^{-1}X-n\\=&(n+A(y-y_{0})\bullet X-\log \det X_{0}^{-1}X)-n\\=&\underbrace {\log \det X_{0}X^{-1}}_{\le \phi _{\max }}+\underbrace {A(y-y_{0})\bullet X}_{\le 0}\le \phi _{\max }.\end{align*}

Using the two upper-bounds to bound the eigenvalues of their arguments is enough to derive a condition number bound on X , which immediately translates into a condition number bound on \nabla ^{2}g(y) .

Proof of Theorem 2:

Here, we will prove \begin{equation*} \frac {\lambda _{\max }(X)}{\lambda _{\min }(X)}\le 2+\frac {\phi _{\max }^{2}\lambda _{\max }(X_{0})}{\lambda _{\min }(\hat {X})},\end{equation*}

which yields the desired condition number bound on \nabla ^{2}g(y) by substituting into (49). Writing \lambda _{1}=\lambda _{\max }(X) and \lambda _{n}=\lambda _{\min }(X) , we have from the two lemmas above:\begin{align*} \phi _{\max }\ge&\lambda _{\min }(\hat {X})(\sqrt {\lambda _{n}^{-1}}-\sqrt {\lambda _{1}^{-1}})^{2}>0,\\ \phi _{\max }\ge&\lambda _{\min }(X_{0}^{-1})(\sqrt {\lambda _{1}}-\sqrt {\lambda _{n}})^{2}>0.\end{align*}
Multiplying the two bounds and substituting \lambda _{\min }(X_{0}^{-1})=1/\lambda _{\max }(X_{0}) yields \begin{equation*} \frac {\phi _{\max }^{2}\lambda _{\max }(X_{0})}{\lambda _{\min }(\hat {X})}\ge \left ({\sqrt {\frac {\lambda _{1}}{\lambda _{n}}}-\sqrt {\frac {\lambda _{n}}{\lambda _{1}}}}\right)^{2}=\frac {\lambda _{1}}{\lambda _{n}}+\frac {\lambda _{n}}{\lambda _{1}}-2.\end{equation*}
Finally, bounding \lambda _{n}/\lambda _{1}\ge 0 yields the desired bound.
