Summary

Sparse high dimensional graphical model selection is a topic of much interest in modern day statistics. A popular approach is to apply l  1-penalties to either parametric likelihoods, or regularized regression/pseudolikelihoods, with the latter having the distinct advantage that they do not explicitly assume Gaussianity. As none of the popular methods proposed for solving pseudolikelihood-based objective functions have provable convergence guarantees, it is not clear whether corresponding estimators exist or are even computable, or if they actually yield correct partial correlation graphs. We propose a new pseudolikelihood-based graphical model selection method that aims to overcome some of the shortcomings of current methods, but at the same time retain all their respective strengths. In particular, we introduce a novel framework that leads to a convex formulation of the partial covariance regression graph problem, resulting in an objective function comprised of quadratic forms. The objective is then optimized via a co-ordinatewise approach. The specific functional form of the objective function facilitates rigorous convergence analysis leading to convergence guarantees; an important property that cannot be established by using standard results, when the dimension is larger than the sample size, as is often the case in high dimensional applications. These convergence guarantees ensure that estimators are well defined under very general conditions and are always computable. In addition, the approach yields estimators that have good large sample properties and also respect symmetry. Furthermore, application to simulated and real data, timing comparisons and numerical convergence is demonstrated. We also present a novel unifying framework that places all graphical pseudolikelihood methods as special cases of a more general formulation, leading to important insights.

1. Introduction

One of the hallmarks of modern day statistics is the advent of high dimensional data sets arising particularly from applications in the biological sciences, environmental sciences and finance. A central quantity of interest in such applications is the covariance matrix Σ of high dimensional random vectors. It is well known that the sample covariance matrix S can be a poor estimator of Σ, especially when p/n is large, where n is the sample size and p is the number of variables in the data set. Hence S is not a useful estimator for Σ for high dimensional data sets, where often either p ≫ n (‘large p, small n’) or p is comparable with n and both are large (‘large p, large n’). The basic problem here is that the number of parameters in Σ is of the order p². Hence, in the settings just mentioned, the sample size is often not sufficiently large to obtain a good estimator.

For many real life applications, the quantity of interest is the inverse covariance or partial covariance matrix Ω = Σ^{-1}. In such situations, it is often reasonable to assume that there are only a few significant partial correlations and that the other partial correlations are negligible in comparison. In mathematical terms, this amounts to making the assumption that the inverse covariance matrix Ω = Σ^{-1} = ((ω_ij))_{1⩽i,j⩽p} is sparse, i.e. many entries in Ω are zero. Note that ω_ij = 0 is equivalent to saying that the partial correlation between the ith and jth variables is zero (under Gaussianity, this reduces to the statement that the ith and jth variables are conditionally independent given the other variables). The zeros in Ω can be conveniently represented by partial correlation graphs. The assumption of a sparse graph is often deemed very reasonable in applications. For example, as Peng et al. (2009) pointed out, among 26 examples of published networks compiled by Newman (2003), 24 networks had edge density less than 4%.

Various methods have been proposed for identifying sparse partial correlation graphs in the penalized-likelihood and penalized-regression-based framework (Meinshausen and Bühlmann, 2006; Friedman et al., 2008, 2010; Peng et al., 2009). The main focus here is estimation of the sparsity pattern. Many of these methods do not necessarily yield positive definite estimates of Ω. However, once a sparsity pattern has been established, a positive definite estimate can be easily obtained by using efficient methods (see Hastie et al. (2009) and Speed and Kiiveri (1986)).

The penalized likelihood approach induces sparsity by minimizing the (negative) log-likelihood function with an l1-penalty on the elements of Ω. In the Gaussian set-up, this approach was pursued by Banerjee et al. (2008) and others. Friedman et al. (2008) proposed the graphical lasso algorithm ‘Glasso’ for the above minimization problem, which is substantially faster than earlier methods. In recent years, many interesting and useful methods have been proposed for speeding up the performance of the graphical lasso algorithm (see Mazumder and Hastie (2012) for instance). It is worth noting that, for these methods to provide substantial improvements over the graphical lasso, certain assumptions are required on the number and size of the connected components of the graph implied by the zeros in Ω̂ (the minimizer).

Another useful approach that was introduced by Meinshausen and Bühlmann (2006) estimates the zeros in Ω by fitting separate lasso regressions for each variable given the other variables. These individual lasso fits give neighbourhoods that link each variable to others. Peng et al. (2009) improved this neighbourhood selection method (the NS algorithm) by taking the natural symmetry in the problem into account (i.e. Ωij = Ωji), as not doing so could result in less efficiency and contradictory neighbourhoods.

In particular, the sparse partial correlation estimation (SPACE) method was proposed by Peng et al. (2009) as an effective alternative to existing methods for sparse estimation of Ω. The SPACE procedure iterates between

  • updating partial correlations by a joint lasso regression and

  • separately updating the partial variances.

As indicated above, it also accounts for the symmetry in Ω and is computationally efficient. Peng et al. (2009) showed that, under suitable regularity conditions, SPACE yields consistent estimators in high dimensional settings. All the above properties make SPACE an attractive regression-based approach for estimating sparse partial correlation graphs. In the examples that were presented in Peng et al. (2009), the authors found that empirically the SPACE algorithm seems to converge very fast. It is, however, not clear whether SPACE will converge in general. Convergence is of course critical so that the corresponding estimator is always guaranteed to exist and is therefore meaningful, both computationally and statistically. In fact, as we illustrate in Section 2, the SPACE algorithm might fail to converge in simple cases, for both the standard choices of weights that were suggested in Peng et al. (2009). Motivated by SPACE, Friedman et al. (2010) presented a co-ordinatewise descent approach (the ‘symmetric lasso’), which may be considered as a symmetrized version of the approach in Meinshausen and Bühlmann (2006). As we show in Section 2.3, it is also not clear whether the symmetric lasso will converge.

In this paper, we present a new method called the convex correlation selection method and algorithm (CONCORD) for sparse estimation of Ω. The algorithm obtains estimates of Ω by minimizing an objective function, which is jointly convex, but more importantly comprised of quadratic forms in the entries of Ω. The subsequent minimization is performed via co-ordinatewise descent. The convexity is strict if n ⩾ p, in which case standard results guarantee the convergence of the co-ordinatewise descent algorithm to the unique global minimum. If n < p, the objective function may not be strictly convex. As a result, a unique global minimum may not exist, and existing theory does not guarantee convergence of the sequence of iterates of the co-ordinatewise descent algorithm to a global minimum. In Section 4, by exploiting the quadratic forms that are present in the objective, it is rigorously demonstrated that the sequence of iterates does indeed converge to a global minimum of the objective function regardless of the dimension of the problem. Furthermore, it is shown in Section 6 that the CONCORD algorithm estimators are asymptotically consistent in high dimensional settings under regularity assumptions that are identical to those of Peng et al. (2009). Hence, our method preserves all the attractive properties of SPACE, while also providing a theoretical guarantee of convergence to a global minimum. In the process the CONCORD algorithm yields an estimator Ω̂ that is well defined and is always computable. The strengths of the method are further illustrated in the simulations and real data analysis that are presented in Section 5. A comparison of the relevant properties of various algorithms proposed in the literature is provided in Table 1 (NS by Meinshausen and Bühlmann (2006), SPACE by Peng et al. (2009), the symmetric lasso algorithm SYMLASSO by Friedman et al. (2010), the pseudolikelihood inverse covariance estimation algorithm SPLICE by Rocha et al. (2008) and CONCORD). Table 1 shows that the CONCORD algorithm preserves all the attractive properties of existing algorithms, while also providing rigorous convergence guarantees. Another major contribution of the paper is the development of a unifying framework that renders the various pseudolikelihood-based graphical model selection procedures as special cases. This general formulation facilitates a direct comparison between the above pseudolikelihood-based methods and gives deep insights into their respective strengths and weaknesses.

Table 1.

Comparison of regression-based graphical model selection methods

Property                                 NS    SPACE   SYMLASSO   SPLICE   CONCORD
Symmetry                                       +       +          +        +
Convergence guarantee (fixed n)          NA                                +
Asymptotic consistency (n, p → ∞)        +     +                           +

A ‘+’ sign indicates that a specified method has the given property. A blank space indicates the absence of a property. ‘NA’ stands for ‘not applicable’.


The remainder of the paper is organized as follows. Section 2 briefly describes the SPACE algorithm and presents examples where it fails to converge. This section motivates our work and also analyses other regression-based or pseudolikelihood methods that have been proposed. Section 3 introduces the convex correlation selection method and presents a general framework that unifies recently proposed pseudolikelihood methods. Section 4 establishes convergence of CONCORD to a global minimum, even if n < p. Section 5 illustrates the performance of the CONCORD algorithm on simulated and real data. Comparisons with SPACE and Glasso are provided. When applied to gene expression data, the results given by CONCORD are validated in a significant way by a recent extensive breast cancer study. Section 6 establishes large sample properties of the convex correlation selection approach. Concluding remarks are given in Section 7. The on-line supplemental document contains proofs of some of the results in the paper.

2. The SPACE algorithm and convergence properties

Let the random vectors Y_k = (y_{1k}, y_{2k}, …, y_{pk}), k = 1, 2, …, n, denote independent and identically distributed (IID) observations from a multivariate distribution with mean vector 0 and covariance matrix Σ. Let Ω = Σ^{-1} = ((ω_ij))_{1⩽i,j⩽p} denote the inverse covariance matrix, and let ρ = (ρ_ij)_{1⩽i<j⩽p}, where ρ_ij = −ω_ij/√(ω_ii ω_jj) denotes the partial correlation between the ith and jth variables for 1 ⩽ i ≠ j ⩽ p. Note that ρ_ij = ρ_ji for i ≠ j. Denote the sample covariance matrix by S, and the sample corresponding to the ith variable by Y_i = (y_{i1}, y_{i2}, …, y_{in}).
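To make the notation concrete, the following minimal Python sketch converts a precision matrix into the corresponding matrix of partial correlations; the function name and the toy matrix are ours and are purely illustrative.

```python
import numpy as np

def partial_correlations(omega):
    """Map a precision matrix Omega to the partial correlation matrix with
    entries rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)."""
    d = np.sqrt(np.diag(omega))           # square roots of the diagonal entries
    rho = -omega / np.outer(d, d)         # elementwise rescaling
    np.fill_diagonal(rho, 1.0)            # convention: rho_ii = 1
    return rho

# toy 3 x 3 precision matrix (hypothetical values)
omega = np.array([[ 2.0, -0.5,  0.0],
                  [-0.5,  2.0, -0.5],
                  [ 0.0, -0.5,  2.0]])
print(partial_correlations(omega))
```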

2.1. The SPACE algorithm

Peng et al. (2009) proposed a novel iterative algorithm called SPACE to estimate the partial correlations {ρ  ij}1⩽i < jp and the partial covariances {ω  ii}1⩽ip corresponding to Ω. This algorithm is summarized in the on-line supplemental section A.

2.2. Convergence properties of SPACE

From empirical studies, Peng et al. (2009) found that the SPACE algorithm converges quickly. As mentioned in Section 1, it is not immediately clear whether convergence can be established theoretically. In an effort to understand such properties, we now place the SPACE algorithm in a useful optimization framework. (See the on-line supplemental section A for a proof.)

Lemma 1
For the choice of weights w_i = ω_ii, the SPACE algorithm corresponds to an iterative partial minimization procedure for the following objective function:
\[
Q_{\mathrm{spc}}(\Omega) = \frac{1}{2}\sum_{i=1}^{p}\Bigl\{-n\log(\omega_{ii}) + \omega_{ii}\Bigl\|Y_i - \sum_{j\neq i}\rho_{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}\,Y_j\Bigr\|^2\Bigr\} + \lambda\sum_{1\leqslant i<j\leqslant p}|\rho_{ij}|
= \frac{1}{2}\sum_{i=1}^{p}\Bigl\{-n\log(\omega_{ii}) + \frac{1}{\omega_{ii}}\Bigl\|\omega_{ii}Y_i + \sum_{j\neq i}\omega_{ij}Y_j\Bigr\|^2\Bigr\} + \lambda\sum_{1\leqslant i<j\leqslant p}|\rho_{ij}|. \tag{1}
\]

Although lemma 1 identifies SPACE as an iterative partial minimization algorithm, the existing theory for iterative partial minimization (see for example Zangwill (1969), Jensen et al. (1991) and Lauritzen (1996)) only guarantees that every accumulation point of the sequence of iterates is a stationary point of the objective function Q_spc. To establish convergence, one needs to prove that every contour of the function Q_spc contains only finitely many stationary points. It is not clear whether this latter condition holds for the function Q_spc. Moreover, for the choice of weights w_i = 1, the SPACE algorithm does not appear to have an iterative partial minimization interpretation.

To improve our understanding of the convergence properties of SPACE, we started by testing the algorithm on simple examples. On some examples SPACE converges very quickly; however, examples can be found where SPACE does not converge for either of the two standard choices of weights: partial variance weights (w_i = ω_ii) and uniform weights (w_i = 1). We now give an example of this lack of convergence.

2.2.1 Example 1

Consider the following population covariance and inverse covariance matrices:
\[
\Omega = \begin{pmatrix} 3.0 & 2.1 & 0.0 \\ 2.1 & 3.0 & 2.1 \\ 0.0 & 2.1 & 3.0 \end{pmatrix}, \qquad
\Sigma = \Omega^{-1} = \begin{pmatrix} 8.500 & 11.667 & 8.167 \\ 11.667 & 16.667 & 11.667 \\ 8.167 & 11.667 & 8.500 \end{pmatrix}. \tag{2}
\]
A sample of n = 100 IID vectors was generated from the corresponding N(0, Σ) distribution. The data were standardized and the SPACE algorithm was run with the choice of weights w_i = ω_ii and λ = 160. After the first few iterations, successive SPACE iterates alternate between the following two matrices:
\[
\begin{pmatrix} 29.009570 & 27.266460 & 0.000000 \\ 27.266460 & 51.863320 & 24.680140 \\ 0.000000 & 24.680140 & 26.359350 \end{pmatrix}, \qquad
\begin{pmatrix} 28.340040 & 27.221520 & 0.705390 \\ 27.221520 & 54.255190 & 24.569900 \\ 0.705390 & 24.569900 & 25.753040 \end{pmatrix}, \tag{3}
\]
thereby establishing non-convergence of the SPACE algorithm in this example (see also Fig. 1(a)). Note that the two matrices in expression (3) have different sparsity patterns. A similar example of non-convergence of SPACE with uniform weights is provided in the on-line supplemental section Q.
Fig. 1.

Illustrations of the non-convergence of SPACE and the convergence of CONCORD (the y-axes are on the log-scale; for SPACE, the log-absolute difference between entries of successive estimates becomes constant, indicating non-convergence): (a) SPACE algorithm (partial variance weights) applied to the data set in example 1; (b) CONCORD algorithm applied to the data set in example 1.

A natural question to ask is whether the non-convergence of SPACE is pathological or whether it is widespread in settings of interest. For this, the following simulation study was undertaken.

2.2.2 Example 2

We created a sparse 100 × 100 matrix Ω with edge density 4% and a condition number of 100. A total of 100 multivariate Gaussian data sets (with n = 100) having mean vector zero and covariance matrix Σ=Ω−1 were generated. Table 2 summarizes the number of times (out of 100) that algorithms SPACE1 (SPACE with uniform weights) and SPACE2 (SPACE with partial variance weights) do not converge within 1500 iterations. When they do converge, the mean numbers of iterations are 22.3 for SPACE1 and 14.1 for SPACE2 (since the original implementation of SPACE by Peng et al. (2009) was programmed to stop after three iterations, we modified the implementation to allow for more iterations to check for convergence of parameter estimates). It is clear from Table 2 that both variations of SPACE, using unit weights as well as ω  ii-weights, exhibit extensive non-convergence behaviour. Our simulations suggest that the convergence problem is exacerbated as the condition number of Ω increases.
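For readers who wish to set up a similar experiment, the sketch below builds a sparse symmetric positive definite matrix with a prescribed condition number by shifting the diagonal; this is one simple construction under our own assumptions and is not necessarily the mechanism used to generate the matrix in this study.

```python
import numpy as np

def sparse_precision(p=100, density=0.04, cond=100.0, seed=0):
    """Return a sparse symmetric positive definite p x p matrix whose
    condition number is (approximately) `cond`, via a diagonal shift."""
    rng = np.random.default_rng(seed)
    A = np.zeros((p, p))
    iu = np.triu_indices(p, k=1)
    mask = rng.random(len(iu[0])) < density             # ~4% of upper-triangular entries
    A[iu[0][mask], iu[1][mask]] = rng.uniform(-1.0, 1.0, mask.sum())
    A = A + A.T                                         # symmetrize
    eig = np.linalg.eigvalsh(A)
    lo, hi = eig[0], eig[-1]
    tau = (hi - cond * lo) / (cond - 1.0)               # makes (hi + tau)/(lo + tau) = cond
    return A + tau * np.eye(p)

Omega = sparse_precision()
print(np.linalg.cond(Omega))                            # approximately 100
```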

Table 2.

Number of simulations (out of 100) that do not converge within 1500 iterations, NC, for select values of penalty parameter (λ* = λ/n)

SPACE1 (w_i = 1)                    SPACE2 (w_i = ω_ii)
λ*       NZ (%)   NC                λ*       NZ (%)   NC
0.026    60.9     92                0.085    79.8     100
0.099    19.7     100               0.160    28.3
0.163    7.6      100               0.220    10.7
0.228    2.9      100               0.280    4.8
0.614    0.4                        0.730    0.5      97

The average percentages of non-zeros, NZ, in Ω are also shown.


2.3 Symmetric lasso

The symmetric lasso algorithm was proposed as a useful alternative to SPACE in recent work by Friedman et al. (2010). The symmetric lasso minimizes the (negative) pseudolikelihood
\[
Q_{\mathrm{sym}}(\alpha, \breve{\Omega}) = \frac{1}{2}\sum_{i=1}^{p}\Bigl\{n\log(\alpha_{ii}) + \frac{1}{\alpha_{ii}}\Bigl\|Y_i + \sum_{j\neq i}\omega_{ij}\alpha_{ii}Y_j\Bigr\|^2\Bigr\} + \lambda\sum_{1\leqslant i<j\leqslant p}|\omega_{ij}|, \tag{4}
\]
where α_ii = 1/ω_ii. Here α denotes the vector with entries α_ii for i = 1,…,p, and Ω̆ denotes the matrix Ω with diagonal entries set to 0. A comparison of equations (1) and (4) shows a deep connection between the SPACE (with w_i = ω_ii) and symmetric lasso objective functions. In particular, the objective function Q_sym(α, Ω̆) in equation (4) is a reparameterization of equation (1): the only difference is that the l1-penalty on the elements of ρ is replaced by a penalty on the elements of Ω in equation (4). The minimization of the objective function in equation (4) is performed by co-ordinatewise descent on (α, Ω̆). The symmetric lasso is indeed a useful and computationally efficient procedure. However, theoretical properties such as convergence or asymptotic consistency have not yet been established. The following lemma investigates the properties of the objective function that is used in the symmetric lasso.

Lemma 2

The symmetric lasso objective in equation (4) is a non-convex function of (α,Ω˘).

The proof of lemma 2 is given in the on-line supplemental section B. The arguments in the proof of lemma 2 demonstrate that the objective function that is used in the symmetric lasso is not convex, or even biconvex in the parameterization that is used above. However, it can be shown that the SYMLASSO algorithm objective function is jointly convex in the elements of Ω (see Lee and Hastie (2014) and the on-line supplemental section O). It is straightforward to check that the co-ordinatewise descent algorithms for both parameterizations are exactly the same. However, unless a function is strictly convex, there are no general theoretical guarantees of convergence for the corresponding co-ordinatewise descent algorithm. Indeed, when n < p, the SYMLASSO objective function is not strictly convex. Therefore, it is not clear whether the co-ordinate descent algorithm converges in general. We conclude this section by remarking that both SPACE and the symmetric lasso are useful additions to the graphical model selection literature, especially because they both respect symmetry and give computationally fast procedures.

2.4 The SPLICE algorithm

The sparse pseudolikelihood inverse covariance estimates algorithm SPLICE was proposed by Rocha et al. (2008) as an alternative means to estimate Ω. In particular, the SPLICE formulation uses an l1-penalized regression-based pseudolikelihood objective function parameterized by matrices D and B, where Ω = D^{-2}(I − B). The diagonal matrix D has elements d_jj = 1/√ω_jj, j = 1,…,p. The (asymmetric) matrix B has as columns the vectors of regression coefficients β_j ∈ R^p. These coefficients arise when regressing Y_j on the remaining variables. A constraint is imposed on each β_j so that the regression of Y_j does not include Y_j itself as a predictor variable, i.e. β_jj = 0. On the basis of the above properties, the l1-penalized pseudolikelihood objective function of the SPLICE algorithm (without the constant term) is given by
\[
Q_{\mathrm{spl}}(B, D) = \frac{n}{2}\sum_{i=1}^{p}\log(d_{ii}^2) + \frac{1}{2}\sum_{i=1}^{p}\frac{1}{d_{ii}^2}\Bigl\|Y_i - \sum_{j\neq i}\beta_{ij}Y_j\Bigr\|^2 + \lambda\sum_{i<j}|\beta_{ij}|. \tag{5}
\]
To optimize equation (5) with respect to B and D, Rocha et al. (2008) also proposed an iterative algorithm that alternates between optimizing over B with D held fixed and optimizing over D with B held fixed. As with other regression-based graphical model selection algorithms, a proof of convergence of SPLICE is not available. The following lemma gives the convexity properties of the SPLICE objective function.

Lemma 3
  • The SPLICE objective function Q_spl(B, D) is not jointly convex in (B, D).

  • Under the transformation C = D^{-1}, Q_spl(B, C) is biconvex.

The proof of lemma 3 is given in on-line supplemental section C. The convergence properties of the SPLICE algorithm are not immediately clear since its objective function is non-convex. Furthermore, it is not clear whether the SPLICE solution yields a global optimum.

3. CONCORD: a convex pseudolikelihood framework for sparse partial covariance estimation

The two pseudolikelihood-based approaches, SPACE and the symmetric lasso, have several attractive properties such as computational efficiency, simplicity and use of symmetry. They also do not directly depend on the more restrictive Gaussian assumption. Additionally, Peng et al. (2009) also established (under suitable regularity assumptions) consistency of SPACE estimators for distributions with sub-Gaussian tails. However, none of the existing pseudolikelihood-based approaches yield a method that is provably convergent. In Section 2.2, we showed that there are instances where SPACE does not converge. As explained earlier, convergence is critical as this property guarantees well-defined estimators which always exist, and are computable regardless of the data at hand. An important research objective therefore is the development of a pseudolikelihood framework which preserves all the attractive properties of the SPACE and SYMLASSO algorithms and, at the same time, leads to theoretical guarantees of convergence. It is not clear immediately, however, how to achieve this goal. A natural approach to take is to develop a convex formulation of the problem. Such an approach can yield many advantages, including

  • (a) a guarantee of existence of a global minimum,

  • (b) a better chance of convergence by using convex optimization algorithms and

  • (c) a deeper theoretical analysis of the properties of the solution and corresponding algorithm.

As we have shown, the SPACE objective function is not jointly convex in the elements of Ω (or any natural reparameterization). Hence, one is not in a position to leverage tools from convex optimization theory for understanding its behaviour. The SYMLASSO objective function is jointly convex in the elements of Ω. However, unless a function is strictly convex, there are no general guarantees of convergence for the corresponding co-ordinatewise descent algorithm. Indeed, when n < p, the SYMLASSO objective function is not strictly convex, and it is not clear whether the corresponding co-ordinatewise descent algorithm converges.

In this section, we introduce a new approach for estimating Ω, called the convex correlation selection method and algorithm CONCORD, that aims to achieve the above objectives. The CONCORD algorithm constructs sparse estimators of Ω by minimizing an objective function that is jointly convex in the entries of Ω. We start by introducing the objective function for the convex correlation selection method and then proceed to derive the details of the corresponding co-ordinatewise descent updates. Convergence is not obvious, as the function may not be strictly convex if n < p. It is proved in Section 4 that the corresponding co-ordinatewise descent algorithm does indeed converge to a global minimum. Computational complexity and running time comparisons for CONCORD are given in the on-line supplemental section E and in Section 5.1 respectively. Subsequently, large sample properties of the resulting estimator are established in Section 6 to provide asymptotic guarantees in the regime where both the dimension p and the sample size n tend to ∞. Thereafter, the performance of CONCORD on simulated data and on real data from biomedical and financial applications is demonstrated. This analysis serves to establish that CONCORD preserves all the attractive properties of existing pseudolikelihood methods and additionally provides the crucial theoretical guarantee of convergence and the existence of a well-defined solution.

3.1. The CONCORD objective function

To develop a convex formulation of the pseudolikelihood graphical model selection problem, let us first revisit the formulation of the SPACE objective function (1) with arbitrary weights w_i instead of ω_ii:
\[
Q_{\mathrm{spc}}(\Omega) = \frac{1}{2}\sum_{i=1}^{p}\Bigl\{-n\log(\omega_{ii}) + w_i\Bigl\|Y_i - \sum_{j\neq i}\rho_{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}\,Y_j\Bigr\|_2^2\Bigr\} + \lambda\sum_{1\leqslant i<j\leqslant p}|\rho_{ij}|. \tag{6}
\]
Now note that this objective function is not jointly convex in the elements of Ω, since
  • (a) the middle (regression) term, for the choices w_i = 1 or w_i = ω_ii, is not a jointly convex function of the elements of Ω, and

  • (b) the penalty term is on the partial correlations ρ_ij = −ω_ij/√(ω_ii ω_jj) and is hence not a jointly convex function of the elements of Ω.

Now note the following relationship for the regression term:
\[
\begin{aligned}
w_i\Bigl\|Y_i - \sum_{j\neq i}\rho_{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}\,Y_j\Bigr\|_2^2
&= w_i\Bigl\|Y_i + \sum_{j\neq i}\frac{\omega_{ij}}{\omega_{ii}}\,Y_j\Bigr\|_2^2
\qquad\Bigl(\text{since } \rho_{ij} = -\,\omega_{ij}/\sqrt{\omega_{ii}\,\omega_{jj}}\Bigr)\\
&= \frac{w_i}{\omega_{ii}^2}\Bigl\|\omega_{ii}Y_i + \sum_{j\neq i}\omega_{ij}Y_j\Bigr\|_2^2
= \frac{w_i}{\omega_{ii}^2}\Bigl\|\sum_{j=1}^{p}\omega_{ij}Y_j\Bigr\|_2^2
= \frac{w_i}{\omega_{ii}^2}\,\omega_{\cdot i}^{\mathrm T}\,Y^{\mathrm T}Y\,\omega_{\cdot i},
\end{aligned}
\]
where Y denotes the n × p data matrix and ω_{·i} the ith column of Ω. The choice of weights w_i = ω_ii² yields
\[
w_i\Bigl\|Y_i - \sum_{j\neq i}\rho_{ij}\sqrt{\frac{\omega_{jj}}{\omega_{ii}}}\,Y_j\Bigr\|_2^2
= \omega_{\cdot i}^{\mathrm T}\,Y^{\mathrm T}Y\,\omega_{\cdot i} \geqslant 0. \tag{7}
\]
Expression (7) is a quadratic form (and hence jointly convex) in the elements of Ω. Putting the l1-penalty term on the partial covariances ω_ij instead of on the partial correlations ρ_ij yields the following jointly convex objective function:
\[
Q_{\mathrm{con}}(\Omega) = L_{\mathrm{con}}(\Omega) + \lambda\sum_{1\leqslant i<j\leqslant p}|\omega_{ij}|
= -n\sum_{i=1}^{p}\log(\omega_{ii}) + \frac{1}{2}\sum_{i=1}^{p}\Bigl\|\omega_{ii}Y_i + \sum_{j\neq i}\omega_{ij}Y_j\Bigr\|_2^2 + \lambda\sum_{1\leqslant i<j\leqslant p}|\omega_{ij}|. \tag{8}
\]

The function L_con(Ω) can be regarded as a pseudolikelihood function in the spirit of Besag (1975). Since −log(x) and |x| are convex functions, and Σ_{i=1}^p ‖ω_ii Y_i + Σ_{j≠i} ω_ij Y_j‖² is a positive semidefinite quadratic form in the entries of Ω, it follows that Q_con(Ω) is a jointly convex function of Ω (but not necessarily strictly convex). As we shall see later, this particular formulation helps us to establish theoretical guarantees of convergence (see Section 4) and, consequently, yields a regression-based graphical model estimator that is well defined and always computable. Note that the n/2 in equation (6) has been replaced by n in expression (8). This point is elaborated further in remark 4. We now proceed to derive the details of the co-ordinatewise descent algorithm for minimizing Q_con(Ω).
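As a minimal illustration, the following Python sketch evaluates expression (8) from a centred n × p data matrix; the function is ours (the reference implementation is the gconcord R package mentioned in Section 5.1).

```python
import numpy as np

def concord_objective(omega, Y, lam):
    """Evaluate Q_con(Omega) of expression (8) for a symmetric matrix `omega`
    with positive diagonal and a centred n x p data matrix `Y`."""
    n, p = Y.shape
    log_term = -n * np.sum(np.log(np.diag(omega)))
    # the ith column of Y @ omega equals omega_ii Y_i + sum_{j != i} omega_ij Y_j
    quad_term = 0.5 * np.sum((Y @ omega) ** 2)
    penalty = lam * np.sum(np.abs(omega[np.triu_indices(p, k=1)]))
    return log_term + quad_term + penalty
```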

3.2 A co-ordinatewise minimization algorithm for minimizing Q_con(Ω)

Let A_p denote the set of p × p real symmetric matrices. Let the parameter space M be defined as
\[
M := \{\Omega \in A_p : \omega_{ii} > 0 \ \text{for every} \ 1 \leqslant i \leqslant p\}.
\]
As in other regression-based approaches (see Peng et al. (2009)), we have deliberately not restricted Ω to be positive definite as the main goal is to estimate the sparsity pattern in Ω. As mentioned in Section 1, a positive definite estimator can be obtained by using standard methods (Hastie et al., 2009; Xu et al., 2011) once a partial correlation graph has been determined.
Let us now proceed to optimize Q_con(Ω). For 1 ⩽ i ⩽ j ⩽ p, define the function T_ij : M → M by
\[
T_{ij}(\Omega) = \underset{\{\tilde{\Omega}\,:\,(\tilde{\Omega})_{kl} = \omega_{kl}\ \text{for all}\ (k,l)\neq(i,j)\}}{\arg\min}\; Q_{\mathrm{con}}(\tilde{\Omega}). \tag{9}
\]

For each (i, j), T_ij(Ω) gives the matrix in which all the elements of Ω are left as they are except the (i, j)th element; the (i, j)th element is replaced by the value that minimizes Q_con(Ω) with respect to ω_ij, holding all other entries ω_kl, (k, l) ≠ (i, j), constant. We now proceed to evaluate T_ij(Ω) explicitly.

Lemma 4
The function T_ij(Ω) defined in equation (9) can be computed in closed form. In particular,
\[
\{T_{ii}(\Omega)\}_{ii} = \frac{-\sum_{j\neq i}\omega_{ij}s_{ij} + \sqrt{\bigl(\sum_{j\neq i}\omega_{ij}s_{ij}\bigr)^{2} + 4s_{ii}}}{2s_{ii}}, \qquad 1\leqslant i\leqslant p, \tag{10}
\]
and
\[
\{T_{ij}(\Omega)\}_{ij} = \frac{S_{\lambda/n}\Bigl(-\bigl(\sum_{j'\neq j}\omega_{ij'}s_{j'j} + \sum_{i'\neq i}\omega_{i'j}s_{i'i}\bigr)\Bigr)}{s_{ii} + s_{jj}}, \qquad 1\leqslant i<j\leqslant p, \tag{11}
\]
where s_ij is the (i, j)th entry of S = (1/n)Y^T Y and S_λ(x) := sgn(x)(|x| − λ)_+ denotes the soft-thresholding operator.

The proof is given in the on-line supplemental section F. An important contribution of lemma 4 is that it gives the necessary ingredients for designing a co-ordinate descent approach to minimizing the CONCORD objective function. More specifically, equation (10) can be used to update the partial variance terms, and equation (11) can be used to update the partial covariance terms. The co-ordinatewise descent algorithm for CONCORD is summarized in algorithm 2 in the on-line supplemental section D. The computational complexity of the CONCORD algorithm is min{O(np²), O(p³)}. Hence CONCORD is competitive with the SPACE and the symmetric lasso algorithms (see the on-line supplemental section E for details). The zeros in the estimated partial covariance matrix can then subsequently be used to construct a partial covariance or partial correlation graph.
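A minimal Python sketch of the resulting cyclic co-ordinatewise descent, using the closed form updates (10) and (11), is given below. It is written for illustration only and makes our own simplifying choices (centred data, a diagonal starting value, a crude stopping rule); the reference implementation is the gconcord R package.

```python
import numpy as np

def soft_threshold(x, t):
    """S_t(x) = sgn(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def concord_coordinate_descent(Y, lam, max_iter=100, tol=1e-6):
    """Cyclic co-ordinatewise minimization of Q_con via equations (10)-(11).
    Y is an n x p centred data matrix; lam is the penalty parameter lambda."""
    n, p = Y.shape
    S = (Y.T @ Y) / n                        # sample covariance matrix
    omega = np.diag(1.0 / np.diag(S))        # simple diagonal starting value
    for _ in range(max_iter):
        omega_old = omega.copy()
        for i in range(p):                   # diagonal updates, equation (10)
            off = omega[i, :] @ S[:, i] - omega[i, i] * S[i, i]
            omega[i, i] = (-off + np.sqrt(off ** 2 + 4.0 * S[i, i])) / (2.0 * S[i, i])
        for i in range(p):                   # off-diagonal updates, equation (11)
            for j in range(i + 1, p):
                a = omega[i, :] @ S[:, j] - omega[i, j] * S[j, j]   # sum over j' != j
                b = omega[:, j] @ S[:, i] - omega[i, j] * S[i, i]   # sum over i' != i
                val = soft_threshold(-(a + b), lam / n) / (S[i, i] + S[j, j])
                omega[i, j] = omega[j, i] = val
        if np.max(np.abs(omega - omega_old)) < tol:
            break
    return omega
```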

The following procedure can be used to select the penalty parameter λ. Define the residual sum of squares for i = 1,…,p as
\[
\mathrm{RSS}_i(\lambda) = \sum_{k=1}^{n}\Bigl(y_{ik} + \sum_{j\neq i}\frac{\omega_{ij}}{\omega_{ii}}\,y_{jk}\Bigr)^{2},
\]
where the ω_ij are the CONCORD estimates obtained with penalty parameter λ. Further, the ith component of a Bayes information criterion type of score can be defined as
\[
\mathrm{BIC}_i(\lambda) = n\log\{\mathrm{RSS}_i(\lambda)\} + \log(n)\,\bigl|\{j : j\neq i,\ \omega_{ij,\lambda}\neq 0\}\bigr|.
\]
The penalty parameter λ can then be chosen to minimize the sum BIC(λ) = Σ_{i=1}^p BIC_i(λ).
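A sketch of this criterion, reusing the co-ordinatewise descent sketch above, is shown below; the function names are ours.

```python
import numpy as np

def concord_bic(Y, omega):
    """BIC-type score described above for a fitted CONCORD estimate `omega`."""
    n, p = Y.shape
    bic = 0.0
    for i in range(p):
        beta = -omega[i, :] / omega[i, i]       # implied regression coefficients
        beta[i] = 0.0
        resid = Y[:, i] - Y @ beta              # y_ik + sum_{j != i} (omega_ij / omega_ii) y_jk
        df = np.count_nonzero(omega[i, :]) - 1  # non-zero off-diagonal partners of variable i
        bic += n * np.log(np.sum(resid ** 2)) + np.log(n) * df
    return bic

# choose lambda on a grid by minimizing the summed BIC (illustrative usage)
# lambdas = np.linspace(1.0, 50.0, 20)
# scores = [concord_bic(Y, concord_coordinate_descent(Y, lam)) for lam in lambdas]
# lam_best = lambdas[int(np.argmin(scores))]
```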

3.3. A unifying framework for pseudolikelihood-based graphical model selection

In this section, we provide a unifying framework which formally connects the five pseudolikelihood formulations that are considered in this paper, namely the algorithms SPACE1, SPACE2, SYMLASSO, SPLICE and CONCORD (counting the two choices of weights in the SPACE algorithm as two different formulations). Recall that the random vectors Y_k = (y_{1k}, y_{2k}, …, y_{pk}), k = 1, 2, …, n, denote IID observations from a multivariate distribution with mean vector 0 and covariance matrix Σ, the precision matrix is given by Ω = Σ^{-1} = ((ω_ij))_{1⩽i,j⩽p}, and S denotes the sample covariance matrix. Let Ω_D denote the diagonal matrix with ith diagonal entry given by ω_ii. Lemma 5 below formally identifies the relationship between all five of the regression-based pseudolikelihood methods.

Lemma 5
  • The (negative) pseudolikelihood functions of CONCORD, SPACE1, SPACE2, SYMLASSO and SPLICE formulations can be expressed in matrix form as shown in Table 3 (up to reparameterization).

  • All five pseudolikelihoods above correspond to a unified or generalized form of the Gaussian log-likelihood function
    L_uni{G(Ω), H(Ω)} = (n/2)( −log[det{G(Ω)}] + tr{S H(Ω)} ),
    where G(Ω) and H(Ω) are functions of Ω. The functions G and H which characterize the pseudolikelihood formulations corresponding to CONCORD, SPACE1, SPACE2, SYMLASSO and SPLICE are as follows:
    G_con(Ω) = Ω_D², H_con(Ω) = Ω²;
    G_spc,1(Ω) = Ω_D, H_spc,1(Ω) = Ω Ω_D^{-2} Ω;
    G_spc,2(Ω) = G_sym(Ω) = G_spl(Ω) = Ω_D, H_spc,2(Ω) = H_sym(Ω) = H_spl(Ω) = Ω Ω_D^{-1} Ω.

Table 3.

Pseudo-log-likelihood for graphical models in both regression and matrix forms

L_con(Ω):        regression form  (1/2) Σ_{i=1}^p { −n log(ω_ii²) + ‖ω_ii Y_i + Σ_{j≠i} ω_ij Y_j‖₂² };
                 matrix form      (n/2) { −log|Ω_D²| + tr(SΩ²) }                                             (12)
L_spc,1(Ω_D, ρ): regression form  (1/2) Σ_{i=1}^p { −n log(ω_ii) + ‖Y_i − Σ_{j≠i} ρ_ij √(ω_jj/ω_ii) Y_j‖₂² };
                 matrix form      (n/2) { −log|Ω_D| + tr(S Ω Ω_D^{−2} Ω) }                                   (13)
L_spc,2(Ω_D, ρ): regression form  (1/2) Σ_{i=1}^p { −n log(ω_ii) + ω_ii ‖Y_i − Σ_{j≠i} ρ_ij √(ω_jj/ω_ii) Y_j‖₂² };
                 matrix form      (n/2) { −log|Ω_D| + tr(S Ω Ω_D^{−1} Ω) }                                   (14)
L_sym(α, Ω̆):    regression form  (1/2) Σ_{i=1}^p { n log(α_ii) + (1/α_ii) ‖Y_i + Σ_{j≠i} ω_ij α_ii Y_j‖² };
                 matrix form      (n/2) { −log|Ω_D| + tr(S Ω Ω_D^{−1} Ω) }                                   (15)
L_spl(B, D):     regression form  (1/2) Σ_{i=1}^p { n log(d_ii²) + (1/d_ii²) ‖Y_i − Σ_{j≠i} β_ij Y_j‖₂² };
                 matrix form      (n/2) { −log|Ω_D| + tr(S Ω Ω_D^{−1} Ω) }                                   (16)

The proof of lemma 5 is given in the on-line supplemental section I. Lemma 5 gives various useful insights into the different pseudolikelihoods that have been proposed for the inverse covariance estimation problem. The following remarks discuss these insights.

Remark 1

When G(Ω) = H(Ω) = Ω, L{G(Ω),H(Ω)} corresponds to the standard (negative) Gaussian log-likelihood function.

Remark 2

Ω_D^{-1}Ω is a rescaling of Ω that makes all the diagonal elements equal to 1 (hence the sparsity patterns of Ω and Ω_D^{-1}Ω are the same). In this sense, the SPACE2, SYMLASSO and SPLICE algorithms make the same approximation to the Gaussian likelihood, with the log-determinant term log|Ω| replaced by log|Ω_D|. The trace term tr(SΩ) is approximated by tr(SΩΩ_D^{-1}Ω). Moreover, if Ω is sparse, then Ω_D^{-1}Ω is close to the identity matrix, i.e. Ω_D^{-1}Ω = I + C for some C. In this case, the term tr(SΩ) in the Gaussian likelihood is perturbed by an off-diagonal matrix C, resulting in an expression of the form tr{SΩ(I + C)}.

Remark 3

Conceptually, the sole source of difference between the three regularized versions of the objective functions of the SPACE2, SYMLASSO and SPLICE algorithms is in the way in which the l  1-penalties are specified. SPACE2 applies the penalty to the partial correlations, SYMLASSO to the partial covariances and SPLICE to the symmetrized regression coefficients.

Remark 4
The convex correlation selection method approximates the normal likelihood by replacing the log|Ω| term with log|Ω_D²|, and tr(SΩ) with tr(SΩ²). Hence, the CONCORD algorithm can be considered as a reparameterization of the Gaussian likelihood with concentration matrix Ω² (together with an approximation to the log-determinant term). More specifically,
\[
L_{\mathrm{con}}(\Omega) = L_{\mathrm{uni}}(\Omega_D^2, \Omega^2) = \frac{n}{2}\bigl[-\log\{\det(\Omega_D^2)\} + \mathrm{tr}(S\Omega^2)\bigr] = -n\log\{\det(\Omega_D)\} + \frac{n}{2}\,\mathrm{tr}(S\Omega^2),
\]
which justifies the appearance of ‘n’ as compared with ‘n/2’ in the CONCORD objective in expression (8). In the on-line supplemental section J, we illustrate the usefulness of this correction based on the insight from our unification framework, and we show that it leads to better estimates of Ω.
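The identity in remark 4 is easy to verify numerically; the short sketch below checks, on simulated data of our own, that the regression form of L_con in Table 3 coincides with the matrix form (n/2){−log|Ω_D²| + tr(SΩ²)}.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
Y = rng.standard_normal((n, p))
S = (Y.T @ Y) / n

# an arbitrary symmetric Omega with positive diagonal entries
Omega = rng.standard_normal((p, p))
Omega = (Omega + Omega.T) / 2.0
np.fill_diagonal(Omega, np.abs(np.diag(Omega)) + 1.0)

# regression form: (1/2) sum_i { -n log(omega_ii^2) + || Y omega_{.i} ||^2 }
reg = 0.5 * np.sum(-n * np.log(np.diag(Omega) ** 2)
                   + np.sum((Y @ Omega) ** 2, axis=0))

# matrix form: (n/2) { -log det(Omega_D^2) + tr(S Omega^2) }
mat = (n / 2.0) * (-np.sum(np.log(np.diag(Omega) ** 2))
                   + np.trace(S @ Omega @ Omega))

print(np.isclose(reg, mat))   # True
```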

4. Convergence of CONCORD

We now proceed to consider the convergence properties of the CONCORD algorithm. Note that Q_con(Ω) is not differentiable. Also, if n < p, then Q_con(Ω) is not necessarily strictly convex. Hence, the global minimum may not be unique and, as discussed below, the convergence of the co-ordinatewise minimization algorithm to a global minimum does not follow from existing theory. Although Q_con(Ω) is not differentiable, it can be expressed as the sum of a smooth function of Ω and a separable function of Ω (namely λ Σ_{1⩽i<j⩽p} |ω_ij|). Tseng (1988, 2001) proved that, under certain conditions, every cluster point of the sequence of iterates of the co-ordinatewise minimization algorithm for such an objective function is a stationary point of the objective function. However, if the function is not strictly convex, there is no general guarantee that the sequence of iterates has a unique cluster point, i.e. there is no theoretical guarantee that the sequence of iterates converges. The following theorem shows that the cyclic co-ordinatewise minimization algorithm applied to the CONCORD objective function converges to a global minimum. A proof of this result can be found in the on-line supplemental section K.

Theorem 1

If S_ii > 0 for every 1 ⩽ i ⩽ p, the sequence of iterates {Ω̂^(r)}_{r⩾0} obtained by algorithm 2 converges to a global minimum of Q_con(Ω). More specifically, Ω̂^(r) → Ω̂ ∈ M as r → ∞ for some Ω̂, and furthermore Q_con(Ω̂) ⩽ Q_con(Ω) for all Ω ∈ M.

Remark 5

If n⩾2, and none of the underlying p marginal distributions (corresponding to the p-variate distribution for the data vectors) is degenerate, it follows that the diagonal entries of the data covariance matrix S are strictly positive with probability 1.

With theory in hand, we now proceed to illustrate numerically the convergence properties that have been established above. When CONCORD is applied to the data set in example 1, convergence is achieved (see Fig. 1(b) in Section 2.2), whereas SPACE does not converge (see Fig. 1(a)).

5. Applications

5.1. Simulated data

5.1.1 Timing comparison

We now proceed to compare the timing performance of CONCORD with the graphical lasso algorithm Glasso and the two versions of SPACE. The names SPACE1 and SPACE2 denote SPACE estimates using uniform weights and partial variance weights respectively. We first consider the setting p = 1000 and n = 200. For this simulation study, a p × p positive definite matrix Ω (with p = 1000) with condition number 10 was used. Thereafter, 50 independent data sets were generated, each consisting of n = 200 IID samples from an N_p(0, Σ = Ω^{-1}) distribution. For each data set, the four algorithms were run until convergence for a range of penalty parameter values. We note that the default number of iterations for SPACE in the R function by Peng et al. (2009) is 3. However, given the convergence issues for SPACE, we ran SPACE until convergence or until 50 iterations (whichever number of iterations was smaller). The timing results (averaged over the data sets) in the top part of Table 4 show wall clock times until convergence (in seconds) for Glasso, CONCORD, SPACE1 and SPACE2.

We can see that, in the p = 1000, n = 200, setting, CONCORD is uniformly faster than its competitors. Note that the low penalty parameter cases correspond to high dimensional settings where the estimated covariance matrix is typically poorly conditioned and the log-likelihood surface is very flat. The results in Table 4 indicate that in such settings CONCORD is faster than its competitors by orders of magnitude (even though Glasso is implemented in Fortran). Both SPACE1 and SPACE2 are much slower than CONCORD and Glasso in this setting. The wall clock time for an iterative algorithm can be thought of as a function of the number of iterations until convergence, the order of computations for a single iteration and also the implementational details (such as the choice of software and efficiency of the code). Note that the order of computations for a single iteration is the same for SPACE and CONCORD, and lower than that of Glasso when n < p. It is likely that the significant increase in the wall clock time for SPACE is due to implementational details and the larger number of iterations that are required for convergence (or non-convergence, since we are stopping SPACE if the algorithm does not satisfy the convergence criterion by 50 iterations).

Table 4.

Timing comparison for p = 1000, 3000 and varying n

p = 1000, n = 200

Glasso                    CONCORD                   SPACE1 (w_i = 1)          SPACE2 (w_i = ω_ii)
λ     NZ (%)  Time (s)    λ*    NZ (%)  Time (s)    λ*    NZ (%)  Time (s)    λ*    NZ (%)  Time (s)
0.14  4.77    87.60       0.12  4.23    6.12        0.10  4.49    101.78      0.16  100.00  19206.55
0.19  0.87    71.47       0.17  0.98    5.10        0.17  0.64    99.20       0.21  1.76    222.00
0.28  0.17    5.41        0.28  0.15    5.37        0.28  0.14    138.01      0.30  0.17    94.59
0.39  0.08    5.30        0.39  0.07    4.00        0.39  0.07    75.55       0.40  0.08    108.61
0.51  0.04    6.38        0.51  0.04    4.76        0.51  0.04    49.59       0.51  0.04    132.34

p = 3000, n = 600                                   p = 3000, n = 900

Glasso                    CONCORD                   Glasso                    CONCORD
λ     NZ (%)  Time (s)    λ*    NZ (%)  Time (s)    λ     NZ (%)  Time (s)    λ*    NZ (%)  Time (s)
0.09  2.71    1842.74     0.09  2.10    266.69      0.09  0.70    1389.96     0.09  0.64    298.21
0.10  1.97    1835.32     0.10  1.59    235.49      0.10  0.44    1395.42     0.10  0.41    298.00
0.10  1.43    1419.41     0.10  1.19    232.67      0.10  0.27    1334.78     0.10  0.26    302.15

SPACE is run until convergence or 50 iterations (whichever number of iterations is smaller). Note that SPACE1 and SPACE2 are much slower than CONCORD and Glasso in wall clock time, for the p = 1000 simulations. Hence, for p = 3000, only Glasso and CONCORD are compared. Here, λ denotes the value of the penalty parameter for the respective algorithms, with λ  *=λ/n for CONCORD and SPACE. NZ is the percentage of non-zero entries in the corresponding estimator.


We further compare the timing performance of CONCORD and Glasso for p = 3000 with n = 600 and n = 900. (SPACE is not considered here because of the timing issues that were mentioned above; these issues are amplified in this more demanding setting.) A p × p positive definite matrix Ω (with p = 3000) with 3% sparsity is used. Thereafter, 50 independent data sets were generated, each consisting of n = 600 IID samples from an N_p(0, Σ = Ω^{-1}) distribution. The same exercise was repeated with n = 900. The timing results (averaged over the data sets) in the bottom part of Table 4 show wall clock times until convergence (in seconds) for Glasso and CONCORD for various penalty parameter values. It can be seen that, in both the n = 600 and the n = 900 cases, CONCORD was roughly four to eight times faster than Glasso.

In conclusion, the simulation results in this subsection illustrate that CONCORD is much faster than SPACE and Glasso, especially in very high dimensional settings. We also note that an implementation of the CONCORD algorithm has been developed in R and is freely available from http://cran.r-project.org/web/packages/gconcord.

5.1.2 Model selection comparison

In this section, we perform a simulation study in which we compare the model selection performance of CONCORD and Glasso when the underlying data are drawn from a multivariate t-distribution (the reasons for not considering SPACE are provided in a remark at the end of this section). The data are drawn from a multivariate t-distribution to illustrate the potential benefit of using penalized regression methods (convex correlation selection) outside the Gaussian setting.

For this study, using a similar approach to that in Peng et al. (2009), a p × p sparse positive definite matrix Ω (with p = 1000) with condition number 13.6 is chosen. Using this Ω, for each sample size n = 200, n = 400 and n = 800, 50 data sets are generated, each consisting of n IID observations from a multivariate t-distribution with mean 0 and covariance matrix Σ = Ω^{-1}. We compare the model selection performance of Glasso and CONCORD in this heavy-tailed setting with receiver operating characteristic (ROC) curves, which compare false positive rates FPR and true positive rates TPR. Each ROC curve is traced out by varying the penalty parameter λ over 50 possible values.

We use the area under the curve, AUC, as a means to compare model selection performance. This measure is frequently used to compare ROC curves (Fawcett, 2006; Friedman et al., 2010). The AUC of a full ROC curve resulting from perfect recovery of zero–non-zero structure in Ω would be 1. In typical real applications, FPR is controlled to be sufficiently low. We therefore compare model selection performance when FPR is less than 15% (or 0.15). When controlling FPR to be less than 0.15, a perfect method will yield AUC=0.15. Table 5 provides the median of the AUCs (divided by 0.15 to normalize to 1), as well as the interquartile ranges IQR over the 50 data sets for n = 200, n = 400 and n = 800.
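For concreteness, the following sketch computes this truncated and normalized AUC from a set of (FPR, TPR) points traced out by varying λ; it is our own helper, written under the assumption that the supplied points cover the interval [0, fpr_max].

```python
import numpy as np

def partial_auc(fpr, tpr, fpr_max=0.15):
    """Area under the ROC curve restricted to FPR <= fpr_max, divided by
    fpr_max so that a perfect method scores 1."""
    fpr, tpr = np.asarray(fpr, float), np.asarray(tpr, float)
    order = np.argsort(fpr)
    fpr, tpr = fpr[order], tpr[order]
    keep = fpr <= fpr_max
    f = np.append(fpr[keep], fpr_max)                   # close the curve at fpr_max
    t = np.append(tpr[keep], np.interp(fpr_max, fpr, tpr))
    area = np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0)  # trapezoidal rule
    return area / fpr_max
```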

Table 5.

Median and IQR of area under the curve, AUC, for 50 simulations

          n = 200            n = 400            n = 800
Solver    Median    IQR      Median    IQR      Median    IQR
Glasso    0.745     0.032    0.819     0.030    0.885     0.029
CONCORD   0.811     0.011    0.887     0.012    0.933     0.013

Each simulation yields an ROC curve from which AUC is computed for FPR in the interval [0, 0.15] and normalized to 1.


Table 5 shows that CONCORD has a much better model selection performance compared with Glasso. Moreover, it turns out that CONCORD has a higher AUC than Glasso for every single one of the 150 data sets (50 each for n = 200,400,800). We note that CONCORD not only recovers the sparsity structure more accurately in general but also has much less variation.

Remark 6

We need to simulate 50 data sets for each of the three sample sizes n = 200, 400, 800. For each of these data sets, an algorithm must be run for 50 different penalty parameter values. In total, this amounts to running the algorithm 7500 times. As we demonstrated in the simulations in Section 5.1.1, when SPACE is run until convergence (or terminated after 50 iterations), its computational burden makes it infeasible to run 7500 times. As an alternative, one could follow the approach of Peng et al. (2009) and stop SPACE after three iterations. However, given the possible non-convergence issues that are associated with SPACE, it is not clear whether the resulting estimate is meaningful. Even so, if we follow this approach of stopping SPACE after three iterations, we find that CONCORD outperforms SPACE1 and SPACE2. For example, if we consider the n = 200 case, then the median AUC-value for SPACE1 is 0.779 (with IQR = 0.054) and the median AUC-value for SPACE2 is 0.802 (with IQR = 0.013).

5.2. Application to breast cancer data

We now illustrate the performance of the convex correlation selection method on a real data set. To facilitate comparison, we consider data from a breast cancer study (Chang et al., 2005) on which SPACE was illustrated. This data set contains expression levels of 24481 genes on 248 patients with breast cancer. The data set also contains extensive clinical data including survival times.

Following the approach in Peng et al. (2009), we focus on a smaller subset of genes. This reduction can be achieved by utilizing the clinical information that is provided together with the microarray expression data set. In particular, survival analysis via univariate Cox regression with patient survival times is used to select a subset of genes that are closely associated with breast cancer. A p-value threshold of 0.0003 yields a reduced data set with 1107 genes. This subset of the data is then mean centred and scaled so that the median absolute deviation is 1 (as outliers seem to be present). Following a similar approach to that in Peng et al. (2009), penalty parameters for each partial correlation graph estimation method were chosen so that each estimated partial correlation graph has 200 edges.
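The centring and scaling step can be written compactly as below; this is our own reading of the preprocessing described above (mean centring followed by scaling each gene so that its median absolute deviation is 1).

```python
import numpy as np

def centre_and_mad_scale(X):
    """Mean centre each column (gene) of X and scale it so that its median
    absolute deviation equals 1 (a robust alternative to unit variance)."""
    X = X - X.mean(axis=0)
    mad = np.median(np.abs(X - np.median(X, axis=0)), axis=0)
    return X / mad
```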

Partial correlation graphs can be used to identify genes that are biologically meaningful and can lead to gene therapeutic targets. In particular, there is compelling evidence from the biomedical literature that highly connected nodes are central to biological networks (Carter et al., 2004; Jeong et al., 2001; Han et al., 2004). For this, we focus on identifying the 10 most highly connected genes (‘hub’ genes) identified by each partial correlation graph estimation method. Table 6 in the on-line supplemental section L summarizes the top 10 hub genes obtained by CONCORD, SYMLASSO, SPACE1 and SPACE2. That table also gives references from the biomedical literature that place these genes in the context of breast cancer. These references illustrate that most of the genes identified are indeed quite relevant in the study of breast cancer. It can also be seen that there is a large level of overlap in the top 10 genes identified by the four methods. There are also, however, some notable differences. For example, TPX2 has been identified only by CONCORD. Bibby et al. (2009) suggested that mutation of Aurora A, a known general cancer-related gene, results in reduced cellular activity and mislocalization owing to loss of interaction with TPX2. Moreover, a recent extensive study by Maxwell et al. (2011) ( http://www.ncbi.nlm.nih.gov/pubmed/22110403) identifies a gene regulatory mechanism in which TPX2, Aurora A, RHAMM and BRCA1 play a key role. This finding is especially significant given that BRCA1 (breast cancer type 1 susceptibility protein) is one of the most well-known genes linked to breast cancer. We also remark that, if a higher number of hub genes is targeted (say the top 20 or top 100 rather than the top 10), CONCORD identifies additional genes that have not been discovered by existing methods. However, identification of even a single important gene can lead to significant findings and novel gene therapeutic targets, since many gene silencing experiments often focus on one or two genes at a time.

Table 6.

Realized Sharpe ratio of various investment strategies corresponding to different estimators with various N_est

N_est   Sample   Glasso   CONCORD   CondReg   Ledoit–Wolf   DJIA
35      0.357    0.489    0.487     0.486     0.470         0.185
40      0.440    0.491    0.490     0.473     0.439         0.185
45      0.265    0.468    0.473     0.453     0.388         0.185
50      0.234    0.481    0.482     0.458     0.407         0.185
75      0.379    0.403    0.475     0.453     0.368         0.185
150     0.286    0.353    0.480     0.476     0.384         0.185
225     0.367    0.361    0.502     0.494     0.416         0.185
300     0.362    0.359    0.505     0.488     0.409         0.185

The maximum annualized Sharpe ratios for each row, and others within 1% of this maximum, are highlighted in italics.


We conclude this section by remarking that convex correlation selection is a useful addition to the graphical models literature as it is competitive with other methods in terms of model selection accuracy, timing and relevance for applications, and also gives provable convergence guarantees.

5.3. Application to portfolio optimization

We now consider the efficacy of using CONCORD in a financial portfolio optimization setting, where a stable estimate of the covariance matrix is often required. We follow closely the exposition of the problem that was given in Won et al. (2013). A portfolio of financial instruments constitutes a collection of both risky and risk-free assets that are held by a legal entity. The return on the overall portfolio over a given holding period is defined as the weighted average of the returns on the individual assets, where the weight for each asset corresponds to its proportion in monetary terms. The primary objective of the portfolio optimization problem is to determine the weights that maximize the overall return on the portfolio subject to a certain level of risk (or vice versa). In Markowitz mean variance portfolio theory, this risk is taken to be the standard deviation of the portfolio (Markowitz, 1952). As noted in Luenberger (1997) and Merton (1980), the optimal portfolio weights or the optimal allocation depends critically on the mean and covariance matrix of the individual asset returns, and hence estimation of these quantities is central to mean variance portfolio theory. As one of the goals in this paper is to illustrate the efficacy of using CONCORD to obtain a stable covariance matrix estimate, we shall consider the minimum variance portfolio problem, rather than the more general mean variance portfolio optimization problem. The former requires estimating only the covariance matrix and thus presents an ideal setting for comparing covariance estimation methods in the portfolio optimization context (see Chan et al. (1999) for more details). In particular, we aim to compare the performance of CONCORD with other covariance estimation methods for constructing a minimum variance portfolio. The performance of each of the methods and the associated strategies will be compared over a sustained period of time to assess their respective merits.

5.3.1. Minimum variance portfolio rebalancing

The minimum variance portfolio selection problem is defined as follows. Given p risky assets, let r_it denote the return of asset i over period t, which in turn is defined as the change in its price over time period t divided by the price at the beginning of the period. As usual, let Σ_t denote the covariance matrix of the daily returns r_t^T = (r_{1t}, r_{2t}, …, r_{pt}). The portfolio weight vector w_k^T = (w_{1k}, w_{2k}, …, w_{pk}) gives the weight of each asset i = 1,…,p in the portfolio for the kth investment period. A long position or a short position for asset i during period k is indicated by the sign of w_ik, i.e. w_ik > 0 for a long position and w_ik < 0 for a short position. The budget constraint can be written as 1^T w_k = 1, where 1 denotes the vector of all 1s. Note that the risk of a given portfolio, as measured by the standard deviation of its return, is simply (w_k^T Σ w_k)^{1/2}.

The minimum variance portfolio selection problem for investment period k can now be formally defined as
\[
\underset{w_k}{\text{minimize}}\quad w_k^{\mathrm T}\Sigma\,w_k \qquad \text{subject to}\quad \mathbf{1}^{\mathrm T}w_k = 1. \tag{17}
\]
As definition (17) is a simple quadratic programme, it has an analytic solution given by w_k* = (1^T Σ^{-1} 1)^{-1} Σ^{-1} 1. The solution depends on the theoretical covariance matrix Σ. In practice, the parameter Σ must be estimated.
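The closed form solution of problem (17) translates directly into code; the sketch below is ours, with Σ replaced in practice by any of the covariance estimates compared in this section.

```python
import numpy as np

def min_variance_weights(sigma):
    """Minimum variance portfolio weights w* = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    ones = np.ones(sigma.shape[0])
    x = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1 without forming the inverse
    return x / (ones @ x)
```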

The most basic approach to the portfolio selection problem often makes the unrealistic assumption that returns are stationary in time. A standard approach to dealing with the non-stationarity in such financial time series is to use a periodic rebalancing strategy. In particular, at the beginning of each investment period k = 1, 2, …, K, portfolio weights w_k = (w_{1k}, …, w_{pk}) are computed from the previous N_est days of observed returns (N_est is called the ‘estimation horizon’). These portfolio weights are then held constant for the duration of each investment period. The process is repeated at the start of the next investment period and is often referred to as ‘rebalancing’. More details of the rebalancing strategy are provided in the on-line supplemental section M.3.
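A stylized version of the rebalancing loop is sketched below; the argument names (returns matrix, period start indices, covariance estimator) are our own assumptions rather than the exact interface used in the study.

```python
import numpy as np

def backtest_minimum_variance(returns, n_est, period_starts, estimate_cov):
    """At the start of each investment period, estimate the covariance matrix
    from the previous n_est days of returns and hold the minimum variance
    weights for that period.  `returns` is a T x p array of daily returns,
    `period_starts` lists the row indices where each period begins, and
    `estimate_cov` is any covariance estimator (sample, CONCORD-based, ...)."""
    weights = []
    for start in period_starts:
        window = returns[start - n_est:start]    # previous n_est trading days
        sigma_hat = estimate_cov(window)
        ones = np.ones(sigma_hat.shape[0])
        x = np.linalg.solve(sigma_hat, ones)
        weights.append(x / (ones @ x))           # minimum variance weights, problem (17)
    return np.array(weights)

# example with the sample covariance estimator:
# w = backtest_minimum_variance(returns, n_est=75, period_starts=starts,
#                               estimate_cov=lambda R: np.cov(R, rowvar=False))
```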

5.3.2 Application to the Dow Jones industrial average

We now consider the problem of investing in the stocks that feature in the Dow Jones industrial average index. The Dow Jones industrial average is a composite blue chip index consisting of 30 stocks (note that the Kraft Foods data were removed in our analysis because of their limited data span). (Kraft Foods were a component stock of the Dow Jones industrial average from September 22nd, 2008, to September 13th, 2012. From September 14th, 2012, Kraft Foods were replaced with the United Health Group.) Table 7 in the on-line supplemental section M.1 lists the 29 component stocks that were used in our analysis.

Rebalancing time points were chosen to be every 4 weeks starting from February 18th, 1995, to October 26th, 2012 (approximately 17 years), and are shown in Table 8 in the on-line supplemental section M.2. Start and end dates of each period are selected to be calendar weeks and need not coincide with a trading day. The total number of investment periods is 231, and the number of trading days in each investment period varies between 15 and 20 days. We shall compare the following five methods for estimating the covariance matrix: sample covariance, the graphical lasso of Friedman et al. (2008), CONCORD, the condition-number-regularized estimator CondReg of Won et al. (2013) and the Ledoit–Wolf estimator of Ledoit and Wolf (2004). We consider various choices of N  est, in particular N  est ∈ {35,40,45,50,75,150,225,300}, in our analysis. Once a choice for N  est has been made, it is kept constant for all the 231 investment periods.

For $l_1$-penalized methods, such as Glasso and CONCORD, a value for the penalty parameter must be chosen. For the purposes of this study, cross-validation was performed within each estimation horizon so as to minimize the residual sum of squares from out-of-sample prediction, averaged over all stocks. Further details are given in the on-line supplemental section M.4. The condition-number-regularized and Ledoit–Wolf estimators each use different criteria to perform cross-validation; the reader is referred to Won et al. (2013) and Ledoit and Wolf (2004) for details on the cross-validation procedures for these methods. For comparison with Won et al. (2013), we use the following quantities to assess the performance of the five minimum variance rebalancing (MVR) strategies: the realized return, realized risk, realized Sharpe ratio SR, turnover, size of the short side and normalized wealth growth. Precise definitions of these quantities are given in the on-line supplemental section M.5.
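As a rough illustration of this type of penalty selection (the exact procedure that we used is described in the on-line supplemental section M.4), the sketch below scores each candidate penalty by the out-of-sample residual sum of squares obtained when every variable is predicted from the others via the regression coefficients implied by the estimated inverse covariance matrix; fit_omega stands for any routine returning such a sparse estimate (e.g. a CONCORD or Glasso fit) and is a placeholder name:

import numpy as np

def cv_penalty_rss(X, penalties, fit_omega, n_folds=5):
    # Choose the penalty whose estimated inverse covariance gives the smallest
    # out-of-sample residual sum of squares, averaged over folds, when each
    # variable is predicted from the others using beta_jk = -omega_jk / omega_jj.
    n, p = X.shape
    folds = np.array_split(np.random.permutation(n), n_folds)
    scores = []
    for lam in penalties:
        rss = 0.0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            Omega = fit_omega(X[train_idx], lam)      # sparse inverse covariance estimate (placeholder routine)
            B = -Omega / np.diag(Omega)[:, None]      # row j holds the regression coefficients for variable j
            np.fill_diagonal(B, 0.0)
            preds = X[test_idx] @ B.T                 # predict each variable from all the others
            rss += np.sum((X[test_idx] - preds) ** 2)
        scores.append(rss / n_folds)
    return penalties[int(np.argmin(scores))]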

Table 6 gives the realized Sharpe ratio of all MVR strategies for the various choices of estimation horizon N  est. The column DJIA stands for the passive index tracking strategy that tracks the Dow Jones industrial average index. It is clear from Table 6 that CONCORD performs uniformly well across different choices of estimation horizons.

Fig. 2 shows normalized wealth growth over the trading horizon for the choice N  est = 225. A normalized wealth growth curve for another choice N  est = 75 is provided in the on-line supplemental section M.5. These plots demonstrate that CONCORD is either very competitive or better than leading covariance estimation methods.

Fig. 2. Normalized wealth growth after adjusting for transaction costs (0.5% of the principal) and borrowing costs (interest rate of 7% annualized percentage rate) with $N_{\text{est}} = 225$, comparing CONCORD, CondReg, the graphical lasso, Ledoit–Wolf, the sample covariance matrix and the Dow Jones industrial average

We also note that the trading costs that are associated with CONCORD are the lowest for most choices of estimation horizon and are comparable with those of CondReg for $N_{\text{est}} \in \{35, 40\}$ (see Table 12 in the on-line supplemental section M.5). Moreover, CONCORD also has by far the smallest short side for most choices of estimation horizon. This property reduces the dependence on borrowed capital for shorting stocks and is also reflected in the higher normalized wealth growth.
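For readers who wish to reproduce summaries of this kind, the following sketch computes the performance measures under their standard textbook definitions (annualized from daily returns, with a zero risk-free rate); the exact conventions used in our tables are those of the on-line supplemental section M.5 and Won et al. (2013), so this should be read as an approximation rather than a definition:

import numpy as np

def realized_metrics(daily_returns, weights_by_period, trading_days=252):
    # daily_returns: realized daily portfolio returns over the whole trading horizon
    # weights_by_period: K x p array of portfolio weights, one row per investment period
    r = np.asarray(daily_returns)
    ann_return = trading_days * r.mean()                         # annualized realized return
    ann_risk = np.sqrt(trading_days) * r.std(ddof=1)             # annualized realized risk
    sharpe = ann_return / ann_risk                               # realized Sharpe ratio (zero risk-free rate)
    W = np.asarray(weights_by_period)
    turnover = np.abs(np.diff(W, axis=0)).sum(axis=1).mean()     # average absolute change in weights
    short_side = np.clip(-W, 0.0, None).sum(axis=1).mean()       # average total weight held short
    return {"return": ann_return, "risk": ann_risk, "sharpe": sharpe,
            "turnover": turnover, "short_side": short_side}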

6. Large sample properties

In this section, large sample properties of the CONCORD algorithm, namely estimation consistency and oracle properties, are investigated under suitable regularity conditions. We adapt the approach in Peng et al. (2009) with suitable modifications. The dimension $p = p_n$ is now allowed to vary with $n$, so that the treatment is relevant to high dimensional settings. Let $\{\bar{\Omega}_n\}_{n \ge 1}$ denote the sequence of true inverse covariance matrices. As in Peng et al. (2009), for consistency, we assume the existence of suitably accurate estimates of the diagonal entries, and consider the accuracy of the estimates of the off-diagonal entries obtained by running the CONCORD algorithm with the diagonal entries held fixed. In particular, the following assumption is made.

Assumption 1
(accurate diagonal estimates). There are estimates $\{\hat{\alpha}_{n,ii}\}_{1 \le i \le p_n}$ such that, for any $\eta > 0$, there is a constant $C > 0$ such that
$$\max_{1 \le i \le p_n} |\hat{\alpha}_{n,ii} - \bar{\omega}_{ii}| \le C \sqrt{\log(n)/n}$$
holds with probability larger than $1 - O(n^{-\eta})$.

The theory that follows is valid when the estimates $\{\hat{\alpha}_{n,ii}\}_{1 \le i \le p_n}$ and the estimates of the off-diagonal entries are obtained from the same data set. When $\limsup_n p_n/n < 1$, Peng et al. (2009) showed that the diagonal entries of $S^{-1}$ can be used as estimates of the diagonal entries of $\Omega$. However, no such general recipe was provided in Peng et al. (2009) for the case $p_n > n$. Nevertheless, establishing consistency in the above framework is useful, as it indicates that the estimators obtained are statistically well behaved when $n$ and $p$ both increase to $\infty$.

For vectors $\omega^o \in \mathbb{R}^{p_n(p_n-1)/2}$ and $\omega^d \in \mathbb{R}_+^{p_n}$, the notation $L_n(\omega^o, \omega^d)$ stands for $L_{\mathrm{con}}/n$ ($L_{\mathrm{con}}$ is defined in expression (8)) evaluated at a matrix with off-diagonal entries $\omega^o$ and diagonal entries $\omega^d$. Let $\bar{\omega}_n^o = ((\bar{\omega}_{n,ij}))_{1 \le i < j \le p_n}$ denote the vector of off-diagonal entries of $\bar{\Omega}_n$, and let $\hat{\alpha}_{p_n} \in \mathbb{R}_+^{p_n}$ denote the vector with entries $\{\hat{\alpha}_{n,ii}\}_{1 \le i \le p_n}$. Let $A_n$ denote the set of non-zero entries in the vector $\bar{\omega}_n^o$, and let $q_n = |A_n|$. Let $\bar{\Sigma}_n = \bar{\Omega}_n^{-1}$ denote the true covariance matrix for every $n \ge 1$. The following standard assumptions are required.

Assumption 2

(bounded eigenvalues). The eigenvalues of $\bar{\Omega}_n$ are bounded below by $\lambda_{\min} > 0$ and bounded above by $\lambda_{\max} < \infty$, uniformly for all $n$.

Assumption 3

(sub-Gaussianity). The random vectors $Y_1, \ldots, Y_n$ are IID sub-Gaussian for every $n \ge 1$, i.e. there is a constant $c > 0$ such that, for every $x \in \mathbb{R}^{p_n}$, $E[\exp(x^T Y_i)] \le \exp(c\, x^T \bar{\Sigma}_n x)$ and, for every $i$ and $j$, there exists $\eta_j > 0$ such that $E[\exp\{t (Y_{ji})^2\}] < K$ whenever $|t| < \eta_j$. Here $K$ is independent of $i$ and $j$.

Assumption 4
(incoherence condition). There exists $\delta < 1$ such that, for all $(i,j) \notin A_n$,
$$\left| \bar{L}_{ij,A_n}(\bar{\Omega}_n)\, \bar{L}_{A_n,A_n}(\bar{\Omega}_n)^{-1}\, \operatorname{sgn}(\bar{\omega}^o_{A_n}) \right| \le \delta,$$
where, for $1 \le i, j, t, s \le p_n$ satisfying $i < j$ and $t < s$,
$$\bar{L}_{ij,ts}(\bar{\Omega}_n) := E_{\bar{\Omega}_n}\!\left[\big(\nabla^2 L_n(\bar{\Omega}_n)\big)_{ij,ts}\right] = \bar{\Sigma}_{n,js}\, 1_{\{i=t\}} + \bar{\Sigma}_{n,it}\, 1_{\{j=s\}} + \bar{\Sigma}_{n,is}\, 1_{\{j=t\}} + \bar{\Sigma}_{n,jt}\, 1_{\{i=s\}}.$$

Conditions analogous to assumption 4 have been used in Zhao and Yu (2006), Peng et al. (2009) and Meinshausen and Bühlmann (2006) to establish high dimensional model selection consistency. In the context of lasso regression, Zhao and Yu (2006) showed that such a condition (which they referred to as an irrepresentable condition) is almost necessary and sufficient for model selection consistency, and they provided some examples where this condition is satisfied. We provide some examples of situations where assumption 4 is satisfied, along the lines of Zhao and Yu (2006), in the on-line supplemental section P.

Define $\bar{\theta}_n^o = ((\bar{\theta}_{n,ij}))_{1 \le i < j \le p_n} \in \mathbb{R}^{p_n(p_n-1)/2}$ by $\bar{\theta}_{n,ij} = \bar{\omega}_{n,ij}/(\hat{\alpha}_{n,ii}\hat{\alpha}_{n,jj})$ for $1 \le i < j \le p_n$. Let $s_n = \min_{(i,j) \in A_n} |\bar{\omega}_{n,ij}|$. The assumptions above can be used to establish the following theorem.

Theorem 2

Suppose that assumptions 1–4 are satisfied. Suppose that $p_n = O(n^{\kappa})$ for some $\kappa > 0$, $q_n = o(\sqrt{n/\log(n)})$, $\sqrt{q_n \log(n)/n} = o(\lambda_n)$, $\lambda_n \sqrt{n/\log(n)} \to \infty$, $s_n/(\sqrt{q_n}\,\lambda_n) \to \infty$ and $\sqrt{q_n}\,\lambda_n \to 0$, as $n \to \infty$. Then there is a constant $C$ such that, for any $\eta > 0$, the following events hold with probability at least $1 - O(n^{-\eta})$.

  • There is a minimizer $\hat{\omega}_n^o = ((\hat{\omega}_{n,ij}))_{1 \le i < j \le p_n}$ of $Q_{\mathrm{con}}(\omega^o, \hat{\alpha}_n)$.

  • Any minimizer $\hat{\omega}_n^o$ of $Q_{\mathrm{con}}(\omega^o, \hat{\alpha}_n)$ satisfies $\|\hat{\omega}_n^o - \bar{\omega}_n^o\|_2 \le C \sqrt{q_n}\,\lambda_n$ and $\operatorname{sgn}(\hat{\omega}_{n,ij}) = \operatorname{sgn}(\bar{\omega}_{n,ij})$ for all $1 \le i < j \le p_n$.

The proof of theorem 2 is provided in the on-line supplemental section N.

7. Conclusion

This paper proposes a novel regression-based graphical model selection method that aims to overcome some of the shortcomings of current methods, but at the same time to retain their strengths. We first placed the highly useful SPACE method in an optimization framework, which in turn allowed us to identify SPACE with a specific objective function. These and other insights led to the formulation of the CONCORD objective function. It was then shown that the CONCORD objective function is comprised of quadratic forms, is convex and can be regarded as a penalized pseudolikelihood. A co-ordinatewise descent algorithm that minimizes this objective, via closed form iterates, was proposed and subsequently analysed. The convergence of this co-ordinatewise descent algorithm was established rigorously, thus ensuring that CONCORD leads to well-defined symmetric partial correlation estimates that are always computable—a guarantee that is not available with popular regression-based methods. The large sample properties of CONCORD establish consistency of the method as both the sample size and dimension tend to ∞. The performance of CONCORD was also illustrated via simulations and was shown to be competitive in terms of graphical model selection accuracy and timing. CONCORD was then applied to a biomedical data set and to a finance data set, leading to novel findings. Last, but not least, a framework that unifies all pseudolikelihood methods was established, yielding important insights.

Given the attractive properties of CONCORD, a natural question that arises is whether one should move away from penalized likelihood estimation (such as the graphical lasso) and instead use only pseudolikelihood methods. We note that CONCORD compares favourably with the Glasso algorithm for several reasons: firstly, it does not assume Gaussianity and is hence more flexible; secondly, the computational complexity per iteration of CONCORD is lower than that of Glasso; thirdly, CONCORD is faster (in terms of wall clock time) than Glasso by an entire order of magnitude in higher dimensions; fourthly, CONCORD delivers better model selection performance. It is, however, important to note that, if there is a compelling reason to assume multivariate Gaussianity (which some applications may warrant), then using both Glasso and CONCORD can be useful for affirming multivariate associations of interest. In this sense, the two classes of method can be complementary in practice.

Acknowledgements

KK was supported in part by National Science Foundation grant DMS-1106084. SO and BR were supported in part by the National Science Foundation under grants DMS-0906392, DMS-CG 1025465, AGS-1003823, DMS-1106642, DMS CAREER-1352656 and grants NSA H98230-11-1-0194, DARPA-YFA N66001-111-4131 and SMC-DBNKY. Mark Hayes and SMC are gratefully acknowledged for useful discussions.

References

1. Banerjee, O., El Ghaoui, L. and D'Aspremont, A. (2008) Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9, 485–516.

2. Besag, J. (1975) Statistical analysis of non-lattice data. Statistician, 24, 179–195.

3. Bibby, R. A., Tang, C., Faisal, A., Drosopoulos, K., Lubbe, S., Houlston, R., Bayliss, R. and Linardopoulos, S. (2009) A cancer-associated aurora A mutant is mislocalized and misregulated due to loss of interaction with TPX2. J. Biol. Chem., 284, 33177–33184.

4. Carter, S. L., Brechbühler, C. M., Griffin, M. and Bond, A. T. (2004) Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics, 20, 2242–2250.

5. Chan, L. K., Karceski, J. and Lakonishok, J. (1999) On portfolio optimization: forecasting covariances and choosing the risk model. Working Paper 7039.

6. Chang, H. Y., Nuyten, D. S. A., Sneddon, J. B., Hastie, T., Tibshirani, R., Sørlie, T., Dai, H., He, Y. D., van't Veer, L. J., Bartelink, H., van de Rijn, M., Brown, P. O. and van de Vijver, M. J. (2005) Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc. Natn. Acad. Sci. USA, 102, 3738–3743.

7. Fawcett, T. (2006) An introduction to ROC analysis. Pattn Recogn Lett., 27, 861–874.

8. Friedman, J., Hastie, T. and Tibshirani, R. (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 432–441.

9. Friedman, J., Hastie, T. and Tibshirani, R. (2010) Applications of the lasso and grouped lasso to the estimation of sparse graphical models. Technical Report.

10. Han, J.-D. J., Bertin, N., Hao, T., Goldberg, D. S., Berriz, G. F., Zhang, L. V., Dupuy, D., Walhout, A. J. M., Cusick, M. E., Roth, F. P. and Vidal, M. (2004) Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 430, 88–93.

11. Hastie, T., Tibshirani, R. and Friedman, J. H. (2009) The Elements of Statistical Learning. New York: Springer.

12. Jensen, S. R. T., Johansen, S. R. and Lauritzen, S. L. (1991) Globally convergent algorithms for maximizing a likelihood function. Biometrika, 78, 867–877.

13. Jeong, H., Mason, S. P., Barabasi, A.-L. and Oltvai, Z. N. (2001) Lethality and centrality in protein networks. Nature, 411, 41–42.

14. Lauritzen, S. L. (1996) Graphical Models. New York: Oxford University Press.

15. Ledoit, O. and Wolf, M. (2004) A well-conditioned estimator for large-dimensional covariance matrices. J. Multiv. Anal., 88, 365–411.

16. Lee, J. D. and Hastie, T. J. (2014) Learning the structure of mixed graphical models. J. Computnl Graph. Statist., to be published.

17. Luenberger, D. G. (1997) Investment Science. New York: Oxford University Press.

18. Markowitz, H. (1952) Portfolio selection. J. Finan., 7, 77–91.

19. Maxwell, C. A., Benítez, J., Gómez-Baldó, L., Osorio, A., Bonifaci, N., Fernández-Ramires, R., Costes, S. V., Guinó, E., Chen, H., Evans, G. J. R., Mohan, P., Català, I., Petit, A., Aguilar, H., Villanueva, A., Aytes, A., Serra-Musach, J., Rennert, G., Lejbkowicz, F., Peterlongo, P., Manoukian, S., Peissel, B., Ripamonti, C. B., Bonanni, B., Viel, A., Allavena, A., Bernard, L., Radice, P., Friedman, E., Kaufman, B., Laitman, Y., Dubrovsky, M., Milgrom, R., Jakubowska, A., Cybulski, C., Gorski, B., Jaworska, K., Durda, K., Sukiennicki, G., Lubiński, J., Shugart, Y. Y., Domchek, S. M., Letrero, R., Weber, B. L., Hogervorst, F. B. L., Rookus, M. A., Collee, J. M., Devilee, P., Ligtenberg, M. J., van der Luijt, R. B., Aalfs, C. M., Waisfisz, Q., Wijnen, J., van Roozendaal, C. E. P., Easton, D. F., Peock, S., Cook, M., Oliver, C., Frost, D., Harrington, P., Evans, D. G., Lalloo, F., Eeles, R., Izatt, L., Chu, C., Eccles, D., Douglas, F., Brewer, C., Nevanlinna, H., Heikkinen, T., Couch, F. J., Lindor, N. M., Wang, X., Godwin, A. K., Caligo, M. A., Lombardi, G., Loman, N., Karlsson, P., Ehrencrona, H., von Wachenfeldt, A., Bjork Barkardottir, R., Hamann, U., Rashid, M. U., Lasa, A., Caldés, T., Andrés, R., Schmitt, M., Assmann, V., Stevens, K., Offit, K., Curado, J., Tilgner, H., Guigó, R., Aiza, G., Brunet, J., Castellsagué, J., Martrat, G., Urruticoechea, A., Blanco, I., Tihomirova, L., Goldgar, D. E., Buys, S., John, E. M., Miron, A., Southey, M., Daly, M. B., Schmutzler, R. K., Wappenschmidt, B., Meindl, A., Arnold, N., Deissler, H., Varon-Mateeva, R., Sutter, C., Niederacher, D., Imyanitov, E., Sinilnikova, O. M., Stoppa-Lyonnet, D., Mazoyer, S., Verny-Pierre, C., Castera, L., de Pauw, A., Bignon, Y.-J., Uhrhammer, N., Peyrat, J.-P., Vennin, P., Fert Ferrer, S., Collonge-Rame, M.-A., Mortemousque, I., Spurdle, A. B., Beesley, J., Chen, X., Healey, S., Barcellos-Hoff, M. H., Vidal, M., Gruber, S. B., Lázaro, C., Capellá, G., McGuffog, L., Nathanson, K. L., Antoniou, A. C., Chenevix-Trench, G., Fleisch, M. C., Moreno, V., Pujana, M. A., HEBON, EMBRACE, SWE-BRCA, BCFR, GEMO Study Collaborators, and kConFab (2011) Interplay between BRCA1 and RHAMM regulates epithelial apicobasal polarization and may influence risk of breast cancer. PLOS Biol., 9, no. 11, article e1001199.

20. Mazumder, R. and Hastie, T. (2012) Exact covariance thresholding into connected components for large-scale graphical lasso. J. Mach. Learn. Res., 13, 781–794.

21. Meinshausen, N. and Bühlmann, P. (2006) High-dimensional graphs and variable selection with the Lasso. Ann. Statist., 34, 1436–1462.

22. Merton, R. C. (1980) On estimating the expected return on the market: an exploratory investigation. Working Paper 444.

23. Newman, M. E. J. (2003) The structure and function of complex networks. SIAM Rev., 45, 167–256.

24. Peng, J., Wang, P., Zhou, N. and Zhu, J. (2009) Partial correlation estimation by joint sparse regression models. J. Am. Statist. Ass., 104, 735–746.

25. Rocha, G., Zhao, P. and Yu, B. (2008) A path following algorithm for Sparse Pseudo-Likelihood Inverse Covariance Estimation (SPLICE). Technical Report. Statistics Department, University of California at Berkeley, Berkeley.

26. Speed, T. P. and Kiiveri, H. T. (1986) Gaussian Markov distributions over finite graphs. Ann. Statist., 14, 138–150.

27. Tseng, P. (1988) Coordinate ascent for maximizing nondifferentiable concave functions. Technical Report. Massachusetts Institute of Technology, Cambridge.

28. Tseng, P. (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optimizn Theor. Appl., 109, 475–494.

29. Won, J.-H., Lim, J., Kim, S.-J. and Rajaratnam, B. (2013) Condition-number-regularized covariance estimation. J. R. Statist. Soc. B, 75, 427–450.

30. Xu, P.-F., Guo, J. and He, X. (2011) An improved iterative proportional scaling procedure for Gaussian graphical models. J. Computnl Graph. Statist., 20, 417–431.

31. Zangwill, W. I. (1969) Nonlinear Programming: a Unified Approach. Englewood Cliffs: Prentice-Hall.

32. Zhao, P. and Yu, B. (2006) On model selection consistency of lasso. J. Mach. Learn. Res., 7, 2541–2563.
