Abstract

We propose penalized likelihood methods for estimating the concentration matrix in the Gaussian graphical model. The methods lead to a sparse shrinkage estimator of the concentration matrix that is positive definite, and thus perform model selection and estimation simultaneously. The implementation of the methods is nontrivial because of the positive definite constraint on the concentration matrix, but we show that the computation can be carried out efficiently by taking advantage of the maxdet algorithm developed in convex optimization. We propose a BIC-type criterion for selecting the tuning parameter in the penalized likelihood methods. The connection between our methods and existing methods is illustrated. Simulations and real examples demonstrate the competitive performance of the new methods.

1 Introduction

Let X = (X(1), …, X(p)) be a p-dimensional random vector following a multivariate normal distribution N(μ, Σ) with unknown mean μ and nonsingular covariance matrix Σ. Given a random sample X1, …, Xn of X, we wish to estimate the concentration matrix C = Σ−1. Of particular interest is the identification of zero entries in the concentration matrix C = (cij), since a zero entry cij = 0 indicates the conditional independence between the two random variables X(i) and X(j) given all other variables. This is the covariance selection problem (Dempster, 1972) or the model-selection problem in the Gaussian concentration graph model (Cox & Wermuth, 1996).

A Gaussian concentration graph model for the Gaussian random vector X is represented by an undirected graph G = (V, E), where V contains p vertices corresponding to the p coordinates and the edge set E describes the conditional independence relationships among X(1), …, X(p). The edge between X(i) and X(j) is absent if and only if X(i) and X(j) are independent conditional on the other variables, which corresponds to cij = 0. Thus parameter estimation and model selection in the Gaussian concentration graph model are equivalent to estimating parameters and identifying zeros in the concentration matrix; see Whittaker (1990), Lauritzen (1996) and Edwards (2000) for statistical properties of Gaussian concentration graph models and commonly used model selection and parameter estimation methods in such models.

The standard approach to model selection in Gaussian graphical models is greedy stepwise forward-selection or backward-deletion, and parameter estimation is based on the selected model. In each step the edge selection or deletion is typically done through hypothesis testing at some level α. It has long been recognized that this procedure does not correctly take account of the multiple comparisons involved (Edwards, 2000). Another drawback of the common stepwise procedure is its computational complexity. To remedy these problems, Drton & Perlman (2004) proposed a method that produces conservative simultaneous 1 − α confidence intervals, and uses these confidence intervals to do model selection in a single step. The method is based on asymptotic considerations. Meinshausen & Bühlmann (2006) proposed a computationally attractive method for covariance selection that can be used for very large Gaussian graphs. They perform neighbourhood selection for each node in the graph and combine the results to learn the structure of a Gaussian concentration graph model. They showed that their method is consistent for sparse high-dimensional graphs. In all of the above mentioned methods, model selection and parameter estimation are done separately. The parameters in the concentration matrix are typically estimated based on the model selected. As demonstrated by Breiman (1996), the discrete nature of such procedures often leads to instability of the estimator: small changes in the data may result in very different estimates. Other recent advances include a Duke University discussion paper by A. Dobra and M. West, who presented a novel Bayesian framework for building Gaussian graphical models and illustrated their approach in a large scale gene expression study, and Li & Gui (2006), who adopted gradient-directed regularization, which is described in a technical report by J. Friedman and B. Popescu, available at http://www-stat.stanford.edu/∼jhf, to construct sparse Gaussian graphical models.

In this paper, we propose a penalized-likelihood method that does model selection and parameter estimation simultaneously in the Gaussian concentration graph model. We employ an ℓ1 penalty on the off-diagonal elements of the concentration matrix. This is similar to the idea of the lasso in linear regression (Tibshirani, 1996). The ℓ1 penalty encourages sparsity and at the same time gives shrinkage estimates. In addition, we explicitly ensure that the estimator of the concentration matrix is positive definite. We also introduce a ‘nonnegative garrote’ type method that is closely related to the aforementioned approach.

There is a connection between the neighbourhood-selection method of Meinshausen & Bühlmann (2006) and our penalized-likelihood approach, which we illustrate in § 5. The neighbourhood-selection method can be cast as penalized M-estimation without incorporating the positive definiteness or symmetry constraints. The loss function in the penalized M-estimation is a particular quadratic form. The neighbourhood-selection method is computationally faster because of its simpler form and because it does not consider the positive definite constraint. Our method is statistically more efficient because it incorporates the positive definite constraint and uses the likelihood.

Throughout the paper we assume that the observations are suitably centred and scaled. The sample mean is centred to be zero. One may scale to have the diagonal elements of the sample covariance matrix equal to one or to have the diagonal elements of the sample concentration matrix equal to one. In our experience these two scalings give very similar performance, and in this paper we assume the latter since it seems to be more natural for estimating the concentration matrix.

2 Methodology

2·1 Lasso-type estimator

The loglikelihood for μ and C = Σ−1 based on a random sample X1, …, Xn of X is

l(μ, C) = (n/2) log|C| − (1/2) ∑i=1,…,n (Xi − μ)′C(Xi − μ),

up to a constant not depending on μ and C. The maximum likelihood estimator of (μ, Σ) is (X̄, A), where

A = n−1 ∑i=1,…,n (Xi − X̄)(Xi − X̄)′.
The commonly used sample covariance matrix is S = nA/(n − 1). The concentration matrix C can be naturally estimated by A−1 or S−1. However, because of the large number, p(p + 1)/2, of unknown parameters to be estimated, S is not a stable estimator of Σ for moderate or large p. In general, the matrix S−1 is positive definite when n > p, but does not lead to ‘sparse’ graph structure since the matrix typically contains no zero entry.
To achieve ‘sparse’ graph structure and to give a better estimator of the concentration matrix, we adapt the lasso idea and seek the minimizer Ĉ of

−l(μ, C)   subject to   ∑i≠j |cij| ≤ t    (1)

over the set of positive definite matrices. Here t ≥ 0 is a tuning parameter. When t = ∞, the solution to (1) is the maximum likelihood estimator A−1 provided that the inverse exists. On the other hand, if t = 0, then the constraint forces C to be diagonal, which implies that X(1), …, X(p) are mutually independent. It is clear that μ̂ = X̄ regardless of t. Since the observations are centred, we have X̄ = 0. Therefore, Ĉ is the positive definite matrix that minimizes
(1/2) ∑i=1,…,n Xi′CXi − (n/2) log|C|   subject to   ∑i≠j |cij| ≤ t.    (2)
Since ∑i Xi′CXi = n tr(CA) for the centred data, we can further rewrite (2) as

tr(CA) − log|C|   subject to   ∑i≠j |cij| ≤ t.    (3)
Since both the objective function and the feasible region of (3) are convex, we can equivalently use the Lagrangian form

tr(CA) − log|C| + λ ∑i≠j |cij|,    (4)

with λ ≥ 0 being the tuning parameter.
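For concreteness, the objective in (4) is straightforward to evaluate numerically. The following is a minimal NumPy sketch, assuming the data have been centred and scaled as in § 1; the function and variable names are ours, not part of the original implementation.

import numpy as np

def penalized_negloglik(C, A, lam):
    """Objective (4): tr(CA) - log det(C) + lam * sum_{i != j} |c_ij|.

    C   : candidate concentration matrix (symmetric, positive definite)
    A   : sample covariance matrix A = (1/n) sum_i (X_i - Xbar)(X_i - Xbar)'
    lam : nonnegative tuning parameter
    """
    sign, logdet = np.linalg.slogdet(C)
    if sign <= 0:
        return np.inf                      # outside the positive definite cone
    off_diag_l1 = np.sum(np.abs(C)) - np.sum(np.abs(np.diag(C)))
    return np.trace(C @ A) - logdet + lam * off_diag_l1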

2·2 Nonnegative garrote-type estimator

If a relatively reliable estimator C̃ = (c̃ij) of C is available, a shrinkage estimator can be defined through ĉij = dijc̃ij, where the symmetric matrix D = (dij) is the minimizer of

tr(CA) − log|C|   subject to   dij ≥ 0, ∑i≠j dij ≤ t,    (5)

and with C = (dijc̃ij) positive definite. Equivalently, this can be written as

tr(CA) − log|C| + λ ∑i≠j dij,    (6)

subject to dij ≥ 0 and with C positive definite. For a relatively large sample size, A−1 is an obvious choice for the preliminary estimator. This procedure is similar in spirit to the nonnegative garrote estimator proposed by Breiman (1995) for linear regression.
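In the same spirit, the garrote-type criterion (6) can be evaluated for a given multiplier matrix D and preliminary estimator C̃; the sketch below again uses names of our own choosing and is only an illustration of the construction.

import numpy as np

def garrote_objective(D, C_tilde, A, lam):
    """Objective (6) evaluated at C = (d_ij * ctilde_ij).

    D       : symmetric matrix of nonnegative shrinkage factors d_ij
    C_tilde : preliminary estimator of C, e.g. inv(A) when n is large
    A       : sample covariance matrix
    lam     : nonnegative tuning parameter
    """
    if np.any(D < 0):
        return np.inf                      # d_ij >= 0 is required
    C = D * C_tilde                        # elementwise product defines the estimator
    sign, logdet = np.linalg.slogdet(C)
    if sign <= 0:
        return np.inf                      # C must be positive definite
    off_diag_sum = np.sum(D) - np.sum(np.diag(D))
    return np.trace(C @ A) - logdet + lam * off_diag_sum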

2·3 Illustration

Consider the special case in which p = 2. Denote the maximum likelihood estimator of the concentration matrix by

A−1 = (1, r; r, 1),

where the diagonal elements are 1 because of the scaling and r denotes the off-diagonal element. Therefore,

A = (1 − r²)−1 (1, −r; −r, 1).

Substitution in (4) gives

(c11 + c22 − 2rc12)/(1 − r²) − log(c11c22 − c12²) + 2λ|c12|,

where we used the fact that C is symmetric.
 
LEMMA 1.
In the case of the bivariate normal, the proposed penalized likelihood estimator given by the solution to (4) is
where (x)+ = max(x, 0) and
7

Similarly, the garrote type estimator can also be obtained in an explicit form in this case.

 
LEMMA 2.
With C̃ = A−1, the minimizer of (6) is
and ĉ11 = ĉ22 are given by (7).

The estimators are illustrated in Fig. 1. If the true value of c12 is zero, r will tend to be small in magnitude. With an appropriately chosen λ, both estimates of c12 can be shrunk to zero, so that model selection for the graphical model can be achieved. Note that the proposed estimators are continuous functions of r, and consequently of the data. Such continuity, which is not shared by existing methods that perform maximum likelihood estimation on a selected graph structure, ensures the stability of our estimators. The garrote-type estimator penalizes large values of r less heavily than small ones. As will be demonstrated in the next section, this can be advantageous for model fitting. However, the disadvantage of the garrote-type estimator is that it can be applied only when a good initial estimator is available.

Fig. 1

(a) Lasso-type estimator and (b) garrote-type estimator for the case of p = 2.

3 Asymptotic Theory

In this section, we derive asymptotic properties of the proposed estimators that are analogous to those for the lasso (Knight & Fu, 2000). For simplicity, we assume that p is held fixed as the sample size n → ∞. Although it might be more realistic to consider the case in which p → ∞ as n → ∞, the following results nevertheless provide an adequate illustration of the mechanism of the proposed estimators.

 
THEOREM 1.
If √n λn → λ0 ≥ 0 as n → ∞, the lasso-type estimator is such that √n(Ĉ − C) → arg min V(U)
in distribution, where the minimum is taken over symmetric matrices U,

V(U) = (1/2) tr(UΣUΣ) + tr(UW) + λ0 ∑i≠j {uij sign(cij) I(cij ≠ 0) + |uij| I(cij = 0)},

in which W is a random symmetric p × p matrix such that vec(W) has a N(0, Λ) distribution, and Λ is such that

cov(wij, wkl) = σikσjl + σilσjk.
As an illustration, consider an example where p = 3 and
Note that, for i ≠ j ≠ k ≠ l,
After some tedious algebraic manipulation, we obtain that

We simulated 1000 observations from the distribution of arg min V. Figure 2 gives the scatterplot of the off-diagonal elements for λ0 = 0, 0·5 and 1. When λ0 = 0, our estimator is asymptotically equivalent to the maximum likelihood estimator, and the asymptotic distribution of the elements of Ĉ is multivariate normal; see Fig. 2(a). If λ0 > 0, the proposed estimator has a positive probability of estimating c13 by its true value 0, and this probability increases as λ0 increases. From Theorem 1, pr(ĉ13 = 0) tends to 0·30 if λ0 = 0·5 and to 0·45 when λ0 = 1.

Fig. 2

Example with p = 3. Asymptotic distribution of our estimator as estimated by 1000 simulations, for different values of λ0, (a) λ0 = 0, (b) λ0 = 0·5, (c) λ0 = 1.

Similarly to Theorem 1, we can derive the asymptotic properties of the nonnegative garrote-type estimator.

 
THEOREM 2.

Denote by Ĉ the minimizer of (6) with initial estimator C̃ = A−1. If √n λn → 0 and nλn → ∞ as n → ∞, then pr(ĉij = 0) → 1 if cij = 0, and the other elements of Ĉ have the same limiting distribution as the maximum likelihood estimator on the true graph structure.

Theorem 2 indicates that the garrote-type estimator enjoys the so-called oracle property: it selects the right graph with probability tending to one and at the same time gives a root-n consistent estimator of the concentration matrix.

4 Computation

4·1 The maxdet problem

The nonlinearity of the objective function and the positive-definiteness constraint make the optimization problem (3) nontrivial. We take advantage of the connection between (3) and the determinant-maximization problem, the maxdet problem (Vandenberghe et al., 1998), which can be solved very efficiently with the interior point algorithm.

The maxdet problem is of the form

minimize b′x + log det G(x)−1   subject to   G(x) ≻ 0, F(x) ⪰ 0,

where b ∈ Rm, G(x) ≻ 0 means that G(x) is positive definite, F(x) ⪰ 0 means that F(x) is positive semidefinite, and the matrix-valued functions G(x) and F(x) are affine:

G(x) = G0 + x1G1 + ⋯ + xmGm,   F(x) = F0 + x1F1 + ⋯ + xmFm,

where Fi and Gi are symmetric matrices. To use the algorithm of Vandenberghe et al. (1998), it is also required that Fi, i = 1, …, m, be linearly independent and the same be true of Gi, i = 1, …, m. It is not hard to see that the garrote-type estimator (5) solves a maxdet problem.
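To make the connection to convex optimization concrete, problem (4) can also be handed directly to a modern conic solver; the sketch below uses CVXPY as a stand-in for the maxdet interior-point algorithm and is intended only as an illustration of the formulation, not as the implementation used here.

import numpy as np
import cvxpy as cp

def lasso_type_estimate(A, lam):
    """Solve (4): minimise tr(CA) - log det C + lam * sum_{i != j} |c_ij| over positive definite C."""
    p = A.shape[0]
    off = 1.0 - np.eye(p)                  # mask selecting the off-diagonal entries
    C = cp.Variable((p, p), symmetric=True)
    objective = (cp.trace(A @ C) - cp.log_det(C)
                 + lam * cp.sum(cp.abs(cp.multiply(off, C))))
    cp.Problem(cp.Minimize(objective)).solve()   # log_det keeps C inside the positive definite cone
    return C.value

# Example: C_hat = lasso_type_estimate(A, lam=0.1), with A the scaled sample covariance matrix.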

4·2 Algorithm for lasso-type estimator

If the signs of the cij’s are known, (3) can be expressed as the following maxdet problem:

minimize tr(CA) − log|C|
subject to ∑i≠j sijcij ≤ t, sijcij ≥ 0 for all i ≠ j, and ∑iciiI(i) + ∑i<jcijI(ij) being positive definite,    (8)

where C = (cij), S = (sij), A = (aij), I(i) is a p × p matrix with the (i, i)th entry being 1 and all other entries being 0, I(ij) is a p × p matrix with the (i, j)th and the (j, i)th entries being 1 and all other entries being 0, and sij is the sign of cij. Since the signs of the cij’s are not known in advance, we propose to update the sij’s and cij’s iteratively using the following steps; a code sketch of the iteration is given after the steps. In our experience the algorithm usually converges within a small number of iterations. Clearly, other initial values for s can also be used.
  • Step 1. Let Ĉold = A−1 and let sij be the sign of the (i, j)th element of A−1, for all i ≠ j.

  • Step 2. Let Ĉnew solve (8) over the set of positive definite matrices.

  • Step 3. If Ĉnew = Ĉold, then stop and let Ĉ = Ĉnew. Otherwise, set Ĉold = Ĉnew and sij = −sij for all pairs (i, j) such that ĉij = 0 and go back to Step 2.
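The following sketch illustrates one possible implementation of Steps 1–3, with the fixed-sign subproblem (8) solved by a generic convex solver (CVXPY) in place of the maxdet interior-point method; the initialisation from A−1 and the numerical tolerances are our own choices rather than prescriptions from the text.

import numpy as np
import cvxpy as cp

def solve_fixed_sign(A, t, S):
    """Subproblem (8): with the signs s_ij fixed, |c_ij| becomes s_ij * c_ij."""
    p = A.shape[0]
    off = 1.0 - np.eye(p)
    C = cp.Variable((p, p), symmetric=True)
    signed = cp.multiply(off * S, C)          # s_ij * c_ij off the diagonal, 0 on it
    constraints = [signed >= 0, cp.sum(signed) <= t]
    cp.Problem(cp.Minimize(cp.trace(A @ C) - cp.log_det(C)), constraints).solve()
    return C.value

def sign_iteration(A, t, tol=1e-6, max_iter=100):
    p = A.shape[0]
    S = np.sign(np.linalg.inv(A))             # Step 1 (assumed): initial signs from A^{-1}
    S[S == 0] = 1.0
    C_old = None
    for _ in range(max_iter):
        C_new = solve_fixed_sign(A, t, S)     # Step 2
        if C_old is not None and np.max(np.abs(C_new - C_old)) < tol:
            return C_new                      # Step 3: no change, stop
        zero = (np.abs(C_new) < tol) & ~np.eye(p, dtype=bool)
        S[zero] = -S[zero]                    # Step 3: flip signs where the estimate hit zero
        C_old = C_new
    return C_old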

 
LEMMA 3.

The above algorithm always converges and converges to the solution to (3).

4·3 Tuning

So far we have concentrated on the calculation of the minimizer of (3) for any fixed tuning parameter t. In practice, we need to choose a tuning parameter so as to minimize a score measuring the goodness-of-fit. A commonly used such score is the multifold crossvalidation score, but a computationally more efficient alternative is the BIC for model selection and estimation. To evaluate the BIC for the current setting, one must first obtain an estimate of the degrees of freedom, which is defined as the number of unknown parameters in the case of the maximum likelihood estimate.

Let Â = (âij) = Ĉ−1 and Ω = {(i, j): ĉij ≠ 0}. From the Karush-Kuhn-Tucker conditions, it is not hard to see that the lasso-type estimator satisfies âij = aij + λ sign(ĉij) for all pairs (i, j) ∈ Ω. The remaining card(Ωc)/2 unique entries of Â can be obtained by solving the card(Ωc)/2 equations ĉij = 0 for (i, j) ∈ Ωc with i < j, where card(·) is the cardinality of a set. Therefore, Ĉ relies on the observations only through aij, (i, j) ∈ Ω. Note that the number of parameters in {aij: (i, j) ∈ Ω} is {card(Ω) + p}/2. Since the aij’s are maximum likelihood estimates, we can define, for a given tuning parameter t,
where êij = 0 if ĉij = 0, and êij = 1 otherwise.
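The degrees of freedom and a BIC of this type can be computed from a fitted Ĉ as in the sketch below; the exact form of the criterion used here (twice the negative profile loglikelihood plus log n times the number of free parameters) is our reading, so the constants may differ from the definition intended above.

import numpy as np

def bic_score(C_hat, A, n, tol=1e-8):
    """BIC-type criterion for a fitted concentration matrix C_hat.

    Degrees of freedom = number of nonzero entries on and above the diagonal,
    i.e. sum_{i <= j} e_ij with e_ij = 1 if c_hat_ij != 0 and 0 otherwise.
    """
    p = A.shape[0]
    sign, logdet = np.linalg.slogdet(C_hat)
    neg2loglik = n * (np.trace(C_hat @ A) - logdet)   # -2 * profile loglikelihood, up to constants (assumed)
    nonzero = np.abs(C_hat) > tol
    df = np.sum(nonzero[np.triu_indices(p)])          # count each symmetric pair once
    return neg2loglik + np.log(n) * df

# The tuning parameter (t or lambda) is then chosen to minimise bic_score over a grid of candidate values.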

5 Quadratic Approximation

Provided that A is nonsingular, a second-order approximation to the objective function of (3) around A−1 can be written, up to an additive constant, as (Boyd & Vandenberghe, 2003)

(1/2) tr{(C − A−1)A(C − A−1)A}.

Therefore, the solution to (4) can be approximated by the solution to

minimize (1/2) tr{(C − A−1)A(C − A−1)A} + λ ∑i≠j |cij|.    (9)
This second-order approximation is closely connected to the approach proposed by Meinshausen & Bühlmann (2006). In their approach, for each i = 1, …, p, we seek the minimizer θ̂i = (θ̂ij, j ≠ i) of
10
where X[−i] is the n × (p − 1) matrix resulting from the deletion of the ith column from the data matrix X. A vertex j is taken to be a neighbour of vertex i if and only if θ̂ij ≠ 0. The two vertices are connected by an edge in the graphical model if either vertex is the neighbour of the other one.
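For comparison, the neighbourhood-selection procedure just described can be sketched with one ℓ1-penalized regression per node; the penalty level alpha and the scaling conventions below are placeholders rather than the settings used by Meinshausen & Bühlmann (2006).

import numpy as np
from sklearn.linear_model import Lasso

def neighbourhood_selection(X, alpha):
    """Estimate the edge set by regressing each column of X on all the others.

    X     : n x p data matrix, assumed centred (and scaled) beforehand
    alpha : l1 penalty level passed to the lasso
    Returns a symmetric boolean adjacency matrix (no self-edges).
    """
    n, p = X.shape
    neighbours = np.zeros((p, p), dtype=bool)
    for i in range(p):
        others = np.delete(np.arange(p), i)
        fit = Lasso(alpha=alpha, fit_intercept=False).fit(X[:, others], X[:, i])
        neighbours[i, others] = fit.coef_ != 0
    # 'or' rule: connect i and j if either is selected as a neighbour of the other
    return neighbours | neighbours.T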

Note that θii, i = 1, …, p, are not determined. For notational purposes, we write θii = 1 for i = 1, …, p. Recall that we scale each component of X so that all the diagonal elements of the sample concentration matrix are unity. The following lemma reveals a close connection between the approach of Meinshausen & Bühlmann (2006) and the second-order approximation (9).

 
LEMMA 4.
The matrix Θ = (θij) defined by (10) is the unconstrained solution to
11
over all p × p matrices with diagonal elements fixed at 1.

Lemma 4 shows that the approach of Meinshausen & Bühlmann (2006) seeks a sparse C close to A−1. However, it does not incorporate the symmetry and positive-definiteness constraint in the estimation of the concentration matrix, and therefore an additional step is needed to estimate either the covariance matrix or the concentration matrix. Also, the loss function used by Meinshausen & Bühlmann is different from the quadratic approximation to the loglikelihood, and therefore the approach is expected to be less efficient than our penalized likelihood method or the corresponding quadratic approximation (9).

6 Simulation

We consider eight different models in our simulation; a code sketch illustrating how the corresponding concentration matrices can be generated follows the list.

  • Model 1. Heterogeneous model with Σ = diag(1, 2, …, p).

  • Model 2. An AR(1) model with cii = 1 and ci, i−1 = ci−1, i = 0·5.

  • Model 3. An AR(2) model with cii = 1, ci, i−1 = ci−1, i = 0·5 and ci, i−2 = ci−2, i = 0·25.

  • Model 4. An AR(3) model with cii = 1, ci, i−1 = ci−1, i = 0·4 and ci, i−2 = ci−2, i = ci, i−3 = ci−3, i = 0·2.

  • Model 5. An AR(4) model with cii = 1, ci, i−1 = ci−1, i = 0·4, ci, i−2 = ci−2, i = ci, i−3 = ci−3, i = 0·2 and ci, i−4 = ci−4, i = 0·1.

  • Model 6. Full model with cij = 2 if i = j and cij = 1 otherwise.

  • Model 7. Star model with every node connected to the first node, with cii = 1, c1, i = ci, 1 = 0·2 and cij = 0 otherwise.

  • Model 8. Circle model with cii = 1, ci, i−1 = ci−1, i = 0·5 and c1p = cp1 = 0·4.
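As an illustration, the concentration matrices of two of these models can be generated as follows; this sketch reflects our reading of the entries listed above and is not the simulation code used for Table 1.

import numpy as np

def ar1_concentration(p):
    """Model 2: c_ii = 1 and c_{i,i-1} = c_{i-1,i} = 0.5."""
    C = np.eye(p)
    idx = np.arange(p - 1)
    C[idx, idx + 1] = C[idx + 1, idx] = 0.5
    return C

def circle_concentration(p):
    """Model 8: the AR(1) band plus c_{1,p} = c_{p,1} = 0.4 linking the two end nodes."""
    C = ar1_concentration(p)
    C[0, p - 1] = C[p - 1, 0] = 0.4
    return C

def simulate(C, n, seed=0):
    """Draw n observations from N(0, C^{-1})."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(C.shape[0]), np.linalg.inv(C), size=n)

# Example: X = simulate(circle_concentration(10), n=50)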

For each model, we simulated samples with size 25 and dimension p = 5, or size 50 and dimension 10. We compare our methods with the approach of Meinshausen & Bühlmann (2006) and the method proposed by Drton & Perlman (2004) in terms of the Kullback–Leibler loss,
the number of false positives (FP; incorrectly identified edges) and the number of false negatives (FN; incorrectly missed edges). The approach of Meinshausen & Bühlmann (2006) was implemented using the LARS package in R, and the method of Drton & Perlman (2004) is implemented in the SIN package in R. Their method assigns each edge of the full graph a p-value, and two different cut-off values, 5% and 25%, were suggested in their original paper. Both of these methods focus on model selection and do not consider the problem of estimating the covariance matrix or the concentration matrix. For comparison, we estimate the concentration matrix by the maximum likelihood estimate after the graph structure has been selected using their methods. Table 1 documents the means and standard errors, in parentheses, from 100 runs for each combination. Our penalized likelihood method is referred to as Lasso in the table because of its connection to the idea of the lasso in linear regression. Similarly, the extension described in § 2·2 is referred to as Garrote in the table.
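For reference, one common form of the Gaussian Kullback–Leibler loss between the true distribution N(0, Σ) and a fitted N(0, Ĉ−1) can be computed as below; the exact definition used in the comparisons may differ from this form by a constant factor.

import numpy as np

def kl_loss(Sigma_true, C_hat):
    """KL-type loss between N(0, Sigma_true) and the fitted N(0, inv(C_hat)).

    A standard form: tr(Sigma_true C_hat) - log det(Sigma_true C_hat) - p,
    which is zero when C_hat equals inv(Sigma_true).
    """
    p = Sigma_true.shape[0]
    M = Sigma_true @ C_hat
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p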
Table 1

Results for the eight simulated models. Averages and standard errors from 100 runs

                Lasso                   Garrote                 MB                      SIN (0·05)              SIN (0·25)
 p  Model   KL     FP     FN       KL     FP     FN       KL     FP     FN       KL     FP     FN       KL     FP     FN
 5    1    0·27   0·20   0·00     0·31   0·42   0·00     0·45   0·91   0·00     0·26   0·05   0·00     0·32   0·26   0·00
          (0·02) (0·06) (0·00)   (0·02) (0·08) (0·00)   (0·04) (0·08) (0·00)   (0·02) (0·03) (0·00)   (0·03) (0·07) (0·00)
      2    0·70   3·31   0·07     0·67   1·20   0·14     0·63   0·68   0·16     1·88   0·03   2·47     1·40   0·15   1·59
          (0·05) (0·12) (0·03)   (0·05) (0·12) (0·04)   (0·05) (0·08) (0·04)   (0·06) (0·02) (0·10)   (0·06) (0·05) (0·09)
      3    0·89   1·29   2·24     0·87   0·60   2·58     0·98   0·47   3·68     1·16   0·01   5·42     1·06   0·07   4·16
          (0·05) (0·10) (0·24)   (0·04) (0·08) (0·18)   (0·04) (0·06) (0·09)   (0·04) (0·01) (0·12)   (0·05) (0·04) (0·15)
      4    0·79   0·22   5·60     0·80   0·16   5·86     0·83   0·16   6·23     0·93   0·00   8·14     0·90   0·01   6·97
          (0·03) (0·04) (0·29)   (0·04) (0·04) (0·23)   (0·04) (0·04) (0·11)   (0·03) (0·00) (0·11)   (0·03) (0·01) (0·15)
      5    0·78   0·00   7·06     0·76   0·00   6·98     0·80   0·00   7·26     0·88   0·00   9·13     0·86   0·00   7·94
          (0·04) (0·00) (0·30)   (0·03) (0·00) (0·23)   (0·04) (0·00) (0·12)   (0·03) (0·00) (0·11)   (0·03) (0·00) (0·16)
      6    1·09   0·00   4·53     1·11   0·00   4·58     1·30   0·00   7·05     1·28   0·00   6·18     1·18   0·00   3·77
          (0·04) (0·00) (0·44)   (0·04) (0·00) (0·41)   (0·04) (0·00) (0·11)   (0·04) (0·00) (0·24)   (0·05) (0·00) (0·25)
      7    0·45   0·31   3·47     0·51   0·46   3·02     0·61   0·55   2·75     0·43   0·00   3·92     0·50   0·13   3·61
          (0·02) (0·08) (0·10)   (0·03) (0·08) (0·12)   (0·03) (0·07) (0·10)   (0·02) (0·00) (0·03)   (0·03) (0·05) (0·06)
      8    0·73   2·55   0·11     0·77   1·28   0·26     0·80   0·17   0·37     1·89   0·03   3·29     1·48   0·11   2·12
          (0·05) (0·13) (0·03)   (0·05) (0·12) (0·06)   (0·05) (0·05) (0·06)   (0·05) (0·02) (0·10)   (0·05) (0·04) (0·09)
10    1    0·22   0·26   0·00     0·26   0·75   0·00     0·63   3·48   0·00     0·23   0·07   0·00     0·26   0·23   0·00
          (0·01) (0·09) (0·00)   (0·01) (0·14) (0·00)   (0·03) (0·17) (0·00)   (0·01) (0·03) (0·00)   (0·02) (0·05) (0·00)
      2    1·42  31·76   0·00     0·60   4·83   0·00     0·58   2·25   0·00     4·01   0·06   3·75     2·39   0·25   2·05
          (0·04) (0·26) (0·00)   (0·02) (0·27) (0·00)   (0·02) (0·15) (0·00)   (0·14) (0·02) (0·14)   (0·12) (0·06) (0·12)
      3    1·22  10·87   3·30     1·03   5·86   3·05     1·52   2·92   7·07     1·90   0·07  11·26     1·60   0·21   8·91
          (0·04) (0·59) (0·34)   (0·03) (0·34) (0·21)   (0·03) (0·15) (0·14)   (0·03) (0·03) (0·20)   (0·03) (0·05) (0·20)
      4    1·22   3·37  14·10     1·07   2·14  12·64     1·28   1·95  14·33     1·68   0·05  20·80     1·49   0·15  18·34
          (0·04) (0·32) (0·52)   (0·03) (0·19) (0·39)   (0·03) (0·14) (0·20)   (0·02) (0·02) (0·20)   (0·03) (0·04) (0·27)
      5    1·21   1·08  23·34     1·06   0·98  20·58     1·23   1·02  21·19     1·42   0·02  26·66     1·30   0·08  24·13
          (0·03) (0·18) (0·56)   (0·03) (0·12) (0·47)   (0·03) (0·09) (0·20)   (0·02) (0·01) (0·22)   (0·02) (0·03) (0·31)
      6    1·66   0·00  44·60     1·66   0·00  44·30     2·08   0·00  38·05     2·10   0·00  17·29     1·95   0·00   9·94
          (0·01) (0·00) (0·16)   (0·01) (0·00) (0·18)   (0·02) (0·00) (0·20)   (0·04) (0·00) (1·01)   (0·05) (0·00) (0·79)
      7    0·71   0·73   7·61     0·69   2·14   5·82     0·97   3·37   4·77     0·78   0·07   8·79     0·81   0·23   8·29
          (0·01) (0·15) (0·21)   (0·02) (0·24) (0·25)   (0·03) (0·16) (0·16)   (0·01) (0·03) (0·05)   (0·02) (0·05) (0·09)
      8    0·89  19·24   0·02     0·65   5·81   0·03     0·93   3·58   0·00     6·83   0·06   4·50     4·03   0·25   2·55
          (0·04) (0·63) (0·02)   (0·03) (0·30) (0·02)   (0·02) (0·19) (0·00)   (0·23) (0·02) (0·16)   (0·18) (0·06) (0·13)

MB, method of Meinshausen & Bühlmann (2006); SIN (0·05), method of Drton & Perlman (2004) based on a cut-off of 0·05; SIN (0·25), method of Drton & Perlman (2004) based on a cut-off of 0·25; KL, Kullback–Leibler loss defined in § 6; FP, number of incorrectly identified edges; FN, number of incorrectly missed edges.


As shown in Table 1, the proposed penalized likelihood methods enjoy better performance than the other methods. The method of Meinshausen & Bühlmann (2006) and both versions of the Drton–Perlman method tend to have larger FN, which may partly explain their relatively poor performance. However, the results suggest that the proposed penalized likelihood approach combined with BIC may have relatively larger FP. The solution path of (3) may therefore be more informative in determining the graph structure.

All four methods under comparison require the selection of tuning parameters that control the trade-off between sensitivity and specificity. To appreciate better the merits of different methods independently of the tuning parameter selection, we plotted the receiver operating characteristic curves for different models and methods, each averaged over the 100 simulated datasets. The AR(4) and full models are not included because the specificity in these cases is not well defined when p = 5. From the plot, not shown here because of space restrictions, the proposed methods outperform the other approaches for all models when p = 5 and n = 25. In the cases when p = 10 and n = 50, the performance of all methods improves but Lasso and Garrote still enjoy competitive performance when compared with the other approaches.

7 Real World Examples

We first consider three real-world examples. The cork borings data are presented in Whittaker (1990, p. 267) and were originally used by Rao (1948). The p = 4 measurements are the weights of cork borings on n = 28 trees in the four directions, north, east, south and west. Frets' heads dataset contains head measurements on the first and the second adult son in a sample of 25 families. The four variables are the head length of the first son, the head breadth of the first son, the head length of the second son and the head breadth of the second son. The data are also presented in Whittaker (1990, p. 255). The Mathematics marks dataset (Mardia et al., 1979, p. 3) contains the marks of n = 88 students in the p = 5 examinations in mechanics, vectors, algebra, analysis and statistics. The data also appear in Whittaker (1990, p. 1).

Figures 3–5 depict the solution paths of (3) for each of the three datasets.

Fig. 3

Cork borings dataset. The Meinshausen–Bühlmann method selects (d); both methods based on SIN, with cut-off 0·05 and 0·25, select (b); both Lasso and Garrote with BIC select (e).

Fig. 4

Frets' heads dataset. The Meinshausen–Bühlmann method selects (f); SIN with cut-off 0·05 selects (a); SIN with cut-off 0·25 selects (b); both Lasso and Garrote with BIC select (f).

Fig. 5

Mathematics marks dataset. The Meinshausen–Bühlmann method selects (f); SIN with cut-off 0·05 selects (b) with an additional edge between Mechanics and Vectors; SIN with cut-off 0·25 selects (f); Lasso with BIC selects (j); and Garrote with BIC selects (f).

To compare the accuracy of the different methods, fivefold crossvalidation was applied to the datasets. Table 2 documents the average value of the KL distance for each method, where the loss is now computed with Ĉ, the concentration matrix estimated on the training set, and the sample covariance matrix evaluated on the test set.
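The crossvalidation calculation can be sketched as follows; the fitting function fit_concentration is a placeholder for any of the estimators above, and the predictive loss is written in a common form that may differ from the definition used in Table 2 by constants.

import numpy as np

def predictive_kl(C_hat, A_test):
    """Predictive KL-type loss: tr(C_hat A_test) - log det(C_hat), up to constants (assumed)."""
    sign, logdet = np.linalg.slogdet(C_hat)
    return np.trace(C_hat @ A_test) - logdet

def fivefold_cv(X, fit_concentration):
    """Average predictive loss over five folds; fit_concentration(A_train) -> C_hat."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(0).permutation(n), 5)
    losses = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        A_train = np.cov(X[train_idx], rowvar=False, bias=True)
        A_test = np.cov(X[test_idx], rowvar=False, bias=True)
        losses.append(predictive_kl(fit_concentration(A_train), A_test))
    return float(np.mean(losses))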
Table 2

Small-scale examples. Averaged KL loss estimated by fivefold crossvalidation

Dataset          Lasso    Garrote   MB       SIN (0·05)   SIN (0·25)   Sample
Cork borings     21·65    22·28     22·46    25·21        24·45        22·68
Frets' heads     18·68    18·33     20·15    21·10        21·22        20·00
Maths marks      29·52    29·53     29·83    30·66        30·26        29·84

MB, method of Meinshausen & Bühlmann (2006); SIN (0·05), method of Drton & Perlman (2004) based on a cut-off of 0·05; SIN (0·25), method of Drton & Perlman (2004) based on a cut-off of 0·25.


Next we consider a larger problem. The opening prices of 35 stocks were collected for the years 2003 and 2004. The different methods were applied to estimate the covariance matrix using the data from 2003. The KL loss of each estimate was then evaluated using the data from 2004, and Table 3 reports the improvement in KL loss over the sample covariance matrix.

Table 3

Stock market example. Improvement of predictive KL loss over the sample covariance matrix

Dataset          Lasso    Garrote   MB       SIN (0·05)   SIN (0·25)
Stock market     0·05     0·16      −0·58    −5·89        −4·81

As shown in Tables 2 and 3, the proposed penalized likelihood methods enjoy very competitive performance.

The authors wish to thank the editor and two anonymous referees for comments that greatly improved the paper. This research was supported in part by grants from the U.S. National Science Foundation.

Appendix

Proofs

 
Proof of Lemma 1
The first-order condition leads to
A1
A2
A3
where sign(c12) = 1 if c12 > 0, sign(c12) = −1 if c12 < 0, and sign(c12) is anywhere between −1 and 1 if c12 = 0. Equation (7) can be easily obtained from (A1) and (A2). Together with (A3), we conclude that
A4
The sign of the left-hand side of the above equation, sign(c12), should therefore be equal to the sign of the right-hand side, sign(r). It follows that (A4) implies that
A5
The proof can be completed by the solution of (A5).  □
 
Proof of Theorem 1
Define Vn(U) as
Note that
where σi(·) denotes the ith-largest eigenvalue of a matrix. Since
we conclude that
On the other hand,
Together with the fact that
nVn(U) can be re-written as
where forumla. Therefore, forumla, in distribution. Since both V(U) and nVn(U) are convex and V(U) has a unique minimum, it follows that,
  □
 
Proof of Theorem 2
The proof proceeds in the same fashion as that of Theorem 1. Define Vn(U) as
As before,
Note that forumla if cij = 0, forumla in probability and forumla. Therefore, the above expression can be rewritten as
Since nλn → ∞, we conclude that the minimizer of nVn(U) satisfies uij = 0 if cij = 0 with probability tending to one. The proof is now completed if we note that the maximum likelihood estimator Ĉtrue for the true graph (V, E = (cij ≠ 0)) is such that
in distribution, where the minimum is taken over all symmetric matrices U such that uij = 0 if cij = 0.  □
 
Proof of Lemma 3

Simple matrix calculus shows that the matrix of second derivatives of the objective function in (8) is positive definite and therefore the objective function is strictly convex. Since the feasible region is compact, Ĉnew is always well defined. We now show that the algorithm will terminate in a finite number of iterations. Note that, at each iteration, Ĉold lies in the feasible region of Step 2. If the algorithm does not terminate, that is, at each step Ĉnew ≠ Ĉold, then the minimum attained at Step 2 is strictly smaller than that from the previous iteration. The minima attained in the iterations form a strictly decreasing sequence, which in turn implies that the sign matrix in (8) must be different for all iterations. However, this contradicts the fact that there are only a finite number, 2^{p(p−1)/2}, of possible choices for the sign matrix S. Therefore the algorithm has to terminate.

Now we show that the algorithm converges to the solution to (3). Denote the solution at convergence of the algorithm by Ĉ. By the algorithm we see there exist two sign matrices Ŝ and forumla, with forumla, forumla, and forumla for any ĉij = 0, such that Ĉ solves (8) with both Ŝ and forumla. Let forumla. Then, by the Karush-Kuhn-Tucker conditions (Boyd & Vandenberghe, 2003), there exist λ1 > 0 and λ2 > 0 such that
A6
A7
and
A8
A9
Together with the fact that forumla for any ĉij ≠ 0, (A6) and (A8) imply that λ1 = λ2 ≡ λ. Combining this with (A7) and (A9), we conclude that
which implies that Ĉ is also the solution to (4), and equivalently (3), again by the Karush-Kuhn-Tucker conditions.  □
 
Proof of Lemma 4
Let B = A−1. Then bii = 1 according to our scaling and bi,−i is the least-squares estimator corresponding to regressing X(i) on the other elements; see Lauritzen (1996) and Meinshausen & Bühlmann (2006). Using this fact, we may write (11) as
To minimize this function, we have θii = 1 and θi,−i as the minimizer of (10).  □

References

Boyd, S. & Vandenberghe, L. (2003). Convex Optimization. Cambridge: Cambridge University Press.
Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37, 373–84.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Ann. Statist. 24, 2350–83.
Cox, D. R. & Wermuth, N. (1996). Multivariate Dependencies: Models, Analysis and Interpretation. London: Chapman and Hall.
Dempster, A. P. (1972). Covariance selection. Biometrics 28, 157–75.
Drton, M. & Perlman, M. (2004). Model selection for Gaussian concentration graphs. Biometrika 91, 591–602.
Edwards, D. M. (2000). Introduction to Graphical Modelling. New York: Springer.
Knight, K. & Fu, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28, 1356–78.
Lauritzen, S. L. (1996). Graphical Models. Oxford: Clarendon Press.
Li, H. & Gui, J. (2006). Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7, 302–17.
Mardia, K. V., Kent, J. T. & Bibby, J. M. (1979). Multivariate Analysis. London: Academic Press.
Meinshausen, N. & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34, 1436–62.
Rao, C. R. (1948). Tests of significance in multivariate analysis. Biometrika 35, 58–79.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–88.
Vandenberghe, L., Boyd, S. & Wu, S.-P. (1998). Determinant maximization with linear matrix inequality constraints. SIAM J. Matrix Anal. Appl. 19, 499–533.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Chichester: John Wiley and Sons.