Abstract
We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm—the graphical lasso—that is remarkably fast: It solves a 1000-node problem (∼500000 parameters) in at most a minute and is 30–4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.
1. INTRODUCTION
In recent years a number of authors have proposed the estimation of sparse undirected graphical models through the use of L1 (lasso) regularization. The basic model for continuous data assumes that the observations have a multivariate Gaussian distribution with mean μ and covariance matrix Σ. If the ijth component of Σ−1 is zero, then variables i and j are conditionally independent, given the other variables. Thus, it makes sense to impose an L1 penalty for the estimation of Σ−1 to increase its sparsity.
Meinshausen and Bühlmann (2006) take a simple approach to this problem; they estimate a sparse graphical model by fitting a lasso model to each variable, using the others as predictors. The ijth component of Σ−1 is then estimated to be nonzero if either the estimated coefficient of variable i on j or the estimated coefficient of variable j on i is nonzero (alternatively, they use an AND rule). They show that asymptotically, this consistently estimates the set of nonzero elements of Σ−1.
Other authors have proposed algorithms for the exact maximization of the L1-penalized log-likelihood; Yuan and Lin (2007), Banerjee and others (2007), and Dahl and others (2007) adapt interior-point optimization methods for the solution to this problem. The first 2 of these papers also establish that the simpler approach of Meinshausen and Bühlmann (2006) can be viewed as an approximation to the exact problem.
We use the blockwise coordinate descent approach in Banerjee and others (2007) as a launching point and propose a new algorithm for the exact problem. This new procedure is extremely simple and is substantially faster than competing approaches in our tests. It also bridges the "conceptual gap" between the Meinshausen and Bühlmann (2006) proposal and the exact problem.
2. THE PROPOSED METHOD
Now to the main point of this paper. Problem (2.4) looks like a lasso (L1-regularized) least-squares problem. In fact, if W11 = S11, the solutions are easily seen to equal the lasso estimates for the pth variable on the others and hence are related to the Meinshausen and Bühlmann (2006) proposal. As pointed out by Banerjee and others (2007), W11 ≠ S11 in general, and hence the Meinshausen and Bühlmann (2006) approach does not yield the maximum likelihood estimator. They point out that their blockwise interior-point procedure is equivalent to recursively solving and updating the lasso problem (2.4) but do not pursue this approach. We do, to great advantage, because fast coordinate descent algorithms (Friedman and others, 2007) make solution of the lasso problem very attractive.
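For reference, the setup these equation numbers point to can be sketched as follows. The estimate maximizes the L1-penalized log-likelihood

\[
\log\det\Theta \;-\; \operatorname{tr}(S\Theta) \;-\; \rho\,\|\Theta\|_{1} \tag{2.1}
\]

over nonnegative definite matrices Θ, where S is the empirical covariance matrix of the data and ‖Θ‖1 is the sum of the absolute values of the elements of Θ. With W (the current estimate of Σ) and S partitioned so that one row and column are singled out,

\[
W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{T} & w_{22} \end{pmatrix},
\qquad
S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^{T} & s_{22} \end{pmatrix},
\]

the subproblem for that row and column is the lasso

\[
\hat{\beta} = \arg\min_{\beta}\ \tfrac{1}{2}\bigl\|W_{11}^{1/2}\beta - W_{11}^{-1/2}s_{12}\bigr\|^{2} + \rho\|\beta\|_{1}, \tag{2.4}
\]

with the update w12 = W11β̂.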
In terms of inner products, the usual lasso estimates for the pth variable on the others take as input the data S11 and s12. To solve (2.4), we instead use W11 and s12, where W11 is our current estimate of the upper block of W. We then update w12 and cycle through all of the variables until convergence.
Note that from (2.6), the solution wii = sii + ρ for all i, since θii > 0, and hence, Γii = 1. For convenience we call this algorithm the graphical lasso. Here is the algorithm in detail:
Graphical lasso algorithm
1. Start with W = S + ρI. The diagonal of W remains unchanged in what follows.
2. For each j = 1, 2, …, p, 1, 2, …, p, …, solve the lasso problem (2.4), which takes as input the inner products W11 and s12. This gives a p − 1 vector solution β̂. Fill in the corresponding row and column of W using w12 = W11β̂.
3. Continue until convergence.
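In code, here is a minimal numpy sketch of these steps. It is an illustrative implementation, not the glasso package itself: the inner solver is plain coordinate descent with soft-thresholding, and the final Θ-recovery step uses the partitioned-inverse identities given below.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(V, u, rho, beta, n_sweeps=1000, tol=1e-8):
    """Coordinate descent for min_b 0.5 b'Vb - u'b + rho*||b||_1.
    With V = W11 and u = s12 this is subproblem (2.4) in inner-product form."""
    for _ in range(n_sweeps):
        delta = 0.0
        for k in range(len(u)):
            # partial residual: drop coordinate k's own contribution
            r = u[k] - V[k] @ beta + V[k, k] * beta[k]
            b_new = soft_threshold(r, rho) / V[k, k]
            delta = max(delta, abs(b_new - beta[k]))
            beta[k] = b_new
        if delta < tol:
            break
    return beta

def graphical_lasso(S, rho, n_outer=100, tol=1e-4):
    """Blockwise coordinate descent on the working covariance W.
    Returns (W, Theta). A sketch, not the glasso package code."""
    p = S.shape[0]
    W = S + rho * np.eye(p)            # diagonal of W stays fixed throughout
    B = np.zeros((p, p))               # columns of beta-hat, for warm starts
    for _ in range(n_outer):
        W_prev = W.copy()
        for j in range(p):
            rest = np.arange(p) != j
            W11 = W[np.ix_(rest, rest)]
            beta = lasso_cd(W11, S[rest, j], rho, B[rest, j])
            B[rest, j] = beta
            W[rest, j] = W[j, rest] = W11 @ beta   # w12 = W11 beta-hat
        if np.abs(W - W_prev).mean() < tol:
            break
    # Recover Theta = W^{-1} from the final regressions:
    # theta_jj = 1/(w_jj - w12'beta), theta_{rest,j} = -beta * theta_jj.
    Theta = np.zeros((p, p))
    for j in range(p):
        rest = np.arange(p) != j
        Theta[j, j] = 1.0 / (W[j, j] - W[rest, j] @ B[rest, j])
        Theta[rest, j] = -B[rest, j] * Theta[j, j]
    return W, Theta
```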
The point is that problem (2.1) is not equivalent to p separate regularized regression problems, but to p coupled lasso problems that share the same W and Θ = W−1. The use of W11 in place of S11 shares the information between the problems in an appropriate fashion.
Note that β̂ will typically be sparse, and so the computation w12 = W11β̂ will be fast; if there are r nonzero elements, it takes rp operations.
Interestingly, if W = S, these are just the formulas for obtaining the inverse of a partitioned matrix. That is, if we set W = S and ρ = 0 in the above algorithm, then one sweep through the predictors computes S−1, using a linear regression at each stage.
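Concretely, the partitioned-inverse identities behind this observation are, with Θ = W−1 partitioned as above,

\[
\theta_{22} = \bigl(w_{22} - w_{12}^{T}W_{11}^{-1}w_{12}\bigr)^{-1},
\qquad
\theta_{12} = -W_{11}^{-1}w_{12}\,\theta_{22}.
\]

With ρ = 0, each step computes the regression coefficients β̂ = W11−1w12 of one variable on the rest; the same identities recover Θ from W and the β̂'s after convergence, since θ22 = 1/(w22 − w12ᵀβ̂) and θ12 = −β̂θ22.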
REMARK 2.2 If the diagonal elements are left out of the penalty in (2.1), the solution for wii is simply sii; otherwise the algorithm is the same as before.
3. TIMING COMPARISONS
We simulated Gaussian data from both sparse and dense scenarios, for a range of problem sizes p. The sparse scenario is the AR(1) model taken from Yuan and Lin (2007): (Σ−1)ii = 1, (Σ−1)i,i−1 = (Σ−1)i−1,i = 0.5, and 0 otherwise. In the dense scenario, (Σ−1)ii = 2 and (Σ−1)ii′ = 1 otherwise.
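A short numpy sketch of such a simulation, under the scenario definitions above (the helper names and the sample size are illustrative):

```python
import numpy as np

def ar1_precision(p):
    """Sparse scenario: (Sigma^-1)_ii = 1, first off-diagonals 0.5."""
    return np.eye(p) + 0.5 * np.eye(p, k=1) + 0.5 * np.eye(p, k=-1)

def dense_precision(p):
    """Dense scenario: (Sigma^-1)_ii = 2, all other entries 1."""
    return np.ones((p, p)) + np.eye(p)

def empirical_cov(Theta, n, seed=0):
    """Draw n Gaussian observations with covariance Sigma = Theta^-1
    and return their empirical covariance matrix S."""
    rng = np.random.default_rng(seed)
    Sigma = np.linalg.inv(Theta)
    X = rng.multivariate_normal(np.zeros(len(Theta)), Sigma, size=n)
    return np.cov(X, rowvar=False)
```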
We compared the graphical lasso to the COVSEL program provided by Banerjee and others (2007). This is a Matlab program, with a loop that calls a C language code to do the box-constrained QP for each column of the solution matrix. To be as fair as possible to COVSEL, we only counted the CPU time spent in the C program. We set the maximum number of outer iterations to 30 and, following the authors' code, set the duality gap for convergence to 0.1.
The number of CPU seconds for each trial is shown in Table 1. The algorithm took between 2 and 8 iterations of the outer loop. In the sparse scenarios for p = 200 and 400, COVSEL had not converged by 30 iterations. We see that the graphical lasso is 30–4000 times faster than COVSEL and only about 2–10 times slower than the approximate method.
Table 1. Timings (seconds) for the graphical lasso, Meinshausen–Bühlmann approximation, and COVSEL procedures
p | Problem type | (1) Graphical lasso | (2) Approximation | (3) COVSEL | Ratio of (3) to (1) |
100 | Sparse | 0.014 | 0.007 | 34.7 | 2476.4 |
100 | Dense | 0.053 | 0.018 | 2.2 | 40.9 |
200 | Sparse | 0.050 | 0.027 | > 205.35 | > 4107 |
200 | Dense | 0.497 | 0.146 | 16.9 | 33.9 |
400 | Sparse | 1.23 | 0.193 | > 1616.7 | > 1314.3 |
400 | Dense | 6.2 | 0.752 | 313.0 | 50.5 |
Figure 1 shows the number of CPU seconds required for the graphical lasso procedure, for problem sizes up to 1000. Even the largest problem (p = 1000) is solved in at most a minute.
Fig. 1. Number of CPU seconds required for the graphical lasso procedure.
Table 2. Timing results for the dense scenario, p = 400, for different values of the regularization parameter ρ. The middle column is the fraction of nonzero coefficients
ρ | Fraction nonzero | CPU time (s) |
0.01 | 0.96 | 26.7 |
0.03 | 0.62 | 8.5 |
0.06 | 0.36 | 4.1 |
0.60 | 0.00 | 0.4 |
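Using the hypothetical helpers sketched earlier, a sweep of this kind might look as follows; the exact fractions depend on the simulated data, and the pure-Python sketch is far slower than the glasso package at p = 400, so this only illustrates the loop:

```python
# Dense scenario, p = 400; n = 1000 draws is an illustrative choice.
S = empirical_cov(dense_precision(400), n=1000)
off_diag = ~np.eye(400, dtype=bool)
for rho in (0.01, 0.03, 0.06, 0.60):
    W, Theta = graphical_lasso(S, rho)
    # fraction of nonzero off-diagonal entries of Theta
    print(rho, (Theta[off_diag] != 0.0).mean())
```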
4. ANALYSIS OF CELL-SIGNALING DATA
For illustration, we analyze a flow cytometry data set on p = 11 proteins and n = 7466 cells, from Sachs and others (2003). These authors fit a directed acyclic graph (DAG) to the data, producing the network in Figure 2.
Fig. 2. Directed acyclic graph from cell-signaling data, from Sachs and others (2003).
The result of applying the graphical lasso to these data is shown in Figure 3, for 12 different values of the penalty parameter ρ. There is moderate agreement between, for example, the graph for L1 norm = 0.00496 and the DAG: the former has about half of the edges and nonedges that appear in the DAG. Figure 4 shows the lasso coefficients as a function of the total L1 norm of the coefficient vector.
Fig. 3. Cell-signaling data: undirected graphs from the graphical lasso with different values of the penalty parameter ρ.
Fig. 4. Cell-signaling data: profile of coefficients as the total L1 norm of the coefficient vector increases, that is, as ρ decreases. Profiles for the largest coefficients are labeled with the corresponding pair of proteins.
In the left panel of Figure 5, we tried 2 different kinds of 10-fold cross-validation for estimation of the parameter ρ. In the “regression” approach, we fit the graphical lasso to nine-tenths of the data and used the penalized regression model for each protein to predict the value of that protein in the validation set. We then averaged the squared prediction errors over all 11 proteins. In the “likelihood” approach, we again applied the graphical lasso to nine-tenths of the data and then evaluated the log-likelihood (2.1) over the validation set. The 2 cross-validation curves indicate that the unregularized model is the best, not surprising given the large number of observations and relatively small number of parameters. However, we also see that the likelihood approach is far less variable than the regression method.
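As a sketch, the likelihood criterion amounts to evaluating the unpenalized part of (2.1) on the held-out fold; here S_val is assumed to be the empirical covariance of the validation data, and the score would be averaged over the 10 folds:

```python
def validation_loglik(Theta_train, S_val):
    """Unpenalized part of criterion (2.1) on the validation fold:
    log det(Theta) - tr(S_val Theta); higher values are better."""
    _, logdet = np.linalg.slogdet(Theta_train)
    return logdet - np.trace(S_val @ Theta_train)
```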
Fig. 5. Cell-signaling data. The left panel shows 10-fold cross-validation using both regression and likelihood approaches (details in text). The right panel compares the regression sum of squares of the exact graphical lasso approach to the Meinshausen–Bühlmann approximation.
The right panel compares the cross-validated sum of squares of the exact graphical lasso approach to the Meinshausen–Bühlmann approximation. For lightly regularized models, the exact approach has a clear advantage.
5. DISCUSSION
We have presented a simple and fast algorithm for estimation of a sparse inverse covariance matrix using an L1 penalty. It cycles through the variables, fitting a modified lasso regression to each variable in turn. The individual lasso problems are solved by coordinate descent.
The speed of this new procedure should facilitate the application of sparse inverse covariance procedures to large data sets involving thousands of parameters.
An R language package glasso is available on the third author's Web site.
FUNDING
National Science Foundation (DMS-97-64431 to J.F., DMS-0505676 to T. H., DMS-9971405 to R.T.); National Institutes of Health (2R01 CA 72028-07 to T. H.); National Institutes of Health Contract (N01-HV-28183 to R.T.).
ACKNOWLEDGMENTS
We thank the authors of Banerjee and others (2007) for making their COVSEL program publicly available, Larry Wasserman for helpful discussions, and an editor and 2 referees for comments that led to improvements in the manuscript.
Footnotes
We note that while most authors use this formulation, Yuan and Lin (2007) omit the diagonal elements from the penalty.
The corresponding expression in Banerjee and others (2007) does not have the leading 1/2 and has a factor of 1/2 in b. We have written it in this equivalent form to avoid factors of 1/2 later.