Summary
Graphical lasso is one of the most used estimators for inferring genetic networks. Despite its popularity, there are several fields in applied research where the limits of detection of modern measurement technologies make the use of this estimator theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. Typical examples are data generated by polymerase chain reactions and flow cytometry. The combination of censoring and high dimensionality makes inference of the underlying genetic networks from these data very challenging. In this article, we propose an $\ell_1$-penalized Gaussian graphical model for censored data and derive two EM-like algorithms for inference. We evaluate the computational efficiency of the proposed algorithms by an extensive simulation study and show that, when censored data are available, our proposal is superior to existing competitors both in terms of network recovery and parameter estimation. We apply the proposed method to gene expression data generated by microfluidic Reverse Transcription quantitative Polymerase Chain Reaction technology in order to make inference on the regulatory mechanisms of blood development. A software implementation of our method is available on GitHub (https://github.com/LuigiAugugliaro/cglasso).
1. Introduction
An important aim in genomics is to understand interactions among genes, characterized by the regulation and synthesis of proteins under internal and external signals. These relationships can be represented by a genetic network, i.e. a graph where nodes represent genes and edges describe the interactions among them. Genetic networks can be used to gain new insights into the activity of biological pathways and to deduce unknown functions of genes from their dependence on other genes.
Gaussian graphical models (Lauritzen, 1996) have been widely used for reconstructing a genetic network from expression data. The reason for their widespread use lies in the statistical properties of the multivariate Gaussian distribution, which allow the topological structure of a network to be related to the non-zero elements of the concentration matrix, i.e. the inverse of the covariance matrix. Thus, the problem of network inference can be recast as the problem of estimating a concentration matrix. The graphical lasso (Yuan and Lin, 2007) is a popular method for estimating a sparse concentration matrix, based on the idea of adding an $\ell_1$-penalty to the likelihood function of the multivariate Gaussian distribution. Nowadays, this estimator is widely used in applied research (e.g. Menéndez and others, 2010; Vinciotti and others, 2016) and widely studied in the computational as well as the theoretical literature (e.g. Bickel and Levina, 2008; Friedman and others, 2008; Witten and others, 2011). The interested reader is referred to Augugliaro and others (2016) for an extensive review.
Despite the widespread literature on the graphical lasso estimator, there are many fields in applied research where modern measurement technologies make the use of this graphical model theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. A first example of this is Reverse Transcription quantitative Polymerase Chain Reaction (RT-qPCR), a popular technology for gene expression profiling (Derveaux and others, 2010). This technique is used to measure the expression of a set of target genes in a given sample through repeated cycles of sequence-specific DNA amplification followed by expression measurements. The cycle at which the observed expression first exceeds a fixed threshold is commonly called the cycle-threshold (McCall and others, 2014). If a target is not expressed, the threshold is not reached after the maximum number of cycles (limit of detection) and the corresponding cycle-threshold is undetermined. For this reason, the resulting data are naturally right-censored (McCall and others, 2014; Pipelers and others, 2017). Another example is given by the flow cytometer, which is an essential tool in the diagnosis of diseases such as acute leukemias and malignant lymphomas (Brown and Wittwer, 2000). A flow cytometer measures a limited range of signal strength and records each marker value within a fixed range, such as between 0 and 1023. If a measurement falls outside this range, then the value is replaced by the nearest legitimate value; that is, a value smaller than 0 is censored to 0 and a value larger than 1023 is censored to 1023. A direct application of the graphical lasso to network inference from data such as these is theoretically unfounded, since it does not consider the effects of the censoring mechanism on the estimator of the concentration matrix.
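The clipping behavior just described is easy to mimic; the following sketch (purely illustrative, with made-up signal values) shows how measurements outside the instrument range [0, 1023] are recorded as the nearest legitimate value, producing left- and right-censored observations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent signal intensities; some fall outside the
# instrument's measurable range [0, 1023].
latent = rng.normal(loc=500.0, scale=400.0, size=1000)

lower, upper = 0.0, 1023.0
observed = np.clip(latent, lower, upper)   # values recorded by the instrument
left_censored = latent < lower             # recorded as 0
right_censored = latent > upper            # recorded as 1023
```

Only `observed` and the censoring indicators would be available in practice; the latent values are never seen, which is exactly what the censored-likelihood machinery developed below accounts for.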
In this article, we propose an extension of the graphical lasso estimator that takes into account the censoring mechanism of the data explicitly. Our work can be related to Städler and Bühlmann (2012), who propose an $\ell_1$-penalized estimator of the inverse covariance matrix of a multivariate Gaussian model based on the assumption that the data are missing at random. As we shall see later in this article, failure to take into account the censoring mechanism leads to poor behavior of this approach when compared with our proposal. Our proposal can also be related to the work of Perkins and others (2013), Hoffman and Johnson (2015), and Pesonen and others (2015), who provide a maximum likelihood estimator of the covariance matrix under left-censoring. However, these works do not address the estimation of the precision matrix under a sparsity assumption and for this reason they are applicable only when the sample size is larger than the number of nodes in the network.
The remaining part of this article is structured as follows. In Section 2, we extend the notion of Gaussian graphical model to censored data and in Section 3, we propose the extension of the graphical lasso estimator to a censored graphical lasso estimator and two Expectation-Maximization (EM) algorithms for inference of parameters in the censored Gaussian graphical model. Section 4 is devoted to the evaluation of the behavior of the proposed algorithms and the comparison of the proposed estimator with existing competitors. Finally, in Section 5, we study a real dataset and in Section 6, we draw some conclusions.
2. The censored Gaussian graphical model
In order to describe the technical details of the proposed method, let $X = (X_1, \ldots, X_p)^\top$ be a $p$-dimensional random vector. Graphical models allow the set of conditional independencies among these random variables to be represented by a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ is the set of nodes associated to $X$ and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of ordered pairs, called edges, representing the conditional dependencies among the random variables (Lauritzen, 1996).

The Gaussian graphical model is a member of this class of models based on the assumption that $X$ follows a multivariate Gaussian distribution with expected value $\mu = (\mu_1, \ldots, \mu_p)^\top$ and covariance matrix $\Sigma = (\sigma_{hk})$. Denoting with $\Theta = (\theta_{hk}) = \Sigma^{-1}$ the concentration matrix, i.e. the inverse of the covariance matrix, the density function of $X$ can be written as

$$\phi(x; \mu, \Theta) = (2\pi)^{-p/2} |\Theta|^{1/2} \exp\left\{-\tfrac{1}{2}(x - \mu)^\top \Theta (x - \mu)\right\}. \tag{2.1}$$

As shown in Lauritzen (1996), the pattern of non-zero elements in $\Theta$ defines the corresponding graph, namely the undirected edge $(h, k)$ is an element of the edge set of the corresponding conditional independence graph if and only if $\theta_{hk} \neq 0$.
Let $X$ be a (partially) latent random vector with density function (2.1). In order to include the censoring mechanism inside our framework, let us denote by $l = (l_1, \ldots, l_p)^\top$ and $u = (u_1, \ldots, u_p)^\top$, with $l_h < u_h$ for $h = 1, \ldots, p$, the vectors of known left and right censoring values. Thus, $X_h$ is observed only if it is inside the interval $[l_h, u_h]$; otherwise it is censored from below if $X_h < l_h$ or censored from above if $X_h > u_h$. Under this setting, a rigorous definition of the joint distribution of the observed data can be obtained using the approach for missing data with ignorable mechanism (Little and Rubin, 2002). This requires the specification of the distribution of a $p$-dimensional random vector, denoted by $R(X; l, u)$, used to encode the censoring patterns. Formally, the $h$th element of $R(X; l, u)$ is defined as $R(X_h; l_h, u_h) = I(X_h > u_h) - I(X_h < l_h)$, where $I(\cdot)$ denotes the indicator function. By construction, $R(X; l, u)$ is a discrete random vector with support in the set $\{-1, 0, +1\}^p$ and probability function

$$P\{R(X; l, u) = r\} = \int_{D_r} \phi(x; \mu, \Theta)\,\mathrm{d}x,$$

where $D_r = D_{r_1} \times \cdots \times D_{r_p}$, with $D_{r_h} = (-\infty, l_h)$ if $r_h = -1$, $D_{r_h} = [l_h, u_h]$ if $r_h = 0$, and $D_{r_h} = (u_h, +\infty)$ if $r_h = +1$.
Given a censoring pattern, we can simplify our notation by partitioning the set $\{1, \ldots, p\}$ into the sets $o = \{h : r_h = 0\}$, $c_l = \{h : r_h = -1\}$ and $c_u = \{h : r_h = +1\}$ and, in the following of this article, we shall use the convention that a vector indexed by a set of indices denotes the corresponding subvector. For example, the subvector of observed elements in $x$ is denoted by $x_o$ and, consequently, the observed data is the vector $(x_o^\top, r^\top)^\top$. As explained in Little and Rubin (2002), the joint probability distribution of the observed data, denoted by $\varphi(x_o, r; \mu, \Theta)$, is obtained by integrating $x_{c_l}$ and $x_{c_u}$ out of the joint distribution of $X$ and $R(X; l, u)$, which can be written as the product of the density function (2.1) and the conditional distribution of $R(X; l, u)$ given $X = x$. As shown in Schafer (1997), this results in

$$\varphi(x_o, r; \mu, \Theta) = \int_{D_{c_l}} \int_{D_{c_u}} \phi(x_o, x_{c_l}, x_{c_u}; \mu, \Theta)\,\mathrm{d}x_{c_l}\,\mathrm{d}x_{c_u}, \tag{2.2}$$

where $D_{c_l} = \prod_{h \in c_l} (-\infty, l_h)$ and $D_{c_u} = \prod_{h \in c_u} (u_h, +\infty)$. The density function (2.2) is used in Lee and Scott (2012) inside the framework of a mixture of multivariate Gaussian distributions with censored data. Using (2.2) the censored Gaussian graphical model can be formally defined.
Definition 1. Let $X$ be a $p$-dimensional Gaussian random vector whose density function factorizes according to an undirected graph $\mathcal{G}$ and let $R(X; l, u)$ be the $p$-dimensional random censoring-data indicator defined by the censoring vectors $l$ and $u$. The censored Gaussian graphical model (cGGM) is defined to be the set $\{X, R(X; l, u), \mathcal{G}\}$.
A closer look at Definition 1 reveals that the proposed notion of censored Gaussian graphical model is characterized by a high degree of generality, since it also covers the special cases of the classical Gaussian graphical model ($l_h = -\infty$ and $u_h = +\infty$, for any $h$), and the cases in which there is only left-censored data ($u_h = +\infty$ for any $h$) or right-censored data ($l_h = -\infty$ for any $h$).
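As an illustration of the objects introduced in this section, the sketch below (hypothetical data; the $-1/0/+1$ encoding is one natural choice for the censoring indicator) computes a censoring pattern and the induced partition of the variable indices:

```python
import numpy as np

# Hypothetical censoring limits and one latent realization (p = 5).
l = np.zeros(5)            # left-censoring values l_h
u = np.full(5, 40.0)       # right-censoring values u_h
x = np.array([12.3, 45.0, -1.0, 39.9, 41.2])

# Censoring indicator: -1 left-censored, 0 observed, +1 right-censored.
r = np.where(x <= l, -1, np.where(x >= u, 1, 0))

# Partition of the variable indices into observed and censored sets.
o  = np.flatnonzero(r == 0)    # observed components
cl = np.flatnonzero(r == -1)   # left-censored components
cu = np.flatnonzero(r == 1)    # right-censored components
# r == [0, 1, -1, 0, 1]
```

Only the subvector `x[o]`, together with `r`, constitutes the observed data in the sense of (2.2).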
3. $\ell_1$-Penalized estimator for the censored Gaussian graphical model
3.1. The censored graphical lasso estimator
Consider a sample of $n$ independent observations drawn from the censored Gaussian graphical model $\{X, R(X; l, u), \mathcal{G}\}$. For ease of exposition, we shall assume that $l$ and $u$ are fixed across the observations, but the extension to the case where the censoring vectors are specific to each observation is straightforward and does not require a specific treatment. If these values are unknown, we suggest using the smallest and largest observed value of each variable as estimates of $l$ and $u$, respectively.
Let $x_i = (x_{i1}, \ldots, x_{ip})^\top$ be the $i$th realization of the random vector $X$. Then the $i$th observed data is the vector $(x_{i,o_i}^\top, r_i^\top)^\top$, with $r_i = R(x_i; l, u)$. Using the density function (2.2), the observed log-likelihood function can be written as

$$\ell(\mu, \Theta) = \sum_{i=1}^{n} \log \varphi(x_{i,o_i}, r_i; \mu, \Theta), \tag{3.1}$$

where, as before, the set $\{1, \ldots, p\}$ is partitioned into $o_i$, $c_{l,i}$ and $c_{u,i}$ according to the censoring pattern $r_i$. Although inference about the parameters of this model can be carried out via the maximum likelihood method, the application of this inferential procedure to real datasets, such as the gene expression data described in Section 5, is limited for three main reasons. Firstly, the number of measured variables is larger than the sample size, and this implies the non-existence of the maximum likelihood estimator even when the dataset is fully observed. Secondly, even when the sample size is large enough, the maximum likelihood estimator will exhibit a very high variance (Uhler, 2012). Thirdly, empirical evidence suggests that gene networks, or more general biochemical networks, are not fully connected (Gardner and others, 2003). In terms of Gaussian graphical models, this evidence translates into the assumption that $\Theta$ has a sparse structure, i.e. only few $\theta_{hk}$ are different from zero, which is not obtained by a direct (or indirect) maximization of the observed log-likelihood function (3.1).

All that considered, in this article, we propose to estimate the parameters of the censored Gaussian graphical model by generalizing the approach proposed in Yuan and Lin (2007), i.e. by maximizing a new objective function defined by adding a lasso-type penalty function to the observed log-likelihood (3.1). The resulting estimator, called censored graphical lasso (cglasso), is formally defined as

$$\{\hat\mu^\rho, \hat\Theta^\rho\} = \arg\max_{\mu,\,\Theta}\ \ell(\mu, \Theta) - \rho \sum_{h \neq k} |\theta_{hk}|. \tag{3.2}$$

Like in the standard graphical lasso estimator, the nonnegative tuning parameter $\rho$ is used to control the amount of sparsity in the estimated concentration matrix $\hat\Theta^\rho = (\hat\theta_{hk}^\rho)$ and, consequently, in the corresponding estimated graph $\hat{\mathcal{G}}^\rho = \{\mathcal{V}, \hat{\mathcal{E}}^\rho\}$, where $\hat{\mathcal{E}}^\rho = \{(h, k) : \hat\theta_{hk}^\rho \neq 0\}$. When $\rho$ is large enough, some $\hat\theta_{hk}^\rho$ are shrunken to zero, resulting in the removal of the corresponding links in $\hat{\mathcal{G}}^\rho$; on the other hand, when $\rho$ is equal to zero and the sample size is large enough, the estimator coincides with the maximum likelihood estimator of the concentration matrix, which implies a fully connected estimated graph.
3.2. Fitting the censored graphical lasso model
Using known results about the multivariate Gaussian distribution, the conditional distribution of $X_{c_i}$ given $X_{o_i} = x_{i,o_i}$, where $c_i = c_{l,i} \cup c_{u,i}$, is also a multivariate Gaussian distribution with concentration matrix $\Theta_{c_i c_i}$ and conditional expected value equal to $\mu_{c_i \mid o_i} = \mu_{c_i} - \Theta_{c_i c_i}^{-1} \Theta_{c_i o_i} (x_{i,o_i} - \mu_{o_i})$. As we shall show in the next theorem, the conditions characterizing the cglasso estimator are based on the first and second moments of this Gaussian distribution truncated over the region $D_{c_{l,i}} \times D_{c_{u,i}}$. To this end, we let

$$\hat x_{ih} = E(X_{ih} \mid x_{i,o_i}, r_i) \quad \text{and} \quad \hat x_{ihk} = E(X_{ih} X_{ik} \mid x_{i,o_i}, r_i),$$

where $E(\cdot \mid x_{i,o_i}, r_i)$ denotes the expected value computed using the conditional distribution of $X_{c_i}$ given $x_{i,o_i}$ truncated over $D_{c_{l,i}} \times D_{c_{u,i}}$, and where $\hat x_{ih} = x_{ih}$ and $\hat x_{ihk} = x_{ih} x_{ik}$ whenever the corresponding components are observed. Finally, we let $\hat x_i = (\hat x_{i1}, \ldots, \hat x_{ip})^\top$, $\bar x = n^{-1} \sum_{i=1}^n \hat x_i$, $\hat X_i = (\hat x_{ihk})$, and $\bar S = n^{-1} \sum_{i=1}^n \hat X_i - \bar x \bar x^\top$. Using this notation, Theorem 1 gives the Karush–Kuhn–Tucker conditions for the proposed estimator.
Theorem 1. Necessary and sufficient conditions for $\{\hat\mu, \hat\Theta\}$ to be the solution of the maximization problem (3.2) are

$$\bar x - \hat\mu = 0, \qquad \hat\Theta^{-1} - \bar S - \rho\,\hat\Gamma = 0, \tag{3.3}$$

where $\bar x$ and $\bar S$ are computed at $\{\hat\mu, \hat\Theta\}$ and $\hat\Gamma = (\hat\gamma_{hk})$, with $\hat\gamma_{hh} = 0$ and, for $h \neq k$, $\hat\gamma_{hk}$ the subgradient of the absolute value function at $\hat\theta_{hk}$, i.e., $\hat\gamma_{hk} = \operatorname{sign}(\hat\theta_{hk})$ if $\hat\theta_{hk} \neq 0$ and $\hat\gamma_{hk} \in [-1, +1]$ if $\hat\theta_{hk} = 0$.
A proof of Theorem 1 is reported in the supplementary material available at Biostatistics online. The stationary conditions (3.3) show that, while in the standard graphical lasso the parameter $\mu$ can be estimated by the empirical average regardless of $\rho$, so that inference about the concentration matrix can be carried out using the profile log-likelihood function, inside our framework the two inferential problems cannot be separated, since the tuning parameter also affects the estimator of the expected value. Furthermore, the conditions suggest that, for a given value of the tuning parameter, the cglasso estimator can be computed using the EM algorithm (Dempster and others, 1977). This algorithm is based on the idea of repeating two steps until a convergence criterion is met. The first step, called E-Step, requires the calculation of the conditional expected value of the complete log-likelihood function using the current estimates. The resulting function, called the $Q$-function, is maximized in the second step, i.e. the M-Step. As explained in McLachlan and Krishnan (2008), the EM algorithm can be significantly simplified when the complete probability density function is a member of the regular exponential family. In this case, the E-Step simply requires the computation of the conditional expected values of the sufficient statistics. In our case, if we denote by $\{\hat\mu^{(0)}, \hat\Theta^{(0)}\}$ an initial estimate of the parameters, the E-Step reduces to the computation of $\hat x_{ih}$ and $\hat x_{ihk}$, for $i = 1, \ldots, n$. From this, conditions (3.3) can be written as

$$\bar x^{(0)} - \hat\mu = 0, \qquad \hat\Theta^{-1} - \bar S^{(0)} - \rho\,\hat\Gamma = 0, \tag{3.4}$$

where $\bar x^{(0)}$ and $\bar S^{(0)}$ are computed using the current estimates; these are the stationary conditions of a standard graphical lasso problem (Witten and others, 2011) with $\bar S^{(0)}$ used as the current estimate of the empirical covariance matrix. The conditions (3.4) imply that in the M-Step the parameter $\mu$ is estimated by $\hat\mu = \bar x^{(0)}$, while $\Theta$ is estimated by solving the following maximization problem:

$$\hat\Theta = \arg\max_{\Theta}\ \log|\Theta| - \operatorname{tr}(\bar S^{(0)} \Theta) - \rho \sum_{h \neq k} |\theta_{hk}|, \tag{3.5}$$

which is a standard graphical lasso problem. The following steps summarize the proposed EM algorithm for the derivation of the cglasso estimator:
1. Let $\{\hat\mu^{(0)}, \hat\Theta^{(0)}\}$ be initial estimates;
2. E-step: compute $\bar x^{(m)}$ and $\bar S^{(m)}$ using the current estimates $\{\hat\mu^{(m)}, \hat\Theta^{(m)}\}$;
3. M-step: let $\hat\mu^{(m+1)} = \bar x^{(m)}$ and solve the problem (3.5) using $\bar S^{(m)}$ to obtain $\hat\Theta^{(m+1)}$;
4. If a convergence criterion is met then return $\{\hat\mu^{(m+1)}, \hat\Theta^{(m+1)}\}$, else let $m \leftarrow m + 1$;
5. Repeat steps 2–4.
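To make the E/M alternation concrete, the following self-contained sketch applies the same scheme to the simplest possible case: a single right-censored Gaussian variable with no penalty. The E-step replaces each censored value by the first and second moments of the truncated conditional distribution, and the M-step updates $\mu$ and $\sigma^2$ from the expected sufficient statistics. The function is our own illustration, not the implementation used in the article:

```python
import numpy as np
from scipy.stats import norm

def em_censored_gaussian(x, u, n_iter=200):
    """EM estimate of (mu, sigma2) for N(mu, sigma2) right-censored at u.

    x : data vector in which censored entries are recorded as u.
    """
    cens = x >= u
    mu, sigma2 = x.mean(), x.var() + 1e-6    # crude starting values
    for _ in range(n_iter):
        sigma = np.sqrt(sigma2)
        alpha = (u - mu) / sigma
        # Inverse Mills ratio of the upper tail.
        lam = norm.pdf(alpha) / norm.sf(alpha)
        # E-step: moments of X | X > u for the truncated Gaussian.
        ex = mu + sigma * lam                           # E(X | X > u)
        ex2 = mu**2 + sigma2 + (u + mu) * sigma * lam   # E(X^2 | X > u)
        xhat = np.where(cens, ex, x)
        x2hat = np.where(cens, ex2, x**2)
        # M-step: update the parameters from expected sufficient statistics.
        mu = xhat.mean()
        sigma2 = x2hat.mean() - mu**2
    return mu, sigma2
```

On data simulated with right censoring, the procedure recovers the latent mean and variance, whereas the naive sample moments of the censored data are biased downward.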
Although the maximization problem (3.5) can be efficiently solved using, for example, the algorithms proposed by Friedman and others (2008), Rothman and others (2008), or Witten and others (2011), the previous steps reveal that the main computational cost of the proposed EM algorithm comes from the evaluation of the moments of the multivariate truncated Gaussian distribution, which are needed to compute $\bar x$ and $\bar S$. These can be computed using the methods proposed by Lee (1983), Leppard and Tallis (1989), or Arismendi (2013), but these methods require complex numerical algorithms for the calculation of the integral of the multivariate normal density function (see Genz and Bretz, 2002, for a review), and they are computationally infeasible even for moderate size problems. A possible solution is to approximate the moments using Monte Carlo methods. Our preliminary study, however, shows that the Monte Carlo error increases the number of EM steps required for convergence of the proposed algorithm, thus removing the gain from the faster calculation of the moments. Taking all of this into consideration, we propose a different approximate EM algorithm. In particular, following the idea proposed by Guo and others (2015), we approximate the quantities $\hat x_{ihk}$, for any $h \neq k$, by:

$$\hat x_{ihk} \approx \hat x_{ih}\, \hat x_{ik}. \tag{3.6}$$

As shown by Guo and others (2015), the approach works well when the real concentration matrix is sparse or the tuning parameter is sufficiently large. The main advantage coming from the approximation (3.6) is that now, in order to evaluate the quantities $\bar x$ and $\bar S$, we only need to compute the univariate moments $\hat x_{ih}$ and $\hat x_{ihh}$, for $h \in c_i$. These quantities can be computed by using exact formulas (Johnson and others, 1994) and require only the evaluation of the cumulative distribution function of the univariate Gaussian distribution. In the following of this article, we denote by $\tilde S$ the approximate estimate of the empirical covariance matrix resulting from this approach. By an extensive simulation study, in Section 4.1, we shall study the behavior of the proposed algorithm based on the usage of the matrix $\tilde S$ in Steps 2 and 3.
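The univariate moments needed by the approximate E-step are standard truncated-Gaussian results (see Johnson and others, 1994); a sketch for a right-censored component, i.e. conditional on $X > c$, is given below, cross-checked against scipy's truncated normal. The helper name is our own:

```python
import numpy as np
from scipy.stats import norm, truncnorm

def upper_tail_moments(mu, sigma, c):
    """First two moments of X ~ N(mu, sigma^2) conditional on X > c."""
    alpha = (c - mu) / sigma
    lam = norm.pdf(alpha) / norm.sf(alpha)          # inverse Mills ratio
    m1 = mu + sigma * lam                           # E(X | X > c)
    m2 = mu**2 + sigma**2 + (c + mu) * sigma * lam  # E(X^2 | X > c)
    return m1, m2

# Cross-check against scipy's truncated-normal distribution.
mu, sigma, c = 1.0, 2.0, 2.5
m1, m2 = upper_tail_moments(mu, sigma, c)
tn = truncnorm((c - mu) / sigma, np.inf, loc=mu, scale=sigma)
```

Only `norm.pdf` and the survival function `norm.sf` are evaluated, in line with the claim that the approximate E-step requires nothing beyond the univariate Gaussian distribution function.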
As with the graphical lasso, our proposed method requires a sequence of $\rho$-values, which should be suitably defined so as to reduce the computational cost needed to compute the entire path of the estimated parameters. Theorem 2 gives the exact formula for the derivation of the largest $\rho$-value, denoted by $\rho_{\max}$, and the corresponding cglasso estimator.
Theorem 2. For any index $h = 1, \ldots, p$, define the sets $o_h = \{i : r_{ih} = 0\}$, $c_{l,h} = \{i : r_{ih} = -1\}$ and $c_{u,h} = \{i : r_{ih} = +1\}$, and compute the marginal maximum likelihood estimates $\{\hat\mu_h, \hat\sigma_h^2\}$ maximizing:

$$\ell_h(\mu_h, \sigma_h^2) = \sum_{i \in o_h} \log \phi(x_{ih}; \mu_h, \sigma_h^2) + |c_{l,h}| \log \Phi\!\left(\frac{l_h - \mu_h}{\sigma_h}\right) + |c_{u,h}| \log\left\{1 - \Phi\!\left(\frac{u_h - \mu_h}{\sigma_h}\right)\right\}.$$

Then $\rho_{\max} = \max_{h \neq k} |\bar s_{hk}|$, where $\bar s_{hk}$ are the elements of the matrix $\bar S$ computed using $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_p)^\top$ and $\hat\Theta = \operatorname{diag}(1/\hat\sigma_1^2, \ldots, 1/\hat\sigma_p^2)$. Furthermore, $\hat\mu$ and $\hat\Theta$ are the corresponding cglasso estimators.
A proof of Theorem 2 is reported in the supplementary material available at Biostatistics online. Theorem 2 shows how the maximum value of the tuning parameter can be efficiently computed using $p$ univariate optimization problems which do not involve the tuning parameter, and then calculating the matrix $\bar S$. Once $\rho_{\max}$ is calculated, the entire path of the cglasso estimates is computed as follows:
1. Compute $\rho_{\max}$ as specified in Theorem 2 and let $\rho_{\min}$ be the smallest $\rho$-value;
2. Compute a decreasing sequence of $K$ distinct $\rho$-values starting from $\rho_1 = \rho_{\max}$ down to $\rho_K = \rho_{\min}$;
3. for $k = 2$ to $K$ do
4. Let $\hat\mu^{(0)} = \hat\mu^{\rho_{k-1}}$ and $\hat\Theta^{(0)} = \hat\Theta^{\rho_{k-1}}$;
5. Use the EM algorithm to compute $\{\hat\mu^{\rho_k}, \hat\Theta^{\rho_k}\}$ with $\{\hat\mu^{(0)}, \hat\Theta^{(0)}\}$ as starting values;
6. end for
The previous description shows that the entire path can be computed using the estimates obtained for a given $\rho$-value as warm starts for fitting the next cglasso model. This strategy is also commonly used in other efficient lasso algorithms and R packages (Friedman and others, 2010; Friedman and others, 2018). In our model, this strategy turns out to be remarkably efficient since, using a sufficiently fine sequence of $\rho$-values, the starting values defined in Step 4 will be sufficiently close to the estimates computed in Step 5, thus increasing the speed of convergence of the resulting algorithm.
3.3. Tuning parameter selection
The tuning parameter plays a central role in the proposed cglasso estimator, since it is designed to control the complexity of the topological structure of the estimated graph. In this article, we propose to select the optimal $\rho$-value of the cglasso estimator by using the extended Bayesian Information Criterion (eBIC) (Foygel and Drton, 2010). For our proposed estimator, this is given by:

$$\mathrm{eBIC}_\gamma(\rho) = -2\,\ell(\hat\mu^\rho, \hat\Theta^\rho) + a(\rho) \log n + 4\, a(\rho)\, \gamma \log p,$$

where $\hat\mu^\rho$ and $\hat\Theta^\rho$ are the maximum likelihood estimates of the Gaussian graphical model specified by $\hat{\mathcal{E}}^\rho$, and $a(\rho)$ denotes the number of nonzero off-diagonal estimates of $\hat\Theta^\rho$. The measure is indexed by the parameter $\gamma \in [0, 1]$, with $\gamma = 0$ corresponding to the classical BIC measure. Although a grid search can be performed to select the $\rho$-value that minimizes $\mathrm{eBIC}_\gamma$, the computational burden related to the evaluation of the log-likelihood function can make this strategy infeasible even for moderate size problems. For this reason, following Ibrahim and others (2008), we propose to select the $\rho$-value by minimizing the following approximate measure:

$$\mathrm{eBIC}_\gamma^{Q}(\rho) = -2\,Q(\hat\mu^\rho, \hat\Theta^\rho) + a(\rho) \log n + 4\, a(\rho)\, \gamma \log p,$$

which is defined by substituting the exact log-likelihood function with the $Q$-function used in the M-Step of the proposed algorithm, and which is easily obtained as a byproduct of the EM algorithm. In the supplementary material available at Biostatistics online, a simulation study is reported where we compare the behavior of the $\mathrm{eBIC}_\gamma$ and $\mathrm{eBIC}_\gamma^{Q}$ measures. The results show that the two criteria are equivalent in terms of false discovery rate and true positive rate. Furthermore, as suggested in Foygel and Drton (2010), the optimal results are obtained when $\gamma = 0.5$.
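As a concrete illustration of the selection rule (our own helper functions; the exact Gaussian log-likelihood is used here in place of the $Q$-function), the extended BIC of Foygel and Drton (2010) can be sketched as:

```python
import numpy as np

def gaussian_loglik(X, mu, Theta):
    """Log-likelihood of an n x p sample under N(mu, Theta^{-1})."""
    n, p = X.shape
    Xc = X - mu
    sign, logdet = np.linalg.slogdet(Theta)
    S = Xc.T @ Xc / n
    return 0.5 * n * (logdet - np.trace(S @ Theta) - p * np.log(2 * np.pi))

def ebic(X, mu, Theta, gamma=0.5):
    """Extended BIC for a Gaussian graphical model (Foygel and Drton, 2010)."""
    n, p = X.shape
    # Edges: nonzero off-diagonal entries, counted once per symmetric pair.
    k = np.count_nonzero(np.triu(Theta, 1))
    return -2 * gaussian_loglik(X, mu, Theta) + k * np.log(n) + 4 * k * gamma * np.log(p)
```

In practice one evaluates `ebic` along the whole path of estimates and selects the $\rho$-value attaining the minimum; setting `gamma=0` recovers the classical BIC.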
4. Simulation studies
4.1. Evaluating the behavior of the approximate EM algorithm
In a first simulation, we evaluate the effects of the approximation (3.6) on the accuracy of the estimators obtained with the EM algorithm. As in the real dataset studied in Section 5, we consider right-censored data and we set the right-censoring value equal to 40, i.e. $u_h = 40$ for any $h$. For this simulation, where we plan to use the full EM algorithm, we set the number of variables to 10 and the sample size to 100. To simulate a sample from a sparse right-censored model, we use the following procedure. First, using the method implemented in the R package huge (Zhao and others, 2015), we simulate a sparse concentration matrix with a random structure, where we set the probability that an off-diagonal element $\theta_{hk}$ is different from zero to 0.1. Then, the elements of the parameter $\mu$ are selected to obtain a fixed probability of right censoring for any given random variable. More specifically, we randomly draw a subset $\mathcal{D}$ from $\{1, \ldots, p\}$; for each variable outside $\mathcal{D}$, the mean is chosen so that the probability of right censoring is negligible, while for each variable in $\mathcal{D}$ the mean is chosen so that the probability of right censoring is approximately equal to a fixed level. The cardinality of the set $\mathcal{D}$ is used in the study as a tool to analyze the effects of the number of censored variables on the behavior of the proposed EM algorithm and its approximated version. Finally, we draw a sample from a multivariate Gaussian distribution with the parameters given in the previous steps and we treat each value greater than 40 as a missing (censored) value. We simulated 100 datasets from this model and for each simulation we computed a path of cglasso estimates using the proposed EM algorithm and a path using the approximated EM algorithm. For each path, the largest value of the tuning parameter was computed using the results given in Theorem 2, while the smallest value was set to a fixed fraction of the largest one.
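The generating scheme above can be sketched as follows; the diagonally dominant random matrix below is a simplified stand-in for the huge simulator, and the means and censoring levels are illustrative choices, not the values used in the study:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, u = 10, 100, 40.0

# Sparse random concentration matrix: each off-diagonal pair is nonzero
# with probability 0.1; diagonal dominance guarantees positive definiteness.
Theta = np.zeros((p, p))
mask = np.triu(rng.random((p, p)) < 0.1, 1)
Theta[mask] = 0.5
Theta = Theta + Theta.T
np.fill_diagonal(Theta, np.abs(Theta).sum(axis=1) + 1.0)

Sigma = np.linalg.inv(Theta)
Sigma = (Sigma + Sigma.T) / 2.0   # symmetrize against round-off

# Means chosen so that a few variables often exceed the limit of detection.
mu = np.full(p, 30.0)
mu[:3] = 39.0

X = rng.multivariate_normal(mu, Sigma, size=n)   # latent sample
R = X > u                                        # right-censoring pattern
X_obs = np.minimum(X, u)                         # recorded (censored) data
```

Only `X_obs` and `R` would be passed to the estimators; the latent `X` plays the role of the ground truth against which accuracy is measured.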
The first part of Table 1 reports the average CPU times for computing the path. As expected, the computational time needed to compute a path is always an increasing function of the number of censored variables but, when the exact conditional moments are used in the E-step, the table reveals that the relationship between the CPU time and the number of censored variables is almost exponential, implying that the computation of the cglasso estimator is infeasible even for relatively small datasets. In contrast, when the approximated moments are used to estimate the empirical covariance matrix, the table shows an almost linear dependence of the CPU time, which implies a significant reduction of the computational complexity compared with the full EM algorithm. Although the results shown in Table 1 strongly suggest the use of the approximated EM algorithm to compute the cglasso estimator, they do not provide information about the difference between the two estimates. For this reason, we also computed the largest Euclidean distance between the two estimates of $\mu$ along the path, and the largest Frobenius distance between the two estimates of $\Theta$. The averages and standard deviations of these values are reported in the second part of Table 1. All the results clearly show that the two estimators are sufficiently close to each other, and point to the use of the approximated EM algorithm for the derivation of the cglasso estimator.
Table 1. Results of the simulation study on evaluating the effect of the approximation (3.6): the first part reports the average CPU time for computing a path, whereas the second part reports the average of the largest distances between the two estimates; standard deviations are given in parentheses.

| | Number of censored variables | | | | | | |
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Average CPU time (s) | | | | | | | |
| Exact EM | 9.60 | 25.87 | 68.36 | 104.77 | 140.01 | 249.40 | 374.34 |
| | (2.79) | (7.39) | (18.87) | (28.53) | (34.39) | (57.13) | (96.26) |
| Approx. EM | 5.08 | 7.89 | 13.81 | 15.31 | 16.26 | 21.28 | 24.90 |
| | (1.02) | (1.44) | (2.23) | (2.20) | (2.33) | (2.64) | (3.10) |
| Difference between the two estimates | | | | | | | |
| Euclidean distance (means) | 2.9 | 8.2 | 1.7 | 2.2 | 7.8 | 1.1 | 2.1 |
| | (8.6) | (2.1) | (3.7) | (3.1) | (1.0) | (1.2) | (2.3) |
| Frobenius distance (concentration matrices) | 3.0 | 8.3 | 2.0 | 2.6 | 2.3 | 2.6 | 6.6 |
| | (8.5) | (2.0) | (3.7) | (3.8) | (3.1) | (2.4) | (6.0) |
4.2. Comparison of methods on data simulated from a censored Gaussian graphical model
In a second simulation study, we compare our proposed estimator with MissGlasso (Städler and Bühlmann, 2012), which performs $\ell_1$-penalized estimation under the assumption that the censored data are missing at random, and with the glasso estimator (Friedman and others, 2008), where the empirical covariance matrix is calculated by imputing the missing values with the limit of detection. These estimators are evaluated in terms of both recovery of the structure of the true graph and mean squared error. We use a similar approach to the previous simulation for generating right-censored data. In particular, we set the right-censoring value to 40 for any variable and the sample size to 100. We generate a sparse concentration matrix with random structure and set the probability of observing a link between two nodes as a function of the number of variables, so as to control the amount of sparsity in $\Theta$. Finally, we set the mean so that a subset of the variables is right censored with a fixed probability, while for the remaining variables the mean is sampled from a uniform distribution. At this point, we simulate a sample from the latent $p$-variate Gaussian distribution and treat all values greater than 40 as missing. The number of censored variables, the sparsity level, and the number of variables are used to specify the different models used to analyze the behavior of the considered estimators. In particular, we consider the following cases:
Model 1: this setting is used to evaluate the effects of the number of censored variables on the behavior of the proposed estimators.
Model 2: this setting is used to evaluate the effects of the sparsity of the matrix $\Theta$ on the considered estimators.
Model 3: this setting is used to evaluate the impact of high dimensionality ($p > n$) on the estimators.
For each model, we simulate 100 samples from the right-censored model and in each simulation we compute the coefficient paths using cglasso, MissGlasso, and glasso. Each path is computed using an equally spaced sequence of 30 $\rho$-values. Figure 1a shows the precision-recall curves for Model 1. The curves report the relationship between precision and recall for any $\rho$-value, which are defined by:

$$\mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$

where TP, FP, and FN denote, respectively, the numbers of true positive, false positive, and false negative edges in the estimated graph.
Fig. 1. Results of the comparative simulation study based on Model 1. (a) The average precision-recall curves are shown; (b) the box-plots of the minimum mean squared error attained along the path for the considered estimators are shown.
The curves show that cglasso gives a better estimate of the concentration matrix both in terms of precision and recall, for any given value of the tuning parameter. Figure 1b shows the distributions of the minimum value of the mean squared error attained along the path of solutions. These box-plots emphasize that not only does cglasso have a mean squared error much smaller than glasso and MissGlasso, but it is also much more stable than its competitors. The same behavior was also observed in the other models used in our numerical study, and can be found in the supplementary material available at Biostatistics online. Table 2 reports the summary statistics from all the simulations. In addition to the quantity described above (second meta-column), the first meta-column shows the mean squared error in the estimation of the mean and the third meta-column reports the area under the precision-recall curve (AUC). Note that the considered measures allow us to study the behavior of the estimators along the entire path. The distribution of the minimum value of the mean squared errors shows that our estimator not only recovers the structure of the graph but also outperforms the competitors in terms of the estimation of both $\mu$ and $\Theta$. We did not report the mean squared error of the mean for glasso since this method does not allow the parameter $\mu$ to be estimated. The results on the AUC suggest that cglasso can be used as an efficient tool for recovering the structure of the true concentration matrix of a Gaussian graphical model from censored data.
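For completeness, precision and recall for an estimated concentration matrix can be computed by comparing the supports of the upper triangles of the estimated and true matrices; the helper below is our own illustration:

```python
import numpy as np

def precision_recall(Theta_hat, Theta_true, tol=1e-8):
    """Precision and recall of the estimated edge set versus the true one."""
    est = np.abs(np.triu(Theta_hat, 1)) > tol   # estimated edges
    true = np.abs(np.triu(Theta_true, 1)) > tol  # true edges
    tp = np.sum(est & true)    # correctly recovered edges
    fp = np.sum(est & ~true)   # spurious edges
    fn = np.sum(~est & true)   # missed edges
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall
```

Sweeping the tuning parameter and evaluating this pair at each estimate along the path traces out exactly the kind of precision-recall curve displayed in Figure 1a.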
Table 2. Comparison between cglasso and two existing methods on simulated data: for each measure the table reports the average value and standard deviation between parentheses.

| Model | MSE of the mean | | Minimum MSE along the path | | | AUC | | |
| | cglasso | MissGlasso | cglasso | glasso | MissGlasso | cglasso | glasso | MissGlasso |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 0.47 | 14.50 | 8.76 | 103.35 | 96.75 | 0.48 | 0.37 | 0.19 |
| | (0.11) | (0.69) | (0.64) | (14.43) | (16.01) | (0.04) | (0.03) | (0.02) |
| | 0.48 | 21.00 | 10.11 | 139.76 | 131.99 | 0.47 | 0.33 | 0.15 |
| | (0.10) | (0.76) | (0.84) | (15.94) | (18.81) | (0.05) | (0.03) | (0.02) |
| | 0.47 | 18.31 | 6.92 | 128.60 | 119.65 | 0.61 | 0.42 | 0.16 |
| | (0.10) | (0.81) | (0.82) | (17.75) | (17.38) | (0.08) | (0.05) | (0.03) |
| | 0.46 | 17.39 | 12.02 | 113.84 | 105.70 | 0.43 | 0.34 | 0.20 |
| | (0.10) | (0.92) | (0.85) | (14.79) | (15.71) | (0.04) | (0.03) | (0.02) |
| | 1.92 | 63.34 | 41.57 | 398.02 | 373.86 | 0.32 | 0.21 | 0.13 |
| | (0.19) | (1.54) | (1.52) | (29.78) | (32.46) | (0.02) | (0.02) | (0.01) |
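The network-recovery measures summarized in Table 2 can be computed directly from the support of the estimated concentration matrices. The sketch below is illustrative Python (the paper's own software is an R package, and these function names are not taken from it): it computes precision and recall of edge recovery and a trapezoidal area under the precision-recall curve along a path of estimates.

```python
import numpy as np

def edge_precision_recall(theta_hat, theta_true, tol=1e-8):
    """Precision and recall of the off-diagonal support of theta_hat."""
    p = theta_true.shape[0]
    off = ~np.eye(p, dtype=bool)
    est = np.abs(theta_hat[off]) > tol   # estimated edges
    tru = np.abs(theta_true[off]) > tol  # true edges
    tp = np.sum(est & tru)
    return tp / max(np.sum(est), 1), tp / max(np.sum(tru), 1)

def pr_auc(path, theta_true):
    """Trapezoidal area under the precision-recall curve along a path."""
    pts = sorted((edge_precision_recall(t, theta_true) for t in path),
                 key=lambda pr: pr[1])                   # order by recall
    prec = np.array([p for p, _ in pts])
    rec = np.array([r for _, r in pts])
    return float(np.sum(np.diff(rec) * (prec[1:] + prec[:-1]) / 2.0))

# Toy example: a tridiagonal "true" concentration matrix and a short path.
theta_true = np.eye(4)
for i in range(3):
    theta_true[i, i + 1] = theta_true[i + 1, i] = -0.4
path = [np.zeros((4, 4)), 0.5 * theta_true, theta_true]
auc = pr_auc(path, theta_true)
```

In a real simulation the path would be evaluated over a full grid of tuning-parameter values; the three-point path above only illustrates the computation.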
4.3. Testing the robustness of the method on more realistic biological data
In a third simulation study, we test the robustness of the method on more realistic biological data. In particular, we consider expression data from Wille and others (2004) on the extensively studied Arabidopsis thaliana biological system, covering a set of genes whose regulatory network is of interest. Although the data are fully observed, we test our method on datasets where observations are made artificially right-censored, similarly to Städler and Bühlmann (2012). In particular, we produce three datasets at different levels of right censoring, by recording the highest 10%, 20%, and 30% of the values, respectively, as censored. We then compare our method, cglasso, with MissGlasso (Städler and Bühlmann, 2012) and with three additional methods, which first impute the missing values and then infer the network by applying the graphical lasso to the imputed data. As in Städler and Bühlmann (2012), we consider k-nearest neighbor imputation, which we denote by missknn, and imputation using random forests, which we denote by MissForest. Both of these methods rely on a missing-at-random assumption. We further consider imputeLCMD (Lazar, 2015), a method that assumes multivariate Gaussian data under a censoring mechanism.
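The artificial censoring scheme just described can be sketched as follows. The function name, the data dimensions, and the mock data are illustrative, not taken from the paper's software:

```python
import numpy as np

def right_censor(X, q):
    """Record the highest 100*q% of the entries of X as right-censored at
    the corresponding empirical quantile; return the censored matrix and
    a boolean mask of the censored entries."""
    X = np.asarray(X, dtype=float)
    cutoff = np.quantile(X, 1.0 - q)
    mask = X > cutoff
    Xc = X.copy()
    Xc[mask] = cutoff   # values above the limit are only known to exceed it
    return Xc, mask

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 39))        # mock expression matrix
Xc, mask = right_censor(X, 0.30)      # 30% right-censoring

# The Euclidean distance of Table 3 compares observed and imputed values;
# with naive "imputation at the limit" it reduces to:
ed = np.linalg.norm(X[mask] - Xc[mask])
```

A proper evaluation would replace the censored entries with each method's imputations before computing the distance; the last line only shows the baseline of leaving them at the detection limit.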
Table 3 shows the Euclidean norm between the true observed data and the imputed data for the five methods considered, across the three different levels of censoring. The results show that the performance of all methods deteriorates as the level of censoring increases and that cglasso is overall superior to the other methods across all comparisons, followed closely in some cases by imputeLCMD. The areas under the precision-recall curves of the networks inferred by the five methods, compared with the graphical lasso network selected from the fully observed data, show that the proposed cglasso method leads to a significant gain in network recovery over MissGlasso, missknn, and MissForest, even with 30% of censored data.
Table 3. Comparison of methods on the Arabidopsis thaliana expression data: Euclidean distance (ED) between the observed and imputed data and area under the precision-recall curve (AUC) across the five methods and the three levels of censoring.

| | Percentage censored | cglasso | imputeLCMD | MissGlasso | missknn | missForest |
|---|---|---|---|---|---|---|
| ED | 10% | 9.95 | 13.49 | 27.89 | 31.37 | 33.31 |
| | 20% | 15.22 | 18.60 | 37.73 | 40.16 | 46.21 |
| | 30% | 19.76 | 24.38 | 46.65 | 50.08 | 57.54 |
| AUC | 10% | 0.93 | 0.88 | 0.78 | 0.79 | 0.77 |
| | 20% | 0.90 | 0.86 | 0.70 | 0.72 | 0.68 |
| | 30% | 0.87 | 0.85 | 0.63 | 0.70 | 0.61 |
5. Application to single-cell data: megakaryocyte-erythroid progenitors
Recent advances in single-cell techniques have provided the opportunity to finely dissect cellular heterogeneity within known populations and to uncover rare cell types. In a study of the formation of blood cells, Psaila and others (2016) recently identified three distinct subpopulations of cells, all derived from hematopoietic stem cells through cell differentiation. One of these subpopulations, denoted by MK-MEP, is a previously unknown, rare population of cells that are bipotent but primarily generate megakaryocytic progeny. In this section, we look closely at this subpopulation and investigate the molecular mechanisms of regulation within these cells. To this end, we use data available from Psaila and others (2016) on 87 genes and 48 single human MK-MEP cells profiled by multiplex RT-qPCR.
As discussed in the Introduction, RT-qPCR data are typically right-censored. In this particular study, the limit of detection is fixed by the manufacturer at 40. Raw data were mean-normalized using the method proposed by Pipelers and others (2017) and the two housekeeping genes provided. Figure 2a shows the relationship between the proportion of right-censored data and the mean of the normalized cycle threshold. Two main conclusions can be drawn from this figure. Firstly, the proportion of censored data is an increasing function of the mean of the normalized cycle threshold: this is expected, and it means that the missing-at-random assumption is not justified for this dataset. Secondly, some genes have a very high proportion of censoring (see the points above the dashed line): these may be genes whose transcription failed to amplify, in which case they should be treated as missing rather than censored. Following this exploratory analysis, and considering also the possible computational problems caused by the inclusion of these genes, we filtered out the genes with a proportion of censoring above 85% for the subsequent analysis. The resulting dataset contains the expression of 63 genes measured on 48 cells.
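The screening step described above amounts to computing, for each gene, the proportion of cycle-threshold values at the detection limit and discarding genes above the 85% threshold. A minimal sketch (the array contents are mock data, not the Psaila and others (2016) measurements):

```python
import numpy as np

LIMIT = 40.0           # detection limit fixed by the manufacturer
MAX_CENSORING = 0.85   # genes censored more often than this are dropped

rng = np.random.default_rng(1)
# Mock cycle-threshold matrix: 48 cells x 87 genes, right-censored at LIMIT.
ct = rng.uniform(20.0, 45.0, size=(48, 87)).clip(max=LIMIT)

prop_censored = (ct >= LIMIT).mean(axis=0)   # per-gene proportion of censoring
keep = prop_censored <= MAX_CENSORING
ct_kept = ct[:, keep]
```

On the real data this filter reduces the 87 profiled genes to the 63 analyzed in the paper; with the uniform mock data above, essentially all genes survive the filter.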
Fig. 2.
Real data analysis. (a) The proportion of right-censored data versus the mean of the normalized cycle threshold; the black line is obtained by fitting a logistic regression model, while the dashed line identifies the threshold used to filter out genes from the study. (b) Path of the model selection measure; the vertical dashed line identifies the optimal value of the tuning parameter. (c) The estimated graph.
The normalized cycle thresholds of the remaining genes are used to fit a right-censored GGM by means of the proposed cglasso estimator. Figure 2b shows the path of the model selection measure; the vertical dashed line is traced at the optimal value of the tuning parameter. The resulting estimated graph (see Figure 2c) contains about of all possible edges and about of the considered genes. Consistent with existing knowledge about this subpopulation, the estimated graph shows the central role of two genes, CD42 (glycoprotein 1b) and MYB (Psaila and others, 2016). The first, CD42, is a megakaryocyte gene whose expression was found to be significantly high in the MK-MEP subpopulation. In particular, it is expressed later during megakaryocyte differentiation and has been associated with unipotent megakaryopoietic activity in mouse models. The link between CD42 and VWF, another highly expressed gene in this subpopulation (Psaila and others, 2016), was also identified by Chan and others (2017), along with other connections, such as the link between CD42 and CD61 and that between CD42 and TGFB1. The second central gene, MYB, is a transcription factor known to enhance erythroid differentiation at the expense of megakaryopoiesis.
6. Conclusion
In this article, we have proposed a computational approach to fit Gaussian graphical models in the presence of censored data. The approach covers both right- and left-censored data. Since classical Gaussian graphical models cannot be used in high-dimensional settings, we also introduced an ℓ1-penalty to perform sparsity (model selection) and parameter estimation simultaneously. The resulting estimator is called the censored graphical lasso (cglasso). The computational problem of estimating the mean and the conditional independence graph of a Gaussian graphical model from censored data is solved via an EM algorithm. An extensive simulation study showed that the proposed estimator outperforms existing estimators both in terms of parameter estimation and network recovery. The analysis of a real RT-qPCR dataset showed how the method is able to infer the regulatory network underlying blood development under high levels of censoring.
In addition, we have proposed an approximate EM algorithm, which is computationally more efficient and makes the method feasible for high-dimensional settings. This approach relies on an efficient approximation of the mixed moments of the latent variables conditional on the observed ones. The approximation is exact under conditional independence and has thus been found to work well in sparse settings, as shown by Guo and others (2015) and by our own simulations. Future work will investigate the theoretical justification for this approximation, as well as refine the characterization of the scenarios under which good performance is to be expected.
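To make the structure of such an EM scheme concrete, here is a deliberately simplified Python sketch. It is not the authors' algorithm: the E-step below replaces each censored value with the mean of its marginal normal distribution truncated below the detection limit, whereas the paper works with (approximate) conditional moments given the observed components of the same sample; the M-step is delegated to scikit-learn's graphical lasso rather than the cglasso implementation.

```python
import numpy as np
from scipy.stats import truncnorm
from sklearn.covariance import graphical_lasso

def em_step(X, censored, limit, mu, sigma2, alpha=0.2):
    """One simplified imputation + graphical-lasso step.

    mu and sigma2 are current per-variable marginal means and variances."""
    Ximp = X.copy()
    for j in range(X.shape[1]):
        idx = censored[:, j]
        if idx.any():
            sd = np.sqrt(sigma2[j])
            a = (limit - mu[j]) / sd            # standardized lower bound
            # Mean of N(mu, sd^2) truncated below at the detection limit.
            Ximp[idx, j] = truncnorm.mean(a, np.inf, loc=mu[j], scale=sd)
    S = np.cov(Ximp, rowvar=False)              # completed-data covariance
    _, prec = graphical_lasso(S, alpha=alpha)   # sparse precision (M-step)
    return Ximp.mean(axis=0), prec

# Toy right-censored data: entries above 1.0 are only known to exceed 1.0.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
cens = X > 1.0
Xc = np.where(cens, 1.0, X)
mu_hat, prec_hat = em_step(Xc, cens, 1.0, Xc.mean(axis=0), Xc.var(axis=0))
```

In a full algorithm this step would be iterated to convergence, updating the marginal moments from the imputed data and tracking the penalized likelihood.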
Acknowledgments
Conflict of Interest: None declared.
Funding
The project was partially supported by the European Cooperation in Science and Technology (COST) [COST Action CA15109 European Cooperation for Statistics of Network Data Science (COSTNET)].
References
Arismendi, J. C. (2013). Multivariate truncated moments. Journal of Multivariate Analysis 117, 41–75.
Augugliaro, L., Mineo, A. M. and Wit, E. C. (2016). ℓ1-Penalized methods in high-dimensional Gaussian Markov random fields. In: Dehmer, M., Shi, Y. and Emmert-Streib, F. (editors), Computational Network Analysis with R: Applications in Biology, Medicine, and Chemistry, Chapter 8. Weinheim, Germany: Wiley-VCH Verlag GmbH & Co. KGaA, pp. 201–267.
Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics 36, 199–227.
Brown, M. and Wittwer, C. (2000). Flow cytometry: principles and clinical applications in hematology. Clinical Chemistry 46, 1221–1229.
Chan, T. E., Stumpf, M. P. H. and Babtie, A. C. (2017). Gene regulatory network inference from single-cell data using multivariate information measures. Cell Systems 5, 251–267.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B 39, 1–38.
Derveaux, S., Vandesompele, J. and Hellemans, J. (2010). How to do successful gene expression analysis using real-time PCR. Methods 50, 227–230.
Foygel, R. and Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In: Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R. S. and Culotta, A. (editors), Advances in Neural Information Processing Systems. Vancouver, British Columbia, Canada: Curran Associates, Inc., pp. 604–612.
Friedman, J. H., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
Friedman, J. H., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1–22.
Friedman, J. H., Hastie, T. and Tibshirani, R. (2018). glasso: Graphical Lasso-Estimation of Gaussian Graphical Models.
Gardner, T. S., di Bernardo, D., Lorenz, D. and Collins, J. J. (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–105.
Genz, A. and Bretz, F. (2002). Comparison of methods for the computation of multivariate probabilities. Journal of Computational and Graphical Statistics 11, 950–971.
Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2015). Graphical models for ordinal data. Journal of Computational and Graphical Statistics 24, 183–204.
Hoffman, H. J. and Johnson, R. E. (2015). Pseudo-likelihood estimation of multivariate normal parameters in the presence of left-censored data. Journal of Agricultural, Biological, and Environmental Statistics 20, 156–171.
Ibrahim, J. G., Zhu, H. and Tang, N. (2008). Model selection criteria for missing-data problems using the EM algorithm. Journal of the American Statistical Association 103, 1648–1658.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1994). Continuous Univariate Distributions, 2nd edition, Volume 1. New York: John Wiley & Sons, Inc.
Lauritzen, S. L. (1996). Graphical Models. Oxford: Oxford University Press.
Lazar, C. (2015). imputeLCMD: A Collection of Methods for Left-Censored Missing Data Imputation.
Lee, G. and Scott, C. (2012). EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Computational Statistics & Data Analysis 56, 2816–2829.
Lee, L.-F. (1983). The determination of moments of the doubly truncated multivariate normal tobit model. Economics Letters 11, 245–250.
Leppard, P. and Tallis, G. M. (1989). Algorithm AS 249: Evaluation of the mean and covariance of the truncated multinormal distribution. Journal of the Royal Statistical Society. Series C 38, 543–553.
Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd edition. Hoboken, NJ, USA: John Wiley & Sons, Inc.
McCall, M. N., McMurray, H. R., Land, H. and Almudevar, A. (2014). On non-detects in qPCR data. Bioinformatics 30, 2310–2316.
McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd edition. Hoboken, NJ, USA: John Wiley & Sons, Inc.
Menéndez, P., Kourmpetis, Y. A. I., ter Braak, C. J. F. and van Eeuwijk, F. A. (2010). Gene regulatory networks from multifactorial perturbations using graphical lasso: application to the DREAM4 challenge. PLoS One 5, e14147.
Perkins, N. J., Schisterman, E. F. and Vexler, A. (2013). Multivariate normally distributed biomarkers subject to limits of detection and receiver operating characteristic curve inference. Academic Radiology 20, 838–846.
Pesonen, M., Pesonen, H. and Nevalainen, J. (2015). Covariance matrix estimation for left-censored data. Computational Statistics & Data Analysis 92, 13–25.
Pipelers, P., Clement, L., Vynck, M., Hellemans, J., Vandesompele, J. and Thas, O. (2017). A unified censored normal regression model for qPCR differential gene expression analysis. PLoS One 12, e0182832.
Psaila, B., Barkas, N., Iskander, D., Roy, A., Anderson, S., Ashley, N., Caputo, V. S., Lichtenberg, J., Loaiza, S., Bodine, D. M. and others. (2016). Single-cell profiling of human megakaryocyte-erythroid progenitors identifies distinct megakaryocyte and erythroid differentiation pathways. Genome Biology 17, 83–102.
Rothman, A., Bickel, P. J., Levina, E. and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics 2, 494–515.
Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Boca Raton, Florida: Chapman and Hall/CRC.
Städler, N. and Bühlmann, P. (2012). Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing 22, 219–235.
Uhler, C. (2012). Geometry of maximum likelihood estimation in Gaussian graphical models. The Annals of Statistics 40, 238–261.
Vinciotti, V., Augugliaro, L., Abbruzzo, A. and Wit, E. C. (2016). Model selection for factorial Gaussian graphical models with an application to dynamic regulatory networks. Statistical Applications in Genetics and Molecular Biology 15, 193–212.
Wille, A., Zimmermann, P., Vranová, E., Fürholz, A., Laule, O., Bleuler, S., Hennig, L., Prelić, A., von Rohr, P., Thiele, L., Zitzler, E., Gruissem, W. and others. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5, R92.
Witten, D. M., Friedman, J. H. and Simon, N. (2011). New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics 20, 892–900.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.
Zhao, T., Li, X., Liu, H., Roeder, K., Lafferty, J. and Wasserman, L. (2015). huge: High-Dimensional Undirected Graph Estimation.
© The Author 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.