Summary

The graphical lasso is one of the most used estimators for inferring genetic networks. Despite its widespread use, there are several fields in applied research where the limits of detection of modern measurement technologies make the use of this estimator theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. Typical examples are data generated by polymerase chain reactions and flow cytometry. The combination of censoring and high dimensionality makes inference of the underlying genetic networks from these data very challenging. In this article, we propose an ℓ1-penalized Gaussian graphical model for censored data and derive two EM-like algorithms for inference. We evaluate the computational efficiency of the proposed algorithms by an extensive simulation study and show that, in the presence of censored data, our proposal is superior to existing competitors both in terms of network recovery and parameter estimation. We apply the proposed method to gene expression data generated by microfluidic Reverse Transcription quantitative Polymerase Chain Reaction technology in order to make inference on the regulatory mechanisms of blood development. A software implementation of our method is available on GitHub (https://github.com/LuigiAugugliaro/cglasso).

1. Introduction

An important aim in genomics is to understand interactions among genes, characterized by the regulation and synthesis of proteins under internal and external signals. These relationships can be represented by a genetic network, i.e. a graph where nodes represent genes and edges describe the interactions among them. Genetic networks can be used to gain new insights into the activity of biological pathways and to deduce unknown functions of genes from their dependence on other genes.

Gaussian graphical models (Lauritzen, 1996) have been widely used for reconstructing a genetic network from expression data. The reason for their widespread use lies in the statistical properties of the multivariate Gaussian distribution, which allow the topological structure of a network to be related to the non-zero elements of the concentration matrix, i.e. the inverse of the covariance matrix. Thus, the problem of network inference can be recast as the problem of estimating a concentration matrix. The graphical lasso (Yuan and Lin, 2007) is a popular method for estimating a sparse concentration matrix, based on the idea of adding an ℓ1-penalty to the likelihood function of the multivariate Gaussian distribution. Nowadays, this estimator is widely used in applied research (e.g. Menéndez and others, 2010; Vinciotti and others, 2016) and widely studied in the computational as well as the theoretical literature (e.g. Bickel and Levina, 2008; Friedman and others, 2008; Witten and others, 2011). The interested reader is referred to Augugliaro and others (2016) for an extensive review.

Despite the widespread literature on the graphical lasso estimator, there are many fields in applied research where modern measurement technologies make the use of this graphical model theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. A first example is Reverse Transcription quantitative Polymerase Chain Reaction (RT-qPCR), a popular technology for gene expression profiling (Derveaux and others, 2010). This technique is used to measure the expression of a set of target genes in a given sample through repeated cycles of sequence-specific DNA amplification followed by expression measurements. The cycle at which the observed expression first exceeds a fixed threshold is commonly called the cycle-threshold (McCall and others, 2014). If a target is not expressed, the threshold is not reached within the maximum number of cycles (limit of detection) and the corresponding cycle-threshold is undetermined. For this reason, the resulting data are naturally right-censored (McCall and others, 2014; Pipelers and others, 2017). Another example is given by flow cytometry, which is an essential tool in the diagnosis of diseases such as acute leukemias and malignant lymphomas (Brown and Wittwer, 2000). A flow cytometer measures a limited range of signal strength and records each marker value within a fixed range, such as between 0 and 1023. If a measurement falls outside this range, then the value is replaced by the nearest legitimate value; that is, a value smaller than 0 is censored to 0 and a value larger than 1023 is censored to 1023. A direct application of the graphical lasso to network inference from data such as these is theoretically unfounded, since it does not consider the effects of the censoring mechanism on the estimator of the concentration matrix.

In this article, we propose an extension of the graphical lasso estimator that takes the censoring mechanism of the data explicitly into account. Our work is related to Städler and Bühlmann (2012), who propose an ℓ1-penalized estimator of the inverse covariance matrix of a multivariate Gaussian model based on the assumption that the data are missing at random. As we shall see later in this article, failing to take the censoring mechanism into account leads to poor behavior of this approach when compared with our proposal. Our proposal can also be related to the work of Perkins and others (2013), Hoffman and Johnson (2015), and Pesonen and others (2015), who provide a maximum likelihood estimator of the covariance matrix under left-censoring. However, these works do not address the estimation of the precision matrix under a sparsity assumption and, for this reason, they are applicable only when the sample size is larger than the number of nodes in the network.

The remaining part of this article is structured as follows. In Section 2, we extend the notion of Gaussian graphical model to censored data and in Section 3, we propose the extension of the graphical lasso estimator to a censored graphical lasso estimator and two Expectation-Maximization (EM) algorithms for inference of parameters in the censored Gaussian graphical model. Section 4 is devoted to the evaluation of the behavior of the proposed algorithms and the comparison of the proposed estimator with existing competitors. Finally, in Section 5, we study a real dataset and in Section 6, we draw some conclusions.

2. The censored Gaussian graphical model

In order to describe the technical details of the proposed method, let $X=(X_1,\dots,X_p)^\top$ be a p-dimensional random vector. Graphical models allow the set of conditional independencies among these random variables to be represented by a graph $G=\{V,E\}$, where V is the set of nodes associated to X and $E\subseteq V\times V$ is the set of ordered pairs, called edges, representing the conditional dependencies among the p random variables (Lauritzen, 1996).

The Gaussian graphical model is a member of this class of models based on the assumption that X follows a multivariate Gaussian distribution with expected value $\mu=(\mu_1,\dots,\mu_p)^\top$ and covariance matrix $\Sigma=(\sigma_{hk})$. Denoting by $\Theta=(\theta_{hk})$ the concentration matrix, i.e. the inverse of the covariance matrix, the density function of X can be written as
\[ \phi(x;\mu,\Theta)=(2\pi)^{-p/2}|\Theta|^{1/2}\exp\left\{-\tfrac{1}{2}(x-\mu)^\top\Theta(x-\mu)\right\}. \]
(2.1)

As shown in Lauritzen (1996), the pattern of non-zero elements in Θ defines the corresponding graph, namely the undirected edge (h,k) is an element of the edge set E of the corresponding conditional independence graph if and only if $\theta_{hk}\neq 0$.
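As a small illustration of this correspondence, the R sketch below recovers the edge set from the pattern of non-zero off-diagonal entries; the 4x4 concentration matrix is an arbitrary positive-definite example of our own choosing.

    ## Toy example: read the edge set E off the non-zero pattern of Theta.
    ## The matrix below is arbitrary and only illustrates the rule
    ## "theta_hk != 0  <=>  (h, k) is an edge".
    Theta <- matrix(c( 2.0, -0.8,  0.0,  0.0,
                      -0.8,  2.0, -0.5,  0.0,
                       0.0, -0.5,  2.0,  0.0,
                       0.0,  0.0,  0.0,  1.5), 4, 4, byrow = TRUE)
    edges <- which(Theta != 0 & upper.tri(Theta), arr.ind = TRUE)
    colnames(edges) <- c("h", "k")
    edges   # here: (1,2) and (2,3)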

Let X be a (partially) latent random vector with density function (2.1). In order to include the censoring mechanism in our framework, let us denote by $l=(l_1,\dots,l_p)^\top$ and $u=(u_1,\dots,u_p)^\top$, with $l_h<u_h$ for $h=1,\dots,p$, the vectors of known left and right censoring values. Thus, $X_h$ is observed only if it lies inside the interval $[l_h,u_h]$; otherwise it is censored from below if $X_h<l_h$ or censored from above if $X_h>u_h$. Under this setting, a rigorous definition of the joint distribution of the observed data can be obtained using the approach for missing data with an ignorable mechanism (Little and Rubin, 2002). This requires the specification of the distribution of a p-dimensional random vector, denoted by $R(X;l,u)$, used to encode the censoring patterns. Formally, the hth element of $R(X;l,u)$ is defined as $R(X_h;l_h,u_h)=I(X_h>u_h)-I(X_h<l_h)$, where $I(\cdot)$ denotes the indicator function. By construction, $R(X;l,u)$ is a discrete random vector with support in the set $\{-1,0,1\}^p$ and probability function
\[ \mathrm{Prob}\{R(X;l,u)=r\}=\int_{D_r}\phi(x;\mu,\Theta)\,\mathrm{d}x, \]
where $D_r=\{x\in\mathbb{R}^p: R(x;l,u)=r\}$.
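A direct transcription of this indicator into R is shown below; the function name `censor_pattern` is ours and the numbers are purely illustrative.

    ## The censoring indicator R(X; l, u): +1 if the latent value exceeds u_h,
    ## -1 if it falls below l_h, 0 otherwise.  Function name is ours.
    censor_pattern <- function(x, l, u) {
      as.integer(x > u) - as.integer(x < l)
    }

    ## Example with p = 3, right-censoring at 40 (as in RT-qPCR data) and no
    ## left-censoring:
    x <- c(25.3, 41.0, 38.7)
    censor_pattern(x, l = rep(-Inf, 3), u = rep(40, 3))   # 0  1  0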
Given a censoring pattern, we can simplify our notation by partitioning the set $I=\{1,\dots,p\}$ into the sets $o=\{h\in I: r_h=0\}$, $c^-=\{h\in I: r_h=-1\}$ and $c^+=\{h\in I: r_h=+1\}$; in the remainder of this article, we shall use the convention that a vector indexed by a set of indices denotes the corresponding subvector. For example, the subvector of observed elements in x is denoted by $x_o=(x_h)_{h\in o}$ and, consequently, the observed data is the vector $(x_o,r)$. As explained in Little and Rubin (2002), the joint probability distribution of the observed data, denoted by $\varphi(x_o,r;\mu,\Theta)$, is obtained by integrating $X_{c^+}$ and $X_{c^-}$ out of the joint distribution of X and $R(X;l,u)$, which can be written as the product of the density function (2.1) and the conditional distribution of $R(X;l,u)$ given $X=x$. As shown in Schafer (1997), this results in
\[ \varphi(x_o,r;\mu,\Theta)=\int_{u_{c^+}}^{+\infty}\!\!\int_{-\infty}^{l_{c^-}}\phi(x_o,x_{c^-},x_{c^+};\mu,\Theta)\,\mathrm{d}x_{c^-}\,\mathrm{d}x_{c^+}\;I(l_o\le x_o\le u_o)=\int_{D_c}\phi(x_o,x_c;\mu,\Theta)\,\mathrm{d}x_c\;I(l_o\le x_o\le u_o), \]
(2.2)
where $c=c^-\cup c^+$ and $D_c=(-\infty,l_{c^-})\times(u_{c^+},+\infty)$. The density function (2.2) is used in Lee and Scott (2012) inside the framework of a mixture of multivariate Gaussian distributions with censored data. Using (2.2), the censored Gaussian graphical model can be formally defined.
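For a single observation, the integral in (2.2) can be evaluated by factorizing $\phi(x_o,x_c;\mu,\Theta)$ into the marginal density of the observed components and the conditional probability that the censored components fall in $D_c$. The R sketch below illustrates this decomposition; it assumes the mvtnorm package, works with $\Sigma=\Theta^{-1}$ for readability, and all function and object names are ours.

    ## Sketch: evaluate the observed-data density (2.2) for one observation by
    ## factorizing phi(x_o, x_c) = phi(x_o) * phi(x_c | x_o) and integrating the
    ## conditional Gaussian over D_c.  Assumes mvtnorm; names are illustrative.
    library(mvtnorm)

    phi_obs <- function(x, r, mu, Sigma, l, u) {
      idx_o <- which(r == 0)       # observed components (assumed non-empty)
      idx_c <- which(r != 0)       # censored components
      dens_o <- dmvnorm(x[idx_o], mean = mu[idx_o],
                        sigma = Sigma[idx_o, idx_o, drop = FALSE])
      if (length(idx_c) == 0) return(dens_o)
      ## conditional distribution of X_c given X_o = x_o
      S_oo_inv <- solve(Sigma[idx_o, idx_o, drop = FALSE])
      mu_c_o <- mu[idx_c] + Sigma[idx_c, idx_o, drop = FALSE] %*%
                S_oo_inv %*% (x[idx_o] - mu[idx_o])
      S_c_o  <- Sigma[idx_c, idx_c, drop = FALSE] -
                Sigma[idx_c, idx_o, drop = FALSE] %*% S_oo_inv %*%
                Sigma[idx_o, idx_c, drop = FALSE]
      ## integration region D_c: below l_h if left-censored, above u_h if right-censored
      lower <- ifelse(r[idx_c] == -1, -Inf, u[idx_c])
      upper <- ifelse(r[idx_c] == -1, l[idx_c], Inf)
      p_c <- as.numeric(pmvnorm(lower = lower, upper = upper,
                                mean = as.numeric(mu_c_o), sigma = S_c_o))
      dens_o * p_c
    }

    ## Tiny example: p = 2, the second component right-censored at u_2 = 40.
    Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
    phi_obs(x = c(38, NA), r = c(0, 1), mu = c(37, 39),
            Sigma = Sigma, l = c(-Inf, -Inf), u = c(Inf, 40))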
 
Definition 1

Let X be a p-dimensional Gaussian random vector whose density function $\phi(x;\mu,\Theta)$ factorizes according to an undirected graph $G=\{V,E\}$, and let $R(X;l,u)$ be the p-dimensional random censoring-data indicator defined by the censoring vectors l and u. The censored Gaussian graphical model (cGGM) is defined to be the set $\{X,R(X;l,u),\varphi(x_o,r;\mu,\Theta),G\}$.

A closer look at Definition 1 reveals that the proposed notion of censored Gaussian graphical model is characterized by a high degree of generality, since it also covers the special cases of the classical Gaussian graphical model ($l_h=-\infty$, $u_h=+\infty$ for any h), and the cases in which there are only left-censored data ($u_h=+\infty$ for any h) or only right-censored data ($l_h=-\infty$ for any h).

3. ℓ1-Penalized estimator for the censored Gaussian graphical model

3.1. The censored graphical lasso estimator

Consider a sample of n independent observations drawn from the censored Gaussian graphical model $\{X,R(X;l,u),\varphi(x_o,r;\mu,\Theta),G\}$. For ease of exposition, we shall assume that l and u are fixed across the n observations, but the extension to the case where the censoring vectors are specific to each observation is straightforward and does not require a specific treatment. If these values are unknown, we suggest using the smallest and largest observed value of each variable as estimates for $l_h$ and $u_h$, respectively.

Let $r_i$ be the ith realization of the random vector $R(X_i;l,u)$. Then the ith observed data is the vector $(x_{io_i},r_i)$, with $o_i=\{h\in I: r_{ih}=0\}$. Using the density function (2.2), the observed log-likelihood function can be written as
\[ \ell(\mu,\Theta)=\sum_{i=1}^n\log\int_{D_{c_i}}\phi(x_{io_i},x_{ic_i};\mu,\Theta)\,\mathrm{d}x_{ic_i}=\sum_{i=1}^n\log\varphi(x_{io_i},r_i;\mu,\Theta), \]
(3.1)
where, as before, $c_i=c_i^-\cup c_i^+$, with $c_i^-=\{h\in I: r_{ih}=-1\}$ and $c_i^+=\{h\in I: r_{ih}=+1\}$, and $D_{c_i}=(-\infty,l_{c_i^-})\times(u_{c_i^+},+\infty)$. Although inference about the parameters of this model can be carried out via the maximum likelihood method, the application of this inferential procedure to real datasets, such as the gene expression data described in Section 5, is limited for three main reasons. Firstly, the number of measured variables is larger than the sample size, which implies the non-existence of the maximum likelihood estimator even when the dataset is fully observed. Secondly, even when the sample size is large enough, the maximum likelihood estimator will exhibit a very high variance (Uhler, 2012). Thirdly, empirical evidence suggests that gene networks, or more general biochemical networks, are not fully connected (Gardner and others, 2003). In terms of Gaussian graphical models, this evidence translates into the assumption that Θ has a sparse structure, i.e. only a few $\theta_{hk}$ are different from zero, which is not obtained by a direct (or indirect) maximization of the observed log-likelihood function (3.1).
With all this in mind, in this article we propose to estimate the parameters of the censored Gaussian graphical model by generalizing the approach proposed in Yuan and Lin (2007), i.e. by maximizing a new objective function defined by adding a lasso-type penalty to the observed log-likelihood (3.1). The resulting estimator, called the censored graphical lasso (cglasso), is formally defined as
\[ \{\hat\mu^\rho,\hat\Theta^\rho\}=\arg\max_{\mu,\,\Theta\succ 0}\ \frac{1}{n}\sum_{i=1}^n\log\varphi(x_{io_i},r_i;\mu,\Theta)-\rho\sum_{h\neq k}|\theta_{hk}|. \]
(3.2)

As in the standard graphical lasso estimator, the nonnegative tuning parameter ρ is used to control the amount of sparsity in the estimated concentration matrix $\hat\Theta^\rho=(\hat\theta_{hk}^\rho)$ and, consequently, in the corresponding estimated graph $\hat G_\rho=\{V,\hat E_\rho\}$, where $\hat E_\rho=\{(h,k):\hat\theta_{hk}^\rho\neq 0\}$. When ρ is large enough, some $\hat\theta_{hk}^\rho$ are shrunk to zero, resulting in the removal of the corresponding links in $\hat G_\rho$; on the other hand, when ρ is equal to zero and the sample size is large enough, the estimator $\hat\Theta^\rho$ coincides with the maximum likelihood estimator of the concentration matrix, which implies a fully connected estimated graph.

3.2. Fitting the censored graphical lasso model

Using known results about the multivariate Gaussian distribution, the conditional distribution of $X_{c_i}$ given $X_{o_i}=x_{o_i}$ is also a multivariate Gaussian distribution, with concentration matrix $\Theta_{c_ic_i}=(\theta_{hk})_{h,k\in c_i}$ and conditional expected value $E_{c_i|o_i}(X_{c_i})=\mu_{c_i|o_i}=\mu_{c_i}-\Theta_{c_ic_i}^{-1}\Theta_{c_io_i}(x_{io_i}-\mu_{o_i})$, where $\Theta_{c_io_i}=(\theta_{hk})_{h\in c_i,k\in o_i}$. As we shall show in the next theorem, the conditions characterizing the cglasso estimator are based on the first and second moments of the Gaussian distribution $\phi(x_{c_i};\mu_{c_i|o_i},\Theta_{c_ic_i})$ truncated over the region $D_{c_i}$. To this end, we let
\[ x_{i,h}(\mu,\Theta)=\begin{cases} x_{ih} & \text{if } h\in o_i,\\ E_{c_i|o_i}(X_{ih}\mid X_{ic_i}\in D_{c_i}) & \text{if } h\in c_i,\end{cases} \qquad x_{i,hk}(\mu,\Theta)=\begin{cases} x_{ih}x_{ik} & \text{if } h,k\in o_i,\\ x_{ih}\,E_{c_i|o_i}(X_{ik}\mid X_{ic_i}\in D_{c_i}) & \text{if } h\in o_i,\ k\in c_i,\\ E_{c_i|o_i}(X_{ih}X_{ik}\mid X_{ic_i}\in D_{c_i}) & \text{if } h,k\in c_i,\end{cases} \]
where $E_{c_i|o_i}(\,\cdot\mid X_{ic_i}\in D_{c_i})$ denotes the expected value computed using the conditional distribution of $X_{ic_i}$ given $x_{io_i}$ truncated over $D_{c_i}$. Finally, we let $\bar x_h(\mu,\Theta)=\sum_{i=1}^n x_{i,h}(\mu,\Theta)/n$, $\bar x(\mu,\Theta)=\{\bar x_1(\mu,\Theta),\dots,\bar x_p(\mu,\Theta)\}^\top$, $s_{hk}(\mu,\Theta)=\sum_{i=1}^n x_{i,hk}(\mu,\Theta)/n-\bar x_h(\mu,\Theta)\bar x_k(\mu,\Theta)$, and $S(\mu,\Theta)=\{s_{hk}(\mu,\Theta)\}$. Using this notation, Theorem 1 gives the Karush–Kuhn–Tucker conditions for the proposed estimator.
 
Theorem 1
Necessary and sufficient conditions for $\{\hat\mu^\rho,\hat\Theta^\rho\}$ to be the solution of the maximization problem
\[ \max_{\mu,\,\Theta\succ 0}\ \frac{1}{n}\sum_{i=1}^n\log\varphi(x_{io_i},r_i;\mu,\Theta)-\rho\sum_{h\neq k}|\theta_{hk}|, \]
are
\[ \left.\begin{aligned} \bar x_h(\hat\mu^\rho,\hat\Theta^\rho)-\hat\mu_h^\rho&=0\\ \hat\sigma_{hk}^\rho-s_{hk}(\hat\mu^\rho,\hat\Theta^\rho)-\rho\,\hat v_{hk}&=0 \end{aligned}\right\} \]
(3.3)
where $\hat\sigma_{hk}^\rho$ denotes the $(h,k)$th element of $(\hat\Theta^\rho)^{-1}$ and $\hat v_{hk}$ denotes the subgradient of the absolute value function at $\hat\theta_{hk}^\rho$, i.e., $\hat v_{hk}=\mathrm{sign}(\hat\theta_{hk}^\rho)$ if $\hat\theta_{hk}^\rho\neq 0$ and $|\hat v_{hk}|\leq 1$ if $\hat\theta_{hk}^\rho=0$.
A proof of Theorem 1 is reported in the supplementary material available at Biostatistics online. The stationary conditions (3.3) show that, while in the standard graphical lasso the parameter μ can be estimated by the empirical average regardless of ρ and the inference about the concentration matrix can be carried out using the profile log-likelihood function, inside our framework the two inferential problems cannot be separated, since the tuning parameter also affects the estimator of the expected value. Furthermore, the conditions suggest that, for a given value of the tuning parameter, the cglasso estimator can be computed using the EM algorithm (Dempster and others, 1977). This algorithm repeats two steps until a convergence criterion is met. The first step, called the E-step, requires the calculation of the conditional expected value of the complete log-likelihood function using the current estimates. The resulting function, called the Q-function, is maximized in the second step, i.e. the M-step. As explained in McLachlan and Krishnan (2008), the EM algorithm can be significantly simplified when the complete probability density function is a member of the regular exponential family. In this case, the E-step simply requires the computation of the conditional expected values of the sufficient statistics. In our case, if we denote by $\{\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho\}$ an initial estimate of the parameters, the E-step reduces to the computation of $x_{i,h}(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ and $x_{i,hk}(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$, for $i=1,\dots,n$. From this, conditions (3.3) can be written as
\[ \left.\begin{aligned} \bar x_h(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)-\hat\mu_h^\rho&=0\\ \hat\sigma_{hk}^\rho-s_{hk}(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)-\rho\,\hat v_{hk}&=0 \end{aligned}\right\} \]
(3.4)
which are the stationary conditions of a standard graphical lasso problem (Witten and others, 2011) with $S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ used as the current estimate of the empirical covariance matrix. The conditions (3.4) imply that in the M-step the parameter μ is estimated by $\bar x(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$, while Θ is estimated by solving the following maximization problem:
\[ \max_{\Theta\succ 0} Q(\Theta\mid\hat\Theta_{\mathrm{ini}}^\rho)=\max_{\Theta\succ 0}\ \log\det\Theta-\mathrm{tr}\{\Theta S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)\}-\rho\sum_{h\neq k}|\theta_{hk}|, \]
(3.5)
which is a standard graphical lasso problem. The following steps summarize the proposed EM algorithm for the derivation of the cglasso estimator (a schematic implementation is sketched after the list):
  1. Let $\{\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho\}$ be initial estimates;

  2. E-step: compute $\bar x(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ and $S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$;

  3. M-step: let $\hat\mu^\rho=\bar x(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ and solve problem (3.5) using $S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$;

  4. If a convergence criterion is met, return $\{\hat\mu^\rho,\hat\Theta^\rho\}$; otherwise let $\hat\mu_{\mathrm{ini}}^\rho=\hat\mu^\rho$ and $\hat\Theta_{\mathrm{ini}}^\rho=\hat\Theta^\rho$;

  5. Repeat Steps 2–4.
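To make the structure of the algorithm concrete, the following R sketch implements one possible version of this loop for right-censored data (the setting of Sections 4 and 5). It is not the authors' cglasso implementation: the E-step is deliberately simplified, replacing each censored entry by univariate truncated-normal moments computed from its conditional distribution given the observed entries of the same row (in the spirit of the approximation (3.6) discussed next), and the M-step calls the glasso package. All names are illustrative.

    ## Schematic EM loop for a right-censored graphical lasso fit (a sketch, not
    ## the authors' implementation).  Input X has NA at right-censored entries,
    ## u is the vector of censoring limits, and each row is assumed to contain
    ## at least one observed entry.  Assumes the glasso and truncnorm packages.
    library(glasso)
    library(truncnorm)

    cglasso_em <- function(X, u, rho, maxit = 50, tol = 1e-4) {
      n <- nrow(X); p <- ncol(X)
      R  <- ifelse(is.na(X), 1L, 0L)                # 1 = right-censored at u_h
      Xw <- ifelse(is.na(X), rep(u, each = n), X)   # working copy, start at the limit
      mu <- colMeans(Xw); Sigma <- cov(Xw); Theta <- solve(Sigma)
      for (it in seq_len(maxit)) {
        ## ---- E-step: conditional moments of the censored entries ------------
        Xhat <- Xw; V <- matrix(0, p, p)            # V collects extra variances
        for (i in seq_len(n)) {
          ci <- which(R[i, ] == 1L)
          oi <- which(R[i, ] == 0L)
          for (h in ci) {
            s_oo <- Sigma[oi, oi, drop = FALSE]
            m_h  <- as.numeric(mu[h] + Sigma[h, oi, drop = FALSE] %*%
                                 solve(s_oo, X[i, oi] - mu[oi]))
            v_h  <- as.numeric(Sigma[h, h] - Sigma[h, oi, drop = FALSE] %*%
                                 solve(s_oo, Sigma[oi, h, drop = FALSE]))
            ## simplified moments: univariate truncation of the conditional marginal
            Xhat[i, h] <- etruncnorm(a = u[h], b = Inf, mean = m_h, sd = sqrt(v_h))
            V[h, h]    <- V[h, h] + vtruncnorm(a = u[h], b = Inf,
                                               mean = m_h, sd = sqrt(v_h)) / n
          }
        }
        mu_new <- colMeans(Xhat)
        S <- cov(Xhat) * (n - 1) / n + V            # working covariance matrix
        ## ---- M-step: a standard graphical lasso problem ----------------------
        fit <- glasso(S, rho = rho)
        converged <- max(abs(fit$wi - Theta)) < tol && max(abs(mu_new - mu)) < tol
        mu <- mu_new; Theta <- fit$wi; Sigma <- fit$w
        if (converged) break
      }
      list(mu = mu, Theta = Theta)
    }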

Although the maximization problem (3.5) can be efficiently solved using, for example, the algorithms proposed by Friedman and others (2008), Rothman and others (2008), or Witten and others (2011), the previous steps reveal that the main computational cost of the proposed EM algorithm comes from the evaluation of the moments of the multivariate truncated Gaussian distribution, which are needed to compute $\bar x(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ and $S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$. These moments can be computed using the methods proposed by Lee (1983), Leppard and Tallis (1989), or Arismendi (2013), but these methods require complex numerical algorithms for the calculation of the integral of the multivariate normal density function (see Genz and Bretz, 2002, for a review), and they are computationally infeasible even for moderate-size problems. A possible solution is to approximate the moments using Monte Carlo methods. Our preliminary study, however, shows that the Monte Carlo error increases the number of EM steps required for convergence of the proposed algorithm, thus removing the gain from the faster calculation of the moments. Taking all of this into consideration, we propose a different, approximate EM algorithm. In particular, following the idea proposed by Guo and others (2015), we approximate the quantities $E_{c_i|o_i}(X_{ih}X_{ik}\mid X_{ic_i}\in D_{c_i})$, for any $h\neq k$, by:
\[ E_{c_i|o_i}(X_{ih}X_{ik}\mid X_{ic_i}\in D_{c_i})\approx E_{c_i|o_i}(X_{ih}\mid X_{ic_i}\in D_{c_i})\,E_{c_i|o_i}(X_{ik}\mid X_{ic_i}\in D_{c_i}). \]
(3.6)

As shown by Guo and others (2015), this approach works well when the true concentration matrix is sparse or the tuning parameter is sufficiently large. The main advantage of the approximation (3.6) is that, in order to evaluate the quantities $\bar x(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ and $S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$, we now only need to compute $E_{c_i|o_i}(X_{ih}\mid X_{ic_i}\in D_{c_i})$ and $E_{c_i|o_i}(X_{ih}^2\mid X_{ic_i}\in D_{c_i})$, for $h=1,\dots,p$. These quantities can be computed using exact formulas (Johnson and others, 1994) and require only the evaluation of the cumulative distribution function of the univariate Gaussian distribution. In the remainder of this article, we denote by $\bar S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ the approximate estimate of the empirical covariance matrix resulting from this approach. In Section 4.1, we use an extensive simulation study to assess the behavior of the proposed algorithm when the matrix $\bar S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ is used in Steps 2 and 3.
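For a right-censored component, the required univariate moments have the classical closed forms below, involving only dnorm() and pnorm(); the helper name is ours.

    ## Closed-form first and second moments of a Gaussian N(m, s^2) truncated
    ## below at u (the relevant case for a right-censored component).
    ## Helper name is ours.
    trunc_moments_right <- function(m, s, u) {
      alpha  <- (u - m) / s
      lambda <- dnorm(alpha) / pnorm(alpha, lower.tail = FALSE)  # inverse Mills ratio
      e1 <- m + s * lambda                                       # E(X | X > u)
      v  <- s^2 * (1 + alpha * lambda - lambda^2)                # Var(X | X > u)
      c(first = e1, second = v + e1^2)                           # E(X | .), E(X^2 | .)
    }

    ## Quick numerical check (can be compared with truncnorm::etruncnorm):
    trunc_moments_right(m = 39, s = 1, u = 40)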

As with graphical lasso, our proposed method requires a sequence of ρ-values, which should be suitably defined so as to reduce the computational cost needed to compute the entire path of the estimated parameters. Theorem 2 gives the exact formula for the derivation of the largest ρ-value, denoted by ρmax, and the corresponding cglasso estimator.

 
Theorem 2
For any index h define the sets $o_h=\{i: r_{ih}=0\}$, $c_h^-=\{i: r_{ih}=-1\}$ and $c_h^+=\{i: r_{ih}=+1\}$, and compute the marginal maximum likelihood estimates $\{\hat\mu_h,\hat\sigma_h^2\}$ maximizing
\[ \ell(\mu_h,\sigma_h^2)=\sum_{i\in o_h}\log\phi(x_{ih};\mu_h,\sigma_h^2)+|c_h^-|\log\int_{-\infty}^{l_h}\phi(x;\mu_h,\sigma_h^2)\,\mathrm{d}x+|c_h^+|\log\int_{u_h}^{+\infty}\phi(x;\mu_h,\sigma_h^2)\,\mathrm{d}x. \]

Then $\rho_{\max}=\max_{h\neq k}|s_{hk}(\hat\mu,\hat\Theta)|$, where the $s_{hk}(\hat\mu,\hat\Theta)$ are the elements of the matrix $S(\hat\mu,\hat\Theta)$ computed using $\hat\mu=(\hat\mu_1,\dots,\hat\mu_p)^\top$ and $\hat\Theta=\mathrm{diag}(\hat\sigma_1^{-2},\dots,\hat\sigma_p^{-2})$. Furthermore, $\hat\mu$ and $\hat\Theta$ are the corresponding cglasso estimators.

A proof of Theorem 2 is reported in the supplementary material available at Biostatistics online. Theorem 2 shows how the maximum value of ρ can be efficiently computed by solving p univariate optimization problems that do not involve the tuning parameter, and then calculating $S(\hat\mu,\hat\Theta)$. Once $\rho_{\max}$ is available, the entire path of cglasso estimates is calculated as follows:
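The univariate, right-censored marginal fit required by Theorem 2 can be sketched in R as follows (left-censoring is omitted for brevity and all names are ours); given the p marginal fits, $\rho_{\max}$ is then the largest absolute off-diagonal entry of $S(\hat\mu,\hat\Theta)$.

    ## Sketch of the marginal, right-censored Gaussian fit of Theorem 2
    ## (left-censoring omitted; helper names are ours).  x contains the observed
    ## values of one variable and n_cens the number of observations censored at
    ## the limit of detection u.
    marginal_cens_mle <- function(x, n_cens, u) {
      negloglik <- function(par) {
        mu <- par[1]; sigma <- exp(par[2])      # log-parametrization keeps sigma > 0
        -(sum(dnorm(x, mu, sigma, log = TRUE)) +
            n_cens * pnorm(u, mu, sigma, lower.tail = FALSE, log.p = TRUE))
      }
      fit <- optim(c(mean(x), log(sd(x))), negloglik)
      c(mu = fit$par[1], sigma2 = exp(fit$par[2])^2)
    }

    ## Example: a variable with values right-censored at u = 40.
    set.seed(2)
    z <- rnorm(50, mean = 39, sd = 1.5)
    marginal_cens_mle(x = z[z <= 40], n_cens = sum(z > 40), u = 40)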

  1. Compute $\rho_{\max}$ as specified in Theorem 2 and let $\rho_{\min}$ be the smallest ρ-value;

  2. Compute a decreasing sequence $\{\rho^{(k)}\}_{k=1}^K$ of distinct ρ-values from $\rho_{\max}$ to $\rho_{\min}$;

  3. for k = 2 to K do

  4.   Let $\hat\mu_{\mathrm{ini}}^{\rho^{(k)}}=\hat\mu^{\rho^{(k-1)}}$ and $\hat\Theta_{\mathrm{ini}}^{\rho^{(k)}}=\hat\Theta^{\rho^{(k-1)}}$;

  5.   Use the EM algorithm to compute $\{\hat\mu^{\rho^{(k)}},\hat\Theta^{\rho^{(k)}}\}$ with $\{\hat\mu_{\mathrm{ini}}^{\rho^{(k)}},\hat\Theta_{\mathrm{ini}}^{\rho^{(k)}}\}$ as starting values;

  6. end for

The previous description shows that the entire path can be computed using the estimates obtained for a given ρ-value as warm starts for fitting the next cglasso model. This strategy is also commonly used in other efficient lasso algorithms and R packages (Friedman and others, 2010; Friedman and others, 2018). In our model, this strategy turns out to be remarkably efficient: with a sufficiently fine sequence of ρ-values, the starting values defined in Step 4 are close to the estimates computed in Step 5, which speeds up the convergence of the resulting algorithm.
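The warm-start mechanics can be illustrated on the M-step alone, i.e. on a plain graphical lasso path computed on a fixed covariance matrix; in the cglasso algorithm the same scheme wraps the full EM fit. The log-spaced ρ-sequence is our own choice, and we assume the start, w.init, and wi.init arguments of the glasso package.

    ## Warm-start path on a fixed covariance matrix S (illustration only; the
    ## rho sequence and object names are ours).
    library(glasso)

    set.seed(3)
    S <- cov(matrix(rnorm(100 * 10), 100, 10))
    rho_max <- max(abs(S[upper.tri(S)]))
    rho_seq <- exp(seq(log(rho_max), log(1e-3), length.out = 30))

    fits <- vector("list", length(rho_seq))
    fits[[1]] <- glasso(S, rho = rho_seq[1])
    for (k in 2:length(rho_seq)) {
      fits[[k]] <- glasso(S, rho = rho_seq[k], start = "warm",
                          w.init = fits[[k - 1]]$w, wi.init = fits[[k - 1]]$wi)
    }
    sapply(fits, function(f) sum(f$wi[upper.tri(f$wi)] != 0))  # edges along the path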

3.3. Tuning parameter selection

The tuning parameter plays a central role in the proposed cglasso estimator, since it is designed to control the complexity of the topological structure of the estimated graph. In this article, we propose to select the optimal ρ-value of the cglasso estimator by using the extended Bayesian Information Criterion (Foygel and Drton, 2010). For our proposed estimator, this is given by:
\[ \mathrm{BIC}_\gamma(\hat E_\rho)=-2\sum_{i=1}^n\log\varphi(x_{io_i},r_i;\hat\mu,\hat\Theta(\hat E_\rho))+a(\rho)(\log n+4\gamma\log p), \]
where $\hat\mu$ and $\hat\Theta(\hat E_\rho)$ are the maximum likelihood estimates of the Gaussian graphical model specified by $\hat E_\rho=\{(h,k):\hat\theta_{hk}^\rho\neq 0\}$, and $a(\rho)$ denotes the number of non-zero off-diagonal estimates of $\hat\Theta^\rho$. The measure is indexed by the parameter $\gamma\in[0,1]$, with $\gamma=0$ corresponding to the classical BIC measure. Although a grid search can be performed to select the ρ-value that minimizes $\mathrm{BIC}_\gamma(\hat E_\rho)$, the computational burden related to the evaluation of the log-likelihood function can make this strategy infeasible even for moderate-size problems. For this reason, following Ibrahim and others (2008), we propose to select the ρ-value by minimizing the following approximate measure:
\[ \overline{\mathrm{BIC}}_\gamma(\hat E_\rho)=-n\left[\log\det\hat\Theta^\rho-\mathrm{tr}\{\hat\Theta^\rho S(\hat\mu^\rho,\hat\Theta^\rho)\}\right]+a(\rho)(\log n+4\gamma\log p), \]
which is defined by substituting the exact log-likelihood function with the Q-function used in the M-step of the proposed algorithm and is easily obtained as a byproduct of the EM algorithm. In the supplementary material available at Biostatistics online, we report a simulation study comparing the behavior of the $\mathrm{BIC}_\gamma(\hat E_\rho)$ and $\overline{\mathrm{BIC}}_\gamma(\hat E_\rho)$ measures. The results show that the two criteria are equivalent in terms of false discovery rate and true positive rate. Furthermore, as suggested in Foygel and Drton (2010), the optimal results are obtained when $\gamma=0.5$.
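Given a path of fitted precision matrices and the corresponding working covariance matrices produced by the EM algorithm, the approximate criterion can be evaluated and minimized as in the sketch below (helper and object names are ours; whether a(ρ) counts non-zero pairs or entries is a convention, and we count pairs here).

    ## Approximate eBIC over a path of estimates: for each rho, the Q-function
    ## part -n * [log det(Theta) - tr(Theta %*% S)] plus the complexity penalty
    ## a(rho) * (log n + 4 * gamma * log p).  Theta_path and S_path are assumed
    ## to be lists of precision and working covariance matrices (names ours).
    ebic_path <- function(Theta_path, S_path, n, p, gamma = 0.5) {
      sapply(seq_along(Theta_path), function(k) {
        Theta <- Theta_path[[k]]; S <- S_path[[k]]
        a_rho <- sum(Theta[upper.tri(Theta)] != 0)   # non-zero off-diagonal pairs
        -n * (determinant(Theta, logarithm = TRUE)$modulus - sum(Theta * S)) +
          a_rho * (log(n) + 4 * gamma * log(p))
      })
    }

    ## Toy check with a single 3 x 3 identity "fit"; in practice one would take
    ## k_opt <- which.min(ebic_path(Theta_path, S_path, n, p)).
    ebic_path(list(diag(3)), list(diag(3)), n = 100, p = 3)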

4. Simulation studies

4.1. Evaluating the behavior of the approximate EM algorithm

In a first simulation, we evaluate the effects of the approximation (3.6) on the accuracy of the estimators obtained with the EM algorithm. As in the real dataset studied in Section 5, we consider right-censored data and set the right-censoring value $u_h$ equal to 40 for every $h=1,\dots,p$. For this simulation, where we plan to use the full EM algorithm, we set the number of variables p to 10 and the sample size n to 100. To simulate a sample from a sparse right-censored model, we use the following procedure. First, using the method implemented in the R package huge (Zhao and others, 2015), we simulate a sparse concentration matrix with a random structure, where we set the probability that a $\theta_{hk}$ is different from zero to 0.1. Then, the elements of the parameter μ are selected so as to obtain a fixed probability of right censoring for any given random variable. More specifically, we randomly draw a subset D from $I=\{1,\dots,p\}$ and, for each $h\in D$, the value of the parameter $\mu_h$ is such that $\mathrm{Prob}\{R(X_h;-\infty,40)=+1\}=0.25$, while for each $h\notin D$ the parameter $\mu_h$ is such that the probability of right censoring is approximately equal to $10^{-11}$. The cardinality of the set D, denoted by |D|, is used in the study as a tool to analyze the effects of the number of censored variables on the behavior of the proposed EM algorithm and its approximated version. Finally, we draw a sample from a multivariate Gaussian distribution with parameters given in the previous steps and treat each value greater than 40 as a missing (censored) value. We simulated 100 datasets from this model and, for each simulation, we computed a path of cglasso estimates using the proposed EM algorithm and a path using the approximated EM algorithm. In the remainder of this section, the estimates belonging to the two paths are denoted by $\{\hat\mu_e^\rho,\hat\Theta_e^\rho\}$ and $\{\hat\mu_a^\rho,\hat\Theta_a^\rho\}$, respectively. For each path, the largest value of the tuning parameter was computed using the results given in Theorem 2, while the smallest value was set equal to $1\times10^{-3}$.
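The data-generating scheme just described can be reproduced along the following lines; we assume the interfaces of huge::huge.generator and MASS::mvrnorm, and the means are chosen through the Gaussian quantile relation so that Prob{X_h > 40} equals the target censoring probability. Object names are ours.

    ## Sketch of the data-generating scheme of Section 4.1 (assumed interfaces of
    ## huge::huge.generator and MASS::mvrnorm; object names are ours).
    library(huge)
    library(MASS)

    set.seed(4)
    p <- 10; n <- 100; u <- 40
    sim   <- huge.generator(n = n, d = p, graph = "random", prob = 0.1)
    Sigma <- sim$sigma                    # covariance implied by the sparse Theta

    ## Means chosen so that the variables in a subset D are right-censored with
    ## probability 0.25, and the remaining ones essentially never (prob ~ 1e-11).
    D  <- sample(p, 4)
    mu <- u - qnorm(1 - 1e-11) * sqrt(diag(Sigma))
    mu[D] <- u - qnorm(0.75) * sqrt(diag(Sigma))[D]

    X <- mvrnorm(n, mu = mu, Sigma = Sigma)
    X[X > u] <- NA                        # treat values above the limit as censored
    mean(is.na(X))                        # overall proportion of censored entries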

The first part of Table 1 reports the average CPU times for computing the path. As expected, the computational time needed to compute a path is always an increasing function of the number of censored variables but, when we use the matrix $S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ in the E-step, the table reveals that the relationship between the CPU time and |D| is almost exponential, implying that the computation of the cglasso estimator is infeasible even for relatively small datasets. In contrast, when we use the matrix $\bar S(\hat\mu_{\mathrm{ini}}^\rho,\hat\Theta_{\mathrm{ini}}^\rho)$ to approximate the current estimate of the empirical covariance matrix, the table shows an almost linear dependence of the CPU time on |D|, which implies a significant reduction of the computational complexity compared with the full EM algorithm. Although the results shown in Table 1 strongly suggest the use of the approximated EM algorithm to compute the cglasso estimator, they do not provide information about the difference between the two estimates. For this reason, we also computed the largest Euclidean distance between $\hat\mu_e^\rho$ and $\hat\mu_a^\rho$, denoted by $\max_\rho\|\Delta\hat\mu^\rho\|_2$, and the largest Frobenius distance between $\hat\Theta_e^\rho$ and $\hat\Theta_a^\rho$, denoted by $\max_\rho\|\Delta\hat\Theta^\rho\|_F^2$. The averages and standard deviations of these values are reported in the second part of Table 1. All the results clearly show that the two estimators are sufficiently close to each other, and point to the use of the approximated EM algorithm for the derivation of the cglasso estimator.

Table 1.

Results of the simulation study evaluating the effect of the approximation (3.6): the first part reports the average CPU time (in seconds) for computing a path, whereas the second part reports the average of $\max_\rho\|\Delta\hat\mu^\rho\|_2$ and $\max_\rho\|\Delta\hat\Theta^\rho\|_F^2$. Standard deviations are shown in brackets.

                                      Number of censored variables
                            2           3           4           5           6           7           8
Average CPU time (s)
  Exact EM               9.60        25.87       68.36       104.77      140.01      249.40      374.34
                        (2.79)       (7.39)      (18.87)     (28.53)     (34.39)     (57.13)     (96.26)
  Approx. EM             5.08         7.89        13.81       15.31       16.26       21.28       24.90
                        (1.02)       (1.44)      (2.23)      (2.20)      (2.33)      (2.64)      (3.10)
Difference between the two estimates
  max_rho ||Dmu||_2    2.9x10^-6    8.2x10^-6   1.7x10^-5   2.2x10^-5   7.8x10^-5   1.1x10^-4   2.1x10^-4
                      (8.6x10^-6)  (2.1x10^-5) (3.7x10^-5) (3.1x10^-5) (1.0x10^-4) (1.2x10^-4) (2.3x10^-4)
  max_rho ||DTheta||_F^2  3.0x10^-5  8.3x10^-5   2.0x10^-4   2.6x10^-4   2.3x10^-3   2.6x10^-3   6.6x10^-3
                      (8.5x10^-5)  (2.0x10^-4) (3.7x10^-4) (3.8x10^-4) (3.1x10^-4) (2.4x10^-3) (6.0x10^-3)


4.2. Comparison of methods on data simulated from a censored Gaussian graphical model

In a second simulation study, we compare our proposed estimator with MissGlasso (Städler and Bühlmann, 2012), which performs ℓ1-penalized estimation under the assumption that the censored data are missing at random, and with the glasso estimator (Friedman and others, 2008), where the empirical covariance matrix is calculated by imputing the missing values with the limit of detection. These estimators are evaluated in terms of both recovery of the structure of the true graph and mean squared error. We use an approach similar to the previous simulation for generating right-censored data. In particular, we set the right-censoring value to 40 for every variable and the sample size n to 100. We generate a sparse concentration matrix with random structure and set the probability of observing a link between two nodes to k/p, where p is the number of variables and k is used to control the amount of sparsity in Θ. Finally, we set the mean μ in such a way that $\mu_h=40$ for the H censored variables, i.e. $\mathrm{Prob}\{R(X_h;-\infty,40)=+1\}=0.50$, while for the remaining variables $\mu_h$ is sampled from a uniform distribution on the interval [10, 35]. At this point, we simulate a sample from the latent p-variate Gaussian distribution and treat all values greater than 40 as missing. The quantities k, p, and H are used to specify the different models used to analyze the behavior of the considered estimators. In particular, we consider the following cases:

  • Model 1: $k=3$, $p=50$ and $H\in\{25,35\}$. This setting is used to evaluate the effects of the number of censored variables on the behavior of the proposed estimators when $n>p$.

  • Model 2: $k\in\{1,5\}$, $p=50$ and $H=30$. This setting is used to evaluate the effects of the sparsity of the matrix Θ on the considered estimators when $n>p$.

  • Model 3: $k=3$, $p=200$ and $H=100$. This setting is used to evaluate the impact of high dimensionality on the estimators ($p\gg n$).

For each model, we simulate 100 samples from the right-censored model and in each simulation we compute the coefficients path using cglasso, MissGlasso, and glasso. Each path is computed using an equally spaced sequence of 30 ρ-values. Figure 1a shows the precision-recall curves for Model 1 with H=25. The curves report the relationship between precision and recall for any ρ-value, which are defined by:
\[ \mathrm{Precision}(\rho)=\frac{\#\{(h,k):\hat\theta_{hk}^\rho\neq 0 \text{ and } \theta_{hk}\neq 0\}}{\#\{(h,k):\hat\theta_{hk}^\rho\neq 0\}},\qquad \mathrm{Recall}(\rho)=\frac{\#\{(h,k):\hat\theta_{hk}^\rho\neq 0 \text{ and } \theta_{hk}\neq 0\}}{\#\{(h,k):\theta_{hk}\neq 0\}}. \]
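These two quantities can be computed directly from the upper triangles of the estimated and true concentration matrices, as in the short sketch below (helper name ours).

    ## Precision and recall of an estimated concentration matrix against the
    ## true one, as defined above (helper name is ours).
    precision_recall <- function(Theta_hat, Theta_true) {
      est  <- Theta_hat[upper.tri(Theta_hat)]   != 0
      true <- Theta_true[upper.tri(Theta_true)] != 0
      c(precision = sum(est & true) / sum(est),
        recall    = sum(est & true) / sum(true))
    }

    ## Toy example with p = 3: one true edge (1,2), estimate adds a spurious (1,3).
    Theta_true <- matrix(c(1, .5, 0,  .5, 1, 0,  0, 0, 1), 3, 3)
    Theta_hat  <- matrix(c(1, .4, .1, .4, 1, 0,  .1, 0, 1), 3, 3)
    precision_recall(Theta_hat, Theta_true)   # precision 0.5, recall 1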
Fig. 1.

Results of the comparative simulation study based on Model 1 with H=25. (a) The average precision-recall curves are shown; (b) the box-plots of $\min_\rho\mathrm{MSE}(\hat\Theta^\rho)$ for the considered estimators are shown.

The curves show that cglasso gives a better estimate of the concentration matrix, both in terms of precision and recall, for any given value of the tuning parameter. Figure 1b shows the distribution of the quantity $\min_\rho\mathrm{MSE}(\hat\Theta^\rho)$, i.e. the minimum value of the mean squared error attained along the path of solutions. These box-plots emphasize that cglasso not only has a much smaller mean squared error than glasso and MissGlasso, but is also much more stable than its competitors. The same behavior was observed in the other models used in our numerical study, as reported in the supplementary material available at Biostatistics online. Table 2 reports the summary statistics from all the simulations. In addition to the quantity described above (second meta-column), the first meta-column shows the mean squared error in the estimation of the mean (denoted by $\min_\rho\mathrm{MSE}(\hat\mu^\rho)$) and the third meta-column reports the area under the precision-recall curve (AUC). Note that the considered measures allow us to study the behavior of the estimators along the entire path. The distributions of the minimum mean squared errors show that our estimator not only recovers the structure of the graph but also outperforms the competitors in terms of the estimation of both μ and Θ. We do not report $\min_\rho\mathrm{MSE}(\hat\mu^\rho)$ for glasso since this method does not estimate the parameter μ. The results on the AUC suggest that cglasso is an efficient tool for recovering the structure of the true concentration matrix of a Gaussian graphical model from censored data.

Table 2.

Comparison between cglasso and two existing methods on simulated data: for each measure the table reports the average value and, between parentheses, the standard deviation.

  Model               min_rho MSE(mu)         min_rho MSE(Theta)                     AUC
  p    H/p   k/p     cglasso  MissGlasso   cglasso  glasso    MissGlasso   cglasso  glasso  MissGlasso
  50   0.5   0.06     0.47     14.50         8.76   103.35      96.75        0.48    0.37     0.19
                     (0.11)   (0.69)        (0.64)  (14.43)    (16.01)      (0.04)  (0.03)   (0.02)
  50   0.7   0.06     0.48     21.00        10.11   139.76     131.99        0.47    0.33     0.15
                     (0.10)   (0.76)        (0.84)  (15.94)    (18.81)      (0.05)  (0.03)   (0.02)
  50   0.6   0.02     0.47     18.31         6.92   128.60     119.65        0.61    0.42     0.16
                     (0.10)   (0.81)        (0.82)  (17.75)    (17.38)      (0.08)  (0.05)   (0.03)
  50   0.6   0.10     0.46     17.39        12.02   113.84     105.70        0.43    0.34     0.20
                     (0.10)   (0.92)        (0.85)  (14.79)    (15.71)      (0.04)  (0.03)   (0.02)
  200  0.5   0.015    1.92     63.34        41.57   398.02     373.86        0.32    0.21     0.13
                     (0.19)   (1.54)        (1.52)  (29.78)    (32.46)      (0.02)  (0.02)   (0.01)

4.3. Testing the robustness of the method on more realistic biological data

In a third simulation study, we test the robustness of the method on more realistic biological data. In particular, we consider expression data from Wille and others (2004) on the extensively studied Arabidopsis thaliana biological system. The study reports data from n=118 experiments on p=39 genes, whose regulatory network is of interest. Although the data are fully observed, we test our method on datasets where observations are made artificially right-censored, similarly to Städler and Bühlmann (2012). In particular, we produce three datasets at different levels of right censoring, by recording as missing the 10%, 20%, and 30% highest values, respectively. Then, we compare our method, cglasso, with MissGlasso (Städler and Bühlmann, 2012) and with three additional methods, which impute the missing values first and then infer the network using the graphical lasso on the imputed values. As in Städler and Bühlmann (2012), we consider k-nearest neighbor imputation, which we denote with missknn, and imputation using random forests, which we denote with MissForest. These methods are based on a missing-at-random assumption. We further consider a method that relies on the assumption of multivariate Gaussian data under a censoring mechanism, which we denote with imputeLCMD (Lazar, 2015).
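The artificial censoring step can be sketched as follows; whether the highest values are selected per gene or globally is not stated above, so the column-wise version below is one plausible reading, and all names are ours.

    ## Artificially right-censor a fully observed expression matrix by recording
    ## the highest values as missing (here column-wise, at the empirical quantile;
    ## this is one plausible reading of the procedure described above).
    censor_top <- function(X, frac = 0.10) {
      apply(X, 2, function(x) { x[x > quantile(x, 1 - frac)] <- NA; x })
    }

    set.seed(5)
    X   <- matrix(rnorm(118 * 39), 118, 39)  # same dimensions as the A. thaliana data
    X10 <- censor_top(X, 0.10)
    mean(is.na(X10))                          # roughly 10% of entries censored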

Table 3 shows the Euclidean norm between the true observed data and the imputed data for the five methods considered and across the three different levels of censoring. The results show that the performance of all methods deteriorates as the level of censoring increases and that cglasso is overall superior to the other methods across all comparisons, followed closely in some cases by imputeLCMD. The areas under the precision-recall curves of the networks inferred by the five methods, compared with the graphical lasso network selected by $\mathrm{BIC}_{0.5}(\hat E_\rho)$ from the fully observed data, show that the proposed cglasso method leads to a significant gain in network recovery over MissGlasso, missknn, and MissForest even with 30% of censored data.

Table 3.

Comparison of methods on the Arabidopsis thaliana expression data: Euclidean distance (ED) between the observed and imputed data and area under the precision-recall curve (AUC) across the five methods and the three levels of censoring.

        Percentage censored   cglasso  imputeLCMD  MissGlasso  missknn  missForest
  ED        10%                 9.95      13.49       27.89      31.37     33.31
            20%                15.22      18.60       37.73      40.16     46.21
            30%                19.76      24.38       46.65      50.08     57.54
  AUC       10%                 0.93       0.88        0.78       0.79      0.77
            20%                 0.90       0.86        0.70       0.72      0.68
            30%                 0.87       0.85        0.63       0.70      0.61

5. Application to single cell-data: megakaryocyte-erythroid progenitors

Recent advances in single-cell techniques have provided the opportunity to finely dissect cellular heterogeneity within known populations and to uncover rare cell types. In a study about the formation of blood cells, Psaila and others (2016) have recently identified three distinct subpopulations of cells, which are all derived from hematopoietic stem cells through cell differentiation. One of these sub-populations, denoted by MK-MEP, is a previously unknown, rare population of cells that are bipotent but primarily generate megakaryocytic progeny. In this section, we look closely at this sub-population and investigate the molecular mechanisms of regulation within these cells. To this end, we used data available from Psaila and others (2016) on 87 genes and 48 single human MK-MEP cells profiled by multiplex RT-qPCR.

As discussed in the Introduction, RT-qPCR data are typically right-censored. In this particular study, the limit of detection is fixed by the manufacturer at 40. Raw data have been mean normalized using the method proposed in Pipelers and others (2017), based on the two housekeeping genes provided. Figure 2a shows the relationship between the proportion of right-censored data and the mean of the normalized cycle-threshold. Two main conclusions can be drawn from this figure. Firstly, the proportion of censored data is an increasing function of the mean of the normalized cycle-threshold: this is expected, and it means that the assumption of missing-at-random is not justified for this dataset. Secondly, there are some genes with a very high proportion of censoring (see the points above the dashed line): these may be genes whose transcription failed to amplify, in which case they should be treated as missing rather than censored. Following this explorative analysis, and considering also the possible computational problems caused by the inclusion of these genes, we filtered out the genes with a proportion of censoring above 85% for the subsequent analysis. The resulting dataset contains the expression of 63 genes measured on 48 cells.
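The filtering step can be written compactly as below; the object names are ours and `ct` stands for the matrix of normalized cycle-thresholds with cells in rows and genes in columns.

    ## Exploratory filtering used before fitting: drop genes whose proportion of
    ## measurements at the limit of detection (here 40) exceeds 85%.
    ## Object names are ours.
    filter_by_censoring <- function(ct, limit = 40, max_prop = 0.85) {
      prop_cens <- colMeans(ct >= limit)
      ct[, prop_cens <= max_prop, drop = FALSE]
    }

    ## Toy illustration with 48 cells and 5 genes, one of which is almost always
    ## at the limit of detection:
    set.seed(6)
    ct <- cbind(matrix(rnorm(48 * 4, mean = 30, sd = 4), 48, 4),
                rep(40, 48))
    dim(filter_by_censoring(ct))   # 48 x 4: the fifth gene is removed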

Fig. 2.

Real data analysis. (a) The proportion of right-censored data versus the mean of the normalized cycle-threshold; the black line is obtained by fitting a logistic regression model, while the dashed line identifies the threshold used to filter out genes from the study. (b) Path of the $\mathrm{BIC}_{0.5}(\hat E_\rho)$ measure; the vertical dashed line identifies the optimal value of the tuning parameter. (c) The estimated graph.

The normalized cycle-thresholds of the remaining genes are used to fit a right-censored Gaussian graphical model with the proposed cglasso estimator. Figure 2b shows the path of the $\mathrm{BIC}_{0.5}(\hat E_\rho)$ measure; the vertical dashed line marks the optimal ρ-value. The resulting estimated graph (see Figure 2c) contains about 0.6% of all possible edges and about 19% of the considered genes. Consistent with the existing knowledge about this subpopulation, the estimated graph shows the central role of two genes, CD42 (glycoprotein 1b) and MYB (Psaila and others, 2016). The first one, CD42, is a megakaryocyte gene whose expression was found to be significantly high in the MK-MEP subpopulation. In particular, it is expressed later during megakaryocyte differentiation and has been associated with unipotent megakaryopoietic activity in mouse models. The link between CD42 and VWF, another highly expressed gene in this subpopulation (Psaila and others, 2016), was also identified by Chan and others (2017), as were other connections, such as the link between the genes CD42 and CD61 and that between CD42 and TGFB1. The second central gene, MYB, is a transcription factor that is known to enhance erythroid differentiation at the expense of megakaryopoiesis.

6. Conclusion

In this article, we have proposed a computational approach to fit Gaussian graphical models in the presence of censored data. The approach covers both right- and left-censored data. Since classical Gaussian graphical models cannot be used in high-dimensional settings, we also introduced an ℓ1-penalty to achieve sparsity (model selection) and parameter estimation simultaneously. The resulting estimator is called the censored graphical lasso (cglasso). The computational problem of estimating the mean and the conditional independence graph of a Gaussian graphical model from censored data is solved via an EM algorithm. An extensive simulation study showed that the proposed estimator outperforms existing estimators both in terms of parameter estimation and of network recovery. The analysis of a real RT-qPCR dataset showed how the method is able to infer the regulatory network underlying blood development under high levels of censoring.

In addition, we have proposed an approximated EM algorithm, which is computationally more efficient and makes the method feasible for high-dimensional settings. This approach relies on an efficient approximation of the mixed moments of the latent variables conditional on the observed ones. The approximation is exact under conditional independence and has thus been found to work well in sparse settings, as shown by Guo and others (2015) and by our own simulations. Future work will investigate the theoretical justification for this, as well as refine the scenarios under which a good performance is to be expected.

Acknowledgments

Conflict of Interest: None declared.

Funding

The project was partially supported by the European Cooperation in Science and Technology (COST) [COST Action CA15109 European Cooperation for Statistics of Network Data Science (COSTNET)].

References

Arismendi, J. C. (2013). Multivariate truncated moments. Journal of Multivariate Analysis 117, 41–75.

Augugliaro, L., Mineo, A. M. and Wit, E. C. (2016). ℓ1-Penalized methods in high-dimensional Gaussian Markov random fields. In: Dehmer, M., Shi, Y. and Emmert-Streib, F. (editors), Computational Network Analysis with R: Applications in Biology, Medicine, and Chemistry, Chapter 8. Weinheim, Germany: Wiley-VCH Verlag GmbH & Co. KGaA, pp. 201–267.

Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics 36, 199–227.

Brown, M. and Wittwer, C. (2000). Flow cytometry: principles and clinical applications in hematology. Clinical Chemistry 46, 1221–1229.

Chan, T. E., Stumpf, M. P. H. and Babtie, A. C. (2017). Gene regulatory network inference from single-cell data using multivariate information measures. Cell Systems 5, 251–267.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B 39, 1–38.

Derveaux, S., Vandesompele, J. and Hellemans, J. (2010). How to do successful gene expression analysis using real-time PCR. Methods 50, 227–230.

Foygel, R. and Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In: Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R. S. and Culotta, A. (editors), Advances in Neural Information Processing Systems. Vancouver, British Columbia, Canada: Curran Associates, Inc., pp. 604–612.

Friedman, J. H., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.

Friedman, J. H., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1–22.

Friedman, J. H., Hastie, T. and Tibshirani, R. (2018). glasso: Graphical Lasso-Estimation of Gaussian Graphical Models. R package version 1.10.

Gardner, T. S., di Bernardo, D., Lorenz, D. and Collins, J. J. (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–105.

Genz, A. and Bretz, F. (2002). Comparison of methods for the computation of multivariate t probabilities. Journal of Computational and Graphical Statistics 11, 950–971.

Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2015). Graphical models for ordinal data. Journal of Computational and Graphical Statistics 24, 183–204.

Hoffman, H. J. and Johnson, R. E. (2015). Pseudo-likelihood estimation of multivariate normal parameters in the presence of left-censored data. Journal of Agricultural, Biological, and Environmental Statistics 20, 156–171.

Ibrahim, J. G., Zhu, H. and Tang, N. (2008). Model selection criteria for missing-data problems using the EM algorithm. Journal of the American Statistical Association 103, 1648–1658.

Johnson, N. L., Kotz, S. and Balakrishnan, N. (1994). Continuous Univariate Distributions, 2nd edition, Volume 1. New York: John Wiley & Sons, Inc.

Lauritzen, S. L. (1996). Graphical Models. Oxford: Oxford University Press.

Lazar, C. (2015). imputeLCMD: A Collection of Methods for Left-Censored Missing Data Imputation. R package version 2.0.

Lee, G. and Scott, C. (2012). EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Computational Statistics & Data Analysis 56, 2816–2829.

Lee, L.-F. (1983). The determination of moments of the doubly truncated multivariate normal tobit model. Economics Letters 11, 245–250.

Leppard, P. and Tallis, G. M. (1989). Algorithm AS 249: Evaluation of the mean and covariance of the truncated multinormal distribution. Journal of the Royal Statistical Society. Series C 38, 543–553.

Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd edition. Hoboken, NJ, USA: John Wiley & Sons, Inc.

McCall, M. N., McMurray, H. R., Land, H. and Almudevar, A. (2014). On non-detects in qPCR data. Bioinformatics 30, 2310–2316.

McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd edition. Hoboken, NJ, USA: John Wiley & Sons, Inc.

Menéndez, P., Kourmpetis, Y. A. I., ter Braak, C. J. F. and van Eeuwijk, F. A. (2010). Gene regulatory networks from multifactorial perturbations using graphical lasso: application to the DREAM4 challenge. PLoS One 5, e14147.

Perkins, N. J., Schisterman, E. F. and Vexler, A. (2013). Multivariate normally distributed biomarkers subject to limits of detection and receiver operating characteristic curve inference. Academic Radiology 20, 838–846.

Pesonen, M., Pesonen, H. and Nevalainen, J. (2015). Covariance matrix estimation for left-censored data. Computational Statistics & Data Analysis 92, 13–25.

Pipelers, P., Clement, L., Vynck, M., Hellemans, J., Vandesompele, J. and Thas, O. (2017). A unified censored normal regression model for qPCR differential gene expression analysis. PLoS One 12, e0182832.

Psaila, B., Barkas, N., Iskander, D., Roy, A., Anderson, S., Ashley, N., Caputo, V. S., Lichtenberg, J., Loaiza, S., Bodine, D. M. and others. (2016). Single-cell profiling of human megakaryocyte-erythroid progenitors identifies distinct megakaryocyte and erythroid differentiation pathways. Genome Biology 17, 83–102.

Rothman, A., Bickel, P. J., Levina, E. and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics 2, 494–515.

Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability. Boca Raton, Florida: Chapman and Hall/CRC.

Städler, N. and Bühlmann, P. (2012). Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing 22, 219–235.

Uhler, C. (2012). Geometry of maximum likelihood estimation in Gaussian graphical models. The Annals of Statistics 40, 238–261.

Vinciotti, V., Augugliaro, L., Abbruzzo, A. and Wit, E. C. (2016). Model selection for factorial Gaussian graphical models with an application to dynamic regulatory networks. Statistical Applications in Genetics and Molecular Biology 15, 193–212.

Wille, A., Zimmermann, P., Vranová, E., Fürholz, A., Laule, O., Bleuler, S., Hennig, L., Prelić, A., von Rohr, P., Thiele, L., Zitzler, E., Gruissem, W. and others. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5, R92.

Witten, D. M., Friedman, J. H. and Simon, N. (2011). New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics 20, 892–900.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.

Zhao, T., Li, X., Liu, H., Roeder, K., Lafferty, J. and Wasserman, L. (2015). huge: High-Dimensional Undirected Graph Estimation. R package version 1.2.7.

