Summary
The basic building block of a gene regulatory network consists of a gene encoding a transcription factor (TF) and the gene(s) it regulates. Considerable efforts have been directed recently at devising experiments and algorithms to determine TFs and their corresponding target genes using gene expression and other types of data. The underlying problem is that the expression of a gene coding for the TF provides only limited information about the activity of the TF, which can also be controlled posttranscriptionally. In the absence of a reliable technology to routinely measure the activity of regulators, it is of great importance to understand whether this activity can be inferred from gene expression data. We here develop a statistical framework to reconstruct the activity of a TF from gene expression data of the target genes in its regulatory module. The novelty of our approach is that we embed the deterministic Michaelis–Menten model of gene regulation in this statistical framework. The kinetic parameters of the gene regulation model are inferred together with the profile of the TF regulator. We also obtain a goodness-of-fit test to verify the fit of the model. The model is applied to a time series involving the Streptomyces coelicolor bacterium. We focus on the transcriptional activator cdaR, which is partly responsible for the production of a particular type of antibiotic. The aim is to reconstruct the activity profile of this regulator. Our approach can be extended to include more complex regulatory relationships, such as multiple regulatory factors, competition, and cooperativity.
1. Introduction
Linking transcription factors (TFs) to their targets is a central problem in postgenomic biology. While genes regulated by the same TFs tend to be co-expressed, the relationship between the gene expression profiles of the TFs and their regulated genes can be quite complicated, often exhibiting time-shifted or inverted behavior (Yu et al., 2003). This could be due to the fact that changes in the expression of a TF are subtle and its activity is often controlled at levels other than expression, for example, via posttranscriptional modifications. Therefore, the expression of a gene coding for a TF generally provides only limited information on the true transcription factor activity (TFA). The situation becomes even more complex in the presence of cooperativity or competition between two or more TFs that regulate a target gene.
New computational methods have been proposed to infer TFAs from the gene expression data under the assumption that the two are not necessarily the same. Zhou et al. (2005) propose to validate TFA through cross-platform integration of expression data. Kao et al. (2004) and Boulesteix and Strimmer (2005) estimate the TFAs by setting the problem in a (partial) least-squares framework and by using algebraic matrix decomposition to deal with the high-dimensionality issue. Both assume a linear additive model of gene regulation. Gao, Foat, and Bussemaker (2004) suggested a multivariate regression analysis, using the ChIP occupancy log ratios for the TFs as a response and the genes as predictors. The coefficients of the regression express the changes in TFA. Regulated genes are those that are correlated with the TFA profile. In all of the above models, the data on the connectivity comes from outside sources, like ChIP-chip data or a priori knowledge.
In this article we develop a statistical framework to model regulatory pairs of TFs and their target genes using Michaelis–Menten (MM) kinetics for gene regulation. The MM model has been successfully used in various biological applications, including the regulation of a gene by a TF (Bolouri and Davidson, 2002; Mangan and Alon, 2003). Nachman, Regev, and Friedman (2004) were the first to incorporate the quantitative MM regulation model into the generative Bayesian probabilistic model. These authors attempted to estimate simultaneously the structure of regulatory modules as well as the kinetic parameters of the MM regulation model and the levels of ideal regulators that control them. They considered multiple TFs and multiple targets in their model, as well as a dynamic temporal behavior. They applied their Bayesian learning algorithms to yeast cell cycle expression data sets. In contrast, we develop a frequentist approach to find the parameters of the MM model of regulation by embedding this model in the statistical framework.
In Section 2, we introduce a model for observed gene expressions within a general network motif. Then in Section 3, we focus on the special case of a single input motif (SIM), for which we can obtain an explicit expression of the MM ordinary differential equation. In Section 4, we show how conjugate gradient methods can be used to estimate the kinetic MM parameters and the TFA of the regulator can be estimated via maximum likelihood. Finally, in Section 5, our statistical framework is applied to a 10-point time-course data set for a wild type and mutant type Streptomyces coelicolor. We obtain some interesting biological results and show that the model we propose has good fit to the data.
2. Model for TF-Initiated Gene Transcription
In this section, we present a general gene expression model that takes into account (i) transcription rate, (ii) decay rate, (iii) network structure, and (iv) stochastic effects.
2.1 Kinetic Model of Gene Transcription
The gene expression of a regulated gene, μ(t), defined as the number of transcribed RNA molecules present at time t, changes due to gene transcription and the decay of RNA molecules. The average rate of change in expression of a target gene, , is therefore described by the number of RNA molecules transcribed per unit of time and the number of decaying molecules per unit time:
where p(t) is a production term, that is, the rate of gene transcription, and δ is a linear degradation rate. Here μ(t) stands for the underlying expression of the regulated gene at time t. The general solution of the above linear differential equation is given by
Gene regulation is usually controlled by one or more TFs. The rate of gene transcription, p(t), depends on the type of regulation, that is, activation or repression, and on the type of regulation control, namely a single TF or multiple TFs. The transcription rate depends also on the so-called gate type in the case of multiple TF regulators. For example, the “AND” gate means that all TFs are required for regulation, while the “OR” gate implies that either of the TFs is sufficient to regulate the transcription of the target gene(s). In addition, the production term, p(t), depends on gene-specific kinetics of regulation, θ. For example, the target genes can have different values for the maximal production rate. Also, the transcription of different targets could saturate at different levels of the TF regulator.
Gene regulation has commonly been described using a linear model: either the transcription rate of a target gene or its expression is assumed proportional to the level of the TFs that regulate this gene (Kao et al., 2004; Boulesteix and Strimmer, 2005). In this article we model gene transcription with the so-called MM kinetics. The MM kinetics have been used in modeling enzyme-mediated reactions and have also been applied to TF-initiated transcription (Bolouri and Davidson, 2002; Nachman et al., 2004). The MM kinetic model, unlike a linear model, is able to describe saturation effects, which are biologically plausible.
It is worth noting that the proposed statistical framework is by no means limited to a specific microarray platform. The model can equally be applied to both cDNA and oligonucleotide microarrays, as well as gene expression profiles obtained by other technologies, such as quantitative real-time PCR.
2.2 Network Motif
The term network motif, coined by Milo et al. (2002), is defined as patterns of interconnections that recur in different parts of a network at frequencies much higher than those found in random networks. Several basic network motifs have been found in biological networks. Each network motif consists of several target genes regulated by one (SIM) or several (multiple input motif, feed forward loop) TFs.
Within a network motif, the gene expression of a gene k at time t, μk(t), depends on the decay rate-constant δk and on the transcription rate, pk(t), as defined by equation (2). The transcription rate pk(t) depends on several gene-specific kinetic parameters, θk, as well as on the activity of its TF regulator(s), whose activity levels are denoted by η1(t), … , ηM(t) (M≥ 1). The TFs are the common regulators to all the target genes, μk, k = 1, … , K in the network motif, while the kinetic parameters of gene regulation are likely to be target-dependent:
It is biologically compelling to assume that the gene-specific parameters of the gene kinetic equation, θk, are the same between the different biological conditions, such as wild type and mutant. The only exception is the initial amount of gene expression, μk0, which can be different due to all sorts of external factors that affect gene transcription. In Section 3, we consider a MM model implementation of the case of a SIM, that is, one regulator and many targets.
2.3 Noise Model
As the MM kinetic model requires that we model the intensities on the original rather than log-transformed scale, it is important to find a suitable distribution for the noise process. In particular, it is unlikely if not impossible to have merely additive noise. As log-transformed intensity ratios have been found to be approximately normal (Lee et al., 2000), we use the log-normal distribution for the ratios of the intensities. Moreover, as every microarray measures the gene expression of a different biological sample due to destructive sampling, it is reasonable to assume that all observations are independent. Let us denote by gkcr(t) the observed gene expression of gene k at a time-point t for the replicate r under condition c. The condition c stands, for example, for wild type or mutant. We assume that the observed gene expressions of a target gene k are independent and log-normally distributed with location parameter mkc(t) and scale parameter σ2k. This distribution takes into account the different variances associated with different amplitudes. The log-likelihood contribution of a single observation gkcr(t) is given by
Given the expectation of a log-normal distribution , the relationship between the true gene expression under condition c, μkc(t), and the location parameter of the log-normal distribution, mkc(t), is given by
Therefore, the location parameter mkc(t) ≡mkc(t; θkc, σ2k, η1, … , ηM) implicitly depends on the kinetic parameters θkc of the gene regulation model and on the TFs levels η1, … , ηM. The likelihood contribution in equation (3) can then be written as a function of the kinetic parameters of the gene regulation model as well as activities of TFs, namely l(θkc, σ2k, η1, … , ηM | gkcr(t)).
3. Michaelis–Menten Model of a Single Input Motif
3.1 Single Input Motif
We now apply our general methodology to a simple network architecture, called the SIM. It consists of a set of genes that are controlled by a single TF (Shen-Orr et al., 2002). All of the genes are under the same type of regulation (either all activated or all repressed), which presumably happen under a specific set of circumstances. None of these genes have additional transcriptional regulation. SIMs are potentially useful for coordinating a discrete unit of some biological function, such as a set of genes that code for the subunits of a biosynthesis apparatus or enzymes of a metabolic pathway (Lee et al. 2002). SIM is probably the simplest logical unit of a transcriptional regulatory network architecture that could serve as a starting point for the reconstruction of the TFA.
There is compelling experimental evidence that SIMs frequently occur in biological systems (Lee et al. 2002; Shen-Orr et al., 2002). It is partly an open question as how to identify new SIMs, verify the targets, and infer the activity of the regulators. The first source of information for SIMs is in databases such as RegulonDB, that were used by Shen-Orr et al. (2002) in their original study of network motifs. There is an increasing amount of ChIP-chip data, pioneered by Lee et al. (2002), which identify TF-and-target pairs. The use of such data together with statistical models such as (Bar-Joseph et al., 2003; Yu et al., 2004) helps to identify and verify SIMs.
Another rich source of data for identifying SIMs is contained in microarrays studies. For example, an experiment comparing a wild type and a mutant, wherein the TF of interest is knocked out, yields a list of differentially expressed genes, which are potential targets of this TF. To identify whether these targets are primary or secondary, further experiments, such as data on binding sites, or a priori knowledge is required. In this paper, we identify a SIM for Streptomyces coelicolor by finding differentially expressed genes between a wild type and a mutant type (where the TF has been knocked out) combined with biological knowledge on specific location of the TF and targets within the genome.
3.2 Michaelis–Menten Model
When a gene is regulated by a single TF that binds to the promoter region of the regulated gene, the transcription rate p(t) depends on the level of this TF, η and gene-specific kinetic parameters. The MM model of gene transcription activated by some TF states that production occurs in a saturating manner:
Here β is the rate of production, γ is the half-saturation constant, and α is the basal level of gene expression production. The general solution of the transcription equation (2) takes the form
In a SIM, the same TF regulates more than one gene. The gene expression profile of gene k, μk(t), depends on several kinetic parameters that are gene specific, αk, βk, γk, δk, μk0, as well as the activity of the regulator, η, that is common for all targets in the SIM regulated by it. We use the following notation μk(t) ≡μk(t; θk, η) and θk≡ (αk, βk, γk, δk, μk0).
4. Parameter Estimation
4.1 Likelihood
The kinetic parameters of the MM model, θk, and the variance of the log-normal distribution, σ2k, for a single gene, k, can be estimated by an approximate maximum likelihood procedure. The likelihood for a gene k, regulated by one TF, ηc, given all observations, gkcr(t), across all time-points, t, conditions, c, and replicates, r, is given by
The likelihood of the whole SIM, wherein the TF with activity level η(t) regulates several target-genes, can be written as
Here G = {g1, … , gK} is the set of K target genes; Θ represents all the kinetic parameters of the MM model, θk, for all genes in the SIM and Σ2 stands for all the scale parameters of the log-normal distribution, σ2k, that are also assumed to be gene specific.
4.2 Transcription Factor Activity
A common approach (Bar-Joseph et al., 2003; Qian et al., 2003; Segal et al., 2003) assumes that the transcription of the gene coding for the TF represents its activity reasonably well. Therefore, the observed gene expression values for the TF (TFX) are used as a proxy for TFA. A biologically more plausible model suggests that the TFA is not equal or not necessarily even correlated with the TFX (Gao et al., 2004; Nachman et al., 2004) due to the processes of translation and posttranslational modifications. In this case, the TFA, η(t), can be thought of as an unknown parameter. The idea is that η(t) can be reconstructed from the expression data of the genes that are known to be regulated by it. In a SIM, where a given TF regulates several gene-targets, the TFA profile, ηc(t), is the same for all target genes in the regulatory module as all target genes become activated (or repressed) by the master TF regulator under a specific set of conditions, c∈C. The kinetic parameters of regulation are gene specific for each of the K genes with profiles μ1(t), … , μK(t). These kinetic parameters as well as the TFA profile can be found by maximizing the likelihood (8) for a given set of genes.
4.3 MM Model Constraint
The true expression of a target gene k at time t depends on a continuous integral of the TFA values (6). Without any further constraints, it is clear that the function η(t) is unidentifiable. We, therefore, assume that the TFA can be approximated by a piecewise constant step function on the intervals (tj, tj + 1), where tj are the sampling points (j = 0, … , N− 2). Given this constraint, the integral in (6) can be approximated by a sum,
yielding the full general solution of the gene transcription equation (6)
This approximation is used for each of the target k = 1, … , K in the SIM. The parameter is N− 1 dimensional, but due to its collinearity of β on the one hand and γ on the other in equation (9), it can only be identified up to a multiplicative constant. Therefore, without loss of generality we can fix
. Computational details on maximizing likelihood by conjugate gradient are given in supplementary materials.
5. Application
The model described above has been applied to two 10-point time-series of two Streptomyces coelicolor strains grown on solid medium, one wild type and one mutant type for which a transcriptional regulator cdaR (SCO3217) has been knocked out. Each time-point of the two time-courses is replicated twice using independent biological samples, as the sampling mechanism is destructive.
The importance of the genus Streptomyces results from the bacterium's production of over two-thirds of naturally derived antibiotics in current use, as well as many antitumor agents and immunosuppressants. Streptomyces coelicolor produces at least four chemically distinct antibiotics (Bibb, 1996). The genes responsible for the synthesis of each of the four antibiotics have been found to be clustered in distinct locations (Bentley et al., 2002). Here we study genes in the cluster responsible for the production of calcium-dependent antibiotics (CDA) (Hojati et al., 2002). This cluster of 40 genes (SCO3210-SCO3249) contains at least two genes encoding the transcriptional regulators, CdaR and AbsA2, whose specific roles in the regulation of antibiotic biosynthesis have not been characterized in detail. Only 34 genes from the 40-gene cdaR cluster are present on the arrays, so only these genes have been considered in the current study. The cdaR gene product is known to positively regulate genes for CDA biosynthesis, while AbsA2 acts as an inhibitor, repressing CDA promoters, perhaps in competition with CdaR (Ryding, Anderson, and Champness, 2002; Sheeler, MacMillan, and Nodwell, 2005). At the same time, the cdaR gene appears to be expressed independently of absA (Ryding et al., 2002). The current experimental and modeling study focuses on analyzing the role of the cdaR gene product in regulating the expression of the cdaR gene cluster. The details on data preprocessing can be found in supplementary materials.
5.1 Identification of cdaR Regulatory Module
As there is not much a priori biological knowledge available, we use the data to inform us about which of the 34 available gene targets might be directly regulated by cdaR. We implement this by means of an ANOVA and checking the significance of the knock-out effectκc, gctr = μ+κc + τt + εctr for each gene in the cdaR cluster separately (c = mutant, wild-type; t = 1, … , 10 time; r = 1, 2 replicates) accounting for a possible time effect. Apart from cdaR gene itself, another 17 genes within the cdaR cluster have been identified (with p-values < 0.01) as being differentially expressed between the two strains. Although performing 34 tests simultaneously, a p-value of 0.01 guarantees that it is unlikely that more than one of the 17 genes is falsely discovered. These 17 genes are therefore assumed to be activated directly by the transcriptional activator CdaR. Ten of these genes, SCO3235-39 (with SCO3238 absent from the array) and SCO3244-49, form two stretches of coregulated genes probably belonging to the same operons, the latter extending from the fab operon (with known members SCO3245-49) that encodes the biosynthesis of the fatty acid moiety of CDA (Hojati et al., 2002). We further assume that this TF and its 17 target genes constitute a SIM.
5.2 Reconstruction of CdaR Activity
To reconstruct the activity profile of the CdaR regulator, the profiles of all 17 differentially expressed genes within this regulatory module are used. In other words, we consider a SIM with CdaR as its master regulator and the 17 genes as its targets. The maximum likelihood estimate of the activity profile for CdaR found by the conjugate gradient method using gene expression data for all 17 targets for wild type organism is shown in Figure 1.
TFA of CdaR inferred from gene expression data. (a) the TFA is the piece-wise constant step function (solid line) together with its 95% confidence bounds (dashed lines). The inferred profile is rescaled between zero and one. Corresponding confidence bounds are rescaled accordingly. (b) TFA vs TFX of cdaR for wild-type time-course. The TFA profile is smoothed using R spline function (solid line) from inferred piece-wise constant function. Points represent the observed data for two biological replicates for TFX (wild type) (dashed lines). Smoothed profile has been rescaled between zero and one; data points have also been rescaled independently to be between zero and one.
The confidence bounds for the -components were obtained via a classical Wilks procedure. Let L* be the value of the maximum likelihood found with respect to all parameters, including
. By perturbing each
, we obtain a value of likelihood
. The 95% confidence bound for
is found by finding ▵j such that (L*−L*j)/2 =χ21,0.95. Figure 1a shows the reconstructed CdaR activity profile as a piece-wise constant function (solid line). Dashed lines show upper and lower 95% Wilks confidence bounds for each
. A CdaR profile smoothed over the reconstructed piece-wise profile is shown on Figure 1b (solid line). The smoothed profile was obtained by the cubic spline function (R-function pspline). Points (connected with dashed lines) represent the observed data for cdaR gene expression for the two independent biological replicates.
Because of the arbitrary scale of the expression data, the shapes of the reconstructed and the expression data for cdaR are of interest to us, rather than their absolute values. The Pearson correlation between inferred activity profile and the average expression profile is 0.45, suggesting that the regulator CdaR is modified posttranslationally. Indeed, it is highly likely that the activity of the CdaR protein is influenced by its phosphorylation state. The deduced CdaR protein sequence has a putative ATP-binding site and it is known that the activity of related streptomycete antibiotic regulatory proteins, such as AfsR, is governed by protein phosphorylation. The data presented here would be consistent with such posttranscriptional modification. This indicates that it is not safe to substitute the activity of the regulator by the measured gene expression in the models of gene regulation.
Expression profiles of all 17 differentially expressed genes between wild type and mutant has been used in the reconstruction of the transcription activity profile of their common regulator. To evaluate how sensitive the result is to the false positives among the targets, we performed the same analysis by iteratively leaving one of the putative targets out. The TFA profiles found for each of the 17 SIMs with 16 targets were compared to the TFA profile found for the original SIM with 17 targets. The results are shown in Figure 2.
Sensitivity of the TFA to possible false positives among the targets. The TFA reconstructed from an original SIM with 17 targets (solid line); TFAs reconstructed for SIMs with 16 targets (leave-one-out) (dotted lines); 95% confidence bounds (dashed lines).
The mean correlation between the original TFA and the ones found for SIMs with 16 targets is 0.872. It is clear from Figure 2 that some difference is noticeable on the first and last time intervals. However, in each case the reconstructed profile of the gene target that has been left out shows excellent fit with the expression data for this gene (not shown). This is not surprising, as each of the inferred TFA profiles has a high correlation with the original inferred profile.
5.3 Kinetic Profiles of Target Genes
For each of the 17 target differentially expressed (DE) genes, the mean gene expression profiles μk(t) and kinetic parameters θk, k = 1, … , 17, of the MM model (6) were estimated given the reconstructed profile of the TF, . Two representative gene profiles within the regulatory module for wild type are shown in Figure 3.
Two representative profiles of target genes within the SIM regulatory module. The points connected by dotted lines stand for the observed data for the wild type (two replicates). The solid line is for a gene profile fitted with the inferred TFA of CdaR regulator . (a) Gene SCO3230. ML estimates of kinetic parameters are β= 168, γ= 569, δ= 48, α = 0.55, σ= 0.14. (b) Gene SCO3235. ML estimates of kinetic parameters are β= 265, γ= 516, δ= 9.4, α = 0.000001, σ= 0.31.
It is difficult to evaluate the estimates of the kinetic parameters, θk, as quantitative biological knowledge on gene transcription in general, and for Streptomyces coelicolor in particular, is very limited. It is nevertheless worthwhile mentioning some details about the kinetic parameters (see Web Appendix D).
5.4 Goodness-of-Fit
As the observed gene expression data, gkcr(t), are assumed to be log-normally distributed, log[gkcr(t)] is normally distributed. Therefore,
where the location parameter mkc is given by formula (4). Whether the inferred data truly comes from a normal distribution can be tested by a Kolmogorov–Smirnov test by using, for example, the R-function ks.test.
Figure 4 shows a QQ-plot of the p-values from the Kolmogorov–Smirnov test for all 17 differentially expressed genes between wild type and cdaR mutant. This figure shows that the MM model combined with log-normal deviations displays a very good fit to the observed time-course gene expression data. The dashed line stands for an ideal fit of the data to the model. If the p-values fall below this line, the fit is poor. P-values above the line indicate some overfit of the model. However, the 95% confidence bounds of the uniform distribution (dotted line) show how most of the p-values might be higher than the line simply by chance, as they fall within the upper confidence bound.
Fit of MM-kinetics with log-normal noise and ML estimate of ηA for 17 genes identified as differentially expressed between the wild type and the cdaR-mutant. The p-values from Kolmogorov-Smirnov test are shown versus the quantiles of the uniform distribution. Dashed line stands for an ideal fit of the data to the model.
To address concerns of overfitting, we compare the current model with gene-specific variances σ2k, with a model, wherein a common variance σ2 is used for all genes. The maximum likelihood estimate for common σ2 has been found by a grid-search between the smallest and largest values of σ2k. The likelihoods of the two models are compared using a χ2-test with 16 degrees of freedom, that is, the difference in the number of parameters. This yields a statistic of 153.17, which far exceeds the 95% cut-off of χ216,0.95 = 26.3. This suggests that the model with a gene-specific variance gives the best trade-off between the goodness of fit and the number of parameters in the model.
6. Discussion
In this paper we developed a statistical framework that embeds the deterministic MM kinetics of gene regulation within a stochastic model of microarray measurement noise. As an alternative for direct experimental measurement of the activity profile of the TF, the model reconstructs this profile using the gene expression profiles of its targets within a SIM regulatory module. In addition, estimates of gene-specific kinetic parameters of the gene regulation are found. We have shown that in the case of posttranscriptional modifications, such as is the case of the cdaR gene in Streptomyces coelicolor, the amount of mRNA of a regulator is not a good approximation for its protein activity levels and one cannot be substituted for the other in quantitative models of gene regulation.
Our statistical framework requires some knowledge of the structure of the regulatory module, which can be determined by experimental methods (ChIP technology), analytical (e.g., by finding differentially expressed genes) and available biological knowledge. Currently, in the absence of a reliable technology to routinely measure the TFA of regulators, it is of great importance to understand whether TFA can be inferred from the expression of its targets. A straightforward experimental verification of the results is to measure the phosphorylation profile of CdaR and compare it with the TFA, inferred by our model.
The statistical framework developed in this paper can be extended to include cooperativity and competitive regulation by two or more TFs with both AND and OR gate types. It can be used to reconstruct the activity of TFs in known regulatory modules and to discriminate between the types of regulation (activation/inhibition; gate types) by using likelihood ratio and goodness-of-fit tests. The model can also be extended to search for the TFA and gene-specific kinetic parameters of regulation by combining different microarray data sets. Other types of data as well as available knowledge can be incorporated in the model.
7. Supplementary Materials
R-code and other supplementary materials are available under the Paper Information link at the Biometrics website: http://www.tibs.org/biometrics. The microarray data used in the paper is available publicly in ArrayExpress (http://www.ebi.ac.uk/arrayexpress/; accession numbers: Experiment, E-MAXD-14; Arrays: A-MAXD-6, UMIST S COELICOLOR SC8 7337; A-MAXD-7, UMIST S COELICOLOR SC3 6077; A-MAXD-8, UMIST S COELICOLOR SC4 6884).
Acknowledgements
The work on this project was funded by a BBSRC Exploiting Genomics initiative consortium grant “DNA microarray data analysis and modeling: an integrated approach.”
These authors contributed equally to the work.