mLDM infers microbe and environmental associations in metagenomic datasets
It removes indirect associations among OTUs correlated with a common factor
The model accounts for compositional bias and variance in metagenomic sequencing data
We apply mLDM to marine and human gut microbial datasets
Summary
The inference of associations between environmental factors and microbes and among microbes is critical to interpreting metagenomic data, but compositional bias, indirect associations resulting from common factors, and variance within metagenomic sequencing data limit the discovery of associations. To account for these problems, we propose metagenomic Lognormal-Dirichlet-Multinomial (mLDM), a hierarchical Bayesian model with sparsity constraints, to estimate absolute microbial abundance and simultaneously infer both conditionally dependent associations among microbes and direct associations between microbes and environmental factors. We empirically show the effectiveness of the mLDM model using synthetic data, data from the TARA Oceans project, and a colorectal cancer dataset. Finally, we apply mLDM to 16S sequencing data from the western English Channel and report several associations. Our model can be used on both natural environmental and human metagenomic datasets, promoting the understanding of associations in the microbial community.
). Most microbes cannot be cultured in laboratories, making it difficult to gain an understanding of their interactions with existing technologies. However, with the advancement of high-throughput sequencing technology, we are able to sequence 16S rRNA genes or whole metagenome of uncultured microbes directly from samples at diverse times or spots and, as a result, obtain microbial abundance information (
) over the last few years. One of the major challenges is to discover associations, usually referred to as positive and negative relationships, among microbes and between microbes and environmental factors, or EFs. Such associations could help us to unravel real interactions, including, for example, commensalism, parasitism and competition in a community, resulting in a broad understanding of community-wide dynamics.
Associations can be measured by different statistical methods to investigate underlying relationships. Existing association studies can be classified into two main categories. The first is pairwise association calculation, such as Pearson’s correlation coefficient (PCC) and Spearman’s rank correlation coefficient (SCC), which directly computes the correlation between two species. Local similarity association (LSA) also computes pairwise association, but its mechanism differs from the others, and it calculates associations using the dynamic programming (
). The second is complex association calculation that estimates the relationships between one species and the remaining species and/or EFs via multivariate regression-based methods (
), but such methods are not suitable for metagenomic datasets for the following two reasons. First, their calculated values may not indicate real associations because of compositional bias introduced when association is computed using methods that assume data are unconstrained, while ignoring dependence among the elements of compositional data (
). More specifically, the abundance of each microbe in metagenomic samples is usually normalized as the compositional relative abundance by dividing its read count over the total read count of a particular sample. Thus, after normalization, the relative abundance xi is not independent from the relative abundance of the rest of the microbes, regardless of their underlying relationships as:
Compositional bias tends to be more severe when some dominant species exist. This is particularly widespread in the marine microbial community (
). Consequently, for association studies, it is desirable to develop computational methods that bypass compositional bias in order to enable the inference of associations in metagenomic sequencing data. Second, the observed read count of one microbe may deviate from its true abundance based on a given experimental protocol, in which a series of sample preparation, amplification (
), and sequencing steps, can lead to large variance of read counts. This variance within metagenomic sequencing data is also ignored by pairwise association calculation methods.
Recent advances have been made in the development of statistical methods to study associations using sequencing data, while taking compositional bias into account. For example, CCREPE (
) estimates the compositionally corrected p value for every association, allowing the extraction of significant associations via pairwise association calculation. Permutation and bootstrapping have also been used to generate the null distribution of the association while considering compositional bias, and the corrected p value is obtained by the pooled-variance Z-test. However, the limited number of data samples results in unreliable null distribution and corrected p values that are sensitive to noise. SparCC (
) infers correlations among microbes by utilizing log-ratio transformation to eliminate the effect of the total number of read counts, while imposing sparsity of correlations among microbes. SPIEC-EASI (
) uses the covariance of the centered log-ratio-transformed data to approximate the covariance of log-transformed absolute abundance of microbes and obtains conditionally dependent associations among microbes. Similar to SPIEC-EASI, CCLasso (
) estimates the covariance matrix via an alternating direction algorithm instead of the graphical lasso. However, without considering environmental factors, many associations between and among microbes, as determined by these methods, may not be real. For example, Figure 1C shows that two unrelated microbes (OTU-1 and OTU-2) may appear to be associated just because they both respond to the same environmental perturbation (EF-1).
considered the effect of environmental factors by filtering out indirect associations among OTUs. However, this method is limited since only triples, i.e., two OTUs and one EF, are included each time. The influence of EFs on OTU-OTU associations was previously explored by
by testing the statistical significance of associations among OTUs. A null model based on the assumption that associations are independent from either taxa or locations is constructed via the binary presence-absence matrix, which records the presence or absence of taxa in samples. However, the mere presence-absence binary information in taxa of samples, not abundance, is utilized; therefore, the results may be restricted.
Controlling for Bias, Indirect Effects, and Variance Using a Hierarchical Bayesian Model
To address the shortcomings of the methods noted above, we propose the metagenomic Lognormal-Dirichlet-Multinomial (mLDM) model in this study (Figure 1). It is a typical hierarchical Bayesian model (
) that learns complex relationships underlying the data. The sequencing process in which millions of DNA molecules are randomly sampled from a DNA library for sequencing (
) can be modeled by a multinomial distribution. In metagenomics, a DNA library consists of a large number of amplified 16S or 18S rRNA gene sequences, and the relative abundances of OTUs in a library, which are determined by their real abundances in the environmental samples, can be modeled by a Dirichlet distribution. Thus mLDM models read counts of OTUs via Dirichlet-Multinomial distributions to estimate associations among OTUs considering the effect of EFs (Figure 1B). The real abundances of OTUs are determined by associations, both among OTUs and between OTUs and EFs, and consequently, mLDM applies a lognormal distribution to parameterize these two kinds of associations using two matrices, one denoting conditionally dependent associations among microbes and the other representing direct associations between microbes and EFs. Finally, the two estimated parameters are visualized as an OTU-OTU association network (Figure 1D) and an EF-OTU association network (Figure 1E), respectively.
Simulated Experiment on Synthetic Dataset
To show the effectiveness of the proposed mLDM model, we conducted several experiments and compared mLDM with several state-of-the-art models, including eight programs: PCC, SCC, CCREPE, SparCC, CCLasso, glasso (graphical lasso) (
), SPIEC (multiple lasso [ml]), and SPIEC (graphical lasso [gl]). SPIEC (ml) and SPIEC (gl) are two different modules within SPIEC-EASI. The first five methods estimate associations via the calculation of correlations with PCC as the baseline, and the last three compute the conditional dependence with glasso as the baseline. In the next experiment, we will estimate the following: (1) OTU-OTU associations among all microbes (or OTUs) and (2) EF-OTU associations between environmental factors and microbes. Synthetic data can be naturally produced via our generative process, as shown in Figure 1B, based on five different graphical structures (random, cluster, scale-free, hub, and band) for microbes parameterized by , and randomly sparse association structure between microbes and EFs parameterized by B. Receiver operating characteristic (ROC) curves, area under the curve (AUC) scores, and distance, which is defined as the L1 distance between estimated results and ground truth, are used for evaluation.
First, we compared performances of all nine methods based on the estimation of OTU-OTU and EF-OTU associations with simulation parameters p = 50, Q = 5, and n = 500, corresponding to 500 samples with 50 OTUs and five EFs. Figure 2A shows the ROC curves of OTU-OTU association studies for five different types of graphical structures. The corresponding AUC scores and distances are summarized in Table S1. From the ROC curves, we learn that mLDM has larger true-positive rates than any of the other methods when false-positive rates are small. The AUC scores of mLDM are superior to those of all other state-of-the-art methods across all five different graphical structures. A direct comparison between mLDM and glasso and SPIEC-EASI, which both estimate conditionally dependent associations without considering the variance of metagenomic data, shows that mLDM achieves the highest AUC scores across all five graphical structures. We also observe that mLDM has smaller distances than most of the other methods, suggesting that mLDM is able to accurately estimate the weight and sign of conditionally dependent associations. On the cluster graph, the ROC curves of SparCC and CCLasso increase more slowly than those of mLDM at the beginning, but climb higher as the false-positive rates become larger. This can be explained by the local density of each standalone cluster in the graph. Under these conditions, mLDM tends to shrink edges with low weights, finally retaining fewer edges than either SparCC or CCLasso. However, we argue that an initial high true-positive rate, when the false-positive rate is small, is very significant in biological applications, essentially because a higher ratio of predicted associations will be true.
Figure 2Performance of Association Inference of Nine Methods on Synthetic Experiment
Figure 2B shows the ROC curves for the estimated associations between EFs and OTUs (EF-OTU), where simulation parameters are set the same as those shown in Figure 2A. The corresponding AUC scores and distances are shown in Table S2. CCREPE, SparCC, CCLasso, and SPIEC do not estimate EF-OTU associations; therefore, we compared mLDM with PCC, SCC and glasso only. From the ROC curves, we observe that mLDM has higher true-positive rates and lower false-positive rates than the other four methods. From the AUC scores, we observe that mLDM has better performance than the other methods. For distances, mLDM also performs better than the other methods, with the exception of SCC, which does slightly better in the Band graph. More comparisons (with DirMulti, Dirichlet-multinomial regression [
Next, to show the sensitivity of the computational models with respect to different sample sizes, we fixed the number of microbes as p = 50 and the number of EFs as Q = 5 and simulated metagenomic sequencing datasets with various sample sizes, including n = 25, 50, 200, and 500. The AUC scores of the estimated OTU-OTU associations by glasso, SPIEC (gl), SPIEC (ml), and mLDM are plotted in Figure 2C. As expected, the AUC scores of all five methods increase when the sample size increases. Among these methods, mLDM gives the highest AUC scores on all five graphical structures, which again proves that mLDM can accurately estimate conditionally dependent associations. The AUC scores of the estimated EF-OTU associations by PCC, SCC, glasso, and mLDM are shown in Figure 2D, and, again, the AUC scores of mLDM are higher than those of PCC, SCC, or glasso.
Performance on TARA Oceans Eukaryotic Data
To validate the performance of mLDM on discovering OTU-OTU associations from real metagenomic sequencing data, we show the results of mLDM, as well as eight other methods, on TARA Oceans eukaryotic data (
). The eukaryotic abundance profiles were estimated by sequencing and clustering the V9 region of eukaryotic 18 s rRNA genes. A subset for evaluations was extracted from datasets established by the original authors. This subset consists of 67 OTUs with 28 known genus-level interactions and 17 EFs from 221 samples.
It should be noted that the known interactions are at genus level, and thus we evaluated the results at genus level. Since the exact OTU-OTU associations at species level are unidentified, we further specified that a predicted association between two OTUs would match a known genus-level interaction if the two OTUs belonged to two interacting genera. Since this is not a ground truth dataset because of its incompleteness, we reported the numbers of matched genus-level associations among the top-N predicted associations (with the highest weights) of all methods, as listed in Table S3. It can be seen that mLDM is superior to other programs in terms of the number of matched associations for six cases, demonstrating its power of association inference. SCC is competitive with mLDM when N≤40, but its performance decreases as N increases. Both CCLasso and SparCC tend to report a dense association network, which includes a large number of false-positive associations, as shown in Figures 3E and 3F. In contrast, mLDM assumes network sparsity and therefore selects associations with higher weights, as shown in Figure 3D.
Figure 3Results of Experiments on TARA Oceans Eukaryotic Dataset and West English Channel Data
The ground truth, consisting of 28 genus-level symbiotic interactions, as listed in Table S4, and the top 40 highest valued genus-level associations discovered by mLDM, are plotted in Figures 3A and 3B, respectively. The strong negative association between the genus Amoebophrya and the genus Alexandrium, as given by mLDM, implies a parasitic interaction, which matches known parasitic interactions (
). The known parasitic interactions between Amoebophrya and Peridiniaceae, and between Amoebophrya and Acanthometra were also detected by mLDM as having negative associations (
). We also list the top 10 predicted OTU-OTU associations, i.e., those with largest weights, together with relevant citations in Table S5.
Figure 3C shows EF-OTU associations estimated by mLDM. Compared to OTU-OTU associations estimated by mLDM, as shown in Figure 3D, fewer EF-OTU associations are found, indicating that EFs have direct effect on only some OTUs, while OTU-OTU associations comprise the greater share of forces that drive the changes of microbial community. Similarly, we show the top 10 estimated EF-OTU associations in Table S5. Some of these predictions were consistent with findings reported in the literature. For example, Cope-1 (Corycaeus sp.) is positively associated with the depth of maximum Brunt-Väisälä frequency, which is a measure of the stability of ocean’s stratification. This discovery is consistent with a previous work about the predictability of the depth of maximum Brunt-Väisälä frequency to Cope-1 (
. The relationships between the depth of maximum chlorophyll and Cope-2 (Oithona sp.) and between moon phase and Cope-7 (Centropages fu.) were also studied in other projects (
). The composition of gut microbes of patients with colorectal cancer (CRC) has been found to differ from that of the normal gut microbial community, and some microbes, such as Fusobacterium, Peptostreptococcus, Parvimonas, and Porphyromonas, have been reported to be enriched in the patients’ gut (
). A total of 117 OTUs and five meta data (FIT results, site, Dx_Bin, age, and gender) out of 490 samples were selected from the dataset provided by the original authors to construct association networks. Among the meta data, “site” contains four cities with three cities in the US and one in Canada, and “Dx_Bin” comprises five diagnostic states, “normal,” “high-risk normal,” “adv adenoma,” “Adenoma,” and “cancer.” Results of mLDM and previous studies were compared to verify our model’s effectiveness. Table 1 lists top 12 EF-OTU associations estimated by mLDM.
Table 1Top 12 EF-OTU Associations Estimated by mLDM on Colorectal Cancer Data Dataset
OTU
EF
Association
p value (Wilcoxon rank-sum test)
Peptostreptococcus (OTU310)
cancer
+0.865
2.00 × 10−15
Porphyromonas (OTU105)
cancer
+0.617
2.08 × 10−14
Fusobacterium (OTU264)
normal
−0.463
1.74 × 10-5
Fusobacterium (OTU264)
FIT positive
+0.442
N/A
Parvimonas (OTU281)
cancer
+0.378
3.50 × 10-12
Porphyromonas (OTU105)
normal
−0.372
7.34 × 10−6
Veillonella (OTU66)
age
+0.364
N/A
Parvimonas (OTU281)
normal
−0.307
7.94 × 10-5
Pasteurellaceae (OTU58)
cancer
−0.298
2.95 × 10-5
Porphyromonas (OTU105)
FIT positive
+0.288
N/A
Parasutterella (OTU82)
age
−0.275
N/A
Fusobacterium (OTU264)
cancer
+0.272
2.68 × 10-7
All predicted EF-OTU associations are sorted in descending order according to their absolute values. Wilcoxon rank-sum test is performed on eight out of 12 EF-OTU associations where the EFs include five diagnostic states, “normal,” “high-risk normal,” adv Adenoma,” “adenoma,” and “cancer,” “FIT” (fecal immunochemical test), and “age.” The content in the “OTU” column consists of annotated OTUs and OTU numbers from the original article. Wilcoxon rank-sum test was performed on the diagnostic state-OTU associations but is not applicable for other types of EF-OTU associations.
We observed that four OTUs, Peptostreptococcus (OTU310), Porphyromonas (OTU105), Parvimonas (OTU281), and Fusobacterium (OTU264), appear among the top 12 EF-OTU associations and are positively associated with CRC. The unclassified Prevotella (OTU57) reported by the original authors is also found to be positively associated with CRC, and it is the 25th largest EF-OTU association (+0.185). These results are consistent with previous studies. We believe that this is convincing validation for the accuracy of EF-OTU associations estimated by mLDM.
In addition, for eight out of the 12 EF-OTU associations, p values via the Wilcoxon rank-sum test are shown in Table 1. All these associations are statistically significant, which again shows the efficiency of mLDM as a predictor of EF-OTU associations. Interestingly, among the top 12 EF-OTU associations, we also discover that two microbes are associated with the “age,” Veillonella (OTU66) (+0.364), and Parasutterella (OTU82) (−0.275), and that a special species, Pasteurellaceae (OTU58), is negatively associated (−0.298) with CRC. More studies are needed to explain these associations.
Association Inference on West English Channel Data
Finally, we applied mLDM to other marine metagenomic sequencing data to infer the underlying OTU-OTU associations and EF-OTU associations. In the marine community, huge numbers of marine microbes play important roles in ocean food chains. However, very little is known about how marine microbes interact with each other or how they are affected by environmental factors.
studied the dynamics of the marine microbial community in the West English Channel by analyzing high-throughput 16S rRNA data sampled from 2003 to 2008. From these data, we extracted 48 OTUs and eight EFs that appear in 46 samples and employed mLDM to infer associations.
The OTU-OTU association network for the 48 OTUs is shown in Figure 3H. In general, the number of positive associations (brown edges) among OTUs is more than that of the negative associations (blue edges). The network is clearly dominated by OTUs from Proteobacteria, which are colored green. This result is consistent with the original discovery by
. The OTU Alphap17, which belongs to the family Rhodospirillaceae, plays an important role in the network, as it is a hub connecting most OTUs. Rhodospirillaceae is known to produce energy through photosynthesis, which is critical to the marine microbial community on the surface of the ocean.
also found that a single Rhodobactereaceae OTU acts as a hub and is correlated with different groups. Although the OTU Alphap5 from the genus Thalassobacter, the OTU Alphap2 from the family SAR11, and the OTU Alphap17 are from the same class, Alphaproteobacteria, their associations are different. Alphap5 and Alphap17 have a strong negative association while Alphap2 and Alphap17 have a positive association. The OTUs Gammap47 and Gammap76 are from the same family, SAR86, and both have a positive association with the OTU Alphap17. It is remarkable that the relative abundance of Alphap17 is so low, while still connecting many big OTUs with high relative abundance levels, such as Alphap1, Alphap2, Gammap76, and Gammap7, implying that we should pay more attention to rare OTUs with low abundance in future research.
Figure 3I shows the EF-OTU association network between eight EFs and 48 OTUs. We observe that temperature has the most significant impact on OTUs, especially on the phylum Proteobacteria. This is consistent with previous observations. Furthermore, the OTU Alphap17, which connects many other OTUs, is very strongly and positively associated with day length. This is consistent with the photosynthesis function of OTU Alphap17 and further confirms that the photosynthesis of Alphap17 is critical to the whole marine microbial community.
also associated day length with the variance of microbial community via discriminant function analysis. In addition, the OTU Alphap16 from the family Rhodobacteraceae has a positive association with temperature. The top ten OTU-OTU and EF-OTU associations are shown in Table S6. The positive associations between temperature and both Alphap16 and Gammap58 were previously reported by
To discover the underlying associations among microbes from metagenomic samples, we propose mLDM, a hierarchical Bayesian model with sparsity constraints to discover associations among microbes and between microbes and the environmental factors that affect them. mLDM can infer both conditionally dependent associations among microbes and direct associations between microbes and environmental factors, by taking into account both compositional bias and variance of metagenomic data, an approach not previously studied. This newly discovered conditionally dependent association provides insight into the mechanisms underlying a microbial community by capturing the direct relationship underlying each microbial pair and removing the indirect connection induced from other common factors. The effectiveness of mLDM was verified on the basis of experiments involving both synthetic and real datasets.
To address the question whether environmental factors are important for the inference of OTU-OTU associations, we applied mLDM on synthetic datasets, when only one type of associations, either OTU-OTU or EF-OTU associations, was estimated, similar to the approach of
. The results are shown in Figure S1. Compared to the methods considering both types of associations, we observe lower ROC curves for those estimating only one type of associations. Therefore, it can be concluded that environmental factors would affect the estimation of OTU-OTU associations, that OTU-OTU associations would affect EF-OTU associations, and that both types of associations should be considered in association estimation.
Since mLDM assumes sparsity of true association, we also test whether the sparsity pattern of OTU-EF interactions (matrix B) would affect association estimation by this method. In matrix B, coefficients of a row correspond to the impact of an environmental factor, and coefficients of a column correspond to the impact of multiple environmental factors to an out. Therefore, we assumed some sparsity patterns of B by assigning some fractions of rows or columns as nonzero and plotted the ROC curves of estimated EF-OTU associations by mLDM, as shown in Figure S2. Overall, mLDM is insensitive to these patterns and works well in all cases. These results indicate that mLDM can be applied to estimate various types of sparse associations.
mLDM assumes that microbes respond linearly to environmental factors. However, the abundance of microbes may reach optima under certain environmental conditions, such as some range of temperature and depth. While this appears to be a limitation, we argue that some nonlinear associations can, to some extent, be captured by our model, when data points are distributed askew, which is typical in real datasets. For example, among EF-OTU associations estimated by mLDM on the TARA Oceans Eukaryotic Data, we observed that the OTU Cope-1 annotated with the strain Corycaeus sp. is negatively associated with oxygen concentration. Almost 95% of all 221 samples of the TARA Oceans dataset are either from the surface waters or from the deep chlorophyll maximum subsurface, whose depths range from 5.374 to 183.31 m. From the samples near the ocean surface, the abundance of Corycaeus sp. does not increase linearly with the increase of oxygen but rather tends to be more abundant when the concentration of oxygen is within a certain range, as plotted in Figure 3G. This example demonstrates that mLDM is capable of capturing some of these nonlinear associations.
Consistency is an important property that shows the robustness of methods against noise. Accordingly, we tested mLDM on the Human Microbiome Project dataset (HMP) and constructed two datasets to evaluate consistency. Since some subjects had two gut samples from two time points, we constructed the first dataset using the samples from the first time point and the second dataset using those from the second time point. Consistency was measured by Jaccard similarity, i.e., the fraction of the number of intersections among the top 200 largest OTU-OTU associations, as estimated from the two datasets, over the number of the union of these two sets of associations. The consistency of nine methods was then plotted, and it is shown in Figure S3. We observe that the consistency of mLDM is ranked fifth among all, and among the four methods that estimate conditionally dependent associations, including glasso, SPIEC (gl), SPIEC (ml), and mLDM, mLDM is the second best. Of the nine methods compared, CCLasso had the highest consistency, while glasso had the lowest consistency. Overall, methods that estimate direct correlations had higher consistency than those that estimate conditionally dependent associations. However, approaches that estimate conditionally dependent associations are sensitive to heterogeneity or noise within the dataset, particularly in model selection. However, consistency needs not to be the best standard to assess methods because, to some extent, consistency may reflect this method is misled by systematic bias.
For future work, we will develop a more scalable mLDM model to analyze large microbial network structures with tens of thousands of microbes by using stochastic gradient descent and parallel computing techniques. For rare OTUs, which only exist in a small fraction of the samples, the lognormal distribution may be not suitable, and other appropriate distributions need to be explored. We will also develop dynamic mLDM models to analyze time series data and learning time-varying network structures.
Please contact the corresponding author Dr. Ting Chen ( tingchen@tsinghua.edu.cn ) for further information and requests about codes and datasets.
Methods Details
The metagenomic Lognormal-Dirichlet-Multinomial Model
Suppose there are N samples . Each is a P-dimensional vector that contains P microbes (or Operational Taxonomic Units (OTUs)), where represents the sequence/read count of the j-th microbes in the i-th sample. Let represent the environmental factors, where each is a Q-dimensional vector and represents the value of the j-th environmental factor associated with the i-th sample.
Figure 1B illustrates the mLDM model for metagenomic sequencing, where is the read count vector of the i-th sample and records values of the environmental factors corresponding to the i-th sample. The latent variable is the vector of the relative abundance levels of P microbes in the extracted sample, and represents the absolute abundance levels of the microbes in the original community. We assume that the counts are proportional to the latent microbial ratios which are determined by their absolute abundance . Microbial absolute abundance can be influenced by two factors: 1) environmental factors , whose effects on the microbes are denoted by a linear regression model , and 2) the associations among microbes encoded by a latent vector , which is determined by the matrix that records microbial associations and the mean vector that affects the basic absolute abundance of microbes. The microbial basic absolute abundance can be regarded as the average result of effects of all other factors that have an effect on microbial abundance, but are not included in the mLDM. More specifically, the generative process of the metagenomic Lognormal-Dirichlet-Multinomial hierarchical model is defined as:
where is a parameter matrix, is a P-dimensional vector, and is the inverse covariance matrix (i.e., precision matrix) of a multivariate Gaussian distribution. With this model, our goal is to infer both , the environmental factor-microbe (or EF-OTU) associations, and , the microbe-microbe (or OTU-OTU) associations, under some sparsity regularization which will be made clear in next section. We now explain the design of each component in mLDM.
We assume that read count data follows a multinomial distribution with the microbial ratio parameter :
(1)
where is the total read count of the i-th sample. Since the multinomial parameter is subject to the constraint that , we assume it follows a Dirichlet distribution
(2)
where , is the Gamma function and . Based on the conjugacy of Dirichlet and multinomial distribution, we can obtain the following Dirichlet-Multinomial distribution via integrating out
(3)
The flexible variance-covariance property of the Dirichlet-multinomial distribution is suitable for modeling the sequencing data. A simple explanation is as follow. We calculate the variance of the read count , , and the covariance of two read counts and , , where and , are true relative abundance levels. We can see that both the variance and covariance of microbial counts are regulated by the sequencing depth and the true relative abundance of the microbes. Moreover, the coefficient between and is negative, which models the compositional negative bias.
We further assume that the absolute abundance for all microbes in the i-th sample follows the multivariate lognormal distribution with mean and covariance which is commonly used to model most microbial abundance except for some occasional species (
). Microbes survive in a community through conditionally dependent associations. However, at the same time, microbes are also subjected to unpredictable fluctuations impacted by their microenvironment. Therefore, we record associations among microbes in the matrix and let the mean vary with the environmental data vector by a linear regression model. Then the prior distribution is defined as
(4)
where . Using the relationship between the lognormal and Gaussian distributions, it is also equivalent to the following form:
(5)
where . This formulation avoids positivity constraint in the lognormal distribution. This is beneficial for finding the estimates, e.g., by using some unconstrained optimization algorithms, as explained in the next section.
With the above model, we capture both the conditionally dependent associations among microbes and the direct associations between microbes and environmental factors. More specifically, the conditionally dependent associations among microbes are encoded in the precision matrix . To visualize the microbial association network, we use an undirected graph denoted as employed in the Gaussian Markov random field (
) to represent , where represents the set of nodes denoting P microbes and is the set of conditionally dependent associations with each element representing the association between the i-th and j-th microbes. If , then the i-th and the j-th microbes are conditionally independent, and hence, no edge exists between the two microbes in graph . The weight of edge , , is the strength of the association between the two microbes.
The direct associations between microbes and environmental factors are encoded in weight matrix B. The association between the i-th microbe and the j-th environmental factor is , and we can plot them in another bipartite graph , where the set of nodes represents both P microbes and Q environmental factors, and the edge in represents the direct association between the j-th environmental factor and the i-th microbe. The weight of edge equals .
Overall, our metagenomic association network consists of these two graphs and , as illustrated in Figures 1D and 1E.
Sparse association estimation
We now explain how to estimate the metagenomic association network by using sparsity regularization. Given metagenomic data and environmental factors , the posterior distribution of the latent factors is
(6)
where can be calculated with Equation 3, and with each factor being a Gaussian distribution. As a consequence of the deterministic relationship , it should be noted that the distribution is a Dirac delta function. In general, associations among microbes are not expected to be dense and only a few environmental factors will predominate. This motivated us to identify a sparse association network which could be effectively achieved by sparse learning techniques (
). Also, in practice, the number of samples is usually smaller than the number of microbes, or . Therefore, introducing sparsity regularization helps avoid overfitting. Specifically, we estimate the sparse association network by solving the following problem:
(7)
where , is the log gamma function, and the positive parameters and are used to control the sparsity of the solution with larger values representing sparser results. Then, the model parameters can be estimated by optimizing the objective function with respect to Z,B,B0 and alternately.
1)
For , we minimize the objective function in Equation 7 with respect to . Because of independence, we can solve for each independently by the gradient descent methods. Here, we adopt the limited-memory quasi-Newton (L-BFGS) algorithm (
), which is a quasi-Newton method and converges fast. L-BFGS requires the derivative of , which is computed as follows:
(8)
where is the digamma function and is the j-th row of the matrix .
2)
For B, we minimize Equation 7 with respect to B. The objective is not differentiable by the existence of the L1 norm regularizer. Therefore we use the orthant-wise limited-memory quasi-Newton (OWL-QN) algorithm (
), which is based on L-BFGS and can minimize the log likelihood function with L1 regularization for optimization. The derivative of is
(9)
where
and .
3)
For B0, we have the update rule , which is the mean of the latent vectors .
4)
For , this step is equal to solving the classical problem of a graphical lasso (glasso):
(10)
where the empirical covariance . This problem is also termed as sparse inverse covariance estimation and can be solved with a standard graphical lasso (glasso) algorithm by (
). However, different from the fully observed glasso, where the empirical covariance is computed once, we should note that our S depends on the inferred latent vectors and needs to update at each iteration. Since and mutually influence each other in explaining the observed data (see the Figure 1B), the learned sparse graph (i.e., ) is affected by environmental factors, matching our intuition in Figure 1C.
For model selection, we choose the best parameters for and via extended Bayesian information criteria (EBIC) (
). EBIC improves the original BIC by assigning larger prior to lower dimension models, a strategy more suitable for model selection in large model spaces.
Quantification and Statistical Analysis
Data Generation and Evaluation Metrics in Synthetic Experiment
The synthetic data can be naturally produced via our generative process. First, the environmental factor matrix is sampled from the multivariate normal distribution and then normalized with and . The element of matrix B is sampled from the uniform distribution of and set to 0 with probability of 0.85. Since dominant microbes are found in some microbial communities, we produce vector by uniformly sampling from with probability of 0.2 and with probability of 0.8 to affect the distribution of absolute abundance of microbes. To evaluate the ability of mLDM to recover network structures, we follow
) is used to produce a graph in which a) initially two nodes in are connected and b) every new node is added in by linking to a node in the current graph with probability proportional to the degree of the node.
Hub Graph: Nodes are randomly split into groups, and within the same group, every node is connected with a center node with probability of 1. Finally, random edges are included in the .
Band Graph: Each adjacent node pair i and j in is connected if and P−1 edges are generated in .
) to generate and obtain the positive definite covariance matrix . In order to make the covariance matrix sparse, and thus beneficial to methods estimating the correlations, we set if . Then, is sampled from the normal distribution , and is calculated via Equation 5. Next, we generate the Dirichlet-multinomial samples from Equation 3. This process relies on the R package ‘HMP’, which includes the generation of Dirichlet-multinomial samplers. For B, B0 and with five structures, all methods are compared with the following four experimental settings: P = 50, and N = 25,50,200 and 500. We use public codes glasso, CCREPE, SPIEC-EASI, CCLasso and the implementation of SparCC in SPIEC-EASI. Here PCC and SCC are implemented in R language, and the candidates of associations are selected via p value. We set p value at 0.05 for PCC, SCC and CCREPE, and the threshold of correlation for SparCC is 0.1. For each parameter setting, we randomly generate 20 sets of data for evaluation. For all experimental results, it should be noted that we show the mean and variance of evaluation results from the 20 synthetic datasets.
We use three metrics for evaluation:
ROC curve: We plot the ROC curves using two criteria. For PCC, SCC, CCREPE, SparCC and CCLasso, which estimate pairwise correlations, we compare their results with the true correlation matrix with each element being . For glasso, SPIEC-EASI and mLDM, which estimate conditional independence, we compare their results with the true precision matrix .
AUC score: We compute the area under the ROC curves directly. The AUC scores are calculated by ignoring the sign of edges.
distance: It is defined as the distance between the estimated edge weights and the true weights in the graph. A smaller distance indicates a higher accuracy. Let and denote the distance for the OTU-OTU and EF-OTU association graphs, respectively. For the pairwise correlation methods, , where is the estimated value and is the true value. For the conditional independence methods, , and .
Preprocessing of TARA Oceans Eukaryotic Data
The TARA Oceans eukaryotic OTU table and environmental data, including the known genus-level eukaryotic symbiotic interactions were downloaded from the PANGAEA website (https://doi.pangaea.de/10.1594/PANGAEA.843018) and the TARA OCEANS project website (http://www.raeslab.org/companion/ocean-interactome.html). A total of 91 genus-level mapped eukaryotic symbiotic interactions that consist of both parasitism and mutualism were collected based on the literature (
) and were used to evaluate the effectiveness of all methods. Samples with missing environmental factor values or with too large or small read counts were removed. OTUs that appear in less than 40% of the samples were omitted. For comparison, we chose OTUs that were involved in known genus-level symbiotic interactions. Finally we constructed a dataset consisting of 67 OTUs with 28 known genus-level interactions and 17 environmental factors from 221 samples for evaluation.
and downloaded the OTU and meta data from the github (https://github.com/SchlossLab/Baxter_glne007Modeling_GenomeMed_2015). We selected a total of 117 OTUs, including 112 that existed in at least half of all 490 samples, and 5 that were CRC-associated OTUs reported in the article, including Prevotella (OTU57), Porphyromonas (OTU105), Fusobacterium (OTU264), Parvimonas (OTU281) and Peptostreptococcus (OTU310).
Preprocessing of West English Channel Data
For the West English Channel data, we downloaded the OTU table from the VAMPS website (https://vamps.mbl.edu/). Forty-seven samples from position L4 were selected for association estimation. We extracted 48 OTUs that appeared in at least 46 samples, and the total abundance of these OTUs exceeds 50% of the total read counts. This dataset has 8 EFs, including temperature, day length, as well as concentrations of salinity, ammonia, chlorophyll, nitrate, phosphate and silicate, which were used to infer EF-OTU associations.
Data and Software Availavility
The program of mLDM is freely available at https://github.com/tinglab/mLDM/. Now, for the synthetic dataset with 50 OTUs, 5 EFs and 500 samples, mLDM runs about 20 min on a server with an Intel Xeon v3 2.5GHz CPU and 128G RAM.
Author Contributions
Conceptualization, Y.Y., N.C., and T.C.; Methodology, Y.Y. and N.C.; Software, Y.Y.; Format Analysis, Y.Y.; Investigation, Y.Y., N.C., and T.C.; Writing – Original Draft, Y.Y., N.C., and T.C.; Writing – Review & Editing, Y.Y., N.C., and T.C.; Visualization, Y.Y.; Supervision, N.C. and T.C.
Acknowledgments
We are thankful for the assistance of Jun Zhu on methodology. This work is supported by the National Natural Science Foundation of China (nos: 61305066 , 61561146396 , 61322308 , 61673241 ). An early version of this paper was submitted to and peer reviewed at the 2016 Annual International Conference on Research in Computational Molecular Biology (RECOMB). The manuscript was revised and then independently further reviewed at Cell Systems.
Andrew, G., and Gao, J. (2007). Scalable training of L1-regularized log-linear models. Proceedings of the 24th International Conference on Machine Learning. 33–40.
Genetic diversity of Amoebophryidae (Syndiniales) during Alexandrium catenella/tamarense (Dinophyceae) blooms in the Thau lagoon (Mediterranean Sea, France).
Global-scale distributions of marine surface bacterioplankton groups along gradients of salinity, temperature, and chlorophyll: A meta-analysis of fluorescence in situ hybridization studies.
To submit a comment for a journal article, please use the space above and note the following:
We will review submitted comments within 2 business days.
This forum is intended for constructive dialog. Comments that are commercial or promotional in nature, pertain to specific medical cases, are not relevant to the article for which they have been submitted, or are otherwise inappropriate will not be posted.
We recommend that commenters identify themselves with full names and affiliations.