Abstract
Identifying disease-related genes is an important issue in computational biology. Module structure is widespread in biomolecule networks, and complex diseases are usually thought to be caused by perturbations of local neighborhoods in these networks, which can provide useful insights for the study of disease-related genes. However, mining and effectively utilizing module structure remains challenging in problems such as disease-gene prediction.
We propose a hybrid disease-gene prediction method integrating multiscale module structure (HyMM), which can utilize multiscale information from local to global structure to more effectively predict disease-related genes. HyMM extracts module partitions from local to global scales by multiscale modularity optimization with exponential sampling, and estimates the disease relatedness of genes in partitions by the abundance of disease-related genes within modules. Then, a probabilistic model for integration of gene rankings is designed in order to integrate multiple predictions derived from multiscale module partitions and network propagation, and a parameter estimation strategy based on functional information is proposed to further enhance HyMM’s predictive power. By a series of experiments, we reveal the importance of module partitions at different scales, and verify the stable and good performance of HyMM compared with eight other state-of-the-art methods, as well as the further performance improvement derived from the parameter estimation.
The results confirm that HyMM is an effective framework for integrating multiscale module structure to enhance the ability to predict disease-related genes, which may provide useful insights for the study of multiscale module structure and its application in problems such as disease-gene prediction.
Introduction
The progress of human disease gene discovery has promoted the understanding of the underlying molecular basis of human diseases, but genes known to be associated with diseases only account for a very small proportion of the incidences [1–4]. Traditional approaches such as linkage analysis and genome-wide association studies (GWAS) often provide a long list of candidate genes, requiring expensive and time-consuming experimental identification [5, 6]. Therefore, with the accumulation of biomedical data [7–10], developing computational algorithms for predicting disease-related candidate genes is indispensable to accelerate the discovery of disease-related genes [3, 11, 12].
An organism, as a complex biological system, is composed of a large number of biomolecules (e.g. genes and proteins) with complex relationships (physical interactions or functional associations), forming a complex biomolecule network system in which biomolecules exert biological functions through intermolecular synergy and rarely function alone. Human complex diseases can be viewed as the consequences of perturbations or functional abnormalities of associated synergistic biomolecules in the complex network system [13]. Therefore, it is necessary to study complex diseases and relevant biological phenomena from the perspective of systems biology, and biological networks provide an important means for systems biology research [14–18]. In particular, network-based algorithms have become a popular strategy for the study of disease-related genes [3, 19–28], since genes associated with the same or similar diseases are more similar functionally and their products tend to be highly interconnected in biomolecule networks [4, 29] (see next section). However, how to mine the characteristics of biomolecule networks so as to more effectively explore disease-related genes and related issues is still under continuous exploration due to the inherent complexity of biomolecule networks and the limitation of existing knowledge (e.g. the incompleteness of the protein interactome) [15, 17, 30–34].
As we know, module structure as a common property of complex networks is ubiquitous in biomolecule networks, and the modular nature of human diseases can provide useful insights for the study of diseases, but it has not been fully explored in disease-gene prediction [35–37]. Generally, the genes and their products of disease tend to form a disease module due to their high interconnectivity in biomolecule networks [4, 29], but they are usually found to be distributed in multiple modules/subnetworks due to the intrinsic definition of a specific algorithm and the existence of multiscale module structure in the networks [4, 38–41]. The multiscale structure is indeed widespread in biological networks. For example, a module in a protein network may contain several sub-modules, e.g. some protein complexes (such as SAGA) contain several secondary complexes; most of the biological information (e.g. in Gene Ontology) is organized in the form of hierarchical structure. Many algorithms with a flexible resolution parameter have been proposed and applied to mine multiscale module structures in biological networks [42–45], where the resolution parameter can adjust the size or scale of identified modules (see next section). This can provide richer information for studying complex systems such as biomolecule networks, but there are still many challenging issues, such as how to effectively identify the multiscale modules from a network and how to mine the valuable information hidden in the multiscale structure.
To make use of multiscale module structure to more effectively predict disease-related genes, we therefore propose a hybrid method integrating the information of multiscale modules (HyMM) (Figure 1). HyMM extracts a series of module partitions from local to global scales by multiscale modularity optimization (MO) with exponential sampling, and estimates the disease relatedness of genes in partitions by the abundance of disease-related genes in modules. Then, a probabilistic model for integration of gene rankings is designed so as to integrate multiple predictions derived from multiscale module partitions and network propagation, and a parameter estimation strategy based on functional information for Gene Ontology (GO) annotations, pathways or disease genes is proposed to further enhance HyMM’s predictive power.
Workflow of the HyMM method integrating multiscale module structure. (a) Datasets: GG, DD and DG denote the gene–gene, disease–disease and disease–gene associations, respectively; GO and PW denote the GO annotations and pathways of genes, respectively. (b) Extract multiscale module partitions by multiscale algorithm. (c) The extracted module partitions are transformed into matrix representation. (d) A ranking list of genes is generated for each partition matrix; these ranking lists are organized into a ranking matrix of genes, and then, an integrated ranking list of genes based on this ranking matrix is generated by a probabilistic model along with a parameter estimation based on functional information. (e) Generate a ranking list of genes by network propagation. (f) Final rankings of genes are generated by integrating the ranking lists of genes from multiscale modules and network propagation.
The rest of the paper is organized as follows. Firstly, we introduce some related work in this study (including the identification of module structure and the prediction of disease-related genes). Secondly, we present the datasets and details of HyMM, as well as evaluation methods. Thirdly, we study the effectiveness of multiscale module information in disease-gene prediction by combining the functional analysis of multiscale modules; then, by a series of experimental tests, we verify the good performance of HyMM and study the effects of various factors including the definition of conditional probability, multiscale module extraction, parameter estimation based on functional information, sampling of multiscale module partitions and random shuffling of disease-gene associations. Furthermore, we apply HyMM to other datasets as well as specific diseases [e.g. Alzheimer’s disease (AD)] to further demonstrate the effectiveness of HyMM. These results confirm that HyMM can enhance the ability to predict disease-related genes by integrating a multiscale module structure. It is a protocol for disease-gene prediction integrating multiscale module structure, which may become a very useful computational tool for the study of disease-related genes.
Related work
In this study, we focus on the mining of information hidden in the multiscale structure to enhance the ability of disease-gene prediction, which involves two main aspects: disease-gene prediction and (multiscale) module identification (also called community detection or community mining in the field of complex networks [46–48]).
Disease-gene prediction is not only an important issue in computational biology but also an important field of network medicine/biology [4, 14, 16, 17, 19, 43–45]. Numerous network-based methods for disease-gene prediction have been proposed, based on various approaches, e.g. from homogeneous (HO) networks to heterogeneous (HE) networks and from single-layer networks to multi-layer networks [15, 17, 34, 49–52]. For HO network models, for example, Köhler et al. [53] predicted disease-associated genes by the random walk with restart (RWR) on a protein interactome; Chen et al. [54] prioritized disease candidate genes by the K-step Markov (KS) method; Hsu et al. [55] developed the gene interconnectedness-based method to rank candidate genes by evaluating their network closeness to seeds (known disease-related genes); Zhu et al. [56] proposed the vertex similarity-based (VS) method to discover disease-associated genes. For HE network models, for example, Li et al. [57] proposed the RWR on a disease-gene HE network to infer disease-gene associations; Wu et al. [58] proposed the network-based global inference method called CIPHER to predict human disease-related genes; Xie et al. [59] proposed the bi-random walk (BiRW) to predict disease-gene associations; Singh-Blom et al. [60] predicted disease-gene associations by developing the KATZ measure on an HE network, inspired by social network analyses. For more sophisticated network models or techniques, for example, Valdeolivas et al. [27] proposed the RWR on multiplex and HE networks; Xiang et al. [34] proposed the network impulsive dynamics on the multiplex network for disease-gene prediction; Liu et al. [33] proposed a new network embedded representation algorithm to infer pathogenic genes; Xiang et al. [61] proposed a disease-gene-prediction method based on fast network embedding, which can effectively use information in a multi-source HE network constructed by integrating multiple types of association data.
Moreover, some module-based algorithms are also applied to the analysis of disease-related genes/modules as well as related issues [42, 62–65]. See references [2, 3, 15, 19, 66–68] for related reviews.
The existing methods in the literature have promoted the progress of disease-gene prediction, and ‘guilt-by-association’ has become a top-down central principle for predicting disease-related genes in the networks [69]. Evaluating the closeness or distance between candidates and known disease-related genes is a direct strategy to infer disease-related genes in the networks [20, 55, 56], while network propagation (e.g. RWR, KS, RWRH and BiRW) can effectively make use of more information in the whole network to mine potential disease-gene associations [26, 53, 54, 70, 71]. Network propagation (especially the RWR) shows excellent performance in many scenarios [19], so it has been widely applied in the study of bio-entity associations including disease-gene prediction [70].
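As a concrete illustration of network propagation, the standard RWR iteration can be sketched as follows. This is a minimal pure-Python sketch, not the implementation used in this study; the toy graph, the restart probability of 0.5, and the function name `rwr` are illustrative assumptions.

```python
def rwr(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart on an undirected, unweighted graph.

    adj:   dict mapping node -> set of neighbours (toy input format, assumed).
    seeds: known disease-related genes; the walker restarts uniformly on them.
    Returns a dict node -> approximate steady-state visiting probability.
    """
    nodes = sorted(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: jump back to the seed set with probability `restart`.
        nxt = {v: restart * p0[v] for v in nodes}
        # Walk term: spread the remaining mass evenly over neighbours.
        for u in nodes:
            if not adj[u]:
                continue  # isolated nodes pass no mass on
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        diff = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Toy network: a triangle A-B-C with a tail C-D-E; gene A is the only seed.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
scores = rwr(toy, seeds={"A"})
ranking = sorted((g for g in toy if g != "A"), key=lambda g: -scores[g])
```

Candidates closer to the seed in the network receive higher steady-state probabilities, which is the 'guilt-by-association' principle expressed as a ranking.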
Biological networks such as protein–protein interaction networks are an important basis for network-based methods in disease-gene prediction, and the mining of biological networks can be helpful for understanding the characteristics of networks and promoting the study of relevant issues. The existence of module structure is an important property of biological networks [72–74], and the research of module structure has been an important topic in the study of complex networks including biological networks. In the past decades, a large number of different types of algorithms based on various approaches (e.g. MO [75], dynamics [76] and statistical inference [77]) were proposed to identify modules (also called communities) in networks, which involve various types of modules (e.g. from single scale to multiple scales, and from non-overlapping to overlapping) and various types of networks (e.g. from unweighted to weighted networks, from undirected to directed networks, and from unsigned to signed networks) [46, 47]. Many of the module identification algorithms, especially MO-based algorithms, have been applied to the study of biological networks, e.g. functional module mining [72, 74, 78], protein complex detection [79–81], and disease module identification [42].
As mentioned above, genes/proteins associated with the same disease tend to form relevant disease modules in a biomolecule network [4, 29], but these genes are usually distributed in multiple modules by specific algorithms [4, 38]. There are several possible reasons for this phenomenon. (i) Complex diseases usually involve functional abnormalities of multiple genes, and these genes may have different functions, playing different roles in the development of complex diseases [82]. (ii) The existing biomolecule networks such as protein–protein interactions are still incomplete [30, 83]. This may cause the detected network modules to be broken and incomplete. (iii) Detected modules in networks are often algorithm-specific, because specific definitions of modules are different for different algorithms [46]. Some algorithms may split a large module into several small submodules in a network, or aggregate several small modules into a large one, because of the existence of a resolution limit that is related to the intrinsic definition or mechanism of algorithms [39–41]. This also implies the existence of a multiscale structure in the network.
In fact, multiscale structure widely exists in various natural and artificial complex networks, including biological networks [84, 85]. In this case, algorithms with flexible resolution parameters, e.g. multiscale MO [86, 87], may more effectively mine the module structure of networks at different scales, where the resolution parameter can be used to tune the scale or size of identified modules [48, 86, 88]. For example, multiscale MO can find relatively large modules in a network when the resolution parameter is small, while it can identify relatively small modules when the resolution parameter is large. This is similar to observing an object from a local to a global scale by a microscope with adjustable resolution parameters. Modules at different scales from local to global ones can be identified by adjusting the resolution parameter in continuous real number space. To study modules at different scales, one generally extracts a set of module partitions corresponding to a set of values sampled from the space of the resolution parameter by a suitable strategy (e.g. exponential sampling).
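The idea of sampling the resolution parameter exponentially can be sketched as below. This is a hedged illustration only: the function name `exponential_sampling`, the interval [0.1, 10], and the number of samples are assumptions, and the exact sampling rule used by HyMM may differ.

```python
def exponential_sampling(gamma_min, gamma_max, n_samples):
    """Return n_samples resolution values evenly spaced on a log scale.

    Consecutive values share a constant ratio, so small resolutions (coarse,
    global modules) and large resolutions (fine, local modules) are covered
    with similar density.
    """
    if n_samples < 2 or gamma_min <= 0 or gamma_max <= gamma_min:
        raise ValueError("need n_samples >= 2 and 0 < gamma_min < gamma_max")
    ratio = (gamma_max / gamma_min) ** (1.0 / (n_samples - 1))
    return [gamma_min * ratio ** i for i in range(n_samples)]


# Hypothetical range of the resolution parameter.
gammas = exponential_sampling(0.1, 10.0, 5)
```

Each sampled value would then be passed to the multiscale module identification algorithm to obtain one module partition per scale.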
Multiscale module identification is important for studying biomolecule networks. Dunn et al. [89] have used edge-betweenness clustering to separate protein interaction networks into modules correlating to annotated gene functions, where modules of different sizes can be identified by removing different numbers of edges. Lewis et al. [90] investigated the correlation between the functions of sets of proteins and network module/community structure at multiple resolutions/scales, and they showed that there exist different important scales of module/community structure depending on studied proteins and processes. Wang et al. [91] proposed a fast hierarchical clustering (HC) algorithm using the local metric of edge clustering value, which can uncover the hierarchical organization of functional modules that approximately corresponds to the hierarchical structure of GO annotations. Extended (multiscale) MO was used to identify disease modules [42]. More recently, Zheng et al. developed a multiscale approach called HiDeF to identify robust structures at all scales by integrating the concept of persistent homology with existing community detection algorithms (e.g. multiscale MO).
The mining of multiscale module structure can reveal the features of biological networks at multiple scales, reflecting the correlations of network nodes (e.g. genes) at different levels. This can provide more abundant information for the relevant research involving biological networks, such as network-based disease–gene prediction and protein–protein interaction prediction. Therefore, in this study, we will explore methods for disease-gene prediction by integrating a multiscale module structure.
Materials and methods
Here, we introduce the datasets in this study, and then propose the details of HyMM (including multiscale MO with exponential sampling, disease-relatedness estimation of genes based on multiscale modules, a probabilistic model for integration of multiple gene rankings as well as parameter estimation based on functional information) and evaluation methods. See Figure 1 and Supplementary Note 1 (see Supplementary Data available online at https://academic.oup.com/bib) for the workflow of HyMM.
Datasets
To investigate the predictive ability of algorithms, we employ the disease-gene associations, gene–gene associations and disease-disease associations. In order to conduct functional analysis of modules and parameter estimation based on functional information, we adopt three types of functional groups: GO annotations, PW, and disease-gene sets. See Supplementary Note 2 (see Supplementary Data available online at https://academic.oup.com/bib) for details of datasets.
Disease–gene associations
We use three disease-gene association datasets. (i) The first dataset is an integrated disease-gene dataset [30, 92] retrieved from GWAS and Online Mendelian Inheritance in Man [93]. It is denoted as the Medical Subject Headings Ontology (MeSH) dataset, since MeSH is used to combine the different disease nomenclatures of the two sources into a single standard vocabulary. (ii) The second one is obtained from the DISEASES database, which is a weekly updated web resource for disease-gene associations [94]. (iii) The third one is obtained from the DisGeNet database (https://www.disgenet.org/), which is known as a platform that contains one of the largest publicly available collections of disease-related genes [95]. The UMLS (Unified Medical Language System) diseases in the dataset are mapped into MeSH diseases.
Gene–gene associations
Genes and their products mainly perform their biological functions through their direct or indirect interactions, forming a complex gene–gene association network [83, 96–99]. The gene–gene associations are very important for disease research, since complex diseases are usually considered to be caused by local disturbances of complex biomolecule networks [17, 100, 101]. Here, the gene–gene associations are derived from protein–protein interactions (PPIs). Because single-source protein–protein networks are often incomplete and noisy, we adopt a comprehensive protein interactome that consists of multiple sources of protein–protein interactions: regulatory interactions, binary interactions from several yeast two-hybrid high-throughput and literature-curated datasets, literature-curated interactions derived mostly from low-throughput experiments, metabolic enzyme-coupled interactions, protein complexes, kinase-substrate pairs and signaling interactions [30]. The network data include only physical protein interactions with experimental support. The identifiers of proteins are mapped into gene symbols.
These gene–gene associations form a HO network of genes. Furthermore, we construct a disease-gene HE network by integrating gene–gene associations, disease-gene associations and disease-disease associations mentioned above. Note that only the disease-gene associations in the training set are used in the construction of the HE network.
Disease–disease associations
The disease-disease associations can provide useful knowledge for the discovery of disease-related genes. Here, the disease–disease association network is constructed by using the associations between symptoms and diseases. The strengths of these associations between a symptom
Three types of functional groups
(i) The GO annotations are downloaded from the Molecular Signatures Databases (MSigDB) [103, 104], which omits GO terms with fewer than 5 genes or in very broad categories; (ii) the pathway-gene sets (PW) are also obtained from MSigDB, which were curated from several online pathway databases (such as KEGG and Reactome), publications in PubMed and knowledge of domain experts [105, 106] and (iii) the disease-gene sets (DG) are obtained as mentioned above.
Multiscale MO with exponential sampling
Identification of module/community structure itself is an important issue in the research of networked systems [46, 48, 107]. We here extract module structure from local to global scales by MO with exponential sampling. Moreover, we also consider two other multiscale methods: asymptotic surprise (AS) [88] and fast HC [91]. All of them have flexible resolution parameters to adjust the scale or size of modules. Given a set of reasonably sampled resolution-parameter values, they can generate a set of network module partitions that contain important information about the network structure (see Supplementary Note 3 for details; Supplementary Data available online at https://academic.oup.com/bib).
Multiscale MO
MO can detect module structure at different scales by varying the resolution parameter
MO needs to be realized with the help of effective optimization algorithms. Here, the Louvain algorithm is applied, because it is a very effective and widely used strategy for optimizing objective functions of module structure in networks, and we have shown that it can be further improved by an effective initialization process and refining process [48, 75, 88].
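To make the role of the resolution parameter concrete, the sketch below evaluates a resolution-dependent modularity for a given partition. It assumes the widely used Reichardt–Bornholdt form, Q(γ) = Σ_c [L_c/m − γ (D_c/2m)²], where L_c is the number of intra-module edges, D_c is the total degree of module c and m is the number of edges. This form is an assumption and may differ from the exact objective optimized in this study; the toy graph is likewise illustrative.

```python
def modularity(edges, membership, gamma=1.0):
    """Resolution-dependent modularity of a partition (assumed RB form).

    edges:      list of undirected edges (u, v), unweighted, no duplicates.
    membership: dict node -> module label.
    gamma:      resolution parameter; larger values favour smaller modules.
    """
    m = len(edges)
    internal = {}    # module -> number of intra-module edges
    degree_sum = {}  # module -> total degree of its nodes
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - gamma * (d / (2.0 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {n: "all" for n in range(6)}

q_two_default = modularity(edges, two_modules, gamma=1.0)
q_one_default = modularity(edges, one_module, gamma=1.0)
q_two_coarse = modularity(edges, two_modules, gamma=0.1)
q_one_coarse = modularity(edges, one_module, gamma=0.1)
```

On this toy graph the two-triangle partition scores higher at γ = 1, whereas the single large module scores higher at γ = 0.1, illustrating how the resolution parameter shifts the preferred scale. An optimizer such as the Louvain algorithm searches for the partition that maximizes this quantity at each sampled resolution.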
Exponential sampling of multiscale module partitions
To extract a set of meaningful module partitions from local to global scales {
According to the set of sampled
Disease-relatedness estimation of genes based on multiscale module structure
A vector indicating association scores between
A partition matrix
A diagonal matrix
The disease relatedness scorings of modules in the
The disease-relatedness scores of genes in the
Then, the union matrix of gene scorings can be calculated by
According to the above gene scoring/ranking strategy, genes within the same module have the same disease-relatedness scorings/rankings. Therefore, these scoring/ranking lists of genes contain the information of disease relatedness as well as the information of multiscale module structure from low to high resolutions. For example, if a module partition consists of two modules: one contains disease-related genes while not for another, gene scorings/rankings will have two values/levels. If the module with disease-related genes is further split into two sub-modules with disease-related genes, then gene scorings/rankings will have three values/levels if there is no degeneracy of module scorings. The number of values/levels in the scoring/ranking list of genes is closely related to the number of disease-related modules in a module partition. As the resolution increases, we can get the gene scoring/ranking lists with more levels/values, thereby revealing different levels of disease-related information in the network.
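The two-module example above can be reproduced with a small sketch. Here the disease relatedness of a module is assumed to be the fraction of its genes that are known disease-related genes; this normalization, the function name `module_based_scores`, and the toy gene names are assumptions for illustration, not the exact scoring defined in this study.

```python
def module_based_scores(partition, seeds):
    """Assign every gene the disease-relatedness score of its module.

    partition: dict gene -> module label.
    seeds:     set of known disease-related genes.
    The module score is the fraction of seed genes in the module (assumed).
    """
    size, hits = {}, {}
    for gene, module in partition.items():
        size[module] = size.get(module, 0) + 1
        if gene in seeds:
            hits[module] = hits.get(module, 0) + 1
    return {g: hits.get(m, 0) / size[m] for g, m in partition.items()}


seeds = {"g1", "g4"}

# Coarse partition: module X holds both seeds, module Y holds none.
coarse = {"g1": "X", "g2": "X", "g3": "X", "g4": "X", "g5": "X", "g6": "X",
          "g7": "Y", "g8": "Y"}
# Finer partition: X is split into X1 and X2, each with one seed.
fine = {"g1": "X1", "g2": "X1",
        "g3": "X2", "g4": "X2", "g5": "X2", "g6": "X2",
        "g7": "Y", "g8": "Y"}

coarse_scores = module_based_scores(coarse, seeds)
fine_scores = module_based_scores(fine, seeds)
coarse_levels = len(set(coarse_scores.values()))
fine_levels = len(set(fine_scores.values()))
```

The coarse partition yields two score levels and the finer partition yields three, matching the behavior described above when module scores are not degenerate.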
This provides another possible way to understand the scorings based on multiscale module partitions.
Probabilistic model for integration of multiple gene rankings
The prior knowledge
In order to calculate the final scores of candidate genes, it is necessary to provide an explicit mathematical form of the above conditional probability function (CPF)
Without loss of generality, given a ranking list of genes
CPF1:
CPF2:
CPF3:
where
For CPF1,
Given the above union matrix of gene rankings from multiscale modules, a comprehensive scoring list
The scorings/rankings from multiscale modules may provide useful and complementary information for disease-gene prediction, which is different from that of many other algorithms based on various principles, e.g. network propagation. So, we further integrate the ranking list
Parameter estimation based on functional information
Because of the good performance of CPF3, it will be used as the default form of
We firstly introduce the functional consistency of a module
For a set of functional groups, the functional consistency of a module is defined as the maximal functional consistency over all the functional groups
The functional consistency of a module partition
We will use the functional consistency metrics to quantify the functional relevance of network modules and module partitions, based on the set of functional groups for GO/PW/DG. This may provide insights for parameter estimation of
Evaluation methods
We implement the above procedure of HyMM (including the multiscale algorithms) in MATLAB (2016 version). To evaluate the performance of algorithms in disease-gene prediction, we use two evaluation strategies: traditional 5-fold cross-validation (5FCV) and independent test (IndTest). (i) For 5FCV, known disease-related genes for each disease are randomly split into five subsets. In each realization, one of the subsets is treated as a test set, while the rest are treated as a training set. (ii) For IndTest, the disease-gene associations in the MeSH dataset are used as a training set, and the disease-gene associations that only belong to the DisGeNet dataset are used as a test set.
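The 5FCV split for a single disease can be sketched as follows. This Python sketch is illustrative only (the study itself was implemented in MATLAB); the function name `five_fold_splits`, the fixed random seed, and the toy gene names are assumptions.

```python
import random


def five_fold_splits(disease_genes, n_folds=5, seed=0):
    """Yield (training set, test set) pairs for one disease.

    The known disease-related genes are shuffled once and dealt into
    n_folds subsets; each subset serves as the test set exactly once.
    """
    genes = sorted(disease_genes)  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(genes)
    folds = [genes[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = set(folds[i])
        train = set(genes) - test
        yield train, test


known_genes = {"gene%d" % i for i in range(12)}  # toy gene set
splits = list(five_fold_splits(known_genes))
```

In each realization only the training genes would be used as seeds (and, in the HE network, as disease-gene associations), so that the test genes are genuinely held out.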
To construct the candidate set of genes, which consists of a test set of genes and a control set of genes, we will construct two kinds of control sets: artificial linkage-interval control set (ALICS) and whole-genome control set (WGCS). (i) For ALICS, each test gene selects 99 control genes from genes closest to this test gene on the same chromosome. This simulates the scenario with disease-related mutation locations (e.g. derived from the genome-wide association study or linkage analysis) [53]. (ii) For WGCS, all unknown genes outside the training and test sets are used as a control set. This simulates the scenario without the information of disease-related mutation locations.
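The construction of an ALICS control set can be sketched as below. The input format (a dictionary of chromosome and coordinate per gene), the function name `alics_control_set`, and the choice to exclude already known disease genes from the controls are assumptions made for illustration.

```python
def alics_control_set(test_gene, positions, excluded=(), n_controls=99):
    """Pick the n_controls genes nearest to test_gene on its chromosome.

    positions: dict gene -> (chromosome, coordinate); hypothetical format.
    excluded:  genes that must not serve as controls, e.g. known
               disease-related genes (an assumption of this sketch).
    """
    chrom, pos = positions[test_gene]
    same_chrom = [
        g for g, (c, _) in positions.items()
        if c == chrom and g != test_gene and g not in excluded
    ]
    # Sort by genomic distance; break ties by gene name for determinism.
    same_chrom.sort(key=lambda g: (abs(positions[g][1] - pos), g))
    return same_chrom[:n_controls]


# Toy genome: 300 genes on chr1 and 50 genes on chr2, 1 kb apart.
positions = {"g%d" % i: ("chr1", i * 1000) for i in range(300)}
positions.update({"h%d" % i: ("chr2", i * 1000) for i in range(50)})

controls = alics_control_set("g150", positions, excluded={"g149"})
candidates = ["g150"] + controls  # 100 candidates: 1 test gene + 99 controls
```

For WGCS, by contrast, the control set would simply be every gene outside the training and test sets, with no positional filtering.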
Then, based on the ranking list of candidate genes, several standard evaluation metrics (AUPRC, Recall, and Precision) are used to quantify the performance of prediction algorithms. (i) AUPRC denotes the area under the Precision-Recall curve (PRC), which has Recall on the x-axis and Precision on the y-axis. This is a widely used metric to comprehensively evaluate the performance of algorithms. (ii) Recall measures the fraction of disease-related genes in the test set that are retrieved in the top-k ranking list. (iii) Precision (Prec) measures the fraction of genes in the top-k ranking list that are known disease-related genes. Recall and Precision as functions of k provide an intuitive comparison of the local performance of prediction algorithms.
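These metrics can be computed from a ranked candidate list as sketched below. AUPRC is estimated here by average precision, a common step-wise estimator; whether this matches the exact AUPRC computation used in this study is an assumption, as are the function names and the toy ranking.

```python
def top_k_recall_precision(ranked, positives, k):
    """Return (Recall, Precision) for the top-k genes of a ranked list."""
    hits = sum(1 for g in ranked[:k] if g in positives)
    return hits / len(positives), hits / k


def average_precision(ranked, positives):
    """Step-wise estimate of the area under the Precision-Recall curve.

    Precision is evaluated at the rank of every retrieved positive and
    averaged over all positives in the test set.
    """
    hits, total = 0, 0.0
    for i, g in enumerate(ranked, start=1):
        if g in positives:
            hits += 1
            total += hits / i
    return total / len(positives)


ranked = ["a", "b", "c", "d", "e"]  # toy ranking, best candidate first
test_genes = {"a", "c"}             # toy test set of disease-related genes

recall_at_2, precision_at_2 = top_k_recall_precision(ranked, test_genes, 2)
auprc = average_precision(ranked, test_genes)
```

In this toy case one of the two test genes appears in the top 2, so both Recall and Precision at k = 2 equal 0.5, and the average precision is (1/1 + 2/3)/2.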
Experimental results
In this section, we first study the effectiveness of multiscale modules and display the good performance of the HyMM framework in disease-gene prediction by a series of experimental tests, including the effect of various factors such as the CPF, multiscale algorithm, parameter estimation based on functional information, and sampling of multiscale module partitions.
Effectiveness of multiscale modules in disease-gene prediction
Here, we study the predictive performance of each module partition being used independently (Figure 2; Figures S2–S4, and Supplementary Note 5, see Supplementary Data available online at https://academic.oup.com/bib). As a whole, the predictive power of the algorithm based on single-scale module partition first has a large upward trend and then a downward trend with the increase of resolution, and the downward trend appears earlier and is more obvious in the HO network. This clearly indicates that different scale module partitions have different levels of importance in disease-gene prediction.
Predictive ability of MO-based module partitions at different scales in disease-gene prediction, as a function of resolution parameter, in HO and HE networks, under different control sets (WGCS and ALICS).
The main reason behind the above phenomenon is that as the resolution increases, modules become more and more fine-grained, and the relevance between nodes in the same module is getting higher and higher. Gradually splitting modules into more fine-grained ones can filter low-relevance nodes while retaining high-relevance nodes in a module, but this will also lead to the loss of information, resulting in the decline of prediction power, because modules without disease-related information are trivial for our scoring strategy.
In order to verify the relationship between the above-mentioned functional relevance of nodes inside modules and the studied scale (resolution), we calculate the functional consistency of module partitions at different scales by using the GO, PW and DG functional groups. The results show that the functional consistency scores of module partitions increase with the increase of resolution (Figure S5, see Supplementary Data available online at https://academic.oup.com/bib). This means that the ratio of similar genes within the modules increases with the resolution: these genes are more likely to have the same GO annotations or belong to the same pathway or disease-gene set. This is because the edge density in the modules becomes higher with the increase of resolution. Genes in the same module tend to interact with each other and are thus more likely to have the same or similar functions or to participate in a common biological process. Therefore, the module partitions at different scales can provide different levels of information about the relationship between genes. This may provide a more comprehensive understanding of genes and their functions.
Further, with the increase of resolution, disease-related genes also tend to be dispersed into more modules with smaller sizes but stronger functional relevance (e.g. for GO, PW or DG) (Figure S6, see Supplementary Data available online at https://academic.oup.com/bib). As a whole, candidate genes in these modules will be more likely to be disease-related. So, the predictive power of the algorithm based on single-scale module partition gradually increases with the increase of resolution, but the decline of predictive power may appear due to information loss caused by over-filtering of low-relevance nodes, especially when the network is divided into extremely broken modules.
HyMM effectively enhances the ability of disease-gene prediction
HyMM outperforms baseline algorithms
Here, we evaluate the performance of HyMM/MM when default setting is used, by comparing to eight baseline algorithms: RWR (Random Walk with Restart) [53], KS (K-Step Markov) [54], VS (Vertex Similarity) [56], ICN (Interconnectedness) [55], RWRH (Random Walk with Restart on Heterogeneous network) [57], CIPHER (Correlating protein Interaction network and PHEnotype network to pRedict disease genes) [58], BiRW (Bi-Random Walk) [59] and KATZ [60] (also see Supplementary Note 6). MM denotes the algorithm that uses only multiscale modules to generate predictive scores. For simplicity, we here define the ratio of performance improvement as
The experimental results show that HyMM outperforms all these baseline algorithms, in both the HO and HE networks, under both the ALICS and WGCS control sets (see the results of AUPRC/Recall/Prec in Figure 3; Figures S7 and S10, see Supplementary Data available online at https://academic.oup.com/bib). Specifically, under the ALICS control set, HyMM in the HO network exceeds the best baseline algorithm by 7, 4 and 7% in AUPRC, Recall and Prec, respectively; HyMM in the HE network exceeds the best baseline algorithm by 27, 25 and 23% in AUPRC, Recall and Prec, respectively. Under the WGCS control set, HyMM in the HO network exceeds the best baseline algorithm by 9, 21 and 31% in AUPRC, Recall and Prec, respectively; HyMM in the HE network exceeds the best baseline algorithm by 28, 33 and 32% in AUPRC, Recall and Prec, respectively. The top-k Recall/Prec curves further confirm the performance of HyMM (Figure 4; Figures S8 and S11, see Supplementary Data available online at https://academic.oup.com/bib).
Performance comparison of HyMM/MM to different baseline algorithms under the ALICS control set. (a and b) AUPRC, (c and d) Recall and (e and f) Precision (Prec) in the HO/HE networks.
Comparison of local performance of HyMM to different baseline algorithms under the ALICS control set. (a and b) Top-k Recall and Precision in the HO network; (c and d) top-k Recall and Precision in the HE network.
HyMM provides a useful framework to enhance the ability of disease-gene prediction
We systematically test the performance of the HyMM framework by integrating it with other baseline algorithms. For simplicity, we here define the ratio of performance improvement due to the HyMM framework as
Performance improvement of different baseline algorithms due to the use of multiscale module information in (a) HO and (b) HE networks, under the ALICS control set. −MM and +MM denote the non-use and use of multiscale module information, respectively.
Moreover, ALICS and WGCS simulate the scenarios with and without disease-related mutation locations, respectively. Because ALICS uses this additional information about mutation locations, it has a smaller set of candidate genes than WGCS. Thus, the evaluation metrics (AUPRC, Recall and Precision) are generally larger under ALICS. This can also be understood from a random-selection point of view, since the probability of randomly selecting correct genes from a smaller candidate set is usually greater.
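The random-selection argument can be made concrete with a small sketch. The candidate-set sizes used here are hypothetical illustration values, not the actual sizes of the ALICS or WGCS control sets, and the function name is our own.

```python
from fractions import Fraction


def random_hit_probability(n_true: int, n_candidates: int) -> Fraction:
    """Probability that one gene drawn uniformly at random from the
    candidate set is a true disease gene."""
    if n_candidates <= 0 or not 0 <= n_true <= n_candidates:
        raise ValueError("need n_candidates > 0 and 0 <= n_true <= n_candidates")
    return Fraction(n_true, n_candidates)


# Hypothetical sizes: one held-out disease gene hidden among 100 candidates
# (a small, location-restricted set) versus 10,000 candidates (genome-wide).
small_set = random_hit_probability(1, 100)
large_set = random_hit_probability(1, 10_000)

print(float(small_set))  # 0.01
print(float(large_set))  # 0.0001
```

Under these assumed sizes, a random guess is 100 times more likely to be correct in the smaller candidate set, so metric values obtained under the two control sets are not directly comparable.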
Comparison of different CPFs
We have compared the performance of HyMM using different CPFs (CPF1/CPF2/CPF3) (Figures S13 and S14, in Supplementary Note 7, see Supplementary Data available online at https://academic.oup.com/bib). In HE, CPF3 can obtain the best performance under both the ALICS and WGCS control sets. In HO, HyMM using CPF3 has comparable or better performance than that using CPF1/CPF2 under the ALICS and WGCS control sets. So, CPF3 is used as the default form of CPF.
Moreover, it is interesting that the nonlinear forms of CPF (e.g. CPF2/CPF3) are better than the linear form (e.g. CPF1). This means that it is beneficial to give more preference to high-ranking genes in this probabilistic framework.
Comparison of different multiscale algorithms
We have compared the performance of HyMM using different multiscale algorithms (MO/AS/HC). Under the ALICS control set, HyMM using MO outperforms HyMM using AS and HC in most cases, while the Recall- and Prec-values of AS in HE are slightly higher than those of MO (Figure 6; Figure S15, in Supplementary Note 8, see Supplementary Data available online at https://academic.oup.com/bib). Under the WGCS control set, HyMM using MO has good performance, although HC is the best in HO, and AS is the best in HE (Figure S16, see Supplementary Data available online at https://academic.oup.com/bib). Moreover, MM using MO is better than that using AS and HC in all the cases. Overall, MO can robustly produce better or comparable performance in various tests, so MO is used as the default choice.
For HyMM/MM, comparison of different multiscale algorithms (MO, AS and HC) under the ALICS control set. (a and b) AUPRC, (c and d) Recall and (e and f) Precision (Prec), in HO and HE networks. HyMM/MM denotes the default algorithms using MO; HyMM-AS/MM-AS and HyMM-HC/MM-HC denote the algorithms using AS and HC, respectively.
Performance improvement through parameter estimation based on functional information
Since module partitions at different scales are of different importance for disease-gene prediction, the optimization of
Due to the use of functional information of DG/GO/PW, performance improvement of (a–c) MM and (d–f) HyMM (HE) (using MO, AS and HC), under the ALICS control set. EW denotes that equal weight is used.
Stability to sampling of multiscale module partitions
Sampling of multiscale module partitions is closely related to the number of multiscale module partitions and the amount of information extracted from the network. To study the effect of sampling of multiscale module partitions on prediction performance, we evaluate the performance of HyMM for different values of resolution interval
Performance stability of HyMM using MO/AS/HC (i.e. HyMM_MO, HyMM_AS and HyMM_HC) under the ALICS control set, as a function of resolution interval: (a) AUPRC, (b) Recall and (c) Precision in the HO network; (d) AUPRC, (e) Recall and (f) Precision in the HE network.
Effect of random shuffling of disease-gene associations on predictive performance under the ALICS control set, as a function of ratio of shuffled disease-gene associations: (a and b) AUPRC, (c and d) Recall and (e and f) Precision, in the HO and HE networks, respectively.
Effect of random shuffling of disease-gene associations
Further, we study the effect of random shuffling of disease-gene associations on the predictive performance of algorithms. We generate a series of disease-gene datasets with different degrees of randomization from no randomization to complete randomization by randomly replacing a certain ratio of known disease-gene associations in a dataset with randomly sampled unknown associations; and then we test the performance of algorithms on these datasets (see Supplementary Note 11, see Supplementary Data available online at https://academic.oup.com/bib, for details).
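The randomization procedure can be sketched as follows. This is a minimal illustration under our own assumptions (replacement is done over the whole dataset rather than per disease, and every disease-gene pair not in the known set counts as an unknown association); the exact procedure is given in Supplementary Note 11, and the function name and toy data are hypothetical.

```python
import random


def shuffle_associations(known, diseases, genes, ratio, seed=0):
    """Replace a fraction `ratio` of known (disease, gene) pairs with
    randomly sampled pairs that are not in the known set."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)
    known_set = set(known)
    pairs = sorted(known_set)
    # Assumption: every pair absent from the known set is an "unknown" association.
    unknown = [(d, g) for d in sorted(diseases) for g in sorted(genes)
               if (d, g) not in known_set]
    n_replace = round(ratio * len(pairs))
    if n_replace > len(unknown):
        raise ValueError("not enough unknown associations to sample from")
    removed = set(rng.sample(pairs, n_replace))
    added = set(rng.sample(unknown, n_replace))
    return sorted((known_set - removed) | added)


# Toy data (hypothetical): 4 known associations, 2 diseases, 6 genes.
known = [("d1", "g1"), ("d1", "g2"), ("d2", "g3"), ("d2", "g4")]
diseases = {"d1", "d2"}
genes = {"g1", "g2", "g3", "g4", "g5", "g6"}

half_shuffled = shuffle_associations(known, diseases, genes, ratio=0.5)
fully_shuffled = shuffle_associations(known, diseases, genes, ratio=1.0)
print(len(set(half_shuffled) & set(known)))   # 2 known pairs survive
print(len(set(fully_shuffled) & set(known)))  # 0 known pairs survive
```

Enumerating all unknown pairs is only practical for toy inputs; for a genome-scale network, rejection sampling of random pairs would be a more economical choice.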
The results show that HyMM consistently outperforms the baseline algorithms as the degree of randomization increases (Figure 9, Figures S21 and S22, see Supplementary Data available online at https://academic.oup.com/bib). This again confirms the stable and good performance of HyMM in disease-gene prediction. Moreover, as expected, the performance of all algorithms clearly degrades as the degree of randomization increases. This means that HyMM performs far better on real datasets than on randomized ones, and that the known disease-gene associations are critical for effectively inferring new disease-gene associations. The predictive power of existing prediction algorithms depends heavily on the accumulation of confirmed and reliable disease-gene associations, which is the solid basis for the development of disease-gene-prediction algorithms.
Applications to other datasets
In the above sections, we have demonstrated that HyMM has stable and good performance in disease-gene prediction. Here, we further apply HyMM to two other datasets, the DISEASES and DisGeNET datasets, by cross-validation and an independent test (see Supplementary Note 12, see Supplementary Data available online at https://academic.oup.com/bib).
Performance evaluation in cross-validation
For the DISEASES dataset, the disease terminology in Disease Ontology (DO) database is used, and thus the similarity scores between diseases are calculated based on the DO database by the DOSE package [115]. For the DisGeNet dataset, the UMLS diseases are mapped into the MeSH diseases according to the disease mappings in the DisGeNet database, and thus the MeSH symptom-based disease similarity scores are still used. Here, the disease-gene associations in the two datasets are used as benchmarks in turn, and we have tested the performance of HyMM in the two datasets by cross-validation experiments (Figures S23 and S24, see Supplementary Data available online at https://academic.oup.com/bib). In the DISEASES dataset, HyMM has higher or comparable values of AUPRC/Recall/Prec than the best baseline algorithm(s) under the ALICS control set, and HyMM has good overall performance under the WGCS control set (see Supplementary Note 12, see Supplementary Data available online at https://academic.oup.com/bib). In the DisGeNet dataset, HyMM consistently has higher values of AUPRC/Recall/Prec than the best baseline algorithm(s) under both control sets. These results show that, as in the MeSH dataset, HyMM also has good performance when applied to these two datasets, further confirming the effectiveness of HyMM in disease-gene prediction.
Performance evaluation on external dataset
Furthermore, we evaluate HyMM by a test on an external dataset (also denoted as IndTest). We calculate the scores of candidate genes by using the MeSH dataset of disease-gene associations as a training set and then evaluate the prediction performance by using the disease-gene associations belonging to DisGeNet (excluding the training set) as a test set, since DisGeNet is one of the largest publicly available datasets of disease-related genes. In this test, HyMM shows higher Recall- and Prec-values than the baseline algorithms under both the ALICS and WGCS control sets (Figure S25, see Supplementary Data available online at https://academic.oup.com/bib). In particular, the Prec of HyMM is clearly better than that of the baseline algorithms.
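The train/test split of this external test can be sketched for a single disease as follows. This is a minimal illustration assuming the standard top-k definitions of Recall and Precision and assuming that training genes are removed from the ranked candidates; the exact metric definitions and cutoffs used in this study are given elsewhere in the paper, and all names and scores here are hypothetical.

```python
def top_k_recall_precision(scores, train_genes, test_genes, k):
    """Rank candidate genes by decreasing score, skipping training genes,
    and compare the top-k genes against the held-out test genes."""
    train = set(train_genes)
    test = set(test_genes) - train  # test set excludes the training set
    candidates = [g for g in scores if g not in train]
    # Ties are broken by gene name so that the ranking is deterministic.
    ranked = sorted(candidates, key=lambda g: (-scores[g], g))
    top = ranked[:k]
    hits = sum(1 for g in top if g in test)
    recall = hits / len(test) if test else 0.0
    precision = hits / len(top) if top else 0.0
    return recall, precision


# Hypothetical scores for one disease.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.7, "g4": 0.6, "g5": 0.1}
train_genes = {"g1"}                   # e.g. associations from the training dataset
external_genes = {"g1", "g2", "g4"}    # e.g. associations from the external dataset

print(top_k_recall_precision(scores, train_genes, external_genes, k=2))  # (0.5, 0.5)
```

Here g1 is dropped from the test set because it is already in the training set, so only g2 and g4 count as new associations to be recovered.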
Applications to specific diseases
In the above sections, we have confirmed the disease-gene prediction ability of HyMM by its average performance on different datasets. Here, we further show the predictive ability of HyMM for specific diseases.
Disease-specific performance evaluation
We first study the effectiveness of the HyMM framework in enhancing the ability of predicting specific disease-related genes by using the MeSH dataset as a benchmark. The results of AUPRC/Recall/Prec show that HyMM can improve the ability of predicting disease-related genes for many diseases such as AD (see Figures S26–S28, see Supplementary Data available online at https://academic.oup.com/bib).
Then, we further study the performance of HyMM for AD and some related diseases. Figure 10 and Figures S29 and S30 (see Supplementary Data available online at https://academic.oup.com/bib) display the results of AD, Huntington disease (HD), Parkinson disease (PD), Lewy Body disease (LBD), Frontotemporal Lobar Degeneration (FLD), Anxiety Disorder (Anxiety), Major Depressive Disorder (MDD), and Depressive Disorder (DD). The results show that, under both ALICS and WGCS control sets, for AD and some related diseases (e.g. HD, PD and LBD), HyMM consistently has better AUPRC than the best baseline algorithm(s), except for the results of FLD. Under ALICS, for most diseases (e.g. AD, Anxiety, MDD, DD, HD, PD and LBD), HyMM has comparable or higher values of Recall/Prec compared to the best baselines. Especially for AD, Anxiety, MDD, PD and LBD, HyMM has clearly higher values of Prec. Under WGCS, for most diseases (e.g. AD, Anxiety, MDD, DD, HD and LBD), HyMM has higher values of Recall/Prec compared to the best baselines, except for the results of PD and FLD. Especially for AD, MDD, DD and HD, HyMM has clearly higher values of Recall/Prec. See Supplementary Note 13 (see Supplementary Data available online at https://academic.oup.com/bib) for details.
Performance of HyMM and other methods for specific diseases in DisGeNet dataset under ALICS control set: (a) AUPRC, (b) Recall and (c) Prec.
Overall, HyMM has good performance for AD and many related diseases. Especially for AD and some related diseases, HyMM has better performance than the best baseline algorithm(s), though it is not specially designed for these diseases.
Case study
AD is a progressive neurodegenerative disease and the most common form of dementia. Its prevalence is increasing in our aging population, resulting in a huge socio-economic burden [116–118]. AD involves specific onset and course of age-related cognitive and functional decline, as well as specific neuropathology. It is a highly hereditary disease with high complexity, and identifying AD-related genes is of great significance for determining its therapeutic targets [118]. Here, we calculate the scores of candidate genes related to AD by using the known disease-gene associations in the MeSH dataset as a training set and then obtain the top-20 genes from the list of candidate genes ranked in decreasing order of score (Table S1, see Supplementary Data available online at https://academic.oup.com/bib).
Through literature verification, we find that some biomedical studies have suggested associations between AD and many genes in this list of candidate genes [119–125]. For example, Park et al. [119] showed that ALK was important to the tau-mediated AD pathology; Annunziata et al. [120] showed that the deficiency of NEU1 caused the occurrence of an AD-like amyloidogenic process; Qi et al. [126] showed that GAA promoted Aβ clearance by promoting autophagy via the Axl/Pak1 signaling pathway in microglial cells and improved cognitive deficiency in a mouse model; Michele et al. [123] observed a statistically significant increase of CNVs for C4B in AD patients, suggesting a possible role for C4A CNVs in the risk of AD; Pichiah et al. [124] showed that C4B was differentially expressed in AD; Lian et al. [121] showed that the dysregulation of neuron–glia interaction through NFκB/C3/C3aR signaling might lead to synaptic dysfunction in AD; Rasmussen et al. [122] confirmed that the low baseline levels of complement C3 were associated with a high risk of AD; Stoye et al. [125] demonstrated that APOA1 might be a key factor within intestine altered in AD-like pathology; Rai et al. [127] showed that the MTHFR C677T polymorphism was associated with an increased risk of AD; and Feng et al. [128] showed that the autophagosome-lysosome fusion could be repressed by the AD-like MAPT accumulation, showing a vicious cycle of MAPT accumulation and autophagy deficit in the chronic course of AD. MTHFR and MAPT have been recorded as related to AD in DisGeNet.
By the enrichment analysis of the above genes, we obtain the most relevant KEGG pathways and GO terms (Tables S2 and S3, see Supplementary Data available online at https://academic.oup.com/bib), many of which are known to be related to AD, such as the pathways (Lysosome, Metabolic pathways, Oxidative phosphorylation and PD) and the GO annotations (myeloid leukocyte activation, leukocyte mediated immunity, regulated exocytosis, oxidation–reduction process, glycosphingolipid metabolic process, mitochondrial respiratory chain complex I assembly, neutrophil degranulation, energy derivation by oxidation of organic compounds, mitochondrion organization, small molecule metabolic process). As we know, lysosomes are the main digestive compartments in cells that degrade extracellular and intracellular substances by a series of processes (e.g. autophagy, endocytosis and phagocytosis), and the dysfunction of lysosomes leads to the accumulation of undigested substances [129]. For example, pathological aggregates of proteins Aβ and τ can result in AD [130]. Autophagy-lysosome defects appear in the early stage of AD and are considered to be an important factor in the AD process [131]. Removing these aggregates by autophagy and degrading them in lysosomes may be a promising treatment. Much evidence shows that AD is a widespread metabolic disorder that is related to the dysregulation of multiple biochemical pathways [132, 133]. Understanding the metabolic perturbations related to AD is essential to identify new therapeutic targets. The reduction of oxidative phosphorylation enzyme activities may be related to β-amyloid accumulation or other neurodegenerative processes, which may play a critical role in the pathology of AD [134, 135]. Many neurodegenerative disorders are closely related [136, 137]. For example, iron plays an important role in maintaining the normal physiological function of the brain, and the iron metabolism dysregulation associated with cell injury and oxidative stress often co-occurs in several neurodegenerative diseases such as AD and PD [136].
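For readers unfamiliar with enrichment analysis, the following sketch shows one common way to score the over-representation of a pathway or GO term in a gene list, namely a one-sided hypergeometric test. The source does not state which test or tool was used for Tables S2 and S3, so this is an assumption for illustration only, and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_universe genes, of which n_annotated carry the annotation."""
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, i) * comb(n_universe - n_annotated, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 background genes, 5 annotated to a pathway,
# 4 genes in the candidate list, 3 of which are annotated.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)
print(round(p_value, 4))  # 0.032
```

In practice, such p-values are computed for many terms at once and therefore need a multiple-testing correction before terms are reported as enriched.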
In addition, we analyze the druggability of the candidate genes (Table S1, see Supplementary Data available online at https://academic.oup.com/bib) and find that there are many genes corresponding to protein targets of approved or clinical trial-phase drug candidates [138, 139], and many genes have a large number of interacting drugs [140], which may be potential therapeutic agents.
Conclusions and discussions
Identifying disease-related genes is important for the study of human diseases. Network-based algorithms for disease-gene prediction are very popular, because human complex diseases are usually considered to be caused by the perturbations or functional abnormalities of biomolecule networks. Multiscale module structure widely exists in the biomolecule networks, but it is not fully utilized in the analysis and prediction of disease-related genes. Therefore, we proposed the hybrid method called HyMM that integrates the information of multiscale modules to more effectively predict disease-related genes. HyMM consists of several key components: the multiscale MO with exponential sampling for extracting multiscale module structure, the disease-relatedness estimation of genes based on multiscale modules and the probabilistic model for integration of multiple gene rankings, along with the parameter estimation based on functional information.
We first revealed the importance of module partitions at different scales in disease-gene prediction by the partition-by-partition analysis of multiscale modules (e.g. by MO, AS and HC). Then, by a series of experimental tests, we verified the good performance of HyMM, and showed the effect of different conditional probability forms and different multiscale module extraction algorithms. Next, we confirmed the performance improvement derived from parameter estimations based on functional information (DG/PW/GO), and the stability of HyMM to multiscale module partition sampling and random shuffling of disease-gene associations. Finally, the applications of HyMM to other datasets as well as specific diseases further demonstrated the effectiveness of HyMM. Overall, HyMM provides an effective framework for integrating multiscale module structure to enhance the ability to predict disease-related genes. This framework can provide useful insights for the study of the multiscale module structure and its application in problems such as disease-gene prediction.
In this study, multiscale module identification is critical to the HyMM framework, but we confirmed the effectiveness of multiscale modules in enhancing the ability of disease-gene prediction by using only MO and two other multiscale algorithms. It is quite possible that HyMM can be further improved by using more advanced module identification algorithms, sampling methods and parameter estimation methods. Motifs, i.e. small patterns recurring in a network, widely exist in many biological networks (e.g. metabolic networks and PPI networks) and are generally considered as building blocks of biological networks [141], but we do not specifically consider network motifs in module identification. The study of motifs in networks is an important topic, and there has been some research on network clustering using motifs. For example, HiSCF (Higher-order Structural Clustering Framework) [142] is able to perform the clustering analysis by exploiting a variety of network motifs, which demonstrates that the consideration of higher-order network motifs gains new insight into the analysis of biological networks. Moreover, the extracted multiscale modules are used by HyMM in the way of integrating independent rankings of genes, but they can also form a (feature) matrix reflecting the module affiliations of genes at different scales, which may be used to infer disease-associated genes by kernel methods or machine learning algorithms [143–146]. These interesting ideas are worth exploring in the future.
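The feature-matrix idea can be sketched as follows. This is a minimal illustration of one possible encoding, in which each (scale, module) pair becomes a binary column; it is not part of HyMM as evaluated in this study, and the function name, module labels and toy partitions are hypothetical.

```python
def module_feature_matrix(genes, partitions):
    """Build a binary gene-by-module matrix from several partitions.

    `partitions` holds one dict per scale, mapping each gene to its module
    label at that scale. Each (scale, module) pair becomes one column.
    """
    columns = sorted(
        {(scale, module)
         for scale, part in enumerate(partitions)
         for module in part.values()}
    )
    col_index = {col: j for j, col in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in genes]
    for i, gene in enumerate(genes):
        for scale, part in enumerate(partitions):
            if gene in part:
                matrix[i][col_index[(scale, part[gene])]] = 1
    return columns, matrix


# Toy example: a coarse partition (scale 0) and a finer one (scale 1).
genes = ["g1", "g2", "g3", "g4"]
coarse = {"g1": "A", "g2": "A", "g3": "A", "g4": "B"}
fine = {"g1": "a", "g2": "a", "g3": "b", "g4": "c"}

columns, matrix = module_feature_matrix(genes, [coarse, fine])
print(columns)    # [(0, 'A'), (0, 'B'), (1, 'a'), (1, 'b'), (1, 'c')]
print(matrix[2])  # [1, 0, 0, 1, 0]  (g3 is in module A at scale 0 and b at scale 1)
```

Rows of such a matrix could then serve as gene features for a kernel or a classifier; genes that share modules at many scales receive similar rows.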
Biomedical data is an important basis for the research of complex diseases and their related genes [7, 8]. Individual status (disease or health) can be reflected through gene expression, which is affected by multiple factors such as gene mutation, methylation and transcription factors, so the analysis of multi-omics data is very important for disease research, which may promote the discovery of unknown biological knowledge. With the development of high-throughput sequencing technologies, a large amount of omics data (e.g. from genomics and transcriptomics to proteomics and metagenomics) are continuously being generated [9, 147–149], and disease-related research will benefit from the increase in the amount and type of the data as well as the improvement in data quality [14]. The integrative use of omics data is expected to improve the ability of disease-related association prediction, and may promote the innovation of relevant technologies and methods (e.g. dimension reduction techniques, network embedding, structured sparsity regularization and multilayer network methods), accelerating the development of systems biology [14, 16, 146, 150–155]. However, it is still difficult to manage, analyze and use these data, though many studies for integrative bioinformatics and omics data source interoperability are actively promoting the solution of the related problems [9, 149, 156–158]. For example, heterogeneous repositories with multiple formats and different quality levels hinder the integration of genomic data [156]. The disease-related datasets also have similar problems: diversity of disease terminology systems, disease term redundancy, lack/incompleteness of mapping between terminologies in different datasets, data reliability, etc.
Moreover, precision medicine aims to realize the personalized diagnosis and treatment of diseases, while the research on disease-gene prediction in the literature basically focuses on the disease class or its subclass. Patient-level datasets with genotypic data and phenotypic data provide the possibility to study individual pathogenic genes [10, 154]. Especially with the development of single-cell sequencing technologies, the amount of cell-level multi-omics data is growing explosively [147, 158], which provides new opportunities for the research of tissue heterogeneity and cell function, as well as the personalized study of diseases and pathological genes. The integrative analysis of single-cell data at different molecular levels is expected to reveal the overall complexity of biological systems. We believe that these are worthy of further exploration in the future, though the integration of the single-cell data is still a challenge due to their intrinsic heterogeneity.
Developing computational methods for predicting disease-related genes is important to the study of human diseases, due to the high cost and time consumption of biological experiments.
We proposed a hybrid framework for disease-gene-prediction by integrating multiscale module structure (HyMM), which can utilize multiscale information from local to global structure to more effectively predict disease-related genes.
HyMM extracts module partitions from local to global scales by multiscale modularity optimization with exponential sampling, and estimates the disease relatedness of genes in partitions by the abundance of disease-related genes within modules. A probabilistic model for aggregation of gene rankings is designed in order to integrate multiple predictions derived from multiscale module partitions and network propagation, and a parameter estimation strategy based on functional information is proposed to further enhance HyMM’s predictive power.
By a series of experiments, we reveal the importance of module partitions at different scales, and verify the good performance of HyMM and its further performance improvement derived from the parameter estimation.
Funding
National Key Research and Development Program of China (Grant No. 2019YFA0706202); the Training Program for Excellent Young Innovators of Changsha (Grant No. kq2106075); the Fundamental Research Funds for the Central Universities of Central South University (Grant No. 2019zzts279); the Project funded by China Postdoctoral Science Foundation (Grant No. 2021M703633); the National Natural Science Foundation of China (Grant No. 61702054).
Ju Xiang is currently working toward the PhD degree in the School of Computer Science and Engineering, Central South University, China. He is an Associate Professor with Changsha Medical University, Hunan, China. His research interests include complex networks, bioinformatics, machine learning and deep learning.
Xiangmao Meng is currently a postdoctoral researcher in the School of Computer Science and Engineering, Central South University, China. His current research interests include bioinformatics, complex network analysis and data mining.
Yichao Zhao is currently working toward the PhD degree in the School of Computer Science and Engineering, Central South University, China. His current research interests include bioinformatics and systems biology.
Fang-Xiang Wu is a Professor in the Division of Biomedical Engineering, Department of Computer Science, Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada. He is a senior member of IEEE. His current research interests include bioinformatics and artificial intelligence.
Min Li is currently the vice dean and a Professor at the School of Computer Science and Engineering, Central South University, China. Her main research interests include bioinformatics and systems biology.
References
Hu L, Zhang J, Pan X, Yan H, You Z-H. HiSCF: leveraging higher-order structures for clustering analysis in biological networks.