Introduction
With the global population aging, the prevalence of long-term illnesses has become the main challenge in healthcare [1]. This is even more apparent in critical care, where the medical outcome of a patient depends on the interaction of a number of factors. The coexistence of multiple long-term illnesses, or multimorbidity, is one of the most important of these factors, and it contributes heavily to the heterogeneity of adverse outcomes [2]–[5].
For this reason, clustering techniques such as k-means have been used to identify multimorbidity profiles and contextualize the potential impact of specifically adapted treatments, including those for sepsis [6]–[8]. However, even when they are successful, these computational techniques commonly rely on heuristics to optimize an objective function, and have been criticized for their dependence on the quality of the data and the simplicity of their models [9]. In response to such concerns, latent class analysis (LCA) has been proposed as an alternative [9], [10]. Recently, it has been used on laboratory and demographic data to identify four multimorbidity profiles with different prevalence of sepsis and mortality [11]. Similarly, using data on pre-existing conditions, Zador et al. identified six multimorbidity profiles [12].
Despite being a more principled approach that addresses some of the issues of other clustering algorithms, LCA shares some of their drawbacks. First, it does not use a principled method to define the optimal number of clusters, which could lead to poor results [5], [13]. Second, like traditional clustering techniques, LCA makes assumptions about the causal relationship between variables; these assumptions may be unrealistic and consequently affect the quality of the results [5]. Finally, LCA uses unobserved variables to find clusters, which makes the clusters hard to understand because they cannot be interpreted directly from the observed data [14].
Network science is the study of complex systems based on the connections between their constituent elements [15], which makes it particularly suitable for analyzing the complex interactions between long-term disorders in patients. In complex networks, the analogue of clustering is called community detection, a problem whereby the entities constituting a network (nodes) are grouped based on their connectivity patterns [16]. However, most of the methods proposed for this task are heuristic and therefore show problems similar to those of traditional clustering techniques [17], [18]. To overcome this problem and provide a more robust solution to the task of patient clustering, we propose the use of stochastic block modeling (SBM) [19], [20]. This encompasses a family of generative models commonly used for community detection that, thanks to their probabilistic approach, are not prone to the same issues that affect heuristic methods [18], [21]. Specifically, to address the shortcomings of the clustering techniques discussed above, we use the hierarchical stochastic block model (hSBM), a non-parametric version of the SBM that provides hierarchical clusters and therefore a better resolution, and that has already been successfully applied to clustering outside standard community detection [21]–[23].
One of the main advantages of this approach is that the hSBM is an unsupervised, non-parametric method, and therefore it does not require any input or assumption about the data. Additionally, this guarantees that the model cannot overfit and find structure where there is none [21], [22]. Finally, the model presents a hierarchical cluster structure that can facilitate the visualization of the solutions and enhance their interpretability [24].
We use the hSBM to detect clusters of patients based on their multimorbidity and demographic information. Our results show that it finds clusters with homogeneous demographic and multimorbidity profiles that explain the data in more detail than recent work [12], and that it uncovers important groups of patients missed by existing approaches. Additionally, these groups show distinct, statistically significant sepsis and mortality rates, which are more informative than those suggested by existing methods used in critical care.
Methods and Data
In this section we briefly describe the dataset used for our analysis, and present the hSBM and its use in the context of our work.
A. Dataset
To perform our analysis, we use information on 38,417 patients from the Medical Information Mart for Intensive Care (MIMIC-III), an anonymized dataset comprising information on the admissions to the critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012, which has been widely used since its release [25], [26].
Following Zador et al. [12], we consider age, sex, admission type (elective, non-elective) and secondary diagnoses as features for our study. Patients under 16 years of age are not considered, and the age of the remaining patients is discretized into the following ranges: 16–24, 25–44, 45–64, 65–84, and over 85. For patients with multiple ICU admissions, we consider only the first. The resulting dataset is consistent with that reported by Zador et al. in terms of both demographic (gender and age) and morbidity distributions.
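The age discretization described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name, the band labels, and the decision to place a patient aged exactly 85 in the top band are assumptions, since the text does not specify how that boundary is handled.

```python
from typing import Optional

# Age bands as listed in the text. The label strings are illustrative.
AGE_BANDS = [
    (16, 24, "16-24"),
    (25, 44, "25-44"),
    (45, 64, "45-64"),
    (65, 84, "65-84"),
]


def age_band(age: int) -> Optional[str]:
    """Return the age band of a patient, or None if the patient is excluded.

    Patients under 16 are excluded. Assumption: age 85 falls in the top band.
    """
    if age < 16:
        return None
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "85+"


if __name__ == "__main__":
    for example_age in (12, 16, 44, 70, 91):
        print(example_age, age_band(example_age))
```

Each resulting band would then act as one feature node to which the patient is linked in the network described in Sec. II-B.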
Morbidities are computed from the rich collection of secondary diagnoses by using the Elixhauser comorbidity index [27], a well-established method to detect long-term disorders in patients based on ICD-9 codes. Finally, sepsis is computed following the definition given by Angus et al., whose implementation is available from the official MIMIC-III code repository [26], [28]: a patient is considered to have sepsis if it is explicitly recorded in their history, if there exists an infection (bacterial/fungal) and organ dysfunction, or if there exists an infection (bacterial/fungal) and the patient is under mechanical ventilation.
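The three-way sepsis rule stated above reduces to a short boolean expression. The sketch below encodes only that rule as worded in the text; the function and argument names are hypothetical, and the derivation of each flag from ICD-9 codes, which the repository implementation performs, is not reproduced here.

```python
def has_sepsis(
    explicit_sepsis: bool,
    infection: bool,
    organ_dysfunction: bool,
    mechanical_ventilation: bool,
) -> bool:
    """Apply the sepsis rule as described in the text.

    A patient is flagged if sepsis is explicitly recorded, or if a
    bacterial/fungal infection co-occurs with either organ dysfunction
    or mechanical ventilation. All four inputs are assumed to be
    precomputed boolean flags (hypothetical names).
    """
    return explicit_sepsis or (
        infection and (organ_dysfunction or mechanical_ventilation)
    )


if __name__ == "__main__":
    # Infection alone is not sufficient under this rule.
    print(has_sepsis(False, True, False, False))
    # Infection together with mechanical ventilation is sufficient.
    print(has_sepsis(False, True, False, True))
```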
B. Hierarchical Stochastic Block Model for Patient Clustering
The stochastic block model (SBM) is a generative model commonly used for community detection in complex networks. It is based on the assumption that nodes have a given probability of connecting to each other, and that this probability depends solely on the community (or block) to which they belong [29], [30]. A major limitation of the SBM is that the model requires prior information on the number of blocks the network possesses [17]. To address this issue, recent studies have proposed a series of Bayesian, non-parametric versions of the SBM [21], [22]. One such version, the hierarchical (or nested) stochastic block model (hSBM), provides hierarchical clustering and was introduced to overcome the resolution limit of the non-parametric SBM, allowing the model to discover finer-grained clusters [22].
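As an illustration of the assumption above, the likelihood of the simplest (Bernoulli) SBM can be written as follows. The symbols are introduced here for illustration only: A is the adjacency matrix, b_i is the block of node i, and p_{rs} is the probability of an edge between a node in block r and a node in block s. The non-parametric and hierarchical variants cited above use more elaborate formulations with priors on these quantities.

```latex
P(A \mid b, p) \;=\; \prod_{i<j} p_{b_i b_j}^{\,A_{ij}} \,\bigl(1 - p_{b_i b_j}\bigr)^{1 - A_{ij}}
```

Every pair of nodes contributes one factor, and that factor depends on the nodes only through their block labels, which is what allows block membership to be inferred from connectivity alone.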
Fig. 1. Posterior odds ratio obtained at every batch run of the merge-split MCMC. We run the algorithm for 1,000 batches of 10 runs each. The posterior probability of a partition is computed for each of these batches. We then compute the posterior odds ratio by dividing the posterior probability of the partition obtained at a given batch run by the posterior probability of the partition obtained after running the agglomerative multilevel MCMC 100 times. After around 200 batch runs, the posterior odds ratio stabilises, after having reached a state whose partition is
In this paper, we use the hSBM to identify informative clusters of homogeneous patients. To this end, we create a bipartite unweighted network in which one set of nodes represents patients and the other represents their features, including demographics, admission type, and morbidities. Each patient node is connected to its features, as shown in Fig. 2. Then, we use the hSBM to find clusters of patients. There are several ways in which the hSBM can find the best partition, so we follow a procedure that gives the highest probability of not getting stuck in a local minimum. Specifically, we first use an agglomerative multilevel Markov Chain Monte Carlo (MCMC) algorithm that starts by placing each node in its own cluster and then, at each step, proposes moving nodes to different clusters [31]. These moves are accepted with a given probability based on their resulting minimum description length gain. Given its stochastic nature, this algorithm is not guaranteed to always find the best partition. To limit this possibility, we run the algorithm 100 times and keep the run that yields the highest posterior probability. However, this still does not ensure that the partition obtained is optimal, as there is a small chance that the algorithm found a solution corresponding to a local minimum. For this reason, we further refine the resulting partition by running another optimization algorithm on it, the merge-split MCMC, which was proposed to address this issue [32]. We run this 10,000 times, in batches of 10, to ensure that no further significant improvement is possible. Indeed, we find that, after roughly 200 batches, the improvement in the posterior probability reaches a plateau, as can be seen in Fig. 1. This result suggests that the clusters obtained by the hSBM are either optimal or near-optimal.
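Two steps of the procedure above lend themselves to a short sketch: building the bipartite patient-feature edge list, and comparing two partitions through their posterior odds ratio. The code below is illustrative and makes several assumptions: the function names and the feature label format are hypothetical, the inference itself is not shown, and the odds ratio assumes that each partition is summarised by a description length in nats equal to the negative log of its joint probability with the data.

```python
import math
from typing import Collection, List, Mapping, Tuple


def build_bipartite_edges(
    patients: Mapping[str, Collection[str]],
) -> Tuple[List[str], List[str], List[Tuple[int, int]]]:
    """Build an unweighted bipartite edge list linking patients to features.

    Patient nodes take indices 0..P-1 and feature nodes take indices
    P..P+F-1, so the two node sets never overlap. Duplicate features
    listed for one patient produce a single edge.
    """
    patient_ids = sorted(patients)
    feature_names = sorted({f for feats in patients.values() for f in feats})
    offset = len(patient_ids)
    feature_index = {name: offset + i for i, name in enumerate(feature_names)}
    edges = []
    for p_idx, pid in enumerate(patient_ids):
        for feat in sorted(set(patients[pid])):
            edges.append((p_idx, feature_index[feat]))
    return patient_ids, feature_names, edges


def posterior_odds_ratio(dl_candidate: float, dl_reference: float) -> float:
    """Posterior odds of a candidate partition against a reference one.

    Assumes description lengths in nats, so the ratio of posterior
    probabilities is exp(-(dl_candidate - dl_reference)). A value above 1
    means the candidate partition is the more probable of the two.
    """
    return math.exp(dl_reference - dl_candidate)


if __name__ == "__main__":
    # Toy records; the labels are placeholders, not values from the dataset.
    toy = {
        "p1": ["age:25-44", "sex:F"],
        "p2": ["age:25-44", "sex:M", "morbidity:obesity"],
    }
    ids, feats, edges = build_bipartite_edges(toy)
    print(ids, feats, edges)
    print(posterior_odds_ratio(dl_candidate=99.0, dl_reference=100.0))
```

Keeping the two node sets in disjoint index ranges makes it straightforward to tell a model which nodes are patients and which are features, so that the two types are never placed in the same cluster.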
Cluster Analysis
We compare our clusters with those obtained by Zador et al. using LCA [12] through a discussion of their composition and relationship to the prevalence of sepsis and mortality among the patients that belong to them.
A. The Benchmark: LCA for Patient Clustering
Zador et al. find groups of patients with similar morbidity profiles and the relationship of such groups with sepsis and mortality rates [12]. We use this work as our benchmark for three reasons: First, this is one of the few studies that cluster patients based on their multimorbidity profiles. Second, to achieve this they use MIMIC-III. Third, they focus their analysis of clusters on adverse outcomes such as mortality and sepsis. To the best of our knowledge, this is currently the state-of-the-art for studies that possess all three of the aforementioned characteristics.
Zador et al. find six groups of patients with statistically significantly different multimorbidity profiles. They compute the prevalence of sepsis and organ dysfunction for all the clusters, and the associated mortality rates. Their results suggest that two scores commonly used by clinicians to assess the risk of sepsis and mortality at the time of admission, namely the Oxford Acute Severity of Illness Score (OASIS) [33] and the Sequential Organ Failure Assessment (SOFA) [34], provide predictions that contrast with the observed prevalence of adverse events in their clusters. This implies that, by using information on multimorbidity, it is possible to improve the assessment of adverse outcomes. Despite this promising finding, not all the clusters they obtain show discernible mortality and sepsis rates [12]. Also, these clusters miss some categories of patients with far higher or lower probabilities of developing sepsis or dying, such as those patients with elective admissions or those with no long-term illnesses, respectively. In the next section, we show that our approach can uncover these groups of patients, which are missed by LCA as well as by the OASIS and SOFA scores.
B. Clinically Homogeneous Multimorbidity Clusters
Fig. 2 presents the clusters and their hierarchy as found using the hSBM. At the highest hierarchical level, it is possible to see that the hSBM captures the bipartite structure of the network, separating features (on the left-hand side) from patients (on the right-hand side).
At the intermediate hierarchical level, the hSBM identified six patient clusters. This is the same number of clusters identified by Zador et al. with LCA. However, despite some similarities, the clusters provided by the two methods are very different. In both cases, it does not seem that gender plays a significant role in assigning patients to a group. Moreover, both approaches identified clusters of similar age profiles.
Fig. 2. The bipartite network of patients and their features. Patients are displayed on the right-hand side of the network, and features on the left-hand side. To reduce visual clutter, only 1,000 of the 235,137 edges between patients and their features, sampled at random, are displayed. The tree illustrates the hierarchical structure of the clusters we find, and is discussed in Sec. III.
For instance, the hSBM includes most patients in the age ranges 16–24 and 25–44 in the same clusters (A and B, Fig. 3), as does LCA [12]. Similarly, patients over 85 and patients in the age range 65–84 are clustered separately by both methods. Morbidities aside, the main difference is that the admission type (i.e., elective vs non-elective admissions) plays a role in several of the clusters found by the hSBM. We can clearly see this in clusters B and D, in which patients have a low prevalence of elective admissions (roughly 60% lower than average). This difference becomes even more pronounced when considering clusters at the bottom hierarchical level, with clusters B1, D1, and D2 showing little to no elective admissions. This is a level of detail that LCA could not capture.
Another major difference in cluster composition emerges when examining multimorbidity profiles: LCA separates patients based on several groups of diseases, suggesting that multimorbidity is the main separation criterion for this algorithm [12]. Furthermore, Zador et al. show that in all clusters patients have a multimorbidity count, which suggests that only the morbidity type plays an effective role in grouping patients.
Our results are in stark contrast to these. In fact, the hSBM discerns patients based not only on multimorbidity profiles, but also on their combinations with demographics, admission type and, importantly, the number of morbidities patients have (see Fig. 5). Specifically, we see that only two clusters, D and F, include patients with a heterogeneous multimorbidity count, whereas the others include patients with a definite number of morbidities. For instance, a remarkable finding of our approach is that none of the 2,716 patients in cluster A have morbidities. This cluster, in fact, represents younger patients with no long-term conditions who have been admitted to the hospital following some traumatic event, such as a car accident or a stab or gunshot wound. Importantly, this category of patients is expected and of particular interest, but is undetectable by LCA. This is also reflected in the much lower than average mortality and sepsis rates for these patients (see Fig. 4), which are not captured by either LCA or the SOFA and OASIS scores (see Fig. 7).
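The notion of a cluster with a definite versus a heterogeneous multimorbidity count can be made concrete with a short sketch. The code below is illustrative only: the function names are hypothetical, and the toy records are placeholders rather than values from the dataset (apart from the fact, stated above, that patients in cluster A have no morbidities).

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def morbidity_count_distribution(
    records: Iterable[Tuple[str, int]],
) -> Dict[str, Dict[int, float]]:
    """Compute, per cluster, the fraction of patients with each morbidity count.

    Each record is a (cluster_label, number_of_morbidities) pair for one patient.
    """
    counts = defaultdict(Counter)
    for cluster, n_morbidities in records:
        counts[cluster][n_morbidities] += 1
    distribution = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        distribution[cluster] = {n: c / total for n, c in sorted(counter.items())}
    return distribution


def is_homogeneous(fractions: Dict[int, float]) -> bool:
    """A cluster is homogeneous here if all its patients share one count."""
    return len(fractions) == 1


if __name__ == "__main__":
    # Placeholder records for illustration.
    toy_records = [("A", 0), ("A", 0), ("D", 3), ("D", 5), ("D", 5), ("D", 7)]
    dist = morbidity_count_distribution(toy_records)
    for cluster, fractions in dist.items():
        print(cluster, fractions, is_homogeneous(fractions))
```

The per-cluster fractions computed this way correspond to the kind of frequency that a plot of morbidity counts by cluster would encode.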
Fig. 3. Multimorbidity composition of the patient clusters inferred by the hSBM. The lines between clusters at the bottom level group clusters that originate from the same parent cluster at the intermediate level; for example, cluster B splits into clusters B1 and B2. The heatmap shows the relative difference in the prevalence of each morbidity between each cluster and the whole dataset. Although relative changes can be larger, we cap the colormap at 500% to improve readability.
Fig. 4. Heterogeneity in the prevalence of sepsis, mortality, and mortality given sepsis across the clusters we find. These are compared to the respective averages in the whole subset of 38,417 patients we use in our analysis. Especially when considering the fine-grained clusters at the bottom hierarchical level, starting from A1, it is possible to see that some clusters, such as D1, F1, and F3, present a far higher than average prevalence of sepsis and mortality, which is not captured by the OASIS and SOFA scores computed at admission (Fig. 7). Similarly, some of the clusters we uncover, such as clusters A1 to C1, display a high prevalence of patients with a low risk of developing sepsis and dying, which is once again in contrast with the information provided by SOFA and, in particular, OASIS scores.
C. Analysis of Fine-Grained Clusters
In the previous section, we showed how the six clusters obtained at the intermediate hierarchical level display information that could not be retrieved with traditional clustering and LCA. At the lowest hierarchical level, these clusters split into 12 more fine-grained ones, which discriminate even further between complex patient and multimorbidity profiles and provide greater insight into sepsis and mortality in the ICU.
Given the different number of clusters, a direct comparison between the hSBM and LCA at this level would hardly be meaningful. Instead, we provide a detailed analysis of each of these 12 clusters next, focusing on their composition and on the mortality and sepsis rates of the patients they represent. We further discuss the similarities and differences with the LCA clusters in Sec. IV.
Cluster A1 - Young patients without morbidities.
Fig. 5. Multimorbidity count, i.e., the number of co-existing long-term disorders, for each cluster at both the intermediate and bottom hierarchical levels. The circle size is proportional to the frequency of patients who have exactly that number of morbidities. It is interesting to see that, contrary to existing literature, we find several clusters in which the patients have only a few, if any, long-term illnesses. Importantly, the fact that clusters such as A1, B1, C1, etc. are composed of patients who all have the same multimorbidity count suggests that the number of morbidities has a major role in determining the clusters. The exact number seems to matter less, though, when the multimorbidity count becomes higher, as in the case of clusters D1, D3, and F2.
This cluster has a much higher than average prevalence of younger patients, who have no morbidities and expectedly show a far lower prevalence of sepsis (9.68% vs 27.52%) and mortality given sepsis (13.3% vs 21.6%) than average (Fig. 4).
Clusters B1 and B2 - Younger patients with substance abuse issues and non-elective admissions.
These two clusters are also defined by patients under 45 years of age, but compared to A1 they present a higher proportion of patients in the age range 25–44. Besides age, Fig. 3 shows that patients in these two clusters are distinguished by a much lower than average prevalence of elective admissions, and by a prevalence of drug abuse which is 3 and 3.89 times higher than average for B1 and B2, respectively. Similarly, alcohol abuse is present in 22.74% and 21.17% of the patients, whereas the average in the dataset is 8.42%. It is worth noting that the main factor that distinguishes these two otherwise similar clusters is that in cluster B1 all patients have exactly one morbidity each, whereas in B2 they all have exactly two (see Fig. 5). This is reflected in the higher prevalence of AIDS, psychoses and depression that we can see in B2, but also in the fact that patients grouped in B1 have a lower mortality rate after they develop sepsis (Fig. 4). Overall, these two clusters have a low mortality rate, both with and without sepsis, and also a lower than average prevalence of sepsis, which is consistent with the fact that patients in these clusters are younger and have only one or two long-term conditions.
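The comparisons against the dataset average used throughout this section can be reproduced with two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and the relative difference is assumed to be defined as (cluster - overall) / overall, which the text does not spell out. The optional cap mirrors the 500% limit applied to the colormap in Fig. 3.

```python
from typing import Optional


def prevalence_ratio(cluster_pct: float, overall_pct: float) -> float:
    """How many times the cluster prevalence is of the overall prevalence."""
    return cluster_pct / overall_pct


def relative_difference_pct(
    cluster_pct: float, overall_pct: float, cap_pct: Optional[float] = None
) -> float:
    """Relative difference from the overall prevalence, in percent.

    Assumed definition: (cluster - overall) / overall * 100. If cap_pct is
    given, values above it are clipped (only the upper end is capped).
    """
    diff = (cluster_pct - overall_pct) / overall_pct * 100.0
    if cap_pct is not None:
        diff = min(diff, cap_pct)
    return diff


if __name__ == "__main__":
    # Alcohol abuse: 22.74% in one of the two clusters vs 8.42% overall.
    print(round(prevalence_ratio(22.74, 8.42), 2))
    print(round(relative_difference_pct(22.74, 8.42), 1))
```

With the alcohol abuse figures quoted above, 22.74% against an average of 8.42% corresponds to roughly 2.7 times the average, or a relative difference of about 170%.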
Cluster C1 - Elective admissions of middle-aged patients with exactly one morbidity.
This cluster is the one with the highest rate of elective admissions, which is 38.38% higher than average (Fig. 6). From Fig. 5, we can see that patients in this cluster only have one morbidity, but on average, the prevalence of all morbidities is lower than in the whole dataset (Fig. 3). To gain a better idea of who these patients are, we inspect the most common causes of admission. We find that these are mostly patients who are receiving coronary artery bypass graft surgery, which is compatible with the fact that most admissions in this cluster are elective. This might also explain the second lowest prevalence of sepsis among the clusters.
Clusters D1, D2, D3 - Patients with mental and neurological disorders.
Fig. 6. Relative change of gender and admission type prevalence between each cluster and the average among all patients. Gender does not play a major role in clustering, with only two clusters, D1 and D2, having a significantly lower than average prevalence of female patients, and only F1 having a significantly higher prevalence. Conversely, the admission type is more influential. For instance, D1 and D2 have few or no patients with elective admissions, whereas C1 is composed of many patients scheduled for critical surgery.
At the intermediate hierarchical level, cluster D showed a significantly high prevalence of substance abuse, followed by a number of other conditions with higher than average prevalence, such as AIDS, coagulopathy, liver disease and fluid electrolyte disorder, all compatible with substance abuse [35], [36]. However, the split of this cluster into D1, D2, and D3 tells a much more complex story. Clusters D1 and D2 still present the highest prevalence of substance abuse, AIDS, and liver disease among all clusters (see Fig. 3). As with B1 and B2, the main difference between D1 and D2 is the multimorbidity count, which is either 3 or 4 for all patients in D2 but is far higher and more heterogeneous for patients in D1. If analyzed together with the age profiles (patients in D1 are older than those in D2), this suggests that D1 groups those patients who are in the later stage of substance abuse: these patients have developed a number of long-term illnesses that are not as present in their younger counterparts, such as a higher prevalence of liver disease, coagulopathy, peptic ulcer, and weight loss. This is unsurprisingly reflected in their far higher sepsis and mortality rates (Fig. 4).
Interestingly, despite sharing the same parent cluster as D1 and D2, cluster D3 does not show a higher than average prevalence of substance abuse, but instead displays the highest proportion of obese patients (28.22% vs an average of 4.92%) and a number of associated cardiovascular and metabolic conditions, including diabetes with and without complications and pulmonary circulation disorders. Although at first sight D3 may seem very different from D1 and D2, all three clusters share a very similar prevalence of neurological disorders, psychoses, depression, and even paralysis. Further, D3 and D1 are linked by an almost identical multimorbidity count distribution.
Cluster E1 - Elderly patients with multimorbidity and elective admissions.
Fig. 7. Distribution of OASIS and SOFA scores among the patients in our clusters. As discussed in Sec. III-D, we find that the actual rates of sepsis and mortality in our clusters follow different patterns than those suggested by these scores.
The defining characteristics of this cluster are age (with a higher prevalence of elderly patients), elective admissions (29.71% higher than average), and the number of morbidities, either 3 or 4. Apart from this, patients in this cluster show a much lower than average prevalence of substance abuse, AIDS and liver disease, but a slightly higher prevalence of everything else. This cluster is representative of the dataset, so it does not come as a surprise that its sepsis and mortality rates are close to the average.
Clusters F1, F2, F3, and F4 - Patients over 45 with heterogeneous multimorbidity profiles.
At the intermediate level, cluster F shows that its patients are predominantly older, with a peak of patients over 85 years old, and have a number of co-occurring morbidities. By inspecting its ramifications at the bottom level in Fig. 3, we can further divide this category of patients into two groups. Clusters F1 and F3 are mostly composed of patients who do not have a high number of morbidities: between 2 and 4, and exactly 2, respectively. The main difference between these two clusters is the age of the patients, which is strictly over 85 for F1 and between 45 and 84 for F3. Remarkably, this difference allows us to uncover significantly different mortality rates between the two groups: 21.29% for the older patients of F1, compared to 8.75% for the patients of F3, as can be seen in Fig. 4. The other two clusters of this group, F2 and F4, are also fairly similar in terms of multimorbidity profiles and demographics, but differ in the number of morbidities their patients have. In fact, although patients in both clusters have a high multimorbidity count, those who belong to F4 have either 5 or 6, whereas patients in F2 have at least 7 morbidities (Fig. 5). This is clearly reflected in their sepsis prevalence, which is the highest among all clusters at 56.11%, more than double the average of all patients. A similar observation can be made for their mortality rate, which is 21.29%, or 86.5% higher than the average. These observations are to be expected, given both the age of the patients and the fact that they have complex multimorbidity profiles which include a number of cardiovascular diseases [37].
D. Comparison of Sepsis and Mortality Rates with OASIS and SOFA Scores
OASIS and SOFA are commonly used scores for patient risk assessment at the time of admission. OASIS focuses on assessing the risk of mortality, whereas SOFA is used to evaluate the risk of organ dysfunction (and consequently sepsis) [33], [34]. These scores have the merit of being easy to compute, not requiring any laboratory data, and easy to understand. However, they do not take into account multimorbidity, despite it being a key factor in determining the outcome of a patient, as we have seen from our results in Sec. III-C. Our results suggest that these scores alone are not accurate enough to be used effectively in the risk assessment of a patient. Specifically, we investigate whether the relative risk of mortality and sepsis across our clusters, as predicted by OASIS and SOFA, is consistent with the prevalence of these two adverse outcomes that we find. Comparing Fig. 7 with Fig. 4 shows that OASIS and SOFA provide scores which disagree with the actual prevalence of adverse outcomes in our clusters. Two stark examples illustrate this point. The first is cluster D, whose child clusters D1, D2, and D3 have progressively higher OASIS and SOFA scores. However, if we analyze the actual prevalence of mortality and sepsis, we see that the highest values are found in cluster D1. In fact, we find that mortality rates progressively decrease, rather than increase, across these clusters. The second, and perhaps most powerful, example is F2, which is the cluster with the lowest average SOFA score but the highest prevalence of sepsis and the second highest mortality rate across all clusters. These results show that our approach provides significantly different insights into sepsis and mortality rates than common scores, and that it can potentially be used to assess the risk of a patient more accurately.
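One simple way to quantify the kind of disagreement described above is to count how many pairs of clusters are ranked in the same order by the mean score and by the observed outcome rate. The sketch below is an illustrative check, not the analysis performed in this work: the function name is hypothetical, and the cluster labels and numbers in the usage example are placeholders, not values from the paper.

```python
from itertools import combinations
from typing import Dict


def concordant_fraction(
    mean_score: Dict[str, float], observed_rate: Dict[str, float]
) -> float:
    """Fraction of cluster pairs ordered the same way by score and by rate.

    Pairs tied on either quantity are skipped. A value of 1.0 means the
    score ranks clusters exactly as the observed rates do; 0.0 means the
    ranking is fully reversed.
    """
    clusters = sorted(set(mean_score) & set(observed_rate))
    concordant = 0
    comparable = 0
    for a, b in combinations(clusters, 2):
        score_diff = mean_score[a] - mean_score[b]
        rate_diff = observed_rate[a] - observed_rate[b]
        if score_diff == 0 or rate_diff == 0:
            continue
        comparable += 1
        if (score_diff > 0) == (rate_diff > 0):
            concordant += 1
    if comparable == 0:
        raise ValueError("no comparable cluster pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Placeholder values: scores increase while rates decrease.
    scores = {"X1": 1.0, "X2": 2.0, "X3": 3.0}
    rates = {"X1": 0.30, "X2": 0.20, "X3": 0.10}
    print(concordant_fraction(scores, rates))
```

A pattern like the one reported for D1, D2, and D3, where scores rise while mortality falls, would yield a fraction of zero under this measure.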
Discussion
By representing patients' data as a bipartite network, we can map the problem of patient clustering to community detection, and find structure in our data using the hSBM, a non-parametric Bayesian generative model. Thanks to this approach, we are able to unveil complex relationships between multimorbidity and adverse events in the ICU, and find more significant profiles than existing methods. Our results show that our approach has three distinctive advantages over LCA.
First, we are able to retrieve clusters that would otherwise be undetected, such as those with low multimorbidity prevalence. For instance, cluster A1 only includes patients with no pre-existing conditions, who have been admitted to the hospital primarily due to traumatic events. Not surprisingly, this cluster has the lowest sepsis rate of all. Moreover, the two clusters in which patients only have one or two long-term disorders, namely B1 and B2, have the lowest mortality rates. Second, our comparison with LCA shows that the hSBM captures more fine-grained relationships. In fact, although our approach still identifies clusters similar to those reported by Zador et al., revolving around substance abuse, cardiovascular diseases, diabetes, etc., it is also capable of differentiating them into more detailed depictions. This is clear from analyzing, for instance, clusters B and D. They both represent substance abuse clusters but, within cluster B, patients are younger and have a low prevalence of multimorbidity, whereas in cluster D patients are older and have a higher prevalence of disorders commonly associated with drug abuse (Fig. 3). These differences create a significant divide in mortality and sepsis rates, which is not observed in Zador et al. [12]. Third, the hSBM is non-parametric, and it does not even require input on the number of clusters. Thanks to the network representation of the data, the fact that the model is non-parametric ensures that no overfitting occurs and, consequently, that the resulting clusters are completely unbiased, even in the presence of highly unbalanced data. A remarkable consequence of these features is that we find several clusters in which patients display an exceedingly high prevalence of characteristics which have, instead, a particularly low prevalence in the whole dataset, such as being between 16 and 24 years old (2.9%), obesity (4.9%), peptic ulcer (0.82%), and AIDS (0.57%).
A second, equally important, consequence is that, unlike recent work, all our clusters include patients with a largely homogeneous number of long-term disorders [11], [12].
Besides comparison with other clustering methods, we also show that the sepsis and mortality rates found in our clusters largely differ from the predictions made by OASIS and SOFA scores. For this reason, we argue that our results constitute robust evidence that multimorbidity should be included in critical care risk assessment.
Conclusion
There is currently a limited understanding of the co-occurrence of long-term health conditions and their associated health outcomes due to the complex interactions between morbidities and also interactions with other factors such as demographics. Our work shows that hierarchical stochastic block modeling and, more generally, a network representation of patient data offer several intrinsic advantages, such as the elucidation of fine-grained associations, over traditional clustering methods. It is an original contribution to a growing number of research efforts aimed at mapping and identifying disease clusters and understanding adverse health outcomes - in our case sepsis and death in the ICU - for people with complex sets of pre-existing conditions.