For our model, let $n$ be the number of nodes in a hypergraph. Each node is assigned to one of $\bar{\ell}$ groups. We let $z_i$ denote the group assignment of node $i$ and collect these assignments in a vector $z$. As in the dyadic DCSBM, each node $i$ is assigned a parameter $\theta_i$ governing its degree, and we collect these parameters in a vector $\theta \in \mathbb{R}^n$. Let $\mathcal{R}$ represent the set of unordered node tuples, so that each $R \in \mathcal{R}$ is a set of nodes representing the location of a possible hyperedge (following the standard choice for the DCSBM in graphs, we allow $\mathcal{R}$ to include node tuples with repeated nodes). Let $z_R$ denote the vector of cluster labels for nodes in a given tuple $R$, and $\theta_R$ the vector of degree parameters.
We use an affinity function $\Omega$ to control the probability of placing a hyperedge at a given node tuple $R$, which depends on the group memberships of the nodes in $R$. Formally, $\Omega$ maps the group assignments $z_R$ to a non-negative number. If $\Omega(z_R)$ is large, then there is a higher probability that a hyperedge forms between the nodes in $R$. In our model, the number of hyperedges placed at $R \in \mathcal{R}$ is distributed as $a_R \sim \mathrm{Poisson}(c_R\, \pi_R\, \Omega(z_R))$, where $c_R$ denotes the number of distinct ways to order the nodes of $R$ and $\pi_R = \prod_{i \in R} \theta_i$ is the product of degree parameters. The probability of realizing a given value $a_R = m$ is then
$$P(a_R = m) = \frac{\left( c_R\, \pi_R\, \Omega(z_R) \right)^m e^{-c_R \pi_R \Omega(z_R)}}{m!}.$$
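As an illustration of this sampling step, here is a minimal Python sketch. The function names and toy parameters are ours, not the paper's, and exhaustively enumerating $\mathcal{R}$ is feasible only for very small $n$ and $k$:

```python
import math
from itertools import combinations_with_replacement

import numpy as np

rng = np.random.default_rng(0)

def num_orderings(R):
    """Number of distinct ways to order the nodes of the tuple R
    (a multinomial coefficient, accounting for repeated nodes)."""
    c = math.factorial(len(R))
    for node in set(R):
        c //= math.factorial(tuple(R).count(node))
    return c

def sample_hyperedges(theta, z, Omega, k):
    """Draw a_R ~ Poisson(c_R * pi_R * Omega(z_R)) for every size-k tuple R.

    theta : array of degree parameters, one per node
    z     : array of group labels, one per node
    Omega : function mapping a tuple of group labels to a non-negative rate
    """
    n = len(theta)
    A = {}
    for R in combinations_with_replacement(range(n), k):
        pi_R = np.prod([theta[i] for i in R])   # product of degree parameters
        rate = num_orderings(R) * pi_R * Omega(tuple(z[i] for i in R))
        a_R = rng.poisson(rate)
        if a_R > 0:
            A[R] = a_R                          # weighted hyperedge
    return A

# Toy usage: two groups, stronger within-group affinity for 3-edges.
theta = np.full(10, 0.5)
z = np.array([0] * 5 + [1] * 5)
Omega = lambda labels: 0.05 if len(set(labels)) == 1 else 0.001
print(sample_hyperedges(theta, z, Omega, k=3))
```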
This edge generation process has the following intuitive interpretation: for each of the $c_R$ possible orderings of the nodes in $R$, we attempt to place a $\mathrm{Poisson}(\pi_R\, \Omega(z_R))$ number of hyperedges on this tuple. The result is a weighted hyperedge on the unordered tuple $R$, whose weight $a_R$ can be any non-negative integer. This is a helpful modeling feature, as many empirical hypergraphs contain multiple hyperedges between the same set of nodes. Even in hypergraph datasets where we only know the presence or absence of hyperedges (but no weights), the Poisson-based model serves as a computationally convenient approximation to a Bernoulli-based model. The probability of realizing a given hyperedge set $A = (a_R)_{R \in \mathcal{R}}$ is then just the product of these probabilities over each $R \in \mathcal{R}$. In the maximum likelihood framework, we learn estimates $\hat{z}$ of the node labels, $\hat{\Omega}$ of the affinity function, and $\hat{\theta}$ of the degree parameters by solving the optimization problem
$$(\hat{z}, \hat{\Omega}, \hat{\theta}) = \operatorname*{argmax}_{z,\, \Omega,\, \theta}\; P(A \mid z, \Omega, \theta),$$
where $A$ is a given dataset represented by a collection of (integer-weighted) hyperedges. As usual, it is easier to work with the log-likelihood, which has the same local optima. The log-likelihood is
$$\log P(A \mid z, \Omega, \theta) = \mathcal{L}(z, \Omega, \theta) + \sum_{R \in \mathcal{R}} a_R \log \pi_R + \sum_{R \in \mathcal{R}} \left( a_R \log c_R - \log a_R! \right),$$
where
$$\mathcal{L}(z, \Omega, \theta) = \sum_{R \in \mathcal{R}} \left[ a_R \log \Omega(z_R) - c_R\, \pi_R\, \Omega(z_R) \right].$$
The first term $\mathcal{L}(z, \Omega, \theta)$ is the only part of the log-likelihood that depends on the group assignments $z$ and affinity function $\Omega$. The second term depends on $\theta$, while the third term depends only on the data and can be disregarded for inferential purposes. In the coordinate ascent approach to maximum likelihood, we alternate between two stages. In the first stage, we assume a current estimate $\hat{z}$ and obtain new estimates of $\Omega$ and $\theta$ by solving
$$(\hat{\Omega}, \hat{\theta}) = \operatorname*{argmax}_{\Omega,\, \theta}\; \log P(A \mid \hat{z}, \Omega, \theta).$$
The resulting pair $(\hat{\Omega}, \hat{\theta})$ can be viewed as maximum likelihood estimates, conditioned on the current estimate $\hat{z}$ of the label vector $z$. In the second stage, we assume current estimates $\hat{\Omega}$ and $\hat{\theta}$ and obtain a new estimate of $z$ by solving
$$\hat{z} = \operatorname*{argmax}_{z}\; \mathcal{L}(z, \hat{\Omega}, \hat{\theta}).$$
We alternate between these two stages until convergence.
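The overall alternation can be summarized in a short sketch. The two stage routines are left as placeholder callables, since their concrete forms are developed below; this outline assumes labels are stored as numpy arrays:

```python
import numpy as np

def coordinate_ascent(A, z0, fit_affinity_and_degrees, improve_labels,
                      max_iters=20):
    """Alternate the two stages of coordinate-ascent maximum likelihood.

    A                        : observed weighted hyperedges
    z0                       : initial label vector
    fit_affinity_and_degrees : stage 1, conditional MLE of (Omega, theta) given z
    improve_labels           : stage 2, update z given (Omega, theta)
    """
    z = np.asarray(z0)
    for _ in range(max_iters):
        Omega, theta = fit_affinity_and_degrees(A, z)   # stage 1
        z_new = improve_labels(A, Omega, theta, z)      # stage 2
        if np.array_equal(z_new, z):                    # labels stabilized
            break
        z = z_new
    return z, Omega, theta
```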
There are several identifiability issues that must be addressed. First, permuting the group labels in $z$ and $\Omega$ does not alter the value of the likelihood. We therefore impose an arbitrary order on group labels. Second, the number $\bar{\ell}$ of possible groups can, in principle, be larger than the number of groups present in $z$. Such a case would correspond to the presence of groups that are statistically possible but empty in the given data realization. While other treatments are possible, we choose to disregard empty groups and treat $\bar{\ell}$ as equal to the number of distinct labels in an estimate $\hat{z}$ of $z$. A final form of unidentifiability relates to the scales of $\theta$ and $\Omega$. For a fixed $\theta$ and $\Omega$, we can construct $\theta' \neq \theta$ and $\Omega' \neq \Omega$ such that $\mathcal{L}(z, \Omega, \theta) = \mathcal{L}(z, \Omega', \theta')$ (Supplementary Appendix A). To enforce identifiability, we must therefore place a joint normalization condition on either $\theta$ or $\Omega$. We choose to constrain $\theta$ such that
$$\sum_{i} \theta_i\, \delta(z_i, \ell) = \mathrm{vol}(\ell) \quad \text{for each group } \ell,$$
where $\mathrm{vol}(\ell) = \sum_i d_i\, \delta(z_i, \ell)$ and $d_i$ is the (weighted) number of hyperedges in which node $i$ appears. In this expression and below, $\delta$ is an indicator function with value 1 if all its inputs are equal and value 0 otherwise. The usefulness of eq. is that, when $z$ is known or estimated, the conditional maximum likelihood estimates $\hat{\theta}$ and $\hat{\Omega}$ in eq. take simple, closed forms. First, for a fixed label vector $\hat{z}$, when using the normalization above, the maximum likelihood estimate for $\theta$ is (see Supplementary Appendix B)
$$\hat{\theta}_i = d_i.$$
Second, conditioned on $\hat{\theta}$, if $\Omega$ takes a constant value $\omega$ on some set $S$ of unordered tuples of labels, then the maximum likelihood estimate for $\omega$ is (see Supplementary Appendix C)
$$\hat{\omega} = \frac{\sum_{R \in \mathcal{R} \,:\, z_R \in S} a_R}{\sum_{R \in \mathcal{R} \,:\, z_R \in S} c_R\, \pi_R},$$
the ratio of the observed hyperedge weight on tuples with labels in $S$ to its Poisson exposure. In full generality, we can estimate one such $\omega$ for every possible label arrangement in the data. Later, we will make natural restrictions on $\Omega$. Although eq. assumes that $z$ was fixed, it is not necessary to know $z$ to form the estimate $\hat{\theta}$. However, forming the estimate $\hat{\Omega}$ via eq. requires that we know or estimate $z$. It is therefore important to remember that $\hat{\Omega}$ is not a globally optimal estimate, but rather a locally optimal estimate conditioned on the currently estimated group labels. Our results from the previous section imply that the degree parameters and a piecewise constant affinity function can be efficiently estimated in closed form, provided an estimate $\hat{z}$ of $z$. This provides a solution to the first stage of coordinate ascent. We now discuss the second stage. From eq. , it suffices to optimize $\mathcal{L}$ with respect to $z$. To do so, it is helpful to impose some additional structure on $\Omega$.
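A naive Python sketch of these two closed-form estimates follows. The estimator names are ours; the exposure sum in omega_estimate enumerates all of $\mathcal{R}_k$ directly, which is practical only for tiny instances (efficient evaluations are deferred to the supplementary appendices):

```python
import math
from itertools import combinations_with_replacement

import numpy as np

def num_orderings(R):
    """Distinct orderings of the tuple R (a multinomial coefficient)."""
    c = math.factorial(len(R))
    for i in set(R):
        c //= math.factorial(tuple(R).count(i))
    return c

def degree_estimate(A, n):
    """theta_hat[i] = d_i, the weighted number of hyperedges containing
    node i; this estimate does not require knowledge of the labels z."""
    d = np.zeros(n)
    for R, a_R in A.items():
        for i in set(R):
            d[i] += a_R
    return d

def omega_estimate(A, z, theta, S, k, n):
    """Conditional MLE of a constant affinity value omega on a set S of
    sorted label tuples of size k: observed weight over Poisson exposure."""
    label_key = lambda R: tuple(sorted(z[i] for i in R))
    observed = sum(a for R, a in A.items()
                   if len(R) == k and label_key(R) in S)
    exposure = sum(num_orderings(R) * np.prod([theta[i] for i in R])
                   for R in combinations_with_replacement(range(n), k)
                   if label_key(R) in S)
    return observed / exposure
```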
We now define generalized cuts and volumes corresponding to a possible partition vector $p$ for tuples of $k$ nodes:
$$\mathrm{cut}_p(z) = \sum_{R \in \mathcal{R}_k} a_R\, \delta(p(z_R), p), \qquad \mathrm{vol}_p(z) = \sum_{\ell = (\ell_1, \ldots, \ell_k) \,:\, p(\ell) = p}\; \prod_{j=1}^{k} \mathrm{vol}(\ell_j),$$
where $\mathcal{R}_k$ is the subset of tuples in $\mathcal{R}$ consisting of $k$ nodes and $p(\ell)$ denotes the partition induced by a label vector $\ell$. The function $\mathrm{cut}_p(z)$ counts the number of edges that are split by $z$ into the specified partition $p$, while the function $\mathrm{vol}_p(z)$ is a sum-product of volumes over all grouping vectors that induce partition $p$. Let $\mathcal{P}$ be the set of partition vectors on sets up to size $\bar{k}$, the maximum size of hyperedges. We show in Supplementary Appendix D that the symmetric modularity objective can then be written as
$$\mathcal{L}(z, \Omega, \hat{\theta}) = \sum_{p \in \mathcal{P}} \left[ \mathrm{cut}_p(z) \log \Omega(p) - \mathrm{vol}_p(z)\, \Omega(p) \right] + C,$$
where $C$ does not depend on $z$. For a partition vector $p$ for tuples of $k$ nodes, direct calculation of $\mathrm{vol}_p(z)$ is a summation over as many as $\bar{\ell}^k$ terms, which can be impractical when either $\bar{\ell}$ or $k$ is large. We show in Supplementary Appendix E, however, that it is possible to efficiently evaluate these sums via a combinatorial identity. We also give a formula for updating volume terms $\mathrm{vol}_p(z)$ when a candidate labeling is modified. An important subtlety has been recently raised () concerning the relationship between blockmodels and modularity derived in (). This consideration also applies to our derivation of eq. above and eq. below. We derived the conditional maximum likelihood estimates $\hat{\theta}$ and $\hat{\Omega}$ of $\theta$ and $\Omega$ under the assumption of a general, unconstrained affinity function $\Omega$. It is not guaranteed that these same estimates maximize the likelihood when additional constraints are imposed, such as the symmetry constraint $\Omega(z_R) = \Omega(\phi(z_R))$ for any permutation $\phi$ of the labels. In the case of dyadic graphs, the eqs. for estimating $\hat{\theta}$ and $\hat{\Omega}$ are only exact under the symmetry assumption on $\Omega$ when $\mathrm{vol}(\ell)$ is constant across groups $\ell$. When the sizes of groups vary, as is typical in most datasets, these eqs. are instead approximations of the exact conditional maximum likelihood estimates. The situation is reminiscent of the tendency of the graph modularity objective to generate clusters of approximately equal sizes (). The objectives and algorithms that we develop below should therefore be understood as approximations to coordinate-ascent maximum likelihood inference, which are exact only in the case that all clusters have equal volumes.
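For concreteness, a naive evaluation of these cut and volume terms might look as follows. This is illustrative code of ours, using direct enumeration rather than the combinatorial identity of Supplementary Appendix E:

```python
from collections import Counter
from itertools import product

import numpy as np

def partition_of(labels):
    """Partition vector induced by a label tuple, e.g. (1, 1, 2) -> (2, 1)."""
    return tuple(sorted(Counter(labels).values(), reverse=True))

def cut_p(A, z, p, k):
    """Total weight of size-k edges whose labels induce the partition p."""
    return sum(a for R, a in A.items()
               if len(R) == k and partition_of([z[i] for i in R]) == p)

def vol_p(z, d, p, k):
    """Sum-product of group volumes over all label vectors inducing p.
    Naive enumeration over all label vectors: O(ell_bar^k) terms."""
    groups = sorted(set(z))
    vol = {l: sum(d[i] for i in range(len(z)) if z[i] == l) for l in groups}
    return sum(np.prod([vol[l] for l in ell])
               for ell in product(groups, repeat=k)
               if partition_of(ell) == p)
```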
Inserting the AON affinity function into eq. yields, after some algebra (Supplementary Appendix F), the objective
$$Q(z; \beta, \gamma) = \sum_{k} \beta_k \left[ m_k(z) - \gamma_k \sum_{\ell} \mathrm{vol}(\ell)^k \right] + C(\omega, A),$$
where $\beta_k = \log \omega_k(1) - \log \omega_k(0)$, $\gamma_k = (\omega_k(1) - \omega_k(0))/\beta_k$, and $C(\omega, A)$ collects terms that do not depend on the partition $z$. We collect $\{\beta_k\}$ and $\{\gamma_k\}$ into vectors $\beta$ and $\gamma$. We have also defined
$$m_k(z) = \sum_{R \in \mathcal{R}_k} a_R\, \delta(z_R),$$
the total weight of size-$k$ hyperedges contained entirely within a single cluster. Recently, the authors of () proposed a "strict modularity" objective for hypergraphs. This strict modularity is a special case of eq. , obtained by choosing $\omega_k(1)$ and $\omega_k(0)$ such that $\beta_k = 1$ and $\gamma_k = m_k/\mathrm{vol}(H)^k$, where $m_k$ is the total weight of size-$k$ hyperedges and $\mathrm{vol}(H)$ is the sum of all node degrees in hypergraph $H$. However, leaving these parameters free lends important flexibility to our proposed AON objective eq. . Tuning $\beta$ allows one to specify which hyperedge sizes are considered to be most relevant for clustering. In email communications, for example, a very large list of recipients may carry minimal information about social relationships, and it may be desirable to down-weight large hyperedges. Tuning $\gamma$ has the effect of modifying the sizes of clusters favored by the objective, in a direct generalization of the resolution parameter in dyadic modularity (, ). It is not necessary to specify the values of these parameters a priori; instead, they can be adaptively estimated via eq. .
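Given estimates of $\beta$ and $\gamma$, the objective above is cheap to evaluate directly. A minimal sketch of ours, assuming hyperedges are stored as a dict from node tuples to weights and degrees in an array d:

```python
def aon_objective(A, z, d, beta, gamma):
    """Evaluate the AON objective, up to the constant C(omega, A):
    sum_k beta_k * ( m_k(z) - gamma_k * sum_l vol(l)^k ).
    beta and gamma are dicts keyed by edge size k."""
    groups = set(z)
    vol = {l: sum(d[i] for i in range(len(z)) if z[i] == l) for l in groups}
    Q = 0.0
    for k in beta:
        # m_k(z): total weight of size-k edges contained in a single cluster
        m_k = sum(a for R, a in A.items()
                  if len(R) == k and len({z[i] for i in R}) == 1)
        Q += beta[k] * (m_k - gamma[k] * sum(v ** k for v in vol.values()))
    return Q
```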
Another approach to influencing the number of clusters is to impose a Bayesian prior on the community labels. In the simplest version of a Bayesian approach, one assumes that each node is independently assigned one of $\bar{\ell}$ labels with equal probability, before sampling edges. The probability of realizing a given label vector $z$ is then $\bar{\ell}^{-n}$, which generates a term of the form $-n \log \bar{\ell}$ in the log-likelihood $\mathcal{L}$. This term may then be incorporated into Louvain implementations, with the result that greedy moves that reduce the total number of clusters are strongly incentivized. The resulting regularized algorithm tends to return label vectors with slightly smaller numbers of distinct clusters. This can be useful when it is known a priori that the true number of clusters in the data is small. Our implementation of AON HMLL incorporates this optional regularization term. We use this term only in the synthetic detectability experiments presented below.
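A minimal sketch of this regularization, under our convention that $\bar{\ell}$ is the number of distinct labels in the current estimate:

```python
import math

def regularized_objective(Q, z):
    """Add the label-prior term log P(z) = -n * log(ell_bar), where ell_bar
    is taken to be the number of distinct labels currently in z."""
    n = len(z)
    ell_bar = len(set(z))
    return Q - n * math.log(ell_bar)
```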
Dyadic Louvain algorithms are known for being highly efficient on large graphs. Here, we show that AON HMLL can achieve performance on synthetic data similar to graph MLL (GMLL), a variant of the standard dyadic Louvain algorithm in which we return the combination of resolution parameter and partition that yields the highest dyadic likelihood. For a fixed number of nodes $n$, we consider a DCHSBM-like hypergraph model with planted clusters and $m = 10n$ hyperedges with sizes uniformly distributed between 2 and 4. Each $k$-edge is, with probability $p_k$, placed uniformly at random on any $k$ nodes within the same cluster. Otherwise, with probability $1 - p_k$, the edge is instead placed uniformly at random on any set of $k$ nodes. We use this model rather than a direct DCHSBM to avoid the computational burden of sampling edges at each $k$-tuple of nodes, which is prohibitive when $n$ is large. For the purpose of performance testing, we compute estimates of the parameter vectors $\hat{\beta}$ and $\hat{\gamma}$ (in the case of AON HMLL) and the resolution parameter $\gamma$ (in the case of GMLL) using ground truth cluster labels. We emphasize that this is typically not possible in practical applications, because the ground truth labels are not known. We make this choice to focus on a direct comparison of runtimes of each algorithm in a situation in which both can succeed. In later sections, we study the ability of HMLL and GMLL to recover known groups in synthetic and empirical data when affinities and resolution parameters are not known.
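A sketch of this benchmark generator is below; the helper name and the parameter values in the usage line are ours and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_benchmark(n, n_clusters, m, p_within, sizes=(2, 3, 4)):
    """Planted-partition benchmark: each hyperedge draws a size k uniformly
    from `sizes`; with probability p_within[k] its k nodes are drawn (with
    replacement) from one uniformly random cluster, otherwise from all nodes."""
    z = rng.integers(n_clusters, size=n)
    members = {c: np.flatnonzero(z == c) for c in range(n_clusters)}
    edges = []
    for _ in range(m):
        k = int(rng.choice(sizes))
        if rng.random() < p_within[k]:
            c = int(rng.integers(n_clusters))
            nodes = rng.choice(members[c], size=k)   # within-cluster edge
        else:
            nodes = rng.choice(n, size=k)            # uniform over all nodes
        edges.append(tuple(sorted(int(i) for i in nodes)))
    return z, edges

# Toy usage (parameter values are illustrative only):
z, E = sample_benchmark(500, 5, 5000, {2: 0.6, 3: 0.05, 4: 0.05})
```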
The figure shows runtime, adjusted Rand index (ARI), and number of clusters returned on synthetic hypergraphs when $p_2 = 0.6$ and $p_3$ and $p_4$ both scale as $1/n$. These parameters are chosen so that hyperedges of sizes three and four are rarely (if ever) contained completely inside clusters. Thus, hyperedges of different sizes provide different signal regarding ground truth clusters. For this experiment, we implemented GMLL by computing a normalized clique projection, in which nodes $i$ and $j$ are joined by weighted dyadic edges with weights
$$w_{ij} = \sum_{e \,:\, i, j \in e} \frac{1}{|e| - 1}.$$
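A small sketch of this projection; setting normalized=False gives the unnormalized variant described next:

```python
from collections import defaultdict
from itertools import combinations

def clique_projection(edges, normalized=True):
    """Project a hypergraph onto a weighted dyadic graph. The normalized
    variant weights each pair in a size-k edge by 1/(k - 1), so that each
    node's dyadic degree equals its hypergraph degree."""
    w = defaultdict(float)
    for e in edges:
        nodes = sorted(set(e))
        k = len(nodes)
        if k < 2:
            continue
        for i, j in combinations(nodes, 2):
            w[(i, j)] += 1.0 / (k - 1) if normalized else 1.0
    return w
```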
We also performed experiments on an unnormalized clique projection with $w_{ij} = |\{e : i, j \in e\}|$ but do not show these results because, in this experiment, the associated MLL algorithm consistently fails to recover labels correlated with the planted clusters. Informally, an algorithm is able to detect communities in a random graph model with fixed labels $z$ when the output labeling of that algorithm is, with probability bounded above zero, correlated with $z$. Using arguments from statistical physics, the authors of () conjectured the existence of a regime in the graph SBM in which no algorithm can successfully detect communities. This conjecture has since been refined and proven in various special cases; see () for a survey. In the dyadic SBM with two equal-sized planted communities, a necessary condition for detectability in the large-graph limit is
$$(c_{\mathrm{in}} - c_{\mathrm{out}})^2 > 2\,(c_{\mathrm{in}} + c_{\mathrm{out}}),$$
where $c_{\mathrm{in}}$ is the mean number of within-cluster edges attached to a node and $c_{\mathrm{out}}$ is the mean number of between-cluster edges attached to a node. If this condition is not satisfied, then no algorithm can reliably detect communities in the associated graph SBM, although the communities are statistically distinct. This bound limits direct inferential methods, such as Bayesian or maximum likelihood techniques, and methods based on maximization of modularity or other graph objectives (). Several recent papers have considered the detectability problem in the case of uniform hypergraphs (, , ). In our model, the presence of edges of multiple sizes complicates analysis. Here, we limit ourselves to an experimental demonstration that the regimes of detectability for the graph SBM and our DCHSBM can differ significantly.

[Figure caption: For $k = 2, 3$, $p_k$ is the proportion of within-cluster edges of size $k$. Each pixel gives the mean ARI over 20 independently generated DCHSBMs of size $n = 500$, where each node is incident to, on average, five 2-edges and five 3-edges. (Left) The recovered partition is obtained from GMLL. (Right) The recovered partition is obtained from AON HMLL (algorithm S1). The dashed and dotted lines give various detectability thresholds as described in the main text. In each panel, the returned partition is the highest-likelihood partition from 20 alternations between updating $\hat{z}$ and inference of the affinity parameters. In this experiment only, we incorporate a regularization term in the modularity objective to promote label vectors with fewer clusters.]
In this experiment only, both GMLL and the AON HMLL variant discussed below use the Bayesian regularization term in the likelihood objective to encourage each algorithm to form a relatively small number of clusters. This choice reflects the fact that the true number of clusters is known and is equal to 2. In the left-hand panel, we show the ARI of the returned partition against the true partition when using the unnormalized variant of GMLL (results for the normalized variant are similar). The dashed and dotted white lines give the boundaries at which eq. holds with equality. The dashed white line gives the detectability threshold for the assortative regime, in which nodes are more likely to link with others in the same cluster. Louvain, as an agglomerative algorithm, is well suited for detecting assortative clusterings and is able to detect communities in much, but not all, of this regime. The gap between the theoretical threshold and the performance of Louvain reflects the fact that Louvain, as a stagewise greedy algorithm, has no optimality guarantees. There is also a disassortative detectable region below the dotted white line. The agglomerative structure of graph-based Louvain causes the algorithm to fail entirely here.
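As a concrete numerical check of the dyadic threshold above, the following snippet classifies example parameter pairs; the values are illustrative only:

```python
def detectable(c_in, c_out):
    """Two-block SBM threshold: (c_in - c_out)^2 > 2 * (c_in + c_out)."""
    return (c_in - c_out) ** 2 > 2 * (c_in + c_out)

# With 7 within- and 3 between-cluster edges per node on average:
# (7 - 3)^2 = 16 <= 2 * (7 + 3) = 20, so the partition is undetectable
# even though the two blocks are statistically distinct.
print(detectable(7, 3))   # False
# A slightly stronger signal crosses the threshold: (8 - 2)^2 = 36 > 20.
print(detectable(8, 2))   # True
```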
Shown are the number of nodes $n$, number of hyperedges $m$, mean degree $\langle d \rangle$, SD of degree, mean edge size $\langle k \rangle$, SD of edge size, and number of data labels.
It is often stated that higher-order features are important for understanding the structure and function of complex networks. It is less often clarified what kinds of higher-order features are relevant for which networks. Generative modeling provides one way to compare different kinds of higher-order structure. In the DCHSBM, this structure is specified by the affinity function $\Omega$. Comparison of the likelihoods obtained under each affinity can indicate which one is most plausible as a higher-order generative mechanism for the underlying data. We performed such a comparison using the symmetric affinity functions and node labels described above. In this setup, we can compute an approximate ML estimate for $\Omega$, given its functional form. To make concrete comparisons, it is necessary to specify the functional forms of the GN, RP, and Pairwise affinities. We use the following parameterizations.
The GN affinity function assigns a separate parameter to each combination of edge size and number of groups represented within the edge. The RP affinity function assigns one parameter to the case that the difference between the sizes of the largest and second-largest groups within an edge exceeds $k/4$, where $k$ is the size of the edge, and another parameter otherwise. The Pairwise affinity function assigns one parameter to the case that the total number of dyadic pairs in differing groups exceeds half the possible number of such pairs, and another parameter otherwise. RP, which favors edges in which the two most common labels are roughly balanced in representation, is qualitatively distinct from AON, GN, and Pairwise, all of which favor edges with homogeneous cluster labels. Because these affinity functions have different numbers of parameters, we compare them via the Bayesian information criterion (BIC) (), which penalizes affinity functions with more parameters than are supported by the data. In computing the BIC, we exclude the parameters $\theta$, as these are the same in each model and therefore contribute an unimportant additive constant. The AON, RP, and Pairwise affinities each have two parameters per edge size. In the case of GN, we compute the number of possible parameters for each edge size by computing the number of possible group counts using the number of distinct labels in the given partition. For example, if the given partition contained only three distinct groups, then we do not posit parameters corresponding to edges containing more than three groups. It would also be reasonable to remove this restriction, in which case there would be $k$ parameters for edges of size $k$ regardless of the number of groups in the partition.
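A sketch of this model comparison step, with entirely hypothetical numbers of ours standing in for the fitted log-likelihoods and parameter counts:

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical fits: (maximized log-likelihood, number of parameters).
fits = {"AON": (-10250.0, 8), "GN": (-10240.0, 30),
        "RP": (-10400.0, 8), "Pairwise": (-10380.0, 8)}
n_obs = 25_000  # e.g., the number of observed hyperedges
scores = {name: bic(ll, p, n_obs) for name, (ll, p) in fits.items()}
# "AON": GN fits marginally better but is penalized for its extra parameters.
print(min(scores, key=scores.get))
```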
We can obtain some qualitative insight into the behavior of HMLL by studying the structure of the inferred affinity function $\hat{\Omega}$. The most intuitive way to do so is through the derived parameters $\hat{\beta}$ and $\hat{\gamma}$ from eq. . The bottom row of the figure shows these parameters and the distribution of edge sizes. The dependence of $\hat{\beta}_k$ on edge size $k$ provides one explanation of why GMLL succeeds on contact-primary-school but makes several errors on contact-high-school. Under the standard dyadic projection, a $k$-hyperedge generates $\binom{k}{2}$ 2-edges and therefore appears in the dyadic modularity objective $\binom{k}{2}$ distinct times. In the case of contact-primary-school, the estimated importance parameter $\hat{\beta}_k$ is indeed relatively close to $\binom{k}{2}$ (bottom center). At the optimal partition, the relative weights of edges are therefore distorted relatively little by the clique projection. On the other hand, the estimates for $\hat{\beta}_k$ in contact-high-school deviate considerably from $\binom{k}{2}$, especially for $k = 2, 3$. Here, small edges feature much more prominently in the polyadic modularity objective than they do in the projected dyadic objective, implying that the latter is a poorer approximation to the former near the optimal partition. This difference may explain the small errors made by GMLL on contact-high-school. The bottom-right panel compares the inferred value of the size-specific resolution parameter $\hat{\gamma}_k$ to $\gamma_k = m_k/\mathrm{vol}(H)^k$, the implicit value used in (). The inferred resolution parameters are consistently larger than this implicit value and increase with $k$, highlighting the value of adaptively estimating these parameters in our approach.
In this experiment, we study the ability of AON HMLL to recover ground truth communities in several more of our study datasets. Unlike the two contact networks, each of these datasets contains edges of size up to 25 nodes. We have excluded house-committees and senate-committees on the grounds that these datasets are disassortative, indicating that AON is clearly inappropriate. We compare AON HMLL to two variants of GMLL. In the unnormalized variant, we obtain a dyadic graph by replacing each $k$-edge with a $k$-clique, thus generating a total of $\binom{k}{2}$ dyadic edges. In the normalized variant, we weight each edge in the $k$-clique by a factor of $1/(k - 1)$. The normalized dyadic degree of each node is then equal to its degree in the original hypergraph. In either case, we then alternate between the dyadic Louvain algorithm for estimating clusters and conditional maximum likelihood inference of the resolution parameter $\gamma$. In each trial, we perform 20 iterations of AON HMLL and the two GMLL variants, returning from these the combination of group labels and parameters that achieves the highest likelihood. We then compare the clustering to the ground truth labels via the ARI. We vary the maximum edge size $\bar{k}$ to show how each algorithm responds to the incorporation of progressively larger edges. Because extreme sparsity poses issues for community detection algorithms in general (), we show experiments for progressively denser cores of trivago-clicks and walmart-purchases. The $c$-core of a hypergraph is defined as the largest subhypergraph in which all nodes have degree at least $c$.

[Figure caption: Points give the ARI of the highest-likelihood partition obtained after 20 alternations between partitioning and parameter estimation. The maximum edge size $\bar{k}$ varies along the horizontal axis. In the panel titles, $n$ is the number of nodes and $m$ the number of edges. Note that the vertical axis limits vary between panels.]
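A simple peeling routine suffices to extract such cores; this sketch is ours and is not tuned for large datasets:

```python
from collections import Counter

def core_edges(edges, c):
    """Iteratively remove nodes of degree < c, keeping only edges whose
    nodes all survive; the fixed point is the edge set of the c-core."""
    edges = list(edges)
    while True:
        deg = Counter(i for e in edges for i in set(e))
        keep = {i for i, d in deg.items() if d >= c}
        pruned = [e for e in edges if all(i in keep for i in e)]
        if len(pruned) == len(edges):
            return pruned
        edges = pruned
```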
The results highlight the strong dependence of the performance of AON HMLL on the relative plausibility of the AON affinity function as a generative mechanism for the data (cf. ). In trivago-clicks, the AON affinity function achieved the lowest BIC of all four candidates. Because AON is a more plausible generating mechanism by this metric, it is unsurprising that AON HMLL is able to find partitions considerably more correlated with the supplied data labels than those returned by the dyadic variants. In walmart-purchases, on the other hand, the Pairwise affinity is preferred to AON. In this case, AON HMLL performs much worse and, in the 2-core, even returns clusters that are anticorrelated with the supplied labels. As weakly connected nodes are removed and the resulting data become denser, HMLL begins to return correlated clusters. However, the normalized GMLL variant is at least as effective in recovering the data labels. In the two congressional-bills datasets, the Pairwise affinity achieves a lower BIC than AON in the House and a comparable one in the Senate. Echoing this finding, a dyadic method outperforms AON HMLL in each of these cases: unnormalized GMLL performs best on house-bills and senate-bills, while normalized GMLL is preferable on walmart-purchases. That said, HMLL is the worst algorithm only in the case of the 2-core of walmart-purchases for small $\bar{k}$. HMLL may therefore be the algorithm of choice in cases when it is not known whether normalized or unnormalized dyadic representations are more appropriate for the data.
Article Information
vol. 7 no. 28
- Received for publication February 17, 2021
- Accepted for publication May 24, 2021
Author Information
- 1Department of Mathematics, University of California, Los Angeles, 520 Portola Plaza, Los Angeles, CA 90095, USA.
- 2Center for Applied Mathematics, Cornell University, 657 Frank H.T. Rhodes Hall, Ithaca, NY 14853, USA.
- 3Department of Computer Science, Cornell University, 413B Gates Hall, Ithaca, NY 14853, USA.
- *Corresponding author. Email: phil@math.ucla.edu