Block modelling in dynamic networks with non-homogeneous Poisson processes and exact ICL
- First Online: 02 August 2016
- DOI: 10.1007/s13278-016-0368-3
- Cite this article as: Corneli, M., Latouche, P. & Rossi, F. Soc. Netw. Anal. Min. (2016) 6: 55.
Abstract
We develop a model in which interactions between nodes of a dynamic network are counted by non-homogeneous Poisson processes. In a block modelling perspective, nodes belong to hidden clusters (whose number is unknown) and the intensity functions of the counting processes only depend on the clusters of nodes. In order to make inference tractable, we move to discrete time by partitioning the entire time horizon, over which interactions are observed, into fixed-length time sub-intervals. First, we derive an exact integrated classification likelihood criterion and maximize it relying on a greedy search approach. This allows us to estimate cluster memberships and the number of clusters simultaneously. Then, a maximum likelihood estimator is developed to estimate the integrated intensities nonparametrically. We discuss the over-fitting problems of the model and propose a regularized version that solves these issues. Experiments on real and simulated data are carried out in order to assess the proposed methodology.
Keywords
Dynamic network · Stochastic block model · Exact ICL · Non-homogeneous Poisson process

1 Introduction
Graph clustering (Schaeffer 2007) is probably one of the main exploratory tools used in network analysis as it provides data analysts with a high level summarized view of complex networks. One of the main paradigms for graph clustering is community search (Fortunato 2010): a community is a subset of nodes in a graph that are densely connected and have relatively few connections to nodes outside of the community. While this paradigm is very successful in many applications, it suffers from a main limitation: it cannot be used to detect other important structures that arise in graphs, such as bipartite structures, hubs, authorities and other patterns.
The alternative solution favoured in this paper is provided by block models (Lorrain and White 1971; White et al. 1976): in such a model, a cluster consists of nodes that share the same connectivity patterns to other clusters, regardless of the pattern itself (community, hub, bipartite, etc.). A popular probabilistic view on block models is provided by the stochastic block model (SBM, Holland et al. 1983; Wang and Wong 1987). The main idea is to assume that a hidden random variable is attached to each node. This variable contains the cluster membership information, while connection probabilities between clusters are handled by the parameters of the model. The reader is referred to Goldenberg et al. (2009) for a survey of probabilistic models for graphs and to Wasserman and Faust (1994), Ch. 16, for an overview of stochastic block models.
This paper focuses on dynamic graphs in the following sense: we assume that nodes of the graph are fixed and that interactions between them are directed and take place at a specific instant. In other words, we consider a directed multigraph (two nodes can be connected by more than one edge) in which each directed edge is labelled with an occurrence time. We are interested in extending the SBM to this type of graphs. More precisely, the proposed model is based on a counting process point of view of the interactions between nodes: we assume that the number of interactions between two nodes follows a non-homogeneous Poisson counting process (NHPP). As in a standard SBM, nodes are assumed to belong to clusters that do not change over time; thus, the temporal aspect is handled only via the non-homogeneity of the counting processes. Then, the block model hypothesis takes the following form: the intensity of the NHPP that counts interactions between two nodes depends only on the clusters of the nodes. In order to obtain a tractable inference, a segmentation of the time interval under study is introduced and the interactions are aggregated over the sub-intervals of the partition. Following Côme and Latouche (2015), the model is adjusted to the data via the maximization of the integrated classification likelihood (ICL; Biernacki et al. 2000) in an exact form. As in Côme and Latouche (2015) (and Wyse et al. (2014) for latent block models), the maximization is done via a greedy search. This allows us to choose automatically the number of clusters in the block model.
When the number of sub-intervals is large, the model can suffer from a form of over-fitting as the ICL penalizes only a large number of clusters. Therefore, we introduce a variant, based on the model developed in Corneli et al. (2015), in which sub-intervals are clustered into classes of homogeneous intensities. Those clusters are accounted for in a new version of the ICL which prevents over-fitting.
This paper is structured as follows: in Sect. 2, we mention works related to the approach we propose, Sect. 3 presents the proposed temporal extension of the SBM and Sect. 4 derives the exact ICL for this model and presents the greedy search algorithm used to maximize the ICL. Section 5 gathers experimental results on simulated data and on real-world data. Section 6 concludes the paper.
2 Related works
Numerous extensions of the original SBM have already been proposed to deal with dynamic graphs. In this context, both node memberships to clusters and interactions between nodes can be seen as stochastic processes. In Yang et al. (2011), for instance, the authors introduce a Markov chain to obtain the cluster of a node at time t given its cluster at time t-1. Xu and Hero (2013) as well as Xing et al. (2010) used a state space model to describe temporal changes at the level of the connectivity pattern. In the latter, the authors developed a method to retrieve overlapping clusters through time. In general, the proposed temporal variations of the SBM share a similar approach: the data set consists of a sequence of graphs rather than the more general structure we assume. Some papers remove these assumptions by considering continuous time models in which edges occur at specific instants (for instance, when someone sends an email). This is the case of, e.g. Dubois et al. (2013) and of Guigourès et al. (2012, 2015). A temporal stochastic block model, related to the one presented in this paper, was independently developed by Matias et al. (2015). They assume that nodes in a network belong to clusters whose composition does not change over time and that interactions are counted by a non-homogeneous Poisson process whose intensity only depends on the node clusters. In order to estimate (nonparametrically) the instantaneous intensity functions of the Poisson processes, they develop a variational EM algorithm to maximize an approximation of the likelihood.
3 The model
We consider a fixed set of N nodes, \{1,\ldots ,N\}, that can interact as often as desired during the time interval [0, T]. Interactions are directed from one node to another and are assumed to be instantaneous.1 A natural mathematical model for this type of interaction is provided by counting processes on [0, T]. Indeed, a counting process is a stochastic process whose values are non-negative integers increasing through time: the value at time t can be seen as the number of interactions that took place from 0 to t. Then, the classical adjacency matrix (X_{ij})_{1\le i, j\le N} of static graphs is replaced by an N\times N collection of counting processes, (X_{ij}(t))_{1\le i, j\le N}, where X_{ij}(t) is the counting process that gives the number of interactions from node i to node j. We still call {\mathbf {X}}=(X_{ij}(t))_{1\le i, j\le N} the adjacency matrix of this dynamic graph.
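The counting-process view can be made concrete with a short simulation. The sketch below is ours, not taken from the paper: it samples the event times of one process X_{ij} on [0, T] via standard Lewis–Shedler thinning, with `intensity`, `t_max` and `lambda_max` as illustrative names (`lambda_max` must upper-bound the intensity).

```python
import math
import random

def simulate_nhpp(intensity, t_max, lambda_max, seed=0):
    """Sample event times of a non-homogeneous Poisson process on [0, t_max]
    by thinning: propose arrivals from a homogeneous process of rate
    lambda_max, then accept each with probability intensity(t) / lambda_max."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        t += rng.expovariate(lambda_max)      # next candidate arrival
        if t > t_max:
            return events
        if rng.random() < intensity(t) / lambda_max:
            events.append(t)

def count_up_to(events, t):
    """X(t): the number of events that occurred in [0, t]."""
    return sum(1 for s in events if s <= t)

# Example: a sinusoidal intensity, bounded above by 3 on [0, 10].
events = simulate_nhpp(lambda t: 2.0 + math.cos(2.0 * math.pi * t), 10.0, 3.0)
```

The counting process itself is then recovered by `count_up_to(events, t)` for any t in [0, T].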
We introduce in this section a generative model for adjacency matrices of dynamical graphs that is inspired by the classical stochastic block model (SBM).
3.1 Non-homogeneous Poisson counting process
3.2 Block modelling
The main idea of the SBM (Holland et al. 1983; Wang and Wong 1987) is to assume that nodes have some (hidden) characteristics that solely explain their interactions, in a stochastic sense. In our context, this means that rather than having pairwise intensity functions \lambda _{ij}, those functions are shared by nodes that have the same characteristics.
In a second step, we assume that given {\mathbf {z}}, the counting processes X_{ij}(t) are independent and, in addition, that the intensity function \lambda _{ij} depends only on z_i and z_j. To keep the notation light, we denote \lambda _{z_iz_j} the common intensity function and we will not use directly the pairwise intensity functions \lambda _{ij}. We denote \varvec{\lambda } the matrix-valued intensity function \varvec{\lambda }=(\lambda _{kg}(t))_{1\le k,g \le K}.
3.3 Discrete time version
Considering the network as a whole, we can introduce two tensors of order 3: Y is an N\times N\times U random tensor whose element (i, j, u) is the random variable Y_{ij}^{I_u}, and \pi is the K \times K \times U tensor whose element (k, g, u) is \pi _{kg}^{I_u}. Y can be seen as an aggregated (or discrete time) version of the adjacency process {\mathbf {X}}, while \pi can be seen as a summary of the matrix-valued intensity function \varvec{\lambda }.
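The aggregation step can be sketched as follows (our illustration, with hypothetical names): given the event times of one pair (i, j), the discrete-time counts Y_{ij}^{I_u} are simply the numbers of events falling in each of the U equal-length sub-intervals partitioning [0, T].

```python
import bisect

def aggregate(event_times, T, U):
    """Aggregate the event times of one directed pair (i, j) into counts over
    U equal-length sub-intervals I_1, ..., I_U partitioning [0, T]. Returns a
    list y with y[u] = Y_ij^{I_{u+1}}, i.e. an increment of the process X_ij."""
    bounds = [T * u / U for u in range(U + 1)]
    times = sorted(event_times)
    return [bisect.bisect_right(times, bounds[u + 1])
            - bisect.bisect_right(times, bounds[u])
            for u in range(U)]

# Three interactions on [0, 10], split into U = 5 sub-intervals of length 2.
aggregate([0.5, 1.7, 6.1], 10.0, 5)  # → [2, 0, 0, 1, 0]
```

Stacking these lists over all ordered pairs (i, j) yields the N × N × U tensor Y.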
3.4 A constrained version
As will be shown in Sect. 4.4, the model presented thus far is prone to over-fitting when the number of sub-intervals U is large compared to N. Additional constraints on the intensity functions \{\varLambda _{kg}(t)\}_{k,g \le K} are needed in this situation.
Remark 1
The introduction of this hidden vector {\mathbf {y}} is not the only way to impose regularity constraints on the integrated function \varLambda _{kg}(t). For example, a segmentation constraint could be imposed by forcing each temporal cluster to contain only adjacent time intervals.
3.4.1 Summary
- Model A
the model has two meta-parameters, K the number of clusters and \varvec{\omega } the parameters of a multinomial distribution on \{1,\ldots ,K\}. The hidden variable {\mathbf {z}} is generated by the multivariate multinomial distribution of Eq. (2). Then, the model has a K\times K\times U tensor of parameters \pi. Given {\mathbf {z}} and \pi, the model generates a tensor of interaction counts Y using Eq. (9).
- Model B
is a constrained version of model A. In addition to the meta-parameters K and \varvec{\omega } of model A, it has two meta-parameters, D the number of clusters of time sub-intervals and \varvec{\rho } the parameters of a multinomial distribution on \{1,\ldots ,D\}. The hidden variable {\mathbf {y}} is generated by the multivariate multinomial distribution of Eq. (10). Model B has a K\times K\times D tensor of parameters \pi. Given {\mathbf {z}}, {\mathbf {y}} and \pi, the model generates a tensor of interaction counts Y using Eq. (12).
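The generative process of model A can be sketched as follows. This is a minimal illustration under our own naming conventions; Eqs. (2) and (9) are not reproduced here, and self-pairs (i, i) are kept for brevity.

```python
import math
import random

def poisson(mu, rng):
    """Knuth's Poisson sampler; adequate for the modest intensities used here."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

def sample_model_A(N, U, K, omega, pi, seed=1):
    """Sketch of model A: draw each node's cluster z_i from a multinomial with
    weights omega, then, given z, draw Y[i][j][u] ~ Poisson(pi[z_i][z_j][u])
    independently for every ordered pair and sub-interval."""
    rng = random.Random(seed)
    z = rng.choices(range(K), weights=omega, k=N)
    Y = [[[poisson(pi[z[i]][z[j]][u], rng) for u in range(U)]
          for j in range(N)] for i in range(N)]
    return z, Y

# Toy run: N = 4 nodes, U = 3 sub-intervals, K = 2 clusters.
z, Y = sample_model_A(N=4, U=3, K=2, omega=[0.5, 0.5],
                      pi=[[[1.0] * 3, [0.1] * 3], [[0.1] * 3, [2.0] * 3]])
```

Model B would differ only in that `pi` has third dimension D and each u is first mapped to its time cluster y_u.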
4 Estimation
4.1 Nonparametric estimation of integrated intensities
Remark 2
Estimator (14) at times \{t_u\}_{u\le U} can be viewed as an extension, to random graphs and mixture models, of the nonparametric estimator proposed in Leemis (1991). In that article, N trajectories of independent NHPPs sharing the same intensity function are observed, and the proposed estimator is basically obtained via the method of moments.
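The moment-based idea can be illustrated as follows. This is our reconstruction of the spirit of the estimator, not Eq. (14) verbatim: the integrated intensity Lambda_{kg} at t_u is estimated by the cumulated interaction count from cluster k to cluster g up to t_u, averaged over the ordered pairs of nodes in those clusters.

```python
def estimate_integrated_intensity(Y, z, k, g, U):
    """Method-of-moments sketch: estimate Lambda_{kg}(t_u) by the cumulated
    count of interactions from cluster k to cluster g up to t_u, divided by
    the number of ordered pairs (i, j), i != j, with z_i = k and z_j = g."""
    pairs = [(i, j) for i in range(len(z)) for j in range(len(z))
             if i != j and z[i] == k and z[j] == g]
    if not pairs:
        return [0.0] * U
    estimates, total = [], 0
    for u in range(U):
        total += sum(Y[i][j][u] for i, j in pairs)
        estimates.append(total / len(pairs))
    return estimates

# Toy example: nodes 0 and 1 in cluster 0, node 2 in cluster 1, U = 2 intervals.
Y = [[[0, 0], [2, 1], [0, 0]],
     [[0, 1], [0, 0], [0, 0]],
     [[0, 0], [0, 0], [0, 0]]]
estimate_integrated_intensity(Y, [0, 0, 1], 0, 0, 2)  # → [1.0, 2.0]
```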
4.2 ICL
4.3 Model B
4.4 Greedy search
- 1.
An initial configuration for both {\mathbf {z}} and K is set (standard clustering algorithms like k-means or hierarchical clustering can be used).
- 2.
Label switches leading to the highest increase in the exact ICL are repeatedly made. A label switch consists of either merging two clusters or moving a node from one cluster to another.
Remark 3
The greedy algorithm described in this section makes the best choice locally. Convergence towards the global optimum is not guaranteed, and this optimum can often only be approximated by the local optimum reached by the algorithm.
Remark 4
The exact ICL (as well as the ICL criterion) penalizes the number of parameters. Since the tensor \pi has dimension K \times K \times U, when U, which is fixed, is very high, the ICL attains its maximum at K=1. In other words, the only way the ICL can make the model more parsimonious is to reduce K, down to one. By doing so, no community (or other) structure can be detected. This over-fitting problem has nothing to do with the possible limitations of the greedy search algorithm, and it can be solved by switching to model B.
Once K_{\max } has been fixed, together with an initial value of {\mathbf {z}}, a shuffled sequence of all the nodes in the graph is created. Each node in the sequence is moved to the cluster leading to the highest increase in the ICL, if any. This procedure is repeated until no further increase in the ICL is possible. Henceforth, we refer to this step as the greedy exchange (GE). When maximizing the modularity score to detect communities, the GE is usually a final refinement step adopted after repeatedly merging clusters of nodes. In that context, moreover, the number of clusters is initialized to N and each node is alone in its own cluster. See, for example, Noack and Rotta (2008). Here, we follow a different approach, proposed by Côme and Latouche (2015) and Blondel et al. (2008): after running the GE, we try to merge the remaining clusters of nodes in an attempt to increase the ICL. In this final step (henceforth GM), all possible merges are tested and the best one is retained.
- 1.
GE + GM for nodes at first and then for times (we will call this strategy TN, henceforth).
- 2.
GE + GM for time intervals at first and then for nodes (NT strategy).
- 3.
A hybrid strategy, involving alternate switching of nodes and time intervals (M strategy).
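The GE step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact ICL is treated as an assumed black-box function `icl` mapping a label vector to a score, and only node moves (no merges) are shown.

```python
import random

def greedy_exchange(z, K, icl, seed=0):
    """GE sketch: visit nodes in shuffled order and move each to the cluster
    that most increases the score returned by icl(z); repeat full passes
    until no single move improves the score (a local optimum)."""
    rng = random.Random(seed)
    improved = True
    while improved:
        improved = False
        order = list(range(len(z)))
        rng.shuffle(order)
        for i in order:
            best_k, best_score = z[i], icl(z)
            for k in range(K):
                if k == z[i]:
                    continue
                candidate = z[:i] + [k] + z[i + 1:]
                score = icl(candidate)
                if score > best_score:
                    best_k, best_score = k, score
            if best_k != z[i]:
                z = z[:i] + [best_k] + z[i + 1:]
                improved = True
    return z

# Toy score rewarding agreement with a known target partition.
target = [0, 0, 1, 1]
icl = lambda z: -sum(zi != ti for zi, ti in zip(z, target))
greedy_exchange([1, 1, 0, 0], 2, icl)  # → [0, 0, 1, 1]
```

The GM step would follow the same pattern, scoring every pairwise merge of the remaining clusters and keeping the best one.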
5 Experiments
In this section, experiments on both synthetic and real data are provided. All running times are measured on a twelve-core Intel Xeon server with 92 GB of main memory running a GNU Linux operating system, the greedy algorithm described in Sect. 4.4 being implemented in C++. A Euclidean hierarchical clustering algorithm was used to initialize the labels, and K_{\max } was set to N/2.
In the following, we call TSBM the temporal SBM we propose and we refer to the optimization algorithm described in the previous section as greedy ICL.
5.1 Simulated data
5.1.1 First scenario
Real (a) and estimated (b) integrated intensity functions (IIFs) according to the considered generative model (\psi =2). In blue, \varLambda _1(t); in red, \varLambda _2(t) (colour figure online)
A tensor Y, with dimensions N \times N \times U, is drawn. Its (i, j, u) component is the sampled number of interactions from node i to node j over the time interval I_u. Moreover, sampled interactions are aggregated over the whole time horizon to obtain an adjacency matrix. In other words, each tensor is integrated over its third dimension. We compared the greedy ICL algorithm with the Gibbs sampling approach introduced by Nouedoui and Latouche (2013): the former was run on the tensor Y (providing estimates in 11.86 s on average), the latter on the corresponding adjacency matrix. This experiment was repeated 50 times, and estimates of the random vector {\mathbf {z}} were provided at each iteration. Each estimate \hat{{\mathbf {z}}} is compared with the true {\mathbf {z}} and an adjusted Rand index (ARI, Rand 1971) is computed. This index takes values between zero and one, where one corresponds to a perfect clustering (up to label switching).
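For completeness, the ARI used throughout the experiments can be computed as below. This is a standard self-contained sketch (the adjusted form of Rand's index is due to Hubert and Arabie), not code from the paper.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label vectors: 1 means the partitions
    agree up to label switching, 0 is the chance level."""
    n = len(a)
    # Contingency counts n_{kg}, plus row and column marginals.
    pair = Counter(zip(a, b))
    rows, cols = Counter(a), Counter(b)
    sum_comb = sum(comb(c, 2) for c in pair.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:          # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_comb - expected) / (max_index - expected)

adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # identical up to relabelling → 1.0
```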
Remark 5
The true structure is always recovered by the TSBM: 50 unitary values of the ARI are obtained. Conversely, the standard SBM never succeeds in recovering any hidden structures present in the data (50 null ARIs are obtained). This can easily be explained since the time clusters have opposite interaction patterns, making them hard to uncover when aggregating over time.
Relying on an efficient estimate of {\mathbf {z}}, the two integrated intensity functions can be estimated through the estimator in Eq. (15). Results are shown in Fig. 1b, where the estimated functions (coloured dots) overlap the real functions of Fig. 1a.
Unfortunately, the model is not robust to such changes. Indeed, when running the greedy ICL algorithm on each sampled tensor Y, the algorithm does not detect any community structure and all nodes are placed in the same cluster. This leads to a null ARI for each estimation. As mentioned in Sect. 4.4, the ICL penalizes the number of parameters, and since the tensor \pi has dimension K \times K \times U, for a fixed K, when moving from the coarser decomposition (U=100) to the finer one (U=1000), the number of free parameters in the model is approximately multiplied by 10. The increase we observe in the likelihood when increasing the number of node clusters from K=1 to K=2 is not sufficient to compensate for the penalty due to the large number of parameters, and hence the ICL decreases. Therefore, the maximum is attained at K=1 and a single cluster is detected.
Model B allows us to tackle this issue. By allowing the integrated intensity functions \varLambda _1(t) and \varLambda _2(t) to grow at the same rate on each interval I_u belonging to the same time cluster {\mathcal {C}}_d, we basically reduce the third dimension of the tensor \pi from U to D.
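The dimension reduction at the heart of model B can be sketched as follows. This is our illustration with hypothetical names: the hidden vector y maps each of the U sub-intervals to one of D time clusters, so the sufficient statistics pooled per (node cluster, node cluster, time cluster) triple have third dimension D instead of U.

```python
def pool_over_time_clusters(Y, z, y, K, D):
    """Pool the N x N x U count tensor Y over the time clusters C_d defined
    by y: S[k][g][d] sums Y[i][j][u] over ordered pairs (i, j), i != j, with
    z_i = k, z_j = g, and over sub-intervals u with y_u = d."""
    N, U = len(z), len(y)
    S = [[[0] * D for _ in range(K)] for _ in range(K)]
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            for u in range(U):
                S[z[i]][z[j]][y[u]] += Y[i][j][u]
    return S

# Toy example: N = 2 nodes in distinct clusters, U = 2 intervals, one time cluster.
Y = [[[0, 0], [1, 2]],
     [[3, 0], [0, 0]]]
S = pool_over_time_clusters(Y, [0, 1], [0, 0], 2, 1)  # S[0][1][0] = 3, S[1][0][0] = 3
```

With D much smaller than U, the parameter tensor \pi shrinks accordingly, which is what prevents the over-fitting described in Remark 4.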
Box plots for both clusterings of nodes and time intervals: 50 dynamic graphs are sampled according to the considered generative model, estimates of {\mathbf {z}} and {\mathbf {y}} are provided by the greedy ICL (model B)
5.1.2 Second scenario
Since the node clusters are fixed over time, the TSBM model can be seen as an alternative to a standard SBM to estimate the label vector {\mathbf {z}}. The previous scenario shows that the TSBM can recover the true vector {\mathbf {z}} in situations where the SBM fails. In this paragraph, we show how the TSBM and the SBM can sometimes have similar performances.
Real (a) and estimated (b) integrated intensity functions (IIFs) according to the considered generative model. In blue, \varLambda _1(t), for \psi =4; in red, \varLambda _2(t) (colour figure online)
Box plots of ARIs for different levels of contrast (\psi). We compare the proposed model with a standard SBM. a ARIs obtained by greedy ICL and b ARIs obtained with the Gibbs sampling procedure for SBM
Box plots of ARIs for different levels of contrast (\psi). Data have been sampled by non-homogeneous Poisson processes counting interactions in a dynamic graph whose nodes are grouped in three clusters and interactivity patterns vary across two time clusters
5.1.3 Scalability
A full scalability analysis of the proposed algorithm as well as the convergence properties of the proposed estimators are outside the scope of this paper. Nonetheless, in “Appendix” we provide details about the computational complexity of the greedy ICL algorithm. Future work could certainly be devoted to improving both the efficiency and the scalability of the algorithm through the use of more sophisticated data structures.
5.2 Real data
| ID1 | ID2 | Time interval (15 min) | Number of interactions |
|---|---|---|---|
| 52 | 26 | 5 | 16 |
It means that conference attendees 52 and 26, between 9 a.m. and 9.15 a.m., spoke for 16 \times 20\,\hbox {s} \approx 5\,\hbox {min}\,20\,\hbox {s}.
Box plot of the ten final values of the ICL produced by the greedy ICL algorithm for different initializations
a Cumulated aggregated connections for each time interval for cluster {\mathcal {A}}_4. b The estimated IIF for interactions inside cluster {\mathcal {A}}_4. Vertical red lines delimit the lunch break and the wine and cheese reception (colour figure online)
13.00–15.00—lunch break.
18.00–19.00—wine and cheese reception.
- 1.
The obtained clustering seems meaningful: the three time intervals with the highest interaction levels are placed in the same cluster (blue), apart from all the others. More generally, each cluster is associated with a certain intensity level, so time intervals in the same cluster, not necessarily adjacent, share the same global interactivity pattern.
- 2.
There are no constraints on the number of abrupt changes associated with these five time clusters. In other words, time clusters need not be adjacent, and this is the real difference between the approach considered in this paper (time clustering) and a pure segmentation approach.
a Aggregated connections for each time interval for the whole network. b Interactions of the same form/colour take place on time intervals assigned to the same cluster (model B) (colour figure online)
6 Conclusion
We proposed a non-stationary extension of the stochastic block model (SBM) allowing us to cluster nodes of a network in situations where the classical SBM fails. The approach we chose consists in partitioning the time interval over which interactions are studied into sub-intervals of fixed length. Those intervals provide aggregated interaction counts that are increments of non-homogeneous Poisson processes (NHPPs). In an SBM-inspired perspective, nodes are clustered in such a way that aggregated interaction counts are homogeneous over clusters. We derived an exact integrated classification likelihood (ICL) for such a model and proposed to maximize it through a greedy search strategy. Finally, a nonparametric maximum likelihood estimator was developed to estimate the integrated intensity functions of the NHPPs counting interactions between nodes. The experiments we carried out on artificial and real-world networks highlight the capacity of the model to capture non-stationary structures in dynamic graphs.
More information about the way the data were collected can be found in Isella et al. (2011) or by visiting the website http://www.sociopatterns.org/datasets/hypertext-2009-dynamic-contact-network/.