Introduction
Graph data has become increasingly important in various real-world applications, such as social network analysis and recommendation systems. Graph partitioning can be used to analyze such data by dividing the graph into meaningful communities, enabling valuable insights into the interactions and relationships among nodes. Among various graph partitioning techniques, Stochastic Graph Partitioning (SGP) [1] characterizes real-world graphs using Bayesian inference based on the degree-corrected stochastic blockmodel [2]. Unlike typical combinatorial graph partitioning problems, SGP presents unique computational challenges caused by time-consuming sampling processes. As a result, HPEC recently launched the Stochastic Graph Partitioning Challenge (SGPC) to seek innovative acceleration methods from the high-performance computing (HPC) community [1].
SGPC provides a baseline sequential partitioner, namely PEIXOTO, developed by [3]–[5]. The baseline algorithm uses a Fibonacci search [6] to explore different numbers of blocks and determines the optimal partition by minimizing the overall entropy of the partitioned graph. Each node is initially assigned a unique block and then merged with others at each search step, a process called block merging. Subsequently, iterative sequential Markov Chain Monte Carlo (MCMC) updates assign each node to a block, a process called nodal block assignment. To evaluate the performance of a partitioner against PEIXOTO, SGPC provides a rigorous evaluation metric on synthetic graphs generated by the stochastic model in [2]. The model samples connections between nodes from a Poisson distribution with a correction term to emulate real-world graph characteristics. Table I lists the characteristics of four categories of these graphs, ranging from 1K to 200K nodes. The four categories present different levels of partitioning difficulty or graph complexity:
Low block overlap, low block size variation (Low-Low)
Low block overlap, high block size variation (Low-High)
High block overlap, low block size variation (High-Low)
High block overlap, high block size variation (High-High)
However, PEIXOTO is extremely time-consuming due to its bottom-up merging strategy and iterative MCMC updates, both of which require a large number of sequential iterations. As a result, the algorithm struggles to scale to large graphs. To tackle this problem, SGPC has yielded several solutions [7]–[9], but their speed-ups are limited by various algorithmic constraints (discussed later). For example, the latest SGPC champion [9] reports only up to 3.78x speed-up on a 50K-node graph and cannot complete larger graphs within a reasonable amount of time (e.g., 10 hours).
Consequently, we propose uSAP, an ultra-fast stochastic graph partitioner, to deliver SGP performance that was previously out of reach. uSAP introduces a novel strongly connected component (SCC)-based initial block merging strategy to largely reduce the number of partitioning iterations. To further enhance the sampling performance, uSAP adopts a dynamic batch parallel nodal block assignment algorithm. In addition to runtime improvement, uSAP employs a dynamic matrix representation to reduce the memory footprint. We have evaluated uSAP on the official 2022 HPEC SGPC benchmarks. The results demonstrate the promising performance of uSAP across different graph sizes and categories. For example, uSAP achieves a 129.4x speed-up over the latest champion on the graph of 50K nodes. We have made uSAP open-source to facilitate high-performance graph partitioning research.
State of the Art
To overcome the scalability challenge, Distributed Sketches [7], the 2020 SGPC champion, proposes matrix sketches derived from random dimension-reducing projections. It demonstrates the excellent distributed-memory scalability of the linear sketch embedding algorithm. However, the results show that the pairwise precision and recall (i.e., accuracy) are not promising when partitioning large graphs. Also, handling high-degree vertices remains challenging because of the algorithm's communication and computation bottlenecks.
DPGS [8], an honorable mention at the 2021 SGPC, introduces a graph summarization technique that preserves the community structure of the graph while reducing its complexity. The results indicate that their algorithm runs faster than PEIXOTO, but its pairwise precision and recall are not significantly better than those of PEIXOTO.
Faster Stochastic Block Partition (FSBP) [9], the 2021 SGPC champion, proposes an aggressive initial merging strategy to considerably decrease the initial block count at the first search iteration, which in turn reduces the total number of partitioning iterations. Its parallelism control strategy carefully manages the amount of parallelism during different phases of the algorithm to improve performance. However, the aggressive initial merging strategy may merge blocks that cause substantial changes in entropy, negatively affecting accuracy. Also, its parallelism control strategy does not scale well due to synchronization overhead.
uSAP
To overcome the performance bottleneck of existing partitioners, we present uSAP, an ultra-fast stochastic graph partitioner. We implement uSAP in a task graph programming model to accelerate SGP through an SCC-based initial block merging strategy, a dynamic batch parallel nodal block assignment, and a dynamic matrix representation.
A. SCC-Based Initial Block Merging Strategy
In order to reduce the total number of partitioning iterations, we merge the SCCs of the graph into blocks in advance. This approach substantially reduces the initial block count and significantly enhances the algorithm's efficiency. In contrast to the greedy strategy of FSBP, which aggressively increases the number of blocks to merge, our method focuses on merging blocks with stronger connections. The idea is inspired by the degree-corrected stochastic blockmodel [2], which implies that nodes within the same block exhibit stronger connections than nodes across different blocks. The merging scenario shown in Figure 1 represents a possible solution of PEIXOTO and FSBP: merging node B into node A causes an entropy change of 0.35 according to the entropy definition in [1]. In contrast, merging node C into node A yields a smaller entropy change of −0.43, as shown in Figure 2. This indicates that merging node C into node A is more favorable than merging node B into node A because nodes A and C are strongly connected. Based on this observation, we adopt a linear-time SCC-finding algorithm to identify SCCs and guide the initial merging strategy.
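For reference, the objective minimized throughout is the description length (entropy) of the degree-corrected stochastic blockmodel. As a rough guide to how the entropy changes above arise (the exact definition used by the challenge is in [1]), a commonly used approximation for undirected graphs from [2] is

\[
S \approx -E - \frac{1}{2}\sum_{r,s} M_{rs} \ln\!\left(\frac{M_{rs}}{d_r\, d_s}\right),
\]

where \(E\) is the total edge count, \(M_{rs}\) is the block edge count between blocks \(r\) and \(s\), and \(d_r = \sum_s M_{rs}\). The values 0.35 and −0.43 in Figures 1 and 2 are differences in \(S\) computed from the block edge count matrices before and after each merge.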
Illustration of merging node B into node A (a possible solution of PEIXOTO and FSBP) and the resulting block edge count matrix used to calculate the change in entropy (ΔE = 0.35, following [1]).
In our SCC-based initial block merging strategy, we introduce an initial block size threshold, denoted as tscc, to limit the block size instead of finding all the SCCs in the graph. The threshold can be adjusted to accommodate different graph complexities. The overall procedure, described in Algorithm 1, begins by applying a depth-first search to obtain the traversal sequence stored in a stack. Subsequently, we transpose the input graph and perform a second depth-first search on the transposed graph according to the traversal sequence. Once a node is popped from the traversal sequence, its neighbors are assigned to the same block. We continue to find the descendants of these neighbors until the step counter exceeds tscc or no descendant exists. At this point, an initial block is found. The nodes within the block are then marked as finished, and the step counter is reset in preparation for finding the subsequent initial blocks.
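The following is a minimal C++ sketch of this procedure, based on Kosaraju's two-pass algorithm with the size bound tscc. All names (Graph, initial_blocks, t_scc) are illustrative rather than uSAP's actual interface, and the recursive first pass is for brevity only:

```cpp
#include <cstddef>
#include <stack>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency lists

// First pass: DFS on G, pushing nodes in order of completion.
void dfs_order(const Graph& g, int u, std::vector<bool>& seen,
               std::stack<int>& order) {
  seen[u] = true;
  for (int v : g[u]) {
    if (!seen[v]) dfs_order(g, v, seen, order);
  }
  order.push(u);
}

// Second pass: traverse the transposed graph; stop growing a block
// once it reaches t_scc nodes, so no initial block gets too large.
void collect_block(const Graph& gt, int u, int block_id, std::size_t t_scc,
                   std::vector<int>& block) {
  block[u] = block_id;
  std::size_t steps = 1;
  std::vector<int> frontier{u};
  while (!frontier.empty() && steps < t_scc) {
    int x = frontier.back();
    frontier.pop_back();
    for (int v : gt[x]) {
      if (block[v] == -1 && steps < t_scc) {
        block[v] = block_id;
        ++steps;
        frontier.push_back(v);
      }
    }
  }
}

std::vector<int> initial_blocks(const Graph& g, std::size_t t_scc) {
  const int n = static_cast<int>(g.size());
  Graph gt(n);  // transposed graph
  for (int u = 0; u < n; ++u)
    for (int v : g[u]) gt[v].push_back(u);

  std::stack<int> order;
  std::vector<bool> seen(n, false);
  for (int u = 0; u < n; ++u)
    if (!seen[u]) dfs_order(g, u, seen, order);

  std::vector<int> block(n, -1);
  int next_id = 0;
  while (!order.empty()) {
    int u = order.top();
    order.pop();
    if (block[u] == -1) collect_block(gt, u, next_id++, t_scc, block);
  }
  return block;  // block[v] = initial block of node v
}
```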
Illustration of merging node C into node A (the solution of uSAP) and the resulting block edge count matrix used to calculate the change in entropy (a smaller ΔE = −0.43, following [1]).
B. Dynamic Batch Parallel Nodal Block Assignment
To prevent the time-consuming MCMC updates in the early partitioning stage, we propose a dynamic approach that decides when to perform parallel nodal block assignment. We define a threshold, denoted as tB, to control this decision and the granularity of the updates, where a smaller tB yields more fine-grained updates (see the accuracy discussion in the experimental results).
We randomly select a batch of nodes and parallelize their nodal block assignment, as shown in Figure 3. Each thread is assigned a specific set of nodes and performs block assignment independently based on the shared state of the current partitioning result. After the batch completes, the global shared state is updated, and the overall entropy is calculated to determine whether to continue the nodal block assignment. This parallelization significantly accelerates the time-consuming MCMC updates without affecting accuracy; the theoretical foundation supporting this claim can be found in [10], [11].
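A minimal sketch of one batch is shown below, using a hypothetical propose_block() routine (stubbed out here; a real one follows the MCMC proposal rule of [1]) and plain std::thread rather than uSAP's actual task-based implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <thread>
#include <vector>

struct PartitionState {
  std::vector<int> assignment;  // current block of each node
  // ... block edge counts, block degrees, etc.
};

// Stub proposal: a real implementation samples a candidate block for
// `node` from the shared state following the MCMC rule of [1].
int propose_block(const PartitionState& s, int node, std::mt19937& rng) {
  (void)rng;
  return s.assignment[node];
}

void assign_batch(PartitionState& state, const std::vector<int>& batch,
                  unsigned num_threads) {
  std::vector<int> proposed(batch.size());
  std::vector<std::thread> workers;
  const std::size_t chunk = (batch.size() + num_threads - 1) / num_threads;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      std::mt19937 rng(t);  // per-thread RNG for independent sampling
      const std::size_t lo = t * chunk;
      const std::size_t hi = std::min(batch.size(), lo + chunk);
      // Threads read the shared state but write only their own slice.
      for (std::size_t i = lo; i < hi; ++i) {
        proposed[i] = propose_block(state, batch[i], rng);
      }
    });
  }
  for (auto& w : workers) w.join();
  // Commit the whole batch at once; the caller then recomputes the
  // overall entropy to decide whether to run another batch.
  for (std::size_t i = 0; i < batch.size(); ++i) {
    state.assignment[batch[i]] = proposed[i];
  }
}
```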
C. Dynamic Matrix Representation
To increase the memory efficiency of uSAP, we introduce a threshold parameter, denoted as tM, to dynamically select the representation of the block edge count matrix: when the block count exceeds tM, uSAP stores the matrix as adjacency lists to reduce the memory footprint; otherwise, it uses a dense matrix for fast access.
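A minimal sketch of such a dual representation follows, assuming hash-map rows as the adjacency-list form; the class and member names are illustrative, not uSAP's actual data structure:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

class BlockEdgeCountMatrix {
 public:
  BlockEdgeCountMatrix(std::size_t num_blocks, std::size_t t_M)
      : n_(num_blocks), dense_(num_blocks <= t_M) {
    if (dense_) dense_data_.assign(n_ * n_, 0);
    else sparse_rows_.resize(n_);
  }

  // Add w edges between blocks r and s.
  void add(std::size_t r, std::size_t s, std::int64_t w) {
    if (dense_) dense_data_[r * n_ + s] += w;
    else sparse_rows_[r][s] += w;
  }

  // Read the edge count between blocks r and s.
  std::int64_t get(std::size_t r, std::size_t s) const {
    if (dense_) return dense_data_[r * n_ + s];
    auto it = sparse_rows_[r].find(s);
    return it == sparse_rows_[r].end() ? 0 : it->second;
  }

 private:
  std::size_t n_;
  bool dense_;  // dense when the block count is at most t_M
  std::vector<std::int64_t> dense_data_;  // n_ x n_, early-exit when small
  std::vector<std::unordered_map<std::size_t, std::int64_t>> sparse_rows_;
};
```

The switch at tM trades the quadratic footprint of the dense form against slower per-entry access in the sparse form, which is the trade-off studied later in Table V.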
D. Task Graph Parallelism
To maximize the parallelism of uSAP, we leverage Taskflow [12] to describe our algorithms in a task dependency graph, where dependent tasks and parallel algorithms can be scheduled by the Taskflow runtime across different CPUs with dynamic load balancing. Furthermore, as uSAP incorporates many iterative control flows in the Fibonacci search and MCMC updates, the control taskflow graph (CTFG) programming model of Taskflow allows us to express end-to-end parallelism by integrating control flow into our task graph, largely reducing the threading and synchronization overheads. More details about CTFG and its successful applications can be found in [12]–[46].
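As an illustration of CTFG, the following self-contained Taskflow example models an iterative loop with a condition task, whose integer return value selects the successor to run next; the loop body is a placeholder rather than uSAP's actual MCMC update:

```cpp
#include <taskflow/taskflow.hpp>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow;

  int iteration = 0;

  auto init = taskflow.emplace([&] { iteration = 0; });
  auto body = taskflow.emplace([&] {
    // ... one round of batch parallel nodal block assignment ...
    ++iteration;
  });
  auto cond = taskflow.emplace([&]() -> int {
    // Returning 0 runs the first successor (loop back to body);
    // returning 1 runs the second successor (exit the loop).
    return iteration < 100 ? 0 : 1;
  });
  auto done = taskflow.emplace([] { /* finalize the partition */ });

  init.precede(body);
  body.precede(cond);
  cond.precede(body, done);  // control-flow edges selected at runtime

  executor.run(taskflow).wait();
}
```

Because the loop lives inside the task graph, the runtime keeps its worker threads busy across iterations instead of tearing down and re-launching parallel regions, which is where the threading and synchronization savings come from.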
We depict the task graph of uSAP in Figure 4 and present the corresponding overall algorithm in Algorithm 2. uSAP begins with Initial-Block-Merging (line 1), where the SCC-based initial block merging strategy is employed. This step is followed by initializing the block edge count data based on the initial merged blocks. Next, the Fibonacci search (line 4) consists of two steps: the parallel Block-Merging (line 6) and the batch parallel Nodal-Block-Assignment (line 14). The latter begins with Fetch-Batches (line 10) to obtain batches of nodes and terminates based on the result of Check-Convergence (line 17). Finally, Prepare-for-Next (line 21) examines the result of the Fibonacci search to determine whether the optimal partition has been achieved.
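The following sketch summarizes this control flow; every function is a trivial stand-in for the corresponding step of Algorithm 2, not the real implementation:

```cpp
struct Partition {
  double entropy = 0.0;  // overall entropy of the current partition
  int num_blocks = 0;
};

Partition initial_block_merging() { return {}; }           // SCC-based merging
void block_merging(Partition&) {}                          // parallel block merges
bool nodal_block_assignment(Partition&) { return false; }  // one batch; false once converged
bool optimal_found(const Partition&) { return true; }      // Fibonacci bracket closed?

Partition partition_graph() {
  Partition p = initial_block_merging();   // line 1: Initial-Block-Merging
  do {
    block_merging(p);                      // line 6: Block-Merging
    while (nodal_block_assignment(p)) {    // lines 10-17: Fetch-Batches,
      /* Check-Convergence loops here */   //   Nodal-Block-Assignment
    }
  } while (!optimal_found(p));             // line 21: Prepare-for-Next
  return p;
}
```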
Experimental Results
We evaluate the performance of uSAP using the official 2022 SGPC dataset. We focus on static graphs under the four categories of different graph complexity listed in Table I. All experiments were conducted on an Ubuntu Linux machine (kernel 5.15.0-58-generic).
A. Baseline
We consider two baseline implementations: (1) the PEIXOTO sequential partitioner provided by SGPC and (2) FSBP [9], the latest SGPC champion. The aggressive initial merging rate of FSBP is set to 0.75, the same as in [9]. In addition, we ran FSBP with eight threads, with which it achieved its best performance on our machine; using more threads does not provide any further performance advantage due to its synchronization overhead. In terms of uSAP's parameters, the threshold tB is fixed across all four categories.
B. Performance Comparison
Table III compares the runtime performance and memory usage among PEIXOTO, FSBP, and uSAP across different graph sizes and categories. The results clearly demonstrate that uSAP significantly outperforms PEIXOTO and FSBP on all graphs. For example, uSAP is about 1700x faster than FSBP on the 1K-node graph under the low-low category. For the same category at 5K, 20K, and 50K nodes, uSAP is about 80.2x, 103.3x, and 129.4x faster than FSBP, respectively. When partitioning the largest graph of 200K nodes under the high-high category (the most complex graph), uSAP finishes in 23 minutes, while PEIXOTO and FSBP fail to complete within 10 hours. Similar trends can be observed in the other three categories.
The runtime advantage of uSAP stems from a combination of our SCC-based initial merging strategy, the dynamic batch parallel nodal update approach, and the task graph parallelism. The SCC-based initial merging strategy significantly reduces the number of blocks before the block merging stage, resulting in far fewer partitioning iterations. The results in Figure 5 demonstrate that uSAP outperforms PEIXOTO and FSBP by initially merging up to 80% of nodes into blocks, leading to a large reduction in the number of partitioning iterations.
In terms of memory usage, uSAP outperforms PEIXOTO and FSBP on all graphs. For example, uSAP consumes 7.1x less memory than FSBP and 12.55x less than PEIXOTO on the 1K-node low-low graph. For the 20K-node low-low graph, uSAP consumes 2.99x less memory than FSBP and 16.4x less than PEIXOTO. The memory efficiency of uSAP can be attributed to its dynamic matrix representation, which manages memory efficiently by employing the adjacency list when the block count exceeds the threshold tM and the dense matrix otherwise.
Table IV compares the accuracy in terms of pairwise precision (PP) and pairwise recall (PR) across the four categories. We observe that uSAP outperforms PEIXOTO and FSBP on nearly all graphs because the SCC-based initial block merging strategy minimizes the change in entropy, which helps maintain PP and PR. In contrast, the aggressive initial merging approach used in FSBP can harm accuracy, since it can significantly change the entropy and degrade PP and PR. Compared to PEIXOTO, uSAP incorporates the batch parallel nodal block assignment inspired by [10], [11]. This technique enables the concurrent processing of multiple nodes, leveraging parallelism to accelerate the MCMC updates without compromising PP and PR. For the high-high 20K-node graph, uSAP exhibits slightly lower PP and PR than PEIXOTO and FSBP. This is because the threshold tB is fixed across all four categories, whereas more complex graphs like the high-high category require a higher level of granularity in the nodal block assignment to achieve optimal PP and PR. Thus, we leave the threshold tB as a tunable parameter that applications can fine-tune to improve PP and PR (e.g., reducing tB for more fine-grained updates).
C. Scalability
We compare the scalability of uSAP and FSBP over increasing numbers of threads when partitioning the 50K-node graphs under the different categories. We do not report scalability for 200K-node graphs because FSBP cannot complete them within a reasonable amount of time. The execution time reported in Figure 6 is presented on a base-10 logarithmic scale. We observe that uSAP significantly outperforms FSBP regardless of the number of threads. Regarding scalability, FSBP achieves its best performance with eight threads; beyond eight threads, its performance begins to degrade because FSBP lacks an effective scheduling algorithm to handle the synchronization overhead that arises with more threads. In contrast, uSAP leverages Taskflow [12] to program our partitioning algorithms in a scalable task dependency graph, where parallel and dependent tasks can be efficiently scheduled by the Taskflow runtime over different CPUs with dynamic load balancing. For example, the runtime of uSAP with eight threads is approximately 8x faster than with one thread in Figure 6. This improvement demonstrates the superior scalability of uSAP, allowing it to effectively utilize CPUs to achieve performance enhancements for SGP.
Scaling results of uSAP and FSBP under different numbers of threads on 50K-node graphs.
D. Effect of Dynamic Matrix Representation Strategy
We study the effect of our dynamic matrix representation strategy under different tM values. In Table V, we consider only the effect of the dynamic matrix representation, without the other optimizations mentioned in this paper. Using adjacency lists alone takes about 0.45s, 4.9s, 60s, and 363s to partition the four low-low graphs of 1K, 5K, 20K, and 50K nodes, respectively. With our proposed dynamic strategy, the best runtime improvements reach 19.6%–25% on the 50K-node graph under different tM values (512, 1024, 2048, and 4096). The value of 1024 strikes a balance between memory access time and the number of iterations needed to read each row of the matrix when calculating the entropy.
Conclusion and Future Work
In this paper, we have introduced uSAP, an ultra-fast stochastic graph partitioner targeting HPEC SGPC. uSAP has introduced a novel SCC-based initial block merging strategy to significantly reduce the number of partitioning iterations. In addition, uSAP has adopted a dynamic batch parallel nodal block assignment algorithm and a dynamic matrix representation to improve runtime and memory performance. We have evaluated uSAP on the 2022 official HPEC SGPC benchmarks. The results have demonstrated the promising performance of uSAP on graphs of different sizes and complexities. For example, uSAP achieves 129.4x speed-up over the 2021 champion on a graph of 50K nodes.
Our future work will extend uSAP to handle streaming graphs and leverage GPUs [36] with data-parallel distributed computing [47]–[50] to gain further acceleration and performance improvements.
ACKNOWLEDGMENT
We are grateful for the support of the U.S. National Science Foundation (CCF-2126672, CCF-2144523 (CAREER), OAC-2209957, TI-2229304, and DMR-2235276). Special thanks go to the reviewers for their help in improving this manuscript.