Introduction
The recent success of neural networks has boosted research on pattern recognition and data mining. Many machine learning tasks, such as object detection [1], [2], machine translation [3], [4], and speech recognition [5], which once heavily relied on handcrafted feature engineering to extract informative feature sets, have recently been revolutionized by various end-to-end deep learning paradigms, e.g., convolutional neural networks (CNNs) [6], recurrent neural networks (RNNs) [7], and autoencoders [8]. The success of deep learning in many domains is partially attributed to rapidly developing computational resources (e.g., GPUs), the availability of big training data, and the effectiveness of deep learning in extracting latent representations from Euclidean data (e.g., images, text, and videos). Taking image data as an example, we can represent an image as a regular grid in the Euclidean space. A CNN is able to exploit the shift-invariance, local connectivity, and compositionality of image data [9]. As a result, CNNs can extract local, meaningful features that are shared across an entire data set for various image analyses.
While deep learning effectively captures hidden patterns of Euclidean data, there is an increasing number of applications where data are represented in the form of graphs. For example, in e-commerce, a graph-based learning system can exploit the interactions between users and products to make highly accurate recommendations. In chemistry, molecules are modeled as graphs, and their bioactivity needs to be identified for drug discovery. In a citation network, articles are linked to each other via citations, and they need to be categorized into different groups. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. As graphs can be irregular, a graph may have a variable number of unordered nodes, and nodes in a graph may have different numbers of neighbors, so some important operations (e.g., convolutions) that are easy to compute in the image domain are difficult to apply to the graph domain. Furthermore, a core assumption of existing machine learning algorithms is that instances are independent of each other. This assumption no longer holds for graph data because each instance (node) is related to others by links of various types, such as citations, friendships, and interactions.
Recently, there is increasing interest in extending deep learning approaches for graph data. Motivated by CNNs, RNNs, and autoencoders from deep learning, new generalizations and definitions of important operations have been rapidly developed over the past few years to handle the complexity of graph data. For example, a graph convolution can be generalized from a 2-D convolution. As illustrated in Fig. 1, an image can be considered as a special case of graphs, where pixels are connected by adjacent pixels. Similar to 2-D convolution, one may perform graph convolutions by taking the weighted average of a node’s neighborhood information.
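To make this analogy concrete, the following minimal sketch (our own illustration, not taken from any particular GNN library; all names are ours) computes one such neighborhood-averaging step for every node given an adjacency matrix and a node feature matrix.

```python
import numpy as np

def graph_conv_average(A, X):
    """One step of a simple graph convolution: each node's new
    representation is the average of its own and its neighbors' features.

    A: (n, n) binary adjacency matrix of an undirected graph.
    X: (n, d) node feature matrix.
    """
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)   # node degrees incl. self-loop
    return (A_hat @ X) / deg                 # row-normalized aggregation

# Toy example: a three-node path graph.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(graph_conv_average(A, X))
```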
Fig. 1. 2-D convolution versus graph convolution. (a) 2-D convolution: analogous to a graph, each pixel in an image is taken as a node where neighbors are determined by the filter size. The 2-D convolution takes the weighted average of pixel values of the red node along with its neighbors. The neighbors of a node are ordered and have a fixed size. (b) Graph convolution: to get a hidden representation of the red node, one simple solution of the graph convolutional operation is to take the average value of the node features of the red node along with its neighbors. Different from the image data, the neighbors of a node are unordered and variable in size.
There are a limited number of existing reviews on the topic of graph neural networks (GNNs). Using the term geometric deep learning, Bronstein et al. [9] give an overview of deep learning methods in the non-Euclidean domain, including graphs and manifolds. Although it was the first review on GNNs, that survey mainly covers convolutional GNNs. Hamilton et al. [10] cover a limited number of GNNs with a focus on addressing the problem of network embedding. Battaglia et al. [11] position graph networks as the building blocks for learning from relational data, reviewing part of GNNs under a unified framework. Lee et al. [12] conduct a partial survey of GNNs that apply different attention mechanisms. In summary, existing surveys only include some of the GNNs and examine a limited number of works, thereby missing the most recent development of GNNs. This article provides a comprehensive overview of GNNs, for both interested researchers who want to enter this rapidly developing field and experts who would like to compare GNN models. To cover a broader range of methods, this article considers GNNs as all deep learning approaches for graph data.
A. Our Contributions
This article makes notable contributions summarized as follows.
New Taxonomy: We propose a new taxonomy of GNNs. GNNs are categorized into four groups: recurrent GNNs (RecGNN), convolutional GNNs (ConvGNNs), graph autoencoders (GAEs), and spatial–temporal GNNs (STGNNs).
Comprehensive Review: We provide the most comprehensive overview of modern deep learning techniques for graph data. For each type of GNNs, we provide detailed descriptions of representative models, make the necessary comparison, and summarize the corresponding algorithms.
Abundant Resources: We collect abundant resources on GNNs, including state-of-the-art models, benchmark data sets, open-source codes, and practical applications. This article can be used as a hands-on guide for understanding, using, and developing different deep learning approaches for various real-life applications.
Future Directions: We discuss theoretical aspects of GNNs, analyze the limitations of existing methods, and suggest four possible future research directions in terms of model depth, scalability tradeoff, heterogeneity, and dynamicity.
B. Organization of This Article
The rest of this article is organized as follows. Section II outlines the background of GNNs, lists commonly used notations, and defines graph-related concepts. Section III clarifies the categorization of GNNs. Sections IV–VII provide an overview of GNN models. Section VIII presents a collection of applications across various domains. Section IX discusses the current challenges and suggests future directions. Section X summarizes this article.
Background and Definition
In this section, we outline the background of GNNs, list commonly used notations, and define graph-related concepts.
A. Background
1) Brief History of Graph Neural Networks:
Sperduti and Starita [13] first applied neural networks to directed acyclic graphs, which motivated early studies on GNNs. The notion of GNNs was initially outlined in [14] and further elaborated in [15] and [16]. These early studies fall into the category of RecGNNs. They learn a target node’s representation by propagating neighbor information in an iterative manner until a stable fixed point is reached. This process is computationally expensive, and recently, there have been increasing efforts to overcome these challenges [17], [18].
Encouraged by the success of CNNs in the computer vision domain, a large number of methods that redefine the notion of convolution for graph data have been developed in parallel. These approaches fall under the umbrella of ConvGNNs. ConvGNNs are divided into two main streams: spectral-based approaches and spatial-based approaches. The first prominent research on spectral-based ConvGNNs was presented by Bruna et al. [19], who developed a graph convolution based on spectral graph theory. Since then, there have been increasing improvements, extensions, and approximations of spectral-based ConvGNNs [20]–[23]. Research on spatial-based ConvGNNs started much earlier than that on spectral-based ConvGNNs. In 2009, Micheli [24] first addressed graph mutual dependence by architecturally composite nonrecursive layers while inheriting ideas of message passing from RecGNNs. However, the importance of this work was overlooked, and it was not until recently that many spatial-based ConvGNNs (e.g., [25]–[27]) emerged. Apart from RecGNNs and ConvGNNs, many alternative GNNs have been developed in the past few years, including GAEs and STGNNs. These learning frameworks can be built on RecGNNs, ConvGNNs, or other neural architectures for graph modeling. Details on the categorization of these methods are given in Section III.
2) Graph Neural Networks Versus Network Embedding:
The research on GNNs is closely related to graph embedding or network embedding, another topic which attracts increasing attention from both the data mining and machine learning communities [10], [28]–[32]. Network embedding aims at representing network nodes as low-dimensional vector representations, preserving both network topology structure and node content information, so that any subsequent graph analytics task, such as classification, clustering, and recommendation, can be easily performed using simple off-the-shelf machine learning algorithms (e.g., support vector machines for classification). Meanwhile, GNNs are deep learning models aiming at addressing graph-related tasks in an end-to-end manner. Many GNNs explicitly extract high-level representations. The main distinction between GNNs and network embedding is that GNNs are a group of neural network models that are designed for various tasks, while network embedding covers various kinds of methods targeting the same task. Therefore, GNNs can address the network embedding problem through a GAE framework. On the other hand, network embedding contains other nondeep learning methods, such as matrix factorization [33], [34] and random walks [35].
3) Graph Neural Networks Versus Graph Kernel Methods:
Graph kernels are historically dominant techniques to solve the problem of graph classification [36]–[38]. These methods employ a kernel function to measure the similarity between pairs of graphs so that kernel-based algorithms, such as support vector machines, can be used for supervised learning on graphs. Similar to GNNs, graph kernels can embed graphs or nodes into vector spaces by a mapping function. The difference is that this mapping function is deterministic rather than learnable. Due to a pairwise similarity calculation, graph kernel methods suffer significantly from computational bottlenecks. GNNs, in contrast, directly perform graph classification based on the extracted graph representations and, therefore, are much more efficient than graph kernel methods. For a further review of graph kernel methods, we refer the readers to [39].
B. Definition
Throughout this article, we use bold uppercase characters to denote matrices and bold lowercase characters to denote vectors. Unless particularly specified, the notations used in this article are illustrated in Table I. Now, we define the minimal set of definitions required to understand this article.
Definition 1 (Graph):
A graph is represented as $G=(V,E)$, where $V$ is the set of nodes and $E$ is the set of edges. Let $v_{i}\in V$ denote a node and $e_{ij}=(v_{i},v_{j})\in E$ denote an edge pointing from $v_{j}$ to $v_{i}$. The neighborhood of a node $v$ is defined as $N(v)=\{u\in V\,|\,(v,u)\in E\}$. The adjacency matrix $\mathbf{A}$ is an $n\times n$ matrix with $A_{ij}=1$ if $e_{ij}\in E$ and $A_{ij}=0$ otherwise. A graph may have node attributes $\mathbf{X}\in \mathbf{R}^{n\times d}$, where $\mathbf{x}_{v}\in \mathbf{R}^{d}$ is the feature vector of node $v$, and edge attributes $\mathbf{X}^{e}$, where $\mathbf{x}^{e}_{(v,u)}$ is the feature vector of the edge $(v,u)$.
Definition 2 (Directed Graph):
A directed graph is a graph with all edges directed from one node to another. An undirected graph is considered as a special case of directed graphs where there is a pair of edges with inverse directions if two nodes are connected. A graph is undirected if and only if the adjacency matrix is symmetric.
Definition 3 (Spatial–Temporal Graph):
A spatial–temporal graph is an attributed graph where the node attributes change dynamically over time. The spatial–temporal graph is defined as $G^{(t)}=(V,E,\mathbf{X}^{(t)})$, where the graph structure $(V,E)$ is fixed and $\mathbf{X}^{(t)}\in \mathbf{R}^{n\times d}$ is the node feature matrix at time step $t$.
Categorization and Frameworks
In this section, we present our taxonomy of GNNs, as shown in Table II. We categorize GNNs into RecGNNs, ConvGNNs, GAEs, and STGNNs. Fig. 2 shows the examples of various model architectures. In the following, we give a brief introduction to each category.
Fig. 2. Different GNN models built with graph convolutional layers. The term Gconv denotes a graph convolutional layer. The term MLP denotes a multilayer perceptron. The term CNN denotes a standard convolutional layer. (a) ConvGNN with multiple graph convolutional layers. A graph convolutional layer encapsulates each node’s hidden representation by aggregating feature information from its neighbors. After feature aggregation, a nonlinear transformation is applied to the resulting outputs. By stacking multiple layers, the final hidden representation of each node receives messages from a further neighborhood. (b) ConvGNN with pooling and readout layers for graph classification [21]. A graph convolutional layer is followed by a pooling layer to coarsen a graph into subgraphs so that node representations on coarsened graphs represent higher graph-level representations. A readout layer summarizes the final graph representation by taking the sum/mean of hidden representations of subgraphs. (c) GAE for network embedding [61]. The encoder uses graph convolutional layers to get a network embedding for each node. The decoder computes the pairwise distance given network embeddings. After applying a nonlinear activation function, the decoder reconstructs the graph adjacency matrix. The network is trained by minimizing the discrepancy between the real adjacency matrix and the reconstructed adjacency matrix. (d) STGNN for spatial–temporal graph forecasting [74]. A graph convolutional layer is followed by a 1-D-CNN layer. The graph convolutional layer operates on $\mathbf{A}$ and $\mathbf{X}^{(t)}$ to capture the spatial dependence, while the 1-D-CNN layer slides over $\mathbf{X}$ along the time axis to capture the temporal dependence. The output layer is a linear transformation, generating a prediction for each node, such as its future value at the next time step.
A. Taxonomy of Graph Neural Networks
1) Recurrent Graph Neural Networks:
These are mostly pioneer works of GNNs. RecGNNs aim to learn node representations with recurrent neural architectures. They assume a node in a graph constantly exchanges information/message with its neighbors until a stable equilibrium is reached. RecGNNs are conceptually important and inspired later research on ConvGNNs. In particular, the idea of message passing is inherited by spatial-based ConvGNNs.
2) Convolutional Graph Neural Networks:
These generalize the operation of convolution from grid data to graph data. The main idea is to generate a node $v$'s representation by aggregating its own features $\mathbf{x}_{v}$ and its neighbors' features $\mathbf{x}_{u}$, where $u\in N(v)$. Different from RecGNNs, ConvGNNs stack multiple graph convolutional layers to extract high-level node representations. ConvGNNs play a central role in building up many other complex GNN models. Fig. 2(a) shows a ConvGNN for node classification, and Fig. 2(b) shows a ConvGNN for graph classification.
3) Graph Autoencoders:
These are unsupervised learning frameworks that encode nodes/graphs into a latent vector space and reconstruct graph data from the encoded information. GAEs are used to learn network embeddings and graph generative distributions. For network embedding, GAEs learn latent node representations through reconstructing graph structural information, such as the graph adjacency matrix. For graph generation, some methods generate nodes and edges of a graph step by step, while other methods output a graph all at once. Fig. 2(c) presents a GAE for network embedding.
4) Spatial–Temporal Graph Neural Networks:
These aim to learn hidden patterns from spatial–temporal graphs, which becomes increasingly important in a variety of applications, such as traffic speed forecasting [72], driver maneuver anticipation [73], and human action recognition [75]. The key idea of STGNNs is to consider spatial dependence and temporal dependence at the same time. Many current approaches integrate graph convolutions to capture spatial dependence with RNNs or CNNs to model temporal dependence. Fig. 2(d) illustrates an STGNN for spatial–temporal graph forecasting.
B. Frameworks
With the graph structure and node content information as inputs, the outputs of GNNs can focus on different graph analytics tasks with one of the following mechanisms.
Node Level: Outputs relate to node regression and node classification tasks. RecGNNs and ConvGNNs can extract high-level node representations by information propagation/graph convolution. With a multilayer perceptron or a softmax layer as the output layer, GNNs are able to perform node-level tasks in an end-to-end manner.
Edge Level: Outputs relate to the edge classification and link prediction tasks. With two nodes’ hidden representations from GNNs as inputs, a similarity function or a neural network can be utilized to predict the label/connection strength of an edge.
Graph Level: Outputs relate to the graph classification task. To obtain a compact representation on the graph level, GNNs are often combined with pooling and readout operations. Detailed information about pooling and readouts will be reviewed in Section V-C.
Training Frameworks: Many GNNs (e.g., ConvGNNs) can be trained in a (semi)supervised or purely unsupervised way within an end-to-end learning framework, depending on the learning tasks and label information available at hand.
Semisupervised Learning for Node-Level Classification: Given a single network with partial nodes being labeled and others remaining unlabeled, ConvGNNs can learn a robust model that effectively identifies the class labels for the unlabeled nodes [22]. To this end, an end-to-end framework can be built by stacking a couple of graph convolutional layers followed by a softmax layer for multiclass classification.
Supervised Learning for Graph-Level Classification: Graph-level classification aims to predict the class label(s) for an entire graph [52], [54], [78], [79]. The end-to-end learning for this task can be realized with a combination of graph convolutional layers, graph pooling layers, and/or readout layers. While graph convolutional layers are responsible for extracting high-level node representations, graph pooling layers play the role of downsampling, which coarsens each graph into a substructure each time. A readout layer collapses node representations of each graph into a graph representation. By applying a multilayer perceptron and a softmax layer to graph representations, we can build an end-to-end framework for graph classification. An example is given in Fig. 2(b).
Unsupervised Learning for Graph Embedding: When no class labels are available in graphs, we can learn the graph embedding in a purely unsupervised way in an end-to-end framework. These algorithms exploit edge-level information in two ways. One simple way is to adopt an autoencoder framework, where the encoder employs graph convolutional layers to embed the graph into the latent representation upon which a decoder is used to reconstruct the graph structure [61], [62]. Another popular way is to utilize the negative sampling approach that samples a portion of node pairs as negative pairs, while existing node pairs with links in the graphs are positive pairs. Then, a logistic regression layer is applied to distinguish between the positive and negative pairs [42].
In Table III, we summarize the main characteristics of representative RecGNNs and ConvGNNs. Input sources, pooling layers, readout layers, and time complexity are compared among various models. In more detail, we only compare the time complexity of the message-passing/graph convolutional operation in each model. As the methods in [19] and [20] require eigenvalue decomposition, their time complexity is $O(n^{3})$. Most remaining methods incur $O(m)$ time complexity if the graph adjacency matrix is sparse and $O(n^{2})$ otherwise, where $m$ is the number of edges and $n$ is the number of nodes.
Recurrent Graph Neural Networks
RecGNNs are mostly pioneer works of GNNs. They apply the same set of parameters recurrently over nodes in a graph to extract high-level node representations. Constrained by computational power, earlier research is mainly focused on directed acyclic graphs [13], [80].
GNN* proposed by Scarselli et al. extends prior recurrent models to handle general types of graphs, e.g., acyclic, cyclic, directed, and undirected graphs [15]. Based on an information diffusion mechanism, GNN* updates nodes’ states by exchanging neighborhood information recurrently until a stable equilibrium is reached. A node’s hidden state is recurrently updated by \begin{equation*} \mathbf {h}_{v}^{(t)} = \sum _{u\in N(v)}f\big (\mathbf {x}_{v},\mathbf {x}^{\mathbf {e}}_{(v,u)}, \mathbf {x}_{u}, \mathbf {h}^{(t-1)}_{u}\big)\tag{1}\end{equation*} where $f(\cdot)$ is a parametric function and $\mathbf{h}_{v}^{(0)}$ is initialized randomly. To ensure convergence, the recurrent function $f(\cdot)$ must be a contraction mapping, which shrinks the distance between two points after projecting them into a latent space.
Gated GNN (GGNN) [17] employs a gated recurrent unit (GRU) [81] as a recurrent function, reducing the recurrence to a fixed number of steps. The advantage is that it no longer needs to constrain parameters to ensure convergence. A node hidden state is updated by its previous hidden states and its neighboring hidden states, defined as \begin{equation*} \mathbf {h}_{v}^{(t)} = \mathrm {GRU}\left ({\mathbf {h}_{v}^{(t-1)},\sum _{u\in N(v)}\mathbf {W}\mathbf {h}_{u}^{(t-1)}}\right)\tag{2}\end{equation*} where $\mathbf{h}_{v}^{(0)}=\mathbf{x}_{v}$.
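As a concrete rendering of (2), the sketch below implements one GGNN step with PyTorch's built-in GRUCell over a dense adjacency matrix. This is our simplification for illustration, not the reference implementation of [17]; all names are ours.

```python
import torch
import torch.nn as nn

class GGNNStep(nn.Module):
    """One recurrence step of a gated GNN, Eq. (2), on a dense adjacency matrix."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, A, H):
        # Aggregate linearly transformed neighbor states: sum_u W h_u^(t-1).
        M = A @ self.W(H)
        # The GRU treats the message as input and H as the previous state.
        return self.gru(M, H)

# Unrolled for a fixed number of steps; parameters are shared across steps.
step = GGNNStep(16)
A = torch.tensor([[0., 1.], [1., 0.]])  # toy two-node graph
H = torch.randn(2, 16)                  # h_v^(0) = x_v in GGNN
for _ in range(4):
    H = step(A, H)
```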
Stochastic steady-state embedding (SSE) proposes a learning algorithm that is more scalable to large graphs [18]. SSE updates node hidden states recurrently in a stochastic and asynchronous fashion. It alternatively samples a batch of nodes for state update and a batch of nodes for gradient computation. To maintain stability, the recurrent function of SSE is defined as a weighted average of the historical states and new states, which takes the form \begin{align*} \mathbf {h}_{v}^{(t)} = (1-\alpha)\mathbf {h}_{v}^{(t-1)}+\alpha \mathbf {W}_{1}\sigma \left ({\mathbf {W}_{2}\left [{\mathbf {x}_{v},\sum _{u\in N(v)}\big [\mathbf {h}_{u}^{(t-1)},\mathbf {x}_{u}\big]}\right]}\right)\tag{3}\end{align*} where $\alpha$ is a hyperparameter and $\mathbf{h}_{v}^{(0)}$ is initialized randomly.
Convolutional Graph Neural Networks
ConvGNNs are closely related to recurrent graph neural networks. Instead of iterating node states with contractive constraints, ConvGNNs address the cyclic mutual dependencies architecturally using a fixed number of layers with different weights in each layer. This key distinction is illustrated in Fig. 3. As graph convolutions are more efficient and convenient to composite with other neural networks, the popularity of ConvGNNs has been rapidly growing in recent years. ConvGNNs fall into two categories: spectral-based and spatial-based. Spectral-based approaches define graph convolutions by introducing filters from the perspective of graph signal processing [82], where the graph convolutional operation is interpreted as removing noises from graph signals. Spatial-based approaches inherit ideas from RecGNNs to define graph convolutions by information propagation. Since GCN [22] bridged the gap between spectral-based approaches and spatial-based approaches, spatial-based methods have developed rapidly recently due to their attractive efficiency, flexibility, and generality.
Fig. 3. RecGNNs versus ConvGNNs. (a) RecGNNs use the same graph recurrent layer (Grec) in updating node representations. (b) ConvGNNs use a different graph convolutional layer (Gconv) in updating node representations.
A. Spectral-Based ConvGNNs
Background: Spectral-based methods have a solid mathematical foundation in graph signal processing [82]–[84]. They assume graphs to be undirected. The normalized graph Laplacian matrix is a mathematical representation of an undirected graph, defined as $\mathbf{L}=\mathbf{I}_{n}-\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}$, where $\mathbf{D}$ is a diagonal matrix of node degrees with $D_{ii}=\sum_{j}A_{ij}$. The normalized graph Laplacian matrix is real symmetric positive semidefinite and can, therefore, be factored as $\mathbf{L}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T}$, where $\mathbf{U}$ is the matrix of eigenvectors ordered by eigenvalues and $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues. In graph signal processing, a graph signal $\mathbf{x}\in\mathbf{R}^{n}$ assigns a scalar to each node of a graph. The graph Fourier transform of a signal $\mathbf{x}$ is defined as $\mathscr{F}(\mathbf{x})=\mathbf{U}^{T}\mathbf{x}$, and the inverse graph Fourier transform is defined as $\mathscr{F}^{-1}(\hat{\mathbf{x}})=\mathbf{U}\hat{\mathbf{x}}$. Based on these definitions, the graph convolution of an input signal $\mathbf{x}$ with a filter $\mathbf{g}\in\mathbf{R}^{n}$ is defined as \begin{align*} \mathbf {x} \ast _{G} \mathbf {g}&=\mathscr {F}^{-1}(\mathscr {F}(\mathbf {x})\odot \mathscr {F}(\mathbf {g})) \\&=\mathbf {U}(\mathbf {U}^{T}\mathbf {x}\odot \mathbf {U}^{T} \mathbf {g}) \tag{4}\end{align*} where $\odot$ denotes the elementwise product.
If we denote a filter as $\mathbf{g}_{\theta}=\text{diag}(\mathbf{U}^{T}\mathbf{g})$, then the spectral graph convolution is simplified as \begin{equation*} \mathbf {x} \ast _{G} \mathbf {g_\theta } = \mathbf {U}\mathbf {g_\theta } \mathbf {U}^{T}\mathbf {x}.\tag{5}\end{equation*} Spectral-based ConvGNNs all follow this definition; the key difference lies in the choice of the filter $\mathbf{g}_{\theta}$.
Spectral CNN [19] assumes that the filter $\mathbf{g}_{\theta}=\boldsymbol{\Theta}_{i,j}^{(k)}$ is a set of learnable parameters and considers graph signals with multiple channels. The graph convolutional layer of the spectral CNN is defined as \begin{equation*} \mathbf {H}_{:,j}^{(k)} = \sigma \left ({\sum _{i=1}^{f_{k-1}}\mathbf {U}\boldsymbol {\Theta }_{i,j}^{(k)}\mathbf {U}^{T}\mathbf {H}_{:,i}^{(k-1)}}\right) \quad (j=1,2,\ldots, f_{k})\tag{6}\end{equation*} where $k$ is the layer index, $\mathbf{H}^{(k-1)}\in\mathbf{R}^{n\times f_{k-1}}$ is the input graph signal with $\mathbf{H}^{(0)}=\mathbf{X}$, $f_{k-1}$ is the number of input channels, $f_{k}$ is the number of output channels, and $\boldsymbol{\Theta}_{i,j}^{(k)}$ is a diagonal matrix filled with learnable parameters.
Chebyshev spectral CNN (ChebNet) [21] approximates the filter $\mathbf{g}_{\theta}$ by Chebyshev polynomials of the diagonal matrix of eigenvalues, that is, $\mathbf{g}_{\theta}=\sum_{i=0}^{K}\theta_{i}T_{i}(\tilde{\boldsymbol{\Lambda}})$, where $\tilde{\boldsymbol{\Lambda}}=2\boldsymbol{\Lambda}/\lambda_{\max}-\mathbf{I}_{n}$ and the values of $\tilde{\boldsymbol{\Lambda}}$ lie in $[-1,1]$. The Chebyshev polynomials are defined recursively by $T_{i}(\mathbf{x})=2\mathbf{x}T_{i-1}(\mathbf{x})-T_{i-2}(\mathbf{x})$ with $T_{0}(\mathbf{x})=1$ and $T_{1}(\mathbf{x})=\mathbf{x}$. As a result, the convolution of a graph signal $\mathbf{x}$ with the defined filter $\mathbf{g}_{\theta}$ is \begin{equation*} \mathbf {x} \ast _{G} \mathbf {g_\theta } = \mathbf {U}\left ({\sum _{i=0}^{K} \theta _{i} T_{i}(\tilde {\boldsymbol {\Lambda }})}\right)\mathbf {U}^{T}\mathbf {x}\tag{7}\end{equation*}
where $\tilde{\mathbf{L}}=2\mathbf{L}/\lambda_{\max}-\mathbf{I}_{n}$. As $T_{i}(\tilde{\mathbf{L}})=\mathbf{U}T_{i}(\tilde{\boldsymbol{\Lambda}})\mathbf{U}^{T}$, which can be proven by induction on $i$, ChebNet takes the form \begin{equation*} \mathbf {x} \ast _{G} \mathbf {g_\theta } = \sum _{i=0}^{K} \theta _{i}T_{i}(\tilde {\mathbf {L}})\mathbf {x}.\tag{8}\end{equation*} As an improvement over spectral CNN, the filters defined by ChebNet are localized in space, which means that they can extract local features independently of the graph size.
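Equation (8) is computationally attractive because the Chebyshev terms obey the recurrence $T_{i}(\tilde{\mathbf{L}})\mathbf{x}=2\tilde{\mathbf{L}}T_{i-1}(\tilde{\mathbf{L}})\mathbf{x}-T_{i-2}(\tilde{\mathbf{L}})\mathbf{x}$, so no eigendecomposition is needed. A minimal sketch of this recursion follows (our own illustration; it assumes $\tilde{\mathbf{L}}$ is precomputed as $2\mathbf{L}/\lambda_{\max}-\mathbf{I}_{n}$ and $K\geq 1$).

```python
import torch

def chebnet_filter(L_tilde, x, theta):
    """Apply the order-K Chebyshev filter of Eq. (8) to a graph signal x.

    L_tilde: (n, n) rescaled Laplacian, 2L/lambda_max - I.
    x:       (n,) graph signal.
    theta:   list of K+1 scalar filter coefficients (K >= 1 assumed).
    """
    t_prev, t_curr = x, L_tilde @ x                # T_0(L~)x and T_1(L~)x
    out = theta[0] * t_prev + theta[1] * t_curr
    for theta_i in theta[2:]:
        # Chebyshev recurrence: T_i = 2 L~ T_{i-1} - T_{i-2}.
        t_prev, t_curr = t_curr, 2 * L_tilde @ t_curr - t_prev
        out = out + theta_i * t_curr
    return out
```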
CayleyNet [23] further applies Cayley polynomials, which are parametric rational complex functions, to capture narrow frequency bands. The spectral graph convolution of CayleyNet is defined as \begin{equation*} \mathbf {x} \ast _{G} \mathbf {g_\theta } = c_{0}\mathbf {x}+2\text {Re}\left \{{\sum _{j=1}^{r}c_{j}(h\mathbf {L}- i\mathbf {I})^{j}(h\mathbf {L}+ i\mathbf {I})^{-j}\mathbf {x}}\right \}\tag{9}\end{equation*} where $\text{Re}(\cdot)$ returns the real part of a complex number, $c_{0}$ is a real coefficient, $c_{j}$ is a complex coefficient, $i$ is the imaginary unit, and $h$ is a parameter that controls the spectrum of a Cayley filter. While preserving spatial locality, CayleyNet shows that ChebNet can be considered as a special case of CayleyNet.
Graph convolutional network (GCN) [22] introduces a first-order approximation of ChebNet. Assuming that $K=1$ and $\lambda_{\max}=2$, (8) is simplified as \begin{equation*} \mathbf {x} \ast _{G} \mathbf {g_\theta }= \theta _{0}\mathbf {x}-\theta _{1}\mathbf {D}^{-\frac {1}{2}}\mathbf {A}\mathbf {D}^{-\frac {1}{2}}\mathbf {x}.\tag{10}\end{equation*}
To restrain the number of parameters and avoid overfitting, GCN further assumes $\theta=\theta_{0}=-\theta_{1}$, leading to the following definition of a graph convolution: \begin{equation*} \mathbf {x} \ast _{G} \mathbf {g_\theta } = \theta \left({\mathbf {I_{n}}+\mathbf {D}^{-\frac {1}{2}}\mathbf {A}\mathbf {D}^{-\frac {1}{2}}}\right)\mathbf {x}.\tag{11}\end{equation*}
To allow multiple channels of inputs and outputs, GCN modifies (11) into a compositional layer, defined as \begin{equation*} \mathbf {H}= \mathbf {X}\ast _{G} \mathbf {g_\Theta } = f(\bar {\mathbf {A}}\mathbf {X} \boldsymbol {\Theta })\tag{12}\end{equation*} where $\bar{\mathbf{A}}=\mathbf{I}_{n}+\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}$ and $f(\cdot)$ is an activation function. Using $\mathbf{I}_{n}+\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}$ empirically causes numerical instability for GCN. To address this problem, GCN applies a normalization trick, replacing $\bar{\mathbf{A}}$ by $\bar{\mathbf{A}}=\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}$ with $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{n}$ and $\tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij}$.
Being a spectral-based method, GCN can also be interpreted as a spatial-based method. From a spatial-based perspective, GCN can be considered as aggregating feature information from a node's neighborhood. Equation (12) can be expressed nodewise as \begin{equation*} \mathbf {h}_{v} = f\left ({\boldsymbol {\Theta }^{T}\left ({\sum _{u\in \{N(v)\cup v\}} \bar {A}_{v,u}\mathbf {x}_{u}}\right)}\right)\quad \forall v\in V. \tag{13}\end{equation*}
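A compact rendering of the layerwise rule (12) follows (a sketch under our own naming, not the reference implementation of [22]; the normalization uses $\bar{\mathbf{A}}=\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}$ as described above).

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolutional layer of Eq. (12): H = f(A_bar X Theta)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # Theta

    @staticmethod
    def normalize(A):
        # A_bar = D~^{-1/2} (A + I) D~^{-1/2}, the GCN normalization trick.
        A_tilde = A + torch.eye(A.shape[0])
        d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
        return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

    def forward(self, A, X):
        return torch.relu(self.normalize(A) @ self.linear(X))
```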
Several recent works made incremental improvements over GCN [22] by exploring alternative symmetric matrices. Adaptive GCN (AGCN) [40] learns hidden structural relations unspecified by the graph adjacency matrix. It constructs a so-called residual graph adjacency matrix through a learnable distance function that takes two nodes’ features as inputs. Dual GCN (DGCN) [41] introduces a dual-graph convolutional architecture with two graph convolutional layers in parallel. While these two layers share parameters, they use the normalized adjacency matrix $\bar{\mathbf{A}}$ and the positive pointwise mutual information (PPMI) matrix, respectively. The PPMI matrix captures node co-occurrence information through random walks sampled from a graph and is defined as \begin{equation*} \mathbf {PPMI}_{v_{1},v_{2}} = \max \left ({\log \left ({\frac {\text {count}(v_{1},v_{2})\cdot |D|}{\text {count}(v_{1})\text {count}(v_{2})}}\right),0}\right)\tag{14}\end{equation*} where $v_{1},v_{2}\in V$ and $|D|=\sum_{v_{1},v_{2}}\text{count}(v_{1},v_{2})$. By ensembling the outputs from the dual-graph convolutional layers, DGCN encodes both local and global structural information without the need to stack multiple graph convolutional layers.
B. Spatial-Based ConvGNNs
Analogous to the convolutional operation of a conventional CNN on an image, spatial-based methods define graph convolutions based on a node’s spatial relations. Images can be considered as a special form of a graph with each pixel representing a node. Each pixel is directly connected to its nearby pixels, as shown in Fig. 1(a). A filter is applied to a $3\times 3$ patch by taking the weighted average of pixel values of the central node and its neighbors across each channel. Similarly, spatial-based graph convolutions convolve the central node's representation with its neighbors' representations to derive the updated representation for the central node, as illustrated in Fig. 1(b). From another perspective, spatial-based ConvGNNs share the same idea of information propagation/message passing with RecGNNs. The spatial graph convolutional operation essentially propagates node information along edges.
The neural network for graphs (NN4G) [24], proposed in parallel with GNN*, is the first work toward spatial-based ConvGNNs. Distinctively different from RecGNNs, NN4G learns graph mutual dependence through a compositional neural architecture with independent parameters at each layer. The neighborhood of a node can be extended through the incremental construction of the architecture. NN4G performs graph convolutions by summing up a node’s neighborhood information directly. It also applies residual connections and skip connections to memorize information over each layer. As a result, NN4G derives its next-layer node states by \begin{equation*} \mathbf {h}_{v}^{(k)} = f\left ({\mathbf {W}^{(k)^{T}}\mathbf {x}_{v}+\sum _{i=1}^{k-1}\sum _{u\in N(v)}\boldsymbol {\Theta }^{(k)^{T}}\mathbf {h}_{u}^{(k-1)}}\right) \tag{15}\end{equation*}
where $f(\cdot)$ is an activation function and $\mathbf{h}_{v}^{(0)}=\mathbf{0}$. Equation (15) can also be written in matrix form \begin{equation*} \mathbf {H}^{(k)} = f\left ({\mathbf {X}\mathbf {W}^{(k)}+\sum _{i=1}^{k-1}\mathbf {A}\mathbf {H}^{(k-1)}\boldsymbol {\Theta }^{(k)}}\right)\tag{16}\end{equation*} which resembles GCN [22]. One difference is that NN4G uses the unnormalized adjacency matrix, which may potentially cause hidden node states to have extremely different scales.
Diffusion CNN (DCNN) [25] regards graph convolutions as a diffusion process. It assumes that information is transferred from one node to one of its neighboring nodes with a certain transition probability so that the information distribution can reach equilibrium after several rounds. DCNN defines the diffusion graph convolution as \begin{equation*} \mathbf {H}^{(k)} = f(\mathbf {W}^{(k)}\odot \mathbf {P}^{k} \mathbf {X})\tag{17}\end{equation*} where $f(\cdot)$ is an activation function and the probability transition matrix $\mathbf{P}\in\mathbf{R}^{n\times n}$ is computed by $\mathbf{P}=\mathbf{D}^{-1}\mathbf{A}$. Note that, in DCNN, the hidden representation matrix $\mathbf{H}^{(k)}$ has the same dimension as the input feature matrix $\mathbf{X}$ and is not a function of its previous hidden representation matrix. DCNN concatenates $\mathbf{H}^{(1)},\mathbf{H}^{(2)},\ldots,\mathbf{H}^{(K)}$ together as the final model outputs.
As the stationary distribution of a diffusion process is a summation of power series of probability transition matrices, diffusion graph convolution (DGC) [72] sums up the outputs at each diffusion step instead of concatenating them, defined by \begin{equation*} \mathbf {H} = \sum _{k=0}^{K} f(\mathbf {P}^{k}\mathbf {X}\mathbf {W}^{(k)})\tag{18}\end{equation*} where $\mathbf{W}^{(k)}$ is a layer-specific weight matrix and $f(\cdot)$ is an activation function. Using the power of a transition probability matrix implies that distant neighbors contribute very little information to a central node.
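Read literally, (18) can be implemented by accumulating hop-wise diffusion terms (a sketch under our own naming; $\mathbf{P}=\mathbf{D}^{-1}\mathbf{A}$ as above, and isolated nodes would need special handling).

```python
import torch

def diffusion_graph_conv(A, X, weights, f=torch.relu):
    """DGC of Eq. (18): sum of hop-wise diffusion terms f(P^k X W^(k)).

    A:       (n, n) adjacency matrix (no isolated nodes assumed).
    X:       (n, d) node feature matrix.
    weights: list of K+1 weight matrices of shape (d, d_out), one per hop.
    """
    P = A / A.sum(dim=1, keepdim=True)   # transition matrix D^{-1} A
    PX = X                               # P^0 X = X
    H = f(PX @ weights[0])               # k = 0 term
    for W in weights[1:]:
        PX = P @ PX                      # P^k X computed incrementally
        H = H + f(PX @ W)
    return H
```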
PGC-DGCNN [46] increases the contributions of distant neighbors based on shortest paths. It defines a shortest path adjacency matrix $\mathbf{S}^{(j)}$, where $S^{(j)}_{v,u}=1$ if the shortest path from node $v$ to node $u$ is of length $j$ and $S^{(j)}_{v,u}=0$ otherwise. With a hyperparameter $r$ controlling the receptive field size, it introduces the graph convolutional operation \begin{equation*} \mathbf {H}^{(k)} = {\parallel }_{j=0}^{r} f((\tilde {\mathbf {D}}^{(j)})^{-1}\mathbf {S}^{(j)}\mathbf {H}^{(k-1)}\mathbf {W}^{(j,k)})\tag{19}\end{equation*} where $\mathbf{H}^{(0)}=\mathbf{X}$ and $\parallel$ represents the concatenation of vectors.
Partition graph convolution (PGC) [75] partitions a node's neighbors into $Q$ groups based on certain criteria, not limited to shortest paths. It constructs $Q$ adjacency matrices according to the defined neighborhood of each group, applies GCN [22] with a different parameter matrix to each neighbor group, and sums the results \begin{equation*} \mathbf {H}^{(k)} = \sum _{j=1}^{Q} \bar {\mathbf {A}}^{(j)}\mathbf {H}^{(k-1)}\mathbf {W}^{(j,k)}\tag{20}\end{equation*} where $\mathbf{H}^{(0)}=\mathbf{X}$ and $\bar{\mathbf{A}}^{(j)}$ is the normalized adjacency matrix of the $j$th group.
The message-passing neural network (MPNN) [27] outlines a general framework of spatial-based ConvGNNs. It treats graph convolutions as a message-passing process in which information can be passed from one node to another along edges directly. MPNN runs K-step message-passing iterations to let information propagate further. The message-passing function (namely, the spatial graph convolution) is defined as \begin{equation*} \mathbf {h}_{v}^{(k)} = U_{k}\left ({\mathbf {h}_{v}^{(k-1)},\sum _{u\in N(v)} M_{k}\big (\mathbf {h}_{v}^{(k-1)},\mathbf {h}_{u}^{(k-1)},\mathbf {x}^{e}_{vu}\big)}\right)\tag{21}\end{equation*}
where $\mathbf{h}_{v}^{(0)}=\mathbf{x}_{v}$, $U_{k}(\cdot)$ is an update function, and $M_{k}(\cdot)$ is a message function with learnable parameters. After deriving the hidden representations of each node, $\mathbf{h}_{v}^{(K)}$ can be passed to an output layer to perform node-level prediction tasks or to a readout function to perform graph-level prediction tasks. The readout function generates a representation of the entire graph based on node hidden representations, defined as \begin{equation*} \mathbf {h}_{G} = R\big ({\mathbf {h}_{v}^{(K)}|v\in G}\big)\tag{22}\end{equation*} where $R(\cdot)$ represents the readout function with learnable parameters.
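The following skeleton makes the MPNN abstraction of (21) and (22) explicit. It is our schematic rendering rather than the reference code of [27]: the message, update, and readout functions are placeholders to be instantiated per model.

```python
import torch

def mpnn(adj_list, h, x_edge, message_fn, update_fn, readout_fn, K):
    """Schematic MPNN: K message-passing steps, Eq. (21), then a readout, Eq. (22).

    adj_list:  adj_list[v] lists the neighbors of node v (assumed nonempty).
    h:         (n, d) tensor of initial node states, h_v^(0) = x_v.
    x_edge:    dict mapping a pair (v, u) to the edge feature vector x^e_vu.
    """
    for _ in range(K):
        new_h = torch.empty_like(h)
        for v, neighbors in enumerate(adj_list):
            # Sum messages from all neighbors of v, the inner sum of Eq. (21).
            m = sum(message_fn(h[v], h[u], x_edge[(v, u)]) for u in neighbors)
            # Update the node state from its previous state and the message.
            new_h[v] = update_fn(h[v], m)
        h = new_h
    # Graph-level representation from the final node states, Eq. (22).
    return readout_fn(h)
```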
Graph isomorphism network (GIN) [57] finds that previous MPNN-based methods are incapable of distinguishing different graph structures based on the graph embeddings they produce. To amend this flaw, GIN adjusts the weight of the central node by a learnable parameter $\epsilon^{(k)}$. It performs graph convolutions by \begin{equation*} \mathbf {h}_{v}^{(k)} = \text {MLP}\left ({(1+\epsilon ^{(k)})\mathbf {h}_{v}^{(k-1)}+\sum _{u\in N(v)}\mathbf {h}_{u}^{(k-1)}}\right)\tag{23}\end{equation*} where $\text{MLP}(\cdot)$ represents a multilayer perceptron.
As the number of neighbors of a node can vary from one to a thousand or even more, it is inefficient to take the full size of a node’s neighborhood. GraphSage [42] adopts sampling to obtain a fixed number of neighbors for each node. It performs graph convolutions by \begin{equation*} \mathbf {h}^{(k)}_{v}=\sigma \big (\mathbf {W}^{(k)}\cdot f_{k}\big (\mathbf {h}_{v}^{(k-1)},\big \{\mathbf {h}_{u}^{(k-1)}\forall u \in S_{\mathcal {N}(v)}\big \}\big)\big)\tag{24}\end{equation*} where $\mathbf{h}_{v}^{(0)}=\mathbf{x}_{v}$, $f_{k}(\cdot)$ is an aggregation function, and $S_{\mathcal{N}(v)}$ is a random sample of node $v$'s neighbors. The aggregation function should be invariant to permutations of node orderings, such as the mean, sum, or max function.
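Below is a sketch of one GraphSage-style layer with uniform neighbor sampling (with replacement) and a mean aggregator, one of the aggregators proposed in [42]; the names and the dense formulation are our own simplification, not the reference implementation.

```python
import random
import torch
import torch.nn as nn

class SageLayer(nn.Module):
    """GraphSage-style layer of Eq. (24) with a mean aggregator."""

    def __init__(self, in_dim, out_dim, num_samples=10):
        super().__init__()
        # Concatenation of self and neighbor summaries doubles the input size.
        self.linear = nn.Linear(2 * in_dim, out_dim)
        self.num_samples = num_samples

    def forward(self, adj_list, H):
        rows = []
        for v, neighbors in enumerate(adj_list):
            # Sample a fixed-size neighborhood S_N(v) with replacement.
            sampled = random.choices(neighbors, k=self.num_samples)
            neigh = H[sampled].mean(dim=0)           # mean aggregator f_k
            rows.append(torch.cat([H[v], neigh]))    # [h_v, aggregated]
        return torch.sigmoid(self.linear(torch.stack(rows)))
```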
Graph attention network (GAT) [43] assumes that contributions of neighboring nodes to the central node are neither identical like GraphSage [42], nor predetermined like GCN [22] (this difference is illustrated in Fig. 4). GAT adopts attention mechanisms to learn the relative weights between two connected nodes. The graph convolutional operation according to GAT is defined as \begin{equation*} \mathbf {h}_{v}^{(k)} = \sigma \left ({\sum _{u\in \mathcal {N}(v)\cup v}\alpha _{vu}^{(k)}\mathbf {W}^{(k)}\mathbf {h}_{u}^{(k-1)}}\right)\tag{25}\end{equation*}
where $\mathbf{h}_{v}^{(0)}=\mathbf{x}_{v}$. The attention weight $\alpha_{vu}^{(k)}$ measures the connective strength between node $v$ and its neighbor $u$ \begin{equation*} \alpha _{vu}^{(k)} = \text {softmax}\big (g\big (\mathbf {a}^{T}\big [\mathbf {W}^{(k)}\mathbf {h}_{v}^{(k-1)}\big \|\mathbf {W}^{(k)}\mathbf {h}_{u}^{(k-1)}\big]\big)\big)\tag{26}\end{equation*} where $g(\cdot)$ is a leaky ReLU activation function and $\mathbf{a}$ is a vector of learnable parameters. The softmax function ensures that the attention weights sum to one over all neighbors of node $v$. GAT further performs multihead attention to increase the model's expressive power.
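The per-node attention weights of (26) can be computed as below (a simplified single-head sketch under our own naming, with $g$ a leaky ReLU; not the reference implementation of [43]).

```python
import torch
import torch.nn.functional as F

def gat_weights(a, Wh, neighbors_of_v, v):
    """Attention coefficients alpha_vu of Eq. (26) for a single node v.

    a:  (2d,) learnable attention vector.
    Wh: (n, d) node states already transformed by W^(k).
    """
    idx = neighbors_of_v + [v]                        # N(v) union {v}
    scores = torch.stack([
        F.leaky_relu(a @ torch.cat([Wh[v], Wh[u]]))   # g(a^T [Wh_v || Wh_u])
        for u in idx
    ])
    return F.softmax(scores, dim=0)                   # normalize over neighbors
```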
Fig. 4. Differences between GCN [22] and GAT [43]. (a) GCN [22] explicitly assigns a nonparametric weight $a_{ij}=1/\sqrt{\deg(v_{i})\deg(v_{j})}$ to the neighbor $v_{j}$ of $v_{i}$ during the aggregation process. (b) GAT [43] implicitly captures the weight $a_{ij}$ via an end-to-end neural network architecture so that more important nodes receive larger weights.
Mixture model network (MoNet) [44] adopts a different approach to assign different weights to a node’s neighbors. It introduces node pseudocoordinates to determine the relative position between a node and its neighbor. Once the relative position between two nodes is known, a weight function maps the relative position to the relative weight between these two nodes. In such a way, the parameters of a graph filter can be shared across different locations. Under the MoNet framework, several existing approaches for manifolds, such as geodesic CNN (GCNN) [90], anisotropic CNN (ACNN) [91], and spline CNN [92], and for graphs, such as GCN [22] and DCNN [25], can be generalized as special instances of MoNet by constructing nonparametric weight functions. MoNet additionally proposes a Gaussian kernel with learnable parameters to learn the weight function adaptively.
Another distinct line of works achieves weight sharing across different locations by ranking a node’s neighbors based on certain criteria and associating each ranking with a learnable weight. PATCHY-SAN [26] orders the neighbors of each node according to their graph labelings and selects the top-$q$ of them. Graph labelings are essentially node scores, which can be derived by node degree, centrality, or WL color [93]. As each node now has a fixed number of ordered neighbors, graph-structured data can be converted into grid-structured data, and a standard 1-D convolutional filter can be applied, with the order of the filter's weights corresponding to the order of a node's neighbors.
1) Improvement in Terms of Training Efficiency:
Training ConvGNNs, such as GCN [22], usually requires saving the whole graph data and the intermediate states of all nodes in memory. The full-batch training algorithm for ConvGNNs suffers significantly from the memory overflow problem, especially when a graph contains millions of nodes. To save memory, GraphSage [42] proposes a batch-training algorithm for ConvGNNs. It samples a tree rooted at each node by recursively expanding the root node’s neighborhood by $K$ steps with a fixed sample size. For each sampled tree, GraphSage computes the root node's hidden representation by hierarchically aggregating hidden node representations from bottom to top.
Fast learning with GCN (FastGCN) [49] samples a fixed number of nodes for each graph convolutional layer instead of sampling a fixed number of neighbors for each node like GraphSage [42]. It interprets graph convolutions as integral transforms of embedding functions of nodes under probability measures. The Monte Carlo approximation and variance reduction techniques are employed to facilitate the training process. As FastGCN samples nodes independently for each layer, between-layer connections are potentially sparse. Huang et al. [51] propose an adaptive layerwise sampling approach, where node sampling for the lower layer is conditioned on the top one. This method achieves higher accuracy compared with FastGCN at the cost of employing a much more complicated sampling scheme.
In another work, stochastic training of GCNs (StoGCN) [50] reduces the receptive field size of a graph convolution to an arbitrarily small scale using historical node representations as a control variate. StoGCN achieves comparable performance even with two neighbors per node. However, StoGCN still has to save intermediate states of all nodes, which is memory consuming for large graphs.
Cluster-GCN [58] samples a subgraph using a graph clustering algorithm and performs graph convolutions on the nodes within the sampled subgraph. As the neighborhood search is also restricted within the sampled subgraph, Cluster-GCN is capable of handling larger graphs and using deeper architectures at the same time, in less time and with less memory. Cluster-GCN notably provides a straightforward comparison of the time and memory complexity of existing ConvGNN training algorithms. We analyze its results based on Table IV.
In Table IV, GCN [22] is the baseline method that conducts the full-batch training. GraphSage saves memory at the cost of sacrificing time efficiency. Meanwhile, the time and memory complexities of GraphSage grow exponentially with an increase of $K$, the number of graph convolutional layers. The time complexity of StoGCN is the highest, and the bottleneck of its memory remains unsolved; however, StoGCN can work with a very small neighborhood sample size. The time complexity of Cluster-GCN remains the same as the baseline method since it does not introduce redundant computations. Of all the methods, Cluster-GCN realizes the lowest memory complexity.
2) Comparison Between Spectral and Spatial Models:
Spectral models have a theoretical foundation in graph signal processing. By designing new graph signal filters (e.g., Cayleynets [23]), one can build new ConvGNNs. However, spatial models are preferred over spectral models due to efficiency, generality, and flexibility issues. First, spectral models are less efficient than spatial models. Spectral models either need to perform eigenvector computation or handle the whole graph at the same time. Spatial models are more scalable to large graphs as they directly perform convolutions in the graph domain via information propagation. The computation can be performed in a batch of nodes instead of the whole graph. Second, spectral models that rely on a graph Fourier basis generalize poorly to new graphs. They assume a fixed graph. Any perturbations to a graph would result in a change of eigenbasis. Spatial-based models, on the other hand, perform graph convolutions locally on each node, where weights can be easily shared across different locations and structures. Third, spectral-based models are limited to operate on undirected graphs. Spatial-based models are more flexible to handle multisource graph inputs, such as edge inputs [15], [27], [86], [95], [96], directed graphs [25], [72], signed graphs [97], and heterogeneous graphs [98], [99], because these graph inputs can be incorporated into the aggregation function easily.
C. Graph Pooling Modules
After a GNN generates node features, we can use them for the final task. However, using all these features directly can be computationally challenging; thus, a downsampling strategy is needed. Depending on the objective and the role it plays in the network, different names are given to this strategy.
The pooling operation aims to reduce the number of parameters by downsampling the nodes to generate smaller representations, thereby avoiding overfitting, achieving permutation invariance, and easing computational complexity.
The readout operation is mainly used to generate a graph-level representation based on node representations. The two mechanisms are very similar.
In some earlier works, graph coarsening algorithms use eigendecomposition to coarsen graphs based on their topological structure. However, these methods suffer from high time complexity. The Graclus algorithm [100] is an alternative to eigendecomposition for calculating a clustered version of the original graph. Some recent works [23] employ it as a pooling operation to coarsen graphs.
Nowadays, mean/max/sum pooling is the most primitive and effective way to implement downsampling since calculating the mean/max/sum value in the pooling window is fast: \begin{equation*} \mathbf {h}_{G} = \text {mean}/\max /\text {sum}\big (\mathbf {h}_{1}^{(K)}, \mathbf {h}_{2}^{(K)}, \ldots, \mathbf {h}_{n}^{(K)}\big)\tag{27}\end{equation*} where $K$ is the index of the last graph convolutional layer.
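In code, these readouts are one-liners over the final node-state matrix (our own sketch).

```python
import torch

def readout(H, mode="mean"):
    """Graph-level representation of Eq. (27) from node states H of shape (n, d)."""
    if mode == "mean":
        return H.mean(dim=0)
    if mode == "max":
        return H.max(dim=0).values
    return H.sum(dim=0)  # "sum"
```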
Henaff et al. [20] show that performing a simple max/mean pooling at the beginning of the network is especially important to reduce the dimensionality in the graph domain and mitigate the cost of the expensive graph Fourier transform operation. Furthermore, some works [17], [27], [46] also use attention mechanisms to enhance the mean/sum pooling.
Even with attention mechanisms, the reduction operation (such as sum pooling) is not satisfactory since it makes the embedding inefficient; a fixed-size embedding is generated regardless of the graph size. Vinyals et al. [101] propose the Set2Set method to generate a memory that increases with the size of the input. It then implements an LSTM that intends to integrate order-dependent information into the memory embedding before a reduction is applied that would, otherwise, destroy that information.
Defferrard et al. [21] address this issue in another way by rearranging nodes of a graph in a meaningful way. They devise an efficient pooling strategy in their approach ChebNet. Input graphs are first coarsened into multiple levels by the Graclus algorithm [100]. After coarsening, the nodes of the input graph and its coarsened version are rearranged into a balanced binary tree. Arbitrarily aggregating the balanced binary tree from bottom to top will arrange similar nodes together. Pooling such a rearranged signal is much more efficient than pooling the original.
Zhang et al. [52] propose the DGCNN with a similar pooling strategy named SortPooling that performs pooling by rearranging nodes to a meaningful order. Different from ChebNet [21], DGCNN sorts nodes according to their structural roles within the graph. The graph’s unordered node features from spatial graph convolutions are treated as continuous WL colors [93] and are then used to sort nodes. In addition to sorting the node features, it unifies the graph size to $q$ by truncating or extending the node feature matrix: the last $n-q$ rows are deleted if $n>q$; otherwise, $q-n$ zero rows are added.
The aforementioned pooling methods mainly consider graph features and ignore the structural information of graphs. Recently, a differentiable pooling (DiffPool) [54] is proposed, which can generate hierarchical representations of graphs. Compared with all previous coarsening methods, DiffPool does not simply cluster the nodes in a graph but learns a cluster assignment matrix $\mathbf{S}^{(k)}\in\mathbf{R}^{n_{k}\times n_{k+1}}$ at layer $k$, where $n_{k}$ is the number of nodes at the $k$th layer. The probability values in $\mathbf{S}^{(k)}$ are generated based on both node features and topological structure using \begin{equation*} \mathbf {S}^{(k)} = \text {softmax}(\text {ConvGNN}_{k}(\mathbf {A}^{(k)}, \mathbf {H}^{(k)})).\tag{28}\end{equation*} The drawback of DiffPool is that it generates dense graphs after pooling, and thereafter, the computational complexity becomes $O(n^{2})$.
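Given the assignment matrix from (28), one DiffPool coarsening step reduces both the adjacency matrix and the node embeddings (a sketch; the coarsening rules follow [54]).

```python
import torch

def diffpool_step(A, Z, S):
    """Coarsen a graph with a learned soft cluster assignment, following [54].

    A: (n, n) adjacency matrix at the current level.
    Z: (n, d) node embeddings at the current level.
    S: (n, c) soft cluster assignment of Eq. (28), rows sum to 1.
    """
    A_next = S.T @ A @ S   # (c, c) coarsened adjacency between clusters
    X_next = S.T @ Z       # (c, d) cluster embeddings
    return A_next, X_next
```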
Most recently, the SAGPool [102] approach is proposed, which considers both node features and graph topology and learns the pooling in a self-attention manner.
Overall, pooling is an essential operation to reduce graph size. How to improve the effectiveness and computational complexity of pooling is an open question for investigation.
D. Discussion of Theoretical Aspects
We discuss the theoretical foundation of GNNs from different perspectives.
1) Shape of Receptive Field:
The receptive field of a node is the set of nodes that contribute to the determination of its final node representation. When compositing multiple spatial graph convolutional layers, the receptive field of a node grows one step ahead toward its distant neighbors each time. Micheli [24] proves that a finite number of spatial graph convolutional layers exists such that, for each node $v\in V$, the receptive field of node $v$ covers all nodes in the graph. As a result, a ConvGNN is able to extract global information by stacking local graph convolutional layers.
2) VC Dimension:
The VC dimension is a measure of model complexity defined as the largest number of points that can be shattered by a model. There are few works on analyzing the VC dimension of GNNs. Given the number of model parameters $p$ and the number of nodes $n$, Scarselli et al. [103] derive that the VC dimension of a GNN* [15] is $O(p^{4}n^{2})$ if it uses the sigmoid or hyperbolic tangent activation and $O(p^{2}n)$ if it uses the piecewise polynomial activation function. This result suggests that the model complexity of a GNN* [15] increases rapidly with $p$ and $n$ when the sigmoid or hyperbolic tangent activation is used.
3) Graph Isomorphism:
Two graphs are isomorphic if they are topologically identical. Given two nonisomorphic graphs $G_{1}$ and $G_{2}$, Xu et al. [57] prove that if a GNN maps $G_{1}$ and $G_{2}$ to different embeddings, these two graphs can be identified as nonisomorphic by the Weisfeiler–Lehman (WL) test of isomorphism [93]. They show that common GNNs, such as GCN [22] and GraphSage [42], are incapable of distinguishing certain different graph structures. Xu et al. [57] further prove that if the aggregation functions and the readout function of a GNN are injective, the GNN is at most as powerful as the WL test in distinguishing different graphs.
4) Equivariance and Invariance:
A GNN must be an equivariant function when performing node-level tasks and must be an invariant function when performing graph-level tasks. For node-level tasks, let $f(\mathbf{A},\mathbf{X})\in\mathbf{R}^{n\times d}$ be a GNN and $\mathbf{Q}$ be any permutation matrix that changes the order of nodes. A GNN is equivariant if it satisfies $f(\mathbf{Q}\mathbf{A}\mathbf{Q}^{T},\mathbf{Q}\mathbf{X})=\mathbf{Q}f(\mathbf{A},\mathbf{X})$. For graph-level tasks, let $f(\mathbf{A},\mathbf{X})\in\mathbf{R}^{d}$. A GNN is invariant if it satisfies $f(\mathbf{Q}\mathbf{A}\mathbf{Q}^{T},\mathbf{Q}\mathbf{X})=f(\mathbf{A},\mathbf{X})$. In order to achieve equivariance or invariance, the components of a GNN must be invariant to node orderings. Maron et al. [104] theoretically study the characteristics of permutation invariant and equivariant linear layers for graph data.
5) Universal Approximation:
It is well known that multilayer feedforward neural networks with one hidden layer can approximate any Borel measurable function [105]. The universal approximation capability of GNNs has seldom been studied. Hammer et al. [106] prove that cascade correlation can approximate functions with structured outputs. Scarselli et al. [107] prove that a RecGNN [15] can approximate any function that preserves unfolding equivalence up to any degree of precision. Two nodes are unfolding equivalent if their unfolding trees are identical, where the unfolding tree of a node is constructed by iteratively extending a node’s neighborhood at a certain depth. Xu et al. [57] show that ConvGNNs under the framework of message passing [27] are not universal approximators of continuous functions defined on multisets. Maron et al. [104] prove that an invariant graph network can approximate an arbitrary invariant function defined on graphs.
Graph Autoencoders
GAEs are deep neural architectures that map nodes into a latent feature space and decode graph information from latent representations. GAEs can be used to learn network embeddings or generate new graphs. The main characteristics of selected GAEs are summarized in Table V. In the following, we provide a brief review of GAEs from two perspectives: network embedding and graph generation.
A. Network Embedding
A network embedding is a low-dimensional vector representation of a node that preserves a node’s topological information. GAEs learn network embeddings using an encoder to extract network embeddings and using a decoder to enforce network embeddings to preserve the graph topological information, such as the PPMI matrix and the adjacency matrix.
Earlier approaches mainly employ multilayer perceptrons to build GAEs for network embedding learning. Deep neural networks for graph representations (DNGR) [59] uses a stacked denoising autoencoder [108] to encode and decode the PPMI matrix via multilayer perceptrons. Concurrently, structural deep network embedding (SDNE) [60] uses a stacked autoencoder to preserve the node first-order proximity and second-order proximity jointly. SDNE proposes two loss functions on the outputs of the encoder and the outputs of the decoder separately. The first loss function enables the learned network embeddings to preserve the node first-order proximity by minimizing the distance between a node’s network embedding and its neighbors’ network embeddings. It is defined as \begin{equation*} L_{\text {1st}} = \sum _{(v,u)\in E} A_{v,u}||\text {enc}(\mathbf {x}_{v})-\text {enc}(\mathbf {x}_{u})||^{2}\tag{29}\end{equation*}
The second loss function enables the learned network embeddings to preserve the node second-order proximity by minimizing the distance between a node's inputs and its reconstructed inputs \begin{equation*} L_{\text {2nd}} = \sum _{v \in V}~||(\text {dec}(\text {enc}(\mathbf {x}_{v}))-\mathbf {x}_{v})\odot \mathbf {b}_{v}||^{2}\tag{30}\end{equation*} where $b_{v,u}=1$ if $A_{v,u}=0$ and $b_{v,u}=\beta>1$ if $A_{v,u}=1$, with $\beta$ a hyperparameter.
DNGR [59] and SDNE [60] only consider node structural information, which concerns the connectivity between pairs of nodes. They ignore node feature information that depicts the attributes of the nodes themselves. GAE* [61] leverages GCN [22] to encode node structural information and node feature information at the same time. The encoder of GAE* consists of two graph convolutional layers, which takes the form \begin{equation*} \mathbf {Z} = \text {enc}(\mathbf {X},\mathbf {A}) = \text {Gconv}(f(\text {Gconv}(\mathbf {A},\mathbf {X};\boldsymbol {\Theta _{1}}));\boldsymbol {\Theta _{2})}\tag{31}\end{equation*} where $\mathbf{Z}$ denotes the network embedding matrix of a graph and $f(\cdot)$ is a ReLU activation function.
The decoder of GAE* aims to decode node relational information from the embeddings by reconstructing the graph adjacency matrix \begin{equation*} \hat {\mathbf {A}}_{v,u} = \text {dec}(\mathbf {z}_{v},\mathbf {z}_{u}) = \sigma \big (\mathbf {z}_{v}^{T}\mathbf {z}_{u}\big)\tag{32}\end{equation*} where $\mathbf{z}_{v}$ is the embedding of node $v$. GAE* is trained by minimizing the discrepancy between the real adjacency matrix $\mathbf{A}$ and the reconstructed adjacency matrix $\hat{\mathbf{A}}$.
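Taken together, (31) and (32) amount to a few lines (a condensed sketch under our own naming; $\bar{\mathbf{A}}$ denotes a normalized adjacency matrix, and the nonlinearity between the two encoder layers is a ReLU).

```python
import torch

def gae_forward(A_bar, X, theta1, theta2):
    """GAE*-style forward pass: two-layer GCN encoder, Eq. (31),
    followed by an inner-product decoder, Eq. (32).

    A_bar:          (n, n) normalized adjacency matrix.
    X:              (n, d) node feature matrix.
    theta1, theta2: weight matrices of the two graph convolutional layers.
    """
    H = torch.relu(A_bar @ X @ theta1)   # first graph convolutional layer
    Z = A_bar @ H @ theta2               # network embeddings, Eq. (31)
    A_hat = torch.sigmoid(Z @ Z.T)       # reconstructed adjacency, Eq. (32)
    return Z, A_hat
```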
Simply reconstructing the graph adjacency matrix may lead to overfitting due to the capacity of the autoencoders. Variational GAE (VGAE) [61] is a variational version of GAE to learn the distribution of data. VGAE optimizes the variational lower bound \begin{equation*} L = E_{q(\mathbf {Z}|\mathbf {X},\mathbf {A})}[\log p(\mathbf {A}|\mathbf {Z})]-KL[q(\mathbf {Z}|\mathbf {X},\mathbf {A})||p(\mathbf {Z})]\tag{33}\end{equation*} where $KL(\cdot)$ is the Kullback–Leibler divergence, $p(\mathbf{Z})$ is a Gaussian prior, and $q(\mathbf{Z}|\mathbf{X},\mathbf{A})$ is a Gaussian posterior whose mean and variance are produced by two ConvGNN-based encoders.
Similar to GAE*, GraphSage [42] encodes node features with two graph convolutional layers. Instead of optimizing the reconstruction error, GraphSage shows that the relational information between two nodes can be preserved by negative sampling with the loss \begin{align*} L(\mathbf {z}_{v}) = -\log (\text {dec}(\mathbf {z}_{v},\mathbf {z}_{u}))-QE_{v_{n}\sim P_{n}(v)}\log \big (1-\text {dec}(\mathbf {z}_{v},\mathbf {z}_{v_{n}})\big)\tag{34}\end{align*} where node $u$ is a neighbor of node $v$, node $v_{n}$ is a distant node sampled from a negative sampling distribution $P_{n}(v)$, and $Q$ is the number of negative samples.
The aforementioned methods essentially learn network embeddings by solving a link prediction problem. However, the sparsity of a graph causes the number of positive node pairs to be far less than the number of negative node pairs. To alleviate this data sparsity problem in learning network embeddings, another line of works converts a graph into sequences by random permutations or random walks. In such a way, deep learning approaches that are applicable to sequences can be directly used to process graphs. Deep recursive network embedding (DRNE) [63] assumes that a node’s network embedding should approximate the aggregation of its neighborhood network embeddings. It adopts a long short-term memory (LSTM) network [7] to aggregate a node’s neighbors. The reconstruction error of DRNE is defined as \begin{equation*} L = \sum _{v\in V}~||\mathbf {z}_{v}-\text {LSTM}(\{\mathbf {z}_{u}|u\in N(v)\})||^{2}\tag{35}\end{equation*} where $\mathbf{z}_{v}$ is the network embedding of node $v$ obtained by a dictionary lookup and the LSTM network takes a sequence of node $v$'s neighbors as inputs.
Network representations with adversarially regularized autoencoders (NetRA) [64] propose a graph encoder–decoder framework with a general loss function, defined as \begin{equation*} L= -E_{\mathbf {z}\sim P_{\text {data}}(\mathbf {z})}(\text {dist}(\mathbf {z},\text {dec}(\text {enc}(\mathbf {z}))))\tag{36}\end{equation*} where $\text{dist}(\cdot,\cdot)$ is a distance measure between a node embedding $\mathbf{z}$ and its reconstruction. NetRA regularizes the learned network embeddings within a prior distribution via adversarial training.
B. Graph Generation
With multiple graphs, GAEs are able to learn the generative distribution of graphs by encoding graphs into hidden representations and decoding a graph structure given hidden representations. The majority of GAEs for graph generation are designed to solve the molecular graph generation problem, which has a high practical value in drug discovery. These methods either propose a new graph in a sequential manner or in a global manner.
Sequential approaches generate a graph by proposing nodes and edges step by step. Gómez-Bombarelli et al. [111], Kusner et al. [112], and Dai et al. [113] model the generation process of a string representation of molecular graphs named SMILES with deep CNNs and RNNs as the encoder and the decoder, respectively. While these methods are domain-specific, alternative solutions are applicable to general graphs by means of iteratively adding nodes and edges to a growing graph until a certain criterion is satisfied. Deep generative model of graphs (DeepGMG) [65] assumes that the probability of a graph is the sum over all possible node permutations \begin{equation*} p(G)=\sum _\pi p(G,\pi)\tag{37}\end{equation*} where $\pi$ denotes a node ordering. DeepGMG generates graphs by making a sequence of decisions, namely, whether to add a node, which node to add, whether to add an edge, and which node to connect the new edge to.
Global approaches output a graph all at once. Graph variational autoencoder (GraphVAE) [67] models the existence of nodes and edges as independent random variables. By assuming the posterior distribution $q_{\phi}(\mathbf{z}|G)$ defined by an encoder and the generative distribution $p_{\theta}(G|\mathbf{z})$ defined by a decoder, GraphVAE optimizes the variational lower bound \begin{align*} L(\boldsymbol{\phi },\boldsymbol{\theta };G)= E_{q_{\boldsymbol{\phi }}(z|G)}[-\log p_\theta (G|\mathbf {z})]+KL[q_{\boldsymbol{\phi }}(\mathbf {z}|G)||p(\mathbf {z})]\tag{38}\end{align*} where $p(\mathbf{z})$ follows a Gaussian prior and $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ are learnable parameters.
In brief, sequential approaches linearize graphs into sequences. They can lose structural information due to the presence of cycles. Global approaches produce a graph all at once. They are not scalable to large graphs, as the output space of a GAE with $n$ nodes is up to $O(n^{2})$.
Spatial–Temporal Graph Neural Networks
Graphs in many real-world applications are dynamic both in terms of graph structures and graph inputs. STGNNs occupy important positions in capturing the dynamicity of graphs. Methods under this category aim to model the dynamic node inputs while assuming interdependency between connected nodes. For example, a traffic network consists of speed sensors placed on roads, where edge weights are determined by the distance between pairs of sensors. As the traffic condition of one road may depend on its adjacent roads’ conditions, it is necessary to consider spatial dependence when performing traffic speed forecasting. As a solution, STGNNs capture spatial and temporal dependencies of a graph simultaneously. The task of STGNNs can be forecasting future node values or labels or predicting spatial–temporal graph labels. STGNNs follow two directions: RNN-based methods and CNN-based methods.
Most RNN-based approaches capture spatial–temporal dependencies by filtering inputs and hidden states passed to a recurrent unit using graph convolutions [48], [71], [72]. To illustrate this, suppose that a simple RNN takes the form \begin{equation*} \mathbf {H}^{(t)} = \sigma (\mathbf {W}\mathbf {X}^{(t)}+\mathbf {U}\mathbf {H}^{(t-1)}+\mathbf {b})\tag{39}\end{equation*}
where $\mathbf{X}^{(t)}\in\mathbf{R}^{n\times d}$ is the node feature matrix at time step $t$. After inserting the graph convolution, (39) becomes \begin{align*} \mathbf {H}^{(t)} = \sigma (\text {Gconv}(\mathbf {X}^{(t)},\mathbf {A};\mathbf {W}) +\text {Gconv}(\mathbf {H}^{(t-1)},\mathbf {A};\mathbf {U})+\mathbf {b})\tag{40}\end{align*} where $\text{Gconv}(\cdot)$ is a graph convolutional layer.
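In code, (40) simply replaces the two matrix multiplications of the vanilla RNN cell (39) with graph convolutions (a sketch under our own naming; `gconv` stands for any graph convolutional operator, e.g., a GCN layer).

```python
import torch

def gconv(X, A_bar, W):
    """Placeholder graph convolution: A_bar X W."""
    return A_bar @ X @ W

def graph_rnn_step(X_t, H_prev, A_bar, W, U, b):
    """One recurrent step of the graph-convolutional RNN of Eq. (40).

    X_t:    (n, d) node inputs at time t.
    H_prev: (n, h) hidden states at time t-1.
    A_bar:  (n, n) normalized adjacency matrix.
    """
    return torch.sigmoid(gconv(X_t, A_bar, W) + gconv(H_prev, A_bar, U) + b)
```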
Another parallel work uses node-level RNNs and edge-level RNNs to handle different aspects of temporal information. Structural-RNN [73] proposes a recurrent framework to predict node labels at each time step. It comprises two kinds of RNNs, namely, a node-RNN and an edge-RNN. The temporal information of each node and each edge is passed through a node-RNN and an edge-RNN, respectively. To incorporate the spatial information, a node-RNN takes the outputs of edge-RNNs as inputs. Since assuming different RNNs for different nodes and edges significantly increases model complexity, it instead splits nodes and edges into semantic groups. Nodes or edges in the same semantic group share the same RNN model, which saves the computational cost.
RNN-based approaches suffer from time-consuming iterative propagation and gradient explosion/vanishing issues. As alternative solutions, CNN-based approaches tackle spatial–temporal graphs in a nonrecursive manner with the advantages of parallel computing, stable gradients, and low-memory requirements. As illustrated in Fig. 2(d), CNN-based approaches interleave 1-D-CNN layers with graph convolutional layers to learn temporal and spatial dependencies, respectively. Assume that the input to an STGNN is a tensor $\mathcal{X}\in\mathbf{R}^{T\times n\times d}$: the 1-D-CNN layer slides over $\mathcal{X}_{[:,i,:]}$ along the time axis to aggregate temporal information for each node, while the graph convolutional layer operates on $\mathcal{X}_{[t,:,:]}$ to aggregate spatial information at each time step.
Previous methods all use a predefined graph structure. They assume that the predefined graph structure reflects the genuine dependence relationships among nodes. However, with many snapshots of graph data in a spatial–temporal setting, it is possible to learn latent static graph structures automatically from data. To realize this, Graph WaveNet [76] proposes a self-adaptive adjacency matrix to perform graph convolutions. The self-adaptive adjacency matrix is defined as \begin{equation*} \mathbf {A}_{\text {adp}}=\text {SoftMax}\big (\text {ReLU}\big (\mathbf {E}_{1}\mathbf {E}_{2}^{T}\big)\big)\tag{41}\end{equation*} where the softmax is computed along the row dimension and $\mathbf{E}_{1}$ and $\mathbf{E}_{2}$ denote learnable source and target node embeddings, respectively. Multiplying $\mathbf{E}_{1}$ with $\mathbf{E}_{2}^{T}$ yields the dependence weight between each source node and each target node.
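The self-adaptive adjacency matrix of (41) is built from two learnable node-embedding tables (a sketch; following [76], the softmax normalizes each row).

```python
import torch
import torch.nn.functional as F

n, d = 100, 10                               # number of nodes, embedding size
E1 = torch.randn(n, d, requires_grad=True)   # learnable source embeddings
E2 = torch.randn(n, d, requires_grad=True)   # learnable target embeddings

# Eq. (41): nonnegative, row-normalized self-adaptive adjacency matrix.
A_adp = F.softmax(F.relu(E1 @ E2.T), dim=1)
```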
Learning latent static spatial dependencies can help researchers discover interpretable and stable correlations among different entities in a network. However, in some circumstances, learning latent dynamic spatial dependencies may further improve model precision. For example, in a traffic network, the travel time between two roads may depend on their current traffic conditions. GaAN [48] employs attention mechanisms to learn dynamic spatial dependencies through an RNN-based approach. An attention function is used to update the edge weight between two connected nodes given their current node inputs. ASTGCN [77] further includes a spatial attention function and a temporal attention function to learn latent dynamic spatial dependencies and temporal dependencies through a CNN-based approach. The common drawback of learning latent spatial dependencies is that the spatial dependence weight must be calculated between each pair of nodes, which costs $O(n^{2})$.
Applications
As graph-structured data are ubiquitous, GNNs have a wide variety of applications. In this section, we summarize the benchmark graph data sets, evaluation methods, and open-source implementation, respectively. We detail practical applications of GNNs in various domains.
A. Data Sets
We mainly sort data sets into four groups, namely, citation networks, biochemical graphs, social networks, and others. In Table VI, we summarize selected benchmark data sets. More details are given in the Supplementary Material A.
B. Evaluation and Open-Source Implementations
Node classification and graph classification are common tasks to assess the performance of RecGNNs and ConvGNNs.
1) Node Classification:
In node classification, most methods follow a standard split of train/valid/test on benchmark data sets, including Cora, Citeseer, Pubmed, PPI, and Reddit. They report the average accuracy or F1 score on the test data set over multiple runs. A summarization of experimental results of methods can be found in the Supplementary Material B. It should be noted that these results do not necessarily represent a rigorous comparison. Shchur et al. [131] identified two pitfalls in evaluating the performance of GNNs on node classification. First, using the same train/valid/test split throughout all experiments underestimates the generalization error. Second, different methods employed different training techniques, such as hyperparameter tuning, parameter initialization, learning rate decay, and early stopping. For a relatively fair comparison, we refer the readers to [131].
2) Graph Classification:
In graph classification, researchers often adopt tenfold cross validation (CV) for model evaluation. However, as pointed out in [132], the experimental settings are ambiguous and not unified across different works. In particular, [132] raises the concern of the correct usage of data splits for model selection versus model assessment. An often encountered problem is that the external test set of each fold is used both for model selection and risk assessment. Reference [132] compares GNNs in a standardized and uniform evaluation framework. They apply an external tenfold CV to get an estimate of the generalization performance of a model and an inner holdout technique with a 90%/10% training/validation split for model selection. An alternative procedure would be a double-CV method, which uses an external $k$-fold CV for model assessment and an inner $k$-fold CV for model selection.
3) Open-Source Implementations:
These facilitate the work of baseline experiments in deep learning research. In the Supplementary Material C, we provide the hyperlinks of the open-source implementations of the GNN models reviewed in this article. Noticeably, Fey et al. [92] published a geometric learning library in PyTorch named PyTorch Geometric, which implements many GNNs. Most recently, the deep graph library (DGL) [133] is released, which provides a fast implementation of many GNNs on top of popular deep learning platforms, such as PyTorch and MXNet.
C. Practical Applications
GNNs have many applications across different tasks and domains. Despite general tasks that can be handled by each category of GNNs directly, including node classification, graph classification, network embedding, graph generation, and spatial–temporal graph forecasting, other general graph-related tasks, such as node clustering [134], link prediction [135], and graph partitioning [136], can also be addressed by GNNs. We detail some applications based on the following research domains.
1) Computer Vision:
Applications of GNNs in computer vision include scene graph generation, point clouds’ classification, and action recognition.
Recognizing semantic relationships between objects facilitates the understanding of the meaning behind a visual scene. Scene graph generation models aim to parse an image into a semantic graph that consists of objects and their semantic relationships [137]–[139]. Another application inverses the process by generating realistic images given scene graphs [140]. As natural language can be parsed as semantic graphs, where each word represents an object, it is a promising solution to synthesize images given textual descriptions.
Classifying and segmenting point clouds enables LiDAR devices to “see” the surrounding environment. A point cloud is a set of 3-D points recorded by LiDAR scans. References [141]–[143] convert point clouds into $k$-nearest-neighbor graphs or superpoint graphs and use ConvGNNs to explore their topological structure.
Identifying human actions contained in videos facilitates a better understanding of video content from a machine aspect. Some solutions detect the locations of human joints in video clips. Human joints that are linked by skeletons naturally form a graph. Given the time series of human joint locations, [73] and [75] apply STGNNs to learn human action patterns.
Moreover, the number of applicable directions of GNNs in computer vision is still growing. It includes human–object interaction [144], few-shot image classification [145]–[147], semantic segmentation [148], [149], visual reasoning [150], and question answering [151].
2) Natural Language Processing:
A common application of GNNs in natural language processing is text classification. GNNs utilize the interrelations of documents or words to infer document labels [22], [42], [43].
Despite the fact that natural language data exhibit a sequential order, they may also contain an internal graph structure, such as a syntactic dependency tree. A syntactic dependency tree defines the syntactic relations among words in a sentence. Marcheggiani and Titov [152] propose the Syntactic GCN that runs on top of a CNN/RNN sentence encoder. The Syntactic GCN aggregates hidden word representations based on the syntactic dependency tree of a sentence. Bastings et al. [153] apply the Syntactic GCN to the task of neural machine translation. Marcheggiani et al. [154] further adopt the same model as in [153] to handle the semantic dependency graph of a sentence.
Graph-to-sequence learning learns to generate sentences with the same meaning given a semantic graph of abstract words (known as abstract meaning representation). Song et al. [155] propose a graph-LSTM to encode graph-level semantic information. Beck et al. [156] apply a GGNN [17] to graph-to-sequence learning and neural machine translation. The inverse task is sequence-to-graph learning. Generating a semantic or knowledge graph given a sentence is very useful in knowledge discovery [157], [158].
3) Traffic:
Accurately forecasting traffic speed, volume, or the density of roads in traffic networks is fundamentally important in a smart transportation system. Zhang et al. [48], Li et al. [72], and Yu et al. [74] address the traffic prediction problem using STGNNs. They consider the traffic network as a spatial–temporal graph, where the nodes are sensors installed on roads, the edges are measured by the distance between pairs of nodes, and each node has the average traffic speed within a window as dynamic input features. Another industrial-level application is taxi-demand prediction. Given historical taxi demands, location information, weather data, and event features, Yao et al. [159] incorporate LSTM, CNN, and network embeddings trained by LINE [160] to form a joint representation for each location to predict the number of taxis demanded for a location within a time interval.
4) Recommender Systems:
Graph-based recommender systems take items and users as nodes. By leveraging the relations between items and items, users and users, users and items, as well as content information, graph-based recommender systems are able to produce high-quality recommendations. The key to a recommender system is to score the importance of an item to a user. As a result, it can be cast as a link prediction problem. To predict the missing links between users and items, Berg et al. [161] and Ying et al. [162] propose a GAE that uses ConvGNNs as encoders. Monti et al. [163] combine RNNs with graph convolutions to learn the underlying process that generates the known ratings.
5) Chemistry:
In the field of chemistry, researchers apply GNNs to study the graph structure of molecules/compounds. In a molecule/compound graph, atoms are considered as nodes, and chemical bonds are treated as edges. Node classification, graph classification, and graph generation are the three main tasks targeting molecular/compound graphs in order to learn molecular fingerprints [85], [86], predict molecular properties [27], infer protein interfaces [164], and synthesize chemical compounds [65], [69], [165].
6) Others:
The application of GNNs is not limited to the aforementioned domains and tasks. There have been explorations of applying GNNs to a variety of problems, such as program verification [17], program reasoning [166], social influence prediction [167], adversarial attack prevention [168], electronic health records modeling [169], [170], brain networks [171], event detection [172], and combinatorial optimization [173].
Future Directions
Though GNNs have proven their power in learning graph data, challenges still exist due to the complexity of graphs. In this section, we suggest four future directions of GNNs.
A. Model Depth
The success of deep learning lies in deep neural architectures [174]. However, Li et al. [53] show that the performance of a ConvGNN drops dramatically with an increase in the number of graph convolutional layers. As graph convolutions push representations of adjacent nodes closer to each other, in theory, with an infinite number of graph convolutional layers, all nodes’ representations will converge to a single point [53]. This raises the question of whether going deep is still a good strategy for learning graph data.
B. Scalability Tradeoff
The scalability of GNNs is gained at the price of corrupting graph completeness. Whether using sampling or clustering, a model will lose part of the graph information. By sampling, a node may miss its influential neighbors. By clustering, a graph may be deprived of a distinct structural pattern. How to tradeoff algorithm scalability and graph integrity could be a future research direction.
C. Heterogeneity
The majority of current GNNs assume homogeneous graphs. It is difficult to directly apply current GNNs to heterogeneous graphs, which may contain different types of nodes and edges or different forms of node and edge inputs, such as images and text. Therefore, new methods should be developed to handle heterogeneous graphs.
D. Dynamicity
Graphs are, by nature, dynamic in the sense that nodes or edges may appear or disappear and that node/edge inputs may change from time to time. New graph convolutions are needed to adapt to the dynamicity of graphs. Although the dynamicity of graphs can be partly addressed by STGNNs, few of them consider how to perform graph convolutions in the case of dynamic spatial relations.
Conclusion
In this article, we conduct a comprehensive overview of GNNs. We provide a taxonomy that groups GNNs into four categories: RecGNNs, ConvGNNs, GAEs, and STGNNs. We provide a thorough review, comparison, and summary of the methods within and between categories. Then, we introduce a wide range of applications of GNNs. Data sets, open-source codes, and model assessment for GNNs are summarized. Finally, we suggest four future directions for GNNs.