Cascade2vec: Learning Dynamic Cascade Representation by Recurrent Graph Neural Networks

An information dissemination network (i.e., a cascade) with a dynamic graph structure is formed when a novel idea or message spreads from person to person. Predicting the growth of cascades is one of the fundamental problems in social network analysis. Existing deep learning models for cascade prediction are primarily based on recurrent neural networks and on representations learned from random walks or propagation paths. However, these models are not sufficient for learning the deep spatial and temporal features of an entire cascade. Therefore, a new model, called Cascade2vec, is proposed to learn the dynamic graph representation of cascades based on graph recurrent neural networks. To learn more effective graph-level representations of cascades, current graph neural networks are improved by designing a graph residual block, which shares attention weights between nodes and transforms features through perception layers. Furthermore, the proposed graph neural network is integrated into a recurrent neural network to learn the temporal features between graphs. In this way, both the spatial and temporal characteristics of cascades are learned in Cascade2vec. The experimental results show that our method significantly reduces the mean squared logarithmic error and median squared logarithmic error by 16.1% and 12%, respectively, for one-hour cascade prediction on the Microblog network dataset compared with strong baselines.


I. INTRODUCTION
Online social network platforms facilitate the dissemination of novel ideas, news, and messages [1]-[3]. The information trajectories transmitted between people form information dissemination trees (i.e., cascades) with different popularity, depth, and shape [4], [5]. It would be useful to predict the future diffusion popularity or states of a cascade in its early stages. For example, one could better anticipate which papers will become viral and recommend them to potential readers.
The commonly used abbreviations and notations are listed in Table 1. An example of the dynamic structure of a cascade or information dissemination network can be seen in Fig. 1. Nodes in the same community, as detected by the Louvain community detection algorithm [6], are presented in the same color. A user in cascade C_a (see Fig. 1) posted a microblog at a specific time. Then, other users retweet the microblog and make comments, and edges are formed between the users in the cascade. For example, user M retweeted the microblog, and the microblog was further retweeted by some members (e.g., node P) of the gray community, as shown in Fig. 1(a). During the message propagation process, the diffusion network structure changes with time, as shown in Fig. 1. Many works have aimed at predicting cascade growth from both theory and practice [7]-[12]. There are many challenges in cascade prediction; for instance, some underlying factors related to retweet behaviors are too complicated to be quantified, and it is difficult to build powerful models for dynamic graphs. In addition, cascade popularity roughly follows a power-law distribution, making the problem even more difficult [11]. In works prior to deep learning or representation learning, effective features including content features, temporal features, and user-related features were mined to train classification or regression models for cascade prediction [5], [7], [21]. Li et al. [10] proposed the method DeepCas and verified that the end-to-end model is superior to approaches that use hand-crafted features. DeepCas applies random walks to sample the spatial information of cascades and uses node2vec [22] to learn the node representations. (The associate editor coordinating the review of this manuscript and approving it for publication was Weiguo Xia.)
However, it is not sufficient to learn a good representation based only on spatial information, since temporal information also has an important impact on cascade growth. From Fig. 2 and Fig. 3, we can see that the growth of cascades can be very different even when two cascades have similar structural shapes. For example, one hour after the messages were released, cascade C_b (see Fig. 2) and cascade C_c (see Fig. 3) have equal numbers of retweets.
Cao et al. [11] took the temporal features into consideration and designed a new model called DeepHawkes that combines the Hawkes process and deep learning to greatly improve the predictive performance. One of the drawbacks of current deep learning-based cascade prediction models (e.g., DeepCas and DeepHawkes, which are based on random walks or propagation paths) is that they fail to produce an effective graph-level representation of cascades and suffer from learning large-scale parameters. For example, cascade C_a and cascade C_b have almost equal numbers of retweets at 30 minutes and 1 hour after the message was posted, but the popularity of cascade C_a at 24 hours is 409, far larger than that of cascade C_b, mainly due to their different cascade structural types (i.e., the viral structure and the broadcast structure; refer to Goel et al. [23]). Therefore, more efficient and effective spatial feature representation methods need to be proposed and applied in cascade prediction. Since there is a subgraph at each time point, as shown in Fig. 1 and Fig. 2, we are motivated to establish cascade representation models based on graph neural networks (GNNs) [13], [24]. By leveraging GNNs, the spatial features of graphs are learned as a whole, so that we can obtain a better graph-level representation of a cascade with fewer training parameters in the model.
Many graph neural networks have been proposed in recent years, including ChebyNet [14], GCN [15], GAT [16], DGCNN [17], DAGCN [18], and GIN [19]. However, these networks are still not powerful enough for graph regression tasks and suffer from over-smoothing and feature loss during propagation [13]. Therefore, we extend current GNNs, motivated by ideas from ResNet [25] and the Transformer [26], and propose a new graph neural network model, called the Graph Perception Network (GPN). The GPN applies a new attention mechanism and a residual network to improve the learning ability of graph neural networks; it is conceptually simple but outperforms previous GNNs in graph classification and graph regression tasks.
To learn the temporal features between subgraphs at different periods, the proposed graph neural network is integrated into the bidirectional gated recurrent units (GRUs) [20] and a graph recurrent neural network is designed. Based on the graph recurrent neural network model, Cascade2vec is introduced to learn both the spatial and temporal features of cascades. Furthermore, we propose a biased mean squared error loss that benefits model training in power-law distributed data. A weighted sum pooling mechanism is also applied to integrate features at different moments.
Instead of learning an embedding feature for each user as in DeepCas [10] and DeepHawkes [11], Cascade2vec models the cascades as dynamic graphs and incorporates the users' representation features into the representation learned by the graph neural networks. Thus, the number of training parameters can be sharply reduced. Compared to DeepCas and DeepHawkes with tens of millions of parameters, Cascade2vec only has approximately 44,000 model parameters, which is advantageous in training and applications.
Our contributions in this paper are as follows:
• To improve the graph-level representation of graphs in cascades, we propose a more powerful graph neural network that outperforms current GNNs in graph-level classification and regression tasks. The concepts of the residual network and shared attention weights are adopted in the proposed graph neural network GPN to enhance its ability to learn graph representations.
• We model the cascade as a dynamic graph and apply a graph neural network to cascade feature learning. A dynamic cascade representation model, entitled Cascade2vec, is proposed by incorporating the proposed graph neural network into recurrent neural networks. Compared with DeepHawkes and DeepCas, our model can better represent the spatial and temporal features of cascades. The proposed Cascade2vec achieves significant improvements in cascade prediction over the state-of-the-art methods. Our code and processed data will be available at https://github.com/zhenhuascut/Cascade2vec.

II. RELATED WORKS
The related works are discussed from two perspectives, one is cascade prediction and the other is graph neural networks.

A. CASCADE PREDICTION
Existing cascade prediction methods roughly fall into three categories: feature-based approaches, generative approaches, and deep learning approaches.

B. FEATURE-BASED APPROACHES
These methods generally analyze and extract effective features related to popularity growth. Then, classifiers or regression models are trained based on the hand-crafted features. The features mainly include temporal features, structural features, content features, and user features. Hong et al. [27] used a topic model to learn the topic distributions of text content in Microblog and applied them as text features to predict whether a microblog will become popular. Ma et al. [28] found that a microblog is more likely to be retweeted by users when its content is related to hot topics. Jenders et al. [4] found that content features and user features can be complementary and improve the accuracy of popularity prediction. The link density of the initial structure has been found to be highly related to cascade popularity [29]. The community structure of the relationship network among early adopters has also been found to affect future popularity growth [30]. Yi et al. [31] found that the entropy of PageRank centrality at early times is closely related to the future cascade structure. It is reported that the prediction error can be further reduced when combining temporal features with structural features [21]. Cheng et al. [5] pointed out that cascades can be predicted to a certain extent using feature-based approaches. They built cascades from Facebook and verified the effectiveness of temporal features, structural features, content features, and user features. Overall, the performance of feature-based methods highly relies on the extracted features. However, it is costly and inefficient to quantify all the features and design perfect feature patterns in social networks.

C. GENERATIVE APPROACHES
The generative methods are more related to temporal patterns. These models treat online popularity growth as a collection of retweet arrival processes and model the information diffusion process using the theory of the Poisson process [32] or the Hawkes process [8], [33]. Shen et al. [32] extended the Poisson process and proposed the Reinforced Poisson Process (RPP) to model the three key factors in information diffusion. The Hawkes process was then verified to perform better than the RPP model in popularity growth prediction on a dataset from Sina Microblog by Bao et al. [33]. Zhao et al. [8] extended the standard Hawkes process and proposed a method called SEISMIC, which models the self-exciting mechanism of each retweet after observing several tweets in the training set as prior knowledge. SEISMIC uses a power-law function to fit the time decay effect in information diffusion. Since the complicated factors related to cascade growth go far beyond the theory of temporal point processes, purely generative approaches cannot achieve good performance in many cases. A recent research trend in cascade prediction is to incorporate generative approaches into deep learning approaches, such as DeepHawkes [11].

D. DEEP LEARNING APPROACHES
The deep learning methods are roughly classified into two categories. One category focuses on learning the temporal information of cascades. For example, Gou et al. [12] proposed a method for predicting cascade popularity based on recurrent neural networks, which employs an attention mechanism over ensemble temporal features. The other category focuses on learning the spatial information. DeepCas [10] was proposed to learn the spatial features of cascades in a representation learning manner. This end-to-end deep learning method was verified to perform better than feature-based and generative approaches in cascade prediction. DeepCas uses random walks to sample the spatial information of cascades, and Node2vec [22] is applied to learn the node representations. DeepCas does not make full use of the temporal data, while DeepHawkes [11] takes the temporal information into consideration. DeepHawkes significantly improves accuracy by combining temporal and spatial information. However, DeepCas and DeepHawkes learn the cascade structure by random walks or propagation paths, not by building models on the entire cascade, and they suffer from learning a large number of parameters. We found that users with similar roles share similar characteristics in different cascades, so the user features can be integrated into the graph-level representation of cascades. In this way, the number of learned parameters can be sharply reduced. Graph neural networks [13] are able to establish deep learning models on graph-structured data, are suitable for cascade prediction, and have the potential to improve the performance of cascade prediction.

E. GRAPH NEURAL NETWORKS
Graph-level representation is one of the basic problems for graph neural networks (GNNs). The concept of the GNN was first proposed by Scarselli et al. [34]. ChebyNet [14] was proposed at NeurIPS 2016 to classify graphs based on spectral graph theory [35] and Chebyshev polynomials. The ChebyNet model shows competitive performance on image and text classification compared with CNNs. Its simplified version, the Graph Convolutional Network (GCN) [15], is also a commonly used graph neural network. The graph attention network (GAT) [16] applies a concatenation attention mechanism to calculate the attention coefficients of neighbors, which improves the learning ability of GCN in node classification tasks. However, we apply a different mechanism that is more suitable for graph-level representation. Zhang et al. [17] proposed the Deep Graph CNN (DGCNN), which contains a localized graph convolution model with two graph kernels and a SortPooling layer. To address the problems that current graph neural networks lose information during propagation and rely on simple pooling methods, Chen et al. [18] proposed a dual attention graph convolutional network (DAGCN) to learn node representations and importance with several novel attention mechanisms. It has been proven that GNNs are at most as powerful as the Weisfeiler-Lehman test in distinguishing graph structures [19]. A simple neural architecture, the Graph Isomorphism Network (GIN), was then proposed based on injective aggregation and pooling. However, we argue that current GNNs are still not powerful enough in graph-level representation tasks (i.e., graph classification and regression tasks). GNNs need to be further improved or redesigned when applied in certain applications, such as graph regression tasks in traffic forecasting [36], [37]. There is also an urgent need for more effective graph neural networks that are able to produce a better graph-level spatial representation of information cascades.
Current graph neural networks suffer from over-smoothing [13]. For example, stacking multiple GCN layers results in over-smoothing; i.e., all nodes converge to the same semantics. In the graph perception network, we transform the features of different layers so that semantic consistency is maintained and use a residual network to enhance the representation. Shared attention weights and a weighted sum pooling method are also adopted to address the over-smoothing problem.
To learn the dynamic characteristics of cascades, the graph neural network is integrated into recurrent neural networks [20]. The proposed Cascade2vec model integrates spatial feature and temporal feature learning in the same framework, which is one of its key contributions to cascade prediction.

III. PRELIMINARIES

A. PROBLEM DESCRIPTION
We model a cascade before the observation time as a dynamic graph in the proposed model, Cascade2vec. The snapshot subgraph at each time t is denoted by G_t = (V_t, E_t), where V_t and E_t are the sets of nodes and edges, respectively, in the graph at time t. For each cascade C_i, a series of subgraphs {G_t^i}, t ∈ {0, 1, ..., T}, is obtained, where T is the observation time. The target time (e.g., 24 hours) is marked as F. In Fig. 4, T is set to 2 hours, indicating that the observation time is 2 hours. The popularities at the 1st hour, the 2nd hour, and the Fth hour are P_1, P_T, and P_F, respectively. The problem of cascade popularity growth prediction is forecasting the future growth of cascade C_i between times T and F using the information in the graphs {G_t^i}, t ∈ {0, 1, ..., T}, before the observation time T. The cascade growth is denoted by P, as shown in Fig. 4.
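The snapshot construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' exact preprocessing: the edge-record format `(u, v, t)` and the cumulative-snapshot convention (each G_t contains all edges with timestamp ≤ t) are assumptions made for the example.

```python
from collections import defaultdict

def build_snapshots(edges, T, step):
    """Build cumulative snapshot subgraphs G_t from timestamped edges.

    edges: list of (u, v, t) retweet records (illustrative format).
    T: observation time; step: interval between snapshots.
    Returns a list of (t, nodes, adjacency) triples, one per snapshot.
    """
    snapshots = []
    for t in range(step, T + step, step):
        nodes, adj = set(), defaultdict(set)
        for u, v, ts in edges:
            if ts <= t:          # keep only edges formed before time t
                nodes.update((u, v))
                adj[u].add(v)
        snapshots.append((t, nodes, dict(adj)))
    return snapshots
```

For instance, with T = 60 minutes and step = 30, each cascade yields two snapshot graphs that are later fed to the graph neural network.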

B. DATASETS
C. MICROBLOG NETWORK
The Microblog network dataset is from Cao et al. [11]. On the Sina Weibo platform, 119,313 Twitter-like microblogs and their diffusion trees were extracted and built. Examples of the microblog cascades can be seen in Fig. 1 and Fig. 2. The cascades with fewer than 10 retweets or more than 1000 retweets during the observation time are filtered out, following the same method as DeepHawkes [11]. The observation time T is set to 1 hour, 2 hours, and 3 hours, and the target time F is set to 24 hours. When the popularity growth is greater than 10, the distribution of the popularity growth roughly follows the power-law distribution shown in Fig. 5 (1). When the popularity growth is less than 10, the numbers of cascades are almost equal, slightly different from the popularity distribution of all cascades.

D. APS CITATION NETWORK
The APS citation network dataset includes the citation relationships among papers from 1893 to 2017 and is provided by the American Physical Society (APS). The papers from 1893 to 1997 are selected to build cascades so that each paper is allowed to develop for at least 20 years. The distribution of the popularity growth of the citation network, shown in Fig. 5 (2), is quite similar to that of the Microblog network. The structure of a citation cascade is shown in Fig. 6(a), which is quite similar to the cascades in the Microblog network. We set the observation time to 5 years, 7 years, and 9 years. The target time is set to 20 years. The total numbers of users and edges in the Microblog network when T is set to 1 hour and in the APS citation network when T is set to 5 years are listed in Table 2. Note that users can be duplicated when we build cascades, since they may be involved in forwarding different messages.

E. EVALUATION
The distribution of target values roughly follows a power-law distribution; i.e., there is a long-tail area in the data distribution, as shown in Fig. 6 (target values greater than 600 are omitted for better visualization). Cascades with larger growth are fewer in number but contribute more to the loss. We use log2(y + 1) as the target instead of the raw value, where y is the count of retweet growth or citation growth. The mean squared logarithmic error (MSLE) is adopted as the loss function and evaluation metric. The MSLE is defined as MSLE = (1/n) Σ_{i=1}^{n} SLE_i. The loss is sensitive to the distribution. Therefore, we modify the loss with a bias as follows: bSLE = (y / y_max + 1.0)(y − ŷ)^2, where ŷ is the predicted value and y_max is the maximum value of the targets. The standard squared logarithmic error (SLE) is replaced by the bSLE in the training steps; in the evaluation steps, the SLE is retained. When y is small and falls into the head area with rich data, as shown in Fig. 6, the deviation is close to that of a standard MSE. When y is in the long-tail area, it receives more attention than the data in the head area. We also use the median SLE (mSLE) as one of the evaluation metrics, since the MSLE is sensitive to the long-tail area and outliers. The precise definitions of the head area, long-tail area, and outliers are not a key point of our paper and are left to future work. More discussion of the power-law distribution can be found in the book [38]. To evaluate the performance of GNNs in regression tasks, the R2 score (i.e., the coefficient of determination) is also applied. However, the R2 score is sensitive to noisy data, so it is not used in all the experiments.
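The target transform, the MSLE, and the biased training loss can be sketched as follows. The function names and the convention that both arguments of `msle` are already in log2 space are assumptions made for this illustration.

```python
import math

def log_target(y):
    # The paper uses log2(y + 1) as the regression target.
    return math.log2(y + 1)

def msle(targets, preds):
    # MSLE = (1/n) * sum_i SLE_i, with both lists in log2 space.
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)

def biased_sle(y, y_hat, y_max):
    # bSLE = (y / y_max + 1.0) * (y - y_hat)^2: samples in the long-tail
    # area (large y) receive up to twice the weight of head-area samples.
    return (y / y_max + 1.0) * (y - y_hat) ** 2
```

The bias term smoothly interpolates the weight from 1.0 (at y = 0) to 2.0 (at y = y_max), which matches the intent of giving the long-tail area more attention during training.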

F. CASCADE PREDICTION BASELINES
Features-Linear (F-Linear). Temporal features, structural features, and features of early adopters introduced by Cao et al. [11] are extracted in the F-Linear method. After the features have been calculated, a multilayer perceptron is applied to train a linear regression model.
DeepCas [10] learns the spatial features of cascade graphs in an end-to-end manner. It mainly utilizes GRUs and an attention mechanism to extract the structural information and node identities for cascade prediction. The method focuses on learning spatial features while ignoring temporal features.
DeepHawkes [11] is a state-of-the-art cascade prediction method. It builds a bridge between the Hawkes process and recurrent neural networks. The method learns the factors of the Hawkes process in a data-driven manner and outperforms DeepCas [10] and SEISMIC [8].

IV. CASCADE2VEC
A. GRAPH PERCEPTION NETWORK
As mentioned above, the GPN inherits from GIN [19], GCN [15], and ResNet [25] while having a more powerful ability to learn graph-level representations. The basic component of the GPN, shown in Fig. 7, is called the Graph Residual Block, since the idea is motivated by ResNet [25]. The input feature X_0 ∈ R^{N×F} is converted into X_L after two graph propagation layers. The graph propagation layer performs graph convolutional operations and feature aggregations similar to those in GCN [15]. The output of the last graph propagation layer and the input features are converted by perception layers and then concatenated to produce the final output features. The output features are further connected to perception layers to perform classification or regression tasks in certain problems.
The process of the Graph Residual Block is described as follows. The input is X_0 ∈ R^{N×F_0}, where N is the number of graph nodes, F_0 is the dimension of the input features, and L = 2 is the number of graph propagation layers. In the graph propagation layer, the features of node v in layer l propagate among its neighbors and are updated by Equation 1:

h_v^l = f_MLP^l ( h_v^{l−1} + Σ_{u ∈ N(v)} α_{uv} W_a^l h_u^{l−1} )    (1)

where N(v) denotes the neighbors of node v, W_a^l ∈ R^{F_l × F_l} represents the shared attention weights in the layer, and F_l is the output dimension of f_MLP^l in the l-th propagation layer. The coefficient α_{uv} indicates the strength with which the features should be adopted from the neighbors. f_MLP^l is a two-layer perceptron network in layer l. According to Xu et al. [19], the sum aggregator is more likely to produce an injective representation; however, it still leads to over-smoothing. Different neighbors have different impact strengths, so shared attention weights are adopted in the GPN. The graph propagation layer is followed by a batch normalization layer [39].
The output feature vector for node v produced by the graph residual block is

h_v = CONCAT( f_P^0(h_v^0), f_P^L(h_v^L) )

where f_P^0 and f_P^L are one-layer perceptron networks that convert the features into the same dimension so that they can be concatenated, as in residual networks [25]. To obtain the representation feature h_G for the entire graph, sum pooling (SumPool) is applied, followed by a perceptron network f_P^G that converts the representation of a graph into the target dimension. h_G is calculated by Equation 4:

h_G = f_P^G ( SumPool( {h_v | v ∈ V} ) )    (4)

The sum pooling can also be replaced by SortPooling, which was proposed by Zhang et al. [17]. In the following, we verify the performance of the GPN on the graph classification tasks described by Yanardag and Vishwanathan [40] and the graph regression tasks introduced in our paper.
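The forward pass of the Graph Residual Block can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation: a single uniform attention coefficient `alpha` stands in for the learned shared attention, batch normalization is omitted, and the parameter layout in `params` is an assumption.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def propagation_layer(A, H, W_a, W_mlp1, W_mlp2, alpha=1.0):
    """One GPN-style propagation step (sketch).

    A: (N, N) adjacency matrix; H: (N, F) node features.
    Self features are added to the attended neighbor sum, then a
    two-layer perceptron (f_MLP in the text) transforms the result.
    """
    agg = H + alpha * (A @ H @ W_a)
    return relu(relu(agg @ W_mlp1) @ W_mlp2)

def residual_block(A, H0, params):
    """Two propagation layers, then concatenation of perceptron-projected
    input and output features, as in the Graph Residual Block."""
    H1 = propagation_layer(A, H0, *params["layer1"])
    H2 = propagation_layer(A, H1, *params["layer2"])
    skip = H0 @ params["f_p0"]   # f_P^0: project the input features
    out = H2 @ params["f_pL"]    # f_P^L: project the output features
    return np.concatenate([skip, out], axis=1)
```

A graph-level representation would then be obtained by sum-pooling the rows of the returned matrix and applying one more perceptron layer, mirroring Equation 4.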

B. GRAPH FEATURES
When the nodes have tagged input features, the input feature matrix X is directly used in the model. When there are no input features for the nodes, we extract node features by a graph Fourier transform or degree matrix of the input graphs.
The adjacency matrix of a graph is denoted by A, W is the weighted adjacency matrix, and D is the diagonal degree matrix. The graph Laplacian matrix of an undirected graph is calculated by L = D − W. The normalized Laplacian matrix is L = I − D^{−1/2} W D^{−1/2}. If the graph is directed, the normalized directed Laplacian matrix is calculated by L = I − (Φ^{1/2} P Φ^{−1/2} + Φ^{−1/2} P^T Φ^{1/2})/2, where P is the transition probability matrix and Φ is a diagonal matrix with the Perron vector of P on the diagonal [41]. L is a positive semi-definite N × N matrix. We can calculate its N eigenvalues Λ = [λ_0, ..., λ_{N−1}] and N corresponding eigenvectors U = [u_0, ..., u_{N−1}]. The eigenvectors are real and orthonormal and are regarded as the graph Fourier basis [35]. The Laplacian matrix L is thus decomposed as L = U Λ U^T. The features X_t of the input graph G_t at time t are calculated by U^T x. In this paper, the graph features associated with the top-k (set to 100) eigenvalues are selected.
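The undirected case can be sketched as follows. The top-k convention (k = 100) follows the paper; the choice of the degree vector as the graph signal x is an illustrative assumption, since the text does not specify x.

```python
import numpy as np

def graph_fourier_features(W, k=100):
    """Top-k graph Fourier features of an undirected graph (sketch).

    W: symmetric weighted adjacency matrix.
    Builds the normalized Laplacian L = I - D^{-1/2} W D^{-1/2},
    eigendecomposes it, and projects a graph signal x onto the
    eigenvectors of the k largest eigenvalues.
    """
    d = W.sum(axis=1)
    # Guard against isolated nodes (degree 0) when inverting.
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    L = np.eye(len(W)) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L)       # ascending eigenvalues, orthonormal U
    idx = np.argsort(eigvals)[::-1][:k]  # indices of the top-k eigenvalues
    x = d                                # graph signal: degree vector (assumed)
    return U[:, idx].T @ x               # U^T x restricted to the top-k basis
```

If a graph has fewer than k nodes, the feature vector would be zero-padded to a fixed length before being fed to the model.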

C. GRAPH-LEVEL REPRESENTATION
In the following experiments, we compare GPN with current GNNs to verify its effectiveness in spatial representation learning. The output size of f_MLP^l in the layers is set to the same value, chosen from {16, 32}. The learning rate is chosen from {5e-4, 1e-4, 1e-3, 1e-2}. The output dimension of the graph residual block is set to 32, and the output dimension F_l is kept at 32. The hidden size and output dimension of the multilayer perceptron in the GPN are kept the same. The RMSprop optimizer [42] is applied. The GPN model is not very sensitive to these parameters in most of the following experiments. We only use one graph residual block, since it is powerful enough in the experiments. For the comparison methods GCN and GAT, we use two GCN or GAT convolutional layers, respectively, and the hidden size and output dimension are chosen from {16, 32}. For GIN, DGCNN, and DAGCN, we use the parameters mentioned in the original papers [17]-[19].

1) GRAPH CLASSIFICATION TASK
Given a set of graphs {G_i}, i ∈ {1, ..., n}, and their labels {Y_i}, i ∈ {1, ..., n}, the graph classification task aims to predict the correct class label Y_i of a graph G_i using the learned graph representation h_{G_i}. The MUTAG, PROTEINS, PTC, COLLAB, and IMDB-M [40] datasets are benchmark datasets used in graph classification [19]. The MUTAG, PROTEINS, and PTC datasets are bioinformatics data with discrete labels. The COLLAB dataset is a scientific collaboration dataset, and the IMDB-M dataset is a movie collaboration dataset. The average precision and standard deviation from 10-fold cross-validation are reported in Table 3. From the table, it is obvious that GPN outperforms the baselines. GAT is better than GCN and is a very competitive approach. DGCNN and DAGCN are also competitive methods, but they do not perform well on the COLLAB and IMDB-M datasets, in which the nodes have no tag features. The results indicate that the proposed GPN has a stable and powerful learning ability for graph-level representation.

2) SIMPLE GRAPH REGRESSION TASK
A simple graph regression task is to predict the number of nodes of each graph without any additional data. The Microblog network dataset is used, and the target is the number of nodes of each cascade, which is different from cascade growth prediction. Deep learning models suffer from very large variance when the target value follows a power-law distribution, so the target value y (i.e., the number of nodes in a graph) is transformed by log_2(1 + y). As shown in Table 4, the GPN model achieves relatively higher performance than strong baseline models such as GCN and GIN on the graph regression task of identifying how many nodes are in a graph. The performances of GPN, DGCNN, and DAGCN are quite close in terms of the R2 score. When the true-versus-predicted values are plotted, as shown in Fig. 8, it can be seen that the GPN achieves almost perfect performance in perceiving how many nodes are in a graph. The strong baseline GIN cannot adapt to certain graphs, especially when the targets become larger.

3) CASCADE GROWTH PREDICTION
Next, we introduce a more complicated task related to cascade growth prediction and verify whether the proposed graph neural network is able to learn effective spatial representation features of a cascade, much like Word2vec [43] in natural language processing problems. We use G_T^i to represent the subgraph of cascade C_i at time T = 1 hour in the Microblog network dataset. Several approaches are applied to learn the representation features of G_T^i. Then, the features are used to predict the future growth of the cascades at the 24th hour. The compared methods include F-Linear, DeepCas [10], GIN [19], GCN [15], and the proposed GPN. The prediction results are recorded in Table 5. The results show that the proposed GPN achieves a 12.4% error reduction in terms of the MSLE compared to the strong baseline DeepCas, while GCN performs similarly to DeepCas. DGCNN and DAGCN perform better than GCN and GAT on the MSLE. It is noted that the mSLE does not decrease monotonically with the MSLE. These findings indicate that the proposed graph neural network is capable of producing a better representation of the cascade structure and is conducive to predicting future cascade popularity growth.

4) VIRAL STRUCTURE LEARNING
As mentioned in Section I, the viral structure has significant impacts on cascade popularity growth. The viral structure of cascades has been extensively discussed in [5], [23], [30], [44]. According to previous works, cascades with a viral structure have a Wiener index higher than 2.5 or a modularity greater than 0.3 [23], [44]. We use the Wiener index to calculate the structural virality of cascades. We would like to verify whether the existing models are able to extract effective features from cascades with a viral structure. There are 4490 cascades with a viral structure (i.e., a Wiener index greater than 2.5), accounting for 11.7% of the Microblog network dataset. In total, 90% of the viral structural cascades are selected as the training set, and 10% of them are selected as the testing set. The prediction on these cascades is more challenging. The average and median popularity growth of the cascades with a viral structure are 80 and 245, respectively, while those of all the cascades are 26 and 114, respectively. This finding indicates that the viral structure facilitates future popularity growth of cascades. The results in Table 6 show that our method performs even better on cascades with a viral structure in terms of the MSLE while achieving a slightly lower performance on the mSLE. The path-sampling method DeepCas produces much worse results on cascades with a viral structure. The F-Linear method produces better results than DeepCas, which may be because the viral structure actually provides more information about how the cascading structure is formed and changes compared to the broadcast structure [23]. Based on the viral structure, it is possible for the models to establish or learn more effective features about the cascades. Overall, the above experiments show that the proposed graph neural network GPN has an effective learning ability for spatial features and graph-level representation.
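The Wiener index used here (structural virality in the sense of Goel et al. [23]) is the average shortest-path distance over all pairs of nodes in the cascade tree. A minimal sketch with BFS, assuming an undirected adjacency-list representation of the cascade:

```python
from collections import deque

def wiener_index(adj):
    """Structural virality: mean shortest-path distance over all ordered
    node pairs, via BFS from each node.  adj maps node -> neighbor list
    (undirected); unweighted edges are assumed."""
    nodes = list(adj)
    n = len(nodes)
    total = 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:          # first visit = shortest distance
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))
```

A broadcast-like star cascade yields a low value (close to its depth), whereas a long person-to-person chain yields a high value, which is why a threshold such as 2.5 separates viral from broadcast structures.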

D. DYNAMIC CASCADE REPRESENTATION
In the following sections, we discuss how Cascade2vec incorporates the graph perception network into recurrent neural networks to build a dynamic cascade representation model. The whole framework of the dynamic cascade representation model is shown in Fig. 9. The feature matrix and adjacency matrix of the subgraph at time t_i are denoted by X_i and A_i, respectively. As shown in Fig. 9, the structure of a cascade changes with time. For example, three red nodes joined the cascade at time t_2, and three green nodes are involved in the cascade at time t_3. Three gates are applied at each moment in our model. First, the reset gate r_t is calculated by

r_t = σ(W_ir x_t + b_ir + W_hr h_{t−1} + b_hr)

where σ is the sigmoid activation function and x_t is the GPN representation of the subgraph at moment t. W_ir ∈ R^{H×K}, W_hr ∈ R^{H×H}, and b_ir, b_hr ∈ R^H, where K is the output dimension of the GPN and H is the hidden size of the recurrent process in Cascade2vec.
The update gate z_t is computed by

z_t = σ(W_iz x_t + b_iz + W_hz h_{t-1} + b_hz).

The hidden state h_t is then updated by the following equations:

n_t = tanh(W_in x_t + b_in + r_t ⊙ (W_hn h_{t-1} + b_hn)),
h_t = (1 − z_t) ⊙ n_t + z_t ⊙ h_{t-1},

where ⊙ is the element-wise product. W_in, W_iz, and W_ir have the same shape; W_hn, W_hz, and W_hr have the same shape; and all the biases have the same shape. The recurrent neural network outputs a hidden representation h_t of the cascade at moment t. Note that f^l_MLP, W_a, W_i, and W_h are independent of the number of nodes, so the number of parameters is sharply reduced compared with DeepCas and DeepHawkes. The proposed Cascade2vec is also flexible and easy to integrate with any kind of recurrent neural network (e.g., LSTM [45] and Transformer [26]).
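The recurrent step above is a standard GRU cell whose input at each moment is the GPN's graph-level embedding. A minimal numpy sketch, assuming random initialization and using a plain vector `x` as a stand-in for the GPN output (the class name and initialization scale are our own choices, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GraphGRUCell:
    """One recurrent step over per-snapshot graph embeddings.
    K: GPN output dimension; H: hidden size of the recurrent process."""

    def __init__(self, K, H, seed=0):
        rng = np.random.default_rng(seed)
        # W_i{r,z,n} ∈ R^{H×K}, W_h{r,z,n} ∈ R^{H×H}, biases ∈ R^H
        self.Wi = {g: rng.standard_normal((H, K)) * 0.1 for g in "rzn"}
        self.Wh = {g: rng.standard_normal((H, H)) * 0.1 for g in "rzn"}
        self.bi = {g: np.zeros(H) for g in "rzn"}
        self.bh = {g: np.zeros(H) for g in "rzn"}

    def step(self, x, h):
        # Reset and update gates, then candidate state and convex update.
        r = sigmoid(self.Wi["r"] @ x + self.bi["r"] + self.Wh["r"] @ h + self.bh["r"])
        z = sigmoid(self.Wi["z"] @ x + self.bi["z"] + self.Wh["z"] @ h + self.bh["z"])
        n = np.tanh(self.Wi["n"] @ x + self.bi["n"] + r * (self.Wh["n"] @ h + self.bh["n"]))
        return (1.0 - z) * n + z * h

# Usage: feed the snapshot embeddings x_1..x_T through the cell.
cell = GraphGRUCell(K=4, H=8)
h = np.zeros(8)                 # h_0
for _ in range(3):
    x = np.ones(4)              # stand-in for the GPN output at time t
    h = cell.step(x, h)
```

Because h_t is a convex combination of the bounded candidate n_t and the previous state, the hidden representation stays bounded regardless of the number of snapshots, which is one reason the parameter count does not depend on the number of nodes.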
We could simply use the last output h_T as the embedding features of a cascade, since the learned h_T contains the temporal information of each moment. However, the outputs at different times may also contain useful temporal information, so we apply a weighted sum pooling method to integrate the features at each time. The weighted sum pooling learns a parameter β_t for each moment, and the final representation h_{c_i} of a cascade C_i is computed by

h_{c_i} = Σ_{t=1}^{T} β_t h_t.

Then, h_{c_i} is fed into a two-layer perception network MLP_output, and the model outputs a predicted value ŷ of cascade C_i:

ŷ = MLP_output(h_{c_i}).

Finally, the MSLE loss is calculated. Note that y is a logarithmic value.
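The pooling and loss can be sketched in a few lines. Here β is treated as a plain vector of learned coefficients (whether the paper normalizes it is not stated, so no normalization is assumed), and since y is already logarithmic, the MSLE reduces to a squared error on the log popularity:

```python
import numpy as np

def cascade_embedding(hidden_states, beta):
    """Weighted sum pooling h_c = sum_t beta_t * h_t.
    hidden_states: (T, H) array of GRU outputs h_1..h_T; beta: (T,)."""
    return beta @ hidden_states

def msle(y_pred, y_true):
    """Squared error on log-scale popularity values (y is logarithmic)."""
    return float(np.mean((y_pred - y_true) ** 2))

# Toy example: uniform weights over T=4 moments average the hidden states.
H = np.arange(12, dtype=float).reshape(4, 3)   # (T=4, H=3) outputs
beta = np.full(4, 0.25)
h_c = cascade_embedding(H, beta)               # -> [4.5, 5.5, 6.5]
```

With uniform weights the pooling degenerates to mean pooling; the learned β_t lets the model instead emphasize the moments that matter most (see Section V-C on temporal dynamics).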
There are several more hyperparameters in Cascade2vec. The hidden size H of the GRU is set to 32. The observation time is split every 180 seconds in the Microblog network and every 91 days in the APS citation network.

V. PERFORMANCE OF CASCADE2VEC
The performance of Cascade2vec is verified on the Microblog network and the APS citation network. The relative reduction (denoted by Reduction) is calculated by |E_b − E_c|/E_b, where E_b is the smallest error among the compared methods, and E_c is the error of Cascade2vec.
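As a worked instance of this formula: with the Cascade2vec MSLE of 2.055 at 1 hour reported later in this section, a baseline MSLE of about 2.449 (a value inferred from the reported 16.1% reduction, not stated in the text) reproduces the headline figure:

```python
def relative_reduction(e_baseline, e_cascade2vec):
    """Relative error reduction as defined in the text: |E_b - E_c| / E_b."""
    return abs(e_baseline - e_cascade2vec) / e_baseline

# 2.055 is the reported Cascade2vec MSLE at 1 hour; 2.449 is an inferred
# (hypothetical) baseline error consistent with the reported 16.1%.
reduction = relative_reduction(2.449, 2.055)   # ≈ 0.161
```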

A. PREDICTION PERFORMANCE
According to the results in Table 7 and Table 8, Cascade2vec significantly improves the performance and reduces the prediction errors. Compared with DeepHawkes, the MSLE is reduced by 16.1% at one hour in the Microblog network dataset. As one of the advanced deep learning methods for cascade prediction, DeepHawkes achieves quite good performance; however, since DeepHawkes does not model the entire structure of graphs, its prediction performance can be further improved. When the observation time is increased from 1 hour to 2 hours and 3 hours, the relative reduction decreases: as more information becomes available, the prediction task becomes easier, so the performance differences between Cascade2vec and the baselines become smaller. There is a large margin between the strong baselines and Cascade2vec in the mSLE, indicating that the proposed method is able to learn features that are useful for the growth of most cascades. Compared with the results on the Microblog network, Cascade2vec achieves a relatively smaller improvement in the MSLE on the APS citation network, while in terms of the median squared log-error, it still performs much better than the baselines. Overall, the results verify that the graph recurrent neural network is suitable and stable for popularity prediction.

B. VIRAL CASCADE PREDICTION
We now discuss the performance of Cascade2vec on viral cascades to show its ability in dynamic spatial feature learning and to explain why Cascade2vec outperforms the baselines. An example cascade with a viral structure in the Microblog network is shown in Fig. 10. The tweets of viral cascades disseminate among users in a way similar to a virus propagating among people. Many branches are formed during the dissemination process, indicating that the tweets have spread through different communities. We demonstrated in Section IV-A that the GPN is superior to DeepCas in spatial feature learning of cascades. Here, we evaluate the performance of DeepHawkes and Cascade2vec on cascades with viral structure, selected in the same way as introduced in Section IV-C.4. As shown in Table 9, Cascade2vec outperforms DeepHawkes in both the MSLE and mSLE. The relative error reduction even increases to 23.2% on the MSLE and 33.2% on the mSLE. This shows that cascades with viral structure provide more spatial information; however, DeepCas and DeepHawkes fail to make full use of it. DeepHawkes is more stable in this case, and its median error is much lower than its error on the total dataset. Although the prediction task on viral cascades is more challenging, as described above, the performance of our approach remains stable on this part of the dataset. Another reason why DeepHawkes and DeepCas do not perform well on cascades with viral structure is the lack of sufficient data, since they train an embedding vector for each user. However, we found that the roles users play in information propagation are quite similar among different cascades. For example, the nodes A and M in the cascade C_a and the nodes D, E, and F in the cascade C_d are opinion leaders in their communities.
Many marginal nodes represent ordinary users, such as the nodes N, M, S, and T in the cascade C_d, the nodes P, K, and L in the cascade C_a, the nodes G and H in the cascade C_b, and the nodes I and J in the cascade C_c. We do not learn a specific embedding for each node but rather learn a whole representation of the graphs. Thus, the number of parameters in our model is much smaller than that in DeepCas and DeepHawkes, and the performance is more stable when the data are not sufficient.
Although the viral structure provides more spatial information about cascades, temporal characteristics still play a significant role in cascade prediction. In the Microblog network dataset, the MSLE at 1 hour of the GPN, which is based only on the static structural information of cascades, is 3.18, while the MSLE of Cascade2vec is 2.055, which demonstrates the importance of the temporal features of the graphs. However, a model that takes only temporal information into consideration cannot achieve high performance either. For instance, SEISMIC [8], an implementation of a Hawkes self-exciting point process, is a temporal model that applies the self-exciting mechanism to each retweet and uses a power-law function to fit the time-decay effects in information diffusion. SEISMIC suffers from noisy data and performs worse in cascade prediction, as shown in Table 9.
Therefore, a better approach to cascade prediction is to integrate both spatial and temporal features.

C. TEMPORAL DYNAMICS
We present how the temporal weights change over time in Fig. 11. In the Microblog network data, we split the 1-hour observation window into 180-second segments, with one temporal weight parameter at each moment. From the weights, we can see how different moments contribute to the final popularity growth. At one hour in the Microblog network dataset, the 20 parameters are shown in Fig. 11(a). The weights decrease slightly over the first several splits and then increase until the observation time of 1 hour. This finding shows that the features of several early splits and the most recent features have more important impacts on the future popularity growth of microblogs. When the observation time is set to 2 hours, similar effects occur. In the APS citation network dataset, features that are close to the observation time are given higher weights. However, there is a difference in the temporal dynamics between the two datasets: the temporal weights in the APS citation network increase more smoothly than those in the Microblog network. The influences on retweet behavior and popularity growth in microblogs are more complicated; the mood of the users, the content of the microblogs, and the publishing time can significantly impact the popularity growth of tweets. In contrast, citation growth is in most cases a slow process, so the trend in the temporal weights is smoother.

D. COMPLEXITY ANALYSIS
We analyze the computational complexity following DAGCN [18]. The computational complexity of the GPN is related to the density of the input graphs. Each node has to perform feature propagations with its neighbors. Suppose the average number of neighboring nodes is c. Then, there are 2c·N propagations in total, where N is the number of nodes, i.e., |V|. In the worst case, when the input graph is very dense, c and N are of the same order of magnitude; when the graphs are sparse, c is a small constant. Therefore, the time complexity of the GPN is O(N^2) in the worst case. Since the graphs of cascades are sparse, the time complexity of the GPN is O(N) in most cases. From Table 2, we can see that the value of c is very small in cascades. Cascade2vec applies the GPN at each time segment. Suppose there are S temporal segments, where S is a small constant integer; e.g., S = 20 when T = 1 hour in the Microblog network dataset. Therefore, the time complexity of Cascade2vec is O(2c·N·S) = O(N) in cascade prediction [46].
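The 2c·N propagation count above can be checked directly: in one message-passing layer, each node exchanges a message with each neighbor, so the total equals the number of directed edges. A tiny sketch (the adjacency-dict example is our own illustration):

```python
def propagation_count(adj):
    """Feature propagations in one GPN layer: one per directed edge,
    i.e. 2c*N for average degree c over N nodes."""
    return sum(len(neigh) for neigh in adj.values())

# A cascade tree with N nodes has N-1 undirected edges, hence
# 2(N-1) directed propagations — linear in N, matching the O(N)
# bound for sparse cascades.
tree = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}   # N = 4 nodes
count = propagation_count(tree)                  # 2 * (4 - 1) = 6
```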
Space complexity analyses of deep learning models can be complicated, and few papers perform a space complexity analysis of graph neural networks or cascade prediction models [10], [11], [13], [24]. Once a deep learning model has been established, its parameters are fixed. The parameters and the required memory of the GPN and GRUs do not grow with the inputs in Cascade2vec, so the space complexity of the model parameters is O(1) following the analysis in [46], [47]. However, analyzing only the model parameters does not tell the whole story. Therefore, we also analyze the outputs of each layer and estimate the memory requirements following the method introduced in [48], [49]. In the inference process, the model needs to allocate memory for the outputs of each layer. The model parameters and the output size of each layer are listed and estimated in Table 10, where N is the number of nodes of the input graph. We set N = 1000 to estimate the required memory consumption. The space complexity of the model parameters in DeepCas and DeepHawkes is O(N) because the memory required for node representations is related to the total number of nodes, which can be in the millions; therefore, the number of model parameters of DeepCas and DeepHawkes is in the tens of millions. The space complexity of these models, including the memory consumption of the outputs of each layer, is also O(N), the same as that of Cascade2vec. The estimated memory required for the model parameters of Cascade2vec, DeepCas, and DeepHawkes is 340 KB, 260 MB, and 280 MB, respectively. If the outputs of each layer are taken into consideration, the memory requirements of these models are 13.8 MB, 266 MB, and 288 MB, respectively. When the model is implemented in mini-batched mode, the memory consumption is further increased: when the batch size is set to 32, the theoretical memory requirement of Cascade2vec is around 443 MB.
In practice, the GPU memory consumption of Cascade2vec is 966 MB, which is higher than estimated due to caches, temporary variables, biases, framework overhead, etc.
The runtime for each epoch of Cascade2vec is 35 seconds in the Microblog network dataset at 1 hour, while the runtimes for an epoch of DeepCas and DeepHawkes are 300 seconds and 172 seconds, respectively. However, both the runtime and the real memory consumption of the models depend on the specific implementation; thus, the comparison is not completely fair, since the implementations are based on different frameworks, and the preprocessing procedure is also ignored in the complexity analyses. To conclude, our model has a faster runtime in theory and in practice than DeepHawkes and DeepCas. However, our work focuses on improving the performance of cascade prediction, not on reducing its complexity, so complexity is not our main concern.

VI. CONCLUSION
Cascade prediction is one of the fundamental challenges in social network analysis. The cascades in previous models are learned as a set of random walks or propagation paths without having an effective or favorable graph-level representation. In this paper, we modeled the cascades as dynamic graphs and proposed a new method, Cascade2vec, to learn the representation of cascades by graph recurrent neural networks. To improve the spatial representation of the graphs, we proposed a new graph neural network model that addresses the oversmoothing problems in GNNs and improves the learning ability of graph characteristics. Consequently, the proposed Cascade2vec significantly reduces the mean and median squared log-errors compared with the strong baselines in two cases: retweet prediction in the Microblog network and citation prediction in the APS citation network. Cascade2vec enriches the methods of cascade prediction and proves that a graph neural network is able to perform well in the regression prediction tasks.
In future work, we will take the distributions of the datasets into consideration and design a graph neural network that is more suitable for long-tail distributed data. In addition, external information (e.g., topics and sentiment) will be introduced into the model to enhance the representation of cascades.
ZHENHUA HUANG was born in Anhui, China. He received the B.S. degree from the School of Software Engineering, South China University of Technology, Guangzhou, in 2014, where he is currently pursuing the Ph.D. degree with the School of Software Engineering. He was a Visiting Scholar and also a Young Big Data Scientist with the University of California, Irvine, CA, USA, in the program of IBM-CSC Y-100. He has authored more than five articles. His research interests include social computing, sentiment analysis, and deep learning.
ZHENYU WANG received the Ph.D. degree from the Department of Computer Science, Harbin Institute of Technology, in 1993. He has been a Professor with the South China University of Technology, since 2007. He is currently the Director of the Chinese Information Community of China, and also the Director of the Guangdong Provincial Social Media Processing and Engineering Center. His research interests include natural language processing, text mining, and social network analysis. He has published more than 80 articles. He received the IBM-Ministry of Education University Cooperation Project Excellent Teacher Awards several times, guiding students to win the first and second prizes of the IBM National Competition.
RUI ZHANG received the B.S. degree from the School of Software Engineering, South China University of Technology, Guangzhou, China, in 2015, where he is currently pursuing the Ph.D. degree with the School of Software Engineering, South China University of Technology. His research interests include natural language generation, text mining, and sentiment analysis.