VAM: An End-to-End Simulator for Time Series Regression and Temporal Link Prediction in Social Media Networks

We present a machine-learning-driven end-to-end simulator, called the Volume-Audience-Match (VAM) simulator. VAM's purpose is to simulate future phenomena related to various topics of discussion in social media networks. We focus our attention on the social media platform Twitter, due to its abundant use in today's world. VAM was applied to time series forecasting to predict, over the 24 hours following the prediction start time: 1) the total number of activities; 2) the number of active old users; and 3) the number of newly active users. VAM then used these macroscopic volume predictions (VPs) to perform user link predictions: a user–user edge was assigned to each of the activities in the 24 future time steps. We report that VAM outperformed multiple baseline models on the time series task, namely the autoregressive integrated moving average (ARIMA), autoregressive moving average (ARMA), autoregressive (AR), moving average (MA), and Persistence Baseline models, as well as the state-of-the-art tNodeEmbed models. Furthermore, we show that VAM outperformed the Persistence Baseline and tNodeEmbed models on the user-assignment tasks. Finally, we show that using Reddit activity data improves prediction accuracy.


I. INTRODUCTION
Social media's vast societal influence is apparent. Recent research has shown its effect in many aspects of society, such as election campaigns [1], the spread of coronavirus disease 2019 (COVID-19) misinformation [2], and the promotion of pump and dump cryptocurrency schemes [3].
Clearly, it would be ideal to predict the future phenomena related to any topic on any social media platform. To that end, we created an end-to-end simulator, called the Volume-Audience-Match (VAM) algorithm. VAM's goal is to predict what will happen for a given topic on a social media platform. The experimental results are shown on 18 topics appearing on Twitter.
VAM is a model that consists of two components, or modules that work in the following way. First, for each topic in a given social media platform, at some time step of interest, T , the volume prediction (VP) module of VAM takes as input a set of past time series features both related to that topic and external exogenous features that may influence that topic's behavior in the future. It then uses these features to perform time series forecasting. For any given topic-time step pair, VAM predicts three time series of length S which are: 1) the topic's future event volume time series; 2) the topic's newly active user time series; and 3) the topic's active old user time series. Second, the user-assignment module of VAM uses these three time series predictions, as well as previous user interaction history, to tackle the more fine-grained task of predicting, for a given topic, within the timespan of T + 1 up to T + S: 1) which user performs which action and 2) with whom each user interacts. We frame this problem as a link prediction problem. An edge comprises a child user u and a parent user, v. An edge exists between u and v if u reacts to a post written by v. In Twitter, this reaction takes the form of a retweet, or tweet in the case of an initial tweet (i.e., self-loop).
Note that we use the term "module" to differentiate from the term "model" for clarity throughout this work. VAM is the name of the overall model, while the VP module is the component of VAM that predicts event and user volumes, and the user-assignment module is the component of VAM that performs user-to-user predictions.
We tested VAM's predictive power on the Twitter dataset related to the Venezuelan political crisis [4], [5], covering a time period spanning from December 28, 2018, up until March 7, 2019. This article makes the following contributions.
1) We introduce VAM, a simulation pipeline that performs both time series regression and temporal link prediction tasks in an end-to-end manner.
2) We show that VAM can predict the creation of new users and their activities.
3) We show that VAM strongly outperforms multiple statistical and state-of-the-art comparisons across a myriad of metrics for the time series prediction task.
4) We show how VAM performs when using XGBoost as its "backend" versus when it uses recurrent neural networks (RNNs) as its backend. We show that the XGBoost versions of VAM are more accurate and faster to train than the RNN versions. This is notable because RNNs have been used extensively in previous social media prediction literature, while XGBoost has not.
5) We provide an analysis of the use of social media platform features to determine what helps VAM achieve the best time series prediction performance in Twitter. We show that using features from activity on Reddit improves predictions of Twitter activity.
6) Finally, we show that VAM greatly outperforms the baseline and state-of-the-art models in multiple user-assignment tasks, namely the old user prediction task, the indegree prediction task, and the page rank prediction task.

II. RELATED WORK

A. General Popularity Prediction in Social Media
The term "general popularity prediction" refers to the prediction of the overall future volume of activities in social media networks. In these works, user-level activity prediction is not considered. The work [6] uses neural networks to do this. Jayaram et al. [7] performed time series regression in social media networks Facebook, Twitter, and Linkedin to predict the future volume of user activities in these platforms. They use various curve fitting models such as polynomial, logarithmic, and exponential regressions. Kong et al. [8] use a Hawkes process model to predict the time of events on various media platforms. Bidoki et al. [9] used LSTMs to predict bursts of Github activity using exogenous features from Reddit and Twitter.

B. Decompositional User-Level Prediction in Social Media
The term "decompositional user-level prediction" refers to works that aim to predict future user-level activity in social media networks; however, the methods break the task into two or more subtasks. There are two types of decompositional approaches we observed in the literature. First, there are what we call "volume-to-user" approaches, which first predict the overall number of events in a social media network with an initial model, and then assign users to these actions using a second model. The framework in this work, VAM, falls into this category. The previous works that also fall into this category are the SocialCube model [10], an auto-regressive integrated moving average (ARIMA)-driven method, as well as the proposed models of [11] and [12], which are LSTM-driven methods. VAM mainly differs from these methods in that it is used to predict user-to-user interactions in Twitter, while the other methods predict user-to-repository interactions in Github.
The other type of decompositional method is what we call "clustering-based methods." These methods first use an initial model to cluster users in a social media network, and then a second model to predict future user-level actions using the cluster information. Saadat et al. [13] used K-means clustering to cluster Github users based on their activity rate, and Bidoki et al. [14] clustered repositories based on their topics, such as programming languages and profile keywords. In each work, a second set of models is then used to predict, using the cluster information, the most likely user-repository pairs to occur in the future.

C. Direct User-Level Prediction in Social Media
The direct user-level prediction methods predict future user activity in social media networks, but do so directly, unlike the decompositional approaches. The works of Chen et al. [15] and Liu et al. [16] use embedding neural networks to do this. The works of Hernandez et al. [17] and Shrestha et al. [18] use neural networks on sequences of adjacency matrices to predict user activity over time on Twitter. Garibay et al. [19] introduce the multiplexity-based model, which captures social network evolution based on preferential attachment, attention, and recency cognitive bias.
Blythe et al. [20] used three sampling models, as well as one Bayesian model and one link prediction model, to predict future user-to-repo interactions in Github. The authors found that the sampling models performed the best. Finally, there is the work of Murić et al. [21], who created various machine learning models that predicted user-to-repository links in Github, as well as user comment threads in Twitter and Reddit (TR).

D. General Temporal Link Prediction
There have been several previous works on temporal link prediction algorithms. Some use neural networks that embed each node in a given network into a low-dimensional space, such as dyngraph2vec [22], DeepWalk [23], node2vec [24], and tNodeEmbed [25]. These embeddings can then be used for temporal link prediction or node classification. In this work, node embedding is not used in VAM as it can be computationally expensive in terms of training time and space. However, we do compare VAM's performance with some node embedding methods and show that VAM is much more accurate and faster to train.
There are also matrix factorization approaches to temporal link prediction, which are discussed in [26]- [28]. However, these approaches also struggle with scalability due to high computational cost.
Finally, there are temporal link prediction approaches which use probabilistic methods, such as [29] and [30]. These methods have been shown to be effective but suffer from computational complexity in terms of space [30], and time in the case of [29] and [30]. Also note that unlike VAM, these temporal link prediction methods do not have the ability to predict the appearance of new users.

III. IMPORTANCE OF PREDICTING NEW USERS
In this work, we define a "new" user at time step T as someone who has not previously been involved with a topic in the period spanning t = 1 up to t = T − 1. Our data analysis showed that for certain topic-platform pairs, there are a considerable number of new users that appear each day. For five out of 18 topics, at least 40% of the active users within a given day are new on average. For nine out of 18 topics, at least 25% of the users in a given day are new, on average. For this reason, it is important to predict their appearance and activities in addition to that of the old users. We provide more detailed analysis of this phenomenon in the supplemental materials [31].

IV. PROBLEM STATEMENTS
VAM addresses two problems, namely: 1) the Volume Prediction of users and activities and 2) the assignment of predicted activities to appropriate users within the context of a user-to-user link prediction. In this section, we will discuss these two problems in more detail.

A. Volume Prediction Problem
Definition 1 (The VP Task for a Topic-Time Step Pair): Let us say that for some given platform, one is given static and temporal features for some topic, q ∈ Q, at some time step T. Intuitively, T can be thought of as the current time step of interest. One must then predict the following three future time series relating to this topic-time step pair, (q, T): 1) the activity volume time series; 2) the active old user volume time series; and 3) the new active user volume time series. Each time series must span from time T + 1 up to T + S, with S being an integer that represents the length of the predicted time series. Furthermore, let Ŷ ∈ R^{3×S} be a time series matrix that represents the aforementioned predicted time series. In other words, Ŷ represents a prediction matrix such that each row represents one of the three output time series, and each column represents a time step in any of the time series.
Ŷ approximates the ground-truth matrix, Y. The time frame that Y encompasses (T + 1 to T + S) is called the forecast period of interest, or F_T. F_T can be thought of as a tuple of the form (T + 1, T + S). T + 1 is the first time step in the forecast period of interest and T + S is the last time step.
To address the VP problem, we created both XGBoost and RNN regression models. The inputs are time series features and static features related to a given topic. The granularity of the time step information given to the models is hourly, and they predict 24-hour time series (one day). We chose XGBoost models because they are known for being relatively quick to train while retaining high predictive accuracy [32]. We also used RNNs because they have been widely used in previous literature for social media prediction [11], [12], [16]- [18], [25].

B. User-Assignment Link Prediction Problem
In this section, the problem statement for the user-assignment problem is introduced. Let {G}_{t=1}^{t=T} be a sequence of static graphs such that G = {G_1, G_2, . . . , G_T}. G represents the user-interaction history of some topic, q, on some social media platform.
Each graph at time step t, G_t, can be viewed as a tuple of sets of the form (V_t, E_t, w). V_t is the set of all users (nodes) u ∈ V_t present in graph G_t. E_t is the set of all edges that exist in graph G_t. The edges in E_t are of the form [u, v, w(u, v, t)]. An edge exists in E_t if user u responded to a post made by user v at time step t. The term w represents a weight function such that w(u, v, t) represents how many times user u responded to v at time step t. Using this information, we can now define VAM's user-prediction task as follows.
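To make the snapshot representation concrete, the following Python sketch shows one way a single snapshot G_t could be stored; the function name and the event-tuple format are illustrative assumptions rather than VAM's actual implementation.

```python
from collections import defaultdict

def build_snapshot(events):
    """Build one graph snapshot G_t from a list of (child, parent) response
    events observed during time step t.

    Returns the node set V_t and a weighted edge dictionary, where
    weights[(u, v)] plays the role of w(u, v, t): the number of times
    child u responded to parent v during this time step.
    """
    nodes = set()
    weights = defaultdict(int)
    for child, parent in events:
        nodes.add(child)
        nodes.add(parent)
        weights[(child, parent)] += 1  # self-loop when child == parent (original tweet)
    return nodes, dict(weights)

# Example: two retweets of user "v1" by "u1", plus one original tweet by "v1".
V_t, E_t = build_snapshot([("u1", "v1"), ("u1", "v1"), ("v1", "v1")])
# E_t == {("u1", "v1"): 2, ("v1", "v1"): 1}
```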
Definition 2 (The User-Assignment Prediction Task for a Topic-Time Step Pair): Let us say, for some topic-time step pair (q, T), one is given a matrix, Ŷ ∈ R^{3×S}. This matrix contains, for (q, T), the future VP time series for: 1) the number of events; 2) the number of old users; and 3) the number of new users. Furthermore, let us say one is given a set, {G}_{t=1}^{t=T}, which represents the user-interaction history of topic q, for each of the T time steps. Given these two input items, predict a sequence, {Ĝ^future}_{t=1}^{t=S}. Intuitively, one can think of Ĝ^future as a set containing future user interactions over the next S time steps with regard to topic-time step pair (q, T). So, a sequence Ĝ^future would have the following form: {Ĝ^future_1, Ĝ^future_2, . . . , Ĝ^future_S}. Graph Ĝ^future_1 represents the future user interactions for topic q at time step T + 1. Graph Ĝ^future_2 represents the future user interactions for topic q at time step T + 2, and so on. Note that Ĝ^future is an approximation of the ground-truth graph set, G^future.
Intuitively, one can view the user-assignment problem as a temporal link prediction problem, but with the added "assistance" of the future volume counts from some predictive model. A pictorial overview of VAM is shown in Fig. 1.

A. Twitter Data Collection
The raw data on the 2019 Venezuelan political crisis used in these experiments were originally collected by data collectors at the Leidos company. For the Twitter data, subject matter experts (SMEs) compiled a list of keywords and Twitter handles that would allow for the collection of the most relevant Venezuela tweets. These SMEs were individuals hired by Leidos who were fluent in Spanish and very familiar with the political situation in Venezuela. The keywords were evaluated by SMEs for both their precision and recall with regard to tweets about the Venezuelan political crisis. Frequently co-occurring groups of keywords were then used to create 18 "topics." For example, international/aid_rejected is a topic comprising the two separate keywords "international" and "aid_rejected." This topic refers to the disputed President of Venezuela, Maduro, rejecting humanitarian aid from other countries to the people of Venezuela [33]. Table I contains basic statistics of the networks used in this work. Since there were 18 topics in our dataset, we had 18 Twitter networks. Each node in a Twitter graph represents a user, and each edge represents an interaction between users, such as a retweet, or tweet (which can occur in the case of a self-loop). Note that we did not include quotes or replies in our Twitter data, because they comprised such a small portion of overall Twitter activity (3.6%).
The smallest number of nodes for a given network was 62 603, and the smallest number of edges observed was 110 097. The largest number of nodes in a given network was 484 405. Not all network statistics could be shown in Table I due to space constraints; for full information about each network, see the supplemental materials in [31].

B. Reddit Collection
Reddit is a social media platform in which users read and comment on various message boards, known as subreddits. We collected Reddit posts and comments related to the Venezuelan political crisis spanning from December 28, 2018, to March 7, 2019.

A. Use of Lookback Factor and Exogenous Data
We were interested in knowing whether the next 24 hours of social media activity could be predicted from some initial time step, T , so to that end, we set S = 24 in our experiments. We believe 24 hours is long enough to be useful in a practical application, but still short enough that it is a reasonable period for a model to predict within.
Second, we needed to define an appropriate lookback period, or volume lookback factor for prediction, which we defined as L vol . This means that we used historic data from L vol time steps to make a prediction. In our experiments, we tried 96, 72, and 48 hours as values for L vol .
Furthermore, we wanted to know whether it was sufficient to use Twitter data alone to perform successful future activity predictions on Twitter, or if exogenous features from Reddit were helpful. To that end, we trained and tested two different types of Twitter models, which were Twitter-only (T) models and TR models.

B. Sample Tuples
As previously mentioned, each sample in each dataset represents a topic-time step pair (q, T ). The variable q represents the topic of interest, and T represents the current time step of interest. Each topic-time step sample comprises input features and output values. The inputs and outputs are described as follows.
First, there is the static input feature set. This is a one-hot vector that represents the topic of interest, q. Second, there are the temporal input features. These are the time series input features for our given sample. They differ depending on the model type, as listed in Table II. Finally, there are the output targets. The output for a given topic-time step pair (q, T) is the matrix Y ∈ R^{3×S}. This matrix comprises three output time series for the VP task: the event volume time series, the new user volume time series, and the old user volume time series.
We shall illustrate this point with an example. Let us define the Twitter prediction matrix to be Y, for topic q = arrests, and current time step of interest T = 200. Furthermore, we define a volume lookback factor of L_vol = 96 and we define the output time series size, S, to be equal to 24. For the training and validation sets, we wanted to generate as much data as possible, so we calculated each daily sample both in terms of day and hour. That is, a sliding window was used and advanced 1 hour at a time to create a new overlapping sample. Using this method, we generated 17 730 samples for each training set, and 2610 samples for each validation set.
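The sliding-window sample generation described above can be sketched as follows; the array layout, the function name, and the synthetic counts are illustrative assumptions, and the static one-hot topic features and the exogenous Reddit features are omitted for brevity.

```python
import numpy as np

def make_samples(series, L_vol=96, S=24, stride=1):
    """Slide a window over hourly time series to create (input, output) samples.

    `series` is an array of shape (3, total_hours) holding the hourly event,
    new-user, and old-user counts for one topic.  Each sample uses the last
    L_vol hours as input features and the next S hours as the target matrix
    Y of shape (3, S).  A stride of 1 hour yields overlapping samples, as
    described for the training and validation sets.
    """
    X, Y = [], []
    total_hours = series.shape[1]
    for T in range(L_vol, total_hours - S + 1, stride):
        X.append(series[:, T - L_vol:T].ravel())   # flattened lookback window
        Y.append(series[:, T:T + S])               # ground-truth output matrix
    return np.asarray(X), np.asarray(Y)

# Example with synthetic hourly counts for one topic.
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(3, 24 * 30))       # 30 days of hourly counts
X, Y = make_samples(counts)
print(X.shape, Y.shape)                            # (601, 288) (601, 3, 24)
```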

C. XGBoost Setup
In this section, we discuss our setup of the VAM XGBoost models. Let D be a dataset such that D = {(x_i, Y_i)} for i = 1, . . . , n_samples. Furthermore, let n_samples = n_topics · τ. The terms n_samples, n_topics, and τ represent the number of samples, topics, and prediction time steps of interest, respectively. The term x_i ∈ R^m represents an input feature vector of m features, and Y_i ∈ R^{3×S} represents the output matrix.
We then define a 3 × S matrix of functions, Φ(x_i), whose entries are φ_{a,b}(x_i) = ŷ^i_{a,b}. Each function φ_{a,b}(x_i) in the matrix represents a separate XGBoost model, and each of these models maps to an output-type-and-time step pair value, ŷ^i_{a,b}. An integer variable, a, can be used to indicate any particular row in the matrix, such that 1 ≤ a ≤ 3, and an integer variable b can be used to indicate any particular column of the matrix, such that 1 ≤ b ≤ S. Recall that rows represent one of the three output types (actions, new users, or old users), while columns represent one of the S future time steps.
The function Φ(·) represents the VP module, which contains 3 × S XGBoost models, φ_{a,b}(·), and each XGBoost model is an ensemble of CART trees. Intuitively, one can think of each of the XGBoost models as "specializing" on a particular (output-type, time step) pair.
There are 3 × S models used because XGBoost comprises regression trees. A regression tree can only predict one output. So to predict a time series, one would need a regression tree for each time step in the time series. The alternative to the multiple-model approach would be to predict an output, feed that output back into the XGBoost model as an input, predict the second output, and so on. The problem with this approach is that one would run into the issue of compounding errors over time. As a result, these errors could cause these models to predict time series that do not come close to approximating the ground truth at all.
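A minimal sketch of this one-model-per-(output-type, time step) design is shown below, assuming the xgboost Python package; the helper names and the hyperparameter values are placeholders, not VAM's tuned configuration.

```python
import numpy as np
from xgboost import XGBRegressor

def fit_vp_module(X, Y, S=24):
    """Fit one XGBoost regressor per (output-type, time step) pair.

    X has shape (n_samples, m) and Y has shape (n_samples, 3, S).  The
    returned 3 x S grid of models plays the role of the function matrix
    described above: model [a][b] predicts output type a at future step b.
    """
    grid = [[None] * S for _ in range(3)]
    for a in range(3):
        for b in range(S):
            model = XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1)
            model.fit(X, Y[:, a, b])
            grid[a][b] = model
    return grid

def predict_vp(grid, x):
    """Assemble the predicted matrix Y_hat in R^{3 x S} for one input vector x."""
    S = len(grid[0])
    x = x.reshape(1, -1)
    return np.array([[grid[a][b].predict(x)[0] for b in range(S)] for a in range(3)])
```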

D. XGBoost Parameter Selection
We used the XGBoost [32] and sk-learn [34] libraries to create and train our models. The parameters used for our XGBoost models are as follows. The subsample frequency, gamma, and L1 regularization were set to 1, 0, and 0, respectively. For the other parameters, we performed a grid search over a pool of candidate values. We used our validation set to evaluate for the best parameters to use. For the column sample frequency, the candidate values were 0.6, 0.8, and 1. For the number of trees parameter, the candidate values were 100 and 200. For the learning rate, the values were 0.1 and 0.2.
For L2 regularization, the values were 0.2 and 1. Finally, for maximum tree depth, the values were 5 and 7.
For the loss function, mean squared error was used. For normalization, log normalization was used.
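The following sketch illustrates how such a grid search over the candidate values could be run against the validation set for a single (output-type, time step) target. It assumes log1p/expm1 as the form of the log normalization and maps the "subsample frequency" and "column sample frequency" parameters to XGBoost's subsample and colsample_bytree arguments; both mappings are assumptions, not the paper's documented implementation.

```python
import itertools
import numpy as np
from xgboost import XGBRegressor

# Candidate values taken from the text.
param_grid = {
    "colsample_bytree": [0.6, 0.8, 1.0],
    "n_estimators": [100, 200],
    "learning_rate": [0.1, 0.2],
    "reg_lambda": [0.2, 1.0],   # L2 regularization
    "max_depth": [5, 7],
}

def grid_search_one_target(X_tr, y_tr, X_val, y_val):
    """Return the parameter setting with the lowest validation MSE.

    Targets are log-normalized with log1p before fitting and mapped back
    with expm1 for evaluation (an assumption about the normalization's form).
    """
    best_params, best_mse = None, np.inf
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        model = XGBRegressor(subsample=1.0, gamma=0.0, reg_alpha=0.0, **params)
        model.fit(X_tr, np.log1p(y_tr))
        pred = np.expm1(model.predict(X_val))
        mse = float(np.mean((pred - y_val) ** 2))
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse
```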

E. RNN Overview
We experimented with four different RNNs: GRU [35], LSTM [36], bidirectional LSTM [37], and bidirectional GRU [37] networks. Unlike XGBoost, RNNs have the ability to predict multiple outputs within one model, so we did not have to make multiple RNNs per (output-type, time step) pair in the same manner as the XGBoost VAM models. For additional architecture and hyperparameter information for the RNNs, refer to the supplemental materials [31].

F. Baseline Overview
We compared VAM with five statistical baseline models in this work, which were the Persistence Baseline, ARIMA, auto regressive moving average (ARMA), auto regressive (AR), and moving average (MA) models [38]. Furthermore, we used three state-of-the-art methods, tNE-node2vec-H, tNE-node2vec-S, and tNE-DeepWalk.

G. Statistical Baselines
The Persistence Baseline model predicts the events during time frame T +1 to T + S by simply outputting the events that occurred at time T − S to T . The assumption of this model is that the future will exactly resemble the recent past. This assumption may sound naive; however, we found this baseline to perform very well against the others.
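For reference, the Persistence Baseline amounts to the following one-line forecast rule (a sketch; the exact indexing convention is ours):

```python
import numpy as np

def persistence_forecast(history, S=24):
    """Persistence Baseline: predict that the next S hours repeat the last S
    observed hours.  `history` is a 1-D array of hourly counts up to time T."""
    return np.asarray(history[-S:], dtype=float)
```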
The ARIMA model and its variants (ARMA, AR, and MA) are widely used statistical models and, hence, used for comparison as well. The ARMA, AR, and MA models are variants of ARIMA depending on what p, d, and q parameters are set to. The ARIMA model has p > 0, d > 0, and q > 0. The AR model has p > 0, d = 0, and q = 0. The ARMA model has p > 0, d = 0, and q > 0. Finally, the MA model has p = 0, d = 0, and q > 0.
To train each of these ARIMA-based models, a grid search was performed with p and q's possible values being 0, 24, 48, 72, and 96, and d's possible values being 0, 1, and 2. A different model was trained per topic/output-type pair. So, for example, the (Maduro, # of new users) pair had its own ARIMA, ARMA, AR, and MA models. The validation set was used to select the best model parameters for the test period and the root mean square error (RMSE) metric was used to select the best model parameters.
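A sketch of this per-(topic, output-type) grid search is shown below using the statsmodels ARIMA implementation; the helper name and the reduced candidate grid are illustrative, and order selection by validation RMSE follows the procedure described above.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def select_arima_order(train, val, orders):
    """Pick the (p, d, q) order with the lowest RMSE on the validation series.

    `train` and `val` are 1-D arrays of hourly counts for one
    (topic, output-type) pair; `orders` is an iterable of (p, d, q) tuples.
    """
    best_order, best_rmse = None, np.inf
    for order in orders:
        try:
            fit = ARIMA(train, order=order).fit()
            pred = fit.forecast(steps=len(val))
            rmse = float(np.sqrt(np.mean((pred - val) ** 2)))
        except Exception:      # some orders fail to converge
            continue
        if rmse < best_rmse:
            best_order, best_rmse = order, rmse
    return best_order, best_rmse

# Example: a small subset of the grid described above.
candidate_orders = [(p, d, q) for p in (0, 24) for d in (0, 1) for q in (0, 24)]
```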

H. State of the Art Comparisons
For the state-of-the-art comparisons, we used three variations of the tNodeEmbed embedding algorithm [25] because it has been shown to work well on link prediction tasks. Furthermore, embedding-based approaches in general have been widely used for temporal network prediction tasks.
tNodeEmbed is a variation of node2vec [23], [24] that incorporates temporal information from the graph into its embeddings. It uses a rotation operation that aligns embeddings of nodes across time for more accurate predictions [25]. For clarity throughout this work, we named each tNodeEmbed variation based on the underlying embedding algorithm it uses for initialization. We refer to tNE-DeepWalk as the tNodeEmbed algorithm that is initialized with the DeepWalk graph algorithm. Likewise, tNE-node2vec-H and tNE-node2vec-S refer to the variations that are initialized with the homophilic and structural variations of node2vec, respectively [24].
In this work, each embedding represents a (child, parent, topic, day) tuple. Furthermore, these embeddings were each fed into one of three different fully connected neural networks (one per embedding approach). The output of one of the neural networks was a vector of 24 values representing the number of activities a particular child-user edge would perform under a particular topic over the next 24 hours. Since these models predicted activity at the user-to-user level of granularity, we aggregated these counts to topic and time step granularity for consistent comparison to the VAM and statistical baseline models. For more information on how these embeddings work, and the hyperparameters used for their associated neural networks, refer to the supplemental materials [31].

A. VP Metrics
To ensure that the time series predictions were correctly measured for accuracy, six different metrics were used over each of the 21 forecast period of interest instances in the test period spanning February 15, 2019, to March 7, 2019. The results were averaged across the 21 instances for each metric.
We used RMSE and mean absolute error (MAE) to measure how accurate each time series was in terms of "volume over exact time step." Normalized cumulative RMSE (NC-RMSE), which converts the simulated and ground-truth time series into cumulative sum time series and then divides each by their respective maximum values, was also used. This metric allows us to know how well a time series was predicted without considering the overall scale or "exact timing" of each value in the time series. This type of measurement is important because sometimes a model would predict a burst within some range of time steps, but not in the exact spot. However, knowing that a burst of activities will occur within some range of time steps is better than not knowing at all.
Symmetric absolute percentage error (S-APE) measures how accurate the total number of events was for each model, without regard to the temporal pattern. The formula is as follows. Let F be the forecast time series and let A be the actual time series; then S-APE = |Σ_t F_t − Σ_t A_t| / ((Σ_t F_t + Σ_t A_t)/2) × 100%. The volatility error (VE) and skewness error (SkE) metrics were used to measure how well the simulated time series captured the "burstiness" of the ground-truth time series. VE is measured by calculating the standard deviation of both the ground-truth and simulated time series, and then calculating their absolute difference. The SkE metric is measured by calculating the skewness of both the ground-truth and simulated time series, and then calculating their absolute difference. The skewness statistic used in this work uses the adjusted Fisher-Pearson standardized moment coefficient. It can be found at the top of [39, p. 7].
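The following sketch shows how these per-series metrics could be computed, assuming the S-APE form given above and SciPy's bias-corrected skewness for the adjusted Fisher-Pearson coefficient; the exact implementations behind the reported results may differ in detail.

```python
import numpy as np
from scipy.stats import skew

def s_ape(forecast, actual):
    """Symmetric absolute percentage error on total event counts."""
    f, a = float(np.sum(forecast)), float(np.sum(actual))
    return 200.0 * abs(f - a) / (abs(f) + abs(a))

def nc_rmse(forecast, actual):
    """Normalized cumulative RMSE: compare max-normalized cumulative sums."""
    cf = np.cumsum(forecast) / max(float(np.max(np.cumsum(forecast))), 1e-12)
    ca = np.cumsum(actual) / max(float(np.max(np.cumsum(actual))), 1e-12)
    return float(np.sqrt(np.mean((cf - ca) ** 2)))

def volatility_error(forecast, actual):
    """Absolute difference of the standard deviations of the two series."""
    return abs(float(np.std(forecast)) - float(np.std(actual)))

def skewness_error(forecast, actual):
    """Absolute difference of adjusted Fisher-Pearson skewness coefficients."""
    return abs(float(skew(forecast, bias=False)) - float(skew(actual, bias=False)))
```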
We found that sometimes a particular model might seemingly have an impressive RMSE, MAE, or NC-RMSE relative to other models, but upon visually inspecting the time series plots, the model in question does not capture the ground truth's "bursts" or "dips" that the other models do. Hence the use of metrics that capture "burstiness." We explain this phenomenon in more detail in the supplemental materials [31].
Table III shows the overall results for the models on the six aforementioned metrics. Since there were many metrics, we calculated one "overall" metric that represents how well each model performed across all six metrics. We call this new metric the "overall normalized metric error" (ONME). It was calculated by creating six "metric groups," each comprising the 14 model metric results for that particular metric. A similar "normalized error metric" was used in [19]. The model results within each of the six groups were normalized between 0 and 1 by dividing each model metric result by the sum of all model metric results within that particular group. The models in each table are then sorted and ranked from lowest to highest ONME.

B. Overall Normalized Volume Metric
To illustrate how well each model performed against the best performing baseline, we used a metric that we call the "percent improvement from best baseline" (PIFBB). These values represent, as a percent, how much the ONME improved from the best baseline, which in this case was the Persistence Baseline. The formula for this value is PIFBB = (ONME_baseline − ONME_model) / ONME_baseline × 100%. The upper bound of PIFBB is 100%, which occurs if a model's ONME is 0. This is clearly the best possible result. The lower bound of PIFBB is negative infinity because any given model could potentially perform infinitely worse than the best baseline.
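A sketch of the ONME and PIFBB computations is given below; averaging the normalized errors across the six metric groups is our assumption about the aggregation step, chosen because it is consistent with the reported ONME values.

```python
import numpy as np

def onme(metric_table):
    """Overall normalized metric error.

    `metric_table` has shape (n_models, n_metrics); entry (i, j) is model i's
    result on metric j (lower is better for every metric used here).  Each
    metric column is normalized by its column sum, and each model's ONME is
    taken as the mean of its normalized errors across metrics.
    """
    metric_table = np.asarray(metric_table, dtype=float)
    normalized = metric_table / metric_table.sum(axis=0, keepdims=True)
    return normalized.mean(axis=1)

def pifbb(model_onme, best_baseline_onme):
    """Percent improvement from the best baseline."""
    return 100.0 * (best_baseline_onme - model_onme) / best_baseline_onme
```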

C. Overall Metric Result Analysis
The best VP module for Twitter belonged to the VAM-XGB-TR-96 model. This was the XGBoost model trained on TR data with a lookback factor of 96 hours. The ONME for this model was 0.05394, which was about a 17.53% improvement over the best baseline, the Persistence Baseline. Furthermore, all three of the TR VAM models outperformed all three of the T VAM models. This suggests that the exogenous Reddit platform contains information that can aid with prediction.
The ARIMA-based models (ARIMA, ARMA, MA, and AR) could not outperform the Persistence Baseline, despite its simplicity. The closest baseline to it was the MA baseline, with a PIFBB of about −10.90%.
Finally, despite being state-of-the-art approaches, the tNodeEmbed models did not outperform any of the five basic statistical baselines. The best tNodeEmbed model was the tNE-DeepWalk model, with a PIFBB score of about −42.31%, which was about 7 percentage points lower than the worst statistical baseline, ARIMA, which had a PIFBB score of about −34.92%. A plausible reason for the weak performance of this set of models is that they directly predict the user-to-user interactions, in contrast to the VAM VP module and the ARIMA models, which predict total hourly activity. Performing such a granular task makes it more difficult for these models to accurately predict the more macroscopic phenomenon of hourly user activity. Most of the user-to-user edges perform no activities at the hourly level, so when training models with such samples, the models are inclined to predict mostly 0 activity.

D. VAM XGBoost Versus VAM RNN
Since we observed that the best VAM XGBoost model had TR features (VAM-XGB-TR-96), we then trained several RNN models with the same features to compare their performance. Table IV shows these results. Similar to Table III, there is an ONME metric used to show the relative performance among all models, as well as a PIFBB score to show how well each model performed against the best baseline (Persistence Baseline).
The four different RNN models used were a GRU RNN, LSTM RNN, bidirectional GRU RNN, and bidirectional LSTM RNN. We were particularly interested in comparing the XGBoost VAM models with RNN VAM models because RNNs are among the most frequently used machine learning approaches for social media activity prediction as shown in [11], [12], [16]- [18], [25].
Despite the wide popularity of RNNs, we found that the XGBoost VAM model (VAM-XGB-TR-96) outperformed all RNN approaches. It had a PIFBB score of 17.47%. The best RNN model was trained with a GRU RNN (VAM-GRU-TR-96). It had a PIFBB score of 15.21%. Overall, the RNNs were able to strongly outperform Persistence Baseline. The least accurate RNN (VAM-Bi-LSTM-TR-96) had a PIFBB score of 11.03%.

E. Training Time Analysis
Tables III and IV also show the training time for each model. Each XGBoost and ARIMA model was trained on a computer with an Intel Xeon E5-260 v4 CPU. Each CPU comprised two sockets, eight cores, and 16 threads. Each computer had 128 GB of memory. The tNodeEmbed and VAM RNN models were trained on GeForce GTX 1080 Ti GPUs.
In addition to being the best performing models, the XGBoost VAM models were also the quickest to train, with training times spanning from 3 to 7 min.
The RNN models took much longer to train, with the fastest model (VAM-LSTM-TR-96) taking 2 hours and 54 min, and the slowest model (VAM-GRU-TR-96) taking 4 hours and 36 min.
The Persistence Baseline has "n/a" marked as its training time because this model is trivially created by moving historical predictions forward. There is no training phase involved.
The ARIMA-based models performed worse than the RNN models, and were even slower, taking anywhere from roughly 10 hours (AR) up to 26 hours (ARIMA).
The embedding models took the longest time to train, in addition to being the worst performing. This is because of the cost of creating the original embeddings themselves, and the cost of training neural networks with these embeddings as input features. Furthermore, the embedding methods predict at the hour and user level, in contrast to the ARIMA and VAM models that predict at the total number of users and activities hourly level. The fastest embedding model was the tNE-node2vec-S model, with about 47 hours of training time. The slowest model was the tNE-node2vec-H model, with almost 61 hours of training time.
As one can see, the VAM XGBoost models are the best models for VP, because they are both quick to train and highly accurate relative to the other models.

F. VAM-XGB-TR-96 Metric Results by Topic
We wanted to better understand model performance per topic. To that end, we compared the best VAM model's metric results per topic (VAM-XGB-TR-96) with the best baseline's metric results per topic (Persistence Baseline).
In the supplemental materials are bar plots and tables illustrating VAM's performance against the Persistence Baseline model per each topic and metric pair. Due to space limits, we briefly describe the topic-level metric results in this section.
For the RMSE metric, VAM won against the best baseline on 18 out of 18 topics. For MAE, VAM won 17 out of 18 times, and for NC-RMSE, VAM won 17 out of 18 times. Overall, VAM outperformed the Persistence Baseline 97 out of 108 times, or 89.8% of the time. VAM performed particularly well on the "volume over time" metrics (RMSE, MAE, and NC-RMSE), as well as the volatility metric (VE). It performed decently on the "magnitude" or "scale" metric (S-APE). Finally, it struggled the most with the SkE metric, which measures the asymmetry of the time series. Fig. 2 shows the performance of the VAM-XGB-TR-96 model against the five baselines on various topics and days. As one can see, VAM was able to more closely approximate the ground truth than the baseline models. On the other/chavez/anti and protests topics, VAM more closely approximated some bursty ground-truth behavior in comparison to the baseline models (with some error, of course).

A. Overview
Recall the user-assignment task for VAM. Once the VP module predicts matrix Ŷ for topic-time step pair (q, T), the task for the user-assignment module is to use Ŷ and the graph history set, {G}_{t=1}^{t=T}, to predict the future graph sequence, {Ĝ^future}_{t=1}^{t=S}. As mentioned earlier, Ĝ^future is an approximation of the ground-truth graph set, G^future.
The user-assignment is done in the following way. For S iterations, a graph Ĝ^future_s (s ≤ S) is generated and added to the overall final Ĝ^future sequence. Eight main data structures are used to aid in the user-assignment task, as described in Sections VIII-B–VIII-E.

B. Recent History Table
First, there is a recent history table, called H recent . This is a table containing event tuples generated using information from G. Each tuple contains the following information: 1) the child (acting) user; 2) the parent (receiving) user; 3) the number of interactions between child and parent at some time step t; 4) a flag indicating whether the child is new at time step t; and 5) a flag indicating whether the parent is new at time step t.
H recent is known as a "recent" history table because it is made from only the most recent graph snapshots from G. The lookback factor parameter L user is used to determine the number of snapshots to use. For example, if L user = 5, then only the five most recent graphs in sequence G will be used to make H recent . The assumption here is that recent history is all that is needed to make temporal network predictions.
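A simplified sketch of building such a table from the last L_user snapshots is shown below; the record layout is illustrative, and for brevity the new-user flags are computed only against users seen earlier within the window, whereas the paper defines "new" relative to all prior time steps.

```python
def build_recent_history(snapshots, L_user=24):
    """Build a simplified recent history table H_recent from the last L_user
    graph snapshots.  Each snapshot is a dict mapping (child, parent) to the
    number of interactions at that time step (as in the earlier sketch)."""
    records = []
    known_users = set()
    for t, snapshot in enumerate(snapshots[-L_user:]):
        for (child, parent), count in snapshot.items():
            records.append({
                "child": child,
                "parent": parent,
                "count": count,
                "time_step": t,
                "child_is_new": child not in known_users,
                "parent_is_new": parent not in known_users,
            })
        for child, parent in snapshot:
            known_users.update((child, parent))
    return records
```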

C. Old and New Users
The next two data structures are the set of selected old users, Ô_s, and the set of generated new users, N̂_s. Note that VAM "knows" the number of old and new users because they were predicted by the VP module.

D. Old and New User Probability Tables
The fourth data structure is the old user activity probability table (W old ). It is a table containing each old user's probability of being active (e.g., tweeting/retweeting) at some time step t.
The fifth data structure is the new user archetype table (W_new_arch). This table models how different "archetypes" of new users have behaved in the past. These archetypes are generated using recently active user information from H_recent. These attributes are: 1) the probability of acting and 2) the probability of being influential (e.g., being retweeted). This archetype table is then used to create the sixth data structure, W_new, which contains the activity and influence probabilities for the users in N̂_s.

E. Old and New Parent Tables
The last two tables are the old and new user parent tables, D_old_parent and D_new_parent, respectively. These are hash tables in which each key is a user, and the value is a table containing: 1) a list of that user's historical "parents" (i.e., the users that the user of interest is most likely to retweet) and 2) the probability that the user of interest will retweet or comment on that particular parent.
These eight data structures are used to predict each Ĝ^future_s in the temporal sequence Ĝ^future. Algorithm 1, labeled Assign_Users, contains the pseudocode for the user-assignment algorithm. For an in-depth explanation of the algorithm, refer to the supplemental materials [31].
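The following is a highly simplified sketch of the weighted random sampling idea behind the user-assignment step for a single future time step; the helper names, the stand-ins for W_old and D_old_parent, and the sampling details are illustrative assumptions and do not reproduce Algorithm 1.

```python
import random
from collections import Counter

def assign_users_one_step(n_events, n_old, n_new, old_activity_probs, parent_probs):
    """Simplified sketch of one user-assignment step.

    n_events, n_old, and n_new come from the VP module's predictions for one
    future time step.  old_activity_probs maps old user -> activity
    probability (a stand-in for W_old); parent_probs maps child ->
    {parent: probability} (a stand-in for D_old_parent).  Returns a weighted
    edge dictionary approximating one future graph snapshot.
    """
    users = list(old_activity_probs)
    # Sample (approximately) the predicted number of active old users,
    # then generate placeholder identities for the predicted new users.
    active_old = set()
    if users and n_old > 0:
        weights = [old_activity_probs[u] for u in users]
        active_old = set(random.choices(users, weights=weights, k=n_old))
    new_users = [f"new_user_{i}" for i in range(max(n_new, 0))]
    actors = list(active_old) + new_users
    if not actors:
        return {}

    edges = Counter()
    for _ in range(max(n_events, 0)):
        child = random.choice(actors)
        if child in parent_probs and parent_probs[child]:
            parents, p = zip(*parent_probs[child].items())
            parent = random.choices(parents, weights=p, k=1)[0]
        else:
            parent = child                      # no known parent: original tweet (self-loop)
        edges[(child, parent)] += 1
    return dict(edges)
```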

A. JS for Old Users
To measure how well VAM predicted old users, we used both the unweighted and weighted Jaccard similarity (JS) metrics; the weighted version is also known as the Ruzicka similarity [40]. These metrics were used to measure how well VAM predicted influential old users. We define influential users as users who are retweeted at least once in a given time step. Unweighted JS was used to measure whether a user was retweeted at least once or not. Weighted JS was used to measure the similarity of the predicted number of times a user was retweeted to the ground-truth number of times a user was retweeted.
Let A represent the set of the actual old users within a particular hour, and let P represent the predicted set of old users within a particular hour. The unweighted JS is trivially calculated as the cardinality of the intersection of A and P divided by the cardinality of the union of A and P. Furthermore, let a and p represent vectors that contain the weights of each user in the A and P sets, respectively. For example, a_k represents the weight of user A_k from the A set. With this in mind, the weighted JS is defined as follows: JS_w(a, p) = Σ_k min(a_k, p_k) / Σ_k max(a_k, p_k).
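A direct implementation of both similarity scores could look as follows (the user identifiers and weight mappings are illustrative):

```python
def unweighted_js(actual, predicted):
    """Jaccard similarity between the actual and predicted sets of influential old users."""
    actual, predicted = set(actual), set(predicted)
    union = actual | predicted
    return len(actual & predicted) / len(union) if union else 1.0

def weighted_js(actual_weights, predicted_weights):
    """Weighted Jaccard (Ruzicka) similarity between two user -> retweet-count mappings."""
    users = set(actual_weights) | set(predicted_weights)
    num = sum(min(actual_weights.get(u, 0), predicted_weights.get(u, 0)) for u in users)
    den = sum(max(actual_weights.get(u, 0), predicted_weights.get(u, 0)) for u in users)
    return num / den if den else 1.0
```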

B. Defining Success for New User Prediction
Since our task also involves predicting the creation and activity of new users, in addition to old users, defining and measuring predictive success becomes a bit more difficult. Since we do not "know" the name of a new user before they appear in the ground truth, it is impossible to exactly match a new user that VAM generates with a new user that exists in the ground truth. So, to work around this issue, we measure success using more macroscopic views of the network, specifically the page rank distribution and the complementary cumulative degree histogram (CCDH).

C. Page Rank and EMD
The page rank score [41] measures how influential a particular node is upon the entire network. In our experiments, we calculated page rank on the weighted indegree of our networks. If VAM properly simulated the activities of old and new users, then VAM's simulated network page rank distribution should closely approximate the ground-truth network's page rank distribution. To measure the distance between the predicted and actual page rank distributions, we used the earth mover's distance (EMD) metric [42].
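A sketch of this comparison using networkx and SciPy is shown below; representing the two page rank distributions as one-dimensional samples for scipy's wasserstein_distance is an assumption about the implementation.

```python
import networkx as nx
from scipy.stats import wasserstein_distance

def pagerank_emd(predicted_graph, actual_graph):
    """Earth mover's distance between the page rank distributions of the
    predicted and ground-truth networks.  Both graphs are assumed to be
    networkx DiGraphs whose edges carry a "weight" attribute holding the
    interaction counts."""
    pr_pred = list(nx.pagerank(predicted_graph, weight="weight").values())
    pr_true = list(nx.pagerank(actual_graph, weight="weight").values())
    return wasserstein_distance(pr_pred, pr_true)
```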

D. CCDH and RH Distance
The CCDH of a graph G is defined as (N(k))_{k=1}^{∞}, in which N(k) denotes the number of vertices of degree at least k [43]. It is closely related to the more well-known concept of the degree distribution. In our experiments, we calculated the CCDH on the unweighted indegree distribution of the ground-truth and simulated networks. Success is defined by how closely the predicted CCDH matches the ground-truth CCDH.
To measure the distance between the predicted network CCDH and the ground-truth network CCDH, we used the relative Hausdorff distance (RHD). Previous work has shown the RH distance to be a suitable metric for measuring the distance between two CCDHs [44].
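A minimal sketch of computing the CCDH from an unweighted indegree sequence is shown below; the RH distance between the two resulting CCDHs is then computed as described in [44] and is not reproduced here.

```python
import numpy as np

def ccdh(indegrees):
    """Complementary cumulative degree histogram: N(k) = number of vertices
    with (unweighted) indegree at least k, for k = 1, 2, ..., max degree."""
    indegrees = np.asarray(indegrees)
    max_k = int(indegrees.max()) if indegrees.size else 0
    return np.array([(indegrees >= k).sum() for k in range(1, max_k + 1)])
```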

X. USER-ASSIGNMENT RESULTS
In this section, we discuss the results of VAM's user-assignment module. Since the VAM-XGB-TR-96 model was the best performing VAM model for the VP task, we used that VAM model's VPs for the user-assignment task. We also used a user-assignment lookback factor L_user of 24 hours. In other words, the past 24 hours of user activity history were used when VAM assigned actions to users. We found 24 hours to work the best.

A. Multiple Trials
Since VAM's user-assignment algorithm is probabilistic, it was run five times with five different seed initializations. The three user-assignment metrics (JS, EMD, and RHD) were then calculated across each of the five trials and averaged together. These averaged results are shown.

B. Overall JS Results
Table V shows the model performances for the user prediction measurements (weighted and unweighted JS). The results were calculated across all 18 topics and averaged together for both the weighted and unweighted scores. Then, a final average JS score was calculated by averaging the weighted and unweighted scores for each model. The PIFBB score was calculated against the best baseline, which was Persistence Baseline.
As seen in the table, the VAM-XGB-TR-96 model had the best Average JS score of about 0.12, and a PIFBB of 36.89%. Persistence Baseline came in second with an average JS of about 0.09. The tNodeEmbed models were much worse, with average JS scores of around 0.009, and PIFBB scores of around −88%.

C. Overall EMD and RHD Results
Table VI shows the model performances on the network measurements (EMD and RHD). Similar to the JS results, each metric was calculated individually for all 18 topics and then averaged together.
Similar to the volume result Table III, we calculated an ONME metric to obtain relative model performance and a PIFBB score to obtain relative improvement over the best baseline (Persistence Baseline).
Once again, VAM-XGB-TR-96 was the best model, with an ONME of about 0.10 and PIFBB of 15.3%. Persistence Baseline came in second place with an ONME of 0.13. The t-NodeEmbed models were much worse with ONMEs of around 0.26 and PIFBB scores spanning from about −116% to −118%.

D. Per-Topic Result Analysis
Since the VAM model and Persistence Baseline were the two best models, we wanted to do a more granular comparison between the two. To that end, we compared the metric results on a per-topic basis in a similar fashion to the per-topic comparison done in Section VII-F. We counted the number of times VAM outperformed Persistence Baseline for each of the 18 topics for each of the four user-assignment metrics.
For weighted JS, VAM outperformed Persistence Baseline on 18 out of 18 topics. Similarly, for the unweighted JS, VAM outperformed Persistence Baseline on 18 out of 18 topics.
For the EMD metric, VAM had 17 out of 18 topic wins. Finally, for RH distance, VAM had 15 out of 18 wins. For tables and barplots showing the precise metrics, refer to the supplemental materials [31]. We also performed analysis to observe how well VAM performs against Persistence Baseline for highly influential and lowly influential users. We found that VAM also strongly outperforms the baseline in this analysis as well. However, due to space constraints, we have placed those results in the supplemental materials as well [31].
In summary, VAM was good at predicting which old user edges would exist for a given time step. It was also quite good at predicting what "type" of user would be active at each time step in terms of page rank influence (as measured by EMD). VAM was slightly worse (but still good overall) at predicting the unweighted indegree distribution of the users (as measured by RH distance).

E. User-Assignment Runtime Information
Similar to the VP module, the user-assignment module of VAM was run on computers with an Intel Xeon E5-260 v4 CPU. Each CPU comprised two sockets, eight cores, and 16 threads. Each computer had 128 GB of memory. The user-assignment module was run in parallel over five computers (one per trial). The average runtime of the user-assignment algorithm across the five trials was about 2 hours and 13 min, which is quite reasonable considering that there were 18 topics and millions of edges. Since there were 21 days in the test period, on average the user-assignment algorithm took about 6.33 min to simulate the activities for one day (or 24 hours) across all 18 topics.

XI. CONCLUSION AND FUTURE WORK
In this work, we presented the VAM simulator. It is the first end-to-end simulator of user activity in social media platforms that uses time series prediction and probabilistic link prediction to estimate the future activity of both old and new users. In this work, VAM was used to predict both overall and user-level activity related to the recent Venezuelan political crisis on a per-topic basis.
On the VP task, VAM was shown to have good performance against multiple widely used statistical models (ARIMA, ARMA, AR, MA, persistence), as well as several tNodeEmbed models. As previously mentioned, it outperformed the best baseline (Persistence Baseline) on 97 out of 108 topic-metric pairs, or 89.8% of the time. On the user-assignment task, VAM strongly outperformed the Persistence Baseline on 68 out of 72 topic-metric pairs, or about 94% of the time. With refinement, VAM could be used as an alert system for potential future real-world activity.
Future work includes a variety of tasks. First, we would aim to use two machine learning models in the user-assignment module to predict the most likely active users and the final link predictions. Perhaps these models could outperform the weighted random sampling approach that VAM's user-assignment module currently uses. For the VP module, we would try other machine learning models, such as transformer neural networks, and compare their performance with the XGBoost and RNN VAM models.