Forecasting Network Traffic: A Survey and Tutorial With Open-Source Comparative Evaluation

This paper presents a review of the literature on network traffic prediction, while also serving as a tutorial on the topic. We examine works based on autoregressive moving average models, such as ARMA, ARIMA and SARIMA, as well as works based on Artificial Neural Network approaches, such as RNN, LSTM, GRU, and CNN. In all cases, we provide a complete and self-contained presentation of the mathematical foundations of each technique, which allows the reader to gain a full understanding of the operation of the different proposed methods. Further, we perform numerical experiments based on real data sets, which allows comparing the various approaches directly in terms of fitting quality and computational costs. We make our code publicly available, so that readers can readily access a wide range of forecasting tools, and possibly use them as benchmarks for more advanced solutions.


I. INTRODUCTION
Over the past decade, networks have embraced a softwarization process that is granting operators increasing control capabilities over their infrastructures. This trend creates new opportunities for network management, many of which build upon anticipatory decision-making paradigms: traffic engineering, resource allocation or service orchestration policies are enacted in advance of future fluctuations in traffic demands, as reviewed by Bui et al. [1]. Anticipatory networking has the potential to substantially improve network resource utilization and end users' quality of service with respect to traditional reactive, human-in-the-loop strategies. Yet, its actual gains largely depend on the accuracy of the predictions that proactive decisions are taken upon. Therefore, the role of traffic time-series forecasts is today more central than ever in the design of cutting-edge solutions for network management. (The associate editor coordinating the review of this manuscript and approving it for publication was Oguzhan Urhan.)
Network traffic prediction is in fact a largely explored subject in networking, with early works dating back to the seventies and a flurry of recent proposals fostered by the success of machine and deep learning tools. Several surveys already exist that review and classify the many works on network traffic forecasting based on different approaches, from statistical models to artificial neural network ones.
The most recent reviews focus on specific forecasting approaches, such as the deep neural networks (DNN) targeted by Cao et al. [2] and Huang et al. [3] or the graph neural networks (GNN) considered by Jiang and Luo [4]; these surveys are hence limited to a subset of the techniques available in the literature, whereas our goal is to provide a comprehensive view of all major prediction strategies. Closer to our approach, other reviews encompass multiple classes of models. For instance, Joshi and Hadi [5] compile an extensive survey on network analysis and traffic prediction, describing pre-processing techniques such as discretization methods and feature selection, which are very relevant to improving data quality in the presence of outliers or inconsistent information, as well as linear and non-linear approaches for network traffic prediction. Also, Hendikawati et al. [6] survey papers using the standard linear models (ARMA, ARIMA, SARIMA) as well as multivariate methods where the prediction is a function of two or more variables, such as vector autoregressive (VAR), vector moving average (VMA), and vector autoregressive moving average (VARMA); they also list papers using neural networks, support vector machines, the wavelet transform, fuzzy logic, and Adaptive Neuro-Fuzzy Inference Systems. Finally, Jiang [7] reviews studies that specifically target cellular traffic prediction, and classifies the models into statistical, machine learning, and deep learning models.
All the existing surveys above take a traditional approach of listing a variety of works proposing network traffic predictors, briefly explaining their operation and proposing a classification or taxonomy. Instead, this paper has a different and more ambitious objective: as in regular surveys, we present a wide range of previous studies that employ statistical and artificial neural network techniques for network traffic forecasting; however, at the same time we also propose a complete and self-contained tutorial on those prediction techniques. To this end, we include a full explanation of the mathematical formulation of each technique, and of all the steps that one should follow to best configure the associated model. Unlike previous reviews, this paper thus lets the reader gain a deep understanding of the functioning of available traffic forecasting approaches.
Moreover, we implement all reviewed techniques for traffic prediction, and run numerical experiments with all the techniques presented. Our experimental approach is new, since no previous survey does the same, and it has a twofold advantage. First, it allows a direct comparison of the many solutions proposed in the literature in terms of prediction quality and computational cost; we employ real-world citywide datasets of mobile network traffic to perform such a comparative evaluation, and provide a first-of-its-kind ranking of techniques. Second, we publicly release our implementations of network traffic predictors (the code and documentation are made available, see the Appendix for details), which enables non-experts to get acquainted with practical realizations of the different forecasting models, and also provides researchers with a sound lineup of benchmarks that can be used in comparative analyses of novel solutions. When combined with open datasets of traffic, such as the one we consider for our analysis, our implementations can improve the reproducibility and verifiability of research results aimed at traffic prediction.

The document is organized as follows. In Section II, we give a brief introduction to network topology and the advantages of 5th generation mobile networks. Besides, we present a mathematical formulation for the problem, some literature approaches, and their performance evaluation. In Section III, we discuss ARMA, ARIMA, and SARIMA models, presenting the mathematical formulation of each and explaining their limitations. We also illustrate pre-processing approaches such as time-series differencing and decomposition, and briefly present clustering techniques, which may be useful to improve the prediction results. Section IV introduces different types of artificial neural network models: three recurrent neural networks (RNN, LSTM, and GRU) and convolutional neural networks (CNN).
Again, we detail the mathematical foundations of each model and explain their internal operation with intuitive figures. In Section V, we map relevant papers targeting traffic forecasting to the different techniques presented in the previous sections, while also outlining the associated data sets and evaluation strategies. Section VI presents comprehensive numerical experiments comparing all the presented approaches in the presence of real-world mobile traffic demands, considering different configurations for the statistical and neural network approaches. Section VII provides open research directions. Finally, Section VIII draws conclusions.

II. BACKGROUND
A simplified cellular network topology can be divided into three main parts: radio access network, core network, and user equipment.
• Radio access network (RAN): responsible for the radio communication between mobile devices and the network. It includes several base stations (BS) and one base station controller (BSC), which is responsible for communication resource management, handover tasks, etc. According to Lin and Zhang [8], BSs are responsible for about 80% of the energy consumption of a cellular network. Each generation of wireless cellular technology has its specific BS, as presented in Table 1 (the first generation was based on analog technology). GSM, UMTS, and LTE stand for Global System for Mobile Communications, Universal Mobile Telecommunications System, and Long-Term Evolution, respectively. 4G LTE has no controller because the eNodeB is also able to accomplish this task.
• Core Network: responsible for connecting different RANs among themselves and also to external networks, such as the Internet.
• User Equipment: the mobile devices of the users.
Note that the RAN and Core Network are complex parts of a cellular network. Details about them can be found in Kanani et al. [9], Abed et al. [10], and Seddigh et al. [11].
Nowadays, studies related to 5G are on the rise. This technology presents improvements over the previous generations, as presented in Al-Falahy and Alani [12]: smaller latency, faster download and upload, higher bandwidth, a higher number of connected devices, data processing speed (Mbps/m²) 100 times faster than 4G, reduction in energy consumption, etc. All these developments enable many applications that depend on device-to-device communication, such as the Internet of Things and autonomous cars.
The 5G network architecture is heterogeneous, that is, it includes macro and small cells. Hence, optimization problems concerning spectrum allocation, the location of these cells, and the energy saving of those not serving users are studied, as in Su et al. [13]. Many of these optimization techniques are based on big data generated by both users and network operators. As presented by Zheng et al. [14], this data needs to be processed, due to incomplete or uncertain information, and then transformed into knowledge, such as user locations, mobility patterns, and communication behavior. Details on 5G network topology and technical aspects (frequency range, power consumption, channel bandwidth) are presented in Gupta and Jha [15]. In this work, base stations are also referred to as antennas or towers, regardless of their generation technology.
It is important to emphasize that the network traffic load can also be measured and predicted inside fixed environments, such as a Local Area Network (LAN), where different computers or smartphones are connected to routers or switches. The advantages of network traffic prediction (management, anticipatory decision making, and resource allocation) are also desirable in such a context. Some of the works cited in this survey design forecasting models for these kinds of data sets, as will be further presented.

A. NETWORK TRAFFIC FORECASTING
Consider that a time-series containing T network traffic load observations at a tower j is given by X^(j)_t = [x_1, x_2, . . . , x_T]. The elements of this vector could carry different information depending on the data set, such as: amount of transferred data (bytes), number of internet connections, number of incoming/outgoing cellphone calls and SMS, or any other relevant information.
In network traffic forecasting, the objective is to design a model to predict the future loads [x_{t+1}, . . . , x_{t+k}] for the next k instants ahead, based on previous observations in X^(0)_t and X^(j)_t with j = 1, . . . , n_k, as well as any other available information that the data set may provide (geographical position, past errors, weather conditions, etc). The predicted value is given by

x̂_{t+k} = Φ X^(0)_t + Θ ϒ^(0)_t, (1)

where Φ ∈ R^{1×T} and Θ ∈ R^{1×T} are vectors containing the parameters to be estimated, and ϒ^(0)_t collects auxiliary information about the target tower.
Here ξ is the number of factors (besides the time-series traffic load) that are used to construct the model, such as prediction errors (in the case of moving average models), hidden states (when using recurrent neural networks), or any other variables that may impact the prediction quality and are available in the data set. When information from neighboring towers is also exploited, the terms T_i(Q^i_t) ∈ R^{T×1} are calculated in a similar procedure, where ϒ^(j)_t is a vector containing information about tower j used to construct the prediction model; alternatively, convolution operations may be used, as will be presented later.
It is important to notice that the position of each X^(j)_t in the matrix M_t is not necessarily determined by the geographical position of each tower. It could also be determined via other methodologies and metrics, later presented, used to cluster the time-series; the same analysis extends to ϒ^(j)_t. A prediction model that is capable of forecasting the traffic load with small error may have great commercial value for the companies that design the network. With good estimates x̂_{t+k}, these companies can assign an optimized amount of hardware/software, making the network capable of meeting the demands inside an area while achieving the optimal trade-off between capacity (number of users, data transmission rate, coverage area, etc) and monetary cost. According to Bega et al. [16], not only do over-dimensioned networks cause economic losses, but so do under-dimensioned ones, since they may lead to subscriber churn and fees associated with Service-Level Agreements.
Usually, the data sets shared by the companies contain information about the traffic load of multiple antennas at a specific time granularity, together with the geographical position of each one. In Section VI, numerical experiments comparing the approaches discussed in this survey are performed on a data set of the telecommunication activity in Milan, Italy. There, the following information is captured in 10-minute time slots:
• received/sent SMS;
• incoming/outgoing calls;
• internet connection start/end.
The items above form Call Detail Records (CDR), which were pre-processed by the operator so as to ensure the anonymity of the data subjects generating the traffic. The mobile traffic data is mapped on a regular spatial grid. For a full description of the data set, see Barlacchi et al. [17].
Our goal is to use equation (1) and, based on an observation horizon of i past samples, to predict k steps of the traffic load of a specific tower (j = 0). Suitable values for the prediction and observation horizons are discussed in the following sections.
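To make the roles of the observation horizon i and the prediction horizon k concrete, the sketch below slices a univariate series into (observation, target) windows. This is a minimal illustration with made-up names and toy data, not code from the released repository.

```python
import numpy as np

def make_windows(series, i, k):
    """Split a univariate series into (observation, target) pairs:
    each sample uses i past values to predict the next k values."""
    X, Y = [], []
    for t in range(len(series) - i - k + 1):
        X.append(series[t:t + i])          # observation horizon
        Y.append(series[t + i:t + i + k])  # prediction horizon
    return np.array(X), np.array(Y)

# toy traffic series: 12 samples
x = np.arange(12.0)
X, Y = make_windows(x, i=4, k=2)
print(X.shape, Y.shape)  # (7, 4) (7, 2)
```

Each row of X is then one input to a forecasting model, and the corresponding row of Y holds the k values to be predicted.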

B. DATA-DRIVEN APPROACHES
This survey focuses on prediction models based on Autoregressive Moving Average processes and on Neural Networks. The first class relies on linear combinations of previous traffic load values and errors, leading to relatively simple solutions with high interpretability. As discussed in most of the bibliography covered in this work, models based on neural networks generally achieve higher forecast quality than other approaches. Besides, it is relatively simple to combine different types of ANNs to obtain a model that is able to capture particular patterns, e.g., a combined CNN-RNN approach can provide spatio-temporal knowledge about the network traffic load. For these reasons, a great number of papers related to network traffic prediction use ARIMA- or ANN-based strategies. Figure 1 illustrates this scenario, based on a Scopus search for papers whose indexed keywords contain time-series, forecast, and the name of each of the approaches briefly presented below.
Linear Regression (LR) is a machine learning technique where one wants to estimate the parameters of a linear model. Basically, the dependent variable to predict (output) is calculated as a linear combination of independent variables (input). The parameters are calculated by minimizing some loss function, for instance the sum of squared residuals or the sum of absolute residuals. The first case is equivalent to maximizing the likelihood when a Gaussian distribution is assumed for the error term. The second case, known as Robust Linear Regression, assumes a Laplace distribution and may provide better results when the data set contains outliers. In Lechowicz [18], linear regression is used to predict the bandwidth blocking probability of a network.
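As a minimal illustration of the first loss mentioned above (ordinary least squares), the sketch below fits a line to noisy synthetic data with NumPy's least-squares solver; the data and true coefficients are our own toy choices.

```python
import numpy as np

# Fit y ≈ a*x + b by minimizing the sum of squared residuals.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=x.size)

A = np.column_stack([x, np.ones_like(x)])  # design matrix [x, 1]
params, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = params
print(round(a, 1), round(b, 1))  # close to 3.0 and 2.0
```

Replacing the squared loss with the absolute loss (robust regression) requires an iterative solver instead of this one-shot solution.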

1) SUPPORT VECTOR MACHINE
(SVM) is a machine learning technique widely used in classification problems. In summary, it solves an optimization problem whose output is the parameters of a hyperplane that separates the input space into two halfspaces. Subsequently, the classification is performed according to the halfspace each sample falls into. SVM can also be used in regression and network traffic prediction problems, as in Stepanov et al. [19]. The authors compared SVM forecasting results with predictions obtained by the Random Forest algorithm.

2) RANDOM FOREST
(RF) is an ensemble of decision trees, where different random subsets of the training data are used to build different decision trees. Subsequently, a prediction is made by majority vote in classification problems, or by the mean of the outputs in the case of regression.

3) WAVELET TRANSFORM
(WT) can be used as a decomposition method when the signal features, such as frequency content or trends, change over time. Distinctly from the Fourier Transform, where signals are decomposed into sine waves, the Wavelet Transform uses wavelets, fast-decaying wave-shaped oscillations with zero mean. As will be discussed later, network traffic time-series are highly non-stationary, which makes such a tool very useful in this context. Despite being a decomposition technique, when combined with prediction models the Wavelet Transform may lead to good forecasting results, as in Lu and Yang [20] and Tian [21].
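As a small, self-contained illustration of wavelet decomposition, the sketch below implements one level of the Haar transform by hand; real applications typically use a wavelet library and smoother, deeper wavelets, and the function names and toy signal here are our own.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform:
    approximation (low-pass) and detail (high-pass) coefficients.
    Assumes an even-length input."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse transform, reconstructing the original signal."""
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = haar_dwt(signal)
reconstructed = haar_idwt(a, d)
print(np.allclose(reconstructed, signal))  # True
```

In a forecasting pipeline, the smoother approximation coefficients and the detail coefficients can be predicted separately and then recombined.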

4) HIDDEN MARKOV MODEL (HMM)
is a probabilistic model where the sequence of events can be described as a Markov process: a state x_t at time t is affected only by the previous one, that is, x_{t−1}. Additionally, these states are not directly measured, but impact an output y_t according to known rules. Hence, one wants to estimate x_t from observations of y_t. In Chadza et al. [22], the authors used HMM to predict network attacks on a simulated attack period of 10 days with 50 and 420 attacking/victim machines and 30 servers.
Autoregressive models are constructed as a linear combination of past observations. Usually, the model parameters are estimated by maximizing the likelihood of making the correct predictions. Similarly, moving average models are also linear, but the prediction is calculated from previous errors. When these two models are used simultaneously, one obtains the so-called autoregressive moving average models, explained in Section III, which can be extended to ARIMA and SARIMA.
A very large class of functions can be approximated using Neural Networks, which have the advantage of introducing non-linearity into the models. Many different prediction strategies are based on Neural Networks, where different patterns (such as spatio-temporal ones) can be learned during a training phase and subsequently used for predictions. Some of the techniques widely used in traffic load forecasting are based on Convolutional and Recurrent Neural Networks, detailed in Section IV. Other models are discussed in Chen et al. [23].
Network traffic load prediction involves (at least) three big areas: time-series clustering, decomposition, and prediction models. In Figure 2 we summarize the main research topics involved in this subject. Some examples of strategies for each field are given (many others can be found in the references presented later). It is important to emphasize that these areas are connected and affect the prediction result: for instance, a good time-series decomposition could lead to a better clustering result thanks to the filtering of the stochastic component; subsequently, the parameters of a prediction model can be estimated and, according to the cluster a given network belongs to, its traffic load is predicted.

C. PERFORMANCE EVALUATION
Different metrics can be used to evaluate the three areas presented in Figure 2. Concerning the prediction models, quality is measured according to error metrics, presented in Table 2, computed between the predicted and actual values of the time-series.
Usually, a good clustering result is one where the elements inside the same cluster are similar, while elements in different clusters have low similarity (according to some adopted distance measure). However, clustering is an unsupervised learning technique where the size and number of clusters may not be known a priori. Therefore, evaluating whether the result of some clustering strategy is good is not a trivial task.
According to Aghabozorgi et al. [24], evaluation metrics are divided into two groups: external and internal indexes. In the first one, there is a ground truth where the cluster each time-series belongs to is known. Hence, the results of a clustering strategy are compared with a validation data set by means of metrics such as: Jaccard Score, Rand Statistic, Fowlkes-Mallows Index, purity, entropy, etc. This is used in Chiş et al. [25], where the authors propose a new algorithm for time-series clustering and evaluate the approach by comparing, in terms of Jaccard Score, Rand Statistic, and Fowlkes-Mallows Index, the labels of their clustering results with the labels of the original clusters.
One can see that in traffic load prediction this may not be applicable, since the data sets usually do not provide clustering information. In this case, a common practice is to optimize internal indexes, such as SSE, root-mean-squared standard deviation (RMSSTD), semi-partial R-squared, R-squared, the distance between two clusters, etc. By solving the optimization problem with one of the cited metrics, one expects to obtain high intra-cluster and low inter-cluster similarities. A detailed explanation of each of these metrics is given in Halkidi et al. [26].

III. SIMPLE STATISTICAL MODELS
Before introducing the models, the concept of a stationary time-series is presented. As reported by [27], the mean and autocovariance functions of X_t are

μ_X(t) = E[X_t],   γ_X(t + h, t) = Cov(X_{t+h}, X_t), (3)

for all integers t and h. X_t is stationary if:
• μ_X(t) is independent of t;
• γ_X(t + h, t) is independent of t for each h.
When these two conditions are satisfied, the statistical properties of the time-series are similar to those of its time-shifted version.

A. ARMA MODELS
In Hoong [28] and Tan et al. [29], prediction models for network traffic demand are developed by applying the Autoregressive Moving Average (ARMA) time-series modeling technique. Such a model is composed of two main terms: an AutoRegressive (AR) component, which is a sum of past observations plus a white noise term, and a Moving Average (MA) component, which is a sum of past white noise errors plus the expected value of the time-series. They are modeled, respectively, as

x_t = φ_1 x_{t−1} + · · · + φ_p x_{t−p} + ε_t, (4)

and

x_t = μ + ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}, (5)

where:
• φ_1, . . . , φ_p are the AR parameters to be determined;
• θ_1, . . . , θ_q are the MA parameters to be determined;
• μ is the expected value of x_t;
• p is the autoregressive order;
• q is the moving average order.
Here x_t is the traffic load at time t, which we want to predict based on past instances.
According to Brockwell and Davis [27], X_t can be modeled as an ARMA(p, q) process if it is stationary and if for every t

X_t − φ_1 X_{t−1} − · · · − φ_p X_{t−p} = ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}. (6)

Additionally, it is ARMA(p, q) with mean μ if X_t − μ is an ARMA(p, q) process. Considering μ = 0, equation (6) becomes

X_t = φ_1 X_{t−1} + · · · + φ_p X_{t−p} + ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}. (7)

For instance, an ARMA(2,1) model with μ = 0 can be represented as

X_t = φ_1 X_{t−1} + φ_2 X_{t−2} + ε_t + θ_1 ε_{t−1}. (8)

Such models provide better results when applied to stationary data. Hence, when the data is non-stationary, some pre-processing is necessary. This is usually achieved by using the log return transformation,

r_t = log(x_t / x_{t−1}), (9)

where r_t is the log return value at time t. Then, the model we need to estimate takes the form

r_t = φ_1 r_{t−1} + · · · + φ_p r_{t−p} + ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}. (10)

After the transformation, both papers make use of Cronos, an open-source tool written in the C# programming language, to evaluate the prediction activity. This software uses Maximum Likelihood Estimation (MLE) to estimate the model parameters. With MLE, the model parameters are calculated in such a way that the likelihood of making the correct observations is maximized.
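The log return transformation and its inversion can be sketched in a few lines; the toy series below (constant 10% growth) is an illustrative choice of ours.

```python
import numpy as np

# Log-return transformation r_t = log(x_t / x_{t-1}), often used to
# bring a positive, non-stationary series closer to stationarity.
x = np.array([100.0, 110.0, 121.0, 133.1])
r = np.log(x[1:] / x[:-1])
print(np.round(r, 4))  # every return is approximately log(1.1)

# The original series is recovered from the first value and the returns:
x_back = x[0] * np.exp(np.cumsum(r))
print(np.allclose(x_back, x[1:]))  # True
```

After forecasting in the log-return domain, the inverse mapping above converts predictions back to traffic volumes.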
To evaluate the quality of prediction, Hoong [28] used the mean squared error of the log returns,

MSE = (1/n) Σ_{i=1}^{n} (value_a,i − value_b,i)², (11)

where:
• value_a,i = i-th actual log return;
• value_b,i = i-th predicted log return.
The experiment proposed by the author consisted in designing two models, ARMA(3,0) and ARMA(2,1), and comparing their MSE on 5-step-ahead predictions. The ARMA(3,0) model achieved better results for network traffic with a seasonal pattern, while ARMA(2,1) presented lower MSE for network traffic with a cyclical pattern.

MLE for the autoregressive term: consider that each term in the time-series is independent and identically distributed (iid) and that the set of observations is defined as D. Then

p(D|φ) = Π_{t=p+1}^{n} p(x_t | x_{t−1}, . . . , x_{t−p}, φ), (12)

where p(D|φ) is the likelihood to be maximized, and n is the number of observations. In linear regression problems, the noise terms are typically assumed to be Gaussian. Hence, ε_t ∼ N(0, σ²) in equation (4), and consequently

p(x_t | x_{t−1}, . . . , x_{t−p}, φ) = N(x_t | φ_1 x_{t−1} + · · · + φ_p x_{t−p}, σ²). (13)

By considering equations (12) and (13) and applying the negative log likelihood (NLL), one has

NLL(φ) = (1/(2σ²)) Σ_{t=p+1}^{n} (x_t − φ_1 x_{t−1} − · · · − φ_p x_{t−p})² + const. (14)

Hence, we should compute φ such that equation (14) is minimized. Consider that the targets x_t and the corresponding lagged observations are collected in a vector W and a matrix Y, respectively. So,

min_φ ‖W − Y φ‖², (15)

which is a least squares problem that can be solved by means of the Normal Equation, see Calafiore and El Ghaoui [30]: φ̂ = (Yᵀ Y)⁻¹ Yᵀ W.
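The least-squares estimation of the AR parameters described above can be sketched as follows; the function name and the simulated AR(2) process are our own illustrative choices, and the Gaussian-noise assumption matches the derivation.

```python
import numpy as np

def fit_ar(x, p):
    """Estimate AR(p) coefficients by least squares (the normal
    equation), which coincides with MLE under Gaussian noise."""
    # Column i of Y holds the lag-(i+1) values; W holds the targets x_t.
    Y = np.column_stack([x[p - i - 1:len(x) - i - 1] for i in range(p)])
    W = x[p:]
    phi, *_ = np.linalg.lstsq(Y, W, rcond=None)
    return phi

# Simulate an AR(2) process x_t = 0.6 x_{t-1} + 0.3 x_{t-2} + eps_t
rng = np.random.default_rng(1)
x = np.zeros(5000)
for t in range(2, x.size):
    x[t] = 0.6 * x[t - 1] + 0.3 * x[t - 2] + rng.normal(0, 0.1)

phi = fit_ar(x, p=2)
print(np.round(phi, 1))  # close to [0.6, 0.3]
```

With enough samples the least-squares estimates recover the true coefficients, as the derivation above predicts.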

B. ARIMA AND SARIMA MODELS
When the time-series is not stationary, the use of ARMA models is still possible via transformations, as discussed before. However, if the time-series becomes stationary after differencing d times, one may use the ARIMA(p, d, q) model, where the I stands for integrated, see Jung et al. [31]. Hence, ARMA is a special case of ARIMA (when no differencing is involved). This category of models is discussed next.

1) TIME-SERIES DIFFERENCING
The differenced time-series, as used in Alsharif et al. [32], is simply defined as

x'_j = x_{j+1} − x_j, (17)

for j = {1, . . . , T − 1}. When the differenced time-series is not yet stationary, one may use second-order differencing, defined as

x''_j = x'_{j+1} − x'_j, (18)

for j = {1, . . . , T − 2}. For instance, consider the first sample, that is, j = 1:

x''_1 = x'_2 − x'_1 = (x_3 − x_2) − (x_2 − x_1) = x_3 − 2x_2 + x_1. (19)

Considering the backward shift operator B x_t := x_{t−1}, equation (19) can be written as

x''_1 = (1 − B)² x_3. (20)

A general formulation for a differencing of order d at any instant k is thus

(1 − B)^d x_k. (21)

Then, the ARMA model is constructed on the new (differenced) variables. Another strategy sometimes used is seasonal differencing, defined as

x'_j = x_{j+S} − x_j, (22)

for j = {1, . . . , T − S}, where S is the lag value. For instance, if some event has a seasonality of 24 hours, one may set S = 24.
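First-order, second-order, and seasonal differencing can be sketched directly with NumPy; the quadratic toy series below is chosen so that second-order differencing yields a constant.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0])  # quadratic trend

d1 = np.diff(x)          # first-order differencing
d2 = np.diff(x, n=2)     # second-order differencing
print(d1)  # [ 3.  5.  7.  9. 11.]
print(d2)  # [2. 2. 2. 2.] -> constant: the quadratic trend is removed

# Seasonal differencing with lag S: x'_j = x_{j+S} - x_j
S = 2
ds = x[S:] - x[:-S]
print(ds)  # [ 8. 12. 16. 20.]
```

In an ARIMA(p, d, q) workflow, d is increased until the differenced series looks stationary, and the ARMA model is then fitted to that series.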

2) MODEL DEFINITION
According to Yu et al. [33], ARMA models can be represented as

φ(B) x_t = θ(B) ε_t, (27)

where φ(B) = 1 − φ_1 B − · · · − φ_p B^p and θ(B) = 1 + θ_1 B + · · · + θ_q B^q. If (1 − B)^d is the differencing operator, then the ARIMA(p, d, q) model is

φ(B) (1 − B)^d x_t = θ(B) ε_t. (28)

For instance, consider d = 1 and p = 2; then the left side of equation (28) becomes

(1 − φ_1 B − φ_2 B²)(1 − B) x_t. (29)

Here, the parameter d represents the number of times differencing should be applied to make the time-series stationary.
In a similar way, if the time-series to be modeled is non-stationary and has a periodic component with period S, one may use seasonal differencing, and equation (27) becomes

φ(B) Φ(B^S) x_t = θ(B) Θ(B^S) ε_t, (30)

and the SARIMA(p, d, q)(P, D, Q)_S model is described by

φ(B) Φ(B^S) (1 − B)^d (1 − B^S)^D x_t = θ(B) Θ(B^S) ε_t, (31)

where:
• P: seasonal autoregressive order;
• D: seasonal difference order;
• Q: seasonal moving average order;
and Φ(B^S) and Θ(B^S) are the seasonal polynomials of orders P and Q, respectively.

C. TRAFFIC FORECAST WITH SARIMA MODELS
In this subsection, a general procedure to predict network traffic using SARIMA models is explained. It is important to emphasize that this is a general approach based on the literature review: some papers do not use clustering strategies, others cluster each base station after time-series decomposition, and some apply clustering strategies without considering time-series decomposition. The most general procedure to obtain a prediction model is illustrated in Figure 3. In short, some filtering technique is applied to the data to remove noise. Subsequently, a clustering strategy can be used to divide the antennas into subsets according to the similarity of their traffic loads, and one model can be designed for each cluster. Then, with the time-series of a specific antenna and the information about the cluster it belongs to, it is possible to perform the prediction. Each of these steps is explained below.

1) TIME-SERIES DECOMPOSITION
Many of the papers that use ARIMA models for forecasting first apply time-series decomposition, which can be performed in several ways. For completeness, we report here the method introduced in Brockwell and Davis [27], and adopted for instance in Xu et al. [34] and Zhang et al. [35]. This method is useful when the time-series has a small trend and we may assume that the trend within each period is constant. The time-series can be described as T = {x_1, x_2, . . . , x_n}, with each x_t representing the traffic volume in the t-th time slot. Then, the time-series can be written as

x_t = m_t + s_t + r_t, (32)

with:
• m_t = general trend of the series;
• s_t = periodic pattern of the series;
• r_t = stochastic component of the series.
For a period of length S, a first trend estimate m̂_t is obtained by applying a moving average filter over one period. The next steps are to calculate the deviation series x_{k+jS} − m̂_{k+jS} for k = 1, 2, . . . , S, and to calculate the average w_k of this series. Then, the periodic pattern ŝ_k is computed as

ŝ_k = w_k − (1/S) Σ_{i=1}^{S} w_i, (33)

so that the seasonal terms sum to zero over one period. The de-seasonalized series d_t can be calculated by removing the periodic component from the original series,

d_t = x_t − ŝ_t. (34)

Then, the general trend can be re-computed by applying a moving average filter on d_t; as we want to observe the data for 0 < t ≤ n, we define x_t = x_1 for t < 1 and x_t = x_n for t > n at the borders. At last, the residual component is

r̂_t = x_t − m̂_t − ŝ_t. (35)
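A simplified version of this decomposition can be sketched as follows; the edge-padded moving-average trend filter and the synthetic trend-plus-season series are illustrative choices of ours, not the exact filter of Brockwell and Davis.

```python
import numpy as np

def decompose(x, S):
    """Simplified additive decomposition x_t = m_t + s_t + r_t with
    period S: trend via a moving average, seasonal via phase averages."""
    x = np.asarray(x, dtype=float)
    # Edge-padded moving average of window S as the trend estimate.
    padded = np.concatenate([np.full(S, x[0]), x, np.full(S, x[-1])])
    m = np.convolve(padded, np.ones(S) / S, mode="same")[S:S + x.size]
    # Seasonal pattern: average deviation from the trend at each phase,
    # centered so the seasonal terms sum to zero over one period.
    dev = x - m
    s_period = np.array([dev[k::S].mean() for k in range(S)])
    s_period -= s_period.mean()
    s = np.tile(s_period, x.size // S + 1)[:x.size]
    r = x - m - s  # residual (stochastic) component
    return m, s, r

t = np.arange(48)
x = 0.1 * t + 2.0 * np.sin(2 * np.pi * t / 12)  # trend + period-12 season
m, s, r = decompose(x, S=12)
print(x.shape == m.shape == s.shape == r.shape)  # True
```

By construction the three components add back up to the original series, and the seasonal component sums to zero over each period.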

2) TIME-SERIES CLUSTERING
When the time-series of each tower is decomposed, it is possible to use the trend and periodic components to identify groups of base stations with similar traffic demand patterns. For each group, a separate forecasting model is designed. Hence, clustering algorithms represent an important ingredient of prediction models. In many works, e.g. Vujičić et al. [36] and Trajković [37], K-means is used. Basically, this is an iterative approach that divides the data set into K pre-defined clusters. Each element of the data set is assigned to exactly one cluster in such a way that the sum of the squared distances between all elements and their respective cluster centroids is minimized. New centroids are then computed for each cluster and the assignments of the elements are updated. This procedure repeats until the algorithm converges or another stopping criterion is met.
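A bare-bones K-means on per-tower feature vectors can be sketched as follows; the deterministic initialization and the synthetic two-group data are our own simplifications (practical implementations usually use K-means++ with random restarts).

```python
import numpy as np

def kmeans(data, K, iters=100):
    """Plain K-means: alternate assignment and centroid-update steps.
    Initialization is deterministic here (evenly spaced samples)."""
    idx = np.linspace(0, len(data) - 1, K).astype(int)
    centroids = data[idx].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean).
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        new = np.array([data[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated groups of feature vectors (e.g. per-tower statistics).
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                  rng.normal(5.0, 0.1, size=(20, 2))])
labels, cents = kmeans(data, K=2)
print(labels)  # first 20 entries share one label, last 20 the other
```

In the traffic setting, each row of `data` would hold features of one tower's time-series (mean load, peak hour, seasonal amplitude, etc.), and one SARIMA model would then be fitted per cluster.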
Another strategy is hierarchical clustering. Proposed in Corpet [38], the basic idea of this algorithm is to consider each element as a cluster and merge the nearest two until some stopping condition is reached. The distance can be calculated via the Davies-Bouldin Index (DBI), as in Zhang et al. [35], or via the correlation coefficient between two time-series of aggregated traffic, as in Cici et al. [39].
The base stations could also be clustered according to some specific characteristic, such as their geographical position or traffic consumption level. For instance, Miao et al. [40] classified each antenna according to the latter criterion, where the traffic load was used to label it as heavy load (HL), normal load (NL), or light load (LL).
Time-series clustering is not a trivial task, due to the usually high dimensionality of the problem (long sequences of data) and to decisions that may vary from one application to another, such as: an appropriate representation method, a suitable distance measure, and an adequate clustering algorithm. According to Aghabozorgi et al. [24], comparisons between time-series can be performed in three different ways:
• shape-based: two time-series are considered similar if they have similar shapes. Through non-linear stretching of the time axes, the sequential data are aligned and then standard clustering algorithms with specific distance measures are applied;
• feature-based: vectors containing features of the time-series (standard deviation, mean, variance, maximum/minimum values, etc) are used as input to standard clustering algorithms. This approach has the advantage of dealing with low-dimensional vectors, since very long sequences are represented by their respective features;
• model-based: parametric models representing each time-series are designed. Subsequently, a clustering strategy is applied to the parameters of the models.
The choices of clustering strategy and distance measure may impact the clustering quality. In the context of time-series comparison, a widely used metric is the well-known Euclidean distance (ED). For instance, consider Y = (y_1, . . . , y_m) and Z = (z_1, . . . , z_m); then ED(Y, Z) = sqrt((y_1 − z_1)^2 + · · · + (y_m − z_m)^2). However, this approach has some issues: in case the time-series present the same shape and amplitude, but different phases, the ED may be large, leading to a bad clustering result. Additionally, the time-series must have the same length. To circumvent these problems, elastic distance metrics, such as Dynamic Time Warping (DTW), can be used, see Ding et al. [41] and Rakthanmanon et al. [42]. DTW is calculated by computing a cost matrix M_c whose elements are m_c(i, j) = d(y_i, z_j) + min{m_c(i − 1, j − 1), m_c(i − 1, j), m_c(i, j − 1)}, with d(y_i, z_j) a local distance such as |y_i − z_j|; the DTW distance is the cost accumulated along the optimal warping path, as illustrated in Figure 5. Note that in this case the time-series have the same length, but this is not necessary to calculate the DTW distance.
In Figure 6, it is possible to see a vector V such that equation (41) is satisfied.
For this example, ED(Y, Z) = 2 and DTW(Y, Z) = 0, demonstrating the ability of DTW to detect similar time-series even when they have different phases. ED and DTW are among the most widely used time-series distance measures; however, many other approaches exist, for instance: the Hausdorff distance, Hidden-Markov-Model-based distance, Longest Common Subsequence, Spatial Assembling distance, and Edit Distance with Real Penalty. The reader may find comparisons and applications of these methods in Zhang et al. [43] and Wang et al. [44].
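The DTW recurrence can be implemented directly as a dynamic program over the cost matrix M_c. The sketch below uses toy series chosen to reproduce the same contrast as the example above (ED = 2 and DTW = 0 for two phase-shifted sequences); it is illustrative only, not the figures' original data:

```python
import numpy as np

def euclidean(y, z):
    """Euclidean distance between two equal-length series."""
    return float(np.sqrt(np.sum((np.asarray(y) - np.asarray(z)) ** 2)))

def dtw(y, z):
    """DTW distance: m_c(i,j) = |y_i - z_j| + min of the three predecessor cells."""
    n, m = len(y), len(z)
    mc = np.full((n + 1, m + 1), np.inf)
    mc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(y[i - 1] - z[j - 1])
            mc[i, j] = cost + min(mc[i - 1, j - 1], mc[i - 1, j], mc[i, j - 1])
    return float(mc[n, m])

y = [0, 1, 2, 1, 0, 0]
z = [0, 0, 1, 2, 1, 0]        # same shape, shifted by one step
print(euclidean(y, z))        # 2.0
print(dtw(y, z))              # 0.0
```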
According to Paparrizos and Gravano [45], an ideal shape-based approach is capable of placing time-series in the same cluster if they have similar shapes, regardless of differences in signal amplitude and phase. The authors propose a time-series clustering algorithm known as k-Shape, which is domain-independent and preserves the shapes of the time-series while comparing them. To accomplish this task, they use a distance measure, named shape-based distance (SBD), that is invariant to scaling and shifting and depends on the cross-correlation function of the time-series to be compared. The approach was evaluated on 48 data sets and compared to other techniques, outperforming many of them in accuracy.
Paparrizos and Gravano [45] define k-Shape as a non-trivial k-means-based procedure: in each iteration, the algorithm updates the clusters by assigning each time-series to its closest centroid; subsequently, new centroids are computed. These steps repeat until some stop criterion is reached, such as convergence or the maximum number of iterations. The differences from k-means lie in the distance measure (SBD) and the centroid computation: while k-means uses the arithmetic mean of the coordinates of all sequences to compute an average sequence, k-Shape computes each centroid by solving a minimization problem whose objective function is the sum of squared distances to all other time-series.
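As an illustration of the distance underlying k-Shape, below is a minimal SBD sketch: one minus the maximum normalized cross-correlation over all shifts, so identical shapes yield 0 regardless of phase. Note that this is a simplified toy version; the full k-Shape algorithm of [45] additionally z-normalizes the series before comparing them:

```python
import numpy as np

def sbd(x, y):
    """Shape-based distance: 1 minus the maximum of the cross-correlation
    normalized by the product of the norms; 0 means identical shape."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cc = np.correlate(x, y, mode="full")          # cross-correlation at every lag
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return 1.0 - cc.max() / denom

a = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]    # same shape, shifted by one step
print(round(sbd(a, a), 6), round(sbd(a, b), 6))
```

Unlike ED, the shift of `b` relative to `a` does not increase the distance, since the maximization runs over all lags.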
Surveys concerned with time-series clustering are presented in Warren Liao [46] and Aghabozorgi et al. [24]. The authors discuss time-series representation, similarity and dissimilarity measures, clustering algorithms and their taxonomy, and evaluation metrics for the clusters.

3) MODEL CHOICE AND PARAMETERS p, q, P, Q, S DEFINITION
The autocorrelation function (ACF) and partial autocorrelation function (PACF) are widely used to determine the order of the AR and MA components of the models, as in Xu et al. [34], Suarez et al. [47], and Vujičić et al. [36]. The ACF describes the autocorrelation between an observation and a preceding one, considering both direct and indirect information. In contrast, the PACF considers only the direct relation between a value and a previous one. The behavior of both functions for each model is shown in Table 4.
Determining these parameters for ARIMA and SARIMA models is a similar problem: one should just apply the necessary differencing to make the time-series stationary, as done in Kumar and Vanajakshi [48] and Miao et al. [40]. For the SARIMA case, we also need to fit S, P, and Q. S can be set by simply plotting the time-series and observing the seasonality of the signal, or via spectral analysis, as done in Shu et al. [49]. For instance, the FFT, an algorithm to compute the Discrete Fourier Transform of vectors of data, may be used. Subsequently, P and Q are set by analyzing the ACF and PACF behaviors at the lags that correspond to S, 2S, 3S, . . ..
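The spectral estimation of S can be sketched as follows, on a synthetic hourly series with a daily period (a toy example): the index k of the strongest non-DC Fourier component gives the period as n/k.

```python
import numpy as np

def dominant_period(x):
    """Estimate the seasonal period S as n / k, where k is the index of the
    strongest non-DC component of the series' discrete Fourier transform."""
    x = np.asarray(x, dtype=float)
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    k = spectrum[1:].argmax() + 1      # skip the DC component at index 0
    return len(x) / k

t = np.arange(240)                     # e.g. 10 days of hourly samples
series = 10 + 3 * np.sin(2 * np.pi * t / 24)
print(dominant_period(series))         # 24.0
```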
As an example, consider a model describing a time-series that becomes stationary after two operations: one-order differencing and one seasonal differencing with S = 6. For the stationary series, the PACF plot gradually decays and the ACF plot has the following characteristics:
• one significant component at lag 1;
• significant components at lags 6 and 12;
• all other components are not significant.
Hence, a model that could fit the time-series is SARIMA(0, 1, 1)(0, 1, 2)_6. The parameters p, q, P, and Q could also be set by performing a grid search, where the combination that leads to the smallest Akaike Information Criterion (AIC), which balances the model's simplicity against its maximum-likelihood fit, is chosen. In fact, a good practice is to first analyze the ACF and PACF plots to obtain an approximation of the parameter values, and then perform a grid search around them to choose the best model according to the AIC.
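The grid-search idea can be sketched as below. For brevity, this toy example fits pure AR(p) models by least squares on synthetic data and computes the AIC as n·log(RSS/n) + 2k; in practice, one would fit full SARIMA models with a statistical package (e.g., statsmodels' SARIMAX) and compare its reported AIC values:

```python
import numpy as np

def fit_ar_aic(x, p):
    """Least-squares fit of an AR(p) model and its AIC,
    AIC = n*log(RSS/n) + 2k, with k = p + 1 parameters."""
    x = np.asarray(x, dtype=float)
    Y = x[p:]
    # column i holds the lag-(i+1) values aligned with Y
    X = np.column_stack([x[p - i - 1:len(x) - i - 1] for i in range(p)])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss = float(np.sum((Y - X @ coef) ** 2))
    n = len(Y)
    return n * np.log(rss / n) + 2 * (p + 1)

rng = np.random.default_rng(1)
x = np.zeros(500)
for t in range(1, 500):                 # simulate a toy AR(1) process
    x[t] = 0.8 * x[t - 1] + rng.normal()

aics = {p: fit_ar_aic(x, p) for p in (1, 2, 3)}
best_p = min(aics, key=aics.get)        # the order with the smallest AIC
print(sorted(aics), best_p in (1, 2, 3))
```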
Consider that x_t^(j) denotes the traffic load at tower j at time t, for a specific cluster K with n_k base stations. One may compute the average traffic time-series T_k = {x̄_1, x̄_2, . . . , x̄_n}, where x̄_t = (1/n_k) Σ_{j=1}^{n_k} x_t^(j), as in equation (42).

4) PREDICTION MODELS
The resulting series in equation (42) can be used to train a model corresponding to a cluster, as proposed in Figure 3.
At this point, a study with different approaches to determine x̄_t can be done, since the average value may not be the best one. For instance, in some situations the cost associated with forecasting a lower traffic load than the real one may be higher than the opposite; then, one could use the largest traffic load at each instant to train a model, i.e., x̄_t = max_j x_t^(j). Hence, according to the time-series of a specific base station, it can be classified into a cluster and the assigned model is used to make the predictions. Subsequently, the quality of the predictions may be measured by means of well-known metrics between real and predicted values: mean absolute error, mean squared error, relative error, etc. Note that training a model for each base station may lead to more accurate results. Nevertheless, this can be inappropriate, since the number of antennas may be very large.
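Both aggregation choices amount to a reduction over the towers of a cluster; a sketch with toy numbers (three hypothetical towers, three time steps):

```python
import numpy as np

# traffic load x_t^(j): rows = base stations in cluster K, columns = time steps
cluster_traffic = np.array([[10.0, 12.0, 30.0],
                            [14.0, 16.0, 34.0],
                            [12.0, 14.0, 32.0]])

avg_series = cluster_traffic.mean(axis=0)   # average over towers, as in (42)
max_series = cluster_traffic.max(axis=0)    # pessimistic alternative: peak load

print(avg_series.tolist())  # [12.0, 14.0, 32.0]
print(max_series.tolist())  # [14.0, 16.0, 34.0]
```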

IV. NEURAL NETWORKS
Although SARIMA models are an important tool for the forecast of time-series, in the past few years many papers have addressed this subject with a different and powerful approach: artificial neural networks (ANN). In fact, as shown by Hornik et al. [50], standard multi-layer feed-forward networks using arbitrary squashing functions are universal approximators of a very large class of functions. This strategy may lead to more accurate results, since it allows a nonlinear approach that uses not just past values of a time-series (as SARIMA does), but also other types of information. For instance, Barabas et al. [51] compare network traffic predictions obtained with statistical time-series models (such as ARMA) and with ANNs. The authors show that the approaches based on neural networks yield better results. Similar results are obtained in Azzouni and Pujolle [52], where ANN-based approaches outperform linear ones; besides, ANN models based on LSTM (explained below) lead to even more accurate predictions.

A. A BRIEF INTRODUCTION TO NEURAL NETWORKS
Basically, a neuron is a unit that receives an input vector V = [v_1, . . . , v_q] associated with weights w_ij^l (the weight connecting the j-th neuron in layer l to the i-th neuron in layer l + 1). The neuron then applies a function to the weighted sum of the inputs to compute the output y. Subsequently, y can become the input of a new layer of neurons, forming a neural network, which is a function f : R^g → R^q, as represented in Figure 10, with g and q being the input and output sizes, respectively. Note that there are no cycles or loops in the network, so this is called a feed-forward neural network (FFNN). In case there are 2 or more hidden layers, we have a deep feed-forward neural network (DFFNN).
The function f : R → R in Figure 10 is called the activation function, and the most commonly used are the sigmoid, the hyperbolic tangent, and the rectified linear unit (ReLU), shown in Figure 11. Here, b is a scalar representing a bias term.
The idea is that the weights associated with a neuron can be learned in the training phase by minimizing a loss function L, which measures how much the network output differs from the desired one. The procedure to minimize L is based on the gradient descent algorithm, whose update rule is w_i ← w_i − η ∂L/∂w_i, as in equation (44); due to the non-convexity of the neural network, reaching the global minimum is not guaranteed. The partial derivatives in equation (44) are usually calculated via the backpropagation algorithm, and η represents the learning rate. Small values of η can lead to time-consuming training, and large ones may make the procedure diverge.
The training phase also depends on the weight initialization. As proposed in Glorot and Bengio [53], a good practice is to use the Glorot initialization, where each initial weight is drawn from a zero-mean Gaussian distribution with variance 2/(fan_in + fan_out), where fan_in and fan_out are the numbers of inputs and outputs of a layer, respectively.
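A sketch of this initialization for a layer with hypothetical fan-in 64 and fan-out 32 (the sample standard deviation should be close to sqrt(2/(fan_in + fan_out))):

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Zero-mean Gaussian weights with variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = glorot_normal(fan_in=64, fan_out=32, rng=rng)
# sample std should be close to the target sqrt(2/96) ~ 0.144
print(W.shape, abs(W.std() - np.sqrt(2.0 / 96)) < 0.02)
```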

B. NEURAL NETWORKS AND TIME-SERIES FORECASTING
An important subject when designing a neural network is its architecture. Some types of ANN, such as the feed-forward one, suffer from the vanishing gradient problem, which happens when the partial derivatives in equation (44) become very small during the backpropagation algorithm, leading to very small updates of the weights w_i before a local minimum is reached. Besides, the previous values of sequential data are not considered to predict the next one. To circumvent these problems, papers dealing with time-series prediction (where values from previous steps must be considered in regression problems) make use of different architectures, such as RNN and LSTM, presented below.

1) RECURRENT NEURAL NETWORKS
Recurrent neural networks (RNN), different from feed-forward ones, have the property of cycling the output from layers back into the network. Hence, the effect of initial values on the later ones of sequential data can be modeled by RNNs. In Ramakrishnan and Soni [54], this approach is used to predict three network traffic parameters: volume, protocol classification, and protocol distribution. In summary, the main difference between RNN and FFNN is the presence of feedback in the hidden layers, as shown in Figure 12. Assume that the input is a_t ∈ R^{m×1} and h_t ∈ R^{p×1} is the hidden state (or memory) of the t-th time step. The hidden state is then updated as h_t = f(W a_t + U h_{t−1} + b), where W ∈ R^{p×m}, U ∈ R^{p×p}, and b ∈ R^{p×1} are the model parameters to be determined during training. It is important to emphasize that there is just one hidden layer in Figure 12, represented at different time steps, and the estimated parameters are the same for all time steps. Despite the fact that RNNs provide better results in the prediction of sequential data when compared to FFNNs, they still suffer from the vanishing gradient problem for very long sequences. Hence, special cases of RNNs were designed to retain information even in such cases, such as LSTM and GRU.
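The recurrence above can be sketched as follows, with toy dimensions and random parameters standing in for trained ones; note that the same W, U, and b are reused at every time step:

```python
import numpy as np

def rnn_step(a_t, h_prev, W, U, b):
    """One recurrent step: h_t = tanh(W a_t + U h_{t-1} + b)."""
    return np.tanh(W @ a_t + U @ h_prev + b)

m, p = 3, 4                       # input size and hidden-state size (toy values)
rng = np.random.default_rng(0)
W = rng.normal(size=(p, m))       # input-to-hidden weights
U = rng.normal(size=(p, p))       # hidden-to-hidden weights (shared over time)
b = rng.normal(size=(p, 1))

h = np.zeros((p, 1))
for a_t in rng.normal(size=(5, m, 1)):   # a length-5 input sequence
    h = rnn_step(a_t, h, W, U, b)        # same parameters at every time step
print(h.shape)                            # (4, 1)
```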

2) LSTM
In Wang et al. [55] and Li et al. [56], Long Short-Term Memory (LSTM) neural networks are used to predict traffic in cellular networks and traffic flow in transportation systems, respectively. Introduced by Schmidhuber [57], LSTMs are specialized in predicting time sequences thanks to their cell architecture, built around three gates, presented below, that update the cell state and the hidden state:
• forget gate: a_t and h_{t−1} (with respective weight matrices) pass through a sigmoid function, producing f_t, which is pointwise multiplied by the previous cell state c_{t−1}, deciding which information to discard;
• input gate: here, there is a pointwise product between i_t and C̃_t (candidate values that could be considered relevant and added to the cell state). C̃_t is the hyperbolic tangent (tanh) of a_t and h_{t−1} (with respective weight matrices), squeezing these values between -1 and 1. tanh is used to regulate the values inside the cell, preventing them from becoming very large after the pointwise products. The output of this step is the sum of the forget gate's output and the pointwise product between i_t and C̃_t.
• cell state update: the updated cell state c t is the output of the previous step; c t saves relevant information even from the earliest steps.
• output gate: this gate determines the next hidden state, which contains relevant information about the previous steps. a_t and h_{t−1} (with respective weight matrices) pass through a sigmoid function, producing o_t, which is pointwise multiplied by the hyperbolic tangent of c_t. The result is the new hidden state h_t.
This process is represented in Figure 13. Assume that the inputs a_t ∈ R^{m×1} are the vectors fed into the LSTM layer at each time step, and that the output at each iteration is o_t ∈ R^{p×1}. Note that the input vector can be different at each t, and both c_t and h_t allow relevant information to be carried into subsequent LSTM cells. All weight matrices W ∈ R^{p×m} and U ∈ R^{p×p} are the parameters to be determined during the training phase.
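The gate equations can be sketched as below, with toy dimensions, random parameters, and biases set to zero for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(a_t, h_prev, c_prev, P):
    """One LSTM step with forget (f), input (i), and output (o) gates."""
    f = sigmoid(P["Wf"] @ a_t + P["Uf"] @ h_prev + P["bf"])        # what to discard
    i = sigmoid(P["Wi"] @ a_t + P["Ui"] @ h_prev + P["bi"])        # what to add
    c_tilde = np.tanh(P["Wc"] @ a_t + P["Uc"] @ h_prev + P["bc"])  # candidate values
    c = f * c_prev + i * c_tilde                                   # cell-state update
    o = sigmoid(P["Wo"] @ a_t + P["Uo"] @ h_prev + P["bo"])        # what to expose
    h = o * np.tanh(c)                                             # new hidden state
    return h, c

m, p = 3, 4
rng = np.random.default_rng(0)
P = {}
for g in "fico":                      # forget, input, candidate, output
    P["W" + g] = rng.normal(size=(p, m))
    P["U" + g] = rng.normal(size=(p, p))
    P["b" + g] = np.zeros((p, 1))

h, c = np.zeros((p, 1)), np.zeros((p, 1))
for a_t in rng.normal(size=(5, m, 1)):
    h, c = lstm_step(a_t, h, c, P)
print(h.shape, c.shape)
```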

3) GRU
Another type of RNN is the Gated Recurrent Unit (GRU).
Introduced by Cho et al. [58], GRUs can be seen as a simpler version of LSTMs, in the sense that they have two gates, update and reset, and, according to the authors, are much simpler to compute and implement. In Patil et al. [59], IoT traffic prediction is developed with the use of GRUs, where the results are shown to be more accurate than those obtained via ARIMA modeling. In Fu et al. [60], GRU and LSTM neural networks are used to forecast traffic flow in the state of California; similarly to the previous paper, both approaches presented more accurate results than autoregressive moving average models. The architecture of a GRU cell is shown in Figure 14.
• update gate: it is responsible for controlling how much information from the previous hidden state will be considered by the current one;
• reset gate: when its value is close to 0, the current hidden state is forced to ignore information from the previous ones and resets with only the information from the input.
Then, considering the same input dimensions previously defined (a_t ∈ R^{m×1} and h_t ∈ R^{p×1}), all matrices W (W_z, W_r, W_h) ∈ R^{p×m} and U (U_z, U_r, U_h) ∈ R^{p×p} are the parameters to be determined during the training phase. Besides, z_t ∈ R^{p×1}, r_t ∈ R^{p×1}, h̃_t ∈ R^{p×1}, and h_t ∈ R^{p×1}, where z_t and r_t are the update and reset gates, respectively.
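A corresponding sketch of one GRU step (toy dimensions, random parameters, zero biases), where the new hidden state interpolates between the previous state and a candidate gated by the reset signal:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(a_t, h_prev, P):
    """One GRU step with update (z) and reset (r) gates."""
    z = sigmoid(P["Wz"] @ a_t + P["Uz"] @ h_prev + P["bz"])        # update gate
    r = sigmoid(P["Wr"] @ a_t + P["Ur"] @ h_prev + P["br"])        # reset gate
    h_tilde = np.tanh(P["Wh"] @ a_t + P["Uh"] @ (r * h_prev) + P["bh"])
    return (1.0 - z) * h_prev + z * h_tilde     # mix old state and candidate

m, p = 3, 4
rng = np.random.default_rng(0)
P = {}
for g in "zrh":                       # update, reset, candidate
    P["W" + g] = rng.normal(size=(p, m))
    P["U" + g] = rng.normal(size=(p, p))
    P["b" + g] = np.zeros((p, 1))

h = np.zeros((p, 1))
for a_t in rng.normal(size=(5, m, 1)):
    h = gru_step(a_t, h, P)
print(h.shape)
```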

4) CONVOLUTIONAL NEURAL NETWORKS
Another common approach used to predict time-series is based on Convolutional Neural Networks (CNNs). CNNs have a structure that allows them to identify local patterns in the feature space of a matrix [3]. In Bega et al. [16], a distance matrix based on the similarity between the time-series of the base stations was constructed, and then a forecasting procedure based on a CNN was developed; note that, with the distance matrix, local patterns become important for the prediction. Traffic flow prediction is developed in Chen et al. [61], also with the use of CNNs. Convolutional neural networks are also widely applied in research on image and video processing, since images and videos can be seen as matrices with strong local patterns. In brief, a convolutional layer can be understood as a filter, also called a Kernel, that slides along the spatial dimensions of an input and produces an output, as presented in Figure 15. Note that the dimension of the output is smaller: this operation is known as valid convolution, and the Kernel does not go beyond the matrix border. Alternatively, there are zero padding, where everything outside the border is set to 0, and symmetric padding, where the terms beyond the border are mirrored. In both cases, the dimension of the input can be kept or even increased.
After the operation with the convolutional layer, ReLU is applied, and then the pooling layer may be used. It reduces the size of the resulting matrix by using the max pooling or average pooling. The first approach constructs an output matrix by considering the maximum values of different regions of the input, while the second averages these values. Convolutional, ReLU, and pooling form one layer of a convolutional neural network. The output may have a smaller dimension and contains local patterns of the input matrix.
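The valid convolution and max pooling operations can be sketched as follows, on a toy 3×3 input (the kernel is applied without flipping, as is customary in CNNs):

```python
import numpy as np

def conv2d_valid(x, kernel):
    """'Valid' convolution: slide the kernel without crossing the border."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size):
    """Non-overlapping max pooling over size x size regions."""
    oh, ow = x.shape[0] // size, x.shape[1] // size
    return np.array([[x[i*size:(i+1)*size, j*size:(j+1)*size].max()
                      for j in range(ow)] for i in range(oh)])

x = np.arange(1.0, 10.0).reshape(3, 3)      # [[1,2,3],[4,5,6],[7,8,9]]
k = np.array([[1.0, 0.0], [0.0, 1.0]])
fmap = conv2d_valid(x, k)                   # [[6, 8], [12, 14]]
print(fmap.tolist(), max_pool(fmap, 2).tolist())
```

Note how the 3×3 input shrinks to a 2×2 feature map under valid convolution, and pooling then reduces it further.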
A big advantage of CNNs is that, differently from Figure 10, where all nodes of the input layer are fully connected to the nodes of the hidden one, a neuron may be connected to only some of the neurons in the next layer, since CNNs need to identify localized features. This reduces the number of parameters to be determined during training. Besides, different Kernels can be used in a single convolutional layer, where the resulting matrices (called feature maps) are stacked and later flattened. The case with 3 Kernels is illustrated in Figure 16. Note that the output of the pooling layer could be the input of another convolutional layer; this procedure would lead to higher-level feature extraction, since each CNN layer is trained to detect a specific pattern.
Similar analysis can be extended to 3-D cases, where the concept of tensors is used.
Many papers that apply ANNs to predict time-series, such as Ong et al. [62] and Deng et al. [63], use tensors. Basically, they are generalizations of scalars (rank-0 tensors), vectors (rank-1 tensors), and matrices (rank-2 tensors). For instance, a rank-3 tensor can be seen as a 3-D matrix, which is very useful for traffic load prediction: consider a tensor whose elements are λ_ijk, where i and j are related to the geographical position of a tower, and λ is its traffic load at time step k; this is a way to store the spatio-temporal information of all towers belonging to a network (this is done in Bega et al. [16]).

5) GRAPH NEURAL NETWORKS
In many of the data sets containing information about the network traffic demand, the locations of towers, routers, or switches are also available. Hence, the entire network can be seen as a graph whose nodes and links are, respectively, the hardware composing the network and their connections. In recent years, Graph Neural Networks (GNNs), introduced by Scarselli et al. [64], are on the rise and have provided results superior to the aforementioned approaches when dealing with road traffic forecasting problems. As discussed by Jiang and Luo [4], GNNs have the property of learning directly from graphs, which makes them capable of extracting features from data represented by these structures.
In Wu et al. [65], an extensive survey on GNNs is developed. The authors propose a taxonomy, dividing GNNs into four groups, briefly commented below.
• Recurrent GNNs: also called RecGNNs, they are the first type of GNNs. In summary, it is considered that two neighbor nodes in a graph exchange information until some stop criterion or convergence is reached. The node representation learning is based on RNNs.
• Convolutional GNNs: known as ConvGNNs, they extend the idea of convolution from matrix (or tensor) to graph. Similarly to RecGNNs, a node representation is given by its features and its neighbors ones. Subsequently, multiple graph convolutional layers can be used to learn the graph main features.
• Graph Autoenconders: unsupervised learning that, different from the two aforementioned approaches, encodes the graph in a vector space, from where features can be learned and subsequently a decoder reconstructs the graph.
• Spatio-temporal GNNs: these GNNs consider space and time features simultaneously. They can combine graph convolutional layers with CNN and RNN.
It is important to emphasize that the GNNs above have different mathematical formulations and applications, where the details of each case are also discussed by Wu et al. [65].
In Wang et al. [66], the authors propose a type of GNN, named Time-Series Graph Attention Network (TSGAN), combined with the previously discussed elastic distance metric DTW, to forecast the cellular network traffic in a real-world data set. The prediction results obtained with TSGAN outperformed three standard GNNs and a GRU model in short-term, mid-term, and long-term scenarios. In Zhou et al. [67], a Spatio-temporal Graph Convolutional Neural Network is presented and leads to better results than many state-of-the-art forecasting models: historical averaging, vector autoregressive, LSTM, ConvLSTM, GCN-CNN, attention-mechanism GCN-CNN, and diffusion CNN-RNN.
Some characteristics of the approaches discussed above are shown in Table 5.

V. EXTENDED BIBLIOGRAPHY
In this Section we present, in Table 6, a detailed list of papers using the different approaches discussed in the previous sections for network traffic prediction. In each row, the reader finds the authors, year of publication, prediction technique, training/testing data set, and evaluation metrics of the respective paper. Besides, the Table is divided into three subgroups:
• Yellow subgroup: Autoregressive Moving Average based approaches;
• Blue subgroup: RNN, LSTM, GRU, or CNN based approaches;
• Green subgroup: diversified approaches.
The error acronyms are given in Table 2.

VI. NUMERICAL EXPERIMENTS
To compare many of the approaches described above and used in many of the previously cited papers, we apply SARIMA and RNN models to a mobile phone activity data set from the city of Milan. The data set contains information about received and sent SMS, incoming and outgoing calls, internet activity, and the geographical positions of all antennas. Each of these operations is named a call detail record (CDR). The results presented below regarding the SARIMA models were obtained under the following assumptions:
• models with all possible combinations of parameters p, d, q, P, D, Q = {0, 1, 2} were fitted;
• the model with the smallest AIC is used to predict the CDR traffic behavior;
• S = 24 (time-series with a period of one day);
• the training data set is standardized according to x̃ = (x − µ)/σ, as in equation (49), with µ and σ being the mean and standard deviation of the samples belonging to the training data set.

1) REGULAR SARIMA
In this approach, the standardized time-series of the training data set, without any previous filtering technique, is used to design the model. After fitting all possible models in the intervals declared above, they were compared according to the AIC criterion, which considers not just how well the model fits its observations, but also its simplicity (number of parameters). The optimum (minimum) value of the AIC is obtained for SARIMA(2, 0, 2)(1, 1, 2)_24. With these parameters, the MAE and MSE between the measured 7-day time-series and the predictions given by the model for the same period are, respectively, 2.911 and 12.827.

2) SARIMA WITH DECOMPOSED TIME-SERIES
In this approach, the time-series of the respective cell ID was decomposed according to equations (32)-(38) with S = 24, as shown in Figure 17, which reports one week of measured CDR traffic load for cell ID 1051 (observed data), its corresponding components, and the filtered signal (sum of the trend and periodic components). The resulting time-series, which was standardized according to equation (49) and used to fit the models, is the sum of the trend and periodic components. The model with the smallest AIC is achieved for SARIMA(2, 0, 1)(1, 0, 1)_24 and provides MAE = 2.733 and MSE = 11.452 between the predictions given by the model and the real measured traffic load for the same period.

B. RECURRENT NEURAL NETWORKS
When training a neural network, some hyper-parameters may affect the results, leading to over-fitting or under-fitting. Two examples are the number of epochs and the batch size, defined as:
• epochs (epc): hyper-parameter that determines how many times the learning algorithm works through the entire data set;
• batch size (bs): hyper-parameter that defines the number of samples the learning algorithm processes before updating its internal weights. In brief, at the end of each batch, the error between predictions and outputs is calculated, and the gradient descent algorithm is applied to update the model's weights.
The experiments were run under the following settings:
• Adam optimizer, introduced by [119], proven to be computationally efficient, with little memory requirement, and appropriate for non-stationary objectives;
• loss function: MSE;
• learning rate: 0.0001;
• activation function of the LSTM, GRU, or simple RNN layers: tanh;
• each input vector (a_t in Figures 13 and 14) consists of 24 samples, each ã_t standardized using equation (49); this standardization may improve the network training;
• the algorithms have access to 55 days of CDR traffic: 50 days are used to train the models and 5 days to validate them;
• the last 7 days are used only to evaluate the models' performance by comparing their predictions against the real measured values.
Besides, after the last LSTM, GRU, or simple RNN hidden layer, there are two layers: a flatten layer followed by a dense layer with 32 neurons and ReLU activation, and an output layer with a single neuron (the predicted value) and linear activation.
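The construction of the 24-sample input vectors can be sketched as a sliding window over the (standardized) series. The toy series below stands in for the CDR data; the window length matches the 24 hourly samples used in our experiments:

```python
import numpy as np

def make_windows(series, window=24):
    """Turn a 1-D series into (input, target) pairs: the 24 previous
    samples predict the next one, as in the a_t vectors fed to the RNNs."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y

series = np.arange(30.0)               # stand-in for the standardized CDR series
X, y = make_windows(series, window=24)
print(X.shape, y.shape, float(y[0]))   # (6, 24) (6,) 24.0
```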

1) RNN
The results obtained with the simplest recurrent neural network, RNN, are presented in Table 7.

2) LSTM
The results obtained with LSTM recurrent NN are presented in Table 8.

3) GRU
The results obtained with GRU recurrent NN are presented in Table 9.
The previous three models have different numbers of training parameters, leading to specific computational costs. The quantity of training parameters of each approach for different numbers of layers is presented in Figure 18.
It is clear that the number of training parameters grows when more neurons are considered inside the same layer, which is illustrated in Figure 19.

C. CONVOLUTIONAL NEURAL NETWORKS
The data set contains the geographical positions of all antennas, which are distributed in a matrix M ∈ R 100×100 . As previously mentioned, CNNs are able to capture local patterns in a feature space of a matrix. Hence, one could consider all adjacent towers to the one in cell ID 1051 to predict the traffic load for one week of CDRs.
In this case, the input of the CNNs is a tensor presented in Figure 20, where each tensor position is the traffic load of a specific cell ID at a time step. Notice that we have both geographical and time information as input.
For the CNN simulations, we used the same number of epochs, batch size, optimizer, loss function, and learning rate as above. The architecture is as follows:
• for the simulations with 2 sequences of feature extractions (Figure 16), after the layers above, we have:
- 3D CNN layer with 32 filters, Kernel size of {k, 2, 2}, and padding set to ''valid'';
- 3D Pooling layer of dimension {2, 1, 1} and padding set to ''valid'';
• flatten layer followed by a dense layer with 32 neurons;
• output dense layer with 9 neurons.
Note that the output layer has 9 neurons because each one is the prediction for a position of the spatio-temporal input tensor in Figure 20 at time t. Then, the input tensor is updated with these new values and the last 24 hours are used for a new prediction. The results considering MAE, MSE, and the number of training parameters (T.P.) are presented in Tables 10 and 11.

D. COMPARISON BETWEEN TECHNIQUES
The time-series predictions for hourly aggregated CDRs obtained with the previous approaches are presented in Figure 21. Comparing the results obtained via the NN and SARIMA models, we notice that the neural network approaches provided at least one configuration with smaller MAE and MSE; see Table 12, where the best results of each model are shown. These differences may grow even larger when the time-series to predict is highly nonlinear or the amount of data to be analyzed is very large. In such cases, SARIMA models may degrade, leading to higher errors.
It is important to emphasize that the results obtained via neural network approaches may vary a little, even if all parameters are kept the same, whenever the models are trained anew: due to the high non-convexity of the loss function, different training phases may end up in different local minima. On the contrary, the SARIMA models are computed via a closed-form procedure, which leads to the same results if their parameters (p, d, q)(P, D, Q)_S do not change. Hence, the ordering presented in Table 12 may vary if new experiments are performed. Our goal, besides showing NNs outperforming the statistical models, is to study different strategies, so that they can be used in different situations or even together. In fact, many papers design prediction models based on a combination of these approaches. For instance, Huang et al. [3] present a formulation where the model extracts spatial and temporal correlations based on a 3D-CNN and an LSTM, respectively, leading to very accurate results.

VII. OPEN RESEARCH DIRECTIONS
Our tutorial and survey let us overall highlight that, even though network traffic prediction has been extensively studied over the recent years, there are still gaps in the literature and several clear directions for improvement.

A. GAPS IN THE LITERATURE
As far as gaps are concerned, the search for models capable of forecasting the traffic demand with an accuracy as high as possible is intrinsically related to computational costs, due to the model complexity and to the time associated not just with training, but also with producing the predicted output. Yet, many papers do not present a formal analysis of model complexity, or results about the computational time of the solution. We argue that the community should make an effort to always include comparative analyses of new models against the state of the art, not only in terms of performance but also of complexity, so as to reveal the trade-off between these two key metrics.
Also, recent challenges in time-series forecasting that are general-purpose and not specific to network traffic have shown how many proposed models may fail to outperform simple baselines, as summarized by Makridakis et al. [120], [121]. In this regard, many studies in the literature do not provide a direct comparison against naive or classical approaches, or do so with an unclear parametrization of such benchmarks. Our conclusion is that the availability of a ready-to-use set of baseline models usable by the whole community would benefit the direct comparability of newly proposed solutions. We hope that the implementations developed for the tutorial and comparison parts of this document, which we provide as open-source software, can become the first core of such a toolbox.
It is also worth noting that, although most papers cite the importance of predicting the network traffic load (e.g., for resource allocation and utilization, anticipatory decision-making, or network design), there is a gap on how these forecasting models could be used in real-world scenarios to design optimization problems regarding energy saving, quality of service, closed-loop control of the entire network, etc. We consider that there is a clear need for new solutions to be not only compared to existing ones in terms of absolute or relative prediction errors (e.g., via MAE or MAPE metrics), but also deployed in practical applications; in this way, one can really appreciate the gain that a model would grant over another in production settings.
Another obstacle, related to benchmarking and even harder to surpass than those above, is the lack of a reference data set of network traffic that is consistently adopted across performance evaluations of proposed models, as opposed to what happens in other communities, like the image processing one. Unfortunately, traffic data is often regarded as sensitive by operators, and only disclosed under restrictive Non-Disclosure Agreements (NDAs). Publicly available data sets are scarce and outdated. A promising option in this sense may be provided by emerging efforts on the generation of realistic synthetic spatial and temporal network traffic: early models proposed by Lin et al. [122] and Xu et al. [123] can produce vast amounts of traffic data that can be openly released, towards building a reference data set to be commonly adopted by the community.

B. DIRECTIONS FOR IMPROVING FORECASTING MODELS
Our survey of forecasting models for network traffic shows that, as in many other scientific and application domains, deep learning is revolutionizing the approaches for predicting future demands, thanks to significant accuracy improvements. Yet, it also outlines several emerging pathways for the improvement of existing deep learning architectures, which go beyond the now rather standard RNNs, CNNs, or GNNs and aim at further increasing the quality of predictions.
An interesting direction is that of hybrid strategies that mix deep learning and statistical modeling, jointly parametrized via a single training process. As shown by early works, such as that of Lo Schiavo et al. [102], such combined approaches show promise of improving forecasting performance with respect to pure deep learning solutions. However, current studies in this direction are fairly preliminary, and additional research is needed to unveil the full potential of hybrid models.
In addition, the vast majority of network traffic predictors employ legacy loss functions, such as MAE or MSE, which are an ideal choice when the target of the forecast is anticipating the future traffic as closely as possible. Yet, in many practical applications, such as network management, the value of the prediction lies in how it supports effective decision-making: in these settings, a predictor that minimizes the error with respect to the future demand is not necessarily the optimal choice. On a related note, seminal studies, such as those of Bega et al. [16], have started exploring how to cope with this limitation directly at the model design stage. The idea is to introduce loss functions that are designed by network experts and tailored to the downstream application: such custom functions can train the predictor to produce an output that is aligned with the requirements of the end user. Loss meta-learning frameworks, such as that by Collet et al. [124], are also starting to be proposed, which automate the definition of the loss function most appropriate to the target performance objective.
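As a purely illustrative sketch of such an application-tailored loss (in the spirit of, but not reproducing, the cited works), the function below penalizes under-provisioning, i.e., predicting less traffic than actually arrives, more heavily than over-provisioning; the `alpha` weight is a hypothetical knob that a network expert would tune to the cost structure of the downstream application.

```python
import numpy as np

def asymmetric_loss(y_true, y_pred, alpha=5.0):
    """Illustrative application-tailored loss: under-predictions
    (y_pred < y_true, which could cause SLA violations) are penalized
    alpha times more than over-predictions (wasted capacity).
    `alpha` is a hypothetical tuning parameter, not a value from [16]."""
    err = y_true - y_pred
    return float(np.mean(np.where(err > 0, alpha * err, -err)))
```

Unlike a symmetric MAE, this loss drives a model trained on it to slightly over-provision, which is often the preferable failure mode for capacity allocation.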
Another relevant direction concerns the fact that forecasting models are today evaluated in non-real-time settings. The current practice in performance assessment involves (i) off-line training with historical data; (ii) testing with limited re-played historical data, typically recorded immediately after the training data; and (iii) little or no attention to inference latency. Together, these habits lead to predictor implementations that are hardly validated in production-like settings. The situation calls for forecasting frameworks that can operate in on-line settings, e.g., by ensuring inference delays that are consistent with the requirements of the downstream applications, which can be down to milliseconds in radio access operations. It also requires in-depth studies of how prediction accuracy degrades over time, and of the need for re-training or, better, continual learning. Finally, evaluations should be designed to ensure that models can correctly operate on live streaming data.
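One simple way to approximate such on-line conditions is walk-forward (rolling-origin) evaluation with periodic re-training and per-step latency measurement. The sketch below is our own illustration under stated assumptions: `fit` and `predict` are hypothetical callables standing in for any model, and `retrain_every` is an arbitrary re-training period.

```python
import time
import numpy as np

def walk_forward_eval(series, fit, predict, train_len, retrain_every=24):
    """Walk-forward (rolling-origin) evaluation sketch: the model is
    re-fit periodically on the most recent window, mimicking an on-line
    deployment, and per-step inference latency is recorded."""
    errors, latencies = [], []
    model = fit(series[:train_len])
    for t in range(train_len, len(series)):
        # Periodic re-training on a sliding window of recent history.
        if t > train_len and (t - train_len) % retrain_every == 0:
            model = fit(series[t - train_len:t])
        start = time.perf_counter()
        y_hat = predict(model, series[:t])   # forecast the next sample
        latencies.append(time.perf_counter() - start)
        errors.append(abs(series[t] - y_hat))
    return float(np.mean(errors)), float(np.mean(latencies))
```

Plugging in a naive model (a `fit` that returns nothing and a `predict` that returns the last observation) yields a latency-aware baseline against which heavier architectures can be judged on both accuracy and inference delay.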
Finally, generalization is also a widely open research subject. Most models proposed in the literature are tested in specific scenarios, which, even when covering large urban areas, are still representative of a single network deployment. There are unanswered questions about how models trained in specific regions generalize to other areas. Recent studies of transfer learning approaches, such as those by Wu et al. [125], are promising in this sense and pave the way for further investigations.

VIII. CONCLUSION
In this survey, we presented distinct approaches used to design network traffic prediction models. Different mathematical formulations were discussed, and the most relevant ones were fully explained, so that the reader can gain a deep understanding of each. Based on the state of the art, the network traffic prediction task was divided into three main areas (clustering, prediction model, and time-series decomposition), which were explained separately. A detailed list of papers addressing this problem was also presented, allowing the reader to find some of the most relevant works in this field of research. We also performed numerical experiments on a real-world data set to compare some of the explained mathematical models in terms of accuracy and computational cost. All the code is publicly available, so that other researchers can compare their approaches with the ones we provide, or even use them as a foundation for more advanced solutions. Gaps in the literature were also listed, not only allowing readers to use our code and gain a full understanding of network traffic prediction, but also giving possible directions for future work.

APPENDIX
In the numerical experiments, six different prediction models were designed and tested on a real-world data set. We provide the code for all approaches, so that other researchers can use these models and perform a direct comparison in terms of computational cost and prediction accuracy. All the code, written in Python 3, is commented to ease its use and understanding, and is available on GitHub.2 We also made available a small pre-processed subset of the data used to train the models. A description of the data set is provided by Barlacchi et al. [17] and its full version can be downloaded from the Harvard Dataverse Italia [126].