Multivariate Time Series characterization and forecasting of VoIP traffic in real mobile networks

Predicting the behavior of real-time traffic (e.g., VoIP) in mobility scenarios could help operators to better plan their network infrastructures and to optimize the allocation of resources. Accordingly, in this work the authors propose a forecasting analysis of crucial QoS/QoE descriptors (some of which are neglected in the technical literature) of VoIP traffic in a real mobile environment. The problem is formulated in terms of a multivariate time series analysis. Such a formalization allows us to discover and model the temporal relationships among various descriptors and to forecast their behaviors for future periods. Techniques such as Vector Autoregressive models and machine learning (deep-based and tree-based) approaches are employed and compared in terms of performance and time complexity, by reframing the multivariate time series problem into a supervised learning one. Moreover, a series of auxiliary analyses (stationarity, orthogonal impulse responses, etc.) is performed to discover the analytical structure of the time series and to provide deep insights about their relationships. The whole theoretical analysis has an experimental counterpart, since a set of trials across a real-world LTE-Advanced environment has been performed to collect, post-process and analyze about 600,000 voice packets, organized per flow and differentiated per codec.


I. INTRODUCTION AND MOTIVATION
PERFORMANCE prediction of real-time traffic (such as VoIP) is a crucial topic in the network management field. Predicting and optimizing Quality of Service (QoS) and Quality of Experience (QoE) metrics allows better dimensioning of network infrastructures, improved battery life of devices, and optimized resource allocation strategies [1], [2]. This is even more critical in cellular environments, where the high unpredictability of variables such as the interference, the concurrency of real-time sessions, and the time-varying load of mobile network nodes poses intriguing challenges.
We tackle these issues through a multivariate predictive time series analysis of VoIP traffic across an urban LTE-A environment. At the moment, LTE represents the dominant broadband technology, accounting for 57% of users worldwide [3]. Older technologies such as 2G and 3G continue to be intensively used for their robustness, with about 38% of subscriptions, whereas 5G accounts for about 5% of subscriptions due to its market immaturity. Interestingly, one of the most adopted deployments today is the Non-Standalone (NSA) 5G, where a substantial part of the LTE core network is reused to implement voice-based services such as VoLTE [4].

M. Di Mauro and F. Postiglione are with the University of Salerno, Italy. E-mails: {mdimauro,fpostiglione}@unisa.it. G. Galatro is with IBM Italy. E-mail: galatro.giovanni@gmail.com. W. Song is with the College of Information Technology, Shanghai Ocean University, China. E-mail: wsong@shou.edu.cn. A. Liotta is with the Faculty of Computer Science, Free University of Bozen-Bolzano, Italy. E-mail: Antonio.Liotta@unibz.it.
The access to LTE technology has stimulated a series of studies devoted to analyzing the performance of QoS/QoE metrics involving, for example: various deployment strategies [5], resource allocation [6], probabilistic models [7], and coexistence with other technologies [8].
On the other hand, the main contribution offered in this work pertains to a multivariate time series characterization of the dynamic (time-varying) behavior of crucial VoIP metrics which mutually influence each other. Such a cross-dependency has a great impact on forecasting, since the future values of a specific metric (e.g., the bandwidth consumption) will depend not only on the temporal evolution of the same metric, but also on the evolution of other metrics (e.g., round-trip time, jitter) for a given VoIP flow. Accordingly, we formalize analytically such a cross-dependency by means of a vector autoregressive (VAR) model, along with a set of analyses (e.g., stationarity, causality) useful to capture some insights characterizing the mutual influence among the metrics at stake. Such a formalization is then compared to classic (e.g., tree-based) and novel (e.g., deep-based) machine learning approaches.
As a first step, we carry out an experimental campaign to collect real-world mobile VoIP traffic, deriving variables such as bandwidth consumption, mean opinion score (MOS) and signal-to-noise ratio (SNR), among others. In a second step, we perform a predictive analysis aimed at discovering temporal dependencies among the variables and forecasting their behavior in future time periods. To this aim, we consider two approaches: i) a statistical approach relying on VAR models, useful to analytically describe the dependencies among interrelated time series; and ii) a machine learning approach, employed by turning a time series structure into a supervised learning problem. It is worth noting that a time series analysis would be of little use when dealing with data collected in controlled environments (e.g., testbeds). In such a case, in fact, the forecast would be biased, since it is possible to manually tune quantities such as interference or noise figures. Conversely, in real settings we deal with uncontrollable variables which impact the overall performance, such as: time-varying load of radio and network nodes; physical obstacles; weather conditions; and handover procedures.
The paper is organized as follows. Section II proposes an overview of similar works, highlighting how our work contributes to the state-of-the-art. In Sect. III, we offer a description of the experimental environment along with details about the time series construction. In Sect. IV, we formulate the problem in terms of multivariate time series characterization, and we introduce statistical and machine learning (ML) based models. In Sect. V, we present the experimental comparison among the different forecasting techniques by taking into account both performance and times. Section VI concludes the work along with some ideas for future research.

II. RELATED WORK AND OFFERED CONTRIBUTION
Due to the rapid evolution of telecommunication infrastructures, themes involving network traffic characterization are becoming decisive from a network management point of view. QoS and QoE metrics, for instance, are typically used as benchmarks to evaluate the quality of a network service; thus, predicting their behavior is crucial for network optimization and protocol design. Accordingly, in this section we propose an excursus of relevant works in the field of traffic characterization/forecasting, where we highlight a set of novelties emerging from our work along different directions.
A first aspect concerns network traffic forecasting through statistical models, where a common trend is to exploit autoregressive moving average (ARMA) [9], [10] or autoregressive integrated moving average (ARIMA) models [11]-[13]. Although based on a robust methodology, ARMA and ARIMA models allow characterizing the behavior of individual network variables (in terms of univariate time series models), but are not able to capture the mutual influence among the variables, which is crucial, for example, to understand the interdependency between objective indicators (e.g., bandwidth) and subjective ones (e.g., MOS).
A univariate time series perspective is also adopted by the part of the technical literature which employs machine learning models for network traffic forecasting, including neural networks [14], support vector machines [15], general supervised models [16], and deep learning models [17]-[23].
To fill this gap, we formulate the problem in terms of a multivariate time series, where each variable is expressed as a function of past values of the variable itself and of the other variables. This approach allows us to characterize the interdependency among variables by enabling joint analyses (e.g., orthogonal impulse response analysis) which would have no meaning in a univariate setting.
Another limitation which emerges in the part of the technical literature focusing on traffic characterization (especially in mobile environments, as in our case) is the lack of real-world data. This issue is typically faced through the usage of network simulators, where many variables or models are artificially generated (e.g., mobility models, interference, packet loss, data bursts, weather conditions, and many others). Examples include: [24] and [25], where LTE environments are simulated through NS-2; and [26]-[28], where some LTE metrics are characterized via NS-3. Other works employ customized LTE simulators to model QoE [29], [30] and QoS indicators [32], [33].
Even when network experiments are carried out within real mobile scenarios [34]-[37], only a limited set of metrics is considered, often because standard communication protocols (e.g., RTP/RTCP) natively collect only classic metrics, typically relating to bandwidth consumption or network delay. To overcome such restrictions, we have set up an experimental campaign where, through the RTP Control Protocol Extended Reports (RTCP-XR), we are able to analyze QoS/QoE metrics that are usually neglected in traffic characterization, including MOS, round trip delay, playout delay buffer, and SNR.
In summary, the following contributions emerge from our work. First, we formalize the multivariate time series problem of mobile VoIP traffic through the VAR model, which allows us to govern analytically the forecasting process. Moreover, through specific analyses including the Dickey-Fuller test, the OLS-CUSUM test and the orthogonal impulse response, we are able to discover interesting insights and hidden relationships among the considered VoIP metrics. Then, we turn a set of machine learning techniques (random forest, recurrent networks, etc.) into forecasting methods, by reframing the multivariate time series problem into a supervised learning one through a sliding window approach. This step is needed to evaluate and compare the performance and time complexity of the statistical approach against the learning-based ones. Finally, we remark that the whole time series analysis relies on an experimental campaign carried out on a real-world LTE-A network. During this campaign we: i) collect and elaborate a series of VoIP flows exploiting different voice codecs; ii) elaborate a set of performance metrics (most of them neglected in classic literature) through the support of the RTCP-XR protocol.
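The sliding-window reframing mentioned above can be sketched as follows. This is a minimal illustration with our own function name and toy data standing in for the real VoIP metrics: each training input stacks the last `lag` observations of all variables, and the target is the observation one step ahead.

```python
import numpy as np

def sliding_window(series, lag, horizon=1):
    """Reframe a multivariate time series (T x k array) into supervised
    (X, y) pairs: each X row contains `lag` consecutive observations of
    all k variables (flattened); y is the observation `horizon` steps ahead."""
    T, k = series.shape
    X, y = [], []
    for t in range(lag, T - horizon + 1):
        X.append(series[t - lag:t].ravel())
        y.append(series[t + horizon - 1])
    return np.array(X), np.array(y)

# toy example: 100 samples of 6 metrics, lag 12 (the order selected later)
data = np.random.rand(100, 6)
X, y = sliding_window(data, lag=12)
print(X.shape, y.shape)  # (88, 72) (88, 6)
```

Any supervised regressor (random forest, MLP, etc.) can then be fitted on `(X, y)` exactly as on a static dataset.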

III. EXPERIMENTAL ENVIRONMENT AND TIME SERIES CONSTRUCTION
The location chosen for mobile VoIP traffic collection and analysis is an urban area (about 2000 people/km²) near Salerno (Italy). Figure 1 shows the area map derived from cellmapper [38], a crowd-sourced cellular tower and coverage mapping service. The number of evolved nodes B (eNB) handling the radio links amounts to approximately 100. All the VoIP traffic is collected between two nodes: a mobile node (a car traveling at about 60 km/h) and a fixed node with a station to collect/elaborate the VoIP flows. The distance between the two nodes ranges from 30 to 70 kilometers. Both nodes are equipped with Linphone [39], one of the few softphones supporting the RTCP-XR protocol defined in RFC 3611 [40]. Such a protocol allows gathering a rich set of metrics not available through the classic RTCP protocol, such as MOS, SNR, round trip delay (or round trip time), and playout delay buffer.
The overall collected flows amount to about 600,000 voice packets, and are divided per codec, including G.722 (64 kb/s of bit rate and 8 kHz of sampling rate). Such a choice is justified by the fact that each codec is able to react differently to diverse network conditions (e.g., the consumed bandwidth or the playout delay buffer) and accordingly adjust the quality of the voice flow. In this way, we obtain a more ample view of the behavior of the time-based variables and of how they are influenced by the different codecs. In Table I we report some useful information about the collected dataset, including: codec type (first column), number of RTP packets per VoIP conversation (second column), stream length, namely, the duration of the conversation (third column), and lost packets (fourth column). We derive such information from the RTP stream statistics section available in Wireshark, the open source sniffer tool aimed at network traffic inspection [41]. Upon collecting the traffic, we have performed a post-processing stage to extract and process six crucial time-based variables. Precisely, for each voice flow (namely, for each codec) we built a (6 × 1) time-based vector $\mathbf{y} = (y_1, \dots, y_6)^{\top}$ whose components are the following six time series:
• $y_1$: time series representing the MOS, which quantifies the human subjective experience of a voice call in a dimensionless range between 1 (low perceived quality) and 5 (high perceived quality);
• $y_2$: time series representing the bandwidth (BW) consumed by the voice flow, measured in kb/s;
• $y_3$: time series representing the round trip time (RTT), namely the time taken by a packet to travel from source to destination and back, measured in ms;
• $y_4$: time series representing the jitter, namely the variability of the packet delay, measured in ms;
• $y_5$: time series representing the playout delay buffer (Buffer for brevity, and measured in ms), a mechanism to compensate for the encountered jitter by buffering voice packets and playing them out in a steady stream;
• $y_6$: time series representing the signal-to-noise ratio (SNR), defined as the ratio between the power of a signal and the power of the background noise, and measured in decibels (dB).
Figure 2 shows all the six time series for a single voice flow (G.722 codec). Please note that this representation is meant to offer just a big picture of the time series behaviors, since the measurement units are different for each series (e.g., the bandwidth is measured in kb/s, the RTT in ms, the SNR in dB, etc.). To this aim, MOS and jitter have been magnified into two separate insets (MOS in grey and jitter in light blue) to better appreciate their behaviors. We can preliminarily notice some interesting facts. For instance, it is immediate to see how the RTT time series (in green) has two noteworthy peaks (approximately at t = 30 and t = 415 seconds), probably due to some obstacles in the VoIP flow path. Yet, the bandwidth time series (in red) seems to be quite stable (around 80 kb/s), with a peak at about t = 360 seconds and a more irregular behavior after t = 450 seconds, probably due to a more unstable connection. Finally, the jitter time series is more or less regular and lies below 40 ms, as prescribed by telco standards for VoIP flows [43]. In order to inspect data variability for this flow, we also report a box-plot representation in Fig. 3. Such a representation reveals that metrics such as MOS and SNR seem to be quite stable, having 8 and 2 outliers, respectively (over a stream length of 566 s, see Table I). The reason is that MOS naturally varies in a bounded range of values, whereas SNR is typically regularized thanks to the underlying codec.
The remaining metrics exhibit more instability, basically due to uncontrollable external factors (e.g., interference, mobility); thus, the number of outliers is greater: BW (39), RTT (36), Jitter (52), Buffer (97). We finally note that, to add more value to our work, we make available: i) the raw datasets divided per codec (as described in Table I); ii) the post-processed datasets useful to directly test a given forecasting technique [44].
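For reference, the outlier counts above follow the standard box-plot convention (points lying beyond 1.5 × IQR from the quartiles), which can be sketched as follows; the function name is ours and the sample data is illustrative:

```python
import numpy as np

def iqr_outliers(x):
    """Count box-plot outliers: points beyond 1.5 * IQR from the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((x < lo) | (x > hi)))

sample = np.array([1., 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(iqr_outliers(sample))  # 1 (only the value 100 lies outside the whiskers)
```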

IV. PROBLEM FORMULATION AND FORECASTING MODELS
In this section we examine in depth the problem of multivariate time series forecasting, by exploiting different techniques. Basically, through such a formalization, we try to predict future values of the time series on the basis of the current information set, namely the set consisting of current and past values of the series.
Let $y_1, y_2, \dots, y_k$ be a set of $k$ related variables. The forecast of the $i$-th variable $\hat{y}_{i,T+h}$ at the end of period $T$ can be expressed as
$$\hat{y}_{i,T+h} = f\big(y_{1,T}, y_{1,T-1}, \dots; \; \dots; \; y_{k,T}, y_{k,T-1}, \dots\big), \qquad (1)$$
where $h$ is the forecast horizon (number of future samples to predict), whereas $f(\cdot)$ is a function of the past observations and can be: i) a statistical model, or ii) derived by a machine learning algorithm. One of the most powerful classes of statistical multivariate models for time series forecasting is represented by the vector autoregressive (VAR) models. In contrast, machine learning models exploit their data-driven features to give a prediction of future values of a time series.
It is useful to anticipate that, to exploit a VAR model correctly, some preliminary analyses are needed (e.g., stationarity, residual correlation, etc.). Thus, in the next section we formally introduce the VAR model along with its employment in the multivariate time series field. We should clarify that the term variable used in the classic statistics field is equivalent to the term feature typically encountered in the machine learning realm. To adopt a uniform notation, we will use the term variable (or time variable) to highlight the temporal dependency.

A. Vector Autoregressive Model
The vector autoregressive model is a generalization of the univariate autoregressive (AR) model used for time series prediction.In the classic AR model, a value from a time series is expressed in terms of its preceding values.The number of preceding values chosen for the prediction is called order or lag.We remark that, in cases where more time series are present, the AR model does not allow to capture their mutual influence.Conversely, in the VAR model a value from a time series can be expressed in terms of preceding values of the time series itself and the preceding values of other time series.It results in a multivariate time series prediction problem since multiple time series can influence each other.

In its general form, the VAR($p$) model for the $k$-dimensional time series vector $\mathbf{y}_t = (y_{1,t}, \dots, y_{k,t})^{\top}$ can be written as
$$\mathbf{y}_t = A_1 \mathbf{y}_{t-1} + \dots + A_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t, \qquad (2)$$
where the $A_i$ are $(k \times k)$ coefficient matrices and $\boldsymbol{\varepsilon}_t$ is a zero-mean white noise error vector.

The preliminary operation to perform when employing a VAR model is to determine the best lag $p^*$, which allows us to build a VAR($p^*$) model embodying most of the information of the $k$-dimensional time series. Choosing the optimal lag length is not trivial, since many criteria exist (often contradicting each other). We start by applying the Akaike Information Criterion (AIC) [45], a selection rule aimed at minimizing the forecast mean square error (MSE). Specifically, the approach is to fit VAR($p$) models having orders $p = 0, \dots, p_{max}$ and pick the value of $p$ which minimizes the criterion.
Formally, the AIC obeys the following rule:
$$\mathrm{AIC}(p) = \ln \big| \tilde{\Sigma}_{\varepsilon}(p) \big| + \frac{2\, p\, k^2}{T},$$
being $T$ the time series length and $|\tilde{\Sigma}_{\varepsilon}(p)|$, with $\tilde{\Sigma}_{\varepsilon}(p) = T^{-1} \sum_{t=1}^{T} \hat{\boldsymbol{\varepsilon}}_t \hat{\boldsymbol{\varepsilon}}_t^{\top}$, the determinant of the covariance matrix of the residuals estimated by OLS. The results of the AIC criterion applied to our VAR model made of 6 time series are shown in Fig. 4. We can see that the order $p$ which minimizes the AIC criterion amounts to 11. Actually, when choosing the optimal lag for a VAR model, the lag resulting from a selection criterion (e.g., the AIC) represents only a preliminary choice.
Often, such a value must be adjusted to meet other needs as well [46], such as minimizing the residual correlations, as explained in the next subsection.
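The lag-selection procedure can be sketched numerically: fit a VAR($p$) by OLS for each candidate order and keep the $p$ minimizing the AIC. This is a simplified illustration under our own naming, with an intercept included and toy bivariate data in place of the real six-variable series:

```python
import numpy as np

def fit_var_resid_cov(y, p):
    """OLS fit of a VAR(p) with intercept; returns the residual covariance."""
    T, k = y.shape
    # regressor matrix: intercept column followed by the p stacked lags
    Z = np.hstack([np.ones((T - p, 1))] +
                  [y[p - i - 1:T - i - 1] for i in range(p)])
    Y = y[p:]
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ B
    return resid.T @ resid / (T - p)

def var_aic(y, p):
    """AIC(p) = ln|Sigma(p)| + 2*p*k^2/T (one common convention)."""
    T, k = y.shape
    return np.log(np.linalg.det(fit_var_resid_cov(y, p))) + 2 * p * k**2 / T

# toy bivariate VAR(1) data
np.random.seed(0)
y = np.zeros((400, 2))
for t in range(1, 400):
    y[t] = 0.6 * y[t - 1] + 0.1 * np.random.randn(2)

best_p = min(range(1, 6), key=lambda p: var_aic(y, p))
```

Libraries such as statsmodels automate this scan, but the sketch shows why the criterion trades goodness of fit (the determinant term) against model size (the penalty term).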
1) Residual Analysis: When fitting a model to time series data, it is likely to find autocorrelation in the residuals (the differences between observed and fitted values). In such a case, the model violates the assumption of no autocorrelation in the errors, and the forecast can be inaccurate [47]. A powerful residuals-autocorrelation test is the Breusch-Godfrey test [48] (also referred to as Lagrange Multiplier test, LM test in brief), which considers the following VAR model of the error vector:
$$\boldsymbol{\varepsilon}_t = B_1 \boldsymbol{\varepsilon}_{t-1} + \dots + B_h \boldsymbol{\varepsilon}_{t-h} + \mathbf{v}_t,$$
where $h$ is the maximum lag of the error model and $\mathbf{v}_t$ a white noise at time $t$. In the LM test, the null hypothesis $H_0$ is that there is no residual autocorrelation, whereas the alternative hypothesis $H_1$ is that residual autocorrelation exists:
$$H_0: B_1 = \dots = B_h = 0 \quad \text{vs.} \quad H_1: \exists \, i \in \{1, \dots, h\} \ \text{such that} \ B_i \neq 0. \qquad (6)$$
A common way to compute the LM test statistic based on the residuals of the VAR($p$) model is to take into account the following auxiliary regression model [49]:
$$\hat{\boldsymbol{\varepsilon}}_t = A_1 \mathbf{y}_{t-1} + \dots + A_p \mathbf{y}_{t-p} + B_1 \hat{\boldsymbol{\varepsilon}}_{t-1} + \dots + B_h \hat{\boldsymbol{\varepsilon}}_{t-h} + \mathbf{e}^*_t,$$
where the $\hat{\boldsymbol{\varepsilon}}_t$ represent the residuals from the original VAR($p$) model (with $\hat{\boldsymbol{\varepsilon}}_t = \mathbf{0}$ for $t \leq 0$), and $\mathbf{e}^*_t$ is an auxiliary error term. Accordingly, the LM statistic can be computed as
$$LM(h) = T \left( k - \mathrm{tr}\big( \tilde{\Sigma}_R^{-1} \tilde{\Sigma}_e \big) \right),$$
where $\tilde{\Sigma}_e$ is built from the residuals of the estimated auxiliary model, $\tilde{\Sigma}_R$ is its restricted counterpart obtained by imposing $B_1 = \dots = B_h = 0$, and $\mathrm{tr}(\cdot)$ is the trace of a matrix. Under the null hypothesis of no autocorrelation, it is possible to show [55] that $LM(h) \xrightarrow{d} \chi^2(h k^2)$, where $\xrightarrow{d}$ indicates convergence in distribution (as $T \to \infty$). Moreover, a correction has been proposed by Edgerton and Shukur [50] that exploits an $F$ statistic (based on the Fisher-Snedecor distribution $F(a, b)$ with $a$ and $b$ degrees of freedom) in place of the $\chi^2$ one, showing interesting results especially in unstable VAR models [51]. The resulting Edgerton-Shukur (ES) statistic follows, under $H_0$, an approximate $F$ distribution whose parameters depend on $T$, $k$ and $h$ (we refer the reader to [50] for its explicit form).
Once we have chosen the optimal lag value suggested by the AIC criterion ($p = 11$), we tested the residual correlation through the hypothesis test (6). Such a test has been implemented by exploiting both the $\chi^2$ statistic of the LM test and the $F$ statistic of the Edgerton-Shukur test. The results are shown in Table II (left side) in terms of p-values. Moreover, as suggested by credited literature [47], [52], [55], we choose a moderate value for $h$, namely $h = 10$.
By choosing a type I error probability $\alpha = 0.05$ (the probability of rejecting the null hypothesis when it is actually true), a p-value lower than $\alpha$ allows us to reject the null hypothesis of no residual correlation; conversely, when the p-value exceeds $\alpha$, the hypothesis $H_0$ in (6) is accepted. Such a condition occurs in both the LM and ES tests, but the latter seems to be more "conservative" and allows rejection of the null hypothesis less frequently than the LM test. We also explore some higher lags¹ and we find interesting results for $p = 12$, which represents the second optimal choice according to the AIC criterion (see Fig. 4). The corresponding results in terms of p-values are shown in Table II (right side). It is possible to notice that the null hypothesis of no residual correlation is satisfied for all values of $h$ in the case of the ES test, with p-values significantly higher than 0.05. Remarkably, also in the case of the LM test we have p-values higher than 0.05. As mentioned before, we have also tried to further increase the order $p$ of the VAR($p$) model, but we obtained more p-values allowing rejection of the $H_0$ hypothesis of no serial correlation (needed for accurate forecasting) than those obtained for a lag length of 12, which was finally elected as the optimal choice.
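As an intuition for the LM statistic, the univariate (per-equation) analogue regresses a residual series on its own lags and compares $n R^2$ with a $\chi^2(h)$ quantile; the multivariate test used in the paper works analogously on the full residual vector. A rough sketch with our own function name:

```python
import numpy as np

def bg_lm(resid, h):
    """Univariate Breusch-Godfrey-style LM statistic: regress the residual
    series on its own h lags and return n * R^2, to be compared against a
    chi-square distribution with h degrees of freedom."""
    e = resid - resid.mean()
    # columns: lag-1, lag-2, ..., lag-h of the residual series
    X = np.column_stack([e[h - i - 1:len(e) - i - 1] for i in range(h)])
    y = e[h:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - np.sum((y - X @ beta) ** 2) / np.sum(y ** 2)
    return len(y) * r2

# strongly autocorrelated "residuals" should yield a large statistic
np.random.seed(1)
e = np.zeros(500)
for t in range(1, 500):
    e[t] = 0.9 * e[t - 1] + np.random.randn()
lm = bg_lm(e, 2)
# the chi-square 95% quantile with 2 dof is ~5.99; lm here is far above it
```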
2) Stationarity: When dealing with VAR models, another important operation consists in removing possible trending behaviors of the involved variables to avoid spurious regressions. Otherwise stated, we have to guarantee the stationarity of the time series, meaning that first and second moments must be time invariant. Pragmatically, the stationarity check is often performed through OLS-based unit root tests. In particular, Dickey and Fuller [53] developed a procedure (DF) for testing whether a variable has a unit root or, equivalently, whether the variable follows a random walk.
We use the augmented Dickey-Fuller test (which, differently from the classic DF, includes higher-order autoregressive terms in the regression), where the following test model (in its more general form, see [54]) is considered:
$$\Delta y_t = \beta_0 + \beta_1 t + \gamma y_{t-1} + \sum_{i=1}^{p} \phi_i \, \Delta y_{t-i} + \varepsilon_t,$$
where $\Delta y_t = y_t - y_{t-1}$ is the difference operator, $\beta_0$ is the intercept term (constant), $\beta_1 t$ is the time trend, and $p$ the lag of the autoregressive process. Finally, the test statistic on the $\gamma$ coefficient is used to test whether the data need to be differenced to make them stationary. The DF test is the following:
$$H_0: \gamma = 0 \ \text{(null hypothesis, unit root)}, \qquad H_1: \gamma < 0 \ \text{(alternative hypothesis, stationarity)}. \qquad (11)$$
For our experiments, we have performed the augmented DF test for each variable, verifying that the variables are already stationary in levels; thus, there is no need to apply the differencing operator. The results are reported in Table III, where the negative values of $\gamma$ (second column) for each variable and the corresponding low p-values (third column) suggest rejecting the null hypothesis, and accepting the stationarity hypothesis by assuming a type I error probability of 0.05.

¹Too high values of the lag could lead the system to overfitting.
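The augmented DF regression above can be reproduced numerically: regress $\Delta y_t$ on the lagged level and lagged differences, then inspect the t-ratio on the level coefficient. The sketch below (our own function name) uses the constant-only specification without trend, for which the approximate 5% critical value is about −2.86:

```python
import numpy as np

def adf_stat(y, lags=1):
    """Augmented Dickey-Fuller t-statistic (constant, no trend).

    Regresses dy_t on y_{t-1}, `lags` lagged differences and an intercept;
    returns the t-ratio on the level coefficient. Values well below ~-2.86
    (the approximate 5% critical value for this specification) suggest
    rejecting the unit-root null."""
    dy = np.diff(y)
    rows = []
    for t in range(lags, len(dy)):
        # [intercept, lagged level, lagged differences (most recent first)]
        rows.append([1.0, y[t]] + list(dy[t - lags:t][::-1]))
    X = np.array(rows)
    target = dy[lags:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    s2 = resid @ resid / (len(target) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

np.random.seed(0)
white_noise = np.random.randn(500)             # stationary: strongly negative stat
random_walk = np.cumsum(np.random.randn(500))  # unit root: stat typically near zero
```

For the actual decision, tabulated Dickey-Fuller critical values (or a library such as statsmodels) should be used rather than the single rule-of-thumb threshold above.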
3) Stability: Stability conditions are typically required to avoid explosive solutions of the stochastic difference equations characterizing a time series expressed in terms of an autoregressive part and a moving average part. To this aim, it is possible to show [52], [55] that the VAR($p$) process (2) can be written in the following $(kp)$-dimensional VAR(1) form:
$$\mathbf{Y}_t = A^* \mathbf{Y}_{t-1} + \mathbf{E}_t, \quad \text{with} \quad A^* = \begin{bmatrix} A_1 & A_2 & \cdots & A_{p-1} & A_p \\ I_k & 0 & \cdots & 0 & 0 \\ 0 & I_k & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & I_k & 0 \end{bmatrix}, \qquad (12)$$
where $\mathbf{Y}_t = (\mathbf{y}_t^{\top}, \dots, \mathbf{y}_{t-p+1}^{\top})^{\top}$, $\mathbf{E}_t = (\boldsymbol{\varepsilon}_t^{\top}, \mathbf{0}^{\top}, \dots, \mathbf{0}^{\top})^{\top}$, and $I_k$ is the order-$k$ identity matrix. The process $\mathbf{y}_t$ is stable if the eigenvalues of the companion matrix $A^*$ in (12) have modulus less than one. Such a property is satisfied for the considered VAR($p$) model, as can be observed in Fig. 5.
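The eigenvalue check on the companion matrix translates directly into code; a minimal sketch (our function name, toy coefficient matrices):

```python
import numpy as np

def is_stable(A_list):
    """Check VAR stability: all companion-matrix eigenvalues strictly inside
    the unit circle. A_list holds the k x k coefficient matrices A_1..A_p."""
    k, p = A_list[0].shape[0], len(A_list)
    companion = np.zeros((k * p, k * p))
    companion[:k, :] = np.hstack(A_list)          # top block row: A_1 ... A_p
    companion[k:, :-k] = np.eye(k * (p - 1))      # shifted identity blocks
    return bool(np.all(np.abs(np.linalg.eigvals(companion)) < 1))

# a stable VAR(1): the eigenvalues 0.5 and 0.3 lie inside the unit circle
print(is_stable([np.diag([0.5, 0.3])]))  # True
```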
Although such an analysis is formally correct to verify the stability condition, it does not allow capturing the behavior in the time domain. Accordingly, we additionally perform an OLS-based cumulative sum (CUSUM) test [56]. The rationale behind this test is that, if the model parameters (e.g., the autoregressive coefficients) change over time, the accuracy of the one-step-ahead forecast will decrease and the forecast error will increase.
The panels of Fig. 6 show the results of the OLS-based CUSUM test for all the six variables. The x-axis represents the normalized time between 0 and 1, while the y-axis reports the cumulative sums of residuals (interpretable as random processes). It is possible to notice that all the processes are substantially stable, with oscillations around the zero value. A slight exception is represented by SNR, where it is possible to see some small drifts from the stability value, which however never exceed the 95% confidence boundaries (red lines).
4) Time Series Relationships: One of the most interesting aspects when dealing with VAR models is to understand how the time series composing the process mutually influence each other. Precisely, it is useful to know the response of one variable to an external disturbance (unit impulse or unit shock in the econometrics jargon) of another variable, which allows examining more in depth the cause/effect relation among the involved variables. In particular, if we observe a reaction of one variable to an impulse in another variable, the latter will be causal for the former [55]. In many real-world cases, there is a correlation among the variables in a system. This means that an impulse of one variable should be accompanied by an impulse of some other variables correlated with the modified one. In other words, the impulse response allows tracing the transmission of a single shock within a system of equations [57]. Often, it is interesting to isolate the effect of a single variable shock onto another variable of the system to better capture the interdependencies. To this aim, we implement the orthogonal impulse response functions (OIRF) method [58], which allows rewriting the process (2) in moving-average form as:
$$\mathbf{y}_t = \sum_{i=0}^{\infty} \Theta_i \, \mathbf{w}_{t-i},$$
where $\Sigma_{\varepsilon} = P P^{\top}$, being $P$ a lower triangular nonsingular matrix with positive diagonal elements (also known as the Choleski decomposition, see Appendix A.9.3 in [55]), $\Theta_i = \Phi_i P$ (with $\Phi_i$ the coefficient matrices of the moving-average representation of the process), and $\mathbf{w}_t = P^{-1} \boldsymbol{\varepsilon}_t$ a white noise with covariance matrix $\Sigma_w = I_k$. Since the white noise errors $\mathbf{w}_t$ have uncorrelated components $w_{1,t}, \dots, w_{k,t}$ with unit variance, they are often known as orthogonal residuals or innovations. Thus, it is reasonable to assume that a change in one component of $\mathbf{w}_t$ has no effect on the other components due to orthogonality. In particular, the $(j, m)$-th element of $\Theta_i$ is assumed to represent the effect on variable $j$ of one unit innovation (namely, one unit standard deviation) in the $m$-th variable that has occurred $i$ periods before.
In our setting, we have 6 variables, resulting in 36 orthogonal impulse responses² as shown in the panels of Fig. 7. The causal variables are grouped per columns. The x-axis reports the observation period, thus it is possible to evaluate the disturbance effects for various observation periods (25 in our case). The blue continuous curves represent the oscillating values of the affected variables around their stability point (horizontal black line at 0), namely the value they would keep if the impulse were not applied at all. The black dashed curves surrounding the blue ones represent the asymptotic confidence intervals at the 95% confidence level. Such an analysis has the merit of highlighting some relationships among variables which are often hidden at first sight. For example, the sub-figure in the first row and second column allows visualizing the effect of a bandwidth shock on the MOS variable (BW → MOS). In particular, it is possible to see that a bandwidth impulse causes a slight increase of the MOS by approximately 0.006 units of innovation after about 15 observation periods. Then, it decreases. Likewise, a BW impulse causes a decrease of a couple of units of innovation in Jitter after 10 observation periods before exhausting its effect. Such behaviors are in line with the fact that having more bandwidth is typically beneficial for other metrics. It is useful to notice that, after a shock, some variables can undergo a decrease before rising back to their stability point. This is the case of Jitter and Buffer after a MOS impulse, which experience a decrease of 2.5 and 5 units of innovation, respectively (MOS → Jitter and MOS → Buffer sub-figures). Also in this case, we can reasonably admit that a better voice quality can be associated with lower values of jitter which, in turn, are associated with smaller values of the playout delay buffer. Interestingly, the mutual influence between two variables can be quite different when the "causing" and "caused" roles are inverted. For example, in the RTT → BW case, an RTT shock causes a slight oscillation of BW (with a peak of about 2 units of innovation around an observation period of 5) before decaying rapidly to the stability point. In contrast, a BW shock causes a substantial decrease in RTT (BW → RTT), with two peaks (around −50 and 45 units of innovation) and a slower re-stabilization. Such an apparently unusual behavior can be explained by the fact that BW (red curve in Fig. 2) exhibits a certain robustness, thus it is not dramatically impaired by unit shocks, whereas RTT (green curve in Fig. 2) appears to be more sensitive due to its oscillating behavior, and is then more susceptible to exogenous interventions.
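For a VAR(1), the orthogonal responses take the simple closed form $\Theta_i = A^i P$, which makes the mechanics of the analysis easy to reproduce; a minimal sketch (our own names; for $p > 1$ the $\Phi_i$ matrices would come from the moving-average representation instead of plain powers of $A$):

```python
import numpy as np

def oirf_var1(A, Sigma_u, periods=10):
    """Orthogonal impulse responses for a VAR(1) y_t = A y_{t-1} + u_t.

    Theta_i = A^i @ P, with P the lower Cholesky factor of the error
    covariance Sigma_u; entry (j, m) of Theta_i is the response of
    variable j, i periods after a one-std-dev shock to variable m."""
    P = np.linalg.cholesky(Sigma_u)
    responses, Ai = [], np.eye(A.shape[0])
    for _ in range(periods + 1):
        responses.append(Ai @ P)
        Ai = Ai @ A  # next power of the coefficient matrix
    return np.array(responses)

# toy bivariate example with correlated errors
A = np.array([[0.5, 0.1], [0.0, 0.4]])
Sigma_u = np.array([[1.0, 0.3], [0.3, 1.0]])
theta = oirf_var1(A, Sigma_u, periods=10)
# theta[i][j, m]: response of variable j to a shock in variable m, i steps later
```

Note that the Cholesky ordering of the variables matters: shocks are orthogonalized recursively, so reordering the series changes the responses.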

B. Learning models for time series forecasting
The application of machine learning techniques to time series forecasting is a relatively recent topic, with interesting applications to econometrics [61]. When dealing with time series, in fact, the temporal information is crucial, whereas a machine learning dataset is typically a list of records treated equally from a time perspective. This notwithstanding, it is possible to manipulate these models (especially supervised ones) to train on historical time series data and provide future predictions. Moreover, some deep learning methods have been explicitly designed to take temporal information into account through memory-based cells, as detailed below.
Recurrent Neural Networks (RNNs): such a technique relies on a network architecture able to naturally handle variable-length sequences. In such a way, through the RNNs it is possible to track the state of a system (by retaining past information) and update its state as time elapses. The memory state is recursively updated with new observations, thus the hidden state $h_t$ at time $t$ can be represented as a function of the input $x_t$ at time $t$ and the previous hidden state at time $t-1$, namely:
$$h_t = f(h_{t-1}, x_t),$$
that, in turn, is used to evaluate the output (namely, the prediction):
$$\hat{y}_t = g(h_t).$$
A weak point of RNNs is managing the long-range dependencies connected with transferring information from earlier to later time steps across long sequences (known as the vanishing gradient problem [60]). Such an issue can be solved through the techniques explained below. The following hyper-parameters have been used for RNN: 30 RNN units; dropout rate of 0.25; Adam optimization algorithm (learning rate = 0.1); tanh activation function; 30 epochs.
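The recursive update above can be made concrete with a bare-bones numpy cell; this is a sketch of the mechanism, not the framework layers used in our experiments, and all names and sizes are illustrative:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def rnn_forecast(seq, W_x, W_h, b, W_y, b_y):
    """Roll the cell over a sequence and map the final hidden state to a
    prediction: y_hat = W_y h_T + b_y."""
    h = np.zeros(W_h.shape[0])
    for x_t in seq:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return W_y @ h + b_y

# illustrative sizes: 6 input metrics, 4 hidden units, 12-step window
np.random.seed(0)
W_x, W_h, b = 0.1 * np.random.randn(4, 6), 0.1 * np.random.randn(4, 4), np.zeros(4)
W_y, b_y = 0.1 * np.random.randn(6, 4), np.zeros(6)
seq = np.random.randn(12, 6)
pred = rnn_forecast(seq, W_x, W_h, b, W_y, b_y)
```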
Long Short-Term Memory (LSTM): represents an evolved RNN network [62] with some internal state cells acting as long-term or short-term memory cells. The output of the LSTM network is modulated by the state of these cells and by three gates which tune the information flow: the input gate, responsible for updating the cell state; the forget gate, in charge of keeping or discarding information on the basis of the input data $x_t$ and the previous hidden state $h_{t-1}$; and the output gate, which decides which information to pass to the next LSTM unit. The hidden state at time $t$ is:
$$h_t = o_t \odot \tanh(c_t),$$
being $o_t$ the output gate, and $c_t$ the cell state at time $t$.
Gated Recurrent Unit (GRU): a lighter version of LSTM [63] with two gates: the update gate, which embodies the functionalities offered by the LSTM forget and input gates, and the reset gate, which is in charge of deciding how much past information to forget. The GRU hidden state can be expressed as: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$, where $z_t$ is the update gate, which decides the updating amount of its candidate activation $\tilde{h}_t$. The following hyper-parameters have been used for GRU: 30 GRU units; dropout rate of 0.25; Adam optimization algorithm (learning rate = 0.1); tanh activation function; 30 epochs.
Convolutional Neural Networks (CNN): typically used for classification problems involving spatial information, where an image matrix (2D array) is provided to the CNN structure. When dealing with time-series problems, instead, a CNN can be fed with a 1D array, since only the temporal dimension must be taken into account. Also when applied to temporal data, the CNN uses: i) the convolutional layer, aimed at applying filters to derive the most representative features; ii) the pooling layer, to reduce the size of the series while preserving the important features extracted by the convolutional layers; iii) the fully connected layer, to map the features extracted by the network into specific classes or values. The following hyper-parameters have been used for CNN: 30 CNN filters (each of size 6); dropout rate of 0.25; Adam optimization algorithm (learning rate = 0.1); tanh activation function; 30 epochs.
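The 1D convolution and pooling steps applied to a temporal sequence can be sketched as follows; the series and the averaging kernel are illustrative, not taken from the experiments (and real CNN layers learn the kernel weights rather than fixing them).

```python
def conv1d(series, kernel):
    """Valid 1-D convolution (cross-correlation, as in CNN layers)."""
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]

def max_pool(features, size):
    """Non-overlapping max pooling to shrink the feature sequence."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]

series = [1, 2, 4, 3, 1, 0, 2, 5]
feats = conv1d(series, kernel=[0.5, 0.5])  # moving-average-like filter
pooled = max_pool(feats, size=2)           # keeps the strongest local response
```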
Multi Layer Perceptron (MLP): the most common form of neural network and one of the first exploited in time series forecasting problems. The lagged observations (say $y_t, y_{t-1}, \ldots, y_{t-p+1}$) are used as inputs of an MLP structure to evaluate the forecast $\hat{y}_{t+1}$: $\hat{y}_{t+1} = g\big(\sum_{i=0}^{p-1} w_i\, y_{t-i} + b\big)$, where $g(\cdot)$ is an activation function (e.g., sigmoid, linear, etc.) producing the output, and $w_i$ and $b$ are the weights and the bias, respectively. The input data activate the hidden (intermediate) layers following the forward direction, and, in turn, the hidden-layer neurons feed forward into the output neurons. The MLP training is regulated by backpropagation, a mechanism that updates the neuron weights to progressively minimize the error. The following hyper-parameters have been used for MLP: 30 dense units; dropout rate of 0.25; Adam optimization algorithm (learning rate = 0.1); tanh activation function; 30 epochs.
Random Forest (RF): a technique based on bootstrap aggregation (bagging) over decision trees. In practice, during the training stage each tree within a random forest learns from random samples drawn with replacement (bootstrapping) so as to reach a lower variance. For each input sample $x_i$, $i = 1, \ldots, N$, the desired forecast is the average of the forecasts of the $K$ individual trees $T_k$ applied to the input data, namely: $\hat{y}_i = \frac{1}{K} \sum_{k=1}^{K} T_k(x_i)$. The following hyper-parameters have been used for RF: 30 estimators (or trees); maximum tree depth of 10.
Extreme Gradient Boosting (XGB): an improved version of gradient boosting, an iterative technique that fits a decision tree on the residuals of a base learner, improving the prediction model by progressively adding new decision trees. The output forecast can be written as: $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, where $K$ is the number of trees and $f_k$ is the $k$-th base learner.
An objective function is used to train the model by measuring how well it fits the training data. The following main hyper-parameter has been used for XGB in our experiment: 30 gradient boosted trees.
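The residual-fitting idea behind boosting can be sketched with depth-1 trees (stumps) fitted iteratively on the residuals under squared loss. The data, learning rate, and number of rounds below are illustrative, not the XGB configuration used in the paper (XGBoost additionally regularizes the objective and grows deeper trees).

```python
def fit_stump(xs, rs):
    """Fit a depth-1 regression tree (stump) to residuals by grid search."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x <= thr]
        right = [r for x, r in zip(xs, rs) if x > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - (lm if x <= thr else rm)) ** 2 for x, r in zip(xs, rs))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def boost(xs, ys, rounds=10, lr=0.5):
    """Each new stump fits the residuals left by the current ensemble."""
    pred = [0.0] * len(ys)
    learners = []
    for _ in range(rounds):
        rs = [y - p for y, p in zip(ys, pred)]   # current residuals
        f = fit_stump(xs, rs)
        learners.append(f)
        pred = [p + lr * f(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * f(x) for f in learners)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.1, 0.9, 3.0, 3.2, 2.9]
model = boost(xs, ys)
```

After a few rounds the ensemble output approaches the training targets, illustrating how each added tree corrects the errors of its predecessors.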

V. EXPERIMENTAL FORECASTING RESULTS
In this section we present a comparative analysis of the methods described in the previous section, based on the experimental measurements. Before delving into the details of the numerical results, we need to provide some clarifications about the processing performed on the gathered data.
A preliminary operation is to re-frame the time series forecasting task into a supervised learning problem. We first split the multivariate time series into training and testing sets, adopting the classic 70/30 split (70% of the data is used for training, 30% for testing) as shown in Fig. 8. It is useful to highlight that the classic k-fold cross-validation method cannot be applied in this setting, since it assumes that there is no relationship among the observations. In contrast, when dealing with time series problems, the temporal continuum has to be preserved. Accordingly, we adopt the sliding window mechanism, where part of the input sequence (a window of lagged values, represented by the past observations within the shaded blue area in Fig. 8) is used to forecast new samples (future observations within the shaded red area in Fig. 8). The sliding window approach has been profitably employed in other fields involving time series forecasting, such as smart manufacturing [64] or radar [65]. Moreover, as in the aforementioned works, we perform so-called one-step forecasting, where the prediction is made one step at a time to limit the forecast uncertainty [47].
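The sliding-window reframing and the temporal train/test split described above can be sketched as follows (the series and the window length are illustrative):

```python
def sliding_window(series, n_lags):
    """Re-frame a univariate series into supervised (X, y) pairs:
    each sample uses the previous n_lags observations to predict the next one.
    """
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])
        y.append(series[i])
    return X, y

series = [10, 20, 30, 40, 50, 60]
X, y = sliding_window(series, n_lags=3)
# X = [[10, 20, 30], [20, 30, 40], [30, 40, 50]] ; y = [40, 50, 60]

split = int(0.7 * len(X))                 # temporal 70/30 split: no shuffling,
X_train, X_test = X[:split], X[split:]    # the time order must be preserved
y_train, y_test = y[:split], y[split:]
```

The key difference from a generic supervised dataset is that the split point is a single cut in time, never a random shuffle.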
As regards the tuning of the various learning-based techniques, we have empirically chosen their structures so that the resulting accuracy would be in the same range as the VAR model (which does not require any fine tuning other than the choice of the optimal lag $p^*$). Such an approach is in line with what is suggested by the credited literature [66].
We start by visually analyzing the behavior of the various techniques for two specific VoIP flows identified by their codecs, namely G.722 and G.729 (for space constraints we omit the visualization for the remaining codecs, but a summary of the performance results for each codec is reported in Table IV). These two codecs represent two extreme trade-off choices between conversation quality and bandwidth utilization. Indeed, among the codecs used in our experiments, G.722 provides the best audio quality (for instance, in terms of MOS) but its bandwidth consumption is not very efficient (bit rate of 64 kb/s). In contrast, G.729 offers a slightly lower audio quality but allows a greater bandwidth saving, with just 8 kb/s of bit rate.
The panels of Figs. 9 and 10 show the temporal behavior of each variable for codecs G.722 and G.729, respectively. Superimposed onto the actual values of the variables (black dashed lines) we report, with different colors as specified in the figure legends, the behavior of each forecasting technique described above. In order to highlight the behavior of VAR compared to the learning techniques, we also report, in shaded pale red, the 95% forecast intervals (for the VAR), which represent an estimate of the intervals where we expect a future value to fall. Since the interval width amounts to $1.96 \cdot \hat{\sigma}_j$ (with $\hat{\sigma}_j$ the standard deviation of the residuals of each time series, see [55]), the shape and the width of each interval strongly depend on the behavior of the residuals. For instance, in the MOS - G.722 case (see the first panel in Fig. 9), the low residual standard deviation directly results in a narrow forecast interval. Conversely, the unexpected peak in the RTT - G.722 case (see the third panel in Fig. 9) at about 415 seconds has a negative impact on the prediction accuracy (for all examined techniques) and, in turn, implies a growth of the residual standard deviation. This directly translates into a broad forecast interval.
The first aspect to highlight is that the actual behavior of the variables basically depends on two factors that impact the performance forecast: the codec type and the network conditions. For instance, the overall bandwidth consumption amounts, on average, to 90 kb/s in the G.722 case (top-middle panel of Fig. 9), whereas it is around 30 kb/s in the G.729 case (top-middle panel of Fig. 10). In principle, this implies that more fluctuations are possible in the G.722 case due to the wider span of values. Unfortunately, even if a codec is able to guarantee a kind of temporal "stability", the high unpredictability of the network conditions is the main cause of fluctuations, which are very challenging to predict due to their extremely time-variant behavior.
This notwithstanding, in both Figs. 9 and 10 we can observe that each technique is able to produce a satisfying forecast of the original variables, remaining within the area delimited by the forecast intervals.
By visual inspection, we observe that VAR (red curve) shows good performance when the time series do not exhibit excessive fluctuations. When important fluctuations are present, the VAR model is able to follow only the mean value of the oscillating time series (see, for instance, the MOS - G.729 case). The reason is that VAR is governed by a set of linear equations (see (2)), thus it can suffer when representing non-linear behaviors. Occasionally, the MLP technique also shows slight difficulty in fitting the original values, once again due to the underlying linear model (see (18)). Such a behavior emerges in particular in Fig. 9, where the MLP prediction moves slightly away from the corresponding forecast intervals. Conversely, the remaining techniques adapt well to fluctuations, in particular the deep-based techniques, whose internal structure allows keeping the state at time $t-1$ to improve the prediction at time $t$.
To better quantify the behavior of each technique, we have evaluated the performance for each voice flow (namely, for each codec), for each technique, and for each time-based variable, as shown in Table IV. Such metrics are computed for each time series $j = 1, \ldots, m$, with $m = 6$ and $n$ the time series length. These three indicators are often used jointly when evaluating forecasting accuracy. The RMSE is a quadratic scoring rule which gives a relatively high weight to large errors, since the errors are squared before being averaged. The MAE is a linear scoring rule designed to weight all individual differences equally.
The MAPE includes a normalization by the actual values and is expressed as a percentage. Since this indicator is often used as a summarizing metric, to easily pinpoint the best forecasting technique in Table IV the corresponding MAPE value is indicated in red.
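The three indicators can be computed as follows; the actual/predicted values below are illustrative, not taken from Table IV.

```python
def rmse(actual, pred):
    """Root Mean Square Error: squares errors, so large errors dominate."""
    return (sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)) ** 0.5

def mae(actual, pred):
    """Mean Absolute Error: weights all individual differences equally."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean Absolute Percentage Error: normalized to the (non-zero) actuals."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

actual = [4.0, 4.2, 4.1, 4.3]   # e.g., a short MOS trace
pred   = [4.1, 4.0, 4.2, 4.1]
```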
For each voice flow we have repeated the lag analysis (as in Sect. IV-A), and we have reported the optimal lag value $p^*$ next to the VAR model in each sub-table of Table IV.
Let us start by noting some general facts valid for all the experiments. For each voice flow, the three performance indicators (RMSE, MAE, and MAPE) exhibit very different ranges for each variable. For instance, in the case of MOS, RMSE and MAE never reach the value 1. This is due to the fact that all the chosen codecs guarantee a good perceived quality, with a MOS varying within a limited range of values (MOS values never fall below 4). This directly reflects in low values of all the indicators.
Conversely, RTT varies within a great range of values, with some unusual peaks due to temporary network conditions (see, e.g., the peaks of RTT in green at about t = 30 s and t = 415 s in Fig. 2). In such cases, all the forecasting techniques are (obviously) unable to predict this behavior, thus the prediction error is quite large. Such a condition directly reflects in the performance indicators and, in particular, in the RMSE, whose values are high and hugely different from each other, since the RMSE tends to magnify large errors. The SNR, instead, exhibits a quite standard smooth behavior, with weak oscillations and no unusual peaks. Thus, the performance indicators are not dramatically high, indicating a satisfying forecasting accuracy. Indeed, if we compare the accuracy of each technique, we can observe that: the VAR technique, as also mentioned before, returns satisfactory results when the time series does not exhibit too many non-linearities; on average, deep techniques such as LSTM, GRU, and CNN produce satisfactory results (see the MAPE values in red). For LSTM and GRU, the results are justified by the presence of a memory-based internal structure able to keep track of past values. For CNN, the results are justified by the presence of a convolutional structure able to derive the most significant temporal features. The remaining deep-based techniques (RNN and MLP) exhibit lower accuracy due to their naive internal structure, which does not exploit any particular characteristic of temporal data. Among standard machine learning techniques, XGBoost exhibits the best performance, since it relies on a combination of ensemble models useful to improve the quality of the prediction.
The performance analysis should be complemented by a time evaluation to better compare the considered techniques. Figure 11 shows the results of such a comparison, obtained on a platform equipped with 2 virtual CPUs (Intel Xeon @2.30GHz) and 13 GB of RAM. In this time analysis we take into account all the experiments (namely, all the voice flows) in order to highlight possible variability through a boxplot representation. First, the analysis reveals that the VAR technique is able to perform the forecasting in a few milliseconds, as shown in the top-right inset. This is basically due to the fact that VAR is built on a linear combination of lagged values to forecast the next sample. In contrast, deep methods require more time (in particular during training) to perform the forecasting, due to their internal structure, which can be more or less complicated (e.g., memory-based cells, convolution operations). In the middle we find XGB and RF which, relying on an optimized tree-based structure, are quite fast. The boxplot representation also highlights that, when using a technique with a complex internal structure (typically, a deep-based technique), the time variability increases. This is obviously connected to the fact that a more complex structure may produce higher delays. This behavior is captured through the inter-quartile range (IQR), defined as the difference between the third and first quartiles: large IQR values imply more dispersed values. For the deep-based techniques we observe the following IQR values: RNN (0.88), LSTM (1.59), GRU (1.58), CNN (0.42), MLP (0.79). For the standard learning techniques we have: RF (0.15), XGB (0.14). Finally, VAR is the most stable, having the smallest IQR value, amounting to 0.006.
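As a small illustration of the dispersion measure used above, the IQR of a set of hypothetical per-run times (in seconds, not the paper's measurements) can be computed with the standard library:

```python
import statistics

# Hypothetical per-run training times (s) for one technique.
times = [2.1, 2.3, 2.0, 3.9, 2.2, 2.4, 2.8, 2.6]

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, _, q3 = statistics.quantiles(times, n=4)
iqr = q3 - q1   # inter-quartile range: spread of the middle 50% of runs
```

Note that the single slow run (3.9 s) barely moves the IQR, which is exactly why it is a robust way to compare the time dispersion of the techniques.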

A. Main Findings
Through the proposed assessment we are able to draw some general considerations about the evaluated forecasting techniques. First of all, we can reasonably say that there is no definitive winner, since the complexity of the forecasting problem does not allow selecting an outperforming technique in an absolute sense. An insightful comparison can be made between the statistical approach (represented by VAR) and the learning techniques.
First, we highlight that the VAR method allows complete control over the analytical structure of each time series. In particular, its ancillary analyses (e.g., residuals, impulse responses) provide deep insights about the time series composition and their mutual relationships. In contrast, the data-driven approach adopted by the learning techniques does not allow capturing many analytical details. For instance, guaranteeing the stationarity condition (not required by learning approaches) allows obtaining useful descriptors (mean, variance, correlation) of the future behavior of a time series. Conversely, if the stationarity condition is violated (namely, if the series is consistently increasing over time), the sample mean and variance will grow with the sample size and will tend to underestimate the mean and variance of future periods.
Second, VAR is a good choice in case the time series exhibits good stability over time, or when the observation time is much longer than tens of minutes (e.g., per-month or per-year). In this latter case, in fact, the temporal irregularities tend to be smoother, and the linear combination of past lagged values offers better performance. Conversely, being intrinsically adaptive, the learning techniques are more responsive in the presence of network parameter fluctuations. Furthermore, VAR offers very competitive computation times due to the simplicity of the model. On the other hand, the deep methods (RNN, LSTM, GRU, CNN, MLP) exhibit slower computation times along with high temporal uncertainty (high IQR values), mainly due to their complex internal structure. Among standard ML techniques, XGBoost offers an interesting trade-off between accuracy and time.
We finally notice that, differently from all the learning techniques, VAR does not need any hyper-parameter tuning other than the optimal lag; for the learning techniques, an inaccurate tuning could lead to poor performance.

VI. CONCLUSION
In this work we tackle the problem of forecasting mobile VoIP traffic in a real cellular environment. The main purpose is to provide valuable information to network operators, allowing them to optimize network planning in mobile environments. In particular, we characterize the temporal evolution of the most important QoS/QoE descriptors of VoIP traffic through a multivariate time series assessment. Forecasting techniques such as Vector Autoregression and machine learning approaches have been compared to highlight pros and cons both in terms of performance and time.
The work presents a series of novel elements. First, we propose a multivariate time series characterization of network descriptors, an approach currently used by econometricians to model and predict stock market evolution. Through such an approach it is possible to analytically capture the interdependencies among the stochastic processes which govern the behavior of the network variables. Then, the time series problem has been turned into a supervised learning framework through the sliding window technique.
Such a reframing of the problem is useful to: i) reinterpret the classic concepts of training/test sets in terms of temporal values of a time series, with the aim of forecasting future values of the network descriptors; ii) compare in a critical manner statistical techniques (here represented by the VAR model) and machine learning methods. Results show that VAR is the optimal choice when complete analytical control over the variables is needed, when the network fluctuations are not persistent, or when strict elaboration time constraints are present. In contrast, learning-based techniques provide excellent accuracy in case of network instability, thanks to their data-driven approach.
Finally, the whole assessment is supported by an experimental campaign in a real LTE-A mobility environment where, through the evolved RTCP-XR protocol, we are able to derive network metrics typically neglected in the literature (e.g., MOS, SNR, playout delay buffer).
This work remains open to future investigations along several directions: i) the main techniques adopted for this analysis could be extended to technologies such as 5G as they become more pervasive, with the possibility of acquiring data from real settings; ii) many of the derived models could be used as benchmarks to design more realistic network simulators; iii) new parameters such as the car's speed could be gathered and related to the behavior of the VoIP metrics; iv) the whole analysis could be repeated in a transformed domain (e.g., the wavelet domain in place of the time domain).

Fig. 1: Experimental setting including a mobile user (top-left) and a fixed user with a control server for data collection and elaboration (bottom-right). The main parameters are summarized in the bottom-left table.
derived from the R-factor (R), a QoE indicator obtainable via RTCP-XR. Then, we have applied the conversion formula provided by the ITU-T G.107 standard [42] to derive the MOS, namely: $\mathrm{MOS} = 1 + 0.035\,R + 7 \cdot 10^{-6}\, R\,(R - 60)(100 - R)$.
• $y_2$: time series representing the bandwidth (often BW for brevity), which provides information about the bandwidth consumption and is measured in kb/s;
• $y_3$: time series representing the round-trip time (RTT), a key performance indicator measuring the time interval (in ms) between a voice packet sent from a source and the ack received from the destination;
• $y_4$: time series representing the jitter (measured in ms), namely the variation in voice packet latency, evaluated through the formula $J_i = |(t_r^{(i)} - t_s^{(i)}) - (t_r^{(i-1)} - t_s^{(i-1)})|$, which quantifies the jitter of the $i$-th packet depending on the transmitting ($t_s^{(i)}$) and the receiving ($t_r^{(i)}$) time;
• $y_5$: time series representing the playout delay buffer (of-
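The R-to-MOS conversion formula above can be expressed directly in code. Note that the full G.107 mapping also clamps the result to 1 for R ≤ 0 and to 4.5 for R ≥ 100; this minimal sketch covers only the intermediate range.

```python
def mos_from_r(r):
    """ITU-T G.107 conversion from R-factor to MOS, for 0 < R < 100."""
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

# A high R-factor (e.g., 93.2, a typical narrowband best case) maps to MOS ~ 4.41.
mos = mos_from_r(93.2)
```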

Fig. 8: Time series reframed into supervised learning through the sliding window method.

Fig. 9: G.722 codec: multivariate time series forecasting for MOS, Bandwidth, RTT, Jitter, Buffer, and SNR, along with the 95% forecast intervals in shaded pale red. In the gray area on the right, the forecasting results for each technique.

Fig. 10: G.729 codec: multivariate time series forecasting for MOS, Bandwidth, RTT, Jitter, Buffer, and SNR, along with the 95% forecast intervals in shaded pale red. In the gray area on the right, the forecasting results for each technique.

Fig. 11: Comparison of the computation times of the forecasting techniques (boxplot representation).

TABLE I :
Some dataset statistics per codec: RTP pkts, stream length (s), lost pkts.

TABLE II :
Results (in terms of p-values) for models with p = 11 and p = 12 lags, respectively. Two residual correlation tests have been considered: Lagrange Multiplier (LM), based on the χ² statistic, and Edgerton-Shukur (ES), based on the F statistic, both computed under the null hypothesis.
and thus to accept the alternative hypothesis H_a with an error probability of 5%, at most. We highlight in red the p-values in correspondence of the lag values where the alternative hypothesis H_a (presence of residual correlation) is accepted.

TABLE III :
Augmented Dickey-Fuller test per variable.
TABLE IV: Each sub-table contains the performance per voice flow in terms of test Root Mean Square Error (RMSE), test Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE).