Data-driven Network Simulation for Performance Analysis of Anticipatory Vehicular Communication Systems

The provision of reliable connectivity is envisioned as a key enabler for future autonomous driving. Anticipatory communication techniques have been proposed for proactively considering the properties of the highly dynamic radio channel within the communication systems themselves. Since real world experiments are highly time-consuming and lack a controllable environment, performance evaluations and parameter studies for novel anticipatory vehicular communication systems are typically carried out based on network simulations. However, due to the required simplifications and the wide range of unknown parameters (e.g., Mobile Network Operator (MNO)-specific configurations of the network infrastructure), the achieved results often differ significantly from the behavior in real world evaluations. In this paper, we present Data-driven Network Simulation (DDNS) as a novel data-driven approach for analyzing and optimizing anticipatory vehicular communication systems. Different machine learning models are combined to achieve a close-to-reality representation of the analyzed system's behavior. In a proof of concept evaluation focusing on opportunistic vehicular data transfer, the proposed method is validated against field measurements and system-level network simulation. In contrast to the latter, DDNS not only generates results massively faster, it also represents the real world behavior significantly better because the machine learning approach implicitly considers cross-layer dependencies.


I. INTRODUCTION
Within the approaching transition phase from human-driven cars to fully autonomous traffic systems [1], guaranteeing reliable and efficient communication is of crucial importance for enabling mutual coordination between the traffic participants as well as for optimizing the Intelligent Transportation System (ITS)-based traffic flow by using the vehicles themselves as mobile sensors. In order to provide seamless connectivity and avoid link failures proactively, future communication technologies will rely on short- and mid-term predictions of the radio channel quality and meaningful end-to-end indicators. Context-aware and anticipatory [2] mobile networking principles such as opportunistic channel access [3] and dynamic Radio Access Technology (RAT) selection [4] have been demonstrated to significantly improve the end-to-end Quality of Service (QoS) of challenging data links. In order to fulfill the requirements of upcoming 5G networks for Ultra Reliable Low Latency Communications (URLLC), Massive Machine-type Communications (mMTC), and Enhanced Mobile Broadband (eMBB), these methods need
In this paper, we present DDNS as a novel approach for simulating the end-to-end behavior of vehicular communication networks. Through application of a data-driven approach and a combination of multiple machine learning models, the proposed method combines a level of accuracy close to that of real world evaluations with the computational efficiency of analytical modeling and the environment control of classical network simulation. Fig. 1 shows a comparison of the modeling complexity between the proposed DDNS and classical Discrete Event Simulation (DES) (the architecture models are inspired by the implementations of the SimuLTE framework [7]). As the DES approach involves a large number of submodules on all logical layers, the model parameterization within the simulation setup phase is highly complex.
Moreover, many of the required parameters are either subject to simplifications or are even unknown due to confidential MNO-specific configurations. In contrast to that, the proposed DDNS method focuses on direct modeling of the end-to-end behavior. The complex interdependencies between the different components are not explicitly parameterized. Instead, they are implicitly learned solely from the data within the training phase of the machine learning models.
This manuscript extends and brings together groundwork for data-driven network simulation [8], data rate prediction [9], and anticipatory data transmission in vehicular networks [3], [10], [11]. In contrast to the previous work, we consider additional experiments and further machine learning methods, and provide an extended theoretical discussion. Furthermore, all evaluations are performed in uplink and downlink transmission direction, whereas the previous work focused only on the uplink performance. The contributions provided by this paper are summarized as follows:
• Presentation of Data-driven Network Simulation (DDNS) as a novel performance analysis method for evaluating and optimizing the end-to-end behavior of anticipatory vehicular communication systems.
• Comparison of different machine learning approaches for client-based online data rate prediction in vehicular Long Term Evolution (LTE) networks.
• Validation against field measurements and comparison to classical system-level network simulation in a proof of concept study focusing on opportunistic vehicular data transfer.
• All raw results and the developed applications are provided as open source.
The remainder of the paper is structured as follows. After discussing relevant related research in Sec. II, we introduce methodological aspects in Sec. III. Afterwards, we present the machine learning-based solution approach for data rate prediction in vehicular multi-MNO networks in Sec. IV, which is a key component of the DDNS method proposed in Sec. V. For the validation of the proposed approach, we consider a case study focusing on opportunistic vehicular data transfer in Sec. VI. Finally, we summarize the key properties and the limitations of the DDNS method in Sec. VII.

II. RELATED WORK
Methods for network performance analysis: Due to the complex interdependencies of mobility and communication, analysis and development of next generation Connected and Automated Vehicles (CAVs) and ITSs require the joint consideration of both domains [12], [13]. System-level network simulation has become the main evaluation method for vehicular communication systems. However, the analysis carried out in [6] shows that a high number of publications rely on overly simplistic parameter assumptions. Although a lot of effort is spent on making these simulations more realistic [14], the underlying issues are often only shifted to a different domain. As a popular example, ray tracing-based analysis [15] theoretically makes it possible to obtain detailed insights into the radio propagation characteristics within well-defined scenarios. However, the required environment data (highly detailed maps with obstacle shape and material information) is often not available. In addition, increasing the level of detail within those simulations inherently increases the computation time and therefore limits their applicability for large-scale evaluations.
Machine learning: The application of machine learning methods offers new potentials for modeling and analyzing mobile wireless communication systems. While analytical models fail to consider the complex interdependencies between the considered variables in highly dynamic environments, those impacts can be implicitly learned by machine learning-based models. Giordani et al. [16] even envision future 6G networks to bring intelligence to every terminal in the network. A general summary about machine learning methods and their application fields within wireless communication networks is provided by [17]. In addition, Ye et al. [18] and Liang et al. [19] present summaries with a deeper focus on vehicular networks. Recently, the idea of learning the end-to-end behavior of communication systems has received great attention within the wireless communications community [20]. First approaches, which focus on learning the physical layer behavior, have been proposed by Ye et al. [21], Dörner et al. [22], and Aoudia et al. [23]. By interpreting the communication system as an autoencoder, the behavior can be learned in a supervised manner based on Stochastic Gradient Descent (SGD) without requiring channel models for the physical layer interactions. The work presented in this manuscript can be regarded as a logical continuation of the emerged research field. In contrast to the state-of-the-art work, we focus on learning the behavior at the application layer, which is subject to additional interdependencies on the different layers of the protocol stack.
Anticipatory communication: In previous work, we have explored network quality-aware channel access [24] and have demonstrated the considerable potential of using data rate prediction for optimizing the resource efficiency of delay-tolerant vehicular data transmissions [10], [3]. Client-based data rate prediction within mobile cellular networks is a highly challenging task, as the resulting end-to-end throughput is influenced by various external and internal factors. In addition to mobility-related effects, which impact the channel coherence time, cross-layer dependencies (e.g., the slow start mechanism of Transmission Control Protocol (TCP)) have great influence on the observed end-to-end behavior [25]. Active prediction methods monitor the data rates of ongoing data transmissions with time series-based analysis methods. As an example, Throughput prediction based on LSTM (TRUST) [26] brings together mobility pattern identification with TCP data rate prediction based on Long Short-term Memory (LSTM) methods. In contrast to that, passive approaches only rely on measurable network quality indicators without introducing additional traffic themselves. In this paper, we focus on the passive measurement technique due to its wider acceptance within the research community, its better resource efficiency, and its inherent capability of making predictions in an ad hoc manner. The authors of [27] analyze online data rate prediction based on a large data set for two different MNOs in a highway scenario. Similar to Samba et al. [28], the highest prediction accuracy is achieved with a Random Forest (RF) regression model. However, the resulting prediction accuracy is relatively low, as the end-to-end prediction is solely based on network context indicators and does not consider features that are related to the cross-layer dependencies within the protocol stack of the User Equipment (UE).
Similar studies are carried out by the authors of [29], who compare the performance of the machine learning models Artificial Neural Network (ANN), Logistic Regression (LR), Gaussian Process Regression (GPR), and RF. They conclude that these classic machine learning models, with GPR and RF achieving the highest accuracies, yield excellent prediction results, which can be utilized by the MNO to optimize its network processes.
Maintaining network quality data: While the mobile UE is able to measure the network quality indicators at its current location itself, it has to rely on estimation methods for forecasting those indicators at future locations. For this purpose, connectivity maps [30], [31] can serve as a data-driven method for maintaining geospatially aggregated network quality information. In [3], connectivity maps are jointly used with mobility prediction in order to schedule the time of vehicular sensor data transmissions with respect to the expected network quality on the future route. Although it is possible to use and maintain these databases in a completely decentralized way, as people often drive the same routes regularly, data freshness and the grade of covered areas can be significantly increased through exploitation of crowdsensing approaches [32]. In order to increase the overall knowledge base by using potentially heterogeneous data from different sources, correlation-based feature mapping [33] can be applied. As an alternative to purely measurement-based approaches, the acquired data can be exploited to optimize the parameterization of radio propagation models. The latter are then exploited to estimate the network quality at unobserved locations. In [34]

III. METHODOLOGY
In this section, the general DDNS approach is introduced and the methodological aspects of the performance evaluation of the proposed method are described.

A. Problem Definition and High-level Approach Description
The overall goal of the proposed data-driven approach is to mimic the network behavior of a concrete real world scenario. For this purpose, DDNS relies on replaying previously acquired context traces (e.g., the measured network context indicators a vehicle has encountered on its trajectory) which are utilized to analyze the end-to-end performance of a novel anticipatory communication method based on machine learning.
The logical information flow is illustrated in the overall system architecture model in Fig. 2.
• Prediction model generation: In contrast to system-level network simulations, which model actual communicating entities including their protocol stacks, the proposed DDNS method relies on machine learning-based analysis of the end-to-end behavior. Supervised learning is applied to derive a deterministic prediction model, which allows forecasting the behavior of the considered end-to-end indicator based on the provided context traces. In this work, we focus on data rate prediction in vehicular LTE networks. Since the resulting accuracy of the prediction model is crucial for the achievable simulation accuracy, this aspect is analyzed in detail in Sec. IV.
• Derivation model generation: If a data rate prediction model is applied in the real world, the actually achieved measurement provides an immediately accessible ground truth for assessing the prediction accuracy. As the defined goal of the DDNS approach is to mimic the behavior of the real world network, the model imperfections need to be taken into account within the simulations. However, since replaying the passive context traces implies performing data rate prediction on unlabeled data, a ground truth is missing. To address this issue, a virtual measurement is derived within the DDNS by sampling from the error distribution of the real world measurements. For this purpose, a second machine learning model is applied to transform the prediction model from the deterministic to the probabilistic domain. This process is further described in Sec. V.
• Performance evaluation: Finally, the performance evaluation is performed by applying the novel method to the replayed passive context measurements. The resulting end-to-end behavior is simulated based on the generated machine learning models. Sec. VI illustrates the proposed methodological approach with a case study focusing on opportunistic data transmission in vehicular networks.

B. Data Acquisition
For the later training of the machine learning models, a comprehensive data set is obtained by performing real world measurements in the public LTE networks of the three German MNOs. During the drive tests, a TCP-based data transmission with a random payload size from the set 0.1, 0.5, 1, ..., 10 MB is performed every 10 s in the uplink and in the downlink transmission direction. Furthermore, passive measurements of network quality indicators are acquired continuously. The data rate measurement is handled at a remote server. All raw measurements can be accessed via [35]. The data transmissions are performed using multiple Android-based UEs (Samsung Galaxy S5 Neo, Model SM-G903F), which execute the developed measurement application. The real world drive tests are carried out in multiple scenarios, which differ with respect to the velocity range and the building density: campus (3 km), urban (3 km), suburban (9 km), and highway (14 km). Each track is driven ten times. In total, 12938 transmissions (58.45 GB of transmitted data) are performed on a total driven distance of 287 km.

C. Data Analysis
The machine learning-based data analysis is carried out with Waikato Environment for Knowledge Analysis (WEKA) [36] and LIBSVM [37]. In order to automatically generate online prediction models as C++ code from the abstract WEKA results, we created a dedicated interface application, which is part of the supplied software package. If not stated otherwise, all presented data analysis results are 10-fold cross validated.
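As an illustration of the 10-fold cross validation mentioned above, the following minimal sketch shows how a data set can be partitioned into folds. It is not the WEKA implementation; the function name `k_fold_indices` and the round-robin fold assignment are assumptions made for this example only.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Hypothetical helper for illustration; WEKA's own fold generation
    may differ in shuffling and stratification details.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps the fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx
```

Each sample is used exactly once for testing and k-1 times for training, so the reported accuracy is an average over k models that never see their own test data.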

IV. CLIENT-BASED DATA RATE PREDICTION
This section discusses the prediction of the end-to-end data rate in uplink and downlink direction in multi-MNO networks. The availability of reliable prediction models is one of the foundations of the proposed DDNS approach, which is further discussed in Sec. V.
Predicting end-to-end performance indicators is a regression task, where a model $f$ is trained to learn the relationship between a feature set $X$ and a labeled data set $Y$. After the training phase, the model can be utilized to make predictions $\tilde{Y} = f(X)$ on unlabeled data.
The overall architecture model of the machine learning-based data rate prediction process, which is conducted in this paper, is illustrated in Fig. 3. In the following evaluations, the feature set $X$ is composed of nine features from different logical context domains:
• The application context consists of the payload size of the data packets, which are transmitted via TCP.
• The channel context is formed by the passive LTE network quality indicators Reference Signal Received Power (RSRP), Reference Signal Received Quality (RSRQ), Signal-to-interference-plus-noise Ratio (SINR), Channel Quality Indicator (CQI), Timing Advance (TA), and the carrier frequency of the serving evolved Node B (eNB).
• The mobility context is represented by the vehicle's velocity and the current cell id.
During the training phase, the resulting data rate of the active transmissions is utilized as the labeled data set $Y$. The actual regression task is performed by multiple machine learning models, which were tuned in a preparatory step.
• Artificial Neural Network (ANN) [38], where a deep neural network with two hidden layers (10 and 5 neurons) showed the highest prediction accuracy. Learning rate η = 0.1 and momentum α = 0.001 were optimized based on an evolutionary algorithm.
• Classification And Regression Tree (CART)-based models: Random Forest (RF) [39], which consists of 100 random trees of maximum depth 20, and M5 Regression Tree (M5) [40].
• Support Vector Machine (SVM) with Radial Basis Function (RBF) kernel [41] trained with Sequential Minimal Optimization (SMO) regression.
For completeness, it is remarked that other regression models such as k-Nearest Neighbors (KNN) and LR were also considered during the initial model exploration phase. However, as those approaches did not reach a performance level comparable to the other, more widely used data rate prediction models, they were excluded from the deeper evaluations. The interested reader is referred to [42], [29].
As a statistical metric for the model performance and for allowing a comparison to related work (e.g., [28], [27]), which considers the same performance indicator, the coefficient of determination $R^2$ is analyzed. It is calculated as
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \tilde{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
with $\tilde{y}_i$ being the current prediction, $\bar{y}$ the mean of the measurements, and $y_i$ the current measurement. The $R^2$ describes the proportion of the response variable variation that is explained by the derived regression model.
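The coefficient of determination can be computed directly from paired measurements and predictions. The following sketch assumes the standard definition (one minus the ratio of residual to total sum of squares); the function name `r_squared` is chosen for this example.

```python
def r_squared(measured, predicted):
    """Coefficient of determination for paired measurements and predictions."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(measured) / len(measured)
    # Residual sum of squares: squared error between measurement and prediction.
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    # Total sum of squares: variation of the measurements around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for constant measurements")
    return 1.0 - ss_res / ss_tot
```

A perfect predictor yields 1, while a model that always outputs the mean of the measurements yields 0.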

A. Comparison of Different Prediction Models and Training Data Granularities
In the first evaluation, the overall training data set is split into various subsets in order to find the most suitable data aggregation granularity. The trade-off is between using a larger amount of training data (e.g., a single global data set per MNO) and focusing more closely on infrastructure-specific aspects, which would imply utilizing many local data sets. In addition, it is analyzed which regression model achieves the highest prediction accuracy and will be utilized in the further evaluation phases.
For both transmission directions, all regression models are trained on all data subsets, which are composed as follows:
Overall, it can be seen that the highest prediction accuracy is achieved with the CART-based models RF and M5, which is confirmed by the findings of related performance evaluations [28], [27]. As pointed out in the analysis of [10], in many cases, a single network quality indicator has a dominant impact on the resulting data rate under well-defined conditions. While the SINR is an important indicator within the cell center region, the RSRQ has a major impact on the considered end-to-end indicator at the cell edge. The regions themselves can be estimated with the RSRP, which depends on the distance to the serving eNB. Since the CART models provide a scope-wise feature hierarchy within their model structure, they are able to represent these conditions in their native model architecture.
In addition to the achieved accuracy, a great advantage of the CART-based models is that they can be implemented in a highly resource-efficient way using simple if/else statements. Within the online application of the trained models, the execution time for making predictions is nearly negligible. On the considered Android platform, the average online execution time per single prediction is ∼ 0.1 ms for the trained RF. The training of the 10-fold cross validation is performed in less than a minute. Although the RF achieves the highest prediction accuracy, it is remarkable that the much simpler M5 is often only slightly less accurate. As an example for uplink prediction of MNO A, the trained RF consists of 120533 leaves, which contain numerical values. The trained M5 only consists of 11 leaves, which contain linear regression models. The lightweight model size of the M5 can be exploited for enabling the usage of machine learning even on highly resource-constrained systems (e.g., microcontrollers).
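To illustrate why a trained tree maps to plain if/else statements, the following sketch shows a tiny hand-written tree with linear models in some leaves, in the spirit of an M5 tree. All thresholds, coefficients, and the function name `predict_data_rate` are invented for illustration and are not taken from the trained models of this paper.

```python
def predict_data_rate(rsrp_dbm, rsrq_db, sinr_db, payload_mb):
    """Toy regression tree returning a data rate in MBit/s.

    Hypothetical structure: the RSRP selects the region, the SINR is
    evaluated in the assumed cell center, the RSRQ at the assumed cell edge.
    """
    if rsrp_dbm > -90.0:  # assumed cell center region
        if sinr_db > 15.0:
            return 20.0 + 0.5 * payload_mb  # leaf with a linear model
        return 10.0 + 0.5 * payload_mb
    if rsrq_db > -12.0:  # assumed cell edge region
        return 5.0 + 0.2 * payload_mb
    return 2.0  # leaf with a constant value
```

A prediction only requires a handful of comparisons and at most one multiply-add, which is why such models remain usable on very constrained hardware.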
As the analysis shows, in most cases, the considered regression models benefit more from using a higher amount of training data than from increasing the grade of locality. Based on the obtained results, the following evaluations focus on a deeper analysis of the RF regression model with the global data sets for each MNO and transmission direction.

B. Behavior Analysis of the Random Forest Data Rate Prediction Model
The resulting prediction performance of the RF models of each MNO in uplink and downlink direction is visualized in Fig. 4. It can be seen that the behavior depends strongly on the MNO and its provided coverage within each scenario. For MNO A, the values are spread homogeneously for all scenarios. In contrast to that, MNO B and MNO C have focus regions, where a distinct level of performance is provided (e.g., MNO C only provides the highest performance in the urban scenario). Overall, the highest spread of the prediction error can be observed in the highway scenario. Due to the high velocity range of up to 150 km/h, the channel coherence time is low and handovers occur frequently. Apart from MNO A, which achieves a similar performance in both transmission directions, it can also be observed that the operators prioritize uplink and downlink performance differently. MNO B is the only operator that provides downlink Carrier Aggregation (CA). Therefore, the value range of the downlink measurements is significantly larger than for the other MNOs.
Fig. 5(a) and Fig. 5(e) show the resulting $R^2$ for cross-MNO data rate prediction in uplink and downlink transmission direction. It can be observed that the learned models are only able to provide significant results for the network of the MNO they were trained on. It can be concluded that the measurable context indicators have to be considered jointly with the non-measurable MNO-specific configurations, which are implicitly learned as hidden features.
Fig. 5(b)-(d) and Fig. 5(f)-(h) show the MNO-specific cross-scenario prediction performance. For each MNO, an RF model is trained on the data subset of each scenario and tested against the other scenarios. For MNO A, the campus and urban subsets achieve very good generalization for all test sets. However, the data subsets for the highway and the suburban scenarios do not generalize well. Considering Fig. 4(a) and Fig. 4(d), it can be seen that the error spread is significantly higher for those two scenarios than for the others. Therefore, prediction artifacts, which arise from the low channel coherence time in the challenging environments, limit the cross-scenario prediction accuracy. In contrast to that, the other subsets are more successful in learning the general relationship between context indicators and resulting data rate. In addition, the LTE cells in the campus and urban subsets are more crowded than in the suburban and highway subsets. Therefore, if only the latter scenarios are considered, the machine learning model fails to learn the interdependency between cell load (observed through measurements of the RSRQ) and data rate for high load scenarios within congested cells. For MNO B and MNO C, the cross-scenario generalization is low, as the network performance itself is highly scenario-dependent (see Fig. 4). Moreover, LTE coverage is not always guaranteed, e.g., MNO C suffers from poor LTE coverage (76.25 %) in the campus scenario.
The results emphasize that meaningful data sets should be composed of data from different heterogeneous scenarios in order to achieve good generalization. However, it is not reasonable to handle the different scenarios with scenario-specific prediction models. In all cases, the global MNO data sets achieve a higher mean $R^2$ than the overall average $R^2$ of all individual scenarios.

C. Impact of Individual Features
For assessing the impact of individual features on the resulting prediction accuracy, the relative Mean Decrease Impurity (MDI) [43] is computed for the different RFs. The results of the evaluations are shown in Fig. 6.
It can be seen that the feature importance depends on the MNO. It is influenced by the unknown resource scheduling policy and the unknown configurations of the hardware components of the network infrastructure itself. While the carrier frequency has a dominant impact on the uplink prediction accuracy for MNO B and MNO C, the feature is less important for MNO A. As the overlaid distribution of the observed carrier frequencies shows, the UE is mostly connected to 1800 MHz cells in the network of MNO A. For the other MNOs, the carrier frequencies are distributed more diversely. In the downlink direction, the importance of the carrier frequency is significantly reduced for MNO B and MNO C. While it is possible that the eNBs employ different scheduling policies for uplink and downlink, another explanation is the traffic pattern of the cell users. As the downlink resources are more often subject to resource competition [2], it is plausible that the radio propagation-related impact is less significant than the resource allocation process. For MNO A, the feature importance is symmetrical for uplink and downlink.
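The impurity-based importance measure used for this analysis can be sketched for a single regression tree as follows. The example assumes variance as the impurity measure and sums the sample-weighted impurity decrease of every split per feature; for a forest, these values would additionally be averaged over all trees. The function names and the toy splits are hypothetical.

```python
from collections import defaultdict


def variance(values):
    """Population variance, used here as the regression impurity."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(parent, left, right, n_total):
    """Sample-weighted variance reduction achieved by one split."""
    n = len(parent)
    children = (len(left) * variance(left) + len(right) * variance(right)) / n
    return (n / n_total) * (variance(parent) - children)


def relative_mdi(splits, n_total):
    """Sum the decreases per feature and normalize them to a total of 1.

    `splits` is a list of (feature, parent_labels, left_labels, right_labels).
    """
    totals = defaultdict(float)
    for feature, parent, left, right in splits:
        totals[feature] += impurity_decrease(parent, left, right, n_total)
    norm = sum(totals.values())
    return {feature: value / norm for feature, value in totals.items()}


# Toy tree: the root splits on "sinr", its left child splits on "payload".
toy_splits = [
    ("sinr", [1.0, 2.0, 9.0, 10.0], [1.0, 2.0], [9.0, 10.0]),
    ("payload", [1.0, 2.0], [1.0], [2.0]),
]
importance = relative_mdi(toy_splits, n_total=4)
```

In this toy tree, the split on `sinr` removes almost all label variance, so it receives nearly the entire relative importance.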
In comparison to related work [28], [27], the achieved overall prediction accuracy is significantly higher. While the mentioned approaches only consider the network context features for the prediction, other dominant influences such as the payload size are not considered. The achievable average data rate of a transmission is directly related to the payload size as the latter has a strong impact on the resulting transmission time and the behavior of the TCP slow start mechanism. In the vehicular context, the UE moves during the transmission process, which results in a low channel coherence time. While larger payload sizes are beneficial from a transport layer perspective [44], higher transmission durations increase the probability of significant changes of the channel quality during active transmissions. However, these complex cross-layer interdependencies are implicitly considered by the applied machine learning-based approach.
For completeness, it is remarked that the integration of additional features (e.g., time of day) was analyzed in a pre-evaluation step. As their consideration did not increase the resulting prediction accuracy, they were removed from the feature set. This behavior can be explained by their correlation with features that are already contained. As an example, the time of day can be used as an indicator for the load dynamics of the LTE network [2], but similar information is provided by the RSRQ, which is already contained in the feature set.
Within upcoming 5G networks, the Network Data Analytics Function (NWDAF) [45] of the core network will act as a machine learning-based method for estimating the load level of network slices. Although similar analyses can already be performed by the UEs using passive control channel analysis [46], providing the NWDAF information itself to the cell users could greatly improve client-side data rate prediction and would therefore significantly contribute to catalyzing anticipatory mobile networking techniques. ...

D. Exploiting Crowdsensing Data For Network Quality Prediction
The presented prediction methods rely on immediate measurements of different context indicators, which allows data rate predictions to be derived only for the current vehicle location. However, state-of-the-art anticipatory communication techniques are able to significantly benefit from exploiting knowledge about the network quality along the expected future trajectory (e.g., for opportunistic data transfer [3], which is applied for the DDNS validation in Sec. VI-A).
Since the vehicle itself is not able to measure the network quality at the future locations, it has to rely on previously obtained spatially aggregated data, which can be provided by crowdsensing-based connectivity maps. Fig. 7 shows an excerpt of the derived multi-MNO connectivity map for the urban evaluation track. The connectivity map is organized into three logical layers. The lowest layer consists of previous measurements of the individual features of the prediction scheme. Each cell of the connectivity map contains the aggregated information of measurements, which were performed in the same cell during previous drive tests or by other network participants. For a defined cell size $c$ and a given position prediction $\tilde{P}(t + \tau)$, the cell key $k$ is computed from $\tilde{P}(t + \tau)$ and $c$ and utilized to access the context information $C$ from the connectivity map. The prediction layers maintain the prediction results of the considered end-to-end indicators for each MNO and are based on the feature layer information.
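A minimal sketch of the feature layer lookup described above is given below. It assumes planar coordinates in meters and a cell key obtained by floor-dividing each coordinate by the cell size; this quantization rule, the mean aggregation, and all function names are assumptions for illustration and may differ from the key computation used in this work.

```python
import math


def cell_key(position_m, cell_size_m):
    """Map a planar (x, y) position to a grid cell key (assumed quantization)."""
    x, y = position_m
    return (math.floor(x / cell_size_m), math.floor(y / cell_size_m))


def add_measurement(cmap, position_m, cell_size_m, context):
    """Store one context record (dict of indicators) in its grid cell."""
    cmap.setdefault(cell_key(position_m, cell_size_m), []).append(context)


def lookup_context(cmap, predicted_position_m, cell_size_m):
    """Return the per-indicator mean of a cell, or None if it is unobserved."""
    records = cmap.get(cell_key(predicted_position_m, cell_size_m))
    if not records:
        return None
    return {
        name: sum(record[name] for record in records) / len(records)
        for name in records[0]
    }
```

The returned aggregate can then serve as the feature input of the data rate prediction model for a location the vehicle has not reached yet.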
On the highest layer, the prediction results are exploited by anticipatory networking techniques. In the considered example shown in Fig. 7, the availability of multiple MNOs is exploited for data rate-aware interface selection. Apart from enabling context-predictive networking methods, the usage of connectivity maps for maintaining the feature information has additional advantages. First, it allows the measurement platform to be separated from the application platform. Although not all UE types and operating systems are able to provide the same network quality indicators [44], anticipatory networking methods can still exploit this information if it has been measured by other UEs and is maintained by a connectivity map. Second, it enables the usage of synthetic mobility traces [47] for evaluating the method under analysis at unobserved locations.
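Data rate-aware interface selection of this kind can be sketched as choosing the operator with the highest predicted data rate. The function name `select_mno` and the convention of using `None` for an operator without a prediction (e.g., no LTE coverage) are assumptions for this example.

```python
def select_mno(predicted_rates_mbit_s):
    """Return the MNO with the highest predicted data rate, or None.

    `predicted_rates_mbit_s` maps an MNO name to its predicted data rate;
    entries with value None are treated as unavailable.
    """
    available = {
        mno: rate
        for mno, rate in predicted_rates_mbit_s.items()
        if rate is not None
    }
    if not available:
        return None
    return max(available, key=available.get)
```

For instance, `select_mno({"A": 12.0, "B": 18.5, "C": None})` selects operator B.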

V. DATA-DRIVEN SIMULATION OF END-TO-END NETWORK PERFORMANCE INDICATORS
The deterministic data rate prediction model is now extended by a method to consider model imperfections within the simulations in order to achieve an accurate representation of the real world behavior. Based on the analysis of the previous section, we draw the following conclusions:
• In the vast majority of the evaluations, the RF regression model achieves the most accurate prediction performance. Therefore, RF is utilized for performing the data rate predictions within the simulation.
• It is more reasonable to use only a few models with large data sets than a large number of highly specialized prediction models (e.g., a single model for each eNB). Therefore, the global data sets for each MNO and transmission direction are used as the training data for the prediction model.
In order to derive a probabilistic description of the deviations between ground truth and prediction model, a Bayesian machine learning model is applied to the resulting transmission profile of the prediction model. For this purpose, we utilize a Gaussian Process Regression (GPR) [48] model, as it inherently provides favorable statistical properties, which are explained and exploited in the following paragraphs.
In the first step, the prediction results Ỹ_RF of the RF model are used as training data for the GPR model f_GPR to derive a predicted data set Ỹ_GPR such that Ỹ_GPR = f_GPR(Ỹ_RF). Fig. 8 (a) shows an example of the resulting behavior of the GPR model based on the overall uplink data set of MNO A. For the predicted values Ỹ_RF, the actual real world measurements are centered around Ỹ_GPR with a certain value spread. The latter describes the deviations from the real world behavior and is related to effects that are not covered by the prediction model. However, the confidence area of the GPR allows drawing error-aware samples for each given value of Ỹ_RF, which follow the distribution of the real world measurements. Assuming a Gaussian distribution N of the prediction errors, a sample ỹ_GPR can be drawn from a normal distribution that is centered at f_GPR(Ỹ_RF) and whose standard deviation is given by the standard deviation function σ_GPR evaluated at Ỹ_RF. For the considered data set, it can be seen that the prediction confidence is reduced for Ỹ_RF < 3 MBit/s and Ỹ_RF > 33.5 MBit/s, which describes the edge regions of the training set. Due to the probabilistic properties of the sampling process, it is possible that sample values exceed the value range of the observed measurement values or are even assigned impossible values (e.g., negative data rates). Therefore, a final filtering step is applied in order to compensate for these statistical effects, which yields the corrected sample value ŷ.

An example application of this method within DDNS is shown in Fig. 8 (b). In the anticipation phase, the vehicle predicts the currently achievable data rate Ỹ_RF based on the passive context indicators. As a ground truth is missing in the data-driven simulation, a virtual measurement ŷ is derived by sampling from the confidence area of the predicted value. For all considered MNOs, all uplink measurements were re-generated with the proposed mechanism by simulatively replaying the transmissions at their actual measurement locations under the measured network conditions. Fig. 9 shows the resulting distribution of DDNS-synthesized data rate values. In comparison to the real world measurements (see Fig. 4 (a)-(c)), it can be seen that the process is able to provide a close to reality representation of the data rate distributions, which captures the MNO-specific as well as the scenario-specific characteristics.

Fig. 9. Synthesized transmission profiles based on the DDNS method by replaying the real world transmissions using RF-based data rate prediction and GPR-based deviation modeling (uplink transmission direction). In consideration of the real world measurements in the same scenarios (see Fig. 4), it can be seen that DDNS achieves a close to reality representation of the characteristics of all MNOs.
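The sampling and filtering steps described in this section can be sketched as follows. The sketch is an illustration under assumptions: `gpr_mean` and `gpr_std` are hypothetical callables that stand in for the trained GPR model and its standard deviation function, and the filter is assumed to clip the sample to the observed value range, which is only one plausible way to realize the filtering step.

```python
import random


def draw_virtual_measurement(y_rf, gpr_mean, gpr_std, y_min, y_max, rng=random):
    """Draw one error-aware data rate sample for an RF prediction y_rf.

    gpr_mean and gpr_std are placeholders for the trained GPR model
    (hypothetical interface). y_min and y_max bound the observed data rates.
    """
    # Gaussian sample around the GPR mean with the GPR standard deviation.
    sample = rng.gauss(gpr_mean(y_rf), gpr_std(y_rf))
    # Assumed filter: keep the sample inside the observed value range,
    # which also rules out negative data rates if y_min >= 0.
    return min(max(sample, y_min), y_max)


# Toy stand-ins: identity mean, larger uncertainty at the edge regions
# of the training set (thresholds taken from the text, spread values invented).
toy_mean = lambda y: y
toy_std = lambda y: 4.0 if (y < 3.0 or y > 33.5) else 1.5

rng = random.Random(42)
virtual = [draw_virtual_measurement(20.0, toy_mean, toy_std, 0.0, 40.0, rng)
           for _ in range(5)]
```

Passing a seeded `random.Random` instance keeps individual simulation runs reproducible, which fits evaluations that are repeated with different random seeds.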

VI. VALIDATION
In order to validate the proposed DDNS method, a case study focusing on opportunistic vehicular data transfer is carried out with real world field tests serving as a ground truth. As a further reference for the performance of the proposed DDNS method, we consider classical system-level network simulation, which is based on DES. Within the simulative evaluations, both approaches replay the trajectories of the real world measurements of the highway and the suburban scenario. The ultimate goal is to mimic the real world behavior of the analyzed anticipatory communication method within the simulation setup. The following evaluations show the results of additional validation experiments, for which the measurement data is not contained in the training sets of the machine learning methods.
It is remarked that the proposed DDNS mechanism can be exploited for accelerating the development process of novel anticipatory networking methods by applying a method-in-the-loop approach. Within this work, the same C++ implementation code is used for the real world application and the DDNS variant. The only required differences are the context inputs (actual measurements in the real world, trace data in DDNS) and the data transmissions (TCP access in the real world, machine learning-based prediction for DDNS). Achieving a similar level of code reusability is often not possible with established network simulators, as the latter enforce the usage of simulator-specific modules and interfaces.
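The method-in-the-loop idea can be illustrated with a small sketch in which the transfer method only depends on two exchangeable inputs: a context source and a channel. The sketch is written in Python for brevity, whereas the actual implementation is in C++; all names (`run_method`, `Context`, `Channel`) and the payload generation rate are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# A context source yields (time_s, metric) pairs. A channel maps
# (payload_mb, metric) to the achieved data rate in MBit/s.
Context = Iterable[Tuple[float, float]]
Channel = Callable[[float, float], float]


def run_method(context: Context, channel: Channel,
               should_send: Callable[[float, float], bool],
               mb_per_second: float = 0.05) -> List[float]:
    """Run one transfer method; only context and channel differ per setup."""
    rates, last_tx = [], 0.0
    for t, metric in context:
        if should_send(metric, t - last_tx):
            # Buffered payload grows with the time since the last transmission.
            payload_mb = (t - last_tx) * mb_per_second
            rates.append(channel(payload_mb, metric))
            last_tx = t
    return rates


# DDNS-style setup: replayed trace data and a stand-in for the
# machine learning-based data rate prediction.
trace = [(1.0, 5.0), (2.0, 25.0), (3.0, 8.0), (4.0, 30.0)]
predicted_channel = lambda payload_mb, metric: 0.5 * metric
rates = run_method(trace, predicted_channel, lambda metric, dt: metric > 20.0)
```

In a real world setup, the same `run_method` would receive live measurements as context and a function that performs an actual TCP transfer as channel, while the decision logic stays unchanged.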

A. Anticipatory Communication Methods for Opportunistic Data Transfer
In the following, the anticipatory communication methods that are used for the validation are introduced. It is remarked that these models have been published in earlier work and are only applied here. Within this manuscript, the focus of the scientific evaluations is on the achievable accuracy of the simulation approaches and not on the performance of the transmission methods.
Within typical vehicular Machine-type Communication (MTC) systems, the radio channel is accessed in a periodic way, e.g., sensor data is acquired and transmitted to a remote server with a fixed transmission interval. Since this approach does not take the current network quality into account, many transmissions are performed during low radio channel quality periods and are subject to undesired effects such as packet loss. Due to the low resulting transmission efficiency and the need for retransmissions, cell resources and energy are wasted.
In contrast to the periodic transmission approach, the considered anticipatory communication methods Channel-aware Transmission (CAT) and Machine Learning CAT (ML-CAT) [10] access the channel in an opportunistic way based on a probabilistic process. The schemes exploit the dynamics of the network channel by delaying the transmission until sufficient radio channel conditions are established. Acquired sensor data is buffered locally until a transmission decision is made for the whole buffer. Due to the introduced buffering delay, the method is intended for delay-tolerant applications (e.g., vehicle-as-a-sensor) and does not satisfy the latency requirements of safety-critical vehicular communications.
With predictive CAT (pCAT) and Machine Learning pCAT (ML-pCAT) [3], the general opportunistic transmission schemes are extended by a predictive component, which introduces a prediction horizon τ for forecasting the radio channel quality at the future location P̃(t + τ). The latter is obtained using trajectory-aware mobility prediction and is exploited for obtaining the context data from a connectivity map.
The different CAT variants can be configured to perform the transmission scheduling decision with respect to different metrics (e.g., SINR and predicted data rate). In the first step, the measured metric value Φ(t) is transformed to a normed metric value Θ(t) with

$$\Theta(t) = \frac{\Phi(t) - \Phi_{\min}}{\Phi_{\max} - \Phi_{\min}}$$

in order to allow the application of the basic CAT principles with metrics that have different value ranges [Φ_min, Φ_max]. The transmission probability p_TX(t) is then computed as

$$p_{\mathrm{TX}}(t) = \begin{cases} 0 & \Delta t \le t_{\min} \\ 1 & \Delta t > t_{\max} \\ \Theta(t)^{\alpha \cdot z} & \text{else} \end{cases} \tag{6}$$

with α being an exponent that describes how strongly the scheme should prefer high metric values and ∆t being the time that has passed since the last transmission. t_min is used to guarantee a minimum payload size and t_max defines an upper bound for the buffering delay. z is a pCAT-exclusive factor, which is responsible for taking the trade-off between the current measurement Φ(t) and the anticipated future network quality Φ̃(t + τ) into account. It is computed from ∆Φ(t) = Φ̃(t + τ) − Φ(t) and a prediction weighting factor γ. The probabilistic transmission decision process itself is triggered periodically (1 Hz in the following evaluations).
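The probabilistic decision process can be sketched as follows. The sketch assumes a min-max normalization of the metric and a transmission probability of zero until t_min has passed. The factor z is passed in as a plain parameter, and its default of 1 is an assumption that leaves the exponent at α for the non-predictive variants. The clamping of Θ(t) to [0, 1] is an added safeguard, and all parameter values in the usage lines are invented for illustration and are not the values of Tab. II or Tab. III.

```python
import random


def normalized_metric(phi, phi_min, phi_max):
    """Map a metric value to [0, 1] relative to [phi_min, phi_max]."""
    theta = (phi - phi_min) / (phi_max - phi_min)
    return min(max(theta, 0.0), 1.0)  # clamping is an added safeguard


def transmission_probability(phi, dt, phi_min, phi_max, t_min, t_max,
                             alpha, z=1.0):
    """Transmission probability for the time dt since the last transmission."""
    if dt > t_max:
        return 1.0  # timeout: enforce the transmission
    if dt <= t_min:
        return 0.0  # assumed case: wait until a minimum payload is buffered
    return normalized_metric(phi, phi_min, phi_max) ** (alpha * z)


def decide(phi, dt, rng=random, **params):
    """One probabilistic transmission decision (triggered periodically)."""
    return rng.random() < transmission_probability(phi, dt, **params)


# Illustrative parameters (hypothetical values): SINR metric in dB.
params = dict(phi_min=0.0, phi_max=30.0, t_min=10.0, t_max=120.0, alpha=8.0)
p_mid = transmission_probability(15.0, 60.0, **params)   # 0.5 ** 8
p_good = transmission_probability(28.0, 60.0, **params)  # much closer to 1
```

With a large exponent α, the probability stays small for average channel conditions and rises steeply towards Φ_max, so that most transmissions are performed during connectivity hotspots or after the timeout t_max.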

B. Reference Setup for System-level Network Simulation
As a reference for the methodological evaluation, a classical system-level network simulation approach based on DES is applied with Objective Modular Network Testbed in C++ (OMNeT++) 5.0 [49], INET 3.4 and SimuLTE v0.9.1 [7]. The provided example scenario test_handover is taken as a starting point for our own extensions. As pointed out in Sec. I, multiple simplifications are required for transforming the real world scenario into a system-level simulation setup:
• Code extension: SimuLTE uses a single carrier frequency definition for all eNBs within a scenario. Therefore, the simulator implementation was extended to support individual carrier frequencies for each eNB according to their corresponding real world values.
In the following result analysis, the SimuLTE evaluations will be referred to as DES. Tab. II summarizes the overall parameterization of the transmission schemes and the DES configurations.

C. DDNS-based Parameter Optimization
Since DDNS evaluations can be performed in a highly resource-efficient way (see Sec. VI-E), even large-scale parameter studies that employ brute-force analysis over the whole parameter space can be executed. In order to find the best parameterizations for the considered anticipatory communication methods for each MNO and transmission direction, the impact of Φ_max on the average resulting data rate and buffering delay is analyzed in Fig. 10. As extremely high metric values (e.g., SINR > 50 dB) do not occur in the real world data set, the transmission schemes converge for large values of Φ_max: their behavior is then determined by the maximum buffering delay t_max, which enforces the transmissions after the timeout is exceeded.
Every parameter configuration is evaluated based on 20 mobility traces and each evaluation is repeated with 25 different random seeds. In total, for each MNO, every transmission scheme is analyzed in 50000 different evaluation runs. Performing the same number of evaluations in the real world is practically impossible, as it would require analyzing the data transfer schemes over a total driven distance of more than 1.15 million km during more than 950 whole days. Classical system-level network simulation would take more than 2600 days with four computation cores (estimated based on the findings in Sec. VI-E). In contrast, the DDNS approach requires less than three hours to finish on the considered evaluation system.
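The stated figures can be related to each other with a short back-of-the-envelope calculation. The derived quantities below are inferences from the numbers in the text and are not reported results: the number of configurations assumes that the 50000 runs are split evenly over traces and seeds, and the speedup is a lower bound on the wall-clock ratio that ignores possible differences in the number of cores used.

```python
traces = 20
seeds = 25
total_runs = 50_000

# Runs needed to evaluate a single parameter configuration.
runs_per_configuration = traces * seeds
# Implied number of parameter configurations per scheme and MNO.
configurations = total_runs // runs_per_configuration

des_hours_lower_bound = 2600 * 24  # "more than 2600 days"
ddns_hours_upper_bound = 3         # "less than three hours"
speedup_lower_bound = des_hours_lower_bound / ddns_hours_upper_bound

print(configurations, speedup_lower_bound)  # prints "100 20800.0"
```

Under these assumptions, the parameter study covers about 100 configurations per transmission scheme and MNO, and DDNS finishes at least four orders of magnitude faster than the estimated DES run time.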
Tab. III shows the resulting MNO-specific parameterization of the different transmission schemes, which is based on a trade-off between data rate and buffering delay. Note that the units for Φ_max differ between the transmission schemes, as CAT and pCAT perform their decisions with respect to the measured SINR, while ML-CAT and ML-pCAT consider the predicted data rate of the RF model. For all schemes, Φ_min is configured as the zero value of the corresponding unit. As a reference, periodic data transfer with a fixed interval of 10 s is considered.

D. Resulting Modeling Accuracy
Finally, the resulting modeling accuracy is investigated for system-level network simulation and the proposed DDNS. Fig. 11 shows the resulting end-to-end data rate values for the different transmission schemes and MNOs in uplink and downlink direction. Within the real world evaluation, several characteristics of the different CAT variants can be observed:
• The periodic transmission scheme provides the lower baseline for the achievable data rate, as the transmissions are performed unaware of the network channel conditions.
• The SINR-based CAT variants are able to increase the resulting data rate significantly.
• With the introduction of machine learning-based channel quality assessment (ML-CAT), the average data rate is massively increased.
• By using context prediction (pCAT and ML-pCAT), an additional slight improvement is achieved.
For DDNS, the achievable modeling accuracy is directly related to the prediction accuracy of the applied regression models (see Fig. 4). Therefore, MNO A achieves a significantly more realistic representation of the real world behavior for the uplink than for the downlink. As ML-CAT utilizes data rate prediction within the transmission scheme itself, it is subject to the accumulated error of the DDNS mechanism and the prediction error of the Φ_RF metric within the CAT mechanism itself. ML-pCAT is furthermore impacted by the prediction error for the anticipated data rate at the future location P̃(t + τ). However, in the vast majority of all evaluations, the aggregated prediction errors have a lower impact on the results than the parameter uncertainties of the DES.
A general observation for the DES results is that the different MNOs behave very similarly. As the MNO-specific configurations are unknown, the MNOs only differ with respect to the eNB positions and the applied carrier frequencies. However, in the real world and in the DDNS evaluations, different behavioral characteristics of the MNOs can be observed. Although MNO A achieves the highest uplink data rates for all transmission schemes in the real world, it has the lowest throughput in the event-based simulation. Due to the high prediction accuracy in the uplink for MNO A, ML-pCAT is able to unleash its full potential in the real world and in the DDNS evaluation, where it achieves an average data rate gain of ∼ 14 MBit/s. However, this effect is not captured by the DES due to the applied simplified regression model and the missing TPC. Instead, MNO A behaves similarly to the other MNOs in the DES. MNO B achieves the highest mean data rate in the downlink by applying CA within some of the cells. As this feature is not explicitly modeled within the SimuLTE framework, the observed behavior differs significantly from the real world. In contrast to that, the proposed DDNS approach is able to implicitly learn the impacts of CA on the considered Key Performance Indicator (KPI) directly from the measurement data.
It can be seen that the DES fails to mirror the real world behavior of the pCAT transmission scheme. Due to the context prediction step, pCAT is highly sensitive to the SINR dynamics. In the DES, the network dynamics differ from the real world due to the fixed eNB transmission power of 43 dBm. In the real world, eNB position and transmission power optimization are the results of a complex network planning phase, which is performed with respect to the radio environment. In contrast to that, the proposed DDNS does not require definitions or value assumptions for the eNB parameters; it simply learns the implications of these hidden variables on the considered end-to-end KPI.
In order to assess the overall similarity between real world and simulation, a similarity measure is computed for each transmission scheme as the correlation coefficient of the ECDFs of the real world measurement results and the corresponding simulation results. Fig. 12 summarizes the average behavior of DDNS and DES. It can be seen that the proposed DDNS achieves a significantly higher modeling accuracy than the DES method in all considered cases.
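One possible way to compute such an ECDF-based similarity measure is sketched below. The evaluation of both ECDFs on a shared, evenly spaced grid and the use of the Pearson correlation coefficient are assumptions, since the text does not specify how the ECDFs are aligned. The function names are hypothetical.

```python
import bisect
import math


def ecdf(samples, grid):
    """Empirical CDF of samples, evaluated at each point of grid."""
    ordered = sorted(samples)
    return [bisect.bisect_right(ordered, x) / len(ordered) for x in grid]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ecdf_similarity(real, simulated, points=200):
    """Correlation coefficient of two ECDFs on a shared, evenly spaced grid."""
    lo = min(min(real), min(simulated))
    hi = max(max(real), max(simulated))
    grid = [lo + (hi - lo) * i / (points - 1) for i in range(points)]
    return pearson(ecdf(real, grid), ecdf(simulated, grid))
```

Identical sample sets yield a similarity of 1, while a simulation whose data rates are shifted away from the measured ones yields a clearly lower value. The sketch does not handle the degenerate case of a constant ECDF on the grid (e.g., when all samples are equal), for which the correlation coefficient is undefined.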

E. Computational Efficiency
In addition to the achievable modeling accuracy of the obtained results, the computational efficiency of the simulation setup itself is of great importance for the system optimization phase. Fig. 13 shows the aggregated resulting computation time per run for the different evaluation methods for the considered transmission schemes. It can be seen that the proposed DDNS is multiple orders of magnitude faster than the DES approach. Although the application-level end-to-end behavior of the data transfer method is investigated, the DES spends most of its computation resources on simulating processes that are only indirectly related to the considered KPI. As an example, within the SimuLTE setup, neighboring eNBs are interconnected via X2 interfaces in order to coordinate the cellular handover mechanisms, and this coordination is completely simulated during the evaluations. In consequence, the event-based network simulation does not scale well when the number of eNBs is increased. In contrast, the proposed DDNS allows deriving results with a very high computational efficiency, as the machine learning-based modeling focuses on the end-to-end behavior itself and treats the intermediate modules as a black box.

VII. LIMITATIONS OF DATA-DRIVEN NETWORK SIMULATION
Although the previous evaluations have pointed out numerous advantages of using the DDNS method for analyzing end-to-end network performance indicators, it needs to be remarked that the proposed method has a defined application range with specific limitations.
• Dependency on the prediction model: Since the acquired real world data provides the foundation for the evaluation scenario and the prediction models, the significance of the DDNS results strongly depends on the quality and the amount of the data (see Sec. IV-B). Due to the focus on analyzing end-to-end indicators in a data-driven way, the considered features need to be carefully chosen in the data acquisition phase. In contrast to system-level network simulation, it is mostly not possible to alter the analyzed KPI without performing additional measurements and model trainings.
• Scenario-oriented analysis: Replaying real world context traces allows analyzing the performance of new data transfer methods under close to reality network conditions. However, the results are only significant for the considered evaluation scenarios and the existing configurations of the network infrastructure. Although this limits the generalizability of the achieved results, it needs to be remarked that system-level network simulators are confronted with the same issues. In addition, the latter are further impacted by simulator-specific feature deviations (e.g., models implemented in DES A might be missing in DES B), which limit the significance of cross-simulator performance comparisons [6]. Open data sets serving as reference scenarios could make a significant contribution to improving the generalizability of the DDNS approach. This way, a novel method could be evaluated using a wide range of different MNO- and scenario-specific impact factors.
• Black box approach: Although the applied black box approach enables very fast result generation, the implied encapsulation does not allow inspecting the behavior of the intermediate layers. Therefore, DDNS is mainly intended to be used as a powerful method for the system optimization phase, when the most important features and indicators have already been explored.
However, for analyzing the behavior of the lower layer protocols, existing end-to-end models for these layers (e.g., [21], [22], [23]) can be applied in a similar way. A possible future extension might be a hierarchical DDNS setup, where the prediction models of the upper layers leverage the results of the lower layer prediction models as additional features.

VIII. CONCLUSION
In this paper, we presented Data-driven Network Simulation (DDNS) as a novel methodological approach for analyzing anticipatory vehicular communication systems. The proposed method exploits machine learning-based prediction models and crowdsensing-enabled data acquisition for achieving close to reality modeling of end-to-end network performance indicators.
While classic DES-based system-level network simulation suffers from a high scenario generation complexity due to a large number of parameter uncertainties, DDNS is able to learn their hidden interdependencies implicitly and solely from real world measurement data. The statistics of the deviations between prediction model and real world behavior can be learned by a dedicated machine learning model in order to consider their implications as Gaussian noise within the simulative evaluation phase. Applying DDNS to model the behavior of cellular communication systems requires training individual models for each MNO. Although machine learning-based data rate prediction is able to consider the effects of cross-layer dependencies, the resulting end-to-end behavior depends significantly on unknown MNO-specific configurations (e.g., the resource scheduling mechanisms).
As shown in the proof-of-concept validation focusing on anticipatory vehicular data transmission, the proposed DDNS method is able to achieve more realistic end-to-end results with a significantly higher computational efficiency than the reference system-level network simulation setup.
In future work, we want to further exploit crowdsensing-based data maintenance for keeping the simulation data consistent with the real world. By introducing online learning capabilities in the regression phase, an up-to-date digital twin of the real world network could be achieved, which would be able to autonomously learn and consider new technological developments (similar to CA as discussed in Sec. VI-D).