Machine Learning for QoS Prediction in Vehicular Communication: Challenges and Solution Approaches

As cellular networks evolve towards the 6th generation, machine learning is seen as a key enabling technology to improve the capabilities of the network. Machine learning provides a methodology for predictive systems, which can make networks become proactive. This proactive behavior of the network can be leveraged to sustain, for example, a specific quality of service requirement. With predictive quality of service, a wide variety of new use cases, both safety- and entertainment-related, are emerging, especially in the automotive sector. Therefore, in this work, we consider maximum throughput prediction enhancing, for example, streaming or high-definition mapping applications. We discuss the entire machine learning workflow highlighting less regarded aspects such as the detailed sampling procedures, the in-depth analysis of the dataset characteristics, the effects of splits in the provided results, and the data availability. Reliable machine learning models need to face a lot of challenges during their lifecycle. We highlight how confidence can be built on machine learning technologies by better understanding the underlying characteristics of the collected data. We discuss feature engineering and the effects of different splits for the training processes, showcasing that random splits might overestimate performance by more than twofold. Moreover, we investigate diverse sets of input features, where network information proved to be most effective, cutting the error by half. Part of our contribution is the validation of multiple machine learning models within diverse scenarios. We also use explainable AI to show that machine learning can learn underlying principles of wireless networks without being explicitly programmed. Our data is collected from a deployed network that was under full control of the measurement team and covered different vehicular scenarios and radio environments.


I. INTRODUCTION
Machine learning (ML) has strong potential to overcome challenges arising in vehicular communication and networking, as presented in [1,2]. Examples of such challenges are resource allocation [3,4] and quality of service (QoS) prediction [5]. QoS prediction, in turn, is an enabler for a variety of use cases in the domain of vehicular communication, such as autonomous driving, platooning, cooperative maneuvering, tele-operated driving, and smart navigation. Some of these are presented in [6,7]. Predictions can not only enhance existing use cases and support new ones, but also be a stepping stone for a network to become proactive. Such predictions can be an integral part of a network that acts proactively to sustain, for example, a specific key performance indicator (KPI) target, like the maximum achievable throughput under high network load.
In this work, we focus on the prediction of QoS as a way of looking into the typical workflow for building and applying machine learning (ML) algorithms. We start with data acquisition, moving on to data analysis and statistical characterization of the data, proceeding to feature engineering and different split strategies, and concluding by testing the performance of the model. Predicting QoS is a relatively complex task since it depends on several time-varying factors. It is especially complex in the case of vehicular communication where the radio conditions might change very drastically in a short amount of time [8].
Capturing datasets that reveal the complex interdependencies between network operations, radio environment and terminal behavior is a challenging task in itself. The captured data might often lack the required quality for ML-based prediction, rendering it inappropriate. There are multiple factors that can reduce the quality of a collected dataset for training ML models, with a few examples being dataset imbalance, coverage of radio regions with bounded dynamics or limited network states gathered. Moreover, acquiring different types of data for a prediction task comes with varying acquisition costs that are typically not discussed extensively. From the theoretical perspective of statistical learning, a training dataset should often be independent and identically distributed (i.i.d.) and drawn from the same probability distribution as the test dataset [9,10]. Radio terminals frequently experience drastic shifts in the data distribution [11] that usually degrade the performance of ML models [12].
arXiv:2302.11966v2 [cs.NI] 22 Aug 2023
Meaningful comparisons between prior art proposals, though desirable, can be hard. There are multiple reasons for this, for example, the pre-processing steps are not well reported, or the used performance metrics differ. Additionally, literature in the area of applying ML for wireless communications often reports the performance of the models without giving an in-depth analysis of the dataset's underlying characteristics. This raises the question of how robust such models are to changes in different environments, where the dataset's statistical properties might be different.
Our results emerge from the data analysis of a dedicated measurement campaign with the goal of capturing several characteristics of the radio environment and enabling an in-depth discussion of ML workflows. The code for this analysis is based on customary functions from common Python libraries for scientific computing (e.g., numpy, pandas, scikit-learn). Our contributions include the following:
• We study the prediction of maximum throughput, which can enhance, for example, streaming or high-definition mapping applications.
• We discuss the entire machine learning workflow, highlighting less regarded but important aspects, such as the detailed sampling procedures, the in-depth analysis of the dataset characteristics, and the effects of splits on performance. For example, we test the data stationarity assumption and discuss methods to handle its violations in commercial networks. Many proposals in the literature do not discuss violations of the theoretical assumptions that many ML models rely on.
• We put the characteristics of wireless environments in the foreground, explaining what might hinder and what might benefit the adoption of ML approaches for new-generation wireless networks.
• We further showcase why the lack of this type of information, which is often missing from the literature, can generate overconfidence in proposed ML solutions.
• We investigate the effects of diverse sets of input features, specifically those that are not typically available in the prior art.
• We showcase that consumer devices and low sampling rates can be enablers for the adoption of ML in commercial networks.
• We also use explainable AI to demonstrate that machine learning can learn underlying principles of wireless networks without being explicitly programmed.
• Our data is collected from a deployed network, which avoids the limitations of theoretical studies or simulation-based results.
The remainder of the paper is structured as follows: Section II presents related work in the area of predictive quality of service (pQoS), and the measurement campaign is described in Section III. In Section IV, we discuss various radio environment properties in relation to ML workflows, such as vehicle speed and stationarity. We continue the analysis of ML workflows in Section V with a careful look into train/test data splitting and feature engineering. In Section VI, we apply a set of ML algorithms to predict the maximum achievable throughput for diverse prediction models, communication directions, feature sets, splitting strategies, and prediction horizons. In Section VII, we look at the topic of explainable AI (XAI) [13], showcasing what the models have learned in the absence of explicitly programmed rule sets. Finally, we draw conclusions in Section VIII.

II. RELATED WORK
The application of ML for QoS prediction in vehicular networks has drawn attention recently. In the following, we present an overview of related works. Measurement campaigns that focus on the feasibility of teleoperated driving (ToD) are described in [14,15]. The authors of [14] also include a sensitivity analysis in their work and examine the impact of factors such as speed and distance to a serving cell. When applying gathered data from measurements to QoS prediction in the current literature, a sampling interval in the order of seconds is often used (e.g., [16,17]). This does not reflect the fast-changing wireless channel associated with high-mobility vehicular communication scenarios [18], and there is an open question as to which degree sampling schemes need to be adjusted based on the vehicle speed. Some works already have compared measurement applications running on slow-sampling consumer equipment (CE) with dedicated measurement equipment (DME) analysis [19,20] and propagation model tuning [21]. As vehicular communication is characterized by high mobility of the terminal, it is an open question whether CE's low-cost transceiver chains might render collected data unsuitable for applying ML.
Moreover, it is known that the radio environment has dynamics that can drastically change its statistical properties over a distance of a few meters [22,23]. Such changes in the statistical properties of the data stream, where the ML model needs to reformulate its decision boundaries, are discussed in the literature as concept drift [24][25][26]. The detection and analysis of concept drift using state-of-the-art algorithms is explained in [27]. The management of concept drift has been researched extensively, with approaches including ensemble learning [28], concept drift-aware federated averaging [29], and model updating mechanisms [30].
In [31] and [32] the authors present a measurement campaign and evaluate several ML models regarding their performance for downlink (DL) throughput prediction. One limitation is that the authors measure in a public network and thus cannot consider features such as the cell load. The authors of [33] consider the movement of the user and build a two-staged ML model to predict the transmission control protocol (TCP) throughput. In the first step, the movement pattern is identified. Based on this, a long short-term memory (LSTM), trained with corresponding data, is selected for the throughput prediction. Some preliminary results for uplink (UL) and DL throughput prediction using Random Forest and employing the dataset used for this paper are presented in [34]. QoS prediction in a 5G non-standalone network is examined in [35]. Besides throughput, ML is also used to predict latency [5,36,37] and handovers [38]. Recently, several datasets intended for ML-based studies have been made publicly available [39,40] that typically do not provide cell-related information. Moreover, those works focus on introducing the datasets instead of applying ML methods.
An inherent limitation of ML algorithms, especially as they get more complex, is that their decision-making is difficult to understand, rendering them into black boxes. Recently, the field of explainable AI (XAI) [13] has gained traction, to ensure ML models perform as desired, not only in test environments but also in target applications. With XAI, models can be judged in the context of human domain knowledge, e.g. by investigating the learned feature importance.
We extend prior art by providing detailed insights into the aspects mentioned above. Our dedicated measurement campaign, on a fully controlled private network, allowed us to perform a deep analysis and enable an in-depth discussion. We could control which devices could connect to the network, we could adjust the total network interference, and generate diverse traffic dynamics. We consider different radio environments, device types, sampling frequencies, measurement scenarios, and prediction horizons. We present the complete workflow towards building a pQoS model, predicting the maximum achievable throughput. We discuss several design decisions, such as the train-test split, various feature groups, and different models, and show the influence on the achieved performance. In the end, we discuss the issue of trustworthiness.

III. MEASUREMENT CAMPAIGN
Within the project AI4Mobile, an extensive measurement campaign was performed in the 5G-ConnectedMobility test field (see footnote 1). In total, more than 3000 km were driven by four vehicles over one week to perform measurements in the long term evolution (LTE) network. A detailed description of the measurement campaign is available in [41]. Since we had control over the network infrastructure, we were able to measure and collect data from all parts of the network. We were also able to define custom scenarios and set parameters accordingly. Subsequently, we briefly describe the setup and highlight some scenarios.

A. Measurement Setup
During the measurement campaign, four vehicles were equipped with several user equipments (UEs). In each vehicle, an identical DME was used for measurements with high granularity. The DME is a Linux-based PC including a cellular modem connected to a 2x2 MIMO car antenna, which is placed on the roof of the respective car. Additionally, a global positioning system (GPS) receiver was placed on the roof and connected to the DME, providing accurate location and time synchronization via a pulse per second (PPS) signal. To capture the data from the cellular modem, we used the application MobileInsight [42].

Footnote 1: https://www.ericsson.com/en/cases/2020/5g-connectedmobility

Fig. 1: The map of the areas where the measurements took place. The A9 highway is highlighted in red, while the rural and side street areas are marked in green (not all rural streets are plotted to reduce clutter). The blue area is the suburban city of Feucht.
Each vehicle was also equipped with several CE phones as pure data generators and measurement devices with a low sampling frequency for a comparison to the DME's measurements. The CE uses the application G-Net Track Pro [43] to capture radio measurements via the Android API. All devices used the application Iperf [44] to exchange data with a local server connected to the network. In addition to the UEs (CE and DME), we also collected data from the base stations and the core network, which were also time-synchronized. An overview of all captured data is given in Table I.
The network is deployed in the south of Germany, close to Nuremberg. It consists of 10 frequency-division duplex (FDD)-LTE cells with 10 MHz of bandwidth at a carrier frequency of 700 MHz, with typical radio tower panel antennas (horizontal half-power beam width of 65°). Since it is a test network, our devices were the only devices connected to the network. In Fig. 1, the area of the test field is shown, including the position of the distinct cells. While the cells were positioned to provide excellent connectivity along the highway and in the suburban city of Feucht, part of the rural area next to the highway was also covered, which allowed us to capture several radio environments.

Figure caption (fragment): All clocks are synchronized and all generated data is marked accordingly and captured at separate entities. Figure adapted from [41].

B. The Conducted Studies
The purpose of the measurements was to thoroughly examine numerous parameters through a variety of real-time scenarios in our captured dataset. This involved analyzing different driving patterns at different speeds, stationary measurements, recording different radio environments, and generating varied data traffic to create contrasting load scenarios. Furthermore, we utilized various protocols and generated traffic in both the UL and DL to the server and to other vehicles. Additional information on the parameters captured in the measurement campaign can be found in [41].

C. Data Preprocessing
For this study, only user datagram protocol (UDP) measurements from the dataset were considered, since the protocol properties of TCP and UDP differ. With UDP, the end-to-end (E2E) throughput could not be limited by the transport layer. We used only the data for which the devices were requesting maximum throughput.
Moreover, we filtered out stationary measurements, i.e., when vehicles were parked for some time. The stationary measurements contain similar radio properties for the majority of the samples, which could result in overly optimistic prediction results.

IV. PROPERTIES OF THE RADIO ENVIRONMENT
In this section, we study properties of the radio environment that might influence ML algorithms. Radio environments have very diverse characteristics that depend on multiple factors. Some of these factors are environment-based, for example, radio propagation can be drastically different in a rural area and in a crowded city center [45], as the type and density of buildings influence the radio propagation. Another factor is, for example, the specific network deployment (e.g. antenna heights, frequency layers and the number of cells in an area). All these factors contribute to very unique properties in the collected data as UEs move. We first discuss the data collection procedures and how these might affect the quality of the collected dataset. We continue looking at the statistical properties of the radio environment. As many ML algorithms rely on multiple statistical assumptions [46], we try to highlight some of those, as they might need to be better understood by the research community that applies ML on wireless networks.

A. Sampling the Radio Environment
Short sampling intervals (i.e., high sampling rates) of the radio environment come at a cost, whether in terms of acquiring more capable hardware, higher power consumption, or signaling, as more data need to be transferred. We specifically study the effects of different sampling intervals on the characteristics of the collected dataset. We also study the effects of the vehicle speed on these characteristics to see whether higher velocities call for shorter sampling intervals in order to sustain specific quality characteristics in the collected dataset. We focus on LTE signal values for the analysis, i.e., reference signal received power (RSRP) and received signal strength indicator (RSSI), since these are most affected by the vehicle's speed and, in turn, influence the target throughput, as we show in Section VI and in Section VII.
To quantify the sensitivity of the collected measurements to these two parameters, we introduce the resampling error (RE) at the k-th time instance in Eq. (1), where x_o(k) represents the original signal of length K at a sampling interval of 10 ms for the DME and 1 s for the CE. The corresponding resampled value for a larger sampling interval is x_rs(k). To determine x_rs(k), the following procedure is applied. First, the original dataset is downsampled by averaging to a target rate to obtain x_ds(k′) in Eq. (2), where M is the downsampling ratio for a given pair of original and target sampling intervals, e.g., M = 10 for a target sampling interval of 10 s when considering the CE's original sampling interval of 1 s. Second, the downsampled dataset is upsampled by forward filling, which can be expressed by integer division of the original index k by M in Eq. (3). Finally, Equations (1), (2) and (3) can be combined into Eq. (4). For each sample at the k-th time instance, the vehicle speed v(k) is also known, which allows us to explore the relation between the mobility of the terminal and the resulting RE.
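The displayed equations referenced in this subsection are not reproduced in this text. A reconstruction that is consistent with the verbal description (block averaging with ratio M, forward filling through integer division of k by M) could read as follows. The sign convention of Eq. (1) and the zero-based indexing are assumptions and are not taken from the source.

```latex
\begin{align}
\mathrm{RE}(k) &= x_{o}(k) - x_{rs}(k), \tag{1}\\
x_{ds}(k') &= \frac{1}{M}\sum_{m=0}^{M-1} x_{o}(Mk' + m), \tag{2}\\
x_{rs}(k) &= x_{ds}\!\left(\left\lfloor k/M \right\rfloor\right), \tag{3}\\
\mathrm{RE}(k) &= x_{o}(k) - \frac{1}{M}\sum_{m=0}^{M-1} x_{o}\!\left(M\left\lfloor k/M \right\rfloor + m\right). \tag{4}
\end{align}
```

Under this reading, every sample inside a block of M consecutive original samples is compared against the mean of that block, so the RE is zero whenever the signal is constant within the block.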
Therefore, the speed range from 0 to 140 km/h is split into intervals of 5 km/h. For each interval with a lower speed boundary v′, the mean absolute resampling error RE_v′ is calculated according to Eq. (5).
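The resampling and binning procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code: the function names, the toy RSRP and speed values, the signed error definition, and the handling of a trailing partial block are all assumptions.

```python
from collections import defaultdict


def resample_forward_fill(x, m):
    """Downsample x by block averaging (ratio m), then upsample by forward filling."""
    # Assumption: a trailing partial block is averaged over the samples it contains.
    block_means = []
    for start in range(0, len(x), m):
        block = x[start:start + m]
        block_means.append(sum(block) / len(block))
    # Forward fill: sample k takes the mean of block k // m.
    return [block_means[k // m] for k in range(len(x))]


def resampling_error(x, m):
    """Signed resampling error per sample (sign convention is an assumption)."""
    resampled = resample_forward_fill(x, m)
    return [orig - res for orig, res in zip(x, resampled)]


def mean_abs_re_by_speed(errors, speed_kmh, bin_width=5.0):
    """Mean absolute RE per speed bin, keyed by the lower bin boundary in km/h."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for err, v in zip(errors, speed_kmh):
        lower = int(v // bin_width) * bin_width
        sums[lower] += abs(err)
        counts[lower] += 1
    return {lower: sums[lower] / counts[lower] for lower in sorted(sums)}


# Toy data (hypothetical values, in dBm and km/h).
rsrp = [-90.0, -92.0, -94.0, -96.0, -80.0, -80.0]
speed = [0.0, 3.0, 12.0, 13.0, 51.0, 52.0]

re = resampling_error(rsrp, m=2)
per_bin = mean_abs_re_by_speed(re, speed)
```

In this toy case the last block is constant, so its error is zero, while the two blocks with a 2 dB change inside them each contribute an absolute error of 1 dB.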
The results are shown for RSRP and RSSI and measurements from the DME and CE in Fig. 3. The shaded areas indicate the central half of the distribution of the errors.
Focusing on the effect of different sampling intervals for a DME dataset (Figs. 3a and 3b), one can see that the RE increases with a larger sampling interval as expected. Interestingly, this increase seems to be below 2 dB for a sampling interval of 1 s and between 2 and 3 dB for a sampling interval of 10 s, which might still be acceptable for certain ML applications.
The mobility of the terminal seems to have some effect on the quality of the collected dataset. When the terminal is stationary, the RE is close to 0. Both RSRP and RSSI are characterized by a steep rise of the error once the terminal starts to move at an approximate speed of 10 km/h. Above that speed, the error remains stable. Of particular interest is a sampling interval of 1 s, as it approximately shows the error introduced by employing a CE instead of a DME, which collects data at a sampling interval of 10 ms. The mean RE for RSRP and RSSI is well below 2 dB, which suggests that CE samples may also enable an accurate QoS prediction. Only the sampling interval of 10 s shows larger variations across different speeds. Here we notice that the variance of the error increases drastically for the larger sampling interval.
We also executed the same experiments with a dataset from the less granular CE. We obtained results similar to those for the DME dataset, as shown in Figs. 3c and 3d, i.e., a rise of the error in the beginning, followed by a saturation of the RE. We emphasize that a direct comparison of the resulting RE between the two different devices should be avoided, as the original sampling intervals of the CE are larger and thus the number of samples that are considered for each downsampling interval is smaller. This, combined with the simpler transceiver chain of the CE, is a potential explanation for why these results are less smooth.
The results indicate that the vehicle speed does not have a noticeable impact on the absolute error for different sampling intervals, except for vehicles moving at very low speeds. The observed effects can possibly be attributed to the Doppler effect. However, no channel measurements were carried out during the measurement campaign and thus the causes cannot be conclusively assessed on the basis of our dataset. The relatively constant error for slow sampling at different speeds can further simplify the sampling procedures without the need to introduce adaptive sampling schemes. Also, lower sampling rates can still provide useful input to ML models, as the resampling error remains low. This type of result is necessary for optimizing data collection procedures for future ML applications. In the following sections, we use downsampled data from the DME with a sampling interval of 1 s, which also enables a comparison with the CE. The resulting dataset sizes for DL and UL are 18846 and 19535 samples, respectively.

B. Data Stationarity
Many theoretical results in ML literature are based on the assumption that the available dataset is i.i.d. [9,10,12]. This assumption can be violated in several ways, for instance, if i) samples are correlated in time (i.e., they are not independent), or ii) the underlying time series is non-stationary (i.e., not identically distributed).
There are both parametric and non-parametric hypothesis tests available in the literature to study time stationarity. Parametric tests are often more powerful, but one needs to ensure that the assumptions they require (i.e., a specific well-known underlying distribution) are met by the tested data. We have opted to use the parametric augmented Dickey-Fuller (ADF) test [47], whose null hypothesis is that the tested time series has a unit root, while the alternative hypothesis is stationarity. In other words, a low p-value (for our analysis we have considered p < 0.05) indicates strong evidence for stationarity in the data. As a parametric test, ADF requires a suitable choice of the maximum lag of the assumed autoregressive model. Since the lag order is unknown for our data, we estimate it according to the Akaike information criterion (AIC) [47].
We have run the ADF test for all physical layer (PHY) features from all measurements (c.f. Table I, first row) in an accumulated manner. That is, for every analyzed time series of size T , we construct T sub-series each one consisting of all samples from 0 to t ∈ {0, 1, . . . , T − 1}. We then run the ADF test against every sub-series and concatenate the output p-values into a vector of size T . Fig. 4 shows the resulting accumulated ADF test as a color code on top of some time series.
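The accumulated procedure can be sketched independently of the specific hypothesis test. The snippet below is an illustration under assumptions: `accumulated_pvalues`, `first_index_below`, the `min_len` guard, and the stand-in `toy_pvalue` function are hypothetical names introduced here and are not part of the original analysis.

```python
def accumulated_pvalues(series, pvalue_fn, min_len=10):
    """Apply pvalue_fn to the growing prefixes series[:t + 1] for every t.

    Prefixes shorter than min_len are skipped (None), since a regression-based
    test needs a minimum number of samples. This guard is an assumption.
    """
    pvalues = []
    for t in range(len(series)):
        prefix = series[: t + 1]
        pvalues.append(pvalue_fn(prefix) if len(prefix) >= min_len else None)
    return pvalues


def first_index_below(pvalues, alpha=0.05):
    """Index of the first prefix whose p-value drops below alpha, or None."""
    for t, p in enumerate(pvalues):
        if p is not None and p < alpha:
            return t
    return None


def toy_pvalue(x):
    """Stand-in for a real stationarity test: shrinks with the sample size."""
    return 1.0 / len(x)


series = [float(i % 7) for i in range(30)]
pvals = accumulated_pvalues(series, toy_pvalue, min_len=3)
first_hit = first_index_below(pvals)
```

To run the actual ADF test with AIC-based lag selection, one option is to pass `lambda x: adfuller(x, autolag="AIC")[1]` as `pvalue_fn`, with `adfuller` imported from `statsmodels.tsa.stattools`. Whether the original analysis used that library is not stated in the text.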
The expected behavior is that the accumulated ADF test tends to p ≪ 0.05 as the size of the evaluated time sub-series increases. The number of samples needed for this depends on some statistics, such as the variance (cf. Fig. 4a & 4b), but the corresponding measurement time rarely exceeds 15 minutes for our data.
This time stationarity assumption is violated for measurements with an appreciable scenario shift, e.g., those where vehicles remain idle for some time and drive away afterward (Fig. 4c). This shows that stationarity cannot always be assumed, even though it might hold true for some datasets under a long-enough measurement time. For the measurement scenarios, we opted for a duration of 40 to 60 minutes while driving to alleviate the problem of non-stationarities.
In summary, we acknowledge the presence of non-stationary portions in our data and its possible negative impact on the performance of ML algorithms.

C. Radio Environment Correlations
Additionally, we look at an important characteristic of the radio environment, which is the correlation in time and space [48,49]. In Fig. 5, we show an example of the autocorrelation function of the signal-to-interference-plus-noise ratio (SINR) for three distinct radio environments. We see that the radio environment tends to be highly correlated in time for all three environments. Somewhat higher correlations are noted for the rural environment, with the highway having the weakest of the three. This is probably explained by the faster movement of the vehicles. In [50], the authors discuss that if correlations are not handled properly, there is a higher risk of reporting overly optimistic results. The reason is that the ML model does not clearly learn the underlying relations between input and output, and instead learns the existing dependencies of a dataset, which in our case are the reported correlations.
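For readers who want to reproduce this kind of analysis, a sample autocorrelation function can be computed as sketched below. The estimator choice (biased, normalised by the lag-0 term) and the toy input are assumptions, since the text does not state how the curves in Fig. 5 were obtained.

```python
def autocorrelation(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    dev = [value - mean for value in x]
    energy = sum(d * d for d in dev)
    if energy == 0.0:
        raise ValueError("autocorrelation is undefined for a constant series")
    acf = []
    for lag in range(max_lag + 1):
        # Sum of products of deviations that are `lag` samples apart.
        cross = sum(dev[i] * dev[i + lag] for i in range(n - lag))
        acf.append(cross / energy)
    return acf


# Toy series (hypothetical SINR-like values in dB).
sinr = [1.0, 2.0, 3.0, 4.0, 5.0]
acf = autocorrelation(sinr, max_lag=2)
```

A slowly decaying curve indicates that neighbouring samples carry largely redundant information, which is the situation in which a random train/test split can leak information between the two sets.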

V. MACHINE LEARNING FOR WIRELESS NETWORKS
In this section, we discuss important aspects that the ML engineer needs to understand before applying ML to radio data. We discuss the notion of concept drift, different strategies for splitting the data into training and test datasets, and feature engineering.

A. Concept Drift
In this section, we introduce the notion of concept drift and its detection as a tool that can mitigate some of the effects of non-stationarity. ML models traditionally need a formal and precise definition of the problem to represent the decision boundaries. However, non-stationarity in datasets implies that the decision boundaries, i.e., concepts, are more likely to require re-learning to accommodate variations in the underlying data distributions [51,52]. Such concept drifts must be detected to ensure that the ML models remain accurate. Given that a sample instance x belongs to a class ω (x ∈ ω), the Bayes posterior probability is P(ω | x) = P(x | ω) P(ω) / P(x). This indicates that drift may occur when: (i) the class definition P(ω | x) changes while P(x) remains the same, i.e., real concept drift; (ii) the input distribution P(x) changes while P(ω | x) remains the same, i.e., virtual concept drift; (iii) the prior probability of the class P(ω) changes while P(x | ω) remains the same. However, sometimes the class boundaries may change due to hidden contexts. This drift occurs due to 'insufficient, unknown, or observable features in a dataset' [53]. We aim to test how often such hidden contexts appear in real-time scenarios such as the automotive domain, where the environments change dynamically at a rapid pace. To this end, we employ the Page-Hinkley (PH) test (see Algorithm 1) to detect statistical changes in the input data stream, i.e., to detect drift in the data [54,55].
The PH estimator needs a labeled dataset, a magnitude threshold (δ), and a detection threshold (λ) as inputs. The magnitude threshold defines the degree to which noise is permitted in the dataset. For each subsequent timestep t, a single data entry is fed to the PH estimator and a cumulative error U_T is updated (Algorithm 1: Page-Hinkley Test). The estimator raises an alarm if the cumulative error U_T increases beyond the minimum cumulative error m_T by a margin determined by the detection threshold λ.
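Since Algorithm 1 is not reproduced in this text, the following sketch shows the common one-sided textbook form of the Page-Hinkley test for detecting an increase in the mean of a stream. It may differ in detail from the variant used in the study. The function name, the reset after each alarm, and the toy stream are assumptions.

```python
def page_hinkley(stream, delta, lam):
    """Return the indices at which the Page-Hinkley test raises an alarm.

    delta: magnitude threshold (tolerated deviation from the running mean).
    lam:   detection threshold on the gap between the cumulative error and
           its running minimum.
    """
    alarms = []
    count, total = 0, 0.0
    cumulative, minimum = 0.0, 0.0  # U_T and m_T in the notation of the text
    for index, value in enumerate(stream):
        count += 1
        total += value
        running_mean = total / count
        cumulative += value - running_mean - delta
        minimum = min(minimum, cumulative)
        if cumulative - minimum > lam:
            alarms.append(index)
            # Assumption: restart the statistics after each detected drift.
            count, total = 0, 0.0
            cumulative, minimum = 0.0, 0.0
    return alarms


# Toy stream (hypothetical): the mean jumps from 0 to 5 at index 50.
stream = [0.0] * 50 + [5.0] * 50
alarms = page_hinkley(stream, delta=0.1, lam=10.0)
no_drift = page_hinkley([1.0] * 100, delta=0.1, lam=10.0)
```

A larger δ makes the detector tolerate more noise before the cumulative error grows, while a larger λ delays the alarm but reduces false detections.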
Since Algorithm 1 only returns a boolean value, we build T sub-series in the same way as for the ADF test and run the PH test against all of them. In that way, we can provide an example of detecting drifts in the input dataset as the vehicle is moving through different environments. For that example, an underlying ML model has been trained on the dataset of the suburban region. The PH estimator's magnitude threshold δ is set to twice the standard deviation (2σ) of the training dataset. The dataset captured by DME 3 from Vehicle 3 in the measurement campaign is used to test for drifts. The estimator reacts to the statistical changes in the input data and raises an alarm when a drift is detected (see Table II). The number of detected drifts on the highway is higher by a factor of 3.3 than that of the suburban region. To verify the time and position of these drifts, an example from the dataset in which the vehicle moves through different regions is taken (see Fig. 6). Two distinct regions, i.e., suburban and highway, are chosen for training the PH estimator. Furthermore, the drifts are detected for different magnitude thresholds of 1σ and 2σ for the two training datasets. A change in the radio environment, when trained on the suburban dataset (see Fig. 6a), can be detected when the two different thresholds are applied. We obtained similar results when we trained on highway environments and the UE was moving between suburban, rural, and highway environments (see Fig. 6b).
The different types of concept drift are expected to deteriorate ML performance. Detecting these drifts and dealing with them as early as possible will result in a resilient and accurate ML model. One way to handle the drifts is to include in the training dataset all the possible scenarios that an ML model might face during its life cycle. As this might be hard, and even unrealistic in many cases, another approach is online training, which may be based on concept drift detection [56,57]. Hernangómez et al. [58] provide some preliminary results on this dataset using said approach.

B. Train/Test Split and Evaluation of ML Algorithms
This section provides more details on splitting strategies, and it explains how different techniques can provide different insights. If not used properly, the splitting technique might give skewed results, and we hope to explain some of their fundamental strengths and weaknesses in the following.
The train/test splits included in this study are random split, split by time, split by measurement run, and split by fold. Fig. 7 provides the reader with a graphical explanation of the different splitting strategies. Subsequently, we describe the four strategies.
Random split: Each dataset sample has a nonzero probability of being added to either the training or the test set. This often leads to two consecutive samples from the same device being added to the training and the test set, respectively. These samples, as discussed above, share similarities, explained by their correlation structure. In the coming sections, we use a split where 70 % of the data are in the training dataset and 30 % in the test dataset. Note that this split strategy assumes samples to be i.i.d., an assumption we have shown to be violated in the radio environment.
Split by time: The dataset is split into two equally sized parts based on the time domain. The first part of each measurement run (independent of the device) is added to the training set, and the second to the test set. As one measurement run consisted of driving the highway segment in two directions, this represents adding one direction of travel to the training set, and the second to the test set. Hence, the training and test sets contain very similar radio environments but are uncorrelated in time.
Split by measurement run: A measurement run (independent of the device) is considered part of either the training or the test set. The assignment and balancing of training and test set can be challenging, as the train and test datasets need to contain similar characteristics. This splitting scheme is best used when the model's generalization performance is the main focus. It is difficult to capture all dynamics (i.e., the parameter permutations described in Section III) in both training and test set. Poor performance on the test set then results from the fact that the complexity and dynamics of the region are not reflected by any subset of the measurement runs. In our dataset, we ended up with a split where about 70 % of the data are in the training dataset, and 30 % in the test dataset to ensure that training contains most of the dynamics found in the test dataset.
Split by folds: The last proposed splitting method aims to combine the split by time and random splitting. We divide the time domain into ten subsets of equal length for each measurement run and device. This approach promises a realistic prediction performance evaluation, as it negates many of the problems of random splitting, and combines advantages of the time splitting, where the models learn characteristics from the time-varying radio environment. Here, we assigned 70 % of the data to the training and 30 % to the test dataset.
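The four strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the sample layout (dictionaries with the hypothetical keys run, device, and t), the function names, and the rule that assigns seven randomly drawn folds out of ten to the training set are all assumptions made for the example.

```python
import random
from collections import defaultdict


def random_split(samples, train_frac=0.7, seed=0):
    """Shuffle all samples and cut once; neighbours in time may be separated."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def split_by_time(samples):
    """First half (in time) of every measurement run -> train, second half -> test."""
    runs = defaultdict(list)
    for s in samples:
        runs[s["run"]].append(s)
    train, test = [], []
    for run in runs.values():
        times = [s["t"] for s in run]
        midpoint = (min(times) + max(times)) / 2
        for s in run:
            (train if s["t"] <= midpoint else test).append(s)
    return train, test


def split_by_run(samples, test_runs):
    """Whole measurement runs go to either the training or the test set."""
    train = [s for s in samples if s["run"] not in test_runs]
    test = [s for s in samples if s["run"] in test_runs]
    return train, test


def split_by_folds(samples, n_folds=10, n_train_folds=7, seed=0):
    """Cut each (run, device) series into equal-length time folds and assign
    whole folds to train or test. The random fold assignment is an assumption."""
    rng = random.Random(seed)
    series = defaultdict(list)
    for s in samples:
        series[(s["run"], s["device"])].append(s)
    train, test = [], []
    for key in sorted(series):
        items = sorted(series[key], key=lambda s: s["t"])
        t0, t1 = items[0]["t"], items[-1]["t"]
        width = (t1 - t0) / n_folds or 1.0  # guard against zero-length series
        train_folds = set(rng.sample(range(n_folds), n_train_folds))
        for s in items:
            fold = min(int((s["t"] - t0) / width), n_folds - 1)
            (train if fold in train_folds else test).append(s)
    return train, test
```

Because whole folds are kept together, most consecutive samples stay on the same side of the split, while every run and device still contributes to both sets.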
To investigate the effects of the different splitting strategies in more depth, we use principal component analysis (PCA) [9], an orthogonal linear transformation that maps a dataset (e.g., the training or test set) to a new basis such that the dataset's greatest variance lies along the first principal components. It can therefore be used as a dimensionality reduction method by keeping only a subset of the components, e.g., the first two for visualization purposes. We note that PCA captures only the second-order statistics and that there are alternative methods for visualization, such as t-SNE. Fig. 8 shows the differences between the training and test sets for the four splits using PCA. The random split's train and test datasets have very similar distributions (cf. Fig. 8a). The reason is that the test set does not contain unseen propagation, data traffic, and driving scheme scenarios. Because time-series data from the radio environment seems highly correlated, the training and test datasets include many similar points. Therefore, although this seems to be the splitting scheme most used in the literature, there is a risk that the reported performance is higher than that of a deployed system. This should not discourage the use of random sampling. However, the data engineer needs to understand the inherent limitations of this splitting technique and be careful when reporting results.
The split by time alleviates to some extent the sample correlation, as consecutive samples belong to the same set (training or test). Moreover, the varying propagation environment, due to the changing direction of travel, and differences in speed, result in distributions that are less similar than for the random split (cf. Fig. 8b). Some regions, e.g., around the value five for the second principal component, are only present in the test set. The PCA can be considered a first test of how detailed collection procedures should be and highlights how time-varying components influence differences between the training and test sets. The split by time also constrains the measurement duration, as balanced training and test sets require more data than, e.g., the random split.
The split by measurement run is more appropriate to test model performance under varying and unseen parameters in the measurement procedure, the devices' behavior, and the network. In Fig. 8c, it becomes clear that all these parameters have a substantial effect on the two datasets. Perhaps this is a good way to study the generalization performance of ML models, as a training dataset will never comprise all potential parameter permutations. The ML model makes predictions on samples that can be considered different, e.g., using our PCA evaluation.
We consider the split by folds the best splitting strategy for our dataset as it provides enough variation for ML models to be tested on unseen data while alleviating some of the disadvantages of the other splitting strategies.
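The PCA-based comparison of training and test sets can be illustrated with a small, dependency-free sketch. It is an assumption-laden example rather than the procedure behind Fig. 8: the basis is fitted on the training set only and then reused for the test set, only the first principal component is computed (by power iteration), and the two-feature data points are invented for illustration.

```python
import math


def column_means(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, means):
    """Sample covariance of the mean-centered data."""
    n, d = len(rows), len(means)
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(cov, iterations=500):
    """Power iteration towards the direction of greatest variance.
    Assumes the start vector is not orthogonal to that direction."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(rows, means, component):
    """Score of each row along the component, using the training-set means."""
    return [sum((x - m) * c for x, m, c in zip(row, means, component))
            for row in rows]


# Invented toy data: the test set lies in a region the training set never visits.
train = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
test = [[10.0, 20.0], [11.0, 22.0]]

means = column_means(train)
pc1 = first_principal_component(covariance_matrix(train, means))
train_scores = project(train, means, pc1)
test_scores = project(test, means, pc1)
```

When the test scores fall outside the range of the training scores, as in this toy case, the test set contains regions of the feature space that the model has not seen during training.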

C. Feature Acquisition and Availability Analysis
We group different types of features based on the type and the measurement capability of a device into distinct feature groups, see Table III. The table defines a name for the feature group, a corresponding abbreviation that is used throughout the rest of the paper, and the parameters (features) that the feature group inherits. The feature group PHY includes features describing the radio environment, while channel conditions (CHAN) contains related features describing the channel. Feature group base station data (BS) contains parameters that are aggregated per cell.
Vehicle information such as position, speed, and distance to the serving cell is included in the feature group vehicle information (VEH). Finally, the feature group radio environment map (REM) contains statistical information of PHY features and throughput in form of a radio environment map.
Since the features are captured from several different network nodes, as previously shown in Table I, it is not realistic to assume that all features are readily available at a specific entity such as a vehicle or one of the network nodes. Data acquisition is associated with a cost function that is not constant for different sets of features. Also, confidentiality and privacy concerns could hinder the accessibility to specific features. This should be taken into account for ML models for wireless networks, instead of assuming full data availability, at least without discussing their acquisition costs.
Subsequently, we further categorized the previously defined feature groups into different access scenarios. We define these according to possible scenarios in which different sets of feature groups can be collected from exemplary entities. The defined access scenarios can be found in Table IV. The table defines a name for each access scenario, a corresponding abbreviation that is used throughout the rest of the paper, the feature groups which can be accessed, and an exemplary entity corresponding to the access scenario.
The first access scenario is modem access (MD) and refers to features available at the modem, such as parameters from feature group PHY, which can often be easily accessed, e.g., via an API. Extended modem access (EMD) refers to access to more modem data provided, e.g., by specialized software. Both MD and EMD correspond to data available at a UE, while modem and network access (MDNET), extended modem and network access (EMDNET), and modem, statistics and network access (REMNET) are available for a network operator with varying acquisition costs (in terms of computing and signaling). For example, PHY features are reported by the UE to the network during measurement reports, but these are usually sent after specific events are triggered. On the other hand, features from the CHAN group are reported regularly if the UE receives DL traffic. Since the dataset is based on vehicular measurements, we also introduce the access scenarios vehicle and network access (DEVNET) and full device access (DEV), which include VEH features in addition to other feature groups. For comparison, we also include a full access (FULL) scenario, which has access to all feature groups.
The aim of this paper is not to define specific acquisition costs associated with the different access scenarios, but to acknowledge the potential trade-off between data acquisition costs and ML performance and compare the performance in the next section. Additionally, we would like to highlight two aspects in particular: First, there are data acquisition costs, which vary depending on the different feature groups. Second, ML models need to be tested under different assumptions on data availability including concerns of data governance and ownership.

D. ML Models for QoS Prediction
We briefly introduce the ML models that are relevant to the prediction task. These include linear regression, ensemble methods, and neural networks. For completeness, we also add a statistical model that serves as a baseline. The model hyperparameters were tuned using [60] with 5-fold cross-validation. The split by folds was used to split the dataset into a train and validation set. We used the mean absolute error (MAE) as the cost function.
REM of the Throughput: After building a REM of the DL/UL throughput, the predicted value is the interpolated throughput value for the current UE position. This method serves as a statistical baseline.
Linear regression (LR): Linear regression is a relatively simple statistical model where a dependent variable is explained by multiple independent variables multiplied by coefficients defining their weights. In a ML task, the weights of these coefficients are learned.
Random forest (RF): A random forest is an ensemble method based on decision trees. The trees are built independently and in a regression task, the mean of the individual trees is the prediction result. Our RF is composed of 793 parallel trees with a maximum depth of 19.
Gradient-boosted decision tree (GB): Gradient boosting is also an ensemble method based on decision trees. Here, the trees are built sequentially, with each new tree fitted to reduce the loss left by the trees before it. The prediction result is the sum of the individual trees' outputs, each scaled by the learning rate. Our GB ensemble is composed of 715 sequential trees with a maximum depth of 10.
Multilayer perceptron (MLP): An MLP is a feedforward neural network that stacks together multiple layers of neurons in a sequential architecture. It is a deep learning (DL) model that has been successfully applied to regression problems with labeled tabular datasets. Our MLP consists of four hidden layers with 256, 128, 64, and 32 neurons, respectively, and rectified linear unit (ReLU) activation functions. It is trained with an Adam optimizer, with a learning rate of 0.001 and a batch size of 16. The parameters of the neural network are initialized randomly. For regularization, we use early stopping with the criterion that training ends when the error on a validation dataset does not decrease for eight epochs.
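The early-stopping criterion described for the MLP can be sketched as a small helper. The function name, the 0-based epoch convention, and the choice to also report the best epoch (for example, to restore the corresponding weights) are assumptions of this sketch; the text only specifies the patience of eight epochs.

```python
def early_stopping_epoch(val_errors, patience=8):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch at which the validation error has not
    improved for `patience` consecutive epochs. If that never happens, the
    last epoch is returned as the stop epoch.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_errors) - 1, best_epoch


# Hypothetical validation MAE per epoch: improves for three epochs, then stalls.
history = [5.0, 4.0, 3.0] + [3.5] * 20
stop_epoch, best_epoch = early_stopping_epoch(history)
```

With this invented history, the best epoch is the third one, and training ends eight epochs later, long before the 23 recorded epochs are exhausted.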

VI. THROUGHPUT PREDICTION
In this section, we discuss the throughput prediction task based on ML. The prediction task we picked is to estimate the achieved maximum throughput under a high network load. The high network load conditions mean that the end users have to share the network resources.
We start by investigating the performance of different ML models and continue by looking at the performance when different data access scenarios apply. We then compare the influence of the split strategies on the training and test datasets. For all the provided results, we report multiple metrics [61], as this allows a more flexible evaluation of the prediction performance for diverse use cases, and, when applicable, we discuss the limitations of specific metrics.

A. Model Performance
In Tables V and VI, we compare different models on standard regression metrics for the DL and UL direction, respectively. We assume to use all the information available for the prediction (i.e., FULL), consider all environments jointly, and refer to split by folds. Moreover, we relied on all available data samples for performance evaluation.
In each table, we highlight the best-performing model per metric in bold. For the DL direction (Table V) the GB outperforms the other models for all metrics. RF consistently shows slightly lower performance than GB. For the mean absolute percentage error (MAPE), the MLP is on par with GB, but it shows worse performance than GB and RF for most other metrics. Considering the two simpler models, LR outperforms the REM interpolation for most metrics, underlining that the use of context information, which is included in the input features, is crucial for the prediction task. Moreover, as the feature space is not linear, the LR performance is poor.
The UL direction (Table VI) mainly corresponds to the performance observations in the DL prediction task. We observe that the absolute gap between GB and RF narrows for MAE, median absolute error (MedAE), and root mean square error (RMSE) (due to lower UL data rates), with RF showing similar performance as GB for the R² score. The UL throughput range (up to 23 Mbps) is smaller than the DL throughput range (up to 70 Mbps), which is reflected in the prediction performance.
While the average MAE in UL is significantly smaller than the average MAE in DL, the MAPE of the DL is comparably lower than in UL.
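The opposite behavior of MAE and MAPE in the two directions follows directly from the metric definitions. The sketch below computes the reported metrics with their standard definitions; the throughput values are invented solely to mimic a wide (DL-like) and a narrow (UL-like) value range and are not taken from the dataset.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """MAE, MedAE, RMSE, MAPE (in percent), and R^2 for paired sequences."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mean_true = statistics.fmean(y_true)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rel_errors = [a / abs(t) for a, t in zip(abs_errors, y_true)]
    return {
        "MAE": statistics.fmean(abs_errors),
        "MedAE": statistics.median(abs_errors),
        "RMSE": math.sqrt(ss_res / len(errors)),
        "MAPE": 100.0 * statistics.fmean(rel_errors),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Invented values in Mbps: large absolute but small relative errors (DL-like) ...
dl = regression_metrics([40.0, 50.0, 60.0], [44.0, 55.0, 66.0])
# ... versus small absolute but large relative errors (UL-like).
ul = regression_metrics([4.0, 5.0, 6.0], [5.0, 6.0, 7.0])
```

In this toy case the UL-like series has the lower MAE but the higher MAPE, because the same absolute deviation weighs more heavily against a smaller true value.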
We also examined the trade-off between maximum performance and acquisition cost and found that performance metrics, such as the MAE, converge at around 5000 training samples. Adding further samples leads only to minor increases in prediction accuracy. However, we note that this result is specific to our dataset and may not generalize to other datasets because the minimum required dataset size depends on a multitude of factors. Such factors include the environment dynamics, the network dynamics, and the specific use case (i.e., the required accuracy). Nevertheless, this result shows that the return in performance diminishes with an increasing number of samples, and measurement campaigns should be planned carefully, as there is room for optimizing the data acquisition expenditures.
In summary, GB proved to be the best-performing model considering both data traffic directions and all metrics. At the same time, the interested reader should note that there are a plethora of ML algorithms that can be applied and provide good performance. The ML engineer can always pick one based on criteria like the model's complexity and the degrees of explainability. As the ensemble methods based on decision trees offer a good degree of explainability and robustness against outliers, we pick GB for the following sections.

B. Access Scenarios & Feature Groups
We continue by looking at different sets of input features and how much these affect the achieved performance. In the literature, input features are very often considered quite alike in terms of acquisition costs. In a real network, there is a cost associated with collecting data from different nodes and end users. We compare the prediction performance for the access scenarios defined in Section V in Tables VII and VIII for the DL and UL direction, respectively.
In both DL and UL, the performance of different access scenarios is very similar, with the only difference that, in general, for DL the MAPE is significantly lower compared to UL, and the MAE, MedAE, and RMSE are higher for DL due to the larger value range. The worst-performing access scenario in both cases is MD, where only the feature group PHY is used. Prediction for EMD, modem access and statistics (MDREM), and DEV performs slightly better but not by a large margin. This probably indicates that additional channel or statistical/historical information might be beneficial but does not add much useful information in addition to physical layer parameters, which can be obtained easily. The prediction with only UE features (MD/EMD) achieves the worst performance across all presented metrics and access scenarios. However, adding BS features, demonstrated in access scenarios MDNET, EMDNET, REMNET and DEVNET, significantly improves the prediction performance, resulting in a halved prediction error for DL. The same holds true for the UL direction except for the MAPE, which is nevertheless significantly lower, by around 20 to 30 percentage points. Also, the R² score for both DL and UL is increased to approximately 0.9 once BS features are added.
The best prediction performance is achieved when access to all features is provided, which is emphasized by the bold rows for access scenario FULL. As mentioned in the previous section, we introduced this for comparison only, as access to all features is often not realistic or associated with high acquisition costs. However, we see that the performance of FULL is improved by only a small margin compared to the access scenarios including feature group BS, even compared to MDNET, which contains only the simple feature group PHY in addition to feature group BS. Thus, MDNET might be a feasible candidate if one tries to find a good trade-off between better performance and lower data acquisition costs.

C. Split Strategies
As discussed in Section V, there are different approaches for splitting the dataset into train and test sets. We present results for using different splitting strategies for the DL direction in Table IX. According to all analyzed metrics, the random split, as described above, yields the best prediction results.
The performance would likely decrease drastically during deployment, compared to other splits, when the model is presented with new data with a different correlation structure.
In our case, the performance of the random split is 56 % better compared to the split by folds. However, the performance of the random split is likely overly optimistic due to the temporal correlations between samples [50]. Hence, ML models trained with the random split strategy might perform worse on new uncorrelated data that appears in real deployments.
As reasoned earlier, performance results obtained with the split by folds are likely to be more robust to other deployment scenarios. The split by folds achieves the second-best results for most metrics.
On the other hand, the split by time and the split by measurement run perform worse, which supports the hypothesis that the statistics change over time. This change can be attributed to the radio environment, as different numbers and types of vehicles were around during the data collection process.
The results above highlight the need for picking one or several splitting strategies for assessing the ML performance for a specific use case. Failure to have a clear splitting strategy might lead to a false estimation of the performance during the deployment phase.
In Fig. 9, we show the predicted values against the measured ones. The DL throughput was predicted with the GB model, utilizing the features contained in the FULL access type and using the split by folds.
In the figure, we also include histograms for both axes. As multiple devices were competing for higher throughput, the total capacity of the network was typically split between multiple users, making the highest throughput measurements a rarity. This explains the skewed nature of the distribution. The interested reader should note that the mean-based metrics used in this section, MAE and MAPE, are heavily influenced by the long distribution tails. Therefore, we have introduced median-based metrics, such as the MedAE, that are less influenced by a few large outliers.

D. Sampling Intervals
As discussed in Sections IV and V, the radio environment has a correlation structure that renders consecutive measurements quite alike. That means that longer sampling intervals might be used to reduce overhead, such as battery consumption and signaling. To showcase that, we compare in Table X results for longer sampling intervals with the GB results in Table V. We see that halving the sampling rate (a 2 s interval) brings a negligible drop in the performance of an ML model. Even a sampling rate that is 75 percent lower than the initial one (a 4 s interval) might be used and still deliver performance that is more than adequate for some use cases, with the biggest change being that the MAE increases by 20 %. We also noticed that the lower sampling rate made PHY features less relevant and increased the importance of the REM features. The correlation structure of the radio environment can be further exploited to optimize sampling procedures while keeping the ML model's performance within the requirements of a use case. A deeper understanding of the correlation structure of the radio environment can drastically benefit the sampling procedures [45], supporting more efficient ML workflows.
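The relation between sampling interval and the number of collected samples can be made concrete with a short sketch. It assumes a regularly sampled series and a base interval of 1 s; the latter is inferred from the 2 s and 4 s figures above rather than stated explicitly, and both function names are hypothetical.

```python
def downsample(series, factor):
    """Keep every `factor`-th sample of a regularly sampled series."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return series[::factor]


def sample_reduction(base_interval_s, new_interval_s):
    """Fraction of samples saved when moving to a longer sampling interval."""
    return 1.0 - base_interval_s / new_interval_s


# One minute of measurements at the assumed 1 s base interval.
one_minute = list(range(60))
every_2s = downsample(one_minute, 2)   # 30 samples remain
every_4s = downsample(one_minute, 4)   # 15 samples remain
```

A 2 s interval therefore removes half of the samples and a 4 s interval three quarters of them, which is the overhead reduction weighed against the change in prediction error.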

E. Prediction Horizon
We continue by changing the prediction problem slightly. Instead of predicting the achievable instantaneous throughput, we predict the throughput several seconds ahead, as defined by the prediction horizon. This is shown in Fig. 10 for different access scenarios. We see that the access scenario EMD provides poor performance, similar to the previous subsections. The biggest improvement in the prediction error occurs with the MDNET access scenario, which includes the network information. The DEVNET access scenario provides marginal improvements, with the prediction performance dropping more slowly for longer prediction horizons. The best performance is achieved for the access scenario FULL, although beyond a prediction horizon of 12 seconds the extra features bring no real gain, as FULL performs as well as the DEVNET access scenario.
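Changing the prediction horizon only changes how inputs and targets are paired before training. A minimal sketch, assuming a single contiguous and regularly sampled series (pairs must not cross the boundary between measurement runs) and a hypothetical function name:

```python
def make_horizon_pairs(features, throughput, horizon_steps):
    """Pair the features observed at step t with the throughput measured
    at step t + horizon_steps. A horizon of 0 gives the instantaneous task."""
    if horizon_steps < 0:
        raise ValueError("horizon_steps must be >= 0")
    if len(features) != len(throughput):
        raise ValueError("features and throughput must have equal length")
    usable = max(len(throughput) - horizon_steps, 0)
    return [(features[t], throughput[t + horizon_steps]) for t in range(usable)]


# Invented series: one feature per step and the throughput in Mbps.
features = [[0], [1], [2], [3], [4]]
throughput = [10.0, 11.0, 12.0, 13.0, 14.0]
pairs = make_horizon_pairs(features, throughput, horizon_steps=2)
```

Each additional step of horizon removes one usable pair from the end of the series, so longer horizons slightly reduce the amount of training data per run.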

F. Concept Drift and ML
As discussed in Section IV, concept shifts occur frequently in the radio environment. Here, we illustrate how performance degrades when an ML model faces concept drifts. Our example is shown in Table XI, where we trained a model on the suburban environment and afterward assumed that it was deployed on a vehicle traveling across a highway environment. We have already seen in Section IV that a large number of concept drifts is detected while driving from a suburban environment onto a highway. The performance of the ML algorithm drops drastically, to the level of a statistical baseline, as shown by the R² score.

G. Consumer Grade Devices
The presented results were calculated using data captured by DMEs. As DME is relatively expensive and comes with increased processing and measurement set-up effort, we compare it in this section to data from CE. Since CE is significantly more accessible in price and set-up, the question arises to which degree the dataset differs from the DME dataset, and, more specifically, whether a CE dataset can be sufficient for the use case of throughput prediction.
For this comparison, we use data only from the vehicles that were equipped both with a CE and a DME (Vehicles 3 and 4) since this creates datasets of roughly the same size and diversity. Moreover, only feature groups PHY and BS were considered (i.e., access scenario MDNET) to match the more limited feature set availability of the CE. Table XII presents a comparison of prediction results between the CE and the corresponding DME. Both in UL and DL, the different metrics display similar performance.
The R² score of the CE is slightly better than the DME's. The DME has higher sensitivity, and we have seen a larger number of outliers at the higher throughput ranges. Such outliers negatively influence the DME's performance. Overall, our findings show that CE can be part of a data collection procedure. More expensive devices might bring some small benefits but with diminishing returns.

VII. INTERPRETABILITY AND EXPLAINABILITY OF THE MODELS
In this section, we use Shapley additive explanations (SHAP) [62] and accumulated local effects (ALE) [63] for

A. Shapley Additive Explanations
SHAP is a framework to explain individual predictions. We use the model-agnostic variant of SHAP as described in [62]. Fig. 11 presents the SHAP values of the five most important features, in DL and UL respectively. The x-axis depicts the SHAP value that describes the impact of the input feature on the prediction. The y-axis depicts the input feature names, and the color represents the numeric input feature value.
The cell load feature has both in UL and DL a strong impact on the prediction. A higher cell load value leads to a lower value on the prediction and vice versa. On the other hand, the radio-based features (RSRP, reference signal received quality (RSRQ) and SINR) seem to have the opposite tendency. A higher radio value leads to a higher value on the prediction and vice versa.
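The idea behind these attributions can be illustrated by computing exact Shapley values for a toy model. This sketch is not the SHAP implementation of [62]: it enumerates all feature orderings, represents an absent feature by a single baseline value instead of a background dataset, and uses a hypothetical additive throughput model with invented coefficients and feature names.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at point `x` relative to `baseline`.

    Each feature's value is its marginal contribution averaged over all
    orderings in which features are switched from baseline to actual value.
    """
    names = list(x)
    phi = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = x[name]
            value = model(current)
            phi[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in phi.items()}


def toy_model(f):
    """Hypothetical throughput model in Mbps; coefficients are invented."""
    return 60.0 * (1.0 - f["cell_load"]) + 0.5 * (f["rsrp_dbm"] + 110.0)


x = {"cell_load": 0.9, "rsrp_dbm": -80.0}          # loaded cell, good signal
baseline = {"cell_load": 0.5, "rsrp_dbm": -95.0}   # reference operating point
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the high cell load receives a negative attribution and the high RSRP a positive one, and the two attributions sum to the difference between the prediction at x and at the baseline. The number of orderings grows factorially with the number of features, which is why practical tools rely on approximations.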

B. Accumulated Local Effects
We then continue using ALE for discovering the global effects of input features. Fig. 12 depicts ALE values for RSRP in UL throughput prediction. The x-axis depicts the RSRP value, the y-axis the effect on the prediction, where a higher input feature value leads to a higher value of the prediction.
We note that the ALE plot appears to contain four regions that affect the prediction differently, as marked in the figure. We added orange dotted lines at the perceived border of each region. The first one is with the lowest captured RSRP values. The UE needs to get at least some minimum RSRP before it can transmit data. The second region shows some linear characteristics. Higher RSRP values seem to contribute to a higher throughput significantly in this range, which demonstrates that the RSRP is contributing strongly to the prediction variable (UL throughput). The third region shows some slight saturation in the linear trend, meaning that higher RSRP values do not contribute considerably more to a higher throughput prediction. The last region looks saturated. This probably means that either higher RSRP values do not provide any benefits for higher rates, or other factors like the cell load play a more important role in the prediction variable. We note that the ML model has discovered these regions from the input features, without being explicitly programmed, and that they are close to the operating principles of cellular networks.
The interested reader should note that the absolute numbers of the regions depend to a large extent on the characteristics of the network deployment, which include, for example, the dynamic range of the receivers and the network states captured. Similar findings have been shown in a previous paper [64], which was based on data from an operator in Asia. This increases the confidence that ML is able to learn about wireless environment characteristics on diverse networks and deployments.
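How an ALE curve exposes linear and saturated regions can be sketched with a first-order ALE computation on a toy model. Everything specific here is an assumption for illustration: the model (linear in RSRP up to a saturation point, scaled by cell load), the bin edges, the feature names, and the centering by count-weighted bin midpoints, which is one common approximation.

```python
def ale_1d(model, rows, feature, edges):
    """First-order ALE of `feature` on sorted bin `edges`.

    For the samples in each bin, the feature is moved from the lower to the
    upper edge and the prediction differences are averaged; the averages are
    accumulated and centered. Returns one value per edge.
    """
    effects = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        is_last = hi == edges[-1]
        in_bin = [r for r in rows
                  if lo <= r[feature] < hi or (is_last and r[feature] == hi)]
        if in_bin:
            diffs = [model({**r, feature: hi}) - model({**r, feature: lo})
                     for r in in_bin]
            effects.append((sum(diffs) / len(diffs), len(in_bin)))
        else:
            effects.append((0.0, 0))
    ale = [0.0]
    for effect, _ in effects:
        ale.append(ale[-1] + effect)
    total = sum(count for _, count in effects)
    mean = sum(0.5 * (ale[k] + ale[k + 1]) * count
               for k, (_, count) in enumerate(effects)) / total
    return [value - mean for value in ale]


def toy_model(r):
    """Hypothetical UL throughput in Mbps: linear in RSRP, saturating at -80 dBm."""
    radio = min(20.0, max(0.0, 0.5 * (r["rsrp_dbm"] + 120.0)))
    return (1.0 - r["cell_load"]) * radio


rows = [{"rsrp_dbm": float(rsrp), "cell_load": load}
        for rsrp in range(-120, -59, 5) for load in (0.2, 0.8)]
ale = ale_1d(toy_model, rows, "rsrp_dbm", [-120.0, -100.0, -80.0, -60.0])
```

The resulting curve rises by the same amount over the first two bins and stays flat over the last one, mirroring the transition from a linear to a saturated region in this invented example.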

VIII. CONCLUSION
Based on a dedicated measurement campaign, we have presented insights into building reliable ML models for pQoS. Our results go beyond UE data by including network and vehicular information measurements, covering a large range of scenarios. Our measurements reveal many challenges ML models will face in real deployments.
Our first contribution discussed methods for improving sampling for the radio environment. Its correlation structure allows improved sampling procedures that reduce energy consumption and signaling for sharing collected data. The vehicle speeds do not seem to impact the statistics and characteristics of the collected data strongly, further simplifying the sampling procedures. We have also tested the data stationarity assumption, a precondition for many ML models and other theoretical approaches. Even though stationarity often holds true, especially for datasets captured over a longer duration, it should not be expected liberally. We have provided multiple examples where this assumption is violated, by focusing mostly on the aspect of concept drifts as vehicles move between radio environments. As concept drifts degrade ML performance, a reactive ML method might be needed for retraining the ML models as required, once drifts are detected. Another option is having larger datasets that cover multiple scenarios, like in our testbed, reducing the number of concept drifts that an ML model will face when deployed. A combination of the two methods might be the most feasible way forward.
We were able to predict the maximum achievable throughput under high load scenarios, where multiple users compete for maximum throughput in the network, with an MAE of 2.46 and 1.08 Mbps for downlink and uplink, respectively. In particular, our findings show that the effects of data processing, different validation datasets, and sets of input features can have a very strong overall effect on the ML model performance. For that reason, simply comparing numbers from literature can be misleading. We have seen that GB models outperform neural networks while keeping a balance between complexity and explainability. At the same time, our results clearly show that low-cost consumer-grade devices can be part of ML processes. Their performance falls close to more expensive transceiver chains, further facilitating data collection procedures from end users' terminals.
Moreover, we emphasized the topics of interpretability and explainability, showing that the tested ML models were able to capture the underlying principles without being explicitly programmed. It is interesting to note that the cell load was discovered as the most important feature for both communication directions, showcasing the importance of network features for such prediction tasks.
Our results indicate that more thorough testing of ML models is needed as complexities coming from the radio environment, the end users, and the effects of the network can considerably affect prediction performance, which might not always be precisely captured by any collected dataset. That can result in high performance variations of ML models, when deployed, that have never seen such effects in their training sets.
One can also draw several other conclusions and lessons from this data and its analysis, for the wireless research community and ML engineers alike. First, reporting only the ML performance of models on a specific dataset might provide overly optimistic results. Second, the data collection procedures, the handling and processing of the data, as well as the way of reporting results, are all equally important in the long chain of ML workflows. Third, results coming from simplified analytical assumptions and simulations should be used with caution. Our results indicate that real-world intricacies might hinder the performance of ML models more than hitherto understood.
Although some of these results could be regarded as intuitive, they have not been properly emphasized or taken into account in the literature, with only a few exceptions. Especially when it comes to applying ML in wireless networks, there has been a tendency to rate the usefulness of ML models based on their performance on some testing datasets. Our results clearly show that this can often be misleading, and more conservative estimates might need to be used. Moreover, the relevant acquisition costs of data are rarely discussed, and we believe that such aspects need to be part of future discussions and proposals. We hope that this publication serves as a starting point in that direction.
We believe that ML-based predictions do not only have the potential to improve specific use cases but also serve as an important enabler for a more proactive network. We made more datasets [65,66] available to the research community that are based on similar collection principles as the ones we described. In the future, we plan to integrate the lessons learned towards methods that can innately handle some of the dynamics we noticed in the radio environment, such as nonstationarities and concept drifts. Moreover, we would like to extend our work toward other KPI metrics, such as latency and the number of dropped packets. Finally, we would like to integrate issues of data governance since, in this study, we did not consider the acquisition cost of the different features in detail.