BoB: Bandwidth Prediction for Real-Time Communications Using Heuristic and Reinforcement Learning

Bandwidth prediction is critical in any Real-time Communication (RTC) service or application. This component decides how much media data can be sent in real time. Subsequently, the video and audio encoder dynamically adapts the bitrate to achieve the best quality without congesting the network and causing packets to be lost or delayed. To date, several RTC services have deployed the heuristic-based Google Congestion Control (GCC), which performs well under certain circumstances and falls short in some others. In this paper, we leverage the advancements in reinforcement learning and propose BoB (Bang-on-Bandwidth), a hybrid bandwidth predictor for RTC. At the beginning of the RTC session, BoB uses a heuristic-based approach. It then switches to a learning-based approach. BoB predicts the available bandwidth accurately and improves bandwidth utilization under diverse network conditions compared to the two winning solutions of the ACM MMSys'21 grand challenge on bandwidth estimation in RTC. An open-source implementation of BoB is publicly available for further testing and research.

Index Terms: Bandwidth prediction, real-time communications, reinforcement learning, RTC, WebRTC, AlphaRTC.

I. INTRODUCTION
Real-time Communication (RTC) services account for a sizeable fraction of today's Internet traffic [23]. For example, there were 300 million daily meeting participants on the Zoom platform alone in 2020, a 50% increase from 2019 [65], and on the Facebook Messenger application, there were 150 million daily video calls in 2021 [49]. With [...] and more efficient video and audio codecs, RTC services continue to grow and evolve. Today, RTC is used in a range of applications such as video gaming [26], [43], [53], videoconferencing [34], [38], e-learning [9] and real-time immersive experience sharing [59]. Needless to say, RTC is an integral part of our lives as it enables us to stay connected with the rest of the world while working remotely, which has become the new normal due to the COVID-19 pandemic. However, this does not mean users' quality of experience (QoE) for RTC services is always great. Occasionally, and sometimes more than occasionally, users still suffer from blurry, low-quality or distorted video, high latency, video freezes and audio drops.
Abdelhak Bentaleb is with the Gina Cody School of Engineering and Computer Science, Concordia University, Montreal, QC H3H 2R9, Canada (e-mail: abdelhak.bentaleb@concordia.ca).
Digital Object Identifier 10.1109/TMM.2022.3216456
To date, there has been significant research to improve QoE in RTC services. These efforts offered several solutions that can be divided into three broad categories: (i) congestion control optimization at the transport layer [22], [24], [60], [64] that primarily aims to provide an accurate bandwidth estimation, (ii) bitrate selection optimization for video codecs [68] (e.g., H.26x, VPx and AV1) that strives to adapt the bitrate (through the rate control at the application layer) for each frame to suit the instantaneous network capacity changes, and (iii) mixed techniques that combine congestion control and bitrate selection optimizations. Despite the advances in codec rate control, accurate bandwidth estimation is still an open problem. It plays a critical role in maintaining a good QoE as the codec allocates more or fewer bits based on this estimation. In other words, if the actual bandwidth is overestimated or underestimated, this can be detrimental to the QoE. Existing heuristics (e.g., [21], [22]) may work well in some network environments but not so well in others [27] due to dynamic, complex and diverse bandwidth fluctuations. These heuristics mainly follow the Google Congestion Control (GCC) algorithm [1], which implements two rules that use the aggregated Real-time Transport Protocol (RTP, RFC 3550) feedback information to estimate the bandwidth. The first rule is a loss-based rate controller implemented at the sender, while the second is a delay-based one implemented at the receiver.
Deep reinforcement learning (DRL) has recently emerged as a key solution for many networking problems such as bitrate adaptation [45], congestion control in TCP [7], [42] and RTC [27], scheduling [44] and bandwidth prediction [13]. Leveraging the power of a learning-based approach that masters and adapts dynamically to various environments, we design BoB (Bang-on-Bandwidth), a bandwidth predictor for RTC. BoB is located at the receiver and operates fully automatically by learning from experience and reacting quickly to changes in network conditions while considering video quality and packet delay/loss. It uses actor-critic networks for model training and Proximal Policy Optimization (PPO) [52] with clip and Adam optimizers for policy updates at each time interval. Using DRL directly in the context of bandwidth prediction requires a certain level of caution because of the cold start issues (i.e., not enough data being available at the beginning of the session) [66]. The reason is that DRL approaches are often trained offline with large amounts of data, and then used online with limited data. Such a gap between offline and online environments results in inconsistent performance [66] caused by taking sub-optimal actions. To avoid this issue, BoB includes an adaptive selector for bandwidth prediction that initially uses a heuristic-based controller. Once it collects sufficient input data, it switches to a learning-based controller.
Note that we may use estimation and prediction interchangeably throughout this paper while keeping a small but important difference adopted from [13]. An estimation is derived from the raw measurements and/or samples using simple smoothing techniques, whereas a prediction is derived from the smoothed values and/or other data using learning-based techniques.
The contributions of this paper are three-fold:
1) We design BoB, a receiver-side hybrid bandwidth prediction solution for RTC, which combines a heuristic-based controller (inspired by the GCC algorithm) with a DRL controller. The main feature of BoB is to leverage the DRL benefits in adapting to diverse network conditions while using the heuristic-based controller only at the beginning of an RTC session when input data is scarce.
2) We propose an adaptive technique to select between the heuristic and learning-based controllers to avoid inaccurate actions when using DRL for bandwidth prediction.
3) We implement BoB on Microsoft's OpenNetLab platform termed AlphaRTC [5] and validate its performance gains against the recent state-of-the-art solutions and winners of the grand challenge organized by Microsoft and OpenNetLab on the subject of bandwidth estimation for RTC at ACM MMSys 2021 [3]. To train BoB's DRL model, we incorporate BoB into RTC GYM [3], which emulates an RTC environment, and subsequently use the model for evaluation using real-world network traces (online BoB model inference). We also evaluate BoB in the wild using the OpenNetLab public Internet-based testbed. Evaluation results show that BoB achieves good prediction accuracy with high utilization and viewer experience across many real-world network conditions. The source code for BoB is publicly available at [11].
The rest of the paper is organized as follows: Section II overviews some of the QoE optimization solutions for RTC systems. Section III details the proposed learning-based solution (BoB) for bandwidth prediction in RTC systems. The evaluation and analysis are given in Section IV, followed by a discussion on open directions in Section V. Finally, Section VI concludes the paper.

II. RELATED WORK
Improving QoE for different video streaming services, such as RTC, has gained massive attention in the last several years. For this purpose, solutions have been developed with techniques ranging from heuristics to learning-based methods at the transport layer (e.g., congestion control) and application layer (e.g., bitrate selection and bandwidth estimation). In general, these solutions fall into three main categories.

A. Congestion Control Optimization
There are many congestion control solutions that include numerous variants of TCP. Here, we briefly present some of them. Among the early solutions, TCP Reno [37] and NewReno [29] both use a heuristic additive-increase-multiplicative-decrease (AIMD)-based algorithm that considers packet loss as the key indicator for congestion. Later, improved congestion control versions have emerged, such as TCP Cubic [31] and TCP Vegas [18] (and then Copa [8]), where the former tries to replace the AIMD function with an improved one while the latter uses delay as the primary indicator of congestion instead of packet loss. More recently, BBR [19] uses delay instead of loss as the primary parameter to determine the sending rate, allowing it to work near the optimal point of full bandwidth utilization and low delay. BBRv2 [20] aims to address the issues that were introduced in the initial version: (i) unfairness, and (ii) excessive retransmissions in shallow-buffered networks.
As learning techniques became popular, there were attempts to automatically perform the task of congestion control. Winstein et al. [58] designed Remy, a distributed congestion control solution for heterogeneous and dynamic network environments. Remy formulates congestion control as an optimization problem and implements an offline mapping from all possible events to good actions using a dynamic programming approach. Using online learning techniques, PCC-Vivace [25] was proposed to select the best sending rates automatically. Indigo [62] adjusts the congestion window based on a trained model that employs imitation learning, while Aurora [39] leverages basic DRL techniques to determine the sending rate. Orca [7] uses a hybrid approach that combines a legacy congestion control solution with modern DRL techniques. Zhu et al. [70] proposed NADA, a congestion control scheme for interactive RTC services, where the sender adjusts its sending rate based on either implicit or explicit congestion signaling markings from the network nodes. Johansson et al. designed SCReAM [41], a hybrid loss and delay-based congestion control algorithm for interactive video streaming applications. Interested readers are encouraged to read more details in [48], [62].

B. Bitrate Selection Optimization
Fang et al. [27] designed an RL-based agent to control the sending rate in an RTC system. Their preliminary results showed good performance under challenging network conditions. Tianrun et al. [55] designed Gemini, an ensemble framework for bandwidth estimates in RTC. Gemini implements a hybrid technique that switches between the heuristic-based GCC rule and a DRL agent on the fly based on a safety factor. This safety factor decides when Gemini falls back to the GCC rule once the DRL model performs poorly and then switches back to DRL when the performance improves. However, based on our experimental test runs and results (Section IV), Gemini suffers from three issues: (i) the switching technique fails frequently when it tries to select the correct algorithm, especially under challenging network conditions with a high packet loss ratio (e.g., 3G/4G), (ii) the DRL algorithm uses a simple neural network that does not consider the fluctuation in the past bandwidth prediction values, and (iii) the DRL algorithm fails to converge to the best bandwidth prediction decisions. Such issues may result in bandwidth overpredictions or underpredictions.
Similarly, Wang et al. [57] proposed HRCC, which uses an RL agent to dynamically tune the values of GCC parameters depending on the network variability instead of using fixed values to boost bandwidth estimation accuracy. Our solution (BoB) falls into this category and its objective of controlling the receiving rate is similar to HRCC and Gemini. All these solutions (HRCC, Gemini and BoB) use a GCC-like heuristic algorithm. However, the key differences are in the DRL-agent design. BoB differs from Gemini in the following aspects: (i) the DRL architecture and set of the NN inputs, (ii) the adaptive algorithm switcher, where during the streaming session, Gemini keeps switching between the DRL and heuristic algorithms, and BoB only uses the heuristic at the beginning and then switches to DRL once more data is available, and (iii) Gemini uses an ACK-based heuristic algorithm while BoB uses a delay-loss based heuristic algorithm.

C. Mixed Techniques
Fouladi et al. [30] designed Salsify as an RTC architecture that includes a video codec and a network transport protocol. Salsify uses per-frame rate adaptation and aims to work under extreme network conditions by alleviating packet losses and delays. To achieve this, Salsify employs a custom encoding/decoding scheme not supported by existing hardware codecs. Zhang et al. [67] proposed a solution that combines a multipath transmission scheme with path selection for improved transmissions in RTC. Here, the sender selects the best path from several candidate paths using a multi-armed bandit learning-based technique. Zhou et al. [69] proposed Concerto, a machine learning-based bitrate adaptation system aiming to maximize video telephony QoE. Concerto first extracts high-level features of both layers (application and transport) and then leverages deep imitation learning to train models using massive data traces. In particular, it considers historical packet losses, packet delays and the sending/receiving rates in its neural network and imitates the behavior of an expert (an Oracle that knows the actual bandwidth values). Zhang et al. [66] developed an online RL-based solution for rate decisions in RTC systems named OnRL. The central insight behind OnRL is that RL models trained offline in a simulator suffer from less satisfactory performance when deployed under real conditions.

III. BOB: BANG-ON-BANDWIDTH
Predicting the bandwidth is one of the critical tasks in RTC that directly impacts the user experience. The essential question is how to perform bandwidth prediction accurately, considering the information collected from the Real-time Transport Protocol (RTP, RFC 3550) packets. Information that includes sending/receiving time and packet size can be collected with every received RTP packet. This information is used to compute the receiving rate, packet delay and packet loss, all of which are used as inputs to determine how much bandwidth is available now and will be available soon on the current network path. Typically, bandwidth prediction is performed using a heuristic-based scheme (e.g., GCC-based [1]). In a learning-based approach, the above inputs are translated into a state and reward (QoE), which are then mapped to an action (bandwidth prediction). BoB achieves the benefits of both approaches to perform the bandwidth prediction task as explained below.

A. Overview
The overall workflow of BoB is depicted in Fig. 1. It consists of two phases: BoB training and BoB testing.
1) BoB Training Phase: We use the AlphaRTC GYM simulator [5], based on an ns-3 [51] and WebRTC implementation. This simulator emulates a WebRTC session, utilizing various network traces that were collected from real-world environments such as Belgium 4G/LTE [56], Norway 3G/HSDPA [50] and NYU LTE [46]. Each network trace comprises a throughput value, a round-trip time (RTT) and a packet loss ratio. We implemented BoB as a bandwidth prediction module within RTC GYM. During the WebRTC session, the simulator collects and computes statistics (e.g., receiving rate, packet delay and loss) from each received RTP packet, and then these statistics are fed as inputs into BoB, which in turn predicts the bandwidth at every time step. The predicted bandwidth is sent to the sender via RTP Control Protocol (RTCP, RFC 3550) feedback to adjust the encoding rate. During the offline phase, we train our learning-based BoB model (refer to the DRL controller in Fig. 2, with more details given in Section III-C2) and this model is then used during the testing phase. We note that because our experiments were conducted using a short video sample, we did not retrain the BoB model during the testing phase. However, if/when desired, it could be retrained periodically.
2) BoB Testing Phase: We use the AlphaRTC implementation [5] with the BoB controller for a receiver-side hybrid bandwidth predictor, as highlighted in blue color in Fig. 1. The system consists of an (RTP) sender and an (RTP) receiver. The sender initiates the RTC video session with the receiver by creating a UDP socket to send RTP packets and receive RTCP feedback. The congestion control is adapted from GCC and includes two controllers: a loss-based controller and the hybrid BoB controller (it is called hybrid as it has both a heuristic delay-based controller and a learning-based DRL controller). The BoB controller is placed at the receiver and is responsible for computing a bitrate ($x^r$) based on the BoB bandwidth predictor output, which is then fed back to the sender. Conversely, the loss-based controller is placed at the sender and is responsible for computing the target sending rate (denoted by $x^s$). The target bitrate $x^s$ is fed to the video encoder, which attempts to encode the video at a bitrate as close to the target as possible. The encoded video is then forwarded to the packet pacer, which is responsible for regulating the bitrate produced by the encoder when the bitrate of the encoded video deviates from the target. Here, the encoder cannot change its rate as frequently as the pacer can. If the video encoder produces a bitrate higher than the target, then the pacer is allowed to drain its queue at a higher rate to alleviate queuing delays at the sender. On the other hand, padding/forward error correction (FEC) can be added, if desired, under certain circumstances. This way, on average, the sending rate is expected to be equal to the target bitrate $x^s$.

B. System Architecture
BoB is a hybrid rate control solution implemented at the receiver to improve the QoE for RTC systems. It combines the strength of a heuristic-based rate controller with a DRL-based controller to predict the bandwidth. As shown in Fig. 1, BoB takes the historical packet-level statistics from the network path as an input, where we denote the receiving rate by $c_t$, packet delay intervals by $d_t$, packet loss ratio by $l_t$ and the $n$ most recent predicted bandwidth samples by $\vec{X}^r_t$. It outputs a prediction (denoted by $a_t = x^r_t$) for the next time window $W_t$ (in milliseconds), where $t \in \{1, 2, \ldots, T\}$ and $T$ is the total number of time windows of an RTC session. The predicted bandwidth value is then sent to the sender using an RTCP feedback message, which in turn is passed to the encoder. After that, the encoder uses this value as the target bitrate and encodes the frames based on this target. Therefore, BoB controls the receiving rate and helps to avoid issues that could lead to poor QoE. In short, BoB replaces the traditional, heuristic-only-based rate controller (e.g., based on an unscented Kalman filter) by leveraging the power of DRL. During the offline training phase, it uses the past and current information of the incoming packets (at the transport layer) as input to the neural network. Due to the nature of DRL, BoB might deviate from the right decision in some corner (uncovered) cases, which mostly happen at the beginning of an RTC session. For this reason, we developed a simple but robust adaptive selector that enables run-time switching between the heuristic and DRL-based controllers. The adaptive selector uses the heuristic-based controller at the beginning of an RTC session, when the DRL controller behaves sub-optimally because of limited session data and incorrect exploitation actions. Then, it switches to the DRL controller once most of the corner cases are covered and the predicted values become accurate. Specifically, it uses a current percentage value and a percentage threshold (fixed to 30%, tuned empirically) as a switching point between the heuristic-based and DRL controllers. The current percentage value is computed from the difference between, and the average of, the predicted bandwidth values given by the heuristic and DRL controllers.
As shown in Fig. 2, each endpoint (sender or receiver) runs its controller. The receiver runs the BoB controller, whereas the sender runs a loss-based controller. Next, we describe the receiver-side BoB controller and the sender-side loss-based controller in detail.

C. (Receiver-Side) BoB Controller
Here, we describe the BoB controller, which consists of (i) a delay-based (heuristic) controller, (ii) a DRL controller, and (iii) an adaptive selector.
1) Delay-Based (Heuristic) Rate Controller: At each time window $W_t$, the delay-based rate controller predicts the bandwidth $x^r_t$ as described in Algorithm 1. In this algorithm, $\beta = 1.08$ and $\alpha = 0.85$ are coefficients of the packet arrival Kalman filter, which are tuned empirically based on our experiments, $\sigma$ is the controller's state, $c_t$ is the receiving rate measured in the last $W_t = 200$ milliseconds (ms), and $\bar{x}_t$ is the additive value that is determined by the rate control region. The delay-based controller first uses the packet arrival filter, which divides and groups the received packets into 200-ms windows, and then computes the slope factor (denoted by $m_t$) based on a delay gradient between the groups of received packets to judge the trend of the delay change. After that, $m_t$ is fed to the adaptive threshold, which sets the threshold used by the overuse detector. Then, the overuse detector produces a signal that drives the network state (denoted by $\tau$): underuse, overuse or normal, based on $m_t$ and the threshold (see Fig. 2). The network state is then mapped to a controller state (increase, decrease or hold) using an AIMD algorithm to predict the currently available bandwidth according to the prevailing network state. If the controller state is decrease, then the controller sets the rate control region to state NearMax. Once the controller state changes to increase and the rate control region is in state NearMax, the controller sets $\bar{x}_t = c_t$. Otherwise, if the controller state is increase and the rate control region is in state MaxUnknown, the controller sets $\bar{x}_t = \beta \times c_t$. Therefore, the controller additively increases $x^r_t$ based on the rate control region.
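The state handling described above can be illustrated with a minimal Python sketch. It is not Algorithm 1 itself: the symmetric threshold test in `detect_signal` and the transition table in `next_controller_state` follow the usual GCC-style state machine and are assumptions, as are all function and enum names. Only the value $\beta = 1.08$, the NearMax/MaxUnknown regions and the two rules for the additive value come from the text.

```python
from enum import Enum


class Signal(Enum):
    UNDERUSE = "underuse"
    NORMAL = "normal"
    OVERUSE = "overuse"


class ControllerState(Enum):
    INCREASE = "increase"
    DECREASE = "decrease"
    HOLD = "hold"


class Region(Enum):
    MAX_UNKNOWN = "MaxUnknown"
    NEAR_MAX = "NearMax"


BETA = 1.08  # coefficient value quoted in the text


def detect_signal(m_t, threshold):
    """Compare the delay slope m_t with the adaptive threshold (assumed symmetric test)."""
    if m_t > threshold:
        return Signal.OVERUSE
    if m_t < -threshold:
        return Signal.UNDERUSE
    return Signal.NORMAL


def next_controller_state(state, signal):
    """GCC-style transition table (an assumption, not taken from Algorithm 1)."""
    if signal is Signal.OVERUSE:
        return ControllerState.DECREASE
    if signal is Signal.UNDERUSE:
        return ControllerState.HOLD
    # Normal signal: leave 'decrease' via 'hold', otherwise increase.
    if state is ControllerState.DECREASE:
        return ControllerState.HOLD
    return ControllerState.INCREASE


def update_region(state, region):
    """A 'decrease' state moves the rate control region to NearMax, as in the text."""
    return Region.NEAR_MAX if state is ControllerState.DECREASE else region


def additive_value(state, region, c_t):
    """Additive value for the 'increase' state; None when no increase applies."""
    if state is not ControllerState.INCREASE:
        return None
    return c_t if region is Region.NEAR_MAX else BETA * c_t
```

For instance, a slope well above the threshold yields an overuse signal, the controller moves to decrease, and the region becomes NearMax; a later increase then uses the measured receiving rate $c_t$ directly rather than $\beta \times c_t$.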
2) Learning-Based (DRL) Rate Controller: BoB implements an RL agent that interacts with the environment encompassing the communication process between the sender and receiver in the RTC system. For the BoB model training, the packet-level statistics (input) are collected periodically during a fixed time window of $W_t = 200$ ms and aggregated as the environment state. Subsequently, the agent predicts the bandwidth, which represents an action value. Formally, the RL agent interacts with the environment that defines a state space denoted by $S$. At each time window $W_t$ (at time epoch $t$), the RL agent receives a state $s_t \in S$ from the environment and then takes an action $a_t \in A$ (bandwidth prediction for the next time window $W_{t+1}$) while it receives a reward $r_t \in R$. The essential objective of the agent is to find an optimal policy $\pi : S \to A$ that maps states to actions, maximizing the overall reward (i.e., finding the bandwidth that maximizes the receiving rate while minimizing the packet loss and delay). After the bandwidth prediction action $a_t$ is taken, the BoB environment observes the new receiving rate, packet loss, delay and the predicted bandwidth, and transitions to the next state $s_{t+1} \in S$, while updating the reward $r_{t+1} \in R$. The DRL controller is depicted in Fig. 3.
a) Input State Space and Network: At each time window $W_t$, the state input is a $1 \times 11$ vector defined as $s_t = \{c_t, d_t, l_t, \vec{X}^r_t\}$, comprising the receiving rate $c_t$ (bps), packet delay $d_t$ (ms), packet loss ratio $l_t$ (%) and the $n$ most recent bandwidth prediction samples $\vec{X}^r_t$ (bps). We normalize each state input using a linear-to-log() function to the value range [0, 1]. We then feed the current state $s_t$ as the input to the actor-critic network that comprises two neural networks. As depicted in Fig. 4, $\vec{X}^r_t$ is fed into a 1DConv (LSTM) layer in time order for feature extraction. The main insight behind using LSTM is capturing the temporal characteristics of the bandwidth variation. Thus, the accuracy of the bandwidth prediction can be improved. Other inputs are fed into a linear, fully-connected (FC) layer with a Rectified Linear Unit (ReLU()) activation function. After that, the input layers are concatenated and finally fed into the hidden layers. Results from the concatenation are then aggregated in three levels of FC layers that use 514, 320 and 64 neurons, with a ReLU() activation function with a slope of 0.5.
We use the same structure for both actor and critic networks, but with different outputs. For the actor network, we use a softmax() distribution function followed by a logarithm (log_softmax()) as the last FC layer with L2 normalization of the network, resulting in an output in the range from 0 to 1. The output (selected action) is then mapped to a value between 0.01 and 8 Mbps (as fixed in AlphaRTC [5], [27]) as the bandwidth prediction using a log-to-linear() function. The critic network is similar to the actor without log_softmax() in the last layer, resulting in output state-values, denoted by $V^{\pi_\theta}(s_t, w)$ (value function), that help the actor network update the policy distribution in the direction suggested by the critic network (such as with policy gradients). We note that each 1DConv layer uses a 3 × 3 convolution with 64 filters to extract implicit features, and is followed by a ReLU() activation function that tries to maintain a non-zero policy gradient over the whole training phase. Therefore, the vanishing gradient problem is avoided while the training time is reduced.
b) Action Space: In each time window $W_t$, the BoB policy $\pi_\theta$ maps $s_t$ to a compact action space whose values range between 0.01 and 8 Mbps. Specifically, $A = \{a_0 : 0.01-2 \text{ Mbps}, a_1 : 2-4 \text{ Mbps}, a_2 : 4-6 \text{ Mbps}, a_3 : 6-8 \text{ Mbps}\}$, representing an appropriate range of bandwidth prediction for RTC systems. Therefore, the output is a $1 \times 4$-dimensional vector that identifies the state-action probabilities produced by log_softmax(). Then, $\pi_\theta : s_t \to a_t$ maps the state to a suitable action ($a \in \{a_0, \ldots, a_3\}$) based on the state-action probabilities, i.e., the agent policy selects the action with the highest probability.
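As a hedged illustration of the action selection just described, the following Python sketch applies a log-softmax to four raw scores and picks the most probable of the four bandwidth ranges. The names `log_softmax`, `select_action` and `ACTION_RANGES_MBPS` are hypothetical, and the sketch deliberately stops at the selected range: how a single bandwidth value is derived inside that range (the log-to-linear() step) is not specified here and is not modeled.

```python
import math

# The four action ranges in Mbps, as listed in the text.
ACTION_RANGES_MBPS = [(0.01, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 8.0)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def select_action(log_probs):
    """Return (index, range) of the action with the highest log-probability."""
    if len(log_probs) != len(ACTION_RANGES_MBPS):
        raise ValueError("expected one log-probability per action")
    index = max(range(len(log_probs)), key=lambda i: log_probs[i])
    return index, ACTION_RANGES_MBPS[index]


# Example: made-up actor outputs for one time window.
scores = log_softmax([0.1, 2.0, -1.0, 0.3])
action_index, mbps_range = select_action(scores)
```

Because the logarithm is monotonic, taking the maximum of the log-probabilities selects the same action as taking the maximum of the probabilities.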
c) Reward Function: The reward $r_t$ is calculated after each action $a_t$ is taken to ensure that BoB can learn from past experience. It reflects the accuracy of the bandwidth prediction according to the user QoE. At each time window $W_t$, we define $r_t$ following [27]: the agent is rewarded when it receives more packets (leading to higher QoE) and penalized when packet delay/loss increases (leading to lower QoE).

d) Training Algorithm: We use the Advantage Actor-Critic with on-policy Proximal Policy Optimization (PPO) and the Adam optimizer for policy updates. During the training, the objective of BoB is to maximize the total discounted cumulative reward $R_t$. Here, $T_{\pi_\theta}$ denotes the batch size for updating the policy gradient (fixed to 4,000 time windows per episode in our simulations), $\gamma \in [0, 1]$ serves as a discount factor (usually set to 0.99 or 0.9) and $R_t$ represents the discounted cumulative reward from time $t$ to the end of the RTC session. The objective of the actor network is to find a policy $\pi : \pi_\theta(s, a) \to [0, 1]$ to maximize $R_t$, where $\pi_\theta : s \to a$ is the probability distribution over the different actions in $A$. The stochastic policy $\pi_\theta$ is responsible for selecting an action $a$ with the highest probability. On the other hand, the critic network is responsible for making an objective assessment for each current state $s_t$ using a value function $V^{\pi_\theta}(s_t, w)$.
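To make the notion of a discounted cumulative reward concrete, here is a small Python sketch that computes it for every time window of a finished batch. It assumes the standard definition, in which the return at time $t$ is $r_t$ plus $\gamma$ times the return at $t+1$; the function name `discounted_returns` is hypothetical and the exact summation limits used for BoB are not reproduced here.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [R_0, ..., R_{T-1}] with R_t = r_t + gamma * R_{t+1} (standard definition)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each R_t reuses the already computed R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Example with three unit rewards and a deliberately small discount factor.
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
```

With $\gamma$ close to 1 (e.g., 0.99), rewards far in the future still contribute noticeably to $R_t$, whereas a smaller $\gamma$ makes the agent focus on the next few time windows.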
In the training algorithm, we use PPO and the Adam optimizer to update the policy gradient such that $R_t$ is maximized at every training episode. In this update, $\Theta$ is the total number of episodes and $A^{\pi_\theta}$ is the advantage function, which expresses the difference in the cumulative reward between the actual value obtained after selecting the action $a_t$ based on policy $\pi_\theta$ at $s_t$ and the expected value. The advantage function is calculated as a function of $R_t$ and a baseline $b_t$ that has a significant impact on the convergence of the total cumulative reward $R_t$. In the DRL model, we found that $A^{\pi_\theta}$ did not work well. Hence, we replaced it with an estimate computed by the actor network using the k-step Temporal Difference (TD) method. For each training step, the actor network strives to maximize $R_t$ through maximizing $A^{\pi_\theta}$, i.e., making better action decisions than the current policy $\pi$. Therefore, the parameter $\theta$ of the actor is updated via a stochastic gradient ascent algorithm, where $\alpha$ is the learning rate and $\nabla_\theta \log \pi_\theta(a_t, s_t)$ represents the dynamics that the parameter $\theta$ accounts for in order to achieve the objective. It is worth noting that BoB leverages dropouts with probability $p = 0.5$ to add a regularization term to the update of the actor network, which helps to alleviate overfitting issues. Such a regularization term can be considered the entropy of the probabilities over the bandwidth prediction decisions $H(\pi_\theta(\cdot \mid s_t))$, which promotes exploration and avoids severe overfitting. The critic network is responsible for making an objective assessment for all the states $\forall s_t \in S$ during the training. To do so, the critic network uses the standard TD method to compute the loss function and minimize its value. Hence, the parameter $w$ of the critic network is updated through a stochastic gradient descent algorithm, in which $V^{\pi_\theta}(s_t, w)$ and $V^{\pi_\theta}(s_{t+1}, w)$ are the objective assessments for $s_t$ and $s_{t+1}$, respectively, from the critic network.
We update the policy $\pi_\theta$ periodically every $k$ steps ($k < T_\theta$, the update interval) using PPO with a clipped objective and the Adam optimizer. PPO aims to optimize (via Adam) the clipped objective function of [52], in which $\mathbb{E}$ denotes the empirical expectation over time steps, $\mathrm{ratio}_t(\theta) = \pi_\theta(s_t, a_t) / \pi_{\theta_{old}}(s_t, a_t)$ is the ratio of the probabilities under the new and old policies, and $\varepsilon$ is the clip hyperparameter (usually fixed to 0.1 or 0.2).
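The clipped objective can be sketched in plain Python as follows. This uses the standard PPO form from the original PPO formulation, i.e., the empirical mean of min(ratio × advantage, clip(ratio, 1 − ε, 1 + ε) × advantage); the function name and the way advantages are supplied are assumptions for illustration, not BoB's implementation.

```python
def ppo_clip_objective(new_probs, old_probs, advantages, eps=0.2):
    """Empirical mean of the standard PPO clipped surrogate over a batch of time steps."""
    if not (len(new_probs) == len(old_probs) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    if not new_probs:
        raise ValueError("batch must not be empty")
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old                          # ratio_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)       # pessimistic (lower) bound
    return total / len(new_probs)
```

Taking the minimum removes the incentive to move the new policy far from the old one: with a positive advantage the gain is capped once the ratio exceeds 1 + ε, and with a negative advantage the penalty cannot be reduced by pushing the ratio below 1 − ε.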
3) Adaptive Selector: The main purpose of the adaptive selector is to decide when to switch between the heuristic and learning-based rate controllers. With this functionality, we enable a hybrid bandwidth prediction and increase the accuracy of the DRL controller in the long term. Bandwidth prediction is likely to be inaccurate (because of bandwidth underprediction caused by the lack of data; i.e., transmitted packets from the sender to the receiver) at the beginning of a session, since the values returned from the DRL controller at that time are mostly related to the training dataset.
To overcome this possible inaccuracy, we compare the prediction results obtained from the DRL controller with those from the heuristic controller and validate their accuracy.To do so, we use symmetric mean absolute percentage error (sMAPE).First, we compute the absolute difference (Dif t ) between the predicted bandwidth values given by the heuristic controller (Heuristicbw t ) and the DRL controller (DRLbw t ).Second, we compute the average predicted bandwidth value (Avg t ) based on both controllers.If the output from the percentage Dif t Avg t is equal to or more than 30%, the algorithm decides not to use the DRL controller and feeds the output of the heuristic controller to the DRL controller for later use.In time, the percentage between the outputs of the two controllers reduces and the DRL controller starts making a better prediction.The essential steps of the adaptive selector are highlighted in Algorithm 2. We note that this algorithm also monitors the difference between DRL and heuristic controllers in case of deviations (corner cases) from the expected converged predictions from both controllers.If a deviation happens, it switches back to the heuristic controller.However, we observed this situation only occasionally under some network conditions.Once the DRL controller starts performing well, it keeps doing so in the long run.
We chose the 30% threshold for switching between the BoB controllers empirically. We performed extensive experiments to find a suitable percentage that resulted in high bandwidth prediction accuracy and good scores (defined in Section IV-C) in the long term (i.e., the whole live video session). In particular, we ran many tests with various percentage values, from 5% up to 50%, using different network conditions (see Section III-A1) and video content (same as given in Section IV).
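The selection rule described above can be sketched as follows. This is a minimal illustration rather than the released Algorithm 2: the function name, the returned tuple, and the fallback to the heuristic when both predictions are zero are assumptions; only the Dif_t / Avg_t test and the 30% threshold come from the text.

```python
from typing import Tuple


def select_bandwidth(heuristic_bw: float,
                     drl_bw: float,
                     threshold: float = 0.30) -> Tuple[str, float]:
    """Pick a controller output using the relative gap Dif_t / Avg_t."""
    dif = abs(heuristic_bw - drl_bw)        # Dif_t
    avg = (heuristic_bw + drl_bw) / 2.0     # Avg_t
    # Assumption: with no usable average, stay with the heuristic controller.
    if avg <= 0.0:
        return "heuristic", heuristic_bw
    # A gap of 30% or more means the DRL output is not trusted yet.
    # (In BoB the heuristic output is also fed to the DRL controller
    # for later use; that step is not modeled here.)
    if dif / avg >= threshold:
        return "heuristic", heuristic_bw
    return "drl", drl_bw
```

For example, predictions of 1000 Kbps (heuristic) and 400 Kbps (DRL) differ by about 86% of their average, so the heuristic value is used, whereas 1000 Kbps and 900 Kbps differ by about 11% and the DRL value is used.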

D. (Sender-Side) Loss-Based Controller
The sender and receiver controllers complement each other to select a suitable bitrate. The loss-based controller is located at the sender and is responsible for selecting the sending rate based on the packet loss ratio. At every time window W_t, the sender receives an RTCP feedback message from the receiver carrying the predicted bandwidth x^r_t and the loss ratio l_t computed at the receiver. Based on this, the sender selects the sending rate x^s_t as follows: Here, the selected sending rate x^s_t changes depending on the loss ratio l_t, where: (i) x^s_t remains constant in case l_t is small (0.02 ≤ l_t ≤ 0.1), (ii) x^s_t decreases multiplicatively in case l_t is high (l_t > 0.1), and (iii) x^s_t increases multiplicatively in case l_t is very small (l_t < 0.02). The final selected sending rate is then computed as follows: x_t = min(x^r_t, x^s_t). This value x_t is provided to the encoder as the target bitrate. The chosen loss ratio ranges are given by GCC, as referenced from [22].
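The three loss regimes and the final min() step can be sketched as below. The thresholds (0.02 and 0.1) and the min(x^r_t, x^s_t) rule come from the text; the multiplicative factors (a decrease by 0.5 · l_t and an increase by 5%) are assumptions taken from the published GCC loss-based controller, since the corresponding equation is not reproduced here. Function names are hypothetical.

```python
def loss_based_rate(prev_rate: float,
                    loss_ratio: float,
                    decrease_factor: float = 0.5,
                    increase_factor: float = 1.05) -> float:
    """Sender-side rate x^s_t from the previous rate and the loss ratio l_t."""
    if loss_ratio > 0.1:
        # High loss: multiplicative decrease (factor assumed from GCC).
        return prev_rate * (1.0 - decrease_factor * loss_ratio)
    if loss_ratio < 0.02:
        # Very small loss: multiplicative increase (factor assumed from GCC).
        return prev_rate * increase_factor
    # Small loss (0.02 <= l_t <= 0.1): hold the rate.
    return prev_rate


def target_bitrate(receiver_rate: float,
                   prev_sender_rate: float,
                   loss_ratio: float) -> float:
    """Final rate x_t = min(x^r_t, x^s_t) handed to the encoder."""
    sender_rate = loss_based_rate(prev_sender_rate, loss_ratio)
    return min(receiver_rate, sender_rate)
```

Taking the minimum means the encoder never exceeds the more conservative of the two estimates, so a receiver-side overprediction is still bounded by the loss observed at the sender.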

E. Parameter Choices and Training Setup
We fixed α and β at 0.85 and 1.08, respectively, in the BoB delay-based controller. These values have been empirically tuned based on our experiments and our finding is also aligned with [22]. For the BoB DRL controller, training parameters can impact its performance, so we empirically set the parameters as follows: the maximum number of episodes N to 2,000, the policy update interval T_θ to 4,000 time windows, the PPO k-steps to 20, the PPO clip parameter to 0.2, the discount factor γ to 0.99, the Adam learning rate lr to 3×10^−5, the Adam β to 0.999, the number of recent samples n to eight, and the time window W_t during which the states are captured to 200 ms. To train our DRL model, we used around 500 network traces in total from different datasets: the ACM MMSys'21 grand challenge on bandwidth estimation in RTC dataset [3], Belgium 4G/LTE [56], Norway 3G/HSDPA [50], NYU LTE [46], FCC [28], and Synthetic [12]. We randomized and divided them into two sets: 80% for BoB training and 20% for BoB testing. With the 80-20 train-test split, we performed 5-fold walk-forward cross-validation on each dataset.
The training output is one DRL model with the .pth extension, which we use for the online inference. Our results are presented in Section IV.

F. BoB Implementation and Challenges
To implement BoB, we used the AlphaRTC platform [5], provided by Microsoft's grand challenge on RTC [3], which comprises two main parts: offline training and online testing.
1) Offline Training: The trace-driven simulator mainly uses PyTorch v1.10 [47] for the deep reinforcement learning components and implements the GYM for a typical RTC system. The GYM uses ns-3 and WebRTC applications to simulate a sender-receiver RTC environment. The BoB model training uses real-world network traces to simulate the network conditions between the sender and receiver in terms of available bandwidth, RTT and packet loss.
2) Online Testing: The AlphaRTC [5] framework is a fork of Google's WebRTC project with machine learning-based bandwidth estimation. We use this framework and plug in the BoB bandwidth predictor for RTC system testing using real-world traces. The BoB controller is implemented in Python and consists of about 1,700 lines of new code, available online at [11]. In this code, the BoB controller is implemented as a class in the file BandwidthEstimator_bob.py, which comprises three functions: AdaptiveSelector(), HeuristicController() and DRLController().
3) Challenges: It is well known [66], [69] that any DRL model requires significant data to converge to the best bandwidth prediction accuracy because of the training-to-testing gap issue [66]. Achieving the best bandwidth prediction requires a ramp-up during the live video session, which might hinder the overall performance of an RTC system. For example, during the design of BoB, we tried to use the DRL controller from the beginning of the video session, but we observed that our model experienced frequent bandwidth underprediction issues, which adversely impacted its convergence during the session. We also observed that the underprediction persisted for some time because the penalties (loss, delay) were close to zero due to the initial scarcity of data history, so the model behaved as if it was performing well. This is also one of the known issues with a DRL model, which trains an agent by giving it feedback (QoE; a combination of the receiving rate, delay and loss) for decisions while interacting with an environment. This explains why a DRL model requires some time to converge to the best bandwidth decisions. To avoid these issues, we used the heuristic controller to perform the bandwidth prediction decisions and at the same time collected enough data, allowing fast convergence of the DRL controller to the best decisions once the selector switched to it. For all considered network traces, we found that the BoB DRL model requires on average 500 training episodes to converge to the best bandwidth prediction decisions. In the testing phase, we found that at the beginning of a video session, the BoB DRL model requires at most 10-15 seconds of packet transmissions based on the heuristic controller to start performing well (refer to Section III-C3). This is important at the beginning of a live video session as the DRL model requires the latest eight bandwidth prediction values as one of the input channels of the NN. These values are given by the heuristic controller. Since the heuristic controller makes its bandwidth prediction decisions purely based on the last value of a heuristic such as packet loss or delay, it requires minimal data. However, the heuristic controller cannot easily be generalized to various network conditions as it heavily depends on some hardcoded configuration parameters. As a result, it might suffer from inaccurate bandwidth predictions under some network conditions. This motivates and confirms the hybrid selection choice for BoB.
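The dependence on the latest eight predictions can be pictured with a small rolling buffer that the heuristic controller fills before the DRL controller takes over. This is only a sketch of the idea: the class and method names are hypothetical and are not taken from the BoB code base.

```python
from collections import deque
from typing import List


class PredictionHistory:
    """Rolling window of the n most recent bandwidth predictions."""

    def __init__(self, n: int = 8) -> None:
        # A bounded deque drops the oldest value automatically.
        self._values = deque(maxlen=n)

    def push(self, bandwidth: float) -> None:
        """Record one prediction (e.g., from the heuristic controller)."""
        self._values.append(float(bandwidth))

    def ready(self) -> bool:
        """True once n predictions are available for the NN input channel."""
        return len(self._values) == self._values.maxlen

    def as_state(self) -> List[float]:
        """Oldest-to-newest copy of the stored predictions."""
        return list(self._values)
```

With a 200 ms time window, eight values correspond to 1.6 seconds of history, so the buffer itself fills quickly; the 10-15 seconds reported above reflects how long the DRL output takes to become accurate, not how long the buffer takes to fill.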

IV. PERFORMANCE EVALUATION
In this section, we evaluate the effectiveness of BoB against the purely heuristic-based (GCC) approach and the latest hybrid (heuristic and learning-based) approaches proposed for RTC systems, including Gemini [55] and HRCC [57]. Our evaluation is divided into two setups: emulation-based and Internet-based.

A. Evaluation Setups
1) Emulation-Based Setup: In order to evaluate the effectiveness of BoB over an end-to-end controlled system, we used a physical machine running Ubuntu 18.04 LTS OS with dual 20-core Intel E5-2630 v4 @ 2.20 GHz processors and 192 GB memory. We ran the trace-driven framework (AlphaRTC) in an isolated environment using the Docker container provided by the Microsoft team, and we installed extra library dependencies for the tc [4] command to be able to throttle the bandwidth between the sender and receiver and introduce packet loss/delay following the network profiles highlighted in Fig. 5. The Estimator class, contained in the source file BandwidthEstimator.py, is used by the Docker environment to call the desired bandwidth estimator (BoB, HRCC or Gemini), and the get_estimated_bandwidth() method of the Estimator class is invoked as packets arrive in the setup. Each solution logs the predicted bandwidth values, bandwidth prediction accuracy and error, receiving rate score, delay score, packet loss score, network score, video score and total score.
Network Profiles: The network profiles we used in the evaluation, re-purposed for this work from [15], are shown in Fig. 5. The profiles are extracted randomly from the 20% of network traces assigned for testing, namely: LTE, Twitch, Cascade, FCC Amazon and Synthetic. For FCC Amazon and Synthetic, we fixed the delay to 50 ms and the loss to 0.08%.
Video Sample: For the video sample, we used the Big Buck Bunny video sample [2] with 24 fps and 640×360 pixel resolution, which was approximately one minute long. The simulation configuration files are given in [5], including receiver_pyinfer.py and sender_pyinfer.py, and were updated with the test video source and properties. In these files, there is also an autoclose parameter that specifies the duration (in seconds) of the system test to be performed. In our simulations, this parameter was set to 60 seconds.
2) Internet-Based Setup: OpenNetLab provides an Internet-based public testbed (https://opennetlab.org/) that creates a unified measuring platform to validate the performance of RTC-based solutions, including BoB, under unseen network conditions in the wild through initiating several end-to-end RTC calls. This testbed includes wired, wireless and mobile networks and heterogeneous nodes with support from universities throughout Asia. The nodes are in China (Beijing, Hefei, Nanjing, Lanzhou, Shenzhen and Hong Kong), South Korea (Seoul and Daejeon) and Singapore (Queenstown). The set of nodes in the testbed is coordinated using Azure Backend microservices.
To test the end-to-end RTC calls with the BoB solution versus competitors (heuristic-based, Gemini and HRCC) over the Internet, we submitted a performance validation job by uploading the BoB trained DRL model and algorithms.We also specified the predefined resource (compute node and network type) and predefined scenarios (A, B and C) via a Web-based frontend.For each video sample, each scenario was run five times in a round-robin manner.These scenarios are highlighted in Table II and Fig. 6 shows the setup of the testbed.To validate the performance of BoB's competitors, we used the same process.
Network Profiles: The public Internet-based testbed offers three types of network characteristics: High, Medium and Low Bandwidth (BW).The details of each network are highlighted in Table II.
Video Sample: Each scenario (Table II) runs with different types of video samples including animation, movie, conversation, presentation and screen sharing over a remote desktop.
Each video is five minutes long with various sets of frame rates (fps) and resolutions.

B. Comparisons
We compared BoB against three approaches: the heuristic approach, the winner (Gemini [55]) and the runner-up (HRCC [57]) of the ACM MMSys'21 grand challenge on bandwidth estimation in RTC, organized by Microsoft. We selected Gemini and HRCC because (i) they represent the latest solutions using a hybrid approach, (ii) they are the winner and runner-up of the grand challenge, and (iii) their implementations are available in AlphaRTC, which allows us to replicate their claimed results.

C. Evaluation Metrics
We tested the efficiency of BoB and other approaches using the following evaluation metrics:
1) Bandwidth Prediction Error and Accuracy: The bandwidth prediction error and accuracy are calculated based on the symmetric mean absolute percentage error (sMAPE). The sMAPE is an accuracy measure based on percentage (or relative) errors between predicted bandwidth values (x_t) and actual network profile values (y_t) for the total samples T, and its function is given as follows:
2) Network Score: The network score (denoted by N_s) is computed as a combination of three metrics: delay score (d_s), loss score (l_s) and receiving rate score (c_s), as follows: Here, w_1 = w_2 = 0.1 and w_3 = 0.5 are the weights of the network score. The max_delay is fixed to 400 ms and min_delay is the minimum delay achieved during the RTC session. The ground_truth_c refers to the corresponding average bandwidth that can be obtained in an ideal environment (such as when there is no loss and no delay). Since we have the network profiles for the experiments, it is easy to compute ground_truth_c, which is fixed as the overall average actual bandwidth value in each corresponding network profile (Cascade: 220 Kbps, LTE: 741 Kbps, Twitch: 335 Kbps, FCC Amazon: 676 Kbps, Synthetic: 581 Kbps). Finally, l is the packet loss ratio.
3) Video Score: The video score (denoted by V_s) is calculated with respect to video perceptual quality based on Video Multi-Method Assessment Fusion (VMAF) as follows: where vmaf_score is the average VMAF value (ranges between 0 and 1) computed based on per-frame VMAF values resulting from the source and encoded video.
4) Total Score: The total score (denoted by T_s) is computed as a combination of N_s and V_s, as follows: where w_4 is the weight factor associated with the video score, which is fixed to 0.3, and Σ_{i=1}^{4} w_i = 1. We note that the network, video and total score formulation was originally supplied by the Microsoft grand challenge organizers [6]. These scores cover all the main metrics to evaluate the QoE performance of an RTC system, which are widely used in many papers such as [27], [55], [66], [69]. For instance, the video score uses VMAF, the widely used metric proposed by Netflix to compute video perceptual quality, while the network score combines the important metrics for an RTC system including packet loss, delay and receiving rate.
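As an illustration of the first metric, the sketch below computes sMAPE between predicted values x_t and actual values y_t. It uses one common definition, in which each absolute error is divided by the mean of |x_t| and |y_t|, so the result lies between 0 and 2; the exact normalization used in the evaluation is not reproduced in this text, so this variant and the treatment of samples where both values are zero are assumptions.

```python
from typing import Sequence


def smape(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error (common 0-to-2 variant)."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for x_t, y_t in zip(predicted, actual):
        denom = (abs(x_t) + abs(y_t)) / 2.0
        # Assumption: a sample with x_t = y_t = 0 contributes zero error.
        if denom > 0.0:
            total += abs(x_t - y_t) / denom
    return total / len(predicted)
```

A perfect prediction gives 0, while predicting 150 Kbps against an actual 50 Kbps gives 1.0, the same value as predicting 50 Kbps against an actual 150 Kbps, which is the symmetry the metric is named for.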

D. Results and Analysis:
We now compare and describe the performance of different solutions. For statistically meaningful results, we repeated all experiments five times for each solution with the same configuration, and all the presented results show the averages over the five runs. We divided our results into two setups: emulation-based and Internet-based.
1) Emulation-Based Results: First, we analyze the performance in terms of the bandwidth prediction accuracy that each solution achieves. Then, we compare the performance of different solutions in terms of network, video and total scores, expressed with their metrics.
Bandwidth Prediction Accuracy: The time series plots for different solutions for every network profile are depicted in Fig. 7. The overall average bandwidth prediction accuracy and prediction error in terms of sMAPE are provided in the first two columns of Table III. The red solid lines in Fig. 7 represent the actual bandwidth for the network profiles. A superior solution must determine a bandwidth within close proximity of these solid lines. Overall, we notice that BoB achieves the best bandwidth prediction accuracy (and the lowest prediction error). Specifically, BoB improves the overall average bandwidth prediction accuracy by 67.63% (Cascade: 61.72%, LTE: 41.71%, Twitch: 81.64%, FCC Amazon: 73.27%, Synthetic: 79.80%), and reduces the overall average bandwidth prediction error by 49.11% (Cascade: 38.95%, LTE: 46.45%, Twitch: 62.14%, FCC Amazon: 71.07%, Synthetic: 26.94%) compared to the other solutions across all the network profiles. Only in the Cascade profile is HRCC slightly better than BoB in terms of the average bandwidth prediction accuracy, with a marginal improvement of 0.19%. Therefore, HRCC was able to achieve better receiving rate, network, video and total scores compared to BoB in the Cascade profile.
Compared to BoB, we also observe that other solutions generally suffer from either bandwidth overprediction or underprediction due to their designs. As shown in Fig. 7, Gemini tends to underutilize the bandwidth, which expectedly produces a low receiving rate score but also a higher packet delay and loss score than BoB and HRCC. This happens because Gemini fails to switch in a timely manner between the learning and heuristic-based prediction. For example, for the FCC Amazon profile (see Fig. 7(d)), the learning-based prediction for Gemini fails to track the increase in the actual bandwidth. Similarly, HRCC generally fails to learn suitable parameters for the heuristic-based algorithm, which leads to a bandwidth underprediction issue for various network profiles, which is most visible in Figs. 7(b), 7(c) and 7(d). As a result, HRCC suffers from poor video quality. This also confirms that HRCC is more suitable for more stable, low-bandwidth scenarios.
One interesting observation is that the Heuristic solution is not able to recover from bandwidth underestimations during the whole RTC session, which contributes to poor video quality (see the score results in Table III). This outcome confirms the difficulty and importance of bandwidth prediction in RTC [3], and also shows how urgent it is to have a hybrid solution that combines learning- and heuristic-based algorithms. In contrast, BoB harmoniously fuses both algorithms and tries to predict the bandwidth within a small margin of its actual value during the RTC session, and works equally well across different network profiles.
Scores and Their Metrics: We evaluate different solutions in terms of network, video and total scores and their metrics (see Section IV-C). The average total scores are given in Fig. 8 and the individual metrics are tabulated in Table III. Fig. 8 shows that BoB has the highest performance in the LTE, Twitch, FCC Amazon and Synthetic network profiles.
For the Cascade profile, BoB and HRCC perform similarly in terms of the total score and average prediction error but differ in terms of delay and loss scores. BoB can cause increased delays without significantly increasing the packet loss, whereas HRCC has less delay but more packet loss. At the end of the RTC session, the network scores were quite close with these different trade-offs. In the LTE profile, BoB uses higher receiving rates with a delay and loss cost, which still results in a better video score.
In the Twitch profile, BoB has the highest average bandwidth prediction accuracy, again resulting in high bitrates without inducing much delay and loss. The reason is that BoB can upshift fast and utilize the available bandwidth after the first 20 seconds (which confirms the convergence of the learning-based algorithm to the optimal solution), whereas both Gemini and HRCC still underpredict the bandwidth most of the time. Overall, BoB achieves the smallest average prediction error with a value of 0.38 and the highest average prediction accuracy with a value of 81.03%. As for the Heuristic, the prediction accuracy is the worst, yet it has the highest delay and loss scores. The overall indicators imply that there is further room for improvement in RTC systems, where the bandwidth prediction and bitrate selection should be jointly considered to achieve better application performance, i.e., to use the full available bandwidth for the media without inducing significant packet delay or loss under diverse network conditions.
Results Summary: In all the considered experiments, BoB performs better in most performance metrics and outperforms the Gemini, HRCC and Heuristic solutions under various network conditions. This is mainly due to BoB's design, which combines heuristic and learning-based controllers for bandwidth prediction and bitrate selection for RTC systems. The percentages of improvement (%) achieved by BoB versus other solutions are calculated by comparing BoB's results with the ones obtained by each solution. The results are summarized in Table IV.

E. Internet-Based Results
We further validate the performance of BoB against its competitors in terms of network, video and total scores through the OpenNetLab public Internet-based testbed. Fig. 9 shows the average scores for different scenarios. First, BoB achieves the highest scores (video, network and total) compared to Heuristic, Gemini and HRCC in all scenarios. This demonstrates and validates the capabilities of BoB in adapting to unseen network conditions. Second, HRCC suffers from a low delay score, while Heuristic suffers from a low loss score in a low bandwidth network. On the contrary, HRCC suffers from a low loss score, while Heuristic suffers from a low delay score in a medium bandwidth network. Such an outcome confirms their low network scores in low and medium bandwidth scenarios. Third, Gemini is the runner-up to BoB; however, it does not perform well in the high bandwidth scenario. Overall, the results here are quite similar to the ones obtained in the controlled emulation-based experiments. Specifically, BoB provides the following improvements over Gemini, HRCC and Heuristic, respectively:

V. DISCUSSION AND OPEN DIRECTIONS
To inspire further work in this area, we discuss three interesting future research directions in RTC systems.
1) We believe that QoE metrics and bandwidth prediction accuracy should be jointly optimized for better performance in RTC systems. BoB aims to achieve this objective but leaves room for improvement under more complex network conditions and RTC-based application requirements.
2) Various bandwidth prediction models may perform differently based on which metrics they tend to prioritize or sacrifice, which makes comparisons between these models and drawing conclusions difficult, especially if their scores are similar. One way to compare them is to find the best and worst performing model in each QoE metric and quantify the relative performance of the other models against these boundaries or targets so that a system implementer may choose a scheme based on his/her own priorities and preferences. Note that the QoE is a compound metric and if the aggregated values are similar, then the individual components (latency, bandwidth variations, etc.) can be examined to compare different prediction models.
3) Fairness is an important aspect when deploying a solution on the Internet, where competition usually exists between different streams for the available bandwidth in a shared network environment (either on the server or the client side). This competition can be between intra (e.g., between different RTC streams) or inter traffic (e.g., between RTC and non-RTC streams like HTTP-based streaming traffic). We believe that analyzing fairness and building a fairness-aware solution is critical for optimizing the QoE. Also, the designed solution should consider the impact and diversity in transport-layer congestion control protocols (BBR, NewReno, Cubic, NADA, SCReAM, etc.). Note that the definition of fairness deserves some examination, too. For example, is it fair to treat a small phone (small screen and likely one viewer) and a large big-screen TV (large screen and likely more than one viewer) the same [10]?

VI. CONCLUSION
In this study, we developed a receiver-side hybrid bandwidth predictor for RTC services, named BoB. Hybrid prediction is achieved using a heuristic and a learning-based controller. The heuristic uses a delay filter, while the learning-based mechanism uses DRL actor-critic networks with PPO and an Adam optimizer for model training and policy updates. To perform the bandwidth prediction task, BoB uses the heuristic-based controller at the beginning of each session and then switches to the learning-based controller for more accurate bandwidth prediction. As a result, BoB can achieve a higher receiving rate with reduced packet delay and loss ratio, contributing to a better user experience. During each fixed time window, BoB collects packet-level data, including the receiving rate, packet delay, packet loss and the last eight predicted bandwidth values, as the input state into the neural network to predict the bandwidth for the next time epoch.
BoB has been integrated into AlphaRTC and the results show the superiority of BoB for bandwidth prediction in RTC. BoB achieves up to 15.62% and 27.87% better bandwidth prediction accuracy than Gemini and HRCC (the winning and runner-up solutions, respectively, in the ACM MMSys'21 grand challenge) under various challenging network conditions. For future work, we plan to implement FEC techniques using DRL and perform larger scale real-world experiments.

Fig. 4. The neural network design for training in BoB.

Fig. 5. The network profiles used in the simulations.

Fig. 7. Actual and predicted bandwidth for different network profiles.

Fig. 7(d) illustrates that BoB has the best fit for the actual bandwidth values, especially after the 25th second.
Manuscript received 23 March 2022; revised 14 August 2022 and 17 October 2022; accepted 17 October 2022. Date of publication 21 October 2022; date of current version 1 November 2023. This work was supported in part by the Singapore Ministry of Education Academic Research Fund Tier 2 under MOE's Official under Grant T2EP20221-0023, and in part by the Scientific and Technological Research Council of Turkey under Grant 120C154. The Associate Editor coordinating the review of this manuscript and approving it for publication was Prof. Mea Wang. (Corresponding author: Abdelhak Bentaleb.)

TABLE I
AVERAGE RESULTS IN TERMS OF SMAPE AND TOTAL SCORE FOR DIFFERENT PERCENTAGE VALUES OF THE ADAPTIVE SELECTOR

Table I summarizes the outcome over all the tests. As one can see, the percentage of 30% achieves the best performance in terms of the lowest sMAPE and highest total score compared to the other percentages.

TABLE II
SCENARIOS FOR THE PUBLIC INTERNET TESTBED

TABLE III
AVERAGE SIMULATION RESULTS FOR DIFFERENT NETWORK PROFILES (↑: HIGHER IS BETTER, ↓: LOWER IS BETTER)

TABLE IV
SUMMARY OF THE AVERAGE RESULTS. PERCENTAGE IMPROVEMENTS OF BOB OVER THE OTHER SOLUTIONS, AT SCALE

As a result of its accurate bandwidth prediction, BoB achieves the highest receiving rate score and a loss score slightly lower than the best one achieved by the other solutions. Synthetic is one of the most challenging profiles as it exhibits fast and sudden changes in the bandwidth. Even if the bandwidth is changing frequently, the prediction error is the smallest with BoB. Moreover, BoB achieves the highest receiving rate score.