Improved Quality of Online Education Using Prioritized Multi-Agent Reinforcement Learning for Video Traffic Scheduling

The recent global pandemic has transformed the way education is delivered, increasing the importance of video-based online learning. However, this puts significant pressure on the underlying communication networks, and the limited available bandwidth needs to be intelligently allocated to support a much higher transmission load, including video-based services. In this context, this paper proposes a Machine Learning (ML)-based solution that dynamically prioritizes content viewers with heterogeneous video services to increase their Quality of Service (QoS) and perceived Quality of Experience (QoE). The proposed approach makes use of the novel Prioritized Multi-Agent Reinforcement Learning solution (PriMARL) to decide the prioritization order of the video-based services based on networking conditions. However, the performance in terms of QoS and QoE provisioning to learners with different profiles and networking conditions depends on the type of scheduler employed in the frequency domain to conduct the scheduling and the radio resource allocation. To decide on the best approach, we employ the proposed PriMARL solution with different types of scheduling rules and compare them with other state-of-the-art solutions in terms of throughput, delay, packet loss, Peak Signal-to-Noise Ratio (PSNR), and Mean Opinion Score (MOS) for different traffic loads and characteristics. We show that the proposed solution achieves the best user QoE results.

Broadband connectivity plays a central role in mitigating the economic aftermath of the pandemic and boosting the digital access and inclusiveness of different sectors [1]. One such sector of utmost importance is remote education and eLearning, which all regions of the world must have access to [2]. COVID-19 containment measures forced actors of the educational sector to remotely deliver large amounts of media content across the existing broadband infrastructure. Prior to the global pandemic, educational institutions were slowly moving towards a blended learning approach which combines the traditional physical classroom teaching with the adoption of various Information and Communication Technology (ICT)-based tools and solutions to improve the educational experience [3]. However, the global pandemic has accelerated the digital transformation of educational institutions by forcing the teaching-learning process to move to 'online only'. In this context, instructors rely on any form of video content (e.g., live video streaming, video on demand, etc.) as well as text and graphics, to improve the teaching-learning process within the online learning environment. Previous studies [4] have shown that the integration of instructional videos within the educational content can increase the effectiveness of online learning. However, moving from the optional adoption of ICT-based tools within the educational domain to a compulsory one, including video-based learning, does not come without challenges.
One of the existing challenges that was worsened by the pandemic is the issue of digital inequalities. The factors that contribute to these inequalities are [5]: (1) digital literacy; (2) access to hardware and/or software; (3) usage autonomy; and (4) social factors, such as peer interactions. Additionally, when it comes to video-based learning, there are several factors that impact learners' Quality of Experience (QoE), including the type of device (e.g., smartphone, laptop, desktop, etc.) and the quality of the Internet connectivity. To be able to accommodate an appropriate level of eLearning content for mobile learners, stable broadband connections are strongly demanded. To this end, network operators are pressured to ensure high levels of Quality of Service (QoS) and QoE while exchanging larger amounts of educational media content among an increasing number of mobile/online learners over the existing radio access networks.
Enabling good QoS provisioning over the wireless interface is challenging. A limited frequency spectrum must be allocated by a scheduling entity to an increasing number of users requesting different traffic types and experiencing a variety of network conditions [6]. In remote education, this is even more of a challenge, since a proper prioritization of the delivered services is needed to deal with different learner profiles, dynamic wireless conditions, device types, and content characteristics with heterogeneous QoS requirements [7]. Therefore, the focus of this paper is on packet scheduling and intelligent prioritization of eLearning content for mobile learners. To provide high QoS, we employ a solution based on Prioritized Multi-Agent Reinforcement Learning (PriMARL) [8] to allocate the limited frequency spectrum among an increased number of mobile learners accessing the radio interface. However, enabling high QoS provisioning does not guarantee acceptable QoE when scheduling video services with different degrees of heterogeneity in terms of data rates and QoS requirements. Therefore, the focus of PriMARL is to maximize both QoS and QoE provisioning for learners experiencing heterogeneous video services in eLearning.
In the literature, multi-agent reinforcement learning is used to deal with user association and resource allocation in heterogeneous cellular and millimeter wave networks [9], [10]. In our previous work [7], we considered the prioritization and scheduling aspects of educational content over the broadband networks and proposed a Hierarchical MARL (HiMARL) model based on a source-sync approach, where the source controller prioritizes video classes in the time domain and the sync controller performs the scheduling and resource allocation in the frequency domain. This method is highly efficient to deliver the requested heterogeneous video services in terms of QoS compared to other state-of-the-art approaches. However, there is no evaluation of the proposed scheduling technique in terms of QoE.

A. Addressed Use Case Scenario
According to a study conducted by Campbell [11], one of the most important issues in enabling eLearning over mobile technology (mobile learning) is the network speed and reliability. In this context, the use case scenario illustrated in Figure 1 is considered. Four types of mobile users access educational video services from a cloud mobile learning server via a 5G gNodeB base station. The mobile users are located in different geographical locations, use diverse device types (e.g., smartphones, laptops, tablets, VR gear, etc.), and have various network connectivity characteristics (e.g., poor, medium, or good connectivity). In this scenario, the network scheduler located at the level of the 5G gNodeB base station is responsible for allocation of the available radio resources to all users and maximizing the QoS parameters for each delivered video service, given the channel conditions, traffic types and characteristics, device resolutions, and prioritization policies. However, as noted in [7], the quality of user experience is important for the learning performance. Therefore, in this paper, we propose PriMARL, an ML-based decision-making framework that aims to increase the time and number of users (learners, instructors) experiencing high QoE levels when delivering a range of four video services.

B. Paper Contributions
The proposed PriMARL framework for downlink scheduling systems eliminates the need for a source-sync approach as employed in the previous work (HiMARL) and improves learner QoE when delivering heterogeneous educational video in different traffic load conditions. In contrast to [7], the contributions of this paper are as follows.
a) Prioritization-Driven Scheduler: The proposed approach focuses on service prioritization. It provides a low-complexity solution to the proposed optimization problem that decides in each Transmission Time Interval (TTI) the prioritization order of video classes with different QoS profiles in the time domain and considers particular scheduling rules in the frequency domain, i.e., Barrier Function (BF), Exponential (EXP), and Opportunistic Packet Loss Fair (OPLF) [6].
The remainder of this paper is organized as follows. In Section II, we discuss the related work carried out in this area. Section III introduces the system model, and in Section IV, we describe the proposed PriMARL-based solution. In Section V, we present an analysis of the obtained results, and Section VI concludes the paper.

II. RELATED WORKS
Recently, an increasing number of solutions that make use of Machine Learning (ML) and other Artificial Intelligence (AI) techniques have started gaining momentum in various fields, mainly due to the global pandemic that accelerated the digital transformation. Different ML-based approaches are proposed in the literature to build intelligent systems that identify patterns and behaviour in historical data and learn from it without relying on rules-based systems.
The concept of Multimedia Intelligence is introduced by Zhu et al. [12], representing the convergence of multimedia and AI. A bidirectional link is formed between multimedia and AI, that enables them to enhance each other. Consequently, on one side, multimedia enriches the varieties of applications for AI through explainability. On the other side, AI boosts the inferrability of multimedia through reasoning.
Deep Reinforcement Learning (DRL) has been used by Cui et al. [13] to propose TCLiVi, a transmission control solution for live video streaming. TCLiVi jointly adjusts the streaming parameters (e.g., video bitrate, target buffer size) in order to improve the QoE for live video streaming. The performance evaluation results show that TCLiVi outperforms other solutions from the literature in terms of QoE score with an increase of 40.84%. DRL has also been used by Mao et al. [14] to propose Pensieve, an intelligent system that generates adaptive bitrate (ABR) algorithms for Video on Demand (VoD) scenarios. Pensieve automatically learns ABR algorithms that adapt to a wide range of dynamic network conditions and QoE metrics.
Tan et al. [15] investigate the use of game theory to enable dynamic adaptive bitrate streaming in multi-client over Named Data Networking (NDN). A client-side game theory-based distributed ABR algorithm for NDN is proposed to optimize the overall QoE of multiple clients and guarantee fairness. The performance evaluation results demonstrate the effectiveness of the proposed solution in terms of overall QoE, fairness, and bandwidth resource utilization. Looking at maximizing user capacity for an auto-scaling VoD system, Chang and Chan [16] propose AVARDO, an auto-scaling Video Allocation and Request Distribution Optimization solution. The proposed solution seeks to maximize the user capacity at each autoscaling level and formulate the optimization problem as a multi-objective mixed-integer linear programming problem. The performance evaluation results show that the proposed AVARDO solution is close to the optimum.
Random Forest (RF) classifier is used by Chandrasekhar et al. [17] for real time video scheduling over LTE networks. The proposed solution detects the service type of different flows as well as the video player status for users with HTTP Adaptive Streaming (HAS) flows. The output of the RF classifier is used for prioritizing scheduling of the HAS users. The proposed solution enhances the video QoE with an acceptable impact on other non-video best effort services. Similarly, an adaptive resource scheduling solution named AdaptSch, based on neural network (NN) and mobile traffic prediction, is proposed by Semov et al. [18]. AdaptSch makes use of an NN architecture with two building blocks, where the first one predicts the future network state, while the second one chooses the optimum scheduling policy to be applied. The proposed solution improves the system performance in terms of packet delay. However, this comes at the cost of overall throughput degradation.
With a focus on radio resource scheduling in the 5G Radio Access Network (RAN), Tseng et al. [19] designed a modularized Deep Deterministic Policy Gradient (DDPG) architecture. Here, DDPG is used to select a radio resource scheduling policy from a pool of 60 combinations of scheduling algorithms as actions. DDPG has been widely adopted to solve optimal control problems in wireless network environments, such as allocating resources among different network slices [20] or among different traffic classes [21]. However, Gu et al. [22] argue that, due to the very slow convergence of DDPG, it cannot be implemented in real-world 5G systems. Consequently, the authors propose a knowledge-assisted DDPG that reduces its convergence time significantly and achieves better QoS.
Motivated by the fact that reconfigurable wireless networks open up new opportunities for advanced rich multimedia applications, such as online AR/VR gaming, high-quality video streaming, and autonomous vehicles, Mollahasani et al. [23] take a different approach and propose an Actor-Critic learning-based QoS-aware scheduler to overcome the problem of stringent QoS requirements of such applications. The authors adopt two advantage actor-critic models, where the first technique schedules packets by prioritizing their scheduling delay budget, while the second technique considers channel quality, delay budget, and packet type. Performance evaluation results validate the efficiency of the proposed approach.
In addition to the approaches described above, there are also time-efficient schedulers that target multiple QoS objectives at the same time. An example of such a scheduler is the Frame Level Scheduler (FLS) [24] that divides the scheduling problem in two stages: a) time-domain, where the users are prioritized based on the approximated quota of data necessary to meet the delay constraints; b) frequency-domain, where the prioritized users get radio resources for data transmission in a fair manner according to scheduling rules, such as proportional-fair scheduling. Another efficient example is the Required Activity Detection Scheduler (RADS) [25], where in the time domain, users are prioritized based on a multi-target criterion encompassing fairness, delay, and rate requirement, while in the frequency domain, the pre-selected users are served based on their channel quality. More recently, in [26] the authors proposed the Minimal Delay Violation (MDV) downlink scheduler that considers arrival rates in data queues and the state of each flow in the network in terms of packet loss and delay. When compared to FLS, MDV achieves a maximum gain of about 25% in terms of average system throughput when scheduling users requesting heterogeneous traffic in terms of video, voice, and best effort. In a railway environment, the authors in [27] proposed a New version of RADS (NRADS) that allocates the radio resources to mobile users based on the number of correctly received bits at the level of physical layer, channel conditions, and a static and standardized prioritization sequence to be followed when scheduling multiple classes of services. Compared to RADS, NRADS provides a gain of nearly 10% when measuring the overall system throughput.
In summary, a variety of scheduling approaches exist in the literature to deal with the prioritization and scheduling of multimedia services. However, most of these approaches focus mainly on QoS optimization. Improving the user QoE of the provided services, assessed in terms of objective (e.g., PSNR) or subjective (e.g., MOS) metrics, remains largely unaddressed. To fill this gap, our proposed PriMARL-based decision-making solution focuses on the maximization of PSNR performance for heterogeneous video scheduling given the dynamic user traffic and network conditions. The primary objective of the PriMARL-based prioritization framework is to maximize the QoS revenue for all video content viewers in terms of packet delay, throughput, and packet loss rate (PLR). The second objective is to carefully select the best rule for the frequency domain-based scheduling, namely the one that provides the highest number of viewers with excellent MOS scores.

III. SYSTEM MODEL AND PROBLEM STATEMENT
The proposed system model is presented in Fig. 2, where mobile/online learners access different types of educational video content from the mobile learner server through the OFDMA interface and scheduling system. Let P = {1, 2, ..., P} be the set of video services that need to be prioritized at each TTI, where class 1 requests the highest priority and class P is associated with the lowest priority. Furthermore, let U = {U_1, U_2, ..., U_P} be the set of active mobile learners distributed over the P video classes. Each learner u ∈ U_p receives on a mobile device (e.g., tablet, smartphone) educational videos with different QoS requirements for each class p ∈ P. By Q_p = {q_{p,n} : n = 1, 2, ..., N} we define the set of QoS requirements associated with class p ∈ P, where n indexes the type of QoS indicator (throughput, delay, or packet loss).
We define Key Performance Indicators (KPIs) for the QoS data (i.e., throughput, delay, packet loss), which are measured in each TTI based on observations collected from each user. In multi-class prioritization and scheduling, in a given class p ∈ P, users' KPIs are constrained by the same set of QoS requirements Q_p indicated by standards [28]. Therefore, for each QoS type n, class p ∈ P, and learner u ∈ U_p, we define the KPI k_{p,u,n}, measured at each TTI and monitored to verify whether its QoS requirement q_{p,n} is met. By enlarging the dimension of data to the user level for all N QoS indicators, we can further define the learner KPI vector as k_{p,u} = [k_{p,u,n}]_{n=1,2,...,N} and the vector of QoS requirements as q_p = [q_{p,n}]_{n=1,2,...,N}.
Then, the aim is to maximize in each TTI the number of KPI vectors k_{p,u} respecting the corresponding QoS requirement vector q_p for as many learners u ∈ U_p as possible.
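As a minimal sketch of the per-TTI objective stated above, the following Python snippet counts the learners whose KPI vector satisfies the requirement vector of their class. The function names, the dictionary layout, and the comparison directions (throughput treated as a lower bound, delay and packet loss as upper bounds) are illustrative assumptions and are not taken from the system model.

```python
# Assumed ordering of the N = 3 QoS indicators: throughput, delay, packet loss.
# "min": the KPI must be at least the requirement; "max": at most the requirement.
SENSES = ("min", "max", "max")


def kpi_meets(kpi, requirement, sense):
    """Check a single KPI k_{p,u,n} against its requirement q_{p,n}."""
    return kpi >= requirement if sense == "min" else kpi <= requirement


def vector_satisfied(k_pu, q_p, senses=SENSES):
    """True when every entry of the KPI vector k_{p,u} respects q_p."""
    return all(kpi_meets(k, q, s) for k, q, s in zip(k_pu, q_p, senses))


def count_satisfied(kpis_by_class, reqs_by_class):
    """Count the learners, over all classes, whose KPI vector meets q_p."""
    total = 0
    for p, learner_kpis in kpis_by_class.items():
        total += sum(vector_satisfied(k, reqs_by_class[p]) for k in learner_kpis)
    return total


# Hypothetical values: (throughput in Mbps, delay in ms, packet loss ratio).
example_reqs = {1: (1.0, 100.0, 0.01), 2: (0.5, 300.0, 0.05)}
example_kpis = {
    1: [(1.5, 80.0, 0.005), (0.8, 80.0, 0.005)],  # second learner misses throughput
    2: [(0.6, 250.0, 0.02)],
}
satisfied = count_satisfied(example_kpis, example_reqs)
```

In this hypothetical example, two of the three learners meet all of their requirements in the considered TTI.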
The role of the scheduler from Fig. 2 is to prioritize learners from different video classes p ∈ P and allocate the necessary radio resources in the frequency domain at each TTI. Let us suppose that the prioritization sequence, for example, p, 2, ..., p − 1, p + 2, ..., 1, is decided at TTI t, where learners requesting video service from class p ∈ P are scheduled first, followed by learners from class 2, and so on. In the proposed system, the number of video classes from the prioritization sequence that are scheduled in the frequency domain depends on the amount of remaining radio resources. In OFDMA networks, the available bandwidth is divided into B equal Resource Blocks (RBs). Let B = {1, 2, ..., B} be the set of RBs that are allocated at each TTI, where RB b ∈ B is the smallest resource unit. Learners u ∈ U_p within video class p ∈ P from the prioritized sequence compete in the frequency domain to get the highest number of RBs. Then, utility functions are used to rank learners for each RB b ∈ B according to their QoS budget [29]. In particular, for each RB b ∈ B and learner u ∈ U_p, a utility function targets a specific type n of QoS indicator and takes as input in each TTI t the measured KPI k_{p,u,n}; in most cases, as output, such a function provides a measure of how far each KPI of each class p ∈ P is from the QoS requirement q_{p,n} ∈ Q_p. At the level of each RB b ∈ B, the learner with the highest utility value is allocated that particular RB. Learners u ∈ U_p with higher utility values over the entire bandwidth have higher chances of getting more RBs. Such utility functions map the KPI k_{p,u,n} to a real value (R → R) and can take different forms depending on the target type of QoS indicator (i.e., throughput, delay, PLR).
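The per-RB ranking described above can be sketched as follows. This is an illustration only: the function name and the `metric[u][b]` layout are hypothetical, the utility values are taken as given inputs rather than computed by a specific scheduling rule, and ties are broken in favor of the first learner listed.

```python
from collections import Counter


def allocate_rbs(metric):
    """Assign each RB to the learner with the highest utility value on it.

    metric[u][b] is the (assumed precomputed) utility value of learner u
    on RB b. Returns a list whose entry b is the learner that owns RB b.
    """
    n_rbs = len(next(iter(metric.values())))
    owners = []
    for b in range(n_rbs):
        # max() returns the first maximal learner, which fixes the tie-break.
        owners.append(max(metric, key=lambda u: metric[u][b]))
    return owners


# Hypothetical utility values for two learners over three RBs.
example_metric = {"u1": [3.0, 1.0, 2.0], "u2": [1.0, 4.0, 2.0]}
owners = allocate_rbs(example_metric)
rbs_per_learner = Counter(owners)
```

Each RB has exactly one owner in this sketch, which matches the statement that the learner with the highest utility value on an RB receives it; learners with higher values across the bandwidth end up with more RBs.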

A. Optimization Problem
In the proposed optimization problem presented in (1.a), the prioritization of video classes and resource allocation are performed at each TTI t ∈ {1, 2, ..., T}, subject to constraints (1.b)-(1.e), where T represents the number of TTIs of a given scheduling session.
max_{x,y} Σ_{p∈P} Σ_{u∈U_p} Σ_{b∈B} [...]    (1.a)
In such an optimization problem, the aim is to maximize for each RB b ∈ B the sum of utility values over learners u ∈ U_p of class p ∈ P decided by the prioritization sequence in each TTI t. However, the wireless environment must be considered in the optimization problem to enable scheduling and resource allocation for users with high utility values and favorable channel conditions. Therefore, learner u ∈ U_p gets the RB b ∈ B if the metric is maximized relative to all other learners' metrics, where λ_{u,b}(t) is the achievable rate that could be obtained if RB b ∈ B were allocated to learner u ∈ U_p at TTI t.
To solve such complex problems, two variables must be determined at each TTI t: a) x_{p,u} ∈ {0, 1} decides whether learner u ∈ U_p is scheduled in the frequency domain (i.e., if x_{p,u} = 1, then video class p ∈ P is prioritized and all learners u ∈ U_p are passed to the frequency domain; if x_{p,u} = 0, then video class p ∈ P is not prioritized); b) y_{u,b} ∈ {0, 1} performs the scheduling and resource allocation. When obtaining the best combinations of users and RBs to maximize (1.a) at each TTI, a set of constraints must also be considered. Constraints (1.b) indicate that each RB b ∈ B is allocated to one learner at most. Also, as requested by (1.c), once a video class p ∈ P is prioritized, all learners u ∈ U_p = {u_1, u_2, ..., u_{U_p}} within that class compete to get the available resources allocated, where U_p is the number of learners in class p ∈ P. If resources remain after scheduling the higher-priority class, the optimization problem is repeated for the next video class from the prioritized sequence. However, due to unfavorable networking conditions, some video classes can remain unscheduled at certain TTIs. In this sense, let P*(t) be the set of video classes scheduled at TTI t and P⊗(t) the set of video classes that remain unscheduled, where P* ∪ P⊗ = P and P* ∩ P⊗ = ∅. Accordingly, the constraints (1.d) show that all learners in the scheduled classes p* ∈ P* are passed to the frequency domain and compete for radio resource allocation. Meanwhile, the other learners in p⊗ ∈ P⊗ do not receive video packets in that TTI t because there are not enough radio resources left after scheduling the learners in p* ∈ P*, as indicated by the constraints (1.e).

B. Problem Solving
To find optimal solutions of (1.a) in each TTI t, the scheduler needs to identify the best type n of utility function to be employed and, at the level of each resource block b ∈ B, the most appropriate learner u ∈ U_p and service class p ∈ P. This decision-making should be done in such a way that the set of constraints (1.b)-(1.e) is met in each TTI and the number of KPIs k_{p,u,n} that satisfy their associated requirements q_{p,n} is maximized in the subsequent TTI t + 1. This approach raises two main problems: a) the decision process becomes time-consuming, as each possible combination n × b × u × p must be tested, and the best one has to be selected to perform scheduling; b) finding the optimal solution in each TTI is complex, as the performance (meeting the QoS requirements) of each possible decision in a) needs to be known in advance.
Therefore, we want to simplify the solution-search problem at each TTI by finding sub-optimal solutions of the original optimization problem in two stages: a) the prioritization sequence of video classes; b) the scheduling of pre-selected learners and resource allocation by respecting the prioritization of video classes decided in a).
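The two-stage decomposition can be illustrated with the following sketch, which takes the prioritization sequence of stage a) as an input and performs a greedy per-RB allocation for stage b). All names and the data layout are hypothetical. The sketch also assumes that a learner stops competing once its queued data is served and that each allocated RB drains the queue by the learner's achievable rate on that RB; these are simplifications introduced for the example, not details given in the text.

```python
def two_stage_schedule(priority, learners, metric, rate, queue, n_rbs):
    """Greedy sketch: serve classes in priority order until RBs run out.

    priority : list of class ids, highest priority first (stage a output)
    learners : dict class id -> list of learner ids
    metric   : dict learner id -> per-RB utility values (assumed given)
    rate     : dict learner id -> per-RB achievable rate, in bits per TTI
    queue    : dict learner id -> queued data in bits
    Returns (allocation, scheduled): RB -> learner, and the scheduled classes.
    """
    free = list(range(n_rbs))
    backlog = dict(queue)  # copy so the caller's queue sizes stay untouched
    allocation = {}
    scheduled = []
    for p in priority:
        if not free:
            break  # remaining classes stay unscheduled in this TTI
        for b in list(free):
            active = [u for u in learners[p] if backlog[u] > 0]
            if not active:
                break  # class p is fully served; pass leftover RBs on
            winner = max(active, key=lambda u: metric[u][b])
            allocation[b] = winner
            backlog[winner] = max(0, backlog[winner] - rate[winner][b])
            free.remove(b)
            if p not in scheduled:
                scheduled.append(p)
    return allocation, scheduled


# Hypothetical scenario: class 2 is prioritized ahead of class 1.
ex_learners = {1: ["a"], 2: ["b", "c"]}
ex_metric = {"a": [1, 1, 1], "b": [5, 1, 1], "c": [1, 5, 1]}
ex_rate = {u: [10, 10, 10] for u in ("a", "b", "c")}
ex_queue = {"a": 10, "b": 10, "c": 10}
alloc3, sched3 = two_stage_schedule([2, 1], ex_learners, ex_metric, ex_rate, ex_queue, 3)
alloc2, sched2 = two_stage_schedule([2, 1], ex_learners, ex_metric, ex_rate, ex_queue, 2)
```

With three RBs, both classes are served; with only two RBs, class 1 remains unscheduled, which mirrors the split of the classes into a scheduled and an unscheduled set at a given TTI.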
To solve the first sub-problem, this paper employs a PriMARL-based solution to increase the QoS provisioning by deciding at each TTI the best prioritization of video classes. However, the type of scheduling rule used in resource allocation has a major impact on QoS and QoE provisioning for the pre-selected users. In this paper, we train our PriMARL method by employing three different scheduling rules in the frequency domain [30]: PriMARL-BF, PriMARL-EXP, and PriMARL-OPLF, which focus on a particular QoS performance indicator, namely throughput (n = 1), delay (n = 2), and packet loss (n = 3), respectively.

IV. PROPOSED PRIMARL SOLUTION
A controller is employed in Fig. 2 to interact with the scheduler entity and learn the best prioritization decision to be taken at each TTI t. In a real system, this controller is deployed at the MAC layer of the 5G gNodeB base station and is owned by the network operator. The interaction between controller and scheduler at the MAC layer is modeled in terms of: the state, representing the observable data received from the scheduler; the action, corresponding to the prioritization sequence; and the reward, which measures in the current state how good the prioritization decision taken in the previous state was. By experiencing a very large number of interactions of the form (previous state, action, reward, current state), the controller learns by trial and error to improve its decisions over time based on reinforcement learning [31]. The controller considers P agents trained to compute the prioritization decision of video classes at each TTI t. In particular, each agent p ∈ P learns to claim at each TTI the priority of class p ∈ P to be passed to the frequency domain. Then, the controller computes a joint action by ordering the priority values given by each particular agent. Since each agent learns based on its own state to compute a joint action together with other agents, the proposed approach works in a multi-agent reinforcement learning mode [8]. We argue that combining the decisions of multiple agents with various priorities is more efficient than using a single agent that decides the prioritization sequence once at each TTI.

A. States, Actions, and Rewards
An instantaneous state of agent p ∈ P observed at TTI t is given by the data sample s_p(t) ∈ S_p, where S_p is the state space of class p ∈ P. This state is divided into two parts, s_p(t) = [c_p(t), n_p(t)], where c_p(t) are the controllable elements that can be influenced by the prioritization decisions, while n_p(t) are the non-controllable elements, such as the Channel Quality Indicator (CQI), that change regardless of the applied decision. The controllable sample comprises three vectors: the KPI vector k_p of all learners in class p ∈ P, a vector of the differences between each KPI k_{p,u,n} from k_p and its associated QoS requirement q_{p,n} ∈ Q_p, and a vector containing the amount of queued data for each learner at the MAC layer. At each TTI t, the controller state s(t) ∈ S is obtained by encompassing all agents' states, s(t) = [s_1(t), s_2(t), ..., s_P(t)], where S is the controller state space. A joint action is denoted by a(t) = [a_1, a_2, ..., a_P] ∈ A, where a_i ∈ P is the video class with the i-th priority to be scheduled at TTI t, and A is the P-dimensional and discrete controller action space. As mentioned, only P* classes may be scheduled, and consequently the action is partially used, where 1 ≤ P* ≤ P.
The controllable state of each agent evolves to the next state based on the applied joint action, through a transition function that moves the agent from the state s_{a_i}(t) ∈ S_{a_i} to the next state s_{a_i}(t + 1) ∈ S_{a_i} when scheduling the learners in each class a_i ∈ P at TTI t. The reward function of the controller depicted in Fig. 2 measures the impact of applying action a(t) ∈ A in state s(t) ∈ S and is defined as in [32], where R : S × A → R is the reward function and E[·] is the expectation operator, with a random state s(t) ∈ S such that P[s(t) = s] > 0 and P[a(t) = a] > 0 hold for all a ∈ A. For our purpose, the reward function is computed as in [7], where r_{a_i} is the reward function that evaluates the QoS performance when scheduling learners in video class a_i ∈ P, and h is the weight function that sets the importance of each reward r_{a_i} given the sequence [1, 2, ..., P] requested by the prioritization standard. By using (2) and measuring the QoS performance in each video class, the proposed reward is expressed through the individual rewards r_{u,n}, where we assume that action a_i = p ∈ P, and r_{u,n} is the particular reward of user u ∈ U_p and QoS requirement q_n ∈ Q, with the function argument given by the controllable sample c_{p,u,n}, which comprises the KPI k_{p,u,n}, its difference from the associated requirement, and the amount of queued data.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
As shown in [7], the computation of the learners' rewards r u,n depends on type n of QoS requirement for each traffic class.

B. Policy and Value Functions
The proposed solution considers a stochastic game in which each agent p ∈ P learns, based on its own state space S_p, to cooperate with the other agents to maximize the overall QoS provisioning in all video classes according to the employed reward functions in (4) and (5).
Each agent keeps its own policy function π_p, defined as the probability of selecting a given joint action a ∈ A in state s_p ∈ S_p [32]. Similar to the joint action, we compute the controller joint policy as the sequence π = [π_p]_{p=1,2,...,P}.
Furthermore, each agent keeps track of an action-value function to calculate the expected cumulative future reward if agent p ∈ P is in state s_p, executes the joint action a ∈ A by obtaining the i-th priority to be scheduled, and the joint policy π is subsequently followed. This function is defined as in [32], where 0 ≤ γ ≤ 1 is a discount factor that gives more importance to the immediate rewards than to the later ones, and E[·] is the expectation operator with the same properties as shown in (3). The action-value function of each agent p ∈ P is trained separately to claim the priority of the corresponding video class to be scheduled in the frequency domain. When the controller is trained and the action-value functions are considered optimal or near-optimal, an action a ∈ A is selected with a sequence of probabilities π(a) = [1, 1, ..., 1] by applying the solve operator to the action values of the trained functions Q*_p, where solve gives the descending order of all action values and returns the agents' indices.
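A minimal sketch of the ordering step performed by solve is given below. It assumes that each agent has already produced a scalar action value for the current TTI; the 1-based class indexing and the tie-breaking rule (the lower class index wins) are choices made for this example only.

```python
def solve(action_values):
    """Return class indices (1-based) sorted by descending action value.

    action_values[p - 1] is the value with which agent p claims priority.
    Python's sort is stable, so equal values keep the lower class index first.
    """
    indexed = enumerate(action_values, start=1)
    ranked = sorted(indexed, key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked]


# Hypothetical action values for P = 4 agents.
joint_action = solve([0.2, 0.9, 0.5, 0.1])
```

Here agent 2 claims the highest value, so class 2 is passed to the frequency domain first, followed by classes 3, 1, and 4.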
In addition to the action-value functions of the individual agents p ∈ P, we use the value function V(s) that considers the initial controller state s(0) = s ∈ S and follows the joint policy π afterwards [32]. The role of V(s) is to coordinate agents in the training process to learn the best prioritization decisions. In addition, the transition between two consecutive states can also be used based on [31], where s′ = s(t + 1) ∈ S represents the next state. With these consecutive states {s, s′} ∈ S and reward function R(s, a), the value of the previous state V(s) is updated based on (9).
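For intuition, the update of V(s) from two consecutive states can be sketched with the standard one-step temporal-difference rule. The target R(s, a) + γ V(s′) is assumed here as the usual form from the reinforcement learning literature and may differ from the exact expression in (9); the dictionary-based table and the function name are hypothetical and serve readability only.

```python
def td_update(values, s, s_next, reward, gamma, lr):
    """One-step TD update of V(s); returns the TD error.

    values : dict mapping a (hashable) state to its current value estimate
    target : reward + gamma * V(s_next), assumed standard TD(0) form
    """
    target = reward + gamma * values.get(s_next, 0.0)
    td_error = target - values.get(s, 0.0)
    values[s] = values.get(s, 0.0) + lr * td_error
    return td_error


# Hypothetical two-state example.
V = {"s": 0.0, "s_next": 1.0}
delta = td_update(V, "s", "s_next", reward=1.0, gamma=0.5, lr=0.1)
```

A positive error indicates that the observed transition was better than the current estimate of V(s) suggested, so the estimate is moved upwards.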

C. Solution Employment
In order to use the proposed solution in real-time systems, two major aspects need to be considered: a) the dimension of all states {s_1, s_2, ..., s_P} depends on the number of active learners {U_1, U_2, ..., U_P}, which can change over time; b) because of the multi-dimensionality of the state space, the action-value and value functions cannot be updated using conventional look-up tables.
Therefore, we address these challenges through compression and approximation methods, respectively.
The original state space S_p is compressed to avoid the dependency on U_p by applying a transformation operator T that maps S_p to a compressed state space of class p ∈ P, whose dimension stays constant over a variable number of mobile learners. Depending on the elements in s_p ∈ S_p, the space transformation can have different computations. For example, descriptive statistics (mean and standard deviation) are used for the vector of controllable elements (the KPIs, their differences from the QoS requirements, and the queued data) for all p ∈ P, u ∈ U_p, and n ∈ {1, 2, ..., N} [29]. In the case of non-controllable elements (e.g., CQI), unsupervised and supervised learning techniques are used [29].
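The descriptive-statistics compression of the controllable elements can be sketched as follows. The function name, the row layout (one tuple per learner), and the zero vector returned for an empty class are assumptions made for this example; the population standard deviation is used here, although the text does not specify which variant is applied.

```python
from statistics import fmean, pstdev


def compress_state(per_learner, n_features):
    """Map a variable number of learner rows to a fixed-length vector.

    per_learner : list of tuples, one per learner, each with n_features values
    Returns [mean_1, std_1, ..., mean_n, std_n], so the length is always
    2 * n_features regardless of how many learners are active.
    """
    if not per_learner:
        return [0.0] * (2 * n_features)  # assumed convention for an empty class
    compressed = []
    for j in range(n_features):
        column = [row[j] for row in per_learner]
        compressed.extend([fmean(column), pstdev(column)])
    return compressed


# Hypothetical controllable elements: (KPI, queued data) for two learners.
state_two = compress_state([(1.0, 10.0), (3.0, 10.0)], n_features=2)
state_five = compress_state([(1.0, 2.0)] * 5, n_features=2)
```

Both calls return a vector of length four, which shows how the compressed state keeps a constant dimension as the number of learners in the class changes.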
With the compression mechanism, the obtained states s p ∈S p are still multi-dimensional and function approximators must be used to model the action-value and value functions. In this paper, we adopt the use of feed-forward neural networks as parameterizable functions to be learned over time to provide the best prioritization sequence on each state. Therefore, each agent p ∈ P is represented by where p is the set of weights that must be updated during the training stage. To increase the training efficiency of the proposed solution, we also employ the value function of the controller states ∈S and approximated by the neural network Therefore, a number of P + 1 neural networks must be trained during the learning stage of the proposed PriMARL solution.
During training, a joint action a ∈ A is selected by each agent on each compressed state s p according to:

\pi_p(a \mid s_p) = \begin{cases} 1 - \epsilon, & a = \mathrm{solve}\{Q_p(\cdot;\theta_p)\}_{p=1,\ldots,P}, \\ \epsilon, & a = \mathrm{solve}\{\mathrm{rand}_p\}_{p=1,\ldots,P}, \end{cases} \quad (11)

where rand p ∈ [0, 1] is a sequence of random numbers. In some cases, parameter ε ∈ [0, 1] is set to higher values at the beginning of the training stage (more exploration in terms of random action selections), and to lower values at the end of the training (more exploitation based on the trained functions). In other cases, ε can have a constant value for the entire training period. Regardless of the strategy used, the same value of ε is used by all agents at each TTI. Once a joint action a ∈ A is applied at TTI t, the system moves to the next state, and a reward R(s, a) is computed. We denote by the controller experience at TTI t + 1, and P * (t) is the number of classes scheduled at TTI t. The experience of an agent p ∈ P is given by All these experiences e ∈ {E, E 1 , E 2 , . . . , E P } are used at each TTI to reinforce the neural networks with the aim of minimizing the following cost function: where η ∈ [0, 1] is the learning rate, θ is the set of weights of the trained neural networks, and δ(θ) is the Temporal Difference (TD) error computed as the difference between the target and the actual estimate of the network: By F(·; θ) we mean both the functions V(·; θ) and Q p (·; θ p ) for all p ∈ P. The target F T (·; θ) is determined separately for the value and action-value functions. For example, the target of the value function takes the form of (9) and the TD error becomes We design the neural network that learns the value function as a critic to determine whether the decision of the multi-agent system is a good or a bad option. If δ(θ) ≥ 0, the prioritization sequence a ∈ A has a positive effect and the cost values should be reinforced in the networks with a relatively higher learning rate η = α.
If δ(θ) < 0, such actions must be prevented in the future by using a lower learning rate η = β; thus, β << α when choosing the parameters of the PriMARL controller.
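The exploration rule and the critic-driven choice of learning rate can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, agents are indexed from 0, the Q-values are plain floats rather than neural network outputs, and a single random draw decides between exploration and exploitation for the whole joint action.

```python
import random
from typing import List, Sequence


def select_joint_action(q_values: Sequence[float], epsilon: float,
                        rng: random.Random) -> List[int]:
    """Return a priority order over agents (0-based indices)."""
    if rng.random() < epsilon:
        # Exploration: rank the agents by a sequence of random numbers.
        scores = [rng.random() for _ in q_values]
    else:
        # Exploitation: rank the agents by their current action values.
        scores = list(q_values)
    return sorted(range(len(scores)), key=lambda p: -scores[p])


def learning_rate(critic_td_error: float, alpha: float, beta: float) -> float:
    """Use alpha when the critic's TD error is non-negative, else beta."""
    return alpha if critic_td_error >= 0 else beta


rng = random.Random(0)
greedy_order = select_joint_action([0.2, 0.9, 0.1, 0.5], 0.0, rng)
random_order = select_joint_action([0.2, 0.9, 0.1, 0.5], 1.0, rng)
```

With epsilon = 0 the order is purely greedy, and with epsilon = 1 it is a random permutation of the agents.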
Even when the TD error becomes positive, the prioritization decision can infuse the over-provisioning effect and some classes with met QoS requirements (r p = 1) are prioritized at the expense of other classes with unmet QoS requirements (r p < 1), ∀p = p ∈ P. To address this problem, we employ as a penalty function to improve the decision-making, so that: a) if h(a i * , r 1 , . . . , r P ) = 1, i * = 1, 2, . . . , P * , then all video classes a i * ∈ P * meet the QoS requirements but are prioritised at the expense of other classes whose QoS requirements are not met and whose rewards are lower than r p < 1; b) if h(a i * , r 1 , . . . , r P ) = 0, i * = 1, 2, . . . , P * , then prioritising a i * ∈ P * among other classes is a fair choice. Then, the proposed target of the action-value function becomes: where Q T a i * (s a i * , a; p ) is the target function of those classes a i * = p * ∈ P * being scheduled at TTI t, while the rest of the agents are not updated. As observed in (14), negative target values are associated even when the value function error is positive (δ( ) ≥ 0), but the penalty function shows inequity between prioritized video classes (h(·) = 1). Therefore, the error to be reinforced by the agent p * ∈ P * averaged with the learning rate η = {α, β} becomes Finally, the weights of the critic neural network and all agents are updated based on the Stochastic Gradient Descent (SGD) algorithm, which is given by the following formula [7]: In Algorithm 1, we describe how PriMARL is trained to prioritize traffic classes and allocate radio resources through a specific scheduling rule based on utility function n . As input parameters, the algorithm considers two consecutive states {s, s } ∈ S, the action applied in the previous state a ∈ A, and the number of video classes P * (t) being scheduled in the previous state. 
As an output, Algorithm 1 provides a new action a′ ∈ A as a prioritization sequence and executes the scheduling and allocation of radio resources. In the first step (lines (4)-(8)), the controller's reward is calculated, the states are compressed for each agent, the error of the value function (critic) is back-propagated, and the weights are updated based on the SGD algorithm. We set different learning rates for the agents if the critic error is positive or negative (line 10). In the second step, we update the agents representing the traffic classes that were scheduled in the previous TTI (lines (11)-(15)). In the third step, the video classes are prioritized according to the new joint action a′ ∈ A decided by all agents (line 17). In the frequency domain, radio resources in B are allocated to prioritized learners competing with each other based on the type of utility function n or scheduling rule, and they have access to radio resources within the limits of the available stock (lines (18)-(24)). For example, learners in class a 2 ∈ P compete for radio resources if there are enough resources left after scheduling the higher-priority class a 1 ∈ P in the sequence.

Algorithm 1 PriMARL Training in Traffic Prioritization and Scheduling With a Particular Utility Function n
1: input: s ∈ S, a ∈ A, s′ ∈ S, P * (t)
2: output: a′ ∈ A, scheduling and radio resource allocation
3: for each TTI t + 1
4:   calculate rewards based on (3)
...
10:   if δ(θ) ≥ 0, then η = α, else η = β
11:   for i * = 1, 2, . . . , P *
12:     determine target function Q T a i * based on (14)
13:     calculate error δ a i * (θ a i * ) based on (13)
14:     back-propagate and update θ a i * based on (15)
15:   end for
16:   // act based on the joint policy
17:   determine new action a′ ∈ A based on policy (11)
18:   while B ≠ ∅
...

In Algorithm 2, each PriMARL algorithm is tested and implemented in real-time scheduling. Here, the process is simplified, since only the current states s ∈ S are needed as input parameters, and the algorithm will provide a new prioritization sequence given by the trained agents. The neural networks are no longer updated, but the algorithm still needs to compress the states (line 4) of each agent p ∈ P. The joint action is decided at each TTI by ordering the agents' outputs (lines (5)-(9)), and the scheduling process is performed based on (1.a)-(1.e), depending on the type of scheduling rule n and the available stock of radio resources (lines (11)- (14)).
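The priority-ordered access to the stock of radio resources can be illustrated with the following sketch. It is a deliberate simplification: each class is reduced to a single hypothetical demand expressed in resource blocks, whereas the actual allocation is made per learner and per resource block through the utility function of the chosen scheduling rule. The function and parameter names are assumptions.

```python
from typing import Dict, Sequence


def allocate(priority: Sequence[int], demand: Dict[int, int],
             total_rbs: int) -> Dict[int, int]:
    """Grant resource blocks class by class until the stock is exhausted."""
    remaining = total_rbs
    granted: Dict[int, int] = {}
    for cls in priority:
        if remaining == 0:
            break  # lower-priority classes receive nothing in this TTI
        granted[cls] = min(demand.get(cls, 0), remaining)
        remaining -= granted[cls]
    return granted


# Hypothetical prioritization sequence and per-class demands for B = 100 RBs.
grants = allocate([2, 1, 3, 4], {1: 30, 2: 50, 3: 40, 4: 20}, 100)
```

In this example class 2 is served in full, class 1 next, class 3 only partially, and class 4 not at all, which mirrors the behaviour described for lower-priority classes when the stock runs out.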

V. SIMULATION RESULTS
The proposed PriMARL framework is developed in a C/C++ software environment using intelligent OFDMA scheduling in both the time and frequency domains, data compression mechanisms, and neural networks to approximate agents' decisions for each video class. The proposed tool inherits the LTE-Sim functionality [33]. As explained above, the proposed PriMARL-based solution considers three types of utility functions as scheduling rules [6]: PriMARL-BF, PriMARL-OPLF, and PriMARL-EXP. Since most of the state-of-the-art works presented in Section II do not provide the level of detail necessary to enable their implementation, we provide a comprehensive comparison of the proposed solutions with the following approaches: HiMARL [7], FLS [24], and RADS [25]. We evaluate the performance of these schedulers from the perspective of: a) QoS provisioning, where throughput, delay, and packet loss indicators are monitored in each TTI. To quantify the level of QoS provisioning in the time domain, three types of QoS requirements are considered for each video class: Guaranteed Bit Rate (GBR, n = 1), packet delay (n = 2), and Packet Loss Rate (PLR, n = 3). b) QoE provisioning by calculating the perceived PSNR based on throughput and arrival rates. As a result of PSNR assessment, MOS is calculated on five different levels: excellent (5), good (4), fair (3), poor (2), and bad (1).
The purpose of this section is to demonstrate that setting a multi-objective target (n = {1, 2, 3}) to maximize the QoS provisioning does not guarantee the same effect in terms of perceived PSNR and MOS levels. In particular, we show that the proposed PriMARL solution is able to outperform HiMARL, RADS, and FLS when monitoring the number of learners achieving excellent MOS levels while viewing different types of educational video content. Therefore, we organise this section as follows: a) first, we present the traffic characteristics, network, scheduler and controller settings; b) then, we present the QoS analysis in terms of throughput, delay, packet loss, and the number of TTIs when all three QoS objectives are met. c) In the third part, QoE analysis is performed for PSNR and MOS levels for all approaches which are involved in this comparison framework. d) Finally, we provide additional results and insights to better highlight the importance of using PriMARL approaches with static scheduling rules from the perspective of QoE performance.

A. Video Traffic Settings
As shown in Fig. 1, learners access the heterogeneous video contents from mobile devices. To cope with the different financial situations of learners, in this study we consider two resolutions of mobile devices, 240p and 480p, linked to lower and higher prices, respectively. According to [34], for each resolution, maximum thresholds for low and high bit rate values are recommended: for 240p, 150kbps and 250kbps, while for 480p, maximum rates of 0.6Mbps and 1Mbps are recommended. In the subjective surveys conducted in [7], learners were asked to rate video quality using mean opinion scores for seven categories of educational videos with low and high quality levels. All content categories with low quality levels were perceived as good by all viewers, with the exception of slideshow content, which was perceived as fair. Similarly to [7], we consider the same classes of video services, i.e., low and high quality slideshows with a resolution of 240p, as well as animations and screencasts for devices with a resolution higher than 480p. By modeling animation as video traffic with a variable bit rate and screencast video with a constant bit rate, as well as by standardizing the QoS requirements [28], we obtain P = 4 video classes with the following characteristics:
• p = 1: video_1 (slideshow, high quality), q 1,1 = 242kbps, q 1,2 = 150ms, and q 1,3 = 10^−3, ∀u ∈ U 1 ;
• p = 2: video_2 (slideshow, low quality), q 2,1 = 138kbps, q 2,2 = 300ms, and q 2,3 = 10^−6, ∀u ∈ U 2 ;
• p = 3: video_3 (animation, low quality), q 3,1 = 512-1024kbps, q 3,2 = 300ms, and q 3,3 = 10^−6, ∀u ∈ U 3 ;
• p = 4: video_4 (screencast, low quality), q 4,1 = 640kbps, q 4,2 = 300ms, and q 4,3 = 10^−6, ∀u ∈ U 4 .
Figure 3 illustrates an example of a video frame from each educational video class considered.
In such environments, the role of PriMARL is to increase the QoS provisioning in all classes by learning the best prioritization sequence to apply at each TTI according to the actual traffic and networking conditions. We then study the impact of this dynamic prioritization and different scheduling rules (BF, OPLF, EXP) on the QoE metrics, namely PSNR and MOS. During the training and testing stages, the aggregate traffic load of all classes is varied in an interval of U ∈ [6, 60], while respecting the following ratios between video classes: video_1 (16.53%), video_2 (16.53%), video_3 (33.3%), and video_4 (33.3%). Then, the QoS and QoE performance is evaluated based on three traffic load settings: low (U ∈ [6, 20]), medium (U ∈ [21, 40]), and high (U ∈ [41, 60]). All scheduling approaches are tested for each configuration of U, and then, the results are averaged over the number of possible configurations in each traffic setting.

B. Network Settings
From the network perspective, we consider downlink scheduling sessions over the OFDMA interface with a system bandwidth of 20MHz and a number of B = 100 RBs. The radio channel model uses fast fading based on Jakes' model due to the high diversity provided in the CQI reports necessary to employ unsupervised learning techniques to find patterns and supervised learning methods to automate the CQI compression process [29]. The most widely used 7-cell cluster inter-cell interference model is considered. Each cell follows a macro-urban model with a radius of 1 km, since a wide range of CQI reports should be captured. When training the PriMARL controller, we consider a generic speed of 30km/h to teach the neural networks how to behave under different channel conditions, while when testing its performance, we consider static positions of the learners over several trials, as explained in more detail later in this section. We neglect intracell interference between mobile devices and other electronic devices, as this aspect is not relevant to our study. When training the machines, all ML-based approaches (HiMARL, PriMARL-BF, PriMARL-OPLF, PriMARL-EXP) are trained separately with different networking conditions. In the test phase, all candidates use the same network conditions.

C. Packet Scheduler Settings
At the level of the packet scheduler, the modulation and coding scheme is adapted at three levels (QPSK, 16-QAM, and 64-QAM) and the scheduling is done at each TTI in the time and frequency domain. In the radio link protocol layer, video packets are transmitted in acknowledged mode, with a maximum of five re-transmissions allowed for each lost packet. Once the scheduling process is complete and the system moves to the next TTI, the QoS indicators obtained are compared with the QoS requirements for each video traffic to verify the level of QoS provisioning. The delay of learner u requesting one of the video services is measured as the head-of-line packet delay and should not be greater than the requirement. The packet loss and the throughput performance are measured by averaging all instantaneous lost packets and throughput, respectively, in a sliding time window of 1000 TTIs. Depending on the method used, scheduling in the time and frequency domains is performed based on different metrics: a) Time-domain scheduling: On one hand, the PriMARL and HiMARL approaches prioritize learners from the same class by deciding the sequence of classes to schedule at each TTI. On the other hand, FLS and RADS prioritize learners from different video classes based on different metrics. For example, as explained in Section II, in time-domain scheduling, the FLS scheduler estimates the amount of real-time data to be transmitted in the next frame of 10 TTIs based on discrete linear control theory arguments. Then, the learners from different classes are prioritized based on the approximated quota of data needed to meet the delay requirements. In the case of RADS, learners requesting different video services are ranked based on a metric that considers fairness, delay, and throughput. b) Frequency-domain scheduling: The proposed PriMARL approach uses different scheduling strategies to allocate data in the frequency domain, namely BF, OPLF, and EXP rules. 
HiMARL uses reinforcement learning solutions at the level of each video class to learn the best rule to apply each time that class is selected in the prioritization sequence. As the results will show, this scheme is able to balance the QoS provisioning between the PriMARL with separate scheduling rules, by affecting the QoE performance in terms of perceived video quality. In case of FLS, the proportional-fair scheduler is used in the frequency domain to improve the fairness between the pre-selected learners, while RADS uses OPLF to improve the PLR performance since this QoS indicator is not part of the metric used in the time domain.
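The sliding-window averaging of the packet loss and throughput indicators over 1000 TTIs can be sketched with a fixed-length queue. The class name and interface below are hypothetical; only the window length of 1000 TTIs comes from the text.

```python
from collections import deque


class SlidingMean:
    """Running mean over the most recent `window` samples (one per TTI)."""

    def __init__(self, window: int = 1000) -> None:
        self._values = deque(maxlen=window)
        self._total = 0.0

    def update(self, value: float) -> float:
        if len(self._values) == self._values.maxlen:
            # The oldest sample is about to be evicted by append().
            self._total -= self._values[0]
        self._values.append(value)
        self._total += value
        return self._total / len(self._values)


# Small window used here only to keep the example readable.
window_mean = SlidingMean(window=3)
history = [window_mean.update(x) for x in (1.0, 2.0, 3.0, 4.0)]
```

Until the window fills, the mean is taken over the samples seen so far; afterwards each new TTI replaces the oldest sample.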
The scheduling performance is assessed by comparing the six candidates based on different metrics. As performances can vary depending on the network and channel conditions, different trials are conducted, with all schedulers using the same conditions (number of learners, mobility, channel and traffic characteristics) in each trial to allow a fair comparison. Subsequently, the performance metrics are averaged using the following formula: where G is the number of trials in the test stage and m p,g is the metric that evaluates the performance of a given indicator (QoS or QoE) for each video class p ∈ P in one trial g. In this study, we consider a number of G = 10 trials, where each trial has a duration of the scheduling process of about 50s.
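The averaging of a metric m p,g over the G trials, together with a spread measure, can be sketched as follows. The function name and the sample values are illustrative, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def average_over_trials(per_trial: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and sample standard deviation across trials."""
    if len(per_trial) < 2:
        raise ValueError("at least two trials are needed for a sample SD")
    return mean(per_trial), stdev(per_trial)


# Hypothetical metric values for three trials of one video class.
avg, sd = average_over_trials([1.0, 2.0, 3.0])
```

With G = 10, the same call is made once per video class, scheduler, and indicator.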

D. PriMARL Controller Settings
The PriMARL controller is trained for a duration of 10^7 TTIs, and learners are switched randomly from IDLE to ACTIVE and vice versa every 1000 TTIs, taking into account the traffic load ratio between classes. To improve the generalization in decision-making, the speed of each mobile learner is set to 30km/h. Several configurations of neural networks were tested, and only the best ones are considered in this paper. For example, each neural network used to approximate an agent's ranking decision uses a hidden layer with 80 hidden nodes. When covering the entire state of all video classes, the value function uses a neural network with one hidden layer and 200 hidden nodes. In our settings, we choose a discount factor of γ = 0.99, which gives more importance to the value of the next state when calculating the target value based on (9). Throughout training, we also consider equal chances of selection between exploration (random actions) and exploitation (actions based on trained functions) by setting ε = 0.5. The learning rates for the critic and all agents are varied according to the minimum errors found during the training period. Figure 4 shows the convergence analysis of the PriMARL algorithm in terms of mean and minimum errors and learning rates. By EV mean we denote the TD error of the critic neural network δ(θ) averaged over 1000 TTIs, and by EV min the minimum value reached during training. By EQ mean, we denote the error averaged over all agents and 1000 TTIs (1/4000 Σ_{p=1}^{P} δ(θ p )), while EQ min denotes the minimum value. It is worth noting that each time a new minimum is found in the mean error of each agent p ∈ P, the set of weights θ p is stored. When evaluating the PriMARL approaches, the most recently stored set of weights is used. As can be seen in Fig. 4, the error of the critic neural network drops below the value of 0.1 and remains relatively constant for the rest of the training period.
In contrast, the mean error of all four agents converges to a value of 10^−3 by the end of the training period. The learning rates associated with the critic (LRV) and agent (LRQ) neural networks are set to an initial value of 0.02 at the beginning of the training period, and gradually decrease with a step of 10^−7 each time a new minimum error is found for each type of neural network.
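The bookkeeping described above (store the weights and lower the learning rate whenever a new minimum mean error is found) can be sketched as below. The class is hypothetical and merges both rules into one tracker for brevity; the initial rate of 0.02 and the step of 10^−7 are the values quoted in the text, while representing the weights as a plain Python object is an assumption.

```python
import copy
from typing import Any


class MinErrorTracker:
    """Keep the best weights seen so far and decay the learning rate."""

    def __init__(self, lr: float = 0.02, step: float = 1e-7) -> None:
        self.lr = lr
        self.step = step
        self.best_error = float("inf")
        self.best_weights: Any = None

    def observe(self, mean_error: float, weights: Any) -> bool:
        """Return True when mean_error sets a new minimum."""
        if mean_error >= self.best_error:
            return False
        self.best_error = mean_error
        self.best_weights = copy.deepcopy(weights)  # snapshot for evaluation
        self.lr = max(self.lr - self.step, 0.0)
        return True


tracker = MinErrorTracker()
first = tracker.observe(0.5, [1.0, 2.0])   # new minimum
second = tracker.observe(0.6, [9.0, 9.0])  # worse error, nothing changes
```

One tracker per network type (critic and agents) would reproduce the separate LRV and LRQ schedules.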

E. QoS Analysis
To analyse the performance of QoS indicators in all video classes, we measure the levels of throughput, delay, and PLR for low, medium, and high traffic loads when employing the proposed PriMARL and state-of-the-art scheduling solutions. When quantifying the QoS provisioning, we are particularly interested in counting the number of TTIs when all QoS requirements are met in each video class.
1) Throughput, Delay, and Packet Loss: are collected for each scheduling scheme, traffic class, and mobile learner during the entire period of each trial. In particular, we are interested in calculating the percentiles for each collection of QoS indicators and identifying the worst indicators that could help us distinguish between the PriMARL solutions and other scheduling techniques. In this sense, we measure the percentiles of 5 th throughput, 95 th delay, and 95 th packet loss in each video class and average them over G = 10 trials. Figure 5 (first row) shows the performance of scheduling candidates when monitoring the 5 th throughput percentile. For video_1 and video_2, similar throughput is achieved by all solutions at low traffic load. However, for video_3 and video_4, HiMARL, PriMARL-BF, PriMARL-OPLF, and PriMARL-EXP improve the level of the 5 th throughput percentile by about 30kbps compared to the non-ML candidates RADS and FLS. At medium traffic load, PriMARL-EXP is the best option in the video_3 class, while in the video_4 class PriMARL-OPLF outperforms PriMARL-EXP by more than 20kbps. By increasing the traffic load to 'high', in the first two prioritized video classes the throughput level remains nearly similar in both cases. A larger discrepancy in performance between ML and non-ML approaches could be observed in the case of video_3, where PriMARL-OPLF outperforms the FLS scheduler by more than 100kbps. The impact of dynamic prioritization of PriMARL schemes can be observed when comparing the throughput performance of the classes video_3 and video_4. In this case, it can be observed that PriMARL-BF, PriMARL-OPLF, PriMARL-EXP, and HiMARL allocate a higher amount of resources to learners in the video_3 class, while RADS and FLS are not able to prioritize video_3 over video_4, achieving nearly the same throughput for both video classes. 
Except in the case of video_3 with medium traffic load, the PriMARL-OPLF solution remains the best option when measuring the 5 th throughput percentile in all traffic classes.
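For reference, a percentile over a collection of per-TTI indicators can be computed as in the sketch below. The nearest-rank definition is an assumption, since the percentile method is not specified here, and the sample data are illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for 0 < q <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must lie in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative throughput samples of 1..100 kbps.
samples = list(range(1, 101))
worst_throughput = percentile(samples, 5)   # 5th percentile
worst_delay_like = percentile(samples, 95)  # 95th percentile
```

A low percentile is the relevant tail for throughput, whereas a high percentile is the relevant tail for delay and packet loss.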
Considering the high traffic load and summing up the 5 th throughput percentiles across all four traffic classes and for each scheduler, we obtain gains higher than 55% and 36% when comparing PriMARL-EXP with RADS and FLS, respectively. As we discussed in Section II, MDV [26] and NRADS [27] achieve throughput gains of about 25% and 10% when compared to FLS and RADS, respectively. Therefore, we can estimate throughput gains of about 45% and 10% when comparing the PriMARL-EXP approach with the recent state-of-the-art schedulers NRADS and MDV, respectively.
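The estimate above can be checked with simple arithmetic. The sketch below (function names are illustrative) combines the reported gains in two ways: subtracting percentage points, which appears consistent with the figures of about 45% and 10%, and composing the gains as ratios against the common baseline, which gives slightly lower values. Which convention was intended is not stated, so both are shown.

```python
def gain_difference(gain_vs_base: float, other_vs_base: float) -> float:
    """Difference in percentage points between two gains over one baseline."""
    return gain_vs_base - other_vs_base


def gain_ratio(gain_vs_base: float, other_vs_base: float) -> float:
    """Relative gain (in %) obtained by dividing the two throughput ratios."""
    return ((1 + gain_vs_base / 100) / (1 + other_vs_base / 100) - 1) * 100


# PriMARL-EXP vs RADS (55%) and NRADS vs RADS (10%).
vs_nrads_diff = gain_difference(55, 10)
vs_nrads_ratio = round(gain_ratio(55, 10), 1)
# PriMARL-EXP vs FLS (36%) and MDV vs FLS (25%).
vs_mdv_diff = gain_difference(36, 25)
vs_mdv_ratio = round(gain_ratio(36, 25), 1)
```

Either way, these are indirect estimates, since NRADS and MDV were not simulated under the same conditions.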
The delay performance in terms of 95 th percentile is evaluated in Fig. 5 (middle row) for low, medium, and high traffic loads. In the first case, it can be observed that ML-based approaches perform better than RADS and FLS, especially for the video_3 and video_4 classes. Among all options, PriMARL-BF has the lowest delay in all video classes. When the traffic load is increased to medium and high, the delay increases, especially in video_3 and video_4. At medium traffic load, PriMARL-BF and PriMARL-OPLF minimize the delay in video_1, FLS in video_2, PriMARL-EXP in video_3 and RADS in video_4. For high traffic load, PriMARL-BF, FLS, PriMARL-EXP, and PriMARL-OPLF are the best solutions in the video_1, video_2, video_3 and video_4 classes, respectively. However, when correlating the delay and throughput performance (Fig. 5 first and second  rows), it is generally observed that lower delay percentiles are associated with higher throughput levels.
As shown in Fig. 5 (third row), the PLR performance is measured by the 95 th percentile of packet losses, averaged over the number of G = 10 trials. In case of low traffic load, the MARL approaches outperform RADS and FLS in video_1, while almost the same performance is obtained in video_2. In other traffic classes (video_3 and video_4), PriMARL-OPLF generally remains the best option among all candidates when scheduling low traffic load. For medium traffic load, PriMARL-OPLF gets the minimum PLR in video_1 and video_4, FLS in video_2, and PriMARL-EXP in video_3. When increasing the traffic load to high, the lowest PLR level is obtained by PriMARL-OPLF in all video classes. Similar to delay and throughput, when correlating the packet loss and user throughput, we observe that lower PLR involves higher throughput in terms of 5 th percentile and vice versa.
Looking at the performance of the QoS indicators in Fig. 5, we notice that PriMARL-OPLF generally performs better when measuring the 5 th throughput and the 95 th PLR percentiles, with a few exceptions. These exceptions relate to the PriMARL-EXP solution, which performs better in video_3 and video_4 when scheduling medium and low traffic loads, respectively. By using the reward as a multi-objective function of throughput, delay, and PLR, the HiMARL approach achieves a better balance of QoS performance in all video classes compared to the PriMARL approach with static rules. However, it remains to be verified whether this method is the best option for measuring duration when all QoS requirements are met in each video class.
2) Duration of QoS Provisioning: Figure 6 shows the normalized number of TTIs when all QoS requirements are met in each video class, averaged over G = 10 scheduling trials. In low traffic load settings, PriMARL and HiMARL approaches perform better compared to FLS and RADS schedulers. Since the video_1 and video_2 classes have a higher variability in arrival rates (242kbps and 138kbps, respectively) compared to animation and screencast videos (video_3 and video_4), it is very difficult to maintain certain levels of average throughput for these classes (video_1 and video_2) at the imposed GBR requirements over a very long period of time. This explains the longer duration of QoS provisioning in classes video_3 and video_4 compared to video_1 and video_2 when a low traffic load is scheduled. When the traffic load increases to medium and high, it is observed that the duration of QoS provisioning in video_1 and video_2 is similar to the previous case for all scheduling approaches, except for RADS where a higher performance degradation is obtained. However, in higher rate classes such as video_3 and video_4, PriMARL-EXP, PriMARL-OPLF, and HiMARL maintain the duration of providing high QoS significantly longer compared to FLS and RADS. In both settings of medium and high traffic loads, PriMARL-EXP is the best option, followed by HiMARL and PriMARL-BF in video_3 and PriMARL-OPLF in video_4. Therefore, the best strategy to maximise the duration of QoS provisioning is to schedule learners with the highest delay in each video class given the prioritization sequence decided for each TTI.
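The normalized duration of QoS provisioning for one video class can be sketched as the fraction of TTIs in which all three requirements hold at once. The function name, the inclusive comparisons, and the sample values are assumptions; the thresholds in the example are those listed for video_1 (242kbps, 150ms, 10^−3).

```python
from typing import Sequence


def qos_duration(throughput: Sequence[float], delay: Sequence[float],
                 plr: Sequence[float], gbr: float, max_delay: float,
                 max_plr: float) -> float:
    """Fraction of TTIs where GBR, delay, and PLR requirements are all met."""
    if not throughput:
        raise ValueError("at least one TTI is required")
    met = sum(
        1
        for t, d, loss in zip(throughput, delay, plr)
        if t >= gbr and d <= max_delay and loss <= max_plr
    )
    return met / len(throughput)


# Four hypothetical TTIs: only the first satisfies every requirement.
share = qos_duration(
    throughput=[250.0, 240.0, 250.0, 100.0],  # kbps
    delay=[0.10, 0.10, 0.20, 0.10],           # seconds
    plr=[0.0, 0.0, 0.0, 0.0],
    gbr=242.0, max_delay=0.15, max_plr=1e-3,
)
```

A single unmet requirement in a TTI is enough to exclude that TTI from the count.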

F. QoE Analysis
When we analyse the quality of experience of each learner being scheduled in each video class, we calculate the perceived PSNR at each TTI by employing the following formula [35]: where R p,u is the arrival rate in the data queue and T p,u is the throughput of learner u receiving video services from class p ∈ P.   video_1 and video_2 classes at the 5 th PSNR percentiles, we decided to plot the worst percentiles. Depicted in Fig. 7 are the 1 st PSNR percentiles averaged over 10 trials for each traffic load. In case of low traffic, the ML-based approaches outperform RADS and FLS in all video classes, except for video_2 where RADS performs slightly better. By assigning MOS levels to the calculated PSNR percentiles, an excellent MOS is ensured to all learners by all scheduling approaches in the first two prioritized classes; good and fair MOS levels are obtained by the ML-based approaches in video_3 and video_4, while fair to bad levels are obtained through RADS and FLS approaches. In medium traffic load, the best 1 st percentiles are obtained by using PriMARL-OPLF in video_2 and video_4 and PriMARL-EXP for the remaining classes. HiMARL provides a balance in PSNR performance within classes, without being the best option in any of them. Correlating to MOS, PriMARL-OPLF can get a good level in the video_1 and video_2 classes. However, in the remaining classes, a bad MOS level is experienced by all scheduling approaches. When increasing the traffic load to high, good MOS levels are obtained only in video_2 class by all ML-based approaches. When looking at the performance of 1 st PSNR percentiles for all traffic settings and video classes, the best values are obtained by PriMARL-OPLF and PriMARL-EXP solutions. We can conclude at this point that aiming to maximize the multi-objective function in terms of throughput, delay, and PLR will not guarantee the best performance in terms of worst PSNR percentiles, as we have seen in the case of the HiMARL approach.
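A minimal sketch of how a per-learner PSNR estimate and its MOS level could be computed from the arrival rate R p,u and the throughput T p,u is given below. The expression used, 20 log10(R / |R − T|), and the PSNR-to-MOS thresholds (37, 31, 25, and 20 dB) are assumptions borrowed from common practice in the scheduling literature; they are not taken from [35] and may differ from the formula used in this study.

```python
import math


def estimate_psnr(arrival_rate: float, throughput: float) -> float:
    """Assumed throughput-based PSNR estimate in dB."""
    if arrival_rate <= 0:
        raise ValueError("arrival_rate must be positive")
    gap = abs(arrival_rate - throughput)
    if gap == 0:
        return math.inf  # throughput matches the arrival rate exactly
    return 20 * math.log10(arrival_rate / gap)


def psnr_to_mos(psnr_db: float) -> int:
    """Map PSNR to a MOS level from 5 (excellent) to 1 (bad); assumed bands."""
    if psnr_db > 37:
        return 5
    if psnr_db > 31:
        return 4
    if psnr_db > 25:
        return 3
    if psnr_db >= 20:
        return 2
    return 1


# Hypothetical learner: 1000 kbps arriving, 900 kbps delivered.
psnr = estimate_psnr(1000.0, 900.0)
mos = psnr_to_mos(psnr)
```

Under these assumptions, the closer the throughput is to the arrival rate, the higher the estimated PSNR and the MOS level.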
2) MOS Analysis: This analysis counts the number of PSNR percentiles that fall in each of the five MOS levels, averaged over ten trials in downlink scheduling. Highlighted in green, we represent the best performance in terms of the highest and lowest number of PSNR percentiles with excellent and bad MOS, respectively. In Tables I, II, and III we present the MOS analysis in the form of numerical results for each scheduler type in low, medium, and high traffic load. The results are averaged over G = 10 trials and the Standard Deviation (SD) values are reported in brackets.
When scheduling low traffic load of video_1 (Table I) ... In the second prioritized class, more than 99% of the PSNR percentiles are in excellent MOS level for all approaches. The same performance is obtained in video_3 by MARL-based approaches only, while a degradation of more than 2% in excellent MOS level is obtained by the other approaches (RADS and FLS). When scheduling learners in video_4, PriMARL-EXP, PriMARL-OPLF and HiMARL achieve a percentage higher than 98% of the PSNR percentiles with excellent MOS, while the lowest amount of percentiles in bad MOS is obtained when using PriMARL-OPLF. In case of RADS and FLS, more than 3% degradation of excellent MOS services can be observed. When looking at the overall performance in low traffic setting, PriMARL-EXP, PriMARL-OPLF, and HiMARL could be identified as the best options.

[TABLE II: MOS LEVELS FOR MEDIUM TRAFFIC]

[TABLE III: MOS LEVELS FOR HIGH TRAFFIC]
Under medium traffic load (Table II), all candidates except RADS obtain nearly the same performance, with about 98% of the PSNR percentiles at the excellent MOS level, when scheduling learners from the video_1 and video_2 classes. For video_3, PriMARL-EXP achieves the highest share of percentiles with excellent MOS and the lowest share with bad MOS, which makes it the best option among the candidates. HiMARL follows the PriMARL-OPLF policy by degrading the performance uniformly over the MOS levels. RADS achieves a similar performance in terms of the percentage of PSNR percentiles with excellent MOS, but it substantially increases the share of percentiles located at the bad MOS level. However, because it does not respect the imposed prioritization scheme, RADS provides the highest and lowest numbers of PSNR percentiles with excellent and bad MOS, respectively, when scheduling learners in the video_4 class. Looking at the overall MOS performance within the video classes under medium traffic load, it can be concluded that PriMARL-OPLF is the best option for video_1 and video_2, while PriMARL-EXP achieves a much higher percentage of excellent PSNR percentiles when scheduling the video_3 and video_4 classes.
When the traffic load increases from medium to high (Table III), RADS allocates more resources to video_4, which has the lowest priority requirements, and degrades the MOS levels in the two highest-priority service classes, video_1 and video_2. For these classes, all other scheduling options provide more than 98% of the PSNR percentiles with excellent MOS, and PriMARL-OPLF is the best option among them. For video_3, PriMARL-EXP outperforms the other scheduling candidates, with more than 60% of the percentiles at the excellent MOS level and around 27% at the bad MOS level. The second-best option in this case is PriMARL-BF, with 55% of the percentiles at excellent MOS and 41% at bad MOS. As previously observed in lower traffic settings, the PriMARL-OPLF scheduling technique aims to minimize the packet loss for all learners without any specific control on PSNR performance. The HiMARL approach follows the OPLF scheduling rule for the resource allocation in video_3 and consistently increases the percentage of PSNR percentiles at the fair, poor, and bad MOS levels. When scheduling learners in video_4, FLS obtains the same performance as for the video_3 class, which means that only the group of video_1 and video_2 services is prioritized over the video_3 and video_4 classes. Among the ML-based approaches, PriMARL-BF obtains the highest share of PSNR percentiles with excellent MOS, about 32%, while PriMARL-EXP obtains the lowest share of percentiles with bad MOS, about 56%.
Summarizing the results from Tables I, II, and III, the following conclusions can be drawn from the perspective of MOS levels over the calculated PSNR percentiles: a) RADS does not respect the imposed prioritization scheme and provides a higher number of PSNR percentiles with excellent MOS in the video_2 and video_4 classes than in video_1 and video_3, respectively, especially for medium and high traffic loads; b) FLS prioritizes the group of video_1 and video_2 classes over the rest, but it cannot prioritize video_3 over video_4 and provides nearly the same distribution of MOS levels in both classes for all traffic settings; c) HiMARL aims at maximizing the multi-objective reward function in terms of QoS requirements for all learners in all video classes, thus reducing the number of learners experiencing excellent MOS levels of video content; d) PriMARL-BF and PriMARL-OPLF are fair to learners from all classes regardless of the wireless channel conditions, which is why they yield a higher number of PSNR percentiles with bad MOS, especially when providing video_3 and video_4 services at medium and high traffic loads; e) being able to properly prioritize and schedule learners based on the highest packet delay, PriMARL-EXP provides the best results, substantially improving over the other candidates in terms of the percentage of results in the excellent MOS category.

G. Additional Results
As we observed, the MARL-based approaches prioritize learners from the considered video classes much better than more conventional scheduling approaches, such as RADS and FLS. When evaluating the QoS performance in Fig. 5, we observe that PriMARL-OPLF and PriMARL-EXP obtain the highest throughput levels (5th percentiles) and the lowest packet loss rates (95th percentiles) in different video classes and traffic settings. The HiMARL metascheduler provides the best trade-off between PriMARL-OPLF and PriMARL-EXP in terms of delay, PLR, and throughput because a different scheduling rule is selected to perform the radio resource allocation based on the networking conditions in each class. However, focusing only on improving the QoS performance and ensuring a good trade-off between throughput, delay, and PLR does not guarantee enhanced performance in terms of perceived QoE.
When evaluating the PSNR and MOS, we considered three levels of traffic load. For our discussion, we would like to find an approximate average number of learners that can be supported with excellent MOS in all video classes under the different scheduling approaches. So far, in Tables I-III, we averaged the MOS levels over the number of learners in the intervals [6, 20], [21, 40], and [41, 60] for the low, medium, and high traffic load settings, respectively. Averaging over these intervals gives the number of learners supported in each traffic setting: 12, 30, and 50 for low, medium, and high traffic load, respectively. With the ratios between video classes introduced in Section V-A, the following averaged numbers of learners in each video class are obtained: a) in low traffic load, Based on the MOS statistics reported in Tables I, II, and III, we next estimate the approximate number of learners experiencing excellent MOS of video content in each class when employing the best PriMARL scheduling schemes compared to other approaches.
In low traffic settings (Table I), the thresholds for MOS dropping from excellent to lower levels are about 50% for slideshow content with high and low quality (video_1 and video_2), and 75% for animation and screencast content with low quality (video_3 and video_4). All scheduling approaches analysed in Table I achieve more than 94% of PSNRs at the excellent MOS level, and therefore all 12 learners from the different video classes experience an excellent MOS level of the viewed content most of the time.
When scheduling medium traffic load (Table II), we approximate the number of learners with excellent MOS to five for all scheduling approaches when watching slideshow content at high and low quality (video_1 and video_2). In the case of video_3, PriMARL-EXP and PriMARL-BF provide excellent MOS to nine learners watching animation content, while RADS and HiMARL handle eight, and FLS and PriMARL-OPLF seven learners. When scheduling learners with screencast content, eight of them can get excellent MOS with RADS, seven with PriMARL-EXP, FLS, and PriMARL-BF, and six with HiMARL and PriMARL-OPLF. By summing the number of learners experiencing excellent MOS in all video classes, we observe that both RADS and PriMARL-EXP support the same number of learners with this quality, which is 26. However, PriMARL-EXP prioritizes the viewers of animation content (video_3) much better than those of screencast content (video_4): U_3 : U_4 = 9 : 7 for PriMARL-EXP compared to U_3 : U_4 = 8 : 8 for RADS.
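The medium-load comparison between RADS and PriMARL-EXP can be checked with a short calculation. The sketch below is illustrative only: it assumes that "five" learners applies to each of video_1 and video_2 (which is consistent with the stated total of 26), and the dictionary and function names are hypothetical rather than part of the paper's simulator.

```python
# Learners with excellent MOS at medium traffic load, read from the text above.
# Tuple order: (video_1, video_2, video_3, video_4).
# Assumption: five learners in EACH of video_1 and video_2.
excellent_medium = {
    "RADS": (5, 5, 8, 8),
    "PriMARL-EXP": (5, 5, 9, 7),
}


def summarize(counts):
    """Return (total learners, (U_3, U_4), whether video_3 is favoured over video_4)."""
    total = sum(counts)
    u3, u4 = counts[2], counts[3]
    return total, (u3, u4), u3 > u4


summary = {name: summarize(counts) for name, counts in excellent_medium.items()}

for name, (total, (u3, u4), favoured) in summary.items():
    print(f"{name}: total={total}, U_3:U_4={u3}:{u4}, video_3 favoured={favoured}")
```

Both schedulers reach the same total, so the difference lies only in how the excellent-MOS learners are split between the two lower-priority classes.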
In high traffic load settings (Table III), all eight learners can receive slideshow content at high quality with excellent MOS under the analysed scheduling approaches, except for RADS, which supports only six viewers. At the lower quality of video_2, RADS provides nearly the same performance as the other candidates, supporting the same number of learners with excellent MOS. When delivering animation content (video_3), PriMARL-EXP is the best option, supporting ten viewers, followed by PriMARL-BF with nine, RADS and FLS with seven, HiMARL with three, and PriMARL-OPLF with two learners. In the case of screencast video streaming and scheduling, eight, seven, five, and five viewers are supported by the RADS, FLS, PriMARL-BF, and PriMARL-EXP approaches, respectively. By summing the number of viewers with excellent MOS in all video classes, PriMARL-EXP remains the best option with 31 learners, while PriMARL-BF and FLS support 30 learners with the same QoE. However, PriMARL-BF prioritizes animation better compared to screencasts. The list continues with the RADS, HiMARL, and PriMARL-OPLF schedulers, which obtain excellent MOS for 29, 20, and 18 learners, respectively.
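To put these totals in perspective, they can be expressed as a share of the average number of learners per traffic setting (12, 30, and 50). The following is a minimal sketch with hypothetical names; rounding down to a whole percentage is an assumption made here, not something the text specifies.

```python
# Average number of learners per traffic setting, as stated in this subsection.
AVG_LEARNERS = {"low": 12, "medium": 30, "high": 50}

# Total learners with excellent MOS at high traffic load, as summed in the text.
excellent_high = {
    "PriMARL-EXP": 31,
    "PriMARL-BF": 30,
    "FLS": 30,
    "RADS": 29,
    "HiMARL": 20,
    "PriMARL-OPLF": 18,
}


def share_percent(supported, total):
    """Share of learners with excellent MOS, rounded down (assumed rounding)."""
    return 100 * supported // total


shares_high = {
    name: share_percent(count, AVG_LEARNERS["high"])
    for name, count in excellent_high.items()
}

# PriMARL-EXP at low, medium, and high load (12, 26, and 31 learners).
primarl_exp_shares = [
    share_percent(12, AVG_LEARNERS["low"]),
    share_percent(26, AVG_LEARNERS["medium"]),
    share_percent(31, AVG_LEARNERS["high"]),
]

print(shares_high)
print(primarl_exp_shares)
```

Under these assumptions, the gap between PriMARL-EXP and the weakest candidate at high load is 26 percentage points.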

H. Summary
The QoS analysis (Section V-E) shows that, with few exceptions, PriMARL-OPLF achieves the best results when measuring the 5th throughput and 95th PLR percentiles, while PriMARL-EXP performs slightly better when measuring the 95th delay percentile. When monitoring the time during which all QoS requirements are met for each video class, PriMARL-EXP performs better than all other candidates, especially under medium and high traffic loads when lower-priority video services are delivered. From the QoE analysis (Section V-F), PriMARL-OPLF obtains the highest level of 1st PSNR percentiles in almost all cases. However, when considering the QoE levels for all traffic loads, PriMARL-EXP obtains the highest number of PSNR percentiles with excellent MOS while maintaining the required prioritization between video classes. Further analysis (Section V-G) shows that PriMARL-EXP outperforms the other candidates in terms of the number of learners experiencing excellent MOS values of video content in each class and traffic load. Compared to the previous work [7], in which HiMARL is proposed to decide at each TTI the prioritization among classes as well as the scheduling rule for each class, in this paper we show that, from a QoE perspective, maintaining a static scheduling rule in the frequency domain is more efficient.

VI. CONCLUSION
This paper proposes a PriMARL-based decision-making solution to improve the QoS and QoE provisioning when delivering heterogeneous educational video content in the context of remote education. The proposed PriMARL framework employs an intelligent agent for each class of service that learns, through a neural network, to claim its own priority to be scheduled in the frequency domain. All agents cooperate in the form of a joint action that is applied to maximize the overall QoS provision in all classes. Simulation results show that ensuring a good QoS performance does not guarantee excellent QoE levels in different prioritized video classes. We also observed that the scheduling rule employed to conduct the scheduling and radio resource allocation plays a crucial role in obtaining high QoE. Among all options analysed in this paper, the proposed PriMARL-based prioritization scheme with the exponential scheduling rule works best in terms of perceived QoE. The proposed approach supports 100%, 86%, and 62% of learners with excellent MOS in low, medium, and high traffic settings, respectively.

Per Bergamin is a Professor of Didactics in Distance Education and E-Learning with the Swiss Distance University of Applied Sciences. Since 2006, he has been the Director of the Institute for Research in Open-, Distance-, and eLearning. In 2020, he was also appointed as an Extraordinary Professor with the Faculty of Education, North-West University, South Africa. Since 2016, he has held the UNESCO Chair on personalized and adaptive distance education. His research activities focus on self-regulated and technology-based personalized and adaptive learning. Central aspects are instructional design, usability, and application implementation.
Gabriel-Miro Muntean (Fellow, IEEE) is a Professor with the School of Electronic Engineering, Dublin City University (DCU), Ireland, and the Co-Director of the DCU Performance Engineering Lab. He has published over 450 papers in top international journals and conferences, authored four books and 28 book chapters, and edited 9 other books. His research interests include quality, performance, and energy issues related to rich media delivery, technology-enhanced learning, and other data communications over heterogeneous networks. He is an Associate Editor of the IEEE TRANSACTIONS ON BROADCASTING, the Multimedia Communications Area Editor of the IEEE COMMUNICATIONS SURVEYS AND TUTORIALS, and a reviewer for top international journals, conferences, and funding agencies.
Cristina Hava Muntean (Member, IEEE) received the Ph.D. degree from Dublin City University, Ireland, in 2005. She is an Associate Professor with the School of Computing, National College of Ireland. She performed various research activities in the past 18 years fostering and promoting research, leading research projects, supervising Ph.D. and M.Sc. students, and publishing over 120 publications in international peer-reviewed books, journals, and conferences. Her main research areas are adaptive multimedia, adaptive and personalized learning, and user quality of experience.
Ramona Trestian received the Ph.D. degree from Dublin City University, Ireland, in 2012. She is a Senior Lecturer with the Design Engineering and Mathematics Department, Middlesex University, London, U.K. She published in prestigious international conferences and journals and has one authored and five edited books. Her research interests include mobile and wireless communications, quality of experience, multimedia streaming, handover and network selection strategies, and digital twin modeling. She is an Associate Editor of the IEEE COMMUNICATIONS SURVEYS AND TUTORIALS.