Deep Reinforcement Learning for Resource Constrained Multiclass Scheduling in Wireless Networks

The problem of multiclass scheduling in a dynamic wireless setting is considered here, where limited bandwidth resources are allocated to handle randomly arriving service demands belonging to different classes in terms of payload data request, delay tolerance, and importance/priority. In addition to heterogeneous traffic, another major challenge stems from random service rates due to time-varying wireless communication channels. Existing scheduling and resource allocation approaches, ranging from simple greedy heuristics and constrained optimization to combinatorics, are tailored to specific network or application configurations and are usually suboptimal. On this account, we resort to deep reinforcement learning (DRL) and propose a distributional Deep Deterministic Policy Gradient (DDPG) algorithm, combined with Deep Sets, to tackle the aforementioned problem. Furthermore, we present a novel way to use a Dueling Network, which leads to a further performance improvement. Our proposed algorithm is tested on both synthetic and real data, showing consistent gains against baseline methods from combinatorics and optimization, as well as state-of-the-art scheduling metrics. Our method can, for instance, achieve the same user satisfaction rate as a myopic algorithm based on knapsack optimization while using 13% less power and bandwidth.


I. INTRODUCTION
Scheduling and resource allocation are two related yet challenging problems with a plethora of practical applications in various fields. For instance, in computing systems, computational processes have to be efficiently arranged and planned for the server to handle them; in management, each person is assigned a set of jobs for completion; in logistics, packages have to be carefully matched to each truck. Since resources, e.g., central processing units (CPUs), workers, trucks, etc., are limited, they have to be shared efficiently among different tasks and requests so as to optimize the system performance. Optimal resource allocation, together with the associated scheduling task, is one of the main challenges and requirements for the design of communication networks. How efficiently the available resources (e.g., subbands, timeslots, beams, transmit power, etc.) are managed has a direct impact on the communication system performance.
In this paper, we investigate the problem of scheduling and resource allocation in wireless networks. A base station (BS) sends data traffic to mobile users, which have different application-dependent Quality of Service (QoS) requirements. We consider applications that require the delivery of large amounts of data without any strict deadline, as well as time-sensitive or mission-critical ones involving low-payload packets that have to be reliably received within a stringent latency constraint. The increased heterogeneity in users' traffic and the diverse service requirements substantially complicate the provisioning of high-fidelity, personalized service with QoS guarantees.
The objective of this work is to design a generic architecture and efficient algorithms, which take as input the specific constraints of the traffic/service class to which each user belongs, and output the set of users to serve, as well as the allocated resources and timeslots, so as to maximize the number of satisfied users.
The problem considered here is hard to solve due to several major technical challenges. First, with the exception of very few special cases, there is no simple closed-form expression for the problem, and a fortiori for its analytical solution. Second, optimization algorithms that solve the problem have to be computationally efficient and implementable in large-scale wireless networks. Applying optimal methods from combinatorial optimization, such as the branch and bound algorithm [1], results in solutions exhibiting prohibitively high computational complexity that are hard or impossible to meaningfully scale with the number of active users. Other existing approaches, relying on heuristics, approximations, or relaxations, provide suboptimal solutions, which may work satisfactorily in specific scenarios but fail to perform close to optimal in general cases and with a large number of users. Moreover, the proliferation of new use cases makes the problem of efficient and scalable scheduling and resource allocation more intricate. This will be exacerbated with the advent of the emerging mobile systems (Beyond 5G/6G), which will involve high-dimensional optimization domains, various application scenarios, as well as heterogeneous, often conflicting, QoS requirements. This motivates the quest for alternative methods.
In this paper, we propose to resort to Deep Reinforcement Learning (DRL) for efficient and scalable resource allocation. DRL has recently attracted much attention for providing very promising results in complex problems obeying strict game rules (e.g., Atari, Chess, Go [2]-[4]) or physical laws (robotics and physics-related tasks [5], [6]). In cloud service provisioning, DRL has been used to schedule incoming tasks to servers according to their heterogeneous CPU and memory requirements [7]. DRL approaches have also recently shown interesting gains in wireless communication systems [8]-[16]. In contrast to most prior work, and to harness the high level of stochasticity, we consider distributional DRL [17]-[19] as a means to obtain richer representations of the environment and thus better solutions. Furthermore, we leverage (i) techniques such as noisy networks for better exploration [20]; (ii) architectures such as dueling networks [21] for improved stability of the trained models; and (iii) deep sets [22] for simplifying and improving neural network models when permutation invariance properties apply. We combine these three ingredients with a deep deterministic policy gradient method [23] to propose a highly efficient general architecture and scheduling/resource allocation algorithm. In a setup similar to ours and to Nokia's challenge [24], deep deterministic policy gradient is used in [25] to allocate bandwidth to incoming data traffic. Nevertheless, unlike our work, [25] considers only full channel state information (CSI), a single traffic class, and only a few users (typically fewer than 15). In [26], Graph Neural Networks, a technique similar to Deep Sets, are used to increase the number of users, but the users' traffic is not considered. Initial attempts to solve the problem of scheduling traffic for users with heterogeneous performance requirements can be found in [27], considering though only full CSI and a limited number of users.
In the context of using DRL for revisiting the problem of heterogeneous multiclass scheduling and dynamic resource allocation in wireless communication networks, our main contributions can be summarized as follows:
• We develop a neural network with two crucial architectural choices facilitating stable training even in the case of high traffic from a very large number of users. First, we leverage Deep Sets [22] as a means to exploit the permutation equivariance property of the problem and drastically reduce the number of necessary parameters. Second, we introduce a user normalization trick capturing the attribute of the problem that the available bandwidth resources are limited. We show that without those crucial steps, the performance plummets.
• We further improve the performance using distributional DRL [19] and reward scaling as implemented in [28]. Finally, we obtain additional gains by adapting the idea of dueling networks [21], used in Deep Q-Networks (DQN), to distributional RL by modifying the output to represent the distribution of the return of the agent's action.
• We demonstrate that our DRL proposal can easily be implemented with minor changes in both extremal cases in terms of wireless channel knowledge, namely full CSI and no CSI.
• To compare our DRL solution, we design strong baselines:
-In the full CSI case, the scheduling step is solved in a myopically optimal way by reformulating it as a knapsack problem. The DRL scheduler outperforms baseline schemes in the sense that it reaches the same performance while requiring 13% less power and bandwidth. Furthermore, we devise a baseline operating as an oracle knowing all future traffic characteristics. The oracle finds the optimal resource allocation policy via Integer Linear Programming (ILP) and constitutes an upper bound. Our experimental results show that the proposed DRL scheduler operates close to the upper bound.
-In the no CSI case, the model-based baseline actually requires access to the statistics of the problem and uses them to cast the scheduling problem as an optimization one. The baseline scheme employs the Frank-Wolfe (FW) algorithm, which guarantees that the solution is a local optimum. Our DRL scheme significantly outperforms Frank-Wolfe, supporting our hypothesis that the more complicated the communication system is, with unknown variables affecting it, the higher the gains that may be yielded by a DRL-based model-free method.
The paper is organized as follows. In Section II, we introduce the system model, including the channel and traffic models. In Section III, we formulate the optimization problem, and Section IV is devoted to the main contribution of the paper, namely the design of a new DRL scheduler for heterogeneous multiclass traffic. In Section V, baseline algorithms used for performance comparison are presented. In Section VI, we provide experimental results with both synthetic and real data, and Section VII concludes the paper.

II. SYSTEM MODEL

A. Network and channel model
We consider the downlink of a communication system, in which a BS serves multiple users by sending data over a wireless, randomly time-varying channel. Users are uniformly distributed within two concentric rings of radii d_min and d_max > d_min. Therefore, the distance of a user u from the BS is a random variable with probability density function (PDF) f(d) = 2d/(d_max^2 - d_min^2), d ∈ [d_min, d_max]. We assume that mobility is not very high, so that BS-user distances remain constant during the time interval users remain active.
Orthogonal frequency bands are assigned to simultaneously served users, hence there is no interference among them. Users experience frequency-flat fading, i.e., the channel gain of a user remains constant during a time slot and throughout all available frequency bands assigned to it. Consider a user u that has entered the system at time t_0. Its channel gain at time t is given by κ_u |h_{u,t}|^2, with κ_u = P C_pl d_u^{-n_pl} / σ^2_N, where n_pl denotes the pathloss exponent, C_pl is a constant accounting for constant losses, and σ^2_N is the noise power spectral density. The small-scale fading h_{u,t} evolves over time according to the following Gauss-Markov model
h_{u,t} = ρ h_{u,t-1} + √(1 - ρ^2) e_{u,t},
where h_{u,t_0} ∼ CN(0, 1) (circular complex normal distribution with zero mean and unit variance), e_{u,t} ∼ CN(0, 1), and
ρ = J_0(2π f_d T_slot)   (1)
determines the time correlation of the channel, with J_0(·) denoting the zeroth-order Bessel function of the first kind, f_d the maximum Doppler frequency (determined by the user mobility), and T_slot the slot duration. If ρ = 0 (high mobility), a user experiences an independent realization of the fading distribution at each time slot (i.i.d. block fading). If ρ = 1 (no mobility), the channel attenuation is constant throughout the user's lifespan (no small-scale fading). We consider the following two cases for the channel state information (CSI): (i) full CSI, in which h_{u,t_c} and the users' locations (and so d_u) are perfectly known at the BS at time t_c, thus enabling accurate estimation of the exact resources each user requires; (ii) no CSI, in which the scheduler is completely channel-agnostic, both in terms of instantaneous fading realization and long-term channel statistics. In case of unsuccessful and/or erroneous data reception, a simple retransmission protocol (Type-I Hybrid Automatic Repeat Request (HARQ)) is employed. A packet is discarded whenever the user fails to correctly decode it (no buffering at the receiver side) and the BS will attempt to send it again in some subsequent slot.
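For illustration, a minimal simulation of this fading model is sketched below (Python; variable and constant names are ours, not from the paper).

```python
import numpy as np

def simulate_fading(n_slots, rho, seed=0):
    """Sample a small-scale fading trajectory following the Gauss-Markov
    model h_t = rho * h_{t-1} + sqrt(1 - rho^2) * e_t, with h_0, e_t ~ CN(0, 1),
    so that the unit variance is preserved at every slot."""
    rng = np.random.default_rng(seed)
    cn = lambda: (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
    h = np.empty(n_slots, dtype=complex)
    h[0] = cn()
    for t in range(1, n_slots):
        h[t] = rho * h[t - 1] + np.sqrt(1 - rho**2) * cn()
    return h

# Per-slot SNR of user u: kappa_u * |h_t|^2, with the (hypothetical) constant
# kappa_u = P * C_pl * d_u**(-n_pl) / sigma2_N.
```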
Remark 1: For a non-trivial implementation of the Frank-Wolfe (FW) algorithm, which serves as a baseline for comparison in the no-CSI case, we need to consider some kind of CSI. For that, we consider the case of statistical CSI, where the scheduler knows the statistics of the users' channels and locations. Our proposed DRL algorithm will always operate under full absence of CSI, since all statistics can effectively be learned through the training phase.

B. Traffic model
We consider a generic yet tractable traffic model, in which users with diverse data and latency requirements arrive to and depart from the system. There is a set of service classes C, to which a user entering the system belongs with probability p_c. Each user in class c ∈ C is characterized by the tuple (D_c, L_c, α_c) as follows:
• Data size D_c: the number of information bits requested by a user belonging to class c.
• Maximum Latency L c : the maximum number of time slots within which the user has to be satisfied, i.e., to successfully receive its data packets of size D c .
• Importance α c : an index allowing the scheduler to prioritize certain service classes, e.g., users with privileged contracts (e.g., high-value Service-Level Agreement (SLA)) may demand better service and higher reliability.
We assume that a maximum number of users K can coexist per time slot and that a new user may arrive only after the departure of a user that exceeded the maximum time allowed to remain in the system. That way, the scheduling decisions do not influence the arrival process. For example, if a user arrives at time t_0 = 1, belonging to class c ∈ C with L_c = 4, then even if it successfully receives its requested packet of size D_c at t = 1, a new arrival may randomly be generated only at time t = t_0 + L_c = 5 and afterwards. The rationale behind adopting this model is as follows.
If a new arrival were generated right after a previous user is satisfied (in the example, at time t = 2), then the traffic load would be affected by the scheduler performance: the faster the scheduler serves the users, the more arrivals occur. In contrast, in our model, the arrival process and its statistics remain uninfluenced by the scheduling decisions and the available resources. Therefore, at every time slot, the set of users U_t (|U_t| ≤ K) contains all users waiting to receive their requested data while remaining within their latency constraint. Finally, to ensure random inter-arrival times, we assert that the probability p_null = 1 - Σ_{c∈C} p_c is positive, i.e., p_null > 0, leaving a probability that no user appears in a time slot.
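As a sketch, the arrival process can be simulated as follows (the class table here is hypothetical; the values actually used in the experiments are those of Table I).

```python
import numpy as np

# Hypothetical class table {name: ((D_c bits, L_c slots, alpha_c), p_c)}.
CLASSES = {"urgent": ((4_000, 2, 1.0), 0.5), "bulk": ((400_000, 10, 1.0), 0.3)}
P_NULL = 1.0 - sum(p for _, p in CLASSES.values())  # must stay positive

def draw_arrival(rng=np.random.default_rng(0)):
    """One arrival attempt for a freed user slot: returns a class name,
    or None (the 'null' event) with probability p_null."""
    names = list(CLASSES) + [None]
    probs = [p for _, p in CLASSES.values()] + [P_NULL]
    return names[rng.choice(len(names), p=probs)]
```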

C. Service Rate
The service rate is measured using the Shannon rate expression, assuming capacity-achieving codes. The achievable service rate of user u at time t is equal to w_{u,t} R_{u,t}, where R_{u,t} = log_2(1 + κ_u |h_{u,t}|^2) is the spectral efficiency; averaging it over the fading yields a closed form involving Γ(s, x) = ∫_x^∞ t^{s-1} e^{-t} dt, the upper incomplete gamma function. For exposition convenience, we overload notation by allowing u in D_u, L_u, α_u to denote either a class u or a user u belonging to a class with those characteristics.
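For completeness, a short derivation (ours, under the unit-mean exponential distribution of |h_{u,t}|^2 implied by the CN(0, 1) fading) of the mean spectral efficiency in terms of this function:

```latex
\mathbb{E}_h\left[\log_2\left(1+\kappa_u |h|^2\right)\right]
  = \int_0^{\infty} \log_2\left(1+\kappa_u x\right) e^{-x}\,\mathrm{d}x
  = \log_2(e)\, e^{1/\kappa_u}\, \Gamma\!\left(0, \tfrac{1}{\kappa_u}\right).
```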

III. PROBLEM STATEMENT
We consider the problem of heterogeneous scheduling and resource allocation, which involves a BS handling a set of randomly arriving service requests belonging to different classes with heterogeneous requirements. Each class defines the requirements and the expected Quality of Service (QoS) guarantees for its users. Observing this time-varying set of heterogeneous requests, the objective of the scheduler at each time slot is two-fold: (i) carefully select which subset of user requests to satisfy, and (ii) allocate the finite resources amongst the selected user requests. The performance metric to maximize is the long-term importance-based weighted sum of successfully satisfied requests. A request is considered satisfied whenever the user has received the requested data within the maximum tolerable latency specified by its service class.
The scheduling problem at hand can be formulated as a Markov Decision Process (MDP) [30] (S, A, R, P, γ), where S is the state space of the environment and A is the action space, i.e., the set of all feasible allocations in our case. After action a_t ∈ A at state s_t ∈ S, a reward r_t ∼ R(·|s_t, a_t) is obtained and the next state follows the probability s_{t+1} ∼ P(·|s_t, a_t). The discount factor is γ ∈ [0, 1). Under a fixed policy π : S → A determining the action at each time step, the return is defined as the random variable Z^π_t = Σ_{k≥0} γ^k r_{t+k} (4), which represents the discounted sum of rewards when a trajectory of states is taken following π.
Ideally, the aim is to find the optimal policy π that maximizes the expected return E_π[Z^π_t]. At each time step t, a set of users u ∈ U_t is waiting for service, where each user therein belongs to a class c_u ∈ C described by (D_c, L_c, α_c). After L_c time steps, a new user belonging to class c might arrive with probability p_c. Throughout its "lifespan", w_{u,t} indicates the amount of resources given to user u at time t. If at any time t, w_{u,t} ≥ D_u/R_{u,t}, then user u is satisfied.
Since resources are limited (finite), Σ_{u∈U_t} w_{u,t} ≤ W, ∀t, i.e., no more than W resources in total can be spent per time slot. Summing up, the state is s_t = {(D_u, L_u, α_u, κ_u, h_{u,t}, l_{u,t}) : u ∈ U_t}, where l_{u,t} ≤ L_u is the remaining number of time slots within which user u (i.e., u ∈ U_t) expects to successfully receive its packet, and the per-slot reward adds α_u 1{u is satisfied at t} over the users, where 1{·} denotes the indicator function. Note that knowing the class c_u to which user u belongs implies knowing the requirements (D_u, L_u, α_u). An inherent attribute of this MDP is the permutation equivariance of an optimal policy, meaning that if we permute the indexing of the users, then permuting likewise the allocation of the resources retains the performance of the policy. For that reason, in our DRL approach, we only consider permutation equivariant policies, and as a consequence we need a permutation invariant function to evaluate and train the policy.
In this work, we focus on bandwidth allocation, assuming a fixed amount of energy spent per channel use and no power adaptation, i.e., P_{u,t} = P, ∀u, t. Specifically, for total bandwidth W, the scheduler aims at finding (w_{u_1,t}, w_{u_2,t}, ...) ∈ R^{|U_t|}_{≥0}, with u_1, u_2, ... ∈ U_t and Σ_{u∈U_t} w_{u,t} ≤ W, ∀t, so as to maximize the accumulated reward for every satisfied user over a finite time horizon. The expected reward is described by the objective "gain-function" G in (5), i.e., the expected importance-weighted number of satisfied users accumulated over the horizon. We stress that a user u remains in the set U_t for a time interval less than or equal to the maximum acceptable latency L_u. If not satisfied within that interval, it does not contribute positively to the objective G.
Note that R_{u,t} satisfies the Markov property since h_{u,t} follows a Markov model. Under full CSI, the agent (here the BS) fully observes the state s_t, while in the no-CSI case, h_{u,t} is unknown, resulting in a Partially Observable MDP (POMDP) [31]. One way to transform a POMDP into an MDP is by substituting the states with the "belief" of the states [32]. Another way is to use the history of past observations and actions. Notice that only the most recent part of the history is relevant, as users that have already left the system do not affect the way the channels of the current users evolve, the generation of future users, or, in general, the current and future system dynamics. Therefore, we can safely consider the scheduling and allocation history of only the current users. Specifically, if w_{u,t} = (w_{u,t_0}, w_{u,t_0+1}, ..., w_{u,t}) is the scheduling history of user u, then the input of the agent is {∀u ∈ U_t : D_u, L_u, α_u, κ_u, l_{u,t}, w_{u,t}}.

IV. PROPOSED DEEP REINFORCEMENT LEARNING ARCHITECTURE
In this section, we propose a novel DRL architecture as a means to solve the aforementioned multiclass scheduling and resource allocation problem. Despite the highly challenging dynamics and stochasticity (wireless channel and heterogeneous traffic), we show that DRL can provide performance gains, although it is impossible to accurately predict the number of users, their service demands, and their channel/link characteristics even a few steps ahead.

A. Policy Network
Our objective is to build a scheduler that can handle a large number of users K, even in the order of hundreds. Moreover, we require that our method works in both the full CSI and no CSI cases with minor (if any) modifications. A widely used approach is the Deep Q-learning Network (DQN). However, it is not feasible to employ DQN in our case, since it needs a Neural Network (NN) architecture with a number of outputs equal to the number of possible actions, and the action space is extremely large (under statistical CSI it is even infinitely large). For that, we resort to a Deep Deterministic Policy Gradient method [23], which trains a policy π_θ : S → A modeled as a NN with parameters θ.
If at time t in state s_t the action a_t is taken and the policy π is followed thereafter, then the return, using (4), is given by Z^π(s_t, a_t) = r_t + Σ_{k≥1} γ^k r_{t+k}. Note that if also at time t the action a_t comes from the policy π, then Z^π(s_t, a_t = π(s_t)) = Z^π_t. Let the expected return be Q^π(s, a) = E[Z^π(s, a)]. Then, the objective of the agent is to maximize J(θ) = E_{s_{t_0}∼p_{t_0}}[Q^{π_θ}(s_{t_0}, π_θ(s_{t_0}))] (8), with p_{t_0} being the probability of the initial state s_{t_0} at time t_0. The gradient can be written as [33] ∇_θ J(θ) = E_{s∼ρ^{π_θ}_{s_{t_0}}}[∇_θ π_θ(s) ∇_a Q^{π_θ}(s, a)|_{a=π_θ(s)}], with ρ^{π_θ}_{s_{t_0}} being the discounted (improper) state distribution. To compute the gradient, the function Q^{π_θ}(s, a) is needed; it is approximated by another NN Q_ψ(s, a), named the value network, described in the next subsection.
We now explain the architecture of the model π θ .

1) Deep Sets:
As discussed in Section III, we aim for a policy that falls in the realm of permutation equivariant functions (i.e., permuting the users should only result in permuting likewise the resource allocation). In [22], necessary and sufficient conditions are shown for permutation equivariance in neural networks; their proposed structure, called Deep Sets, is adopted here. At first, the characteristics (or features, as commonly termed in the machine learning literature) of each user are processed individually by the same function φ_user : R^{N_u} → R^{H_u}, modeled as a two-layer fully connected neural network. Then, the outputs of φ_user, which correspond to the new characteristics per user, are aggregated with the permutation equivariant mapping f_σ : R^{K×H} → R^{K×H'} of H (resp. H') input (resp. output) characteristics per user, f_σ(X) = σ(XΛ + (1/K) 1 1^T X Γ) with learnable Λ, Γ ∈ R^{H×H'}, where σ(·) is an element-wise nonlinear function. We stack two of those: one f_relu : R^{K×H_u} → R^{K×H'_u} with σ(·) being relu(x) = max(0, x), and a second f_linear : R^{K×H'_u} → R^{K×1} without any nonlinearity σ(·). In addition to preserving the desirable permutation equivariance property, this structure also brings a significant parameter reduction.
The number of parameters of Deep Sets, contained in Λ, Γ, does not depend on the number of users K. Therefore, any increase in K does not necessitate additional parameters, which could otherwise lead to a much bigger network, prone to overfitting.
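A minimal sketch of such an equivariant block (PyTorch; the mean-pooling variant written above, with our own class names) is:

```python
import torch
import torch.nn as nn

class EquivariantLayer(nn.Module):
    """Permutation-equivariant layer f(X) = sigma(X @ Lam + mean_K(X) @ Gam),
    one of the forms shown sufficient in the Deep Sets paper [22].  The
    parameter count depends only on (H_in, H_out), not on the user count K."""
    def __init__(self, h_in, h_out, activation=None):
        super().__init__()
        self.lam = nn.Linear(h_in, h_out, bias=True)   # Lambda: per-user mixing
        self.gam = nn.Linear(h_in, h_out, bias=False)  # Gamma: pooled context
        self.act = activation
    def forward(self, x):                    # x: (K, H_in), one row per user
        y = self.lam(x) + self.gam(x.mean(dim=0, keepdim=True))
        return self.act(y) if self.act else y

f_relu = EquivariantLayer(10, 10, activation=torch.relu)  # H = H' = 10, Sec. VI
f_linear = EquivariantLayer(10, 1)                        # final linear block
```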
2) Output: The activation function for the last layer of the policy network is a smooth approximation of relu(x), namely softplus(x) = log(1 + e^x), restricting the output y ∈ R^K to be positive. After that, depending on the availability of CSI, there are two ways of performing the allocation. For full CSI, the bandwidth required per user is accurately known. Therefore, we only need a binary decision per user (to serve or not), which would, however, ruin the differentiability of the policy, a property mandatory for DDPG to work. For that, we interpret the output y as a continuous relaxation of the binary problem. Specifically, y assigns to each user a "value" per resource; after being multiplied by the number of resources the user requires, a user ranking is obtained. Then, the scheduler satisfies as many of the most "valuable" (highest-ranked) users as possible, subject to the available resources. Therefore, in full CSI, y semantically denotes how advantageous the policy believes it is to allocate resources to each user. On the contrary, in the no CSI case, the action is not binary but continuous, since the scheduler has to decide on the portion of the available resources each user takes. To ensure that y has the valid form of portions (i.e., positive and adding up to one), we just divide by the sum, y → y/||y||_1 (with ||·||_1 being the l1 norm).¹ This discrepancy in the output process is the only minor difference in the considered model between full CSI and no CSI.
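The two output modes can be sketched as follows (our own helper, illustrating the description above rather than the exact implementation).

```python
import numpy as np

def allocate(y, w_required=None, W_total=1.0):
    """Turn the positive policy output y (one score per user) into an
    allocation.  Full CSI (w_required given): rank users by y times their
    required bandwidth and greedily serve the highest-ranked within the
    budget.  No CSI: interpret y as bandwidth portions via the l1 norm."""
    if w_required is None:                    # no-CSI branch
        return W_total * y / np.sum(y)        # y -> y / ||y||_1, scaled by W
    rank = np.argsort(-(y * w_required))      # "value per resource" x resources
    w, budget = np.zeros_like(y), W_total
    for u in rank:                            # serve most "valuable" users first
        if w_required[u] <= budget:
            w[u] = w_required[u]
            budget -= w_required[u]
    return w
```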
3) User normalization: Before the final nonlinearity softplus(x) = log(1 + e^x), as seen in Figure 5, there is the crucial "user normalization" step x → (x - mean(x))/||x||_2 (with ||·||_2 denoting the l2 norm). Consider first the full CSI case. Without that step, the value network would perceive that the higher the "value" per resource assigned to a user, the more probable it is for that user to get resources (and thus to be satisfied and receive reward). Unfortunately, this leads to a pointless, interminable increase of every user's "value". What matters here is not the actual "value" of a user but how large it is relative to the rest of the users. To capture the notion of limited total resources, the "user normalization" subtracts from the value of each user the mean of all the users' values. Hence, whenever the algorithm pushes the value of a single user to increase, the values of the rest decrease. In the no CSI case, there is an additional benefit. Since the following step applies y → y/||y||_1 so that the output signifies portions (of the total bandwidth), performing the normalization step beforehand (dividing by ||x||_2) helps keep the denominator ||y||_1 stable.
¹ Instead of dividing by the l1 norm, we also considered softmax(y), which seemed a good choice as it also provides positive outputs adding up to one. Nevertheless, this approach led to poor performance: no matter how much the number of users is increased, the policy insists on evaluating as advantageous to serve only a very small number of users. This makes sense, since the softmax function is a smooth approximation of argmax, hence it focuses on finding the single most advantageous user to serve.
In Figure 1, we show the significance of choosing the right architecture. It is clearly observed that if either all Deep Sets blocks (in both the policy and the value network) are substituted by the most common choice of linear blocks, or the user normalization step is removed, the performance degrades substantially.
4) Exploration: Since the action a_t has to satisfy specific properties, such as positiveness and summing up to one in the no CSI case, the common approach of adding noise to the actions becomes rather cumbersome. An easy way out is through noisy networks [20], which introduce noise to the weights of a layer, resulting in changed decisions for the policy network.
The original approach considers the variance of the added noise to be learnable. Here, we instead keep it constant, since this provides better results. With probability P_explore, we add noise to the parameters of φ_users, altering the output features per user, and therefore the policy outputs a different allocation. Specifically, if θ_{φ_users} are the parameters of φ_users, then they are distorted as θ_{φ_users}(1 + σ_explore ε), with ε being normally distributed with zero mean and unit standard deviation and σ_explore being a constant.
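A sketch of this constant-variance parameter-space exploration (PyTorch; acting with a perturbed copy, so the trained weights stay intact, is our implementation choice):

```python
import copy
import torch

def noisy_copy(phi_users, sigma_explore=0.3):
    """Return a perturbed copy of the per-user feature extractor:
    theta <- theta * (1 + sigma_explore * eps), eps ~ N(0, 1) elementwise,
    i.e., the constant-variance variant of noisy networks [20]."""
    noisy = copy.deepcopy(phi_users)
    with torch.no_grad():
        for theta in noisy.parameters():
            theta.mul_(1 + sigma_explore * torch.randn_like(theta))
    return noisy

# Usage: with probability P_explore, act with noisy_copy(phi_users) instead
# of phi_users for the current time slot.
```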

B. Value Network
As mentioned previously, Q^{π_θ}(s, a) is used for computing the gradient of the objective function described in (8). Since this is intractable to compute, a neural network, named the value network, is used to approximate it. We compare three ways of employing the value network.
1) DDPG: At first, the common approach of DDPG is considered, which uses the Bellman operator to minimize the temporal difference error, i.e., the difference between before and after applying the Bellman operator. This leads to the minimization of the loss
L(ψ) = E[(r_t + γ Q_{ψ'}(s_{t+1}, π_{θ'}(s_{t+1})) - Q_ψ(s_t, a_t))^2],
where (π_{θ'}, Q_{ψ'}) correspond to two separate networks, called the target policy and target value neural networks, respectively, used for stabilizing the learning. At each iteration, they are gradually updated as the weighted sum between the current policy/value networks and the current target policy/value networks, i.e., θ' ← (1 - m_target)θ' + m_target θ and ψ' ← (1 - m_target)ψ' + m_target ψ.
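In code, the TD loss and the soft target updates look as follows (a sketch of standard DDPG machinery [23], not our exact training loop):

```python
import torch
import torch.nn.functional as F

def ddpg_losses(batch, pi, q, pi_targ, q_targ, gamma=0.95):
    """One-step TD loss L(psi) = E[(r + gamma * Q_targ(s', pi_targ(s'))
    - Q(s, a))^2] plus the deterministic policy-gradient surrogate."""
    s, a, r, s_next = batch                    # assumed pre-batched tensors
    with torch.no_grad():
        y = r + gamma * q_targ(s_next, pi_targ(s_next))
    value_loss = F.mse_loss(q(s, a), y)
    policy_loss = -q(s, pi(s)).mean()          # ascend Q along the policy
    return value_loss, policy_loss

def soft_update(net, net_targ, m_target=0.005):
    """theta' <- (1 - m_target) * theta' + m_target * theta."""
    with torch.no_grad():
        for p, p_t in zip(net.parameters(), net_targ.parameters()):
            p_t.mul_(1 - m_target).add_(m_target * p)
```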
2) Distributional DDPG: Another way is to approximate the distribution of the return instead of only its expected value, as in [34]. The following analogy is helpful to motivate its interest. Instead of a scheduler and its users, consider a teacher and their students. Even though the objective of the teacher is to increase the average "knowledge" of the students, using the distribution of the capacity/knowledge of the students allows, for instance, deciding whether to distribute the teacher's attention uniformly among students or to focus mostly on a fraction of them needing further support.
Algorithmically, it is impossible to represent the full space of probability distributions with a finite number of parameters, so the value neural network Z^{π_θ}_ψ : S × A → R^{N_Q} is designed to approximate the actual Z^{π_θ} with a discrete representation. Among many variations [18], [35], we choose the representation to be a uniform (discrete) probability distribution supported at N_Q points, where the i-th element of the output is the i-th support point. More rigorously, the distribution that the value neural network represents is (1/N_Q) Σ_{i=1}^{N_Q} δ_{θ_i(s,a)}, where δ_x is a Dirac delta function at x [19]. Minimizing the 1-Wasserstein distance between this (approximated) distribution and the actual one of Z^{π_θ} can be achieved by minimizing the quantile regression loss (13), formed against the distributional Bellman target T^π Z = R(s, a) + γ Z^π(s', π(s')), s' ∼ P(·|s, a), computed with the target policy and value networks (defined as before). Notice that even though we approximate the distribution of Z^{π_θ}(s, a), what is actually needed for improving the policy is only its expected value, approximated as the average of the N_Q support points. It is therefore natural to wonder whether it indeed helps to use Z^{π_θ}_ψ instead of directly approximating the needed expected value (confirming the intuition in the teacher-student analogy). In Figure 2, we provide numerical support for distributional DDPG. Comparing Figures 2b and 2c, we show the benefits of using distributional DDPG. The distributional DDPG approach detects faster the existence of two different service classes with heterogeneous requirements, thus gradually improving the satisfaction rate for both of them. On the other hand, trying only to learn the expected value leads to a training where the performance for one class is improved at the expense of the other. Nonetheless, when aggregating the rewards coming from both classes, we observe in Figure 2a faster convergence of DDPG than of the distributional DDPG, even though, when converged, the latter exhibits slightly better performance. Introducing a trick (explained later in the "dueling" paragraph), the distributional DDPG approach can be enhanced and outperforms DDPG.
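The quantile regression loss of [19] can be sketched as follows (plain, non-Huber variant; tensor shapes are our assumptions):

```python
import torch

def quantile_regression_loss(theta, target):
    """theta: (..., N_Q) predicted quantile locations; target: (..., N_Q)
    samples of the distributional Bellman target T Z.  Uses the check
    function rho_tau(u) = u * (tau - 1{u < 0}) at the quantile midpoints
    tau_i = (2i - 1) / (2 N_Q)."""
    n_q = theta.shape[-1]
    tau = (2 * torch.arange(n_q, dtype=theta.dtype) + 1) / (2 * n_q)
    u = target.unsqueeze(-1) - theta.unsqueeze(-2)   # pairwise TD errors
    return (u * (tau - (u < 0).to(theta.dtype))).mean()
```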

3) Distributional DDPG & Dueling:
To facilitate the approximation of the distribution Z^{π_θ}(s_t, a_t), we propose to split it into two parts: one that estimates the mean, Z^{π_θ,Mean}_ψ, and one that estimates the shape of the distribution, Z^{π_θ,Shape}_ψ. For that, we use a dueling architecture [21] (shown in Figure 5). The output becomes the sum of the two branches, with the mean branch used to approximate Q^{π_θ} for training the policy. To ensure the decomposition of the distribution into shape and mean, we add a loss term keeping the shape branch centered around zero; the resulting total loss function is denoted L_{1+duel} in (14). To better understand the role and the performance of using the dueling architecture to approximate the (return) distribution, we have implemented a simple experiment, whose results are shown in Figure 3. We set a random variable Z with known cumulative distribution function (cdf) CDF_real, from which we draw samples. The objective is to test the distributional approach and the combined distributional-plus-dueling approach on how fast, using samples from Z, they correctly estimate CDF_real. For the first approach (termed Distributional), we use N_Q parameters φ ∈ R^{N_Q} and aim to approximate the quantiles of CDF_real by minimizing the quantile regression loss (as in (13)). For the second, we use the dueling architecture (termed Distributional & Dueling) with parameters φ_shape ∈ R^{N_Q} and φ_mean ∈ R, and we want, with φ_duel := [φ_shape, φ_mean], to approximate the quantiles of CDF_real by minimizing the loss L_{1+duel}(φ_duel) as defined in (14). In Figure 3, each column corresponds to a different cdf CDF_real:
• the first column corresponds to a normal distribution N(0, 1),
• the second one to a Gamma distribution Γ(1, 1), and
• the last one to an equiprobable mixture of two normal distributions N(0, 1) and N(4, 1).
Each row corresponds to a different number of samples used to estimate CDF_real. We depict the estimated cdf with and without the dueling trick and compare them to the true one. We use N_Q = 50 and, as optimization algorithm, Adam with learning rate 0.01. We can see that using dueling leads to a faster estimation of the true cdf in all cases.
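A sketch reproducing this toy experiment (the form and weight of the zero-centering penalty on the shape part are our assumptions):

```python
import torch

def fit_quantiles(samples, n_q=50, dueling=False, lam=1.0, steps=2000, lr=0.01):
    """Fit N_Q quantiles of an unknown cdf from samples, either directly (phi)
    or as mean + zero-centred shape (the dueling split of Sec. IV-B.3)."""
    tau = (2 * torch.arange(n_q) + 1.0) / (2 * n_q)
    if dueling:
        shape = torch.zeros(n_q, requires_grad=True)
        mean = torch.zeros(1, requires_grad=True)
        params = [shape, mean]
    else:
        phi = torch.zeros(n_q, requires_grad=True)
        params = [phi]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        q = (shape + mean) if dueling else phi
        u = samples.unsqueeze(-1) - q                # (n_samples, n_q)
        loss = (u * (tau - (u < 0).float())).mean()  # quantile regression loss
        if dueling:
            loss = loss + lam * shape.mean() ** 2    # keep shape centred at zero
        opt.zero_grad(); loss.backward(); opt.step()
    return (shape + mean).detach() if dueling else phi.detach()

quantiles = fit_quantiles(torch.randn(200))          # e.g., N(0, 1) samples
```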

4) Scaling rewards:
A closer look at the range of possible rewards reveals that they span a very large set of values, starting from 0 (no user satisfied) up to K (maximum number of users satisfied, assuming all classes have equal importance α_c = 1). Therefore, both their mean and variance may take large values. This is accentuated for the returns, since a return is the (discounted) sum of many such rewards. Approximating returns taking such a large range of values is thus demanding. A standard technique to facilitate the approximation is "scaling" the rewards.
The rewards are normalized in such a way that the returns take values in a range that is easier to approximate. Given a path that a fixed agent has taken, one can compute the returns per time slot across that path. Scaling the rewards pushes the mean of those returns to zero and their variance to one.
Specifically, the implementation of scaling the rewards involves first estimating the discounted sum of rewards z_t ← γ z_{t-1} + r_t, then the running statistic of its mean, z^mean_t ← m_scale z^mean_{t-1} + (1 - m_scale) z_t, and of its mean of squares, z^sq_t ← m_scale z^sq_{t-1} + (1 - m_scale) z_t^2; the rewards are then standardized using these running statistics. The DRL algorithm is fed with those rewards, whose discounted sum over time is the return that the policy network is trained to predict. We fix m_scale = 10^{-4}. In Figure 2a, it is shown that reward normalization clearly provides an additional boost in performance. In Figure 4, we visualize what the value network tries to approximate. In the first row, considering only distributional DDPG, the distribution of the returns Z^{π_θ}_ψ(s, a) from state s and action a is approximated. From a different state s' and action a', there will be other possible random paths that the agent with policy π_θ may take, and the value network will try to approximate the distribution Z^{π_θ}_ψ(s', a'). The black dots depict the averages of the two distributions, which are in fact the values that the value network of a simple DDPG would try to approximate and the policy network to maximize. In the second row, the use of reward scaling shifts the distributions around zero and also shrinks them. In the last row, the dueling trick is added, so the value network has two outputs: one branch of the dueling architecture approximates the mean Z^{π_θ,Mean}_ψ(s, a) = E[Z^{π_θ}_ψ(s, a)], while the other approximates the centered distribution Z^{π_θ,Shape}_ψ(s, a).
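A sketch of the reward scaler (the exact standardization formula, and reading m_scale as the slow weight on the new sample, are our assumptions):

```python
import numpy as np

class RewardScaler:
    """Track the discounted reward sum z_t = gamma * z_{t-1} + r_t and slow
    running statistics of its mean and mean of squares, then standardize
    rewards so that the resulting returns land near zero mean / unit
    variance.  The paper fixes m_scale = 1e-4."""
    def __init__(self, gamma=0.95, m_scale=1e-4):
        self.gamma, self.m = gamma, m_scale
        self.z, self.z_mean, self.z_sq = 0.0, 0.0, 1.0
    def scale(self, r):
        self.z = self.gamma * self.z + r
        self.z_mean = (1 - self.m) * self.z_mean + self.m * self.z
        self.z_sq = (1 - self.m) * self.z_sq + self.m * self.z ** 2
        std = np.sqrt(max(self.z_sq - self.z_mean ** 2, 1e-8))
        # Shifting each reward by (1 - gamma) * mean shifts the discounted
        # return by the full mean (our construction, not the paper's formula).
        return (r - (1 - self.gamma) * self.z_mean) / std
```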

5) Deep Sets:
A final remark concerns the architecture, which, as discussed before, should be designed so as to preserve permutation invariance. If we associate every user's characteristics with the resources given by the agent, i.e., the action corresponding to it, then permuting the users and, accordingly, the respective resource allocation should not influence the assessment of the success of the agent. To build such an architecture, we adopt the same structure as in our policy network, capitalizing on ideas from Deep Sets [22].
The different steps of our algorithm are shown in Figure 5.

V. BASELINE ALGORITHMS

In this section, we present baseline scheduling algorithms, which are built upon conventional optimization techniques but are adapted to our specific problem. These algorithms are used for performance comparison, in order to show the gains of our proposed DRL architecture.

A. Full CSI case
At time t_c (t_c ≥ t_0), for user u_0, which arrived at time t_0, both the channel h_{u_0,t_c} and the location d_{u_0} are known. User u_0 is not satisfied at time t if and only if the allocated bandwidth w_{u_0,t} is smaller than the threshold w^th_{u_0,t} = D_{u_0}/R_{u_0,t}. We first consider algorithms working with an immediate horizon (T = 1), where only the current time t_c is considered, ignoring the effects on future slots.
In that case, it is possible that the scheduler prefers serving two users that just arrived in the system rather than a user with a bad channel, requiring more resources but being on the verge of its latency constraint expiration. The optimization problem can easily be rewritten as follows. The variables to optimize are {x_{u,t_c}}_u, where x_{u,t_c} is equal to 1 if user u is served at time t_c and 0 otherwise. The cost in terms of bandwidth used is w^th_{u,t_c} x_{u,t_c}, since full CSI is assumed and the scheduler allocates exactly the minimum bandwidth required to successfully send the data to user u. The contribution to the reward function is then α_u x_{u,t_c}. As a result, the optimization problem can be written as
maximize Σ_u α_u x_{u,t_c} subject to Σ_u w^th_{u,t_c} x_{u,t_c} ≤ W, x_{u,t_c} ∈ {0, 1}.
This problem boils down to the knapsack problem, which aims to maximize the total value by choosing a proper subset from a set of objects. Every object has a value but also a weight, preventing one from picking all objects, since the total weight of the chosen subset should not exceed the knapsack capacity. It is a well-known NP-complete problem with numerous efficient algorithms solving it. In this work, we use Google's OR-TOOLS library for solving it.
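A sketch of this myopic baseline using OR-TOOLS' knapsack solver (the solver requires integer values and weights, hence the scaling; module names follow the classic pywrapknapsack_solver interface and may differ across OR-Tools versions):

```python
from ortools.algorithms import pywrapknapsack_solver

def knapsack_schedule(alpha, w_th, W_total, scale=1000):
    """Choose the user subset maximizing sum(alpha_u * x_u) subject to
    sum(w_th_u * x_u) <= W_total; returns the indices of served users."""
    values = [int(a * scale) for a in alpha]
    weights = [[int(w * scale) for w in w_th]]
    solver = pywrapknapsack_solver.KnapsackSolver(
        pywrapknapsack_solver.KnapsackSolver
        .KNAPSACK_MULTIDIMENSION_BRANCH_AND_BOUND_SOLVER,
        "scheduler")
    solver.Init(values, weights, [int(W_total * scale)])
    solver.Solve()
    return [u for u in range(len(alpha)) if solver.BestSolutionContains(u)]
```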
A second baseline we compare with is the so-called exponential rule [36], which corresponds to a generalization of the proportional fair scheduler, taking into account the queue state and the latency constraint of each user. At each time slot t, users are ordered according to their index values, and we start serving the ones with the highest rank until the resources are exhausted. Let v_{u,t} be the number of time slots user u has remained unsatisfied, l_{u,t} the number of time slots the user is still willing to wait, and R̄_u the estimated mean past rate; the latter is known by the server at time t by keeping track of the history of channel gains. The index J_u for user u is then given by the exponential rule of [36], with a_{u,t} = -log(δ_u)/l_{u,t} and δ_u being the delay violation probability.
Lastly, we focus on algorithms that explicitly take into account the effects of an action on a future finite horizon (T > 1). For the sake of simplicity, we assume that the channel realizations of all current users, and of those that will appear within the interval t ∈ [t_c, t_c + T - 1], are known beforehand (i.e., when the algorithm is executed at time t_c). Therefore, this baseline becomes an oracle, since it knows the future channel realizations of users and can choose the best moment to serve them. Evidently, this method provides an upper bound on the performance. Specifically, if U^T_{t_c} denotes the set of all current users plus the ones that will arrive in the time interval [t_c, t_c + T - 1], then for every user u ∈ U^T_{t_c} this baseline knows w^th_{u,t}, which corresponds to the bandwidth required to satisfy u at time t ∈ [t_c, t_c + T - 1]. The optimization problem is then cast as an ILP (maximizing the importance-weighted number of satisfied users over the horizon, subject to the per-slot bandwidth budget), and we use the IBM CPLEX Optimization software, which employs the Branch and Cut algorithm [1], to solve it. Notice that the above problem cannot be mapped into a knapsack one (as for T = 1), or even a multiple knapsack problem, because the weight of each user is time-varying due to channel variability and the non-constant user set.
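The oracle ILP can be sketched as follows (using OR-Tools' MIP wrapper for illustration instead of CPLEX; x[u, t] = 1 if user u is scheduled to be satisfied at slot t, and active[u] is the set of slots in which u is in the system).

```python
from ortools.linear_solver import pywraplp

def oracle_ilp(alpha, w_th, active, W_total):
    """Finite-horizon oracle: w_th[u][t] is the bandwidth needed to satisfy
    user u at slot t (known in advance only by the oracle)."""
    solver = pywraplp.Solver.CreateSolver("SCIP")
    users = range(len(alpha))
    x = {(u, t): solver.BoolVar(f"x_{u}_{t}") for u in users for t in active[u]}
    for t in {t for u in users for t in active[u]}:       # per-slot budget
        solver.Add(sum(w_th[u][t] * x[u, t]
                       for u in users if t in active[u]) <= W_total)
    for u in users:                                       # satisfy at most once
        solver.Add(sum(x[u, t] for t in active[u]) <= 1)
    solver.Maximize(sum(alpha[u] * x[u, t] for (u, t) in x))
    solver.Solve()
    return {(u, t) for (u, t) in x if x[u, t].solution_value() > 0.5}
```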

B. Statistical CSI case
Under statistical CSI, the BS knows the statistics of the system (channel, location, and traffic).
In this section, we build a baseline to compare with the proposed DRL scheduler in the no-CSI case as mentioned in Remark 1.
Let us first focus on the case of a single user u_0 arriving at time t_0; the current time is t_c ≥ t_0. We denote by w_{u_0,t} = (w_{u_0,t_0}, w_{u_0,t_0+1}, ..., w_{u_0,t}) the assigned bandwidth from time t_0 (beginning of transmission for user u_0). Additionally, let A_{u_0,t} be a binary random variable, where A_{u_0,t} = 1 if u_0 is still unsatisfied at the end of time slot t (after receiving w_{u_0,t} resources) and A_{u_0,t} = 0 otherwise. Given that at the beginning of time t user u_0 still remains unsatisfied and that w_{u_0,t} is scheduled at time t, we define Φ(w_{u_0,t}; d_{u_0}) to be the probability that w_{u_0,t} is not sufficient to satisfy the user's request for known location d_{u_0} and unknown channel realization h_{u_0,t}, i.e., Φ(w_{u_0,t}; d_{u_0}) = P(A_{u_0,t} = 1 | A_{u_0,t-1} = 1, w_{u_0,t}). The average contribution of user u_0 to the gain (5) on the time interval [t_c, t] is given by Eq. (15), derived by applying the chain rule on conditional probabilities. We consider now the average contribution to the gain (5) of subsequent users after user u_0.
The next user (if any) appears at time t_1 = t_0 + L_{u_0}, the second next at time t_2 = t_1 + L_{u_1}, and so on. In other words, we consider the users, denoted u_1, u_2, ..., which appear at times t_1, t_2, ..., respectively. Users belong to classes c_1, c_2, ..., with probabilities p_{c_1}, p_{c_2}, ..., respectively. Since the locations of those future users are unknown, we need to average (15) and (16) over their possible locations in order to obtain their contribution to the gain function (5). So, for i ≥ 1, if w_{u_i,t} = (w_{u_i,t_i}, w_{u_i,t_i+1}, ..., w_{u_i,t}), we obtain Eq. (18), where the contribution, looking at time t with t < t_i + L_{u_i}, starts at time t_i for user u_i, and where the expectation is taken over the location of user u_i. Closed-form expressions for Eqs. (15) and (18) are provided in Appendix A.
For notational convenience, to include the case where no new user is generated in a time slot, we introduce the "null" class of users, which contains users serving as dummies. They appear with probability p_null, are active for one slot (L_null = 1), and have zero contribution to the gain, with t_{i+1} = t_i + L_null. Hence, the average value of the gain function for the sequence of users u_0, u_1, ... (i.e., when there is at most one user per time slot, K = 1), starting at the current time t_c, is given by (19). From (19), we observe a tree structure: whenever a user vanishes, there is a summation over all possible classes the new user may belong to. Therefore, a number of branches equal to the number of possible classes (equal to |C|) is created whenever a new future user is taken into account. To harness this scalability issue, we prune the tree by considering only T future time slots and work with the finite horizon [t_c, t_c + T - 1].
The general case with multiple users served simultaneously (K > 1) can easily be considered by just computing K "parallel trees". With a slight abuse of notation, we consider that the first subscript of the variables w now refers to the index of the tree (and implicitly to a specific user).
As a consequence, the variables for the scheduled bandwidth resources over a horizon of length T can be put into a K × T matrix W_{t_c}, and the average gain for these resources takes the form given in (20). Finally, we arrive at our optimization problem at current time t_c: maximize G(W_{t_c}) subject to the constraints (22). It can easily be shown that the objective function G(·) is non-concave, with multiple local optima. The constraints given by (22) describe a compact and convex domain, which allows applying the Frank-Wolfe (FW) algorithm [37], guaranteed to reach a local optimum. The convergence of the FW method is sublinear; moreover, the cost of computing the objective function (20) and its partial derivatives grows exponentially with T, leading to a slow and cumbersome method in practice. Therefore, in each time slot t_c, we use FW to get a locally optimal solution W_{t_c}, from which we retrieve the first column [w_{1,t_c}, ..., w_{K,t_c}], corresponding to the bandwidth allocation applied at the current time step t_c.
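A sketch of this FW loop follows (the gradient callback grad_G and the random initialization are placeholders for the computations built from (20)):

```python
import numpy as np

def frank_wolfe(grad_G, K, T, W_total, n_iter=100, seed=0):
    """Frank-Wolfe ascent over the compact convex set
    { W in R^{K x T}, W >= 0, sum_k W[k, t] <= W_total for all t }."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=(K, T))
    W *= W_total / np.maximum(W.sum(axis=0), 1e-9)   # feasible start
    for it in range(n_iter):
        g = grad_G(W)                                # gradient of G at W
        S = np.zeros_like(W)                         # linear maximization oracle:
        best = np.argmax(g, axis=0)                  # per slot, all bandwidth on
        for t in range(T):                           # the best coordinate, or
            if g[best[t], t] > 0:                    # nothing if all grads <= 0
                S[best[t], t] = W_total
        step = 2.0 / (it + 2)                        # classic FW step size
        W = (1 - step) * W + step * S
    return W
```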

VI. EXPERIMENTAL RESULTS

A. Synthetic Data
We consider the distance-dependent pathloss model 120.9 + 37.6 log_10(d) (in dB) [38], which corresponds to a constant loss component C_pl = 10^{-12.09} and a pathloss exponent n_pl = 3.76. The noise spectral density is σ^2_N = -149 dBm/Hz. We consider that the distance between the base station and the users ranges from 0.05 km to 1 km. The power per unit bandwidth is kept equal to 1 µW/Hz.
For the proposed DRL scheduler, we update the target policy and value networks with momentum m_target = 0.005. We use a replay buffer with a capacity of 5000 samples. The batch size is set to 64 and the learning rate to 0.001. The discount factor is γ = 0.95. We use N_Q = 50 quantiles to describe the distribution. The φ_user consists of two fully connected layers, each with 10 neurons. We set P_explore = 0.2 and σ_explore = 0.3. The number of input and output dimensions in both f_relu and f_linear is 10 (i.e., H = H' = 10). We remark that the number of parameters is kept relatively low (around 1800), mainly due to the use of Deep Sets. Increasing it further unavoidably results in overfitting, due to the high stochasticity of the environment.
Moreover, keeping the number of parameters low makes our solution fast and cost-effective (both in terms of energy and hardware).
We consider two scenarios for the traffic, as described in Table I. The first scenario consists of two classes: one with users requesting a small amount of data but within a stringent latency constraint (of just two time slots), and one delay-tolerant class requesting a large amount of data. All classes have the same importance, as seen from the Imp. column. In the second scenario, classes do not have the same importance. Note that the Prob. column describes the probability p_c with which a user of that class appears in the system at a given time slot (the probabilities do not sum up to one, signifying that it is possible that no user appears during some time slots).
In Figure 6, we plot the satisfaction ratio per class priority (i.e., all users having the same priority take part in the computation of the same ratio and are thus depicted in the same curve) versus the channel correlation (ρ, left column) and versus the total bandwidth (W, right column), for both scenarios and different CSI knowledge. Figures 6c and 6d correspond to the statistical CSI/no CSI case. Recall that the FW algorithm reaches a suboptimal point and that different initializations lead to different local optima. For that reason, at each time slot, we repeat the FW algorithm N_init times with a different initialization each time, and we select the best suboptimal point. This method could lead to considerable performance improvement as N_init increases; however, due to computational complexity, we stop at N_init = 20. Moreover, as the number of users K increases, so does the number of local optima, including solutions with poor performance, making it tougher for FW to find a good local optimum without significantly increasing N_init. This is the main reason why our DRL scheduler substantially outperforms the Frank-Wolfe algorithm even at moderate numbers of users (K = 60). Note that our DRL scheduler continues to exhibit very good performance even if K is further increased.
The proposed DRL scheduler significantly outperforms the knapsack algorithm. For instance, at a level of 95% satisfaction probability, we may save about 13% of bandwidth, which is accompanied by a 13% power saving, as the power per Hz is kept constant (see Figure 6b). We also observe that our scheduler operates quite close to the optimal policy, since the ILP, which uses an oracle, constitutes an upper bound on the performance. In Figures 6e and 6f, there is a priority class of users, which always enjoys a higher satisfaction probability. Interestingly, the proposed DRL scheduler serves the priority class slightly worse than the knapsack does. Nevertheless, since the priority class accounts for 0.1/0.55 ≈ 18% of the users and the remaining 82% are served much better by our scheduler, the latter exhibits overall better performance than the knapsack.

B. Real Data
To assess the applicability of our algorithm in a realistic setup, we perform experiments on real data, using publicly available traces based on real measurements over Long Term Evolution (LTE) 4G networks in a Belgian city [39], [40]. Six different types of transportation (foot, bicycle, bus, tram, train, car) are used. The throughput and the GPS location of a mobile device continuously demanding data are recorded every second. Since the timescale of 1 s is much larger than the small-scale fading timescale, represented here by the random variable h, the provided measurement corresponds to M_i = E_h[W log_2(1 + κ|h|^2)] for every i-th second. The value of κ, which mainly depends on the user location, is assumed constant within 1 s. As the measurement bandwidth W is not reported in the dataset, we assume it to be 15 MHz, resulting in a mean signal-to-noise ratio (SNR) of ≈ 6 dB in an LTE-compliant system. This allows us to retrieve κ from the measurement M_i. To compute the channel time variation h, the user speed is required in (1) so as to obtain ρ; it is estimated using the trajectory of GPS coordinates given in the traces. A user entering the system belongs to a class according to Table II, with its type of transportation chosen randomly; we then sample M_i and its location from the traces accordingly. Knowing the locations in the previous and subsequent time slots, we can compute the average speed and thus ρ. Finally, so far we have assumed that the bandwidth can be split arbitrarily finely (continuum); in practice, however, the bandwidth is divided into N_bl resource blocks and each user is assigned an integer multiple of those. In Table III, we increase N_bl while keeping the resource block size constant at 200 kHz, again confirming the performance gains of the DRL-based approach.
For the exponential rule, the value of δ_u is set to δ_u = δ = 10^{-2} [36]. Nevertheless, since this value does not provide the best results for every N_bl (number of resource blocks), we tune this parameter for each N_bl in order to obtain the highest possible performance. In Table III, we see that the proposed DRL algorithm outperforms the baseline algorithms with full CSI on real data, both in terms of data rate and satisfaction probability. The gap from the upper bound is rather significant, but this is expected, as the upper bound is optimistic in assuming that the channel is known in advance.

VII. CONCLUSION
The problem of scheduling and resource allocation for a time-varying set of users with heterogeneous traffic and QoS requirements was studied here. We leveraged deep reinforcement learning and proposed a deep deterministic policy gradient algorithm, which builds upon distributional reinforcement learning and deep sets. Our experiments on both synthetic and real data showed that the proposed scheduler can achieve significant performance gains compared to state-of-the-art conventional combinatorial optimization methods, in both the full and no CSI scenarios.

Figure 1: We conducted five experiments (with different seeds) for no CSI using the traffic model of Table Ia, a maximum number of users K = 75, ρ = 0, and total bandwidth W = 5 MHz. We depict the average probability that a user is satisfied over those five experiments, as an ablation study on the importance of the Deep Sets and user normalization steps.

Figure 2: Comparison between distributional and standard (non-distributional) DDPG RL. We conducted five experiments with different seeds, as in Figure 1, with the same traffic model. In the first figure, we depict the average over those five experiments; in the other figures, we consider one specific experiment in an attempt to show the inherent ability of distributional DDPG in dealing with heterogeneous traffic.

Figure 3: Estimation of a cumulative distribution function with and without the dueling trick.

Figure 4: Effect of adding the dueling architecture in the Value Network and/or reward scaling to distributional DDPG RL.

Figure 6: Satisfaction rate of the proposed DRL scheduler and the baseline algorithms versus ρ (left column) and W (right column). The first and last rows correspond to the case of Table Ia and the middle row to that of Table Ib. Figures 6a, 6b, 6e and 6f refer to the full CSI case, while the others to the statistical CSI/no CSI case.

Table I: Classes description for the two scenarios.

Table II: Equal classes description (data rate per user in Kbps, latency in msec).

Table III: Sum Data Rate (in Mbps) / Probability of Satisfaction.