On Meeting a Maximum Delay Constraint Using Reinforcement Learning

Several emerging applications in wireless communications must achieve low latency as well as high traffic rates and high reliability. From a latency point of view, most state-of-the-art techniques consider the average latency, which may not directly apply to scenarios with stringent latency constraints. In this paper, we consider scheduling under a max-delay constraint; this is an NP-hard problem. We propose a novel approach that tackles the scheduling problem by directly addressing the constraint. We consider the downlink of a multi-cell wireless communication network in which nodes communicate with users, each facing its own delay constraint on randomly arriving packets. Packets must be scheduled to meet the users' delay constraints. Our main contributions are: first, proposing a new search approach, Super State Monte-Carlo Tree Search (SS-MCTS), a version of regular MCTS modified for large-scale probabilistic environments; second, developing trained value and policy networks to reduce computational complexity; and finally, addressing the scheduling problem through a reinforcement learning framework. Our numerical results demonstrate that the proposed approach significantly improves the packet delivery rate over a baseline approach while meeting the max-delay constraint, and addresses scalability, the main issue in large action-state spaces.

high reliabilities [31]. Considering the requirements of such applications, average delays are not of interest, given that an instantaneous disruption in the transmitted data will lead to poor performance of the overall system. Furthermore, in applications such as remote surgery, different tasks face different priority levels as well as different tolerable deadlines to be served.

In this paper, our main goal is to consider scheduling a given number of flows (interchangeably called users) with random packet arrivals (with a known arrival rate) and a hard latency constraint (maximum tolerable delay) in the downlink of a wireless communication network. We wish to minimize the dropping of packets that comprise these flows; however, finding the optimal schedule for a set of flows is NP-hard [32]. This is the motivation for our work: we wish to find an efficient approach to tackle the scheduling problem in a realistic large-scale scenario. To the best of our knowledge, in considering the issue of latency, most state-of-the-art technologies focus on the average delay (which can often be translated into a throughput constraint [25]). Those addressing the hard delay constraint suffer from several limitations in large-scale probabilistic environments; we discuss these contributions and their limitations below.

Since our scheduling problem of interest is NP-hard, we propose a technique based on Monte-Carlo Tree Search (MCTS) to address the maximum delay constraints. We consider a multi-cell network with base stations (BSs) serving multiple users simultaneously on multiple channels.

We further improve the performance and reduce the computational complexity of the proposed approach. In this regard, we propose a distributed multi-agent RL framework that significantly improves the performance in an online manner. Also, the value and policy neural networks, trained efficiently by SS-MCTS, can be used with much less computational complexity than SS-MCTS itself. Furthermore, in this work, we consider a more realistic model for scheduling in the downlink of a multi-cell network and further reduce computational complexity by both modifying the action selection in large-scale scenarios and addressing the scheduling problem through a distributed multi-agent RL framework. This efficiency can be traded off against the performance improvements of the proposed SS-MCTS approach for larger-scale scenarios. We also analyse the complexity of the proposed approach. In addition to the aforementioned contributions, the scheduling problem considered in this work is the selection of multiple tuples of users and their assigned channels per timeslot in a multi-cell scenario, while in [23] a single user is selected per timeslot in a single-cell scenario.

The remainder of this paper is organized as follows. We first review the relevant literature on delay-sensitive communication systems in Section II. Section III presents our system model. The proposed SS-MCTS technique is then described in Section IV. The RL framework for the scheduling problem with max-delay constraints, as well as the SS-MCTS approach combined with deep neural value and policy networks, are introduced in Section V.
In Section VI, we provide simulation results illustrating the performance of the proposed techniques. Finally, Section VII concludes the paper.

As our work covers such different areas as communications, tree search methods, and reinforcement learning, here we discuss the most relevant works in these areas.

The related works accounting for a delay constraint mainly consider the throughput-delay trade-off, the delay-limited link capacity, and channel coding schemes for low-latency communications in 5G systems [26]. Importantly, these works generally consider the average delay in delivering data packets. While an average delay constraint can be converted to a throughput constraint [25] (and so is easier to address), applications such as remote surgeries may be better served by a max-delay constraint [14]. At the network layer, recent works include end-to-end delay bounds in wireless networks using large-deviations theory [30], the use of short transmission time intervals and delay-limited throughput [6], [20], the average delay of network coding in the downlink [37], and a trade-off between throughput and guaranteeable delay [17].

Heuristic scheduling approaches are based on choosing the task with the longest remaining processing time or the shortest processing time [22]. Another approach is to first schedule the packets with the least remaining time, or to make scheduling decisions collectively rather than separately for individual objects [5]. Although heuristic approaches are simple and practical, especially in large-scale scenarios, as we will see, they suffer significantly in terms of their performance. A greedy heuristic will act as our baseline approach.

In this paper, we propose the use of RL and deep learning to tackle this scheduling problem.

Tree search-based methods form another category of value-based approaches that directly address the max-delay problem. Although tree search methods can consider more actions, and therefore more realistic models, a full tree search in large search spaces is not practical. This limitation has been alleviated by Monte-Carlo Tree Search approaches, which reduce the depth of the search tree and use an efficient policy to decrease its effective breadth [29]. Constrained MCTS for partially observable Markov decision processes (POMDPs) [15] and MCTS with an information layer [34] (which uses probabilistic information about the environment to estimate the expected value of future actions) are the works most relevant to our problem. In these schemes, the next state resulting from an action applied in the current state is not known deterministically. However, these schemes cannot be directly applied in scenarios with a large number of users, since calculating the expected value of all future actions is not practical.

In [3], MCTS with deep neural networks is used for pilot-power allocation in the uplink of a massive MIMO system. That work considers a regular MCTS approach, unlike our work, which focuses on modifying the MCTS approach to decrease computational complexity. Also, in [3] only a single decision is made, while our maximization problem is defined over a finite time horizon; our proposed approach therefore considers all decisions required within a finite time horizon, which increases the complexity of the search tree.

As mentioned, most state-of-the-art technologies focus on the average delay. The approaches that consider a hard delay do not address a max-delay constraint and do not discuss practical large-scale scenarios, the main focus of our work. As the problem of interest is NP-hard, the approaches proposed in the literature to tackle it, such as tree search algorithms, must be modified in order to become efficient tools for scheduling a large number of users with individual max-delay constraints on their packets. In this work, our modifications of the MCTS approach also allow for the consideration of large-scale probabilistic environments with lower complexity compared to the regular MCTS approach. The proposed SS-MCTS search approach, with fewer rollouts and therefore less computational complexity, is integrated with deep learning methods, resulting in an improvement in sample efficiency and therefore performance.

Considering that the main purpose of this work is addressing a maximum delay constraint in probabilistic environments, in this paper we consider the problem of scheduling in the downlink of a multi-cell wireless network with multiple frequency bands available to each BS per cell.

We consider a network area partitioned into, for convenience, identical hexagonal cells, with one BS located at the geometric center of each cell. (If the BS has multiple receive antennas, a simple beamforming scheme like matched filtering can be easily incorporated.) The small-scale fading components vary independently from one timeslot to another, while the path-loss components remain identical.

We assume that each BS can schedule one user on each available frequency band in each timeslot. The binary variable $y^t_{j_n,f} = 1$ $(=0)$ indicates whether (or not), in timeslot $t$, the user $j$ associated with BS $n$ is scheduled on frequency band $f$.

332
Therefore, the signal-to-interference-plus-noise ratio (SINR) for user $j$ in BS $b$ on frequency band $f$ is given by the received signal power divided by the sum of the inter-cell interference power and the noise power. The combined data rate achieved by user $j$ in timeslot $t$, denoted $R^t_{j_b,\mathrm{tot}}$, is the sum of the per-band rates $R^t_{j_b,f}$ over all $F$ available frequency bands:
$$R^t_{j_b,\mathrm{tot}} = \sum_{f=1}^{F} R^t_{j_b,f}.$$
Our goal is to find a practical and efficient policy for each BS to select users such that the total number of dropped packets is minimized. Each time a user is selected, based on the quality of the channel, packets stored for that user in the BS can be transmitted. The scheduling of each packet can be interpreted as a task with a deadline. Specifically, we consider a finite time horizon divided into $T$ timeslots, $\mathcal{T} = \{1, 2, \ldots, T\}$.

Our goal is to minimize the total number of dropped packets over the finite time horizon $T$. To reach this goal, we aim to minimize the total number of dropped packets in each cell; this is stated mathematically, for all users in the network, as the optimization Problem (1). The scheduled transmission time of each packet must satisfy its maximum tolerable delay, written mathematically as constraint (2); we note that (2) results in the scheduled transmission time $T^k_{j_b}$ being infinite in case a packet is never transmitted. Problem (1) is NP-hard, i.e., there is no known polynomial-time algorithm to solve this optimization problem [16].
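To make the rate model concrete, the following sketch computes a per-band SINR and the combined rate; the transmit power, channel-gain arrays, noise power, and the Shannon-rate mapping are illustrative assumptions, since the exact per-band rate expression is not restated in the extracted text.

```python
import numpy as np

def total_rate(p_tx, h_own, h_interf, noise_pow):
    """Combined rate of one user over F bands (a sketch; symbols assumed).

    p_tx      : transmit power per band
    h_own     : length-F array of channel gains from the serving BS
    h_interf  : (N-1, F) array of gains from the interfering BSs
    noise_pow : noise power per band
    """
    interference = p_tx * h_interf.sum(axis=0)        # inter-cell interference per band
    sinr = p_tx * h_own / (interference + noise_pow)  # per-band SINR
    return float(np.sum(np.log2(1.0 + sinr)))         # R_tot: sum of per-band rates
```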

The optimization Problem (1) takes the whole finite time horizon into account when calculating the number of dropped packets within one cell. However, future information about channels and arrivals is not available; the problem is therefore non-causal, rendering Problem (1) infeasible. One option is to re-visit the problem of interest by considering the optimization in each timeslot $t$ and minimizing the expected number of dropped messages in that timeslot. Thus, keeping the same constraints as in Problem (1), the objective function can be replaced by
$$\min \; \mathbb{E}\big[d^t\big], \qquad (3)$$
where $d^t$ denotes the number of packets dropped in timeslot $t$ and $\mathbb{E}[\cdot]$ denotes expectation (over the future arrivals and channel states).

In deterministic scenarios, the branch-and-bound (B&B) technique provides the optimal solution to the problem in (1) by implicitly enumerating all possible solutions on a search tree. The complexity of the B&B approach grows exponentially with the size of the system. Therefore, even for deterministic arrivals, B&B is impractical for real-time execution in large-scale scenarios. With random arrivals, the B&B approach is infeasible because the objective function cannot be evaluated in closed form, and taking the expectation requires Monte-Carlo simulations.

To build toward an effective solution, we recognize that the problem in (3) can be formulated as a constrained MDP, and therefore the optimal solution can be obtained through the well-known value-iteration method [21], [33]. In this MDP, the state at each time comprises the vector of arrived packets, the arrival time of each packet, and the channel state information at that time; the actions are the scheduling decisions $\{x^t_{k_{j_b}, f}\}$. However, solving the MDP directly through value iteration is not practical for our problem, as the complexity of solving an MDP grows exponentially with the size of the state and action spaces. Therefore, even for a small number of users and channels, finding the optimal solution is impractical.
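For intuition only, a generic value-iteration loop over a fully enumerated MDP is sketched below; even in this toy form, the tables scale with the product of the state and action space sizes, and in our problem the state bundles buffers, arrival times, and channel states, which is exactly why direct value iteration is impractical.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Generic value iteration (illustrative sketch, not the paper's method).

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Merely storing P is already infeasible when S grows exponentially
    with the number of users and channels.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)        # (S, A) one-step look-ahead values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```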

Heuristic techniques provide significantly worse performance than optimal approaches such as the B&B technique, especially for large-scale problems.

These reasons constitute the main motivation for our use of the Monte-Carlo Tree Search technique, which provides a balance between the optimal and heuristic approaches. The MCTS approach provides a near-optimal solution with lower computational complexity compared to the B&B approach [24]. Importantly, by fixing the number of rollouts, the computation load of the MCTS approach can be bounded. In the following, we first describe the regular MCTS technique and then provide our novel modifications.

In the selection step, regular MCTS picks, at each node, the action maximizing the upper confidence bound for trees (UCT) criterion,
$$a^{*} = \arg\max_{a} \big( Q(s,a) + U(s,a) \big), \qquad (4)$$
where $Q(s, a)$ represents the total average reward received for taking action $a$ in state $s$ and $U(s, a)$ is defined as
$$U(s,a) = c \sqrt{\frac{\ln N}{N(s,a)}},$$
where $N$ denotes the total number of times state $s$ has been visited and $N(s, a)$ is the number of times the state-action pair $(s, a)$ has been selected. The constant $c$ balances the two terms in (4); the first term is the state-action value function (exploitation term), while the second is the confidence term (exploration term). Each time an action $a$ is selected at node $s$, $N(s, a)$ increments, and therefore the uncertainty is presumably reduced; as $N(s,a)$ appears in the denominator in (4), the exploration bonus shrinks. In the backpropagation step, the value estimate is updated as a running average,
$$Q(s,a) \leftarrow Q(s,a) + \frac{R - Q(s,a)}{N(s,a)},$$
where $R$ represents the reward for taking action $a$ in state $s$. In the next section, we discuss how the estimated value is generated in the proposed SS-MCTS approach.
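A minimal sketch of the UCT rule in (4), assuming each tree node stores its visit count and per-action statistics; the node layout is an assumed data structure, not the paper's implementation.

```python
import math

def uct_select(node, c=1.0):
    """Select the action maximizing Q(s,a) + c*sqrt(ln N / N(s,a)).

    `node.N` is the visit count of state s; `node.stats[a]` holds the
    pair (Q(s,a), N(s,a)). Unvisited actions score infinity so that
    every action is expanded at least once.
    """
    def score(a):
        q, n_sa = node.stats[a]
        if n_sa == 0:
            return float("inf")
        return q + c * math.sqrt(math.log(node.N) / n_sa)
    return max(node.stats, key=score)
```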

The state-action value function $Q(\cdot, \cdot)$ plays a key role in action selection. In regular MCTS, in order to estimate the value of a newly explored node, Monte-Carlo rollouts are performed, using a random or heuristic-guided policy (the default policy), from the expanded node to a terminal node. This value is then backed up through the tree in the backpropagation step.

For large action and state spaces, a large number of rollouts is required to obtain an appropriate estimate of the nodes' values, which results in high computational complexity; this issue is exacerbated in probabilistic environments, where the calculation of averages over possible actions is also required. In the next section, we discuss our proposed approach to alleviate this limitation of regular MCTS.
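For contrast with the super-state rollouts introduced next, a regular Monte-Carlo leaf evaluation averages the returns of many sampled trajectories under the default policy; the environment interface (`sample_step`) is an assumed stand-in for sampling arrivals and channel states.

```python
def mc_value_estimate(env, state, default_policy, n_rollouts=500):
    """Regular MCTS leaf evaluation: average return over random rollouts.

    Every rollout re-samples the random packet arrivals, which is why a
    probabilistic environment needs many rollouts before the average
    value estimate settles.
    """
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, done = state, 0.0, False
        while not done:
            a = default_policy(s)
            s, r, done = env.sample_step(s, a)  # samples arrivals/channels
            ret += r
        total += ret
    return total / n_rollouts
```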

As mentioned earlier, the MCTS approach combined with neural networks has gained significant attention since its performance in the game of Go [27], [28] and has been used in several works [11]. Our main contribution in this work is to modify the MCTS approach for large-scale probabilistic scenarios and to apply this technique to a cellular network.

The expected number of dropped tasks for the super state $SS_{S_t,a_t}$ is calculated as follows:
$$\mathbb{E}\big[d(SS_{S_t,a_t})\big] = \sum_{S_n \in SS_{S_t,a_t}} \Pr\big((S_t, a_t) \to S_n\big)\, d\big((S_t, a_t) \to S_n\big), \qquad (9)$$
where $\Pr((S_t, a_t) \to S_n)$ is the probability of reaching state $S_n$ by taking action $a_t$ in state $S_t$, and $d((S_t, a_t) \to S_n)$ is the number of dropped packets resulting from the transition from state-action pair $(S_t, a_t)$ to $S_n$. The summation in (9) is over all possible arrivals, which yields the set of possible states in the super state $SS_{S_t,a_t}$.
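A sketch of (9): rather than sampling, the reachable next states are enumerated with their arrival probabilities, and the dropped-packet counts are averaged; the `(prob, drops)` pair interface for a super state is an assumed representation.

```python
def expected_drops(super_state):
    """E[d] over a super state, per (9).

    `super_state` is assumed to be an iterable of (prob, drops) pairs,
    one per reachable state S_n, where prob = Pr((S_t, a_t) -> S_n)
    comes from the known Poisson arrival rates.
    """
    return sum(prob * drops for prob, drops in super_state)
```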

After the expected number of dropped tasks for the super state is calculated, the value is backed up through the path taken to this node in the tree (similar to the back-up step in regular MCTS), and the values of the edges along this path are updated.

The use of SS-MCTS results in a significant decrease in the required number of rollouts compared to the regular MCTS approach, as the knowledge of the arrival rates is used in place of a large number of rollouts to reach a good value estimate. Effectively, we bypass the need for Monte-Carlo trials to account for the random arrivals; the runtime and computational complexity therefore decrease.

The SS-MCTS steps as proposed are illustrated in Fig. 3. In the rollout step of Fig. 3, as an example, two future super states are considered (green boxes). The first super state is the set of states resulting from the state-action pair connected to that state. The second super state is the set of all states reachable from the states in the first super state. In this way, the size of the super states grows exponentially as more look-ahead steps are considered. As we discuss in Section VI, the number of required look-ahead super states depends on the problem parameters. Taking more future steps into account yields a better value estimate, but also increases the computational complexity of the algorithm.

In order to calculate the expected number of dropped packets using the super state, we analyze the probability transition model of the states. In this regard, we consider a Markov model in which the state in every timeslot depends on its value in the previous timeslot. Denoting by $b_t$ the number of stored packets at time $t$, the number of remaining packets to be scheduled at time $t+1$ is
$$b_{t+1} = b_t - c_t + r_t, \qquad (10)$$
where $r_t$ is the number of randomly arrived packets and $c_t$ is the number of scheduled packets at time $t$. Thus, the probability of the transition from the state with $b_t$ stored packets to the next state with $b_{t+1}$ stored packets, in case $r_t$ packets arrive in timeslot $t$, is
$$\Pr(b_t \to b_{t+1}) = \Pr\big(r_t = b_{t+1} - b_t + c_t\big) = \frac{\lambda^{r_t} e^{-\lambda}}{r_t!}, \qquad (11)$$
where the probabilities are calculated using the Poisson distribution with the known arrival rate $\lambda$. The probability transition model in (11) can be used in the rollout step to calculate the average number of dropped packets in the super state rollout.

The size of the action space is a main concern for large-scale scenarios (i.e., a large number of users and channels). Selecting complete tuples of users and channels as single actions (see Fig. 4.b) results in a large number of actions in each level of the tree. Instead, in Section IV-B, we suggested considering channels one by one and, for each channel, treating the selection of a user as an action (see Fig. 4.a). This significantly reduces the number of actions per level of the tree. The depth of the tree, however, increases by a factor of the number of available channels per cell; even so, the computational complexity of the tree in Fig. 4.a is less than that of Fig. 4.b.
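A sketch of the transition model in (10) and (11), assuming Poisson arrivals with known rate `lam`; the buffer clipping at size `B` and the tail truncation at `r_max` are assumptions added so the enumeration is finite.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def transition_probs(b_t, c_t, lam, B=20, r_max=30):
    """Pr(b_{t+1}) given b_t stored and c_t scheduled packets, per (10)-(11).

    b_{t+1} = b_t - c_t + r_t with r_t ~ Poisson(lam); arrivals beyond
    the buffer size B overflow (clipped), and r_max truncates the tail.
    """
    probs = {}
    for r in range(r_max + 1):
        b_next = min(b_t - c_t + r, B)
        probs[b_next] = probs.get(b_next, 0.0) + poisson_pmf(r, lam)
    return probs
```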

As discussed, the MCTS approach reduces the depth of the search tree by truncating the tree at newly explored states and replacing the truncated sub-tree by an approximate evaluation of the nodes' value; MCTS also reduces the breadth of the tree through policy improvement techniques such as the UCT approach.

The proposed SS-MCTS approach decreases the computation required in the rollout step of MCTS. As we discuss in Section VI, a larger number of super states (i.e., more look-ahead steps) results in a better estimate for the newly explored node in SS-MCTS. However, calculating the estimated values through super-state rollouts for a large number of look-ahead steps increases the computational complexity of this technique, since the size of the super states grows exponentially when considering multiple steps. Therefore, in order to decrease the computational complexity of SS-MCTS, as well as to improve its performance, we benefit from deep multi-agent reinforcement learning (MARL) by incorporating policy and value networks coupled with the SS-MCTS approach.

The structure of the RL framework for the scheduling problem comprises agents (in our problem, the BS in each cell) and a probabilistic environment in which packets arrive randomly for users. The environment is everything within the multi-cell network, including the BSs, the users, and their randomly arriving delay-sensitive packets. The agents interact with the environment by taking actions, denoted by $a$, indicating the scheduling of users based on the observed state of the system, denoted by $s$, and according to the policy $\pi(s)$. The state of the system comprises the stored packets of each user, the arrival time of each packet, and the channel state information. The policy $\pi(s)$ is a function that selects the next action to take at each state $s$ and is determined based on the values of the actions $a$ that can be taken at each state $s$, denoted by the function $Q(s, a)$.

We consider a multi-agent RL scenario, treating each BS as an agent that interacts with the users associated with it. Our goal here is to find an efficient scheduling policy for each agent.

RL has been shown to be robust to disturbances in the dynamics [8]. Also, unlike supervised learning, which requires a large amount of pre-sampled data as well as substantial memory and computational resources for training, the (multi-agent) RL framework allows us to train the network more robustly and at a lower computational cost. Our goal is to find a policy through which BSs can schedule users. Therefore, in this work, we consider online training of neural networks using RL, such that the value and policy neural networks are coupled with the SS-MCTS approach.

In the training of the VP-NN using the SS-MCTS approach, both the value and policy networks learn features of the inputs (states, in our case) that are useful for their training purposes. Although different features are learned for the different tasks, the extracted features may be useful for other tasks as well. In the deep learning framework defined for the scheduling problem, since the inputs to both the value and policy networks are the same set of states, combining both networks into a single network avoids multiple networks computing the same features; we therefore consider a single neural network $f_\theta(s)$ for both the value and policy networks instead of two separate networks, where $\theta$ denotes the network parameters. The network takes states as its input and outputs both the value $V_\theta$ and the policy vector $\pi_\theta$. Multi-task learning can result in extracting richer and more efficient features, which improves the training performance [4] and also helps avoid overfitting.
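A minimal PyTorch sketch of the shared network $f_\theta(s)$ with a scalar value head and a softmax policy head; the layer sizes are placeholders, with ReLU activations as stated in Section VI.

```python
import torch
import torch.nn as nn

class VPNN(nn.Module):
    """Single network f_theta(s) -> (V_theta, pi_theta): a shared trunk
    feeding one scalar value head and one softmax policy head."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value_head = nn.Linear(hidden, 1)
        self.policy_head = nn.Linear(hidden, n_actions)

    def forward(self, s):
        z = self.trunk(s)
        v = self.value_head(z).squeeze(-1)               # V_theta
        pi = torch.softmax(self.policy_head(z), dim=-1)  # pi_theta
        return v, pi
```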

The input to the neural network is the state $s$, passed through the value and policy heads. The value head's output is the node's estimated value $V_\theta$, and the policy head outputs the probability vector $\pi_\theta$ as a probability distribution over actions. The training goal is, therefore, to update the neural network to minimize the error between the value output and the terminal value $Z$, and to match the policy output to the search probabilities. After each realization, we record the visited states, the terminal value $Z$, and the normalized probability vector $\pi$ (the normalized visit counts over all possible actions) to train the VP-NN. We execute the SS-MCTS approach combined with the VP-NN for a sufficiently large number of realizations until the required training accuracy is achieved.
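A hedged sketch of the implied training objective, in the AlphaZero style: a squared error between $V_\theta$ and the terminal value $Z$, plus a cross-entropy term pulling $\pi_\theta$ toward the normalized visit counts $\pi$; the equal weighting of the two terms is an assumption, as the exact loss is not restated here.

```python
import torch

def vpnn_loss(v_pred, pi_pred, z_target, pi_target, eps=1e-8):
    """(V_theta - Z)^2 - pi^T log(pi_theta), averaged over a batch.

    pi_target holds the normalized visit counts recorded by SS-MCTS;
    the exact loss weighting is not given in the text, so equal weights
    are assumed here.
    """
    value_loss = torch.mean((v_pred - z_target) ** 2)
    policy_loss = -torch.mean(
        torch.sum(pi_target * torch.log(pi_pred + eps), dim=-1))
    return value_loss + policy_loss
```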

The proposed framework allows for online learning of values, unlike approaches in the literature in which the value model is trained offline and then used to pick actions. In the proposed SS-MCTS with VP-NN approach, a (local) model good enough to take an action in the current state is trained and used for action selection while training continues.

To incorporate the VP-NN into the SS-MCTS approach, we use the policy vector $\pi_\theta$ such that the constant term $c$ in the UCT decision rule, as defined in (4), is multiplied by the probability of selecting an action. The action selection stated in (8) also needs to be updated accordingly. In this way, the policy vector guides the UCT algorithm toward more promising actions, which in turn improves the action-selection policy. The value $V_\theta$ is also used to estimate the nodes' values: we consider a linear combination of the super states' estimated value, calculated as stated in (9), and the neural network's output $V_\theta$ to update the nodes' values. As shown in Section VI, using the VP-NN reduces computation and enhances performance in two ways: first, by improving the policy and the estimated values in the tree; second, by reducing both the depth and the breadth of the search tree.
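Following the description above, the selection rule becomes the UCT rule of (4) with the constant $c$ scaled by the network prior; a sketch, reusing the node layout assumed earlier:

```python
import math

def guided_uct_select(node, prior, c=1.0):
    """UCT selection with c multiplied by pi_theta(a|s), as described
    above; `prior[a]` is the policy network's probability for action a."""
    def score(a):
        q, n_sa = node.stats[a]
        if n_sa == 0:
            return float("inf")
        return q + c * prior[a] * math.sqrt(math.log(node.N) / n_sa)
    return max(node.stats, key=score)
```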

The previous section described the proposed SS-MCTS approach. In this section, we present simulation results illustrating the efficacy and performance of the proposed technique. To evaluate the performance of the proposed approach, we first consider the simplest non-trivial case of a single cell in which channels have only two states, "good" and "bad". Then, we simulate the performance of SS-MCTS combined with the VP-NN in the downlink of a 7-cell network with wraparound and multiple frequency bands available to each BS. The proposed approach can be applied to a wide range of large-scale sequential decision-making problems dealing with a max-delay constraint in probabilistic environments.

The evaluation metrics are the average rate of dropped tasks and the total run-time. Since the optimal approach is not practical in the large-scale case, as baselines we use a full tree search for a small number of users and, for large-scale scenarios, a strong greedy choice: selecting the user with the largest number of tasks having the least remaining time, assigned to the best channel. This takes the number of stored messages, the channels, and the delay constraints into account.

The results demonstrate that the SS-MCTS approach provides performance similar to the optimal solution for small-scale scenarios, where a full tree search is computationally practical. Table 2 compares the average execution time for each run of SS-MCTS, normalized to the greedy approach, against the regular MCTS approach. To calculate the average time per simulation, we run 10000 realizations of the SS-MCTS approach for different sets of random arrivals, measure the time each full run of the SS-MCTS approach takes to generate all the scheduling decisions compared to that of the regular MCTS approach, and normalize the runtime to the greedy simulation time.
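A sketch of the greedy baseline as described: each band goes to the user holding the most packets at the minimum remaining time, assigned to its best channel; the data structures here are assumed for illustration.

```python
def greedy_schedule(users, channels):
    """Greedy baseline: pick the most urgent user, then its best band.

    `users[j]["deadlines"]` lists remaining timeslots per stored packet;
    `channels[j][f]` is user j's channel quality on band f.
    """
    def urgency(j):
        d = users[j]["deadlines"]
        if not d:
            return (0, 0)                 # nothing stored for this user
        t_min = min(d)
        return (d.count(t_min), -t_min)   # most packets at tightest deadline
    j_star = max(users, key=urgency)
    f_star = max(range(len(channels[j_star])), key=lambda f: channels[j_star][f])
    return j_star, f_star
```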

As is clear, the SS-MCTS-based approaches are faster than the traditional MCTS approach. It is worth mentioning that the results for the regular MCTS approach use 500 rollouts for each node's expansion, significantly more than SS-MCTS, which takes only one to a few look-ahead super states into account. In some applications, as in [12], MCTS is performed with an extremely large number of rollouts (on the order of $10^4$). However, when the value estimation can be addressed through limited-depth rollouts, as in our problem of interest, using super states can significantly decrease the computational complexity and therefore the time needed to select actions.

The VP-NN takes states as input, with a single output for the value of each given state and a softmax vector as the output of the policy network, generating the probability distribution of the next action at each state. We use the ReLU activation function for all layers. In order to train the VP-NN, 20000 realizations of the SS-MCTS combined with the VP-NN are considered. The data used for training after each realization consists of all states in the realization, the total number of dropped messages, and the vector of normalized visit counts over all possible actions of each state. The training process repeats until the required accuracy level is reached. We therefore use online training, and our approach improves over time; this is a crucial advantage of the proposed approach compared to offline learning scenarios.

The large-scale simulations consider a finite horizon of timeslots with up to 40 users per cell, each having a strict delay constraint ranging from 1 to 5 timeslots as well as a different random arrival rate. This accounts for the different priorities and loads of the users' tasks. It is worth emphasizing that it is only the use of the SS-MCTS approach that allows us to address such a large number of users. We have also considered a large-enough buffer size of 20, which for the considered system model parameters results in close to zero buffer overflow. This is important for evaluation purposes, as the main goal here is to monitor the number of packets dropped due to delay expiration only, and not due to buffer overflow.

The blue curve is the SS-MCTS approach in which the algorithm looks only one super state ahead in the rollout step. The orange dashed curve (SS-MCTS+VP-NN) represents the SS-MCTS approach combined with the VP-NN such that, in the rollout step, a linear combination of the value network's output and one super step ahead is considered, and the policy is guided using the policy network's output.

As demonstrated in Fig. 7, the combination of super-state rollouts with value networks results in better performance compared to the regular MCTS approach. As the number of users per cell increases, the gap between the two curves also increases, showing the better performance of SS-MCTS+VP-NN in larger-scale environments compared to the regular SS-MCTS approach. Furthermore, a higher number of users per cell results in higher interference to surrounding cells, causing the concavity of the curves for larger numbers of users.

The impact of the number of look-ahead super state steps, compared with a full-tree search, is demonstrated in Fig. 8. The results here are averaged over 2 to 9 users per cell, with a fixed number of 3 frequency bands.

In Table 5, the impact of the number of timeslots (i.e., the depth of the tree) on execution time is demonstrated for a scenario with 3 users and 2 frequency bands per cell. As shown here, the runtime grows exponentially with the number of timeslots for the full-tree search, while for SS-MCTS the runtime grows almost linearly (as discussed in Section IV-C).

In training the neural networks, considering the actions as only the selection of users, as shown in Fig. 4.a, results in nodes at different levels of the tree having different distributions. The reason is that, for a scenario with $F$ frequency bands per cell, the packets arrive at the beginning of each timeslot (i.e., at the orange nodes in Fig. 4.a, where packet arrivals are applied to states and channels are updated). Therefore, we have used two separate VP-NNs, trained separately on the levels of the tree that represent the beginning of a timeslot and on the remaining levels. This resulted in better performance compared to feeding the data from all levels to the same value-policy neural network.

As the different states within one realization are correlated, labeling all states of one SS-MCTS realization with the same terminal value $Z$ (the total number of dropped messages in that realization) results in overfitting of the neural networks, since increased training of the value network leads to memorizing the output labels. To avoid this issue, for training the neural networks we randomly select a few states from each realization to reduce the correlation between data samples.
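A sketch of this decorrelation step: only a few randomly chosen states per realization enter the training set, each labeled with that realization's terminal value $Z$ and its visit-count vector.

```python
import random

def subsample_realization(states, pi_targets, z, k=3):
    """Keep k random (state, pi, Z) training samples from one realization
    to reduce the correlation between data samples."""
    idx = random.sample(range(len(states)), min(k, len(states)))
    return [(states[i], pi_targets[i], z) for i in idx]
```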

The simulation results demonstrated that considering more features of the problem, such as the delay constraints on tasks, the arrival rates, and the number of users, results in better performance of SS-MCTS compared to the greedy approach. The main reason is that the greedy choice is fixed and cannot adapt itself to new parameters. The SS-MCTS, however, by interacting with the environment, gains a better understanding of how the different features affect the nodes' values and, through a good action-selection rule, searches for the most promising nodes.