Introduction
Narrowband Internet of Things (NB-IoT) is a radio access technology specified by the Third Generation Partnership Project (3GPP) to provide efficient connectivity to a massive number of low-complexity devices, and is considered one of the key technologies adopted by 5G and beyond networks for machine-type communications, with applications in smart metering, logistics, tracking, and smart cities, among others [1]. Its radio interface, which is derived from the Long Term Evolution (LTE) standard, requires a minimum bandwidth of 180 kHz in both downlink and uplink. Its design promotes coexistence with legacy networks, enabling several deployment options: within an LTE carrier, occupying a physical resource block; within a guard band of an LTE carrier; or standalone, for example reusing a GSM carrier [2]. The NB-IoT physical layer, specified in [3], [4], and [5], uses repetition and signal combining techniques to enhance the coverage of low-power devices, sometimes in unfavorable (e.g. underground) locations. See [6] for a detailed survey. To provide equal access opportunities to devices with different path loss conditions, NB-IoT defines three coverage enhancement (CE) configurations with their specific settings.
One of the objectives of NB-IoT is to support transmissions of small data messages from a massive number of devices with relaxed delay requirements [7]. In the uplink carrier, data transmissions are performed over the narrowband physical uplink shared channel (NPUSCH). Connected devices must wait until the base station assigns them a transmission opportunity or uplink grant. The uplink grant is provided in a message called downlink control information (DCI), which also specifies the parameters of the NPUSCH transmission: the link-adaptation parameters (the number of transmitted information bits or transport block size (TBS), the modulation and coding scheme (MCS) and the number of repetitions), as well as the specific time-frequency resources allocated to the NPUSCH. The time domain of the uplink carrier is divided into subframes of 1 ms, and its bandwidth is divided into subcarriers spaced by either 15 kHz or 3.75 kHz. The minimum amount of time-frequency resources that can be allocated is called a resource unit (RU). For 15 kHz subcarrier spacing, an RU can be configured in 4 possible shapes with 1, 3, 6, and 12 subcarriers, and 8, 4, 2, and 1 subframes, respectively. The link-adaptation parameters determine the number of resource units that must be allocated for a transmission in the uplink carrier. For example, the number of RUs needed by a 256-bit TBS may range between 2 and 10 RUs depending on the selected MCS [8].
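To make the resource-unit geometry concrete, the following Python sketch (with hypothetical helper names, 15 kHz subcarrier spacing only) maps the number of allocated subcarriers to the RU durations listed above and estimates the total duration of an NPUSCH, assuming that repetitions simply extend the transmission in time.

# Sketch with hypothetical helper names; RU shapes for 15 kHz spacing taken from the text above.
RU_SHAPES_15KHZ = {1: 8, 3: 4, 6: 2, 12: 1}  # number of subcarriers -> RU duration in subframes (1 ms each)

def npusch_duration_subframes(n_subcarriers: int, n_ru: int, n_rep: int) -> int:
    """Total NPUSCH duration in subframes, assuming repetitions extend the transmission in time."""
    return n_ru * RU_SHAPES_15KHZ[n_subcarriers] * n_rep

# Example: 4 RUs on 3 subcarriers with 2 repetitions -> 4 * 4 * 2 = 32 subframes (32 ms).
print(npusch_duration_subframes(3, 4, 2))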
This paper focuses on the control of NPUSCH transmissions, which is a challenging task that involves selecting which devices are allowed to transmit (scheduling), their link-adaptation parameters, and the resources allocated to each NPUSCH transmission. Figure 1 shows an example of how NPUSCHs are dynamically arranged in the uplink carrier of an NB-IoT access network with three CE levels labeled as CE0, CE1 and CE2. The NPUSCHs must not overlap with the resources reserved for the access opportunities of each CE level, i.e. the narrowband physical random access channels (NPRACHs), appearing periodically in the uplink carrier. Each NPUSCH is configured by a DCI which occupies either 1, 2 or 4 subframes of the narrowband physical downlink control channel (NPDCCH), depending on the CE level of the target device. Because of the hardware limitations of the NB-IoT devices, there must be a minimum delay of 8 ms from the reception of the DCI to the start of the NPUSCH. Other possible delays (that must be specified by the DCI) are 16, 32 and 64 ms.
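The "no overlap with NPRACH" constraint can be illustrated with a simple occupancy-grid check. The Python sketch below is a simplified illustration (hypothetical function, NumPy grid of subframes by subcarriers), not the scheduling logic used in the paper.

import numpy as np

# Simplified illustration (hypothetical function): the uplink carrier is modelled as a boolean
# occupancy grid of shape (subframes, subcarriers), with NPRACH occasions and already-scheduled
# transmissions marked as occupied.
def npusch_fits(occupancy, start_sf: int, n_subframes: int, first_sc: int, n_subcarriers: int) -> bool:
    """Return True if a rectangular NPUSCH allocation stays inside the grid and overlaps nothing."""
    if start_sf + n_subframes > occupancy.shape[0] or first_sc + n_subcarriers > occupancy.shape[1]:
        return False
    return not occupancy[start_sf:start_sf + n_subframes, first_sc:first_sc + n_subcarriers].any()

# Example: a 64-subframe horizon with 12 subcarriers, where subframes 20-25 host an NPRACH.
grid = np.zeros((64, 12), dtype=bool)
grid[20:26, :] = True
print(npusch_fits(grid, start_sf=8, n_subframes=8, first_sc=0, n_subcarriers=3))   # True
print(npusch_fits(grid, start_sf=18, n_subframes=8, first_sc=0, n_subcarriers=3))  # False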
Diagram illustrating the allocation of time-frequency resources for NPUSCH transmissions in the uplink carrier. An NPUSCH can occupy 1, 3, 6 or 12 subcarriers, and can span multiple subframes depending on the transport block size (TBS), the modulation and coding scheme (MCS) and the number of repetitions ($N_{rep}$).
NPUSCH transmission control can be seen as a more challenging version of the Tetris game, which is considered a demanding benchmark for decision algorithms, specifically reinforcement learning (RL) algorithms [9]. An NPUSCH is like a Tetris piece whose size, in RUs, depends on the amount of data backlogged in the device and the selected link-adaptation parameters. The NPUSCH shape is rectangular, but its width in subframes depends on the assigned subcarriers. There are many more combinations of link-adaptation parameters than the four possible rotations of a Tetris piece, and they determine not only the occupied resources in the uplink carrier, but also the reception probability of the NPUSCH, which is relevant since each unsuccessful transmission implies a future retransmission. The objective, similar to Tetris, is to “fit” the NPUSCHs in the uplink carrier among the NPRACHs and other (data and control) uplink transmissions, ensuring that the data is received with an acceptable delay. In this paper, we propose an RL strategy capable of learning how to control NPUSCH transmissions in a plug-and-play fashion (i.e. once deployed in the network) without causing any perceptible increase in the transmission delay during the learning process. This allows the controller to be re-trained as often as necessary to adapt to changes in network configuration, devices, firmware, etc.
A. Related Works and Contributions
Previous works have addressed different parts of the problem of uplink transmission control in NB-IoT. For example, the problem of subcarrier allocation for NPUSCH is considered in [10] and [11], in combination with transmission delay [10], or power adjustment [11]. The automatic configuration of link-adaptation parameters has been studied in [12], which proposed an adaptive heuristic algorithm known as NBLA, and in [13], which used a multi-armed bandit approach. Device selection for uplink transmission was addressed in [14], which presented a mechanism to assign priorities to different types of devices based on their energy efficiency objectives and latency demands, but their mechanism does not consider resource allocation in the uplink carrier (all messages are assumed to occupy a single RU) or link-adaptation parameters. Other approaches for link-adaptation in NB-IoT, like [15], rely on mathematical models of the transmission error rate. In contrast, instead of a theoretical model, which in general is a simplified representation of the response of the real system, we propose a mechanism capable of learning the system’s response from observations obtained from the network itself in operation. Two aspects of the problem that have been considered only by recent works are the allocation of NPUSCH time-frequency resources in the uplink carrier avoiding overlapping with NPRACH resources [16]; and NPDCCH signalling constraints [17]. These works considered a given number of UEs (deterministic scenario), while we consider a stochastic scenario where the devices randomly generate transmission attempts for messages of random length. NPDCCH configuration and signalling messages were also considered in [18] using a fixed link-adaptation configuration for each CE level, while our control approach adjusts the link-adaptation parameters to the measured path loss of the device and the message size. Downlink scheduling, addressed in [19], although related to uplink scheduling, differs in several aspects such as the structure of the frame, the NPDCCH configuration, and the reduced number of options for link-adaptation and subcarrier selection.
Our approach considers a scenario comprising all aspects involved in NPUSCH control decisions: link-adaptation, scheduling, transmission delays, NPRACH awareness, random transmission attempts with variable data demands, retransmissions, and NPDCCH signalling constraints, among others. To address the problem, we propose the use of reinforcement learning algorithms, which, despite their current popularity, do not seem to have been sufficiently explored for NB-IoT transmission control. There have been attempts, however, to apply RL to the configuration of random access parameters, in particular for the allocation of NPRACH resources of each CE level [20], and for dynamically delaying the access attempts to avoid congestion in the NPRACH [21]. These NPRACH configuration mechanisms, or the heuristic proposed in [22], could seamlessly coexist with our transmission control scheme.
The use of RL is just the starting point for our proposal. As pointed out in [20], RL algorithms generally need to be trained offline in a simulator before deployment and, once deployed, real network conditions can be very different from the simulation environment, rendering control policies ineffective. One possible solution is to prepare the algorithms to self-tune after deployment, but this approach raises questions about how long this tuning might take and how it will affect network performance. Our goal is to propose a mechanism capable of learning autonomously once deployed in the network (online learning), a task that state-of-the-art RL algorithms are not able to accomplish without deteriorating the transmission delay during their initial stages of learning, as verified by our numerical results. The reason is the low sample efficiency of model-free RL algorithms: they need to explore multiple policies before converging to an efficient one, which implies selecting very ineffective actions over long periods. To overcome this limitation, we propose two complementary strategies: a multi-agent architecture in which two agents learn in a coordinated way, and a model-based RL (MBRL) approach, leveraging its higher sample efficiency [23], [24] with respect to model-free RL.
This work is framed within a broader trend focused on developing online learning algorithms to control network functions. Previous works in this line have addressed resource allocation for network slices [25], interference coordination in LTE [26], [27], and energy saving for small cells [28], using specific ad hoc mechanisms based on multi-armed bandits [26], sequential likelihood ratio tests [27], Bayesian models [28], and kernel-based methods [25]. This work is the first one focused on NB-IoT transmission control, and contains the following contributions:
We present a new approach to the design of an NB-IoT transmission control scheme, based on online learning with minimum impact on the network performance.
We evaluate the use of state-of-the-art RL algorithms for this task and highlight their limitations.
We propose a multi-agent architecture in which the control task is divided into two sub-tasks (link-adaptation and device selection) that are assigned to two agents learning in a coordinated way.
We develop a novel model-based RL agent for link-adaptation combining online learning techniques with sample augmentation strategies. The high sample efficiency attained by our agent allows it to learn without a noticeable degradation in the transmission delay.
The rest of the paper is organized as follows: Section II describes the system and explains the transmission control problem. Section III details how RL can be applied to the problem with a single agent and with a multi-agent configuration. Section IV details our novel model-based proposal and Section V explains our evaluation methodology, describes the baselines with which we compare our proposal, and discusses the numerical results. Finally, Section VI summarizes our findings.
System Description and Problem Statement
A. System Description
The system under study consists of a base station serving several thousand NB-IoT devices. These devices must complete a random access procedure, making use of the NPRACH resources, followed by a connection procedure and, once connected, they must wait for the base station to send them an uplink grant to transmit data on an NPUSCH. We consider one carrier for each transmission direction (uplink and downlink) with a bandwidth of 180 kHz and the time dimension divided into frames comprising 10 subframes of 1 ms.
1) Random Access Over the NPRACH
The uplink carrier contains pre-allocated resources for the NPRACHs of each CE level. An NPRACH can use 12, 24, 36 or 48 contiguous subcarriers of 3.75 kHz. The duration of an NPRACH depends on the number of repetitions of the preamble sequence: 1, 2, 4, 8, 16, 32, 64, or 128. The duration of a single preamble sequence is 5.6 ms (format 0). The NPRACH of each CE level appears periodically in the carrier every 40, 80, 160, 320, 640, 1280, 2560, or 5120 ms. For each CE level, the number of subcarriers and the periodicity of the NPRACH are typically determined by the data traffic profile of the devices with this CE level, whereas the number of preamble repetitions depends on the pathloss of these devices.
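As an illustration of these configuration options, the following Python sketch (hypothetical helper) computes the duration of one NPRACH occasion from the number of preamble repetitions and the fraction of uplink resources it reserves, assuming the 48 subcarriers of 3.75 kHz that span the 180 kHz carrier.

# Sketch (hypothetical helper) based on the NPRACH options listed above (preamble format 0).
PREAMBLE_MS = 5.6  # duration of one preamble sequence (format 0)

def nprach_footprint(n_repetitions: int, periodicity_ms: int, n_subcarriers: int):
    """Duration of one NPRACH occasion and the fraction of uplink resources it consumes.

    Assumes 48 x 3.75 kHz subcarriers span the 180 kHz carrier.
    """
    duration_ms = n_repetitions * PREAMBLE_MS
    resource_fraction = (duration_ms / periodicity_ms) * (n_subcarriers / 48)
    return duration_ms, resource_fraction

# Example: 8 repetitions, 320 ms periodicity, 12 subcarriers
# -> 44.8 ms per occasion, i.e. 14% of the time on 1/4 of the band (3.5% of uplink resources).
print(nprach_footprint(8, 320, 12))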
To establish a connection with the network, a device must transmit a random access preamble on the NPRACH associated with its CE level. The preambles are known sequences of symbols allowing the base station to detect devices attempting to access the network. To reduce the probability of collision between the preambles of different UEs in the same NPRACH, the symbols are divided into four groups, each one transmitted on a different subcarrier (frequency hopping). The device selects an initial frequency, which is associated with a deterministic frequency hopping sequence, such that devices selecting a different initial frequency do not collide. The number of subcarriers of an NPRACH determines the number of frequency hopping sequences available to the devices. The more subcarriers are available, the more deterministic hopping sequences can be used, increasing the number of available resources and hence reducing the probability of collision.
After detecting the preambles, the base station sends a Random Access Response (RAR) message to the devices that have transmitted them. The RAR message includes a Temporary C-RNTI (Cell Radio Network Temporary Identifier) that is unique to each device, and contains uplink grants providing the UE with uplink resources to transmit its connection request message.
The carrier bandwidth not reserved for NPRACH is divided into 12 subcarriers of 15 kHz. The base station must allocate time-frequency resources of the uplink carrier for two types of transmissions: i) the UE connection requests mentioned above, and ii) NPUSCH transmissions containing UE data. This allocation must avoid any overlap with NPRACH resources.
2) Signaling Over the NPDCCH
The base station transmits control messages to the devices through the NPDCCH on the downlink carrier. The uplink data grants are provided in specific control messages known as Downlink Control Information (DCI) Format N0 (other formats correspond to different purposes) transmitted over the NPDCCH. DCI messages are repeated a number of times depending on the pathloss of the destination device. We consider that CE2, CE1 and CE0 devices require 4, 2, and 1 repetitions, respectively. We consider an NPDCCH lasting 4 consecutive subframes and with a periodicity of 10 subframes. Note that the number of DCIs that the base station can transmit per frame is determined by the CE level of the devices. In each NPDCCH, the base station can transmit one DCI for a CE2 device, two DCIs for CE1 devices, four DCIs for CE0 devices, or one DCI for a CE1 device plus two DCIs for CE0 devices.
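The per-NPDCCH DCI budget described above can be expressed as a simple capacity check. The Python helper below is a hypothetical illustration based on the repetition numbers assumed in this paper (4, 2 and 1 subframes per DCI for CE2, CE1 and CE0).

# Sketch (hypothetical helper): subframes consumed per DCI, from the repetition numbers above.
DCI_SUBFRAMES = {"CE2": 4, "CE1": 2, "CE0": 1}
NPDCCH_SUBFRAMES = 4  # consecutive NPDCCH subframes per 10-subframe period

def fits_in_npdcch(pending_ce_levels) -> bool:
    """Check whether the DCIs for the given CE levels fit into a single NPDCCH occasion."""
    return sum(DCI_SUBFRAMES[ce] for ce in pending_ce_levels) <= NPDCCH_SUBFRAMES

print(fits_in_npdcch(["CE1", "CE0", "CE0"]))  # True: 2 + 1 + 1 = 4 subframes
print(fits_in_npdcch(["CE2", "CE0"]))         # False: 4 + 1 > 4 subframes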
The DCI message specifies the values for all the parameters of an NPUSCH transmission:
The set of contiguous subcarriers allocated in the uplink carrier, $N_{sc}$. The number of subcarriers in the set is denoted by $n_{sc}$.
The number of uplink resource units, $N_{RU}$ (the resource unit being the minimum amount of resources that can be allocated).
The number of subframes, $I_{delay}$, elapsed from the end of the current NPDCCH subframe until the first uplink subframe of the NPUSCH.
The modulation and coding scheme index, $I_{MCS}$.
The number of repetitions, $N_{rep}$.
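For illustration, the DCI Format N0 fields listed above could be grouped in a small data structure; the Python class below is a hypothetical representation, not part of the standard.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class DciFormatN0:
    """Hypothetical container for the NPUSCH parameters carried by a DCI Format N0."""
    subcarriers: Tuple[int, ...]  # contiguous allocated subcarriers (n_sc of them)
    n_ru: int                     # number of resource units
    i_delay: int                  # scheduling delay index (mapping to e.g. 8/16/32/64 ms)
    i_mcs: int                    # modulation and coding scheme index
    n_rep: int                    # number of repetitions

grant = DciFormatN0(subcarriers=(0, 1, 2), n_ru=4, i_delay=0, i_mcs=5, n_rep=2)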
3) Data Transmission Over the NPUSCH
The duration of a resource unit in subframes is determined by the number of allocated subcarriers: with 15 kHz subcarrier spacing, an RU spans 8, 4, 2, or 1 subframes for 1, 3, 6, or 12 subcarriers, respectively.
If the uplink data transmitted over an NPUSCH is correctly received at the base station, the transmitting device will disconnect from the base station after receiving an acknowledgement (ACK) from it. Transmission errors are recovered by means of an asynchronous adaptive HARQ process. If the base station is unable to decode a transport block, it will not transmit any ACK to the transmitting device. The base station then sends a new uplink grant to that device at a later time, indicating that it must retransmit the previous transport block using the same link-adaptation parameters.
B. The Uplink Transmission Control Problem
Our objective is to design an algorithm capable of learning how to control uplink data transmissions to minimize the average transmission delay. The control decisions determine, for each NPUSCH transmission, the selected device, the link-adaptation (LA) parameters and the time-frequency resources allocated in the uplink carrier. Importantly, the algorithm must be capable of learning online in the radio access network. This implies that the algorithm must make irrevocable decisions, observe the results of these decisions (i.e. samples of a performance metric), and update its control policy accordingly. In our case, the performance metric is the transmission delay of each device, defined as the time elapsed from the instant when the device becomes connected to the base station to the moment when its data is successfully received.
The algorithm makes decisions when there are no random access procedures in progress and all the active devices are connected and waiting for data uplink grants (except those devices that may be in backoff state waiting for a future random access opportunity). Each decision corresponds to a DCI message sent on an NPDCCH. For each DCI sent, the base station must select the device to be scheduled, its link-adaptation parameters, and the time-frequency resources allocated to its NPUSCH transmission.
To make decisions, the agent considers the following information: i) For each connected device: the estimated pathloss, the connection time, the amount of backlogged data, and an indicator of a required retransmission (i.e. if the device was scheduled previously but its transmission was not successfully decoded). ii) The resources occupied in each upcoming subframe up to a certain horizon.
Each decision of the agent consists of a selected user, the link-adaptation parameters for its transmission, and the time-frequency resources allocated to the corresponding NPUSCH in the uplink carrier.
Reinforcement Learning Approaches
In this Section we describe how the scheduling problem described in the previous Section can be formulated as a reinforcement learning problem. In the first subsection we consider a conventional RL formulation with a single agent. Then, the second subsection explains how the problem could be approached with a multi-agent RL strategy.
A. Reinforcement Learning Formulation
Reinforcement Learning (RL) is a type of machine learning algorithm where an agent learns to make decisions by interacting with the environment. The agent’s goal is to maximize the expected value of a cumulative reward, which is the sum of rewards received over time. At each time step $t$, the agent observes the environment, selects an action, and receives a reward $R_{t}$; the expected discounted cumulative reward from step $t$ is \begin{equation*} \mathbb {E}[R_{t} + \gamma R_{t+1} + \gamma ^{2} R_{t+2} + {\dots }] = \mathbb {E}\left [{\sum _{i=0}^{\infty } \gamma ^{i} R_{t+i}}\right] \tag{1}\end{equation*} where $\gamma \in [0,1)$ is the discount factor.
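For instance, a finite-horizon approximation of the discounted return in (1) can be computed as follows (illustrative sketch):

def discounted_return(rewards, gamma: float) -> float:
    """Finite-horizon approximation of the expected discounted return in (1)."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example with three observed rewards and gamma = 0.99:
print(discounted_return([-0.1, -0.2, -0.1], gamma=0.99))  # approx. -0.396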
The agent observation comprises the information described in Section II-B: for each connected device, its estimated pathloss, connection time, backlogged data, and retransmission indicator, together with the occupation of the upcoming subframes of the uplink carrier.
The action specifies the selected device, its link-adaptation parameters, and the time-frequency resources allocated to its NPUSCH transmission.
The reward is designed to penalize the transmission delay resulting from each decision.
Given that the outcome of an NPUSCH transmission is only known after the transmission (and its repetitions) has been completed, the reward associated with a decision is observed several steps after the decision is made.
Moreover, the delay of the reward is variable, since it depends on the selected parameters (e.g. $I_{delay}$ and $N_{rep}$). The reward at step $t$ is defined as \begin{equation*} R_{t} = -d_{t} - (\mathbb {I}^{(index)}_{t} \lor \mathbb {I}^{(unfit)}_{t})\cdot \text {max}\{d_{1},\ldots,d_{t}\} \tag{2}\end{equation*} where $d_{t}$ denotes the transmission delay observed at step $t$, $\mathbb {I}^{(unfit)}_{t}$ indicates that the scheduled NPUSCH does not fit into the uplink carrier, and $\mathbb {I}^{(index)}_{t}$ indicates an invalid device selection.
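A direct translation of (2) into code, with hypothetical variable names for the delay and the two indicators, is:

def single_agent_reward(delay: float, bad_index: bool, unfit: bool, max_delay: float) -> float:
    """Reward of (2): the delay penalty plus the worst observed delay when a decision is invalid."""
    return -delay - (max_delay if (bad_index or unfit) else 0.0)

# Example: a 120 ms delay with an unfit NPUSCH, when the worst delay seen so far is 900 ms.
print(single_agent_reward(120.0, bad_index=False, unfit=True, max_delay=900.0))  # -1020.0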
B. Multi-Agent Reinforcement Learning
In environments involving large action and observation spaces, the learning rate of an RL agent is, in general, too slow for online operation. To accelerate the learning rate, we propose the use of two agents that learn in a coordinated and cooperative way to solve the uplink scheduling problem. This approach, known as Multi-Agent Reinforcement Learning (MARL), poses three main challenges: 1) dividing the main task into subtasks to be distributed among the agents, 2) defining the rewards received by each agent so that when each agent maximizes its particular objective, the global objective is also maximized, and 3) addressing the non-stationarity issue that characterizes multi-agent settings.
An RL agent is assumed to operate in a stationary environment, which means that the response of the environment (the next state and the reward) for each state-action pair remains statistically invariant at any time step. This property allows the RL algorithm to converge, but is not present, in general, when several agents learn concurrently in the same environment. The reason is that, for each agent, the response of the environment does not depend only on the action chosen by the agent, but on the aggregate actions of all the agents. Agents modify their policies while learning, which implies that their mappings from states to actions will change over time. Thus, from the perspective of each agent, the environment becomes non-stationary because its response changes as the rest of the agents modify their policies.
1) Task Division
In our proposed MARL scheme we divide the main task into two sub-tasks: device selection and link adaptation. Decisions are made in sequence: first, the device-selection (DS) agent receives the observation and selects the device to be scheduled; then, the link-adaptation (LA) agent selects the link-adaptation parameters and the corresponding uplink resources for the NPUSCH transmission of that device.
2) Rewards
In MARL, each agent can receive a different reward signal provided each signal is oriented towards the same general objective. We adopt this strategy to mitigate the problem of delayed rewards discussed in the previous subsection. The DS agent receives a modified version of the reward defined in (2):\begin{equation*} R^{(1)}_{t} = -d_{t} - \mathbb {I}^{(index)}_{t} \text {max}\{d_{1},\ldots,d_{t}\} \tag{3}\end{equation*}
The LA agent receives the following reward:\begin{equation*} R^{(2)}_{t} = n^{(rx)}_{t} - n^{(error)}_{t} - c\, \mathbb {I}^{(unfit)}_{t} \tag{4}\end{equation*} where $n^{(rx)}_{t}$ and $n^{(error)}_{t}$ denote the successful and erroneous NPUSCH receptions observed at step $t$, and $c$ is a constant penalizing the scheduling of an NPUSCH that does not fit into the uplink carrier.
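Both reward signals can be computed directly from the observed quantities, as in this sketch (variable names are hypothetical; the value of $c$ and the example inputs are arbitrary):

def ds_reward(delay: float, bad_index: bool, max_delay: float) -> float:
    """DS-agent reward of (3)."""
    return -delay - (max_delay if bad_index else 0.0)

def la_reward(n_rx: int, n_error: int, unfit: bool, c: float = 5.0) -> float:
    """LA-agent reward of (4); c = 5.0 is an arbitrary example value for the unfit penalty."""
    return n_rx - n_error - (c if unfit else 0.0)

print(ds_reward(120.0, bad_index=False, max_delay=900.0))  # -120.0
print(la_reward(n_rx=1, n_error=0, unfit=False))           # 1.0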
3) Coordination Scheme
To overcome the problem of non-stationarity under concurrent learning of the agents, we propose a coordination scheme based on the principle of curriculum learning [30] under which agents learn to complete one sub-task before proceeding to learn another sub-task. The idea is to accumulate knowledge gradually by incorporating agents one by one. In our case, the LA agent learns first and, once its performance is considered acceptable, this agent switches to inference operation (i.e. it ceases to update its policy), and the learning of the DS agent starts. With this strategy, the learning of an agent does not interfere with the learning of the other agent. During the learning phase of the LA agent, the DS decisions are random.
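This curriculum-style coordination could be organized as a two-phase training loop; the sketch below uses hypothetical environment and agent interfaces purely for illustration.

# Sketch of the two-phase curriculum described above (all interfaces are hypothetical).
def run_learning_episode(env, la_agent, ds_agent, la_steps: int, ds_steps: int):
    # Phase 1: the LA agent learns while device selection is random.
    for _ in range(la_steps):
        obs = env.observe()
        device = env.random_device()           # random DS decisions during phase 1
        action = la_agent.act_and_learn(obs, device)
        env.apply(device, action)
    la_agent.freeze()                          # switch the LA agent to inference mode
    # Phase 2: the DS agent learns against the fixed LA policy.
    for _ in range(ds_steps):
        obs = env.observe()
        device = ds_agent.act_and_learn(obs)
        action = la_agent.act(obs, device)     # LA policy is no longer updated
        env.apply(device, action)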
Online Model-Based Approach
Although the RL approach is capable of finding effective control policies, it is not fully suitable for online learning because its low learning rate causes excessively high transmission delays during the initial period of operation, as we show in Section V. The MARL strategy mitigates this problem, but its learning rate is still insufficient for online learning. In this Section we present a model-based proposal for the LA agent that increases the learning rate drastically. We refer to our multi-agent model-based approach as MAMBRL.
A. Elements of the Model
1) Classifiers
In model-based RL the agent learns a model of the system instead of a policy or a value function. The design of our model is aimed at maximizing the contribution of each action to the reward defined in (4). Particularly, at each step the agent should select an action that is predicted to result in a correct reception, while avoiding actions whose NPUSCH would not fit into the uplink carrier.
2) Feedback Scheme
Instead of using a reward signal, this agent receives the indicator signals associated with each NPUSCH transmission, reporting whether the scheduled NPUSCH fitted into the uplink carrier and whether the transport block was correctly decoded.
3) Buffer Memory
Recall that a variable number of subframes elapse between the transmission of a DCI (which contains the selected link-adaptation parameters and resource allocation) and the end of the corresponding NPUSCH. The agent therefore stores the information associated with each decision in a buffer memory until the corresponding indicator signals are received.
B. Online Learning Strategy
An online learning algorithm for binary classification aims at learning a mapping from input feature vectors to binary labels, updating its hypothesis function each time a new labeled sample is observed.
Our proposal uses two hypothesis functions, $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$, which respectively predict whether a candidate NPUSCH would not fit into the uplink carrier and whether it would be correctly received.
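A minimal sketch of such an online classifier, here instantiated with the HoeffdingTreeClassifier of the river library (chosen for illustration; the feature names are hypothetical placeholders for the feature vectors defined in Section IV-C):

# Minimal sketch of an online binary classifier in the role of h_rx, using the river library.
from river import tree

h_rx = tree.HoeffdingTreeClassifier()

def predict_rx(pathloss_db: float, backlog_bits: int, action_id: int) -> bool:
    x = {"pathloss": pathloss_db, "backlog": backlog_bits, "action": action_id}
    return bool(h_rx.predict_one(x))  # returns None before the first update; bool(None) -> False

def update_rx(pathloss_db: float, backlog_bits: int, action_id: int, received: bool) -> None:
    x = {"pathloss": pathloss_db, "backlog": backlog_bits, "action": action_id}
    h_rx.learn_one(x, received)       # single-sample (online) update

update_rx(144.0, 256, 3, received=True)
print(predict_rx(144.0, 256, 3))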
The overall operation of the link-adaptation agent is detailed in Algorithm 1, where $t$ denotes the decision step index.
Algorithm 1 Link-Adaptation Agent
Inputs: online classifiers $h^{(unfit)}$ and $h^{(rx)}$ (initially untrained), empty buffer memory
Outputs: updated classifiers $h^{(unfit)}$ and $h^{(rx)}$
for each decision step $t$ do
Receive the observation for the device selected by the DS agent
Apply action $a_{t}$ = SelectAction(observation, $h^{(unfit)}$, $h^{(rx)}$) (Algorithm 2)
Store the observation and $a_{t}$ in the buffer memory
Receive the indicator signals of the NPUSCH transmissions completed at step $t$
for each received indicator signal do
Retrieve the associated entry from the buffer memory and update the classifiers with UpdateU (Algorithm 3) and UpdateR (Algorithm 4)
end for
end for
Diagram showing the LA agent, its internal elements and signals, and its interaction with the DS agent and the controlled system.
C. Action Selection
Let us define $C_{t}$ as the occupation state of the upcoming subframes of the uplink carrier, $L^{(i)}_{t}$ and $B^{(i)}_{t}$ as the estimated pathloss and the backlogged data of the selected device $i$, and $\{a_{1},\ldots,a_{K}\}$ as the set of candidate link-adaptation actions. For each candidate action $a_{k}$, the agent evaluates the following conditions:
$h^{(unfit)}_{t}(\mathbf {x}^{(unfit)}_{t})=1$, with $\mathbf {x}^{(unfit)}_{t} = (C_{t}, a_{k})$: the agent estimates that the NPUSCH does not fit into the uplink carrier.
$h^{(rx)}_{t}(\mathbf {x}^{(rx)}_{t})=1$, with $\mathbf {x}^{(rx)}_{t} = (L^{(i)}_{t}, B^{(i)}_{t}, a_{k})$: the agent predicts that the transmission is likely to be correctly received.
When condition 1 holds for a candidate action, that action is discarded; otherwise, if condition 2 also holds, the action is selected for the NPUSCH transmission. If no candidate satisfies condition 2, the agent selects an exploratory action among the candidates not discarded.
Recall that the information associated to the selected device includes a retransmission indicator; when a retransmission is required, the agent reuses the link-adaptation parameters of the previous attempt, as required by the HARQ process described in Section II.
The action selection algorithm is detailed in Algorithm 2. Note that, in addition to the explicit exploratory actions, the prediction errors made by $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$ also produce informative feedback samples, which act as a form of implicit exploration.
Algorithm 2 SelectAction
Inputs: $C_{t}$, $L^{(i)}_{t}$, $B^{(i)}_{t}$, retransmission indicator, classifiers $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$, candidate actions $\{a_{1},\ldots,a_{K}\}$
Outputs: selected action $a_{t}$
Extract $C_{t}$, $L^{(i)}_{t}$ and $B^{(i)}_{t}$ from the observation
if the selected device requires a retransmission then
return the link-adaptation parameters of the previous attempt
end if
for $k = 1,\ldots,K$ do
if $h^{(unfit)}_{t}(C_{t}, a_{k})=1$ then
continue with the next candidate ($a_{k}$ is estimated not to fit)
end if
if $h^{(rx)}_{t}(L^{(i)}_{t}, B^{(i)}_{t}, a_{k})=1$ then
return $a_{k}$
end if
end for
return an exploratory action chosen among the candidates not discarded
D. Model Update with Data Augmentation
Our models $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$ are updated online each time the indicator signals of a transmission are received. To increase the sample efficiency of the updates, each observed sample is augmented with additional samples whose labels can be deduced from the observed outcome.
The $h^{(unfit)}_{t}$ classifier is updated by the UpdateU procedure detailed in Algorithm 3.
Algorithm 3 UpdateU
Inputs: classifier $h^{(unfit)}$, stored entry $(C_{t}, a_{t})$, indicator $\mathbb {I}^{(unfit)}_{t}$
Outputs: updated classifier $h^{(unfit)}$
Extract $C_{t}$ and $a_{t}$ from the stored entry
Extract the indicator $\mathbb {I}^{(unfit)}_{t}$
if $\mathbb {I}^{(unfit)}_{t} = 1$ then
for each candidate action $a_{k}$ requiring at least as many uplink resources as $a_{t}$ do
update $h^{(unfit)}$ with the sample $((C_{t}, a_{k}), 1)$
end for
else
for each candidate action $a_{k}$ requiring at most as many uplink resources as $a_{t}$ do
update $h^{(unfit)}$ with the sample $((C_{t}, a_{k}), 0)$
end for
end if
return $h^{(unfit)}$
Similarly, to increase the learning efficiency for the update of $h^{(rx)}_{t}$, the UpdateR procedure (Algorithm 4) generates several training samples from each observed reception outcome.
Algorithm 4 UpdateR
Inputs: classifier $h^{(rx)}$, stored entry $(L^{(i)}_{t}, B^{(i)}_{t}, a_{t})$, reception indicator
Outputs: updated classifier $h^{(rx)}$
Extract $L^{(i)}_{t}$ and $B^{(i)}_{t}$ from the stored entry
Extract the applied action $a_{t}$
Extract the reception indicator (1 if the transport block was correctly decoded, 0 otherwise)
if the transport block was correctly decoded then
for each candidate action $a_{k}$ at least as robust as $a_{t}$ (lower $I_{MCS}$ and/or higher $N_{rep}$) do
update $h^{(rx)}$ with the sample $((L^{(i)}_{t}, B^{(i)}_{t}, a_{k}), 1)$
end for
else
for each candidate action $a_{k}$ at most as robust as $a_{t}$ do
update $h^{(rx)}$ with the sample $((L^{(i)}_{t}, B^{(i)}_{t}, a_{k}), 0)$
end for
end if
return $h^{(rx)}$
These two algorithms could implement an even more extensive data augmentation strategy: for example, UpdateU could generate additional samples for occupation states other than the observed $C_{t}$, and UpdateR for other combinations of pathloss and backlogged data.
It is worth mentioning that the only configurable parameters in our proposal are those of the online classifiers $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$.
Evaluation
A. Methodology
The proposal has been evaluated empirically using a simulator of the uplink transmission functionalities of NB-IoT. We will first describe the essential aspects of the simulator and then we will present the baselines and the evaluation experiments.
1) Simulation Environment
The simulation environment, developed in Python, comprises a population of devices trying to access the system and transmit their data packets, a base station that controls the access and schedules transmission opportunities for the devices, one or more carriers where the resources for NPRACH and NPUSCH are assigned, and the corresponding channel models for NPRACH and NPUSCH transmissions. Devices are idle until they become active according to a probabilistic traffic model. In particular, we use the uniform mMTC traffic model defined in 3GPP TR 37.868 [34], which considers time periods of fixed duration during which each device becomes active at an instant drawn from a uniform distribution.
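A sketch of this activation model is shown below (hypothetical helper; the 60 s period is the value usually associated with the uniform model of TR 37.868 and is an assumption here):

import numpy as np

def activation_times(n_devices: int, period_s: float = 60.0, rng=None) -> np.ndarray:
    """Each device becomes active once, at an instant uniformly distributed over the period."""
    rng = rng or np.random.default_rng()
    return np.sort(rng.uniform(0.0, period_s, size=n_devices))

print(activation_times(5, rng=np.random.default_rng(0)))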
An active device goes through several stages. First, it must accomplish the access procedure. After becoming active, the device waits for the start of the next NPRACH corresponding to its CE level. When the NPRACH starts, the device transmits a randomly chosen preamble sequence over that NPRACH. If no other device has selected the same preamble, the transmitted preamble can be detected if it arrives with sufficient power. If the same preamble is selected by several devices, it is received with a lower signal quality but it can sometimes be detected thanks to the capture effect. The contention is resolved during the RAR window that starts after the NPRACH. When the base station decodes a non-collided preamble, it will send a signaling message (msg2) to the device during the RAR window, granting the device an opportunity to send its connection request and a buffer status report (msg3). This is followed by a signaling exchange which, if successful, ends the access procedure and sets the device to connected state. When the preamble is not detected (because of collision or low signal quality), the device does not receive any message within the RAR window, and starts a backoff period before a new access attempt. If the number of access attempts reaches a certain configurable value, the device will move to a higher CE level.
When the base station manages to decode a collided preamble, several devices will receive the same msg2, resulting in several colliding msg3 responses. If the base station is able to decode one of these msg3 responses, the corresponding device will become connected. If a collided or non-collided msg3 is not decoded, the base station can schedule a retransmission during the RAR window. When a device sends an msg3, it starts a contention timer. If this timer expires before the reception of the msg4 response, the device will start a new access attempt.
Once connected, a device waits for an uplink grant indicated by a DCI. After receiving a DCI, the device will transmit its data packet over the allocated NPUSCH with the indicated LA parameters. Then, the device can receive either an ACK, if the packet was decoded, or another DCI with an uplink grant for a retransmission, if the packet was not decoded. The connection ends upon an ACK reception.
The environment uses the propagation conditions and the antenna pattern described in sections IV-B and 4.5 of 3GPP TR 36.942 [35], respectively. It implements a block fading model where a channel realization is constant over each (NPRACH or NPUSCH) transmission, and changes independently from one transmission to another according to a lognormal shadow fading model.
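A sketch of the per-transmission shadowing draw is given below (the 8 dB standard deviation is an assumed placeholder, not necessarily the value used in our experiments):

import numpy as np

def shadowing_db(n_transmissions: int, sigma_db: float = 8.0, rng=None) -> np.ndarray:
    """One independent lognormal (i.e. Gaussian in dB) shadowing realization per transmission."""
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, sigma_db, size=n_transmissions)

print(shadowing_db(3, rng=np.random.default_rng(1)))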
The configurable parameters of each NPRACH are the periodicity, the number of (3.75 kHz spaced) subcarriers, and the number of preamble repetitions. Other configurable parameters are: the NPDCCH periodicity, the number of consecutive NPDCCH subframes, the backoff time, the duration of a RAR window, the contention timer and the maximum number of access attempts. Table 1 summarizes the environment configuration.
Although our proposed model-based scheme is essentially parameter-free, the Hoeffding Tree (HT) algorithm selected for the classifiers $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$ has its own configuration parameters, which did not require fine tuning in our experiments.
2) Baselines
To evaluate our proposal we compared it with a diverse set of baselines based on three alternative architectures: 1) single-agent RL, 2) multi-agent RL (MARL, presented in Section III-B), and 3) multi-agent control using the NBLA algorithm proposed in [12] for link-adaptation, and RL for user selection. The latter approach is referred to as NBLA+RL. Each approach is evaluated using two strategies for determining which connected devices are included in the agent observation: prioritizing the devices with the longest connection times, and prioritizing the devices with the highest estimated pathloss.
Note that all the control schemes incorporate either one or two RL agents. For these agents, we have considered the following state-of-the-art deep RL algorithms:
Deep Q-Networks (DQN) [38] is the deep learning version of Q-learning, a classical model-free off-policy RL algorithm. Q-learning and DQN have been used in NB-IoT environments for the configuration of NPRACH parameters in [20] and for dynamically delaying the access attempts [21] to avoid congestion in the NPRACH.
Proximal policy optimization (PPO) [39] is a model-free deep policy gradient algorithm that updates policies while preventing the new policy from diverging too much from the previous one, in order to avoid unstable behavior during the learning process.
Synchronous advantage actor critic (A2C) [40] is an on-policy deep actor-critic algorithm.
The experiments use the RL implementations provided by Stable Baselines [41], which is an improved version of the OpenAI Baselines [42]. Two additional RL algorithms were evaluated: Sample Efficient Actor-Critic with Experience Replay (ACER) and Actor Critic using Kronecker-Factored Trust Region (ACKTR), but they performed clearly worse than the above ones and have not been included in the results. In summary, each baseline is determined by an architecture, a device prioritization strategy, and an RL algorithm.
3) Evaluation Experiments
Each control scheme (either a baseline or our proposal) has been evaluated in 30 independent simulation experiments. Each simulation experiment consists of an online learning episode in which the controller starts with no prior knowledge and learns over time. During each simulation run, we sampled the delay of each completed transmission, allowing us to represent how the average transmission delay changes during a learning episode. As explained in Section III-B, a learning episode for a multi-agent scheme comprises two phases: 1) link-adaptation (LA) learning phase, and 2) device selection (DS) learning phase. We have adjusted the duration, in decision steps, of the learning episodes (and their respective phases) to the learning efficiency of each scheme, resulting in the following values:
Single agent RL: 200000.
MARL: 20000 (LA phase) + 20000 (DS phase).
NBLA+RL: 50000 (LA phase) + 20000 (DS phase).
MAMBRL: 10000 (LA phase) + 20000 (DS phase).
B. Numerical Results
1) Prioritizing Devices with Longer Connection Times
We first discuss the results corresponding to the first device prioritization strategy, consisting of selecting, among the connected devices waiting for an uplink grant, those with the longest connection times.
Figure 4 shows the delay performance during the MARL learning episodes. The vertical dotted lines represent the boundaries between the LA and the DS learning phases. Compared to the single-agent case, the MARL architecture reduces the initial transmission delay faster, and keeps it low during the DS phase, except in the case of DQN, whose delay surprisingly increases in this phase.
Figure 5 shows the transmission delay during the NBLA+RL learning episodes. Compared with the previous baselines, the initial delays are considerably smaller and, once the DS phase is initiated, these delays generally remain between 200 and 400 ms. There is no noticeable difference between the RL algorithms used during the DS learning phase.
Figure 6 shows the results of our proposal, which is capable of keeping the delay close to 100 ms even at the early stages of the learning episode. As with NBLA, the performance during the DS learning phase is similar for the three RL algorithms. Overall, MAMBRL clearly outperforms the previous baselines in terms of the quality of service experienced by the devices during online learning. One of the reasons for this performance is the remarkable sample efficiency of the two online classifiers ($h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$).
In the learning episodes shown in the figures above, we observe that, after a sufficient number of NPUSCH transmissions, all the evaluated control schemes reach a steady state in which the transmission delay remains close to a level that can no longer be improved. Figure 8 shows, for each control mechanism, the distribution of the delay samples during the learning phase (the initial 10000 transmissions) and during the steady state (beyond 20000 accomplished transmissions). For this figure, we have selected the best performing RL algorithm for each architecture. Two conclusions can be drawn from this figure: first, the task division used by the multi-agent architecture does not reduce the long-term performance compared to the single-agent RL strategy. Second, in the long term, both model-based and model-free approaches perform similarly; thus, the advantage of the model-based strategy lies in its superior sample efficiency, which makes MAMBRL especially effective for online learning.
Comparison of the control algorithms during the learning phase (initial 10000 transmissions) and in steady state operation (beyond 20000 accomplished transmissions).
2) Prioritizing Devices with Higher Pathloss
Finally, we assess the performance when the observation contains the connected devices with the highest estimated pathloss. The corresponding results for the single-agent RL and MARL baselines are shown in the two figures below.
Transmission delay of the single-agent RL baselines (prioritizing devices with higher pathloss).
Transmission delay of the MARL baselines (prioritizing devices with higher pathloss).
Figure 11 shows that the delay performance of the NBLA+RL scheme degrades when using the pathloss-based device prioritization scheme, resulting in transmission delays ranging between 200 ms and 800 ms, independently of the RL algorithm used in the DS learning phase. In contrast, our MAMBRL proposal does not experience any noticeable degradation when adopting this device prioritization scheme, as shown in Figure 12. This suggests that, with an effective controller, the number of devices waiting for an uplink grant remains so small that the criterion used to include them in the observation has little impact on the resulting performance.
Transmission delay of the NBLA+RL baseline (prioritizing devices with higher pathloss).
Transmission delay of the MAMBRL proposal (prioritizing devices with higher pathloss).
Finally, Figure 13 shows the delay distributions of the control schemes in the two phases of operation. During the steady state we see that, while RL and NBLA+RL perform slightly worse than with the previous device prioritization criterion, MARL and MAMBRL still maintain their transmission delays close to 100 ms.
Comparison of the control algorithms during the learning phase (initial 10000 transmissions) and in steady state operation (beyond 20000 accomplished transmissions).
In order to verify these results for different traffic intensities, we have repeated the above experiments (for both device prioritization strategies) for a number of devices ranging from 1000 to 2500. As expected, the proposal showed a consistent performance, maintaining the lowest delay during the learning phase under any traffic intensity, and matching the performance of MARL in steady state.
Conclusion and Future Work
In this paper we have addressed the scheduling of uplink transmissions in NB-IoT, including the selection of link-adaptation parameters, the allocation of time-frequency resources in the uplink carrier, and device selection. Our goal was to develop a control mechanism capable of learning autonomously in an operating network without any prior knowledge of the system (online learning), while preventing harmful degradation of the quality of service experienced by the devices during the learning process. The proposed mechanism is based on two principles: 1) a multi-agent architecture, in which two agents coordinate to sequentially learn the link-adaptation and device selection policies; and 2) a model-based approach for the link-adaptation agent, especially tailored for high sample efficiency. The proposed structure for this agent, based on two online classifiers, has no precedent in the related literature. The proposal does not introduce configurable parameters other than those of the classifiers, which in our experiments did not require fine tuning. Our experiments have shown that the multi-agent architecture is able to improve the performance of conventional RL algorithms, but it is not able to avoid a steep increase of the transmission delay during the initial stages of learning. In contrast, when our model-based agent is used, the transmission delay does not experience any noticeable increase during learning. Note that, with state-of-the-art deep RL agents such as PPO or A2C, the delay can rise to more than 150 times that of our proposal.
One limitation of our model-based approach is its specialization. In our case, the model-based agent is specialized in link-adaptation, and it is not straightforward to generalize it to control tasks not covered by its model, such as the configuration of NPRACH parameters. To this end, new agents (either model-free or model-based) could be integrated in the multi-agent architecture. We believe that this approach, supported by the promising results of our proposal, opens two new lines of research in the area of NB-IoT control (and networking environments in general): investigating the potential of the multi-agent architecture, and designing new model-based agents.