
Transmission Control in NB-IoT With Model-Based Reinforcement Learning



Abstract:

In Narrowband Internet of Things (NB-IoT), the control of uplink transmissions is a complex task involving device scheduling, resource allocation in the carrier, and the configuration of link-adaptation parameters. Existing heuristic proposals partially address the problem, but reinforcement learning (RL) seems to be the most effective approach a priori, given its success in similar control problems. However, the low sample efficiency of conventional (model-free) RL algorithms is an important limitation for their deployment in real systems. During their initial learning stages, RL agents need to explore the policy space selecting actions that are, in general, highly ineffective. In an NB-IoT access network this implies a disproportionate increase in transmission delays. In this paper, we make two contributions to enable the adoption of RL in NB-IoT: first, we present a multi-agent architecture based on the principle of task division. Second, we propose a new model-based RL algorithm for link adaptation characterized by its high sample efficiency. The combination of these two strategies results in an algorithm that, during the learning phase, is able to maintain the transmission delay in the order of hundreds of milliseconds, whereas model-free RL algorithms cause delays of up to several seconds. This allows our approach to be deployed, without prior training, in an operating NB-IoT network and learn to control it efficiently without degrading its performance.
Published in: IEEE Access ( Volume: 11)
Page(s): 57991 - 58005
Date of Publication: 09 June 2023
Electronic ISSN: 2169-3536



SECTION I.

Introduction

Narrowband Internet of Things (NB-IoT) is a radio access technology specified by the Third Generation Partnership Project (3GPP) to provide efficient connectivity to a massive number of low-complexity devices, and is considered one of the key technologies adopted by 5G and beyond networks for machine-type communications, with applications in smart metering, logistics, tracking, and smart cities, among others [1]. Its radio interface, which is derived from the Long Term Evolution (LTE) standard, requires a minimum bandwidth of 180 kHz in both downlink and uplink. Its design promotes coexistence with legacy networks, enabling several deployment options: within an LTE carrier, occupying one physical resource block; within a guard band of an LTE carrier; or standalone, for example utilizing a GSM carrier [2]. The NB-IoT physical layer, specified in [3], [4], and [5], uses repetition and signal combining techniques to enhance the coverage of low-power devices, sometimes located in unfavorable (e.g. underground) locations. See [6] for a detailed survey. To provide equal access opportunities to devices with different path loss conditions, NB-IoT defines three coverage enhancement (CE) configurations, each with its own settings.

One of the objectives of NB-IoT is to support transmissions of small data messages from a massive number of devices with relaxed delay requirements [7]. In the uplink carrier, data transmissions are performed over the narrowband physical uplink shared channel (NPUSCH). Connected devices must wait until the base station assigns them a transmission opportunity or uplink grant. The uplink grant is provided in a message called downlink control information (DCI), which also specifies the parameters of the NPUSCH transmission: the link-adaptation parameters (the number of transmitted information bits or transport block size (TBS), the modulation and coding scheme (MCS) and the number of repetitions), as well as the specific time-frequency resources allocated to the NPUSCH.1 The time domain of the uplink carrier is divided into subframes of 1 ms, and its bandwidth is divided into subcarriers spaced by either 15 kHz or 3.75 kHz. The minimum amount of time-frequency resources that can be allocated is called a resource unit (RU). For 15 kHz subcarrier spacing, an RU can be configured in four possible shapes with 1, 3, 6, and 12 subcarriers, spanning 8, 4, 2, and 1 subframes, respectively. The link-adaptation parameters determine the number of resource units that must be allocated for a transmission in the uplink carrier. For example, the number of RUs needed by a 256-bit TBS may range between 2 and 10 depending on the selected MCS [8].

This paper focuses on the control of NPUSCH transmissions, which is a challenging task that involves selecting which devices are allowed to transmit (scheduling), their link-adaptation parameters, and the resources allocated to each NPUSCH transmission. Figure 1 shows an example of how NPUSCHs are dynamically arranged in the uplink carrier of an NB-IoT access network with three CE levels labeled CE0, CE1 and CE2. The NPUSCHs must not overlap the resources reserved for the access opportunities of each CE level, i.e. the narrowband physical random access channels (NPRACHs), which appear periodically in the uplink carrier. Each NPUSCH is configured by a DCI which occupies either 1, 2 or 4 subframes of the narrowband physical downlink control channel (NPDCCH), depending on the CE level of the target device. Because of the hardware limitations of NB-IoT devices, there must be a minimum delay of 8 ms from the reception of the DCI to the start of the NPUSCH. The other possible delays (which must be specified by the DCI) are 16, 32 and 64 ms.

FIGURE 1. Diagram illustrating the allocation of time-frequency resources for NPUSCH transmissions in the uplink carrier. An NPUSCH can occupy 1, 3, 6 or 12 subcarriers, and can span multiple subframes depending on the Transport Block Size (TBS), the Modulation and Coding Scheme (MCS) and the repetition number ($N_{rep}$ ), but cannot overlap with NPRACH resources. The base station specifies the parameters of each NPUSCH in the Downlink Control Information (DCI) messages transmitted over the NPDCCH subframes of the downlink carrier.

NPUSCH transmission control can be seen as a more challenging version of the Tetris game, which is considered a demanding benchmark for decision algorithms, specifically reinforcement learning (RL) algorithms [9]. An NPUSCH is like a Tetris piece whose size, in RUs, depends on the amount of data backlogged in the device and the selected link-adaptation parameters. The NPUSCH shape is rectangular, but its width in subframes depends on the assigned subcarriers. There are many more combinations of link-adaptation parameters than the four possible rotations of a Tetris piece, and they determine not only the occupied resources in the uplink carrier, but also the reception probability of the NPUSCH, which is relevant since each unsuccessful transmission implies a future retransmission. The objective, similar to Tetris, is to “fit” the NPUSCHs in the uplink carrier among the NPRACHs and other (data and control) uplink transmissions, assuring that the data is received with an acceptable delay. In this paper, we propose an RL strategy capable of learning how to control NPUSCH transmissions in a plug-and-play fashion (i.e. once deployed in the network) without causing any perceptible increase in the transmission delay during the learning process. This allows the controller to be re-trained as often as necessary to adapt to changes in network configuration, devices, firmware, etc.

A. Related Works and Contributions

Previous works have addressed different parts of the problem of uplink transmission control in NB-IoT. For example, the problem of subcarrier allocation for NPUSCH is considered in [10] and [11], in combination with transmission delay [10] or power adjustment [11]. The automatic configuration of link-adaptation parameters has been studied in [12], which proposed an adaptive heuristic algorithm known as NBLA, and in [13], which used a multi-armed bandit approach. Device selection for uplink transmission was addressed in [14], which presented a mechanism to assign priorities to different types of devices based on their energy efficiency objectives and latency demands, but this mechanism does not consider resource allocation in the uplink carrier (all messages are assumed to occupy a single RU) or link-adaptation parameters. Other approaches for link-adaptation in NB-IoT, like [15], rely on mathematical models of the transmission error rate. In contrast, instead of a theoretical model, which in general is a simplified representation of the response of the real system, we propose a mechanism capable of learning the system’s response from observations obtained from the network itself in operation. Two aspects of the problem that have been considered only by recent works are the allocation of NPUSCH time-frequency resources in the uplink carrier avoiding overlap with NPRACH resources [16], and NPDCCH signalling constraints [17]. These works considered a given number of UEs (deterministic scenario), while we consider a stochastic scenario where the devices randomly generate transmission attempts for messages of random length. NPDCCH configuration and signalling messages were also considered in [18] using a fixed link-adaptation configuration for each CE level, whereas our control approach adjusts the link-adaptation parameters to the measured path loss of the device and the message size. Downlink scheduling, addressed in [19], although related to uplink scheduling, differs in several aspects such as the structure of the frame, the NPDCCH configuration, and the reduced number of options for link-adaptation and subcarrier selection.

Our approach considers a scenario comprising all aspects involved in NPUSCH control decisions: link-adaptation, scheduling, transmission delays, NPRACH awareness, random transmission attempts with variable data demands, retransmissions, and NPDCCH signalling constraints, among others. To address the problem, we propose the use of reinforcement learning algorithms which, despite their current popularity, do not seem to have been sufficiently explored for NB-IoT transmission control. In contrast, there have been attempts to apply RL to the configuration of random access parameters, in particular for the allocation of NPRACH resources of each CE level [20], and for dynamically delaying access attempts to avoid congestion in the NPRACH [21]. These NPRACH configuration mechanisms, or the heuristic proposed in [22], could seamlessly coexist with our transmission control scheme.

The use of RL is just the starting point for our proposal. As pointed out in [20], RL algorithms generally need to be trained offline in a simulator before deployment and, once deployed, real network conditions can be very different from the simulation environment, rendering control policies ineffective. One possible solution is to prepare the algorithms to self-tune after deployment, but this approach raises questions about how long this tuning might take and how it will affect network performance. Our goal is to propose a mechanism capable of learning autonomously once deployed in the network (online learning), a task that state-of-the-art RL algorithms are not able to accomplish without deteriorating the transmission delay during their initial stages of learning, as verified by our numerical results. The reason is the low sample efficiency of model-free RL algorithms. They need to explore multiple policies before converging to an efficient one, which implies selecting very ineffective actions during long periods. To overcome this limitation, we propose two complementary strategies: a multi-agent architecture in which two agents learn in a coordinated way, and a model-based RL (MBRL) approach, leveraging its higher sample efficiency [23], [24] with respect to model-free RL.

This work is framed within a broader trend focused on developing online learning algorithms to control network functions. Previous works in this line have addressed resource allocation for network slices [25], interference coordination in LTE [26], [27], and energy saving for small cells [28], using specific ad hoc mechanisms based on multi-armed bandits [26], sequential likelihood ratio tests [27], Bayesian models [28], and kernel-based methods [25]. This work is the first one focused on NB-IoT transmission control, and contains the following contributions:

  • We present a new approach to the design of an NB-IoT transmission control scheme, based on online learning with minimum impact on the network performance.

  • We evaluate the use of state-of-the-art RL algorithms for this task and highlight their limitations.

  • We propose a multi-agent architecture in which the control task is divided into two sub-tasks (link-adaptation and device selection) that are assigned to two agents learning in a coordinated way.

  • We develop a novel model-based RL agent for link-adaptation combining online learning techniques with sample augmentation strategies. The high sample efficiency attained by our agent allows it to learn without a noticeable degradation of the transmission delay.

The rest of the paper is organized as follows: Section II describes the system and explains the transmission control problem. Section III details how RL can be applied to the problem with a single agent and with a multi-agent configuration. Section IV details our novel model-based proposal and Section V explains our evaluation methodology, describes the baselines with which we compare our proposal, and discusses the numerical results. Finally, Section VI summarizes our findings.

SECTION II.

System Description and Problem Statement

A. System Description

The system under study consists of a base station serving several thousand NB-IoT devices. These devices must complete a random access procedure, making use of the NPRACH resources, and a connection procedure; once connected, they must wait for the base station to send them an uplink grant to transmit data on an NPUSCH. We consider one carrier for each transmission direction (uplink and downlink) with a bandwidth of 180 kHz and the time dimension divided into frames comprising 10 subframes of 1 ms.

1) Random Access Over the NPRACH

The uplink carrier contains pre-allocated resources for the NPRACHs of each CE level. An NPRACH can use 12, 24, 36 or 48 contiguous subcarriers of 3.75 kHz. The duration of an NPRACH depends on the number of repetitions of the preamble sequence: 1, 2, 4, 8, 16, 32, 64, or 128. The duration of a single preamble sequence is 5.6 ms (format 0). The NPRACH of each CE level appears periodically in the carrier every 40, 80, 160, 320, 640, 1280, 2560, or 5120 ms. For each CE level, the number of subcarriers and the periodicity of the NPRACH are typically determined by the data traffic profile of the devices at this CE level, and the number of preamble repetitions depends on the pathloss of these devices.

To establish a connection with the network, a device must transmit a random access preamble on the NPRACH associated to its CE level. The preambles are known sequences of symbols allowing the base station to detect devices attempting to access the network. To reduce the probability of collision between the preambles of different UEs in the same NPRACH, the symbols are divided into four groups, each one transmitted on a different subcarrier (frequency hopping). The device selects an initial frequency which is associated to a deterministic frequency hopping sequence, such that devices selecting a different initial frequency do not collide. The number of subcarriers of an NPRACH determines the number of frequency hopping sequences available to the devices. The more subcarriers available, the more deterministic hopping sequences can be used, increasing the number of available resources and hence reducing the probability of collisions.

After detecting the preambles, the base station sends a Random Access Response (RAR) message to the devices that have transmitted preambles. The RAR message includes a Temporary C-RNTI (Cell Radio Network Temporary Identifier) that is unique to each device, and contains uplink grants providing the UE uplink resources to transmit its connection request message.

The carrier bandwidth not reserved for NPRACH is divided into 12 subcarriers of 15 kHz. The base station must allocate time-frequency resources of the uplink carrier for two types of transmissions: i) the UE connection requests mentioned above, and ii) the NPUSCH transmissions containing UE data. This allocation must avoid any overlap with NPRACH resources.

2) Signaling Over the NPDCCH

The base station transmits control messages to the devices through the NPDCCH on the downlink carrier. The uplink data grants are provided in specific control messages known as Downlink Control Information (DCI) Format N0 (other formats correspond to different purposes) transmitted over the NPDCCH. DCI messages are repeated a number of times depending on the pathloss of the destination device. We consider that CE2, CE1 and CE0 devices require 4, 2, and 1 repetitions respectively. We consider an NPDCCH lasting 4 consecutive subframes and with a periodicity of 10 subframes. Note that the number of DCIs that the base station can transmit per frame is determined by the CE level of the devices. In each NPDCCH, the base station can transmit one DCI for a CE2 device, two DCIs for two CE1 devices, four DCIs for four CE0 devices or 1 DCI for a CE1 device plus 2 DCIs for two CE0 devices.
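As a quick illustration of this subframe budget, the sketch below (ours, not taken from the paper's simulator; the greedy packing rule is our assumption) counts how many uplink grants fit into one 4-subframe NPDCCH given the per-CE-level DCI repetition factors stated above.

```python
# Illustrative sketch: packing DCIs into one NPDCCH.
# Repetition factors per CE level as assumed in the text: CE0=1, CE1=2, CE2=4.
DCI_REPETITIONS = {"CE0": 1, "CE1": 2, "CE2": 4}

def pack_dcis(ce_levels, npdcch_subframes=4):
    """Greedily grant devices until the NPDCCH subframes are exhausted."""
    granted, remaining = [], npdcch_subframes
    for ce in ce_levels:
        cost = DCI_REPETITIONS[ce]
        if cost <= remaining:
            granted.append(ce)
            remaining -= cost
    return granted

# One CE1 grant plus two CE0 grants fill the NPDCCH exactly; a further CE2 grant no longer fits.
print(pack_dcis(["CE1", "CE0", "CE0", "CE2"]))  # ['CE1', 'CE0', 'CE0']
```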

The DCI message specifies the values for all the parameters of an NPUSCH transmission:

  • The set $n_{sc}$ of contiguous subcarriers allocated in the uplink carrier. The number of subcarriers in the set is denoted by $N_{sc}$ .

  • The number of allocated uplink resource units, $N_{RU}$ (the RU being the minimum amount of resources that can be allocated).

  • The number of subframes, $I_{delay}$ , elapsed from the end of the current NPDCCH subframe until the first uplink subframe of the NPUSCH.

  • The modulation and coding scheme index, $I_{MCS}$ .

  • The number of repetitions, $N_{rep}$ .

3) Data Transmission Over the NPUSCH

The duration of a resource unit in subframes is $N_{sf}^{RU}=8$ , 4, 2, or 1 for $N_{sc} = 1$ , 3, 6, and 12 respectively. Therefore, the duration of an NPUSCH transmission is $N_{sf} = N_{rep} \times N_{RU} \times N_{sf}^{RU}$ subframes. The number of information bits (Transport Block Size, TBS) contained in an NPUSCH transmission is determined by $N_{RU}$ and $I_{MCS}$ . The larger the $I_{MCS}$ value, the higher the code rate, thus fewer RUs are required to fit a given TBS. We consider that all devices transmit small messages that fit into an NPUSCH. Each device informs the base station of the amount of data to be transmitted by means of a control information element called buffer status report (BSR) which, in our scenario, is sent with the connection request message as in [29].
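As a numerical illustration of the relation above, the following sketch (ours, not part of the paper's simulator) computes the NPUSCH duration for 15 kHz subcarrier spacing from the RU shapes listed in Section I.

```python
# Sketch: NPUSCH duration in subframes, N_sf = N_rep * N_RU * N_sf^RU (15 kHz spacing).
RU_SUBFRAMES = {1: 8, 3: 4, 6: 2, 12: 1}  # N_sc -> N_sf^RU

def npusch_duration(n_sc, n_ru, n_rep):
    """Uplink subframes (1 ms each) occupied by one NPUSCH transmission."""
    return n_rep * n_ru * RU_SUBFRAMES[n_sc]

# Example: 2 RUs over 3 subcarriers with 4 repetitions -> 4 * 2 * 4 = 32 subframes (32 ms).
assert npusch_duration(n_sc=3, n_ru=2, n_rep=4) == 32
```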

If the uplink data transmitted over an NPUSCH is correctly received at the base station, the transmitting device will disconnect from the base station after receiving an acknowledgement (ACK) from it. Transmission errors are recovered by means of an asynchronous adaptive HARQ process. If the base station is unable to decode a transport block, it will not transmit any ACK to the transmitting device. The base station should send a new uplink grant to that device at a later time, indicating that it must retransmit the previous transport block using the same link-adaptation parameters ($I_{MCS}$ , $N_{RU}$ and $N_{rep}$ ) as the previous transmission, so that the information can be combined at the base station using a chase combining procedure.

B. The Uplink Transmission Control Problem

Our objective is to design an algorithm capable of learning how to control uplink data transmissions to minimize the average transmission delay. The control decisions determine, for each NPUSCH transmission, the selected device, the link-adaptation (LA) parameters and the time-frequency resources allocated in the uplink carrier. Importantly, the algorithm must be capable of learning online in the radio access network. This implies that the algorithm must make irrevocable decisions, observe the results of these decisions (i.e. samples of a performance metric), and update its control policy accordingly. In our case, the performance metric is the transmission delay of each device, defined as the time elapsed from the instant when the device becomes connected to the base station to the moment when its data is successfully received.

The algorithm makes decisions when there are no random access procedures in progress and all the active devices are connected and waiting for data uplink grants (except those devices that may be in backoff state waiting for a future random access opportunity). Each decision corresponds to a DCI message sent on an NPDCCH. For each DCI sent, the base station must select the device $i$ receiving the uplink grant and its corresponding NPUSCH parameters ($n_{sc}$ , $N_{RU}$ , $I_{delay}$ , $I_{MCS}$ , $N_{rep}$ ). The NPUSCH granted to $i$ must not overlap with any NPRACH or with any previously allocated NPUSCH. The DCI message is repeated 1, 2 or 4 times depending on the CE level of $i$ . If, after the last DCI repetition, there are still available subframes in the NPDCCH, the base station can select another device of a CE level requiring a number of DCI repetitions smaller than or equal to the number of remaining subframes in the NPDCCH. This way, the NPDCCH can accommodate between 1 and 4 decisions (uplink grants).

To make decisions, the agent considers the following information: i) For each connected device: the estimated pathloss, the connection time, the amount of backlogged data, and an indicator of a required retransmission (i.e. if the device was scheduled previously but its transmission was not successfully decoded). ii) The resources occupied in each upcoming subframe up to a certain horizon.

Each decision of the agent consists of a selected user $i$ , an ($I_{MCS}$ , $N_{RU}$ ) pair, and a number of repetitions $N_{rep}$ . For each transmission, the ($I_{MCS}$ , $N_{RU}$ ) pair is selected among those that provide the minimum TBS required to hold all the bits indicated in the BSR previously sent by the device. Using an approach similar to the one used in [18] to select $I_{delay}$ , we determine the values of $n_{sc}$ and $I_{delay}$ by an automatic procedure that iterates over the values of these parameters until the NPUSCH fits into the uplink carrier without any overlap. Since the main objective is to minimize the delay, the procedure starts with the values $n_{sc} = 12$ and $I_{delay} = 0$ (minimum delay), for which the NPUSCH transmission ends earliest. Subsequent iterations try $n_{sc}$ configurations with a decreasing number of subcarriers and larger values of $I_{delay}$ , until a feasible combination is found. If none is found, the agent should make a different decision. The agent must therefore also learn to make feasible decisions, i.e. decisions for which the NPUSCH fits into the uplink carrier.
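A minimal sketch of this search is shown below. The helper fits_in_carrier and the exact iteration order are our assumptions; the procedure simply starts from the configuration that ends earliest and relaxes it until the NPUSCH fits.

```python
# Sketch (fits_in_carrier is a hypothetical overlap checker against NPRACH and
# previously allocated NPUSCHs): search over (n_sc, I_delay) as described above.
SUBCARRIER_OPTIONS = [12, 6, 3, 1]   # decreasing number of subcarriers
DELAY_OPTIONS = [0, 1, 2, 3]         # increasing scheduling delay index

def find_allocation(occupancy, n_ru, n_rep, fits_in_carrier):
    for i_delay in DELAY_OPTIONS:
        for n_sc in SUBCARRIER_OPTIONS:
            if fits_in_carrier(occupancy, n_sc, n_ru, n_rep, i_delay):
                return n_sc, i_delay
    return None  # infeasible: the agent must choose different LA parameters
```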

SECTION III.

Reinforcement Learning Approaches

In this Section we explain how the scheduling problem introduced in the previous Section can be formulated as a reinforcement learning problem. In the first subsection we consider a conventional RL formulation with a single agent. Then, the second subsection explains how the problem could be approached with a multi-agent RL strategy.

A. Reinforcement Learning Formulation

Reinforcement Learning (RL) is a type of machine learning algorithm where an agent learns to make decisions by interacting with the environment. The agent’s goal is to maximize the expected value of a cumulative reward, which is the sum of rewards received over time. At each time step $t$ , the agent observes the current state $S_{t}$ of the environment and selects a control action $A_{t}$ which is applied to the environment. As a result of this action, the agent receives a reward $R_{t+1}$ and the environment transitions to a new state $S_{t+1}$ . The agent’s objective is to learn a policy $\pi $ , which is a mapping from states to actions, such that the expected value of the sum of discounted rewards (return) is maximized. The discount factor $\gamma \in [{0, 1}]$ controls the importance of future rewards. The expected value of the return from $t$ is given by:\begin{equation*} \mathbb {E}[R_{t} + \gamma R_{t+1} + \gamma ^{2} R_{t+2} + {\dots }] = \mathbb {E}\left [{\sum _{i=0}^{\infty } \gamma ^{i} R_{t+i}}\right] \tag{1}\end{equation*} where the expectation is taken over the probabilities of the trajectories (sequences of state-action pairs) determined by $\pi $ . In our environment, the time steps $t=0,1,2,\ldots $ are the instants when the agent makes a decision. Specifically, each instant corresponds to the transmission of a DCI containing an uplink grant. This implies that the number of decision stages in each NPDCCH ranges between 1 and 4 depending on the CE levels of the selected devices.

The agent observation $S_{t}$ contains the information (pathloss, connection time, bits of data, and retransmission indicator) of a given number of connected devices ($D$ devices), plus a list of integers indicating the number (between 0 and 12) of occupied subcarriers in each upcoming subframe up to a given time horizon $H$ . The observation $S_{t}$ is therefore a vector of $4D+H$ elements. $D$ and $H$ place a bound on the dimension of the observation space, which otherwise could grow arbitrarily large as the number of connected devices increases. The devices included in each observation are those with the longest connection times, since the objective is to minimize the average transmission delay. We validate this strategy in Section V by comparing it with an alternative approach proposed in [18], which prioritizes the users with the highest estimated pathloss.
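For concreteness, a sketch of how such an observation vector could be assembled is shown below (the device attribute names are ours; the simulator's internal representation may differ).

```python
import numpy as np

# Sketch: build the fixed-size observation S_t of 4*D + H elements.
def build_observation(devices, occupancy, D, H):
    # keep the D devices with the longest connection times, zero-pad if fewer
    selected = sorted(devices, key=lambda d: d.connection_time, reverse=True)[:D]
    feats = []
    for dev in selected:
        feats += [dev.pathloss, dev.connection_time, dev.backlog_bits, float(dev.needs_retx)]
    feats += [0.0] * (4 * (D - len(selected)))
    occ = list(occupancy[:H]) + [0] * max(0, H - len(occupancy))  # occupied subcarriers per subframe
    return np.asarray(feats + occ, dtype=np.float32)              # shape: (4*D + H,)
```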

The action $A_{t}$ comprises three variables: an index identifying the selected device $i$ ($\in \{1\ldots D\}$ ), the ($I_{MCS}$ , $N_{RU}$ ) pair, and the number of repetitions $N_{rep}$ . The set $\mathcal {A}(S_{t})$ contains the available actions (action space) for a given observation $S_{t}$ . For notational brevity, we will refer to $\mathcal {A}(S_{t})$ simply as $\mathcal {A}$ .

The reward $R_{t}$ depends on the transmission delays experienced by the devices whose uplink data was correctly received between steps $t-1$ and $t$ . We refer to the clock time elapsed between these two steps as the $t$ -th observation period. Note that an observation period can last 1, 2 or 6 downlink subframes depending on which NPDCCH subframe is associated to $t$ and on the CE level of the device selected in $t-1$ . Besides, several uplink transmissions may end simultaneously at the same uplink subframe if their NPUSCHs use different subcarriers. We therefore define $d_{t}$ as the average transmission delay of the transmissions that are correctly detected during the $t$ -th observation period.

Given $d_{t}$ , the reward could be defined as $R_{t} = -d_{t}$ (recall that the objective of the RL agent is to maximize $\mathbb {E}\left [{\sum _{i=0}^{\infty } \gamma ^{i} R_{t+i}}\right]$ ). However, there are some adjustments to this definition that accelerate learning. Note that the reward observed at $t$ is not, in general, a consequence of the decision made at $t-1$ , but of one (or several) decisions made at much earlier steps, because multiple subframes elapse from the time the uplink grant is sent until the NPUSCH transmission is successfully received. This is a case of delayed reward, which makes RL problems harder to solve.

Moreover, the delay of the reward is variable since it depends on the selected parameters ($I_{MCS}$ , $N_{RU}$ , $N_{rep}$ ) and on the occupied subcarriers in the upcoming subframes (which are random and determine $I_{delay}$ and $n_{sc}$ ). We can at least provide some immediate feedback to the agent in the form of a penalty in case the selected action is instantly known to be wrong. In particular, we consider two cases: 1) when the selected $i$ is larger than the number of connected users (in case there are fewer than $D$ connected users) and 2) when the uplink carrier does not have enough resources to fit the NPUSCH. The penalties must be significant with respect to the delay in order to have an effect on the agent. Also, because the delay is variable and unbounded, we define the penalties relative to the maximum delay observed so far. Let $\mathbb {I}^{(index)}_{t}\in \{0,1\}$ denote a boolean signal indicating whether action $A_{t-1}$ has incurred the first error mentioned above, and let $\mathbb {I}^{(unfit)}_{t}$ be a similar boolean signal for the second error. Thus the reward signal is defined as:\begin{equation*} R_{t} = -d_{t} - (\mathbb {I}^{(index)}_{t} \lor \mathbb {I}^{(unfit)}_{t})\cdot \text {max}\{d_{1},\ldots,d_{t}\} \tag{2}\end{equation*} where $\lor $ denotes a logical “or” operation.
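A direct transcription of (2) into code is straightforward; the sketch below (ours) only makes explicit that the penalty term scales with the largest delay observed so far.

```python
# Sketch of the reward in Eq. (2). observed_delays is the list d_1, ..., d_t.
def reward(d_t, observed_delays, bad_index, unfit):
    max_delay = max(observed_delays) if observed_delays else 0.0
    penalty = max_delay if (bad_index or unfit) else 0.0
    return -d_t - penalty
```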

B. Multi-Agent Reinforcement Learning

In environments involving large action and observation spaces, the learning rate of an RL agent is, in general, too slow for online operation. To accelerate the learning rate, we propose the use of two agents that learn in a coordinated and cooperative way to solve the uplink scheduling problem. This approach, known as Multi-Agent Reinforcement Learning (MARL), poses three main challenges: 1) dividing the main task into subtasks to be distributed among the agents, 2) defining the rewards received by each agent so that when each agent maximizes its particular objective, the global objective is also maximized, and 3) addressing the non-stationarity issue that characterizes multi-agent settings.

An RL agent is assumed to operate in a stationary environment, which means that the response of the environment (the next state and the reward) for each state-action pair remains statistically invariant at any time step. This property allows the RL algorithm to converge, but is not present, in general, when several agents learn concurrently in the same environment. The reason is that, for each agent, the response of the environment does not depend only on the action chosen by the agent, but on the aggregate actions of all the agents. Agents modify their policies while learning, which implies that their mappings from states to actions will change over time. Thus, from the perspective of each agent, the environment becomes non-stationary because its response changes as the rest of the agents modify their policies.

1) Task Division

In our proposed MARL scheme we divide the main task into two sub-tasks: device selection and link adaptation. Decisions are made in sequence: first, the device-selection (DS) agent receives the observation $S_{t}$ and selects a device $i$ among the connected ones. Then, the link-adaptation (LA) agent receives the observation $S_{t}$ and the selected $i$ , and determines the LA parameters ($I_{MCS}$ , $N_{RU}$ , $N_{rep}$ ). At a given observation $S_{t}$ the DS agent selects actions from the action space $\mathcal {A}^{(1) }(S_{t})$ which contains as many device indexes as indicated by $S_{t}$ . For the LA agent, the observation $S'_{t}$ combines $S_{t}$ and the device $i$ selected by the DS agent. Therefore its action space $\mathcal {A}^{(2) }(S'_{t})$ includes only ($I_{MCS}$ , $N_{RU}$ ) pairs that provide a TBS compatible with the BSR of device $i$ . For notation convenience we denote these action spaces as $\mathcal {A}^{(1) }$ and $\mathcal {A}^{(2) }$ respectively. Note that $\mathcal {A} = \mathcal {A}^{(1) }\times \mathcal {A}^{(2) }$ .

2) Rewards

In MARL, each agent can receive a different reward signal provided that each signal is oriented towards the same general objective. We adopt this strategy to mitigate the problem of delayed rewards discussed in the previous subsection. The DS agent receives a modified version of the reward defined in (2):\begin{equation*} R^{(1) }_{t} = -d_{t} - \mathbb {I}^{(index)}_{t} \text {max}\{d_{1},\ldots,d_{t}\} \tag{3}\end{equation*} The LA agent receives a more direct, less delayed feedback signal comprising i) the number of NPUSCHs correctly received during the last observation period, $n^{(rx)}_{t}$ , ii) the number of NPUSCHs received with errors, $n^{(error)}_{t}$ , and iii) the boolean signal $\mathbb {I}^{(unfit)}_{t}$ defined in the previous subsection. The reward for this agent is defined as:\begin{equation*} R^{(2) }_{t} = n^{(rx)}_{t} - n^{(error)}_{t} - c \mathbb {I}^{(unfit)}_{t} \tag{4}\end{equation*} where the weight factor $c$ is constant in this case because both $n^{(rx)}_{t}$ and $n^{(error)}_{t}$ are bounded. We use $c = 10$ , which is typically larger than the number of NPUSCH transmissions ending during an observation period, in order to penalize unfit NPUSCH configurations more than transmission errors. Note that an NPUSCH received with errors is preferable to an unfit one, because the former is recoverable by HARQ, while the latter is not even transmitted. With this reward signal, the LA agent learns to configure NPUSCH transmissions with as high a reception probability as possible (high $n^{(rx)}_{t}$ and small $n^{(error)}_{t}$ ), constrained to fit into the uplink carrier (keeping the probability $p^{(unfit)}_{t} = \mathbf {P}(\mathbb {I}^{(unfit)}_{t} = 1)$ close to 0).
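The two per-agent rewards can be sketched as follows (our transcription of (3) and (4), with $c = 10$ as in the text).

```python
# Sketch of the per-agent rewards in Eqs. (3) and (4).
def ds_reward(d_t, max_delay_so_far, bad_index):
    return -d_t - (max_delay_so_far if bad_index else 0.0)

def la_reward(n_rx, n_error, unfit, c=10):
    return n_rx - n_error - c * int(unfit)
```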

3) Coordination Scheme

To overcome the problem of non-stationarity under concurrent learning of the agents, we propose a coordination scheme based on the principle of curriculum learning [30] under which agents learn to complete one sub-task before proceeding to learn another sub-task. The idea is to accumulate knowledge gradually by incorporating agents one by one. In our case, the LA agent learns first and, once its performance is considered acceptable, this agent switches to inference operation (i.e. it ceases to update its policy), and the learning of the DS agent starts. With this strategy, the learning of an agent does not interfere with the learning of the other agent. During the learning phase of the LA agent, the DS decisions are random.
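The phased coordination can be summarized with the following sketch (the agent interface, with act, update and freeze methods, is hypothetical and only illustrates the order of the two learning phases).

```python
# Sketch of the curriculum-style coordination: the LA agent learns first with
# random DS decisions, then it is frozen while the DS agent learns.
def run_learning_episode(env, la_agent, ds_agent, la_steps, ds_steps):
    for _ in range(la_steps):                    # phase 1: LA learning
        s = env.observe()
        device = env.random_connected_device()   # DS decisions are random here
        feedback = env.step(device, la_agent.act(s, device))
        la_agent.update(feedback)
    la_agent.freeze()                            # inference-only from now on
    for _ in range(ds_steps):                    # phase 2: DS learning
        s = env.observe()
        device = ds_agent.act(s)
        reward = env.step(device, la_agent.act(s, device))
        ds_agent.update(reward)
```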

SECTION IV.

Online Model-Based Approach

Although the RL approach is capable of finding effective control policies, it is not fully suitable for online learning because its low learning rate causes excessively high transmission delays during the initial period of operation, as we show in Section V. The MARL strategy mitigates this problem, but its learning rate is still not sufficient for online learning. In this Section we present a model-based proposal for the LA agent that increases the learning rate drastically. We refer to our model-based MARL approach as MBMARL.

A. Elements of the Model

1) Classifiers

In model-based RL the agent learns a model of the system instead of a policy or a value function. The design of our model is aimed at maximizing the contribution of each action to the reward defined in (4). In particular, at each step the agent should select an action $A^{(2) }_{t}\in \mathcal {A}^{(2) }$ that maximizes the probability of reception $p^{(rx)}_{t} = \mathbf {P}(\mathbb {I}^{(rx)}_{t} = 1)$ such that $p^{(unfit)}_{t} = 0$ . Therefore, the agent should be able to estimate both $p^{(rx)}_{t}$ and $p^{(unfit)}_{t}$ for any given observation $S_{t}$ , device, and action. For this, the model comprises two classifiers that generate the predictions for $\mathbb {I}^{(rx)}_{t}$ and $\mathbb {I}^{(unfit)}_{t}$ respectively. These classifiers learn online, meaning that they are trained while making predictions. Therefore, instead of using an existing dataset (offline training), an online classifier receives a new data item at each $t$ and is updated for each single item. We provide a more formal definition of online learning for classification in the next subsection.

2) Feedback Scheme

Instead of using a reward signal, this agent receives the indicator signals $\mathbb {I}^{(unfit)}_{t}$ and $\mathbb {I}^{(rx)}_{t}$ directly, which act as the labels for each classifier. The general scheme of online learning works as follows: at each step, each classifier makes a prediction, then its true label is revealed, and then the classifier is updated according to this label.

3) Buffer Memory

Recall that a variable number of subframes elapse between the transmission of a DCI (which contains $A^{(2) }_{t}$ ) and the (correct or failed) reception of the granted NPUSCH. This implies that $\mathbb {I}^{(rx)}_{t}$ is not revealed, in general, at $t+1$ , but after a random number of steps. Therefore, to properly associate each feedback signal to its corresponding action, the model incorporates a buffer memory $\mathcal {M}$ storing recent observation-action pairs. When the outcome of an NPUSCH transmission, scheduled at $t-\tau $ , is observed, then the corresponding pair ($S_{t-\tau }$ , $A_{t-\tau }$ ) is retrieved from memory, and the classifier is updated using the revealed label ($\mathbb {I}^{(rx)}_{t-\tau } = 1$ if the NPUSCH was correctly received, and $\mathbb {I}^{(rx)}_{t-\tau } = 0$ otherwise).
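A sketch of this buffer, keyed by the decision step at which the DCI was sent (our own simplification), is shown below.

```python
# Sketch: buffer memory M matching delayed HARQ feedback to the decision that caused it.
memory = {}  # decision step t -> (S_t, A_t)

def on_decision(t, state, action):
    memory[t] = (state, action)

def on_npusch_outcome(t_grant, received, update_rx):
    """Called when the NPUSCH scheduled at step t_grant is (or is not) decoded."""
    state, action = memory.pop(t_grant)
    update_rx(state, action, label=1 if received else 0)  # label = I^(rx)
```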

B. Online Learning Strategy

An online learning algorithm for binary classification aims at learning a mapping $h:\mathcal {X}\rightarrow \lbrace 0, 1\rbrace $ from a sequence of examples $(\mathbf {x}_{t}, \mathbb {I}_{t})$ , where $\mathbf {x}_{t} \in \mathcal {X} $ is called an instance, $\mathcal {X}$ is typically a ${d}$ -dimensional vector space, and $\mathbb {I}_{t}\in \lbrace 0,1\rbrace $ is called a label. The function $h$ is called the hypothesis (function) or the (prediction) model. In online learning, the algorithm updates the hypothesis at each step $t$ , thus we denote it by $h_{t}$ . At every $t$ , an instance $\mathbf {x}_{t}$ is presented to the algorithm, which predicts a label $\hat {\mathbb {I}}_{t}\in \lbrace 0,1\rbrace $ using the current hypothesis function: $\hat {\mathbb {I}}_{t} = h_{t}(\mathbf {x}_{t})$ . Then, the correct label $\mathbb {I}_{t}$ is revealed and the learner can update its prediction model $h_{t}$ according to an algorithm-specific strategy. There exists a wide variety of online learning algorithms for classification. A recent survey [31] offers a complete overview of this topic, and the Scikit-multiflow package [32] provides an implementation of an extensive selection of methods.
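As a concrete example, the predict-then-update protocol can be written as below using the Hoeffding Tree implementation of scikit-multiflow (the class name corresponds to recent versions of the package; the default prediction returned before the first update is our assumption).

```python
import numpy as np
from skmultiflow.trees import HoeffdingTreeClassifier  # VFDT, used for h^(unfit) and h^(rx)

h = HoeffdingTreeClassifier()
CLASSES = np.array([0, 1])
_fitted = False

def predict_then_update(x, true_label):
    """One online-learning step: predict with h_t, then update to h_{t+1}."""
    global _fitted
    x = np.asarray(x, dtype=float).reshape(1, -1)
    y_hat = int(h.predict(x)[0]) if _fitted else 0  # default prediction before any update
    h.partial_fit(x, np.array([true_label]), classes=CLASSES)
    _fitted = True
    return y_hat
```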

Our proposal uses two hypothesis functions, $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$ , to predict $\mathbb {I}^{(unfit)}_{t}$ and $\mathbb {I}^{(rx)}_{t}$ respectively. The instance for $h^{(unfit)}_{t}$ is defined as $\mathbf {x}^{(unfit)}_{t} = (C_{t}, A^{(2) }_{t})$ , where $C_{t}$ is a vector of integers indicating the number of occupied subcarriers in the upcoming $H$ subframes. The instance for $h^{(rx)}_{t}$ is defined as $\mathbf {x}^{(rx)}_{t} = (L^{(i)}_{t}, B^{(i)}_{t}, A^{(2) }_{t})$ where $L^{(i)}_{t}$ and $B^{(i)}_{t}$ denote the estimated pathloss and the BSR, respectively, of the device $i$ selected at $t$ (by the DS agent). $C_{t}$ , $L^{(i)}_{t}$ and $B^{(i)}_{t}$ are extracted from $S_{t}$ .

The overall operation of the link-adaptation agent is detailed in Algorithm 1, where $\mathcal {I}^{(rx)}_{t}$ denotes the set containing the indicator signals $\mathbb {I}^{(rx)}_{t'}, \mathbb {I}^{(rx)}_{t''}, \ldots $ , received during the last observation period. These signals correspond to the actions $A_{t'}, A_{t''}, \ldots $ , selected in previous steps. The following subsections explain the action selection algorithm SelectAction, and the procedures UpdateU and UpdateR to update $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$ respectively. Figure 2 depicts the LA agent, its internal elements, and the information exchanged both internally and externally.

Algorithm 1 Link-Adaptation Agent

1: Inputs: $S_{t}$ , $A^{(1)}_{t}$ , for $t = 1,2,\ldots $ , $h^{(unfit)}_{1}$ , $h^{(rx)}_{1}$
2: Outputs: $A^{(2) }_{t}$ , for $t = 1,2,\ldots $
3: for $t = 1,2,\ldots $ do
4:   Receive $S_{t}$ , $A^{(1)}_{t}$
5:   $A^{(2) }_{t} \gets $ SelectAction($S_{t}$ , $A^{(1) }_{t}$ , $h^{(unfit)}_{t}$ , $h^{(rx)}_{t}$ )
6:   Apply action $A_{t} = (A^{(1)}_{t}, A^{(2) }_{t})$
7:   Store $(S_{t}, A_{t})$ in $\mathcal {M}$
8:   Receive $\mathbb {I}^{(unfit)}_{t}$ , $\mathcal {I}^{(rx)}_{t}$
9:   $h^{(unfit)}_{t+1} \gets $ UpdateU($S_{t}, A_{t}, \mathbb {I}^{(unfit)}_{t}$ , $h^{(unfit)}_{t}$ )
10:  for each $\mathbb {I}^{(rx)}_{t'}$ in $\mathcal {I}^{(rx)}_{t}$ do
11:    Retrieve $(S_{t'}, A_{t'})$ from $\mathcal {M}$
12:    $h^{(rx)}_{t+1} \gets $ UpdateR($S_{t'}, A_{t'}, \mathbb {I}^{(rx)}_{t'}$ , $h^{(rx)}_{t}$ )
13:  end for
14: end for

FIGURE 2. Diagram showing the LA agent, its internal elements and signals, and its interaction with the DS agent and the controlled system.

C. Action Selection

Let us define $a_{1}, a_{2}, \ldots, a_{K}$ as the sequence of all actions in $\mathcal {A}^{(2) }$ ordered by decreasing datarate (note that, for a given TBS, a smaller $I_{MCS}$ and a larger $N_{rep}$ result in a smaller datarate). For a given pathloss, the lower the datarate, the higher the probability of NPUSCH reception, but also the higher the amount of RUs occupied by the NPUSCH in the uplink carrier. The action selection algorithm iterates over the actions $a_{1}, a_{2}, \ldots $ until it finds an action $a_{k}$ fulfilling one of the following two conditions:

  1. $h^{(unfit)}_{t}(\mathbf {x}^{(unfit)}_{t})=1$ , with $\mathbf {x}^{(unfit)}_{t} = (C_{t}, a_{k})$ : the agent estimates that the NPUSCH does not fit into the uplink carrier.

  2. $h^{(rx)}_{t}(\mathbf {x}^{(rx)}_{t})=1$ , with $\mathbf {x}^{(rx)}_{t} = (L^{(i)}_{t}, B^{(i)}_{t}, a_{k})$ : the agent predicts that the transmission is likely to be correctly received.

When condition 1 holds with $k>1$ , the agent selects the preceding action $a_{k-1}$ , which is the one with the highest reception probability among those that $h^{(unfit)}_{t}$ predicts to fit into the carrier. Condition 2 is assessed only if condition 1 does not hold, i.e. when the agent considers that the NPUSCH fits into the carrier. Thus, if condition 2 holds, the agent selects action $a_{k}$ , which defines the most efficient NPUSCH in terms of RUs among those that $h^{(rx)}_{t}$ predicts to be successfully received. If the main loop ends without returning any action, it can mean either that there is no feasible action or that the prediction models are in their early training stages and are still not sufficiently accurate. Thus, this is an opportunity for exploration, i.e. to gather information about actions that have not yet been sufficiently evaluated.

Recall that the information associated to the selected device $i$ includes a boolean signal $Rtx^{(i)}$ that equals 1 only if the device is waiting for a retransmission. In this case, the agent selects the same LA parameters used in the first NPUSCH transmission granted to this device (at an earlier time step $t' < t$ ), since all HARQ transmissions of a transport block must use the same link-adaptation parameters to be combined and decoded at the receiver.

The action selection algorithm is detailed in Algorithm 2. Note that, in addition to the explicit exploratory actions, the prediction errors made by $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$ (especially during their initial training stages) result in actions that are also exploratory, i.e. allowing the agent to assess the outcome of state-action pairs for which it did not have sufficient knowledge.

Algorithm 2 SelectAction

1: Inputs: $S_{t}$ , $A^{(1)}_{t}$ , $h^{(unfit)}_{t}$ , $h^{(rx)}_{t}$
2: Outputs: $A^{(2) }_{t}$
3: Extract $C_{t}$ , $L^{(i)}_{t}$ , $B^{(i)}_{t}$ , and $Rtx^{(i)}$ from $S_{t}$
4: if $Rtx^{(i)} = 1$ then $\triangleright $ $i$ is in HARQ retransmission
5:   return $A^{(2)}_{t'}$ $\triangleright $ use its previous LA parameters
6: end if
7: for $a_{k} = a_{1},\ldots,a_{K}$ do $\triangleright $ in decreasing datarate
8:   $\mathbf {x}^{(unfit)}_{t} \gets (C_{t}, a_{k})$
9:   if $h^{(unfit)}_{t}(\mathbf {x}^{(unfit)}_{t}) = 1$ then
10:    $k'=\max (k-1,1)$ ; return $a_{k'}$
11:  end if
12:  $\mathbf {x}^{(rx)}_{t} \gets (L^{(i)}_{t}, B^{(i)}_{t}, a_{k})$
13:  if $h^{(rx)}_{t}(\mathbf {x}^{(rx)}_{t}) = 1$ then
14:    return $a_{k}$
15:  end if
16: end for
17: return $a_{k'}$ with random $k'$ $\triangleright $ explore

D. Model Update with Data Augmentation

Our models $h^{(unfit)}_{t}$ and $h^{(rx)}_{t}$ are based on the Hoeffding Tree (HT) classifier, also known as the Very Fast Decision Tree (VFDT) classifier [33]. We selected this classifier after evaluating different methods implemented in Scikit-Multiflow [32]. HT is a lightweight algorithm that provides a fast and bounded execution time which is independent of the number of examples already seen. The high accuracy observed in the results shown in Section V suggests that its structure is especially well suited to the response of our environment. In this subsection we present a data augmentation strategy to further improve the learning rate of these models.

The $h^{(unfit)}_{t}$ data-augmented update leverages a basic property of the system’s response that we can state as follows: Consider the resource occupancy $C_{t}$ of the uplink carrier at a certain step $t$ . Given an action $A^{(2) }_{t}$ , let $N(A^{(2) }_{t}) = N_{rep}N_{RU}$ denote the number of RUs that the NPUSCH occupies in the carrier. If this NPUSCH does not fit into the carrier, then any NPUSCH that occupies $N(A^{(2) }_{t})$ RUs or more does not fit into the carrier either. We denote by $\mathcal {A}_{RU^{+}}$ the set of actions requiring as many or more RUs than $A^{(2) }_{t}$ . Conversely, if the NPUSCH defined by $A^{(2) }_{t}$ fits into the carrier, then any NPUSCH that occupies $N(A^{(2) }_{t})$ RUs or fewer fits into the carrier too. We denote by $\mathcal {A}_{RU^{-}}$ the set of actions requiring as many or fewer RUs than $A^{(2) }_{t}$ . Algorithm 3 details how the above property is used to increase the number of updates for $h^{(unfit)}_{t}$ when an NPUSCH fits ($\mathbb {I}^{(unfit)}_{t} = 0$ ) or does not fit ($\mathbb {I}^{(unfit)}_{t} = 1$ ). In our implementation, the update step for the $k$ -th generated sample uses the VFDT algorithm described in [33], but any other method for online binary classification could be applied in this step.

Algorithm 3 UpdateU

1: Inputs: $S_{t}$ , $A_{t}$ , $\mathbb {I}^{(unfit)}_{t}$ , $h^{(unfit)}_{t}$
2: Outputs: $h^{(unfit)}_{t+1}$
3: Extract $C_{t}$ and $A^{(2)}_{t}$ from $S_{t}$ and $A_{t}$ respectively
4: Extract $N_{rep}$ and $N_{RU}$ from $A^{(2) }_{t}$
5: $N(A^{(2)}_{t}) = N_{rep}N_{RU}$
6: $h_{0} \gets h^{(unfit)}_{t}$
7: if $\mathbb {I}^{(unfit)}_{t} = 0$ then $\triangleright $ the NPUSCH fitted
8:   $\mathcal {A}_{RU^{-}} = \{a_{k}\in \mathcal {A}^{(2)}\mid N(a_{k})\leq N(A^{(2)}_{t})\}$
9:   for each $a_{k}$ in $\mathcal {A}_{RU^{-}}$ do
10:    $\mathbf {x} \gets (C_{t}, a_{k})$
11:    $h_{k} \gets h_{k-1}$ updated with $\mathbf {x}$ and $\mathcal {I} = 0$
12:  end for
13: else $\triangleright $ the NPUSCH did not fit
14:  $\mathcal {A}_{RU^{+}} = \{a_{k}\in \mathcal {A}^{(2)}\mid N(a_{k})\geq N(A^{(2)}_{t})\}$
15:  for each $a_{k}$ in $\mathcal {A}_{RU^{+}}$ do
16:    $\mathbf {x} \gets (C_{t}, a_{k})$
17:    $h_{k} \gets h_{k-1}$ updated with $\mathbf {x}$ and $\mathcal {I} = 1$
18:  end for
19: end if
20: return $h_{k}$
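In code, the augmented update amounts to one classifier update per implied action. The sketch below is ours and mirrors Algorithm 3 under some simplifying assumptions: actions is the ordered action set $\mathcal {A}^{(2)}$ , n_rus(a) returns $N_{rep}N_{RU}$ for an action, and the feature encoding of an action is reduced to a tuple of its parameters.

```python
import numpy as np

# Sketch of the data-augmented UpdateU: one classifier update per action whose
# fit/unfit outcome is logically implied by the observed one.
def update_unfit(h_unfit, c_t, action, unfit, actions, n_rus):
    n_ref = n_rus(action)
    if unfit:   # anything needing at least as many RUs cannot fit either -> label 1
        implied = [(a, 1) for a in actions if n_rus(a) >= n_ref]
    else:       # anything needing at most as many RUs fits as well -> label 0
        implied = [(a, 0) for a in actions if n_rus(a) <= n_ref]
    for a, label in implied:
        x = np.asarray(list(c_t) + list(a), dtype=float).reshape(1, -1)
        h_unfit.partial_fit(x, np.array([label]), classes=np.array([0, 1]))
    return h_unfit
```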

To increase the learning efficiency of the $h^{(rx)}_{t}$ update, we use a similar data augmentation scheme based on the following property: Consider an NPUSCH transmission from a specific user $i$ , with the datarate determined by action $A^{(2) }_{t}$ . If this transmission turns out to be unsuccessful, it would also have been unsuccessful with an equal or greater datarate than $A^{(2) }_{t}$ (i.e. with any action in $\mathcal {A}_{RU^{-}}$ ). Conversely, if the NPUSCH transmission is successful, it would also have been successful with an equal or smaller datarate than $A^{(2) }_{t}$ (i.e. with any action in $\mathcal {A}_{RU^{+}}$ ). Algorithm 4 details how the above property is used to increase the number of updates associated to each transmission outcome.

Algorithm 4 UpdateR

1: Inputs: $S_{t}$ , $A_{t}$ , $\mathbb {I}^{(rx)}_{t}$ , $h^{(rx)}_{t}$
2: Outputs: $h^{(rx)}_{t+1}$
3: Extract $i$ and $A^{(2)}_{t}$ from $A_{t}$
4: Extract $L^{(i)}_{t}$ and $B^{(i)}_{t}$ from $S_{t}$
5: Extract $N_{rep}$ and $N_{RU}$ from $A^{(2)}_{t}$
6: $N(A^{(2) }_{t}) = N_{rep}N_{RU}$
7: $h_{0} \gets h^{(rx)}_{t}$
8: if $\mathbb {I}^{(rx)}_{t} = 0$ then $\triangleright $ NPUSCH not decoded
9:   $\mathcal {A}_{RU^{-}} = \{a_{k}\in \mathcal {A}^{(2)}\mid N(a_{k})\leq N(A^{(2)}_{t})\}$
10:  for each $a_{k}$ in $\mathcal {A}_{RU^{-}}$ do
11:    $\mathbf {x} \gets (L^{(i)}_{t}, B^{(i)}_{t}, a_{k})$
12:    $h_{k} \gets h_{k-1}$ updated with $\mathbf {x}$ and $\mathcal {I} = 0$
13:  end for
14: else $\triangleright $ NPUSCH decoded
15:  $\mathcal {A}_{RU^{+}} = \{a_{k}\in \mathcal {A}^{(2)}\mid N(a_{k})\geq N(A^{(2)}_{t})\}$
16:  for each $a_{k}$ in $\mathcal {A}_{RU^{+}}$ do
17:    $\mathbf {x} \gets (L^{(i)}_{t}, B^{(i)}_{t}, a_{k})$
18:    $h_{k} \gets h_{k-1}$ updated with $\mathbf {x}$ and $\mathcal {I} = 1$
19:  end for
20: end if
21: return $h_{k}$

These two algorithms could implement an even more extensive data augmentation strategy. For UpdateU, we could generate additional samples of $C_{t}$ as follows: when an NPUSCH does not fit into the carrier, it will not fit either in a carrier with fewer available resources. We could then generate new samples by adding occupied subcarriers to $C_{t}$ . Conversely, if the NPUSCH fits into the carrier, it will also fit in a carrier with more available resources. Then, we could generate new samples by removing occupied subcarriers from $C_{t}$ . For UpdateR, if an NPUSCH transmission is successful, we could generate new samples with pathloss values smaller than the observed one, $L^{(i)}_{t}$ , and if an NPUSCH is unsuccessful, we could generate new samples with higher pathloss. These two improvements are easy to implement, but our results show that they are not necessary given the fast learning rate attained with Algorithms 3 and 4.

It is worth mentioning that the only configurable parameters in our proposal are those of the online classifiers ($h^{(unfit)}_{t}$ , $h^{(rx)}_{t}$ ). As we discuss in the next section, the chosen classifier (HT) does not require fine tuning of parameters since the default configuration is sufficient to achieve efficient performance and the controller shows low sensitivity to the HT parameters.

SECTION V.

Evaluation

A. Methodology

The proposal has been evaluated empirically using a simulator of the uplink transmission functionalities of NB-IoT. We will first describe the essential aspects of the simulator and then we will present the baselines and the evaluation experiments.

1) Simulation Environment

The simulation environment, developed in Python,2 comprises a population of devices trying to access the system and transmit their data packets, a base station that controls the access and schedules transmission opportunities for the devices, one or more carriers where the resources for NPRACH and NPUSCH are assigned, and the corresponding channel models for NPRACH and NPUSCH transmissions. Devices are idle until they become active according to a probabilistic traffic model. In particular we use the uniform mMTC traffic model defined in 3GPP TR 37.868 [34], which considers time periods of $T=60$ seconds over which each device attempts to access the network with a uniform probability.
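For reference, the uniform activation model can be sketched in a couple of lines (our own illustration of the TR 37.868 model, not the simulator code).

```python
import numpy as np

# Sketch: uniform mMTC activation times over a T = 60 s period (3GPP TR 37.868).
rng = np.random.default_rng()

def activation_times(n_devices, period_s=60.0):
    return np.sort(rng.uniform(0.0, period_s, size=n_devices))
```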

An active device goes through several stages. First, it must complete the access procedure. After becoming active, the device waits for the start of the next NPRACH corresponding to its CE level. When the NPRACH starts, the device transmits a randomly chosen preamble sequence over that NPRACH. If no other device has selected the same preamble, the transmitted preamble can be detected if it arrives with sufficient power. If the same preamble is selected by several devices, it is received with lower signal quality, but it can sometimes be detected thanks to the capture effect. The contention is resolved during the RAR window that starts after the NPRACH. When the base station decodes a non-collided preamble, it sends a signaling message (msg2) to the device during the RAR window, granting the device an opportunity to send its connection request and a buffer status report (msg3). This is followed by a signaling exchange which, if successful, ends the access procedure and sets the device to the connected state. When the preamble is not detected (because of collision or low signal quality), the device does not receive any message within the RAR window, and starts a backoff period before a new access attempt. If the number of access attempts reaches a certain configurable value, the device moves to an upper CE level.

When the base station manages to decode a collided preamble, several devices will receive the same msg2, resulting in several colliding msg3 responses. If the base station is able to decode one of these msg3 responses, the corresponding device will become connected. If a collided or non-collided msg3 is not decoded, the base station can schedule a retransmission during the RAR window. When a device sends an msg3, it starts a contention timer. If this timer expires before the reception of the msg4 response, the device will start a new access attempt.

Once connected, a device waits for an uplink grant indicated by a DCI. After receiving a DCI, the device will transmit its data packet over the allocated NPUSCH with the indicated LA parameters. Then, the device can receive either an ACK, if the packet was decoded, or another DCI with an uplink grant for a retransmission, if the packet was not decoded. The connection ends upon an ACK reception.

The environment uses the propagation conditions and the antenna pattern described in sections IV-B and 4.5 of 3GPP TR 36.942 [35], respectively. It implements a block fading model where a channel realization is constant over each (NPRACH or NPUSCH) transmission, and changes independently from one transmission to another according to a lognormal shadow fading model with $\sigma = 10$ dB. The detection of a preamble sequence is based on the probability model developed in [36], and the reception of NPUSCH transmissions uses the block error rate tables available in [37] for each link-adaptation configuration.
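A minimal sketch of the shadowing model follows (our illustration; the deterministic path loss term is assumed to be computed from the TR 36.942 propagation model elsewhere).

```python
import numpy as np

# Sketch: block-fading shadowing, constant over one transmission and drawn
# independently per NPRACH/NPUSCH transmission, with sigma = 10 dB.
rng = np.random.default_rng()

def effective_loss_db(pathloss_db, sigma_sf_db=10.0):
    shadowing_db = rng.normal(0.0, sigma_sf_db)   # Gaussian in dB, lognormal in linear scale
    return pathloss_db + shadowing_db
```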

The configurable parameters of each NPRACH are the periodicity, the number of (3.75 kHz spaced) subcarriers, and the number of preamble repetitions. Other configurable parameters are: the NPDCCH periodicity, the number of consecutive NPDCCH subframes, the backoff time, the duration of a RAR window, the contention timer and the maximum number of access attempts. Table 1 summarizes the environment configuration.

TABLE 1. Parameter Setting of the Simulation Environment

Although our proposed model-based scheme is essentially parameter-free, the HT algorithm selected for the classifiers ($h^{(unfit)}$ and $h^{(rx)}$ ) involves three configurable parameters (see [33] for further details). The results shown in this Section use the default configuration provided by the Scikit-Multiflow implementation. Nevertheless, to evaluate the sensitivity of our proposal to these parameters, we replicated the experiments for 26 additional configurations around the default settings and observed no significant variations in the results.

2) Baselines

To evaluate our proposal, we compared it with a diverse set of baselines based on three alternative architectures: 1) single-agent RL, 2) multi-agent RL (MARL, presented in Section III-B), and 3) multi-agent control using the NBLA algorithm proposed in [12] for link adaptation and RL for user selection. The latter approach is referred to as NBLA+RL. Each approach is evaluated using two strategies for determining which $D$ devices are included in the observation $S_{t}$: i) those with the longest connection times, and ii) those with the largest estimated pathloss (as suggested in [18]). Our proposal is also evaluated under these two criteria.

Note that all the control schemes incorporate either one or two RL agents. For these agents, we have considered the following state-of-the-art deep RL algorithms:

  • Deep Q-Networks (DQN) [38] is the deep learning version of Q-learning, a classical model-free off-policy RL algorithm. Q-learning and DQN have been used in NB-IoT environments for the configuration of NPRACH parameters [20] and for dynamically delaying access attempts to avoid congestion in the NPRACH [21].

  • Proximal policy optimization (PPO) [39] is a model-free deep policy gradient algorithm that updates policies while preventing the new policy from diverging too much from the previous one, in order to avoid unstable behavior during the learning process.

  • Synchronous advantage actor critic (A2C) [40] is an on-policy deep actor-critic algorithm.

The experiments use the RL implementations provided by Stable Baselines [41], which is an improved version of the OpenAI Baselines [42]. Two additional RL algorithms were evaluated: Sample Efficient Actor-Critic with Experience Replay (ACER) and Actor Critic using Kronecker-Factored Trust Region (ACKTR), but they performed clearly worse than the above ones and have not been included in the results. In summary, each baseline is determined by an architecture, a device prioritization strategy, and an RL algorithm.
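For concreteness, the following sketch shows one way these agents could be instantiated with Stable Baselines; the `make_agent` helper and the Gym-compatible wrapper around the NB-IoT simulator are assumptions for illustration, and all hyperparameters are left at the library defaults.

```python
from stable_baselines import A2C, DQN, PPO2  # Stable Baselines (v2) classes; PPO is exposed as PPO2

def make_agent(algo_name, env):
    """Build an RL agent on top of a Gym-compatible wrapper `env` of the simulator (assumed)."""
    algos = {"DQN": DQN, "PPO": PPO2, "A2C": A2C}
    return algos[algo_name]("MlpPolicy", env, verbose=0)

# Example of online learning during one phase of an episode:
# agent = make_agent("PPO", env)
# agent.learn(total_timesteps=20000)
```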

3) Evaluation Experiments

Each control scheme (either a baseline or our proposal) has been evaluated in 30 independent simulation experiments. Each experiment consists of an online learning episode in which the controller starts with no prior knowledge and learns over time. During each run, we sampled the delay of each completed transmission, allowing us to represent how the average transmission delay evolves during a learning episode. As explained in Section III-B, a learning episode for a multi-agent scheme comprises two phases: 1) a link-adaptation (LA) learning phase, and 2) a device selection (DS) learning phase. We have adjusted the duration, in decision steps, of the learning episodes (and their respective phases) to the learning efficiency of each scheme, resulting in the following values:

  • Single agent RL: 200000.

  • MARL: 20000 (LA phase) + 20000 (DS phase).

  • NBLA+RL: 50000 (LA phase) + 20000 (DS phase).

  • MAMBRL: 10000 (LA phase) + 20000 (DS phase).

B. Numerical Results

1) Prioritizing Devices with Longer Connection Times

We first discuss the results corresponding to the first device prioritization strategy, which selects among the $D$ devices with the longest connection times. Figure 3 shows the transmission delays during the learning episodes of a single-agent controller using three deep RL algorithms (DQN, PPO, and A2C). Since the agents start operating without prior knowledge, the initial actions are, in general, inefficient and cause long transmission delays for various reasons (link-adaptation parameters leading to excessive retransmissions, consuming too many resources, or not fitting into the carrier, etc.). In addition, due to the slow learning rate of conventional RL algorithms, devices connect to the system faster than the controller is able to schedule them, so they have to wait increasingly long times for an uplink grant. For this reason, we observe how the delay (which should be on the order of 100 ms) reaches values of 14, 17, and almost 20 seconds with these baselines. When the algorithms gain enough experience, they are able to select efficient actions and begin to drain the backlog of waiting transmissions. From that moment on, the delays start to decrease.

FIGURE 3. Transmission delay of the single-agent RL baselines during learning episodes.

Figure 4 shows the delay performance during the MARL learning episodes. The vertical dotted lines represent the boundaries between the LA and the DS learning phases. Compared to the single-agent case, the MARL architecture reduces the initial transmission delay faster, and keeps it low during the DS phase, except in the case of DQN, whose delay surprisingly increases in this phase.

FIGURE 4. Transmission delay of the MARL baselines during learning episodes.

Figure 5 shows the transmission delay during the NBLA+RL learning episodes. Compared with the previous baselines, the initial delays are considerably smaller and, once the DS phase is initiated, they generally remain between 200 and 400 ms. There is no noticeable difference between the RL algorithms used during the DS learning phase.

FIGURE 5. Transmission delay of the NBLA+RL baseline during learning episodes.

Figure 6 shows the results of our proposal, which is capable of keeping the delay close to 100 ms even at the early stages of the learning episode. As with NBLA+RL, the performance during the DS learning phase is similar for the three RL algorithms. Overall, MAMBRL clearly outperforms the previous baselines in terms of the quality of service experienced by the devices during online learning. One of the reasons for this performance is the remarkable sample efficiency of the two online classifiers ($h^{(rx)}_{t}$ and $h^{(unfit)}_{t}$) used by our proposed model-based agent. Figure 7 shows that these classifiers require very few samples (each NPUSCH is a sample) to attain their highest accuracy levels. The maximum accuracy of the packet reception predictor $h^{(rx)}_{t}$ is lower because of the uncertainty caused by signal fading (note that no channel state information is exchanged with the devices), but it is sufficient for an efficient selection of link-adaptation parameters.

FIGURE 6. Transmission delay of the MAMBRL proposal during learning episodes.

FIGURE 7. Accuracy of the MAMBRL classifiers over consecutive NPUSCH transmissions.

In the learning episodes shown in the figures above, we observe that, after a sufficient number of NPUSCH transmissions, all the evaluated control schemes reach a steady state in which the transmission delay remains close to a level that can no longer be improved. Figure 8 shows, for each control mechanism, the distribution of the delay samples during the learning phase (the initial 10000 transmissions) and during the steady state (beyond 20000 accomplished transmissions). For this figure, we have selected the best-performing RL algorithm for each architecture. Two conclusions can be drawn: first, the task division used by the multi-agent architecture does not reduce the long-term performance compared to the single-agent RL strategy. Second, in the long term, both model-based and model-free approaches perform similarly; thus, the advantage of the model-based strategy lies in its superior sample efficiency, which makes MAMBRL especially effective for online learning.

FIGURE 8. Comparison of the control algorithms during the learning phase (initial 10000 transmissions) and in steady state operation (beyond 20000 accomplished transmissions).

2) Prioritizing Devices with Higher Pathloss

We now assess the performance when the observation contains the $D$ devices with the highest pathloss. Figures 9 and 10 show the learning episodes for the single-agent RL and MARL architectures, respectively, under this user selection strategy. On the negative side, one of the baselines (PPO) struggles to learn an effective policy during the episodes. On the positive side, the transmission delays do not reach values as high as in the previous experiments. One explanation may be that, in the early stages of the learning episodes, the observations received by the agents are more similar to each other (high-pathloss devices), which is equivalent to learning in a smaller state space. In addition, these observations have a more pessimistic bias, so the initially learned policies tend to select low-rate LA parameters that are also effective for devices with lower pathloss.

FIGURE 9. Transmission delay of the single-agent RL baselines (prioritizing devices with higher pathloss).

FIGURE 10. Transmission delay of the MARL baselines (prioritizing devices with higher pathloss).

Figure 11 shows that the delay performance of the NBLA+RL scheme degrades when using the pathloss-based device prioritization scheme, resulting in transmission delays ranging between 200 ms and 800 ms, independently of the RL algorithm used in the DS learning phase. In contrast, our MAMBRL proposal does not experience any noticeable degradation when adopting this device prioritization scheme, as shown in Figure 12. This suggests that, with an effective controller, the number of devices waiting for an uplink grant remains so small that the criterion used to include them in $S_{t}$ is no longer relevant, i.e., most of the time there are fewer than $D$ backlogged devices.

FIGURE 11. Transmission delay of the NBLA+RL baseline (prioritizing devices with higher pathloss).

FIGURE 12. Transmission delay of the MAMBRL proposal (prioritizing devices with higher pathloss).

Finally, Figure 13 shows the delay distributions of the control schemes in the two phases of operation. In steady state we see that, while single-agent RL and NBLA+RL perform slightly worse than with the previous device prioritization criterion, MARL and MAMBRL still keep their transmission delays close to 100 ms.

FIGURE 13. Comparison of the control algorithms during the learning phase (initial 10000 transmissions) and in steady state operation (beyond 20000 accomplished transmissions).

To verify these results for different traffic intensities, we repeated the above experiments (for both device prioritization strategies) with a number of devices ranging from 1000 to 2500. As expected, the proposal showed consistent performance, maintaining the lowest delay during the learning phase under any traffic intensity and matching the performance of MARL in steady state.

SECTION VI.

Conclusion and Future Work

In this paper we have addressed the scheduling of uplink transmissions in NB-IoT, including the selection of link-adaptation parameters, the allocation of time-frequency resources in the uplink carrier, and device selection. Our goal was to develop a control mechanism capable of learning autonomously in an operating network without any prior knowledge of the system (online learning), while preventing harmful degradation of the quality of service experienced by the devices during the learning process. The proposed mechanism is based on two principles: 1) a multi-agent architecture, in which two agents coordinate to sequentially learn the link-adaptation and device selection policies, respectively; and 2) a model-based approach for the link-adaptation agent, especially tailored for high sample efficiency. The proposed structure for this agent, based on two online classifiers, has no precedent in the related literature. The proposal does not introduce configurable parameters other than those of the classifiers, which in our experiments did not require fine-tuning. Our experiments have shown that the multi-agent architecture is able to improve the performance of conventional RL algorithms, but it is not able to avoid a steep increase in the transmission delay during the initial stages of learning. In contrast, when our model-based agent is introduced, the transmission delay does not experience any noticeable increase during learning. Note that, when using state-of-the-art deep RL agents such as PPO or A2C, the delay can rise to more than 150 times that of our proposal.

One limitation of our model-based approach is its specialization. In our case, the model-based agent is specialized in link adaptation, and it is not straightforward to generalize it to control tasks not contemplated by its model, such as the configuration of the NPRACH parameters. To this end, new agents (either model-free or model-based) could be integrated into the multi-agent architecture. We believe that this approach, supported by the promising results of our proposal, opens two new lines of research in the area of NB-IoT control (and networking environments in general): investigating the potential of the multi-agent architecture, and designing new model-based agents.
