Learning Random Access Schemes for Massive Machine-Type Communication with MARL

In this paper, we explore various multi-agent reinforcement learning (MARL) techniques to design grant-free random access (RA) schemes for low-complexity, low-power, battery-operated devices in massive machine-type communication (mMTC) wireless networks. We use value decomposition networks (VDN) and QMIX algorithms with parameter sharing (PS) under the centralized training and decentralized execution (CTDE) paradigm while maintaining scalability. We then compare the policies learned by VDN, QMIX, and deep recurrent Q-network (DRQN) and explore the impact of including the agent identifiers in the observation vector. We show that the MARL-based RA schemes can achieve a better throughput-fairness trade-off between agents without having to condition on the agent identifiers. We also present a novel correlated traffic model, which is more descriptive of mMTC scenarios, and show that the proposed algorithm can easily adapt to traffic non-stationarities.


I. INTRODUCTION
The mMTC paradigm is a key component of 5G and will continue to be important in the development of 6G technologies [1]. As the number of Internet of Things (IoT) devices grows, millions of devices with characteristics different from human-type communication will require connectivity [2], [3]. To support mMTC in LTE-A, 3GPP has developed narrowband IoT (NB-IoT) and LTE Machine-type communication (LTE-M) [4], which fall under the category of low-power wide-area networks (LPWANs). In addition to these cellular standards, non-cellular LPWAN standards such as Sigfox [5] and LoRa [6] have also been developed. 3GPP's Rel-17 introduces 'NR-Light', a new class of devices that is more capable than NB-IoT or LTE-M and supports a bandwidth larger than NB-IoT/LTE-M but smaller than that of 5G NR devices. In this paper, we focus on low-power, low-complexity machine-type devices (MTDs) with low data rates (around 1-100 kbps), whose communication is mostly uplink dominated. These devices are also low-cost and battery-operated, with long battery life and sporadic activity. Managing medium access for these devices is challenging, and future wireless communication systems will need to provide massive connectivity to meet these needs.
For devices with such characteristics, grant-free RA schemes are preferred, since scheduled access incurs a huge signaling overhead [7]. However, RA schemes are prone to collisions and scale poorly. Traditionally, RA schemes such as exponential backoff (EB) [8] employ a back-off mechanism at each device to update its transmit probability based on the feedback from the receiver. These schemes are relatively simple and decentralized; however, their performance depends on various assumptions, such as the traffic arrival process and whether the buffers are saturated. Additionally, the optimal back-off factor is not fixed and may vary with the system parameters [9]. One drawback of EB schemes such as binary exponential backoff (BEB) is the capture effect, where a group of devices occupies the channel for a period of time, depriving other devices of access and making the technique unfair. Our goal is to design RA schemes for mMTC that not only provide better throughput but are also fair.
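As a concrete baseline, the BEB back-off mechanism described above can be sketched in a few lines. The contention-window bounds `cw_min`/`cw_max` and the class structure are illustrative assumptions for a slotted channel with per-slot collision feedback, not parameters taken from any specific standard.

```python
import random

class BEBDevice:
    """Minimal sketch of binary exponential backoff (BEB): on a collision
    the contention window doubles (back-off factor 2), on a success it
    resets; the device transmits when its backoff counter reaches zero."""

    def __init__(self, cw_min=2, cw_max=1024):
        self.cw_min, self.cw_max = cw_min, cw_max
        self.cw = cw_min
        self.backoff = random.randrange(self.cw)

    def wants_to_transmit(self):
        # Transmit in this slot only when the backoff counter is zero.
        return self.backoff == 0

    def step(self, collided):
        """Advance one slot; `collided` is the receiver feedback for the
        device's own transmission (ignored while still counting down)."""
        if self.backoff > 0:
            self.backoff -= 1
            return
        if collided:
            # Double the contention window, capped at cw_max.
            self.cw = min(2 * self.cw, self.cw_max)
        else:
            # Successful transmission: reset the window.
            self.cw = self.cw_min
        self.backoff = random.randrange(self.cw)
```

The capture effect mentioned above arises because a device that just succeeded returns to the smallest window and is likely to win the channel again, while repeatedly colliding devices back off for longer and longer.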
Reinforcement learning (RL) algorithms have become a popular method for learning RA policies in wireless networks. These algorithms can adapt to changes in the environment and use past history to learn the transmission probabilities of devices in a decentralized manner. However, many RL solutions are not tailored to the traffic and device characteristics of mMTC systems (details in Section V), and they also struggle to scale to large numbers of devices, which is a critical concern in RA schemes for mMTC. This is because mMTC devices often have low computational power and rely on batteries, making it impractical to perform learning at each device. To the best of our knowledge, the scalability issue has not been adequately addressed in previous RL or MARL studies on RA schemes, and it is unclear whether the proposed techniques can handle a large number of devices. MARL has several advantages over traditional EB policies, as it allows the design of multi-objective policies in a decentralized manner, which is not analytically tractable using traditional methods.
Therefore, the objective of this paper is to design grant-free RA schemes for mMTC using MARL to achieve fairness, adaptability to changes in traffic, centralized learning with decentralized execution, and scalability to a large number of devices. Our contributions are listed in the following.
• We present a system model to learn schemes in which the devices can leave and join the network randomly. Unlike most other works on RA with RL, e.g., [10], [11], we do not assume that all the devices in the network have packets.
• We use broadcast feedback to reduce the signaling overhead and improve energy efficiency. We assume that the feedback is only sent to the active devices and, to save energy, the inactive devices do not listen to the feedback signal.
• We present a suitability report of MARL algorithms for our proposed environment. Since we want a single policy for all the devices that can be learned in a CTDE manner, we provide a comparison between some well-known MARL algorithms and discuss how they may or may not be suitable for our environment. We propose the VDN and QMIX algorithms to achieve our objectives. We present our simulation results for VDN in a multiple-user, multiple physical resource block (PRB) environment and also compare the VDN, QMIX, and DRQN policies.
• Most MARL algorithms that employ CTDE include an agent-specific identifier in the observation vector of the agents. In the case of mMTC, the devices should be able to leave/join the network and the policies should be scalable to a large number of devices. For these reasons, incorporating agent/device identification (ID) is not feasible. We also show how the algorithm distributes resources among MTDs fairly when agent IDs are not incorporated, and how the algorithms learn an unfair policy when we use agent IDs.
• We present our results for regular or periodic traffic arrivals, in which each MTD receives packets following an independent random process. In addition, we present a correlated traffic arrival model in Section IV that is more suitable for mMTC systems. In the correlated traffic arrival model, the devices follow both regular traffic arrivals and event-driven (ED) traffic arrivals. In ED traffic, which is independent of the regular traffic arrivals, a subset of MTDs become active together whenever an event happens. We show that our proposed algorithm adapts to different traffic conditions.

II. RELATED WORK
The application of RL to channel access problems in wireless communications dates back to 2010, when tabular Q-learning was used [12]. However, it has become popular in recent years due to advancements in deep reinforcement learning (DRL). In [13], the authors considered a multiple access problem in which the agents are base stations that predict the future state of the system. They use a recurrent neural network (RNN) and the REINFORCE algorithm to learn policies for each agent. In [14], the ALOHA-Q protocol is proposed for a single-channel slotted ALOHA scheme that uses an expert-based approach in RL. The goal in that work is for nodes to learn in which time slots the likelihood of packet collisions is reduced. However, ALOHA-Q depends on the frame structure, and each user keeps and updates a separate policy for each time slot in the frame.
In [15], ALOHA-Q is enhanced by removing the frame structure. However, every user still has to keep a number of policies equal to the window of time slots it is going to transmit in. Other works such as [10], [11], [16], [17] consider RL-based multiple access for multiple channels. In [10], [11], [18], the deep Q-network (DQN) algorithm is used for multiple-user, multiple-channel wireless networks. In [16], another DRL algorithm, known as actor-critic DRL, is used for dynamic channel access. All of these works train agents under the assumption that every device always has a packet in its buffer (saturated state). Moreover, it is not clear whether their algorithms can be scaled to a higher number of agents. Interestingly, these works also do not compare their results with any backoff techniques such as EB. In [19], the authors propose an RA procedure for delay-sensitive applications using the context IDs of the devices along with the two-step RA procedure. This is done by predicting the traffic of the devices. An RA strategy for initial access (four-message exchange) to allocate resources is proposed in [20]. The authors assume that each device also reports its energy level and access delay to the centralized receiver. Therefore, the signaling overhead in this work is high for massive access and it is not energy efficient. Recently, an RA protocol for initial access was proposed in [21], where results were shown for both regular and bursty traffic arrivals. An RL-based strategy has been proposed in [22] for the correlated traffic model. In our previous work [23], we used DQN with a single resource for devices following a Poisson traffic arrival process, and in [24] we showed how DQN with PS scales for bursty traffic. In [25], a heterogeneous environment is considered in which an RL agent learns an access scheme in co-existence with slotted ALOHA and time division multiple access (TDMA) schemes. In [26], the access class barring (ACB) mechanism has been optimized for NB-IoT using DRL. A multiple access algorithm is designed using actor-critic MARL in [27].

III. SYSTEM MODEL AND PROBLEM FORMULATION
We consider a synchronous time-slotted wireless network with a set N = {1, . . ., N} of MTDs, a set M = {1, . . ., M} of shared orthogonal PRBs, and a receiver, as shown in Fig. 1. The physical time is divided into slots, each of duration 1, and the slot index is k ∈ N. At each time slot, we assume that only a subset N_a ⊆ N of devices is active, and the activity pattern follows a random process. Each active MTD transmits over the shared PRBs in a grant-free manner. At each time slot k, an MTD can transmit only one packet, and only over one resource m ∈ M. The MTDs are assumed to have perfect synchronization. Moreover, each MTD is equipped with a buffer to store the packets in its queue, and each device n can store at most one packet. The buffer state at time k is defined as B_n(k) ∈ {0, 1}, where B_n(k) = 1 if there is a packet in the buffer and 0 otherwise. If the buffer is full, new packets arriving at device n are discarded and considered lost. Each device becomes active whenever a packet is generated at the device following one of the traffic arrival models given in Section IV. At each time slot k, MTD n takes an action, where A_n(k) = 0 corresponds to the event when user n chooses not to transmit and A_n(k) = m corresponds to the event when user n transmits a single packet on channel m for 1 ≤ m ≤ M. If only one user transmits on channel m in a given time slot k, the transmission is successful, whereas a collision happens if two or more devices transmit on the same channel in the same time slot. The collided packets are discarded and need to be retransmitted until they are successfully received at the receiver.
For feedback, we consider a broadcast feedback signal F(k) from the receiver that is common to all the devices. Formally, we define F(k) = [F_1(k), . . ., F_M(k)], where F_m(k) stands for the feedback corresponding to channel m at time slot k, indicating whether the channel was idle, carried a successful transmission, or experienced a collision. Let us define the binary set B = {0, 1}. We define the success and collision events for user n as functions G_n,m(k) ∈ B and C_n,m(k) ∈ B of the feedback, which are locally computed by each device. Formally, for user n and ∀m ∈ M, the success event G_n,m(k) = 1 occurs if A_n(k) = m and F_m(k) indicates a success, and the collision event C_n,m(k) = 1 occurs if A_n(k) = m and F_m(k) indicates a collision. Since each device can only transmit on one resource at each time slot, the indicator of whether the transmission of device n on any resource has been successful can be written as G_n(k) = Σ_m G_n,m(k) ∈ {0, 1}, and similarly for collisions, C_n(k) = Σ_m C_n,m(k) ∈ {0, 1}. Furthermore, we can define N × M matrices G(k) = [G_n,m(k)] and C(k) = [C_n,m(k)] for the success and collision events, respectively. We assume that each user keeps a record of its previous actions, the feedback, and its current buffer state B_n(k) up to h past instants, where we refer to h as the history length. Therefore, the tuple S_n(k) of the h most recent actions and feedback values of user n together with B_n(k) is referred to as the local history or the state of user n at time k, and S(k) = [S_1(k), . . ., S_N(k)] is the global history of the system.
The feedback signal F(k) is only recorded by the devices that are active at time k − 1. If a new device becomes active at time k, its local history S_n(k) is initialized with A_n(k − 1) = 0 and F_m(k − 1) = 0, ∀m ∈ M. The memory is initialized with zero values. Moreover, we set zero values for the time a device has been inactive if the period of inactivity is smaller than the history size.
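The locally computed success and collision indicators can be illustrated with a small sketch. The three-valued per-PRB feedback encoding (`IDLE`/`SUCCESS`/`COLLISION`) below is an assumption for illustration; the system model only requires that each active device can tell from F(k) whether its chosen PRB carried a success or a collision.

```python
IDLE, SUCCESS, COLLISION = 0, 1, 2  # assumed per-PRB feedback encoding

def local_events(action, feedback):
    """Locally compute (G_n(k), C_n(k)) for one device from its own
    action A_n(k) in {0, 1, ..., M} and the broadcast feedback vector
    F(k) = (F_1(k), ..., F_M(k)); no inter-device communication needed."""
    M = len(feedback)
    G = [0] * M   # per-PRB success indicators G_{n,m}(k)
    C = [0] * M   # per-PRB collision indicators C_{n,m}(k)
    if action > 0:                 # the device transmitted on PRB `action`
        m = action - 1
        if feedback[m] == SUCCESS:
            G[m] = 1
        elif feedback[m] == COLLISION:
            C[m] = 1
    # A device transmits on at most one PRB, so the sums are 0/1-valued.
    return sum(G), sum(C)
```

Note that a silent device (action 0) always reports (0, 0), even if other devices collided on some PRB; the indicators refer only to the device's own transmission.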

Definition 1:
A policy or access scheme of user n at time slot k is a mapping from S_n(k) to a conditional probability mass function π_n(·|S_n(k)) over the action space A = {0, 1, . . ., M}. We consider a distributed setting in which there is no coordination or message exchange between users for channel access. Each new action A_n(k) ∈ A is drawn at random as A_n(k) ∼ π_n(·|S_n(k)). We are interested in developing a distributed transmission policy for slotted RA that can effectively adapt to changes in the traffic arrivals and provide better performance in terms of throughput, latency, and fairness than the baseline reference schemes. We consider EB policies as our baseline schemes.
More specifically, we use BEB, where the value of the back-off factor is 2, which has been used in the IEEE 802.11 and IEEE 802.3 standards.

A. PERFORMANCE METRICS

1) Throughput
The channel throughput is defined as the average number of packets that are successfully transmitted by all the devices, divided by the total number of PRBs, over a time window of size K. For the finite time horizon K and for M orthogonal resources, the average throughput of the system is defined as

T = (1/(K M)) Σ_{k=1}^{K} Σ_{n=1}^{N} Σ_{m=1}^{M} G_n,m(k),

where G_n,m(k) refers to the success event over channel m and T ∈ [0, 1].

2) Age of Packets
The age of packet (AoP)¹ of device n, denoted as w_n(k), grows linearly with time while a packet stays in the buffer of the device, and it is reset to 0 when the packet is transmitted successfully. Specifically, we assume that w_n(1) = 0, and the AoP w_n(k) evolves over time as

w_n(k + 1) = 0 if G_n(k) = 1, and w_n(k + 1) = w_n(k) + B_n(k) otherwise.

The average AoP for user n after a time span of K time slots is given by ∆_n = (1/K) Σ_{k=1}^{K} w_n(k), and the average AoP of the overall system by ∆ = (1/N) Σ_n ∆_n. Techniques such as EB incur the capture effect [8], where a transmitting device keeps transmitting on the channel for some time, introducing short-term unfairness. In this work, we use the average AoP to measure fairness as well as the average delay budget of the packets. A higher AoP means the scheme is more unfair and has a higher delay, and vice versa. To illustrate the concept of fairness with AoP, let us consider an example of 3 users, where each user generates a single packet that is successfully transmitted within K = 10 time slots, as depicted in Fig. 2. The average delay (the number of time slots taken by each user to transmit its packet) is the same for both the fair and unfair schemes, namely 4 time slots. However, the scheme shown in Fig. 2a is clearly not fair, since user 2 takes many more time slots to send its packet than the other two users. This short-term unfairness can be captured with the average AoP: the average for the scheme in Fig. 2a is higher (i.e., 1.23) than for the one shown in Fig. 2b, which is 1.00.

¹This metric has a different connotation from age of information (AoI).
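The two metrics can be sketched as follows, assuming the per-slot success and buffer indicators are available; the function names are hypothetical, and the AoP recursion follows the grow-while-buffered, reset-on-success rule described above.

```python
def throughput(G, M):
    """Average throughput T over a window: G[k][n] = 1 iff device n
    succeeded (on any PRB) in slot k; M is the number of PRBs."""
    K = len(G)
    return sum(sum(row) for row in G) / (K * M)

def aop_trace(buffer_full, success):
    """Evolve one device's age of packet w_n(k): it grows while a packet
    waits in the buffer and resets to 0 on a successful transmission."""
    w, trace = 0, []
    for b, g in zip(buffer_full, success):
        if g:
            w = 0          # packet delivered: age resets
        elif b:
            w += 1         # packet still waiting: age grows
        trace.append(w)
    return trace
```

Averaging `aop_trace` over the window gives ∆_n, and averaging ∆_n over devices gives the system AoP used as the fairness measure.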

IV. TRAFFIC MODELS

A. REGULAR TRAFFIC MODEL
In traditional RA schemes, the activation of MTDs and the traffic arrival for each user are usually modeled by independent processes. We call such traffic arrivals regular traffic arrivals. In the regular traffic model, each device follows an independent Bernoulli process with average arrival rate λ_n to generate packets. The average arrival rate for the system can then be written as λ = Σ_{n=1}^{N} λ_n. However, for machine-type communication (MTC), it is highly likely that some devices are correlated in terms of activation, i.e., some devices observe the same physical phenomenon and activate together. For instance, in industrial fault detection or fleet management, some MTDs are highly likely to transmit at the same time due to the activation of a certain event. Similarly, in flood, earthquake, or landslide detection, there is a high probability that the devices closer to the event will start transmitting at once. Several recent works have used a correlated activity model to design access schemes for MTC [22], [28]-[30]. Therefore, the assumption of independent traffic arrivals is not valid in this case. Moreover, apart from correlated ED device activity, each MTD also follows the regular traffic model [31].
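The regular (Bernoulli) arrival process is easy to simulate; in the sketch below, `lams` holds the per-device rates λ_n, and the seed parameter is only for reproducibility.

```python
import random

def regular_arrivals(lams, K, seed=0):
    """Independent Bernoulli arrivals: device n generates a packet in a
    given slot with probability lams[n], independently across slots and
    devices; returns a K x N 0/1 arrival matrix."""
    rng = random.Random(seed)
    return [[1 if rng.random() < lam else 0 for lam in lams]
            for _ in range(K)]
```

The empirical per-device rate converges to λ_n, and the system arrival rate to λ = Σ_n λ_n, as K grows.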

B. CORRELATED TRAFFIC MODEL
The correlated traffic model is a mix of regular traffic and ED (or alarm) traffic, as depicted in Fig. 3. We assume that the regular traffic generation for each MTD follows an independent random process such as a Bernoulli process. The ED traffic also follows a random process, on top of the regular traffic arrival process. For ED traffic, certain devices are strongly correlated in space and time, and the ED traffic generation for such devices depends on the occurrence of an event in their vicinity. The regular and ED traffic arrival processes are independent of each other.
To formulate this behaviour, we assume that N MTDs are uniformly distributed in a given area. When active, each MTD can be either in a regular state or an alarm state. We consider L event epicentres that are scattered randomly and independently across the given area. The location of a device is represented by x ∈ R² and the location of the epicentre of an event is denoted by y ∈ R². We assume that all MTDs are stationary and fixed at their locations, or exhibit very low mobility.
Let E_xy denote the event that a device n ∈ N at location x is triggered into alarm (ED) mode by the activation of an event with its epicentre at location y. Let Ē_xy be the complement of E_xy, and let p_xy denote the probability of a device at location x being triggered into alarm mode by the activation of an event at location y. Moreover, we define p_x as the probability of the device at location x being in ED mode. Assuming that events are triggered independently of each other, we can write

p_x = 1 − Π_y (1 − p_xy).

Therefore, for the correlated traffic arrival model, each MTD can be either in the alarm (ED) state or the regular state in a given time slot k. We denote by V_x the state of the device at location x and model the states at each time slot k by i.i.d. Bernoulli random variables as

V_x = ED with prob. p_x, and V_x = Regular with prob. 1 − p_x.
Moreover, we denote by p the probability of an event being active at location y. We assume that each event at epicentre y triggers a subset N_y ⊆ {1, . . ., N} of devices. The probability of a device n going into alarm mode depends on the distance of the device from the epicentre y of the event. Furthermore, each MTD n can sense and report multiple events, but at any given time, we assume that it can report only one event. The MTDs are unaware of the actions and events sensed by other devices.
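A sketch of the correlated activation probability follows. The exponential distance decay in `trigger_prob` and its scale `d0` are illustrative assumptions (the text only states that the trigger probability decreases with distance from the epicentre), while `ed_prob` combines the per-event trigger probabilities under the stated independence of event triggers.

```python
import math

def trigger_prob(x, y, d0=1.0):
    """Hypothetical distance-based trigger probability p_xy: a device at
    x is more likely to enter ED mode the closer it is to epicentre y."""
    return math.exp(-math.dist(x, y) / d0)

def ed_prob(x, active_events, d0=1.0):
    """p_x = 1 - prod over active events y of (1 - p_xy), i.e. the
    probability that at least one active event triggers the device."""
    q = 1.0
    for y in active_events:
        q *= 1.0 - trigger_prob(x, y, d0)
    return 1.0 - q
```

Sampling `V_x` as a Bernoulli variable with parameter `ed_prob(x, ...)` in each slot, on top of the regular arrivals, reproduces the correlated activation of nearby devices.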

V. SUITABILITY OF MARL FOR RA SCHEME DESIGN
To design an RA policy for the multiuser MTC environment, it is important to consider the suitability of MARL algorithms in general for scalability, and in particular for the specific characteristics of the MTC system. There exists a large body of literature on medium access using RL in wireless networks, but only a few works address the issue of scalability (e.g., [41] and our recent work [24]). Distributed multiple access schemes designed with MARL do have scalability challenges, and this is further exacerbated by the limitations of mMTC. The MTC system presents the following challenges for MARL algorithms in designing a distributed RA policy:
1) Since the MTDs are low-powered, low-complexity devices that are battery-operated, it is not feasible to perform learning on the devices; therefore, the centralized training and decentralized execution (CTDE) method is required for learning.
2) Usually, the PS method is used for homogeneous agents, and each agent's ID is used in the observation (state) vector to distinguish between agents. Homogeneous agents are those that have the same state space and action space. In our system model, since the devices have variable sleep cycles and should be able to join/leave the network, it is not feasible to use agent identification. Therefore, we require each agent to have a single policy using PS, but at the same time without using agent IDs.
3) Any communication to exchange channel or device information between MTDs is not energy efficient, as it will drain the batteries of the devices. Therefore, the devices do not communicate with each other and do not know the actions taken by the other devices.

TABLE 1. Suitability of well-known MARL algorithms for the proposed system model.

DQN [32]:
• Can be used as independent Q-learning (IQL) or for centralized learning using PS by extending the single-agent network to multiple agents.
• Has convergence issues, and it is nearly impossible to learn an optimal policy for a large number of agents with PS.

MADDPG [33] and MAPPO [34]:
• MAPPO is on-policy and MADDPG is off-policy for the CTDE method.
• Both circumvent the challenge of non-Markovian and non-stationary environments during learning.
• Stabilize learning, due to reduced variance in the value function estimates.
• Need a centralized critic, which is not scalable to a large number of agents.
• The state-space dimensionality grows exponentially for the critic as the number of agents increases.
• Most critically, the noise accumulated from the exploratory actions of other agents makes learning the Q-function no longer feasible [35].

COMA [36]:
• Tackles the credit assignment problem.
• Calculates the advantage function, which can marginalize a single agent's actions while keeping the others fixed.
• Uses a shared critic, so it scales poorly to a higher number of agents, just like MADDPG and MAPPO.

VDN [37] and QMIX [35]:
• Each agent has its own network, and a mixing network takes the Q-values of the agents to compute Q_tot.
• QMIX uses NNs to make the function monotonic, whilst VDN uses a linear combination of the Q-values of each agent.
• Both make use of RNNs.
• Better scalability than centralized critic methods; scalability to a massive number of agents.
• A different policy for each agent; can also be applied to homogeneous agents with PS.
• Information might be lost in this way of centralized training.
• It is not certain whether training with PS without using agent IDs will result in a better policy than DQN.

Mean Field MARL (MF-MARL) [38]:
• Good scalability results; considers the effects of neighbouring agents for each agent to estimate its value function.
• Requires communication between agents; each agent has its own policy to learn.

Multi-Actor-Attention-Critic (MAAC) [39]:
• Uses an attention mechanism to incorporate the effects of important agents, whose actions are given more consideration than others.
• Requires communication among agents; scales better than MADDPG, but results are shown only for a small number of agents.

Soft Actor-Critic [40]:
• An extension of AC methods. Uses the notion of entropy to encourage exploration and to avoid converging to non-optimal policies.
• Can be used for a multi-agent system with PS, local actors, and local critics.
• Has convergence issues just like DQN.
• The performance with centralized training (i.e., a centralized critic and centralized actor, an extension of the single-agent system, just like DQN) is not known.
In Table 1, we provide a comparison of some popular MARL algorithms and their suitability to the proposed system model. Even though there are several MARL algorithms in the literature, we compare only some well-known algorithms that employ the CTDE method to learn policies. The aim of this comparison is not to provide an exhaustive survey of MARL algorithms, but to make a case for using a specific algorithm over the others for the proposed system model. Interested readers are referred to [42], a recent review of different MARL algorithms addressing scalability challenges. Moreover, we focus on DRL algorithms only.
The standard DQN [32] and actor-critic algorithms are extended to multiple agents using PS for homogeneous agents in [43]. This method scales well to a large number of agents, but it does not exploit any advantages of centralisation. Moreover, without using agent IDs and without any cooperation between agents, we have observed in our simulations that these algorithms are not able to provide better policies and have convergence issues. However, this might be an issue for most of the centralized approaches. Our recent works [23] and [24] use DQN with PS, and in [24] we provided results for up to 500 devices for bursty traffic. Similarly, just as DQN can be extended to multiple agents using PS, one can also use the Soft Actor-Critic method [40] with a local actor and a local critic that are shared among all users, where each user's individual observation is used to update the actor and critic at each time step.
Other popular choices for MARL are multi-agent deep deterministic policy gradient (MADDPG) [33] and the recently proposed multi-agent proximal policy optimization (MAPPO) [34]. In both MADDPG and MAPPO, the critic network has a global view of the system, which is only used during the training phase, and an actor network is employed for each agent. A major bottleneck of these algorithms is scalability due to the shared critic network, even if PS is considered. Since the shared critic network takes the observations of all the agents, the size of its observation space grows exponentially with the number of agents. Moreover, in a network where the number of agents changes with time, it is inefficient to use the state space of all agents at the centralized critic. Counterfactual multi-agent policy gradients (COMA) [36] has a similar issue in scaling to a higher number of agents, because a shared critic is used in it as well, just like MADDPG and MAPPO. Some algorithms, such as mean field MARL (MF-MARL) [38] and multi-actor-attention-critic (MAAC) [39], show good scalability results, but the major issue with these algorithms is that they require communication between the agents, which is not practical for our system model. Furthermore, the results for these algorithms are shown for only a moderate number of agents.
VDN [37] and QMIX [35] are both designed for cooperative multi-agent learning, in which the joint action-value Q_tot is estimated from the Q-values of individual agents that condition only on local observations. One of the main differences between these algorithms is that in VDN, Q_tot is calculated as a linear combination of the Q-values of each agent, while QMIX employs a network that can compute Q_tot as a complex non-linear combination of the individual Q-values. This way of learning also provides better scalability compared to MADDPG and MAPPO. QTRAN [44] improves upon both VDN and QMIX and provides a more general form of factorization, but it falls under the same category as VDN and QMIX. For these reasons, we focus on the VDN and QMIX algorithms in our simulations.

VI. RL ENVIRONMENT AND MARL ALGORITHMS

A. THE ENVIRONMENT
We consider shared PRBs where each agent interacts with the resources by taking an action and receiving the common feedback F(k) as an observation. The reward R_n(k) is then calculated by each agent. The action space of each agent is A, and each device can either transmit on channel m, i.e., A_n(k) = m, or stay silent, i.e., A_n(k) = 0. The state of each device is the history tuple S_n(k) defined in (8). Let R_n(k) ∈ R be the immediate reward that user n obtains at the end of time slot k. The reward depends on agent n's action A_n(k) and the other agents' actions A_n′(k), n′ ≠ n. The accumulated discounted reward for user n is defined as

R_n = Σ_{k=0}^{∞} γ^k R_n(k),

where γ ∈ [0, 1) is a discount factor. The reward function to maximize the packet success rate is defined as

R_n(k) = Σ_{n′=1}^{N} G_n′(k).

The summation over agents shows that the reward is global, i.e., all agents receive the same reward, which indicates that the agents are fully cooperative. The environment is partially observable, as each agent is unaware of the actions taken by the other users. At each time slot k, each agent n obtains the feedback F(k) from the receiver, updates its history, and then feeds S_n(k) to the proposed algorithm, whose outputs are the Q-values for all the available actions. Each agent n follows the policy π by drawing an action A_n(k) from the following Boltzmann distribution:

π(A_n(k) = a | S_n(k)) = exp(Q(S_n(k), a)/τ) / Σ_{a′ ∈ A} exp(Q(S_n(k), a′)/τ),

where 0 < τ < ∞ is the temperature parameter, which is used for exploration. We gradually decrease the value of τ towards 0 to make the agents more greedy.
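Boltzmann (softmax) action selection can be sketched as follows; subtracting the maximum Q-value before exponentiating is a standard numerical-stability trick and does not change the distribution.

```python
import math, random

def boltzmann_action(q_values, tau, rng=random):
    """Sample an action from pi(a) proportional to exp(Q(s, a) / tau);
    as tau -> 0 this approaches the greedy (argmax) policy, and a large
    tau spreads probability mass across all actions (exploration)."""
    m = max(q_values)                       # shift for numerical stability
    w = [math.exp((q - m) / tau) for q in q_values]
    z = sum(w)
    r = rng.random() * z
    acc = 0.0
    for a, wa in enumerate(w):
        acc += wa
        if r <= acc:
            return a
    return len(q_values) - 1                # guard against rounding
```

Annealing `tau` towards 0 over training, as described above, smoothly shifts the agents from exploration to exploitation.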

B. DEEP Q-NETWORK (DQN)
DQN represents the action-value function using a neural network characterized by parameters θ. In double DQN, a target network is also used, which is parameterized by θ⁻; these parameters are periodically copied from θ during training.
DQN and its variants use a replay buffer D to store the transitions (s, a, r, s′), where s is the current state and s′ is the next state observed after taking action a and receiving reward r. The learning updates are applied to experience samples (s, a, r, s′) ∼ U(D), drawn uniformly at random as mini-batches of size z from D, by minimizing the loss function

L_i(θ_i) = E_{(s,a,r,s′)∼U(D)} [(y_i − Q(s, a; θ_i))²],

where y_i = r + γ max_{a′} Q(s′, a′; θ⁻) is the target value for the i-th iteration.
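The target and loss computation can be sketched without any deep-learning framework; here `q_net` and `q_target` are stand-ins for the online and target networks (each maps a state to a list of Q-values), and the function names are hypothetical.

```python
def dqn_targets(batch, q_target, gamma=0.99):
    """Compute the targets y_i = r + gamma * max_a' Q(s', a'; theta-)
    for a mini-batch of transitions (s, a, r, s')."""
    return [r + gamma * max(q_target(s2)) for (s, a, r, s2) in batch]

def mse_loss(batch, q_net, targets):
    """Mean-squared TD error between Q(s, a; theta) and the targets,
    i.e. the loss minimized over the mini-batch."""
    err = [(y - q_net(s)[a]) ** 2
           for (s, a, r, s2), y in zip(batch, targets)]
    return sum(err) / len(err)
```

In practice a framework's optimizer would differentiate this loss with respect to θ; the sketch only shows what quantity is being minimized.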
Since the environment is partially observable, the agents can benefit from using an RNN, such as a gated recurrent unit (GRU), that facilitates learning from the previous history. A DQN making use of an RNN is referred to as a DRQN.

C. VALUE DECOMPOSITION NETWORKS (VDN)
VDN [37] takes advantage of centralization and aims to learn the joint action-value function Q_tot(s, a) as a linear value decomposition from the team reward signal, where s is the global state:

Q_tot(s, a) = Σ_{n=1}^{N} Q_n(S_n, A_n).

The loss function of the VDN algorithm can be calculated in the same way as for DQN, i.e.,

L_i(θ_i) = E [(y_i^tot − Q_tot(s, a; θ_i))²],

where y_i^tot = r + γ max_{a′} Q_tot(s′, a′; θ⁻) is the target value for iteration i.
In this way, each agent selects an action locally, based on its own learned Q-value, in a decentralized manner. Moreover, the VDN method employs an RNN (i.e., a DRQN) to calculate the Q-values for each agent.
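The additive decomposition and why it permits decentralized execution can be shown in a tiny sketch: because Q_tot is a sum of per-agent terms, each agent maximizing its own Q_n also maximizes Q_tot.

```python
def vdn_q_tot(per_agent_q, actions):
    """VDN's additive decomposition: Q_tot = sum_n Q_n(S_n, A_n), where
    per_agent_q[n] lists agent n's Q-values and actions[n] its choice."""
    return sum(q[a] for q, a in zip(per_agent_q, actions))

def vdn_greedy_joint(per_agent_q):
    """Decentralized execution: each agent takes the argmax of its own
    Q_n; for an additive Q_tot this joint action is globally greedy."""
    return [max(range(len(q)), key=q.__getitem__) for q in per_agent_q]
```

This per-agent argmax property is exactly what the mixing constraint in QMIX generalizes beyond linear sums.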

D. QMIX
The QMIX [35] algorithm improves upon VDN and can represent a much richer class of action-value functions. QMIX applies the following constraint on the relationship between Q_tot and each individual action-value Q_a to ensure that the mixing network has positive weights:

∂Q_tot / ∂Q_a ≥ 0, ∀a.

Intuitively, if the weight on an individual value function Q_a were negative, less weight would be given to that agent for cooperation. Moreover, as opposed to VDN, Q_tot is calculated in a complex non-linear way. QMIX uses a separate feed-forward neural network as a mixing network that takes the individual agents' outputs and mixes them monotonically to produce Q_tot, enforcing the constraint in (24) [35]. The weights of the mixing network are produced by a separate hypernetwork to ensure that they are non-negative. The loss function is calculated in the same way as given in (22).
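A minimal sketch of the monotonic mixing step follows. The layer sizes and the use of `abs()` to force non-negative weights are illustrative; in the real algorithm the weights come from a hypernetwork conditioned on the global state, but the monotonicity argument is the same.

```python
import math

def qmix_mix(agent_qs, w1, b1, w2, b2):
    """Tiny two-layer QMIX-style mixer: per-agent Q-values are combined
    with non-negative weights (via abs) and a monotone activation (ELU),
    so dQ_tot/dQ_n >= 0 holds for every agent by construction."""
    hidden = []
    for j in range(len(b1)):
        z = sum(abs(w1[n][j]) * q for n, q in enumerate(agent_qs)) + b1[j]
        hidden.append(z if z > 0 else math.exp(z) - 1.0)  # ELU activation
    # Output layer: Q_tot = sum_j |w2[j]| * h_j + b2.
    return sum(abs(wj) * h for wj, h in zip(w2, hidden)) + b2
```

Because every weight applied to an agent's Q-value is non-negative and the activation is monotone increasing, raising any individual Q_n can never decrease Q_tot, which is the constraint in (24).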

VII. SIMULATION RESULTS AND DISCUSSION
In the following experiments, we use the VDN [37] method to learn RA policies for different values of N and M. We use a neural network with two layers of 256 and 64 units before the final layer of size M; when an RNN is used, a GRU layer of 64 neurons is added after the first layer. The parameters used during the training of the network are presented in Table 2. In all the experiments, we use experience replay to accumulate each agent's experience, and the learning is performed with the CTDE method. An agent's ID is a one-hot encoded vector that is appended to the observation S_n(k) of each agent. We first provide results for regular traffic, in which we also compare different MARL algorithms, such as QMIX and DRQN, with VDN, and then we present our results for the correlated traffic arrival model.

A. RESULTS FOR REGULAR TRAFFIC
For regular traffic, we train all algorithms in two ways: (i) using agents' IDs in the observation vector, which is common practice for CTDE, and (ii) without using them. We denote by IDs = 0 the case when agent IDs are not used and by IDs = 1 the case when we incorporate agent IDs in the observation space. We show the results in terms of average throughput (normalized reward) and AoP. The learning process and how the average throughput of the system increases for different values of N are shown in Fig. 4. During training, we use K = 2000, 3000, and 5000 time slots per episode for N = 8, 16, and 50, respectively. We increase K per episode as N grows to allow better learning for each value of N. Moreover, M = 2 is used for N = 8 and 16, and M = 5 is used for N = 50. The average throughput and average AoP after testing are shown in Table 3. Clearly, the IDs = 1 case outperforms BEB, the theoretical slotted-ALOHA throughput (i.e., 1/e), and the IDs = 0 case. The IDs = 0 case provides much lower average throughput, and as the number of devices N grows, the average throughput decreases further, which is not surprising because agents decrease their transmit probability as N grows.
Moreover, we assess the fairness of the learned policy in two different ways, as depicted in Fig. 5: first, by the number of successful packets per user (Fig. 5a), and second, by the AoP of individual users, which reflects both packet delay and fairness (Fig. 5b). We observe that using IDs incurs significant unfairness among devices: a subset of MTDs are starved and never get a chance to send a packet. Surprisingly, the no-IDs case provides much better fairness among devices. This is due to the inherent way MARL algorithms with CTDE behave. When we use IDs in the state space, devices are only concerned with achieving a better total throughput; but when we do not use IDs, no coordination exists during centralized training that could allow the MTDs to reach an unfair consensus.
To understand this, let us take the case of N = 8. For VDN with IDs = 1, it is clear that some devices have sent 0 packets while a few devices have sent most of the packets (the 95th percentile is 1070 and the 25th percentile is around 443). Similarly, in Fig. 5b for N = 8, the average AoP value (the mean point) and the distribution of AoP across devices show a significant difference between the 75th percentile (627) and the 25th percentile (0.8). On the other hand, for VDN with IDs = 0, the number of successful packets sent by each device is close to the mean value (493) at all percentiles, i.e., the 75th percentile is around 500 and the 25th percentile is around 486. Similarly, the average AoP shown in Fig. 5b and in Table 3 for VDN with IDs = 0 is much lower (5.5), which is an indication of fairness. Similar conclusions can be drawn for N = 16 and N = 50.
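The percentile-based fairness check used above can be computed directly from the per-device packet counts (a small helper sketch; the function name and returned keys are ours):

```python
import numpy as np

def fairness_summary(packets_per_device):
    """Percentile summary of successful packets per device: a policy
    is fairer when the 25th and 75th percentiles lie close to the
    mean, and unfair when they are far apart."""
    p = np.asarray(packets_per_device, dtype=float)
    return {"mean": p.mean(),
            "p25": np.percentile(p, 25),
            "p75": np.percentile(p, 75),
            "p95": np.percentile(p, 95)}
```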
The IDs = 0 case allows the agents to update their policies as if they were a single agent (hence a single policy). This is unlike the IDs = 1 case, in which agents behave differently even when their state spaces are identical. In this way we achieve a better trade-off without using agent IDs, which is also scalable and allows devices to join or leave the network without identification. These plots also show that the IDs = 1 case outperforms BEB in terms of average throughput, but BEB has a better average AoP than the IDs = 1 case. The average throughput of BEB is higher than that of the IDs = 0 case for higher values of N, but the no-IDs case exhibits a much better throughput-fairness trade-off, as evident from the average AoP values. The case where agent IDs are used is the most unfair because the reward signal only rewards maximizing the throughput.
Obviously, one could learn a fair policy by designing a reward function that enforces fairness among devices even when agent IDs are incorporated; however, such a scenario is not of interest in this paper. Both VDN and QMIX use mixer networks to calculate the total Q-value Q_tot and exploit the benefits of centralized learning, whereas DRQN takes no such advantage of centralized training. For this reason, both QMIX and VDN learn a policy that maximizes the throughput when agent IDs are incorporated (IDs = 1), with QMIX outperforming VDN. Interestingly, DRQN learns only a slightly better policy with IDs = 1 than with IDs = 0, again because, unlike VDN and QMIX, it takes no advantage of centralization. However, in Fig. 7 we can clearly see that QMIX and VDN learn an unfair policy when IDs = 1, VDN being more unfair than QMIX. On the other hand, DRQN in this case has lower AoP and a much fairer policy than VDN and QMIX. This does not imply that DRQN is a better algorithm than VDN and QMIX: in fact, QMIX outperforms VDN, and both outperform DRQN, as far as the objective (maximizing throughput) is concerned. Another interesting observation is that the learned policies are very similar across all the algorithms in the IDs = 0 case, as is evident from both Fig. 6 and Fig. 7. They are fair, but it seems that exploiting the advantages of centralization without using agent IDs does not provide significant improvements.

B. RESULTS FOR CORRELATED TRAFFIC
In this work, we are interested in designing an access policy for MTDs deployed in an area where, whenever an event l happens, the MTDs in the vicinity of the event (i.e., those closer to its epicenter) become active in a correlated manner. To model this, we calculate the probability p_xy that a device at location x becomes active due to an event with epicenter y as a function of the distance d_xy = ‖x − y‖ and a threshold distance d_th. We assume that the events are atomic in nature, i.e., if an event becomes active in time slot k, it activates the MTDs within d_th.
We assume that each MTD activated by an event has one packet to transmit and that the MTDs remain active until their packets are successfully transmitted.
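A single time slot of this traffic model can be sketched as follows. This is a hedged illustration, not the paper's exact generator: consistent with the "atomic events" assumption, we let an active event deterministically activate every MTD within d_th of its epicenter (the paper's p_xy may be more general), with each of the L epicenters firing as an independent Bernoulli(λ/L) process and regular traffic arriving as Bernoulli(λ_reg) per device:

```python
import numpy as np

def step_traffic(positions, epicenters, lam, lam_reg, d_th, rng):
    """One slot of traffic generation: regular Bernoulli arrivals per
    MTD, plus correlated bursts from event epicenters that activate
    all MTDs within the threshold distance d_th (our simplification)."""
    n = len(positions)
    arrivals = rng.random(n) < lam_reg          # regular traffic
    for y in epicenters:
        if rng.random() < lam / len(epicenters):  # event fires, p = lam/L
            d = np.linalg.norm(positions - y, axis=1)
            arrivals |= d <= d_th               # correlated burst
    return arrivals
```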
For correlated traffic arrivals, we consider the example shown in Fig. 3, in which N = 20 MTDs are randomly distributed in a rectangular area and follow a Bernoulli process with average arrival rate λ = 0.3 for regular traffic. We consider L = 3 event epicenters; the MTDs belonging to each epicenter are given in Table 5. The events become active following another independent Bernoulli process with average event activation probability λ such that p = λ/L. We train and test for different values of λ, as shown in Table 4. The training for each λ is performed over 60 episodes with K = 2000 time slots per episode. Since we want to learn the same policy for each agent, we do not use agent IDs for the correlated traffic case, and we only consider VDN with IDs = 0, since we have seen that VDN performs better than QMIX and DRQN. Moreover, we do not consider whether event-driven traffic has any priority over regular traffic: all events are of the same nature, and the same reward function as for regular traffic is used. Fig. 8 shows the average throughput and average AoP for different values of λ. The value λ = 0 means that there is only regular traffic, and we see in Fig. 8 that as λ increases, both the average throughput and the average AoP increase, which is natural: with more traffic, more packets are successful, but they require more time to be transmitted. By further zooming in on individual MTDs, we want to observe how each MTD behaves in the correlated traffic scenario. In this case, we do not incorporate agent IDs in the state space of agents. Fig. 9 shows the number of successful packets and the AoP per device, comparing the correlated traffic scenario with the regular traffic scenario. Clearly, only the users that are involved in reporting an event have higher throughput as well as higher AoP.
Obviously, when a few users become active together, they take more time to resolve collisions and send their packets successfully. The purpose of these plots is to show that the RL-based algorithms adapt to traffic changes: the devices not involved in reporting any events have AoP and packet success rates similar to the regular traffic case, and only the MTDs belonging to events change their policies. The baseline BEB does not adapt to whether the devices are involved in events. The throughput of BEB is high for devices that receive more packets, which is not surprising, but the AoP plot in Fig. 9b shows that, under BEB, some devices that are not involved in reporting any events have higher AoP than under the proposed algorithm.

C. SCALABILITY AND ROBUSTNESS ANALYSIS
We compare the learned policies of VDN, QMIX, and DRQN for the no-IDs case and show how robust the policy learned by each algorithm is when scaled to a higher number of devices. The performance of each algorithm in terms of average throughput is shown in Fig. 10. We denote by N_tr the number of devices during training and by N_test the number of devices during testing. The average arrival rate of the system is λ = 0.3. The results are simulated for 3 different random seeds, and the best performances are shown in Fig. 10. We show that VDN has a more robust policy than both QMIX and DRQN. For λ = 0.3, the policy learned for N_tr = 4 performs the same for N_test = 4, 8, and 16, and the throughput starts dropping after that. This is because the policy learned for N_tr = 4 has a higher per-device arrival rate λ_n = λ/N_tr; as the number of devices grows, collisions are not resolved and the average throughput drops to almost 0. On the other hand, a policy learned for a relatively higher number of devices, such as N_tr = 16, is robust for any number of devices below N_tr and also scales to a large number of devices.
Intuitively, when N_tr = 4 and λ = 0.3, λ_n is higher than it would be for a larger N_tr; in other words, the MTDs observe packet arrivals more frequently. The devices therefore learn to transmit more aggressively to empty their buffers, and such a policy does not perform well for a very large number of devices.
Furthermore, the QMIX policy performs worse than both VDN and DRQN. We also observe that QMIX without agent IDs is neither as effective nor as robust as VDN.

VIII. CONCLUSION AND FUTURE WORK
In this paper, we have proposed MARL-based RA schemes for a multi-user, multi-channel mMTC network, using the broadcast feedback commonly received by all the MTDs. We demonstrated that incorporating agent IDs is not suitable for mMTC systems, as we aim to design a fair and scalable scheme. We have shown that, even when the optimization objective is to maximize throughput, omitting agent IDs yields a fair use of resources by each agent and thus a better throughput-fairness trade-off than both the BEB scheme and the case when IDs are used. This is supported by our analysis of the distribution of successful packets per user and of the average AoP. We presented a scalability analysis of the proposed algorithms with agent IDs omitted during learning, and we conclude that our system scales well for lower average arrival rates and that the learned policy is more robust for VDN than for QMIX and DRQN. Moreover, we have assessed the suitability of several popular MARL algorithms and shown that VDN and QMIX take advantage of centralization and are better suited for designing RA schemes for mMTC. We have demonstrated that RL-based algorithms can adapt to changes in traffic, whereas EB schemes are not aware of such changes. We have used a correlated traffic arrival model alongside regular traffic, for which we show that the users learn the correlation and adapt to the changes in traffic.
For the correlated traffic, we have assumed that all events are of the same nature and have the same priority. Future work could consider prioritizing events, so as to learn a scheme in which high-priority devices send their packets with low latency. Moreover, for a system where the devices are fixed and known, one could use agent IDs and design a reward that is fair and exploits the correlation between devices even better. Furthermore, while omitting IDs makes the scheme fair, the system throughput goes to zero for a very large number of devices; therefore, there may be a need to design algorithms with some coordination among devices or groups of devices.
A. VALCARCE (Senior Member, IEEE) is Head of Department of Wireless AI/ML at Nokia Bell Labs, France. His research focuses on the application of machine learning techniques to L2 and L3 wireless problems for the development of technologies beyond 5G. He is especially interested in the potential of multi-agent reinforcement learning for emerging novel L2 signaling protocols, as well as in the use of Bayesian optimization for RRM problems. His background is in cellular networks, computational electromagnetics, optimization algorithms, and machine learning.

FIGURE 2 :
FIGURE 2: An example of using AoP for fairness.

FIGURE 3 :
FIGURE 3: MTC network depicting N = 20 MTDs uniformly distributed in a rectangular area with L = 3 event epicenters. MTDs follow regular traffic, and those within the range of an event epicenter follow both regular and ED traffic.

FIGURE 4 :
FIGURE 4: Average throughput during training with the VDN algorithm for different values of N, comparing the cases with and without agent IDs. The results are for λ_n = 0.3 and (K, N) = (2000, 8), (3000, 16), and (5000, 50).

FIGURE 6 :
FIGURE 6: Average throughput comparison of different MARL algorithms during training, for N = 8 and λ n = 0.3.

FIGURE 10 :
FIGURE 10: Average throughput performance of the learned policies for λ = 0.3, trained with different N_tr and tested with N_test.

TABLE 1 :
Suitability of some popular MARL algorithms for designing RA schemes for MTC systems.

TABLE 2 :
Simulation Parameters

Algorithm 1 Training Phase of the Proposed Algorithm
Define N, τ, γ ∈ [0, 1], λ_n ∀n ∈ N, h and K
Initialize S_n = 0, B_n = 0, ∀n ∈ N
for each episode do
    for each time slot k = 1, . . . , K do
        GENERATE traffic for all MTDs, i.e., B̃_n ∼ Bernoulli(λ_n) for regular traffic and B̃_n ∼ Bernoulli(p) for ED traffic, ∀n ∈ N
        UPDATE buffer B_n = min(1, B̃_n)
        for each agent/MTD n = 1, . . . , N do
            Observe input S_n and feed it to the DRQN
            Generate the estimate of Q_a, ∀a ∈ A
            Choose action according to (19)
            Receive feedback F(k) from the receiver
            Obtain reward R_n according to (18)
            Update buffer B_n
            Observe the next state S′_n
        end
        STORE (S, A, R, S′) in the replay buffer D
        if mixer = VDN: CALCULATE Q_tot according to (21)
        elseif mixer = QMIX: CALCULATE Q_tot using QMIXer [35]
        for every K_θ time slots: UPDATE θ⁻ ← θ
        for every K_β time slots: UPDATE β
    end
end

TABLE 3 :
Average throughput and average AoP values for the proposed algorithm compared to BEB, for the learned policies shown in Fig. 4.

TABLE 4 :
Number of times each event was activated over K = 10,000 time slots for different event activation probabilities λ.

TABLE 5 :
MTDs reporting the events, as in Fig. 3.