Deep Reinforcement Learning-Based Access Class Barring for Energy-Efficient mMTC Random Access in LTE Networks

Long-Term Evolution (LTE) networks are expected to be a key enabler for the massive Machine-Type Communications (mMTC) service in the 5G context. As highly synchronized access attempts from a massive number of Machine-Type Devices (MTDs) may overload the Physical Random Access Channel (PRACH), the Access Class Barring (ACB) scheme has been officially adopted as a control in LTE specifications. The baseline ACB scheme with fixed barring factor and fixed mean barring time has been shown to prevent the PRACH overload issue at the cost of a sharp increase in access delay. In order to improve the baseline ACB’s delay performance, several studies have suggested discarding the barring time and dynamically adjusting the barring factor over time. While neglecting the barring time can indeed bring a significant delay improvement, it may also cause an increased level of energy consumption for the MTDs because an MTD may need to continuously listen in order to obtain the updated barring factor without getting to actually transmit. In this paper, we propose to dynamically tune both the barring factor and the mean barring time using a reinforcement learning approach known as the Dueling Deep Q-Network. Computer simulations show that given a certain tolerance level on access delay and energy consumption, our design can achieve a significantly higher level of energy satisfaction while maintaining a comparable level of delay satisfaction compared to schemes that only focus on tuning the barring factor.


I. INTRODUCTION
The massive Machine-Type Communications (mMTC) service featuring billions of autonomous Machine-Type Devices (MTDs) has been identified as one of the three main use cases of 5G networks [1]. Due to its ubiquity and low-cost requirements, the service is expected to be mainly supported by the current-generation Long-Term Evolution (LTE) cellular access technology, which has matured in terms of coverage and manufacturing cost. Nevertheless, LTE was originally designed for human-centric communications and may not cope well with the massive device population of the mMTC service. Studies from the 3GPP and the literature have indeed shown that when tens of thousands of MTDs wake up and try to access an LTE cell in a highly synchronized manner (e.g., in event-based mMTC applications), the cell's Physical Random Access Channel (PRACH) may be severely overloaded, leaving a significant portion of the MTDs unable to obtain access rights before exceeding the maximum number of allowed attempts.
To cope with the PRACH overload issue, the 3GPP has adopted the Access Class Barring (ACB) scheme as a countermeasure in LTE specifications. The scheme works by forcing the devices that are about to request access on the PRACH to first perform a probabilistic check whose passing probability, i.e., the ''barring factor'' p_ACB, is under the evolved NodeB's (eNB) control. Devices that pass the check may proceed, while the others are barred for a random duration whose average length is determined by the ''mean barring time'' T_ACB (also controlled by the eNB) before they can redo the check. This mechanism reshapes the highly synchronized access pattern to prevent the PRACH from being overloaded by too many concurrent MTDs. In fact, it has been shown that the baseline ACB with fixed p_ACB and T_ACB can significantly increase the access success rate for the MTDs in massive bursty access scenarios, at the cost of a sharp increase in access delay [2].

A. RELATED WORKS
In order to improve the baseline ACB's delay performance while still maintaining its high success rate, various works have proposed to dynamically adjust p_ACB over time. In [3], p_ACB is increased or decreased in a heuristic manner depending on whether the average number of collided preambles over the last three slots (also known as Random Access Opportunities, or RAOs) exceeds or drops below certain thresholds. References [4]-[8], on the other hand, introduce methods to estimate the number of backlogged MTDs from the numbers of collided, singleton, and idle preambles and the p_ACB used in the previous RAO(s). Based on such estimates, the p_ACB for the next RAO can be set so that the average number of transmitting MTDs is kept at an optimum. Reference [9] employs a Kalman filter to further refine the estimate after the arrival period is over and no new MTDs arrive. It is worth noting that, except for [8] which assumes a fixed T_ACB, most of these works assume that MTDs failing the barring check in an RAO will retry in the very next RAO (which effectively discards the T_ACB) in order to simplify the backlog estimation process. The effect of different combinations of p_ACB and T_ACB in the baseline ACB scheme has been investigated in [10].
Recently, reinforcement learning, more specifically the Q-learning approach to the dynamic ACB tuning problem, has started to gain traction owing to its ability to produce reasonable solutions using only computer simulations of the system, i.e., without the need to formulate complicated theoretical models. In [11], tabular Q-learning is used to learn a p_ACB control that tries to balance access delay and PRACH collision rate, while T_ACB is not specified. On the other hand, [12] assumes a fixed T_ACB = 4 seconds and uses tabular Q-learning to learn a p_ACB control that greatly reduces contention on the PRACH while accepting the delay cost. Although the tabular method is guaranteed to converge to the optimal policy given that the system satisfies the Markovian property and some mild assumptions [13], it becomes infeasible when the number of possible state-action pairs (s, a) is so large that the estimate of the Q-function Q*(s, a) (also known as the action-value function) cannot be represented exactly in tabular form. There has thus been a transition to using deep neural networks (DNNs) to approximate the estimate Q(s, a) of Q*(s, a) by a parameterized version Q(s, a; θ). Being an approximation, DNN-based Q-learning is not guaranteed to arrive at optimal policies, but it allows us to obtain sub-optimal solutions for much more complex systems. For example, [14] considers two systems where the MTDs are single-class and multi-class, with the reward functions defined, respectively, as the number of successful MTDs and as the sum over classes of the ratio of the number of successful MTDs meeting their class' delay requirement to the number of successful MTDs of the same class. A Feed-Forward Network (FFN) is then trained to learn two corresponding sub-optimal p_ACB control policies conforming to the two reward definitions. Additionally, the authors assume exponentially growing T_ACB and backoff times at each individual MTD.
Reference [15], on the other hand, discards T_ACB and uses a Gated Recurrent Unit (GRU) network to learn a p_ACB control that maximizes the overall number of successful MTDs. In [16], an FFN is employed to learn a dynamic PRACH resource allocation scheme in NB-IoT systems rather than to control the barring parameters.

B. CONTRIBUTIONS
It is seen that many dynamic ACB solutions focus only on tuning the p_ACB and either discard T_ACB or assume a fixed or exponentially growing one. While discarding T_ACB and relying on p_ACB alone is beneficial from a delay standpoint, it may lead to very high energy consumption at the MTDs, especially in bursty access scenarios where the devices may need to continuously monitor the downlink to obtain the updated p_ACB without getting to actually transmit due to a persistently low p_ACB. This can be partly mitigated by using a per-device T_ACB that grows exponentially with the number of times the device has failed the barring check, as in [14]. Nevertheless, the energy expense may still be noticeable, as an MTD may need a considerable number of failed barring checks, or equivalently, p_ACB updates, before its T_ACB grows big enough to escape the low-p_ACB period. Employing a fixed high T_ACB, on the other hand, is energy-efficient but obviously results in a much higher overall access delay even if p_ACB is dynamically adjusted.
In order to balance the access delay and energy consumption at the MTDs, both p_ACB and T_ACB should be adaptive. However, backlog estimation and proper barring parameter selection in each RAO become challenging due to the involvement of the variable T_ACB. Reinforcement learning is a promising approach to avoid dealing with such mathematical complications, yet, to the best of our knowledge, there is no (Q-learning-based) dynamic ACB solution that jointly controls the barring pair. Therefore, in this paper, we propose a DNN-based Q-learning dynamic ACB scheme that controls both p_ACB and T_ACB to flexibly realize the tradeoff between access delay and energy consumption.

II. SYSTEM MODEL AND PROPOSED SOLUTION
A. SYSTEM MODEL
We consider a single LTE cell environment in which N MTDs arrive following the bursty Beta(3, 4) distribution, i.e., traffic model 2 in [17]. All N MTDs have the same requirements on the tolerable access delay and energy consumption, denoted by D_thresh and E_thresh, respectively. When an MTD arrives at the system, it first monitors the downlink to obtain the System Information Block (SIB2) containing the ACB parameter pair (p_ACB, T_ACB) and performs the ACB check by comparing a randomly generated number in [0, 1) to the obtained p_ACB. If this number is smaller, the MTD randomly selects and sends one of the R available preamble sequences as an access request to the evolved NodeB (eNB) over the PRACH in the nearest upcoming RAO. Otherwise, the MTD is barred for a period of length (0.7 + 0.6 × rand) × T_ACB, where rand is another random number in [0, 1), before it can update the p_ACB (by re-obtaining the SIB2) and redo the check.
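The per-device barring logic above can be sketched as follows (a minimal Python sketch; the function and variable names are ours):

```python
import random

def acb_check(p_acb, t_acb):
    """Perform one ACB check. Returns (True, 0.0) if the MTD may send a
    preamble in the nearest upcoming RAO, or (False, barring_duration)
    if it must wait before re-obtaining the SIB2 and redoing the check."""
    if random.random() < p_acb:  # draw in [0, 1), compared to the barring factor
        return True, 0.0
    # barred for (0.7 + 0.6 * rand) * T_ACB, i.e., T_ACB on average
    return False, (0.7 + 0.6 * random.random()) * t_acb
```

Note that the barring duration averages exactly T_ACB, since the multiplier (0.7 + 0.6 × rand) has mean 1.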
Even if the MTD passes the check and is allowed to send a preamble, it is not guaranteed to succeed, as other MTDs may use the same preamble in the same RAO due to the randomness in preamble selection. In such a case, a collision happens and renders the preamble undecodable [17]. The eNB will thus not send back acknowledgments for the collided preamble(s) during the Random Access Response (RAR) window, and MTDs that have selected those preambles must update the barring parameters (by re-obtaining the SIB2) and re-perform the ACB check after backing off for a random period in [0, T_BO). On the contrary, MTDs whose preambles are successfully received will be acknowledged and granted dedicated resources for transmission of the next signaling message (known as Msg3) containing their IDs and the reasons for requesting access. For each correctly received Msg3, the eNB echoes back the decoded ID via a message known as Msg4. The random access procedure is considered finished upon the correct reception of the Msg4 at the relevant device's side. Note that if an MTD experiences a certain number of consecutive preamble transmission failures (denoted by N_PTmax), it will terminate the random access procedure and is considered temporarily ''blocked'' by the network. On the other hand, there is no upper limit on the number of failed ACB checks at a device. The ACB scheme is summarized in Fig. 1.
For simplicity, we assume that a preamble transmission only fails due to collisions. Furthermore, in the LTE specifications, devices that have passed the ACB check are not required to redo the check even if their preamble attempts are unsuccessful. However, we notice that better performance can be achieved if all devices are forced to perform the check every time they are about to transmit a preamble, and we have thus adopted this strategy in our work. This is in line with most existing works on dynamic ACB. For our energy model, we assume that once an MTD has arrived in the cell, it is always in one of three states, namely transmitting (Tx), receiving (Rx), or idle, with respective power consumption levels denoted by P_tx, P_rx, and P_idle. The Tx state applies in the subframe where the MTD transmits a preamble. The Rx state is assumed when the MTD is either capturing the RAR window in an attempt to find an acknowledgment from the eNB, or monitoring the downlink channel to obtain a new SIB2 for the purpose of performing an ACB check. Note that the device may stop capturing the RAR window in the very subframe where a relevant acknowledgment is found, but otherwise has to capture the whole window without finding any acknowledgment. The MTD is assumed to spend one subframe in the Rx state each time it obtains the updated SIB2. When the device is neither transmitting nor receiving, it falls into the idle state.
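Under this model, the energy an MTD spends during its access attempt is simply the number of subframes in each state weighted by the corresponding power level (a sketch; the actual power values are those of Table 2):

```python
def mtd_energy(n_tx, n_rx, n_idle, p_tx, p_rx, p_idle):
    """Energy spent by one MTD under the three-state model: each
    subframe (1 ms) is billed at the power level of the state the
    device is in during that subframe."""
    return p_tx * n_tx + p_rx * n_rx + p_idle * n_idle
```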

B. NEURAL NETWORK-BASED Q-LEARNING
To achieve good delay and energy performance simultaneously, the barring parameters should be changed over the RAOs to adapt to the unknown access traffic variation. However, the selection of barring parameters itself also affects the observable access pattern. As such, the evolution of dynamic ACB systems (including our proposal) may be approximated by Markov Decision Processes (MDPs), which are succinctly described via four sets, namely a set S of states of the system, a set A of possible actions (or decisions) to be taken, a set P_a(s, s') modeling the conditional probability that the system will transit to state s' ∈ S given that it is currently at state s ∈ S and an action a ∈ A is to be taken, and a set R_a(s, s') representing the immediate rewards obtained when the system transits from state s to s' due to the action a. The ultimate goal in a dynamic system is to find a mapping µ* : S → A, i.e., a policy dictating which action to take given the current state of the system, that maximizes the expected total rewards, i.e.,

µ* = argmax_µ E[ Σ_{t=0}^{∞} γ^t R_{µ(s_t)}(s_t, s_{t+1}) ],   (1)

where γ ∈ (0, 1) is the discount factor needed to make the sum converge. It is noted that given the optimal policy µ*, the value to which the expected sum E[ Σ_{t=0}^{∞} γ^t R_{µ*(s_t)}(s_t, s_{t+1}) ] converges is a function of the initial state s_0. This function, denoted by V*, is known as the value function encoding the maximum achievable total rewards given each starting state. The value function must satisfy the Bellman equations

V*(s) = max_{a ∈ A} Σ_{s'} P_a(s, s') [ R_a(s, s') + γ V*(s') ]   (2)

for all states s ∈ S. Furthermore, the action produced by the optimal policy, i.e., a* = µ*(s), must also be the action that achieves the maximum on the right-hand side of the Bellman equation (2) [18]. Thus, if we have the V*(s), then the optimal policy can be readily recovered as

µ*(s) = argmax_{a ∈ A} Σ_{s'} P_a(s, s') [ R_a(s, s') + γ V*(s') ]   (3)

for all s.
The central problem of finding µ* in a dynamic system can thus be cast into the problem of solving the system of Bellman equations (2) to get the value function. Alternatively, by defining the action-value function

Q*(s, a) = Σ_{s'} P_a(s, s') [ R_a(s, s') + γ max_{a'} Q*(s', a') ],   (4)

the Bellman equation (2) can be rewritten as V*(s) = max_{a ∈ A} Q*(s, a). Intuitively, Q*(s, a) is the maximum achievable expected total reward given that the system is currently at state s and action a is going to be taken. Following (3), the optimal policy µ* can be extracted from Q*(s, a) simply as

µ*(s) = argmax_{a ∈ A} Q*(s, a)   (5)

for all s. Thus, the problem of finding µ* can also be alternatively cast into the problem of finding Q*(s, a) for all pairs (s, a). The problems of finding V*(s) and Q*(s, a) are theoretically equivalent and can both be solved using the Value Iteration (VI) algorithm [18] given the transition probabilities P_a(s, s') and the associated rewards R_a(s, s'). However, when this information is not available, and the expectations in (2) and (4) are to be estimated by sampling from actual interactions with the system (or computer simulations of the system), the approach of finding Q*(s, a) can be more convenient. In fact, Watkins et al. [13] showed that by maintaining a table storing the estimates of Q*(s, a), henceforth denoted by Q(s, a), for every possible pair (s, a), and only updating the estimate of the actually encountered pair (s_t, a_t) at time step t according to

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + ξ_t [ r_t + γ max_{a'} Q_t(s'_t, a') − Q_t(s_t, a_t) ],   (6)

where s'_t is the actual next state that the system transits into after a_t has been performed, r_t is the corresponding observed immediate reward, and ξ_t is the learning step at time t, the estimates Q_t(s, a) will converge to Q*(s, a) as t → ∞ given mild assumptions on the ξ_t. This method of finding Q* without requiring knowledge of the P_a(s, s') and R_a(s, s') is also known as tabular Q-learning. Note that the discount factor γ plays an important role in determining the convergence rate of (6). The closer γ is to 1, the longer it takes for (6) to converge, but at the same time the optimal policy µ* obtained via (5) can achieve a higher maximum expected total reward.
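The tabular update rule (6) can be expressed compactly (a sketch with illustrative states and actions; `xi` and `gamma` stand for the learning step ξ_t and the discount factor γ):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, xi, gamma):
    """One tabular Q-learning step: move Q(s_t, a_t) toward the
    bootstrapped target r_t + gamma * max_a' Q(s'_t, a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += xi * (target - Q[(s, a)])
```

A `defaultdict` initializes unseen entries to zero, which plays the role of the table's initial estimates.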
The main issue with tabular Q-learning is that for complex systems, the number of possible (s, a) pairs can be so large that the estimate Q(s, a) cannot be represented exactly in tabular form. A natural solution is thus to instead represent the estimate of Q*(s, a) by a parameterized Q(s, a; θ) using the powerful non-linear function approximators known as DNNs, and then, at time t, to find (learn) the best set θ_t of parameters that minimizes the squared difference between the two sides of the Bellman equation (4), i.e., to minimize the loss function

L(θ_t) = E[ ( r + γ max_{a'} Q(s', a'; θ_t) − Q(s, a; θ_t) )² ].   (7)

However, early attempts at using DNNs for Q-learning were not fruitful, due to the fact that Q(s, a; θ) cannot be updated in isolation for each pair (s, a) and that consecutive transition samples from the system are highly correlated. The first breakthrough in DNN-based Q-learning came in [19], where Mnih et al. propose to use a memory of size T_mem known as the ''experience replay'' that records the transition history, i.e., the tuples (s_i, a_i, s'_i, r_i) for t − T_mem < i ≤ t, and to use randomly drawn batches of samples from the memory to mitigate the correlation. Furthermore, a target network is also used instead of the online network in the ''target'' portion of (7). The parameters θ_t^target of this network are periodically overwritten by θ_t from the online network with a period of T_copy time steps but remain fixed otherwise. This helps to stabilize the training process, as the target is now fixed w.r.t. the online parameters θ_t, and the problem of minimizing (7) only changes once every T_copy steps. That is, their objective at time t is to find a set θ_t of parameters that minimizes

L(θ_t) = E[ ( r + γ max_{a'} Q(s', a'; θ_t^target) − Q(s, a; θ_t) )² ],   (8)

whose value is estimated from a randomly drawn batch of experiences (s, a, s', r) from the memory. The experience replay and target network significantly improve the performance of their Deep Q-Network (DQN).
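A minimal experience-replay buffer in the spirit of [19] might look as follows (our sketch; the full DQN additionally maintains the periodically copied target network):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience replay: store transition tuples and draw
    uncorrelated random batches for training."""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest entries fall out first

    def push(self, s, a, s_next, r):
        self.buf.append((s, a, s_next, r))

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)
```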
The work in [20] further observes that Q(s, a) can be viewed as the sum of two separate terms, namely the value term V(s) representing the ''value'' of being at state s and the advantage term A(s, a) representing the ''advantage'' of choosing action a at state s, and proposes the ''Dueling DQN'' architecture in which the two terms are parameterized as V(s; θ, θ^V) and A(s, a; θ, θ^A) and combined into the estimate

Q(s, a; θ, θ^V, θ^A) = V(s; θ, θ^V) + A(s, a; θ, θ^A) − (1/|A|) Σ_{a'} A(s, a'; θ, θ^A).   (9)

They then show that such a decomposition can greatly improve performance compared to the case where Q(s, a) is parameterized directly, especially for systems with many ''bad'' states, where a bad state is defined as one at which the reward is more or less similar regardless of the action to be performed. An example of a dueling DQN with three hidden layers is shown in Fig. 2. It is worth noting that even with these improvements, DNN-based Q-learning solutions remain an approximation, and the corresponding obtained policies are most likely sub-optimal.
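The aggregation of the value and advantage streams in (9) can be sketched as below (we assume the standard mean-subtracted form of [20], which keeps the decomposition identifiable):

```python
def dueling_q(v, advantages):
    """Combine the value and advantage streams into Q(s, .):
    Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [v + a - mean_adv for a in advantages]
```

Subtracting the mean advantage forces the Q-values to average to V(s), so the two streams cannot drift arbitrarily against each other.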
C. PROPOSED DYNAMIC ACB SCHEME
Unlike previous solutions which only control the p_ACB, we propose to have the eNB control both the barring factor p_ACB and the mean barring time T_ACB. The Dueling DQN architecture [20] is then employed to parameterize Q(s, a). In our system, an action a is a nonnegative integer, more specifically the index of one of the |A| barring pairs. On the other hand, we assume that the eNB always knows the delay and energy consumption of the successful MTDs in the i-th RAO before the upcoming (i + 1)-th RAO and stores the corresponding information tuple o_i into its memory. The reward function design is another significant difference of our work compared to previous DNN-based Q-learning ACB solutions. More specifically, we explicitly take the energy consumption of the N_i^s successful MTDs into account and define the immediate reward of the i-th RAO as a weighted sum of the number of successful MTDs whose delay is below D_thresh and the number of successful MTDs whose energy consumption is below E_thresh, normalized by the number of preambles R. The reward of the i-th RAO, denoted by r_i, is thus

r_i = [ (1 − α) N_i^del + α N_i^ene ] / R,   (10)

where N_i^del and N_i^ene denote the numbers of successful MTDs of the i-th RAO meeting the delay and energy thresholds, respectively, and α ∈ [0, 1) is the weight indicating the relative importance of energy performance at the MTDs. Similar to o_i, r_i is also stored in the eNB's replay memory once observed. Note that the information on the access delay and energy consumption of the successful MTDs in an RAO can be obtained by requiring these MTDs to report their delays (in units of subframes) and the numbers of subframes they spent in the Tx and Rx states via the subsequent Msg3. If the network is to be trained online via interaction with an actual system, this information may not be available to the eNB before the next RAO as per our assumption, because the successful MTDs do not send their Msg3 immediately after an RAO.
However, given an expected arrival distribution, the network can always be trained offline via interaction with a computer simulation of the system, where such information is available instantly after each simulated RAO. After the offline training, the best-performing set of weights can then be extracted and used as the starting point for online training. This paper only considers the offline training phase, in line with most Q-learning-based ACB solutions. The problems of delayed information and mismatched arrival distribution in online training are left for future studies.
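The per-RAO reward computation of (10) can be sketched as follows (our reading of the weighting; `delays` and `energies` list the statistics reported by the RAO's successful MTDs):

```python
def rao_reward(delays, energies, d_thresh, e_thresh, alpha, R):
    """Immediate reward of one RAO: a weighted, R-normalized count of
    the successful MTDs meeting the delay / energy thresholds, with
    alpha weighting the energy term."""
    n_delay_ok = sum(d < d_thresh for d in delays)
    n_energy_ok = sum(e < e_thresh for e in energies)
    return ((1 - alpha) * n_delay_ok + alpha * n_energy_ok) / R
```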
We use an FFN with 3 fully connected hidden layers, each containing 128 nodes, as the online network in the Dueling DQN architecture. We note that the Q*(s, a) are not functions of time and therefore need not be approximated by DNN types that are designed to capture temporal correlation in the input, such as GRU networks. The FFN was chosen because it achieves good end results despite not having any temporal characteristics (as per Section III-B) and is easy to train. Besides, the temporal correlation between the RAOs has been partly encoded in the input states themselves, since a state is defined as a window of observations across T_win consecutive RAOs. This online FFN is then trained offline using the Adam optimizer [21] in a series of simulated ''episodes''. An episode starts at the same time as the arrival period of T_arr seconds and finishes when all N MTDs have either succeeded in accessing the cell or been blocked after undergoing N_PTmax preamble transmission failures, at which point the states of all MTDs are reset and a new episode starts. The learning rate of Adam is initially λ_init and is halved every time the reward accumulated over the RAOs of an episode (henceforth referred to as the episode reward) has not increased for 50 consecutive episodes, until it reaches a minimum value of λ_min. Also, during training, the network parameters θ, θ^V, θ^A are backed up once every 10,000 RAOs. Note that for online training, the Q-agent may not be able to perform gradient descent updates of the online network every RAO, since the required computation time may exceed the RAO periodicity. However, since we are considering offline training, the online network can always be updated on a per-simulated-RAO basis.
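The learning-rate schedule just described can be sketched as a small helper the training loop calls once per episode (the state handling and names are our sketch):

```python
def adam_lr_schedule(lr, lr_min, best_reward, episode_reward, stall, patience=50):
    """Halve the learning rate whenever the episode reward has not
    improved for `patience` consecutive episodes, floored at lr_min.
    Returns the updated (lr, best_reward, stall) triple."""
    if episode_reward > best_reward:
        return lr, episode_reward, 0          # improvement: reset the counter
    stall += 1
    if stall >= patience:
        return max(lr / 2.0, lr_min), best_reward, 0
    return lr, best_reward, stall
```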
In the i-th RAO of the offline training phase, the Q-agent observes the system's state s_i and makes a decision following the well-known ε-greedy strategy, where it chooses the most promising action (index of a barring pair) according to the current estimates, i.e., a_i = argmax_{a'} Q(s_i, a'; θ_i), with probability 1 − ε, and a random action in the set A with probability ε. The parameter ε is set to 1 at first but is gradually annealed to a minimum value of 0.1 within the first 1 million simulated RAOs during training. This strategy aims to provide a tradeoff between exploration and exploitation during the limited number of RAOs of the training phase. The training process is summarized in Algorithm 1.
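The annealing of ε can be sketched as follows (we assume a linear decay, as the exact decay shape is not stated):

```python
def epsilon_at(rao_index, eps_start=1.0, eps_min=0.1, anneal_raos=1_000_000):
    """Epsilon for the epsilon-greedy strategy: start at 1.0 and anneal
    linearly down to 0.1 over the first million simulated RAOs, then
    hold at the minimum."""
    frac = min(rao_index / anneal_raos, 1.0)
    return eps_start + frac * (eps_min - eps_start)
```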

A. SIMULATION SETUP
In this section, computer simulations are performed to compare the effectiveness of the proposed solution to the dynamic ACB solution in [14], where exponentially growing backoff and barring times are used. The hyperparameters of the training procedure of our scheme are detailed in Table 1. The environment settings used for both training and verification, on the other hand, are detailed in Table 2. The delay tolerance D_thresh is set to 1000 subframes, i.e., 1 second. The limit E_thresh on the energy consumption at an MTD, meanwhile, is chosen to be equal to the consumption level when the MTD spends 2 subframes in the Tx state, 2 × W_RAR + 2 subframes in the Rx state, and D_thresh − (4 + 2 × W_RAR) subframes in the idle state, where W_RAR is the length of the RAR window in units of subframes. This corresponds to the case where an MTD succeeds in its second preamble attempt and receives the acknowledgment in the last subframe of the RAR window, all at an exact delay of D_thresh. Note that given the power levels in Table 2, a Tx or Rx subframe can be traded for 2000 idle subframes, which implies a way to trade off between access delay and energy consumption.

Algorithm 1 (excerpt):
…, where each experience is of the form (s_j, a_j, r_j, s_{j+1}), and each state s_j is a window of consecutive observations s_j = (o_{j−T_win+1}, · · · , o_j);
7: for each experience j in the batch do
8: Set y_j^pred = Q(s_j, a_j; θ, θ^V, θ^A);
9: Set y_j^target = r_j + γ max_{a'} Q(s_{j+1}, a'; θ^target, θ^{V,target}, θ^{A,target});
10: Calculate the sample loss L_j(θ, θ^V, θ^A) = (y_j^pred − y_j^target)²;
11: end for
12: Based on the s_j, a_j, and L_j(θ, θ^V, θ^A) of the experiences, perform a stochastic gradient descent update on θ, θ^V, θ^A following Adam's update rules;
13: if i mod T_copy = 0 do
14: Overwrite target network parameters θ^target, θ^{V,target}, θ^{A,target} with current online network parameters θ, θ^V, θ^A;
end for
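The E_thresh budget described above can be checked numerically (the power levels and W_RAR below are placeholders for the actual Table 2 values):

```python
def energy_threshold(d_thresh, w_rar, p_tx, p_rx, p_idle):
    """E_thresh as constructed in the text: 2 Tx subframes,
    2 * W_RAR + 2 Rx subframes, and the remaining idle subframes,
    with the three spans summing to exactly D_thresh subframes."""
    n_tx = 2
    n_rx = 2 * w_rar + 2
    n_idle = d_thresh - (4 + 2 * w_rar)
    assert n_tx + n_rx + n_idle == d_thresh  # the spans cover D_thresh
    return p_tx * n_tx + p_rx * n_rx + p_idle * n_idle
```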
It is the Q-agent's task to learn how to control the barring pair to perform the tradeoff efficiently given a certain α in (10).

B. RESULTS AND DISCUSSIONS
After the offline training phase, we pick the episodes whose rewards are among the top 1% of all episodes, extract the θ, θ^V, θ^A sets that were backed up during or close to those episodes, and evaluate and report the best results among them in this section. For a fair comparison, we compare the reported results with those obtained in the case of perfect control, where the eNB always knows the exact number of backlogged MTDs and can thus select a corresponding p_ACB such that the average number of MTDs per RAO equals the number of available preambles R, in the single-priority network setting of Section III-A of [14]. This reference scheme also assumes exponentially growing T_BO and T_ACB at the MTDs. More specifically, in units of milliseconds, T_BO is a random number in [0, 10 × 2^{P_t − 1}) and T_ACB = (0.7 + 0.6 × rand) × 2^{N_barring}, where P_t and N_barring are the number of times an MTD has transmitted a preamble and the number of failed ACB checks at the MTD, respectively.
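The reference scheme's per-device timers can be sketched as below (in ms; the function name is ours):

```python
import random

def reference_timers(p_t, n_barring):
    """Exponentially growing per-device timers of the reference scheme
    [14]: T_BO uniform in [0, 10 * 2^(P_t - 1)) and
    T_ACB = (0.7 + 0.6 * rand) * 2^N_barring, both in ms."""
    t_bo = random.random() * 10 * 2 ** (p_t - 1)
    t_acb = (0.7 + 0.6 * random.random()) * 2 ** n_barring
    return t_bo, t_acb
```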
The ratios of the numbers of successful MTDs that also meet the delay and energy thresholds to the total number N of MTDs, denoted by r_s^del and r_s^ene respectively, are shown in Table 3 for the two schemes. It is seen that compared to the reference scheme, ours with different α can achieve a comparable r_s^del at a significantly lower energy cost, as suggested by the much higher r_s^ene. For instance, the proposed scheme at α = 0.1 achieves a similar level of r_s^del while outperforming the reference in terms of r_s^ene by a hefty margin of 16%. To further verify that such energy gain comes from the fact that MTDs are not forced to update p_ACB too frequently, and not simply from reduced contention levels, we modify the reference scheme so that the average number of MTDs per RAO is kept at 0.7 × R instead of R. The corresponding results are shown in the last column of Table 3. It is clear that while this modification helps to boost the reference scheme's energy efficiency, its r_s^ene metric is still far lower than that of the proposed scheme. This proves that p_ACB updates are indeed a major source of energy consumption at the MTDs, and that dynamic control of both p_ACB and T_ACB is crucial for simultaneously improving the network's delay and energy performance in bursty access scenarios. In terms of average delay, the reference scheme (second-to-last column) shows better performance. Nevertheless, an MTD's data should remain relevant as long as the MTD can obtain access rights within the delay threshold D_thresh, and the metric r_s^del is thus more relevant than the average access delay. On the contrary, the average energy consumption of successful MTDs is as crucial as r_s^ene, since it dictates the amount of unrecoverable energy spent by the battery-limited devices.
Here, as foreshadowed by the much higher r_s^ene, successful MTDs in the proposed scheme consume anywhere from 25% to 42% less energy on average than those in the reference scheme depending on α, which is a remarkable improvement. As for the blocking ratio, defined as the ratio of the number of MTDs that cannot successfully access the network before exceeding N_PTmax to the total number N of MTDs, the two schemes perform comparably, with less than 0.5% of the population getting blocked.
Looking at the proposed scheme alone, it is obvious that as the energy prioritization parameter α is increased, the achieved r_s^ene and r_s^del tend to increase and decrease, respectively. This is expected and implies that by adjusting α, the proposed scheme can efficiently realize the tradeoff between access delay and energy consumption. To support this claim, we report the smoothed curves of r_s^del and r_s^ene observed during the training episodes for different values of α in Fig. 3. As seen, there is a definite trend showing that the Q-agent is driven towards solutions that yield lower r_s^del (and higher r_s^ene) for higher values of α, and vice versa. Note that the curves' occasional fluctuation is a typical behavior in DNN-based Q-learning, as there is no guarantee that the agent will converge to the optimal solutions. It is also interesting to take a detailed look at the solutions discovered by the agent for different α settings. We have thus plotted in Fig. 4 the p_ACB and T_ACB controls that the Q-agent comes up with, using the same sets of θ, θ^V, θ^A that were used to obtain the results reported in Table 3. The vertical axis on the right represents the values of T_ACB normalized by 4000 ms as well as p_ACB. Here, the immediately visible feature is that during the peak period (from around RAO 500 to RAO 1000), the p_ACB is kept relatively high at 0.7 or above for all cases, which is in stark contrast with the reference scheme where p_ACB may reach as low as 0.25. More importantly, the high barring factors are coupled with a very high T_ACB which mostly varies between 3400 ms and 4000 ms, except for the case of α = 0, i.e., when energy consumption does not matter. This implies that most of the MTDs will pass the barring check and are allowed to partake in the busy RAOs for a smaller success chance but a high possibility of meeting the access delay threshold D_thresh.
The smaller fraction of the population that fails the check, on the other hand, is immediately deferred until much later, when the new traffic has settled down. While those devices will certainly miss the delay deadline, they also save a notable amount of energy thanks to not having to update the barring parameters frequently and to a higher chance of a successful preamble transmission. The results in Table 3 indeed show that the combination of a high p_ACB and a high T_ACB during the peak period is an overall more efficient strategy than greatly lowering the p_ACB to keep the average number of MTDs close to R.
Additionally, when α is increased, i.e., energy performance is more emphasized, the Q-agent chooses to 1) slightly lower both p_ACB and T_ACB during the peak period, and 2) keep T_ACB at those lower values for longer even after the peak has elapsed. As a result, the number of contending MTDs is consistently maintained at lower levels over a longer period. This reduces the contention level without requiring the MTDs to update the barring pair too frequently, thereby achieving a higher r_s^ene. At the same time, the PRACH is still kept busy enough to avoid severe under-utilization. Finally, we note that these best θ, θ^V, θ^A sets are found relatively late, as shown by the last row in Table 3. However, other good sets that yield performance levels (in terms of average episode reward) within 96-98% of the reported ones are also found frequently during training. Furthermore, the Q-agent actually starts to make reasonable barring parameter selections quite early, as shown in Fig. 5. It is evident that depending on α, usable sets achieving performance levels within 90% of those of the reported sets in Table 3 are found as early as after 0.75 million simulated RAOs.

IV. CONCLUSION AND FUTURE DIRECTIONS
In this paper, we proposed a DNN-based dynamic ACB solution that explicitly takes energy consumption into consideration and controls both the barring factor p ACB and the mean barring time T ACB in order to improve the energy performance of the MTDs in massive bursty mMTC access scenarios. Computer simulations show that compared to conventional schemes that only control the barring factor, the proposed scheme can achieve a similar delay performance at a significantly lower energy cost. Furthermore, by adjusting a parameter representing the importance of energy performance in the reward function, our scheme can efficiently realize the tradeoff between access delay and energy consumption. In future works, we aim to investigate the impact of the delayed reward, the computational latency of the gradient descent update procedure, and the mismatched arrival distribution on the performance of the proposed scheme when the best weights extracted during the offline training phase are used as the initial weights for online training. The designs that can overcome such issues to further improve the performance will also need to be considered.