Reinforcement Learning Based Adaptive Blocklength and MCS for Optimizing Age Violation Probability

As a measure of the freshness of data, Age of Information (AoI) has become an essential performance metric in status update applications with stringent timeliness constraints. This study employs adaptive strategies to minimize the novel, information freshness-based performance metric age violation probability (AVP), the probability of the instantaneous age exceeding a predefined constraint, in short packet communications (SPC). AVP can be considered one of the key performance indicators (KPIs) in 5G Ultra-Reliable Low Latency Communications (URLLC), and it is expected to gain more importance in 6G technologies, especially in extreme URLLC (xURLLC). Two distinct approaches are considered: the first focuses on adaptively selecting the blocklengths with either imperfect or missing channel state information exploiting finite blocklength theory approximations. The second involves dynamically choosing the modulation and coding scheme (MCS) to minimize the AVP under stringent timeliness constraints and non-asymptotic information theory bounds. In the context of adaptive blocklength selection, state-aggregated value iteration, Q-learning algorithms, and finite blocklength theory approximations are leveraged to adjust blocklengths to achieve low age violation probabilities adaptively. The simulation results highlight the effectiveness of these algorithms in minimizing age violation probabilities compared to the fixed blocklengths under varying channel conditions. Additionally, constructing a deep reinforcement learning (DRL) framework, we propose a deep Q-network policy for the dynamic selection of the modulation and coding scheme among the available MCSs defined for URLLC systems. Through comprehensive simulations, we demonstrate the superiority of the proposed adaptive methods over traditional benchmark methods.


I. INTRODUCTION
Reliable and fast communication has become an urgent need for many applications with the rapid development of technology over the years. Ranging from factory automation and smart grids to remote surgery and autonomous driving, a vast number of applications rely on reliably and efficiently transmitting short status update packets from a source to a monitor. With these applications came the demand for timely delivery of information. Consequently, a measure of the timeliness of data called Age of Information (AoI) has emerged and become an important research topic. AoI is defined as the time elapsed since the last successfully delivered packet was generated [1]. It is a critical metric in status update systems where information is needed before it becomes stale or irrelevant, such as industrial automation, augmented reality, and traffic safety applications. While it is also regarded as an important metric in fifth-generation (5G) systems, AoI is expected to gain more prominence and be considered a key performance indicator (KPI) in sixth-generation (6G) communications, especially in next-generation/extreme Ultra-Reliable Low Latency Communication (xURLLC) and massive Machine Type Communication (mMTC) systems. As the name implies, 5G URLLC focuses on stringent latency and reliability requirements; a latency of 1 ms or lower is targeted in addition to successful packet delivery rates up to $1 - 10^{-5}$ or even $1 - 10^{-9}$ in some cases [2]. With xURLLC, additional requirements are introduced such as throughput, spectral efficiency, energy efficiency, and security, as well as AoI [3]. The significance of AoI is also apparent in semantic communications, where the meaning of the transmitted message is more important than the accurate transmission of bits [4]. AoI is considered one of the fundamental measures of the relevance of the information in semantic communications, as it determines whether the information is still fresh and valuable or out-of-date and irrelevant [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Md. Arafatur Rahman.
In age-aware xURLLC and mMTC systems, and in status update applications such as augmented reality, smart sensors, and industrial automation, information packets generally consist of a small number of bits. Such communication systems are referred to as short packet communications. Unlike conventional communication networks with long packets, in short packet communications the distortions caused by thermal noise and the propagation channel are not averaged out. Thus, Shannon capacity cannot be used as a performance metric in short packet communications, as it is based on infinite blocklength. Instead of classic information theory results, finite blocklength (FBL) theory approximations need to be utilized [6].
The main challenge in age-aware short packet communication systems is the selection of the appropriate blocklength for coding. If a large blocklength is used, implying that a larger number of redundancy bits is used, the probability of error is small. However, the transmission duration increases as a result of transmitting a larger number of bits; hence, age also increases. On the other hand, using a small blocklength results in a shorter transmission time but a higher error probability. Thus, a challenging trade-off exists when selecting the blocklength, and one of our purposes in this study is to balance this trade-off and minimize the AoI by selecting the blocklength dynamically.
Another approach to the AoI minimization problem for short packet communications is adaptive modulation and coding (AMC). In communication systems, the modulation and coding scheme (MCS) determines the number of bits to be transmitted in one symbol and the coding rate. The selection of the MCS directly affects the age, similar to the blocklength. MCSs with high code rates and modulation orders result in a short transmission time but a higher error probability. In contrast, MCSs with lower modulation orders and coding rates guarantee a lower error probability, yet a longer transmission time. Hence, the same trade-off exists in MCS selection for age optimization.
The majority of the studies on AoI focus on the average age [7], [8], [9], [10], [11], [12] and peak age [7], [13], [14]. Average age is defined as the time-average AoI. Although useful, it is not a sufficient metric for fully assessing the timeliness of the information since it cannot account for extreme AoI events observed with low probabilities [15]. Peak age is another important AoI metric, indicating the value of age just before an update is correctly received. While peak age is a critical metric for ensuring the freshness of the received data, the timeliness of the whole process also needs to be assured. Also, numerous real-time applications have stringent timeliness constraints, and violation probabilities matter more than averages in such systems.
In this study, we investigate the age violation probability (AVP), i.e., the probability that the instantaneous age exceeds a given threshold, in short packet communications. We first utilize finite blocklength theory approximations to dynamically select the blocklength that minimizes the AVP with either imperfect or missing channel state information. Secondly, we focus on choosing the MCS adaptively to minimize the AVP under stringent timeliness constraints and non-asymptotic information theory bounds.
Related Work: There are a few works in the literature showing the existence of an optimal blocklength that minimizes age-related metrics [7], [8], [9], [13]. In [7], [8], and [9], the optimal blocklength minimizing the average age is investigated, taking into account retransmission techniques like automatic repeat request (ARQ) and/or hybrid ARQ (HARQ). On the other hand, in [13], the optimal blocklengths optimizing delay and peak age violation probabilities are studied using FBL information-theoretic bounds. Notably, the study in [13] showed that there may exist two distinct optimal blocklengths that result in the same average age but different age violation probabilities. This highlights the critical importance of prioritizing age violation probabilities in addition to the average age while optimizing blocklengths.
Aside from showing the existence of an optimal blocklength, methods for finding the optimal blocklength have also been a topic of discussion [10], [11], [12], [14], [16]. In [10], [11], [14], and [16], blocklength selection in point-to-point wireless networks is considered for optimizing end-to-end delay [16] or age metrics [10], [11], [14]. The study in [12] solves the non-convex blocklength optimization for average age in a two-hop wireless relaying network. References [10] and [16] formulate the average delay [16] and average AoI minimization problems as Markov decision processes (MDPs) and propose dynamic blocklength selection methods based on reinforcement learning (RL). Meanwhile, [11] maps the average AoI minimization problem under a power consumption constraint to a constrained Markov decision process (CMDP) and solves the problem by linear programming methods. Although motivated by them, our blocklength selection problem differs from the aforementioned ones as it focuses on the age violation probability and proposes dynamic blocklength selection methods based on RL and dynamic programming (DP). This allows our method to adapt to varying channel conditions and imperfect channel state information, setting it apart from previous work.
122412 VOLUME 11, 2023. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Some works in the literature also use RL techniques for AMC to optimize traditional performance metrics such as throughput [17], [18] and spectral efficiency [19]. However, none of them consider dynamic MCS selection in AoI-aware systems. Both [17] and [19] use Q-learning to map channel quality indicators to MCS options. Reference [19] aims to maximize spectral efficiency and maintain a low block error rate (BLER), while [17] optimizes the link throughput in orthogonal frequency-division multiplexing (OFDM) wireless systems. Reference [18] also maximizes the link-level throughput with MCS selection and power allocation by Deep Deterministic Policy Gradient (DDPG) agents in a distributed manner. MCS selection in age-aware systems has been considered only in [20], where an AoI-driven scheduler without any learning-based approach or any finite blocklength analysis is proposed to minimize the long-term average AoI.
A baseline technique for AMC is outer loop link adaptation (OLLA) [21]. It is an addition to inner loop link adaptation (ILLA), a fixed lookup table method that maps the channel quality indicator (CQI) to the highest MCS that satisfies the block error rate requirement. OLLA improves on ILLA by adjusting the signal-to-noise ratio (SNR) according to the positive or negative acknowledgment (ACK/NACK) following a transmission; thus, the effects of delayed CQI or quantization errors are mitigated.
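As a rough illustration of the OLLA mechanism described above, the following Python sketch keeps an SNR offset that is nudged after every ACK/NACK. The class name `Olla`, the step sizes, and the sign convention (raise the offset on ACK, lower it on NACK) are assumptions made for illustration and are not taken from [21]; the up/down step ratio is a commonly used choice under which the long-run NACK fraction settles near the target BLER.

```python
class Olla:
    """Hypothetical outer-loop offset tracker (illustrative, not from [21])."""

    def __init__(self, target_bler: float = 0.1, step_down_db: float = 1.0):
        self.offset_db = 0.0
        self.step_down_db = step_down_db
        # Assumed step ratio: in steady state p_nack * down = (1 - p_nack) * up,
        # which gives p_nack = target_bler.
        self.step_up_db = step_down_db * target_bler / (1.0 - target_bler)

    def update(self, ack: bool) -> float:
        """Apply one ACK/NACK observation and return the new offset in dB."""
        if ack:
            self.offset_db += self.step_up_db
        else:
            self.offset_db -= self.step_down_db
        return self.offset_db

    def effective_snr_db(self, reported_snr_db: float) -> float:
        """SNR that would be fed to the ILLA lookup table."""
        return reported_snr_db + self.offset_db
```

The corrected SNR returned by `effective_snr_db` would then index the fixed ILLA table, so the table itself stays unchanged.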
To the best of our knowledge, our study is the first to propose an RL-based dynamic MCS selection method to minimize AVP in short packet transmissions and provide superior performance compared to baseline methods. Similarly, while there are some studies on the optimal fixed blocklength in age-aware systems, we present a novel RL-based method of dynamically selecting the blocklength according to channel conditions, and we consider not the average age but the AVP. Also note that the RL algorithms proposed in this paper do not assume knowledge of the underlying system characteristics such as the channel distribution, packet arrival statistics, and finite blocklength error probabilities.
Objectives and Contributions: Our main objective is to minimize the age violation probability by an adaptive selection of the blocklength or modulation and coding scheme. The main contributions of this study are as follows:
• We leverage finite blocklength theory approximations and formulate the AVP minimization problem as a discrete-time Markov decision process. We present a dynamic programming method that uses the known system characteristics to select the optimal blocklength for the current channel and AoI states.
• In the absence of a priori knowledge of the system characteristics and with either imperfect or missing channel estimation, we exploit an RL approach for obtaining an online policy that chooses the blocklength adaptively.
• We propose a deep Q-network (DQN) algorithm that dynamically chooses the appropriate MCS among the available MCSs defined in 5G URLLC standards [22]. The adaptive selection of both the codelength and the modulation order is investigated under different scenarios where the channel state information is available or unavailable.
• Extensive simulation results show that the proposed algorithms achieve significantly lower AVP than the fixed blocklength schemes and benchmark link adaptation policies.
The structure of the paper is as follows: In Section II, we present the system model adopted in the blocklength and MCS selection problems. In Section III, we investigate DP and RL-based adaptive blocklength selection methods. In Section IV, we study AVP minimization with dynamic MCS selection and propose a deep RL-based solution. In Section V, we compare the performance of our RL-based policies with the baseline methods. Section VI concludes the paper and discusses future work.

II. SYSTEM MODEL
We consider a discrete-time point-to-point communication link with stochastic arrivals of time-critical information packets. The source generates short status update packets according to a Bernoulli process, where $\lambda \in (0, 1)$ denotes the probability of a new packet arrival in one channel use (CU). The information packets are stored in a single-server queue with capacity 2, meaning that aside from the packet in service, there can be at most 1 packet in the queue. The queue follows a Last Come First Serve (LCFS) policy with preemption in the queue (LCFS-Q) as defined in [23]: If a new packet arrives when the queue is empty, it is sent to the server immediately. However, if the queue is not empty, the packet already waiting in the queue is replaced with the newly arrived packet. The LCFS-Q queueing policy has previously been shown to be more efficient than the First Come First Serve (FCFS) policy [24].
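The waiting-slot behavior of the LCFS-Q discipline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class name `LcfsQueue` is hypothetical, a packet is taken to have age 0 in the CU in which it arrives, and the server itself is not modeled.

```python
import random


class LcfsQueue:
    """Single waiting slot with preemption in the queue (illustrative)."""

    def __init__(self):
        self.waiting_age = None  # age in CUs of the waiting packet, None if empty

    def tick(self, arrival: bool) -> None:
        """Advance the queue by one channel use."""
        if arrival:
            # A new arrival replaces whatever was waiting (LCFS-Q preemption).
            self.waiting_age = 0
        elif self.waiting_age is not None:
            self.waiting_age += 1

    def pop(self):
        """Hand the waiting packet to the server; returns its age or None."""
        age, self.waiting_age = self.waiting_age, None
        return age


def simulate_arrivals(queue: LcfsQueue, lam: float, n_cu: int, seed: int = 0) -> None:
    """Drive the queue with Bernoulli(lam) arrivals for n_cu channel uses."""
    rng = random.Random(seed)
    for _ in range(n_cu):
        queue.tick(rng.random() < lam)
```

Calling `simulate_arrivals(q, lam, n)` for the duration of a frame and then `q.pop()` yields the age of the freshest packet that arrived during that frame, or `None` if there were no arrivals.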

A. SHORT PACKET TRANSMISSION MODEL
The information packet generated by the source consists of $k$ bits. The encoder maps the information packet to a codeword with blocklength $n$ and code rate $k/n$. After encoding and modulation, the packet is transmitted through the wireless channel. The packet is demodulated and decoded on the receiving side, and a positive or negative acknowledgment is given. Figure 1 illustrates the main components of the system model studied in this paper. We assume a memoryless block-fading channel where the fading coefficient is constant for a block of symbols. Each transmitted packet is subject to independent and identically distributed (IID) fading coefficients and additive white Gaussian noise. The input-output relation of the channel is as follows:
$$y = hx + w, \tag{1}$$
where $x$ and $y$ denote the transmitted and received symbols, respectively, $h$ is the corresponding fading coefficient, and $w$ denotes the additive noise. The fading coefficient $h$ is assumed to be constant during the transmission of a block with length $n$. Let $P$ denote the transmit power. Assuming additive white Gaussian noise (AWGN) with a standard normal distribution $\mathcal{N}(0, 1)$, the instantaneous SNR can be expressed as
$$\gamma = P|h|^2. \tag{2}$$
This paper focuses on transmitting short packets within stringent timeliness constraints. With significantly reduced coding gain, short packet communications are error-prone due to AWGN and fading. The successful reception of a transmission block or a decoding error is assumed to be acknowledged by an error-free single-bit ACK/NACK feedback.
We first study adaptive blocklength selection schemes minimizing the AVP in (16) and utilize non-asymptotic information theory results in order to derive the BLER for a chosen blocklength $n$, denoted by $\epsilon_n$. In the well-known study of Polyanskiy et al. [25], the maximal coding rate, i.e., the largest rate at which an encoder/decoder pair with coded blocklength $n$ and BLER lower than $\epsilon_n$ exists, is expressed as follows:
$$R^*(n, \epsilon_n) = C(\gamma) - \sqrt{\frac{V(\gamma)}{n}}\, Q^{-1}(\epsilon_n) + O\!\left(\frac{\log n}{n}\right), \tag{3}$$
where $C(\gamma)$ and $V(\gamma)$, defined as functions of the SNR $\gamma$ in (4) and (5), denote the capacity and channel dispersion, respectively. Lastly, $O(\log n / n)$ is the remainder term, and $Q(\cdot)$ is the tail distribution function of the standard normal distribution:
$$Q(x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt. \tag{6}$$
Rewriting (3) with rate $k/n$ and neglecting the remainder term allows us to formulate the block error rate $\epsilon_n$ given the number of information bits $k$, coded blocklength $n$, and SNR $\gamma$:
$$\epsilon_n \approx Q\!\left(\frac{C(\gamma) - k/n}{\sqrt{V(\gamma)/n}}\right). \tag{7}$$
Then, as a more realistic and practical approach, we consider an MCS selection problem to choose the blocklength and modulation order that minimize the AVP in short packet communications. We leverage finite blocklength approximations to obtain the BLER, denoted by $\epsilon_{n,M}$, for a given blocklength $n$ and modulation order $M$. In [25], an infinite constellation is assumed; thus, the expression for the maximal coding rate in (3) does not apply to practical modulation schemes with finite constellations such as $M$-ary quadrature amplitude modulation ($M$-QAM). In such cases, we cannot use the capacity definition in (4). Instead, we can exploit the mutual information bound $I(\gamma, M)$ of [26], given in (8).
Here, an $M$-QAM constellation with equiprobable symbols is assumed; $\gamma$ is the SNR at the receiver, $x_i \in \mathcal{X}_M$ is an $M$-QAM constellation point from the symbol set $\mathcal{X}_M$, and $y$ is the received signal. In [27], the authors provide an approximation of $I(\gamma, M)$, denoted by $I'(\gamma, M)$ and given in (9), based on multi-exponential decay curve fitting (M-EDCF). The coefficients ε and $\vartheta_j^{(M)}$ are provided in [27], and the approximation is shown to agree with the experimental results. To compute the maximum coding rate for an equiprobable $M$-QAM constellation, the capacity $C(\gamma)$ in (3) is replaced with $I'(\gamma, M)$ [26], with $V(\gamma)$ and $Q(\cdot)$ defined the same as in (5) and (6), respectively. Denoting the block error rate in this case by $\epsilon_{n,M}$, we can express the maximum coding rate as follows:
$$R^*(n, \epsilon_{n,M}) = I'(\gamma, M) - \sqrt{\frac{V(\gamma)}{n}}\, Q^{-1}(\epsilon_{n,M}) + O\!\left(\frac{\log n}{n}\right). \tag{10}$$
We can calculate the BLER by rewriting (10) in the following form:
$$\epsilon_{n,M} \approx Q\!\left(\frac{I'(\gamma, M) - k/n}{\sqrt{V(\gamma)/n}}\right). \tag{11}$$
Thus, we use (7) in the blocklength selection problem and (11) in the MCS selection problem for calculating the block error rate. In addition, we can utilize the MCS tables defined in the 5G standards [22]: one of the tables lists MCSs with modulation up to 256QAM, and the other two tables define MCSs with 64QAM at most. In this work, we investigate the MCS indexes introduced for low spectral efficiency cases and URLLC applications in [22, …]; the accompanying table outlines some of the MCS indexes with the corresponding modulation orders $M$, code rates $R$, and spectral efficiencies. The blocklength used in each MCS, and in (11) for BLER calculation, can be found as in (12).
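To make the normal approximation used for the blocklength selection problem concrete, the following Python sketch evaluates a BLER of the form $Q\big((nC(\gamma) - k)/\sqrt{nV(\gamma)}\big)$. It is only an illustration under stated assumptions: the capacity and dispersion are taken to be the standard complex-AWGN expressions $C(\gamma) = \log_2(1+\gamma)$ and $V(\gamma) = \big(1 - (1+\gamma)^{-2}\big)(\log_2 e)^2$, which may differ from the exact definitions in (4) and (5), the remainder term is dropped, and all function names are hypothetical.

```python
import math


def q_func(x: float) -> float:
    """Tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def capacity(snr: float) -> float:
    """Assumed capacity in bits per channel use (complex AWGN)."""
    return math.log2(1.0 + snr)


def dispersion(snr: float) -> float:
    """Assumed channel dispersion in bits^2 per channel use (complex AWGN)."""
    return (1.0 - 1.0 / (1.0 + snr) ** 2) * math.log2(math.e) ** 2


def bler(k: int, n: int, snr: float) -> float:
    """Normal-approximation BLER for k information bits in n channel uses."""
    v = dispersion(snr)
    if v <= 0.0:
        return 1.0  # zero SNR: nothing can be decoded reliably
    return q_func((n * capacity(snr) - k) / math.sqrt(n * v))


if __name__ == "__main__":
    # Illustrate the trade-off: longer blocks lower the error probability.
    for n in (100, 150, 200, 300):
        print(n, bler(k=100, n=n, snr=1.0))
```

Sweeping `n` for a fixed `k` and SNR shows the trade-off discussed in the introduction: the error probability falls as the blocklength grows, while the transmission time grows linearly with `n`.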
The adaptive MCS selection for AVP optimization can also be considered as an adaptive blocklength $n$ and modulation order $M$ selection problem, where the set of available blocklengths is determined using (12). We consider different scenarios to solve the adaptive blocklength and MCS selection problems. In the first one, the quantized channel state information at the transmitter (CSIT) is known and included in the state of the system. The channel quality indicator, $CQI$, is a measure of the channel condition that depends on the SNR and is obtained as in [19] through the quantization in (13), where $\gamma_{\min}$ and $\gamma_{\max}$ are the minimum and maximum SNR values, respectively, and $N_{cqi}$ is the total number of CQI states. $\lfloor \cdot \rfloor$ denotes the floor function, which takes a real number as input and returns the greatest integer less than or equal to it. Meanwhile, the second scenario studied in this paper is more practical, assuming that CSIT is unavailable and CQI is excluded from the state.
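One plausible way to realize such a CQI quantizer is a uniform partition of the SNR range followed by a floor operation, as sketched below in Python. The exact expression in (13) is not reproduced here, so the uniform step, the clipping at both ends, and the function name `quantize_cqi` are assumptions made purely for illustration.

```python
import math


def quantize_cqi(snr_db: float, snr_min_db: float, snr_max_db: float, n_cqi: int) -> int:
    """Map an SNR value to one of n_cqi CQI indices (assumed uniform quantizer)."""
    if snr_db <= snr_min_db:
        return 0
    if snr_db >= snr_max_db:
        return n_cqi - 1
    step = (snr_max_db - snr_min_db) / n_cqi
    # The floor operation selects the interval that contains the SNR.
    return min(n_cqi - 1, math.floor((snr_db - snr_min_db) / step))
```

With `snr_min_db = 0`, `snr_max_db = 20`, and `n_cqi = 4`, for example, each CQI state would cover a 5 dB interval.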

B. AGE VIOLATION PROBABILITY (AVP)
Let $\Delta_r(t)$ denote the AoI at the receiver at time $t \in \{0, 1, 2, \ldots\}$, defined as the time elapsed since the generation of the most recent packet that was successfully delivered:
$$\Delta_r(t) = t - u(t),$$
where $u(t)$ is the packet's time stamp. Similarly, $\Delta_q(t)$ denotes the AoI at the source queue at time $t$ and represents the time elapsed since the arrival of the last packet in the queue. $\Delta_r(t)$ keeps increasing in the absence of a successful transmission, that is, when a transmission error occurs or there is no status update packet in the system. If a transmission error occurs, the previously transmitted packet is discarded, and the packet waiting in the queue gets transmitted. If a packet is correctly decoded, $\Delta_r(t)$ is set to $\Delta_q(t)$. Figure 2 shows the evolution of $\Delta_r(t)$ over time. We aim to minimize the age violation probability, defined as the probability that $\Delta_r(t)$ exceeds a predetermined threshold $\Delta_{\max}$. Following the notations in [13] and [28], we can express the AVP as $\Pr(\Delta_r(t) > \Delta_{\max})$. We consider a frame-based model where the transmitter chooses a finite blocklength $n_l$ (and modulation order $M_l$ for MCS selection) at frames denoted by $l \in \{0, 1, 2, \ldots, L\}$. If there is a packet waiting at the source queue at the beginning of frame $l$, the transmitter transmits the most recent packet, selecting a finite blocklength $n_l$ (and modulation order $M_l$ for MCS selection). Otherwise, the transmitter stays idle for one CU, which is treated as a frame with a length of one CU, i.e., $n_l = 1$. Let $t_l \in \mathbb{Z}_{\geq 0}$ and $t_{l+1} \in \mathbb{Z}_{\geq 0}$ denote the starting times of the $l$th and $(l+1)$th frames, respectively, where
$$t_{l+1} = t_l + n_l.$$
Using a simplified version of the reward function used in [28] and [29], we count the number of CUs in which the instantaneous age at the receiver exceeds the age threshold, i.e., $\Delta_r(t) > \Delta_{\max}$, during each frame. We compute the AVP in (16) by taking the ratio of the time in which $\Delta_r(t)$ exceeds the threshold to the time elapsed over the total number of frames $L$ [28], using the indicator function $\mathbb{1}(\cdot)$, which is equal to 1 if there is an age violation, i.e., $\Delta_r(t) > \Delta_{\max}$, and 0 otherwise.
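The time-average computation described above can be sketched as follows. This Python snippet is illustrative only: it assumes that the receiver age grows by one per CU within a frame and that, after a successful frame of length $n$, the receiver age equals the age the delivered packet has reached by the end of the frame (its queue age at the frame start plus $n$). Both function names are hypothetical.

```python
def receiver_ages_in_frame(delta_r: int, delta_q: int, n: int, success: bool):
    """Per-CU receiver ages during one frame and the age at the next frame start.

    delta_r: receiver age at the frame start
    delta_q: age of the transmitted packet at the frame start
    n:       frame length in channel uses
    """
    trace = [delta_r + i for i in range(n)]
    # Assumed update rule: a decoded packet resets the receiver age to the
    # packet's own age at delivery; otherwise the age keeps growing.
    next_age = delta_q + n if success else delta_r + n
    return trace, next_age


def age_violation_probability(ages, threshold: int) -> float:
    """Fraction of channel uses in which the age exceeds the threshold."""
    if not ages:
        raise ValueError("age trace must not be empty")
    return sum(1 for a in ages if a > threshold) / len(ages)
```

Concatenating the per-frame traces over all frames and passing the result to `age_violation_probability` gives the ratio of violating CUs to total elapsed CUs.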

III. ADAPTIVE BLOCKLENGTH SELECTION FOR MINIMIZING AGE VIOLATION PROBABILITY
We consider the adaptive selection of the coding rate to minimize the AVP and address the trade-off between smaller blocklengths with higher error probability and larger blocklengths with longer transmission delays. To effectively employ RL-based techniques, we formulate our problem as a countable-state discrete-time discounted MDP. This MDP is characterized by a five-tuple consisting of the state space $\mathcal{S}$, the action space $\mathcal{A}$, the transition probabilities $P$, the reward function $R$, and a discount factor in $(0, 1)$ that determines the importance given to future rewards. $\mathcal{S}$ represents the countable state space and is investigated for two different sets $\mathcal{S}_1$ and $\mathcal{S}_2$, corresponding to the scenarios in which CSIT is available and unavailable, respectively. The first set includes the CQI at frame $l$ as a state variable and is formed by three components: $(\Delta_q(l), \Delta_r(l), CQI(l)) \in \mathcal{S}_1$. Meanwhile, the second set does not include the CQI, and thus $(\Delta_q(l), \Delta_r(l)) \in \mathcal{S}_2$ is formed by two components. With a slight abuse of notation, $\Delta_q(l)$, $\Delta_r(l)$, and $CQI(l)$ denote the age of the packet at the queue, the age at the receiver, and the quantized channel state at the beginning of frame $l$, respectively. That is, $\Delta_q(l)$ and $\Delta_r(l)$ represent the AoI at time $t_l$, i.e., $\Delta_q(l) = \Delta_q(t_l)$ and $\Delta_r(l) = \Delta_r(t_l)$.
The action space, $\mathcal{A}$, represents the finite set of blocklengths we can select, plus the stay-idle action, that is, $n_l = 1$. The reward function $R: \mathcal{S} \times \mathcal{A} \to \mathbb{Z}$ is defined in (17) in terms of $\Delta_r(l)$, the component of $S_l$ describing the AoI at the receiver, and the action $A_l = n_l$ for all $n_l \in \mathcal{A}$. Besides that, we also need to consider the states in which the queue is empty, denoted by $\Delta_q(l) = -1$. There should be no blocklength selection in such states since there are no packets to transmit. The system should stay idle, i.e., $n_l = 1$, until a new packet arrives.
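A reward of this kind can be evaluated in closed form, without stepping through the frame. The Python sketch below is an assumption-laden illustration rather than the definition in (17): it takes the reward to be the negative number of CUs in the frame whose receiver age exceeds the threshold, with the receiver age running from $\Delta_r(l)$ to $\Delta_r(l) + n_l - 1$ over the frame. The function name and the exact indexing are hypothetical.

```python
def frame_reward(delta_r: int, n: int, delta_max: int) -> int:
    """Negative count of violating CUs in a frame of length n.

    Assumes the ages seen in the frame are delta_r, delta_r + 1, ...,
    delta_r + n - 1, so the number of ages above delta_max is
    (delta_r + n - 1 - delta_max) clipped to the range [0, n].
    """
    violations = max(0, min(n, delta_r + n - 1 - delta_max))
    return -violations
```

Under this convention, an idle frame (`n = 1`) contributes a reward of $-1$ exactly when the receiver age already exceeds the threshold, and 0 otherwise.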
The state transition probabilities $P^{n_l}_{ss'} = P(S_{l+1} = s' \mid S_l = s, A_l = n_l)$ are determined by the underlying statistics of error probabilities and random packet arrivals. Therefore, we first identify all possible state transitions and calculate the corresponding probabilities.
If the queue is empty, i.e., $\Delta_q(l) = -1$, the transmitter stays idle for one CU and waits for a new packet arrival, that is, $n_l = 1$. The next queue state, $\Delta_q(l+1)$, depends on whether a packet arrives in that CU, which happens with probability $\lambda \in (0, 1)$, while $\Delta_r(l+1) = \Delta_r(l) + 1$ since no new packet is delivered to the receiver. The corresponding transition probabilities are given in (18) (omitting the parentheses from the state variables $(\Delta_q, \Delta_r)$), where $\Delta_q$ and $\Delta_r$ stand for $\Delta_q(l)$ and $\Delta_r(l)$, respectively. When the queue is not empty at the beginning of frame $l$, i.e., $\Delta_q(l) \neq -1$, a packet is waiting to be transmitted. Then, the transmitter chooses a finite blocklength $n_l$ from the available blocklengths, $n_l \in \mathcal{A}$. $\Delta_q(l+1)$ depends on the arrival time of the most recent packet in the queue during the $n_l$ CUs of frame $l$, and $\Delta_q(l+1) = -1$ refers to the case of no packet arrivals throughout the $n_l$ CUs. For Bernoulli arrivals with rate $\lambda \in (0, 1)$, $\Delta_q(l+1)$ can take the following values with the corresponding probabilities for all $j \in \{0, \ldots, n_l - 1\}$:
$$\Pr(\Delta_q(l+1) = j) = \lambda (1 - \lambda)^{j}, \qquad \Pr(\Delta_q(l+1) = -1) = (1 - \lambda)^{n_l}.$$
The AoI at the receiver in the next frame, $\Delta_r(l+1)$, depends on the AoI in the queue at the beginning of frame $l$, $\Delta_q(l)$, and on whether a block error occurred, which happens with probability $\epsilon_{n_l}$ defined in (7).
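The distribution of the next queue age follows directly from the Bernoulli arrival model and can be tabulated as in the Python sketch below. The convention that a packet arriving in the last CU of the frame has age 0 at the next frame start is an assumption consistent with $j$ starting at 0, and the function name is hypothetical.

```python
def next_queue_age_distribution(n: int, lam: float) -> dict:
    """Distribution of the queue age at the next frame start.

    Key j in 0..n-1: the most recent arrival occurred j CUs before the end
    of the frame (one arrival followed by j arrival-free CUs).
    Key -1: no packet arrived during the n CUs, so the queue is empty.
    """
    dist = {j: lam * (1.0 - lam) ** j for j in range(n)}
    dist[-1] = (1.0 - lam) ** n
    return dist
```

The probabilities sum to one because $\lambda \sum_{j=0}^{n-1} (1-\lambda)^j = 1 - (1-\lambda)^n$. Longer frames therefore make an empty queue at the next decision epoch less likely.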
Unlike $\Delta_q(l)$ and $\Delta_r(l)$, the change in the CQI state is completely independent of the other state components and of the previous CQI state. We calculate the SNR as $\gamma = P|h|^2$, where the channel coefficient $h$ is assumed to be a Rayleigh random variable for simplicity. Since the probability density function of the Rayleigh distribution is known, the probabilities corresponding to the defined SNR intervals, and hence CQI intervals, can be calculated. In conclusion, using the packet arrival probabilities and state transitions expressed in (18) and (20), together with the CQI probabilities, we can obtain $P^{n_l}_{ss'}$ for all states and all actions. We remark that the formulated MDP has a countable state space, since both $\Delta_q(l) \in \{0, 1, \ldots\}$ and $\Delta_r(l) \in \{1, 2, \ldots\}$ are unbounded by definition. However, since the reward given in (17) is the same for all $\Delta_r(l) > \Delta_{\max}$, the problem can be reduced to a finite-state finite-action MDP where $\Delta_r(l), \Delta_q(l) \in [0, \Delta_{\max} + \ldots]$ [...] policies, Blackwell optimality holds for finite-state finite-action MDPs, and the gain of the discounted MDP described in this section approaches the AVP defined in (16) as the discount factor approaches 1.
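For the CQI probabilities mentioned above, a closed form is available under Rayleigh fading. The Python sketch below assumes a unit-mean-square fading coefficient, $E[|h|^2] = 1$, so that $\gamma = P|h|^2$ is exponentially distributed with mean $P$; that normalization, the linear-scale interval edges, and the function name are assumptions made for illustration.

```python
import math


def cqi_probabilities(edges, power: float):
    """Probability of each SNR interval when the SNR is exponential with mean `power`.

    edges: increasing interior boundaries on a linear SNR scale; the
           intervals are [0, e1), [e1, e2), ..., [e_last, infinity).
    """
    bounds = [0.0] + list(edges) + [math.inf]
    # CDF of an exponential random variable with mean `power`.
    cdf = [1.0 - math.exp(-b / power) for b in bounds]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]
```

Because the CQI process is IID across frames in this model, these interval probabilities are all that is needed to extend the transition probabilities from the age components to the full state in $\mathcal{S}_1$.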
We also adopt the state aggregation method [32] when constructing the state space, i.e., by combining similar states into groups, we reduce the number of states and hence the complexity of the problem. Although the time unit is one CU, the $\Delta_q(l)$ and $\Delta_r(l)$ components of the state do not point to a single value but to a collection of values. Hence, the mapping from the AoIs at the queue and the receiver to the states $\Delta_q(l)$ and $\Delta_r(l)$ is not one-to-one. With a much lower number of states, the complexities of the proposed algorithms are significantly reduced, and convergence is accelerated.
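A simple way to realize such a many-to-one mapping is to group consecutive age values into fixed-width bins, as in the Python sketch below. The uniform bin width, the saturation at `max_age`, and the function name are assumptions; the grouping actually used may differ.

```python
def aggregate_age(age: int, bin_width: int, max_age: int) -> int:
    """Map an age in CUs to an aggregated state index.

    The empty-queue marker (-1) is kept as its own state; all ages at or
    above max_age share the last bin.
    """
    if age < 0:
        return -1
    return min(age, max_age) // bin_width
```

With `bin_width = 4`, for instance, the ages 4, 5, 6, and 7 all map to aggregated state 1, which reduces the number of age states by roughly a factor of four.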
Next, we present two novel solution methods for the blocklength selection problem. The first is based on the value iteration method [30], [31], exploiting the knowledge of system characteristics, while the second utilizes Q-learning [33] without a priori knowledge of system characteristics.

A. VALUE ITERATION BASED ADAPTIVE BLOCKLENGTH SELECTION
Value iteration is a dynamic programming method that requires full knowledge of the environment dynamics, i.e., the state transition probabilities $P^{n_l}_{ss'}$ in (18) and (20) and the reward function $R(S_l, A_l)$ in (17). The purpose of value iteration is to maximize the state-value function $V(S_l)$, defined in (21) as the expected discounted accumulation of the future rewards starting from the state $S_l$ [30], [31]. It is possible to obtain the optimum state-value function $V^*(s)$ recursively as in (22), using the knowledge of $P^{n_l}_{ss'}$ and $R^{n_l}_s = R(S_l = s, A_l = n_l)$. In the value iteration method, we exploit (22) to obtain the maximum state-value function. After the iteration converges, we obtain a deterministic policy $\pi: \mathcal{S} \to \mathcal{A}$ as in (23). The value iteration-based adaptive blocklength selection method (VI-ABM) is summarized in Algorithm 1.
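The recursion described above can be illustrated with a generic value iteration routine. The following Python sketch is not Algorithm 1 itself: it uses a standard in-place Bellman optimality backup on an arbitrary finite MDP, and the data layout (`transitions[s][a]` as a list of `(probability, next_state)` pairs, `rewards[s][a]` as a number) and all names are assumptions made for illustration.

```python
def value_iteration(states, actions, transitions, rewards, discount, tol=1e-8):
    """Generic value iteration on a finite MDP.

    actions:     function mapping a state to its admissible actions
    transitions: transitions[s][a] -> list of (probability, next_state)
    rewards:     rewards[s][a] -> immediate reward
    """
    values = {s: 0.0 for s in states}

    def q_value(s, a):
        expected = sum(p * values[s2] for p, s2 in transitions[s][a])
        return rewards[s][a] + discount * expected

    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a) for a in actions(s))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:  # convergence check
            break

    # Greedy policy with respect to the converged values.
    policy = {s: max(actions(s), key=lambda a, s=s: q_value(s, a)) for s in states}
    return values, policy
```

For the blocklength problem, `actions(s)` would return only the idle action when the queue is empty and the set of admissible blocklengths otherwise.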

B. Q LEARNING BASED ADAPTIVE BLOCKLENGTH SELECTION
[Algorithm 1: VI-ABM. Recoverable steps: a loop over all $s = (\Delta_q(l), \Delta_r(l), CQI(l)) \in \mathcal{S}_1$, repeated until $\delta < \rho$ (convergence); a second loop over all $s \in \mathcal{S}_1$; return $\pi$.]
We propose two adaptive blocklength selection methods based on Q-learning, which assume no prior knowledge about the environment dynamics. The first Q-learning agent is assumed to know the quantized channel state information, so the CQI is included in the state $S_l = (\Delta_q(l), \Delta_r(l), CQI(l)) \in \mathcal{S}_1$ of the system. Also, note that although CQI knowledge is assumed, the channel state information is noisy and quantized with $N_{cqi}$ levels as in (13). On the other hand, the second agent knows only the ages at the queue and the receiver and assumes no CSIT. Hence, the CQI is excluded from the state $S_l = (\Delta_q(l), \Delta_r(l)) \in \mathcal{S}_2$. Actions and rewards are the same for the two scenarios. $\Delta_q(l)$ denotes the age of the packet in the queue, with $\Delta_q(l) = -1$ if the queue is empty, and $\Delta_r(l)$ denotes the age of the packet at the receiver. Q-learning is an online reinforcement learning algorithm for finding the optimal action-value function $Q(S_l, A_l)$, also known as the Q-function. The Q-function, defined in (24), is the expected discounted accumulation of the future rewards given state $S_l$ and action $A_l$. Q-learning is a model-free, off-policy temporal difference algorithm. The Q-learning agent learns entirely by trial and error, generating behavior with a behavior policy that is different from the learned target policy [33]. The agent faces a trade-off between exploitation and exploration [34], i.e., between choosing the action with the highest action-value estimate and choosing a non-greedy action to improve its estimates. ε-greedy is a simple strategy to balance the exploration-exploitation trade-off: with probability ε, the agent chooses a random action, and with probability $1 - ε$, it chooses a greedy action.
Firstly, we initialize the Q-functions $Q(S_l, A_l)$ to zero for all states $S_l \in \mathcal{S}$ and all actions $A_l \in \mathcal{A}$. We follow an ε-greedy policy with a decaying exploration rate: at each iteration, the exploration rate ε is multiplied by a decay rate ζ. The initial value is $ε = ε_{\max}$, and the minimum value is limited to $ε_{\min}$. At each iteration, according to the observed state $S_l$, the agent has to select either a blocklength $n_l$, if there is a packet waiting for service, or to stay idle for one CU, i.e., $n_l = 1$. After the action is executed, the environment moves to the next state $S_{l+1}$ and returns the reward $R(S_l, A_l)$ defined in (17). We update the corresponding Q-table entry $Q(S_l, A_l)$ according to the Bellman update in (25), where $α$, $0 < α < 1$, is the learning rate or step size. With a higher learning rate, the changes in $Q(S_l, A_l)$ are more rapid. Similar to the exploration rate, we use a decaying learning rate: starting with $α = α_{\max}$, the learning rate is multiplied by the same decay rate ζ in each iteration, and the minimum value it can take is $α_{\min}$. Assuming that all state-action pairs continue to be updated and the parameters ε and α are set properly, $Q(S_l, A_l)$ converges to the optimal value $Q^*(s, a) = Q(S_l = s, A_l = a)$ for a given frame $l$ [33].
Algorithm 2 gives a detailed explanation of our Q-learning-based adaptive blocklength selection method (QL-ABM).
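The tabular procedure outlined above (zero-initialized Q-table, ε-greedy action selection, and multiplicative decay of both ε and α down to their minimum values) can be sketched in Python as follows. This is an illustrative sketch rather than Algorithm 2 itself: the class name, the default hyperparameters, and the use of the standard Q-learning temporal-difference target are assumptions.

```python
import random
from collections import defaultdict

IDLE = 1  # stay-idle action, i.e., a frame of one channel use


class QLearningAgent:
    """Hypothetical tabular Q-learning agent for blocklength selection."""

    def __init__(self, blocklengths, discount=0.99, eps_max=1.0, eps_min=0.01,
                 alpha_max=0.5, alpha_min=0.01, decay=0.999, seed=0):
        self.blocklengths = list(blocklengths)
        self.discount = discount
        self.eps, self.eps_min = eps_max, eps_min
        self.alpha, self.alpha_min = alpha_max, alpha_min
        self.decay = decay
        self.q = defaultdict(float)  # Q-table, implicitly initialized to zero
        self.rng = random.Random(seed)

    def act(self, state, packet_waiting: bool):
        """Epsilon-greedy action; idle is forced when the queue is empty."""
        if not packet_waiting:
            return IDLE
        if self.rng.random() < self.eps:
            return self.rng.choice(self.blocklengths)
        return max(self.blocklengths, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, next_actions):
        """One temporal-difference update followed by parameter decay."""
        best_next = max(self.q[(next_state, a)] for a in next_actions)
        target = reward + self.discount * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
        self.eps = max(self.eps_min, self.eps * self.decay)
        self.alpha = max(self.alpha_min, self.alpha * self.decay)
```

The state passed to `act` and `update` can be any hashable tuple, for example `(delta_q, delta_r, cqi)` when the CQI is available or `(delta_q, delta_r)` when it is not.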

IV. ADAPTIVE MCS SELECTION FOR MINIMIZING AGE VIOLATION PROBABILITY
In this section, we focus on adaptively selecting the modulation and coding schemes to minimize the age violation probability, and present our solution based on deep Q-networks.

A. DQN BASED ADAPTIVE MCS SELECTION
Modulation and coding scheme selection is a more complex problem than blocklength selection. Because the number of actions and states is significantly larger, it is impractical to use a tabular method like Q-learning, where the Q-functions Q(S_l, A_l) for all states S_l ∈ S and actions A_l ∈ A are stored in a table. The required memory and computation resources are too high; thus, Q-learning fails to be a feasible solution, and we utilize deep reinforcement learning (DRL) methods instead [34]. DRL is a function approximation technique that uses deep neural networks (DNNs). The Q-function Q(S_l, A_l) is approximated by Q(S_l, A_l; θ), where θ is the vector of DNN weights, so that the network mimics the actual Q(S_l, A_l). The resulting network is also called a deep Q-network (DQN). It consists of an input layer, H hidden layers, and an output layer. The network takes a state S_l as input and outputs the Q-functions for state S_l and all possible actions.
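To make the input/output structure of such a network concrete, the following is a small dependency-free sketch of a forward pass. The layer sizes, the ReLU activation, the weight initialization, and the example state are all assumptions made for illustration; they are not taken from the paper.

```python
import random

def init_network(layer_sizes, seed=0):
    """Random weights and zero biases for a fully connected network.
    layer_sizes = [state_dim, hidden_1, ..., hidden_H, num_actions]."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params

def q_values(params, state):
    """Forward pass: state vector in, one Q-value per action out."""
    x = list(state)
    for i, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            # ReLU on hidden layers only (activation choice is assumed)
            x = [max(0.0, v) for v in x]
    return x

def greedy_action(params, state):
    """Index of the action with the largest estimated Q-value."""
    q = q_values(params, state)
    return max(range(len(q)), key=q.__getitem__)

# Hypothetical example: 3 state features, H = 2 hidden layers, 5 actions
params = init_network([3, 16, 16, 5])
example_state = (12, 40, 7)  # (queue age, receiver age, CQI), made-up values
example_q = q_values(params, example_state)
```

A single forward pass yields the estimates for every action at once, which is why the greedy action can be read off with one `max` over the output vector.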
Similar to Section III-B, we consider two DQN-based scenarios to solve the adaptive MCS selection problem. In the first one, the CQI information is known and included in the state S_l of the system. The second scenario is more practical: we know only the ages at the queue and receiver, and CQI is excluded from the state. Actions and rewards are the same for the two scenarios. Let S_1 and S_2 denote the state spaces for the first and second scenarios, with states (Δ_q(l), Δ_r(l), CQI(l)) ∈ S_1 and (Δ_q(l), Δ_r(l)) ∈ S_2, respectively. As in Section III, Δ_q(l) denotes the age of the packet in the queue, and Δ_q(l) = −1 if the queue is empty. Δ_r(l) denotes the age of the packet at the receiver. For the CQI state, instead of quantization as in Section III, here we obtain the CQI simply by rounding the SNR to the nearest integer.
Unlike the blocklength selection problem, we do not use the state aggregation method for Δ_q(l) and Δ_r(l). The evolutions of Δ_q(l) and Δ_r(l) in time are otherwise the same as in the blocklength selection problem: the age of the packet at the queue is affected only by new packet arrivals to the system. When a packet arrives at the queue, Δ_q(l) is reset to zero. Otherwise, it increases at unit rate. The age at the receiver Δ_r(l), on the other hand, grows until the transmission is completed successfully. Let n_l^(M) denote the blocklength used according to the chosen MCS index at frame l, where n_l^(M) = 1 implies the action of staying idle for one CU. Then, the changes in Δ_q(l) and Δ_r(l) after n_l^(M) CUs follow from these rules. Again, the CQI state after n_l^(M) CUs does not depend on the previous or the other CQI states but changes randomly according to the Rayleigh distribution.
The finite action space A represents the MCSs in [22, Table 5.1.3.1-3], plus the stay-idle action. We also design a slightly different reward function R(S_l, A_l) than the one in Section III. We count the number of age violations in each iteration caused by the selected action. However, this alone is not a sufficient solution: the reward of applying an action A_l is the same whether Δ_r(l) is above the threshold or not. Thus, the reward should include information about how much the threshold is exceeded. Also, as in the blocklength selection problem, the DQN agent should not choose to stay idle unless the queue is empty. Again, rewards corresponding to these cases are large negative values. On the other hand, the reward of choosing to stay idle when the queue is empty is zero, as it is the optimal action to take in that state. We follow a slightly different notation from Section III here: a_0 corresponds to the action of staying idle, i.e., n_l^(M) = 1. Then, the reward function is expressed in (28).
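The properties listed above can be illustrated with a small reward function. This is a hypothetical sketch that only reproduces the qualitative behavior described in the text; it is not the paper's reward in (28). The penalty constant, the per-CU excess sum, and the treatment of transmitting from an empty queue are assumptions.

```python
IDLE = 0                # index of the stay-idle action a_0
IDLE_PENALTY = -1000.0  # hypothetical "large negative" reward

def reward(age_receiver, queue_age, action, blocklength, age_threshold):
    """Hypothetical reward with the properties described in the text.

    queue_age == -1 encodes an empty queue; blocklength is the number of
    CUs occupied by the chosen action (1 for staying idle).
    """
    queue_empty = queue_age == -1
    if action == IDLE:
        # idling is the optimal choice only when there is nothing to send
        return 0.0 if queue_empty else IDLE_PENALTY
    if queue_empty:
        # assumption: transmitting without a packet is also penalized
        return IDLE_PENALTY
    # The receiver age grows by one per CU while the packet is in flight.
    # Summing the excess over the threshold counts each violation and
    # also reflects how far the threshold is exceeded.
    penalty = 0.0
    for i in range(1, blocklength + 1):
        excess = age_receiver + i - age_threshold
        if excess > 0:
            penalty += excess
    return -penalty
```

With a threshold of 100 CUs, for instance, a three-CU transmission starting from a receiver age of 99 passes through ages 100, 101, and 102, so the sketch returns −3, while the same transmission starting from age 10 returns 0.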
The DQN agent learns iteratively from experience. An experience is represented by the tuple (S_l, A_l, R(S_l, A_l), S_{l+1}): the state S_l, the action A_l taken in state S_l, the reward R(S_l, A_l) obtained by taking action A_l in state S_l, and the resulting next state S_{l+1}. A replay buffer with a limited size stores the experiences, and to train the network, a batch of experiences is sampled randomly from the buffer. This method improves stability because it eliminates the correlations between the samples and covers a wider variety of state-action pairs [34]. Instabilities are also limited by the use of two networks in the training process: the main network and the target network. The main network is represented by the action-value function Q(S_l, A_l; θ) with weight vector θ, and the target network by Q(S_l, A_l; θ^−). While the main network is actively trained, the target network is updated every N episodes. The purpose is to improve stability and increase the probability of convergence by avoiding rapid changes in Q(S_l, A_l; θ^−).
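The two stabilizing mechanisms, experience replay and a periodically updated target network, can be sketched without any deep learning library. The class and function names below are hypothetical, and the weight vectors are plain lists standing in for θ and θ^−.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=0):
        # a bounded deque silently drops the oldest experience when full
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform sampling without replacement breaks the temporal
        # correlation between consecutive experiences
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def maybe_sync_target(main_weights, target_weights, episode, sync_every):
    """Return a copy of the main weights every `sync_every` episodes,
    otherwise keep the current target weights unchanged."""
    if episode % sync_every == 0:
        return list(main_weights)
    return target_weights

# Usage with made-up experiences
buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push(t, 0, -1.0, t + 1)
batch = buf.sample(2)
```

Keeping the target weights fixed between synchronizations means the regression targets change only every N episodes, which is the source of the stability benefit mentioned above.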
At each time step in an episode of the algorithm, the agent chooses an action A_l with an ε-greedy approach: with probability ε, a random action is selected. Otherwise, the action with the maximum Q value is selected. As in QL-ABM, we use a decaying exploration rate ε. Execution of action A_l results in reward R(S_l, A_l) and state S_{l+1}. The experience (S_l, A_l, R(S_l, A_l), S_{l+1}) is stored in the replay buffer. The agent is trained with a minibatch of experiences sampled randomly from the replay buffer. The difference between the actual and predicted results, i.e., the loss L(θ), is calculated. As the loss function, we use the Huber loss [35] given in (29). Equation (29) states that if the error is smaller than φ, the Huber loss is quadratic like the mean squared error (MSE); for errors greater than φ, it is linear like the mean absolute error (MAE). As the MSE loss squares the difference, it puts more weight on outliers, i.e., observations that differ substantially from the others. On the other hand, the MAE loss weighs all errors on a linear scale and is therefore much less sensitive to outliers. By combining MSE and MAE, the Huber loss balances the weight given to outliers.
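As a concrete reference, the standard Huber loss with threshold φ can be written in a few lines. The function names and the example values are illustrative assumptions; the piecewise definition itself is the usual one (half the squared error below the threshold, a shifted absolute error above it).

```python
def huber_loss(target, prediction, phi=1.0):
    """Huber loss for a single sample with threshold phi."""
    error = abs(target - prediction)
    if error <= phi:
        # quadratic region: behaves like (half) the squared error
        return 0.5 * error ** 2
    # linear region: behaves like the absolute error, shifted so the
    # two pieces meet continuously at error == phi
    return phi * (error - 0.5 * phi)

def mean_huber_loss(targets, predictions, phi=1.0):
    """Average Huber loss over a minibatch."""
    losses = [huber_loss(t, p, phi) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

small = huber_loss(1.0, 0.5)   # error 0.5, quadratic region
large = huber_loss(5.0, 0.0)   # error 5.0, linear region
batch_loss = mean_huber_loss([1.0, 5.0], [0.5, 0.0])
```

With φ = 1, an error of 0.5 contributes 0.125 while an error of 5 contributes 4.5 rather than the 12.5 that half the squared error would give, which shows how large errors are damped.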
As the training progresses, the loss is expected to converge to arbitrarily small values. Lastly, every N episodes, the weights of the main network are copied to the target network. The algorithm for our DQN-based adaptive MCS selection method is given in Algorithm 3, and the related parameters are listed in Table 3.

B. BASELINE SOLUTIONS
To evaluate their performance, we compare our DQN-based solutions with two baseline methods: ILLA and OLLA [21]. ILLA is an adaptive MCS selection method based on a fixed lookup table approach; it chooses an MCS index that satisfies a target BLER requirement for a given SNR value. The measured SNR can be unstable because of variations in the wireless channel, quantization errors, and delays. In such cases, ILLA becomes an inefficient solution, and the OLLA technique is used in addition to ILLA to improve performance. OLLA adjusts the measured SNR γ with an offset η_olla according to the ACK/NACK feedback about the transmitted packet. The resulting SNR γ_olla is then used for MCS selection.
We first evaluate the performance of VI-ABM and QL-ABM-1&2 compared with fixed blocklength schemes. We fix the number of information bits to k = 100, and the blocklengths in our action space go from 100 to 300 with a step size of 25. In VI-ABM, the number of iterations run for each scenario is 200, and the discount factor is 0.95. The number of iterations and the discount factor in QL-ABM-1&2 are 100000 and 0.95, respectively. As mentioned before, we use a decaying exploration rate ε in QL-ABM-1&2, and the related parameters are ε_max = 1, ε_min = 0.01, and ζ = 1 − 10^−4. We also use a decaying learning rate with the same decay rate ζ, and the maximum and minimum values are α_max = 0.5 and α_min = 10^−4.
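A short calculation shows what these decay settings imply over the 100000 Q-learning iterations. The helper name is hypothetical, and the sketch assumes the rate is multiplied by ζ once per iteration and clipped at its minimum, as described in Section III.

```python
import math

def steps_to_floor(start, floor, decay):
    """Smallest n such that start * decay**n <= floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))

decay = 1 - 1e-4                                 # zeta
eps_steps = steps_to_floor(1.0, 0.01, decay)     # exploration rate
alpha_steps = steps_to_floor(0.5, 1e-4, decay)   # learning rate

# Blocklength action space: 100 to 300 in steps of 25
blocklengths = list(range(100, 301, 25))
```

Under these assumptions, ε reaches its floor after roughly 46,000 iterations and α after roughly 85,000, so the agent acts almost greedily for the second half of training while the step size keeps shrinking. The action space contains nine blocklengths in addition to the stay-idle action.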
Figure 7 shows the results obtained with different transmit power levels when the arrival rate and threshold are fixed (λ = 0.01 and Δ_max = 800 CUs). Low transmit power implies that the probability of experiencing low SNR levels is high. For low P values, large blocklengths (n ≥ 200) yield the lowest AVP among the fixed blocklength schemes. This is because more redundancy bits are needed for reliable transmission, i.e., low BLER, in low SNR cases. As P increases, constantly using a large blocklength becomes inefficient, and smaller n values such as 100 and 125 become advantageous. On the other hand, our adaptive blocklength methods provide lower AVP for the majority of P levels since they can dynamically select the blocklength best suited to each channel realization. Although QL-ABM-2 without CQI performs slightly worse than QL-ABM-1 with CQI, it still matches or surpasses the best fixed blocklength schemes. The performance differences between VI-ABM, QL-ABM-1, and QL-ABM-2 are more apparent for lower P values. Since high SNR levels are rarely experienced at low transmit powers, the Q-learning agent without CQI knowledge cannot learn about them thoroughly, so it does not know which action is optimal in the states corresponding to high SNR. Meanwhile, VI-ABM and QL-ABM-1 achieve significantly lower AVP than the other schemes for all P values. It is worth noting that VI-ABM requires a priori knowledge of CSIT and system dynamics, which may not always be feasible.
Figure 8 displays the results for varying packet arrival rate λ, where P = 0 dB and Δ_max = 800 CUs. When λ is small, packet arrivals are sparse, and the main factor increasing the age is the idle periods in which the system waits for a new packet arrival. Thus, AVP is very high for both the fixed blocklength schemes and our methods. As λ increases, these idle periods are shortened; hence, AVP decreases significantly for all schemes. When λ = 0.1, the probability of updating the queue with a newly arrived packet is high, which leads to smaller Δ_q and, therefore, smaller Δ_r and AVP. VI-ABM performs better than the fixed blocklength schemes for the whole range of λ values, and the performance gap becomes more visible for larger λ. Although not as good as VI-ABM and QL-ABM-1, QL-ABM-2 also achieves lower AVP than the fixed blocklength schemes for all packet arrival rates.
Lastly, Figure 9 shows the age violation probabilities for different age thresholds. The transmit power P is kept constant at 0 dB, and the arrival rate λ is 0.01. For low Δ_max values, AVP is large for all cases, as expected. As Δ_max is increased, AVP decreases substantially for all schemes. VI-ABM and QL-ABM-1&2 outperform the fixed blocklength schemes as the threshold increases, while VI-ABM achieves the lowest age violation probability for all threshold values.
It is clear that for all scenarios, VI-ABM is superior to both QL-ABM-1 and QL-ABM-2. Nevertheless, it is essential to recall that value iteration is a model-based method; hence, it requires complete knowledge of the environment dynamics, such as state transition probabilities and reward models. On the other hand, a Q-learning agent learns by trial and error, as it has no prior knowledge about the environment. It also suffers from the exploration-exploitation trade-off mentioned in Section III-B. Thus, it is reasonable that VI-ABM performs better than the Q-learning-based methods, considering its prior knowledge and higher complexity. In addition, among the two Q-learning-based methods, QL-ABM-1 outperforms QL-ABM-2 for all test scenarios, which is understandable, as SNR, and hence the CQI state, is a crucial factor in determining the probability of error and affects the action selection process. Nevertheless, QL-ABM-2 is a more practical method than QL-ABM-1 as it does not require knowledge about CSIT.

B. ADAPTIVE MCS SELECTION
We compare the performance of the two DQN-based solutions with the baseline methods ILLA and OLLA. Three target BLER values (10^−1, 10^−3, 10^−5) are used with the ILLA method, and for OLLA we set the target BLER to 10^−1. The number of information bits is set to k = 200. In the MCS table [22], the modulation order M and the coding rate R for each MCS index are provided, and the corresponding blocklength n can be computed as n = k / (R · log_2 M). We refer to the proposed policies when the information on the CQI state is available and unavailable as DQN-AMC-1 and DQN-AMC-2, respectively.
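The blocklength formula can be checked with a few lines of code. The function name is hypothetical, rounding up to a whole number of channel uses is an assumption, and the (R, M) pairs below are made-up examples rather than entries copied from the MCS table.

```python
import math

def blocklength(info_bits, code_rate, modulation_order):
    """n = k / (R * log2(M)), rounded up to a whole number of CUs.

    modulation_order is the constellation size M (e.g. 4 for QPSK),
    so log2(M) is the number of coded bits carried per channel use.
    """
    bits_per_symbol = math.log2(modulation_order)
    return math.ceil(info_bits / (code_rate * bits_per_symbol))

k = 200  # number of information bits used in this section
n_low = blocklength(k, 0.25, 4)    # low-rate example with M = 4
n_mid = blocklength(k, 0.5, 4)     # higher rate, same modulation
n_high = blocklength(k, 0.5, 16)   # same rate, higher-order modulation
```

For k = 200, these examples give 400, 200, and 100 CUs, respectively, which illustrates why low MCS indices (low rate, low modulation order) translate into long blocklengths.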
Figure 10 shows the age violation probability of different schemes for various transmit power levels P. The age threshold Δ_max and arrival rate λ are fixed at 5000 CUs and 0.005, respectively. When P is low, the probability of having poor channel conditions is higher; thus, the frequently seen SNR values are low, and erroneous transmissions heavily influence AVP. As ILLA and OLLA schemes use low MCS indices to achieve the target BLER, their AVP is high because of the large blocklengths, so the DQN-AMC schemes provide lower AVP. As P increases, the superior performance of DQN-AMC becomes more visible. However, for transmit powers above around 4 dB, ILLA and OLLA schemes become more advantageous as higher MCS indices with small blocklengths are used. Notably, while the ILLA schemes have similar performance, AVP increases as the target BLER of ILLA goes from 10^−1 to 10^−5, since only a lower MCS index with a larger blocklength satisfies the lower BLER requirement at a given SNR. Meanwhile, it is evident that using OLLA does not significantly affect the age violation probability. Comparing the two DQN-AMC schemes, DQN-AMC-1 clearly outperforms DQN-AMC-2 for most of the P levels. Still, considering that DQN-AMC-2 does not know the SNR and has lower complexity regarding the number of states, it is a feasible solution.
Figure 11 shows the age violation probability for different packet arrival rates. At the lowest arrival rate (λ = 0.001), DQN-AMC schemes are insufficient. The reason is that the DRL agent mainly encounters the states in which the queue is empty, even with a high exploration rate. Therefore, it cannot fully learn the optimal actions when the queue is non-empty. Increasing λ to about 0.005 leads to a substantial reduction of AVP in all schemes, but the reduction is much larger for DQN-AMC schemes. For λ values above 0.005, changes in AVP become negligible for all schemes. As in the previous results, ILLA with BLER = 0.1 and OLLA perform very similarly, and for ILLA with a smaller target BLER, we observe higher AVP.
In Figure 12, AVP is plotted for different age thresholds Δ_max while the transmit power P is fixed at 0 dB and the arrival rate λ is 0.005. As can be seen, DQN-AMC schemes surpass the performance of ILLA and OLLA schemes. Also, DQN-AMC-1 achieves lower AVP than DQN-AMC-2 for almost all threshold values. Consistent with the previous results, the ILLA scheme with BLER = 10^−5 has the highest AVP, and the difference between the ILLA schemes is visible. Again, the OLLA scheme improves the performance only negligibly. As the threshold increases, the probability of age violation is reduced for all schemes.
The proposed DQN-AMC methods achieve lower age violation probabilities for most of the test scenarios. DQN-AMC-1, which includes CQI information in the state, performs better than DQN-AMC-2 in general. This is understandable, as SNR, and hence CQI, is one of the main factors determining the probability of error and affecting the action selection process. Nevertheless, DQN-AMC-2 is an efficient method considering that it does not require knowledge about the SNR and has a lower number of states, and thus lower complexity.

VI. CONCLUSION AND FUTURE WORK
This paper addresses short packet communication links with strict timeliness requirements for xURLLC and mMTC systems. To capture data timeliness, we optimize the age violation probability through dynamic selection of the blocklength and of the modulation and coding scheme. For dynamic blocklength selection, we propose value iteration and Q-learning under non-asymptotic information theory approximations. Simulation results show that optimal blocklengths exist for different transmit powers, arrival rates, and predefined age thresholds. The proposed adaptive blocklength selection methods with and without CSIT significantly outperform the fixed blocklengths, even under unknown arrival rate and block error rate conditions. For adaptive modulation and coding scheme selection, due to the large state space, we introduce two algorithms based on DQN, with and without CSIT. Our DQN-based approach exhibits significantly lower age violation probability than the ILLA and OLLA baseline methods. Across the dynamic blocklength and modulation/coding problems, the gap between methods with and without channel state information narrows as SNR increases. In the future, these methods have the potential to be extended to xURLLC and mMTC systems with multiple users and various channel models that consider the distinct geographical locations and path loss of the transmitter and the receiver.

FIGURE 1. System model for the blocklength and MCS selection problems.

FIGURE 4. Coding rate versus AVP for different transmit power levels when λ = 0.01 and Δ_max = 800 CUs (red circles correspond to the minimum AVPs).

FIGURE 5. Coding rate versus AVP for different arrival rates when P = 0 dB and Δ_max = 800 CUs (red circles correspond to the minimum AVPs).

FIGURE 6. Coding rate versus AVP for different age thresholds when P = 0 dB and λ = 0.01 (red circles correspond to the minimum AVPs).

FIGURE 2. The evolution of Δ_r(t) in the presence of random packet arrivals with LCFS-Q and transmission errors.