Multi-Agent DRL Approach for Energy-Efficient Resource Allocation in URLLC-Enabled Grant-Free NOMA Systems

Grant-free non-orthogonal multiple access (GF-NOMA) has emerged as a promising access technology for the fifth generation and beyond wireless networks that enable ultra-reliable and low-latency communications (URLLC) to ensure low access latency and high connectivity density. Furthermore, designing energy-efficient (EE) resource allocation strategies is a crucial aspect of future cellular system development. Taking these goals into account, this paper proposes an EE sub-channel and power allocation strategy for URLLC-enabled GF-NOMA (URLLC-GF-NOMA) systems based on multi-agent (MA) deep reinforcement learning (MADRL). In particular, the URLLC-GF-NOMA methods using MA dueling double deep Q network (MA3DQN), MA double deep Q network (MA2DQN), and MA deep Q network (MADQN) techniques are designed to enable users to select the most appropriate sub-channel and transmission power for their communications. The aim is to build an efficient MADRL-based solution, ensuring rapid convergence with small signaling overhead, to maximize the network EE while fulfilling the URLLC requirements of all users. Simulation results show that the MADQN and MA2DQN methods, which have lower complexity than MA3DQN, are more appropriate for the URLLC-GF-NOMA systems under consideration. Moreover, our proposed methods exhibit superior convergence characteristics, a reduction in signaling overhead, and enhanced EE performance compared to other benchmark strategies.

Recently, the combination of NOMA and URLLC has been investigated in several works [9], [10], [11] to increase connectivity and guarantee the reliability and latency requirements of wireless networks. Specifically, these works considered multiple-input multiple-output (MIMO) and multiple-input single-output (MISO) schemes for URLLC-enabled systems to improve the system performance in terms of reliability and latency. The works proposed user-pairing methods based on the power-domain NOMA principle to enhance connectivity and reduce interference. However, the above works did not examine the GF access method, which can support massive access and reduce the transmission latency for wireless systems requiring high reliability and low latency.
Taking GF transmission into account, the works in [17], [18] studied GF access for OMA. In the GF-OMA scheme, users can select RBs randomly, and each RB is used strictly by a single user for successful reception. This limitation may lead to severe collisions when the number of users is much higher than the number of available RBs. To overcome this challenge, GF-NOMA has emerged as a promising technology for massive access by allowing multiple users to access the same RB based on the power-domain NOMA [7]. In particular, the users occupying the same RB are distinguished by different received power levels, and multi-user data can be decoded at the receivers by utilizing the successive interference cancellation (SIC). The traditional contention-based GF-NOMA schemes are implemented by dividing a cell area into multiple fractions and using the orthogonal resource allocation among those fractions to reduce the inter-fraction collisions [8], [19]. Nevertheless, the spectrum competition among users within the same fraction is still high, resulting in severe interference and reducing system performance. Thus, it is important to find a smart congestion control method to reduce the collisions and improve the long-term system performance.
Intelligent features are an important aspect of future cellular networks, and many current research works have applied RL-based algorithms to address the collisions and severe interference in massive access scenarios [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31]. Specifically, Sharma and Wang [20] proposed a collaborative distributed Q-learning algorithm for the frame-based slotted-Aloha (SA) random access (RA) scheme to find the best resource block allocation strategy for IoT users, in order to avoid collisions in GF-OMA-based IoT systems. The authors in [21], [22], [23], [24] investigated the application of Q-learning to different GF-NOMA scenarios with/without SPC to mitigate the congestion and interference in overloaded systems, where the number of users is larger than the number of available RBs. However, RL-based algorithms such as Q-learning are not suitable for large high-dimensional state-action spaces [13], making them inadequate for addressing the network optimization problems in complex and large-scale scenarios of future wireless networks.
To overcome the aforementioned challenges, recent studies have been applying deep RL (DRL) to address the complex resource allocation problems and optimize system performance [25], [26], [27], [28], [29], [30], [31]. In particular, the work in [25] proposed a DRL framework to find an optimal resource management strategy for GF-OMA systems and address dynamic spectrum access issues. In [26], a DRL algorithm based on generative adversarial networks was proposed to minimize power consumption while ensuring high reliability and low latency for orthogonal frequency division multiple access (OFDMA) systems. To further improve the spectral access efficiency and enhance the system performance, DRL-based GF-NOMA schemes were investigated in [27], [28], [29], [30], [31] under different scenarios. Specifically, the work [27] investigated a pilot sequence-based GF-NOMA system and proposed a centralized training distributed execution multi-agent (MA) DRL (MADRL) solution to maximize the network throughput (number of successfully served users). Additionally, different MADRL-based dynamic resource allocation strategies for power-domain GF-NOMA systems were investigated in [28], [29] to maximize the system throughput [28] and sum rate [29]. In [30], [31], DRL-based methods were proposed for GF-NOMA systems enabling massive URLLC (mURLLC) to maximize the long-term average throughput.

B. CONTRIBUTIONS
Unlike the aforementioned works on GF-NOMA systems, this paper investigates an MADRL-based resource allocation strategy aimed at maximizing the energy efficiency (EE) while satisfying the users' requirements on reliability and latency for URLLC-enabled GF-NOMA (URLLC-GF-NOMA) systems. Given the stringent reliability and latency requirements of URLLC users, there is a demand for an efficient and rapid communication protocol. Therefore, our focus is on constructing an effective distributed MADRL-based solution that achieves both EE and rapid convergence with minimal signaling overhead. The approach is designed to reduce the information exchange between the environment and agents, which in turn lowers the processing latency for URLLC users. Specifically, we consider a GF-NOMA scenario where the users compete for the RBs, i.e., sub-channels (SCs) and transmission power levels (TPLs), to communicate with the BS by randomly selecting one SC and one TPL for their transmissions. Following the NOMA principle, the users utilizing the same SC are distinguished by their received power at the BS, and their messages are decoded successively using SIC [8]. However, owing to its random access nature, GF-NOMA may cause severe interference since too many users can select the same SC, degrading the system performance. To overcome this drawback, we utilize DRL techniques to enable the users to find the most suitable SCs and TPLs for their transmissions, optimizing the network EE and fulfilling the URLLC requirements of all users. Thus, the main contributions of this paper are summarized as follows:
• Given that EE is an important factor due to users' energy limitations, we investigate the problem of maximizing the long-term average EE for URLLC-GF-NOMA systems. The goal must be achieved while also ensuring the strict requirements of users in terms of reliability and latency, which necessitates a rapid and efficient transmission protocol. Building on this EE maximization problem, we further investigate the objectives of maximizing the sum rate and minimizing power consumption to clarify the benefits of the proposed problem in balancing the achievable sum rate against power consumption for energy-limited users.
• We develop three distributed MADRL-based resource allocation methods to address the considered problem: MA Dueling Double Deep Q Network (MA3DQN), MA Double Deep Q Network (MA2DQN), and MA Deep Q Network (MADQN). Within this context, the MADRL frameworks are designed to provide energy-efficient learning-based solutions which ensure rapid convergence and minimal signaling overhead, ultimately reducing the processing latency for URLLC users.
• We provide a performance comparison between the proposed mechanisms and other benchmark schemes to clarify the benefits of the former in terms of convergence property and EE performance. Additionally, we evaluate the effects of different state-action spaces, URLLC requirements, and the number of users on the achieved rewards and EE performance. The provided numerical results demonstrate that the proposed solutions outperform other benchmark schemes, achieving higher EE, faster convergence, and reduced signaling overhead.
The remainder of the paper is organized as follows. Section II presents the system model, URLLC method, and the EE maximization problem. Section III describes the MADRL-based solution of the EE optimization problem for the considered URLLC-GF-NOMA system. Section IV provides the obtained simulation results and discussions. Finally, Section V concludes this paper. For clarity, we provide a summary of the main notations and symbols used in this paper in Table 1.

II. SYSTEM MODEL
We consider an uplink URLLC-GF-NOMA system consisting of one base station (BS) and a set of $M$ URLLC users, denoted by $\mathcal{M}$, uniformly distributed around the BS within a circular cell of radius $r_c$ (m), as shown in Fig. 1. The system bandwidth is equally divided into a set of $K$ orthogonal SCs, denoted by $\mathcal{K}$, to serve the users. Moreover, the GF-NOMA transmission strategy is utilized to improve the spectrum access efficiency and guarantee the strict requirements of the URLLC users in overloaded scenarios, i.e., $M > K$. Following this transmission scheme, the users utilize the available SCs to communicate with the BS, and multiple users can share the same SC based on the power-domain NOMA principle [7].
In 5G new radio (5G-NR) networks, the SC's bandwidth is defined as $2^{\nu}$ times the SC's bandwidth in 4G systems (i.e., 180 kHz), where $\nu \in \{0, 1, 2, 3, 4\}$ denotes the numerology index, which stands for the various SC types in order to support different services [32], [33]. In particular, the SC with higher bandwidth is used for URLLC service, while other services such as enhanced mobile broadband (eMBB) and massive machine type communications (mMTC) can utilize the numerology with smaller SC spacing. Given this context, this paper considers a scenario where the total bandwidth is divided into a set of SCs, i.e., $\mathcal{K}$, serving the URLLC users, and the bandwidth of each SC is defined as $W = 2^{\nu} \times 180$ (kHz).
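The bandwidth rule above can be checked with a few lines of Python. This is a minimal sketch; the function and constant names are illustrative and do not come from the paper.

```python
BASE_SC_BANDWIDTH_KHZ = 180  # 4G sub-channel bandwidth stated in the text


def sc_bandwidth_khz(nu: int) -> int:
    """Return W = 2^nu * 180 kHz for a numerology index nu in {0, ..., 4}."""
    if nu not in range(5):
        raise ValueError("numerology index must be in {0, 1, 2, 3, 4}")
    return (2 ** nu) * BASE_SC_BANDWIDTH_KHZ


# Print the sub-channel bandwidth for every admissible numerology index.
for nu in range(5):
    print(f"nu = {nu}: W = {sc_bandwidth_khz(nu)} kHz")
```

Larger values of the numerology index give wider SCs, which matches the statement that the SCs with higher bandwidth are used for the URLLC service.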

A. UPLINK GF-NOMA TRANSMISSION PROCESS
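Under the GF strategy, the users are free to choose the SCs for their transmissions without any scheduling instructions from the BS. However, this can lead to severe collision issues as too many users may select the same SCs. To mitigate this drawback, the NOMA technique can be applied, where multiple users can access the same SC. Considering the NOMA transmission process over SC $k$ ($k \in \mathcal{K}$) in time slot (TS) $t$, we denote $x_m^{(k)}(t) \in \{0, 1\}$ as the SC assignment indicator, where $x_m^{(k)}(t) = 1$ if user $m$ selects SC $k$ in TS $t$ and $x_m^{(k)}(t) = 0$ otherwise. Let $M_k$ be the number of users using SC $k$ in TS $t$, i.e., $\sum_{k=1}^{K} M_k = M$. Then, the received signal at the BS over SC $k$ in TS $t$ is given by
$$y^{(k)}(t) = \sum_{m=1}^{M_k} \sqrt{P_m^{(k)}(t)}\, h_m^{(k)}(t)\, s_m^{(k)}(t) + n(t), \quad (1)$$
where $n(t) \sim \mathcal{CN}(0, \sigma^2)$ is the additive white Gaussian noise (AWGN), $s_m^{(k)}(t)$ is the unit-power message of user $m$, $P_m^{(k)}(t)$ is the transmission power of user $m$, and $h_m^{(k)}(t)$ represents the channel coefficient between user $m$ and the BS over SC $k$ in TS $t$.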
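We assume that the users using SC $k$ are sorted in the descending order of the corresponding received power level at the BS, i.e., $P_1^{(k)}(t)|h_1^{(k)}(t)|^2 \geq P_2^{(k)}(t)|h_2^{(k)}(t)|^2 \geq \dots \geq P_{M_k}^{(k)}(t)|h_{M_k}^{(k)}(t)|^2$, where $P_m^{(k)}(t)$ and $h_m^{(k)}(t)$ are the transmission power and channel coefficient of user $m$ over SC $k$ in TS $t$, and $\sigma^2$ is the noise power. Following the NOMA principle, the messages of the users with higher received power level are decoded earlier at the BS. Specifically, the BS decodes the message of a user by treating the messages of users with lower received power level as noise [11], [34]. It then reconstructs and removes this component from the received signal to decode the remaining users' messages successively by using the SIC technique. Accordingly, the received signal-to-interference-plus-noise ratio (SINR) of user $m$ over SC $k$ in TS $t$ is expressed as
$$\gamma_m^{(k)}(t) = \frac{P_m^{(k)}(t)\,|h_m^{(k)}(t)|^2}{\sum_{j=m+1}^{M_k} P_j^{(k)}(t)\,|h_j^{(k)}(t)|^2 + \sigma^2}. \quad (2)$$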
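The decoding order described above can be illustrated with a short Python sketch that computes the per-user SINR on one SC. It assumes perfect SIC (no residual interference after cancellation), which the text does not state explicitly, and all names are illustrative.

```python
def sic_sinrs(rx_powers, noise_power):
    """Per-user SINR on one SC under SIC with descending received-power order.

    rx_powers: received powers P * |h|^2 (linear scale) of the users on the SC.
    Returns the SINRs in the same order as the input list.
    Assumption: perfect SIC, so only weaker, not-yet-decoded users interfere.
    """
    # Decode the strongest user first, then the next strongest, and so on.
    order = sorted(range(len(rx_powers)), key=lambda i: rx_powers[i], reverse=True)
    sinrs = [0.0] * len(rx_powers)
    for pos, user in enumerate(order):
        # Users later in the decoding order are treated as noise.
        interference = sum(rx_powers[j] for j in order[pos + 1:])
        sinrs[user] = rx_powers[user] / (interference + noise_power)
    return sinrs


# Hypothetical received powers of three users sharing one SC.
print(sic_sinrs([4.0, 1.0, 2.0], noise_power=1.0))
```

The last decoded user sees only noise, so its SINR reduces to a plain SNR, while the first decoded user is interfered by all the others.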

B. URLLC COMMUNICATION MODEL
Due to the stringent low-latency requirement of URLLC communication, very short packets with finite blocklength (FBL) are used for data transmission, which is referred to as short-packet communications (SPC). Consequently, the Shannon capacity formula cannot be applied to the URLLC communication model since it is derived under the assumption of infinite blocklength (iFBL). According to [5], the achievable rate of user $m$ over SC $k$ in the FBL regime for a quasi-static flat fading channel can be approximated as
$$R_m^{(k)}(t) \approx W \left[ \log_2\left(1 + \gamma_m^{(k)}(t)\right) - \sqrt{\frac{V_m^{(k)}(t)}{\tau W}}\, \frac{Q^{-1}(\varepsilon_m)}{\ln 2} \right], \quad (3)$$
where $\gamma_m^{(k)}(t)$ is the received SINR in (2), $V_m^{(k)}(t) = 1 - \left(1 + \gamma_m^{(k)}(t)\right)^{-2}$ is the channel dispersion, $\tau$ denotes the transmission latency threshold, $\varepsilon_m$ is the decoding error probability, and $Q^{-1}(x)$ represents the inverse of the Gaussian Q-function $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt$. Based on (3), one can define an SNR threshold, given in (4), for user $m$ trying to transmit one packet over one SC $k$ in each transmission TS that satisfies the URLLC requirements (i.e., $\tau$ and $\varepsilon_m$), as in [35], where $n_b$ (bits) is the packet size. From (4), the target rate $\hat{R}_m$ for the transmission of user $m$ is defined in (5). Similar to [28], [35], we assume that each user $m$ can transmit its packet only once. As the interference over an SC increases, the likelihood of packet drops escalates. Specifically, a successful transmission occurs if $R_m^{(k)}(t) \geq \hat{R}_m$; otherwise, the transmission fails, i.e., the packet is dropped.
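To make the FBL rate model more concrete, the following Python sketch evaluates the standard normal-approximation rate and numerically searches for the smallest SNR at which one packet fits within the latency budget. The bisection criterion (rate times latency at least equal to the packet size) is an assumption made for illustration and is not necessarily the closed-form threshold of [35]; all function names and parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def fbl_rate(snr, bandwidth_hz, latency_s, error_prob):
    """Normal-approximation achievable rate (bits/s) in the FBL regime."""
    blocklength = latency_s * bandwidth_hz         # channel uses available within tau
    dispersion = 1.0 - (1.0 + snr) ** -2           # channel dispersion V
    q_inv = -NormalDist().inv_cdf(error_prob)      # Q^{-1}(eps) = -Phi^{-1}(eps)
    penalty = math.sqrt(dispersion / blocklength) * q_inv / math.log(2)
    return bandwidth_hz * (math.log2(1.0 + snr) - penalty)


def min_snr_for_packet(packet_bits, bandwidth_hz, latency_s, error_prob):
    """Bisection for the smallest SNR with rate * tau >= packet_bits (assumed criterion)."""
    def delivers(snr):
        return fbl_rate(snr, bandwidth_hz, latency_s, error_prob) * latency_s >= packet_bits

    lo, hi = 0.0, 1.0
    while not delivers(hi):                        # grow the bracket until it is feasible
        hi *= 2.0
    for _ in range(100):                           # invariant: lo infeasible, hi feasible
        mid = 0.5 * (lo + hi)
        if delivers(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical setting: 32-byte packet, W = 720 kHz, tau = 1 ms, eps = 1e-5.
snr_min = min_snr_for_packet(256, 720e3, 1e-3, 1e-5)
print(f"minimum linear SNR: {snr_min:.3f}")
```

The penalty term shows why the Shannon formula is optimistic for short packets: at any finite blocklength the achievable rate stays strictly below $W \log_2(1 + \gamma)$.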

C. ENERGY EFFICIENCY MAXIMIZATION
Energy efficiency (EE) is considered one of the major goals in 5G and beyond wireless networks [36]. Furthermore, the majority of mobile devices operate on limited battery power [36], resulting in the need to design energy-efficient communication methods. To address this concern, we first define an EE factor in (6), i.e., the ratio of the achievable sum rate to the total power consumption, with the purpose of ensuring the achievable rate requirement while reducing the power consumption of the system, where $P_c$ denotes the circuit power consumption. In what follows, the work focuses on designing an effective distributed power control and SC assignment strategy for URLLC-GF-NOMA systems to maximize the average EE while ensuring the URLLC requirements of all users. This can have a direct impact on the overall sustainability and cost-effectiveness of the considered networks. The design objective is cast as the optimization problem (7), which maximizes the long-term average EE subject to the constraints (7b)-(7e). Here, $\mathbb{E}_t[\cdot]$ is the expectation operation over TSs, and $\mathbf{x}$ and $\mathbf{P}$ denote the SC assignment and power control strategies, respectively. The constraint (7b) represents the rate condition to guarantee the users' URLLC requirements. The constraint (7c) ensures the NOMA-based multi-user decoding process. The constraint (7d) implies that each user selects at most one SC. The constraint (7e) shows the users' power budget.
Remark 1: It is noteworthy that the EE maximization problem defined in (7) can also include the objectives of maximizing the sum rate and minimizing the power consumption. These objectives can be attained by setting the denominator and numerator of the EE factor to 1, respectively. Thus, the considered scenario represents a general case where an efficient solution, striking the trade-off between the achievable sum rate and power consumption, can be achieved. Further evaluation on this matter is provided in Section IV.

III. MADRL-BASED ENERGY EFFICIENCY RESOURCE ALLOCATION SOLUTION FOR URLLC-GF-NOMA SYSTEMS
The problem described in (7) is challenging to solve due to its non-convex nature and NP-hard complexity. Moreover, with the GF access method, the users can select their preferred SC and transmission power independently in each TS without requiring admission approval from the BS. While this feature can reduce the access latency and increase the connectivity density, it also necessitates a decentralized optimization solution. Therefore, to effectively address the problem stated in (7), we consider an MADRL-based method, which can be implemented in a distributed manner.

A. MADRL FRAMEWORK
RL is one of the machine learning methods that enable a learning agent to achieve its specific goal with the best long-term reward by interacting with the environment in a trial-and-error manner [29]. In particular, an agent interacts with the environment by taking an action selected from its action space at the current state. It then receives a respective reward and moves to a new state. These procedures are repeated until convergence is observed, where the learning policy of the agent achieves an optimal value in terms of average reward. This learning process can be formulated as a Markov decision process (MDP) with a tuple of four elements $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P})$, defined as follows:
• $\mathcal{S}$: The set of states in the environment, where $s(t) \in \mathcal{S}$ denotes the state of an agent at TS $t$.
• $\mathcal{A}$: The set of actions that an agent can take, where $a(t) \in \mathcal{A}$ is the action of an agent at TS $t$.
• $\mathcal{R}$: The reward function, where $r(t)$ represents the immediate reward of the agent at TS $t$ by performing action $a(t)$ in state $s(t)$.
• $\mathcal{P}$: The probability distribution function of the state transition, where $\mathcal{P}(s(t), s(t+1))$ denotes the state transition probability from state $s(t)$ to state $s(t+1)$.
In the considered URLLC-GF-NOMA system, the behavior of all users (i.e., transmission power and SC selection) can be modeled as an MA MDP (MAMDP). Unlike a single-agent DRL, which involves the learning process of only one agent, our proposed MADRL-based model involves a set of agents $\mathcal{M}$, where all agents operate autonomously and concurrently in a shared environment. In particular, each agent $m$ observes its current state $s_m(t) \in \mathcal{S}_m$ from the environment and performs an action $a_m(t)$ chosen from its own action space $\mathcal{A}_m$. The joint action of all agents can be formulated as $\mathbf{a}(t) = (a_1(t), a_2(t), \dots, a_M(t))$. The agent $m$ then moves from the current state $s_m(t)$ to a new state $s_m(t+1)$.
All agents then receive a reward of $r(t+1)$ and perform an update of their current policy according to the feedback from the environment. It is worth noting that each agent having a distinct reward may result in selfish behavior, leading to a reduction of the global network performance [37]. Therefore, we assume that all agents have a common reward to obtain the global optimum. The main elements of the proposed MADRL approach are defined as follows:
• State: Due to users' independence and URLLC requirements, the state of agent (user) $m \in \mathcal{M}$ is designed only based on the local information available at this agent to reduce the processing latency and the signaling overhead in information exchange between the agent and environment. Specifically, the state of agent $m$ in TS $t$ can be defined as the combination of the SC index and transmission power value it selected in the previous TS $t-1$, which is expressed as
$$s_m(t) = \left\{ k_m(t-1),\ P_m^{(k_m(t-1))}(t-1) \right\}, \quad (8)$$
where $k_m(t-1)$ and $P_m^{(k_m(t-1))}(t-1)$ are the selected SC index and transmission power of agent $m$. Since the users' selection of SC and transmission power will impact the overall EE, it is reasonable to include this information in the defined state. From (8), the state of agent $m$ has a cardinality of 2, i.e., it consists of only two elements. It is noteworthy that the state definition in (8) differs from those in recent related works on GF-NOMA systems, which require a large signaling overhead in information exchange between the environment and the agents during the learning process [28], [29]. A performance comparison between different state definitions will be provided in Section IV.
• Action: At the beginning of TS $t$, agent $m$ selects an SC and transmission power for its transmission. As a feasible solution, the discrete power domain has been widely used for the learning-based GF-NOMA systems in the literature [21], [27], [29]. This approach can ensure stable convergence and reduce the computational complexity of the distributed learning models conducted by the users, who have limited computational resources. Given this context, we consider a discrete action space, where the power is quantized into $L$ levels which are determined as $\hat{P}_l = l P_{\max} / L$, $l \in \{1, 2, \dots, L\}$, where $\hat{P}_l$ is the $l$-th TPL. Thus, the action of user $m$ in TS $t$ is defined as
$$a_m(t) \in \mathcal{A}_m = \left\{ kl \mid k \in \mathcal{K},\ l \in \{1, 2, \dots, L\} \right\}, \quad (9)$$
where $a_m(t) = kl$ indicates that agent $m$ selects SC $k$ and TPL $l$ in TS $t$. Thus, the action space size of agent $m$ is $KL$ and the overall action space size of all agents is determined as $(KL)^M$.
• Reward: After all agents take their chosen actions, they receive an immediate reward from the environment reflecting whether their transmissions are successful or not, i.e., whether all constraints in the problem (7) are satisfied or not. In the MADRL frameworks, both centralized and decentralized rewards can be considered to build learning models. The centralized-reward mechanism yields a common reward to all agents, whereas in decentralized-reward schemes, each agent receives a distinct reward. However, the decentralized-reward strategy can lead to selfish behavior among agents. They may compete with others to maximize their own rewards, which potentially results in a degradation of overall system performance. To circumvent this issue, a common reward can be implemented to align the agents towards a shared global objective [37]. Since the objective is to maximize the network EE, we use the achieved EE to formulate the reward function. Furthermore, all agents receive the same reward with the aim of achieving the common objective, i.e., optimizing the network EE and guaranteeing URLLC requirements of all users. Thus, the reward function is defined as
$$r(t) = \begin{cases} \mathrm{EE}(t), & \text{if all constraints in the problem (7) are satisfied}, \\ 0, & \text{otherwise}, \end{cases} \quad (10)$$
where $\mathrm{EE}(t)$ denotes the network EE achieved in TS $t$, as defined in (6).
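The action and reward definitions above can be sketched in Python as follows. This is an illustration under stated assumptions: the flat action index is a hypothetical encoding of the pair (SC $k$, TPL $l$), the EE is taken as the sum rate divided by the total transmit power plus a single circuit power term, and only the rate constraint of problem (7) is checked.

```python
def power_levels(p_max, num_levels):
    """Quantized TPLs: P_l = l * P_max / L for l = 1, ..., L."""
    return [l * p_max / num_levels for l in range(1, num_levels + 1)]


def encode_action(k, l, num_levels):
    """Map 1-based (SC k, TPL l) to a flat index in {0, ..., K*L - 1} (hypothetical encoding)."""
    return (k - 1) * num_levels + (l - 1)


def decode_action(action, num_levels):
    """Inverse of encode_action: flat index back to 1-based (SC k, TPL l)."""
    k, l = divmod(action, num_levels)
    return k + 1, l + 1


def common_reward(rates, target_rates, tx_powers, circuit_power):
    """Shared reward: network EE if every user meets its target rate, else 0.

    Assumption: EE = sum rate / (total transmit power + circuit power), and
    only the rate constraint is checked here for brevity.
    """
    if any(r < r_hat for r, r_hat in zip(rates, target_rates)):
        return 0.0
    return sum(rates) / (sum(tx_powers) + circuit_power)


# Hypothetical example with K = 3 SCs and L = 4 power levels (K * L = 12 actions).
print(power_levels(1.0, 4))
print(decode_action(7, 4))
print(common_reward([2.0, 2.0], [1.0, 1.0], [0.5, 0.5], 1.0))
```

Because every agent receives the same value, a single user that misses its target rate zeroes the reward for all agents, which is what discourages selfish SC and power choices.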
Based on the reward function defined in (10), it becomes apparent that inappropriate user actions, such as an excessive number of users choosing the same SC, may degrade the system's EE. Consequently, the users will receive a low reward. Throughout the learning process, users explore the environment to find the best policies that will maximize their reward, ultimately leading to optimal EE performance. The objective of RL algorithms is to find a policy $\pi$ to maximize the expected reward [38]. Considering the Q-learning algorithm, a popular RL technique, the expected reward achieved by agent $m$ after taking action $a_m$ in state $s_m$ following a policy $\pi$ can be determined based on the action-value function (or Q-value function) as
$$Q^{\pi}(s_m, a_m) = \mathbb{E}\left[ \hat{r}(t) \mid s_m(t) = s_m,\ a_m(t) = a_m \right], \quad (11)$$
where $\mathbb{E}[\cdot]$ denotes the expectation operator and $\hat{r}(t)$ is the long-term discounted cumulative reward, which is given by
$$\hat{r}(t) = \sum_{i=0}^{\infty} \gamma^{i}\, r(t+i), \quad (12)$$
where $\gamma$ is the discount factor that determines the weight of the future reward. Based on (11), the optimal Q-function can be calculated as
$$Q^{*}(s_m, a_m) = \max_{\pi} Q^{\pi}(s_m, a_m). \quad (13)$$
Through the Q-learning method, the optimal policy can be found based on the available information $(s_m(t), a_m(t), r(t), s_m(t+1))$. The update equation of the Q-value function of agent $m$ can be expressed as [38]
$$Q(s_m(t), a_m(t)) = Q(s_m(t), a_m(t)) + \alpha \left[ y_m(t) - Q(s_m(t), a_m(t)) \right], \quad (14)$$
where $y_m(t) = r(t) + \gamma \max_{a} Q(s_m(t+1), a)$ and $\alpha \in [0, 1]$ is the learning rate.
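For illustration, the Q-value update in (14) can be written as a few lines of tabular Python. The dictionary-based table and the example states are hypothetical and are only meant to show how the target $y_m(t)$ enters the update.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q <- Q + alpha * (y - Q), y = r + gamma * max_a Q(s', a)."""
    target = reward + gamma * max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical two-action example; unseen state-action pairs start at 0.
q = defaultdict(float)
print(q_update(q, "s0", 0, reward=1.0, next_state="s1", actions=[0, 1], alpha=0.5))
print(q_update(q, "s1", 1, reward=2.0, next_state="s0", actions=[0, 1], alpha=0.5))
```

The table needs one entry per state-action pair, which is the storage burden that motivates replacing it with a neural network when the state-action space grows.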
Although the Q-learning method has been widely adopted in wireless networks for resource management purposes, it only works well under small state-action spaces, which limits its applicability. Its practicality diminishes as the problem size increases, primarily due to two key factors [29]: (i) the need for a lookup table to store Q-values for every possible state-action pair becomes unmanageable in terms of storage complexity when dealing with large-scale problems; and (ii) with a larger state space, many states are rarely visited, resulting in decreased performance. To overcome this drawback, we consider DRL techniques to efficiently solve the proposed problem in (7). In the DRL method, a deep neural network (DNN) is integrated into the framework of Q-learning to reduce the memory size and computational complexity by calibrating and training the DNN's different layers to define the best action for each state instead of using a large storage space (i.e., Q-table) to store all Q-values [39]. In this paper, we propose MADRL-based EE URLLC-GF-NOMA methods, where different DRL techniques including deep Q network (DQN), double DQN (2DQN), and dueling 2DQN (3DQN), are investigated. 1

B. PROPOSED MADRL ALGORITHMS FOR URLLC-GF-NOMA SYSTEMS
1) MADQN-BASED APPROACH
In this section, we consider an MADQN-based URLLC-GF-NOMA approach. With this method, each agent constructs its own DQN model that consists of two different DNNs: the online and target networks, as depicted in Fig. 2. Specifically, in each TS $t$, agent $m$ uses the online network for the Q-function approximation $Q(s_m(t), a_m(t); \theta_m)$ to select an action $a_m(t) \in \mathcal{A}_m$ at state $s_m(t) \in \mathcal{S}_m$. Here, $\theta_m$ represents the parameters (weights) of agent $m$'s online network. Meanwhile, the target network is used to stabilize the learning process, and its parameters $\hat{\theta}_m$ are updated by copying the parameters $\theta_m$ of the online network after a certain number of TSs, which is also known as the parameter update frequency $F$.
Regarding the action selection at each state, one should consider the trade-off between exploration and exploitation during the learning process to achieve the optimal policy. Given this context, the $\epsilon$-greedy policy can be used for action selection to obtain a balance between the exploitation of the best Q-value function and the environmental exploration [38]. In particular, the $\epsilon$-greedy policy selects an action based on two conditions:
$$a_m(t) = \begin{cases} \arg\max_{a \in \mathcal{A}_m} Q_m(t), & \text{with probability } 1 - \epsilon, \\ \text{a random action in } \mathcal{A}_m, & \text{with probability } \epsilon, \end{cases} \quad (15)$$
where $Q_m(t) = Q(s_m(t), a; \theta_m)$. Herein, the parameter $\epsilon$ determines the level of exploration, and it is usually set to decrease over time to reduce the exploration rate as the learning progresses. During the learning process, the MADQN approach uses the experience replay strategy to achieve learning stability, where the transition in the form of a tuple $(s_m(t), a_m(t), r(t), s_m(t+1))$ is stored in the experience replay memory of each agent $m$. At each iteration, a mini-batch of experiences is sampled uniformly to train the learning model and update the parameters of the online network $\theta_m$ with the purpose of minimizing the loss function defined as
$$L(\theta_m) = \mathbb{E}\left[ \left( y_m(t) - Q(s_m(t), a_m(t); \theta_m) \right)^2 \right], \quad (16)$$
where $y_m(t)$ is the target value calculated from the target network as follows:
$$y_m(t) = r(t) + \gamma \max_{a \in \mathcal{A}_m} Q(s_m(t+1), a; \hat{\theta}_m). \quad (17)$$
Given the DQN model of each agent mentioned above, the proposed MADQN-based URLLC-GF-NOMA approach is summarized in Algorithm 1. In particular, in TS $t$, each agent $m$ observes its current state $s_m(t) \in \mathcal{S}_m$ and independently takes an action $a_m(t) \in \mathcal{A}_m$ selected based on the $\epsilon$-greedy policy in (15). After performing the chosen action, agent $m$ receives a common reward $r(t)$ based on the achieved EE and moves to a new state $s_m(t+1)$. It then stores an experience tuple of $(s_m(t), a_m(t), r(t), s_m(t+1))$ into its experience replay memory, and a mini-batch of experiences is sampled for training the online network. The parameters of the online network $\theta_m$ are then updated to minimize the loss function in (16) by using the stochastic gradient method, where the target value is given by (17). After a predetermined number of TSs, the parameters of the target network $\hat{\theta}_m$ are updated by copying $\theta_m$. The above training process continues until reaching a predefined number of episodes guaranteeing the algorithm's convergence.

Algorithm 1
…
4: Initialize the network state $s_m(t)$, ∀m.
5: for t = 1, 2, . . . , T do
6:   All agents select their actions $a_m(t) \in \mathcal{A}_m$, ∀m, based on the $\epsilon$-greedy policy in (15).
7:   All agents take their actions, receive a common reward $r(t)$, and move to the next state $s_m(t+1)$.
8:   for each agent $m \in \mathcal{M}$ do
9:     Store an experience tuple of $(s_m(t), a_m(t), r(t), s_m(t+1))$ to the replay memory of agent m.
10:    Randomly sample a mini-batch of experience from the replay memory for training.
11:    Determine the loss function $L(\theta_m)$ as follows:
       - MADQN approach: Using (16) and (17).
       - MA2DQN approach: Using (16) and (18).
       - MA3DQN approach: Using (16) and (18), where the Q-value (action-value) functions are calculated by utilizing (19).
12:    Update $\theta_m$ by using stochastic gradient to minimize $L(\theta_m)$.
13:    Update $\hat{\theta}_m$ as $\hat{\theta}_m = \theta_m$ after every F TSs.
14:  end for
15: end for
16: end for

1. Besides DRL algorithms based on Q-learning and DNN, tile coding and on-policy learning could also be promising methods to achieve an effective solution and analytical convergence. This would be a noteworthy issue to investigate in future work.
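The action selection and the two target computations referred to in Step 11 of Algorithm 1 can be sketched in Python as follows. The Q-values are passed in as plain lists that stand in for the outputs of the online and target networks, and the exponential decay schedule for $\epsilon$ is a hypothetical choice, since the text only states that $\epsilon$ usually decreases over time.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Hypothetical exponential decay of the exploration rate, floored at eps_min."""
    return max(eps_min, eps_start * decay ** step)


def dqn_target(reward, next_q_target, gamma):
    """DQN target: the target network both selects and evaluates the next action."""
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma):
    """Double-DQN target: the online network selects, the target network evaluates."""
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Stand-in Q-values for the next state from the online and target networks.
online, target = [1.0, 3.0, 2.0], [5.0, 0.5, 4.0]
print(dqn_target(1.0, target, gamma=0.5))
print(double_dqn_target(1.0, online, target, gamma=0.5))
```

In this toy example the plain DQN target follows the largest target-network value, whereas the double-DQN target uses the value of the action preferred by the online network, which illustrates how decoupling selection from evaluation can temper over-optimistic estimates.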

2) MA2DQN-BASED APPROACH
From (17), one can observe that the MADQN approach based on the DQN model uses the same Q-value function for both tasks, i.e., action selection, $\max_{a \in \mathcal{A}_m} Q(s_m(t+1), a; \hat{\theta}_m)$, and action evaluation, $Q(s_m(t+1), a; \hat{\theta}_m)$. This can lead to an unstable learning process since the Q-value function is estimated over-optimistically. To mitigate this issue, we investigate an MA2DQN-based URLLC-GF-NOMA approach, where the 2DQN model is considered [40], as shown in Fig. 2. In this method, the action selection and evaluation are decoupled to avoid the overestimation issue by replacing the target value in (17) with
$$y_m(t) = r(t) + \gamma\, Q\left( s_m(t+1),\ \arg\max_{a \in \mathcal{A}_m} Q_m(t+1);\ \hat{\theta}_m \right), \quad (18)$$
where $Q_m(t+1) = Q(s_m(t+1), a; \theta_m)$. It can be seen from (18) that the online network $Q(s, a; \theta_m)$ is used to select the action, while the target network $Q(s, a; \hat{\theta}_m)$ is used to evaluate it.

3) MA3DQN-BASED APPROACH
An MA3DQN-based URLLC-GF-NOMA approach is studied in this section. This method uses a 3DQN model, whose structure is depicted in Fig. 3, to speed up the convergence and improve the learning efficiency [41]. Following the MA3DQN approach, each agent $m$ creates its own 3DQN model based on 2DQN, where the last layer of the 2DQN model is split into two parts to evaluate the state value function (SVF) $V(s_m(t))$ and the advantage function (AF) $A(s_m(t), a_m(t))$. Herein, the SVF $V(s_m(t))$ is used for estimating the quality (goodness or badness) of a given state $s_m(t)$, allowing the agent to evaluate the long-term potential of being in that state. Meanwhile, the AF $A(s_m(t), a_m(t))$ captures how much better or worse a specific action is compared to other actions in state $s_m(t)$. This allows the agent to choose the best action to take in a given state. The two parts are then combined to produce the final action-value function $Q(s_m(t), a_m(t); \theta_m, \theta_m^V, \theta_m^A)$ that is used to select actions in the environment. Here, $\theta_m^V$ and $\theta_m^A$ denote the parameters of the SVF-related and AF-related parts, respectively. Given this context, the action-value function determined by agent $m$ for a given state $s_m(t)$ and action $a_m(t)$ is calculated as follows:
$$Q(s_m(t), a_m(t); \theta_m, \theta_m^V, \theta_m^A) = V(s_m(t); \theta_m, \theta_m^V) + A(s_m(t), a_m(t); \theta_m, \theta_m^A) - \frac{1}{|\mathcal{A}_m|} \sum_{a' \in \mathcal{A}_m} A(s_m(t), a'; \theta_m, \theta_m^A), \quad (19)$$
where the last term of the right-hand side of (19) is the mean of the AF over all actions. It is subtracted from the AF $A(s_m(t), a_m(t))$ of a specific action to ensure that the AF is centered around zero, making it easier to train the network. This approach improves the convergence and stability of the network and enables the effective separation of the estimation of SVF and AF, which can result in better performance compared to DQN and 2DQN architectures. The MA3DQN-based URLLC-GF-NOMA approach is also summarized in Algorithm 1 under the designation MA3DQN mentioned in Step 11.
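The combination of the SVF and AF streams described above can be sketched in plain Python. The inputs stand in for the outputs of the two network heads for a single state, and the function name is illustrative.

```python
def dueling_q_values(state_value, advantages):
    """Combine the two heads: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


# Stand-in head outputs for one state with three actions.
q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0])
print(q_values)
```

Subtracting the mean advantage makes the average of the resulting Q-values equal to the state value, so a constant offset cannot drift between the two heads during training.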

1) COMPLEXITY ANALYSIS
Let $H$, $N_h$, and $I_s$ be the number of training layers (input, hidden, and output layers), the number of neurons in layer $h$, and the size of the input layer, respectively. For each TS, the computational complexity of the URLLC-GF-NOMA algorithms based on MADQN and MA2DQN can be calculated by $\mathcal{O}(X)$, where $X = I_s N_1 + \sum_{h=1}^{H-1} N_h N_{h+1}$. For the training phase with $M$ agents, $E$ episodes, and $T$ TSs, the computational complexities of the algorithms can be given by $\mathcal{O}(M E T X)$. Taking the MA3DQN-based URLLC-GF-NOMA algorithm into account, it has higher complexity than the MADQN- and MA2DQN-based algorithms due to the implementation of the dueling network architecture. Specifically, its complexity can be determined as

2) CONVERGENCE ANALYSIS
The convergence of a multi-agent system relies on whether the combined strategy of the agents ultimately approaches the optimal state (Nash equilibrium), ensuring the stability of the solution. In this paper, we propose URLLC-GF-NOMA methods based on MADQN, MA2DQN, and MA3DQN, which combine conventional Q-learning with neural networks. To analyze the convergence of these methods, two key aspects need to be addressed [42]: (i) demonstrating the ability of conventional Q-learning to converge to the optimal state, and (ii) verifying that the neural network effectively identifies or approximates the nonlinear Q-values generated by the general Q-learning iteration in (14). In particular, it has been shown in [43] that the conventional Q-learning algorithm is guaranteed to attain the optimal state when the learning rate $\alpha_t$ satisfies $0 \le \alpha_t \le 1$, $\sum_t \alpha_t = \infty$, and $\sum_t \alpha_t^2 < \infty$. Additionally, based on [44], it is established that a neural network can approximate any nonlinear continuous function when adequately sized and suitably initialized. Thus, the convergence of our proposed methods can be guaranteed. It is noteworthy that, as mentioned in [45], a theoretical analysis of the neural network size and initial conditions needed to ensure convergence before training is challenging due to the complex quantitative relationship between network convergence and the hyperparameters. Therefore, we utilize simulations to demonstrate the convergence of our proposed methods.
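The learning-rate conditions above can be illustrated numerically. The sketch below uses the schedule alpha_t = 1/t, which is a common textbook choice and an assumption here (the paper does not state its schedule in this section), and tracks the partial sums of alpha_t and alpha_t squared.

```python
import math


def partial_sums(num_terms: int) -> tuple:
    """Partial sums of alpha_t and alpha_t**2 for the schedule alpha_t = 1/t."""
    sum_alpha = 0.0
    sum_alpha_sq = 0.0
    for t in range(1, num_terms + 1):
        alpha = 1.0 / t
        sum_alpha += alpha
        sum_alpha_sq += alpha * alpha
    return sum_alpha, sum_alpha_sq


for n in (10, 1_000, 100_000):
    s1, s2 = partial_sums(n)
    # s1 keeps growing (roughly like ln n), while s2 stays below pi^2 / 6.
    print(f"n={n:>7}  sum(alpha)={s1:.3f}  sum(alpha^2)={s2:.5f}  "
          f"bound={math.pi ** 2 / 6:.5f}")
```

The first sum diverges, so the updates never stop moving the estimate, while the second stays bounded, so the accumulated noise remains finite. A constant learning rate, by contrast, satisfies the first condition but not the second.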

3) SOLUTION ANALYSIS
To clarify the difference between the scenario considered in this paper and those investigated in related works on RL-based GF-NOMA [27], [28], [29], [31], this section summarizes the solutions examined in these works, as shown in Table 2. As can be seen from this table, different DRL frameworks have been proposed to address the unique problems of GF-NOMA systems. In delay-sensitive RL-based systems, signaling overhead is a key performance indicator. It is defined as the number of information bits needed to feed back the channel status data, SC indicators, and the transmission power of a specific user over an SC [46]. The total number of users and SCs, as well as the exchange of states and rewards between the agents and the environment, also affect the signaling overhead. Higher signaling overhead results in larger processing latency for users.
Following [46], it is assumed that transmitting a continuous value of channel status, data rate, or reward requires 16 bits. Additionally, 1 bit is allocated for acknowledgment (ACK) feedback, 2 bits for decoding status, and 4 bits for the SC indicator, transmission power, and other relevant parameters. The work [27] produces a large signaling overhead because it depends on the decoding status of $K$ pilot sequences, the users' average throughput, and the parameters (weights) of the centralized-training MADRL model transmitted from the BS to the users, who build local DRL models for distributed execution. These parameters depend on the number of input, hidden, and output layers ($A$) and the number of neurons per layer ($N_a$, $1 \le a \le A$). In addition, large signaling overhead can be observed in [28], [29] due to the inclusion of various feedback information. This includes the channel status and ACK information of each user [28], as well as the users' data rates [29]. In [31], the BS decides the actions for the users (the selection of the repetition value and contention transmission unit (CTU)); hence, the signaling overhead depends on the feedback information from the BS to the users regarding the actions selected for each user's transmission. Note that $V_{cc}$, $V_{ic}$, $V_{sc}$, $V_{sd}$, and $V_{ud}$ used in Table 2 stand for the number of collision CTUs, idle CTUs, singleton CTUs, successfully served users, and users with decoding failures, respectively. In our method, only the reward feedback is required, which reduces the signaling overhead while still guaranteeing an effective learning solution. Consequently, the signaling overhead is determined by the reward feedback.
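The bit budget above can be turned into a small accounting helper. The sketch below only uses the per-field sizes stated in the text (16, 1, 2, and 4 bits); the function name `feedback_bits`, the grouping of fields into one message, and the 10-user comparison are hypothetical and do not reproduce the expressions in Table 2.

```python
CONTINUOUS_BITS = 16       # channel status, data rate, or reward value
ACK_BITS = 1               # acknowledgment feedback
DECODING_STATUS_BITS = 2   # decoding status
INDICATOR_BITS = 4         # SC indicator, transmission power, similar fields


def feedback_bits(continuous: int = 0, acks: int = 0,
                  decoding_flags: int = 0, indicators: int = 0) -> int:
    """Total bits of one feedback message built from the given field counts."""
    return (continuous * CONTINUOUS_BITS
            + acks * ACK_BITS
            + decoding_flags * DECODING_STATUS_BITS
            + indicators * INDICATOR_BITS)


# Reward-only feedback, as in the proposed method: one continuous value.
reward_only = feedback_bits(continuous=1)

# Hypothetical comparison: feeding back the achievable rates of 10 users.
all_rates = feedback_bits(continuous=10)
```

Under these assumptions, reward-only feedback stays constant per message, whereas any state definition that requires per-user or per-SC quantities scales with the network size.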

IV. SIMULATION RESULTS
In this section, simulation results are provided to evaluate the performance of the proposed MADRL-based resource allocation methods for the considered URLLC-GF-NOMA system. The simulations were performed on an Intel Core i7-8665U CPU with a 1.9 GHz frequency, 16 GB of random access memory (RAM), and a 64-bit Windows 10 operating system. The learning models have three hidden layers with 256, 128, and 64 neurons, respectively. The experimental parameters are provided in Table 3. Besides the proposed URLLC-GF-NOMA approaches based on MADQN, MA2DQN, and MA3DQN, we further investigate the following methods for comparison purposes.
• MA Q-learning (MAQL) [21]: MAQL is applied to GF-NOMA systems in [21]. With this scheme, each agent builds its own Q-table to store the Q-values of all possible state-action combinations during the learning process.
• Random approach: In this scheme, users randomly select the SC and TPL for their transmissions without learning.
• Exhaustive search (ES): This method determines the optimal solution through exploration of the entire network space in every TS.
• GF-OMA method: This method explores the GF-OMA scheme, where the users utilize distinct frequency/time domains for their transmissions [47].
• Different state spaces [28], [29]: Various state spaces for MADRL-based GF-NOMA systems introduced in [28], [29] are also considered to assess the proposed methods' efficiency in terms of convergence and signaling overhead. Specifically, the network state defined for agent $m$ in [28], named State 1, consists of its action, its channel gains over all SCs, and its transmission outcome. Meanwhile, the work [29] defines agent $m$'s state, called State 2, as the combination of the achievable rates of all agents.
Fig. 4(a) shows the convergence behavior during the training phase of the URLLC-GF-NOMA approaches based on the MA3DQN, MA2DQN, MADQN, MAQL, and Random schemes by plotting the reward achieved by all agents versus the number of episodes. As can be observed from this figure, the Random method achieves the worst performance (i.e., lowest reward) compared to the other schemes. This is because the users randomly select the SC and TPL when using this method. It is, therefore, difficult for them to find the best SC and TPL for their transmissions to optimize the network performance and guarantee the URLLC requirements. Among the remaining approaches, the MAQL scheme outperforms the Random method thanks to the application of the Q-learning algorithm, but still achieves worse performance than the others. This highlights the limitation of the Q-learning method when applied to a dynamic environment with an extremely large state-action space. The proposed URLLC-GF-NOMA methods (i.e., MA3DQN, MA2DQN, and MADQN) are superior to the MAQL and Random methods, while exhibiting the same learning behavior and comparable rewards in this simulation. After the training phase, the testing phase is conducted to evaluate the training results, where the users always select the action with the highest Q-value based on their learning results under new network conditions (network states and channels). The simulation results for the testing phase are provided in Fig. 4(b), where the testing process is performed over 100 episodes. This figure shows that, during the testing phase, the learning-based methods (MA3DQN, MA2DQN, MADQN, and MAQL) maintain the convergence they achieved in the training phase.
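The difference between the testing-phase policy (always pick the highest Q-value) and the Random baseline can be sketched as follows. The action encoding as (SC index, TPL index) pairs, the helper names, and the Q-values are illustrative assumptions.

```python
import random
from typing import List, Tuple


def build_actions(num_scs: int, num_tpls: int) -> List[Tuple[int, int]]:
    """Enumerate every (sub-channel index, power-level index) pair."""
    return [(k, p) for k in range(num_scs) for p in range(num_tpls)]


def greedy_action(q_values: List[float]) -> int:
    """Testing phase: pick the action index with the highest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def random_action(num_actions: int, rng: random.Random) -> int:
    """Random baseline: pick an action index uniformly, without learning."""
    return rng.randrange(num_actions)


actions = build_actions(num_scs=2, num_tpls=3)   # hypothetical sizes
q_values = [0.1, 0.4, 0.2, 0.9, 0.3, 0.0]        # hypothetical Q-values
chosen = actions[greedy_action(q_values)]
baseline = actions[random_action(len(actions), random.Random(0))]
```

The greedy rule depends entirely on the quality of the learned Q-values, which is why the testing-phase curves reflect what was learned during training, whereas the Random baseline ignores them altogether.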
In Fig. 5, we plot the variation of the achieved reward versus the number of episodes when using the MA3DQN approach with different network state definitions. This is to evaluate the efficiency of our proposed methods in terms of convergence and signaling overhead. Specifically, we investigate two network states used for GF-NOMA systems in [28], [29], namely State 1 and State 2, as mentioned earlier. In addition, a channel-based state definition, called State 3, is also investigated, where only the channel state information (CSI) of each user is used to define its state. One can see from Fig. 5 that the method utilizing the proposed state in (8) attains rewards comparable to the methods that use State 2 and State 3, and larger than the method utilizing State 1. Furthermore, the proposed state demands lower signaling overhead than State 1, State 2, and State 3. In particular, the proposed state only requires the agents to know their own selected SC index and transmission power value, which are available at the agent. Thus, the environment only needs to provide feedback to the agents regarding their transmission outcomes (i.e., reward), which is used for the training process. Meanwhile, State 1 requires the agents to also have knowledge of their own channel quality and to incorporate transmission results into their state information. This unnecessarily increases the input data for the agents' learning model. On the other hand, State 2 requires agents to know the achievable rates of all users. This necessitates significant information exchange between the environment and the agents, resulting in high signaling overhead. Moreover, State 3 demands additional information exchange between the agents and the BS to obtain the CSI, which increases the signaling overhead but does not further improve the learning process or the system performance in our considered scenario.
Fig. 6(a) and Fig. 6(b) illustrate the effect of small and large state-action spaces (i.e., number of users ($M$), SCs ($K$), and TPLs ($L_p$)) on the achieved rewards, respectively. Herein, the MA3DQN, MA2DQN, and MADQN approaches using the proposed state and State 2 are considered. As demonstrated by these figures, the methods using the proposed state and those employing State 2 have similar learning behavior and achieve comparable reward values in the small state-action space. However, in the large state-action space, the methods utilizing the proposed state outperform those using State 2. This is because the proposed state significantly reduces the state-action space of the considered methods compared to State 2, resulting in a faster learning process and higher achieved rewards. Fig. 6(b) also illustrates that the MA3DQN method outperforms the MA2DQN and MADQN methods in the large state-action space generated by State 2. This is due to the MA3DQN approach's ability to rapidly identify optimal actions and important states, leading to better learning outcomes than the MA2DQN and MADQN techniques. The enhanced performance of MA3DQN is achieved by the separation of the state and action networks at the last layer of the DNN model used in this scheme. On the other hand, the proposed state results in a considerably smaller state-action space than State 2, even with an increase in $M$, $K$, and $L_p$, resulting in faster learning. As a result, the MA3DQN, MA2DQN, and MADQN methods employing the proposed state achieve comparable learning outcomes. Thus, the MA3DQN method is suited to problems with a larger state-action space, whereas the MA2DQN and MADQN methods, with a simpler network design, are suitable for problems with smaller state-action spaces.
To evaluate the effect of the URLLC requirements (i.e., $\varepsilon_m$ and $\tau$) on the system performance, Fig. 7 plots the variation of the achieved reward versus the number of episodes for different value sets of ($\varepsilon_m$, $\tau$) when using the MA3DQN method. This figure indicates that the achieved reward converges to a greater value when looser URLLC requirements are set; for instance, when the reliability requirement is relaxed (i.e., $\varepsilon_m$ increases from $10^{-7}$ to $10^{-1}$) and the latency threshold is relaxed (i.e., $\tau$ increases from 0.5 ms to 2 ms). This can be explained by the fact that the minimum data rate threshold based on (5) gets higher as the URLLC requirements become stricter. It is, thus, more difficult to satisfy the rate constraint required to fulfill the URLLC conditions in this case, leading to an EE performance degradation.
Fig. 8 shows the performance comparison in terms of the achieved reward between the methods using GF-NOMA and GF-OMA. For the GF-OMA scheme, each user occupies a distinct resource block and the system bandwidth $W$ is equally divided among the users [47]. Fig. 8 reveals that the methods utilizing GF-NOMA obtain greater rewards than those utilizing GF-OMA. This can be attributed to the performance degradation that occurs in the latter due to the splitting of bandwidth resources among the users. Moreover, the MA3DQN, MA2DQN, and MADQN methods outperform the MAQL and Random schemes.
Fig. 9 depicts the variation of the average EE with respect to the number of users ($M$) for different methods. As observed from this figure, the EE performance decreases as $M$ gets higher, since the growing number of users sharing the same SCs leads to stronger interference. In addition, the proposed MA3DQN, MA2DQN, and MADQN methods yield better EE performance than the MAQL and Random methods as $M$ increases. Furthermore, they achieve comparable EE gains under the different values of $M$.
As mentioned in the previous results, this is because the proposed approaches produce a small state-action space for each agent, accelerating their learning process and leading to equivalent EE performance.
Fig. 10 provides an EE performance comparison between the investigated methods (i.e., MA3DQN, MA2DQN, MADQN, MAQL, and Random) and an optimal solution obtained through the ES method by plotting the achieved EE versus the number of TPLs. The ES method finds the largest EE by traversing all possible actions in the network in every TS. As illustrated in Fig. 10, the EE values achieved by the MA3DQN, MA2DQN, and MADQN methods are close to those of the ES method and significantly exceed those of the MAQL and Random approaches. It is noteworthy that the ES method is infeasible for large network spaces since it requires exploring the entire network space, leading to high computational complexity. To address this issue, the proposed URLLC-GF-NOMA methods based on MA3DQN, MA2DQN, and MADQN enable the users to interact with the wireless environment and learn from their accumulated experiences to rapidly achieve a near-optimal solution without visiting the entire network space.
Fig. 11 provides an EE performance comparison between MADQN methods using centralized and decentralized rewards with different values of $M$, $K$, and $L$. Specifically, the centralized reward is defined in (10), whereas the decentralized reward implies that each agent receives a distinct reward depending on its own transmission outcome. In particular, with the objective of maximizing EE, the decentralized reward $r_m(t)$ of each agent $m$ is defined from that agent's own transmission outcome, using the quantity given in (5). As can be seen from Fig. 11, the EE performance achieved with decentralized rewards is much lower than that achieved with centralized rewards. This is because decentralized rewards can lead to selfish behavior of the agents, where they compete with each other to maximize their own objective instead of the common one, i.e., maximizing the overall EE while guaranteeing the URLLC requirements of all users. Therefore, a significant global EE performance degradation can be observed, as shown in Fig. 11.
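To illustrate why the ES benchmark discussed with Fig. 10 becomes infeasible, the sketch below counts and enumerates the joint actions when each of M users picks one of K SCs and one of L_p TPLs. The helper names and the toy objective are hypothetical stand-ins; the paper's EE expression is not reproduced here.

```python
from itertools import product
from typing import Callable, Tuple

JointAction = Tuple[Tuple[int, int], ...]


def joint_action_count(num_users: int, num_scs: int, num_tpls: int) -> int:
    """Number of joint actions when every user picks one (SC, TPL) pair."""
    return (num_scs * num_tpls) ** num_users


def exhaustive_search(num_users: int, num_scs: int, num_tpls: int,
                      objective: Callable[[JointAction], float]) -> JointAction:
    """Evaluate every joint action and return one that maximizes the objective."""
    per_user = list(product(range(num_scs), range(num_tpls)))
    return max(product(per_user, repeat=num_users), key=objective)


def toy_objective(joint: JointAction) -> float:
    """Hypothetical score: reward distinct SCs, penalize higher power indices."""
    distinct_scs = len({sc for sc, _ in joint})
    total_power_index = sum(tpl for _, tpl in joint)
    return distinct_scs - 0.1 * total_power_index


best = exhaustive_search(num_users=2, num_scs=2, num_tpls=2,
                         objective=toy_objective)
small_space = joint_action_count(2, 2, 2)     # tiny network
large_space = joint_action_count(10, 4, 5)    # grows exponentially in M
```

The count grows exponentially with the number of users, so evaluating every joint action in every TS quickly becomes impractical, which is the motivation for the learning-based alternatives.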
As mentioned earlier in Section II-C, the problems of maximizing the achievable sum rate, referred to as maxRate, and minimizing the power consumption, referred to as minPower, can also be investigated based on the EE maximization problem, denoted by maxEE, defined in (7). Herein, maxRate and minPower are obtained by setting the denominator and numerator of (6) to 1, respectively. Given this context, Figs. 12(a) and 12(b) depict the achievable sum rate and the power consumption, respectively, versus learning episodes for the maxEE, maxRate, and minPower problems. These figures demonstrate that maxRate obtains the highest sum rate but with the largest power consumption, since it focuses only on maximizing the sum rate. Meanwhile, minPower achieves the minimum power consumption but results in a poor achievable sum rate due to its power minimization objective. On the other hand, the proposed maxEE problem can achieve a high sum rate close to that obtained by maxRate while minimizing the users' power consumption. Thus, maxEE outperforms maxRate and minPower in guaranteeing the trade-off between the achievable sum rate and power consumption for energy-limited users.
Fig. 13 provides the EE performance of different MADRL frameworks proposed for GF-NOMA systems, including our proposed solution, the throughput-based solution [28], and the rate-based solution [29]. As can be seen from this figure, our proposed solution achieves much better EE performance than the throughput-based and rate-based solutions. This is because our proposed solution aims to maximize EE with minimum transmission power to save energy for users with limited energy resources. In contrast, the throughput-based method tries to maximize network throughput; hence, higher transmission power than necessary can be used to ensure the successful decoding of the users' messages. Meanwhile, the rate-based solution focuses on maximizing the data rate with large transmission power, resulting in an EE performance reduction.
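The relation between the three problems can be sketched as a single ratio objective. This assumes, as the text above suggests, that (6) is a ratio of the sum rate to the power consumption; the function name `objective` and the numbers are illustrative.

```python
def objective(sum_rate: float, total_power: float, mode: str = "maxEE") -> float:
    """Ratio objective; maxRate and minPower fix one side of the ratio to 1."""
    if total_power <= 0:
        raise ValueError("total_power must be positive")
    if mode == "maxEE":
        return sum_rate / total_power    # full energy-efficiency ratio
    if mode == "maxRate":
        return sum_rate / 1.0            # denominator set to 1
    if mode == "minPower":
        return 1.0 / total_power         # numerator set to 1
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical operating point: 10 units of sum rate at 2 units of power.
ee_value = objective(10.0, 2.0, "maxEE")
rate_value = objective(10.0, 2.0, "maxRate")
power_value = objective(10.0, 2.0, "minPower")
```

Maximizing the minPower variant is equivalent to minimizing the power consumption, so all three problems can share the same learning framework and differ only in the reward.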
To clarify the benefits of the received power-based decoding order, Fig. 14 shows the EE comparison between the received power-based and rate-based SIC methods during the learning process. Here, we consider that the predetermined rate demand of user $m$ ($1 \le m \le M$) is set to $m$ bps/Hz. In the rate-based SIC method, the message of the user with the lower rate demand is decoded earlier at the BS. This is because a user whose signal is decoded earlier suffers stronger interference and achieves a smaller data rate. As can be observed from Fig. 14, the received power-based SIC outperforms the rate-based SIC in terms of EE. The reason behind this result is that the decoding order in the received power-based SIC method depends on the users' channel conditions and TPL selection, and is therefore more flexible than that in the rate-based SIC approach. This can help the users find the most appropriate SC and TPL for their transmissions to optimize the global EE performance and satisfy the different rate demands of all users. In contrast, the decoding order is fixed in the rate-based SIC method due to the predetermined rate demands of the users. It is, therefore, difficult for users to find the best learning policy, especially in time-varying and strong-interference environments, leading to performance degradation.
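The two decoding orders can be contrasted with a short sketch. It assumes that the received power-based order decodes the strongest received signal first and that SIC is perfect, which are common uplink NOMA assumptions rather than details confirmed in this section; all function names and numbers are hypothetical.

```python
from typing import List


def received_power_order(tx_powers: List[float],
                         channel_gains: List[float]) -> List[int]:
    """User indices sorted by received power p*g, strongest first (assumed)."""
    received = [p * g for p, g in zip(tx_powers, channel_gains)]
    return sorted(range(len(received)), key=lambda m: received[m], reverse=True)


def rate_based_order(rate_demands: List[float]) -> List[int]:
    """User indices sorted by rate demand, lowest demand decoded first."""
    return sorted(range(len(rate_demands)), key=lambda m: rate_demands[m])


def sinr_after_sic(order: List[int], tx_powers: List[float],
                   channel_gains: List[float], noise_power: float) -> List[float]:
    """SINR per user when earlier-decoded signals are perfectly cancelled."""
    received = [p * g for p, g in zip(tx_powers, channel_gains)]
    sinr = [0.0] * len(order)
    for position, user in enumerate(order):
        # Only users decoded later still interfere with this user.
        interference = sum(received[other] for other in order[position + 1:])
        sinr[user] = received[user] / (interference + noise_power)
    return sinr


powers = [0.1, 0.2, 0.1]      # hypothetical TPL choices on one SC
gains = [1.0, 4.0, 2.0]       # hypothetical channel gains
power_order = received_power_order(powers, gains)
demand_order = rate_based_order([1.0, 2.0, 3.0])
sinr = sinr_after_sic(power_order, powers, gains, noise_power=0.1)
```

In this sketch, the received power-based order changes whenever the channel gains or the selected TPLs change, while the rate-based order stays fixed by the rate demands, which mirrors the flexibility argument made above.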
From the results achieved above, it can be concluded that the proposed URLLC-GF-NOMA methods based on MA3DQN, MA2DQN, and MADQN can obtain similar performance and outperform other benchmark schemes in terms of EE, convergence rate, and signaling overhead. However, the methods based on MA2DQN and MADQN exhibit lower complexity compared to the MA3DQN-based method as indicated in Section III-C1, thereby reducing the power consumption and processing latency for the URLLC users. This benefit makes them better suited for the considered URLLC-GF-NOMA system.

V. CONCLUSION
In this paper, we have investigated a resource allocation problem in an uplink URLLC-GF-NOMA system where the users aim to maximize energy efficiency while satisfying their URLLC requirements. To achieve this, we have proposed three MADRL-based URLLC-GF-NOMA approaches (MA3DQN, MA2DQN, and MADQN) for the users to learn how to select the most suitable sub-channel and transmission power for their transmissions. In particular, we have designed an MADRL framework that guarantees a rapid convergence and small signaling overhead to maximize energy efficiency and satisfy users' URLLC requirements. Our simulation results have shown that the proposed URLLC-GF-NOMA methods based on MA3DQN, MA2DQN, and MADQN can achieve similar performance, but MA2DQN and MADQN are more appropriate for the investigated URLLC-GF-NOMA system due to their lower complexity compared to MA3DQN. Moreover, our proposed methods outperform existing benchmark schemes in terms of energy efficiency performance, convergence property, and signaling overhead to guarantee the URLLC requirements of energy-limited users.