Refiner GAN Algorithmically Enabled Deep-RL for Guaranteed Traffic Packets in Real-time URLLC B5G Communication Systems

Ultra-reliable and Low-latency Communications (URLLC) is expected to be one of the most critical characteristics in Beyond fifth-Generation (B5G) cellular networks with stringent low latency and high-reliability requirements. The Deep Reinforcement Learning (deep-RL) framework has been applied to predict the optimization of a Resource Block (RB) and minimize Power Allocation (PA) to guarantee a high End-to-End (E2E) reliability and low E2E latency under rate constraints. This paper proposes a novel Policy Gradient-based Actor-Critic Learning (PGACL) algorithm to optimize the policy gradient for optimal rate allocation to solve the RB, minimize power, and guarantee a solution for URLLC scheduling. The purpose of a PGACL algorithm is to provide a good policy with a closer convergence rate and a low computational cost depending on the reduced action space for every user. URLLC systems need to operate in highly reliable systems and account for extreme network conditions. Therefore, we proposed the refiner Generative Adversarial Networks (GANs) that apply enough extreme events for the deep-RL agent to generate synthetic data with high reliability similar to real data based on the regulated number of extreme events in the dataset. This refiner GAN method enables a deep-RL approach to generate large amounts of data practically used in real-time operation. Simulation results showed that the proposed deep-RL for refiner-GAN can omit the transient training time and develop deep-learning based on a controlled set of unlabeled real traffic at a relatively short time. Furthermore, the refiner GAN demonstrated 99.9999% reliability and E2E latency of less than 1.4ms.


I. INTRODUCTION
With the rapid deployment of wireless networks to support diverse applications, e.g., smart cities and intelligent transportation, Beyond Fifth-Generation (B5G) networks are required to provide seamless access and diverse services for a huge number of devices over a limited radio spectrum. In wireless networks, different devices have various Quality-of-Service (QoS) requirements. For QoS guarantees in B5G wireless networks, Ultra-reliable and Low-latency Communications (URLLC) is one of the most challenging services because of its stringent low-latency and high-reliability requirements; for example, the 3rd Generation Partnership Project (3GPP) specifies a general URLLC requirement for one-way radio transmission of 99.999% target reliability with 1 ms latency [1], [2]. Nevertheless, one key challenge in the B5G system is how to design enough extreme events for the Deep Reinforcement Learning (deep-RL) agent to intelligently make decisions (such as Resource Allocation (RA), Resource Block (RB) assignment, energy management, and transmission scheduling) for wireless networks serving different devices [3].
To accommodate high data rates in B5G cellular networks, real-time optimization is needed to control the generation of real data from radio resources under time-varying network conditions. Recently, much attention has been paid to URLLC, RA problems, and decision making in wireless networks, e.g., [4], [5]. The optimization of the RA problem in [5] relies on a two-phase framework, comprising Enhanced Mobile Broadband (eMBB) RA and URLLC scheduling, that maximizes the data rate of Users (UEs) while considering the reliability of both eMBB and URLLC. RA for Orthogonal Frequency Division Multiple Access (OFDMA) can decrease the End-to-End (E2E) delay and achieve the stringent reliability of uplink and Downlink (DL) transmission in mobile edge computing systems [1], [6], [7]. Such an efficient RA algorithm guarantees the maximum delay by offering a partial iteration between the uplink and the DL to reduce E2E latency [8]. B5G addresses several new service applications, such as drones, virtual reality, the advanced Internet of Things (IoT), Artificial Intelligence (AI), and autonomous driving, all of which require high reliability and low latency. Virtual environments such as videos and images need high data rates with ultra-reliability in real time. Applications with a short packet size can achieve high reliability and low latency, based on localization in time with Transmission Time Intervals (TTIs). Moreover, to achieve low latency and high reliability in the B5G era, the data packet should be small and the TTI should be short [9]. The transmission latency decreases when the blocklength is short, and the relation between blocklength and error probability has been studied in [10]. This relation improves RA for short packet transmission in URLLC, as shown in [11], [12].
The RA problem is posed as a power minimization problem. To meet the URLLC-B5G requirements, optimization algorithms cannot sacrifice QoS by compromising reliability, latency, or rate. The total power consumption of a Base Station (BS) can be decreased by improving the bandwidth allocation and Power Allocation (PA) through deep transfer learning for radio RA in real time. However, the approaches in [13]-[15] cannot minimize power consumption while meeting the requirements of wireless UEs in cloud radio access networks because they do not address the limitations of deep-RL when working with large action spaces. Another challenge is gathering real data for training deep learning models.

A. RELATED WORKS
To provision URLLC services in B5G, the problem of transmitting more packets while reducing transmit power levels depends on the current state of the decision policy. The work in [16] proposed a model-based and data-driven unsupervised learning method for designing a burstiness-aware scheduling framework that reserves bandwidth for UEs to satisfy the ultrahigh reliability requirement. The authors of [17] proposed a packet prediction technique that predicts future incoming packets based on the packets in the current queue. The author of [6] studied the RA problem for a mission-critical IoT system to achieve a data rate under short packet communications by jointly optimizing the bandwidth and PA. The authors of [4] proposed an actor-critic Reinforcement Learning (RL) method that uses a new reward function for RB allocation and PA to enhance the learning efficiency by applying a learning policy and interacting with the environment. This actor-critic system supports high reliability and low latency in device-to-device-enabled vehicle-to-vehicle wireless networks. Many studies trained deep-RL to develop an artificial agent capable of learning and substantially enhancing prediction during the training process by collecting more data, as shown in [15], [19], [20], [22]. If the deep-RL agent is deployed without training, the system lacks experience from the beginning, which results in an unreliable system. In [6], [15], [18], the systems did not take into account the learning reliability based on the training dataset to accommodate extreme and critical events and, therefore, could not demonstrate the B5G URLLC conditions that occur in real wireless networks. Previous works, such as [18], [19], could not satisfy the requirements of B5G URLLC because the systems did not gain considerable experience through rigorous training to handle unusual traffic patterns, extreme events, and unforeseen network congestion.
Therefore, fulfilling the requirements of B5G URLLC requires the deep-RL agent to handle a large action space. However, this large set of actions makes RB allocation and reliability guarantees difficult for deep-RL frameworks [20]. Previous studies [5], [8], [18], [19], [22], and [27] were unable to handle the large amounts of data involved in URLLC and exhibited high-order time complexity, making them unsuitable for real-time requirements. Moreover, deep-RL has difficulty obtaining a large labeled real dataset in real time. Deep-RL with Generative Adversarial Networks (GANs) in the real-time setting is therefore needed to cover large arrival rates [21]-[24] and to allow the deep-RL agent to gain experience.
A large training sample is critical for handling the large amounts of data involved in URLLC, guaranteeing training efficiency, and improving the learning speed and learning stability toward the optimal policy [20]. A large real dataset can be obtained in real time by using the reward clipping scheme that enhances the training stability of the GAN [23], which learns an approximate state-action distribution to obtain more stable and superior learning. A suitable strategy for transmitting packets with high efficiency from different buffers through multiple channels depends on a transmission scheduling mechanism based on deep learning [24]. The GAN-powered deep distributional Q network proposed in [21] decreases the size of the action space, which provides good packet transmission, guarantees high reliability, and achieves the optimal RA. Meeting the target reliability and guaranteeing the desired arrival rate for every UE depends on minimizing the BS power.

B. MOTIVATION AND CONTRIBUTIONS
Motivated by the above issues, and to satisfy the E2E delay requirements of network UEs, we address the joint problem of power minimization under a rate constraint for the UEs, ultrahigh reliability, and ultralow latency. To solve this problem, we propose a deep-RL framework that measures the E2E reliability and E2E latency of every UE using a dynamically predicted traffic model, while jointly allocating RBs and minimizing the power of the UEs under the rate and URLLC constraints. Deep-RL uses a Deep Neural Network (DNN) to gather data through the learning process. Fig. 1 illustrates how deep learning combines Artificial Neural Networks (ANNs) with an RL agent's experience to learn the best possible actions in a virtual environment. URLLC RB allocation and PA together form a Non-deterministic Polynomial-time hard (NP-hard) problem for which a closed-form solution is difficult to obtain, so both are relaxed into convex optimization problems. The system is designed to obtain an optimal RA and to meet the extreme rate requirements of the UEs. We propose a deep-RL framework that uses two feedback inputs, transmit power and reliability, and updates the DNN in every time slot. Intelligent B5G-URLLC scheduling guarantees ultra-reliability and minimizes the transmit power in every time slot based on the feedback received over a few time slots and the predicted future significance of its actions. The Lyapunov function handles the queue delay of arriving packets that wait for transmission. In addition, the Lyapunov function guarantees minimized transmit power and ultra-reliability for every UE in a time-varying system. Even though deep-RL effectively handles the large state space problem, it still has to guarantee the desired rate for every action by addressing the huge state space and action space in URLLC.
Therefore, we propose a Policy Gradient-based Actor-Critic Learning (PGACL) algorithm that provides a good policy with a closer convergence rate and a low computational cost by reducing the action space and adapting the exact reliability and latency for every UE. Furthermore, the policy gradient obtained by training deep-RL with PGACL selects the optimal solution by iteratively leveraging the Bellman equation, which provides the optimal distribution of actions for every UE and improves the estimation of action values in an environment with inherent randomness. Deep-RL for B5G URLLC uses a large number of training samples to select the optimal state decision based on historical traffic packet data. Issues related to the large number of training samples appear when the arrival rate of a UE suddenly increases and the recovery time is long. In this case, the system requires a transient time, which is critical for URLLC. Furthermore, using only deep-RL makes it difficult to obtain a large labeled real dataset in real-time settings. We propose deep-RL with refiner GANs to provide sufficiently accurate traffic packets in real-time settings by incorporating a large number of unlabeled training samples to support B5G networks. More robust real data must cover all unexpected situations and training periods, which requires a sufficient number of extreme events and control over the level of extreme events such that the refiner's output remains similar to the refiner's input.

II. SYSTEM MODEL
We have considered the DL OFDMA scenario of a single BS, where the BS is located at the center of the cell and a set Ĭ of UEs and has a set of available at RBs. The wireless RA consists of the following × RB allocation matrix and × PA matrix. The Shannon capacity rates cannot be used due to the small packet size in the URLLC traffic. Instead, the achievable rate can use the finite blocklength to define a URLLC UE on RB at any time slot . The achievable URLLC rate based on finite blocklength is given by [25], [30]: where is the transmission power of gNodeB (gNB) for URLLC user ( ∈ ), and is the Rayleigh fading channel gain from the BS to UE on RB at time slot . represents the duration of a TTI, is the RB allocation display with =1 when RB is allocated for UE at any TTI of the time slot ; otherwise, = 0. is the bandwidth of RB, and 0 is the single-sided noise spectral density. The achievable rate depends on RB for every UE by calling TTI of the time slot to serve URLLCs based on the reliability constraint. From (1), the reliability of the URLLC decreases due to interference. Thus, for serving URLLC data rate, UEs are based on concentrating a finite blocklength and error probability [25] to support B5G-URLLC scenarios. The achievable rate depends on RB to every UE ∈ as shown (1). The probability of the E2E packet delay surpassing a predefined target E2E latency for UE is represented as reliability ѱ . The transmission delay includes E2E packet delay and queue delay for the arriving packets. To meet the + Data arrives from upper layer system reliably and latency requirements, the system must keep rate concerning the packet arrival rate [22], [30]., i.e.
where ₼ represents the vector of current serving URLLC UEs for particular packet size, is the random number representing the arrival URLLC packet rate at UE at TTI, and (. ) refers to an unknown function and we consider it as implicitly approximate. A guarantee of high reliability depends on redesigning the physical layer and enabling technologies, including packet and frame structure, as shown in Fig. 2. The short packet transmission or efficient lowlatency transmission over 0.1 ms depends on the control of signaling and scheduling information for a large portion of the transmission latency as = + + + + . Where represent the time slot to decrease the processing latency, is the latency, is signal propagation time, is the time to achieve precoding and decoding, is the time taken for retransmission, and represents pre-processing time. The success probability of negative acknowledgement (NAK) is sent by the UE if there is no acknowledgement (ACK) ACK/NAK, and is the successful probability of ACK that guarantees high reliability of data packet when the UE sends NAK. The minislot-level (142, 241 µs), 2x, and 4x provide the base numerology for sub-frame and latency in the slot , which improves the success probability of packet [9]. Allocation of RB is achieved by calling in any TTI of a time slot based on the reliability constraint. The target E2E latency for UE for every packet loss probability can be written as where ѱ represent the reliability for a UE , and E2E latency for UE . The achievable rate depends on the ability to ensure the URLLC when the Signal -to-Noise Ratio (SNR) ≥ 5 dB, as shown in [5], [26 -28]. However, a reduction in the reliability of each UE occurs due to variations in the quality of the channel. The reliability ѱ and latency of URLLC depend on the ability to ensure that the outage probability of E2E instantaneous packet delay is more than the target E2E latency ( , ) for UE , as shown in (3). 
Designing reliable RA to keep the minimum required rate depends on the instantaneous arrival rate that satisfies the queue stability condition and the reliability and latency condition ( < ≤ ѱ ∀ ). The average data arrival rate of UE ( ∈ Ĭ at TTI) was needed to serve URLLC UEs, which is expressed as follows: where is the packet arrival rate in the time slot. The vector URLLC of UEs is expressed as = , when ∈ by nextgeneration gNB at TTI of the time slot ; otherwise, = . Based on (3) and (4), the probability and latency ensure the URLLC.
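To make the role of the finite blocklength in the achievable rate more concrete, the following Python sketch evaluates the widely used normal approximation for short-packet rates. It is only an illustration under stated assumptions: the function names, the choice of blocklength as bandwidth times TTI duration, and the numerical values in the usage note (a 180 kHz RB and a 0.125 ms TTI) are hypothetical and are not taken from (1).

```python
import math
from statistics import NormalDist


def q_inv(eps: float) -> float:
    """Inverse Gaussian Q-function, where Q(x) = 1 - Phi(x)."""
    return NormalDist().inv_cdf(1.0 - eps)


def shannon_rate(snr: float, bandwidth_hz: float) -> float:
    """Shannon capacity in bit/s (infinite blocklength reference)."""
    return bandwidth_hz * math.log2(1.0 + snr)


def fbl_rate(snr: float, bandwidth_hz: float, tti_s: float, eps: float) -> float:
    """Normal-approximation achievable rate in bit/s for a short packet.

    snr is linear (not dB); eps is the decoding error probability.
    """
    # Assumption: blocklength in channel uses is bandwidth times TTI duration.
    blocklength = bandwidth_hz * tti_s
    # Channel dispersion of the AWGN channel.
    dispersion = 1.0 - 1.0 / (1.0 + snr) ** 2
    # Rate penalty paid for decoding a short block with error probability eps.
    penalty = math.sqrt(dispersion / blocklength) * q_inv(eps)
    rate = bandwidth_hz / math.log(2) * (math.log(1.0 + snr) - penalty)
    return max(rate, 0.0)
```

For example, `fbl_rate(10 ** 0.5, 180e3, 0.125e-3, 1e-5)` evaluates the rate at an SNR of 5 dB; the result stays below `shannon_rate(10 ** 0.5, 180e3)`, and the gap shrinks as the TTI grows or the error probability is relaxed.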

A. PROBLEM FORMULATION
The goal for the system is to allocate resources to minimize the average DL power while maintaining reliability, latency, and rate for the UEs. So, we pose this RA problem as a power minimization problem that is subject to a QoS constraint on maximizing the reliability of every packet and maintaining the required E2E latency and rate constraint of UEs are given in [2], [5], [22], and rate constraint of UEs can be formulated as: . [ The optimization problem in (5) seeks to minimize the average power consumed by the BS. In the outage probability in (5a), the packet scheduling utilizes available radio resources efficiently, ensures fairness among scheduled UEs, and satisfies the QoS requirements depending on maximizing the reliability of every packet and maintaining the required E2E latency [2]. A higher value of traffic packets (5b) in the TTI lowers the chance of a UE getting scheduled, and therefore, it ensures fairness among the UEs over a certain time duration. The transmission short packet delay of all UEs and reliability constraints are preserved by (5a) and (5b). However, the reliability condition and ultralow latency in (5a) must ensure that the E2E delay is less than ( ( , ) ≤ 0) with minimum reliability of 1 − ѱ and represent the minimum required for the instantaneous arrival rate and minimum power . Both (5b) and (5c) ensure that the rate of each UE enables connection to the BS and determination of data transmission at every UE to guarantee target reliability ѱ at every time slot . Moreover, constraints (5d) and (5e) represent the orthogonality of RBs among the URLLC and PA. Resource restraint is given by constraint (5f). The latency or reliability cannot be sacrificed to decrease power, as shown in (5).

B. DECOMPOSITION AS A SOLUTION APPROACH FOR PROBLEM
The wholly achievable UE RB and joint optimization of PA depend on obtaining the best solution by searching for possible URLLC location TTIs in the space. The gNB allows URLLC UEs to obtain some RBs directly on TTI within every time slot . At the beginning of the time slot , gNB schedules each of its RBs, the URLLC traffic requests any of to come in, and the scheduler tries to serve the requests in the next + 1. The portion of all RB is required for serving URLLC traffic overlaps at TTI. The reality of chance constraint in (5a) is still computationally expensive, and the combinatorial variable in (5) has difficulty reaching a globally optimal solution [3], [5], [22], [28]. Currently, it needs to adapt (5a) into deterministic form for explaining (5) by assuming ( , ) = ∑ − , ∈ ∀ . Thus, URLLC traffic arriving at gNB (any TTI of the time slot, follows a Gaussian distribution, i.e., ~(Ӷ, 2 ), where Ӷ and 2 represent the mean and variance of .
where Ӻ represents the cumulative distribution function of the instantaneous packet size of . From (6a), the reliability can reduce the E2E delay for the arrival rate of several URLLC packets at UE at TTI. Therefore, constraints (6) can be rewritten as follows: Pr {∑ ≤ ∈ } < 1 − ѱ * , (7a) At every time slot , the target reliability ѱ i * depends on the rate of every UE. In addition, the minimization of transmit power depends on the PA and RBs of every UE. The (7) and (7c) show that the low latency and reliability depend on adapting the exact reliability with the smallest resource usage. From (7) -(7f) the fairness doctrine for this mission contributes stationary service quality enhances URLLC rate based on finite blocklength, and makes UEs more pleasant in the network [2]. Moreover, the PA is selected as the central resource for optimization problems (7a). Thus, the repeated form of (5a) and the problem of RB allocation, i.e., ∑ ≤ | |, ∀ ∈ ∈ for any UE and any RB are still NP-hard due to the appearance of a combinatorial variable. The deep-RL agent must be trained with sufficient experience because the deep-RL gaining can take a long time to guarantee high reliability.
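The step from the chance constraint to a deterministic form can be illustrated with a short Python sketch. Assuming, as stated above, that the URLLC load in a TTI is Gaussian with mean Ӷ and a given standard deviation, requiring the probability that the load exceeds the served rate to stay below 1 − ѱ* is equivalent to serving at least the mean plus a quantile-scaled standard deviation. The function names and the numerical values in the usage note are hypothetical.

```python
from statistics import NormalDist


def min_rate_for_reliability(mean_load: float, std_load: float,
                             target_reliability: float) -> float:
    """Smallest served rate R such that Pr{load > R} <= 1 - target_reliability.

    For a Gaussian load, Pr{load > R} = 1 - Phi((R - mean) / std), so the
    chance constraint becomes R >= mean + std * Phi^{-1}(target_reliability).
    """
    z = NormalDist().inv_cdf(target_reliability)
    return mean_load + z * std_load


def violation_probability(rate: float, mean_load: float, std_load: float) -> float:
    """Probability that the Gaussian load exceeds the served rate."""
    return 1.0 - NormalDist(mean_load, std_load).cdf(rate)
```

With a mean load of 100 bits per TTI, a standard deviation of 10 bits, and a target reliability of 0.99999, `min_rate_for_reliability(100, 10, 0.99999)` returns the rate at which `violation_probability` equals the allowed outage of 1e-5.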

C. INTELLIGENT URLLC-B5G SCHEDULING: DEEP-RL
From the problem formulation the proposed deep-RL framework will use two feedback inputs and to estimate its performance and update its DNN in every time slot: The total power in DL BS for every time slot ( ) = min and the calculated reliability ѱ of every UE can be calculated from the problem formulation in (5). Using those two inputs, the deep-RL can determine and for all and . After iteratively assigning and and receiving the needed feedback in a few time slots. Determining the desired rate for every UE depends on the application of deep-RL at every time slot. We will use the derivation in Section II-D. An Action Space Reducer for RB and PA to the OFDMA resources and for all ∈ Ĭ, ∈ while decreasing the power in (Section II-D). Therefore, we will use the derivation in (Section II-E) for PGACL algorithm, whereas every UE achieves the data rate and attains a reward function as shown in (8) and transmits it as feedback to the deep-RL that uses this feedback and updates every UEs accordingly PGACL algorithm (Section II-E). The deep-RL framework is formally defined by its action-value function , state-space , and reward ℛ. The deep-RL framework takes action ( ∈ ) at every state ( ∈ ) and receives reward function, i.e., ℛ( , ). We have considered the number of arriving URLLC packets , instantaneous packet length , for every UE, and channel variation at every time slot.
Thus, the state of a time slot is defined as = ( , , ), ∀ ∈ Ĭ, ∈ , the action space represents the DL transmission power, and the number of TTI of every RB allocation for any UE is expresed as = ( , ), ∀ , . We formulated the reward function to guarantee that the requirement of URLLC based on training its DNN and action space function is satisfied as follows: where is a time-varying weight that guarantees the URLLC reliability when ѱ < ѱ * throughout the time slots, Ω represents the weight factor of power, and the total transmit power ( ) = ∑ ∑ =1 =1 is a casual variable depending on the status of the channel gain. The delay priority in (5) is non-convex due to the non-convex function created from two inputs, and . , that use deep-RL to evaluate the weights of the power and the reliability constraint to define the time-varying weight at step, + 1, as follows: where ѱ * represents the target reliability at time slot , which is defined in (7). The reliability is accurate when the timevarying value increased , when the reliability is ѱ < ѱ * . The maximum data arrival rate is stable and can satisfy . The number of bits to be transmitted in every TTI depends on the packet size of each UE. Deep-RL appropriated depends on whether the BS maximizes the reward function in (8), and the reliability for every UE must be guaranteed at the fixed point when +1 = , also when the reliability is ѱ ≥ ѱ * . The convergence deep-RL algorithm first assumes that the time-varying value converges to * . This convergence of deep-RL can minimize the allocated power , maximizing a data arrival rate and for every UE and of the time-varying value . By defining the Lyapunov function for every BS as ‖ +1 − ‖ 2 = ‖max{ + ѱ * − ѱ , 0 } − ‖ 2 , the original optimization in (5b) is equivalent to (7d) to satisfy (7b). 
Therefore, if this constraint in (7b) is not satisfied in any time slot, the queue delay will depend on the number of arriving packets in the queue waiting for transmission, followed by the use of Lyapunov optimization to solve the RA problem related to time variation [22], [27]. The deep-RL guarantees the reliability for every UE when ‖ѱ * − ѱ ‖ 2 ≤ 2( * − ) (ѱ * − ѱ ) and if the initial conditions of queue delay are finite. Then, high reliability is guaranteed for every UE based on the fixed point for the equal time variance ( +1 = ), followed by + ѱ * − ѱ ≤ max{ + ѱ * − ѱ , 0 } − . Additionally, the latency for every UE is guaranteed when the BS maximizes the reward in (8) when + ѱ * − ѱ ≤ , and ѱ ≥ ѱ * . From (7f), the value in every time slot shows more complexity, making it difficult to improve the cellular networks. The long order of time complexity and difficulty handling more active space in URLLC [5], [28]. The deep-RL algorithm must be able to address more action space in real-time. To solve this problem, a novel algorithm was proposed to reduce the size of the action space without limiting it.
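The time-varying weight update that appears inside the Lyapunov expression above, i.e., a new weight equal to the maximum of zero and the current weight plus the reliability gap ѱ* − ѱ, can be sketched in Python as follows. The update rule is read off from that expression; the toy reliability model, in which the measured reliability grows linearly with the weight, is purely hypothetical and only serves to show that the iteration settles at a fixed point where the measured reliability meets the target.

```python
def update_weight(weight: float, target_reliability: float,
                  measured_reliability: float) -> float:
    """One step of the weight update: grow when reliability is below target,
    shrink otherwise, and never drop below zero."""
    return max(weight + target_reliability - measured_reliability, 0.0)


def toy_reliability(weight: float) -> float:
    """Hypothetical environment: a larger weight on the reliability term of
    the reward makes the agent achieve a higher reliability."""
    return min(1.0, 0.9 + 0.05 * weight)


def run_to_fixed_point(target_reliability: float = 0.99, steps: int = 500):
    """Iterate the update and return the final weight and reliability."""
    weight = 0.0
    for _ in range(steps):
        weight = update_weight(weight, target_reliability, toy_reliability(weight))
    return weight, toy_reliability(weight)
```

In this toy model the weight stops changing exactly when the measured reliability equals the target, which mirrors the fixed-point argument made in the text.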

D. ACTION SPACE REDUCER FOR RB AND PA
To improve the scalable hierarchical framework for the whole RB allocation and PA of the problem, the allocation solution with the smallest power was selected. The RA of the action space adopts the actions for diverse integer representing the action space sizes, i.e., ( ) × with dimensional RB ( × ) and PA ( × ). Reducing the action space can improve the rate of every UE by adjusting exactly the reliability and latency [29]. The optimization problem related to the action space function can be defined to obtain and from the optimization variable in (5).
where = [ 1 , 2 , … . . , ] ∈ represents the desired rate for every UE. The optimization problem in (10) is equivalent to decreasing the BS power and guaranteeing the rate by collecting the desired rate for every UE. The constraint (10a) guarantees that every UE can be performed with a required rate ≥ , and minimization of BS power in (10), which can be solved with constraint (10a) a form of inequality constraint ≥ , will be fulfilled in the form of equivalence [30]. The convex optimization problem in (10)-(10b) can be solved by the augmented Lagrangian algorithm [28]. To reduce the state space dimensions, the rate for every UE is defined by collecting the desired rate for every UE. The deep-RL can reduce the error caused by the action space reducer and conventional RA by using the Lagrangian for the problem (10) while recognizing relaxation of the inequality constraint ≥ by introducing auxiliary variables as follows: Moreover, the problem in (11) is dual decomposable for every RB, growing will growth , which can be rewritten as: where is a vector of Lagrangian dual variables and ( , , ) is a convex function and has a closed-form solution. The optimal power for a given is denoted as optimal * . The optimal PA is obtained when the equality in (5b) and (5e) is controlled. By utilizing the softmax function as the activation function in the output layer, we can guarantee that ≥ 0. This problem in (12) can be derived in terms of as follows: The optimal PA can be expressed as: Since the channel gain of the transmission is required to guarantee the rate of every UE, which decreases with the transmit PA to UE, then every RB can be allocated only to one UE, and the optimal solution of problem (12) is selected to allocate RB to UE where is expressed as: Allocating the RB to every UE is based on the optimal PA * required in (14) and (15) by minimizing the system PA.
A greedy allocation method is adopted, in which the UE that requires the optimal PA is selected for every RB.
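A minimal Python sketch of a greedy, minimum-power RB assignment is given below. It is not the closed-form solution of (14) and (15): for simplicity it assumes the Shannon rate instead of the finite-blocklength rate, assigns exactly one RB to every UE, and assumes there are at least as many RBs as UEs. All function names and numerical values are hypothetical.

```python
def required_power(rate_bps: float, gain: float,
                   bandwidth_hz: float, noise_psd: float) -> float:
    """Transmit power (W) needed to reach rate_bps on one RB.

    Assumption: Shannon rate, rate = B * log2(1 + p * gain / (N0 * B)).
    """
    return (2.0 ** (rate_bps / bandwidth_hz) - 1.0) * noise_psd * bandwidth_hz / gain


def greedy_allocate(gains, target_rates, bandwidth_hz, noise_psd):
    """Greedy RB and power assignment.

    gains[i][k] is the channel gain of UE i on RB k. Candidate (UE, RB) pairs
    are visited in order of increasing required power; a pair is accepted if
    the UE is still unserved and the RB is still free, so every RB serves at
    most one UE. Returns {ue: (rb, power)}.
    """
    candidates = sorted(
        (required_power(target_rates[i], g, bandwidth_hz, noise_psd), i, k)
        for i, row in enumerate(gains)
        for k, g in enumerate(row)
    )
    allocation, used_rbs = {}, set()
    for power, ue, rb in candidates:
        if ue not in allocation and rb not in used_rbs:
            allocation[ue] = (rb, power)
            used_rbs.add(rb)
    return allocation
```

For two UEs with gains `[[1e-9, 4e-9, 2e-9], [3e-9, 5e-9, 1e-9]]` and equal target rates, the second UE takes the RB with the strongest gain overall and the first UE falls back to its best remaining RB. A greedy pass of this kind is cheap but not guaranteed to be globally optimal.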
According to [31], only the deep-RL complexity is affected by the increase in the number of UEs; the complexity of our algorithm is ( ) based on the chosen actions and will provide the new training state information to the agent in (12). Moreover, when reducing the action space, the desired rate for every action becomes = [ , , … . . , ] ∈ . This action space is still not scalable due to the Ndimensional excitation in , as shown in [32]. The proposed deep-RL problem using a policy gradient is shown in the new subsection.

E. PGACL ALGORITHM FOR POLICY GRADIENT FOR OPTIMAL RATE ALLOCATION
In this section, the PGACL algorithm is explained to improve the performance of the policy gradient for optimal rate allocation and analyze its convergence. The agent aims to select a policy gradient algorithm such as the PGACL algorithm that can control the desired rate or every action of the optimization problem in (12). Power allocation * based on deep-RL can learn to find the optimal policy * through maximizing expected reward. The policy function can be described as ( , ) = { , ∀ ∈ Ĭ, ∈ }, where is a conditional distribution of RB , which is allocated for UE at any TTI of the time slot . The goal of deep-RL is to obtain the optimal policy * by selecting the desired rate and power according to the current state based on the decision policy and generating the maximal ( , ) for each state space and action space . Let * = arg max ( , ) ∀ ∈. With given policy , the policy action value, i.e., ( , ) is a cumulative discounted reward at a given π as shown in Fig. 3, which is used to update the deterministic policy to achieve a policy that exploits the expected return of the algorithm in (16) and can be expressed as where ∈ (0,1] is a discount factor, and is expectation. The cumulative discounted reward iteratively applying ( +1 , +1 ) leads to junction ( , ) = [ℛ( , ) + ( +1 , +1 ) − ( , )]. The best solution obtained by iteratively leveraging the Bellman equation in [16], is used to improve its ability and estimate action values in an integral randomness environment. The policy that maximizes ( ) of the DNN can be obtained and explained in [7]: Based on (17), policy optimization proposed a policy gradient PGACL algorithm that can provide a good policy with a closer convergence rate and a low computational cost by combining policy learning and value learning. The PGACL algorithm consists of two parts: a) The first part is called an actor part that can update the policy in DNN depending on the policy gradient. 
This policy is created based on proposing a parameter vector , as ( , ) = ( | , ). The agent performs the Bellman optimality in (16) on every transition of the selected action and obtains the target action value. This actor part can achieve incredible performance on deep-RL problems according to (17) in respect of as is the gradient of objective function. Additionally, this policy of a parameter vector can provide the optimal distribution of actions i t for every UE ∈ Ĭ based on regularizing the actor's learning to correct the error and improve stability as large steps in the actor update. The parameterized policy can control the policy by the Gibbs distribution as ( , ) = ( ( , )) ∑ ( ( ,)) ∈ ⁄ , based on using the gradient function in (17) as +1 = + ( ), where ( , ) is the feature vector and represents the learning rate of the actor. b) The second part is called the critic part, which is used to learn the correct scheduler mechanism to be fulfilled at every TTI to maximize the rate allocation for the policy gradient. The function estimator in [33] is applied to estimate the value function Ҷ( , ) of agent ǂ, which is expressed as: where Ϫ = [Ϫ 1 ( , ), . . . . . . , Ϫ ( , )] is the basis function vector. The critic utilities are used to calculate the error between the estimated and real values to achieve the performance gain in terms of the temporal-difference (TD) method as = ℛ +1 + Ҷ( +1 , +1 ) − Ҷ( , ). The linear function estimator in (18) is used to update the weight vector, i.e., Ϫ( , ) by using the gradient descent method as: ҷ( +1 , +1 ) = ҷ((s t ), (a t )) + ℓ c t ∇ ҷ Ҷ(s, a) = ((s t ), (a t )) + ℓ c t Ϫ(s, a), where is the critic that uses the TD method, and ℓ is the critical learning rate. The value function in (18) has updated the DNN at every TTI by applying critic learning to the value of Ҷ( , ) in (19). 
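The interplay of the actor and the critic described above can be sketched in a few lines of Python. This is a simplified illustration and not the PGACL implementation: it uses a tabular Gibbs (softmax) policy, a tabular state-value critic whose TD error serves as the advantage signal (the text's critic estimates an action-value function with a linear estimator), and a hypothetical one-state environment with two actions in which action 1 yields reward 1 and action 0 yields reward 0.

```python
import math
import random


def softmax(prefs):
    """Gibbs distribution over action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, value, state, action, reward, next_value,
                      gamma=0.9, actor_lr=0.1, critic_lr=0.1):
    """One actor-critic update driven by the TD error."""
    # Critic: temporal-difference error between target and current estimate.
    td_error = reward + gamma * next_value - value[state]
    value[state] += critic_lr * td_error
    # Actor: policy-gradient step; for a softmax policy the gradient of
    # log pi(a|s) w.r.t. preference b is 1{a == b} - pi(b|s).
    probs = softmax(theta[state])
    for a, p in enumerate(probs):
        grad_log_pi = (1.0 if a == action else 0.0) - p
        theta[state][a] += actor_lr * td_error * grad_log_pi
    return td_error


def train_bandit(steps=2000, seed=0):
    """Train on the hypothetical one-state, two-action task and return pi."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0]]   # actor preferences for the single state
    value = [0.0]          # critic estimate for the single state
    for _ in range(steps):
        probs = softmax(theta[0])
        action = rng.choices([0, 1], weights=probs)[0]
        reward = float(action)
        # Each step ends the episode, so the next-state value is zero.
        actor_critic_step(theta, value, 0, action, reward, next_value=0.0)
    return softmax(theta[0])
```

After training, the returned policy places most of its probability on the rewarded action, which shows how the critic's TD error steers the actor's parameter vector.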
The PGACL algorithm trains the DNN by sampling random tuples from the experience pool, where each experience tuple is built from the selected action, the next state $s_{t+1}$, and the current reward ℛ. When $\gamma \in (0, 1)$, starting from any $Ҷ(s, a)$ and iteratively applying the operator $ҷ(s_{t+1}, a_{t+1})$, the iterative process comes to satisfy the Bellman optimality equation, as shown in [34]. The time-varying value is updated according to (9) in the system to meet the target reliability while reducing the aggregate BS power. An issue can arise from a sudden increase in the arrival rate of each UE with a long recovery time, which causes the system to require a transient time that is critical for URLLC in B5G. To overcome this issue, refiner GANs are proposed to evaluate real and synthetic datasets by controlling the generation of realistic data in real-time operation. The proposed method can generate realistic traffic flows at the packet level and guarantee reliability for RA in the long term. Furthermore, the proposed deep-RL relies on the refiner GAN solution to generate high-reliability synthetic data similar to real data based on the regulated number of great actions in the dataset. Previous research has not considered regulating the number of great actions in the dataset [22], [36].
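The experience pool mentioned at the start of this paragraph can be sketched as a bounded buffer with uniform random sampling. This is a minimal illustration; the class name `ExperiencePool`, its capacity handling, and the tuple layout are assumptions made for the example.

```python
import random
from collections import deque


class ExperiencePool:
    """Bounded pool of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        # The oldest tuples are discarded once the capacity is reached
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw up to batch_size distinct tuples uniformly at random."""
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)
```

Sampling uniformly from stored transitions, rather than training on consecutive ones, reduces the correlation between the tuples used in one update.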

III. PROPOSED DEEP-RL FOR REFINER GAN IN REAL-TIME
Deep-RL for refiner GANs in real-time is needed to handle the large arrival rates created by a (synthetic data) generator and a (data) discriminator. The guaranteed level of great actions in the generated datasets depends on whether the output of the Refiner Neural Network (RNN) is similar to the input of the RNN. The optimal RNN $Ҩ^*_ℝ$ is trained so that the output of the refiner is indiscernible from a real dataset, using a discriminator neural network together with the RNN [35], [37] as in (20), where $Ҩ_ℝ$ denotes the regular weights of the refining network, $Ҩ$ represents the regular weights of the discriminator, and ℱ is the cross-entropy loss function. The optimal RNN $Ҩ^*_ℝ$ is trained to minimize an objective prediction function while the discriminator attempts to discriminate the real data from the refined synthetic data, as shown in Fig. 4. The RNN controls the generation of the real-like data from ℤ, obtained from a distribution $p(ℤ)$, through the refiner with weights $Ҩ^*(Ҩ_ℝ)$, where $ℤ \sim p(ℤ)$. During training, the discriminator $Ҩ$ learns to distinguish between the actual data $x$ and the data arriving from the refined GAN-distribution traffic data $p(ℤ)$. The DNN is trained with backpropagation using cross-entropy as the loss function, as given in (21), where $p_{\mathrm{real}}(x)$ represents the real data distribution and $p(ℤ)$ represents the refined simulated data. The notation $\mathbb{E}_{x \sim p_{\mathrm{real}}(x)}$ denotes an expectation, dependent on the output of the discriminator parametrized by $Ҩ$, when the unrefined synthetic input is given, and ℤ is a data sample from the distribution, i.e., from $p(ℤ)$ and its refined counterpart with weights $Ҩ^*(Ҩ_ℝ)$. According to (21), the first term represents the discriminator's ability to learn the real data distribution. The second term represents the discriminator's ability to learn the data coming from the refiner generator.
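The two-term structure of the cross-entropy objective described above can be illustrated with a short sketch. It assumes the usual binary cross-entropy sign convention and takes the discriminator outputs as plain probabilities; the function names and the clipping constant `eps` are hypothetical and are not taken from the paper.

```python
import math


def discriminator_loss(d_real, d_refined, eps=1e-12):
    """Binary cross-entropy of the discriminator.

    d_real:    discriminator outputs (probability of "real") on real samples
    d_refined: discriminator outputs on refined synthetic samples
    """
    # First term: how well real samples are recognized as real
    real_term = -sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    # Second term: how well refined samples are recognized as synthetic
    refined_term = -sum(math.log(max(1.0 - p, eps))
                        for p in d_refined) / len(d_refined)
    return real_term + refined_term


def refiner_loss(d_refined, eps=1e-12):
    """Refiner objective: mean of log(1 - D(refined)), to be minimized."""
    return sum(math.log(max(1.0 - p, eps)) for p in d_refined) / len(d_refined)
```

With an uninformative discriminator that outputs 0.5 everywhere, `discriminator_loss` equals `2 * math.log(2)`; a discriminator that separates the two sets confidently drives the loss toward zero.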
By training the discriminator, the generative refiner can guarantee the output of the RNN as $ℱ(Ҩ_ℝ) = \mathbb{E}_{ℤ \sim p(ℤ)} \log\big(ℱ(ℤ, Ҩ^*(Ҩ_ℝ)) / (1 - ℱ(ℤ, Ҩ^*(Ҩ_ℝ)))\big)$ and minimize the average correct predictions [8]. From (21), in practice, the output of the RNN cannot be controlled well enough for $Ҩ_ℝ$ to learn sufficiently. When $Ҩ_ℝ$ is insufficiently trained, $Ҩ$ can reject its output with great confidence because that output is not comparable to the real data seen during training, so training $Ҩ_ℝ$ to minimize $\min_{Ҩ_ℝ} \log(1 - ℱ(ℤ, Ҩ^*(Ҩ_ℝ)))$ alone is not sufficient. The dynamics of $Ҩ_ℝ$ and $Ҩ$ provide stronger real data by controlling the level of extreme events when the refiner's output is the same as its input. The network $Ҩ^*$ is trained to make the distributions of the real data and the generated data the same by applying the convergence discriminator $Ҩ^* = p_{\mathrm{real}}(x) / (p_{\mathrm{real}}(x) + p_{\mathrm{syn}}(x))$, where $p_{\mathrm{real}}(x)$ is the real data distribution and $p_{\mathrm{syn}}(x)$ is the global optimality distribution of synthetic data [37]. From (21), if the output of the RNN is similar to its input, the synthetic dataset can easily be recognized by the convergence discriminator. Therefore, the global optimality of the virtual environment can prepare the trained agent and help to control the level of real extreme events when $p_{\mathrm{real}}(x) = p_{\mathrm{syn}}(x)$. The optimal discriminator $Ҩ^*$ of the real wireless environment can be written as in (22). For any $(a, b) \in \mathbb{R}^2 \setminus \{0, 0\}$, the function $Ỿ \mapsto a \log Ỿ + b \log(1 - Ỿ)$ attains its maximum on $Ỿ \in [0, 1]$ at $a / (a + b)$, which the real-time training of the refiner GAN exploits for the discriminator $Ҩ$. The training objective of $Ҩ$ can be explained as the maximization of the log-likelihood for discriminating the real data from the refined synthetic training data, as expressed in (23). The optimal discriminator consequently tries to attain $\max_{Ҩ} ℱ(Ҩ_ℝ, Ҩ)$ by training to distinguish between the outputs associated with $Ҩ_ℝ$ and $Ҩ$.
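The maximizer used in this argument can be checked numerically. The sketch below is a simple grid search, written under the assumption that the pointwise objective is a*log(y) + b*log(1 - y) with a and b standing for the real and synthetic densities at one point; the function names and the grid resolution are illustrative choices.

```python
import math


def pointwise_objective(y, a, b):
    """Value of a*log(y) + b*log(1 - y) for y strictly between 0 and 1."""
    return a * math.log(y) + b * math.log(1.0 - y)


def argmax_on_grid(a, b, n=10000):
    """Grid search for the maximizer of the pointwise objective on (0, 1)."""
    best_y, best_value = None, -math.inf
    for k in range(1, n):
        y = k / n
        value = pointwise_objective(y, a, b)
        if value > best_value:
            best_y, best_value = y, value
    return best_y


def optimal_discriminator(p_real, p_syn):
    """Closed-form maximizer: the ratio p_real / (p_real + p_syn)."""
    return p_real / (p_real + p_syn)
```

When the two densities are equal at a point, `optimal_discriminator` returns 0.5, which corresponds to the equilibrium where the discriminator can no longer tell refined data from real data.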
The minimax objective of the RNN in (21) can be optimized by generating the most realistic samples of real data, which aim to fool the best discriminator trained offline, i.e., $Ҩ$, and it can be reformulated as in (24).

The generator function achieves the convergence discriminator $Ҩ^*$ as in (21), which satisfies [(Ҩ * (ℤ) = ) = ( * ( ) = )]. In addition, if and only if the real and synthetic data distributions coincide, the discriminator $Ҩ$ is allowed to minimize the average correct predictions, as shown in (24). The algorithm generates real data during real-time refiner GAN training for developing deep learning. Algorithm I handles the large amounts of data involved in URLLC, and its order of time complexity makes it suitable for real-time operation. The refiner GAN training algorithm can address a larger action space with low complexity by removing the transient training time, with great confidence in the regular weights of the discriminator $Ҩ$, because the output becomes incomparable to the real data in real-time. The computational complexity is proportional to the number of actions at every decision epoch. After an update to the RNN, the gradient of the output of the discriminator is parametrized by $Ҩ$. With the network that discriminates between the refined data $ℱ(ℤ, Ҩ^*(Ҩ_ℝ))$ and the real data, the refined data are more likely to be classified as real data.
After several steps of training, $Ҩ_ℝ$ and $Ҩ$ generate more realistic data and reach a point at which neither can improve because the real and synthetic data distributions coincide. The discriminative neural network has to allocate training data to establish efficient decisions for the refiner networks. Let , C 0 , and C denote the training layers in trained deep-RL models, proportional to the number of hidden layers and dimensions of the output utilized in deep-RL, respectively. The complexity in every training for every agent is computed by ) at every training procedure. In real data training, every TTI has epochs ℕ ℎ with every epoch being a time slot, and every trained model is finished over iterations. Therefore, at convergence, the network has ʒ agents with ʒ trained deep-RL models. Hence, the total complexity is (ʒℕ ). The high-complexity deep-RL training phase is performed offline, for a limited number of actions and epochs, at a powerful unit such as the BS [38]-[40].

IV. SIMULATION RESULTS
In this section, B5G URLLC is evaluated on a large labeled real dataset in real-time and on the number of packets generated during the interarrival time. In addition, the number of arriving packets from the dataset, which is comparable in length and interarrival time for every UE, is considered. The proposed algorithms are evaluated using different benchmark metrics. These metrics are commonly used to evaluate the performance of existing deep-RL approaches for refiner GANs [15], [22], [30]. The main simulation parameters are listed in Table II.

A. INTELLIGENT URLLC-B5G SCHEDULING AND ACTION SPACE REDUCER FOR RB AND PA
The intelligent agent can improve smart packet transmission scheduling for URLLC. The average transmission delay is very sensitive to the size of the data packet. Short packets achieve high reliability, and the reliability decreases as the average packet size increases, as shown in Fig. 5. The proposed deep-RL ensured the reliability of the UEs, keeping it close to 1 under a limited blocklength at a rate of 2.5 Mbps, and URLLC remains more reliable as the data rate increases when the packet size is small, because the delay increases with the packet size. In Fig. 5, it can be observed that the proposed deep-RL for refiner GAN provides high reliability and low latency at a high data rate when the packet size decreases, given the limited bandwidth. The proposed deep-RL for refiner GAN can randomly reduce the packet loss generated by different multiuser arrivals at a BS. Compared with [30, Fig. 4], the proposed approach provides more reliability for traffic with shorter packet sizes. The interarrival time between packets and the queueing delay violation depend on the E2E performance in terms of reliability. The deep-RL can decrease errors quickly, as indicated by the average training loss and validation loss of the ANN. The proposed deep-RL for refiner GAN kept the URLLC reliability higher than 97% at 2.5 Mbps, whereas with the average achievable rate, suitable reliability could not be maintained and fell below 60%. From Fig. 6, the deep-RL algorithm for reducing transmission power relies on the current state of the decision policy to obtain intelligent transmission for every UE. The training sample time is insufficient if the packet arrival rate is high. Therefore, the training time requires the AI in URLLC to transmit more packets in real-time. The proposed deep-RL reduces the total power while holding the lowest average rate of every UE and controls the packet transmission for a maximum delay.
The deep-RL achieved minimum average transmission delays as low as 0.16 ms, as shown in Fig. 6, with finite transmit power for many URLLC UEs. As shown in Fig. 6, the optimal RA can provide the minimum transmission delay while minimizing the transmit power or using the same power as the proposed deep-RL. The minimum average delay is obtained at a power of 35 W, compared with [22, Fig. 7], where the minimum average delay is reached at a power greater than 50 W. The average transmission delay is nearly flat for the refiner GAN when the transmission power level increases. Training the deep-RL with the PGACL algorithm can control the transmission duration of each packet by ensuring a minimized transmit power for more URLLC UEs.

B. OPTIMAL RATE ALLOCATION FOR POLICY GRADIENT
The policy gradient is significant for selecting an optimal rate allocation to solve the RB, improving packet transmission to guarantee high E2E reliability, and guaranteeing a good policy with a closer convergence rate, which depends on the reduced action space for every UE. As shown in Fig. 7, when the packet size is large, the rate decreases because delay violations cause packet loss.
The proposed deep-RL of URLLC for refiner GAN becomes more reliable by generating the most realistic real data samples to fool the best discriminator trained offline. In Fig. 7, the delay increases with increasing packet size. When the packet size is increased, the average delay metric is not suitable for real-time services (the delay of packets would lie in a small range). However, the proposed deep-RL for refiner GAN achieved an average achievable data rate that varied from 28 Mbps to 48 Mbps as the average URLLC load grew from 20 to 100 packets/time slot. Figure 8 evaluates the reliability with the total number of UEs in the system over the time slots. The reliability did not decrease as the running time increased from zero to 500 time slots but slowly increased due to the strict E2E QoS requirement. From Fig. 8, the refiner GAN can achieve a reliability of 0.999 with around 400 epochs, whereas in [15, Fig. 6], 400 epochs are needed to achieve an accuracy of 0.98. As a result, the system provides high reliability and a higher data rate based on localizing in time with the TTI. From (11), the data packet size was small enough to achieve high reliability, and the TTI was short. From Fig. 8, the proposed deep-RL for the refiner GAN provides good performance by enabling fast convergence, achieving a better response, and improving time by controlling the generation of real data during refiner GAN training from the first time slot in real-time. The reliability of the PGACL alone is lower than that of the refiner GAN because, with the PGACL, a sudden increase in the arrival rate of each UE with a long recovery time occurred, and the system needed a transient time, which is critical for B5G URLLC.

C. DEEP-RL FOR REFINER GAN IN REAL-TIME
In this subsection, the conditional refiner GAN for large iterative training is shown to provide a desirable action in each decision epoch, reduce the order of time complexity, and control the great action space involved in URLLC in real-time. Figure 9 shows the relation between the E2E target latency and the delay reliability in terms of the effect of the maximum bandwidth. From Fig. 9, high reliability and low latency were achieved at a high rate by allocating a higher bandwidth to the system. The E2E latency increases with increasing reliability because of the tradeoff between latency and reliability. When the bandwidth increased from 35 MHz to 45 MHz, the latency decreased, making it difficult to guarantee the latency and reliability. Moreover, the minimum 35 MHz bandwidth could increase the rate of each UE as required without sacrificing the E2E latency and reliability of URLLC. The curve in Fig. 9 shows increasing reliability because the URLLC data rate based on finite blocklength provides more reliability to traffic with shorter packet sizes and protects UEs in bad channel states to satisfy the looser reliability requirements. The proposed deep-RL for the refiner GAN can achieve a reliability of 99.9999% and a latency of less than 1.4 ms. Figure 10 shows the training of the GANs and presents the actual refined training data. The discriminator $Ҩ$ and refiner losses decreased to a stable loss on the real dataset in terms of the training loss. The optimal refiner GAN ($Ҩ^*_ℝ$) has been trained to provide a stable loss over time for both the discriminator loss and the refiner loss. The loss decreased only slowly over the training epochs due to the control over the output of the RNN, which provides sufficient training so that $Ҩ_ℝ$ can learn well. Only 2000 epochs (training samples) are needed for the deep-RL refiner GAN to achieve a stable loss.
From Fig. 10, the deep-RL for refiner GAN trains on data samples to minimize an accurate prediction, as shown in (21), and generates the most realistic samples of real data, which provide a stable loss and yield high-reliability synthetic data similar to real data when the discriminator tries to distinguish between the two sets of data.

V. CONCLUSION
This paper uses the proposed deep-RL framework to determine the E2E reliability and E2E latency for every UE based on a dynamically predicted traffic model, jointly allocating RBs and minimizing power under the arrival rate constraint of B5G URLLC in the DL of an OFDMA system. The joint power minimization problem was formulated with the rate constraints of the UEs, ultrahigh reliability, and ultralow latency to operate in highly reliable systems. The proposed deep-RL can predict the UEs' traffic and use those predictions in the RA process. To solve this problem, it is necessary to guarantee the desired rate for every action by addressing the large state space and large action space with the proposed PGACL, which can provide a good policy with a closer convergence rate and a low computational cost. Finally, to improve highly reliable systems, the E2E reliability and latency of every UE were used as feedback to the proposed refiner GANs to provide sufficient traffic packets in real-time and to avoid leaving a large number of training samples unlabeled. As a result, the proposed deep-RL for the refiner GAN can minimize the unlabeled real traffic, learns faster, and has a shorter transient period. The simulation results verify that the proposed deep-RL with refiner GANs can satisfy the stringent requirements of URLLC and a high rate by omitting the longer transient training time required by the untrained agent and the synthetically trained agent. Our future work will investigate improved intelligent packet transmission scheduling and fairness among UEs for the internet of everything in URLLC-B5G.