Smart Packet Transmission Scheduling in Cognitive IoT Systems: DDQN Based Approach

The convergence of Artificial Intelligence (AI) can overcome the complexity of network defects and support a sustainable and green system. AI has been used in the Cognitive Internet of Things (CIoT) to handle large volumes of data, minimize energy consumption, manage traffic, and store data. Improving smart packet Transmission Scheduling (TS) in CIoT depends on choosing an optimum channel with a minimum estimated Packet Error Rate (PER), minimum packet delays caused by channel errors, and few subsequent retransmissions. Therefore, we propose a Generative Adversarial Network and Deep Distributional Q-Network (GAN-DDQN) to enhance smart packet TS by reducing the distance between the estimated and target action-value particles. Furthermore, GAN-DDQN training based on reward clipping is used to evaluate the value of each action for certain states and to avoid large variations in the target action value. The simulation results show that the proposed GAN-DDQN increases throughput and transmitted packet rate while reducing power consumption and Transmission Delay (TD) when compared to fuzzy Radial Basis Function (fuzzy-RBF) and Distributional Q-Network (DQN). Furthermore, GAN-DDQN provides a high rate of 38 Mbps, compared to actor-critic fuzzy-RBF's rate of 30 Mbps and the DQN algorithm's rate of 19 Mbps.


I. INTRODUCTION
Recently, the Internet of Things (IoT) has emerged as a promising vision in Beyond fifth-Generation (B5G) wireless technology, intelligently interconnecting the growing usage of application services such as mobile phones, video streaming, and video conferencing in business and daily life. Application services enable people to access streaming applications from anywhere and at any time, while also providing big data in real time. Improving wireless multimedia applications (i.e., vehicles, monitors, YouTube, Skype, and web browsing) depends on packet transmission scheduling in Ultra-Reliable Low-Latency Communications (URLLC) [1], [2]. In addition, URLLC is closely related to mission-critical IoT applications due to stringent constraints on combined latency and reliability [2].
In particular, controlling loss and enhancing packet Transmission Scheduling (TS) in the Cognitive IoT (CIoT) system are two major challenges for URLLC systems. Many studies have proposed spectrum schemes [3], [4] to guarantee quality-of-service requirements and provide a high transmission data rate. Artificial Intelligence (AI) has been used in CIoT to keep up with data volume while minimizing energy consumption and managing traffic and data storage. To guarantee throughput and transmit packets from different buffers through multiple channels, a new Q-learning-based TS has been proposed to improve packet transmission efficiency in CIoT systems [5]. A Deep-Learning (DL) agent can substantially enhance its packet-transmission predictions as it becomes more intelligent about the CIoT [5].
To provide better-performing TS in the Cognitive Internet of Vehicles (CIoV) system, Deep Reinforcement Learning (DRL) has difficulty obtaining a large labeled real dataset in real time [6]. Generative Adversarial Networks (GANs) are therefore needed to achieve packet TS, allowing the DRL agent to gain knowledge. A GAN-based Deep Distributional Q-Network (DDQN) has been proposed to plan an intelligent agent, as shown in [6]. A model-free actor-critic DRL method has been proposed to solve the TS problem by applying the learning problem to intelligent resource allocation in CIoT systems, improving the transmitted packet rate and power consumption [7]. In addition, the GAN-based DDQN algorithm improves training stability in CIoT by efficiently estimating the value of every action for certain states and the expectation of the action-value distribution. Our entire procedure of using DRL for CIoT systems depends on designing an intelligent agent, as shown in Fig. 1.

A. MOTIVATION AND CONTRIBUTIONS
The current low packet transmission efficiency of IoT faces the problem of a crowded spectrum because of the rapidly increasing popularity of various wireless applications. A major challenge in CIoT is packet transmission efficiency. The unexpected growth in the arrival rate at all Users (UEs) causes unnecessary overhead and long retransmissions in the case of extreme events. Because URLLC applications are sensitive to reliability and latency, short periods of unreliability or latency can have a significant impact on UEs. Spectrum-decision management must account for the dynamic channel nature in Cognitive Radio Networks (CRNs), which provide a large volume of data based on the estimated Packet Error Rate (PER), channel status, throughput, and packet retransmission delay. Furthermore, the underestimation of action values due to the effect of random noise in CIoT must be overcome. Therefore, we propose a GAN-based DDQN empowered by a Software-Defined Network (SDN) controller in a highly complex IoT environment for intelligent TS. The main contributions of this work are as follows:
• We explore how to improve CIoT throughput by maximizing the quality of the transmitted packet rate and reducing Transmission Delay (TD) based on channel transmission, Signal-to-Noise Ratio (SNR), and PER, choosing a good channel and reducing spectrum handoff in multimedia applications.
• We propose a Radial Basis Function (RBF) learning algorithm for reducing transmission power, which depends on the current state of the decision policy to obtain intelligent TS for every UE in CIoT systems. Furthermore, we use a convergent actor-critic to provide reasonable real-time learning decisions at the output layer of the fuzzy-RBF learning algorithm. The integrated actor-critic fuzzy-RBF learning algorithm can solve the TS problem for a large number of transmission packets by updating the temporal-difference error.
• We propose GAN-DDQN for enhancing the value estimate of each action, resolving the long DRL training process, and processing the collected real data in real time. DDQN learning suffers from the problem that only a small part of the generator output is included in the calculation of the loss function during training. The proposed GAN-DDQN can remove this loss when the agent is deployed, as well as reduce delay, improve throughput, and perform TS. To stabilize GAN-DDQN training, we propose a new reward-clipping process that prevents large variations in the target action value.

B. RELATED WORKS
Achieving an intelligent spectrum-handoff decision in CRNs depends on the proposed transfer actor-critic learning selection, which uses a comprehensive reward function that considers channel quality, packet error rate, packet dropping rate, and throughput [8]. Energy management, resource allocation, and TS are the challenges in the CIoT system, which require the design of a learning agent to develop decision-making ability [8]. The distributed resource allocation problem for IoT has been addressed using cognitive hierarchy methods, since both human-type and machine-type devices exist in CIoT systems [9]. Moreover, reducing power consumption for low-energy UEs such as narrowband-IoT devices depends on scheduling data transmissions to fulfill the URLLC requirements for cellular networks [10]. Improved CIoT systems must provide data transmission that satisfies URLLC requirements of low average packet latency, reliability, energy efficiency, and internet connectivity [11][12][13][14]. Based on previous studies [6][7][8][9][10][11][12][13][14], CIoT systems are still not smart enough to search for the optimal policy. To make the system intelligent enough to search for the optimal policy, we propose an RBF to extract the intelligence and improve TS performance by developing 'intelligent' fuzzy controllers. Improving the success of transmitted packets depends on minimizing the discrepancy between the estimated and target action-value distributions, which achieves the optimal resource allocation policy for GAN-powered DRL [15]. Good transmission packet scheduling guarantees high End-to-End (E2E) reliability based on the proposed experience-driven DRL action-space reducer that shrinks the action space of the GAN [16]. Previous studies [12][13][14][15][16] often overlook the real-time requirements of real-time traffic and the influence of retransmission delay, whereas real-time communications over IEEE 802.11 are vital to meet highly efficient TS in the CIoT system.

II. SYSTEM MODEL
In this section, we consider the downlink of a wireless IoT system whose devices are randomly distributed in a circular cell; every IoT device adaptively selects among the lower modulation levels: Binary Phase-Shift Keying (BPSK), 4-Quadrature Amplitude Modulation (QAM), 8-QAM, and 16-QAM. GAN scheduling is applied in the SDN-based radio access network for dynamic CIoT, with a noisy wireless bandwidth in the radio access network for downlink transmissions.

A. CHANNEL STATE
We consider a CRN with $J$ independent channels; every channel can be allocated to UEs. The packet arrival process in CIoT is modeled as a Poisson process. The SNR is independent and identically distributed between different transmissions of a packet. Under Rayleigh fading, the probability density function of the received SNR can be written as
$f(\gamma_j) = \frac{1}{\bar{\gamma}}\, e^{-\gamma_j/\bar{\gamma}}$, (1)
where $\gamma_j$ represents the instantaneous SNR of the $j$th channel at the receiver and $\bar{\gamma}$ represents the average received SNR. Let $\gamma_{j,n}$ be the received SNR at the $n$th transmission after the packets are combined at time slot Ҩ; in the adopted system, the received SNR defines perfect channel state information at the receiver. The $j$th channel experiences block Rayleigh fading, and its status at time slot Ҩ is represented by a binary variable $c_j(Ҩ) \in \{0,1\}$ [14], [17]. If $c_j(Ҩ) = 1$, the channel is occupied by one transmission packet; otherwise, the channel is idle.
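As an illustration, the channel model above can be sampled numerically: under Rayleigh fading the instantaneous SNR is exponentially distributed around the average received SNR, and per-slot packet arrivals are Poisson. This is a minimal sketch; the average SNR and arrival rate are assumed values, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

avg_snr = 10.0       # assumed average received SNR (linear scale)
arrival_rate = 0.5   # assumed Poisson packet-arrival rate per slot

# Rayleigh fading => exponentially distributed instantaneous SNR:
# f(gamma) = (1/avg_snr) * exp(-gamma/avg_snr)
snr_samples = rng.exponential(scale=avg_snr, size=100_000)

# Poisson packet arrivals per time slot
arrivals = rng.poisson(lam=arrival_rate, size=100_000)
```

Sampling the SNR this way is enough to drive the error-rate and delay models of the following subsections in simulation.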

B. POWER CONSUMPTION MODEL
Power consumption depends on the small-scale channel gains, and several packets must wait for retransmission in the next time slot. Every device has one of two power statuses $\{0,1\}$ in each time slot Ҩ. The transmit power corresponds to the data packets transmitted by each device. The low-power level for every device on the $j$th channel, accounting for queuing to access this channel or being rearranged to other channels, is modeled as
$P_j = P_c + P_{t,j}$, (2)
where $P_c$ is the circuit power and $P_{t,j}$ is the transmission power consumption on the $j$th channel in CIoT systems.
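A minimal sketch of the two-state power model above, assuming illustrative circuit and transmission power values (the text does not give numbers):

```python
def device_power(active: bool, p_circuit: float = 0.1, p_tx: float = 0.5) -> float:
    """Per-slot power on a channel: circuit power is always drawn,
    transmission power is added only when the device is actively sending."""
    return p_circuit + (p_tx if active else 0.0)
```

A sleeping device therefore still pays the circuit power, which is why the scheduler tries to keep devices on the low-power status whenever the channel queue forces them to wait.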

C. TRANSMIT PACKET RATE MODEL
In this subsection, we explain the SNR boundary values of the Rayleigh fading channel based on the packet loss rate [18]. In this paper, we choose $2^m$-QAM, which is used when the received SNR exceeds the minimum SNR. The minimum SNR required to achieve a target Bit Error Rate (BER) is derived as $\gamma_m = \frac{1}{g_m}\ln\!\left(\frac{c_m}{\mathrm{BER}_t}\right)$, $m = 1, 2, \ldots, N_{\max}$, as shown in [19], where $m$ indexes the modulation and coding scheme levels and $g_m$, $c_m$ are level-dependent constants. $N_{\max}$ represents the maximum number of transmissions in Hybrid Automatic Repeat Request (HARQ). To avoid deep channel fades, no payload bits are sent when the received SNR is lower than $\gamma_1$; for each modulation and coding scheme level the operating SNR lies in $[\gamma_m, \gamma_{m+1})$ [17]. The BER is estimated based on the calculated SNR and modulation level as
$\mathrm{BER}_m = 1 - \left(1 - 2\left(1 - \frac{1}{\sqrt{2^m}}\right) Q\!\left(\sqrt{\frac{3\gamma_j}{2^m - 1}}\right)\right)^2$, (3)
where $Q(\cdot)$ is the Q-function, used to find the tail distribution of the standard normal distribution. The average PER in the HARQ mode over all SNR values $\gamma_{j,1}, \gamma_{j,2}, \ldots, \gamma_{j,N_{\max}}$, including the number of packet transmissions, is expressed in terms of the per-bit error rate and the packet size $L$ transmitted successfully on the $j$th channel. When a packet is retransmitted, the receiver tries to recover errors by combining it with the packets received from previous transmissions; the BER of the combined retransmitted packet can be calculated as in (4), where the number of correctable errors is limited by the permitted distance of the convolutional code at the $n$th transmission attempt. The estimated BER of the retransmitted packet determines the PER, whereas the BER of the retransmitted packet is not independent of the previously transmitted packets, as shown in (4). The successful transmit packet rate of the $n$th packet transmission on the $j$th channel then follows as in (5).
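The per-transmission error model can be sketched as follows, using the standard textbook square M-QAM symbol-error approximation and an independent-bit-error PER. This is a simplified stand-in; the exact constants of the paper's HARQ-combined expressions are not reproduced here.

```python
import math

def qfunc(x: float) -> float:
    """Gaussian tail function Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def qam_ser(snr: float, M: int) -> float:
    """Approximate symbol-error rate of square M-QAM over AWGN at linear SNR
    (standard expression; stands in for the per-transmission error term)."""
    p_dim = 2 * (1 - 1 / math.sqrt(M)) * qfunc(math.sqrt(3 * snr / (M - 1)))
    return 1 - (1 - p_dim) ** 2

def per_from_ber(ber: float, packet_bits: int) -> float:
    """PER assuming independent bit errors: 1 - (1 - BER)^L."""
    return 1 - (1 - ber) ** packet_bits
```

The monotone dependence of the error rate on SNR is what makes the minimum-SNR thresholds for each modulation level well defined.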

D. TRANSMISSION DELAY MODEL
CRNs may suffer from TD problems; to meet the high bandwidth of real-time transmission, the average TD in wireless CRNs must be reduced. It consists of two kinds of delay [20]: handoff and retransmission [21]. Retransmission is used to improve reliability and meet performance targets with low power consumption. Moreover, the system mainly focuses on packet delays caused by channel errors and the subsequent retransmissions. The delay can be estimated by assuming that one packet must be transmitted up to $N_{\max}$ times at the Medium Access Control (MAC) layer. The packet retransmission delay is calculated from $T_h$ and $T_d$, which represent the processing time of the handshake in MAC transmission and the time required to transmit the data packet, respectively [22]. The average delay is determined by the $j$th channel allocation and the effect of retransmissions as
$\bar{T}_{re} = \sum_{n=1}^{N_{\max}} n\,(T_h + T_d)\,\mathrm{PER}^{\,n-1}(1 - \mathrm{PER})$, (6)
where the number of retransmissions and the PER determine the maximum retransmission delay. The packet TD is minimized by computing the maximum handoff time of one packet, analyzing the processing time for both releasing and sending packets in the $j$th channel allocation. According to the real-time traffic analysis of TD [20], the handoff TD can be written as
$T_{ho} = T_{p1} + T_{p2}$, (7)
where $T_{p1}$ and $T_{p2}$ represent the processing times of the handover process. The average TD of a single packet can be computed by adding the average retransmission delay and the handoff TD:
$\bar{T} = \bar{T}_{re} + T_{ho}$. (8)
High data throughput is estimated based on the average TD required to deliver a packet and the impact of the maximum number of real-time retransmissions.
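Assuming each HARQ attempt fails independently with the same PER (a simplification: the text notes that combined retransmissions are not independent), the expected retransmission delay of one packet can be sketched as:

```python
def avg_retx_delay(per: float, t_handshake: float, t_data: float, n_max: int) -> float:
    """Expected MAC-layer delay with up to n_max attempts, each costing the
    handshake time plus the data-transmission time.  Attempts are assumed
    independent with failure probability `per` (illustrative model)."""
    t_attempt = t_handshake + t_data
    delay = 0.0
    for k in range(1, n_max + 1):
        p_success_at_k = (per ** (k - 1)) * (1 - per)
        delay += p_success_at_k * k * t_attempt
    # packets that exhaust every attempt still spend n_max attempt times
    delay += (per ** n_max) * n_max * t_attempt
    return delay
```

With PER = 0 the delay collapses to a single attempt time, and it grows with PER, which is why the scheduler prefers channels with the lowest estimated PER.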

E. THROUGHPUT
Each packet has the same coding rate, denoted by $R_c$. The throughput of the $n$th packet transmitted using the $2^m$-QAM level is $N_s \times m \times R_c$ bits, where $N_s$ represents the number of symbols per packet and $m$ is the number of bits per symbol of the modulation scheme used. Therefore, the successfully transmitted total throughput over all packets can be written as in (9). Improving throughput depends on choosing a high modulation level to transmit more bits per symbol under an adaptive modulation scheme.
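The per-packet throughput product (symbols × bits per symbol × coding rate) is straightforward to compute; the values below are illustrative:

```python
def packet_throughput_bits(n_symbols: int, bits_per_symbol: int, code_rate: float) -> float:
    """Information bits delivered by one successfully received packet
    sent with 2^m-QAM (m = bits_per_symbol) at coding rate R_c."""
    return n_symbols * bits_per_symbol * code_rate

# e.g. 100 symbols of 16-QAM (4 bits/symbol) at coding rate 1/2
bits = packet_throughput_bits(100, 4, 0.5)
```

Summing this quantity over only the successfully transmitted packets yields the total throughput of (9).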

F. QOS FOR IMPROVE SMART PACKET TRANSMISSION SCHEDULING
Maximizing the Mean Opinion Score (MOS) together with the handoff scheme can maximize the quality of the transmitted data while minimizing TD, improving the Quality of Experience (QoE) when considering the PER, packet length, channel transmission, and SNR. The MOS is a metric used to assess the multimedia UEs' perception of quality [23]. The performance of the TS depends on the maximum expected MOS for spectrum handoff, which is achieved by choosing an available channel with the minimum estimated PER and identifying the transmit packet rate that corresponds to a QoE-driven spectrum handoff. The MOS is expressed in (10) as a weighted combination of the transmit packet rate Ꞃ, the normalized total throughput, the normalized low-power consumption level of each device, and the average TD, with weights $\beta_1, \beta_2, \beta_3, \beta_4, \beta_5$ obtained by a linear regression analysis [24]. From (10), the performance of TS depends on the CIoT networks periodically generating data packets and transmitting at the average packet rate; a binary indicator specifies whether the $n$th packet is transmitted on the $j$th channel (1 if so, 0 otherwise). Moreover, the maximum normalized system throughput can be written as the achieved throughput divided by the ideal throughput. The normalized low-power consumption level can be expressed as the consumed power divided by the maximum consumed-power threshold. The average retransmission delay also affects QoE: as the time delay grows, the QoE of on-demand throughput decays further, which can be expressed as the TD divided by the total TD threshold.
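A hedged sketch of the linear MOS model: the weights below are placeholders (the text obtains its weights by regression analysis [24]), and the exact grouping of the five terms is an assumption.

```python
def mos(betas, packet_rate, norm_throughput, norm_low_power, norm_td):
    """Linear QoE model: a constant offset plus weighted normalized metrics,
    with the normalized delay term entering negatively.
    Weight values are illustrative, not the paper's fitted values."""
    b1, b2, b3, b4, b5 = betas
    return b1 + b2 * packet_rate + b3 * norm_throughput + b4 * norm_low_power - b5 * norm_td
```

Because every term is normalized to [0, 1], the scheduler can trade throughput against delay and power on a common scale when maximizing the MOS.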

III. PROBLEM FORMULATION
The goal is to maximize MOS by ensuring TS performance under evolving conditions. A DNN is used to obtain an optimal policy, which can be achieved by applying DRL. The agent treats its environment as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P})$, where $\mathcal{S}$ stands for the state space, $\mathcal{A}$ contains every potential action set, $\mathcal{R}$ is the immediate reward function $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and $\mathcal{P}$ is a transition probability function $\mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$. In addition, $\pi$ denotes the decision policy that maps a state to an action, $\pi: \mathcal{S} \to \mathcal{A}$. The DRL agent observes a state $s_t \in \mathcal{S}$, where $t$ is an episode, and selects an action $a_t \in \mathcal{A}(s_t)$. According to policy $\pi$, the agent interacts with the environment through actions. The environment then changes into a new state $s_{t+1} = s' \in \mathcal{S}$ with transition probability $\mathcal{P}_{ss'}(a_t)$ and offers the agent a feedback reward $r(s_t, a_t)$, which is the reward of an action in a state and is defined as the predicted MOS of multimedia transmission. The objective of the DRL agent is to maximize the discounted cumulative reward, which can be written as
$G_t = \sum_{k=0}^{\infty} ѱ^k\, r(s_{t+k}, a_{t+k})$, (11)
where $ѱ \in (0, 1)$ represents a discount factor and $G_t$ is the expected return. From (11), we can determine the state-value function under a policy $\pi$, denoted by $Ѵ^\pi(s)$, as in (12). The value of the reward is $\mathcal{R}(s, \pi(s)) = \mathbb{E}\{r(s, \pi(s))\}$. Because evaluating the policy for the reward function $\mathcal{R}(s, \pi(s))$ and transition probability $\mathcal{P}_{ss'}$ in (12) is difficult, we use the Bellman equation to obtain the optimal policy. $\mathcal{R}(s, \pi(s))$ represents the reward of an action in a state and is also described as the predicted MOS of multimedia transmission. In Q-learning, the policy is established over state-action pairs and can be written as
$Q^\pi(s, a) = \mathcal{R}(s, a) + ѱ \sum_{s'} \mathcal{P}_{ss'}(a)\, Ѵ^\pi(s')$. (13)
The expected discounted cumulative reward begins with taking action $a$ under the policy $\pi$.
Consequently, the optimal policy $\pi^*$ of the value function, indicated by $Ѵ^*$ as in (12), can be written mathematically as follows. Let $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ be the optimal action-value function under the optimal policy $\pi^*$. The reward of an action in a state is defined as the predicted MOS of multimedia transmission, so
$Ѵ^*(s) = \max_\pi Ѵ^\pi(s) = \max_{a \in \mathcal{A}(s)} \big(\mathrm{MOS}(s, \pi(s)) + ѱ \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}(a)\, Ѵ^*(s')\big)$.
The optimal policy yields the highest value of the optimal value function over all actions and states, $Ѵ^*(s) = \max_{a \in \mathcal{A}(s)} Q^*(s, a)$, and can be written in terms of the optimal policy as $\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} Q^*(s, a)$, which gives the optimal Bellman equation. The DRL concept of a learning experience is combined with the reward principle to solve this problem. To maximize the total discounted reward, the DRL elements are as follows:
• Agent: The vision of having an intelligent network can be achieved by exploiting the quality of learning, using information from previous successful experiences to create intelligence in the SDN control plane.
• State Space: The state indicates the priority level assigned to each of the channels, the quality of the channel (SNR), the traffic load of the chosen channel, and the performance of the TS in the CIoT system in terms of minimizing TD.
• Action Space: It is necessary to adjust the whole policy and determine its improvement through $a = \{a_1, a_2, a_3, a_4\}$, where $a_1$ represents the power consumption control (active or sleep), $a_2$ denotes spectrum management access, which should avoid unnecessary waiting time or handoff, $a_3$ is the transmission modulation selection, and $a_4$ represents the bandwidth allocation for every packet.
• Reward: It is designed based on traffic scheduling policies that take URLLC service requests into account. The reward function is used to improve training with probability-ratio clipping of the MOS, the transmit packet rate Ꞃ, the normalized throughput, the power level, and the average TD.
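The Bellman update in (13) reduces, in the tabular case, to the familiar Q-learning step; this sketch uses assumed learning-rate and discount values:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, discount=0.9):
    """One tabular Q-learning step toward the Bellman-optimality target
    r + discount * max_a' Q(s', a')."""
    target = r + discount * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((4, 2))          # 4 toy states, 2 toy actions
q_update(Q, 0, 1, 1.0, 3)     # observed reward 1.0, moved to state 3
```

In the CIoT setting the state and action spaces are far too large for a table, which is why the following sections replace the table with fuzzy-RBF and GAN-DDQN function approximators.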
Therefore, we offer a new mechanism for URLLC scheduling, called the actor-critic based fuzzy-RBF algorithm, which can schedule while avoiding large computations in the learning process.

IV. FUZZY-RBF ALGORITHM BASED ACTOR-CRITIC LEARNING FOR URLLC SCHEDULING
The goal of DRL is to address the problem of intelligent TS and reduce the transmission power level based on the current state of the decision policy. To solve the TS problem under a large number of transmission packets, we propose a fuzzy-RBF learning algorithm that converges both the action of the actor and the state-action value of the critic. Fuzzy-RBF can adjust its stochastic learning policies in CIoT systems over a high-dimensional system state. The sum of discounted rewards is increased and the transmission schedule enhanced based on the Bellman-optimality objective [24]
$\rho(\pi) = \int_{\mathcal{S}} d^\pi(s) \int_{\mathcal{A}} \pi(a \mid s)\, r(s, a)\, da\, ds$,
which is the expected reward per time step under the policy $\pi$, where $d^\pi(s)$ is the stationary state distribution. The fuzzy-RBF consists of three types of layers. In this environment, the state space is the input of the actor and critic, and the output of the fuzzy-RBF is the estimate of the actor and critic functions. The connection weight vectors in both the actor and critic learning frameworks are determined through the hidden layer of the fuzzy-RBF. The UE-specified system state at time step $t$ is denoted as $s_t = \{s_{1,t}, \ldots, s_{M,t}\} \in \mathcal{S}$ for the input layer; every neuron in the input layer represents one input state variable. Each node of the hidden layer represents the premise part of a fuzzy rule, and the output of the hidden layer using the Gaussian kernel function is given as
$\phi_i(s_t) = \exp\!\left(-\frac{\lVert s_t - \mu_i \rVert^2}{2\sigma_i^2}\right)$, (15)
where $\mu_i$ represents the center of the Gaussian kernel function at the $i$th hidden-layer node and $\sigma_i^2$ is the variance controlling the sensitivity of the Gaussian to off-center input. The connections from the hidden layer to the output are then learned by squared-error minimization; each node in the hidden layer can achieve this quickly and simultaneously without iterative learning.
The actor and critic of the fuzzy-RBF learning algorithm are composed in the output layer, which represents the actor output for the action function and the critic output for the value function, where Ѽ is the weight vector between the hidden-layer nodes and the output node of the actor network, and ҳ is the weight vector between the hidden-layer nodes and the output node of the critic network. Due to the exploration ("Gaussian interference"), the output action cannot be used directly. To obtain the actual action function, it is necessary to detect and remove inactive hidden-layer nodes from the learning progress by calculating the error between the estimated and real values with the temporal-difference method:
$\delta_t = r_t + ѱ\, Ѵ(s_{t+1}) - Ѵ(s_t)$. (16)
The temporal-difference error update depends on the corresponding weight vector ҳ between the hidden-layer nodes and the output nodes. To handle a delayed reward, the eligibility-trace mechanism in DRL connects the weight vector ҳ with the developing learning process and propagates the temporal-difference error [26]. The fuzzy-RBF learning updates the weight vector ҳ and the eligibility trace ₼ at time step $t$ as
$₼_t = ѱ λ\, ₼_{t-1} + \phi(s_t)$,  ҳ$_{t+1} = $ ҳ$_t + \alpha_c\, \delta_t\, ₼_t$, (17)
where $\alpha_c$ represents the learning rate for the weight vector of the critic network and $λ \in [0, 1]$ represents a decay parameter of the eligibility-trace mechanism. In the actor part, at the end of time step $t$ the policy can be improved by using the updated temporal-difference error as
Ѽ$_{t+1} = $ Ѽ$_t + \alpha_a\, \delta_t\, \phi(s_t)$, (18)
where $\alpha_a$ represents a positive step-size parameter of the actor network. Therefore, improving the policy based on the updated temporal-difference error, without requiring prior knowledge of the system, provides a better approximation of the actor's action and the current action value in the critic part, as shown in (18) and (19).
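The Gaussian hidden-layer features and the critic's temporal-difference update with eligibility traces can be sketched as follows; the kernel centers, widths, and step sizes are illustrative assumptions:

```python
import numpy as np

def rbf_features(state, centers, sigma):
    """Gaussian-kernel hidden-layer outputs, as in the fuzzy-RBF hidden layer."""
    d2 = np.sum((centers - state) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def critic_td_step(w, trace, phi, phi_next, r, alpha=0.05, discount=0.95, lam=0.8):
    """Temporal-difference update with eligibility traces:
    delta = r + discount*V(s') - V(s); the trace decays and accumulates phi,
    and the TD error is propagated along the trace to the weights."""
    delta = r + discount * (w @ phi_next) - (w @ phi)
    trace = discount * lam * trace + phi
    w = w + alpha * delta * trace
    return w, trace, delta

centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # assumed kernel centers
phi = rbf_features(np.zeros(2), centers, sigma=1.0)
w, trace, delta = critic_td_step(np.zeros(2), np.zeros(2), phi, phi, r=1.0)
```

The actor update follows the same shape, stepping its weight vector along the same TD error scaled by its own step size.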

Algorithm I: Training Algorithm of the Proposed fuzzy-RBF Based Actor-critic Learning.
1- Set the learning rate for the weight vectors, the variance $\sigma^2$ controlling the Gaussian sensitivity, and the decay parameter $λ$ of the eligibility-trace mechanism in DRL.
2- Determine the initial state $s_0$, the fuzzy weight vector Ѽ between the hidden-layer nodes and the output node of the actor network, and the weight vector ҳ between the hidden-layer nodes and the output node of the critic network.

V. DEEP DISTRIBUTIONAL RL BASED ON GAN-SCHEDULING FRAMEWORK
The proposed GAN scheduling is used to create a virtual environment for training DRL agents and to operate in highly reliable systems. The agent attempts to obtain the optimal TS based on the distributional perspective on DRL [27], [28], in which the random return ƴ has the value $Q$ as its expectation. The random return obtained by following the current policy, performing action $a$ from state $s$, is the random variable ƴ$(s, a)$, which, due to the unpredictability of the environment, satisfies the distributional Bellman equation
ƴ$(s, a) \stackrel{D}{=} r(s, a) + ѱ\,$ƴ$(s', a')$,
where $s'$ and $a'$ are the random next state-action pair under the policy and $\stackrel{D}{=}$ indicates that both sides follow the same probability law. Consequently, the behavior of policy evaluation with the distributional Bellman operator Ʈ can be analyzed through $\mathrm{dist}(A, B)$, the distance between random variables $A$ and $B$, which can be measured by several metrics, such as the p-Wasserstein distance [29] and the Kullback-Leibler divergence [27]. The p-Wasserstein metric compares cumulative distribution functions. For $\mathcal{F}$ and $\mathcal{G}$, two cumulative distribution functions over the reals, it is defined as $ɗ_p(\mathcal{F}, \mathcal{G}) = \inf \lVert X - Y \rVert_p$, where the infimum is taken over all pairs of random variables $(X, Y)$ with respective cumulative distributions $\mathcal{F}$ and $\mathcal{G}$. By applying the inverse cumulative distribution function to a random variable Ⱳ uniformly distributed on $[0,1]$, this becomes $ɗ_p(\mathcal{F}, \mathcal{G}) = \lVert \mathcal{F}^{-1}(Ⱳ) - \mathcal{G}^{-1}(Ⱳ) \rVert_p$ for $p < \infty$. The C51 algorithm [27] approximates ƴ$(s, a)$ using a discrete distribution and attained state-of-the-art performance on Atari 2600 games. For two random variables $A$, $B$ with cumulative distribution functions $\mathcal{F}_A$, $\mathcal{F}_B$, we write $ɗ_p(A, B) := ɗ_p(\mathcal{F}_A, \mathcal{F}_B)$. The optimal action-value depends on the distributional Bellman optimality operator, which is a contraction in the p-Wasserstein distance, and on decreasing the p-Wasserstein distance (error) in (22).
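For equal-size particle sets, the inverse-CDF form of the 1-Wasserstein distance reduces to the mean absolute difference of sorted samples, which is how the distance between estimated and target action-value particles can be measured in practice:

```python
import numpy as np

def wasserstein_1(x, y):
    """Empirical 1-Wasserstein distance between two equal-size sample sets:
    sorting both sets realizes the inverse-CDF (quantile) coupling."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    assert x.shape == y.shape
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))
```

Two sets with the same values in any order are at distance zero, and shifting one set by a constant shifts the distance by exactly that constant, which matches the metric's intuition as "earth-moving" cost.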
To overcome the action-value estimation issue caused by the effect of random noise in CIoT, we propose GANs to evaluate real data and synthetic data by controlling the generation of real data in real time.

A. GENERATIVE ADVERSARIAL NETWORK (GAN)
Generative adversarial networks offer a virtual environment for training and experimenting with DRL agents. A GAN trains two models: a generative model $G$ and a discriminative model $D$. The Wasserstein GAN guarantees the suitability of the discriminator as a 1-Lipschitz function; the gradient-penalty variant proposed in [30], [31] performs as follows:
$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{Ж}[D(Ж)] - \mathbb{E}_{ℤ}[D(G(ℤ))] - \varphi\, \mathbb{E}_{Ж'}\big[(\lVert \nabla_{Ж'} D(Ж') \rVert_2 - 1)^2\big]$,
$Ж' = ℰЖ + (1 - ℰ)\,G(ℤ)$,  $ℰ \sim \cup(0, 1)$,
where $\mathcal{D}$ represents the set of 1-Lipschitz functions, Ж represents a real data sample, ℤ is a sample from a random distribution, $\varphi$ is the gradient-penalty coefficient, and the probability of packets that have the same coding rate is indicated by Р(γ). To handle the outputs of (18) and (19) with multiple neural layers, the flow of a DNN must be used to approximate the state-value distribution.
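To make the gradient-penalty structure above concrete without a deep-learning framework, the toy sketch below uses a 1-D linear critic, whose gradient with respect to its input is simply its slope; all names and values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def critic(x, w):
    """Toy linear critic f(x) = w * x standing in for the discriminator D."""
    return w * x

def wgan_gp_critic_loss(real, fake, w, gp_coef=10.0):
    """WGAN-GP critic loss (to be minimized): E[D(fake)] - E[D(real)]
    + gp_coef * E[(||grad_x D(x_hat)|| - 1)^2], where x_hat is sampled on
    the line between real and generated points.  For a linear critic the
    input gradient is the constant slope w."""
    eps = rng.uniform(0.0, 1.0, size=np.shape(real))
    x_hat = eps * real + (1.0 - eps) * fake   # random interpolation point
    grad_norm = np.abs(np.full_like(x_hat, w))
    penalty = gp_coef * np.mean((grad_norm - 1.0) ** 2)
    return float(np.mean(critic(fake, w)) - np.mean(critic(real, w)) + penalty)
```

With slope 1 the penalty vanishes, so the loss reduces to the plain Wasserstein critic term, illustrating why the penalty pushes the critic toward the 1-Lipschitz set.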

B. GAN-DDQN BASED ON REWARD CLIPPING TECHNIQUE FOR DISCRIMINATOR NETWORK
Given the problems with a large action space, we use the GAN-DDQN algorithm to estimate the value of every action for certain states. The discriminator network uses a 1-Wasserstein criterion to decrease the error (distance) between the target action-value particles and the estimated action-value particles, as shown in (22) and (23). The current state $s_t = s$ and a sample from the uniform distribution $\cup(0, 1)$ are fed to the generator network $G$ by the agent at iteration $t$ [32]. To produce the predicted action-value particles (samples), the agent computes $Q(s, a) = \frac{1}{N}\sum_{i=1}^{N} G_i(s, a)$, $\forall a \in \mathcal{A}$, and selects the action $\hat{a} = \arg\max_a Q(s, a)$, $\forall a \in \mathcal{A}$.
Consequently, the agent receives a reward ℛ, and the environment moves to the next state $s_{t+1} = s'$. The transition tuple $(s, \hat{a}, s', r)$ is collected into a replay buffer, as shown in Fig. 2. The networks $G$ and $D$ are updated using the transition tuples in the replay buffer every few iterations [33]. From each transition, the target action-value particles are denoted as $y_i = r + ѱ\, G'_i(s', \hat{a})$, where $\hat{a}$ represents the action with the highest expected action-value particle, Р(γ) is declared in (24), and $G_i(s, a)$ is the output of the generator network, parameterized by its weights, when the input is provided. The loss function in (25) is high when the discriminator can discriminate between the generated particles and the real data distributed according to the replay buffer. To prevent large variations in the target action value, we propose a new reward-clipping mechanism, as shown in (27). The clipping strategy can be formulated as
$RC(r) = \mathrm{clip}(r,\; 1 - ϵ,\; 1 + ϵ)$, (27)
where $1 - ϵ$ and $1 + ϵ$ are thresholds that are set manually. This new reward clipping is used to measure the difference between the precision of the discriminator network and the optimal action-value particles generated by the generator network. We assume that the ϵ thresholds that partition the transmission schedule increase the utility and set the constant $1 + ϵ$; the clipped values are taken as the rewards in RL, and their values are much lower than the raw utility. The utility used as the reward in RL is thus clipped to these constants. If the reward-clipping (RC) parameter is large, it can take a long time to adjust any weights, making the process of setting parameters more sophisticated. However, if the RC parameter is small, gradients can easily vanish when the number of ϵ thresholds is small.
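The reward-clipping mechanism in (27) is a one-liner; ϵ = 0.2 below is an assumed threshold, since the text says the thresholds are set manually:

```python
import numpy as np

def clip_reward(r: float, eps: float = 0.2) -> float:
    """Clip the raw utility into [1 - eps, 1 + eps] before using it as the RL
    reward, limiting large swings in the target action value."""
    return float(np.clip(r, 1.0 - eps, 1.0 + eps))
```

Rewards already inside the band pass through unchanged, so clipping only tames the outliers that would otherwise dominate the target action-value particles.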

Algorithm II: Enhance Intelligent TS Based on Proposed A GAN-DDQN Algorithm.
1- Initialize a generator $G$ and a discriminator $D$ with random weights, the discount factor ѱ, the number of predicted action-value particles $N$, and the gradient-penalty coefficient.
2- Initialize the agent iteration $t = 0$ and the target network $G'$ with the generator's weights.
3- Initialize the replay buffer to collect every transition tuple for $G$ and $D$.
4- for each transition $t \ge 0$ do
5- Select the current state $s_t = s$ with the smallest target action-value $G'(s', \hat{a})$ that partitions the transmission schedule, increases the utility, and sets the constant $1 + ϵ$,
6- Get the next state-action by searching the predicted action-value $Q(s, a)$ from the current state $s_t = s$, with the agent sampling from the uniform distribution $\cup(0, 1)$,
7- Produce the predicted action-values: the agent computes $Q(s, a) = \frac{1}{N}\sum_i G_i(s, a)$, $\forall a \in \mathcal{A}$, and selects the action $\hat{a} = \arg\max_a Q(s, a)$, $\forall a \in \mathcal{A}$.

8- The agent receives a reward ℛ, and the environment moves to the next state $s_{t+1} = s'$,
9- The tuple $(s, \hat{a}, s', r)$ is collected into the replay buffer,
10- The agent updates the weights of $G$ and $D$ using transition tuples from the replay buffer for several iterations,
11- for various transitions $t \ge 0$ in GAN training, do
12- To form the target action-values, the agent first chooses
13- transitions from the replay buffer as a mini-batch $(s, \hat{a}, s', r)$.
14- The target action-values are denoted as $y_i = r + ѱ\, G'_i(s', \hat{a})$, with the agent's expected action $\hat{a} = \arg\max_a \frac{1}{N}\sum_i G'_i(s', a)$,
19- Set all the transitions in the replay buffer for training and reset $G' = G$, with the agent replicating network $G$ to $G'$,
20- Apply the new reward-clipping mechanism as shown in (27),
21- Calculate the difference between the precision of the discriminator and the optimal action-values generated by $G$ via $RC(r) = \mathrm{clip}(r, 1 - ϵ, 1 + ϵ)$.

VI. SIMULATION RESULTS
In this section, we evaluate the performance of our proposed GAN-DDQN algorithm in the CIoT system. The proposed scheme is compared to GAN-DDQN [32], a standard actor-critic RL algorithm based on the policy gradient for TS [14], and a deep Q-learning RL algorithm for TS [15]. URLLC was evaluated under GAN scheduling by obtaining a large real dataset in real time, where the URLLC packet is small in size. The main simulation parameters are listed in Table I.

A. TRANSMISSION SCHEDULING AND TRANSMISSION DELAY
In this section, we examine the learning process of the GAN-DDQN scheduling procedure compared with actor-critic fuzzy-RBF and DQN in terms of power consumption, throughput, and transmit packet rate when the normalized packet arrival rate is 0.5. The performance gap between GAN-DDQN scheduling and the other algorithms becomes more pronounced as the number of iterative steps increases, indicating more effective learning. Figure 3 presents the learning process in terms of normalized throughput, transmit packet rate, and power. The three DRL algorithms achieve similar normalized throughput, as shown in Fig. 3(a); during the training process, GAN-DDQN scheduling achieves slightly better throughput than actor-critic fuzzy-RBF and DQN. From Fig. 3(b), GAN-DDQN scheduling achieves a slightly improved transmit packet rate in fewer iterations than actor-critic fuzzy-RBF and DQN. GAN-DDQN scheduling is able to generate better candidates with high fitness, while actor-critic fuzzy-RBF and DQN deliver the same packet rate during the training process. From Fig. 3(c), increasing the arrival rate of all UEs necessitates more retransmissions and increases the number of hand-off processes. GAN-DDQN attains the lowest transmit power by learning the action-value distribution, reducing the difference between the predicted and target action-value distributions. Moreover, GAN-DDQN scheduling handles random noise best among the three algorithms, as shown in Fig. 3(c). Figure 4 presents the average TD performance as a function of transmission power. When the power level is small, the average TD over the channel in CIoT is almost as good as the optimum.
When the power level is small, the actual average transmission power of the scheduled packet rate remains constant, the constraint on average transmission power weakens further, and most packets must be transmitted with a very short delay, as shown in (10). When the average packet delay increases, more packets wait in the queue for transmission, or more packets are retransmitted. However, as Fig. 4 shows, increasing the average transmission power does not help to send more packet transmissions; the average packet TD curves are nearly flat as the transmission power level increases. This means that the average packet TD is determined by the packet arrival process, not by the transmission power level.
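The observation that the average delay is governed by the arrival process rather than the power level can be reproduced with a toy slotted queue. This sketch is an illustrative assumption, not the paper's system model: per-slot success probability stands in for transmission power, and arrivals are Bernoulli.

```python
import random

def avg_delay(arrival_p, success_p, slots=20000, seed=1):
    """Toy slotted FIFO queue: at most one packet arrives per slot with
    probability arrival_p; the head-of-line packet departs with
    probability success_p (a crude stand-in for transmission power).
    Returns the average queueing delay in slots."""
    random.seed(seed)
    queue, delays = [], []
    for t in range(slots):
        if random.random() < arrival_p:
            queue.append(t)           # record arrival slot
        if queue and random.random() < success_p:
            delays.append(t - queue.pop(0))  # departure: delay = wait
    return sum(delays) / max(len(delays), 1)
```

Once `success_p` is comfortably above `arrival_p`, pushing it higher barely moves the average delay, mirroring the nearly flat TD curves in Fig. 4; only when the load approaches saturation does the delay grow sharply.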

B. THROUGHPUT
In this section, we examine the throughput the system can achieve under the channel status during the training process. From Fig. 5, the success of packet arrival depends on minimizing the time a packet spends waiting and the variation in packet delay. A long waiting time for packets reduces throughput, making packets wait longer to transmit, and the normalized throughput decreases as the packet rate increases. With successful training, the normalized throughput of GAN-DDQN scheduling is close to 1, and a high packet arrival rate is achieved without additional training for the various TDs. The packet arrival is randomly selected with a learning rate ranging from 0 to 0.14, and an optimization algorithm is used to obtain successful training for actor-critic fuzzy-RBF and DQN. GAN-DDQN increases intelligence and minimizes the error (distance) between the target action-value particles and the predicted action-value particles to achieve minimal long-term cost, based on signal-processing planning and the use of RC on the target ψG′(s′, a) to distinguish real samples from the generated G(s, a) samples, reducing the loss when transmitting big data. The transition probability is nearly proportional to the packet arrival rate, as shown in Fig. 6. Figure 6 also shows that as the packet arrival rate increases, the transition probability of the three algorithms improves. When the packet arrival rate increases, the transition dropping probability rises markedly, is most pronounced at high traffic load, and continues to increase. On the other hand, a high transition probability occurs when more packets are received, providing the optimal TS. When the system's radio resources are fixed under heavy traffic and more packets arrive, the optimal broadcast planning decision increases linearly with the traffic packet arrival rate.
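The "error (distance) between target action-value particles and predicted action-value particles" discussed above can be made concrete with a simple 1-D Wasserstein-style measure over the two particle sets. This is an illustrative proxy under our own assumptions, not the paper's exact GAN training objective.

```python
import numpy as np

def particle_distance(pred, target):
    """Mean absolute gap between the sorted predicted and target
    action-value particles: the 1-D Wasserstein distance between the
    two empirical distributions when both sets have equal size."""
    p = np.sort(np.asarray(pred, dtype=float))
    t = np.sort(np.asarray(target, dtype=float))
    return float(np.abs(p - t).mean())
```

Training that drives this distance toward zero makes the generator's predicted action-value distribution match the target distribution, which is the sense in which GAN-DDQN "minimizes error between particles".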
Moreover, the average big-data rate improves with the arrival rate when the packet arrival rate exceeds the transmission transition probability. The transition probability depends on the successful transmission probability of data packets at a given SNR, as shown in (1), and also on the average arrival rate, as shown in (11) to (13). Finally, the proposed GAN-DDQN scheduling algorithm achieves better transition probability performance than the actor-critic fuzzy-RBF and DQN algorithms in terms of improving transmission packet scheduling in the CIoT system.
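Since (1) and (11)-(13) are not reproduced in this section, the dependence of the transition probability on per-packet success and offered load can only be sketched generically. Both functions below are stand-in assumptions for illustration, not the paper's actual expressions.

```python
def success_probability(per, packets):
    """Probability that `packets` independent packets all survive a
    channel with per-packet error rate `per` (generic stand-in for the
    SNR-dependent success probability in (1))."""
    return (1.0 - per) ** packets

def transition_probability(arrival_rate, service_rate):
    """Toy proxy for (11)-(13): the fraction of offered load the
    scheduler can serve, growing with the arrival rate until the fixed
    radio resource saturates."""
    return min(arrival_rate / service_rate, 1.0)
```

Under this sketch, picking the channel with the smallest estimated PER directly maximizes `success_probability`, which is the intuition behind scheduling on the minimum estimated PER.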

C. SMALL PACKETS FOR URLLC TO GUARANTEE QOE
In this part, we consider the performance of the proposed algorithms in the scenario where the packet size of the URLLC service is small, as shown in Table I. We consider two cases, in which the bandwidth allocation resolution is either 1 MHz or 200 kHz, to guarantee meeting the MOS of the URLLC service. From Fig. 7, the impact of the average data rate for different URLLC traffic depends on GAN-DDQN scheduling, the actor-critic architecture for fuzzy-RBF, and the DQN algorithm used to distribute the URLLC traffic. The GAN-DDQN scheduling algorithm learns the URLLC traffic, copes with difficult channel variations, and adjusts the URLLC weight dynamically according to (27), leading to more reliable transmissions through the proposed reward-clipping method, which prevents significant variation in the target action value. GAN-DDQN scheduling provides an average bit rate of 38 Mbps when the URLLC arrival load is 5 packets/time slot, decreasing to 8 Mbps when the average URLLC load increases to 100 packets/time slot. The average bit rates obtained by the actor-critic fuzzy-RBF and DQN algorithms vary from 30 Mbps to 19 Mbps as the average URLLC load increases from 5 to 100 packets/time slot. Figure 8 shows the packet loss rate for varying numbers of UEs for the three algorithms; the loss rate increases with more UEs. The GAN-DDQN scheduling algorithm yields a smaller loss rate than actor-critic fuzzy-RBF and DQN when the number of UEs is less than 100. Furthermore, controlling the performance loss and enhancing TS depend on enabling discrimination of the real data distributed according to the replay buffer, as shown in (25), under stringent URLLC requirements. From Fig. 8, GAN-DDQN scheduling achieves a lower loss rate than the actor-critic fuzzy-RBF and DQN algorithms by applying RC, increasing the intelligence and decreasing the error between the target action-value and the estimated action-value, as shown in (22) and (23). In contrast, actor-critic fuzzy-RBF shows a lower loss rate than GAN-DDQN as the number of UEs increases, which depends on minimizing the hidden-layer error, as shown in (16), and on calculating the error between the estimated and real values by updating the temporal-difference error, as shown in (19).
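The reward clipping referenced in (22), (23), and (27) can be sketched as bounding the reward ratio before it enters the target, so a single noisy reward cannot swing the target action-value. The clipping interval and discount value below are illustrative assumptions, not the paper's tuned parameters.

```python
def clip_reward(ratio, eps=0.1):
    """Bound a reward ratio to [1 - eps, 1 + eps], in the spirit of the
    reward-clipping mechanism in (27)."""
    return max(1.0 - eps, min(ratio, 1.0 + eps))

def td_error(reward, q_next, q_est, gamma=0.9, eps=0.1):
    """Error between the clipped target and the estimated action-value,
    loosely following the roles of (22)-(23)."""
    target = clip_reward(reward, eps) + gamma * q_next
    return target - q_est
```

Because the clipped target lives in a narrow band, successive updates to the action-value estimate stay small, which is the stability benefit the RC mechanism is credited with here.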

VII. CONCLUSION
Designing a learning agent with intelligent decision-making ability for the CIoT system is very challenging. Smart scheduling in DRL with RC guarantees good packet transmission with high reliability. Our proposal investigated the combination of GAN and DDQN to solve intelligent TS in CIoT systems. In addition, the proposed RC adopted in GAN-DDQN scheduling improves training stability through probability-ratio clipping of the reward, as well as the power consumption, transmission packet rate, and throughput. The simulation results show that improving training stability and increasing the intelligence of the GAN-DDQN scheduling algorithm, based on the discriminator network's action-value under RC, decreases the error between the target action-value particles and the estimated action-value particles. The simulation results also show that the GAN-DDQN scheduling algorithm outperforms the other DRL algorithms. Our future work will investigate a distributed implementation of the proposed GAN-DDQN that removes the temporary training time in DRL in the case of unforeseen maximum events that cause failure in URLLC systems.