Energy Efficient Transmission Power Control Policy of the Delay Tolerable Communication Service

In recent years, the development of wireless communication has led to explosive growth of energy demand, the wide application of smart devices, and the rapid emergence of new services. Energy efficient communication is therefore urgently needed to save power and prolong the lifetime of resource-constrained terminal devices. Especially in the 5G age, the excellent spectrum efficiency provides more opportunities to save power by adjusting the transmission power of delay tolerable (DT) services. Meanwhile, although the tradeoff between energy efficiency and service delay plays a non-negligible role in energy efficient communication, it has not been exploited sufficiently due to the time variation and randomness of the wireless channel. For this reason, the fundamental tradeoff between energy efficiency and delay of the DT service is investigated and analyzed. The optimization problem of energy efficient communication for the DT service is formulated as a Markov Decision Process (MDP), which can be solved effectively by statistical dynamic programming (SDP) since perfect channel state information (CSI) is hard to obtain. To improve the practical utility of this research, approximate SDP (ASDP) and Q-learning are also investigated to overcome the curse of dimensionality and the limitations of model-based algorithms, respectively.

way to prolong the lifetime of power-constrained terminal devices.
In fact, technologies for energy-efficient wireless networks have been studied for a long time. In [4], Chen et al. discussed the tradeoff among energy efficiency, spectrum efficiency, convergence efficiency, and service delay in cellular networks. The conclusions showed that increasing energy efficiency, unlike increasing spectrum efficiency, which is almost always beneficial for the Quality of Service (QoS), may lead to degradation of QoS. In [5], an energy consumption model of the macro cell Base Station (BS) was formulated based on an analysis of the fundamental circuitry. Moreover, R. Mahapatra et al. simplified the energy consumption model of the macro cell BS by dividing the consumed energy into static power and transmission power to reduce the complexity of analysis [6]. In [7], a series of energy efficiency indices was proposed in terms of network convergence, throughput, and spectrum efficiency. With the maturity of the research framework of energy efficient communication, specific technologies were investigated to realize energy efficient communication.
In this paper, we investigate the energy efficiency potential of the Delay Tolerable (DT) service with partial Channel State Information (CSI). We also propose three transmission power control policies under various assumptions. Our main contributions are summarized as follows:
• The energy efficiency problem of the DT service is formulated. Based on the formulated optimization problem and convex optimization theory, the performance bound of the transmission power is derived.
• Since perfect CSI of the future cannot be obtained, the energy efficiency problem of the DT service is reformulated as a Markov Decision Process (MDP) with statistical CSI. Furthermore, under the discrete channel assumption, statistical dynamic programming (SDP) is adopted to design the energy efficient transmission policy for the proposed MDP.
• Due to the curse of dimensionality, the SDP is not available for the continuous channel. Unfortunately, the channel of wireless communication follows continuous distributions in most general cases. For this reason, we adopt the approximate SDP (ASDP) to design the energy efficient transmission policy for the DT service under the continuous channel assumption.
• The ASDP and SDP policies are developed based on statistical CSI, which may not be available in practice. In order to improve the practical utility of this research, we develop a model-free transmission policy based on the Q-learning algorithm to improve the energy efficiency of the DT service.
The remainder of this paper is organized as follows. In Section II, the related works on energy efficient communication are briefly reviewed. In Section III, we formulate the system model of the DT service, and the transmission power optimization problem is proposed to derive the performance bound under the perfect CSI assumption. In Section IV, we reformulate the proposed optimization problem as an MDP since perfect CSI is unavailable in practical scenarios; furthermore, the SDP is adopted to solve the proposed MDP under the discrete channel assumption. In order to further increase the practical utility of our research, the ASDP is adopted in Section V to deal with the curse of dimensionality caused by the continuous channel, and Q-learning is investigated to overcome the limitations of the model-based policies. Section VI presents the simulation results, and Section VII concludes the paper.

II. RELATED WORKS
Energy efficient power control and allocation is a challenging problem in modern wireless networks due to the complex network structures and various service constraints. Many researchers have contributed substantially to this field. In [8], a sleep mechanism was proposed to control the static power by deactivating components when they are not working. Following this idea, [9]–[11] optimized the mode switching policy of the sleep mechanism based on the loading condition and traffic intensity under a blocking probability constraint. Furthermore, with the explosive rise in the number of subscribers and traffic, the concept of cell breathing emerged as a generalized BS sleeping mode in [12]. Cell breathing is more flexible and effective since it can adjust the cell size in accordance with loading and QoS variations. The conclusions of [13], [14] showed that cell breathing provides an excellent energy efficiency performance gain in complex cellular networks.
Besides the static power, the optimization of transmission power has also been widely investigated. In [15], a double-threshold water filling (WF) algorithm was proposed to improve the spectrum efficiency and energy efficiency of spectrum sharing systems with statistical CSI. In [16], a multi-objective optimization problem was formulated to optimize the energy efficiency and spectrum efficiency of cognitive radio networks. By the ε-constraint method, the minimum power allocation strategy of the secondary transmitter was obtained while the average interference power at the primary user receiver was limited. In [17], a joint optimization problem for the OFDMA system was formulated, and the researchers also proposed an iterative algorithm to solve it. In [18], a link adaptive power control and allocation (LaPCA) scheme was designed based on the requirements of LTE services. Instead of optimizing the energy efficiency directly, LaPCA kept the portion of cell transmission power proportional to the volume of data flows nominated for transmission. In order to overcome the shortcomings of centralized approaches, game theory has attracted significant attention for developing distributed power control and allocation approaches. In [19], [20], the problem of QoS-driven power control in the uplink of CDMA wireless networks was formulated as a non-convex noncooperative Multi-Service Uplink Power Control (MSUPC) game. Based on diverse user requirements, various utility functions were designed with respect to throughput performance, QoS requirement fulfillment, and power consumption, and the corresponding distributed iterative algorithms were also proposed and evaluated. In [21], a simulation-based optimization framework for D2D networks was proposed to achieve a tradeoff between energy consumption and network performance.
Although the existence of a Nash equilibrium point was not proved theoretically, the simulation results showed that the proposed optimization framework achieved excellent performance in most cases.
In all of the related research, the tradeoff between energy efficiency and service delay plays an important role. In [22], [23], the relationship between energy efficiency and service delay was analyzed and simulated; the conclusion was that energy efficiency and delay are contradictory to each other. In [24], a time-domain power allocation scheme was studied to improve the energy efficiency of the D2D system in the delay-insensitive scenario. In [25], the mechanism of pro-actively pushing content to users was adopted to reduce the energy consumption through the extended transmission time duration; furthermore, the user request probability thresholds were derived under two different QoS constraints. In [26], the grid power-delay tradeoff of an energy harvesting wireless system with finite energy storage capacity was studied, and bounds on the average delay and grid power consumption were derived. In order to improve the practical utility of the related research, an energy efficiency and delay tradeoff optimization scheme was proposed under limited energy and data storage space assumptions [27]. A modified distributed Q-learning algorithm was also developed to cooperate with the proposed optimization scheme. The simulation results showed that the proposed algorithm and scheme can obtain near-optimal performance comparable to the classical centralized Q-learning mechanism.
To the best of our knowledge, however, most existing research on resource allocation over the time domain was developed under simple wireless channel models or perfect CSI assumptions to reduce the research complexity. These assumptions do not correspond to practical wireless communication scenarios. Under general conditions, the obtained CSI is imperfect and partial since the wireless channel is random and time-varying. Fortunately, the development of reinforcement learning provides an effective tool to improve the performance of wireless communication with partial CSI [28]–[30], and a few researchers have employed it to optimize the energy efficiency of various wireless networks [31], [32]. Motivated by their work, reinforcement learning is adopted in this paper to improve the energy efficiency of the DT service.

III. SYSTEM MODEL AND PROBLEM FORMULATION
In this section, the model of the DT service is formulated, and the transmission power optimization problem of the DT service is proposed.

A. SYSTEM MODEL
As shown in Fig. 1, a time-slotted communication link with 2 nodes [32] is studied in this paper. A time slot is a fixed time interval and could consist of one or several packets. Meanwhile, multiplicative fading and additive white Gaussian noise (AWGN) are adopted. Without loss of generality, we assume that the fading gains of the channel between the users are affected by fast fading, shadow fading, and long-time-scale variations, and thus can be modeled as a stationary time-varying stochastic process. Furthermore, we also assume the channel gain within the duration of each time slot is constant. So, the received signal of the destination node in the i-th slot can be presented as

y_i = √(P_t,i) h_i x_i + n_i, (1)

where P_t,i (Watt) is the transmission power of the i-th slot, which is fixed within the duration of a time slot; h_i and n_i are the channel fading coefficient and the Gaussian noise variable, respectively; and y_i and x_i are the received signal and the transmitted signal. The proposed link is supposed to provide a series of DT services. Unlike a delay sensitive service, the DT service requires a certain amount of information within a period rather than immediately. For example, video-on-demand (VOD) is a typical DT service when the spectrum efficiency is good enough: the video will keep playing smoothly as long as new data can fill the buffer before it is empty. So, the maximal tolerable delay of VOD depends on the video playing rate and the buffer pool size of the terminal node. Consequently, the energy efficiency of the DT service can be improved by optimizing the transmission power in the time domain. Fig. 2 illustrates the optimal power allocation pattern for a DT service which requires transferring I bits in 8 slots. The solid bars in Fig. 2 represent the reciprocal of the channel condition of the wireless channel, which is defined as

γ_i = |h_i|² / σ_n², (2)

where σ_n² is the noise power. Meanwhile, the line bars in Fig. 2 represent the transmission power of the i-th slot. This idea is similar to the well-known water filling algorithm, except that we allocate the transmission power resource in the time domain. In order to reduce the analysis complexity, we assume all of the nodes in the proposed model are equipped with a single antenna.
When energy efficient communication is studied, a popular and precise index is the energy efficiency, which is defined as

η_EE = C / (W(P_f + P_t)), (3)

where C (bits/s) and W (Hz) are the channel rate and bandwidth, respectively, and P_f and P_t (Watt) are the static power and transmission power, respectively. The main job of energy efficient communication is to maximize the energy efficiency under a set of constraint conditions. In the general case, a higher energy efficiency means the source node can transmit more information with limited power. However, when we consider a DT service with a certain amount of information and a maximal tolerable delay, such as VOD or file transfer, the energy efficiency in (3) can be modified as

η_EE = I / (WTτP_f + Wτ Σ_{i=1}^{T} P_t,i), (4)

where I (bits) and T are the required total information amount and the maximal number of tolerable delay slots, respectively; τ (s) is the duration of a slot; and P_t,i (Watt) is the transmission power in the i-th slot as defined before. Different from the traditional energy efficiency, the information amount I is a constant. And, since the receiving node requires all of this information within the period, the total static power WTτP_f is also a constant. Then, the optimization of the energy efficiency reduces to the optimization of the transmission power Wτ Σ_{i=1}^{T} P_t,i. In order to reduce the analysis complexity, τ and W are normalized in this paper. So, in the following parts, we will focus on the minimization of Σ_{i=1}^{T} P_t,i.
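Because I and the static term WTτP_f are constants for a given DT service, maximizing (4) is equivalent to minimizing the summed transmission power. A minimal numeric sketch of this observation follows; the function name and sample values are illustrative, not from the paper:

```python
import numpy as np

def dt_energy_efficiency(I_bits, P_t, P_f=1.0, W=1.0, tau=1.0):
    """Eq. (4): eta = I / (W*T*tau*P_f + W*tau*sum_i P_t,i), with T = len(P_t)."""
    P_t = np.asarray(P_t, dtype=float)
    T = P_t.size
    return I_bits / (W * T * tau * P_f + W * tau * P_t.sum())

# The same 2 bits delivered with less total transmission power yields higher eta,
# since every other term in the denominator is fixed.
eta_hi = dt_energy_efficiency(2.0, [0.2, 0.3])   # sum P_t,i = 0.5
eta_lo = dt_energy_efficiency(2.0, [1.0, 1.5])   # sum P_t,i = 2.5
```

With P_f = W = τ = 1 this gives η = 2/2.5 = 0.8 for the low-power schedule versus 2/4.5 ≈ 0.44 for the high-power one, so ranking schedules by η and by summed power is the same.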

B. PROBLEM FORMULATION
Without loss of generality, the required information I of the DT service can be represented as

I = Σ_{i=1}^{T} log₂(1 + γ_i P_t,i), (5)

where γ_i is the CSI defined in (2). And, the transmission power constraint is

0 ≤ P_t,i ≤ P_t,i,max, (6)

where P_t,i,max is the maximal available transmission power of the source node in the i-th slot. So, the transmission power optimization problem of the DT service can be formulated as

min Σ_{i=1}^{T} P_t,i, s.t. Σ_{i=1}^{T} log₂(1 + γ_i P_t,i) ≥ I, 0 ≤ P_t,i ≤ P_t,i,max. (7)

Without loss of generality, the sign '≥' is adopted to replace the sign '=' in the information constraint without affecting the optimal result. Apparently, (7) is a classical convex optimization problem which can be solved effectively when the source node has perfect CSI (γ_i). However, it is almost impossible for the source node to obtain perfect CSI of the future T slots in practice. So, (7) actually provides a transmission power lower bound, or an energy efficiency upper bound, for the practical DT service. On the other hand, with the development of various channel estimation techniques, the source node can obtain fairly good CSI of the current slot and statistical CSI of the future slots. This means the source node should adjust the transmission power based on the observed CSI. Following this idea, we reformulate the transmission power optimization problem of the DT service as a Markov Decision Process (MDP).
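Under the perfect-CSI assumption, the KKT conditions of (7) yield a time-domain water-filling form P_t,i = clip(ν − 1/γ_i, 0, P_t,i,max), where the water level ν is set so that the information constraint is met. The following sketch solves this bound by bisection on ν; the function name and the bisection approach are our illustration, not taken from the paper:

```python
import numpy as np

def min_power_schedule(gamma, I_bits, p_max, tol=1e-9):
    """Solve the convex lower bound (7): minimise sum_i P_i subject to
    sum_i log2(1 + gamma_i * P_i) >= I_bits and 0 <= P_i <= p_max.
    KKT gives P_i = clip(nu - 1/gamma_i, 0, p_max); bisect on nu."""
    gamma = np.asarray(gamma, dtype=float)

    def info(nu):
        # Delivered bits and per-slot powers at water level nu.
        p = np.clip(nu - 1.0 / gamma, 0.0, p_max)
        return np.sum(np.log2(1.0 + gamma * p)), p

    lo, hi = 0.0, 1.0 / gamma.min() + p_max  # at hi every slot uses p_max
    if info(hi)[0] < I_bits:
        raise ValueError("I_bits infeasible even at p_max in every slot")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if info(mid)[0] >= I_bits:
            hi = mid
        else:
            lo = mid
    return info(hi)[1]

p = min_power_schedule([1.0, 4.0], I_bits=2.0, p_max=20.0)
```

For the two-slot example, the optimum puts all power on the better slot at water level ν = 1, giving the schedule [0, 0.75] with total power 0.75 W.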

IV. ENERGY EFFICIENT POWER CONTROL POLICY FOR THE DISCRETE CHANNEL
In this section, we propose an energy efficient power control policy for the DT service with a discrete channel model based on the SDP. As mentioned before, the DT service requires a certain amount of information I within a period of T slots rather than immediately. So, the state set of the policy is designed as in the following subsection.

A. POLICY STATE SETTING
In general, an MDP can be defined by a 4-tuple ⟨S, A, R, Pr_t⟩. S = {s_1, s_2, . . .} is the state set. A = {a_1, a_2, . . .} is the action set. Meanwhile, R and Pr_t are the action reward and the state transition probability, respectively.
For the transmission power optimization problem of the DT service, the state set S is defined by the CSI and the remaining information that the source node wants to transmit. And, the action set A is defined by the information that the source node will transmit in the current slot. Unfortunately, the required information (I) and the CSI (γ) usually are continuous variables, which would lead to an unsolvable Bellman equation due to the curse of dimensionality. For this reason, we discretize the required information I as (0, I/N₁, . . . , (N₁ − 1)I/N₁, I) by a predefined integer N₁ and assume the CSI is simple enough to be formulated as a discrete model (γ_1, . . . , γ_{N₂}) in this section. So, the state set S is denoted as

S = {s = (k_I I/N₁, γ_{k_c}, i)}, (8)

where k_I ∈ {0, . . . , N₁}, k_c ∈ {1, . . . , N₂}, and i ∈ {1, . . . , T}. The action set A is

A = {a = k_a I/N₁}, (9)

where k_a ∈ {0, . . . , N₁}. From the definition of S and A, the transition probability Pr_t is

Pr_t(s′|s, a) = Pr_c(γ_{k′_c}), (10)

where Pr_c(γ_{k_c}) is the probability that the CSI equals γ_{k_c}, which follows

Σ_{k_c=1}^{N₂} Pr_c(γ_{k_c}) = 1. (11)

Furthermore, the reward R = (r_1, r_2, . . . , r_T) is a function of the state and action, and it reflects the transmission power P_t,i. In this way, we formulate the transmission power optimization problem of the DT service as an MDP with finite states. Furthermore, a T-stage query table can be established.

B. OPTIMAL ALGORITHM OF THE VALUE FUNCTION
Generally, the MDP problem can be solved by the Bellman equation

Q_t(s_i, a_j) = r(s_i, a_j) + Σ_{s′} Pr_t(s′|s_i, a_j) V_{t+1}(s′), (12)

where V_{t+1}(s_i) and Q_{t+1}(s_i, a_j) are the state value function and the state-action value function of the (t+1)-th stage, respectively. If the MDP has a terminal state, as in the transmission power optimization problem of the DT service, the Bellman equation of the terminal state can be further simplified as

Q_T(s_i, a_j) = r(s_i, a_j). (13)

In practice, the state value function V_t(s_i) can be calculated from the state-action value function Q_t(s_i, a_j) as

V_t(s_i) = min_{a_j} Q_t(s_i, a_j). (14)

Actually, the state-action value function Q_t(s_i, a_j) also indicates the best action. When the source node has the exact state-action value function, the best action of the t-th stage can be identified as

a*_t = arg min_{a_j} Q_t(s_i, a_j). (15)

So, it is clear that accurate state-action value functions are very important for the Bellman equation. Due to the required information constraint of the DT service, the action of the last stage is

a_T = k_I^T I/N₁, (16)

i.e., all of the remaining information must be transmitted. So, the last-stage Bellman equation of the DT service (13) should be represented as

V_T(s_i) = r(s_i, k_I^T I/N₁) = (2^{k_I^T I/N₁} − 1)/γ_{k_c}^T, (17)

where s_i = (k_I^T I/N₁, γ_{k_c}^T). Based on (10), (12) and (17), all of the state-action value functions Q_t(s_i, a_j) can be calculated by the backtrack algorithm. Furthermore, the best action of each state can be identified by (15). The policy based on the SDP is summarized in Policy 1: i) look up the best action of the current state in the query table; ii) identify the next stage based on the action.
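The backtrack recursion of (12)–(17) can be sketched as below. States are (remaining backlog, channel level), the action is the number of backlog steps transmitted now, and the slot reward (2^b − 1)/γ follows from the rate formula in (5) with τ = W = 1; all names and the table layout are our illustration:

```python
import numpy as np

def sdp_tables(I_bits, N1, gammas, probs, T, p_max):
    """Backward SDP of Section IV: build the T-stage value and policy tables.
    V[t, k_I, k_c] is the minimal expected remaining power; policy holds
    the argmin action index (backlog steps to transmit in the slot)."""
    nI, nC = N1 + 1, len(gammas)
    cost = np.full((nI, nC), np.inf)          # power to send k_a*I/N1 bits now
    for ka in range(nI):
        b = ka * I_bits / N1
        for kc, g in enumerate(gammas):
            p = (2.0 ** b - 1.0) / g
            if p <= p_max:                    # constraint (6)
                cost[ka, kc] = p
    V = np.zeros((T + 1, nI, nC))
    V[T] = cost                               # eq. (17): last slot clears the backlog
    policy = np.zeros((T + 1, nI, nC), dtype=int)
    policy[T] = np.arange(nI)[:, None]        # eq. (16): forced terminal action
    for t in range(T - 1, 0, -1):
        EV = V[t + 1] @ probs                 # expectation over the next channel, eq. (10)
        for kI in range(nI):
            q = cost[:kI + 1, :] + EV[kI - np.arange(kI + 1)][:, None]
            policy[t, kI] = np.argmin(q, axis=0)   # eq. (15)
            V[t, kI] = np.min(q, axis=0)           # eq. (14)
    return V, policy

V, pol = sdp_tables(I_bits=2.0, N1=4, gammas=[0.4, 0.9],
                    probs=np.array([0.5, 0.5]), T=3, p_max=20.0)
```

The query table has the O(N₁N₂T) size discussed in Section V-C, and giving the service more slots can only lower the value, i.e. V[1] ≤ V[2] ≤ V[3] elementwise.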

V. ENERGY EFFICIENT POWER CONTROL POLICY FOR THE CONTINUOUS CHANNEL
When the CSI is simple enough to be formulated as a discrete model, the previous algorithm is an effective method to improve the energy efficiency of the DT service. However, since the channel of wireless communication is varied and random, the discrete model is not powerful enough to depict it in most general cases. For this reason, the Additive White Gaussian Noise (AWGN) and Multiplicative Rayleigh Fading (MRF), which are the most common channel models of wireless communication, are considered in this section and the corresponding simulation section. Meanwhile, we have the following property.
Property 1: Let the channel gain h follow the Rayleigh distribution with probability density function f(h) = (2h/σ²) exp(−h²/σ²), h ≥ 0. Then the CSI γ follows the negative exponential distribution.
Property 1 shows that the CSI γ follows the negative exponential distribution when the AWGN and MRF are adopted. So, the proposed SDP policy is no longer effective due to the curse of dimensionality.

A. MODEL-BASED POLICY BY APPROXIMATE STATISTICAL DYNAMIC PROGRAMMING
Approximate Statistical Dynamic Programming (ASDP) is an extension of the SDP. When the state value function and the state-action value function cannot be calculated exactly, the ASDP is adopted to approximate them by reducing the accuracy demand. Following this idea, we define both the state set S and the action set A as (0, I/N₁, . . . , (N₁ − 1)I/N₁, I). Then, the T-stage query table has T × (N₁ + 1) elements. And, the state transition probability Pr_t is

Pr_t(s′|s, a) = 1 if s′ = s − a, and 0 otherwise. (19)

Moreover, since the state set no longer contains the CSI, the Bellman equation of the terminal state (13) should be rewritten as

V_T(s) = E_γ[r_T(s, γ)], (20)

where E_γ[·] is the expectation with respect to the CSI γ. Similarly, the Bellman equation of the other stages (12) becomes

V_t(s) = E_γ[min_a (r(s, a, γ) + Σ_{s′} Pr_t(s′|s, a) V_{t+1}(s′))]. (21)

Different from the SDP approach, which identifies the best action of each state directly, the ASDP identifies the action based on the CSI of the current slot. So, (15) should be modified as

a_t = arg min_a (r(s, a, γ_t) + V_{t+1}(s − a)). (22)

The policy based on the ASDP is summarized in Policy 2.
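The expectation in (20)–(21) generally has no closed form for a continuous channel, so the sketch below approximates it by Monte Carlo over exponentially distributed CSI samples; as a simplification of the power constraint, the slot power is capped at P_max. Function names, the sample count, and the capping are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def asdp_value(I_bits, N1, T, p_max, mean_gamma=1.0, n_mc=4000):
    """ASDP value table V[t, backlog] of (20)-(21): the state keeps only the
    information backlog; E_gamma[.] is taken by Monte Carlo."""
    bits = np.arange(N1 + 1) * I_bits / N1          # possible per-slot loads
    g = rng.exponential(mean_gamma, n_mc)           # CSI samples, gamma ~ Exp
    p = (2.0 ** bits[None, :] - 1.0) / g[:, None]   # slot power per (sample, action)
    p = np.minimum(p, p_max)                        # simplification: cap at P_max
    V = np.zeros((T + 1, N1 + 1))
    V[T] = p.mean(axis=0)                           # eq. (20): last slot clears backlog
    for t in range(T - 1, 0, -1):
        for k in range(N1 + 1):                     # backlog k*I/N1 remaining
            q = p[:, :k + 1] + V[t + 1][k - np.arange(k + 1)][None, :]
            V[t, k] = q.min(axis=1).mean()          # eq. (21): E_gamma[min_a ...]
    return V, bits

def asdp_action(k, t, gamma_obs, V, bits):
    """Run-time rule (22): minimise observed slot power plus cost-to-go."""
    cost = (2.0 ** bits[:k + 1] - 1.0) / gamma_obs
    return int(np.argmin(cost + V[t + 1][k - np.arange(k + 1)]))

V, bits = asdp_value(I_bits=2.0, N1=4, T=4, p_max=20.0)
a = asdp_action(k=4, t=1, gamma_obs=5.0, V=V, bits=bits)
```

Note that the table now has only (N₁ + 1)T entries, matching the O(N₁T) space complexity claimed in Section V-C, and extra slots never increase the value (V[1] ≤ V[2] elementwise).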

B. MODEL-FREE POLICY BY Q-LEARNING
When the statistical model of the continuous wireless channel is available, the ASDP is an effective method to overcome the curse of dimensionality and improve the energy efficiency of the DT service. However, it is often a challenging job to estimate the exact statistical model of the wireless channel. For this reason, it is more valuable to develop a model-free policy for the energy efficient DT service. So, the Q-learning algorithm is investigated in this section.
Based on the idea of the Q-learning algorithm, the value functions V(s) and Q(s, a) can be estimated from the transmission experience. We assume there is a set of transmission power samples obtained by the transmission power control policy π in state s_i. Then, the state-action value can be estimated as

Q^m_π(s_i, a_j) = (1/m) Σ_{n=1}^{m} G_n, (25)

where G_n is the n-th sampled return and m = 1, 2, . . .. (25) can be rewritten in the recursive form

Q^m_π(s_i, a_j) = Q^{m−1}_π(s_i, a_j) + α_m (G_m − Q^{m−1}_π(s_i, a_j)), (26)

where

α_m = 1/m. (27)

It is easy to see that Q^m_π(s_i, a_j) will converge to the exact Q_π(s_i, a_j) with sufficient samples. For this reason, a random policy is the best choice for the Q-learning algorithm to explore all of the possible states and actions. On the other hand, a deterministic policy is preferred to pursue the best energy efficiency of the DT service. In order to deal with this dilemma, the ε-greedy policy is adopted as a tradeoff. The ε-greedy policy can be represented as

π(a|s_i) = 1 − ε + ε/|A(s_i)| for the greedy action, and ε/|A(s_i)| for each non-greedy action, (28)

where |A(s_i)| represents the number of elements of the action set when the source node is in state s_i. Furthermore, the ε-greedy policy has the following property.
Property 2: The ε-greedy policy with 0 < ε < 1 is an improving and asymptotically optimal policy.
Proof: See Appendix B.
(28) shows that the ε-greedy policy pursues the energy efficiency with probability 1 − ε + ε/|A(s_i)| and explores the value function with probability ε/|A(s_i)|. Furthermore, Property 2 promises that the ε-greedy policy will provide the best energy efficient policy for the DT service when Q^m_π(s_i, a_j) approaches the exact Q_π(s_i, a_j). To improve the convergence performance, a decay factor β is adopted as

ε ← βε. (29)

The policy based on Q-learning is summarized in Policy 3: i) update the estimated state-action value based on (26); ii) update the ε-greedy policy as in (29).
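Policy 3 can be sketched as tabular Q-learning over (stage, backlog) with an ε-greedy behaviour policy and the decay factor β. The reward model reuses the slot power (2^b − 1)/γ from the earlier sections, capped at P_max as a simplification; all names, the episode count, and the learning rate are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def q_learning(I_bits, N1, T, p_max, episodes=3000,
               alpha=0.1, eps0=1.0, beta=0.999, mean_gamma=1.0):
    """Model-free policy of Section V-B. Q[t, backlog, action] estimates the
    remaining transmission power (cost-to-go); no channel model is assumed,
    the CSI is only sampled as the environment would reveal it."""
    bits = np.arange(N1 + 1) * I_bits / N1
    Q = np.zeros((T + 1, N1 + 1, N1 + 1))
    eps = eps0
    for _ in range(episodes):
        k = N1                                    # full backlog at episode start
        for t in range(1, T + 1):
            g = rng.exponential(mean_gamma)       # observed CSI of the slot
            if t == T:
                a = k                             # last slot must clear the backlog
            elif rng.random() < eps:
                a = int(rng.integers(0, k + 1))   # explore
            else:
                a = int(np.argmin(Q[t, k, :k + 1]))   # exploit (minimum cost)
            cost = min((2.0 ** bits[a] - 1.0) / g, p_max)   # capped slot power
            if t == T:
                nxt = 0.0
            elif t + 1 == T:
                nxt = Q[T, k - a, k - a]          # single feasible terminal action
            else:
                nxt = Q[t + 1, k - a, :k - a + 1].min()
            Q[t, k, a] += alpha * (cost + nxt - Q[t, k, a])  # update as in (26)
            k -= a
        eps *= beta                               # eq. (29): epsilon decay
    return Q

Q = q_learning(I_bits=2.0, N1=4, T=4, p_max=20.0)
```

Because exploration shrinks with β while every visited pair keeps being averaged, the estimates settle toward a deterministic low-power schedule, mirroring the convergence behaviour reported in Fig. 7.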

C. IMPLEMENTATION COST AND COMPLEXITY ANALYSIS
As discussed above, the energy efficient power control policy can be developed based on the SDP, ASDP, or Q-learning algorithm. In general, the statistical CSI required by the SDP and ASDP is obtained by channel estimation technology in a realistic wireless communication environment. So, the implementation costs of the SDP and ASDP mainly depend on the cost of the adopted channel estimation technology. Meanwhile, for Q-learning, the transmission experience is required to estimate the value function. So, the implementation cost of Q-learning mainly lies in the time cost of sampling the transmission experience. In the following, we summarize the time and space complexity of the various energy efficient power control policies.
The state space of the SDP is based on the predefined integer N₁, the number of channel states N₂, and the maximal tolerable delay T. So, the state-space complexity is O(N₁N₂T). The action space of the SDP is the same size as the state space. So, the overall space complexity of the energy efficient power control policy based on the SDP is O(2N₁N₂T).
The time complexity corresponds to the space complexity. In order to build the query table, the source node has to evaluate the state value function and the state-action value function over the various states. So, the corresponding time complexity is O(N₁²N₂²T). Meanwhile, during the transmission stage, the policy is a typical lookup table algorithm, so the corresponding time complexity is O(N₁N₂T).
Different from the SDP, the space complexity of the ASDP only depends on the predefined integer N₁ and the maximal tolerable delay T. So, the state-space complexity is O(N₁T). Meanwhile, there is no stored action table for the ASDP. So, the overall space complexity of the energy efficient power control policy based on the ASDP equals the state-space complexity, O(N₁T). Similarly, the time complexity of building the query table is O(N₁²T). Meanwhile, during the transmission stage, the source node has to identify the action based on the CSI of the current slot, so the corresponding time complexity is O(N₁T).
The space complexity of Q-learning is the same as that of the ASDP. The time complexity depends on the convergence rate. If we denote by K the number of convergence iterations, the time complexity of Q-learning is O(KN₁T).

VI. SIMULATION RESULTS
In this section, the simulation results of the proposed policies are presented and analyzed. All of the results are evaluated on a PC with an Intel(R) Core(TM) i5 CPU @ 3.20 GHz. The duration of the time slot is 1 (s), and the required information of the destination node I is 2 (bits). We also assume the maximal available transmission power of each slot P_t,i,max is the same, 20 (Watt).

A. NUMERICAL RESULTS UNDER DISCRETE CHANNEL ASSUMPTION
In this subsection, the transmission power performance of the DT service is presented when the channel fading coefficient is 0.4 or 0.9, each with probability 0.5, and the signal-to-noise ratio (SNR) is defined with respect to the maximal transmission power and the noise power. Fig. 3 shows the relationship between the SNR and the transmission power. The blue line with the plus mark is the transmission power performance when the source node has perfect future CSI; it is also the result of (7). Just as analyzed before, this result can be regarded as the lower bound of the transmission power for the DT service. The red line with the square denotes the transmission power performance of the greedy policy: when the greedy policy is adopted, the source node works with the maximal available transmission power P_t,i,max to pursue the maximal spectrum efficiency. The black line with the circle and the green line with '×' are the performance of the SDP and ASDP policies, respectively. From the results in Fig. 3, it is clear that the transmission power of the DT service decreases as the SNR increases. And, the SDP policy shows the best performance, effectively approaching the performance bound. Meanwhile, the performance of the ASDP is slightly worse than that of the SDP, and both are obviously better than the greedy algorithm. These conclusions are consistent with the previous analysis: the SDP policy is developed based on the original Bellman equation without any approximation, while the ASDP works with the reward expectation instead of the exact reward to reduce the computational complexity at the cost of slight performance degradation. Fig. 3 also shows that the performance gap among the various policies decreases as the SNR increases. When the SNR is over 8 (dB), the performance of the SDP and ASDP is very close to the performance bound. This result indicates that the ASDP policy is a better choice when the channel condition is good enough.
On the other hand, the SDP policy is suitable for the low SNR case. Fig. 4 presents the relationship between the transmission power and the maximal tolerable delay under various SNR conditions. The solid lines are the performance when the SNR is 2 (dB); the dashed lines are the performance when the SNR is 8 (dB). The meanings of the colors and marks are the same as in Fig. 3. The results of Fig. 4 show that the transmission power decreases as the maximal tolerable delay increases, except for the greedy policy, since it focuses on the current slot without considering the potential of the future slots. And, the gap between the performance of the greedy policy and the performance bound increases with the maximal tolerable delay. These results mean that the tolerable delay of the communication service does provide an opportunity to decrease the transmission power, and a larger tolerable delay means more potential for saving power.
Meanwhile, we also notice that the decreasing rate of the power lower bound reduces as the maximal tolerable delay increases. For the SNR = 2 (dB) case, the power lower bound almost converges when the maximal tolerable delay is over 7 (s); for the SNR = 8 (dB) case, it almost converges when the maximal tolerable delay is over 5 (s). These results show that an overly long delay provides only a limited performance advantage for the service. When the channel condition is good enough, the service can finish quickly without extra power consumption.
The results of Fig. 4 also show that the performance of the SDP policy is almost the same as the power lower bound when the tolerable delay is large enough. And, the performance gap between the SDP and ASDP policies increases with the maximal tolerable delay. These results mean that the ASDP is more suitable for the DT service with a smaller tolerable delay, while the SDP is more suitable for the DT service with a larger tolerable delay.

B. NUMERICAL RESULTS UNDER CONTINUOUS CHANNEL ASSUMPTION
In this section, we assume all of the CSI γ_i, i = 1, 2, . . . , T follow the same distribution. The well-known Rayleigh fading channel with parameter σ = √2 is considered in this section. So, from Property 1, the CSI γ_i follows the exponential distribution with parameter 1/σ_i². Fig. 5 presents the relationship between γ_i and the transmission power. In this case, the SDP policy is not available due to the curse of dimensionality caused by the continuous distribution. Instead, the Q-learning policy, denoted by the pink line with the diamond mark, is adopted to minimize the transmission power of the DT service. The meanings of the symbols are the same as in Fig. 3. The results of Fig. 5 are similar to those of Fig. 3: as γ_i increases, both the transmission power and the performance gaps among the various policies decrease greatly. These results indicate that the service delay also provides the potential to reduce the transmission power under the continuous channel assumption. Especially in the low γ_i region, the energy efficiency potential of the DT service is very obvious. Furthermore, the performance of the Q-learning policy is significantly better than that of the ASDP policy, though it has almost no prior information about the channel. This result means that the Q-learning policy with the ε-greedy algorithm can 'learn' the continuous channel pretty well. We also notice that, unlike the discrete channel case, there is an obvious gap between the performance of the ASDP policy and the performance bound even though γ_i is larger than 10 (dB). This result means the approximation of the reward leads to more performance degradation in the continuous channel.
The relationship between the transmission power and the maximal tolerable delay is shown in Fig. 6. The results show that increasing the maximal tolerable delay leads to decreasing transmission power, and the decline rate also decreases with the available maximal delay. Especially in the γ_i = 8 (dB) case, the advantage of delay is very limited. We also notice that the performance of Q-learning is better than that of the ASDP, and there is an obvious gap between the performance of the ASDP and the performance bound. These conclusions are similar to those of Fig. 4.
In Fig. 7, the convergence performance of the ε-greedy policy is presented when the decay factor β is 0.999. For the SNR = 2 (dB) case, the red line with the circle and the blue line with the plus denote the convergence performance of the ε-greedy policy when the maximal tolerable delays are 3 and 8 (s), respectively. Meanwhile, for the SNR = 8 (dB) case, the black line with the '×' mark and the green line with the square mark are adopted. From Fig. 7, it is clear that the transmission power of the ε-greedy policy decreases steadily. This result is consistent with Property 2, which states that the ε-greedy policy is an asymptotically optimal policy. And, Fig. 7 also shows that the channel condition (γ_i) and the maximal delay affect the convergence performance significantly. A large maximal delay leads to a large state-action space, which results in convergence difficulty for the ε-greedy policy. And, a higher γ_i means the channel condition has lower variance, which is beneficial for the convergence performance. So, when T = 8 (s) and γ_i = 2 (dB), the ε-greedy policy needs about 5000 iterations to converge; when T = 3 (s) and γ_i = 8 (dB), the ε-greedy policy almost converges within 1000 iterations.

C. RUNNING TIME
In this section, the running time of the proposed power control policies is evaluated when the maximal tolerable delay (T) is 8 (s). Just as mentioned above, the results are evaluated on a PC with an Intel(R) Core(TM) i5 CPU @ 3.20 GHz with MATLAB 2017. And, the SNR of the discrete channel and the γ of the continuous channel are 8 (dB). The running time of the ASDP and SDP is divided into the preparation stage time and the transmission stage time. The preparation stage time denotes the running time used to build the query table, while the transmission stage time is spent in the communication process. From the results in Table 2, we can find that the preparation stage time of the SDP is larger than that of the ASDP. Meanwhile, the transmission stage time of the SDP is less than that of the ASDP, since the ASDP has to identify the action based on the channel condition of the current slot. We also notice that the time efficiency of the proposed policies is excellent: even when the iteration number K of Q-learning is 10000, the running time is less than 0.2 (s). The results suggest that the proposed policies are applicable for real-time applications.

VII. CONCLUSION
Energy efficient communication technology plays an important role in future wireless communication.
To exploit the energy efficiency potential of the DT service, the fundamental tradeoff between energy efficiency and delay of the DT service is investigated and analyzed. Furthermore, since it is hard to obtain perfect CSI in practical scenarios, three energy efficient policies for the DT service are designed under various partial CSI assumptions. The numerical results show that the proposed policies can improve the energy efficiency of the DT service greatly. Even when the source node has no prior information about the wireless channel, the energy consumption of DT services can still be reduced by adopting the Q-learning policy. Thus, the results of this paper may provide some new insights for designing energy efficient communication for the DT service in the near future.

APPENDIX A PROPERTY 1
Let the channel gain h follow the Rayleigh distribution with PDF

f(h) = (2h/σ²) exp(−h²/σ²), h ≥ 0.

We set a new variable y = h². Then, we have

F_y(y) = Pr(h² ≤ y) = Pr(h ≤ √y) = 1 − exp(−y/σ²), y ≥ 0.

So, the PDF of y is

f_y(y) = (1/σ²) exp(−y/σ²), y ≥ 0,

i.e., y follows the negative exponential distribution with parameter 1/σ².

APPENDIX B PROPERTY 2
The ε-greedy policy π can be written as

π(a|s) = 1 − ε + ε/|A(s)| for the greedy action, and ε/|A(s)| for each non-greedy action.

We denote the previous ε-greedy policy by π′. By the policy improvement theorem, the state value of the updated ε-greedy policy is no worse than that of π′, so the ε-greedy policy is an improving policy. Furthermore, as ε decays, the policy converges to the unique solution of the Bellman optimality equation, i.e., the best policy. More details can be found in [28].