Advanced Energy-Efficient Computation Offloading Using Deep Reinforcement Learning in MTC Edge Computing

Mobile edge computing (MEC) supports the internet of things (IoT) by leveraging computation offloading. It minimizes the delay and consequently reduces the energy consumption of the IoT devices. However, most recent work assumes a static communication mode despite varying network dynamics and resource diversity, which is a key limitation. An energy-efficient computation offloading method using deep reinforcement learning (DRL) is proposed. Both delay-tolerant and non-delay-tolerant scenarios are considered using capillary machine-type communication (MTC). Depending on the type of service, an intelligent MTC edge server using DRL decides either to process the incoming request locally or to send it to the cloud server. To control communication, we formulate a Markov decision process (MDP) that minimizes the long-term power consumption of the system. The optimization problem is formulated under computing power and delay constraints. Simulation results show a significant performance gain of 12% in computation offloading with the proposed DRL approach. The effectiveness and superiority of the proposed model over other baselines are demonstrated numerically.

transmission delay. However, geo-distributed MEC devices cannot handle all computation offloading tasks due to their limited resource capacities, which means that some tasks have to be processed at the remote cloud server. This calls for an efficient solution that classifies the traffic based on its delay constraints.
Recently, machine learning (ML) has been considered an effective technique for solving many classification challenges. The authors in [11], [12] have studied the computation offloading problem using machine learning on time-varying computing systems.
The authors in [13] have proposed a reinforcement learning (RL) based algorithm for handling the random time-domain variations of wireless channels. It does so by using future reward feedback from the environment to achieve long-term goals.
In addition, a DRL-based binary offloading scheme has been used to decide which users offload their tasks to an edge server that serves multiple users [14], [15]. Using the DRL method, the MEC system determines whether or not a computing task should be offloaded to the MEC server [16]. The authors in [16] and [17] introduce DRL in detail. Both, however, neglect the computation capability of the mobile device itself, which can reduce the computation delay for some tasks. In [18], the authors consider cooperation between cloud and local computing.
The authors in [19] proposed a DRL method to improve energy efficiency in the internet of vehicles. In the proposed solution, the authors establish a fog-cloud offloading system that optimizes power consumption under delay constraints; the overall system is decomposed into a front end and a back end. The authors in [20] propose an energy-efficient task offloading strategy based on channel constraints with deadline awareness. Their scheme minimizes the energy consumption of user equipment while satisfying the deadline conditions of mobile cloud workflows. To handle massive connectivity in the presence of heterogeneous edge servers, the authors in [21] investigate the computation offloading management problem, jointly considering channel states, power consumption, latency, and diverse computation resources.
An RL-based algorithm for a multi-user MEC computation offloading system is designed by the authors in [22], who solve the Markov decision process (MDP) with a Q-learning based scheme. In [23], DRL is utilized to solve the storage and computation problem of the action-value Q; this DRL-based algorithm is called the deep Q-network (DQN), which uses a deep neural network (DNN) to estimate the value function of the Q-learning algorithm [24]. In [25], [26], an edge-computing system with a single edge server is discussed, where multiple UEs can offload computation to the server; the optimization objective is the sum cost of the energy consumption and delay of all the UEs. However, these works do not consider energy efficiency with delay-tolerant and non-delay-tolerant devices, which is the more likely scenario in the future internet.
In this paper, we jointly consider MTC edge and cloud servers for computation offloading to improve energy efficiency under delay constraints. Using DRL, the computation offloading requests are classified as either delay-tolerant or non-delay-tolerant. Consequently, the delay-tolerant and non-delay-tolerant requests are served at the cloud server and the MTC edge server, respectively. The use of edge servers reduces the transmission distance compared to the cloud server and thus improves the quality of service (QoS) in addition to the energy efficiency.
The main contributions of this paper are:
• A DRL-based computation offloading strategy for MTCDs is presented. This scheme improves the energy efficiency of the MTCDs using the edge computing infrastructure under delay constraints. Moreover, it enables the MTCDs to achieve optimal offloading without knowing the MTC edge and cloud server models.
• The proposed DRL-based approach minimizes the uplink energy consumption of the MTC devices to achieve energy efficiency in the MTC system, by choosing an appropriate tier to perform the computation offloading task.
• We analyze the power consumption under different delay constraints and evaluate the energy consumption of the offloading tasks of delay-tolerant and non-delay-tolerant MTCDs.
• The impact of DRL-based rewards is illustrated, and the simulation results are compared with Q-learning and fixed resource allocation schemes to analyze the performance. The proposed DRL-based scheme achieves 12% more reward than the fixed resource allocation approach.

II. SYSTEM MODEL
We consider an MTC environment in which a large number of MTCDs are distributed throughout the network. The MTCDs generate computation-intensive tasks that need to be offloaded subject to the allocated maximum power resources and minimum delay constraints. To provide real-time computation and intelligent decisions for energy-efficient processing, the MTC edge server and the DRL controller are connected to the evolved Node B (eNB). All the MTCDs are connected to the eNB for communication with the MTC edge server and the DRL controller. The DRL controller makes an intelligent decision to process MTCD data either at the edge server or at the cloud server based on delay tolerance: it sends the data of delay-tolerant MTCDs to the cloud server for computation, while the data of non-delay-tolerant MTCDs are computed at the MTC edge server. Fig. 1 shows the proposed network model. Let all the MTCDs (delay-tolerant and non-delay-tolerant) be represented as $v \in \{1, 2, \ldots, V\}$ and a single eNB as $j \in J$, where $V$ and $J$ are the total numbers of MTCDs and eNBs, respectively. The eNB is connected to the MTC edge server and to the DRL controller for making intelligent decisions. The eNB is also connected to a cloud server, represented as $M$, which has a considerably larger computation capacity than the MTC edge server.
The problem is approached through appropriate mode selection: the computation-intensive tasks are offloaded, within a given time frame, either to the MTC edge server or to the cloud server. We consider $S^{\text{mode}}$ as the set of available communication modes for this selection. We also assume that each MTCD is equipped with a single antenna and each eNB is equipped with $L_t$ antennas. A detailed breakdown of the MTC edge and cloud server communication and the related energy consumption model is provided in the following subsections. It is worth noting that we have considered intra-cell interference in our proposed model.

A. MEC COMMUNICATION MODEL
Since the MTCDs $v$ do not possess processing capability, all tasks need to be offloaded either to the edge computing tier or to the cloud computing tier. At the edge tier, the tasks are executed by the MTC edge server. However, as the edge server has limited resource capacity, it cannot handle the computation of both delay-tolerant and non-delay-tolerant MTCDs within the edge tier; if all tasks could be handled at the edge tier, the energy consumption would be very low. The received signal $y_{v,j}$ on the link between MTCD $v$ and eNB $j$ is represented as
$$y_{v,j} = \mathbf{h}_{v,j}^{H} \mathbf{w}_{j}\, s_{v} + n_{j}, \qquad (1)$$
where $\mathbf{h}_{v,j}^{H}$ is the channel state information (CSI) vector from MTCD $v$ to eNB $j$, $\mathbf{w}_{j}$ is the beamforming vector transmitted towards MTCD $v$, $n_{j}$ is the additive Gaussian noise at MTCD $v$, distributed as $\mathcal{CN}(0, \sigma_{i}^{2})$, and $s_{v}$ denotes the transmitted signal containing the message to eNB $j$.
In the system model for energy efficiency, the received signal-to-interference-plus-noise ratio (SINR) is an important parameter; we use it to ensure the QoS and to evaluate the channel capacity. The SINR from MTCD $v$ to eNB $j$ is represented as
$$\gamma_{v,j} = \frac{\left| \mathbf{h}_{v,j}^{H} \mathbf{w}_{v} \right|^{2}}{\sum_{i \neq v} \left| \mathbf{h}_{v,j}^{H} \mathbf{w}_{i} \right|^{2} + \sigma_{i}^{2}}, \qquad (2)$$
where $\mathbf{w}_{v}$ is the beamforming vector of the link between MTCD $v$ and eNB $j$. The transmit power of $\mathbf{w}_{v}$ can vary across MTCDs, depending on the channel conditions and the power constraints imposed by the eNB. The optimal transmission rate minimizes the transmission and communication delays while ensuring the QoS. Considering this and Shannon's channel capacity, the achievable transmission rate from MTCD $v$ to eNB $j$ is
$$R_{v,j} = W \log_{2}\left(1 + \gamma_{v,j}\right), \qquad (3)$$
where $W$ is the total achievable channel bandwidth of the network. We define the target SINR $\gamma_{\min}$, the lower bound that ensures the QoS, as
$$\gamma_{v,j} \geq \gamma_{\min}. \qquad (4)$$
Equation (4) states that the threshold $\gamma_{\min}$ must be satisfied when choosing a channel in the access network.
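As a rough numerical illustration of (2)-(3), the following Python sketch computes the beamformed SINR and the Shannon rate for each MTCD under intra-cell interference. The array names (`h`, `w`), the noise power, the bandwidth, and the single-eNB toy setup are our own assumptions, not values from the paper.

```python
import numpy as np

def sinr_and_rate(h, w, noise_power, bandwidth):
    """Per-MTCD SINR and achievable rate under intra-cell interference.

    h : (V, Lt) complex channel vectors h_{v,j} for V MTCDs and Lt eNB antennas.
    w : (V, Lt) complex beamforming vectors; w[v] serves MTCD v.
    """
    gains = np.abs(h.conj() @ w.T) ** 2          # gains[v, u] = |h_v^H w_u|^2
    signal = np.diag(gains)                      # desired-link power
    interference = gains.sum(axis=1) - signal    # intra-cell interference
    sinr = signal / (interference + noise_power)
    rate = bandwidth * np.log2(1.0 + sinr)       # Shannon capacity per MTCD
    return sinr, rate

# Toy example: 10 MTCDs, 4 eNB antennas, 1 MHz bandwidth (illustrative values).
rng = np.random.default_rng(0)
h = (rng.standard_normal((10, 4)) + 1j * rng.standard_normal((10, 4))) / np.sqrt(2)
w = h / np.linalg.norm(h, axis=1, keepdims=True) * np.sqrt(0.01)  # ~10 mW per beam
sinr, rate = sinr_and_rate(h, w, noise_power=1e-9, bandwidth=1e6)
qos_ok = sinr >= 1.0   # an assumed gamma_min of 0 dB for constraint (4)
```

Devices whose SINR falls below the assumed $\gamma_{\min}$ would be excluded by constraint (4) when the channel is selected.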

B. ENERGY CONSUMPTION MODEL
In the considered MEC architecture, the energy consumption comprises (i) the energy consumed for uploading and downloading the computing task and (ii) the processing energy consumed in the different tiers. The uploading phase proceeds as follows. First, the DRL controller decides whether the task can be computed at the MTC edge server. If so, MTCD $v$ uploads the generated task to the MTC edge server for processing; if the task is beyond the processing capability of the edge server, it is assigned to the cloud server. The latter case consumes more energy than the former. We consider the trade-off between delay and power consumption, and we also consider the QoS of the network so that all tasks are offloaded within the tolerable delay.
We assume that each MTCD generates a task $W_{v}^{\text{task}}$, which needs to be offloaded within the allocated time frame either to the edge tier or to the cloud tier, based on the task size and offloading urgency. With transmission delay $D_{v,M}$ from MTCD $v$ to cloud tier $M$ and computing delay $D_{\text{edge}}^{\text{computing}}$ at the edge node, the total delay from the MTCD to the cloud tier is considered as [27]
$$D_{v,M} = D_{v,j} + D_{j,M} + D_{\text{edge}}^{\text{computing}} = \frac{W_{v}^{\text{task}}}{R_{v,j}} + \frac{W_{v}^{\text{task}}}{R_{j,M}} + \frac{W_{v}^{\text{task}}}{\phi_{j,v}}, \qquad (5)$$
where $D_{v,j}$ and $D_{j,M}$ are the transmission delays from MTCD $v$ to eNB $j$ and from eNB $j$ to cloud tier $M$, respectively, $R_{*,*}$ represents the achievable transmission rate on the respective hop, and $\phi_{j,v}$ represents the maximum available computing resource at the edge. Similarly, the energy consumption for uploading the data from MTCD $v$ to the cloud tier $M$, including processing at the edge and cloud tiers, can be represented as [27]
$$E_{v,M} = E_{v,j}^{\text{upload}} + E_{j,M}^{\text{upload}} + E_{v}^{\text{Edge}} + E_{v}^{\text{Cloud}}, \qquad (6)$$
where $E_{v,j}^{\text{upload}}$ and $E_{j,M}^{\text{upload}}$ represent the energy consumed for uploading data from MTCD $v$ to eNB $j$ and subsequently to the cloud tier $M$, respectively, while the processing energies at the edge and cloud tiers are represented by $E_{v}^{\text{Edge}}$ and $E_{v}^{\text{Cloud}}$, respectively. Furthermore, $P_{v}$ represents the power consumption for uplink data transmission from MTCD $v$ to eNB $j$, $N_{j,v}$ represents the computing resources of MTC edge server $j$ allocated to MTCD $v$, $P_{j}$ represents the power consumption from edge tier $j$ to cloud tier $M$, and $N_{M,v}$ represents the computing resources of cloud $M$ allocated to MTCD $v$.
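The sketch below mirrors the two-hop delay and energy bookkeeping of (5)-(6), under the simplifying assumptions that upload energy equals transmit power times transmission delay and that processing energy equals processing power times processing delay. The function name `offload_cost` and its parameter names are illustrative, not symbols from the paper.

```python
def offload_cost(task_bits, rate_v_j, rate_j_m, phi_edge, phi_cloud,
                 p_uplink, p_backhaul, p_proc_edge, p_proc_cloud, to_cloud):
    """Return (total_delay, total_energy) for one task of size task_bits.

    Rates and computing capacities (phi_*) are in bits/s; powers are in watts.
    """
    d_v_j = task_bits / rate_v_j              # D_{v,j}: MTCD -> eNB transmission delay
    e_upload = p_uplink * d_v_j               # E^upload_{v,j}: uplink transmission energy
    if to_cloud:                              # delay-tolerant request -> cloud tier M
        d_j_m = task_bits / rate_j_m          # D_{j,M}: eNB -> cloud transmission delay
        d_proc = task_bits / phi_cloud        # processing delay at the cloud
        delay = d_v_j + d_j_m + d_proc
        energy = e_upload + p_backhaul * d_j_m + p_proc_cloud * d_proc
    else:                                     # non-delay-tolerant request -> MTC edge server
        d_proc = task_bits / phi_edge         # D^computing_edge: processing delay at the edge
        delay = d_v_j + d_proc
        energy = e_upload + p_proc_edge * d_proc
    return delay, energy

# Example: a 1 Mbit task, 5 Mbps uplink, 50 Mbps backhaul, edge capacity 20 Mbit/s.
d_edge, e_edge = offload_cost(1e6, 5e6, 50e6, 20e6, 200e6, 0.01, 0.5, 1.0, 2.0, to_cloud=False)
d_cloud, e_cloud = offload_cost(1e6, 5e6, 50e6, 20e6, 200e6, 0.01, 0.5, 1.0, 2.0, to_cloud=True)
```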

III. PROBLEM FORMULATION
To minimize the power consumption, we formulate a joint optimization problem. We explicitly consider the CSI and the transmission delay to achieve energy efficiency in the edge and cloud server-based MTC architecture. The energy optimization problem is stated as
$$\min_{\{s_{v}^{\text{mode}},\, P_{v}\}} \; \sum_{v=1}^{V} E_{v,M} \qquad (7)$$
subject to the following constraints:
$$\gamma_{v,j} \geq \gamma_{\min}, \quad \forall v, \qquad (8)$$
$$\sum_{v=1}^{V} P_{v} \leq P_{\text{total}}, \qquad (9)$$
$$P_{v}^{\text{comp}} \leq P_{\max}, \quad \forall v, \qquad (10)$$
where constraint (8) represents the QoS demand of each MTCD $v$, constraint (9) states that the transmit power cannot exceed the total power $P_{\text{total}}$, and constraint (10) states that the computing power $P_{v}^{\text{comp}}$ is bounded by the maximum power capacity $P_{\max}$. Furthermore, the optimization variable $s_{v}^{\text{mode}}$ is exploited to make the appropriate decision by selecting a proper mode for offloading the data.
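To make the role of the mode variable concrete, the toy selection routine below assumes the per-device edge and cloud energies are already known (e.g., from a model such as (6)) and greedily assigns each QoS-feasible MTCD to the cheaper tier. It is only a sketch of what problem (7) asks of $s_{v}^{\text{mode}}$: the function name and numbers are made up, and the total-power constraints (9)-(10) are not enforced in this simplified version.

```python
def select_modes(edge_energy, cloud_energy, sinr, gamma_min):
    """Pick a mode per MTCD minimizing total energy while honoring the QoS constraint (8)."""
    modes, total = [], 0.0
    for e_edge, e_cloud, g in zip(edge_energy, cloud_energy, sinr):
        if g < gamma_min:
            modes.append(None)            # QoS violated: device cannot be served on this channel
            continue
        mode = "edge" if e_edge <= e_cloud else "cloud"
        modes.append(mode)
        total += min(e_edge, e_cloud)
    return modes, total

modes, total_energy = select_modes(
    edge_energy=[0.128, 0.130, 0.125],    # illustrative per-device energies (edge)
    cloud_energy=[0.160, 0.150, 0.170],   # illustrative per-device energies (cloud)
    sinr=[2.0, 0.5, 3.0], gamma_min=1.0)
```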

IV. DRL BASED ENERGY EFFICIENCY IN MTC EDGE COMPUTING
The computation offloading problem is formulated as an MDP to achieve energy efficiency. To solve this problem, we use a DRL algorithm.
The MDP is defined by the state space, the action space, the transition probabilities, and the immediate reward. From here on, the model is described by the tuple $\{S, A, P(s_{t+1} \mid s_t, a_t), R(s_t, a_t)\}$, where $S$ is the set of states representing the environment and $A$ is the set of possible actions. After performing action $a_t$ in state $s_t$, the probability of transitioning to state $s_{t+1}$ is given by $P(s_{t+1} \mid s_t, a_t)$, and $R(s_t, a_t)$ characterizes the reward received when action $a_t$ is taken in state $s_t$. The objective of the MDP model is to maximize the accumulated reward $R$ over a long period $T$. In the following subsections, we provide a detailed breakdown of our computation offloading problem for the MTCDs.
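As a concrete, minimal stand-in for the tuple $\{S, A, P, R\}$, the Python sketch below uses a two-component state (a CSI magnitude and an offloading delay), a binary action (edge or cloud), a placeholder reward equal to a negative energy cost, and random channel dynamics. All numerical values and names (`sample_state`, `step`) are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_state():
    """State s_t: [CSI magnitude |h_{v,j}|, offloading delay D_{v,j}] (toy values)."""
    csi = rng.rayleigh(scale=1.0)
    delay = rng.uniform(0.01, 0.1)
    return np.array([csi, delay], dtype=np.float32)

def step(state, action):
    """One MDP transition; action 0 = MTC edge server, 1 = cloud server.

    The reward R(s_t, a_t) is a placeholder negative energy cost, and the next
    state stands in for P(s_{t+1} | s_t, a_t) via an independently resampled channel.
    """
    energy_cost = (0.20 if action == 1 else 0.13) * (1.0 + state[1])
    return sample_state(), -energy_cost

s = sample_state()
s_next, r = step(s, action=0)
```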

1) STATE SPACE
The system state is designed for energy efficiency when offloading the MTCD data. It is compatible with the currently available communication mode, power resources, and CSI. We assume that there are two modes for each MTCD $v$: the first is the eNB-tier mode $M_{v}^{\text{eNB}}$ and the second is the cloud-tier mode $M_{v}^{\text{cloud}}$. Thus, the state space can be designed as
$$s_{t} = \left[ M_{v}^{\text{eNB}},\; M_{v}^{\text{cloud}},\; \mathbf{h}_{v,j},\; D_{v,j} \right],$$
where $\mathbf{h}_{v,j}$ is the CSI between MTCD $v$ and eNB $j$, and $D_{v,j}$ is the delay generated when offloading the tasks to the eNB computing tier.

2) ACTION SPACE
To achieve enhanced performance and to accomplish the tasks, the agent could perform a large number of actions. However, performing a large number of actions requires massive computation capacity, which in turn reduces the overall system performance. Therefore, the agent performs only one action at each time step $t$ based on the state $s_t$. The action space is represented as $a_t = [s_{v}^{\text{mode}}]$. Based on the available CSI and the delay $D_{v,j}$, the agent selects the optimal communication mode for offloading the data.

3) REWARD FUNCTION
The purpose of rewarding the previous action is to give feedback to the RL model. Consequently, it is crucial to define an appropriate reward to improve the learning process; an efficient reward function guides the search for the optimal policy [28]. The proposed reward function is closely related to the energy-minimization problem (7), which can be achieved by efficient computation offloading. The negative sum cost of the system's energy consumption is regarded as the immediate reward, and the immediate reward $r_t$ depends primarily on the precise mode selection. Therefore, the immediate reward can be represented as
$$r_{t} = -\sum_{v=1}^{V} E_{v,M}. \qquad (11)$$
The accumulated reward over the longer period $T$ corresponds to minimizing the energy cost of the system. Let $\xi \in [0, 1]$ be the discount factor that relates the current mode selection to the effect of future rewards. Then, $R_{t} = \sum_{t=0}^{T} \xi^{t} r_{t}$, where $R_{t}$ is the accumulated future reward. It is worth mentioning that a lower value of $\xi$ places more importance on immediate rewards [29]. The overall rewards are always negative because we attempt to minimize the energy cost of the system; thus, our main goal is to maximize the reward, which minimizes the energy.
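The reward terms above translate directly into code: the immediate reward is the negative sum of the per-device energy costs, and the return is the discounted sum $R_t = \sum_{t=0}^{T} \xi^{t} r_t$. The energy values in the example below are made up for illustration.

```python
def immediate_reward(energy_costs):
    """r_t: negative sum cost of the system's energy consumption at step t."""
    return -sum(energy_costs)

def discounted_return(rewards, xi=0.9):
    """R_t = sum_{t=0}^{T} xi^t * r_t over one episode."""
    return sum((xi ** t) * r for t, r in enumerate(rewards))

# Three steps whose chosen modes cost 0.20, 0.15, and 0.13 (arbitrary units):
rewards = [immediate_reward([0.20]), immediate_reward([0.15]), immediate_reward([0.13])]
R = discounted_return(rewards, xi=0.9)   # negative; maximizing it minimizes energy
```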
The state transition probability $P(s_{t+1} \mid s_t, a_t)$ is important for attaining the long-term accumulated reward. However, it is quite difficult to obtain the transition probabilities in the MDP problem; hence, Q-learning is used to solve the MDP. The Q-function $Q(s_t, a_t)$ expresses the quality of an action $a_t$ in a given state $s_t$. The maximum expected achievable reward is given by the Q-function that follows the policy $\pi$, which can be stated as [28]
$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ R_t \mid s_0 = s_t,\; a_0 = a_t,\; \pi \right], \qquad (12)$$
where the Q-function accumulates the rewards obtained starting from the initial state $s_0$ and action $a_0$.
Thereafter, following the Bellman criterion, the optimal Q-function is estimated as [30]
$$Q^{*}(s_t, a_t) = \mathbb{E}\left[ r_t + \xi \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}) \,\middle|\, s_t, a_t \right]. \qquad (13)$$
Usually, the Q-function is obtained iteratively by means of the information $(s, a, r, s', a')$ at the next time step $s_{t+1}$. Hence, the updated Q-function can be given as [28]
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_t + \xi \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right], \qquad (14)$$
where $\alpha \in [0, 1]$ is the learning rate that weights the current offloading decision. With a proper learning rate, the iterative algorithm ensures that $Q_t(s, a)$ converges to $Q^{*}(s, a)$. The optimal action and mode selection $s_{v}^{\text{mode}}$ are obtained by repeatedly updating the Q-value for each state-action pair [28].
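A tabular version of the update in (14) is shown below; the string-valued states, the action encoding (0 = edge, 1 = cloud), and the sample reward are our own illustrative choices.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, xi=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + xi * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + xi * best_next - Q[(s, a)])
    return Q

actions = (0, 1)                     # 0 = offload to MTC edge server, 1 = cloud server
Q = defaultdict(float)               # Q-table over (state, action) pairs, initialized to 0
Q = q_update(Q, s="good_channel", a=0, r=-0.13, s_next="good_channel", actions=actions)
```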
The Q-learning algorithm can efficiently solve RL problems with a small number of state-action pairs. When the state-action space is large, however, it becomes challenging to visit every step while storing all the samples in a Q-table [28]. This behavior intrinsically limits traditional RL to fully observed, low-dimensional state spaces [31]. In our case, the energy-minimization problem is formulated in a dynamic wireless environment, which leads to an extremely large state space. To enhance the performance while avoiding the curse of dimensionality and the limitations of Q-learning, we use the DQN approach [28].

A. DRL BASED OFFLOADING
In this subsection, we introduce the DRL-based mode selection. The DRL agent combines traditional RL with a DNN and is used for efficient computation offloading to achieve energy efficiency at the MTC edge server. This approach significantly accelerates the learning process [32].
In the DRL, each possible action is paired with a unique output unit, and the state representation serves as the input to the neural network (NN) [29]. In a single forward pass, the DRL agent can compute the Q-values for all possible actions in a given state, which reduces the complexity of the system [28], [32].
The DRL achieves better performance by using the NN, owing to its inherent generalization capability and the replay memory, which reduce the interaction required with the complex environment. The DRL also approximates $Q(s, a; \theta)$ to $Q^{*}(s, a)$, which improves performance and avoids the intractable situations of tabular Q-learning in RL [32].
In Algorithm 1, we analyze the energy-efficient computation offloading. The input state is the delay between the MTCD and the eNB, and the outputs are the Q-values $Q(s, a; \omega)$ with weights $\omega$, which provide precise estimates for all possible actions. The controller performs the action $a_t$ based on the given state; after executing $a_t$, the system moves to a new state $s_{t+1}$. From the energy-efficiency reward function, the agent simultaneously calculates the reward $r_t$, which is based on the selected mode. The replay memory $D$ stores this transition $(s_t, a_t, r_t, s_{t+1})$. The DQN is updated after a limited number of iterations using a batch of random samples of size $B$ selected from the replay memory $D$. The DQN is trained towards the target value by minimizing the following loss function [33]:
$$L(\omega) = \mathbb{E}\left[ \left( r_t + \xi \max_{a' \in A} Q(s_{t+1}, a'; \bar{\omega}) - Q(s_t, a_t; \omega) \right)^{2} \right], \qquad (15)$$
where $r_t + \xi \max_{a' \in A} Q(s_{t+1}, a'; \bar{\omega})$ represents the target value produced by the target network with weights $\bar{\omega}$ [29]. Every certain period, the agent copies the weights of the DQN to the target DQN.
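A compressed sketch of one DQN update consistent with the loss in (15) and the periodic target-network refresh is given below, using tf.keras (the paper's simulations use TensorFlow). The layer widths, optimizer, state dimension, and update period are our own assumptions; only the use of two fully connected layers and a batch size of 32 is stated in Section V.

```python
import numpy as np
import tensorflow as tf

STATE_DIM, N_ACTIONS, XI = 2, 2, 0.9          # illustrative sizes and discount factor

def build_q_network():
    # Two fully connected hidden layers, as in the simulation setup of Section V.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS, activation="linear"),
    ])

q_net = build_q_network()                      # online network, weights omega
target_net = build_q_network()                 # target network, weights bar-omega
target_net.set_weights(q_net.get_weights())
q_net.compile(optimizer="adam", loss="mse")

def train_step(states, actions, rewards, next_states):
    """Minimize (r + xi * max_a' Q_target(s', a') - Q(s, a))^2 on one replay mini-batch."""
    targets = q_net.predict(states, verbose=0)
    next_q = target_net.predict(next_states, verbose=0)
    targets[np.arange(len(actions)), actions] = rewards + XI * next_q.max(axis=1)
    q_net.train_on_batch(states, targets)

# After every fixed number of updates, refresh the target network:
# target_net.set_weights(q_net.get_weights())
```

Overwriting only the chosen actions' entries in `targets` keeps the loss of the unchosen actions at zero, which matches the per-transition form of (15).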

V. SIMULATIONS RESULTS AND DISCUSSIONS
In this section, we demonstrate the performance of the proposed scheme through simulations. The simulations use Python 3.6 and TensorFlow 1.14 to build the neural network, which consists of two fully connected dense layers. We choose one eNB and 10 MTCDs, and the transmit power of each MTCD is set to 10 dBm.
The network area is set to 400 × 400. The remaining parameters are listed in Table 1.
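For reference, the values quoted in this section can be collected into a single configuration dictionary; the dictionary keys are our own, and the entries of Table 1 that are not quoted in the text are not reproduced here.

```python
SIM_CONFIG = {
    "num_enb": 1,
    "num_mtcd": 10,
    "tx_power_dbm": 10,        # per-MTCD transmit power
    "area": (400, 400),        # simulation area
    "hidden_layers": 2,        # two fully connected dense layers
    "batch_size": 32,          # value at which DRL convergence stabilizes (Fig. 2)
    "learning_rate": 0.01,     # most stable of the rates examined in Fig. 3
}
```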

A. PERFORMANCE ANALYSIS
Algorithm 1 (excerpt): DRL-based energy-efficient computation offloading
4: Generate a random number x ∈ (0, 1).
5: If x ≤ ε:
6:   The agent performs a random action a_t.
7: Else:
8:   The agent performs the action a_t = arg max_{a_t ∈ A} Q(s_t, a_t; ω).
9: End If
10: Execute action a_t in the emulator.
11: Observe the reward r_t according to the formulated problem (11); based on the performed action a_t, evaluate whether the energy consumption is minimized, and then observe the next state s_{t+1}.
12: Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay memory D.
13: Randomly sample a mini-batch of size B of transitions (s_t, a_t, r_t, s_{t+1}) from the replay memory D.
14: Perform a gradient descent step on (r_t + ξ max_{a' ∈ A} Q(s_{t+1}, a'; ω̄) − Q(s_t, a_t; ω))² with respect to ω to train the Q-network on the selected transitions.
15: Periodically set the target-network parameters ω̄ to ω.
16: End For
17: End For

In the beginning, we compare the DRL learning parameters, considering the transmit power, a single eNB, the MTC edge server, and the cloud server domain. Furthermore, both the delay-tolerant and non-delay-tolerant offloading schemes are taken into account. The convergence of the DRL stabilizes when the batch size is set to 32; with a smaller batch size, the DRL exhibits rough performance, whereas a batch size of 64 behaves almost the same as 32, and in both cases the DRL converges smoothly, as shown in Fig. 2.
The impact of different learning rates is depicted in Fig. 3, where we can see that when the learning rate is very high (α = 0.09), the agent learns very fast but becomes trapped in local optima because it cannot explore all the states; after a certain number of steps, the agent's cumulative reward starts to decrease, so the output does not reach optimality. In contrast, when the learning rate is low, around 0.001, the system learns well but slowly, which generates a long delay. With a learning rate of 0.01, the agent achieves very stable performance.

Fig. 4 shows the effect of the QoS requirement on the proposed system. When the QoS requirement is high, around 400 kbps, the system consumes more power, around 230 mW, to ensure the best performance; with lower QoS demands, the power consumption decreases linearly.

Fig. 5 demonstrates the performance when handling delay-tolerant and non-delay-tolerant service requests. We consider the same number of requests of each type. It can be observed that when the maximum possible number of requests is served at the eNB, the power consumption remains minimal. Despite the same number of requests, the non-delay-tolerant service consumes less power, about 128 mW, because of its proximity to the eNB, while the delay-tolerant services consume 160 mW due to the longer distance and the traversal of intermediate nodes. It can also be observed that the non-delay-tolerant mode consumes around 20% less energy than the delay-tolerant mode.

To compare the performance of the DRL, we consider the Q-learning and fixed resource allocation algorithms. The overall performance comparison is shown in Fig. 6. The DRL achieves higher performance than the other benchmarks: the agent exploits the optimal resources and shows the best performance in achieving cumulative rewards, which leads to the highest energy efficiency. The proposed scheme achieves 12% more reward than the fixed resource allocation approach, while the Q-learning method attains 3% less reward than the proposed scheme.
For the fixed resource allocation algorithm, we select the eNB and radio resources without considering optimality; as a result, it shows lower performance in achieving cumulative rewards. The Q-learning based algorithm performs better than the fixed resource allocation approach but yields lower reward efficiency than the DRL approach, because Q-learning must revisit every previous state, which causes unexpected power consumption and accumulates fewer rewards.
In Fig. 7, the total energy consumption of the three schemes is compared. The proposed scheme outperforms both the fixed resource allocation and Q-learning schemes and consumes the lowest energy, approximately 125 mW. The fixed resource allocation scheme consumes the most energy, around 190 mW, whereas Q-learning consumes comparatively less, approximately 155 mW.

VI. CONCLUSION
In this paper, we minimized the energy consumption of machine-type communication devices (MTCDs) using a deep reinforcement learning (DRL) approach. The MTC edge computing architecture utilizes the DRL technique to improve the offloading process of the MTCDs. The problem is formulated under power constraints and QoS requirements. We compared the DRL, Q-learning, and fixed resource allocation schemes. Among these techniques, the simulation results show that the DRL-based approach achieves the highest cumulative rewards and is the most efficient in minimizing the energy consumption of the MTCDs.