Task Offloading and Trajectory Control for UAV-Assisted Mobile Edge Computing Using Deep Reinforcement Learning

Mobile Edge Computing (MEC) has been widely employed to support various Internet of Things (IoT) and mobile applications. By leveraging the easy deployment and flexibility of Unmanned Aerial Vehicles (UAVs), UAVs equipped with MEC servers can provide computation support for tasks offloaded by mobile users in temporary hotspot areas or emergency scenarios, such as sports venues or areas destroyed by natural disasters. Despite its numerous advantages, a UAV carrying a MEC server is restricted by its limited computation resources and sensitive energy consumption. Moreover, due to the complexity of the UAV-assisted MEC system, its computational resources and energy consumption cannot be optimized well by traditional optimization methods. Furthermore, the computational cost of optimizing the MEC system often grows exponentially with the number of MEC servers and mobile users. Therefore, it is considerably challenging to control the UAV positions and schedule the task offloading ratio. In this paper, we propose a novel Deep Reinforcement Learning (DRL) method to optimize UAV trajectory control and users' offloaded task ratio scheduling, thereby improving the performance of the UAV-assisted MEC system. We maximize the system stability and minimize the energy consumption and computation latency of the UAV-assisted MEC system. The simulation results show that the proposed method outperforms existing work and has better scalability.


I. INTRODUCTION
The increasing popularity of the Internet of Things (IoT) [1] provides a promising platform for sophisticated mobile applications such as automatic driving [2], augmented reality [3], and various cognitive applications. With the rapid growth of IoT and mobile devices, a vast number of devices are connected to wireless networks and rely on online resources to process various tasks. Offloading a large number of tasks through the core networks and processing the tasks on central cloud servers would cause network traffic congestion and increase latency. In addition, most mobile applications are latency-sensitive, computation-intensive, and energy-intensive. To mitigate those issues, Mobile Edge Computing (MEC) [4] is proposed to process offloaded tasks in proximity.
Although MEC has numerous advantages, it cannot avoid the limitations of static tower locations. Therefore, it is challenging to deploy MEC anytime and anyplace. In addition, there is a high possibility that the infrastructure might sometimes be destroyed by natural disasters. Furthermore, it is almost impossible to mount the infrastructure for temporary use in hotspot areas or in rural areas such as mountains. In those scenarios, IoT devices cannot fully function to serve their users. Thanks to the flexibility of Unmanned Aerial Vehicles (UAVs), UAV-assisted MEC [5]-[7] is introduced to serve as a computational server for mobile users at flexible positions. UAV-assisted MEC can extend the mobile devices' operating lifetime and accelerate computation by providing extra computation resources on the MEC servers. Moreover, offloading tasks to a nearby MEC server avoids mobile users frequently communicating with the cloud or uploading their tasks to the cloud, thereby alleviating communication congestion. Due to its limited payload and energy, the UAV has limited computation capability and flying time. Therefore, the UAV-assisted MEC system demands a control model that can maximize the number of tasks processed before expiring and minimize energy consumption.
Conventional optimization methods have been adopted to address the abovementioned problems. As presented in [8], task scheduling and trajectory control are essential to minimize energy consumption. Zhang et al. [9] adopted the Lagrangian dual method and successive convex approximation (SCA) to optimize task offloading to a UAV-assisted MEC server with the objective of minimizing energy consumption. Classical optimization methods such as convex (CVX) optimization [10] and mixed-integer programming (MIP) [11] are also employed to address similar problems. Similarly, Liu et al. [12] presented an algorithm based on SCA to minimize the energy consumption by controlling the CPU frequencies of edge servers and the UAV trajectory. However, the UAV-assisted MEC system environments are not fully observable to the optimization models, and it is challenging to formulate the complex MEC environments within conventional optimization models. Besides, the computational cost of standard optimization methods grows exponentially with the increasing number of parameters.
As one of the most powerful pattern recognition and optimization approaches, machine learning [13] has been extensively employed to learn and optimize various problems in UAV-assisted MEC. Jiang et al. [14] introduced a deep learning method to optimize UAV positions and schedule resources in UAV-assisted MEC to save energy. However, both standard machine learning and deep learning [15], [16] methods require labeled historical data, which demands considerable human effort to label the training data. Reinforcement learning [17] can learn and optimize for UAV-assisted MEC without training data by interacting with the MEC environment. Therefore, reinforcement learning [17] and Deep Reinforcement Learning (DRL) [18] methods have been introduced for resource allocation and UAV position optimization [19] to minimize energy consumption. DRL [20] employs deep neural networks to capture the complex states of UAV-assisted MEC and reinforcement learning to make decisions. Wang et al. [21] presented a multi-agent DRL model for trajectory planning and task offloading in UAV-assisted MEC. However, the existing reinforcement learning or DRL methods require further optimization steps to achieve joint optimization purposes. The extra steps introduce additional computational costs and compromise the performance of the models. This paper aims to develop an end-to-end DRL model to optimize UAV trajectory planning and task offloading for UAV-assisted MEC. Specifically, the model jointly optimizes computation offloading and UAV trajectory control to minimize the time and energy consumption of the whole system and maximize the stability of the system. To maximize stability means to balance the computation workload of the system, extend the system operation time, and maximize the number of completed tasks.
The main contributions of this paper are summarized as follows:
• The UAV-assisted MEC system has been formulated as a Markov Decision Process (MDP) so that we can solve the optimization problem with DRL models. The details of the MDP elements and the UAV-assisted MEC environment are well defined with explanations.
• A novel end-to-end DRL model is developed to address the optimization problem. Specifically, the DRL model is proposed to optimize the offloading task ratio and the UAV trajectory to minimize the overall energy consumption in UAV-assisted MEC.
• The proposed method has been verified through extensive simulations. The simulation results show that the proposed model outperforms the existing methods, including a greedy algorithm, reinforcement learning, and DRL methods.

The remainder of the paper is organized as follows. Section II presents the related works, and Section III shows the system model and problem formulation. In Section IV, we explain how to solve the optimization problem in the UAV-assisted MEC system with the proposed DRL model. After that, we present the simulation results and conclude our work in Section V and Section VI, respectively.

II. RELATED WORKS
In the literature, there are many research works on computational offloading [22] and trajectory control [23] in MEC and UAV-assisted MEC. Since MEC can increase computation capacity and save energy for mobile devices, it has been extensively researched in recent years as a critical technology towards 5G. In [24], the authors provide a high-quality survey, where the definition of MEC, its computation and communication modeling, and its advantages and applications are discussed. Chen and Hao [25] formulated the task offloading problem of MEC as a mixed-integer non-linear program to reduce the computation delay and save battery life. Zhang et al. [26] explored computation offloading mechanisms for MEC in 5G networks, where energy consumption is minimized.
Generally, we can categorize the UAV-assisted MEC literature into several types based on the proposed methods and their research objectives. The first type aims to minimize the energy consumption of the whole system or of the mobile users [27]-[34]. Hua et al. [27] proposed a resource allocation strategy for a single UAV-assisted edge server serving an individual mobile user, where a UAV is designed to fly from a predefined initial location to a final location while providing computation power. Instead of providing the offloading service for only a single mobile user, Jeong et al. [28] researched how a single UAV provides the computation service for multiple users. In [27] and [28], the authors explored UAV-assisted MEC systems that keep the UAV flying while it provides the offloaded computation service. In order to save propulsion energy, Xiong et al. [30] and Diao et al. [32] proposed that the UAV only provides offloaded computation services at specific times or specific positions.
Besides the energy consumption of the MEC system, the computation rate is also an important consideration for the UAV-assisted MEC system [35]-[37]. There are two kinds of computation offloading in MEC: binary offloading and partial offloading. In the binary offloading mode, users either perform computation tasks entirely locally or offload all computation tasks to the MEC server. In the partial offloading mode, users can perform computation tasks partially locally and partially on the MEC server, where the local computation and the offloaded computation are performed in parallel. In [36], a two-stage alternative algorithm and a three-stage alternative algorithm are employed to tackle the partial offloading and binary offloading problems, respectively. Hu et al. [35] researched partial offloading in a UAV-assisted MEC system with a penalty dual decomposition-based and L0-norm algorithm, which minimized the total processing time, including the transmission time, computation time, and local computation time. Most interestingly, the simulation results indicated that better performance could be achieved when the UAV keeps stationary over a set of time intervals to collect data.
Joint optimization methods are extensively investigated in recent literature because energy consumption and computation rate are both critical due to the limited onboard energy of the UAV. Chen et al. [37] utilized DRL methods to schedule the offloading to improve the satisfaction of perceived delay and the energy consumption of mobile users. Zhan et al. [38] first studied the energy minimization problem without time constraints, followed by the minimization problem of the task completion time; after that, they jointly optimized the UAV energy and completion time with a Pareto-optimal solution. It is noteworthy that existing studies, such as those mentioned above, researched the UAV-assisted MEC system with the objectives of minimizing energy consumption and minimizing task completion time separately, without considering the balance of the two aspects. Although Zhan et al. [38] took the trade-off between energy consumption and task completion time into account, they did not take the long-term stability of the whole system into consideration. Deep Reinforcement Learning (DRL) [18] methods are naturally introduced to make decisions for UAV-assisted MEC [39] to achieve high performance. However, they did not consider partial offloading, so a computation task can only be processed entirely on the local device or on the UAV; therefore, there is much less freedom to control and optimize the offloaded tasks.

III. SYSTEM MODEL AND PROBLEM FORMULATION

A. SYSTEM MODEL
In this work, we consider a MEC system with a single UAV and K mobile users in a 3-D Cartesian coordinate system, as illustrated in Fig. 1. A UAV carries a MEC server and flies at a fixed altitude to provide computation service for IoT devices and mobile users. Since the MEC server has greater computing capability than the mobile users, mobile users can offload their computation-intensive and latency-sensitive tasks to the UAV to reduce their energy costs and speed up computation. The set of mobile users is described by k ∈ K = {1, 2, . . . , K }. Concretely, a user k can offload d τ of the current task to the UAV, while 1 − d τ of the task is processed on the local device. A control agent plans the trajectory of the UAV and the offloading proportion d τ of the task. To minimize the total energy consumption of the MEC system and maximize the number of completed tasks over the time T, we need to define the models in the MEC system, including the communication model, the offloaded computation model, and the local computation model.

1) COMMUNICATION MODEL
In the communication model, users transmit their computation tasks to the UAV-assisted MEC server. We divide the time T into N time slots, where N > K; the τ-th time slot is defined as τ ∈ T = {1, 2, . . . , N }, and the length of each time slot is sufficiently small. Assume that at most one task is generated in each time slot. The position of mobile user k is given by u k = [x k , y k , 0], k ∈ K. The trajectory of the UAV in a horizontal plane at altitude H can be indicated by the UAV's discrete locations in each time slot, defined as h τ = [x τ , y τ , H ], τ ∈ T . Assume that the UAV has the capability to return to its initial position after completing its missions. Thereby, we can derive the following constraint for the UAV's flight as equation (1).
where equation (1) represents that the speed of the UAV has to satisfy its maximum speed constraint v max .
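Equation (1) itself is missing from the text; a plausible reconstruction in its standard form, assuming δ denotes the length of one time slot, is:

```latex
\frac{\lVert h_{\tau+1} - h_{\tau} \rVert}{\delta} \le v_{\max}, \qquad \forall \tau \in \mathcal{T} \tag{1}
```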
Since the UAV flies in the proximity of mobile users, there are Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS) links between the UAV and the mobile users. Define P LoS as the LoS link probability between the UAV and mobile users, which can be calculated following [40], as given in equation (2). The environment-related variables α and β can also be obtained from [40].
Based on [40], the path loss between the UAV and mobile user k at the τ-th time slot is given as equation (3).
where c is the speed of light and f c is the carrier frequency. The parameters η LoS and η NLoS denote the environment-dependent losses for LoS and NLoS links, respectively. Therefore, the data transmission rate from a mobile user to the UAV is given as equation (4), where B denotes the bandwidth, p τ,k denotes the transmission power of mobile user k at the τ-th time slot, and σ 2 denotes the noise power.
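Equations (2)-(4) are also missing from the text. Under the assumption that θ τ,k is the elevation angle and l τ,k the UAV-user distance, the standard forms of the air-to-ground channel model of [40] consistent with the surrounding definitions would read:

```latex
P_{LoS} = \frac{1}{1 + \alpha \exp\!\big(-\beta(\theta_{\tau,k} - \alpha)\big)} \tag{2}
```
```latex
L_{\tau,k} = 20\log_{10}\!\Big(\frac{4\pi f_c\, l_{\tau,k}}{c}\Big) + \eta_{LoS}\, P_{LoS} + \eta_{NLoS}\,(1 - P_{LoS}) \tag{3}
```
```latex
r_{\tau,k} = B \log_2\!\Big(1 + \frac{p_{\tau,k}}{\sigma^2\, 10^{L_{\tau,k}/10}}\Big) \tag{4}
```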
Assume that D τ,k bits of data need to be computed for mobile user k at the τ-th time slot. d τ,k is introduced to indicate the ratio of the task offloaded by user k to the UAV at the τ-th time slot. Thereby, in the communication model, the transmission time and the transmission energy consumption are calculated by equations (5) and (6), respectively.
where p τ,k is the transmission power.
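The omitted equations (5) and (6) follow directly from the definitions above; writing t tr and E tr for the transmission time and energy, they would take the form:

```latex
t^{tr}_{\tau,k} = \frac{d_{\tau,k}\, D_{\tau,k}}{r_{\tau,k}} \tag{5}
\qquad
E^{tr}_{\tau,k} = p_{\tau,k}\, t^{tr}_{\tau,k} \tag{6}
```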

2) OFFLOADED COMPUTATION MODEL
After transmitting the offloaded computation tasks, the UAV and the mobile users perform the offloaded computation and the local computation, respectively. Denote C τ,k as the CPU cycles required to compute one bit of data. Therefore, the computation time t off k,c and energy consumption E off k,c can be calculated accordingly, where f UAV represents the CPU frequency of the MEC server mounted on the UAV and κ = 10 −26 is a hardware-related constant. The UAV has an idle state to save energy and a running state to process the tasks.
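The omitted expressions take the standard dynamic-voltage-scaling form used throughout the MEC literature; a reconstruction consistent with the surrounding definitions (equation numbers assumed) is:

```latex
t^{off}_{\tau,k} = \frac{d_{\tau,k}\, D_{\tau,k}\, C_{\tau,k}}{f_{UAV}}
\qquad
E^{off}_{\tau,k} = \kappa\, f_{UAV}^{2}\, d_{\tau,k}\, D_{\tau,k}\, C_{\tau,k}
```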
Since multiple tasks are transmitted to the UAV by many mobile users, there is a waiting time for transmitted tasks. Suppose that there is a virtual queue in the UAV. The virtual queue holds the offloaded tasks, and the edge server computes the tasks in first-come-first-served order. The proposed algorithm decides the offloaded ratio of the queue-head task and the UAV position at the next time slot. Suppose that there are a − 1 tasks waiting in this virtual queue; then the current task is added into the queue as the a-th element. Therefore, we can calculate the waiting time of user k as follows. In the offloaded computation model, the total time cost is given as equation (10).
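As an illustration of this FIFO queueing logic, the sketch below computes the offloading-path time of a task that joins the virtual queue behind a − 1 tasks; the helper name and arguments are hypothetical, not taken from the paper:

```python
def offload_path_time(t_tr, queued_service_times, t_off):
    """Time on the offloading path: transmission + FIFO waiting + edge computation.

    queued_service_times holds the service times of the a-1 tasks already in
    the UAV's virtual queue, which are served first-come-first-served.
    """
    t_wait = sum(queued_service_times)  # wait for every task ahead in the queue
    return t_tr + t_wait + t_off

# A task with 1 s transmission and 2 s edge computation, behind two 0.5 s tasks:
print(offload_path_time(1.0, [0.5, 0.5], 2.0))  # 4.0
```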

3) LOCAL COMPUTATION MODEL
Similar to the offloaded computation model, given the CPU frequency f k of mobile user k, we can derive the local computation time and the local computation energy consumption in equations (11) and (12).
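Mirroring the offloaded computation model, the omitted equations (11) and (12) would take the form, with the remaining fraction 1 − d τ,k processed locally:

```latex
t^{loc}_{\tau,k} = \frac{(1 - d_{\tau,k})\, D_{\tau,k}\, C_{\tau,k}}{f_k} \tag{11}
\qquad
E^{loc}_{\tau,k} = \kappa\, f_k^{2}\, (1 - d_{\tau,k})\, D_{\tau,k}\, C_{\tau,k} \tag{12}
```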

4) OVERALL TIME AND ENERGY CONSUMPTION
Since the local computation and the offloaded computation run in parallel, the total time cost equals the maximum of the offloading-path time (the sum of the transmission time, waiting time, and offloaded computation time) and the local computation time, as in equation (13).
where τ,k denotes the maximum tolerable completion time of task t total τ,k . If t total τ,k exceeds this expiration time, we consider that the algorithm has failed to complete this task because the task has expired. The total energy consumption of the UAV-assisted MEC system includes the communication energy consumption, the offloaded computation energy consumption, the local computation energy consumption, and the propulsion energy consumption of the UAV. The former three types of energy consumption have been listed above, and the propulsion energy consumption is given with ξ = 0.5MT /N, where M is the mass of the UAV including its payload [35]. Therefore, the total energy consumption at the τ-th time slot can be computed as follows.
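Writing Δ τ,k for the deadline symbol lost in extraction and v τ for the UAV velocity in slot τ, a reconstruction of equation (13) and the total-energy expression consistent with the surrounding text (the propulsion term follows the kinetic-energy-style model of [35], an assumption here) is:

```latex
t^{total}_{\tau,k} = \max\!\big( t^{tr}_{\tau,k} + t^{wait}_{\tau,k} + t^{off}_{\tau,k},\; t^{loc}_{\tau,k} \big) \tag{13}
```
```latex
E_{\tau,k} = E^{tr}_{\tau,k} + E^{off}_{\tau,k} + E^{loc}_{\tau,k} + \xi\, \lVert v_{\tau} \rVert^{2}
```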

B. PROBLEM FORMULATION
In the proposed UAV-assisted MEC system, we aim to minimize the total energy consumption and time consumption simultaneously; meanwhile, the number of tasks completed before their expiration time is maximized. Therefore, we formulate our target problem as follows, where λ 1 , λ 2 and λ 3 are normalization factors that convert the terms to about the same magnitude. The total energy consumption is constrained by the total consumable energy E uav of the UAV, and the computation time is constrained by the tolerance of the tasks. Again, E τ,k and t total τ,k are defined as the energy and time cost at the τ-th time step. F τ,k is a flag value that describes whether the task is processed before its expiration time; it is assigned the value 1 if the total time cost is less than the maximum tolerable time and 0 otherwise, where t total τ,k is the total time consumption defined in equation (13). Note that we could remove the E τ,k term if we only cared about whether the task is processed before its maximum tolerable time and about the response speed. However, we may also want to extend the UAV lifetime by sacrificing response time. Therefore, it is more reasonable to minimize both the time cost and the energy consumption.
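The omitted formulation can be sketched in a standard weighted-sum form; Δ τ,k again stands in for the deadline symbol lost in extraction, and the equation numbering is inferred from the later reference to equation (16):

```latex
\min_{\{d_{\tau,k},\, h_{\tau}\}} \; \sum_{\tau \in \mathcal{T}} \sum_{k \in \mathcal{K}}
  \Big( \lambda_1\, E_{\tau,k} + \lambda_2\, t^{total}_{\tau,k} - \lambda_3\, F_{\tau,k} \Big) \tag{15}
```
```latex
\text{s.t.}\quad \sum_{\tau,k} E_{\tau,k} \le E_{uav}, \quad 0 \le d_{\tau,k} \le 1, \quad
F_{\tau,k} = \begin{cases} 1, & t^{total}_{\tau,k} \le \Delta_{\tau,k} \\ 0, & \text{otherwise} \end{cases} \tag{16}
```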

IV. PROPOSED METHOD
This section presents a Markov Decision Process (MDP) formulation for UAV-assisted MEC and the proposed DRL model to solve the abovementioned target problem.

A. MARKOV DECISION PROCESS SETTINGS
DRL employs deep neural networks to capture the complex states of UAV-assisted MEC and reinforcement learning [17] to make decisions. Therefore, the DRL settings are the same as the reinforcement learning configuration. Specifically, a learning agent interacts with the environment and learns to accumulate long-term expected rewards. The environment describes the real world with a Markov Decision Process (MDP). In this work, we consider an MDP with a finite number of states, and the terminal states are defined as the MEC server being overloaded and unable to process newly arrived tasks. In practice, we can consider an episode terminated when the task waiting time on the MEC server is longer than a threshold. For simplicity of the argument, we assume that the UAV retains sufficient energy to return to its initial position after completing all tasks.

1) STATE SPACE
From the system description, the states are considerably complicated, as they contain the states of the user devices, the task profiles, the network channel distribution, and various parameters of the UAV. We first define that the MDP in each episode is described by a set of states, given as follows: where s τ ∈ S is a general state at time slot τ, and s τ is equal to {f k , f UAV , τ,k , r τ,k , h τ , τ }, where f k and f UAV are the CPU frequencies of the k-th user's device and the UAV, respectively; τ,k is defined as the information of the τ-th task on mobile user k, represented by τ,k = {D τ,k , C τ,k , τ,k }; r τ,k is the current data transfer rate from mobile user k to the UAV; h τ is the UAV position at time step τ; and τ denotes the task queue state in the UAV at time slot τ.
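To feed this heterogeneous state to a neural network, its components must be normalized and concatenated, as the training section later describes. A minimal sketch, with hypothetical scale constants not taken from the paper:

```python
def build_state(f_k, f_uav, task, rate, uav_pos, queue_len, scales):
    """Normalize each state component by an assumed maximum, then concatenate.

    task is (D, C, deadline); scales holds one positive maximum per raw feature.
    """
    raw = [f_k, f_uav, *task, rate, *uav_pos, queue_len]
    assert len(raw) == len(scales)
    return [v / s for v, s in zip(raw, scales)]

# Hypothetical feature maxima; real values depend on the simulation setup.
state = build_state(2.0, 4.0, (8.0, 2.0, 10.0), 6.0, (1.0, 1.0, 100.0), 5.0,
                    scales=[2.0] * 10)
```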

2) ACTION SPACE
The state transitions depend on the actions taken and the MEC network's internal transition probability. Each action contains two parts: deciding the percentage of a task to offload and controlling the next position of the UAV. A general action can be defined as follows, where d τ,k is the percentage of the task offloaded at time step τ. In other words, d τ,k of the task is offloaded to the UAV and 1 − d τ,k is processed on the local device.
h τ +1 decides the next position of the UAV. Note that the MDP environment state changes whenever an action is taken, conditional upon the current state.
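Since a DQN outputs one value per discrete action, the two-part action must be discretized in practice. The binning below (five offload ratios, hover plus four moves) is a hypothetical illustration, not the paper's actual action set:

```python
DIRECTIONS = ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))  # hover and 4 moves
RATIO_LEVELS = 5  # offload ratios 0.0, 0.25, 0.5, 0.75, 1.0

def decode_action(index):
    """Map a flat action index to (offload ratio d, movement direction)."""
    d = (index // len(DIRECTIONS)) / (RATIO_LEVELS - 1)
    move = DIRECTIONS[index % len(DIRECTIONS)]
    return d, move

print(decode_action(0))   # (0.0, (0, 0))
print(decode_action(24))  # (1.0, (0, -1))
```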

3) TRANSITION FUNCTION
We assume the channel gain does not change during the short time slot τ. That means the data transfer rate r τ,k changes when the UAV has moved to a new position, because r τ,k depends on the distance and the path loss; we do not consider channel gain changes while the UAV is flying. Thus, the transition probability of the MEC network can be given as follows, where s and s' are the current state and the next state, and the transitions depend on the actions taken by the DRL agent and the current states. Also, the sum of the transition probabilities is equal to one:

4) REWARD FUNCTION
The feedback from the MEC network to the DRL model, often called the reward in DRL, can be computed when an action has been taken for a set of offloading tasks. Specifically, a reward is described as the number of tasks completed before expiring, denoted F τ, minus the corresponding energy and time consumption, formulated as follows. The agent is rewarded when the offloaded task has been processed before it expires and is punished through the energy and time consumption terms. The energy E τ and time t total τ,k consumption values are smoothed with a logarithm function because the learning model can suffer from the fluctuation of the energy and time consumption if we use the raw values. Moreover, C is a small constant value to encourage the model to keep running and accumulate rewards over time steps. The other parameters have been defined in equation (16).
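A reward of this shape can be sketched as follows; the weights w_e, w_t and the keep-alive bonus are hypothetical constants, since the paper does not report its exact coefficients:

```python
import math

def reward(completed, energy, t_total, w_e=0.1, w_t=0.1, bonus=0.01):
    """Reward sketch: completion flag minus log-smoothed energy/time penalties."""
    # log1p smooths fluctuating raw consumption values, as the text describes
    return completed - w_e * math.log1p(energy) - w_t * math.log1p(t_total) + bonus

print(reward(1, 0.0, 0.0))  # 1.01: completed task with no measured consumption
```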
To maximize the long-term accumulated rewards of the proposed model, each action is evaluated with the expected long-term reward, which can be given as follows, where R τ is the immediate reward and γ G τ +1 is the discounted long-term reward that can be calculated by equation (24). γ ∈ [0, 1] represents the discount for future rewards.
When k = 0, the feedback is the immediate reward R τ . When γ = 1 and k > 0, future rewards are not discounted.
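The recursion G τ = R τ + γ G τ+1 can be computed by folding backwards over a reward sequence, as this small sketch shows:

```python
def discounted_return(rewards, gamma):
    """G_tau = R_tau + gamma * G_{tau+1}, accumulated from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
print(discounted_return([1.0, 1.0, 1.0], 1.0))  # 3.0 (undiscounted)
```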
Since the environment only returns an immediate reward to the learning agent during the interaction and learning, the expected future rewards are often generated with a policy π, which is a sequence of actions corresponding to a set of states. The expected value of taking action a in state s is called the action-value function Q(s, a), and the maximum Q(s, a) is known as the optimal action-value function Q * (s, a). In other words, the target problem is equivalent to finding an optimal policy π * that maximizes the expected long-term reward. In practice, there may be more than one optimal policy, but we do not need to find all of them. We employ the Deep Q-Network (DQN) [18] as the approximator. It seems intuitive to replace the Q-table in the RL algorithm with a deep neural network [41]. However, RL has to learn the right answer from continuous, evaluative, and sequential feedback. In other words, DRL is different from supervised learning: the label data (feedback) of RL comes from its own iterative updates. Therefore, the feedback changes at each iteration. The feedback is a score evaluating the action taken based on the current state. The model can suffer from oscillation during exploitation and exploration because of the noisy feedback from the environment. To address this challenge, DQN uses a target network to back up the Deep Q-Network and fixes its weights (the Fixation method) for a certain number of episodes.
The general process of the offloading system of UAV-assisted MEC and the DRL agent learning process is presented in Fig. 2. First, the users send the profiles of the tasks ready for offloading to the offloading control system, denoted as the Environment in the offloading system. Second, the Environment collects the mobile edge computing network's current states whenever it receives a task profile. Third, the DRL agent, which also resides on the UAV, takes actions based on the observations from the Environment. Fourth, the Environment provides feedback and the next states of the MEC network to the DRL agent corresponding to the action; the feedback can be considered an evaluation of the action and is also known as a reward. Note that there are two deep neural networks in the DRL agent, called the local network and the target network, respectively. The local network takes actions for the Environment. Finally, the control agent (Environment) executes the action, which decides the proportion of the task to offload to the UAV and drives the UAV to its new location.
The training algorithm is shown in Alg. 1 and proceeds as follows. First, the experience replay buffer stores the collected data. As shown in Fig. 2, the system generates a record whenever the agent takes an action and interacts with the environment. Each record contains the current state s τ , the action a τ taken, the reward r τ , and the next state s τ +1 , formed as the tuple < s τ , a τ , r τ , s τ +1 >. The experience replay buffer is a queue-like buffer with a fixed length; new records are stored in the buffer, and the oldest records are discarded in favor of the latest records when the buffer is full. The experience replay buffer is crucial for DRL to learn and converge to robust policies, because it is wasteful to use a sample only once and throw it away, as in conventional RL. In addition, training a deep learning model with multiple passes over the data is very common; the number of epochs in deep learning defines how many times the model is trained on the same samples. The model can converge faster and learn from rare samples that are considerably important for converging to a robust policy. Moreover, although we formulate the MEC network environment within an MDP framework, we prefer to decouple the sequential dependencies during learning and interaction. Finally, we can reduce the noise of the training samples by drawing batches of samples from the experience buffer instead of using a single training sample at a time.
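The fixed-length, queue-like buffer described above maps naturally onto a deque; a minimal sketch (class and field names are illustrative, not from the paper's code):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

class ReplayBuffer:
    """Fixed-capacity experience replay: oldest records drop out when full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform mini-batch sampling decouples sequential dependencies
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

Because `deque(maxlen=...)` evicts the oldest element automatically, the "discard the oldest, keep the latest" behavior needs no extra bookkeeping.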
Second, the deep neural network (Deep Q-Network) representing the Q-value function is defined. Note that the input and output sizes of the network are set equal to the state space and action space, respectively, because the input is the state and the outputs are the estimated values of the available actions.
The target network is created by copying the Deep Q-Network. The two copies of the network serve different purposes during training. The first copy is often called the local network and is responsible for interacting with the environment and generating the training data samples. The target network is crucial for training because it can prevent the learning model from suffering oscillation caused by the noisy feedback from the environment. During training, the model tries to minimize the loss between the Q-values from the target network and the local network. Let Q * (s', a'; θ −) be the optimal value from the target network; the model uses the target network values to supervise the local network, and the loss function of the Deep Q-Network is the error between this target value and the local network's Q-value.

Third, a score window is initialized to smooth the reward score. Given the complexity of the mobile edge computing network and the noisy feedback, the rewards are still considerably noisy; therefore, it is more reliable to evaluate the model with the average of all the rewards in the current window. The score window is a queue that is updated along with the training; it discards the oldest score and keeps the latest scores when the queue is full.

Algorithm 1 E2DQN-Learning for UAV-MEC
Input: epoch_no, ε_start, ε_end
Output: loss, gradients
// 1. Initialization:
Initialize replay memory D with capacity N;
Initialize action-value network Q with random weights θ;
Initialize target action-value network Q̂ with weights θ− ← θ;
Initialize the rewards window size;
Reset the UAV-MEC environment;
Obtain the initial raw state s1 and preprocess it: s1 ← φ(s1);
for time step τ ← 1 to Tmax do
    // 2. Generate training data:
    Select action a based on sτ using π ← ε-Greedy(Q(sτ , a, θ));
    Take action a to obtain the reward r and the next state sτ+1;
    Preprocess the next state: sτ+1 ← φ(sτ+1);
    Save the experience tuple (sτ , a, r, sτ+1) into replay memory D;
    // 3. Training:
    Draw a mini-batch of experiences (sj , aj , rj , sj+1) from D;
    if sj+1 is terminal then set target Ȳj ← rj;
    else set target Ȳj ← rj + γ maxa' Q(sj+1, a', θ−);
    Update the local network θ ← θ − δ∇θ L(θ) with Adam;
    Soft-update the target network: θ− ← ρθ + (1 − ρ)θ−;
    ε ← max(εend, ε · decay);
    Store the score for the current episode;
end for

Fourth, the algorithm starts to train the model by starting an episode. We define an episode as ended when the UAV server is full, which indicates the waiting time is longer than a threshold. We reset the MEC network environment whenever we start a new episode. As described in the previous section, we want the model to keep the UAV server running without overload. Before the state features are fed into the model, they are preprocessed. The feature value scales are considerably different, which can mislead the model into biasing toward features with large numerical values while ignoring critical features represented by small numerical values. Therefore, we normalize the components of the state before concatenating them together as one state input to the model. Fifth, the agent starts to interact with the mobile edge computing environment and the local network by incorporating the ε-greedy algorithm. Each interaction produces an experience tuple, including the current state, action, reward, and next state, denoted as (S, A, R, S'); these experience tuples are collected and stored in the experience buffer for training the local network. Note that the learning agent chooses the best action given the current state and policy with probability 1 − ε and takes a random action with probability ε.
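The ε-greedy selection step can be sketched as follows:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise act greedily on Q-values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the agent always exploits the best-valued action:
print(epsilon_greedy([0.1, 0.9, 0.2], 0.0))  # 1
```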
To balance exploration and exploitation, ε decays over time because we want the model to spend more time exploring the environment at the beginning of training than in later episodes. As the model gains more knowledge about the environment, we want the learning agent to rely more on its experience than on exploration.
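A multiplicative decay with a floor, as in the ε ← max(ε_end, ε · decay) step, is:

```python
def decay_epsilon(epsilon, eps_end, decay):
    """Shrink epsilon geometrically but never below the floor eps_end."""
    return max(eps_end, epsilon * decay)

eps = 1.0
for _ in range(10):
    eps = decay_epsilon(eps, 0.05, 0.5)
print(eps)  # 0.05: clamped at the floor after repeated decay
```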
Finally, the agent draws sample batches of experience tuples from the replay buffer and trains the local network. As mentioned above, the learning agent tries to minimize the loss (error) between the outputs of the local network and the target network. The weights of the local network are updated at every step of gradient descent. In the original DQN method, the algorithm only updates the target network every N steps by overwriting its weights with the weights of the local network. In this work, we adopt the soft-update method introduced by Lillicrap et al. [42] to update the target network smoothly, instead of the fixed-interval method that performs one large update every N steps. Concretely, the target network is updated toward the local network weights by a small mixing factor ρ. We adopt the Adam [43] algorithm to optimize the loss function and update the weights of the local network. Note that the training-data generation block in the above step and the training steps do not have to alternate strictly one after the other. Therefore, we can run several training steps per data-generation step, and they can also run simultaneously.
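A minimal sketch of the soft-update step, assuming the standard Polyak-averaging form from Lillicrap et al. with mixing factor ρ (the weights are represented as plain lists here for clarity; in the actual PyTorch implementation this would iterate over the networks' parameter tensors):

```python
def soft_update(local_weights, target_weights, rho=0.001):
    """Soft target-network update: theta_minus <- rho * theta + (1 - rho) * theta_minus.

    With a small rho, the target network tracks the local network slowly,
    which stabilizes the bootstrapped Q-learning targets. rho = 0.001 is an
    assumed illustrative value.
    """
    return [rho * l + (1.0 - rho) * t
            for l, t in zip(local_weights, target_weights)]
```

By contrast, the original DQN's hard update corresponds to the special case ρ = 1 applied only once every N steps.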
In summary, the algorithm first initializes the replay buffer, the local network, the target network, and the score window. Moreover, the proposed DRL model resets the MEC network at every episode, and the learning agent interacts with the environment, generates training data, and stores it in the experience buffer. The DRL agent draws sample data from the experience buffer to train the local network, and the target network is softly updated toward the local network at every step.

V. SIMULATION RESULTS
In this section, we present the details of the simulation configuration and results. We adopt Python as the programming language and PyTorch to build the proposed DRL model. The simulator has three main components: the system environment, the UAV-assisted MEC, and the DQN agent. The system environment describes the whole MEC offloading environment as an MDP. The UAV-assisted MEC component simulates the MEC network. The DRL model takes actions in the MDP environment of the UAV-assisted MEC. The proposed DRL model is compared with the greedy algorithm, reinforcement learning (Q-learning), and an existing DRL model. Note that the existing reinforcement learning and DRL models have to incorporate traditional optimization to achieve joint optimization.

A. SIMULATION SETUP
The UAV-assisted MEC is constructed from various MEC entities, including the UAV, the users, and the network environment. Further, the environment wraps the UAV-assisted MEC and provides the interface to the DRL agent. For simplicity, we refer to the UAV-assisted MEC and its environment wrapper as the MEC environment in the following sections. The initial position of the UAV is [10, 10, 25], with a frequency of 16 × 10^9 Hz. The queue expiration time and the task number constraint are 10 seconds and 500, respectively. The MEC environment is considered overloaded when the task waiting time in the UAV task queue is longer than 10 seconds or the number of tasks in the UAV queue exceeds 500. The simulator terminates the current episode and starts a new one when the MEC environment is overloaded. The key parameter settings are listed in the parameter table. Next, we present the neural network settings used for the DRL agent to optimize the target problem. Our goal is to maximize the number of tasks completed before they expire and to minimize the energy and time consumption by optimizing the UAV trajectory and the offloading ratio. We input the state of the UAV, the users, and the environment into the neural network and output the combination of the UAV position and the offloading ratio. The input parameters include the task transmission speed, the UAV's current queue waiting time, the UAV's current position and frequency, the position and frequency of the currently served user, and the data size, computation cycles, and expiration time of the task. There are nine action options for moving the UAV to control its trajectory; further, we discretize the offloading ratio [0, 1] into five intervals. Therefore, the state space (neurons in the input layer) and the action space (neurons in the output layer) are 11 and 45, respectively. Besides the input and output layers, we also build five hidden layers, which have 256, 512, 1024, 512, and 256 nodes, respectively.
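The flat action space of 45 = 9 × 5 described above can be decoded into a (trajectory move, offloading ratio) pair. The mapping below is a sketch under the assumption that the five ratio levels are evenly spaced over [0, 1]; the paper does not specify the exact discretization:

```python
NUM_MOVES = 9    # discrete UAV trajectory options
NUM_RATIOS = 5   # offloading ratio [0, 1] discretized into five levels

def decode_action(action_index):
    """Map a flat action index in [0, 44] to (move, offloading_ratio).

    Assumes ratio levels 0.0, 0.25, 0.5, 0.75, 1.0 (an illustrative choice).
    """
    if not 0 <= action_index < NUM_MOVES * NUM_RATIOS:
        raise ValueError("action index out of range")
    move = action_index // NUM_RATIOS
    ratio = (action_index % NUM_RATIOS) / (NUM_RATIOS - 1)
    return move, ratio
```

This keeps the network's output layer a single softmax-free vector of 45 Q-values while still letting one action jointly set the trajectory step and the offloading ratio.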
During the training process, the batch size and the buffer size are set to 256 and 10^5, respectively.
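The experience buffer with these settings can be sketched as a fixed-size deque with uniform mini-batch sampling (the class name and method signatures below are our own illustrative choices, not the paper's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience buffer with uniform mini-batch sampling.

    Default sizes match the paper's settings: buffer 10^5, batch 256.
    """
    def __init__(self, buffer_size=10**5, batch_size=256):
        self.memory = deque(maxlen=buffer_size)  # oldest tuples drop off automatically
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self):
        """Draw a uniform random mini-batch of stored experience tuples."""
        return random.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Uniform sampling from a large buffer breaks the temporal correlation between consecutive transitions, which is what makes the gradient updates on the local network stable.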
Moreover, the DQN has to balance exploration and exploitation to find the optimal policies. In this simulation, we adopt the ε-greedy algorithm, where the agent randomly generates a number in the range [0, 1]. When the generated number is greater than ε, the agent exploits the learned knowledge, which means the agent takes the best action according to the current optimal policy. On the other hand, the agent takes a random action to explore the environment if the generated number is less than or equal to ε. The ε starts from 1.0 and gradually decays to 0.001. Finally, the future expected rewards are discounted with γ, which is set to 0.9.
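The ε-greedy rule just described is a few lines of code; this sketch operates on a plain list of Q-values for one state (in the actual model these would come from a forward pass of the local network):

```python
import random

def epsilon_greedy(q_values, eps):
    """Return the greedy action with probability 1 - eps, else a random action.

    q_values: Q-value estimates for every action in the current state.
    """
    if random.random() > eps:
        # Exploit: index of the highest Q-value.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Explore: uniform random action.
    return random.randrange(len(q_values))
```

With ε = 1.0 every action is random (pure exploration, as at the start of training); with ε at its floor of 0.001 the agent is almost always greedy.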

B. RESULTS
To demonstrate the efficiency and robustness of the proposed DRL model, we compare the proposed with existing works given the aforementioned parameter settings. The UAV-assisted MEC system performance is impacted by various factors, such as the number of mobile users, the MEC environment traffic, and different parameter settings of the DRL models. Therefore, we also show the performance of the model in different parameter settings.
To determine the heuristic settings and performance improvements of the proposed end-to-end DRL model (E2DQN), we run 2,000 episodes to compare the performance of the model with different ε values. Fig. 3 shows that the smaller the ε, the less exploration and the faster the model converges to its local optimum. However, the model with ε = 0.99 achieves the highest performance, and ε = 0.7 ends up with the lowest performance in this simulation. The models with large ε values take more time exploring the environment than those with small ε. Although a model with a larger ε can achieve higher performance, a large ε is not always the best option because the exploration costs computational power and takes longer to converge to the optimal policies. Nevertheless, we choose ε = 0.99 because it collects more rewards than the others.
To verify the performance of the proposed DRL model, we compare it with the greedy algorithm (Greedy), reinforcement learning (Q-learning), and the existing DRL model (DQN) in terms of collected rewards. In this work, each reward contains weighted values of the number of completed tasks and the energy consumption. As shown in Fig. 4, the proposed model (E2DQN) achieves the highest performance because it can learn to optimize the target problem in an end-to-end manner. Contrarily, the existing DRL model (DQN) and reinforcement learning model (Q-learning) require further steps to optimize multiple objectives. Specifically, the DQN and Q-learning agents learn to control the trajectory and rely on a classical optimizer to optimize the task offloading ratio. The classical optimization steps often have less room to optimize based on the outputs (actions) generated by the learning agent, which compromises the performance of these models. The learning agents are learning ''knowledge'' and exploring the system environment in the early period, while the greedy algorithm always selects the locally optimal actions to collect rewards. Therefore, the learning agents have lower performance than the greedy algorithm in the early period. As they accumulate sufficient ''experience'', the learning agents consider the long-term reward and take optimal actions, whereas the greedy algorithm always selects the local, short-term optimal actions. Gradually, the learning agents outperform the greedy algorithm. As shown in Fig. 4, Fig. 5, and Fig. 6, the Q-learning and existing DQN models converge to their optimal policies faster than the proposed model. The existing models take fewer time steps to learn because Q-learning cannot capture a complex state space; moreover, the existing models only learn to optimize a single target, and the rest of the optimization steps are processed by classical optimization methods such as CVX and MIP solvers. Therefore, the existing learning methods have relatively small action spaces.
However, Q-learning and the existing DQN method end up with relatively lower performance than the proposed method because it is challenging to ensure that the collaboration between learning agents and optimization methods converges to globally optimal policies for joint optimization problems. As we can see from Fig. 4, Q-learning and the existing DQN converge at about 750 and 1,000 episodes, whereas the proposed model converges to its optimal policies at around 1,500 episodes.
Further, we compare the proposed model with the other methods in terms of the reward components (sub-objectives). As shown in Fig. 5, the proposed model can process more tasks than the existing DQN model, Q-learning, and the greedy algorithm. Moreover, Fig. 6 shows that the proposed model consumes less energy than the other models. Again, the proposed method can optimize the objectives in an end-to-end manner to generate actions that optimize the UAV trajectory and the task offloading ratio simultaneously. The proposed method can process more than 250 tasks, the existing DQN method can process about 175 tasks, and Q-learning and the greedy algorithm can process about 148 and 50 tasks, respectively. The proposed model saves energy and increases the number of processed tasks by optimizing the UAV trajectory and the task offloading ratio simultaneously.
Finally, Fig. 7 shows the scalability of the models, and the proposed model outperforms the DQN and Q-learning models. The learning models can process more tasks as the number of users increases; on the contrary, the greedy algorithm scales extremely poorly because it has no mechanism for planning to accumulate expected long-term rewards. Note that the greedy rewards decrease when more tasks are offloaded in a short time, which can lead to the server being overloaded and end the current episode; the earlier the episode ends, the fewer rewards the agent can collect. The greedy algorithm would probably maintain the same rewards if episodes were never reset in the simulation. On the other hand, the learning algorithms can collect more rewards in an episode because they optimize the offloading ratio with respect to long-term rewards and maintain the stability of the MEC server.

VI. CONCLUSION
In this paper, we have investigated the deployment of UAV-assisted MEC networks to temporarily serve multiple users on short notice. Further, we proposed a deep reinforcement learning model to learn and optimize task offloading and UAV trajectory control in an end-to-end fashion. The proposed model controls the proportion of offloaded tasks and the UAV trajectory; it takes actions to optimize multiple targets, including the computing latency and the energy consumption of the UAV-assisted MEC network. Specifically, the agent tries to maximize the number of tasks processed before they expire while simultaneously minimizing energy and time consumption. The proposed model is an end-to-end model that does not require further optimization based on its output. The results show that the proposed model outperforms the existing methods. For future work, we will investigate multi-UAV collaboration with task partitioning.