A Task Offloading Algorithm With Cloud Edge Jointly Load Balance Optimization Based on Deep Reinforcement Learning for Unmanned Surface Vehicles

Unmanned Surface Vehicles (USVs) generate large volumes of data that must be processed in real time while they work, but they are usually limited in computational and battery resources, so they need to offload tasks to the edge for processing. However, when numerous USVs offload their tasks to edge nodes, some offloaded tasks may be dropped due to queuing timeouts. Existing task offloading methods generally consider the latency or overall system energy consumption of collaborative processing at the edge and end layers, but not the energy wasted when tasks are dropped. To address this situation, this paper establishes a task offloading model that minimizes long-term task latency and energy consumption by jointly considering the requirements of latency- and energy-sensitive tasks and the overall load dynamics across the cloud, edge, and end layers. A deep reinforcement learning (DRL)-based Task Offloading with Cloud Edge Jointly Load Balance Optimization algorithm (TOLBO) is proposed to select the best edge or cloud server for offloading. Simulation results show that the algorithm improves the energy utilization of cloud and edge nodes compared with other algorithms, while significantly reducing the task drop rate, average latency, and energy consumption of end devices.


I. INTRODUCTION
The Unmanned Surface Vehicle (USV) is a kind of intelligent surface ship capable of autonomous navigation and mobile surveillance tasks. Compared with traditional boats, USVs have the advantages of low cost and multiple functions. They can perform tasks in complex and hazardous water environments, which opens up a wide range of application scenarios in maritime cruising [1], environmental monitoring [2], anti-submarine operations, and other fields.
The complex and changing ocean environment, unpredictable navigational emergencies, and multi-source sensor data all place high demands on the intelligent processing capability of USVs. In recent years, deep learning (DL) has achieved great success in various fields, which also provides an opportunity to enhance the intelligence of USVs. For example, USVs can effectively extract features from sensor data through convolutional neural networks, which accelerates their progress toward intelligence but places high demands on their storage and computation capabilities. However, limited by their local computing ability and power supply, USVs cannot handle these resource-intensive intelligent tasks quickly [3], so most such tasks are still handled by traditional cloud-based centralized solutions. This approach provides scalable resources, but cloud servers are often far from the data source, and the considerable communication overhead and latency caused by traffic congestion become the critical bottleneck of the cloud computing model. Driven by this problem, edge computing is emerging as a practical solution that distributes computing and storage resources near IoT devices [4]. Thus, the data generated by USVs can be offloaded directly to nearby edge servers, which significantly reduces data transmission latency and alleviates network congestion. (The associate editor coordinating the review of this manuscript and approving it for publication was Md Arafatur Rahman.)
Although edge computing has many advantages, existing studies have shown that there are significant challenges in guaranteeing stable operation of edge servers over long periods [5], such as satisfying multiple user requests with high-performance requirements at the same time. With the increase of computational tasks, the limited resources in some edge nodes cannot handle such computational tasks, which will lead to queuing of some tasks for execution at the edge nodes [6]. In contrast, some other computational resources may remain idle or non-busy, thus leading to an imbalance in the edge server load. Therefore, we must ensure load balancing at the edge nodes to improve the quality of user experience (QoE) and satisfy the quality of service (QoS) requirements. In addition, it is another major challenge to determine whether or where to offload tasks in multi-edge network architecture, limited by the computing ability and power of mobile terminals.
Scholars have investigated several approaches to overcome some of these challenges. Most jointly consider task latency and energy consumption in task offloading decisions [7]-[11]. However, only the delay and energy consumption of completed tasks are considered; the energy wasted when tasks are dropped after computation or offloading, because they cannot meet their latency requirements, is left out of consideration, which may lead to wasted battery power and reduced USV working hours. For example, Thai et al. [7] proposed a new edge computing network architecture that enables edge nodes to collaborate and share computational and wireless resources. Jie et al. [8] offloaded tasks to multiple edge servers via deep reinforcement learning to reduce energy consumption and average computational latency. However, these studies tend to consider collaboration only between the edge and end layers, or end-to-end. Therefore, this study proposes a DRL-based offloading algorithm in a cloud-edge collaboration scenario, TOLBO (Task Offloading with Cloud Edge Jointly Load Balance Optimization). Its offloading decisions are based on the task size of the USV; the queue information of the cloud, edge, and USVs; and the load level of the cloud and edge nodes. It can therefore improve the resource utilization of cloud and edge servers, especially their energy utilization; in other words, when computing tasks are intensive, the task drop rate can be minimized. The main contributions are as follows:
• A task offloading problem model is developed for latency-sensitive and energy-sensitive tasks. The model considers the load-level dynamics of the cloud and edge nodes to minimize the expected long-term cost of a task, comprising the task's delay, the penalty for a dropped task, and the energy consumption required to process the task.
• In order to achieve the desired long-term cost minimization, a DRL-based cloud edge collaborative joint optimization offloading algorithm named TOLBO is proposed.
• Simulation experiments show that TOLBO can better utilize the processing ability of the cloud-edge-end layers. Compared to other algorithms, it can significantly reduce the task drop rate, the average latency, and the energy consumption of mobile terminals.
The rest of this paper is organized as follows. Section II describes the current related work. Section III presents the system model. Section IV presents the design of the DRL-based offloading algorithm. Section V describes the experimental setup and the performance evaluation of the algorithm. Finally, we summarize the full paper in Section VI.

II. RELATED WORKS
The main goal of USV's task offloading is to minimize task completion time and reduce the energy consumption of mobile terminals.
Completion time is generally defined as the time duration from submitting a request to receiving the task result, and it is often considered as a QoS metric in the design of offloading methods. The task completion time is affected by many factors such as waiting delay, transmission delay, and computation delay.
Due to the limited battery capacity of various mobile devices, the energy consumed by a device to complete its tasks must also be considered. Energy consumption mainly comprises data processing and data transmission energy. For both types, existing related works generally use simple linear models in which the energy consumed increases linearly with the amount of data processed or transmitted.
Since task offloading requirements are diverse, the offloading decision is usually a multi-objective optimization problem. A common solution is to transform the multiple objectives into a single objective, either by a weighting method or by treating the most important objective as the optimization target and the others as constraints. The optimal task offloading problem thus reduces to selecting the right task to offload to the right edge server at the right time, improving application performance while efficiently utilizing edge server resources. This is an NP-hard multi-objective optimization problem [12], for which only an approximately optimal solution can be found.
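The weighting method mentioned above can be sketched in a few lines; the weight `eta` and the normalization constants below are illustrative placeholders, not values from this paper:

```python
def scalarize(delay, energy, eta=0.5, beta_d=1.0, beta_e=1.0):
    """Collapse the two objectives (delay, energy) into one scalar cost
    via a weighted sum; eta trades delay off against energy."""
    return eta * beta_d * delay + (1 - eta) * beta_e * energy

# Whether a decision that halves delay at double energy "wins"
# depends entirely on the chosen weight:
a = scalarize(delay=4.0, energy=1.0)  # cost 2.5
b = scalarize(delay=2.0, energy=2.0)  # cost 2.0
```

With `eta` closer to 1 the scalarized cost favors latency-sensitive tasks; closer to 0 it favors energy-sensitive ones.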
In recent years, many scholars have researched the offloading decision problem to optimize latency and energy consumption. Thai et al. [7] proposed a new edge computing network architecture that enables edge nodes to share computational and wireless resources collaboratively. They formulate the joint task offloading and resource allocation problem as a mixed-integer nonlinear program (MINLP) and solve it using interior-point and branch-and-bound methods; experimental results demonstrate that the proposed method minimizes total energy consumption while satisfying all delay requirements for mobile users. Du et al. [13] modeled the multi-objective optimization problem as a mixed-integer nonlinear program containing two subproblems, task offloading decision and resource allocation, and solved the resource allocation subproblem using fractional programming and Lagrangian dual decomposition; however, they did not consider the waiting delay before a task executes. Sun et al. [14] proposed a DBO-based online learning algorithm that jointly uses projected dual-gradient iteration and greedy methods to minimize the task computation and communication delay at the user terminal, and demonstrated its long-term effectiveness experimentally. In large-scale offloading scenarios, traditional methods such as convex approximation [15], game theory [16], and metaheuristics [17] usually require complex mathematical modeling of the system. They therefore have difficulty coping with the computational burden caused by the exponential growth of the search space, and may find locally rather than globally optimal offloading decisions.
In contrast, Reinforcement Learning (RL) can learn optimal offloading policies by interacting with edge networks, achieving model-free control without knowledge of the system's internal transitions. Reinforcement learning methods are therefore widely used in offloading [18]. Zhou et al. [19] proposed a Q-Learning-based spectrum access scheme in mobile networks that enables mobile users to access the optimal spectrum and maximize the transmission rate. Dab et al. [9] proposed the QL-JTAR algorithm to minimize the energy consumption of mobile terminals under application delay constraints. Traditional reinforcement learning stores tuples of states, actions, and rewards in Q tables, which is not applicable when the state or action space is enormous. Deep Reinforcement Learning (DRL) solves this problem by integrating neural networks into reinforcement learning to approximate Q values. DRL for joint optimization problems has therefore attracted increasing interest from researchers, as complex offloading decision problems can be solved more efficiently with this approach. Jie et al. [8] offloaded tasks to multiple edge servers by deep reinforcement learning to reduce energy consumption and average computational latency; the offloading strategy is further optimized by incorporating the transmission rate and battery power into the decision conditions during learning, improving the service performance of the whole edge system. Lu et al. [10] modeled a large-scale heterogeneous MEC environment and proposed the IDRQN algorithm, based on LSTM, to address the shortcomings of DRL; experiments verify its validity. Similarly, Meng et al. [11] proposed an online DRL algorithm that reduces the queuing delay of end users by designing a new action space and reward function to trade off mobile-device energy consumption against task delay.
To better compare the working scenarios and optimization dimensions of the methods mentioned above, we compare all of them in Table 1.
From the table, we can see that current work usually considers only collaboration among edge servers, ignoring cloud servers with their large amounts of computing resources. Moreover, none of these works considers the computational resources and energy wasted when tasks are dropped. Therefore, to deal with large-scale edge computing scenarios, we propose the DRL-based TOLBO algorithm, which takes task dropping into account and utilizes the computing resources of cloud and edge servers to make the best decision.

III. SYSTEM MODEL
In this section, we first give an overview of the system model and then describe the task model, delay model, energy model, and problem modeling successively.
As shown in Fig. 1, a quasi-static scenario is considered in this paper: during task offloading, the states of all mobile end devices and wireless networks do not change. At each time slot, a set of USVs U = {u_1, u_2, u_3, ..., u_N} is connected wirelessly to the edge servers, where N and K denote the number of USVs and the number of edge servers, respectively. Meanwhile, RSUs are wired to remote cloud servers to transfer task data and computation results.
Each heterogeneous edge server k can be represented as M_k = (Q^mec_k, f^mec_k), where Q^mec_k denotes its task queue information and f^mec_k denotes its maximum CPU frequency. Similarly, the cloud server can be denoted as C = (Q^cloud, f^cloud).
We consider that the tasks generated by USVs are indistinguishable and that each USV generates at most one task per time slot. At the beginning of a time slot, when a task arrives, it can be selected for local processing or offloaded to an edge node or a cloud server. As shown in Fig. 2, if the USV decides to process the task locally, the task is added to the queue waiting for computation (i.e., the computation queue). Otherwise, the task is placed into the queue waiting for transmission (i.e., the transmission queue) according to the offloading destination. If the task is offloaded to the cloud server, it is first uploaded to the transmission queue of the RSU. All transmission and computation queues are treated as FIFO queues, assuming that the current task finishes in the current time slot and the next task is processed at the beginning of the next time slot. The RSU, the edge server, and the cloud server each maintain a queue per user (Section III.B).
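The queue placement described above can be sketched as follows; the class and method names are illustrative, not from the paper:

```python
from collections import deque

class USV:
    """Each device keeps a FIFO computation queue (local execution)
    and a FIFO transmission queue (offloading); a cloud-bound task
    also passes through the RSU's transmission queue later."""
    def __init__(self):
        self.comp_queue = deque()   # tasks to process locally
        self.tran_queue = deque()   # tasks to upload toward edge/RSU

    def route(self, task, decision):
        # decision: 'local', 'edge', or 'cloud' (cloud goes via the RSU)
        if decision == 'local':
            self.comp_queue.append(task)
        else:
            self.tran_queue.append(task)

u = USV()
u.route('task0', 'local')   # joins the computation queue
u.route('task1', 'cloud')   # joins the transmission queue
```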

A. TASK MODEL
At the beginning of time slot t ∈ T, if USV user u_i ∈ U has a new task arriving, we define this task as Λ_{i,t} = (D_{i,t}, T^max, T_{i,t}), where D_{i,t} is the size of the task, T^max is its maximum tolerated delay (in time slots), and T_{i,t} is its generation time slot; if the task has not been completed by time slot T_{i,t} + T^max − 1, it will be dropped.
When a task arrives, an offloading decision must be made for task Λ_{i,t}. To facilitate the later energy consumption calculation, we define a one-hot encoded variable d_u(t) to indicate whether the task is executed locally or offloaded to an edge node or a cloud server: d_u(t) = [1, 0, 0] means local execution, d_u(t) = [0, 1, 0] means execution on an edge server, and d_u(t) = [0, 0, 1] means offloading to the cloud server. When a task is offloaded to the edge, we define a binary variable O_{m,u}(t) ∈ {0, 1} to indicate whether the task generated by USV user u at time slot t is offloaded to edge server m: O_{m,u}(t) = 1 means it is offloaded to edge server m, and 0 otherwise. Because tasks are non-separable and a task is not offloaded to multiple edge servers, O_{m,u}(t) is restricted by constraint (1).

B. DELAY MODEL
1) COMPUTATION DELAY
When task Λ_{i,t} is selected for local processing, it is placed in the computation queue. At the beginning of time slot t, if the local computation queue is not empty and the USV is not processing a task, the task at the head of the queue enters computation. To facilitate calculating the actual energy consumption of the task, we define T^comp_u(t) as the number of time slots after which task Λ_{i,t} has been processed or dropped. We set the minimum resolution of the task execution time to one time slot by default. Given T^comp_u(t) and the queue length Q^comp_u(t), the queue length at the next time slot is Q^comp_u(t + 1) = Q^comp_u(t) + T^comp_u(t) − 1. For example, at time slot 0, suppose the tolerated latency T^max of task Λ_{u,0} is 10 time slots and the task needs 3 time slots to be fully processed; the queue length is then Q^comp_u(1) = 0 + 3 − 1 = 2. We can also calculate the total latency T^local_{u,t} for a task processed locally until it is completed or dropped.
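The local queue update can be sketched as below; the update rule is reconstructed from the worked example in the text and should be read as an assumption:

```python
# One-hot offloading decision, as in the text.
DECISIONS = {'local': [1, 0, 0], 'edge': [0, 1, 0], 'cloud': [0, 0, 1]}

def next_queue_len(q_t, proc_slots, t_max=10):
    """Queue length at slot t+1: the arriving task adds at most t_max
    slots of backlog (it is dropped after that), minus the one slot
    consumed during slot t (assumed reconstruction of the update rule)."""
    return q_t + min(proc_slots, t_max) - 1

# Worked example from the text: empty queue, task needs 3 slots -> Q(1) = 2.
assert next_queue_len(0, 3) == 2
```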
When a task is executed on an edge server, the server receives the tasks offloaded by each USV user. To ensure fairness across tasks, we consider that the resources allocated to each USV user on the same edge server are equal. Assume that at time slot t there are G_m(t) USV users' tasks being computed on edge server m; the computational resource share of each USV user is then f^edge_m / G_m(t). Note that G_m(t) changes between time slots as tasks finish or new tasks arrive, so the computation queue information on the edge server also changes over time.
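The equal-sharing rule can be illustrated directly; the function name and the idle-server behavior are hypothetical:

```python
def per_user_cpu(f_edge, active_users):
    """CPU frequency each USV's task receives on edge server m in slot t:
    the server capacity f^edge_m is split evenly among the G_m(t)
    active tasks. An idle server (G_m(t) = 0) grants full capacity
    to the next arrival -- an assumption for this sketch."""
    if active_users == 0:
        return f_edge
    return f_edge / active_users

# A 4 GHz server shared by 4 tasks gives each task 1 GHz.
assert per_user_cpu(4e9, 4) == 1e9
```

Because G_m(t) is recomputed every slot, the same task can receive a different share in consecutive slots, which is why the completion slot cannot be derived from a closed-form queue length.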
Here we assume that the tasks generated by USV users at time slot t are offloaded to the edge server within time slot t; thus, we can define Λ_{u,m,t} = Λ_{u,t}.
Since the computational resources allocated by the edge server to different USV users change dynamically with the number of tasks on the server, the queue information on the edge server cannot be derived by a simple calculation. We therefore define q^edge_{u,m}(t) to denote the queue information of user u on edge server m. To calculate the latency of tasks on the edge server, we define l^edge_{u,m}(t) ∈ T as the time slot at which task Λ_{u,m,t} is processed to completion or dropped, and l̂^edge_{u,m}(t) ∈ T as the time slot at which task Λ_{u,m,t} starts to be processed: task Λ_{u,m,t} starts only after all tasks that arrived before time slot t have been computed or dropped. The completion slot l^edge_{u,m}(t) is the first slot at which the cumulative computation performed by the edge server reaches the task data size: at slot l^edge_{u,m}(t) − 1 the computed amount is still less than the task data size, while at slot l^edge_{u,m}(t) it is greater than or equal to it; the computation delay T^edge_{u,t} follows directly. Meanwhile, the cloud server can be regarded as an edge server with more computing power, so its queue information and computational latency are defined analogously as q^cloud_{u,m}(t) and T^cloud_{u,t}.

2) TRANSMISSION DELAY
The transmission model is also a FIFO queue. Assuming the transmission power of USV user u is P^tran_u, the maximum transmission rate in an additive white Gaussian noise (AWGN) channel [20] is given by the Shannon capacity, where B denotes the channel bandwidth, d the distance between the USV user and the RSU, and N_0 the noise power. The transmission model also uses a Rayleigh fading channel model, with h the channel fading coefficient and β the path-loss exponent. Similar to the computation queue, at the beginning of a time slot, if the transmission queue is not empty and USV user u is not transmitting, the task at the head of the queue starts transmission. We also set the minimum resolution of the task transmission time to one time slot by default, which gives the actual transmission time of a task from USV user u to edge server m and, from it, the transmission queue Q^tran_u(t + 1) for the next time slot. The remaining tasks are offloaded to the cloud server for execution: they are first offloaded to the roadside unit (RSU) and then forwarded to the cloud server. Since RSUs and cloud servers are typically connected by wired links such as optical fiber, the task data size is negligible relative to the transmission rate, and only the per-transmission overhead needs to be considered. We define this overhead delay as a constant r^tran_RSU, which gives the actual time for the RSU to offload a task to the cloud server and the corresponding transmission queue for the next time slot.
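The transmission-rate and transmission-time computations described above can be sketched as follows, assuming the standard Shannon-capacity form with Rayleigh fading and path loss (the original equation is not reproduced in this text, so this is a plausible reconstruction):

```python
import math

def max_tx_rate(bandwidth, p_tx, h, dist, beta, n0):
    """Shannon capacity of the USV-to-RSU link: received power is the
    transmit power P^tran_u attenuated by the Rayleigh fading gain h
    and the path loss d^-beta, divided by the noise power N_0."""
    snr = p_tx * h * dist ** (-beta) / n0
    return bandwidth * math.log2(1 + snr)

def tx_slots(task_bits, rate, slot_len):
    """Whole time slots needed to upload a task of D bits
    (minimum resolution: one time slot, as in the text)."""
    return max(1, math.ceil(task_bits / (rate * slot_len)))
```

With unit bandwidth and SNR 1 the rate is exactly 1 bit/s (log2 of 2), so a 2500-bit task at 1000 bit/s and 1 s slots occupies 3 slots.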

3) TASK DELAY
Depending on the offloading decision, the total delay of a task completed or dropped by the USV user at time slot t is defined as T_{u,t}, which selects T^local_{u,t}, T^edge_{u,t}, or T^cloud_{u,t} according to the decision d_u(t).

C. ENERGY MODEL
Since USVs need to execute tasks in outdoor water environments, they carry limited battery capacity, whereas edge and cloud servers are powered by fixed supplies, so server energy consumption is negligible. Therefore, in this paper, our primary goal is to reduce the energy consumption of USVs in processing tasks. This energy consumption consists of two main parts: when tasks are executed locally, we only consider the energy consumed during execution; when a task is offloaded to an edge or cloud server, we mainly consider the transmission energy.
Computational energy consumption: In subsection III.B.1 we noted that the computational capability of USV user u is f^usv_u, and that for task Λ_{u,t} the actual execution time is T^comp_u(t). Note that this does not necessarily mean the task is completed; it may be dropped. The computational energy consumed in processing the task is defined similarly to [21].
Transmission energy: If a task is offloaded to an edge or cloud server, the energy consumption comes mainly from transmission. Recall that in the previous subsection we defined the transmission power of USV user u as P^tran_u; the energy required to transmit task Λ_{u,t} is defined as in [22].
Therefore, depending on the offloading decision, the total energy consumption incurred by USV user u for a task completed or dropped can be defined accordingly.
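The decision-dependent total energy can be sketched with the one-hot variable d_u(t); treating the wired RSU-to-cloud hop as free of USV battery cost is an assumption consistent with the text:

```python
def total_energy(d, e_comp, p_tran, t_tran):
    """Total task energy at the USV, selected by the one-hot decision
    d = [local, edge, cloud]. Local execution costs the computation
    energy; offloading (edge or cloud) costs the transmit power P^tran_u
    times the upload time -- the wired RSU-to-cloud hop is assumed to
    consume no USV battery."""
    e_tx = p_tran * t_tran
    return d[0] * e_comp + (d[1] + d[2]) * e_tx

# A local decision charges only the computation energy:
assert total_energy([1, 0, 0], e_comp=5.0, p_tran=2.0, t_tran=3.0) == 5.0
```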

D. PROBLEM MODELING
The focus of this work is to optimize the long-term performance of the USV, i.e., to maximize the number of completed tasks and reduce each task's completion delay within its tolerable range while reducing the USV's energy consumption. We model this optimization problem accordingly.

IV. DRL-BASED OFFLOADING ALGORITHM DESIGN

A. MDP MODEL
At the beginning of each time slot, each USV observes its state (e.g., task size, queue information) and transmits it to the decision agent. If a new task is to be processed, the agent selects an action for it. The current state of the USV and the chosen action incur corresponding costs (i.e., the task delay, the penalty when the task is not completed, and the energy consumed to process the task). The agent's goal is to minimize the expected long-term cost by optimizing the policy from states to actions.

1) State Space
At the beginning of time slot t ∈ T, each USV u ∈ U observes its state information, including the task size, queue information, and load-level history of the edge nodes.

2) Action Space
At the beginning of time slot t ∈ T, if USV user u ∈ U has a new task Λ_{u,t} arriving, it chooses the offloading action for the task: (a) whether to process the task locally or offload it to an edge node or a cloud server, i.e., d_u(t); and (b) to which edge node the task is offloaded, i.e., o_u. The action of USV user u in time slot t is thus represented by an action vector combining the two.

3) Reward
If a task is processed successfully, its latency is the duration between its arrival and its completion. If the task is dropped, we set its latency to a penalty delay (γ T^max). We design a flexible reward function that addresses all optimization objectives at once using a weighted sum of energy consumption and delay. In the reward function R_t(S_t, A_t), β_1 and β_2 are parameters that normalize the task delay and task energy consumption, and η ∈ [0, 1] is a smoothing weight between the two.
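A minimal sketch of such a reward function is given below; the logarithmic scaling of the energy term and all default parameter values are illustrative assumptions, not the paper's exact formula:

```python
import math

def reward(delay, energy, eta=0.5, beta1=1.0, beta2=1.0,
           dropped=False, t_max=10, gamma_pen=2.0):
    """Negative weighted cost of one task: a dropped task is charged
    the penalty delay gamma_pen * t_max; energy enters through a log
    (log(1 + e) keeps it defined at zero -- an assumption here)."""
    if dropped:
        delay = gamma_pen * t_max
    return -(eta * beta1 * delay + (1 - eta) * beta2 * math.log(1 + energy))

# A 4-slot task at zero energy costs -2.0 with equal weights;
# dropping a task is far more expensive than completing it late.
assert reward(4, 0) == -2.0
assert reward(0, 0, dropped=True) == -10.0
```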
The task energy consumption grows approximately with the square of the device operating frequency, so we scale it with a logarithmic function. The agent can only obtain the immediate reward of the current state, not future rewards, so to maximize the long-term reward we learn a policy that predicts the expected future reward. Here we introduce the Q function, defined as the expected long-term reward obtained by taking a specific action in a given state. In the Bellman equation of the Q function, R_τ is the immediate reward and γ is the discount factor; since our assessment of the future is never fully accurate, a discount is applied to expected future rewards to express this uncertainty, and γ G_{τ+1} is the discounted long-term return. The expected future reward is predicted by the policy π, so we seek an optimal policy π* that maximizes the Q function.

B. DRL MODEL
In this subsection, we propose a DRL model to solve the joint optimization problem presented in the previous section.
Due to the complexity and continuity of the states, it is impractical to enumerate all states and actions. Deep Reinforcement Learning introduces a deep neural network (the Deep Q-Network, DQN) to approximate the Q function, so that the action of the optimal policy π* is obtained from the network's output. Unlike traditional supervised learning, DRL models can learn without prior knowledge by interacting with the environment and continuously exploring its feedback. However, the environmental feedback changes with each interaction. To prevent oscillations during training, DQN uses a target Action-Value network that backs up the weight parameters of the Action-Value network, making the fitted labels relatively stable over time and allowing smoother convergence. The target Action-Value network and the Action-Value network are structurally identical, and both networks' estimates of rewards and action values depend on their weight parameters. During training, the Action-Value network learns from the reward estimates F̂ given by the target network and updates its weights through the loss function L_τ(w_τ); the target network's weights are not updated at this time, but the Action-Value network's weights are copied to the target network after a fixed number of updates. The reward estimate of the target Action-Value network is computed with the previous round's weight parameters w⁻_τ, and the loss is the squared difference between this target and the Action-Value network's prediction.
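The fixed-target update described above can be sketched as follows; these are generic DQN formulas, not the paper's exact equations:

```python
def td_target(r, next_q_values, gamma=0.9, terminal=False):
    """Fixed-target estimate F̂ from the target network: the immediate
    reward plus the discounted best action value of the next state
    (just the reward if the episode ends here)."""
    if terminal:
        return r
    return r + gamma * max(next_q_values)

def mse_loss(targets, predictions):
    """Squared TD-error averaged over the mini-batch; only the online
    Action-Value network is updated from its gradient, while the
    target network's weights stay frozen between synchronizations."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Reward 1.0, best next-state value 3.0, discount 0.5 -> target 2.5
assert td_target(1.0, [2.0, 3.0], gamma=0.5) == 2.5
```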

C. TRAINING PROCESS
In this subsection, we describe the training process of the DRL model. To simplify the description, we divide the training process into three parts: initialization, training data generation, and learning from the experience pool, as shown in Fig. 3.

1) Initialization:
Initialize an Action-Value network Q(s, a | θ) and assign the parameter θ to the target Action-Value network Q̂(s, a | θ⁻). A replay memory D with capacity N is also initialized to store the experience tuples (s, a, r, s′) generated by USV users interacting with the environment. The whole system then iteratively generates data and trains the DRL model; an episode ends when the step count reaches the designed number of time slots.

2) Exploration and data acquisition:
USV users interact with the network environment to generate training data. The agent balances exploitation and exploration using the ε-greedy method. Initially, the agent knows nothing about the edge network environment and takes mostly random actions to explore it. As the agent gradually gains enough knowledge of the environment, it uses the learned knowledge to generate optimal policies; accordingly, ε decreases over successive episodes. The reward r_τ and next state s_{τ+1} are obtained by interacting with the environment, and the tuple of current state s_τ, action a_τ, reward r_τ, and next state s_{τ+1} is stored in the experience pool to improve the learning speed of DRL.

3) Replay memory:
When the agent interacts with the network environment, consecutive experience tuples may be highly correlated, and feeding them sequentially into the learning model can cause training to fail. We therefore use experience replay to sample training data randomly from the experience pool. This decouples data generation from learning, breaking the sequential correlation of experience tuples. Replay memory also lets the agent learn more from the same experience, and, most importantly, batch sampling reduces the oscillation or divergence caused by individual training samples.
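The ε-greedy exploration and replay memory described above can be sketched as below; class and function names are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity experience pool; uniform random sampling breaks
    the temporal correlation between consecutive transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old tuples are evicted

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def epsilon_greedy(q_values, epsilon):
    """Explore a uniformly random action with probability epsilon,
    otherwise exploit the greedy action; epsilon is decayed between
    episodes as the agent learns the environment."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```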
4) Learning: The DRL learning process differs slightly from training a regular deep learning model. In forward propagation, the DRL model takes a batch of training samples from the experience pool and feeds them to the learning network and the target network; the loss is then calculated from the error between their outputs. The loss function is computed from the difference between the learning network's and the target network's outputs because the DRL model learns from reward feedback rather than from labeled data. The pseudo-code is shown in Algorithm 1.

Algorithm 1 DRL-Based Task Offloading With Cloud Edge Jointly Load Balance Optimization Algorithm (TOLBO)
1  // 1. Initialize:
2  Initialize replay memory D to capacity N
3  Initialize action-value network with random weights θ
4  Initialize target action-value network with weights θ⁻ = θ
5  for Episode = 1 to M do
6      // 2. Generate training data
7      Initialize input raw data x
8      for each task in raw data do
9          Select action a from state s using π ← ε-Greedy(Q(s, a, θ))
10     end
11     Take action a, observe reward r, and get next state s′
12     Store experience tuple (s, a, r, s′) in replay memory D
13 end
14 // 3. Learning:
15 Obtain a mini-batch of (s_j, a_j, r_j, s_{j+1}) from D
16 if the episode stops at step j + 1 then
17     Set target F̂_j ← r_j
18 else
19     Set target F̂_j ← r_j + γ max_a Q(s_{j+1}, a, w⁻)
20 end
21 Update θ ← θ + α ∇_w L(w)
22 Every N steps, update θ⁻ = θ
23 ε ← max(ε_min, ε · decay)
24 Store the score for the current episode

V. EXPERIMENTAL ANALYSIS
In this section, we evaluate the performance of TOLBO in simulations. We use Python as the programming language and PaddlePaddle [23] to build the DRL model. We consider a static scenario with 50 USV users, 5 edge servers, and 1 cloud server; that is, the number of users does not change over time. If the environment changes, we can also reset the random exploration probability to 1 and retrain the model. The parameter settings for this experiment are shown in Table 1.

A. PARAMETER ANALYSIS
The neural network in this algorithm is trained online: the experience collected in real time is used to train the network and update the task offloading decision. We evaluate the convergence of the proposed algorithm under different neural network hyperparameters, specifically the learning rate and the batch size. The experimental setup is 1000 Episodes with 100 time slots per Episode. The simulation results are shown in Fig. 4, where the x-axis represents the Episode and the y-axis represents the cost per USV user (incorporating delay, throw rate, and energy consumption).

Fig. 4(a) shows the convergence of the proposed algorithm at different learning rates. The learning rate is the step size taken toward minimizing the loss function in each iteration of the algorithm. As can be seen from the figure, when lr = 10⁻³ the proposed algorithm converges relatively quickly and to a relatively small cost. When lr = 10⁻⁵, convergence slows, and the cost has not dropped to its lowest value after 500 Episodes. Moreover, when the learning rate is large (lr = 10⁻² or lr = 10⁻¹), the parameter updates of the neural network become large, so anomalous data disturbances in the experiment leave the algorithm in an unstable state, yielding a convergence cost even higher than that of the random offloading strategy. Therefore, this experiment chooses lr = 10⁻³ to reduce the convergence cost and accelerate convergence.

Fig. 4(b) shows the convergence of the proposed algorithm for different batch sizes, i.e., the number of samples selected in each training round. When the batch size increases from 2 to 8, convergence accelerates; increasing it further from 8 to 32 does not significantly improve the convergence speed or cost. Therefore, this experiment chooses a batch size of 8 to reduce training time while preserving the algorithm's performance.
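The qualitative effect of the learning rate in Fig. 4(a) can be reproduced on a toy problem. The sketch below runs gradient descent on a simple quadratic cost; the quadratic and its curvature are assumptions for illustration only, not the paper's DRL model, but they show the same pattern: a large step size diverges, a tiny one converges too slowly, and an intermediate one wins.

```python
# Toy gradient-descent sweep mirroring the learning-rate behavior in Fig. 4(a).
# The cost f(w) = 0.5 * CURVATURE * w^2 is an illustrative stand-in.
CURVATURE = 250.0

def final_cost(lr, steps=500, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * CURVATURE * w          # gradient step: f'(w) = CURVATURE * w
        if abs(w) > 1e6:                 # diverged: learning rate is unstable
            return float("inf")
    return 0.5 * CURVATURE * w * w

for lr in (1e-1, 1e-2, 1e-3, 1e-5):
    print(f"lr={lr:g}: final cost {final_cost(lr):.3g}")
```

With these assumed constants, lr = 10⁻¹ and 10⁻² diverge, lr = 10⁻⁵ is still far from the minimum after 500 steps, and lr = 10⁻³ reaches a near-zero cost, matching the relative ordering observed in the experiment.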

B. PERFORMANCE EVALUATION
We compare the proposed algorithm (TOLBO) with several existing methods, including no offload (N. Offload), random offload (R. Offload), and DRL [11]; we also compare against a variant of our offloading algorithm that uses edge servers only (TOLBO*).

1) PERFORMANCE INDICATORS
In this experiment we consider the following performance metrics:

1) Task throw rate: the percentage of generated tasks that are thrown because they are not completed within the tolerated delay, measured by θ_throw. A smaller θ_throw means fewer thrown tasks and a more stable system. It can be expressed as θ_throw = C_throw / C_task, where C_throw denotes the number of tasks thrown by all USV users in an Episode, and C_task denotes the total number of tasks generated by all USV users in an Episode.

2) Average task delay: the average delay for USV users to complete each task.

3) Average energy: because a thrown task consumes little energy, which is easily averaged away over the total number of tasks, we compute the average energy consumption as the total energy consumed by the USVs over the entire Episode divided by the number of completed tasks.

4) Energy waste rate: the percentage of the overall USV energy consumption wasted on tasks that are thrown during execution, measured by θ_E, i.e., the ratio of the energy consumed by thrown tasks to the overall USV energy consumption. A smaller θ_E means higher energy utilization by the USV users.

2) IMPACT OF TASK ARRIVAL RATE ON PERFORMANCE
As the task arrival rate increases, more tasks are generated by the same number of USV users in the same time slot. Incorrect offloading decisions can then cause task backlogs in the queue, resulting in thrown tasks and unnecessary transmission and computation energy. In Fig. 5(a), the number of tasks generated by each USV user in an Episode increases with the task arrival rate, and the proposed algorithm maintains a lower average task delay than the benchmark methods. When the task arrival rate increases from 0.1 to 0.4, the average delay of our proposed DRL-based algorithm increases by 17.2%, while the average delay of the benchmark methods increases by at least 92.5%. This implies that the average delay of the proposed algorithm grows more slowly than that of the benchmark methods as the system load increases. As the task arrival probability increases to 0.7, the average latency of some methods levels off as more and more tasks are thrown, so the average latency slowly approaches the maximum tolerated latency. Also, compared to TOLBO*, the average latency of TOLBO is reduced by 15.3% owing to the support of cloud servers with higher computational power.
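For concreteness, the four performance metrics defined under 1) above can be computed directly from per-Episode counters. The function and argument names in this sketch are hypothetical bookkeeping choices, not identifiers from the paper:

```python
def throw_rate(c_throw, c_task):
    """θ_throw: fraction of generated tasks thrown for exceeding the tolerated delay."""
    return c_throw / c_task

def average_delay(delays_of_completed):
    """Average task delay over the tasks completed in an Episode."""
    return sum(delays_of_completed) / len(delays_of_completed)

def average_energy(total_energy, c_completed):
    """Average USV energy per completed task over the whole Episode."""
    return total_energy / c_completed

def energy_waste_rate(e_thrown, e_total):
    """θ_E: share of overall USV energy spent on tasks that were thrown."""
    return e_thrown / e_total

# Example: 1000 tasks generated, 73 thrown in an Episode.
print(throw_rate(73, 1000))   # 0.073
```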
In Fig. 5(b), TOLBO consistently maintains a lower throw rate than the benchmark algorithms. For example, when the task arrival probability is 0.5, the throw rate of TOLBO is 7.28%, while those of the benchmark algorithms range between 62.13% and 83.16%. This is because TOLBO makes better use of all computing resources on the cloud, edge, and end sides. At higher task arrival rates (0.7-1.0), the throw rate of TOLBO is at least 39.8% lower than that of TOLBO*: when computational pressure is high, the computing resources at the edge servers and user side are already severely insufficient, and the addition of the cloud server effectively alleviates this situation.
In Fig. 5(c) and Fig. 5(d), the average energy consumption of our proposed algorithm is 47.55% lower than that of the benchmark methods when the task arrival rate is 0.1. For USVs, power resources are quite valuable. When the workload is low, the algorithm balances power consumption against delay and offloads more tasks to the cloud and edge servers within the delay tolerance, reducing the energy the USVs spend processing tasks. When the task arrival rate is high, the wrong offloading decisions of the other algorithms cause excessive task throwing, which increases both the average energy consumption per task and the energy waste rate.

3) IMPACT OF THE NUMBER OF USERS ON PERFORMANCE
As the number of users increases, more tasks are generated in the same time slot. Unlike an increase in the task arrival rate, however, adding users also adds local computing resources.
In Fig. 6, we compare the performance of each method for different numbers of USV users. As the number of users increases, the decision dimension grows and the edge and cloud node resources allocated to each USV user shrink, which places higher demands on the adaptability of the algorithm.
As shown in Fig. 6(a), the proposed algorithm makes good use of the computational resources of the edge and cloud nodes at different numbers of users, so its latency is always kept at a steadily low level. TOLBO* can still maintain low latency at 60 users, but when the number of users rises to 70-100, tasks would need to be further offloaded to the cloud server because the computing power of the edge nodes is insufficient; lacking cloud support, TOLBO* suffers a rapid increase in latency.
In Fig. 6(b), as the number of USV users increases, the dimensionality of the offloading decision grows, so the wrong offloading decisions of the benchmark methods lead to an increasing throw rate. In contrast, TOLBO always maintains a throw rate of zero. Compared to TOLBO, TOLBO* has a progressively higher task throw rate once the number of users exceeds 50, because it lacks the computing resources of the cloud server.
In Fig. 6(c) and Fig. 6(d), when the number of users is small (10-50), the proposed algorithm reduces the average energy consumption by at least 31.57% and the energy waste rate by 15.44% compared to the benchmark methods. When the number of USV users is large (60-100), our algorithm greatly reduces both the average energy consumption and the energy waste rate; for example, with 100 users, the average energy consumption is reduced by 49% compared with R. Offload and DRL, and the energy waste rate is reduced by 21.35%. TOLBO* also maintains low energy consumption and a low energy waste rate when the number of users is small; however, as the number of users increases, some of its tasks are thrown because it cannot draw on the cloud server's computing resources, which increases its average energy consumption and energy waste rate.

VI. CONCLUSION
In this work, we account for the energy consumed by thrown tasks and study the computational offloading problem for USVs with the objective of minimizing the long-term task throw rate, task delay, and task energy consumption. We propose TOLBO, a DRL-based cloud edge collaborative load-optimizing task offloading algorithm. The proposed algorithm maximizes the long-term task payoff and makes all decisions without relying on other optimization functions for joint optimization. Simulation results demonstrate that, compared with No Offload (N. Offload), Random Offload (R. Offload), and DRL [11], the algorithm significantly reduces the task throw rate, average latency, and average energy consumption of processing tasks. However, the experimental results in this paper are derived from quasi-static simulation experiments. In future work, we will migrate the method to dynamic real-world experimental scenarios and combine it with the latest algorithms and techniques, such as using an LSTM to predict upcoming tasks, to improve the performance of TOLBO in dynamic IoT application scenarios.