Energy-Optimal Flight Strategy for Solar-Powered Aircraft Using Reinforcement Learning With Discrete Actions

The low efficiency of photovoltaic cells limits the energy absorption of high-altitude long-endurance (HALE) solar-powered unmanned aerial vehicles (UAVs), which dramatically weakens their capacity for long-endurance missions. Therefore, finding a method to extend the flight duration with finite solar energy has driven extensive research. The present work introduces a method that applies a deep reinforcement learning (DRL) framework to generate an energy-optimized flight strategy for HALE solar-powered aircraft. The neural network controller is designed to realize autonomous flight navigation by giving commands of thrust, attack angle, and bank angle. A mission area with a radius of 5 km is assumed to test the RL controller performance. The simulation results show that the RL controller leads to a 28 % increase in the battery SoC after a 24-hour flight, which indicates that a controller based on the RL framework might be a potential method for solving the solar-powered UAV trajectory planning problem. To explore the applicability of the RL controller, a sustained flight test is also implemented. The results show that a 39-day endurance flight is achieved by the RL controller, which is 50 % longer than the base case with a steady flight trajectory.


I. INTRODUCTION
Research on high-altitude long-endurance (HALE) solar-powered unmanned aerial vehicles (UAVs) has received considerable attention in recent years, mainly due to their promising future in various applications such as long-endurance intelligence, surveillance, and reconnaissance (ISR) [1], [2]. However, the poor capabilities of the photovoltaic cell and the rechargeable battery lead to a continuous increase in aircraft scale to meet the 24-hour flight energy requirement, which strongly obstructs the design of HALE solar-powered aircraft [3]. Therefore, researchers began to seek alternative methods to help HALE UAVs absorb and store more energy. Since solar energy absorption varies with the time and the sunlight incidence angle, trajectory optimization has become a potential methodology to enhance the endurance of HALE aircraft. It mainly includes two approaches: first, path planning that maximizes solar energy absorption by optimizing the incidence angle between the sunlight and the solar panels; second, an energy management strategy that stores solar energy as gravitational potential during the daytime.

The associate editor coordinating the review of this manuscript and approving it for publication was Bin Xu.
The earliest research on solar-powered UAV flight path planning was implemented by Klesh and Kabamba [4], [5], with a detailed discussion of the influence of flight bank angle and speed when the solar-powered UAV cruises along an optimal path. A nondimensional parameter, the power ratio, was proposed to predict the qualitative features of the aircraft's energy-optimal state. Klesh et al.'s results showed that a perpetual endurance flight can be achieved when the power ratio exceeds a specific threshold. However, only the simplest flight path, such as flying from a starting point to a destination, was applied in their work. Therefore, it is still worth exploring whether their method can be extended to more complicated scenarios.

VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Aiming to achieve continuous tracking of moving ground targets, Huang et al. [6] proposed an integrated model and studied energy-optimal path planning using a particle swarm optimization (PSO) algorithm. Numerical results showed that the proposed method provided the possibility for solar-powered UAVs to achieve longer target tracking. Unfortunately, the longitudinal angles were not considered in their solar energy absorption calculation, which means that their method can only be applied when the UAV is in level flight.
Spangelo and Gilbert [7], [8] explored path planning when an aircraft flies in three-dimensional space and estimated the effect of longitudinal motion on energy absorption. By allowing the aircraft to change speed and altitude, an increase in solar energy absorption of up to 30 % was achieved compared to the steady case with constant speed and altitude. However, Spangelo et al. fixed the aircraft's spatial location on the surface of a vertical cylinder, which effectively reduced the three-dimensional space to a two-dimensional surface.
Martin et al. [9] expanded Spangelo et al.'s work by allowing the aircraft to fly inside the vertical cylinder rather than only on its surface. Therefore, a more complex flight path can be formed, which leads to better solutions. A nonlinear model predictive control (MPC) formulation was applied to maximize the total stored energy, E_tot,stored = E_battery + E_potential, with a receding-horizon approach. The results showed that the maximum E_tot,stored is increased compared to the steady-state trajectory. Similar research was further conducted by Marriott et al. [10], and the computational efficiency of the path optimization process was improved by adopting a greedy dynamic programming algorithm with buffering. However, both studies applied an optimization process with a limited forward horizon to plan the trajectories, which cannot provide end-to-end control instructions because of the repeated optimization.
In addition to making an energy-optimal path, some researchers tend to introduce an energy management strategy (EMS) into solar-powered aircraft flights. A novel EMS that partly stores solar energy as gravitational potential in the daytime was proposed by Gao et al. Their results showed that an additional energy surplus of up to 23.5 % can be achieved with the proposed EMS in one 24-hour flight [11]. An attempt to reduce the rechargeable battery weight by utilizing gravitational potential was further implemented [12]. Gao et al. believed that solar-powered UAV endurance might be greatly improved by energy storage using gravitational potential. Based on Gao et al.'s work, a developed energy management strategy considering uncertain weather conditions was investigated by Lee and Yu [13].
Wang et al. [14] considered the installation angle of the solar panels in an energy absorption model. The flight state and control variables were estimated by the Gauss pseudospectral (GPS) method. Through comprehensive adjustments to flight attitude and altitude, the aircraft received an 18 % increase in the total stored energy after a 24-hour flight. Furthermore, a mission-oriented design for three-dimensional path planning, using a method coupling GPS and colony algorithms, was presented to maximize mission effectiveness in Wang et al.'s work [15].
At present, trajectory planning and energy management of solar-powered aircraft based on energy optimization mainly focus on improving real-time performance and utilizing energy day and night. However, high-accuracy optimization algorithms are often complex and computationally expensive, while the aircraft needs to adjust its flight state in real time according to the predicted target position [16], [17]. Additionally, some high-efficiency online methods, such as MPC, still require a receding-horizon optimization process [18]. Recently, reinforcement learning (RL) has received considerable attention in the aeronautic community due to its real-time end-to-end control performance and adaptability to unknown environments. Relying on a deep neural network, a transformation from dynamic trajectory generation to an end-to-end controller can be achieved, which reduces the computational cost and provides real-time navigation or guidance commands while realizing long-term objective optimization. Several relevant works are described below.
To increase the endurance of unmanned aircraft, Woodbury et al. studied the flight strategy of a glider passing through ascending thermals. Reinforcement learning was applied to generate the reference bank angle command for leading the glider close to the ascending thermals, so that a circling trajectory can be established to help the glider gain energy from the atmosphere. Three variables, the distance to the thermal, the azimuth to the thermal relative to the aircraft heading, and the aircraft bank angle, were chosen to describe the state in RL. The numerical results showed that the RL-based controller can consistently navigate the aircraft to the ascending thermals, and online glider trajectory generation can be implemented by low-computational-burden table lookup in a static state-action value table [19]. Reddy et al. further developed Woodbury et al.'s work using a model-free RL framework. In addition to the bank angle, Reddy et al. noted that the local vertical wind acceleration and rollwise torque were the key variables serving as navigational cues. The results showed that a longer endurance was achieved with the conversion between gravitational potential and kinetic energy [20].
Bohn et al. first adopted the proximal policy optimization (PPO) algorithm to generate UAV flight controller commands directly [21]. The Skywalker X8 fixed-wing UAV was used as a test aircraft in a simulated turbulent wind field. Up to seven variables related to the UAV flight were selected to establish the state in RL. With a well-trained neural network, it was observed that the RL control framework has several superior characteristics, such as excellent disturbance rejection, precision, and quick response, compared with a traditional PID controller.
Bellemare et al. applied the distributional QR-DQN algorithm to realize station-keeping of high-altitude stratospheric balloons by determining a wind-field utilization strategy [22]. The RL framework was applied to create a high-performing flight controller, and a data self-correcting system was introduced in their work to mitigate the effect of imperfect data. A 39-day controlled experiment over the Pacific Ocean was implemented, and the experimental results showed that the RL-based controller outperformed the conventional control algorithm.
The aforementioned studies indicated that RL can process high-dimensional data by optimizing long-term goals and learning information, which has great potential in autonomous path planning and end-to-end aircraft control. Therefore, an alternative method might be to combine solar-powered UAV path planning with EMS as an integral flight strategy. However, few works have reported the application of machine learning to solar-powered HALE unmanned aircraft.
Given this gap, a double deep Q-network (DDQN) with dueling architecture is chosen as the reinforcement learning algorithm for solar-powered aircraft trajectory planning for the following reasons. First, the basic Q-learning algorithm and its modified variants are very robust and can be applied to a variety of tasks, such as video games [33] and quadrotor control [34]. Second, deep Q-network algorithms have been used to guide the paths of air vehicles and have achieved very good results [22], [35]. To our knowledge, this is the first time that a reinforcement learning method has been applied end to end to HALE solar-powered aircraft flight trajectory planning. The present work makes the following contributions: (1) The reinforcement learning framework is innovatively applied to the three-dimensional energy-optimal flight strategy of HALE solar-powered UAVs.
(2) For missions in which the UAV is restricted to a specific region, a neural network controller is designed by a model-free reinforcement learning method to realize automatic navigation. The flight commands of thrust, attack angle and bank angle are directly generated by the proposed controller.
(3) The proposed controller does not require a precomputed feasible trajectory via a motion planning algorithm and can achieve a millisecond response.
(4) A sustained flight test is implemented for the first time to explore the applicability of the RL controller in divergent flight conditions. Compared to the base case with constant speed and altitude, the present work shows an additional 50 percent flight endurance in sustained flight.
The remainder of the paper is organized as follows. The dynamic simulation models are described in Section II. The configuration of the model-free DRL algorithm and the key design decisions are presented in Section III. Then, in Section IV, a level flight base case is described, and the training process is presented on this basis. The controller is evaluated in 24-hour flight and sustained flight cases. Finally, Section V presents the final comments and suggestions for further work.

II. MODELS
In the air, HALE aircraft can absorb solar energy from the environment and convert it into electric energy through the photovoltaic cells, which is allocated to the propulsion and avionics systems or stored in rechargeable batteries. The interconnection of these subsystems on HALE solar-powered aircraft is shown in Fig. 1.
In this section, the numerical models used in the present work are introduced in five parts: the aircraft dynamics and kinematics model, solar irradiance model, energy absorption model, energy consumption model and energy storage model.

A. AIRCRAFT DYNAMICS AND KINEMATICS MODEL
For trajectory optimization research, the focus is on the aircraft's macroscopic characteristics during flight. Therefore, a simplified mass point model can be used. The aircraft is assumed not to be affected by wind or sideslip during the flight, and the propulsion system installation angle is ignored. As shown in Fig. 2, the axes of the aircraft body frame point from the center of mass toward the nose, the right wing, and downward, respectively.
By analyzing the forces on the aircraft, the dynamics and kinematics equations of the aircraft can be expressed in the following differential form (a detailed derivation can be found in [23]):

dx/dt = V cos γ cos ψ
dy/dt = V cos γ sin ψ
dz/dt = −V sin γ
m dV/dt = T cos α − D − m g sin γ
m V dγ/dt = (L + T sin α) cos φ − m g cos γ
m V cos γ dψ/dt = (L + T sin α) sin φ        (1)

where x, y and z are the aircraft positions in the earth-fixed inertial frame O_gX_gY_gZ_g, whose axes point north, east, and down, respectively. V is the aircraft velocity, α is the attack angle measured between the O_bX_b axis and the V vector, and γ is the flight path angle. ψ is the yawing angle, which is equal to the heading angle because there is no sideslip. φ is the bank angle, m is the aircraft mass, and g is the acceleration of gravity. Another Euler angle, the pitch angle θ, is calculated by the following formula:

θ = α + γ        (2)

In addition to gravity, the aircraft is subject to aerodynamic drag force, lift force and the thrust generated by the propulsion system, whose directions are shown in Fig. 2: the thrust is aligned with the O_bX_b axis, the drag is parallel to the negative velocity vector, and the lift is perpendicular to the velocity vector.
Then, the lift force L and drag force D can be calculated by

L = (1/2) ρ V² S C_L,    D = (1/2) ρ V² S C_D        (3)

where ρ is the air density around the aircraft, calculated according to the 1976 US Standard Atmospheric Model [24]. C_L and C_D are the lift and drag coefficients, and S is the wing area. In this work, C_L and C_D are fitted as functions of the Reynolds number Re and attack angle α, where the Reynolds number is calculated from the local density and flight velocity [25]. The fitted curves of the lift coefficient and drag coefficient are shown in Fig. 3, and some aerodynamic coefficients under certain working conditions are listed in Table 2.
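As a sanity check, the point-mass model can be integrated with the same SciPy ODEINT routine the paper uses for its state updates. The mass, wing area, air density, and the linear lift / parabolic drag fits below are illustrative assumptions, not the paper's fitted values:

```python
import numpy as np
from scipy.integrate import odeint

# Illustrative constants (not the paper's aircraft values)
m, g, S, rho = 350.0, 9.81, 60.0, 0.19  # mass kg, gravity, wing area m^2, density near 15 km

def aero_coeffs(alpha):
    """Placeholder fits; the paper fits C_L, C_D to Reynolds number and alpha."""
    C_L = 0.4 + 5.0 * alpha          # assumed per-radian lift slope
    C_D = 0.015 + 0.04 * C_L ** 2    # assumed parabolic drag polar
    return C_L, C_D

def dynamics(state, t, T, alpha, phi):
    """Point-mass 3-DOF equations: z positive down, gamma is the flight path angle."""
    x, y, z, V, gamma, psi = state
    C_L, C_D = aero_coeffs(alpha)
    q = 0.5 * rho * V ** 2 * S       # dynamic pressure times wing area
    L, D = q * C_L, q * C_D
    dx = V * np.cos(gamma) * np.cos(psi)
    dy = V * np.cos(gamma) * np.sin(psi)
    dz = -V * np.sin(gamma)
    dV = (T * np.cos(alpha) - D) / m - g * np.sin(gamma)
    dgamma = ((L + T * np.sin(alpha)) * np.cos(phi) - m * g * np.cos(gamma)) / (m * V)
    dpsi = (L + T * np.sin(alpha)) * np.sin(phi) / (m * V * np.cos(gamma))
    return [dx, dy, dz, dV, dgamma, dpsi]

# Integrate 20 s of flight under fixed thrust, attack angle and bank angle commands
s0 = [0.0, 0.0, -15000.0, 30.0, 0.0, 0.0]   # start at 15 km altitude (z is down)
traj = odeint(dynamics, s0, np.linspace(0, 20, 21), args=(90.0, 0.05, 0.1))
```

With a nonzero bank angle the heading drifts, producing the curved paths that the controller later exploits.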

B. SOLAR IRRADIANCE MODEL
According to [26], the solar energy that can be absorbed by HALE aircraft in the daytime consists of two parts: (1) direct solar irradiance, which is radiated directly onto the photovoltaic cells installed on the upper wing surface, and (2) diffuse irradiance, sunlight scattered by molecules in the atmosphere and then radiated onto the photovoltaic cells. The absorbed energy is influenced by the intensity of the solar flux, the azimuth angle of the sun and the angle between the photovoltaic cells and the direct solar beam.
As Fig. 4 shows, n pm is the external unit normal vector of the photovoltaic cell, and n s is the unit vector that points from the center of the photovoltaic cell to the sun. α s is the solar altitude angle, and γ s is the azimuth angle. Formulas adopted to estimate the total available solar flux I tot at any specified position and time are given in Table 3.

C. ENERGY ABSORPTION MODEL
The solar flux through the aircraft photovoltaic cells, which is converted into electric energy, can be calculated from the geometric relationship between the normal vector of the solar cell plane and the incidence vector of the sunlight [29]. For a photovoltaic panel with an area of S_PV^i, its external unit normal vector in the aircraft body coordinate system is represented as [x_i, y_i, z_i]^T; then, the solar flux through this cell can be obtained by

I_i = I_tot S_PV^i max(0, n_s · n_pm)

where n_s is the unit vector pointing from the photovoltaic cell to the sun and n_pm is the external normal unit vector of the photovoltaic module plane, as shown in Fig. 4.
They are given by

n_s = [cos α_s cos γ_s, cos α_s sin γ_s, −sin α_s]^T,    n_pm = L_bg [x_i, y_i, z_i]^T

where L_bg denotes the coordinate transformation matrix from the aircraft body coordinate system to the inertial coordinate system, using the shorthand cα and sα to represent cos(α) and sin(α). Then, the total input energy power converted from the solar flux, P_solar, is

P_solar = η_MPPT η_PV Σ_i I_i

where η_MPPT is the efficiency of the maximum power point tracking (MPPT) controller and η_PV is the efficiency of the photovoltaic cells.
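The geometric core of the absorption model — rotating the panel normal into the inertial frame and taking its dot product with the sun vector — can be sketched as follows. The yaw-pitch-roll rotation order and the θ = α + γ pitch relation are assumptions of this sketch:

```python
import numpy as np

def body_to_inertial(alpha, gamma, psi, phi):
    """Rotation from body to inertial (NED) axes via yaw-pitch-roll; theta = alpha + gamma assumed."""
    th = alpha + gamma
    cps, sps = np.cos(psi), np.sin(psi)
    cth, sth = np.cos(th), np.sin(th)
    cph, sph = np.cos(phi), np.sin(phi)
    Rz = np.array([[cps, -sps, 0], [sps, cps, 0], [0, 0, 1]])   # yaw
    Ry = np.array([[cth, 0, sth], [0, 1, 0], [-sth, 0, cth]])   # pitch
    Rx = np.array([[1, 0, 0], [0, cph, -sph], [0, sph, cph]])   # roll
    return Rz @ Ry @ Rx

def panel_flux(I_tot, S_pv, n_body, alpha, gamma, psi, phi, alpha_s, gamma_s):
    """Solar flux through one panel: I_tot * S * max(0, n_s . n_pm)."""
    # Sun unit vector in NED from solar altitude alpha_s and azimuth gamma_s
    n_s = np.array([np.cos(alpha_s) * np.cos(gamma_s),
                    np.cos(alpha_s) * np.sin(gamma_s),
                    -np.sin(alpha_s)])
    n_pm = body_to_inertial(alpha, gamma, psi, phi) @ np.asarray(n_body)
    return I_tot * S_pv * max(0.0, float(n_s @ n_pm))

# Level flight with an upward panel normal and the sun directly overhead
P = panel_flux(1300.0, 1.0, [0, 0, -1], 0.0, 0.0, 0.0, 0.0, np.pi / 2, 0.0)
```

Banking the aircraft tilts n_pm away from the sun vector, which is exactly the lever the trajectory optimizer manipulates.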

D. ENERGY CONSUMPTION MODEL
Solar-powered aircraft rely on propellers to provide forward thrust to overcome aerodynamic drag, and the necessary avionics systems also consume energy. Hence, during an all-day flight, the energy consumption is composed of these two parts [30]. In this work, a simplified propulsion system model and a constant avionics power are used to calculate the required power P_required [31]:

P_required = P_prop + P_acc = T V / (η_prop η_mot) + P_acc        (9)

where P_acc denotes the avionics power, P_prop denotes the power of the propulsion system, η_prop denotes the propeller efficiency and η_mot denotes the motor efficiency.

E. ENERGY STORAGE MODEL
Currently, solar-powered UAVs mostly use Li-based batteries, which have the advantages of high specific energy and stable charging and discharging processes [14]. The onboard battery pack can either power all the systems or store excess solar energy for use at night. The state of charge (SoC) is used to estimate the remaining available energy in the battery relative to the full-charge state [32].
Eq. 10 states the definition of SoC:

SoC = E_battery / E_battery,max        (10)

where E_battery is the temporal stored energy of the battery and E_battery,max is the maximum battery pack capacity, which ignores capacity loss due to cyclic charge and discharge and is set to a constant value in the present work [27], [28].
SoC can be updated by Eq. 11; here, the differences between the battery charging and discharging processes are not considered:

SoC(t) = ( E_battery,0 + ∫₀ᵗ P_battery dτ ) / E_battery,max        (11)

where E_battery,0 is the initial electric energy in the battery pack and P_battery = P_solar − P_required is the net battery power. For battery charging, P_battery is positive because the energy consumed in flight is less than the energy converted by the photovoltaic cells; otherwise, it is negative. When the battery is fully charged, P_battery = 0. Therefore, the net energy power P_net is defined in Eq. 12, and it is not less than zero if perpetual flight is to be possible:

P_net = P_battery + dE_potential/dt        (12)

In addition to batteries, HALE aircraft can store gravitational potential energy E_potential by climbing, which can be released by gliding down to maintain the airspeed required for flight at night. In this article, for purposes of the net energy expression, only the gravitational potential energy relative to the initial height h_0 is considered:

E_potential = m g (h − h_0)        (13)

In summary, by establishing the above five mathematical models, the aircraft state can be solved in real time through a numerical solution method during the simulation, which provides a detailed description of the environment in the reinforcement learning framework for the aircraft trajectory optimization problem. The pseudocode of the calculation flow is shown in Algorithm 1. Detailed parameters and the physical constraints of the studied HALE solar-powered aircraft are listed in Table 4 and Table 5, respectively.
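The battery and potential-energy bookkeeping reduces to a few lines of code; the capacity, mass, and power levels here are illustrative only:

```python
# Minimal SoC bookkeeping sketch (the Eq. 10-13 idea); numbers are illustrative only.
E_max = 30.0e6          # battery capacity, J (assumed)
E_batt = 0.3 * E_max    # initial SoC = 30 %
m, g, h0 = 350.0, 9.81, 15000.0

def step_energy(E_batt, P_solar, P_required, dt):
    """Advance the stored energy one step; clamp at full charge and at empty."""
    P_batt = P_solar - P_required                      # net battery power
    return min(max(E_batt + P_batt * dt, 0.0), E_max)

# One hour of daytime charging at a 500 W surplus, 1 s steps
for _ in range(3600):
    E_batt = step_energy(E_batt, 2000.0, 1500.0, 1.0)

soc = E_batt / E_max                         # SoC definition
E_potential = m * g * (16000.0 - h0)         # potential energy relative to h0 at 16 km
```

The clamp at full charge mirrors the paper's remark that P_battery = 0 once the pack is full, at which point climbing is the only way to keep banking surplus energy.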

III. REINFORCEMENT LEARNING METHODOLOGY
The objective is to use reinforcement learning to train a neural network controller offline that provides control commands achieving superior flight endurance. Since the stability of the solar-powered UAV is strongly affected by the pitch angle, its climbing process is determined by the thrust and the angle of attack. Therefore, the designed RL controller is tasked with outputting the control commands of bank angle, thrust and attack angle online according to the flight states. Here, it is assumed that the inner-loop controller is ideal and can automatically track these control instructions. At each time step, the controller receives an immediate reward, and its goal is to establish a control law that maximizes the sum of future discounted rewards.

A. DOUBLE Q-NETWORK WITH DUELING ARCHITECTURE
DDQN with a dueling architecture belongs to the family of model-free, off-policy reinforcement learning methods. For readability and completeness, the DDQN algorithm and the dueling architecture are briefly introduced.
The Q-learning algorithm aims to map each state-action pair (s, a) to its corresponding value Q(s, a) [36] and to select the action with the maximum Q(s, a) according to the strategy. Under a given policy π, the value of taking action a in state s is defined as

Q^π(s, a) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + … | s_t = s, a_t = a, π ]        (14)

where γ ∈ [0, 1] is a discount factor that trades off the importance of immediate and later rewards. By selecting the action with the highest optimal value Q* = max_π Q^π(s, a) in each state, the optimal strategy can be derived from the optimal value. With the expansion of the state space, the traditional table lookup method cannot meet the requirements, so the DQN algorithm [33] was proposed, in which a multilayer neural network approximates the optimal action-value function and updates its parameters θ with the loss

L(θ) = E[ ( y_t^DQN − Q(s_t, a_t; θ) )² ]        (15)

In the neural network training process, DQN uses a separate target network to generate the target Q-value y_t^DQN to decrease oscillation and improve stability:

y_t^DQN = r_t + γ max_{a'} Q(s_{t+1}, a'; θ⁻)        (16)

where θ⁻ are the parameters of the target network, which are updated regularly.
The DQN maximum operator uses the same values for both action selection and action evaluation, resulting in overestimation. In contrast, double DQN (DDQN) [37] decouples action selection from action evaluation to prevent this problem, replacing the target y_t^DQN with y_t^DDQN:

y_t^DDQN = r_t + γ Q( s_{t+1}, arg max_{a'} Q(s_{t+1}, a'; θ); θ⁻ )        (17)

In the DQN and DDQN algorithms, once the current state s is given, the networks evaluate the Q-values of all state-action pairs. However, it is not necessary to estimate the value of each candidate action in every state. Wang et al. [38] proposed the dueling structure to improve DQN performance. This structure separates the state value function V(s) from the action advantage function A(s, a), which represents a relative measure of the importance of each action in that state, and recombines them through a special aggregating operation to approximate the Q-value of each action:

Q(s, a) = V(s) + A(s, a) − (1/|A|) Σ_{a'} A(s, a')        (18)

where |A| is equal to the size of the designed action set. When the dueling architecture is combined with DDQN, the training process can be represented by Algorithm 2. Note that in the present work, the ''soft'' target update [39] is employed so that the target network is moved toward the learned network in small steps each time, rather than copying the whole set of network parameters directly, which improves learning stability.
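The two algorithmic ingredients, the dueling aggregation and the double-DQN target, are easy to isolate in NumPy. This is a sketch of the equations themselves, not of the full training loop:

```python
import numpy as np

def dueling_q(V, A):
    """Combine value and advantage streams: Q = V + A - mean(A)."""
    A = np.asarray(A, dtype=float)
    return V + A - A.mean()

def ddqn_target(r, q_next_online, q_next_target, gamma, done):
    """Double-DQN target: the online net selects the action, the target net evaluates it."""
    if done:
        return r
    a_star = int(np.argmax(q_next_online))     # selection by online network
    return r + gamma * q_next_target[a_star]   # evaluation by target network

Q = dueling_q(1.0, [0.5, -0.5])                            # [1.5, 0.5]
y = ddqn_target(0.1, [2.0, 3.0], [1.0, 0.5], 0.99, False)  # action 1 selected, valued 0.5
```

Note that the selected action (index 1) is the argmax of the online estimates even though the target network values it lower, which is precisely how DDQN avoids the max-operator overestimation.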

B. STATE SPACE
The state space consists of the position, attitude, local clock time, battery and action information; at time step t, the state s_t is the vector collecting these quantities. The ODEINT package in Python SciPy is employed to solve the differential equations in Section II and update the state [9]. The smaller the integration step size, the more accurate the computed attitude. However, to reduce the number of interactions between the agent and the environment in each episode and to reduce the calculation time, the aircraft state is observed only once every n iteration steps, at which point a command is selected and held during the skipped iteration steps [40]. This process is illustrated in Fig. 5.
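The observe-every-n-steps scheme can be sketched as an action-repeat loop; the one-line environment and controller below are toy stand-ins for the flight simulation:

```python
def run_with_action_repeat(env_step, controller, s0, total_steps, n_repeat):
    """Advance a fixed-step simulation, querying the controller only every n_repeat steps."""
    s, a = s0, None
    for k in range(total_steps):
        if k % n_repeat == 0:       # observe the state and pick a new command
            a = controller(s)
        s = env_step(s, a)          # the command is held on the skipped steps
    return s

# Toy environment: state is a number, the action is added each step
final = run_with_action_repeat(lambda s, a: s + a,
                               lambda s: 1 if s < 50 else 0,
                               0, 60, 20)
```

With 60 simulation steps and n_repeat = 20, the controller is queried only three times, which mirrors the paper's 1 s simulation step versus 20 s command interval.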
The components of the state vector are expressed in different units and have different dynamic ranges. Therefore, the convergence process may fluctuate when the neural network is updated by gradient descent. To improve the training speed and accuracy, all elements are normalized to the range of 0 to 1 by Eq. 20 so that the neural network does not have to deliberately learn this scale [21].

Algorithm 1 Flight State Update
for each integration step in the control interval do
  Update the aircraft position and attitude by integrating Eq. 1
  Calculate the total available solar flux I_tot using the formulas in Table 3
  Calculate total input energy power P_solar using Eq. 4-8
  Calculate power required P_required using Eq. 9
  Update battery state SoC using Eq. 11 and potential energy E_potential using Eq. 13
end for
Output: state s_t+1

Algorithm 2 Double DQN With Dueling Architecture
Initialize replay buffer memory D to capacity N
Initialize the network parameters θ
Initialize the target network parameters θ⁻ ← θ
for episode = 1 to M do
  Initialize s_1
  Initialize preprocessed sequence g_1 = g(s_1)
  for timestep t = 1 to T do
    Select a random action a_t with probability ε; otherwise a_t = arg max_a Q(g(s_t), a; θ)
    Execute action a_t and update the flight state by Algorithm 1
    Observe the reward r_t and s_t+1
    Preprocess g_t+1 = g(s_t+1) and rescale
    Store transition tuple (g_t, a_t, r_t, g_t+1) in D, replacing the oldest tuple if ||D|| > N
    Sample a minibatch of N_m tuples (g_j, a_j, r_j, g_j+1) from D randomly
    Set y_j = r_j if the episode terminates at step j + 1; otherwise y_j = r_j + γ Q(g_j+1, arg max_{a'} Q(g_j+1, a'; θ); θ⁻)
    Update the network parameters θ by performing a gradient descent step on the loss L = (1/N_m) Σ_j || y_j − Q(g_j, a_j; θ) ||²
    Update the target network: θ⁻ ← τθ + (1 − τ)θ⁻
    Set s_t ← s_t+1
  end for
end for
In addition, because the policy is a feedforward network with no memory, the input vector at each time step t is composed of the state values of the previous several time steps so that the network can capture the flight dynamics.

C. ACTION SPACE
In this work, the action space of the controller is three dimensional, consisting of the commanded increments of thrust T_cmd, attack angle α_cmd and bank angle φ_cmd. From Eq. 1, the aircraft can change its attitude and flight altitude through these three commands. Because the DDQN with dueling architecture needs a discrete action space, five basic values are designed for the change in each command, as shown in Table 6, which means 125 action pairs can be chosen by the controller. To achieve accurate control in some situations while still allowing rapid attitude changes, symmetric numerical values are selected, and the commands are guaranteed not to exceed the physical limits during the response.
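The 5 × 5 × 5 discrete action set can be enumerated with itertools.product; the increment values and command limits below are placeholders, since the paper's Table 6 values are not reproduced here:

```python
from itertools import product

# Assumed symmetric increments per command (NOT the paper's Table 6 values)
d_thrust = [-2.0, -0.5, 0.0, 0.5, 2.0]     # N
d_alpha  = [-1.0, -0.2, 0.0, 0.2, 1.0]     # deg
d_phi    = [-5.0, -1.0, 0.0, 1.0, 5.0]     # deg

ACTIONS = list(product(d_thrust, d_alpha, d_phi))   # 5^3 = 125 command triples

def apply_action(T, alpha, phi, idx, limits):
    """Apply increment triple idx and clip each command to its physical limit."""
    dT, da, dp = ACTIONS[idx]
    clip = lambda v, lo, hi: min(max(v, lo), hi)
    return (clip(T + dT, *limits["T"]),
            clip(alpha + da, *limits["alpha"]),
            clip(phi + dp, *limits["phi"]))

lims = {"T": (0.0, 200.0), "alpha": (-5.0, 10.0), "phi": (-30.0, 30.0)}
state = apply_action(100.0, 2.0, 0.0, 0, lims)   # the most-negative increment triple
```

Mixing a small and a large magnitude for each command is what lets the same discrete set provide both fine trimming and fast attitude changes.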

D. REWARD
In reinforcement learning, the reward function directly affects the training results. Therefore, it is very important to design a reward function that is reasonable and has realistic physical meaning. Generally, the goal of solar-powered UAV trajectory optimization during high-altitude navigation is to achieve longer flight endurance, or to retain more energy than the initial condition after a 24-hour flight, under the premise of meeting the mission constraints and system dynamics. Although environmental uncertainty and other factors affect the flight time, the total energy available in the system is still the main factor [11]. Therefore, the reward function constructed for the HALE solar-powered aircraft is shown in Eq. 21.
r = E_battery + κ E_potential        (21)

where E_battery is the temporal energy stored in the battery and E_potential is the gravitational potential energy relative to the initial height, calculated by Eq. 13. κ is a weight coefficient used to amplify the influence of the potential energy and is set to 0 at night to save battery power:

κ = κ_day in the daytime,    κ = 0 at night        (22)

To be closer to a real flight, rules are imposed such that when the state of the aircraft exceeds the constraint range, for example, when the angle of attack exceeds the specified maximum or the HALE aircraft flies out of the specified altitude band, the aircraft is judged to have crashed, and a large penalty is given.
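A hedged sketch of this reward logic: a weighted total energy when the flight is valid, and a large penalty on a constraint violation. The weight, penalty, and scale values are assumptions, not the paper's constants:

```python
def reward(E_batt, E_pot, is_day, crashed,
           kappa_day=0.5, penalty=-100.0, scale=1.0e-6):
    """Illustrative reward: scaled weighted energy, with a crash penalty dominating."""
    if crashed:
        return penalty                      # constraint violation ends the episode
    kappa = kappa_day if is_day else 0.0    # potential energy only rewarded in daytime
    return scale * (E_batt + kappa * E_pot)

r_day = reward(9.0e6, 2.0e6, True, False)
r_crash = reward(9.0e6, 2.0e6, True, True)
```

Setting κ to zero at night removes the incentive to hold altitude on battery power, so the agent learns to glide down and spend potential energy instead.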

E. THE INNER-LOOP RESPONSE MODEL
To concentrate on the autonomous navigation problem, it is assumed that an ideal low-level controller can track the control command u_cmd within a constant response time once the command is given. Thus φ, T and α in Eq. 1 are calculated by Eq. 23 until the next control commands are chosen.
u(t) = u(t_0) + ( u_cmd − u(t_0) ) (t − t_0) / t_a  for  t_0 ≤ t ≤ t_0 + t_a,    u(t) = u_cmd thereafter        (23)

where u(t) represents the response value to the command at time t, u_cmd is the control command given at t_0 and u(t_0) is the initial value. t_a is the constant response time, after which the specified command is fully tracked.
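Under a linear-ramp reading of this response model, the tracked value interpolates from the old value to the command over the response time t_a:

```python
def track_command(u0, u_cmd, t, t0, t_a):
    """Linear ramp from u0 to u_cmd over response time t_a (one plausible reading of Eq. 23)."""
    frac = min(max((t - t0) / t_a, 0.0), 1.0)   # 0 at t0, 1 at t0 + t_a and beyond
    return u0 + (u_cmd - u0) * frac

mid = track_command(0.0, 10.0, 1.0, 0.0, 2.0)   # halfway through the response
done = track_command(0.0, 10.0, 5.0, 0.0, 2.0)  # fully tracked after t_a
```

Clamping the fraction at 1 encodes the statement that after t_a the specified command is fully tracked and held.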

IV. SIMULATION AND RESULTS
A. SIMULATION SETUP

In the present work, the flight test area is assumed to be a vertical cylinder centered at location (39. …), and the aircraft parameters are those of Table 2, which is consistent with Keidel et al.'s work [27]. A numerical result for the total available solar flux at a similar latitude and on the same day by Martin et al. [9], denoted by the black short-dashed line, is used to verify our calculation code. As shown in Fig. 6, only a slight divergence, with a maximum difference of 2 %, is found compared with Martin et al.'s results, which indicates that the solar flux calculation code in the present work is reliable. Fig. 7 shows the total available solar flux at different altitudes over the flight test location on the summer solstice. It is obvious that the total available solar flux increases with altitude. The additional solar energy might be absorbed by the aircraft to meet the flight power requirements at high altitude.
It should be noted that for HALE solar-powered aircraft, an increase in flight altitude always leads to an increase in the thrust power required for level flight due to the decrease in air density. The low air density causes a low Reynolds number, which in turn decreases both the lift coefficient and the lift-drag ratio, as shown in Fig. 3. To balance gravity, the flight speed must therefore be increased, so according to Eq. 1, the required thrust power increases with the flight altitude.

B. BASE CASE
To compare with the RL flight controller, a steady circular trajectory is considered as the base case in the present work. The cruising altitude is set to 15 km, which is the minimum altitude allowed by the mission. The radius of the circular trajectory is set to 5 km, consistent with the flight test area radius. These settings lead to the minimum thrust power requirement in level flight. Similar base cases were also used in [9] and [14] to benchmark their optimized trajectories, but with different radius constraints and flight test locations.
The minimum thrust power when the aircraft cruises on a circular trajectory can be estimated by solving the following optimization problem:

min P_required  subject to  (L + T sin α) cos φ = m g,  (L + T sin α) sin φ = m V²/R,  T cos α = D        (24)
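The minimum-power cruise condition can be estimated numerically: for each candidate speed, the bank angle follows from the turn radius, the lift from the vertical force balance, and the thrust from the drag. The aircraft constants and the drag polar are illustrative assumptions, not the paper's values:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative values; the paper's aircraft and atmosphere parameters differ.
m, g, S, rho, R = 350.0, 9.81, 60.0, 0.19, 5000.0

def level_turn_power(V):
    """Thrust power for a steady circular turn of radius R at speed V."""
    phi = np.arctan(V ** 2 / (g * R))          # bank angle required by the turn
    L = m * g / np.cos(phi)                    # vertical force balance
    C_L = L / (0.5 * rho * V ** 2 * S)
    C_D = 0.015 + 0.04 * C_L ** 2              # assumed parabolic drag polar
    D = 0.5 * rho * V ** 2 * S * C_D
    return D * V                               # T = D in steady flight

res = minimize_scalar(level_turn_power, bounds=(15.0, 60.0), method="bounded")
V_opt, P_min = res.x, res.fun
```

With a 5 km radius the required bank angle is under a degree at these speeds, so the result is close to the straight-and-level minimum-power speed, which is why the circular base case is a fair endurance baseline.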
Since the base case has a steady circular trajectory, the initial conditions are consistent with the solution of Eq. 24, as shown in Table 7. On this basis, the 24-hour flight data were calculated; the three-dimensional trajectory is shown in Fig. 8, and the altitude and SoC curves are shown in Fig. 9. The final SoC was 27.1 % after a 24-hour flight, lower than the initial SoC of 30 %, while the altitude was maintained at the initial 15 km. The 24-hour flight data of the base case thus indicate that the total stored energy E_tot,stored decreases relative to the initial conditions. This case provides a baseline for the subsequent comparisons.

C. NETWORK ARCHITECTURE AND HYPERPARAMETER SETTING
The network architecture adopted is shown in Fig. 10: there are two fully-connected layers with 512 units each, after which the dueling network splits into two streams of fully-connected layers. The value stream has a fully-connected layer with 256 units, and the advantage stream has a fully-connected layer with 128 units. The final hidden layers of the value and advantage streams are both fully-connected, with the value stream producing a single output and the advantage stream producing as many outputs as there are valid actions.
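The dueling forward pass can be sketched with plain NumPy and random weights. The layer widths follow the text; the state dimension and action count below are assumptions, since the paper's observation and action encodings are not reproduced in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # Toy fully-connected layer with He-style random initialization
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out)

relu = lambda x: np.maximum(x, 0.0)

# Widths from the paper: 2x512 shared, 256 (value stream), 128 (advantage stream).
# STATE_DIM and N_ACTIONS are illustrative assumptions.
STATE_DIM, N_ACTIONS = 8, 27
W1, b1 = dense(STATE_DIM, 512)
W2, b2 = dense(512, 512)
Wv1, bv1 = dense(512, 256); Wv2, bv2 = dense(256, 1)
Wa1, ba1 = dense(512, 128); Wa2, ba2 = dense(128, N_ACTIONS)

def q_values(state):
    h = relu(relu(state @ W1 + b1) @ W2 + b2)
    v = relu(h @ Wv1 + bv1) @ Wv2 + bv2      # scalar state value V(s)
    a = relu(h @ Wa1 + ba1) @ Wa2 + ba2      # per-action advantages A(s, a)
    # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    return v + a - a.mean(axis=-1, keepdims=True)

q = q_values(rng.standard_normal(STATE_DIM))
print(q.shape)  # one Q-value per discrete action
```

Subtracting the mean advantage is the standard identifiability trick for dueling networks; the controller then picks the action with the largest Q-value.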
The discount factor is set to γ = 0.99, and the Adam optimizer [41] is chosen for training the neural network with a learning rate of 0.0001 and a minibatch size of 64. The soft target update coefficient is τ = 0.001, and the replay buffer size is 10^6. These hyperparameter settings follow [42]. In addition, the exploration rate ε decreases linearly from 0.7 to 0.1 over the first 1 million steps and is fixed thereafter.
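The exploration schedule and soft target update follow directly from these numbers. A minimal sketch (plain dictionaries stand in for network parameters):

```python
def epsilon(step, eps_start=0.7, eps_end=0.1, decay_steps=1_000_000):
    # Linear decay from 0.7 to 0.1 over the first 1M steps, then held fixed
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def soft_update(target, online, tau=0.001):
    # Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target
    return {k: tau * online[k] + (1.0 - tau) * target[k] for k in target}

print(epsilon(0), epsilon(500_000), epsilon(2_000_000))
```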
The initial conditions for each episode are consistent with those of the base case shown in Table 7. The flight data are updated every 1 second, and the RL controller generates a command every 20 seconds; therefore, there are at most 4,320 timesteps per episode. When a 24-hour flight is completed or the aircraft is determined to have crashed, the episode restarts with the initial conditions. A test episode is run every 1 million timesteps, and its total reward is visualized during training, as shown in Fig. 11. Using the models and constraints described above, the final controller is obtained in approximately 140 million timesteps. The trained model takes 1 millisecond to output flight commands when tested on a laptop with an i5-10210U CPU, which means that the RL controller can reasonably be expected to operate in real flight. Additionally, the single-core CPU time required by different online algorithms to predict a 30-minute-horizon trajectory is used to further compare the real-time performance of the methods [10]. The results, obtained with the same sampling time as Marriott's work, are shown in Table 8 and demonstrate that the computational time decreases greatly with the end-to-end control capability of the RL controller.
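The episode timing described above (1 s dynamics updates, 20 s command interval, at most 4,320 decisions per 24-hour episode) can be sketched as a loop; `policy`, `step_dynamics` and `crashed` are placeholders, not the paper's implementations:

```python
SIM_DT = 1                 # flight-dynamics update interval, s
CONTROL_DT = 20            # RL command interval, s
EPISODE_SECONDS = 24 * 3600

def run_episode(policy, step_dynamics, initial_state, crashed):
    # Dynamics advance at 1 Hz, the controller is queried every 20 s,
    # and the episode ends after 24 h or on crash.
    state, t = initial_state, 0
    while t < EPISODE_SECONDS and not crashed(state):
        action = policy(state)        # discrete thrust / attack-angle / bank command
        for _ in range(CONTROL_DT // SIM_DT):
            state = step_dynamics(state, action)
        t += CONTROL_DT
    return state

print(EPISODE_SECONDS // CONTROL_DT)  # at most 4320 controller timesteps per episode
```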

D. 24-HOUR FULL TRAJECTORY OF RL-CONTROLLED AIRCRAFT
The well-trained RL controller is employed to test the performance over a full 24-hour trajectory. Two cases are compared in which the controller selects an action at intervals of 10 s and 20 s, as shown in Table 9. The relative differences in the maximum stored energy and the final SoC over the 24-hour trajectory are -0.17 % and -0.04 %, respectively, so the sampling time is set to 20 s for the presentation of the results. The longer sampling time also frees the remaining CPU time for other aircraft tasks, such as reconnaissance data processing and communication relay task assignment.
The altitude and SoC curves of the 24-hour flight are shown in Fig. 12(a). The power consumption time histories of the RL-controlled aircraft propulsion system, the base-case propulsion system and the avionics system are presented in Fig. 12(b). The RL controller learned a strategy that can be divided into five intuitive stages: preclimb, climb, high-altitude cruise, glide descent and low-altitude hover. The complete three-dimensional flight trajectory is presented in Fig. 13, and typical trajectories of the first four stages are shown in Fig. 14.

1) STAGE 1: PRECLIMB
The preclimb stage lasted from 5:00 to approximately 6:50. Since the total available solar flux increases with altitude, the RL controller decided to make full use of the battery's remaining energy for a slight climb from 15 km to 17 km until the SoC reached a minimum value of 0.205, very close to the limit state. This caused an increase in solar energy absorption due to the higher altitude. A sharp rise in the power requirement occurred at the beginning of the preclimb stage, as shown in Fig. 12(b), consistent with the RL controller commands. Subsequently, a short cruise at 17 km altitude induced a decline in propulsion power and allowed a slight recharge to preserve the health of the battery SoC until the average solar flux was sufficient for a continuous climb after 6:40. (In the three-dimensional trajectory of Fig. 13, yellow denotes the preclimb stage, red the climb, green the high-altitude cruise, gray the glide descent, and blue the low-altitude hover; the dashed black line is the horizontal boundary.) A close-up view of the flight altitude and battery charging power is given in Fig. 17. In addition, Fig. 14(a) shows a typical circle trajectory during this process, and Table 10 shows that the RL-controlled aircraft had an orbital period of approximately 0.31 h with an orbital radius of 4,200 m, faster than the 0.4 h of the base case owing to the increased flight speed at high altitude. Fig. 15(b) details the time-based comparison of solar energy absorption between the RL controller case and the base case: the RL-controlled aircraft clearly absorbed more solar power during the majority of the preclimb stage. The preclimb stage shows that the RL controller attempted to balance solar energy absorption against flight energy cost even in a critical state.
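The reported orbit numbers can be sanity-checked with simple arithmetic, since ground speed is just circumference over period:

```python
import math

def orbital_speed(radius_m, period_h):
    # V = 2 * pi * r / T for a steady circular orbit
    return 2.0 * math.pi * radius_m / (period_h * 3600.0)

# Implied ground speeds from the reported orbits (arithmetic check only)
print(round(orbital_speed(4200, 0.31), 1), "m/s, RL controller (preclimb)")
print(round(orbital_speed(5000, 0.40), 1), "m/s, base case")
```

The RL controller's smaller, faster orbit implies a higher ground speed than the base case, consistent with the faster flight at higher altitude noted in the text.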
To the best of our knowledge, this is the first report of a preclimb stage, although it is reasonable intuitively.

2) STAGE 2: CLIMB
The climbing stage lasted from approximately 6:50 to 12:10. Fig. 12 shows that the aircraft chose to increase altitude rather than recharge the batteries, implying that the RL controller understood that increasing flight altitude was more valuable than recovering the battery SoC and that there was enough time to recharge during the remaining daytime; a full altitude-priority strategy was therefore generated. When approaching 24 km, the aircraft decreased its climb rate to avoid exceeding the ceiling of 25 km. A typical circle trajectory during the climbing stage is shown in Fig. 14(b), and a detailed time-based comparison of the solar power received by the RL controller and the base case is shown in Fig. 15(c). Consistent with the preclimb stage, the RL-controlled aircraft received more solar energy than the base case by flying higher with a faster circular motion. The orbital period decreased to 0.23 h with a radius of 3,981 m at approximately 8:00 due to the continuous increase in flight speed. At the end of this stage, the aircraft had stored 3.6 kWh of gravitational potential energy and 60 % SoC.
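The gravitational-potential figure can be checked against E_p = m g Δh. The implied mass below is only a back-calculation from the quoted 3.6 kWh, assuming the potential is measured from the 15 km altitude floor (~9 km gain); it is not a specification from the paper:

```python
G = 9.81

def potential_energy_kwh(mass_kg, delta_h_m):
    # E_p = m * g * dh, converted from joules to kWh
    return mass_kg * G * delta_h_m / 3.6e6

# Back-of-envelope only: 3.6 kWh stored over an assumed ~9 km gain
# (15 km floor -> 24 km) corresponds to an all-up mass of roughly:
implied_mass_kg = 3.6 * 3.6e6 / (G * 9000.0)
print(round(implied_mass_kg, 1), "kg")
```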

3) STAGE 3: HIGH-ALTITUDE CRUISE
The high-altitude cruise stage lasted from 12:10 to 18:30. The aircraft cruised on a circular trajectory between 24 km and 24.5 km altitude for more than 6 hours, and the battery SoC reached 100 % at approximately 15:00 (see Fig. 12). A typical circle trajectory during this stage is shown in Fig. 14(c), and a detailed time-based comparison of the solar power received by the RL-controlled aircraft and the base case is shown in Fig. 15(d). The RL-controlled aircraft gained an additional 0.3 kW of solar power absorption on average compared with the base case at approximately 12:00. Moreover, a slight climb occurred after the battery SoC became full, further implying that the RL controller attempted to take full advantage of solar power under the 25 km ceiling even though the solar power decreased dramatically compared with midday. After 17:00, slight declines in both flight altitude and SoC appeared due to insufficient solar power, hinting that the RL controller began to shift its strategy from altitude priority to SoC priority.

4) STAGE 4: GLIDE DESCENT
The glide descent stage lasted from 18:20 to 21:30. Due to the sustained decline in solar power, the RL controller turned fully to the SoC-priority strategy to preserve battery energy. The thrust decreased gradually to zero in this stage, and the aircraft hardly consumed any battery energy except the power necessary to keep the avionics working, as shown in Fig. 16(a). The aircraft fully released the gravitational potential energy stored during the daytime until it reached the minimum altitude limit of 15 km. A typical circle trajectory during this stage is shown in Fig. 14(d), and a detailed time-based comparison of the solar power received is shown in Fig. 15(e). The RL controller gained more solar power during the majority of this stage than the base case, owing to the higher altitude, which helped to reduce the battery energy consumed by the avionics.

5) STAGE 5: LOW-ALTITUDE HOVER
When the aircraft reached the minimum altitude limit of 15 km, it flew around the outer circle of the mission radius until the sun rose the next day, attempting to hold the optimal output power of 1.75 kW while maintaining altitude. Due to the discrete action commands, its power consumption showed a small oscillation, but the average was generally close to the minimum power of the base case, as shown in Fig. 16(b).
The last four stages were also found in Wang et al.'s results [15], which maximize the net energy absorption over a 24-hour flight cycle. However, in Wang's research the stages were predefined in the problem setting before applying the pseudospectral method with flexible parameters, which limits the type and number of stages; this may explain why the preclimb stage was not found there. Fig. 12(b) shows the 24-hour time histories of the RL controller thrust power, the base-case thrust power and the avionics power. Because of the discrete control commands, the thrust power of the RL controller oscillated; however, the smoothed curve clearly shows that it was approximately 0 kW in the glide descent stage and matched the base case's in the low-altitude hover stage. Fig. 18 compares the battery SoC and total stored energy E tot,stored between the RL controller and the base case. The results show that a maximum total stored energy of 27.6 kWh was reached by the RL controller thanks to the gravitational potential store, 15 % higher than the base case's 24 kWh. After the 24-hour flight, the total stored energy remaining was 8.34 kWh, which is 14 % higher than the initial 7.2 kWh and 27.9 % higher than the base case's 6.5 kWh. The results strongly suggest that a neural network controller trained by reinforcement learning can automatically plan the flight path and support long-endurance flight.

E. SUSTAINED FLIGHT RESULTS
Previous research has focused on flight strategy optimization within a single 24-hour flight, and many remarkable results have been reported by comparing the trajectory optimization results with a control group [9], [14]. However, how well these fixed optimized flight strategies work under divergent flight conditions, such as on different days, without reoptimization remains an open question. To explore the applicability to unknown solar conditions, the RL controller trained on the aforementioned 24-hour flight was applied directly to a sustained flight test. The initial conditions were consistent with those in Table 6, and a detailed comparison between the RL controller and the base case in the sustained flight test was carried out.
The time histories of flight altitude and battery SoC are presented in Fig. 19. A full 39-day endurance flight was achieved from a departure date of the summer solstice. The five flight stages, preclimb, climb, high-altitude cruise, glide descent and low-altitude hover, were observed, consistent with the 24-hour flight; when the temporary battery energy was sufficient to support the climb, the preclimb stage gradually faded away. The extreme values of the maximum SoC, minimum SoC, maximum altitude and minimum altitude in the sustained flight test, together with their arrival times, are presented in Fig. 21. The maximum battery SoC reached 100 % every day, but its arrival time was gradually delayed from 14:45 to 16:00 as the day number increased, as shown in Fig. 21(a), whereas the opposite tendency was found for the minimum SoC and its arrival time, as shown in Fig. 21(b). This is due to the continuous decline in the intensity and duration of solar radiation. Moreover, Fig. 21(c-d) shows few changes in the maximum and minimum flight altitudes and their arrival times. However, the cruising time above 24 km altitude gradually decreased from 6.1 hours on the 5th day to 3.6 hours on the 35th day, and the time of reaching 15 km altitude gradually moved earlier (see Fig. 20(a)). The cruising time at 100 % SoC also decreased, from 2.7 hours on the 5th day to 1.4 hours on the 35th day (see Fig. 20(b)). This shows that the RL controller dynamically adjusts the flight strategy to meet the energy requirements of subsequent flight stages.
In addition, an aircraft using the base-case strategy crashed after 26 days because the battery SoC fell below the protective value of 20 %. Fig. 22(a) compares the total stored energy at 5:00 every day between the strategy generated by the RL controller and the base case. An increase of 28 % was observed on the 5th day, which indicates that the RL controller's strategy is beneficial for maintaining long flight. The total stored energy on the 25th day was 8.29 kWh, comparable to the 8.22 kWh of the 20th day, implying that the RL controller attempted to preserve its energy reserve. Table 11 compares the available solar energy at 20 km altitude in the five stages, divided according to the starting time of each stage on the first day. The available solar energy in stages 4 and 5 declined sharply due to the earlier sunset, and a slight decrease of 3.5 % was observed in stage 1 due to the later sunrise. All of these declines led to an insufficient power supply, which caused the crash on the morning of the 40th day.
By combining Eq. 12 and integrating it over 24 hours, the relationship between the energies can be inferred as Eq. 25, where E battery denotes the difference between the battery energy at the end and at the beginning of the integration interval. On this basis, the solar energy utilization ratio η can be defined as the ratio of the solar energy actually used by the aircraft to the total available solar energy over the 24 hours. The physical meaning of η is the aircraft's capacity to use the entire 24-hour solar energy; a higher η indicates that less solar energy is wasted.
Fig. 22(b) shows that the RL controller absorbs more energy, reaching a utilization ratio of approximately 95 %, whereas the base case reaches only 65 %. With this trained RL controller, the flight endurance of the aircraft increased by 50 %.
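Under this definition (symbol names below are assumptions, since Eq. 25 and the exact η formula are not reproduced in this excerpt), the reported 95 % versus 65 % gap translates into a large usable-energy margin:

```python
def utilization_ratio(e_used_kwh, e_available_kwh):
    # eta = solar energy actually used / total available solar energy over 24 h
    return e_used_kwh / e_available_kwh

# Illustrative numbers only: with equal available energy, eta = 0.95 vs 0.65
gain = utilization_ratio(95.0, 100.0) / utilization_ratio(65.0, 100.0) - 1.0
print(f"relative usable-energy gain: {gain:.0%}")
```

A roughly 46 % relative gain in usable energy per day is consistent with the 50 % endurance improvement reported for the sustained flight.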

V. CONCLUSION
In the present work, a deep reinforcement learning framework was applied to the trajectory planning and automatic navigation of a high-altitude long-endurance unmanned aerial vehicle. A station-keeping task was assumed, which requires flight within a radius of 5 km and an altitude range from 15 km to 25 km. The solar radiation model and the flight dynamics model were introduced. The numerical results show that the RL-controlled aircraft retained more energy after the full 24-hour flight, 27.9 % more than the base case. Subsequently, a sustained flight was implemented to test the applicability of the RL controller: a full 39-day flight was achieved, whereas the base case flew for only 26 days. The presented work shows that a controller generated by reinforcement learning may be an alternative method for improving the flight endurance of HALE solar-powered UAVs.
However, double DQN with a dueling architecture is an algorithm for discrete action spaces; hence, the flight control commands may oscillate. In future work, a continuous-action-space RL method might be adopted, and the wind field should be considered in the simulation. Since reinforcement learning maximizes the long-term cumulative return, the proportional relationship between potential energy and battery energy in the optimization process still needs further exploration.