Energy-Optimal Trajectory Planning for Near-Space Solar-Powered UAV Based on Hierarchical Reinforcement Learning

One of the key technologies for achieving day-and-night flight, tracking the solar peak, and reducing flight energy consumption for a near-space solar-powered unmanned aerial vehicle (UAV) is trajectory planning. However, the environmental differences that a near-space solar-powered UAV faces during long-term flight pose challenges to online trajectory planning. This article introduces a hierarchical guidance method designed with a hierarchical reinforcement learning algorithm, comprising a two-layer neural network structure of bottom-level trajectory planning models and a top-level decision model. The top-level decision maker selects the appropriate bottom-level planner based on flight and current environmental information, while the planner outputs thrust, attack angle, and bank angle commands based on the input information. This hierarchical guidance structure improves the UAV's adaptability to variations in the energy environment and realizes autonomous, energy-maximizing flight in long-term missions. Flight simulations spanning the spring, summer, and autumn seasons show that the guidance controller switches flight policies on its own as the environment changes, allowing the UAV to maximize energy gain on each day and thereby achieve the best energy management strategy in long-term flight. The simulation results also verify the over-fitting and under-fitting effects of neural networks in the solar UAV trajectory planning task, supporting the necessity of hierarchical guidance.


I. INTRODUCTION
A near-space solar-powered Unmanned Aerial Vehicle (UAV) can theoretically fly permanently, relying on the inexhaustible supply of solar energy. Its working environment is 10 km-30 km above the ground [1]. Characteristics such as long endurance, high flight altitude, and flexible deployment are unique to this type of aircraft, and it is currently widely used in military and civilian fields such as communication relay, air warning, and ground observation. Trajectory planning is one of the key technologies promoting the development of near-space solar-powered UAVs. To meet the flight requirements of near-space vehicles, which need to track the maximum solar irradiation and ensure low flight energy consumption, it is necessary to design appropriate flight trajectory planning methods and optimize flight strategies.
Current research on the flight strategy of solar-powered UAVs focuses on two aspects [2]. One is to study the relationship between the photovoltaic cells and the incidence angle of sunlight during flight, so that more energy can be obtained by optimizing the flight attitude. The other is to explore energy management strategies based on a wide range of altitude variations, reducing the platform's dependence on batteries through gravity energy storage.
In terms of attitude optimization, Klesh and Kabamba [3] first investigated the trajectory optimization problem of a solar UAV between two waypoints, discussed in detail the effects of bank angle and flight speed on energy absorption when the UAV cruises along the optimal path, and proposed a non-dimensional power ratio parameter to predict the qualitative characteristics of the aircraft's energy optimum using the maximum principle. Dwivedi et al. [4] sought to maximize the solar power input by solving for the optimal bank angle and attack angle in advance, and then took the optimization results as the desired attitude input of an inner-loop controller; the effectiveness of the method was verified by real flights based on sliding mode control. At the three-dimensional level, Spangelo and Gilbert [5] explored the three-dimensional path planning of a solar-powered UAV on a vertical cylindrical surface based on a closed ground path, and proposed a parametric trajectory optimization method based on periodic spline functions. Wang [6] studied trajectory optimization in free three-dimensional space, using the Gauss pseudospectral method to exploit changes of attitude angle and flight altitude jointly for 3D flight path optimization. In terms of gravity energy storage, Gao et al. [7] studied the release of stored gravitational potential energy by the solar UAV through unpowered gliding, analyzed the equivalence of gravitational potential energy and storage batteries, and proposed an energy management strategy based on gravitational potential energy storage over the diurnal cycle. Based on the work of Gao et al., Lee and Yu [8] studied an energy management strategy considering uncertain weather conditions.
Compared to general solar-powered UAVs, the missions of a near-space solar-powered UAV are typically characterized by a larger spatial and temporal span. This large spatiotemporal span easily leads to the UAV being unable to follow the pre-planned trajectory exactly in actual flight. Therefore, the near-space solar-powered UAV needs online trajectory planning to correct trajectory errors in real time. Martin et al. [9] used nonlinear Model Predictive Control (MPC) to optimize the flight trajectory in a cylindrical space by adjusting the optimal attitude on a smaller time scale of 8 seconds; the maximum total stored energy was increased compared to the steady-state trajectory. Hamidreza et al. [10] were probably the first to apply neural networks to solar UAV trajectory planning: they used an interior-point optimization technique and a bounded nonlinear simplex search algorithm to solve the energy-optimal trajectory planning problem, and used the planning results to train an Adaptive Neuro-Fuzzy Inference System (ANFIS) that can be deployed in the flight control system, providing a new avenue for online energy-optimal trajectory planning.
The above methods can partly correct the flight trajectory, but they still have disadvantages such as long computation time. Deep reinforcement learning (DRL) has been shown to provide highly efficient and stable real-time online trajectory planning for solar-powered UAVs. Ni et al. [11] investigated an energy-optimal trajectory planning method for high-altitude long-endurance solar aircraft using DRL, obtaining trajectories that consider both the flight attitude relative to the sunlight incidence angle and gravitational potential energy storage, and verified the feasibility of the reinforcement learning framework for online trajectory planning of solar-powered UAVs. Xi et al. [12] considered the effect of the wind field based on Ni's work and experimentally verified that the reinforcement learning controller has advantages over the traditional L1 trajectory planning method for wind-gradient energy harvesting of solar-powered aircraft.
However, DRL still has some shortcomings in planning trajectories for solar-powered UAVs, among which the over-fitting problem of neural networks seriously affects the trajectory planning capability in long-endurance missions of near-space solar-powered UAVs. The meaning of over-fitting is briefly described as follows [13]. The main challenge in machine learning, including reinforcement learning, is that the model must perform well on new inputs, not just on the training set. The goal of machine learning optimization is therefore not only to reduce the training error, but also to close the gap between the training error and the generalization error, where the training error measures performance on the training set and the generalization error is the expectation of the model's error on the test dataset. When a neural network model is over-fitted, its capacity to fit functions is too high: it learns attributes of the training set that do not transfer to the test set, resulting in a significant gap between the generalization error and the training error.
The trajectory planning model of a solar-powered UAV requires a whole day of environmental solar energy data for training, and as the temporal and spatial span increases, the environmental energy of different dates varies greatly. The neural network easily overfits to characteristics of the training date, resulting in poor performance on other dates. The DRL method of [11] was used to demonstrate this phenomenon, and the resulting UAV state of charge (SOC) curves are shown in Fig. 1 and Fig. 2. In this case, the SOC gain obtained by the UAV adopting the flight policy trained on the summer-solstice energy environment for a 24-hour flight on the vernal equinox was lower than that obtained with the flight policy trained on the vernal-equinox energy environment. Similarly, the SOC gain obtained with the vernal-equinox policy for 24 hours of flight on the summer solstice was lower than that obtained with the summer-solstice policy. The mission of a near-space solar-powered UAV typically lasts months [14], and over-fitting can severely limit the application of existing DRL trajectory planning techniques. If all energy data within the long-endurance task time is used as the training set to avoid the impact of over-fitting, new problems arise. On the one hand, the extremely large amount of data will significantly increase the training time and cost of the model. On the other hand, it will make the neural network under-fit, that is, it will be difficult to fully fit the characteristics of the training set, which increases the training error and the generalization error at the same time.
Considering the above shortcomings, new methods are needed to overcome the difficulties faced by DRL trajectory planning technology for near-space solar-powered UAVs. Hierarchical reinforcement learning (HRL) decomposes the problem on the basis of reinforcement learning and is better able to adapt to complex environments and tasks [15]. Hierarchical deep reinforcement learning (HDRL) has already been explored in trajectory planning and flight decision making. Notter et al. [16] divided the glider flight task into two subtasks, triangle tracking and hot-air (thermal) ascent, based on an HRL method and designed a decision module; the glider was able to perform both subtasks well and to plan them autonomously within the mission. Lockheed Martin [17] successfully applied the HRL method to UAV air combat to achieve pursuit, evasion, and strike decisions in multidimensional air combat decision making. Zhang et al. [18] decomposed the complex task of collecting sensor data and charging for UAVs, and used an option-based HDRL method to optimize the UAVs' mission trajectories. In addition, other related studies [19], [20], [21], [22], [23] have shown the theoretical feasibility of HDRL in aircraft trajectory planning and autonomous decision making. The main difficulty of DRL-based trajectory planning for near-space solar-powered UAVs over large spatial and temporal spans comes from the large changes in the energy environment, and the HRL idea of decomposing a large problem into sub-problems provides an explorable direction for dealing with the complex energy environment of UAVs. Therefore, developing trajectory planning techniques based on HRL methods for long-endurance missions of near-space solar-powered UAVs is of theoretical significance and research value.
In summary, in order to take full advantage of the real-time performance, efficiency, and stability of the DRL technique in online trajectory planning for near-space solar-powered UAVs, and to overcome problems such as the degraded trajectory optimization capability of neural network models after variations in the energy environment, this paper makes the following contributions: (1) The variation pattern of the solar angles and solar irradiance over the time range from the vernal equinox to the autumnal equinox and the altitude range of 15 km-25 km is studied, and the energy environment is classified considering the near-space solar-powered UAV's own performance.
(2) For the different types of energy environments, DRL methods based on a continuous action space are designed to plan the trajectory of the solar UAV in 3D space for each environment type.
(3) By taking the trajectory planning models adapted to the different energy environment types as the lower-level policies and designing an upper-level decision model, a hierarchical guidance framework for the near-space solar-powered UAV based on HRL is established. The framework enables autonomous planning of long flight trajectories for the UAV.
The rest of the paper is organized as follows. Section II describes the mathematical models of the simulated flight platform and environment. Section III introduces the hierarchical reinforcement learning algorithm and the UAV hierarchical guidance framework. Section IV describes the detailed design of the HRL-based guidance system. Section V presents the simulation results of a long-endurance mission and analyzes and discusses the results. Finally, Section VI summarizes the research work of this paper and comments on future work.

II. MODEL
This section presents the mathematical models used in this study, including the dynamics and kinematics models of a near-space solar-powered UAV, a solar irradiation model, and an energy model.

A. DYNAMICS AND KINEMATICS MODEL OF THE UAV
In the trajectory planning of a near-space solar-powered UAV with extremely large spatial and temporal scales, the vehicle can be regarded as a mass point, so a simplified three-degree-of-freedom mass point model is used in this study. Neglecting the effects of vehicle sideslip angle, propulsion system installation angle, and the wind field, the coordinate system of the vehicle is shown in Fig. 3, where $O_gX_gY_gZ_g$ is the north-east-down inertial coordinate system, $O_bX_bY_bZ_b$ is the body coordinate system, $O_bX_b$ points to the front of the vehicle, $O_bY_b$ points to the right wing, and $O_bZ_b$ is downward. Analyzing the forces acting on the vehicle, a system of kinetic and kinematic equations is obtained as follows [24]:

$$
\begin{aligned}
m\dot{V} &= T\cos\alpha - D - mg\sin\gamma \\
mV\dot{\gamma} &= (T\sin\alpha + L)\cos\phi - mg\cos\gamma \\
mV\cos\gamma\,\dot{\psi} &= (T\sin\alpha + L)\sin\phi \\
\dot{x} &= V\cos\gamma\cos\psi \\
\dot{y} &= V\cos\gamma\sin\psi \\
\dot{z} &= -V\sin\gamma
\end{aligned}
\tag{1}
$$

where V is the flight velocity, α is the attack angle (the angle between $O_bX_b$ and the velocity vector), γ is the track angle, ϕ is the bank angle, ψ is the yaw angle, m is the total mass of the vehicle, g is the acceleration of gravity, taken as 9.81 m/s², and x, y, and z are the positions of the vehicle in the inertial coordinate system. In addition, the pitch angle θ of the vehicle can be expressed as:

$$\theta = \alpha + \gamma \tag{2}$$

The forces on the aircraft include thrust T, lift L, and drag D. Thrust T is generated by the propulsion system and is oriented parallel to the $O_bX_b$ axis. Lift L and drag D are generated by the airflow, with L perpendicular to the velocity and D parallel to the negative direction of the velocity. The expressions for L and D are as follows:

$$L = \frac{1}{2}\rho V^2 S C_L, \qquad D = \frac{1}{2}\rho V^2 S C_D \tag{3}$$

where ρ is the air density, calculated from the 1976 U.S. Standard Atmosphere Model, S is the wing area, $C_L$ is the lift coefficient, and $C_D$ is the drag coefficient. In this study, $C_L$ and $C_D$ are fitted as functions of the Reynolds number Re and the attack angle α, where Re is calculated from the current speed and altitude. The fitting results are shown in (4), and the values of $A_{ij}$ and $B_{ij}$ are given in Table 1.
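For concreteness, a minimal Python sketch of one integration step of the above point-mass model is given below. The forward-Euler step, the exponential placeholder atmosphere, and all identifiers are illustrative assumptions; the simulations in Section V use a Runge-Kutta scheme with a 0.02 s step.

```python
import numpy as np

def step_dynamics(state, T, alpha, phi, m, S, CL, CD, dt=0.02, g=9.81):
    """One forward-Euler step of the 3-DOF point-mass model (1).

    state = [V, gamma, psi, x, y, h] with h the altitude (positive up,
    equivalent to -z in the NED frame). rho(h) should come from the 1976
    U.S. Standard Atmosphere; a crude exponential fit stands in here.
    """
    V, gamma, psi, x, y, h = state
    rho = 1.225 * np.exp(-h / 7254.0)              # placeholder atmosphere
    L = 0.5 * rho * V**2 * S * CL                  # lift, equation (3)
    D = 0.5 * rho * V**2 * S * CD                  # drag, equation (3)
    dV = (T * np.cos(alpha) - D) / m - g * np.sin(gamma)
    dgamma = ((T * np.sin(alpha) + L) * np.cos(phi) - m * g * np.cos(gamma)) / (m * V)
    dpsi = (T * np.sin(alpha) + L) * np.sin(phi) / (m * V * np.cos(gamma))
    dx = V * np.cos(gamma) * np.cos(psi)
    dy = V * np.cos(gamma) * np.sin(psi)
    dh = V * np.sin(gamma)
    return np.asarray(state) + dt * np.array([dV, dgamma, dpsi, dx, dy, dh])
```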

B. SOLAR IRRADIATION MODEL
To estimate the solar radiation received by the UAV in real time, this paper introduces the mathematical model proposed by Keidel et al. [25].
The angle between the PV cell and the sunlight is shown in Fig. 4, where $n_s$ is the vector pointing towards the sun from the center of the PV panel, $n_{pm}$ is the normal vector of the PV panel, $\alpha_s$ is the solar altitude angle, and $\gamma_s$ is the solar azimuth angle. During flight in the near space, the effect of reflected radiation can be ignored due to the thin cloud layer and the few impurity micro-particles in the upper atmosphere. Therefore, the solar radiation received by the PV cell can be divided into two types. The first type is direct radiation $I_{dir}$, which is radiated directly from the sunlight to the PV panels. The second type is scattered radiation $I_{diff}$, which is repeatedly scattered by the Earth's atmosphere before reaching the PV panels. Thus, the total radiation $I_{tot}$ received by the PV cell can be expressed as:

$$I_{tot} = I_{dir} + I_{diff} \tag{5}$$

where $I_{dir}$ and $I_{diff}$ are given by the atmospheric attenuation expressions of [25]. In equation (6), $I_{on}$ is the solar irradiance outside the atmosphere on the $n_d$-th day of the year, $I_{on} = G_{sc}\left[1 + 0.033\cos(360 n_d / 365)\right]$, and $\alpha_{dep}$ is the corrected value of $\alpha_s$ relative to the ground level; its expression involves the Earth's radius. $G_{sc}$ is the standard solar radiation constant, taken as 1367 W/m², and $R_{earth}$ is the radius of the Earth, taken as 6356.8 km.

C. ENERGY MODEL
1) ENERGY ABSORPTION MODEL
The energy absorption model of PV cells designed in the work of Chang [26] is adopted, and its details are as follows.
For a PV cell with an area of $S_{PV}$, the solar irradiation flux through it can be calculated in terms of $n_{pm}$ and $n_s$ as follows:

$$\Phi_{solar} = I_{tot}\, S_{PV}\, (n_{pm} \cdot n_s) \tag{9}$$

where $n_{pm}$ and $n_s$ can be expressed in terms of the solar altitude angle, the solar azimuth angle, and the aircraft attitude angles, as shown in equation (10).
Then at any moment, the total power $P_{solar}$ converted by the PV cells laid on the solar aircraft can be expressed as:

$$P_{solar} = \eta_{MPPT}\, \eta_{PV}\, I_{tot}\, S_{PV}\, (n_{pm} \cdot n_s) \tag{11}$$

where $\eta_{MPPT}$ is the efficiency of the Maximum Power Point Tracking (MPPT) and $\eta_{PV}$ is the efficiency of the PV cell.
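A small sketch of this energy absorption computation follows; the NED/Euler-angle conventions and the efficiency values are assumptions for illustration, not the paper's exact equation (10).

```python
import numpy as np

def solar_power(alpha_s, gamma_s, roll, pitch, yaw, I_tot,
                S_pv=1.0, eta_mppt=0.95, eta_pv=0.20):
    """P_solar = eta_MPPT * eta_PV * I_tot * S_PV * max(0, n_pm . n_s).

    n_s points from the panel towards the sun (NED frame); n_pm is the
    panel normal, i.e. the body 'up' axis rotated by the aircraft attitude.
    Angle conventions and efficiencies are illustrative assumptions.
    """
    # Sun direction in the north-east-down frame
    n_s = np.array([np.cos(alpha_s) * np.cos(gamma_s),
                    np.cos(alpha_s) * np.sin(gamma_s),
                    -np.sin(alpha_s)])
    # Rotation body -> NED (Z-Y-X Euler), applied to the body up axis (0,0,-1)
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    R = np.array([[cy*cp, cy*sp*sr - sy*cr, cy*sp*cr + sy*sr],
                  [sy*cp, sy*sp*sr + cy*cr, sy*sp*cr - cy*sr],
                  [-sp,   cp*sr,            cp*cr]])
    n_pm = R @ np.array([0.0, 0.0, -1.0])
    return eta_mppt * eta_pv * I_tot * S_pv * max(0.0, float(n_pm @ n_s))
```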

2) ENERGY CONSUMPTION MODEL
The energy consumption during the flight of a near-space solar-powered UAV is mainly generated by the propulsion system and the avionics, and the simplified required power $P_{need}$ is expressed as [27]:

$$
P_{prop} = \frac{TV}{\eta_{prop}\,\eta_{mot}}, \qquad P_{need} = P_{prop} + P_{acc} \tag{12}
$$

where $P_{prop}$ is the power demanded by the propulsion system, $P_{acc}$ is the power demanded by the avionics, $\eta_{prop}$ is the efficiency of the propeller, and $\eta_{mot}$ is the efficiency of the motor.

3) ENERGY STORAGE MODEL
All systems of the near-space solar-powered UAV are powered by PV cells. SOC is the percentage of the remaining battery energy to the maximum stored battery energy, and its rate of change is related to the battery power $P_{battery}$, as shown in equation (14):

$$
\dot{SOC} =
\begin{cases}
\dfrac{P_{battery}}{E_{battery,max}}, & SOC < 100\% \\[2ex]
\dfrac{P_{battery}}{E_{battery,max}}\ \text{or}\ 0, & SOC = 100\%
\end{cases}
\tag{14}
$$

where $E_{battery}$ is the current battery energy and $E_{battery,max}$ is the maximum battery energy. The change rate of SOC is proportional to the battery power $P_{battery}$ during charging and discharging, while it is 0 when the battery is fully charged and has not started discharging.
The solar-powered UAV can also store energy as gravitational potential energy, which can be released during night gliding. The gravitational potential energy storage is expressed in equation (15), where $h_0$ is the initial height of the UAV:

$$E_{potential} = mg(h - h_0) \tag{15}$$
By establishing the above mathematical models, the flight status of the UAV can be solved in real time. The required parameter values are shown in Table 2, where $t_s$ is the ideal response time for aircraft control commands.

III. HIERARCHICAL REINFORCEMENT LEARNING
In this section, an option-based hierarchical reinforcement learning algorithm and a guidance framework for the near-space solar-powered UAV based on this algorithm are introduced in detail.

A. OPTION-BASED HRL
Hierarchical reinforcement learning provides a way to find spatio-temporal abstractions and behavioral patterns in long-term planning and complex control problems. It has mostly been studied in three main ways: the Options Framework [15], MAXQ Decomposition [28], and Hierarchical Abstract Machines (HAMs) [29]. In this study, an HRL technique based on the Options Framework is used. The details are as follows, and the reason for selecting the Options Framework is explained at the end of this subsection.
The Options Framework extends actions at the temporal level. Options, also known as skills or macro operations, are sub-policies with termination conditions. An option observes the environment and outputs actions until its termination condition is satisfied. The termination condition is an implicit split point in a class of temporal sequences, indicating that the corresponding sub-policy has done its job and the top-level policy-over-options needs to switch to another option. Given a Markov Decision Process (MDP) with state set S and action set A, an option ω ∈ O is defined as a triple $(I_\omega, \pi_\omega, \beta_\omega)$, where $I_\omega \subseteq S$ is a set of initial states, $\pi_\omega: S \times A \to [0, 1]$ is an intra-option policy, and $\beta_\omega: S \to [0, 1]$ is a termination function that provides a random termination condition. An option ω is available in state s only when $s \in I_\omega$. If the option ω is executed, the action is determined by the corresponding policy until the option is randomly terminated according to $\beta_\omega$. The policy-over-options is the high-level policy that the agent needs to learn, defined as $\mu: S \times O_s \to [0, 1]$, where O denotes the set of all options and $O_s$ denotes the set of options available at state s. $\mu(s, o)$ denotes the probability of selecting o as the current option at state s.
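The option triple can be encoded directly, as in the following minimal sketch for a tabular MDP (types and names are illustrative assumptions); the fixed-horizon termination used later in this paper is included as an example β.

```python
from dataclasses import dataclass
from typing import Callable, Set

State, Action = int, int    # placeholder types for a tabular MDP

@dataclass
class Option:
    """An option (I, pi, beta) in the Options Framework."""
    init_set: Set[State]                      # I: states where the option may start
    policy: Callable[[State], Action]         # pi: intra-option policy
    termination: Callable[[State], float]     # beta: P(terminate | state)

    def available(self, s: State) -> bool:
        return s in self.init_set

def make_fixed_beta(tau: int) -> Callable[[State], float]:
    """Fixed-horizon termination: beta is 0 before step tau, 1 at step tau,
    implemented with a step counter rather than the state itself."""
    count = {"t": 0}
    def beta(_s: State) -> float:
        count["t"] += 1
        return 1.0 if count["t"] >= tau else 0.0
    return beta
```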
Options can be viewed as a decomposition of the MDP at the temporal level, and the Semi-Markov Decision Process (SMDP) provides a theoretical view of the Options Framework with uncertainty in the duration between actions. The SMDP is a standard MDP equipped with an additional element F: (S, A, P, R, F), where F(τ|s, a) gives the probability that the transfer time is τ when action a is executed in state s.
The comparison of MDP and SMDP states is shown in Fig. 5. In the SMDP, the time interval between two decisions is τ. An option starts at $s_t$ and terminates at $s_{t+\tau}$. At each intermediate time $t \le k \le t+\tau$, the MDP depends only on $s_k$, while the SMDP may depend on the entire sequence preceding it. If SMDP Q-learning is used and the value is updated after each option terminates, the update formula is as follows:

$$Q(s, o) \leftarrow Q(s, o) + \eta\left[r + \gamma^{\tau}\max_{o'} Q(s', o') - Q(s, o)\right] \tag{16}$$

where r is the cumulative discounted reward received while option o runs for τ steps and η is the learning rate. The trajectory planning policy of a solar-powered UAV within a day and night can be seen as a 24-hour action sequence. One of the important evaluation indicators of the policy is the energy status of the UAV after a 24-hour flight, which depends on the entire action sequence. Therefore, the policy of each day is consistent with the concept of the option when using DRL for trajectory planning of UAVs in long-duration missions, so an option-based HRL approach is used in this study.
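A sketch of this SMDP Q-learning update for a tabular case follows; the dictionary representation of Q and the argument names are assumptions.

```python
def smdp_q_update(Q, s, o, r_cum, s_next, tau, n_options, lr=0.1, gamma=0.99):
    """Apply update (16): Q(s,o) += lr * [r + gamma^tau * max_o' Q(s',o') - Q(s,o)].

    r_cum is the discounted reward accumulated while option o ran for tau
    steps; Q is a dict keyed by (state, option index).
    """
    best_next = max(Q.get((s_next, o2), 0.0) for o2 in range(n_options))
    old = Q.get((s, o), 0.0)
    Q[(s, o)] = old + lr * (r_cum + gamma**tau * best_next - old)
    return Q
```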

B. GUIDANCE FRAMEWORK
The framework of the hierarchical guidance model based on the Options Framework for long-endurance flight of a near-space solar-powered UAV is shown in Fig. 6. The neural network of the top-level decision model learns the macro-action selection strategy, and the neural network of each bottom-level model learns the path planning strategy for a different energy environment. Through this framework, the UAV can select the appropriate trajectory planning strategy in different environments and output control actions through the corresponding strategy to complete the guidance of long-endurance flight missions.
In general option-based HRL, the policy-over-options, the intra-option policies, and the termination function need to be trained simultaneously. However, the SMDP in this study adopts a fixed step size, which is the number of time steps included within 24 hours. Therefore, the parameters of the termination policy are fixed and do not need to be updated. The termination function can be expressed in the following form:

$$\beta(s_t) = \begin{cases} 0, & t < \tau \\ 1, & t = \tau \end{cases} \tag{17}$$

where τ is the fixed number of time steps of the SMDP; the termination probability before step τ is 0 and at step τ is 1. The intra-option policies take pre-built bottom-level models, because a bottom-level model trained by taking the option framework as a whole has difficulty concretizing the action sequence into a policy that conforms to human logic and experience, and it is prone to over-fit to a certain action rather than a sequence of actions, so the intra-option policies need to be pre-designed with experience in mind. Therefore, in this study, the policy-over-options and the intra-option policies are trained separately.

IV. GUIDANCE CONTROLLER DESIGN
In this section, the methodology for delineating the energy environment of the near-space solar UAV is introduced, and the design of its hierarchical guidance controller is described in detail, including the bottom-level sub-task models and the top-level decision model. The entire process is presented as a flowchart in Appendix A.

A. ENVIRONMENTAL CLASSIFICATION
In this study, we investigate trajectory planning over a time scale of six months from the vernal equinox to the autumnal equinox, where the vernal equinox is the 80th day of the year and the autumnal equinox is the 266th day. The dates within this time span need to be classified according to the variation of the energy environment. The annual solar irradiation at 39.92°N, 116.42°E and an altitude of 15 km is shown in Fig. 7, which is calculated with the aforementioned solar irradiation model and is consistent with the result in [9]. The sunshine duration from the vernal equinox to the summer solstice extends from 12 hours to 14.9 hours, an increase of 24.17%, and the sunshine duration from the summer solstice to the autumn equinox shortens from 14.9 hours to 11.9 hours, a decrease of 20.13%. Due to the small difference in peak solar radiation, the difference in energy environment mainly depends on the duration of sunlight. Therefore, the closer the date is to the summer solstice, the more abundant the environmental energy; the closer the date is to the spring or autumn equinox, the more scarce the environmental energy. The state machine strategy is a commonly used path planning method for solar-powered UAVs proposed by Gao et al. [30], which is widely recognized and adopted. An aircraft with this policy essentially flies along the outer surface of a cylinder, and the transitions among climb, cruise, and descent are performed according to certain preset rules. The state machine policy divides the 24-hour trajectory of the solar UAV into five phases: the low-altitude hovering charging phase, the climbing phase, the high-altitude hovering phase, the descent phase, and the low-altitude hovering phase at night. This rule is designed around gravitational energy storage, which converts abundant solar energy into gravitational potential energy after the battery is fully charged.
The state machine strategy is adopted to plan the trajectory of the UAV on each day within the studied time range. Considering the actual flight conditions of the modeled UAV, the low hovering charging altitude is set to 15 km, the high hovering altitude to 24.8 km, and the hovering radius to 5 km, and the condition for switching to the climbing phase is that the SOC reaches 100%. The height curves of the resulting trajectories are shown in Fig. 8, and based on these height curves the trajectories can be categorized into classes A, B, and C. The class A trajectory appears from day 80 to day 91 and from day 253 to day 266: because the solar irradiation time is short in these two periods, the environment is scarce in energy and the UAV cannot fill its battery, so the UAV remains in the low-altitude hovering phase throughout. The class B trajectory appears from day 92 to day 129 and from day 211 to day 252, in which the solar irradiation time is longer and the environment richer in energy, but the battery power starts to drop once the UAV climbs, so the UAV is able to store gravitational potential energy but cannot maintain high-altitude hovering. The class C trajectory appears from day 130 to day 210, during which the solar irradiation time is much longer and the UAV can reach the highest altitude and then keep hovering for a period of time, so the UAV is able to complete gravity energy storage well.
To facilitate the study, the ability of the UAV to perform gravitational energy storage is used as the basis for environmental classification: the dates of class A trajectories form one type of environment, and the dates of class B and C trajectories form the other. Therefore, with reference to the flight performance of the UAV itself, the period from the vernal equinox to the autumn equinox can be divided into two environments. One is the energy-poor environment, including days 80 to 91 and days 253 to 266; the other is the energy-rich environment, including days 92 to 252.

B. BOTTOM TRAJECTORY PLANNING MODEL
To design the bottom-level options in the guidance framework based on the results of the environment classification, this study uses the maximum-entropy Soft Actor-Critic (SAC) algorithm to train the intra-option policies.

1) SAC ALGORITHM
The SAC algorithm is a model-free reinforcement learning method proposed by Haarnoja et al. [31]; it is an off-policy method that bridges the gap between stochastic policy optimization and deterministic policy methods. It brings the idea of maximizing entropy into the traditional actor-critic method and uses a stochastically distributed policy function similar to that of PPO [32]. The introduction of maximum entropy makes the policy as random as possible, so the agent can fully explore the state space, avoid falling into a local optimum prematurely, and discover multiple feasible solutions to the task, which improves robustness to disturbances. In addition, to improve performance, SAC employs techniques from deep Q-networks (DQN), introducing two Q-networks as well as target networks. Also, to express the importance of maximizing the entropy value, an adaptive temperature coefficient is introduced. The adjustment of the temperature coefficient for different problems is formulated as a constrained optimization problem, which maximizes expected return while keeping the entropy of the policy greater than a threshold. The pseudo code of SAC is shown in Algorithm 1.
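As a reference point, the two SAC loss terms described above (the entropy-regularized actor objective and the adaptive temperature) can be sketched in PyTorch as follows; this is a fragment under assumed interfaces, not the full Algorithm 1.

```python
import torch

def sac_actor_loss(policy, q1, q2, states, log_alpha):
    """Entropy-regularized actor objective of SAC.

    policy(states) must return a reparameterized action sample and its
    log-probability; q1/q2 are the twin critics (assumed interfaces).
    """
    actions, log_prob = policy(states)                 # reparameterized sample
    q_min = torch.min(q1(states, actions), q2(states, actions))
    alpha = log_alpha.exp().detach()                   # adaptive temperature
    return (alpha * log_prob - q_min).mean()

def sac_alpha_loss(log_alpha, log_prob, target_entropy):
    """Adaptive temperature update: keep policy entropy above target_entropy."""
    return -(log_alpha * (log_prob + target_entropy).detach()).mean()
```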

2) TRACK PLANNING MODEL FOR THE ENERGY-POOR ENVIRONMENT
To adequately represent the information of the UAV and the environment, the solar angles and the position, flight attitude, flight speed, battery status, and action information of the vehicle are used as inputs to the bottom-level model. Thus, at time step t, the input state is represented as a vector of these quantities. Note that each state variable is linearly normalized in order to eliminate the differences between their units.
Based on the energy absorption model, the natural choice of control variables would be the pitch and bank angles. However, since the stability of the UAV is greatly affected by the pitch angle, its ascent is instead determined by the thrust and angle of attack. In this model, the action space of the controller is three-dimensional and consists of the commanded incremental thrust $T_{cmd}$, attack angle $\alpha_{cmd}$, and bank angle $\phi_{cmd}$. Without exceeding the physical limits, the controller can choose any continuous value within the ranges listed in Table 3.
In order to guide the UAV to better manage energy, this study designs a dense reward function for the bottom-level trajectory planning model. Considering the difficulty of filling the battery of the solar-powered UAV in the energy-poor environment, the dense reward function is designed as in equation (19). This reward function only guides the UAV to maintain its own power, excludes other actions, and adopts a conservative flight strategy.
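The expression of reward (19) did not survive extraction; the following stand-in is therefore only a plausible assumption consistent with the stated intent of rewarding battery-power maintenance alone.

```python
def reward_energy_poor(soc_new, soc_old):
    """Hypothetical stand-in for reward (19): the original expression was
    lost in extraction. Assumed here to reward only the change in battery
    SOC, i.e. maintaining the UAV's own power with a conservative policy."""
    return soc_new - soc_old
```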
The policy network is shown in Fig. 9. It consists of two fully connected hidden layers of 512 units, each followed by a ReLU activation function. In this design, the policy is assumed to be Gaussian. Therefore, the last layer of the policy network consists of two parallel linear layers that respectively encode the mean and standard deviation of each command type. The policy network is trained by the SAC algorithm.
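A PyTorch sketch matching this description (two 512-unit ReLU layers with parallel mean and standard-deviation heads) is given below; the tanh squashing and the dimension choices are assumptions.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Two 512-unit ReLU layers with parallel mean / log-std heads, as
    described for the bottom-level policy network (state and action
    dimensions here are assumptions)."""
    def __init__(self, state_dim: int, action_dim: int = 3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.mean = nn.Linear(512, action_dim)      # command means
        self.log_std = nn.Linear(512, action_dim)   # command spreads

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        std = self.log_std(h).clamp(-20, 2).exp()
        dist = torch.distributions.Normal(self.mean(h), std)
        a = dist.rsample()                           # reparameterized sample
        # tanh squashing keeps commands inside the ranges of Table 3
        log_prob = (dist.log_prob(a)
                    - torch.log1p(-torch.tanh(a)**2 + 1e-6)).sum(-1)
        return torch.tanh(a), log_prob
```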

3) TRACK PLANNING MODEL FOR THE ENERGY-RICH ENVIRONMENT
The model takes the same state space, action space, and neural network design as the trajectory planning model for the energy-poor environment, while the dense reward function is designed to take full advantage of solar energy. Considering that the UAV is capable of climbing in an energy-rich environment, the reward function (20) consists of three parts, each of which encourages the UAV to pursue an abstract goal. First, the UAV fully charges its battery by absorbing solar energy after sunrise. Then the UAV converts the excess solar energy into gravitational potential energy, where the factor e amplifies the gravitational potential energy to the same level as the battery energy. The SOC threshold of 0.95 instead of 1 accounts for the damage caused to the battery by repeated charging and discharging.

4) TOP-LEVEL DECISION MODEL
Unlike the continuous action space of the bottom-level model, the action space of the top-level decision model consists of the bottom-level options, which are discrete actions, so the DQN algorithm is adopted to train the top-level decision model. As shown in Fig. 6, the UAV's own energy state and the environmental energy state are selected from the state input $s_t$ of the bottom-level model as inputs to the top-level model. At time step n, the input state can be represented in vector form as $s^h_n = [E_{potential}, E_{battery}, E_{env}]$, where n is an integer multiple of the decision interval of the top-level decision model. Let the decision interval be k time steps and let each time step of the bottom-level model represent a duration $\Delta t$; they satisfy the following relationship:

$$k\,\Delta t = 24\ \mathrm{h} \tag{22}$$

$E_{potential}$ and $E_{battery}$ represent the energy state of the UAV itself, calculated from $[h_t, SOC_t]$ in $s_t$. $E_{env}$ represents the environmental energy state, which depends on $\alpha_{s,t}$ and $\gamma_{s,t}$ at each time step in the following decision interval, that is, the solar radiation situation within 24 hours. Since trajectories based on the state machine strategy are the general benchmark cases in research on solar-powered UAVs, they reflect the spatial and temporal distribution of the aircraft along flight trajectories that consider gravity energy storage. Therefore, $E_{env}$ is calculated by integrating the energy absorbed by a 1 m² PV cell along this spatial and temporal distribution. For convenience of calculation, the efficiency of the photovoltaic modules is ignored and the normal vector of the cell is assumed to point towards the sun. The calculation formula is:

$$E_{env} = \int_{0}^{24\,\mathrm{h}} I_{tot}\bigl(h(t), t\bigr)\, \mathrm{d}t \tag{23}$$

The size of $E_{env}$ on each date from the vernal equinox to the autumnal equinox is shown in Fig. 10. Since $E_{env}$ is linearly normalized, the aforementioned calculation assumptions are reasonable.
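A numerical sketch of the $E_{env}$ integral in equation (23) is shown below; the irradiance callback and the 20 s step are assumptions tied to the Section II model and the simulation settings.

```python
import numpy as np

def environment_energy(i_tot, altitude_profile, dt=20.0):
    """E_env = sum_t I_tot(h_t, t) * dt for a 1 m^2 panel whose normal
    points at the sun (module efficiencies ignored, as in the paper).

    i_tot(h, t): total irradiance [W/m^2] at altitude h and time-of-day t,
    from the Section II irradiation model (not reimplemented here).
    altitude_profile: altitudes along the state-machine trajectory, one
    per bottom-model time step (dt = 20 s).
    """
    times = np.arange(len(altitude_profile)) * dt
    return sum(i_tot(h, t) * dt for h, t in zip(altitude_profile, times))
```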
The action space of the top-level decision model contains two discrete actions $a^h_0$ and $a^h_1$, with $a^h_0$ representing the trajectory planning strategy for the energy-rich environment and $a^h_1$ representing the trajectory planning strategy for the energy-poor environment. The action with the larger Q-value is used to activate the corresponding trajectory planning strategy, which then autonomously generates the flight commands of the aircraft.
Since the key evaluation index of an energy-management-oriented flight trajectory is the battery energy gain after each daily flight, and in order to enable the UAV to autonomously switch policies as the environment changes, the reward function of the top-level decision model is designed as the daily battery energy gain, i.e., the change in SOC over each 24-hour flight. The top-level decision policy network consists of a fully connected hidden layer of 64 units followed by a ReLU activation function. The decision process of the top-level policy is shown in Fig. 11. The UAV selects a trajectory planning strategy at the current time step n according to this policy, and after outputting corresponding control actions for k steps, the UAV again selects the next trajectory planning strategy according to the top-level policy.
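At deployment, the top-level decision step reduces to a greedy pass through the trained DQN, as in the following sketch (interface names and normalization bounds are assumptions).

```python
import torch

def select_macro_action(q_net, e_potential, e_battery, e_env):
    """Top-level decision step: feed the normalized energy states to the
    DQN and pick the option with the larger Q-value (0 = energy-rich
    policy a_0, 1 = energy-poor policy a_1)."""
    s = torch.tensor([[e_potential, e_battery, e_env]], dtype=torch.float32)
    with torch.no_grad():
        q_values = q_net(s)              # shape [1, 2]
    return int(q_values.argmax(dim=1))
```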

V. SIMULATION EXPERIMENTS
In this study, the mission area of the near-space solar-powered UAV is a cylindrical space with a radius of five kilometers located at 39.92°N, 116.42°E. The mission time is from the vernal equinox to the autumnal equinox, and the basic parameters and constraints of the UAV are kept consistent with Table 2. The training of the top-level decision model and the bottom-level trajectory planning models is realized on the Tianshou platform, and the hyperparameters are set as shown in Table 4 and Table 5.
Since the reinforcement learning controller trained on day 172 is well evaluated in Ni's work [11], the trajectory planning model for the energy-rich environment in this study also uses the solar irradiation information of day 172 for training. As day 172 is the date with the most abundant solar irradiation in the energy-rich environment, the trajectory planning model for the energy-poor environment is correspondingly trained with the solar irradiation information of day 91, the most abundant date in the energy-poor environment.
The flight information is updated in 0.02-second steps by the Runge-Kutta method. The bottom-level controller observes the current state every 20 seconds, inputs it to the bottom-level policy network, and obtains the command, so the maximum number of time steps in an episode of the bottom-level model is 24 × 3600 s / 20 s = 4320, according to equation (22). Each episode of the top-level model covers 5 consecutive dates, with the start date randomly selected within the task time for each training run, so the maximum number of time steps in an episode of the top-level model is 5 × 4320 = 21600. All relevant settings and assumptions for model training are presented in Table 6. The trained hierarchical policy networks are deployed on a laptop with a Ryzen9-5900HX CPU for testing.

A. THE BOTTOM MODEL TRAINING AND TESTING
Fig. 12 shows the reward curves during the training of the bottom-level models; both trajectory planning controllers converge after 5 million time steps. For convenience of narration, the trajectory planning model for the energy-rich environment is subsequently denoted as bottom policy 0, and the trajectory planning model for the energy-poor environment as bottom policy 1. The two bottom policies are tested as follows.
The trained bottom policies were used to guide 24-hour flights of the UAV. The testing environment for policy 0 is day 172 and for policy 1 day 91; the initial time is 4:24 and the initial SOC is 30%. The two trajectories are shown in Fig. 13, the altitude and SOC curves in Fig. 14, and the angle, velocity, and thrust curves in Fig. 15. To better analyze the flight information, some curves are smoothed using a sliding average with a window size of 70. Note that the horizontal time coordinate is measured from the initial time of 4:24.
From the altitude and SOC curves, the trajectory of policy 0 is mainly divided into five stages: circling with charging, climbing, high-altitude cruising, descending, and low-altitude circling, while the trajectory of policy 1 always circles steadily at low altitude and can be roughly divided into charging and consumption stages based on its SOC changes. Both trajectories are consistent with the design of the reward functions.
The first stage of the trajectory of policy 0, circling with charging, lasts from takeoff until around 12:40. During this period, the UAV with policy 0, like the UAV with policy 1, circles at an altitude of 15 km, resulting in similar thrust and bank angles, relatively stable velocity changes, and track angles close to 0. Both UAVs consume energy first and then recharge, but the UAV with policy 1 starts charging later than the UAV with policy 0 because the sunrise on day 91 is later than on day 172.
The second stage of the trajectory of policy 0, climbing, starts at 12:40. Note that when the UAV with policy 0 starts climbing, the SOC is about 90%, lower than the threshold value of 95% in the reward function, indicating that the RL policy can prepare in advance for better returns. In this stage, the thrust rapidly increases to 100 N, the track angle rapidly increases, and the velocity and bank angle gradually increase. As the UAV approaches its maximum altitude, the thrust and track angle gradually decrease, and the velocity and bank angle stabilize. The total duration of the climb is about 2.23 hours. The UAV's battery is fully charged at 13:54, and the battery power curve in Fig. 16 shows that the UAV's energy consumption thereafter is covered by the excess solar energy. During this period, the UAV with policy 1 is still charging.
The high-altitude cruising stage of the trajectory of policy 0 lasts from 14:54 to 16:55. During this phase, the UAV with policy 0 hovers around an altitude of 24.5 km, and its velocity and bank angle are smoothly maintained at relatively high values. From Fig. 16 it can be seen that the UAV's SOC stays at 100%, since its energy input power is always greater than its output power. During this time, the UAV with policy 1 reaches its maximum SOC at 16:42, ending the charging phase and starting to consume its battery energy. From the power curves in Fig. 17, this is because from 16:42 onwards the external solar radiation weakens, so that UAV's energy input power becomes lower than its output power.
In stage 4 of the trajectory of policy 0, descending, the UAV starts to reduce its thrust at 16:55, causing a decrease in its output power. However, as the solar irradiation decays, the energy input power is still lower than the output power, and therefore the SOC decreases. The UAV drops the thrust to 0 at 19:18, ending the powered descent and instead beginning an unpowered descent that releases gravitational potential energy, so its rate of SOC decline slows significantly. Throughout the descent stage, which lasts 4.15 hours, the bank angle, velocity, and track angle of the UAV with policy 0 gradually decrease until it approaches the lowest altitude, after which its velocity and bank angle stop decreasing while the thrust and track angle increase. Over this period, the UAV with policy 1 has no gravitational potential energy to release and its thrust remains steady as it maintains constant-altitude hovering, so its SOC continues to decrease. At 18:06, the solar irradiation disappears completely and the UAV with policy 1 begins to circle at night with its SOC decreasing at a constant rate.
The UAV with policy 0 circles at a low altitude in the fifth stage of its trajectory, and since solar irradiation disappears during its unpowered descent, it circles at night like the UAV with policy 1 when it reaches the lowest altitude.At 04:24 the next day, the SOC of the UAV with policy 0 is 58.35% with an energy increase of 28.35%, and that of the UAV with policy 1 is 44.31% with an energy increase of 14.31%.
The above tests prove the feasibility of the bottom-level models for trajectory planning, which supports the subsequent training of the top-level decision model.

B. THE TOP MODEL TRAINING AND TESTING
The task duration from day 80 to day 266 contains 807,840 time steps (187 days × 4320 steps), and if the total task duration were used as the maximum number of steps in a training episode, completing an episode would require extremely high computational time and cost. To reduce the training cost, the top-level model is trained by randomly selecting five consecutive days from the task duration as an episode.
Since the training sample for each episode is not the whole mission time period but a random period within it, the flight history of the UAV before the sample time is required to determine the initial state inputs for the first date of the sample, so that the training setup matches the real flight situation. A state matrix containing the initial state for each day of the mission time is designed, allowing the UAV to obtain reasonable initial state inputs by continuously updating the matrix during training.
The initial state matrix of the UAV is set up as follows. $S_d$ represents the UAV's SOC and flight altitude at the initial time (4:24) on day d, with an initial $SOC_d$ of 0.3 and an initial $h_d$ of 15 km. The state matrix is updated in each episode of the top-level model training in the following manner:

Step 1: The start date is randomly selected to get the initial power $SOC_d$ and flight altitude $h_d$ of the UAV.
Step 2: $S_d$ is converted to obtain $E_{battery}$ and $E_{potential}$, which are combined with the environmental energy state value $E_{env}$ to form the top-level controller's state input.
Step 3: The top-level controller selects the macro action to obtain the UAV states $SOC'_{d+1}$ and $h'_{d+1}$ after 24 hours, and then updates the top-level policy network.
Step 4: Comparing $[SOC'_{d+1}, h'_{d+1}]$ and $[SOC_d, h_d]$, the pair with the larger SOC is taken as $S_{d+1}$. The process then returns to Step 2 and iterates until the end of the episode.
Through the gradual update of the state matrix, the initial state input of the top-level model for each date becomes the result of the UAV's flight from the vernal equinox up to that date, so the initial state settings of each episode are consistent with a long-endurance flight starting from the vernal equinox, ensuring the effectiveness of the top-level model training.
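Steps 1-4 can be summarized in the following sketch of one top-level training episode; rollout_24h and update_policy are hypothetical placeholders for the simulator and the DQN update.

```python
import random

def train_episode(state_matrix, rollout_24h, update_policy, episode_days=5):
    """One top-level training episode following Steps 1-4.

    state_matrix[d] = (SOC_d, h_d) at 4:24 on day d; rollout_24h simulates
    one day under the selected macro action and returns the next-day state;
    update_policy performs the DQN update. Signatures are assumptions.
    """
    d0 = random.randint(80, 266 - episode_days)          # Step 1: random start
    for d in range(d0, d0 + episode_days):
        soc_d, h_d = state_matrix[d]                     # Step 2: state input
        soc_next, h_next = rollout_24h(soc_d, h_d, day=d)  # Step 3: macro action
        update_policy()
        # Step 4: of [SOC'_{d+1}, h'_{d+1}] and [SOC_d, h_d], keep the
        # pair with the larger SOC as S_{d+1}
        if soc_next > soc_d:
            state_matrix[d + 1] = (soc_next, h_next)
        else:
            state_matrix[d + 1] = (soc_d, h_d)
    return state_matrix
```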
Considering that training uses randomly selected periods, this study takes the change of the macro-action sequence of the top-level controller over a fixed time period as the criterion for judging whether the model has converged. The action sequences are represented in the following way:

$$Seq = \left[\,Value_{option,1},\ Value_{option,2},\ \ldots,\ Value_{option,n}\,\right] \tag{26}$$

In the formula, $Value_{option}$ represents the selection made by the top-level controller: when the controller selects bottom policy 0 its value is 0, and when bottom policy 1 is selected its value is 1. n is the number of days in the time period, and the fixed time period observed here is 5 days. Fig. 18 shows the Value sequence curves of the three time periods observed during the training process of the top-level model over 1000 epochs, where the time steps of each epoch represent 5 × 24 hours. As shown in the figure, after 1000 epochs the action sequence of the top-level controller stabilizes in each time period, so the top-level decision model converges. The planning result of the top-level decision model within the task time is shown in Fig. 19. Macro action $a^h_0$, i.e., policy 0, is selected from day 99 to day 245, and macro action $a^h_1$, i.e., policy 1, is selected from day 80 to day 98 and from day 246 to day 266. There is a slight date difference between the planning results and the environmental classification results.

C. LONG ENDURANCE FLIGHT SIMULATION
In order to demonstrate the superiority of this hierarchical guidance method for long-endurance missions, multiple methods are applied to guide a near-space solar-powered UAV through the same mission, with the following cases set up for comparison: (1) The planning result from the application of the hierarchical guidance controller.
(2) Strictly following the environment classification, the policy adopted by the UAV is determined by the type of environment on its current flight date.
(3) The UAV only uses policy 1 to plan its trajectory.
(4) The UAV only uses policy 0 to plan its trajectory.
(5) The UAV uses an RL-based trajectory planning policy trained with solar irradiation information for all dates within the task time. The number of training steps is consistent with the bottom-level models and the reward curve converges, as shown in Appendix B. Considering that the segmented reward function (20) includes parts not tied to gravity energy storage, this policy adopts the same reward function as policy 0.
The policy adoption for the above cases is shown in Table 7.
The initial time of all cases is 4:24 on day 80, and the initial SOC is 30%. Taking the daily SOC of the UAV at 4:24 as the vertical axis and days 80 to 267 as the horizontal axis, the comparison of the cases is shown in Fig. 20. In Case 3, the UAV only adopts policy 1, which does not perform gravity energy storage, and its SOC after each day-night flight is maintained at a low value. In Case 4, the UAV only adopts policy 0, which performs gravity energy storage, so its SOC values after each day-night flight are higher than those in Case 3 from day 100 to day 245. However, on days 80 to 99, at the beginning of the mission, and on days 246 to 266, close to the autumnal equinox, the UAV in Case 4 instead has a lower energy gain per 24 hours due to scarcer environmental energy. Therefore, in Case 4 the SOC of the UAV after completing the entire mission time is only 29.83%, while in Case 3 it is 47.06%, 17.23% higher than the former. In Case 1, the UAV's hierarchical guidance controller autonomously switches its policy at the beginning of day 100 and of day 246, thus allowing the UAV to provide more abundant power for avionics or other mission payloads during the period of abundant external energy, as in Case 4, and to maintain the same battery energy reserve as in Case 3 when flight continues into the period of scarce external energy.
The UAV in Case 2 does not switch its policy autonomously through the top-level controller, but switches based on the date boundaries obtained from the environment classification. Due to this strict adherence to the environment classification, the UAV makes the opposite choice near the date boundaries compared with Case 1, resulting in a decrease in its SOC gain from day 92 to day 98 and from day 246 to day 252. The largest difference occurs on day 252, on which the SOC of the UAV in Case 2 after 24 hours of flight is 5.50% lower than that of the UAV in Case 1. The difference between the state machine method and the reinforcement learning method adopted by the bottom-level models leads to a discrepancy between the environmental classification and the result planned by the hierarchical guidance controller. As can be seen in Fig. 19, the hierarchical guidance controller selects the policy adapted to the energy-rich environment slightly later than the earlier time boundary, and selects the policy adapted to the energy-poor environment slightly before the later time boundary. The earlier time boundary, from day 91 to day 92, is taken as an example for analysis. On day 92, based on its own physical properties and the external irradiation, the UAV is able to charge fully while hovering at the low altitude of 15 km, and in accordance with the rules of the state machine policy, the UAV will climb after being fully charged; therefore this date is classified as the energy-rich environment. However, according to the planning result of policy 0 shown in Fig. 21, the UAV starts to climb when the SOC reaches only about 75% in order to seek higher energy gain. This phenomenon of early climb has been discussed in the trajectory analysis of policy 0. For the current environment, this policy is still relatively risky: because of the energy consumption of climbing, it is difficult for the UAV to fill up the battery, so the energy gain of the UAV with policy 0 is lower than that of the UAV with policy 1. This is one embodiment of over-fitting in the policy network. Therefore, the top-level controller does not select policy 0 until day 99, when the environmental energy is more abundant.
The UAV in Case 5, adopting the flight policy obtained from training on all-date data, only obtains high energy gains from day 132 to day 214, a shorter duration compared to Case 1, and its SOC at the end of each diurnal flight is about 14% lower than that of Case 1. At the same time, the UAV is less able to gain energy during the periods close to the vernal or autumnal equinox, and thus its SOC at the end of the mission time is 6.92% lower than that in Case 1. This is because the huge amount of data causes the policy network to suffer from under-fitting, which makes it difficult to achieve the best performance on each date.
The under-fitting of the model is analyzed as follows. The trajectory profiles obtained from policy 0 and from the policy trained on the all-date data are shown in Fig. 22, with the same initial conditions on day 172. Both trajectories contain gravitational energy storage, but the UAV guided by policy 0 is able to hover at an altitude close to 25 km, while the UAV with the all-date policy is only able to climb up to 20 km, so the former gains more gravitational potential energy. From the thrust and SOC curves, it can be seen that the UAV with policy 0, having more gravitational potential energy, performs a longer unpowered glide, which consumes less energy and results in a higher final SOC. Fig. 23 shows the altitude and SOC curves of the trajectories planned from day 258 to the autumnal equinox by policy 1 and by the policy trained on the all-date data, with the same initial conditions. From the SOC curves, it can be seen that the UAV with the all-date policy exhibits an overall decreasing trend in its SOC from day 261 to day 264, while the daily SOC curves of the UAV with policy 1 remain essentially similar. The initial SOC of each day of the two trajectories is selected for observation. As shown in Fig. 24, the reduction in the daily initial SOC of the UAV with the all-date policy grows from day 261 to day 265, whereas the SOC of the UAV with policy 1 maintains a stable and lower rate of decrease. In conjunction with the altitude curve in Fig. 23, it can be seen that the UAV with the all-date policy does not adjust its altitude-seeking strategy appropriately as it enters dates when the environmental energy is scarcer, namely from day 261 onwards. The UAV does not properly adjust the timing of the early climb or give up this decision, so its gain in gravitational potential energy cannot compensate for the energy lost to climbing, and its SOC after each daily flight decreases ever more until day 265, when the UAV no longer climbs. In contrast, the UAV with policy 1 consistently hovers at low altitude and performs better in this relatively extreme energy-scarce environment. In the above phenomena, the UAV adopting the all-date policy does not sufficiently perform gravitational energy storage when the environmental energy is abundant, and does not appropriately adjust its gravitational energy storage strategy when the environmental energy is scarce, reflecting the fact that the large amount of environmental data over the whole period leaves the model with high training error: it falls into the under-fitting effect even though the reward curve converges.
The experiments and comparisons of the above cases demonstrate that the hierarchical guidance controller is able to plan an energy-optimal long-endurance flight trajectory for a near-space solar-powered UAV, allowing the UAV to pursue the maximum possible energy gain in different energy environments. Meanwhile, the errors caused by environment classification based on empirical knowledge can be compensated for by the controller's autonomous decision-making.

VI. CONCLUSION
Aiming at the trajectory planning problem of long-endurance flight of a near-space solar-powered UAV, this paper designs an intelligent guidance controller based on hierarchical reinforcement learning to optimize the energy management capability of the UAV. The conclusions of this study are summarized as follows: (1) Based on the UAV's own physical properties and the solar irradiation, an energy environment classification method is proposed, dividing dates according to whether the UAV can be fully charged while hovering at low altitude and yielding an environment with relatively abundant external energy and an environment with relatively scarce external energy.
(2) A hierarchical guidance controller for the UAV is designed on the basis of the environment classification, consisting of bottom-level trajectory planning models and a top-level decision-making model. The bottom-level trajectory planning models are trained on the different types of energy environments by the continuous-action SAC algorithm and can guide the UAV to autonomously track the energy-optimal flight trajectory. The top-level decision-making model is trained by the discrete-action DQN algorithm and can guide the UAV to autonomously change the trajectory planning strategy as the energy environment varies.
(3) The simulation results show that the bottom-level planners of the hierarchical guidance system can effectively plan the trajectory of the solar-powered UAV in their respective adapted environments. The energy growth of the UAV with the planner suited to the energy-rich environment is 28.35% on day 172, while the energy growth of the UAV with the planner suited to the energy-poor environment is 14.31% on day 91. At the same time, the top-level decision maker can autonomously switch between the bottom-level trajectory planners as the environment changes. Because the RL planner adapted to the energy-rich environment climbs early, which causes the UAV to perform poorly on energy-rich dates near the pre-classification boundary, there are some differences between the planning results of the top-level decision maker and the pre-classification results.
(4) The simulation results of long-endurance flight demonstrate that the hierarchical guidance controller adapts well to the large-span variations of the energy environment during long-duration flights. Compared with adopting only one bottom-level policy or the all-date policy, the planning result of the hierarchical guidance controller has the longest period of high energy gain, up to 147 days. It also retains 47.06% of battery power after completing the mission-time flight, the same as using only the policy based on the energy-poor environment, and respectively 17.23% and 6.92% higher than using only the policy based on the energy-rich environment and only the all-date policy. At the same time, the autonomous decision-making of this controller is able to compensate for the error caused by the environment classification based on empirical knowledge.
The above research results show that hierarchical reinforcement learning can help a near-space solar-powered UAV overcome the trajectory planning challenge caused by the substantial variation of the environmental energy during long-term flights, strengthen the UAV's energy management capability in long-endurance missions, and further improve the adaptability and intelligence of the UAV. How to reduce the gap between model training and the real world is a further research issue, which will contribute to the practical application of neural network controllers. In addition, in subsequent research, the flight airspace of the UAV will be extended to 10 km and below to explore how hierarchical reinforcement learning can help the solar-powered UAV cope with interference from factors such as weather and cloud cover. Furthermore, the 24-hour flight trajectory of a solar-powered UAV is characterized by distinct phases, which are equally suitable for setting up different policies through hierarchical reinforcement learning to improve the trajectory planning method.

FIGURE 1. SOC curves of different policies on the vernal equinox.

FIGURE 2. SOC curves of different policies on the summer solstice.

FIGURE 4. Angle between PV cells and solar irradiation.

FIGURE 6. Framework of the hierarchical guidance controller for the near-space solar-powered UAV.

FIGURE 8. Trajectories planned by the state machine strategy.

FIGURE 10. E_env from the vernal equinox to the autumn equinox.

FIGURE 14. Altitude and SOC curves of two trajectories.

FIGURE 15. Angles, velocity and thrust curves of two trajectories.

FIGURE 16. Power curves of the trajectory for policy 0.

FIGURE 17. Power curves of the trajectory for policy 1.


FIGURE 20. Simulation comparison of each case.

FIGURE 21. The trajectory with policy 0 on day 92.

FIGURE 22. The all-date policy and policy 0 on day 172.

FIGURE 23. The all-date policy and policy 1 from day 258 to 266.

FIGURE 25. SOC curves of different policies on the vernal equinox.

TABLE 1. The values of A_ij and B_ij.

TABLE 2. Relevant parameters and physical constraints of the UAV.

TABLE 3. Command value range.

TABLE 5. Top-level DQN algorithm hyperparameters.

TABLE 6. Model training related settings and assumptions.

FIGURE 19. The top-level decision model planning result.