Improved Reinforcement Learning Using Stability Augmentation With Application to Quadrotor Attitude Control

Reinforcement learning (RL) has been successfully applied to motion control, without requiring accurate models and selection of control parameters. In this paper, we propose a novel RL algorithm based on proximal policy optimization algorithm with dimension-wise clipping (PPO-DWC) for attitude control of quadrotor. Firstly, dimension-wise clipping technique is introduced to solve the zero-gradient problem of the PPO algorithm, which can quickly converge while maintaining good sampling efficiency, thus improving the control performance. Moreover, following the idea of stability augmentation system (SAS), a feedback controller is designed and integrated into the environment before training the PPO controller to avoid ineffective exploration and improve the system’s convergence. The eventual controller consists of two parts: the first is the result of the actor neural network in the PPO algorithm, and the second is the output of the stability augmentation feedback controller. Both of them directly use an end-to-end style of control commands to map the system state. This control architecture is applied in the attitude control of the quadrotor. The simulation results show that the quadrotor can quickly and accurately track the command and has a small steady-state error after the training by the improved PPO algorithm. Meanwhile, compared with the traditional PID controller and basic PPO algorithm, the proposed PPO-DWC algorithm with stability augmentation framework has better performance in tracking accuracy and robustness.


I. INTRODUCTION
In recent years, reinforcement learning (RL) has made rapid progress in the field of artificial intelligence research. Combined with deep learning and data progression, deep reinforcement learning (DRL) algorithms modeled by neural networks are now successfully applied in a variety of scenarios such as investing [1], gaming [2], and traffic control [3]. At the same time, in the robot control system, the application of DRL for continuous tasks has become one of the most popular research topics, including the motion control of unmanned aerial vehicles [4], unmanned surface vehicles [5], and autonomous underwater vehicles [6].
Quadrotor UAVs are widely used in power inspection [7], urban planning [8], agricultural monitoring [9], disaster rescue [10] and other fields, and are currently developing in a safer and more efficient direction. The quadrotor is a typical underactuated, nonlinear, strongly coupled system with 6-DOF (six degrees of freedom) for the rotational and translational motion and only four actuators for control input. For attitude control, it is not enough for the quadrotor to be able to hover; it must also perform large maneuvers in a tough environment [11]. Factors such as air resistance, the gyroscopic moment generated by the rapid rotation of the motors during flight, and the uneven distribution of mass all affect the stability of quadrotor flight. Many advanced control algorithms have been proposed and applied to quadrotor flight control systems, such as sliding mode control [12], adaptive control [13], robust control [14], active disturbance rejection control [15] and model predictive control [16]. In an ideal environment, the quadrotor shows excellent performance in both agility and precision. However, when there are uncertainties in the environment, typical control methods that rely on precise quadrotor models find it challenging to meet the control requirements. In addition, most control techniques do not consider actuator saturation in the design process, which degrades the controller's performance and can lead to instability of the system in extreme cases. Alternatives to the conventional control techniques are available through intelligent controllers [17]. In recent years, thanks to the development of machine learning, intelligent flight control systems based on DRL with neural networks have become an active research field.
The associate editor coordinating the review of this manuscript and approving it for publication was Yang Tang.
It has been shown that RL algorithms can succeed in situations close to the complexity of the real world [18]. Extensive research has been carried out on policy learning for autonomous control of quadrotors. In [19], RL achieves stable quadrotor control by training a neural network policy in a model-free manner. Combined with low-resolution images, a control policy trained with RL in [20] was used to achieve autonomous landing of an aircraft for the first time. The RL controller designed in [21] successfully completed the tasks of hovering and trajectory tracking for a real quadrotor. Remarkably, more advanced RL algorithms such as soft actor-critic (SAC) [22], twin delayed deep deterministic policy gradient (TD3) [23] and proximal policy optimization (PPO) [24] are gradually being used in the control system of the quadrotor. An actor-critic neural-network-based controller was presented in [25] to improve the quadrotor trajectory tracking performance. In [26], an MAV successfully completed autonomous navigation in a gusty environment using the SAC algorithm as the DRL framework. Sequential deep Q-network (SDQN) was first used as an end-to-end learning paradigm to train control policies for autonomous landing of UAVs in [27]. Advanced tasks of UAVs in actual flight are completed in [28] by the control policy of the deep deterministic policy gradient (DDPG) algorithm. In [29], the state-of-the-art DRL algorithm PPO was used to explore the control policy of the quadrotor position loop while maintaining good sampling efficiency. Moreover, the method of integrating classical controllers with RL policies has been shown to have higher learning efficiency [30]. An MPC-guided RL policy search algorithm is studied in [31] for learning quadrotor autonomous flight. In order to improve the tracking accuracy and robustness, a method that introduces an integral compensator in the actor-critic neural network was investigated in [32].
In [33], the PPO algorithm was suggested to correct the parameters of the quadrotor PID controller, which significantly reduced the training time and ensured the high stability of the quadrotor.
Among the advanced RL algorithms off the shelf, PPO has been evaluated as one of the most suitable algorithms for synthesizing high-precision attitude flight controllers [34]. However, due to its structural factors, problems such as vanishing gradients of clipping samples can also lead to inefficient learning of agents in high action-dimensional tasks. Some variant algorithms based on PPO have been proposed to solve the zero gradient problem. In [35], a two-phase policy gradient algorithm (PPG) that advances training and distills features is proposed to optimize the value function using a higher-level sample reuse method, which solves gradient vanishing and improves sample efficiency. A PPO algorithm based on relative Pearson (RPE) divergence is proposed in [36], through which an explicit minimization target can be yielded, and the latest policy is restricted to the baseline policy. Although the improvement of the algorithm can improve the training efficiency in the benchmark, the application of the PPO needs to be further studied for the specific quadrotor control system.
In this paper, the original PPO algorithm is improved for training the quadrotor attitude controller to achieve higher precision control in a shorter time. The algorithm is altered in two ways: algorithm structure and combination of classical control theory. The results of the original PPO algorithm are used as a benchmark to demonstrate the advantages of the new algorithm. The main contributions of this paper are listed as follows:
1) By estimating the advantage value of dimension importance sampling (IS) weight clipping, a new proxy target with a mechanism of clipping the IS weight of each action dimension separately is proposed to improve sample efficiency and achieve stable training for quadrotor attitude control.
2) A stability augmentation controller is introduced into the RL gain to speed up the process of training the quadrotor control policy and significantly improve the motion control accuracy of the quadrotor.
The remainder of this paper is organized as follows. Section II introduces the modeling of quadrotor and the basic principle of PPO algorithm. In Section III, the disadvantages of the PPO algorithm are analyzed. Then, the PPO algorithm with dimension-wise clipping and the PPO algorithm combined with stability augmentation controllers are introduced, respectively. Section IV gives simulation details and results, and our conclusions are summarized in Section V.

A. DYNAMIC MODEL OF QUADROTOR
The basic structure of the quadrotor is shown in Fig. 1. The UAV control system is an under-actuated system with four inputs and six outputs. The inertial coordinate system and the body coordinate system fixed on the quadrotor are established to describe the attitude of the quadrotor. $F_i$ ($i = 1, 2, 3, 4$) are the thrusts generated by the four rotors and $\Omega_i$ ($i = 1, 2, 3, 4$) are the rotational speeds of the rotors. $\phi$, $\theta$ and $\psi$ denote the three Euler angles of the quadrotor.
For rotational motion, applying Euler's equation of rotation to the quadrotor frame, the absolute derivative in dynamic coordinates can be expressed as follows:
$$I\dot{\omega} + \omega \times I\omega = M \tag{1}$$
where $I = \mathrm{diag}(I_x, I_y, I_z)$ is the diagonal inertia matrix of the quadrotor and $\omega = [\dot{\phi}, \dot{\theta}, \dot{\psi}]^T$ is the angular velocity about the three axes of the quadrotor. $M$ is mainly composed of the following parts: the control torque $M_\tau$, the gyroscopic effect torque $M_c$ and the rotational dynamic resistance torque $M_f$. The control torque $M_\tau$ is given by (2), where $L$ is the distance from each rotor to the center of mass. Finally, the nonlinear dynamic rotation equation of the quadrotor is given by (3).
Equation (3) should be discretized for RL implementation to define a Markov Decision Process (MDP). In our work, the state space is defined as $s_t = \{\phi, \theta, \psi, \dot{\phi}, \dot{\theta}, \dot{\psi}\}$, which includes the Euler angles and the angular velocities. The action space is selected as $a_t = \{a_1, a_2, a_3, a_4\}$, where $a_i$ ($i = 1, 2, 3, 4$) is the throttle level of each rotor, obtained as $a_i = \Omega_i^2 / \Omega_m^2$, and $\Omega_m$ is the maximum speed of each rotor. The state transition function can be obtained as follows:
$$s_{t+1} = f(s_t, a_t) \tag{4}$$
where $f = [f_1, f_2, f_3, f_4, f_5, f_6]^T$ is the smooth nonlinear function vector.
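To make the discretization step concrete, the following is a minimal sketch of one explicit-Euler step of Euler's rotation equation for a diagonal inertia matrix, together with the throttle-level mapping. It is not the paper's model: the control, gyroscopic and drag torques are lumped into one hypothetical `torque` input, the inertia values and step size are illustrative assumptions, and the function names are invented for this example.

```python
import math

def throttle_level(rotor_speed, max_rotor_speed):
    """Throttle level a_i = Omega_i^2 / Omega_m^2, which lies in [0, 1]."""
    return rotor_speed ** 2 / max_rotor_speed ** 2

def rotational_step(state, torque, inertia, dt):
    """Advance (phi, theta, psi, p, q, r) by one explicit-Euler step.

    Assumes the angular velocity components equal the Euler-angle rates,
    matching the text's choice omega = [phi_dot, theta_dot, psi_dot].
    """
    phi, theta, psi, p, q, r = state
    ix, iy, iz = inertia
    mx, my, mz = torque
    # Component form of I*omega_dot = M - omega x (I*omega) for diagonal I.
    p_dot = (mx - (iz - iy) * q * r) / ix
    q_dot = (my - (ix - iz) * p * r) / iy
    r_dot = (mz - (iy - ix) * p * q) / iz
    return (phi + dt * p, theta + dt * q, psi + dt * r,
            p + dt * p_dot, q + dt * q_dot, r + dt * r_dot)

# Hypothetical inertia values (kg m^2) and step size (s), for illustration only.
INERTIA = (0.01, 0.01, 0.02)
DT = 0.01
state = (math.radians(-30.0), 0.0, 0.0, 0.0, 0.0, 0.0)
next_state = rotational_step(state, (0.01, 0.0, 0.0), INERTIA, DT)
```

Explicit Euler is chosen here only for brevity; a higher-order integrator could be substituted without changing the interface.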

B. REINFORCEMENT LEARNING
From the basic principles of RL, it can be seen that RL has some essential differences compared with supervised learning and unsupervised learning. The data samples are static training sets with labels in traditional supervised learning. However, RL is a continuous decision-making process. In the training process of RL, the agent does not have any instruction information, and it updates its policy by getting reward values through interaction with the environment. The agent finally obtains the best policy through continuous trial and error.
The basic block diagram of RL is shown in Fig. 2. RL trains an optimal policy through continuous trial-and-error interactions between the agent and the environment. The agent generally consists of two parts, the RL algorithm and the policy, where the policy is usually a function approximator with adjustable parameters, such as a neural network. When training begins, the agent chooses an action $a_t$ to act on the environment. The entire environment model reaches a new state $s_t$, generating a reward value $r_t$ simultaneously. The RL algorithm continuously updates the policy parameters based on action $a_t$, state $s_t$, and reward $r_t$. The agent and environment interact in a continuous loop to generate data samples. The agent finds the optimal control policy when the accumulated reward is maximized during training. Policy gradient (PG) with importance sampling (IS) is a classic policy-based RL algorithm. The PG algorithm sets $J(\lambda)$ as the performance function and maximizes $J(\lambda)$ by updating the policy. In each iteration, PG obtains the new policy $\pi_{\tilde{\lambda}}$ from samples of the current policy $\pi_\lambda$ at time step $t$:
$$J(\tilde{\lambda}) = \hat{\mathbb{E}}_t\left[\rho_t(\tilde{\lambda})\, A^{\pi_\lambda}(s_t, a_t)\right] \tag{5}$$
where $\rho_t(\tilde{\lambda}) = \pi_{\tilde{\lambda}}(a_t|s_t) / \pi_\lambda(a_t|s_t)$ is the IS weight and $A^{\pi_\lambda}(s_t, a_t)$ is the advantage value.
The traditional PG algorithm is strongly affected by large IS weights, which can prevent effective learning or lengthen the policy convergence time. The PPO algorithm proposes a new objective function that addresses the problem of selecting the step size and allows multiple training steps of mini-batch updates on the stored samples. Moreover, it has two other characteristics. One is that it has been shown to perform well on continuous-action problems with neural networks. The other is that it uses importance sampling in the update, so that the PPO algorithm achieves a good balance among algorithm complexity, accuracy and implementation difficulty.
PPO clips the objective function to bound the update of the current policy and achieve stable learning. At the start of the $i$-th iteration, policy $\pi_{\lambda_i}$ generates the current sample batch $B_i$. Then $\pi_\lambda$ is updated using multiple mini-batches sampled from $B_i$. Because the policy $\pi_{\lambda_i}$ that generates $B_i$ differs from the target policy $\pi_\lambda$ being updated, PPO corrects the statistical difference with the IS weight $\rho_t$. In addition, PPO clips the IS weight in order to limit the size of policy updates and ensure the stability of learning. Therefore, the objective function of PPO is given by the following:
$$\hat{J}_{PPO}(\lambda) = \frac{1}{M} \sum_{t=1}^{M} \min\left(\rho_t \hat{A}_t,\ \mathrm{clip}(\rho_t, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_t\right) \tag{6}$$
where $\hat{A}_t$ is the estimate of $A^{\pi_{\lambda_i}}(s_t, a_t)$, $\varepsilon$ is the clipping parameter, and each mini-batch consists of $M$ samples drawn randomly from $B_i$. However, precisely because PPO clips the overall likelihood ratio, the gradient of the clipped samples vanishes completely. Therefore, in tasks with a high action dimension, PPO also suffers from low sample efficiency, which affects learning efficiency and tracking accuracy in complex quadrotor systems. The improved PPO algorithm presented in this work is an attempt to solve this problem.
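The vanishing gradient of clipped samples can be checked numerically. The sketch below evaluates the standard PPO clipped term for a single sample and estimates its derivative with respect to the IS weight by central differences. The clipping parameter of 0.2 is an assumed value for illustration, not one taken from the paper's training parameters, and the function names are hypothetical.

```python
def clip(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))

def ppo_clip_term(rho, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    return min(rho * advantage, clip(rho, 1.0 - eps, 1.0 + eps) * advantage)

def grad_wrt_rho(rho, advantage, eps=0.2, h=1e-6):
    """Central-difference estimate of d(term)/d(rho)."""
    upper = ppo_clip_term(rho + h, advantage, eps)
    lower = ppo_clip_term(rho - h, advantage, eps)
    return (upper - lower) / (2.0 * h)

# Positive advantage with rho above 1 + eps: the term is constant in rho.
clipped_high = grad_wrt_rho(1.5, 2.0)
# Negative advantage with rho below 1 - eps: the term is constant in rho.
clipped_low = grad_wrt_rho(0.5, -1.0)
# Inside the clipping interval the gradient equals the advantage.
unclipped = grad_wrt_rho(1.0, 2.0)
```

Only the two combinations above produce a zero gradient; a sample with negative advantage and a large IS weight, for example, still contributes a nonzero gradient.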

C. NETWORK STRUCTURE
The neural network used in PPO is based on the Actor-Critic network structure. Due to the good generalization ability of neural networks, the multilayer perceptron (MLP) structure proposed in [19] is used. There may be a better network configuration, but in fact the neural network is quite versatile. Its configuration has been able to approximate a controller for similar tasks. MLP is a fully connected feedforward artificial neural network trained using supervised learning with backpropagation. Its initial weight is a Gaussian random number with mean 0 and standard deviation 1. The structure of the actor-critic neural network is shown in Figure 3. For the Actor network, the input layer is the quadrotor state s t , and the output layer is the signal that controls the rotational speed of the four rotors of the quadrotor. Each network has two hidden layers, each with 64 nodes, which are neurons with tanh activation functions. The critic neural network has the same structure. The only difference is that its output is an estimated value function that evaluates the advantage of selecting a given action a t in a given state s t .
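As an illustration of the layer sizes described above, here is a minimal pure-Python sketch of the forward pass: six state inputs, two hidden layers of 64 tanh units, four actor outputs and one critic output, with weights drawn from a zero-mean, unit-variance Gaussian. The zero biases, the linear output layer and all function names are assumptions made for this example; the paper's networks are built with TensorFlow and are trained, which this sketch does not attempt.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Weights ~ N(0, 1) as stated in the text; zero biases are an assumption."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(inputs, layer, activation=None):
    """Fully connected layer: activation(W x + b)."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [activation(v) for v in outputs] if activation else outputs

def build_mlp(sizes, seed=0):
    """Create one layer for each consecutive pair of sizes."""
    rng = random.Random(seed)
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(mlp, inputs):
    """tanh on hidden layers, linear output layer (an assumption)."""
    hidden = inputs
    for layer in mlp[:-1]:
        hidden = dense(hidden, layer, math.tanh)
    return dense(hidden, mlp[-1])

actor = build_mlp([6, 64, 64, 4], seed=1)   # state -> four rotor signals
critic = build_mlp([6, 64, 64, 1], seed=2)  # state -> scalar value estimate
state = [0.1, -0.2, 0.05, 0.0, 0.0, 0.0]
rotor_signals = forward(actor, state)
value_estimate = forward(critic, state)[0]
```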

III. PROPOSED APPROACH
A. PPO WITH DIMENSION-WISE CLIPPING (PPO-DWC)
In (6), $\rho_t$ is a function of the policy variable $\lambda$ being optimized, while $\hat{A}_t$ is fixed for the policy $\pi_{\lambda_i}$ that generated the actions in the current sample batch $B_i$. Therefore, in general, maximizing the objective over $\lambda$ increases $\rho_t$ when $\hat{A}_t > 0$ and decreases $\rho_t$ when $\hat{A}_t < 0$. PPO restricts the size of policy updates by clipping the objective function. The advantage is that this clipping mechanism prevents $\rho_t$ from becoming too small or too large; especially in complex environments, a stable update range is conducive to faster and more efficient training. The disadvantage is that when the action dimension is high, it easily causes a zero-gradient problem, which can trap the policy in a local optimum. When we simplify the clipped objective function to a single sample:
$$\hat{J}_t = \min\left(\rho_t \hat{A}_t,\ \mathrm{clip}(\rho_t, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_t\right) \tag{7}$$
It can be seen that when $\hat{A}_t < 0$ and $\rho_t < 1 - \varepsilon$, $\hat{J}_t = (1 - \varepsilon)\hat{A}_t$, and when $\hat{A}_t > 0$ and $\rho_t > 1 + \varepsilon$, $\hat{J}_t = (1 + \varepsilon)\hat{A}_t$. In these two cases, $\hat{J}_t$ is a constant and the gradient disappears. This zero-gradient problem is very severe [37], especially in tasks with a high action dimension. Because PPO directly clips the loss function, the sample efficiency is strongly affected by the zero-gradient samples that PPO creates. A Gaussian distribution is often used as the stochastic policy for RL when performing action tasks:
$$\pi_\lambda(\cdot\,|\,s_t) = \mathcal{N}(\mu, \sigma^2 I) \tag{8}$$
where $\mu = (\mu_0, \mu_1, \ldots, \mu_{D-1})$ is the mean vector, $D$ is the action dimension, $\sigma$ is a standard deviation parameter, $I$ is the identity matrix and thus the policy parameter is $\lambda = (\mu, \sigma)$. When policy $\pi_\lambda$ is decomposed into action dimensions, it can be written as follows:
$$\pi_\lambda(a_t|s_t) = \prod_{d=0}^{D-1} \pi_\lambda(a_{t,d}|s_t) \tag{9}$$
It can be seen that $\pi_\lambda$ is a product of $D$ factors, so the IS weight can deviate from 1 exponentially as $D$ increases, which leads to excessive IS weights. In response to this problem, combining the advantages of the clipping mechanism, the IS weights of each dimension are clipped separately, and the new IS weight function is proposed, as shown in (10).
VOLUME 10, 2022
In addition, an additional loss is proposed to prevent the IS weight from being too far from (6).
The IS weights are constrained with a simple KL divergence, given in (11). $\alpha_{IS}$ is set as an adaptive weighting factor to constrain the divergence, following the rule in (12). Finally, the objective function is given in (13). Dimension-wise clipping successfully solves the zero-gradient problem of PPO. Proximal policy optimization with dimension-wise clipping (PPO-DWC) improves the learning efficiency of the algorithm and effectively reduces gradient vanishing. Algorithm 1 shows the complete iterative process.
Algorithm 1:
Randomly initialize states of quadrotor
Load the desired states
Sample trajectory $B_i$ of size $N$ from $\pi_{\lambda_i}$
Store $B_i$ in replay buffer $E$
Compute advantage estimations $\hat{A}_t$ by using all off-policy trajectories $B_i$, . . .
for each gradient step do
Sample mini-batch of size $M$ from the sample batches in $E$
Compute the objective function $J(\lambda)$
Update the policy parameter $\lambda$
end for
end for
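The effect of clipping each action dimension separately can be illustrated with a small numerical sketch. For a diagonal Gaussian policy, the joint IS weight is the product of per-dimension weights, so four modest per-dimension ratios can multiply into a joint ratio that falls outside the clipping interval. The per-dimension surrogate below (a sum of clipped terms over dimensions) is an assumption modeled on the dimension-wise clipping idea, not the paper's exact objective, and it omits the additional IS loss and its adaptive weight. The means, standard deviation, action and clipping parameter are hypothetical values.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of a scalar Gaussian N(mu, sigma^2)."""
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

def per_dim_ratios(action, mu_new, mu_old, sigma):
    """IS weight of each action dimension under a diagonal Gaussian policy."""
    return [math.exp(gaussian_logpdf(a, mn, sigma) - gaussian_logpdf(a, mo, sigma))
            for a, mn, mo in zip(action, mu_new, mu_old)]

def has_zero_gradient(rho, advantage, eps):
    """True when the clipped PPO term is constant in rho."""
    return ((advantage > 0 and rho > 1.0 + eps)
            or (advantage < 0 and rho < 1.0 - eps))

def dimension_wise_surrogate(ratios, advantage, eps=0.2):
    """Assumed form: sum over dimensions of the clipped per-dimension term."""
    total = 0.0
    for rho in ratios:
        clipped = max(1.0 - eps, min(1.0 + eps, rho))
        total += min(rho * advantage, clipped * advantage)
    return total

EPS = 0.2
action = [0.1, 0.1, 0.1, 0.1]          # four rotor commands (hypothetical)
ratios = per_dim_ratios(action, [0.05] * 4, [0.0] * 4, sigma=0.2)
joint_ratio = math.prod(ratios)         # IS weight used by standard PPO
joint_is_dead = has_zero_gradient(joint_ratio, 1.0, EPS)
live_dims = [not has_zero_gradient(r, 1.0, EPS) for r in ratios]
objective = dimension_wise_surrogate(ratios, 1.0, EPS)
```

In this example each per-dimension ratio is about 1.10, while the joint ratio is about 1.45, so the sample would contribute no gradient under joint clipping but every dimension still contributes under dimension-wise clipping.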

B. PPO WITH PD STABILITY AUGMENTATION CONTROLLER (PPO-PD)
RL finds the correct policy for an unstable system through constant trial and error. During learning, the random actions generate many invalid samples, which hinders the rapid convergence of the agent to the optimal policy. In engineering practice, a stability augmentation system (SAS) is often used to meet the control requirements more quickly. For example, without the feedback of a stability augmentation system, a balance bike is just a unicycle that is difficult to control. After the stability augmentation system is added, learning to ride it becomes much easier, and the same holds for machine learning. Inspired by this, we introduce the idea of SAS into the learning process of RL.
Before training the RL control policy, we can design a stability augmentation feedback controller to stabilize the equilibrium point first. For the nonlinear system (4), the original action of RL is $a_t$. Assuming that the target point is an unstable equilibrium point, our goal is to learn $a_t$ so that $s_t$ tends to 0, which is also the basic idea of training control policies with RL. After integrating the stability augmentation feedback, we define
$$a_t = a'_t + k(s_t) \tag{14}$$
where $k(s_t)$ is the stability augmentation state feedback and $a'_t$ is the action generated by RL. Substituting (14) into (4), we obtain the new environment, including the stability augmentation controller, as
$$s_{t+1} = f\left(s_t,\ a'_t + k(s_t)\right) \tag{15}$$
Once the RL controller is trained on (15) to choose an action $a'_t$, we can obtain the action $a_t$ of the original environment (4) by using (14). For the quadrotor UAV system, proportional-derivative (PD) controllers are usually designed in the roll, pitch, and yaw channels to ensure local stability of the attitude [38]-[40]. In this work, we use the PD control in (16) as the stability augmentation feedback, where $k_{Pi}$ and $k_{Di}$ ($i = \phi, \theta, \psi$) respectively represent the proportional and derivative coefficients of the roll, pitch and yaw channels. Moreover, the total lift level of the quadrotor is assumed to be $T$, from which (17) is obtained. It must be pointed out that the role of stability augmentation feedback is to construct a local convergence region for the original system. The goal of RL is to find a solution that can reach the convergence region. Using the stability augmentation controller avoids a large amount of trial and error in the training process and improves the learning efficiency. Moreover, it can be used in combination with other RL algorithms.
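The idea of wrapping the environment with a stabilizing feedback can be sketched on a toy problem. The example below uses a single-axis double integrator as a hypothetical stand-in for the quadrotor dynamics, adds a PD feedback term to the RL action before stepping the plant, and compares a rollout with and without the feedback when the RL action is zero. The gains, the sign convention of the PD term, the step size and the function names are all assumptions for illustration.

```python
def pd_feedback(angle, rate, kp, kd):
    """Stability augmentation term k(s_t) for one attitude channel."""
    return -(kp * angle + kd * rate)

def augmented_step(state, rl_action, kp, kd, dt=0.01):
    """Step the wrapped environment: the plant sees a_t = a'_t + k(s_t)."""
    angle, rate = state
    command = rl_action + pd_feedback(angle, rate, kp, kd)
    # Toy plant: angular acceleration equals the command (double integrator).
    rate = rate + dt * command
    angle = angle + dt * rate
    return angle, rate

def rollout(state, steps, kp, kd, policy=None):
    """Run the wrapped environment; the default policy outputs a'_t = 0."""
    for _ in range(steps):
        rl_action = policy(state) if policy else 0.0
        state = augmented_step(state, rl_action, kp, kd)
    return state

# With PD feedback the toy plant settles even though the RL action is zero.
with_sas = rollout((0.5, 0.0), 1000, kp=4.0, kd=4.0)
# Without feedback an initial rate makes the angle drift away.
without_sas = rollout((0.5, 0.1), 1000, kp=0.0, kd=0.0)
```

An RL policy trained on the wrapped environment would replace the default zero action; the command sent to the original plant is then recovered as the sum of the policy output and the feedback term.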
On the other hand, compared with PD feedback alone, the RL algorithm enhances control performance in extreme conditions, such as state overshoot or actuator saturation. Algorithm 2 describes the learning process of an RL controller with stability augmentation.
Algorithm 2:
Randomly initialize states of quadrotor
Load the desired states
for actor = 1 to N do
for time step = 1 to T do
Calculate the stability augmentation feedback $k(s_t)$ with state $s_t$
Run policy $\pi_\lambda$ to generate RL gains $a_t$
Record reward $r_t$ and the next state $s_{t+1}$
Store transition $(s_t, a_t, r_t, s_{t+1})$ into replay buffer
Compute advantage estimations $\hat{A}_t$
end for
for epoch = 1 to K do
Optimize the loss target with mini-batch size $M \le NT$

C. SYSTEM FRAMEWORK
In the learning process, the RL algorithm is required to stabilize the attitude while the quadrotor can be released from any attitude in the state space. Neural networks are used to receive the state of the quadrotor, provide the throttle level of each rotor, and seek the optimal policy in iterations.

1) PPO-DWC FRAMEWORK STRUCTURE
The network structure of the system is shown in Fig. 4. Two neural networks are used in the training of PPO-DWC, one is the critic neural network, and the other is the actor neural network with parameter λ. Four policy sub-networks with parameters λ i (i = 1, 2, 3, 4) compose the actor neural network. Their weights will be optimized during the training phase.
During training, the current state of the quadrotor enters replay buffer $E$ as a state vector $[\phi, \theta, \psi, \dot{\phi}, \dot{\theta}, \dot{\psi}]^T$.
Since PPO adopts batch training, after the actor network collects a batch of state vectors, its network parameters are copied to the old actor network. In the next batch of training, the four sub-networks continue to be trained and updated. At the same time, the parameters of the old actor network remain unchanged until copied by a new round of network parameters. After the policy update, the output of the old neural network will be dimensionally clipped to obtain π λ and the IS weight function ρ t as the input of the PPO-DWC operation.
For the critic network, the advantage values it outputs evaluate the quality of the actions taken to reach these states. After its parameters are updated by minimizing the value loss, the critic neural network also feeds the advantage value to the operation side to complete the entire update process of the actor network.
After updating the policy, the outputs of the four subnetworks are µ i and σ i (i = 1, 2, 3, 4), which correspond to the four sets of mean and standard deviation of Gaussian distribution. A group of actions is randomly sampled from a Gaussian distribution and normalized to a i (i = 1, 2, 3, 4). a i becomes the input of the quadrotor, and the quadrotor generates a new state.
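The sampling step can be sketched as follows: each of the four sub-networks supplies a mean and a standard deviation, one action is drawn per rotor, and the result is mapped to a throttle level. The text does not specify the normalization, so clipping to the interval [0, 1] is an assumption here, as are the numerical values and the function name.

```python
import random

def sample_throttle(means, std_devs, rng):
    """Draw one action per rotor and limit it to a throttle level in [0, 1].

    Clipping is an assumed normalization; the paper does not state the
    exact mapping.
    """
    raw = [rng.gauss(mu, sigma) for mu, sigma in zip(means, std_devs)]
    return [min(1.0, max(0.0, a)) for a in raw]

rng = random.Random(0)
means = [0.5, 0.5, 0.5, 0.5]        # hypothetical sub-network outputs mu_i
std_devs = [0.1, 0.1, 0.1, 0.1]     # hypothetical sub-network outputs sigma_i
throttle = sample_throttle(means, std_devs, rng)
```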

2) PPO WITH STABILITY AUGMENTATION FRAMEWORK STRUCTURE
The system structures of PPO-PD and PPO-DWC are fundamentally different. PPO-DWC mainly analyzes the algorithm structure and clips the policy dimension to optimize the convergence of the optimal function, while PPO-PD uses the classical PPO algorithm and introduces a stability augmentation controller in the state observation stage to cooperate with RL to complete policy optimization, which is a synchronous process. The quadrotor will output the state to the stabilization module and the RL module respectively, and the input obtained is also the result of the combination of the stability augmentation feedback and the RL gain. The system framework is shown in Fig. 5.
For the attitude angle tracking task, our goal is to minimize the cumulative tracking error. In order to evaluate the performance of the quadrotor in terms of robustness, the reward signal is as simple as possible. Therefore, the reward function is given by:

IV. SIMULATION AND RESULTS
In this section, the proposed improved PPO control policy is applied to the quadrotor UAV. Table 1 lists the model parameters. In order to fly safely, physical constraints should be imposed on the states of the quadrotor. The range of attitude angular velocity is set to ±258°/s, which also meets the limitation of the gyroscope sensor. The range of attitude angle is set to ±45°. When the quadrotor's attitude exceeds 45° during training, it is considered a bad training session, and the training round is terminated early. We use Python to develop the training simulation environment of the quadrotor. TensorFlow tools are utilized to build neural networks for learning and training [41]. Its library calls are computed on a laptop GPU (NVIDIA GeForce GTX 1650 Ti). The simulations do not involve parallel computing techniques.
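The state constraints above can be expressed as two small checks in the simulation loop. This is a sketch under assumptions: the text states the ranges but not how the rate limit is enforced, so saturating the angular rates is an assumed choice, and the function names are hypothetical.

```python
MAX_ANGLE_DEG = 45.0     # attitude limit from the text
MAX_RATE_DEG_S = 258.0   # angular velocity limit from the text

def limit_rates(rates_deg_s):
    """Saturate each angular rate to the +/-258 deg/s range (assumed handling)."""
    return tuple(max(-MAX_RATE_DEG_S, min(MAX_RATE_DEG_S, r))
                 for r in rates_deg_s)

def episode_terminated(angles_deg):
    """End the training round early when any attitude angle exceeds 45 deg."""
    return any(abs(a) > MAX_ANGLE_DEG for a in angles_deg)

safe = episode_terminated((-30.0, -20.0, -10.0))
unsafe = episode_terminated((10.0, -46.0, 0.0))
limited = limit_rates((300.0, -300.0, 10.0))
```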

A. PPO-DWC ABILITY TEST
The control policies learned by the original PPO algorithm and the PPO-DWC algorithm are compared in terms of offline training efficiency and control performance.

1) OFFLINE TRAINING
After defining the actor-critic network structure, Table 2 gives the training parameters of Algorithm 1 in the offline learning phase.
The training task is that the quadrotor can adjust to the desired attitude [0, 0, 0] in a randomly initialized state. Two indicators are used: average value loss and average accumulated rewards which are in a negative correlation to measure the learning effect. When training the quadrotor, the error should become smaller and smaller. In each step, the smaller the error of the quadrotor attitude is, the larger the reward value is. A larger and more stable accumulated reward fully reflects a more accurate and faster control policy. In this study, we calculate the average value of each 50 groups of data, and evaluate the value loss and accumulated reward. PPO-DWC and the original PPO algorithm are used to train on the same network structure and training parameters, respectively. The comparison results are shown in Fig. 6. The training phase has a total of 6000 iterations. The total time cost of the computation of the two algorithms is almost the same because they are trained under the same network structure and training parameters, which takes 1932.4 s. It can be seen that PPO-DWC makes the value loss converge faster during training, which is due to its strong sampling efficiency. At the same time, ten independent simulations are carried out for the two algorithms. The shadow part in the figure represents the standard deviation of the ten simulation results. Obviously, the errors of the two algorithms are convergent. After a certain round of training, the original PPO algorithm still has errors, while the error of the PPO-DWC algorithm is smaller, and the convergence is faster.
The average accumulated reward is shown in Fig. 7, which shows that PPO-DWC converges faster and reaches a higher reward. The PPO-DWC training progress stabilizes after about 500 training steps, taking only 152.6 s, while the original PPO algorithm converges after about 1500 steps, which takes 482.4 s. It can also be seen from the standard deviation that the PPO-DWC algorithm is consistent across simulations. After 6000 training iterations, we test the control policies learned through the PPO-DWC and original PPO algorithms in the environment of quadrotor rotational motion. The initial attitude of the quadrotor is [−30, −20, −10]°, which is set within the safe range. The attitude angles of the quadrotor are recorded for 10 seconds. Fig. 8 shows the results of the two algorithms. Both algorithms obtain convergent policies in the quadrotor attitude control task, and PPO-DWC has higher accuracy.
In order to highlight the advantages of PPO-DWC, we compare the mean absolute error (MAE) of the attitude angles provided by the two algorithms. As shown in Fig. 9, the control performance of PPO-DWC is better than that of the original PPO algorithm.
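For reference, the MAE of a recorded attitude trace against the desired attitude can be computed as below. The sample values are made up for illustration and are not results from the paper.

```python
def mean_absolute_error(recorded, desired):
    """Average of |recorded - desired| over all samples of one attitude angle."""
    if len(recorded) != len(desired):
        raise ValueError("recorded and desired must have the same length")
    return sum(abs(r - d) for r, d in zip(recorded, desired)) / len(recorded)

# Hypothetical roll-angle samples in degrees while tracking a 0 degree command.
roll_trace = [-30.0, -15.0, -5.0, 0.0]
roll_mae = mean_absolute_error(roll_trace, [0.0] * len(roll_trace))
```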

2) ROBUSTNESS TEST
In the offline learning phase, a stable, robust control policy has been trained. In order to test its generalization ability, two different robustness tests are implemented. A traditional PID controller is also added for comparison. Similar to PPO, PID also controls the target by trial and error: according to the error between the actual output of the quadrotor state and the desired command, the output is repeatedly adjusted to achieve the given desired value. Moreover, many other control algorithms are based on precise model dynamics and are therefore not directly comparable to the model-free PPO learning algorithm.
Case I: Model generalization test of different sizes. In this case, we change the distance from the rotor to the center of mass (the radius of the quadrotor) to test the generalization ability of the control policy. The initial flight attitude of the quadrotor is assumed to be [−15, −10, −10]° and the desired attitude is [0, 0, 0]°. The attitude changes of the quadrotor are observed during 10 seconds of flight under three control policies: PID, original PPO and PPO-DWC. In addition, the sum of the absolute errors of the three attitude angles is introduced to demonstrate the dynamic performance of the three control policies. A smaller sum of errors means a faster control policy and higher accuracy.
We assume that the radius of the quadrotor model in the offline learning phase is 0.31m as the standard radius. In the robustness test, the mass of the quadrotor model remains unchanged. We change the radius range from 0.1m (65% smaller) to 1.2m (300% larger). A total of 12 simulations are carried out. The change of the attitude angle under the three control policies is shown in Fig. 10. When the radius is in 0.31m∼0.6m, the three control policies can make the quadrotor reach the steady state very well. However, when the radius gradually increases, the PID controller becomes unstable. Compared with the previous steady state, the quadrotor based on the PID controller begins to oscillate violently, and the error between the real attitude and the desired attitude becomes larger. The result also suggests that the quadrotor model performs more consistently under the RL control policies. The control performance is better than the PID control, which fully reflects the good generalization ability of RL. The result presented by the PPO-DWC policy and the original PPO policy indicates that the steady-state error of PPO-DWC is significantly smaller than that of the original PPO under the same number of training steps. The model controlled by PPO-DWC policy can be quickly and accurately stabilized, demonstrating its high efficiency in learning control tasks in complex environments. Its improvement over the baseline PPO is quite apparent.
It can also be verified in Fig. 11 that, across the test model set, the attitude tracking response of the control policy learned by the PPO-DWC algorithm is the least affected. As the radius increases, the control performance of both the RL and PID controllers degrades, but the degradation of the PID controller is more pronounced. The steady-state error of PPO-DWC is the smallest. Moreover, the control policy performs better on a small quadrotor, whose fast response benefits from its smaller aerodynamic drag and moment of inertia.
Case II: Model generalization test under random initial state. Simulations under different initial states of the quadrotor are conducted to observe the control efficiency of the PPO-DWC algorithm. The control task is to adjust the quadrotor from a random initial attitude within a safe range to a desired steady state [0, 0, 0]°. A total of 20 simulations are carried out. We observe and record the attitude changes of the quadrotor in the 10 seconds of flight. The results are shown in Fig. 12. It can be seen that when the quadrotor starts to operate from different initial attitudes, the PPO-DWC control strategy can effectively make the quadrotor reach a stable state with a small steady-state error. This demonstrates the good performance of the PPO-DWC offline policy.
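The evaluation protocol of Case II can be sketched as follows. This is only an illustration: the ±30° sampling range, the 0.01 s time step, the 1 s averaging window, and the exponential placeholder response (with a 0.5° residual offset) are all assumptions introduced here, and the placeholder stands in for the actual closed-loop simulation, which is not reproduced.

```python
import numpy as np


def steady_state_error(attitude_deg, desired_deg, dt, window_s=1.0):
    """Mean absolute error per axis over the final window_s seconds."""
    attitude = np.asarray(attitude_deg, dtype=float)  # shape (T, 3)
    n_tail = max(1, int(round(window_s / dt)))
    tail = attitude[-n_tail:]
    return np.abs(tail - np.asarray(desired_deg, dtype=float)).mean(axis=0)


rng = np.random.default_rng(0)
# 20 random initial attitudes (roll, pitch, yaw); the range is hypothetical.
initial_attitudes = rng.uniform(-30.0, 30.0, size=(20, 3))

dt = 0.01
t = np.arange(0.0, 10.0, dt)  # 10 seconds of flight
desired = [0.0, 0.0, 0.0]

errors = []
for theta0 in initial_attitudes:
    # Placeholder response: exponential decay toward a small residual offset.
    trajectory = theta0 * np.exp(-3.0 * t)[:, None] + 0.5
    errors.append(steady_state_error(trajectory, desired, dt))
errors = np.array(errors)  # shape (20, 3)

print("worst steady-state error over all runs (deg):", errors.max())
```

Replacing the placeholder trajectory with logged simulation output would give one steady-state error per axis and per run.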

B. PPO-PD CONTROLLER ABILITY TEST
In this test, the same quadrotor model and neural network parameters are used as in Subsection A to observe the performance of the PPO algorithm with the PD stability augmentation controller from the two aspects of learning efficiency and control performance.

1) OFFLINE TRAINING
Before the RL control policy training, three groups of PD parameters are set for the stability augmentation controller of the system, as shown in Table 3. The three groups are chosen to be representative, so that the effect of the PD stability augmentation controller in the RL training stage can be observed. In the first group of PD parameters, K_Di = 0 (i = 1, 2, 3), so the stability augmentation controller is only proportional control. Without RL gain, the system is in an unstable state. On this basis, the proportional feedback can speed up the adjustment process, quickly respond to the command, and reduce the steady-state error. The introduction of the K_D parameter in the second group improves the stability of the system, accelerates its dynamic response, and reduces the adjustment time. At the same time, it reduces the overshoot and overcomes the oscillation, thereby improving the dynamic performance of the system. The third group is obtained from the second group by modifying the K_D parameters according to the control effect on the quadrotor system. With the third group, the system can be stabilized without RL gain, and it performs relatively well among the three groups.
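The role of the K_D term can be illustrated on a toy problem. The sketch below is not the quadrotor model or the gains of Table 3. It assumes a single rotational axis modeled as an undamped double integrator with a hypothetical inertia of 0.01 kg m^2 and hypothetical gains, and the optional `rl_policy` callable is a stand-in for the actor output that is added to the feedback term. Without RL gain, proportional feedback alone keeps oscillating, while adding the derivative term damps the response.

```python
import numpy as np


def pd_feedback(angle_err, rate, kp, kd):
    """Stability augmentation term: P on attitude error, D on angular rate."""
    return kp * angle_err - kd * rate


def simulate_axis(kp, kd, rl_policy=None, theta0_deg=-30.0,
                  inertia=0.01, dt=0.002, t_end=8.0):
    """Simulate one axis; returns the attitude history in degrees."""
    theta = np.deg2rad(theta0_deg)
    omega = 0.0
    history = []
    for _ in range(int(round(t_end / dt))):
        err = 0.0 - theta  # desired attitude is 0
        u_rl = rl_policy(theta, omega) if rl_policy is not None else 0.0
        torque = u_rl + pd_feedback(err, omega, kp, kd)
        # Semi-implicit Euler: update rate first, then angle.
        omega += (torque / inertia) * dt
        theta += omega * dt
        history.append(theta)
    return np.rad2deg(np.array(history))


p_only = simulate_axis(kp=0.5, kd=0.0)   # analogue of K_D = 0
pd_ctrl = simulate_axis(kp=0.5, kd=0.1)  # derivative term added

# Largest deviation during the final second of the 8 s run.
print("P only :", np.max(np.abs(p_only[-500:])))
print("PD     :", np.max(np.abs(pd_ctrl[-500:])))
```

In this toy setting the RL term would only need to correct what the feedback loop leaves over, which is the intuition behind training the policy with the stability augmentation controller already in the loop.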
In this subsection, the average loss and average accumulated reward are again used as the measures of the learning effect. The larger the reward value, the smaller the error between the attitude and the desired state at each step. We also average every 50 groups of data into one sample to compare the value loss and accumulated reward of the learning algorithms after introducing the stability augmentation controller. The final comparison of the average value loss is shown in Fig. 13. Overall, the value loss of PPO with the PD stability augmentation controller is minimal at the beginning of training and finally tends to be stable, which indicates that stability augmentation feedback benefits the training efficiency of RL. Comparing the final convergence states of the four algorithms, the PPO algorithm with PD Parameter III has the least value loss. This also shows that the more stable the PD parameters make the system, the easier it is for RL to learn the desired control policy.
PPO algorithm. This is because the quadrotor itself is still unstable under proportional control, so that it ultimately reaches equilibrium mainly through the effect of RL gain. However, owing to the proportional adjustment in the initial training, the system avoids a large number of random trials and errors, so the error of the initial accumulated reward is far less than that of the original PPO algorithm. After the K_D parameter is added, PPO-PD with Parameter II obtains a higher reward value than both the PPO algorithm and PPO-PD with Parameter I. As shown by the PPO-PD result with Parameter III, the more stable the PD parameters make the system, the more reward it obtains under RL gain.
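The averaging of every 50 groups of data into one sample can be sketched as below. The function name and the choice to drop an incomplete trailing block are assumptions made for illustration.

```python
import numpy as np


def block_average(values, block=50):
    """Average consecutive blocks of `block` entries into one sample each.

    Entries left over after the last full block are dropped (assumption).
    """
    values = np.asarray(values, dtype=float)
    n_blocks = len(values) // block
    return values[:n_blocks * block].reshape(n_blocks, block).mean(axis=1)


# Example with a synthetic sequence of 120 per-episode values.
raw = np.arange(120)
smoothed = block_average(raw, block=50)
print(smoothed)  # two samples: the means of entries 0..49 and 50..99
```

Applied to the logged value loss or accumulated reward, this gives the smoothed curves on which the algorithms are compared.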

2) STABILITY TEST
Using the three groups of parameters listed in Table 3, the four RL algorithms have generated the control policies of the quadrotor in the offline training phase. To carry out the stability test, we set the initial flight attitude of the quadrotor to [−30, −30, −30]°, from which the desired attitude is relatively difficult to reach. The desired attitude is still [0, 0, 0]°. The model parameters of the quadrotor are shown in Table 1. The attitude changes of the quadrotor are observed for the same model over 8 seconds of flight. The comparison results of attitude changes under the four control policies are shown in Fig. 15.
When using the first group of PD parameters, since K_Di = 0 (i = 1, 2, 3), the stability augmentation controller is only proportional control. It can be observed that without RL gain, the proportional control cannot make the quadrotor stable. After introducing RL gain, the quadrotor can be stabilized under Parameter I. Because the proportional adjustment accelerates the response of the system, the quadrotor reaches the desired attitude more quickly, and its control performance is clearly better than that of the original PPO control policy. The differentiation element is introduced when the second group of PD parameters is used. Without RL gain, the PD stabilization controller can only make the roll and pitch angles reach the desired state; the yaw angle cannot reach the desired attitude and oscillates slightly. After the RL gain is added, the quadrotor reaches the desired attitude faster than with the original PPO algorithm, and the correction amplitude during flight is significantly reduced. This is because the differentiation control increases the damping of the system and improves its stability.
According to the flight results of the quadrotor under the second group of PD parameters, we slightly adjust the parameters, and thus obtain the third group of PD parameters. The quadrotor can achieve a steady state without RL gain through the stability augmentation controller under this parameter. The control policy trained by PPO with this stability augmentation controller can converge to the equilibrium point faster, and the error will be smaller.
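The tracking error in these stability tests can be summarized by the mean absolute error (MAE) between the attitude angles and the desired attitude. A minimal sketch is given below; the function name and the per-axis averaging over time are assumptions, and the short trajectory is synthetic rather than simulation output.

```python
import numpy as np


def attitude_mae(attitude_deg, desired_deg):
    """Per-axis mean absolute error over a recorded attitude trajectory."""
    attitude = np.asarray(attitude_deg, dtype=float)  # shape (T, 3)
    desired = np.asarray(desired_deg, dtype=float)    # shape (3,)
    return np.abs(attitude - desired).mean(axis=0)


# Synthetic three-step trajectory starting from [-30, -30, -30] degrees.
trajectory = [[-30.0, -30.0, -30.0],
              [-10.0, -10.0, -10.0],
              [0.0, 0.0, 0.0]]
mae = attitude_mae(trajectory, [0.0, 0.0, 0.0])
print(mae)  # one value per axis (roll, pitch, yaw)
```

A lower MAE over the same flight window indicates that the policy reaches the desired attitude faster, stays closer to it, or both.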
To compare intuitively how the PD parameters affect the PPO control policy, we compare the MAE between the attitude angle and the desired attitude for the PPO-PD algorithms with the three different parameter groups. As shown in Fig. 16, the more stable the PD control is, the better the performance PPO-PD can obtain. Since the parameter selection of PD control is very tedious work, it is difficult to make the system reach a steady state in a complex environment by PD control alone. After introducing RL gain, not only can the system be stabilized readily, but the control accuracy and learning efficiency are also superior to those of ordinary RL.
PPO-DWC and PPO-PD improve PPO at two different levels. PPO-DWC changes the structure of the algorithm to solve the problem of vanishing gradients, extending the sample exploration of PPO so that it converges to the desired policy quickly, while PPO-PD introduces a stability augmentation controller outside the RL policy. Well-tuned PD parameters can shorten the training time, thereby affecting the convergence of the policy. In this subsection, we combine the two algorithms. The stability augmentation controller uses Parameter III in Table 3. The DWC-based PPO with stability augmentation controller (PPO-DWC-PD) is trained with the same network parameters and quadrotor model parameters. We compare the results with PPO-DWC and PPO-PD.
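The vanishing-gradient issue targeted by dimension-wise clipping can be illustrated with a small numerical sketch. The code below is one common form of per-dimension clipping and may differ from the exact objective used in this paper. It assumes a policy with independent action dimensions, so that the joint probability ratio factorizes into per-dimension ratios, and it averages the per-dimension terms, which is also an assumption. All function names are hypothetical.

```python
import numpy as np


def joint_clip_objective(log_ratio_per_dim, advantage, eps=0.2):
    """Standard PPO surrogate: clip the ratio of the joint action probability."""
    ratio = np.exp(np.sum(log_ratio_per_dim))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(min(ratio * advantage, clipped * advantage))


def dimension_wise_clip_objective(log_ratio_per_dim, advantage, eps=0.2):
    """Clip each action dimension's ratio separately, then average."""
    ratios = np.exp(np.asarray(log_ratio_per_dim, dtype=float))
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratios * advantage, clipped * advantage)))


advantage = 1.0
# Dimension 0 has drifted far from the old policy; dimensions 1 and 2 have not.
base = np.log([2.0, 1.0, 1.0])
nudged = np.log([2.0, 1.05, 1.0])  # small policy change in dimension 1 only

joint_base = joint_clip_objective(base, advantage)
joint_nudged = joint_clip_objective(nudged, advantage)
dwc_base = dimension_wise_clip_objective(base, advantage)
dwc_nudged = dimension_wise_clip_objective(nudged, advantage)

print("joint clipping :", joint_base, "->", joint_nudged)
print("dimension-wise :", dwc_base, "->", dwc_nudged)
```

With joint clipping, the single out-of-range dimension saturates the whole sample, so the objective does not respond to the change in dimension 1. With per-dimension clipping, only dimension 0 is saturated and the sample still provides a learning signal for the remaining dimensions.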

1) OFFLINE TRAINING
The comparison of the average value loss of the algorithms is shown in Fig. 17. In general, PPO-DWC-PD has the fastest response and the highest learning efficiency. It combines the advantages of the other two algorithms: the ability of PPO-PD to converge quickly and the ability of PPO-DWC to obtain a smaller value loss through efficient exploration. Fig. 18 shows the average accumulated reward of the three algorithms, demonstrating that the learning efficiency of PPO-DWC-PD is higher than that of PPO-PD. By observing the convergence of the three algorithms, it can be concluded that PPO-DWC-PD is the fastest to reach its maximum reward value. PPO-DWC gets the most reward because the weight of RL gain in the system has changed after the introduction of the PD controller.

2) STABILITY TEST
To further observe the control performance of PPO-DWC-PD, a worst case is selected to increase the complexity of the task. Assuming that the distance from the quadrotor rotor to the center of mass L is 1.  Figure 19.
The quadrotor under the PPO-PD control policy still oscillates slightly in the steady state because the change of the model parameters greatly influences the stability augmentation controller. The PPO-DWC control policy has good robustness and produces a relatively slight steady-state deviation. In contrast, the PPO-DWC-PD policy converges to the steady state fastest among the three policies and is the most stable once the steady state is reached.

V. CONCLUSION
In this paper, an improved PPO algorithm based on PPO-DWC and PPO-PD is proposed to solve the continuous motion control of the quadrotor. This is a learning-based control policy, which significantly improves the flight accuracy and reduces the steady-state error of the quadrotor attitude control. The PPO algorithm is improved in two directions. The first is to optimize the algorithm structure of PPO. To address the vanishing of the sample gradient in the PPO algorithm, the dimension-wise clipping method is used to calculate the policy function, which improves the sample efficiency and accelerates the convergence of the quadrotor control policy. The second is to introduce a stability augmentation controller to avoid blind exploration of the quadrotor in the initial stage. The PPO output is used as a gain term to enhance the stability of the quadrotor. The simulation results show that the control policy has good robustness and that the new algorithm performs better than the original PPO and the PID controller. In future work, we will focus on analyzing and processing the old sample batches in the PPO algorithm to enhance learning efficiency, and on combining a compound reward function signal to reduce the observed steady-state error. Moreover, a more complex nonlinear stability augmentation feedback will also be considered.