Reinforcement Learning-Based Composite Controller for Cable-Driven Parallel Suspension System at High Angles of Attack

This paper investigates an intelligent method for the motion control of a cable-driven parallel suspension system (CDPSS) in wind tunnel tests, especially at high angles of attack, where the aerodynamics are unsteady and nonlinear. Considering the modeling uncertainties and the complex aerodynamic interference, a composite controller that combines deep deterministic policy gradient (DDPG) and computed-torque control is proposed to improve the control performance. The main tasks are the construction of the training environment based on the dynamic equations and the design of the Markov Decision Process (MDP). The supplementary computed-torque control is used to speed up the learning of the agent. A well-trained agent is then applied to the control of high angle-of-attack maneuvers in different examples, including single-DOF and multi-DOF coupled motion. The simulation results show that the controller fulfills the training tasks efficiently, and that it is robust and maintains strong generalization ability when handling unlearned tasks.


I. INTRODUCTION
A cable-driven parallel mechanism (CDPM) is a special kind of robotic manipulator that employs flexible cables instead of traditional rigid links, which brings the advantages of simple structure, low inertia, relatively large workspace and stiffness, and superior modularity and reconfigurability. These advantages have attracted increasing interest among researchers, and a rich literature exists on CDPMs in many different applications, such as cooperative cranes [1], haptic devices [2], rescue or medical rehabilitation [3], and radio telescopes [4]. In particular, because cables cause less flow field interference and offer higher dynamic performance, the CDPM provides a new method of suspending the aircraft model in wind tunnel tests. For example, the French National Aerospace Research Centre (ONERA) proposed an active cable suspension method for wind tunnel tests (the SACSO project), and it was successfully applied in a vertical wind tunnel for aerodynamic design and validation of fighters [5]. Bruckmann et al. [6] investigated an active cable-driven suspension system for ship maneuvers in the wind tunnel. The system is designed for a weight of 150 kg and a motion frequency of 0.5 Hz at an amplitude of approximately 0.5 m in the low-speed wind tunnel. Lambert [7], [8] developed a cable-driven, dynamically controlled wind tunnel traverse mechanism with 6 DOF to investigate the coupling between a blunt body and the embedding flow during a controlled maneuver. Lin and Wang [9], [10] carried out kinematic and dynamic analyses of the cable-driven suspension system, conducted some typical tests in the low-speed wind tunnel, and obtained static and unsteady aerodynamic forces by using the internal six-force balance.
The associate editor coordinating the review of this manuscript and approving it for publication was Nishant Unnikrishnan.
High-precision and robust control of the cable-driven parallel suspension system (CDPSS) is critical to its applications in dynamic wind tunnel tests. In the literature, many advanced control approaches have been proposed for CDPMs in the Cartesian space or the joint space. Begey et al. [11] designed a cascade dynamic controller in a singular perturbation framework for CDPMs with highly flexible cables. Stability was demonstrated and the method's efficiency was experimentally validated. A robust adaptive approach was proposed in [12] to control a CDPM subject to dynamic uncertainties. It is designed based on the function approximation technique, which can adaptively learn the uncertain terms in the dynamics of both the manipulator and the actuators. Santos et al. [13] proposed a model predictive control strategy for large-dimension CDPMs working at low speeds. The simulation results indicate a superior performance of the proposed controller compared with two commonly used strategies: sliding mode control and PID+ control. Shang et al. [14] designed a composite controller based on the tracking error and synchronization error of the moving platform in the cable space to realize synchronized motion between the cables and thereby increase the tracking accuracy of the moving platform. Vibration rejection control in the Cartesian space was dealt with in [15] to address the vibrations induced by the axial flexibility of massless cables, and the approach was tested experimentally. However, when dealing with kinematic and dynamic uncertainties and external interferences, new approaches are still worthy of research, especially for the motion control of a cable-driven parallel suspended aircraft model at high angles of attack.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
During this kind of maneuver, the flow field around the aircraft is highly complex due to the multi-degree-of-freedom coupling, and the aerodynamic characteristics usually show strong nonlinearity and unsteadiness [16], [17]. The aerodynamic forces and moments vary with multiple parameters describing both the static and dynamic motion states, such as the angle of attack, the sideslip angle and their rates of change, and the pitch, yaw, and roll rates. As a result, the conventional aerodynamic model does not meet the requirements of flight simulation and control law design. The modeling uncertainty and external interference make high-precision motion control of the CDPSS a great challenge, especially in high angle-of-attack maneuvers.
With the upsurge of artificial intelligence research, intelligent algorithms represented by reinforcement learning (RL) have received considerable attention in the past decades. In particular, the deep reinforcement learning (DRL) technique can handle a large state space by building a deep neural network that relates the value estimates to the associated state-action pairs, thereby overcoming a shortcoming of conventional RL. Some recent works have started developing such approaches for various practical problems [18]-[22]. For example, the DRL technique has been shown to be successful in playing Atari and Go games [18], emerging as a powerful data-driven method for solving complex control problems. Lee et al. [19] studied a DRL algorithm that is adopted to learn an efficient adaptive gain-tuning strategy, and validated that the proposed system can achieve better station-keeping performance without deterioration in its control efficiency. Considering the limitations of conventional PID controllers for the dynamic positioning system, Modares et al. [20] developed an online actor-critic (AC) algorithm along with a novel neural network (NN) identifier to learn the optimal control solution for nonlinear systems with completely unknown dynamics and input constraints. Huang et al. [21] proposed an agent trained by deep deterministic policy gradient (DDPG) for unmanned aerial vehicle (UAV) control. The agent's generalization ability and robustness proved reliable in handling unlearned flying tasks. Lin et al. [22] investigated a supplementary controller based on RL to improve the control performance of quadrotor UAVs by constructing an AC structure and applying some improved techniques, e.g., Q-learning, temporal difference, and experience replay. With the proposed method, the speed and stability of training can be significantly improved.
In this paper, considering the complex, nonlinear and unsteady aerodynamic characteristics of the aircraft model at high angles of attack, an alternative to the standard computed-torque approach, based on reinforcement learning, is studied. The main contributions of the paper are the following. 1) A reinforcement learning-based DDPG algorithm is designed. Its training environment is constructed based on the dynamic equations of the system, and the reward functions are set in a segmental form according to the errors of the state parameters and their rates of change. 2) A novel composite controller combining DDPG and the computed-torque method is proposed for the typical application of CDPMs in wind tunnel tests, to achieve better control performance in multi-DOF high angle-of-attack maneuvers. This approach is verified through simulations.
The structure of this paper is as follows. Section II describes the dynamic model of the cable-driven system. Section III presents the construction of an improved deep deterministic policy gradient algorithm. Section IV covers the training and simulation analysis. Finally, Section V summarizes the main achievements of the paper.

II. DESCRIPTION OF CDPSS AND DYNAMIC EQUATIONS
Taking the low-speed wind tunnel test as the application background, the cable-driven suspension system can be set up in the test section accordingly, consisting of a fixed frame, an aircraft model, actuated cables, servo motors, a motion control system, and so on. The prototype of a typical aircraft model suspended by eight cables is given in Fig. 1. The i-th cable ($i = 1, \ldots, 8$) exits from the fixed base at point $B_i$ and is connected to the aircraft model at point $P_i$. By adjusting the cable lengths, the position and orientation of the aircraft model can be dynamically controlled. The model's pose can be measured by a visual method, such as a camera or laser, in which some cooperative targets are set on the model. Cable tension sensors are also used to monitor the tension in case of slackness.
Because the dimensions of the wind tunnel test section are limited, the CDPSS is a small-scale mechanism, and the ratio between the mass of the end-effector and that of the cables is quite large; it is therefore reasonable to assume the cables to be massless and straight. The impact of cable elasticity on the pose of the aircraft model is studied in [23], and it can be further minimized by using a material with a high elastic modulus, such as steel cable. Therefore, the cables are assumed to be rigid in this paper; the errors induced by this assumption are treated as parameter uncertainties and are dealt with by the new approach proposed in Section III. Moreover, the diameter of the fixed pulley is generally small enough for the CDPSS to be neglected, which facilitates the establishment of the kinematic model. The fixed pulley is then regarded as a point hinge with fixed coordinates.
The kinematic relationship diagram is shown in Fig. 2. The static coordinate system $o_w x_w y_w z_w$ is fixed to the frame, and the moving coordinate system $OXYZ$ is fixed on the aircraft model. Each cable $i$ ($i = 1, \ldots, 8$) satisfies a geometric relation in the static coordinate system, and the pose of the aircraft model is denoted by $X$. According to the differential kinematics of the parallel structure, the kinematic model of the cable length vector is
$\dot{L} = J\dot{X}_\omega$
where $L$ is the cable length vector, $J$ is the Jacobian matrix of the system, and $\dot{X}_\omega$ is the velocity vector.
Based on the Newton-Euler method, the motion of the aircraft model and that of the actuators are modeled separately. In the resulting dynamic equations, $M$ is the mass of the aircraft model, $f_e$ and $\tau_e$ are the aerodynamic forces and moments acting on the aircraft model, $T_i$ is the i-th cable tension, $A_G$ is the model's inertia matrix expressed in the static coordinate system, and $\dot{\omega}$ is the angular acceleration vector. $M_0$ is the inertia matrix of the actuators, $C_0$ is the viscous friction coefficient matrix of the actuators, $\mu$ is the drive coefficient of the ball screw, $\tau$ is the input torque, and $T$ is the cable tension vector. $\theta_m$ is the rotational angle vector of the actuators, with $L = \mu\theta_m$. Generally, the cable tensions are effective only when they are positive, so each tension is kept between a lower bound $T_{\min}$ and an upper bound $T_{\max}$. Equations (3), (4), and (5) are combined to derive the dynamic equation (6) of the system.
In high angle-of-attack maneuvers, the model of the aerodynamic forces and moments is a nonlinear function of $X$ and $\dot{X}$. In this model, $b$ is the wing semispan, $\bar{c}$ is the mean aerodynamic chord of the wing, $q$ is the dynamic pressure, $S$ is the reference wing area, and $V$ is the flow speed. $C_L$, $C_D$, and $C_Y$ are the coefficients of lift, drag, and lateral force, and the three $C_M$ terms are the coefficients of the roll, pitch, and yaw moments. The aerodynamic derivatives can be identified from typical low-amplitude forced oscillation and rotary test data [24]. $P$, $Q$, and $R$ are the roll, pitch, and yaw rates. The angle of attack $\alpha$ and the sideslip angle $\beta$ are defined by the following relations,
$\tan\alpha = \tan\theta\cos\phi - \sin\phi\tan\psi/\cos\theta$ (8)
$\sin\beta = \sin\theta\sin\phi\cos\psi + \sin\psi\cos\phi$ (9)
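As an illustration, relations (8) and (9) can be evaluated directly from the attitude angles. The following is a minimal sketch, assuming angles in radians and the principal branches of the inverse tangent and inverse sine; the function name `aero_angles` is hypothetical and not taken from the paper.

```python
import math


def aero_angles(phi, theta, psi):
    """Return (alpha, beta) in radians from roll phi, pitch theta, yaw psi.

    Implements relations (8) and (9); the principal branches of atan and
    asin are an assumption of this sketch.
    """
    # Relation (8): tan(alpha) = tan(theta) cos(phi) - sin(phi) tan(psi) / cos(theta)
    tan_alpha = (math.tan(theta) * math.cos(phi)
                 - math.sin(phi) * math.tan(psi) / math.cos(theta))
    # Relation (9): sin(beta) = sin(theta) sin(phi) cos(psi) + sin(psi) cos(phi)
    sin_beta = (math.sin(theta) * math.sin(phi) * math.cos(psi)
                + math.sin(psi) * math.cos(phi))
    # Guard against round-off pushing the argument slightly outside [-1, 1].
    sin_beta = max(-1.0, min(1.0, sin_beta))
    return math.atan(tan_alpha), math.asin(sin_beta)


if __name__ == "__main__":
    alpha, beta = aero_angles(0.0, math.radians(30.0), 0.0)
    print(math.degrees(alpha), math.degrees(beta))
```

With zero roll and yaw the angle of attack reduces to the pitch angle and the sideslip angle vanishes, which is a quick sanity check on the two relations.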

III. CONTROL ALGORITHM
As to the CDPSS, there are multiple sources of uncertainties, such as the nonlinear and unsteady aerodynamic forces caused by the boundary layer separation and flow hysteresis characteristics at high angles of attack, the variations of tangency points of driving cables and guide pulleys, the cable elasticity and the transmission efficiency of the ball screw.
Considering the above factors and the self-learning and self-adaptation characteristics of DRL, this intelligent method is used to design the control law of the CDPSS.

A. DDPG ALGORITHM
The learning process of RL is similar to that of the human brain. It can be described as trial and error: during each episode, rewards guide the agent toward the ultimate objective. The basic principle of RL is shown in Fig. 3, i.e., the agent continuously executes actions in the environment; after each action, the agent receives a reward from the environment; through this learning process, the agent learns to take the action that yields the highest reward value. The process of RL can be described mathematically as a Markov Decision Process (MDP). An MDP is made up of five elements, i.e., $M = \langle S, A, P, R, \gamma \rangle$, where $S$ is the state set of the environment, $A$ is the action set of the agent, $P$ is the state transition probability, $R$ is the reward function, which expresses how much the environment prefers the different actions the agent takes, and $\gamma \in [0, 1]$ is the discount factor; a relatively large $\gamma$ means the agent places more weight on future rewards relative to the current one.
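The role of the discount factor can be made concrete with a small numerical sketch. The function below computes the discounted sum of a reward sequence; the name `discounted_return` and the sample rewards are illustrative assumptions, not values from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards so each step applies one extra factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    delayed = [0.0, 0.0, 10.0]  # hypothetical rewards: payoff arrives late
    print(discounted_return(delayed, 0.9))  # late payoff still counts strongly
    print(discounted_return(delayed, 0.1))  # late payoff is almost ignored
```

A value of $\gamma$ close to 1 keeps delayed rewards nearly at full weight, whereas a small $\gamma$ makes the agent short-sighted.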
The Deep Deterministic Policy Gradient (DDPG) algorithm mainly comprises the environment, the experience replay set, the Actor network module, and the Critic (value evaluation) network module. The environment is the interaction and exploration space of the agent. The agent obtains samples while interacting with the environment and stores them in the experience replay set for the training process. The network architecture of the algorithm is shown in Fig. 4. There are four neural networks in the DDPG algorithm, and the main functions of each are listed below:
1. Actor Current Network: updating the policy parameter $\theta^{\mu}$; executing the current action $a_t$ according to the current state $s_t$; $a_t$ is used to generate $s_{t+1}$ and $R$ by interacting with the environment.
2. Actor Target Network: selecting the next action $a_{t+1}$ according to the next state $s_{t+1}$ sampled from the experience replay set; its parameter $\theta^{\mu'}$ is updated from $\theta^{\mu}$.
3. Critic Current Network: updating the value network parameter $\theta^{Q}$; calculating the current value $Q(s_t, a_t \mid \theta^{Q})$.
4. Critic Target Network: calculating $Q'(s_{t+1}, a_{t+1} \mid \theta^{Q'})$ in the target Q value.
The Critic Current Network is updated by minimizing its loss function, and the Actor Current Network is updated along its loss gradient. The two Target Networks are updated by soft update formulas, in which $\tau$ is the update coefficient; it represents the degree to which the current network is copied into the target network at each update, and it usually takes a small value such as 0.1.
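For reference, the standard DDPG formulation writes the critic loss, the actor gradient, and the target-network updates as sketched below. The minibatch size $N$, the sample index $i$, and the target value $y_i$ are notation assumed here, and the equation numbering of the paper is not reproduced.

```latex
\begin{aligned}
y_i &= r_i + \gamma\, Q'\!\big(s_{i+1},\, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big) \\
L(\theta^{Q}) &= \frac{1}{N} \sum_{i=1}^{N} \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^{2} \\
\nabla_{\theta^{\mu}} J &\approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_i,\, a = \mu(s_i)}\;
  \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_i} \\
\theta^{Q'} &\leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'} \\
\theta^{\mu'} &\leftarrow \tau\, \theta^{\mu} + (1 - \tau)\, \theta^{\mu'}
\end{aligned}
```

With $\tau = 0.1$, each update moves the target parameters one tenth of the way toward the current parameters, which keeps the target value $y_i$ slowly varying during training.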

B. SYSTEM MODEL DESIGN
To train the agent to achieve motion control of the aircraft model, the MDP should be designed, and several key elements need to be determined, such as the environment state, the executable actions, and the rewards generated from the interactions with the environment. In this paper, the dynamic equations of the CDPSS are taken as the training environment of the agent, and solving these equations constitutes the construction of the training environment. By solving the differential equations (6), the attitude angles of the aircraft model and their angular velocities are obtained and denoted as $X$ and $\dot{X}$. The errors between these state parameters and the desired ones are taken as the observation of the agent, i.e., the three angle errors and the three angular velocity errors. These six errors, $\Delta\phi_t$, $\Delta\theta_t$, $\Delta\psi_t$ and $\Delta\dot{\phi}_t$, $\Delta\dot{\theta}_t$, $\Delta\dot{\psi}_t$, are grouped together to form the environment state set (14), where $\phi$, $\theta$, $\psi$ respectively denote the roll, pitch, and yaw angles. Each of the eight cables is controlled by a motor, so the torques of the eight motors, which are the inputs applied to the environment, form the agent action set.
After constructing the state and action sets, an environmental reward function is set to enable the agent to reach the expected target. The reward function consists of two parts, one based on the angle errors and the other based on the angular velocity errors. The total reward function is therefore composed of $r_a$ and $r_v$, where $r_a$ represents the angle error part and is divided into three parts, $r_{a1}$, $r_{a2}$, and $r_{a3}$. In the formula for $r_{a1}$, $|\Delta\phi_t| = |\phi_t - \phi_d|$, where $\phi_d$ is the desired roll angle and $\phi_t$ is the observed value; $r_{a2}$ and $r_{a3}$ are respectively the reward functions of $\theta_t$ and $\psi_t$, and have a form similar to that of $r_{a1}$. $r_v$ represents the angular velocity error part and is likewise divided into $r_{v1}$, $r_{v2}$, and $r_{v3}$, where $r_{v2}$ and $r_{v3}$ are respectively the reward functions of $\dot{\theta}_t$ and $\dot{\psi}_t$, and have a form similar to that of $r_{v1}$. In the formula for $r_{v1}$, $\Delta\dot{\phi}_t = \dot{\phi}_t - \dot{\phi}_d$, where $\dot{\phi}_d$ is the desired angular velocity and $\dot{\phi}_t$ is the observed value.
The neural network construction is shown in Fig. 5. The input layer of the actor neural network contains six neurons corresponding to the six-dimensional environment state. In the hidden layer, four fully connected layers are interlaced with three convolution layers, with ReLU as the activation function. In the output layer, eight neurons correspond to the agent's eight-dimensional action, with Tanh as the activation function. Fig. 6 shows the critic neural network, which has fourteen neurons in the input layer, corresponding to the six-dimensional environment state and the eight-dimensional agent action. There are several steps between input and output: first, the state input goes through two twenty-neuron fully connected layers; it is then added to the output of a twenty-neuron fully connected layer fed by the action input; finally, the sum goes through another two twenty-neuron fully connected layers. The ultimate output is a one-dimensional state-action value corresponding to the state input and action input. ReLU is the activation function of this network.
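The segmental reward described above can be sketched as a piecewise-constant function of each absolute error. The thresholds, reward values, penalty, and the unweighted sum of $r_a$ and $r_v$ used below are all hypothetical placeholders chosen for illustration; the paper's actual segment boundaries and values are not reproduced here.

```python
def segment_reward(abs_error,
                   thresholds=(0.1, 1.0, 5.0),  # hypothetical segment limits
                   values=(1.0, 0.5, 0.0),      # hypothetical reward per segment
                   penalty=-1.0):               # hypothetical reward beyond the last limit
    """Piecewise-constant reward: a smaller absolute error earns a larger reward."""
    for limit, value in zip(thresholds, values):
        if abs_error <= limit:
            return value
    return penalty


def total_reward(angle_errors, rate_errors):
    """Sum the angle part r_a and the angular velocity part r_v.

    Each part is the sum over roll, pitch, and yaw; the unweighted sum is an
    assumption of this sketch.
    """
    r_a = sum(segment_reward(abs(e)) for e in angle_errors)
    r_v = sum(segment_reward(abs(e)) for e in rate_errors)
    return r_a + r_v


if __name__ == "__main__":
    print(total_reward((0.05, 0.5, 0.0), (0.0, 0.0, 0.0)))
    print(total_reward((10.0, 0.0, 0.0), (0.0, 0.0, 0.0)))
```

A segmented shape of this kind gives the agent a clear signal as each error crosses a limit, at the cost of providing no gradient of reward inside a segment.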

C. DESIGN OF COMPOSITE CONTROL LAW
Considering the uncertainties and interferences that exist in the modeling process, a composite control law combining the computed-torque method and the DDPG algorithm is proposed. The controller's structure is shown in Fig. 7. The new controller is divided into a control subsystem and a learning subsystem; the control subsystem works online, i.e., it feeds the environment state to the well-trained actor target network. By combining the agent action output $a_t$ and the computed-torque controller output $\tau_1$, a new control law is generated and applied to the aircraft model. To achieve high-precision pose control of the aircraft model, direct pose feedback is adopted, with monocular vision used to measure the motion state. Based on the dynamic equations, a typical computed-torque control law is designed. It ensures the stability of the aircraft model when it moves along a specified trajectory, and it corrects the deviation between the actual pose and the expected pose of the aircraft model, as well as the deviations of the rotation angle and angular velocity of the motors. Besides the proportional and derivative terms, the nonlinear and other terms should be compensated; the control law is thus set as (20), where $C^{+}$ is the Moore-Penrose generalized inverse of $C$, $e = X - X_d$, $\dot{e} = \dot{X} - \dot{X}_d$, and $K_p$ and $K_d$ are respectively the proportional and derivative gains of the control system. $Q = (I - J^{T+}J^{T})w$ is the internal cable tension, which can be adjusted through the factor $w$ to avoid cable slackness.
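The idea of the composite law can be illustrated on a toy single-DOF rigid body with dynamics `inertia * x_ddot + damping * x_dot = tau`. This is only a sketch under stated assumptions: the toy model, the function names, the additive combination of $\tau_1$ and $a_t$, the `action_scale` factor, and the optional saturation are all hypothetical and do not reproduce the eight-cable law (20) of the paper. The error convention $e = X - X_d$ follows the text.

```python
def computed_torque(inertia, damping, x, x_dot, x_d, x_d_dot, x_d_ddot, kp, kd):
    """Computed-torque law for the toy model inertia*x_ddot + damping*x_dot = tau.

    With a perfect model, the closed-loop error obeys e_ddot + kd*e_dot + kp*e = 0.
    """
    e = x - x_d              # pose error, same sign convention as in the text
    e_dot = x_dot - x_d_dot  # velocity error
    return inertia * (x_d_ddot - kd * e_dot - kp * e) + damping * x_dot


def composite_torque(tau_1, agent_action, action_scale=1.0, tau_limit=None):
    """Combine the computed-torque output with the agent action (assumed additive)."""
    tau = tau_1 + action_scale * agent_action
    if tau_limit is not None:
        # Optional symmetric saturation, a hypothetical actuator limit.
        tau = max(-tau_limit, min(tau_limit, tau))
    return tau


if __name__ == "__main__":
    tau_1 = computed_torque(inertia=2.0, damping=0.5, x=1.0, x_dot=0.0,
                            x_d=0.0, x_d_dot=0.0, x_d_ddot=0.0, kp=4.0, kd=4.0)
    print(tau_1, composite_torque(tau_1, agent_action=0.5))
```

In this arrangement the model-based term handles the known dynamics, and the learned action only has to compensate for what the model misses, which is one plausible reason for the faster training reported with the composite scheme.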

1) OFF-LINE AGENT TRAINING PROCESS
When the agent training process starts, the critic network and the policy (actor) network are initialized randomly, and their parameters are assigned to the corresponding target networks. Then an episode starts; in this experiment, each episode has a fixed task and task time, and the agent control period is defined as a time step. At each time step, $s_t$ is input to the actor policy network, which outputs $a_t$. The torque $\tau_t$ is the combination of $a_t$ and the computed-torque controller signal; $s_{t+1}$ is then obtained by integrating the dynamic equations, and the designed reward function gives the immediate reward $r_{t+1}$. Finally, $(s_t, a_t, r_{t+1}, s_{t+1})$ is stored in the experience pool as a trajectory sample, and the flight strategy is updated by applying the DDPG algorithm.
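Two building blocks of the process described above, the experience pool and the soft update of a target network, can be sketched in a few lines. The class and function names are hypothetical, network parameters are represented as plain lists of floats for simplicity, and uniform random sampling is assumed.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience pool storing (s_t, a_t, r_{t+1}, s_{t+1}) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest sample when full.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a uniform random minibatch without replacement."""
        k = min(batch_size, len(self.storage))
        return rng.sample(list(self.storage), k)

    def __len__(self):
        return len(self.storage)


def soft_update(target_params, source_params, tau):
    """Return tau * source + (1 - tau) * target, element by element."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(source_params, target_params)]


if __name__ == "__main__":
    pool = ReplayBuffer(capacity=1000)
    pool.add([0.0] * 6, [0.0] * 8, -1.0, [0.1] * 6)
    print(len(pool), soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1))
```

Sampling minibatches at random from the pool breaks the temporal correlation between consecutive time steps, which is the usual motivation for experience replay.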
According to the above description, the DDPG off-line training process is summarized as follows, and the flow chart is shown in Fig. 8.
1. Randomly initialize the critic network $Q(s, a \mid \theta^{Q})$ and the actor network $\mu(s \mid \theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$.
2. Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$.
3. Initialize the replay buffer $D$.
4. For episode = 1 to Max Episode do
During the agent training process, the initial state of the aircraft model is shown in Table 1, and the corresponding training parameters are shown in Table 2. This training task assumes that the aircraft model undergoes a sinusoidal oscillation in pitch, e.g., θ = 6° sin(t), while the yaw angle and roll angle are expected to remain stable at around 0°. The control period is set as 0.1 s, and a single experiment lasts 15 s.
The reward value curve reflects the performance variation of the agent to some extent. The variation of the agent's reward value averaged over every ten steps is shown in Fig. 9. At the beginning of the training process, the reward values are negative, which means the agent does not meet the requirement of the experiment; as the number of steps increases, the value turns positive and continues to rise; finally, it stabilizes near a fixed value. When only the DDPG algorithm is applied, the agent reward value turns positive at about the seventieth time and finally converges at about the three-hundredth time; when the DDPG+computed-torque algorithm is applied, the agent reward value turns positive at about the tenth time and remains stable for a long period after the fortieth time. From the analysis above, it is evident that the computed-torque auxiliary controller speeds up the training process and improves the steadiness of the training.
The expected trajectory of the pitch angle is tracked rapidly and accurately; the tracking error is within 0.1°. In the roll and yaw directions, a stable state is reached within 4 s.

A. VARIABLE TASKS CONTROL
In the agent training process, the pitch angle tracking trajectory is 6° × sin(t); here, unlearned high angle-of-attack trajectories are used as flying tasks instead. According to the analysis of the CDPSS workspace, i.e., the movement range of the aircraft model, several typical tasks are taken as examples.
Task 1: Oscillatory maneuver at high angles of attack in the pitch direction with a central pitch angle of 20°, an amplitude of 20°, and an angular frequency of 0.6; the initial roll and yaw angles are limited to the range from −10° to 10°. As shown in Fig. 13, when the aircraft model moves with a single degree of freedom, the three different controllers manage to track the expected high angle-of-attack trajectory within 5 s in the pitch motion. Similar results in the roll and yaw motion are shown in Fig. 15 and Fig. 16: the pre-trained agent manages to maintain stability at around 0° within 5 s, and the computed-torque+DDPG control algorithm achieves faster convergence than the other two.
Task 2: Three-DOF coupled oscillatory maneuver at high angles of attack: in the pitch direction with a central pitch angle of 20°, an amplitude of 30°, and an angular frequency of 0.5; in the roll direction with a phase of π/6, an amplitude of 5°, and an angular frequency of 0.6; and in the yaw direction with an amplitude of 6° and an angular frequency of 0.6.
As shown in Fig. 17 and Fig. 19, when the aircraft model moves with two DOF in the pitch and roll directions, the three different controllers are compared. In the pitch motion, the computed-torque+DDPG controller manages to track the corresponding trajectory at around 3.2 s, compared with 3.6 s for the computed-torque controller and 3.9 s for the DDPG controller; in the roll motion, the computed-torque+DDPG controller takes 3.9 s to track the trajectory, compared with 4.1 s for the computed-torque controller and 5.2 s for the DDPG controller. As can be seen from Fig. 18 and Fig. 20, in the pitch motion the steady-state error of the computed-torque+DDPG controller is within 0.08°, while the maximum tracking error is about 0.1° for the DDPG controller and 0.23° for the computed-torque controller; in the roll direction, the tracking error of the computed-torque+DDPG controller is within 0.07°, while the maximum tracking error is about 0.1° for the DDPG controller and 0.15° for the computed-torque controller.
The yaw tracking history and error curves are shown in Fig. 21 and Fig. 22. The pre-trained agent remains stable at around 0° within 5 s, and the computed-torque+DDPG control algorithm is the first to reach a stable state.
On the basis of the simulation results of the two tasks, whether in single-DOF or multi-DOF movement, the computed-torque+DDPG control algorithm is superior to computed-torque or DDPG alone; in other words, it requires a shorter time to reach tracking, adapts faster to a new task, and yields a smaller error.
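The reference trajectories of the two tasks and the error measure used in the comparison can be sketched as follows. The sinusoidal form, the zero phase wherever none is stated, the unit of the angular frequency (rad/s), and the function names are assumptions of this sketch; the yaw reference follows the statement of Task 2.

```python
import math


def task1_reference(t):
    """Task 1 reference (roll, pitch, yaw) in degrees at time t in seconds."""
    pitch = 20.0 + 20.0 * math.sin(0.6 * t)  # centre 20 deg, amplitude 20 deg
    return 0.0, pitch, 0.0                   # roll and yaw held at zero


def task2_reference(t):
    """Task 2 reference (roll, pitch, yaw) in degrees at time t in seconds."""
    roll = 5.0 * math.sin(0.6 * t + math.pi / 6.0)  # phase pi/6, amplitude 5 deg
    pitch = 20.0 + 30.0 * math.sin(0.5 * t)         # centre 20 deg, amplitude 30 deg
    yaw = 6.0 * math.sin(0.6 * t)                   # amplitude 6 deg
    return roll, pitch, yaw


def max_abs_error(actual, desired):
    """Largest absolute deviation between two equally sampled sequences."""
    return max(abs(a - d) for a, d in zip(actual, desired))


if __name__ == "__main__":
    times = [0.1 * k for k in range(151)]  # 15 s sampled at the 0.1 s control period
    pitch_ref = [task2_reference(t)[1] for t in times]
    print(min(pitch_ref), max(pitch_ref))
```

Evaluating `max_abs_error` over the samples after the settling time gives a figure comparable to the steady-state tracking errors quoted above.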

1) CONTROLLER ROBUSTNESS DISCUSSION
In order to check the control robustness, simulations have been conducted for several values of the parameters. The reference trajectory is set as a motion with an initial angle of 0° and an amplitude of 15° in the pitch direction. Case 1: the model weight is increased by 10%. Case 2: the model weight is reduced by 10%, and the aerodynamic force is reduced by 10%.
As shown in Fig. 23, under the conditions of two-parameter changes, the designed controller combining computed-torque and DDPG can recover the stable tracking state in about 6s.
It indicates that the proposed composite controller can still maintain certain rapidity and stability in the control process despite the model uncertainties.
As to the CDPSS, when the suspended aircraft model maneuvers at high angles of attack, the aerodynamic forces show nonlinear and unsteady characteristics and are difficult to model accurately. The robustness of the controller in such a situation is therefore analyzed. When the controller is in a stable state, an aerodynamic disturbance of the form $\sin(t)\,[4.5 \times 10^{-4};\ -1.8 \times 10^{-3};\ 0;\ 0;\ -1.2 \times 10^{-2};\ 0]^{T}$ is added from 1.5 s to 3.5 s. The simulation results are given in Fig. 24. The computed-torque+DDPG controller restores stability after an oscillation of about 0.7 s. This shows that the designed controller can adapt to a certain degree of external interference.
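The disturbance used in this test can be written as a small function of time. The sketch below assumes that $t$ is the absolute simulation time in seconds and that the window from 1.5 s to 3.5 s is inclusive at both ends; the names are hypothetical.

```python
import math

# Amplitude vector of the six-component disturbance given in the text.
DISTURBANCE_AMPLITUDE = (4.5e-4, -1.8e-3, 0.0, 0.0, -1.2e-2, 0.0)


def aerodynamic_disturbance(t, t_start=1.5, t_end=3.5):
    """Return the disturbance vector at time t; zero outside [t_start, t_end]."""
    if not (t_start <= t <= t_end):
        return [0.0] * len(DISTURBANCE_AMPLITUDE)
    return [math.sin(t) * a for a in DISTURBANCE_AMPLITUDE]


if __name__ == "__main__":
    for t in (1.0, 2.0, 4.0):
        print(t, aerodynamic_disturbance(t))
```

In a simulation loop, this vector would be added to the aerodynamic forces and moments at each control step inside the window.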

V. CONCLUSION
This paper addresses the control of a fully constrained cable-driven parallel suspension system in high angle-of-attack maneuvers. Nonlinear and unsteady aerodynamic forces, together with other kinematic and dynamic parameter uncertainties, inevitably have negative impacts on the motion accuracy and performance. Thus, to cope with these uncertainties and bounded interferences, a reinforcement learning-based composite control algorithm is proposed to achieve suitable tracking performance. This novel composite controller consists of two major parts: 1) an intelligent DDPG controller, in which the MDP is designed based on the system dynamic equations; and 2) a computed-torque controller, which is based on a rigid model of the system and is used to enhance the DDPG training speed and guarantee stability. After the off-line agent training, several tasks are simulated, and the results show that the composite controller outperforms both the traditional computed-torque controller and DDPG alone in terms of convergence speed and precision. The impacts of different uncertainties and aerodynamic interferences were evaluated through the robustness analysis, which further validates the effectiveness and feasibility of the designed controller. However, the training environment constructed in this paper is based on the theoretical dynamic equations, and the proposed controller is only examined through simulation studies. Therefore, future work aims at better training performance of DDPG and at experimental validation of this approach in dynamic high angle-of-attack wind tunnel tests.

QI LIN received the B.S., M.S., and Ph.D. degrees in propulsion engineering from the Nanjing University of Aeronautics and Aerospace, Nanjing, China, in 1982, 1984, and 1988, respectively. She is currently a Professor with the School of Aerospace Engineering, Xiamen University, Xiamen, Fujian, China. She is the author of more than 50 peer-reviewed journal articles and more than ten inventions. Her current research interests include cable-driven parallel robots, advanced wind tunnel test technologies, and flow control.