Fault Tolerant Control Using Reinforcement Learning and Particle Swarm Optimization

The diversity, uncertainty, and suddenness of unexpected faults challenge fault-tolerant control because valid data are lacking, especially during the early stage of a fault. In this study, a reinforcement learning approach with a critic-action architecture is proposed to overcome this challenge by designing an online-learning fault-tolerant controller so that the faulty system can approximate the performance index of the fault-free system. Different from the traditional Hebb enhancement rules in reinforcement learning, the training process is sped up by introducing supervised learning on a training dataset built from the states and the virtual optimal controls acquired by particle swarm optimization. The effectiveness of the algorithm is demonstrated on a three-tank test bed.


I. INTRODUCTION
Fault-tolerant control (FTC) has received much attention in engineering applications because of its prominent contribution to system reliability and safety, high system performance, productivity, and operating efficiency [1]. Great efforts have been made to mitigate the influence of faults on an operating system through hardware- and/or information-redundancy-based tolerant control strategies [2]-[7]. Generally, FTC approaches are classified into passive FTC and active FTC [8], and model-based FTC has played a dominant role during the past decades. Recently, fuzzy adaptive finite-time fault-tolerant control [9] and event-triggered robust fuzzy adaptive finite-time control [10] approaches were investigated. It should be noted that model-based FTC depends heavily on the accuracy of the model: an excellent model contains accurate and sufficient information on the object of study, such as its parameters and structure, which enables effective control. However, such a model may be unavailable or difficult to build for a complex industrial process. Encouragingly, data-based methods [11]-[14] provide a feasible way to complement the model-based approaches using abundant data that directly reflect the actual status of the system. Among them, machine learning and artificial intelligence techniques such as neural networks (NN) [15], support vector machines (SVM) [16], and deep learning [17] are striking because they imitate the human brain and other intelligent agents. These intelligent approaches use ''learning'' to grasp the inner rules underlying the training data. Usually there are supervised and unsupervised methods of ''learning'' in data-driven fault diagnosis [18]. Supervised learning achieves higher learning efficiency from a training set, which requires enough data to cover the relations from inputs to goals. Unsupervised learning is essentially data clustering based on a certain criterion (usually a norm); although it does not need a prior training set, it also needs enough data to guarantee the pattern's accuracy.
In principle it is impossible to obtain data for all kinds of faults, owing to physical constraints and economic considerations. Moreover, the data available for FTC are imbalanced: plenty of data are healthy and only a few are faulty. Finding a way to learn from as little data as possible is therefore well motivated for FTC. Reinforcement learning (RL) brings a novel way to learn by introducing the idea of a control value instead of a control effect. The control value is evaluated immediately after each action. As a result, ''learning'' becomes a process of seeking the best evaluation of control, which does not require all kinds of faulty data in advance. It only needs the current state and the next state at every trial. Through trial and error that simulates the process by which animals adapt to their environment [19], [20], the controller acquires the ability to keep the performance index consistent with the working condition by self-adjusting its actions.
RL provides a feasible approach to learning in an unknown environment without abundant prior data, which benefits FTC when early faulty data are lacking. Consequently, if one takes the fault as the environment and the healthy performance index as the goal, the FTC control rule can be achieved by self-adjustment with an RL approach. Recently, RL-based FTC has attracted attention and some results have been reported, such as FTC tracking control for MIMO discrete-time systems [21], FTC design for a class of nonlinear MIMO discrete-time systems [22], and RL-based FTC for linear discrete-time dynamic systems [23]. It should be noticed that RL learning is a very slow process, which limits RL to small FTC systems [24], although RL has succeeded in games, computer vision [25], and robotics [26]. Several methods for accelerating RL on large systems have been proposed, such as trajectory tracing [20], trust region policy optimization (TRPO) [27], actor-critic with experience replay (ACER) [28], the proximal policy optimization algorithm (PPO) [29], FPGA acceleration [30], and so forth. However, all these methods need a large amount of training data, which leads back to the dilemma of lacking valid training data in FTC. It is therefore necessary to find a way to accelerate RL training so that FTC can mitigate the influence of a fault as quickly as possible.
Compared with the primitive animal behavior from which RL originates, which searches for a better action through inefficient trial and error, biological species have evolved more effective search methods, such as social behavior, over millennia of competition for survival. Particle swarm optimization (PSO) [31]-[33], which imitates the social behavior of a bird flock, provides a way to find the goal without needing more data. It is therefore attractive to adopt PSO as an auxiliary tool to improve RL's learning speed, owing to its excellent parallel random search ability in unknown surroundings. Motivated by the above ideas, a reinforcement-learning fault-tolerant control via PSO is proposed. A critic-action architecture [34]-[36], an efficient form of RL, is used to implement FTC by combining a critic network, which evaluates the control value against the goal, with an action network, which adjusts the control action. The learning of the critic-action pair is accelerated by converting the alternating training of the critic network and the action network under the Hebb rule into efficient supervised training via an improved PSO. The advantages of the proposed approach are as follows:
1) It resolves the conflict between the data required and the impossibility of obtaining data for all kinds of faults in advance.
2) The optimal control for the best reachable performance index is obtained by employing the critic-action architecture, regardless of the system model and fault type.
3) The response time to a fault is shortened by introducing supervised learning to train the critic-action pair with PSO.

II. PROBLEM DESCRIPTION AND PRELIMINARIES
A. PROBLEM DESCRIPTION
Suppose a fault-free nonlinear system/process is described by

$$x_p(k+1) = f\left(x_p(k), u_p(k)\right) \tag{1}$$

which is controlled by a pre-designed state feedback controller

$$u_p(k) = g\left(x_p(k)\right) \tag{2}$$

where x_p = (x_p1, x_p2, · · · , x_pn)^T ∈ R^n and u_p = (u_p1, u_p2, · · · , u_pm)^T ∈ R^m are the state and control vectors respectively, f and g are maps satisfying f: R^n × R^m → R^n and g: R^n → R^m, and k is the sampling instant. A performance index J_p is defined in the quadratic form

$$J_p = \sum_{k=1}^{\infty} \left[ x_p^T(k) M x_p(k) + u_p^T(k) N u_p(k) \right] \tag{3}$$

where M and N are known weighting matrices and the superscript T denotes the transpose of a vector or matrix.
A faulty system originating from system (1) becomes

$$x_f(k+1) = h\left(x_f(k), u(k)\right) \tag{4}$$

and the performance index J_f, with the same form as formula (3), is marked as

$$J_f = \sum_{k=1}^{\infty} \left[ x_f^T(k) M x_f(k) + u^T(k) N u(k) \right] \tag{5}$$

where x_f = (x_f1, x_f2, · · · , x_fn)^T ∈ R^n (the subscript f denotes fault), u = (u_1, u_2, · · · , u_m)^T ∈ R^m, and h is an unknown function of the system subject to the unexpected fault.
The target is to find a control sequence {u(k) | k = 1, 2, · · ·} that makes the performance index J_f under fault approach the fault-free performance index J_p. The system state variables are assumed to be measurable.

B. REINFORCEMENT LEARNING AND CRITIC ACTION ARCHITECTURE
For the system (1) and (2), the expected immediate cost R_k(x(k+1), x(k), u(k)) is incurred in the transition from state x(k) to state x(k+1) under control u(k). The control value V_k^u(x(k)) is defined as the sum of the discounted immediate costs from state x(k) along the control u:

$$V_k^u(x(k)) = \sum_{i=k}^{\infty} \gamma^{i-k} R_i\left(x(i+1), x(i), u(i)\right) \tag{6}$$

where γ is a discount factor with 0 ≤ γ < 1 to ensure convergence. The optimal control is obtained by alternating policy evaluation and policy improvement:

$$V_k(x(k)) = R_k\left(x(k+1), x(k), u(k)\right) + \gamma V_{k+1}(x(k+1)) \tag{7}$$

$$u(k) = \arg\min_{u} \left[ R_k\left(x(k+1), x(k), u\right) + \gamma V_{k+1}(x(k+1)) \right] \tag{8}$$

Formulas (7) and (8) provide a feasible way to reach the optimum using only the state information and the immediate cost. In fact they cannot be solved directly, because V_k and V_{k+1} need the information of x after k+1 according to (6). Therefore, the critic-action architecture is proposed to implement them. In the critic-action architecture, a critic network is used to approximate the control values V_k and V_{k+1} with inputs x(k) and x(k+1) under the control u. The critic is trained forward in time by minimizing the error measure $\sum_{k=1}^{\infty} TE^2(k)$ over time, where

$$TE(k) = \gamma \hat{V}_{k+1}(X(k+1)) + R_k\left(x(k+1), x(k), u(k)\right) - \hat{V}_k(X(k)) \tag{9}$$

and X(k) stands for either the state vector x(k) or a concatenation of x(k) and the control vector u(k). An action network is used to build a map from input x(k) to output u(k). The partial derivative of V_k(X(k)) with respect to the control variable is obtained by backpropagating through the critic network, and the gradient of the cost function with respect to the action network's weights then yields the parameter updates of the action network for all inputs. The critic-action architecture was proposed by Werbos [34] and expressed as heuristic dynamic programming (HDP) and action-dependent heuristic dynamic programming (ADHDP) in [35]. More details are omitted owing to space limitations.
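For concreteness, a minimal Python sketch of the critic estimate and the temporal-difference error (9) is given below; the two-layer network shape, the function names, and the default discount factor are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def critic_value(Wc1, Wc2, X):
    """Two-layer critic: estimate V(X), where X is x(k) or [x(k); u(k)]."""
    return Wc2 @ sigmoid(Wc1 @ X)   # scalar value estimate

def td_error(Wc1, Wc2, X_k, X_k1, R_k, gamma=0.95):
    """TE(k) = gamma*V(X(k+1)) + R_k - V(X(k)), per formula (9)."""
    return (gamma * critic_value(Wc1, Wc2, X_k1) + R_k
            - critic_value(Wc1, Wc2, X_k))
```

The critic is then trained to drive the accumulated TE^2(k) toward zero.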

C. CONVERGENCE OF HDP
The convergence of HDP was established in [37]. It is repeated here only for completeness and the proof is omitted.
Lemma 1: Consider the sequences V_k and u_k defined by

$$u_k(x(k)) = \arg\min_{u} \left[ R_k\left(x(k+1), x(k), u\right) + \gamma V_k(x(k+1)) \right] \tag{10}$$

$$V_{k+1}(x(k)) = \min_{u} \left[ R_k\left(x(k+1), x(k), u\right) + \gamma V_k(x(k+1)) \right] \tag{11}$$

If V_0(x(k)) = 0, then V_k is a non-decreasing sequence, ∀k: V_{k+1}(x(k)) ≥ V_k(x(k)). Moreover, as k → ∞, V_k → V* and u_k → u*; hence, the sequence V_k converges to the solution of the discrete-time Hamilton-Jacobi-Bellman equation.
Remark 1: The Hamilton-Jacobi-Bellman equation is the goal of the convergence of HDP, though it cannot be solved exactly in the general nonlinear case.

III. THE CRITIC ACTION-BASED FTC
The goal of FTC is to approximate the performance index of the fault-free condition by seeking a control u in the case of an unexpected fault. A critic-action-based fault-tolerant control (CAFTC), which suits all kinds of faults without any prior fault information, is proposed; its schematic block diagram is depicted in Figure 1. The CAFTC is composed of three parts: a critic network, an action network, and a healthy model. In Figure 1 the same critic network is drawn as two boxes in order to express the different samples of a time series. The critic-action architecture places no requirement on the neural network type; the feedforward neural network is selected for both the critic network and the action network because it is the basis of other neural networks.
For the critic network, denote W_c1 as the connection weights between the input layer and the hidden layer and W_c2 as the connection weights between the hidden layer and the output layer, with the sigmoid function as the activation function:

$$s_{out} = sig(s_{in}) = \frac{1}{1 + e^{-s_{in}}} \tag{12}$$

whose derivative is

$$\dot{s}_{out} = sig(s_{in})\left(1 - sig(s_{in})\right) = s_{out}(1 - s_{out}) \tag{13}$$

where s_in and s_out are the input and output of the sigmoid function.
As a result, the estimates V̂_k(x(k)) of V_k(x(k)) and V̂_{k+1}(x(k+1)) of V_{k+1}(x(k+1)) are obtained according to formulas (14) and (15):

$$\hat{V}_k(x(k)) = W_{c2} \cdot sig\left(W_{c1} \cdot X(k)\right) \tag{14}$$

$$\hat{V}_{k+1}(x(k+1)) = W_{c2} \cdot sig\left(W_{c1} \cdot X(k+1)\right) \tag{15}$$

Formulas (14) and (15) transform the solving of V_k and V_{k+1} into the learning of the connection weights W_c1 and W_c2, which are trained by backpropagation. The error δ_c(k) between the estimate V̂_k(x) and the real V_k(x) is obtained by replacing V_k(x) according to formula (7) and using V̂_{k+1}(x(k+1)) instead of V_{k+1}(x(k+1)). Then we have

$$\delta_c(k) = \gamma \hat{V}_{k+1}(x(k+1)) + R_k\left(x(k+1), x(k), u(k)\right) - \hat{V}_k(x(k)) \tag{16}$$

The goal of the critic network is to minimize the error measure

$$E_c(k) = \frac{1}{2}\,\delta_c^2(k) \tag{17}$$

over time. As a result, the weight corrections ΔW_c1 and ΔW_c2 at time k are

$$\Delta W_{c2}(k) = -l \cdot \delta_c(k) \cdot s_{out,c} \tag{18}$$

$$\Delta W_{c1}(k) = -l \cdot \delta_c(k) \cdot W_{c2} \cdot s_{out,c}\left(1 - s_{out,c}\right) \cdot X^T(k) \tag{19}$$

where l is the learning rate, X(k) is the input of the critic network, s_out,c is the output of the sigmoid function of the critic network, and W_c2 is the connection weight between the hidden layer and the output layer. s_out,c and W_c2 are obtained from the critic network, and δ_c is obtained according to formula (16).
The connection weights W_c1 and W_c2 of the critic network are updated according to formulas (20) and (21):

$$W_{c1}(k+1) = W_{c1}(k) + \Delta W_{c1}(k) \tag{20}$$

$$W_{c2}(k+1) = W_{c2}(k) + \Delta W_{c2}(k) \tag{21}$$

For the action network, denote W_a1 as the connection weights between the input layer and the hidden layer and W_a2 as the connection weights between the hidden layer and the output layer, with the sigmoid function of formula (12) as the activation function. The connection weights W_a1 and W_a2 of the action network are updated according to formulas (22) and (23):

$$W_{a1}(k+1) = W_{a1}(k) + \Delta W_{a1}(k) \tag{22}$$

$$W_{a2}(k+1) = W_{a2}(k) + \Delta W_{a2}(k) \tag{23}$$

The weight corrections ΔW_a1 and ΔW_a2 are adjusted by the reinforcement learning according to the Hebb enhancement rules of formulas (24) and (25):

$$\Delta W_{a2}(k) = -l \cdot \frac{\partial V_k}{\partial u} \cdot \frac{\partial u}{\partial W_{a2}} \tag{24}$$

$$\Delta W_{a1}(k) = -l \cdot \frac{\partial V_k}{\partial u} \cdot \frac{\partial u}{\partial W_{a1}} \tag{25}$$

where l is the learning rate. Notice that ∂V_k/∂u may be obtained from the critic network:

$$\frac{\partial V_k}{\partial u} = W_{c2} \cdot \left[ s_{out,c}\left(1 - s_{out,c}\right) \right] \cdot W_{c1,u} \tag{26}$$

Substituting formula (26) into formulas (24) and (25) gives

$$\Delta W_{a2}(k) = -l \cdot W_{c2} \cdot \left[ s_{out,c}\left(1 - s_{out,c}\right) \right] \cdot W_{c1,u} \cdot s_{out,a} \tag{27}$$

$$\Delta W_{a1}(k) = -l \cdot W_{c2} \cdot \left[ s_{out,c}\left(1 - s_{out,c}\right) \right] \cdot W_{c1,u} \cdot W_{a2} \cdot s_{out,a}\left(1 - s_{out,a}\right) \cdot x^T(k) \tag{28}$$

where W_c1,u is the connection weight between u (part of X) and the hidden layer of the critic network, and W_c2 is the connection weight between the hidden layer and the output layer of the critic network; s_out,c and s_out,a are the outputs of the sigmoid functions in the critic network and the action network, respectively; W_a2 is the connection weight between the hidden layer and the output layer of the action network; and x(k) is the input of the action network. W_c1,u, W_c2, s_out,c, s_out,a, and W_a2 can all be obtained from the critic network and the action network. The healthy model, built offline from healthy data series, runs in parallel with the plant, and its dynamic performance is consistent with that of the plant under the fault-free scenario. A well-trained feedforward neural network (FNN) can be used as a black-box healthy model [24]. The healthy model provides a current reference x_r(k+1), which develops a target value for the critic-action pair in the training process.
The optimized control variable is obtained by adjusting the weight parameters of the action network based on formulas (22) and (23). After each action selection, the critic network evaluates the result to determine whether it is better or worse than expected. By iterating the critic network training and action network training, the connection weights of the critic network and the action network converge according to Lemma 1. Further, the optimal control u*(k) corresponding to x(k) is obtained from the action network.
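A compact sketch of one training cycle for the updates (17)-(28) is given below, assuming a single control input, the sigmoid network of (12), and an illustrative learning rate; it is a sketch of the stated rules, not the authors' code.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def critic_update(Wc1, Wc2, X, delta_c, lr=0.01):
    """One gradient step on the critic error, cf. formulas (17)-(21)."""
    h = sigmoid(Wc1 @ X)                                  # hidden output s_out,c
    Wc2_new = Wc2 - lr * delta_c * h                      # cf. (18), (21)
    grad_h = Wc2 * h * (1.0 - h)                          # backprop through sigmoid
    Wc1_new = Wc1 - lr * delta_c * np.outer(grad_h, X)    # cf. (19), (20)
    return Wc1_new, Wc2_new

def action_update(Wa1, Wa2, x, dV_du, lr=0.01):
    """Move the action weights down the gradient dV/du of (26), cf. (24)-(28)."""
    h = sigmoid(Wa1 @ x)                                  # hidden output s_out,a
    Wa2_new = Wa2 - lr * dV_du * h                        # cf. (27)
    grad_h = Wa2 * h * (1.0 - h)
    Wa1_new = Wa1 - lr * dV_du * np.outer(grad_h, x)      # cf. (28)
    return Wa1_new, Wa2_new
```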

IV. IMPROVED BY PSO
The training of the critic-action architecture requires alternation between the critic network and the action network in order to converge. It is a slow evolutionary process that spends too much time on trial and error. In fact, after long-term evolution, animals possess some efficient ways to find a goal in an unknown environment. One of them underlies particle swarm optimization (PSO), proposed by Kennedy and Eberhart [31], which provides a quick random search approach by imitating the social behavior of a bird flock without depending on a model [32]. The iteration of the basic PSO follows formulas (29) and (30):

$$v_i(k+1) = \omega\, v_i(k) + c_1 \cdot rand1 \cdot \left( p_{best} - x_i(k) \right) + c_2 \cdot rand2 \cdot \left( g_{best} - x_i(k) \right) \tag{29}$$

$$x_i(k+1) = x_i(k) + v_i(k+1) \tag{30}$$

where ω is an inertia weight, c_1 and c_2 are acceleration constants, rand1 and rand2 are two random numbers independently generated in [0, 1], p_best is the best location the current particle has experienced, and g_best is the best location the entire particle swarm has experienced. It is worth noting that the iteration of PSO differs from that of RL: RL often takes a long time to converge to a result, whereas PSO can reach a solution after a few iterations thanks to the parallel work of many particles.
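A minimal Python sketch of the update (29)-(30) follows; the parameter values ω = 0.7 and c_1 = c_2 = 2.0 are common defaults in the PSO literature and are not taken from this paper.

```python
import numpy as np

def pso_step(x, v, p_best, g_best, w=0.7, c1=2.0, c2=2.0):
    """One basic PSO iteration per formulas (29) and (30).

    x, v   : (n_particles, dim) positions and velocities
    p_best : (n_particles, dim) best position of each particle
    g_best : (dim,) best position of the whole swarm
    """
    r1 = np.random.rand(*x.shape)    # rand1 in [0, 1]
    r2 = np.random.rand(*x.shape)    # rand2 in [0, 1]
    v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)   # (29)
    return x + v, v                                               # (30)
```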

A. FITNESS FUNCTION
The fitness function (FF) of PSO, as an evaluation of a particle's position, takes the same form as the temporal-difference error (TE), with V_k(x) and V_{k+1}(x(k+1)) replaced by V̂_k(x) and V̂_{k+1}(x(k+1)), respectively:

$$FF = \left[ \gamma \hat{V}_{k+1}(x(k+1)) + R_k\left(x(k+1), x(k), u(k)\right) - \hat{V}_k(x(k)) \right]^2 \tag{31}$$

where V̂_k(x) and V̂_{k+1}(x(k+1)) are obtained from the critic network and the activation threshold is set to 0. The immediate cost R_k(x(k+1), x(k), u(k)) is determined as the cost error between the faulty state and the healthy reference:

$$R_k\left(x(k+1), x(k), u(k)\right) = \left[ x(k+1) - x_r(k+1) \right]^T M \left[ x(k+1) - x_r(k+1) \right] \tag{32}$$

where x(k+1) is the state at k+1 after applying u, and x_r(k+1) is the state at k+1 from the healthy model, which is the target of FTC.
As a result, seeking the parameters W_c1 and W_c2 of the critic network becomes the following optimization problem:

$$\left[ W_{c1}, W_{c2} \right]^T = \arg\min_{W_{c1},\, W_{c2}} FF \quad \text{subject to } u = \text{output of the action network} \tag{33}$$
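A sketch of the particle fitness evaluation follows, under the assumption that the immediate cost takes the quadratic form of (32) and that the critic input concatenates state and control (ADHDP style); names and shapes are illustrative.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fitness(Wc1, Wc2, x_k, x_k1, x_ref_k1, u_k, M, gamma=0.95):
    """Particle fitness per (31): squared TD residual against the
    healthy-model reference x_r(k+1), with R_k as in (32)."""
    V = lambda X: Wc2 @ sigmoid(Wc1 @ X)     # critic estimate V_hat(X)
    e = x_k1 - x_ref_k1                      # tracking error vs healthy model
    R_k = e @ M @ e                          # assumed quadratic immediate cost
    X_k = np.concatenate([x_k, u_k])         # critic input: state + control
    X_k1 = np.concatenate([x_k1, u_k])
    return (gamma * V(X_k1) + R_k - V(X_k)) ** 2
```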

B. LEVEL SEARCH
PSO provides an effective solution within a few iterations, but it is prone to being trapped in local optima [32], [33]. Here a level-search approach is proposed to overcome this. Suppose the local optima are divided into different levels according to their fitness function values. The particles begin searching the solution space from the initial level and easily become trapped in a local optimum. It is difficult for a standard PSO to escape because the particles lose their searching ability owing to homoplasy at the local optimum. To overcome this, a secondary search, which changes the fitness function value and restores the particles' searching ability by re-randomization, restarts the search from this level. This search runs until it is trapped in another local optimum and is then replaced by another new search from a new level. The process continues until the best fitness function value is found. The algorithm is summarized as Procedure 1.

Procedure 1
Step 1: Initialize the particles corresponding to W_c1 and W_c2 of the critic network.
Step 2: Compute the fitness function of each particle according to formula (31).
Step 3: Renew p_best and g_best according to
p_best = min{FF_i(x_i(k)) | i = 1, 2, · · · , l}
g_best = min{FF_i(x_i(k)) | i = 1, 2, · · · , l, k = 1, 2, · · · , k}
Step 4: Update v_i and x_i according to formulas (29) and (30).
Step 5: Repeat Steps 2 to 4 until convergence and record g_best1.
Step 6: Modify the fitness function value and reassign the particles with random numbers in [0, 1].
Step 7: Repeat Steps 2 to 4 until convergence and record g_best2.
Step 8: Repeat Steps 2 to 7 until no better solution can be found, i.e., g_best1 ≥ g_best2.
Step 9: The position of the particle at g_best1 is the solution for W_c1 and W_c2 of the critic network.
Step 10: Get a virtual optimal control u*(k) by solving the optimization of (34) using PSO again:

$$u^*(k) = \arg\min_{u} FF \quad \text{subject to fixed } W_{c1},\, W_{c2} \text{ from Step 9} \tag{34}$$

It is noted that formula (33) and formula (34) have the same goal but different constraints. Formula (33) implies an achievable goal under the real control vector u (the u that acts on the system at k, from which we get the state x(k+1)) and memorizes this goal in the parameters W_c1, W_c2. Formula (34) provides an approach to get the optimal control u*(k) that would reach this goal, but this control u*(k) cannot be put into practice because the system has already moved to the next time k+1. This is the reason u*(k) is called a virtual optimal control. As stated above, this u*(k) depends on x(k) and x(k+1) whatever fault happens. The black-box relation between x(k) and u*(k) (k = 1, 2, · · · , n) is then obtained by training the action network.
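The level search of Procedure 1 can be sketched as follows, reusing pso_step from the sketch in Section IV; the swarm size, iteration counts, and [0, 1] initialization range are illustrative assumptions, and Step 6's fitness-value modification is simplified to plain re-randomization here.

```python
import numpy as np

def level_search(fitness, dim, n_particles=30, iters=50, max_levels=5):
    """Procedure 1, Steps 2-8: run PSO to a (possibly local) optimum,
    re-randomize the swarm, and restart from that level; stop when a
    restart fails to improve on the recorded best (g_best1 >= g_best2)."""
    best_x, best_f = None, np.inf
    for _ in range(max_levels):
        x = np.random.rand(n_particles, dim)      # re-randomized swarm (Step 6)
        v = np.zeros_like(x)
        p_best = x.copy()
        p_f = np.array([fitness(p) for p in x])
        g_i = p_f.argmin()
        g_best, g_f = p_best[g_i].copy(), p_f[g_i]
        for _ in range(iters):                    # Steps 2-4
            x, v = pso_step(x, v, p_best, g_best)
            f = np.array([fitness(p) for p in x])
            improved = f < p_f
            p_best[improved], p_f[improved] = x[improved], f[improved]
            g_i = p_f.argmin()
            if p_f[g_i] < g_f:
                g_best, g_f = p_best[g_i].copy(), p_f[g_i]
        if g_f >= best_f:                         # Step 8: no better solution
            break
        best_x, best_f = g_best, g_f              # record this level's optimum
    return best_x, best_f
```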

C. PSO IMPROVED FTC
The state x(k) and the virtual optimal control u*(k) form a sample element of the training set, which is consistent with the current fault. A training set is obtained after many attempts by changing states under the effect of different control actions. Once the training set for supervised learning is available, existing approaches such as the Levenberg-Marquardt algorithm [38] can play a significant role in improving the training efficiency. The schematic block diagram of the FTC controller improved by PSO is depicted in Figure 2 and the algorithm is summarized as Procedure 2.
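As an illustration of this supervised stage, the sketch below fits an action network to the collected pairs (x(k), u*(k)); plain gradient descent with a linear output layer is used only to keep the example short, whereas the paper applies the Levenberg-Marquardt algorithm [38].

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_action_network(X, U, hidden=10, lr=0.05, epochs=500, seed=0):
    """Fit the action network to the PSO-built training set:
    rows of X are states x(k), rows of U are virtual optimal controls u*(k)."""
    rng = np.random.default_rng(seed)
    Wa1 = rng.normal(0.0, 0.5, (hidden, X.shape[1]))
    Wa2 = rng.normal(0.0, 0.5, (U.shape[1], hidden))
    for _ in range(epochs):
        for x, u_star in zip(X, U):
            h = sigmoid(Wa1 @ x)
            err = Wa2 @ h - u_star                # output error vs u*(k)
            Wa2 -= lr * np.outer(err, h)          # output-layer step
            Wa1 -= lr * np.outer((Wa2.T @ err) * h * (1.0 - h), x)
    return Wa1, Wa2
```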
Remark 2: The virtual optimal control obtained by the PSO level search makes it possible to build a training dataset, which turns the RL training into supervised learning. This provides a much higher training efficiency than slow trial and error. Meanwhile, the advantage of CAFTC of requiring no model, no fault type, and no prior fault information is inherited, because the training data come directly from the system subject to the fault.

V. CASE STUDY
A three-tank system is implemented as a testbed to verify the proposed approach. The three-tank system consists of the tank bodies (T1, T2, and T3), submersible pumps (pump1 and pump2) whose flows (Q1, Q2) are controlled by a digital controller, connection valves (CV1, CV2, CV3), leak valves (LV1, LV2, LV3), and pipes. The liquid height of each tank (T1, T2, and T3) is obtained by a pressure level meter. The three tanks have the same size and are connected by pipes. The system works with the connection valves open and the leak valves closed; the liquid therefore flows from the tank bodies to the reservoir through CV3 and re-enters the tank bodies via pump1 and pump2. The characteristics of the plant can be changed by manually regulating the opening of the connection valves and leak valves, which alters the flow resistance between tanks. The flux of pump1 and pump2 is determined by their rotation speed, each controlled by an individual frequency converter. The 0-5 V control signal of the frequency converter is taken as the output of the controller. The relation between the flux of a pump and the control signal of the frequency converter was obtained in advance by additional experiments. Accordingly, for clarity, we take the flux of the pumps instead of the rotation speed as the control variable of the plant, omitting the frequency converter. The structure is shown in Fig. 3. The model of the three-tank system is given as formula (35):

$$\begin{cases} S \dot{h}_1 = Q_1 - \lambda_{13}\, \text{sgn}(h_1 - h_3) \sqrt{2g\left|h_1 - h_3\right|} \\ S \dot{h}_2 = Q_2 + \lambda_{32}\, \text{sgn}(h_3 - h_2) \sqrt{2g\left|h_3 - h_2\right|} - \lambda_2 \sqrt{2g h_2} \\ S \dot{h}_3 = \lambda_{13}\, \text{sgn}(h_1 - h_3) \sqrt{2g\left|h_1 - h_3\right|} - \lambda_{32}\, \text{sgn}(h_3 - h_2) \sqrt{2g\left|h_3 - h_2\right|} \end{cases} \tag{35}$$

where h_1, h_2, and h_3 are the liquid heights of T1, T2, and T3; S is the base area of the tanks; λ_13 = ρ/R_13, λ_32 = ρ/R_32, λ_2 = ρ/R_2; R_13 and R_32 are the flow resistances between the corresponding tanks, R_2 is the drainage resistance of CV3, and R_13, R_32, and R_2 are set by adjusting the openings of CV1, CV2, and CV3; ρ is the liquid density (for water, 1×10³ kg/m³); sgn(·) is the sign function; g is the acceleration of gravity; and Q_1 and Q_2 are the flows from pump1 and pump2.
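For reference, a simple Euler-integration sketch of model (35) is given below; the numerical parameter values are placeholders for illustration, not the test bed's identified values.

```python
import numpy as np

S = 0.0154                     # assumed tank base area [m^2]
g = 9.81                       # gravity [m/s^2]
lam13 = lam32 = lam2 = 1e-4    # assumed flow coefficients rho/R

def three_tank_step(h, Q1, Q2, dt=1.0):
    """One Euler step of (35): pump1 feeds T1, pump2 feeds T2,
    T3 sits between them, and the drain goes through CV3."""
    h1, h2, h3 = h
    q13 = lam13 * np.sign(h1 - h3) * np.sqrt(2 * g * abs(h1 - h3))
    q32 = lam32 * np.sign(h3 - h2) * np.sqrt(2 * g * abs(h3 - h2))
    qout = lam2 * np.sqrt(2 * g * max(h2, 0.0))
    dh = np.array([Q1 - q13, Q2 + q32 - qout, q13 - q32]) / S
    return h + dt * dh
```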
We apply a PID controller to pump1 and keep pump2 at 50% opening (the mid signal, 2.5 V, of the 0-5 V frequency-converter control signal, set in software) to implement our target of keeping the liquid level in T3 under the fault-free condition. We call this fault-free steady state the standard state. When a fault occurs, an FTC controller with two outputs (the flux of pump1 and pump2) replaces the former controller (a PID for pump1 and a fixed 50% opening for pump2). Our target is to preserve a reference liquid level in T3 by controlling the flows of pump1 and pump2 respectively.

A. SCENARIO OF ACTUATOR OUTPUT DEVIATION FAULT
An actuator fault of pump1 is imitated by changing the relation between the flux of the pump and the control signal of the frequency converter from its fault-free setting. This change makes the flux larger or smaller than the initial setting that corresponds to the output of the controller. By this means the actuator output deviation fault is produced in software, which prevents the real actuator from being damaged. An actuator fault of pump1 that pumps 12 l/min more fluid than its initial setting (translated according to the relation between the flux of the pump and the control signal of the frequency converter) is added after sample 100. The liquid height of T3, the states evolution, and the control variables are shown in Figure 4, Figure 5, and Figure 6. The blue curve and the red curve represent the cases without FTC and with FTC. Figure 5 shows that the states x1, x2, and x3 under the fault-free condition are stable before the fault occurs, and the liquid height of T3 is maintained at the reference level (Figure 4). When the fault occurs at sample 100, the state x1 goes up because of the increased flow into T1, and the states x2 and x3 also go up owing to the coupling in the case without FTC. The liquid height of T3 then settles at another level, from 10 cm to 15 cm, after a transient process. An FTC controller is designed according to Procedure 1, resulting in a feedforward neural network with a 3-10-2 structure. The training set consists of 100 data and is trained with the Levenberg-Marquardt algorithm. The well-trained NN is used as the FTC. It is obvious that the liquid height of T3 is recovered by the proposed algorithm.
More explanations about the control variables are given on the basis of Figure 6. In Figure 6 the horizontal axis is the sampling time and the vertical axis is the flux of the pumps. The zero of the vertical axis represents the flux of the pumps in the standard state. We use the zero scale instead of the real flux because the standard state varies with the fault-free reference level of T3. Negative means less flow and positive means more flow than in the fault-free standard state. The red curve and the blue curve represent the flux without FTC and with FTC respectively. It is seen that pump1 reduces its flow to counteract the actuator over-output fault. Meanwhile, pump2 also reduces its output in order to keep the T3 liquid height at the reference level.

B. SCENARIO OF ACTUATOR STUCK FAULT
A stuck fault of pump1 at 60% opening (signal 3 V of the 0-5 V frequency-converter control signal, meaning pump1 is stuck and no longer responds to control) is applied after sample 100. Figure 7, Figure 8, and Figure 9 show the liquid height of T3, the states evolution, and the control variables respectively. It is seen from the blue curves of Figure 8 that the liquid levels of T1, T2, and T3 rise slowly (according to the character of the plant) after the stuck fault if the fault-free control is kept. Figure 9 shows the control variables with FTC (red curve) and without FTC (blue curve). For pump1 the red and blue curves coincide because pump1 is jammed and has lost its regulating function. Pump2 reacts to this fault by stopping its flow delivery for a while in order to release the accumulation; it then provides a stable flow to maintain the level of T3. The reason there is no difference between the blue curve and the red curve after the fault is the error discussed in the following paragraph. Figure 7 shows that the liquid height of T3 maintains its fault-free height (red curve) under the FTC.
A similar stuck fault with the pump1 opening reduced to 30% (signal 1.5 V of the 0-5 V frequency-converter control signal), which cannot maintain the liquid level, is also considered. The states evolution, the liquid height of T3, and the control variables are shown in Figure 11, Figure 10, and Figure 12. The blue curves present the cases without FTC and the red curves show the evolution with FTC. It is seen from Figure 12 that the released flow has a lower intensity and shorter duration compared with the 60% opening stuck fault. A deviation also appears between the red curve and the blue curve owing to the difference of the two steady states.

C. SCENARIO OF PLANT LEAK FAULT
We also produced a flow leak fault by partly opening LV2 of tank T3. Figure 13 shows that the liquid height in T3 would fall from 9 cm to 7 cm because of the leak if the fault-free control were kept, as shown by the blue curve. The red curve of Figure 13 shows the tendency of the liquid height in T3 under the proposed FTC; one can see that the liquid height in T3 holds its fault-free level thanks to the FTC. The states evolution and the control variables are shown in Figure 14 and Figure 15.

D. SOME CONCERNS
To the best of our knowledge, there is no other data-driven FTC that deals with all kinds of faults without any preconditions except the redundancy of the system. The approach takes effect as soon as the fault is detected, without further diagnosis. Fault detection is a relatively easy task in FDD, with many mature methods such as PCA, Bayesian decision, and so forth. Moreover, the critic-action architecture has excellent robustness, owing to its neural-network approximation, which resists the disturbance of noise.
The transient time from fault detection to control action depends on the number of samples and their training time under supervised training, which is less than that of traditional trial and error. More details on the shortened time are given in our early work [39]. In this paper the transient process has been omitted in order to highlight the effect of the proposed FTC.

VI. CONCLUSION
A critic-action architecture in RL has been proposed to design a fault-tolerant controller that overcomes the difficulty of unknown changes in a plant caused by the suddenness and unpredictability of faults. The PSO with level search finds the virtual optimal control for the current states without a training dataset and then constructs a training dataset for supervised training, which is more effective than conventional RL training. The proposed approach is a data-driven FTC without any prior knowledge of the system, which avoids the bottleneck problem of modeling. The FTC is obtained by a training process using online data that reliably reflect the fault process. Like other data-driven methods, the proposed approach needs the collected data as an analysis basis. It has the ability to reach an optimal action targeted at the performance index under the condition of an unexpected fault.