Robust Control of Unknown Observable Nonlinear Systems Solved as a Zero-Sum Game

An optimal robust control solution for general nonlinear systems with unknown but observable dynamics is advanced here. The underlying Hamilton-Jacobi-Isaacs (HJI) equation of the corresponding zero-sum two-player game (ZS-TP-G) is learned using a Q-learning-based approach employing only input-output system measurements, assuming system observability. An equivalent virtual state-space model is built from the system’s input-output samples, and it is shown that controlling this virtual model implies controlling the original system. Since the existence of a saddle-point solution to the ZS-TP-G cannot be verified, the solution is derived in terms of upper-optimal and lower-optimal controllers. The learning convergence is theoretically ensured, while the practical implementation relies on neural networks, which provide scalability to the control problem dimension and automatic feature selection. The learning strategy is validated on an active suspension system, a good candidate for the robust control problem with respect to road-profile disturbance rejection.


I. INTRODUCTION
Feedback control systems that are robust to external disturbances are a common challenge, and robustness frequently poses a direct or indirect design specification. To this end, robust optimal control design is a highly attractive approach that has gained renewed attention lately in the zero-sum (ZS) game framework. Although robust optimal design has for quite some time been well posed for linear systems and solved by the H-infinity framework, it was not until the works [1], [2] that the framework was imported to nonlinear robust optimal control in the form of L2-gain optimal control. The goal of this latter formulation is to solve the Hamilton-Jacobi-Isaacs (HJI) equation, accepting the fact that for general nonlinear systems it is often impossible to find analytical solutions. A well-studied approach to the L2 control design is formulated as a ZS differential game [3]-[7] between two competing players: the optimal controller and the worst-case optimal disturbance controller.
The associate editor coordinating the review of this manuscript and approving it for publication was Shuping He .
Whenever the HJI equation is feasible, these two players form its minimax saddle-point solution. The feasibility of the HJI equation is well studied for linear systems, where it depends on the attenuation level as a prescribed degree of performance [8]-[10]. For general nonlinear systems, however, the common approach is not to assume HJI feasibility directly, but rather to search for its solution using the concepts of upper-optimal and lower-optimal controllers, which act as upper and lower bounds for the optimal controller, respectively, when the HJI solution exists, or as independent optimal solutions when the HJI equation is infeasible [11], [12].
There exists a large number of solution approaches to the HJI equation, stemming from the methods of Approximate Dynamic Programming (ADP), also known as Reinforcement Learning (RL), an area of very active research [13]-[26]. These ADP works have been applied, to name just a few settings, to discrete-time systems [27] vs. continuous-time systems, and in known-system [28] or unknown-system approaches [29], [30]. In game theory, [27] proposed an iterative ADP algorithm for solving the HJI equation related to ZS game problems for discrete-time affine nonlinear systems with known dynamics. In [28], ADP was derived to find the optimal saddle point in feedback strategies of a nonlinear continuous-time ZS game with affine input and control constraints, using two-player policy iteration. In [29], ADP solves a discrete-time zero-sum game for linear systems with continuous states, using Q-learning for unknown systems. In [31], a near-optimal solution based on successive approximation of the HJI equation and on disturbance-input and control updates was given for a discrete-time affine nonlinear system subject to unknown internal dynamics and disturbances. In [32], an online adaptive real-time policy learning method based on zero-sum games for nonlinear discrete-time systems was proposed for learning the HJI equation. In [33], an online adaptive robust dynamic programming algorithm using a policy iteration scheme was considered for the ZS-TP-G of continuous-time unknown systems subject to uncertainties. In [34], a data-based adaptive critic method using output feedback for unknown model and system states was described under a disturbance-measurement assumption. In [35], a data-based policy iteration Q-learning algorithm for the ZS-TP-G was developed for linear systems, to eliminate knowledge of the process dynamics.
Recent game-theoretical contributions (some in nonzero-sum games) for nonlinear systems are reported in [36]- [38], in the more general framework of robust control [39], [40]. The first ADP-based solution approach for solving the HJI equation for unknown system dynamics is the Q-learning one [29] in which learning the HJI solution relies on data collected from the system to be controlled. In Q-learning, the learning process produces the two optimal controllers as state-feedback ones that must use the entire state information available. This is contradictory to some extent to the model-free label of the method itself, since knowing the natural system state requires a significant insight into the system, such as the system order and usually the nature of the underlying phenomenological process which is highly correlated with, e.g., the time-scale of the system. On the other hand, not using the entire state for learning optimal control poses great challenges to the learning process since the system is a partially observable one. This is the reason why some ADP approaches for solving the HJI ZS-TP-G were devised for handling observable systems and they rely only on input-output (IO) samples collected from the system. These IO samples are subsequently used to build a so-called virtual state that defines a virtual state-space model transformation of the original system. Unfortunately, the approach has been tackled for linear systems only, in a number of recent works [34], [41], [42] and not for general nonlinear systems, to the authors' best knowledge. This last remark serves as an incentive to one of this work's main contributions.
On the other hand, designing off-the-shelf ADP techniques is another challenge, since a great deal of effort is concerned with the suitable parameterization of the nonlinear cost function and of the controllers, respectively. Most often, automatic basis-function selection is a difficult task. Moreover, for observable systems whose state is built from past IO samples, the order of the equivalent virtual state quickly increases, which creates a two-fold problem: the adequate exploration of the input-state-output space and the time correlation of successive input and output samples [43]. These issues are related to the curse of dimensionality, well known in ADP and RL methods. This is why neural networks (NNs) have been the most flexible tool employed so far for parameterizing function approximators. Their main advantages are self-regulated scalability to the control problem dimension, approximation capacity boosted by complex architectures, and overfitting-avoidance mechanisms intrinsically embodied in NN training procedures. Therefore, NNs are considered a standard tool in ADP that can automate the selection of basis functions (or features) in the function approximation tasks that are mandatory with ADP approaches.
Based on the above ideas, the goal of this article is to integrate several concepts into a fully functional approach to designing robust control for observable, general, unknown nonlinear systems. The contributions of this work are as follows:
- extension of the Q-learning approach to solve the optimal robust control problem, as a ZS-TP-G solution to the HJI equation, for general unknown nonlinear observable systems. An equivalent virtual state-space model of the original system is built from IO samples and subjected to robust control learning. The approach does not assume that a solution to the HJI equation is feasible; it therefore searches for one via the computation of the intermediate upper-optimal and lower-optimal controllers. Theoretical analysis establishes convergence of the proposed Q-learning-based solution.
- an NN-based implementation that provides scalability to the control problem dimension and automatic feature selection, in spite of the high-dimensional virtual state vector.
- validation on a nonlinear industrial system of practical importance: an active suspension system. The active suspension system is a well-suited candidate for learning robust control, since it inherently deals with road-profile disturbance rejection when employed on a variety of transportation vehicles (cars, trains, etc.), and it is a naturally underdamped system from the two-mass-spring-damper class. On the other hand, the suspension system is of high order (it has six natural states when the active hydraulic actuator dynamics is considered), which makes it costly to measure all states; hence it is a good candidate for an observable system. Fortunately, it turns out to be fully observable, where the virtual state can be constructed from present and past values of only one output measurement (the deviation of the ''car body'' from the rest position) and from past values of the two inputs (the control input of the actuator and the disturbance input). Since measuring the road disturbance input is not a valid option in practice, a solution to this problem is offered, and it is shown to achieve better attenuation of the road profile's impact on the ''car body'' motion than a competing optimal controller that is not learned with disturbance rejection in mind. While there exists a consistent body of scientific literature dealing with optimal active suspension control, and in particular with reinforcement learning applied to suspension control [44]-[48], none of the above solutions deal with the nonlinear unknown-dynamics observable case, as done herein.
There are several advantages to learning a disturbance-rejecting optimal controller for an active suspension system:
- avoidance of system dynamics knowledge;
- the observability-based solution requires only IO samples to reconstruct the system state;
- artificial disturbances emulating road conditions are easily generated on fixed stands in a learning facility.
The article is structured as follows. Section II defines the ZS robust optimal control problem formulation and proposes a Q-learning-based solution employing upper-optimal and lower-optimal controllers, together with a theoretical learning convergence analysis. Section III describes the practical implementation of the proposed learning strategy, with neural networks used as function approximators. The case study in Section IV extensively validates the learning concept on a realistic quarter-car active suspension model and provides discussions and implementation details. Section V concludes the article.

A. UNKNOWN OBSERVABLE SYSTEM
Let the nonlinear unknown system S: x_{k+1} = f(x_k, ū_k), y_k = h(x_k), defined in discrete time, comprise a state-transition equation and an output equation, respectively (1). The system state is x_k ∈ D ⊂ R^n, and the measured (and controlled) output is denoted y_k = [y_{k,1}, . . . , y_{k,p}]^T ∈ Y ⊂ R^p. The functions f, h are assumed unknown on their definition domains and continuously differentiable. In addition to the unknown system dynamics, the following assumptions are made:
A1. System (1) is completely state observable.
A2. System (1) is IO controllable from u_k to y_k.
A3. System (1) is IO stable inside the domain defined by the input and output.
A1-A3 are common when defining control problems for systems with unknown dynamics. Since they are not verifiable due to the unknown system model, they are validated from working experience with the system or from technical datasheets. Assessment efforts for linear systems' controllability and observability have been proposed, e.g., in [49], [50].
Observation 1: The input vector lumping both the control inputs and the disturbance inputs is important for deriving the two-player formulation of the optimal robust control solution.
State-feedback controllers cannot be derived directly, because the state in (1) is not measurable. Observability theory allows deriving an alternative state-space model for (1) in terms of a virtual state. The support for this claim is given as follows.
Lemma 1: If the pair (f, h) in (1) is observable, then there exist a map and a positive integer r such that (2) holds. Proof: See [51].
A virtual state vector z_k is next introduced. Then, Theorem 1 in [51] shows that, based on (1) and (2), a new virtual state-space system (3) with its output equation is defined, which is completely observable (the components of z_k are sequences built from current and past successive IO samples) and controllable (since it has the same input and output as (1)). The ideas from [51] are summarized as follows:
a) System (3) is IO controllable, since it has the same input and output as (1);
b) With the state dimension n in (1) unknown, r from Lemma 1 corresponds to an observability index and is also unknown. Increasing r (and, subsequently, the dimension of the virtual state z_k) is the general advice; as [51] shows, beyond some value of r, no further information about the state x_k is gained from z_k;
c) Controlling (1) and (3) is the same problem, except that (3) uses ''measurable'' state information. Hence, learning a controller for (3) renders a controller for (1);
d) Model (1) accommodates a wide range of processes, including time-delay ones. By properly introducing additional state variables and via variable substitutions, time delays in the control input and in the states result in another virtual state-space model (3) that is fully state observable and controllable (Comment 7 in [51]);
e) When learning state-feedback controllers of the form u_k = C(z_k) (for some function C : Z → U), note that closing the control loop with this controller yields a recurrent controller, since z_k includes past samples of the input ū_k. This type of recurrent controller is known as a nonlinear output error (NOE) model.
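To make the virtual-state construction concrete, the following Python sketch assembles z_k from the last r output samples and the last r lumped input samples. The stacking convention is a hypothetical illustration; the exact composition of z_k follows Theorem 1 in [51], and r plays the role of the (unknown) observability index.

```python
from collections import deque

def make_virtual_state_builder(r):
    """Return a function that assembles the virtual state z_k from the
    last r output samples and the last r lumped input samples."""
    y_hist = deque(maxlen=r)
    u_hist = deque(maxlen=r)

    def update(y_k, u_bar_k):
        y_hist.append(y_k)
        u_hist.append(u_bar_k)
        if len(y_hist) < r:
            return None  # not enough history collected yet
        # one simple convention: [y_k, ..., y_{k-r+1}, u_k, ..., u_{k-r+1}]
        return list(y_hist)[::-1] + list(u_hist)[::-1]

    return update

builder = make_virtual_state_builder(r=3)
z = None
for k in range(5):
    z = builder(float(k), 0.1 * k)   # feed IO samples as they arrive
```

When the closed loop is formed with u_k = C(z_k), the dependence of z_k on past inputs is what makes the resulting controller recurrent (NOE-type), as noted in item e) above.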
In order to develop the control solution in terms of disturbance rejection, the input ū_k is split, to show distinctly the control input and the disturbance input (Observation 1), as in (4). Next, the optimal robust control problem of (4) is formulated as a ZS-TP-G.

B. ZS CONTROL PROBLEM DEFINITION AND SOLUTION
The goal is to minimize a certain cost serving as a performance index, with the optimal control problem defined by the optimization in (5), whose solution stabilizes the closed-loop control system. Minimization of the cost in (5) is interpreted as a degree of attenuation achieved by the control system in the face of any disturbance d_k ∈ Ω_d, when only the optimal controller C*(z_k) is used in closed loop.
Definition 1 [12]: In the existence domains of the controllers C, D, the optimal controllers C*, D* are a saddle-point solution of the ZS-TP-G (5) if (6) holds for all C, D. For general nonlinear systems with intractable analytical solutions for (5), and moreover for nonlinear systems with unknown dynamics, the existence of a saddle-point equilibrium is not guaranteed, as pointed out in [12]. In this sense, following [2], upper-optimal and lower-optimal costs are introduced as in (7). These costs are such that, when the saddle-point solution J*(z_k) exists, the upper-optimal and lower-optimal costs both equal it, and, when such a solution is not feasible, the lower-optimal cost is bounded above by the upper-optimal cost J̄*(z_k). Moreover, the upper-optimal and lower-optimal costs satisfy the Hamilton-Jacobi-Isaacs optimality equations, which suggests using iterative ADP solutions to overcome the difficulty of computing them for general nonlinear systems.
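For reference, the omitted display equations can be written in a standard form consistent with [2], [12] (the paper's exact notation may differ): the saddle-point condition of Definition 1 and the upper-optimal and lower-optimal costs of (7) read

```latex
% saddle-point condition (Definition 1)
J(z_k, C^*, D) \le J(z_k, C^*, D^*) \le J(z_k, C, D^*), \quad \forall C, D,
% upper-optimal and lower-optimal costs (7)
\overline{J}^*(z_k) = \min_{C} \max_{D} J(z_k, C, D), \qquad
\underline{J}^*(z_k) = \max_{D} \min_{C} J(z_k, C, D).
```

The inequality \(\underline{J}^*(z_k) \le \overline{J}^*(z_k)\) always holds, with equality exactly when the saddle point exists.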
Notice that when the saddle-point solution of (5) does not exist, the optimal controllers from (7) differ between the upper and lower problems. To compute the optimal controllers from (7) for both the upper and lower costs, upper and lower extended costs called Q-functions are defined as in (8), with the well-known meaning: they are the cost of taking any action (u_k, d_k) in state z_k and afterwards acting only on the controller actions computed by C and D in all subsequent states. They are directly connected to the original upper and lower costs J, as shown in (8). The advantage of such Q-functions is that the optimal controllers are computable by directly minimizing the upper-optimal and lower-optimal Q-functions Q̄*(z_k, u_k, d_k), Q*(z_k, u_k, d_k) w.r.t. u_k and d_k, once these are found.
Value Iteration (VI)-like algorithms are next proposed to calculate the upper-optimal and lower-optimal Q-functions; both share a similar structure. For the upper-optimal Q-function calculation, the VI Algorithm 1 is as follows.

Algorithm 1
Starting from initial (not necessarily admissible) controllers C̄_0, D̄_0 and an initial upper Q-function estimate Q̄_0, for all possible combinations of the tuple (z_k, u_k, d_k), alternate the following steps at each iteration j (starting with j = 1):
S1. Update the Q-function as in (9).
S2. Improve the controllers as in (10).
S3. If the stopping criterion (no more changes from Q̄_{j−1} to Q̄_j) is not met, go to S1; otherwise stop the algorithm.
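On a small discretized toy game, the upper Q-function VI update of Algorithm 1 can be sketched as follows. The dynamics, stage cost, grids, and discount factor below are illustrative assumptions (not the paper's suspension model); the key point is that the max(.) over the disturbance is taken before the min(.) over the control.

```python
import numpy as np

# Illustrative discretization of a scalar toy game
Z = np.linspace(-1, 1, 21)        # virtual-state grid
U = np.linspace(-1, 1, 5)         # control-action grid
D = np.linspace(-0.5, 0.5, 5)     # disturbance-action grid
gamma = 0.9                       # discount factor (assumed)

def step(z, u, d):                # assumed toy dynamics
    return np.clip(0.8 * z + 0.5 * u + 0.3 * d, -1.0, 1.0)

def cost(z, u, d):                # stage cost with disturbance penalty
    return z * z + 0.1 * u * u - 0.5 * d * d

Qbar = np.zeros((len(Z), len(U), len(D)))
for sweep in range(200):          # VI sweeps (Algorithm 1 style)
    Qnew = np.empty_like(Qbar)
    for i, z in enumerate(Z):
        for a, u in enumerate(U):
            for b, d in enumerate(D):
                zn = step(z, u, d)
                k = int(np.abs(Z - zn).argmin())   # nearest grid state
                # upper value: max over d' is taken BEFORE min over u'
                Qnew[i, a, b] = cost(z, u, d) + gamma * Qbar[k].max(axis=1).min()
    if np.max(np.abs(Qnew - Qbar)) < 1e-9:
        break
    Qbar = Qnew

# greedy upper-optimal control at z = 0.6: min over u of (max over d)
i0 = int(np.abs(Z - 0.6).argmin())
u_star = U[int(Qbar[i0].max(axis=1).argmin())]
```

Swapping `max(axis=1).min()` for a min over u followed by a max over d would instead produce the lower Q-function of Algorithm 2.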
The sense of the update operator ''⇐'' in (9) is understood as the update of an infinitely dense table Q̄_j(z_k, u_k, d_k), for all (infinitely many) possible combinations (z_k, u_k, d_k).
A compacted update form of Algorithm 1 is possible, by repeating the Q-function update as in (11) and computing the upper-optimal controllers C̄*, D̄* by directly minimizing Q̄. It is important to notice the order of the min-max operations in the VI updates for the upper Q-function: max(.) is performed before min(.). This is the important difference w.r.t. the VI Algorithm 2 update for finding the lower-optimal Q-function, described as follows:

Algorithm 2
Starting from initial (not necessarily admissible) controllers C_0, D_0 and an initial lower Q-function estimate Q_0, for all possible combinations of the tuple (z_k, u_k, d_k), alternate the following steps at each iteration j (starting with j = 1):
S1. Update the Q-function as in (12).
S2. Improve the controllers as in (13).
S3. If the stopping criterion (no more changes from Q_{j−1} to Q_j) is not met, go to S1; otherwise stop the algorithm.
A compacted update form of Algorithm 2 is again possible, by repeating the Q-function update as in (14) and computing the lower-optimal controllers C*, D* by directly minimizing Q. Convergence of the VI update for the upper Q-function to the upper-optimal controllers and to the upper-optimal original cost is analysed next.
Theorem 1: The updates (9)-(10) (in compacted form as in (11)), starting from C̄_0, D̄_0 and from an initial upper Q-function estimate Q̄_0(z_k, C̄_0(z_k), D̄_0(z_k)) ≥ 0, generate sequences {Q̄_j}, {C̄_j}, {D̄_j} according to Algorithm 1 that converge to the upper-optimal Q-function Q̄*(z_k, u_k, d_k), to the upper-optimal original cost J̄*(z_k) and to the upper-optimal controllers C̄*(z_k), D̄*(z_k). Proof: From the compact update (11), notice that the right-hand side can be expressed, based on (8), as in (15). Notice that the right-hand side of (15) is in fact the VI update performed in the space of the upper original cost, (16), which holds for all z_{k+1}. Altogether, update (11), based on (15) and (16), defines a uniquely associated paired sequence of iterates in the Q-function and original-cost spaces. It was shown in Lemma 1 from [12] that, with a proper positive-definite initialization J̄_0(z_k) ≥ 0, the VI update performed in the space of the original cost preserves J̄_j(z_k) ≥ 0 for all iterations j. Theorem 2 in [12] shows that lim_{j→∞} J̄_j(z_k) = J̄*(z_k), for all z_k.
Since the update (11) in the upper Q-function's space embeds the update (16) in the space of the original upper cost, and the latter converges to J̄*(z_k), it follows by definition (8) that {Q̄_j} converges to Q̄*(z_k, u_k, d_k). It also implies that the controller sequences {C̄_j(z_k)}, {D̄_j(z_k)} converge to their upper-optimal values C̄*(z_k), D̄*(z_k).
By similar reasoning, convergence of the VI update for the lower Q-function to the lower-optimal controllers and to the lower-optimal original cost is captured by the next Theorem 2.
Theorem 2: The updates (12)-(13) (in compacted form as in (14)), starting from C_0, D_0 and from an initial lower Q-function estimate Q_0(z_k, C_0(z_k), D_0(z_k)) ≥ 0, generate sequences {Q_j}, {C_j}, {D_j} according to Algorithm 2 that converge to the lower-optimal Q-function Q*(z_k, u_k, d_k), to the lower-optimal original cost J*(z_k) and to the lower-optimal controllers C*(z_k), D*(z_k).
Proof: The proof follows a reasoning similar to that of Theorem 1, relying instead on the convergence of the lower original cost sequence updates, shown in Lemma 2 and Corollary 2 of [12]. It is therefore not detailed here.
Observation 2: The proposed algorithms for computing the upper-optimal and lower-optimal Q-functions of the ZS-TP-G do not use knowledge of the system dynamics. Practical implementations of the proposed algorithms are detailed in the following Section.
Observation 3: Following Theorem 5 and Corollary 7 from [12], the convergence of the upper-optimal and lower-optimal Q-function updates has important practical implications. Convergence of the upper-optimal and lower-optimal Q-functions to the same value is a necessary and sufficient condition for the existence of the saddle-point solution to the ZS-TP-G: the two then coincide with the saddle-point solution Q* in the space of Q-functions. This is a consequence of the upper and lower original costs coinciding with J* in the space of the original costs. On the other hand, convergence of the upper and lower Q-functions to different values, Q* ≠ Q̄*, means that a saddle-point solution to the ZS-TP-G is infeasible.

III. PRACTICAL ALGORITHMS IMPLEMENTATION
A. ZS-TP-G NN IMPLEMENTATION
Algorithms 1 and 2 described in the previous Section are practically implemented using function approximators, in order to deal with large continuous state and action spaces affected by high dimensionality. Neural networks (NNs) are the most common structures employed for this purpose, owing to their high approximation capability and well-established tuning rules.
Let the NN function approximators for the Q-function and for the controllers C, D be denoted Q̂(z_k, u_k, d_k, π_Q), Ĉ(z_k, π_C) and D̂(z_k, π_D), respectively, where π_i, i ∈ {Q, C, D}, represents the tuneable NN weights of each individual approximator. Most VI-like algorithms, such as the batch-fitted Q-learning variant implemented in this work, operate batch-wise and rely on a dataset of transition samples collected from the process by interaction. These samples form a collection (set) of tuples M = {(z_k, u_k, d_k, z_{k+1})} which allows the calculation of the penalty function. Especially in the unknown-dynamics VI case, these tuples must efficiently explore the state-action space and cover as uniformly as possible the entire space Ω_z × Ω_u × Ω_d, i.e., try all possible actions (u_k, d_k) in every state z_k. The advantage of the VI algorithms is that they are off-policy in nature: they learn the optimal controllers from transition samples collected under any other controllers, which can be chosen for efficient state-action space exploration.
In terms of updating the approximated Q-function iteratively, based on the transition-sample dataset M, step S1 of Algorithms 1 and 2 ((9) and (12), respectively) is captured by the optimization problem (17), where Q̂, Ĉ, D̂ can be any of Q̄, C̄, D̄ (Algorithm 1) or Q, C, D (Algorithm 2). The iteration number j has been moved from the subscript of Q̄, C̄, D̄ (Q, C, D) to the superscript of their corresponding parameterizations. Equation (17) improves the Q-function estimate by bootstrapping on its most recent estimate: (17) is the mean sum of squared errors (MSE) training cost of the neural network Q̂(z_k, u_k, d_k, π), whose targets are the bootstrapped one-step returns. This makes the Q-function estimate improvement directly amenable to standard NN training procedures (e.g., gradient-based backpropagation). The squared-error term under the sum in (17) is the well-known one-step temporal difference.
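A minimal batch-fitted sketch of step S1, under illustrative assumptions: a toy scalar system, and a linear-in-parameters Q-function with quadratic features standing in for the Q-function NN. Each iteration regresses the parameters onto the bootstrapped one-step targets, mirroring the arg-min MSE problem of (17) (least squares replaces gradient-based NN training here).

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(z, u, d):
    # quadratic feature vector (a stand-in for an NN's hidden layer)
    return np.array([z*z, u*u, d*d, z*u, z*d, u*d, z, u, d, 1.0])

def step(z, u, d):          # assumed toy dynamics (not the suspension model)
    return 0.8 * z + 0.5 * u + 0.3 * d

def cost(z, u, d):          # assumed stage cost with disturbance penalty
    return z*z + 0.1*u*u - 0.5*d*d

# exploratory dataset M of transition tuples (z_k, u_k, d_k, z_{k+1})
M = [(z, u, d, step(z, u, d))
     for z, u, d in rng.uniform(-1, 1, size=(500, 3))]

gamma, w = 0.9, np.zeros(10)
grid = np.linspace(-1, 1, 5)           # action grids for the inner min-max
for j in range(20):                    # fitted-Q iterations (step S1)
    X, targets = [], []
    for (z, u, d, zn) in M:
        # bootstrapped target: c + gamma * min_u' max_d' Q_j(z', u', d')
        q = np.array([[phi(zn, uu, dd) @ w for dd in grid] for uu in grid])
        targets.append(cost(z, u, d) + gamma * q.max(axis=1).min())
        X.append(phi(z, u, d))
    # least-squares regression = the arg-min MSE fit of (17)
    w = np.linalg.lstsq(np.array(X), np.array(targets), rcond=None)[0]
```

With an actual NN, the `lstsq` call would be replaced by several epochs of backpropagation on the same input-target pairs.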
For the controller improvement steps in Algorithms 1 and 2 (equations (10) and (13), respectively), the controller parameters π_C^{j+1}, π_D^{j+1} are obtained from the cascaded NN Q̂(z_k, Ĉ(z_k, π_C^j), D̂(z_k, π_D^j), π_Q^{j+1}), again by gradient descent and ascent steps (per the min(.) and max(.) operations required by Algorithms 1 and 2). Since the succession of the min(.) and max(.) operations differs between the upper-optimal Q-function calculation (Algorithm 1) and the lower-optimal Q-function calculation (Algorithm 2), the details are given next for the former.
In Algorithm 1, the max(.) operation is performed first, aiming at maximizing Q̂(z_k, Ĉ(z_k, π_C^j), D̂(z_k, π_D), π_Q^{j+1}) w.r.t. π_D. This is equivalent to setting the targets of Q̂(z_k, Ĉ(z_k, π_C^j), D̂(z_k, π_D), π_Q^{j+1}) to zero and taking a specified number T_1 of gradient ascent steps (18), starting from the initial inner-loop iterate π_D^{[0]} = π_D^j, over a number B_1 of selected states z_k (either randomly picked from the dataset M or randomly generated in the domain Ω_z), and using a step size α_1. At each iteration of (18), the B_1 states z_k are first forward-propagated through Ĉ(z_k, π_C^j), D̂(z_k, π_D^{[i]}) and afterwards through the Q-function NN; the gradient of the Q-function is calculated with backpropagation and multiplied by the gradient of D̂(z_k, π_D) w.r.t. π_D, again calculated by backpropagation. After T_1 iterations of (18), π_D^{j+1} = π_D^{[T_1]} is rendered.
The min(.) operation in Algorithm 1 follows, to minimize Q̂(z_k, Ĉ(z_k, π_C), D̂(z_k, π_D^{j+1}), π_Q^{j+1}) w.r.t. π_C. Notice that the D̂(z_k, π_D^{j+1}) NN already employs the most recent updated parameters obtained after (18). Similarly, this is equivalent to setting zero targets for Q̂(z_k, Ĉ(z_k, π_C), D̂(z_k, π_D^{j+1}), π_Q^{j+1}) and minimizing w.r.t. π_C, accomplished by a specified number T_2 of gradient descent steps of the form (19), starting from the initial inner-loop iterate π_C^{[0]} = π_C^j, over a number B_2 of selected states z_k (either randomly picked from the dataset M or randomly generated in the domain Ω_z), and using a step size α_2. The same computation mechanism applies as in the case of the max(.) operation (18). After T_2 iterations of (19), π_C^{j+1} = π_C^{[T_2]} is rendered.
Algorithm 2, dedicated to the lower-optimal Q-function and lower-optimal controllers' calculations, differs in the order of the min(.) and max(.) operations: π_C^{j+1} is updated first, then used to update π_D^{j+1}.
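The alternating inner loops (ascent on the disturbance parameters, then descent on the controller parameters) can be illustrated on a known quadratic saddle function used as a toy stand-in for the Q-function NN. The function, step sizes, and iteration counts below are arbitrary illustrative choices.

```python
# Toy stand-in for the Q-function: Q(u, d) = u^2 - d^2 + u*d,
# convex in u, concave in d, with its saddle point at (u, d) = (0, 0).
def grad_u(u, d):
    return 2.0 * u + d           # dQ/du

def grad_d(u, d):
    return -2.0 * d + u          # dQ/dd

u, d = 0.8, -0.6                 # arbitrary initial "parameters"
alpha1, alpha2, T1, T2 = 0.1, 0.1, 200, 200
for j in range(50):              # outer iterations j
    for _ in range(T1):          # (18)-style gradient ASCENT on the disturbance
        d += alpha1 * grad_d(u, d)
    for _ in range(T2):          # (19)-style gradient DESCENT on the controller
        u -= alpha2 * grad_u(u, d)
```

Each inner loop is run to near-convergence with the other player held fixed, mimicking the T_1-step ascent followed by the T_2-step descent of Algorithm 1; the iterates approach the saddle point.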
The corresponding updates are given in (20) and (21). We next summarize the NN-based solutions to the ZS-TP-G aimed at computing the upper-optimal and lower-optimal Q-functions and the upper-optimal and lower-optimal controllers, respectively, in the batch-fitted Q-learning style. For the upper-optimal Q-function and upper-optimal controller, Algorithm 3 is described first.
For the lower-optimal Q-function and lower-optimal controller calculations, the following Algorithm 4 is described.
Observation 4: After Algorithms 3 and 4 converge, it is established whether the saddle-point solution to the ZS-TP-G exists (the upper-optimal and lower-optimal Q-functions converge to the same value) or, on the contrary, does not exist. In practice, it is more convenient to measure and analyse the upper and lower original cost values, evaluated with the current-iteration controllers on a test scenario, that is,

Algorithm 3 NN-Based Solution for the Upper-Optimal Q-Function and Upper-Optimal Controller for the ZS-TP-G
1. Take the dataset M of collected transition samples as input.
4. Initialize π_D^{[0]} = π_D^j and iterate T_1 times on (18) to find π_D^{j+1}. A set of B_1 states z_k is used.
5. Initialize π_C^{[0]} = π_C^j and iterate T_2 times on (19) to find π_C^{j+1}. A set of B_2 states z_k is used.
6. If the stopping criteria are not met in terms of the maximum number of iterations (j < j̄) and of significant changes in the Q-function NN parameters between iterations (‖π_Q^{j+1} − π_Q^j‖ > ε_π), update j = j + 1 and go to 3; otherwise stop.

Algorithm 4 NN-Based Solution for the Lower-Optimal Q-Function and Lower-Optimal Controller for the ZS-TP-G
1. Take the dataset M of collected transition samples as input.
4. Initialize π_C^{[0]} = π_C^j and iterate T_1 times on (20) to find π_C^{j+1}. A set of B_1 states z_k is used.
5. Initialize π_D^{[0]} = π_D^j and iterate T_2 times on (21) to find π_D^{j+1}. A set of B_2 states z_k is used.
6. If the stopping criteria are not met in terms of the maximum number of iterations (j < j̄) and of significant changes in the Q-function NN parameters between iterations (‖π_Q^{j+1} − π_Q^j‖ > ε_π), update j = j + 1 and go to 3; otherwise stop.

and J_j(z_k) = Q(z_k, C(z_k, π_C^j), D(z_k, π_D^j), π_Q^j), respectively. J̄_j(z_k) and J_j(z_k) will be measured in the next case study.
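Observation 4's saddle-point test can be sanity-checked on static matrix games, where the upper value (min over rows of the row-wise max) and lower value (max over columns of the column-wise min) either coincide or do not. The payoff matrices below are toy data, with rows indexing the minimizing player and columns the maximizing player.

```python
import numpy as np

def upper_lower(A):
    """Upper value: min over rows of the row-max. Lower value: max over
    columns of the column-min. Rows = minimizer, columns = maximizer."""
    return A.max(axis=1).min(), A.min(axis=0).max()

saddle = np.array([[4.0, 3.0], [2.0, 1.0]])       # toy game with a saddle
no_saddle = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies: none

print(upper_lower(saddle))     # equal values: saddle point exists
print(upper_lower(no_saddle))  # upper > lower: no (pure) saddle point
```

This mirrors comparing the converged J̄_j and J_j on a test scenario: equality signals that the ZS-TP-G saddle point exists, a gap signals that it does not.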
In the following, a state feedback optimal controller is introduced for comparing the performance of the upper and lower optimal controllers in terms of disturbance rejection capability.

B. STATE FEEDBACK OPTIMAL CONTROLLER NN IMPLEMENTATION
To assess the performance of the learned upper and lower optimal controllers, a state-feedback optimal controller (SFOC) is learned, set to solve the optimization problem (22). The cost in (22) preserves part of the penalty term of the original cost (5), without penalizing the disturbance term (W_D = 0). The controller that learns to solve (22) uses the natural system state rather than a virtual state, but it is not designed to aim for disturbance attenuation. Problem (22) is solvable by a variant of Algorithm 3 or 4 without a dedicated disturbance-controller NN approximator D(x_k, π_D^j), using only a controller NN C(x_k, π_C^j) that is improved at each step by minimizing the current-iteration Q-function NN Q(x_k, u_k, π_Q^j). The resulting Algorithm 5 for SFOC learning is:

Algorithm 5 SFOC Learning
1. Collect another dataset M_1 of transition samples from the system, to be used as input to the algorithm. The collection task takes place under d_k = 0.
2. Initialize j̄, B, T, α, ε_π, and initialize architectures and training settings for the NNs Q(x_k, u_k, π_Q^j) and C(x_k, π_C^j). Initialize the values j = 0, π_Q^0, π_C^0.
3. At a certain iteration step j, using the entire dataset M_1 of transition samples, obtain an improved NN estimate of the Q-function as the solution π_Q^{j+1} of the arg-min MSE problem analogous to (17), with the disturbance input removed.

4. Initialize π_C^{[0]} = π_C^j and iterate T times on a gradient-descent update analogous to (19) to find π_C^{j+1}. A number B ≤ |M_1| of states x_k from the dataset M_1 can be used.
5. If the stopping criteria are not met in terms of the maximum number of iterations (j < j̄) and of significant changes in the Q-function NN parameters between iterations (‖π_Q^{j+1} − π_Q^j‖ > ε_π), update j = j + 1 and go to 3; otherwise stop.

IV. VALIDATION CASE STUDY
A. ACTIVE SUSPENSION SYSTEM
The continuous-time state-space model of the active suspension system for a quarter-car is given in (25), with the model parameters taken from [52]. The displacements x̄_1 and x̄_3 of the sprung mass (car body) and unsprung mass (wheel), respectively, are defined in (25) w.r.t. their resting positions. A hydraulically actuated four-way valve-piston generates the force denoted x̄_5 in (25) as a consequence of applying a voltage to the actuator input; this is the control input u. The road profile's derivative w.r.t. time models the input disturbance d. To normalize the disturbance to d ∈ [−1, 1] (corresponding to ±3 cm/s maximum amplitude of the road profile derivative), a scaling constant of 0.03 multiplies the input d in the model (25). Similarly, the input u is brought to u ∈ [−1, 1] using a scaling constant of 0.001. In (25), sgn(.) denotes the sign function. From (25), the output equation extracts x̄_1 as the measured output. The active suspension is schematically depicted in Fig. 1.
Since the IO data from model (25) will be collected at the fixed sample period T_s = 0.01 sec, the model is regarded as an equivalent discrete-time one (with a zero-order hold on the inputs that keeps their values constant for one sample period) and used for IO measurement. Importantly, the active suspension model is not used in the learning process. Let the states of the discrete-time equivalent of model (25) be grouped as x_k = [x_{1,k}, . . . , x_{6,k}]^T (i.e., x_{i,k} corresponds to x̄_i). It then follows that (25) can be expressed as a discrete-time state-space model. For the active suspension, the artificial disturbance d ∈ [−1; 1] is easily generated on fixed indoor test stands and is therefore measurable for learning purposes.

B. ACTIVE SUSPENSION SYSTEM OBSERVABILITY DISCUSSION
The observability of (25) is discussed next, following the analysis of [53]. The system observability does not depend on the inputs (u_k, d_k), which are set to zero in (25). The analysis is carried out on the continuous-time system (25); if the continuous-time system is observable, the observability of its discrete-time counterpart is implied when a sufficiently small sampling period is employed for discretization. Kou et al. showed in [53] that, for a nonlinear autonomous continuous-time system with state equation ẋ(t) = f(x(t)), x ∈ R^n, and output equation y(t) = h(x(t)), the output and its time derivatives at the initial time can be stacked into a map z = H(x(t_0)), where y(t_0) = h(x(t_0)) = h_0(x(t_0)). The dynamic system is completely observable on t ∈ [t_0, t_1] if H(.) is injective (univalent, or one-to-one) from the initial state x(t_0) to z. The univalence of H(.) is a sufficient observability condition, since z contains only the output and its derivatives at the initial time t_0 [53] and not over the entire interval t ∈ [t_0, t_1]. If one can show that the map H(x) is (locally) invertible (i.e., a bijection), then its injectivity follows. Local invertibility of the map is ensured by a non-singular Jacobian determinant at a given point, which for a square map H(x) is equivalent to the Jacobian having maximum rank at that point.
For system (25), by repeated substitutions (y = x̄_1, ẏ = x̄_2, . . .) using the model equations (25), it is verified that the Jacobian of H(x) is of full rank six, irrespective of the point at which it is calculated. This means that (25) is observable in continuous time (and consequently in discrete time, for a small enough sampling period), which implies that a virtual state can be constructed from past input-output samples.
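The rank test above can be illustrated numerically. The sketch below uses a hypothetical two-state system with output y = sin(x̄_1) standing in for the unknown six-state suspension model; the map H and the evaluation point x0 are assumptions for illustration, and a central finite-difference Jacobian replaces the symbolic one.

```python
import numpy as np

# Toy two-state system standing in for the (unknown) six-state suspension
# model: x1' = x2, x2' = -sin(x1) - 0.5*x2, with output y = sin(x1).
def H(x):
    # H stacks the output and its first time derivative at the initial state:
    # H(x) = [y, y'] = [sin(x1), cos(x1)*x2]; for an n-state system, H stacks
    # y, y', ..., y^(n-1) obtained by repeated substitution of the dynamics.
    x1, x2 = x
    return np.array([np.sin(x1), np.cos(x1) * x2])

def jacobian(F, x, eps=1e-6):
    # central finite-difference Jacobian of F at x
    m = len(F(x))
    J = np.zeros((m, len(x)))
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = eps
        J[:, i] = (F(x + e) - F(x - e)) / (2 * eps)
    return J

x0 = np.array([0.3, -0.1])
rank = np.linalg.matrix_rank(jacobian(H, x0))
print(rank)  # full rank 2 -> H locally invertible -> locally observable at x0
```

For the suspension model, the paper's symbolic verification shows rank six at every point; a numerical check like this one only certifies local observability at the tested point.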
In practice, the model (25) is assumed unknown, and observability must be assumed if it is not verifiable from the literature or from working experience with the process. Since the number of true states and the observability index are unknown, the virtual state should be built from a sufficiently large number of past input-output samples. It was reported in [51] that, beyond a certain number of past IO samples (the presumed observability index), there is no gain in information about the state value.

C. COLLECTING TRANSITION SAMPLES FOR THE LEARNING PROCESS
The first goal is to collect a transition-samples dataset M = {(z_k, u_k, d_k, z_{k+1})}. Since the system (25) is observable, an observability index equal to six is used to build a virtual state from past samples of the inputs u_k, d_k and from present and past samples of the output y_k. The virtual state has the form z_k = [y_k, . . . , y_{k−6}, u_{k−1}, . . . , u_{k−6}, d_{k−1}, . . . , d_{k−6}]^T ∈ R^19, and the system (25) is transformed into a virtual state-space model of the form z_{k+1} = F(z_k, u_k, d_k) with output equation y_k = z_{1,k}. In addition, the system is IO stable due to existing friction and can therefore be excited in open loop.
Then, the transition samples are gathered using the following settings for u_k and d_k. The input u_k ∈ [−1; 1] is modelled as a sequence of piece-wise constant steps whose amplitudes follow a random uniform distribution. Each step lasts 0.5 sec and is perturbed with random noise drawn from another uniform distribution with amplitudes inside [−0.2; 0.2]; this noise is added to u_k every T_s seconds. The disturbance input d_k ∈ [−1; 1] is modelled similarly to u_k, but each constant portion lasts 0.6 sec, and it is additively perturbed by uniform random noise of the same [−0.2; 0.2] amplitude every T_s seconds. The additive noises on the two input channels u_k, d_k are uncorrelated. The dataset M of |M| = 20 000 transition samples is built from 200 sec of experiment time with the system (25) in open loop, excited by the above u_k, d_k. Since the inputs were already normalized to [−1; 1] by introducing the input normalizing coefficients in the model (25), the output is normalized to y_k ∈ [−1; 1], k = 1, . . . , |M|, by dividing each sample by the maximal absolute value max_k |y_k| from the recorded history. In general, all inputs and outputs should be normalized, leading to all components of the virtual state being normalized.
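The excitation signals described above can be sketched as follows. The clipping to [−1; 1] after adding noise is an assumption not stated in the text; step amplitudes, durations, and the noise range follow the settings above.

```python
import numpy as np

# Piece-wise constant random-step excitation with additive uniform noise,
# matching the collection settings described above (Ts = 0.01 sec).
def excitation(total_s, step_s, Ts=0.01, noise_amp=0.2, rng=None):
    rng = rng or np.random.default_rng()
    n = int(total_s / Ts)                   # total number of samples
    per_step = int(step_s / Ts)             # samples per constant portion
    steps = rng.uniform(-1, 1, n // per_step + 1)
    base = np.repeat(steps, per_step)[:n]   # piece-wise constant levels
    sig = base + rng.uniform(-noise_amp, noise_amp, n)  # noise every Ts
    return np.clip(sig, -1, 1)              # assumed: keep inputs normalized

rng = np.random.default_rng(2)
u = excitation(200.0, 0.5, rng=rng)   # control channel: 0.5 sec steps
d = excitation(200.0, 0.6, rng=rng)   # disturbance channel: 0.6 sec steps
print(len(u))  # 20000 samples -> |M| = 20 000 transitions
```

Drawing both channels from independent uniform distributions keeps their additive noises uncorrelated, as required.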
Normalization of the virtual state's components is extremely important since NN approximators are going to be used. The normalization coefficients of all states are memorized and used to de-normalize the states when running the learned controller in the loop.
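The memorize-and-denormalize scheme can be sketched as a small helper; the class name and the stand-in history are illustrative assumptions.

```python
import numpy as np

# Per-channel normalization by the maximal absolute value of the recorded
# history; the coefficients are stored so that quantities can be
# de-normalized when running the learned controller in the loop.
class Normalizer:
    def fit(self, samples):                  # samples: (N, dims) history
        self.scale = np.abs(samples).max(axis=0)
        self.scale[self.scale == 0] = 1.0    # guard against constant channels
        return self

    def normalize(self, x):
        return x / self.scale

    def denormalize(self, x):
        return x * self.scale

rng = np.random.default_rng(3)
hist = rng.normal(0.0, 2.0, (1000, 19))      # stand-in virtual-state history
norm = Normalizer().fit(hist)
z = norm.normalize(hist)                     # all components now in [-1, 1]
```

The stored `scale` vector plays the role of the memorized normalization coefficients.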

D. CONTROLLER LEARNING SETTINGS AND RESULTS
Before the learning process, the penalty in the original cost (5) is constructed as 20y_k^2 + u_k^2 − 4d_k^2 (for W_C = 1, W_D = 4). It is computed for each transition sample (z_k, u_k, d_k, z_{k+1}).
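Evaluating this penalty over the stored transitions can be sketched as follows; the dataset arrays are random stand-ins, and y_k is taken as the first virtual-state component, consistent with y_k = z_{1,k}.

```python
import numpy as np

# Stage penalty r_k = 20*y_k^2 + W_C*u_k^2 - W_D*d_k^2 with W_C = 1, W_D = 4,
# evaluated for every stored transition (z_k, u_k, d_k, z_{k+1}).
def stage_cost(z, u, d, Wc=1.0, Wd=4.0):
    y = z[..., 0]                       # y_k = z_{1,k}
    return 20.0 * y ** 2 + Wc * u ** 2 - Wd * d ** 2

rng = np.random.default_rng(4)
Z = rng.uniform(-1, 1, (20000, 19))     # stand-in virtual states
U = rng.uniform(-1, 1, 20000)
D = rng.uniform(-1, 1, 20000)
r = stage_cost(Z, U, D)
print(r.shape)  # (20000,)
```

The negative weight on d_k^2 is what turns the disturbance into the maximizing player of the zero-sum game.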
To find the upper-optimal and lower-optimal controllers and Q-functions according to Algorithms 3 and 4, respectively, the approximators are selected as follows. The same architectures are used for each algorithm. The Q-function is a 21−50−1 feedforward NN with tanh(.) activation in the hidden layer and linear output activation. The input of this approximator Q̂(z_k, u_k, d_k, π_Q) is formed from the 19 components of the virtual state and the two control inputs u_k, d_k, while π_Q formally captures the NN weights. These weights are initialized with uniform random numbers inside [−0.005; 0.005]. A random 80% of the training samples are used for effective training, and the rest serve as validation data, forcing early stopping in order to prevent overfitting. The training algorithm is a fast scaled conjugate gradient, run for a maximum of 500 epochs. The training uses the MSE criterion over the entire batch of transition samples. The Q-function NN training (in both the optimal upper and optimal lower Q-function search processes) in fact solves (17).
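The 80/20 split with validation-based early stopping can be sketched as below. This is a stand-in only: the regression targets are synthetic, and plain batch gradient descent replaces the scaled conjugate gradient; the 21−50−1 tanh architecture and the [−0.005; 0.005] weight initialization follow the text.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic regression stand-in for the Q-function fit: a 21-50-1 tanh NN
# trained on a random 80%/20% split with validation-based early stopping.
Xall = rng.uniform(-1, 1, (2000, 21))
yall = 0.5 * Xall[:, :1] + 0.1            # synthetic targets, illustration only
split = int(0.8 * len(Xall))
Xtr, ytr = Xall[:split], yall[:split]
Xva, yva = Xall[split:], yall[split:]

W1 = rng.uniform(-0.005, 0.005, (21, 50)); b1 = np.zeros(50)
W2 = rng.uniform(-0.005, 0.005, (50, 1)); b2 = np.zeros(1)

def forward(X):
    Hid = np.tanh(X @ W1 + b1)
    return Hid, Hid @ W2 + b2

init_val = float(((forward(Xva)[1] - yva) ** 2).mean())
best, wait, patience, lr = np.inf, 0, 20, 0.1
for epoch in range(500):                   # at most 500 training epochs
    Hid, out = forward(Xtr)
    err = out - ytr                        # d(MSE)/d(out), up to a factor of 2
    gW2 = Hid.T @ err / len(Xtr); gb2 = err.mean(axis=0)
    dHid = (err @ W2.T) * (1 - Hid ** 2)   # backprop through tanh
    gW1 = Xtr.T @ dHid / len(Xtr); gb1 = dHid.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    val = float(((forward(Xva)[1] - yva) ** 2).mean())
    if val < best - 1e-10:
        best, wait = val, 0
    else:
        wait += 1
        if wait >= patience:               # early stopping on validation MSE
            break
```

In the paper, the targets come from the Bellman fit (17) rather than a fixed regression label.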
For the controllers Ĉ(z_k, π_C) and D̂(z_k, π_D), the NN approximators are also feedforward NNs, of the form 19−10−1, with tanh activation in the hidden layer and linear output activation. Their weights, captured by π_C and π_D, are initialized as for the Q-function NN, but the training of the two NNs must comply with the gradient ascent/descent steps imposed by the upper-optimal controllers' search (equations (18) and (19), performed inside Algorithm 3) and by the lower-optimal controllers' search (equations (20) and (21), performed inside Algorithm 4).
The other parameters are selected as follows. For Algorithm 3, j̄ = 500, T_1 = T_2 = 50, α_1 = α_2 = 10^−3, ε_π = 10^−5, while the gradient ascent/descent steps from (18) and (19) are performed on B_1 = B_2 = 256 values z_k randomly picked from the dataset M at each ascent or descent step. For Algorithm 4, the same parameter settings are used.
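One round of the inner minibatch improvement loop can be sketched as follows. This is only an illustration of the sampling and update pattern: linear policies and hand-written, purely illustrative Q-gradients stand in for the NNs and for equations (18)-(19).

```python
import numpy as np

rng = np.random.default_rng(6)

# At each of the T1 = T2 = 50 inner steps, B1 = B2 = 256 virtual states are
# drawn from the dataset; the controller takes a gradient-descent step on the
# Q-function and the disturbance policy a gradient-ascent step.
M = rng.uniform(-1, 1, (20000, 19))     # stored virtual states z_k
wC = np.zeros(19)                       # stand-in control policy u = z @ wC
wD = np.zeros(19)                       # stand-in disturbance policy d = z @ wD
alpha, B, T = 1e-3, 256, 50

def dQ_du(z, u, d):  # assumed dQ/du, convex in u (illustrative only)
    return 2 * u + z[:, 0]

def dQ_dd(z, u, d):  # assumed dQ/dd, concave in d (illustrative only)
    return -8 * d + z[:, 0]

for _ in range(T):
    batch = M[rng.integers(0, len(M), B)]
    u, d = batch @ wC, batch @ wD
    # chain rule through the linear policies: dQ/dw = (dQ/du) * z, etc.
    wC -= alpha * (batch * dQ_du(batch, u, d)[:, None]).mean(axis=0)
    wD += alpha * (batch * dQ_dd(batch, u, d)[:, None]).mean(axis=0)
```

The opposite update signs realize the minimax structure of the game: the controller descends on Q while the disturbance policy ascends.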
The results obtained after the learning process are shown in Fig. 2, in terms of the normed difference between successive Q-function weight vectors and in terms of the measured attenuation cost [27], [31], defined and measured on a test scenario lasting 100 seconds, where a disturbance d_k ∈ [−1; 1] (modelled as successive piece-wise constant steps of uniform random amplitude, each lasting 0.5 sec) is used. This sequence d_k was not presented to the system in the transition-samples collection phase used for learning the upper-optimal and lower-optimal controllers. Observation 5: Importantly, at every iteration, J_test is measured with the upper and lower controllers (C̄(z_k, π_C̄^j) and C(z_k, π_C^j), respectively) in closed loop, without using the disturbance controllers D̄(z_k, π_D̄^j) and D(z_k, π_D^j), whose outputs are replaced by the test input signal d_k. These disturbance controllers are necessary only throughout the learning process of Algorithms 3 and 4.
Inspecting the bottom subplot of Fig. 2, the original costs J̄^j(z_k) and J^j(z_k) converge to the same value, meaning that the saddle-point solution to the game exists. The upper subplot of Fig. 2 also indicates that, after many iterations, no further changes occur in the upper and lower Q-function estimates, a sign of the learning process's convergence.

E. COMPARISONS AND DISCUSSIONS OF THE RESULTS
For learning the SFOC via Algorithm 5, for comparison purposes, the optimization problem (28) is solved, computed on the transition samples collected under the same u_k settings used in the previous subsection, letting d_k = 0. As a consequence of the null disturbance, a fifth-order state model version of (25) results. Two feedforward NNs, of sizes 6−30−1 and 5−5−1, are employed for the Q-function estimate and for the controller estimate, respectively. T = 50 gradient descent steps (24) are repeated within each major iteration of Algorithm 5. Each state component x_{i,k} of x_k is normalized to x_{i,k} ∈ [−1; 1], i = 1, . . . , 5, k = 1, . . . , |M_1|, by dividing each sample by its greatest absolute value max_k |x_{i,k}| over the recorded history.
The rest of the parameters in Algorithm 5 are j̄ = 500, α = 0.005, B = 128, ε_π = 10^−5. After the maximum number of 500 iterations has elapsed, the optimal controller and the optimal Q-function estimates result.
For comparison, on the same test scenario, the control results obtained with the upper-optimal controller and with the SFOC are shown in Fig. 3.
Fig. 3 is interpreted as follows. The black line in all subplots corresponds to open loop (u_k = 0), where the profile of x_{1,k} (the ''car body'') is the same as the profile of x_{3,k} (''the wheel'', as the unsprung mass). This means the wheel follows the road profile, obtained as the discrete-time integral of d_k from Fig. 3; therefore, x_{1,k} − x_{3,k} = 0 for the black line in Fig. 3A. On the other hand, the blue line x_{1,k} in Fig. 3B shows that the car body is insensitive to the road disturbance: the active suspension control manages to absorb the road profile via the unsprung mass x_{3,k}, using the control input shown as the blue line in Fig. 3D. The SFOC control (red line x_{1,k} in Fig. 3B) does not reject the disturbance as well as the upper-optimal controller, since it was not learned with the disturbance-rejection goal in mind. Measuring the attenuation obtained with both the upper-optimal controller and the SFOC yields J_test^OptUppContr = 6.6 · 10^−3 and J_test^SFOC = 34.5 · 10^−3, clearly indicating the superior attenuation attained by the former. This is despite the SFOC using the full state information directly, which may be considered an advantage.
On the other hand, the virtual state used for learning the upper-optimal and lower-optimal controllers incorporates the measured disturbance, which is the road profile derivative. This may not be acceptable in practice, since the road profile disturbance is difficult to measure. The learned upper-optimal controller is therefore tested next by setting z_k = [y_k, . . . , y_{k−6}, u_{k−1} , . . . , u_{k−6}, d_{k−1} = 0, . . . , d_{k−6} = 0]^T ∈ R^19, i.e., with null disturbance in the virtual state. The actual disturbance d_k still affects the controlled system, and the closed loop is tested under the same scenario as before, under both the upper-optimal controller and the SFOC. The results are shown in Fig. 4.
The conclusion from Fig. 4 is clear. Even when a null disturbance is fed to the virtual state, the disturbance rejection is better with the upper-optimal controller than with the SFOC, in terms of x_{1,k} in Fig. 4B being closer to zero than the same x_{1,k} obtained with the SFOC. This means that effective disturbance attenuation is still obtained without measuring the road profile derivative.
The transmissibility from the disturbance input d_k to the output y_k = x_{1,k} is also measured in the frequency domain, assuming an approximately linear model both for the open-loop suspension system and for the closed-loop suspension control system. The frequency response function estimate is identified in three cases: a) the open-loop setting (u_k = 0); b) the loop closed with the upper-optimal controller, with the virtual state z_k fed by the disturbance input d_k; and c) the loop closed with the upper-optimal controller, with z_k fed by d_k = 0. The results captured in Fig. 5 were obtained after exciting either the open-loop system or the closed-loop control system with a zero-mean sine-stream signal d_k of amplitude 0.5, at 100 logarithmically-spaced frequencies in the range 0.01-1000 Hz. The magnitude of the ratio between the Fast Fourier Transform (FFT) of the output y_k = x_{1,k} and the FFT of the input d_k is then calculated at each particular frequency. Even though the active suspension provides natural attenuation in open loop, it is observed that in case b) (corresponding to a measured disturbance in the virtual state), the low-frequency attenuation is significantly stronger than that obtained with the SFOC (Fig. 5, left). Better low-frequency attenuation with the upper-optimal controller is still measured in case c) (Fig. 5, right), when the disturbance is not measured and enters the virtual state as a null value.
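The FFT-ratio transmissibility measurement can be sketched as follows. A first-order discrete low-pass filter stands in for the suspension dynamics, the test frequencies are chosen to fall on exact FFT bins to avoid leakage, and all names are illustrative assumptions.

```python
import numpy as np

# FFT-based transmissibility estimate at sine-excitation frequencies:
# excite with a zero-mean sine of amplitude 0.5 at each frequency, then take
# the magnitude ratio |FFT(y)| / |FFT(d)| at that frequency's bin.
def transmissibility(system, freqs, Ts=0.01, amp=0.5, n=4000):
    t = np.arange(n) * Ts
    mags = []
    for f in freqs:
        d = amp * np.sin(2 * np.pi * f * t)   # zero-mean sine excitation
        y = system(d)
        k = int(round(f * n * Ts))            # FFT bin index of frequency f
        Y, D = np.fft.rfft(y), np.fft.rfft(d)
        mags.append(abs(Y[k]) / abs(D[k]))    # |FFT(y)| / |FFT(d)| at f
    return np.array(mags)

def lowpass(d, a=0.9):
    # stand-in "plant": y_{k+1} = a*y_k + (1-a)*d_k
    y = np.zeros_like(d)
    for k in range(len(d) - 1):
        y[k + 1] = a * y[k] + (1 - a) * d[k]
    return y

freqs = np.array([0.1, 1.0, 5.0, 10.0])       # Hz, below Nyquist (50 Hz)
mag = transmissibility(lowpass, freqs)
print(mag.round(3))  # magnitudes decrease with frequency for the low-pass
```

In the paper's setup, `system` would be the open-loop suspension or one of the closed-loop configurations b) and c), measured at 100 logarithmically-spaced frequencies.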
The learned upper-optimal and lower-optimal controllers for the observable active suspension system were shown to be feasible. For the active suspension system, the proposed ZS-TP-G robust control learning approach is highly attractive since learning takes place on a fixed test rig where artificial disturbances emulating the road conditions are easily generated. Afterwards, the disturbance controller is discarded and the control loop is closed by either the upper-optimal or the lower-optimal controller; the active suspension can then be used in real-world road conditions. The learned attenuation was shown to be effective even when the virtual state is constructed with a null measured disturbance, which expands the applicability range of the approach. All the features above may stimulate industrial implementation, owing to the reduced number of sensors and to the on-site learning ability.

V. CONCLUSION
The approach presented in this article offers several features, enumerated next. It learns an optimal robust controller using ADP, formulated as a ZS-TP-G, for systems with unknown dynamics. The learned controller is the saddle-point of the ZS-TP-G when such a solution is feasible; otherwise, it can be either the upper-optimal or the lower-optimal controller that solves the game. The learning process, consisting of the operations specific to the upper-optimal and lower-optimal controllers' calculations, was shown to converge by theoretical analysis.
NN approximators were used for the practical learning implementation. This is advantageous for general nonlinear systems since it enables automatic feature selection in the Q-function and controller parameterizations. The proposed framework deals with observable systems perceived through IO data, therefore solving the partial observability problem that can prevent successful learning. It relies on the virtual state built from present and past values of the input and output samples. Learning a robust control for the virtual state-space system is shown to be equivalent to learning a robust control for the underlying system. Although the virtual state construction leads to a higher-order virtual state-space system, NNs ensure the scalability of the learning problem in all aspects, except for the efficient-exploration problem, which remains one of the major issues in ADP and reinforcement learning.
The approach presented here is believed to handle many practical systems (such as Markov jump systems and nonlinear multiagent systems [54]-[57]); it is therefore a further goal to validate it on observable systems of even higher order which, similarly to the active suspension, are of significant practical interest.