Optimization for Interval Type-2 Polynomial Fuzzy Systems: A Deep Reinforcement Learning Approach

It is known that interval type-2 (IT2) fuzzy controllers are superior to their type-1 counterparts in terms of robustness, flexibility, etc. However, how to conduct the type reduction optimally while taking system stability into account under the fuzzy-model-based (FMB) control framework is still an open problem. To address this issue, we present a new approach combining membership-function-dependent (MFD) analysis and deep reinforcement learning. In the proposed approach, the reduction of the IT2 membership functions of the fuzzy controller is completed while the control performance is being optimized. Another fundamental issue is that the stability conditions must hold subject to different type-reduction methods. It is tedious and impractical to re-solve the stability conditions for each type-reduction method, of which there are infinitely many possibilities. Since it is more practical to guarantee that the stability conditions hold during type reduction than to re-solve them, the MFD approach is adopted together with the imperfect premise matching concept. Thanks to the unique merit of the MFD approach, the stability conditions are guaranteed to be valid for all the embedded type-1 membership functions within the footprint of uncertainty. During the control process, the state transitions associated with a properly engineered cost/reward function can be used to approximate the deterministic policy gradient, which optimizes the acting policy and thereby improves the control performance by determining the grades of the IT2 membership functions of the fuzzy controller. A detailed simulation example is provided to verify the merits of the proposed approach.

Bo Xiao, Member, IEEE, Hak-Keung Lam, Fellow, IEEE, Chengbin Xuan, Ziwei Wang, Member, IEEE, and Eric M. Yeatman, Fellow, IEEE
Impact Statement—The connection between the membership functions of type-2 fuzzy systems and reinforcement learning is observed and investigated for the first time. In this paper, the authors present reinforcement-learning-based type reduction for interval type-2 fuzzy-model-based control systems. The theoretical guarantee of the stability conditions holds during the optimization process conducted by the reinforcement learning agent. The proposed research work bridges the areas of fuzzy control and reinforcement learning. Adopting reinforcement learning techniques

I. INTRODUCTION
Nonlinear systems are difficult to analyze and control in general due to their inherent complex dynamics. To analyze nonlinear control problems systematically, the Takagi-Sugeno (T-S) fuzzy model was proposed for its rigorous mathematical framework [1], [2]. Over the past three decades, promising research outcomes on T-S fuzzy-model-based (FMB) control systems have been reported in the literature. For example, fuzzy tracking control problems were investigated in [3]-[8]. The design of sampled-data fuzzy control systems was addressed by T-S FMB techniques in [9]-[14]. It is also worth mentioning the recent work using fuzzy logic to control a robotic manipulator in an adaptive way [15]. Regarded as the polynomial extension of the classical T-S fuzzy model, polynomial fuzzy models were presented to reinforce the ability to express nonlinearity. Thanks to the ability of polynomial terms to represent nonlinearity, the polynomial fuzzy model is considered to have more potential to deal with nonlinearity than the T-S fuzzy model [16]-[19].
As one of the most common implementations, type-1 fuzzy sets can handle nonlinearity effectively in control systems, but they lack the ability to deal with uncertainty directly [20], [21]. The potential to deal with uncertainty is of great importance since uncertainty is inevitable in many situations, such as different understandings of linguistic variables among different people, imperfect knowledge of the system dynamics, limited measurement accuracy, and noise in the system or observations. To handle uncertainty directly through fuzzy sets, type-2 fuzzy sets were proposed, in which uncertainty is represented by the concept of the footprint of uncertainty (FOU) [20]. Type-2 fuzzy sets are indeed effective in handling uncertainty directly, but the general form of type-2 fuzzy sets is more complicated, which results in high computational demand due to the additional fuzzification on the grade of the fuzzy set. To alleviate the computational cost without losing the merit of handling uncertainty directly, interval type-2 (IT2) fuzzy logic was introduced [20]. For IT2 FMB control systems, the seminal works were proposed by Lam [21], [22], in which the construction method of the IT2 fuzzy model was presented and the imperfect premise matching (IPM) concept was coined for stability analysis and control design. It is worth mentioning that in IT2 FMB control systems, the membership functions in the IT2 fuzzy model capture uncertainties and, thus, are uncertain in value. As a result, the membership functions of the controller are different from those of the IT2 fuzzy model. The IPM concept and the membership-function-dependent (MFD) analysis techniques proposed by H. K. Lam [23] play a crucial role in making the stability analysis and control design possible.
Driven by the most straightforward motivation, most available fuzzy controllers are designed based on the parallel distributed compensation (PDC) concept. To implement the PDC concept in the design of FMB control systems, the membership functions of the fuzzy model and controller are required to be perfectly matched to facilitate the analysis. Research works relaxing the stability conditions for PDC-based fuzzy control systems were reported in [2], [24]-[28], and the results were generalized by adopting Pólya's theorem [29]. However, most of the works mentioned did not consider the specific information of the membership functions in the stability analysis, which requires the stability conditions to be valid for membership functions of any shape; thus, the stability conditions are conservative. In addition, since the IT2 membership functions of the model and controller contain uncertainty, it is not possible to make the membership functions of the model and controller exactly the same during the control process, which implies that the assumption of the PDC approach is invalid. Without the symmetry information utilized in the PDC approach, the corresponding stability conditions can be very conservative, and only a linear controller is expected to be obtained. To address this imperfect match problem, the MFD approach is an alternative for stability analysis. Through the MFD approach, the specific information of the membership functions can be included in the stability conditions. Related work was reported to further relax the stability conditions and endow the fuzzy controller with more flexibility in control synthesis [13], [18], [21]-[23], [30]-[42]. In addition, interesting applications of type-3 fuzzy systems can be found in [43] and [44].
Research on type reduction for IT2 fuzzy systems has been reported in the literature, e.g., the KM algorithm in [45] and the BMM algorithm in [46]. However, these methods focus on calculating the centroids of the IT2 fuzzy systems using the lower and upper bounds of the outputs instead of optimizing the control performance.
On the other hand, reinforcement learning (RL) has recently achieved great success in computer science and machine learning thanks to its ability to optimize the agent's behavior through interacting with the environment. For example, works adopting RL techniques in playing Atari games and Go achieved human-level or beyond-human-level performance [47]-[49]. In this article, the RL approach is applied to conduct the type reduction and improve the performance of the IT2 polynomial fuzzy controller during the control process. To train a proper RL agent to fulfill the type-reduction objective, the deterministic policy gradient can be utilized to find the optimal policy, from which the corresponding embedded type-1 membership functions can be extracted [50], [51]. The main motivation for adopting an RL-based approach is to adaptively optimize the control performance during the process. In this way, the RL agent gradually learns to pick the embedded type-1 membership functions according to the different system states, which has long-term (delayed) effects on the overall control performance. However, prototype RL accepts only discrete states and discrete actions to develop the state-action value table. This requirement restricts the application of RL to continuous-state cases since it is impractical to calculate the exact value function with infinite dimensions numerically. An alternative is to use a function approximator to approximate the state-action values. In this approach, the optimal policy and/or the value functions are represented by an approximation function that allows continuous states as its input. One of the most popular choices of approximation function is the deep neural network (DNN), thanks to its capability to approximate continuous functions globally at the desired accuracy [51]-[53]. Utilizing the global approximation capacity of neural networks (NNs), successful adaptive and optimal control designs were reported in [54]-[57]. When the policy
and state-action value function are approximated by DNNs, RL becomes approximate reinforcement learning (ARL). In ARL, the states of the observed environment and the corresponding actions determined by the policy are used to estimate the state-action value function. During the control process, the acting policy is improved through the state transitions associated with the reward function; thus, it can be used to guide the agent to pick the optimal action according to the performance index.
Although the advantages of utilizing the information of membership functions have been reported in [13], [18], [21]-[23], [30], [31], [33], [34], [36], the issue of developing a proper type reducer for the IT2 fuzzy controller during the control process remains untouched. Inspired by the successes of the FMB control strategy and RL, in this article, we present a type-reduction approach adopting RL techniques, in which the type-1 membership functions are extracted from the FOU while the control performance is improved. In the proposed control strategy, the control gains are first obtained by solving the sum-of-squares (SOS) based stability conditions, in which a perfect match between the fuzzy model and controller is not required, i.e., IPM. Since the specific information of the IT2 membership functions is used to facilitate the analysis and the control design, the stability conditions are guaranteed to be valid for all type-1 membership functions embedded in the FOU. Therefore, the type-1 membership functions extracted within the FOU satisfy the stability conditions automatically, without re-solving the stability conditions. Starting from the obtained control gains, RL techniques are applied to train the RL agent, which is used as the type reducer afterwards. The RL agent determines the grades of the type-1 membership functions embedded in the FOU according to the reward function and its current state. At the same time, the control performance is improved to maximize the return (or reduce the cost) during the control process.
As far as the authors know, no work has been reported in the literature adopting an RL agent to complete the type-reduction process of IT2 polynomial-fuzzy-model-based (PFMB) control systems while the control performance is optimized at the same time. The main novelties and contributions of the proposed RL-based type-reduction approach are summarized as follows.
1) The connection between the IT2 membership functions and the optimization through RL is recognized for the first time.
2) To optimize the control performance with consideration of delayed effects, the RL agent is adopted in the IT2 PFMB control system to extract the type-reduced membership functions from the FOU of the IT2 membership functions.
3) The IPM concept and MFD techniques are utilized to make the stability analysis of the IT2 polynomial fuzzy systems possible.
4) The action set of the RL agent is properly designed, and the predefined control performance according to the cost/reward function is improved based on the deterministic policy gradient.
5) The stability conditions of the control system are always guaranteed to hold for different sets of reduced type-1 membership functions.
6) The merits of the proposed method are verified through comparisons with different type-reduction methods and the transient responses of the control processes.
The rest of this article is organized as follows. In Section II, the preliminaries of the IT2 polynomial fuzzy model and controller are presented. In addition, the SOS-based membership-function-independent (MFI) stability conditions and the MFD stability analysis with consideration of the specific information of the membership functions are introduced. In Section III, type reduction through RL is presented and discussed in detail. The simulation examples are provided in Section IV. Finally, Section V concludes this article.
Notation: The bold vector variable x(t) ∈ R^n is used to represent the n-dimensional state vector of the control system. The operations of matrix inverse and transpose are represented by the superscripts "−1" and "T", respectively. The positive or negative definiteness of a matrix S is denoted as S > 0 or S < 0. If q(x(t)) can be written as a linear combination of monomials with finite real coefficients, then q(x(t)) is regarded as a polynomial in x(t). Furthermore, if there exist polynomials q_1(x(t)), q_2(x(t)), ..., q_m(x(t)) such that q(x(t)) can be decomposed as q(x(t)) = Σ_{i=1}^{m} q_i(x(t))^2, then q(x(t)) is an SOS, which implies q(x(t)) ≥ 0. I_{m×m} and 0_{m×m} are used to represent the m-by-m identity and zero matrices, respectively.
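As a minimal numerical sketch of the SOS property defined above (the polynomial q and its decomposition are hypothetical examples, not taken from the paper), an SOS decomposition immediately certifies nonnegativity:

```python
import numpy as np

# Hypothetical example: q(x) = x^4 + 2x^2 + 1 decomposes as the single
# square (x^2 + 1)^2, so q is SOS and hence q(x) >= 0 for all x.
def q(x):
    return x**4 + 2 * x**2 + 1

def q_sos(x):
    # SOS decomposition: sum_i q_i(x)^2 with q_1(x) = x^2 + 1.
    return (x**2 + 1) ** 2

xs = np.linspace(-5.0, 5.0, 101)
assert np.allclose(q(xs), q_sos(xs))   # decomposition matches q
assert np.all(q(xs) >= 0.0)            # SOS implies nonnegativity
```

In the paper itself, finding such decompositions for the matrix-valued stability conditions is delegated to SOSTOOLS rather than done by hand.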

II. PRELIMINARIES AND STABILITY CONDITIONS
In this section, the essential preliminaries of the IT2 polynomial fuzzy model, the IT2 polynomial fuzzy controller, the MFI stability analysis of the control system, and the MFD stability conditions including the information of the membership functions are presented.

A. IT2 Polynomial Fuzzy Model
Inspired by the IT2 fuzzy modeling concept from [18], [21], [22], p sets of local polynomial dynamic models associated with the corresponding IT2 fuzzy rules are introduced to describe the plant nonlinearity and handle the uncertainty in the plant directly:

Rule i: IF f_1(x(t)) is M^i_1 AND ... AND f_Ψ(x(t)) is M^i_Ψ, THEN ẋ(t) = A_i(x(t)) x̂(x(t)) + B_i(x(t)) u(t)  (1)

where M^i_1, M^i_2, ..., M^i_Ψ are the IT2 fuzzy terms under the ith model rule, i = 1, 2, ..., p; A_i(x(t)) ∈ R^{n×N} and B_i(x(t)) ∈ R^{n×m} are the local polynomial subsystem and input matrices under the ith fuzzy model rule, which are given by the modeling process of the plant; x(t) ∈ R^n denotes the state vector of the plant; x̂(x(t)) ∈ R^N stands for the polynomial vector of monomials in x(t), and it is assumed that x̂(x(t)) = 0 iff x(t) = 0 [16]; u(t) ∈ R^m stands for the control input generated by the IT2 polynomial fuzzy controller. Since IT2 membership functions are considered in the fuzzy modeling, the lower and upper firing strengths of the ith model rule are introduced as w^L_i(x(t)) and w^U_i(x(t)). Directly from the definition, the relationship 0 ≤ w^L_i(x(t)) ≤ w^U_i(x(t)) ≤ 1 holds, and w^L_i(x(t)) and w^U_i(x(t)) can be calculated through the lower and upper grades of membership as follows:

w^L_i(x(t)) = Π_{α=1}^{Ψ} μ̲_{M^i_α}(f_α(x(t))), w^U_i(x(t)) = Π_{α=1}^{Ψ} μ̄_{M^i_α}(f_α(x(t)))

in which μ̄_{M^i_α}(f_α(x(t))) and μ̲_{M^i_α}(f_α(x(t))) stand for the upper and lower grades of membership, respectively, α = 1, 2, ..., Ψ. From the definition, the values of μ̄_{M^i_α}(f_α(x(t))) and μ̲_{M^i_α}(f_α(x(t))) are within the interval [0, 1], and the overall grade of membership of the ith rule is obtained as

w̃_i(x(t)) = ϑ̲_i(x(t)) w^L_i(x(t)) + ϑ̄_i(x(t)) w^U_i(x(t))

where ϑ̲_i(x(t)) ∈ [0, 1] and ϑ̄_i(x(t)) ∈ [0, 1], with ϑ̲_i(x(t)) + ϑ̄_i(x(t)) = 1, are nonlinear functions serving the type-reduction purpose.
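The firing-strength interval and its type reduction can be sketched numerically as follows (the Gaussian-type lower/upper grades and the weight value are illustrative assumptions, not the paper's membership functions):

```python
import numpy as np

# Sketch (hypothetical grades): lower/upper membership grades of the IT2
# fuzzy terms, and the firing-strength interval of a rule obtained as the
# product over the premise variables alpha = 1, ..., Psi.
def mu_lower(f):
    return 0.8 * np.exp(-f**2)          # assumed lower grade, in [0, 0.8]

def mu_upper(f):
    return np.exp(-0.5 * f**2)          # assumed upper grade, in (0, 1]

def firing_interval(premises):
    wL = np.prod([mu_lower(f) for f in premises])
    wU = np.prod([mu_upper(f) for f in premises])
    return wL, wU

wL, wU = firing_interval([0.3, -0.7])
assert 0.0 <= wL <= wU <= 1.0            # interval property of IT2 sets

# Type reduction: a convex combination picks one embedded grade in the FOU.
theta = 0.4                              # example type-reduction weight
w_tilde = (1 - theta) * wL + theta * wU
assert wL <= w_tilde <= wU
```

The convex combination on the last lines mirrors the role of the ϑ functions: any weight in [0, 1] yields a grade lying inside the FOU.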
Considering all the local polynomial dynamic models together through the IT2 membership functions, the dynamics of the plant subject to uncertainty can be described by the IT2 polynomial fuzzy model:

ẋ(t) = Σ_{i=1}^{p} w̃_i(x(t)) (A_i(x(t)) x̂(x(t)) + B_i(x(t)) u(t))  (2)

where Σ_{i=1}^{p} w̃_i(x(t)) = 1, w̃_i(x(t)) ≥ 0, ∀ i.
Remark 1: In the fuzzy modeling, the uncertainty in the plant is directly captured through the FOU characterized by the lower and upper boundaries of the IT2 membership functions. The uncertainty in the plant makes the grades of membership uncertain in value, and thus, the IT2 polynomial fuzzy controller cannot employ exactly the same membership functions as the IT2 polynomial fuzzy model in practice. To solve this problem in a strict way, IPM is adopted in the analysis of the IT2 PFMB control system.

B. IT2 Polynomial Fuzzy Controller
To control the nonlinear plant subject to uncertainty, which is represented by the IT2 polynomial fuzzy model (2), the IT2 polynomial fuzzy controller constructed by c sets of local polynomial controllers through the corresponding fuzzy rules is proposed as follows [18], [21]:

Rule j: IF g_1(x(t)) is Ñ^j_1 AND ... AND g_Ω(x(t)) is Ñ^j_Ω, THEN u(t) = G_j(x(t)) x̂(x(t))  (3)

where Ñ^j_1, Ñ^j_2, ..., Ñ^j_Ω are the IT2 fuzzy terms under the jth controller rule; G_j(x(t)) ∈ R^{m×N} is the polynomial local feedback gain matrix to be determined. The lower and upper firing strengths of controller rule j are denoted as m^L_j(x(t)) and m^U_j(x(t)), and they can be obtained from the lower and upper grades of membership:

m^L_j(x(t)) = Π_{β=1}^{Ω} μ̲_{Ñ^j_β}(g_β(x(t))), m^U_j(x(t)) = Π_{β=1}^{Ω} μ̄_{Ñ^j_β}(g_β(x(t)))  (4)

where μ̄_{Ñ^j_β}(g_β(x(t))) ≥ μ̲_{Ñ^j_β}(g_β(x(t))) ≥ 0 denote the upper and lower grades of membership. From the basic properties of IT2 membership functions, we directly obtain the property 0 ≤ m^L_j(x(t)) ≤ m^U_j(x(t)) ≤ 1. Inspired by [21], the normalized type-reduced membership m̃_j(x(t)) can be defined as

m̃_j(x(t)) = (β̲_j(x(t)) m^L_j(x(t)) + β̄_j(x(t)) m^U_j(x(t))) / Σ_{k=1}^{c} (β̲_k(x(t)) m^L_k(x(t)) + β̄_k(x(t)) m^U_k(x(t)))

where β̲_j(x(t)) and β̄_j(x(t)) are the controller type-reduction functions to be determined by the RL agent during the control process to improve the performance. Considering all the local polynomial controllers together through the IT2 membership functions, the control input is described as

u(t) = Σ_{j=1}^{c} m̃_j(x(t)) G_j(x(t)) x̂(x(t)).  (5)
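The normalization of the type-reduced controller memberships can be sketched as below (the firing strengths and type-reduction weights are arbitrary example values; `beta` plays the role of the type-reduction functions chosen within [0, 1]):

```python
import numpy as np

# Sketch: normalized type-reduced controller memberships for c = 3 rules.
mL = np.array([0.2, 0.5, 0.1])    # lower firing strengths (example values)
mU = np.array([0.6, 0.9, 0.4])    # upper firing strengths (example values)
beta = np.array([0.3, 0.8, 0.5])  # type-reduction weights in [0, 1]

raw = (1 - beta) * mL + beta * mU   # embedded grades inside the FOU
m_tilde = raw / raw.sum()           # normalization across the c rules

assert np.all(mL <= raw) and np.all(raw <= mU)  # grades stay in the FOU
assert np.isclose(m_tilde.sum(), 1.0)           # memberships sum to one
```

Whatever weights the RL agent picks, the normalization step guarantees the convexity property Σ_j m̃_j(x(t)) = 1 required by the closed-loop analysis.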

C. MFI Stability Analysis
The control objective in this article is to properly design the fuzzy control gain matrices G_j(x(t)) such that the states of the nonlinear plant are stabilized in the sense of asymptotic stability, i.e., x(t) → 0 as t → ∞.
Connecting the IT2 polynomial fuzzy model (2) and the IT2 polynomial fuzzy controller (5), the closed-loop dynamics of the control system can be obtained as follows:

ẋ(t) = Σ_{i=1}^{p} Σ_{j=1}^{c} w̃_i(x(t)) m̃_j(x(t)) (A_i(x(t)) + B_i(x(t)) G_j(x(t))) x̂(x(t)).  (6)

To develop the polynomial dynamics of the closed-loop IT2 PFMB control system, the relationship between the time derivative of x̂(x(t)) and ẋ(t) is analyzed here. The linking matrix L(x(t)) ∈ R^{N×n} is introduced to calculate the derivative of x̂(x(t)) from ẋ(t):

d x̂(x(t))/dt = L(x(t)) ẋ(t)  (7)

where the (α, β)th element of L(x(t)) is defined as

L_{αβ}(x(t)) = ∂x̂_α(x(t)) / ∂x_β.  (8)

Adopting the linking matrix (7) into (6), the dynamics of x̂(x(t)) can be developed as

d x̂(x(t))/dt = Σ_{i=1}^{p} Σ_{j=1}^{c} w̃_i(x(t)) m̃_j(x(t)) (Ã_i(x(t)) + B̃_i(x(t)) G_j(x(t))) x̂(x(t))  (9)

in which Ã_i(x(t)) and B̃_i(x(t)) are defined as Ã_i(x(t)) = L(x(t)) A_i(x(t)) and B̃_i(x(t)) = L(x(t)) B_i(x(t)). From the polynomial dynamics (9), the stability analysis of the IT2 PFMB closed-loop control system can be conducted within the framework of Lyapunov stability theory. To develop the stability conditions, X(x̃(t))^{-1} is first introduced, and X(x̃(t)) is required to satisfy X(x̃(t)) = X(x̃(t))^T > 0. Then, the following Lyapunov function candidate V(t) is introduced:

V(t) = x̂(x(t))^T X(x̃(t))^{-1} x̂(x(t)).  (10)

Remark 2: To avoid nonconvex terms generated by the derivative terms in V̇(t), we define K = {k_1, k_2, ..., k_q} as the set of row numbers for which the elements of the entire row of B_i(x(t)) are zero for all i, and X is chosen to depend only on the states x̃(t) = (x_{k_1}, x_{k_2}, ..., x_{k_q}).

From Lyapunov stability theory, if V(t) ≥ 0 (equality holds only when x(t) = 0) and V̇(t) < 0, the asymptotic stability of the whole IT2 PFMB closed-loop control system is guaranteed. To develop the asymptotic stability conditions, V̇(t) is calculated along the trajectories of (9).  (11)

To make condition (11) convex, z(t) = X(x̃(t))^{-1} x̂(x(t)) and N_j(x(t)) = G_j(x(t)) X(x̃(t)) are introduced to transform (11) into the following form:

V̇(t) = Σ_{i=1}^{p} Σ_{j=1}^{c} w̃_i(x(t)) m̃_j(x(t)) z(t)^T Q_ij(x(t)) z(t)  (12)

where

Q_ij(x(t)) = Ã_i(x(t)) X(x̃(t)) + X(x̃(t)) Ã_i(x(t))^T + B̃_i(x(t)) N_j(x(t)) + N_j(x(t))^T B̃_i(x(t))^T − Σ_{k∈K} (∂X(x̃(t))/∂x_k) A^k_i(x(t)) x̂(x(t))

∀ i and j, with A^k_i(x(t)) denoting the kth row of A_i(x(t)). The asymptotic stability of the IT2 PFMB control system is satisfied when −Q_ij(x(t)) > 0, ∀ i, j. The results of the MFI stability analysis are summarized in Theorem 1.
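A toy numerical illustration of the transformed condition (all matrices are assumed example values, with constant, i.e., degree-zero, polynomials so the derivative term of X vanishes): for a candidate pair (X, N), negative definiteness of Q certifies V̇ < 0, and the recovered gain G = N X^{-1} stabilizes the vertex system.

```python
import numpy as np

# Assumed example vertex (A, B) and candidate decision variables (X, N).
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
X = np.array([[1.0, -0.5], [-0.5, 1.0]])  # X = X^T > 0 (candidate)
N = np.array([[0.0, -1.0]])               # N = G X, so G = N X^{-1}

# Q = A X + X A^T + B N + (B N)^T; constant X, so no derivative term.
Q = A @ X + X @ A.T + B @ N + (B @ N).T
assert np.all(np.linalg.eigvalsh(X) > 0)    # X positive definite
assert np.all(np.linalg.eigvalsh(-Q) > 0)   # -Q positive definite

G = N @ np.linalg.inv(X)                    # recovered feedback gain
closed = A + B @ G
assert np.all(np.linalg.eigvals(closed).real < 0)  # stable closed loop
```

In the polynomial case this eigenvalue test is replaced by the SOS certificates of Theorem 1, but the convexifying substitutions z = X^{-1} x̂ and N_j = G_j X play exactly the role shown here.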
Theorem 1: The IT2 PFMB closed-loop control system, in which the nonlinearity and uncertainty in the physical plant are represented by the IT2 polynomial fuzzy model (2), is guaranteed to be asymptotically stable subject to the control input generated by the IT2 polynomial fuzzy controller (5), if polynomial matrices X(x̃(t)) = X(x̃(t))^T ∈ R^{N×N} and N_j(x(t)) ∈ R^{m×N}, j = 1, 2, ..., c, can be found by solving the following conditions with SOSTOOLS:

ς^T (X(x̃(t)) − ε_1(x(t)) I) ς is SOS;
−ς^T (Q_ij(x(t)) + ε_2(x(t)) I) ς is SOS, ∀ i, j;

where ς ∈ R^N is an arbitrary vector independent of x(t), and the predefined ε_1(x(t)) > 0 and ε_2(x(t)) > 0 are introduced for numerical reasons. The feedback matrices can be retrieved as G_j(x(t)) = N_j(x(t)) X(x̃(t))^{-1}.
Remark 3: It should be pointed out that the stability conditions in Theorem 1 are independent of the specific forms of the membership functions, which implies that the stability conditions obtained by solving the SOS-based conditions in Theorem 1 are required to be valid for any membership functions [18], [23], which is unnecessary in practical applications. Furthermore, the same control gain for different fuzzy rules is expected after solving those SOS-based stability conditions; then, the ability to fulfill nonlinear control through the fuzzy controller is unfortunately lost. Therefore, the MFI stability conditions presented in Theorem 1 are conservative.

D. MFD Stability Analysis
As pointed out in Remark 3, utilizing the information of the IT2 membership functions is essential to further relax the stability conditions. In this section, the information of the specific membership functions adopted in the control system is included in the stability analysis to help relax the stability conditions obtained in Theorem 1. Let us recall the original stability condition considering the fuzzy summation of both the IT2 polynomial fuzzy model and controller:

Σ_{i=1}^{p} Σ_{j=1}^{c} w̃_i(x(t)) m̃_j(x(t)) Q_ij(x(t)) < 0.  (13)

It is convenient to denote h_ij(x(t)) = w̃_i(x(t)) m̃_j(x(t)), and h_ij(x(t)) will be adopted in the analysis afterwards. To properly import the information of the membership functions into the SOS conditions without loss of numerical feasibility, an approximation of h_ij(x(t)) by constants is introduced.
In the MFD analysis, the whole state space (operation domain) Φ is divided into L connected subdomains such that Φ = ∪_{l=1}^{L} Φ_l. The maximum and minimum of the membership function h_ij(x(t)) in subdomain l are obtained and denoted as h̄_ijl and h̲_ijl, which are constants and can be included in the SOS conditions. It always holds that 0 ≤ h̲_ijl ≤ h_ij(x(t)) ≤ h̄_ijl ≤ 1 for x(t) ∈ Φ_l. Then, for subdomain l, (13) can be rewritten as

Σ_{i=1}^{p} Σ_{j=1}^{c} h_ij(x(t)) Q_ij(x(t)) < 0, x(t) ∈ Φ_l.  (14)

Besides the constant approximation of the membership functions, slack matrices Υ_ijl(x(t)) = Υ_ijl(x(t))^T ≥ 0, i = 1, 2, ..., p, j = 1, 2, ..., c, l = 1, 2, ..., L, are defined to satisfy Q_ij(x(t)) + Υ_ijl(x(t)) ≥ 0, ∀ i, j, l. Adopting the terms h̲_ijl, h̄_ijl, and Υ_ijl(x(t)), the stability condition with the information of the membership functions in subdomain l can therefore be relaxed as

Σ_{i=1}^{p} Σ_{j=1}^{c} (h̲_ijl Q_ij(x(t)) + δ_ijl (Q_ij(x(t)) + Υ_ijl(x(t)))) < 0  (15)
where δ_ijl = h̄_ijl − h̲_ijl, ∀ i, j, l, is a constant determined by the shape of the IT2 membership functions.
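The subdomain constants h̲_ijl and h̄_ijl can be obtained in practice by gridding each subdomain, as in this sketch (the product membership h(x) below is a hypothetical example standing in for w̃_i(x) m̃_j(x)):

```python
import numpy as np

# Sketch: constants h_lo (underline) and h_hi (bar) for one subdomain,
# obtained by gridding. h(x) is a hypothetical product of a model grade
# and a controller grade.
def h(x):
    w = np.exp(-x**2)               # assumed model membership
    m = 1.0 / (1.0 + np.exp(-x))    # assumed controller membership
    return w * m

# Subdomain Phi_l = [0.0, 1.0], gridded finely.
xs = np.linspace(0.0, 1.0, 1001)
vals = h(xs)
h_lo, h_hi = vals.min(), vals.max()

assert 0.0 <= h_lo <= h_hi <= 1.0               # required ordering
assert np.all((h_lo <= vals) & (vals <= h_hi))  # bounds cover the subdomain
delta = h_hi - h_lo                              # delta_ijl in the conditions
assert delta >= 0.0
```

A finer grid (or interval analysis) tightens the bounds; looser bounds remain valid but make condition (15) more conservative.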
Besides, the state information of the subdomains is also considered to further relax the stability condition. Since (x(t) − x̲_l)^T D (x̄_l − x(t)) ≥ 0 holds for all x(t) ∈ Φ_l, the nonnegative term (x(t) − x̲_l)^T D (x̄_l − x(t)) M_l(x(t)) can be added to the left-hand side of (15), where x̲_l ∈ R^n and x̄_l ∈ R^n can be calculated as the lower and upper boundaries of Φ_l through the division of the state space, and D is a diagonal matrix whose elements can only be either 0 or 1.
When the element d_k of D equals 1, k = 1, 2, ..., n, the domain information of the kth dimension of x(t) is used in the stability conditions; otherwise, it is not. M_l(x(t)), l = 1, 2, ..., L, is defined as an SOS matrix.
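The domain term can be checked numerically as in this sketch (boundaries and the D selection are assumed example values): it is nonnegative exactly on the subdomain for the dimensions selected by D, which is what makes the S-procedure-style relaxation valid.

```python
import numpy as np

# Sketch: (x - x_lo)^T D (x_hi - x) is nonnegative on the subdomain
# Phi_l for the dimensions selected by the 0/1 diagonal matrix D.
x_lo = np.array([-1.0, -2.0])     # assumed lower boundary of Phi_l
x_hi = np.array([1.0, 2.0])       # assumed upper boundary of Phi_l
D = np.diag([1.0, 0.0])           # use only dimension 1's domain info

def domain_term(x):
    return (x - x_lo) @ D @ (x_hi - x)

inside = np.array([0.5, 1.0])
outside = np.array([1.5, 0.0])    # dimension 1 leaves [-1, 1]
assert domain_term(inside) >= 0.0
assert domain_term(outside) < 0.0
```

Because the term is nonnegative only on Φ_l, adding it (scaled by the SOS matrix M_l) requires negativity of the fuzzy summation only where the subdomain bounds actually apply, relaxing the condition elsewhere.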
Remark 4: It is also worth pointing out that in stability condition (15), h̲_ijl, h̄_ijl, and δ_ijl represent the shape information of the membership functions of the IT2 PFMB control system. h̲_ijl, h̄_ijl, and δ_ijl are constants instead of general nonlinear forms, which can be processed efficiently through SOS convex optimization.
Theorem 2: The IT2 PFMB closed-loop control system, in which the nonlinearity and uncertainty in the physical plant are represented by the IT2 polynomial fuzzy model (2), is guaranteed to be asymptotically stable subject to the control input generated by the IT2 polynomial fuzzy controller (5), if polynomial matrices X(x̃(t)) = X(x̃(t))^T ∈ R^{N×N}, N_j(x(t)) ∈ R^{m×N}, Υ_ijl(x(t)) = Υ_ijl(x(t))^T, and M_l(x(t)) ∈ R^{N×N}, i = 1, 2, ..., p, j = 1, 2, ..., c, l = 1, 2, ..., L, can be found by solving the following conditions with SOSTOOLS:

ς^T (X(x̃(t)) − ε_1(x(t)) I) ς is SOS;
ς^T (Υ_ijl(x(t)) − ε_2(x(t)) I) ς is SOS, ∀ i, j, l;
ς^T (Q_ij(x(t)) + Υ_ijl(x(t)) − ε_3(x(t)) I) ς is SOS, ∀ i, j, l;
−ς^T (Σ_{i=1}^{p} Σ_{j=1}^{c} (h̲_ijl Q_ij(x(t)) + δ_ijl (Q_ij(x(t)) + Υ_ijl(x(t)))) + (x(t) − x̲_l)^T D (x̄_l − x(t)) M_l(x(t)) + ε_4(x(t)) I) ς is SOS, ∀ l;

where Ã_i(x(t)) = L(x(t)) A_i(x(t)) and B̃_i(x(t)) = L(x(t)) B_i(x(t)); x̲_l ∈ R^n and x̄_l ∈ R^n are calculated as the lower and upper boundaries of Φ_l through the division of the state space; D is a predefined diagonal matrix whose elements can only be either 0 or 1. The predefined ε_1(x(t)) > 0, ε_2(x(t)) > 0, ε_3(x(t)) > 0, and ε_4(x(t)) > 0 are introduced for numerical reasons. h̲_ijl, h̄_ijl, and δ_ijl are constants representing the shape information of the IT2 membership functions to be incorporated into the stability analysis. The polynomial feedback matrices are retrieved as G_j(x(t)) = N_j(x(t)) X(x̃(t))^{-1}.
Remark 5: From the SOS stability conditions in Theorem 2, it can be found that as long as the membership functions h_ij(x(t)) are within the range [h̲_ijl, h̄_ijl], stability is guaranteed by the same set of SOS stability conditions. Furthermore, as pointed out in [58], type-2 fuzzy sets can be regarded as functions on spaces. Therefore, [h̲_ij, h̄_ij] can also be considered as a constrained function space. Through optimization over the FOU of the IT2 polynomial fuzzy controller, the optimal embedded type-1 membership functions can be obtained. In addition, since the stability conditions are guaranteed for all the grades within the FOU of the IT2 membership functions, the optimization can be conducted safely inside the FOU without the risk of stability loss.

III. TYPE REDUCTION THROUGH RL
To formulate the type-reduction problem in the framework of RL and adopt the optimal policy as the type reducer, the environment (in the sense of RL) is represented by the closed-loop control system. The RL agent intervenes in the environment by picking the grades of the membership functions for the fuzzy controller. Through the interactions with the environment, the RL agent is trained, and the acting policy is improved through the deterministic policy gradient in a data-driven style. The acting policy μ determined by the RL agent indicates how it chooses the grades of the IT2 membership functions according to different states x(t). The full trajectories of the grades of the IT2 membership functions for the premise variables within the operation domain constitute the embedded type-1 membership functions.

A. Cost Function
Considering the delayed effects of the current choice on the overall control performance, to adaptively extract the embedded type-1 membership functions from the FOU of the IT2 membership functions of the polynomial fuzzy controller, RL can be applied with the following cost function under policy μ:

J_μ(x(t)) = ∫_t^∞ γ^{τ−t} r(x(τ), u(τ)) dτ

in which γ ∈ (0, 1] is the discount factor and r(x(t), u(t)) := x(t)^T S x(t) + u(t)^T R u(t), with S^T = S ≥ 0 and R^T = R ≥ 0.
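The quadratic stage cost can be sketched directly (the weight matrices S and R below are assumed example values):

```python
import numpy as np

# Sketch of the stage cost r(x, u) = x^T S x + u^T R u with assumed
# positive semidefinite weights S and R.
S = np.diag([1.0, 0.5])
R = np.array([[0.1]])

def stage_cost(x, u):
    return float(x @ S @ x + u @ R @ u)

x = np.array([2.0, -1.0])
u = np.array([3.0])
assert np.isclose(stage_cost(x, u), 5.4)  # 4.0 + 0.5 + 0.9
# Positive semidefinite weights keep the cost nonnegative.
assert stage_cost(np.zeros(2), np.zeros(1)) == 0.0
```

S ≥ 0 and R ≥ 0 guarantee r ≥ 0, so minimizing the accumulated cost drives both the state and the control effort toward zero.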

B. Construction of Action Set
The purpose of the policy μ is to choose the optimal grades within the FOU of the membership functions of the IT2 fuzzy controller according to the current state x(t), minimizing the total cost during the control process. However, since the range of the FOU differs for different states x(t), it is not efficient to directly use the grades of the membership functions as the actions of the RL agent.
To make the picking of the grades of membership within the FOU more effective, we denote the valid range of the FOU of the IT2 membership function for the first fuzzy control rule as [m̲_1, m̄_1]. Considering the nonnegative variable g_1 ∈ [0, 1], any value m̃_1 inside the FOU can be represented as m̃_1 = (1 − g_1) m̲_1 + g_1 m̄_1, and it can be found that the value g_1 ∈ [0, 1] uniquely determines the value of m̃_1. Also, equally dividing g_1 in [0, 1] results in equally divided values of m̃_1 in [m̲_1, m̄_1]. Therefore, g ∈ G can be considered as a more evenly distributed action set of the RL agent, and g is formed by g_j, j = 1, 2, ..., c, from the different sets of IT2 membership functions.
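The action parameterization above can be sketched as follows (the FOU bounds are assumed example values):

```python
import numpy as np

# Sketch: mapping the RL action g in [0, 1] to a grade inside the FOU
# [m_lo, m_hi]; g = 0 and g = 1 hit the lower and upper bounds exactly,
# and equal steps in g give equal steps inside the FOU.
m_lo, m_hi = 0.2, 0.7             # assumed FOU bounds for one rule

def grade_from_action(g):
    return (1.0 - g) * m_lo + g * m_hi

assert grade_from_action(0.0) == m_lo
assert grade_from_action(1.0) == m_hi
gs = np.linspace(0.0, 1.0, 5)
grades = grade_from_action(gs)
# Equally spaced actions yield equally spaced grades.
assert np.allclose(np.diff(grades), (m_hi - m_lo) / 4)
```

This normalization decouples the agent's action space from the state-dependent width of the FOU, which is exactly why g is preferred over the raw grades as the action.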

C. Value Function and Deep Deterministic Policy Gradient
When the state set and the action set G are too large, calculating the exact state-action value function Q(x(t), g(t)) is difficult and may be impossible in many cases. One alternative is to use function approximators to replace the exact state-action value function and/or policy. As a universal approximator, a DNN with nonlinear activation functions can be used to approximate Q(x(t), g(t)) [51], [53]. At the same time, the acting policy can also be approximated by another DNN, and the behavior of the acting policy can be improved through the deterministic policy gradient to achieve a higher return (lower cost) [51].
In the DNN for the approximation of the state-action values, which is regarded as the critic network, the state x(t) and action g(t) are the inputs, while the estimated state-action value Q(x(t), g(t)) is the output. For the acting policy, which is regarded as the actor network, the state x(t) is the input, and the output of the actor network is the action g(t).
To simplify the notation, the sets of all weights of the critic and actor networks are written as θ^Q and θ^μ, respectively. Similarly, the sets of all weights of the target critic and actor networks are represented as θ^{Q'} and θ^{μ'}. The state-action value from the critic network for the current state x(t) with the corresponding action g(t) is represented as Q(x(t), g(t)|θ^Q). The actor network determines the grade-picking policy, in which the action g(t) is generated according to the observation of the current state x(t). After holding the action for a small interval δ_t, the next state x(t + δ_t) is observed, and the reward ∫_t^{t+δ_t} r(x(τ), u(τ)) dτ is calculated according to the predefined cost function. The weights in the critic network Q(x(t), g(t)|θ^Q) can be updated according to the target critic network, the target actor network, and the rewards obtained during the interactions. The weights in the actor network can be updated through the approximated deterministic policy gradient calculated by the critic network. To improve the sample efficiency, the technique of experience replay is adopted. The tuple <x(t), g(t), x(t + δ_t), ∫_t^{t+δ_t} r(x(τ), u(τ)) dτ> is added to the experience buffer (EB) for experience replay. The structure of the EB without the time relationship can be denoted as <x_i, g_i, x'_i, r_i>, i = 1, 2, .... For experience fragment i, x_i is the start state, x'_i is the next state, g_i is the action, and r_i is the reward/cost.
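The experience buffer described above can be sketched minimally as follows (capacity, batch size, and the fabricated transitions are assumed values):

```python
import random
from collections import deque

# Sketch of the experience buffer (EB): store transition fragments
# <x_i, g_i, x'_i, r_i> and sample mini-batches without regard to
# their time ordering.
class ExperienceBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old fragments drop out

    def add(self, x, g, x_next, r):
        self.buffer.append((x, g, x_next, r))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

eb = ExperienceBuffer(capacity=100)
for k in range(150):                      # overfill past capacity
    eb.add((0.1 * k, 0.0), 0.5, (0.1 * k + 0.01, 0.0), -0.1 * k)
assert len(eb.buffer) == 100              # bounded by capacity
batch = eb.sample(16)
assert len(batch) == 16
```

Sampling uniformly from the buffer breaks the temporal correlation of consecutive transitions, which is the point of experience replay for stabilizing the DNN updates.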
In order to update the weights θ^Q and θ^μ during the control process, a mini-batch of N experience fragments is randomly sampled from the EB to train the DNNs in a supervised manner. The target value Q̂_i for the ith sample is obtained through the target critic and actor networks as

Q̂_i = r_i + γ Q(x'_i, μ(x'_i|θ^μ')|θ^Q').

Using the estimated training targets, the cost function of the critic network can be calculated as

J(θ^Q) = (1/N) Σ_{i=1}^{N} (Q̂_i − Q(x_i, g_i|θ^Q))²

and the mini-batch gradient ∇_{θ^Q} J(θ^Q) can be calculated through the N experience fragments as

∇_{θ^Q} J(θ^Q) = −(2/N) Σ_{i=1}^{N} (Q̂_i − Q(x_i, g_i|θ^Q)) ∇_{θ^Q} Q(x_i, g_i|θ^Q). (20)

After the mini-batch gradient is obtained, θ^Q can be updated with a properly defined learning rate α_Q as

θ^Q ← θ^Q − α_Q ∇_{θ^Q} J(θ^Q).

To search for the optimal policy, the deterministic policy gradient can be approximated through the same mini-batch of experience. The deep deterministic policy gradient that minimizes the cost is approximated as [51]

∇_{θ^μ} J ≈ (1/N) Σ_{i=1}^{N} ∇_g Q(x_i, g|θ^Q)|_{g=μ(x_i|θ^μ)} ∇_{θ^μ} μ(x_i|θ^μ)

and the weights of the actor network can be updated through the deterministic policy gradient as

θ^μ ← θ^μ − α_μ ∇_{θ^μ} J.

After a certain number of updates of the critic and actor networks, the weights of the target networks are replaced as θ^Q' ← θ^Q and θ^μ' ← θ^μ. The full algorithm for updating the critic and actor networks through the deterministic policy gradient is summarized in Algorithm 1.
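A minimal numerical sketch of the critic update is given below. To keep the gradient in closed form, a linear-in-features critic is assumed (in the article both the critic and the actor are DNNs); the feature map, the synthetic mini-batch, and the stand-in target actor are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, alpha_Q, N = 0.999, 0.005, 64

# Linear critic Q(x, g | theta) = theta^T phi(x, g), so the mini-batch gradient
# of J(theta^Q) has a closed form
def phi(x, g):
    return np.array([x[0], x[1], g, 1.0])

theta_Q = np.zeros(4)            # critic weights
theta_Q_target = theta_Q.copy()  # target critic weights, held fixed in this sketch

def target_actor(x):             # stands in for mu'(x | theta^{mu'})
    return 0.5

# Synthetic mini-batch of experience fragments <x_i, g_i, x'_i, r_i>
xs  = rng.uniform(-5, 5, (N, 2))
gs  = rng.uniform(0, 1, N)
xps = xs * 0.99
rs  = -np.einsum('ij,ij->i', xs, xs) * 0.01   # quadratic running cost

for step in range(200):
    q = np.array([theta_Q @ phi(x, g) for x, g in zip(xs, gs)])
    # training targets: y_i = r_i + gamma * Q'(x'_i, mu'(x'_i))
    y = rs + gamma * np.array([theta_Q_target @ phi(xp, target_actor(xp)) for xp in xps])
    # gradient of J(theta^Q) = (1/N) sum_i (y_i - Q(x_i, g_i))^2
    grad = -(2.0 / N) * sum((y[i] - q[i]) * phi(xs[i], gs[i]) for i in range(N))
    theta_Q = theta_Q - alpha_Q * grad
    # (the analogous actor update via the deterministic policy gradient is omitted)

q = np.array([theta_Q @ phi(x, g) for x, g in zip(xs, gs)])
loss = float(np.mean((y - q) ** 2))
```

Repeating the gradient step drives the critic toward the mini-batch targets; the actor update then follows the same pattern with the chain-rule gradient through the critic.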

IV. SIMULATION EXAMPLE
To demonstrate the effectiveness of the proposed deep RL method, a detailed example is discussed in this section. To reduce the burden of symbols, the time argument t of the variables is omitted in the following context to keep the notation concise. In this numerical example, the simulation example with x(t) = x = [x_1, x_2]^T is considered.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Algorithm 1: Approximate Reinforcement Learning With Deterministic Policy Gradient.
1: Randomly initialize the weights of the actor and critic networks θ^μ and θ^Q, and set the target networks θ^μ' ← θ^μ, θ^Q' ← θ^Q.
2: for episode = 1 to Max_Episode do
3:   Observe the initial state x(t).
4:   for step = 1 to Max_Time_Step do
5:     Select the action g(t) through the actor network and hold it for the interval δ_t.
6:     Observe the next state x(t + δ_t) and the reward ∫_t^{t+δ_t} r(x(τ), u(τ))dτ; store the tuple in the EB.
7:     Randomly sample a mini-batch of N experience fragments from the EB and calculate the training targets Q̂_i for the critic network.
8:     Calculate the cost function according to Q̂_i and update θ^Q by the mini-batch gradient (20).
9:     Update θ^μ through the approximated deterministic policy gradient.
10:    Every Replace_Threshold updates, replace the target network weights as θ^Q' ← θ^Q, θ^μ' ← θ^μ.
11:   end for
12: end for
Considering the operating domain x_1 ∈ [−5, 5], the global lower and upper boundaries of the operating domain for x_1 are obtained directly as −5 and 5. As demonstrated in Theorem 2, in order to incorporate more specific information of the membership functions into the analysis, the whole operating domain of x_1 is divided uniformly into 20 subdomains (i.e., L = 20). The lower and upper boundaries of the lth subdomain of x_1 are then obtained as x_{1,l} = −11/2 + l/2 and x̄_{1,l} = −5 + l/2, l = 1, 2, . . ., 20, respectively, and D = diag{1, 0}.
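The subdomain boundaries above can be checked with a few lines of Python:

```python
# Sub-domain boundaries for x1 in [-5, 5] split into L = 20 uniform pieces:
#   lower_l = -11/2 + l/2,  upper_l = -5 + l/2,  l = 1, ..., 20
L = 20
lower = [-11 / 2 + l / 2 for l in range(1, L + 1)]
upper = [-5 + l / 2 for l in range(1, L + 1)]
```

Each subdomain has width 1/2, the first starts at −5, the last ends at 5, and consecutive subdomains share a boundary, so the 20 pieces tile the operating domain exactly.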
To solve the SOS-based stability conditions in Theorem 2, the following parameters are defined first: ε_1(x) = ε_2(x) = ε_3(x) = ε_4(x) = 0.001; X(x) is a constant matrix; N_j(x_1), j = 1, 2, are polynomials with monomials in x_1 of degree 0; M_l(x) ∀l is of degree 0; Υ_ijl(x) ∀i, j, l is a polynomial matrix in x of degree 2. Through SOSTOOLS, the stability conditions in SOS form can be solved numerically, and the feedback matrices are obtained as G_1 = [22.0220, −0.4502], G_2 = [11.2020, 4.9780], and

X = [ 0.1943  −0.0106
     −0.0106   0.2052 ].

Remark 6: The type-reduction functions ϑ_i(x(t)) and ϑ̄_i(x(t)) of the model are unknown due to the uncertainty lying in the plant. The uncertainties are handled by the FOU, and the stability conditions remain valid for the plant subject to uncertainty. However, for simulation purposes, the type-reduction functions have to be defined explicitly to reduce the IT2 polynomial fuzzy model to a type-1 one.
In order to conduct the time-response simulation, we manually define the type-reduction functions first. We choose ϑ̄_1(x_1) = (sin(5x_1) + 1)/2, ϑ_1(x_1) = 1 − ϑ̄_1(x_1), ϑ̄_3(x_1) = (cos(5x_1) + 1)/2, and ϑ_3(x_1) = 1 − ϑ̄_3(x_1) to obtain w̃_1(x_1) and w̃_3(x_1). From the property that Σ_{i=1}^{3} w̃_i(x_1) = 1, w̃_2(x_1) can be obtained directly from w̃_1(x_1) and w̃_3(x_1). In the simulation, to train the DNNs of the critic and actor, Adam [59] is adopted with learning rates α_Q = 0.005 and α_μ = 0.001; the discount factor γ is chosen as 0.999 during the training; the size of the replay buffer is chosen as 1 million transitions; the mini-batch size is chosen as 64; Max_Episode, Max_Time_Step, and Replace_Threshold are chosen as 4000, 2500, and 5, respectively; the NN of the critic has two hidden layers with 30 and 50 neurons, with rectified nonlinearity [60] and log-sigmoid as the nonlinear activation functions, and the final output layer has a linear activation function; the NN of the actor has two hidden layers with 20 and 30 neurons, with hyperbolic tangent and rectified nonlinear activation functions, and the final output layer has a log-sigmoid activation function to bound the output of the actor into the range [0, 1]; all the weights of the DNNs for the critic and actor are initialized as proposed in [61]; S = diag{1, 1} and R = 0; and δ_t is chosen as 0.01 s.
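The construction of the embedded grades from the manually defined type-reduction functions can be sketched as follows. The FOU bounds m_lower and m_upper used here are hypothetical placeholders, since the article's model membership functions are not reproduced in this excerpt, and the convex-combination form of w̃_i is one common choice:

```python
import numpy as np

# Manually chosen type-reduction functions from the simulation setup
def theta1_bar(x1): return (np.sin(5 * x1) + 1) / 2   # upper, in [0, 1]
def theta1_und(x1): return 1 - theta1_bar(x1)          # lower
def theta3_bar(x1): return (np.cos(5 * x1) + 1) / 2
def theta3_und(x1): return 1 - theta3_bar(x1)

# Hypothetical FOU bounds m_lower <= m_upper for rules 1 and 3; any bounds in
# [0, 1] with w1 + w3 <= 1 everywhere would serve for this illustration
def m1_upper(x1): return 0.4 * np.exp(-x1 ** 2 / 8)
def m1_lower(x1): return 0.8 * m1_upper(x1)
def m3_upper(x1): return 0.4 * np.exp(-(x1 - 2) ** 2 / 8)
def m3_lower(x1): return 0.8 * m3_upper(x1)

def w1_tilde(x1):
    # convex combination inside the FOU selected by the type-reduction functions
    return theta1_und(x1) * m1_lower(x1) + theta1_bar(x1) * m1_upper(x1)

def w3_tilde(x1):
    return theta3_und(x1) * m3_lower(x1) + theta3_bar(x1) * m3_upper(x1)

def w2_tilde(x1):
    # partition of unity: w1 + w2 + w3 = 1
    return 1.0 - w1_tilde(x1) - w3_tilde(x1)
```

Because ϑ_i + ϑ̄_i = 1 pointwise, each w̃_i stays between its FOU bounds, and w̃_2 follows for free from the partition-of-unity property.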
After 600,000 update steps of the actor and critic networks, the values/costs of different states calculated by (17) under the policy μ* determined by the trained actor are shown in Fig. 1. In the figure, the costs are evaluated at states sampled uniformly with x_1 and x_2 in the ranges [−5, 5] and [−4, 4], respectively; 400 states in total are evaluated.
Remark 7: In Fig. 1, it can be found that the cost becomes smaller as the state moves closer to the origin. This is reasonable since the definition of the cost function in (17) encourages the states to approach the origin quickly. The largest cost, 6.4357, is obtained at the state x_1 = 5, x_2 = 4, which is far away from the origin.
To compare the type-reduction method determined by the policy μ* with other feasible type-reduction methods, ten sets of type-reduction methods are manually defined as g_m^1(t) = 0, g_m^2(t) = 0.11, g_m^3(t) = 0.22, . . ., g_m^10(t) = 1. The grades of the IT2 membership functions under the qth manually defined type-reduction method are then obtained as m̃_1(x_1) = (1 − g_m^q(t))m_1(x_1) + g_m^q(t)m̄_1(x_1), q = 1, 2, . . ., 10. The costs of these manually chosen type-reduction methods are calculated by (17), and the cost functions according to the different type-reduction methods are shown in Fig. 2. In Fig. 2, from top left to bottom right, the ten subfigures represent the cost functions under the type-reduction functions g_m^1(t), g_m^2(t), . . ., g_m^10(t), respectively. It can be found that the cost functions under the manually defined type-reductions are generally larger than the cost function under the policy μ*; however, the difference cannot be observed easily. To show the optimality of the policy μ* more clearly, we divide each of the ten cost functions under the manually defined type-reduction methods by the cost function determined by the policy μ*; the ratios of the cost functions at different states are shown in Fig. 3.
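A sketch of the manually defined grades and the resulting embedded type-1 membership functions is given below; the lower and upper bounds m1_lower and m1_upper of the controller membership function are hypothetical placeholders, since the article's actual bounds are not reproduced in this excerpt:

```python
import numpy as np

# The ten manually defined, state-independent grades g_m^q in [0, 1]
grades = [round(q / 9, 2) for q in range(10)]   # 0.0, 0.11, 0.22, ..., 1.0

# Hypothetical lower/upper bounds of the controller membership function
def m1_lower(x1): return 0.9 / (1.0 + np.exp(x1))
def m1_upper(x1): return 1.0 / (1.0 + np.exp(x1))

def embedded_m1(x1, g):
    # a grade g in [0, 1] picks one embedded type-1 MF inside the FOU:
    #   m~_1 = (1 - g) * m_lower + g * m_upper
    return (1.0 - g) * m1_lower(x1) + g * m1_upper(x1)

def embedded_m2(x1, g):
    # two-rule controller: m~_2 = 1 - m~_1
    return 1.0 - embedded_m1(x1, g)
```

Fixed grades like these correspond to one embedded type-1 membership function per subfigure of Fig. 2, whereas the policy μ* selects the grade as a function of the state.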
Similar to Fig. 2, in Fig. 3, the ten subfigures from top left to bottom right show the ratios of the cost functions under the type-reduction functions g_m^1(t), g_m^2(t), . . ., g_m^10(t) to those under the policy μ*, respectively. It can be found that, for all the manually defined type-reductions, the ratios at almost all the states are larger than 1, which implies that the smallest cost is obtained under the policy μ*. The ratio of the cost functions demonstrates the optimality of the proposed approach.
The time response of the states during the control process demonstrates the effectiveness of the proposed approach in a straightforward way; the time response starting from the initial state x(0) = [5, −1.5]^T is provided. In Fig. 4, the curves in teal represent the response of x_1(t) and the curves in gold represent the response of x_2(t). The solid curves represent the time response with the policy μ* determining the grades of the IT2 membership functions of the fuzzy controller; the dashed curves represent the time response with the grades calculated by m̃_1(x_1) = m_1(x_1) (from the definition, m̃_2(x_1) = 1 − m̃_1(x_1)); the dash-dotted curves represent the time response with the grades calculated by m̃_1(x_1) = (1/5)m_1(x_1) + (4/5)m̄_1(x_1); and the dotted curves represent the time response with the grades calculated by m̃_1(x_1) = (1/2)(m_1(x_1) + m̄_1(x_1)). The total cost calculated from (17) during the control process under the policy μ* is 0.6972, while the costs for the control processes adopting the other type reducers are 1.1279, 1.011, and 0.8681, respectively. The total costs during the control process show that, by adopting the policy μ* as the type reducer, the embedded type-1 membership function can be extracted from the FOU according to the optimized control performance. Besides, m̃_1(t) and m̃_2(t) determined by the policy μ* during the control process are shown in Fig. 5, in which the green curve represents m̃_1(t) and the yellow curve represents m̃_2(t).

V. CONCLUSION
In this article, new results on the type reduction of IT2 polynomial fuzzy systems through the deterministic policy gradient are presented for the first time. In the proposed method, the optimal reduced type-1 membership functions are extracted from the FOU according to the policy approximated by a DNN to optimize the predefined control-performance index. The gradual improvement of the control performance is achieved through interaction within the confined function space, i.e., the FOU of the IT2 membership functions, which guarantees the stability conditions without resolving the SOS conditions for different sets of embedded type-1 membership functions. Different from the PDC approach, the MFD approach is adopted to utilize the specific information of the membership functions. Moreover, thanks to incorporating the information of the IT2 membership functions into the stability analysis, any type-1 membership function within the FOU still fulfills the stability conditions. Therefore, the stability conditions are valid for all the potential embedded type-1 membership functions obtained according to the optimized control performance. The extraction of embedded type-1 membership functions from the FOU is formulated as a function optimization problem to fully optimize the control performance, and the deterministic policy gradient is adopted to search for the corresponding optimal policy μ*, which determines the optimal grades of the type-1 membership functions. A detailed simulation example has been provided to demonstrate that the control performance is optimized during the type-reduction process. Having presented the merits of the DRL-based type-reduction method, it should be pointed out that its disadvantage is that it requires more computational resources than its counterparts reported in the literature.

Fig. 1. Cost function according to the acting policy determined by the actor network.

Fig. 4. Time responses of states under different type-reduction methods.