Neuro-Optimal Event-Triggered Impulsive Control for Stochastic Systems via ADP

This article presents a novel neural-network-based optimal event-triggered impulsive control method. First, a novel general-event-based impulsive transition matrix (GITM) is constructed to represent the probability-distribution evolving characteristics of all system states across the impulsive actions, rather than over a prefixed timing sequence. On the foundation of this GITM, the event-triggered impulsive adaptive dynamic programming (ETIADP) algorithm and its high-efficiency version (HEIADP) are developed to deal with the optimization problems of stochastic systems with event-triggered impulsive controls. It is shown that the obtained controller design scheme can reduce the computational and communication burden caused by updating the controller periodically. By analyzing the admissibility, monotonicity, and optimality properties of ETIADP and HEIADP, we further establish the approximation error bound of the neural networks to address the connection between the ideal and neural-network-based realizations of the present methods. It is proven that the iterative value functions of both the ETIADP and HEIADP algorithms fall into a small neighborhood of the optimum as the iteration index increases to infinity. By adopting a novel task synchronization mechanism, the proposed HEIADP algorithm fully utilizes the computing resources of multiprocessor systems (MPSs), while significantly reducing the memory requirement compared to traditional ADP approaches. Finally, we carry out a numerical study to show that the proposed methods can fulfill the desired goals.

Derong Liu is with the Department of Mechanical and Energy Engineering, Southern University of Science and Technology, Shenzhen 518055, China, and also with the Department of Electrical and Computer Engineering, University of Illinois Chicago, Chicago, IL 60607 USA (e-mail: liudr@sustech.edu.cn; derong@uic.edu).
Digital Object Identifier 10.1109/TNNLS.2022.3232635

nonlinear systems, which permits the initializer to be an arbitrary positive semidefinite function, thus improving the universality and reliability of the ADP methods. In [12] and [13], the local value iteration ADP algorithm with a novel updating mechanism is introduced to relax the computational burden, while its admissibility and global optimality properties are analyzed systematically. In [14], a novel value-iteration-based off-policy ADP algorithm is proposed for the optimal control of continuous-time linear periodic systems, so that approximate optimal solutions can be obtained directly from the collected data, without exact knowledge of the system dynamics. Bhattacharya et al. [15] focus on DP problems and use policy iteration and rollout techniques to solve a class of autonomous sequential repair problems where the system states are partially observable. The stable value iteration algorithm is suggested in [16] to resolve optimization problems for nonlinear two-player zero-sum games based on ADP. It is also shown in [16] that if the iteration index reaches a given number, the generated iterative control inputs make the closed-loop system asymptotically stable. Zhu and Zhao [17] use the double Q-learning technique and provide double-loop iterative realization methods for the ADP algorithms to construct the optimal controllers for zero-sum games of stochastic systems. ADP also possesses great practical value. For example, it can be used to obtain the optimal controllers for permanent magnet synchronous motors [18], antilock brake systems [19], dc-dc power converters [20], etc. Researchers in recent years have been aiming at building high-performance impulsive controllers for impulsively controlled systems. Haddad et al.
[21] extend the dissipativity theory to nonlinear dynamical systems controlled by impulsive controllers, for which the corresponding invariant set stability and Lyapunov theorems are established. Li et al. [22] design impulsive control schemes for nonlinear systems with constant, unbounded time-varying, or bounded time-varying delays. Lakshmikantham et al. [23] present the concept of impulsive evolution processes and analyze the corresponding stability properties of the impulsive controllers via methods of discontinuous Lyapunov functions. Dufour et al. [24] analyze the Hamilton-Jacobi-Bellman (HJB) equation associated with the optimal impulsive control problems of piecewise deterministic Markov processes (PDMPs). Miller et al. [25] develop the martingale representation of stochastic systems subject to joint impulsive and gradual controls, while constructing the optimal strategy based on the DP equation. Basu and Stettner [26] provide optimal impulsive controller design schemes for zero-sum games under several weak assumptions and weak Feller conditions. Dufour and Piunovskiy [27] study continuous-time stochastic systems on a general Borel state space governed by both impulsive and regular controllers to minimize the infinite-time-horizon discounted cost. Heydari [28] focuses on nonlinear impulsive systems and presents controller design schemes for obtaining the impulsive instants in optimal control problems with an unlimited number of impulses. Wei et al. [29] develop a novel iterative ADP algorithm to solve the optimal impulsive control problems for infinite-horizon discrete-time nonlinear systems. Wang and Balakrishnan [30] give the optimal neuro-controller synthesis for impulse-driven systems while using neural network approximation structures to solve the optimality equations. Wang et al.
[31] extend the ideas in [30] to consider optimal impulsive control problems where the impulsive instants are calculated by a prefixed function. The optimal impulsive control technologies have also been utilized in practical applications such as Internet congestion control [32], antiangiogenic tumor therapies [33], and human immunodeficiency virus treatments [34].
Event-triggered control, as a promising methodology for reducing computational and communication costs, has nowadays become a hot topic in the community. Vamvoudakis [35] proposes the event-triggered ADP algorithm for nonlinear continuous-time systems, which reduces the controller updates by sampling the state only when an event is triggered while maintaining stability and optimality. Wang et al. [36] construct an event-triggered adaptive robust control approach for nonlinear systems through a neural DP strategy, achieving robustness of the designed controller under a suitable triggering condition. Luo et al. [37] design the event-triggered optimal controller directly based on the solution of the Hamilton-Jacobi-Bellman equation and provide formal performance guarantees of the controller by proving a predetermined upper bound. Mu et al. [38] present an event-sampled integral reinforcement learning algorithm for partially unknown nonlinear systems using a novel dynamic event-triggering strategy, in which the actor network adopts event-based communication to update the controller only at triggering instants. Zhao and Liu [39] develop an event-triggered decentralized tracking control approach for modular reconfigurable robots based on ADP, while solving the local HJB equation via a local critic with an asymptotically stable structure. Xue et al. [40], [41] use event-triggered ADP to design controllers for uncertain nonlinear dynamical systems.
Although the impulsive and event-triggered control technologies have made considerable progress, several problems still have to be addressed. First, the regular time-based state transition probability matrix P (whose (m, n)th entry is the probability p(σ_n | σ_m, ν), where σ_m and σ_n are discrete elements of the state space X and ν is the controller) is strictly constrained by the action time of the controllers. Specifically, to reflect the probability-distribution evolving characteristics from the timing node k to k + ℓ, from P we can derive the multistep transition matrix P^ℓ = P · P · ⋯ · P (ℓ factors), whose element p^ℓ(σ_n | σ_m, ν) establishes the relationship between the system states x(k) and x(k + ℓ) (we use the symbol x ∈ X to generally refer to the system state of the stochastic systems). However, the multistep transition matrix P^ℓ requires that the action time of ν be fixed at the time sequence (k, k + 1, …, k + ℓ − 1), with a prefixed time span for all initial states x(k) ∈ X. This restriction of the regular time-based transition matrices (P and P^ℓ) makes them fail to reveal the impulsive transition dynamics across the impulsive actions, since the time span between two adjoining impulsive actions (also called the "impulsive control cycle") adaptively changes according to the current system state and is not predefined. Consequently, the regular time-based transition matrices (P and P^ℓ) can be used to derive the evolution of the probability distribution at a prefixed time sequence, but not the evolution of the probability distribution across the impulsive actions. In summary, these facts indicate that the restrictions of the regular time-based transition matrices and the variable impulsive control cycles of the impulsive controller are incompatible and conflict with each other, which renders the traditional impulsive control methods [24], [25], [26], [27] complicated and highly specialized, with low generality and uniformity. Noticing that the "arrival of the impulsive action" can be treated as an event, a novel general-event-based impulsive transition matrix is therefore needed to represent the probability-distribution evolving characteristics across the impulsive actions.
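To make the fixed-span restriction concrete, the following minimal NumPy sketch (the three-state kernel is purely hypothetical) computes the multistep matrix P^ℓ by repeated multiplication; its (m, n)th entry relates x(k) to x(k + ℓ) only when the controller acts over the same ℓ-step span for every initial state:

```python
import numpy as np

# Hypothetical 3-state example: P[m, n] = p(sigma_n | sigma_m, nu)
# for a fixed controller nu acting at every time step.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])

def multistep(P, ell):
    """ell-step transition matrix P^ell = P * P * ... * P (ell factors).
    Its (m, n)th entry relates x(k) to x(k + ell), but only if the
    controller acts at every instant k, k+1, ..., k+ell-1 -- the same
    fixed span for every initial state, which is exactly the restriction
    that variable impulsive control cycles violate."""
    return np.linalg.matrix_power(P, ell)

P3 = multistep(P, 3)
# Rows of any transition matrix sum to 1.
assert np.allclose(P3.sum(axis=1), 1.0)
```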
Second, the impulsive control methods proposed in [24], [25], [26], and [27] use DP to iteratively obtain the explicit solutions of the optimality equations. However, due to the increased difficulty of deriving exact solutions and the "curse of dimensionality" issues of DP (especially when the system dynamic characteristics are complex), these DP-based methods are almost impossible to implement in reality. Fortunately, the above DP-associated problems can be successfully avoided by the newly developed ADP methodology, which uses approximate structures to efficiently and numerically approach the optimum. Nevertheless, there are still no reports on utilizing ADP to solve the optimal impulsive control problems for discrete-time stochastic systems. Besides, to derive the optimal impulsive controllers for stochastic systems under the ADP framework, a new general-event-based impulsive transition matrix, which can reflect the impulsive transition dynamics across the impulsive actions, is crucial and required.
Furthermore, existing research on impulsive control via ADP [28], [29], [30], [31]: 1) merely deals with deterministic systems and is invalid when the controlled systems are stochastic, as a consequence of the issues mentioned above, and 2) requires the system states to be sampled periodically and the controller/actuator to be updated at each time step, consuming huge computational and communication resources. As for the existing event-triggered ADP approaches [35], [36], [37], [38], [39], [40], [41], they: 1) mainly focus on deterministic cases, and there are no reports on how they can be applied to impulsively controlled systems, and 2) require the triggering condition to be carefully designed by the decision maker prior to the optimization of the controller, and are thus characterized by low portability and extendability, which also indicates that the optimality of the triggering condition needs improvement.
Finally, on the subject of algorithm execution, the traditional ADP-based approaches, including [1], [2], [3], [4], [5], [6], [7], [8], [28], [29], [30], [31], [35], [36], [37], [38], [39], require the iterative items to be updated globally at each step. This updating mechanism may not be friendly to computing devices, especially those with limited memory sizes. Particularly, if the complexity of the controlled systems increases, ADP needs a massive amount of sampled data to globally and precisely train the utilized neural networks at each iteration, which poses a significant memory burden on the computing devices and may even render the algorithms infeasible. On the other hand, more and more modern computing devices come with multiple processors; such devices, referred to as multiprocessor systems (MPSs), are now widespread and play critical roles in various areas of daily life and industrial production. Hence, considering the fact that traditional ADP algorithms are usually designed to run on a single processor, how to adapt ADP to fully utilize the computing resources of MPSs remains an urgent problem to be resolved.
The above issues motivate us to carry out the present research, and we summarize the main novelties in the following three aspects.
1) A novel general-event-based impulsive transition matrix (GITM) is established, which can reveal the probability-distribution evolving characteristics across the predefined general events. Due to this advanced property, this novel transition matrix plays a key role in the development of the ETIADP and HEIADP methods. Moreover, the GITM possesses strong extendability. Specifically, the "general event" of the GITM is not limited to "the arrival of the impulsive instants," but can be further customized into "the jumping behavior of the system happens," "the switching behavior of the controller happens," or "the triggering condition is violated," etc. Hence, these unique features of the GITM substantially improve the universality of the ADP-based approaches.
2) A new event-triggered impulsive controller (ETIC) design scheme is developed, and for the first time ADP is applied to obtain the optimal ETICs for stochastic systems, where value iteration is employed to address the "curse of dimensionality" issues and effectively approximate the optima, including the optimal triggering condition.
3) The high-efficiency event-triggered impulsive ADP (HEIADP) algorithm is developed to fully utilize the computing resources of MPSs while reducing its memory requirement.
The advantages and differences of the developed methods compared to the existing methods are emphasized as follows.
1) Existing impulsive control methods use the traditional state transition probability matrices, which require that the action time of the impulsive controller have the same length for each system state. Unfortunately, for the impulsive controller, the impulsive control cycles are always adaptively changing with the system states and are not fixed. This mismatch between the variable impulsive control cycles and the requirements of the regular time-triggered transition matrices renders the traditional impulsive control methods complicated and highly specialized, with low generality and uniformity. In comparison, the proposed ETIADP and HEIADP algorithms are based on the GITMs, which reflect the "impulsive" dynamics, possess greater flexibility, and improve the extendability of the ADP approaches. (Please see Remark 1 for more details about the advantages of the GITM over the traditional one.)
2) Existing DP-based impulsive control methods may cause the "curse of dimensionality" problems, making the algorithms unusable in reality. In comparison, the proposed methods are ADP-based, which means neural networks are used to effectively approximate the optima, thus successfully avoiding the DP-related issues. In addition, to the best of our knowledge, this is the first time ADP is utilized to design the optimal impulsive controllers for stochastic systems.
3) Existing impulsive control methods for deterministic systems cannot be directly extended to stochastic systems, offering limited generality.In comparison, the proposed methods can effectively solve the optimal impulsive control problems of stochastic systems, due to the established GITMs.
4) Existing impulsive control methods require the system states to be sampled periodically and the controller/actuator to be updated at each time step, consuming huge computational and communication resources. In comparison, the event-triggering mechanism is introduced into the optimal impulsive controller design scheme for the first time, further improving its efficiency. In other words, according to our literature research, there have been no reports on how to obtain the optimal event-triggered controllers for impulsive stochastic systems, which is another contribution of our article.
5) Existing ADP methods, when implemented in computing devices, consume huge memory spaces if the system complexity is high.In comparison, the proposed HEIADP algorithm can significantly reduce the memory burden.
6) Existing ADP methods, when executed in MPSs, are faced with the "task unsynchronization" problem, resulting in low utilization rate of the multiprocessor computing resources.In comparison, the proposed HEIADP algorithm can overcome the above issue by adopting a novel MPS task scheduling scheme.

A. System Dynamics and the ETIC Design
The controlled systems are modeled as

x(k + 1) = F(x(k), a(k), ω(k))    (1)

where x and a are the system state and control action, respectively, and ω(k) is a random variable with its own probability distribution. The global state space and control action space are denoted by X and A, respectively. If the current system state is x(k) with the control action being a(k), then the value of the following state x(k + 1) is governed by a probability distribution p(· | x(k), a(k)). The equilibrium of (1) is x = 0 when the control action is zero, meaning p(0 | 0, 0) = 1. The timing instants where the event-triggered impulsive controller is active are of two types: the decision-making instants and the impulsive control instants.
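As a minimal illustration of the transition model above, the following Python sketch draws the successor state from p(· | x(k), a(k)); the three-state set and the kernel are purely hypothetical stand-ins and do not come from the article:

```python
import numpy as np

# Minimal sketch of the stochastic transition model: the successor
# state is drawn from p(. | x(k), a(k)). The states and the kernel
# below are illustrative assumptions.

rng = np.random.default_rng(1)
states = [0, 1, 2]                    # discrete elements of X

def kernel(x, a):
    """Hypothetical transition probabilities p(. | x, a)."""
    base = np.array([0.6, 0.3, 0.1])
    return np.roll(base, x) if a == 0 else base

def step(x, a):
    # x(k+1) ~ p(. | x(k), a(k))
    return rng.choice(states, p=kernel(x, a))

x = 0
for k in range(5):
    x = step(x, a=0)
assert x in states
```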
If the current time is a decision-making instant, then the ETIC determines whether the current "controller updating event (CUE)" is triggered or not, and specifies when the succeeding impulsive action takes place. If the current time is an impulsive control instant, the ETIC applies the impulsive action to the system. In particular, we use {θ_l}, l = 0, 1, …, to represent the sequence of the decision-making instants, with θ_0 = 0. The symbol G denotes the collection of all possible impulsive intervals (the time length from the current decision-making (impulsive control) instant to the following one), i.e., G = {T_1, T_2, …, T_max}. Obviously, we have θ_{l+1} − θ_l ∈ G, l = 0, 1, …. In practical engineering, the frequency of the impulsive actions of the ETIC should be regulated. Specifically, if the frequency is too high, it may cause wear and tear of the controller and affect system reliability; if the frequency is too low, it may not achieve the desired controller performance. Hence, to acquire a proper balance between controller performance and reliability, a delay function ψ(x, l): X × N_{≥0} → G, which is used to adjust the frequency of the impulsive actions, is designed. If the current time satisfies k = θ_l, which is exactly the decision-making instant, and the current CUE is triggered, then based on the current system state x(θ_l), ψ outputs the current "impulsive" control cycle (i.e., the time span bounded by the current decision-making instant θ_l and the next one θ_{l+1}) of the ETIC. In other words, we have θ_{l+1} − θ_l = ψ(x(θ_l), l). Meanwhile, under the above conditions, the control law cluster u(x, l) outputs the impulsive magnitude of the succeeding impulsive action, which is applied at the impulsive control instant, i.e., θ_{l+1} − 1, of the current impulsive control cycle (as its name suggests, the control law cluster u(x, l) is composed of a series of control laws defined by u(x, ρ, l): X × G × N_{≥0} → A). On the other hand, if at the current time k = θ_l the CUE is not triggered, then the time span and the impulsive action of the current impulsive control cycle are kept the same as those of the last "triggered" one. In addition, if the current time index is not an impulsive control instant, then the ETIC is in the idle state and there is no control input, whether the current CUE is triggered or not.
We set up a register r(k) along with a counter c(k) in the ETIC. These two components r(k) and c(k) are used to signal the arrival of the decision-making or impulsive control instants to the controller. Moreover, we construct the recorder r̄ = [r̄_1, r̄_2]^T, where r̄_1 and r̄_2 record the outputs of ψ(x, l) and u(x, l) at the triggered impulsive control cycle, respectively. Then, we can define the "impulsive" state ς accordingly. Let the function Λ(ς, l): X̄ × N_{≥0} → {0, 1} denote the CUE scheduling/triggering strategy, where Λ(ς(θ_l), l) = 1 means the CUE is triggered or, equivalently, the time span and the impulsive action of the current impulsive control cycle are both updated, and Λ(ς(θ_l), l) = 0 means the previously calculated cycle length and impulsive action are applied to the current impulsive control cycle (the CUE is not triggered, and the current control cycle inherits the attributes of the last triggered one). Through monitoring the system state trajectory x(k), the components r, r̄, and c renew their values accordingly. Suppose the current system state is x(k); the updating rules of these items are demonstrated in Tables I-III, with the corresponding initial values configured as r(0) = 0. From Table I, if r(k) = 0, then the ETIC treats k as the current decision-making instant, i.e., k = θ_l. When r(k) = 1, the ETIC recognizes k as the current impulsive control instant. As for the recorder r̄(k), the corresponding dynamics is shown in Table II, where r̄ memorizes the cycle length of the current "triggered" impulsive control cycle, along with the impulsive action applied in that period. If the current CUE is not triggered, r̄ remains unchanged. Table III demonstrates the dynamics of c(k), which records the total number of impulsive actions applied to the system before the current time k. The collection of all accessible values of the recorder r̄ is denoted by R̄, which consists of r̄_0, r̄_1, …, r̄_{|R̄|}, with r̄_0 = 0.
We use e to represent the "expanded" system state. The global space X̄ of all the "impulsive" states is discrete and countable, and we list its elements as σ_1, σ_2, …, σ_{|X̄|}. Then, we use μ_l(e) = [μ_l^1(e), μ_l^2(e)]^T as the abstract mathematical description of the ETIC, where μ_l^1 returns the output of the triggering strategy at the decision-making instants while μ_l^2 returns the control action of the ETIC at the impulsive control cycle [θ_l, θ_{l+1}). Equation (3) shows that the decision procedure of the ETIC is subject to the following five factors: 1) the current expanded state; 2) the total count of impulsive actions applied before the present time; 3) the delay function ψ; 4) the triggering strategy Λ; and 5) the control law cluster u. Therefore, more accurately, the ETIC ought to be marked with the symbol μ_{l,Λ,ψ,u}(e) instead of μ_l(e). Nonetheless, to conveniently analyze the properties of the ETIC, we still adopt μ_l or μ_l(e) to represent μ_{l,Λ,ψ,u}(e) if no confusion occurs. Moreover, the operating mechanism of the ETIC is demonstrated intuitively in Fig. 1.
Based on the preceding analysis, we now define the event-triggered impulsive policy employed across all impulsive control cycles as h = (Λ(ς, 0), ψ(x, 0), u(x, 0), …, Λ(ς, l), ψ(x, l), u(x, l), …). In addition, we use K_l, M_l, and C_l to represent the collections of all possible triggering strategies, delay functions, and control law clusters applied at the lth impulsive control cycle, respectively. Then, the event-triggered impulsive policy space is expressed as the Cartesian product (denoted ×) of these collections.
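The ETIC bookkeeping described above can be sketched as a small simulation loop. The exact update rules of Tables I-III are not reproduced in this excerpt, so the details below (trigger, delay, and control functions, and the order of updates within a step) are illustrative assumptions rather than the article's definitions:

```python
# Hypothetical sketch of the ETIC operating mechanism: at each
# decision-making instant the trigger decides whether to refresh the
# recorded cycle length and impulsive magnitude; the impulse itself
# fires one step before the next decision-making instant, and the
# counter tallies applied impulses.

def run_etic(x0, step, trigger, psi, u, horizon=20):
    x, l = x0, 0                     # state and cycle index
    c = 0                            # counter: impulses applied so far
    rec_span, rec_action = 1, 0.0    # recorder: last triggered (span, action)
    k, theta = 0, 0                  # time index and decision-making instant
    while k < horizon:
        if k == theta:               # decision-making instant
            if trigger(x, l):        # CUE triggered: refresh the recorder
                rec_span, rec_action = psi(x, l), u(x, l)
            theta = theta + rec_span # next decision-making instant
            l += 1
        if k == theta - 1:           # impulsive control instant
            x = step(x, rec_action)
            c += 1
        else:                        # controller idle: free dynamics
            x = step(x, 0.0)
        k += 1
    return x, c

xf, n_imp = run_etic(
    x0=1.0,
    step=lambda x, a: 0.9 * x + a,      # toy deterministic dynamics
    trigger=lambda x, l: abs(x) > 0.1,  # toy triggering strategy
    psi=lambda x, l: 2,                 # constant candidate cycle length
    u=lambda x, l: -0.5 * x,            # toy impulsive magnitude
)
```

With a constant cycle length of 2 over 20 steps, one impulse fires in every cycle, so `n_imp` comes out to 10; when the trigger returns 0, the loop reuses the last recorded span and action, matching the "inherits the attributes of the last triggered cycle" rule.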

B. Development of the GITM
Suppose at the time k, the expanded state of the system is e(k) = e_{r̄_i,σ_m}. Under that condition, p^ℓ(e_{r̄_j,σ_n} | e_{r̄_i,σ_m}, μ_l), ℓ = 1, 2, …, denotes the probability that the stochastic system reaches the state e_{r̄_j,σ_n} at the time index k + ℓ when the ETIC μ_l is utilized at the time indexes k to k + ℓ − 1. According to the operating mechanism of the ETIC, at the lth impulsive control cycle, Λ(ς, l), ψ(x, l), and u(x, l) are employed to determine the controller outputs. Thereupon, we define the corresponding GITM as P^{Λ(ς,l),ψ(x,l),u(x,l)}, which is given in (4), shown at the bottom of the page. Specifically, the (m, n)th component of the GITM is p(σ_n | σ_m, Λ(ς, l), ψ(x, l), u(x, l)), which represents the probability that the impulsive state evolves to ς(θ_{l+1}) = σ_n at the (l + 1)th decision-making instant starting from ς(θ_l) = σ_m at the lth decision-making instant. The elements of P^{Λ(ς,l),ψ(x,l),u(x,l)} satisfy (5) and (6).

Remark 1: From (5) and (6), it is worth mentioning that the GITM is only governed by the triggering strategy Λ(ς, l), the control law cluster u(x, l), and the delay function ψ(x, l), and is not affected by the adaptively changing impulsive control cycles of the ETIC. This advantage of the GITM over the traditional transition matrix is also demonstrated intuitively in Figs. 2 and 3. In these figures, the area of the rectangle whose label and corresponding time index are σ_i and k, respectively, represents the probability of the system state being σ_i at k, while the shaded rectangles connected by the dashed lines represent the probability distribution of system states at a certain time point or upon the occurrence of the general event.

Fig. 2. Changes of the probability distribution of system states over time (for illustration purpose only).
Fig. 3. Changes of the probability distribution of system states across the predefined general events (for illustration purpose only).

Specifically, Fig. 2 shows that the traditional transition matrices require that the action time of the controller during a system transition be the same (e.g., 1 or 2 time units) for all current states with the same time index, in order to describe the changes of the probability distribution over time. Fig. 3 indicates that when a general event happens, the corresponding system states may not be at the same time instant (for example, after the lth impulsive action is applied, the probability of the state being σ_2 at time k + 1 is 20%, while there is a 30% chance that the system occupies σ_3 at time k + 2). Fig. 3 also demonstrates the variable transition time from one general event to another, which causes the traditional matrices to fail to represent the general-event-based impulsive dynamics. Comparing Figs. 2 and 3 shows that the established GITM can describe the probability-distribution evolving characteristics across the "general events" while the traditional matrices cannot (from now on, the "general event" of the GITM specifically refers to "the arrival of the decision-making instant"). With the above superiority of the GITM over the traditional transition matrices, ADP is then empowered to solve the optimal impulsive control problems of stochastic systems. The GITM also plays a key role in analyzing the convergence, admissibility, and error boundedness properties of the ETIADP and HEIADP methods.
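The core idea, stripped of the trigger and the impulsive action, can be sketched in NumPy: each row of the event-based matrix uses its own horizon ψ(m), so the matrix maps the distribution at one decision-making instant to the next even though cycle lengths differ per state. The one-step kernel and delay function below are illustrative; the article's GITM in (4)-(6) additionally folds in Λ and u:

```python
import numpy as np

# Hypothetical sketch: build an event-based transition matrix from a
# one-step matrix by giving each row m its OWN horizon psi(m). A
# fixed-horizon matrix P^ell cannot do this, since ell is shared by
# all rows.

def gitm(P_onestep, psi):
    n = P_onestep.shape[0]
    G = np.empty_like(P_onestep)
    for m in range(n):
        # Row m of the psi(m)-step matrix: distribution over states at
        # the next decision-making instant, starting from sigma_m.
        G[m] = np.linalg.matrix_power(P_onestep, psi(m))[m]
    return G

P = np.array([[0.6, 0.4],
              [0.3, 0.7]])
G = gitm(P, psi=lambda m: m + 1)  # state-dependent cycle lengths 1 and 2
assert np.allclose(G.sum(axis=1), 1.0)  # each row is still a distribution
```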
Choose the event-triggered impulsive policy as h = (Λ(ς, 0), ψ(x, 0), u(x, 0), …, Λ(ς, l), ψ(x, l), u(x, l), …) and let e(θ_l) = e_{r̄_0,ς}. Under the above situation, the expected infinite horizon cost subject to h is given in (7). The content within the bracket of (7) is a random variable whose expectation is calculated by the operator E{·}. The utility function U(·, ·) in (7) is composed as follows. At each time index k, the function Q gives the immediate cost regarding the current expanded state e(k) and the corresponding control action μ_l^2; an additional event-cost term represents the costs of resampling the current state, recalculating the outputs of the ETIC (i.e., utilizing the computational resources), and utilizing the communication resources when the controller updating event is triggered; and the function π in the above expression is specified accordingly. Designers should carefully choose the functions Q and the event cost in order for the impulsive controllers and systems to function properly and healthily. That is, the degree of the penalty (i.e., the utility function U) should be stepped up if the impulsive action frequency increases, because the impulsive controller/actuator can be abused or severely damaged, and its lifespan may be shortened, if the impulsive action frequency is too high, which therefore ought to be suppressed.

Definition 1: If h = (Λ(ς, 0), ψ(x, 0), u(x, 0), …, Λ(ς, l), ψ(x, l), u(x, l), …) makes J_h(e_{r̄_0,ς}, l) < ∞ hold for any ς ∈ X̄, l ∈ N_{≥0}, then we say the event-triggered impulsive policy h, or the triplet (Λ(ς, l), ψ(x, l), u(x, l)), is admissible.
In addition, define the impulsive utility function U_{Λ,ψ,u}(ς), the impulsive value function V^{r̄_0}(ς), and the impulsive performance index function J_{h,r̄_0}(ς), with U_{Λ,ψ,u}(·), V^{r̄_0}(·), J_{h,r̄_0}(·): X̄ → R. In particular, U_{Λ,ψ,u}(ς) is the accumulated cost of one impulsive control cycle during which the impulsive state at the decision-making instant equals ς. V^{r̄_0}(ς) denotes the value of the impulsive state ς at the decision-making instant where the expanded system state is e_{r̄_0,ς}. Given the policy h, J_{h,r̄_0}(ς) is the corresponding performance index regarding the current impulsive state ς when the current time is a decision-making instant. The corresponding vectorized forms U_{Λ,ψ,u}, V^{r̄_0}, and J_{h,r̄_0} are all |X̄|-dimensional vectors whose ith elements are U_{Λ,ψ,u}(σ_i), V^{r̄_0}(σ_i), and J_{h,r̄_0}(σ_i), respectively.
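For a stationary admissible policy, the vectorized quantities above satisfy a linear fixed point, which the following NumPy sketch solves directly. The discount factor gamma < 1 is an assumption added here to guarantee a finite cost (the article instead relies on admissibility), and the GITM and utility vector are toy values:

```python
import numpy as np

# Hypothetical sketch: for a STATIONARY policy, the vectorized
# performance index solves J = U + gamma * G @ J, where U is the
# per-cycle impulsive utility vector and G the event-based transition
# matrix over impulsive states.

gamma = 0.9
G = np.array([[0.60, 0.40],
              [0.39, 0.61]])          # toy GITM over two impulsive states
U = np.array([1.0, 2.0])              # toy per-cycle accumulated utility

J = np.linalg.solve(np.eye(2) - gamma * G, U)
# Sanity check: J satisfies the fixed-point relation.
assert np.allclose(J, U + gamma * G @ J)
```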

A. Derivation of the ETIADP Algorithm
Define the set of all "impulsive" equilibrium points as X̄_o = {ς | x = 0, r̄_2 = 0, r̄_1 ∈ G} (it is guaranteed that any admissible ETIC can drive the impulsive system into the impulsive states ς ∈ X̄_o, and once the impulsive trajectory enters the space X̄_o, it remains there and never leaves). Let the elements of the initial value function V_0^{r̄_0} satisfy the initialization condition (11). Then, the initial triggering strategy Λ_0(ς), delay function ψ_0(x), and control law cluster u_0(x) are calculated accordingly, and the iterative triggering strategy Λ_i(ς), delay function ψ_i(x), and control law cluster u_i(x) are derived through (12) and (13). ETIADP then iterates through (12) and (13) with i → ∞.
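Under several simplifying assumptions that do not come from the article (a finite impulsive state set, candidate triplets collapsed to candidate cycle lengths, and an added discount gamma < 1), the iteration through (12) and (13) can be caricatured as plain value iteration over the event-based transition rows:

```python
import numpy as np

# Hypothetical finite sketch of an ETIADP-style sweep: for each
# impulsive state, evaluate a small candidate set, form the induced
# event-based transition row and one-cycle utility, and keep the
# minimizer. All names and numbers are illustrative.

def etiadp(states, candidates, utility, gitm_row, gamma=0.9, iters=200):
    V = np.zeros(len(states))               # nonnegative initializer
    policy = [None] * len(states)
    for _ in range(iters):
        V_new = np.empty_like(V)
        for m, s in enumerate(states):
            costs = [utility(s, c) + gamma * gitm_row(s, c) @ V
                     for c in candidates]
            best = int(np.argmin(costs))
            V_new[m], policy[m] = costs[best], candidates[best]
        V = V_new
    return V, policy

states = [0, 1]
candidates = [1, 2]                          # candidate cycle lengths
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])
utility = lambda s, c: 1.0 + 0.5 * c         # shorter cycles cheaper here
gitm_row = lambda s, c: np.linalg.matrix_power(P, c)[s]
V, policy = etiadp(states, candidates, utility, gitm_row)
```

In this toy instance the cheaper cycle length 1 always wins, so the iteration contracts to the uniform fixed point 1.5/(1 − 0.9) = 15 for both states.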

B. Convergence and Monotonicity Analysis
Theorem 1: For i = 0, 1, …, let V_i^{r̄_0}, Λ_i, ψ_i, and u_i be obtained by (11)-(13). Choose the constants ζ, η, and η̄ such that conditions (14)-(16) hold for any Λ ∈ K, ψ ∈ M, and u ∈ C. Then, the iterative value function V_i^{r̄_0} converges to the optimum as i → ∞.

Remark 2: It is obvious that there must exist constants ζ, η, and η̄ satisfying (14)-(16). In other words, as long as the initial value function is chosen as in (11), the sequence of iterative value functions obtained by the ETIADP algorithm converges to the optimal performance index function. In terms of the monotonicity property, Theorem 1 shows that the value sequence may be neither monotonically decreasing nor monotonically increasing. Nevertheless, under several conditions, the value sequence can increase (decrease) monotonically, which is illustrated by the following theorem.
All the neural networks mentioned above consist of three layers, namely the input, hidden, and output layers, and the action network group and the critic network share this standard three-layer structure. Each network has an input-layer-to-hidden-layer weight matrix and a hidden-layer-to-output-layer weight matrix, together with bias vectors attached to the hidden and output layers. The tan-sigmoid function g(·) is the transfer function of the hidden layers, and the transfer functions of the output layers of the networks ψ̂_i and Λ̂_i are γ(·) and μ(·), respectively. At each iteration of ETIADP, the weight matrices and bias vectors are properly tuned by employing the gradient descent and error backpropagation techniques such that the neural networks precisely approximate their targets. Overall, the implementation structure of the ETIADP algorithm is illustrated by Fig. 4.
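A minimal NumPy sketch of one such three-layer network (tanh hidden layer, linear output) trained by gradient descent on a single target; the layer sizes, learning rate, and target are illustrative, not the article's settings:

```python
import numpy as np

# Hypothetical three-layer approximator: tan-sigmoid (tanh) hidden
# layer, linear output, weights and biases tuned by gradient descent
# on the squared approximation error.

rng = np.random.default_rng(0)
n_in, n_hid = 3, 8
W1, b1 = rng.normal(0, 0.5, (n_hid, n_in)), np.zeros(n_hid)
W2, b2 = rng.normal(0, 0.5, (1, n_hid)), np.zeros(1)

def forward(x):
    h = np.tanh(W1 @ x + b1)           # hidden layer, tan-sigmoid
    return (W2 @ h + b2)[0], h         # linear output layer

def train_step(x, target, lr=0.05):
    global W1, b1, W2, b2
    y, h = forward(x)
    err = y - target                   # output error
    dh = (W2[0] * err) * (1 - h ** 2)  # backpropagated through tanh
    W2 -= lr * err * h[None, :]
    b2 -= lr * err
    W1 -= lr * np.outer(dh, x)
    b1 -= lr * dh
    return 0.5 * err ** 2

x, target = np.array([0.2, -0.1, 0.4]), 1.0
losses = [train_step(x, target) for _ in range(100)]
assert losses[-1] < losses[0]          # approximation error decreases
```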

A. Derivation of the HEIADP Algorithm
Modern computing devices are usually equipped with multiple processors and a neural processing unit (NPU), which accelerates neural network operations such as convolutions and matrix multiplications. For example, the Apple A14 Bionic system on a chip (SoC), designed by Apple Inc., features six central processors and includes a dedicated NPU called the "Neural Engine." Meanwhile, the Kirin 9000 SoC introduced by Huawei Technologies Company is an octa-core chipset integrating a tri-core NPU. The simplified architecture of such MPSs is illustrated in Fig. 5(a), where, without loss of generality, the MPS is assumed to have four central processors c_1, c_2, c_3, and c_4. The MPSs, along with computing techniques such as multithreaded programming and parallel processing, enable the ETIADP algorithm to be executed in a concurrent manner. Specifically, with the MPS shown in Fig. 5(a) employed, the concurrent ETIADP algorithm at the ith iteration step divides the original and impulsive global system state spaces, i.e., X and X̃, into smaller disjoint nonempty subsets X_{i,1}, X_{i,2}, X_{i,3}, X_{i,4} and X̃_{i,1}, X̃_{i,2}, X̃_{i,3}, X̃_{i,4}, respectively, where X = ∪_{j=1}^4 X_{i,j} and X̃ = ∪_{j=1}^4 X̃_{i,j}. Then, X_{i,j} and X̃_{i,j} are sent to the processor c_j, which generates and gathers the sample data required for training (û_i, ψ̂_i, Γ̂_i, V̂_{i+1}^{r_0}) at X_{i,j}/X̃_{i,j} according to (12) and (13). After the NPU receives all the sampled data obtained by the processors and associated with the global spaces X and X̃, it updates the neural networks V̂_{i+1}^{r_0}, û_i, Γ̂_i, and ψ̂_i. Implemented in such a concurrent manner, the ETIADP algorithm iteratively executes the above steps with i → ∞ and converges to the optimum. The task scheduling for the MPS under the concurrent ETIADP algorithm is demonstrated in Fig.
5(b). However, this algorithm may cause the task progress across processors to become unsynchronized. Since the complexity levels of the assigned tasks differ and the processors may operate at different speeds, the time each processor takes to complete its sample generation task at the current iteration step usually differs from the others'. Specifically, as shown in Fig. 5(b), c_1 is the first processor in the MPS to finish collecting the required sample set at the ith iteration step; it then stays idle during [t_1, t_4], waiting for the other processors to catch up. Similarly, c_2 and c_3, as the second and third processors to complete their assigned tasks of the current iteration, also have to remain idle until c_4 gets its job done. Only at time t_4, when all the sampled data associated with the global spaces have been gathered, can the NPU start training the neural networks (û_i, ψ̂_i, Γ̂_i, V̂_{i+1}^{r_0}). Overall, due to this synchronization problem, the concurrent ETIADP algorithm leaves a number of processors of the MPS unused during some time intervals of the execution, so the computational resources of the MPS are not fully utilized.
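The partition-and-gather step of the concurrent ETIADP above can be sketched as follows; `target_fn` is a hypothetical stand-in for the (12)-(13) iteration targets, and interleaved slicing is just one way to form disjoint subsets.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_state_space(states, n_workers):
    """Divide the global state space X into n disjoint, nonempty subsets X_{i,j}."""
    return [states[j::n_workers] for j in range(n_workers)]

def generate_samples(subset, target_fn):
    """Processor c_j evaluates the iteration targets on its assigned subset."""
    return [(x, target_fn(x)) for x in subset]

def concurrent_etiadp_iteration(states, target_fn, n_workers=4):
    """One concurrent ETIADP iteration: every worker must finish before the
    NPU can train, which is exactly where the idle (desynchronized) time arises."""
    subsets = partition_state_space(states, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunks = pool.map(generate_samples, subsets, [target_fn] * n_workers)
    return [sample for chunk in chunks for sample in chunk]  # data for NPU training
```

Note that the final flattening only completes once the slowest worker returns, mirroring the waiting interval [t_1, t_4] in Fig. 5(b).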
To address this deficiency, a concurrent HEIADP algorithm is proposed, which achieves synchronization by forcing the current task of the MPS to transition from sample set generation to neural network training as soon as the first processor finishes its assigned sample generation task. Fig. 5(c) shows the task scheduling for the processors with the concurrent HEIADP algorithm implemented in the MPS. According to Fig. 5(c), at time t_1, the first processor, c_3, completes the assigned sample generation task at the ith iteration step. At the same time, all sample generation tasks still running on the other processors are forced to terminate, and all processors proceed to send the already generated sample data to the NPU. Let B_{i,j} ⊆ X_{i,j} and B̃_{i,j} ⊆ X̃_{i,j} denote the actual "effective areas" associated with the sample dataset generated by c_j at the ith iteration step. Suppose the sample dataset generated by c_j at the ith iteration is S_{i,j} = {(x_1, y_1), (x_2, y_2), . . ., (x_{|S_{i,j}|}, y_{|S_{i,j}|})}, where x_1, . . ., x_{|S_{i,j}|} ∈ X_{i,j} are the input data and y_1, . . ., y_{|S_{i,j}|} are the corresponding expected output data. Then, by choosing a relatively small positive constant ξ, we define the set as the "effective area" associated with S_{i,j}. The concept of "effective areas" is illustrated in Fig. 6. Then, the NPU trains the neural networks (û_i, ψ̂_i, Γ̂_i, V̂_{i+1}^{r_0}) at B_i = ∪_{j=1}^4 B_{i,j} or B̃_i = ∪_{j=1}^4 B̃_{i,j}, while letting them inherit the values of their corresponding predecessors at X\B_i or X̃\B̃_i. Afterward, the concurrent HEIADP algorithm repeats similar steps for the next iteration, and so on. This synchronization mechanism ensures that the processors in the MPS are properly loaded during the whole execution, thus making full use of the multiprocessor computing power.
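A minimal sketch of the first-finisher synchronization, together with one plausible reading of the ξ-based effective area (states within distance ξ of a generated input sample); the helper names are illustrative assumptions, not the article's notation.

```python
import threading

def sampling_worker(subset, target_fn, stop, out, j):
    """Processor c_j generates samples until the stop event fires; the first
    processor to exhaust its whole subset raises the event for everyone else."""
    samples = []
    for x in subset:
        if stop.is_set():
            break                       # terminated by the first finisher
        samples.append((x, target_fn(x)))
    else:
        stop.set()                      # finished first: force the others to stop
    out[j] = samples

def effective_area(samples, candidates, xi):
    """B_{i,j}: candidate states within distance xi of some generated input
    sample -- a hypothetical concrete form of the xi-based definition."""
    inputs = [x for x, _ in samples]
    return {s for s in candidates if any(abs(s - x) <= xi for x in inputs)}
```

In a real run each `sampling_worker` would live on its own thread or processor; the NPU then trains only on the union of the workers' effective areas, inheriting predecessor values elsewhere.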
Another advantage of the HEIADP algorithm over the ETIADP approach lies in its low memory usage. In the concurrent ETIADP approach, before the NPU trains the neural networks, the computing devices or MPSs have to store all the sample data associated with the global state spaces, i.e., X and X̃, in physical memory. Therefore, when dealing with complex or large-scale controlled systems, or if the control action space and system state space are of huge volume, the vast amount of required sample data may not fit into the limited physical memory (i.e., memory overflow), rendering the algorithm infeasible. In contrast, HEIADP at each iteration only locally updates the iterative Γ̂_i, ψ̂_i, û_i, and V̂_{i+1}^{r_0} with respect to the local state spaces B_i and B̃_i. Consequently, the size of the sample datasets that the algorithm collects and stores at each iteration is also reduced significantly, which relaxes both the memory burden and the memory requirement for concurrent execution. In summary, the present approach improves the utilization rate of the computational resources of the MPS while reducing the memory requirement for large-scale problems. Next, the mathematical abstraction of the concurrent HEIADP algorithm is given.

B. Mathematical Abstraction of the Concurrent HEIADP
Given the set sequences {B_i} and {B̃_i}, i = 0, 1, . . ., which, according to (21), should satisfy B_i ⊆ X and B̃_i ⊆ X̃, respectively, define {α_i(ς)}, i = 0, 1, . . ., as the sequence of weight functions. Then, let {A_i}, i = 0, 1, . . ., be a sequence of diagonal matrices built from the weight functions. The algorithm is initialized by the value function V_0^{r_0}, which is chosen according to (11), and the initial Γ_0, ψ_0, and u_0 are computed over the constrained searching spaces of Γ, ψ, and u. In particular, Γ_{−1}, ψ_{−1}, and u_{−1}(·, ρ) are chosen arbitrarily in the sets K, M, and C_ρ, respectively. For any i = 1, 2, . . ., the iterative value functions V_i^{r_0} are obtained via (25), where E is the X̃-dimensional identity matrix. The iterative triggering strategy Γ_i(ς), control law cluster u_i(x), and delay function ψ_i(x) are derived through (26). Then, HEIADP iterates through (25) and (26) with i → ∞.
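Assuming α_i(ς) is the indicator function of the effective area (a reading consistent with the inheritance rule, though the article's exact definition is not reproduced here), the local value update with the diagonal matrices A_i and the identity E can be sketched as:

```python
import numpy as np

def heiadp_value_update(V, backup, mask):
    """Local HEIADP update: apply the iteration backup on the effective area
    (the A_i term) and inherit the predecessor V_i elsewhere (the (E - A_i) term).
    `backup` is a hypothetical stand-in for the value-iteration operator."""
    A = np.diag(mask.astype(float))  # A_i with the weights alpha_i on the diagonal
    E = np.eye(len(V))               # the identity matrix E
    return A @ backup(V) + (E - A) @ V
```

With the mask identically one (B_i = X̃), this collapses to a plain global value iteration, matching the observation that ETIADP is a special case of HEIADP.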
Adding the same term to and subtracting it from (35), it becomes (36). Combining similar terms of (36), we can obtain (37). Similar arguments to (35)-(37) hold for any φ(l − 1). Using the same techniques as in (30)-(39), and from (43) together with the facts (39) and (40), we immediately derive the desired result. The proof is completed. Remark 3: The convergence proof of Theorem 1 can be obtained from that of Theorem 3 by letting B_i = X and B̃_i = X̃ for all i = 0, 1, . . . (thereupon, the ETIADP algorithm is in fact a special case of the HEIADP algorithm, since if the sequence of effective areas satisfies B_i = X and B̃_i = X̃, ∀i = 0, 1, . . ., the HEIADP algorithm reduces to the ETIADP algorithm). Moreover, Theorem 3 shows that as long as the iterative impulsive policies and iterative value functions are updated infinitely many times for any ς ∈ X̃, the HEIADP algorithm is guaranteed to converge to the global optimum. In practical applications, however, the algorithm cannot execute for infinitely many iteration steps on computing devices. Instead, a convergence termination criterion, requiring the change between successive value functions to be less than a threshold ε, is used to stop the algorithm, and the iterative value function that first satisfies this criterion during execution is treated as the "optimal" performance index function. To guarantee the admissibility of the resulting controller, an admissibility criterion identifying the admissible policies is provided via the following theorems, which also show that the HEIADP algorithm obtains an admissible policy within finitely many iteration steps.
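The practical termination rule described above can be sketched as a simple fixed-point loop; `update` is a hypothetical stand-in for one HEIADP iteration.

```python
import numpy as np

def iterate_until_converged(V0, update, eps=1e-6, max_iter=10_000):
    """Run the iteration until successive value functions differ by less than
    eps -- the practical stand-in for letting i go to infinity."""
    V = V0
    for i in range(max_iter):
        V_next = update(V)
        if np.max(np.abs(V_next - V)) < eps:   # termination criterion met
            return V_next, i + 1
        V = V_next
    return V, max_iter                         # budget exhausted without converging
```

The value function returned at the first iteration satisfying the criterion is then treated as the "optimal" performance index function.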
Proof: Assume the conclusion is false. Then, for any δ ∈ N_{≥0}, there always exist an integer 0 ≤ φ̄(δ) ≤ φ(δ + 1) − φ(δ) − 1 and an impulsive state ς ∈ B̃_{φ(δ)+φ̄(δ)} \ Ω_o satisfying the contrary condition. This contradicts the positive definiteness of U. Hence, the assumption is false and the conclusion holds. According to the above convergence and admissibility analysis, the HEIADP algorithm in the form of mathematical abstraction is summarized in Algorithm 1.
Since approximation structures such as neural networks are used in the ETIADP and HEIADP algorithms, errors exist between the approximated and theoretical values. The following theorem analyzes the error dynamics of the proposed methods and establishes the approximation error bound of the critic networks, by which the approximated iterative value functions fall in a small neighborhood of the optimum as i → ∞. In other words, the following theorem addresses the connection between the ideal and neural-network-based realizations of the proposed methods.
V. SIMULATION EXAMPLE
Consider the stochastic process {x(k)} generated by the dynamics F(x, a, ω) expressed in (60). The term Q(e, a) is the immediate cost regarding the current expanded state e and the corresponding impulsive action a at each time step k.
In particular, Z_{e,a} in Q is constructed to suppress the abuse of high-frequency impulsive actions, thus guaranteeing the healthy operation of the impulsive controllers/actuators

Algorithm 2 MPS Task Scheduling Scheme for the Concurrent HEIADP Algorithm
Initialization: Let i = 0, δ = 0; denote the processors in the MPS as c_1, c_2, . . ., c_n; let φ(0) = 0; construct the sequences of state subsets {B̃_κ} ⊆ X̃ and {B_κ} ⊆ X, κ = 0, 1, . . ., where B_0 = B̃_0 = ∅; divide the global original state space X into disjoint nonempty subsets X_{0,1}, X_{0,2}, . . ., X_{0,n} such that X = ∪_{j=1}^n X_{0,j}; divide the global impulsive state space X̃ into subsets X̃_{0,1}, X̃_{0,2}, . . ., X̃_{0,n} according to (21); ∀j = 1, 2, . . ., n, send X_{0,j} and X̃_{0,j} to the central processor c_j.
Iteration:
1: ∀j = 1, . . ., n, processor c_j generates and gathers the sample set required for training (û_i, ψ̂_i, Γ̂_i, V̂_{i+1}^{r_0}) at X_{i,j}/X̃_{i,j} according to (12) and (13);
2: The first processor in the MPS to complete its assigned sample generation task informs the others to terminate their ongoing sample generation tasks simultaneously;
3: Obtain the corresponding effective areas B_i and B̃_i of the currently acquired sample sets from all processors;
4: The NPU trains (û_i, ψ̂_i, Γ̂_i, V̂_{i+1}^{r_0}) at B_i or B̃_i, while letting them inherit the values of their corresponding predecessors at X\B_i or X̃\B̃_i;
6: Let i = i + 1;
7: Construct the subsets X_{i,1}, X_{i,2}, . . ., X_{i,n} such that X\B_i ⊆ ∪_{j=1}^n X_{i,j};
8: Construct the subsets X̃_{i,1}, X̃_{i,2}, . . ., X̃_{i,n} according to (21);
9: ∀j = 1, 2, . . ., n, send X_{i,j} and X̃_{i,j} to the central processor c_j, and go to Step 1.
and systems. Notice that the value of the recorder r̄ represents the time interval from the current impulsive action to the following one. Hence, as r̄ steps down, the frequency of the impulsive actions goes up, and the penalty Z_{e,a} is accordingly strengthened. Besides, the second term of the utility function is the computational and communication cost caused by the ETIC resampling the system state and updating its output. By utilizing the MPS task scheduling scheme in Algorithm 2, it is guaranteed that condition (27) is satisfied. Then, according to Theorem 3, the concurrent HEIADP should theoretically approach the impulsive optimum of (60). If the triggering strategy is fixed as Γ ≡ 1, the controller updates periodically, consuming more computational and communication resources, and the policy space becomes Ω_t = M_0 × C_0 × M_1 × C_1 × · · ·, in which the time-triggered ADP finds the optimal time-triggered controller. However, Ω_t has a smaller volume than the event-triggered impulsive policy space associated with the proposed methods, i.e., Ω_t ⊂ Ω = K_0 × M_0 × C_0 × K_1 × M_1 × C_1 × · · ·. Therefore, the optimal ETIC in Ω should theoretically perform better than the optimal time-triggered controller in Ω_t, which is validated in Fig. 10(b).
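One hypothetical reading of the register countdown and of the frequency penalty's qualitative shape (the article's exact Z_{e,a} and register dynamics are not reproduced here; the forms below are assumptions for illustration):

```python
G = [1, 2, 3, 4, 5]  # admissible impulsive intervals T_1..T_5 from the example

def step_register(r, chosen_interval):
    """Count the register down each step; at zero an impulse fires and the
    register reloads with the interval chosen by the triggering strategy."""
    if r == 0:
        assert chosen_interval in G
        return chosen_interval, True   # impulse applied, register reloaded
    return r - 1, False

def frequency_penalty(r_next, z0=1.0):
    """Qualitative shape of Z_{e,a}: the shorter the interval to the next
    impulse, the larger the penalty on high-frequency impulsive action."""
    return z0 / max(r_next, 1)
```

Under this reading, choosing the shortest interval T_1 = 1 every time maximizes the accumulated penalty, which is exactly the high-frequency behavior the cost is designed to discourage.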
Fig. 11 compares the memory/CPU usage of the concurrent ETIADP with that of the concurrent HEIADP. From Fig. 11(a), it is noticed that the CPU utilization rate of the ETIADP algorithm drops significantly at some time periods, during which some processors in the MPS finish their sample generation tasks of the current iteration earlier than the others and thus transition to the idle state, waiting for the others to catch up. This lack of synchronization across processors causes the algorithm not to fully utilize the computing resources of the MPS. In contrast, in Fig. 11(b), the CPUs are properly loaded throughout the whole execution time of the concurrent HEIADP algorithm, thanks to the novel MPS task scheduling scheme. As for the memory consumption, Fig. 11(c) and (d) show that ETIADP (which represents the traditional ADP-based methods wherein the iterative items are updated globally) has a peak memory usage of nearly 100%. In contrast, HEIADP introduces a novel updating mechanism that reduces the memory usage to around 20%-25%, making it more suitable for computing devices with limited memory. Therefore, based on the above experimental results, HEIADP effectively improves the operating efficiency (in terms of CPU utilization and memory footprint) compared with the traditional ADP-based approaches, while approaching the optimum.

VI. CONCLUSION
To obtain the optimal ETIC for stochastic systems, the ETIADP algorithm is proposed, with its convergence and error-boundedness properties analyzed. The HEIADP algorithm is further developed to fully utilize the computing resources of MPSs. The effectiveness of both methods is verified by the numerical study.

Manuscript received 25 September 2022; accepted 18 December 2022. Date of publication 6 January 2023; date of current version 9 July 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62203120 and Grant 62073085 and in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2021A1515110870. (Corresponding author: Derong Liu.)

Fig. 1. Output trajectories of the ETIC and corresponding register (for illustration purpose only).

Fig. 5. Comparison between the concurrent ETIADP and HEIADP algorithms. (a) Simplified architecture of the MPS. (b) Task scheduling for processors with the concurrent ETIADP algorithm implemented in the MPS. (c) Task scheduling for processors with the concurrent HEIADP algorithm implemented in the MPS (for illustration purpose only).

Fig. 6. Effective areas associated with the sample sets (for illustration purpose only).

Here, x = [x_1, x_2]^T represents the system state, a represents the control action (with the two components of the impulsive control output denoted a and ā), ω = [ω_1, ω_2]^T represents the random variable, and T = 0.01. Due to the existence of ω, (60) is essentially a stochastic system (see Chapters 3-5 of [10] for more explanations). Both the state and control spaces are finite and countable. All possible impulsive intervals of the ETIC are given by G = {T_1, T_2, T_3, T_4, T_5}, where T_1 = 1, T_2 = 2, T_3 = 3, T_4 = 4, and T_5 = 5. By constructing the corresponding register r, recorder r̄, and the "expanded" state e, the utility function is defined by U(e, [a, ā]^T) = Q(e, a) + π(ā)Π(e, ā), where

Fig. 11. (a) CPU usage as the concurrent ETIADP is running. (b) CPU usage as the concurrent HEIADP is running. (c) Memory usage as the concurrent ETIADP is running. (d) Memory usage as the concurrent HEIADP is running.