An Efficient Impulsive Adaptive Dynamic Programming Algorithm for Stochastic Systems

In this study, a novel general impulsive transition matrix is defined, which can reveal the transition dynamics and probability distribution evolution patterns for all system states between two impulsive “events,” instead of between two regular time indexes. Based on this general matrix, the policy iteration-based impulsive adaptive dynamic programming (IADP) algorithm, along with its more efficient variant (EIADP), is developed to solve the optimal impulsive control problems of discrete stochastic systems. Through analyzing the monotonicity, stability, and convergence properties of the obtained iterative value functions and control laws, it is proved that the IADP and EIADP algorithms both converge to the optimal impulsive performance index function. By dividing the whole impulsive policy into smaller pieces, the proposed EIADP algorithm updates the iterative policies in a “piece-by-piece” manner according to the actual hardware constraints. This feature enables these ADP-based algorithms to be fully optimized to run on all “sizes” of computing devices, including those with little memory. A simulation experiment is conducted to validate the effectiveness of the presented methods.

systems. In [11], the local policy iteration ADP algorithm with a novel updating mechanism is developed to relax the computational burden while preserving the monotonicity and global optimality properties of the traditional ADP algorithm. In [12], a new actor-predictor-critic framework is designed and the ADP-based policy iteration method is adopted to approximate the analytical solution of the Bellman equation, with application to attitude-tracking control of hypersonic vehicles. In [13], an approximate policy iteration scheme, where the successive policies are approximated using neural network classifiers, is proposed to find the optimal solution for partially observable Markovian decision problems with finite state and control spaces. The multiagent rollout policy iteration algorithms, in both their exact and approximate forms, are suggested in [14] to solve finite- or infinite-horizon discounted stochastic optimization problems. It is also shown that these policy-iteration-like methods in [14] are strictly offline algorithms and can be used to provide a base policy for an online multiagent rollout scheme. In [15], by introducing neural networks to approximate the complicated minimax Q value functions, the general policy iteration algorithm is utilized to construct the Nash equilibrium policy for two-player zero-sum Markov games. On the application side, the ADP methods can be widely applied to practical optimization problems, such as optimal control for ice-storage air conditioning systems [16], continuously stirred-tank reactor systems [17], and wastewater treatment plants [18].
Controller design, along with its property analysis, for impulsive nonlinear or stochastic systems, such as the attitude and position control of spacecraft using impulsive thrusters [19], [20] and missile guidance and control using impulsive actuators [21], has received considerable attention in recent years. As for stability analysis, Lakshmikantham et al. [22] offered a systematic treatment of the theory of impulsive differential equations and provided global stability criteria to demonstrate how the impulses affect the stability behavior of impulsive systems. Haddad et al. [23] presented Lyapunov, asymptotic, and exponential stability theorems for nonlinear time-dependent and state-dependent impulsive dynamical systems. Li et al. [24] achieved global exponential stability for the impulsive control of nonlinear delay systems. On the basis of stability analysis, the optimality of the impulsive controller is a significant factor considered in the existing literature. Miller et al. [25] solved optimal impulsive control problems subject to state inequality constraints. Dufour and Piunovskiy [26] and Dufour et al. [27] studied the associated Bellman or HJB equation and obtained both the optimal impulsive and gradual (continuous) controls for continuous-time Markov decision processes or piecewise deterministic Markov processes. Basu and Stettner [28] considered a Markov stopping game with impulsive strategies and found its saddle-point equilibria through the corresponding Isaacs-Bellman equations. Wei et al. [29] and Heydari [30] investigated the finite- or infinite-horizon optimal control problems of impulsive nonlinear deterministic systems using ADP methods. Wang and Balakrishnan [31] presented the single network adaptive critic (SNAC, one of the structures of ADP)-based controller design technique for nonlinear systems with impulsive control inputs applied at fixed timing instants.
In [32], the proposed method in [31] is further extended to deal with the cases where the impulsive control instants are determined by a predefined function.
However, there exist several defects in the traditional optimal impulsive control methods. First, for impulsive stochastic systems, the traditional state transition probability matrix P involved in [25]-[28] requires that the action time of the impulsive controller be fixed at the current time k for each system state x ∈ X. Similarly, the multistep transition matrix P^Δ used by the traditional methods requires that, for each initial state x(k), the impulsive controller be applied to the system from time k to k + Δ, of which the time span is strictly fixed as Δ. Unfortunately, the impulsive controller in its simple form is only active when the "impulsive instants" arrive (the impulsive instants refer to two types of time nodes, that is, the time nodes where the controller needs to determine when to apply the next impulsive action, and the time nodes where an impulsive action needs to be applied to the system right now, respectively). If the current time is not an impulsive instant, then the impulsive controller is in the idle state and there is no control input. This fact causes P or P^Δ to fail to represent the state transition probability (or the change of the probability distribution) from one impulsive instant to another for all system states at once, since the control cycle (the time span between two adjacent impulsive instants of the same type) may vary with the state x and need not be a constant. This incompatibility between the restrictions of P or P^Δ and the variable control cycle of the impulsive controller makes the traditional methods [25]-[28] complicated and highly specialized, lacking uniformity and generality.
Second, the algorithms in [25]-[28] are all dynamic programming (DP) based and have to solve the corresponding optimality equations analytically. As the complexity of the systems increases, the computational burden of these DP-based methods grows exponentially, resulting in the "curse of dimensionality" problem. ADP, characterized by strong self-learning and self-adaptation abilities, has shown its advantages in avoiding the DP-related dimensionality disaster. However, in order to utilize ADP to design the optimal impulsive controller for discrete stochastic systems (to the best of our knowledge, no published study has yet investigated this subject), or to analyze the monotonicity, convergence, and optimality properties of the later proposed impulsive adaptive dynamic programming (IADP) and efficient impulsive adaptive dynamic programming (EIADP) algorithms in this article, a novel general transition matrix, which can provide the entire transition dynamics between two impulsive instants for all system states, is essential.
On the other hand, existing ADP-based impulsive controller design schemes [29]-[32] only consider deterministic systems and cannot be directly extended to stochastic cases due to the above-mentioned issues. Moreover, in [29], the numerical implementation method, wherein the whole state and control space have to be discretized, is still computationally expensive, especially when the space size is large. The impulse magnitude of the impulsive controller of [30] is fixed, offering limited versatility. In [31] and [32], the impulsive instants, which are prefixed or determined by a predefined function, cannot guarantee optimality, leading to an inefficient impulsive controller design.
In terms of hardware implementation, the traditional ADP algorithms and impulsive control methods, including [10]-[32], have not taken into account the hardware constraints of the corresponding computing devices, such as the memory size and CPU performance limitations. When faced with higher system complexity, the huge sample dataset needed for ADP and the utilization of sophisticated approximation structures (e.g., deep neural networks) can easily cause memory overload and thus lead to the infeasibility of the algorithms. Furthermore, in the era of the Internet of Everything, a growing number of low-performance devices, such as smartphones, smart wearables, smart homes, and vehicle computers, are now widespread in society and in industrial fields. Hence, how to adapt ADP to a wide variety of computing devices, especially those with limited computing and storage resources, remains a challenging task.
The above problems motivate the present research, and the main contributions of this article can be summarized as follows.
1) A novel general impulsive transition matrix, which is "event"-based (e.g., the event in our case can be defined as "any type of impulsive instant" or "one specific type of impulsive instant") and can reveal the transition dynamics or probability distribution evolution patterns between any two impulsive events, is established. This general matrix acts as the theoretical foundation and is used for the property analysis of the later proposed IADP and EIADP algorithms. Besides, the event associated with the general matrix can be customized by users into "triggering," "switching," "jumping," etc., thus significantly expanding the application domain of these ADP-based methods.
2) ADP is applied to solve the optimal impulsive control problems of discrete stochastic systems for the first time, and the policy iteration-based impulsive ADP algorithm is developed, which can effectively approximate the optimal impulsive performance index function and overcome the "curse of dimensionality" problem of the traditional DP methods.
3) An efficient impulsive ADP algorithm is proposed to reduce the memory consumption of the IADP method, such that these ADP-based approaches are fully optimized to run on all "sizes" of computing devices.
The remainder of this article is organized as follows. In Section II, we present the system dynamic characteristics, the impulsive controller design, and the introduction of the general impulsive transition matrix. In Section III, the policy iteration-based IADP and EIADP algorithms for stochastic systems are developed. In Section IV, the monotonicity, convergence, and optimality properties of the generated iterative control laws and value functions are analyzed. In Section V, a simulation experiment is conducted to validate the effectiveness of the developed methods. Finally, in Section VI, concluding remarks are provided.

A. System Description and the Impulsive Controller Design
The following discrete stochastic system (1) is considered, where x(k) ∈ X denotes the system state, a(k) ∈ A is the control input (action) of the stochastic process, and ω(k) is the external disturbance, which is independent of ω(κ) for all κ < k. The symbols X and A are used to denote the system state space and the action space, respectively. Given the current state x(k) and control input a(k), the next state x(k + 1) is determined by a probability distribution p(·|x(k), a(k)). The state x = 0 is an equilibrium of the stochastic system (1) under the control input a = 0, that is, p(0|0, 0) = 1.

Two types of time nodes, called the decision-making instants and the impulsive control instants, respectively, are essential for impulsive controllers. At the decision-making instants, the impulsive controller determines when the next impulsive control action happens. At the impulsive control instants, the corresponding impulsive action is applied to the stochastic system. More specifically, the decision-making instants are denoted by {k_l}, l = 0, 1, . . . , where the symbol l indicates that it is the (l + 1)th time the decision-making instant has arrived in the stochastic process. For l = 0, we let k_0 = 0. Besides, define N as the set of the impulsive intervals (the time spans between two adjacent decision-making or impulsive control instants), which is expressed as N = {L_1, L_2, . . . , L_max}. (2) Without loss of generality, we assume L_1 < L_2 < · · · < L_max and use |N| to denote the number of elements in N. Based on the definition (2), the decision-making instants {k_l} obviously satisfy k_{l+1} − k_l ∈ N, l = 0, 1, . . . Suppose the current time is the decision-making instant k_l. Then, according to the system state x(k_l), a delay function μ(x, l), which is defined as μ(x, l) : X × N_{≥0} → N, specifies the time interval between the current decision-making instant k_l and the next one k_{l+1} (this time interval is also referred to as the "control cycle" of the impulsive controller).
That is, k_{l+1} = k_l + μ(x(k_l), l). (3) To be more precise, based on (3), the time index k_l is also dependent on the history of the system state [x(k_0), x(k_1), . . . , x(k_{l−1})] and should be rewritten as k_{l,x(k_0),x(k_1),...,x(k_{l−1})}. Hence, if only the information of the system state x at the (l + 1)th decision-making instant is provided, the specific value of k_l still cannot be determined and may vary with the state sequences or system trajectories. Nevertheless, for simplicity, we still use k_l to indicate k_{l,x(k_0),x(k_1),...,x(k_{l−1})} in the rest of this article if no confusion will arise. Then, at the time k_l + μ(x(k_l), l) − 1, which is referred to as the impulsive control instant, the control law group d(·, l) = [d(·, L_1, l), d(·, L_2, l), . . . , d(·, L_max, l)] is applied to the stochastic process. Each element of the control law group is a mapping from X to A, where A is the action space. According to the notation above, the impulsive policy h utilized over the entire time horizon can be expressed in terms of the pairs (μ(·, l), d(·, l)). Define the set of delay functions μ(·, l) applied at the (l + 1)th decision-making instant as M^[l]. Define the set of all control law groups d(·, l) = [d(·, L_1, l), d(·, L_2, l), . . . , d(·, L_max, l)] utilized at the (l + 1)th impulsive control instant as D^[l]. Then, the impulsive policy space can be denoted via the Cartesian product of these sets, where × denotes the Cartesian product and the sets satisfy M = M^[0] = M^[1] = · · ·. The same applies to D^[l], that is, D = D^[0] = D^[1] = · · ·. Two registers r′(k) and r″(k) and a counter c(k) are designed with the controller, which indicate whether or not the decision-making or impulsive control instant has arrived. By observing the current system state x(k), the registers and counter update their own values at each time step. Given the current system state x(k), the dynamics of the registers and counter can be expressed as in (5)-(7), where the initial values are set as r′(0) = r″(0) = 0, c(0) = 0. (Moreover, μ is not predefined or fixed and will be chosen in the space M. For example, μ can be chosen as μ(·, ·) ≡ 1 in (5)-(7).
Furthermore, we will use the later proposed IADP and EIADP algorithms to find the "optimal" delay function μ* in the policy space.) By (5) and (6), if the current time index k satisfies r′(k) = 0, then k is recognized as the decision-making instant k = k_l. In case r″(k) = 1, k is identified as the impulsive control instant. Furthermore, all possible values of r′ and r″ are listed in (8). Then, based on (8), we define the set of all possible register values as R, whose elements can be enumerated as r_0, r_1, . . . , r_M. The positive integer M < ∞ represents the size of R, and we let the element r_0 be r_0 = [0, 0]^T. On the other hand, through (7), the value of the counter c(k) indicates that the current k satisfies k_{c(k)} ≤ k ≤ k_{c(k)+1} − 1, and the delay function μ(x, c(k)) and control law group d(x, c(k)) are utilized in the current impulsive control cycle. In other words, the (c(k) + 1)th decision-making instant has already come before or at the current time k, and the next one has not. Define the "extended" system state s(k) as the stack of the register values and the original state, as in (9). Then, the extended global system state space is S = R × X.
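The timing mechanism above can be sketched as a small simulator. This is an illustrative sketch only: the paper's exact update rules (5)-(7) are not reproduced here, and `simulate_instants`, `mu`, and `step` are hypothetical names; a deterministic `step` stands in for the stochastic transition.

```python
# Illustrative sketch of the decision-making / impulsive-control timing:
# at a decision-making instant the controller picks a control cycle mu(x, l);
# the impulse fires mu - 1 steps later, and the next decision instant
# follows immediately after the impulse.

def simulate_instants(mu, x0, step, horizon):
    """Trace decision-making and impulsive control instants.

    mu(x, l): delay function, returns the control cycle length (>= 1).
    step(x):  hypothetical zero-input state update (deterministic stand-in
              for the stochastic transition p(.|x, a)).
    """
    decisions, impulses = [], []
    x, k, l = x0, 0, 0
    k_l = 0                        # current decision-making instant
    cycle = None
    while k < horizon:
        if k == k_l:               # decision-making instant (r'(k) = 0)
            decisions.append(k)
            cycle = mu(x, l)
            l += 1
        if k == k_l + cycle - 1:   # impulsive control instant (r''(k) = 1)
            impulses.append(k)
            k_l = k + 1            # next decision instant follows at once
        x = step(x)
        k += 1
    return decisions, impulses

# With a constant cycle mu == 2 the two kinds of instants alternate:
decisions, impulses = simulate_instants(lambda x, l: 2, 0, lambda x: x, 8)
```

Running this with a constant delay of 2 gives decision instants at k = 0, 2, 4, 6 and impulsive instants at k = 1, 3, 5, 7, matching the relation k_{l+1} − k_l = μ.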
For convenience of analysis, the symbol s, or more accurately, s_{r,x}, is used to denote the system state s = s_{r,x} = [r^T, x]^T.
Meanwhile, let the elements of the original system state space X be enumerated as X = {e_1, e_2, . . . , e_m, . . .}. Then, S satisfies S = ∪_{j=0}^{M} S_{r_j}, where S_{r_j} = {s_{r_j,e_1}, s_{r_j,e_2}, . . . , s_{r_j,e_m}, . . .}. Suppose the (l + 1)th decision-making instant, that is, k_l, has just occurred. Then, the unified mathematical expression of the impulsive controller applied to the stochastic process is developed in (10), where the impulsive action at the impulsive control instant is determined by the control law d(x, μ(x(k_l), l), l) within the control law group d(x, l).
From (10), it can be seen that the output of the impulsive controller depends not only on the current extended state and the number of previously applied impulsive actions but also on the delay function and control law group. Hence, the symbol ϑ_l(s, k) should be written more precisely as ϑ_{l,μ,d}(s, k). However, for convenience of analysis, we use ϑ_l or ϑ_l(s, k) to indicate ϑ_{l,μ,d}(s, k).

B. Derivation of the General Impulsive Transition Matrix
When the impulsive controller ϑ_l(s, k) is applied to the stochastic process, with the current and next register values fixed as r(k) = σ ∈ R and r(k + 1) = ρ ∈ R, respectively, the partial one-step transition matrix P_{ρ,σ,ϑ_l(s,k)} can be expressed as in (14) [(14)-(18) are displayed on the next page]. The (m, n)th entry p(s_{ρ,e_n}|s_{σ,e_m}, ϑ_l(s_{σ,e_m}, k)) of P_{ρ,σ,ϑ_l(s,k)} denotes the probability of the extended system state s transitioning from s_{σ,e_m} to s_{ρ,e_n}, subject to ϑ_l(s, k). Then, with the current and next register values unfixed, the global one-step transition matrix P_{ϑ_l(s,k)} can be defined as in (15). If the stochastic process is initialized by the system state s(0) = s_{r_i,e_m}, then p_k(s_{r_j,e_n}|s_{r_i,e_m}, ϑ_l(s, κ)), k = 1, 2, . . . , represents the probability that at time k the process occupies the state s_{r_j,e_n} with the impulsive controller ϑ_l(s, κ) applied at κ = 0, 1, . . . , k − 1. Composed of p_k(·), the partial k-step transition matrix P_{k,ρ,σ,ϑ_l(s,κ)} is defined by (16). The relationship between the one-step transition matrix P_{ϑ_l(s,k)} and the probability function p_k(s_{r_j,e_n}|s_{r_i,e_m}, ϑ_l(s, κ)) satisfies (17). Based on the facts above and given the condition wherein the delay function μ(x, l) and control law group d(x, l) are utilized at the (l + 1)th decision-making instant k_l and impulsive control instant k_{l+1} − 1, respectively, the general impulsive transition matrix P_{μ(x,l),d(x,l)} is expressed as in (18). [In practical engineering, the approximate μ̂ and d̂, whose definitions will be given later, are used, and the corresponding general matrix P_{μ̂(x,l),d̂(x,l)} can be obtained similarly as in (18).] The (m, n)th entry of P_{μ(x,l),d(x,l)} denotes the probability of the original system state x transitioning from x(k_l) = e_m at the current decision-making instant to x(k_{l+1}) = e_n at the next decision-making instant.
By introducing a new product operator, P_{μ(x,l),d(x,l)} can be obtained as in (11), where the mapping rule of the operator is such that the elements of P_{μ(x,l),d(x,l)} satisfy (12). Remark 1: In (11) and (12), notice that the general impulsive transition matrix P_{μ(x,l),d(x,l)} is only determined by the delay function μ(x, l) and control law group d(x, l) and is not subject to the variable control cycle of the impulsive controller ϑ_l(s, k), indicating that it can reveal the transition dynamics or probability distribution evolution patterns for all states between two impulsive events (here, the definition of event is narrowed down to "the decision-making instant arrives"). This advantage of P_{μ(x,l),d(x,l)} over the traditional single or multistep transition probability matrices enables the ADP methods to be applied to impulsive stochastic systems and lays the foundation for the monotonicity, stability, and optimality analysis of the later proposed IADP and EIADP algorithms. Given an impulsive policy h, if the extended system state at the (l + 1)th decision-making instant is s(k_l) = s_{r_0,x}, then the expected total reward under the policy h is defined as follows, where the mathematical symbol E{·} represents the expectation of the random variable inside the bracket.
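To make the row-wise construction behind the general impulsive transition matrix concrete, here is a minimal numerical sketch. All matrices, names, and the delay assignment are hypothetical; the point is only that each row is propagated for its own state-dependent number of steps, which a fixed-span multistep matrix P^Δ cannot express.

```python
import numpy as np

# Hypothetical two-state example (states e1, e2). P_free is the one-step
# transition matrix under zero input; P_act[tau] is the one-step matrix at
# an impulsive control instant when the control law d(., tau) is applied.
P_free = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
P_act = {1: np.array([[1.0, 0.0], [0.7, 0.3]]),
         2: np.array([[0.95, 0.05], [0.6, 0.4]])}
mu = {0: 1, 1: 2}   # state-dependent control cycle: mu(e1)=1, mu(e2)=2

def general_impulsive_matrix(P_free, P_act, mu):
    """Row m: distribution over states at the NEXT decision-making instant,
    starting from state m at the current one: mu(m)-1 zero-input steps
    followed by one controlled step at the impulsive control instant."""
    n = P_free.shape[0]
    P = np.zeros((n, n))
    for m in range(n):
        row = np.eye(n)[m]                # start concentrated on e_m
        for _ in range(mu[m] - 1):
            row = row @ P_free            # idle (zero-input) evolution
        row = row @ P_act[mu[m]]          # impulsive action applied
        P[m] = row
    return P

P_gen = general_impulsive_matrix(P_free, P_act, mu)
```

Each row of `P_gen` is still a probability distribution even though different rows span different numbers of time steps, which is exactly the "event-to-event" property the general matrix provides.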
Here, U(·, ·) is the utility function, which assigns a running cost to the corresponding extended system state s(k) and control ϑ(·) at each time step k. In particular, the higher the frequency of impulsive actions is, the greater the utility function (or punishment intensity) U(·, ·) is. Frequent impulsive actions may lead to performance loss or wear and tear of the controller/actuator, which should be discouraged to ensure healthy and normal operation of the systems.
By searching in the impulsive policy space, our goal is to find the optimal delay function μ*(x, l) and control law group d*(x, l). For simplicity, from now on, we focus on analyzing the properties of the stationary impulsive policies. In fact, as will be discussed and proved later, the optimal nonstationary impulsive policy is exactly the optimal stationary one. Under a stationary policy h, the delay functions μ(x, l) and control law groups d(x, l) employed at each impulsive instant l, l = 0, 1, . . . , are the same. Hence, the "stationary" delay function, control law group, impulsive controller, general impulsive transition matrix, and performance index function no longer depend on l. With a slight abuse of notation, the nonstationary variables and their stationary counterparts use the same letters, such as μ(x) and μ(x, 0), d(x) and d(x, 0), etc.
Besides, U_{μ,d} is used to denote the |X|-dimensional impulsive utility vector, whose detailed mathematical definition will be given later; and O and E are the |X| × |X|-dimensional zero matrix and identity matrix, respectively. P^{r_0}_{initial} is the |X| × |S|-dimensional matrix which is expressed as P^{r_0}_{initial} = [E, O, O, . . . , O]. Before presenting the definition of the impulsive utility vector U_{μ,d}, a novel sum operator is needed.
Definition 2: The novel sum operator, whose mapping rule is demonstrated through (21)-(23), is defined as follows.
Let A and B_κ, κ = 1, 2, . . . , be matrices of the same suitable dimension, expressed as in (21), where a_m and b_{m,κ} are the mth row vectors of A and B_κ, respectively. Then, the new sum operator can be defined as in (22) such that (23) holds. Suppose the current time step k is the (l + 1)th decision-making instant, that is, r′(k) = 0 and k = k_l. Then, by the definition of the delay function μ(x), the following impulsive control instant is k + μ(x(k)) − 1, where x(k) represents the system state at time k. The state evolution of the stochastic process from k to k + μ(x(k)) − 1 can be illustrated as in (24).
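The accumulation of running costs from a decision-making instant up to and including the following impulsive control instant can be sketched numerically. The cost vectors, transition matrix, names, and delay assignment below are hypothetical stand-ins; the expectation is taken by propagating the state distribution through the idle steps.

```python
import numpy as np

# Hypothetical running costs: u_free[x] per idle (zero-input) step and
# u_act[x] at the impulsive control instant. Penalizing u_act models the
# discouragement of overly frequent impulsive actions.
P_free = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
u_free = np.array([1.0, 2.0])
u_act  = np.array([5.0, 6.0])
mu = {0: 1, 1: 3}   # state-dependent control cycle

def impulsive_utility_vector(P_free, u_free, u_act, mu):
    """U[m]: expected cost accumulated from the decision-making instant up
    to and including the impulsive control instant, starting from state m.
    A hedged stand-in for the accumulation the impulsive utility vector
    performs for each original system state."""
    n = len(u_free)
    U = np.zeros(n)
    for m in range(n):
        dist = np.eye(n)[m]             # distribution concentrated on e_m
        for _ in range(mu[m] - 1):      # idle steps incur the running cost
            U[m] += dist @ u_free
            dist = dist @ P_free
        U[m] += dist @ u_act            # cost of applying the impulse
    return U

U_vec = impulsive_utility_vector(P_free, u_free, u_act, mu)
```

For state e1 (cycle 1) the cost is just the impulse cost; for state e2 (cycle 3) it is two expected idle costs plus the expected impulse cost at the propagated distribution.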
Here, F_s is the equivalent system dynamics of the extended state s [for the detailed mapping rule of F_s, refer to (1), (5), and (6)]. According to (24), only at the impulsive control instant k + μ(x(k)) − 1 is the stochastic process under the control of d(x(k + μ(x(k)) − 1), μ(x(k))), while the system is under zero input for k, k + 1, . . . , k + μ(x(k)) − 2. Let k = k_l. Then, we can construct the |X|-dimensional impulsive utility vector U_{μ,d} according to (25). Based on (25), it can be seen that the impulsive utility vector represents the accumulation of the running cost U_{ϑ(s,k)}(·) from the current decision-making instant to the following impulsive control instant for each original system state x ∈ X. Then, given the delay function μ(x) ∈ M, we can define V^{*,r_0}_μ as the partially optimal value function, which is obtained by (26).

Theorem 1: Let J^{h*,r_0} be the optimal impulsive performance index function. Let V^{*,r_0}_μ be the value function defined in (26). Then, we can obtain (27).

Proof: The conclusion can be proven in two steps. First, we prove that if, for any original system state x ∈ X, μ*(x) is the optimal delay function, then (28) holds. Assume that (28) is false. Based on the definition of the optimal performance index function J^{h*}, it is clear that there exists at least one original system state x ∈ X which satisfies the inequality (29), or, in vector notation, (30). The impulsive optimum J^{h*,r_0} can be derived as in (31). Let (32) hold. According to the definition of V^{*,r_0}_{μ*}, we have (33). Based on (30) and (33), we obtain the conclusion (34), which contradicts the fact that J^{h*,r_0} is the optimal impulsive performance index function. Hence, the assumption is false and the conclusion J^{h*,r_0} = V^{*,r_0}_{μ*} holds. Second, we prove that for any original system state x ∈ X, the partial optimal value function satisfies (35).
We also prove this conclusion by contradiction. Assume that (35) is false. Then, there exists at least one original system state x̄ ∈ X which satisfies (36), or, in vector notation, (37). Let (38) hold. Then, (37) indicates that there exists an impulsive policy which guarantees the value function J^{h,r_0} to be smaller than J^{h*,r_0}. This is a contradiction. Hence, the assumption is false and the conclusion (35) holds. Based on (28) and (35), we immediately obtain (27).

A. Impulsive Policy Iteration ADP Algorithm
To obtain the optimal impulsive performance index function J h * ,r 0 , the policy iteration-based impulsive ADP algorithm is developed in this section.
Set the iteration index number as i = 0. For all x ∈ X, let d_0(x) = [d_0(x, L_1), d_0(x, L_2), . . . , d_0(x, L_max)] and μ_0(x) denote the initial admissible control law group and delay function, respectively. Let i = 1. The iterative value function V^{r_0}_i can be obtained by (41). The iterative control law group d_i(x) and delay function μ_i(x) are computed by (42) and (43), where the superscript of d depends on the choice of μ(x). Then, the algorithm iterates through (41)-(43) as the iteration index number increases from 1 to infinity.
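The evaluate-then-improve loop above can be illustrated with a hedged tabular sketch: policy evaluation of the event-based value function under the current policy, followed by a greedy improvement over candidate control cycles. The transition rows, utilities, and discount factor are illustrative stand-ins (the discount is added only to keep the evaluation step well-posed in this toy setting, not a feature of the paper's formulation).

```python
import numpy as np

gamma = 0.95        # discount, added only for well-posedness of this toy
N_taus = [1, 2]     # candidate control cycles

# For each cycle tau: row m of P[tau] is a hypothetical event-based
# transition distribution and U[tau][m] the accumulated utility when state
# m uses cycle tau (stand-ins for the general impulsive transition matrix
# and the impulsive utility vector).
P = {1: np.array([[0.8, 0.2], [0.3, 0.7]]),
     2: np.array([[0.6, 0.4], [0.5, 0.5]])}
U = {1: np.array([4.0, 4.0]),
     2: np.array([3.0, 6.0])}

def evaluate(mu):
    """Policy evaluation: solve V = U_mu + gamma * P_mu V."""
    n = len(mu)
    P_mu = np.array([P[mu[m]][m] for m in range(n)])
    U_mu = np.array([U[mu[m]][m] for m in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_mu, U_mu)

def improve(V):
    """Policy improvement: per state, pick tau minimizing U + gamma * P V."""
    n = len(V)
    return [min(N_taus, key=lambda t: U[t][m] + gamma * P[t][m] @ V)
            for m in range(n)]

mu = [1, 1]                      # initial admissible delay assignment
for _ in range(20):              # iterate until the policy is stable
    V = evaluate(mu)
    new_mu = improve(V)
    if new_mu == mu:
        break
    mu = new_mu
```

On a finite problem this alternation terminates at a policy that is a fixed point of its own improvement step, mirroring the convergence claims analyzed later for IADP.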
Remark 2: Define the set X_{μ,τ} as in (44). Let E_{μ,τ} be the |X| × |X|-dimensional matrix such that the (j, j)th entry of E_{μ,τ} is 1 if e_j ∈ X_{μ,τ}, and the other elements of E_{μ,τ} are all set to 0. P^{μ,τ}_{initial} is the |X| × |S|-dimensional matrix which is defined as P^{μ,τ}_{initial} = [E_{μ,τ}, O, O, . . . , O]. Given any delay function μ ∈ M, using the componentwise notation, (42) can be rewritten accordingly, where the mapping rule is such that the (j, j)th entry of the corresponding matrix is Σ_{ℓ=0}^{M} p_τ(s_{r_ℓ,e_j}|s_{r_0,e_j}, ϑ(s, κ)), with p_τ(s_{r_ℓ,e_j}|s_{r_0,e_j}, ϑ(s, κ)) being an element of the multistep transition matrix Π_{κ=0}^{τ−1} P_{ϑ(s,κ)}. For any μ ∈ M, (45) and (46) hold, where the mapping rule of the impulsive controller ϑ(s, k) in (45) and (46) is determined by the choice of μ and d. Particularly, the symbol ϑ_τ(s, k) in (46) indicates the impulsive controller of which the delay function is fixed as μ(x) ≡ τ, x ∈ X. Let d_i(x, τ) be the decision rule which minimizes the right-hand side of (46). Then, comparing (45) and (46), the following lemma can be obtained.

Lemma 1: Let the impulsive utility vector U_{τ,d} and general impulsive transition matrix P_{τ,d} indicate that the delay function μ(x) ≡ τ is applied at the decision-making instant.
Here, the (j, j)th entry of the |X| × |X|-dimensional matrix E_x is 1 if x = e_j, and the other elements of E_x are all set to 0. Then, the iterative delay function μ_i(x) can be set accordingly. In the implementation of the impulsive ADP algorithm, approximate structures such as neural networks are utilized to approximate the corresponding control law group and value function; specifically, the networks d̂_i(x, τ) and V̂^{r_0}_i are constructed for this purpose. The utilized neural networks are trained according to the error backpropagation algorithm and have a three-layer structure, with ζ being the size of the hidden layers. The mathematical models of the critic network and the action network are built from the following parameters: W_hci and W_hai,τ are the weights going to the hidden layers from the network inputs, while W_oci and W_oai,τ are the weights going to the output layers from the hidden layers; b_hci, b_hai,τ, b_oci, and b_oai,τ represent the bias vectors for each layer of the networks. In the training process of the critic and action networks, gradient descent methods are used to adjust the above parameters to achieve a close approximation of d_i(x) and V^{r_0}_i. Besides, with modern computing techniques, such as multithread programming and parallel processing, the ADP algorithms are usually executed in a concurrent manner. In our case, all elements of the neural networks μ̂_i and d̂_i are updated simultaneously at the "policy improvement" or "policy evaluation" phase. However, it is worth noting that, prior to the policy improvement stage, all the sample data needed for neural network training and the redundant space required for concurrent execution have to be loaded into or allocated within the memory of the computing devices.
Hence, once the complexity of the controlled systems increases, or when the size of the state space and control space is large, the massive sets of sample data and the proportionately increasing redundant space required for parallel processing may easily overload computing devices with low memory capacities.
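The three-layer approximation structure described above can be sketched with a plain-numpy critic network trained by gradient descent. The weight names loosely mirror the W_hc/W_oc convention, while the data, activation choice, sizes, and learning rate are purely illustrative assumptions.

```python
import numpy as np

# Minimal stand-in for a three-layer critic: one tanh hidden layer of size
# zeta, linear output, trained by batch gradient descent on squared error.
rng = np.random.default_rng(0)
zeta, n_in = 8, 1
W_hc = rng.normal(0, 0.5, (zeta, n_in)); b_hc = np.zeros((zeta, 1))
W_oc = rng.normal(0, 0.5, (1, zeta));    b_oc = np.zeros((1, 1))

X = np.linspace(-1, 1, 32).reshape(1, -1)   # states (batch as columns)
Y = X**2                                    # hypothetical target values

def forward(X):
    H = np.tanh(W_hc @ X + b_hc)            # hidden layer
    return W_oc @ H + b_oc, H               # linear output

losses = []
lr = 0.05
for _ in range(500):
    V_hat, H = forward(X)
    err = V_hat - Y
    losses.append(float(np.mean(err**2)))
    # Backpropagation through the linear output and tanh hidden layer.
    dW_oc = err @ H.T / X.shape[1]
    db_oc = err.mean(axis=1, keepdims=True)
    dH = (W_oc.T @ err) * (1 - H**2)
    dW_hc = dH @ X.T / X.shape[1]
    db_hc = dH.mean(axis=1, keepdims=True)
    W_oc -= lr * dW_oc; b_oc -= lr * db_oc
    W_hc -= lr * dW_hc; b_hc -= lr * db_hc
```

Note that even this toy run keeps the full batch `X`, both weight matrices, and all intermediate activations resident in memory at once, which is the scaling concern raised above.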
To preserve the concurrence property while reducing the memory demand of the ADP implementation, EIADP algorithms are proposed. Within the EIADP algorithms, the iterative control law group d_i(x) and delay function μ_i(x) are updated partially at the policy improvement phase. As a result, significantly fewer sample datasets need to be collected and the memory footprint for the concurrent execution decreases substantially at each iteration step, thus reducing the memory requirement of the ADP algorithms. As for algorithm implementation, there are two execution modes for the EIADP algorithms, namely the synchronous mode and the asynchronous mode. The EIADP algorithm under synchronous mode only updates one element of the iterative control law group at a time [e.g., d_i(x, τ), τ ∈ N_sub, or there exists a state subset X_sub ⊂ X such that μ_i(x) = μ_{i−1}(x), x ∈ X_sub]. Next, the detailed introduction of the EIADP algorithms is given under the different execution modes.

B. EIADP Algorithm Under Synchronous Mode
The EIADP algorithm under synchronous execution mode is illustrated in Algorithm 1.
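The piece-by-piece idea of the synchronous mode can be sketched as follows: only one slice d(·, τ) of the control law group is re-optimized per improvement step, so only that slice's samples must be memory-resident. All names, the sample sets, and the argmin stand-in below are hypothetical.

```python
# Hedged sketch of the synchronous EIADP schedule: the control law group is
# updated one tau-slice at a time, bounding the per-step working set.
taus = [1, 2, 3]

def improve_slice(d, tau, samples_for):
    """Update the tau-slice of the control law group from its own samples."""
    batch = samples_for(tau)        # only this slice's data is loaded
    d[tau] = min(batch)             # stand-in for the argmin over actions
    return len(batch)               # working-set size at this step

# Hypothetical per-slice sample sets (values stand in for candidate costs).
samples = {1: [5, 3, 4], 2: [9, 2], 3: [7, 8, 6, 1]}
d = {}
peak = 0
for tau in taus:                    # one slice per iteration step
    peak = max(peak, improve_slice(d, tau, samples.get))
# A full-policy update would need all 9 samples resident simultaneously;
# the piece-by-piece schedule peaks at only `peak` samples.
```

The trade-off is more iteration steps for a smaller peak memory footprint, which is what makes the method suitable for low-memory devices.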

C. EIADP Algorithm Under Asynchronous Mode
Define the sequences {δ_i} and {B_i}, i = 0, 1, . . . , where δ_i ∈ N = {L_1, L_2, . . . , L_max} and the state subset B_i ⊆ X. Let μ_0(x) and d_0(x) be the initial values. Let i = 1. V^{r_0}_i is obtained by (41). Then, the iterative control law group d_i(x) is computed by (52). In (52), D_i is a subset of the space D, which is the global set of all control law groups d(x).
This set represents the global set of the control laws d(·, δ_{i−1}). In other words, D_i can also be expressed as in (53), where M_i is a subspace of M defined by M_i = {μ(x)|μ(x) = μ_{i−1}(x) ∀x ∈ X\B_{i−1}}. Based on Theorem 1, (52), and (53), we can derive the pair (μ_i(x), d_i(x)) as in (54). Then, as the iteration index number i increases from 1 to infinity, the EIADP algorithm iterates through (41) and (52)-(54). In addition, the EIADP algorithm under asynchronous execution mode is illustrated in Algorithm 2.
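The asynchronous schedule can be sketched in the same spirit: at iteration i the delay function is re-optimized only on the state subset B_i, and all other entries are carried over, loosely mirroring the carry-over constraint in (53). The subsets and the "improvement" stand-in below are hypothetical.

```python
# Hedged sketch of the asynchronous EIADP update over state subsets B_i.
states = ["e1", "e2", "e3", "e4"]
B = [{"e1", "e2"}, {"e3"}, {"e4"}, {"e1", "e3"}]   # illustrative subsets

def async_update(mu, B_i, improved):
    """mu_i(x) = improved(x) on B_i, and mu_{i-1}(x) elsewhere."""
    return {x: (improved(x) if x in B_i else mu[x]) for x in mu}

mu = {x: 1 for x in states}        # mu_0: initial delay assignment
better = lambda x: 2               # stand-in for the per-state improvement
for B_i in B:                      # one subset per iteration step
    mu = async_update(mu, B_i, better)
# Once the subsets B_i have jointly covered X, every state's entry has
# been improved at least once.
```

Convergence of this schedule relies on every state appearing in infinitely many (here: at least one of the) subsets, which matches the covering condition implicit in the asynchronous analysis.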

IV. PROPERTY ANALYSIS
In this section, we analyze the properties of the iterative value functions and control law groups obtained by the policy iteration-based EIADP algorithms in terms of monotonicity, admissibility, and convergence.
Proof: With the iteration index number i = 0, define the sequence of value functions V^{r_0}_{1,q}, q = 0, 1, . . . , as in (55). For the system state x ∈ (X\B_0) ∩ X_{μ_0,δ_0}, we obtain (56).

Algorithm 2 EIADP Algorithm Under Asynchronous Execution Mode
Initialization: μ_0(x), d_0(x), and the parameter ε are set in the same way as in Algorithm 1; construct the sequence of time intervals {δ_κ} and the sequence of state subsets {B_κ}, κ = 0, 1, . . ., such that δ_κ ∈ N and B_κ ⊆ X.
Iteration:
1: Let the iteration index i = 0;
2: Let i = i + 1;

Then, considering any x ∈ X\(X_{μ_0, δ_0} ∪ B_0), we have (57) and (58). According to (56)-(58), for all x ∈ X, we derive (59). Using mathematical induction, for any q = 0, 1, . . ., we obtain (60).
Based on (55) and (60), we obtain (61). With q increasing to infinity, (61) becomes (62), and the limit lim_{q→∞} P^{q+1} yields (63) and (64). According to Definition 1, (63) and (64) indicate that the stationary impulsive policy is admissible. Then, the sequence {V^{r_0}_{1,q+1}} can be expressed as in (65). Since the sequence {V^{r_0}_{1,q+1}} is monotonically nonincreasing and bounded below by zero, it converges to V^{r_0}_{1,∞}, which satisfies (66). In fact, it can be proved that the conclusion holds as long as the initial value function V^{r_0}_0 is induced by an admissible impulsive policy. Finally, the proof is completed by mathematical induction.
Theorem 3: Let V^{r_0}_i, μ_i(x), and d_i(x) be the iterative functions obtained by the asynchronous EIADP algorithm. Then, (72) holds. The policy-space symbol in (72) denotes a subset of the global impulsive policy space, whose definition is given in the following proof.
Proof: Based on Theorem 2, the iterative value functions obtained by the asynchronous EIADP algorithm are monotonically decreasing and converge as i → ∞. Using the converged value function, we have (73), where the other elements of d_{ı(η,τ)+1} satisfy d_{ı(η,τ)+1}(x, τ′) = d_{ı(η,τ)}(x, τ′) for all τ′ ≠ τ, τ′ ∈ N. Based on the definition of the set G̃, the number of occurrences of τ ∈ G̃ within the sequence {δ_i} is finite, which means that the control law d_i(x, τ), τ ∈ G̃, is no longer updated after finitely many iteration steps. The iteration step at which d_i(x, τ) is updated for the last time is denoted by ĩ(τ). In other words, for any i ≥ ĩ(τ), d_i(x, τ) remains unchanged, that is, d_{ĩ(τ)}(x, τ) = d_{ĩ(τ)+1}(x, τ) = · · · for all x ∈ X and τ ∈ G̃. On the other hand, for any τ ∈ N\G̃, with η → ∞, (73) becomes (74). According to the facts above, the iterative control law groups obtained by the EIADP algorithm under asynchronous execution mode converge as i → ∞, and we denote the converged item as d^L_∞(x). Based on the definition of Ĝ, for any x ∈ X\Ĝ, there must exist a time interval τ̄ ∈ N such that x ∈ B_{ı(η,τ̄)} holds for any η = 0, 1, . . .. Consequently, according to Lemma 1, (75) holds. When η → ∞, (75) can be developed into (76), which holds for any x ∈ X\Ĝ. Similar to the iterative control laws, for any system state x ∈ Ĝ, there is no update of μ_i(x) after finitely many iterations. The iteration step at which μ_i(x), x ∈ Ĝ, is updated for the last time is expressed as ı̂(x); the value of μ_i(x) for x ∈ Ĝ then stays the same as i continues to approach infinity, that is, μ_{ı̂(x)}(x) = μ_{ı̂(x)+1}(x) = · · · for all x ∈ Ĝ. Hence, the converged value of the sequence {μ_i(x)} obtained by the asynchronous EIADP algorithm can be denoted as lim_{i→∞} μ_i(x) = μ^L_∞(x) for all x ∈ X. According to Remark 2, Lemma 1, (74), and (76), the iterative functions and the iterative value function V^{r_0}_{i+1} within the asynchronous EIADP algorithm satisfy (77) and (78). Letting i → ∞ and combining with (77), (78) can be expressed as (79), where the converged stationary impulsive policy is h^L_∞.
Define the corresponding subset of the global impulsive policy space, and let it denote the set of all admissible impulsive policies therein. According to (79), choosing an arbitrary impulsive policy h = (μ(x, 0), d(x, 0), μ(x, 1), d(x, 1), . . . , μ(x, l), d(x, l), . . .) from this subset, we can derive (80). Applying the impulsive policy h and (79) to (80) n times, we obtain (81). Since h is chosen arbitrarily, we have J_{h^L_∞, r_0} = V^{r_0}_{∞,L} ≤ J_{h, r_0} for all h in this subset, which means that the stationary impulsive policy h^L_∞ is the optimal one in the local policy space. Consequently, V^{r_0}_{∞,L} is the local optimal impulsive performance index function.
This completes the proof.

In summary, provided that Ĝ = ∅ and G̃ = ∅, the local policy subset coincides with the global impulsive policy space, and the asynchronous EIADP algorithm converges to the global optimal impulsive performance index function. As for the synchronous EIADP algorithm, the corresponding monotonicity, convergence, and optimality analysis can be derived similarly to Theorem 3 and is omitted here.
V. SIMULATION EXAMPLE

Consider the system (86), where x = [x_1, x_2]^T is the system state, u is the control input from the impulsive controller, and ω = [ω_1, ω_2]^T is the external disturbance. Discretizing (86) with the sampling interval T = 0.01 yields (87). The set of all admissible impulsive intervals is configured as N = {1, 2, 3, 4, 5}. Then, with the register r(k) and the extended system state s(k) defined by (5)-(9), we construct the utility function for (86) as in (88), where Q = I (the identity matrix of suitable dimension) and the penalty matrix R_{s,u} is defined as in (89). On the one hand, the utility function U(s, u) assigns a running cost to the current system state and the corresponding control input; on the other hand, U(·) and the penalty matrix R_{s,u} are designed so that a high frequency of impulsive actions is discouraged, reducing wear and tear of the controllers and systems. (The register r indicates the time span between two adjacent impulsive actions: the lower the value of r, the higher the frequency of impulsive actions and the greater the penalty intensity R_{s,u}.) Let the global state space be expressed as in (90). Then, the sequences {B_i} and {δ_i} are chosen as B_i = X_{rem(i,4)} and δ_i = rem(i, 5) + 1, respectively, where the function rem(a, b) outputs the remainder of a divided by b. Hence, the expression of the function ı(η, τ) and the set T_{x,τ}, x ∈ X_0, can be obtained, the latter as in (91). From (91), the number of elements in the set T_{x,τ}, x ∈ X_0, τ ∈ N, is infinite. In fact, by enumerating the elements in T_{x,τ} similarly to (91), the same conclusion holds for any x ∈ X, which means φ(x, τ) = ∞ for all x ∈ X and τ ∈ N. In other words, we can derive Ĝ = G̃ = ∅, indicating that the asynchronous EIADP algorithm should converge to the global optimum of system (86). In the numerical implementation of the EIADP method, neural networks are used to approximate the corresponding iterative control law groups and value functions.
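The remainder-based construction of {B_i} and {δ_i} in this example can be checked numerically. The sketch below (helper names are ours) lists the (state block, impulsive interval) pair touched at each iteration; because 4 and 5 are coprime, every pair recurs infinitely often, consistent with Ĝ = G̃ = ∅.

```python
def rem(a, b):
    """Remainder of a divided by b, as used for B_i and delta_i."""
    return a % b

def schedule(num_iters, num_blocks=4, intervals=(1, 2, 3, 4, 5)):
    """(block index of B_i, impulsive interval delta_i) per iteration i,
    with B_i = X_{rem(i,4)} and delta_i = rem(i,5) + 1."""
    return [(rem(i, num_blocks), intervals[rem(i, len(intervals))])
            for i in range(num_iters)]
```

Over any 20 consecutive iterations, `schedule` visits each of the 4 × 5 = 20 (block, interval) pairs exactly once, and the pattern then repeats with period lcm(4, 5) = 20.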
Let the initial system state be x_0 = [−1, 0.3]^T. Fig. 1 shows the initial system state trajectories and impulsive value function. The output trajectory of the initial admissible controller is illustrated in Fig. 1(c), where the initial delay function is set as μ_0(x) ≡ 4. The optimal system state and control trajectories, together with the optimal control law group obtained by the asynchronous EIADP algorithm, are demonstrated in Fig. 2. Meanwhile, the optimal delay function μ^L_∞ obtained by the asynchronous EIADP algorithm is illustrated in Figs. 3 and 4. Specifically, Fig. 3 shows the output trajectory of μ^L_∞ at time k = 0, 1, . . . , 250 as the system is running, and the optimal impulsive interval for each system state x ∈ X (i.e., the mapping rule of μ^L_∞) is presented in Fig. 4, with different colors indicating the corresponding impulsive intervals. Fig. 5 compares the optimal impulsive value functions obtained by the impulsive ADP algorithm and the asynchronous EIADP algorithm. From Fig. 5, it can be seen that, with the sequences defined as B_i = X_{rem(i,4)} and δ_i = rem(i, 5) + 1, the proposed asynchronous EIADP algorithm converges to the same global optimum as the impulsive ADP algorithm, thus validating the conclusions of Theorem 3 and Remark 3.
On the other hand, if N = {1}, or the iterative delay function is fixed as μ_i(x) ≡ 1, i = 0, 1, . . ., the impulsive ADP algorithm reduces to the traditional ADP algorithm. The converged optimum of the traditional ADP algorithm is shown in Fig. 5. Since the traditional policy space is a subset of the impulsive policy space, the optimum over the latter is guaranteed to be no worse than the one over the former in terms of performance. Moreover, Fig. 5(b) shows that the impulsive ADP algorithm obtains a smaller performance index function than the traditional ADP algorithm, which verifies the above conclusion. Fig. 6 demonstrates the difference in memory usage while the IADP and EIADP algorithms are running. (Memory usage is tracked by the corresponding hardware sensors and the software "CPUID Hardware Monitor PRO," with a total physical memory capacity of 64 GB.) It is shown that the impulsive ADP algorithm consumes around 60%-80% of the total memory space, whereas the proposed asynchronous EIADP algorithm reduces the memory usage to approximately 15%. Therefore, combining Figs. 5 and 6, it is validated that the EIADP algorithms significantly relieve the memory load of the impulsive ADP algorithm while still obtaining the global optimal impulsive performance index function.

VI. CONCLUSION
In this article, the policy iteration-based IADP algorithm is developed to solve the optimal impulsive control problems of discrete stochastic systems. A novel general impulsive transition matrix is established to analyze the monotonicity, stability, and convergence properties of the IADP algorithm. It is proven that the iterative value functions converge to the optimum as the iteration index increases to infinity. The efficient impulsive ADP (EIADP) algorithms with two execution modes, that is, asynchronous and synchronous modes, are proposed to reduce the memory consumption of the traditional ADP methods. Finally, a simulation experiment is given to illustrate the performance of the developed algorithms.