Distributed Reinforcement Learning for Networked Dynamical Systems

In this article, we propose a scalable algorithm for learning distributed optimal controllers for networked dynamical systems. Assuming that the network comprises nearly homogeneous subsystems, each subcontroller is trained using the local state and input information of its corresponding subsystem together with filtered information from its neighbors. Thus, the costs of both learning and control become independent of the number of subsystems. We show the optimality and convergence of the algorithm for the case when the individual subsystems are identical, based on an algebraic property of such networks. Thereafter, we show the robustness of the algorithm when applied to general heterogeneous networks. The effectiveness of the design is investigated through numerical simulations.


I. INTRODUCTION
DISTRIBUTED control is a widely used control methodology for networked dynamic systems (NDS), with examples such as multiagent networks [1], [2], [3]; robotic networks [4]; social networks [5]; transportation networks [6]; and smart grids [7]. With the recent emergence of advanced machine learning techniques for control [8], especially of reinforcement learning (RL) [9], several papers have been written on developing RL-based distributed control of networked systems [10], [11], [12]. The majority of these studies, however, only consider distributing the actuation of the learned controllers while the learning process itself still remains centralized. Due to this centralized structure, the computational complexity of learning increases notably as the number of subsystems in the network increases.
The limitation in scaling up RL algorithms has spurred recent developments in the model-free distributed design [13] of controllers, whereby multiple subsystems can independently learn their local controllers through intercommunication. For the sake of convenience, we will hereafter refer to these methodologies as distributed RL [14], [15], [16]. A significant advantage of distributed RL is that the computational complexity remains invariant to the number of subsystems, as learning can be executed in parallel. However, guaranteeing the optimality and stability of the closed-loop system for any generic NDS with any arbitrary topology still remains an open question. One approach that has been proposed in recent literature to make the problem tractable is to exploit the structure of the network or of the control objective. For example, the authors in [17] and [18] developed distributed RL for linear multiagent systems when the agent dynamics are decoupled, assuming that the reward function depends on a global state that is accessible by every agent. The authors in [14], [19], and [20] extended this to cases when the agent dynamics are interconnected. The approach relies on an exponential decay property of networks such that the effects of local decisions act only locally. However, this approach would not be practical when the network of interest does not have the exponential decay property, e.g., for networks with short average distances.
In this article, we present an alternative approach for exploiting network structure for distributed RL, based on an algebraic property of networks that are composed of identical subsystems connected over a complete graph [21], [22]. We refer to this type of NDS as a homogeneous NDS. Our main contributions are as follows.
1) Starting with the ideal case of a homogeneous NDS, we first develop a centralized RL algorithm that finds a distributed optimal controller. The optimality of the distributed controller relies on the fact that the overall Riccati equation can be separated into two independent Riccati equations: one for the intrasubsystem dynamics and another for the intersubsystem dynamics. These two dynamics, respectively, obey the difference and the average of the subsystem states. Therefore, by applying an existing RL algorithm, such as off-policy iteration (Off-PI) [23], which is a one-shot and quadratically convergent algorithm, to the state and input data, we can find a distributed optimal controller. However, this algorithm by itself is not completely scalable because it requires average data across all subsystems.

2) To solve the scalability issue, we augment our method with a consensus-based model-free distributed observer [24] to estimate the average state instead of the entire state. The distributive nature of the observer allows the subsystems to have their own individual learning algorithm that can learn the RL controller distributively as well, requiring state information from only their neighboring subsystems over any connected and undirected network graph. The convergence of the proposed scheme, along with the optimality and stability of the learned distributed controller, is theoretically guaranteed due to the finite-time convergence of the distributed observer.
The communication topology of the distributed observer is independent of that of the plant NDS. Sparser communication topologies, however, as will be shown later in simulations, require a longer convergence time for the learning phase.

3) Finally, we consider the heterogeneous case, i.e., when the network is comprised of subsystems with different dynamic models, connected over any arbitrary network topology. Direct extension of our distributed RL design to such networks while guaranteeing stability and convergence is unfortunately not possible. We therefore establish the applicability of the design to heterogeneous NDSs using a robustness approach. We specifically derive a numerical bound on how much the individual subsystems are allowed to differ while the proposed algorithm can still guarantee stability and convergence. We validate our claims using numerical simulations.

To the best of our knowledge, this is the first distributed learning method that allows for designing the network structure among learners and, at the same time, guarantees convergence to the optimal controller. Although this theoretical guarantee is limited to nearly homogeneous NDSs, the proposed method has a scale-free property. For linear NDSs consisting of κ subsystems, each of dimension n, the computational complexity of one of the computationally efficient RL algorithms [23] is of order (κn)^6 [25], whereas that of the proposed method is of order n^6. Therefore, the proposed method is effective for NDSs consisting of a large number of (relatively small dimensional) subsystems. The most relevant existing works, as indicated above, are [14], [19], and [20], which address weakly coupled networks such that the effects of local decisions act only locally. In contrast, the proposed method targets strongly coupled networks where local actions have global impacts, two leading examples being consensus networks [26] and power networks [7]. Integration of these two different methods is one of our future goals.
The rest of this article is organized as follows. Section II describes the problem setting. Section III-A presents the centralized RL algorithm. A way of distributing the algorithm is described in Section III-B. Section IV establishes the robustness of the proposed algorithm for general network systems in which heterogeneous subsystems are interconnected via any graph structure. Section V shows the effectiveness of the proposed algorithm through simulations. Finally, Section VI concludes this article.

II. PROBLEM SETTING

Notation: Given vectors p_1, . . ., p_n, their stacked vector is denoted by [p_1^T, . . ., p_n^T]^T. For x ∈ R^n, the 2- and infinity-norms are denoted by ||x||_2 and ||x||_∞. For functions f(x), g(x), if there exist M and δ such that ||x|| < δ ⇒ ||f(x)|| ≤ M||g(x)||, then we say f(x) = O(g(x)). Given systems Σ_1 : u → y and Σ_2 : y → u, we denote their closed-loop interconnection as (Σ_1, Σ_2).

A. Networked Dynamical Systems
Consider a network system where κ subsystems are coupled through a graph. For k ∈ {1, . . ., κ}, the kth subsystem dynamics are described by (1), where x_k ∈ R^n is the state and u_k ∈ R^m is the control input. In (1), {A_a, A_b, B_a} represents a nominal dynamical state-space model for the subsystems, while {Ã_kk, Ã_kl, B̃_k} represents a perturbation from that model for each subsystem. Throughout this article, we suppose that the perturbation is small, i.e., the individual subsystems are nearly homogeneous and the interconnection weights are close to each other. Later, in Section IV, we will discuss how the perturbation affects our learning algorithm.
Let Ã ∈ R^{κn×κn} be such that the (k, l)th n-by-n block matrix is Ã_kl for k ≠ l, while the (k, k)th block matrix is Σ_{l=1}^{κ} Ã_kl, and let B̃ := diag(B̃_1, . . ., B̃_κ). Using the notation in (2), the interconnected network can be written as Σ in (3). Our objective is to develop a distributed learning algorithm for designing a distributed optimal controller for the system Σ in (3). We make the following assumptions.
Assumption 1: The matrices A_a, A_b, B_a, Ã_kl, and B̃_k are unknown.
Assumption 2: The matrix A is Hurwitz.

Assumption 1 implies that although we know that the system of interest is a network of κ subsystems coupled together via a graph, we do not know its state-space model. Assumption 2 is satisfied in many real-world networks. Under these settings, we formulate a distributed learning problem in the next section.

B. Problem Formulation
For k ∈ {1, . . ., κ}, let L_k ⊆ {1, . . ., κ} be a given index set of neighboring subcontrollers such that the graph corresponding to {L_k}_{k∈{1,...,κ}} is connected and undirected. We impose this distributed structure on both the learning phase and the control phase. In other words, each subsystem can only use its own and its neighbors' information for learning its subcontroller and for computing the input of the learned subcontroller. We suppose that both phases are performed within a single episode, i.e., the learning is performed for t ∈ [0, T] with a given parameter T, while the control is performed for t > T. A schematic picture of the two phases is shown in Fig. 2. Each phase is formulated as follows.
1) Control phase: Let the kth subcontroller be described as in (4), where y_k ∈ R^r is the output used for communicating with neighboring subcontrollers, x_k and u_k are defined in (1), and K_k is a dynamical map to be designed. The control objective of the set {K_k}_{k∈{1,...,κ}} is to minimize the cost function J in (5) with Q and R in (6), where (7) is imposed to ensure Q ≥ 0 and R > 0. Using these structured Q and R, we can construct a distributed RL algorithm while appropriately controlling the behavior of the nearly homogeneous network Σ.

2) Learning phase: The requirements for designing K_k in (4) are twofold. First, it should be data-driven due to Assumption 1. Second, only neighboring information should be available for the design, where y_k^ini is defined as in (9) for a given K_k^ini. Note that u_k is not generated by K_k^ini but will be given in feedforward mode. In this setting, the controller design can be described as a map A from the neighbors' data to the kth subcontroller, i.e., (10), where the weights Q and R are assumed to be shared among subsystems. In conclusion, our problem is summarized as follows.

Problem 1: Consider Σ in (3) and J in (5) with Q and R in (6), where Q_a, Q_b, and R_a are given such that (7) holds. Let Assumptions 1 and 2 hold. Given {L_k}_{k∈{1,...,κ}}, T, and {u_k(t)}_{k∈{1,...,κ}} for t ∈ [0, T], find A in (10) and {K_k^ini}_{k∈{1,...,κ}} in (9) such that the resultant {K_k}_{k∈{1,...,κ}} makes J as small as possible.
Owing to the distributed architecture of both the learning and control phases, as shown in Fig. 2, the complexity of computing K_k in (10) and {u_k, y_k} in (4) is entirely independent of the number of subsystems. The main challenge of Problem 1 lies in the distributive nature of the learning phase.

III. PROPOSED METHOD
We first present a centralized RL algorithm to design a distributed optimal controller. Later, by distributing the algorithm, we show a solution to Problem 1. Throughout this section, for developing the algorithm, we impose the following assumption on Σ in addition to Assumptions 1 and 2.
Assumption 3: Ã = 0 and B̃ = 0 in (2).

Assumption 3 implies that the dynamics of the individual subsystems are identical while their interconnections are symmetrically identical. In other words, under this assumption, Σ is a network of homogeneous subsystems coupled together via a complete graph. Later, in Section IV, we will investigate the heterogeneous case.

A. Centralized RL Algorithm for Designing a Distributed Optimal Controller
We start from the following lemma.

Lemma 1: Consider Σ in (3) with J in (5). Let P_a > 0 and P̄ > 0 be the solutions of (11) and (12), respectively, where the coefficient matrices are defined accordingly. Then, the control law u = −Kx with K in (14), where K_a := R_a^{-1} B_a^T P_a and K̄ := R_a^{-1} B_a^T P̄, minimizes J.

Proof: Note that the positive-definite solution of the Riccati equation Ric in (15) is unique because A is stable. We will show that the solution can be described as (16), where P_b := (P̄ − P_a)/κ. The positive definiteness of P in (16) follows if P̄ > 0 and P_a > 0 because P is block-diagonally dominant. The n-by-n block-diagonal and off-diagonal parts of Ric can be written as (17) and (18), respectively. We see that (17) is equivalent to the sum of (11) multiplied by κ − 1 and (12). Also, by subtracting (11) from (12), we obtain (18). Therefore, P in (16) is the unique solution of Ric. Using this P, the optimal gain PBR^{-1} can be written as (14), which completes the proof.

This lemma means that the κn-by-κn Riccati equation Ric in (15) can be decomposed into the two n-by-n Riccati equations (17) and (18), each of whose size is independent of the number of subsystems κ. In other words, constructing (14) by solving those two equations is scalable in κ. We note that the resultant controller u = −Kx has a distributed structure that is compatible with the structure of Σ. To solve (17) and (18) in a model-free way, we define the arithmetic averages of x and u as in (19) and the differences from the averages as in (20). Then, we take the coordinate transformation of x and u given in (21). Let us focus on the first n equations of (21). The Riccati equation of the average dynamics with the cost J̄ := ∫_0^∞ (x̄^T Q̄ x̄ + ū^T R_a ū) dt coincides with (12). Therefore, by applying any one-shot Q-learning algorithm, such as Off-PI [23], to a finite-length dataset {x̄, ū} with J̄, we can find K̄ in (14) without knowing the system model. Similarly, for any choice of k ∈ {1, . . ., κ − 1}, we can find K_a by applying the same algorithm to the data {x̃_k, ũ_k} with the cost J_a. Also, we can see that this result holds when k = κ by redefining ξ and μ such that x̃_κ and ũ_κ are included. A brief summary of the design steps is as follows.
1) Collect the data {x, u} by applying exploration inputs. 2) Choose k ∈ {1, . . ., κ}. Compute {x̄, ū} and {x̃_k, ũ_k} in (19)-(20). 3) Learn K̄ and K_a by applying Off-PI to the datasets with J̄ and J_a, respectively. 4) Construct K in (14). The details of the Off-PI and the conditions on the data for guaranteeing convergence will be shown later in Section III-C. The above algorithm has a structure in which the information of all subsystems is sent to one place, and then the learned control gains are distributed to the individual subsystems.
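The decomposition in Lemma 1 that underlies Steps 3 and 4 can be checked numerically. The sketch below uses illustrative stand-in matrices (not the article's) and assumes the Kronecker reading Q = I ⊗ Q_a + 11^T ⊗ Q_b and R = I ⊗ R_a of the structured weights in (6): the κn-dimensional Riccati solution of a complete-graph network of identical subsystems is reassembled from the two n-dimensional solutions, as in (16).

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative subsystem data (hypothetical, chosen to be stable).
kappa, n = 4, 2
Aa = np.array([[-1.0, 1.0], [0.0, -2.0]])   # intra-subsystem dynamics
Ab = 0.1 * np.eye(n)                        # inter-subsystem coupling
Ba = np.array([[0.0], [1.0]])
Qa, Qb, Ra = np.eye(n), 0.1 * np.eye(n), np.eye(1)

I, E = np.eye(kappa), np.ones((kappa, kappa))

# Full network over a complete graph of identical subsystems.
A = np.kron(I, Aa) + np.kron(E - I, Ab)
B = np.kron(I, Ba)
Q = np.kron(I, Qa) + np.kron(E, Qb)
R = np.kron(I, Ra)
P = solve_continuous_are(A, B, Q, R)        # kappa*n-dimensional ARE

# Two small AREs: the average mode (bar) and the difference mode.
P_bar = solve_continuous_are(Aa + (kappa - 1) * Ab, Ba, Qa + kappa * Qb, Ra)
P_a = solve_continuous_are(Aa - Ab, Ba, Qa, Ra)

# The large solution is P = I ⊗ P_a + E ⊗ (P_bar - P_a)/kappa, as in (16).
P_rec = np.kron(I, P_a) + np.kron(E, (P_bar - P_a) / kappa)
print(np.allclose(P, P_rec, atol=1e-6))
```

The two small equations here play the roles of (11) and (12): one governs the difference mode A_a − A_b, the other the average mode A_a + (κ − 1)A_b, and neither grows with κ.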
The above algorithm, however, is not completely scalable because it requires all-to-all communication among subsystems for computing x̄ and ū in (19). To overcome this difficulty, we aim to estimate the averages by the model-free consensus-type distributed observer in [24], which is described in the next subsection.
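Concretely, the averages in (19) and differences in (20) are plain means over the subsystem index, and it is the mean that requires global information. A minimal sketch with hypothetical array shapes:

```python
import numpy as np

# Hypothetical trajectory data: kappa subsystems, n states, m inputs, N samples.
rng = np.random.default_rng(0)
kappa, n, m, N = 10, 3, 1, 200
x = rng.standard_normal((kappa, n, N))   # x[k] is the state trajectory of subsystem k
u = rng.standard_normal((kappa, m, N))

# Arithmetic averages over the subsystems, as in (19): needs data from ALL k.
x_bar = x.mean(axis=0)                   # shape (n, N)
u_bar = u.mean(axis=0)

# Differences from the averages, as in (20): each only needs x_k and x_bar.
x_tilde = x - x_bar                      # shape (kappa, n, N)
u_tilde = u - u_bar

# The differences sum to zero over k, so only kappa - 1 of them are
# linearly independent, matching the choice k in {1, ..., kappa - 1}.
print(np.abs(x_tilde.sum(axis=0)).max())
```

Once x̄ is replaced by a locally computed estimate, each subsystem can form its own x̃_k without any global exchange.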
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

B. Average State Estimation
Unlike the original average-state observer in [24], which estimates a scalar-valued average of multiple scalar variables, we need to estimate the n-dimensional vector x̄. Thus, for each of the n entries, we construct a distributed observer composed of κ subobservers.
Let x_k^i ∈ R be the ith entry of the kth subsystem's state x_k ∈ R^n in (1). Our objective here is to estimate the average x̄^i of these entries across the subsystems. Let the graph structure among the subobservers be specified by {L_k}_{k∈{1,...,κ}} introduced in Section II-B. Also, let L ∈ R^{κ×κ} be its graph-Laplacian matrix, whose (k, j)-entry is defined as in (23). Though the graph could differ for each i, for simplicity we suppose that the graphs of all distributed observers are identical. Following [24], the ith entry of the kth subobserver dynamics is described by (24), where the gain γ_k^i satisfies (25) and the convergence time t_*^i is defined in (27). The relation (26) implies that the outputs of all subobservers become identical to the true average for any t after the finite time t_*^i; please see [24] for the details of this proof.
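Since the subobserver equations (24)-(27) of [24] are not reproduced above, the following sketch is only a simplified illustration of the mechanism: a sign-based average-consensus observer for static scalar signals over a path graph, with the sigmoid smoothing of Remark 1 and forward-Euler integration. The protocol form and all parameter values are assumptions of this sketch, not the article's exact design.

```python
import numpy as np

kappa = 5
rng = np.random.default_rng(1)
x_static = rng.uniform(0.0, 1.0, kappa)    # one scalar entry x_k^i per node, held constant

# Path-graph Laplacian over the neighbor sets {L_k}.
L = np.zeros((kappa, kappa))
for k in range(kappa - 1):
    L[k, k] += 1.0; L[k + 1, k + 1] += 1.0
    L[k, k + 1] -= 1.0; L[k + 1, k] -= 1.0
Adj = np.diag(np.diag(L)) - L              # adjacency matrix recovered from L

gamma, sigma, dt = 1.0, 50.0, 1e-3
q = np.zeros(kappa)                        # internal observer states, q(0) = 0
for _ in range(20000):
    p = x_static - q                       # subobserver outputs
    diff = p[None, :] - p[:, None]         # diff[k, j] = p_j - p_k
    # sig(z) = tanh(sigma * z) approximates sgn(z) and mitigates chattering
    flow = (Adj * np.tanh(sigma * diff)).sum(axis=1)
    q -= dt * gamma * flow                 # forward-Euler step of q_dot = -gamma * flow
p = x_static - q

# sum(q) stays zero (the flow is antisymmetric), so consensus of the
# outputs p_k forces them onto the true average of x_static.
print(np.max(np.abs(p - x_static.mean())))
```

Because tanh is smooth, convergence near consensus is exponential rather than exactly finite-time; with the true sign function one recovers the finite-time bound of Remark 2, at the cost of chattering.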
Remark 1: The output p_j^i in reality contains chattering noise due to the sign function in (24). One way to mitigate the noise is to replace it by the sigmoid function in (28), where σ is a sufficiently large parameter [24]. Finally, the set {o_k^i}_{i∈{1,...,n}} can be described as O_k in (29). An illustrative example of O_k is shown in Fig. 3. The distributed observer {O_k}_{k∈{1,...,κ}} has the feature shown in (30).

Remark 2: Although the exact computation of t_* in (30) is difficult because the true average x̄^i appears in (27), an upper bound of t_*, denoted as T̄, can be estimated as follows. Let q_k(0) = 0 in (29). From a simple calculation, t_* is shown to satisfy t_* ≤ max{||x^1(0)||, . . ., ||x^n(0)||}/λ_2(L), where x^i := [x_1^i, . . ., x_κ^i]^T. Even though x^i(0) is still difficult to know, one may be able to estimate its upper bound in advance. Assume that ||x^i(0)|| ≤ β_i, where β_i is a known parameter. Then we have the bound T̄ in (31). Later in our simulations, we use T̄ as the time of convergence.
Remark 3: The right-hand side (RHS) of (24) changes significantly near the true value, even if the sgn function is replaced by the sigmoid function. Therefore, in numerical simulations, the step-size parameter of the numerical integration method used to solve (24) should be small enough. Consequently, the computational complexity of the numerical integration is often large; note, however, that this is distinct from the computational cost of the learning algorithm.
Next, we show how to incorporate this distributed observer with the RL algorithm presented in the previous section.

C. Proposed Algorithm
This section presents a solution to Problem 1. By choosing the initial subcontroller as in (32), where O_k and y_k are defined in (29), we can estimate x̄ in a distributed manner, as shown in (30). Although we can estimate ū by a similar procedure, for simplicity we assume that ū is shared in advance among the subsystems. This sharing can be done because u_k is a feedforward signal. From (30), the signal p_k coincides with x̄ after a finite time. The algorithm proceeds as follows. Let Q_a, Q_b, and R_a satisfy (7). Give {L_k}_{k∈{1,...,κ}} such that the graph is connected and undirected, {β_i}_{i∈{1,...,n}} in Remark 2, and {u_k}_{k∈{1,...,κ}}. Compute L in (23) and T̄ in (31), and choose T > T̄ and a natural number N. Share this information among the subsystems in advance. For k ∈ {1, . . ., κ}, do the following. Initialization: Give ū and ũ_k in (19)-(20). Construct K_k^ini in (9) with (32), where γ_k^i satisfies (25). Give δ ≥ 0 and {t_j}_{j∈{1,...,N}}. Set K_{a,k}^[0] = 0 and i ← 0.

Policy Evaluation: Find {P̄_k^[i], P_{a,k}^[i]} minimizing (36) and (37), respectively.

Policy Improvement: Update the gains {K̄_k^[i+1], K_{a,k}^[i+1]} accordingly. If the gain update satisfies the stopping criterion with threshold δ, terminate; otherwise, let i ← i + 1 and return to Step 5.

Controller Construction: Construct K_k as in (38), where the learned gains replace their model-based counterparts.

In Policy Improvement, Steps 4-6 are equivalent to a data-driven implementation of Kleinman's algorithm [27], which is an iterative scheme for solving the Riccati equation of Σ. The reason why we construct K_k as in (38) is as follows. Note that the control u = −Kx with K in (14) can be rewritten as (39). If we replace x̄ and {K̄, K_a} in (39) by p_k and {K̄_k, K_{a,k}} for t ≥ T while keeping the output u_k the same, we obtain K_k in the form of (38), provided that (40) holds. Next, we show that (40) is indeed satisfied when i → ∞. In other words, we show the convergence of the algorithm and the stability/optimality of the resultant closed-loop system.
Theorem 1: Consider Problem 1 under Assumptions 1-3, and consider Algorithm 1. If (41) holds, then the following statements are true.
i) The closed-loop system (Σ, {K_k}_{k∈{1,...,κ}}) is stable. ii) The set {K_k}_{k∈{1,...,κ}} minimizes J in (5).

Proof: Under the condition (41), there exists a unique pair {P̄_k^[i], P_{a,k}^[i]} making the term (36) [respectively, (37)] exactly zero [23]. Moreover, the solution satisfies the corresponding model-based equations, so the solution is independent of k, i.e., P̄_k^[i] = P̄^[i] and P_{a,k}^[i] = P_a^[i]. Using these symbols, we define P^[i] as in (16). From a procedure similar to the proof of Lemma 1, it is shown that {P^[i], K^[i]} obeys the recursion with K^[0] = 0, where A^[i] := A − BK^[i]. This is the ith step of Kleinman's algorithm [27] applied to Σ. Therefore, A^[i] is stable for any i. Because the behavior of the closed-loop
It should be noted here that the proposed algorithm is computable in parallel among the subsystems. In other words, its computational cost is scalable in κ, and yet another benefit is that the resultant fully distributed controller {K_k}_{k∈{1,...,κ}} is shown to be optimal. Moreover, owing to the advantage of the Off-PI [23], the algorithm is one-shot, i.e., we do not need to recollect data using an updated control gain K^[i+1].
Finally, we summarize the impact of the design parameters of Algorithm 1 on the learning/control phases and a guideline for parameter selection as follows.
1) Note that the Riccati equations (11) and (12) govern the behavior of the average x̄ in (19) and of x̃_k in (20) for any k. Therefore, similarly to standard optimal control, we can individually regulate {x̃_k}_{k∈{1,...,κ}} and x̄ by tuning Q_a and Q_b with R_a while satisfying (7).
2) The graph structure {L_k}_{k∈{1,...,κ}} does not have any impact on the resultant control performance as long as the graph is connected. However, as we can see in (27), the structure affects the time at which learning can start. When the graph is sparse (respectively, dense), the second-smallest eigenvalue of the associated Laplacian will be small (respectively, large). As a result, t_* in (30), where t_*^i is defined in (27), becomes large (respectively, small). Therefore, the sparser the communication graph is, the longer it takes before learning can start. This implicit impact on the control phase will be demonstrated in the later numerical simulations.
3) The choice of β_i in (31) affects only the time T̄ to start learning used in Step 2. Note that β_i should satisfy ||x^i(0)|| ≤ β_i, as described in Remark 2. Therefore, one can choose a larger β_i depending on the confidence of the prior knowledge regarding x^i(0). The tradeoff, however, is that T̄ becomes larger in that case. 4) The gain γ_k^i in (24) needs to satisfy only (25). Therefore, a guideline for choosing γ_k^i would be to assign it an adequately high value. In practice, the larger the gain, the more significant the impact of chattering noise on the estimated average state. However, as shown in the continuous-time numerical simulations in Section V-A, the choice of a large γ_k^i has little impact on the learning results in continuous-time settings. For the case when Algorithm 1 is time-discretized for implementation on digital computers, please see the next subsection. 5) The choice of N, {t_j}_{j∈{1,...,N}}, and {u_k}_{k∈{1,...,κ}} used in Steps 2 and 3 does not have any impact on the learned controller as long as the condition (41) is satisfied. In Algorithm 1, these parameters are assumed to be given according to the setting of Problem 1. Determining them, particularly N, may appear to be difficult. However, they can be determined through a data-driven approach by incrementally repeating Steps 2 and 3 with increasing N until (41) is met, as the condition can be verified using only the estimated data.
6) The parameter δ utilized in Step 6 should be chosen to be sufficiently small, such as 10^{-6}. As Kleinman's algorithm is known to exhibit quadratic convergence [28], selecting a sufficiently small δ will not significantly impact the iteration count for learning.

Remark 4: One can rewrite (36) [respectively, (37)] in a linear matrix equation form because there exists a unique solution making the evaluation cost (36) [respectively, (37)] exactly zero, as shown in the proof. However, the existence of such a solution is limited to the case of homogeneous networks. In the following section, we consider applying the algorithm to general heterogeneous network systems, and therefore we describe Step 6 as a minimization problem.
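Steps 4-6 are a data-driven implementation of Kleinman's algorithm [27]; for reference, its model-based form, which the data-driven steps reproduce, is only a few lines. The matrices below are illustrative stand-ins, and the stopping threshold follows item 6) above.

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

# Illustrative stable single-subsystem model (hypothetical values).
A = np.array([[-1.0, 1.0], [0.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

K = np.zeros((1, 2))            # K^[0] = 0 is stabilizing since A is Hurwitz
delta = 1e-6
for i in range(50):
    Ai = A - B @ K
    # Policy evaluation: solve Ai^T P + P Ai + Q + K^T R K = 0.
    P = solve_continuous_lyapunov(Ai.T, -(Q + K.T @ R @ K))
    # Policy improvement: K^[i+1] = R^{-1} B^T P^[i].
    K_next = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_next - K) <= delta:   # Step-6-style stopping criterion
        K = K_next
        break
    K = K_next

# Quadratic convergence: K matches the ARE-optimal gain after a few iterations.
P_opt = solve_continuous_are(A, B, Q, R)
K_opt = np.linalg.solve(R, B.T @ P_opt)
print(np.allclose(K, K_opt, atol=1e-5))
```

The data-driven steps of Algorithm 1 replace the Lyapunov solve by a least-squares fit to the collected trajectories, which is why (36)-(37) appear as minimization problems in the heterogeneous case of Remark 4.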
Remark 5: The number of data samples N needed to satisfy (41) is of order n^2, while the computational cost of executing Step 6 is of order n^6, where n is the dimension of each subsystem. Even if n is large, by preconditioning the data p_k and x̃_k based on the singular value decomposition, we can reduce the cost to order n^3; see [25] for more details.

D. Time-Discretization for Implementation
Algorithm 1 contains two continuous-time calculations, i.e., the continuous-time evolution of o_k^i in (24) used for K_k^ini and K_k, and the continuous-time integral (35). For implementing this algorithm on a digital computer, it is necessary to discretize them. One approach is to replace o_k^i with the time-discretized subobserver with sampler and zero-order hold (ZOH) described in (44), where Δ > 0 is the sampling interval and t ∈ Z. Note that the sign function in (24) is replaced with the sigmoid function defined in (28). In addition, one may replace (35) by its trapezoidal approximant (45). When o_k^i and ρ_j^•• in Algorithm 1 are replaced with their discretized counterparts, we refer to the revised algorithm as Algorithm 1D.
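The Δ-dependence of the trapezoidal approximant can be illustrated generically: halving Δ reduces the integration error by roughly a factor of four, i.e., the error is O(Δ^2). A small sketch (the integrand is an arbitrary stand-in, not a signal from the article):

```python
import numpy as np

def trapezoid(f, a, b, delta):
    """Composite trapezoidal rule with step delta, in the spirit of (45)."""
    t = np.arange(a, b + delta / 2, delta)
    y = f(t)
    return delta * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

exact = 2.0                       # integral of sin over [0, pi]
err1 = abs(trapezoid(np.sin, 0.0, np.pi, np.pi / 100) - exact)
err2 = abs(trapezoid(np.sin, 0.0, np.pi, np.pi / 200) - exact)
print(err1 / err2)                # close to 4: error is O(delta^2)
```

This second-order decay is what justifies the expectation below that stability and suboptimality are retained for sufficiently small Δ.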
Due to the replacement of o_k^i by its time-discretized version, the estimated signal p_k^i will chatter around the true average, which may result in failure of learning. One way to avoid this phenomenon is to choose an appropriately small γ_k^i by monitoring the level of chattering noise on q_k^i. One heuristic procedure for choosing γ_k^i is as follows. When the estimate p_k^i is around the true average, Σ_{j=1}^{κ} L_kj p_j^i ≈ 0 holds. Therefore, one can see from (44) that q_k^i(t + 1) − q_k^i(t) chatters depending on the value of γ_k^i, and vice versa. Based on this observation, we can tune γ_k^i such that the chattering noise of q_k^i(t + 1) − q_k^i(t) is small. The trapezoidal approximation (45) leads to a Δ-dependent error between ρ_j^•• and its approximant. As the error becomes smaller as Δ becomes smaller, we can expect that stability and suboptimality
by the learned controller will be guaranteed as long as Δ is sufficiently small. The effectiveness of the discrete-time algorithm, Algorithm 1D, as well as the selection of an appropriate value of γ_k^i, will be verified through a numerical simulation in Section V-C. A rigorous robustness analysis necessitates the extension of our result to an entirely discrete-time version based on the discrete-time consensus observer [29] and the discrete-time RL method [30], which is beyond the scope of this article.

IV. ROBUSTNESS ANALYSIS FOR HETEROGENEOUS NDSS
In this section, we consider general NDSs without imposing Assumption 3 on Σ in (3). For the following arguments, we define the nominal matrices in (46) and let the homogeneous network system be denoted as Σ̄ in (47), where x̄ := [x̄_1^T, . . ., x̄_κ^T]^T. We impose Assumptions 1 and 2 and the following.
Assumption 4: The matrix Ā is Hurwitz.

Assumptions 2 and 4 imply that both the heterogeneous and homogeneous networks are stable. Let ε denote the degree of heterogeneity. If ε is small, one can expect that applying the proposed algorithm to the heterogeneous network Σ will yield a suboptimal distributed controller. In the remainder of this section, we verify this expectation mathematically. To this end, we introduce the following lemma.

Lemma 2: Consider Σ in (3) and Σ̄ in (47). Let Assumptions 1, 2, and 4 hold. Define x̄ and x̃ as in (49). In addition, consider O_k in (29), and define x̃_k in (33). Then, the relations (50) follow.

Proof: From a simple calculation using (19) and (20), we obtain the stated relations; finally, from (30), the claim follows.
This lemma characterizes the average estimate p_k (respectively, the difference estimate x̃_k) of the heterogeneous system Σ in terms of the actual average x̄ (respectively, the actual difference x̃) of the homogeneous system Σ̄, up to the degree of heterogeneity ε. Since p_k and x̃_k are used by the kth learner, the learning results are characterized by the heterogeneity. This is summarized in the following theorem, where, for clarity of representation, we denote the data matrices φ_• and ρ_•• in (37) constructed from {x̄, ū} in bold font, i.e., φ_• and ρ_••.
Theorem 2: Consider Σ in (3), Σ̄ in (47), J in (5), and the proposed algorithm. Assume Assumptions 1-4, (41), and (52), where x̄ and x̃ are defined in (49). Consider P^[i] > 0 and K^[i] satisfying the Kleinman recursion, where A^[i] := Ā − B̄K^[i] and K^[0] = 0. Then, the following holds. i) The Lyapunov function candidate V^[i](x) := x^T P^[i] x satisfies (54), where x follows (Σ, {K_k}_{k∈{1,...,κ}}). ii) If ε is sufficiently small such that the derivative of V^[i] is negative for any x ≠ 0, then the suboptimality bound holds.

Proof: First, we show (56), where {P̄_k^[i], P_{a,k}^[i]} is a solution minimizing (37), while (57) is the corresponding solution when ε = 0. Note here that (57) has a unique solution for any i because of (52). From Lemma 2, φ_x̃k, ρ_x̃k ũk, and ρ_x̃k x̃k satisfy (50).

Therefore, for i = 0, (37) can be written as (58). Note here that K_{a,k}^[0] = 0 was used for deriving (58). The solution minimizing (58) can be written as (59). From a simple calculation, for any two matrices A and A' whose sizes are identical, we have (60), where X := A^T A' + A'^T A + A'^T A'. By applying this formula to (59), the solution {P̄_k^[0], P_{a,k}^[0]} minimizing (58) is shown to satisfy (56) for i = 0. By repeating the above procedure, we have (56) for any i. Similarly, we obtain the corresponding relation for the gains, where {K_k} is that when ε = 0. We now show claim i). We define F^[i+1] as the full-coordinate gain of the learned controller. Let K_k be given by (38) with the replacement of K̄_k and K_{a,k} by the learned gains; then u_k given by K_k^[i+1] coincides with −F^[i+1] x. Thus, for the following stability analysis, we focus on the closed loop ẋ = (A − BF^[i+1])x. It follows from (56), (60), and the proof of Theorem 1 that the derivative of V^[i] along the closed loop can be evaluated through P^[i](A^[i] + BK^[i] − BK^[i+1]). By the technique of completing the square, we obtain the bound (54). Thus, property i) follows. Next, note that the cost J when Assumption 3 holds coincides with J^[i]. Thus, property ii) follows. This completes the proof.

Property i) implies that the stability of the closed-loop system is guaranteed when ε is sufficiently small, provided that the O(ε) term is negligible compared with the other terms in the RHS of (54). Property ii) guarantees the suboptimality achieved by the learned controller. As ε decreases, the resultant control performance of the heterogeneous network gets closer to that of the homogeneous case.

V. NUMERICAL SIMULATIONS

A. Homogeneous Case
We first consider a homogeneous NDS composed of ten subsystems, each of which is a three-dimensional single-input linear system. Thus, κ = 10, n = 3, and m = 1. The kth subsystem dynamics are described by (1), where A_a, A_b, and B_a are given, and B̃_k = 0 and Ã_kl = 0 for any l. Each entry of x(0) is randomly chosen from the range [−0.5, 0.5]. We construct Q and R from (6) with Q_a, Q_b, and R_a in (63). Note that Q ≥ 0 and R > 0 hold. If the model information were available, we could construct the optimal control gain by (14), as given in (64). However, because the model information is not available, we find the gain by applying Algorithm 1 to this network. Let the structure among the subobservers be a path graph, i.e., (65) for k ∈ {2, . . ., κ − 1}. The graph structure is shown in Fig. 4. L is constructed using (23), yielding λ_2(L) = 0.0979. We use the exploration input u_k(t) = (1/10) Σ_{h=1}^{100} sin(ω_{k,h} t) with ω_{k,h} randomly chosen from the range (−50, 50) for h ∈ {1, . . ., 100}. Also, let β_i = 1 in (31) for any i. Then, T̄ = 10.22. For k ∈ {1, . . ., κ}, we construct a subobserver O_k := {o_k^i}_{i∈{1,...,n}} in (29), where o_k^i is given by (24) with sgn replaced by the sigmoid sig defined in (28) to mitigate chattering noise.

Fig. 5. Trajectory of x_1 when no input is applied for t ≥ 0 (red chained), the exploration noise is applied for t ∈ [0, T) (black solid), and the model-based optimal controller (yellow dotted) and the designed controller {K_k}_{k∈{1,...,κ}} (blue solid) are actuated at t ≥ T. T̄ denotes the time to start learning.

Let γ_k^i = 249 and σ = 10 000. We will later investigate how these values impact the learning results. In addition, we let δ = 10^{-8}, N = 200, t_j := T̄ + (j − 1)(T − T̄)/N, and T = 12.22. From t ≥ 0 we run K_k^ini := O_k, and subsequently, for t ∈ [T̄, T], we collect {x_k, u_k, {y_j^ini}_{j∈L_k}}. In this case, the condition (41) is satisfied. The Policy Improvement in the algorithm terminated at i = 9. All codes were developed in Matlab 2021b and run on an Intel(R) Core(TM) i9-9900K 3.60 GHz, RAM 64.0 GB computer. The total computational time for running all the iterations of Algorithm 1 was found to be only 0.00014 (s), which is almost negligible. Although small differences among the resultant control gains, such as K_{a,κ} and K̄_κ, exist due to the chattering noise of the estimation, those differences are almost negligible. By comparing the resultant gains with (64), we can see that the optimal controller is successfully obtained. In Fig. 5, the blue solid and yellow dotted lines depict x_1(t) ∈ R^3 (i.e., the state of the first subsystem) when {K_k}_{k∈{1,...,κ}} in (38) and the above model-based controller are actuated at t = T, respectively. For comparison, we show the case when no input is applied (i.e., u(t) ≡ 0 for t ≥ 0) by the red chained lines. The following two observations can be made from this result. i) The proposed method can learn the optimal controller in a decentralized manner. ii) However, the performance improvement compared with the free response is limited due to the long time required for learning.

Fig. 7. Trajectory of x_1 of the homogeneous network when no input is applied for t ≥ 0 (red chained), the exploration noise is applied for t ∈ [0, T′) (black solid), {K_k}_{k∈{1,...,κ}} (yellow dotted) is actuated at t = T, and {K′_k}_{k∈{1,...,κ}} is actuated at t = T′. T̄′ denotes the time to start learning {K′_k}_{k∈{1,...,κ}}.
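The reported values λ_2(L) = 0.0979 and T̄ = 10.22 are consistent with an unweighted path-graph Laplacian on ten nodes; assuming the bound (31) takes the form T̄ = max_i β_i / λ_2(L) with β_i = 1 (our reading of Remark 2, since (31) itself is not reproduced), they can be recomputed as follows:

```python
import numpy as np

kappa = 10
# Unweighted path-graph Laplacian for the subobserver network (Fig. 4).
L = np.zeros((kappa, kappa))
for k in range(kappa - 1):
    L[k, k] += 1.0; L[k + 1, k + 1] += 1.0
    L[k, k + 1] -= 1.0; L[k + 1, k] -= 1.0

lam = np.sort(np.linalg.eigvalsh(L))
lam2 = lam[1]                            # second-smallest eigenvalue of L
beta = 1.0                               # beta_i = 1 for all i, as in the simulation
T_bar = beta / lam2                      # assumed form of the bound (31)

print(round(lam2, 4), round(T_bar, 2))   # 0.0979 and 10.22, matching the text
```

For a path graph, λ_2 = 2(1 − cos(π/κ)), which shrinks as κ grows; this is the mechanism behind the sparsity/learning-speed tradeoff discussed around Fig. 6.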

TABLE I COST ACHIEVED BY THE RESULTANT CONTROLLER
To demonstrate the impact of the choice of graph structure on the control performance, we consider {L′_k}_{k∈{1,...,κ}} whose graph Laplacian L′ has the sparsity pattern shown in Fig. 6. Because this graph is denser than the path graph given by (65), its algebraic connectivity is larger than λ_2(L) = 0.0979, namely, λ_2(L′) = 0.4023. As a result, the time to start learning becomes T̂ = 2.49, which is earlier than T = 10.22, and the corresponding end of the data-collection interval becomes 4.49 instead of 12.22. Let the newly learned controller be denoted as {K′_k}_{k∈{1,...,κ}}. The closed-loop dynamic response with this controller is almost identical to that with {K_k}_{k∈{1,...,κ}} because of Theorem 1. However, owing to the faster learning, the closed-loop response achieved by {K′_k}_{k∈{1,...,κ}} is more damped than that achieved by {K_k}_{k∈{1,...,κ}}, as shown in Fig. 7 compared with Fig. 5. Therefore, we can see a tradeoff between the communication cost and the learning speed.
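The role of λ_2 here can be checked numerically: the 10-node path graph has algebraic connectivity 2(1 − cos(π/10)) ≈ 0.0979, matching the value reported above, and adding edges can only increase it. The sketch below (Python with NumPy) uses a hypothetical denser pattern, not the actual sparsity of Fig. 6:

```python
import numpy as np

def lambda2(adj):
    """Second-smallest eigenvalue (algebraic connectivity) of the Laplacian."""
    L = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(L))[1]

kappa = 10
path = np.zeros((kappa, kappa))
for k in range(kappa - 1):          # path graph: k <-> k+1
    path[k, k + 1] = path[k + 1, k] = 1.0
dense = path.copy()
for k in range(kappa - 2):          # hypothetical extra shortcut edges
    dense[k, k + 2] = dense[k + 2, k] = 1.0

print(lambda2(path))   # 2(1 - cos(pi/10)) ~= 0.0979, as in the text
print(lambda2(dense))  # strictly larger: denser graph, faster consensus
```

A larger λ_2 means faster averaging across the network, which is why the denser topology shortens the time to start learning.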
Next, we investigate how the choices of γ_k^i and σ affect the learning results. Let the graph structure be the one shown in Fig. 6. For simplicity, we let γ_k^i be identical for all k and i. Table I shows the cost achieved by the resultant controller for four different choices of (γ_k^i, σ). We can see that the resultant cost x^T(0)P x(0) is almost the same as γ_k^i and σ change. This result implies that the values of γ_k^i and σ have little impact on the learning. In this simulation, the right-hand side (RHS) of (25) for t ≤ T was 61.1268. One may think that the model information is needed for choosing γ_k^i to satisfy (25); however, the fact that choosing γ_k^i as approximately 20 times the RHS value does not affect the results implies that such a precise choice is not necessary.
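The definition of sig in (28) is not reproduced in this section; purely as an assumed stand-in for illustration, one common smooth approximation of sgn is tanh(σx), which removes the discontinuity at zero that causes chattering. The paper's actual (28) may differ:

```python
import numpy as np

def sig(x, sigma=1e4):
    """Hypothetical smooth stand-in for sgn: tanh(sigma * x) approaches
    sgn(x) as sigma grows, but is continuous at x = 0, which is what
    suppresses chattering in the observer."""
    return np.tanh(sigma * x)

x = np.array([-1e-3, 0.0, 1e-3])
print(np.sign(x))  # hard switch: [-1.  0.  1.]
print(sig(x))      # smooth, yet nearly identical away from 0
```

With a large σ (here 10^4), the surrogate is indistinguishable from sgn except in a thin boundary layer around zero, so the observer's tracking behavior is essentially preserved.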

B. Heterogeneous Case
We next investigate the heterogeneous case. Let A_a, A_b, and B_a be the same as in the previous section, while Ã_[kl] and B̃_k are given such that every element is randomly chosen from a range D. We take L_k as the one corresponding to L′ in Fig. 6 and construct Q and R from (6), where Q_a and Q_b are given in (63). Let the other parameters be identical to those in Section V-A. Table II shows whether the resultant closed-loop system is stable for several choices of D and q_gain. For the reason described in Remark 3, only in this simulation do we utilize the true average x̄ for learning. We can see that the closed loop is stabilized when q_gain and the degree of heterogeneity are small. The closed-loop response of x_1 is shown in Fig. 8 for q_gain = 1 and D = [−0.001, 0.001], where the average-state estimation uses the graph Laplacian L′. In this case, the learned controller behaves similarly to the ideal model-based controller. On the other hand, the closed loop moves closer to instability as q_gain or the degree of heterogeneity increases. This is because high control gains induce higher levels of uncertainty, for which the robustness guarantees of Algorithm 1 no longer hold. Numerical computation of the allowable ranges of q_gain and the degree of heterogeneity, as well as of the interdependence between the two, remains an open question that will be explored in our future work.
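The qualitative trend of Table II can be mimicked with a generic Monte Carlo check: entrywise perturbations drawn from a small range D preserve Hurwitz stability of a stable closed-loop matrix, while a large range need not. The nominal matrix below is hypothetical and unrelated to the paper's actual closed loop; the sketch (Python with NumPy) only illustrates the mechanism:

```python
import numpy as np

rng = np.random.default_rng(2)

def stays_hurwitz(A_cl, width, trials=100):
    """Monte Carlo check: does A_cl remain Hurwitz (all eigenvalue real
    parts negative) under entrywise perturbations drawn from [-width, width]?"""
    n = A_cl.shape[0]
    for _ in range(trials):
        delta = rng.uniform(-width, width, (n, n))
        if np.linalg.eigvals(A_cl + delta).real.max() >= 0.0:
            return False
    return True

A_cl = np.array([[-1.0, 0.5],   # hypothetical stable nominal closed loop
                 [0.0, -0.5]])
print(stays_hurwitz(A_cl, 0.001))  # True: small heterogeneity preserves stability
print(stays_hurwitz(A_cl, 5.0))    # False: large heterogeneity destabilizes
```

The margin by which the nominal eigenvalues sit in the left half-plane determines how much heterogeneity is tolerable, which mirrors the observed interplay between q_gain and the range D.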

C. Discrete-Time Simulation for Heterogeneous Second-Order Oscillatory Networks
We show the effectiveness of the discrete-time version of the proposed algorithm, Algorithm 1D in Section III-D, for a larger and more complex heterogeneous NDS, where multiple second-order oscillators are interconnected via a complete undirected graph. Let κ = 100. For k ∈ {1, . . ., κ}, let the kth oscillator dynamics be described as

θ̇_k = ω_k,  m_k ω̇_k = −d_k ω_k − f_k θ_k − Σ_{j≠k} g_{kj}(θ_k − θ_j) + u_k

where m_k and d_k are the inertia and the damping coefficient, g_kj is the stiffness of the connection between the kth and jth oscillators, and f_k is the proportional gain introduced for stabilizing Ā in (13). Suppose g_kj = g_jk. The values of m_k, d_k, and g_kj are randomly chosen from the ranges [4.9, 5.1], [1.4, 1.6], and [0.099, 0.101], respectively. Let L_k be given by a strongly connected Erdős–Rényi random graph with 300 edges. Fig. 9 shows the graph structure of the closed-loop system. Let Q_a = 10, Q_b = 1, R_a = 1, and σ = 10^6; let the sampling interval of Algorithm 1D be Δ = 0.01 ms, β_i = 3 in (31), and N = 400. The value of T defined in (31) becomes T = 0.16, and the end of the data-collection interval becomes T′ = 4.16.

Fig. 10. Trajectories of p_1^1(t) and q_1^1(t + 1) − q_1^1(t) in (44) when γ_k^i = 5 and 2 for any k and i.
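Under the parameter ranges stated above, the free response of such an oscillator network can be sketched as follows (Python with NumPy). The dynamics form, the gain f_k = 1, and the forward-Euler step of 0.01 s are our own assumptions for illustration, since the paper's display equation and Algorithm 1D are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
kappa = 100
m = rng.uniform(4.9, 5.1, kappa)               # inertia
d = rng.uniform(1.4, 1.6, kappa)               # damping coefficient
g = rng.uniform(0.099, 0.101, (kappa, kappa))  # coupling stiffness
g = (g + g.T) / 2.0                            # enforce g_kj = g_jk
np.fill_diagonal(g, 0.0)
f = np.ones(kappa)                             # stabilizing gain (assumed value)

def step(theta, omega, u, dt=0.01):
    """One forward-Euler step of the assumed dynamics
    m_k w'_k = -d_k w_k - f_k th_k - sum_j g_kj (th_k - th_j) + u_k."""
    coupling = g.sum(axis=1) * theta - g @ theta
    omega_dot = (-d * omega - f * theta - coupling + u) / m
    return theta + dt * omega, omega + dt * omega_dot

theta = rng.uniform(-0.5, 0.5, kappa)  # random initial angles
omega = np.zeros(kappa)
for _ in range(1000):                  # 10 s of free response (u = 0)
    theta, omega = step(theta, omega, np.zeros(kappa))
print(np.abs(theta).max(), np.abs(omega).max())  # bounded, slowly decaying
```

Because d_k > 0 and f_k > 0, the free response is damped, so the learned controller's role is to shape this decay rather than to stabilize an unstable plant.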
For selecting an appropriately small value of γ_k^i, we plot p_1^1(t) and q_1^1(t + 1) − q_1^1(t) in Fig. 10 for two cases, γ_k^i = 5 and γ_k^i = 2. For reference, we also plot the true average x̄_1 in Fig. 10(a) and (b).

TABLE III
ITERATIONWISE LEARNING TIME [SEC] OF CENT-RL AND ALGORITHM 1D

We can see from Fig. 10(a) and (b) that the estimate successfully tracks the true average for t > T in both cases. Furthermore, Fig. 10(c) shows that chattering noise occurs in q_1^1(t + 1) − q_1^1(t) after t > T in both cases, and that the magnitude of the noise is smaller for γ_k^i = 2 than for γ_k^i = 5. Because a smaller noise magnitude can be expected to yield better learning results, we choose γ_k^i = 2 for the following demonstration. Fig. 11 shows the closed-loop response of ω_1 when Algorithm 1D is applied. This figure implies that a suboptimal controller is successfully obtained by the discrete-time version of the proposed method even for the larger and more complex NDS in Fig. 9.
Finally, we show an advantage of Algorithm 1D over a time-discretized version of the centralized RL algorithm in [23], which we refer to as Cent-RL. To compare the two algorithms, we consider three second-order oscillator networks consisting of 30, 40, and 100 oscillators, respectively. Table III shows the iterationwise learning time, defined as the time required for updating the controller once. The learning time of Cent-RL for κ = 100 could not be measured due to the massive amount of computation required; the computational complexity of Cent-RL is known to be of order (κn)^6. Indeed, as shown in Table III, its learning time increases notably as the number of subsystems κ increases. In contrast, the learning time of the discrete-time algorithm Algorithm 1D remains constant as κ increases, demonstrating the scalability of the proposed algorithm.
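The (κn)^6 scaling can be made concrete with a back-of-the-envelope calculation: relative to κ = 30, the per-iteration cost grows by (40/30)^6 ≈ 5.6 at κ = 40 and by (100/30)^6 ≈ 1372 at κ = 100, which is consistent with the κ = 100 case being unmeasurable. A sketch in Python:

```python
# Relative iterationwise cost of Cent-RL, which scales as (kappa * n)^6,
# normalized to the smallest network (kappa = 30, second-order oscillators n = 2).
n = 2
base = (30 * n) ** 6
ratios = {kappa: (kappa * n) ** 6 / base for kappa in (30, 40, 100)}
print(ratios)  # {30: 1.0, 40: ~5.62, 100: ~1371.74}
```

By contrast, each subcontroller in the distributed algorithm solves a problem of fixed size n regardless of κ, so its cost in this normalization stays at a constant.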

VI. CONCLUSION
In this article, we have proposed a scalable algorithm for learning distributed optimal controllers for generic NDSs composed of nearly homogeneous subsystems under information constraints that follow a given arbitrary graph structure. Each subcontroller consists of an observer, which estimates the mean and differential states in a distributed manner, and static controllers that feed back those estimates. We have theoretically shown the convergence of the proposed algorithm and the optimality of the learned controller for the case where the individual subsystems are identical. Furthermore, we have demonstrated the robustness of this analysis to general heterogeneous networks. The effectiveness of the proposed algorithm has been verified through numerical simulations. An interesting observation is that when the proposed algorithm is executed in discrete time with a large sampling interval (on the order of Δ = 1 ms), it could not learn a stabilizing controller. Resolving this numerical issue will require a rigorous extension of this result that incorporates the discrete-time consensus observer [29] and the discrete-time RL method [30], which will be addressed in our future work.

Fig. 8. Trajectory of x_1 of the heterogeneous network, where the notations are the same as in Fig. 5.