Distributed Channel Access for Control Over Unknown Memoryless Communication Channels

We consider the distributed channel access problem for a system consisting of multiple control subsystems that close their loops over a shared wireless network. We propose a distributed method for providing deterministic channel access without requiring explicit information exchange between the subsystems. This is achieved by utilizing timers for prioritizing channel access with respect to a local cost, which we derive by transforming the control objective cost into a form that allows its local computation. This property is then exploited for developing our distributed deterministic channel access scheme. A framework to verify the stability of the system under the resulting scheme is then proposed. Next, we consider a practical scenario in which the channel statistics are unknown. We propose algorithms that learn the parameters of the imperfect communication links in order to estimate the channel quality and, hence, define the local cost as a function of this estimate and the control performance. We establish that our learning approach results in collision-free channel access. The behavior of the overall system is illustrated via a proof-of-concept example, and the efficacy of this mechanism is evaluated for large-scale networks via simulations.

loops was thoroughly investigated. It was shown that the certainty equivalence principle holds as long as the adopted network protocols guarantee packet acknowledgements/negative acknowledgements (ACK/NACKs). In addition, when only the link between the sensor and the controller is unreliable, the closed-loop performance can be investigated by considering the impact of packet dropouts on the corresponding Kalman filter. As a result, the prominent results obtained for sensor scheduling in remote estimation for various channel models, communication and energy constraints can readily be adopted in respective WNCS settings; see, e.g., [4]-[11].
Typically, WNCSs contain several control subsystems which share the same communication resources. Due to the limited available bandwidth, the wireless devices need to coordinate in sharing the scarce and unreliable communication resources efficiently so that the control tasks are accomplished with good performance. This has led to a surge of research on the design of effective resource allocation schemes in the last decade; see, for example, the recent survey [12] and references therein. It has been shown that finding the optimal schedule for multiple subsystems that have access to multiple lossy channels requires solving a mixed-integer quadratic program, which is computationally infeasible for large systems [13], [14]. In the presence of a central coordinator in the network, this can be overcome by employing priority-based resource allocation schemes, which determine the priorities dynamically with respect to a finite-horizon criterion. Try-once-discard (TOD) is one of the most well-known schemes of this type which, at each frame, allocates the channels to the subsystems with the largest discrepancy between the true and estimated state values [15]. For the linear quadratic Gaussian (LQG) control problem, the value of information (VoI) contained in the sensors' current observations for the network was proposed as the priority measure in [16], [17]. In a similar context, the contribution of a lost data packet of a controller to the increase in the quadratic cost of the entire system was labeled as the cost of information loss (CoIL) in [18]. For decoupled systems, it was shown that minimizing the linear quadratic cost is equivalent to prioritizing channel access with respect to CoIL. In the case of sensors with a limited energy budget, the energy expenditure can also be included in the objective or as a constraint for determining transmission priorities, as proposed in [19]-[21].
arXiv:2103.06048v2 [eess.SY] 2 Aug 2021
In some scenarios, a central coordinator is non-existent, thus requiring the subsystems to coordinate in a distributed manner for accessing the channels. The channel access in such settings is often provided by implementing random medium access schemes, e.g., ALOHA, which are incapable of taking the control performance into account. However, a limited number of novel deterministic solutions have emerged for resolving contention over perfect channels, showing promising results [22], [23]. In this paper, we consider the LQG scenario and address how the scarce and unreliable communication resources can be allocated in a distributed manner by proposing a priority-based channel access scheme. This is achieved by adopting a variant of the timer-based mechanism for CoIL (TBCoIL), which was initially proposed in [23] for contention resolution over ideal communication channels. The proposed variant of this mechanism allows for distributed channel access in settings where sensors have access to multiple wireless channels subject to independent and identically distributed (i.i.d.) packet dropouts. Moreover, this mechanism leads to control-aware channel access as long as the timer values are defined accordingly. To this end, we utilize the concept of CoIL and show that the optimal timer setup requires knowledge of the rate of packet dropouts. In practice, however, the sufficient statistics of the probability distributions according to which the packet dropouts happen are unknown. To enable implementation in such practical scenarios, we propose a method for learning the essential channel parameters online and in a control-aware manner.
Applying learning methods in control problems has a long history, mostly centered around learning the unknown system dynamics by reinforcement learning; see [24] for a thorough literature review. However, only a limited number of works consider learning methods in the context of scheduling for WNCSs. In [25], the problem of sensor scheduling with communication rate constraints over a single channel is addressed. After proving the threshold-like structure of the optimal scheduling policy, iterative algorithms are designed to obtain the optimal solution without knowing the packet dropout rate. In a similar context, the relationship between the sample complexity and stability margin of a system over an unknown memoryless channel was investigated in [26]. Most recently, a multi-armed bandit (MAB) approach was proposed for near-optimal resource allocation [27]. In the case of multiple stable systems having access to multiple known identical channels, the optimal scheduling problem was solved by deriving Whittle's index, leading to promising performance at low computational cost. In this paper, we also utilize the celebrated results obtained for MAB problems. However, our setup poses unique challenges since it includes unstable subsystems, which need to coordinate in a distributed manner for accessing multiple available channels. Furthermore, each wireless link can have a distinct packet dropout rate which is unknown. To overcome these challenges, we cast our problem as an MAB problem and propose a novel distributed solution, which also takes the control performance into account.
The main contributions of this paper can be summarized as follows:
• We propose a distributed channel access method for WNCSs with multiple unreliable communication links. This is achieved by extending the application of TBCoIL to multi-channel wireless networks. We then utilize the concept of CoIL to formulate the channel access problem for minimizing the stage cost and show that utilizing timers for solving this problem in a distributed manner requires a priori knowledge of the packet dropout rates.
• We propose a framework for verifying the mean square stability of the system under the proposed channel access scheme. By modeling the packet arrival sequence as a Markov chain and investigating its stationary distribution, we derive sufficient conditions that guarantee stability. Furthermore, we illustrate how the proposed framework can be utilized to determine stability in practical settings where analytical expressions for the stationary distribution cannot be derived.
• We additionally consider the practical scenario in which the channel statistics are unknown a priori. We first demonstrate how the well-known indexing policies developed for single-player MABs can be employed in timers to solve multi-player MABs in a distributed and collision-free manner. Next, we cast the channel access problem as a multi-player MAB and introduce time-varying weights for scaling the indices with respect to the control performance. The resulting control-dependent indices are then utilized in timers to solve the channel access problem in a distributed and control-aware manner.
The remainder of this paper is organized as follows: the necessary preliminaries and system model are provided in Section II. The proposed distributed channel access mechanism for channels with known statistics is described in Section III, where stability conditions are also derived. In Section IV, we use the MAB approach for learning the unknown channel parameters and propose a novel method for concurrent distributed channel access and learning. In Section V, we evaluate the effectiveness of the proposed method by numerical simulations. Finally, we draw conclusions and discuss future directions in Section VI.
Notation: $\mathbb{Z}_{\geq 0}$ ($\mathbb{Z}_{>0}$) denotes the set of nonnegative (positive) integers. The transpose, inverse, and trace of a square matrix $X$ are denoted by $X^T$, $X^{-1}$, and $\mathrm{tr}(X)$, respectively, while the notation $X \succeq 0$ ($X \succ 0$) means that matrix $X$ is positive semi-definite (definite). $\mathbb{S}^n_+$ is the set of $n$ by $n$ positive semi-definite matrices and the $n$ by $n$ identity matrix is represented by $I_n$. $E\{\cdot\}$ represents the expectation of its argument and $P\{\cdot\}$ denotes the probability of an event. $f^n(\cdot)$ is the $n$-fold composition of $f(\cdot)$, with the convention that $f^0(X) = X$. The Euclidean norm of a vector $x$ is denoted by $\|x\|$ and $\sigma_{\max}(X)$ denotes the spectral radius of a matrix $X$. Finally, the cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$.

II. SYSTEM MODEL AND PRELIMINARIES
The schematic diagram of the WNCS under consideration is depicted in Fig. 1. Each subsystem consists of an unstable dynamical process, a dedicated local controller, a smart sensor, and an estimator. We assume the actuators are collocated with the controllers, but the sensors need to transmit their data over a capacity-constrained time-slotted network. We consider the scenario in which the subsystems exchange no explicit information and coordinate for channel access in a distributed manner.¹ The effects of state quantization and transmission delays are considered negligible and are thus ignored henceforth.

A. Local Processes and Measurements
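The process dynamics are assumed to be unstable and modeled by the following linear time-invariant (LTI) stochastic process:
$$x_{i,k+1} = A_i x_{i,k} + B_i u_{i,k} + w_{i,k}, \qquad y_{i,k} = C_i x_{i,k} + v_{i,k},$$
where $x_{i,k} \in \mathbb{R}^{n_i}$, $y_{i,k} \in \mathbb{R}^{p_i}$, and $u_{i,k} \in \mathbb{R}^{m_i}$ are the states, outputs measured by the sensor, and inputs of subsystem $i$ at time step $k$, respectively. $A_i$, $B_i$, and $C_i$ are the system, input, and output matrices of appropriate dimensions and, to avoid trivial cases, we assume that $\sigma_{\max}(A_i) > 1$ for all $i$. Moreover, $w_{i,k}$ and $v_{i,k}$ are the uncorrelated zero-mean Gaussian disturbance and measurement noise, with respective covariances $W_i \succeq 0$ and $V_i \succ 0$. The initial state $x_{i,0}$ is also a Gaussian random variable with mean $\bar{x}_{i,0}$ and covariance $X_{i,0} \succeq 0$, which is independent of $w_{i,k}$ and $v_{i,k}$, i.e., $E\{x_{i,0} w_{i,k}^T\} = 0$ and $E\{x_{i,0} v_{i,k}^T\} = 0$ for all $k$.
Each sensor is assumed to have enough computational power for pre-processing the measurement data. Similar to the typical configuration considered in remote estimation, we consider the scenario in which the sensors compute the state estimate and transmit that rather than the raw measurement. The resulting estimator outperforms the one based on raw (unprocessed) measurements since it results in smaller error covariance while offering tighter stability conditions [28], [29]. Let $I^s_{i,k} = \{y_{i,0}, \ldots, y_{i,k}\}$ denote the available information set at the sensor of subsystem $i$ at time $k$ and define
$$\hat{x}^s_{i,k|k-1} \triangleq E\{x_{i,k} \mid I^s_{i,k-1}\}, \qquad \hat{x}^s_{i,k|k} \triangleq E\{x_{i,k} \mid I^s_{i,k}\},$$
$$P^s_{i,k|k-1} \triangleq E\{(x_{i,k} - \hat{x}^s_{i,k|k-1})(x_{i,k} - \hat{x}^s_{i,k|k-1})^T \mid I^s_{i,k-1}\}, \qquad P^s_{i,k|k} \triangleq E\{(x_{i,k} - \hat{x}^s_{i,k|k})(x_{i,k} - \hat{x}^s_{i,k|k})^T \mid I^s_{i,k}\}.$$
Due to the availability of the complete observation history, the sensor can compute the minimum mean square error (MMSE) estimate of the state by running a local Kalman filter, which computes $\hat{x}^s_{i,k|k}$, to be transmitted to the corresponding estimator, recursively by
$$\hat{x}^s_{i,k|k-1} = A_i \hat{x}^s_{i,k-1|k-1} + B_i u_{i,k-1}, \qquad P^s_{i,k|k-1} = h_i(P^s_{i,k-1|k-1}),$$
$$K_{i,k} = P^s_{i,k|k-1} C_i^T (C_i P^s_{i,k|k-1} C_i^T + V_i)^{-1},$$
$$\hat{x}^s_{i,k|k} = \hat{x}^s_{i,k|k-1} + K_{i,k} (y_{i,k} - C_i \hat{x}^s_{i,k|k-1}), \qquad P^s_{i,k|k} = g_i(P^s_{i,k|k-1}),$$
where $h_i(X) \triangleq A_i X A_i^T + W_i$ and $g_i(X) \triangleq X - X C_i^T (C_i X C_i^T + V_i)^{-1} C_i X$.

¹ The term decentralized control is often used to describe scenarios in which determining the control inputs requires no explicit information exchange between subsystems. In this work, we use the term distributed to describe the channel access problem in accordance with the WNCS literature [12].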

Fig. 1. Example of the WNCS layout where $N$ subsystems compete to access a shared wireless channel $j$. $P_i$ represents the plant of subsystem $i \in \{1, \ldots, N\}$, with $S_i$, $E_i$, and $C_i$ being its sensor, estimator, and controller, respectively. Note that the timer is embedded in the smart sensor and determines whether $\hat{x}^s_{i,k|k}$ is transmitted from $S_i$ to $E_i$.
By assuming that the pair $(A_i, C_i)$ is observable and $(A_i, W_i^{1/2})$ is controllable, the equation $g_i \circ h_i(X) = X$ has a unique positive semi-definite solution. We denote this solution by $P_i$, which represents the steady-state error covariance of subsystem $i$. In such settings, it is commonly assumed that the Kalman filter has entered steady state since the a posteriori error covariance converges to $P_i$ exponentially fast for any initial conditions [30]. Since all the parameters for evaluating $g_i \circ h_i(X) = X$ are known, we initiate the filter from $P^s_{i,0|-1} = P_i$ and $\hat{x}^s_{i,0|-1} = 0$ to ensure that it is already in steady state.
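As a hedged illustration of the fixed point $g_i \circ h_i(X) = X$, the following Python sketch iterates the prediction and update maps of the error variance for a scalar subsystem until the value stops changing. The scalar forms of `h` and `g` are the standard Kalman covariance maps and are assumed here; the function names and the numerical values of `a`, `c`, `w`, and `v` are hypothetical and chosen only for illustration.

```python
def h(x, a, w):
    """Prediction map of the error variance (scalar case): h(X) = a*X*a + w."""
    return a * x * a + w


def g(x, c, v):
    """Measurement-update map of the error variance (scalar case)."""
    return x - (x * c) ** 2 / (c * x * c + v)


def steady_state_variance(a, c, w, v, tol=1e-12, max_iter=10_000):
    """Iterate P <- g(h(P)) until the fixed point g(h(P)) = P is reached."""
    p = w  # nonnegative starting value, chosen arbitrarily for this sketch
    for _ in range(max_iter):
        p_next = g(h(p, a, w), c, v)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical unstable scalar process (|a| > 1) with noisy measurements.
P = steady_state_variance(a=1.2, c=1.0, w=1.0, v=0.5)
print(P)
```

Starting the sensor filter from the returned value keeps the variance constant from the first step onward, which is the sense in which the filter is "already in steady state".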

B. Imperfect Communication
Let $\mathcal{N}$ and $\mathcal{M}$ denote the index sets of subsystems and available channels, respectively, with $|\mathcal{N}| = N$ and $|\mathcal{M}| = M$. The problem of scheduling typically arises when the shared communication resources are limited, i.e., $M < N$, as is the case considered here. Let the decision variable $\delta_{i,j,k} \in \{0, 1\}$ represent whether subsystem $i$ transmits on channel $j$ at time step $k$ as follows:
$$\delta_{i,j,k} = \begin{cases} 1, & \hat{x}^s_{i,k|k} \text{ is transmitted on channel } j, \\ 0, & \text{otherwise.} \end{cases}$$
We assume packet ACK/NACKs are guaranteed at each time instant and define another binary variable $\gamma_{i,j,k}$ such that $\gamma_{i,j,k} = 1$ corresponds to the event of successful packet reception given that $\delta_{i,j,k} = 1$; otherwise, $\gamma_{i,j,k} = 0$. Similarly, to represent whether subsystem $i$ receives the data packet at $k$, regardless of the selected channel, we define an additional binary variable $\theta_{i,k}$ as
$$\theta_{i,k} = \begin{cases} 1, & \gamma_{i,j,k} = 1 \text{ for some } j \in \mathcal{M}, \\ 0, & \text{otherwise.} \end{cases}$$
To ensure collision-free channel access, we impose a constraint on the network such that a channel can only be accessed by one subsystem at a given time:
$$\sum_{i \in \mathcal{N}} \delta_{i,j,k} \leq 1, \quad \forall j, \forall k. \tag{3}$$
Moreover, since it is assumed that one slot is enough to convey all the information from the sensor to the estimator at each time slot $k$, each subsystem can use one channel at most, i.e.,
$$\sum_{j \in \mathcal{M}} \delta_{i,j,k} \leq 1, \quad \forall i, \forall k. \tag{4}$$
The communication channels in this work are non-ideal and thus prone to packet dropouts due to phenomena such as multipath fading, shadowing, and interference. This unreliability can be taken into consideration by modeling the packet dropouts over each channel as i.i.d. Bernoulli random sequences, so that the outcome of each transmission follows a Bernoulli distribution with mean $q_{i,j} \in (0, 1]$. Using the introduced decision variables, the probability of successful packet delivery over a wireless link is given by

C. Control and Estimation
In this work, the standard quadratic cost over the infinite horizon is chosen as the performance metric. This cost is defined as where $Q_i$ and $R_i \succ 0$ are weighting matrices of appropriate dimensions. As long as the channel access decisions are independent of the control inputs, the certainty equivalence principle holds and the optimal controller for minimizing this cost is linear and obtained by $u_{i,k} = L_{i,\infty} \hat{x}_{i,k|k}$, where $L_{i,\infty}$ is the optimal feedback gain given by
$$L_{i,\infty} = -(B_i^T \Pi_{i,\infty} B_i + R_i)^{-1} B_i^T \Pi_{i,\infty} A_i,$$
where $\Pi_{i,\infty}$ is the positive semi-definite solution of the discrete-time algebraic Riccati equation (DARE)
$$\Pi_{i,\infty} = A_i^T \Pi_{i,\infty} A_i + Q_i - A_i^T \Pi_{i,\infty} B_i (B_i^T \Pi_{i,\infty} B_i + R_i)^{-1} B_i^T \Pi_{i,\infty} A_i,$$
which always exists due to perfect actuation links and assuming that the pairs $(A_i, B_i)$ and $(A_i, Q_i^{1/2})$ are controllable and observable, respectively [31]. The local estimator computes the state estimate, denoted by $\hat{x}_{i,k|k}$, based on the information received from the sensor. Let $t_{i,k} \triangleq \min\{\kappa \geq 0 : \theta_{i,k-\kappa} = 1\}$ denote the time elapsed since the most recent successful transmission. Based on its locally available information, i.e., $E\{e_{i,k|k} e_{i,k|k}^T \mid I_{i,k}\}$ denotes the error covariance at the estimator with the error being defined as $e_{i,k|k} \triangleq x_{i,k} - \hat{x}_{i,k|k}$. Moreover, $P_i$ is the steady-state a posteriori error covariance at the corresponding sensor. This estimation architecture is equivalent to the optimal estimator that one would obtain if all observations up to time $k - t_{i,k}$ were successfully delivered [29].
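To make the role of $t_{i,k}$ concrete, the short Python sketch below computes the age sequence from a sequence of reception indicators $\theta_{i,k}$ and then propagates a scalar error variance through $t_{i,k}$ open-loop prediction steps. It assumes that the estimator covariance after $t$ consecutive losses equals the $t$-fold composition $h^t(P)$ with $h(x) = a x a + w$; the helper names and all numerical values are hypothetical.

```python
def ages(theta, t_init=0):
    """Return t_k, the number of slots since the most recent successful reception."""
    t, out = t_init, []
    for received in theta:
        t = 0 if received else t + 1
        out.append(t)
    return out


def h_power(p, t, a, w):
    """t-fold composition h^t(p) of h(x) = a*x*a + w, with h^0(p) = p."""
    for _ in range(t):
        p = a * p * a + w
    return p


theta = [1, 0, 0, 1, 0]   # hypothetical reception outcomes over five slots
t_seq = ages(theta)       # ages after each slot
# Assumed estimator variance: steady-state value P pushed through t_k prediction steps.
P, a, w = 0.5, 1.2, 1.0
cov_seq = [h_power(P, t, a, w) for t in t_seq]
print(t_seq, cov_seq)
```

Under this assumption the variance resets to `P` after every successful reception and grows with each consecutive dropout, which is why the age alone summarizes the estimator's uncertainty.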

D. Cost of Information Loss (CoIL)
The concept of CoIL was introduced in [18] to quantify the additional cost incurred due to the loss of information. Let $F_k \subseteq \mathcal{N}$ denote the set of subsystems that transmit their data packet at $k$ and $\bar{F}_k \triangleq \mathcal{N} \setminus F_k$. Furthermore, we define $E^0_{i,k}$ as the cost of subsystem $i$ in case it does not receive any data at $k$; similarly, $E^1_{i,k}$ is the cost when this subsystem receives its data packet. Let $j_i : F_k \to \mathcal{M}$ denote the index $j$ with $\delta_{i,j,k} = 1$ for $i \in F_k$. At the beginning of time slot $k$, the expected value of the stage cost, denoted by $J_k$, can be written as where $I_{k-1} \triangleq \cup_{i \in \mathcal{N}} I_{i,k-1}$ is the available information at the beginning of time slot $k$, $Q \triangleq \{q_{i,j} : \forall i \in \mathcal{N}, \forall j \in \mathcal{M}\}$ contains the probability of successful transmission over each wireless link, and $\mathrm{CoIL}_{i,k} \triangleq E^0_{i,k} - E^1_{i,k}$ denotes the cost of information loss. Consequently, minimizing the expected cost is equivalent to finding $F_k$ such that the last term is maximized. $\mathrm{CoIL}_{i,k}$ can be construed as the amount by which subsystem $i$ increases the entire cost in case it receives no data packet at $k$.
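The decomposition behind this statement can be sketched as follows. This is a reconstruction from the definitions above rather than a quotation of the paper's own equation: it assumes that the stage cost is additive over subsystems, treats $E^0_{i,k}$ and $E^1_{i,k}$ as expected costs given the available information, and suppresses the conditioning on $I_{k-1}$ and $Q$ for brevity.

```latex
\begin{aligned}
J_k &= \sum_{i \in \bar{F}_k} E^0_{i,k}
      + \sum_{i \in F_k} \Big[ q_{i,j_i} E^1_{i,k} + (1 - q_{i,j_i}) E^0_{i,k} \Big] \\
    &= \sum_{i \in \mathcal{N}} E^0_{i,k}
      - \sum_{i \in F_k} q_{i,j_i} \big( E^0_{i,k} - E^1_{i,k} \big) \\
    &= \sum_{i \in \mathcal{N}} E^0_{i,k}
      - \sum_{i \in F_k} q_{i,j_i} \, \mathrm{CoIL}_{i,k}.
\end{aligned}
```

The first sum does not depend on who transmits, so only the last term is affected by the choice of $F_k$ and of the channels $j_i$.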

E. Timer-based Mechanism
The timer-based mechanism, denoted as TBCoIL, was first proposed in [23] for providing channel access in Networked Control Systems (NCSs). This mechanism is able to provide collision-free distributed channel access in capacity-constrained networks with no packet dropouts. TBCoIL is based on the idea of assigning a local timer to each subsystem $i$. At each time step $k$, the value of each timer is calculated by
$$\tau_{i,k} = \frac{\lambda}{m_{i,k}}, \tag{8}$$
where $m_{i,k}$ is a nonzero cost which represents how critical the data packet of subsystem $i$ is at time $k$. By choosing as $m_{i,k}$ a cost that can be calculated from local information, all subsystems can set their timers without requiring any explicit information exchange. Note that $\lambda$ in (8) can be interpreted as a tuning parameter which allows for adjusting the duration of the contention period based on the requirements. By using an identical value for $\lambda$ in all subsystems, the subsystem with the largest cost will have the smallest timer.
At the beginning of each transmission slot, subsystems calculate (8) and start their timers from the obtained values. The timer of the subsystem with the largest cost reduces to zero first and the corresponding subsystem sends a short-duration flag packet on the network, which informs the remaining contestants to stop their timers and back off to avoid collisions. Then, this subsystem transmits its data packet for the remaining duration of the slot. As the next transmission slot begins, the timers are reset to newly calculated values and the same procedure is repeated. Note that the timer is embedded in the sensor, where the local computations for deciding whether to access the channel are carried out, as depicted in Fig. 1. The underlying idea, in which the timer is a function of the channel quality only, is a celebrated result in wireless cooperative networks [32].

III. DISTRIBUTED CHANNEL ACCESS MECHANISM
We first modify TBCoIL to extend its application to the case where multiple imperfect channels are available. We assume that each subsystem is equipped with $M$ independent timers, i.e., a separate timer for each channel. Similar to the original method, the timer values are inversely proportional to the local cost and are determined by
$$\tau_{i,j,k} = \frac{\lambda_j}{m_{i,j,k}}, \tag{9}$$
where $\lambda_j$ is a constant specific to channel $j \in \mathcal{M}$ but is identical for all $i$, and the nonzero local cost, denoted by $m_{i,j,k}$, is calculated individually for each channel. For simplicity, we will assume that $\lambda_j$ is the same for all channels, i.e., $\lambda_j = \lambda$ for all $j$.
As the transmission slot begins, subsystems start their timers from (9). The smallest timer corresponds to the largest cost and thus the highest priority. Let $\{i^*, j^*\} = \arg\min_{i,j} \{\tau_{i,j,k}\}$ denote the indices of the smallest timer at $k$. As this timer reaches zero, subsystem $i^*$ immediately transmits a short-duration flag packet on channel $j^*$, which informs other subsystems to stop their timers for $j^*$ and back off. Simultaneously, $i^*$ stops the rest of its timers, i.e., withdraws from competition for other channels, and transmits its data packet on $j^*$ without collision. Meanwhile, the remaining subsystems compete for the remaining available resources until all $M$ channels have been allocated. Therefore, this mechanism inherently satisfies constraints (3) and (4). Similar to the original method, as the time slot ends, all timers are reset (based on the newly calculated local costs) and the entire procedure is repeated in the next time slot.
Remark 1. The idea of assigning multiple timers to subsystems can be realized by assuming that each subsystem is equipped with a single real-time clock. The value of an imaginary timer assigned to a specific channel can equivalently be represented by a checkpoint on the elapsed time of the clock from the beginning of the respective time slot. As the clock reaches the first checkpoint, i.e., the smallest timer expires, the corresponding channel is claimed and all the remaining checkpoints are removed, which can be interpreted as withdrawing from competition for the remaining resources. Furthermore, if a flag packet is received on a channel, the corresponding checkpoint is neglected, which is equivalent to backing off for avoiding packet collision.
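The outcome of one contention period can be emulated offline with a few lines of Python. The sketch below is a centralized simulation of what the distributed timers would produce, under the assumptions that each timer equals $\lambda / m_{i,j,k}$, that all costs are positive, that no two timers are equal, and that flag packets are received instantly. The function name and the cost values are hypothetical.

```python
def allocate_channels(cost, lam=1.0):
    """Emulate one contention period.

    cost[i][j] is the positive local cost m_{i,j,k} of subsystem i on channel j.
    Returns a dict mapping each transmitting subsystem to its claimed channel.
    """
    # One timer per (subsystem, channel) pair; a smaller timer means higher priority.
    timers = sorted(
        (lam / m, i, j)
        for i, row in enumerate(cost)
        for j, m in enumerate(row)
    )
    assignment, claimed = {}, set()
    for _, i, j in timers:  # timers expire in increasing order
        if i in assignment or j in claimed:
            continue  # i already withdrew, or a flag was already sent on j
        assignment[i] = j
        claimed.add(j)
    return assignment


# Three subsystems competing for two channels (hypothetical costs).
cost = [[4.0, 1.0],
        [3.0, 2.5],
        [0.5, 0.2]]
result = allocate_channels(cost)
print(result)
```

Because a subsystem is skipped once it holds a channel and a channel is skipped once it is claimed, the returned assignment respects both the one-subsystem-per-channel and the one-channel-per-subsystem constraints.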

A. Timer Setup
The main challenge for implementing the proposed channel access mechanism is quantifying the local cost such that it corresponds to control performance while, to avoid explicit information exchange, it is a function of the local information only. We first start by breaking down the quadratic cost of the system to identify the components that are affected by channel access decisions and derive the associated CoIL. As we will show, local information is sufficient for computing CoIL, and we can utilize it in timers for solving the control-aware channel access problem in a distributed fashion.
Lemma 1. Consider the cost criterion defined in (6). The stage cost at $k$ is given by where Considering a finite horizon, the linear quadratic cost (6) can be written as [33, Lemma 6.1, Ch. 8] where $\Pi_{i,k}$ is determined by solving the standard DARE over the finite horizon $K$ and is used accordingly for obtaining the associated matrices $L_{i,k}$ and $\Gamma_{i,k}$. Note that the covariance of the process noise is time-invariant. Hence, using [33, Lemma 3.3, Ch. 8] yields where the first equality results from the zero-mean property of $w_{i,k}$, the definition of $e_{i,k|k} \triangleq x_{i,k} - \hat{x}_{i,k|k}$, and recalling that $X_{i,0}$ and $W_i$ are the covariance matrices of the initial state and process disturbance. Moreover, the last equality follows from the law of total expectation and the definition of the error covariance. Therefore, by considering the infinite horizon and using the steady-state values $\Gamma_{i,\infty}$ and $\Pi_{i,\infty}$ for all $i$, the stage cost at $k$ is determined by (10).
We are now ready to derive CoIL by examining how the channel access decisions impact the stage cost in (10).
The cost of information loss for each subsystem can be formulated as Proof. Following the same procedure as (7) yields Thus, by comparing this result with the definition in (7), CoIL is obtained by (11).
Since minimizing (12) is equivalent to maximizing the last term, the optimal resource allocation at time slot $k$ can be formulated as
$$\max_{\Delta_k} \; \sum_{i \in \mathcal{N}} \sum_{j \in \mathcal{M}} \delta_{i,j,k} \, q_{i,j} \, \mathrm{CoIL}_{i,k} \quad \text{subject to (3) and (4)}, \tag{13}$$
where CoIL is given by (11) and $\Delta_k$ is a binary matrix that includes all the optimization variables at time $k$, i.e., $\Delta_k \triangleq [\delta_{i,j,k}] \in \{0, 1\}^{N \times M}$. Note that this optimization problem can be rewritten as a generic assignment problem (see [18]) and solved efficiently in a centralized manner by adopting methods such as the Hungarian method [34]. Nevertheless, we aim at solving this problem in a distributed manner. As mentioned above, if the local information is sufficient for determining the cost $m_{i,j,k}$ in (9), implementing the timer-based mechanism ensures that channel access is granted to the subsystems with the highest cost in a distributed fashion. Given that $q_{i,j}$ is known for all $i$ and $j$, and since computation of CoIL in (11) only requires local information, the product in (13) can be used as $m_{i,j,k}$ in timers, i.e.,
$$m_{i,j,k} = q_{i,j} \, \mathrm{CoIL}_{i,k}. \tag{14}$$
The first $M$ timers that expire, each for a different channel, determine the transmitting subsystems and the corresponding claimed channels. As a result, this setup provides a distributed solution to (13). Since the event that two timers take exactly the same value has Lebesgue measure zero, assuming negligible propagation delays and one-bit flags ensures that channel access is collision-free. Even in homogeneous networks, i.e., networks containing subsystems with identical dynamics, timers lead to collision-free channel access since the $q_{i,j}$'s are distinct despite (possibly) identical values for $\mathrm{CoIL}_{i,k}$. Note that computation of CoIL only requires knowledge of the system parameters, the initial condition $P_i$, and the age of the last successfully received packet, i.e., $t_{i,k}$. Therefore, it can be determined independently of the measurements, which allows for implementing the timers away from the sensor that takes the measurements, thus leading to more flexible architectures.
Remark 2. At each time instant κ, the optimal resource allocation problem for minimizing (6) over a finite horizon can be formulated as This is a mixed-integer optimal control problem (MIOCP) formulated in discrete time. Due to the extreme difficulty of solving this problem, approximate solutions can be obtained by adopting the partial outer convexification approach and utilizing numerical solvers as discussed in [13]. Although the formulation in (13) provides the solution over a single time step, in addition to computational efficiency, it facilitates distributed implementation as aforementioned.
Remark 3. Application of TBCoIL and its proposed variant is not limited to networks consisting of LTI processes with LQG controllers. Nevertheless, for the system described in Section II, CoIL is only a function of locally available information and thus it enables control-aware distributed channel access with timers. The same method can be applied in other scenarios, e.g., nonlinear systems, as long as the control objective is defined such that computation of CoIL requires no explicit information exchange between subsystems.

B. Stability Analysis
We investigate the stability of the WNCS under the proposed channel access scheme by considering the Lyapunov mean square stability criterion. For ease of exposition, the subscript corresponding to the index of the subsystem is dropped in Definition 1 and Lemma 3 since only a single subsystem is considered.
Definition 1 (Lyapunov mean square stability [35]). The equilibrium solution is said to possess stability of the second moment if, given $\varepsilon > 0$, there exists $\xi(\varepsilon)$ such that $\|x_0\| < \xi$ implies
$$E\{\|x_k\|^2\} < \varepsilon, \quad \forall k \geq 0. \tag{16}$$
Lemma 3. For the closed-loop systems considered in this work, there exists $\varphi$ satisfying $0 < \varphi < \varepsilon$, such that (16) is equivalent to
$$\mathrm{tr}\, E\{P_{k|k}\} < \varphi. \tag{17}$$
Proof. Let $A_L = A + B L_\infty$. The state dynamics can be written as since $w_k$ is zero-mean and independent of $\hat{x}_{k|k}$ and $e_{k|k}$, and $E\{e_{k|k} \mid I_k\} = E\{x_k \mid I_k\} - \hat{x}_{k|k} = 0$. From the law of total expectation it follows that which, in accordance with Definition 1, must be bounded. Then, for the second term we obtain [36, Fact 8.12.28] which is bounded if $E\{P_{k|k}\} < \infty$. The certainty equivalence principle holds and thus the adopted controller ensures mean square boundedness of the state estimate, which in turn ensures boundedness of the first term in (18). Since the first and last terms of (18) are non-negative and bounded, the stability condition (16) only depends on the boundedness of $E\{P_{k|k}\}$. Hence, $x_k$ is Lyapunov stable in the mean square sense if and only if there exists $0 < \varphi < \varepsilon$ such that $\mathrm{tr}\, E\{P_{k|k}\} < \varphi$.
As a result of Lemma 3, stability of the WNCS under the proposed channel access scheme can be guaranteed as long as, for all $i \in \mathcal{N}$, there exists $0 < \varphi_i < \infty$ such that $\mathrm{tr}\, E\{P_{i,k|k}\} < \varphi_i$. To this end, we exploit the fact that the number of consecutive packet dropouts determines the error covariance at the estimator, i.e.,
$$P_{i,k|k} = h_i^{t_{i,k}}(P_i) = A_i^{t_{i,k}} P_i (A_i^{t_{i,k}})^T + \sum_{c=1}^{t_{i,k}} A_i^{c-1} W_i (A_i^{c-1})^T, \tag{19}$$
where $\sum_{c=1}^{0} \triangleq 0$. By showing that the process $t_{i,k}$ is an ergodic Markov chain, its stationary distribution can be utilized to determine the boundedness of $E\{P_{i,k|k}\}$. We first demonstrate how the Markov chain can be constructed and analyzed through an illustrative example, and subsequently derive the stability conditions.
Example 1. Consider a WNCS consisting of two unstable subsystems that share a single channel, i.e., $N = 2$ and $M = 1$, where the timers are set according to (14) for providing channel access. Let $S = \mathbb{Z}_{\geq 0} \times \mathbb{Z}_{\geq 0}$ denote the state space of a two-dimensional Markov chain. For any $m \in \mathbb{Z}_{\geq 0}$ and $l \in \mathbb{Z}_{\geq 0}$ we denote the respective state by $(m, l) \in S$, which corresponds to $t_{1,k} = m$ and $t_{2,k} = l$. To determine the transition probabilities, we define the state-dependent action by
$$a_{(m,l)} = \begin{cases} 0, & \text{if Subsystem 1 claims the channel,} \\ 1, & \text{if Subsystem 2 claims the channel,} \end{cases} \tag{20}$$
Fig. 2. Two-dimensional Markov chain modeling the evolution of $(t_{1,k}, t_{2,k})$ in a WNCS where two subsystems share a single channel. Here, the transition probabilities $\rho_1$, $\rho_2$, and $\rho_3$ correspond to (21a), (21b), and (21c), respectively.
which indicates the outcome of employing the timers in (14). For each state $(m, l)$, CoIL for each subsystem can be determined from (11). Furthermore, the probability of successful transmission over each wireless link is known and time-invariant. Hence, the timer values and the resulting channel access decision at each state can be determined regardless of the time instant $k$, which is represented by the state-dependent deterministic action in (20). Let $0 < q_i \leq 1$ be the probability of successful transmission for subsystem $i \in \{1, 2\}$ and let $p_i \triangleq 1 - q_i$. Note that subscript $j$ is dropped since only a single channel is available. The transition probabilities are given by

  
Note that since $M < N$, the state $(0, 0)$ exists only when the system is initiated and can safely be ignored. This state can be excluded from the analysis by removing the first row and column of $T$; the resulting matrix is denoted by $\tilde{T}$. Thus, the resulting Markov chain has a single communicating class and is irreducible, aperiodic and, since $q_i > 0$, positive recurrent. As a result [37, Ch. 1], this chain always has a limiting distribution $\pi = [\pi_0, \pi_1, \ldots]$, where $\pi_m = [\pi_{(m,0)}, \pi_{(m,1)}, \ldots]$, which is the unique solution to
$$\pi \tilde{T} = \pi, \qquad \pi \mathbf{1} = 1, \tag{22}$$
where $\mathbf{1}$ is the all-ones column vector of appropriate dimensions. The vector $\pi$ found by solving (22) can be used to determine $\mu_i(t) \triangleq P\{t_{i,k} = t\}$, which is crucial in the remainder of this section. For the illustrative case considered in this example we have
$$\mu_1(t) = \sum_{l \in \mathbb{Z}_{\geq 0}} \pi_{(t,l)}, \qquad \mu_2(t) = \sum_{m \in \mathbb{Z}_{\geq 0}} \pi_{(m,t)}.$$
The method of Example 1 can readily be applied in larger WNCSs to form the Markov chain that models the evolution of the $t_{i,k}$'s. The states of the chain in such general settings represent $(m_1, m_2, \ldots, m_N)$, where $m_i = t_{i,k}$. Furthermore, the transition probabilities are determined by the state-dependent actions that result from the interaction of $N \times M$ state-dependent timers. This leads to an $N$-dimensional irreducible, aperiodic, and positive recurrent Markov chain with a corresponding transition probability matrix $\tilde{T}$. Therefore, the limiting distribution of this Markov chain is the unique solution of (22). Consequently, $\mu_i(t)$ can be determined for all $i$ and $t$, which, as we show next, is crucial for examining whether $\mathrm{tr}\, E\{P_{k|k}\}$ is bounded as required by Lemma 3.
Proof. Due to the ergodicity of the Markov chain, taking the limit of the expected value of (19) yields Subsequently, and applying Gelfand's formula yields Hence, if (25) holds for all i ∈ N , the upper bound in (27) exists which itself guarantees that 0 < ϕ i < ∞ exists such that tr E{P i,k|k } < ϕ i thus concluding the proof.
If a closed-form expression for $\mu_i(t)$ is known, Theorem 1 can readily be used to verify stability. In general, however, finding a closed-form expression might not be possible, as is the case in Example 1. Nevertheless, as will be demonstrated in Section V, the p-series convergence test can be used in practice to examine stability within the same framework.

(21) is inevitably zero, all states can be reached with a nonzero probability. This is due to the fact that $h_i^t(X)$ is a monotonically increasing function of $t$ [39, Lemma A.3]. More specifically, assume that for a given state $(m, l)$ the parameters are such that Subsystem 1 has a smaller timer, which means that $a = 0$. From (21) it follows that $P\{(m + 1, 0) \mid (m, l), a\} = 0$. Nonetheless, there exists a state $(m, l')$ with $l' > l$ such that the CoIL of Subsystem 2 is large enough to result in a smaller timer value than that of Subsystem 1. Therefore, in state $(m, l')$ the action is $a = 1$ and thus $P\{(m + 1, 0) \mid (m, l'), a\} = q_2$.

A. Problem Statement
Optimal resource allocation requires knowledge of the exact values of the $q_{i,j}$'s, which describe the time-invariant distributions of the time-varying channels. Due to the dynamic nature of the considered subsystems and the changing environment, the coherence time of the channel is relatively small and fast fading occurs, rendering instantaneous channel state information (CSI) acquisition impossible. In such settings, learning methods can be applied to gain knowledge of the underlying statistical parameters of the channel, which are assumed to change very slowly with respect to the coherence time. Despite the abundance of existing learning algorithms that are applicable to standard wireless networks, adopting a suitable learning algorithm in our problem is challenging for two setup-related reasons: (i) the considered WNCS structure allows no information exchange between subsystems, and thus the learning method should be compatible with distributed implementation; (ii) since the main objective is minimizing the quadratic cost, the adopted algorithm should be compatible with the proposed timer-based mechanism. More specifically, channel statistics cannot be learned separately without taking CoIL into account. We aim at devising a novel distributed method that maintains good control performance while learning the channel statistics.

B. A MAB Approach
The MAB problem refers to optimal sequential allocation in unknown random environments. In the classic single-player stochastic MAB, a player has access to multiple, say $M$, independent arms. At each round, the player pulls an arm $j \in \mathcal{M}$, which yields a reward drawn randomly from an unknown probability distribution specific to that arm. Since the player has no prior knowledge of the reward distributions, an inferior arm in terms of reward might be played. We define regret as the difference between the reward obtained from playing the best arm and the reward obtained from the player's choice. Let $r_{j,k}$ and $I_k$ denote the instantaneous reward obtained from arm $j$ and the selected arm at round $k$, respectively. Then, the (external) regret up to round $K$ is defined by
$$\max_{j \in \mathcal{M}} \sum_{k=1}^{K} r_{j,k} - \sum_{k=1}^{K} r_{I_k,k}.$$
The objective is to find a policy for selecting the arms, i.e., to determine $I_k$ at each round $k$, such that this regret is minimized over the game horizon. The performance of a policy relies on how it addresses the exploration/exploitation dilemma: balancing the exploration of all arms to learn their reward distributions against playing the best arm more often to gain more reward.

The channel selection problem for a single subsystem can be conveniently cast as a single-player MAB. In this scenario, channels represent arms and playing an arm corresponds to claiming a channel for packet transmission. We adopt a binary rewarding scheme ($r_{j,k} \in \{0, 1\}$), where in case of a successful transmission a unit reward is obtained over the corresponding channel ($r_{I_k} = 1$); otherwise, no reward is earned ($r_{I_k} = 0$). The channels are independent and packet dropouts are i.i.d. random; consequently, the rewards are i.i.d. random. The mean of the Bernoulli distribution of rewards over each channel corresponds to the probability of successful transmission in (5). Therefore, by adopting a suitable policy, after an initial exploration phase, the channel with the best quality is exploited for maximizing the success rate or, equivalently, the reward.
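To make the notion of regret concrete, the following minimal sketch computes a hindsight regret from a table of binary rewards. It assumes regret is measured against the single arm with the largest cumulative reward, which may differ in detail (for example, in the use of expectations) from the definition used in the analysis. The function name `external_regret` and the sample data are hypothetical.

```python
def external_regret(rewards, choices):
    """Cumulative reward of the best fixed arm minus the collected reward.

    rewards[k][j] is the reward arm j would have given at round k, and
    choices[k] is the arm actually played at round k.
    """
    num_arms = len(rewards[0])
    best_fixed = max(sum(row[j] for row in rewards) for j in range(num_arms))
    collected = sum(rewards[k][choices[k]] for k in range(len(choices)))
    return best_fixed - collected


# Hypothetical three-round, two-arm example with binary rewards.
rewards_example = [[1, 0],
                   [1, 1],
                   [0, 1]]
choices_example = [1, 0, 0]
regret_example = external_regret(rewards_example, choices_example)
```

Here either arm would have collected 2 units in hindsight, while the played sequence collects 1, so the regret is 1.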
Index policies are a class of solutions to this problem, which assign an index to each arm and play the one with the largest index. One of the main categories of methods in this class is based on the upper confidence bound (UCB). These policies estimate an upper bound of the mean reward of each arm at some fixed confidence level and determine the indices accordingly. One of the celebrated results based on this idea is UCB1, a policy introduced in [40]. In this policy, at each round $k$, the upper confidence bound of the mean reward, denoted by $\hat{q}_{j,k}$, is calculated and the arm with the largest $\hat{q}_{j,k}$ is played. In this work, we use a slightly modified version of UCB1 to ensure collision-free channel access. More specifically, we calculate $\hat{q}_{j,k}$ by
$$\hat{q}_{j,k} = \bar{r}_{j,k} + \sqrt{\frac{2 \ln z_k}{z_{j,k} + \epsilon_{j,k}}}, \tag{31}$$
where, similar to the UCB1 algorithm, $z_k$ is the total number of plays, $z_{j,k}$ denotes the number of plays of arm $j$ up to $k$, and $\bar{r}_{j,k}$ is the average reward obtained from playing arm $j$ up to $k$, i.e.,
$$\bar{r}_{j,k} = \frac{1}{z_{j,k}} \sum_{\kappa=1}^{k} r_{j,\kappa}\, \mathbb{1}_{I_\kappa = j},$$
where $\mathbb{1}_{I_\kappa = j}$ is 1 when the arm played at round $\kappa$ is $j$ and 0 otherwise.
Moreover, differently from the original algorithm, we also employ a uniformly distributed random variable $\epsilon_{j,k} \sim \mathcal{U}(a, b)$ in the exploration term of (31). This ensures that the upper confidence bounds are distinct during the initial exploration phase of the algorithm, thus enabling collision-free channel access with timers. Furthermore, since $z_{j,k} \in \mathbb{Z}_{\ge 0}$, choosing small values for $a$ and $b$ ensures that convergence of (31) is not disrupted, as demonstrated in Subsection V-B. Note that the index of the played arm at $k$ is given by
$$I_k = \arg\max_{j \in \mathcal{M}} \hat{q}_{j,k}.$$

The problem of distributed channel access in standard wireless networks, unlike in WNCSs, only concerns maximizing throughput without considering the importance of the contents of the data packets [41]. This problem can be cast as a multi-player MAB where, given that the aforementioned binary rewarding scheme is adopted, the maximum reward at each time step $k$ is given by the optimal resource allocation according to
$$\max_{\delta_{i,j,k} \in \{0,1\}} \sum_{i \in \mathcal{N}} \sum_{j \in \mathcal{M}} q_{i,j}\, \delta_{i,j,k} \tag{32}$$
subject to constraints (3) and (4). Since the reward distribution over each wireless link is assumed to be time-invariant, the optimal decision variables are likewise time-invariant. Consequently, the subscript $k$ is dropped and we denote the solution by $\delta^{q*}_{i,j}$. As a result, the regret is given by (33), where $I_{i,k}$ denotes the index of the channel selected by subsystem $i$ at round $k$. By implementing suitable policies, one can ensure that this regret grows logarithmically.
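The single-player index policy described above can be sketched as follows. This is an illustrative simulation under stated assumptions rather than the implementation used for the reported results: the perturbation drawn from $\mathcal{U}(a, b)$ is assumed to be added to the play count inside the exploration term, the values of `a` and `b` are arbitrary small numbers, and the channel success probabilities are hypothetical.

```python
import math
import random


def perturbed_ucb1_index(mean_reward, total_plays, arm_plays, rng, a=1e-6, b=1e-3):
    """UCB1-style index with a small random perturbation (assumed placement)."""
    eps = rng.uniform(a, b)  # keeps the indices of different arms distinct
    return mean_reward + math.sqrt(2.0 * math.log(total_plays) / (arm_plays + eps))


def run_single_player(success_probs, rounds, seed=0):
    """Play Bernoulli arms with the perturbed index; return play counts."""
    rng = random.Random(seed)
    num_arms = len(success_probs)
    plays = [0] * num_arms
    reward_sums = [0.0] * num_arms
    for k in range(rounds):
        if k < num_arms:
            arm = k  # initial phase: play every arm once
        else:
            indices = [
                perturbed_ucb1_index(reward_sums[j] / plays[j], k, plays[j], rng)
                for j in range(num_arms)
            ]
            arm = max(range(num_arms), key=indices.__getitem__)
        reward = 1 if rng.random() < success_probs[arm] else 0  # binary reward
        plays[arm] += 1
        reward_sums[arm] += reward
    return plays


# Hypothetical success probabilities of three channels.
example_plays = run_single_player([0.3, 0.5, 0.8], rounds=5000)
```

With well-separated success probabilities, the arm with the highest probability should receive the large majority of the plays after the exploration phase, while the perturbation only serves to break ties between indices.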

C. Distributed Channel Access Algorithm
We first cast our problem as a multi-player MAB and then propose a novel indexing policy for addressing the exploration/exploitation dilemma with respect to the control performance in a distributed manner. Since our goal is to minimize the quadratic cost, with a slight abuse of notation, we define the cost regret up to time $K$ as in (34). The aim of the policy is to determine, without any prior knowledge of the channel qualities, the subset of subsystems that transmit and their respective channels; this corresponds to the first term of (34). The last term of (34) is the minimum cost that is incurred when $Q$ is known; it is obtained by solving the optimal resource allocation problem formulated in (13). The performance of a channel access policy can now be measured in terms of the cost regret.
Although the cost regret is fundamentally different from the standard regret defined in (33), we propose a new method for exploiting the well-established results on minimizing the latter by introducing time-varying weights that reflect the control performance. We still apply the aforementioned binary rewarding scheme, i.e., $r_{i,j,k} = 1$ if the transmission of subsystem $i$ over channel $j$ at time $k$ is successful and $r_{i,j,k} = 0$ otherwise, which is an i.i.d. random variable with $E\{r_{i,j,k}\} = q_{i,j}$. We calculate the initial index of each channel by an index policy that is compatible with distributed implementation, that is, a policy that requires only local information for calculating the index of each arm and whose resulting indices are distinct, e.g., as per (31). In our policy, these initial indices are then weighted by the control performance metric, namely CoIL. Consequently, the index of the channel selected by each subsystem follows from $\delta^{*}_{i,j,k}$, which is obtained by solving the optimization problem (37) subject to constraints (3) and (4). This ensures correct estimation of the success probability of each channel while, at the same time, the slot is allotted to the subsystem with the highest cost. This policy can be implemented in a distributed manner by adopting the timer-based mechanism for solving (37). By using the weighted indices as the local cost, the timers are determined by (38). As a result, assuming that the duration of the flag packet is negligible, and since the event of two indices $\hat{q}_{i,j,k}$ being identical has Lebesgue measure zero, this mechanism guarantees collision-free channel access even for homogeneous systems.
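The outcome of such a timer-based mechanism can be sketched with a small centralized simulation. The timer rule used below, a constant divided by the CoIL-weighted index, is an assumption made purely for illustration and is not the timer setup in (38); the function name `timer_based_access`, the `scale` parameter, and the numerical values are likewise hypothetical. The sketch only captures the ordering effect: the shortest timer expires first, the corresponding subsystem claims the channel, and all remaining timers of that subsystem and for that channel are cancelled.

```python
def timer_based_access(coil, q_hat, scale=1.0):
    """Return {subsystem: channel} obtained by expiring timers in increasing order.

    coil[i] is the local cost of subsystem i and q_hat[i][j] its index for
    channel j; both are assumed to be strictly positive.
    """
    timers = []
    for i, cost in enumerate(coil):
        for j, quality in enumerate(q_hat[i]):
            # Assumed rule: a larger weighted index gives a shorter timer.
            timers.append((scale / (cost * quality), i, j))
    timers.sort()

    assignment, busy_channels = {}, set()
    for _, i, j in timers:
        # An expired timer claims channel j for subsystem i unless either
        # has already been served (the flag packet cancels those timers).
        if i not in assignment and j not in busy_channels:
            assignment[i] = j
            busy_channels.add(j)
    return assignment


# Hypothetical values: three subsystems contending for two channels.
coil_example = [4.0, 1.0, 2.5]
q_hat_example = [[0.9, 0.4],
                 [0.8, 0.7],
                 [0.3, 0.85]]
access = timer_based_access(coil_example, q_hat_example)
```

In this example the weighted indices are 3.6 and 1.6 for the first subsystem, 0.8 and 0.7 for the second, and 0.75 and 2.125 for the third, so the first subsystem claims the first channel and the third subsystem claims the second one, while the second subsystem waits. Because the indices are distinct, no two timers expire simultaneously.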
When the channel access policy is designed with respect to the regret defined in (33), its implementation only maximizes the number of successful transmissions. This translates to sacrificing the control performance, which is the primary objective in WNCSs, in favor of maximizing the throughput. Nevertheless, the outcome of these policies can be steered toward the control objective by applying the time-varying weights, i.e., CoIL. This significantly improves performance despite (possibly) higher packet dropout rates, as shown by the numerical results in Section V. Algorithm 1 illustrates the detailed distributed implementation of our proposed policy.
Remark 5. Before initiating the index calculation in (31), each arm needs to be played once. This can easily be achieved by temporarily adopting round-robin, where subsystems transmit according to a random sequence for the first $N \times M$ time steps, i.e., the number of subsystems times the number of channels. Afterwards, by setting the reference time to $N \times M$, the generated sets of observations and accumulated rewards, denoted by $Z_{i,1} \triangleq \{z_{i,j,1} \mid \forall j \in \mathcal{M}\}$ and $R_i \triangleq \{R_{i,j} \mid \forall j \in \mathcal{M}\}$, respectively, are used for determining channel access according to Algorithm 1.
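The initialization in Remark 5 can be sketched as follows, assuming one transmission per time step and a uniformly shuffled order of the subsystem-channel pairs. The function name, the seed, and the success probabilities are hypothetical.

```python
import random


def round_robin_initialization(success_probs, seed=0):
    """Play every (subsystem, channel) pair once in a random order.

    success_probs[i][j] is the (unknown to the learner) success probability
    of subsystem i over channel j. Returns the schedule together with the
    play counts and accumulated rewards after N * M time steps.
    """
    rng = random.Random(seed)
    n, m = len(success_probs), len(success_probs[0])
    schedule = [(i, j) for i in range(n) for j in range(m)]
    rng.shuffle(schedule)  # random sequence of transmissions

    plays = [[0] * m for _ in range(n)]
    rewards = [[0] * m for _ in range(n)]
    for i, j in schedule:
        plays[i][j] += 1
        # Binary reward: 1 for a successful transmission, 0 otherwise.
        rewards[i][j] += 1 if rng.random() < success_probs[i][j] else 0
    return schedule, plays, rewards


# Hypothetical network with three subsystems and two channels.
probs_example = [[0.9, 0.4],
                 [0.8, 0.7],
                 [0.3, 0.85]]
init_schedule, init_plays, init_rewards = round_robin_initialization(probs_example)
```

After this phase every play count equals one, so the exploration term of an index policy is well defined for all arms when the reference time is reset.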
The states are the wheel angle, the tilt angle, and their respective derivatives. The input is the voltage of the DC motors delivering torque to the wheels, and the output consists of the measurements given by the encoder and the inertial measurement sensor. Furthermore, the covariances of the process disturbance and the measurement noise are chosen as $W = 0.1 I_4$ and $V = 0.01 I_2$, respectively, and the weighting matrices in (6) are $Q = I_4$ and $R = 0.1$.

A. Stability Analysis
Considering the scenario of two balancing robots contending for channel access as in Example 1, the truncated Markov chain can be analyzed to provide insight into the stability of the system. Due to the lack of a closed-form expression for $\mu_i(t)$, the stability condition in Theorem 1 cannot be verified directly. Therefore, we consider a truncated version of the Markov chain of Fig. 2 by letting $0 \le m, l \le \bar{m}$, as graphically represented in Fig. 3. In this scenario, we can form the transition matrices $\hat{P}_m$ and $\hat{Q}_m$ by keeping only the first $\bar{m}$ rows and columns of $P_m$ and $Q_m$, respectively. As a result, the transition probability matrix of this chain can be expressed as $\hat{T}$, which is row stochastic, irreducible, and aperiodic; therefore, the stationary probability vector can be obtained by (39) [43], [44], where $D(m, l) = 1$ for all $m, l$. Although this is an approximation of the actual chain in Fig. 2, by choosing a sufficiently large $\bar{m}$, (39) provides a highly accurate approximation of the stationary distribution of the actual chain.

According to the p-series convergence test, the series on the right-hand side of (27) is convergent if there exist $p > 1$ and $\beta < \infty$ such that
$$\mu_i(t)\, \|A_i\|^{2t} \le \frac{\beta}{t^{p}}. \tag{40}$$
With $\mu_1(t)$ and $\mu_2(t)$ as per (23) and (24), respectively, $\mu_i(t) \|A_i\|^{2t}$ is seen to be a monotonically decreasing function of $t$ for $t \ge 4$ for both subsystems, and it is upper-bounded by $\beta / t^{p}$. This can be construed as convergence of the series in (40) and thus stability. In sharp contrast, Fig. 5 shows the case where $q_1$ is reduced to 0.2. In this scenario, $\mu_i(t) \|A_i\|^{2t}$ becomes an increasing function of $t$, which indicates that the right-hand side of (27) is a divergent series, and thus stability of the system cannot be guaranteed.
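A numerical version of the p-series check can be sketched as below. The sketch tests, over a finite horizon, whether a sequence of terms stays below $\beta / t^{p}$; a finite check of this kind is only evidence and not a proof of convergence. The geometric form assumed for $\mu(t)$ and the scalar growth factor `gain` are hypothetical stand-ins for the quantities obtained from the truncated chain and the system matrix.

```python
def within_p_series_bound(terms, p, beta, t_start=1):
    """Check terms[t - 1] <= beta / t**p for all t >= t_start in the horizon."""
    if p <= 1:
        raise ValueError("the p-series converges only for p > 1")
    return all(
        term <= beta / t ** p
        for t, term in enumerate(terms, start=1)
        if t >= t_start
    )


def geometric_terms(mu_ratio, gain, horizon):
    """Terms mu(t) * gain**(2 t) with the assumed mu(t) = (1 - r) * r**t."""
    return [
        (1 - mu_ratio) * mu_ratio ** t * gain ** (2 * t)
        for t in range(1, horizon + 1)
    ]


# mu decays fast enough: 0.5 * 1.2**2 = 0.72 < 1, so the terms vanish.
decaying = within_p_series_bound(geometric_terms(0.5, 1.2, 200), p=2, beta=3.0)
# mu decays too slowly: 0.9 * 1.2**2 = 1.296 > 1, so the terms grow.
growing = within_p_series_bound(geometric_terms(0.9, 1.2, 200), p=2, beta=3.0)
```

The first sequence satisfies the bound over the whole horizon, whereas the second eventually exceeds it, mirroring the contrast between the stable and the unstable scenario discussed above.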

B. MAB Approach
In this subsection, it is assumed that all subsystems communicate with a central scheduler which prioritizes channel access based on a measure $m_{i,j,k}$. This enables us to also consider policies which are not compatible with distributed implementation, as well as those which might result in collisions. Fig. 6 illustrates how the choice of the prioritizing criterion affects exploration/exploitation in a small network consisting of three subsystems and two communication channels. The probability of successful transmission over each wireless link is given in Table I. Since the dynamics are identical, the best performance is achieved when Subsystems 1 and 2 transmit more frequently on Channel 1 and Subsystem 3 on Channel 2.
In the first scenario, we consider using CoIL as the sole priority measure without taking into account the different link qualities. To this end, we use $m_{i,j,k} = \mathrm{CoIL}_{i,k}\, q_0$ as the priority measure, where $q_0$ is chosen as an identical constant for all links, and the higher-priority subsystems choose among the available channels randomly. The top plot in Fig. 6 depicts the result of adopting this scheme where, as expected, subsystems transmit over both channels equally often and the statistics are not learned, thus leading to endless exploration. The next scenario concerns adopting $\hat{q}_{i,j,k}$, which is the UCB1 index determined by (31), as the only priority measure. As shown in the middle plot, although the best channels are exploited in this scheme, the dynamics are ignored in the exploration/exploitation. Therefore, despite the identical dynamics, Subsystem 2 is rarely granted channel access (starved), which could destabilize the system. Finally, we consider the effect of prioritizing with respect to CoIL as well as channel qualities by implementing the proposed policy in (8). As the results illustrated in the bottom plot of Fig. 6 indicate, exploration in this scheme is done with respect to CoIL while the outcome is exploited for minimizing the cost regret. More specifically, all subsystems are frequently given channel access due to their unstable dynamics. Meanwhile, Channel 1 is allocated more often to Subsystem 1 and Subsystem 2 to ensure the highest probability of successful transmission, i.e., exploitation.
The performance of several policies in terms of average regret and cost regret is depicted in Fig. 7. In addition to the aforementioned measures, we also consider the case of granting channel access based on the solution of (32) when channel qualities are known, denoted by $q$. The solution of the MIOCP in (15) is obtained by using the open-source nonlinear mixed integer programming (BONMIN) solver with $\kappa = 5$. Moreover, we consider the impact of using the indices in (31), denoted by CoIL $\hat{q}$, instead of the originally proposed UCB1 algorithm, denoted by CoIL $\hat{q}$-org, in the timer setup (38). As expected, the average regret when using the upper confidence bound calculated in (31) as the measure converges to zero, while allocating the resources without considering the channel statistics, i.e., using CoIL $q_0$, leads to the largest average regret. On the other hand, when considering the average cost regret, the latter outperforms the scenarios where control performance is neglected. More specifically, the system is destabilized and the cost regret is unbounded for $q$ and $\hat{q}$, while CoIL $q_0$ can stabilize the system despite its nonzero average cost regret. Moreover, the average cost regret of our proposed policy when channel qualities are unknown converges to zero quickly, indicating satisfactory performance. A similar trend is observed with the original UCB1 indices, which shows that our proposed policy enables distributed implementation without adversely affecting the exploration/exploitation. Although using the solution of the MIOCP results in a lower quadratic cost than (13), as indicated by the negative cost regret, it can only be realized in a centralized configuration and requires considerable computational resources.

C. Distributed Implementation in Large Networks
To evaluate the impact of the adopted learning method on the performance of the timer-based mechanism, we consider three additional setups where the channel statistics are taken into account by implementing kl-UCB [45], kl-UCB++ [46], and a Bayesian framework [47]. Similar to the method used for modifying UCB1, a randomly generated number is added in the exploration term of kl-UCB and kl-UCB++, and the resulting indices, i.e., $\hat{q}_{i,j}$, are adopted in (38) for minimizing the cost regret. The addition of the random number ensures that obtaining identical indices has Lebesgue measure zero, and thus the indices can be used in timers for providing distributed channel access without collisions. Unlike the MAB approaches, the adopted Bayesian method is based on the assumption that the channel has memory, i.e., the packet dropouts are correlated rather than being i.i.d. random. Nevertheless, it is capable of learning the belief of successful transmission within the timer-based framework. The cost incurred by a mechanism which ignores the channel statistics, i.e., using CoIL $q_0$ as the measure in (9), is chosen as the benchmark for the cost reduction achieved by the other setups. Additionally, we consider a centralized setup which prioritizes channel access based on VoI, introduced in [16], rather than CoIL. Since VoI is developed for resource allocation over perfect channels, we assume that one of the available channels is assigned randomly to the subsystem with the highest VoI, similar to CoIL $q_0$. Fig. 8 depicts how much the average quadratic cost in (6) is reduced by the aforementioned setups compared to adopting CoIL $q_0$ for $N \in \{8, 16, 24, 40\}$ and $M = 0.75N$. As expected, the best performance, i.e., the lowest average cost, is achieved when the probability of successful transmission over each link is known and incorporated in (14). This setup can reduce the incurred cost by 30% to 36%, depending on the size of the WNCS.
When the exact values of the $q_{i,j}$'s are unknown, ignoring them as in VoI leads to the least improvement. Nevertheless, it offers up to 20% reduction in cost compared with CoIL $q_0$ due to utilizing the measured output for prioritizing channel access rather than the statistics of the error. Using the false assumption of Markovian packet dropouts and applying the Bayesian learning method leads to better performance in smaller networks, while its performance deteriorates in larger settings. When the $q_{i,j}$'s are unknown a priori, using the indexing policies for channel access results in the best performance. The results indicate that, regardless of the adopted indexing policy, the setup in (38) leads to significant improvements of up to 30%. Nevertheless, utilizing the indices obtained by kl-UCB offers 1% better performance compared with kl-UCB++ and UCB1.
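For reference, a kl-UCB-type index for Bernoulli rewards can be computed by bisection, as in the minimal sketch below. It uses the basic exploration budget $\ln(\text{total plays})$ divided by the number of plays of the arm, and it assumes that the small random number `eps` is added to the play count; both choices are assumptions for illustration and may differ from the exact variants of [45], [46] used in the simulations.

```python
import math


def bernoulli_kl(p, q, tiny=1e-12):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, tiny), 1 - tiny)  # clip to avoid log(0)
    q = min(max(q, tiny), 1 - tiny)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean_reward, arm_plays, total_plays, eps=0.0, iters=50):
    """Largest q in [mean_reward, 1] whose divergence stays within the budget."""
    budget = math.log(total_plays) / (arm_plays + eps)  # assumed perturbation
    lo, hi = mean_reward, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean_reward, mid) <= budget:
            lo = mid  # mid is still admissible, search higher
        else:
            hi = mid
    return lo


# Hypothetical statistics: the same empirical mean with few and many plays.
index_few_plays = kl_ucb_index(0.5, arm_plays=10, total_plays=100)
index_many_plays = kl_ucb_index(0.5, arm_plays=100, total_plays=100)
```

As with UCB1, the index shrinks toward the empirical mean as an arm is played more often, so a small random `eps` mainly matters for separating indices in the early rounds.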

A. Conclusion
In this paper, we presented a novel distributed deterministic channel access mechanism for WNCSs with imperfect (and possibly unknown) communication links. We adopted local timers for prioritizing channel access and derived the optimal timer setup for improving performance in terms of a linear quadratic cost in a distributed manner. In the case of unknown channel parameters, we cast the channel access problem as a MAB and proposed a novel policy for distributed deterministic channel access. This policy utilizes well-known indexing policies for estimating the success probability of the channels and weights the resulting indices by a time-varying control measure, namely CoIL; the weighted indices are then incorporated in the timers. The simulations showed that the best performance with the timer-based mechanism is achieved when the channel parameters are known a priori. When the parameters are unknown, however, implementing our proposed policy leads to significant improvement compared to policies in which the channel statistics are ignored.

B. Future Directions
Part of ongoing research is the consideration of more advanced models for the communication channels, for instance, channels with temporally correlated state variations. Another ongoing direction concerns the scenario in which flag packets have non-negligible duration, which results in a nonzero probability of collision between data packets. While in the deterministic case such collisions can only deteriorate performance, probabilistic models can become more relevant, especially in cases where a large number of subsystems share a limited number of channels. Furthermore, extending the proposed channel access method to WNCSs which involve subsystems with coupled dynamics poses another interesting yet challenging problem.