Joint Optimization Via Deep Reinforcement Learning in Wireless Networked Controlled Systems

This paper proposes a deep Reinforcement Learning (RL) based co-design approach for the joint optimization of wireless networked control systems (WNCS), where the co-design approach can help achieve optimal control performance under network uncertainties, e.g. delay and variable throughput. Compared to traditional and modern control methods, where the dynamics of the system are important for predicting a system's future response, a model-free approach can adapt to many applications with stochastic behaviour. Our work provides a comparison of how the control performance is affected by network uncertainties such as delays and bandwidth consumption under an unknown number of devices. The control data is transmitted under different network conditions in which several applications transmit background traffic over the same network. The problem contains several sub-optimization problems because the optimal number of devices is non-deterministic under network delay and channel capacity constraints. The proposed approach seeks to minimize control errors in WNCS in order to improve Quality of Service and Quality of Control. The approach is evaluated with three model-free Q-learning-based RL algorithms for high-throughput flow control in a double emulsion droplet formation application. The results show that the allowable number of devices for reliable network communication under bounded network constraints is 10 when using binary search. The control performance of the system without considering the network effect in the reward function (Scenario 1) was good with the C51 algorithm; when including the OMNeT++-based network effect in the reward function (Scenario 2), the best performance was achieved with all three algorithms (C51, DQN, DDQN) with an exponential reward function, and only with C51 in the case of a linear reward function. Finally, under random network conditions (Scenario 3), C51 and DDQN performed well, but DQN did not converge. Comparisons with other machine learning and non-machine learning algorithms also highlight the superior performance of the utilized algorithms.


I. INTRODUCTION
The outset of this work in wireless networked control systems (WNCS) stems from a multidisciplinary research effort related to a microfluidics application. Microfluidics has enabled automation in the pharmaceutical and diagnostic fields thanks to the use of small reagent volumes, increased particle monodispersity with uniform drug composition, and efficient evaluation methods for drug testing [4], [49]. To integrate microfluidic devices in the consumer market, a highly synchronized flow rate is a major challenge to be addressed [64]. For example, in the case of liposomal drug delivery [46], [47], which is promising for high-throughput cell screening, double emulsion could help in the better formation of droplets. The formation of double emulsion droplets depends upon the synchronized delivery of the reagents at a specific flow rate [56]. The formation of double emulsion requires at least four pumping units for generating emulsions and additional pumps for reagent delivery. To achieve a high flow rate, different techniques have been proposed, which include several microfluidic units working in parallel [34], [71]. This raises issues that include not only the control of the devices but also the data communication and storage for achieving efficient control in a high-throughput production unit. Our previous research focused on integrating wireless Cyber-Physical System (CPS) concepts [8], [9] with bioanalytical devices, which could help with efficient control in a high-throughput laboratory setup. Such a Cyber Bioanalytical Physical System (CBPS) integrates the physical and biological processes with the computation and communication domains, enabling an efficient remote operation of the processes, which is the future of laboratory automation.
The associate editor coordinating the review of this manuscript and approving it for publication was Marco Anisetti.
In a CBPS, synchronization between the devices is important to ensure the overall stability and reliability of the system [73]. The fault tolerance and delay requirement restrictions put constraints on the overall performance of the system (as in the case of Ultra-Reliable Low Latency Communications (URLLC)) [18]. These factors are affected by delays introduced by the control systems, which include computation and prediction delays, as well as the uncertainties of the wireless networks, including queuing delays, transmission delays and backhaul delays [41], [43]. It is thus important to see the design of this sub-domain of CPS, i.e. WNCS, as a co-design problem [12], [42] rather than an interactive design in which one design lies on top of the other. The control and information distribution aspects of the application can be exploited by looking at the co-design of WNCSs. The principle of co-design of networked control systems is well established [17], and Figure 1 shows the design framework for networked controlled systems (inspired by the above-mentioned work), which acted as a starting point for our work in WNCS. In this paper, we use reinforcement learning to compensate for the delays of the systems (both wireless and control) in order to optimize the overall system's error response reliability. The reason behind using model-free RL rather than model-based approaches is that when dealing with massive systems, the delay models are non-deterministic in nature; additionally, the physical dynamics of the system might be unknown.
FIGURE 1. Framework for the co-design of networked controlled systems (inspired by [17]).
In addition, over a shared communication network, the traffic pattern [24] could be highly non-deterministic, specifically when dealing with event-triggered control; i.e., developing a traffic model over a shared communication network is also a non-deterministic problem. Because of factors such as data storage capacities and adaptability, simple search algorithms become infeasible when attempting to cope with changes in real-time as the problem complexity increases with network growth. The use of online learning algorithms could help solve these issues at the expense of convergence time as compared to offline algorithms, which require a lot of data. The proposed concept could even be extended to Ultra Reliable Low Latency Communication (URLLC) applications in which the system is subject to stringent delay constraints and system-wide optimization is necessary to achieve reliable performance.

A. SUMMARY OF CONTRIBUTIONS
Our contributions are summarized as follows:
1) We present our proposed joint optimization of WNCSs using a co-design approach. The aim is to analyze the benefit of using model-free RL in stochastic systems as compared to classical and modern control methods.
2) To analyze the problem in depth, classical optimization theory is used to formulate the problem. The objective is defined as the minimization of control errors under network constraints, as well as of the errors introduced by the reinforcement Q-learning technique used.
3) The problem is extended to the application of droplet generation using a stepper motor, where the flow rate is controlled by motor operation. To estimate the control delays as close as possible to reality, we benchmarked the Raspberry Pi that is used as the central control unit of the fluidic pumps in our laboratory setups. The wireless control of the pump is obtained via WiFi, and the network uncertainties were mimicked using the OMNeT++ simulation tool.
Furthermore, our proposed solution is evaluated under three different network scenarios:
Scenario 1: The network uncertainties, such as delay and bandwidth consumption, were simulated using OMNeT++. The optimal number of devices satisfying the delay and bandwidth constraints for reliable performance was calculated using binary search. Finally, RL was performed using different algorithms, i.e., DQN, DDQN, C51, and LSTM, and a comparison was made based on the convergence of the algorithms.
Scenario 2: The delay and network data simulated via OMNeT++ were used as a control factor during the RL environment design and were also used as a dynamic parameter in the reward function to obtain an efficient performance of the algorithm under the network uncertainties.
Scenario 3: The network uncertainties were defined as random variables and were introduced in the RL reward function.
To mitigate the overestimation errors, a deep Q-learning algorithm is used instead of Q-learning, and the performance of the methods is compared with other model-free RL algorithms including C51. Furthermore, our results show that double-DQN is more efficient at mitigating overestimation errors. These RL algorithms are equipped with an experience replay buffer [5], which acts as a middle ground between offline and online algorithms; in turn, this helps make convergence faster. An experience replay buffer is used to save the trajectory of previous experiences in order to improve the performance of the learning process. The number of stored observations used per update must be neither too small nor too large: updating the policy after each iteration is extremely time consuming, whereas updating it after too many observations (which may overlook the pattern of change) does not improve performance.
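The experience replay mechanism described above can be illustrated with a minimal sketch. This is not the implementation used in this work; the class name `ReplayBuffer`, the `Transition` fields and the capacity value are assumptions chosen for illustration. It shows only the two properties mentioned in the text: old experiences are retained up to a bounded size, and updates draw on a sampled batch rather than on the latest observation alone.

```python
import random
from collections import deque, namedtuple

# One stored interaction with the environment (field names are illustrative).
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-capacity FIFO store of past transitions (hypothetical helper)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen discards the oldest item once it is full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self._storage.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


# Usage: capacity 3, so only the three most recent of five pushes survive.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(batch_size=2)
```

The capacity plays the role of the "not too small, not too large" trade-off discussed above: a small buffer forgets quickly, while a very large one keeps stale experience.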

B. DESCRIPTION OF THE APPLICATION
The formation of single or double emulsion droplets helps in high-throughput screening of the cell's susceptibility for drug formation and testing. There are several other applications of microfluidic droplets in the chemical industry [20] other than drug testing. In microfluidic applications, the formation of a double emulsion requires a synchronized flow of the different reagents. As mentioned earlier, the generation of such droplets requires at least four pumping units for emulsion generation, and more if needed. These pumping units require an efficient control method that guarantees the fluidic flow from each pumping unit at a specified flow rate.
Control of such pumps over wireless networks could add the possibility of cost-effective remote operation. However, if the systems are running in parallel with other high-data-consuming applications, such as video streaming, wireless communication may introduce additional challenges such as delay, packet loss, and channel congestion. Using classic or robust control methods, e.g. Proportional Integral Derivative (PID) control or Model Predictive Control (MPC), could be highly inefficient for applications with high synchronization requirements [28]. Indeed, one drawback of PID is that it is ineffective for Multi-Input Multi-Output (MIMO) systems and necessitates the tweaking of several parameters to get the desired response; one drawback of MPC is that it necessitates the modeling of the system's dynamics. The wireless network in a wireless control system is non-deterministic by nature, necessitating the continual adjustment of the PID parameters or the redevelopment of the MPC model.
However, using a model-free (black box) [57] or semi-supervised (grey box) implementation could help such a system achieve an optimal response. RL is based on trial-and-error methods and is derived from the field of psychology, e.g. animal learning [45]; several pumping systems controlled over wireless networks obtain their actions from RL agents working in parallel, which might learn from each other if necessary, as depicted in Figure 2. The use of RL algorithms assists in adapting the system to a higher level without remodelling the system dynamics. Further details of RL and its comparison with other control methods are provided in Section II.

C. PREVIOUS WORKS
Existing research on WNCS focuses on several elements of its design challenge, such as stability, reliability, and energy efficiency. On the other hand, the use of deep learning is mostly studied in the case of URLLC [53]. In [29], a hybrid approach that combines wireless connectivity with wired connectivity for the control of Unmanned Aerial Vehicles (UAVs) is provided. The proposed reliable VANET routing decision scheme is dependent on network conditions and is based on the Manhattan mobility model. In [38], a model-free deep RL based framework is analyzed for URLLC in the downlink of OFDMA systems while optimizing the power. A delay-sensitive joint optimization control study is carried out for multi-loop networked control systems in [41], emphasizing the importance of delay sensitivities in the design of optimal control and network policies. In [36], a clustering-based strategy for efficient energy optimization in embedded processors for wireless sensor networks (WSNs) is investigated in order to increase the lifetime of WSN nodes and improve the utilization of resources. In the context of communication rivalry, an adaptive learning-based approach for vehicle-to-vehicle and vehicle-to-infrastructure communication is presented in [52]. The gain settings of the PID controller are explored under the influence of non-linear delay using neural networks and ant colony optimization in [63], but other critical network parameters such as packet error and channel capacity are not taken into account. The use of RL algorithms is examined in [37] to handle the collision problem in vast IoT networks, with encouraging findings; the optimization problem is modelled as a function of access delay, access success and energy consumption rewards. A technique similar to ours is investigated in [68] with the goal of optimizing platoon performance by accounting for wireless network delay and control stability. However, that work models the vehicle dynamics, which is a complex task when the dynamics of the application are unknown or difficult to model. In [35], a framework for prediction and communication co-design is provided for improving the reliability of URLLC systems using an optimization technique. However, a limitation of that work is that the implementation requires information about the state transitions of different parameters of the system. In [26], an offline scheduling algorithm is proposed for machine-to-machine communications; however, as mentioned earlier, offline algorithms are less adaptable to real-time changes. A joint optimization method for Linear Quadratic Regulator (LQR) cost and energy consumption is analyzed in [65], providing an energy-to-control efficiency framework for URLLC in IoT systems, but factors such as channel capacity and number of users are not taken into account.

II. REINFORCEMENT LEARNING VERSUS PREDICTIVE CONTROL
In a reinforcement learning algorithm, an agent learns a strategy to control an environment based on feedback and a reward strategy.
The state of the system is evaluated by the valuation function Q(s, a), which is based on the sum of expected rewards R associated with previous states plus the discount factor γ related to next states. The overall reward $R_t$ is given by:
$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$$
whereas the long-term reward based on the discount factor γ is given by:
$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$$
For a policy π that defines the probability distribution over any action a for state s, the valuation function is given by:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s,\ a_t = a \right]$$
The valuation function tries to achieve an optimal value $Q^{*}(s, a)$ [58] where:
$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$$
Q-learning is based on the following Bellman update rule [6]:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
where α denotes the learning rate. The reward function plays a significant role in RL in reaching a particular objective. A major challenge in designing a reward function is a positive infinite loop in the reward feedback, in which the objective is achieved early while the agent has not yet learned all the possible scenarios. Including in the reward function a discount factor [39] that comprises the factors affecting the overall performance of the system, such as the bandwidth assigned to each device after a certain device has left the network, could help solve the infinite loop. The discount factor used in RL is similar to the quasi-hyperbolic discount as mentioned in Equation 6.
The quasi-hyperbolic discount function gets a value of ρ when β = 1; the discount factor solves the problem of positive loop in the infinite horizon as well as adds the contribution from the next states.
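As a concrete illustration of the Bellman update rule with learning rate α and discount factor γ, the following is a minimal tabular sketch. The table size, the parameter values and the function name `q_learning_update` are assumptions made for the example and are not taken from the experiments in this paper.

```python
import numpy as np


def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step in place and return the TD error."""
    # Bootstrapped target: immediate reward plus discounted best next value.
    td_target = reward + gamma * np.max(q_table[next_state])
    td_error = td_target - q_table[state, action]
    # Move the current estimate a fraction alpha towards the target.
    q_table[state, action] += alpha * td_error
    return td_error


# Usage: 3 states, 2 actions, all estimates initialised to zero.
q_table = np.zeros((3, 2))
td_error = q_learning_update(q_table, state=0, action=1, reward=1.0, next_state=2)
```

With a zero-initialised table, the first update moves Q(0, 1) from 0 to α times the reward, since the bootstrapped term is still zero.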

A. REINFORCEMENT LEARNING AND PREDICTIVE CONTROL
Predictive control of any system is regarded as an optimization problem that is solved over a control horizon based on the system dynamics. Classic and robust control methods, e.g. PID, LQR and MPC, revolve around achieving a stable response of the system. MPC has been in use for decades for solving networked control system problems thanks to its stable response [10], [67]. On the other hand, RL is based on agent(s) and an environment, where the agent tries to learn the policy based on the feedback from the environment to solve an optimization problem through exploration and exploitation [66]. RL deals with how to learn control strategies by acting as an optimization framework for complex problems. MPC algorithms might not converge in the real world, where problems are more complex and non-deterministic in nature. Table 1 shows a brief comparison of RL with MPC and LQR. MPC might perform nearly as well as the RL algorithms for convex problems [15], but for WNCS, where the network problem itself could be non-deterministic or non-convex in nature, MPC will fail to solve the problem in an efficient manner [53].

B. MODEL-FREE REINFORCEMENT LEARNING ALGORITHMS
In RL, an agent learns the policy or valuation function either from the dynamics of the system, i.e. the model is given, or by learning a model of the environment from provided data or practical implementation [21], [27], [54]. RL algorithms can be model-based or model-free. In real-world problems where the system model might not be available or a demonstration of a specific action is impossible, model-free algorithms can play an important role. Model-free RL algorithms are divided into policy-based and value-based learning techniques [33], [59] and are further classified into different algorithms as shown in Figure 3. In this work, our main focus is to highlight the use of value-based model-free algorithms in wireless networked controlled systems. Although providing only the bounds or rules of the environment should be sufficient for these algorithms, we evaluated the response of the algorithms with deterministic data.
The algorithms chosen in this work are extensions of simple Q-learning algorithms including DQN, DDQN and C51 also known as categorical DQN. The only difference between Q-learning and DQN [62] is that the agent in DQN is based on neural networks rather than a simple Q-table.
In DQN, an overestimation phenomenon is well observed due to the maximization function [7]. To solve this overestimation problem, DDQN uses two identical neural network models where one learns the Q-value and the other is a copy of the model learned from the last stage. Using a second model in combination with the current state model helps the system to evaluate different actions which might be more suitable for some states rather than the one on which the system is trained [1]. As compared to DQN or DDQN, categorical DQN uses a distribution value of the return rather than an expected value [11]. In multi-modal distributed data, where several peaks may be present in the data and a single average cannot truly represent the system's response, categorical DQN can solve the problem by looking at the distribution of the Q-function.
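The difference between the DQN and DDQN targets can be sketched numerically. The sketch below follows the commonly used Double DQN formulation, in which the online network selects the next action and the copied (target) network evaluates it; this is an assumption about the variant, and the Q-value arrays are made-up numbers rather than outputs of the networks trained in this work.

```python
import numpy as np


def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """DQN target: the same value estimates both select and evaluate."""
    if done:
        return float(reward)
    return float(reward + gamma * np.max(next_q_target))


def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: online estimates select, copied estimates evaluate."""
    if done:
        return float(reward)
    best_action = int(np.argmax(next_q_online))
    return float(reward + gamma * next_q_target[best_action])


# Illustrative next-state Q-values for three actions (hypothetical numbers).
next_q_online = np.array([1.0, 2.0, 0.5])
next_q_target = np.array([1.5, 1.0, 3.0])

y_dqn = dqn_target(0.0, next_q_target, gamma=1.0)
y_ddqn = ddqn_target(0.0, next_q_online, next_q_target, gamma=1.0)
```

Here the DQN target takes the largest copied estimate (3.0), whereas the DDQN target evaluates the action preferred by the online estimates (action 1) and obtains 1.0. Decoupling selection from evaluation in this way is what reduces the upward bias of the maximization.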

III. PROBLEM FORMULATION
The problem can be considered as a single task being completed by a number of centralized distributed event-triggered systems over a shared communication network; Table 2 provides definitions of the necessary variables and symbols. Data from each system is time-stamped, unsynchronized and transmitted under network imperfections (random delay, variable sampling time, packet drops, packet reordering). The tasks are divided among systems $p_1, p_2, \ldots, p_n$, which are controlled via a series of controllers $c_1, c_2, \ldots, c_n$ with some or no interdependency. The input of any single system depends upon the learning parameter of controller $i$ as well as on the output of the same controller and on the outputs of the other controllers.
Here we try to minimize the delay and mean square error of the control system by (model-free) prediction, where $u_i(t)$ is the input of the $i$-th system. The $i$-th system is defined by Equation 8 [50], where $w(t)$ is the additive disturbance, $x_i(t)$ is the state of the system, $y_i(t)$ is the output of the $i$-th system and $u_i(t)$ is its input. The control error of the system is given by Equation 10. The learning algorithm is designed to compensate for control errors and additive disturbances in order to follow the reference trajectory for optimum control performance, as per Equation 11.

A. DECISION VARIABLES | CONSTRAINTS
Several decision variables influence the control performance of the systems, including network and control constraints.
The variables that we included in our problem formulation are as follows:

1) TRANSMISSION ACK
The binary valued vector for transmission acknowledgment is defined such that: where δ is a function of σ (trigger condition)

2) DELAY CONSTRAINTS
Reducing the overall delay of the system helps ensure its stability. In the case of a wireless networked controlled system with prediction and transmission happening over uncertain/unreliable networks, the overall delay is the sum of the transmission delay $d_t$, the processing delay $d_p$ and the queuing delay $d_q$. The delay of the overall system is random in nature and could be modelled as a Markov chain, since all the systems on the network face delay when the network is under congestion. However, to ensure the stability of any system $i$, the system should satisfy a delay constraint in which the transmission delay is upper bounded by the maximum channel capacity and is determined by $N_p^i$, the bits to be transmitted, and $T_r$, the transmission rate. A complete End-to-End (E2E) delay model has been discussed in [43]. There exists an inverse relationship between the transmission delay and the effective bandwidth of the network, which eventually puts a bound on the queuing delay. However, in the case of a non-deterministic network, delay models (where an upper bound on the overall E2E delay and channel capacity is defined by separate tuning of different delay parameters) might not be required. For further details about the relationship between delay and number of devices, one can refer to [55], [72].
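The delay budget described above can be sketched as a simple check. The sketch assumes the usual relation in which the transmission delay equals the number of bits divided by the transmission rate; the packet size, processing delay and queuing delay are hypothetical values, while the 800 kbps rate and the 200 ms bound are the values used later in the network simulations.

```python
def transmission_delay(num_bits, rate_bps):
    """Time needed to push num_bits onto a link of rate_bps (seconds)."""
    if rate_bps <= 0:
        raise ValueError("rate_bps must be positive")
    return num_bits / rate_bps


def end_to_end_delay(num_bits, rate_bps, processing_s, queuing_s):
    """Overall delay as the sum of transmission, processing and queuing delay."""
    return transmission_delay(num_bits, rate_bps) + processing_s + queuing_s


def satisfies_delay_bound(total_delay_s, bound_s):
    """True when the overall delay respects the upper bound."""
    return total_delay_s <= bound_s


# Hypothetical packet of 8000 bits with assumed processing and queuing delays.
delay = end_to_end_delay(num_bits=8_000, rate_bps=800_000,
                         processing_s=0.005, queuing_s=0.020)
ok = satisfies_delay_bound(delay, bound_s=0.200)
```

In this example the transmission delay is 10 ms and the total is about 35 ms, well inside the 200 ms bound; as queuing delay grows with the number of devices, the same check eventually fails.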

3) CHANNEL CAPACITY CONSTRAINTS
To ensure the efficient utilization of resources and minimum transmission errors as well as packet loss, the information transferred by the cumulative systems should be less than the channel capacity. The channel capacity [13] is a function of bandwidth and Signal-to-Noise Ratio (SNR) and is given by Equation 15 at any instance $t$, where the SNR can be represented as a function of the transmission power $P_t$, the channel gain $h$ and the noise spectral density $\sigma^2$.
To ensure reliability of the overall system when $N$ systems are transmitting, the cumulative channel usage is upper bounded by the maximum channel capacity $C_{\max}$ (Equation 17), where $N$ is the number of devices; the upper bound on the number of devices ($N_{\max}$) is affected by both capacity and delay constraints.
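The capacity constraint can be illustrated with a short sketch, assuming the Shannon-Hartley form of the capacity (bandwidth times the base-2 logarithm of one plus the linear SNR). The bandwidth, SNR and per-device rate below are hypothetical numbers chosen so that the arithmetic is easy to follow, and the helper names are not from this paper. The sketch covers only the capacity side of the bound; the delay constraint may limit the number of devices further.

```python
import math


def shannon_capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity in bit/s for a given bandwidth and linear SNR."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def max_devices_by_capacity(c_max_bps, per_device_bps):
    """Largest N such that N * per_device_bps does not exceed c_max_bps."""
    return int(c_max_bps // per_device_bps)


# Hypothetical link: 20 MHz bandwidth and a linear SNR of 15 (about 11.8 dB).
capacity = shannon_capacity(bandwidth_hz=20e6, snr_linear=15.0)

# Each device is assumed to need 0.8 Mbit/s; background traffic is ignored.
n_cap = max_devices_by_capacity(capacity, per_device_bps=0.8e6)
```

With these numbers the capacity is 80 Mbit/s, so at most 100 devices fit by capacity alone.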

4) SYNCHRONIZATION ERRORS
The synchronization error [43] between agents $i$ and $j$ depends on $K_{ij}$, the communication link between the $i$-th and $j$-th agents. For simplicity, we define here a synchronization parameter $\xi_s^i$, which depends upon how much the output of the $i$-th agent is delayed, which will eventually affect the output of the $j$-th agent.

5) OVERESTIMATION ERRORS
RL is based on learning optimal policies in a Markovian decision process where the objective function Q(s, a) is learned incrementally. The state learning depends upon the reward r and the discount factor γ as in Equation 21. In the presence of external noise $\zeta_s^a$, the Q-learning overestimation phenomenon occurs [61].
VOLUME 10, 2022
Lemma 1: Assume that Q-learning takes place in a stochastic environment with i.i.d. variables $X = X_1, X_2, \ldots, X_n$, which introduce a zero-mean noise $\zeta_s^a$ in the evaluated function value. Then Q-learning overestimates in stochastic environments.
This phenomenon was first reported by Thrun and Schwartz in 1993. In the presence of noise, the evaluation function is approximated by the target evaluation function perturbed by the noise term $\zeta_s^a$. The error that the environmental noise introduces in Equation 21 is bounded from above; this upper bound was proved by Thrun and Schwartz and is included for the reader's convenience (Lemma 2).
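The overestimation effect of the max operator under zero-mean noise can be checked with a small Monte Carlo sketch. The setting is an assumption made for illustration: all true action values are equal (zero), and the noise is uniform on a symmetric interval, for which the expected maximum of n independent draws is the half-width times (n - 1)/(n + 1). The function name and parameter values are hypothetical.

```python
import numpy as np


def max_operator_bias(num_actions, noise_half_width, num_trials, seed=0):
    """Estimate E[max_a (Q + noise)] - max_a Q for equal true Q-values."""
    rng = np.random.default_rng(seed)
    true_q = np.zeros(num_actions)
    # Zero-mean uniform noise added independently to every action value.
    noise = rng.uniform(-noise_half_width, noise_half_width,
                        size=(num_trials, num_actions))
    noisy_q = true_q + noise
    # The max over noisy estimates is compared with the true maximum (zero).
    return float(np.mean(noisy_q.max(axis=1)) - true_q.max())


# Five actions with noise in [-1, 1]: the bias is positive even though
# every individual noise term has zero mean.
bias = max_operator_bias(num_actions=5, noise_half_width=1.0, num_trials=100_000)
expected = 1.0 * (5 - 1) / (5 + 1)
```

The simulated bias lies close to the analytical value of 2/3 for this setting, which illustrates why maximizing over noisy estimates systematically inflates the target.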

B. OBJECTIVE FUNCTION
The goal is to maximize Quality of Service (QoS) and Quality of Control (QoC), which is achieved by minimizing synchronization and control errors, and is expressed as follows:

max(QoS and QoC)
which is based on the minimization of the control errors $e_c^i$ and noise $w(t)$ for the $i$-th system:
$$\min f(e_c^i, w(t))$$
The cost function is the mean-square control error with respect to the reference output $y_r(t)$. Based on the constraints and the optimization goal, the overall objective with its constraints is given in Equation 25. The objective is to reduce the MSE under the reliability constraints (25a, 25b, 25e), where constraint (25c) gives the upper limit on the maximum number of devices and constraint (25d) is a feasibility constraint.
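As a small worked example of a mean-square control error, the sketch below averages the squared difference between a reference trajectory and a measured output over a finite window. The averaging window, the sample values and the function name are assumptions for illustration; the exact cost function of this paper may weight or normalise the error differently.

```python
import numpy as np


def mean_square_control_error(reference, output):
    """Mean of squared differences between reference and measured output."""
    reference = np.asarray(reference, dtype=float)
    output = np.asarray(output, dtype=float)
    if reference.shape != output.shape:
        raise ValueError("reference and output must have the same shape")
    return float(np.mean((reference - output) ** 2))


# Constant reference of 1.0 and an output that undershoots, hits, overshoots.
mse = mean_square_control_error([1.0, 1.0, 1.0], [0.0, 1.0, 2.0])
```

Here the squared errors are 1, 0 and 1, so the mean-square error is 2/3.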

C. FORMAL DESCRIPTION
The importance of the use of formal methods in understanding the behaviour of stochastic systems has been discussed in our previous research [8]. Learning automata have been used for decades to solve complex problems like routing in stochastic environments [31]. In this context, RL provides the core of learning automata. A learning automaton based formal description of the problem could help to understand the considered problem in a perspective to replicate the approach for multi-agent systems as shown in Figure 4. This section provides the necessary definitions for learning automaton.
Definition: A learning automaton [3] is a tuple $L = (\eta, Act, P, p_t, u_n, \zeta_n, R)$ where:
$\eta$ → set of bounded inputs
$Act$ → the set of actions in the action space $(a_1, a_2, \ldots, a_n)$
$\zeta_n$ → the sequence of environmental responses, $\zeta_n \subseteq \eta$
$u_n$ → set of outputs/actions
$P$ → probability space, which depends upon the probability and sigma-algebra function ($F$) of a set for bounded inputs and the output sequence, $F_n = \sigma(\zeta_1, p_1, u_1; \ldots; \zeta_n, p_n, u_n)$
$p_t$ → set of probability distributions, $p_t = [p_n(1), p_n(2), \ldots, p_n(n)]^T$, where $p_n(i)$ is the conditional probability for the set of actions occurring under the σ-algebra function, the probabilities sum to 1, and $p_n(i) = \Pr\{\varsigma : u_n = u(i) \mid F_{n-1}\}$ with $F_{n-1} \subset F$
$R$ → the reinforcement scheme, where $R_{t+1} = r_{t+1} + \gamma R_t$, $R_t$ represents the overall reward for the previous actions and γ is the discount factor
$\zeta_n$ → conditional probability of the environment responses, $\zeta_t = [\zeta_n(1), \zeta_n(2), \ldots, \zeta_n(n)]^T$

IV. PROPOSED SOLUTION
As mentioned in the introduction, a co-design approach offers a more satisfactory optimal control performance in the presence of wireless network constraints than an interactive design approach. To formulate the problem, conventional optimization theory is used. The problem is formulated in mathematical form as indicated in Equation 25 with network constraints 25a, 25b, 25c, 25d, 25e, 25f and 25g. The problem under consideration is highly non-deterministic given that the number of communicating devices ($N$) is unknown. To solve the problem, the initial step is to calculate the maximum number of devices subject to channel capacity, delays and errors. Here we assume that the minimum channel capacity required for each device to ensure the delay bound and a small error probability is $C^*$. Constraint (25g) comes into play for multi-agent interaction; to simplify the problem to a single agent, we have dropped this constraint. Finding the solution to the problem consists of the following steps:
Learning: As mentioned earlier, to ensure optimal control performance an RL technique is used. Allocating a reward to the output response of the system in a stochastic environment under network constraints helps to achieve the desired performance. If the error is greater than the defined control threshold, the reward is −1, i.e. a penalty, whereas if the error is small the reward is +1. Section III-C provides a formal representation of the problem, and the proposed Algorithm 1 summarizes the approach used to solve the problem.
Reliability: To ensure reliability, the channel capacity constraint must be satisfied, which puts a limit on the maximum number of devices communicating. Thus, delays and errors are correlated with the assigned channel capacity. To make the problem simpler, constraints 25a, 25b and 25e are assumed to satisfy a reliability upper bound $\kappa_{opt}$, which provides minimum delay and errors under channel capacity constraints.
The maximum number of devices is obtained via a common binary search algorithm. Further discussion and explanation can be found in the simulations and results of Section V.
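The two steps above can be sketched together: a threshold reward for the learning step and a binary search for the largest feasible number of devices in the reliability step. This is a sketch under stated assumptions, not Algorithm 1 itself. The feasibility model `toy_feasible` is hypothetical (a linear growth of delay with the number of devices), and binary search is valid only because feasibility is assumed to be monotone, i.e. if n devices are infeasible then so are more than n. The 800 kbps per-device rate, 54 Mbps capacity and 200 ms bound are the simulation settings reported in Section V; the other numbers are made up.

```python
def threshold_reward(control_error, error_threshold):
    """+1 when the absolute error is within the threshold, otherwise -1."""
    return 1 if abs(control_error) <= error_threshold else -1


def max_feasible_devices(is_feasible, upper_limit):
    """Largest n in [0, upper_limit] with is_feasible(n) true.

    Assumes is_feasible is monotone (true up to some n, false afterwards)
    and that zero devices is always feasible.
    """
    low, high = 0, upper_limit
    while low < high:
        mid = (low + high + 1) // 2  # upper midpoint avoids an endless loop
        if is_feasible(mid):
            low = mid
        else:
            high = mid - 1
    return low


def toy_feasible(n, per_device_kbps=800, c_max_kbps=54_000,
                 base_delay_ms=20, delay_per_device_ms=25, bound_ms=200):
    """Hypothetical check: capacity and delay constraints must both hold."""
    capacity_ok = n * per_device_kbps <= c_max_kbps
    delay_ok = base_delay_ms + n * delay_per_device_ms <= bound_ms
    return capacity_ok and delay_ok


n_max = max_feasible_devices(toy_feasible, upper_limit=64)
```

With the toy delay model the search returns 7, because the delay constraint binds before the capacity constraint does; this value is a property of the made-up model and is not a result of this paper. The search needs on the order of log2 of the upper limit feasibility checks, which matters when each check is a network simulation.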

A. TIME COMPLEXITY ANALYSIS
1) TIME COMPLEXITY ANALYSIS FOR OUR APPROACH
For the proposed Algorithm 1, if an RL-based approach is used, the computational complexity for step1 for determining the channel capacity for each user and step2 for the [...] For step3, where the upper bound on the maximum number of allowable users is determined using a common binary search, the computational complexity is $O(\log n)$. As for the while loop, the complexity is $O(n^2)$. The complexity of the value iteration algorithm is $O(S^2 \times A \times n)$ [22], where $S$ are the states, $A$ are the actions and $n$ is the number of iterations. Thus, the overall time complexity of the proposed algorithm becomes $O((S^2 \times A \times n)^2)$. In what follows, we also present, for reference, the time complexity of approaches based on MPC and LQR.

2) TIME COMPLEXITY FOR AN MPC-BASED APPROACH
Assuming that the model of the system is given, the relationship between the input and output variables of the system is known. The complexity of the algorithm remains the same as described above for our approach for step1, step2 and step3. However, for the while loop, the complexity of determining the output depends upon the number of inputs $m$ and the prediction horizon $p$ [70]; thus, the overall time complexity of an MPC-based approach under the capacity and number-of-devices constraints is $O(m \times p \times n)^3$ in the case of conventional MPC and $O(m^{i} \times p \times n)^3$ in the case of step-based MPC, where $i$ represents the number of steps.

3) TIME COMPLEXITY FOR AN LQR-BASED SOLUTION
Assuming that the system dynamics are known and that step1 to step3 remain the same, solving the (least-squares) control problem [14], [69] alone determines the time complexity of the controller. Including the while loop, and with $p$ denoting the control horizon, the overall time complexity would be $O((p \times n^3)^2)$.

V. SIMULATIONS AND RESULTS
To evaluate the performance as close as possible to a real-life scenario, we obtained the network and control parameters for the pump used in our laboratory, shown in Figure 5. The pump is integrated with a Raspberry Pi (RPi) to implement a wireless controller over WiFi. The pump unit is a compact, portable, dual-channel piezoelectric pump that uses two Bartels mp6 piezo pumps in a closed-loop regulated pressure generator setup. The internal low-level controller is an ESP32 microcontroller, which will be connected to an RPi4 board. RPi4 benchmarking was performed to obtain an overview of its computational capabilities.

A. NETWORK SIMULATIONS
OMNeT++ is a powerful C++-based simulation tool for wireless, wired and many other networks. OMNeT++ was used to obtain network parameters such as delay and channel capacity for different numbers of devices. The network simulations were aimed at acquiring the End-to-End (E2E) delay, which consists of transmission delay ($d_t$), processing delay ($d_b$), propagation delay ($d_p$), and queuing delay ($d_q$), for control and background traffic applications. The network was simulated around the IEEE 802.11e standard with Quality of Service (QoS) enabled and disabled [2]. In 802.11e, the MAC uses enhanced distributed channel access (EDCA), by which video and audio packets can be assigned different priorities, which helps achieve minimum delay in delay-sensitive applications. Using the same mechanism, control commands were sent at the same priority level as video in 802.11e, which ensures that control packets are transmitted before the background traffic. The background traffic model represents unnecessary load on the network while control data is being transmitted. The network configuration included controllers with a static processing delay defined as 5 Sec, a server, a configurator, an Access Point (AP) and a radio medium.
The upper bound on the E2E delay was defined as 200 ms for control applications. The bit rates were defined as 800 kbps and 33.3 Mbps for each control and background application, respectively. The maximum channel capacity was defined as 54 Mbps (2.4 GHz center frequency). The network simulations were performed for different numbers of devices, i.e. 1, 5, 10 and 15. Figure 6 gives an overview of the delay experienced by control devices versus the background application when QoS is enabled for a single host. The control application experiences a constant and almost negligible delay, whereas the background application experiences a large delay at the start and then tends to stabilize; this initial delay is due to packet accumulation when the application initializes. Figure 7 depicts how the delay increases when 10 hosts are communicating control data, compared with the single-host case, due to shared bandwidth. As the number of hosts increases, the delay experienced by the control applications also increases.
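The E2E delay decomposition and the 200 ms bound described above can be sketched as follows. This is a minimal illustration, not the OMNeT++ model: the class name `DelayComponents`, the helper `transmission_delay`, and all numeric component values in the usage note are hypothetical; only the four delay terms, the 200 ms bound and the 800 kbps control bit rate come from the text.

```python
from dataclasses import dataclass

E2E_BOUND_S = 0.200            # upper bound on E2E delay for control traffic (from the text)
CONTROL_BIT_RATE_BPS = 800_000  # control application bit rate (from the text)


def transmission_delay(packet_bits: int, bit_rate_bps: float) -> float:
    """Time needed to push one packet onto the link, in seconds."""
    return packet_bits / bit_rate_bps


@dataclass
class DelayComponents:
    """The four delay terms, in seconds (illustrative container)."""
    transmission: float  # d_t
    processing: float    # d_b
    propagation: float   # d_p
    queuing: float       # d_q

    def end_to_end(self) -> float:
        # E2E delay is the sum of the four components.
        return self.transmission + self.processing + self.propagation + self.queuing

    def within_bound(self, bound_s: float = E2E_BOUND_S) -> bool:
        return self.end_to_end() <= bound_s
```

For instance, with an assumed 1000-byte control packet, `transmission_delay(8000, CONTROL_BIT_RATE_BPS)` gives 10 ms, and a hypothetical `DelayComponents(0.010, 0.001, 0.000001, 0.020)` stays well inside the bound.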
Next, when QoS is not enabled, the control applications are not prioritized and the control devices experience severe delay and low throughput. Figure 8 shows the throughput for the control versus background traffic when the QoS is not enabled.
The simulations were repeated for different numbers of host applications (N = 1, 5, 10, 15); Figure 9 gives an overview of the maximum throughput achieved for control applications while QoS is enabled. In contrast to the non-QoS case, the throughput of the network changes over time, depending upon the data transmitted by high-priority applications. Because of this, the throughput for control applications is comparatively higher than the throughput of the background applications. The average delay and throughput were calculated for control and background applications from the gathered data under QoS-enabled services (Table V).
As mentioned in the introduction section, the RL control of the pump was obtained under network uncertainties using three different approaches. To reduce the problem complexity, the agents were assumed to act independently of each other, and the problem was solved for a single agent interacting with an environment in which other agents are present and affect the same environment. To include the effect of delay and bandwidth consumption, a reliability parameter ρ was introduced in the reward function. To accommodate the different cases in which either the delay or the bandwidth consumed by the application exceeds the upper bound, which in turn leads to packet loss, the reliability parameter ρ was assigned a probability value between 0 and 1. The ρ factor was introduced in two different ways, as a linear multiplier and as an exponential multiplier, to analyze its effect on the convergence of the algorithm.
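The two ways of introducing ρ into the reward can be illustrated with a short sketch. The exact functional forms are not given in this section, so both forms below are assumptions: the linear variant scales the base reward by ρ, and the exponential variant scales it by exp(ρ − 1) so that ρ = 1 leaves the reward unchanged. The function name `shaped_reward` and the notion of a non-negative `base_reward` are hypothetical.

```python
import math


def shaped_reward(base_reward: float, rho: float, mode: str = "linear") -> float:
    """Scale a control reward by the network reliability parameter rho in [0, 1].

    Assumed forms (not taken from the paper):
      linear:      rho * base_reward
      exponential: exp(rho - 1) * base_reward
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if mode == "linear":
        return rho * base_reward
    if mode == "exponential":
        return math.exp(rho - 1.0) * base_reward
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumed forms, the linear multiplier removes the reward entirely when ρ = 0, while the exponential multiplier only attenuates it, which is one possible reason the two variants could lead to different convergence behaviour.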
Scenario 1: In the first scenario, the network effects were not included in the reward function of the RL environment. Three algorithms, namely DQN, DDQN and C51, were used for the learning of the agent. In addition to the state of the system, the difference between the allowable upper and lower bounds of the flow rate was factored into the reward function. Figures 10, 11, and 12 show the average returns and loss for DDQN, DQN and C51 agents, respectively. The performance was best with the C51 algorithm, where the average returns were more stable; on the other hand, the performance with DDQN and DQN was poor.
Scenario 2: As mentioned earlier, to mimic the real network scenario, OMNeT++ simulations were performed. In Scenario 2, the simulation results were included as a learning factor in the reward function, either as an exponential or as a linear multiplier.
The C51 algorithm with either exponential or linear rewards (see Fig. 13) outperformed DDQN and DQN with linear rewards. DQN (see Fig. 14) and DDQN (see Fig. 15) performed well with the exponential reward.

Scenario 3: To accommodate more uncertain network scenarios, both bandwidth consumption and delay were introduced in the reward function as random variables.
Under random network conditions, C51 (see Fig. 16) and DDQN (see Fig. 18) performed well, whereas DQN (see Fig. 17) did not converge. This implies that, although DQN achieved a satisfactory average reward when the problem was limited to a single agent, accounting for the network congestion or delay caused by other networks revealed the unsuitability of the agent in high network traffic scenarios.
Figure 19 shows the variation of the flow rate obtained over 30000 iterations where network simulations were included in the reward function. It is evident from Figure 19 that DQN takes somewhat longer to reach a stable flow-rate response, whereas DDQN and C51 reach a stable response comparatively faster.
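The idea behind Scenario 3, treating delay and bandwidth consumption as random variables, can be sketched with a small Monte Carlo estimate of a reliability value. Everything about the sampling is an assumption made for illustration: the exponential delay model, the Gaussian load model, their parameters and the function name `estimate_reliability` are hypothetical; only the 200 ms delay bound and the 54 Mbps capacity come from the text.

```python
import random

DELAY_BOUND_S = 0.200   # E2E delay bound for control traffic (from the text)
CAPACITY_MBPS = 54.0    # maximum channel capacity (from the text)


def estimate_reliability(rng: random.Random,
                         n_samples: int = 5000,
                         mean_delay_s: float = 0.1,
                         mean_load_mbps: float = 40.0,
                         load_std_mbps: float = 5.0) -> float:
    """Fraction of sampled network states in which both bounds are respected."""
    ok = 0
    for _ in range(n_samples):
        delay = rng.expovariate(1.0 / mean_delay_s)       # assumed delay model
        load = rng.gauss(mean_load_mbps, load_std_mbps)   # assumed load model
        if delay <= DELAY_BOUND_S and load <= CAPACITY_MBPS:
            ok += 1
    return ok / n_samples


rho = estimate_reliability(random.Random(0))
```

The resulting `rho` lies in [0, 1] and could then be fed to a reward multiplier; how the paper actually draws its random network conditions is not specified in this section.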
In addition to the scenarios discussed above, we also analyzed other learning algorithms, such as an LSTM-based RNN, for Scenario 2. A reliability parameter depending upon the E2E delay, the number of devices and the data rate was chosen between 0 and 1, the Poisson distribution of which was then used as the control error. A Poisson distribution for the control error (based on network parameters) is assumed because the non-deterministic nature of the system leads to no predefined distribution of the control error. The total computation time for 30,000 iterations was recorded as 4 min 12 s. Figure 20 shows the control error for prediction on training and test data for the LSTM algorithm using an RNN dense layer. Both the look-back (number of previous timestamps) and the batch size were chosen to be 50, and 333 different events were used as input data.
Covariance, Pearson's correlation, and Spearman's correlation between two test flow rate data inputs (lying within the constraints) and the flow rate achieved by agents utilizing various RL algorithms were analyzed for further evaluation of the algorithms. Here, the covariance indicates a linear relationship between the test data and the actions taken by the agent for achieving the optimal/desired response. The value reported in Table 3 is the covariance between the variables; a positive value indicates that the variables change in the same direction, whereas a negative value suggests a change in the opposite direction. However, because the covariance is hard to interpret and is therefore not the best measure to characterize the relationship between data, Pearson's and Spearman's correlations between the variables were also analyzed. Pearson's correlation lies between −1 and 1; values above 0.5 show a strong correlation between data in the same direction, while values below −0.5 show a strong correlation in the opposite direction.
To account for a non-linear relationship between test data and agent actions, Spearman's correlation was also calculated, where −1 shows a strong negative correlation and +1 shows a strong positive correlation. As evident from the results summarized in Table 3 for these evaluation metrics, C51 outperforms the other algorithms in Scenario 1 and Scenario 2, whereas DDQN shows satisfactory results in some cases.
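The three evaluation metrics can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the function names are hypothetical and the sample data in the usage note is made up, not taken from Table 3. Spearman's correlation is obtained as Pearson's correlation of the ranks, with ties given their average rank.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def pearson(x, y):
    """Pearson's correlation: covariance normalised by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))
```

For a monotone but non-linear pair such as `x = [1, 2, 3, 4, 5]` and `y = [v ** 3 for v in x]`, Spearman's correlation is 1 while Pearson's correlation is below 1, which is why the rank-based measure is useful alongside the linear one.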
Another perspective to analyze is the role of the experience replay buffer. As mentioned earlier, C51 is an offline learning algorithm while DQN and DDQN are online learning algorithms; the experience replay buffer provides a middle ground for efficient operation. Based on the evaluation results, the role of the replay buffer in improving the overall performance of the C51 algorithm was further analyzed for Scenario 2. Different batch sizes were used to store the previous observations in the experience replay buffer and the cumulative rewards were calculated. It is evident from Figure 21 that increasing the batch size helps increase the average rewards in the early stages of the learning process, and that a small batch size leads to lower cumulative rewards. However, it does not follow that 'the bigger the batch size, the better the performance', because continuous observation and update improve the policy. The observations must be updated frequently, so that enough past history is stored for the system to learn, but not all of the observations. The overall simulation results showed that, even under random network conditions, C51 still performs well and achieves an optimal control performance in a stochastic environment. Overall, the results show that reliability can be achieved using model-free RL approaches. The scalability of the system is restricted by network constraints such as capacity and by system requirements such as delay; this sets an upper bound on the number of devices that can be supported under specific network conditions. Event-triggered network control reduces network load but introduces reliability issues, which are out of the scope of this work. However, as the problem becomes more complicated, the method could take more time to converge.
Although a binary search can provide an estimate for the optimal number of devices, it is best to include the network effects in the reward function so that the system learns the possible network scenarios and can adapt dynamically to network changes.
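The trade-off between storing enough history and refreshing observations frequently can be seen in a minimal experience replay buffer. This is a generic sketch, not the buffer used in the experiments: the class name `ReplayBuffer`, the fixed capacity and the uniform sampling are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer: the oldest transitions are discarded first."""

    def __init__(self, capacity: int, seed=None):
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition) -> None:
        # A transition is typically (state, action, reward, next_state, done).
        self._buffer.append(transition)

    def sample(self, batch_size: int):
        # Uniformly sample a mini-batch without replacement.
        size = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), size)

    def __len__(self) -> int:
        return len(self._buffer)
```

Because the capacity is bounded, new observations continuously displace old ones, so the agent learns from a window of past history instead of from every observation ever made.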
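The binary search over the number of devices mentioned above can be sketched as follows. The sketch assumes that feasibility is monotone (if n devices are supported, so are fewer than n); the function name `max_allowable_devices` and the toy capacity-only rule `fits_capacity` are hypothetical and do not reproduce the OMNeT++-based delay and capacity constraints used in the paper.

```python
from typing import Callable

CAPACITY_KBPS = 54_000     # maximum channel capacity (from the text)
CONTROL_RATE_KBPS = 800    # bit rate of one control application (from the text)


def max_allowable_devices(is_feasible: Callable[[int], bool], upper: int) -> int:
    """Largest n in [0, upper] for which is_feasible(n) holds (monotone predicate)."""
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2   # round up so the interval always shrinks
        if is_feasible(mid):
            lo = mid               # mid devices are supported, try more
        else:
            hi = mid - 1           # mid devices are too many
    return lo


def fits_capacity(n_devices: int) -> bool:
    """Toy rule: control traffic alone must fit in the channel capacity."""
    return n_devices * CONTROL_RATE_KBPS <= CAPACITY_KBPS
```

The search needs only about log2(upper) feasibility checks, which matches the logarithmic cost quoted for this step; in practice each check would involve the delay bound and the background traffic as well, which is why the toy rule gives a much looser limit than the value reported in the paper.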

VI. CONCLUSION
In this work, we focused on the co-design for joint optimization of wireless networked controlled systems using model-free RL. The research emphasized the importance of wireless network constraints, in addition to the control system constraints, for achieving optimal system performance. The paper addressed several aspects of the problem in terms of optimization theory and discussed the factors that motivated the use of RL over classical and robust control methods. As a use case, the theory was applied to a double emulsion droplet formation unit; DQN, DDQN and C51 algorithms were used to achieve the control performance of the system under bounded constraints. C51 was found to outperform the other algorithms due to its multi-modal problem-solving capabilities. The results also showed that the reward function plays an important role in the agent's learning process and that designing the reward function carefully can help achieve better performance. Currently, our work does not investigate the reliability issues introduced by event-triggered control for better efficiency. In the future, the aim is to explore a middle ground between better performance and reliability under different network scenarios using hybrid control approaches. The power constraints have also not been studied and are left for future work.

APPENDIX. LEMMAS
Lemma 1: Q-learning overestimates in stochastic environments. Assume that Q-learning takes place in a stochastic environment with i.i.d. variables $X = \{X_1, X_2, \ldots, X_n\}$, which introduce a zero-mean noise $\zeta_s^a$ in the evaluated function value. With the noise introduced during learning, the reward attached to the Q-function is given as $\hat{r}(s, a) = r(s, a) + \zeta_s^a$. From probability theory, the expectation of any variable $X_i$ is given by its distribution over the $N$ samples. At each stage the reward will be higher than expected due to the cumulative errors added at each stage. Even if the function values are small at any stage, due to the maximum operator in Q-learning the function will tend to select the maximum from the estimated distributions $\psi_i$.
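The overestimation effect stated in Lemma 1 can be checked numerically with a small simulation. The setup is an assumption chosen for illustration: all true action values are 0 (so the true maximum is 0), and each estimate is corrupted by independent zero-mean Gaussian noise. The function name `mean_max_of_noisy_estimates` is hypothetical.

```python
import random


def mean_max_of_noisy_estimates(n_actions: int,
                                n_trials: int = 10_000,
                                noise_std: float = 1.0,
                                seed: int = 0) -> float:
    """Average of max_a (Q(a) + noise_a) when every true Q(a) is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Each action value is the true value 0 plus zero-mean noise.
        total += max(rng.gauss(0.0, noise_std) for _ in range(n_actions))
    return total / n_trials


bias_one_action = mean_max_of_noisy_estimates(n_actions=1)
bias_five_actions = mean_max_of_noisy_estimates(n_actions=5)
```

With a single action the average stays close to the true value 0, whereas with several actions the maximum operator picks out the largest positive noise term and the average becomes clearly positive, even though every individual noise term has zero mean.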

A. COMPARISON TO OTHER NON-MACHINE LEARNING METHODS
To support our argument of using RL instead of simple search algorithms, we analyzed the problem in more depth. For the optimization problem in Equation 25, we tried to compute the linear approximation of the control error under constraints based on the desired output and the input data-set gathered using simulations. For the computation, we used the well-known "Newton-Raphson method". However, the method failed to converge within the first 50 iterations under the subset of the acquired data. The use of "Least Square Minimization" and "Trust Region Constrained" algorithms was also considered, but as discussed earlier, the problem is non-deterministic in nature, which did not lead us to any feasible implementation. We also tried using "Binary Search" to solve other constraints of the problem, but the algorithm did not find any solutions with the provided simulation data-set. The use of a "Brute Force" algorithm was also tested; however, using the whole data-set led our system to run out of memory, or under best conditions, the algorithm was not able to find any solution in a 4-5 hour period. This led to trying the use of a smaller subset of the data; however, the algorithm did not manage to find the optimal/desired solution.
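The kind of iteration budget mentioned above (50 Newton-Raphson iterations) can be illustrated with a generic root finder that reports whether it converged. This sketch does not reproduce the control-error problem of the paper: the function `newton_raphson`, the tolerance and the arctangent test function are assumptions, chosen only because Newton-Raphson is known to diverge on it for poor starting points.

```python
import math


def newton_raphson(f, df, x0: float, tol: float = 1e-10, max_iter: int = 50):
    """Return (x, converged) after at most max_iter Newton-Raphson steps."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x, True
        slope = df(x)
        if slope == 0.0 or not math.isfinite(slope):
            break                      # cannot take a Newton step
        x = x - fx / slope
        if not math.isfinite(x):
            break                      # iterate has blown up
    return x, False


def arctan_slope(x: float) -> float:
    """Derivative of atan(x)."""
    return 1.0 / (1.0 + x * x)


root, ok = newton_raphson(math.atan, arctan_slope, x0=0.5)
_, ok_far = newton_raphson(math.atan, arctan_slope, x0=1.5)
```

Starting at 0.5 the iteration reaches the root at 0 in a few steps, while starting at 1.5 the iterates grow without bound and the routine reports failure, which shows how sensitive the method can be to the starting point and to the shape of the function.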

B. COMPUTATION TIME
For the simulations, we used a Lenovo IdeaPad L340 Gaming Laptop equipped with an Intel(R) Core(TM) i5-9300H CPU @ 2.4 GHz. Table 4 shows the computation time for the three scenarios for the implemented RL algorithms.