Saving Energy and Spectrum in Enabling URLLC Services: A Scalable RL Solution

Communication systems supporting cyber-physical production applications should satisfy stringent delay and reliability requirements. Diversity techniques and power control are the main approaches to reduce latency and enhance the reliability of wireless communications at the expense of redundant transmissions and excessive resource usage. Focusing on the application layer reliability key performance indicators (KPIs), we design a deep reinforcement learning orchestrator for power control and hybrid automatic repeat request retransmissions to optimize these KPIs. Furthermore, to address the scalability issue that emerges in the per-device orchestration problem, we develop a new branching soft actor–critic framework, in which a separate branch represents the action space of each industrial device. Our orchestrator enables near-real-time control and can be implemented in the edge cloud. We test our solution with a Third Generation Partnership Project-compliant and realistic simulator for factory automation scenarios. Compared with the state of the art, our solution offers significant scalability gains in terms of computational time and memory requirements. Our extensive experiments show significant improvements in our target KPIs, over the state of the art, especially for fifth percentile user availability. To achieve these targets, our framework requires substantially less total energy or spectrum, thanks to our scalable reinforcement learning solution.


I. INTRODUCTION
Cyber-physical systems (CPSs) are engineered systems in which computation and communication are integrated with physical processes. Within CPSs, the continuous interactions with the physical world may impose strict reliability and latency requirements on the underlying communication system [1]. Failing to fulfill such demands may result in faulty behaviors, economic losses, or even put people's lives in danger. The fifth generation of wireless networks, 5G, is designed to satisfy such stringent requirements via ultrareliable low-latency communications (URLLC) [2]. To meet these requirements, we can use interference management and power control [3] as well as diversity techniques [4]. The Third Generation Partnership Project (3GPP) proposes various diversity techniques [4], [5]: 1) hybrid automatic repeat request (HARQ) retransmissions on the medium access control (MAC) layer to benefit from time and frequency diversity [5]; 2) repetitions on the physical (PHY) layer to leverage time and frequency diversity [6]; 3) automatic repeat request retransmissions on the radio link control (RLC) layer to benefit from time and frequency diversity [7]; and 4) packet duplication on the packet data convergence protocol (PDCP) layer (using either multiconnectivity to leverage spatial diversity or carrier aggregation to benefit from frequency diversity) [8].

A. Diversity and Power Control in CPSs
Operating at full power with multiple diversity transmissions is neither effective nor efficient. Full-power transmission can cause high interference to neighboring users. Multiple diversity transmissions can result in a high cell load, potentially increasing packet queuing delays to the point where packets miss their delay bounds. Moreover, both of these approaches (i.e., power control and diversity techniques) are costly in terms of energy consumption and system capacity. Thus, it is necessary to properly utilize and combine such approaches based on parameters such as load, channel state, and queue size. Anand and Veciana [5] and Jang et al. [9] propose joint resource allocation and HARQ optimization frameworks to minimize the required bandwidth subject to a packet error ratio constraint in single-cell scenarios. These works make strict assumptions on queue models (e.g., M/GI/∞ with zero queuing delay in [5] and M/G/1 in [9]), which are often unrealistic in cyber-physical production systems.

B. Machine Learning Potential in CPSs
To overcome the drawbacks of these model-based approaches, there has been increasing interest in using deep reinforcement learning (RL) to learn from data and minimize the use of unrealistic assumptions. Khalifa et al. [10] propose a risk-sensitive RL scheme to dynamically orchestrate repetitions in frequency in order to reduce the packet error ratio. In this article, as in many others, a packet error or packet loss refers to the event in which the transmitted packet is not received or cannot be decoded at the receiving entity. However, in URLLC applications, including CPSs and our work, a packet is also considered lost if it is received after its corresponding delay bound [2], [11]. Moreover, Khalifa et al. [10] assume per-slot orchestration, implying that the entire pipeline (including data transmission, processing, decision making, and applying the decision) should occur within every time slot, which could be as short as 125 μs [12]. Such intensive real-time interactions for service management and orchestration are not envisioned in the near future [13]. Near-real-time RL solutions for reliability enhancement have received little attention in the literature. Furthermore, the majority of existing works have been evaluated on small-scale software modules developed by researchers for limited use cases. Hence, when it comes to practical scenarios, the scalability of such solutions is questionable. Finally, in pursuing stringent delay and reliability objectives, the energy efficiency of the designed solutions has usually been neglected in prior art; i.e., the energy costs of the different approaches by which an objective is achieved have not been considered.

C. Contributions
In this article, we develop an RL-based control solution for URLLC services. As the controller, we develop an orchestrator module to manage the transmission power and diversity transmissions. Unlike prior art, our solution takes the application layer performance into account, does not require real-time interactions with the network, and is more scalable than existing works. Our main contributions are listed as follows.
1) Translation of application layer requirements of CPSs to communication system orchestration objectives: According to [2], application layer availability and reliability are defined, respectively, as the mean proportion of time and the mean time during which the communication service is delivered to the application layer of the receiving entity according to its required quality of service (QoS). We combine basic concepts of reliability theory [14] with the application layer requirements (defined by 3GPP [2]) to develop concrete mathematical models for the application layer reliability and availability of URLLC services.
2) Near-real-time orchestration: Using deep RL, we develop a model-free framework to dynamically orchestrate power control and diversity techniques. We implement the orchestrator using both deep Q-networks (DQNs) and soft actor-critic (SAC). Our orchestrator performs near-real-time control, which is in line with current developments of edge computing for cellular networks [13].
3) Scalable learning: To address the scalability of our solution in terms of the size of the action space (which grows exponentially with the number of industrial devices), we propose branching SAC (BSAC), in which the action representation of each device is distributed across separate neurons. Such decomposition of the devices' actions lets the action space grow linearly with the number of devices. This significantly reduces the computational and memory complexities of the solver, and consequently, our BSAC solution can orchestrate many more devices within a factory environment. We evaluate the performance of BSAC on our problem with a near-product 5G simulator.
4) Efficient assurance of stringent constraints: We evaluate our framework using a 5G-compliant simulator. Compared to the state-of-the-art baselines, our proposed framework improves the fifth percentile user availability by more than 18% in our simulated scenarios.
   a) Spectral efficiency: To achieve the same level of availability as our proposed solution, the state-of-the-art algorithms require at least twice the bandwidth (which translates to a much higher operational cost for the operator).
   b) Energy efficiency: To achieve the same level of availability as our proposed solution, the state-of-the-art algorithms require at least 90% more energy (which corresponds to a much higher operational cost and carbon footprint for the operator).
Our intelligent and efficient (in terms of computation, memory, and signaling) approach meets the stringent reliability requirements of URLLC services while saving energy and spectrum. Such algorithms pave the way toward improving 5G and beyond wireless access networks for industrial applications.
The rest of this article is organized as follows. In Section II, we introduce our system model. We describe the application layer reliability key performance indicators (KPIs) in Section III. We formulate the orchestration problem in Section IV and model it as an RL problem in Section V. The BSAC orchestrator is developed and evaluated in Sections VI and VII. Finally, Section VIII concludes this article.
Notations: Normal font $x$ or $X$, bold font $\mathbf{x}$ or $\mathbf{X}$, and uppercase calligraphic font $\mathcal{X}$ denote a scalar, a vector, and a set, respectively. We denote by $|\mathcal{X}|$ the cardinality of set $\mathcal{X}$ and by $\mathbb{1}\{x\}$ the indicator function, which takes the value 1 only when condition $x$ holds.

II. SYSTEM MODEL
We consider a CPS in which end devices are responsible for performing various functions that facilitate automated production. The communication system is responsible for the timely delivery of 1) sensor data to base stations and from there to an orchestrator and 2) computed or emergency tasks to the actuators. For the latter, the application layer performs the requested action upon receiving the corresponding messages. Sporadic packet loss in the underlying communication system will not necessarily affect the operational performance. The survival time, denoted by $T_{\mathrm{sv}}$, defines the maximum time period during which the application layer can tolerate consecutive losses in the underlying communication [2].
We focus on downlink (DL) transmission in a 5G deployment with several gNodeBs (gNBs), where each gNB corresponds to three cells. Throughout this article, we use index $i$ to denote a cell and $j$ to denote an industrial device. We consider a set of cells $\mathcal{C} := \{\mathrm{CL}_1, \mathrm{CL}_2, \ldots, \mathrm{CL}_{|\mathcal{C}|}\}$ in the network. Each cell $\mathrm{CL}_i$ serves a set of devices $\mathcal{U}_i := \{1, 2, \ldots, K_i\}$, where $K_i$ is the total number of devices served by $\mathrm{CL}_i$, and $K := \sum_{i} K_i$ denotes the total number of devices served by all cells. We use both diversity and power control to enhance the reliability of CPSs. Let $N_f$ be the number of diversity techniques that we can control. Examples of such techniques include HARQ retransmission and packet duplication at the PDCP layer. We can compactly represent the diversity setting for $u_{ij}$ (i.e., device $j$ associated with $\mathrm{CL}_i$) as $\mathbf{f}_{ij} := (f_{ij}^{(1)}, \ldots, f_{ij}^{(N_f)})$, where $f_{ij}^{(k)}$ is the setting for the $k$th diversity technique such that $f_{ij}^{(k)} \in \{1, \ldots, N_k\}$, $k \in \{1, \ldots, N_f\}$, and $N_k$ denotes the maximum number of diversity transmissions that the specific diversity technique can have. For example, if we can control HARQ retransmissions and PDCP duplication for $u_{ij}$ and the maximum transmission diversity is set to 10 and 2, respectively, then $\mathbf{f}_{ij} = (f_{ij}^{(1)}, f_{ij}^{(2)})$ contains two diversity techniques (i.e., $N_f = 2$), and $N_1$ and $N_2$ are 10 and 2, respectively. For power control, we define $p_{ij}$, which represents the transmission power level to $u_{ij}$.
We propose a feature orchestrator that optimizes the diversity techniques and transmission power at per-device granularity. Therefore, our orchestrator performs intraslice management to maximize the operational performance of the URLLC service given the resources (e.g., bandwidth, maximum transmission power, and processing capabilities) allocated to the URLLC slice. From the architecture perspective, such orchestration can be implemented in the edge (also known as on-premise) cloud as a virtual network function (VNF) in the URLLC slice. In this architecture, the edge cloud is located close to the radio access network (RAN). Besides, the measurement data are shared with the orchestrator via a northbound interface, and the network-level decisions are transmitted to gNBs via a southbound interface [15]. This cloud-native implementation enables us to perform network-level optimizations instead of the current threshold-based implementations in the gNB. Moreover, owing to the proximity of the edge cloud to the RAN, the added propagation delay is low, while the powerful computing resources available at the edge reduce the processing delay considerably [16]. In the big picture, our feature orchestrator can be implemented as an inner loop to interslice management techniques in which the orchestration is mainly based on the priority of each slice [17]. These techniques implement slice (or slice subnet, if within a domain, e.g., the RAN) management functions and typically leverage high-level decision variables (e.g., the spectrum [18] or resource provisioning [19]) to optimize the resources allocated to each slice. Thus, our solution complements such management functions toward a resource-efficient URLLC slice. Fig. 1 illustrates the proposed architecture.

III. PERFORMANCE METRICS
In the network layer, we say that the communication to $u_{ij}$ is operational if packets are successfully and timely received by the PDCP layer of the receiving device. Hence, the channel state is described by a Bernoulli state variable

$$X_{ij}^{(n)} \sim \mathrm{Bern}\big(1 - \epsilon_{ij}^{(n)}\big) \tag{1}$$

where $\epsilon_{ij}^{(n)}$ represents the packet decoding error probability for the $n$th DL transmission to $u_{ij}$ (regardless of the packet delay). This probability is a function of the signal-to-interference-and-noise ratio (SINR), the code rate of the transmitted packet, and $\mathbf{f}_{ij}$. Considering the delay, we can define the network state variable as

$$Y_{ij}^{(n)} := X_{ij}^{(n)}\, \mathbb{1}\big\{D_{ij}^{(n)} \le D_{ij}^{\mathrm{req}}\big\} \tag{2}$$

where $D_{ij}^{(n)}$ and $D_{ij}^{\mathrm{req}}$ represent the end-to-end delay of the $n$th packet of $u_{ij}$ and the delay requirement for $u_{ij}$, respectively. Assuming that packet $n$ is received at time $t_n$, we define the continuous network state variable as

$$Y_{ij}(t) := \sum_{n} Y_{ij}^{(n)} \big[u(t - t_n) - u(t - t_{n+1})\big] \tag{3}$$

where $u(t)$ is 1 for $t \ge 0$ and 0, otherwise. In the application layer, sparse packet failures shorter than the survival time, $T_{\mathrm{sv}}$, can be tolerated or even, in some cases, corrected [2], [20], [21]. Hence, the application layer state variable can be defined as

$$Z_{ij}(t) := \max_{t' \in [t - T_{\mathrm{sv}},\, t]} Y_{ij}(t'). \tag{4}$$

Fig. 2 illustrates the relationship between $Z_{ij}(t)$ and $Y_{ij}(t)$.

A. Application Layer KPIs

1) Availability, $\alpha_{ij}$: Availability is an item's ability to perform its functions given the specified requirements at an arbitrary time [14]. Therefore, in wireless systems, availability (or average availability) at the application layer is defined as the mean proportion of time during which the communication service is delivered according to the required QoS, with interruptions shorter than $T_{\mathrm{sv}}$. That is

$$\alpha_{ij} := \lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} Z_{ij}(t)\, \mathrm{d}t. \tag{5}$$

2) Reliability, $\tau_{ij}$: Reliability at the application layer is defined as the mean duration of time during which the application is operational (i.e., $Z_{ij}(t) = 1$). Let $N(t)$ denote the number of times the crossing from $Z_{ij}(t) = 1$ to $Z_{ij}(t) = 0$ occurs in $[0, t]$. Then, we can define the reliability of device $u_{ij}$ as

$$\tau_{ij} := \lim_{T \to \infty} \frac{\int_{0}^{T} Z_{ij}(t)\, \mathrm{d}t}{N(T)}. \tag{6}$$
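To make these KPI definitions concrete, the following Python sketch estimates the application layer state, availability, crossing rate, and reliability from a sampled network-state trace over a finite window. The sampling period, the input trace, and the function name are illustrative assumptions, not part of the system described above.

```python
import numpy as np

def application_layer_kpis(y, dt, t_sv):
    """Estimate Z(t), availability, crossing rate, and reliability from a
    sampled network-state trace y (1 = packet delivered in time, 0 = not),
    sampled every dt seconds, with survival time t_sv seconds.
    Illustrative sketch; assumes a finite observation window."""
    y = np.asarray(y, dtype=int)
    win = max(1, int(round(t_sv / dt)))          # samples covered by T_sv
    # Z(t) = 1 if at least one timely delivery occurred within the last T_sv
    z = np.array([y[max(0, k - win + 1):k + 1].max() for k in range(len(y))])
    availability = z.mean()                       # finite-window estimate of (5)
    crossings = int(np.sum((z[:-1] == 1) & (z[1:] == 0)))  # N(T) in (6)
    crossing_rate = crossings / (len(y) * dt)     # estimate of l_ij
    reliability = z.sum() * dt / crossings if crossings > 0 else np.inf
    return availability, crossing_rate, reliability

# Example: 2-ms packet period, 10-ms survival time, a short loss burst.
avail, l_rate, rel = application_layer_kpis(
    [1, 1, 0, 0, 1, 1, 1, 0, 1, 1], dt=2e-3, t_sv=10e-3)
```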

IV. ORCHESTRATION PROBLEM
The goal for the system is to find the configuration of the diversity setting and transmission power level (i.e., $\mathbf{f}_{ij}$ and $p_{ij}$ for $\forall i \in \mathcal{C}, \forall j \in \mathcal{U}_i$) that maximizes the application layer reliability and availability of a CPS. Hence, we define the orchestration problem for a given set of reliability enhancement techniques as

$$\max_{\{\mathbf{f}_{ij},\, p_{ij}\}} \; \sum_{i \in \mathcal{C}} \sum_{j \in \mathcal{U}_i} \big(\eta_\alpha\, \alpha_{ij} + \eta_\tau\, \tau_{ij}\big) \tag{7a}$$
$$\text{s.t.} \quad 0 \le p_{ij} \le p_{ij}^{\max}, \quad \forall i \in \mathcal{C},\; \forall j \in \mathcal{U}_i \tag{7b}$$
$$\qquad\;\; f_{ij}^{(k)} \in \{1, \ldots, N_k\}, \quad \forall k \in \{1, \ldots, N_f\},\; \forall i \in \mathcal{C},\; \forall j \in \mathcal{U}_i \tag{7c}$$

in which $\eta_\alpha > 0$ and $\eta_\tau > 0$ in (7a) are the scaling coefficients of availability and reliability, respectively. Besides, (7b) enforces the maximum allowable transmission power to $u_{ij}$, and (7c) indicates that the diversity transmissions are nonnegative integers with a specified maximum. Note that both (7b) and (7c) should be respected at all decision time epochs. The application layer availability and reliability of a system are functions of the channel state variable, $X_{ij}^{(n)}$, as well as the end-to-end delay of packets at the receiving devices [see (2)-(6)]. The channel state variable and end-to-end delay depend on many variables, such as instantaneous path gain, SINR, code rate, and transmission buffer status, each of which can be impacted by $\mathbf{f}_{ij}$ and $p_{ij}$. Hence, jointly modeling the channel state variable and the end-to-end delay is a highly complex task and requires significant assumptions on the channel, traffic, and queue models.

V. TRANSFORMATION TO DEEP RL PROBLEM
The optimization problem (7) is a complex nonconvex problem. Besides, characterizing the effect of $\mathbf{f}_{ij}$ and $p_{ij}$ on our reliability KPIs, i.e., on the objective function of (7), requires explicit channel and queue models, which involve approximations that may not hold in practice. The complexity of the optimization problem (7) and the lack of accurate models for the components of (7a) motivate the use of RL.
Our RL framework consists of: 1) an RL agent; 2) an environment that is described by a set of states; 3) their interactions through actions and rewards [22]; and 4) a step period, Δt, i.e., the length of time between two consecutive actions.
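The following Python sketch illustrates this interaction loop at the step-period granularity. The `agent` and `network` interfaces, their method names, and the Δt value are hypothetical placeholders assumed only for illustration.

```python
import time

STEP_PERIOD = 1.0  # Δt in seconds; matches the evaluation setting in Section VII

def orchestration_loop(agent, network, num_steps):
    """Minimal near-real-time RL loop: every Δt seconds the agent receives
    aggregated statistics (state), issues per-device power/diversity settings
    (action), and later observes the resulting reward. The `agent` and
    `network` objects are hypothetical placeholders."""
    state = network.collect_statistics()          # statistics of the last Δt
    for _ in range(num_steps):
        action = agent.act(state)                 # per-device f_ij, p_ij
        network.apply_configuration(action)       # pushed via southbound interface
        time.sleep(STEP_PERIOD)                   # wait one step period
        next_state = network.collect_statistics()
        reward = network.report_reward()          # availability/crossing-rate based
        agent.observe(state, action, reward, next_state)
        state = next_state
```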

A. State Space, S
The state space describes the environment where the RL agent learns based on the sequence of action-reward pairs. Our state at time slot t, denoted by S t , includes explicit QoS and implicit QoS variables as described below.
1) Explicit QoS Variables: As suggested by Ganjalizadeh et al. [20], application layer availability and reliability are functions of the packet error ratio (i.e., $\lim_{t \to \infty} \Pr(Y_{ij}(t) = 0)$), the mean down time on the network layer (i.e., the mean time during which $Y_{ij}(t)$ is zero), and the survival time of the specific use case. Hence, we propose to use the packet error ratio and mean down time of each device in the state space. Note that the survival time is a fixed setting, and therefore, it need not be included. These metrics can be estimated using empirical measurements within the RL step period, Δt.
2) Implicit QoS Variables: We identify (i) SINR ($\beta_{ij}$), (ii) path gain ($g_{ij}$), (iii) end-to-end delay ($D_{ij}$), (iv) RLC buffer status ($q_{ij}$), (v) the number of diversity transmissions, and (vi) the number of used resource blocks as the most relevant measures impacting the network layer QoS (directly or indirectly). For (i)-(iii), the empirical distribution (i.e., $F_x(\Delta t)$ for $x \in \{\beta_{ij}, g_{ij}, D_{ij}\}$) is well suited to describe the environment as perceived by $u_{ij}$. Therefore, we use certain statistics of these measures, namely, the fifth and 95th percentiles, the median, and the mean. The remaining measures [i.e., (iv)-(vi)] are counters, and a simple mean in the state is sufficient.
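As an illustration, the per-device state features described above could be assembled as follows. The measurement arrays, the function name, and the feature ordering are assumptions made for this sketch, not the exact implementation.

```python
import numpy as np

def device_state(sinr, path_gain, delay, rlc_buffer, n_div_tx, n_rbs,
                 packet_error_ratio, mean_down_time):
    """Build one device's state vector from measurements collected during the
    last step period Δt. Distribution-type measures are summarized by the
    5th/95th percentiles, median, and mean; counters by their mean."""
    def stats(x):
        x = np.asarray(x, dtype=float)
        return [np.percentile(x, 5), np.percentile(x, 95),
                np.median(x), np.mean(x)]

    return np.concatenate([
        [packet_error_ratio, mean_down_time],          # explicit QoS variables
        stats(sinr), stats(path_gain), stats(delay),   # implicit: distributions
        [np.mean(rlc_buffer), np.mean(n_div_tx), np.mean(n_rbs)],  # counters
    ])
```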

B. Action Space, A
The action space is the set of decision parameters through which the RL agent interacts with the environment. In practice, many RL frameworks quantize the continuous transmit power into several discrete levels [23]. We define the set of power levels as $\mathcal{P} := \{p_{\min}, p_1, p_2, \ldots, p_M, p_{\max}\}$. The action space for each device, including diversity transmissions, then becomes $\mathcal{A}_{ij} := \mathcal{P} \times \prod_{k=1}^{N_f} \{1, \ldots, N_k\}$.

C. Reward Function
From (5) and (6), application layer availability and reliability are measured over infinite time, while the temporal granularity of the RL agent is defined by a step period, Δt. Hence, we propose using estimators of these two measures in the reward function. Although the estimation of these long-term measures over a short step period might be inaccurate, such an approximation reflects the impact of each action on the application layer QoS in the short-term future. The availability estimator can be defined as

$$a_{ij}(\Delta t) := \frac{1}{\Delta t} \int_{t - \Delta t}^{t} Z_{ij}(t')\, \mathrm{d}t'. \tag{8}$$

However, the unit of reliability is time, and it cannot be reliably estimated over small time intervals. Hence, we define the crossing rate from $Z_{ij}(t) = 1$ to $Z_{ij}(t) = 0$ as $l_{ij} := \lim_{T \to \infty} N(T)/T$. Therefore, $\tau_{ij} = \alpha_{ij}/l_{ij}$, where $l_{ij}$ can be estimated at each step period by $l_{ij}(\Delta t) := N(\Delta t)/\Delta t$. We propose to use $a_{ij}(\Delta t)$ and $l_{ij}(\Delta t)$ in the reward function instead of $\alpha_{ij}$ and $\tau_{ij}$. Therefore, we transform (7a) to maximize the availability estimator and minimize the crossing rate. We define the reward function, $R_t$, as

$$R_t := \frac{1}{Kw} \sum_{i \in \mathcal{C}} \sum_{j \in \mathcal{U}_i} \Big(w\, a_{ij}(\Delta t) - (1 - w)\, l_{ij}(\Delta t)\Big) \tag{9}$$

in which $w \in (0, 1)$ is the weight specifying the relative importance of the availability and the crossing rate, respectively. The term $1/(Kw)$ scales the reward function to have an upper bound of 1. Moreover, the step period of the RL is denoted by Δt, at the end of which the reward is calculated. For instance, in our evaluations, we set Δt to 1 s, gather statistics of availability and crossing rate during this 1 s, and calculate the reward at time step t + 1 with those statistics.
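A minimal sketch of the reward computation, under the reconstruction of (9) given above and assuming that per-device availability estimates and crossing rates have already been measured over the last step period (the weight value and function name are illustrative):

```python
def step_reward(availability, crossing_rate, w=0.9):
    """Reward (9): scaled sum of per-device availability estimates minus
    weighted crossing rates over one step period Δt.
    `availability` and `crossing_rate` are lists of per-device estimates."""
    k = len(availability)
    total = sum(w * a - (1.0 - w) * l
                for a, l in zip(availability, crossing_rate))
    return total / (k * w)   # upper-bounded by 1 when all a = 1 and l = 0

# Example with three devices measured over the last second.
r = step_reward([0.999, 0.95, 1.0], [0.0, 2.0, 0.0])
```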

VI. BSAC ORCHESTRATOR
In this section, we address the orchestration problem using both DQN and BSAC algorithms. In the following, we briefly explain both algorithms.

A. Background: DQNs and Optimization
Q-learning is a model-free RL algorithm based on the concept of the action-value function. This function, denoted by $Q(S_t, A_t)$, is the long-term average accumulated discounted reward of state $S_t$ when action $A_t$ is taken at time $t$ [22]

$$Q^{\pi}(s, a) := \mathbb{E}_{\pi}\Big[\sum_{k=t}^{T} \gamma^{\,k-t} R_k \,\Big|\, S_t = s,\, A_t = a\Big] \tag{10}$$

where $R_k$ is the observed instantaneous reward at time $k$, $T$ is either a finite number representing the actual episode length or infinity for continuous RL tasks, $\gamma$ is the discount factor, and $\pi$ is a policy providing a mapping from the state to the action to be taken.
In Q-learning, actions are chosen by following a deterministic policy over the action-value function. In basic Q-learning, the action-value function is updated at each time step of the episode by [22]

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \zeta\Big[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\Big] \tag{11}$$

where $0 < \zeta < 1$ is the learning rate.
We use (11) to find the optimal action from each state. To this end, during the training phase, the optimization algorithm trades off exploration and exploitation. Exploration is the process in which the RL agent prioritizes trying actions (in the current state) that have not been experienced before in order to improve its value estimates. In contrast, during exploitation, the RL agent selects actions greedily based on what it has learned, i.e., it takes the action with the highest estimated Q-value [22]

$$A_t = \arg\max_{a \in \mathcal{A}} Q(S_t, a). \tag{12}$$

Exploration is often performed via the $\epsilon$-greedy action selection method, in which an action is either selected randomly with probability $\epsilon_t$ or otherwise using (12), where $\epsilon_t$ is usually a decaying function of $t$ [22].
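For concreteness, a minimal tabular sketch of the update (11) with $\epsilon$-greedy selection follows; the state/action encodings and hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

class TabularQLearner:
    """Toy Q-learning agent implementing update (11) with ε-greedy exploration."""
    def __init__(self, actions, zeta=0.1, gamma=0.95):
        self.q = defaultdict(float)      # (state, action) -> Q-value
        self.actions = actions
        self.zeta, self.gamma = zeta, gamma

    def act(self, state, epsilon):
        if random.random() < epsilon:    # explore
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])  # greedy, (12)

    def update(self, s, a, r, s_next):
        target = r + self.gamma * max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] += self.zeta * (target - self.q[(s, a)])     # update (11)
```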
In the Q-learning approach, the Q-function is represented by a Q-table, containing states as rows, actions as columns, and Q-values as entries. In practice, the number of states is often too large. This leads to either reducing the number of states (e.g., through quantization) or replacing the Q-table with a machine learning model [e.g., a neural network (NN)]. The latter is also called DQN, which we use to solve the orchestration problem. However, (12) implies that all the possible discrete actions in A need to be present in the output layer, thus resulting in poor scalability and slow convergence [24].

B. Soft Actor-Critic
SAC is a model-free deep RL algorithm that performs exploration by jointly maximizing the expected accumulated discounted reward and the entropy (i.e., a measure of uncertainty in the policy) [25]. In SAC, an NN approximates the Q-value of a state-action pair (the critic model), while an actor model computes the next action. Afterward, this action and the current state are fed into the critic model for action-value function approximation. The central optimization problem in SAC is to find a stochastic policy such that [25]

$$\pi^{*}(\cdot|\cdot) := \arg\max_{\pi(\cdot|\cdot)} \mathbb{E}\Big[\sum_{t=0}^{T} \gamma^{t}\big(R_t + \psi\, \mathcal{H}(A_t)\big)\Big] \tag{13}$$

where $\pi^{*}(\cdot|\cdot)$ is the optimal policy, the expectation is over the distribution of state-action trajectories induced by the stochastic policy $\pi$, $T$ is the number of time steps, $\psi > 0$ determines the relative importance of the reward and the entropy term, and $\mathcal{H}(A_t) := -\mathbb{E}_{A_t}[\log \pi(A_t|S_t)]$ is the entropy of the action taken at time $t$, where $A_t \sim \pi(\cdot|S_t)$. In other words, SAC performs guided exploration, i.e., it selects a policy that maximizes the sum of the discounted reward and the entropy.

C. Branching SAC
One of the main challenges of DQN and SAC is their poor scalability. In the following, we present BSAC to address this issue.
Our first step is to introduce continuous action in actor-critic algorithms. The continuous action is then quantized to a discrete value representing the setting for all devices in the factory. Throughout our study, we implemented our orchestrator using both SAC, which is a stochastic policy gradient method, and deep deterministic policy gradient [26]. Nevertheless, our evaluations indicated that SAC performs better in terms of convergence rate and the average reward for our orchestration problem.
The transformation of SAC to a continuous action space enabled us to handle a larger action space that was computationally infeasible with the DQN. However, the performance in terms of average reward was still unsatisfactory. Hence, we introduce BSAC, in which the action space of each industrial device (i.e., $\mathcal{A}_{ij}$ as the action space of $u_{ij}$, $\forall i \in \mathcal{C}, \forall j \in \mathcal{U}_i$) is represented by two neurons in the last layer of the actor network. With this modification, the size of the actor's last layer grows linearly with the action space (instead of the exponential growth in the DQN's Q-network or a discrete SAC's actor network). In addition, BSAC incorporates two further techniques. First, it uses target Q-networks and clipped double Q-learning, originally introduced in the twin delayed deep deterministic policy gradient algorithm. According to [27], these additions can overcome overestimation in value approximation while ensuring convergence to a suboptimal policy. Second, our BSAC leverages prioritized experience replay, originally developed for the DQN in [28]. In other words, we replay transitions with higher temporal-difference errors more frequently. However, we anneal this prioritization during training to correct the bias it introduces in the expected value of the action-value function. The architecture of BSAC is depicted in Fig. 3(a).
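A minimal PyTorch-style sketch of such a branching actor follows. The layer sizes, the two-dimensional per-device branch (one power and one diversity dimension), the mean/log-standard-deviation heads, and all names are assumptions made for illustration, not the exact network of the paper.

```python
import torch
import torch.nn as nn

class BranchingActor(nn.Module):
    """Sketch of a branching actor: a shared trunk followed by a per-device
    branch of two continuous action dimensions (power level, HARQ setting).
    Outputs a squashed-Gaussian action; sizes are illustrative assumptions."""
    def __init__(self, state_dim, num_devices, hidden=128, act_per_device=2):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        out_dim = num_devices * act_per_device        # grows linearly with K
        self.mu_head = nn.Linear(hidden, out_dim)
        self.log_std_head = nn.Linear(hidden, out_dim)

    def forward(self, state):
        h = self.trunk(state)
        mu = self.mu_head(h)
        std = self.log_std_head(h).clamp(-20, 2).exp()
        noise = torch.randn_like(mu)                  # reparameterization trick
        action = torch.tanh(mu + std * noise)         # squashed Gaussian policy
        return action, mu, std
```

Each pair of output dimensions is later quantized to the discrete per-device setting (power level and number of diversity transmissions), which is why the output width, and hence the memory footprint, scales linearly with the number of devices.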
Accordingly, similar to [25], the soft Q-value can be updated iteratively as

$$Q(S_t, \mathbf{A}_t) \leftarrow R_t + \gamma\, \mathbb{E}_{S_{t+1}, \mathbf{A}_{t+1}}\big[Q(S_{t+1}, \mathbf{A}_{t+1}) - \psi \log \pi(\mathbf{A}_{t+1}|S_{t+1})\big] \tag{14}$$

where $\mathbf{A}_{t+1}$ is the vector of actions taken at time $t+1$, and $|\mathbf{A}_{t+1}| = 2K$ for $\forall t \in [0, T-1]$. Note that each action in $\mathbf{A}_{t+1}$ is sampled from $\pi(\cdot|S_{t+1})$.
To employ RL on problems with large state and action spaces, $Q$ and $\pi$ can be approximated iteratively using NNs (i.e., critics and actors, respectively). Let us denote the weights of the actor network and its target network by $\theta$ and $\hat{\theta}$, respectively. Besides, we assume that $\phi_i$ and $\hat{\phi}_i$ for $i \in \{1, 2\}$ represent the weights of the double Q-networks and their target networks. As it is preferable to keep offline training as an alternative, we describe the rest of the training algorithm in a form that enables offline training. Moreover, we include the target actors and clipped double Q-learning in the procedure. Hence, we suppose that the transitions are stored in a buffer, $\mathcal{B}$. Then, irrespective of the sampling technique (e.g., prioritized experience replay), we can represent a sampled transition by $B := (S, \mathbf{A}, R, S', \Upsilon)$, where $S'$ and $R$ are the state and reward obtained by taking action $\mathbf{A}$ in $S$, respectively. Besides, $\Upsilon$ is a binary parameter that is 1 if $S'$ is not the last state in the episode and 0 otherwise.
Hence, the parameters of the soft action-value functions can be trained by minimizing the mean squared error between the corresponding $Q_{\phi_i}$ [i.e., critics 1 and 2 in Fig. 3(a)] and the target action-value function, $\hat{Q}$, as [refer to $J_Q$ in Fig. 3(a)]

$$J_Q(\phi_i) := \mathbb{E}_{B \sim \mathcal{B}}\Big[\big(Q_{\phi_i}(S, \mathbf{A}) - \hat{Q}(S, \mathbf{A}, R, S', \Upsilon)\big)^2\Big] \tag{15}$$

where $Q_{\phi_i}$ is the parameterized action-value function [i.e., critics 1 and 2 in Fig. 3(a)] and $\mathbf{A}'$ is sampled from the parameterized target policy [i.e., $\pi_{\hat{\theta}}$, shown as the target actor in Fig. 3(a)]. Besides, the target action-value function is defined as

$$\hat{Q}(S, \mathbf{A}, R, S', \Upsilon) := R + \gamma \Upsilon \Big(\min_{i \in \{1,2\}} Q_{\hat{\phi}_i}(S', \mathbf{A}') - \psi \log \pi_{\hat{\theta}}(\mathbf{A}'|S')\Big). \tag{16}$$

To minimize $J_Q(\cdot)$, $\phi_1$ and $\phi_2$ can be updated in the gradient descent direction as in [25]. To ensure that the temporal difference error remains small, we update the target critics' weights at each time step as $\hat{\phi}_i = \nu \phi_i + (1 - \nu)\hat{\phi}_i$. For the actor part, the objective is to update the policy toward the exponential of the new soft action-value function, in which case the updated policy is guaranteed to improve in terms of its soft value [25]. Hence, the policy parameters can be learned by minimizing the expected Kullback-Leibler divergence between the new policy and the exponential of the new Q-function. Using the definition of the Kullback-Leibler divergence, we can rewrite the cost function for the policy improvement step [i.e., $J_\pi$ in Fig. 3(a)] as

$$J_\pi(\theta) := \mathbb{E}_{S \sim \mathcal{B}}\Big[\mathbb{E}_{\mathbf{A} \sim \pi_\theta}\big[\psi \log \pi_\theta(\mathbf{A}|S) - \min_{i \in \{1,2\}} Q_{\phi_i}(S, \mathbf{A})\big]\Big]. \tag{17}$$

In order to minimize $J_\pi(\cdot)$, we follow [25] and use the reparameterization trick to reformulate the expectation over actions into an expectation over noise, contributing to a lower variance estimator. In other words, the samples are drawn using a squashed Gaussian policy as $\tilde{\mathbf{A}} := \tanh(\mu_\theta(S) + \sigma_\theta(S) \cdot \chi)$, where $\mu_\theta(\cdot)$ and $\sigma_\theta(\cdot)$ are the estimated mean and standard deviation of a Gaussian distribution, respectively, and $\chi \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (i.e., a multivariate Gaussian distribution whose mean is the all-zero vector and whose covariance matrix is the identity matrix). Hence, we can reformulate $J_\pi(\theta)$ as

$$J_\pi(\theta) := \mathbb{E}_{S \sim \mathcal{B},\, \chi \sim \mathcal{N}}\Big[\psi \log \pi_\theta\big(\tilde{\mathbf{A}}\,|\,S\big) - \min_{i \in \{1,2\}} Q_{\phi_i}\big(S, \tilde{\mathbf{A}}\big)\Big]. \tag{18}$$

Consequently, the policy parameters, $\theta$, are updated in the gradient descent direction as in [25]. Besides, we update the target actor weights slowly by $\hat{\theta} = \nu \theta + (1 - \nu)\hat{\theta}$.
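A compact PyTorch-style sketch of one training step following the reconstruction of (15)-(18) above is given next. The network objects (lists of critics and their targets), the sample()/log_prob()-style interfaces, the buffer batch format, and the hyperparameter values are placeholders assumed for illustration only.

```python
import torch

def bsac_update(batch, actor, target_actor, critics, target_critics,
                critic_opts, actor_opt, gamma=0.99, psi=0.2, nu=0.005):
    """One sketch update step: clipped double-Q targets (16), critic
    regression (15), reparameterized actor loss (18), and Polyak averaging
    of the target networks. `critics`/`target_critics` are 2-element lists."""
    s, a, r, s_next, not_done = batch

    with torch.no_grad():
        a_next, logp_next = target_actor.sample(s_next)        # A' ~ π_θ̂(·|S')
        q_next = torch.min(target_critics[0](s_next, a_next),
                           target_critics[1](s_next, a_next))   # clipped double Q
        q_target = r + gamma * not_done * (q_next - psi * logp_next)   # (16)

    for critic, opt in zip(critics, critic_opts):                # (15)
        loss_q = ((critic(s, a) - q_target) ** 2).mean()
        opt.zero_grad(); loss_q.backward(); opt.step()

    a_new, logp_new = actor.sample(s)                            # reparameterized
    q_new = torch.min(critics[0](s, a_new), critics[1](s, a_new))
    loss_pi = (psi * logp_new - q_new).mean()                    # (18)
    actor_opt.zero_grad(); loss_pi.backward(); actor_opt.step()

    with torch.no_grad():                                        # Polyak updates
        for net, tgt in zip(critics + [actor], target_critics + [target_actor]):
            for p, tp in zip(net.parameters(), tgt.parameters()):
                tp.mul_(1 - nu).add_(nu * p)
```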
The learning procedure described in (15)-(18) can realize both off-policy and offline training. In the former, when interactions with the environment are possible, newly experienced transitions are added to the buffer, and several transitions from the buffer are used to update the new policy. Nevertheless, the behavior policy is not necessarily the same as the target policy. In the latter, the dataset is collected using some unknown behavior policy; thus, there is no interaction with the environment [represented via solid lines in Fig. 3(a)]. However, the actors and critics can also be trained via on-policy learning. In such a case, the formulation changes from episodic to infinite-horizon RL (i.e., T → ∞ and Υ is always 1). Besides, the algorithm accesses only the latest transition (i.e., there is no buffer), and consequently, the actions are sampled and executed via the most recent estimate of the policy (i.e., the behavior policy and target policy are identical in this case). This characteristic is significant when dealing with URLLC services, for which there is no tolerance for violating their strict requirements. Leveraging this capability, one can train the NNs using: 1) an already existing dataset in offline mode; 2) a virtual network (e.g., realistic simulations or a digital twin) in off-policy mode, while the episodes can run in parallel to speed up the learning procedure; and 3) the operational network (e.g., in safe exploration mode) to tune the NNs' weights further.
The convergence of BSAC generally follows that of SAC in [25, Th. 1]. Nevertheless, in this article, our primary focus is to study the performance of large-scale optimization of diversity transmissions and transmission power in realistic industrial automation environments (refer to Section VII-A for our simulation setting). Hence, a formal convergence criterion of BSAC to optimal policies in an industrial setting is out of the scope of the present work and is left as an open problem for future research. However, the comprehensive experiments illustrated in Fig. 5(b) show promising convergence behavior in all of our scenarios.

VII. SIMULATION RESULTS AND ANALYSIS
In this section, we present the simulation methodology and evaluate the performance of our deep RL orchestrator.

A. Simulation Methodology and Configuration
We performed link-level and network-level simulations. In the former, we consider a 3-D model of a small factory and calculate path gains and 3-D channel data using [29, Sec. 7]. The path gain matrices, radio rays, and nodes' location information are then fed to the network-level simulator, in which we simulated the PHY, MAC, and higher layers in a multicell multiuser network. The network-level simulator is event-based and operates at an orthogonal frequency-division multiplexing (OFDM) symbol-level time granularity. In this setup, the RL agent resides in a separate server and communicates with the network-level simulator via a ZeroMQ interface. The simulation setup is shown in Fig. 4. We modeled the environment as a small-sized factory with dimensions of 15 × 15 × 11 m³. We adjusted the indoor hotspot model derived in [29] to capture the blockage effect (in a factory environment) from stationary or moving objects. In addition, we assumed that a gNB is located in the center of the small factory at a height of 10 m and is configured with a three-sector cell setting. The gNB has two antennas positioned vertically on top of each other, with a downtilt that optimizes system capacity. The bandwidth allocated to the industrial devices is assumed to be 10 MHz, unless otherwise stated. The devices are positioned so as to emulate high-interference scenarios in the factory. We kept the device positions the same in different simulations and moved the surrounding environment with a speed of 30 km/h. The control application traffic ingress to the devices, as considered in [2], is assumed to be periodic with a 2-ms period and a delay bound of 2.5 ms (i.e., $D_{ij}^{\mathrm{req}} = 2.5$ ms, $\forall i \in \mathcal{C}, \forall j \in \mathcal{U}_i$). In our simulations, depending on the selected modulation and coding scheme, packets are segmented/concatenated and sent as transport blocks. The received instantaneous SINR of each transport block results in an error probability, which depends on the radio channel (e.g., path gain, blockage, and fast fading) and the dynamic interference from other devices' transmissions. After packet reassembly, even if a packet is decoded successfully, the PDCP layer discards packets received after their delay bound. Finally, at the application layer, we take the survival time into account and calculate the availability, $a_{ij}(\Delta t)$, and the crossing rate, $l_{ij}(\Delta t)$.

TABLE I SIMULATION PARAMETERS
The RL orchestrator performs training with 100-s simulations over 10 000 parallel runs, and we set the step period, Δt, to 1 s (i.e., we performed a total of one million iterations in the exploration phase). This implies that the state (which includes the compact statistics gathered over the last second) is sent to the deep RL orchestrator every second. Upon receiving the state, the deep RL orchestrator sends the action immediately and receives the reward of its action with a delay of 1 s. Moreover, since we used a 30-kHz subcarrier spacing and configured 14 OFDM symbols per slot, a Δt of 1 s contains 2000 slots of length 0.5 ms. Without loss of generality, we reduced the action space by quantizing the transmission power to only two levels, i.e., $p_{ij} \in \{p_{\min}, p_{\max}\}$, $\forall i \in \mathcal{C}, \forall j \in \mathcal{U}_i$, where $p_{\min}$ and $p_{\max}$ are set to 0.008 and 0.02 W, respectively. These power values are applied to all resource blocks scheduled for a specific industrial device during a step period. Moreover, among the diversity techniques, we only considered HARQ, where the maximum number of transmissions for each device was also set to 2, i.e., $f_{ij} \in \{1, 2\}$, $\forall i \in \mathcal{C}, \forall j \in \mathcal{U}_i$. The simulation parameters are presented in Table I.
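For reference, the reduced per-device action set used in the evaluation can be enumerated as in the short sketch below; the variable names are illustrative and not taken from the simulator.

```python
from itertools import product

P_MIN, P_MAX = 0.008, 0.02      # transmission power levels [W]
HARQ_MAX_TX = (1, 2)            # maximum number of HARQ transmissions

# Per-device action set: 2 power levels x 2 HARQ settings = 4 actions.
per_device_actions = list(product((P_MIN, P_MAX), HARQ_MAX_TX))

# The joint discrete action space grows exponentially with the device count.
def joint_action_space_size(num_devices, per_device=len(per_device_actions)):
    return per_device ** num_devices

assert joint_action_space_size(20) == 4 ** 20
```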

B. Performance Evaluation
For the performance evaluation of our proposed deep BSAC orchestrator (represented as bsacOrch), we considered several baselines, as follows.
1) maxPwr/maxPwrRet: All the resource blocks were configured with $p_{\max}$, while setting the maximum number of transmissions to 1 and 2, respectively.
2) lowPwr/lowPwrRet: All the resource blocks were configured with $p_{\min}$, while setting the maximum number of transmissions to 1 and 2, respectively.
3) maxPwrHb: We increased the bandwidth from 10 MHz (i.e., the default value) to 20 MHz and configured all the resource blocks with $p_{\max}$, while setting the maximum number of transmissions to 2.
4) hPwr: We increased the total output power from 0.5 W (i.e., the default value) to 1 W, while setting the maximum number of transmissions to 2.
5) dRlOrch: DQN-based orchestration, inspired by Saeidian et al. [30] and adjusted to our objective in (7a) via the reward (9).
In the evaluation phase, we ran the orchestrators (both dRlOrch and bsacOrch) on a server with an Intel(R) Xeon(R) Gold 6132 CPU @ 2.60 GHz, eight cores, and 64 GB of random access memory. For the sake of fair comparison, we trained the dRlOrch and bsacOrch orchestrators for one million iterations. Besides, we used the same random seed for all the figures.

C. Scalability
In this subsection, we evaluate the scalability of bsacOrch in terms of memory requirements and computational time.
1) Memory Requirement: As mentioned in Section VI, Fig. 3(b) depicts the simplified NN architecture of the DQN and BSAC algorithms. In both, we assumed two fully connected hidden layers of 128 × 128 neurons. In the DQN, there exists only one NN, which approximates $Q(s, a_i)$, where $a_i$ is an action in the action space $\mathcal{A}$. Note that in the DQN, or any other discrete deep RL solution, the number of neurons in the last layer of its NN is as large as the size of the action space, $|\mathcal{A}|$. Let us assume that the action set is identical for all industrial devices (i.e., $\mathcal{A}_u = \mathcal{A}_{ij}$, $\forall i \in \mathcal{C}, \forall j \in \mathcal{U}_i$). Then, assuming that there are overall $K$ controlled devices, the action space in our problem becomes $|\mathcal{A}| = |\mathcal{A}_u|^K$. For example, in our simulations, we assumed $|\mathcal{A}_u| = 4$, for two levels of transmission power with or without retransmissions. In this case, the size of the action space for 20 devices becomes $4^{20}$. In contrast, the number of neurons in the BSAC NNs, in both the last layer of the actor and the first layer of the critic, increases linearly with the number of devices (unlike the exponential increase in the DQN and other discrete RL solutions). One of the contributions of our work is that we propose a vector of continuous actions in which each index represents the action space of one industrial device. This linear increase can be clearly observed in Fig. 5(a). The figure shows the memory requirement to run each algorithm, excluding the target NNs in our actual orchestrator; one might double these values to include the memory demand of the target NNs too. As can be seen in this figure, for the orchestration of 30 industrial devices, our BSAC algorithm requires only 1.2 GB of memory, while the DQN algorithm demands more than 10 YB (i.e., more than 10^16 GB) of memory. Therefore, we could not even run dRlOrch for more than ten devices on our high-performance server.
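The scaling argument can be reproduced in a few lines; the per-device action count of four and two branch neurons per device mirror the setting above, while the helper names are illustrative.

```python
def dqn_output_neurons(num_devices, per_device_actions=4):
    """Discrete DQN: one output neuron per joint action -> exponential in K."""
    return per_device_actions ** num_devices

def bsac_branch_neurons(num_devices, neurons_per_device=2):
    """Branching actor: a fixed number of neurons per device -> linear in K."""
    return neurons_per_device * num_devices

for k in (5, 10, 20, 30):
    print(f"K={k}: DQN output neurons = {dqn_output_neurons(k):.3e}, "
          f"BSAC branch neurons = {bsac_branch_neurons(k)}")
```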
2) Computational Time: For the evaluation of computational time, we evaluated bsacOrch with 5, 10, 20, and 30 controlled devices, and dRlOrch with five and ten controlled devices, all with one million iterations. The only exception is dRlOrch with ten controlled devices, which was trained with only 120 000 iterations for the reasons discussed below and in Section VII-C1. For both orchestrators, we measured and stored the learning time, i.e., the time needed to compute the gradients and perform one gradient descent step. As demonstrated in Table II, bsacOrch does not show a significant increase in learning time as we scale the number of controlled devices. In fact, in our evaluations, it never exceeded 16 ms, and the mean was always between 5 and 10 ms. Despite bsacOrch's slightly higher per-iteration learning time compared to dRlOrch, the former converged with a much lower number of iterations [see Fig. 5(b)], i.e., it converged with fewer than 250 000 iterations (the black dotted vertical line) for all numbers of devices. Although dRlOrch showed decent computational efficiency in the scenario with five controlled devices, its performance dropped dramatically when we increased the number of controlled industrial devices. The mean learning time of dRlOrch with ten controlled devices became 390× that of dRlOrch with five industrial devices, implying that dRlOrch cannot be used for large-scale industrial networks, where there exist tens of devices. As a result, we could not finish one million iterations for ten devices: with a mean learning time of 2.46 s, one million iterations would have taken about a month.

D. Performance Gain on Small-Scale Factory
In the small-scale factory simulation scenario, we simulated five industrial devices, where three out of these five devices are positioned to emulate high interference. Moreover, the application generates packets with a size of 80 bytes. Fig. 6 illustrates the distribution of the application layer KPIs, in which each sample represents the availability and crossing rate of one device estimated via empirical measurements in a 100-s simulation. As Fig. 6 suggests, retransmissions in lowPwrRet lead to lower availability and a higher crossing rate than in lowPwr. Our careful analysis of the results shows that lowPwrRet has 1.6% higher loss on the PDCP layer, while almost nothing is lost on the RLC layer. Likely, continuous retransmissions to devices with poor channel quality increased the cell load, resulting in a higher number of packets missing their delay bound. Our bsacOrch succeeds in improving the fifth percentile device availability to 0.99982 (i.e., three-nines availability). This availability level implies that a factory empowered by our bsacOrch can fulfill three-nines availability with a probability of 0.95 (i.e., $\Pr(\alpha_{ij} \ge 0.999) \ge 0.95$). Such availability is 18.3% higher than the fifth percentile availability of maxPwrRet. Meanwhile, the benchmark with twice the bandwidth, maxPwrHb, achieves only 18.1% higher fifth percentile availability than maxPwrRet. In other words, the BSAC orchestrator achieves this gain with half the bandwidth, leading to a substantial saving in the network operational costs. Our bsacOrch even outperforms the baseline with twice the transmission power (i.e., hPwr) by 0.5% in fifth percentile availability. This gain is due to properly addressing the tradeoff between delay, power, and packet loss, as also indicated by the delay and resource block distributions. From the crossing rate perspective, shown in Fig. 6(b), bsacOrch even achieves lower crossing rates. Since packets experience very low delay in the maxPwrHb case, low SINR is likely the major cause of packet loss in maxPwrHb. However, our orchestrator can avoid such low SINR to a great extent using dynamic power control and, hence, gets close to the maxPwrHb performance with half the bandwidth. As Fig. 6(a) and (b) confirm, although the main objective in introducing bsacOrch was to support large action spaces, bsacOrch outperforms dRlOrch even for a small action space (i.e., when there are five controlled devices on the factory floor). Fig. 7 illustrates the transmission energy consumption of our bsacOrch against hPwr and maxPwrHb; these three showed similar performance in terms of application layer availability and reliability. As this figure shows, compared to maxPwrHb, bsacOrch consumes 1.9× the energy, while it requires half the bandwidth. On the other hand, our orchestrator consumes 52% less energy compared to hPwr, while still achieving better reliability and availability. This implies that such an orchestration of diversity techniques and power control achieves reliability and availability that state-of-the-art solutions could achieve only with either double the bandwidth or double the energy consumption. Notice that we did not consider the training energy consumption here because bsacOrch is trained once and retrained only when a significant change occurs. In many factory automation use cases, such drastic changes are rare, and therefore, we rarely need to retrain the agent.

E. Performance Gain on Larger Scale Factory
In the following, we deploy a larger number of industrial devices on the same factory footprint given in Section VII-A. Consequently, not only do the devices observe higher interdevice interference, but bsacOrch also needs to manage many more devices in near real time. We assumed a data unit size of 64 bytes, double precision, in this scenario. Fig. 8 presents the mean and fifth percentile device availability of maxPwrRet and bsacOrch for different numbers of controlled devices. From this figure, the gain in mean and fifth percentile availability of bsacOrch over maxPwrRet grows as the number of devices increases. This is because of the increased degrees of freedom for better interference management. Unlike the baselines, our solution can handle this larger optimization problem, thanks to its scalability.

VIII. CONCLUSION
In this article, we proposed a deep RL-powered orchestrator to dynamically control transmission power and the maximum number of diversity transmissions for URLLC services. In particular, our orchestrator receives certain statistics (e.g., SINR and delay) from industrial devices and maximizes the application layer reliability and availability KPIs. To achieve a scalable solution, we implemented the orchestrator using BSAC, a novel model-free deep RL algorithm in which the action space of each industrial device is represented via separate branches in the actor network. Our solution operates in near real time, and hence, it is suitable for deployment in the edge cloud. The results of our extensive simulations confirm the superior performance of the proposed solution compared to the state-of-the-art benchmarks, including the DQN-based solutions. Since a similar level of performance could be achieved by legacy solutions at the cost of a much higher bandwidth or transmit energy, the proposed solution opens doors to saving more energy and spectrum in realizing URLLC for 5G and beyond 5G networks.