rocorl: Transferable Reinforcement Learning-based Robust Control for Cyber-Physical Systems with Limited Data Updates

Autonomous control systems are increasingly using machine learning technologies to process sensor data, making timely and informed decisions about performing control functions based on the data processing results. Among such machine learning technologies, reinforcement learning (RL) with deep neural networks has recently been recognized as one of the feasible solutions, since it enables learning by interaction with the environments of control systems. In this paper, we consider RL-based control models and address the problem of temporally outdated observations often incurred in dynamic cyber-physical environments. This problem can hinder the broad adoption of RL methods for autonomous control systems. Specifically, we present an RL-based robust control model, namely rocorl, that exploits a hierarchical learning structure in which a set of low-level policy variants are trained for stale observations and then their learned knowledge can be transferred to a target environment limited in timely data updates. In doing so, we employ an autoencoder-based observation transfer scheme for systematically training a set of transferable control policies and an aggregated model-based learning scheme for data-efficiently training a high-level orchestrator in a hierarchy. Our experiments show that rocorl is robust against various conditions of distributed sensor data updates, compared with several other models including a state-of-the-art POMDP method.


I. INTRODUCTION
Recently, deep reinforcement learning (RL) has gained much attention and been recognized as a practical solution for implementing autonomous control functions in intelligent cyber-physical systems, e.g., vehicles [1], [2], robots [3], drones [4], and others. Such an RL-based control function normally relies on timely observations by several sensing mechanisms as part of system operations to acquire state information about its surroundings and make informed decisions about control actions in response to state changes. In a networked system environment, on the other hand, sensors might not be centralized but distributed, and thus they need to communicate with a controller to keep observations highly up-to-date. In this case, while an RL-based control function can be formulated upon real-time data models commonly used for navigation and tracking [5], [6], its underlying communication infrastructure might have inherent restrictions in completely meeting the data synchronization requirements of real-time data [7]. That is, a real-time data model necessitates continual updates of remote sensor data, but network resource constraints (e.g., limited bandwidth, intermittent connections, transmission delays) inherently limit the timeliness of those updates.

Figure 1. In (a), the X-axis denotes mean values of variable data update periods in discrete timesteps, where the standard deviation is set to 2.5 for a lognormal distribution, and the Y-axis denotes the performance ratio calculated by Eq. (44). In (b), the observations are acquired from sensor data streams with variable update periods, thus inherently containing stale information, i.e., represented by circle data.

This inconsistency between real-time data models and underlying
network constraints can cause robustness issues for RL-based controls, particularly when observations depend on the timeliness of sensor data updates. Figure 1 briefly depicts a performance degradation pattern in our RL-based control tests, in which agents with neural networks utilize sensor data updates (i.e., periodic updates with variable intervals) to make decisions about control operations. In this simulation test, we configure variable periods of data updates from sensors to an RL agent to generate temporally outdated data input (stale observations) for the agent, as abstracted in Figure 1(b). We first test an agent with vanilla RL (the red line in Figure 1(a)) that is trained without any specific consideration of stale observations but tested with stale observations. The simulation environment will be detailed in Section V. The "Perf. ratio" on the Y-axis represents the performance relative to the "normal" case where no stale observation exists; its definition is in Eq. (44). The low performance of vanilla RL, less than 30% of the normal case, indicates the negative effect of stale observations on RL-based controls. Considering the temporal features of sensor data, we also evaluate several different types of RL policy network structures that are known to be effective for partial observation problems. Recurrent RL (the orange line in Figure 1(a)) shows the test result when an LSTM (long short-term memory)-based RL agent exploits a sequence of sensor data as input to its recurrent policy network. As shown, the longer the update periods (on the X-axis), the lower the performance. This result is contrary to our expectation that some effective rules for handling stale observations could be learned by this seemingly proper network structure, which makes use of a sequence of observations for state estimation. It turns out that recurrent RL barely shows robust performance under limited data updates, e.g., where the mean ≥ 2 on the X-axis.
We will provide our analysis on this result in Section II-A.
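As a concrete illustration, the variable-period update process used in this test can be simulated in a few lines. This is only a sketch of the setup described above; the function names and the conversion from the desired mean/standard deviation to lognormal parameters are our own assumptions, not the paper's implementation.

```python
import math
import random

def update_times(mean, std, horizon, seed=0):
    """Sample the timesteps at which sensor updates arrive, with
    inter-update periods drawn from a lognormal distribution whose
    mean/std match the test configuration (hypothetical sketch)."""
    rng = random.Random(seed)
    # convert the desired mean/std of the period into lognormal parameters
    mu = math.log(mean ** 2 / math.sqrt(std ** 2 + mean ** 2))
    sigma = math.sqrt(math.log(1.0 + std ** 2 / mean ** 2))
    t, times = 0.0, [0]
    while t < horizon:
        t += rng.lognormvariate(mu, sigma)
        times.append(min(int(t), horizon))
    return times

def staleness(times, horizon):
    """Age (in timesteps) of the most recent update at each timestep,
    i.e., the elapsed period i since the last update."""
    ages, last = [], 0
    for t in range(horizon):
        while last + 1 < len(times) and times[last + 1] <= t:
            last += 1
        ages.append(t - times[last])
    return ages
```

An agent observing through such a stream sees data whose age fluctuates with the sampled periods, which is exactly the condition under which the vanilla and recurrent agents degrade in Figure 1(a).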
In this paper, motivated by the test results above, we address the problem of stale observations for RL-based controls in a networked environment with random periods of sensor data updates. Such a control process with limited data updates is not an unusual structure, since geographically distributed sensing mechanisms are often part of a cyber-physical system; numerous sensing and actuation components are connected on a network. We consider a self-operation device as a target cyber-physical system, for which controls are governed by an RL agent and observations contain temporally outdated, stale information.
To do so, we employ a hierarchical learning structure by which a set of low-level transferable policies are systematically trained and then their learned knowledge can be transferred to a complex target environment having limited data updates. In generating a set of low-level policy variants that can collectively handle stale observations in a hierarchy, we leverage the autoencoder structure to reduce the difference of state estimates between different environments with respect to variable staleness. We also exploit a high-level orchestrator, which can be seen as a meta policy that learns rules for continuously selecting the most appropriate low-level policy from the set over time. Furthermore, to tackle the sample inefficiency induced by the meta policy on the higher level, which has far fewer data sampling points than the low-level control timesteps, we take an efficient model-based learning approach.
The contributions of our paper are as follows.
• We address the stale observation problem for RL-based controls of a network-constrained cyber-physical system. To do so, we employ transfer learning driven through a hierarchical model that consists of a high-level orchestrator and a set of low-level transferable control policies (in Section II). We name our proposed model rocorl (robust control using RL).
• We present the hierarchical policy training schemes of rocorl: autoencoder-based observation transfer that facilitates the systematic generation of transferable control policies upon stale observations (in Section III), and aggregated policy learning that accelerates model-based learning for the orchestrator upon insufficient rollout samples (in Section IV).
• We conduct a case study with the Airsim simulator [8] for autonomous drone operation in edge computing. The case study demonstrates the robust performance and wide applicability of our approach for adopting RL in a networked environment, e.g., showing up to 11% performance improvement over a state-of-the-art RL method in the configuration of variable update periods (in Section V).

II. OVERALL SYSTEM
Figure 2. The overall structure of rocorl: we consider a networked system that consists of a self-operation device node and an edge service, where the device has its locally observed short-range high-resolution sensor data (local data), the edge service has its globally observed long-range low-resolution sensor data (global data), and they communicate for conducting device control operations under network limitations.

In this section, we explain our assumption on limited updates of spatio-temporal sensor data in a target networked environment, describe the problem of stale observations
caused by limited data updates, and then briefly present our approach to the problem.

A. STALE OBSERVATION PROBLEM
In a cyber-physical system with distributed sensors, data updates are considered periodic or sporadic. Temporally outdated observations are often inherent, and they cannot be completely avoided, especially due to underlying network limitations. Throughout this paper, we refer to this situation, temporally outdated observations incurred by variable periods of sensor data updates, as the stale observation problem for RL-based control systems, which makes it hard to train an end-to-end RL model.

Stale Observation. In general, an RL-based control system operates upon real-time data streams that are managed by a set of networked modules, where the data streams continuously maintain observations for making informed decisions about device controls. As shown in Figure 2, we consider two types of real-time data according to the location of sensing mechanisms: global data e managed by an edge service and local data d managed by a device. That is, we refer to global data as near-edge, i.e., long-range low-resolution sensor data that can be accessed in real-time on the edge-side, while we refer to local data as near-device, i.e., short-range high-resolution sensor data that can be accessed in real-time on the device-side. We focus on the issue related to a network-constrained environment, and accordingly we presume that most sensor data cannot be accessed in real-time on both sides, due to the limitation of timely edge-device synchronization. For a self-operation vehicle, sensor measurements of the vehicle can be near-device local data (e.g., lidar, radar, visual images), whilst long-range real-time map information around the vehicle can be near-edge global data (e.g., the location of moving obstacles or traffic information).
In our notation, we represent (near-edge) global data as e ∈ E and (near-device) local data as d ∈ D. We assume that E and D are bounded regions on R^n for some n ∈ N without loss of generality. We also represent a state at discrete timestep t ∈ N_0 as s_t = (e_t, d_t) (2). Considering limited updates, we then formalize stale observations on the device-side as ω_t = (e_{t−i}, d_t) (3) for some i ∈ N_0 that specifies the elapsed period from the last update. Similarly, stale observations on the edge-side can be represented as ω_t = (e_t, d_{t−i}). In most cases, we consider stale observations from the device perspective, since we concentrate on RL-based controls for a self-operation device that exploits its observations to make control decisions.
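The device-side view of Eq. (3), fresh local data paired with the last-received global data, can be sketched as a small bookkeeping class. The class and method names below are hypothetical, not part of the paper's implementation.

```python
class StaleObservationStream:
    """Device-side view of Eq. (3): local data d_t are always fresh,
    while global data e arrive only at update times, so the device
    observes omega_t = (e_{t-i}, d_t). (Hypothetical sketch.)"""

    def __init__(self):
        self.last_e = None        # most recently received global data
        self.last_update_t = None # timestep of the last global update

    def push_global(self, t, e):
        # an edge-to-device update delivers fresh global data at time t
        self.last_e, self.last_update_t = e, t

    def observe(self, t, d):
        # i = elapsed period since the last global update
        i = t - self.last_update_t
        return (self.last_e, d), i
```

For example, after a global update at t = 0, an observation at t = 3 pairs the stale e from t = 0 with the fresh d from t = 3, with staleness i = 3.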
RL Training Issue with Stale Observation. Here, we describe the problem of RL training upon stale observations. In principle, RL is a method that finds an optimal policy π that maximizes the expected reward sum over an MDP (Markov decision process) represented by (S, A, P, R).
Note that S is a set of states, A is a set of actions, P : S × A × S → [0, 1] is a transition probability, and R : S × A → R is an immediate expected reward function. Then, for a θ-parameterized policy π_θ, the objective function is represented as

J(θ) = E_{s_t ∼ P_{π_θ}, a_t ∼ π_θ} [ Σ_{t=0}^{T} R(s_t, a_t) ]   (6)

where P_{π_θ} is a probability distribution of states s ∈ S induced by π_θ and T ∈ N_0 denotes the episode timesteps.
In case that stale observations are involved, however, this objective function rarely optimizes the policy π_θ, due to the fact that states s_t cannot be accurately estimated. In the RL research community, for the cases when the observation is partial or limited, i.e., a POMDP (partially observable MDP) where an RL agent is not able to observe true states from its environment, several methods including the recurrent policy [9], [10] have been investigated. In general, a POMDP is represented by (S, A, P, R, Ω, O), where Ω is a set of observations, O : S → (Ω → [0, 1]) is an observation probability, and the other elements are the same as in Eq. (5). Specifically, given a history h_t of observation and action samples, the recurrent policy (e.g., an LSTM policy) parameterized by θ, say π_θ^{(r)}, intends to infer an action from the history h_t. This approach is based on a property of POMDPs such that the observation probability O is a fixed probability distribution that completely depends on the current state [9], [10].
However, given stale observations, the assumption of a fixed observation distribution does not hold. Suppose that global data were last updated from the edge service to the device i steps earlier. For the current state s_t, we have O(s_t) = Pr[ω_t = (e_{t−i}, d_t) | s_t], which is not fixed, according to Eq. (3); the distribution of observations ω_t is not completely determined by the current state s_t, since e_{t−i} is necessarily part of the past state s_{t−i} rather than of s_t. The difference between a typical POMDP and a process with stale observations is illustrated in their transition diagrams in Figure 3.
In the following, we show why this different temporal dependency leads to incorrect calculations of policy gradients when a conventional end-to-end RL model is applied to a process with stale observations. Let H be a random variable form of a history h. Then, we can rewrite the objective function in Eq. (6) as

J(θ) = E_{h ∼ H} [ R(H) ]   (11)

where R(H) is the sum of rewards obtained during a history h ∼ H. By taking the log derivative trick and Monte Carlo approximation over N episodes of history samples, we obtain

∇_θ J(θ) ≈ (1/N) Σ_{n=1}^{N} ∇_θ log Pr[h_n] R(h_n).   (12)

In a POMDP, the log derivative for model parameters θ in ∇_θ log Pr[h], computed from the observed history h, can be acquired by realizing that the conditional probability of a particular history is the product of all observation and action probabilities [10]. Accordingly, the log derivative term becomes

∇_θ log Pr[h] = Σ_t ∇_θ log π_θ^{(r)}(a_t | h_t)   (13)

where π_θ^{(r)} denotes our target θ-parameterized policy. This is valid under the fixed distribution assumption in Eq. (9). Combining Eq. (12) and (13), we obtain the estimated gradient

∇_θ J(θ) ≈ (1/N) Σ_{n=1}^{N} ( Σ_t ∇_θ log π_θ^{(r)}(a_t | h_t) ) R(h_n).   (14)

While the gradient calculation in Eq. (14) is intended for optimizing π_θ^{(r)} for a typical POMDP, stale observations in Eq. (10) make Eq. (13) invalid, thereby resulting in inaccurate gradient values when a gradient-based optimization method is used. This explains the performance degradation previously shown in Figure 1, and motivates us to reformulate the stale observation problem into a hierarchical form, as we present in the following.
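The Monte Carlo estimator in Eq. (12)-(14) can be sketched generically. Here `grad_log_pi` is an assumed callback standing in for ∇_θ log π_θ^{(r)}(a_t | h_t); the episode representation is ours, not the paper's.

```python
def reinforce_gradient(episodes, grad_log_pi):
    """Monte Carlo policy-gradient estimate in the style of Eq. (14):
    (1/N) * sum_n [ sum_t grad log pi(a_t | h_t) ] * R(h_n).

    `episodes` is a list of (samples, reward_sum) pairs, where
    `samples` is the per-timestep list of (history, action) tuples
    and `reward_sum` is the episode return R(h_n)."""
    n = len(episodes)
    grad = 0.0
    for samples, reward_sum in episodes:
        # product rule over the history turns into a sum of log-gradients
        grad += sum(grad_log_pi(h, a) for h, a in samples) * reward_sum
    return grad / n
```

As the text notes, this estimator is only unbiased when the observation distribution is fixed given the current state; under stale observations the per-step factorization behind Eq. (13) breaks down, and the same code silently computes a biased gradient.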

B. OVERALL APPROACH
To tackle the aforementioned issue of policy gradient methods related to stale observations, we exploit the hierarchical structure of RL processing as shown in Figure 4, where a set of low-level control policies are maintained, and over time, one of them is continuously selected and used for performing rollouts. We name this hierarchical RL model rocorl, where the selection is made by a high-level decision maker, called orchestrator.
Consider a control procedure at a specific timestep t: (i) on the device-side, one of the control policies (say π_u) continuously rolls out with respect to its projection data from stale observations ω_t in Eq. (3), where near-device local data d_t are maintained up-to-date, while near-edge global data e_t often become stale. (ii) Suppose that at some timestep (t + k + 1) after a certain amount of time, an available connection is made between the device and the edge service, and up-to-date data can be shared through the connection. The orchestrator µ selects a policy (say π_v) at the same time. (iii) This enables the device to switch its running policy from π_u to π_v, if needed. Then, π_v keeps rolling out from timestep (t + k + 1) until the next update. We refer to the time period between two successive updates with a variable time interval as a round. These steps of hierarchical controls over successive rounds are depicted in Figure 4. We presume that such update periods are neither configurable nor deterministic, considering the variability of underlying network conditions. That is, neither edge-side nor device-side modules know the next update time in advance.
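The round-based procedure (i)-(iii) above can be sketched as a control loop. The environment, orchestrator, and policy interfaces below are hypothetical stand-ins chosen for illustration.

```python
def run_rounds(env, orchestrator, policies, transfer, horizon):
    """Sketch of rocorl's hierarchical control: within a round the
    selected low-level policy rolls out on (possibly stale)
    observations; whenever a connection delivers fresh data, the
    orchestrator may switch the running policy. All interfaces
    (env.connection_available, orchestrator.select, policy.act,
    transfer) are illustrative assumptions."""
    v = orchestrator.select(env.state())       # initially selected policy
    for t in range(horizon):
        if env.connection_available(t):        # a new round starts
            v = orchestrator.select(env.state())
        omega = env.observe(t)                 # stale observation omega_t
        a = policies[v].act(transfer(v, omega))
        env.step(a)
```

Note that the loop never schedules the next switch itself: consistent with the text, connection times are dictated by the environment, and the orchestrator only reacts when one becomes available.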
As we will explain in Section III-A, an inter-round process can be abstracted as an MDP. We thus take a conventional RL method for training the orchestrator µ : S_µ → Π, where S_µ denotes the state set for the orchestrator and Π denotes a set of control policies. On the other hand, we take several steps for systematically generating the set of control policies π_j : Ω^{(j)} → A, where Ω^{(j)} denotes the observation space for π_j.
To generate such a set of control policies that are capable of collectively dealing with variable staleness, we take a systematic training scheme that exploits the feature compression capability of autoencoder networks. The autoencoder-based scheme is capable of creating a set of control policies that correspond to different spatio-temporal observation spaces, while aiming at reducing the difference of state distributions of different environments for each control policy. Thus, the scheme renders the control policies transferable between different environments with respect to limited data updates and data staleness.
Furthermore, we address the difficulty in training the orchestrator policy, which requires relatively long training times due to the temporal abstraction of a two-level hierarchy. To mitigate the difficulty, we adopt a model-based RL structure by employing efficient policy parameter aggregation from both model-based and model-free learning. We will explain these policy training schemes in the next sections: the autoencoder-based observation transfer in Section III and the aggregated policy learning in Section IV.

III. HIERARCHICAL RL CONTROL
The rocorl model consists of an orchestrator µ and a set of control policies Π = {π_1, . . . , π_m}, and it makes two-level inferences: (i) selecting a policy for each round by the orchestrator when a connection is made, and (ii) conducting low-level control actions by a selected control policy over timesteps within a round. Algorithm 1 represents the two-level inference, where the first inference function called, inference(µ, s), corresponds to the former in Eq. (15), where the orchestrator µ selects a control policy based on its state s, and the second corresponds to the latter in Eq. (16), where a selected control policy π_v calculates an action a based on its observation transfer from observations ω. The inference(·) function yields an action, given a policy and an observation (or a state). The transfer(·) function conducts observation transfer for π_v, from ω to ω^{(v)}, which will be explained in Section III-B.
To make the inference robust against stale observations, we leverage a transfer learning mechanism via hierarchical RL, aiming to adapt the knowledge of RL policies from a learnable environment (without stale observations) for a target, hard-to-learn environment (with stale observations). In the following, we show how to train the policies in a hierarchy of rocorl. We first provide the proof that a decision process by the orchestrator µ conforms to an MDP between successive rounds. We then show how to train each control policy π ∈ Π by employing the autoencoder-based observation transfer scheme that alters the observation formulation according to selective global data for each policy π j ∈ Π.

A. ORCHESTRATION IN ROUNDS AS AN MDP
To show that the inter-round orchestration conforms to an MDP, we exploit the uniformization property that can turn a semi-MDP (SMDP) into an equivalent MDP [11], [12]. An SMDP is used for representing a generalized decision process with variable time intervals (i.e., holding times) between successive actions. For an SMDP, the distribution of holding times τ(s, a, s′) is specified for s, s′ ∈ S, a ∈ A, in addition to the MDP representation in Eq. (5). In the following, we first formulate the orchestrator µ as an SMDP, (S_µ, A_µ, P_µ, R_µ, τ), and then find its equivalent MDP. Given a set of states S and a set of control policies Π in our model, we obviously have S_µ = S and A_µ = Π, since the orchestrator µ performs macro-actions such as selecting a policy from Π for each round without any modification of underlying states. Given π ∈ Π for a round I, further suppose that the holding time of I is k (i.e., k ∼ τ). We then have the reward function R_µ : S × Π → R based on averaging over I,

R_µ(s_t, π) = (1/(k+1)) Σ_{j=0}^{k} R(s_{t+j}, a_{t+j})

where t is the starting timestep in I : [t, t + k]. Furthermore, we represent the transition of the orchestrator µ according to its hierarchy by P_µ(s, π, s′, k) = Pr[s′ = s_{t+k+1} | s_t, π, k].
In our model, the holding times τ correspond to the variable periods of rounds, which are completely governed by underlying network conditions. We thus establish the equivalent MDP by keeping S_µ and A_µ unchanged, because they have nothing to do with the temporal abstraction by the holding times τ. Furthermore, we have R = R_µ, since the holding-time effect is removed when average rewards are taken. Regarding P, we have the conditional expectation over the holding time k ∼ τ for π during I,

P(s, π, s′) = E_{k ∼ τ} [ P_µ(s, π, s′, k) ]   (19)

where T denotes the expected holding time such that T = E_{k ∼ τ}[k] (20). To maintain the Markov property on P, we design our hierarchical model to make the observation space of each control policy strictly confined within a single round. We then establish the equivalent MDP by Eq. (19) and (20).
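For a finite state space, the uniformization step, averaging the round-level transition over holding times k ∼ τ, can be illustrated directly. All names below are illustrative; the paper does not prescribe this representation.

```python
def uniformized_transition(p_mu, holding_dist, s, pi, states):
    """Expected transition of the equivalent MDP: average the
    round-level transition P_mu(s, pi, s', k) over k ~ tau.

    `p_mu(s, pi, sp, k)` is the SMDP transition probability and
    `holding_dist` maps each holding time k to its probability."""
    return {sp: sum(prob_k * p_mu(s, pi, sp, k)
                    for k, prob_k in holding_dist.items())
            for sp in states}
```

Because the holding-time distribution is fixed by the network rather than by the agent, this expectation is well defined independently of the orchestrator's choices, which is what lets the inter-round process be treated as an ordinary MDP.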
The equivalent MDP enables the orchestrator to learn macro-action rules in a hierarchy, similarly to other hierarchical RL methods such as the option framework [13]. Given a set of policies Π = {π_1, . . . , π_m}, it is possible to train the orchestrator µ using a conventional RL method with the objective function

J(µ) = E [ Σ_{n=1}^{N} R_µ(s_{t_n}, µ(s_{t_n})) ]   (22)

where {s_{t_n}}_{n=1}^{N} is a finite subsequence of {s_t} whose elements correspond to the starting times of rounds I : [t, t + k].

B. TRAINING CONTROL POLICIES
As theoretically discussed in Section II-A and empirically shown in Figure 1, it is non-trivial to have a properly learned model in environments with stale observations, when using a conventional end-to-end RL method. Therefore, in rocorl, we train each control policy instead in a normal environment that is particularly set to have no stale observations (i.e., no restriction on data updates), and adapt it for a target environment having limited data updates.
Here, we describe how to use autoencoders for training a set of control policies. Later, we will explain how rocorl makes the control policies transferable from the normal to the target environment, by exploiting different observation spaces and combining a high-level orchestrator with the control policies in a hierarchy in Section III-C.
For training several control policies to be transferable, we employ the autoencoder-based observation transfer scheme. We first define an individual policy as a composite function π = g ∘ f. The function f is intended to exploit a decomposable observation space and increase the diversity in latent representation, while the function g conducts a common decision-making task. We refer to such a function f as an observation transfer and g as a common decider.
To generate a set of observation transfers in a systematic way, we adopt the autoencoder structure. (i) We first exploit the policy π well-learned for the normal environment M_nm to obtain samples S of full observation trajectories. (ii) Then, we train the autoencoder (dec ∘ f_full : s → z → s) using s ∈ S, where the full observation space Ω is exploited at this time. Let the encoder part of the learned autoencoder be f_full : Ω → Z. (iii) We then obtain restricted observation spaces Ω^{(j)} from Ω with projection maps Φ_j for j = 1, . . . , m − 1.
For instance, a projection can be made from Ω by excluding all (or some) features of global data. For each projection map, we then obtain another observation transfer f_j : Ω^{(j)} → Z that is learnable from the paired samples of Ω^{(j)} and Z. Since each z ∈ Z for Ω^{(j)} is equal to that in Eq. (24), each observation transfer f_j can be supervised-learned using a relevant loss such as the similarity distance from ground-truth states [14],

ε_j = || f_j(Φ_j(ω)) − z ||   (27)

for j = 1, . . . , m − 1, where z = f_full(s). This learning process is iteratively conducted for different projection maps. To obtain the projection maps, we first group global data by their correlation into c sets. Then, by either including or excluding each set, we can create 2^c combinations of partial observation spaces, while several candidates turn out to yield lower-performance policies than others and so can be pruned.
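Steps (ii)-(iii) can be illustrated with a deliberately linear stand-in for the encoder f_full and the transfer f_j. The paper trains neural networks, so the following is only a sketch of the regression against the encoder's latent codes, with all names and the gradient-descent setup assumed by us.

```python
import numpy as np

def fit_observation_transfer(S, f_full, project, lr=0.1, steps=500):
    """Linear sketch of step (iii): given states s in S, targets
    z = f_full(s) from the learned encoder, and a projection Phi_j
    (`project`) that drops stale-prone global features, fit f_j so
    that f_j(Phi_j(s)) ~= z, minimizing the distance loss."""
    X = np.stack([project(s) for s in S])   # restricted observations
    Z = np.stack([f_full(s) for s in S])    # ground-truth latent codes
    W = np.zeros((X.shape[1], Z.shape[1]))
    for _ in range(steps):                  # plain gradient descent on L2 loss
        grad = X.T @ (X @ W - Z) / len(X)
        W -= lr * grad
    return (lambda x: x @ W), W
```

The key property mirrored here is that every f_j regresses onto the same latent space Z produced by f_full, which is what keeps the common decider g reusable across all of the restricted observation spaces.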
(iv) To obtain the common decider g, we also train the policy π = g ∘ f_full for the fixed f_full. Finally, given a set of m observation transfers {f_1, . . . , f_{m−1}, f_full}, we combine each with the common decider g, thereby achieving a set of m policies Π = {g ∘ f_1, . . . , g ∘ f_{m−1}, g ∘ f_full} (28). Algorithm 2 shows the aforementioned steps (i)-(iv) for generating a set of control policies. The SLTrain(·) function is the implementation of conventional supervised learning algorithms, given a target model and labeled samples. Similarly, the RLTrain(·) function is the implementation of conventional RL algorithms for the normal environment, given a target policy. Note that we obtain the normal environment from our target environment by statically configuring no limitation on global data updates. This normal environment setting is only used for training each policy by Algorithm 2. All the tests are done under the target environment with various stale observations and nondeterministic round lengths.

C. TRANSFERABILITY OF CONTROL POLICIES

Given the stale observation formulation in Eq. (2) and (3), we can naturally induce such a normal MDP environment M_nm = (S, A, R, P) from the stale environment M_st. Suppose we have optimal policies π*_st for M_st and π*_nm for M_nm. Then, it is possible to obtain identical state distributions which are induced by those policies [15].
This leads to the same identical property for the reward distributions induced by the policies. That is, if there exist only morphological differences between the two environments, it is possible to transfer knowledge between the environments [15]. In our case, the normal M_nm and stale M_st environments have only morphological differences upon O and Ω, because they are the same except for whether or not data updates are limited. In line with Eq. (29), our goal with respect to training an orchestrator can then be formulated as minimizing the statistical distance between M_nm and M_st,

min_{µ} || R_{π*_nm} − R_{(µ,Π)} ||.   (30)
Here, R denotes a reward distribution, ||·|| is a norm between two probability measures (e.g., the L_1 norm), and J is defined in Eq. (22). In the following, we show that a set of control policies obtained by Algorithm 2 and a well-trained orchestrator together minimize the statistical distance between the normal and target environments, as represented by the hierarchical learning objective in Eq. (30).
Suppose that we have an orchestrator µ and a set of control policies Π in Eq. (28) trained according to the loss in Eq. (27). Further suppose that the target environment M_st is Lipschitz [16], [17] with respect to actions. That is, for actions a, a′ ∈ A, there exist K_1, K_2 ≥ 0 such that

|| P(·|s, a) − P(·|s, a′) || ≤ K_1 || a − a′ ||,  |R(s, a) − R(s, a′)| ≤ K_2 || a − a′ ||.   (31)

Then, M_nm is also Lipschitz, since M_nm is naturally induced by M_st; recall that M_nm shares S, A, P, and R with M_st. Given the hypothesis in Eq. (31), here, we show that the orchestrator µ with the policy set Π satisfies Eq. (30) with a bounded error. Specifically, we show that there is a fixed number K > 0 such that

|| R_{π*_nm} − R_{(µ,Π)} ||_1 ≤ K max_j ε_j   (32)

where ||·||_1 is the L_1 norm. First, we show that the error bound of each control policy π_j ∈ Π is

|| R_{π*_nm} − R_{π_j} ||_1 ≤ K ε_j.   (33)

For each observation transfer function f_j and z = f_full(s) ∈ Z, consider the estimation error ε_j > 0 (i.e., the loss in Eq. (27)) such that

|| z − ẑ^{(j)} || ≤ ε_j   (34)

holds, where ẑ^{(j)} is the estimated feature by f_j, i.e., ẑ^{(j)} = f_j(Φ_j(ω)) ∈ Z, and z is the ground truth, f_full(s). We then obtain a fixed number K_3 > 0 such that

|| g(z) − g(ẑ^{(j)}) || ≤ K_3 || z − ẑ^{(j)} || ≤ K_3 ε_j   (35)

holds, since the common decider function g is also a Lipschitz function as a feed-forward neural network [18]; note that we implemented g using a feed-forward neural network. Using Eq. (31) and (35), we then obtain that the rewards achieved by g ∘ f_j are bounded as

|R(s_t, g(z)) − R(s_t, g(ẑ^{(j)}))| ≤ K_2 || g(z) − g(ẑ^{(j)}) || ≤ K_2 K_3 ε_j.   (36)

By integrating both sides of the second inequality in Eq. (36), we obtain the below,

∫ |R(s_t, g(z)) − R(s_t, g(ẑ^{(j)}))| dλ ≤ ∫ K_2 K_3 ε_j dλ   (37)

where λ = |P_{π*_nm} − P_{π_j}| is a measure defined on the set of events in S. Let K = K_2 K_3. Since K is a constant, we obtain that

|| R_{π*_nm} − R_{π_j} ||_1 ≤ K ε_j   (38)

satisfies the error bound of Eq. (33). Second, we show that the bound of Eq. (38) (or Eq. (33)) obtained above leads to Eq. (32) when combined with the well-trained orchestrator µ upon all π_j ∈ Π. For the well-trained orchestrator µ that maximizes the objective in Eq. (22), it is obvious that

J_{(µ,Π)} ≥ max_j J_{π_j}.   (39)

By exploiting the relation based on Eq. (30), we have

|| R_{π*_nm} − R_{(µ,Π)} ||_1 ≤ max_j || R_{π*_nm} − R_{π_j} ||_1   (41)

from Eq. (39). Using Eq. (41) and Eq. (38), we finally obtain the bound in Eq. (32).

IV. MODEL-BASED ORCHESTRATOR
In this section, we describe our data-efficient training method for a high-level orchestrator.

A. DIFFICULTY IN TRAINING ORCHESTRATOR
In rocorl, an orchestrator makes decisions less frequently than control policies, which makes it hard to collect sufficient samples for training. Figure 5(a) depicts sample throughput (on the Y-axis) with respect to various time periods in timesteps for a single round (on the X-axis), when the orchestrator is trained with the Airsim simulator [8]. As the average time period increases, the sample throughput decreases; e.g., a 10-fold increase in the period results in an 83.5% reduction in throughput, increasing the training hours 6 times. This is because the longer each round period is, the fewer sampling points exist within a given training time. Model-based RL methods are effective for solving real-world decision problems because of their data-efficient learning capability [19]. However, in applying model-based RL methods to orchestrator training, we need to consider the uncertainty of a learned model-environment. In general, a learned model-environment often becomes inaccurate due to the high complexity of its target environment or overfitting, and this raises the problem of policies not being optimized by interaction with the model-environment [20], [21].
In our case, the orchestrator performs actions at variable time periods that are randomly determined based on network conditions. While each orchestrator action takes place, a control policy performs a relatively large number of actions, which can have a significant impact on environment dynamics. This hierarchical structure increases the uncertainty in predicting the next state for the orchestrator, hence reducing the accuracy of a learned model-environment. Figure 5(b) indicates this limitation: the learning loss (on the Y-axis) increases with longer periods of each round (on the X-axis), where the learning loss represents the model-environment accuracy measured by L_2 one-step predictions. Several methods are used to learn a model-environment, e.g., splitting off a validation dataset for early stopping, input/output data normalization, batch normalization, an L_2 regularizer [22], and model-ensemble techniques [21].

B. AGGREGATED POLICY LEARNING
To mitigate the issue of low accuracy of a learned model-environment, we propose the aggregated policy learning scheme, by which the model parameters of the orchestrator are updated through interacting with not only a model-environment but also a target environment in parallel. In general, a model-based learning process consists of transition sampling, model-environment learning, and policy improvement. First, a fixed policy is exploited to collect experienced transitions D = {(s_t, π_t, R_µ(s_t, π_t), s_{t+k+1}) | I : [t, t + k]} through interacting with its target environments for update periods I. Then, a model-environment is learned to approximate the transition probability P_µ(s, π, s′) and reward function R_µ(s, π) via a dynamics function P_ψ and a reward function R_φ. We see the transition probability as a distribution of the random variable s′, given s and π. Notice that learning P_ψ and R_φ is conducted in a supervised manner with the experienced transitions D. The objective functions are defined to minimize the L_2 one-step prediction losses:

min_ψ Σ_{D} || P_ψ(s, π) − s′ ||_2^2  and  min_φ Σ_{D} || R_φ(s, π) − R_µ(s, π) ||_2^2,

respectively. Then, a learned model-environment with P_ψ and R_φ can be used for policy improvement. Figure 6 depicts the aggregated policy learning scheme, in which model-free learning with a target environment is performed in parallel with model-based learning. In model-free learning, accurate trajectories from target environments are used for policy improvement, and thus the sample throughput is low. On the other hand, in model-based learning, more trajectories, including inaccurate ones from model-environments, are used, and thus the sample throughput is high.
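The supervised model-environment fitting can be illustrated with a linear least-squares stand-in for P_ψ and R_φ. The paper trains neural networks for both, so this is only a sketch of the L_2 one-step objectives; the tuple layout and names are our assumptions.

```python
import numpy as np

def learn_model_env(transitions):
    """Fit a linear model-environment from round transitions
    (s, pi_feature, r, s_next), minimizing the L2 one-step
    prediction losses for the dynamics (P_psi) and reward (R_phi).
    A hypothetical linear stand-in for the paper's networks."""
    X = np.stack([np.concatenate([s, p]) for s, p, _, _ in transitions])
    S_next = np.stack([sn for _, _, _, sn in transitions])
    R = np.array([r for _, _, r, _ in transitions])
    # closed-form least-squares solutions of the two L2 objectives
    W_dyn, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    w_rew, *_ = np.linalg.lstsq(X, R, rcond=None)

    def predict(s, p):
        x = np.concatenate([s, p])
        return x @ W_dyn, float(x @ w_rew)

    return predict
```

Once fitted, `predict` plays the role of the learned model-environment: the orchestrator can roll out against it at high sample throughput, at the cost of the prediction error discussed above.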
As such, to alleviate the low accuracy issue in policy improvement, the two policies updated by model-free and model-based learning are aggregated. Specifically, for policy aggregation, we use the weighted model averaging algorithm from federated learning techniques [23], i.e., θ ← αθ_f + (1 − α)θ_m, where θ_f and θ_m represent the respective model parameters of the policies learned by model-free and model-based learning, and α ∈ (0, 1) denotes an aggregation weight.
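The aggregation step amounts to a parameter-wise weighted average of the two policies' parameters; a minimal sketch (function and parameter names are illustrative):

```python
import numpy as np

def aggregate_policies(theta_f, theta_m, alpha=0.5):
    """Weighted model averaging in the style of federated learning:
    theta <- alpha * theta_f + (1 - alpha) * theta_m, applied per
    parameter tensor. theta_f / theta_m are the parameter dicts from
    model-free and model-based learning, respectively."""
    assert 0.0 < alpha < 1.0
    return {name: alpha * theta_f[name] + (1.0 - alpha) * theta_m[name]
            for name in theta_f}
```

A larger α weights the aggregate toward the slower but more accurate model-free update; a smaller α weights it toward the high-throughput model-based update.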

V. EVALUATION
In this section, we describe the implementation of rocorl and evaluate it under various simulation conditions. We also present our case study of automated drone navigation in the Airsim simulator, where a flying RL agent is presumed to run on a drone and to rely on observations with near-edge global and near-device local sensor data. Our model implementation is based on Python v3.7 and Tensorflow v1.14, and the neural networks for policies are trained on a system with an Intel(R) Core(TM) i7-7700 processor and an NVIDIA RTX 2080 GPU. The orchestrator and control policies share common hyperparameter settings for their neural network structure, except that each control policy has its own observation space, as explained in Section III-B. For training policies, we use the PPO and SAC algorithms [24], [25] and the Adam optimizer [26]. The hyperparameter settings are summarized in Table 1. For comparison purposes, we also implement and test several algorithms in addition to the rocorl model.

Table 1. Hyperparameter settings.
• REC_FU and REC_PA are recurrent models with different observation spaces: the former manages a full observation space including stale data, while the latter manages a partial observation space with stale data removed.
• DVRL is a state-of-the-art POMDP method [27]. Unlike recurrent models, which have no direct influence on belief states, it exploits the evidence lower bound (ELBO) loss to directly affect belief state inference during learning.
In evaluating models upon stale observations, we concentrate on model robustness; accordingly, we measure the ratio of model performance to an ideal reference. That is, for a model F, the performance degradation ratio (Perf. ratio) is calculated as Perf. ratio = J_F(M_st) / J_{π*_nm}(M_nm), where M_st and M_nm are the stale and normal environments defined in Section III-C and J is defined in Eq. (6). In our tests, the reference performance is empirically calculated based on the accumulated reward obtained in normal environments through the properly learned policy π*_nm. We conduct 10K trials for each test case.
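The empirical performance ratio can be computed directly from per-trial returns; a minimal sketch (the function name is illustrative):

```python
def performance_ratio(returns_stale, returns_normal_ref):
    """Performance degradation ratio: the mean return of a model F
    evaluated in the stale environment M_st, divided by the reference
    mean return of the well-learned policy pi*_nm in the normal
    environment M_nm (an empirical stand-in for the paper's ratio)."""
    j_stale = sum(returns_stale) / len(returns_stale)
    j_ref = sum(returns_normal_ref) / len(returns_normal_ref)
    return j_stale / j_ref
```

A ratio near 1.0 means the model is robust to staleness; lower values indicate degradation relative to the ideal reference.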

A. SIMULATION TESTS
For various simulation tests, we implement a moving-object environment using pyBox2D [28]. Similar to self-driving scenarios, an agent is set to avoid moving objects and reach the goal location. The agent receives lidar-like yet simple sensor data as local data, while it receives dynamic map information, including the positions and speeds of long-range obstacles, as global data. Regarding actions and rewards, the agent manipulates its steering and velocity, acquiring a binary reward: 1 if the agent achieves its goal, 0 otherwise. The simulation environment settings are summarized in Table 2. (Table 2. Simulation environment settings: in the type column, e and d denote the global and local data types respectively, while c denotes non-temporal data that can be commonly accessed from both the device node and the edge service with no stale observation.) Figure 7 shows that rocorl outperforms the other models with respect to various settings on data update periods. rocorl maintains stable performance regardless of the (a) mean and (b) deviation values of data update periods in timesteps, in contrast to the other models. The larger the mean or deviation, the more degraded the performance of REC_FU. On the other hand, REC_PA is not significantly affected by the update periods, since stale data are entirely removed from its observations, but it suffers from relatively low performance. In Figure 8, we also evaluate the models with respect to various dynamics levels of target environments. We observe that high dynamics settings with rapidly moving obstacles increase the difficulty of goal tasks and thus affect the performance of all the compared models. Yet, the performance degradation of rocorl is relatively insignificant, indicating its robustness. In a highly dynamic environment, it is important to make good use of available observations, including stale global data, while doing so might have the adverse effect of increasing the likelihood of misusing them.
rocorl selectively uses proper control policies over time to effectively reduce the likelihood of such misuse through the orchestrator.
In Figure 9(a), we evaluate the models with respect to various degrees of staleness. Here, the number of partitions (on the X-axis) represents the number of sensor data sets with different update times; e.g., two individual global data sets updated at different times make three partitions, including one for device data. Accordingly, more partitions normally increase the variety of stale observations, tending to make it hard to learn stable rules upon stale observations. While the default setting of our target environments has two partitions, one from the device and the other from the edge service, we also consider a situation in which global data from multiple edge services (or nodes) are asynchronously updated to the device. rocorl shows more robust performance than the others against various degrees of staleness, e.g., 17∼25% higher than REC_FU.
While the performance of REC_FU degrades with more partitions, that of REC_PA and rocorl does not. Moreover, rocorl outperforms REC_PA by 6∼9%. As the number of partitions increases, the variety of stale observations increases, which in turn renders a more difficult environment and more unstable observation probability for REC_FU to learn the optimal policy.
In Figure 9(b), we evaluate the relationship between the number of policies with different observation spaces and the number of partitions in rocorl. The large policy set (i.e., size = 8) shows slightly more robust performance than the small one (size = 2). Note that the large set is configured to contain all the policies of the small set in this test for direct comparison. The result clearly shows the benefit of rocorl, which makes use of the diverse capability of transferable control policies under variable staleness. This is consistent with Eq. (32).

B. CASE STUDY
In the following, we describe our case study of autonomous quad-copter control with the Airsim simulator [8]. Figure 10 illustrates the system implementation, where rocorl works as the flying agent for a self-operating drone. We set device-attached sensor measurements, such as the device position, velocity, acceleration, and orientation, as well as lidar measurements, as near-device local data. We set long-range dynamic map information, such as the trajectories of moving objects, as near-edge global data. This configuration is based on one of the edge computing scenarios [29], [30]. Furthermore, the dynamic map information is updated to the device, similarly to the learning map scenario [31], where update periods are randomly given.
Regarding actions and rewards, we implement an agent that continuously manipulates the 3-D acceleration of a drone and acquires rewards according to the distance from the goal, i.e., R(s_t, a_t) = ||go − po_t||_2 − ||go − po_{t+1}||_2, where po_t is the drone position at time t and go is the goal location. The case study environment settings are summarized in Table 3, while other unspecified settings are the same as those for the previous simulation experiments. (Table 3. Case study environment settings: the type specifies different data types in the same way as in Table 2.) Here we evaluate model performance with respect to various settings on update periods. We set random periods
with various mean and deviation values (on the X-axis) for successive near-edge global data updates from the edge service to the drone. In Figure 11, we assume that those values are independent of other environment conditions, while in Figure 12, we configure a practical correlation pattern between update periods and environment density: timely updates to the drone can be more restricted due to possible interference in a harsh area where many obstacles exist. As shown, rocorl outperforms the other models, e.g., with a 60∼71% enhancement over the recurrent REC_FU and REC_PA models as well as a 2∼11% enhancement over the state-of-the-art DVRL for all test cases. In Figure 13, we evaluate model performance with respect to dynamic conditions of a flying environment, where the dynamics level is configured by the speed distributions of moving obstacles. rocorl shows highly stable performance for all cases, while the others show degraded performance under severe changes; e.g., DVRL degrades by 15% with an 8-fold increase of the maximum obstacle speed. In Figure 14, we evaluate the data-efficiency of rocorl. The aggregated policy learning of rocorl shows a learning curve with rapid increases in acquired discounted rewards over timesteps, compared with the case where only model-free learning is used, indicating 2∼3 times improved training capability in terms of data-efficiency.
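The variable update periods described above can be mimicked by a small staleness simulator; this sketch assumes a lognormal parameterization chosen so that the distribution mean matches a given update period (the parameterization and all names are illustrative, not the paper's exact setup):

```python
import numpy as np

def stale_observation_stream(true_states, mean_period, sigma=2.5, seed=0):
    """Simulate global-data staleness: an update arrives only when a
    randomly drawn period (in discrete timesteps) elapses; in between,
    the agent observes the last-updated value. Periods are drawn from
    a lognormal distribution whose mean is `mean_period` and whose
    standard deviation is `sigma`."""
    rng = np.random.default_rng(seed)
    # Solve for (mu, s) so the lognormal has the requested mean/std.
    s = np.sqrt(np.log(1.0 + (sigma / mean_period) ** 2))
    mu = np.log(mean_period) - 0.5 * s ** 2
    observed, last, next_update = [], true_states[0], 0
    for t, state in enumerate(true_states):
        if t >= next_update:                       # fresh update arrives
            last = state
            next_update = t + max(1, int(rng.lognormal(mu, s)))
        observed.append(last)                      # otherwise observe stale data
    return observed
```

Sweeping `mean_period` and `sigma` reproduces the kind of X-axis settings used in the update-period experiments.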
In Figure 15(a), we compare the discounted reward patterns achieved over time by different models. Here, in addition to DVRL and rocorl, we include the well-learned reference model running in the normal environment. Except for the normal model, these models run in the target environment. As noticed, the normal model and rocorl show more similar patterns than the others. This implies that we have transferable control policies in rocorl, which can minimize the state and reward distribution differences between the normal and target environments, thus allowing Eq. (30) to be established. Furthermore, in Figure 15(b), we compare the obstacle distance patterns obtained by different models. Given the probability distribution of the moving-obstacle distance under the normal model, we represent its difference from that of another model (i.e., P_{π*_nm} − P_µ in Eq. (30)). Similar to the reward patterns, we notice that the difference between the normal model and rocorl is much smaller than that between the normal model and DVRL or POL_FU. This indicates that rocorl running in the target environment learns how to control upon stale observations as if it were running in the normal environment. This policy transferability is enabled by hierarchical learning with the autoencoder-based observation transfer scheme in rocorl.

VI. RELATED WORKS
The recent advance of reinforcement learning (RL) has led to its broad adoption in the area of autonomous cyber-physical systems that operate by interacting with surrounding environments, e.g., autonomous driving [1], [2], robots [3], [32], [33], drones [4], [34], mobile edge computing [30], and others. Such an RL-based system senses environment states, aiming to make timely observations and take proper reactions. Thus, the system performance normally depends, to some extent, on the quality of observations.
For a linear dynamic system with Gaussian noise observations, the Kalman filter has been adopted to estimate the ground truth of underlying system states [35], [36]. While Gaussian noise observations normally form a POMDP (partially observable MDP) problem, stale observation problems cannot necessarily be formulated with Gaussian noise observations. In the RL research community, many studies have addressed the issues of partial or noisy observations formulated as a POMDP. In [10], Wierstra et al. demonstrated that recurrent networks with policy gradient algorithms can be effective for POMDPs. Similarly, in [9], Hausknecht et al. adopted a recurrent neural network with the Q-learning algorithm (DQN). Recently, a multi-agent RL system with noisy private observations and selective communication between agents was investigated [37]. Moreover, DVRL [27] and PILCO [20], [38] were proposed. DVRL exploits the evidence lower bound (ELBO) loss, leading to direct estimation of belief states for recurrent RL policies, while PILCO employs dynamics models learned upon belief states rather than observations. These previous works commonly concentrated on partial observations, yet none of them addressed the stale observation problem.
Since POMDP methods were generally designed under the assumption of a fixed correlation between ground truth states and observations (i.e., O(s_t) = Pr(ω_t | s_t)), it is difficult to apply them to an environment with stale observations, in which accurate gradients of neural networks cannot be obtained. To the best of our knowledge, our work is the first to formulate the problem of stale observations in the RL context and to propose a solution employing transfer learning via a hierarchical model.
The option framework was first proposed in [13] and has been much studied recently, e.g., end-to-end option learning with deep RL [39] and transfer learning with options [40]. Our work is in the same vein as these option-based works in handling different levels of temporal abstraction and solving a complex decision-making problem. However, none of the previous works considered limited data updates or stale observations.
Meanwhile, Gupta et al. [15] presented a knowledge transfer theory for RL agents in the form of statistical distance. They focused on finding out good feature extraction functions that can reduce the morphological difference between two different environments. In our problem, it is difficult to directly extract features from stale environments. Thus, we exploit observation transfer functions in an optionlike hierarchical learning structure, rather than extracting features directly from stale environments.
There has been a body of research on model-environment representation in the field of robotics [32], [33]. Recently, several studies have introduced deep neural networks into model-based RL to handle complex environments [21], [41], [42]. In [21], a model-ensemble technique based on trust-region policy optimization was introduced to tackle the shortcomings of backpropagation through time. In [41], asynchronous process structures were investigated to alleviate the model bias problem. Building on the techniques of [21] and [41], we employ an integrated learning approach that combines both model-free and model-based learning to solve the model bias issue caused by hierarchical RL structures.
Our system design is based on edge-device data synchronization and it shares a similar structure with decentralized POMDPs [37], [43], [44]. In [37], MADDPG-M handles a specific constrained case where agents operate under partial observations that are weakly correlated to true states. While MADDPG-M focuses on agents' decisions about what to share with others, our work seeks to intelligently make use of imperfect observations, under a simple but nonrestrictive assumption that data are shared randomly in time.

VII. CONCLUSION
In this paper, we presented rocorl, a hierarchical RL-based control model that can effectively deal with temporally outdated observations incurred by intermittent sensor data updates in a cyber-physical environment. For training the model, we employ the autoencoder-based observation transfer and aggregated policy learning schemes. Our approach is based on a set of policy variants with different observation transfers, by which the learned knowledge is transferable to environments with stale observations. Through experiments with the Airsim simulator, we show that rocorl is robust against various restrictive conditions of sensor data updates, compared with several other models including a state-of-the-art POMDP method.
Our future work is to adapt the rocorl model to a large-scale edge computing environment where many edge computing nodes communicate and interact.