Using Deep Reinforcement Learning to Improve Sensor Selection in the Internet of Things

We study the problem of handling the trade-off between timeliness and criticality when gathering data from multiple resources in complex environments. In IoT environments, where several sensors transmit data packets of varying criticality and timeliness, the rate of data collection can be limited due to associated costs (e.g., bandwidth limitations and energy considerations). Moreover, the complexity of data generation in the environment can impose additional challenges in balancing criticality and timeliness when gathering data. For instance, when the data packets of two or more sensors are correlated (in either criticality or timeliness), or when there exists a temporal dependency among sensors, incorporating such patterns poses challenges for trivial data-gathering policies. Motivated by the success of the Asynchronous Advantage Actor-Critic (A3C) approach, we first map vanilla A3C onto our problem and compare its performance, in terms of criticality-weighted deadline miss ratio, to the considered baselines in multiple scenarios. We observe degradation of A3C's performance in complex scenarios. Therefore, we modify the A3C network by embedding a long short-term memory (LSTM) layer to improve performance in cases where vanilla A3C cannot capture repeating patterns in data streams. Simulation results show that the modified A3C reduces the criticality-weighted deadline miss ratio from 0.3 to 0.19.


I. INTRODUCTION
Connected devices are part of IoT environments in which every device talks to other related devices (by gathering and transmitting data) to communicate important/critical sensor data in a timely fashion to interested parties for further use. Given the enormous amounts of data generated, selecting only the important/critical/relevant data for timely use remains an open issue. Moreover, the complexity of data generation in the environment can impose additional challenges in balancing criticality and timeliness when gathering data. For instance, when the data or the arrival of data for two or more sensors are correlated, or when there exists a temporal dependency among sensors, incorporating such patterns poses challenges for trivial data-gathering policies. Examples of environments with such attributes include a network of sensors for capturing marine temperature, where there is temporal dependence among data [10], or intelligent buildings, where deployed sensors generate data that is temporally and spatially dependent [3].

The associate editor coordinating the review of this manuscript and approving it for publication was Md. Abdur Razzaque.
Capturing patterns such as temporal dependency or correlation in the data arrival of sensors could help collect data more efficiently with respect to the desired goal. For example, consider a network of four sensors (S_A, S_B, S_C, and S_D) where, at each time step, we can collect data from only one sensor. The goal is to collect data packets in such a way that we accumulate the highest criticality values over a time horizon, where each data packet is characterized by a pair of criticality and deadline values (e.g., the data of sensor S_i is referred to as (Cr, d)). As an example (shown in Table 1 and Table 2), imagine that there is a temporal dependency between sensors S_A and S_B due to their close location. If this dependency is not captured (as detailed in Table 1), then at time step t we have data from sensors S_A and S_C, where the data at S_C has a higher criticality value and is deemed the better choice for transmission despite having a larger deadline

TABLE 1. Example of sensor selection in a 4-step time window, when the dependency cannot be captured. In this case, there is a temporal dependency for the arrival of data packets. The available data packet at sensor i is shown as S_i = (cr, d). As the table shows, the total accumulated criticality at the end of time step t = 4 is 8.

TABLE 2. Example of sensor selection in a 4-step time window, when the dependency can be captured. In this case, there is a temporal dependency for the arrival of data packets. The available data packet at sensor i is shown as S_i = (cr, d). As the table shows, the total accumulated criticality at the end of time step t = 4 is 11.
value compared to S_A. However, the other data packets will have expired by the time we collect data again at step t + 1, and the total criticality value collected at the end of t = 4 would be 8 (equivalent to a criticality-weighted deadline miss ratio of (13 − 8)/13 = 5/13 ≈ 0.38). Alternatively, as shown in Table 2, by knowing and capturing the temporal dependency between sensors S_A and S_B, one could transmit the data at S_A, since a data packet will arrive at S_B soon after S_A. Thus, the total criticality of the transmitted data packets from S_A and S_B can be as large as the criticality of the data at S_C (had it been selected instead), and we may still be able to collect the data of S_C if it has not yet expired. In this case, the total criticality value collected at the end of t = 4 would be 11 (equivalent to a criticality-weighted deadline miss ratio of (13 − 11)/13 = 2/13 ≈ 0.15).

We discuss how to prioritize and gather data from multiple devices when the data may have different timeliness and criticality requirements, especially in complex environments. We use the term ''complex'' to refer to scenarios where the sensors may be correlated in time or space; such correlations make sensor polling decisions harder. Given the challenges in capturing dependencies among sensors, and motivated by the success of Deep Reinforcement Learning (DRL) in policy estimation, we explore leveraging DRL techniques to tackle our problem. DRL techniques have been shown to outperform alternative methods (e.g., traditional Q-learning) in handling large state spaces, which is a requirement for deploying large IoT networks. DRL techniques can also handle both continuous and discrete state spaces. We propose an approach based on the Asynchronous Advantage Actor-Critic (A3C) [13], improving the network structure so that it captures the likely case of temporal dependency and correlation within data streams.
We achieve this improvement by adding an LSTM layer that considers several previous states (rather than only one) to learn recurring patterns within the data. Access to previous states amounts to embedding memory into the model. Beyond the addition of memory, A3C itself offers benefits: it outperforms some other RL alternatives (e.g., Q-learning methods based on a Q-table, and DQN) in terms of resource requirements and time performance [13]. We provide a formal presentation of the problem as well as a simulation-based evaluation of scheduling policies for this problem.

A. CONTEXT
The Internet of Things (IoT) refers to the vast number of things (i.e., electronics-infused devices) connected to the internet that act as sensors in their hosting environments, generating massive volumes of data. In such settings, everything from data acquisition to processing and analysis can leverage Machine Learning (ML) techniques to preserve efficiency and performance. As all these steps (i.e., data acquisition, processing, and analysis) mostly involve decision making of some sort, ML techniques help in making informed decisions by capturing patterns in data and sensor behaviour. In a sense, the integration of ML into the IoT world transforms simple sensor-actuator devices into ambient intelligent devices.
In various application settings, such as factory automation and hospitals [15], the notions of message criticality and timeliness co-exist. Imagine the case of factory automation, where sensory messages that require immediate action compete with messages that are critical yet less urgent (i.e., requiring no instant action). In such cases, there needs to be a decision-maker that acts sufficiently well with respect to the desired objective function.
While the ultimate goal is to deploy applications that can identify changes such as the one described above in the environment and adapt to them appropriately, sometimes such changes occur in more challenging ways:
• There could be cases where a correlation exists between the arrival of events for two or more sensors. For instance, in a factory, temperature and pressure sensors may generate data at about the same time, unlike optical sensors.
• Furthermore, there could be cases where temporal dependence exists among sensors due to the physical arrangement of devices. For instance, imagine two optical sensors (sensor A and sensor B) installed in separate locations in a factory. Sensors A and B may observe the same event, but at different points in time.
• It is more challenging to capture the correlation among sensor data. In the context of factory automation, there could be cases where the criticality/timeliness of messages for some sensors correlates with that of other sensors. This situation may arise, for example, from the related functionalities of the operating devices. For instance, two temperature sensors installed in the same factory warehouse sense correlated temperature values, whereas their data may not correlate with the temperature data of sensors in other warehouses.
From a system architecture perspective, we consider a network of devices (i.e., sensors) that sense events in the environment. A central unit manages the transmission of data from sensors by selecting one sensor at a time; sensors communicate their sensed data to the central unit upon selection. The system that we study deals with the challenge of possessing minimal resources (e.g., in terms of energy and memory), a likely setting for many networks ([4], [5], [23]), which emphasizes the importance of an effective decision-making policy. For efficiency of energy and memory consumption, sensors hold a limited number of messages. Such an architecture can be extended to create clustered networks with multiple hierarchy levels. For instance, in a larger network, each central unit can operate as a relay device, as in the model described by Rashtian and Gopalakrishnan [15]. In the scope of our work, we consider a single central unit responsible for deciding during each decision-making epoch.

B. CONTRIBUTIONS
We propose and evaluate a scheduling mechanism for the described architecture, where the environment adds complexity by inducing correlations among messages, and arrival dependencies among the messages that need to be transmitted. We start by exploring the applicability of the A3C method [13], a successful Deep Reinforcement Learning method, and then propose our approach by improving upon it. The contributions of this work are fourfold:
• Mapping the A3C approach to our problem and establishing a baseline method that shows consistent performance in less complex scenarios.
• Showing that the A3C performs no better than the greedy baselines in complex scenarios.
• Embedding memory into the network of the vanilla A3C to improve performance in complex scenarios with strong dependencies.
• Showing that, based on the simulations, the modification to A3C did not negatively affect performance in the other scenarios, where the vanilla A3C had already performed well. This observation confirms the advantage of the proposed solution over the vanilla A3C in all studied scenarios.

II. SYSTEM MODEL
We consider a system with a centralized architecture, as shown in Figure 1. The IoT devices (i.e., sensors) communicate with a central unit. Sensors sense environmental events and capture them as messages. The central unit transmits one message at a time from one of the sensors. We consider messages of equal length arriving at sensors. Due to energy consumption and memory costs, we assume each sensor holds one data message; sensors have enough resources to maintain only one message at a time. We use the following notation to describe a message and its characteristics: message M_i has an applicable deadline d_i and criticality κ_i. If a message arrives at time t, then the absolute deadline for delivery of this message is t + d_i. Until the message reaches its expiration time, it provides its highest value if selected. If the message is selected after its expiry, it provides a discounted value (as discussed in Section III-C). If a new message arrives at a sensor that already holds a message, the older message is replaced by the new one.
At each scheduling epoch, the central unit selects one sensor to use the available bandwidth to transmit its associated message. Also, no assumptions are being made about the event arrival rate.

A. PERFORMANCE METRIC
In the model that we have described, the system goal is to minimize the criticality-weighted deadline miss ratio, defined as follows. Let N be the total number of messages generated during the time interval of interest that also have deadlines within that interval. Let x_i be an indicator variable for whether message M_i missed its deadline. The criticality-weighted deadline miss ratio is

(Σ_{i=1}^{N} κ_i x_i) / (Σ_{i=1}^{N} κ_i)

With this performance metric, we can define the problem that we want to tackle.
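As a sketch, the metric can be computed directly from (criticality, missed-indicator) pairs; the Table 1 example from the introduction (13 units of total criticality, 5 of which expire, here aggregated into two entries) recovers the ratio of 5/13:

```python
def criticality_weighted_miss_ratio(messages):
    """messages: list of (criticality, missed) pairs, where `missed`
    is the indicator x_i (True if the message's deadline was missed)."""
    total = sum(k for k, _ in messages)
    missed = sum(k for k, x in messages if x)
    return missed / total

# Table 1 example: 8 units of criticality delivered on time, 5 missed.
assert abs(criticality_weighted_miss_ratio([(8, False), (5, True)]) - 5 / 13) < 1e-9
```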

B. PROBLEM STATEMENT
The problem that we want to tackle is to determine an efficient bandwidth allocation policy at the central unit that minimizes the criticality-weighted deadline miss ratio over a finite time interval. Optimal policies may be intractable: we do not know what messages will arrive until they arrive, and decisions regarding the selection of messages need to be made at each scheduling epoch.

III. BACKGROUND
We have chosen to take advantage of Reinforcement Learning (RL) to tackle our problem. We provide a brief background on RL techniques in general, A3C, and later discuss our proposed method.

A. OVERVIEW OF DEEP REINFORCEMENT LEARNING
In this section and before discussing the A3C network, we first introduce the basic concepts of RL and Deep Learning (DL), based on which Deep Reinforcement Learning (DRL) is defined.

1) REINFORCEMENT LEARNING (RL)
RL is a class of algorithms in machine learning that can achieve optimal control of a Markov Decision Process (MDP) [9], [19], [24]. There are generally two entities in RL: an agent and an environment. The environment evolves stochastically within a state space over time. The agent operates as the action executor and interacts with the environment. When it acts in a particular state, the environment generates a reward/penalty indicating how good the action taken by the agent was. The agent learns from this response over time. The policy determines the strategy by which an agent takes an action in a state. The agent's task is to learn from the actions it has already taken such that, in the future, it takes actions that are optimal with respect to the value function V^π(s_0). We define the value function as the expected reward from actions taken by a policy π over a finite time horizon:

V^π(s_0) = E[R_total | τ_{s_0}, π]

where τ_{s_0} denotes a chain of states visited by adopting policy π starting from s_0, and R_total is the reward accumulated while traversing this sequence of states. Apart from the value function, another important function is the Q function Q^π(s_0, a_0), which is the expected reward for taking action a_0 in state s_0 and thereafter following policy π. When π is the optimal policy π*, the value function and Q function are denoted by V*(s) and Q*(s, a), respectively.
When Q*(s, a) for all actions a ∈ A is given, the optimal policy can easily be found by π*(s) = arg max_a Q*(s, a). To learn the value functions or Q functions, the Bellman optimality equations usually help. Taking a discounted MDP with discount factor γ as an example, the Bellman optimality equations for the value function and Q function are, respectively,

V*(s) = max_a Σ_{s'} P(s' | s, a) [r(s, a) + γ V*(s')]

Q*(s, a) = Σ_{s'} P(s' | s, a) [r(s, a) + γ max_{a'} Q*(s', a')]
Bellman equations represent the relation between the value/Q functions of the current state and the next state.
Usually, a large amount of memory is required to store the value functions and Q functions. When only small, finite state sets are involved, it is possible to store them as tables or arrays; this is called the tabular method. However, in most real-world problems the state sets are large, sometimes infinite, which makes it impossible to store the value functions or Q functions as tables. In such cases, trial-and-error interaction with the environment is hard to use for learning the environment dynamics due to formidable computational complexity and storage requirements; even if the dynamics can be learned, doing so consumes massive computing resources. Instead, one can approximate some functions of RL, such as Q functions or policy functions, with a smaller set of parameters through the application of DL. The combination of RL and DL results in the more powerful DRL.
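As an illustration of the tabular method, the following sketch runs value iteration on a tiny hypothetical two-state, two-action MDP (the transition and reward numbers are made up for illustration) until the Bellman optimality equation V(s) = max_a E[r + γ V(s')] holds:

```python
import numpy as np

# Toy MDP: P[s, a, s'] are transition probabilities, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):                  # value iteration to a fixed point
    V = (R + gamma * (P @ V)).max(axis=1)

Q = R + gamma * (P @ V)               # Bellman optimality for the Q table
policy = Q.argmax(axis=1)             # greedy policy extracted from Q
```

With only two states, the V and Q tables fit in a few floats; the point of DRL is to replace these tables with a parameterized approximator when the state set is too large.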

2) DEEP LEARNING (DL)
DL refers to a family of machine learning algorithms that leverage artificial neural networks (ANNs) to learn patterns from large amounts of data. It performs well in tasks like regression and classification. The weights and biases of every node in a neural network (NN) are the parameters of the NN. Usually, a neural network with two or more hidden layers is called a Deep Neural Network (DNN). Deep learning uses a loss function L(θ) = g(Ŷ(θ), Y), which is a function of the network output Ŷ(θ) and the desired output Y. The loss function evaluates how well the NN (given the parameters learned so far) models the given data (i.e., Y = f(X)). Depending on the task, various loss functions can be used. For instance, standard regression losses include Mean Square Error (MSE), Mean Absolute Error (MAE), and Mean Bias Error (MBE); for classification, losses such as Cross-Entropy loss and Support Vector Machine (SVM) loss perform well. Gradient descent methods are then used to update the parameters θ of the NN and consequently minimize the loss function. Such methods start from an initial point θ_0. As the input data is fed to the NN, the average loss over all input data is computed and minimized by taking a step along the descent direction, i.e.,

θ_{k+1} = θ_k − α ∇_θ L(θ_k)

where α is a hyper-parameter called the step size, which indicates how fast the parameter values move in the optimal direction. This process is repeated iteratively by feeding more data to the NN until convergence.
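As a concrete sketch of this update rule, the following fits a one-parameter model y = w·x with an MSE loss using the simple gradient step θ_{k+1} = θ_k − α ∇L(θ_k); the data and step size are illustrative:

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X                      # data generated by the "true" w = 2
w = 0.0                          # initial point theta_0
alpha = 0.05                     # step size

for _ in range(200):
    grad = np.mean(2 * (w * X - Y) * X)   # d/dw of the mean squared error
    w -= alpha * grad                     # theta <- theta - alpha * grad

print(round(w, 3))  # → 2.0
```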

3) DEEP REINFORCEMENT LEARNING (DRL)
As discussed earlier, DRL refers to a family of methods that combine RL and DL to approximate value functions, Q functions, or both via a deep NN. In general, DRL approaches can be categorized into two main groups: Value-based and Policy Gradient.
In Value-based methods for DRL the states s t ∈ S or state-action pairs (s t , a t ) ∈ S × A are inputs of NNs, while Q functions Q π (s t , a t ) or value functions V π (s t ) are approximated by parameters θ of NNs. An NN returns the approximated Q functions or value functions for the input states or state-action pairs. There can be a single output neuron or multiple output neurons. For the former case, the output can be either V π (s t ) or Q π (s t , a t ) corresponding to the input s t or (s t , a t ). For the latter case, the outputs are the Q functions for state s t combined with every action, i.e., Q π (s t , a 1 ), . . . , Q π (s t , a |A| ).
In Policy Gradient methods, NNs directly approximate a policy as a function of the state, i.e., π_θ(s). The states are used as inputs to the NNs, while the policy π is approximated by the parameters θ of the NNs as π_θ. In contrast to value-based DRL methods, policy gradient methods provide a direct mapping from state to action, which leads to better convergence properties and higher efficiency in high-dimensional or continuous action spaces [6]. Therefore, we chose to leverage policy gradient methods for our research problem. Specifically, we focused on A3C as the basis of our solution and improved upon it for our problem.

B. A3C NETWORK
A3C networks consist of multiple independent agents (i.e., neural networks) with their own weights, which interact with different copies of the environment in parallel. Therefore, they can explore a larger part of the state-action space in much less time. The agents (or workers) are trained in parallel and periodically update a globally shared neural network that holds the shared parameters. These updates do not happen simultaneously, which is where the asynchronous notion comes from. After each update, the agents reset their parameters to those of the global network and continue their independent exploration and training until they update it again.
We shall now briefly elaborate on Asynchronous Advantage Actor-Critic (A3C) as the underlying structure in our approach.
We define the value function V(s) of a stochastic policy π(s) (which returns a probability distribution over actions) as an expected discounted reward:

V(s) = E_{a∼π(s)}[r + γ V(s')]

i.e., V(s) is the weighted average of r + γ V(s') over every action that can potentially be taken in state s ∈ S.
We also define the action-value function Q(s, a) as:

Q(s, a) = r + γ V(s')

where we emphasize that the action is given and there is only one following state s'. We define the advantage function as:

A(s, a) = Q(s, a) − V(s)

A(s, a) is called the advantage function because it indicates how good it is to take action a in state s compared to the average performance. If action a is better than average, the advantage function is positive; if it is worse, the advantage is negative. Furthermore, we define ρ as the distribution of states, which indicates the probability of being in each state; ρ_{s_0} and ρ^π denote the distribution of beginning states in the environment and of states under policy π, respectively.
Since policy π is only a function of state s, we can approximate it directly. In this case, a neural network (with weights θ) takes a state s and outputs an action probability distribution π_θ. We use π and π_θ interchangeably for the policy parametrized by the network weights θ.
We also want to optimize the policy. We define a metric function J(π) as the averaged discounted reward that a policy π can accumulate over the possible beginning states s_0:

J(π) = E_{s_0 ∼ ρ_{s_0}}[V(s_0)]

Now, we use the gradient of J(π) to improve it. The gradient of J(π) is derived in the Policy Gradient Theorem ([20], [22]) and has the following form:

∇_θ J(π) = E_{s ∼ ρ^π, a ∼ π(s)}[A(s, a) · ∇_θ log π(a|s)]    (12)

where the first factor, A(s, a), gives the advantage of taking action a in state s, and the second factor, ∇_θ log π(a|s), gives a direction in which the log-probability of taking action a in state s rises. Taken together, equation (12) increases the likelihood of actions that are better than the average performance and decreases the likelihood of actions that are worse. Since it is not feasible to compute the gradient over every state and every action, we use sampling for this computation (as the mean of the samples lies near the expected value). The advantage function also needs to be computed. Expanding the definition,

A(s, a) = Q(s, a) − V(s) = r + γ V(s') − V(s)

we can see that running an episode with a policy π provides an unbiased estimate of Q(s, a). In other words, it is sufficient to know V(s) to compute A(s, a). Therefore, we can also approximate V(s) with a neural network (similar to approximating the action-value function in DQN [7]).
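A minimal sketch of the one-step advantage estimate used to weight the policy-gradient term, A(s, a) = r + γ V(s') − V(s), with illustrative values:

```python
def advantage(r, v_s, v_next, gamma=0.99):
    """One-step advantage estimate: how much better the observed
    transition was than the value baseline predicted."""
    return r + gamma * v_next - v_s

# An action leading to a better-than-expected outcome gets a positive
# advantage (its probability is pushed up)...
assert advantage(1.0, v_s=0.5, v_next=1.0) > 0
# ...and a worse-than-expected one gets a negative advantage.
assert advantage(0.0, v_s=1.0, v_next=0.5) < 0
```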
Moreover, we can combine the two neural networks estimating V(s) and π(s) to learn faster and more effectively; on the negative side, separate networks are likely to learn very similar low-level features. Besides, combining the networks acts as a regularizing element that results in better stability. In the combined case, the neural network shares all hidden layers and outputs two sets of results: π(s) and V(s). The part that optimizes the policy is the actor, and the part that estimates the value function is the critic; in effect, the critic provides the actor with feedback about its actions.
The asynchronous notion in A3C originates from the fact that running a single agent results in gathering highly correlated samples. DQN avoids this issue with a method called experience replay, which stores the samples in memory and forms a batch by retrieving them at random [1]. A3C instead handles the issue by running multiple agents simultaneously, each with its own copy of the environment, using its samples as they arrive. The advantage of this approach is twofold: first, it avoids the correlation, as each agent has its own unique experience (i.e., various states and transitions); second, it requires less memory than experience replay.
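A toy sketch of this asynchronous pattern (with a scalar stand-in for the network parameters and a dummy gradient, and a lock added for simplicity, whereas real A3C implementations often apply updates lock-free): each worker pushes its update to the shared global parameters and then re-syncs its local copy.

```python
import threading

global_theta = [0.0]          # shared "network parameters" (toy scalar)
lock = threading.Lock()

def worker(n_updates):
    local_theta = global_theta[0]           # sync from the global network
    for _ in range(n_updates):
        grad = 1.0                          # stand-in for a computed gradient
        with lock:
            global_theta[0] += 0.01 * grad  # push update to the shared network
            local_theta = global_theta[0]   # reset to the global parameters

threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(round(global_theta[0], 2))  # → 4.0  (4 workers x 100 updates x 0.01)
```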
Limitations of A3C: Given all the properties of the vanilla A3C, we examined its performance (quantified using the criticality-weighted deadline miss ratio) in the presence of temporal dependencies and correlations of sensor data values, as an example of a complex environment. Specifically, we experimented with 8 sensors in the setup explained in detail later (Section IV-B), where there is temporal dependence in the arrival of data as well as correlation among the data values of two or more sensors, as described in Scenario 4 in Section IV-A. Figure 2 shows the criticality-weighted deadline miss ratio of the policies with respect to the workload intensity, as defined in equation 16. As Figure 2 indicates, the performance of A3C is merely comparable to the naïve greedy policies. From this observation, we hypothesized that this degradation in A3C's performance relates to the lack of any memorization mechanism for capturing recurrent patterns in data streams. Therefore, we examine this hypothesis by proposing a modified version of A3C.

C. PROPOSED APPROACH
We propose an architecture for determining the policy at the central hub for deciding on data streams in complex environments. Aligned with the examples provided in Section I-A, we study the following cases:
• There exists a correlation among the arrival of data for two or more sensors in the system.
• There is temporal dependence for the arrival of data for two or more sensors in the system.
• There is temporal dependence concerning the arrival of data as well as the correlation among data values for two or more sensors.
We chose the above cases as a representative set of scenarios for a complex environment, inspired by real-world examples (e.g., in [10], [11]). We acknowledge that this set of cases can be extended to represent more complex environments. We provide more details about each case when we discuss the evaluation (Section IV).

We make no assumptions concerning the similarity of information when there is any data-arrival correlation or temporal dependency. That is, in such scenarios, if we collect data from two correlated/dependent sensors at two consecutive times (e.g., at t and t + 1), it does not mean that we have collected ''redundant'' data. We focus solely on collecting data packets in such a way that we accumulate the highest criticality values over a time horizon. In some environment settings, such data-arrival correlation/dependence could lead to redundant data; however, the proposed model does not make this assumption and consequently does not consider redundant data/messages.

As shown in Figure 3, we can abstract the network into two parts: the input and the learning model. At the input layer, the system state is an array of n = n(S) tuples (n(S) = |S|, where S is the set of sensors in the system), each tuple representing the available data packet (criticality, deadline).
Concerning the learning model, we use the Asynchronous Advantage Actor-Critic (A3C) algorithm [13], which leverages a deep neural network to learn the policy and value functions while running parallel threads to update the network parameters. Regularization with policy entropy improves exploration by limiting premature convergence to sub-optimal policies [13]. The core of our network contains an LSTM layer (output space = n(S)) followed by a fully connected network (output space = 2 × n(S)) to perform both required estimations. The LSTM layer arms the agent with some memory of previous states. We propose embedding memory into the model because an agent in a complex environment is likely to encounter recurring patterns in states, and the model can make efficient decisions only if it can distinguish such patterns. Therefore, we chose to embed memory (i.e., by adding the LSTM layer) to enrich the model with this feature. Alternatively, one could argue for embedding memory into the sensors; however, we believe such a decision would lead to a more fragile architecture and would negatively impact the scalability of IoT deployments.
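A minimal numpy forward pass sketches the shapes involved: a single LSTM cell with output space n(S) feeding a fully connected layer with output space 2 × n(S). The random weights, the input encoding, and the split of the output into a policy head and a value head are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors = 8                 # n(S): one (criticality, deadline) tuple per sensor
d_in = 2 * n_sensors          # flattened input state
d_h = n_sensors               # LSTM output space = n(S), as in the text

W = rng.normal(0, 0.1, (4 * d_h, d_in + d_h))    # gates i, f, o, g stacked
b = np.zeros(4 * d_h)
W_out = rng.normal(0, 0.1, (2 * n_sensors, d_h)) # FC output space = 2 x n(S)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """Standard LSTM cell update: gated memory over previous states."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(5):                        # a short sequence of system states
    state = rng.random(d_in)              # stand-in (criticality, deadline) array
    h, c = lstm_step(state, h, c)

out = W_out @ h                           # 2 x n(S) outputs
logits, value_head = out[:n_sensors], out[n_sensors:]
policy = np.exp(logits) / np.exp(logits).sum()   # distribution over sensors
```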
As in any reinforcement learning technique, we are concerned with how to take actions in the environment so as to maximize some notion of cumulative reward; it is therefore essential to define a concrete and reasonable optimization goal, i.e., the reward. We define the reward such that if a message is chosen after its expiry, a penalty is incurred:

r = α κ + β (d − t) − ι · 1[t > d]    (14)

where α and β are parameters that help a system architect balance criticality and timeliness. These may be adapted on a per-sensor basis, but in this discussion all local hubs use the same settings for these two parameters. t denotes the current time step, and d is the corresponding deadline of a message. ι is the parameter characterizing the penalty for a message; it may be adapted per message or per class of messages, but for simplicity of exposition we use the same value of ι for all messages. An example graph of the reward function for some choices of these parameters is shown in Figure 4. We define the loss function as:

L = L_π + c_v L_v + c_reg L_reg    (15)

where L_π is the policy loss, L_v is the value error, and L_reg is a regularization term. The constants c_v and c_reg control which part we want to emphasize.
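The following hypothetical sketch illustrates a reward of this general shape, assuming α weights criticality, β weights the remaining slack d − t, and ι is subtracted when the packet is selected after expiry; the exact functional form and parameter values in the paper may differ:

```python
def reward(criticality, deadline, t, alpha=1.0, beta=0.1, iota=0.5):
    """Hypothetical reward shape: alpha weights criticality, beta weights
    remaining slack (deadline - t), and iota is the extra penalty applied
    when the message is selected after its deadline has passed."""
    r = alpha * criticality + beta * (deadline - t)
    if t > deadline:
        r -= iota
    return r

# On-time selection earns more than a late one for the same message.
assert reward(5, deadline=10, t=4) > reward(5, deadline=10, t=12)
```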

D. ENVIRONMENT
We created a simulation of a complex IoT environment in which the data-packet streams arriving at sensors may have not only different distributions but also temporal dependency and correlation in their arrival times and data-packet values. Such properties are known in IoT environments such as intelligent buildings [3], marine environments [10], and multiple mobile sensing and computing applications [25]. In our environment, as mentioned in Section II, data packets arrive at sensors and the central unit chooses a sensor to fetch data from at each polling turn. It then verifies whether the deadline of the chosen sensor's packet is still valid. If so, it calculates a reward without any extra penalty; otherwise, it still generates a reward (like a soft scheduler [15]) but subtracts a penalty to reflect the deadline expiry of the chosen data packet. This iteration continues at each time step until the episode finishes.
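One episode of this loop can be sketched as follows; the arrival rates, criticality range, and penalty value are illustrative, and the random policy stands in for the learned one:

```python
import random

random.seed(1)
PENALTY = 0.5

def run_episode(n_sensors=4, horizon=20):
    """Poll one sensor per step; earn its criticality, minus a penalty
    if the packet expired before transmission (soft scheduling)."""
    total_reward = 0.0
    # Each sensor buffers at most one packet: (criticality, absolute deadline);
    # a newly arriving packet overwrites the old one.
    buffers = {i: None for i in range(n_sensors)}
    for t in range(horizon):
        for i in range(n_sensors):
            if random.random() < 0.3:
                buffers[i] = (random.randint(1, 5), t + random.randint(1, 5))
        ready = [i for i, p in buffers.items() if p is not None]
        if not ready:
            continue
        chosen = random.choice(ready)     # stand-in for the learned policy
        crit, deadline = buffers[chosen]
        buffers[chosen] = None            # the packet is consumed on transmission
        total_reward += crit if t <= deadline else crit - PENALTY
    return total_reward
```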

IV. QUANTITATIVE EVALUATION
Having described the proposed approach based on deep reinforcement learning, we now discuss other heuristics to compare with our proposal. To recap the performance goal: we want to reduce the criticality-weighted deadline miss ratio.
We evaluate the policy from the proposed approach along with three other policies: the vanilla A3C, critical greedy, and deadline greedy. Given the explanations of the A3C (Section III-B) and the proposed approach (Section III-C), we briefly elaborate on the greedy policies:
• Critical greedy: the central unit selects the sensor holding the message with the highest criticality value.
• Deadline greedy: In this policy, the central unit selects the sensor whose pending message has the earliest deadline.
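The two greedy baselines can be sketched as follows, assuming each sensor's pending message is represented as a (criticality, deadline) tuple; this layout is an illustrative assumption.

```python
# Critical greedy: pick the sensor whose pending message has the highest
# criticality value.
def critical_greedy(messages):
    return max(range(len(messages)), key=lambda i: messages[i][0])

# Deadline greedy: pick the sensor whose pending message has the earliest
# deadline.
def deadline_greedy(messages):
    return min(range(len(messages)), key=lambda i: messages[i][1])
```

For example, given messages [(3, 9), (8, 4), (5, 2)], critical greedy selects sensor 1 (criticality 8) while deadline greedy selects sensor 2 (deadline 2).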

A. EVALUATION SCENARIOS
As mentioned above, we have four different approaches (i.e., policies) for selecting a sensor at each time step. We compare these policies with respect to their criticality-weighted deadline miss ratio. To understand the performance differences, we consider four different scenarios, each representing a distinct environment in terms of data arrivals and values. We chose Scenario 1 to study our proposed solution in less complex environments (where there is no dependency among sensors). Inspired by the scenarios reported in the literature ([3], [10]), we considered Scenarios 2, 3, and 4 to explore the performance of our approach in complex environments and overcome the A3C limitation discussed earlier and shown in Figure 2:
• Scenario 1: The arrival of data to sensors follows different distributions, namely a uniform distribution and multiple Poisson distributions, without dependencies.
• Scenario 2: The arrival of data to half of the sensors is dependent in a pairwise manner; that is, the sensors in each pair are dependent on each other. With eight sensors in the system, two pairs (i.e., half of the sensors) had correlated data arrivals. For example, the arrivals of data for the (s1, s2) pair of sensors were correlated, and the same correlation existed for the (s3, s4) pair. We did not consider any such correlation for the rest of the sensors.
• Scenario 3: The arrival of data to half of the sensors is temporally dependent in a pairwise manner. Similar to Scenario 2, for half of the sensors, pairs of sensors had a temporal dependence on each other. For example, with eight sensors, there was temporal dependence within the (s1, s2) pair and similarly within the (s3, s4) pair, while there was no dependence among the other sensors.
• Scenario 4: The arrival of data to half of the sensors is temporally dependent in a pairwise manner, and the state values (i.e., data values) of those sensors are also dependent on each other. In the case of eight sensors, for the (s1, s2) pair, the data arrivals were temporally dependent and the data values were correlated. The same setting held for the other dependent pair (s3, s4), and the rest of the sensors did not have a dependency of any kind.
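One way the pairwise-dependent arrival streams of Scenarios 2-4 could be synthesized is sketched below; the Bernoulli-style coupling scheme is an illustrative assumption, though the parameter names mirror the t_w, corr, and λ values used in the experiments.

```python
import random

# Synthesize arrival times for a temporally dependent sensor pair (s1, s2):
# each s1 arrival is echoed on s2 after a lag of t_w steps with probability
# corr. The coupling mechanism itself is an assumption for illustration.
def paired_arrivals(horizon, lam=0.15, t_w=5, corr=0.9, seed=0):
    rng = random.Random(seed)
    s1, s2 = [], []
    for t in range(horizon):
        if rng.random() < lam:          # Bernoulli stand-in for a Poisson stream
            s1.append(t)
            if rng.random() < corr and t + t_w < horizon:
                s2.append(t + t_w)      # temporally dependent echo on s2
    return s1, s2
```

By construction, every arrival on s2 lags an s1 arrival by exactly t_w steps, which is the kind of recurring pattern the memory-augmented network is meant to exploit.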

B. EXPERIMENTAL SETUP
We perform our experiments in all of the above scenarios, with multiple numbers of sensors (n = 4, 8, and 16). Because the results are similar, we report the results for n = 8 here; the results for 4 and 16 sensors are available in Appendix A (Section VI-A) and Appendix B (Section VI-B), respectively.
To be consistent, we fixed the values of the temporal dependencies and correlations (regarding arrival times or state values) wherever they existed across the scenarios. For temporal dependency, we used t_w = 5 to denote the temporal difference, with a correlation value of corr = 0.9 (i.e., a high correlation). This correlation value was also used for the cases where only correlation exists (no temporal dependency). As mentioned in Section IV-A, half of the sensors (i.e., four sensors, as n = 8) have dependencies in Scenarios 2, 3, and 4, which means two pairs of sensors have temporal dependency, correlation, or both. We used an arrival rate of λ = 0.15 for these scenarios. For the first scenario, we chose the arrival times for one sensor from a uniform distribution over the time horizon of the study, and for the other sensors from multiple Poisson processes (λ = [0.05, 0.15, 0.3, 0.35, 0.5, 0.65, 0.85]). For each message, the deadline and criticality values are chosen from uniform distributions: deadlines from U_d(1, 10) and criticality values from U_κ(1, 10). In the case of dependencies among data values (i.e., criticality and deadline values), the chosen values still lie within the ranges of these uniform distributions. We ran the simulations with eight threads, as the number of threads directly impacts performance by affecting the quality of the gradient. We tuned the hyperparameters and settled on simulation runs of 250 episodes of 5000 steps each, a minimum batch size of n_batch = 32, and a learning rate of δ = 0.005. We also chose the loss-function constants c_v = 0.5 and c_reg = 0.01.
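For reference, the experimental settings reported above can be collected into a single configuration; the dictionary keys are illustrative names, while the values are taken from the text.

```python
# Experimental settings for the n = 8 runs, gathered in one place.
CONFIG = {
    "n_sensors": 8,
    "episodes": 250,
    "steps_per_episode": 5000,
    "n_threads": 8,              # asynchronous A3C workers
    "batch_size": 32,            # n_batch
    "learning_rate": 0.005,      # delta
    "c_v": 0.5,                  # value-loss weight
    "c_reg": 0.01,               # regularization weight
    "t_w": 5,                    # temporal-dependency lag
    "corr": 0.9,                 # correlation strength
    "lambda_dep": 0.15,          # arrival rate in dependent scenarios
    "deadline_range": (1, 10),   # U_d
    "criticality_range": (1, 10) # U_kappa
}
```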

C. RESULTS
We elaborate on our observations from the simulations. In each scenario, we considered the aggregate results of the 250 episodes for the evaluations.

Table 3. The proposed approach (m_A3C) performs competitively compared to the vanilla A3C.
Workload Intensity: We define a workload intensity metric as ρ = λ_arrival · (1/|M|) Σ_{i∈M} Cr_i / d_i, where M is the set of all messages arriving in a simulation episode, λ_arrival is the arrival rate of events, and Cr_i and d_i are the criticality and deadline of each message i, respectively. We calculate this metric for each simulation episode of 5000 steps. Such a metric captures, to some extent, the workload intensity. Therefore, we report the performance of the policies in terms of the "criticality-weighted deadline miss ratio" (on the Y-axis) along with the "workload intensity" (on the X-axis) when presenting the simulation results. As depicted in Figures 5, 6, 7, and 8, each data point represents the overall criticality-weighted deadline miss ratio against the workload-intensity metric (as defined above) over the 5000 steps of an episode. Table 3 summarizes the results of the experiments for the case of 8 sensors.
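A minimal sketch of one plausible reading of the two quantities plotted on the axes; the paper's exact formulas may differ, so treat both functions as assumptions.

```python
# Criticality-weighted deadline miss ratio: the criticality mass of missed
# messages over the criticality mass of all messages (assumed reading).
def cw_miss_ratio(messages):
    # messages: iterable of (criticality, missed) pairs
    total = sum(cr for cr, _ in messages)
    missed = sum(cr for cr, m in messages if m)
    return missed / total if total else 0.0

# Workload intensity rho: arrival rate times the mean criticality-to-deadline
# ratio over the messages of an episode (assumed form).
def workload_intensity(lam, messages):
    # messages: iterable of (criticality, deadline) pairs
    return lam * sum(cr / d for cr, d in messages) / len(messages)
```

Under this reading, missing a single high-criticality message costs more than missing several low-criticality ones, which is exactly the trade-off the policies are evaluated on.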
Across all scenarios, we find the proposed approach performs well. In two of the scenarios (III and IV), it outperforms all others, which confirms the idea of adding memory to capture temporal relations. In Scenario I, it still performs slightly better than the vanilla A3C and outperforms the other two policies. In Scenario II, it performs competitively when compared with the vanilla A3C while still outperforming the two greedy baselines. Table 4 (Appendix C) provides a more comprehensive set of results by including the experiments for the cases of 4 and 16 sensors.
Observing that the modification to the A3C did not negatively affect the performance in less complex scenarios suggests the advantage of the proposed solution over the vanilla A3C as a generalized solution.
In addition to generalizability, our proposed solution also offers scalability compared to traditional RL approaches such as tabular Q-learning. As described earlier (Section III-A1), with the proposed DRL solution, we do not deal with updating a Q-table of states and actions. Instead, we feed the states as the input of the NN and approximate the value function and action values. Therefore, we avoid the massive consumption of computing resources for updating a Q-table. In this way, it is easier to increase the number of sensors, as doing so only changes the size of the input array of the NN, which requires considerably fewer computing resources than Q-tables that grow exponentially. One possible bottleneck towards a large-scale implementation of our approach would be the training time, which could increase in environments with more sensors and complexity. This issue, however, could be addressed by leveraging GPUs for computational speedup, shown to be effective in improving A3C performance by Babaeizadeh et al. [2].

Table 3. The proposed approach (m_A3C) outperforms other policies. It reports the average criticality-weighted deadline miss ratio for the policies in each of the scenarios. As shown in the green-coloured cells, the proposed policy (m_A3C) consistently performs as the best policy. Even when it is not the best (i.e., Scenario II), it still performs reasonably well, with only one percent of difference (concerning ρ) compared to the best result. Also, the red-coloured cells represent the worst performance (by the greedy policies) across all the scenarios.
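The scalability argument can be illustrated with a back-of-the-envelope comparison; the per-sensor state granularity k and the two-features-per-sensor input layout are illustrative assumptions.

```python
# A tabular method must enumerate every joint state: with k discretized
# values per sensor and n sensors, that is k**n states, times n actions.
def q_table_entries(n_sensors, k=10):
    return (k ** n_sensors) * n_sensors

# The DRL approach only grows the network's input layer linearly with
# the number of sensors.
def nn_input_size(n_sensors, features_per_sensor=2):
    return n_sensors * features_per_sensor

for n in (4, 8, 16):
    print(n, q_table_entries(n), nn_input_size(n))
```

With k = 10, going from 8 to 16 sensors multiplies the table size by 10^8 while the network input merely doubles, which is the crux of the exponential-versus-linear contrast above.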

V. RELATED WORK
Our research brings together multiple areas, such as DRL, performance analysis, and job scheduling. Therefore, the related work mostly concerns how DRL has helped to solve scheduling problems, specifically regarding the performance of sophisticated sensor deployments in which capturing data patterns is non-trivial [21], [26].

Table 4. The proposed approach (m_A3C) outperforms the vanilla A3C.
In the scheduling of distributed tasks in grid and cloud, there are several efforts that optimize metrics such as lifespan and load balance [12], [18]. However, such works focus only on criticality/importance or timeliness. In our work, we study the trade-off between these two system properties.
In terms of performance analysis, our work shares similarities in the notion of the metrics between what we study as the criticality-weighted deadline miss ratio (Section I) and other metrics, such as the one explained by Salah et al. [16]. Aside from the performance metrics, different works consider hierarchical models, such as the ones studied by Khazaei et al. [8] and Rashtian and Gopalakrishnan [15]. Our work, though, focuses on the case of a centralized model as a standard model in infrastructure-as-a-service (IaaS) platforms.
On the RL side, our work compares to the body of work done via policy gradient methods. Policy gradient methods comprise a large family of reinforcement learning algorithms. They have a long history [24], but only recently were they backed by neural networks and successful in high-dimensional cases. The A3C algorithm was published in 2016 and can do better than DQN [14] with a fraction of the time and resources [13]. Although several papers leverage policy gradient methods, our work is distinguished from many of them by both the application and the network structure.

VI. CONCLUSIONS AND FUTURE WORK
We have demonstrated that a deep reinforcement learning approach to balancing message criticality and timeliness in a complex IoT environment is effective. We proposed an approach that improves upon the A3C algorithm by embedding memory into the model so that it can capture recurring patterns of data in complex environments. Our solution outperforms the studied alternative heuristics that one could use in this context. We envision the applicability of such an idea in IoT environments that host temporally dependent and correlated data arrivals. We have tackled this problem in the specific setting where messages have the two essential properties of criticality and timeliness.
We have shown that our solution is effective in four scenarios. The proposed approach outperforms the rest of the policies in the complex scenarios, where we have data arrival correlation, data arrival temporal dependency, and data arrival temporal dependency plus correlation of data values (concerning criticality and deadline values). Our solution also remains effective in the more straightforward scenario, where the arrival of data to sensors follows different distributions without dependencies. The observed results in both complex and non-complex scenarios suggest the generalizability of the proposed solution.
As future work, we envision multiple avenues to explore. One natural extension of this work is to explore the effectiveness of the approach with more scenarios. This path may lead to modifications of the network architecture. Another possible plan could be to compare the proposed approach with solutions based on other DRL algorithms, such as PPO [17]. Such explorations could shed light on the trade-off of convergence vs. stability across DRL-based approaches. Lastly, another direction for future work would be to answer the question of "How should we model information correlation to prevent redundant message collection?". In this case, we may be able to use mutual information as a metric to capture the value of different messages; we could ignore messages with high mutual information relative to messages that have already been scheduled. As our proposed model here does not assume or address collecting redundant data,
it would be interesting to explore potential modifications to the framework to prevent collecting repetitive data while still capturing recurring patterns within the data. We envision that addressing this problem in our proposed framework would require updates to the reward function and the performance metric, but the overall solution structure is likely to remain the same. Finally, it would be interesting to solve a similar case under the assumption of two or more classes of tasks (e.g., critical and regular/normal tasks), where the goal is to prioritize critical tasks over the other classes. Such a study would complement our work in this paper, as the current problem setting does not assume such a clear distinction among tasks and solely focuses on minimizing the criticality-weighted deadline miss ratio metric over a finite horizon.

APPENDIX
Here we present the rest of the results from our experiments as mentioned in the paper.

A. EXPERIMENT RESULTS FOR 4 SENSORS
In this section, we present the experiment results for the case of having 4 sensors (see Figs. 9-12).

B. EXPERIMENT RESULTS FOR 16 SENSORS
In this section, we present the experiment results for the case of having 16 sensors (see Figs. 13-16). Table 4 summarizes the results for the cases of 4, 8, and 16 sensors. It reports the average criticality-weighted deadline miss ratio for the policies in each of the scenarios. As shown in the green-coloured cells, the proposed policy (m_A3C) almost consistently performs as the best policy. Even when it is not the best, it still performs reasonably well, with only one percent of difference (concerning ρ) compared to the best result. The red-coloured cells show the worst performance among the policies, which is predominantly shared between the two greedy policies.

C. TABLE OF RESULTS SUMMARY
HOOTAN RASHTIAN received the M.Sc. degree from The University of British Columbia (UBC), in 2014, where he is currently pursuing the Ph.D. degree with the RADICAL Computing Systems Lab. He has received the NSERC Engage Grant and multiple Mitacs Accelerate grants for industry-related projects. He has collaborated with multiple Research and Development projects in the industry, such as Oracle Labs to apply machine learning techniques, including predictive modeling, recommendation systems, meta learning, and unsupervised learning on real-world problems. His research interests include developing techniques in statistical machine learning, reinforcement learning, and dynamic programming to improve data collection and prioritization in data-rich environments such as IoT settings.
SATHISH GOPALAKRISHNAN is currently an Associate Professor of electrical and computer engineering with The University of British Columbia (UBC) and also an Associate Head of undergraduate programs with his department. He works on research problems related to the safety of cyber-physical systems (computing systems embedded in and interacting with the physical world). He is interested in techniques and tools for resource allocation and the verification of correct system behavior in such embedded computing systems. The broader context for his work relates to how we think about and develop dependable software. He has been recognized for his research with Best Paper Awards from the IEEE Real-Time Systems Symposium and with the Peter Wall Scholarship from the UBC.