Scalable Delay-Sensitive Polling of Sensors

In a sensor-rich Internet of Things environment, we may be unable to gather all data at a processing centre at the rate at which the data is generated. The rate of data collection from a sensor may be limited by available bandwidth, cost, or energy considerations, especially if one were to use cellular networks for such systems. In this context, we present a mechanism for determining which sensors to gather data from at each polling epoch. Our sensor polling mechanism prioritizes sensors using information about the data generation rate, the expected value of the data, and its time sensitivity. Our problem formulation and its solution relate to the restless bandit model for sequential decision making. Whereas existing methods for the restless bandit model are not directly applicable because our state space is continuous rather than discrete, we prove that similar techniques can be used because of particular characteristics of the underlying problem. We then show, through an extensive quantitative study in which event arrivals follow a hyper-exponential distribution, that our approach can be very effective even when it is not optimal.


I. INTRODUCTION
Sensor systems are a significant component of the Internet of Things (IoT). Dense sensor deployments are a rich source of data about the physical world and are used to inform decisions in a variety of applications; agriculture, transportation, and emergency response are a few such applications. Another source of data is certain types of online social networks/services, such as Twitter, which also contain feeds that influence decision making in the types of applications that rely on physical sensing. Privacy in data collection can also be an important consideration for applications that use data from such sensor deployments.
We consider a scheduling problem that arises in the context of detecting events across multiple locations in a system. Imagine that we would like to know if an event was observed at some location. We would like to know of this event relatively quickly (delay sensitivity). However, there are many locations to monitor, making it infeasible to poll each location at each decision epoch. One possible approach to this problem would be to set up triggers at each location for every object or event of interest and be informed when the event occurs. A downside to this approach, and one that we consider, is that we may have to inform each location of the events of interest, potentially compromising the privacy of the events of interest (privacy sensitivity). We, therefore, study a model where a central entity polls an observation point (location) and obtains all observations since the previous polling request to that point. This central entity may then identify the events of interest from the obtained set of observations to take suitable action. We can also use the term sensors to refer to these observation points, and we will use these terms interchangeably for the rest of this article.
From the design perspective, and with the above-mentioned central entity, we can assume that we have a centralized data service that can provide subscription services to applications. The centralized entity may not have the most accurate knowledge of sensor data (and thus, the value of the data). However, one benefit of performing centralized scheduling is to enforce a global bandwidth constraint. (''Centralization'' may still use distributed computing techniques but we can think of a single entity that plans the data collection and data warehousing.) The central entity has limited bandwidth and can only poll some sensors at each decision epoch.
The question that then arises is: Which sensors should be polled? We may want to prioritize sensors that have not been polled recently and at the same time account for the fact that some sensors may provide more valuable data consistently.
Another advantage of this model is that sensor deployment costs may be amortized over multiple applications, and simple application programming interfaces (APIs) allow many applications to be developed easily. This model has also been referred to as sensing-as-a-service [18]. In the sensing-as-a-service model, as more sensors join the network, it becomes increasingly challenging to select appropriate sensors such that the services mentioned above (e.g., APIs) remain efficiently available.
Our focus is on the abstraction of polling sensors where we have some notions of delay sensitivity and privacy sensitivity (or other limitations) that prevent a push-driven architecture. If the value derived from different events may differ, then a simple round-robin polling policy may not be suitable. We study how a single entity can act as an efficient scheduler for sensor data collection operations when detecting incidents/events in IoT platforms. The growth in sensor deployments and the volumes of data generated [18] necessitate careful scheduling of data collection from sensors. Such scheduling is essential due to potential constraints such as the available communication bandwidth (which can be overwhelmed by an approach that simply queries all sensors periodically) or the direct cost of accessing the underlying network. One possible design strategy for such systems would be to utilize the cellular network infrastructure, which can be expensive, although simple to implement and manage.
We propose a centralized approach to periodically collecting sensor data. To manage the process of polling sensors for data, we consider two issues related to the sensors and the data they collect:
1) Data Value: Not all data is equal in a data-rich world. The data stream from one sensor may be deemed more important than the stream from a different sensor. Such judgments of value may result from factors such as sensor location and sensor type. The value can be quantified, for instance, by the number of API queries that use the data in the sensing-as-a-service model.
2) Time Sensitivity: Some data may be important for long-range statistics, but some data may be needed soon after the associated observation. Further, data may lose its value as the time from the associated observation increases.
Using the context that we have provided so far, we propose a policy to prioritize sensors at each polling (data collection) epoch to respect the bandwidth (or cost) constraints and to maximize the long-term average value obtained. In modelling the underlying problem (Section III), we treat the sensors as offering time-varying rewards based on when we poll them. The reward from a sensor is the ''value'' seen by the sensing service. Such a reward depends on the data that the sensor recorded and on when it recorded the data.
Our problem model is related to the restless bandit model for sequential decision making [26] due to Whittle. The central difference, however, between our formulation and the classic restless bandit formulation is that the state-space for our problem is continuous, whereas it is discrete in Whittle's formulation. It is this difference that requires the rigorous treatment we present.
Our main contributions are as follows:
• Establishing that the polling problem can be solved using a dynamic program that has a unique solution (Section IV);
• Deriving a simple dynamic priority, or index, policy that allows us to approximate the solution to the (intractable) dynamic program (Sections V and VII);
• Identifying an adaptive and improved index policy when event arrivals at sensors are modelled using hyper-exponential distributions to capture a wider range of operating conditions (Section VI);
• Demonstrating the effectiveness of the index policy using numerical evaluations (Section VIII).

Organization:
We start by presenting the related work (Section II) to support and position our contributions. We then discuss the model that captures the sensor scheduling problem (Section III) before presenting the main technical results highlighted in the list of contributions (Sections V, VI, VII and VIII). The last section summarizes our findings and discusses extensions to this work (Section IX).

II. RELATED WORK
Before we provide a detailed treatment of the sensor polling problem we consider, we describe some of the related work. In doing so, we position our work relative to prior research.
The model of restless bandits was presented by Whittle [26]. In the original formulation, there are many arms, and pulling an arm results in a reward. The underlying probability distribution for rewards is unknown and may change with time (hence the restlessness). The goal is to identify a policy that maximizes the long-term average reward. This model generalizes the multi-armed bandit model, which was studied initially by Gittins and presented in detail in a more recent monograph [6]. The solution approach is to define a priority/index computation that is efficient and helps determine the actions to take at each decision epoch.
Most such problems can be tackled using stochastic dynamic programming [20], but such direct approaches suffer from the curse of dimensionality, leading to impractical solutions for a large number of sensors/bandit arms. The index approach is suboptimal (although asymptotically it is near-optimal) but avoids brute-force dynamic programming by reducing the problem to a set of more straightforward problems (one per arm). Prior work has focused on the state space of the arms being discrete but, as we shall see, in our formulation the state space is continuous (albeit closed) and requires that we establish the existence and suitability of priority/index policies. Kleinberg and Immorlica have suggested the model of recharging bandits [13], wherein an arm that has not been played accrues rewards over time according to some concave function. The assumption of concavity in how an arm's rewards grow between pulls allows for a polynomial-time approximation scheme. The problem we study bears similarity to the work by Kleinberg and Immorlica, but the fact that a sensor's state may grow with time (as events are observed) and decay (as the data becomes stale) does not permit the same analysis as in the case of recharging bandits. We may view the problem that we have presented as one involving recharging-discharging bandits, with the recharge-only model being a particular case. This difference seems to require the scheme we have discussed because the general restless bandit problem is PSPACE-hard even to approximate [17]; work by Guha et al. [7] presents approximation algorithms for some special cases. The work we present here is more general than the recharging bandits model but not as general as what was studied by Guha et al., and the results we present are relevant and exciting for a specific class of problems.
Sensor scheduling has been modelled as a restless bandit problem but with different constraints and objectives [16]: to find specific elusive targets using faulty sensors. Similarly, there has been an effort to use the restless bandit model for sensors with energy harvesting considerations [11]. Such work did not address the issues of data value and time sensitivity, which also require some analysis of the continuous state-space. Iannello and Simeone [10] studied the problem of optimally scheduling stochastically generated, independent, and time-sensitive tasks, where a centralized controller assigns at each time slot a node to a server for it to execute a task. The setting of their work is similar to ours in that a centralized decision-making entity is assumed, task inter-arrival times are exponentially distributed, time-sensitivity is an explicit constraint, and the policy derived is a restless multi-armed bandit. One key difference, however, is that we consider continuous-time state dynamics, as opposed to the discrete-time state evolution model that the previous work considers. The consideration of continuous-parameter dynamics, and a continuous state-space, poses significant analytical challenges that are otherwise not present in discrete parameter/state-space models. A model where sensors reinitialize their state after some time steps was explored by Villar [24].
Optimal strategies for obtaining data from sensors with delay/freshness constraints have been studied in the context of a single sensor, with the goal of balancing energy consumption with data freshness [5]. In that work, a single sensor node was considered with the goal of maximizing a weighted function that accounted for sensor energy and data freshness. The problem of choosing which sensors to poll when there are multiple available sensors requires a different approach.
Heuristics for data collection have been explored in the IoT setting [9], [12], but these methods approach the problem from a broader system-building perspective, and the algorithms do not have proofs of optimality, or approximation ratios, or competitive ratios.
The sensor scheduling problem has been tackled recently in the specific context of networked control systems, with the goal of state estimation. Weerakkody et al., for example, examine the multi-sensor scheduling problem to minimize mean squared error estimates [25]. Such work assumes specific knowledge of the underlying system state and its dynamics. Our formulation is relevant in situations where such system dynamics are not clearly defined, and the value of sensor data is measured by some exogenous process such as data use. Using similar information, Han et al. have also examined the stochastic sensor scheduling problem [8].
Clark et al. [3] used the general notion of adaptive delayed polling of sensors, but not with near-optimal schedules.

III. MODEL
We shall now explain the model that we study in this work. While we explain the symbols throughout the text, a complete list of symbols (and their corresponding descriptions) is available in the table of symbols in Appendix IX-D.
Consider a sensor deployment consisting of $N$ sensors $S_1, \ldots, S_N$. Let $[N] = \{1, \ldots, N\}$. The data sensed by sensor $i \in [N]$ has initial value $\nu_i$, which is a non-negative random variable with known finite mean $\bar{\nu}_i < \infty$, and this value decreases exponentially at rate $\beta_i > 0$. This means that data collected $t$ units of time after it was initially sensed has (random) value $\nu_i e^{-\beta_i t}$. We assume that $\nu_1, \ldots, \nu_N$ are i.i.d. Also, we assume that data is collected periodically with period $P > 0$.
Moon et al. have presented the approach of assigning value to gathered data based on its use in their work on a learning framework for improving search results [15].
We assume that the successive events that a sensor may detect are such that the inter-arrival times are governed by a hyper-exponential distribution, which is a mixture of exponential distributions. This assumption about inter-arrival times captures a wide range of operating conditions because hyper-exponential distributions include exponential distributions and can approximate heavy-tailed distributions. Sensor $S_i$, $i \in [N]$, senses new events from the environment at rate $\mu_{i,j} > 0$ with probability $q_j$; that is, events are sensed, or equivalently, events arrive at the sensors, according to a hyper-exponential process that is a mixture of $M$ exponential distributions. At each collection period, we select the sensors to poll and thus collect the sensed event data.
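To make the arrival model concrete, the following minimal Python sketch samples hyper-exponential inter-arrival times. The mixture parameters here are illustrative only, not values from our experiments.

```python
import numpy as np

def sample_hyperexp_interarrivals(q, mu, n, rng=None):
    """Sample n inter-arrival times from a hyper-exponential mixture:
    with probability q[j], the time is Exponential(rate=mu[j])."""
    rng = rng or np.random.default_rng()
    q, mu = np.asarray(q), np.asarray(mu)
    branches = rng.choice(len(q), size=n, p=q)    # pick a mixture branch per event
    return rng.exponential(1.0 / mu[branches])    # exponential time at that branch's rate

# Example: a two-branch mixture (M = 2) with hypothetical parameters.
times = sample_hyperexp_interarrivals(q=[0.7, 0.3], mu=[5.0, 0.5], n=1000)
print(times.mean())   # compare to the analytical mean: 0.7/5.0 + 0.3/0.5 = 0.74
```

Mixing a fast and a slow exponential branch in this way yields inter-arrival times with more variability than a single exponential, which is what lets the model approximate heavy-tailed behaviour.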
Our next step is to describe the expected utility/value that we can obtain when we poll a sensor. We define the expected utility/value from polling sensor $S_i$, denoted $\upsilon_i$, as the average value over the average discounted time spanning one period; this is stated formally in (1). We will use this estimate of the utility accrued in a period in our initial discussion. Later (Section VI), we will show that we can obtain improved estimates using a property of the hyper-exponential distribution.
Each sensor may have sensed multiple events between two sampling instants. Let $\Upsilon_i(t)$ be the utility accumulated at sensor $S_i$ at time $t$; this is also the state of $S_i$ at time $t$. Let $a_i(t) \in \{0, 1\}$ be the action taken for sensor $S_i$ at time $t$, where $a_i(t) = 1$ denotes polling the sensor and $a_i(t) = 0$ denotes leaving it idle. The evolution of the state of each sensor depends on the polling action, and there are two cases. For convenience, let $\alpha_i = e^{-\beta_i P}$. First, if $a_i(t) = 0$, no reward is obtained, and the state changes at the next polling period according to
$$\Upsilon_i(t + P) = \alpha_i\, \Upsilon_i(t) + \upsilon_i. \qquad (2)$$
Second, if $a_i(t) = 1$, then the reward obtained is the accumulated utility up to time $t$, namely $\Upsilon_i(t)$. In this case, the state is ''reset'' to its initial value: $\Upsilon_i(t + P) = \upsilon_i$. Note that the controlled state process $\Upsilon = \{\Upsilon(t) : t \ge 0\}$ depends on the sequence of actions. Moreover, $\Upsilon$ is a deterministic process for the following reasons: 1) we are working with the expectations of the random elements involved in the definition of the state, and 2) we are considering deterministic decisions. In this model, our objective is to maximize the average reward over the infinite horizon, subject to a ''bandwidth'' constraint
$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} \sum_{i=1}^{N} \gamma_i\, a_i(t) \le B,$$
where $\gamma_i > 0$ is the ''bandwidth'' required by sensor $S_i$, and $B$ is the given average bandwidth that is available for the overall expected sensor polling activity.
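The per-sensor dynamics above are easy to simulate. Below is a minimal sketch, assuming $\alpha_i = e^{-\beta_i P}$ as in (2); the parameter values and polling schedule are hypothetical.

```python
import math

def simulate_sensor(upsilon, beta, P, poll_schedule):
    """Fluid-flow state recursion for one sensor over discrete epochs.
    Passive epochs decay the state and add the per-period accrual upsilon;
    a poll collects the accumulated utility and resets the state."""
    alpha = math.exp(-beta * P)          # per-period decay factor
    state, total_reward = upsilon, 0.0   # initial state is upsilon
    for a in poll_schedule:
        if a == 1:
            total_reward += state        # poll: collect accumulated utility
            state = upsilon              # ... and reset the state
        else:
            state = alpha * state + upsilon   # idle: decay, then accrue
    return total_reward

# Example: poll every third epoch with hypothetical parameters.
print(simulate_sensor(upsilon=1.0, beta=0.2, P=1.0, poll_schedule=[0, 0, 1] * 10))
```

Under the bandwidth constraint, only a subset of sensors can be polled in each epoch, which is what makes the choice of which sensors to poll non-trivial.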
We assume that sensor $S_i$ requires bandwidth $\gamma_i$ irrespective of the number of observations it reports when polled. We could treat $\gamma_i$ as a random variable, but the long-term behaviour can be approximated by using the mean $\bar{\gamma}_i$. We will assume a deterministic $\gamma_i$ for the rest of the article.
Our formulation is related to the restless multi-armed bandit framework [26], but the difference in our formulation is that the state space is a closed domain in R N . The general analysis in this setting requires establishing some essential results.
In the specific case where $\gamma_i = 1$ for all $i \in [N]$, our problem reduces to the original restless multi-armed bandit setting, and $B$ becomes the expected number of sensors that need to be polled at every decision epoch.
We shall next examine index policies, wherein the global problem with N sensors (arms) is decomposed into N single-sensor problems. In what follows, we shall use the terms arm and sensor interchangeably. When referring to an arm, we will use the term pulling for the sensor polling activity.
In the first part of our discussion (Sections IV and V), we will assume a fluid-flow approximation of the stochastic event arrival process, which triggers observations at sensors. Event arrivals over a time window are averaged, and each observation at a sensor has the same value, although observations at different sensors may have different values. Later (Section VII), we will relax this assumption and study the truly stochastic behaviour of the system, where events arrive at discrete time points and the values of observations at a sensor may differ, but the mean value of a sensor observation is known.

IV. DYNAMIC PROGRAM FORMULATION
This section establishes that the problem at hand can be solved using a stochastic dynamic program. Dynamic programming in this setting is intractable, so we will use the structure of the dynamic program to identify a more efficient solution.
We consider the average reward problem for $S_i$ using the theory of Lagrange multipliers. To this end, denote by $\Upsilon_i$ the state space of sensor $S_i$, and let $v_i(s_i, a)$ be the immediate reward that $S_i$ receives when the state is $s_i \in \Upsilon_i$ and the action taken is $a$:
$$v_i(s_i, a) = \begin{cases} \lambda \gamma_i, & a = 0, \\ s_i, & a = 1, \end{cases} \qquad (3)$$
where $\lambda$ is a Lagrange multiplier. One may interpret $\lambda$ as a subsidy allocated to sensor $i$ so as to make idling (non-polling) more attractive. Using a discount factor $\rho \in (0, 1)$, we can define the discounted reward over an infinite horizon as
$$J_\rho\big(s_i; \{a(t)\}\big) = \sum_{t=0}^{\infty} \rho^t\, v_i\big(\Upsilon_i(t), a(t)\big), \qquad \Upsilon_i(0) = s_i. \qquad (4)$$
The value function associated with (4) is $V_\rho(s_i) = \sup_{\{a(t)\}} J_\rho(s_i; \{a(t)\})$. Now the dynamic programming equation may be written as
$$V_\rho(s_i) = \max\left\{ \lambda\gamma_i + \rho\, V_\rho(\alpha_i s_i + \upsilon_i),\ \ s_i + \rho\, V_\rho(\upsilon_i) \right\}. \qquad (5)$$
Lemma 1: The solution $V_\rho$ to the dynamic program defined by (5) is (i) unique and bounded, (ii) continuous (in the Lipschitz sense) for each $\rho \in (0, 1)$, and (iii) monotonically increasing and convex.
We provide proof of Lemma 1 in the appendix. As discussed earlier, the reward is discounted over time.
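As an illustration, the single-sensor discounted dynamic program (5) can be solved numerically by value iteration on a discretized state grid. The sketch below is ours, with hypothetical parameters; the grid resolution and linear interpolation are implementation choices, not part of the article's analysis.

```python
import numpy as np

def value_iteration(lam, gamma, upsilon, alpha, rho, n_grid=200, iters=2000):
    """Value iteration for (5): the passive action earns the subsidy
    lam * gamma and moves the state to alpha * s + upsilon; the active
    action earns s and resets the state to upsilon."""
    s = np.linspace(upsilon, upsilon / (1.0 - alpha), n_grid)  # reachable states
    V = np.zeros(n_grid)
    for _ in range(iters):
        V_passive_next = np.interp(alpha * s + upsilon, s, V)  # V at next passive state
        V_reset = np.interp(upsilon, s, V)                     # V at the reset state
        V = np.maximum(lam * gamma + rho * V_passive_next, s + rho * V_reset)
    return s, V

s, V = value_iteration(lam=0.5, gamma=1.0, upsilon=1.0, alpha=0.8, rho=0.95)
print(V[0], V[-1])   # V is increasing in s, as Lemma 1 (iii) asserts
```

Because the discounted operator is a contraction, the iteration converges geometrically; the monotonicity and convexity asserted in Lemma 1 can be observed directly in the computed V.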
Note that $(1 - \rho)V_\rho$ is bounded. Using the Bolzano-Weierstrass Theorem [21] and the Arzela-Ascoli Theorem [2], we may pick a subsequence of $\rho \to 1$ along which $V_\rho(\cdot) - V_\rho(\upsilon)$ converges to a function $V(\cdot)$ and $(1 - \rho)V_\rho(\upsilon)$ converges to a constant $\xi$. As $\rho \to 1$ along this subsequence, (6) becomes
$$V(s) + \xi = \max\left\{ \lambda\gamma + V(\alpha s + \upsilon),\ \ s + V(\upsilon) \right\}, \qquad (7)$$
which, with the normalization $V(\upsilon) = 0$, can be written as
$$V(s) + \xi = \max\left\{ \lambda\gamma + V(\alpha s + \upsilon),\ \ s \right\}. \qquad (8)$$
Now that we have derived the dynamic programming equation, we want to show that the value function increases monotonically, is convex, and satisfies $V(\upsilon) = 0$. Such properties are easily verified since point-wise limits preserve convexity and monotonicity.
Also, we want to show that the maximum in (8) is achieved by the optimal action and, correspondingly, that $\xi$ is the optimal reward. To establish this, consider the following argument. Let $a^*(s)$ be the action that maximizes the right-hand side of (8); if multiple actions maximize it, then we can pick one of those actions arbitrarily. Under $a^*$, (8) holds with equality along the trajectory of the process. Now, if we consider the average value of both sides over time, we get
$$\frac{1}{T} \sum_{t=0}^{T-1} \big( V(\Upsilon(t)) + \xi \big) = \frac{1}{T} \sum_{t=0}^{T-1} \Big( v\big(\Upsilon(t), a^*(\Upsilon(t))\big) + V(\Upsilon(t+1)) \Big). \qquad (9)$$
As $T \to \infty$, $\xi$ is the average reward of the chosen control policy. For any other set of actions, we will have $L \ge R$ in (9), and hence $\xi$ is greater than or equal to the average reward under a different set of actions. This implies the optimality of $\xi$. Let us now define the set of states in which we do not poll a sensor as well as the set of states in which we do poll a sensor:
$$D = \left\{ s : \lambda\gamma + V(\alpha s + \upsilon) \ge s \right\}, \qquad D^c = \left\{ s : \lambda\gamma + V(\alpha s + \upsilon) < s \right\}.$$
If $t_0$ is the epoch at which this sensor is first polled and if $t_0 < \infty$ ($t_0 = \infty$ is the ''never poll'' case), then by using the optimal policy and iterating the optimal value function in (7) $t_0$ times, we may write the dynamic programming equation as
$$V(s) = t_0\,(\lambda\gamma - \xi) + \Upsilon(t_0) - \xi + V(\upsilon).$$
If we use a different policy that is not optimal, then we will have the corresponding inequality ($\le$) instead. Consequently, we can write $V(s)$ as the maximum of this expression over all action sequences. This implies that equation (7) has a unique solution.

V. COMPUTING SENSOR PRIORITIES
Now, we show that an index policy, or a dynamic priority policy, exists for the problem. (This is akin to Whittle's approach for the classic restless bandit problem.) In a dynamic priority policy, the priority associated with a sensor is updated at the start of each epoch. (In a fixed priority policy, the same priority is used for a sensor at every epoch.) To establish the existence of an index policy, we utilize the following facts:
• The value function is monotone;
• The value function is convex;
• The mapping from $s$ to $s - V(\alpha s + \upsilon)$ is concave.
Consequently, as $\lambda$ is varied from $-\infty$ to $+\infty$, the set $D$ grows in a monotone fashion from the empty set to the entire state space $S$.
First, we will show that some corner cases can be ignored.
1) Suppose $\upsilon^* \in D$; that is, the optimal action at $\upsilon^*$ is not to poll the sensor, and the associated reward is $\gamma\lambda$. Then $\xi = \gamma\lambda$, and the optimal strategy is to not poll the sensor at any state. Thus $D = [\upsilon, \upsilon^*]$ and $D^c = \emptyset$.
2) Conversely, suppose it is optimal to poll the sensor at every state, so that the reward is $\upsilon$. Then $D^c = [\upsilon, \upsilon^*]$ and $D = \emptyset$. In this case, $\lambda$ should obey $\gamma\lambda \le \upsilon$.
What we have now is that the deterministic control policies $a(t) = 0$ and $a(t) = 1$ have rewards $\gamma\lambda$ and $\upsilon$, respectively, and $\xi$ must then satisfy two conditions:
• $\xi \ge \min(\gamma\lambda, \upsilon)$, and
• $\xi > \min(\gamma\lambda, \upsilon)$ when $\lambda \in (\lambda_l, \lambda_u)$, where $\lambda_l$ and $\lambda_u$ are a lower bound and an upper bound, respectively.
In this intermediate regime, both $D$ and $D^c$ are non-empty, and there is some $\upsilon^+ \in (\upsilon, \upsilon^*)$ at which polling and not polling the sensor are equally good; $\upsilon^+$ increases with $\lambda$. We can obtain $g(x)$ as the inverse of this function; $g(x)$ increases with $x \in (\upsilon, \upsilon^+)$. In essence, $g(x)$ is the value of $\lambda$ at which polling and not polling the sensor are both suitable decisions.
Let $\lambda = g(s)$ for some $s \in (\upsilon, \upsilon^*)$. Every time we poll the sensor, the corresponding state resets to $\upsilon$. The optimal policy is then periodic: do not poll the sensor until the state enters $D^c$, and then poll it. Finite perturbations in initial conditions do not impact long-term behaviour, so we assume without loss of generality that $s(0) = \upsilon$ (i.e., the initial state). Define $\tau(s) = \min\{t : \Upsilon(t) \in D^c\}$, where $s$ is the initial state. In the long run, the overall average cost will converge to the average cost over one polling period; we record this relation as (10). Thus, we have the following theorem.
Theorem 1: The index of sensor $S_i$ is $g_i(s)$, obtained by solving (11) and (12) below.
Note: We introduced the subscript $i$ so that we can have independent indices for different sensors. Also, once a sensor has been polled, the states $\{\Upsilon_i(t)\}$ thereafter take discrete values, with jumps at every time step; these states depend solely on $\upsilon_i$ and $\alpha_i$. For a sensor that is never polled, the states can be restricted to discrete values depending on $\Upsilon_i(0)$. If we restrict attention to such discretized states, the index computation reduces to a closed form.
Proof: For a state $s \in D^c$, we obtain $V(s) = s - \xi$ using (7). Also, using Lemma 2, for $s' = \alpha s + \upsilon \in D^c$, we can obtain $V(s') = s' - \xi$. Combining these results with (7) and the definition of an index for restless bandits, we have
$$s = \gamma_i\, g_i(s) + V(\alpha_i s + \upsilon_i), \qquad (11)$$
where $\hat{\xi}_i$, due to (10), is the optimal policy cost when $\lambda_i = g_i(s)$. Using (10) yields (12), and using (12) in (11), we can solve for $g_i(s)$, thus obtaining the index.
Figure 1 depicts a step-by-step flowchart of the sensor polling process in our proposed approach. Note that an alternative approach is to treat the decision at each epoch as a knapsack problem [4] after we have computed the indices but, as we discuss later, this approach does not result in significant benefits for the extra work involved.

B. COMPUTATIONAL COMPLEXITY
The Whittle-like index priority calculation in the previous section is easy to compute and implement, with only a linear increase in space and time complexity in the number of sensors. Note that, following Whittle's approach, we decouple our problem into $N$ sub-problems (one per sensor), each of which involves a constant-time update. Therefore, the worst-case complexity of calculating indices is linear ($\Theta(N)$) in the number of sensors. Once we have indices at each time step, the index policy requires that the indices be sorted ($O(N \log N)$ in the worst case) so that it can select the best sensors. However, we need not perform a complete sort at each epoch. There is a semi-periodic behaviour that we can exploit, and this allows us to select the top few sensors more efficiently in practice, often in sub-linear time.
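For concreteness, the selection step at one epoch might look like the following sketch. The index values here are placeholders for the quantities computed in Section V; only the top-index rule itself is being illustrated.

```python
def select_sensors(indices, costs, budget):
    """Pick sensors in decreasing index order until adding the next
    sensor would exceed the bandwidth budget B."""
    order = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    chosen, used = [], 0.0
    for i in order:
        if used + costs[i] <= budget:
            chosen.append(i)
            used += costs[i]
    return chosen

# Example: four sensors with unit bandwidth costs and B = 2.
print(select_sensors(indices=[0.9, 0.2, 0.7, 0.4], costs=[1, 1, 1, 1], budget=2))  # [0, 2]
```

With equal costs, this is exactly ''poll the B highest-index sensors''; with unequal costs it becomes a greedy packing, which Section VIII-D compares against a full knapsack formulation.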

VI. ADAPTIVE ESTIMATION OF ACCRUED UTILITY
In our initial analysis, we modelled the utility accrued at sensor $S_i$ during each period using (1). An insight that we can use to refine this estimate is as follows: suppose a sensor has not observed an event for $t$ time units; what is the probability that this sensor observes no event for a further $P$ time units? When events arrive with inter-arrival times that are hyper-exponentially distributed, we can show, rather quickly, that not seeing an event allows us to model the future with a modified hyper-exponential distribution. Let $T$ denote the inter-arrival time between two events. Then
$$P(T > t + s \mid T > t) = \frac{\sum_{j=1}^{M} p_j\, e^{-\lambda_j (t + s)}}{\sum_{j=1}^{M} p_j\, e^{-\lambda_j t}} = \sum_{i=1}^{M} q_i\, e^{-\lambda_i s},$$
where $q_i = p_i e^{-\lambda_i t} / \sum_{j=1}^{M} p_j e^{-\lambda_j t}$. The derivation above illustrates that when we do not see arrivals under a hyper-exponential distribution, we can make predictions using a modified hyper-exponential distribution: the mixture weights shift, but the branch rates are unchanged.
We can use this insight as follows:
• If we poll a sensor and do not find any useful data, then we can modify the arrival distribution and change our estimate of the expected utility in the next period according to the modified hyper-exponential distribution.
• If we do obtain useful data when we poll a sensor, then we make our next estimate using the original hyper-exponential distribution associated with that sensor.
We also note that if we have multiple consecutive periods in which a sensor does not produce useful data, then we can keep shifting the associated distribution based on the number of periods that have elapsed with no event, as sketched below.
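The weight update is a one-line computation. The sketch below applies the formula for $q_i$ derived above; the parameter values are hypothetical.

```python
import numpy as np

def updated_mixture_weights(p, rates, t):
    """Posterior branch probabilities of a hyper-exponential inter-arrival
    distribution, given that no event has arrived for t time units:
    q_i = p_i * exp(-rate_i * t) / sum_j p_j * exp(-rate_j * t)."""
    w = np.asarray(p) * np.exp(-np.asarray(rates) * t)
    return w / w.sum()

# After two empty periods of length P, shift the mixture by t = 2 * P.
P = 1.0
q = updated_mixture_weights(p=[0.7, 0.3], rates=[5.0, 0.5], t=2 * P)
print(q)   # weight moves toward the slower (low-rate) branch
```

Intuitively, a long silence makes it more likely that the sensor is currently in a slow-arrival branch, so the expected utility accrued in the next period is revised downward; this is what distinguishes IP_v from IP_f.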
This observation can be accommodated comfortably in the analysis we have shown so far. Therefore, we can derive an alternative index policy using this adaptive approach; we denote this policy IP_v in our numerical study (Section VIII) and compare it to the original policy, which we denote IP_f.

VII. EXPLICIT ANALYSIS OF STOCHASTIC ARRIVALS
We now consider the case when the real discrete-event stochastic process governs observations at the sensors. In this case, the value of observations at a sensor may differ (and are unknown ahead of time), but the mean data value at each sensor is known.
Let $t_k^i$ represent the times at which sensor $i$ records a new observation, and let $\nu_k^i$ represent the values of these observations. The index $k$ here is the observation count.
We assume that a sensor-dependent Poisson process governs the arrivals of observations. Further, we assume that the observation values $\nu_k^i$ are independent and identically distributed for each sensor $i$.
The value accumulated at sensor $i$ during the $j$th epoch (between sampling instants $j - 1$ and $j$) will be
$$\upsilon_i(j) = \sum_{k :\ (j-1)P < t_k^i \le jP} \nu_k^i\, e^{-\beta_i (jP - t_k^i)}.$$
The state of the system at $t = (j + 1)P$ is then
$$\Upsilon_i\big((j+1)P\big) = \alpha_i\, \Upsilon_i(jP)\,\big(1 - a_i(jP)\big) + \upsilon_i(j+1).$$
The average expected reward can be defined as
$$\limsup_{T \to \infty} \frac{1}{T}\ E\!\left[ \sum_{j=1}^{T} \sum_{i=1}^{N} a_i(jP)\, \Upsilon_i(jP) \right],$$
and we want to maximize this function subject to the cost/bandwidth constraint
$$\limsup_{T \to \infty} \frac{1}{T}\ E\!\left[ \sum_{j=1}^{T} \sum_{i=1}^{N} \gamma_i\, a_i(jP) \right] \le B.$$
For the immediate discussion, we will not use the index $i$; we can focus on any one sensor and explicitly use $i$ later, as needed.
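The accumulated value $\upsilon_i(j)$ is straightforward to compute from a simulated arrival stream. A minimal sketch, with hypothetical arrival and value distributions:

```python
import numpy as np

def epoch_value(arrival_times, values, beta, j, P):
    """Value accumulated during the j-th epoch ((j-1)P, jP]: each
    observation's value decays exponentially until the sampling instant jP."""
    t, v = np.asarray(arrival_times), np.asarray(values)
    in_epoch = (t > (j - 1) * P) & (t <= j * P)
    return np.sum(v[in_epoch] * np.exp(-beta * (j * P - t[in_epoch])))

# Hypothetical rate-3 Poisson arrivals with i.i.d. exponential values.
rng = np.random.default_rng(0)
t = np.cumsum(rng.exponential(1.0 / 3.0, size=50))   # arrival instants
v = rng.exponential(1.0, size=50)                    # observation values
print(epoch_value(t, v, beta=0.5, j=2, P=1.0))
```

Unlike the fluid-flow setting, $\upsilon_i(j)$ is now random, which is why the value function below involves an expectation over the distribution $\phi$ of the per-epoch accrual.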
Let $V_\rho$ be the discounted value function, satisfying
$$V_\rho(s) = \max\left\{ \lambda\gamma + \rho \int V_\rho(\alpha s + \upsilon)\, \phi(d\upsilon),\ \ s + \rho \int V_\rho(\upsilon)\, \phi(d\upsilon) \right\}. \qquad (13)$$
In the expression above, $\phi$ represents the distribution that governs $\upsilon(t)$ for all $t$.
Lemma 3: The solution to (13) has the same properties as the solution to (5):
1) it is unique and bounded;
2) it is continuous;
3) it increases monotonically and is convex.
Proof of this lemma is provided in the appendix. Now, the dynamic program considering average costs can be written as
$$V(s) + \xi = \max\left\{ \gamma\lambda + \int V(\alpha s + \upsilon)\, \phi(d\upsilon),\ \ s \right\};$$
we can use the discounting approach from earlier and also make $V(\cdot)$ unique by forcing $\int V(\upsilon)\, \phi(d\upsilon) = 0$.
We can show that $V(\cdot)$ is monotone and convex using pointwise limits: when $f(\cdot)$ is convex, the map
$$s \mapsto \int f(bs + y)\, \omega(dy)$$
is convex for every $b$ and every probability measure $\omega$ on $\mathbb{R}$.
Using the above, we can then establish the monotonicity and convexity of $V$ for $s \in S$, using a line of reasoning similar to the deterministic case. Then, through iteration, we can show that $V$ is the unique solution to the dynamic programming equation by representing $V(s)$ as follows. For passive $s$, we obtain the representation (15), where $\theta$ is the time at which the sensor is polled for the first time and the maximum is taken over all valid sequences of actions $\{\upsilon(t)\}$; here $\bar{\upsilon} := E[\upsilon(j)]$, and the largest reachable state is $\upsilon' = \bar{\upsilon}/(1 - \alpha)$. For active $s$, we have the representation (16). The second equality for $V(s)$ when $s$ is passive is a consequence of the Optional Stopping Theorem (due to Doob) [14]. Now, considering only the interesting cases where $\xi(\lambda) > \min(\lambda\gamma, \bar{\upsilon})$, the right-hand side of (15) will be less than $s - \xi(\lambda)$ for $s > \upsilon'$. This observation implies that $V(s)$ is smaller than the right-hand side of (16), and it suggests that $s$ must have been active. Reasoning as we did in the deterministic (fluid-flow approximation) situation, we can take the state space to be $[\upsilon, \upsilon']$.
We can show that the optimal policy for polling sensors is a threshold policy exactly like in the deterministic case but with the appropriate change of definition for sets D and D c .

Lemma 4: There is an index policy to solve the sensor selection problem with stochastic observations.
Proof of Lemma 4 is provided in the appendix. Define $\tau$ as the first time $\ge 1$ at which the state of an arm enters $D^c$, with $\Upsilon(0) = \upsilon(0)$; this is the next polling time after 0.
We can restrict our attention to stationary Markovian policies because of the underlying dynamic programming formulation.
Let $\lambda = g(s)$ be the index value at $s$. With $\lambda = g(s)$, let $\tau(s) = E[\tau]$, and let $\xi$ be the optimal average reward.
Using the standard theory for renewal-reward processes [23], we can express the long-run average reward as the expected reward accrued over one polling cycle divided by the expected cycle length. The definition of $g(s)$ is that it is the subsidy needed to make an arm passive at $s$. Then we can note that $s \in D^c$, and the best action would be to poll the sensor. Using the definition of $g(s)$,
$$s = \gamma\, g(s) + E\left[ V\big(\alpha s + \upsilon(1)\big) \right],$$
and therefore we can solve for $g(s)$, based on the observation that all the expectations involved are computable. To solve for $g(s)$, we could adopt the following computational procedure. Fix the threshold policy to use $\hat{s}$ as the threshold; $V$ should then satisfy conditions (17), (18), and (19). The index $g(\hat{s})$ must satisfy $g(\hat{s}) = \lambda$ together with a relation derived from (18). We can write $V(s)$, the unique solution to (17), (18), and (19), so as to make an explicit connection to $\lambda$. We then learn $g(\hat{s})$ through a series of stochastic approximations in which the iterate $g_m$ is forced towards satisfying (18). This gives us the following theorem.
Theorem 2: The index of sensor $S_i$ under fully stochastic observations can be obtained using the foregoing sequence of stochastic approximations.
In this procedure, we assume that $\{\upsilon_m\}$ are random variables that are independent and identically distributed according to the distribution $\phi$. The procedure makes significant computational demands, but it can be reduced to relative value iteration algorithms in which $g_m$ is time-dependent [19]; asymptotically, $g_m$ converges to the index. This computation is needed for each $\hat{s}$, but one could select a finite set of $\hat{s}$ values and then interpolate as a further approximation strategy.
[Figure 2: The proposed index policy typically outperforms greedy and round-robin sensor selection. The Y-axis is the average reward normalized by the greedy reward (hence the greedy policy always has the value 1); the X-axis is the workload intensity, defined as $\sum_{i \in M} \nu_i \times \mu_{arrival} \times \beta_i$, which increases with the data value ($\nu_i$), the data arrival rate ($\mu_{arrival}$), or the decay rate ($\beta_i$) of messages.]
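Since conditions (17)-(19) are not reproduced here, the following sketch only illustrates the general Robbins-Monro form of such a stochastic-approximation iteration; the residual function is a toy stand-in, not the article's condition (18).

```python
import random

def robbins_monro(noisy_residual, g0=0.0, m_max=20000):
    """Generic stochastic approximation: g_{m+1} = g_m + a_m * noisy_residual(g_m),
    with step sizes a_m = 1/m (sum a_m diverges, sum a_m^2 converges), so the
    iterates converge to a root of the expected residual."""
    g = g0
    for m in range(1, m_max + 1):
        g += (1.0 / m) * noisy_residual(g)
    return g

# Toy stand-in residual whose expected root is 1.5.
print(robbins_monro(lambda g: (1.5 - g) + random.gauss(0.0, 0.1)))  # approx. 1.5
```

In the article's procedure, evaluating the residual at $g_m$ would involve simulating renewal cycles of the threshold policy at $\hat{s}$, which is where the computational expense noted above comes from.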
Computational Challenges With the Index Policy: When we consider the fully stochastic nature of the sensing process, the computational complexity of the index policy is high, and this approach, even though it is sub-optimal, may not be practical for large sensor deployments. Consequently, we can use the index policy derived using the fluid-flow approximation as a replacement. We did compare the index policy that we first derived, applied to the fully stochastic scenario, with two other heuristics, namely greedy and round-robin sensor selection. We discuss these two policies in our numerical evaluations (Section VIII); we found that the first index policy does outperform the other heuristics. We therefore believe that the fluid-flow approximation preserves some of the essential problem characteristics and can yield satisfactory policies. We do not include this set of numerical evaluations in the next section because the results are similar to the other results we report.
[Figure 3: The index policy tends to outperform the greedy as well as the round-robin policy, except for a few cases in which round-robin selection has a small advantage. The axes are as in Figure 2: the Y-axis is the average reward normalized by the greedy reward, and the X-axis is the workload intensity.]

VIII. NUMERICAL EVALUATION AND ANALYSIS
The index policies we propose are sub-optimal; the decomposition of the stochastic dynamic program into separate problems per arm results in the sub-optimality. On the other hand, we find that the index policies perform well in comparison to some other conceivable policies. We compare the two proposed policies (labelled IP_v and IP_f) with two other policies:
• Greedy (GD): The greedy sensor selection strategy chooses the sensors that have the highest value at the time of polling (epoch) until it reaches the bandwidth limit.
• Round-robin (RR): The round-robin strategy selects sensors by turn until it reaches the bandwidth limit.
Our discussion here is restricted to the fluid-flow approximation (Section V), since applying the four policies (IP_v, IP_f, GD, RR) to the completely stochastic model yields similar results. We simulated our problem environment for our numerical evaluations in Python 2.7 on a PC with 8 GB RAM and an Intel Core i5 CPU.
The greedy strategy chooses sensors based on their current accumulated reward (that is accumulated for each sensor since its last selection). The round-robin strategy selects sensors in turn and ignores accumulated values and other factors.
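For reference, the two baselines can be sketched in a few lines each; the cursor handling for round-robin is our implementation choice.

```python
def greedy_select(states, costs, budget):
    """Greedy (GD): pick the sensors with the highest accumulated value."""
    order = sorted(range(len(states)), key=lambda i: states[i], reverse=True)
    chosen, used = [], 0.0
    for i in order:
        if used + costs[i] <= budget:
            chosen.append(i)
            used += costs[i]
    return chosen

def round_robin_select(n, costs, budget, cursor):
    """Round-robin (RR): take sensors in turn from a rotating cursor until
    the next sensor in turn would exceed the bandwidth budget."""
    chosen, used = [], 0.0
    for k in range(n):
        i = (cursor + k) % n
        if used + costs[i] > budget:
            break
        chosen.append(i)
        used += costs[i]
    return chosen, (cursor + len(chosen)) % n

print(greedy_select([0.3, 1.2, 0.8, 0.1], costs=[1, 1, 1, 1], budget=2))  # [1, 2]
print(round_robin_select(4, [1, 1, 1, 1], budget=2, cursor=0))            # ([0, 1], 2)
```

Note that GD looks only at the current accumulated value, whereas the index policies fold in the decay rate and arrival statistics; this is the gap the evaluations below quantify.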
For the index policies (IP_v and IP_f), in general, we selected the arms with the highest indices for as long as the bandwidth constraint was not exceeded. However, the hyper-exponential distribution varies in IP_v (unlike remaining fixed in IP_f) based on the occurrence of events in each period of time (as discussed in Section VI).
To compare IP_v and IP_f with GD and RR, we carry out two types of experiments. In the first set of experiments, the bandwidth cost of each sensor/arm was kept identical (i.e., fixed); in the second, we selected these costs at random (i.e., variable).
We focused on studying the performance of the different policies as the number of bandits (or sensors) and the bandwidth limits change. The other parameters in these experiments are chosen as follows:
• The value of a data item ($\nu$) at a sensor is selected from an exponential distribution with parameter 1.0. Note that when we use the fluid-flow approximation, all observations at a particular sensor provide the same value $\nu_i$ at sensor $i$, and therefore we use the described distribution to select this value; the value of data may differ from one sensor to another.
• The rate at which the value of a data item decays (β) is chosen from a uniform distribution over [0.01, 0.99].
A different decay rate is selected for each sensor.
• The rate at which events arrive ($\mu$) in each of the two distributions (in the hyper-exponential) is chosen from a uniform distribution over [0.01, 25]. Again, a different arrival rate is chosen for each sensor.
Workload Intensity: We define a workload intensity metric as
$$\sum_{i \in M} \nu_i \times \mu_{arrival} \times \beta_i,$$
where $M$ is the set of all messages arriving in a simulation run, $\mu_{arrival}$ is the arrival rate of events, and $\nu_i$ and $\beta_i$ are the value and decay rate of each message, respectively. We consider the decay rate as an approximation of the deadline of each message. We calculate this metric for each simulation run over the total number of time steps. Therefore, we report the performance of the policies in terms of the average reward (on the Y-axis) against the workload intensity (on the X-axis) when presenting the simulation results.
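Computing this metric from a simulation trace is a one-liner; the sketch below assumes per-message records of value, arrival rate, and decay rate.

```python
def workload_intensity(values, arrival_rates, decay_rates):
    """Workload intensity: sum over messages of value * arrival rate * decay rate."""
    return sum(v * mu * b for v, mu, b in zip(values, arrival_rates, decay_rates))

# Two messages with hypothetical parameters: 1.2*3.0*0.2 + 0.8*10.0*0.5 = 4.72
print(workload_intensity([1.2, 0.8], [3.0, 10.0], [0.2, 0.5]))
```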

A. IDENTICAL BANDWIDTH COSTS
First, we consider the simple case of equal sensor polling costs ($\gamma = 1$ for all sensors). We start with four bandit arms and a bandwidth limit of two (M = 2), which implies that we can pull two arms at any given epoch. We then increase the number of bandits to 32, with a bandwidth limit of 16, to observe the performance of our algorithms at a slightly larger scale. We ran 1000 Monte Carlo trials for each policy and calculated the average reward attained by each strategy.
The experiments indicate that IP_v and IP_f largely outperform the other two algorithms (Figure 2). To be more specific, IP_v and IP_f always outperformed GD. In comparison with RR, both policies performed better in the majority of cases: IP_v outperformed RR in almost all cases (98% to 100%) with a performance margin of 12.1% to 16.1%, and IP_f outperformed RR in most cases (75% to 98%) with a performance margin of 6% to 11.8%. In both comparisons, when RR did outperform IP, the performance difference was only 0.5%.

B. VARIED BANDWIDTH COSTS
In order to further evaluate our index-based approach, we considered the case of different bandwidth costs among sensors/arms. We randomly assigned a $\gamma$ value to each sensor/arm, selected from a uniform distribution between 0 and M/2, where M is the bandwidth limit. We did not see differences in performance relative to the earlier set of experiments with fixed bandwidth costs. Again, IP_v and IP_f almost always outperformed GD, with a higher performance margin than in the case of fixed bandwidth costs, as shown in Table 1. In comparison with RR, both policies performed better in the majority of cases: IP_v outperformed RR 89% to 99% of the time with a performance margin of 11% to 15.9%, and IP_f outperformed RR 76% to 93% of the time with a performance margin of 7% to 11.3%. In both comparisons, when RR did outperform IP, the performance difference was less than 1.5%. The performance of IP_v and IP_f relative to the other policies is also tabulated (second half of Table 1).

C. IP v VS. IP f
Besides comparing our proposed policies to the other policies, we compared IP_v and IP_f in terms of the total rewards accrued over simulation runs. As the results in Sections VIII-A and VIII-B already suggest, IP_v outperformed IP_f in all simulation setups (as shown in Figure 4). This observation suggests that IP_v is the dominant policy.

D. SELECTING ARMS USING INDICES: TOP-K ARMS VS. KNAPSACK PACKING
The index policy prioritizes the arms/sensors to be polled at an epoch. We can either select the arms with the highest indices until we exhaust the bandwidth, or treat the problem as one of packing a knapsack [4]. We believed that there might be some gains in solving the knapsack problem (even though it is an NP-hard problem) to select arms. However, the numerical results suggest that the more straightforward approach of selecting the arms with the highest indices performs well, and the more elaborate approach seems unnecessary. Note that we use IP_v for this comparison.
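For completeness, the knapsack alternative can be implemented with the standard dynamic program over integer bandwidth costs; this sketch is ours and assumes integral costs.

```python
def knapsack_select(indices, costs, budget):
    """0/1 knapsack over sensors: maximize the total index subject to the
    bandwidth budget (costs and budget assumed integral for the DP table)."""
    n, B = len(indices), int(budget)
    best = [[0.0] * (B + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        c, v = costs[i - 1], indices[i - 1]
        for b in range(B + 1):
            best[i][b] = best[i - 1][b]                     # skip sensor i-1
            if c <= b:                                      # or take it
                best[i][b] = max(best[i][b], best[i - 1][b - c] + v)
    chosen, b = [], B                                       # trace back the chosen set
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return chosen

# With costs [2, 1, 1, 1] and B = 2, the knapsack prefers {2, 3} over {0}.
print(knapsack_select(indices=[0.9, 0.2, 0.7, 0.4], costs=[2, 1, 1, 1], budget=2))
```

The DP runs in O(NB) time per epoch versus O(N log N) for the top-index rule, which is part of why the observed gains did not justify the extra work.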

E. INSIGHT FROM NUMERICAL EVALUATIONS
The main message that we want to communicate from our numerical evaluation is this: the index policy usually does better than the other policies, and when it does under-perform another policy, the difference in performance is rather small (Figures 2 and 3). The index policy performs better in terms of the total accrued reward because both IP_f and IP_v are based on the Whittle-like priority index (derived in Section V) that prioritizes which sensors to poll over a time horizon. These results suggest that IP_v is a policy that we can apply consistently and expect reasonable performance from.

IX. CONCLUSION
We expect that as the deployment of the Internet of Things progresses, we will need to manage a massive data volume, and not all the data can be gathered in one place for processing. When some centralization is needed for data processing, sensor data selection is the first step in a pipeline of tasks that includes data management and analysis to support large-scale reasoning and decision making in a variety of applications [1]. We have concentrated on the first step alone; the amount of data produced by sensors alone can be overwhelming, and we need the right strategies for handling this deluge. Effective filtering of data in the early stages can reduce the pressure on other stages of IoT data processing systems that would store and perform computation on the data.
Based on our analysis and evaluation, the index-based approach, which is computationally simpler than stochastic dynamic programming, is difficult to realize precisely when we consider the stochastic model for event arrivals and observation values. On the other hand, a fluid-flow approximation of the stochastic process may be sufficient to yield a simple and effective index policy for deciding which sensors to poll periodically. Although we poll sensors periodically, the set of sensors polled at each epoch may change; depending on the underlying parameters, we find that an atomic structure may not exist for how this set of sensors changes from one epoch to the next.
We have chosen to model a polling approach and restricted our attention to the problem of deciding which sensors to poll at each epoch. Such polling systems are easy to implement, simplify system architecture, and can be suitable for satisfying specific data privacy requirements.
To conclude, our focus in this work is on the abstraction of polling sensors where we have some notions of delay sensitivity and limitations (such as the privacy sensitivity discussed in Section I) that prevent a push-driven architecture, and we showed that our proposed approach is effective in such applications.
We have not answered the question of what an ideal polling period is. The answer will depend on the characteristics of what is being sensed. We could remove the restriction of periodic polling and allow for adaptive polling intervals. This problem needs further study, although we believe that such adaptivity will lead to a more fragile system architecture.
We want to emphasize, as we conclude, that sensors need not be physical sensors but could also represent feeds on services such as Twitter. The approach we present can be adapted to a variety of applications.
Concerning potential future directions, we envision multiple avenues to explore. For instance, can we generalize this work to other distributions that govern arrival times? One idea would be to come up with multiple solutions and switch among a set of possible solutions appropriately, based on the situation. Also, one could investigate the feasibility of modelling the problem as an online learning problem over the environment.

APPENDIX

A. PROOF OF LEMMA 1
Proof: To show (i), we refer to the theory of Discrete Time Markov Chains (DTMCs), in which having a unique, bounded, continuous solution for $V_\rho$ is standard [22]. For (ii), consider $s_i, s_i' \in \Upsilon_i$ with $s_i \ne s_i'$, and consider processes $\{\Upsilon_i(t)\}$ and $\{\Upsilon_i'(t)\}$ with initial conditions (states) $s_i$ and $s_i'$, respectively. Both processes are controlled by the same actions $\{a(t)\}$. Denote by $T$ the first time instant at which the action is to poll the sensor; that is, $T = \inf\{t \ge 0 : a(t) = 1\}$. Since the state is reset when the sensor is polled, it follows that $\Upsilon_i(t) = \Upsilon_i'(t)$ for all $t > T$ (since the action sequence is the same for both processes), and thus $v_i(\Upsilon_i(t), a(t)) - v_i(\Upsilon_i'(t), a(t)) = 0$ for all $t > T$. On the other hand, the action at each $t < T$ is to keep the sensor idle, and by (3), $v_i(\Upsilon_i(t), a(t)) = v_i(\Upsilon_i'(t), a(t)) = \lambda\gamma_i$ (independently of the state); therefore $v_i(\Upsilon_i(t), a(t)) - v_i(\Upsilon_i'(t), a(t)) = 0$ for all $t < T$ as well. At $t = T$, the states of the processes $\{\Upsilon_i(t)\}$ and $\{\Upsilon_i'(t)\}$ will have evolved according to (2), which for any integer $k \ge 0$ and initial state $s_i$ gives
$$\Upsilon_i(kP) = \alpha_i^k\, s_i + \big(1 + \alpha_i + \cdots + \alpha_i^{k-1}\big)\,\upsilon_i.$$
Observing that $T$ is an integer multiple of $P$, we have
$$\big| V_\rho(s_i) - V_\rho(s_i') \big| \le \alpha_i^{T/P}\, \big| s_i - s_i' \big| \le \big| s_i - s_i' \big|.$$
Interchanging the roles of $s_i$ and $s_i'$, we obtain a symmetric inequality; thus $V_\rho$ is Lipschitz continuous. In order to prove (iii), take $s_i, s_i' \in \Upsilon_i$ with $s_i > s_i'$. Consider processes $\{\Upsilon_i(t)\}$ and $\{\Upsilon_i'(t)\}$ generated by a common action sequence $\{a(t)\}$, where the two processes differ only in the initial state. One can readily verify that $\Upsilon_i(t) \ge \Upsilon_i'(t)$ for all $t$, and thus the corresponding rewards are ordered in the same way. The claimed monotonicity then follows by taking the supremum of both sides of the latter inequality over all valid action sets.
To establish convexity, let $V_{\rho,T}(s_i)$ be the finite-horizon discounted value. It satisfies the following dynamic programming equation:
$$V_{\rho,T}(s_i) = \max\left\{ \lambda\gamma_i + \rho\, V_{\rho,T-1}(\alpha_i s_i + \upsilon_i),\ \ s_i + \rho\, V_{\rho,T-1}(\upsilon_i) \right\}$$
for all $T \ge 1$, with $V_{\rho,0}(s_i) = s_i$. We can establish the convexity of $V_{\rho,T}$ by induction on $T$, since $V_{\rho,0}$ is linear and the maximum of convex functions is convex. Since pointwise limits preserve convexity, the fact that $V_\rho(s_i) = \lim_{T \to \infty} V_{\rho,T}(s_i)$ implies that $V_\rho$ is also convex.

B. PROOF OF LEMMA 3
Proof: Claim (i) can be derived from the standard theory of DTMCs, as stated before.
Claim (ii): Let $\Upsilon(t)$ and $\Upsilon'(t)$ be defined with identical processes for when the sensor records data and for the control (poll/do not poll), with $a(\cdot)$ being optimal for $\Upsilon(\cdot)$; $\Upsilon$ and $\Upsilon'$ differ only in their initial states, which are $s$ and $s'$, respectively. Define
$$t^- := \min\left\{ t \ge 0 : \Upsilon(t) = \Upsilon'(t) \right\};$$
$t^-$ represents the instant at which we first poll this sensor, after which the two processes coincide. We can then show that the value function is continuous, as we did earlier.
To show claim (iii), we consider the processes ϒ(t) and ϒ (t) generated by a common action sequence a(t). Similar to the proof of Lemma 2, we will consider the supremum over all action sets but after taking expectations on both sides. This would establish monotonicity. Convexity follows in a fashion similar to the deterministic case.
C. PROOF OF LEMMA 4
The maximum is over all valid policies and, consequently, over every threshold policy too. Pick some threshold, and let the initial condition be $s \in [\upsilon, \upsilon']$. Now, consider a process that uses the threshold policy. Then $\theta$ is a random variable that does not depend on $\lambda$. Since $\xi'(\lambda) < \gamma$, we can infer that the expression on the right-hand side monotonically increases with $\lambda$. This property holds when we maximize over all threshold-based policies. Let $s^*(\lambda)$ be the optimal threshold with $\lambda$ as the Lagrange multiplier (or subsidy) for passivity. Now, (15) and (16) will continue to hold for $s = s^*(\lambda)$. Define $F(\lambda, s)$ as the right-hand side of (15) for all $s \in D^c$; here the maximization is over all threshold policies. $F(\lambda, s)$ is a convex, increasing function of $s$ because $V$ is convex and increasing in $s$; $F(\lambda, s)$ also increases with $\lambda$. The optimal threshold $s^*(\lambda)$ is a fixed point of $F(\lambda, \cdot)$. The best action at $s = \upsilon$ is to be passive because $\xi(\lambda) > \upsilon$; hence $F(\lambda, \upsilon) > \upsilon$ for all $\lambda$.
Next, we note that it is optimal to be active when $s = \upsilon'$. This gives us $F(\lambda, \upsilon') = \upsilon'$. The convex curve $s \mapsto F(\lambda, s)$ intersects the line $y = s$ at precisely one point in $[\upsilon, \upsilon']$, and the intersection is at $s^*(\lambda)$, by definition. This point increases with $\lambda$ because $F(\cdot, \cdot)$ does.
Hence we conclude that an index policy must exist.

D. SYMBOLS USED
The following table summarizes the symbols used in the article.