To Compute or not to Compute? Adaptive Smart Sensing in Resource-Constrained Edge Computing

We consider a network of smart sensors for an edge computing application that sample a time-varying signal and send updates to a base station for remote global monitoring. Sensors are equipped with both sensing and computing resources, and can either send raw data or process them on board before transmission. Limited hardware resources at the edge generate a fundamental latency-accuracy trade-off: raw measurements are inaccurate but timely, whereas accurate processed updates are available only after a processing delay. Hence, one needs to decide when sensors should transmit raw measurements and when they should rely on local processing to maximize network monitoring performance. To tackle this sensing design problem, we propose an estimation-theoretic optimization framework that embeds both computation and communication latency, together with a Reinforcement Learning-based approach that dynamically allocates computational resources at each sensor. The effectiveness of our approach is validated through numerical experiments motivated by smart sensing for the Internet of Drones and self-driving vehicles. In particular, we show that, under constrained computation at the base station, monitoring performance can be further improved by online sensor selection.


I. INTRODUCTION
Distributed computation scenarios such as the Internet of Things and Industry 4.0 represent a major breakthrough in engineering applications, whereby coordination of sensing and actuation moves away from classical centralized controllers to servers and devices at the network edge. This empowers multiple local systems to jointly achieve complex goals at the global level: this happens with management of electricity and energy harvesting in smart grids [1], [2], resource utilization in smart agriculture [3], [4], modularization and productivity enhancement in Industry 4.0 [5]-[7], urban traffic with interconnected vehicles [8], [9], and space-air-ground services [10].
In particular, recent advances in both embedded electronics, with powerful microcontrollers and GPU processors [11], [12], and new-generation communication protocols for massive networks, such as 5G [13], [14], are currently pushing network systems to rely on sensors and, more generally, edge devices to carry most of the computational burden. Indeed, distributed computation paradigms such as edge and fog computing [15]-[19] and federated and decentralized learning [20]-[22], even though still in their infancy, enjoy febrile activity and excitement across the research community.
Despite the growing resources and technological development, emerging edge technologies are still limited compared to centralized servers: indeed, edge devices are forced to trade off several factors, such as hardware cost, processing speed, and energy consumption. In particular, data processing on devices at the edge requires a non-negligible computational time.
In this work, we consider a group of edge smart sensors, such as compute-equipped IoT nodes or UAVs, that measure a signal of interest, e.g., voltage in a smart grid or movements of vehicles under surveillance, and transmit the measurements to a base station that performs remote global monitoring and possibly decision-making. Limited hardware resources induce a latency-accuracy trade-off at each sensor, which can either supply raw, inaccurate samples of the monitored signal or refine those same data on board by running suitable algorithms, producing high-quality measurements at the cost of a processing delay caused by constrained hardware. Such local processing may consist of averaging or filtering a batch of noisy samples, or feature extraction from images or other high-dimensional data [23], [24], to mention a few examples. Because the monitored system evolves dynamically, delays in transmitted measurements may hinder their usefulness in real-time tasks, so that sensing design for multiple, heterogeneous sensors becomes challenging. In particular, as sensors cooperate, it is unclear which of them should rely on local computation to transmit accurate information, and which ones would be better off sending raw data. Also, channel constraints such as limited bandwidth may introduce non-negligible communication latency, further increasing the complexity of the sensing design. Specifically, local processing might compress acquired samples, so that transmission of raw data to the base station takes longer.

A. Related Literature
Resource allocation in terms of sensor and actuator selection represents a major research topic in IT, robotics, and control.

arXiv:2209.02166v3 [cs.DC] 18 Aug 2023
Classically, the need for selection emerges from maximization of a performance metric subject to a limited resource budget, be it of economical, functional (e.g., weight of autonomous platforms), spatial (e.g., locations to place sensors), or other nature. Typical works in this field [25]-[33] focus on such budget-related constraints and pay little attention to the impact on system dynamics. For example, [25] proposes selection strategies based on coverage probability and energy consumption for a target tracking problem, [33] studies a clustering-based selection to address communication constraints in underwater environments, and [31] tackles placement of cheap and expensive sensors to optimize reconstruction of dynamical variables. The aforementioned works, even though they address computation and/or communication issues, either care about energy consumption or address latency in a qualitative way, but do not use that information to compute an exact performance metric that depends on the system dynamics. Another, more control-theoretic, body of work exploits tools from set-valued optimization, e.g., submodular functions with matroid constraints [34], or studies analytical bounds [35] or convex formulations [30], [36], yet within a static framework that does not address changes in the overall dynamics.
In a similar realm, control theory is traditionally concerned with either channel-aware estimation and control, or co-design of communication and controller, addressing wireless channel issues such as unreliability, latency, and, more generally, limited information [37]-[42]. For example, [37], [42] are concerned with rate-constrained stabilizability, while [38], [41] address LQR and LQG control. More recently, the performance of wireless cyber-physical systems subject to state and input constraints has been thoroughly investigated leveraging model-based prediction and optimization tools such as MPC [43]-[46]. However, this line of work also does not consider processing-dependent delays and their effect on dynamics and performance. Even in recent work on sensing, communication, and control co-design [47], [48], there is no unifying framework that exactly relates sensing and computation on resource-limited platforms to estimation and control performance in dynamical networks. A novel framework concerned with an adaptive design for LQG control that addresses accuracy-dependent sensing latency is presented in [49]. However, it considers a single sensor and proposes a heuristic solution with no theoretical guarantees.
A recent body of literature tailored to edge and fog computing studies distributed computation on resource-constrained devices, focusing on minimization of delays [50]-[53] or latency-dependent energy consumption [54]. While there is a clear, empirically supported intuition that outdated sensory information is detrimental to performance because of the dynamical nature of monitored systems, the above works do not address the true performance metrics (which may be unknown or too complicated to compute), but employ heuristic proxies (e.g., delays) without quantifying the impact on closed-loop performance.
Finally, a similar trend is found within a recent body of the communications literature on Age of Information (AoI) [55]-[58], a metric that quantifies the time elapsed since the latest received update from a source of information. These works focus on minimizing quantities related to the AoI of updates, but typically neglect the dynamics of the measured variables. Also, most works, e.g., [59], [60], assume that the dynamical systems measured by different sensors are uncoupled, limiting the applicability of this approach in networked control systems.

B. Novel Contribution and Organization of the Article
In contrast to previous work, we jointly address sensor local processing, computation and communication latency, and system dynamics towards a dynamical smart sensing design.
In [61], the authors proposed a general model for a processing network, including the impact of computation-dependent delays on monitoring performance, and provided a heuristic sensing design. However, that design is static, i.e., sensors cannot adapt to the monitored system during operation, which may hinder performance. For example, time-varying systems generally prevent the optimal sensing configuration from being static. Also, sensors could store incoming samples into an unlimited buffer. We address such issues through a novel design framework that builds on the insights in [61]. Moreover, this article considerably expands the preliminary version [62], as described next.
First, in Section II-A we propose a novel model for a processing network tailored to data acquisition and transmission by resource-constrained smart sensors. These can adapt their local computation over time and exploit the latency-accuracy trade-off online to maximize global network performance, by choosing to either transmit raw samples or refine data on board. In addition, motivated by [61], we let sensors temporarily stand by (sleep) to alleviate the computational burden of sensor fusion. Roughly speaking, such online sensor selection can crucially improve global monitoring performance if the processing resources available at the base station cannot handle large amounts of sensory data in real time. Remarkably, this result goes against the common wisdom that deploying more sensing resources always improves performance.
In Section II-B, we formulate an optimal design problem to manage sensing resources in a network of smart sensors. We do this by computing an estimation-theoretic performance metric that embeds both dynamical parameters and the accuracy and delays associated with sensory data. To partially overcome the intractability of the problem, in Section III we formulate a simplified version of it, which is tackled via a Reinforcement Learning approach in Section IV; see Fig. 1. Reinforcement Learning, and data-driven methods in general, are now popular in network systems and edge computing because of the challenges raised by real-world scenarios [63]-[65].
Finally, in Section V we validate our approach with numerical experiments motivated by sensing for autonomous driving and Internet-of-Drones tracking. We address realistic communication through an industrial-oriented simulator (OMNeT++) that accurately models the lower layers of the protocol stack. We show that accounting for latency due to resource constraints can improve performance through a careful allocation of sensing and computation. In particular, online sensor selection becomes crucial when a large number of sensors is available.

II. SETUP AND PROBLEM FORMULATION
In this section, we first model a processing network composed of smart sensors (Section II-A), and then formulate the sensing design as an optimal estimation problem (Section II-B).

A. System Model
Dynamical System. The signal of interest is described by a time-varying discrete-time linear dynamical system

x_{k+1} = A_k x_k + w_k,    (1)

where x_k ∈ R^n collects the variables (state) of the system, A_k ∈ R^{n×n} is the state matrix, and white noise w_k ∼ N(0, W_k) captures model uncertainty. This class of models is widely used in control applications, by virtue of its simplicity but also powerful expressiveness [28], [66]-[69]. For example, a standard approach in control of systems modeled through nonlinear differential equations is to approximate the original model as a parameter- or time-varying linear system, for which efficient control techniques are known [28], [70], [71].
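As a concrete illustration, the dynamics (1) can be rolled out in a few lines of Python. The constant-velocity model and noise intensity below are hypothetical, chosen only to make the sketch runnable:

```python
import numpy as np

def simulate(A_seq, W_seq, x0, rng):
    """Roll out x_{k+1} = A_k x_k + w_k with w_k ~ N(0, W_k)."""
    x = np.asarray(x0, dtype=float)
    traj = [x]
    for A, W in zip(A_seq, W_seq):
        w = rng.multivariate_normal(np.zeros(len(x)), W)
        x = A @ x + w
        traj.append(x)
    return np.array(traj)

# Example: 2D constant-velocity (position-velocity) model with step T,
# where process noise stems from unmodeled accelerations
T = 1.0
A = np.array([[1.0, T], [0.0, 1.0]])
W = 0.01 * np.array([[T**3 / 3, T**2 / 2],
                     [T**2 / 2, T]])
rng = np.random.default_rng(0)
traj = simulate([A] * 50, [W] * 50, x0=[0.0, 1.0], rng=rng)
```

A time-varying system is obtained simply by passing different matrices in `A_seq` and `W_seq`.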
In view of the transmission of sensor samples, we assume discrete-time dynamics with time step T, where the subscript k ∈ N denotes the kth time instant kT. Without loss of generality, we fix the first instant k_0 = 1. The sampling time T represents a suitable time scale for the global monitoring and, possibly, decision-making task at hand. For example, typical values of T are one or two seconds for trajectory planning of ground robots, while a higher frequency is required for drones performing a fast pursuit or for self-driving applications.
Smart Sensors. The system modeled by (1) is measured by N smart sensors (or simply sensors) gathered in the set V := {1, . . . , N}, which output a noisy version of the state x_k,

y^(i)_k = x_k + v^(i)_k,    (2)

where y^(i)_k is the measurement produced by the ith sensor at time k, for any i ∈ V, and v^(i)_k is measurement noise. Smart sensors are equipped with processing capabilities alongside standard sensing hardware, and can either transmit raw samples of the signal x_k or locally process acquired samples to provide refined measurements. For example, a smart camera may send raw frames or run computer vision algorithms on the acquired images to get high-quality information, as in typical robot navigation applications where informative features are extracted from visual data. The symbol y^(i)_k refers indifferently to raw or processed measurements: as formalized in Assumption 1, the difference between these two kinds of data is embedded into the measurement noise covariance V_{i,k}.

Remark 1 (Sensor processing). We consider the case where sensor local processing is static, that is, sensors can refine the current sample (as in [49], [61]), but do not (re-)process past samples. This model is suitable for devices that provide data without the need (or possibility) of tracking the history of the measured signal, which is handled by the base station. This is different from, e.g., works [72], [73], where sensor processing is adaptive and involves the history of collected samples. Although it might be possible to integrate this kind of processing into our framework, this is a compelling research direction that will be explored in the future.
Sensors face a latency-accuracy trade-off due to limited hardware: raw data are less accurate, but local data processing introduces extra computational delays that make refined updates more outdated with respect to the current state of the system.
For example, consider a car moving at approximately constant speed, with w_k capturing small unmodeled accelerations: as the car moves, knowledge of its real-time position through the nominal model (constant speed) becomes more and more imprecise because of the unknown accelerations hidden in w_k, which make the car drift away from its nominal trajectory. In this case, a sensor may prefer to sample the system (e.g., collect positions of the car) more often, rather than spending time to obtain precise, but outdated, position measurements.
We formally model the latency-accuracy trade-off with the following assumptions. We also introduce a third operating mode (sleep mode) that lets sensors stand by. The usefulness of sleep mode is tied to the limited computational resources for aggregation of sensory data, and will be motivated in the paragraph "Base Station" below and in Section II-B.
Assumption 1 (Sensing modes). Each sensor i ∈ V can be in raw, processing, or sleep mode. Raw mode: measurements are generated after delay τ_{i,raw} with noise covariance V_{i,k} ≡ V_{i,raw}. Processing mode: measurements are generated after processing delay τ_{i,proc} with noise covariance V_{i,k} ≡ V_{i,proc}. Sleep mode: the sensor is temporarily set idle (asleep): neither data sampling nor transmission occurs in this mode.
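The three sensing modes of Assumption 1 can be captured by a small data structure. This is only an illustrative sketch: the delay and covariance values are hypothetical, and the `total_delay` property anticipates the delay at reception ∆ = τ + δ defined in the "Communication channel" paragraph below:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class SensingMode:
    name: str                 # "raw", "proc", or "sleep"
    tau: Optional[int]        # local sensing/processing delay, in time steps
    delta: Optional[int]      # communication delay, in time steps
    V: Optional[np.ndarray]   # measurement noise covariance (None when asleep)

    @property
    def total_delay(self):
        # Delay at reception: Delta = tau + delta; sleep mode sends nothing
        return None if self.tau is None else self.tau + self.delta

# Hypothetical numbers for one sensor: raw data are fast but noisy,
# processed data are slow but accurate, sleep transmits nothing
raw   = SensingMode("raw",   tau=1, delta=2, V=np.diag([4.0, 4.0]))
proc  = SensingMode("proc",  tau=5, delta=1, V=np.diag([0.5, 0.5]))
sleep = SensingMode("sleep", tau=None, delta=None, V=None)
```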
Next, we define how local operations are ruled over time.
Definition 1 (Sensing policy). A sensing policy for the ith sensor is a sequence of categorical decisions π_i, where each decision γ^i_k sets the sensor sampling at time k in raw (γ^i_k = r) or processing (γ^i_k = p) mode. The next sample (under any mode) occurs at time s_i(k), and the sequence of all sampling instants K_i is defined by the recursion over consecutive sampling times, where K_i[l] denotes the lth element of the sequence, with l ∈ N.

Fig. 2. Measurements are received after delays induced by local computation (rectangular blocks) and communication (dashed arrows). For example, under γ^i_{k_0} = p, the sample acquired at time k_0 is first processed (with processing delay τ_{i,proc}), then transmitted at time k_1 = k_0 + τ_{i,proc} (with communication delay δ_{i,proc}), and finally received at the base station at time k_1 + δ_{i,proc} = k_0 + ∆_{i,proc} (with delay at reception ∆_{i,proc}).
In words, Assumption 3 states that sensors can acquire a new sample only after the previous measurement has been transmitted. This is a realistic assumption if agents have limited storage resources [74]. The effect of a sensing policy on sampling and local data processing is illustrated in Fig. 2.
Communication channel. All sensors transmit data to a common base station through a shared communication channel, which is wireless or wired depending on the application requirements. The channel induces communication latency that may further delay transmitted updates and depends on several factors such as transmission medium, network traffic, and interference. We let δ_{i,raw} and δ_{i,proc} denote the communication delays of raw and processed data transmitted by the ith sensor, respectively. In general, δ_{i,raw} and δ_{i,proc} might differ depending on possible data compression due to processing. In case δ_{i,raw} = δ_{i,proc}, we denote both delays by δ_i. The total delay experienced by updates from sampling to reception at the base station is given by ∆_{i,raw} = τ_{i,raw} + δ_{i,raw} for raw and ∆_{i,proc} = τ_{i,proc} + δ_{i,proc} for processed data. For example, data received after time k_4 cannot be used in the estimation of x_{k_5}; sampling, processing, and transmission are depicted in Fig. 2.
Base Station. Data are transmitted to a base station in charge of estimating the state of the system x_k in real time. Such estimation enables remote global monitoring and decision-making, e.g., coordinated tracking or exploration. Let x̂_k denote the real-time estimate of x_k. In view of the sequential nature of centralized data processing, the real-time estimate of x_k is computed in ϕ_k time (fusion delay), which is proportional to the amount of data used in the update [61]. Consider Fig. 3: from time k_1 through k_4, new data are received at the base station (green dashed arrows). If the estimation routine starts at time k_4, it takes ϕ_{k_5} to process all newly received sensory data (possibly, also old ones if some data arrive out of sequence), and hence the next updated state estimate, x̂_{k_5}, will be available at time k_5 = k_4 + ϕ_{k_5}. Hence, fusion delays induce open-loop predictions that degrade the quality of the computed estimates (similarly to what was discussed about local sensor processing), and motivate sleep mode to reduce the incoming stream of sensory data and improve overall performance [61].
Assumption 4 (Available sensory data). In view of Assumptions 1 and 3, the sensory data available at the base station and used to compute x̂_k at time k are

Y_k = { y^(i)_{K_i[l]} : K_i[l] + ∆_{i,K_i[l]} ≤ k − ϕ_k, i ∈ V, l ∈ N },    (5)

where the lth measurement from the ith sensor, y^(i)_{K_i[l]}, is sampled at time K_i[l] and received after overall delay ∆_{i,K_i[l]}, and ϕ_k is the time needed to compute x̂_k at the base station.

According to Assumption 4, a measurement y^(i)_h can be used to compute the estimate of x_k in real time if it is successfully delivered to the base station (with delay at reception ∆_{i,h}) before or at time k − ϕ_k, where ϕ_k is the amount of time needed to compute x̂_k. Data processing at the base station with limited resources and data availability is depicted in Fig. 3.
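The availability condition of Assumption 4 amounts to a one-line timing check, sketched below (all arguments measured in discrete time steps):

```python
def is_available(sample_time, total_delay, k, fusion_delay):
    """A measurement sampled at `sample_time` with delay at reception
    `total_delay` (Delta) can enter the estimate of x_k only if it
    reaches the base station by time k - phi_k (Assumption 4)."""
    return sample_time + total_delay <= k - fusion_delay

# A sample taken at k=10 with Delta=3 is usable at k=15 if phi_15 = 2,
# but not if the fusion routine needs phi_15 = 3 steps
assert is_available(10, 3, 15, 2)
assert not is_available(10, 3, 15, 3)
```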
Fig. 4. Real-time estimation at the base station. The state estimate is updated at each point in time (top). Because of limited resources at the base station, open-loop updates are performed while fresh sensory data are being processed (bottom), causing estimation to degrade over time through the additive noise w_k in the nominal dynamics (1). As soon as the data processing subroutine produces an updated estimate with new measurements, e.g., x̂_{k_1} at time k_1, the estimation inaccuracy is reduced. Note that the top plot is qualitative: the estimate quality does not degrade linearly, in general.

Remark 2 (Real-time estimation). Based on the above discussion, new data cannot be used by an estimation procedure between times k and k + ϕ_k. In a real system, a real-time state estimate must always be available for effective monitoring. We assume that two parallel jobs are executed. A support subroutine processes received measurements and computes a state estimate at time k in ϕ_k time (cf. Fig. 3). The real-time estimation routine computes one-step-ahead open-loop updates at each point in time according to the nominal dynamics (1) (progressively degrading estimation quality), and resets when the support subroutine outputs an updated estimate with new measurements (with higher estimate quality).² A schematic representation is shown in Fig. 4. Importantly, the degradation of estimation in the top plot is not due to a lack of new measurements (as in the Age of Information literature), but is caused by constrained resources that induce a computational bottleneck in the support subroutine (bottom plot in Fig. 4).

B. Problem Statement
The trade-offs introduced in the previous section call for a challenging sensing design at the network level. In particular, all possible choices of local sensor processing (we refer to a specific choice for all sensors as a sensing configuration) affect global performance in a complex manner, whereby it is unclear which sensors should transmit raw measurements, with poor accuracy and possibly long communication delays, and which ones should refine their samples locally to produce high-quality measurements. In fact, the authors in [61] show that the optimal configuration when considering steady-state performance is nontrivial. Also, the optimal sensing configuration is time-varying, in general. Thus, the sensing policies π_i, i ∈ V, have to be suitably designed to maximize the overall network performance.
The state x_k is estimated via a Kalman predictor, which is the optimal observer for linear systems with Gaussian disturbances. It can be shown, e.g., via state augmentation, that the Kalman predictor is optimal even with delayed measurements, whereby it suffices to ignore updates associated with missing data (see Appendix A in the Supplementary Material). Out-of-sequence arrivals can be handled by recomputing all predictor steps since the latest arrived measurement was acquired, or by more sophisticated techniques [75], [76].

²One-step-ahead open-loop steps are assumed computationally cheap.
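A minimal sketch of such a predictor step is shown below, assuming the identity observation model of (2) (so the observation matrix is C = I); the numerical values in the usage example are hypothetical. When no measurement has arrived, the step reduces to an open-loop prediction, matching the "ignore updates associated with missing data" rule:

```python
import numpy as np

def predictor_step(x, P, A, W, measurements):
    """One Kalman step: always propagate through the nominal dynamics (1);
    fold in whichever measurements (y, V) are available, skipping the
    update entirely when none have arrived."""
    # Time update (open-loop prediction)
    x = A @ x
    P = A @ P @ A.T + W
    # Measurement update for each delivered sample
    for y, V in measurements:
        S = P + V                   # innovation covariance (C = I)
        K = P @ np.linalg.inv(S)    # Kalman gain
        x = x + K @ (y - x)
        P = (np.eye(len(x)) - K) @ P
    return x, P

# Open-loop steps inflate the error covariance...
A, W = np.eye(2), 0.5 * np.eye(2)
x, P = np.zeros(2), np.eye(2)
x, P = predictor_step(x, P, A, W, measurements=[])
# ...while a delivered measurement (y, V) shrinks it again
x, P = predictor_step(x, P, A, W, measurements=[(np.array([1.0, 0.0]), np.eye(2))])
```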
Let x̃_k := x_k − x̂_k denote the estimation error of the Kalman predictor at time k, and let P_k := Var(x̃_k) denote its covariance matrix. We formulate the sensing design as an optimal estimation problem.
Problem 1 (Sensing Design for Processing Network). Given system (1)-(2) and Assumptions 1-4, find the optimal sensing policies π_i, i ∈ V, that minimize the time-averaged estimation error variance over horizon K,

min_{π_i ∈ Π_i, i ∈ V}  (1/K) Σ_{k=1}^{K} Tr(P^π_k),    (6)

where the Kalman predictor f_Kalman(·) computes at time k the state estimate x̂^π_k and the error covariance matrix P^π_k using the data Y^π_k available at the base station according to π := {π_i}_{i∈V}, and Π_i gathers all causal sensing policies of the ith sensor.
Remark 3 (Impact of processing on estimation). Processed measurements are more accurate than raw ones: hence, if delays were neglected, the optimal (trivial) design would be to always process, because this yields the smallest measurement noise variance (Assumption 1) and minimizes the estimation error variance of the Kalman predictor when updates with measurements are performed. However, the computational delays associated with data processing introduce extra open-loop steps that increase the error variance, making the optimal design nontrivial. In other words, uncertainty about the true dynamics (captured in (1) by the noise w_k) makes refined measurements less informative about the current state of the system, so that high accuracy alone might not pay off in real-time monitoring.

Remark 4 (Novelty of sensor selection). Sleep mode actually implements an online sensor selection, whereby sleeping sensors do not supply data to the base station. We identify two key elements that make our framework fundamentally different from standard sensor selection in the literature. First, while we exploit sleep mode towards optimal performance, sensors are typically selected to trade performance for available resources, under the conventional belief that more sensors yield better performance. In contrast, selection emerges naturally in our framework to maximize performance in view of the computational bottleneck at the base station, which may increase the estimation cost in (6). Moreover, rather than a static selection, we allow for dynamical switching to and from sleep mode, which both enables performance improvement through richer design options and is more challenging to optimize.
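The nontriviality discussed in Remark 3 already shows up in a back-of-the-envelope calculation. For a scalar random walk with process noise variance W, a crude approximation of the real-time error variance is the measurement noise variance plus one process-noise term per open-loop step; all numbers below are hypothetical:

```python
# Scalar random walk x_{k+1} = x_k + w_k, w_k ~ N(0, W).
# If the freshest measurement has noise variance V and is d steps old,
# the real-time error variance is roughly V + d * W: the measurement
# pins down x_{k-d} to within V, and each of the d open-loop steps
# adds W of process noise on top.
def realtime_variance(V, d, W):
    return V + d * W

V_raw, d_raw = 4.0, 1      # noisy but timely
V_proc, d_proc = 0.5, 5    # accurate but delayed

# Slow dynamics (small W): processing pays off
W = 0.1
assert realtime_variance(V_proc, d_proc, W) < realtime_variance(V_raw, d_raw, W)

# Fast dynamics (large W): raw data win despite the higher noise
W = 1.0
assert realtime_variance(V_raw, d_raw, W) < realtime_variance(V_proc, d_proc, W)
```

The same measurement qualities can thus favor either mode depending on how fast the monitored system evolves, which is exactly why the optimal design is not "always process".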

III. SENSING POLICY: A CENTRALIZED IMPLEMENTATION
Problem 1 is combinatorial in the number of sensors and raises a computational challenge in finding efficient sensing policies, because the search space may easily explode. For example, 10 sensors yield 2^10 = 1024 possible sensing configurations at each sampling instant. Also, Problem 1 requires designing a potentially asynchronous schedule for each sensor, which is an additional combinatorial problem in the time horizon K. To further complicate things, a sensing policy π_i not only affects the delay and accuracy of the measurements supplied by the ith sensor, but also determines the very sequence of sampling instants K_i (cf. (3)-(4)), augmenting the search space to all possible time sequences over K steps. In particular, sleep mode represents a computational challenge, because it requires the evaluation of all instants subsequent to its activation to decide the best time for triggering a new update.
To partially ease the intractability of the problem, and motivated by practical applications, we restrict the domain of candidate sensing policies to reduce the problem complexity while maintaining a meaningful setup. First, we look at the simple but relevant scenario of a homogeneous network and motivate the design of a centralized policy in Section III-A. We then return to the general scenario of a heterogeneous network and formulate a simplified version of Problem 1 in Section III-B.

A. Homogeneous Network
Sensor Model. In this scenario, all smart sensors have equal measurement noise distributions, with V_k = V_raw or V_k = V_proc for raw and processed data, respectively. Also, all sensors feature identical computational and transmission resources, given by delays τ_raw, δ_raw for raw measurements and τ_proc, δ_proc for processed measurements, respectively (δ in case of no compression). This homogeneous network models the special but relevant case where sensors are interchangeable. This happens, for example, with sensor networks measuring temperature in plants or chemical concentrations in reactors. Also, this model captures smart sensors collecting high-level environmental information, such as UAVs tracking the position of a body moving in space.

Centralized Policy. In this case, it is sufficient to decide how many, rather than which, sensors follow a certain mode. Accordingly, we focus on the design of a centralized policy that commands all sensors with no distinction among them.
Definition 2 (Homogeneous sensing policy). A homogeneous sensing policy is a sequence of categorical decisions, where the ℓth decision is taken at time k^(ℓ) such that n_s sensors are in sleep mode and n_p out of the other N − n_s sensors are in processing mode between times k^(ℓ) and k^(ℓ+1), with 0 ≤ n_s ≤ N and 0 ≤ n_p ≤ N − n_s. Without loss of generality, we set k^(1) = 1.

In words, the base station decides a configuration for all sensors at predefined time instants, which is both practical for applications and convenient to reduce the complexity of the problem. However, decisions may be taken at any times, as long as these are consistent with the sensor computational delays (e.g., to guarantee that one sample is collected for each decision).
With a slight abuse of notation, to denote the mode of a specific sensor that is following the homogeneous decision γ^hom_ℓ, we write γ^i_ℓ = m, meaning that the ith sensor is set in mode m by the ℓth homogeneous decision, where m ∈ {r, p, s} can be raw (r), processing (p), or sleep (s) mode, respectively. We stress that in this context γ^i_ℓ does not represent a decision of a single-sensor sensing policy π_i (as in Definition 1): all decisions are centralized, and γ^i_ℓ denotes the mode that the base station commands to the ith sensor through decision γ^hom_ℓ. By design, centralized decisions are communicated regardless of the current sensing status. In light of common practice in real-time control [77]-[81], we assume what follows (Assumption 5). Formally, given a measurement obeying decision γ^hom_{ℓ−1}, the sampling dynamics (3b) is modified accordingly (cf. (8)). The new sampling mechanism is depicted in Fig. 5. According to Assumption 5, a measurement is not transmitted to the base station if it is not ready when a concurrent decision is communicated. In Fig. 5, the ith sensor discards a measurement whose processing is not completed at time k^(ℓ), when a new decision switches its mode. Formally, a sensor disregards raw (resp. processed) measurements sampled at time k < k^(ℓ) such that k + τ_raw > k^(ℓ) (resp. k + τ_proc > k^(ℓ)), i.e., their acquisition ends after a different mode is imposed by decision γ^hom_ℓ (cf. (8a)). We denote by Y^{π_hom}_k all data available at the base station at time k according to (8) and the discard mechanism imposed by policy π_hom, which excludes some data included in Y_k (cf. (5)).

B. Heterogeneous Network
We now return to the original model (2) with heterogeneous sensors. Without loss of generality, we assume that the sensor set V is partitioned into M subsets V_1, . . . , V_M, where subset V_m, m ∈ {1, . . . , M}, is composed of homogeneous sensors of the mth class. From the discussion in the previous section, it is sufficient to specify how many sensors follow a certain mode within each subset V_m. Hence, we narrow down the domain of all possible policies according to the next definition.
Definition 3 (Network sensing policy). A network sensing policy is a collection π_net := {π_hom,m}_{m=1}^{M}, where each homogeneous sensing policy π_hom,m is associated with the homogeneous sensor subset V_m, and all homogeneous decisions {γ^hom,m_ℓ}_{m=1}^{M} are communicated together at time k^(ℓ).

In Definition 3, the decision times are fixed as in the homogeneous case, so that decisions are communicated to all sensors at once. At time k^(ℓ), the homogeneous decision γ^hom,m_ℓ involves the sensors in V_m, and the overall sensing configuration is given by the ensemble of such decisions. All data available at the base station at time k are collected in Y^{π_net}_k. Finally, we get the following simplified problem formulation.
Problem 2 (Centralized Sensing Design for Processing Network). Given system (1)-(2) with Assumptions 1-5, find the optimal network sensing policy π_net that minimizes the time-averaged estimation error variance over horizon K,

min_{π_net ∈ Π_net}  (1/K) Σ_{k=1}^{K} Tr(P^{π_net}_k),    (9)

where the Kalman predictor f_Kalman(·) computes at time k the state estimate x̂^{π_net}_k and the error covariance matrix P^{π_net}_k, using the data available at the base station according to π_net, and Π_net is the space of causal network sensing policies.

IV. REINFORCEMENT LEARNING ALGORITHM
Assuming complete knowledge of the delays and measurement noise covariances affecting the sensors in the different modes, both Problems 1 and 2 can be solved analytically. However, computing the exact minimizer requires keeping track of all starts and stops of data transmissions for each sensor, resulting in a cumbersome procedure that admits no closed-form expression and requires solving a combinatorial problem that does not scale with the number of sensors. Moreover, the assumptions considered in the problem formulation may be too conservative in real-life scenarios, and the above method cannot be relaxed. Indeed, as soon as either the delays or the covariances are not explicitly known or exhibit some variability, i.e., they are modeled by suitable random variables, the minimization becomes intractable. This is true even if the expectations of these random variables are known, since the dynamics in Problem 2 lead to a nonlinear behavior of the quantity to be minimized.
For the reasons above, we tackle the problem of choosing the optimal sensing policy that minimizes estimation uncertainty through a Reinforcement Learning (RL) algorithm, which performs sequential decision-making suitable for dynamical sensing design and can flexibly address the general problem formulation. Specifically, the RL algorithm runs at the base station and implements a network sensing policy π net by iteratively choosing a sensing configuration at each time k (ℓ) . A scheme of the overall framework is given in Fig. 6.

A. Optimizing Latency-Accuracy Trade-off
The Reinforcement Learning problem of maximizing a reward function through the correct sequence of actions is addressed in this work by the Q-learning algorithm. The latter is a model-free, off-policy algorithm that updates the current estimate of the action-value function targeting an optimistic variant of the temporal-difference error. In a finite Markov Decision Process, this approach converges to the optimal action-value function under standard Robbins-Monro conditions [82], and is efficient with respect to standard competitors [83], [84].
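As a concrete illustration, the tabular Q-learning update targeting the temporal-difference error can be sketched as follows; this is a minimal example with hypothetical state and action sizes, not the exact implementation used in our experiments:

```python
import numpy as np

def q_learning_step(Q, s, a, reward, s_next, alpha=0.1, gamma=1.0):
    """One tabular Q-learning update: move Q[s, a] toward the
    temporal-difference target r + gamma * max_a' Q[s_next, a']."""
    td_target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Hypothetical sizes: 4 discretized covariance bins, 3 sensing configurations.
Q = np.zeros((4, 3))
Q = q_learning_step(Q, s=2, a=1, reward=-5.0, s_next=0)  # Q[2, 1] becomes -0.5
```

With learning rate alpha satisfying the Robbins-Monro conditions and sufficient exploration, this recursion converges to the optimal action-value function in finite MDPs.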
With regard to Problem 1, policy π i is composed of categorical variables corresponding to sensing modes, and characterizes the potential for intervention in the operations of the ith sensor. The constraints due to the centralized implementation in Problem 2 allow us to consider a single policy π net : S → A describing how many sensors are required to process or sleep within each subset V m . In particular, an action a ∈ A is described by M pairs of integers specifying, for each group V m , (i) how many sensors transmit and (ii) how many of the latter are in processing mode (cf. Definition 2). For example, the pair (2, 1) for subset V 1 commands two sensors to transmit, one of them in processing mode, while the remaining sensors sleep; the pairs for the other subsets are specified analogously.
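Under this reading of the action space, the network actions can be enumerated as the Cartesian product of (transmit, process) pairs over the subsets V m. The sketch below is illustrative, with hypothetical subset sizes:

```python
from itertools import product

def group_actions(n_sensors):
    """All (t, p) pairs for one homogeneous subset: t sensors
    transmit (0 <= t <= n_sensors) and p <= t of them process."""
    return [(t, p) for t in range(n_sensors + 1) for p in range(t + 1)]

def network_actions(group_sizes):
    """A network action picks one (transmit, process) pair per subset V_m."""
    return list(product(*(group_actions(n) for n in group_sizes)))

# Two hypothetical subsets with 2 sensors each:
# (2+1)(2+2)/2 = 6 pairs per subset, hence 36 joint actions.
A = network_actions([2, 2])
```

The product structure makes explicit why the centralized action space grows combinatorially with the number of heterogeneous subsets, as discussed in Section IV-B.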
Since we aim to minimize the time-averaged error variance (9a), a straightforward choice of reward function is the negative trace of the matrix P π net k , which evolves according to the Kalman predictor with delayed updates. In the considered framework the base station may change sensing configuration (corresponding to a new action) at each time k (ℓ) ; therefore, a natural way of defining the reward is to average the negative trace of the covariance over the interval between times k (ℓ) and k (ℓ+1) , so that the base station can appreciate the performance of a particular sensing configuration in that interval. This leads to the following instantiation of the RL problem, with k (L+1) := K. The quantity of interest is the trace of the error covariance, and thus a straightforward approach would take S = R + . To keep the Q-learning in a tabular (finite) setting, we discretize the state space through a function d : R + → N + . In particular, the image of d[•] is given by M bins, which were manually tuned in our numerical experiments to yield a fair representation of the values of P π net k observed along the episodes. Then, based on the bin associated with Tr(P k (ℓ) ), the agent outputs a sensing configuration a ∈ A through π net (•) at each time k (ℓ) , given by a ℓ = π net (d[Tr(P k (ℓ) )]).
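A minimal sketch of the state discretization d[•] and of the window-averaged reward follows; the bin edges here are hypothetical stand-ins for the hand-tuned ones used in the experiments:

```python
import numpy as np

def discretize(trace_P, bin_edges):
    """d : R+ -> N+, mapping Tr(P) to one of len(bin_edges) + 1 bins."""
    return int(np.digitize(trace_P, bin_edges))

def window_reward(traces):
    """Reward for one decision interval [k_l, k_{l+1}): the average
    of the negative trace of the error covariance over the window."""
    return -float(np.mean(traces))

bin_edges = [1.0, 5.0, 20.0]      # hypothetical hand-tuned edges -> 4 bins
s = discretize(12.3, bin_edges)   # Tr(P) = 12.3 falls in bin 2
r = window_reward([12.3, 10.1, 9.0])
```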
Notably, choosing γ = 1 and time intervals [k (ℓ) , k (ℓ+1) ] of equal length matches (10) and the objective cost (9a) exactly.

Remark 5 (System dynamics and computational complexity). The Reinforcement Learning procedure is concerned only with the selection of the sensing scheme and not with the computation of the estimate x̂ k (contrary to data-driven estimation). In particular, the sensing configuration is optimized with respect to the evolution (10b) of the covariance matrix P k induced by the Kalman predictor, and not with respect to the actual system dynamics (1), which are measured by the sensors. The Reinforcement Learning step is thus independent of the dimension n of the original system (1), because it deals only with the error variance of the state estimates, while the estimation itself is performed by the Kalman predictor in a model-based fashion.

B. Discussion: Challenges of Reinforcement Learning
While the proposed RL-based solution can tackle Problem 2 in a more flexible and efficient way than brute-force or greedy search, it also comes with nontrivial limitations due to computational and performance challenges of RL algorithms, which we discuss next. Tackling such challenges requires dedicated efforts that will be addressed in follow-up work. Note, however, that the framework considered in this article is representative of a broad range of control applications, and the following issues do not constitute a threat in the current setting.
First, while we consider a single processing mode in Assumption 1, a smart sensor may in general choose among several options to refine raw samples. For example, a robot equipped with cameras may run multiple geometric inference algorithms for perception, each trading runtime for accuracy [49], [85]. In general, the sensing policy of each sensor might feature several design options (modes), which in turn imply a larger action space for the Q-learning and might raise a nontrivial computational challenge, because the total number of actions grows rapidly with the number of modes and sensors.
Second, even though we focus on a centralized learning technique, this inevitably leads to poor computational scalability, especially with heterogeneous sensors, because the action space is the combination of the actions of individual sensors. While many control and robotic applications involve either identical or only a few different sensors, for which a centralized learning approach is feasible, investigating computationally efficient strategies to improve the scalability of training in general is a relevant research question. One way to tackle this challenge might be Multi-Agent Reinforcement Learning [86], where each agent (here, a smart sensor) receives the reward from the environment and autonomously trains its own policy, possibly exchanging information with other agents. In this scenario, each agent chooses only its own actions, so that the total number of actions scales gracefully with the number of agents and permits computationally scalable training. Another argument in favor of this scenario is the possibility of training asynchronous sensing policies tailored to the general problem formulation (6), which is hardly solvable via centralized learning and might prove especially useful to effectively trigger the sleep mode. However, the price to pay is reduced or absent coordination among the agents, which can slow down the overall training or even prevent convergence.
Last, although Q-learning is one of the most widespread algorithms because of its effectiveness and ease of implementation, the proposed procedure could be improved by refining some aspects of the current setup. One of the most challenging aspects is the handling of the continuous state space, which has been solved through a simple discretisation. The latter can be seen as the simplest instance of function approximation, so there could be better ways of addressing the specific state space resulting from the present formulation. It is nonetheless remarkable that satisfactory results can already be achieved with this simple version, proving the flexibility of the Q-learning algorithm. Note indeed that the particular combination of a continuous state space and a discrete action space significantly reduces the range of algorithms that can be applied to solve the problem addressed in this work. The Q-learning algorithm proved flexible enough to handle the difficulties of the non-Markovian environments in both settings considered in Section V. To assess its effectiveness, convergence is numerically studied in Appendix C in the Supplementary Material, together with a discussion on how the sample complexity scales with the number of sensors in the homogeneous setting. While more powerful RL algorithms could give better solutions, an extensive investigation is out of the scope of this work, whose goal is to propose a general methodology for the addressed sensing design problem.

V. NUMERICAL SIMULATIONS
In the previous sections, we have presented an estimation-theoretic framework for optimal sensing design under resource constraints at the processing units and the communication channel, together with a solution approach based on Reinforcement Learning. We next showcase the applicability of our setup through two edge-computing scenarios. This allows us to gain insight into the structure of optimal sensing, and also shows that our proposed approach can outperform standard design choices.
In Section V-A, we consider drones for target tracking and see how online sensor selection can improve performance. In Section V-B, we consider smart sensors monitoring an autonomous car to gain insight into processing allocation for heterogeneous networks. Finally, in Section V-C we elaborate on the role of Reinforcement Learning in conjunction with a model-based tool such as the Kalman filter.

A. Team of Drones for Target Tracking
We simulated a team of 25 drones tracking a vehicle on the road (Fig. 7), modeled as a double integrator [61]. Each drone carries a camera and can either transmit raw frames or run neural object detection on-board, sending fairly precise bounding boxes. We simulated in Python with the parameters in Table I, based on the experiments in [87], [88], and communication delays δ = 10 ms. We set the fusion delays ϕ k proportional to the number of data processed by the Kalman predictor to compute x̂ k . We addressed an optimization horizon [0, K] split into ten 500 ms-long windows, with the Q-learning hyperparameters reported in Table II, where t denotes the tth episode (one episode being one horizon), training for 500000 episodes.
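For concreteness, the double-integrator target model and the covariance recursion of the Kalman predictor that generates Tr(P k) can be sketched as follows; all numerical values here are illustrative placeholders, not the parameters of Table I:

```python
import numpy as np

T = 0.01                               # sampling period (illustrative)
A = np.array([[1.0, T], [0.0, 1.0]])   # double integrator: position, velocity
W = 0.1 * np.eye(2)                    # process noise covariance (assumed)

def predict(P):
    """Open-loop covariance propagation of the Kalman predictor."""
    return A @ P @ A.T + W

def update(P, C, V):
    """Covariance update when a measurement with noise covariance V arrives."""
    S = C @ P @ C.T + V
    K = P @ C.T @ np.linalg.inv(S)
    return P - K @ C @ P

C = np.array([[1.0, 0.0]])             # position measurement (e.g., bounding box)
P = update(predict(np.eye(2)), C, 0.5 * np.eye(1))
# The measurement update can only shrink the predicted position uncertainty.
```

The latency-accuracy trade-off enters through this recursion: a processed measurement carries a smaller V but arrives after more open-loop predict steps.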
The sensing design policy learned for the horizon is shown in Table III (second row). Notably, only raw mode is chosen: in particular, 10 drones are active in raw mode during the first window and 20 through the rest of the horizon. This means that, with the parameters in Table I, data processing at the edge is inconvenient, because the resource constraints of the drones induce long processing delays. In addition, the use of sleep mode, which in fact implements an online sensor selection, improves performance: in words, the transmission of data from all drones cannot be efficiently handled by the base station and introduces extra computation latency, with consequent performance degradation. This finding is remarkable because it clashes with the typical assumption that performance improves monotonically with the number of sensors. We compare our approach against two standard, static design choices: all sensors transmit raw data at all times (all-raw) and all sensors refine measurements at all times (all-processing). The comparison is shown in Fig. 8 and Table IV. Both baselines are outperformed by optimization (9). Interestingly, our solution also keeps the Moving Average (MA) of the error variance small with respect to the other two designs (see Fig. 8). The largest improvement is recorded at steady state, while during the transient all curves are very close, with all-raw performing best at times. This may have two causes: the transient phase is more difficult for the Q-learning to explore, but also, the seemingly sub-optimal behavior during the first two windows might be necessary given that the learning procedure targets the whole horizon. Indeed, the optimal solution to (9) need not patch together policies that optimize different time windows.
To further investigate the structure of the optimal sensing design, we have trained policies with different values of the sensor parameters. In particular, we have considered data processing with progressively higher accuracy, quantified by measurement noise variances v proc ∈ {1, 0.5, 0.1}. The learned sensing policies are shown in Table III. As the accuracy of data processing improves, fewer sensors are needed to achieve high estimation quality (small error variance), while the enhanced processing leads to setting more sensors to processing mode. This is consistent with intuition and may help in the design of real applications. Detailed results for the two additional cases are given in Appendix B in the Supplementary Material.

Remark 6 (Energy saving). An appealing side effect of our proposed design, through the online sensor selection it induces, is reduced energy consumption, which can increase the lifespan of the system. Considering industrial devices such as Genie Nano cameras [89], with typical power consumption of 3.99 W for sampling and transmission, and assuming 0.15 W for data processing [90], the energy consumption under the compared sensing policies is shown in Fig. 9 and Table V. In particular, our policy uses only 76% of the energy consumed by all-raw.

Remark 7 (Computational scalability). As mentioned in Section IV-B, our centralized learning approach can handle small-to-medium network sizes but may struggle when the number of sensors is large. To evaluate how the learning complexity scales with the network, we have run experiments with various numbers of drones, which are reported in Appendix C.
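The energy accounting of Remark 6 can be sketched as follows. The per-window sensor counts below are illustrative, and the assumption that processing sensors also pay the transmission cost is ours:

```python
P_TX = 3.99    # W: sampling and transmission (Genie Nano figure, Remark 6)
P_PROC = 0.15  # W: on-board data processing (assumed figure, Remark 6)

def window_energy(n_raw, n_proc, duration_s):
    """Energy (J) consumed by active sensors in one decision window;
    sleeping sensors are assumed to consume no power."""
    return duration_s * (n_raw * P_TX + n_proc * (P_TX + P_PROC))

# Illustrative 0.5 s window with 25 drones: 20 in raw mode (5 asleep) vs. all-raw.
learned = window_energy(20, 0, 0.5)    # 39.9 J
all_raw = window_energy(25, 0, 0.5)    # 49.875 J
ratio = learned / all_raw              # 0.8: sleep mode saves 20% in this window
```

Summing such terms over the windows of an episode yields the totals reported in Fig. 9 and Table V; the exact 76% figure depends on the learned per-window counts.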

B. Smart Sensing for Self-Driving Vehicle
In our second experiment, we considered a self-driving car traveling at approximately constant speed. Specifically, we considered its transversal position with respect to the center of the lane, which is estimated by an internal controller (base station) that receives data from sensors on-board the car and tracks the car trajectory (Fig. 10), for example to control a lane shift at sustained speed (e.g., for passing or on a highway). The car dynamics are modeled through a double integrator, which is a flexible choice used for uncertain dynamics with direct control of accelerations [47], [91]-[94]. Given such a model, the Kalman predictor is an effective estimator, assuming that lateral movements are limited compared to the car speed. We considered two radar devices, two cameras, and one lidar, which are commonly employed in self-driving applications [95]. Many techniques used in autonomous driving exploit lidar point clouds, such as segmentation, detection, and classification tasks [96]. Also, radars are emerging as a key technology for such systems. Some of today's self-driving cars, e.g., Zoox, are equipped with more than 10 radars providing 360° surrounding sensing capability under any weather conditions [97]. Finally, camera images are essential to enable the commercialization of self-driving cars with autonomy at level 3 [98]. The sensor parameters (Table VI, with V raw = v raw I and V proc = v proc I) were chosen based on real-world experiments [99], with sampling period T = 1 ms to ensure real-time vehicle control.
The sensor network is designed according to the architecture proposed in [100]: here, smart sensors embed a sensing device (e.g., a camera) and a microcontroller that can refine raw sensory data to decrease the computational effort for sensor fusion. The base station is a controller inside the car that manages all jobs needed for autonomous driving. Because the application is safety-critical, transmissions occur through two redundant high-speed Ethernet cables. In light of the small number of sensors and the transmission speed, we assume that communication latency is negligible with respect to sampling and processing.
Communication was simulated through the discrete-event simulator Objective Modular Network Testbed in C++ (OMNeT++) [101]. This simulator is widely adopted for network simulation because it combines standard communication protocols (e.g., IEEE 802.3) with the possibility of creating customized procedures exploiting existing modules. Further, it enables realistic simulations by accurately modeling both the electromagnetic environment and the lower layers of the protocol stack (from the physical to the transport layer). In our simulations, sensors carry IEEE 802.3 (so-called Ethernet) communication boards.
For training, we considered a time horizon [0, K] split into five time windows of length 300 ms each, and trained for 100000 episodes with the hyperparameters reported in Table VII.
From Table VIII we can infer that the learned policy requires processing from almost all the sensors when the error variance is high (top row). However, the need for processing diminishes with the variance, turning both the lidar and the radars to raw mode at the smallest values (bottom row). Interestingly, processing mode is always chosen for the cameras, revealing that the refinement of image frames outweighs the additional computational delay. Note that in this case, given the small number of sensors, the fusion delays induced at the base station are negligible and sleep mode is never selected, namely, the sensors always transmit.
The learned policy was tested against the two standard design choices all-raw and all-processing, as in the previous scenario. The outcome over the horizon is plotted in Fig. 11 and summarized in Table IX. As can be appreciated from Fig. 11, the Q-learning learns to cleverly allocate computational resources according to the current estimate accuracy. During the transient phase (until 600 ms), when the error variance is large, processing mode is selected for the lidar, the cameras, and one radar, according to the first two rows in Table VIII. Notably, this choice performs close to all-processing (red curve), while the all-raw configuration is clearly disadvantageous (higher blue curve). Conversely, at steady state only the cameras are in processing mode: this resembles more closely the all-raw policy, which performs better (lower blue curve) than all-processing. Overall, we can see from Table IX that the proposed approach leads to a total improvement of about 5% compared to the baseline policies. While this result may look marginal, we note that the improvement is rather small over the main transient phase, because the Kalman predictor is able to drop the error variance very quickly for all sensing configurations, but it is considerably larger (about 15-20%) when the curves settle about small values. Also, while the objective cost (9a) refers to the whole horizon, we note that in fact the learned policy performs better than the baselines at nearly every point in time, as Fig. 11 shows, with the curve obtained with the Q-learning policy being almost always below the others. Further, the MA is again consistently smaller than for both baselines, highlighting an even better performance of the proposed approach with respect to the targeted optimization.

C. Discussion: the Role of Learning in Model-Based Estimation
The exposed simulations suggest that the proposed approach can improve the performance of smart sensor networks dealing with estimation tasks, as compared to standard design choices with static processing decisions. In particular, the learning-based design exploits online observation of the estimation error to select effective sensing configurations at different points in time, while the baselines cannot adapt to transient or steady-state regimes that benefit from different processing allocations.
It is noteworthy that a learning method such as Q-learning can effectively drive the sensing design, leading to improvement with respect to the baselines, even with an estimation tool as effective and robust as the Kalman predictor. Indeed, due to the optimality of the latter algorithm applied to the chosen dynamical system, one can expect even trivial choices (such as all-raw and all-processing) to yield acceptable performance. Conversely, it is hard to suggest good heuristics in the present framework, as the performance varies with the system dynamics, delays, and error variances. In particular, an optimal design given all available options is far from trivial: even the simplest setup bears a combinatorial problem that quickly makes deriving an optimal solution computationally infeasible. Indeed, the submodularity properties that allow one to analytically bound the suboptimality of greedy algorithms [47] are hard to meet in realistic scenarios, e.g., under delays, out-of-sequence message arrivals, or multi-rate sensors [61].
Given these premises, the performance improvements obtained via the studied learning method are encouraging not only with regard to the addressed framework, but mostly in supporting the contribution of such tools to general estimation and control tasks, which can benefit from the power of learning to circumvent computational bottlenecks associated with optimization-based design. Hence, rather than looking at the two domains of model-based and data-driven control as mutually exclusive approaches, this work aims to reinforce arguments supporting a unified, best-of-both-worlds framework.

VI. CONCLUSION
Motivated by smart sensing for Edge Computing, we have proposed an adaptive design that addresses the impact of resource-constrained data sampling, processing, and transmission on the performance of a monitoring task. Starting from a suitable mathematical model for the considered class of systems, we have tackled the sensing design problem via Q-learning, showing that the learned design can considerably improve performance compared to standard configurations that do not adapt to the time evolution of the system.
Future research avenues are multifold. Besides the challenges of Reinforcement Learning (see Section IV-B), the model assumptions may be adjusted to address more realistic sensing and communication, as well as different dynamics or control tasks. Also, our approach should be validated with real-world data.

(11) with k M ≤ k − ϕ k . The following procedure can handle out-of-sequence measurements sampled at or after time k 0 (the oldest sample in (11)) and received before or at time k − ϕ k . For the sake of clarity, in (11) we have omitted subscripts and superscripts related to sensors. The estimation error covariance associated with x̂ k given by the Kalman predictor, starting from P k0 and using the measurements in Y k , is given by [61], where the multi-step open-loop update between time k i and time k j ≥ k i (due to the lack of measurements in (k i , k j )) propagates the covariance without corrections, and the update with the ith measurement sampled at time k i incorporates that measurement.
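The reprocessing procedure described in Appendix A can be sketched as follows: starting from P k0, the covariance is propagated open loop between time-sorted measurement times and updated at each of them. This is a hedged reconstruction of the standard recursion under generic system matrices, not the exact expressions of [61]:

```python
import numpy as np

def covariance_with_delayed_updates(P0, A, W, C, V, meas_times, k_end):
    """Error covariance at k_end of a Kalman predictor that restarts from P0
    and re-applies time-sorted (possibly out-of-sequence) measurements."""
    P, prev = P0, 0
    for k_i in sorted(meas_times):
        for _ in range(k_i - prev):        # multi-step open-loop update
            P = A @ P @ A.T + W
        S = C @ P @ C.T + V                # update with measurement at k_i
        K = P @ C.T @ np.linalg.inv(S)
        P = P - K @ C @ P
        prev = k_i
    for _ in range(k_end - prev):          # open loop up to k_end
        P = A @ P @ A.T + W
    return P

# Illustrative double-integrator instance: measurements arrive out of order
# (step 5 before step 2) but are re-applied in time-sorted order.
A = np.array([[1.0, 0.01], [0.0, 1.0]])
W, V = 0.1 * np.eye(2), 0.5 * np.eye(1)
C = np.array([[1.0, 0.0]])
P = covariance_with_delayed_updates(np.eye(2), A, W, C, V, [5, 2], k_end=8)
```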

APPENDIX B TEAM OF DRONES FOR TARGET TRACKING: ADDITIONAL SIMULATION RESULTS
In this appendix, we report additional experiments where we investigate how the sensing design by Q-learning varies as we change the measurement noise variance of processed data, i.e., the accuracy of data provided by sensors in processing mode.
Table X summarizes performances of the optimized sensing policies against the two baselines all-raw and all-processing.
Figures 12 and 13 show the behavior of the error variance along the addressed horizon [0, K] when the measurement noise of processed data has variance v proc equal to 0.5 and 0.1, respectively.
Finally, Fig. 14 shows the estimated energy consumption under the considered sensing designs, which is significantly reduced by the adaptive policies learned through our approach.This demonstrates an attractive by-product of a careful design that takes into account model features such as latency and accuracy of supplied sensory data.

APPENDIX C HEURISTIC CONVERGENCE OF Q-LEARNING
Although the convergence of the Q-learning algorithm has been theoretically established in [82], this result holds for Markov Decision Processes, while the setting under investigation in this paper does not enjoy the Markov property because of the entanglement between the estimator's dynamics and the sensors' operations. For instance, when the transition between two windows happens, the current measurements that have yet to be delivered are discarded (see Fig. 5): the state as defined cannot account for this process, so the dynamics considered in the Reinforcement Learning formulation are not Markovian. One may then wonder whether the Q-learning approach is flexible enough for this problem, and in particular whether it converges around a fixed policy within the number of episodes that have been considered. Due to the non-Markovianity of the environment, the answer can only be provided through numerical experiments.
In this appendix, we show the convergence results for several training instances related to the first setting (homogeneous network), in which a team of drones tracks a vehicle (see Section V-A and Appendix B).
Figure 15 shows the behaviour of the long-term reward (the negative sum of the traces of the error covariance over the different windows within an episode; see Problem (10)) of the supposedly optimal policy under the current Q-table. By interacting with the environment and trying different sensor configurations, the algorithm learns a more reliable Q-table, whose maximisation leads to a policy that performs better and better on the true environment, as is clear from the plot. For the simple case of N = 10 sensors, the algorithm reaches an empirically stable value already at around 175000 episodes.
In order to understand how the sample complexity of the algorithm scales with the number of sensors, the same graph is drawn for the cases of N = 25 and N = 35 sensors in Fig. 16.
The curve with N = 25 starts to converge around episode 400000, while the one with N = 35 does not reach a visible convergence within the chosen training horizon (500000 total episodes), corresponding to a superlinear trend. This can be understood from the fact that the size of the action set scales quadratically with the number of sensors (as there are 3 configurations for each sensor), and the latest proposed bound for the sample complexity of Q-learning [84] establishes a linear dependence on the size of the action set, therefore leading to a quadratic dependence on the number of sensors in our scenario. Note, however, that with a higher number of sensors the optimal policy necessarily attains a steady-state value equal to or higher than the one obtained with fewer sensors: adding more sensors cannot degrade performance since, in the worst case, the additional sensors can be put in sleep mode. Therefore, the green curve will eventually reach the same value as the orange one, or a higher one.
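The quadratic growth can be verified by counting actions for the homogeneous case, where an action only fixes how many sensors transmit and how many of those process (a simple sketch under our counting convention):

```python
def n_actions(N):
    """Number of (transmit, process) actions for N homogeneous sensors:
    t transmitters (0 <= t <= N) and p processing among them (0 <= p <= t)."""
    return sum(t + 1 for t in range(N + 1))   # equals (N + 1)(N + 2) / 2

sizes = {N: n_actions(N) for N in (10, 25, 35)}
# {10: 66, 25: 351, 35: 666}: quadratic in the number of sensors
```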
In the case of heterogeneous sensors, the size of the action set scales exponentially with the number of different kinds of sensors available, thus making it computationally very expensive to perform a numerical analysis of the convergence of the algorithm. Note also that such an analysis would need to account for both the total number of sensors and the number of different kinds of sensors, involving many more experiments than the homogeneous case. An extensive convergence analysis is out of the scope of the present contribution.
Recall also that the case of many different (heterogeneous) sensors is quite atypical in real-world large-scale control applications, which usually comprise only a few types of sensors; this makes the computational scalability, and the related convergence analysis, similar to the homogeneous case.

Fig. 1. Scheme of the proposed methodological framework: the RL algorithm learns a sensing design to maximize performance of the estimation algorithm.

Fig. 3. Data processing at the base station. Resource-constrained centralized processing introduces fusion delay ϕ k5 to estimate x k5 . Measurements y (i) k1

Fig. 5. Homogeneous sensing policy. Sampling and data processing at identical sensors are ruled by policy π hom . Decision γ hom ℓ is communicated at time k (ℓ) and realized at individual sensors as γ i ℓ = r and γ j ℓ = s. Concurrently, the ith sensor disregards its current processed measurement (red cross) and switches to raw mode, acquiring a new sample at time k (ℓ) .

Assumption 5 (Sampling frequency with homogeneous sensing policy). Decision γ hom ℓ switches the mode of the minimum number of sensors possible. If the ith sensor switches mode, the measurement currently being acquired or processed (if any) is immediately discarded. If the new commanded mode is either raw or processing, a new sample is acquired according to such new mode right after the decision γ hom ℓ is communicated.

Fig. 7. Drone tracking simulation setup. The base station estimates the trajectory of the moving target (car) based on visual updates from drones.

Fig. 10. Autonomous-driving simulation setup. On-board sensors measure the position of the car and a centralized microcontroller tracks its trajectory.

Luca Ballotta received the Master's degree in Automation Engineering and the Ph.D. degree in Information Engineering from the University of Padova, Italy, in 2019 and 2023, respectively. He was a Visiting Student at the Massachusetts Institute of Technology in 2020 and 2022. He was awarded the Young Author Prize at the 2020 IFAC World Congress. His research interests include multi-agent systems and networked control systems subject to resource constraints, resilient distributed optimization, and learning-based safe control.

Giovanni Peserico received the Master's degree in Automation Engineering and the Ph.D. degree in "Alto Apprendistato" in Information Engineering in collaboration with Autec s.r.l. from the University of Padova, Italy, in 2019 and 2023, respectively. He is now a Cybersecurity Software Engineer at Qascom, an Italian company specialized in GNSS authentication and space cybersecurity. His research interests include safety and cybersecurity, industrial and wireless networks, networked control systems, and learning-based safe control.

Francesco Zanini received the Master's degree in Automation Engineering and the Ph.D. degree in Information Engineering from the University of Padova, Italy, in 2019 and 2023, respectively. He was a Visiting Student at the University of Alberta in 2022, and later joined the institution as a post-doctoral researcher in 2023. His research interests lie at the intersection of reinforcement learning and dynamical systems, along with Koopman operators and learning theory.

Paolo Dini received the M.Sc. and Ph.D. degrees from the Università di Roma La Sapienza in 2001 and 2005, respectively. He is currently a Senior Researcher with the Centre Tecnológic de Telecomunicacions de Catalunya (CTTC), where he coordinates the activities of the Sustainable Artificial Intelligence research unit. His research interests include sustainable computing and networking, distributed optimization and machine learning, multi-agent systems and decision-making processes, and data mining for cyber-physical systems. He has been involved in more than 25 research projects during his career. He is currently the Coordinator of the CHIST-ERA SONATA project on sustainable computing and communication at the edge and the Scientific Coordinator of the MSCA Greenedge European Training Network on edge intelligence and sustainable computing. His research activity is documented in more than 90 peer-reviewed scientific journals and international conference papers. He received two awards from the Cisco Silicon Valley Foundation for his research on heterogeneous mobile networks, in 2008 and 2011. He has co-organized several training events and workshops/special sessions at international conferences sponsored by IEEE. He serves as a TPC member in many international conferences and as a reviewer for several scientific journals of the IEEE, Elsevier, ACM, and Springer. He has been a European Climate Pact Ambassador since 2022 and participates in several outreach events (e.g., Research Nights) to promote sustainable design principles.

APPENDIX A
KALMAN PREDICTOR WITH DELAYED UPDATES
Assume that at time k − ϕ k the base station can use the following time-sorted measurements to compute x̂ k ,

Fig. 2. Data collection and transmission. Computation at the ith sensor is ruled by sensing policy π i .

Learning framework. The RL algorithm receives the accuracy of estimates (state) and outputs sensing configurations (action) that affect sensory data.

TABLE I. SENSOR PARAMETERS FOR DRONE-TRACKING SCENARIO.

TABLE III. NETWORK SENSING POLICY π NET LEARNED BY Q-LEARNING. THE NUMBERS SHOW HOW MANY SENSORS TRANSMIT (PROCESS) ACROSS TIME.

TABLE IV. MEAN ERROR VARIANCE IN DRONE-TRACKING SIMULATION.

TABLE VI. SENSOR PARAMETERS FOR AUTONOMOUS-DRIVING SCENARIO.

TABLE VIII. Q-LEARNING POLICY FOR AUTONOMOUS-DRIVING SCENARIO.

TABLE IX. MEAN ERROR VARIANCE IN AUTONOMOUS-DRIVING SIMULATION.
Table XI reports in detail the energy consumption under the learned sensing policies for the different cases of v proc over transient and steady-state.

TABLE X. MEAN ERROR VARIANCE IN DRONE-TRACKING SIMULATION. FIGURES BETWEEN BRACKETS SHOW COST DECREASE W.R.T. SECOND BEST POLICY.