On-Demand AoI Minimization in Resource-Constrained Cache-Enabled IoT Networks With Energy Harvesting Sensors

We consider a resource-constrained IoT network, where multiple users make on-demand requests to a cache-enabled edge node to send status updates about various random processes, each monitored by an energy harvesting sensor. The edge node serves users' requests by deciding whether to command the corresponding sensor to send a fresh status update or to retrieve the most recently received measurement from the cache. Our objective is to find the best actions of the edge node to minimize the average age of information (AoI) of the measurements received upon request, i.e., the average on-demand AoI, subject to per-slot transmission and energy constraints. First, we derive a Markov decision process model and propose an iterative algorithm that obtains an optimal policy. Then, we develop a low-complexity algorithm – termed relax-then-truncate – and prove that it is asymptotically optimal as the number of sensors goes to infinity. Simulation results illustrate that the proposed relax-then-truncate approach significantly reduces the average on-demand AoI compared to a request-aware greedy policy and a weighted-AoI policy, and show that it performs close to the optimal solution even for moderate numbers of sensors.


I. INTRODUCTION
The Internet of Things (IoT) is a key technology for providing ubiquitous, intelligent networking solutions to create a smart society. In IoT sensing networks, sensors measure physical quantities (e.g., speed or pressure) and send the measurements to a destination for further processing. IoT networks are subject to stringent energy limitations due to battery-powered sensors. This energy scarcity is often counteracted by energy harvesting (EH) technology, relying on, e.g., solar or ambient RF sources. Moreover, reliable control actions in emerging time-critical IoT applications (e.g., drone control and industrial monitoring) require high freshness of the information received by the destination. Such destination-centric information freshness can be quantified by the age of information (AoI) [1], [2]. These considerations call for designing effective AoI-aware status updating procedures for IoT networks that provide the end users with timely status of remotely observed processes while accounting for the limited energy resources of EH sensors.
We consider a resource-constrained IoT sensing network that consists of multiple EH sensors, a cache-enabled edge node, and multiple users. Users are interested in timely status information about the random processes associated with physical quantities (e.g., speed or temperature), each measured by a sensor. We consider request-based status updating, where the users demand the status of physical quantities from the edge node, which acts as a gateway between the users and the sensors. The edge node is equipped with a cache that stores the most recently received status update packet from each sensor. Upon receiving request(s) for the status of a physical quantity, the edge node has two options to serve the requesting user(s): either command the corresponding sensor to send a fresh measurement, i.e., a status update packet, or use the aged measurement from the cache. The former enables the edge node to serve the user(s) with fresh measurements, yet consumes energy from the sensor's battery. The latter prevents the activation of the sensors for every request, so that the sensors can utilize a sleep mode to save a considerable amount of energy [3], but the data received by the users becomes stale. Thus, there is an inherent trade-off between the AoI at the users and conservation of the sensors' energy in the finite batteries.
In particular, the considered status updating network is subject to the following energy and transmission constraints. First, since the sensors rely only on the energy harvested from the environment, their batteries may be empty and thus they cannot send an update for every request. This energy causality induces an inherent per-slot energy constraint. Second, motivated by the limited amount of radio resources (e.g., bandwidth, time-frequency resource blocks), only a limited number of sensors can send fresh status updates to the edge node in each time slot, imposing a per-slot transmission constraint.
The objective of our network design is to keep the freshness of information at the users as high as possible, subject to the constraints in the system. To this end, we use the concept of on-demand AoI [4], which quantifies the freshness of information at the users restricted to the users' request instants. We aim to find an optimal policy, i.e., the best action of the edge node at each time slot, that minimizes the average on-demand AoI over all the sensors and users, subject to the per-slot transmission and energy constraints. It is worth emphasizing that on-demand AoI minimization differs from conventional AoI optimization in that the freshness of information matters only when user(s) request the information, i.e., an optimal policy for the on-demand AoI minimization problem adapts to the request pattern.
We first cast the problem as a Markov decision process (MDP) and propose an algorithm that obtains an optimal policy. Moreover, since the complexity of finding an optimal policy increases exponentially in the number of sensors, we propose an asymptotically optimal low-complexity algorithm – termed relax-then-truncate – and show that it performs close to the optimal solution.

A. Contributions
The main contributions of our paper are as follows: • We consider the on-demand AoI minimization problem in a multi-user multi-sensor IoT EH network subject to per-slot transmission and energy constraints. We formulate the problem as an MDP and propose an iterative algorithm that finds an optimal policy.
• To deal with massive IoT scenarios, we propose a sub-optimal low-complexity algorithm whose complexity increases linearly in the number of sensors. In particular, we relax the per-slot transmission constraint into a time average constraint, model the relaxed problem as a constrained MDP (CMDP), obtain an optimal relaxed policy, and propose an online truncation procedure to ensure that the transmission constraint is satisfied at each time slot.
• We analytically find an upper bound for the difference between the average cost obtained by the proposed relax-then-truncate approach and the average cost obtained by an optimal policy. Then, we show that the relax-then-truncate approach is asymptotically optimal as the number of sensors goes to infinity.
• Numerical experiments are conducted to analyze the performance of the proposed relax-then-truncate approach and show that it significantly reduces the average on-demand AoI as compared to a request-aware greedy policy. Interestingly, the proposed algorithm performs close to the optimal solution even for moderate numbers of sensors.
Our considered model is highly relevant to resource-constrained IoT scenarios with a massive number of devices. To the best of our knowledge, this work is the first to propose an asymptotically optimal low-complexity algorithm for minimizing on-demand AoI in an IoT network with multiple EH sensors.
In [5], the authors proposed AoI-optimal scheduling algorithms for a broadcast network where a base station updates the users on random information arrivals under a transmission capacity constraint. In [6], the authors developed low-complexity scheduling algorithms, including a Whittle's index policy, and derived performance guarantees for a broadcast network. In [7], [8], the optimality of the Whittle's index policy was investigated for the AoI minimization problem where a central entity schedules a number of users among the total available users for transmission over unreliable channels. In [9], the authors studied AoI-optimal scheduling under a constraint on the average number of transmissions, where the source sends status updates to a destination (user) over an error-prone channel. The authors in [10] extended [9] to a multi-user setting, where the source has to decide not only when to transmit but also to which user. In [11], the authors proposed an asymptotically optimal algorithm for the AoI-optimal scheduling problem under both bandwidth and average power constraints in a wireless network with time-varying channel states. In [12], the authors studied the AoI minimization problem in a multi-source relaying system under per-slot transmission and average resource constraints.
Different from [5]–[12], another line of research [13]–[21] focused on the class of problems where the sources are powered by energy harvested from the environment, i.e., investigating AoI-optimal scheduling policies subject to the energy causality constraint at the source(s). The works [13]–[20] studied AoI-optimal scheduling in single-sensor EH networks where the sensor sends time-sensitive information to the user(s). The authors of [13] derived age-optimal sampling instants for the sensor by assuming known EH statistics. In [14], the authors studied AoI-optimal policies under an erasure channel with retransmissions where the channel and EH statistics are either known or unknown. In [15], AoI-optimal scheduling was studied where the sensor takes advantage of multiple available transmission modes. The work [16] investigated AoI-optimal scheduling in a cognitive radio EH system. In [17], the authors studied age-optimal scheduling under stability constraints in a multiple access channel with two heterogeneous nodes (including an EH node) transmitting to a common destination. In [18], the sensor monitors a stochastic process and tracks its evolution; thereby, a modified definition of AoI was proposed to account for the discrepancy at the remote destination. In [19], the monitoring node (sensor) collects status updates from multiple heterogeneous information sources. In [20], the authors studied AoI-optimal scheduling for a wireless powered communication system under the costs of generating status updates at the sensor nodes. In [21], the authors developed a deep RL algorithm for minimizing the average age of correlated information in an IoT network with multiple correlated EH sensors whose status updates are processed by a data fusion center.
The majority of the literature on AoI minimization, including all the works above, assumes that the time-sensitive information of the source(s) is needed at the destination at all time moments. However, in many applications, a user demands fresh status updates only when it needs such timely information. To account for such information freshness driven by users' requests, we introduced the concept of on-demand AoI in [4], [22]. In these works and a follow-up work [23], the main focus was on-demand AoI minimization in an IoT network with multiple decoupled EH sensors, as opposed to tackling the transmission-constrained status updating problem herein. Only a few works have studied a concept similar to the on-demand AoI. In [24], the authors introduced the idea of effective AoI (EAoI) under a generic request-response model where a server serves the users with time-sensitive information. In [25], the authors studied an information update system where a user pulls information from servers.
However, in contrast to our paper, the works [24], [25] do not consider energy limitations at the source nodes. In [26], [27], the authors introduced the AoI at query (QAoI) and developed an MDP-based policy iteration method to find an optimal policy that minimizes the average QAoI, considering an energy-constrained sensor that is queried to send updates to an edge node under limited transmission opportunities. The QAoI metric [26], [27] is equivalent to our on-demand AoI when particularized to the single-user single-sensor case.

A. Network Model
We consider a multi-user multi-sensor IoT sensing network that consists of a set K = {1, . . ., K} of K energy harvesting (EH) sensors, an edge node (a gateway), and a set N = {1, . . ., N} of N users, as depicted in Fig. 1. Users are interested in timely status information about random processes associated with physical quantities f_k, e.g., speed or temperature, each of which is independently measured by sensor k ∈ K. We consider request-based status updating, where the users send requests on demand for obtaining the status of quantities f_k, k ∈ K. When a request for the physical quantity f_k is generated at the user side, the associated sensor k may send a status update packet that contains the measured value of the monitored process and a time stamp of the generated sample. We assume that there is no direct link between the users and the sensors, i.e., the users receive the status updates only via the edge node.
We consider a time-slotted system with slots indexed by t ∈ N. At the beginning of slot t, users send requests for the status of physical quantities f_k to the edge node. Let r_{k,n}(t) ∈ {0, 1}, t = 1, 2, . . ., denote the random process of requesting the status of f_k by user n; r_{k,n}(t) = 1 if the status of f_k is requested by user n ∈ N at slot t and r_{k,n}(t) = 0 otherwise. The requests are independent across the users, sensors, and time slots. Let p_{k,n} be the probability that the status of f_k is requested by user n at each slot, i.e., Pr{r_{k,n}(t) = 1} = p_{k,n}. Note that there can be multiple users requesting f_k at each slot; r_k(t) = ∑_{n=1}^{N} r_{k,n}(t) ∈ {0, 1, . . ., N} indicates the number of requests for f_k at slot t. We assume that all requests that arrive at the beginning of slot t are handled by the edge node during the same slot t. To this end, we assume that all the communication links, i.e., the sensor-edge and edge-user links, are error-free.
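The request model above is easy to simulate. The following minimal Python sketch draws one slot of independent Bernoulli requests and aggregates them into the per-sensor request counts r_k(t); the function names and the list-of-lists encoding are our own illustration, not from the paper.

```python
import random

def sample_requests(p, rng):
    """One slot of requests: r[k][n] ~ Bernoulli(p[k][n]),
    independent across users, sensors, and slots."""
    return [[1 if rng.random() < p_kn else 0 for p_kn in row] for row in p]

def num_requests(r_slot):
    """r_k(t) = sum over n of r_{k,n}(t): users requesting f_k this slot."""
    return [sum(row) for row in r_slot]

# Example: K = 2 sensors, N = 3 users, uniform request probability 0.6.
rng = random.Random(42)
p = [[0.6] * 3 for _ in range(2)]
r = sample_requests(p, rng)
counts = num_requests(r)   # each entry lies in {0, 1, 2, 3}
```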
The edge node is equipped with a cache of size K that stores the most recently received status update packet from each sensor. Upon receiving a request for the status of f_k at slot t, the edge node has two options to serve the request: 1) command sensor k to send a fresh status update, or 2) use the previous measurement from the cache. Let a_k(t) ∈ {0, 1} be the command action of the edge node at slot t; a_k(t) = 1 if the edge node commands sensor k to send an update and a_k(t) = 0 otherwise. We consider that, due to the limited amount of radio resources (e.g., time-frequency resource blocks), no more than M ≤ K sensors can transmit status updates to the edge node within each slot. This transmission constraint imposes a limitation on the number of commands as ∑_{k=1}^{K} a_k(t) ≤ M, ∀t. (1) We refer to M as the transmission budget hereinafter.

B. Energy Harvesting Sensors
We assume that the sensors harvest energy from the environment for sustainable operation.
We model the energy arrivals at the sensors as independent Bernoulli processes with intensities λ_k, k ∈ K. This characterizes the discrete nature of the energy arrivals in a slotted-time system, i.e., at each slot, a sensor either harvests one unit of energy or not (see, e.g., [13], [18], [28]). Let e_k(t) ∈ {0, 1}, t = 1, 2, . . ., denote the energy arrival process of sensor k. Thus, the probability that sensor k harvests one unit of energy during one slot is λ_k, i.e., Pr{e_k(t) = 1} = λ_k, ∀t.
Sensor k stores the harvested energy in a battery of finite size B_k (units of energy). Formally, let b_k(t) denote the battery level of sensor k at the beginning of slot t, where b_k(t) ∈ {0, . . ., B_k}.
A commanded sensor transmits only if its battery is non-empty, i.e., d_k(t) = a_k(t) 1{b_k(t) ≥ 1}, (2) where 1{·} is the indicator function. Note that d_k(t) in (2) characterizes the energy consumption of sensor k at slot t. It is also worth noting that by (2), we have d_k(t) ≤ a_k(t), and consequently, (1) implies that ∑_{k=1}^{K} d_k(t) ≤ M for all slots; hence the name transmission constraint for (1). Finally, using the defined quantities b_k(t), d_k(t), and e_k(t), the evolution of the battery level of sensor k is expressed as b_k(t + 1) = min{b_k(t) − d_k(t) + e_k(t), B_k}. (3)
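The energy dynamics above can be sketched in a few lines of Python; the helper names are ours, and the logic follows the per-slot rules just described: spend one unit per transmitted update (only possible with a non-empty battery), add the harvested unit, and cap at the battery capacity.

```python
def consumed(a_k, b_k):
    # d_k(t) = a_k(t) * 1{b_k(t) >= 1}: a commanded sensor transmits
    # only if its battery holds at least one unit of energy.
    return a_k if b_k >= 1 else 0

def next_battery(b_k, a_k, e_k, B_k):
    # b_k(t+1) = min{b_k(t) - d_k(t) + e_k(t), B_k}.
    d_k = consumed(a_k, b_k)
    return min(b_k - d_k + e_k, B_k)
```

For instance, a commanded sensor with an empty battery consumes nothing (the command is wasted), and a full battery discards a harvested unit unless one was spent in the same slot.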

C. On-demand Age of Information
To measure the freshness of information seen by the users in our request-based status updating system, we use the notion of age of information (AoI) [1] and define on-demand AoI [4]. In contrast to AoI, which measures the freshness of information at every slot, on-demand AoI quantifies the freshness of information only at the users' request instants.
Let ∆_k(t) be the AoI about the physical quantity f_k at the edge node at the beginning of slot t, i.e., the number of slots elapsed since the generation of the most recently received status update packet from sensor k. Let u_k(t) denote the most recent slot in which the edge node received a status update packet from sensor k, i.e., u_k(t) = max{t′ | t′ < t, d_k(t′) = 1}. Thus, the AoI about f_k at the edge node is ∆_k(t) = t − u_k(t). We make a common assumption (see, e.g., [4], [7], [9], [10], [12], [14]–[16], [18]–[22]) that ∆_k(t) is upper-bounded by a finite value ∆_max, i.e., ∆_k(t) ∈ {1, 2, . . ., ∆_max}. Besides tractability, this accounts for the fact that once the available measurement about f_k becomes excessively stale, further counting would be irrelevant. At each slot, the AoI about f_k drops to one if the edge node receives a status update from the corresponding sensor; otherwise, it increases by one. Accordingly, the evolution of ∆_k(t) can be written as ∆_k(t + 1) = 1 if d_k(t) = 1 and ∆_k(t + 1) = min{∆_k(t) + 1, ∆_max} otherwise, which can be expressed compactly as ∆_k(t + 1) = d_k(t) + (1 − d_k(t)) min{∆_k(t) + 1, ∆_max}. (4) We define on-demand AoI for a sensor-user pair (k, n) at slot t as the sampled version of (4), where the sampling is controlled by the request process r_{k,n}(t), i.e., ∆^{OD}_{k,n}(t) = r_{k,n}(t) ∆_k(t + 1). (5) In (5), since the requests come at the beginning of slot t and the edge node sends the measurements to the users at the end of the same slot, ∆_k(t + 1) is the AoI about f_k seen by the users.
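The AoI recursion (4) and its request-sampled version (5) can be sketched directly; the function names below are illustrative, not the paper's notation.

```python
def next_aoi(delta, d, delta_max):
    # Equation (4): drops to one on a received update (d = 1),
    # otherwise ages by one slot, truncated at delta_max.
    return 1 if d == 1 else min(delta + 1, delta_max)

def on_demand_aoi(r_kn, delta, d, delta_max):
    # Equation (5): the AoI seen by user n is counted only at slots
    # where that user actually requests f_k (r_kn = 1).
    return r_kn * next_aoi(delta, d, delta_max)
```

Note that a slot with no request contributes zero regardless of how stale the cached measurement is, which is exactly what distinguishes on-demand AoI from conventional AoI.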
1) State: The state of the system at slot t is expressed as s(t) = (s_1(t), . . ., s_K(t)), where the per-sensor state is s_k(t) = (r_k(t), b_k(t), ∆_k(t)) ∈ S_k; the state space S has a finite dimension |S| = ∏_{k=1}^{K} (N + 1)(B_k + 1)∆_max. 2) Action: As discussed in Section II-A, the edge node decides at each slot whether to command sensor k to send a fresh status update (and update the cache) or not, i.e., a_k(t) ∈ A_k = {0, 1}, where A_k is the per-sensor action space. The action of the edge node at slot t is given by a K-tuple a(t) = (a_1(t), . . ., a_K(t)) ∈ A with action space A = {a ∈ {0, 1}^K | ∑_{k=1}^{K} a_k ≤ M}. It is worth stressing that the action space A accounts for the transmission constraint (1) in its definition. Additionally, we define the relaxed action space A_R = {0, 1}^K that does not consider the transmission constraint (1). 3) Policy: A policy π is a rule that determines the action by observing the state. Particularly, a randomized policy is a mapping from a state s ∈ S to a probability distribution π(a|s), with ∑_{a∈A} π(a|s) = 1, of choosing each possible action a ∈ A. A deterministic policy is a special case of the randomized policy where, in each state s, π(a|s) = 1 for some a; with a slight abuse of notation, we use π(s) to denote the action taken in state s by a deterministic policy π. In addition, we define a (relaxed) policy as π_R : S × A_R → [0, 1] and a per-sensor policy as π_k : S_k × A_k → [0, 1].

4) Cost Function:
We consider a cost function that incurs a penalty with respect to the staleness of a status update requested and received by a user. Accordingly, we define the cost associated with user n and sensor k at slot t as the on-demand AoI for the sensor-user pair in (5). Then, we define the per-sensor cost at slot t as c_k(t) = ∑_{n=1}^{N} r_{k,n}(t) ∆_k(t + 1) = r_k(t) ∆_k(t + 1). (6) Remark 1. Note that, due to the multiplicative factor r_k(t), (6) accounts for the number of requests for each physical quantity at each slot, i.e., the more requests for f_k, the more important the corresponding freshness becomes. In particular, when the status of f_k is not requested by any user at slot t, i.e., r_k(t) = 0, the immediate cost becomes c_k(t) = 0.

E. Problem Formulation
For the considered system, the energy and transmission constraints pose limitations on when and how often a new status update can be generated at each sensor, which in turn affect the on-demand AoI. Our objective is to keep the on-demand AoI as small as possible, subject to the constraints in the system.
Formally, for a given policy π, we define the average cost as the average on-demand AoI over all sensors and users, i.e., C̄_π = lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_π [ (1/(KN)) ∑_{k=1}^{K} c_k(t) | s(0) ], where E_π[·] is the (conditional) expectation when the policy π is applied to the system and s(0) = (s_1(0), . . ., s_K(0)) is the initial state. We aim to find an optimal policy π* that achieves the minimum average cost, i.e., π* = argmin_π C̄_π. (P1)

III. MDP MODELING AND OPTIMAL POLICY
In this section, we model the problem (P1) as an MDP and propose a value iteration algorithm that finds an optimal policy π*.

A. MDP Modeling
The MDP is defined by the tuple (S, A, Pr(s(t + 1)|s(t), a(t)), c(s(t), a(t))), where the state space S and the action space A were defined in Section II-D. The cost function c(s(t), a(t)) represents the cost of taking action a(t) in state s(t), which is given by the aggregate of the per-sensor costs in (6). The state transition probability Pr(s(t + 1)|s(t), a(t)) maps a state-action pair at slot t onto a distribution of states at slot t + 1. The probability of transition from the current state s(t) = (s_1(t), . . ., s_K(t)) to the next state s(t + 1) = (s_1(t + 1), . . ., s_K(t + 1)) under action a(t) = (a_1(t), . . ., a_K(t)) factorizes as Pr(s(t + 1)|s(t), a(t)) = ∏_{k=1}^{K} Pr(s_k(t + 1)|s_k(t), a_k(t)), which follows from the fact that, given action a, the state associated with each sensor (i.e., the per-sensor state) evolves independently of the other sensors. Above, the per-sensor state transition probability Pr(s_k(t + 1)|s_k(t), a_k(t)) gives the probability of transition from per-sensor state s_k(t) to per-sensor state s_k(t + 1), and it factorizes into the request, battery, and AoI dynamics in (10). The probabilities in (10) are calculated in the following.
The random variable r_k = ∑_{n} r_{k,n} is a sum of independent Bernoulli trials that are not necessarily identically distributed. Therefore, it follows a Poisson binomial distribution [29], i.e., Pr{r_k = j} = ∑_{A⊆N, |A|=j} ∏_{n∈A} p_{k,n} ∏_{n∉A} (1 − p_{k,n}), j = 0, . . ., N. At each slot, sensor k consumes one unit of energy when sending a status update (i.e., when d_k(t) = 1) and harvests one unit of energy with probability λ_k; thus, the battery evolves according to (3). According to (4) and (2), given the current battery level b_k, AoI ∆_k, and action a_k, the next value of the AoI can be obtained deterministically.
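The Poisson binomial pmf above has exponentially many terms when written as a sum over subsets, but it can be computed efficiently by convolving one Bernoulli at a time. The sketch below is our own illustration of that standard O(N²) dynamic program, not code from the paper.

```python
def poisson_binomial_pmf(probs):
    """pmf[j] = Pr{ sum_n Bernoulli(p_n) = j }, computed by folding in
    one user's Bernoulli trial at a time."""
    pmf = [1.0]                              # zero users: the sum is 0
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for j, mass in enumerate(pmf):
            nxt[j] += mass * (1.0 - p)       # user n does not request
            nxt[j + 1] += mass * p           # user n requests
        pmf = nxt
    return pmf
```

With identical probabilities this reduces to the ordinary binomial distribution, e.g. `poisson_binomial_pmf([0.5, 0.5])` yields the masses 1/4, 1/2, 1/4.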

B. Optimal Policy
In this section, we propose an iterative algorithm that obtains an optimal policy π* for (P1). To this end, we first define the accessibility condition for an MDP and prove that our MDP model in Section III-A satisfies this condition. Then, we present a proposition that characterizes an optimal policy π* for (P1). An MDP satisfies the accessibility condition if every two states can be reached from each other under some stationary policy.
Proposition 1. The MDP defined in Section III-A is weakly communicating.
Proof. The proof is presented in Appendix A.
Proposition 2. The optimal average cost achieved by an optimal policy π*, denoted by C̄* (i.e., C̄* = C̄_π*), is independent of the initial state s(0) and satisfies the Bellman equation, i.e., there exists h(s), s ∈ S, such that C̄* + h(s) = min_{a∈A} [ c(s, a) + ∑_{s′∈S} Pr(s′|s, a) h(s′) ], ∀s ∈ S. (14) Further, an optimal action taken in state s is given by π*(s) ∈ argmin_{a∈A} [ c(s, a) + ∑_{s′∈S} Pr(s′|s, a) h(s′) ]. Proof. By Proposition 1, the weak accessibility condition holds; thus, the proof follows directly from [30, Prop. 4.2.1] and [30, Prop. 4.2.3].
It is worth noting that any function h satisfying (14) is unique up to an additive factor, i.e., if h satisfies (14), then h + α, where α is any constant, also satisfies (14). The proposed relative value iteration algorithm (RVIA) is presented in Algorithm 1, where θ is a small constant for the RVIA termination criterion.
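As a concrete illustration of relative value iteration for an average-cost MDP, the following self-contained Python sketch iterates the Bellman operator and subtracts the value at a reference state, as in (14). The dictionary encoding of the transition kernel `P` and cost `c`, and all names, are our own; this is a generic sketch, not the paper's Algorithm 1.

```python
def rvia(states, actions, P, c, theta=1e-9, max_iter=10000):
    """Relative value iteration for an average-cost MDP.
    P[s][a]: list of (next_state, prob); c[s][a]: immediate cost."""
    V = {s: 0.0 for s in states}
    ref = states[0]                          # arbitrary reference state
    gain = 0.0
    for _ in range(max_iter):
        # One Bellman backup: Q(s) = min_a [ c(s,a) + sum_s' P(s'|s,a) V(s') ]
        Q = {s: min(c[s][a] + sum(pr * V[s2] for s2, pr in P[s][a])
                    for a in actions) for s in states}
        gain = Q[ref]                        # estimate of the average cost
        h = {s: Q[s] - gain for s in states} # relative values
        done = max(abs(h[s] - V[s]) for s in states) < theta
        V = h
        if done:
            break
    # Greedy policy with respect to the converged relative values.
    policy = {s: min(actions, key=lambda a: c[s][a]
                     + sum(pr * V[s2] for s2, pr in P[s][a]))
              for s in states}
    return gain, policy
```

On a toy two-state chain where action a deterministically moves to state a and staying in state 0 is free, the iteration recovers zero average cost and the "always choose 0" policy.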
It is important to point out that the state space S and action space A grow exponentially with respect to the number of sensors K, and thus the complexity of the RVIA presented in Algorithm 1 grows exponentially in K. Accordingly, finding an optimal policy π* is practical (tractable) only for small numbers of sensors. To this end, we next propose a low-complexity sub-optimal algorithm whose complexity increases only linearly in K.

IV. LOW-COMPLEXITY ALGORITHM DESIGN: RELAX-THEN-TRUNCATE APPROACH
In this section, to handle massive IoT scenarios, we propose a low-complexity algorithm that provides a sub-optimal solution to problem (P1). The key observation is that the per-slot constraint (1) couples the actions a_k(t), k ∈ K, which results in the exponential complexity of finding an optimal policy for (P1), as explained in Section III. Therefore, we start by relaxing the per-slot constraint (1) into a time average constraint and subsequently model the relaxed problem as a constrained MDP (CMDP). The CMDP problem is then transformed into an unconstrained MDP problem through the Lagrangian approach [32]. The MDP problem decouples along the sensors and, therefore, for a fixed value of the Lagrange multiplier, we can find a per-sensor optimal policy. The optimal value of the Lagrange multiplier is found via bisection. This provides an optimal policy for the relaxed problem, called the optimal relaxed policy hereinafter. Finally, we propose an online truncation procedure to ensure that the constraint (1) is satisfied at each slot.
We remark that our optimality analysis in Section V shows that the proposed relax-then-truncate approach is asymptotically optimal as the number of sensors goes to infinity.

A. CMDP Formulation
We relax the constraint (1) and formulate the relaxed problem as a CMDP. To this end, we define the average number of command actions per sensor under a policy π_R as J̄_{π_R} = lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_{π_R} [ (1/K) ∑_{k=1}^{K} a_k(t) | s(0) ], and express the relaxed problem as (P2): min_{π_R} C̄_{π_R} subject to J̄_{π_R} ≤ Γ, (18) where Γ ≜ M/K is the normalized transmission budget. We model (P2) as a CMDP defined by the tuple (S, A_R, Pr(s(t + 1)|s(t), a(t)), c(s(t), a(t))), where the state space S and the relaxed action space A_R were defined in Section II-D, and Pr(s(t + 1)|s(t), a(t)) and c(s(t), a(t)) were defined in Section III-A. Note that the only difference between the CMDP tuple and the MDP tuple in Section III-A is in the action space (A_R vs. A).
It is worth noting that any policy π that satisfies the per-slot transmission constraint (1) also satisfies the time average transmission constraint (18) in (P2). Thus, the average cost obtained by following an optimal relaxed policy π_R* is a lower bound on the average cost obtained under an optimal policy π*, i.e., C̄_{π_R*} ≤ C̄_{π*}. (19) To solve the CMDP problem (P2), we introduce a Lagrange multiplier µ and define the Lagrangian associated with problem (P2) as L(π_R, µ) = C̄_{π_R} + µ ( J̄_{π_R} − Γ ). (20) For a given µ ≥ 0, we define the Lagrange dual function L*(µ) = min_{π_R} L(π_R, µ). A policy that achieves L*(µ) is called µ-optimal, denoted by π_{R,µ}*, and it is a solution of the following (unconstrained) MDP problem (P3): min_{π_R} L(π_R, µ). Since the dimension of the state space S is finite, the growth condition [32, Eq. 11.21] is satisfied. Moreover, the immediate cost function is bounded below, i.e., c(s, a) ≥ 0, ∀a, s. Under these conditions, the optimal value of the CMDP problem (P2), C̄_{π_R*}, and the optimal value of the MDP problem (P3), L*(µ), satisfy the following relation [32, Corollary 12.2]: C̄_{π_R*} = max_{µ≥0} L*(µ). (22) Therefore, an optimal policy for (P2) is found by a two-stage iterative algorithm: 1) for a given µ, we find a µ-optimal policy, and 2) we update µ in a direction that attains C̄_{π_R*} according to (22). These two steps are detailed in Sections IV-A1 and IV-A2, respectively.
1) An Optimal Policy for a Fixed Lagrange Multiplier: For a given µ, the problem of finding an optimal policy π_{R,µ}* in (P3) is separable across the sensors k ∈ K. Thus, (P3) can be decoupled into K per-sensor problems as follows. We express the Lagrangian in (20) equivalently as a sum of per-sensor Lagrangians L_k(π_k, µ) (up to the constant term −µΓ), where the per-sensor policy π_k was defined in Section II-D3. Thus, finding an optimal policy π_{R,µ}* reduces to finding K per-sensor optimal policies, denoted by π_{R,µ,k}*, k = 1, . . ., K, as (P4): min_{π_k} L_k(π_k, µ), k = 1, . . ., K. Each sub-problem (P4), for a particular k, can be modeled as an (unconstrained) MDP problem. Particularly, the MDP model associated with sensor k is defined by the tuple (S_k, A_k, Pr(s_k(t + 1)|s_k(t), a_k(t)), c_k(s_k(t), a_k(t)) + µ a_k(t)), where the per-sensor state space S_k and the per-sensor action space A_k were defined in Section II-D, the per-sensor state transition probabilities Pr(s_k(t + 1)|s_k(t), a_k(t)) are calculated as in (10), and the cost of taking action a_k(t) in state s_k(t) is c_k(s_k(t), a_k(t)) + µ a_k(t). Proposition 3. The per-sensor MDP formulated for (P4) is communicating, i.e., for every pair of states s, s′ ∈ S_k, there exists a stationary policy under which s′ is accessible from s.
Proof. The proof is presented in Appendix B.
By Propositions 2 and 3, rewriting the Bellman equation in (14) for the per-sensor MDP formulation yields L_k*(µ) + h_{R,µ,k}(s) = min_{a∈A_k} [ c_k(s, a) + µa + ∑_{s′∈S_k} Pr(s′|s, a) h_{R,µ,k}(s′) ], ∀s ∈ S_k, (25) where L_k*(µ) ≜ min_{π_k} L_k(π_k, µ). In addition, an optimal action in state s ∈ S_k is given by π_{R,µ,k}*(s) ∈ argmin_{a∈A_k} [ c_k(s, a) + µa + ∑_{s′∈S_k} Pr(s′|s, a) h_{R,µ,k}(s′) ]. By turning (25) into an iterative procedure, h_{R,µ,k}(s) and consequently π_{R,µ,k}*(s), s ∈ S_k, are obtained iteratively. Particularly, at each iteration i = 0, 1, . . ., we compute V^{(i+1)}_{R,µ,k}(s) by applying the minimization in (25) to V^{(i)}_{R,µ,k}, and set h^{(i+1)}_{R,µ,k}(s) = V^{(i+1)}_{R,µ,k}(s) − V^{(i+1)}_{R,µ,k}(s_ref), where s_ref ∈ S_k is an arbitrary reference state. For any initialization V^{(0)}_{R,µ,k}(s), the sequences {h^{(i)}_{R,µ,k}(s)} converge to h_{R,µ,k}(s), s ∈ S_k. The proposed RVIA is presented in Algorithm 2 (Lines 13-26).
Next, to give insight into optimal policies, we analyze the properties of a per-sensor optimal policy π_{R,µ,k}* obtained by the proposed RVIA. We first prove that V_{R,µ,k}(s) has monotonic properties. Then, exploiting this monotonicity, we prove that an optimal policy has a threshold-based structure with respect to the AoI. Lemma 1. The function V_{R,µ,k} is non-decreasing with respect to the AoI, i.e., for any two states s = (r, b, ∆) ∈ S_k and s′ = (r, b, ∆′) ∈ S_k, where ∆′ ≥ ∆, we have V_{R,µ,k}(s′) ≥ V_{R,µ,k}(s).
Proof. The proof is presented in Appendix C. Theorem 1. For any µ ≥ 0, a per-sensor optimal policy π_{R,µ,k}* obtained by the RVIA has a threshold-based structure with respect to the AoI, i.e., if an optimal action in state s = (r, b, ∆) is π_{R,µ,k}*(s) = 1, then for all states s′ = (r, b, ∆′), ∆′ ≥ ∆, an optimal action is also π_{R,µ,k}*(s′) = 1.
Proof. The proof is presented in Appendix D.
2) Determination of the Optimal Lagrange Multiplier: Recall that the cost function associated with the per-sensor MDP formulation (established for (P4)) is defined as c_k(s_k(t), a_k(t)) + µ a_k(t).
To search for µ* as defined in (28), we apply the bisection method, which exploits the monotonicity of J̄_{π_{R,µ}*} with respect to µ. Particularly, if (1/K) ∑_{k=1}^{K} J̄_{π_{R,µ,k}*} ≤ Γ for µ = 0, then the constraint (18) is inactive, and an optimal policy for (P2) is π_{R,0}*. Otherwise, we apply an iterative update procedure until |µ⁺ − µ⁻| < ε and (1/K) ∑_{k=1}^{K} J̄_{π_{R,µ,k}*} ≤ Γ are satisfied. The details are given in Algorithm 2, where ε is a small constant for the bisection termination criterion.
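The bisection step can be sketched as follows. Here `avg_commands(mu)` stands in for the (expensive) evaluation of the normalized average command rate under the µ-optimal policy, and the non-increasing monotonicity assumption mirrors the one exploited in the text; the function names and the fixed search interval are our own illustrative choices.

```python
def bisect_mu(avg_commands, gamma, mu_hi=64.0, eps=1e-6):
    """Bisection on the Lagrange multiplier mu.
    avg_commands(mu): normalized average command rate under the
    mu-optimal policy, assumed non-increasing in mu; mu_hi is assumed
    large enough to make the constraint feasible."""
    if avg_commands(0.0) <= gamma:
        return 0.0                # constraint inactive: mu* = 0
    lo, hi = 0.0, mu_hi
    while hi - lo >= eps:
        mid = 0.5 * (lo + hi)
        if avg_commands(mid) <= gamma:
            hi = mid              # feasible: try a smaller penalty
        else:
            lo = mid              # infeasible: increase the penalty
    return hi
```

For a toy monotone rate such as 1/(1 + µ) and budget Γ = 0.25, the routine converges to µ* ≈ 3, the smallest multiplier at which the time average constraint holds.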
Remark 2. It is worth noting that the complexity of finding an optimal relaxed policy π_R* increases linearly in the number of sensors, whereas the complexity of finding an optimal policy π* grows exponentially in K. Consider a scenario with K = 100 sensors, N = 7 users, ∆_max = 64, and B_k = 15. The size of the state space S is |S| = (8 × 16 × 64)^100 = 2^1300 ≈ 10^391, whereas the per-sensor state space size is only |S_k| = 8 × 16 × 64 = 2^13 = 8192.
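The gap in Remark 2 is easy to verify numerically; the short computation below (our own illustration) reproduces the per-sensor and joint state-space sizes for that scenario.

```python
import math

# Per-sensor state (r_k, b_k, Delta_k): r_k in {0,...,N}, b_k in {0,...,B_k},
# Delta_k in {1,...,Delta_max}, for the scenario in Remark 2.
N, B, DMAX, K = 7, 15, 64, 100
per_sensor = (N + 1) * (B + 1) * DMAX      # 8 * 16 * 64 = 8192 = 2**13
joint = per_sensor ** K                    # (2**13)**100 = 2**1300
digits = int(K * math.log10(per_sensor))   # number of decimal digits, ~391
```

The relaxed approach only ever works with `per_sensor`-sized value tables, K of them, instead of a single table of `joint` size.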

B. Truncation Procedure
Recall that there is no guarantee that the per-slot constraint (1) is satisfied under an optimal policy π_R* for the relaxed problem (P2). Here, we propose the following truncation procedure, which enforces the constraint (1) at each slot. More precisely, at slot t, let X(t) ⊆ K denote the set of sensors that π_R* commands to send an update, i.e., X(t) = {k ∈ K | a_k(t) = 1}. The truncation step separates into two cases: 1) if |X(t)| ≤ M, the edge node simply commands all the sensors in X(t); 2) otherwise, the edge node selects M sensors from the set X(t) randomly according to the discrete uniform distribution and commands them to send status updates. The online truncation procedure is presented in Algorithm 3.

Algorithm 3: Online truncation procedure
1: for each slot t do
2:   Construct the set X(t) based on π_R*
3:   if |X(t)| ≤ M then command all sensors in X(t)
4:   else select M sensors from X(t) uniformly at random and command them
5:   end if
6: end for
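The per-slot truncation step can be sketched in a few lines; the function and variable names are our own, and `rng.sample` implements the uniform selection of an M-subset.

```python
import random

def truncate(relaxed_actions, M, rng):
    """Enforce the per-slot budget on the relaxed policy's commands:
    keep all of X(t) if it fits within M, otherwise command a uniformly
    random M-subset of the commanded sensors."""
    X = [k for k, a in enumerate(relaxed_actions) if a == 1]
    if len(X) <= M:
        return X
    return sorted(rng.sample(X, M))
```

Note that truncation only ever removes commands, so the resulting policy never exceeds the budget and never commands a sensor the relaxed policy left idle.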

V. ASYMPTOTIC OPTIMALITY OF THE PROPOSED RELAX-THEN-TRUNCATE APPROACH
In this section, we analyze the optimality of the proposed relax-then-truncate policy – denoted by π̃ hereinafter – developed in Section IV. We first find an upper bound for the difference between the average cost obtained by the policy π̃ and the average cost obtained by an optimal policy π*. Then, we present two lemmas that are used to show that the relax-then-truncate approach is asymptotically optimal as the number of sensors goes to infinity.
Theorem 2. The difference between the average cost obtained by the relax-then-truncate policy π̃ and the average cost obtained by an optimal policy π* is upper bounded as follows, where MAD(·) denotes the mean absolute deviation.
Proof. The proof is presented in Appendix E.
We next present two lemmas that will subsequently be used in Theorem 3 to prove the asymptotic optimality of the relax-then-truncate approach.
Lemma 2. For a random variable X that follows a normal distribution with mean ν and variance σ^2, i.e., X ∼ N(ν, σ^2), the mean absolute deviation is given by MAD(X) = √(2/π) σ.
Proof. The proof is presented in Appendix F.
Lemma 3. When K → ∞, by following the policy π_R, we have lim_{K→∞} MAD(|X(t)|)/K = 0.
Proof. The proof is presented in Appendix G.
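Lemma 2 can be checked numerically; the sketch below compares a Monte Carlo estimate of the MAD of a normal sample against the closed form √(2/π) σ. The sample size, seed, and tolerance are arbitrary choices of ours.

```python
import math
import random

def mad_normal_mc(nu, sigma, n=200_000, seed=1):
    """Monte Carlo estimate of the mean absolute deviation of N(nu, sigma^2)."""
    rng = random.Random(seed)
    xs = [rng.gauss(nu, sigma) for _ in range(n)]
    mean = sum(xs) / n
    return sum(abs(x - mean) for x in xs) / n

sigma = 3.0
estimate = mad_normal_mc(nu=5.0, sigma=sigma)
closed_form = math.sqrt(2 / math.pi) * sigma   # Lemma 2: MAD = sqrt(2/pi) * sigma
assert abs(estimate - closed_form) < 0.05
```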
Theorem 3. The proposed relax-then-truncate policy π̃ is asymptotically optimal as the number of sensors goes to infinity, i.e., lim_{K→∞} (C̄_π̃ − C̄_π*) = 0.
Proof. The proof is presented in Appendix H.

VI. SIMULATION RESULTS
In this section, we provide simulation results to demonstrate the performance of the low-complexity relax-then-truncate approach developed in Section IV. In addition, simulation results are presented to demonstrate the structural properties of per-sensor optimal policies obtained by the RVIA in Algorithm 2.

A. Performance of the Proposed Low-complexity Relax-then-Truncate Approach
We consider an IoT network with N = 3 users, where in each slot, user n requests the status of f_k with probability p_k,n = 0.6. The battery capacity of each sensor is set to B_k = 7 units of energy and the AoI upper bound is set to ∆_max = 64. Each sensor is assigned an energy harvesting rate λ_k from the set {0.01, 0.02, . . ., 0.1} in a sequential order. We compare the performance of the proposed relax-then-truncate policy with a greedy (myopic) policy and a lower bound. In the (request-aware) greedy policy, the edge node commands at most M sensors with the largest AoI from the set of sensors whose measurements are requested by at least one user. In other words, this myopic policy minimizes the expected one-step cost over all the sensors and users at each slot. The lower bound is obtained by following an optimal relaxed policy π_R (see (19)).

As shown in Fig. 2, the gap between the proposed policy and the lower bound is generally small and decreases as K increases; the proposed policy approaches the lower bound for large K, which validates the asymptotic optimality of the proposed algorithm as proved in Theorem 3. Fig. 3 depicts the performance of the relax-then-truncate algorithm with respect to the number of sensors K for different values of the normalized transmission budget Γ. The results are obtained by averaging each algorithm over 50 episodes, where each episode takes 10^6 slots. Due to the asymptotic optimality of the proposed algorithm, for all values of Γ, the gap between the proposed policy and the lower bound is very small for large values of K.
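The request-aware greedy baseline can be sketched as follows. The names are our own, and we additionally assume a sensor needs at least one unit of energy to be commanded, which the greedy definition above does not state explicitly.

```python
def greedy_commands(requests, aoi, battery, M):
    """Request-aware greedy baseline: among sensors requested by at
    least one user (and, as an extra assumption here, with energy to
    transmit), command the at most M sensors with the largest AoI."""
    eligible = [k for k in range(len(aoi))
                if requests[k] > 0 and battery[k] >= 1]
    eligible.sort(key=lambda k: aoi[k], reverse=True)
    return eligible[:M]

# 4 sensors: per-sensor request counts, current AoI, battery levels, budget M = 2
cmds = greedy_commands(requests=[1, 0, 2, 3],
                       aoi=[10, 50, 7, 30],
                       battery=[1, 1, 0, 2], M=2)
# Sensor 1 is unrequested and sensor 2 has an empty battery, so the two
# remaining sensors are ranked by AoI: sensor 3 (AoI 30), then sensor 0 (AoI 10).
assert cmds == [3, 0]
```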
Comparing Figs. 3(a)-(d) with each other, it can be seen that, as Γ increases, the proposed policy converges to the optimal performance faster. This is because the proportion of the sensors that can be commanded by the edge node at each slot increases with Γ, and hence, the proportion of the sensors that is truncated (i.e., the sensors that are not commanded under π̃ compared to π_R) decreases.

Fig. 4 and Fig. 5 illustrate the average cost and the average number of command actions, respectively, with respect to the normalized transmission budget Γ. For benchmarking, we also plot the performance of an optimal policy for the case without any transmission constraint (i.e., M = K) [23]. As shown in Fig. 4, the average cost of the proposed algorithm decreases as Γ increases. This is because, for fixed K, the transmission budget M increases with Γ, and thus, the edge node can command more sensors at each slot, which results in serving the users with fresh measurements more often. Interestingly, from a certain point onward, increasing Γ does not decrease the average cost. This is because the average number of command actions does not increase anymore, as shown in Fig. 5, i.e., the constraint (18) becomes inactive and the edge node has more transmission budget than it needs. In these cases, the energy limitation of the sensors, which rely solely on energy harvesting, becomes dominant and does not allow the sensors to transmit more often.

Since the per-slot cost (6) is (linearly) increasing in r_k(t), the edge node has more incentive to command a sensor that is associated with a large number of requests. On the other hand, if there are no requests for f_k (i.e., r_k = 0), the optimal action is not to command the sensor, regardless of the battery level and AoI, i.e., π_{R,µ*,k}(0, b, ∆) = 0. This leads to energy saving for sensor k, which can be used later to serve the users with fresh measurements.

VII. CONCLUSIONS
We considered a resource-constrained IoT network, where multiple users make on-demand requests to a cache-enabled edge node to send status updates about various random processes, each monitored by an EH sensor. We first modeled the problem as an MDP and proposed an iterative algorithm that obtains an optimal policy.
Since the complexity of finding an optimal policy increases exponentially in the number of sensors, we developed a low-complexity relax-then-truncate algorithm and analytically showed that it is asymptotically optimal as the number of sensors goes to infinity. Numerical results illustrated that the relax-then-truncate algorithm significantly reduces the average cost (i.e., the average on-demand AoI over all sensors and users) compared to a request-aware greedy policy and performs close to the optimal solution even for moderate numbers of sensors.

VIII. ACKNOWLEDGMENTS
The work has been financially supported in part by Infotech Oulu and the Academy of Finland. Let e_i denote a unit vector of length K having a single 1 at the i-th entry and all other entries 0, and let e_0 denote the zero vector (i.e., all entries are 0) of length K. We define a vector a_i = (a_{i,1}, . . ., a_{i,K}) with elements a_{i,k} = 1{∆′_k = i, ∆′_k < ∆_max}. First, since the request processes are independent of the other variables in the system (e.g., the actions), a state with request vector r′ is accessible from any other state. Second, realizing the actions e_1 for (b_1 − b′_1)^+ slots, . . ., e_K for (b_K − b′_K)^+ slots, the system reaches a state whose battery vector is b′ with a positive probability (w.p.p.).
Note that, regardless of the subsequent actions, the system remains in a state whose battery vector is b′ w.p.p. Third, realizing the consecutive actions a_δ, a_{δ−1}, . . ., a_1 leads the system to a state whose age vector is ∆′ w.p.p. In summary, the system reaches a state with request vector r′, age vector ∆′, and battery vector b′ w.p.p. Thus, s′ is accessible from s.

C. Proof of Lemma 1
Proof. Here, we drop the unnecessary subscripts for notational convenience, e.g., V_{R,µ,k} is simply written as V. To prove that V is non-decreasing with respect to the AoI, we consider two states s = (r, b, ∆) and s′ = (r, b, ∆′), where ∆′ ≥ ∆, and show that V(s′) ≥ V(s).
E. Proof of Theorem 2
Proof. We introduce the following (penalized) strategy π̂_R: at each slot, command the sensors based on π_R but add a penalty N K ∆_max (|X(t)| − M)^+ / |X(t)| to the cost over all sensors (see (32)). It is clear that the average cost obtained under π̂_R is not less than that obtained by π̃, i.e., C̄_π̃ ≤ C̄_π̂R. Also, recall from (19) that the average cost obtained under policy π_R is a lower bound on the average cost obtained by an optimal policy π*, i.e., C̄_πR ≤ C̄_π*. Moreover, the policy π̃ is a sub-optimal solution for (P1), i.e., C̄_π* ≤ C̄_π̃. Therefore, we have
C̄_πR ≤ C̄_π* ≤ C̄_π̃ ≤ C̄_π̂R. (33)
Using (33), the difference between the average cost obtained by the proposed relax-then-truncate policy π̃ and the average cost obtained by an optimal policy π* is upper bounded as C̄_π̃ − C̄_π* ≤ C̄_π̂R − C̄_πR.

G. Proof of Lemma 3
Proof. The cardinality of the set X(t) (i.e., the set of sensors that are commanded under π_R) can be written as |X(t)| = Σ_{k=1}^{K} a_k(t), where a_k(t) ∈ {0, 1}, k ∈ K, are K independent binary random variables. Let ω_k(t) be the probability that sensor k is commanded at slot t under policy π_R, i.e., ω_k(t) ≜ Pr(a_k(t) = 1). We define the standardized random variable
Z(t) ≜ (|X(t)| − Σ_{k=1}^{K} ω_k(t)) / √(Σ_{k=1}^{K} ω_k(t)(1 − ω_k(t))).
We have
MAD(|X(t)|)/K (a)= √(Σ_{k=1}^{K} ω_k(t)(1 − ω_k(t))) MAD(Z(t))/K (b)≤ MAD(Z(t))/(2√K), (35)
where (a) follows because the MAD does not change by adding a constant to all values of the variable (similar to the variance) and (b) follows from Σ_{k=1}^{K} ω_k(t)(1 − ω_k(t)) ≤ K/4. By the Lyapunov central limit theorem [35, Theorem 27.3], Z(t) converges in distribution to a standard normal distribution, i.e., Z(t) ∼ N(0, 1), as K goes to infinity. Thus, we have
lim_{K→∞} MAD(|X(t)|)/K (a)≤ lim_{K→∞} MAD(Z(t))/(2√K) (b)= lim_{K→∞} √(2/π)/(2√K) = 0,
where (a) follows from (35) and (b) follows from Lemma 2.
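The vanishing of MAD(|X(t)|)/K can also be illustrated by Monte Carlo. The sketch below uses i.i.d. Bernoulli command indicators, which is a simplification: the proof only requires independence, not identical ω_k. All names and constants are our own.

```python
import random

def mad_over_k(K, omega=0.3, trials=5_000, seed=7):
    """Monte Carlo estimate of MAD(|X(t)|) / K when |X(t)| is a sum of
    K independent Bernoulli(omega) command indicators."""
    rng = random.Random(seed)
    samples = [sum(rng.random() < omega for _ in range(K))
               for _ in range(trials)]
    mean = sum(samples) / trials
    return sum(abs(x - mean) for x in samples) / trials / K

# The normalized MAD shrinks as O(1/sqrt(K)), matching Lemmas 2 and 3.
small, large = mad_over_k(K=25), mad_over_k(K=400)
assert large < small
```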

Fig. 1: A multi-user multi-sensor IoT sensing network consisting of K EH sensors, an edge node, and N users. The end users are interested in timely status update information of the physical processes measured by the sensors.

Pr(s′_k | s_k, a_k) = Pr(r′_k) Pr(b′_k | b_k, a_k) Pr(∆′_k | b_k, ∆_k, a_k), (10)
where (a) follows from the chain rule, (b) follows from the independence between the request process and the other random variables, (c) follows because, given the current battery level b_k and action a_k, the next battery level b′_k is independent of the requests and the current AoI, and (d) follows since, given b_k, ∆_k, and a_k, the next value of the AoI ∆′_k can be obtained deterministically.

Fig. 2: Performance of the proposed relax-then-truncate algorithm in terms of average cost (i.e., average on-demand AoI over all the sensors and users) over time for different values of the number of sensors K with a fixed normalized transmission budget Γ = 0.025.

Fig. 3: Performance of the relax-then-truncate algorithm with respect to the number of sensors K for different values of the normalized transmission budget Γ.

Fig. 2 depicts the performance of the relax-then-truncate algorithm over time for different numbers of sensors K with a fixed normalized transmission budget Γ = 0.025. As shown, the proposed algorithm reduces the average cost by approximately 50 % compared to the greedy policy. Furthermore, the gap between the proposed policy and the lower bound is in general small and decreases as K increases.

Fig. 4 and Fig. 5: The average cost and the average number of command actions, respectively, with respect to the normalized transmission budget Γ.

Fig. 7, Fig. 8, and Fig. 9 depict the action under π_{R,µ*,k} in each state s = (1, b, ∆) for different values of the transmission budget M, energy harvesting rate λ_k, and request probability p_k,n, respectively. By comparing Figs. 7(a)-(d), it is inferred that the command region enlarges as M increases, because the edge node can command more sensors at each slot. Further, from a certain point onward (M ≥ 25), the command region does not increase anymore, because the energy limitation of the sensors becomes the dominant factor.

The Academy of Finland support includes grant 323698 and the 6Genesis Flagship (grant 318927). M. Hatami would like to acknowledge the support of the Nokia Foundation. The work of M. Leinonen has also been financially supported in part by the Academy of Finland (grants 340171 and 319485). The work of N. Pappas and Z. Chen has been supported in part by the Swedish Research Council (VR), ELLIIT, and CENIIT. Z. Chen would like to acknowledge the support of the Knut and Alice Wallenberg (KAW) Foundation.

APPENDIX
A. Proof of Proposition 1
Proof. For any state s = (s_1, . . ., s_K), where s_k = (r_k, b_k, ∆_k), k = 1, . . ., K, we define the request vector r = (r_1, . . ., r_K), the battery vector b = (b_1, . . ., b_K), and the age vector ∆ = (∆_1, . . ., ∆_K).

B. Proof of Proposition 3
Proof.
We consider two arbitrary states s, s′ ∈ S_k and show that s′ = (r′, b′, ∆′) is accessible from s = (r, b, ∆) under a (per-sensor) stationary randomized policy π_k in which, at each state s, the edge node randomly selects an action a ∈ A_k = {0, 1} according to the discrete uniform distribution, i.e., π_k(0|s) = π_k(1|s) = 1/2. For the case where b′ ≥ b, realizing the action a = 0 for τ = b′ − b + 1 consecutive slots leads to state (r′, b′ + 1, min{∆ + τ, ∆_max}) w.p.p.; then the action a = 1 leads to state (r′, b′, 1) w.p.p., and subsequently the action a = 0 for ∆′ − 1 consecutive slots leads to state s′ = (r′, b′, ∆′) w.p.p. Similarly, for the case where b′ < b, the action a = 1 for τ = b − b′ consecutive slots leads to state (r′, b′, 1) w.p.p., and subsequently a = 0 for ∆′ − 1 consecutive slots leads to state s′ = (r′, b′, ∆′) w.p.p.
Let δ denote the largest element of the age vector ∆′ (i.e., δ = max{∆′_1, . . ., ∆′_K}). Recall that at most M sensors can send a fresh status update at each slot. Thus, any state whose age vector has more than M identical entries with values strictly less than ∆_max is a transient state. We consider two non-transient states s, s′ ∈ S and show that s′ = (r′, b′, ∆′) is accessible from s = (r, b, ∆) under a stationary randomized policy π in which, at each state s, the edge node randomly selects an action a ∈ A according to the discrete uniform distribution, i.e., π(a|s) = 1/|A|.