Minimizing the AoI in Resource-Constrained Multi-Source Relaying Systems: Dynamic and Learning-Based Scheduling

We consider a multi-source relaying system where independent sources randomly generate status update packets which are sent to the destination with the aid of a relay through unreliable links. We develop transmission scheduling policies to minimize the weighted sum average age of information (AoI) subject to transmission capacity and long-run average resource constraints. We formulate a stochastic control optimization problem and solve it using a constrained Markov decision process (CMDP) approach and a drift-plus-penalty method. The CMDP problem is solved by transforming it into an MDP problem using the Lagrangian relaxation method. We theoretically analyze the structure of optimal policies for the MDP problem and subsequently propose a structure-aware algorithm that returns a practical near-optimal policy. Using the drift-plus-penalty method, we devise a near-optimal low-complexity policy that performs the scheduling decisions dynamically. We also develop a model-free deep reinforcement learning policy for which the Lyapunov optimization theory and a dueling double deep Q-network are employed. The complexities of the proposed policies are analyzed. Simulation results are provided to assess the performance of our policies and validate the theoretical results. The results show up to 91% performance improvement compared to a baseline policy.


I. INTRODUCTION
In many emerging applications of wireless communications, such as the Internet-of-Things (IoT), cyber-physical systems, and intelligent transportation systems, the freshness of status information is crucial [3], [4]. The age of information (AoI) has been proposed to characterize information freshness in status update systems [5]. The AoI is defined as the time elapsed since the latest received status update packet was generated [4], [5]. Recently, the AoI has attracted much interest in different areas, e.g., queuing systems [6]-[9], and scheduling and sampling problems [10]-[24]. The reader may refer to [25] for a survey on the AoI.
In some status update systems, there is no direct communication link between the source of information and the intended destination, or direct communication is costly. In such systems, deploying an intermediate node, called a relay, is indispensable for enabling long-distance communication. Deploying such a node has an array of benefits, e.g., saving on the power usage of wireless sensors and improving the transmission success probability. However, minimizing the AoI is particularly challenging in such relaying systems due to the need to jointly optimize scheduling on both the source and relay sides, especially in a multi-source setup [10], [11]. Moreover, minimizing the AoI becomes more challenging in the presence of unreliable wireless connectivity due to the possibility of losing some updates [13]. At the same time, in practical status update systems, the number of transmissions is limited due to resource constraints (power, bandwidth, etc.), especially in power-limited sensor networks [10], [11], [14], [15].
In this paper, we consider a multi-source relaying status update system with stochastic arrivals.
The sources independently generate different types of status update packets which randomly arrive at a buffer-aided transmitter. The transmitter sends the packets to a buffer-aided full-duplex relay which further forwards the packets to the destination. The buffers store the last arrived packet from each source. All transmission links (channels), i.e., the transmitter-to-relay and relay-to-destination links, are unreliable (error-prone) and have a limited transmission capacity.
A practical application of the considered system is industrial monitoring, where status updates of various sensors in a given factory zone are first gathered by a low-power transmitter and then sent to a remote monitoring center with the help of a relay. Another application arises in vehicular networks, where status updates about various physical processes related to a vehicle are sent to a controller (e.g., a roadside unit) to support vehicle safety applications [32]. When the vehicle is outside the coverage of the controller, a relay (which could be another vehicle [32] or a UAV [27]) is needed to establish the communication.
We formulate a stochastic control optimization problem aiming to minimize the weighted sum average AoI (AAoI) subject to transmission capacity constraints and a long-run average resource constraint, which limits the average number of all transmissions in the system. We develop three different (transmission) scheduling policies by solving the problem. Namely, we provide: (1) a deterministic policy, (2) a drift-plus-penalty-based scheduling policy (DPP-SP), and (3) a deep reinforcement learning policy. To this end, a constrained Markov decision process (CMDP) approach and a drift-plus-penalty method are proposed. For the former, we first show that the unichain structure holds for the CMDP problem and then apply the Lagrangian relaxation method to solve it. We theoretically analyze the structure of an optimal policy for the resulting MDP problem and subsequently propose a structure-aware algorithm that provides a near-optimal deterministic policy (which is an optimal policy for the MDP problem) and another deterministic policy that gives a lower bound on the optimal value of the CMDP problem. We note that an optimal policy can be obtained by randomizing the proposed near-optimal deterministic policy and the lower-bound deterministic policy; however, obtaining such a randomized policy might be computationally intractable. In the drift-plus-penalty method, we transform the main problem into a sequence of per-slot problems and then devise a near-optimal low-complexity DPP-SP, which performs the scheduling dynamically using a scheduling rule described by a closed-form solution to the per-slot optimization problem. Moreover, we provide a model-free deep reinforcement learning algorithm for which we first employ the Lyapunov optimization theory to transform the main problem into an MDP problem and then adopt a dueling double deep Q-network (D3QN) to solve it. The proposed learning-based policy addresses the case in which the packet arrival rates and the error probabilities of the wireless channels are not known a priori, i.e., the so-called unknown environment. It should be noted that the environment model may not be (readily) available, or using a perfect model may not be applicable in practice owing to computational difficulties. The computational complexity of the proposed policies is analyzed. Finally, extensive numerical analyses are provided to validate the theoretical results and show the effectiveness of the proposed scheduling policies.

A. Contributions
The main contributions of this paper are summarized as follows:
• We study the AoI in a multi-source buffer-aided full-duplex relaying status update system with stochastic arrivals and unreliable links. We formulate a stochastic optimization problem that aims to minimize the weighted sum AAoI subject to transmission capacity constraints and a long-run time average resource constraint.
• We develop three different scheduling policies by solving the main optimization problem. Particularly, we propose the CMDP approach and the drift-plus-penalty method. Moreover, we develop a deep reinforcement learning algorithm by combining the Lyapunov optimization theory and D3QN.
• We theoretically analyze the structure of an optimal policy of the MDP problem (obtained via the Lagrangian relaxation) and develop a structure-aware iterative algorithm for solving the CMDP problem. The convergence of the algorithm is also proven.
• We devise a dynamic near-optimal low-complexity scheduling policy, i.e., DPP-SP, by providing a closed-form solution to the per-slot problem obtained under the drift-plus-penalty method. Moreover, we prove that DPP-SP satisfies the average resource constraint.
• We analyze the computational complexity of the proposed scheduling policies.
• We provide numerical analyses to verify the theoretical results and assess the effectiveness of the devised policies. The results show up to 91% performance improvement compared to a greedy-based baseline policy.

B. Related Works
Recently, the AoI in relaying systems has been studied in, e.g., [10], [14], [36]-[46]. The authors of [36] analyzed the AoI in a discrete-time Markovian system for two different relay settings and analyzed the impact of relaying on the AoI. In [37], the authors analyzed the AAoI in a two-way relaying system under the generate-at-will model (i.e., the possibility of generating a new update at any time), in which two sources exchange status data. The AoI performance under different policies (e.g., a last-generated-first-served policy) in general multi-hop single-source networks was studied in [38]. In [14], the authors studied the AoI in a single-source energy harvesting relaying system with error-free channels and designed offline and online age-optimal policies. Reference [39] analyzed the AAoI in a single-source relaying system with and without the automatic repeat-request technique, where the results show that the automatic repeat-request technique can reduce the AAoI. The age-energy tradeoffs in a relay-aided status update system were studied in [45], where expressions for the AAoI and the average energy cost were derived.
In [42], expressions for the AoI distribution in a single-source relaying system under different circumstances were derived. Minimization of the AAoI through optimizing the blocklengths of short-packet communications in decode-and-forward relaying IoT networks was conducted in [46]. The authors of [43] optimized the steady-state AoI violation probability with respect to the sampling rate of monitoring a process in both single-hop and two-hop systems. In [40], the authors considered a single-source relaying system under stochastic packet arrivals where the source communicates with the destination either through the direct link or via a relay. They proposed two different relaying protocols and derived the respective AAoI expressions.
In summary, only a few works, such as [10], [11], [14], have incorporated a resource constraint (as we do in this paper) when analyzing and/or optimizing the AoI in a relaying system. Moreover, different from our multi-source system, most of the discussed works, e.g., [14], [36], [38]-[43], [45], [46], consider single-source relaying systems. Clearly, multi-source scheduling is substantially more challenging, especially when there are also resource constraints (as in this paper), because one needs to properly allocate a limited amount of resources among multiple sources, taking into account each source's characteristics (e.g., the arrival rate of each source and the importance of the source's information).
Our relaying system, considered as a two-hop network, is an extension of the work [12], where the authors provided scheduling policies for minimizing the AAoI in a one-hop buffer-free network with stochastic arrivals, an error-free link, and no average resource constraint. In contrast, our two-hop network is a buffer-aided network with error-prone links. The works most related to our paper are [10], [11]. The work [10] studied the AoI minimization in a multi-source relaying system with the generate-at-will model and unreliable channels. The authors proved that the greedy policy is an optimal scheduling policy for a setting called the error-prone symmetric IoT network, whereas for the general setting, they applied DQN. In [11], the authors studied the AAoI minimization problem in a single-source half-duplex relaying system with the generate-at-will model under a constraint on the average number of forwarding transmissions at the relay. In contrast to [11], we consider a multi-source setup; because of the single-source setup in [11], the scheduling problem of [11] is essentially that of optimizing whether the relay should receive or transmit at each slot, whereas our problem is a multi-source scheduling problem.
Different from [12], our problem has two-dimensional decision variables, which makes constructing optimal/good scheduling policies more difficult. We further consider an average resource constraint, so our problem is a CMDP problem, whereas the problems of [10], [12] are MDP problems. Notably, not only is solving a CMDP problem substantially challenging, but analyzing the structure of its optimal policy is challenging as well. Furthermore, the stochastic arrival model considered in our setup generalizes the generate-at-will model of [10], [11] and brings additional challenges to the design and analysis of scheduling policies, since the statistics of the arrivals and the AoI at the transmitter are also involved in the system dynamics. Finally, besides the MDP/CMDP approach proposed in [10]-[12], we also propose two different scheduling policies, i.e., DPP-SP and the deep reinforcement learning policy that copes with unknown environments. Even though [11] also develops a low-complexity double-threshold relaying policy, its thresholds need to be optimized numerically. In contrast, our low-complexity scheduling policy requires executing only two simple operations.

C. Organization
The rest of this paper is organized as follows. The system model and problem formulation are presented in Section II. The CMDP formulation and its solution are presented in Section III. The DPP-SP is presented in Section IV. The deep reinforcement learning algorithm is provided in Section V. The computational complexity of the proposed policies is analyzed in Section VI. The numerical analysis and conclusions are provided in Section VII and Section VIII, respectively.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model
We consider a status update system consisting of a set I = {1, ..., I} of I independent sources, a buffer-aided transmitter, a buffer-aided full-duplex relay, and a destination, as depicted in Fig. 1. The sources model physically separated, fully autonomous sensors (i.e., they cannot be controlled or commanded) whose (status update) packets are sent to the transmitter using a random access protocol. Thus, the stochastic arrival model is used to account for possible random packet losses on the links between the sources and the transmitter due to, e.g., collisions, and/or for possible idle slots in which the source sensors do not send updates.
Additionally, there is no direct communication link between the transmitter and the destination, and thus, the transmitter sends all status update packets to the destination via the relay.
We assume that each status update is encapsulated in one packet. The buffer size is one packet per source, and each buffer stores the most recently arrived packet of a source, as it contains the freshest information. More specifically, a packet of a source arriving at the transmitter replaces the packet of the same source in the transmitter's buffer; similarly, a packet of a source received by the relay replaces the packet of the same source in the relay's buffer. It is worth noting that considering a one-packet buffer for each source is sufficient in our system, as storing and transmitting outdated packets does not improve the AoI. Moreover, the relay transmits the packet available in its buffer at the beginning of a slot, while the buffer is updated at the end of the slot (if a new packet is successfully received).

Fig. 1: A multi-source relaying status update system in which different status updates arrive at random time slots at the transmitter, which then sends the packets to the destination via a buffer-aided relay over unreliable links.

TABLE I: The key symbols with their definitions used in the paper.

Notation(s)          Definition(s)
I / I / i            The set/number/index of sources
θ_i[t] / δ_i[t]      The AoI of source i at the transmitter/destination (the AoI at the relay is defined analogously)
µ_i                  The arrival rate of source i
w_i                  The weight of source i
p_1 / p_2            The reliability of the transmitter-relay/relay-destination link
ρ_1[t] / ρ_2[t]      The successful packet reception indicator of the transmitter-relay/relay-destination link
α[t] / β[t]          The transmission decision at the transmitter/relay
N                    The upper bound on the AoI values
Γ_max                The transmission budget
K                    The average number of total transmissions per slot
δ                    The weighted sum average AoI at the destination

We consider a discrete-time system with unit time slots t ∈ {0, 1, 2, ...}. The sources, indexed by i ∈ I, independently generate status update packets according to the Bernoulli distribution with parameter µ_i. Note that µ_i = 1 gives the same performance as a system with the generate-at-will model and no sampling cost. Let u_i[t] be a binary indicator that shows whether a packet from source i arrives at the transmitter at the beginning of slot t, i.e., u_i[t] = 1 indicates that a packet arrived; otherwise, u_i[t] = 0. Accordingly, Pr{u_i[t] = 1} = µ_i. For clarity, the definitions of the main symbols are collected in Table I.
Wireless Channels: As the wireless channels fluctuate over time, receptions of updates (both at the relay and at the destination) are subject to errors. However, unsuccessfully received packets can be retransmitted; we assume that all retransmissions have the same reception success probability.
Let p_1 and p_2 be the successful transmission probabilities of the transmitter-relay and relay-destination links, respectively. Also, let ρ_1[t] ∈ {0, 1} and ρ_2[t] ∈ {0, 1} denote the successful packet reception indicators of the transmitter-relay and relay-destination links in slot t, respectively. We assume that there is a centralized controller performing the scheduling.
Age of Information: Let θ_i[t] denote the AoI of source i at the transmitter in slot t. Also, let the AoI of source i at the relay be denoted analogously, and let δ_i[t] denote the AoI of source i at the destination in slot t. We make a common assumption (see, e.g., [21], [22], [48]) that the AoI values are upper-bounded by a finite value N. Besides tractability, this accounts for the fact that once the available information about the process of interest becomes excessively stale, further counting would be irrelevant. The evolution of the AoIs of each source i ∈ I is given by (1); a reconstructed sketch of these dynamics is shown after Remark 1 below.

Remark 1. When N is not sufficiently large, the system performance without bounding the AoI will differ from that with bounded AoI. The appropriate choice of N depends on system parameters such as the number of sources and the links' reliabilities.
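Since the explicit update equations in (1) did not survive extraction, the following display sketches a standard form consistent with the buffering model above; the indexing conventions (e.g., that an arrival resets the transmitter AoI and that a successful reception copies the aged upstream AoI) and the stand-in relay symbol θ^R_i are our assumptions, not the verbatim original.

```latex
% A plausible form of the AoI dynamics (1):
\begin{align*}
  \theta_i[t+1] &= \min\{N,\ (1-u_i[t+1])\,(\theta_i[t]+1)\},\\
  \theta^{\mathrm{R}}_i[t+1] &= \min\{N,\ \mathbb{1}\{\alpha[t]=i,\,\rho_1[t]=1\}\,(\theta_i[t]+1)
      + \mathbb{1}\{\alpha[t]\neq i \text{ or } \rho_1[t]=0\}\,(\theta^{\mathrm{R}}_i[t]+1)\},\\
  \delta_i[t+1] &= \min\{N,\ \mathbb{1}\{\beta[t]=i,\,\rho_2[t]=1\}\,(\theta^{\mathrm{R}}_i[t]+1)
      + \mathbb{1}\{\beta[t]\neq i \text{ or } \rho_2[t]=0\}\,(\delta_i[t]+1)\},
\end{align*}
% where \theta^{R}_i[t] stands in for the relay AoI symbol, which was lost
% in extraction.
```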

B. Problem Formulation
We denote the weighted sum average AoI at the destination (WS-AAoI) by δ and the average number of total transmissions per slot in the system by K, where w_i > 0, ∀i, denotes the weight of source i; 1{·} is an indicator function which equals 1 when the condition in {·} holds; and E{·} is the expectation with respect to the system randomness (i.e., the random wireless channels and packet arrival processes) and the (possibly random) decision variables α[t] and β[t]; a reconstruction of these definitions and of the problem below is sketched after this paragraph. Moreover, K represents the system-wide power consumption. With these definitions, we formulate stochastic optimization problem (2): minimize the WS-AAoI δ subject to the per-slot transmission capacity constraints and the average resource constraint (2b), K ≤ Γ_max, where the real value Γ_max ∈ (0, 2] is the maximum allowable average number of transmissions per slot in the system. The time average constraint (2b) represents a system-wide power utilization budget. Thus, problem (2) provides a trade-off between the WS-AAoI and the system-wide power consumption. Note that since the maximum number of per-slot transmissions in the system is 2, we have Γ_max ∈ (0, 2]; values Γ_max ≥ 2 would make the constraint inactive. Moreover, the Slater condition [49, Eq. 9.32] clearly holds for problem (2), i.e., there exists some set of decisions for which K < Γ_max.
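The defining displays were garbled in extraction; the following long-run time-average forms are our best-effort restatement consistent with the surrounding text (indicator-based transmission counting, weighted AoI averaging), not the verbatim original.

```latex
% Reconstructed definitions of the WS-AAoI and K, and of problem (2):
\begin{align*}
  \delta &= \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\sum_{i\in\mathcal{I}}
      w_i\,\mathbb{E}\{\delta_i[t]\}, \qquad
  K = \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}
      \mathbb{E}\big\{\mathbb{1}\{\alpha[t]\neq 0\} + \mathbb{1}\{\beta[t]\neq 0\}\big\},\\
  \text{(2):}&\quad \underset{\{\alpha[t],\,\beta[t]\}_{t}}{\text{minimize}}\ \ \delta
  \qquad \text{subject to}\quad K \le \Gamma_{\max} \ \text{(2b)},\quad
  \alpha[t]\in\{0\}\cup\mathcal{I},\ \beta[t]\in\{0\}\cup\mathcal{I},\ \forall t.
\end{align*}
```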
In the next section, we will present a CMDP approach to solve problem (2).

III. CMDP APPROACH TO SOLVE PROBLEM (2)
In this section, we recast problem (2) into a CMDP problem which is then solved by using the Lagrangian relaxation method.

A. CMDP Formulation
We specify the CMDP by the following elements:
• State: The state of the CMDP incorporates the knowledge about all the AoI values in the system. We define the state in slot t by s[t] = (θ_1[t], x_1[t], y_1[t], ..., θ_I[t], x_I[t], y_I[t]), where x_i[t] and y_i[t], ∀i ∈ I, are the relative AoIs at the relay and the destination in slot t, respectively. Using the relative AoIs simplifies the subsequent analysis and derivations; the intuition is that, with this parameterization, the evolutions of the AoI of source i at the relay and at the destination from slot t to t + 1 can be expressed directly in terms of (θ_i[t], x_i[t], y_i[t]), capped at N. We denote the state space by S, which is a finite set.
• Action: We define the action taken in slot t by a[t] = (α[t], β[t]). Let A denote the action space. Actions are determined by a policy, denoted by π, which is a (possibly randomized) mapping from S to A. We consider stationary randomized policies because they are dominant (see [49, Definition 2.2]) if the unichain structure exists [49, Theorem 4.1]; we will show in Theorem 1 below that the unichain structure exists for the transition probability matrix of the underlying (C)MDP.
• State Transition Probabilities: We denote the state transition probability from state s to next state s′ under an action a = (α, β) by P_ss′(a). Since the evolution of the AoIs in (1) and the arrivals are independent among the sources, the transition probability can be decomposed as P_ss′(a) = ∏_i Pr{s′_i | s_i, a}, where Pr{s′_i | s_i, a}, ∀i ∈ I, denotes the state transition probability of source i under an action a, s_i is the part of the current state associated with source i ∈ I, i.e., s_i = (θ_i, x_i, y_i), and s′_i is the part of the next state associated with source i, i.e., s′_i = (θ′_i, x′_i, y′_i). Mathematically, Pr{s′_i | s_i, a} is given by (3).

Theorem 1. The transition probability matrix with elements P_ss′(a) corresponding to every deterministic policy is unichain.
Proof. See Appendix A.
• Cost Functions: The (immediate) cost functions include: 1) the AoI cost and 2) the transmission cost. The AoI cost is the weighted sum of the AoIs at the destination, i.e., C(s) = Σ_{i∈I} w_i(θ_i + x_i + y_i), and the transmission cost D(a) counts the number of transmissions under action a. Given a stationary randomized policy π, we denote the WS-AAoI cost by J(π) and the average transmission cost by D(π), defined in (4) and (5), respectively; we omit their dependence on the initial state. The CMDP problem (6) is to minimize J(π) over π ∈ Π_SR subject to D(π) ≤ Γ_max, where Π_SR is the set of all stationary randomized policies; a reconstruction of (4)-(6) is sketched below. The optimal value of the CMDP problem (6) is denoted by J* and an optimal policy is denoted by π*.
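The displays for (4)-(6) were lost in extraction; the following long-run average-cost forms are the standard ones implied by the definitions above and are offered as our reconstruction.

```latex
% Reconstructed cost functionals (4)-(5) and CMDP problem (6):
\begin{align*}
  J(\pi) &= \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}
      \mathbb{E}_\pi\{C(s[t])\}, \qquad
  D(\pi) = \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}
      \mathbb{E}_\pi\{D(a[t])\},\\
  &\text{(6):}\quad \min_{\pi\in\Pi_{\mathrm{SR}}} J(\pi)
  \quad \text{subject to}\quad D(\pi)\le \Gamma_{\max}.
\end{align*}
```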
In the section below, we turn to solving the CMDP problem (6).

B. Solving the CMDP Problem
In order to solve the CMDP problem (6), we transform it into an (unconstrained) MDP problem using the Lagrangian relaxation method [49], [51]. The states, the actions, and the state transition probabilities of the MDP are the same as those of the CMDP. The immediate cost function of the MDP is defined as the Lagrangian L(s, a; λ) = C(s) + λ(D(a) − Γ_max), where λ ≥ 0 is the Lagrange multiplier. The resulting MDP problem (7) minimizes the long-run time average of L(s, a; λ) over Π_SD, the set of all deterministic policies; here, we restrict ourselves to the class of deterministic policies without loss of optimality because there always exists an optimal deterministic policy for the MDP problem (7) [50, p. 370], which is a result of Theorem 8.4.5 in [50] under the unichain structure shown in Theorem 1.
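For concreteness, the relaxed MDP problem (7) can be restated as below; the immediate cost matches the expression used later in Appendix B, so we are fairly confident in this reconstruction, though the display itself is ours.

```latex
% Reconstructed MDP problem (7), whose optimizer is the \lambda-optimal
% policy \pi^*_\lambda:
\begin{align*}
  \text{(7):}\quad \min_{\pi\in\Pi_{\mathrm{SD}}}\ \limsup_{T\to\infty}\frac{1}{T}
      \sum_{t=0}^{T-1}\mathbb{E}_\pi\{L(s[t],a[t];\lambda)\},\qquad
  L(s,a;\lambda) = \sum_{i\in\mathcal{I}} w_i(\theta_i + x_i + y_i)
      + \lambda\,\big(D(a) - \Gamma_{\max}\big).
\end{align*}
```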
By [49, Theorem 12.8], under the Slater condition and two other technical conditions, there exists a Lagrange multiplier λ* such that J* equals the optimal value of the MDP problem (7) with λ = λ*, and the corresponding λ*-optimal policy π*_{λ*} is an optimal policy for the CMDP problem (6).
Remark 2. Under the unichain structure shown in Theorem 1, by the results of [51], it can be shown that an optimal policy of the CMDP problem (6) is a stationary randomized policy that performs randomization between two deterministic λ-optimal policies, where one is feasible and the other is infeasible for (6) (e.g., [11], [15]). However, given such λ-optimal policies, finding the randomization factor of such an optimal policy is computationally difficult [52, Sec. 3.2].
As stated in Remark 2, it is difficult to obtain an optimal stationary randomized policy (by randomizing the two λ-optimal policies) for the CMDP problem (6). Therefore, we will next develop a practical near-optimal (as empirically shown in Section VII) deterministic policy for the CMDP problem (6). In particular, we propose a solution relying on a bisection search over the Lagrange multiplier λ and the relative value iteration algorithm (RVIA). Namely, we alternate between solving the MDP problem (7) for a given λ and searching for a particular value of λ for which π*_λ is feasible for problem (6) and gives the best performance among all feasible λ-optimal policies.

1) Solving the MDP Problem (7): Towards solving the MDP problem (7), we first present the following theorems related to a λ-optimal policy; particularly, Theorem 2 characterizes a λ-optimal policy and Theorem 3 specifies its structure. Then, we utilize these theorems to develop a structure-aware RVIA [50, Sec. 8.5.5] that gives a λ-optimal policy.

Theorem 2. There exists h(s), for each state s ∈ S, such that the average-cost optimality equation (8) holds.
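Equations (8) and (9) did not survive extraction; for an average-cost MDP they take the standard Bellman optimality form, which we restate here as our assumed reconstruction (J*_λ denotes the optimal average cost of (7)).

```latex
% Reconstructed optimality equation (8) and greedy policy extraction (9):
\begin{align*}
  \text{(8):}\quad J^*_\lambda + h(s) &= \min_{a\in\mathcal{A}}
      \Big\{ L(s,a;\lambda) + \sum_{s'\in\mathcal{S}} P_{ss'}(a)\,h(s') \Big\},
      \quad \forall s\in\mathcal{S},\\
  \text{(9):}\quad \pi^*_\lambda(s) &\in \arg\min_{a\in\mathcal{A}}
      \Big\{ L(s,a;\lambda) + \sum_{s'\in\mathcal{S}} P_{ss'}(a)\,h(s') \Big\}.
\end{align*}
```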
Theorem 3. Any λ-optimal policy of problem (7) has a switching-type structure for β with respect to y = (y_1, ..., y_I). That is, if the policy takes action β = i, i ∈ {1, ..., I}, at state s, it also takes the same action at all states s + k e_{3i}, for all k ∈ N, where e_{3i} is a vector whose (3i)-th element is 1 and all other elements are 0.
RVIA is an iterative procedure that utilizes the optimality equation (8). Particularly, at each iteration n ∈ {0, 1, ...}, for each state s ∈ S, the value function V_n(s) is updated according to (10), where s_ref ∈ S is an arbitrarily chosen reference state. The structure-aware RVIA is presented in Alg. 1, where ε is a small constant for the RVIA termination criterion. In particular, at each iteration of RVIA, in Steps 6-9, we exploit the switching-type structure specified in Theorem 3 to find an optimal action for each state s, i.e., a* ∈ arg min_{a∈A} {L(s, a; λ) + Σ_{s′∈S} P_ss′(a)V_n(s′)}; whenever we have determined an optimal decision β* in Step 6, we only need to find an optimal decision α*.
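To make the Alg. 1 workflow concrete, the following Python sketch implements plain RVIA per (10) together with the outer bisection on λ. It runs on a generic finite MDP given as arrays, omits the structure-exploiting Steps 6-9, checks feasibility by simulating the induced chain (a simplification of Alg. 1's exact evaluation), and all function and variable names are ours, not the paper's.

```python
import numpy as np

def rvia(P, cost, lam, trans_count, Gmax, s_ref=0, eps=1e-3, max_iter=10_000):
    """Relative value iteration (10) for the average-cost MDP (7).
    P[a, s, s'] : transition probabilities; cost[s] : AoI cost C(s);
    trans_count[a] : number of transmissions D(a) under action a."""
    L = cost[None, :] + lam * (trans_count[:, None] - Gmax)  # L(s, a; lam)
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = L + P @ V                       # Q[a, s]
        m = Q.min(axis=0)
        V_new = m - m[s_ref]                # relative normalization
        if np.abs(V_new - V).max() < eps:
            V = V_new
            break
        V = V_new
    policy = (L + P @ V).argmin(axis=0)     # greedy policy, cf. (9)
    return V, policy

def avg_transmissions(P, policy, trans_count, n_slots=100_000, seed=0):
    """Estimate D(pi) by simulating the Markov chain induced by the policy."""
    rng = np.random.default_rng(seed)
    s, total = 0, 0.0
    for _ in range(n_slots):
        a = policy[s]
        total += trans_count[a]
        s = rng.choice(P.shape[1], p=P[a, s])
    return total / n_slots

def bisection_on_lambda(P, cost, trans_count, Gmax, lam_hi=100.0, zeta=0.1):
    """Outer loop of Alg. 1: larger lambda discourages transmissions, so we
    search for the smallest lambda whose greedy policy satisfies D(pi) <= Gmax."""
    lam_lo = 0.0
    while lam_hi - lam_lo > zeta:
        lam = 0.5 * (lam_lo + lam_hi)
        _, policy = rvia(P, cost, lam, trans_count, Gmax)
        if avg_transmissions(P, policy, trans_count) <= Gmax:
            lam_hi = lam                    # feasible: try a smaller lambda
        else:
            lam_lo = lam                    # infeasible: increase lambda
    _, feasible_policy = rvia(P, cost, lam_hi, trans_count, Gmax)
    return lam_hi, feasible_policy          # yields pi*_{lambda+}
```

The returned policy corresponds to the feasible deterministic policy π*_{λ+}; running `rvia` at `lam_lo` instead would give the infeasible lower-bound policy π*_{λ−} discussed below.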
Theorem 4 below shows that RVIA given by (10) (i.e., Steps 3-16 of Alg. 1) converges and returns the optimal value of the MDP problem (7).
It is worth stressing that, as stated in Remark 2, there is no guarantee, even for an arbitrarily small ζ, that the feasible deterministic policy π*_{λ+}, obtained by Alg. 1, would be an optimal policy for the CMDP problem (6). Intuitively, increasing λ penalizes the transmission cost more heavily in the Lagrangian; thus, increasing λ decreases the average number of transmissions D(π*_λ), which, in turn, increases the WS-AAoI J(π*_λ). Nevertheless, the empirical results in Section VII will show that policy π*_{λ+} has near-optimal performance. At the same time, the infeasible policy π*_{λ−} can serve as a benchmark, because it provides a lower bound on an optimal solution of (6). In Section VII, we will empirically show that policy π*_{λ−} yields a tight lower bound. It is essential to note that the computational complexity of (relative) value iteration algorithms grows dramatically as the state and action spaces increase, i.e., the curse of dimensionality; the detailed complexity analysis of Alg. 1 can be found in Sec. VI.
Since RVIA is run at each iteration of the bisection, Alg. 1 becomes computationally inefficient when applied to a large number of sources. To circumvent the curse of dimensionality, we propose a low-complexity scheduling policy in the next section.
IV. LOW-COMPLEXITY SCHEDULING POLICY TO SOLVE PROBLEM (2)

In this section, we devise DPP-SP (i.e., the drift-plus-penalty-based scheduling policy), using the idea of the drift-plus-penalty method [55], to solve the main problem (2). The proposed DPP-SP is a heuristic policy that has low complexity and, as empirically shown in Section VII, achieves near-optimal performance. We prove that DPP-SP is guaranteed to satisfy constraint (2b).
According to the drift-plus-penalty method [55], the time average constraint (2b) is enforced by transforming it into a queue stability constraint. Accordingly, a virtual queue is associated with constraint (2b) in such a way that the stability of the virtual queue implies satisfaction of the constraint. Let H[t] denote the virtual queue associated with constraint (2b) in slot t, which evolves as in (11). By [55, Ch. 2], the time average constraint (2b) is satisfied if the virtual queue is strongly stable, i.e., lim sup_{T→∞} (1/T) Σ_{t=1}^{T} E{H[t]} < +∞, where the expectation is with respect to the (possibly random) decisions made in reaction to the current system state.
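The update equation (11) was lost in extraction; for a time-average constraint of the form K ≤ Γ_max, the drift-plus-penalty framework [55] uses the standard virtual-queue recursion below, which we adopt as our reconstruction.

```latex
% Reconstructed virtual queue update (11), with
% D(a[t]) = \mathbb{1}\{\alpha[t]\neq 0\} + \mathbb{1}\{\beta[t]\neq 0\}:
H[t+1] = \max\big\{\, H[t] + D(a[t]) - \Gamma_{\max},\ 0 \,\big\}.
```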
The one-slot conditional Lyapunov drift, denoted by ∆[t], is defined as the expected change in the Lyapunov function over one slot given the current system state Z[t]; ∆[t] is given by (12). Applying the drift-plus-penalty method to the main problem (2), we seek a control policy that minimizes an upper bound on the drift-plus-penalty function ϕ[t] in (13) at every slot t, where the expectation is with respect to the channel randomness (i.e., ρ_1[t] and ρ_2[t]) and the (possibly random) decisions made in reaction to the current system state; the parameter V ≥ 0 adjusts a trade-off between the size of the virtual queue and the objective function. It is noteworthy that, in (13), instead of using only the original immediate objective function (i.e., the sum AoI at the destination) as the penalty term, we have added the sum AoI at the relay to the penalty term, so that minimizing the upper bound of the drift-plus-penalty function at each slot also accounts for the evolution of the sum AoI at the relay.
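A reconstruction of (12) and (13) consistent with the text (quadratic Lyapunov function, penalty augmented with the relay AoI) is shown below; the precise weighting inside the penalty term is our assumption.

```latex
% Reconstructed drift (12) and drift-plus-penalty function (13), with
% L(H[t]) = \tfrac{1}{2}H[t]^2 (quadratic Lyapunov function):
\begin{align*}
  \text{(12):}\quad \Delta[t] &= \mathbb{E}\{ L(H[t+1]) - L(H[t]) \mid Z[t] \},\\
  \text{(13):}\quad \phi[t] &= \Delta[t] + V\,\mathbb{E}\Big\{
      \textstyle\sum_{i\in\mathcal{I}} w_i\big(\delta_i[t+1]
      + \theta^{\mathrm{R}}_i[t+1]\big) \,\Big|\, Z[t] \Big\},
\end{align*}
% where \theta^{R}_i again stands in for the relay AoI symbol.
```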
To obtain the upper bound of the drift-plus-penalty function, we first derive an upper bound on the drift term ∆[t], given by the following proposition.

Proposition 1. The conditional Lyapunov drift in (12) is bounded as ∆[t] ≤ B + H[t] E{D(a[t]) − Γ_max | Z[t]}, where B = (1/2)Γ_max^2 + 2.

Proof. See Appendix C.
Let us express the evolution of the AoI and the relative AoIs of each source i ∈ I by the compact formulas in (15); these expressions are for unbounded AoI values, as the derivation of DPP-SP does not require bounding them. Using Proposition 1 and substituting (15) into (13), the upper bound on the drift-plus-penalty function ϕ[t] can be derived as in (16). Now, we turn to minimizing the upper bound of the drift-plus-penalty function given in (16). To this end, we first compute the expectations with respect to the channel randomness, i.e., E{ρ_1[t]} = p_1 and E{ρ_2[t]} = p_2. Then, after removing the terms in (16) that are independent of the decision variables, we need to minimize the expression in (17), where the expectation is with respect to the (possibly random) decisions.
To minimize the expression in (17), we follow the approach of opportunistically minimizing a (conditional) expectation [55, p. 13], i.e., the expression in (17) is minimized by the algorithm that observes the current system state Z[t] and chooses α[t] and β[t] to minimize the expression in (18). Minimizing (18) leads to the per-slot problem (19), from which it can be inferred that the decisions α[t] and β[t] can be optimized separately. In summary, the proposed DPP-SP works as follows: at each slot t, the controller observes the current system state Z[t] = (s[t], H[t]) and determines the transmission decision variables according to the scheduling rules in (21). What remains is to show that DPP-SP, operating according to (21), satisfies constraint (2b).
We prove this in the following theorem.
Theorem 5. Assume that E{L(H[0])} is finite. For any finite V, the virtual queue under DPP-SP operating according to (21) is strongly stable, implying that DPP-SP satisfies constraint (2b).
Proof. See Appendix D.
As can be seen in (21), DPP-SP performs only two simple operations to determine the actions at each slot. Hence, DPP-SP has low complexity and can easily support systems with large numbers of sources. The detailed complexity analysis of DPP-SP can be found in Sec. VI.
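Since the closed-form rules (21) did not survive extraction, the following Python sketch only illustrates the general shape of a DPP-SP slot update under our assumptions: each decision compares a V-weighted expected age-reduction gain against the virtual-queue price H[t]. The specific gain functions (`tx_gain`, `relay_gain`) are hypothetical placeholders, not the paper's exact indices.

```python
import numpy as np

def dpp_sp_slot(theta, x, y, H, w, p1, p2, V, Gmax):
    """One slot of a DPP-SP-style rule: two independent max-gain comparisons,
    one for the transmitter decision alpha and one for the relay decision beta.
    theta, x, y: per-source AoI / relative AoIs; H: virtual queue; w: weights.
    Returns (alpha, beta, H_next), where 0 means 'stay idle'."""
    # Hypothetical per-source gains: V-weighted expected age reduction if the
    # transmission succeeds (probability p1 or p2), minus the queue price H.
    tx_gain = V * w * p1 * x       # sending source i refreshes the relay copy
    relay_gain = V * w * p2 * y    # forwarding source i refreshes the destination

    alpha = int(np.argmax(tx_gain)) + 1 if tx_gain.max() > H else 0
    beta = int(np.argmax(relay_gain)) + 1 if relay_gain.max() > H else 0

    # Virtual queue update, cf. the reconstructed (11).
    n_tx = (alpha != 0) + (beta != 0)
    H_next = max(H + n_tx - Gmax, 0.0)
    return alpha, beta, H_next

# Example usage with I = 2 sources (all numbers illustrative):
alpha, beta, H = dpp_sp_slot(
    theta=np.array([2, 5]), x=np.array([1, 3]), y=np.array([4, 2]),
    H=0.5, w=np.array([1.0, 1.0]), p1=0.7, p2=0.8, V=100.0, Gmax=1.0)
```

The O(I) complexity claimed in Sec. VI is visible here: each decision is a single argmax over the I sources followed by one threshold comparison.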

V. A DEEP REINFORCEMENT LEARNING ALGORITHM TO SOLVE PROBLEM (2)
In this section, we develop a deep reinforcement learning algorithm to solve the main problem (2). Inspired by [56], we use the Lyapunov optimization theory to convert the CMDP problem (6) into an MDP problem, which is then solved by a model-free deep learning algorithm, namely, D3QN (i.e., dueling double deep Q-network) [34], [35]. Note that another approach to the CMDP problem (6) could be a primal-dual reinforcement learning algorithm. In contrast to our algorithm, such an algorithm leads to an iterative optimization procedure. Thus, the proposed Lyapunov-based learning algorithm is in general simpler than a primal-dual DRL-based algorithm.
It is worth pointing out that: i) as D3QN is a model-free algorithm, we do not require the state transition probabilities of the MDP problem; thus, the proposed deep learning algorithm is applicable to unknown environments (i.e., when the packet arrival rates and the successful transmission probabilities of the (wireless) links are not available at the controller), and ii) there is no guarantee that the proposed deep learning algorithm provides an optimal policy for the main problem (2); however, an advantage of the deep learning algorithm is its ability to cope with unknown environments with large state and/or action spaces, for which it can be used as a benchmark policy. We further note that, to implement the proposed learning algorithm, we do not need to bound the AoI values or store the state space (which may require considerable memory).
We define the expected time average reward function obtained by policy π as in (22), where r is the immediate reward function and L(·) is the quadratic Lyapunov function with the virtual queue H[t] given by (11). It is worth pointing out that the Lyapunov drift in the reward function is introduced to guarantee the satisfaction of the average constraint (2b) [56], [57]. We then maximize the expected time average reward, stated as problem (23). Problem (23) can be formulated as an MDP problem, where r[t] is the immediate reward, the state is Z[t] = {s[t], H[t]}, and the action is a[t] = (α[t], β[t]). To solve the MDP problem, we apply D3QN; implementation details are presented in Sec. VII.
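A reconstruction of (22)-(23) in the spirit of the Lyapunov-based reward shaping of [56] is given below; the exact scaling between the AoI penalty and the drift term is our assumption.

```latex
% Reconstructed immediate reward (22) and learning problem (23):
\begin{align*}
  r[t] &= -\,V \sum_{i\in\mathcal{I}} w_i\,\delta_i[t]
          \;-\; \big( L(H[t+1]) - L(H[t]) \big),\\
  \text{(23):}\quad &\max_{\pi}\ \liminf_{T\to\infty}\frac{1}{T}
      \sum_{t=0}^{T-1}\mathbb{E}_\pi\{ r[t] \}.
\end{align*}
```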

VI. COMPLEXITY ANALYSIS
Here, we analyze the (overall) computational complexity of the proposed policies. In terms of complexity, there are two different phases: 1) the offline phase, i.e., an initial phase to find a policy, and 2) the online phase, where the (offline-derived) policy is used to generate the corresponding action at each slot. DPP-SP does not have an offline phase, whereas the deterministic policy obtained by Alg. 1 and the deep learning policy have both offline and online phases. Next, we elaborate on the complexity of the proposed policies in each phase. The complexity of the policies is summarized in Table II.
• The deterministic policy: The complexity of the offline phase of the deterministic policy is the complexity of running Alg. 1. Alg. 1 is an iterative algorithm that alternates between bisection and RVIA. The complexity order of each iteration of RVIA is at most O(|S|^2 |A|), where the state space size |S| is approximately N^{3I} and the action space size |A| is (I + 1)^2. Accordingly, the complexity of the offline phase of the deterministic policy is O(M_1 M_2 |S|^2 |A|), where M_1 and M_2 are, respectively, the numbers of iterations required by the bisection and RVIA. For example, for the simulation setting of Section VII (N = 10, I = 2), we have |S| ≈ 10^6 and |A| = (2 + 1)^2 = 9. The complexity of the online phase is O(1), since one only needs to fetch the corresponding action of each state from the lookup table obtained in the offline phase.
• DPP-SP: As mentioned above, DPP-SP does not have an offline phase. In the online phase, the policy needs I comparisons for each of the two decision variables, i.e., 2I comparisons in total. Therefore, the complexity of DPP-SP in the online phase is O(I).
• The deep learning policy: The offline phase of the deep learning policy is its training phase.
Because the policy is based on a deep neural network, its (computational) complexity is mainly related to the model and size of the neural network and the training process. The training complexity of the neural network consists of two stages: 1) the forward propagation algorithm (forward pass) and 2) the backpropagation algorithm (backward pass). The complexity of the forward propagation algorithm is O(P_h(P_i + M_3 P_h + P_o)) [58], [59], where P_i = 1 + 3I is the number of neurons in the input layer (which equals the number of elements in the state vector) and P_o = |A| is the number of neurons in the output layer. Moreover, P_h is the number of neurons in each hidden layer and M_3 is the number of hidden layers; it is assumed that all hidden layers have the same number of neurons. The complexity of the backpropagation algorithm is similar to that of the forward propagation algorithm [58]. In terms of the training process, the complexity is mainly related to the number of episodes M_4, the number of iterations per episode M_5, and the batch size M_6 (i.e., the number of samples used to update the weights of the neural network). Accordingly, the overall complexity of the offline phase of the policy is O(M_4 M_5 M_6 P_h(P_i + M_3 P_h + P_o)). In the online phase, the action selection is done by executing the forward propagation algorithm, and thus, the complexity of the online phase of the policy is O(P_h(P_i + M_3 P_h + P_o)).

VII. NUMERICAL RESULTS
In this section, we numerically evaluate the WS-AAoI (i.e., the weighted sum average AoI at the destination) performance of the three proposed policies: 1) the deterministic policy π*_{λ+} obtained by the structure-aware RVIA in Alg. 1, 2) DPP-SP given by (21), and 3) the deep learning policy provided in Section V. For Alg. 1, we set N = 10, I = 2, ζ = 0.1, and ε = 0.001.
For the deep learning policy, we consider a fully-connected deep neural network consisting of an input layer (|Z[t]| = 6 + 1 = 7 neurons), 2 hidden layers consisting of 512 and 256 neurons with the ReLU activation function, and an output layer (|A| = 9 neurons). Moreover, the number of steps per episode is 600, the discount factor is 0.99, the mini-batch size is 64, the learning rate is 0.0001, and the optimizer is RMSProp [60]. The sources' weights are set to 1 for all sources.
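As an illustration, a dueling Q-network matching the stated architecture (7 inputs, hidden layers of 512 and 256 neurons with ReLU, 9 outputs) can be written in PyTorch as below; the split point of the dueling value/advantage heads is our choice, since the text does not specify it.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling Q-network for D3QN: shared trunk, then separate state-value
    and advantage heads, recombined as Q = V + (A - mean(A))."""
    def __init__(self, state_dim=7, n_actions=9):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.value_head = nn.Linear(256, 1)          # V(Z)
        self.adv_head = nn.Linear(256, n_actions)    # A(Z, a)

    def forward(self, z):
        h = self.trunk(z)
        v = self.value_head(h)
        a = self.adv_head(h)
        return v + a - a.mean(dim=-1, keepdim=True)  # Q(Z, a)

# Double-DQN target: the online net selects the action, the target net
# evaluates it, which mitigates Q-value overestimation.
online, target = DuelingQNet(), DuelingQNet()
target.load_state_dict(online.state_dict())
z_next = torch.randn(64, 7)                          # a dummy mini-batch
with torch.no_grad():
    best_a = online(z_next).argmax(dim=1, keepdim=True)
    q_next = target(z_next).gather(1, best_a)        # used in the TD target
```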
The system parameters, i.e., the arrival rates µ = (µ_1, µ_2), the channel reliabilities p = (p_1, p_2), and the constraint budget Γ_max, are specified in the caption of each figure. Next, we provide algorithm-specific analysis in Section VII-A and performance comparisons in Section VII-B.
A. Algorithm-specific Analysis

1) Algorithm 1: Here, we verify Theorem 3 by visualizing the switching-type structure of λ-optimal policies and investigate the WS-AAoI performance of the deterministic policy π*_{λ+}. Fig. 2(a) shows the structure of a λ-optimal policy for the decision at the relay β with respect to the relative AoIs at the destination y_1 and y_2 for state s = (1, 0, y_1, 2, 1, y_2). The figure validates Theorem 3 and reveals that the relay schedules an available packet of the source that has the higher relative AoI at the destination; this is because delivering such a packet contributes more to the AoI reduction than delivering a packet of the source with a lower relative AoI at the destination. Having β = 0 at (y_1 = 0, y_2 = 0) is because the most recent status update packets of the sources at the relay are also available at the destination; thus, resending them would not reduce the AoI. Fig. 2(b) exemplifies the structure of the λ-optimal policy for the decision at the transmitter α with respect to the relative AoIs at the relay x_1 and x_2 for state s = (1, x_1, 4, 1, x_2, 4). Having α = 0 at (x_1 = 0, x_2 = 1) implies that a transmission does not occur in every state, due to the resource budget. Moreover, α = 0 at (x_1 = 0, x_2 = 0) is because the most recent status update packets of the sources at the transmitter have already been sent to the relay. The figure also shows that, for fixed y_1 and y_2, the transmitter gives a higher priority to scheduling transmissions of Source 1, which has the lower arrival rate. That is, while the low packet arrival rate of Source 1 inevitably leads to the destination receiving status updates infrequently, the optimal policy partly compensates for this by prioritizing fresh packets from Source 1 whenever possible. Fig. 3 illustrates the WS-AAoI performance of the policies obtained by Alg. 1 as a function of the constraint budget Γ_max, obtained by averaging over 100,000 time slots. The "lower bound" is obtained by the infeasible policy π*_{λ−}. First, Fig. 3 shows that the deterministic policy π*_{λ+} achieves near-optimal performance and that the lower bound is tight, because the difference between the feasible policy and the infeasible policy is small. In addition, we observe that the gap between the deterministic policy and the lower bound increases as Γ_max decreases. Thus, randomizing these two policies would produce the highest relative gain in this regime.
2) DPP-SP: For DPP-SP, we investigate the impact of the trade-off parameter V on the WS-AAoI and the average number of transmissions in the system in Fig. 4. Fig. 4(a) shows the evolution of the WS-AAoI over the time slots for different values of V. We observe that, for sufficiently small values of V, increasing V decreases the WS-AAoI. Fig. 4(b) shows the evolution of the average number of transmissions over the time slots. The figure validates Theorem 5 by showing that the time average constraint (2b) is satisfied for all V. However, the convergence speed decreases as V increases. These observations give us a practical guideline: parameter V should be set large (but not excessively so) to obtain a low WS-AAoI, because increasing V beyond a certain value does not bring significant improvements.
3) Deep Learning Policy: For the deep learning policy, we show the evolution of the episodic reward over the episodes in Fig. 5(a), the evolution of the average number of transmissions over the episodes in Fig. 5(b), and the evolution of the WS-AAoI over the episodes in Fig. 5(c) for different values of Γ_max. The episodic reward is defined as the sum of the rewards obtained in each episode. Fig. 5(b) validates that the proposed deep learning policy satisfies the time average constraint (2b) for all the constraint budgets. However, the convergence speed is highly affected by Γ_max, i.e., as Γ_max increases, the policy converges more quickly. The same convergence behavior is seen for the episodic reward function in Fig. 5(a) and the WS-AAoI in Fig. 5(c).

B. Performance Comparisons
In this subsection, we provide a performance comparison of the proposed policies. The results are averaged over 100,000 time slots, and the parameter V is set to 100. For comparison, we also consider a greedy "baseline policy", which determines the transmission decision variables at each slot t as follows: it schedules transmissions at the transmitter and the relay whenever the running average number of transmissions remains within the budget; otherwise, α[t] = 0 and β[t] = 0, where D_t denotes the average number of transmissions up to slot t. This policy satisfies the time average constraint (2b). It is notable that the baseline policy and DPP-SP (given by (21)) have similar computational complexity.
1) Effect of the Constraint Budget: Fig. 6 depicts the WS-AAoI performance of the proposed policies and the baseline policy as a function of the constraint budget Γ_max. First, Fig. 6 reveals that the low-complexity DPP-SP has near-optimal performance, as it nearly coincides with the (near-optimal) RVIA-based deterministic policy π*_{λ+} obtained by Alg. 1. The figure also shows that the deep learning policy obtains near-optimal performance when the constraint budget becomes sufficiently large, e.g., Γ_max ≥ 0.8. Moreover, the figure shows that the WS-AAoI performance gap between the baseline policy and the proposed policies is extremely large when the constraint budget is small; this is because, in such cases, performing good actions in each slot becomes more critical due to the tight limitation on the average number of transmissions.
The figure shows that the proposed policies achieve up to almost 91% improvement in the WS-AAoI performance compared to the baseline policy. Finally, we observe that, as the constraint budget increases, the WS-AAoI values decrease; however, from a certain point onward, increasing the constraint budget does not considerably decrease the WS-AAoI.

2) Effect of the Arrival Rates: In Fig. 7(a), we examine the impact of the arrival rates µ_1 and µ_2 on the WS-AAoI performance of the different policies. The figure shows that the WS-AAoI increases as the arrival rates decrease. This is because, when the arrival rates decrease, the rate of fresh update delivery at the destination decreases. The figure also reveals that, as the arrival rates increase, the reduction of the WS-AAoI by the proposed policies in comparison to the baseline policy becomes increasingly more prominent. The reason for this behavior is that, with increasing arrival rates, there are more fresh packets which can potentially reduce the AoI if they are delivered to the destination in a timely/optimal manner; the greedy baseline policy, however, cannot deliver them timely. Moreover, we observe that when the arrival rates are sufficiently large, increasing them further does not considerably reduce the WS-AAoI. This observation is due to the fact that, in our system, only one packet can be transmitted in each slot, and for large values of the arrival rates, the probability of having at least one fresh packet does not change considerably with the arrival rates.
3) Effect of the Successful Transmission Probabilities: In Fig. 7(b), we examine the impact of the successful transmission probabilities p_1 and p_2 on the WS-AAoI performance of the different policies. First, the figure shows that the WS-AAoI performance gap between the proposed policies and the baseline policy is significant, especially when the successful transmission probabilities are small. The reason is that when the successful transmission probabilities are small, finding optimal transmission times becomes more critical, as there are resource limitations. Moreover, the figure shows that the WS-AAoI considerably decreases as the successful transmission probabilities increase; this is expected, because the probabilities of successfully receiving the transmitted status update packets through the unreliable links increase, and consequently, the destination receives updates more frequently.

4) Effect of the Number of Sources: In Fig. 8, we show the effect of the number of sources on the WS-AAoI for different values of the constraint budget Γ_max without bounding the AoI. Here, we utilize DPP-SP, the deep learning policy, and the greedy baseline policy; notably, as explained in Section VI, Alg. 1 is not scalable to a multi-source setup with a high number of sources. The figure shows that the WS-AAoI increases with I. This is because, for a fixed Γ_max, when I increases, the transmission opportunities for each source decrease; thus, the WS-AAoI increases.

5) Effect of the Source's Weight: Fig. 9 illustrates the impact of the weight w_i on the AAoI of the sources for DPP-SP in a two-source setup. As can be seen, increasing the weight of a source decreases its AAoI, as expected. The reason is that by increasing the weight of a source, we put more emphasis on the AoI of that source, and thus, the policy tries to keep its AoI lower.

VIII. CONCLUSION
We studied the WS-AAoI minimization problem in a multi-source relaying system with stochastic arrivals and unreliable channels, subject to a transmission capacity constraint and a constraint on the average number of transmissions. We formulated a stochastic optimization problem and solved it with three different algorithms. Specifically, we proposed the CMDP approach, in which we first showed that an optimal policy of the MDP problem has a switching-type structure and subsequently utilized this structure to devise a structure-aware RVIA that gives a near-optimal deterministic policy and a tight lower bound; the convergence of the algorithm was also proven. We devised the dynamic near-optimal low-complexity DPP-SP, representing an efficient online scheduler for systems with large numbers of sources. Moreover, we devised a deep learning policy combining the Lyapunov optimization theory and D3QN.
We numerically investigated the effect of the system parameters on the WS-AAoI and showed the effectiveness of our proposed policies compared to the baseline policy; the results showed up to 91% improvement in the WS-AAoI performance. Accordingly, an age-optimal scheduler design is crucial for resource-constrained relaying status update systems, where greedy-based scheduling is inefficient. Moreover, the results showed that the proposed deep learning policy satisfies the time average constraint and achieves performance close to the other proposed near-optimal policies in many settings.

A. Proof of Theorem 1
By [54, Exercise 4.3], it is sufficient to show that the Markov chain described by the transition probability matrix with elements P_ss′(a), corresponding to every deterministic policy, has a state which is accessible from any other state. We show this by dividing the sources into two different groups I_1 and I_2 based on the values of the arrival rates µ_i, i.e., sources with µ_i = 1 belong to I_1 and sources with µ_i ∈ (0, 1) belong to I_2. Let us express each state s ∈ S by s = {s_i}_{i∈I_1∪I_2}, where, recall, s_i = (θ_i, x_i, y_i). Then, in the Markov chain induced by every deterministic policy, the state s_acc = {s_acc,i}_{i∈I_1∪I_2}, where s_acc,i = (0, N, 0), ∀i ∈ I_1, and s_acc,i = (N, 0, 0), ∀i ∈ I_2, is accessible from any other state. This is due to the fact that, regardless of the actions taken: (1) there are always new arrivals for the sources belonging to I_1, (2) the probability of having no arrivals for all sources belonging to I_2 for at least N consecutive slots is ∏_{i∈I_2}(1 − µ_i)^N, which is positive, and (3) the probability of having unsuccessful receptions at both the relay and the destination for at least N consecutive slots is (1 − p_1)^N (1 − p_2)^N, which is positive. Thus, according to the evolution of the AoIs, starting from any state at any slot t leads to state s_acc with a positive probability, which completes the proof.

B. Proof of Theorem 3
To show the switching-type structure w.r.t. y for a λ-optimal policy, we use Theorem 2. First, by turning the optimality equation (8) into the iterative procedure (10), for each state s ∈ S, we can iteratively obtain h(s) = V(s) − V(s_ref) and, consequently, π*_λ(s) (see (9)). We then use (10) and show a monotonicity property of the function V(s) in the following lemma, which will be used in the next steps of the proof.

Lemma 1. The function V(s) is non-decreasing with respect to every s_j, where s_j, j = 1, ..., 3I, is the j-th element of the state vector s = (θ_1, x_1, y_1, ..., θ_I, x_I, y_I).
Proof. The proof is based on induction. First, the sequence {V_n(s)}_{n=1,2,...}, updated by (10), converges to V(s) for any initialization (see Theorem 4); also, Lemma 1 trivially holds for V_0(s). Now, assume that V_n(s) is non-decreasing in s_j. The immediate cost of the MDP, L(s, a; λ) = Σ_{i=1}^{I} w_i(θ_i + x_i + y_i) + λ(D(a) − Γ_max), is a non-decreasing function of s_j, j = 1, ..., 3I. In addition, Σ_{s′∈S} P_ss′(a)V_n(s′) is a non-decreasing function of s_j by the induction hypothesis, and the minimum operator in (10) preserves the non-decreasing property. Hence, V_{n+1}(s) is non-decreasing in s_j, and letting n → ∞ establishes the property for V(s).

D. Proof of Theorem 5
To show the strong stability of the virtual queue under DPP-SP, we first define an idle policy that chooses the idle decisions in each slot t, i.e., α_idl[t] = 0 and β_idl[t] = 0; hence, a_idl[t] = (0, 0).
By using inequality (16), we obtain the chain of inequalities (28), where (a) is due to inequality (16); (b) follows because DPP-SP, given by (21), minimizes the upper bound of the drift-plus-penalty function (i.e., the R.H.S. of (a) in (28)) in each slot t among all possible decisions, including the idle decisions; and (c) is due to the fact that, for any decisions in slot t, the expectations E{1 − ρ_1[t]} and E{1 − ρ_2[t]} are bounded by one. In (29), summing over t = 0, ..., T − 1 (using the law of telescoping sums), dividing by the positive T and Γ_max, and rearranging yields (30), where (a) follows because we neglect the negative terms on the L.H.S. of (a). Taking the lim sup of (30) as T → ∞, and since E{L(H[0])} is finite, we obtain (31), which implies that lim sup_{T→∞} (1/T) Σ_{t=1}^{T} E{H[t]} < +∞, i.e., the virtual queue is strongly stable. By [55, Ch. 2], strong stability of the virtual queue implies that the time average constraint (2b) is satisfied, which completes the proof.