Age of Incorrect Information Minimization for Semantic-Empowered NOMA System in S-IoT

Satellites can provide timely status updates to massive terrestrial user equipments (UEs) via non-orthogonal multiple access (NOMA) technology in satellite-based Internet of Things (S-IoT) networks. However, most existing downlink NOMA systems are content-independent, which may result in redundant transmissions in S-IoT with limited resources. In this paper, we design a content-aware sampling policy via a semantic-empowered metric, named Age of Incorrect Information (AoII), to evaluate the freshness and value of status updates simultaneously, and formulate a long-term average AoII minimization problem with three constraints, including an average/peak power constraint, network stability, and a freshness requirement. By regarding the long-term average AoII and the three constraints as the Lyapunov penalty and Lyapunov drift, respectively, we transform the long-term average AoII minimization problem into minimizing the upper bound of the Lyapunov drift-plus-penalty (DPP). Then, we utilize the deep reinforcement learning (DRL) algorithm Proximal Policy Optimization (PPO) to design our AoII minimization resource allocation scheme, and solve the non-convex Lyapunov optimization problem to enable the semantic-empowered downlink NOMA system. Simulation results show that our proposed sample-at-change AoII minimization power allocation (SAC-AMPA) scheme achieves the optimal long-term average AoII performance with less power and bandwidth consumption than state-of-the-art schemes.


I. INTRODUCTION
WITH the rapid deployment of low Earth orbit (LEO) satellite constellations, such as Starlink and OneWeb [1], [2], satellite-based Internet of Things (S-IoT) has the potential to be applied in rural and extreme environments for remote monitoring where terrestrial networks are not accessible [3], [4]. S-IoT can provide low-latency data communication and wide coverage of terrestrial user equipments (UEs), making it an important component of the domain-wide sixth-generation (6G) network [5], [6]. The applications of S-IoT include disaster relief, aviation and navigation monitoring, remote sensing, and other fields [7], [8]. In these applications, timely transmission of status updates to terminals is crucial, as obsolete information can lead to serious accidents. To measure the freshness of information, the age of information (AoI) has been proposed, which represents the elapsed time since the latest status update was generated [9], [10]. Moreover, with the increasing demand for timely transmission and spectral efficiency of status updates, non-orthogonal multiple access (NOMA) technology has been applied to S-IoT [11], [12], [13]. The authors in [14] establish a NOMA framework for space-terrestrial satellite networks and propose a resource allocation scheme to optimize the system capacity and energy efficiency. The authors in [15] propose a power allocation scheme to minimize the expected weighted sum AoI (EWSAoI) under three constraints in a NOMA S-IoT system.
However, AoI has shown its shortcomings as a metric of information freshness for satellites with limited onboard resources: even when the source process of interest does not change after the latest status update, a new status update is still sampled and transmitted due to the increase of AoI [15]. This is because the AoI-optimal downlink NOMA system only considers timeliness and fails to evaluate the significance and usefulness of status updates, which leads to content-independent sampling and redundant transmissions [16]. For example, in an intelligent navigation monitoring system, the goal of the satellite is not to continuously transmit more remote sensing imagery, but to reduce the mismatch between the satellite and the UEs under the limited bandwidth.
To overcome the limitations of AoI, the authors in [17] propose a semantic-empowered metric, named age of incorrect information (AoII), to evaluate the freshness and value of status updates received by UEs simultaneously: it measures the freshness by capturing the increasing penalty with time offered by age, and measures the value by the gap between the state of the source and the current knowledge of the receivers. Therefore, AoII-optimal sampling and resource allocation schemes can estimate the appropriate sampling time and consume fewer unnecessary transmission opportunities, thus transmitting meaningful status updates in a timely manner. The authors in [18] compare the performance of AoII- and AoI-optimal policies for an end-to-end status update system, and illustrate that the AoI-optimal policy wastes transmission attempts and performs worse than the AoII-optimal policy. Hence, in this paper we propose a semantic-empowered downlink NOMA system in S-IoT, and design a content-aware sampling policy to achieve long-term average AoII minimization.
Considering that the AoII metric evaluates both the freshness and value of status updates, the AoII performance of the downlink NOMA S-IoT system is affected not only by time-varying channels, limited power, and limited storage buffers, but also by the content-aware source sampling policy and the mismatch between the satellite and UEs. Consequently, the optimization of AoII in a NOMA S-IoT system is a non-convex problem with multiple constraints. Recently, [16] and [19] have explored Markov decision process (MDP) formulations for AoII optimization. The authors in [19] formulate the AoII-optimal sampling problem as an MDP to compute the optimal sampling policy. The authors in [16] also utilize the MDP framework to derive the optimal transmission strategy that minimizes the average AoII. However, with the exponential growth of UEs in S-IoT networks, the optimization of AoII becomes more challenging, and MDP approaches suffer from the curse of dimensionality [20]. On the other hand, the Lyapunov optimization framework provides better stability when S-IoT carries massive amounts of timely, sensitive, and bursty data [21], and thus has the potential to solve the long-term average AoII minimization in S-IoT networks with limited resources. Moreover, Lyapunov optimization problems in communication systems can be solved by deep reinforcement learning (DRL) algorithms with excellent performance [22]. The authors in [23] propose a Lyapunov-guided DRL-based online computation offloading algorithm to maximize the data processing capability in mobile-edge computing networks, which utilizes the deep Q-network and is suitable for discrete action spaces.
Therefore, we utilize the content-aware sampling policy in the semantic-empowered NOMA system, formulate a long-term average AoII minimization problem under three long-term and one short-term constraints via the Lyapunov optimization framework, and finally solve the optimization problem via a DRL-based algorithm. It is worth noting that our AoII minimization problem focuses on the gap between the randomly changing source process and the current estimate at the UEs, which provides a guideline for determining the content-aware sampling and resource allocation of our downlink NOMA system. In particular, the main contributions are summarized as follows.
• To evaluate both the freshness and value of status updates, we propose a "sample-at-change" (SAC) sampling policy based on the semantic-empowered metric AoII for the downlink NOMA system in S-IoT, and formulate a long-term average AoII minimization resource allocation problem under three constraints, including an average/peak power constraint, network stability, and a freshness requirement. To the best of our knowledge, this is the first work to minimize the long-term average AoII in a semantic-empowered NOMA system. Then, we utilize the Lyapunov optimization framework to model the original problem by transforming the three long-term constraints into three virtual queues, and formulate the drift-plus-penalty (DPP) term by regarding the long-term average AoII and the three virtual queues as the Lyapunov penalty and Lyapunov drift, respectively. We derive the upper bound of the DPP, prove that the AoII minimization resource allocation problem can be solved by minimizing this upper bound, and convert the multi-slot long-term optimization problem into a group of single-time-slot AoII minimization problems.

• To solve the non-convex Lyapunov optimization problem, we utilize the DRL algorithm proximal policy optimization (PPO) [24] and design our SAC-AoII minimization power allocation (SAC-AMPA) scheme to achieve the optimal power allocation order and coefficients in the semantic-empowered NOMA S-IoT system, which outperforms the SAC-deep deterministic policy gradient (DDPG) [25] scheme. We analyze the convergence of our SAC-AMPA scheme and introduce two content-independent sampling policies, "periodic-sampling" (PSA) [26] and "generate-at-will" (GAW) [27], for comparison. Simulation results show that our SAC-AMPA scheme achieves both the lowest long-term average AoII and the lowest power consumption compared with the PSA- and GAW-AMPA schemes, which validates that our content-aware SAC sampling policy can transmit status updates with high freshness and value while effectively conserving limited onboard resources.

The rest of this paper is organized as follows. Section II describes our semantic-empowered NOMA S-IoT system, the sampling and transmission policy, and the AoII model. Section III describes the modelling process of our long-term AoII optimization problem under three constraints in detail. Section IV transforms the problem of minimizing the AoII into the problem of minimizing the upper bound of the DPP term and proves its optimality. In Section V, we propose our SAC-AMPA scheme based on the DRL algorithm. In Section VI, we conduct simulation experiments to compare the AoII performance and power consumption of our SAC-AMPA scheme with other state-of-the-art schemes. Finally, we give the conclusion in Section VII.

II. SYSTEM MODEL
In this section, we present the system model of the semantic-empowered NOMA S-IoT system. The semantic-empowered features stem from the following aspects: the content-aware sampling and transmission policy, and the formulation of the long-term average AoII to simultaneously capture the freshness and value of information at all UEs.

A. Semantic-Empowered NOMA S-IoT System
We consider the semantic-empowered NOMA S-IoT system shown in Fig. 1. The LEO multibeam high-throughput satellite S improves the communication service quality between S and the terrestrial UEs through multiple steerable spot beams. To address spectrum limitations and inter-beam interference, we assume that the LEO satellite S serves the terrestrial UEs through hybrid multiple access. Specifically, S serves different beams through orthogonal multiple access (OMA), and divides the frequency band into three sub-bands to prevent overlapping spectrum allocation between adjacent spot beams. Furthermore, S utilizes NOMA to transmit status updates to M UEs simultaneously within the coverage of each spot beam. Consequently, our research primarily focuses on the transmission within a single spot beam in the downlink network. We assume that the time period is divided into T time slots, with t ∈ {0, 1, . . . , T − 1} denoting the current time slot. Each slot has a duration τ, which equals the propagation delay from S to the UEs.
Furthermore, we consider the following parameters in our analysis: the diameter of a spot beam is d = 100 km, the altitude of satellite S is h = 600 km, the carrier frequency is f = 30 GHz, and the minimum elevation angle is 30° [28]. Note that although the velocity of S is about v = 7.5 km/s, we can assume that the M UEs are quasi-stationary within one time slot duration τ [29]. Moreover, the UEs are equipped with global navigation satellite system (GNSS) receivers to pre-compensate the effects of Doppler shifts with a sufficient guard band, as specified in 3GPP TR 36.763 [30]; thus our system can mitigate the influence of Doppler shifts.
Since the obstacles and occlusions around the UEs lead to scattering and masking effects, we model the S-to-UE channel with the widely utilized Shadowed-Rician (SR) fading channel model, which incorporates both fading and masking effects [31]. We assume that the channels between S and the UEs follow SR fading and are independent and identically distributed (i.i.d.). The probability density function (PDF) of the channel gain |ch_i|² is expressed as follows [32],

f_{|ch_i|²}(x) = (2b_i m / (2b_i m + Ω))^m · (1/(2b_i)) · exp(−x/(2b_i)) · ₁F₁(m; 1; Ωx / (2b_i(2b_i m + Ω))), x ≥ 0,

where 2b_i, m and Ω denote the average power of the scatter component, the Nakagami-m parameter, and the average power of the line-of-sight (LoS) component, respectively, and ₁F₁(·; ·; ·) represents the confluent hypergeometric function. For S with a single transmitting antenna, the cumulative distribution function (CDF) of |ch_i|² can be expressed as in [32], where γ(s, x) = ∫₀ˣ t^{s−1} e^{−t} dt is the lower incomplete Gamma function.
For S with N_s transmitting antennas, the CDF of |ch_i|² follows from [33], where B(·, ·) is the Beta function. Without loss of generality, we assume that the channel condition remains invariant within each slot t and changes randomly between slots.
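For intuition, SR channel coefficients can be generated directly from the generative form of the model: a Nakagami-m distributed LoS amplitude (i.e., a Gamma-distributed LoS power with mean Ω) plus a complex Gaussian scatter component with power 2b. The following Python sketch is illustrative only; the function name and the parameter values are placeholders rather than the Table II settings.

    import numpy as np

    def sr_fading_samples(n, b, m, omega, rng=np.random.default_rng(0)):
        """Draw n Shadowed-Rician channel coefficients.

        Generative form of the SR model: ch = A*exp(j*phi) + Z, where the
        LoS amplitude A is Nakagami-m distributed (so A^2 ~ Gamma(m, omega/m))
        and the scatter component Z is complex Gaussian with power 2b.
        """
        los_power = rng.gamma(shape=m, scale=omega / m, size=n)  # E[A^2] = omega
        phi = rng.uniform(0.0, 2.0 * np.pi, size=n)              # uniform LoS phase
        scatter = np.sqrt(b) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
        return np.sqrt(los_power) * np.exp(1j * phi) + scatter

    # Example with heavy-shadowing-like parameters (placeholders).
    ch = sr_fading_samples(100_000, b=0.063, m=0.739, omega=8.97e-4)
    print("mean channel gain E[|ch|^2]:", np.mean(np.abs(ch) ** 2))

The printed mean should be close to Ω + 2b, the total average power of the LoS and scatter components.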
We focus on the scenario where S is equipped with N_s transmitting antennas while each UE is equipped with a single receiving antenna. We denote s_i(t) as the desired signal and p_i(t) ∈ C^{N_s} as the complex weight column vector for the allocated transmit power of the i-th UE, UE_i, at t. The indices of the M activated UEs are assigned according to the transmit powers sorted by S for the desired signals in ascending order; in other words, the signal s_M(t) is allocated the largest power |p_M(t)|². Consequently, the superposed signal s(t) for the M activated UEs is

s(t) = Σ_{i=1}^{M} p_i(t) s_i(t).

After s(t) has been broadcast to the M UEs, the received signal y_i(t) of UE_i can be expressed as

y_i(t) = (1/√L_F) ch_i(t) s(t) + n_i(t),

where ch_i(t) ∈ C^{N_s}, following the SR fading distribution, denotes the row vector of channel coefficients, and L_F (dB) = 92.4 + 20 log f (GHz) + 20 log d (km) denotes the free space loss from S to UE_i, with f and d denoting the spot beam carrier frequency and the propagation distance from S, respectively. n_i(t) ∼ CN(0, σ²) represents the additive white Gaussian noise (AWGN) with variance σ². Then, by regarding the other UEs' signals as intra-cell interference, UE_i employs successive interference cancellation (SIC) to recover s_i(t) from y_i(t). In detail, s_M(t), with the highest power, is decoded first by regarding all other M − 1 UEs' signals as intra-cell interference. If the SIC decoding succeeds, s_M(t) is subtracted from y_i(t), and s_{M−1}(t), with the second-highest power, is decoded next, and so on, until s_i(t) is recovered at UE_i. Denote g_i(t) as the composite channel gain of UE_i. Based on the principle of NOMA, given the assumption that the allocated powers are sorted in ascending order, we can further assume that the composite channel gains are sorted in descending order, i.e., |g_1(t)|² ≥ |g_2(t)|² ≥ · · · ≥ |g_M(t)|². Then, the allocated power for the signal of UE_i should satisfy certain conditions to guarantee successful SIC decoding [34], where η ∈ [0, 1] is the imperfect SIC coefficient [35].
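To make the SIC decoding order concrete, the following Python sketch evaluates per-UE SINRs for one slot under imperfect SIC; the SINR expression is a common textbook form rather than the exact decoding condition of [34], and all numbers (including the noise power) are placeholders.

    import numpy as np

    def noma_sinr(g, p, eta, sigma2):
        """Per-UE SINR under downlink NOMA with imperfect SIC.

        g:   composite channel gains |g_i|^2, sorted descending (UE_1 strongest).
        p:   allocated powers |p_i|^2, sorted ascending (UE_M gets the most power).
        eta: imperfect SIC coefficient in [0, 1], i.e., the residual fraction of
             already-cancelled higher-power signals.
        UE_i decodes s_M, ..., s_{i+1} first, cancels them (imperfectly), and
        treats the lower-power signals s_1, ..., s_{i-1} as interference.
        """
        M = len(g)
        sinr = np.empty(M)
        for i in range(M):
            intra = g[i] * np.sum(p[:i])             # lower-power signals, not cancelled
            residual = eta * g[i] * np.sum(p[i+1:])  # imperfectly cancelled signals
            sinr[i] = g[i] * p[i] / (intra + residual + sigma2)
        return sinr

    # Worked free-space loss at f = 30 GHz over 600 km:
    # L_F = 92.4 + 20*log10(30) + 20*log10(600) ≈ 92.4 + 29.5 + 55.6 ≈ 177.5 dB.
    L_F = 92.4 + 20 * np.log10(30) + 20 * np.log10(600)
    g = np.array([1.0, 0.5, 0.1]) * 10 ** (-L_F / 10)  # composite gains, descending
    p = np.array([1.0, 3.0, 6.0])                      # powers, ascending
    print(L_F, noma_sinr(g, p, eta=0.05, sigma2=1e-21))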

B. Content-Aware Sampling and Transmission Policy
We model the source as an N-state discrete Markov chain (X(t))_{t∈N}, as shown in Fig. 2. At each t, the source transitions to one of the two adjacent states with total probability 2p (p to each), while it remains in the current state with probability 1 − 2p [18]. For convenience, we assume that the source state transition occurs at the beginning of each t. Then, S decides whether to sample and transmit the source according to its sampling policy, where we propose a "sample-at-change" (SAC) content-aware sampling policy for the semantic-empowered downlink NOMA system, which samples whenever the source changes its state. We also compare it with two content-independent sampling policies (sketched below): 1) "periodic-sampling" (PSA), which samples periodically with period 1/(2p), so that the expected number of samples and transmissions equals that of the SAC policy; and 2) "generate-at-will" (GAW), which samples and transmits in each time slot regardless of source state transitions.
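The source dynamics and the three sampling policies can be summarized in a short Python sketch; the state-space size, the reflecting behavior at the boundary states, and the horizon are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    N, p, T = 8, 0.25, 20          # N states, neighbor prob. p (2p total), T slots

    def step(x):
        """One slot of the N-state Markov source: move to an adjacent state
        with prob. p each, stay with prob. 1 - 2p (assumed reflecting at the
        boundary states)."""
        u = rng.random()
        if u < p:
            return min(x + 1, N - 1)
        if u < 2 * p:
            return max(x - 1, 0)
        return x

    x, last_sample = 0, 0
    for t in range(T):
        x = step(x)                                  # transition at slot start
        sac = (x != last_sample)                     # SAC: sample at change
        psa = (t % int(round(1 / (2 * p))) == 0)     # PSA: fixed period 1/(2p)
        gaw = True                                   # GAW: sample every slot
        if sac:
            last_sample = x
        print(t, x, sac, psa, gaw)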
Moreover, UE_i feeds back an ACK to S if the status update is recovered and a NACK otherwise, and S utilizes the preempt-last sample, first serve (P-LSFS) scheduling policy, i.e., S retransmits an unrecovered status update until a new sample is generated. Therefore, S transmits the new or unrecovered status updates to the UEs in each time slot; note that we assume the propagation delay from S to the UEs equals τ.

C. Model of Long-Term Average AoII
In the semantic-empowered NOMA S-IoT system, we utilize AoII to measure the freshness and value of the status update of UE_i simultaneously. Let X̂_i(t) denote the last recovered status update of UE_i; the AoII is defined as

Δ_i(X_i(t), X̂_i(t), t) = f_i(t) · g_i(X_i(t), X̂_i(t)),

where the information penalty function g_i(X_i(t), X̂_i(t)) quantifies the difference between the current state of the source and the last recovered status update of UE_i, and is defined as g_i(X_i(t), X̂_i(t)) = |X_i(t) − X̂_i(t)|, since our system cannot tolerate any gap between the source state and the current knowledge of the UEs. The increasing time penalty function f_i(t) measures the number of time slots for which UE_i has maintained an unsynchronized status update; note that our system does not need to transmit status updates as quickly as possible, and we have f_i(t) = t − W_i(t) [18], where W_i(t) represents the last time slot before slot t in which UE_i had the same status update as the state of the source (g_i(X_i(t), X̂_i(t)) = 0). Therefore, if the last recovered status update of UE_i is the same as the current source state, UE_i is not penalized; otherwise, a penalty is imposed and increases with the number of time slots. Let d_i(t) ∈ {1, 0} denote whether the status update of UE_i is recovered successfully in t; the evolution of the AoII is shown in Fig. 3. Initially, X_i(0) and X̂_i(0) are both in state 1, with X_i(0) = X̂_i(0) = 1 and W_i(0) = 0. Then, X_i(1) transitions to state 2 with probability 2p and Δ_i(X_i(1), X̂_i(1), 1) rises to 1. Since d_i(2) = 1, we have X̂_i(2) = X_i(1) = 2, and Δ_i(X_i(2), X̂_i(2), 2) resets to 0. In t = 3, d_i(3) = 0 as the SIC decoding of UE_i fails, so X̂_i(3) = 2; however, X_i(3) transitions to state 3, and Δ_i(X_i(3), X̂_i(3), 3) increases to 1.
Therefore, we define the long-term average AoII ∆AoII to evaluate the freshness and value of the status updates of all M UEs in our semantic-empowered NOMA S-IoT system as

∆AoII = lim_{T→∞} (1/T) Σ_{t=0}^{T−1} (1/M) Σ_{i=1}^{M} E[Δ_i(X_i(t), X̂_i(t), t)],

where both the sampling policy and the power allocation scheme in each time slot affect the expectation E[·].
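The per-slot AoII bookkeeping implied by the definitions above can be written in a few lines of Python; in this sketch the decoding outcome d_i(t) is an exogenous input rather than the result of SIC, and the trace reproduces the Fig. 3 walk-through.

    def aoii_step(x_src, x_hat, W, t, decoded, x_sampled):
        """One-slot AoII update for a single UE.

        x_src:     current source state X_i(t)
        x_hat:     last recovered update before this slot
        W:         last slot with zero mismatch penalty
        decoded:   d_i(t), 1 if SIC decoding succeeded in this slot
        x_sampled: state carried by the packet being transmitted
        Returns (new x_hat, new W, AoII value f_i(t) * g_i(t)).
        """
        if decoded:
            x_hat = x_sampled                # receiver knowledge is refreshed
        g = abs(x_src - x_hat)               # information penalty
        if g == 0:
            W = t                            # synchronized: reset reference slot
        return x_hat, W, (t - W) * g         # time penalty times mismatch

    # Trace matching the Fig. 3 walk-through: recovery at t = 2, failure at t = 3.
    x_hat, W = 1, 0
    for t, (x_src, d, x_s) in enumerate([(1, 0, 1), (2, 0, 1), (2, 1, 2), (3, 0, 3)]):
        x_hat, W, aoii = aoii_step(x_src, x_hat, W, t, d, x_s)
        print(t, x_src, x_hat, aoii)         # AoII: 0, 1, 0, 1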

III. AOII MINIMIZATION PROBLEM FORMULATION
In this section, we analyze three long-and one short-term constraints for AoII optimization in the semantic-empowered NOMA S-IoT system, and model the long-term average AoII minimization problem.

A. Three Constraints
Since the power resources of satellites are limited, we need to consider the short-term peak power constraint when allocating power resources in the S-IoT network. Let P_max denote the maximum total power that S can provide to the M UEs in t; then we have

Σ_{i=1}^{M} |p_i(t)|² ≤ P_max.

Moreover, if the channel gain g_i(t) is poor and the ∆AoII benefit of a new status update is low in t, the limited power resources may be wasted on useless transmission attempts unless a long-term power constraint is considered. Thus, since the short-term power constraint only restricts the power allocation in the current time slot, which may deteriorate the long-term average AoII, the average power consumption must also satisfy the long-term power constraint P_avg:

lim_{T→∞} (1/T) Σ_{t=0}^{T−1} E[Σ_{i=1}^{M} |p_i(t)|²] ≤ P_avg.

In addition, the storage resources of S are also limited. Denote Q_i(t) as the backlog of the queue Q_i that buffers the latest status update packet targeted for UE_i in t. At the beginning of t, if a new status update targeted for UE_i is sampled, the data arrive at Q_i at a rate a_i(t). Since the arrival rate cannot exceed a certain limit of Q_i, we set the upper bound of a_i(t) to a_max. Once UE_i successfully recovers its status update through SIC decoding, the data depart from Q_i; otherwise, the status update is retransmitted in the next time slot until a new status update for UE_i is sampled. According to the SIC decoding, the departure rate b_i(t) of UE_i in t is determined by whether the decoding succeeds. Therefore, the following network stability constraint should be met to prevent data overflow [36]:

lim_{T→∞} E[Q_i(T)] / T = 0, ∀i ∈ {1, 2, . . . , M}.

Finally, if UE_i with a poor channel condition consistently fails to decode, the status update of UE_i may become obsolete and its AoII will increase. Therefore, we define the long-term throughput h̄_i of UE_i as

h̄_i = lim_{T→∞} (1/T) Σ_{t=0}^{T−1} E[b_i(t)].

Then, we set the long-term minimum throughput constraint to h. To ensure that UE_i with a poor channel condition still has the opportunity to transmit status updates, the freshness requirement of UE_i is given by h̄_i ≥ h.

B. Problem Formulation
Therefore, we model a long-term average AoII optimization problem under four constraints, including the long-term and short-term power constraints, the freshness requirement, and the network stability constraint:

min lim_{T→∞} (1/T) Σ_{t=0}^{T−1} (1/M) Σ_{i=1}^{M} E[Δ_i(X_i(t), X̂_i(t), t)]    (16a)
s.t. (16b)–(16e),

where (16b) and (16c) are the short-term peak and long-term average power constraints, respectively, (16d) is the freshness requirement, and (16e) is the network stability constraint.
Since the optimization problem in (16) is a complex long-term optimization problem spanning multiple time slots, we utilize the DPP algorithm in the Lyapunov framework to transform it into a solvable single-slot optimization problem in Section IV.
IV. LYAPUNOV OPTIMIZATION VIA DPP

In this section, we first transform the optimization problem in (16) into minimizing the upper bound of the Lyapunov DPP term. Then, we prove that by solving the weighted optimization problem of the DPP term, we can obtain the AoII minimization resource allocation scheme.

A. Transformation of Optimization Problem
The three constraints in (16c)-(16e) can be transformed into three virtual queues in the Lyapunov optimization framework and satisfied by ensuring system stability [36]. We define the evolutions of the three queues as follows (a simulation sketch is given at the end of this subsection):

• First, we establish the power consumption debt queue P(t) to ensure that the average power consumption per time slot does not exceed P_avg:

P(t + 1) = max[P(t) + Σ_{i=1}^{M} |p_i(t)|² − P_avg, 0].    (17)

• Second, we monitor the data buffer queue Q_i(t) for UE_i at S. Considering that Q_i(t) stands for the queue backlog in time slot t, it evolves as follows [37]:

Q_i(t + 1) = max[Q_i(t) − b_i(t), 0] + a_i(t).    (18)

• Third, the throughput debt queue R_i(t) is established for each UE_i to meet the freshness requirement, and is updated as follows:

R_i(t + 1) = max[R_i(t) + h − b_i(t), 0].    (19)

Lemma 1: If the three queues P(t), Q_i(t) and R_i(t) in (17)-(19) are all rate stable for i ∈ {1, 2, . . . , M}, then the three long-term constraints in (16c)-(16e) are satisfied.
Proof: The derivations are presented in Appendix A. □

According to the Lyapunov optimization framework, let S(t) = {P(t), Q_i(t), R_i(t), i ∈ {1, . . . , M}} represent the virtual queue status in t; then the quadratic Lyapunov function can be expressed as follows [36]:

L(S(t)) = (1/2) [P(t)² + Σ_{i=1}^{M} Q_i(t)² + Σ_{i=1}^{M} R_i(t)²].    (20)

We utilize the Lyapunov drift to represent the variation of the quadratic Lyapunov function in (20), expressed as the conditional expectation under the current state S(t):

D(S(t)) = E[L(S(t + 1)) − L(S(t)) | S(t)].    (21)

By keeping the Lyapunov drift D(S(t)) small, we can maintain the stability of the three queues in (17)-(19), since a low value of D(S(t)) indicates that the three queues are not congested [36].
Furthermore, our objective is to minimize the long-term average AoII while maintaining the stability of the three virtual queues and meeting the corresponding constraints. To achieve this objective, we introduce a penalty function associated with the AoII:

P(S(t)) = E[(1/M) Σ_{i=1}^{M} Δ_i(X_i(t), X̂_i(t), t) | S(t)].    (22)

To sum up, the DPP term of our system can be expressed as

D(S(t)) + V · P(S(t)),    (23)

where V is the importance weight of the penalty function P(S(t)). Based on the Lyapunov optimization framework, we can derive the upper bound of the DPP term as in (24), where c is a constant; the detailed derivations of (24) are presented in Appendix B. Therefore, the original optimization problem in (16) can be transformed into minimizing the upper bound of the DPP term, which we denote as problem (25).
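To illustrate the mechanics of this subsection (the sketch referenced above), the following Python fragment implements the max[·, 0] recursions of the three virtual queues in (17)-(19) and evaluates the single-slot drift-plus-penalty surrogate; all numerical values are placeholders.

    import numpy as np

    def update_virtual_queues(P, Q, R, p_alloc, a, b, P_avg, h):
        """One-slot update of the power-debt, data-buffer and throughput-debt
        queues, following the max[., 0] recursions in (17)-(19)."""
        P_next = max(P + np.sum(p_alloc) - P_avg, 0.0)   # power consumption debt
        Q_next = np.maximum(Q - b, 0.0) + a              # data buffer backlog
        R_next = np.maximum(R + h - b, 0.0)              # throughput debt
        return P_next, Q_next, R_next

    def dpp_objective(P, Q, R, p_alloc, a, b, aoii, P_avg, h, V):
        """Single-slot DPP surrogate: quadratic Lyapunov drift of the virtual
        queues plus V times the average AoII penalty."""
        lyap = lambda P_, Q_, R_: 0.5 * (P_**2 + np.sum(Q_**2) + np.sum(R_**2))
        P2, Q2, R2 = update_virtual_queues(P, Q, R, p_alloc, a, b, P_avg, h)
        drift = lyap(P2, Q2, R2) - lyap(P, Q, R)
        return drift + V * np.mean(aoii)

    # Toy evaluation with M = 3 UEs (placeholder numbers).
    M, V, P_avg, h = 3, 150.0, 5.0, 0.3
    P, Q, R = 0.0, np.zeros(M), np.zeros(M)
    p_alloc = np.array([1.0, 2.0, 3.0])        # candidate |p_i(t)|^2
    a, b = np.ones(M), np.array([1.0, 0.0, 1.0])  # arrivals and SIC-dependent departures
    aoii = np.array([0.0, 2.0, 1.0])
    print(dpp_objective(P, Q, R, p_alloc, a, b, aoii, P_avg, h, V))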

B. Analysis of DPP Term Optimization
In Subsection IV-A, the original optimization problem (16) is transformed into minimizing the upper bound of our DPP term in (25). We first introduce the related Lemma 2 [36]; then, in the following Theorem 1 and Theorem 2, we prove that by minimizing (25), a near-optimal long-term average AoII can be obtained. This is achieved for the following reasons: (1) an appropriate optimization problem in (25), which can easily be transformed into a series of solvable single-time-slot problems; (2) the introduction of the three virtual queues P(t), Q_i(t) and R_i(t) in (17)-(19), which satisfies the three long-term constraints in (16c)-(16e); (3) the adoption of the DPP algorithm, which not only balances the stability and the AoII performance of our system, but also makes the target AoII performance achievable; and (4) the SAC-AMPA scheme based on a DRL algorithm, proposed in Section V, which optimizes the power allocation through interaction with the environment.
Lemma 2: Consider a stationary randomized strategy ω satisfying the law of large numbers, which makes i.i.d. power allocation decisions in each time slot. Let x_i(t), y_i(t) and z_i(t) denote the per-slot quantities driving the evolutions of the three virtual queues in (17)-(19), and let ∆(t) denote the average AoII in time slot t. If x_i(t), y_i(t), z_i(t) and ∆(t) are bounded, then for any δ > 0 there exists an ω whose expected average AoII satisfies ∆ω(δ) ≤ ∆opt + δ while the expected virtual queue increments are at most δ, where ω(t) has expectations exactly equal to {x*_i(t), y*_i(t), z*_i(t), ∆*(t)}, ∆opt represents the optimal AoII performance, and ∆ω(δ) represents the feasible suboptimal solution achieved by ω.
Proof: We consider that the AoII optimization problem is strictly feasible. For any t, since |p_i(t)|², a_i(t) and b_i(t) are bounded, the second moments of x_i(t), y_i(t) and z_i(t) are also bounded. Since we set the upper bound ∆max on the AoII, ∆(t) is bounded as well, which satisfies the boundedness assumptions in [36]. Therefore, the conclusion of Lemma 2 is proved. □

Theorem 1: If all virtual queues are mean rate stable and the importance weight V > 0, then the long-term average AoII and the virtual queues satisfy a bounded DPP inequality. Since the DPP algorithm opportunistically minimizes the expectation and greedily minimizes the DPP term in each time slot, the per-slot DPP value under our scheme is no larger than that under the stationary randomized strategy ω, where x*_i(t), y*_i(t), z*_i(t) and ∆*(t) are the results of ω in Lemma 2. According to the conclusion of Lemma 2, and by taking δ → 0, the claimed inequality follows, noting that all the possible associations of the variables [x_i(t), y_i(t), z_i(t)] obtained by strategies ω lie in a closed set Γ. □

Theorem 2: By minimizing the upper bound of the DPP term, the long-term average AoII is upper bounded by

∆AoII ≤ ∆opt + c/V.

Proof: Using the law of iterated expectations for inequality (35), applying a telescoping sum over t = 0, 1, 2, . . . , T − 1, dividing by T and V, and taking δ → 0 and T → ∞, Theorem 2 is proved. □

Based on the conclusion of Theorem 2, an upper bound exists for the long-term average AoII. Additionally, the Lyapunov optimization framework provides a tradeoff between the optimization objective and the lengths of the virtual queues by adjusting V: on one hand, the drift term ensures the stability of the virtual queues, thereby guaranteeing the three long-term constraints; on the other hand, the penalty term can be used to achieve the target AoII performance.
Therefore, by solving the weighted optimization problem of the DPP, we can finally obtain a near-optimal long-term average AoII performance under the three long-term constraints in our semantic-empowered NOMA S-IoT system. Note that the multi-slot long-term optimization problem (25) reduces to an online single-time-slot optimization problem, which depends only on the power allocation decision in the current time slot. Therefore, we can convert (25) into a series of single-time-slot deterministic optimization problems, denoted as problem (39), by ignoring the time variable t; these problems are shown to be non-convex by calculating their Hessian matrices [38].

V. THE PROPOSED SAC-AMPA SCHEME

In this section, we introduce a DRL-based approach to model our AoII optimization problem (39), and present the architecture of our SAC-AMPA power allocation scheme.

A. Problem Modeling Based on DRL
The application of DRL involves three components: the environment, the agent, and the actions. The interaction process of DRL is as follows: in each time slot t, the agent observes the environment state s_t and takes an action a_t according to a specific policy π. The agent then obtains a reward r_{t+1} that evaluates the current action a_t, and the environment transitions to the next state s_{t+1} [39].
In our semantic-empowered NOMA S-IoT system, the LEO satellite S is regarded as the agent. All M UEs' AoII states, channel conditions and queue backlogs jointly constitute the observed environment state, denoted as s_t = {s¹_t, s²_t, s³_t}, where s¹_t, s²_t and s³_t collect the AoII states, channel conditions, and queue backlogs of all M UEs, respectively. Moreover, the optimal power allocation in a NOMA system is affected by both the channel conditions and the queue backlogs of the UEs [40]. To improve the AoII in our semantic-empowered NOMA S-IoT system, we define a sorting function of UE_i based on its channel gain g_i and queue backlog Q_i, in which the column vector w represents the normalization weight of g_i in the sorting function. In each t, the action a_t = {a¹_t, a²_t} consists of two components: the allocation order of the UEs, a¹_t, and the allocated powers of the UEs, a²_t. For example, consider two UEs, UE_1 with a full Q_1 and a low g_1, and UE_2 with an empty Q_2 and a high g_2: the SAC-AMPA scheme prefers to give UE_1 priority and feeds back a small w. The two components then combine into the corresponding action a_t, which must satisfy the peak power constraint (39b). The reward in our system is defined as the difference of the DPP value between t and t − 1; if the peak power constraint P_max is not satisfied, the reward is instead set to a constant −PEN (PEN > 0). For convenience, referring to the DPP term (39a) as DPP, the reward r_i(s_t, a_t) is expressed as

r_i(s_t, a_t) = −(DPP(t) − DPP(t − 1)), if Σ_{i=1}^{M} |p_i(t)|² ≤ P_max; −PEN, otherwise.    (41)

By adopting this approach, S favors power allocation schemes that achieve higher rewards, thus avoiding illegal schemes [41]. The objective of S is to learn the optimal policy π* through continuous interaction with the environment, thereby maximizing the discounted cumulative long-term reward and gradually satisfying the long-term constraints. The discounted cumulative long-term reward R_t is expressed as

R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ ∈ [0, 1) is a discount factor that represents the importance of rewards in future time slots.
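A minimal Python sketch of the reward rule in (41) follows; the sign convention (rewarding a decrease of the DPP value) and the penalty magnitude are assumptions consistent with the description above.

    def reward(dpp_prev, dpp_curr, total_power, p_max, pen=100.0):
        """Reward of the power allocation agent.

        Rewards a decrease of the DPP value between consecutive slots; returns
        the fixed penalty -pen when the peak power constraint (39b) is violated.
        """
        if total_power > p_max:
            return -pen                    # illegal action: constraint violated
        return dpp_prev - dpp_curr         # positive when the DPP term decreases

    # Example: a legal action that lowers the DPP value earns a positive reward.
    print(reward(dpp_prev=12.0, dpp_curr=9.5, total_power=4.2, p_max=5.0))  # 2.5
    print(reward(dpp_prev=12.0, dpp_curr=9.5, total_power=6.1, p_max=5.0))  # -100.0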

B. Architecture of the SAC-AMPA Scheme
We utilize the DRL algorithm proximal policy optimization (PPO) [24] to design our SAC-AMPA scheme and achieve the AoII minimization in the semantic-empowered NOMA S-IoT system. The PPO algorithm leverages the policy gradient approach to approximate the probability distribution of actions given a specific state through the update of a stochastic policy neural network, denoted as π_θ. It adopts an actor-critic structure consisting of three neural networks: a new policy network π_θ with parameters θ, an old policy network π_θ′ with parameters θ′, and a critic network with parameters φ. The two policy networks work together to generate the probability distribution of actions under the current state, and their parameters are optimized with the help of the critic network. Through iterative interactions between the policy networks and the critic network, the PPO algorithm seeks to converge to the optimal solution. The iterations of our SAC-AMPA scheme are introduced as follows.
First, to update the parameters {θ, θ′, φ} in our SAC-AMPA scheme, the agent S interacts with the environment through π_θ to collect a batch of experience data, denoted as (s_t, a_t, r_{t+1}, s_{t+1}). These data are utilized to update the networks N_up times, as illustrated in Fig. 4.
Then, θ′ is updated according to θ as follows. In each t, π_θ takes the state s_t as input and outputs a probability distribution over actions, from which S samples an action a_t. In each training episode, the environment is reset and the initial state s_0 is observed; the agent then takes the power allocation action a_t with respect to the UEs' state s_t, obtains the reward r_{t+1} via (41) and the next state s_{t+1}, and stores the experience tuple (s_t, a_t, r_{t+1}, s_{t+1}) in the buffer D. The loss function of the new policy depends on r_t(θ) = π_θ(a_t|s_t) / π_θ′(a_t|s_t), the probability ratio of the new and old policies, and on Â_t, an estimate of the advantage function,

Â_t = Q(s_t, a_t) − V_φ(s_t),

where Q(s_t, a_t) is the state-action value function, which represents the actual return obtained by taking action a_t in state s_t, and V_φ(s_t) is calculated by the critic network to fit the discounted cumulative long-term reward R̂_t from the state s_t to the end. Hence, Â_t represents the advantage of the actual return obtained by taking action a_t in state s_t over the fitted long-term reward.
On one hand, θ is updated through gradient ascent, which can be expressed as θ = θ + α · ∇_θ L(θ), where α ∈ [0, 1) is the learning rate of the new policy network. The loss function of the new policy network is defined as

L(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)],

where the clip function clip(·) is introduced for the following reason. Considering the sensitivity of policy updates in continuous action spaces, algorithmic errors occur when the difference between the two action distributions generated by π_θ′ and π_θ is too large. To prevent large differences, the PPO algorithm introduces clip(·) to constrain the probability ratio of the new and old policies within the range [1 − ε, 1 + ε]. Moreover, when the advantage function Â_t > 0, indicating good performance of the current state-action pair, the probability ratio r_t(θ) should be increased, but not beyond 1 + ε. Conversely, when Â_t < 0, indicating poor performance of the current state-action pair, r_t(θ) should be decreased, but not below 1 − ε.
On the other hand, φ is updated through gradient descent, i.e., φ = φ − α · ∇_φ L(φ). The loss function L(φ) of the critic network is defined as the mean squared error between the fitted value and the discounted cumulative reward,

L(φ) = E_t[(V_φ(s_t) − R̂_t)²].

A shared neural network architecture is used for the policy and value functions π_θ(a_t|s_t) and V_φ(s_t), and a loss function L^PPO(θ) is employed to combine the error terms of both functions. To ensure sufficient exploration, L^PPO(θ) is further enhanced by an entropy term. Therefore, the final loss function of the new policy network π_θ can be written as

L^PPO(θ) = E_t[L(θ) − c_1 L(φ) + c_2 S_{π_θ}(s_t)],

where S_{π_θ}(s_t) = E_{a_t∼π_θ}[−log π_θ(a_t|s_t)] is the entropy of π_θ at time step t, and c_1, c_2 are the importance weights of L(φ) and S_{π_θ}(s_t), respectively. The detailed SAC-AMPA scheme with the above training process is summarized in Algorithm 1.
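For concreteness, the following PyTorch sketch computes the clipped surrogate L(θ), the critic loss L(φ), and the entropy bonus combined as in L^PPO(θ); the discrete Categorical action head, tensor shapes, and coefficient values are illustrative assumptions rather than the exact SAC-AMPA networks.

    import torch

    def ppo_loss(logits_new, logits_old, values, actions, returns,
                 eps=0.2, c1=0.5, c2=0.01):
        """Clipped PPO loss: policy surrogate, critic MSE, entropy bonus.

        logits_new/logits_old: action logits from pi_theta and pi_theta' (B, A)
        values:  critic outputs V_phi(s_t), shape (B,)
        returns: fitted targets R_t, shape (B,)
        """
        dist_new = torch.distributions.Categorical(logits=logits_new)
        dist_old = torch.distributions.Categorical(logits=logits_old.detach())
        logp_new = dist_new.log_prob(actions)
        logp_old = dist_old.log_prob(actions)

        adv = (returns - values).detach()              # advantage estimate A_t
        ratio = torch.exp(logp_new - logp_old)         # r_t(theta)
        surrogate = torch.min(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
        value_loss = ((values - returns) ** 2).mean()  # L(phi)
        entropy = dist_new.entropy().mean()            # S_pi(s_t)
        # Maximize surrogate and entropy, minimize value loss -> negate for SGD.
        return -(surrogate - c1 * value_loss + c2 * entropy)

    # Toy batch: B = 4 transitions, A = 3 discrete power levels (placeholders).
    B, A = 4, 3
    loss = ppo_loss(torch.randn(B, A), torch.randn(B, A),
                    torch.randn(B), torch.randint(A, (B,)), torch.randn(B))
    print(loss.item())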

VI. SIMULATION RESULTS AND DISCUSSIONS

A. Simulation Setup
In this section, we simulate the long-term average AoII and the average power consumption P of our proposed SAC-AMPA scheme; the important simulation parameters are summarized in Table I, and the simulated parameters of the SR fading channel are listed in Table II. We first validate that the AMPA scheme outperforms the state-of-the-art DDPG scheme [25]. Further, our content-aware SAC sampling policy is also compared with two content-independent sampling policies:

1) PSA-AMPA, which samples periodically with period 1/(2p) [26] and utilizes the AMPA scheme to derive the power allocation; and

2) GAW-AMPA, which samples and transmits in each time slot [27] and utilizes the AMPA scheme to derive the power allocation.

B. Simulation Results
First, we investigate the impact of the fading parameters of the SR channels on the long-term average AoII ∆AoII of our SAC-AMPA scheme. As illustrated in Fig. 5, ∆AoII decreases as the SNR increases, since the SIC decoding performance improves at higher SNR. When SNR ≤ 20 dB, ∆AoII is high under all three shadowing levels, especially under the frequent heavy shadowing (FHS) level. Since the gap between the FHS level and the ILS/AS levels is significant due to the lack of power resources at the satellite, we utilize the FHS fading parameters to validate the performance of the proposed SAC-AMPA scheme in the following simulations. Fig. 6 shows that ∆AoII improves with an increasing number of transmitting antennas; we can observe that the average AoII for N_s = 4 is 33% lower than that of a single antenna at SNR = 15 dB. The convergence of our SAC-AMPA and SAC-DDPG schemes is demonstrated after about 300 episodes, as shown in Fig. 7, and the average AoII of the SAC-AMPA scheme is approximately 19% lower than that of the SAC-DDPG scheme. This is mainly because the clip function clip(·) in PPO can control the evolution of our SAC-AMPA scheme.
Fig. 8 illustrates ∆AoII of the SAC-AMPA scheme and other state-of-the-art schemes with respect to the SNR under N_s = 1 and N_s = 4, respectively, and demonstrates that our SAC-AMPA scheme achieves a lower ∆AoII than the PSA-AMPA, GAW-AMPA and SAC-DDPG schemes under both single and multiple transmitting antennas. Moreover, Fig. 9 illustrates ∆AoII of the SAC-AMPA scheme and the two content-independent schemes with respect to the number of UEs, and shows that the SAC-AMPA scheme maintains the optimal AoII and exhibits a growing advantage as the number of UEs increases. On one hand, when employing the PPO algorithm, the SAC policy achieves better AoII performance than the GAW and PSA policies. This is attributed to two reasons: 1) the GAW policy transmits status updates in each time slot without considering their freshness and value, leading to suboptimal performance; and 2) compared with the PSA policy, the SAC policy effectively captures the state transitions of the source and selects appropriate sampling times. On the other hand, when employing the SAC policy, the AoII performance of the PPO algorithm outperforms that of the DDPG algorithm. Fig. 10 shows the impact of V in (39a) on ∆AoII and P in the SAC-AMPA scheme. We can observe that when V increases, ∆AoII decreases while P significantly increases under the constraint (39b). Therefore, a tradeoff can be achieved between the long-term average AoII and P, as discussed in Section IV-B. Fig. 11 shows ∆AoII under different numbers of states N and different sampling thresholds. Note that the GAW policy corresponds to a sampling threshold of 0, and the SAC policy is simulated with sampling thresholds of 1, 2 and 3. Simulation results demonstrate that the SAC policy with a sampling threshold of 1 achieves the optimal ∆AoII regardless of N.
Finally, Fig. 12 shows the average power consumption P and the long-term average AoII ∆AoII of the three sampling policies versus the transition probability 2p. We can observe that the SAC and PSA policies have similar P, both lower than that of the GAW policy. This is because the GAW policy samples and transmits status updates in each time slot, whereas the former two sampling policies depend on p: when p decreases, fewer samples and transmissions occur under the SAC and PSA policies. Moreover, the SAC policy maintains the optimal ∆AoII regardless of p due to its content-aware feature.

VII. CONCLUSION
In this paper, we proposed a content-aware sampling policy, named SAC, for the semantic-empowered NOMA S-IoT system, and formulated a long-term average AoII optimization problem under three constraints. To solve this long-term non-convex optimization problem, we transformed the original problem within the Lyapunov optimization framework. Then, we utilized PPO to design our AoII-optimal power allocation scheme, named the SAC-AMPA scheme. Simulation results demonstrated that our SAC-AMPA scheme achieves the lowest long-term average AoII and average power consumption among the state-of-the-art schemes. Moreover, we simulated the convergence and AoII performance of the PPO and DDPG algorithms, analyzed the AoII and power consumption performance of different sampling policies, and compared the AoII under different sampling thresholds. Finally, we validated that our proposed SAC-AMPA scheme can transmit status updates with high freshness and value while effectively conserving resources.

APPENDIX A
PROOF OF LEMMA 1
When P(t) is rate stable, lim_{T→∞} E[P(T)]/T = 0. By summing the evolution of P(t) in (17) from 0 to T and taking the limit and the expectation, we can derive the long-term average power constraint (16c). Similarly, when Q_i(t) and R_i(t) are rate stable, we can derive the network stability constraint (16e) and the freshness requirement (16d), respectively.

APPENDIX B
DERIVATION OF (24)
First, by substituting the queue evolutions (17)-(19) into the quadratic Lyapunov function (20) and bounding the cross terms by the constant c, the upper bound of the DPP term follows. Then the proof of (24) is completed.


Fig. 2. Illustration of the N-state discrete Markov source.

Fig. 3. Evolution of the AoII of UE_i in the semantic-empowered NOMA S-IoT system.

Fig. 5. The average AoII versus SNR of the SAC-AMPA scheme, where the number of UEs M = 4, the transition probability 2p = 0.5, and the weight V = 150.

Fig. 10. Tradeoff between the long-term average AoII and the average power consumption P of the SAC-AMPA scheme, where M = 5 and 2p = 0.5.

Fig. 12. The average AoII and average power consumption P versus p of different sampling policies, where M = 4, SNR = 20 dB, and V = 150.