Lyapunov Optimization-Based Latency-Bounded Allocation Using Deep Deterministic Policy Gradient for 11ax Spatial Reuse

With the growing demand for wireless local area network (WLAN) applications that require low latency, orthogonal frequency-division multiple access (OFDMA) has been adopted for uplink and downlink transmissions in the IEEE 802.11ax standard to improve spectrum efficiency and reduce latency. In IEEE 802.11ax WLANs, OFDMA resource allocation that guarantees latency, called latency-bounded resource allocation, is more challenging than in cellular networks because the concurrent-transmission mechanism newly employed in IEEE 802.11ax intensifies severe unmanaged interference from overlapping basic service sets. To improve downlink OFDMA resource allocation under the unmanaged interference caused by IEEE 802.11ax concurrent transmissions, we propose Lyapunov optimization-based latency-bounded allocation with reinforcement learning (RL). We focus on the transmission-queue size for each station (STA) at the access point, which determines the STA latency. Using Lyapunov optimization, we formulate the resource-allocation problem with the queue-size constraints in a form that can be solved using RL (i.e., as a Markov decision process) and prove an upper bound on the queue size. Our simulation results demonstrate that the proposed method, which uses an RL algorithm with a deep deterministic policy gradient, satisfies the queue-size constraints; that is, the proposed method meets the latency requirements, whereas some baseline methods fail to meet them. Furthermore, the proposed method achieves a higher fairness index than the baseline methods.


I. INTRODUCTION
With the rapid growth of Internet-connected devices, wireless local area networks (WLANs) have become important because of their reasonable cost and suitable specifications. Accordingly, WLAN applications have become diversified, and a demand exists for latency-bounded communications (e.g., wireless remote control) in WLANs [1].
To improve spectrum utilization and reduce transmission latency, orthogonal frequency-division multiple access (OFDMA) has been introduced in IEEE 802.11ax [2],
whereby an access point (AP) can transmit frames to multiple stations (STAs) simultaneously. OFDMA has been used in cellular network systems and enables efficient spectrum utilization while satisfying latency requirements, by allocating OFDMA resources appropriately, based on the STAs' requirements; this is called latency-bounded resource allocation.
OFDMA resource-allocation mechanisms have been studied extensively for cellular networks [3]–[5]. However, resource allocation in WLANs is more challenging than that in cellular networks because WLANs adopt distributed control, and it is difficult for one AP to cooperate with others. Therefore, we must consider a resource-allocation method that works well without cooperation with other APs. Moreover, overlapping basic service sets (OBSSs) cause unmanaged interference in IEEE 802.11ax WLANs, which makes resource allocation more difficult. In addition to frame collisions caused by carrier sense multiple access/collision avoidance-based channel access in conventional WLANs, the concurrent transmissions by STAs associated with OBSS APs in IEEE 802.11ax WLANs, which are introduced to improve spatial efficiency, increase the potential interference level at the receivers. Such severe unmanaged interference destabilizes the transmission rate and makes latency-bounded OFDMA resource allocation more challenging.
Some studies have addressed OFDMA resource allocation in WLANs. Reference [6] proposed a high-throughput resource-unit assignment scheme. In this scheme, the AP allocates OFDMA resources to maximize the total throughput. However, it was not designed to control the transmission latency.
Latency-aware OFDMA resource allocation in WLANs has been studied previously [7]–[9]. In these studies, traffic is classified into real-time and non-real-time. OFDMA resources are first allocated to STAs with real-time traffic, and the remaining resources are then allocated to STAs with non-real-time traffic. Although this algorithm may reduce the latency for real-time STAs, these studies do not consider concurrent transmissions from OBSSs, which can result in allocations under which the latency requirements are not satisfied. Therefore, the aim of this study is to provide a latency-bounded OFDMA resource-allocation mechanism for IEEE 802.11ax WLANs that considers OBSS concurrent transmissions.
To solve the resource-allocation problem in IEEE 802.11ax WLANs, we employ reinforcement learning (RL) and Lyapunov optimization. RL is a technique in which an agent learns an optimal strategy in a given environment by trial and error, based on its experience. However, to apply RL to solve the resource-allocation problem [5], it is necessary to formulate the problem as a Markov decision process (MDP).
To formulate the latency-bounded resource allocation as an MDP, we use the Lyapunov optimization scheme presented in our previous study [10]. In the present study, we adopt a deep deterministic policy gradient (DDPG) algorithm [11] to solve the resource allocation problem formulated as an MDP using Lyapunov optimization.
The contributions of this paper are summarized as follows: • We provide a latency-bounded resource-allocation method for spatial-reuse operations in IEEE 802.11ax WLANs wherein an AP does not cooperate with OBSSs, and unmanaged interference is caused by concurrent transmissions from OBSSs. Most existing resource-allocation schemes assume cellular networks in which APs can cooperate with each other.
• By using Lyapunov optimization, we formulate the resource-allocation problem in a form that can be transformed into an MDP solvable by RL, and we prove an upper bound on the transmission latency.
• We demonstrated via simulations that the proposed method satisfies the latency requirement in environments wherein multiple OBSS APs cause interference, whereas existing studies have considered only a single basic service set (BSS). Furthermore, we confirmed that the proposed method achieves higher fairness than the baselines in the presence of OBSSs.
The rest of this paper is organized as follows. Section II presents the allocation model and the formulation of the latency-bounded allocation problem. Section III presents the transformation of the problem using Lyapunov optimization. Section IV defines an MDP, and Section V introduces the proposed allocation framework comprising DDPG. Section VI presents the simulation results. Section VII presents the conclusions of this paper.

II. PROBLEM FORMULATION
A. SYSTEM MODEL
Fig. 1 presents the allocation framework of this study. We focus only on downlink transmissions. We assume that there is one AP and N STAs in the considered network, with some OBSS APs around the STAs. Let the index of the STAs be denoted by n ∈ {1, . . . , N}, and the subchannels be denoted by m ∈ {1, . . . , M}. The OBSS APs use the same channel as the considered AP; that is, they affect the transmission between the considered AP and the associated STAs.
In IEEE 802.11ax networks, multiple transmitters can transmit simultaneously when the received interference power is below OBSS_PD. OBSS_PD is the sensitivity threshold for OBSS frames [2], and transmitters can set it within a fixed range. When a transmitter transmits simultaneously with other transmitters, it must reduce the transmission power as follows [12]:

P_m[t] ≤ P_ref − (P_OBSS_PD − P_OBSS_PD_min),  (1)

where P_ref is the reference power defined in the standard [12], P_OBSS_PD is the OBSS_PD, and P_OBSS_PD_min is the lowest OBSS_PD in the predefined range. Therefore, the higher the OBSS_PD that the transmitter sets, the more opportunities it can obtain for simultaneous transmissions; however, it must then transmit at a lower power. When the channel is idle (i.e., no transmitters are sending frames), the transmitter can transmit frames without reducing the transmission power. When the OBSS APs transmit frames via OFDMA, the abovementioned simultaneous-transmission and transmit-power decisions are made on a per-subchannel basis.
Let the transmission power in subchannel m at instant t be denoted by P_m[t] and the noise power by σ². Moreover, let the interference power at STA n in subchannel m from OBSS APs be denoted by I_nm. We approximate the data rate based on the Shannon capacity [13]; that is,

r_nm[t] = W log₂( 1 + P_m[t] · 10^(−l(d_n)/10) / (σ² + I_nm) ),  (2)

where W denotes the bandwidth of one subchannel, d_n denotes the distance from the considered AP to STA n, and t denotes the time index. In the above estimation, we use the following distance-based path-loss model [14]:

l(d) = 20 log₁₀(f_c) − 28 + 10α log₁₀(max{d, 1}) (dB),  (3)

where d denotes the distance, f_c denotes the center frequency, and α denotes the path-loss factor. In IEEE 802.11ax networks, BSS color bits are embedded in the frame header [2]; they indicate which BSS the transmitter belongs to. Therefore, in downlink transmissions, we can identify the OBSS AP under transmission using the BSS color bits. We then assume that the data rate r_nm[t] is obtained as follows. First, we identify the transmitter from the BSS color bits in the OBSS frame's header. We then estimate the interference power by referring to the previous interference power of the transmitter. Finally, we calculate r_nm[t] from the estimated interference power by (2).
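For concreteness, the rate estimate of (2) with the path-loss model (3) can be sketched as follows. This is a minimal illustration, not the simulator used in the paper; the center frequency (5180 MHz), path-loss factor α = 3.5, and the power units (dBm for transmit power, mW for noise and interference) are our assumptions.

```python
import math

def path_loss_db(d, fc_mhz=5180.0, alpha=3.5):
    """Distance-based path loss of Eq. (3), in dB (f_c in MHz is assumed)."""
    return 20 * math.log10(fc_mhz) - 28 + 10 * alpha * math.log10(max(d, 1.0))

def est_rate_bps(p_tx_dbm, d, i_mw, noise_mw, bw_hz):
    """Shannon-capacity approximation of Eq. (2) for one subchannel."""
    rx_mw = 10 ** ((p_tx_dbm - path_loss_db(d)) / 10)  # received power in mW
    return bw_hz * math.log2(1 + rx_mw / (noise_mw + i_mw))
```

As expected, the estimated rate grows with transmit power and shrinks with distance or OBSS interference.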
We define the total data rate of STA n as follows:

R_n[t] = Σ_{m=1}^{M} x_nm[t] r_nm[t],  (4)

where x_nm[t] ∈ {0, 1} denotes whether the considered AP allocates subchannel m to STA n. We assume that the AP has queues for each STA. Let the arrival rate be denoted by ρ_n[t] and the queue size by Q_n[t]. The queue size Q_n[t] evolves as follows:

Q_n[t + 1] = max{ Q_n[t] + (ρ_n[t] − R_n[t])τ, 0 },  (5)

where τ denotes the time-slot length and Q_n[0] = 0.
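The per-STA rate aggregation of (4) and the queue evolution of (5) can be sketched in a few lines (a minimal illustration with names of our choosing):

```python
def total_rate(x_row, r_row):
    """Eq. (4): total data rate of one STA from its allocation row x_nm
    and per-subchannel rates r_nm."""
    return sum(x * r for x, r in zip(x_row, r_row))

def queue_update(q, rho, r, tau):
    """Eq. (5): one slot of queue evolution; arrivals rho, service r,
    slot length tau, clipped at zero."""
    return max(q + (rho - r) * tau, 0.0)
```

For example, a queue that receives 10 units/s and drains 4 units/s over a 1 s slot grows by 6, while a queue served faster than it fills empties to zero rather than going negative.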

B. PROBLEM FORMULATION
The objective of this study is to allocate OFDMA resources for latency-bounded transmissions. To achieve latency-bounded transmissions, we introduce an allowable queue size Q̄_n for STA n. Note that the main transmission-delay factor in WLANs is the queuing delay [15]. Therefore, we allocate subchannels such that the queue size Q_n[t] stays under the allowable queue size Q̄_n. However, if we consider only the latency, many resources may be concentrated at an STA with strict constraints. Therefore, we adopt proportional fair allocation under the latency constraint. In the allocation, we set the product of the data rates R_n[t] as the objective function; by maximizing this objective function, proportional fair allocation is realized [16]. We summarize this optimization problem as follows¹:

maximize    Π_{n=1}^{N} lim_{T→∞} (1/T) Σ_{t=0}^{T−1} E[R_n[t]]
subject to  lim_{T→∞} (1/T) Σ_{t=0}^{T−1} E[Q_n[t]] ≤ Q̄_n,  ∀n,
            x_nm[t] ∈ {0, 1},  ∀n, m, t,
            Σ_{n=1}^{N} x_nm[t] ≤ 1,  ∀m, t.   (6)

The aim of the objective function is to realize fair allocation. By allocating the subchannels to maximize the objective function, we can reduce the deviation of the data rates and allocate the subchannels more fairly. The first constraint presents the latency constraint: if we maintain the average queue size under the allowable value, each STA's latency requirement is guaranteed. The second and third constraints indicate the resource-allocation constraints. The second constraint shows that each subchannel can be allocated exclusively during a single time slot, and the third constraint shows that each subchannel can be allocated to at most one STA.

III. LYAPUNOV OPTIMIZATION
In optimization problem (6), the queue size Q n [t] is a function of the arrival rate, as in (5). Accordingly, if we obtain the expectation of the arrival rate, we can solve optimization problem (6) directly. However, it is impossible to obtain the future arrival rate and expectation of the arrival rate. Therefore, we introduce Lyapunov optimization [17]. Lyapunov optimization is an online algorithm and does not need future information. Thus, it can be applied to optimization problem (6).
We define a virtual queue Z_n[t] that changes over time as follows:

Z_n[t + 1] = max{ Z_n[t] + Q_n[t + 1] − Q̄_n, 0 },  (7)

where Z_n[0] = 0. This virtual queue size indicates the accumulated backlog of the gap between the actual queue size Q_n[t + 1] and the desirable value Q̄_n. Let the virtual queue vector be denoted by Z[t] = (Z_1[t], . . . , Z_N[t]); we introduce the Lyapunov function as follows:

L(Z[t]) = (1/2) Σ_{n=1}^{N} Z_n[t]².  (8)

This function indicates the size of the virtual queue vector.
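A minimal sketch of the virtual-queue update and the quadratic Lyapunov function above (function names are ours):

```python
def virtual_queue_update(z, q_next, q_allow):
    """Accumulate the excess of the next queue size over the allowable
    size into the virtual queue, clipped at zero."""
    return max(z + q_next - q_allow, 0.0)

def lyapunov(z_vec):
    """Quadratic Lyapunov function: half the squared norm of the
    virtual-queue vector."""
    return 0.5 * sum(z * z for z in z_vec)
```

The virtual queue only grows while the actual queue exceeds its allowable size, so driving the Lyapunov function down pushes every queue back under its bound.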
Using this function, we introduce the Lyapunov drift Δ(Z[t]) as follows:

Δ(Z[t]) = E[ L(Z[t + 1]) − L(Z[t]) | Z[t] ].  (9)

This drift is the conditional expectation of the change in the Lyapunov function.
If we minimize Δ(Z[t]), we can minimize the virtual queue sizes Z_n[t] and satisfy the first constraint in (6). However, we cannot allocate resources fairly by minimizing Δ(Z[t]) alone. To make a fair allocation under the latency constraint, we introduce the drift-plus-penalty [17] as follows:

Δ(Z[t]) − V E[ y[t] | Z[t] ],  (10)

where V ≥ 0 denotes an importance weight. The second term is the weighted conditional expectation of part of objective function (6). The importance weight represents the extent to which we emphasize fair allocation against the latency constraints. To help readers better understand the role of the weight V, we present the following proposition.

Proposition 1: Let us define the function y as follows:

y[t] = Σ_{n=1}^{N} log R_n[t].  (11)

The total data rate R_n[t] is upper bounded; therefore, we can assume that the expected value of y is upper bounded by a finite value y_max; that is,

E[y[t]] ≤ y_max,  ∀t.  (12)

Let ȳ and Z̄_n denote the expected time averages of y[t] and Z_n[t], respectively:

ȳ = lim inf_{T→∞} (1/T) Σ_{t=0}^{T−1} E[y[t]],   Z̄_n = lim sup_{T→∞} (1/T) Σ_{t=0}^{T−1} E[Z_n[t]].  (13)

We suppose that there are constants B ≥ 0, ε ≥ 0, and a target value y*, such that

Δ(Z[t]) − V E[ y[t] | Z[t] ] ≤ B − V y* − ε Σ_{n=1}^{N} Z_n[t],  ∀t.  (14)

Then, the expected average y and virtual queues Z satisfy the following:

ȳ ≥ y* − B/V,  (15)

Σ_{n=1}^{N} Z̄_n ≤ ( B + V (y_max − y*) ) / ε.  (16)

Proof: Provided in Appendix A.

We can understand this proposition as follows. If we set a larger weight V, we can increase the value of y, which allows the resources to be allocated more fairly. However, an increase in V results in an increase in Z because the right-hand side of (16) increases, and the upper limit of Z is relaxed. The increase in Z indicates that the gap between the current queue size and the desirable queue size increases. Accordingly, if we increase V, the latency constraints are relaxed. In summary, the tradeoff between fair resource allocation and the latency constraints can be controlled by the weight V.
Finally, we rearrange the terms in (10) and separate the uncontrollable variables from the controllable variables in the following lemma.
Lemma 2: The drift-plus-penalty (10) is upper bounded as follows:

Δ(Z[t]) − V E[ y[t] | Z[t] ] ≤ B − E[ Σ_{n=1}^{N} ( V log R_n[t] + Z_n[t] R_n[t] τ ) | Z[t] ],  (17)

where B is invariable, irrespective of the resource allocation.
Proof: Provided in Appendix B.

In Lyapunov optimization, we minimize this upper bound instead of the drift-plus-penalty itself; therefore, we set this upper bound as the new objective function. Because B is a constant and the expectation term on the right-hand side of (17) enters with a negative sign, minimizing the bound is equivalent to maximizing that expectation term. By extracting the controllable terms from the expectation on the right-hand side of (17), we can transform the original problem (6) as follows:

maximize    Σ_{n=1}^{N} ( V log R_n[t] + Z_n[t] R_n[t] τ )
subject to  x_nm[t] ∈ {0, 1},  ∀n, m,
            Σ_{n=1}^{N} x_nm[t] ≤ 1,  ∀m.   (18)

The considered AP can calculate the virtual queue sizes Z_n[t] and set the coefficient V in advance. The total data rate R_n[t] is a function of the allocation index x_nm[t], as presented in (4), and the AP can set x_nm[t]. Therefore, the considered AP must determine only x_nm[t] to maximize objective function (18).
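As an illustration, the per-slot problem (18) can be solved exactly by exhaustive search when N and M are tiny. The sketch below assumes the transformed objective takes the form V·Σ_n log R_n[t] + Σ_n Z_n[t] R_n[t] τ (our reading of (18)); it is not the method of the paper, which uses DDPG precisely because this enumeration is intractable for realistic N and M.

```python
import itertools
import math

def per_slot_objective(alloc, rates, z, v, tau):
    """Objective value for one allocation; alloc[m] is the STA index for
    subchannel m, rates[n][m] is r_nm[t]. log(0) is floored for safety."""
    totals = [0.0] * len(rates)
    for m, n in enumerate(alloc):
        totals[n] += rates[n][m]
    return sum(v * math.log(max(r, 1e-9)) + z[n] * r * tau
               for n, r in enumerate(totals))

def best_allocation(rates, z, v, tau):
    """Exhaustive search over all N^M allocations (tiny instances only)."""
    n_sta, n_sub = len(rates), len(rates[0])
    return max(itertools.product(range(n_sta), repeat=n_sub),
               key=lambda a: per_slot_objective(a, rates, z, v, tau))
```

With V large and empty virtual queues the search spreads the subchannels across STAs (the log term punishes starving any STA), whereas a large Z_n pulls resources toward STA n.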

IV. MARKOV DECISION PROCESS FORMULATION
We formulated the allocation problem and transformed it using Lyapunov optimization. Unfortunately, the unmanaged interference from the spatial-reuse operation complicates the allocation problem. Moreover, optimization problem (18) is a 0-1 integer program, which is proven to be NP-complete [18] and is difficult to solve directly. Therefore, we introduce the DDPG algorithm, which approximates the mapping from a state to an optimal allocation decision. Once this mapping is learned, an estimate of the solution to optimization problem (18) can be obtained in a short computation time. To apply the algorithm, we formulate resource allocation as a stochastic decision process, specifically an MDP. For spatial reuse, we consider one AP as the learning agent and all OBSS APs as part of the environment. We define the stochastic decision process as a quadruplet (Ω, A, q, R), where Ω denotes the state space, A the action space, q the state-transition probability, and R the reward function. The next state depends only on the current state ω[t] and action a[t]; therefore, the process satisfies the Markov property and is an MDP.

A. STATE
Let us denote the state space Ω as the Cartesian product of the queue-size state space Ω_QUEUE and the channel state space Ω_CH; that is,

Ω = Ω_QUEUE × Ω_CH.  (19)

The queue-size state ω_QUEUE[t] ∈ Ω_QUEUE denotes the vector of the current queue sizes of the STAs; that is,

ω_QUEUE[t] = ( Q_1[t], . . . , Q_N[t] ).  (20)

We assume that the queue size Q_n[t] does not increase beyond the upper limit Q_max. Note that the queue size is continuous, and the state space Ω_QUEUE is therefore also continuous. The channel state ω_CH[t] ∈ Ω_CH denotes the OBSS AP indices identified from the received frames. We assume that the considered AP senses the channel continuously. If the AP detects a transmission, it checks the BSS color bits in the received frame header and immediately identifies the transmitter. BSS color is a field embedded in the IEEE 802.11ax header and indicates the BSS of the transmitter [2].
The channel state space is denoted by a tuple of the interferer index of each subchannel; that is,

ω_CH[t] = ( i_1[t], . . . , i_M[t] ),  (21)

where i_m[t] presents the index of the OBSS AP transmitting in subchannel m. We set i_m[t] = 0 when no other transmitter is transmitting in subchannel m or when the transmitter cannot be identified owing to a preamble error.

B. ACTION
Let a_m[t] ∈ A_m := {1, 2, . . . , N} denote the STA to which the agent allocates subchannel m. The action space A is then defined as the Cartesian product of the per-subchannel allocation spaces A_m; that is,

A = A_1 × A_2 × · · · × A_M.  (22)

For all subchannels, the agent determines the STA to which the subchannel is allocated; thus, there are N^M ways to allocate subchannels. Therefore, the action space may become large as the number of STAs or subchannels grows.

C. REWARD
We use objective function (18) as the reward; that is, the reward is defined as follows:

r[t] = Σ_{n=1}^{N} ( V log R_n[t] + Z_n[t] R_n[t] τ ),  (23)

where Z_n[t] denotes the virtual queue size, V denotes the importance weight, and R_n[t] denotes the total data rate defined in (4). Therefore, optimization problem (18) is solved when the agent learns the optimal strategy that maximizes the reward.

V. RESOURCE ALLOCATION WITH DDPG
In this study, we seek to develop a policy that enables the fairest allocation under the latency constraints. In Section IV, we formulated the allocation problem as an MDP. Therefore, we can apply the RL algorithm using the formulated decision process. In the OFDMA resource allocation in WLANs, the number of possible allocations may become very large as the numbers of STAs and subchannels increase. In addition, as mentioned in Section IV-A, the state space is continuous. To address the problem with a continuous state space, we adopt a deep RL algorithm. Deep Q-network (DQN) [19] is one of the well-known deep RL algorithms. However, as pointed out in [11], while DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces. Instead of DQN, we use an actor-critic deep RL algorithm, DDPG [11]. DDPG is designed for problems with high-dimensional and continuous action spaces (e.g., robot operation). Therefore, it is suitable for optimization problem (18).

A. ACTOR AND CRITIC
DDPG is based on an actor-critic method, similar to the deterministic policy gradient (DPG) algorithm [20]. An actor-critic method has a parameterized actor function µ(ω|θ^µ) and critic function V(ω, a|θ^V). In DDPG, the two functions are represented by neural networks: θ^µ is the weight of the actor network, and θ^V is the weight of the critic network. The actor function µ(ω|θ^µ) is a deterministic policy with which the agent selects an action a in state ω. The critic function V(ω, a|θ^V) is the value function for state ω and action a. The structures of the actor and critic functions in this study are described in Section VI-A. Given a state ω[t], an action a[t], and the actor function µ, we define the critic function as follows [11]:

V^µ(ω[t], a[t]) = E[ Σ_{k=t}^{∞} γ^{k−t} r[k] | ω[t], a[t] ],  (24)

where γ ∈ [0, 1) denotes the discount factor. This function presents the expected sum of discounted rewards for one episode. The agent learns the optimal policy to achieve the highest action value. In the considered system model, the critic function V^µ(ω, a|θ^V) is the discounted sum of the weighted data-rate product and virtual-queue terms. Therefore, the agent learns to allocate resources more fairly while meeting the latency constraints. The discount factor γ indicates how much we emphasize future rewards. If we use a large γ value, we obtain a better outcome after convergence; however, this lengthens the learning phase.
By using the Bellman equation [21], we can transform (24) as follows:

V^µ(ω[t], a[t]) = E[ r[t] + γ E[ V^µ(ω[t + 1], µ(ω[t + 1])) ] | ω[t], a[t] ].  (25)

As the DDPG policy is deterministic, we can eliminate the inner expectation as follows:

V^µ(ω[t], a[t]) = E[ r[t] + γ V^µ(ω[t + 1], µ(ω[t + 1])) | ω[t], a[t] ].  (26)

We optimize the critic-function parameter by minimizing a loss function, defined as follows:

L(θ^V) = E[ ( V(ω[t], a[t]|θ^V) − h[t] )² ],  (27)

where θ^V is the parameter of the approximate critic function, and h[t] is a function defined as follows:

h[t] = r[t] + γ V(ω[t + 1], µ(ω[t + 1])|θ^V).  (28)

Moreover, we update the actor-function parameter to maximize the expected return from the start distribution, J, defined as follows:

J = E[ V^µ(ω[t], a[t]) ].  (29)
The actor function µ(ω|θ^µ) is updated by applying the chain rule to the expected return as follows:

∇_{θ^µ} J ≈ E[ ∇_a V(ω, a|θ^V) |_{a=µ(ω|θ^µ)} ∇_{θ^µ} µ(ω|θ^µ) ],  (30)

where θ^µ is the parameter of the approximate actor function. Gradient (30) has been proven to be the policy gradient [20]; thus, we use gradient (30) to update θ^µ.
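The target value h[t] and the critic loss (27) reduce to a few lines once the network evaluations are abstracted away; the sketch below uses plain Python numbers in place of the neural-network machinery:

```python
def td_target(reward, gamma, v_next):
    """Target value h[t]: immediate reward plus the discounted critic
    estimate of the next state-action pair."""
    return reward + gamma * v_next

def critic_loss(v_pred, targets):
    """Loss (27) over a minibatch: mean squared error between critic
    outputs and their targets."""
    return sum((v - h) ** 2 for v, h in zip(v_pred, targets)) / len(v_pred)
```

In practice `v_next` comes from the target networks introduced in Section V-B, which is what keeps this regression target from chasing itself.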

B. DDPG FEATURES
DDPG adopts several features to overcome the disadvantages of neural networks [11]. When using neural networks for RL, the samples must be independent and identically distributed. However, the samples that can actually be obtained do not meet these conditions. To address this problem, DDPG adopts a replay buffer, which stores the tuples (ω[t], a[t], r[t], ω[t + 1]). We select samples uniformly at random from the replay buffer for a minibatch and update the actor and critic networks. By using the minibatch, the selected samples are approximately independent and identically distributed. Another disadvantage of neural networks is the convergence problem. Specifically, in DDPG, the network V(ω[t], a[t]|θ^V) is updated such that loss function (27) is minimized. However, this update is not stable because the target value h[t] also uses the network V(ω[t], a[t]|θ^V); in other words, the learning is slow and does not necessarily converge. To stabilize the learning, DDPG adopts target networks [22]. The target networks consist of copies of the actor and critic networks and are used to calculate the target value h[t]. Let θ denote the weights of the original networks and θ′ denote the weights of the target networks. We update the weights of the target networks as follows:

θ′ ← η θ + (1 − η) θ′,  (31)

where η is a constant with η ≪ 1. When using target networks, the target value h[t] changes more slowly, and the learning can be stabilized.
In the learning phase, DDPG adopts an exploration policy µ′, defined as follows:

µ′(ω[t]) = µ(ω[t]|θ^µ) + N[t],  (32)

where µ denotes the current policy and N[t] denotes a sample of a noise process. The noise process is chosen to suit the environment; in this study, we selected a white Gaussian noise process.
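The target-network soft update and the Gaussian exploration policy above can be sketched as follows; representing the network weights as flat lists is a simplification for illustration:

```python
import random

def soft_update(theta, theta_target, eta):
    """Slowly track the online weights with the target weights; with
    eta << 1 the target moves only a small step per update."""
    return [eta * w + (1 - eta) * wt for w, wt in zip(theta, theta_target)]

def explore(action, sigma, rng=random):
    """Perturb the deterministic action with white Gaussian noise of
    standard deviation sigma."""
    return [a + rng.gauss(0.0, sigma) for a in action]
```

With η = 0.01 a target weight covers only 1% of the gap to the online weight per update, which is why the regression target h[t] drifts slowly.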

C. DDPG APPLICATION
In this section, we explain how we apply DDPG in this study. The input of the actor network consists of the queue sizes of the N STAs and the channel states of the M subchannels, and the output of the actor network is the allocation of the M subchannels. Thus, the actor network can be denoted by the mapping µ : R^N × N^M → N^M. To enhance the learning efficiency, we normalize the input vector of the actor network. The input vector is calculated as follows:

ω[t] = ( ω_QUEUE[t] / Q_max, ω_CH[t] / M ),  (33)

where ω_QUEUE and ω_CH respectively denote the queue and channel states defined in Section IV, Q_max denotes the upper limit of the queue size, and M denotes the number of subchannels.
In this study, the output of the actor network is represented by a normalized vector. Therefore, we must transform the output into an allocation action as follows:

a_m[t] = ⌈ µ_m(ω[t]|θ^µ) · N ⌉,  (34)

where m denotes the subchannel index, N denotes the number of STAs, and ⌈·⌉ denotes the ceiling function.
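The ceiling-based mapping from the normalized actor output to STA indices can be sketched as follows; the clamping to the range 1..N is our defensive addition for the boundary outputs 0 and 1:

```python
import math

def to_allocation(actor_out, n_sta):
    """Map each sigmoid output in [0, 1] to an STA index in 1..N via the
    ceiling rule, clamping the endpoints."""
    return [min(n_sta, max(1, math.ceil(y * n_sta))) for y in actor_out]
```

With N = 4, outputs in (0, 0.25] map to STA 1, (0.25, 0.5] to STA 2, and so on, so the sigmoid range is partitioned evenly across the STAs.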

VI. SIMULATION EVALUATION
In this section, we validate the performance of the proposed scheme through a numerical evaluation.

A. SIMULATION SETTINGS
We consider a downlink transmission from one AP to four STAs. Two OBSS APs are near the considered network and interfere with the transmissions in it. To evaluate the performance in a spatial-reuse environment, we select a topology in which the interference power is less than OBSS_PD, so that the considered AP and the OBSS APs can transmit concurrently. The topology is shown in Fig. 2. We assume that the OBSS APs use the same channel as the considered AP and that they can detect each other. We also assume that one of the OBSS APs can start a transmission after carrier sensing. We set the traffic rate according to a uniform distribution. When we determine the subchannel allocation, we assume that the data rate can be calculated from the interference power estimated from the OBSS frame's header. The simulation parameters are presented in Table 1. Fig. 3 presents the structures of the DDPG networks. In this figure, the ''Dense'' layer is a fully connected dense layer, the ''ReLU'' layer is one of the activation functions [23], and the ''Linear'' layer is a linear function. We used a ''Sigmoid'' layer for the output layer of the actor network to restrict the output range. In the actor network, the input is the state ω, and the output is the action a. In the critic network, the inputs are the state ω and the action a, and the output is the action value V. The parameters of DDPG are summarized in Table 2.
We compare the queue sizes to evaluate the transmission latency. To evaluate the allocation fairness, we also use Jain's fairness index [25] to compare the data rates. Jain's fairness index is defined as follows:

F = ( Σ_{n=1}^{N} R_n )² / ( N Σ_{n=1}^{N} R_n² ).  (35)

This index is larger when the data rates of all the STAs are more uniform. We evaluated the following resource-allocation schemes in the WLAN with a spatial-reuse operation.
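Jain's fairness index is straightforward to compute; a minimal sketch:

```python
def jain_index(rates):
    """Jain's fairness index: 1.0 for perfectly equal rates, 1/N when a
    single STA receives everything."""
    s = sum(rates)
    sq = sum(r * r for r in rates)
    return s * s / (len(rates) * sq)
```

For four STAs the index therefore ranges from 0.25 (one STA monopolizes the channel) to 1.0 (all rates equal), which is the scale used in Fig. 9.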

1) PROPOSED SCHEME
The resource allocation is determined by the DDPG agent. In this scheme, the allocation index x_nm[t] is given as follows:

x_nm[t] = 1 if a_m[t] = n, and x_nm[t] = 0 otherwise,  (36)

where a_m[t] is determined by the actor function µ in DDPG and m is the index of the subchannel.

2) RANDOM SCHEME
The resource allocation is performed randomly.

3) RATE SCHEME
The objective of the rate scheme is to maximize the total throughput, which is the same as that in [6]. In the rate scheme, each subchannel is allocated to the STA with the highest data rate in the subchannel; that is,

x_nm[t] = 1 if n = argmax_{n′} r_{n′m}[t], and x_nm[t] = 0 otherwise.  (37)

4) QUEUE SCHEME
The objective of the queue scheme is to prioritize latency-sensitive STAs, which is the same as that in [7]. In the queue scheme, the entire channel is allocated to the STA with the largest queue size among the four STAs; that is,

x_nm[t] = 1 for all m if n = argmax_{n′} Q_{n′}[t], and x_nm[t] = 0 otherwise.  (38)

Note that although the spatial-reuse operation was not used in [6] and [7], our simulations use the WLAN spatial-reuse operation in all the resource-allocation schemes. In addition to comparing the performance of the resource-allocation schemes, we evaluated the performance of the proposed scheme without the spatial-reuse operation to emphasize its importance in WLANs. This scheme is referred to as the ''w/o SR scheme'' in the simulation results.

Fig. 4 presents the average queue sizes. The average queue sizes of the five schemes are smaller than the allowable queue size, and among them, the queue size of the proposed scheme is the smallest. In the w/o SR scheme, the average queue size is the largest because the considered AP does not transmit concurrently with the OBSS APs, and the spectrum utilization is degraded. In the queue scheme, the AP considers only the queue sizes, not the data rates; accordingly, the spectrum utilization is limited, and the queue size increases. In the proposed scheme, the considered AP considers both the queue sizes and the data rates. Thus, it can improve the transmission efficiency while satisfying the latency constraints.

Fig. 5 presents the transition of the average queue size of the proposed scheme during DDPG learning. As the episodes progress, the fluctuation of the average queue size becomes smaller; that is, the learning is stable. The average queue size is the smallest at 1600 episodes. Therefore, we use the result at 1600 episodes in the following comparisons.

Fig. 6 presents the achievement rate for the allowable queue size. The proposed scheme realizes an achievement rate of 1; that is, the considered AP always maintains the queue sizes under the allowable queue size. The queue scheme also realizes an achievement rate of 1. In this scheme, subchannels are allocated preferentially to the STA whose queue size is the largest; thus, the queue size is kept under the allowable queue size. The achievement rate of the w/o SR scheme is the lowest among the five schemes for the same reason as with the average queue size.

Fig. 7 presents the standard deviations of the queue sizes. If the deviation is low, the queue size is stable, and the transmission latency is also stable. The proposed scheme realizes less deviation than the compared schemes. In the proposed scheme, subchannels are allocated to keep the queue sizes small. Accordingly, the change in the queue size is small, which results in a lower deviation.

Fig. 8 presents the change in the maximum queue size for each scheme. The queue size of the proposed scheme does not exceed the allowable queue size. However, the queue sizes of the other schemes sometimes exceed the allowable values. In the proposed scheme, the considered AP allocates resources to all the STAs more frequently to control the queue sizes; with this allocation, the queue sizes are kept under the allowable value. The same is true of the queue scheme.

Fig. 9 presents Jain's fairness index of the data rates. The fairness index of the proposed scheme is higher than that of the other schemes. Objective function (18) contains the product of the data rates so that fair allocation is realized, and this term improves the fairness index. The fairness index of the queue scheme is the lowest among the five schemes because too many subchannels are allocated to the STA whose queue size is the largest, which degrades fairness.

VII. CONCLUSION
We proposed a novel latency-bounded allocation framework for spatial-reuse WLANs. Latency-bounded OFDMA resource allocation in WLANs is more challenging than that in cellular networks due to the unmanaged interference from the OBSSs. To perform latency-bounded OFDMA resource allocation under unmanaged interference, we used an RL algorithm. As one of the inputs of the algorithm, we adopted BSS color bits, which indicate the transmitter that transmits the frame. To realize latency-bounded transmissions, we focused on the queue size because the majority of the latency in WLANs is due to the queuing delay. We then formulated the OFDMA allocation problem with queue-size constraints and transformed the problem into a form such that RL could be applied via Lyapunov optimization. We proved that the queue size is upper bounded in the transformed problem. In the evaluation, we compared the queue size and fairness of the allocation of the proposed scheme with those of other schemes. The simulation results confirmed that the proposed method achieved a high Jain's fairness index while satisfying the latency constraints. Moreover, the proposed method reduced the average queue size and its deviation, which implies that the proposed method can increase the capability of WLANs to accommodate more STAs with a satisfactory latency.

APPENDIX
A. LOWER LIMIT OF DATA RATE PRODUCTS AND UPPER LIMIT OF VIRTUAL QUEUES
Proof of Proposition 1: We consider a slot t. We then obtain the expectations of both sides of (14). By taking the sum over t ∈ {0, 1, . . . , T − 1} for some T > 0 and rearranging the terms, the above inequality is transformed accordingly. By dividing (40) by VT, rearranging the terms, and neglecting non-negative terms appropriately, we can obtain (15). By dividing (40) by T, applying (12), and rearranging the terms in the same manner as that used above, we can obtain (16).

B. UPPER BOUND OF THE DRIFT-PLUS-PENALTY
Proof of Lemma 2: As we cannot control Q_n[t] or ρ_n[t]τ, and Q̄_n is invariable irrespective of the allocation, we can transform (46) into a form in which B is a constant that cannot be changed by resource allocation. From (45), (46), and (47), we obtain Lemma 2.