A Three-Tier Deep Learning-Based Channel Access Method for WiFi Networks

Future WiFi networks require a channel access method that provides users with high capacity. Such a method must consider 1) channel bonding, which improves the transmission capacity of Access Points (APs); and 2) spatial reuse, where APs tune their Clear Channel Assessment (CCA) threshold and transmit power in order to transmit concurrently with neighboring APs. To date, there are no solutions that jointly optimize the channels used by an AP, and the CCA threshold and transmit power of a bonded channel. To this end, we outline a three-tier deep learning approach. Briefly, at Layer-1, it selects a set of transmitting channels. At Layer-2 and Layer-3, it respectively determines the transmit power and CCA threshold for each selected channel. An AP then employs deep reinforcement learning to learn the optimal policy for each layer given its interference intensity and queue length. The simulation results show that when compared to three competing solutions, an AP that uses our approach is able to reduce its queue length by up to 62.52% under realistic traffic load.


I. INTRODUCTION
IEEE 802.11 based Wireless Local Area Networks (WLANs), aka WiFi networks, play an essential role in people's daily activities. Indeed, Access Points (APs) are ubiquitous and densely deployed in places such as offices, shopping malls, stadiums and airports [1]. These APs must support a high number of users. As an example, in the sports event studied in [2], WiFi networks need to provide services to 12,000 users simultaneously with a maximum aggregated data rate of 3.5 Gbps. However, with limited spectrum resources, densely deployed APs may experience significant interference from neighboring cells if they operate on the same channel [3]. Moreover, emerging Internet applications such as online meetings/education, ultra high definition streaming and virtual reality videos require high throughput or capacity to ensure a high quality of service [4].
There are two methods to improve network capacity. One method is via channel bonding or aggregation [5], which allows an AP to combine multiple channels together to form a wide bandwidth. For example, an IEEE 802.11ax AP is able to bond up to eight 20 MHz channels to form a 160 MHz channel. As shown in [2], with increasing bandwidth, an AP is able to transmit with a high data rate and thus, improves its capacity.
Another method is to maximize spatial reuse, meaning multiple nodes are able to transmit concurrently [6]. The level of spatial reuse is determined by the adopted Medium Access Control (MAC) scheme. Specifically, current WiFi networks rely on Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) for channel access. Each device uses a Clear Channel Assessment (CCA) threshold to determine the status of a channel. If the power level on a channel falls below a given CCA threshold, a device is allowed to transmit its packets. Otherwise, the channel is not available. Hence, a key problem is to determine a suitable threshold that allows multiple APs and/or users to transmit concurrently without causing too much interference to one another [7]. In this respect, Transmit Power Control (TPC) is essential [1], whereby an AP adjusts its transmit power to minimize the interference caused to neighboring APs [5] and/or to avoid triggering the CCA threshold of these APs, which would cause them to defer their transmissions unnecessarily.
To illustrate the previous points, consider Fig. 1 where an AP serves four users. The AP is able to bond up to three channels for transmissions, and all users and the AP experience inter-cell interference on different channels. Assume the AP has packets for User-1. Further, it experiences high interference on Channel-2. In addition, assume User-1 experiences low interference on Channel-1. In this regard, the AP can choose to bond Channel-1 and Channel-3, and increase the CCA threshold on these channels to gain more opportunities to transmit. Moreover, the AP can allocate a higher transmit power on Channel-3 as compared to Channel-1 if the Signal to Interference plus Noise Ratio (SINR) on Channel-1 exceeds a given threshold. This helps to ensure the SINR on both channels exceeds a given value, and reduce the interference to neighboring cells. Consequently, the AP is able to transmit packets to User-1 with a high data rate.
The use of channel bonding, transmit power control and CCA threshold adjustment creates several problems. First, an AP that bonds multiple channels may suffer significant interference from neighboring cells [8]. Moreover, bonding leads to a lower power density, i.e., Watts/Hz [9], which may lead to a low data rate. Secondly, although transmit power control may reduce the interference to neighboring APs, it may also result in a low signal strength. In this respect, if an AP uses a low transmit power and transmits over a bonded channel, neighboring devices may not hear its transmission on some channels and thereby start to transmit [10]. This may cause severe interference to ongoing transmissions or even collisions. Thirdly, if the CCA threshold is high, an AP may transmit even when there is high interference, which degrades its transmission capacity. Consequently, an AP needs to optimize its channel bonding strategy, transmit power and CCA threshold carefully. Otherwise, the AP may experience significant delays or queue overflows.
There are three challenges to consider when jointly optimizing channel bonding, CCA threshold and transmit power. First, the channel conditions on each channel vary over time. Next, an AP may bond different combinations of channels for each transmission over time. Lastly, the traffic arrival at an AP is random.
To this end, this paper makes the following contributions:
• It addresses a novel problem that calls for a solution to optimize an AP's channel bonding policy, transmit power and CCA threshold under random traffic and channel conditions. Further, it considers both adjacent and non-adjacent channel bonding. The AP's aim is to maximize its throughput and minimize its queue length.
• It presents the first formal model of the said problem. Specifically, this paper formulates a three-layer Markov Decision Process (MDP) for the said problem [11], and outlines a three-tier learning approach based on Deep Q-Network (DQN) [12] and Deep Deterministic Policy Gradient (DDPG) [13] that runs on an AP to independently solve the MDP. The proposed three-tier approach only requires local information, such as an AP's current queue length and locally measured interference on each channel, to learn the optimal policy. This means the proposed approach scales with network size and it is decentralized. Further, the proposed approach does not require prior knowledge of an environment, meaning it is model-free. We emphasize that our approach is run by a single AP and does not require an AP to cooperate with neighboring APs, which may be managed by different entities. Lastly, our work is significant because the amount of traffic and interference vary over time, which motivates the use of learning based solutions such as DDPG.
• To the best of our knowledge, there are no prior works that jointly and adaptively optimize channel bonding, transmit power and CCA threshold for APs with random traffic arrivals in order to improve network capacity, see Section II for details. Hence, this paper is the first to outline a solution for the said novel problem.
• This paper contains the first study on the aforementioned problem and solution. The simulation results show that an AP running the proposed three-tier learning approach is able to reduce its queue length by 32.94%, 50.99% and 62.52% as compared to algorithms that use fixed or random strategies for channel bonding, transmit power control and CCA threshold.
The rest of this paper is organized as follows. Section II compares prior works that consider channel bonding, transmit power control, CCA threshold and works that apply Reinforcement Learning (RL). Section III presents our system model and problem, and Section IV outlines a brief background of MDP, DQN and DDPG. After that, Section V details the proposed three-tier learning approach. Lastly, Section VI presents simulation results, and Section VII concludes the paper.

II. RELATED WORKS
A number of works have considered applying channel bonding to wireless networks. For example, the work in [14], [15],  [16], [17], [18], and [19] assigns non-overlapping channels to each AP. The aim is to reduce interference and satisfy the traffic demands of APs. The study in [20], [21], and [22] uses game theory or integer nonlinear programming to determine a channel bonding strategy for each AP. Their aim is to improve overall network throughput. The work in [23] determines channel bonding policies by jointly monitoring the traffic load on secondary channels as well as the delay experienced by all users. Its aim is to minimize the transmission delay for each user.
Many works have applied machine learning techniques to optimize channel bonding strategies. For example, the work in [24] uses deep learning to optimize the probability of selecting a set of channels. In [10], the authors use RL to select a set of channels to bond taking into consideration the interference on each channel. Their goal is to maximize the throughput of each AP. The work in [25] uses multi-agent RL and considers the traffic load of each AP. Agents learn to select a primary channel and a number of secondary channels to bond. The aim is to minimize the average transmission delay. The work in [26] aims to minimize interference between APs and satisfy the time varying traffic demands at each AP. The AP in [27] runs RL to learn the optimal probability to sense and bond each secondary channel. Its goal is to maximize the total network throughput given some traffic load. The authors in [28] first use a Markov chain to model the throughput for bonded channels before proposing a multi-arm bandit algorithm to determine a set of channels to bond and avoid collisions.
There are also works that consider combining channel bonding with transmit power control. In [29], the authors use RL to jointly optimize channel bonding and transmit power of an AP. The goal is to maximize the energy efficiency of an AP with random traffic arrivals. The work presented in [30] first uses Q-learning on each femtocell to select a set of channels to transmit. After that, the authors use convex optimization to allocate transmit power on each selected channel, and they aim to maximize the throughput of femtocells as well as minimize the interference experienced by macrocell users.
Some other works focus on incorporating channel bonding with CCA threshold adjustment. For example, the work in [31] and [32] adjusts the CCA threshold of APs first. In particular, the authors of [31] first determine the CCA threshold on the primary channel of each AP by jointly considering average interference, signal strength to associated users and channel occupancy time. After that, the CCA threshold on each secondary channel is obtained by adding a fixed value to the CCA threshold on the primary channel. The work in [32] adjusts the CCA threshold for secondary channels only. The CCA threshold on each secondary channel is the same, and is calculated based on a SINR threshold and the distance between an AP and users. Then, the work in [31] and [32] bonds channels for each AP according to CCA results. In addition, there are some other works that jointly optimize the transmit power and CCA threshold over a channel. For example, the work in [33] and [34] uses reinforcement learning to assign a primary channel, transmit power or CCA threshold to each AP. Their aim is to maximize the throughput of each AP.
Our work is fundamentally different to prior works, see Table 1 for a comparison. Firstly, although previous works consider channel bonding, transmit power control or CCA threshold adjustment, e.g., [14], [15], [19], [29], [30], [31], [32], [33], and [34], they do not jointly optimize them to improve spectrum efficiency. Secondly, works such as [14], [15], [19], and [22] use a centralized method to optimize channel bonding. These methods require global information and cooperation between APs. In contrast, we consider a decentralized solution, where an AP independently determines its policy. Further, our method is model-free and only requires local information, i.e., the current queue length and experienced interference. In addition, previous works, e.g., [14], [15], [16], [17], and [18], only bond adjacent channels. However, we consider bonding both adjacent and non-adjacent channels; this provides more flexibility than bonding adjacent channels and it is now supported in IEEE 802.11ax [5]. Moreover, although the work in [31] and [32] adjusts CCA threshold, it considers assigning the same CCA threshold to all channels. In contrast, we adapt the CCA threshold on each channel. Lastly, research such as [33] does not consider traffic at APs, and the work in [10] considers a fixed channel gain. In contrast, we consider time-varying channel gains and traffic arrivals.

III. SYSTEM MODEL
Time is divided into $T$ time slots indexed by $t$; each time slot has length $\delta$ (in seconds). We consider an AP $i$ with a set of users $\mathcal{U}$. Let $d_{iu}$ denote the Euclidean distance between AP $i$ and a user $u \in \mathcal{U}$. There are $N$ channels in the set $\mathcal{C}$; each channel has a fixed bandwidth of $B$ MHz. In each time slot $t$, AP $i$ transmits to a user $u$ over a set of channels $\mathcal{C}_i^t \subseteq \mathcal{C}$. We assume the interference experienced by AP $i$ and user $u$ is generated by a set of neighboring APs, denoted as $\mathcal{N}_i$; this set contains any APs that may degrade the SINR of AP $i$ and user $u$. Further, we denote by $\mathcal{C}_j^t$ the set of channels used by a neighboring AP $j \in \mathcal{N}_i$ in time slot $t$. Table 2 lists our notations.
AP $i$ uses CSMA/CA for channel access, where the CCA threshold of channel $c$ at time $t$ is denoted as $\gamma_c^t$. Moreover, let $\mathcal{N}_i^t \subseteq \mathcal{N}_i$ be the set of transmitting neighboring APs at the start of time slot $t$. Further, we denote the transmitting APs on channel $c$ at time $t$ as $\mathcal{N}_c^t = \{j \mid j \in \mathcal{N}_i^t \wedge c \in \mathcal{C}_j^t\}$. We consider block fading, meaning the channel gain is fixed within one time slot and differs across time slots. The channel power gain from AP $i$ to user $u$ in time slot $t$ is denoted as $g_{iu}^t$, and it is calculated using the Log-distance path loss model (in dB) [35]:
$$PL(d_{iu}) = PL(d_0) + 10\,\omega \log_{10}\left(\frac{d_{iu}}{d_0}\right) + X_g,$$
where $PL(d_0)$ is the reference path loss (in dB) measured at reference distance $d_0$, and $\omega$ is the path loss exponent.
The term $X_g$ (in dB) is a random variable drawn from a zero-mean Gaussian distribution $\mathcal{N}(0, \sigma^2)$, representing the shadowing effect. Then, the channel power gain is $g_{iu}^t = 1 / 10^{PL(d_{iu})/10}$. We denote by $\beta_{uc}^t$ the SINR of a transmission from AP $i$ to user $u$ over channel $c$ in time slot $t$. Formally,
$$\beta_{uc}^t = \frac{P_{ic}^t\, g_{iu}^t}{N_b B + I_{uc}^t},$$
where $N_b$ is the ambient noise power density (in Watt/Hz), and $P_{ic}^t$ is the transmit power (in Watt) used by AP $i$ on channel $c$ in time slot $t$. Note that the transmit power satisfies $0 \le P_{ic}^t \le P_{\max}$ and $\sum_{c \in \mathcal{C}_i^t} P_{ic}^t = P_{\max}$. The term $I_{uc}^t$ is the aggregated interference experienced by user $u$ on channel $c$ in time slot $t$, which is calculated as
$$I_{uc}^t = \sum_{j \in \mathcal{N}_c^t} P_{jc}^t\, g_{ju}^t,$$
where $P_{jc}^t$ is the transmit power of neighboring AP $j$ on channel $c$ and $g_{ju}^t$ is the channel power gain from AP $j$ to user $u$. We denote by $r_c^t$ the theoretical data rate of AP $i$ on channel $c$ in time slot $t$. This data rate is calculated using the Shannon-Hartley formula, which is given by
$$r_c^t = B \log_2\left(1 + \beta_{uc}^t\right).$$
Then, the aggregated data rate $r_i^t$ of AP $i$ over the channels in $\mathcal{C}_i^t$ in time slot $t$ is given by
$$r_i^t = \sum_{c \in \mathcal{C}_i^t} r_c^t.$$
Note that in this paper we consider the theoretical capacity of a channel given its SINR. In practice, the data rate is determined by the Modulation and Coding Scheme (MCS) adopted by a sender. To this end, the value of $r_c^t$ can be set to the highest possible MCS or data rate for a given SINR.
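The chain from path loss to aggregated data rate described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the default values (reference path loss, path loss exponent) are placeholders rather than the values in Table 3, and the bandwidth is passed in Hz so that it is consistent with a noise density in Watt/Hz.

```python
import math

def path_loss_db(d, d0=1.0, pl_d0=40.0, omega=3.0, shadowing_db=0.0):
    """Log-distance path loss (dB); defaults are placeholder values."""
    return pl_d0 + 10.0 * omega * math.log10(d / d0) + shadowing_db

def channel_gain(pl_db):
    """Linear channel power gain g = 1 / 10^(PL/10)."""
    return 1.0 / (10.0 ** (pl_db / 10.0))

def sinr(tx_power_w, gain, noise_density_w_per_hz, bandwidth_hz, interference_w):
    """SINR = received power / (noise power over the channel + interference)."""
    return tx_power_w * gain / (noise_density_w_per_hz * bandwidth_hz + interference_w)

def shannon_rate(bandwidth_hz, sinr_linear):
    """Shannon-Hartley capacity of one channel, in bits per second."""
    return bandwidth_hz * math.log2(1.0 + sinr_linear)

def aggregate_rate(per_channel_rates):
    """Aggregated rate over the bonded channels."""
    return sum(per_channel_rates)

# Example: one 20 MHz channel, a user 5 m away, no shadowing (assumed numbers).
gain = channel_gain(path_loss_db(5.0))
beta = sinr(0.1, gain, 4e-21, 20e6, 1e-10)
rate_bps = shannon_rate(20e6, beta)
```

In practice, as noted above, the Shannon rate would be replaced by the highest MCS rate supported at the computed SINR.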
We assume AP $i$ has a queue of packets to transmit. The length of the queue at the end of time slot $t$ is $q_i^t$, where $0 \le q_i^t \le q_{\max}$. At the beginning of each time slot $t$, we use $\lambda_i^t$ to denote the number of packets arriving at AP $i$, where the value of $\lambda_i^t$ is sampled from a probability distribution. Further, we assume each packet has a fixed size of $L$ bits. Define $\mathbb{1}_i^t \in \{0, 1\}$ to indicate whether AP $i$ transmits in time slot $t$ ($\mathbb{1}_i^t = 1$); otherwise, we have $\mathbb{1}_i^t = 0$. The queue length at AP $i$ evolves as per
$$q_i^t = \min\left\{q_{\max},\ \max\left\{0,\ q_i^{t-1} + \lambda_i^t - \mathbb{1}_i^t \left\lfloor \frac{r_i^t \delta}{L} \right\rfloor\right\}\right\}.$$
Let $\pi$ be a policy used by AP $i$ that selects a set of channels $\mathcal{C}_i^t$, and assigns transmit power $P_{ic}^t$ on each channel $c \in \mathcal{C}_i^t$ at the beginning of time slot $t$. Moreover, the policy $\pi$ also adjusts the CCA threshold $\gamma_c^t \in [\gamma_{\min}, \gamma_{\max}]$ on each channel $c \in \mathcal{C}_i^t$. Define by $R(\pi)$ an objective function for policy $\pi$. Formally, we have
$$R(\pi) = \mathbb{E}_\pi\left[\frac{1}{T} \sum_{t=1}^{T} \left(\eta_1 r_i^t - \eta_2 q_i^t\right)\right],$$
where $\eta_1$ and $\eta_2$ are two weights that balance the data rate and queue length of AP $i$, and $\mathbb{E}_\pi[\cdot]$ refers to the expectation over the objective value when using policy $\pi$. We note that the weight $\eta_2$ can be revised to be a non-linear function of AP $i$'s queue length, whereby a high penalty is recorded if the queue length is at its maximum. Let $\Pi$ be the collection of policies $\pi$. The problem at hand is to find the optimal policy $\pi^*$ that maximizes the objective function $R(\pi)$ over time. Mathematically, we have
$$\pi^* = \arg\max_{\pi \in \Pi} R(\pi).$$
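To make the queue dynamics and the per-slot objective concrete, the following Python sketch steps a queue forward by one time slot. It is a sketch under stated assumptions: the function names are hypothetical, and it assumes that packets arriving at the start of a slot can be served in that slot, that the number of served packets is the floor of (rate x slot length / packet size), and that the queue is clipped to the range [0, q_max].

```python
import math

def queue_step(q_prev, arrivals, transmits, rate_bps, slot_s, packet_bits, q_max):
    """One-slot queue update (assumed form): add arrivals, subtract served packets."""
    # Packets that fit into one slot at the given aggregated rate; zero if the AP defers.
    served = math.floor(rate_bps * slot_s / packet_bits) if transmits else 0
    q = q_prev + arrivals - served
    # The queue length can neither be negative nor exceed the buffer size.
    return min(q_max, max(0, q))

def slot_objective(rate, queue_len, eta1, eta2):
    """Per-slot term of the objective: weighted rate minus weighted queue length."""
    return eta1 * rate - eta2 * queue_len

# Example with assumed numbers: 10 queued packets, 5 arrivals, 10 served.
q = queue_step(10, 5, True, 1000.0, 1.0, 100, 50)
```

Averaging `slot_objective` over the slots of an episode gives an empirical estimate of the objective for the policy that generated the actions.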

IV. A MARKOV DECISION PROCESS MODEL
We first discuss Markov Decision Process (MDP) [11]. After that, we introduce DQN [12] and DDPG [13]. Lastly, we introduce the simplex sampling method [36], which is used to sample possible transmit power allocations over one or more channels.

A. MARKOV DECISION PROCESS
An MDP is defined as a tuple with four elements $(S, A, R(s_t, a_t), P(s_{t+1} \mid s_t, a_t))$, where $S$ and $A$ denote the set of states and actions, respectively. In each time slot $t$, an agent, i.e., AP $i$ in our problem, observes a state $s_t \in S$ and selects an action $a_t \in A$. The environment then returns a reward $R(s_t, a_t)$, and moves from state $s_t$ to $s_{t+1} \in S$ with probability $P(s_{t+1} \mid s_t, a_t)$. Let $\pi(s_t)$ be a policy used by an agent, where the policy outputs an action $a_t$ given state $s_t$, i.e., $a_t = \pi(s_t)$. Let $V^\pi(s)$ be a value function that measures the expected long-term reward that starts from state $s$ and uses policy $\pi$ thereafter. Mathematically, we have
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s\right],$$
where $\gamma \in (0, 1]$ is the discount factor. The goal of an agent is to find the optimal policy $\pi^*$ that maximizes the value function for all states, denoted by $V^*$. This optimal value function $V^*$ can be computed using Bellman's equation [11] as
$$V^*(s_t) = \max_{a_t \in A}\left[R(s_t, a_t) + \gamma \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1})\right].$$
The optimal policy $\pi^*$ is then given by
$$\pi^*(s_t) = \arg\max_{a_t \in A}\left[R(s_t, a_t) + \gamma \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1})\right].$$
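When the transition probabilities are known, Bellman's equation can be solved by value iteration. The sketch below illustrates this on a toy two-state MDP; the MDP, the function names and the discount factor are all made up for illustration and are not part of the system model, which is model-free.

```python
def backup(s, a, v, reward, transition, gamma):
    """One-step lookahead: R(s, a) + gamma * sum_s' P(s'|s, a) V(s')."""
    return reward(s, a) + gamma * sum(p * v[s2] for s2, p in transition(s, a).items())

def value_iteration(states, actions, reward, transition, gamma=0.9, tol=1e-9, max_iter=10000):
    """Iterate the Bellman optimality operator until the values stop changing."""
    v = {s: 0.0 for s in states}
    for _ in range(max_iter):
        new_v = {s: max(backup(s, a, v, reward, transition, gamma) for a in actions)
                 for s in states}
        delta = max(abs(new_v[s] - v[s]) for s in states)
        v = new_v
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(actions, key=lambda a: backup(s, a, v, reward, transition, gamma))
              for s in states}
    return v, policy

# Toy MDP (hypothetical): staying in state 1 pays 1, everything else pays 0.
def toy_reward(s, a):
    return 1.0 if (s == 1 and a == "stay") else 0.0

def toy_transition(s, a):
    return {s: 1.0} if a == "stay" else {1 - s: 1.0}

values, policy = value_iteration([0, 1], ["stay", "move"], toy_reward, toy_transition, gamma=0.5)
```

With a discount factor of 0.5, the agent learns to move from state 0 to state 1 and then stay there. DQN and DDPG replace the explicit table `v` with neural networks and learn from sampled transitions instead of a known `transition` function.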

B. DEEP Q-NETWORK
DQN is a value-based reinforcement learning algorithm [12]. It learns the optimal policy by approximating the optimal Q-value for each state-action pair [12]. Hence, DQN is suited to discrete actions, e.g., selecting a set of channels. DQN consists of two neural networks, an evaluation network $\theta$ and a target network $\theta'$. The two networks have the same structure, where the evaluation network $\theta$ outputs a Q-value for a given state-action pair, denoted as $Q(s_t, a_t; \theta)$, and the target network $\theta'$ outputs the corresponding target Q-value $Q(s_t, a_t; \theta')$.
DQN uses experience replay to update the weights of its neural networks. Specifically, each combination of state, action, reward and next state $(s_t, a_t, R(s_t, a_t), s_{t+1})$ is called an experience. In each time slot, DQN stores an experience in its memory buffer $\mathcal{M}$, which holds up to $|\mathcal{M}|$ experiences. Every $K$ time slots, DQN uniformly samples a batch of experiences from $\mathcal{M}$ to update the weights of its evaluation network $\theta$. The goal is to minimize a loss function, which is given by
$$\mathcal{L}(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right],$$
where
$$y_t = R(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a'; \theta').$$
In addition, to ensure stability, every $K'$ time slots, the weights of the target network $\theta'$ are replaced by the weights of the evaluation network $\theta$.
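The replay mechanism and the loss target described above can be sketched without any deep learning library. The class and function names below are hypothetical, and the Q-values are passed in as plain lists; in an actual implementation they would be the outputs of the evaluation and target networks.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory; the oldest experience is dropped once the buffer is full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the current buffer size.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)

def td_target(reward, next_q_values_target, gamma):
    """Target y = R + gamma * max_a' Q(s', a'; theta'), using target-network Q-values."""
    return reward + gamma * max(next_q_values_target)

def mse_loss(targets, predictions):
    """Mean squared error between targets and evaluation-network Q-values."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)

# Example usage with made-up experiences.
memory = ReplayBuffer(capacity=3)
for t in range(5):
    memory.store(t, 0, 1.0, t + 1)
batch = memory.sample(2)
```

Sampling uniformly from the buffer breaks the temporal correlation between consecutive experiences, which is the usual motivation for experience replay.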

C. DEEP DETERMINISTIC POLICY GRADIENT
A drawback of DQN is that it is not able to learn the optimal policy when the action space is continuous [13], e.g., the transmit power and CCA threshold of AP $i$. Therefore, we employ DDPG, an actor-critic based algorithm, to address the said issue [13]. DDPG has four neural networks, namely an actor network $\theta^{\mu}$, a target actor network $\theta^{\mu'}$, a critic network $\theta^{Q}$ and a target critic network $\theta^{Q'}$. The structure of the target actor network and target critic network is the same as that of the corresponding actor and critic network, respectively. The actor network chooses a deterministic action $a_t$ at each state $s_t$, denoted as $a_t = \mu(s_t; \theta^{\mu})$, and the critic network evaluates the Q-value $Q(s_t, a_t; \theta^{Q})$ for each selected action $a_t$ at state $s_t$. Similarly, the target actor network selects a target action $\mu(s_t; \theta^{\mu'})$, and the target critic network outputs a target Q-value $Q(s_t, a_t; \theta^{Q'})$.
DDPG also uses experience replay to update the weights of its networks. In particular, every $K$ time slots, DDPG first samples a batch of experiences from its memory buffer $\mathcal{M}$ to update the weights of its critic network $\theta^{Q}$. The update follows a similar process as DQN, which aims to minimize the loss function as per
$$\mathcal{L}(\theta^{Q}) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta^{Q})\right)^2\right],$$
where
$$y_t = R(s_t, a_t) + \gamma\, Q\left(s_{t+1}, \mu(s_{t+1}; \theta^{\mu'}); \theta^{Q'}\right).$$
Next, the weights of the actor network are updated using the Q-values evaluated by the critic network. Specifically, DDPG first calculates the gradient of the Q-values with respect to all actions in the sampled batch, denoted as $\nabla_a Q(s, a; \theta^{Q})|_{s=s_t, a=\mu(s_t; \theta^{\mu})}$. Further, DDPG calculates the gradient of all actions with respect to the weights of the actor network $\theta^{\mu}$, denoted as $\nabla_{\theta^{\mu}} \mu(s; \theta^{\mu})|_{s=s_t}$. Then, by applying the chain rule, the weights of the actor network are updated using a policy gradient method [37] with the following approximation:
$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\left[\nabla_a Q(s, a; \theta^{Q})|_{s=s_t, a=\mu(s_t; \theta^{\mu})}\ \nabla_{\theta^{\mu}} \mu(s; \theta^{\mu})|_{s=s_t}\right].$$
Finally, DDPG applies a soft update on the target networks. Every $K$ time slots, the weights of the corresponding target networks are updated as per
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\, \theta^{\mu'},$$
where $\tau$ is a small positive number, representing the target network update rate.
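The critic target and the soft update of the target networks can be illustrated with a small Python sketch. The names are hypothetical, and network weights are represented as flat lists of floats purely for illustration; a real implementation would apply the same rule to every weight tensor of the actor and critic.

```python
def ddpg_target(reward, target_q_next, gamma):
    """Critic target y = R + gamma * Q(s', mu(s'; theta_mu'); theta_Q')."""
    return reward + gamma * target_q_next

def soft_update(target_weights, source_weights, tau):
    """Move each target weight a fraction tau towards the corresponding source weight."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(source_weights, target_weights)]

# Example: the target network slowly tracks the learned network.
target = [0.0, 1.0]
learned = [1.0, 1.0]
for _ in range(1000):
    target = soft_update(target, learned, tau=0.01)
```

A small update rate keeps the targets used in the critic loss slowly varying, in contrast to DQN, where the target weights are copied in one step.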

D. SIMPLEX SAMPLING
A key issue for DDPG is that it needs to randomly select an action to explore the action space. Recall that for the set of channels $\mathcal{C}_i^t$ used by AP $i$ in time slot $t$, its transmit power allocation is obtained from a $|\mathcal{C}_i^t|$-dimensional space. Further, any transmit power allocation must satisfy $0 \le P_{ic}^t \le P_{\max}$ and $P_{\max} = \sum_{c \in \mathcal{C}_i^t} P_{ic}^t$. To this end, we apply the simplex sampling method from [36] to determine a transmit power allocation over the set of channels $\mathcal{C}_i^t$. Specifically, the function Simplex($|\mathcal{C}_i^t|$) returns a vector $v = [v_1, \ldots, v_{|\mathcal{C}_i^t|}]$ with $v_k \ge 0$ and $\sum_k v_k = 1$, where each element represents a fraction of the maximum transmit power $P_{\max}$ that is used on a certain channel. The function Simplex(.) first randomly generates a sequence of $|\mathcal{C}_i^t| - 1$ values in $[0, 1]$, which it records in the vector $x = [x_1, \ldots, x_{|\mathcal{C}_i^t| - 1}]$. Then, it sorts the elements in $x$ in increasing order, and adds $x_0 = 0$ and $x_{|\mathcal{C}_i^t|} = 1$ to the beginning and the end of $x$, respectively. After that, the vector $v$ is obtained based on
$$v_k = x_k - x_{k-1}, \quad k = 1, \ldots, |\mathcal{C}_i^t|.$$
As an example, assume there are three bonded channels. Then the corresponding vector is $v = [v_1, v_2, v_3]$. Fig. 2 shows the results of 10,000 samples generated by Simplex(3), where each red point $(v_1, v_2)$ represents a sampled vector $v$. Note that the value $v_3$ is not shown as it can be calculated via $v_3 = 1 - v_1 - v_2$ [36].
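The sampling procedure above translates almost directly into code. The following Python sketch is one possible implementation; the function names are hypothetical, and it assumes the random values are drawn uniformly from [0, 1].

```python
import random

def simplex(n, rng=random):
    """Sample n non-negative fractions that sum to one via sorted uniform cut points."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # n - 1 cut points in [0, 1], sorted, with 0 and 1 added at the two ends.
    cuts = sorted(rng.random() for _ in range(n - 1))
    x = [0.0] + cuts + [1.0]
    # The gaps between consecutive cut points are the sampled fractions.
    return [x[k] - x[k - 1] for k in range(1, n + 1)]

def sample_power_allocation(n_channels, p_max, rng=random):
    """Scale a simplex sample by the maximum transmit power."""
    return [p_max * v for v in simplex(n_channels, rng)]

# Example: a random split of P_max = 0.1 W over three bonded channels.
allocation = sample_power_allocation(3, 0.1, random.Random(0))
```

Because the fractions are gaps between sorted cut points, every sample satisfies both power constraints by construction, so no rejection step is needed during exploration.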

V. A THREE-TIER LEARNING APPROACH
Our approach has three layers. This allows us to optimize each system parameter independently whilst keeping the other parameters fixed. For example, for a given set of channels and transmit power, we can then learn a policy to set the CCA threshold for different system states that occur when an AP uses the given number of channels and transmit power. We note that an alternative solution is to jointly optimize the number of channels, transmit power and CCA threshold simultaneously. However, this approach results in a very large action space that leads to low learning efficiency [38]. This problem is further exacerbated by the fact that both transmit power and CCA threshold have continuous values. To this end, our approach decomposes the action space, where each layer optimizes a specific system parameter, and advantageously, it can be optimized using a learning method that is suited to handle discrete or continuous action space.
Next, we first formulate our problem as a three-layer MDP. After that, we show how an AP or agent uses a three-tier learning approach to determine its policy.

A. THREE-LAYER MDP
AP $i$ runs as an agent, operating in an environment with three layers as shown in Fig. 3; each layer corresponds to a task for AP $i$, and is modeled as an MDP. Briefly, AP $i$ first selects a set of channels $\mathcal{C}_i^t$ in Layer-1. Then, it assigns a transmit power $P_{ic}^t$ for each channel $c \in \mathcal{C}_i^t$ in Layer-2. After that, in Layer-3, AP $i$ selects a CCA threshold $\gamma_c^t$ for each channel $c \in \mathcal{C}_i^t$. We note that the AP runs each layer sequentially; i.e., each layer is run after receiving an input from a higher layer or a reward from a lower layer.
Referring to Fig. 3, AP i interacts with its environment as follows. In each time slot t, AP i observes a state in each layer. Specifically, the state of Layer-1 is observed from the environment, and the state of Layer-2 and Layer-3 is obtained from Layer-1 and Layer-2, respectively. Based on its observed state, the agent at each layer outputs an action. The AP executes the action of each layer, which yields a reward and a new state for Layer-1.

1) DEFINITIONS
Here, we define the state, action and reward of each layer. Formally, they are as follows:
• Channel Selection (Layer-1):
- State $s_t^1$: The Layer-1 state $s_t^1$ consists of the current queue length $q_i^{t-1}$, and the interference experienced by AP $i$ on each channel $\{I_{ic}^t \mid c \in \mathcal{C}\}$. Formally, $s_t^1 = \{q_i^{t-1}, I_{ic}^t \mid c \in \mathcal{C}\}$. Note that AP $i$ can use IEEE 802.11k to collect interference information from associated clients/stations.
- Action $a_t^1$: The action for Layer-1 is to select a set of channels for AP $i$ to transmit, i.e., $a_t^1 = \mathcal{C}_i^t$, where $\mathcal{C}_i^t \subseteq \mathcal{C}$. For simplicity, we use $a_t^1$ to represent the set of selected channels $\mathcal{C}_i^t$ in the rest of this section.
- Reward $R_t^1$: The reward for Layer-1 is calculated based on the data rate $r_i^t$ and queue length $q_i^t$ of AP $i$, and is defined as $R_t^1 = \eta_1 r_i^t - \eta_2 q_i^t$, where $\eta_1$ and $\eta_2$ are two weights.
• Transmit Power Allocation (Layer-2):
- State $s_t^2$: The state for Layer-2 is obtained from Layer-1, i.e., it is determined by the set of selected channels $a_t^1$.
- Action $a_t^2$: The action for Layer-2 is to assign a transmit power on each selected channel. Formally, $a_t^2 = \{P_{ic}^t \mid c \in a_t^1\}$. Note that the transmit power $P_{ic}^t$ on each channel satisfies $0 \le P_{ic}^t \le P_{\max}$ and $P_{\max} = \sum_{c \in a_t^1} P_{ic}^t$.
- Reward $R_t^2$: The reward for Layer-2 is the same as Layer-1. Formally, we have $R_t^2 = R_t^1 = \eta_1 r_i^t - \eta_2 q_i^t$.
• CCA Threshold Adjustment (Layer-3):
- State $s_t^3$: The state for Layer-3 is the transmit power assigned on each channel $c \in a_t^1$. Formally, we have $s_t^3 = \{P_{ic}^t \mid c \in a_t^1\}$.
- Action $a_t^3$: The corresponding action for Layer-3 is to select a CCA threshold for each channel. Formally, we have $a_t^3 = \{\gamma_c^t \mid c \in a_t^1\}$.
- Reward $R_t^3$: The reward for Layer-3 is the achieved data rate of AP $i$, i.e., $R_t^3 = r_i^t$.
We emphasize that we propose a model-free approach for practical reasons, meaning the transition probability $P(\cdot)$ in each layer is unknown. Advantageously, our approaches/algorithms allow an AP to learn the optimal policy for different environments upon deployment. Specifically, the AP is only required to observe the states, e.g., varying traffic or interference level, of its environment, and optimize its policy as per our algorithms to determine an action that maximizes its average reward.

B. ALGORITHM DETAILS
We now outline the algorithms used in each tier or layer; these algorithms are run on the same AP. Note that our approach is designed to be run by a single AP. It does not require nor assume other APs run our approach. Briefly, an AP, say i, uses DQN to select a set of channels in Layer-1, and uses DDPG to assign a transmit power over each selected channel in Layer-2. For Layer-3, we assume each channel is managed by an agent using DDPG, where the agent on channel c assigns a CCA threshold for the channel. We emphasize that these agents do not cooperate with each other, as channels are orthogonal and hence the data rate on a given channel is not a function of other channels.
In each time slot $t$, AP $i$ first observes the Layer-1 state $s_t^1$ and calls Layer1SelectChannels() to select an action $a_t^1$; see line 3 in Algorithm-1. The function Layer1SelectChannels() selects an action $a_t^1$ using the function ϵ-greedy(.), where it randomly selects an action with probability $\epsilon_1$. Otherwise, it selects the action with the highest Q-value; see line 1 in Algorithm-2. The value of $\epsilon_1$ is reduced over time until it reaches a minimum value of $\epsilon_{\min}$. This is to ensure convergence. In addition, the function Layer1SelectChannels() also outputs the Layer-2 state $s_t^2$.
Next, AP $i$ enters Layer-2 with state $s_t^2$, and calls Layer2AssignPower(); see line 4 in Algorithm-1. It first calls ϵ-greedy() to output a vector $v$, where each element in vector $v$ represents a certain fraction of the maximum transmit power $P_{\max}$. Specifically, the function ϵ-greedy(.) in Layer-2 uses Simplex($|a_t^1|$) to sample a random vector $v$ with probability $\epsilon_2$, where $|a_t^1|$ is the number of selected channels. Otherwise, it uses the output of $\mu(s_t^2; \theta^{\mu_2})$ as vector $v$. Note that we use the Softmax function as the activation function for the output layer of DDPG. This ensures the constraints for Layer-2 actions hold, i.e., $0 \le P_{ic}^t \le P_{\max}$ and $P_{\max} = \sum_{c \in \mathcal{C}_i^t} P_{ic}^t$. Then, Layer2AssignPower() scales the output vector $v$ with the maximum transmit power $P_{\max}$ to obtain a Layer-2 action $a_t^2$; see line 2 in Algorithm-3. Lastly, the function Layer2AssignPower() outputs the state $s_t^3$ for Layer-3.
In Layer-3, for each channel $c \in a_t^1$, the agent on channel $c$ observes a state, i.e., transmit power $P_{ic}^t \in s_t^3$, and calls Layer3AdjustCCA(); see lines 5 to 8 in Algorithm-1. The algorithm first calls ϵ-greedy(), where it uniformly samples a value $v$ from the range $[0, 1]$ with probability $\epsilon_c$. Otherwise, it uses the output of $\mu(P_{ic}^t; \theta^{\mu_c})$ as value $v$. Note that each DDPG agent in Layer-3 uses the Sigmoid function as the activation function in the output layer, as the CCA threshold on each channel $c$ is a one-dimensional parameter. The value $v$ is then scaled into the range $[\gamma_{\min}, \gamma_{\max}]$ to obtain the action $a_t^c$ on channel $c$; see line 2 in Algorithm-4. Finally, the action of Layer-3 is obtained as $a_t^3 = \{a_t^c \mid c \in a_t^1\}$.
The three actions $a_t^1$, $a_t^2$ and $a_t^3$ are then executed by AP $i$ to obtain rewards $R_t^1$, $R_t^2$ and $R_t^3$. Then, the agent at each layer stores its experience into its memory buffer starting from the second time slot; see lines 11 to 16 in Algorithm-1. This is because the state for Layer-2 and Layer-3 depends on the action from their respective upper layer. Therefore, Layer-2 and Layer-3 obtain their respective next state $s_{t+1}^2$ and $s_{t+1}^3$ only after Layer-1 and Layer-2 select the actions $a_{t+1}^1$ and $a_{t+1}^2$ in the following time slot. The stored experiences are then used by AP $i$ to update the neural networks in each layer, as shown in lines 18 to 21 in Algorithm-1.
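The action selection in Layer-2 and Layer-3 can be sketched as follows. This is a simplified illustration, not the code of Algorithm-3 or Algorithm-4: the function names are hypothetical, the actor networks are replaced by their raw pre-activation outputs (`actor_logits`, `actor_logit`), and the CCA range of -80 to -30 dBm in the example is only an assumed setting.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax; outputs are non-negative and sum to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Numerically stable logistic function with output in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def layer2_power(actor_logits, p_max, epsilon, rng=random):
    """Explore with a simplex sample, otherwise exploit the actor's softmax output."""
    n = len(actor_logits)
    if rng.random() < epsilon:
        x = [0.0] + sorted(rng.random() for _ in range(n - 1)) + [1.0]
        v = [x[k] - x[k - 1] for k in range(1, n + 1)]
    else:
        v = softmax(actor_logits)
    return [p_max * f for f in v]  # scale fractions by the maximum transmit power

def layer3_cca(actor_logit, cca_min, cca_max, epsilon, rng=random):
    """Explore with a uniform sample, otherwise exploit the actor's sigmoid output."""
    explore = rng.random() < epsilon
    v = rng.random() if explore else sigmoid(actor_logit)
    return cca_min + v * (cca_max - cca_min)  # scale into [cca_min, cca_max]

# Example: greedy actions for two selected channels (assumed actor outputs).
powers = layer2_power([0.0, 0.0], p_max=1.0, epsilon=0.0)
threshold = layer3_cca(0.0, -80.0, -30.0, epsilon=0.0)
```

Both output activations keep the actions feasible by construction: the softmax output always sums to one, and the sigmoid output always maps inside the CCA range.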

VI. EVALUATION
We conducted our simulations using Python 3.7 on a computer with an i7-8700 CPU operating at 4.3 GHz and 16 GB of RAM. We used TensorFlow 1.14 [45] and Keras 2.2.5 [46] to build neural networks for our learning agents. These agents ran on an AP, denoted as $i$, that experienced interference from a set of neighboring APs in $\mathcal{N}_i$; each neighboring AP was placed 20 m away from AP $i$, acting as an interference source to induce different interference states at AP $i$. There was at least one interfering AP operating on each channel. Each AP had four users that were uniformly placed within a 5 m distance. Unless otherwise stated, our simulations used the parameter values listed in Tables 3 and 4.
We implemented and compared the following algorithms/rules:
• DDPG: AP i uses DQN to select a set of channels, and uses DDPG for both transmit power distribution and CCA threshold adjustment on each channel.
• Mixed DDPG and DQN (MixDD): AP i uses DQN to select channels, and uses DDPG for transmit power allocation. It uses DQN for CCA threshold adjustment on each channel with eleven discrete CCA threshold values, ranging from -80 to -30 dBm.
• DQN: AP i uses DQN to select channels, and uses DQN for both transmit power distribution and CCA threshold adjustment on each channel. We discretized the AP's transmit power into eight levels, ranging from zero to 100 Watts. The CCA threshold is discretized into 11 levels, ranging from -80 to -30 dBm.
• All Channels Bonded (ACB): AP i will always use all channels for transmissions. The transmit power on each channel is evenly distributed. AP i uses a CCA threshold of -82 dBm. 2
• Random: AP i randomly selects a set of channels for transmissions. The transmit power on each selected channel is distributed evenly. The CCA threshold for each channel is set to -82 dBm.
• Primary Channel Only (PCO): AP i randomly selects a channel as its primary channel, and only uses the primary channel for transmissions with the maximum transmit power. The CCA threshold for the primary channel is set to -82 dBm.
Note that MixDD and DQN are two variations of DDPG. The motivation for studying them is to investigate the case where DDPG uses a different reinforcement learning algorithm in each layer. Our simulations used episodes that comprised 500 time slots. For each episode, we collected the following metrics:
In the Training stage, AP i had three available channels, and on each channel, there was an interfering AP located 20 m away. AP i always had packets to transmit. The packet size was fixed to 2304 Bytes [41]. In this stage, we also conducted simulations to study the impact of ϵ decay rules and the values of η 1 and η 2 . To train our agents, for the first 5000 time slots, we programmed agents to randomly select actions to ensure they collected sufficient data. Next, we trained our agents for 40000 time slots. After that, we set the value of ϵ to zero and ran the simulation for another 5000 time slots to analyze their convergence performance.
In the Test stage, we used the same network model as in the Training stage, and studied two traffic models, the number of interfering APs and the channel gain variance. We used a Poisson traffic model to control the number of packets that arrived at AP i in each time slot; its arrival rate ranged from 30 to 240 packets per time slot. We have also constructed a traffic model using the trace data in [47]. After that, we studied the impact of the number of interfering APs, which ranged from one to six on each channel, meaning the total number of interfering APs increased from three to 18. Next, we evaluated the impact of the channel gain variance, i.e., the shadowing term $\sigma^2$, which ranged from zero to 80 dB.
Next, in the Supplementary stage, we outline our performance study of DDPG, MixDD and DQN when there are different numbers of channels, which ranged from two to eight; each channel had one interfering AP. Lastly, we investigated the topology and channel model provided by the IEEE 802.11ax task group [48].

A. TRAINING STAGE

1) ϵ DECAY RULES
We evaluated three ϵ decay rules. The value of ϵ at each time slot t was calculated as follows:
• Linear: $\epsilon_t = \max\left(1 - \frac{(1-\epsilon_{\min})\,t}{N_L/K},\ \epsilon_{\min}\right)$.
• Quartic: $\epsilon_t = \max\left(1 - \left(\frac{t}{N_L/K}\right)^{4},\ \epsilon_{\min}\right)$.
In our discussion to follow, the terms $K$ and $N_L$ refer, respectively, to the learning frequency and the number of time slots for learning. Fig. 4 shows the evolution of ϵ over time. Our simulations showed that for the Exponential rule, the value of ϵ reduced at the fastest rate before 15000 time slots, and then it reduced at a slower rate than the Quartic and Linear rules. By contrast, the value of ϵ for the Quartic rule decreased at the lowest rate at the beginning and started to decrease faster after 10000 time slots. Note that the value of ϵ will not decrease below the minimum value $\epsilon_{\min}$.
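As a minimal sketch, the Linear and Quartic rules can be written as two small functions. The reading of the decay horizon as N_L/K follows the formulas above, while the values K = 1 and ϵ_min = 0.01 used in the example are hypothetical placeholders rather than the settings of the paper. The Exponential rule is omitted because its formula is not given in this section.

```python
def linear_epsilon(t, n_l, k, eps_min):
    """Linear rule: epsilon falls from 1 to eps_min over n_l / k steps."""
    return max(1.0 - (1.0 - eps_min) * t / (n_l / k), eps_min)


def quartic_epsilon(t, n_l, k, eps_min):
    """Quartic rule: epsilon stays near 1 early on, then drops quickly."""
    return max(1.0 - (t / (n_l / k)) ** 4, eps_min)


if __name__ == "__main__":
    # Hypothetical settings: 40000 learning slots, K = 1, eps_min = 0.01.
    n_l, k, eps_min = 40000, 1, 0.01
    for t in (0, 10000, 20000, 30000, 40000):
        print(t, linear_epsilon(t, n_l, k, eps_min),
              quartic_epsilon(t, n_l, k, eps_min))
```

Both rules reach ϵ_min at t = N_L/K, but the Quartic rule keeps ϵ close to one for much longer, which matches the slow initial decrease described above.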
Referring to Fig. 5, DDPG, MixDD and DQN converged to around 190 packets per time slot. In addition, Fig. 5 shows that the different ϵ decay rules had no significant impact on the converged throughput of any tested algorithm. The largest difference we recorded was 4.5 packets per time slot, between DDPG with the Linear rule and DQN with the Quartic rule, which corresponds to only 2.39%. Fig. 6 shows the average throughput for different ϵ decay rules. The Exponential rule had the highest average throughput for all three learning algorithms. The average number of transmitted packets per time slot for DDPG, MixDD and DQN was 165.5, 156.1 and 153.2, respectively, meaning the Exponential rule outperformed the Linear rule by 11.18% and the Quartic rule by 24.32%, on average. This is because when ϵ is reduced to a low value, an agent exploits its learned policy with a high probability, and the Exponential rule reduced ϵ to the minimum value ϵ min at the fastest rate. As all three ϵ decay rules converged to the same throughput of 190 packets per time slot (Fig. 5), the rule that began exploiting earliest, namely the Exponential rule, had the highest average throughput. Hence, in all subsequent simulations, we use the Exponential rule as the ϵ decay rule when training agents.

2) IMPACT OF η 1 AND η 2
We have also evaluated the impact of different values for η 1 and η 2 . We trained DDPG with η 1 and η 2 values drawn respectively from the sets {0.1, 0.5, 0.9} and {0.01, 0.05, 0.09}. Fig. 7 shows the converged throughput for different combinations of η 1 and η 2 values. We see that when η 1 equals 0.1, the converged throughput for η 2 = 0.05 and η 2 = 0.09 is respectively 66.7 and 67.2 packets per time slot. DDPG achieved an average number of transmitted packets that exceeded 190 packets per time slot for all other combinations of η 1 and η 2 . Recall that η 1 and η 2 weigh the importance of data rate and queue length, respectively. The results in Fig. 7 suggest that the value of η 1 should be sufficiently larger than η 2 to balance the impact of data rate and queue length on the reward received by an AP. Otherwise, a minor change in queue length could have a significant impact on the AP's reward, which may potentially cause an agent to learn an incorrect action or policy. Hence, in all subsequent simulations, we set the value of η 1 to 0.9 and η 2 to 0.01.
Fig. 8 shows that the average number of packets transmitted by DDPG, MixDD and DQN increased over time. This average value for DDPG, MixDD and DQN respectively increased from 102.9, 97.8 and 88.8, and converged to 190.2, 189.6 and 186.0 packets per time slot. This is because they were able to learn the optimal policies for channel selection, transmit power distribution and CCA threshold adjustment over time. Fig. 8 also shows that the performance of DDPG is better than that of the other two learning algorithms. For example, the number of transmitted packets per time slot for DDPG is, on average, 4.43% and 6.84% higher than that of MixDD and DQN, respectively. This is because DDPG used a continuous action space for its transmit power distribution and CCA threshold. By contrast, MixDD employed discrete CCA thresholds on each channel, whereas DQN learned over discrete action spaces for both power distribution and CCA threshold.
Thus, MixDD and DQN could not select the optimal action whenever it did not coincide with one of the discretized levels in their action space. Therefore, DDPG outperformed the other two learning algorithms. We also see that the average number of transmitted packets for ACB, Random and PCO remained unchanged over time, at 118.6, 70.6 and 42.8 packets per time slot, respectively. This is because these three rules had no learning mechanism to optimize channel usage, transmit power and CCA threshold.
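The effect of discretization can be made concrete with a small sketch. It builds the eleven CCA levels from -80 to -30 dBm used by MixDD and DQN and snaps a continuous threshold to the nearest level; the value of -57 dBm is a hypothetical optimum chosen only for illustration.

```python
# Eleven evenly spaced CCA thresholds from -80 to -30 dBm (5 dB apart).
CCA_LEVELS = [-80.0 + 5.0 * i for i in range(11)]


def nearest_level(value, levels=CCA_LEVELS):
    """Return the discrete level closest to a continuous threshold."""
    return min(levels, key=lambda level: abs(level - value))


# Hypothetical continuous optimum that lies between two discrete levels.
optimum_dbm = -57.0
chosen = nearest_level(optimum_dbm)
print(chosen, abs(chosen - optimum_dbm))  # -55.0 2.0
```

With a 5 dB spacing, a discrete agent can be up to 2.5 dB away from the threshold a continuous agent could select, which is consistent with the small but persistent throughput gap reported between DDPG and the two discretized variants.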

B. TEST STAGE

1) POISSON TRAFFIC
To study the impact of Poisson traffic, we assumed three channels. We varied the traffic arrival rate from 30 to 240 packets per time slot. Simulations ran for 5000 time slots for each traffic arrival rate. Fig. 9 and 10 show the impact of arrival rates. From Fig. 9, the throughput of DDPG, MixDD and DQN increased with the arrival rate. They were able to transmit 31.56 packets per time slot when the traffic arrival rate was set to 30 packets per time slot. This number then increased to 190.8 for DDPG, 188.8 for MixDD and 186.6 for DQN when the traffic arrival rate was set to 240 packets per time slot. This is because these three learning algorithms were able to transmit more than 186 packets per time slot after training, as shown in Fig. 8. When the traffic arrival rate was lower than 180 packets per time slot, DDPG, MixDD and DQN were able to empty the queue of AP i. As a result, their throughput was limited by the low traffic arrival rate. Fig. 10 shows that the average queue length of DDPG, MixDD and DQN is lower than 1100 packets when AP i experienced a traffic arrival rate no larger than 180 packets per time slot. However, when the traffic arrival rate exceeded 180 packets per time slot, DDPG, MixDD and DQN did not have sufficient throughput to deliver all arriving packets, which increased the queue length of AP i. From Fig. 10, the average queue length of these three learning algorithms increased significantly from 1500 to 15968 packets when the traffic arrival rate increased from 180 to 210 packets per time slot. We observed similar trends for ACB, Random and PCO, where they had an average throughput of 117.2, 71.8 and 42.4 packets per time slot as shown in Fig. 8. This means they were only able to reduce the queue length of AP i when its traffic arrival rate was lower than their average throughput.
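The queue behavior described above can be reproduced qualitatively with a toy model. This sketch is not the simulator used in the paper: it assumes a fixed service capacity of 190 packets per time slot (the converged throughput reported earlier), a maximum queue length of 16000 packets, and it ignores all channel effects. The function names are illustrative.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_queue(arrival_rate, capacity, slots, max_queue, seed=0):
    """Return (average queue length, final queue length) after `slots` slots."""
    rng = random.Random(seed)
    queue, total = 0, 0
    for _ in range(slots):
        # Packets arrive first; anything beyond max_queue is dropped.
        queue = min(queue + poisson_sample(arrival_rate, rng), max_queue)
        # The AP then sends at most `capacity` packets in this slot.
        queue -= min(queue, capacity)
        total += queue
    return total / slots, queue


if __name__ == "__main__":
    for rate in (150, 240):
        avg, final = simulate_queue(rate, capacity=190, slots=5000,
                                    max_queue=16000)
        print(rate, round(avg, 1), final)
```

In this model the queue stays near zero while the arrival rate is below the service capacity and saturates near the maximum once the arrival rate exceeds it, mirroring the sharp transition between 180 and 210 packets per time slot.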

2) INTERFERING APs
Here, APs have Poisson traffic with an arrival rate of 180 packets per time slot. There were three channels, and one to six interfering APs on each channel. We uniformly placed all interfering APs within a range of 40 m around AP i. For each number of interfering APs, we ran the simulation ten times, with 5000 time slots in each run. Fig. 11 shows that the throughput of AP i decreased as the number of interfering APs increased. DDPG, MixDD and DQN were able to transmit 181 packets per time slot when there were three interfering APs. This number decreased to 160.6 for DDPG, 158.6 for MixDD and 156.2 for DQN when the number of interfering APs increased to 18. As a comparison, the average number of transmitted packets per time slot for ACB, Random and PCO decreased from 136.7, 81.7 and 49.4 to 38.8, 23.3 and 14.3, respectively. This is because the level of interference on each channel increased as the number of interfering APs increased, which led to throughput degradation for all algorithms/rules. However, DDPG, MixDD and DQN were more robust to higher interference levels than ACB, Random and PCO. The throughput of DDPG, MixDD and DQN reduced by 12.74% on average as the number of interfering APs increased from three to 18. In contrast, the performance of ACB, Random and PCO dropped by 71.37% on average under the same circumstances. From Fig. 11, DDPG, MixDD and DQN always had the highest throughput. In particular, these three learning algorithms achieved 137.72%, 296.67% and 548.5% higher throughput than ACB, Random and PCO on average, respectively. This is because these three learning algorithms were able to learn the optimal channel selection, transmit power distribution and CCA threshold adjustment for varying interference levels. From Fig. 11, DDPG had a higher throughput than MixDD by 1.21% and DQN by 2.65% for the scenario with six interfering APs. The difference in throughput led to different average queue lengths.
Referring to Fig. 12, DDPG had a queue length that was 5.88% and 10.51% shorter than that of MixDD and DQN, respectively. Hence, using a continuous action space for transmit power and CCA threshold led to better performance than a discrete action space. In contrast, the average queue length for ACB, Random and PCO was always around 16000 packets. These three rules were not able to transmit more than 180 packets per time slot. Therefore, they were not able to reduce the queue length of AP i, resulting in queue overflow.

3) CHANNEL GAIN VARIANCE
To study the impact of channel gain variance, we used the following settings: Poisson traffic with an arrival rate of 180 packets per time slot, and three channels. The channel gain variance $\sigma^2$ was increased from zero to 80 dB. Referring to Fig. 13 and 14, the performance of DDPG, MixDD and DQN was not affected by the changing channel gain variance. Fig. 13 shows that DDPG, MixDD and DQN were able to transmit 181 packets per time slot for all channel gain variance values. The average queue length of these three algorithms was less than 2000 packets. This is because DDPG, MixDD and DQN were able to learn the optimal transmit power and CCA threshold for each channel, which resulted in a high throughput under various levels of interference. In contrast, the throughput of ACB, Random and PCO increased with higher channel gain variance. The average number of transmitted packets per time slot for ACB, Random and PCO increased from 117.8 to 127.0, 70.2 to 75.7 and 42.8 to 46.8, respectively. Their throughput improved by 8.25% on average as the channel gain variance increased from zero to 80 dB. This is because as the variance increased, the corresponding Cumulative Distribution Function (CDF) changed, which increased the probability of the interference on each channel being less than the CCA threshold used by AP i, i.e., -82 dBm. Therefore, for high channel gain variance, ACB, Random and PCO had more opportunities to transmit than when the channel gain variance was low. Consequently, the average number of transmitted packets increased. However, the increased throughput was still lower than the traffic arrival rate. Hence, ACB, Random and PCO were not able to reduce the queue length of AP i, and suffered from queue overflow. Referring to Fig. 14, the average queue length for ACB, Random and PCO stayed at the maximum queue length of 16000 packets for all channel gain variances.
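The CDF argument above can be illustrated numerically. The sketch below assumes that the interference power in dBm is Gaussian (log-normal shadowing) with a mean that lies above the fixed CCA threshold of -82 dBm; the mean of -70 dBm is a hypothetical value, and the function name is illustrative.

```python
import math


def prob_below_cca(mean_dbm, variance, cca_dbm=-82.0):
    """P(interference < cca_dbm) for Gaussian interference power in dBm.

    mean_dbm: assumed mean interference power (hypothetical value).
    variance: shadowing variance sigma^2.
    """
    if variance == 0:
        # No shadowing: interference is deterministic.
        return 1.0 if mean_dbm < cca_dbm else 0.0
    sigma = math.sqrt(variance)
    z = (cca_dbm - mean_dbm) / sigma
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    for variance in (0, 20, 40, 60, 80):
        print(variance, prob_below_cca(-70.0, variance))
```

When the mean interference exceeds the threshold, a larger variance places more probability mass below -82 dBm, so a rule with a fixed threshold sees the channel as idle more often. Under this assumption the probability still remains well below one half, which is consistent with the modest throughput gain observed for ACB, Random and PCO.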

4) TRACE-BASED STUDY
The next simulation evaluated the performance of all algorithms/rules using the traffic trace file provided in [47]. We extracted eight days of traffic, from 19 October  .87% and 20.13% higher than ACB, Random and PCO, respectively. This was because the traffic arrival rate was low throughout each day. From Fig. 15, the average number of arriving packets in each time slot was around 30.6 across the eight days. Therefore, DDPG, MixDD and DQN were able to empty the queue of the AP quickly. This can also be seen from Fig. 17, where DDPG, MixDD and DQN had the smallest average queue length of 3.05 packets. As a result, DDPG, MixDD and DQN did not have a large number of packets to transmit in each time slot, which resulted in a low average number of transmitted packets per time slot. In contrast, ACB, Random and PCO had an average queue length of 4.64, 6.52 and 8.24 packets, respectively, meaning the average queue length of DDPG, MixDD and DQN was 32.94%, 50.99% and 62.52% shorter than that of ACB, Random and PCO, respectively.

C. SUPPLEMENTARY STAGE

1) NUMBER OF CHANNELS
We now present the simulation that studied different numbers of channels. Each channel had one interfering AP placed 20 m away from AP i. As the number of channels differs, agents used a different action space. Therefore, for each number of channels, we rebuilt our learning agents, and trained them until convergence. Fig. 18 shows the converged throughput for different numbers of channels. The performance of DDPG, MixDD and DQN did not differ when the number of channels was no larger than four. The average number of transmitted packets per time slot for DDPG, MixDD and DQN increased from 130 to 245 when the number of channels increased from two to four. This means all three learning agents were able to learn the optimal policy when the number of channels was low. However, as the number of channels increased further, DDPG achieved the highest throughput. The average number of transmitted packets per time slot for DDPG increased from 351 to 451 when the number of channels increased from six to eight, which was 14.67% and 65.51% higher than MixDD and DQN on average, respectively. This means DDPG was able to learn the optimal policy in scenarios with different numbers of channels. In contrast, the throughput of DQN showed no improvement beyond four channels. Its converged throughput remained around 243 packets per time slot as the number of channels increased from four to eight. This is because the action space for DQN increased significantly with more channels. For example, the number of actions in the action space related to transmit power over six and eight channels was 1287 and 6435, respectively. As a result, agents were not able to explore and learn each action efficiently during training. Therefore, agents were not able to learn the optimal policy, which resulted in poor performance.
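The two action-space sizes quoted above coincide with a standard combinatorial count: the number of ways to split eight indivisible power units across n channels (stars and bars). Whether the authors enumerated their discrete power actions exactly this way is an assumption; the sketch only shows that this reading reproduces both figures and how quickly the count grows. The function name is illustrative.

```python
from math import comb


def power_split_count(units, channels):
    """Number of ways to distribute `units` identical power units over
    `channels` channels, allowing zero units on a channel (stars and bars)."""
    return comb(units + channels - 1, channels - 1)


for n in (2, 4, 6, 8):
    print(n, power_split_count(8, n))
# 2 9
# 4 165
# 6 1287
# 8 6435
```

Going from four to eight channels multiplies the number of power actions by roughly 39 under this count, while the training budget stays fixed, which helps explain why DQN stopped improving beyond four channels.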

2) IEEE SCENARIOS
We have also conducted simulations using the IEEE scenario and channel model proposed in [48] and [49]. Referring to Fig. 19, we simulated an apartment with ten cells; each cell had a dimension of 10 m × 10 m. We placed an AP at the center of each cell, and selected the AP in the center bottom cell as AP i that ran our learning agents. Each AP had four associated users that were uniformly placed within its cell. We adopted the indoor channel model provided in [49], where the wall penetration loss was set to 5 dB. Fig. 20 shows the average number of transmitted packets for different algorithms/rules. We see that DDPG, MixDD and DQN were able to achieve the highest throughput over time. The average number of transmitted packets per time slot for DDPG, MixDD and DQN increased from 61.2, 59.3 and 56.8 to 135.0, 133.0 and 130.6, respectively. These three learning algorithms were able to learn the optimal policy for the stated IEEE scenario and channel model. DDPG achieved the highest throughput among all three learning algorithms, where its average number of transmitted packets per time slot was 4.01% and 5.83% higher than that of MixDD and DQN, respectively. In contrast, the throughput for ACB, Random and PCO remained the same over time. ACB, Random and PCO achieved an average number of transmitted packets of 58.5, 36.6 and 23.5 per time slot, respectively.

VII. CONCLUSION
This paper has outlined and studied a novel three-tier learning approach that aims to improve multi-channel utilization and minimize the queue length of an AP operating in a WiFi network. Specifically, the AP uses our approach to learn a policy that governs when and how it uses one or more channels, allocates its transmit power and sets the CCA threshold for each selected channel given varying environmental conditions. Advantageously, the proposed approach requires an AP to use only local information, such as its queue length and observed interference level. The simulation results showed that the proposed learning approach was able to learn the optimal policy, and achieved the best performance under multiple scenarios. Numerical results showed that the proposed three-tier learning approach was able to reduce the average queue length of an AP by up to 62.52% when compared to an AP that used a fixed strategy over realistic traffic trace data. An interesting direction for future work is to consider a time-varying number of users with different quality of service requirements, such as data rate or transmission frequency. In this respect, the agent will have to incorporate into its state the number of users and their requirements. Its goal is then to learn a policy that ensures users meet their respective requirements. Another possibility is to leverage generative artificial intelligence or diffusion models to speed up training or to improve an AP's policy. Specifically, an AP can first be trained offline using data generated from such models to obtain a preliminary policy for a given environment. After that, the AP/agent uses actual system states to refine its policy.