Cooperate or not Cooperate: Transfer Learning with Multi-Armed Bandit for Spatial Reuse in Wi-Fi

The exponential increase of wireless devices with highly demanding services such as streaming video and gaming has imposed several challenges on Wireless Local Area Networks (WLANs). In the context of Wi-Fi, IEEE 802.11ax brings high data rates in dense user deployments. Additionally, it comes with new flexible features in the physical layer, such as a dynamic Clear Channel Assessment (CCA) threshold, with the goal of improving spatial reuse (SR) in response to radio spectrum scarcity in dense scenarios. In this paper, we formulate the Transmission Power (TP) and CCA configuration problem with the objective of maximizing fairness and minimizing station starvation. We present four main contributions to distributed SR optimization using Multi-Agent Multi-Armed Bandits (MA-MABs). First, we propose to reduce the action space, given the large cardinality of the combinations of TP and CCA threshold values per Access Point (AP). Second, we present two deep Multi-Agent Contextual MABs (MA-CMABs), named Sample Average Uncertainty (SAU)-Coop and SAU-NonCoop, as cooperative and non-cooperative versions to improve SR. Third, we analyze whether cooperation is beneficial using MA-MAB solutions based on the ε-greedy, Upper Confidence Bound (UCB) and Thompson sampling techniques. Finally, we propose a deep reinforcement transfer learning technique to improve adaptability in dynamic environments. Simulation results show that cooperation via the SAU-Coop algorithm contributes to an improvement of 14.7% in cumulative throughput and of 32.5% in Packet Loss Ratio (PLR) when compared with non-cooperative approaches. Moreover, under dynamic scenarios, transfer learning mitigates service drops for at least 60% of the total number of users.


I. INTRODUCTION
Wireless connectivity has become an irreplaceable commodity in our modern society. The exponential trend expected in wireless technology usage has led analysts to predict that by 2023, 71% of the global population will enjoy some kind of wireless service. Within the group of Wireless Local Area Networks (WLANs), Wireless Fidelity (Wi-Fi) technology presents a growth of up to 4-fold over the 5-year period from 2018 to 2023 [1]. The newest Wi-Fi standard, IEEE 802.11ax [2], also known as Wi-Fi 6, is expected to grow 4-fold by 2023, accounting for 11% of all public Wi-Fi hotspots [3].
Spatial reuse (SR) has been of interest to the wireless community for more than 20 years, since it contributes to the reduction of collisions among stations and to the determination of channel access rights [4]. As the number of dense WLAN deployments increases, SR becomes more challenging in the context of Carrier Sense Multiple Access (CSMA) technology as used in Wi-Fi [5]. Wi-Fi 6 addresses diverse challenges such as the increasing number of Wi-Fi users, dense hotspot deployments and highly demanding services such as Augmented, Mixed and Virtual Reality. Moreover, 802.11ax includes additional features such as dynamic adjustment of the Clear Channel Assessment (CCA) threshold and the Transmission Power (TP). A static CCA threshold may not be representative of diverse network topologies and may cause inefficient channel utilization or concurrent transmissions [6]. Additionally, adjusting the TP reduces the interference among the APs and consequently maximizes the network performance [7]. Thus, SR and network performance can be improved by adjusting CCA and TP. Yet, the complex interactions between CCA and TP call for intelligent configuration of both.
To this end, data scarcity and data access are key concerns for any Machine Learning (ML) method [8]. Recently, AI-based wireless networks have been of remarkable interest among researchers in both the Wi-Fi domain [9] and the 5G domain [10]; however, the proposed solutions usually require complete availability of the data. In reality, data access is not always feasible due to privacy restrictions. Recent wireless network architectures have started to shift to a more open and flexible design. 5G networks, as well as the O-RAN Alliance architecture, support the utilization of artificial intelligence to orchestrate main network functions [11]. In the context of Wi-Fi, a novel project named OpenWiFi [12], released by the Telecom Infra Project, intends to disaggregate the Wi-Fi technology stack by utilizing open source software for the cloud controller and the AP firmware operating system. These paradigm changes allow many applications in the area of ML, and more specifically in Reinforcement Learning (RL), to become reality.
In this paper, we intend to optimize the TP and CCA threshold to improve SR and the overall network KPIs. To do so, we formulate the TP and CCA configuration problem with the objective of maximizing product-based network fairness and minimizing station starvation. We model the SR problem as a distributed multi-agent decision-making problem and use a Multi-Agent Multi-Armed Bandit (MA-MAB) approach to solve it. The contributions of this work, different from the ones found in the literature, can be summarized in the following points:
1) We propose a solution for reducing the inherently huge action space given by the possible combinations of TP and CCA threshold values per AP. We derive our solution via worst-case interference analysis.
2) We analyse the network KPIs obtained by well-known distributed MA-MAB implementations such as ε-greedy, UCB and Thompson sampling when selecting the TP and CCA values in cooperative and non-cooperative settings.
3) We introduce a contextual MA-MAB (MA-CMAB) named Sample Average Uncertainty-Sampling (SAU) in cooperative and non-cooperative settings. SAU-MAB is based on a deep contextual MAB.
4) We propose for the first time, to the best of our knowledge, a deep transfer learning solution to efficiently adapt the TP and CCA parameters in dynamic scenarios.
With these contributions, our simulation results show that the ε-greedy MAB solution improves the throughput by at least 44.4% and provides improvements of 12.2% in terms of fairness and 94.5% in terms of Packet Loss Ratio (PLR) over typical configurations when a reduced set of actions is known. Additionally, we show that the SAU-Coop algorithm improves the throughput by 14.7% and the PLR by 32.5% when compared with non-cooperative approaches using the full set of actions. Moreover, our proposed transfer learning based approach reduces the service drops by at least 60%.
The rest of the paper is organized as follows. Section II presents a summary of recent work that uses Machine Learning to improve SR in Wi-Fi. Section III covers the basics of Multi-Armed Bandits, including deep contextual bandits and deep transfer reinforcement learning. In Section IV, we present our system model together with an analysis to reduce the action space via worst-case interference. Section V presents the proposed schemes, and the results are discussed in Section VI. Finally, Section VII concludes the paper.

II. RELATED WORK
Reinforcement learning-based spatial reuse has been of interest in recent literature. The studies have focused on distributed solutions with no cooperation or on centralized schemes of multi-armed bandits. These studies are summarized below.
In [13], the authors present a comparison among well-known MABs such as ε-greedy, UCB, Exp3 and Thompson sampling in the context of decentralized SR via Dynamic Channel Allocation (DCA) and Transmission Power Control (TPC) in WLANs. The results showed that "selfish learning" in a sequential manner presents better performance than "concurrent learning" among the agents. Additionally, [14] presents a centralized MAB consisting of an optimizer based on a modified Thompson Sampling (TS) algorithm and a sampler based on a Gaussian Mixture (GM) algorithm to improve SR in 802.11ax Wi-Fi. More specifically, the authors propose to deal with the large action space composed of the TP and the Overlapping BSS/Preamble-Detection (OBSS/PD) threshold by utilizing a MAB variant called Infinitely Many-Armed Bandit (IMAB). Furthermore, a distributed solution based on Bayesian optimization of Gaussian processes to improve SR is proposed in [15].
Other solutions that are not related to reinforcement learning can be found in the literature with the aim of improving SR in WLANs. For instance, in [16] the authors propose a distributed algorithm where the APs decide their Transmission Power based on their RSSI. Moreover, in [17] the authors present an algorithm to improve SR by utilizing diverse metrics such as SINR, proximity information, RSSI and BSS color, and compare it with the existing legacy algorithms. The ultimate goal of that algorithm is the selection of the channel state (IDLE or BUSY) at the moment of an incoming frame, given the previous metrics. Finally, the authors in [18] presented a supervised federated learning approach for SR optimization.
In all of the above works, the authors employ either centralized or decentralized schemes with no cooperation to address SR optimization in Wi-Fi. In this work, we propose to address this via a coordination-based MA-MAB. In addition, we tackle some of the issues previously encountered in other works, such as the size of the action space due to the set of possible values of TP and CCA. Finally, to the best of our knowledge, we propose for the first time to address SR adaptation in dynamic environments utilizing deep transfer learning.

III. BACKGROUND
In this section, we present a background on Multi-Armed Bandits, including ε-greedy, Upper Confidence Bound and Thompson sampling bandits, and an introduction to contextual MABs with a focus on a neural network-based contextual bandit. Additionally, we introduce MABs in the multi-agent setting and conclude with a background on deep transfer reinforcement learning.
Multi-Armed Bandits (MABs) are a widely used RL approach that tackles the exploration-exploitation trade-off problem. Their implementation is usually simpler when compared with full RL off-policy or on-policy algorithms. However, simplicity often comes at the cost of obtaining suboptimal solutions [19]. The basic model of MABs corresponds to the stochastic bandit, where the agent has K possible actions to choose from, called arms, and receives a certain reward R as a consequence of pulling the j-th arm over T environment steps. The rewards can be modeled as independent and identically distributed (i.i.d.), adversarial, constrained-adversary or random-process rewards [20]. Of these four models, two are more commonly found in the literature: the i.i.d. and the adversarial models. In the i.i.d. model, each pulled arm's reward is drawn independently from a fixed but unknown distribution $D_j$ with an unknown mean $\mu^*_j$. On the other hand, in the adversarial model each pulled arm's reward is chosen by an adversary, i.e., an entity alien to the agent (such as the environment), and is not necessarily sampled from any distribution [21]. The performance of MABs is measured in terms of the cumulative regret $R_T$, or total expected regret over the T steps, defined as: The ultimate goal of the agent is to minimize $R_T$ over the T steps such that $\lim_{T \to \infty} R_T/T = 0$, which means the agent will identify the action with the highest reward in that limit.
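To make the regret criterion concrete, the following minimal sketch computes a cumulative pseudo-regret for the i.i.d. model. It assumes the common textbook form, the sum over steps of the gap between the best arm's mean and the mean of the arm actually pulled, which may differ in detail from the paper's own definition. The function name and the example means are hypothetical.

```python
import numpy as np

def cumulative_regret(true_means, pulled_arms):
    """Cumulative pseudo-regret after each step (assumed textbook form).

    true_means:  mean reward of every arm (unknown to the agent in practice).
    pulled_arms: index of the arm pulled at each environment step.
    """
    true_means = np.asarray(true_means, dtype=float)
    best_mean = true_means.max()
    # Per-step gap between the best arm and the arm that was pulled.
    gaps = best_mean - true_means[np.asarray(pulled_arms, dtype=int)]
    return np.cumsum(gaps)

# Hypothetical 3-armed example: the agent explores twice, then exploits arm 2.
means = [0.2, 0.5, 0.8]
pulls = [0, 1, 2, 2, 2]
regret = cumulative_regret(means, pulls)
average_regret = regret[-1] / len(pulls)
```

Once the agent settles on the best arm the cumulative regret stops growing, so the average regret `regret[-1] / T` shrinks toward zero as T increases.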
A. ε-greedy, Upper-Confidence-Bound and Thompson Sampling MAB
The ε-greedy MAB is one of the simplest MABs and, as the name suggests, it is based on the ε-greedy policy. In this method, the agent greedily selects the best arm most of the time and, once in a while, with a predefined small probability (ε), it selects a random arm [22]. The UCB MAB tackles some of the disadvantages of the ε-greedy policy when selecting non-greedy arms. Instead of drawing an arm at random, the UCB policy measures how close to optimal the non-greedy arms are likely to be. In addition, it takes into consideration the rewards' uncertainty in the selection process. The selected arm is obtained as $\arg\max_a \left[ Q_t(a) + c \sqrt{\ln t / N_t(a)} \right]$, where $N_t(a)$ corresponds to the number of times that action a via the j-th arm has been chosen and $Q_t(a)$ is the Q-value of action a [22], [23]. Finally, the Thompson Sampling MAB bases its action selection on the Thompson Sampling algorithm, as the name indicates. Thompson sampling, or posterior sampling, is a Bayesian algorithm that constantly constructs and updates the distribution of the observed rewards given a previously selected action. This allows the MAB to select arms based on the probability that the chosen arm is optimal. The parameters of the distribution are updated depending on the selected distribution class [24].
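The three selection rules can be sketched side by side as follows. This is an illustrative sketch, not the implementation evaluated in this work: the function names are hypothetical, the UCB exploration constant `c` is a free parameter, and the Thompson variant assumes Bernoulli rewards with a Beta posterior, which is only one possible choice of distribution class.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def ucb(q_values, counts, t, c=2.0):
    """Pick argmax_a Q_t(a) + c * sqrt(ln t / N_t(a)); untried arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]),
    )

def thompson_bernoulli(successes, failures, rng):
    """Sample each arm's Beta(s + 1, f + 1) posterior and pick the largest draw."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda a: draws[a])

rng = random.Random(0)
greedy_arm = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
ucb_arm = ucb([0.5, 0.5], counts=[10, 1], t=11)
ts_arm = thompson_bernoulli([8, 1], [2, 9], rng)
```

With equal Q-values, UCB prefers the arm that has been tried less often, which is the uncertainty-aware behavior described above.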

B. Deep Contextual Multi-Armed Bandits
Contextual Multi-Armed Bandits (CMABs) are a variant of MABs that, before selecting an arm, observe a series of features commonly named context [19]. Different from the stateless MAB, a CMAB is expected to relate the observed context with the feedback or reward gathered from the environment in T episodes and consequently predict the best arm given the received features [21]. Diverse CMABs have been proposed throughout the literature, such as LinUCB, Neural Bandit, Contextual Thompson Sampling and Active Thompson Sampling [19]. More recently, a deep neural contextual bandit named SAU-Sampling has been presented in [25], where the context is related to the rewards using neural networks. The details of SAU-Sampling will be discussed in the following sections.
C. Multi-Agent Multi-Armed Bandits (MA-MABs)
Multi-agent Multi-Armed Bandits are the multi-agent variant of MABs, in which N agents pull their j-th arm and each m-th agent receives a reward drawn from its distribution $D_{m,j}$ with an unknown mean $\mu^*_{m,j}$ [26]. MA-MABs can be modeled as centralized or distributed. In centralized settings the agents' actions are taken by a centralized controller, whereas in distributed settings each agent independently chooses its own actions. Distributed decision-making settings scale more effectively [27] and naturally deal with a large set of K arms more easily than centralized settings, which suffer from a cardinality explosion of the K arms. Finally, the total regret can be defined as: In this work, we consider two main approaches: distributed non-cooperative and cooperative MA-MABs with adversarial rewards.

D. Deep Transfer Reinforcement Learning
Transfer learning or knowledge transfer techniques improve learning time efficiency by utilizing prior knowledge. Typically, this is done by extracting the knowledge from one or several source tasks and then applying such knowledge to a target task [28]. If the tasks are related in nature and the target task benefits from the knowledge acquired from the source, then it is called inductive transfer learning [29]. This type of learning is not uncommon and is used by the human brain on a daily basis. However, a phenomenon called negative transfer can occur if, after knowledge transfer, the target task performance is negatively affected [30].
In the realm of transfer learning we can find Deep Transfer Learning (DTL). DTL is a subset of transfer learning that studies how to utilize knowledge in deep neural networks. In the context of classification/prediction tasks, a large amount of data is required to properly train the model of interest [31]. In many practical applications where training time is essential to respond to new domains [32], retraining using a large amount of data is not always feasible and is possibly catastrophic in terms of performance. "What to transfer" corresponds to one of the main research topics in transfer learning. Specifically, in the case of deep transfer learning, four categories have been broadly identified: instance-based transfer, where data instances from a source task are utilized; mapping-based transfer, where a mapping of two tasks is used on a new target task; network-based transfer, where the pre-trained network model is transferred to the target task; and adversarial-based transfer, where an adversarial model is employed to find which features from diverse source tasks can be transferred to the target task [33].
In this work, we utilize the DTL form called network-based transfer learning to efficiently adapt the TP and CCA parameters in dynamic scenarios. An example of a network-based transfer learning technique is presented in Fig. 2. Such a technique is utilized in deep transfer reinforcement learning (DTRL) as part of a transfer learning type called policy transfer [34]. In particular, policy transfer takes a set of source policies $\pi_{S_1}, \ldots, \pi_{S_K}$ that are trained on a set of source tasks and uses them in a target policy $\pi_T$ in a way that is able to leverage the former knowledge from the source policies to learn its own. More specifically, the weights and biases that comprise each of the hidden layers of the source policies are the elements transferred to the target policies. Note that in practice policies are modeled as neural networks. In this paper, we take advantage of the design of the contextual multi-armed bandit presented in [25] and apply policy transfer to improve the agent's SR adaptability in dynamic environments. The results and observations of applying DTRL are discussed in section VI-E. In the next section, we discuss the details of the system model and present an analysis on reducing the cardinality of the action space in the proposed SR problem formulation.

TABLE I: Notations
$P^m_{tx}$, $P^r_{rx}$: Transmission power at the m-th transmitter (AP) and received signal strength at the r-th receiver.
$d_{m,r}$, $\theta$: Distance between the m-th transmitter and the r-th receiver, and path loss exponent.
$F^+_m$, $F^-_m$: Subset of interferers and non-AP interferers.
$\gamma_{m,r}$, $C_{m,r}$, $C_T$: Worst-case SINR and Shannon's maximum capacity of the m-th transmitter and r-th receiver, and cumulative maximum network capacity.

IV. SYSTEM MODEL AND PROBLEM FORMULATION
In this work, we consider an infrastructure mode Wi-Fi 802.11ax network $\mathcal{N}$ with $N = |\mathcal{S}| + |\mathcal{M}|$ nodes, where $\mathcal{S}$ is the set of stations with positions $\{x_1, x_2, \ldots, x_{|\mathcal{S}|}\} \in \mathbb{R}^2$ and $\mathcal{M}$ is the set of APs with positions $\{c_1, c_2, \ldots, c_{|\mathcal{M}|}\} \in \mathbb{R}^2$. We assume that the $|\mathcal{M}|$ AP positions correspond to cluster centers and that the stations attach to their closest AP. In addition, the list of notations utilized in this work can be found in Table I.
In this paper, we improve SR via maximization of the linear product-based fairness and minimization of the number of stations under starvation by configuring TP and CCA parameters.

[Equation (3): Max over the fairness term and the avg. station starvation complement term (3a), subject to the transmission power and CCA threshold selection (3c).]
Let's define the probability of an STA being idle in a BSS as: where $\varphi^m_s \in [0, 1]$ is the probability of an STA transmitting to the m-th AP. In addition, we define the probability with which an STA will successfully transmit a packet as: where $\xi_{CCA}(\cdot) = 1$ if the sensed signal of a packet sent by the s-th STA is below the CCA threshold ($P_{cs}$), and zero otherwise.
Here, $\xi_{ED}(\cdot) = 1$ if the sensed signal of a packet sent by the s-th STA is below the Energy Detection (ED) threshold ($P_{ed}$), and zero otherwise. Additionally, we consider $P_{cs} = P_{ed}$ to simplify our analysis. As indicated by [35], the expected length of the general time slot $E(T_g)$ and the expected information transmitted by the s-th STA to the m-th AP $E(I_g)$ can be expressed as: where $D^m_s$ corresponds to the link data rate, $T_{EDCA}$ corresponds to the time required for a successful Enhanced Distributed Channel Access (EDCA) transmission, $T_{TXOP}$ is the transmission duration and $\delta$ is the duration of an idle time slot. The link data rate depends adaptively on the SNR [36] and is mapped based on SNR/BER curves [37]. The received SNR can be defined as $P^m_{tx} g^m_s / \sigma^2$, where $P^m_{tx}$ is the transmission power, $g^m_s$ the channel power gain and $\sigma^2$ the noise power.
Finally, the throughput of the s-th station attached to the m-th AP can be defined as: Additionally, let's define the average linear product-based network fairness and the average station starvation in a distributed setting: where $R^s_{m,A}$ is the achievable throughput of the s-th station attached to the m-th AP. Additionally, $\xi_{STA} = 1$ if the s-th station's throughput is greater than a fraction $\omega \in (0, 1]$ of the achievable throughput; otherwise it is zero, in which case the station is considered to be in starvation. The considered problem is a multi-objective problem and can be addressed with the weighted sum approach. Thus, in each time step, the problem can be formulated as follows: Problem 1: Due to the dynamic nature of the scenario, the transmission probabilities of the STAs $\varphi^m_s$ are not directly controllable and require an additional step to map them to EDCA parameters [35]. Instead, we simplify our analysis by utilizing a network simulator to model such dynamics and propose to solve the previous linear programming (LP) problem using an MA-MAB solution as described in section V.
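The starvation indicator described above can be illustrated with a short sketch. The function name, the throughput values and the choice ω = 0.5 are hypothetical; the only rule taken from the text is that a station is not starving when its throughput is strictly greater than ω times its achievable throughput.

```python
def starving_stations(throughputs, achievable, omega=0.5):
    """Return the indices of stations considered to be in starvation.

    A station s is NOT starving (indicator equal to 1) only if its throughput
    is strictly greater than omega times its achievable throughput.
    """
    if not 0.0 < omega <= 1.0:
        raise ValueError("omega must lie in (0, 1]")
    return [
        s
        for s, (rate, rate_max) in enumerate(zip(throughputs, achievable))
        if rate <= omega * rate_max
    ]

# Hypothetical per-station throughputs and achievable throughputs (Mbps).
measured = [10.0, 2.0, 5.0]
attainable = [12.0, 10.0, 10.0]
starving = starving_stations(measured, attainable, omega=0.5)
starvation_count = len(starving)
```

The count of starving stations per AP is the quantity that later enters both the reward and the context of each agent.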

A. Optimal action set via worst-case interference
Typical Wi-Fi scenarios consist of APs and stations distributed non-uniformly. Contrary to the analysis presented in [38], we aim at obtaining an optimal subset of TP and CCA threshold values to further reduce the action space size in SR problems. In this analysis, we only consider the Carrier Sense (CS) threshold term as the form of the CCA threshold.
First, let's consider the worst-case interference scenario in an N > 2 arrangement. For the sake of simplicity we use the path-loss radio propagation model: $P^r_{rx} = P^m_{tx} / d^{\theta}_{m,r}$, where $P^m_{tx}$ and $P^r_{rx}$ are the TP at the m-th transmitter (AP) and the received signal strength at the r-th receiver, respectively. In addition, $d_{m,r}$ is the distance between the transmitter and the receiver. Finally, $\theta \in [2, 4]$ corresponds to the path loss exponent. Thus, from the perspective of the m-th AP, the worst-case interference $I_m$ is defined as: where $P^v_{tx}$ is the TP of the v-th interferer and $P^{sta}_{tx}$ is a constant corresponding to the fixed power assigned to all the stations, based on the fact that stations are typically not capable of modifying their TP. Additionally, $X_{(m,v)}$ and $X_{(m,w)}$ correspond to the distance from the m-th AP to the v-th AP interferer and from the m-th AP to the w-th station interferer, respectively. $X_{(m,\cdot)}$ is calculated as follows: where $(\cdot)$ refers either to the AP or to the non-AP interferer, $D_m$ is the CCA threshold range of the m-th AP, $\varsigma_{r,\cdot}$ is the distance between the receiver and the interferer $(\cdot)$, and $x_{m,\cdot}$ corresponds to the distance between any $(\cdot)$ interferer and $D_m$.
The corresponding worst-case SINR $\gamma_{m,r}$ at the receiver is defined as: Let's assume that $N_0 \ll I_m$; thus the equation is reduced to: Substituting equations (15) and (16) in (18) we obtain equation (19): The aforementioned equation describes $\gamma_{m,r}$ as a function of $D_m$ and $d_{m,r}$. Additionally, we substitute $D_m = (P^m_{tx} / T^m_{cs})^{1/\theta}$ in equation (19), obtaining: where, Now, we proceed to define the maximum channel capacity in terms of the TP and the Carrier Sense (CS) threshold ($T_{cs}$). Given a certain value of SINR, the Shannon maximum capacity is expressed as: $C_{m,r} = W \log_2(1 + \gamma_{m,r})$, where W is the channel bandwidth in Hz. Then, the cumulative maximum network capacity can be calculated as: Figure 3 shows the network maximum capacity as a function of the TP and the CS threshold. As observed, the network capacity achieves its highest values when a combination of high TP and low CS threshold is utilized. Note that prior knowledge of the locations is required.
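Two of the relations used in this analysis can be checked numerically with the short sketch below: the CCA threshold range $D_m = (P^m_{tx}/T^m_{cs})^{1/\theta}$ and the Shannon capacity for a given SINR. The helper names and the numeric values are hypothetical, powers are converted from dBm to mW before taking the ratio, and, because the simplified path-loss model carries no reference constant, the resulting range is in normalized distance units rather than meters.

```python
import math

def dbm_to_mw(power_dbm):
    """Convert a power level from dBm to milliwatts."""
    return 10.0 ** (power_dbm / 10.0)

def carrier_sense_range(p_tx_dbm, t_cs_dbm, theta):
    """D_m = (P_tx / T_cs)^(1/theta), with both powers in linear units."""
    return (dbm_to_mw(p_tx_dbm) / dbm_to_mw(t_cs_dbm)) ** (1.0 / theta)

def shannon_capacity(bandwidth_hz, sinr_linear):
    """Shannon maximum capacity W * log2(1 + SINR) in bit/s."""
    return bandwidth_hz * math.log2(1.0 + sinr_linear)

# Hypothetical settings: a high TP with a low CS threshold versus the opposite.
wide_range = carrier_sense_range(p_tx_dbm=20.0, t_cs_dbm=-82.0, theta=3.0)
narrow_range = carrier_sense_range(p_tx_dbm=15.0, t_cs_dbm=-62.0, theta=3.0)
capacity_at_0_db = shannon_capacity(bandwidth_hz=80e6, sinr_linear=1.0)
```

Raising the TP or lowering the CS threshold both enlarge $D_m$, which is the direction in which the text reports the highest network capacity.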

V. ALGORITHMS
In this section, we present the action space, context definition and reward function for the MA-MAB algorithms utilized in this work.

A. Action space
The action space corresponds to the number of combinations of $P_{cs}$ and $P_{tx}$, which in the context of MABs translates to the number of arms for each MAB agent. The action space is defined as: where $P^{min}_{cs}$, $P^{max}_{cs}$ and $P^{min}_{tx}$, $P^{max}_{tx}$ are the minimum and maximum values of the CCA threshold and the TP, respectively. $L_{cs}$ and $L_{tx}$ correspond to the number of levels into which the CCA threshold and TP values are discretized, respectively. Finally, the number of arms corresponding to the action space for the m-th agent is $K = |A_{cs}| \cdot |A_{tx}|$.
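A discretized action space of this kind can be enumerated as in the following sketch. The numeric ranges are assumptions chosen for illustration (a CCA threshold between −82 and −62 dBm and a TP between 1 and 21 dBm), as is the uniform spacing of the levels; only the structure, a Cartesian product of $L_{cs}$ CCA levels and $L_{tx}$ TP levels, follows the text.

```python
import itertools
import numpy as np

def build_action_space(p_cs_min, p_cs_max, l_cs, p_tx_min, p_tx_max, l_tx):
    """Enumerate every (CCA threshold, TP) pair over uniformly spaced levels."""
    cs_levels = np.linspace(p_cs_min, p_cs_max, l_cs)
    tx_levels = np.linspace(p_tx_min, p_tx_max, l_tx)
    return [
        (float(cs), float(tx))
        for cs, tx in itertools.product(cs_levels, tx_levels)
    ]

# Hypothetical ranges in dBm with 21 levels per parameter.
actions = build_action_space(-82.0, -62.0, 21, 1.0, 21.0, 21)
num_arms = len(actions)  # K = |A_cs| * |A_tx|
```

With 21 levels per parameter each agent faces 441 arms, which illustrates why a reduced action set shortens the exploration stage.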

B. Reward function in distributed non-cooperative settings
The reward is defined following optimization Problem 1. The reward resembles the one presented in [14], which includes a linear product-based fairness term and a station starvation term [14], [17], but is defined in a distributed manner. A station is considered to be in starvation when its performance is below a predefined percentage of its theoretical achievable throughput. The reward is defined as: where $\Psi^{AP}_m$ is the set of starving stations attached to the m-th AP and $N^{AP}_m$ is the set of stations attached to the m-th AP. We can also observe that $r^{AP}_m \propto C_{m,r}$ as defined in Eq. 21. In the next subsection, we present the definition of the context considered in our MA-CMAB solution.

C. Distributed Sample Average Uncertainty-Sampling MA-CMAB
In [25], the authors present an efficient contextual multi-armed bandit based on a "frequentist approach" to compute the uncertainty instead of using Bayesian solutions such as Thompson Sampling. The frequentist approach consists of measuring the uncertainty of the action-values based on the sample average rewards just computed, instead of relying on the posterior distribution given the past rewards. In this work, we present multi-agent cooperative and non-cooperative variants of the previously mentioned RL algorithm.
In our problem, the context comprises only the APs' local observations:
1) Number of starving stations, $|\Psi^{AP}_m|$, i.e., the stations of the m-th AP that are under the ω fraction of their attainable throughput during episode t.
2) Average RSSI, $S^{AP}_m$, of the m-th AP during episode t.
3) Average noise, $\Upsilon^{AP}_m$, of the m-th AP during episode t.
Additionally, the context is normalized as follows:
The multi-agent SAU-Sampling algorithm in its non-cooperative version (SAU-NonCoop) is described in Algorithm 1. The algorithm starts by initializing the action-value functions $\mu(x_m \mid \theta_m)$ as deep neural networks and the exploration parameters $J^2_{m,a}$ and $n_{m,a}$ for each m-th AP. $n_{m,a}$ corresponds to the number of times action a was selected in the m-th AP and $J^2_{m,a}$ is defined as an exploration bonus. In each environment step (Algorithm 1, line 2), each agent observes its local context and computes the selected arm given the reward prediction. In (Algorithm 1, line 11) each CMAB agent updates $\theta_{m,a}$ using stochastic gradient descent on the loss between the predicted reward and the actually observed reward. Finally, the exploration parameters are updated according to the prediction error, as depicted in (Algorithm 1, line 12).
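The loop described above can be sketched for a single agent as follows. This is a hedged illustration rather than Algorithm 1 itself: a linear model per arm stands in for the deep neural network, the sampling rule (a Gaussian around the predicted reward with variance $J^2_a / n_a^2$) is our reading of SAU-Sampling in [25] and should be treated as an assumption, and the class name, learning rate and toy environment are hypothetical.

```python
import numpy as np

class SauAgentSketch:
    """Single-agent SAU-style contextual bandit with a linear reward model."""

    def __init__(self, n_arms, context_dim, learning_rate=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        self.learning_rate = learning_rate
        self.weights = np.zeros((n_arms, context_dim))  # stand-in for theta_m
        self.j2 = np.ones(n_arms)  # exploration bonus J^2_{m,a}
        self.n = np.ones(n_arms)   # selection counts n_{m,a}

    def select(self, context):
        """Predict rewards, perturb them by the SAU uncertainty, pick the argmax."""
        predicted = self.weights @ context
        tau2 = self.j2 / self.n  # sample-average uncertainty
        sampled = self.rng.normal(predicted, np.sqrt(tau2 / self.n))
        return int(np.argmax(sampled)), predicted

    def update(self, context, arm, reward, predicted):
        """SGD step on the squared prediction error, then update J^2 and n."""
        error = reward - predicted[arm]
        self.weights[arm] += self.learning_rate * error * context
        self.j2[arm] += error ** 2
        self.n[arm] += 1

# Toy environment (hypothetical): arm 1 always pays 1, arm 0 always pays 0.
agent = SauAgentSketch(n_arms=2, context_dim=2)
context = np.array([1.0, 0.5])  # already normalized local observation
for _ in range(300):
    arm, predicted = agent.select(context)
    agent.update(context, arm, float(arm == 1), predicted)
```

In the multi-agent non-cooperative setting, each AP would run one such agent on its own local context and reward.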

D. Cooperative Sample Average Uncertainty-Sampling MA-CMAB
In this section we present a cooperative version of SAU-Sampling named SAU-Coop. Different from the non-cooperative version, the total reward $r^C_m$ considers the network Jain's fairness index in addition to the local reward $r^{AP}_m$ as: where $r_J$, the overall network Jain's fairness index, is defined as: $r_J = \left( \sum_{m=1}^{|\mathcal{M}|} R_m \right)^2 / \left( |\mathcal{M}| \sum_{m=1}^{|\mathcal{M}|} R_m^2 \right)$, where $R_m = \sum_{s=1}^{|S_m|} R^m_s$ is the total throughput of all the $|S_m|$ stations of the m-th AP.
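As a worked example, Jain's fairness index over the per-AP total throughputs can be computed as in the sketch below. It uses the standard definition of the index; the function name and the throughput values are hypothetical, and how the index is weighted against the local reward is not shown here.

```python
def jain_fairness(per_ap_throughput):
    """Jain's index: (sum R_m)^2 / (M * sum R_m^2), between 1/M and 1."""
    total = sum(per_ap_throughput)
    sum_of_squares = sum(rate * rate for rate in per_ap_throughput)
    if sum_of_squares == 0:
        return 0.0  # convention chosen here for an all-zero network
    return total ** 2 / (len(per_ap_throughput) * sum_of_squares)

# Hypothetical total throughput per AP (Mbps).
balanced = jain_fairness([5.0, 5.0, 5.0, 5.0])
one_ap_only = jain_fairness([10.0, 0.0, 0.0, 0.0])
```

The index equals 1 when all APs obtain the same throughput and drops to 1/M when a single AP captures everything, so adding it to the reward penalizes configurations that favor one AP.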

E. Reward-cooperative ε-greedy MA-MAB
In addition to the previous cooperative algorithm, we propose a cooperative approach based on the classical ε-greedy strategy [22] that takes into account, in the action's reward update, a percentage of the average reward of the other agents. This algorithm is described in Algorithm 10.
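One possible reading of this reward-cooperative update is sketched below. The additive combination and the `share` parameter are assumptions made for illustration; the text only states that a percentage of the other agents' average reward enters the action's reward update. The Q-value update itself is the usual incremental sample average.

```python
def cooperative_reward(own_reward, other_rewards, share=0.2):
    """Own reward plus a fraction of the other agents' average reward."""
    if not other_rewards:
        return own_reward
    return own_reward + share * sum(other_rewards) / len(other_rewards)

def update_q(q_values, counts, arm, reward):
    """Incremental sample-average update of the pulled arm's Q-value."""
    counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / counts[arm]

# Hypothetical step for one agent with two neighbouring agents.
q_values, counts = [0.0, 0.0], [0, 0]
shared = cooperative_reward(own_reward=1.0, other_rewards=[0.5, 1.5], share=0.2)
update_q(q_values, counts, arm=0, reward=shared)
update_q(q_values, counts, arm=0, reward=0.8)
```

Under this reading, an action that helps the agent while hurting its neighbours accumulates a lower Q-value than it would in the purely selfish case.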

F. Sample Average Uncertainty-Sampling MA-CMAB based Deep Transfer Reinforcement Learning
Typically, RL agents learn their best policy based on the feedback received from the environment over a time horizon T. However, in real-world scenarios the environment conditions can change at T + 1 and thus, adapting to the updated environment is necessary [39]. In such cases, the "outdated" agent's policy might not be optimal to address the new conditions efficiently. For instance, a modification of the stations' distribution over the APs can cause the SR-related parameters chosen by the "outdated" agents' policy to degrade the network performance.
To address the previous situation we propose two main solutions. If the agent detects a change in the environment indicated by a singularity, it will decide to correct its configuration by either 1) forgetting the policy already learnt (forget) or 2) adapting the agent's policy to the new conditions via a transfer learning technique. A singularity is defined as an anomalous behavior of the KPIs of interest after the policy of the MAB agent has converged. In this work, we do not delve into how to detect a singularity; instead, we assume the existence of an anomaly detector in our system [40]. In Algorithm 3, we present the transfer learning algorithm depicting the second proposed solution. At t = 0, each SAU-Sampling agent resets its weights and biases and starts learning as part of Algorithm 1. At $t = S_1$, where $S_1$ corresponds to the time when an anomaly is detected, the transfer procedure is activated (Algorithm 3, line 7). In our setup we transfer l = 2 and reset l = 1 (Algorithm 3, line 11), where l corresponds to the layer of the neural network utilized in the SAU-Sampling agent. However, as indicated in (Algorithm 3, line 13), the transfer is not constrained to one layer but applies more generally to a set of layers. The set of transferred layers is considered a hyperparameter to be tuned. The partial transfer of a model avoids negative transfer by giving the agent room to adapt to the new context, since it mitigates model overfitting.
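The partial transfer step (keep layer l = 2, reset layer l = 1) can be sketched with plain arrays as follows. This is an illustration of network-based transfer under stated assumptions and is not Algorithm 3: layers are numbered from 1 as in the text, the helper names, layer sizes and initialization scale are hypothetical, and anomaly detection is assumed to happen elsewhere.

```python
import numpy as np

def init_layer(n_inputs, n_outputs, rng):
    """Freshly initialized dense layer (small random weights, zero biases)."""
    return {
        "W": rng.normal(0.0, 0.1, size=(n_inputs, n_outputs)),
        "b": np.zeros(n_outputs),
    }

def transfer_policy(source_layers, transferred_ids, rng):
    """Copy the layers listed in transferred_ids and re-initialize the rest."""
    target_layers = []
    for layer_id, layer in enumerate(source_layers, start=1):
        if layer_id in transferred_ids:
            # Transferred layer: weights and biases are copied from the source.
            target_layers.append({"W": layer["W"].copy(), "b": layer["b"].copy()})
        else:
            # Reset layer: same shape, new random initialization.
            n_inputs, n_outputs = layer["W"].shape
            target_layers.append(init_layer(n_inputs, n_outputs, rng))
    return target_layers

rng = np.random.default_rng(0)
# Hypothetical two-layer policy: 3 context features, 8 hidden units, 4 arms.
source_policy = [init_layer(3, 8, rng), init_layer(8, 4, rng)]
target_policy = transfer_policy(source_policy, transferred_ids={2}, rng=rng)
```

Passing a different `transferred_ids` set corresponds to tuning which layers are transferred, which the text treats as a hyperparameter.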

VI. PERFORMANCE EVALUATION
A. Simulation Setting
We consider two scenarios in our simulations. The first one considers stationary users, while the second considers mobile users to model dynamic scenarios (see section VI-E). In addition, stations and APs are two-antenna devices supporting up to two spatial streams in transmission and reception. In this work, we assume a frequency of 5 GHz with an 80 MHz channel bandwidth in a Line of Sight (LOS) setting. The propagation loss model is the Log Distance propagation loss model with a constant speed propagation delay. In addition, an adaptive data rate mode is considered with UDP downlink traffic. We implement our proposed solutions using ns-3, and we use OpenAI Gym to interface between ns-3 and the MA-MAB solution [41]. In Table II

B. Reduced set of actions vs. all actions
In subsection IV-A we presented a mathematical analysis to obtain a reduced set of optimal actions, with the goal of decreasing exploration time and consequently improving convergence time. As concluded from figure 3, high TP and low CCA threshold values maximize the network capacity in the simulation scenario under study. Therefore, we selected a fixed CCA threshold ($P_{cs} = -82.0$ dBm) and a reduced set of TP values, $P_{tx} \in \{15, 16, 17, 18, 19, 20, 21\}$ dBm, and compared the performance against the full set of possible actions described in V-A. In figure 4, we present the convergence performance of three MA-MAB algorithms under UDP traffic of 0.056 Gbps in non-cooperative and cooperative settings (indicated with the subscripts "non-coop" and "coop", respectively). The algorithms correspond to the ε-greedy ($MAB_{eg}$), UCB ($MAB_{ucb}$) and Thompson Sampling ($MAB_{thom}$) MA-MABs. For each algorithm, we plotted three convergence graphs in terms of fairness, cumulative throughput and station starvation, showing the behavior when the reduced set of actions and the full action set (indicated with the subscript "all") are used. With the set of optimal actions, the performance of the three algorithms is similar, with a slight improvement when utilizing MAB Thompson Sampling. On the other hand, with the full action set, the MAB ε-greedy algorithm shows a noticeable improvement with respect to the others. In [43], the authors study the unreasonable behavior of greedy algorithms when K is sufficiently large. They concluded that when K increases above 27 arms, intelligent algorithms are greatly affected by the exploration stage. Those results are consistent with ours, given that $K = |A_{cs}| \cdot |A_{tx}| = 21^2 = 441$. Finally, the impact of utilizing the reduced set of optimal actions can be noted in terms of convergence time and KPI maximization. The set of optimal actions reduces station starvation by an average of two starving users when compared with the best performer on the full action set, $MAB_{eg}$ with subscripts "non-coop" and "all". However, obtaining such a set requires prior knowledge of the geographical locations of the stations and APs. In the following section we compare the results of the ε-greedy MA-MAB with a typical default configuration without machine learning.
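To make the effect of the action-space size concrete, the following Python sketch builds both action sets and runs a stateless ε-greedy agent over each. It is an illustration under stated assumptions, not the simulation setup used in this paper: the full grids (21 CCA thresholds from −82 to −62 dBm and 21 TP levels from 1 to 21 dBm) are assumed so that K = 441, and `toy_reward` is a hypothetical stand-in for the network simulator that simply favors high TP and low CCA threshold values, as suggested by figure 3.

```python
import random

# Assumed full grids (the actual sets are defined in Section V-A):
# 21 CCA thresholds in [-82, -62] dBm and 21 TP levels in [1, 21] dBm.
CCA_FULL = [-82.0 + i for i in range(21)]
TP_FULL = [1.0 + i for i in range(21)]
FULL_ACTIONS = [(cca, tp) for cca in CCA_FULL for tp in TP_FULL]

# Reduced set taken from the text: CCA fixed at -82 dBm, TP in {15, ..., 21} dBm.
REDUCED_ACTIONS = [(-82.0, float(tp)) for tp in range(15, 22)]


class EpsilonGreedyAgent:
    """Stateless epsilon-greedy bandit over a finite list of (CCA, TP) arms."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.actions)
        self.values = [0.0] * len(self.actions)  # sample-average reward per arm

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.actions))
        return max(range(len(self.actions)), key=lambda k: self.values[k])

    def update(self, arm, reward):
        # Incremental sample-average update of the chosen arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def toy_reward(action, rng):
    """Hypothetical reward: higher TP and lower CCA score higher, plus noise."""
    cca, tp = action
    score = 0.5 * (tp - 1.0) / 20.0 + 0.5 * (-62.0 - cca) / 20.0
    return score + rng.gauss(0.0, 0.05)


def run(actions, steps=500, epsilon=0.1, seed=0):
    """Run one agent and return (agent, number of distinct arms tried)."""
    agent = EpsilonGreedyAgent(actions, epsilon, seed)
    env_rng = random.Random(seed + 1)
    for _ in range(steps):
        arm = agent.select()
        agent.update(arm, toy_reward(agent.actions[arm], env_rng))
    tried = sum(1 for c in agent.counts if c > 0)
    return agent, tried
```

Calling `run(FULL_ACTIONS)` and `run(REDUCED_ACTIONS)` with the same number of steps shows the trade-off: with 441 arms, 500 steps allow on average barely more than one sample per arm, whereas seven arms can each be sampled many times.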

C. Distributed ε-greedy MA-MAB vs. default configuration performance results
In this subsection, we present the comparative results and advantages of utilizing a distributed intelligent solution such as MAB ε-greedy over the default CCA threshold and TP configuration with no ML. In figure 5, we show the performance under four different UDP data traffic regimes: {0.011, 0.056, 0.11, 0.16} Gbps. We considered two typical CCA threshold configurations: −82.0 dBm and −62.0 dBm. In both cases, the AP's TP is 16.0 dBm. MAB ε-greedy achieves a significant improvement over the default configuration with $P_{cs} = -82.0$ dBm, with an average gain over all the considered traffic regimes of 44.4% in terms of cumulative throughput, 70.9% in terms of station starvation, 12.2% in terms of fairness, 138.0% in terms of latency and 94.5% in terms of packet loss ratio (PLR). Similarly, over the default configuration with $P_{cs} = -62.0$ dBm, the average gain over all the considered traffic regimes is 53.9% in terms of cumulative throughput, 138.4% in terms of station starvation, 43.0% in terms of fairness, 84.0% in terms of latency and 105.4% in terms of PLR.

Footnote 2: We assume all APs are configured to use 1 channel out of the available 11. This is a practical selection to create dense deployment scenarios.

D. Cooperation vs. non-cooperation performance results
In the two previous subsections we have shown the results obtained with the set of optimal actions. In this subsection we assume that no location information is available for the stations and APs, and thus we must rely on the full set of actions. Consequently, we investigate whether cooperation can improve the KPIs of interest by utilizing the cooperative proposal of the MAB ε-greedy algorithm (Rew-Coop) and the contextual SAU-Sampling algorithm (SAU-Coop). Additionally, we present two non-cooperative algorithms: SAU-NonCoop, which corresponds to the non-cooperative version of SAU-Sampling, and Eg-NonCoop, which refers to the MAB ε-greedy algorithm utilized in the previous section. As observed in figure 6, simulations show that SAU-Coop improves on Eg-NonCoop over all the data traffic regimes, with an average of 14.7% in terms of cumulative throughput, 21.3% in terms of station starvation, 4.64% in terms of network fairness, 36.7% in terms of latency and 32.5% in terms of PLR. Similarly, the distributed version of SAU-Sampling performs better than Eg-NonCoop, indicating that context is beneficial for solving the current optimization problem. Additionally, SAU-Coop performs better than its non-cooperative version, especially when the data rate increases up to 0.16 Gbps, where a gain of 14.1% in terms of cumulative throughput, 32.1% in terms of station starvation, 18.2% in terms of network fairness, 16.5% in terms of latency and 4% in terms of PLR is observed. To sum up, cooperative approaches contribute positively to the improvement of SR in Wi-Fi over non-cooperative approaches. In addition, in cases where cooperation is not possible, it is advisable to utilize contextual multi-armed bandits over stateless multi-armed bandits.
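The intuition behind the cooperative gain can be illustrated with a small Python sketch. The reward definitions used by Rew-Coop and SAU-Coop are not reproduced in this section, so the blending scheme below (each AP's own reward mixed with the network-wide mean through a hypothetical `weight` parameter) is only an assumed example of reward sharing, and the per-AP reward values are made up for illustration.

```python
def non_cooperative_rewards(local_rewards):
    """Each AP learns from its own reward only."""
    return list(local_rewards)


def cooperative_rewards(local_rewards, weight=0.5):
    """Blend each AP's own reward with the network-wide mean (assumed scheme)."""
    mean = sum(local_rewards) / len(local_rewards)
    return [(1.0 - weight) * r + weight * mean for r in local_rewards]


# Hypothetical per-AP rewards for two joint configurations.
# "aggressive": AP 0 gains at the expense of its two neighbours.
# "balanced": all three APs obtain the same moderate reward.
aggressive = [0.8, 0.1, 0.1]
balanced = [0.6, 0.6, 0.6]

# Reward seen by AP 0 under each scheme.
selfish_view = (non_cooperative_rewards(aggressive)[0],
                non_cooperative_rewards(balanced)[0])
shared_view = (cooperative_rewards(aggressive)[0],
               cooperative_rewards(balanced)[0])
```

With purely local rewards, AP 0 prefers the aggressive configuration, since it only sees its own gain. Once the neighbours' rewards are blended in, the balanced configuration scores higher for AP 0, which is the kind of behavior a cooperative scheme is meant to encourage.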

E. Deep Transfer Learning in Adaptive SR in Dynamic scenarios results
In order to model a dynamic scenario, we design a simulation where the users move across the simulation area and attach to the AP that offers the best signal quality. Consequently, the user load in each AP changes, and with it the dynamics of the environment. We model this scenario with 3 APs and 15 users, where the load changes twice throughout the simulation. As depicted in table IV, the user load of the m-th AP, denoted as $C_m$, changes at two instants: 3 and 6 minutes. In figure 7 we present the network behavior in terms of fairness and station starvation under the scenario depicted in Table IV. In addition to the two methods previously mentioned, forget and transfer, we present the performance of a third approach called full transfer, where the whole model is transferred. During the first interval (0 to 3 min) the performance of the three methods is similar, as expected. However, after the two changes in the network load, two singularities are visible in both the fairness and the starvation graphs. More specifically, the forget method experiences the worst behavior, with a 54.3% and 11.7% decrease when compared with the transfer method in terms of station starvation and fairness, respectively. The forget method shows peaks at the moment of the singularities, in which 60% of the users experience a service drop; this behavior is inherent to the agents restarting the learning process and cannot be avoided. From the quality of service perspective, a disturbance such as the one observed is highly undesirable. Meanwhile, the full transfer method underperforms the transfer method, with an 18.7% and 6% decrease in the previously mentioned KPIs. Interestingly, in the second interval under study (3 to 6 min), the forget method is able to outperform the full transfer method at the end of the period. This is due to negative transfer as a result of transferring the whole model. As observed, partial transfer learning not only considerably reduces the performance peaks of the forget method but also achieves better adaptation than the full transfer method. In all methods, the cumulative throughput is similar; however, as observed in figure 7, station starvation and, consequently, fairness are affected.
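The three adaptation strategies compared above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation evaluated in this paper: the z-score rule in `detect_singularity` is a hypothetical detector, the model is represented as a plain list of layers, the choice of which layers to transfer is illustrative, and resetting the exploration statistics in every strategy is an assumption.

```python
import random
import statistics


def detect_singularity(kpi_history, window=10, z=3.0):
    """Hypothetical detector: flag the latest KPI sample if it deviates from
    the mean of the preceding window by more than z standard deviations."""
    if len(kpi_history) < window + 1:
        return False
    past = kpi_history[-(window + 1):-1]
    mean = statistics.mean(past)
    std = statistics.pstdev(past)
    if std == 0.0:
        return kpi_history[-1] != mean
    return abs(kpi_history[-1] - mean) > z * std


def init_layer(n_in, n_out, rng):
    """Fresh layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return {"w": weights, "b": [0.0] * n_out}


def init_model(sizes, rng):
    """Build one layer per consecutive pair of sizes."""
    return [init_layer(sizes[i], sizes[i + 1], rng)
            for i in range(len(sizes) - 1)]


def adapt(model, explore, sizes, strategy, transfer_layers, rng):
    """Return (model, exploration stats) after a detected singularity.

    forget:   reinitialize every layer
    full:     keep every layer (full transfer)
    transfer: keep only the layers in transfer_layers, reinitialize the rest
    """
    if strategy == "forget":
        keep = set()
    elif strategy == "full":
        keep = set(range(len(model)))
    elif strategy == "transfer":
        keep = set(transfer_layers)
    else:
        raise ValueError("unknown strategy: " + strategy)
    new_model = [model[l] if l in keep
                 else init_layer(sizes[l], sizes[l + 1], rng)
                 for l in range(len(model))]
    # Assumption: per-arm exploration statistics are reset in every strategy.
    new_explore = {"n": [0] * len(explore["n"]),
                   "s2": [0.0] * len(explore["s2"])}
    return new_model, new_explore
```

As a usage example, a network with 3 inputs, two hidden layers of 100 neurons and K outputs would use `sizes = [3, 100, 100, K]`; the number of hidden layers is an assumption here. Calling `adapt(..., strategy="transfer", transfer_layers={0, 1})` then keeps the two hidden layers and reinitializes only the output layer.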

VII. CONCLUSION
In this paper, we proposed Machine Learning (ML)-based solutions to the Spatial Reuse (SR) problem in distributed Wi-Fi 802.11ax scenarios. We presented a solution to reduce the huge action space resulting from the possible Transmission Power (TP) and Clear-Channel-Assessment (CCA) threshold values per Access Point (AP), and analysed its impact on diverse well-known distributed Multi-Agent Multi-Armed Bandit (MA-MAB) implementations. In distributed scenarios, we showed that the ε-greedy MA-MAB significantly improves performance over typical configurations when the optimal actions are known. Moreover, the Multi-Agent Contextual Multi-Armed Bandit (MA-CMAB) named SAU-Sampling, in its cooperative setting, contributes positively to an increase in throughput and fairness and a reduction of PLR when compared with non-cooperative approaches. Under dynamic scenarios, transfer learning helps the SAU-Sampling algorithm overcome the service drops that affect at least 60% of the users when the forget method is utilized. Additionally, we found that partial transfer learning offers better results than the full transfer method. To conclude, the cooperative version of the MA-CMAB is preferable for improving SR in Wi-Fi scenarios, since it outperforms the other presented ML-based solutions and prevents service drops in dynamic environments via transfer learning.

Fig. 1: Typical operational scenario: APs adjust their Transmission Power and CCA threshold towards an efficient spatial reuse.

Fig. 2: Network-based transfer learning: the neural network source task's hidden layers are reutilized in the target network.

the subset of interferers $|F_m^{+}| = |M| - 1$, corresponding to APs interfering with the m-th AP and $F_m^{-}$ the subset of non-AP interferers $|F_m^{-}| = |S|$, corresponding to the stations interfering with the m-th AP. Furthermore, P v

Fig. 3: Network capacity as a function of TP and CS threshold.

Algorithm 3: SAU-Sampling MA-MAB Transfer Learning
1 Function DETECT SINGULARITY(K); // returns True if an anomaly is detected in the network KPI data K at time t, and False otherwise.
2 Let $L = \{l \mid l \in \mathbb{N}, l > 0\}$ be the set of layers of model $\theta^{l}_{m,a}$ and $M \subset L$ the subset of layers to be transferred.
Run algorithm SAU-SAMPLING MA-CMAB (Algorithm 1)
while environment step t < T do
3 if
6 Reset exploration parameters $S^{2}_{m,a}$, $n_{m,a}$;
7 Reinitialize weights w and biases b of the l-th layer of $\theta^{l \in M}_{m,a}$ via:

Fig. 4: Convergence performance of ε-greedy, UCB and Thompson Sampling MA-MABs under non-cooperative and cooperative regimes. The subscript "all" indicates the usage of the full set of actions.

Fig. 7: Network response in terms of fairness and station starvation when utilizing the forget, full transfer and transfer strategies.

TABLE I: Notations
Throughput of the s-th STA of the m-th AP
Achievable throughput of the s-th STA of the m-th AP
$D_s^m$: Adaptive data link rate of the s-th STA of the m-th AP
Probability of successful transmission by the s-th STA to the m-th AP
$\phi_s^m$: Probability of the s-th STA transmitting to the m-th AP
$\xi_{CCA}$: Binary function, $\xi_{CCA} = 1$ if the signal is below the CCA threshold $P_{cs}$
$\xi_{ED}$: Binary function, $\xi_{ED} = 1$ if the signal is below the Energy Detection (ED) threshold $P_{ed}$
$\xi_{STA}$: Binary function, $\xi_{STA} = 1$ if throug is below the Energy Detection (ED) threshold $P_{ed}$
Expected length of general time-slot and expected information transmitted by the s-th STA of the m-th AP
$T_{TXOP}$ and $T_{EDCA}$

In Table II and Table III we present the learning hyperparameters and network settings parameters, respectively.

TABLE II: Learning hyperparameters
Number of neurons per hidden layer: $n_h = 100$
Number of inputs: $N_m = 3$; number of outputs: $N_o = K$

TABLE III: Network settings

TABLE IV: Dynamic scenario load distribution