Deep Reinforcement Learning-Based Channel Allocation for Wireless LANs With Graph Convolutional Networks

For densely deployed wireless local area networks (WLANs), this paper proposes a deep reinforcement learning-based channel allocation scheme that enables the efficient use of experience. The central idea is that an objective function is modeled relative to communication quality as a parametric function of a pair of observed topologies and channels. This is because communication quality in WLANs is significantly influenced by the carrier sensing relationship between access points. The features of the proposed scheme can be summarized by two points. First, we adopt graph convolutional layers in the model to extract the features of the channel vectors with topology information, which is the adjacency matrix of the graph dependent on the carrier sensing relationships. Second, we filter experiences to reduce the duplication of data for learning, which can often adversely influence the generalization performance. Because fixed experiences tend to be repeatedly observed in WLAN channel allocation problems, the duplication of experiences must be avoided. The simulation results demonstrate that the proposed method enables the allocation of channels in densely deployed WLANs such that the system throughput increases. Moreover, improved channel allocation, compared to other existing methods, is achieved in terms of the system throughput. Furthermore, compared to the immediate reward maximization method, the proposed method successfully achieves greater reward channel allocation or realizes the optimal channel allocation while reducing the number of changes.


I. INTRODUCTION
Channel allocation is an important problem in densely deployed wireless local area networks (WLANs) owing to the large number of access points (APs) and the limited number of available channels. A poor channel allocation causes substantial contention among the APs and stations (STAs) and reduces the throughput of each AP. WLAN channel allocation schemes in centrally managed environments have been proposed [1]. IEEE 802.11 Task Group be (TGbe) focuses on the multifold increase in real-time applications that impose strict requirements on packet transmission delay and packet loss ratio [2]. To meet these requirements, new resource allocation algorithms are required. Coordinated AP control methods in densely deployed WLANs have been discussed. The poor channel allocation issue can be avoided by effective channel allocation with a limited number of channels. This is the motivation for this study. DSATUR is a graph coloring approach in which nodes and edges represent APs and the contentions among APs (i.e., carrier sensing relationships), respectively, and the color of each node represents its channel. Note that this approach improves the throughput only indirectly, by reducing the number of adjacent APs using the same channel.
The associate editor coordinating the review of this manuscript and approving it for publication was Ayaz Ahmad.
To directly improve throughput, observations of the throughput are required because it is difficult to model the throughput as an explicit function of channels in general. MBLC [4], an immediate throughput improving approach based on observations, uses a weighted cost function of the observed interference based on the physically measured interference power on all channels. This approach switches the channel of an AP exclusively if the calculated weighted interference does not increase after the operation. The objective of this approach is to maximize the immediate throughput at the time of changing the channel of an AP. However, the throughput is not necessarily maximized at the end of the sequence because the allocation can fall into a local optimum. In the following, we introduce prior studies that addressed channel allocation problems in WLANs. In [5], a channel assignment scheme was proposed to maximize the signal-to-interference ratio (SIR) at the user level. This scheme focused on the load balancing of APs according to the number of users of each AP. An optimal channel allocation algorithm in dynamic channel bonding WLANs was proposed in [6]. This algorithm achieved an improvement in the throughput by operating the bandwidth of each channel without overlapping. Raschellà et al. [7] evaluated a channel assignment algorithm. The objective was to minimize the parameters that represented the interference among the APs. As a distributed channel allocation method, potential game-based [8] methods are summarized in [9]. In these approaches, to achieve the Nash equilibrium, each AP stochastically selects its channel. As a potential game-based method for WLANs, Xu et al. [10] proposed an approach to minimize the carrier sensing relationships among the APs. Suliman et al. [11] indicated the potential of artificial intelligence in solving channel allocation problems in wireless communications. This study addressed the determination of the minimum number of channels to satisfy the demands of a network using an artificial immune system. Ghahfarokhi [12] introduced a distributed channel assignment algorithm using a machine learning algorithm, learning automata [13]. The objective of this study was to improve the quality of experience and user-level fairness. Jeunen et al. [14] proposed a machine learning approach using a combination of airtime overlap minimization and bad neighbor detection, which identified devices interfering with each other. This method also focused on improving the experience for users.
As far as we know, the research problem of this paper, which is the direct maximization of the system throughput in WLANs based on the observed throughput, has not been studied before. This is because the prior studies aimed to improve the throughput indirectly (e.g., by reducing the number of carrier sensing relationships or by improving the SIR).
Because the throughput of WLANs is influenced by various factors, these prior studies did not essentially allocate channels to maximize the throughput.
Reinforcement learning is a possible solution to allocate channels based on the feedback of the measured throughput considering the allocation sequence. Reinforcement learning is a decision-making process that learns what choice provides greater reward based on the experience. In particular, deep reinforcement learning has attracted considerable attention and has been utilized to achieve channel allocation in wireless networks [15] because of the effectiveness of the function approximation. In [16], a deep reinforcement learning-based channel allocation method was proposed. However, this method addressed operating only one AP, i.e., this method did not consider a centralized channel allocation problem; rather, it addressed a distributed problem. According to [15], several of the deep reinforcement learning-based channel allocation studies have considered distributed channel allocation or the dynamic spectrum access problem. There would appear to be no previous study for the coordinated WLAN channel allocation problem.
This paper proposes a deep reinforcement learning-based scheme that is suitable for coordinated channel allocation problems in densely deployed WLANs. In reinforcement learning, states must be adequately associated with rewards because the agent acts based on the observed states; thus, we must carefully design the states. To improve the throughput in WLAN channel allocation problems, it is important to capture the carrier sensing relationships among the APs, as in the graph coloring approach of DSATUR. Therefore, we design the states based on the carrier sensing relationships among the APs and the channels in use, which allows an agent to associate a given throughput-based reward with the designed states. Furthermore, we require function approximation to manage the enormous number of observable states in densely deployed WLANs. In this context, extracting input features is important for improving the learning performance. For example, a convolutional neural network (CNN), which is commonly used to extract the features of input images, is an effective means of improving the performance of a neural network (e.g., AlphaGo [17]). We demonstrate in Section VI that function approximation based on a simple neural network does not provide sufficient performance for the channel allocation problem. To extract the features of the adjacent APs and used channels, we propose incorporating graph convolutional layers [18]-[21] into a neural network. Because the states based on carrier sensing relationships can be considered as a graph signal, a graph convolutional network (GCN) is suitable for our settings. Note that a CNN is not suitable for extracting the features of graph-shaped inputs such as the states in our settings. GCN has been utilized to extract features of graph-shaped input in various machine learning studies [22], [23]. Based on GCN, a dynamic action recognition method for human body skeletons was proposed in [22].
By considering the skeleton as a graph wherein the nodes were the joints of the human body, this method adapted GCN to the graph to extract the features of the spatial and temporal actions of the nodes. In [23], a large-scale deep recommendation method was proposed by combining GCN and an efficient random walk approach. The experimental results demonstrated that this method generated higher-quality recommendations than the prior studies in a large-scale graph.
Although a GCN-based channel allocation method can temporarily improve system throughput, the generalization performance decreases over time because of the data duplication in the replay buffer. In WLAN channel allocation problems, once an AP is selected to change its channel, the AP tends to be repeatedly selected to change its channel to the same channel. The imbalanced learning data can advance learning for only the repeatedly observed states and reduce the performance for the other states; this phenomenon is called over-fitting [24]. To prevent the aforementioned degradation in generalization performance, we propose a selective replay buffering that reduces the duplication of the sampling. This idea is based on the undersampling and oversampling approaches for imbalanced learning problems [25].
The contributions of this paper are as follows:
• This paper provides a GCN-based deep reinforcement learning framework that can be applied to problems with an enormous graph-shaped state. Moreover, in the proposed framework, the setting of the optimization objective (i.e., the reward in reinforcement learning) has considerable flexibility as long as it depends on the adjacency relationships of the graph-shaped state.
• This paper proposes a selective replay buffering that is used to avoid the over-fitting caused by the duplication of data for deep learning problems. This approach also functions well for the problem where a certain pair of state and action is repeatedly observed as in WLAN channel allocation problems.
• This paper confirms that the proposed framework successfully increases the cumulative reward. To elaborate, the framework enables a greater reward channel allocation or achieves the optimal channel allocation while reducing the number of changes, compared to the immediate reward maximization method. This is because the immediate reward maximization method does not necessarily achieve the optimal allocation or requires an extended time to achieve the optimal allocation.
The rest of this paper is organized as follows. Section II describes the system model. Section III defines a Markov decision process (MDP) and Section IV introduces the reinforcement learning. Then, Section V introduces the proposed WLAN channel allocation method and Section VI presents an evaluation of the performance of the proposed method. Section VII concludes this study.

II. SYSTEM MODEL
A. CHANNEL ALLOCATION PROBLEM IN WIRELESS LANs
In this study, we learn a control algorithm that allocates the optimal channels in the minimum number of time steps for any initial topology, where a topology is composed of the locations and initial channels of the APs.
Assume that N APs are placed in a square-shaped region and M orthogonal channels with the same bandwidth are available. Let the index set of APs be denoted by $\mathcal{N} = \{1, 2, \dots, N\}$, and the index set of available channels by $\mathcal{M} = \{1, 2, \dots, M\}$. In this system model, we do not set specific values for the bandwidths of the channels; rather, we assume that all channels have the same bandwidth and that their frequency bands do not overlap. The details of the simulation settings are described in Section VI. We regard the channel $c_i \in \mathcal{M}$ of AP $i$ as a one-hot vector of M dimensions (e.g., if AP $i \in \mathcal{N}$ uses Channel $2 \in \mathcal{M}$, then $c_i = [0, 1, 0, \dots, 0]^{\mathsf{T}}$). Fig. 1 displays the system model of the deep reinforcement learning-based coordinated WLAN channel allocation. In the proposed system model, a central controller is considered and is responsible for information gathering and channel allocation from/to each AP, as in [1]. More specifically, the central controller observes the communication quality (e.g., the throughput), the carrier sensing relationships among the APs, and the channels used by the APs at every time step. Moreover, assume that the central controller can change the channel of an AP at a given time step. The central controller decides which AP changes to which channel based on deep reinforcement learning from the observations. As indicated in Fig. 1, we propose a new approach that replaces the observation buffering part with selective data buffering, and the learning part with the GCN-based approach.
VOLUME 8, 2020

B. GRAPH STRUCTURE OF STATE
We model the carrier sensing relationships using a contention graph $G = (\mathcal{N}, \mathcal{E})$. An edge $e_{ij} = \{i, j\} \in \mathcal{E}$ exists if and only if the APs i and j are within the carrier sensing range of each other, regardless of their channels. An adjacency matrix is defined as a matrix representation of the graph G, and expressed as an N × N matrix $A = (A_{ij})$ as follows:
$$A_{ij} = \begin{cases} 1, & \text{if } \{i, j\} \in \mathcal{E}, \\ 0, & \text{otherwise}. \end{cases}$$
By focusing on the characteristics of the graph, we analyze the features of the carrier sensing relationships and utilize the relationships to allocate channels based mainly on GCN, which is detailed in Section V-B.
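The construction of the contention graph described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that two APs sense each other whenever their Euclidean distance is at most a fixed range; the function name `contention_adjacency`, the coordinates, and the range value are hypothetical and are not taken from the simulation settings.

```python
import math


def contention_adjacency(positions, cs_range):
    """Return the N x N adjacency matrix of the contention graph.

    Assumption (not from the paper): carrier sensing is modeled as a
    simple distance threshold between AP coordinates.
    """
    n = len(positions)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Edges ignore channels: only the sensing range matters.
            if math.dist(positions[i], positions[j]) <= cs_range:
                adj[i][j] = adj[j][i] = 1
    return adj


# Hypothetical example: three APs, the first two within range of each other.
aps = [(0.0, 0.0), (3.0, 4.0), (20.0, 0.0)]
A = contention_adjacency(aps, cs_range=10.0)
```

The resulting matrix is symmetric with a zero diagonal, which matches the undirected contention graph defined in this section.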

III. MARKOV DECISION PROCESS
We define an MDP prior to presenting the formulation of the reinforcement learning problem. An MDP is defined as a quadruplet (S, A, P, R): S is a state space (which denotes a set of states in the environment); A is an action space (which denotes a set of available actions by the agent); P is a transition probability to the next state given the current state and action; and R is a function mapping from a tuple of the current state, action, and following state to a real value (this is called a reward function). The agent is designed to determine the best rule for taking an action for a given state (i.e., policy) using the observed history to date. Then, the agent should transfer to a state that provides a greater reward to itself. Therefore, the state should be adequately associated to the reward, and the method by which the states of an MDP are designed is a critical topic.
In general, the input parameter of the reinforcement learning model is the state. Using the input parameter, the agent selects an action based on the output value (e.g., expected reward) of the learning model and receives a reward with a transition to the next state. This series of events (state input, action selection, state transferring, and reward reception) is performed every time step.

A. DEFINITION OF MDP FOR WIRELESS LAN CHANNEL ALLOCATION
To design effective states for the WLAN channel allocation problem, the important insight is that the throughput is, in general, significantly influenced by the carrier sensing relationships among the APs. Therefore, we define a state to be a pair of the adjacency matrix and channel vectors for each circumstance, where we define the channel vectors as
$$C = [c_1, c_2, \dots, c_N]^{\mathsf{T}},$$
i.e., the matrix whose ith row is the one-hot channel vector of AP i. Note that this state can be considered as a graph signal; thus, we can capture the essential features of the state using GCN. To reduce the number of observable states, we determine an isomorphism between the graphs by comparing their canonical labels, which is detailed in Section III-B.
In a WLAN channel allocation problem, if we define the reward as the total throughput, unfairness could occur (e.g., the central controller could allocate channels such that certain APs could not transmit). In this study, to improve the overall throughput without such unfairness, we define the reward as the average throughput of the lowest 40% of the APs. Although we define the reward in this manner, this is not an essential constraint. Note that in the proposed system model, because the state is designed based on the carrier sensing relationships among the APs, the reward can be arbitrary as long as it is based on the carrier sensing relationships (e.g., a function of throughputs).
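The fairness-oriented reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fair_reward`, the integer `percent` parameter, and the rounding rule (floor, with at least one AP) are assumptions introduced here.

```python
def fair_reward(throughputs, percent=40):
    """Average throughput of the lowest `percent` % of APs.

    Assumption: the number of APs counted is floor(percent * N / 100),
    with a minimum of one AP; the paper does not state a rounding rule.
    """
    n = len(throughputs)
    k = max(1, (percent * n) // 100)
    lowest = sorted(throughputs)[:k]
    return sum(lowest) / k


# Hypothetical normalized throughputs of 10 APs; the lowest four are averaged.
observed = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
reward = fair_reward(observed)
```

Because only the worst-served APs enter the average, an allocation that starves a few APs to boost the others receives a low reward.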
We define the action space A as the Cartesian product of the indices of the AP and channel N × M, where each action a = (n, m) ∈ A signifies what AP changes to what channel.
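As a small illustration of this action space, the sketch below enumerates all (AP, channel) pairs and applies one action to a channel configuration. The helper names (`action_space`, `apply_action`, `one_hot`) and the list-based representation of the channel configuration are hypothetical choices made for this example.

```python
from itertools import product


def action_space(num_aps, num_channels):
    """All actions a = (n, m): AP n switches to channel m (1-based indices)."""
    return list(product(range(1, num_aps + 1), range(1, num_channels + 1)))


def apply_action(channels, action):
    """Return a new channel list after AP `ap` switches to `channel`."""
    ap, channel = action
    updated = list(channels)
    updated[ap - 1] = channel
    return updated


def one_hot(channel, num_channels):
    """One-hot channel vector, e.g., channel 2 of 4 -> [0, 1, 0, 0]."""
    return [1 if m == channel else 0 for m in range(1, num_channels + 1)]


actions = action_space(num_aps=10, num_channels=3)
next_channels = apply_action([1, 1, 1], (2, 3))
```

With N APs and M channels the action space has N × M elements, including actions that leave the configuration unchanged (an AP "switching" to its current channel).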

B. STATE MAPPING METHOD
In WLAN channel allocation problems, if multiple APs are in the same situation, any AP can be selected to change its channel. In such a case, by grouping topologies that are regarded as the same, we can reduce the number of observable states and improve the learning performance.
In this section, we introduce a method to reduce the number of observable states based on canonical labeling [26], [27]. The canonical labels are identical if the graphs exhibit an identical topological structure and identical labeling of the nodes and edges. Thus, by comparing the canonical labels, we sort the graphs in a unique and deterministic manner and consider two graphs as isomorphic if their canonical labels are identical. For the computation of automorphisms and canonical labeling of the graphs, we use an open source tool, bliss [28], [29]. Specifically, bliss computes the canonical representative map function ρ, for which the following two conditions hold:
• the representative of a graph, $\rho(G)$, is isomorphic to the graph G.
• the representatives of two graphs, $\rho(G_1)$ and $\rho(G_2)$, are identical if and only if the graphs $G_1$ and $G_2$ are isomorphic.
In [28], it is demonstrated that bliss performs canonical labeling.
We also determine the channel indices in a unified manner because the indices of the channels do not influence the system throughput. For example, in Fig. 2, the system throughputs of both graphs are the same regardless of the channel indices. We assign the channel indices in the order of the AP indices after the canonical labeling method.
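One way to read the channel index unification described above is a first-appearance renaming: scanning the APs in index order, the first channel encountered becomes Channel 1, the next new channel becomes Channel 2, and so on. The sketch below implements this reading; the function name `relabel_channels` is hypothetical, and the sketch assumes the AP order has already been fixed by the canonical labeling step.

```python
def relabel_channels(channels):
    """Rename channels in order of first appearance along the AP indices.

    Assumption: `channels[i]` is the channel of the (i+1)-th AP after
    canonical labeling of the graph; this renaming rule is one possible
    interpretation of the unification described in the text.
    """
    mapping = {}
    relabeled = []
    for ch in channels:
        if ch not in mapping:
            mapping[ch] = len(mapping) + 1
        relabeled.append(mapping[ch])
    return relabeled


# Two configurations that differ only in channel names map to the same state.
same_a = relabel_channels([2, 1, 2])
same_b = relabel_channels([1, 3, 1])
```

Since the throughput depends only on which APs share a channel, collapsing such configurations reduces the number of distinct states the agent has to learn.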

IV. REINFORCEMENT LEARNING
In this section, we provide an outline of reinforcement learning [30]. Reinforcement learning is a learning problem to acquire the optimal policy. It determines the action for a given state that provides the greatest cumulative reward.
In a reinforcement learning problem, a state value function $V^\pi(s)$ of a policy $\pi$ is defined as an expectation of the cumulative reward as follows:
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right],$$
where $\gamma \in [0, 1]$ denotes a discount rate, which is the parameter that determines the value of the future rewards at the current state. Note that the order between two policies $\pi_1$ and $\pi_2$ is defined as follows:
$$\pi_1 \succeq \pi_2 \iff V^{\pi_1}(s) \geq V^{\pi_2}(s), \quad \forall s \in \mathcal{S}.$$
Although this order is not a total order on the policy space $\Pi$, it is known that there is at least one optimal (deterministic) policy $\pi^*$, which satisfies $\pi^* \succeq \pi$ for all $\pi \in \Pi$, if the reinforcement learning problem is based on an MDP. An optimal policy is commonly learned through the estimation of the optimal action-value function, which is written as follows:
$$Q^*(s, a) = \mathbb{E}_{\pi^*}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right].$$
The goal of reinforcement learning is to obtain an optimal policy $\pi^*$ that maximizes $Q^\pi(s, a)$ as follows: $\pi^*(a \mid s) = \arg\max_{a \in \mathcal{A}} Q^*(s, a)$.
Several approaches exist to obtain the optimal policy $\pi^*(a \mid s)$. Specifically, Q-learning is a method to obtain an optimal policy through the estimation of the optimal action-value function.
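For readers unfamiliar with Q-learning, a minimal tabular sketch of its update rule is given below. This is the textbook form rather than the function-approximation variant used in the proposed scheme; the names `q_learning_update`, `step_size`, and the toy states are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      step_size=0.1, gamma=0.9):
    """One tabular Q-learning step; returns the temporal difference error."""
    # Greedy estimate of the value of the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += step_size * td_error
    return td_error


# Toy example with hypothetical states "s0", "s1" and two actions.
q = defaultdict(float)
actions = [0, 1]
first_error = q_learning_update(q, "s0", 1, reward=1.0,
                                next_state="s1", actions=actions)
```

Repeating this update over observed transitions moves the table toward the optimal action-value function, from which the greedy policy is read off.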

V. PROPOSED SCHEME
In this section, we present the details of the proposed deep reinforcement learning-based method, especially the key ideas, GCN [18]-[21] and selective replay buffering. Because the number of observable AP topologies is extremely high in densely deployed WLANs, we adopt the function approximation of $Q^*(s, a)$. In particular, we use GCN to capture the essential features of the graph signals, which correspond to the channel information with topologies (A, C) in our problem. Furthermore, when an AP selects a channel according to a utility function in common WLANs, a fixed action tends to be selected in certain states, and the duplication of data can cause over-fitting [24]. To prevent the duplication and over-fitting, we introduce the selective replay buffering.

A. ALGORITHM
In this section, we solve the problem defined in Section II based on deep reinforcement learning. The baseline method is deep Q-network (DQN) [31].
The main features of DQN are experience replay and a fixed target Q-network. In general, Q-learning with function approximation may fail to converge [32]. The fixed target Q-network promotes convergence by fixing the parameters of the Q-function for a certain period, which avoids fluctuations in the target value (which depends on the Q-function itself) while the parameters are learned. DQN uses two networks, namely a main network $Q_\theta$ (which is the target of the optimization, with a weight parameter $\theta$) and a target network $Q_{\theta^-}$ (which is used to calculate the temporal difference (TD) errors, with a weight parameter $\theta^-$). The parameter of the target network $\theta^-$ is updated to $\theta$ every I time steps and is kept fixed between updates. As I increases, the learning becomes more stable, whereas the parameter update frequency decreases. Experience replay is a technique that breaks the temporal correlation in the training data. The training data $(s, a, r, s')$ is first stored in a buffer called the replay buffer $\mathcal{D}$. Then, DQN updates the parameters using a mini-batch constructed from data sampled randomly from the replay buffer. Consequently, there is virtually no time dependence among the data in the mini-batch.
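The two DQN ingredients described above, experience replay and the periodically synchronized target network, can be sketched with the Python standard library as follows. The class `ReplayBuffer`, the function `maybe_sync_target`, and the dictionary-based parameter containers are hypothetical stand-ins; a real implementation would hold neural network weights instead.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transition is dropped when full."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def store(self, transition):
        self.data.append(transition)  # transition = (s, a, r, s_next)

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the temporal order of the transitions.
        return rng.sample(list(self.data), min(batch_size, len(self.data)))


def maybe_sync_target(step, interval, main_params, target_params):
    """Copy main-network parameters to the target network every `interval` steps."""
    if step % interval == 0:
        target_params.clear()
        target_params.update(main_params)
        return True
    return False
```

A larger `interval` keeps the TD targets fixed for longer, which corresponds to the stability and update-frequency trade-off noted above for the parameter I.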
In DQN, the parameter $\theta$ is updated in each time step t as follows:
$$\theta \leftarrow \theta + \eta \left( r + \gamma \max_{a' \in \mathcal{A}} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right) \nabla_\theta Q_\theta(s, a),$$
where $\eta > 0$ denotes the learning rate and $(s, a, r, s')$ is a transition sampled from the replay buffer $\mathcal{D}$.
In addition to the original DQN, we employ the following well-known techniques: double DQN (DDQN) [33], dueling network [34], and prioritized experience replay [35], which are known to contribute to the general performance improvement of DQN. DDQN [33] is a DQN-based method that avoids overestimations by employing two different networks. Dueling network [34] is a method that can learn the values of the states without the effect of actions. Prioritized experience replay [35] is an effective method for sampling data from the replay buffer $\mathcal{D}$. Details on these methods are presented in the Appendix.
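To make the difference between the DQN and DDQN targets concrete, the sketch below computes both from lists of next-state action values. It follows the commonly used formulations of these targets; the function names and the numeric values are hypothetical and do not come from the simulations.

```python
def dqn_target(reward, next_q_target, gamma):
    """DQN target: the target network both selects and evaluates the action."""
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_main, next_q_target, gamma):
    """DDQN target: the main network selects, the target network evaluates."""
    best_action = max(range(len(next_q_main)), key=next_q_main.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical next-state action values from the two networks.
next_q_main = [1.0, 2.0, 0.5]
next_q_target = [3.0, 1.5, 0.0]
y_dqn = dqn_target(1.0, next_q_target, gamma=0.5)
y_ddqn = double_dqn_target(1.0, next_q_main, next_q_target, gamma=0.5)
```

In this example the DDQN target is smaller than the DQN target because the action preferred by the main network is not the one the target network values most, which illustrates how decoupling selection from evaluation counteracts overestimation.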

B. GRAPH CONVOLUTIONAL NETWORKS
In this section, we describe a function approximation method that is suitable for a state designed based on an adjacency matrix. Because the number of observable states is extremely high in the proposed system model, we adopt function approximation to address this large-scale problem.
Feature extraction layers such as a convolution layer [36] play a crucial role in boosting the performance of reinforcement learning (e.g., AlphaGo [17]). A CNN extracts the features of the signals on an input image; however, the input of the proposed system model is not an image but a graph. The designed state of the proposed system can be considered as a graph signal; thus, GCN [18]-[21] is a suitable algorithm to capture the essential features of the state. By applying GCN layers in the neural network model, we can analyze the graph structure of the APs in the same way that a CNN [36] analyzes an input image.
In general, the convolution calculation in the time domain is expressed as the Hadamard product in the frequency domain. Therefore, GCN is expressed by applying an inverse Fourier transformation to the result that corresponds to the Hadamard product after the Fourier transformation. If the input dimension corresponds to d ∈ R, the following process is adapted to each dimension.
An input vector $x \in \mathbb{R}^N$ is a signal on a graph G with N nodes. Let D be the degree matrix of the graph, and let $L = D - A$ be its graph Laplacian with the adjacency matrix A of the graph G. Let the graph Laplacian L be orthogonally diagonalized as $L = U \Lambda U^{\mathsf{T}}$, where $U = (u_1, u_2, \dots, u_N)$ is the matrix whose columns are the eigenvectors of L and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues. Subsequently, a graph convolution of the input signal x is defined as $x \mapsto U(\theta \odot (U^{\mathsf{T}} x))$, where $\theta = (\theta_1, \dots, \theta_N)$ are the parameters to be learnt, and $\odot$ represents the Hadamard product.
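As a worked illustration of the spectral graph convolution described above, consider a hypothetical contention graph with only two APs that sense each other. The graph, the signal values $x_1, x_2$, and the filter parameters $\theta_1, \theta_2$ are chosen purely for illustration and do not come from the simulations in this paper; the eigenvectors are taken as the columns of $U$.

```latex
% Two mutually sensing APs
A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
D = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
L = D - A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.

% Eigendecomposition L = U \Lambda U^T with eigenvalues 0 and 2
U = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad
\Lambda = \begin{pmatrix} 0 & 0 \\ 0 & 2 \end{pmatrix}.

% Graph Fourier transform of x = (x_1, x_2)^T
U^{\mathsf{T}} x = \frac{1}{\sqrt{2}}
\begin{pmatrix} x_1 + x_2 \\ x_1 - x_2 \end{pmatrix}.

% Filtering in the spectral domain and inverse transform
U\bigl(\theta \odot (U^{\mathsf{T}} x)\bigr)
= \frac{1}{2} \begin{pmatrix}
\theta_1 (x_1 + x_2) + \theta_2 (x_1 - x_2) \\
\theta_1 (x_1 + x_2) - \theta_2 (x_1 - x_2)
\end{pmatrix}.
```

Here $\theta_1$ scales the component of the signal that is common to both APs, whereas $\theta_2$ scales the difference between them; setting $\theta_1 = \theta_2 = 1$ returns the input signal unchanged.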

C. SELECTIVE REPLAY BUFFERING
This section describes the proposed selective data buffering applied to the replay buffer D. When the agent selects actions based on a policy that typically selects the best response (e.g., ε-greedy with a small ε), a fixed action tends to be selected in certain states. As mentioned previously, the imbalanced data can advance learning for only the experienced states and reduce the performance for the inexperienced states, which is called over-fitting [24]. To prevent over-fitting, we propose selective replay buffering, which aims to reduce buffering the same data in the replay buffer. This idea is based on the undersampling and oversampling approaches for imbalanced learning problems [25].

Algorithm 1
Initialize X(s, a) ← 0 for all (s, a)
for each time step t do
    Observe transition (s_t, a_t, r_{t+1}, s_{t+1})
    if X(s_t, a_t) ≡ 0 (mod α) then
        for j ← 1 to β do
            if replay buffer D is not full then
                Store (s_t, a_t, r_{t+1}, s_{t+1}) in D
            else
                Replace the oldest data in D by (s_t, a_t, r_{t+1}, s_{t+1})
            end if
        end for
    end if
    X(s_t, a_t) ← X(s_t, a_t) + 1
end for

Algorithm 1 displays the flow of the buffering to the replay buffer for each episode. The main part of this algorithm is that the observed transition (s_t, a_t, r_{t+1}, s_{t+1}) is stored in the replay buffer D if the transition has never been observed in the episode, or on every αth observation of the same state-action pair thereafter. Furthermore, to prevent observations from remaining in the replay buffer D for an extended time, we store an observation β times repeatedly. Note that X(s, a) is the number of times that action a has been performed in state s, which is initialized at the beginning of each episode. This method reduces the duplication of data stored in the replay buffer.
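The buffering rule of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: states and actions are hashable, the class name `SelectiveReplayBuffer` and its methods are hypothetical, and a bounded `deque` is used so that appending to a full buffer discards the oldest entry.

```python
from collections import defaultdict, deque


class SelectiveReplayBuffer:
    """Stores a transition only on every alpha-th visit of its (s, a) pair."""

    def __init__(self, capacity, alpha, beta):
        self.data = deque(maxlen=capacity)  # oldest entry is replaced when full
        self.alpha = alpha                  # storing period per (s, a)
        self.beta = beta                    # copies stored per accepted sample
        self.counts = defaultdict(int)      # X(s, a)

    def start_episode(self):
        """Reset the visit counters X(s, a) at the beginning of an episode."""
        self.counts.clear()

    def observe(self, state, action, reward, next_state):
        """Apply the selective rule; return True if the transition was stored."""
        key = (state, action)
        stored = self.counts[key] % self.alpha == 0
        if stored:
            for _ in range(self.beta):
                self.data.append((state, action, reward, next_state))
        self.counts[key] += 1
        return stored
```

With, for example, α = 3 and β = 2, seven consecutive observations of the same state-action pair are stored on the first, fourth, and seventh observation only, each time as two copies.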

VI. SIMULATION EVALUATION
In this section, we validate the efficiency of the proposed scheme using proof-of-concept simulations. Assume that the step number of one episode is fixed in these simulations. [37], [38]; the APs within the carrier sensing range have a carrier sensing relationship. The relationship can be expressed through a graph as in Fig. 3, where the APs and carrier sensing relationships are denoted by nodes and edges, respectively. Fig. 4 indicates the overall architectures used in the simulations, where Figs. 4(a) and 4(b) represent the GCN-based model and the simple neural network model, which comprises only fully connected layers, respectively. In detail, the "Dense" layer represents the fully connected layer; the "Batch Normalization" layer is a function layer that increases the learning speed and restrains over-fitting [39]; the "ReLU" layer is a well-known activation function [40]; and the "Graph Convolution" layer represents the graph convolutional layer detailed in Section V-B. Note that each graph convolution layer requires the adjacency matrix in order to treat its input signal as a graph signal. The outputs are the estimated action values $Q(s, a)$ for all $a \in \mathcal{A}$.
The simulation parameters are summarized in Table 2. Assume that the step number of one episode is 500 and the episode number is 10000. The central controller can change the channel of an AP at a given time step. In these simulations, we define the reward, the objective of the optimization, as the average throughput of the lowest 40% of the APs. Note that the setting of the reward has considerable flexibility as long as it depends on the adjacency relationships of the graph-shaped state, as already discussed in Section I. Let the nth lowest throughput ($n \in \mathcal{N}$) among 10 APs in the kth topology ($k \in \{1, 2, \dots, 10000\}$) be denoted by $\xi^{(n)}$. The reward is then the average of the four lowest throughputs, $\frac{1}{4}\sum_{n=1}^{4} \xi^{(n)}$.
To evaluate the generalization performance of the Q-function during learning, we prepared 100 test topologies, where the APs were randomly located and the channels of all the APs were set to Channel 1. When we evaluated the generalization performance, we used the snapshot of the Q-function at that time to select the AP channel to be changed in each time step. For each test topology, the central controller changed the channel of an AP according to the output of the Q-function 20 times. We used the reward corresponding to the state after 20 time steps from the initial state as the final reward of the test topology.
For reproducibility of results, we used the back-of-the-envelope (BoE) throughput evaluation technique [37] to model the throughput of the APs according to the carrier sensing relationships among the APs in each channel configuration. The BoE technique allows the adoption of shortcuts in performance evaluation and bypasses complicated stochastic analysis. The BoE throughput was derived under the assumption that each AP had a link with an STA at a given time, and all the links were saturated, i.e., all links always had frames to send. All simulations in this paper followed this assumption. Moreover, the simulations used a throughput normalized by the bandwidth as the observed throughput according to the BoE technique, i.e., the observed throughput had a value between 0 and 1. As a data collecting policy, we used ε-greedy [30], which randomly selects an action with probability ε and selects a greedy action with probability 1 − ε. If ε was small, the agent tended to repeatedly select a fixed action in certain states, and the imbalanced stored data caused over-fitting.
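The data collecting policy described above can be sketched as a short function. The name `epsilon_greedy`, the list-of-values interface, and the injectable random generator are assumptions made for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest estimated action value.
    return max(range(len(q_values)), key=q_values.__getitem__)


# With epsilon = 0 the choice is always greedy.
greedy_choice = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)
```

When ε is small, almost every call returns the same greedy index for a given state, which is the source of the duplicated transitions that the selective replay buffering is intended to thin out.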

A. EVALUATION OF GENERALIZATION PERFORMANCE FOR RANDOM TOPOLOGY
We compared the following five methods:
• a deep reinforcement learning-based method with the simple neural network model in Fig. 4(b), referred to as "DRL without GCN".
• a deep reinforcement learning-based method with the GCN model in Fig. 4(a), referred to as "DRL with GCN".
• a deep reinforcement learning-based method with the GCN model in Fig. 4(a) and the selective data buffering explained in Section V-C, referred to as "DRL with GCN and buffer method".
• a random action selection method, referred to as "Random".
• a distributed method based on a potential game [10], referred to as "Distributed method".
As mentioned in Section I, as far as we know, none of the prior studies addressed exactly the same problem as this paper. Therefore, as a comparison method, we employed a potential game-based method [10], which allocates channels based on the adjacency relationships of APs to improve system throughput indirectly. In the potential game-based method, an action with a greater payoff was stochastically selected with higher probability than the other choices. This method is guaranteed to achieve the Nash equilibrium [8].
In this paper, we defined the payoff function such that the number of carrier sensing relationships (i.e., network collisions) was minimized according to [10]. In [10], it was proven that minimizing the network collisions provides a near-optimal throughput. Let the channel used by AP $i \in \mathcal{N}$ at time step t be denoted by $c_i[t] \in \mathcal{M}$. At each time step t, the probability that AP i selects the next channel $c_i[t + 1]$ is expressed as follows: where $u_i(c)$ denotes a payoff function; $\mathbb{1}(x)$ denotes an indicator function that is one if event x is true and is zero otherwise; and $\zeta \geq 0$ denotes the parameter that determines the degree of selecting the state with a high payoff function. In this simulation, the parameter $\zeta$ was set to 0.1.
Fig. 5 displays the learning curves representing the transitions of the generalization performance evaluated every 20 episodes. We evaluated the generalization performance by allocating channels for 100 inexperienced test topologies according to the learning models and observed the rewards after 20-step greedy actions. Each value in Fig. 5 indicates the mean reward of 100 test topologies after performing 20-step greedy actions. The generalization performance of the methods using the GCN-based model increased as the learning progressed, whereas that of the simple neural network model exhibited virtually no increase. However, the generalization performance of the method without selective data buffering using the GCN-based model decreased after a certain amount of time. This is because the model learned for experienced states and reduced the performance for inexperienced states, which is called over-fitting. By employing selective data buffering, we could maintain the generalization performance at a high level. Fig. 6 displays the cumulative distribution functions (CDFs) of the rewards of 100 test topologies at the best and final performance points in Fig. 5.
As indicated in the figure, the proportion of high-reward states achieved by the deep reinforcement learning-based methods with the GCN-based model exceeded those of the other methods. Therefore, the learning performance with the GCN-based model exceeded that with the simple neural network model. Moreover, the effect of over-fitting can be seen in this figure by comparing the "DRL with GCN (best)" and "DRL with GCN (final)" lines. We can observe that over-fitting was avoided by employing the proposed selective data buffering.
(Figure caption) The upper sequence is more desirable because fewer time steps are required to achieve the optimal channel allocation, and thus the cumulative reward is greater.
Let the $n$th lowest throughput ($n \in \mathcal{N}$) among 10 APs in the $l$th test topology ($l \in \{1, 2, \ldots, 100\}$) be denoted by $\xi$ … throughput $\xi^{(n)}$ of the deep reinforcement learning-based method with selective data buffering using the GCN-based model was greater than those of the other methods. In particular, the first to fourth lowest throughputs $\xi^{(n)}$ ($n \in \{1, 2, 3, 4\}$) increased when the proposed method was used. This is because we defined the reward of the learning as the average of the lowest four throughputs, $\frac{1}{4}\sum_{n=1}^{4} \xi^{(n)}$. Moreover, by comparing the deep reinforcement learning-based methods with and without GCN, we confirmed that GCN makes it possible to increase the performance through training.
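The reward definition used above, i.e., the average of the lowest four per-AP throughputs, can be illustrated with a minimal sketch. The function name `lowest_k_mean_reward` and the throughput values are hypothetical and serve only as an illustration; they are not taken from the simulation.

```python
def lowest_k_mean_reward(throughputs, k=4):
    """Mean of the k lowest per-AP throughputs (k = 4 in the text)."""
    if len(throughputs) < k:
        raise ValueError("need at least k throughput values")
    lowest = sorted(throughputs)[:k]  # the k worst-served APs
    return sum(lowest) / k


# Hypothetical normalized throughputs of 10 APs (illustrative values only).
example_throughputs = [0.9, 0.2, 0.5, 0.4, 1.0, 0.3, 0.8, 0.7, 0.6, 0.1]
reward = lowest_k_mean_reward(example_throughputs)
```

Because only the four lowest values enter the reward, improving an AP that is already well served leaves the reward unchanged, which is consistent with the observed gain in the first to fourth lowest throughputs.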
The performance of the potential game-based method displayed in Figs. 6 and 7 is inferior to that of the proposed method. This may be because the potential game-based method does not aim to improve the system throughput directly; rather, it aims to reduce the number of carrier sensing relationships, which could influence the system throughput.
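To make the indirect objective of the potential game-based method concrete, the following sketch counts network collisions for a given channel vector. It assumes that a collision is a pair of APs within carrier sensing range that use the same channel; the per-AP payoff shown here (the negative number of same-channel neighbors) is an illustrative assumption and not necessarily the exact payoff of [10]. The names `network_collisions`, `payoff`, and `line` are hypothetical.

```python
def network_collisions(adjacency, channels):
    """Count pairs of APs that sense each other and use the same channel."""
    n = len(channels)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if adjacency[i][j] and channels[i] == channels[j]
    )


def payoff(adjacency, channels, i, candidate):
    """Assumed payoff of AP i for a candidate channel: minus the number of
    neighbors currently using that channel (illustrative choice)."""
    return -sum(
        1
        for j, linked in enumerate(adjacency[i])
        if linked and j != i and channels[j] == candidate
    )


# Hypothetical five-AP line topology: AP k senses only APs k-1 and k+1.
line = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
collisions_same = network_collisions(line, [1, 1, 1, 1, 1])
collisions_alternating = network_collisions(line, [1, 2, 1, 2, 1])
```

A method that minimizes this count never observes throughput itself, which is one possible reason why it trails a method trained directly on a throughput-based reward.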

B. EVALUATION OF CHANNEL ALLOCATION SEQUENCE
In this section, we evaluated the efficiency of the proposed method in maximizing the cumulative reward. Specifically, we evaluated the channel allocation sequence from two perspectives: 1) how fast the proposed method reached its final channel allocation (convergence speed perspective), where a method converging faster is superior, and 2) how proactively the proposed method selected channels (delayed reward perspective), where a method converging to a channel configuration with a greater reward is superior. The convergence speed perspective is evaluated in Section VI-B1; the delayed reward perspective is evaluated in Section VI-B2.

1) CONVERGENCE SPEED PERSPECTIVE
The channel allocation sequence influences the system throughput during the control, even if the destinations are the same. Fig. 8 presents a hypothetical example for explanation. We consider a case where the numbers of APs and channels are five and two, respectively. The positions of the APs are indicated in Fig. 8, where each AP has carrier sensing relationships only with its adjacent APs. The optimal channel allocation is to allocate Channel 1 (or Channel 2) to APs 1, 3, and 5, and Channel 2 (or Channel 1) to APs 2 and 4, i.e., adjacent APs should not be allocated the same channel. In this case, two channel allocation sequences lead to the optimal allocation, as indicated in Fig. 8. One sequence changes the channels of APs 2 and 4, and the other changes the channels of APs 1, 3, and 5. The former sequence requires two time steps, whereas the latter requires three. To improve the system throughput even during the channel allocation process, the former sequence is more desirable. Therefore, we aim to achieve the optimal channel allocation in fewer time steps.
Fig. 9 displays the channel allocation sequence of a test topology where the numbers of APs and channels are five and two, respectively. In these figures, the nodes represent the APs and the edges represent the adjacency relationships. A solid line indicates contention between APs using the same channel within the carrier sensing range; a dashed line connects APs using different channels within the carrier sensing range. The colors of the nodes indicate the channels used by the APs, where blue and orange denote Channels 1 and 2, respectively. The topology in this figure is the same as that in Fig. 8. From this figure, we can observe that the proposed method can allocate channels in the desirable sequence with the maximum cumulative reward.
Similarly, Fig. 10 displays the channel allocation sequence of a test topology where the numbers of APs and channels are nine and three, respectively. In addition to Channels 1 and 2, the purple nodes denote APs using Channel 3. In this topology, the optimal channel allocation assigns orthogonal channels to adjacent APs. The proposed method can allocate channels in the desirable sequence with the maximum cumulative reward.
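The two-step versus three-step comparison in the five-AP example can be checked with a short sketch. It assumes that all five APs start on Channel 1 and that exactly one AP changes its channel per time step; the initial state and the names `steps_to_reach`, `target_a`, and `target_b` are assumptions made for illustration.

```python
def steps_to_reach(current, target):
    """Number of time steps needed when one AP changes channel per step."""
    return sum(1 for c, t in zip(current, target) if c != t)


def is_conflict_free(channels):
    """True if no two adjacent APs on the line share a channel."""
    return all(a != b for a, b in zip(channels, channels[1:]))


initial = [1, 1, 1, 1, 1]   # assumed start: every AP on Channel 1
target_a = [1, 2, 1, 2, 1]  # change APs 2 and 4
target_b = [2, 1, 2, 1, 2]  # change APs 1, 3, and 5

steps_a = steps_to_reach(initial, target_a)
steps_b = steps_to_reach(initial, target_b)
```

Both targets are conflict-free, so they are equally good as destinations; they differ only in the number of intermediate steps, which is what the cumulative reward distinguishes.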
Notably, in each figure, the channel of the AP at the center of the topology is not changed at the first step t = 1. Note that this change does occur under immediate reward maximization. The immediate reward maximization method maximizes the reward received at each time step, whereas the proposed method maximizes the cumulative reward over the channel allocation sequence.

2) DELAYED REWARD PERSPECTIVE
Moreover, we can demonstrate that the proposed method is superior to the immediate reward maximization in performance at the end of the sequence.
Figs. 11 and 12 display the channel allocation sequences of one test topology based on the immediate reward maximization method and the proposed method, respectively. The numbers of APs and channels are 10 and three, respectively. Furthermore, Fig. 13 displays the time series of the reward transitions of each method in this example. The final reward of the proposed method is 0.5, whereas that of the immediate reward maximization method is 0.375. In the immediate reward maximization method, the agent takes the action that maximizes the immediate reward at each time step; thus, this method can become trapped in a locally optimal allocation. To reach the globally optimal allocation from there, multiple APs would have to change their channels simultaneously. Because only one AP can change its channel at each time step in this simulation, this method cannot achieve the globally optimal allocation. Conversely, the proposed method achieves the globally optimal allocation with the maximum reward. The rewards are zero until t = 3 for the proposed method. This result indicates that the proposed method selects the action that maximizes the cumulative reward, which includes the delayed rewards received after the channel allocation at each time step, even if the immediate reward is small. This can be attributed to the fact that reinforcement learning updates the learning parameters to maximize the cumulative reward.
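The local-optimum behavior of immediate reward maximization can be reproduced in a toy setting. The sketch below is not the simulation of this paper: it assumes a five-AP line topology, two channels, and a hypothetical sparse reward that is one only when no adjacent APs share a channel. The names `greedy_step` and `sparse_reward` are illustrative.

```python
def greedy_step(channels, num_channels, reward_fn):
    """Return the single-AP change with the highest immediate reward,
    or None if no change improves on the current reward."""
    best_reward = reward_fn(channels)
    best = None
    for ap in range(len(channels)):
        for ch in range(num_channels):
            if ch == channels[ap]:
                continue
            candidate = list(channels)
            candidate[ap] = ch
            r = reward_fn(candidate)
            if r > best_reward:
                best_reward, best = r, candidate
    return best


def sparse_reward(channels):
    """Toy reward (assumption): 1 only for a conflict-free line, else 0."""
    clash = any(a == b for a, b in zip(channels, channels[1:]))
    return 0.0 if clash else 1.0


start = [0, 0, 0, 0, 0]
stuck = greedy_step(start, 2, sparse_reward)   # no single change helps
two_step = [[0, 1, 0, 0, 0], [0, 1, 0, 1, 0]]  # accepts a zero reward first
rewards = [sparse_reward(c) for c in two_step]
```

In this toy case the greedy rule stops at the initial state because every single change still yields zero reward, whereas a sequence that tolerates one zero-reward step reaches the conflict-free allocation.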

VII. CONCLUSION
A deep reinforcement learning-based channel allocation scheme was proposed for densely deployed WLANs. First, to capture the essential features of the carrier sensing relationships among the APs, we applied GCN to a graph in which APs within carrier sensing range of each other were connected. Then, we proposed selective data buffering to prevent over-fitting by reducing the duplication of sampling data, which is specific to WLAN channel allocation problems. The main learning algorithm of this method was DDQN employing a dueling network and prioritized experience replay. Furthermore, because the number of observable states was extremely high in this problem, we used canonical labeling to reduce this number and improve the learning performance. The simulation results indicated that the proposed scheme achieved greater rewards after 20-step greedy actions from a given initial state than the compared methods. Using GCN, we improved the learning performance compared with that of the simple neural network model, which comprised only fully connected layers. Finally, we demonstrated that the proposed method could allocate channels in a small number of time steps and with a large cumulative reward.

A. DDQN
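DDQN [33] is a DQN-based method that reduces the over-estimation of action values. DDQN uses two networks in the same manner as DQN, as discussed in Section IV, i.e., $Q_{\theta}$ and $Q_{\theta^-}$. The error value of DDQN is expressed as follows: $r_{t+1} + \gamma Q_{\theta^-}(s_{t+1}, \arg\max_a Q_{\theta}(s_{t+1}, a)) - Q_{\theta}(s_t, a_t)$.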
The parameters are updated to minimize this error value.
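The DDQN error described above can be sketched for a single transition. The sketch assumes the usual DDQN form, in which the online network $Q_{\theta}$ selects the next action and evaluates the current state-action pair, whereas the target network $Q_{\theta^-}$ evaluates the selected next action. The function name `ddqn_td_error` and all numerical values are hypothetical.

```python
def ddqn_td_error(reward, gamma, q_online_next, q_target_next, q_online_current):
    """r + gamma * Q_target(s', argmax_a Q_online(s', a)) - Q_online(s, a).

    q_online_next / q_target_next: per-action values at the next state.
    q_online_current: online value of the action actually taken.
    """
    # Action selection uses the online network ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... whereas its value is read from the target network.
    target = reward + gamma * q_target_next[best_action]
    return target - q_online_current


# Hypothetical values for one transition with three actions.
delta = ddqn_td_error(
    reward=0.5,
    gamma=0.9,
    q_online_next=[1.0, 2.0, 0.5],
    q_target_next=[0.8, 1.5, 3.0],
    q_online_current=1.0,
)
```

In this example the target network assigns its largest value to the third action, but the online network selects the second; decoupling selection from evaluation in this way is what distinguishes DDQN from DQN.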

B. DUELING NETWORK
The dueling network [34] is a method that can learn which states are (or are not) valuable without having to learn the effect of each action for each state. The dueling network includes two streams in the neural network architecture that separately estimate the state value and the advantage of each action. The output value corresponds to the total value of the two streams.
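The combination of the two streams can be sketched as follows. The text above states only that the outputs of the two streams are summed; subtracting the mean advantage before the summation is the aggregation commonly used with dueling networks and is an assumption here, as is the function name `dueling_q_values`.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value with per-action advantages.

    Subtracting the mean advantage (an assumed, commonly used choice)
    keeps the split between value and advantage identifiable.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


# Hypothetical stream outputs for one state with three actions.
q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0])
```

With this aggregation, the mean of the resulting action values equals the state value, so the state-value stream can be learned from every transition regardless of the action taken.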

C. PRIORITIZED EXPERIENCE REPLAY
In this paper, we sample the training data from the replay buffer $D$ according to the prioritized experience replay [35], which allocates a priority to every sample based on its TD error. The TD error $\delta_t$ in DDQN is expressed as follows: $\delta_t = r_{t+1} + \gamma Q_{\theta^-}(s_{t+1}, \arg\max_a Q_{\theta}(s_{t+1}, a)) - Q_{\theta}(s_t, a_t)$.
The probability of selecting sample $i$ is expressed as follows: $P(i) = \frac{(|\delta_i| + \mu_0)^{\lambda}}{\sum_{j} (|\delta_j| + \mu_0)^{\lambda}}$, where $\mu_0$ is a small positive number that prevents the sampling probabilities from being zero when the TD error is zero, and $\lambda$ is a parameter that determines the degree of prioritization in sampling (in particular, when $\lambda = 0$, samples are drawn uniformly at random). If the absolute value of the TD error $|\delta_t|$ of a sample is relatively large among those in the replay buffer, the probability of selecting that sample increases.
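A minimal sketch of this sampling rule is given below. It assumes the proportional form of prioritized experience replay, in which the priority of a sample is $(|\delta| + \mu_0)^{\lambda}$ normalized over the buffer; the function names, the default value of `mu0`, and the TD errors are hypothetical.

```python
import random


def sampling_probabilities(td_errors, lam, mu0=1e-6):
    """Proportional prioritization (assumed form): p_i ~ (|delta_i| + mu0)^lam."""
    priorities = [(abs(d) + mu0) ** lam for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, lam, batch_size, rng=random):
    """Draw buffer indices (with replacement) according to the priorities."""
    probs = sampling_probabilities(td_errors, lam)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


# Hypothetical TD errors of four stored transitions.
errors = [0.0, 0.1, -0.5, 2.0]
uniform = sampling_probabilities(errors, lam=0.0)
prioritized = sampling_probabilities(errors, lam=1.0)
batch = sample_indices(errors, lam=1.0, batch_size=8)
```

Setting `lam=0.0` makes every priority equal to one and thus recovers uniform sampling, whereas larger values concentrate sampling on transitions with large absolute TD errors; `mu0` keeps the transition with zero TD error selectable.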