Load-Aware Dynamic Mode Selection for Network-Assisted Full-Duplex Cell-Free Large-Scale Distributed MIMO Systems

The network-assisted full-duplex (NAFD) system realizes flexible duplexing in the spatial domain within the same time-frequency resource. With the explosive growth in the number of users and remote antenna units (RAUs) in 6G scenarios, the resource utilization of the system decreases. When the RAUs select user resources for transmission or reception, collisions or congestion may occur under mechanisms such as grant-free access. To make better use of system resources, a load-aware dynamic mode selection scheme for NAFD is proposed to improve the access efficiency and resource utility of the system. This paper first proposes a centralized Q-learning algorithm, which learns a strategy for approaching the objective on its own and adapts well to environment dynamics. However, the Q-table used by the centralized Q-learning algorithm is very large. A distributed multi-agent Q-learning algorithm is therefore proposed, which has a smaller Q-table and lower complexity and is better suited to practical scenarios. The simulation results show that the proposed load-aware dynamic mode selection scheme significantly improves resource utility and throughput compared with traditional schemes.


I. INTRODUCTION
Currently, theories and technologies related to ultra-reliable and low-latency communication (URLLC) in 6G are in urgent need of a breakthrough. To reduce the latency of the traditional half-duplex (HD) system, full duplex (FD), in which a node is equipped with both transmit and receive antennas, has been widely studied in the literature to enable simultaneous transmission and reception in the same frequency band, with theoretically doubled throughput [1]. Self-interference (SI) is the main barrier to implementing FD. Active and passive SI suppression techniques have been studied in [2], [3], which make FD a realistic technology for modern wireless systems. The authors of [4] studied an FD cell-free massive multiple-input multiple-output (MIMO) network, where the access points (APs) employ a simple conjugate beamforming/matched filtering scheme with the channel state information acquired through uplink training with orthogonal pilots transmitted from the users, and provided a simple power control method to mitigate residual self-interference. In [5], closed-form spectral efficiency (SE) lower bounds were derived for the FD cell-free MIMO system with maximum-ratio combining/maximum-ratio transmission processing and optimal uniform quantization. Recently, the maximization of the SE and energy efficiency (EE) of the FD cell-free MIMO system was considered in [6]. However, to achieve URLLC, the SI cancellation processing latency in FD systems should be considered [7]. Network-assisted full-duplex (NAFD) under a cell-free massive MIMO network was proposed in [8]; it has no SI at the remote antenna unit (RAU) level and can solve the cross-link interference (CLI) problem by using joint processing [9], thus reducing the latency of interference cancellation. It realizes flexible duplex transmission by selecting the uplink and downlink working modes of the RAUs in the spatial domain within the same time-frequency resource block at the network level, reducing the delay of the time division duplex (TDD) system and improving the SE and EE of the system. The NAFD scheme is quite promising for achieving URLLC because it reduces the latency of HD and TDD systems as well as the SI cancellation latency of the traditional full-duplex scheme.
The associate editor coordinating the review of this manuscript and approving it for publication was Cunhua Pan.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
With the explosive growth in the number of mobile terminals, it is necessary to study reliable multi-access and resource utilization mechanisms for the NAFD scheme. To improve the utilization of uplink (UL) and downlink (DL) resources, this paper investigates the selection of each RAU's working mode, uplink reception or downlink transmission, based on the traffic loads and quality of service (QoS) of the users. The proposed load-aware dynamic mode selection scheme directly assigns DL transmission or UL reception to each RAU according to the traffic loads of the whole network. The users do not need to establish a handshake mechanism with the RAUs, which reduces signaling overhead, simplifies the access procedure, reduces the access latency, and improves the access efficiency in massive access scenarios.
On the other hand, most existing works fix the working modes of the RAUs and then analyze the system performance of the NAFD scheme [10]-[12]. The authors of [10] instead estimated the effective CSI (inner products of beamforming and channel vectors) based on a beamforming training scheme. In [11], the joint transceiver design for NAFD systems under a cell-free massive MIMO network with simultaneous wireless information and power transfer (SWIPT) was investigated, with the fronthaul capacity as a constraint. The work in [12] focused on optimizing the SE of the system while taking the SWIPT ratio design into consideration. The work in [13] focused only on maximizing the SE through a mode selection scheme, without considering the traffic loads and quality of service (QoS) of the users. To the best of our knowledge, the working mode selection problem of the RAUs considering the traffic loads and QoS of the users has not yet been explored in the literature, and no existing research focuses on the resource utilization of FD cell-free MIMO systems. This paper focuses on the resource utilization problem and system performance of NAFD cell-free large-scale distributed MIMO systems. A load-aware dynamic mode selection scheme with flexible duplexing based on reinforcement learning is proposed. Load-aware techniques have been studied in [14] and [15], where the authors proposed load-aware system utility functions based on their assumed scenarios to reflect the performance improvement according to the traffic loads. Specifically, in this paper, reinforcement learning, which maximizes the expected return in a dynamic environment, is used to optimize the working mode of each RAU for uplink reception or downlink transmission. Q-learning is a classic reinforcement learning method that does not require a deep neural network for function approximation [16]-[19].
The load-aware dynamic mode selection scheme based on Q-learning learns a strategy for approaching the objective and adapts well to environment dynamics. The proposed algorithms can be used in practical RAU mode selection scenarios with limited computation power. We first propose a centralized Q-learning algorithm that views all RAUs as a single agent. However, the Q-table of this algorithm grows explosively in size. We therefore further propose a distributed multi-agent Q-learning method, which views each RAU as an independent agent and thus requires less storage and has lower complexity.
The main contributions of this paper are highlighted as follows:
• The dynamic mode selection scheme is hard to model in practical scenarios. To determine the RAUs' working modes, two binary assignment vectors $x_u, x_d \in \{0, 1\}^{M \times 1}$ are used to model the mode selection problem. To improve the resource utility and access efficiency of the system, a load-aware dynamic mode selection scheme is further proposed.
• A utility function is defined to reflect both the performance improvement in resource allocation and the associated overhead costs of any coalition formation. The defined utility function leads the user equipments (UEs) to preferentially select RAUs with fewer resource block (RB) resources, provided that their own QoS is satisfied, thereby improving the utilization of RBs.
• Q-learning is a classic reinforcement learning method that does not require a deep neural network for function approximation, so it can be used in practical RAU mode selection scenarios with limited computation power. A load-aware dynamic mode selection scheme based on centralized Q-learning is proposed to solve the resource utilization problem.
• We further propose a distributed multi-agent Q-learning method to avoid the explosive growth of the Q-table in centralized Q-learning. The distributed algorithm views each RAU as an independent agent and thus requires less storage and has lower complexity. The effectiveness of the proposed scheme is verified through simulations.

[...] RAUs to uplink and downlink mode. The fixed mode encounters bottlenecks in the use of UL and DL resources, so dynamic RAU mode selection plays a significant role in improving the system performance. To improve the resource utility and access efficiency of the system, a load-aware dynamic mode selection scheme is further proposed. To determine the RAUs' working modes, two binary assignment vectors $x_u, x_d \in \{0, 1\}^{M \times 1}$ are used to model the mode selection problem. $x_{u,i}$ ($x_{d,j}$) is equal to 1 if RAU $i$ ($j$) is used for UL reception (DL transmission), and equal to 0 otherwise. This paper assumes that all the antennas of the same RAU work in the same mode and that an RAU is either uplink or downlink only, which means $x_u + x_d = \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector. We define $X_u = \mathrm{diag}(x_u)$ and $X_d = \mathrm{diag}(x_d)$, and derive the effective received signal, the transmitted signal and the load-aware mode selection model.
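As a minimal sketch of the assignment model above, the following Python snippet builds $x_u$ and $x_d$ from a list of per-RAU modes and applies $\mathrm{diag}(x_u)$ to one coefficient per RAU. The function names, the example channel values and the single-antenna-per-RAU simplification are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def assignment_vectors(modes: List[str]) -> Tuple[List[int], List[int]]:
    """Build the binary vectors x_u, x_d from per-RAU modes ('UL' or 'DL')."""
    x_u = [1 if m == "UL" else 0 for m in modes]
    x_d = [1 if m == "DL" else 0 for m in modes]
    # Each RAU is either uplink or downlink, so x_u + x_d is the all-ones vector.
    assert all(u + d == 1 for u, d in zip(x_u, x_d))
    return x_u, x_d


def apply_diag(x: List[int], h: List[complex]) -> List[complex]:
    """Compute diag(x) @ h for one coefficient per RAU (N = 1 assumed)."""
    return [xi * hi for xi, hi in zip(x, h)]


modes = ["UL", "DL", "DL", "UL"]
x_u, x_d = assignment_vectors(modes)
h_u = [0.3 + 0.1j, -0.2j, 0.5 + 0j, 0.1 - 0.4j]  # hypothetical channel values
h_u_eff = apply_diag(x_u, h_u)  # entries belonging to downlink RAUs become zero
```

Multiplying by the diagonal selection matrix simply zeroes the channel entries of RAUs that are not in the corresponding mode, which is why a single pair of binary vectors is enough to describe any UL/DL split.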

B. DOWNLINK SIGNAL MODEL
For downlink transmission, the baseband signals are compressed by the central processing unit (CPU) and conveyed to each downlink RAU through the downlink fronthaul links. The downlink RAUs decompress the downlink signals received from the CPU and then forward them to the downlink users.
For downlink transmission, in each scheduled time slot, $M_d$ downlink RAUs jointly send signals to $K_d$ downlink users. Specifically, the signal received by DL user $j$ can be expressed as [...] $] = 1$ by UL user $i$, $g_{i,j}$ denotes the interfering channel vector between the UL transmitter user equipment (UE) $i$ and the DL receiver UE $j$, and $\eta_{d,j} \sim \mathcal{CN}(0, \sigma_d^2)$ is additive white Gaussian noise at DL user $j$. Then, the signal-to-interference-plus-noise ratio (SINR) of downlink user $j$ can be expressed as [...] where $R_{D,j}$ represents the downlink rate, which can be denoted as $R_{D,j} = \log_2(1 + \gamma_{d,j})$.

C. UPLINK SIGNAL MODEL
For uplink transmission, each uplink RAU compresses the received signals that are transmitted by uplink users and sends them to the CPU. The CPU receives the compressed signals that are transmitted by all uplink RAUs and then performs joint decoding of all uplink users based on the received compressed signals.
For uplink transmission, all uplink RAUs jointly receive signals from UL users. The received signal can be expressed as [...] where $\bar{h}_{u,i} = X_u h_{u,i}$ and $\bar{G}_I = X_u G_I X_d$ denote the effective UL channel vector and the effective channel between uplink RAUs and downlink RAUs, respectively, $\bar{\eta}_u = X_u \eta_u$ denotes the effective noise, [...] denotes the channel vector between the transmitting UL user $i$ and all the uplink RAUs, and $G_I \in \mathbb{C}^{M_u N \times M_d N}$ is the real interference channel matrix between downlink RAUs and uplink RAUs. In practice, we assume that the channel state information (CSI) between uplink RAUs and downlink RAUs is imperfect due to channel estimation errors. Specifically, we model the inter-RAU channel as [...] where $\hat{G}_{IRI}$ denotes the estimated channel and $\tilde{G}_{IRI}$ denotes the channel estimation error. Then, the SINR for each uplink user $i$ can be expressed as [...] where $\mu_i$ denotes the interference between DL users and UL users, and [...] and $v_{u,i}$ are the variance of the total interference plus noise in the DL and the corresponding receiver vector, respectively. $R_{U,i}$ represents the uplink rate, which can be denoted as $R_{U,i} = \log_2(1 + \gamma_{u,i})$.

D. MODEL FOR THE PROPOSED LOAD-AWARE DYNAMIC MODE SELECTION SCHEME
The load of the system is an important indicator, as it greatly affects the performance of multi-connectivity. In multi-connectivity, the user equipments should, to the extent possible, avoid accessing RAUs with higher load in order to achieve system load balancing. To improve the resource utility of the system, a load-aware dynamic mode selection scheme is proposed. We assume that each UE requires a certain number of resource blocks (RBs) to meet its QoS. RAU $m$ has $k$ RBs to allocate, denoted as $\lambda_m = \{\lambda_{m,1}, \lambda_{m,2}, \ldots, \lambda_{m,k}\}$, and its total transmit power is $P_m$. Assuming evenly allocated transmission power, the power of each RB is $p_m = P_m / k$. If the number of RBs allocated by RAU $m$ to UE $i$ is $n_{m,i}$, the power allocated by RAU $m$ to UE $i$ is $p_{m,i} = n_{m,i} p_m$. Considering the traffic load, the achievable rate of UE $i$ associated with RAU $m$ can be expressed as [...] where $b$ represents the bandwidth of each RB [20], [21]. It can be seen that the user rate is related to the number of RBs that can be used. Each user has different service requests and different QoS requirements. The RAU guarantees the users' QoS by allocating a certain number of RBs to different users. To meet the QoS needs of different users, let $t_i^{req}$ denote the bandwidth request of user $i$: $n_{m,i} b \log_2$ [...] It can be obtained that the number of RBs provided by RAU $m$ for user $i$ is [...]
The load-aware methods in [14] and [15] both proposed load-aware utility functions suitable for their research scenarios. A utility function should reflect both the performance improvement in resource allocation and the associated overhead costs of any coalition formation. We propose the utility function for UE $i$ as $U_i$ [...] The utility of the UE at each time slot depends on the load of the associated RAUs and the number of RBs allocated to it. If the RAU cannot guarantee the QoS of UE $i$, the RAU does not provide RBs for UE $i$. In this case, the value of the utility function for UE $i$ is $U_i = 0$. In Fig. 2, we present the trend of the utility function when the number of RBs of an RAU is $k = 20$, $k = 25$, $k = 30$ and $k = 40$. We assume that the number of RBs allocated to the users associated with this designated RAU is 10. According to the nature of the logarithmic function, when the load of the entire network is too large, UEs will preferentially select RAUs with fewer RB resources, provided that their own QoS is satisfied, thereby improving the utilization of RBs. The UEs' utility value decreases as the overall network load increases. Also, as the number of RBs of an RAU increases, the value of the utility function goes up. Combined with the uplink and downlink models of the NAFD scheme, the utility functions for downlink user $j$ and uplink user $i$ can be expressed, respectively, as [...], and $\bar{\beta}_{U,i} = \beta_{U,i} X_u$.
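The RB bookkeeping described in this subsection can be sketched as follows. This is only an illustration under stated assumptions: the per-RB SINR is passed in as a fixed number, the RB count is taken as the smallest integer meeting the request $t_i^{req}$, and the helper names are hypothetical. The paper's exact rate and utility expressions are not reproduced here.

```python
import math


def per_rb_power(total_power: float, k: int) -> float:
    """Evenly split the RAU transmit power P_m over its k RBs: p_m = P_m / k."""
    return total_power / k


def required_rbs(t_req: float, b: float, sinr_per_rb: float) -> int:
    """Smallest n with n * b * log2(1 + SINR) >= t_req (assumed rounding rule)."""
    rate_per_rb = b * math.log2(1.0 + sinr_per_rb)
    return math.ceil(t_req / rate_per_rb)


def allocate(t_req: float, b: float, sinr_per_rb: float, free_rbs: int) -> int:
    """Return the number of RBs granted, or 0 if the RAU cannot meet the QoS."""
    n = required_rbs(t_req, b, sinr_per_rb)
    return n if n <= free_rbs else 0


# Hypothetical numbers: 180 kHz RBs, per-RB SINR of 3, a 1 Mbit/s request.
n_needed = required_rbs(t_req=1e6, b=180e3, sinr_per_rb=3.0)
granted_ok = allocate(1e6, 180e3, 3.0, free_rbs=10)
granted_fail = allocate(1e6, 180e3, 3.0, free_rbs=2)
```

Returning 0 when the request cannot be met mirrors the rule in the text that an RAU provides no RBs, and the UE obtains zero utility, when its QoS cannot be guaranteed.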

E. PROBLEM FORMULATION
We aim to maximize the users' utility function considering the traffic load as follows:
[...] $n_{m,i} b \log_2$ [...]
where $P_{D,j}$ and $P_{U,i}$ are the power consumption budgets for downlink RAU $j$ and uplink user $i$. The binary assignment variables must be either 0 or 1; that is, each RAU works in either the downlink or the uplink working mode. Constraint (15) represents the QoS needs of different users.

III. PROPOSED LOAD-AWARE DYNAMIC MODE SELECTION ALGORITHM BASED ON REINFORCEMENT LEARNING
Two load-aware dynamic mode selection algorithms based on reinforcement learning for NAFD cell-free large-scale distributed MIMO systems are proposed in this part. One is based on centralized Q-learning, and the other is based on distributed multi-agent Q-learning. Reinforcement learning (RL) usually contains five elements: environment, agent, state, action and reward. The agent learns by constantly interacting with the environment and acts on the basis of the observed values combined with its own experience; this mapping is also called a strategy. The state of the environment is affected by the specific action taken by the agent. The agent obtains two pieces of information from the changing environment, the observations and the reward, so that it can perform new actions based on new observations. The usefulness of performing an action is represented by a numerical value known as the Q-value. The expectation of the long-term return generated by certain strategic actions, given the current state $s_t$ and action $a_t$, is called the state-action value function [...] where $\gamma$ denotes the discount factor. We then map our scenario to the key elements mentioned above.
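To make the notion of a discounted long-term return concrete, the sketch below computes the return of one sampled reward sequence; the state-action value is the expectation of this quantity over trajectories that start with $(s_t, a_t)$. The indexing convention (the first reward is undiscounted) is an assumption, since the displayed definition did not survive in this text.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k] for one sampled trajectory."""
    g = 0.0
    # Accumulate from the last reward backwards: g <- r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

A discount factor closer to 1 weights future rewards more heavily, while a value near 0 makes the agent short-sighted.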

A. PROPOSED CENTRALIZED Q-LEARNING ALGORITHM FOR LOAD-AWARE DYNAMIC MODE SELECTION
Q-learning is a classic method of reinforcement learning. The purpose of Q-learning is to establish a Q-table with "state" as the row and "action" as the column, and to continuously update the Q values in the Q-table through the rewards brought by each action, so as to obtain the Q value for a specific action in a specific state. The behavior policy in Q-learning is the ε-greedy strategy, which balances exploration and exploitation, whereas the evaluation policy used when updating the Q-table is the greedy strategy, that is, the best action is always recorded in the Q-table. Q-learning is off-policy because its behavior policy and evaluation policy are not the same. In this paper, the proposed centralized Q-learning algorithm treats all RAUs as a whole to assign the working mode of each of them in the system.

1) AGENT
In the proposed centralized Q-learning algorithm, all RAUs together are taken as the single agent of the reinforcement learning framework. The agent makes decisions based on its observations of the environment, including the adaptive selection of the working modes of the RAUs. In particular, the agent accumulates experience and adjusts its action strategy accordingly.

2) STATE
Define the finite set of states as $S$. A $1 \times M$ one-dimensional array is used to represent the state of the environment, $s_t = [x_1, x_2, x_3, \ldots, x_M]$. Each $x_m$ can be either 0 or 1, where 0 means that RAU $m$ is working in uplink reception and 1 means that RAU $m$ is working in downlink transmission. At each time slot, each RAU works in either UL reception or DL transmission.
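One simple way to use such a binary state array as a Q-table row is to read it as a binary number. The sketch below shows this mapping and its inverse; the bit ordering (with $x_1$ as the most significant bit) and the function names are assumptions chosen for illustration.

```python
from typing import List


def state_to_index(state: List[int]) -> int:
    """Map s_t = [x_1, ..., x_M] (0 = UL, 1 = DL) to a Q-table row index."""
    index = 0
    for bit in state:
        index = (index << 1) | bit  # x_1 ends up as the most significant bit
    return index


def index_to_state(index: int, m: int) -> List[int]:
    """Inverse mapping: recover the length-M binary state from a row index."""
    return [(index >> (m - 1 - i)) & 1 for i in range(m)]


row = state_to_index([0, 1, 1])
state = index_to_state(row, 3)
```

With $M$ RAUs this mapping produces row indices from 0 to $2^M - 1$, which is why the number of rows of the centralized Q-table doubles with every additional RAU.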

3) ACTION
Since each RAU has only two working modes, an action can simply change a UL RAU into a DL RAU, or a DL RAU into a UL RAU, by performing a bitwise XOR operation. The agent selects one of the following actions in the current state $s_t$: "RAU 1 changes its original working mode", ..., "RAU M changes its original working mode". Hence, $M$ actions are used to model the mode selection scheme. In this way, each RAU's working mode can be switched between the uplink and downlink modes according to the strategy taken by the agent.
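The XOR-based action can be sketched in a few lines. The function name and the zero-based RAU index are illustrative assumptions.

```python
from typing import List


def apply_action(state: List[int], action: int) -> List[int]:
    """Flip the working mode of RAU `action` (0-based) with a bitwise XOR."""
    next_state = list(state)  # copy so the caller's state is not modified
    next_state[action] ^= 1   # 0 (UL) becomes 1 (DL) and vice versa
    return next_state


s0 = [0, 1, 0]
s1 = apply_action(s0, 0)
s2 = apply_action(s1, 0)  # applying the same action twice restores the state
```

Because every action toggles exactly one RAU, any assignment can be reached from any other in at most $M$ steps.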
To obtain the optimal reward, exploitation and exploration must be balanced appropriately, because the agent generally does not possess sufficient information about the environment. With probability $\varepsilon(e)$, the agent selects a random action, and with probability $1 - \varepsilon(e)$, the agent chooses the action with the highest Q-value. If the agent is too greedy, it easily becomes trapped in a locally optimal solution. When the Q function is first trained, $\varepsilon$ must be large. As the agent becomes more confident about the estimated Q values, $\varepsilon$ is gradually decreased. Hence, the decayed ε-greedy policy is used as follows:
$\varepsilon(e) = \varepsilon_{first} (1 - \varepsilon_{first})^{e / (\phi \times |action|)}$ (16)
where $e$ denotes the current episode index, $\varepsilon_{first}$ is the initial value of $\varepsilon$, $\phi$ denotes an exploration parameter that controls the attenuation rate of $\varepsilon$, and $|action|$ is the size of the action set.
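A small sketch of the decayed ε-greedy rule follows. The decay expression is read from the $\varepsilon(e)$ line listed in Algorithm 1, interpreting the flattened fraction as the exponent $e / (\phi \times |action|)$; the function names and the tie-breaking rule (first maximal action) are assumptions.

```python
import random
from typing import Optional, Sequence


def decayed_epsilon(e: int, eps_first: float, phi: float, n_actions: int) -> float:
    """epsilon(e) = eps_first * (1 - eps_first) ** (e / (phi * n_actions))."""
    return eps_first * (1.0 - eps_first) ** (e / (phi * n_actions))


def select_action(q_row: Sequence[float], epsilon: float,
                  rng: Optional[random.Random] = None) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                     # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])    # exploit


eps_start = decayed_epsilon(0, eps_first=0.9, phi=10.0, n_actions=4)
eps_later = decayed_epsilon(40, eps_first=0.9, phi=10.0, n_actions=4)
greedy = select_action([0.1, 0.7, 0.3], epsilon=0.0)
```

A larger $\phi$ slows the decay, so the agent keeps exploring for more episodes before it settles on the greedy choice.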

4) REWARD
Our goal is to maximize the users' utility function considering the traffic load, so the reward is assigned as the sum of the users' utility functions as expressed in (11). The learning process is driven by the reward function in the RL framework, and the system performance can be improved when the reward function for each step is designed in line with the desired objective. The Q-learning algorithm selects the action that can achieve the maximum reward based on the state-action value $Q(s_t, a_t)$. Q-learning uses the current return and the estimated value of the next moment, obtained by taking the action that maximizes that value, to estimate the value of the current moment. The Q value is updated by the following formula:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$ (17)
where $\alpha$ denotes the learning rate. The specific procedures are summarized in Algorithm 1. In the general RL setting, an episode is one complete run of the agent interacting with the environment.
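The update described above is the standard tabular Q-learning step, which can be sketched as below. The Q-table is represented as a list of rows, and the numbers in the example are hypothetical.

```python
from typing import List


def q_update(q: List[List[float]], s: int, a: int, reward: float,
             s_next: int, alpha: float, gamma: float) -> float:
    """Apply one tabular Q-learning update in place and return the new value."""
    # Target: current reward plus the discounted best value of the next state.
    td_target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]


q_table = [[0.0, 0.0], [1.0, 2.0]]
new_value = q_update(q_table, s=0, a=1, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

The use of `max` over the next state's row, regardless of which action is actually taken next, is what makes the method off-policy.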

B. PROPOSED DISTRIBUTED MULTI-AGENT Q-LEARNING ALGORITHM FOR LOAD-AWARE DYNAMIC MODE SELECTION
The concept of a distributed algorithm for multi-agent optimization first appeared in [22], where the authors studied a scenario in which agents cooperatively minimize a common additive cost with no constraint. Multi-agent reinforcement learning has been used in mobile D2D networks [23] and UAV networks [24]. A distributed Q-learning algorithm for the dynamic resource allocation problem with unknown cost functions and unknown resource transition functions was studied in [25]. Algorithms based on Q-learning perform well in dynamic environments where the users can move randomly in the coverage area and the channel conditions between the RAUs and UEs vary.
In centralized Q-learning, all possible cases that the agent can explore must be considered. Hence, the size of the Q-table is determined by the sizes of the state and action sets, $Q_{table\text{-}size} = |state|_{size} \times |action|_{size}$, and grows exponentially with the number of RAUs. In our scenario, the state of the environment is denoted as $s_t = [x_1, x_2, x_3, \ldots, x_M]$, where each $x_m$ can be either 0 or 1. Therefore, the size of our state set is $2^M$, and the size of the Q-table for the centralized Q-learning algorithm is $2^M \times M$. By contrast, in distributed multi-agent Q-learning, each agent generates its Q-table considering only its own state set and action set. Therefore, the total Q-table size of distributed multi-agent Q-learning is $Q^{dis}_{table\text{-}size} = N_{agent} \times |state|_{own\text{-}size} \times |action|_{own\text{-}size}$, where $N_{agent}$ is the number of agents.
In the proposed distributed multi-agent Q-learning algorithm for load-aware dynamic mode selection, each RAU is regarded as an agent, so $M$ RAUs correspond to $M$ agents. The state set of each RAU is denoted as $s_t = \{s_1, s_2\}$, where $s_1$ denotes the RAU working in uplink reception and $s_2$ denotes the RAU working in downlink transmission. The action taken by RAU $m$ is either "RAU $m$ changes its original working mode from uplink to downlink or from downlink to uplink" or "RAU $m$ stays in its original working mode". The decayed ε-greedy policy in (16) is used in action selection. Consequently, the size of the Q-table of the proposed distributed multi-agent Q-learning algorithm is $M \times 2 \times 2$. Because distributed multi-agent Q-learning does not require a deep neural network for function approximation, the proposed algorithm can be used in practical massive terminal access scenarios with limited computation power. Furthermore, the distributed multi-agent Q-learning method alleviates the explosion of the Q-table size when a large number of terminals access large-scale RAUs, and it is highly scalable in massive access scenarios.
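The storage gap between the two designs follows directly from the sizes stated above and can be checked with a few lines of arithmetic. The function names are illustrative.

```python
def centralized_q_size(m: int) -> int:
    """Entries in the centralized Q-table: 2^M states times M actions."""
    return (2 ** m) * m


def distributed_q_size(m: int) -> int:
    """Total entries over M agents, each with 2 states and 2 actions."""
    return m * 2 * 2


sizes = {m: (centralized_q_size(m), distributed_q_size(m)) for m in (4, 10, 20)}
```

For $M = 10$ this gives 10240 entries for the centralized table against 40 for the distributed one, and for $M = 20$ it gives 20971520 against 80, which illustrates why the centralized table quickly becomes impractical to store.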
When the state of an RAU indicates that its working mode is uplink reception, the reward of the proposed distributed multi-agent Q-learning framework is defined as in (18). When the state of the RAU indicates that its working mode is downlink transmission, the reward is defined as in (19). The Q value in the Q-table is updated according to (17). The specific procedures are summarized in Algorithm 2.
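A minimal sketch of one time slot of the distributed scheme is given below, with each RAU holding its own $2 \times 2$ Q-table (state: current mode, action: stay or switch). The reward function `toy_reward` is a hypothetical stand-in for the rewards in (18) and (19), which are not reproduced here, and the fixed exploration rate is a simplification of the decayed policy.

```python
import random
from typing import Callable, List


def distributed_step(q_tables: List[List[List[float]]], modes: List[int],
                     reward_fn: Callable[[int, List[int]], float],
                     epsilon: float, alpha: float, gamma: float,
                     rng: random.Random) -> List[int]:
    """One time slot: every RAU agent acts on and updates its own Q-table."""
    actions = []
    for q, mode in zip(q_tables, modes):
        if rng.random() < epsilon:
            actions.append(rng.randrange(2))                      # explore
        else:
            actions.append(0 if q[mode][0] >= q[mode][1] else 1)  # exploit
    # Action 0 keeps the mode, action 1 switches between UL (0) and DL (1).
    next_modes = [m ^ a for m, a in zip(modes, actions)]
    for i, q in enumerate(q_tables):
        r = reward_fn(i, next_modes)
        target = r + gamma * max(q[next_modes[i]])
        q[modes[i]][actions[i]] += alpha * (target - q[modes[i]][actions[i]])
    return next_modes


def toy_reward(i: int, modes: List[int]) -> float:
    """Hypothetical reward: prefers an even split of UL and DL RAUs."""
    return -abs(sum(modes) - len(modes) / 2)


rng = random.Random(0)
M = 4
q_tables = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(M)]
modes = [0] * M
for _ in range(200):
    modes = distributed_step(q_tables, modes, toy_reward, 0.2, 0.1, 0.9, rng)
```

Each agent touches only its own four Q-values per slot, so memory grows linearly with the number of RAUs; the price is that every agent sees the other agents' behaviour only through the reward.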

C. COMPLEXITY ANALYSIS
Reinforcement learning provides a robust way to handle environment dynamics and perform sequential decision making by constantly interacting with the uncertain environment, which reduces the computational complexity. The time complexities of the proposed centralized Q-learning and distributed multi-agent Q-learning algorithms are $O(E_{max} K M)$ and $O(E_{max} K)$, respectively, where $E_{max}$ is the number of episodes required for the algorithm to converge, and $K$ and $M$ are the numbers of users and RAUs, respectively. Both are far lower than that of the exhaustive search, whose complexity is $O(2^M)$. The training phase of reinforcement learning is offline, so the training time need not be considered in practical deployment; the optimal assignment of RAUs can be read directly from the trained Q-table. As for storage, the centralized Q-learning algorithm requires a $2^M \times M$ Q-table, while distributed multi-agent Q-learning requires only $M$ Q-tables of size $2 \times 2$, that is, $M \times 2 \times 2$ entries in total. As demonstrated above, the size of the Q-table is greatly reduced in distributed multi-agent Q-learning, without requiring a complicated trained deep neural network, which makes it better suited to massive terminal access scenarios in practice.

IV. NUMERICAL RESULTS AND DISCUSSION
In this section, an NAFD cell-free large-scale distributed MIMO system is considered, in which the $M$ RAUs are distributed in a circular area. The system contains $K$ randomly distributed users, including $K_u$ uplink users and $K_d$ downlink users. Each RAU is equipped with $N$ half-duplex antennas. The detailed simulation parameters are listed in Table 1.

Algorithm 1 Proposed Load-Aware Dynamic Mode Selection Algorithm Based on Centralized Q-Learning
Input:
• Initialization: generate the state $s_t(\cdot)$ as a $1 \times M$ one-dimensional zero array, create a Q-table of size $2^M \times M$, initialize $Q(\cdot)$, $\varepsilon$, $h_{u,i}$, $h_{d,j}$, $w_{d,m}$, $g_{i,j}$, $G_I$, $\gamma$, $\alpha$, and randomly place $K_u$ uplink users and $K_d$ downlink users.
Repeat:
for every episode do
    $\varepsilon(e) = \varepsilon_{first} (1 - \varepsilon_{first})^{e / (\phi \times |action|)}$.
    Determine an action: [...]
    The state changes to the next state by taking the action.
    Calculate the reward according to (11), max [...]
    Update the Q-table according to (17).
end for
Return the optimal solutions of the state and the optimal reward, which correspond to the best assignment of UL/DL RAUs and the largest value of the utility function, respectively.
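The steps of Algorithm 1 can be sketched as a small tabular training loop. This is an illustrative sketch rather than the authors' implementation: `toy_reward` is a hypothetical stand-in for the utility sum in (11), the number of steps per episode and all hyperparameter values are assumptions, and the state is stored as an integer whose bit $a$ holds the mode of RAU $a$.

```python
import random
from typing import Callable, List, Tuple


def train_centralized(m: int, reward_fn: Callable[[int, int], float],
                      episodes: int = 500, steps: int = 20,
                      eps_first: float = 0.9, phi: float = 5.0,
                      alpha: float = 0.1, gamma: float = 0.9,
                      seed: int = 0) -> Tuple[int, float, List[List[float]]]:
    """Tabular Q-learning over all 2^M RAU assignments with M flip actions."""
    rng = random.Random(seed)
    q = [[0.0] * m for _ in range(2 ** m)]        # 2^M x M Q-table
    best_state, best_reward = 0, reward_fn(0, m)
    for e in range(episodes):
        # Decayed epsilon, as listed in Algorithm 1.
        eps = eps_first * (1.0 - eps_first) ** (e / (phi * m))
        s = 0                                      # all-zero initial state
        for _ in range(steps):
            if rng.random() < eps:
                a = rng.randrange(m)               # explore
            else:
                a = max(range(m), key=lambda k: q[s][k])  # exploit
            s_next = s ^ (1 << a)                  # flip the mode of RAU a
            r = reward_fn(s_next, m)
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            if r > best_reward:                    # track the best assignment seen
                best_state, best_reward = s_next, r
            s = s_next
    return best_state, best_reward, q


def toy_reward(state: int, m: int) -> float:
    """Hypothetical reward: favours an even UL/DL split of the M RAUs."""
    n_dl = bin(state).count("1")
    return -abs(n_dl - m / 2)


best_state, best_reward, q = train_centralized(4, toy_reward)
```

In a real system the reward call would evaluate the users' utility for the candidate assignment, which is where the channel and traffic-load information initialized in Algorithm 1 enters.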
In this study, we consider four conventional schemes as benchmarks against which the performance of the proposed algorithms is compared.
• Average RAU scheme: This scheme randomly splits the RAUs equally into half uplink RAUs and half downlink RAUs.
• TDD scheme: The TDD scheme is the time division duplex mode.
• Random scheme: This scheme randomly chooses an assignment of RAUs in each scenario.
• Exhaustion scheme: The exhaustive search provides an optimal assignment of RAUs at very high computational complexity. Comparing the proposed reinforcement learning algorithm with the exhaustive search makes it possible to verify whether the proposed algorithm converges to the optimal solution.
• Centralized Q-learning scheme and Distributed multi-agent scheme: The two schemes refer to the proposed reinforcement learning algorithms mentioned above.
Most previous works on FD systems fix the assignment of antennas or RAUs and then analyze the system performance on this basis. The mode selection of RAUs can significantly improve the utilization of UL/DL resources, and the load-aware dynamic mode selection scheme improves the access efficiency of massive terminal communications and enhances the transmission reliability of dynamic access links by taking the traffic loads into consideration.
Algorithm 2
[...]
        The state changes to the next state by taking the action.
    end for
    Calculate the reward according to (18).
end if
if state detected as downlink transmission then
    for i = 1 : $K_d$ do
        Determine an action for each RAU according to the decayed ε-greedy policy.
        The state changes to the next state by taking the action.
    end for
    Calculate the reward according to (19).
end if
Calculate the sum reward: [...]
Update the Q-table according to (17).
end for
Return the optimal solutions of the state and the optimal reward, which correspond to the best assignment of UL/DL RAUs and the largest value of the utility function, respectively.

Fig. 3 and Fig. 4 illustrate the accumulated average reward as the episodes progress. The optimal rewards were obtained through an exhaustive search over the entire search space. Fig. 3 was obtained under ZF precoding, while Fig. 4 was obtained under MRT precoding. The optimal value is plotted as a constant to show the rate of convergence and the optimality of the converged solutions. As shown in the figures, the proposed Q-learning algorithm converges quickly and has similarly good convergence performance for different precoding schemes, which demonstrates the robustness of the algorithm.
Fig. 5 is the CDF of the utility function of different schemes under ZF precoding. Here, we generated 1000 scenarios of randomly distributed users with their resource requirements. The results indicate that the load-aware dynamic mode selection schemes using the proposed centralized Q-learning and distributed multi-agent Q-learning have similar performance. The gain provided by the load-aware dynamic mode selection scheme over the simple equal split of RAUs and the random assignment scheme is approximately 15% at the probability of 0.5, and it is only 7% poorer than the optimal solution. It also provides nearly 56% gain compared with the TDD scheme.
Fig. 6 is the CDF of the throughput of different schemes. The throughput of the system is calculated under the premise that the utility function of each scheme has reached its maximum value. The red line, which indicates the maximum throughput, is obtained by maximizing the SE through exhaustive search under the condition of satisfying the users' QoS. As shown in Fig. 6, the proposed load-aware mode selection scheme based on Q-learning achieves nearly the same throughput performance as the exhaustive search method, with gains of 16% over the random scheme, 22% over the average RAU scheme and 45% over the TDD scheme at the probability of 0.9. This shows that the mode selection of RAUs can significantly improve the throughput performance of the system. In game theory, a Nash equilibrium of a game is a list of strategies, one for each player, such that no player can get a better payoff by switching to some other available strategy while all other players adhere to the strategies specified for them in the list. In our scenario, each RAU's strategy is the best response to the other RAUs' strategies. Hence, the throughput obtained by maximizing the utility function is no larger than the globally maximized throughput. Instead, each RAU has made a "no regrets" decision. Maximizing the utility function guides users to prefer RAUs with fewer RB resources, provided that their own QoS is met, thus improving the utilization of RBs. With the traffic load and the QoS of users considered, the load-aware dynamic mode selection scheme helps to make better use of system resources, thus improving the efficiency of user access.
Fig. 7 compares the resource utility of the load-aware schemes and the non-load-aware scheme. Our proposed load-aware dynamic mode selection scheme based on Q-learning gains 13% in resource utility compared with the non-load-aware mode selection scheme at the probability of 0.5. This shows that the load-aware schemes can significantly improve the resource utility of the system.
Fig. 8 shows the value of the utility function versus the number of RAUs. We assume that the numbers of UL and DL users are both 10. The fixed mode refers to equally splitting the RAUs into uplink and downlink working modes.
The results indicate that the proposed load-aware dynamic mode selection schemes based on centralized Q-learning and distributed Q-learning achieve nearly the same performance, while distributed Q-learning additionally offers advantages in computation and storage. The exhaustive method obtains the theoretically optimal solution at high complexity. As the number of RAUs increases, the values of the utility function increase under these three schemes. The proposed load-aware dynamic mode selection scheme achieves better performance than the fixed mode scheme and gets closer to the exhaustive method as the number of RAUs grows. Fig. 9 shows the value of the utility function versus the number of UL/DL UEs. The numbers of uplink and downlink users are equal and correspond to the numbers on the horizontal axis. The values of the utility function do not fluctuate much as the numbers of UL/DL UEs vary, which indicates that our proposed load-aware dynamic mode selection scheme is robust even when the number of UEs increases.
With the access of a large number of terminals, it is essential to find a flexible mode selection scheme that can maximize the use of system resources. The load-aware dynamic mode selection scheme based on centralized Q-learning learns a strategy for approaching the objective on its own and adapts well to environment dynamics, while the distributed multi-agent Q-learning method reduces the computational complexity by using a distributed approach. The distributed method is more scalable and therefore more suitable for practical scenarios. The proposed Q-learning algorithms are a reasonable approach for obtaining a near-optimal solution rapidly and with low complexity.

V. CONCLUSION
In this paper, a load-aware dynamic mode selection scheme for RAUs was studied in NAFD cell-free large-scale distributed MIMO systems. A utility function was proposed to measure the utilization of the system traffic loads. Centralized Q-learning and distributed multi-agent Q-learning algorithms with different complexities were investigated. Through extensive simulations, we demonstrated that the proposed algorithms outperform conventional schemes, i.e., the equal split of the RAUs, the random scheme and the TDD scheme, with far lower complexity than the exhaustive search method. The load-aware dynamic mode selection can better exploit the system resources and enhance the performance. The distributed multi-agent Q-learning algorithm is shown to be more suitable for practical scenarios, with smaller storage requirements and lower complexity. In real-time operation, the NAFD scheme may incur CSI overhead and be constrained by limited fronthaul capacity in scenarios with a large number of RAUs and users. Therefore, in future research, clustering algorithms for APs and UEs will be considered to further investigate scalable mechanisms for our systems.