A Machine Learning Approach for Beamforming in Ultra Dense Network Considering Selfish and Altruistic Strategy

Coordinated beamforming is an efficient means of managing interference in ultra-dense networks. However, the optimal strategy remains challenging to obtain due to the coupling among densely and autonomously deployed cells. In this paper, deep reinforcement learning is investigated for predicting the coordinated beamforming strategy. Formulated as a sum-rate maximization problem, the optimal solution turns out to be a balanced combination of selfish and altruistic beamforming. As the balancing coefficients depend on the beamforming vectors of all the cells, iterations are inevitable to obtain the final solution. To address this problem and improve efficiency, deep learning (DL) is proposed to predict the balancing coefficients. Specifically, an agent, acting on behalf of a base station-user pair, relies on a Deep Q-network to learn the highly complex mapping between the balancing coefficients and the signal-interference environment of each user. Subsequently, the beamforming vectors are obtained efficiently from the learned balancing coefficients. By exploiting the parameterization of the beamforming structure, the complexity of predicting the beamforming matrix directly is avoided. The performance of the proposed scheme is investigated through experiments covering the multiple-input multiple-output configuration, shadow fading, and state design. Simulation results indicate that: 1) the theoretically infinite strategy space can be discretized with a limited number of levels and granularity; 2) it is feasible to approximate the complex mapping by Q-learning for wireless channels comprising both large- and small-scale fading; and 3) the balancing coefficients depend only on large-scale fading, so coordinated beamforming can be decomposed into two sub-problems at different time scales: parameterization at large time scales and instant beamforming based on the balancing coefficients.


I. INTRODUCTION
The ultra-dense network (UDN) is a key enabling technology for future mobile communication systems such as 5G and beyond to meet the exponential growth of data traffic [1] and mobile multimedia services [2], [3]. Within the UDN architecture, a large number of small cells are deployed autonomously around the served users; consequently, system performance in terms of capacity, coverage, and service efficiency can be improved thanks to the ever-decreasing distance between transmitter and receiver. However, as cell density continues to increase, UDNs suffer from severe issues such as strong inter-cell interference and complex mobility management. While the latter can be circumvented by separating the control and data planes, interference management faces greater challenges. To solve this problem, traditional tools such as game theory, graph theory, and optimization methods have been employed. Under the optimization framework, coordination in a UDN can be modeled as a multi-cell utility-maximization problem, and schemes based on joint multi-dimensional resource allocation can then be obtained, for example by convex optimization in some cases. Although coordination based on optimization, such as beamforming or power control within a cell cluster, is promising, the optimal solution is ordinarily difficult to obtain: the denser and larger the UDN, the more numerous and stricter the constraints imposed on network performance and resource usage in the optimization problem. As a compromise, sub-optimal strategies or iterative optimization approaches are usually employed [1], [4]. The associate editor coordinating the review of this manuscript and approving it for publication was Dapeng Wu.
Recently, deep learning (DL) has shown great potential for improving the performance of communication systems. To date, many attempts have been made to apply DL in the physical layer [5] and in resource allocation, such as power control [6], [7] and beamforming [8]-[10]. For instance, [6] applied a fully connected deep neural network (DNN) to approximate the weighted minimum mean square error (WMMSE) power allocation algorithm. Experimental results indicate that the DNN can approximate the WMMSE algorithm with high accuracy and lower computational complexity. Reference [7] proposed to solve the transmit power control problem using a convolutional neural network (CNN), with the objective of maximizing spectrum efficiency (SE) and energy efficiency (EE). Simulation results show that the CNN-based power control method achieves almost the same or even higher SE and EE than conventional power control schemes with much less computation time. Reference [8] dealt with transmitter beamforming based on an outage-based scheme; the proposed work attempts to cope with channel uncertainty at the base station (BS), but only simple point-to-point and single-group multicasting scenarios were considered. In [9], a coordinated beamforming scheme based on a neural network model for mmWave BSs was proposed: the network is fed the omni-received OFDM sequences from the coordinated BSs in the uplink to predict the RF beamforming codeword for the downlink. To meet real-time application requirements, [10] exploits the uplink-downlink duality structure of beamforming and relies on a CNN to predict the virtual uplink power allocation, from which the final beamforming matrix is obtained. Notably, the approaches in [8]-[10] prevent the neural network from predicting the beamforming matrix directly; as a result, complexity is reduced.
Despite the progress made in DL-based coordinated beamforming, such efforts mainly focus on single-cell scenarios; approaches for multiple cells are comparatively scarce in the literature. In our opinion, a multi-cell scheme is more beneficial and practical for UDNs. Motivated by this observation, we propose in this paper to use deep reinforcement learning to perform multi-cell coordinated beamforming.
Traditionally, multi-cell beamforming can take the form of coordination or cooperation. In beamforming coordination, such as Coordinated Beamforming (CB) in Coordinated Multi-Point (CoMP) downlink transmission [14] in LTE-Advanced, the cells first share control information with each other and then apply classical beamforming algorithms such as maximal ratio combining (MRC) [14], zero forcing (ZF) [11], or block diagonalization (BD) [12], [13] to reduce inter-cell interference. In beamforming cooperation, such as Joint Processing (JP) in CoMP [14], the cells share both control and data information, and beamforming algorithms such as virtual SINR (VSINR) [15] can be employed. Moreover, aiming at different system objectives, such as spectrum efficiency (SE) [16], energy efficiency (EE) [17], or content distribution efficiency [18], multi-cell beamforming can be formulated as an optimization problem while guaranteeing quality of experience (QoE). Based on the formulated model, optimal or sub-optimal solutions can be obtained by convex optimization or heuristic methods. For 5G heterogeneous networks with massive MIMO (multiple-input multiple-output), new beamforming structures have appeared, such as the hybrid beamforming (HBF) proposed in [19]. HBF performs beamforming at the analog and digital stages concurrently in order to reduce the complexity and cost caused by the large number of antennas.
Distinct from traditional or single-cell approaches, we focus on coordinated beamforming for a multi-cell UDN based on a DL approach. Specifically, the beamforming is formulated as a system sum-rate maximization problem. As the problem is non-convex, it is solved by characterizing the optimality conditions using the standard theory of Lagrange duality. The resulting optimal beamforming vectors of the cells are coupled with one another; moreover, each vector depends on coefficients that balance the selfish and altruistic strategies. To avoid iterations in calculating these coefficients, we propose to use DL to learn the balancing coefficients that parameterize the beamforming vectors. More precisely, a Deep Q-network is applied to approximate the mapping between the strategy equilibrium and the observations of each user, including the desired channel state information and the interference levels toward other users. Accordingly, each serving cell, on behalf of its scheduled user, acts as an agent, and through deep learning the agent obtains the best strategy offline. Based on the learned strategy and the observed information, the agent automatically searches for and reacts with a beamforming strategy balanced between selfish and altruistic. Moreover, as the balancing coefficients concern only large-scale channel fading, the corresponding beamforming can be decomposed into two sub-problems at different time scales: 1) large-time-scale parameterization at different levels for the MIMO configurations; and 2) instant beamforming based on the balancing coefficients. As a result, the proposed scheme avoids direct estimation of the complex beamforming matrix.
The contributions of this work can be summarized as follows.
1) We propose to use deep reinforcement learning to determine the balancing coefficients of the selfish and altruistic strategies in coordinated beamforming.
2) The mapping function between the observations of each user and the balancing coefficients that determine the beamforming can be learned at large time scales by a Deep Q-network. Based on the learned mapping function, discretized coefficients are properly selected for instant beamforming according to the environment of each base station-user pair.
VOLUME 8, 2020
3) The performance of the proposed scheme is evaluated by simulation experiments covering the MIMO configuration, shadow fading, and state design options. From the simulation results, we find that the balancing coefficients depend on the large-scale fading of the channels; the results also confirm the feasibility and effectiveness of the proposed method.

The paper is organized as follows. Section II introduces the system model. In Section III, the optimization problem and its resulting solution are presented. Section IV is devoted to the deep learning design, and Section V presents simulation results. Finally, Section VI concludes the paper.
Notation: The following notations are used: ||X||_2 denotes the 2-norm of X; X^H the Hermitian transpose of X; |X| the cardinality of a set X; and ⊥ orthogonality.

II. SYSTEM MODEL FOR BEAMFORMING IN UDN SYSTEM
We consider a downlink UDN consisting of N cells, each with one base station (BS) and one served user (UE). For convenience of presentation, the same index is used for a BS and its served user. Further, each BS has N_t antennas and each UE has N_r antennas, while in the multiple-input single-output (MISO) configuration each UE has a single antenna.
We assume frequency reuse one in the system, so there is severe inter-cell interference. To combat it, beamforming coordination across the N_c cells of a cluster is sought to suppress the interference. We denote by H_ki ∈ C^{N_r×N_t} the channel from BS i to UE k, including both large- and small-scale fading components. The large-scale fading includes path loss and shadowing; the small-scale fading is modeled as an independent identically distributed complex Gaussian process with zero mean and unit variance. The signal-to-interference-plus-noise ratio of UE k, with noise power σ_k^2, is

SINR_k = P |v_k^H H_kk w_k|^2 / (σ_k^2 + Σ_{j≠k} P |v_k^H H_kj w_j|^2),   (1)

where w_k ∈ C^{N_t×1} is the transmit beamforming vector of BS k, v_k ∈ C^{N_r×1} is the receive beamforming vector of UE k, and P is the transmit power of each BS. Assuming the goal is to maximize the sum-rate of the system under the constraint ||w_k||_2^2 = 1, the beamforming vectors can be obtained by solving the optimization problem

max_{V,W} Σ_k R_k   s.t. ||w_k||_2^2 = 1, ∀k,   (2)

where R_k = log2(1 + SINR_k) is the achievable rate of user k, and V and W are the matrices collecting the vectors v_k and w_k, respectively.
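As an illustration of the system model, the sketch below (not the paper's code) evaluates the SINR of (1) and the sum-rate objective of (2) for randomly drawn channels and unit-norm beamformers; all numerical values are arbitrary.

```python
import numpy as np

# Illustrative sketch: H[k][i] is the N_r x N_t channel from BS i to
# UE k; w[k] and v[k] are unit-norm transmit/receive beamformers.
rng = np.random.default_rng(0)
N_c, N_t, N_r, P, noise = 3, 4, 2, 1.0, 0.1

H = [[(rng.standard_normal((N_r, N_t)) + 1j * rng.standard_normal((N_r, N_t)))
      / np.sqrt(2) for _ in range(N_c)] for _ in range(N_c)]
w = [np.linalg.qr(rng.standard_normal((N_t, 1)))[0][:, :1] for _ in range(N_c)]
v = [np.linalg.qr(rng.standard_normal((N_r, 1)))[0][:, :1] for _ in range(N_c)]

def sinr(k):
    """SINR of UE k per (1): desired power over noise plus interference."""
    sig = P * np.abs(v[k].conj().T @ H[k][k] @ w[k]).item() ** 2
    interf = sum(P * np.abs(v[k].conj().T @ H[k][j] @ w[j]).item() ** 2
                 for j in range(N_c) if j != k)
    return sig / (noise + interf)

# Sum-rate objective of (2) with R_k = log2(1 + SINR_k).
sum_rate = sum(np.log2(1.0 + sinr(k)) for k in range(N_c))
print(sum_rate)
```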
Problem (2) is well known to be NP-hard. To circumvent this obstacle, we resort to the standard theory of Lagrange duality and obtain the final solution by characterizing the optimality conditions of the Lagrangian function.

III. COORDINATE BEAMFORMING ALGORITHM BASED ON BALANCED STRATEGY
In this section, the beamforming algorithm based on the optimality conditions of the Lagrangian is first outlined. Then, based on the structure of the obtained solution, different levels of parameterization are presented for the MIMO, MISO, and special 2×1 MISO configurations, respectively.

A. THE COORDINATE BEAMFORMING ALGORITHM
First, define the Lagrangian function of problem (2), where μ_k is the Lagrange multiplier of the constraint on w_k in (2) and μ is the vector of the μ_k. Taking the partial derivative of the Lagrangian with respect to w_k and setting it to zero yields the stationarity condition (5). By rearranging terms, (5) can be further expressed as an eigenvector problem (6), with coefficients λ_jk given in (7). Once the coefficients λ_jk are given, (6) is an eigenvector problem, and the optimal beamforming vector is obtained from (8) as w_k = V_max(X_k), where V_max(X) denotes the eigenvector of matrix X associated with the largest eigenvalue.

B. PARAMETERIZATION OF THE BEAMFORMING FOR MIMO BY BALANCING COEFFICIENTS
It is noted from (8) that the beamforming vector w_k is determined by the coefficients λ_jk, j ≠ k. Given these coefficients, w_k can be obtained directly by an eigenvalue computation. However, (7) indicates that λ_jk depends both on w_k of user k and on w_j, j ≠ k, of the other users; in other words, the optimal beamforming vectors w_k, ∀k, are inherently coupled with each other. To address this issue, w_k is updated iteratively by (8) with the other w_j, j ≠ k, fixed, until convergence. The iteration-based beamforming algorithm is summarized as Algorithm 1 in Table 1.
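The eigenvector step of (8) can be sketched as follows. Since the exact matrix of (6) is not reproduced in this text, the sketch assumes an SLNR-like form for illustration only: w_k is the principal eigenvector of B_k^{-1} h_kk^H h_kk, where B_k penalizes leakage weighted by hypothetical balancing coefficients.

```python
import numpy as np

# Hedged sketch of the eigenvector step (8), under an *assumed*
# SLNR-like matrix for (6): B_k = sigma^2 I + sum_j lam[j] h_jk^H h_jk,
# and w_k is the dominant eigenvector of B_k^{-1} h_kk^H h_kk.
rng = np.random.default_rng(1)
N_t, sigma2 = 4, 0.1
h_kk = (rng.standard_normal((1, N_t)) + 1j * rng.standard_normal((1, N_t))) / np.sqrt(2)
h_jk = [(rng.standard_normal((1, N_t)) + 1j * rng.standard_normal((1, N_t))) / np.sqrt(2)
        for _ in range(2)]
lam = [1.5, 0.8]  # hypothetical balancing coefficients, fixed here

B = sigma2 * np.eye(N_t) + sum(l * h.conj().T @ h for l, h in zip(lam, h_jk))
A = np.linalg.solve(B, h_kk.conj().T @ h_kk)        # B^{-1} h_kk^H h_kk
eigvals, eigvecs = np.linalg.eig(A)
w_k = eigvecs[:, [np.argmax(eigvals.real)]]         # dominant eigenvector
w_k = w_k / np.linalg.norm(w_k)                     # unit-power constraint of (2)
print(np.linalg.norm(w_k))
```

In Algorithm 1, this step would be repeated per user with the other beamformers fixed until convergence; here the coefficients are simply held constant to keep the sketch self-contained.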
From (8), it is also noted that the coefficient λ_jk plays a balancing role in the beamforming coordination strategy. Specifically, for λ_jk = 0, the beamforming vector corresponds to the traditional maximum-ratio transmission (MRT) scheme, which selfishly increases the desired signal of user k with no consideration of the interference caused to the other users j ≠ k. At the other extreme, λ_jk → ∞, the beamforming effort is devoted entirely to decreasing the interference to users in other cells; this corresponds to the classical zero-forcing (ZF) beamforming scheme. Thus, λ_jk in (8) safeguards the system sum-rate by balancing the selfish and altruistic strategies.

C. PARAMETERIZATION OF THE BEAMFORMING FOR MISO
For the MISO configuration, which can be regarded as a simplified case of the optimization problem in (2), the receive combiner is v_k = 1, ∀k. As a result, the optimal beamforming vector w_k in (6) can be rewritten as

w_k = Σ_j ζ_jk h_jk^H,   (9)
where ζ_jk = (μ_k ω_k)^{-1} λ_jk h_jk w_k for j ≠ k and ζ_kk = (μ_k ω_k)^{-1} h_kk w_k are complex numbers, and h_jk ∈ C^{1×N_t} is the channel vector from base station k to user j.
Equation (9) indicates that the optimal beamforming vector w_k in the MISO configuration can be parameterized by the complex numbers ζ_jk; that is, w_k is a weighted combination of the vectors h_jk ∈ C^{1×N_t} with weights ζ_jk. In fact, (9) was pointed out in [21] as a parameterization of the Pareto boundary. The Pareto boundary [21] is the outer boundary of the achievable rate region R, defined as the set of rate tuples achievable by all users simultaneously. By definition, the outer boundary of R consists of the optimal points at which one user's rate cannot be increased without decreasing the rate of another user.
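The parameterization in (9) can be sketched directly: the beamformer is built as a weighted combination of the channel rows and then normalized. The ζ values below are hypothetical, chosen only to illustrate the construction.

```python
import numpy as np

# Sketch of (9): w_k = sum_j zeta_jk * h_jk^H, normalized to unit power.
# The zeta weights here are hypothetical illustration values, not the
# optimal ones from the paper's derivation.
rng = np.random.default_rng(2)
N_t, N_c = 4, 3
h = [(rng.standard_normal((1, N_t)) + 1j * rng.standard_normal((1, N_t))) / np.sqrt(2)
     for _ in range(N_c)]                       # rows h_jk, j = 1..N_c
zeta = [0.9 + 0.1j, -0.2 + 0.05j, 0.1 - 0.3j]   # hypothetical complex weights

w_k = sum(z * hj.conj().T for z, hj in zip(zeta, h))
w_k = w_k / np.linalg.norm(w_k)                 # ||w_k||_2 = 1
print(w_k.shape)
```

The key structural point is that w_k always lies in the span of the (at most N_c) channel vectors, which is what makes a low-dimensional parameterization possible even when N_t is large.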

D. PARAMETERIZATION OF THE BEAMFORMING FOR 2×1 MISO
For the special case of the 2×1 MISO configuration, the parameterization simplifies further. Up to a complex scaling factor γ, the optimal beamforming vectors w_k, k = 1, 2, can be expressed as

w_k = (λ_k w_k^MRT + (1 − λ_k) w_k^ZF) / ||λ_k w_k^MRT + (1 − λ_k) w_k^ZF||,   (13)

where w_k^MRT and w_k^ZF are the MRT and ZF beamforming vectors, respectively, and λ_k, k = 1, 2, are real-valued parameters in the range 0 ≤ λ_k ≤ 1.
Taking k = 1 as an example, w_1^MRT is the matched-filter direction of h_11, while w_1^ZF is the projection of that direction onto the null space of h_21. Equation (13) has been proved, by different methods, in [21] and [22]. It indicates that, for the special case of a two-user MISO system, the optimal beamforming vector is a balanced combination of the MRT and ZF vectors in the beamforming vector space. Looking back at (8) and (9), we find that the beamforming vector obtained from (13) represents the highest level of parameterization in terms of balancing coefficients; the parameterization level ascends from MIMO to the two-user MISO configuration. Based on this observation, we conclude that the balancing coefficients determine the coordination strategy of the beamforming algorithm. However, obtaining the balancing coefficients via (7) inevitably requires iteration among different cells, which increases implementation complexity and introduces additional signaling delay.
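The two-user combination in (13) can be sketched numerically: λ = 1 recovers the selfish MRT beamformer, while λ = 0 recovers the altruistic ZF beamformer that nulls the leakage to the other user.

```python
import numpy as np

# Sketch of (13) for user 1 of a 2x1 MISO pair: a normalized combination
# of the MRT direction and the ZF direction (projection of MRT onto the
# null space of the cross channel h_21).
rng = np.random.default_rng(3)
N_t = 2
h11 = (rng.standard_normal((1, N_t)) + 1j * rng.standard_normal((1, N_t))) / np.sqrt(2)
h21 = (rng.standard_normal((1, N_t)) + 1j * rng.standard_normal((1, N_t))) / np.sqrt(2)

w_mrt = h11.conj().T / np.linalg.norm(h11)             # selfish: max own gain
P_null = np.eye(N_t) - (h21.conj().T @ h21) / (np.linalg.norm(h21) ** 2)
w_zf = P_null @ h11.conj().T                           # altruistic: null leakage
w_zf = w_zf / np.linalg.norm(w_zf)

def w1(lam):
    """Balanced beamformer of (13) for a real coefficient 0 <= lam <= 1."""
    w = lam * w_mrt + (1.0 - lam) * w_zf
    return w / np.linalg.norm(w)

leak_zf = np.abs(h21 @ w1(0.0)).item()   # lam = 0 -> ZF: zero leakage to user 2
gain_mrt = np.abs(h11 @ w1(1.0)).item()  # lam = 1 -> MRT: full matched-filter gain
print(leak_zf, gain_mrt)
```

Sweeping λ between these extremes traces the one-parameter family of beamformers whose best member the DQN is later trained to select.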
To remedy the issues brought by iteration, the beamforming problem (8) is decomposed into two sub-problems at different time scales: 1) determining the balancing coefficients at large time scales; and 2) instant beamforming based on the balancing coefficients. In the next section, deep reinforcement learning is introduced as an alternative approach to approximate the complex mapping functions in (8) and (13).

IV. DEEP REINFORCEMENT LEARNING BASED BEAMFORMING
Reinforcement learning (RL) is widely used in machine learning. In the RL framework, the agent interacts with the environment and, through this process, acquires the best action strategy by learning from exploration and exploitation. Both activities are based on the agent's experience: exploration visits state-action pairs the agent has not encountered before, whereas exploitation relies on the experience accumulated so far.
In the literature, RL algorithms can be classified into three categories: (1) critic-only; (2) actor-only; and (3) actor-critic. Q-learning, which falls into the critic-only category, is a widely used RL algorithm for learning the best action strategy. The action strategy, also called the policy, is a sequence of actions taken at the upcoming time instants of an episode, where ϕ_t(s_t) is the mapping function from state to action. In Q-learning, a Q-function Q(s_t, a_t) is defined, which reflects the value of taking action a_t in state s_t. An optimal Q-function Q*(s_t, a_t) means that the agent obtains the maximum expected reward when it takes action a_t in state s_t and follows the optimal policy π* thereafter. The objective of Q-learning is to find the best policy π* that achieves the maximum expected reward. In this paper, the mapping function ϕ_t(s_t) in Q-learning is utilized to capture the complex relationships in (8) and (13) between the balancing coefficients and the signal-interference environment. For a multi-agent learning process, either distributed or coordinated Q-learning can be assumed. In distributed Q-learning, each agent learns independently without sharing policy information, and each agent regards the other agents as part of the environment. In contrast, in coordinated Q-learning, part of the policy information is shared among the coordinating agents, and the convergence time can be reduced by providing the learned policy to a new agent for initialization. However, coordinated Q-learning requires extra signaling overhead; moreover, states and actions must be updated synchronously among the agents, otherwise oscillation and instability will occur in the system.
Based on these considerations, and on the additional observation from (9) that only partial observations of the channels h_jk ∈ C^{1×N_t} are required to calculate the optimal beamforming vector w_k, distributed Q-learning is assumed in this paper. Specifically, each BS in the cluster is considered an agent that interacts with the environment, and the environment corresponds to the UDN excluding the target agent. The key elements of the deep reinforcement learning formulation are detailed in the following sub-sections.

A. STATE SPACE
In the deep reinforcement learning framework, the state characterizes the environment the agent faces; the agent interacts with the environment through actions and rewards based on the observed state. To parameterize the optimal beamforming vector, the state is defined differently for the MIMO and MISO configurations, as each possesses a different level of parameterization.
MIMO configuration: The state space S_k of agent k is defined in (17) as the pair {z, L_k}, where z = {z_i = log(||H_ii||^2), i = 1, ..., N_c} is the signal strength vector and L_k = {log ||H_jk||^2, j ≠ k} is the signal leakage vector of agent k toward the other UEs. Since the channel amplitude varies over several orders of magnitude, the logarithmic form in the state definition is beneficial for quick convergence of the DQN during training.
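The MIMO state vector of (17) can be sketched as follows: log channel strengths of all desired links, concatenated with the log leakage norms of the agent's own transmit channel toward the other users.

```python
import numpy as np

# Sketch of the MIMO state (17) for agent k: z collects the log signal
# strengths of all desired links, and L_k the log leakage norms of BS k
# toward the other UEs (H[j][k] is the channel from BS k to UE j).
rng = np.random.default_rng(4)
N_c, N_t, N_r, k = 3, 4, 2, 0
H = [[(rng.standard_normal((N_r, N_t)) + 1j * rng.standard_normal((N_r, N_t)))
      / np.sqrt(2) for _ in range(N_c)] for _ in range(N_c)]

z = [np.log(np.linalg.norm(H[i][i]) ** 2) for i in range(N_c)]
L_k = [np.log(np.linalg.norm(H[j][k]) ** 2) for j in range(N_c) if j != k]
state = np.array(z + L_k)   # dimension 2*N_c - 1
print(state.shape)
```

The logarithm compresses the dynamic range of the channel norms, which, as noted above, helps the DQN converge during training.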
MISO configuration: The state space S_k of each agent is defined in (18). Compared with (17), only the signal strength of the desired user k is included. L_k = {log ||h_jk||^2, j ≠ k} is defined as in the MIMO configuration. The rationale for this design comes from the observation in (9) that the beamforming vector of user k is a weighted combination of h_jk ∈ C^{1×N_t}, j = 1, ..., N_c; accordingly, the logarithmic norms of the channels from BS k to all users are taken as the elements of the state space. By contrast, in the MIMO configuration the balancing coefficient in (7) depends on the receive combiner vectors, so the signal strengths of the other users are also included in (17) to reflect the effect of the receive combiner. The different designs rely on the fact that a higher level of parameterization exists in the MISO beamforming structure, so less information is required in the state space to determine the best solution.

B. ACTION SET
At each instant t, the agent in state s_t takes an action a_t from the action set A. In this paper, the action set A consists of discretized balancing coefficients that parameterize the structure of the beamforming vector. As different MIMO configurations possess different levels of parameterization, different action spaces are defined accordingly.
MIMO configuration: Theoretically, the action space for (7) is infinite, ranging from λ_jk = 0 to λ_jk = ∞. It is therefore a challenge to discretize this space with a limited number of levels and a suitable granularity while trading off performance loss against complexity. To cope with this issue, we discretize the balancing coefficient according to its empirical distribution, obtained from the experimental results in Section V. Based on these statistics, the action space is defined in (19)-(20), with the mapping 10 log10(λ_jk) = 10 log10(λ_jk^min) + a_j^k δ, where δ is the discretization step and λ_jk^max and λ_jk^min are the maximum and minimum values of the truncated distribution of λ_jk.
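The dB-domain discretization can be sketched as below. The six evenly spaced levels and the 4 dB and 9 dB truncation bounds follow the empirical distribution reported in Section V; the step size δ is derived from them for illustration.

```python
# Sketch of the MIMO action discretization: an even grid in dB between
# the truncation bounds of the coefficient distribution (4 dB and 9 dB,
# per Section V), mapped back to a linear balancing coefficient.
lam_min_db, lam_max_db, levels = 4.0, 9.0, 6
delta = (lam_max_db - lam_min_db) / (levels - 1)  # assumed even step size

def action_to_lambda(a):
    """Map an action index a in {0, ..., 5} to a linear lambda value."""
    lam_db = lam_min_db + a * delta
    return 10.0 ** (lam_db / 10.0)

grid = [action_to_lambda(a) for a in range(levels)]
print(grid[0], grid[-1])
```

Coarser or finer grids trade complexity against quantization loss, which is exactly the compromise discussed above and revisited in the simulation section.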
2×1 MISO configuration: For the special case of the 2×1 MISO configuration, the action space is a discretization of the balancing coefficients λ_k, k = 1, 2, in (13). As λ_k lies in the range 0 ≤ λ_k ≤ 1, an even discretization with fixed granularity δ = 1/35 is assumed, giving 36 levels in total: a_k ∈ {1, ..., 36}, corresponding to λ_k ∈ {0, δ, ..., 35δ}. The action a_t selected by an agent in state s_t follows a decision policy π, which the agent learns from the reward function defined in the next sub-section.

C. REWARD FUNCTION
The reward function reflects the objective of the UDN system; the agent receives a reward from the environment expressing the degree of satisfaction with its action. Based on the objective in (2), the reward is defined as the system sum-rate. Through deep learning, the agent obtains the maximum return by following the optimal decision policy based on its observation of the environment. The return is the expected cumulative discounted reward, defined as G_t = E[Σ_{n=0}^∞ β^n r_{t+n}], where β is the discount rate and E[·] denotes expectation.
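The discounted return G_t can be computed for a finite reward trace by folding from the tail, as in this small sketch:

```python
# Sketch of the discounted return G_t = sum_n beta^n * r_{t+n},
# computed here for a finite reward trace as an illustration.
def discounted_return(rewards, beta):
    g = 0.0
    for r in reversed(rewards):   # fold from the tail: g <- r + beta * g
        g = r + beta * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```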
To obtain the optimal decision policy, we adopt deep Q-learning. The core concept of Q-learning is the Q-function Q_t(s_t, a_t), defined as the expected return the agent receives if it takes action a_t in state s_t and follows the policy π thereafter; the optimal decision policy follows the optimal Q-function. In practice, the optimal Q-function is obtained from the Bellman equation by iterative updates:

Q*(s_t, a_t) = E[r_t + β max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) | s_t, a_t].   (24)

A remarkable feature of Q-learning is that it is model-free: the optimal Q-function can be obtained via (24) without knowledge of the state transition probabilities. It has been shown that the iterative updates are guaranteed to converge to the optimal Q-function as t → ∞.
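The iterative update behind (24) can be sketched in tabular form: Q(s, a) is moved toward the bootstrapped target r + β max Q(s', ·). The tiny two-state MDP below is synthetic, for illustration only.

```python
import numpy as np

# Sketch of tabular Q-learning implementing the update behind (24).
# Toy deterministic MDP (synthetic): action 1 in state 0 pays reward 1
# and stays in state 0; every other (state, action) pays 0 and flips state.
rng = np.random.default_rng(5)
n_states, n_actions, alpha, beta = 2, 2, 0.1, 0.9
Q = np.zeros((n_states, n_actions))

def step(s, a):
    return (1.0, 0) if (s == 0 and a == 1) else (0.0, 1 - s)

s = 0
for _ in range(5000):
    a = rng.integers(n_actions)              # pure exploration for the sketch
    r, s_next = step(s, a)
    target = r + beta * Q[s_next].max()      # bootstrapped Bellman target
    Q[s, a] += alpha * (target - Q[s, a])    # move Q(s,a) toward the target
    s = s_next
print(Q[0].argmax())
```

Because Q-learning is off-policy, the greedy policy extracted from Q converges to the optimal one (here: keep taking action 1 in state 0) even though the behavior above is purely random.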
In this paper, the Q-function is utilized to approximate the relations in (8) and (13) between the wireless environment states and the balancing coefficients. Once the optimal Q-function is obtained, the agent selects the best λ_jk based on the wireless state for instant coordinated beamforming across the cells.

D. DEEP Q-NETWORK
When the numbers of discrete states and actions are small, a Q-table can be used for the iterative Q-function updates. As the name suggests, the Q-table has |A| rows and |S| columns, and its entries are the values of specific state-action pairs. Once the Q-table is obtained from the training stage, the optimal policy follows its output. For the DL approach investigated in this paper, however, the dimension of the action space is N_c × (N_c − 1) × |A| for (19), so the action space scales with N_c^2. Moreover, the state values in (17) are continuous. A Q-table is therefore not affordable, and we resort to the Deep Q-network (DQN) of [23].
In a DQN, a deep neural network with weights θ is employed to approximate the Q-function, and the weights are obtained by the training algorithm from data samples in the offline training stage. At the online testing stage, the trained DQN outputs the Q-function value for a given input state. To update the weights θ, the DQN uses a mean-squared-error loss function,

Loss(θ) = E_{(s,a,r,s')∼D}[(y − Q(s, a; θ))^2],

where D is the training sample set and y = r + β max_{a'} Q_old(s', a'; θ^-) is the target value; θ^- are the weights from a previous iteration, and Q_old is generated by the target network to produce the target value. With this loss function, the weights are updated as θ_{t+1} = θ_t − α ∇_θ Loss_t, where ∇_θ Loss_t is the gradient of the loss with respect to θ and α is the step size.
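The loss and gradient step can be sketched with a linear Q-approximator Q(s, a; θ) = θ_a·s, so the gradient is explicit; a real DQN replaces this with a deep network, but the target-network mechanism is the same. All data below are synthetic.

```python
import numpy as np

# Sketch of the DQN loss (MSE against a frozen target network) with a
# *linear* Q-approximator for transparency: Q(s, a; theta) = theta[a] . s.
rng = np.random.default_rng(6)
n_actions, dim, alpha, beta = 3, 4, 0.05, 0.9
theta = rng.standard_normal((n_actions, dim)) * 0.1
theta_old = theta.copy()                 # frozen target weights theta^-

# one synthetic minibatch of transitions (s, a, r, s') from replay set D
S = rng.standard_normal((8, dim)); A = rng.integers(n_actions, size=8)
R = rng.standard_normal(8);        S2 = rng.standard_normal((8, dim))

def loss_and_grad(theta):
    y = R + beta * (S2 @ theta_old.T).max(axis=1)  # target y from Q_old
    q = np.einsum('bd,bd->b', S, theta[A])         # Q(s_b, a_b; theta)
    err = q - y
    grad = np.zeros_like(theta)
    for b in range(len(R)):                        # d(mean err^2)/d(theta)
        grad[A[b]] += 2 * err[b] * S[b] / len(R)
    return np.mean(err ** 2), grad

l0, g = loss_and_grad(theta)
theta = theta - alpha * g                # theta_{t+1} = theta_t - alpha * grad
l1, _ = loss_and_grad(theta)
print(l0 > l1)
```

Freezing θ^- while θ is updated is what stabilizes the bootstrapped target; periodically copying θ into θ^- completes the scheme.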

E. DEEP Q-NETWORK BASED BEAMFORMING
Deep Q-network based beamforming is achieved by training the DQN agent to obtain the balancing coefficients from the experience gathered during interactions with the environment. The principle is shown in Fig. 1.
To train the agent, an emulator is first constructed: a UDN with BSs and UEs distributed under the coverage of their serving base stations. Based on the distance between each BS-UE pair, the channel is generated according to the assumed path loss model. The training process consists of a fixed number of episodes, and each episode proceeds as follows: 1) one BS-UE pair is selected at random as the agent, and the other pairs in the UDN act as the environment; 2) the channel of each pair is generated independently; 3) a random action is selected for the agent from the action set; 4) the beamforming vectors of the users in the environment are initialized with a classical algorithm such as ZF; 5) the beamforming vector of the agent is calculated based on the selected action; 6) the beamforming vectors of the other users are calculated with the balancing coefficients in (7) for the MIMO configuration; 7) the environment produces the observation and reward and feeds them to the DQN; 8) based on the observation and reward, the DQN determines the action according to the ε-greedy algorithm; 9) the DQN records the quadruple e_t = {S_t^k, a_t^k, r_t, S_{t+1}^k} in the memory pool D and updates the DQN parameters with random samples from D.
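The training loop above can be sketched as the following skeleton. The environment and the Q-network are stubs standing in for the UDN emulator and the DQN; channel generation, beamforming, and reward computation are placeholders, so only the loop structure (act, store the quadruple, sample a minibatch) is meaningful.

```python
import random
import numpy as np
from collections import deque

# Skeleton of the episode loop: epsilon-greedy action, environment step,
# replay memory of quadruples e_t = (S_t, a_t, r_t, S_{t+1}), minibatch
# sampling for the DQN update. Env and Q-values are stubs.
rng = np.random.default_rng(7)
n_actions, state_dim, eps = 6, 5, 0.1
memory = deque(maxlen=1000)

def env_step(action):
    """Stub: a real emulator would redraw channels, compute beamformers
    from the chosen balancing coefficient, and return the sum-rate reward."""
    return rng.standard_normal(state_dim), float(action) * 0.1

def dqn_act(state):
    """Stub epsilon-greedy policy over a fake Q-vector."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))                    # explore
    return int(np.argmax(rng.standard_normal(n_actions)))      # exploit (stub)

state = rng.standard_normal(state_dim)
for episode in range(50):
    a = dqn_act(state)
    next_state, r = env_step(a)
    memory.append((state, a, r, next_state))                   # store e_t
    batch = random.sample(list(memory), min(8, len(memory)))   # minibatch
    # ... a real DQN parameter update on `batch` would go here ...
    state = next_state
print(len(memory))
```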
The proposed training algorithms are outlined in Algorithm 2, Algorithm 3, and Algorithm 4 for the MIMO, MISO, and 2×1 MISO configurations, respectively, in the following tables.

V. SIMULATION RESULTS
In this section, the simulation layout is first described; afterwards, the simulation results are presented.

A. SIMULATION LAYOUT
A UDN with a cell radius of 50 meters is simulated, with 3 cells per cluster. The layout of the base stations and users is shown in Fig. 2. The transmit power of each BS is 30 dBm. For the MIMO configuration, each BS has 4 antennas and each UE has 2 antennas; for the MISO case, one antenna per user is assumed. UEs are dropped uniformly in each cell, and the path loss model is PL = 140.7 + 36.7 × log10(R) in dB, where R is the distance in km. The DQN is a fully connected neural network with five layers; the three hidden layers have 100, 60, and 30 neurons, respectively, with ReLU activation. Parameters are updated with the adaptive moment estimation method (Adam) [24], with an initial learning rate of 0.01. The DQN is first trained offline with observations from the simulated UDN; then, in the deployment stage, it generates the coordination coefficients λ_jk for beamforming, and the system throughput is collected for comparison.
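The stated path loss model is straightforward to reproduce; for example, at the cell-radius distance of 50 m:

```python
import math

# The path-loss model used in the simulation layout:
# PL(R) = 140.7 + 36.7 * log10(R) dB, with R in kilometers.
def path_loss_db(r_km):
    return 140.7 + 36.7 * math.log10(r_km)

print(round(path_loss_db(0.05), 2))  # 50 m, the cell radius
```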
The proposed method is compared with two reference beamforming algorithms: the SLNR scheme of [20] and the MRT algorithm. For the MIMO configuration, a maximum-SINR beam combiner is assumed to maximize the achieved data rates, defined as v_k ∝ Q_k^{-1} H_kk w_k, where Q_k is the covariance matrix of the received interference plus noise at the receiver of user k.
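The maximum-SINR combiner can be sketched as follows, under the assumption stated above that v_k is proportional to Q_k^{-1} H_kk w_k (the MMSE direction); the channels and beamformers are synthetic.

```python
import numpy as np

# Sketch of the maximum-SINR (MMSE) receive combiner for the MIMO
# evaluation: v_k proportional to Q_k^{-1} H_kk w_k, where Q_k is the
# interference-plus-noise covariance at user k (one interferer here).
rng = np.random.default_rng(8)
N_t, N_r, sigma2 = 4, 2, 0.1
H_kk = (rng.standard_normal((N_r, N_t)) + 1j * rng.standard_normal((N_r, N_t))) / np.sqrt(2)
H_kj = (rng.standard_normal((N_r, N_t)) + 1j * rng.standard_normal((N_r, N_t))) / np.sqrt(2)
w_k = np.linalg.qr(rng.standard_normal((N_t, 1)))[0][:, :1]
w_j = np.linalg.qr(rng.standard_normal((N_t, 1)))[0][:, :1]

x_j = H_kj @ w_j                                  # interfering signature
Q_k = sigma2 * np.eye(N_r) + x_j @ x_j.conj().T   # interference + noise covariance
v_k = np.linalg.solve(Q_k, H_kk @ w_k)            # MMSE direction
v_k = v_k / np.linalg.norm(v_k)

def sinr(v):
    """Post-combining SINR for a unit-norm combiner v."""
    sig = np.abs(v.conj().T @ H_kk @ w_k).item() ** 2
    return sig / (sigma2 + np.abs(v.conj().T @ x_j).item() ** 2)

print(sinr(v_k))
```

Because v_k maximizes the generalized Rayleigh quotient defining the SINR, it is never worse than the plain matched-filter (MRC) combiner.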

B. DISTRIBUTION OF THE BALANCING COEFFICIENTS
To perform the simulation based on deep Q-learning, the first step is to determine the discretization granularity of the coefficients λ_jk. In this paper, the granularity is determined by analyzing the empirical distribution of the coefficients. Fig. 3 shows the probability distribution of λ_jk under the aforementioned UDN layout. From Fig. 3, it is evident that over 95% of the coefficient samples lie within the range [4, 9] dB. Based on this observation, and as a compromise between performance and complexity, the even discretization scheme with six levels in (20) is adopted in the simulation, with 10 log10(λ_jk^max) = 9 dB and 10 log10(λ_jk^min) = 4 dB.
For training and evaluation of the proposed method, 20000 channel samples are generated, of which 16000 are used to train the DQN. In the training phase, one user is scheduled for transmission per cell, forming multiple BS-UE pairs for coordinated beamforming. The agent is selected at random from the BS-UE pairs, and the remaining pairs constitute the environment. During training, the beamforming vector of the agent is determined by the action provided by the DQN, while the beamforming vectors of the BS-UE pairs in the environment are calculated with a classical algorithm or with Algorithm 1 in Table 1. Based on the coordinated beamforming vectors, the corresponding states and rewards (the system sum-rate) are calculated. Both are fed to the DQN as inputs, and the outputs are the actions of the agent. In the evaluation phase, 4000 samples are used, and the balancing coefficients predicted by the trained DQN substitute the values originally provided by (7).
The simulation results are illustrated below in Figs.4-10 for the MIMO, MISO, and the special case of 2 × 1 MISO configurations. In the simulation, the performance of the DL based algorithm is closely investigated with respect to the arguments listed below.
• MIMO configuration
• Reference algorithm
• Shadow fading effect
• Dominant factor determining the agent state

MIMO Configuration Results:
In the MIMO case, for algorithm 1 to calculate the beamforming vectors as well as the balancing coefficients, iterations between the transmitter and receiver sides are required, as the beamforming vectors at the two sides are coupled as indicated in (7) and (29). For the proposed DL scheme, in contrast, the balancing coefficients are learned by the DQN offline; in the online stage, they are provided by the DQN according to the instant states of the agent and kept fixed during the iterations between the transmitter and receiver. As a result, calculation of the balancing coefficients is avoided during the iteration, compared with algorithm 1. Fig.4 illustrates the throughput achieved by the proposed and classical algorithms in terms of the Cumulative Distribution Function (CDF) for 4 × 2 MIMO. From Fig.4, we can see that the performance of MRT is the worst: it assumes the selfish strategy, so it only enhances the desired signal without considering the interference towards other UEs. SLNR performs better than MRT but worse than the proposed deep learning based beamforming (indicated as DL in the figure), because the coefficient in SLNR is fixed rather than the optimal value in (7), so its result is sub-optimal.
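The transmitter-receiver coupling that forces iteration can be illustrated on a single link: the transmit beam depends on the receive combiner and vice versa, so they are updated alternately. This is a simplified single-link sketch (a power iteration on the channel), not algorithm 1 itself.

```python
import numpy as np

def tx_rx_iteration(H, n_iter=20):
    """Alternate updates of transmit beam w and receive combiner g for one link.

    Each depends on the other, so they are refined in turns; for a single link
    this converges to the dominant singular vectors of the channel H (Nr x Nt).
    """
    Nr, Nt = H.shape
    w = np.ones(Nt, dtype=complex) / np.sqrt(Nt)   # initial transmit beam
    for _ in range(n_iter):
        g = H @ w
        g /= np.linalg.norm(g)                      # combiner matched to H w
        w = H.conj().T @ g
        w /= np.linalg.norm(w)                      # beam matched to H^H g
    return w, g
```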

D. SIMULATION RESULTS COMPARED WITH ALGORITHM1
Next, the performance of the proposed DL based scheme is compared with algorithm 1, in which both the beamforming vectors w_k and the coefficients λ_jk are found iteratively. The number of iterations in algorithm 1 is fixed to 20. The CDF comparison results are shown in Fig.5.
As presented in Fig.5, the performances of the two schemes in terms of system sum-rate match well on the whole, yet a small difference remains between the two CDF curves. The difference mainly exists above the CDF value of 0.5, indicating that, compared with algorithm 1, the proposed scheme performs better for edge users than for center users. Typically, the cell center users and cell edge users correspond to the 90th and 5th percentile users, respectively. The reasons for this difference can be elaborated from two aspects: 1) Performance loss due to the discretization range of the action space. From Fig.3, we notice that some samples still lie outside the discretization range. Owing to this range limitation, center users are not allowed to select a more selfish strategy. To overcome this problem, more advanced discretization schemes, such as nonlinear or non-uniform schemes, may be required. 2) Absence of the leakage components L_j, j ≠ k, from the state space design in (7). This absence reduces the signaling overhead, but at the cost of a small performance loss.
MISO Configuration Result: In the MISO case, algorithm 1 still requires iterations to obtain the balancing coefficients, even though the beamforming vectors at the transmitter and receiver sides are decoupled. For the proposed DL based algorithm, however, the iteration for obtaining the balancing coefficients is avoided. The comparison results in Fig.6 demonstrate this difference in iteration; the results are averaged over 500 experiments. From Fig.6, it is obvious that no iteration is required for the proposed DL scheme, while algorithm 1 requires iterations to reach convergence of the balancing coefficients. The results verify the advantage of the proposed scheme in efficiency, in terms of both processing complexity and time delay.

E. SIMULATION RESULTS ON SHADOWING EFFECT
From the perspective of (7), the balancing coefficient depends on both the large and small scale fading of the channel. The large scale fading consists of path loss and the shadowing effect. To investigate the influence of channel shadowing on the performance of the proposed scheme, and the extent of this influence, the performances of the proposed DL based algorithm with and without the shadowing effect are compared. Moreover, the proposed method is also compared with algorithm 1 for the MIMO (MISO) configuration and with the algorithm in [21] for the 2 × 1 MISO configuration. The shadow fading is assumed to be lognormal distributed with a Standard Deviation (STD) of 5dB. Fig.7 shows the simulation results in terms of the CDF of system sum-rate for the 4 × 2 MISO configuration with and without shadow fading; the latter case is denoted by 'No' in the figure.
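Lognormal shadowing with the stated 5 dB STD is simply Gaussian in the dB domain; a minimal sketch for generating and applying such samples (function names are illustrative) could be:

```python
import numpy as np

rng = np.random.default_rng(0)

def shadowing_db(shape, std_db=5.0):
    """Log-normal shadow fading: zero-mean Gaussian in dB with the given STD."""
    return rng.normal(0.0, std_db, size=shape)

def apply_shadowing(path_gain_lin, shadow_db):
    """Combine a linear-scale path gain with a shadowing value given in dB."""
    return path_gain_lin * 10.0 ** (shadow_db / 10.0)
```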
At first glance of Fig.7, we get the impression that performance differences due to shadow fading exist for both the proposed algorithm and algorithm 1. More precisely, shadow fading decreases the throughput of cell edge users while increasing that of cell center users. Looking more closely at the figure, the influence of shadow fading on the DL agent is slightly greater than on algorithm 1. The reason is that shadow fading, owing to its random nature, introduces more uncertainty into the channel and hence into the distribution of the balancing coefficients, which is responsible for the performance difference for the cell center users. This phenomenon also exists for the 4 × 1 MISO results in Fig.8. It should be pointed out that the results in Fig.8 are achieved at a much lower cost than those in Fig.7, as they are obtained with reduced state information. This fact corroborates the assertion that beamforming for MISO possesses a higher level of parameterization, so less information is required to train the agent and signaling overhead is saved in practical implementation.
where ζ_k is expressed in terms of a_k, b_k, and c_k defined in the closed-form solution (31), and k̄ is the complement of k for the set {1, 2}. Fig.9 gives the simulation results in terms of the system sum-rate CDF for 2 × 1 MISO. In the figure, the DL based algorithm is compared with the algorithm which calculates the beamforming vectors by λ_k, k = 1, 2, in (31); the latter is indicated by 'Theory' in Fig.9. From the figure, we conclude that the performance of the DL based algorithm in 2 × 1 MISO is the least influenced by shadow fading among the MIMO configurations considered. The DL based algorithm performs best here due to the highest level of parameterization in the 2 × 1 MISO configuration.

F. SIMULATING THE DEPENDENCE OF BALANCING COEFFICIENTS ON LARGE SCALE CHANNEL FADING
Considering that the multi-cell beamforming in (8) depends on a wireless channel consisting of large and small scale fading with different time scales, we attempt to separate the original beamforming problem into two sub-problems: 1) predicting the balancing coefficients at large time scales; 2) instant beamforming based on the obtained balancing coefficients. Beamforming with balancing coefficients predicted at large time scales can save signaling overhead for channel estimation in practice. To this end, we consider a state space S_k design option based on large scale fading in this section.
Specifically, the new state space S_k is built only from the large scale fading component of the channel. The predicted balancing coefficients are then used for instant beamforming with MIMO channels consisting of both kinds of fading. To verify this attempt, simulations based on the new state space design are conducted, with results shown in Fig.10. Surprisingly, the performance difference is rather small between the state space designs that use partial and full channel components to predict the balancing coefficients. This result is encouraging for implementation: much signaling overhead can be saved on instant channel estimation, and a more reliable estimate of the balancing coefficients, based on large time scales, can be guaranteed.
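The two state designs compared above differ only in whether the instantaneous small-scale gains enter the state; a minimal sketch, with all function names and the dB-domain composition as illustrative assumptions:

```python
import numpy as np

def large_scale_state(path_loss_db, shadow_db):
    """State built from large-scale components only (path loss + shadowing, in dB)."""
    return np.asarray(path_loss_db) + np.asarray(shadow_db)

def full_state(path_loss_db, shadow_db, small_scale):
    """State additionally including instantaneous small-scale fading gains (in dB)."""
    fast_db = 20.0 * np.log10(np.abs(small_scale))   # amplitude gain -> dB
    return large_scale_state(path_loss_db, shadow_db) + fast_db
```

Since the large-scale terms vary on a much slower time scale, the first design only needs to be refreshed when path loss or shadowing changes.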

VI. CONCLUSION
In this paper, we present the parameterized beamforming structure of coordinated beamforming considering the balanced strategy in UDN. Based on this analysis, deep reinforcement learning is proposed to predict the balancing coefficients that parameterize the final beamforming vectors. The proposed method was evaluated in a simulated UDN system with different MIMO configurations. Experimental results demonstrate that the strategy space for a MIMO configuration can be discretized with limited levels and granularity, although the range is theoretically infinite. Simulation results confirm the efficiency of the proposed scheme in terms of iteration. The results also reveal that the learned DQN can predict the beamforming strategy based on the large scale channel fading alone; the important aspect of this finding is the reduction of signaling overhead in implementation.