Multiuser MIMO (MU-MIMO) plays an important role in modern wireless communication systems because it provides significant performance gains over single-user MIMO (SU-MIMO). With MU-MIMO, a multi-antenna base station (BS) can simultaneously serve multiple user equipments (UEs) within a cell on the same spectrum resource, thereby greatly improving the overall spectrum efficiency [2]. As a promising technology for the Fifth Generation (5G) wireless communication standard [3]–[5], massive MIMO with large-scale antenna arrays further enhances the system performance in terms of spectrum efficiency and energy efficiency. In real operations, we must concurrently balance system capacity and service coverage by appropriately tuning optimization parameters at BSs, which is referred to as the Coverage and Capacity Optimization (CCO) problem. CCO-related system parameters usually include reference signal power, antenna tilt, scheduling parameters, etc. However, it is difficult and expensive to configure a large number of antenna tilt values for adaptive CCO, so we have to study and define more tractable optimization parameters for addressing CCO in massive MIMO systems.
Specifically, user scheduling mechanisms are responsible for deciding how the BS allocates spectrum resources with fine time and frequency resolution, taking into account channel conditions and QoS requirements. We can therefore tune the scheduling parameters, instead of the antenna tilt, to address the CCO problem. Combes et al. [6] proposed an \alpha-fair scheduler that improves coverage performance at the expense of very small capacity losses; an optimization parameter \alpha was used to adaptively adjust the scheduling rule depending on whether the number of users is small. Comsa et al. [7] proposed a dynamic neural Q-learning-based scheduling technique that achieves a flexible tradeoff between system capacity and user fairness, in which a Q-learning algorithm adopts different scheduling policies at each Transmission Time Interval (TTI). Nevertheless, both [6] and [7] focus on SU-MIMO, which schedules only one user for transmission. In the context of MU-MIMO and massive MIMO, Sun et al. [8] proposed a joint user scheduling and power allocation algorithm for the Joint Spatial Division and Multiplexing (JSDM) [9] scheme in massive MIMO downlink systems, which schedules users and allocates power iteratively with MAX user scheduling and the Lagrange power optimization method. Based on a two-stage precoding framework for massive MIMO systems, Xu et al. [10] proposed an improved K-means user grouping scheme that allocates users to different pre-beamforming groups using second-order channel statistics, and then a user grouping scheme that considers both load balancing and precoding design. After the user groups are determined, a dynamic user scheduling scheme is applied in which the second-stage precoding is designed based on instantaneous channel conditions. However, both [8] and [10] only maximize the system sum rate and do not consider concurrently optimizing the system capacity and coverage.
In this paper, we propose a novel parameter, GAUSS (Group Alignment of Users' Signal Strength), to efficiently support user scheduling in massive MIMO systems, and thus to serve as an effective parameter for the CCO problem. Together with SINR_{min}, GAUSS can effectively control the variance of the signal strengths within the scheduled user group. Moreover, a Deep reinforcement learning Enabled Coverage and Capacity Optimization (DECCO) algorithm is proposed to dynamically derive GAUSS and SINR_{min} during the optimization process. We also propose an inter-cell interference coordination (ICIC) scheme to enhance the CCO performance. The key contributions of this work are summarized as follows.
A novel scheduling parameter GAUSS, together with a unified quality-of-service threshold SINR_{min}, is proposed to address the challenging CCO problem in massive MIMO systems.
A novel CCO algorithm, DECCO, is proposed to dynamically derive the optimal combination of GAUSS and SINR_{min} with a pre-trained policy gradient neural network in the user scheduling scheme, together with a novel ICIC scheme.
Analytical and simulation results show that the proposed DECCO algorithm achieves a much better balance between system capacity and service coverage. In particular, compared with the traditional Fixed Optimization (FO) and Proportional Fair Optimization (PFO) algorithms, DECCO significantly increases both the cell-average and cell-edge spectrum efficiency in a typical massive MIMO system.
The rest of this paper is organized as follows. In Section II, we briefly introduce the massive MIMO system model and formulate the coverage and capacity optimization problem. In Section III, we present the novel optimization parameter GAUSS for the user scheduling scheme. In Section IV, we present the DECCO algorithm, including the deep reinforcement learning-based user scheduling algorithm and the inter-cell interference coordination scheme. In Section V, we give simulation results to verify the effectiveness of the proposed parameters, and compare the coverage and capacity performance of our proposed DECCO algorithm with an algorithm using the best fixed configuration of the proposed parameters in the user scheduling scheme. We conclude this work in Section VI.
SECTION II.
System Model and Problem Formulation
In this section, we first briefly introduce the system model for multiuser massive MIMO and the user SINR estimation model. Then we provide the formulation of the coverage and capacity optimization problem.
A. Massive MIMO System Model
We consider the downlink transmission of a massive MIMO network as depicted in Figure 1. The shaded area of each cell is regarded as the cell center, and the area between the dashed line and the solid line is defined as the cell edge. For the CCO problem in this system, the Key Performance Indicators (KPIs) for system capacity and service coverage are defined as the Cell Average Spectrum Efficiency (CASE) and the Cell-Edge Spectrum Efficiency (CESE), which are calculated as the average spectrum efficiency of all users in the cell and of the cell-edge users, respectively.
Each BS is equipped with {M_{t}} antennas and can simultaneously serve {K} user terminals, each with {N_{r}} antennas. We assume that the {M_{t}} \times (K\times {N_{r}})-dimensional channel matrix {\mathbf {H}} is fixed for a certain block length, which is the coherence time of the channel, and changes from block to block. In order to reduce the Channel State Information (CSI) feedback overhead, a two-stage precoding scheme is usually adopted for Frequency Division Duplexing (FDD) massive MIMO systems [9], [11], [12]. The inner precoder adopts zero-forcing based on local CSI, and the outer precoder groups the UEs based on the similarity of the eigenspaces of the auto-correlation matrices of the UEs’ downlink channels. The received signal at the user side under the two-stage precoding scheme is given by \begin{equation} \mathbf {y} = \mathbf {H}^{H}\mathbf {BPd} + \mathbf {z}, \end{equation}
where {\mathbf {y}} is the received signal, {\mathbf {d}} is the transmitted data symbol vector, {\mathbf {z}} is the additive Gaussian noise, {\mathbf {B}} is the outer precoder, {\mathbf {P}} is the inner precoder, and {\mathbf {H}} is the channel matrix. Assuming each UE’s signal is allocated equal power, the normalized received signal {{\tilde {\mathbf y}}} can be further obtained as \begin{equation} {\tilde {\mathbf y}} = \sqrt {\frac {P_{t}/N}{{{\textit {Tr}(}{\mathbf {BP}}{{\mathbf {P}}^{H}}{{\mathbf {B}}^{H}}{)}}}} {{\mathbf {H}}^{H}}{\mathbf {BPx}} + {\mathbf {n}}, \end{equation}
where {P_{t}} is the total transmit power at the BS, {N} is the noise power, {\mathbf {x}} and {\mathbf {n}} are the normalized signal and Gaussian noise respectively, and Tr(\cdot) denotes the matrix trace.
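To make the normalization in Equation 2 concrete, the following NumPy sketch computes the power-normalization factor Tr(BPP^H B^H) and the corresponding noise-free received signal; the dimensions, the random channel, and the placeholder precoders are illustrative assumptions and not the precoders actually used in this paper.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the simulation settings of Section V).
Mt, Nr, K = 64, 2, 8          # BS antennas, UE antennas, scheduled UEs
P_t, N0 = 1.0, 1e-2           # total transmit power and noise power

rng = np.random.default_rng(0)
H = (rng.standard_normal((Mt, K * Nr)) + 1j * rng.standard_normal((Mt, K * Nr))) / np.sqrt(2)
B = rng.standard_normal((Mt, K)) + 1j * rng.standard_normal((Mt, K))   # outer precoder (placeholder)
P = np.eye(K, dtype=complex)                                           # inner precoder (placeholder)

BP = B @ P
norm_factor = np.trace(BP @ BP.conj().T).real      # Tr(B P P^H B^H) in Eq. (2)
scale = np.sqrt((P_t / N0) / norm_factor)

x = rng.standard_normal(K) + 1j * rng.standard_normal(K)     # normalized data symbols
n = rng.standard_normal(K * Nr) + 1j * rng.standard_normal(K * Nr)
y_tilde = scale * (H.conj().T @ BP @ x) + n                  # Eq. (2)

sinr = (P_t / N0) / norm_factor                              # Eq. (4), interference-free case
print(f"Tr(BPP^H B^H) = {norm_factor:.2f}, SINR = {10 * np.log10(sinr):.1f} dB")
```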
B. Problem Formulation
Considering the interference, the user SINR can be estimated as \begin{equation}~\textit {SINR} = \frac {P_{t}/(N + I)}{{{\textit {Tr}(}{\mathbf {BP}}{{\mathbf {P}}^{H}}{{\mathbf {B}}^{H}}{)}}}. \end{equation} where {I} denotes the interference power.
In the downlink of a massive MIMO network operating in MU-MIMO transmission mode, the number of transmit antennas at the BS is much larger than the total number of receive antennas of the scheduled UEs in a cell; thus the two-stage precoding scheme using distributed zero-forcing can substantially reduce both inter-cell and intra-cell interference.
In each scheduling interval, the users’ SINRs can be calculated from the channel matrix formed by the scheduled users, and the SINR of each user can be estimated accurately assuming ideal CSI is available. When scheduling, the minimum user SINR threshold, denoted SINR_{min}, can be used to control the minimum SINR of a scheduled user and to adjust the coverage and capacity performance. The spectrum bandwidth, denoted B, is shared by the scheduled users, and multiple users can be scheduled simultaneously to exploit the spatial gain. For simplicity, we assume the interference is perfectly cancelled by the two-stage precoding scheme and the inter-cell interference coordination scheme. Thus the SINR of the users within each cell can be estimated as \begin{equation}~\textit {SINR} = \frac {P_{t}/N}{{{\textit {Tr}(}{\mathbf {BP}}{{\mathbf {P}}^{H}}{{\mathbf {B}}^{H}}{)}}}. \end{equation}
The instantaneous spectrum efficiency of a cell at a certain time step t can be calculated as \begin{equation} E(t)=\frac {\sum _{k=1}^{K^{*}}B\log _{2}(1+\rho _{k})}{B}=\sum _{k=1}^{K^{*}}\log _{2}(1+\rho _{k}), \end{equation}
where the sum is over the K^{*} scheduled users and {\rho _{k}} is the {k}-th user’s SINR, defined by Equation 4. CASE is then computed as the 50th percentile of the Cumulative Distribution Function (CDF) of the instantaneous spectrum efficiency, and CESE as the 5th percentile. A unified KPI of an optimization area, e.g., sector i, is defined as \begin{equation} {KPI_{i}} = wE_{5\%} + (1-w)E_{50\%}, \end{equation}
where E denotes E(t), and w is a weight factor that balances coverage and capacity performance. As mentioned in the previous section, we utilize a user scheduling scheme and an interference coordination scheme to optimize coverage and capacity. The scheduling result of a sector has little impact on neighboring sectors’ KPIs, since the interference is assumed to be perfectly cancelled by the two-stage precoding scheme and the inter-cell interference coordination scheme. The coverage and capacity optimization problem can therefore be formulated as a KPI maximization problem for sector i:\begin{align}&\max \limits _{G} ~{KPI_{i}} = wE_{5\%} + (1-w)E_{50\%},\notag \\&\mathrm {s.t.} ~0<|G| \leq K^{*},\notag \\&\qquad \!\! 0\leq w \leq 1, \end{align}
where G is the group of scheduled users, and the number of users in the group is constrained by K^{*}. Note that this optimization problem is constrained by how the users are scheduled, which we discuss in the following section.
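As a concrete illustration of Equations 5 and 6, the following Python sketch computes the instantaneous spectrum efficiency, CASE, CESE, and the unified KPI. It assumes the per-user SINRs are already given and that the 5th/50th percentiles are taken over a window of recorded E(t) samples; the SINR draws are placeholders.

```python
import numpy as np

def instantaneous_se(sinrs_linear):
    """Eq. (5): E(t) = sum_k log2(1 + rho_k) over the scheduled users."""
    return float(np.sum(np.log2(1.0 + np.asarray(sinrs_linear))))

def sector_kpi(se_samples, w=0.5):
    """Eq. (6): KPI_i = w * E_5% + (1 - w) * E_50%, with percentiles taken
    over a window of instantaneous spectrum-efficiency samples E(t)."""
    e_5 = np.percentile(se_samples, 5)    # CESE: 5th percentile of the CDF
    e_50 = np.percentile(se_samples, 50)  # CASE: 50th percentile of the CDF
    return w * e_5 + (1.0 - w) * e_50, e_50, e_5

# Toy usage with hypothetical SINR draws for successive scheduling intervals.
rng = np.random.default_rng(1)
window = [instantaneous_se(rng.uniform(0.5, 20.0, size=8)) for _ in range(200)]
kpi, case, cese = sector_kpi(window, w=0.5)
print(f"CASE = {case:.2f}, CESE = {cese:.2f}, KPI = {kpi:.2f} bit/s/Hz")
```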
SECTION III.
Group Alignment of User Signal Strength
In this section, we introduce our proposed optimization parameter GAUSS, which is used to identify the qualified users to be scheduled.
According to the property of singular value decomposition (SVD), we can get \begin{align} \textit {Tr}(\mathbf {BPP^{H}B^{H}})=&~\textit {Tr}(\mathbf {u}\lambda \mathbf {v}^{H}\mathbf {v}\mathbf {u}^{H})\notag \\=&~\textit {Tr}(\lambda ^{2}\mathbf {u}^{H}\mathbf {u})\notag \\=&~\textit {Tr}(\lambda ^{2})\notag \\=&~\sum _{k=1}^{K^{*}}\lambda _{k}^{2}, \end{align}
where {\mathbf {BP}}={\mathbf {u\lambda }}{{\mathbf {v}}^{H}} with singular values {\boldsymbol{\lambda }} = diag\left ({{\lambda _{1}, \cdots {\lambda _{k}}, \cdots {\lambda _{K^{*}}}} }\right) in descending order. With the scheduling constraints, K^{*} should be no larger than the rank of {\mathbf {BP}}{{\mathbf {P}}^{H}}{{\mathbf {B}}^{H}}. Substituting (8) into (5), with \mu = \frac {P_{t}}{N}, we obtain \begin{equation} E(t) = \sum \limits _{k{ = 1}}^{K^{*}} {\log _{2}} \left({1 + \frac {\mu }{{\sum \limits _{i = 1}^{K^{*}} {\lambda _{i}^{2}} }}}\right), \end{equation}
where {\lambda _{i}}^{2} can be interpreted as the channel gain factor of each user in the MIMO scenario; this value is larger for a cell-edge user and smaller for a cell-center user. Equation 9 can be rewritten as \begin{equation} E(t) = \sum \limits _{k{ = 1}}^{K^{*}} {\log _{2}} \left({1 + \frac {\mu }{{\lambda _{1}^{2}\sum \limits _{i = 1}^{K^{*}} {\gamma ^{2}_{i}} }}}\right), \end{equation}
where {\gamma _{i}} = \frac {\lambda _{i}}{\lambda _{1}} and {\gamma _{1}} = 1,{\gamma _{i}} < 1 (i \ne 1). We can thus infer that the spectrum efficiency is mainly determined by a small number of users with larger channel gain factors. If cell-center users and cell-edge users are scheduled simultaneously, the throughput of the cell-center users will be lowered by the cell-edge users. In order to ensure that the system capacity does not decline while the spatial multiplexing rate is improved, the gap between \max ({\lambda _{i}}) and \min ({\lambda _{i}}) must be kept within a certain range. In this sense, we introduce the user scheduling optimization parameter GAUSS to align the users’ SINRs in the scheduled user group, where GAUSS is defined as \begin{equation}~R = \frac {{\max ({\lambda _{i}})}}{{\min ({\lambda _{i}})}}. \end{equation}
If we increase the value of GAUSS, i.e., R, the number of scheduled users increases while the k-th user’s instantaneous SINR {\rho _{k}} decreases, since more UEs are scheduled. We denote {\xi _{i}} as the average channel gain factor of user i. Then Equation 10 can be rewritten as \begin{equation} {E(t) = \sum \limits _{i = 1}^{K^{*}} {\log _{2}}\left({1 + \frac {\mu }{\xi _{i}}}\right)} \end{equation}
Sorting the users’ average channel gain factors in ascending order and combining them with the group alignment of users’ signal strength R, we obtain Figure 2. After selecting a user i, with the average channel gain factor {\xi _{i}} of that user as the center and the group alignment of users’ signal strength R as the radius, the qualified users that may participate in the scheduling process can be determined. Users to the left of {\xi _{i}} have channel conditions superior to user i, and users to the right of {\xi _{i}} have channel conditions inferior to user i. According to the previous analysis, in a scheduling process we should select the user set based on the channel conditions, and the target user set is controlled by how we choose R, i.e., the value of GAUSS. Further considering SINR_{\min }, we obtain the following inequality \begin{equation} {\xi _{i}} \le \frac {\mu }{{SINR_{\min }}} = \beta. \end{equation}
If {\xi _{i}} is smaller than \beta, user i satisfies the SINR constraint in (13) and can be scheduled together with other users; otherwise, user i does not satisfy the constraint and cannot be scheduled with other users, so as to avoid an overly low SINR. In this sense, the spatial resources of the users on the left side of \beta can be reused when scheduling, while those of the users on the right side of \beta cannot. Another observation is that SINR_{\min } determines the location of \beta: as SINR_{\min } decreases, \beta moves to the right. The system capacity then decreases because more users with poor channel conditions are scheduled; however, since more cell-edge users to the right of \beta can now be served, the coverage performance improves. When we increase SINR_{\min }, the system capacity and network coverage change in the opposite way.
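The GAUSS check in Equation 11 and the eligibility test in Equation 13 reduce to a few lines of linear algebra. The sketch below screens a candidate user group, assuming the effective precoder \mathbf{BP} of the group is available; the values of \mu, SINR_{min}, R and the average channel gain factors are illustrative assumptions.

```python
import numpy as np

def gauss_of_group(bp_matrix):
    """Eq. (11): R = max(lambda_i) / min(lambda_i) of the effective precoder BP."""
    singular_values = np.linalg.svd(bp_matrix, compute_uv=False)
    return singular_values.max() / singular_values.min()

def group_is_schedulable(bp_matrix, xi, mu, r_max, sinr_min_linear):
    """Screen a candidate group: GAUSS within R (Eq. 11) and every user's
    average channel gain factor xi_i below beta = mu / SINR_min (Eq. 13)."""
    beta = mu / sinr_min_linear
    return gauss_of_group(bp_matrix) <= r_max and np.all(np.asarray(xi) <= beta)

# Toy usage with a random effective precoder and hypothetical parameters.
rng = np.random.default_rng(2)
BP = rng.standard_normal((64, 4))          # effective precoder of a 4-user group
xi = [3.0, 5.5, 8.0, 12.0]                 # average channel gain factors (assumed)
print(group_is_schedulable(BP, xi, mu=100.0, r_max=50.0, sinr_min_linear=4.0))
```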
SECTION IV.
Deep-Learning Enabled Coverage and Capacity Optimization
In this section, we present the DECCO algorithm to perform the capacity-coverage optimization by the deep reinforcement learning-based user scheduling scheme and a subsequent inter-cell interference coordination scheme. The overall framework of DECCO is depicted in Figure 3, and the detailed implementation is presented in the following.
A. Preliminaries
We only introduce the basic concepts of deep reinforcement learning that we build on in this paper; a detailed survey and rigorous derivations can be found in [13].
1) Reinforcement Learning
Reinforcement learning is a model-free method for solving a Markov decision process (MDP). Typically, an MDP consists of a state space \mathcal {S}, an action space \mathcal {A}, a stationary transition distribution p(s_{t+1}|s_{t},a_{t}) describing the environment dynamics, which satisfies the Markov property, and a reward function r. Figure 4 shows the general setting of reinforcement learning, in which an agent interacts with an environment [14]. The agent observes some state s_{t} and chooses an action a_{t} at each time step t. Once the action is taken, the state of the environment transitions to s_{t+1} and the agent receives a reward r_{t}. The state transitions and rewards of the environment are stochastic and are assumed to have the Markov property; that is, they depend only on the state and action of the previous timestep.
Note that the agent can only control its actions and has no a priori knowledge of the environment. However, the agent can learn to act properly by randomly choosing actions and observing the transitions of the environment. The goal of learning is to maximize the expected cumulative discounted reward \mathbb {E}\left[{\sum _{t=0}^{\infty }\gamma ^{t}r_{t}}\right], where \gamma \in (0,1] is a factor discounting future rewards. If \gamma is very small, learning depends little on future rewards and immediate rewards dominate; if it is too large, learning relies heavily on future rewards.
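As a small illustration of the discounted objective (and of the returns v_t used later in training), the following Python sketch computes the discounted cumulative reward of a recorded reward trajectory; the reward values are arbitrary placeholders.

```python
def discounted_returns(rewards, gamma):
    """v_t = sum_{s >= t} gamma^(s - t) * r_s, computed backwards in one pass."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy usage: a short episode of placeholder rewards.
print(discounted_returns([1.0, -1.0, 1.0, 1.0], gamma=0.9))
```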
2) Policy
At each timestep t, the agent’s decision making is characterized by a policy, \pi (s,a)=Pr\{a_{t}=a|s_{t}=s\}, \forall s \in \mathcal {S}, a \in \mathcal {A}, which is the probability that action a is taken in state s. In practical problems, the state space and action space are normally large, making it intractable to store the policy in tabular form. Function approximators are therefore used to parametrize the policy as \pi _{\theta }(s,a) with a parameter vector \theta \in \mathfrak {R}^{l}, where l \ll |\mathcal {S}|. Another advantage of function approximators is that the agent can take similar actions for “close-by” states.
There are many forms of function approximators that can be used to represent the policy. A popular choice is a linear combination of features of the state/action space. Recently, Deep Neural Networks (DNNs) have been successfully used for large-scale reinforcement learning tasks; generally speaking, algorithms that use DNNs belong to deep learning techniques. The advantage of using DNNs as function approximators is that they do not need hand-crafted features. In this paper, we use a DNN to represent the policy, which is why our algorithm is deep-learning enabled.
3) Policy Gradient Methods
Policy gradient methods are heavily used in recent state-of-the-art reinforcement learning algorithms. In these methods, the policy is trained by following the gradient of the cumulative discounted reward with respect to the parameter vector, which is given by [15]:\begin{equation} \nabla _{\theta } \mathbb {E}\left[{\sum _{t=0}^{\infty }\gamma ^{t}r_{t}}\right] = \mathbb {E}_{\pi _{\theta }}[\nabla _{\theta }\log \pi _{\theta }(s,a)Q^{\pi _{\theta }}(s,a)], \end{equation}
where Q^{\pi _{\theta }}(s,a) is the expected cumulative discounted reward obtained by deterministically choosing action a in state s and following \pi _{\theta } thereafter. The idea of policy gradient methods is to estimate this gradient from the trajectories obtained by executing the policy. The agent empirically calculates the cumulative discounted reward v_{t} with a simple Monte Carlo method [16] by sampling multiple trajectories, and uses it as an unbiased estimate of Q^{\pi _{\theta }}(s_{t},a_{t}). The policy parameters are then updated via gradient ascent as \begin{equation} \theta \leftarrow \theta + \alpha \sum _{t}\nabla _{\theta }\log \pi _{\theta }(s_{t},a_{t})v_{t}, \end{equation}
where \alpha is the step size. This update leads to the episodic REINFORCE algorithm [15], which can be intuitively understood as follows: the agent updates the policy parameters in the direction of \nabla _{\theta }\log \pi _{\theta }(s_{t},a_{t}), thereby increasing \pi _{\theta }(s_{t},a_{t}), the probability of taking action a_{t} in state s_{t}. The effect is to reinforce actions that empirically lead to better performance.
B. User Scheduling Scheme
The user scheduling scheme takes into account both spectrum efficiency and user fairness. We choose the first user with the classical proportional fair (PF) scheduling factor. When scheduling the remaining users, we introduce the group alignment of users’ signal strength R to ensure spectrum efficiency, and we exploit SINR_{\min } to ensure that each user’s estimated SINR is no less than this threshold, thereby protecting the system capacity. We assume that {L} is the set of users to be scheduled, {g} is the scheduled user set, {r_{k}} is the instantaneous data rate of user {k}, {D_{k}} is the average data rate of user {k}, {M} is the number of scheduled users, {K^{*}} is the maximum number of scheduled users, {\lambda _{\max }} and {\lambda _{\min }} are the maximum and minimum singular values respectively, and R is the maximum group alignment of users’ signal strength.
In this paper, we seek to dynamically choose SINR_{\min } and R for each TTI. The resulting user scheduling scheme consists of two phases: 1) in each TTI, SINR_{\min } and R are identified via the deep reinforcement learning algorithm; 2) SINR_{\min } and R are then utilized by the subsequent user scheduling scheme.
1) The Deep Reinforcement Learning Formulation
State space. Each sector of the cell is defined as an agent that aims to maximize both CASE and CESE of the cell. We define the continuous state as s_{t}=\{CASE_{t},CESE_{t}\}, where CASE and CESE of the cell at timestep t are calculated as defined above.
Action space. The action space is constructed from the combinations of the SINR_{\min } and R parameter sets. Suppose there are m discrete levels of SINR_{\min } and n discrete levels of R; the action space then consists of m\cdot n combinations. We use a DNN as the function approximator to compute the policy that the agent should follow in a given state, where the chosen action is the combination of SINR_{\min } and R with the largest probability, i.e., the one expected to yield the largest reward:\begin{equation} a_{t} = \mathop {\arg \max }_{a} \pi _{\theta }(s_{t},a) \end{equation}
A DNN is used to approximate the policy because the state space in our scenario is continuous, so the learned policies cannot be stored in a table.
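To make the discrete action space concrete, the sketch below enumerates the m·n combinations of SINR_min and R and maps a policy output back to a parameter pair. The level grids follow the quantization described in Section V; the probability vector here is a random placeholder standing in for the policy-network output.

```python
import numpy as np

# Action grid: m levels of SINR_min (dB) and n levels of R, as quantized in Section V.
sinr_min_levels = np.arange(1, 16)            # 15 levels: 1 dB .. 15 dB
r_levels = np.linspace(25, 500, 20)           # 20 levels: 25 .. 500
actions = [(s, r) for s in sinr_min_levels for r in r_levels]   # m * n = 300 combinations

def select_action(policy_probs):
    """Eq. (16): pick the (SINR_min, R) pair with the largest policy probability."""
    return actions[int(np.argmax(policy_probs))]

# Placeholder policy output (in practice produced by the policy network).
probs = np.random.default_rng(3).dirichlet(np.ones(len(actions)))
sinr_min, r_max = select_action(probs)
print(f"chosen SINR_min = {sinr_min} dB, R = {r_max:.0f}")
```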
Rewards. The reward function takes into account sub-rewards for CASE and CESE and is calculated as \begin{equation} r_{t} = \eta \cdot r_{CASE_{t}} + (1-\eta)\cdot r_{CESE_{t}}, \end{equation}
where \eta (0\leq \eta \leq 1) is the weight that sets the desired tradeoff between CASE and CESE. The sub-reward for CASE is defined as \begin{equation} r_{CASE_{t}}= \begin{cases} 1, &CASE_{t+1} \ge CASE_{t}, \\ -1, &otherwise. \\ \end{cases} \end{equation}
The same expression applies to r_{CESE_{t}}.
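The reward of Equations 17 and 18 only needs the KPI values of two consecutive timesteps; a minimal Python sketch, assuming the CASE/CESE values have already been measured, is given below.

```python
def sub_reward(curr, prev):
    """Eq. (18): +1 if the KPI did not decrease from t to t+1, otherwise -1."""
    return 1.0 if curr >= prev else -1.0

def reward(case_next, case_curr, cese_next, cese_curr, eta=0.5):
    """Eq. (17): weighted combination of the CASE and CESE sub-rewards."""
    return eta * sub_reward(case_next, case_curr) + (1.0 - eta) * sub_reward(cese_next, cese_curr)

# Example: CASE improved, CESE degraded, equal weighting -> reward 0.
print(reward(case_next=5.2, case_curr=5.0, cese_next=0.8, cese_curr=0.9, eta=0.5))
```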
2) Training
Based on the above formulation, we represent the policy as a neural network (the policy network) that takes the state s_{t} as input and outputs a probability distribution over all possible actions. The policy network is trained with a variant of the REINFORCE algorithm in an episodic setting. In each training iteration, we run N episodes for a fixed duration of T TTIs to explore the probabilistic space of possible actions using the current policy, and use the resulting data to improve the policy. Note that each timestep of an episode corresponds to a whole TTI of the simulation, consisting of computing SINR_{\min } and R via the policy network, utilizing these parameters in user scheduling, and performing inter-cell interference coordination. The trajectories, consisting of states, actions, and rewards, are recorded for all timesteps of each episode, and these values are used to calculate the discounted cumulative reward v_{t} at each timestep t of each episode. The policy gradient estimate in Equation 15 has high variance, so we subtract a baseline from v_{t}; a simple choice of baseline is the average of the discounted cumulative rewards over the episodes. The implementation of this variant of the REINFORCE algorithm is described in Algorithm 1.
Algorithm 1 Policy gradient algorithm
1: Initialize the policy with parameters \theta_{1}
2: for iteration k=1,2,\ldots do
3: \Delta \theta \leftarrow 0
4: run episodes i=1,\ldots,N under policy \pi_{\theta_{k}} and obtain \{s_{1}^{i},a_{1}^{i},r_{1}^{i},\ldots,s_{T}^{i},a_{T}^{i},r_{T}^{i}\}
5: for t = 1,\ldots,T do
6: for i = 1,\ldots,N do
7: compute returns: v_{t}^{i}=\sum_{s=t}^{T}\gamma^{s-t}r_{s}^{i}
8: compute baseline: b_{t}=\frac{1}{N}\sum_{i=1}^{N}v_{t}^{i}
9: \Delta{\theta} \leftarrow \Delta {\theta} + \alpha {\nabla}_{\theta} \log \pi_{\theta} (s_{t}^{i},a_{t}^{i})(v_{t}^{i}-b_{t})
10: end for
11: end for
12: \theta_{k+1} \leftarrow \theta_{k} + \Delta \theta
13: end for
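For readers who prefer code, the following PyTorch sketch mirrors the structure of Algorithm 1 (episodic REINFORCE with an average-return baseline). The environment interface `run_episode` and the small stand-in network are placeholders assumed for illustration; they are not the per-TTI scheduling simulator or the exact policy network of this paper.

```python
import torch
import torch.nn as nn

N_EPISODES, T, GAMMA, LR = 20, 10, 0.9, 0.01
# Small stand-in policy network; the actual architecture is given in Section V.
policy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 300))
optimizer = torch.optim.SGD(policy.parameters(), lr=LR)

def run_episode():
    """Placeholder environment: in the real system each of the T steps would run
    one TTI of user scheduling and ICIC and return the reward of Eq. (17)."""
    states = torch.rand(T, 2)                                  # [CASE_t, CESE_t]
    with torch.no_grad():
        actions = torch.distributions.Categorical(logits=policy(states)).sample()
    rewards = torch.randint(0, 2, (T,), dtype=torch.float32) * 2 - 1   # placeholder +/-1 rewards
    return states, actions, rewards

for iteration in range(100):
    batch = [run_episode() for _ in range(N_EPISODES)]
    # Discounted returns v_t^i for every episode (Algorithm 1, line 7).
    returns = torch.zeros(N_EPISODES, T)
    for i, (_, _, rewards) in enumerate(batch):
        running = 0.0
        for t in reversed(range(T)):
            running = rewards[t] + GAMMA * running
            returns[i, t] = running
    baseline = returns.mean(dim=0)                             # b_t (Algorithm 1, line 8)
    loss = torch.zeros(())
    for i, (states, actions, _) in enumerate(batch):
        log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
        loss = loss - (log_probs * (returns[i] - baseline)).sum()   # -sum log pi * (v - b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # minimizing the negated objective performs the ascent step of Eq. (15)
```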
C. Inter-Cell Interference Coordination Scheme
The interference coordination mechanism is realized by the measurement module and the precoding module [17]–[19]. The basic scheme for mitigating inter-cell interference is distributed zero-forcing, which requires the channel matrix information of neighboring cells. In order to control the spatial degrees of freedom of the transmit antennas used for inter-cell interference coordination, we define a new parameter \delta, the percentage of cell-edge users needing inter-cell interference suppression, which adjusts the spatial degrees of freedom between inter-cell and intra-cell interference cancellation. Denote e_{s} as the number of cell-edge users in cell s; \delta = 100% means that all cell-edge users perform interference suppression. The steps of the proposed inter-cell interference coordination scheme are as follows.
Step 1:
Each serving cell measures the downlink average SINR of each UE, identifies all users whose downlink average SINR is lower than SINR_{\min}, and defines these users as cell-edge users. The cell-edge users are sorted by SINR in ascending order.
Step 2:
Each serving cell instructs its cell-edge users to measure the strongly interfering cells and to estimate the channel matrices of those cells.
Step 3:
Each serving cell forms the interference matrix table as shown in Table 1.
Step 4:
Each cell in the network exchanges its interference matrix table with its neighbors over the X2 interface, and thereby obtains the interfered edge-user information of the neighboring cells and the channel matrices to these users.
Step 5:
When precoding, each serving cell selects, in ascending order of average SINR, the interfered users whose channel vectors constitute the null space; interference suppression is performed using the channels of the first \delta \times {e_{s}} users of each neighboring cell.
Step 6:
The serving cell generates the null-space matrix of the interfered users’ channel matrix and multiplies the outer precoder by this null-space matrix to achieve interference suppression toward the neighboring cells (a linear-algebra sketch of this projection is given after these steps).
Step 7:
The serving cell constructs the inner precoder to form the final precoding matrix.
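As referenced in Step 6, the following NumPy sketch illustrates the null-space projection used for inter-cell interference suppression: the outer precoder is projected onto the null space of the stacked channel vectors of the selected interfered users. The matrix sizes, the random channels, and the number of victim users are illustrative assumptions.

```python
import numpy as np

Mt = 64                                    # BS transmit antennas
rng = np.random.default_rng(4)
# Channel rows of the delta * e_s interfered edge users selected in Step 5 (3 users here).
H_victims = rng.standard_normal((3, Mt)) + 1j * rng.standard_normal((3, Mt))

# Orthonormal basis of the null space of the victims' channels.
_, _, vh = np.linalg.svd(H_victims)
V = vh[3:].conj().T                        # shape (Mt, Mt - 3)

B_outer = rng.standard_normal((Mt, 8)) + 1j * rng.standard_normal((Mt, 8))   # outer precoder
B_suppressed = V @ (V.conj().T @ B_outer)  # project the outer precoder onto the null space

# The selected edge users now receive (numerically) zero power from this cell.
print(np.linalg.norm(H_victims @ B_suppressed))
```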
In conclusion, the overall DECCO algorithm is formed by the user scheduling scheme with a pre-trained policy network and the subsequent inter-cell interference coordination scheme, as summarized in Algorithm 2.
Algorithm 2 DECCO Algorithm
Phase 1 – User Scheduling
1: Initialization: L = \left \{{ 1, 2, \ldots, K }\right \}, g = \emptyset, k^{*} = \mathop {{\mathrm {argmax}}}\limits _{k \in L} \left({\frac {r_{k}}{D_{k}}}\right), g = g \cup \{ {k^{*}}\}, M = 1, L = L\backslash \{ {k^{*}}\}
2: Compute SINR_{\min } and R_{\max } with the policy gradient network
3: schedule_allow_flag = True
4: while M \le K^{*} and schedule_allow_flag = True do
5: L_{tmp} = L
6: while L_{tmp} \ne \emptyset do
7: Select k^{*} from L_{tmp}, set g^{\prime } = g \cup \{k^{*}\}, calculate \mathbf {B} and \mathbf {P} according to the \mathbf {H} formed by g^{\prime }, calculate the SINR, perform the singular value decomposition of \mathbf {BP}, and obtain \lambda _{\max } and \lambda _{\min }
8: if \frac {\lambda _{\max }} {\lambda _{\min }} \le R_{\max } and SINR \ge SINR_{\min } then
9: g = g \cup \{ k^{*}\}, M = M + 1, L = L\backslash \{ k^{*}\}, schedule_allow_flag = True, break
10: else
11: L_{tmp} = {L_{tmp}}\backslash \{ {k^{*}}\}, schedule_allow_flag = False
12: end if
13: end while
14: end while
Phase 2 – Inter-Cell Interference Coordination
15: Initialization: calculate the SINRs, obtain the edge-user set e_{s}, and obtain the null-space matrix \mathbf {H} of the edge users’ channels
16: Obtain the outer precoding matrix \mathbf {B} and achieve interference suppression for the neighboring cells via {\mathbf {H}}^{H}\mathbf {B}
17: Obtain the inner precoding matrix \mathbf {P} and form the final precoding matrix {\mathbf {H}}^{H}\mathbf {B}\mathbf {P}
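The greedy structure of Phase 1 can be summarized in a short Python sketch. The channel and precoder computation is abstracted behind a placeholder `evaluate_group` function (an assumption for illustration); only the control flow of Algorithm 2 is shown.

```python
import numpy as np

def evaluate_group(group, H):
    """Placeholder for line 7 of Algorithm 2: build B, P for the tentative group and
    compute the group SINR and the singular values of BP. Here we approximate it by
    taking the SVD of the users' channel columns directly."""
    s = np.linalg.svd(H[:, list(group)], compute_uv=False)
    sinr = 1000.0 / np.sum(s ** 2)               # mimics Eq. (4) with mu = P_t / N = 1000
    return sinr, s.max(), s.min()

def schedule(H, pf_metric, K_star, R_max, sinr_min):
    users = set(range(H.shape[1]))
    first = max(users, key=lambda k: pf_metric[k])   # PF choice of the first user (line 1)
    group, users, allow = {first}, users - {first}, True
    while len(group) < K_star and allow:             # outer loop (line 4), capped at K* users
        candidates, allow = set(users), False
        while candidates:                             # inner loop over remaining candidates
            k = candidates.pop()
            sinr, s_max, s_min = evaluate_group(group | {k}, H)
            if s_max / s_min <= R_max and sinr >= sinr_min:   # GAUSS + SINR_min checks (line 8)
                group.add(k); users.discard(k); allow = True
                break                                 # re-enter the outer loop with M + 1 users
    return group

# Toy usage with a random channel and random PF metrics.
rng = np.random.default_rng(5)
H = rng.standard_normal((64, 10))
print(schedule(H, pf_metric=rng.random(10), K_star=4, R_max=5.0, sinr_min=1.0))
```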
SECTION V.
Simulation Results
In this section, we demonstrate the performance and behavior of the proposed concepts and models for coverage and capacity optimization through simulative evaluation [20] of representative case studies.
A. Simulation Setup
We adopt the International Telecommunication Union’s (ITU) three-dimensional urban macro cell (3D-UMa) model as our channel model. The number of antennas at the BS is 64, and the number of antennas at each UE is 2. We use the JSDM scheme as the downlink transmission method. SINR_{\min } is quantized into 15 levels from 1 dB to 15 dB, and R is quantized into 20 levels from 25 to 500. Other simulation parameters are listed in Table 2.
Once the simulation parameters are set, the architecture of the policy network that computes the proposed parameters SINR_{\min } and R in the DECCO algorithm can be identified. The input layer of the resulting neural network has 2 neurons, which accept the CASE and CESE of the cell respectively. The output layer consists of 300 neurons, which correspond to the full set of combinations of SINR_{\min } and R and output the probability distribution over policies. We use two hidden layers, each with 100 neurons, to learn and approximate the optimal policy. Thus the policy network has 4 layers and a total of 6,000,000 parameters.
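A minimal PyTorch sketch of this architecture, under the stated sizes (2 inputs, two hidden layers of 100 neurons, 300 softmax outputs), is shown below; the activation function and initialization are assumptions, as they are not specified above.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """2 -> 100 -> 100 -> 300 policy network: input [CASE, CESE],
    output a probability distribution over the 300 (SINR_min, R) combinations."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(2, 100), nn.ReLU(),      # hidden layer 1 (activation assumed)
            nn.Linear(100, 100), nn.ReLU(),    # hidden layer 2
            nn.Linear(100, 300),               # one logit per (SINR_min, R) combination
        )

    def forward(self, state):
        return torch.softmax(self.layers(state), dim=-1)

net = PolicyNetwork()
probs = net(torch.tensor([[5.2, 0.8]]))        # state = [CASE_t, CESE_t]
print(probs.shape, int(probs.argmax()))        # torch.Size([1, 300]) and the greedy action index
```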
B. Policy Network Training
As discussed earlier, the policy network is used to derive the combination of SINR_{min} and R for the downlink transmission. Concretely, it maps an environment state into a policy (or action) by learning from the large number of trajectories that the agent experiences. In this sense, training the policy network is performed interactively between the agent and the environment. We leave the learning rate \alpha of the policy network in the DECCO algorithm as a hyperparameter to tune during training. We run 1000 training iterations, and in each iteration we run N=20 Monte Carlo simulations (also called episodes) in parallel. We update the policy network parameters using stochastic gradient ascent with the configurable learning rate.
We tune the learning rate and investigate the average reward obtained with different learning rates used to update the policy network parameters. Figure 5 shows the average reward trend for three learning rates, 0.01, 0.05, and 0.10. To make the figure more readable, we plot the common logarithm of the iteration index. The average reward clearly converges to a better level with learning rate \alpha =0.01.
C. Influence of the Weight Factor \eta
Recall that the weight factor \eta in the reward function controls the balance between maximizing CASE and maximizing CESE of the cell. We investigate the influence of the weight factor \eta adopted in the DECCO algorithm on the capacity-coverage performance as the number of BSs N increases from 1 to 7. Specifically, we set \delta, the percentage of cell-edge users needing inter-cell interference suppression, to 100% for this scenario; unless otherwise specified, the results below are obtained with \delta =100\%. Figures 6 and 7 together show the average CASE and CESE performance with different weight factors as the number of BSs grows. The weight factors \eta =0.0 and \eta =1.0 are two special cases in which the two optimization objectives of maximizing CASE and CESE reduce to a single-objective problem. When \eta =0.0, the DECCO algorithm only optimizes CESE since the CASE term becomes zero; this case is unlikely to occur in practical network operation. When \eta =1.0, CASE is the only objective optimized by the DECCO algorithm; this case may occur in practice, since providing good service only to the majority of users is a possible option for network operators. When \eta = 0.8, CASE is the dominant factor contributing to the average reward, so CESE grows more slowly than CASE, while the opposite trend occurs when \eta = 0.3. Equal performance gains of CASE and CESE are achieved when the weight factor \eta is 0.5, which is a safe choice regardless of the number of BSs. However, the inter-cell interference increases with the number of BSs, so CESE decreases more quickly than CASE as the number of BSs increases. In this sense, the weight 1-\eta that controls the contribution of CESE to the reward should be increased to effectively optimize CESE by learning as the number of BSs grows.
D. Performance Comparison of Different CCO Algorithms
We would like to point out that the evaluation of DECCO and the other CCO algorithms is made within the user scheduling scheme, and all CCO algorithms use the same proposed ICIC scheme. We compare the capacity-coverage performance of DECCO, which dynamically configures SINR_{\min } and R via the policy network in the user scheduling algorithm at runtime, with a CCO algorithm that uses the best fixed configuration of SINR_{\min } and R obtained through trial and error, as the number of BSs increases; we denote the latter algorithm as Fixed coverage and capacity Optimization (FO). Moreover, we use the CCO algorithm with proportional fair (PF) scheduling and the proposed ICIC scheme as the baseline, which we denote as PFO. The weight factor \eta for the DECCO algorithm, and SINR_{\min } and R for the FO algorithm, for each number of BSs are listed in Table 3.
Figures 8 and 9 show that the proposed DECCO algorithm outperforms the FO algorithm and the PFO algorithm on both CASE and CESE. In this sense, SINR_{\min } and R are two effective capacity-coverage optimization parameters. The FO algorithm outperforms the PFO algorithm on CASE, but does not perform better than the PFO algorithm on CESE, since it schedules users with fixed thresholds and thus fails to track the changing inter-cell interference. As the number of BSs increases, the performance gains of the DECCO algorithm over the FO algorithm decrease. In practical systems, we can partition the BSs into learning groups to alleviate the learning performance degradation at large scale; for example, each learning group can contain 4 base stations. Figure 10 depicts the empirical cumulative distribution function (empirical CDF) of the spectrum efficiency of the different CCO algorithms at runtime, where the number of BSs is N=7. We also evaluate the influence of different weight factors on our proposed DECCO algorithm at runtime for \eta =0.3 and \eta =0.8. The results are summarized in Table 4, where SD is the abbreviation of Standard Deviation. The DECCO algorithm with \eta =0.3 outperforms the FO algorithm and the PFO algorithm by 5.6% and 18.1% on CASE, respectively. In terms of CESE, the DECCO algorithm with \eta =0.3 outperforms the FO algorithm and the PFO algorithm by 62.9% and 7.5%, respectively. The performance gain on CESE is thus larger than that on CASE, since CESE carries more weight than CASE in the reward. Moreover, the PFO algorithm outperforms the FO algorithm on CESE, as discussed earlier. When \eta =0.8, the DECCO algorithm outperforms the FO algorithm and the PFO algorithm by 22.2% and 36.5% on CASE, respectively. In terms of CESE, the DECCO algorithm outperforms the FO algorithm and the PFO algorithm by 57.1% and 3.8%, respectively. In contrast to \eta =0.3, the performance gain on CASE is larger than that on CESE, because CASE plays a more important role in the reward calculation when \eta =0.8. This is consistent with the investigation of the influence of the weight factor above. An important observation is that, in comparison with the FO algorithm, the DECCO algorithm has more potential for maximizing CESE than CASE, while in comparison with the PFO algorithm it improves CASE more. In addition, the DECCO algorithm has a smaller SD value, which means its coverage and capacity optimization performance is more stable. As a result, our proposed DECCO algorithm is a generally superior method for capacity-coverage optimization.
SECTION VI.
Conclusion
In this paper, we have proposed a novel Deep reinforcement learning Enabled Coverage and Capacity Optimization (DECCO) algorithm, which contains a deep reinforcement learning-based user scheduling scheme and a novel inter-cell interference coordination (ICIC) scheme, to address coverage and capacity optimization in massive MIMO networks. A novel optimization parameter GAUSS, i.e., Group Alignment of Users’ Signal Strength, was proposed together with a unified QoS threshold SINR_{min}; the two are dynamically configured by a pre-trained deep policy gradient-based neural network in each transmission time interval within the user scheduling scheme. Furthermore, a novel ICIC scheme was proposed to further enhance the performance of the deep reinforcement learning-based user scheduling scheme. We conducted extensive simulations to compare the capacity-coverage performance of our proposed DECCO algorithm with a CCO algorithm utilizing the best fixed configuration of the proposed optimization parameters in the user scheduling scheme, and used the CCO algorithm with proportional fair scheduling as the baseline. Simulation results show that 1) the proposed optimization parameters are effective for optimizing coverage and capacity, and 2) the proposed DECCO algorithm outperforms the other two CCO algorithms with the same ICIC scheme on both coverage and capacity. This means our method successfully tracks the dynamics of the considered systems. Moreover, learning clusters can be set up to counteract the decrease of learning gains as the network scale grows; future work will therefore address coverage and capacity optimization with large-scale learning.
ACKNOWLEDGMENT
A portion of this paper was presented at the IEEE Vehicular Technology Conference, Sydney, NSW, Australia, June 2017 [1].