Reinforcement Learning for Energy-Efficient 5G Massive MIMO: Intelligent Antenna Switching

To provide users with high throughput, the fifth generation (5G) and beyond networks are expected to utilize the Massive Multiple-Input Multiple-Output (MMIMO) technology, i.e., large antenna arrays. However, additional antennas require the installation of dedicated hardware. As a result, the power consumption of a 5G MMIMO network grows, which implies, e.g., higher operator costs. From this angle, the improvement of Energy Efficiency (EE) is identified as one of the key challenges for 5G and beyond networks. EE can be improved through intelligent antenna switching, i.e., disabling some of the antennas installed at a 5G MMIMO Base Station (BS) when there are few User Equipments (UEs) within the cell area. To improve EE in this scenario, we propose to utilize a sub-class of Machine Learning techniques named Reinforcement Learning (RL). Because 5G and beyond networks are expected to come with accurate UE localization, the proposed RL algorithm is based on UE location information stored in an intelligent database named a Radio Environment Map (REM). Two approaches are proposed. First, EE is maximized independently for every set of UE positions. Second, the process of learning is accelerated by exploiting similarities between data in REM, i.e., the REM-Empowered Action Selection Algorithm (REASA) is proposed. The proposed RL algorithms are evaluated with the use of a realistic simulator of the 5G MMIMO network utilizing an accurate 3D-Ray-Tracing radio channel model. The utilization of RL provides about 18.5% EE gains over algorithms based on standard optimization methods. Moreover, when REASA is used, the process of learning can be accomplished approximately two times faster.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Miguel López-Benítez.

VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Mobile network throughput can be significantly increased by the utilization of the Massive Multiple-Input Multiple-Output (MMIMO) technology [1]. The idea of MMIMO is to equip a Base Station (BS) with an antenna array of a high number of elements, i.e., much greater than 1 [2]. The large number of antenna elements allows the system to, e.g., increase the amount of wanted signal power received by the User Equipment (UE). This can be achieved through proper weighting of the signal transmitted from each of the BS antennas, i.e., beamforming. However, the underlying cost of improved network throughput is reduced Energy Efficiency (EE). This phenomenon is mainly related to the increased power consumption caused by the additional hardware associated with each of the BS antennas [3]. As high power consumption affects network operators' costs and contributes to the world's carbon footprint, EE optimization is identified as one of the key challenges for fifth generation (5G) and beyond networks [4], [5]. There are several possibilities to improve the EE of the MMIMO network. The highest gains are expected to be obtained by switching off underutilized BSs, and several such algorithms have already been reported in the literature [6]. A representative example is an algorithm that switches off a BS if its traffic can be offloaded to the neighboring cells [7]. However, some research has shown that frequent BS on/off switching can reduce the lifetime of BSs and cause high replacement costs [8]. On the other hand, it has been shown, e.g., in [9], [10], that energy consumption scales nearly linearly with the number of active BS antennas.
This number can be adjusted to the network state, e.g., the number of connected UEs. It has been shown that not all antennas contribute equally to the overall array gain. Therefore, a proper antenna selection algorithm should be used to obtain a subset of active antennas in order to, e.g., improve EE [11]. Because this problem is too complex to be fully specified and solved, several heuristic metrics have been proposed to reduce the computational complexity. The simplest one is to select the antennas characterized by the highest mean channel gains to all served UEs [12]-[14]. A more advanced solution additionally takes into account spatial channel response correlations within the antenna array installed at the BS [15], [16]. Another approach is to use a channel gain-based antenna selection algorithm as an initial solution and then run a bio-inspired optimization algorithm [17]. It has also been shown that greedy algorithms can be successfully used for antenna selection [18]. Finally, antenna selection can be realized with the use of machine learning techniques [19], [20]. However, machine learning-based antenna selection requires large training sets. Moreover, learning the models is very time-consuming compared with, e.g., channel gain-based algorithms. The major drawback of the above-mentioned papers is that they do not provide a good answer to the question: how many antennas do we need? Most of them only propose the order in which the antennas should be deactivated, i.e., an antenna selection algorithm. There are some papers where closed-form expressions are obtained for the number of active antennas that maximizes MMIMO network EE [21]-[23]. However, their authors assume that every antenna contributes equally to the array gain, and a simple Shannon formula-based throughput estimation is used. Under real radio conditions, the network throughput is typically lower than the Shannon formula-based one.
As a result of this inaccuracy, the obtained number of active antennas is misestimated, and the resultant EE is not optimal. One should also notice that it is not a trivial task to use standard optimization methods to improve MMIMO network EE under real conditions, mainly because the MMIMO system is too complex to obtain an accurate analytical model that includes the precoder, scheduler, intra-cell interference, and a realistic radio channel model.
Taking into account the limitations of the standard optimization methods, our proposal is to use machine learning for intelligent antenna switching in a complex MMIMO system. The solution is divided into two parts. First, antennas are sorted, e.g., based on their mean channel gains. Next, the optimal number of active antennas, i.e., the one that maximizes EE under a given pattern of UE positions, is obtained using Reinforcement Learning (RL). The process of RL requires memory to store the results of learning and the necessary parameters. For this purpose, the MMIMO BS is expected to be equipped with an intelligent database of location-dependent data known as a Radio Environment Map (REM) [24]. Although state-of-the-art REMs were designed to store value-location tuples, we propose to map the set of all connected UE positions onto the number of active antennas. A similar representation has been used in the context of base station switching in [25]. We propose to learn how many antennas are needed through interaction with the MMIMO network, i.e., using RL. The high-level idea of the RL application is that, for a given set of UE positions, various numbers of active antennas are tested and their resultant EE is observed in order to learn how to act in the future. In RL, the acquisition of new knowledge (exploration) must be balanced against the exploitation of the current knowledge. At first, we utilize for this purpose a state-of-the-art algorithm named Upper-Confidence-Bound [26]. Then, we show that utilizing the similarities between UE positions saved in REM can accelerate the learning phase. This shows that a combination of RL with REM produces a synergy effect. For evaluation, an advanced system-level simulator of the MMIMO network is used, together with an accurate 3D-Ray-Tracing radio channel model.
While the paper focuses on downlink transmission only, as it is the most important factor in obtaining high EE for a given network operator, the proposed method can easily be extended to uplink transmission.
The main contributions of this paper are as follows:
• Comparison between the optimal number of active antennas indicated by an algorithm based on standard optimization methods [21], [22], [27], and the optimal number of active antennas based on the observation of EE in a realistic simulator based on a 3D-Ray-Tracing radio channel model.
• Proposal of an algorithm combining REM, and RL schemes to map the network state (specified by UE position pattern) to the number of active antennas maximizing EE.
• Exploiting the similarities between sets of UE positions saved in REM in order to accelerate the learning phase.
The paper is organized as follows: Sec. II provides the reader with an overview of the system model. Sec. III presents the intelligent antenna switching algorithm, including a description of the antenna selection algorithm and two approaches to active antenna number computation: state-of-the-art and REM-based. The proposed RL scheme aimed at providing REM with information about the optimal number of active antennas is described in Sec. IV. In Sec. V, the results of computer simulations are presented. The paper is concluded in Sec. VI. To improve readability, all acronyms are listed in Tab. 1.

II. SYSTEM MODEL
This paper aims at the maximization of EE through Intelligent Antenna Switching, considering the downlink in a single MMIMO cell. In the cell there is one BS equipped with $M$ antennas arranged in a rectangular array and serving $K$ single-antenna UEs. For fair comparison, the BS total transmission power does not depend on the number of active antennas and has a constant value of $P_T$. The downlink in the considered MMIMO cell is based on Orthogonal Frequency-Division Multiple Access (OFDMA), where the available spectrum is divided into $N_{rb}$ Resource Blocks (RBs). UEs are capable of reporting their positions with the use of standard cellular network mechanisms. They are assumed to utilize the high-accuracy satellite navigation available in 5G and beyond networks, e.g., Real Time Kinematics (RTK), providing cm-level accuracy [28], [29]. Less accurate location information is expected to decrease the performance of the proposed algorithms, and its impact can be an interesting subject of future studies similar to [30]. The process of intelligent antenna switching consists of two stages and requires two additional functional blocks to be installed at the BS: the Active Antenna Number Computation Block (AANCB) and the Antenna Selection Block (ASB), as depicted in Fig. 1. AANCB is responsible for providing the MMIMO BS with the number of active antennas $\hat{m}$ (out of all available $M$), e.g., on the basis of the current positions of UEs. Then, ASB selects the $\hat{m}$ active antennas indicated by AANCB according to a given selection metric. AANCB and ASB cooperate to improve the EE of the considered MMIMO cell.

A. POWER CONSUMPTION MODEL
Although there are several definitions of EE in the literature, all of them consist of a term related to the throughput and a term related to the power consumption. Thus, before the EE optimization problem can be formulated, a power consumption model suitable for the MMIMO BS must be defined. Several power consumption models have already been proposed for a single-antenna BS or a MIMO BS [31], [32]. These are not adequate for MMIMO, where BSs are equipped with antenna arrays of possibly more than a hundred elements. A popular model of MMIMO BS power consumption can be found in [2]. As in our previous work [25], we consider three components having a major impact on power consumption:
• Effective Transmitted Power (ETP) $P_{ETP}$ is the total power $P_T$ transmitted by the MMIMO BS, taking into account the power amplifier efficiency $\eta$:
$$P_{ETP} = \frac{P_T}{\eta}. \tag{1}$$
• Transceiver Chains Power (TCP) $P_{TCP}(m)$ is the power consumed by the local oscillator, $P_{LO}$, and by $m$ active transceiver chains (equal to the number of active antennas), each consuming $\bar{P}_{TCP}$:
$$P_{TCP}(m) = P_{LO} + m \cdot \bar{P}_{TCP}. \tag{2}$$
• Fixed Power $P_{fix}$ is the constant amount of power consumed by the BS for, e.g., backbone communication or processing of signaling information.
The overall power consumption of the considered MMIMO BS is given by:
$$P_{tot}(m) = P_{ETP} + P_{TCP}(m) + P_{fix}. \tag{3}$$
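The three power components described above can be sketched in a few lines of Python. This is only an illustration: the function name, the parameter names, and all default values are hypothetical placeholders (the actual values are given in the simulation parameter table of the paper, which is not reproduced here).

```python
def total_power(m, p_t=6.3, eta=0.5, p_lo=0.2, p_chain=0.4, p_fix=10.0):
    """Total BS power in watts for m active antennas.

    All default values are hypothetical and chosen only for illustration.
    """
    p_etp = p_t / eta              # effective transmitted power
    p_tcp = p_lo + m * p_chain     # local oscillator plus m transceiver chains
    return p_etp + p_tcp + p_fix   # overall power consumption


if __name__ == "__main__":
    for m in (16, 64, 128):
        print(m, round(total_power(m), 2))
```

Under this model, only the transceiver-chain term depends on the number of active antennas, so each deactivated antenna saves exactly the per-chain power.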

B. PROBLEM FORMULATION
Having a power consumption model suitable for the MMIMO BS, a definition of EE can be introduced and the optimization problem can be formulated. Several definitions of EE can be found in the literature [33]. The most commonly used one defines EE as the ratio of the average network throughput to the average network power consumption. We proposed a slight modification to this definition in [25]:
$$EE(s, m) = \frac{c_{50}(s, m)}{P_{tot}(m)}, \tag{4}$$
where $s$ denotes the set of UE positions, i.e., the state, $m$ is the number of active antennas, $P_{tot}(m)$ is the power consumption related to $m$ active antennas, computed according to (3), and $c_{50}(s, m)$ stands for the median user bitrate related to the set of UEs at positions $s$ and $m$ active antennas. In the original definition of EE, users having high bitrates contribute heavily to the average network EE. In contrast, our proposition improves fairness by protecting users characterized by poor radio conditions. The optimization problem addressed in this paper is to maximize EE (4) independently for all visited states, i.e., sets of UE positions $s$. In every state $s$, the optimization is achieved by adapting the number of active antennas $\hat{m}$ so as to maximize EE (4):
$$\hat{m}(s) = \arg\max_{m \in \{1, \dots, M\}} EE(s, m), \quad \forall s \in S, \tag{5}$$
where $S$ denotes the set of all considered states.
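The median-based EE metric and the per-state maximization over the number of active antennas can be illustrated with a minimal sketch. The function names and the toy numbers are assumptions introduced here for illustration only.

```python
from statistics import median


def energy_efficiency(user_bitrates, p_tot):
    """EE as the median user bitrate divided by the total power consumption."""
    return median(user_bitrates) / p_tot


def best_antenna_count(ee_by_m):
    """Return the number of active antennas with the highest observed EE.

    ee_by_m maps a candidate number of active antennas m to its EE.
    """
    return max(ee_by_m, key=ee_by_m.get)


if __name__ == "__main__":
    # One strong user does not dominate the metric, unlike with the mean.
    print(energy_efficiency([10.0, 20.0, 90.0], p_tot=10.0))
    print(best_antenna_count({8: 1.0, 16: 1.5, 32: 1.2}))
```

Using the median rather than the mean means that a single user with a very high bitrate does not raise the metric, which reflects the fairness argument given in the text.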

III. INTELLIGENT ANTENNA SWITCHING ALGORITHM
To obtain the subset of active antennas, we propose to extend the MMIMO BS by two additional functional blocks, i.e., AANCB and ASB, which together form the Intelligent Antenna Switching Algorithm. Our aim is to split the process between AANCB, which decides on the number of active antennas, and ASB, which is responsible for obtaining the exact subset of active antennas. The motivation behind this split is to reduce the computational complexity. Without ASB, there would be $2^M$ possible configurations of active antennas to be evaluated. With the introduction of ASB, there are only $M$ possible numbers of active antennas to choose from.

A. ANTENNA SELECTION BLOCK
The aim of ASB is to select a subset of $\hat{m}$ active antennas from all $M$ antennas installed at the MMIMO BS. This procedure is known as the antenna selection algorithm. In this paper, the antenna selection algorithm used by ASB is chosen arbitrarily. To focus on providing REM with the proper number of active antennas rather than on improving the antenna selection algorithm, a state-of-the-art solution based on average channel gains has been chosen [11]. The $\hat{m}$ antennas having the highest channel gain, averaged over $K$ users and $N_{rb}$ RBs, are chosen to be active. This algorithm is both intuitive and simple to implement.
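A minimal sketch of the gain-based selection described above is given below. The nested-list layout `gains[a][k][r]` (antenna, user, RB) and the function name are assumptions made for this illustration; a practical implementation would operate on estimated channel matrices.

```python
def select_antennas(gains, m_active):
    """Pick the m_active antennas with the highest mean channel gain.

    gains[a][k][r] is the (linear-scale) channel gain between antenna a
    and user k on resource block r. Returns sorted antenna indices.
    """
    def mean_gain(a):
        values = [g for per_user in gains[a] for g in per_user]
        return sum(values) / len(values)

    ranked = sorted(range(len(gains)), key=mean_gain, reverse=True)
    return sorted(ranked[:m_active])


if __name__ == "__main__":
    # Three antennas, two users, two RBs (toy numbers).
    toy = [
        [[1.0, 1.0], [1.0, 1.0]],
        [[3.0, 3.0], [3.0, 3.0]],
        [[2.0, 2.0], [2.0, 2.0]],
    ]
    print(select_antennas(toy, 2))
```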

B. ACTIVE ANTENNA NUMBER COMPUTATION BLOCK
The aim of AANCB is to compute the number of active antennas that maximizes the EE of an MMIMO cell under a given set of UE positions s. The computed number of active antennas is passed to ASB to obtain the exact subset of active antennas. Two versions of AANCB are described: a state-of-the-art solution based on analytical computations, and a REM-based solution.

1) STATE-OF-THE-ART SOLUTION
The MMIMO network EE has been optimized under the Zero-Forcing (ZF) precoding scheme [21], [22], [27], and the Maximum Ratio Combining (MRC) precoding scheme [34], [35]. The focus of this paper is on the more complex ZF precoder (see Sec. V). In these works, the number of active antennas is adapted to the number of users connected to the MMIMO BS. Although the UE position has a significant impact on wireless channel characteristics, it is not taken into account in the state-of-the-art papers. Moreover, although the authors obtained closed-form expressions for the number of antennas, their simulation scenarios, EE definition, and power models differ slightly from the ones presented in this paper. Thus, these expressions cannot be directly implemented. Instead, we adapt them to the considered definition of EE (4). The bitrate of the UE located at position $s$ is calculated as a function of the number of active antennas $m$ using the approximation suited to the ZF precoder from [27], referred to as (6), where $\beta(s)$ is the mean channel gain between a user located at position $s$ and the BS, averaged over all antennas and RBs, $B$ is the system bandwidth, and $\sigma^2$ is the power of the white Gaussian noise at the user's receiver. Using (6), the median user bitrate $\hat{c}_{50}(s, m)$ is estimated as a function of the number of active antennas $m$ (7). Then, the number of active antennas $\hat{m}$ related to the set of UE positions $s$ is computed by solving the resulting EE maximization problem (8). Because obtaining a closed-form solution of (8) is beyond the scope of this paper, numerical methods are used.
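The numerical procedure outlined above can be sketched as a simple grid search. Note the assumptions: the bitrate function below uses a common textbook ZF approximation, $B \log_2(1 + P_T \beta (m - K) / (K \sigma^2))$ with equal power allocation, which is not necessarily the exact expression used in [27] or in this paper. All function and parameter names are hypothetical.

```python
import math
from statistics import median


def zf_bitrate(beta, m, k, p_t, bandwidth, noise):
    """Assumed ZF bitrate approximation (illustrative, see lead-in)."""
    if m <= k:
        return 0.0  # ZF needs more antennas than users in this approximation
    return bandwidth * math.log2(1.0 + p_t * beta * (m - k) / (k * noise))


def reference_antenna_count(betas, m_max, p_tot, p_t, bandwidth, noise):
    """Grid search for the m maximizing estimated median bitrate over power.

    betas holds one mean channel gain per UE; p_tot is a callable m -> watts.
    """
    k = len(betas)

    def estimated_ee(m):
        rates = [zf_bitrate(b, m, k, p_t, bandwidth, noise) for b in betas]
        return median(rates) / p_tot(m)

    return max(range(1, m_max + 1), key=estimated_ee)


if __name__ == "__main__":
    power = lambda m: 20.0 + 0.4 * m  # hypothetical power model
    print(reference_antenna_count([1e-10] * 7, 128, power, 6.3, 100e6, 4e-13))
```

Because the search space contains only M candidate values, exhaustive evaluation is cheap and avoids the need for a closed-form solution.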

2) REM-BASED SOLUTION
The state-of-the-art solution assumes that only the number of UEs connected to the MMIMO BS and their path losses should be taken into account while computing the number of active antennas. However, K users can create various spatial patterns, resulting in different radio conditions, e.g., their channels can be more or less correlated or follow various non-Rayleigh distributions. Such properties of realistic MMIMO channels have been confirmed by measurements, e.g., [36]. Thus, we propose to optimize the number of active antennas separately for each set of UE positions. For this purpose, an intelligent database called REM is employed at the MMIMO BS. The aim of REM is to store and process information related to the given set of UE positions, e.g., power consumption, bitrates, and number of active antennas. As a long-term result of machine learning, REM provides the MMIMO BS with information about the optimal number of active antennas, maximizing EE for a given set of UE positions.
The data stored in REM is organized in entries, as depicted in Fig. 2. Each REM entry contains, for each considered number of active antennas m, the so-called action value Q(s, m) used by RL and computed on the basis of the previously observed EE(s, m). These variables are described in detail in the next section. Finally, the number of times a particular number of active antennas has been tested, N(s, m), is stored in each REM entry as well.
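One possible in-memory layout of such REM entries is sketched below. It is only an assumption-laden illustration: the class names are hypothetical, the entries are keyed by the exact set of reported (x, y) coordinates (a real system would likely need some position quantization or tolerance), and the initial values of Q(s, m) are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class RemEntry:
    """Per-state statistics: Q(s, m) and N(s, m) for m = 1..n_actions."""
    n_actions: int
    q: list = field(init=False)  # action values, index m - 1
    n: list = field(init=False)  # visit counters, index m - 1

    def __post_init__(self):
        self.q = [0.0] * self.n_actions  # placeholder initial values
        self.n = [0] * self.n_actions


class RadioEnvironmentMap:
    """Maps a set of UE positions onto its REM entry (hypothetical layout)."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.entries = {}

    def lookup(self, ue_positions):
        """Return the entry for this state, creating it if it is new."""
        key = frozenset(ue_positions)  # UE order does not matter
        if key not in self.entries:
            self.entries[key] = RemEntry(self.n_actions)
        return self.entries[key]
```

For example, `RadioEnvironmentMap(128).lookup([(10.0, 5.0), (42.5, 17.0)])` would create a new entry on the first call and return the same object on later calls with the same positions in any order.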

IV. REINFORCEMENT LEARNING SCHEME
The solution proposed in this paper is to provide the network with the optimal number of active antennas, maximizing EE, separately for each set of UE positions. This information is to be stored in REM. However, REM must first acquire this knowledge. An effective approach to this task is to use RL, where the process of learning is based on interaction, i.e., a so-called agent interacts with an environment in discrete time steps by taking so-called actions and observing their outcome, i.e., the reward [26]. This procedure is depicted in Fig. 3 in the context of filling REM with information about the most energy-efficient number of active antennas. The main elements of the RL scheme can be described as follows:
• The Environment is a single cell in the MMIMO network equipped with M antennas and serving K UEs, as described in Sec. II.
• A State s is the set of UE positions. First, the currently reported set of UE positions is compared against REM entries. If the reported set of UE positions cannot be found in REM, a new entry is created.
• An Action m is the considered number of active antennas, m ∈ {1, . . . , M } being the output of AANCB. The exact configuration of active antennas is obtained by ASB, as described in Sec. III-A.
• A Reward r(s, m) in this case is EE computed according to (4). The reward is obtained after an observation period called a step, during which a given action m is under evaluation. The step duration has to be long enough to average out the instantaneous EE variation caused by short-term changes in scheduling and channel coefficients.
• An Agent is a REM unit deployed at MMIMO BS as AANCB. It is responsible for taking actions m according to the environment state s, and updating action preferences on the basis of observed reward r(s, m).
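The interaction between the elements listed above can be sketched as a generic loop over the states visited along one UE path. The three callables are hypothetical stand-ins for the action selection rule, the EE measurement over one step, and the action value update; none of these names come from the paper.

```python
def run_episode(states, select_action, observe_ee, update):
    """Run one pass over the visited states and return the mean reward.

    select_action(s) -> m, observe_ee(s, m) -> reward, update(s, m, reward).
    """
    rewards = []
    for s in states:
        m = select_action(s)      # agent picks the number of active antennas
        r = observe_ee(s, m)      # environment returns EE observed in the step
        update(s, m, r)           # agent stores what it has learned
        rewards.append(r)
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    history = []
    mean_ee = run_episode(
        ["s1", "s2"],
        select_action=lambda s: 4,
        observe_ee=lambda s, m: 2.0 if s == "s1" else 4.0,
        update=lambda s, m, r: history.append((s, m, r)),
    )
    print(mean_ee, history)
```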

A. ACTION VALUES UPDATE RULE
REM stores information about the potential EE related to each possible action m (the number of active antennas) in state s, based on previous experience. In RL, it is known as an action value Q(s, m). Each time the agent receives a reward, the related action value is updated. In general, action values depend both on the reward obtained after the current step and on the action values of the next state. However, in the considered model the movement of UEs does not depend on the number of active antennas, i.e., the action does not affect the future state. This case is known in the literature as the Contextual Bandit or Associative Search problem [26]. As a result, it is enough to use the reward observed after performing the current action. Thus, the update rule is a simple exponential average of the rewards observed after each visit to state s in which action m was taken:
$$Q(s, m) \leftarrow Q(s, m) + \alpha \left[ r(s, m) - Q(s, m) \right], \tag{9}$$
where $\alpha \in [0, 1]$ is a step size parameter.
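The exponential averaging of rewards described above can be illustrated with a short sketch. The function name and the default step size are assumptions made for this example.

```python
def update_action_value(q, reward, alpha=0.1):
    """Move the action value a fraction alpha towards the new reward."""
    return q + alpha * (reward - q)


if __name__ == "__main__":
    q = 0.0
    for _ in range(50):
        # With a constant reward the estimate converges towards that reward.
        q = update_action_value(q, reward=2.0, alpha=0.2)
    print(round(q, 4))
```

A larger step size makes the estimate track recent rewards more closely, while a smaller one averages over a longer history.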

B. UPPER CONFIDENCE BOUND SELECTION ALGORITHM
One of the challenges related to RL is the so-called exploration-exploitation problem, i.e., the problem of how much time an agent should spend on learning the environment by taking less explored actions, as opposed to exploiting previously obtained knowledge by taking the best known action. There are several algorithms designed to balance exploration and exploitation, e.g., ε-greedy, soft-max distribution, and Upper Confidence Bound (UCB) [37]. Due to the so-called channel hardening property of MMIMO systems, the radio signal after precoding is almost immune to small-scale fading [38]. Thus, from the perspective of the UE, achievable bitrates should be relatively stable, i.e., affected mainly by slowly varying large-scale fading. For this reason, we decided to utilize the UCB algorithm, because it does not use randomness for action selection, i.e., the algorithm does not blindly explore actions that are expected to result in a low reward. Action selection following the UCB algorithm is given as follows [26]:
$$\hat{m} = \arg\max_{m \in \{1, \dots, M\}} \left[ Q(s, m) + c \sqrt{\frac{\ln t}{N(s, m)}} \right], \tag{10}$$
where $t$ is the number of steps taken so far, and the square-root term is a measure of the uncertainty of the $Q(s, m)$ estimate in (10). This term grows for uncertain Q(s, m) estimates, i.e., those obtained with a low number of measurements. It is smaller for the actions that are selected often [26]. To control the balance between exploration and exploitation, the constant c is introduced. For high values of c the algorithm tends to focus on exploration, while for low values of c it focuses on exploitation. In the extreme case of c = 0, UCB becomes a greedy algorithm focused only on exploitation.
Additionally, in every state s, the UCB algorithm is forced to take each action at least once, even when c is set to zero. This behavior is called optimistic initialization [26]. It is obtained by initially setting Q(s, m) to very large values. After the first reward is observed, the update rule is Q(s, m) ← r(s, m) instead of (9), which is used for subsequent rewards.
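A compact sketch of UCB action selection with forced initial exploration is shown below, following the standard formulation in [26]. Two assumptions should be noted: the step counter t is taken as the total number of actions tested so far in the given state, and untested actions are detected through their zero counter, which has the same effect as the very large initial action values described above. All names are hypothetical.

```python
import math


def ucb_select(q, n, c):
    """Return the action m in 1..M for one state.

    q[m - 1] and n[m - 1] hold Q(s, m) and N(s, m) for that state.
    """
    for idx, count in enumerate(n):
        if count == 0:
            return idx + 1  # every action is tried once first
    t = sum(n)  # assumed step counter: total tests in this state
    scores = [qv + c * math.sqrt(math.log(t) / nv) for qv, nv in zip(q, n)]
    return scores.index(max(scores)) + 1


def ucb_update(q, n, m, reward, alpha):
    """First reward overwrites Q(s, m); later ones are averaged."""
    idx = m - 1
    if n[idx] == 0:
        q[idx] = reward
    else:
        q[idx] += alpha * (reward - q[idx])
    n[idx] += 1
```

With c = 0 the selection reduces to picking the action with the highest stored value, while a larger c favors actions that have been tested only a few times.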

C. REM-EMPOWERED ACTION SELECTION ALGORITHM
In the case of the UCB algorithm, the EE optimization strictly follows the Associative Search rule, i.e., no information about other states is used. However, it can be expected that similar sets of UE positions would result in a similar number of antennas that should be active. Such a phenomenon was observed during our previous studies on the problem of BS switching [25]. Therein, we took advantage of these similarities between states (REM entries) to speed up the process of learning. Thus, in this paper we propose an approach named REM-Empowered Action Selection Algorithm (REASA) for the purpose of selecting the number of active antennas. We expect that, due to the utilization of the similarities between REM entries, REASA can achieve the same final level of EE as UCB but in less time. REASA is designed as an extension of the UCB algorithm. In the case of UCB, action selection is based on the action values related to the current state only, while REASA uses action values Q(s, m) from all states, averaged with weights depending on the distance to the current state. Suppose that $\{s_l\}_{l=1}^{L}$ denotes the states saved in REM, and $s_i$ is the currently reported set of UE positions. REASA selects the action according to (11), where $\gamma$ is an arbitrary constant that scales the impact of more distant REM entries on the current action selection, and $d_h(s_i, s_l)$ is the Hausdorff Distance between $s_i$ and $s_l$. It is computed according to [39]:
$$d_h(s_i, s_l) = \max \left\{ d(s_i, s_l), \, d(s_l, s_i) \right\}, \tag{12}$$
where:
$$d(s_i, s_l) = \max_{x_k \in s_i} \min_{x_j \in s_l} \delta(x_k, x_j), \tag{13}$$
and $\delta(\cdot, \cdot)$ denotes the Euclidean Distance between two points on the Cartesian plane, $x_k$ and $x_j$. Each of these points represents the coordinates of a single UE. One should note that the Hausdorff Distance is used because it is a state-of-the-art measure of similarity between two sets of points of possibly different sizes, used, e.g., for the purpose of image processing [39]. However, the performance of other distance metrics should be compared in the future, similarly to [30].
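The Hausdorff Distance between two sets of UE positions, and the idea of averaging action values over REM entries according to their distance, can be sketched as follows. The distance functions follow the standard definition. The weighting function `exp(-gamma * d)` is purely a hypothetical placeholder chosen for illustration and is not the expression used by REASA in this paper. All names are assumptions.

```python
import math


def directed_distance(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(x, y) for y in b) for x in a)


def hausdorff(a, b):
    """Symmetric Hausdorff Distance between two sets of (x, y) points."""
    return max(directed_distance(a, b), directed_distance(b, a))


def similarity_weighted_q(current, entries, m, gamma):
    """Distance-weighted average of Q(s_l, m) over stored entries.

    entries is a list of (positions, q_list) pairs; the weight
    exp(-gamma * distance) is an assumed, illustrative choice.
    """
    weights = [math.exp(-gamma * hausdorff(current, pos)) for pos, _ in entries]
    weighted = sum(w * q[m - 1] for w, (_, q) in zip(weights, entries))
    return weighted / sum(weights)
```

With this placeholder weighting, a larger gamma reduces the influence of entries that are far from the current set of UE positions, so the result approaches the action value of the closest stored entry.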
Compared with UCB, tuning the parameters of REASA may be more challenging because there is an additional parameter: γ. The procedure of RL-based Intelligent Antenna Switching using either UCB or REASA action selection is summarized as Algorithm 1.

V. SIMULATION RESULTS
The algorithms aimed at increasing the EE of an MMIMO network by switching off selected antennas are evaluated in this section. For this purpose, an advanced system-level simulator of the 5G-like, OFDMA-based MMIMO cell described in Sec. II is used. The cell covers mainly a park area surrounded by approximately 45 m tall buildings, according to the Madrid Grid Model [40]. The simulation parameters are presented in Tab. 2. The multi-stage system-level simulator used in this manuscript includes, e.g., channel estimation using Sounding Reference Signals, user scheduling with the proportional fair rule, and ZF precoding, as presented in detail in [25]. We consider a medium range BS with a transmit power of 38 dBm [41]. The BS is equipped with a rectangular antenna array of 128 elements placed in 8 rows and 16 columns. The available bandwidth equals 300 MHz around the center frequency of 3.55 GHz. Every 0.5 ms the scheduling algorithm allocates all 272 RBs to the users connected to the MMIMO BS. Resource allocation is done independently for each RB following the proportional fair rule, i.e., based on the ratio between the potential and the past bitrate [42]. Beamforming is realized with the use of the so-called Zero-Forcing (ZF) precoding, aiming at the suppression of intra-cell interference [10]. There are 7 UEs in the cell moving at a speed of 1.5 m/s. Their initial positions are drawn from a uniform distribution. The location of the BS (larger dot) together with a single realization of the UEs' initial positions (smaller dots) is depicted in Fig. 4. This is a low-traffic scenario that can be observed, e.g., during night time.
This scenario is expected to allow for a significant number of transmission chains to be deactivated, significantly increasing EE. The coefficients of radio channels are generated with the use of a realistic 3D-Ray-Tracing model [43]. The channel model individually tracks propagation paths between the UE antenna and each of the BS's antennas. Up to two reflections of each ray are considered (to limit computational complexity), but with many scattering rays possible. To produce accurate coefficients of the radio channel, 3D-Ray-Tracer takes into account the urban scenario, including reflections from the buildings and other obstacles, e.g., randomly distributed and moving pedestrians. To mimic the radio channel estimation error, a zero-mean Gaussian-distributed random variable is added to the real channel coefficients [44].

A. TERMINOLOGY
The process of evaluation of the EE-improving algorithms consists of performing RL for several sets of initial UE positions. For each set of initial UE positions, the users and their movement are observed over some time period. To clarify the design of the simulation experiment, the following terminology is introduced:
• A Step is a time period of 60 ms during which one cycle of RL is performed. First, the initial 10 ms are discarded from the statistics as a start-up phase, allowing, e.g., the scheduler to stabilize. After these 10 ms, the state is identified.
• An Episode is a sequence of several steps with the same UE path. During every episode, the same set of states is visited. From the network perspective, a UE pattern and path identical to those observed at some time in the past mean that a given episode repeats. To assess the algorithms in terms of learning speed, the average EE obtained during consecutive episodes is observed. After many repetitions of a single episode, the UCB algorithm should focus on exploitation. In this case, the mean EE over a single episode shows the average EE over the UE paths (over all visited states).
• No EE denotes a scenario where there is no EE optimization, i.e., all M antennas are active.
• REM-UCB denotes the algorithm of learning REM with the use of UCB described in Sec. IV-B.
• REASA is the algorithm of learning REM, utilizing similarities between REM entries, from Sec. IV-C.
The dependencies between Experiment, Episode, and Step are summarized in Fig. 5.

B. HOW MANY ANTENNAS ARE NEEDED?
The aim of the first simulations is to show that the optimal number of active antennas obtained for the full system model with the REM-UCB algorithm differs from that obtained for simplified modeling with the Ref algorithm (the state-of-the-art solution of Sec. III-B1). For this purpose, 50 experiments are conducted assuming perfect channel knowledge to keep the environment invariant. Under such conditions, the number of episodes required to obtain knowledge about every action is equal to the number of antennas M = 128. Therefore, the number of episodes is set to 128, so that REM-UCB only explores all possible actions, i.e., an exhaustive search is performed. Each episode consists of 1 step, i.e., 50 different random patterns of UE positions are tested. This allows us to estimate the average EE for 7 UEs randomly distributed in a cell. Fig. 6 presents the reward related to each action, averaged over all experiments and normalized by the highest value. The red line refers to the Ref algorithm, while the black one refers to REM-UCB. The shaded area around REM-UCB marks the 95% confidence interval for the mean estimate. It can be seen that the Ref algorithm on average chooses a lower number of active antennas than REM-UCB, i.e., 13 against 31. This is because the Ref algorithm utilizes a simplified system model. It assumes that all antennas contribute equally to the BS throughput and that all users can be simultaneously served by the BS. In practice, due to channel correlations, some UEs cannot be served using the same time-frequency resources. Moreover, due to the antenna array geometry and correlations between antennas, the antennas do not contribute equally to the throughput. That is why the average optimal number of active antennas obtained through direct observations of EE via REM-UCB is larger. However, the results depicted in Fig. 6 are averaged over experiments related to different sets of UE positions. For each set of UE positions, a different number of active antennas can be optimal. Fig. 7 shows a probability density function of the optimal number of active antennas obtained for the full system model over the considered 50 experiments. It can be seen that, although the average optimal number of active antennas is around 30 (exactly 31 according to Fig. 6), different patterns of UE positions can result in an optimal number of active antennas as low as 10 or as high as 100. This result justifies the deployment of REM, where the number of active antennas can be optimized independently for different patterns of UE positions.

C. EVALUATION OF REM-UCB
The previous observations have shown that the number of active antennas indicated by the Ref algorithm is typically underestimated, and that the optimal number of active antennas should be adjusted to the particular pattern of UE positions. For this purpose, the REM-UCB learning algorithm is utilized. To evaluate the learning phase, 10 experiments were conducted. Each experiment consists of 160 episodes, and every episode consists of 15 steps. Because REM-UCB utilizes so-called optimistic initialization, the Q-values are initially set to very large values to enforce checking each action at least once, as explained at the end of Sec. IV-B. Thus, the initial 128 episodes are necessary to evaluate every possible action in each visited state once, while the remaining 32 episodes provide an evaluation of the algorithm performance under different values of the c parameter, after the initial "exploration" is finished. The resultant average reward-EE obtained after each episode and averaged over experiments is depicted in Fig. 8. We can see that during the initial 128 episodes, when each possible action is tested, the reward first increases and, after about 40 episodes, starts to decrease. This shape is partly a result of the implementation, i.e., actions are tested in order, from one active antenna up to all active antennas, over the initial 128 episodes. A maximum is observed because both too many and too few active antennas result in poor EE. After every action has been taken at least once, the algorithm is no longer forced to take unprofitable actions. As a result, after about the 120th episode the reward grows rapidly and then remains relatively stable, providing about 18.5% EE improvement over the Ref algorithm. During the last 32 episodes, the smallest fluctuations are observed for c = 0. In this case, the algorithm performs greedily, i.e., it exploits the current knowledge by always taking the best known action.
This stability of the greedy policy is caused by the fact that in the considered scenario all UEs have a line-of-sight connection to the BSs, and thus the related radio channels are stable, even though a channel estimation error is introduced.
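The interplay between optimistic initialization and the c parameter described above can be illustrated with a minimal sketch for a single state (one pattern of UE positions). It assumes the standard UCB1 selection rule, Q(a) + c·sqrt(ln t / N(a)); the exact rule of Sec. IV-B may differ. The reward function `toy_ee`, the value `optimistic_q`, and the noise model are hypothetical stand-ins for the simulator, not the system model of this paper.

```python
import math
import random


def ucb_select(q, counts, t, c):
    """Return the action index maximizing Q(a) + c * sqrt(ln(t) / N(a))."""
    best_action, best_score = 0, -math.inf
    for a, (q_a, n_a) in enumerate(zip(q, counts)):
        if n_a == 0:
            score = q_a  # untried action: the optimistic Q-value dominates
        else:
            score = q_a + c * math.sqrt(math.log(t) / n_a)
        if score > best_score:  # strict comparison: ties go to the lowest index
            best_action, best_score = a, score
    return best_action


def toy_ee(n_active):
    """Hypothetical concave reward peaking at 31 active antennas."""
    return 1.0 - ((n_active - 31) / 128.0) ** 2


def run(num_actions=128, num_rounds=160, c=0.0,
        optimistic_q=1e9, noise_std=0.0, seed=0):
    """Simulate one state; one round here plays the role of one episode."""
    rng = random.Random(seed)
    q = [optimistic_q] * num_actions  # optimistic initialization
    counts = [0] * num_actions
    history = []
    for t in range(1, num_rounds + 1):
        a = ucb_select(q, counts, t, c)
        # Action index a corresponds to a + 1 active antennas.
        reward = toy_ee(a + 1) + rng.gauss(0.0, noise_std)
        counts[a] += 1
        if counts[a] == 1:
            q[a] = reward  # the first sample replaces the optimistic value
        else:
            q[a] += (reward - q[a]) / counts[a]  # incremental sample mean
        history.append(a + 1)
    return history, counts


history, counts = run(c=0.0)
```

In this toy setting the first 128 rounds sweep the actions from 1 to 128 active antennas, which mirrors the rise and fall of the reward during the forced exploration, and with c = 0 the remaining 32 rounds repeat the best action found.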
To highlight the fact that every set of UE positions can potentially be related to a different number of active antennas, we have plotted the optimal number of antennas and the EE related to each step in Experiment 1. The results are depicted in Fig. 9. It can be seen that the optimal number of active antennas varies from about 20 to about 50 between steps. Similarly, the related EE depends on the UE positions.
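As a rough illustration of how a per-step optimum of this kind can be obtained, the sketch below computes EE as the median-user bitrate divided by the mean power consumption, following (4), and selects the number of active antennas with the highest EE. The function names, the data layout, and all numerical values are hypothetical and serve only as an example.

```python
from statistics import mean, median


def energy_efficiency(user_bitrates, power_samples):
    """EE as median-user bitrate divided by mean power consumption."""
    return median(user_bitrates) / mean(power_samples)


def best_antenna_count(measurements):
    """Pick the antenna count with the highest EE.

    measurements maps a number of active antennas to a pair
    (per-user bitrates, power consumption samples).
    """
    ee = {n: energy_efficiency(b, p) for n, (b, p) in measurements.items()}
    n_best = max(ee, key=ee.get)
    return n_best, ee[n_best]


# Hypothetical measurements for one step (bitrates in Mbit/s, power in W).
step_measurements = {
    16: ([10.0, 20.0, 30.0], [50.0, 50.0]),    # EE = 20 / 50
    32: ([30.0, 40.0, 50.0], [80.0, 80.0]),    # EE = 40 / 80
    64: ([40.0, 45.0, 60.0], [140.0, 160.0]),  # EE = 45 / 150
}
n_best, ee_best = best_antenna_count(step_measurements)
```

For these made-up numbers, 32 active antennas give the highest EE: with 16 antennas the bitrate is too low, while with 64 the power consumption grows faster than the bitrate.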
EE is defined as the ratio of the median-user bitrate and mean power consumption as shown in (4). Fig. 10

D. EVALUATION OF REASA
Although the REM-UCB algorithm provides REM with the optimal number of active antennas for a given set of UE positions, we expect that the learning phase can be sped up, i.e., that the EE gains provided by REM-UCB after about 120 episodes can be reached faster. The speed-up can be achieved by exploiting similarities between the patterns of UE positions, i.e., a similar pattern of UE positions would probably be related to a similar optimal number of active antennas. This idea is implemented in the REASA algorithm described in Sec. IV-C. We have evaluated the REASA algorithm under the same scenario as REM-UCB: 10 experiments, each consisting of 160 episodes of 15 steps. In addition, the REM is filled with 50 entries from the beginning. These entries were obtained during the simulations reported in Sec. V-B. Most importantly, the experiments generated here are independent of those reported in Sec. V-B, i.e., UE positions and paths are not repeated. The resultant average reward-EE obtained after each episode and averaged over experiments is depicted in Fig. 11. As can be observed, due to the utilization of REASA together with a REM possessing initial knowledge about past experiments, EE can be improved almost instantly for a completely new pattern of UE positions. The drawback of using c = 0 in the case of REASA is that the algorithm tends to permanently exploit sub-optimal solutions. However, these solutions are much better than Ref. The quality of the sub-optimal solution depends on the γ constant. The higher the γ, the smaller the impact of similar REM entries on the current action selection. The best sub-optimal solution is observed for γ = 1.5. The same level of EE as for REM-UCB can be achieved by REASA by making the algorithm less greedy (increasing c) and by properly balancing the impact of similarities on the current action, i.e., tuning the γ parameter.
Although γ = 1.5 gave the best sub-optimal result, the optimal solution could not be achieved for this value of γ by increasing the constant c. Thus, we decided to slightly reduce the impact of similar REM entries on the current action by setting γ = 3.0. The performance of REASA under γ = 3.0 and a varying constant c is depicted in Fig. 12. It can be seen that the constant c must be properly chosen. Too low a value of c favors a sub-optimal solution because of excessive exploitation, i.e., c = 0 (green line) and c = 0.01 (purple line). On the other hand, too high a value of c results in too frequent exploration of sub-optimal actions. As a result, the performance is slightly worse than that of REM-UCB. The best results were observed for c equal to 0.05. Most importantly, this value of c guarantees convergence to a solution as good as the one observed for the REM-UCB algorithm. Moreover, due to the utilization of similarities between REM entries, the REM learning procedure can be shortened from about 120 episodes to only about 60. This shows that the proposed REASA can reduce the required learning phase approximately twofold. These results are promising from the perspective of potential practical implementations, in which the network operates over a long time. In this case, similar states should occur very often. Then, REASA is expected to provide an even higher speed-up of the learning phase in relation to REM-UCB.
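The qualitative role of γ discussed above can be illustrated with a simple sketch. It is not the REASA rule of Sec. IV-C; it is a hypothetical similarity-weighted bonus in which each stored REM entry supports its own best action with a weight exp(-γ·d), where d is the distance between the patterns of UE positions. The distance metric, the weighting function, and all numbers below are assumptions made only for illustration.

```python
import math


def pattern_distance(p, q):
    """Hypothetical metric: mean Euclidean distance between UEs paired by index."""
    return sum(math.dist(u, v) for u, v in zip(p, q)) / len(p)


def similarity_bonus(new_pattern, rem_entries, gamma, num_actions):
    """Accumulate exp(-gamma * d) weighted support for each stored best action.

    rem_entries is a list of (pattern, best_action, ee) tuples.
    """
    bonus = [0.0] * num_actions
    for pattern, best_action, ee in rem_entries:
        weight = math.exp(-gamma * pattern_distance(new_pattern, pattern))
        bonus[best_action] += weight * ee
    return bonus


def select_action(q, bonus):
    """Greedy choice over the Q-values augmented with the similarity bonus."""
    scores = [q_a + b_a for q_a, b_a in zip(q, bonus)]
    return max(range(len(scores)), key=scores.__getitem__)


# Two hypothetical REM entries: (UE positions in meters, best action, EE).
rem = [
    ([(0.0, 0.0), (10.0, 0.0)], 3, 1.0),
    ([(100.0, 100.0), (120.0, 100.0)], 7, 1.0),
]
new_pattern = [(1.0, 0.0), (10.0, 1.0)]  # close to the first entry
q = [0.0] * 8
q[5] = 0.5  # locally learned preference of the new state

low_gamma_action = select_action(q, similarity_bonus(new_pattern, rem, 0.1, 8))
high_gamma_action = select_action(q, similarity_bonus(new_pattern, rem, 3.0, 8))
```

With the low γ the nearby REM entry dominates and its best action is reused, whereas with the high γ the bonus nearly vanishes and the locally learned Q-values decide. An exploration term scaled by c could be added to the scores in the same way as in UCB.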

VI. CONCLUSION
The results have shown that state-of-the-art optimization of EE for an MMIMO BS, based on analytical formulas, does not provide an optimal solution under realistic conditions. This is because of some simplifying assumptions, e.g., that every antenna contributes equally to the resultant array gain. Instead, we propose to extend the network with a REM unit and to utilize an RL scheme to provide a mapping between patterns of UE positions and the number of active antennas. The results have shown that this approach can increase EE by 18.5% compared to the reference algorithm. Moreover, by exploiting the similarities between REM entries, i.e., by using the REASA algorithm instead of the REM-UCB method, the learning phase can be shortened from about 120 to about 60 episodes.
In the future, more focus can be put on the selection and improvement of the antenna selection algorithm itself. More advanced metrics can be used to assess network and user bitrates, e.g., by taking into account the correlations between radio channels. Finally, studies similar to [30] can be performed to compare the performance of REASA under an alternative distance metric, e.g., the Sum of Minimums.