Deep Convolutional Neural Network Assisted Reinforcement Learning Based Mobile Network Power Saving

This paper addresses the power saving problem in mobile networks. Base station (BS) power and network traffic volume (NTV) models are first established. The BS power is modeled based on in-house equipment measurement by sampling different BS load configurations. The NTV model is built based on traffic data in the literature. Then, a threshold-based adaptive power saving method is discussed, serving as the benchmark. Next, a BS power control framework is created using Q-learning. The action-state function of the Q-learning is approximated via a deep convolutional neural network (DCNN). The DCNN-Q agent is designed to control the loads of cells in order to adapt to NTV variations and reduce power consumption. The DCNN-Q power saving framework is trained and simulated in a heterogeneous network including macrocells and microcells. It can be concluded that the proposed DCNN-Q method outperforms the threshold-based method in power saving.


I. INTRODUCTION
A. BACKGROUND
In the era of data, information is flowing in an unprecedented way, anytime and everywhere. It is reported in [1] that the number of mobile broadband subscriptions will approach eight billion by 2025. The amount of mobile data traffic is anticipated to grow at an exponential pace, reaching 160 exabytes (EB, $10^{18}$ bytes) per month within the same time period. Emerging applications such as augmented reality (AR), virtual reality (VR), vehicle to everything (V2X), and internet of things (IoT) are projected to contribute increasingly to the massive growth of data traffic.
The fifth generation (5G) mobile network (MN) [2]-[4] has introduced groundbreaking technologies in order to satisfy this growing demand of data traffic. Millimeter-wave (mmWave), for instance, is a well-recognized solution as high bandwidths in mmWave are able to provide more available radio resources. In addition, the use of massive multiple-input multiple-output (MIMO), which equips base stations (BSs) and user equipments (UEs) with an increasing number of antennas, can reduce intercell interference and boost network throughput. Most importantly, reducing cell size and increasing cell density have been the main source of enhancing network throughput [5], [6]. There is no exception in 5G networks, as they are expected to significantly scale up cell densities.
The associate editor coordinating the review of this manuscript and approving it for publication was Ivan Wang-Hei Ho.
However, denser cells come at the cost of larger MN power consumption, which increases greenhouse gas emissions and accelerates global warming. Operators such as Vodafone have set targets to reduce greenhouse gas emissions by 50% by 2025 [7]. Reducing power consumption not only reduces greenhouse gas emissions, but also reduces the operating cost of MNs. To tackle the problem of MN power saving, practical models for BS power consumption and data traffic as well as smart resource management techniques are required.
The authors of [8] measured BS power in real equipment and proposed a number of linear power models in terms of load for the remote unit (RU) only. In [9], power models were built for components in a BS, such as the power amplifier and filter. That work concluded that power consumption in the downlink was dominant. Measurement of voice traffic was presented in [10]. More generally, the white paper [11] revealed real-world traffic patterns of various applications. Both measurement reports showed that network traffic volume (NTV) was normally higher during weekdays and lower during weekends.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
There are a number of classic cell on/off algorithms, including optimizing user association, optimizing BS coverage, traffic prediction, and heterogeneous deployment [12]. In [10], a concept known as network-impact was proposed, which is calculated as the maximum of the sum of the original BS load and the additional load increments brought by neighboring BSs. The algorithm in [10] required heuristic parameters. In [13], the user association to BSs and dynamic BS operations were jointly optimized for the purpose of improving energy efficiency. The switching on and off of BSs relied on a greedy algorithm and heuristic parameters. The authors of [14] and [15] proposed algorithms to adjust cell coverage to reduce power consumption. Methods of traffic pattern and BS energy consumption pattern prediction were discussed in [16]. In [17], stochastic geometry was used to model distributions of macrocells and low-power cells. The minimum separation distance between a macrocell and a low-power cell was optimized to reduce interference and power consumption. Besides the discoveries in academia, industry has designed schemes to reduce power consumption as well. The 3GPP 5G new radio (NR) [18] has replaced the always-on cell-specific reference signal (CRS) in 4G long-term evolution (LTE) [19] with a novel reference signal framework, including demodulation reference signals (DMRSs) and channel state information reference signals (CSI-RSs). These are user-specific and flexibly configurable. As a result, power consumption is reduced when there is no traffic or measurement for certain UEs.
Besides classic methods, machine learning (ML) based methods have attracted researchers to explore new approaches to solve the MN power saving problem [20], [21]. Assuming accessible location information, [22] proposed a reinforcement learning (RL) based method to predict the movement of UEs and dynamically adjust the powers of the handover target cell and the original cell. The authors of [23] used RL to optimize the durations of different sleep modes to reduce power consumption. These RL based methods did not consider realistic power and traffic models. Also, the loads of BSs were not directly controlled. In this paper, a centralized deep RL based method is proposed to intelligently control BS loads according to realistic power and traffic models. In a multi-cell mobile network, it is natural to consider a distributed architecture where each cell is equipped with one RL agent, so that multiple agents perform RL individually. However, a distributed architecture suffers from the moving target problem [24], where the behavior of each agent can affect the behaviors of other agents. In contrast, the centralized architecture used in this paper assumes a single agent that controls all cells in the mobile network. This can accelerate convergence.

B. CONTRIBUTIONS
The contributions of this paper are listed as follows.
1) A power model and an NTV model for base stations are proposed. The power model is established based on measurement data in real-world base stations. More importantly, detailed power consumption in a data unit (DU) and an RU is shown. The NTV model is obtained from measurement data in the literature. These two models are able to provide realistic descriptions of the dynamics of network power consumption over time.
2) A threshold-based power saving method is proposed. This method uses a cell load adaptation equation to update cell loads to adjust power consumption.
3) Most importantly, a deep learning approach, i.e., deep convolutional neural network based Q-learning (DCNN-Q), for power saving is proposed. The proposed method uses a centralized architecture and Q-learning to control cell loads, with the action-state function approximated by a DCNN. The DCNN takes as input not only a one-dimensional (1D) load vector, but also a two-channel two-dimensional (2D) image containing information on the instantaneous NTV requirement and network throughput.
The rest of this paper is organized as follows. Section II proposes a power model based on measurement data and an NTV model based on literature data. The problem description and system model are presented in Section III. The benchmark method, i.e., the threshold-based method, is investigated in Section IV. Our proposed DCNN-Q method is discussed in detail in Section V. Simulation/numerical results and analysis are presented in Section VI. Conclusions are drawn in Section VII.

II. POWER MODEL AND NETWORK TRAFFIC MODEL
A. POWER MEASUREMENT IN REAL-WORLD EQUIPMENT
The power measurement was conducted in our in-house lab on real LTE DU and RU equipment. The power of both the DU and the RU was measured under different load settings by installing a power meter on the DU and RU power cables. The load of the system is the ratio of the number of active physical resource blocks (PRBs) to the total number of available PRBs. This was configured using the orthogonal channel noise simulator (OCNS) functionality via the command line interface (CLI) during the measurement. The flowchart of the measurement is depicted in Fig. 1. A typical set of load settings, i.e., 0%, 50%, and 100%, was configured. The total measurement period lasted for 10 hours and the readings of the power meter were recorded every 15 minutes. As a result, there were 41 measured samples in total.

B. POWER MODEL
After obtaining the measured power data, a power model can be established. In this paper, linear models for the DU power $P_{\mathrm{DU}}$ and the RU power $P_{\mathrm{RU}}$ based on measured data are proposed, i.e.,
$$P_{\mathrm{DU}}(l) = 1.68\,l + 266.98 \tag{1}$$
$$P_{\mathrm{RU}}(l) = 153.50\,l + 93.95 \tag{2}$$
where $l$ represents the load of the unit. Both proposed models, the total power model $P_{\mathrm{Total}}(l)$, and the measured data are shown in Fig. 2. It can be observed that the power of the DU is not sensitive to the change of load. On the other hand, the change of load can change the power of the RU from 94 W to 247 W. Moreover, when the load falls to zero, the DU and RU can be assumed to be switched off. In this case, the total power is assumed to reduce to zero in this paper, although in practice there can be a small amount of energy consumption. Hence, the total power can be expressed as
$$P_{\mathrm{Total}}(l) = \begin{cases} P_{\mathrm{DU}}(l) + P_{\mathrm{RU}}(l), & l > 0 \\ 0, & l = 0 \end{cases} \tag{3}$$
The power saving comes from two sources. First, each cell adapts its load according to the current network traffic. Second, certain low-load cells hand over their traffic to other cells such that these low-load cells can be completely switched off.
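As an illustration, the linear models (1) and (2) and the zero-load switch-off assumption can be sketched in a few lines of Python. The function names are hypothetical and chosen only for this example; the coefficients are the ones stated in the text, and powers are in watts.

```python
def du_power(load: float) -> float:
    """DU power in watts from the linear model (1); load is in [0, 1]."""
    return 1.68 * load + 266.98


def ru_power(load: float) -> float:
    """RU power in watts from the linear model (2); load is in [0, 1]."""
    return 153.50 * load + 93.95


def total_power(load: float) -> float:
    """Total BS power in watts, assuming DU and RU are switched off at zero load."""
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must lie in [0, 1]")
    if load == 0.0:
        # Switch-off assumption from the text: zero load means zero power.
        return 0.0
    return du_power(load) + ru_power(load)


if __name__ == "__main__":
    for load in (0.0, 0.25, 0.5, 1.0):
        print(f"load={load:.2f}  total={total_power(load):.2f} W")
```

Under these models the RU accounts for nearly all of the load-dependent power, which is why load adaptation and switch-off are the two levers considered for power saving.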

C. NETWORK TRAFFIC MODEL
The network traffic model in this paper was developed based on measured NTV data published in [11]. The measured NTV data were extracted by visual inspection. It can be observed in [11] that the shape of the NTV in each single day is similar. However, the absolute NTV values on weekdays and weekends are different. Therefore, to establish the model, a two-step approach is used in this paper. First, a normalized NTV model for a single day is established to characterize how the NTV varies over the hours of a day. Second, another model is established to characterize how the NTV varies over the days of a week. The NTV model $V_1(t)$ for a single day can be expressed as a 20th-order polynomial as
$$V_1(t) = \sum_{n=0}^{20} a_n t^n \tag{4}$$
where $t \in [0, 24)$ is the hour of a day and $a_n$ is the coefficient of the $n$th-order term. Least-squares estimation was performed and the coefficients $a_n$ can be found in Table 1. Fig. 3 compares the single-day model with the normalized measured NTV in a day in [11]. It can be seen that the valley of the NTV appears at approximately 4 am, where the NTV is 15% of the daily peak. The peak of the NTV occurs at 9 pm. The proposed 20th-order polynomial model provides a sufficient approximation to the real-world one-day measured NTV.
FIGURE 3. Comparison between the single-day NTV model and measured NTV in [11].
Next, to capture the variation of days within a week, the NTV model $V_2(\tau)$ for a week can be expressed as a 5th-order polynomial as
$$V_2(\tau) = \sum_{m=0}^{5} b_m \tau^m \tag{5}$$
where $\tau = 1, 2, \ldots, 7$ represents Monday to Sunday and $b_m$ is the coefficient of the $m$th-order term. The coefficients $b_m$ are in Table 2. Then, with (4) and (5), the NTV of a week can be synthesized via
$$V(t) = \eta\, V_1\big(\mathrm{mod}(t, 24)\big)\, V_2\big(\lfloor t/24 \rfloor + 1\big) \tag{6}$$
where $t$ is the hour in a week ($t \in [0, 168)$), $\mathrm{mod}(\cdot)$ is the modulus operator, $\lfloor \cdot \rfloor$ is the flooring operator, and $\eta$ is a scaling factor to scale the normalized NTV to a realistic NTV. The normalized NTV model is depicted in Fig. 4. It can be observed that during the weekdays, the NTVs are similar. However, during the weekend, the NTVs drop from Saturday to Sunday. Furthermore, we define a parameter $\gamma$ called the safety margin, which quantifies the largest rate of change of the NTV between two adjacent time instances $t_\nu$ and $t_{\nu+1}$, where $\nu$ is the sub-interval index when an observation interval of a certain length is divided into equal-size sub-intervals. A network that takes $\gamma$ into consideration will be able to satisfy the demand during the period when the traffic increases at the steepest rate. From (6), it can be computed numerically that $\gamma$ equals 0.4.
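The synthesis of a weekly NTV from a daily polynomial and a weekly polynomial can be sketched as follows. This is a minimal sketch under stated assumptions: the weekly NTV is assumed to be the scaled product of the daily model evaluated at the hour of the day and the weekly model evaluated at the day index, and the safety margin is assumed to be the largest relative increase between adjacent samples. The coefficient lists in the usage lines are toy placeholders, not the values of Tables 1 and 2, and all function names are hypothetical.

```python
from typing import Sequence


def horner(coeffs: Sequence[float], x: float) -> float:
    """Evaluate sum(coeffs[n] * x**n) using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def weekly_ntv(t: float, daily_coeffs: Sequence[float],
               weekly_coeffs: Sequence[float], eta: float = 1.0) -> float:
    """NTV at hour t of the week, t in [0, 168).

    Assumption: weekly NTV = eta * V1(hour of day) * V2(day of week).
    """
    if not 0.0 <= t < 168.0:
        raise ValueError("t must lie in [0, 168)")
    hour_of_day = t % 24.0
    day_of_week = int(t // 24) + 1  # 1 = Monday, ..., 7 = Sunday
    return eta * horner(daily_coeffs, hour_of_day) * horner(weekly_coeffs, day_of_week)


def safety_margin(samples: Sequence[float]) -> float:
    """Largest relative increase between adjacent samples (assumed definition)."""
    return max((b - a) / a for a, b in zip(samples, samples[1:]))


if __name__ == "__main__":
    # Toy placeholder coefficients, NOT the fitted values from the paper.
    daily = [0.5, 0.02]
    weekly = [1.0, -0.05]
    half_hourly = [weekly_ntv(0.5 * n, daily, weekly) for n in range(336)]
    print(safety_margin(half_hourly))
```

With the fitted coefficients substituted for the placeholders, the same loop over half-hour steps would give a numerical estimate of the safety margin for the chosen step size.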
Assume that there are $N_{\mathrm{user}}$ users per cell with index $i = 1, 2, \cdots, N_{\mathrm{user}}$. The user traffic volume (UTV) of the $i$th user over time is modeled using $Z_i$, a user-specific independent and identically distributed (i.i.d.) log-normal random variable, i.e., $\ln Z_i \sim \mathcal{N}(0, \sigma^2)\ \forall i$, which describes user-specific traffic variations. Parameter $\sigma$ is the standard deviation of the UTV among different users.

III. PROBLEM DESCRIPTION AND SYSTEM MODEL
A. PROBLEM DESCRIPTION
From Section II, the mobile network power saving problem is to adjust the loads of cells according to the current NTV requirement. Furthermore, to save the largest amount of power, a handover mechanism needs to be considered such that certain cells can migrate their attached users to other cells, reduce their loads to zero, and switch off. However, since the number of cells can be massive in the area of interest (AOI), the solution space of this combinatorial optimization problem will be too large for exhaustive search, even for a single time instance. Moreover, the NTV evolves over time. The solution of the problem should be sufficiently flexible to handle the variation of the NTV.

B. NETWORK DEPLOYMENT
In this paper, we consider an approximately 1 km × 1 km AOI which is covered by four frequency bands. Among these four frequency bands, three are for urban micro (UMi) and one is for urban macro (UMa). Settings of these four frequency bands are listed in Table 3. The UMi cells have carrier frequencies of 2.1 GHz, 2.7 GHz, and 3.6 GHz, and they follow a two-ring hexagonal layout [25] with a 200 m inter-site distance (ISD). For a two-ring hexagonal layout [25], each band has 19 three-sector sites, resulting in 57 cells per band. The UMa cells have a carrier frequency of 1.8 GHz and follow a one-ring hexagonal layout with a 500 m ISD. For a one-ring hexagonal layout [25], the band has 7 three-sector sites, resulting in 21 cells in total. Users are uniformly and randomly dropped into the AOI for each band and the average total NTV for each band of each cell, i.e., the mean of the sum of all user traffic within a cell, equals (6) for a specific time $t$. It should be noted that each band fully covers the AOI. When a user is dropped inside the AOI, it will choose the cell in a certain band which provides the largest received power. Also, in this paper, for the sake of reducing power, handover between different bands is allowed. Namely, when cells in two different bands have similar coverage areas, one cell in Band 1 can migrate all of its traffic to the other cell in Band 2, provided that the cell in Band 2 will not be overloaded. Then, the cell in Band 1 will have zero load and it can completely switch off.
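The cell counts quoted above follow from the standard site count of a hexagonal layout with a given number of rings around a central site. A short sketch of the arithmetic, with hypothetical function names, is:

```python
def sites_in_hex_layout(rings: int) -> int:
    """Number of sites in a hexagonal layout: one center site plus 6*r sites in ring r."""
    return 1 + 3 * rings * (rings + 1)


def cells_per_band(rings: int, sectors_per_site: int = 3) -> int:
    """Number of cells in one band when every site has the same number of sectors."""
    return sites_in_hex_layout(rings) * sectors_per_site


if __name__ == "__main__":
    umi_cells = cells_per_band(rings=2)   # two-ring layout, per UMi band
    uma_cells = cells_per_band(rings=1)   # one-ring layout, UMa band
    total_cells = 3 * umi_cells + uma_cells
    print(umi_cells, uma_cells, total_cells)
```

Three UMi bands and one UMa band therefore give the 192 cells that the DCNN-Q agent controls.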

C. SINR AND NETWORK THROUGHPUT CALCULATION
Since different bands do not interfere with each other, the band index is dropped in the following expressions. Consider a certain band and let $P_k$ be the transmit power of the $k$th cell. In (9), the interference from the $j$th base station to the $i$th base station enters through the term $\beta_{ij} P_j \chi_{ij}$, where $\chi_{ij}$ is a coefficient representing the interference ratio between the $i$th base station and the $j$th base station. As the $j$th base station is only using $N^{(k)}_j$ PRBs, the interference power emitted by it is a fraction of its total power $P_j$. At the same time, since the $i$th base station has only $N^{(k)}_i$ active PRBs, the interference power it receives is a fraction of the interference power emitted by the $j$th base station. As a result, $\chi_{ij}$ is a function of $N_{\mathrm{PRB}}$ and the numbers of active PRBs of the two base stations. The network throughput provided by the $k$th cell is denoted by $T^{(k)}$; its expression includes a factor $\mu$ accounting for the overhead and the number of layers during the transmission process. The area throughput in the AOI, $T_{\mathrm{area}}$, is then the sum of the network throughput of all cells, i.e.,
$$T_{\mathrm{area}} = \sum_{k} T^{(k)}.$$
From (9) and (11), it can be observed that the network throughput may not always be monotonically increasing with loads, because as loads increase, mutual interference among cells increases as well. Also, when a set of new loads is configured for all the cells, the SINR and throughput map of the AOI need to be updated.

IV. BENCHMARK METHOD: THRESHOLD-BASED POWER SAVING
Control problems like MN power saving are usually approached by threshold-based methods. Namely, a feedback loop is established and the feedback is mapped to a metric, such that actions are taken based on whether the metric is higher or lower than a threshold. For power saving, these actions include scaling the loads of cells up or down, handing over traffic to other bands, switching off cells whose loads are zero, and switching on cells. Let $V_X^{(k)}$, $T_X^{(k)}$, and $l_X^{(k)}$ be the NTV, the network throughput, and the cell load of the $k$th cell in Band $X$, respectively. The adaptation of the cell load $l_X^{(k)}$ at time $t_{n+1}$ is expressed by (13), where $\gamma$ from (7) is a safety margin such that the cell load is enough for the steepest NTV increase. It can be seen from (13) that the cell load at time $t_{n+1}$ is the sum of two terms. The first term is a scaled version of the load at the previous time $t_n$. The gap between two time instances is customizable and is assumed to be half an hour in this paper. The second term is an additional load if Band $Y$ is switched off and Band $Y$ migrates its traffic to Band $X$.
When the load of the $k$th cell $l_X^{(k)}(t_{n+1})$ in Band $X$ is less than a threshold $\xi_1$, i.e., $l_X^{(k)}(t_{n+1}) < \xi_1$, the cell will hand over its traffic to another band; the cell can then be switched off and $l_X^{(k)}(t_{n+1})$ will be set to zero. For simplicity, we assume that the handover is done by handing over from Band $X$ to Band $X+1$. This is a feasible simplification if the traffic distributions in Band $X$ and Band $X+1$ are statistically the same. On the other hand, let $\bar{l}_X^{\mathrm{active}}$ denote the average load of active cells in Band $X$. If $\bar{l}_X^{\mathrm{active}} > \xi_2$, meaning that the currently active cells have heavy load, then inactive cells in Band $X-1$ should be switched on to help handle traffic. The settings of $\gamma$, $\xi_1$, and $\xi_2$ in this paper are heuristically determined and listed in Table 4.
The procedure of the threshold-based power saving method is shown in Fig. 5. The procedure starts with scaling the cell load based only on traffic variation. Then, each band adjusts the on/off states of its cells. This is achieved by calculating the average load of active cells. If this load is larger than $\xi_2$, more cells are required to offload upcoming traffic. Therefore, inactive cells in Band $X-1$ are switched on. If this load is less than $\xi_2$, the load of each cell in Band $X$ is compared to $\xi_1$. If a cell load is less than $\xi_1$, this cell has low load and can be switched off as soon as its traffic is handed over to Band $X+1$. Otherwise, the updated load is calculated according to (13).
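The on/off decision step of this procedure can be sketched as follows. This is only a sketch of the two threshold comparisons described above; it does not implement the load adaptation equation (13), and the threshold values in the usage lines are hypothetical placeholders rather than the settings of Table 4. The function name and the returned dictionary layout are likewise assumptions made for this example.

```python
from statistics import mean
from typing import Dict, List, Sequence, Union


def band_decisions(loads: Sequence[float], xi1: float,
                   xi2: float) -> Dict[str, Union[bool, List[int]]]:
    """Decide on/off actions for one band from its per-cell loads.

    Returns whether inactive cells in the previous band should be switched on,
    and the indices of cells in this band that can hand over and switch off.
    """
    active = [load for load in loads if load > 0.0]
    if active and mean(active) > xi2:
        # Active cells are heavily loaded: wake up cells in Band X-1.
        return {"switch_on_previous_band": True, "switch_off": []}
    # Otherwise, lightly loaded active cells hand over to Band X+1 and switch off.
    switch_off = [k for k, load in enumerate(loads) if 0.0 < load < xi1]
    return {"switch_on_previous_band": False, "switch_off": switch_off}


if __name__ == "__main__":
    # Hypothetical thresholds for illustration only.
    print(band_decisions([0.90, 0.95, 0.0], xi1=0.2, xi2=0.8))
    print(band_decisions([0.10, 0.50, 0.0], xi1=0.2, xi2=0.8))
```

Because both thresholds are fixed heuristics, the behavior of this rule depends strongly on how they are tuned, which is the limitation the learning-based method aims to remove.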

V. DCNN-Q FOR MN POWER SAVING
A. RL REVIEW
RL is a trial-and-error machine learning technique, in which an agent samples the environment and takes actions on it. The environment is everything that cannot be arbitrarily modified by the RL agent; it provides feedback to the RL agent containing the reward corresponding to the action. When the RL agent obtains a sample from the environment, this sample is known as a state. The RL agent attempts to make a sequence of decisions on actions in order to achieve a certain goal. The difficulty of RL is that an action taken in each step affects the actions in later stages.
A Markov decision process (MDP) provides a widely used model for RL [26]. An MDP can be modeled by a 4-tuple $E = \langle S, A, P, R \rangle$. The state space $S$ consists of all possible states of the environment. A state $s \in S$ is the RL agent's perception of the environment. The action space $A$ contains the potential actions to be taken by the RL agent. Assume that the state is $s$; when action $a \in A$ is taken, the environment transits to a new state $s'$. This transition is modeled by a hidden transfer function $P: S \times A \times S \to \mathbb{R}$, which represents the transition probability. Moreover, in each state transition, a reward is produced, which is characterized by the reward function $R$. A policy $\pi$ associates a state $s$ with an action and can be categorized as deterministic or randomized. A deterministic policy maps a state to an action, $\pi: S \to A$. In contrast, a randomized policy maps a state to a probability distribution, $\pi: S \times A \to \mathbb{R}$, representing the probability of taking action $a \in A$ in state $s$. In the learning process, the state-action value function (Q function) $Q^{\pi}(s, a)$ stores the estimated values of accumulated discounted rewards using policy $\pi$.
When the model of the environment is accessible (model-based learning), i.e., the hidden transfer function $P$ is known, the expected values of the Q function can be computed iteratively with dynamic programming. According to [26], the optimal policy satisfies the optimal Bellman equation and can be found by iteratively selecting the action that maximizes the state-action value. The state-action values increase monotonically each time the policy is updated with the best action. Therefore, when the policy converges, it converges to the optimal policy. However, in practice, it is usually difficult to obtain the model of the environment. Namely, the hidden transfer function $P$ is unknown. In this case, model-free learning can be applied. Model-free learning assumes no knowledge of the environment and relies on approximating the Q function by sampling the environment, states, and rewards. A widely used model-free learning method is the Monte Carlo (MC) method [26], where the value function and policies are updated only when an episode of samples is finished. Another model-free learning method is temporal-difference (TD) learning [26], where the value function and policies are updated in a step-by-step manner. A TD learning method known as Q-learning is used in this paper. The main characteristic of Q-learning is that the approximation of the Q function is independent of the policy being followed, which largely reduces complexity [26].

B. DCNN-Q ARCHITECTURE DESIGN
The overview diagram of DCNN-Q for mobile network power saving is depicted in Fig. 6. This RL problem can be divided into the design of the state space, action space, policy, reward function, and Q function, which are detailed in the following paragraphs.

1) STATE SPACE
In a power saving problem, a state should be able to capture what the current requirement of NTV is and how well the system is responding to such requirement. Therefore, a state s ∈ S is characterized by a traffic map, a throughput map, and a vector of current loads of all cells in the AOI, i.e., s = {traffic map, throughput map, loads of all cells}. It should be noticed that both maps are 2D while the vector of loads of all cells is 1D.

2) ACTION SPACE
The cardinality of the action space should be properly designed. If the cardinality of the action space is too small, then the granularity of actions becomes coarse. Conversely, if the cardinality of the action space is too large, convergence of training will be too slow. To reach this balance, three constraints are imposed in this paper. First, instead of continuous loads, only discretized loads are considered. For example, a load can only be chosen from the set {0%, 25%, 50%, 75%, 100%}. However, even if only discretized loads are considered, with 192 cells as shown in Table 3, there are still $5^{192}$ combinations. Therefore, the second constraint is that once a load is chosen, all the cells in the same band are set to the same load. Then, for four bands, the number of combinations reduces to $5^4 = 625$. Third, certain combinations are excluded from the action space, as shown by the pseudo code in Fig. 7. In Fig. 7, the subscript $k$ in $w_k$ represents the band identification (ID). From the action space generation algorithm, it can be observed that the UMa band is always on to guarantee that there will be coverage in the AOI, while UMi cells can be switched off. This constraint avoids coverage holes in the AOI when certain UMi cells are switched off. After the algorithm in Fig. 7, the cardinality of the action space is reduced to 140.
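The first two constraints and the always-on rule for the UMa band can be illustrated with a short enumeration. This is a partial sketch only: the remaining exclusion rules of Fig. 7 that bring the action space down to 140 are not reproduced here, and the choice of index 0 for the UMa band is a hypothetical convention for this example.

```python
from itertools import product

LOAD_LEVELS = (0.0, 0.25, 0.5, 0.75, 1.0)  # discretized loads
NUM_BANDS = 4                               # three UMi bands and one UMa band
UMA_BAND = 0                                # hypothetical index of the UMa band

# One load per band, shared by all cells in that band.
all_actions = list(product(LOAD_LEVELS, repeat=NUM_BANDS))

# Keep only actions in which the UMa band stays on to preserve coverage.
uma_on_actions = [a for a in all_actions if a[UMA_BAND] > 0.0]

if __name__ == "__main__":
    print(len(all_actions))      # 5**4 band-level combinations
    print(len(uma_on_actions))   # combinations left after the UMa rule alone
```

Each remaining tuple maps directly to a load configuration for all 192 cells, so the agent's output layer needs one Q value per retained tuple.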

3) POLICY
A state is mapped to an action via the policy $\pi(s)$. A commonly chosen policy is the $\epsilon$-greedy algorithm ($\epsilon \in [0, 1]$) [26]. It consists of two phases. In the exploitation phase, which has probability $1 - \epsilon$, the RL agent selects the action with the highest Q value. In the exploration phase, which has probability $\epsilon$, the RL agent chooses an action at random from the action space with equal probability. The $\epsilon$-greedy algorithm is presented as
$$\pi(s) = \begin{cases} \arg\max_{a \in A} Q(s, a), & U \geq \epsilon \\ \text{an action drawn uniformly from } A, & U < \epsilon \end{cases} \tag{14}$$
where $U$ is a uniform random variable in $[0, 1]$.
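The two phases of the $\epsilon$-greedy policy can be sketched in Python as below. The function name and the representation of Q values as a plain list indexed by action are assumptions made for this example; in the paper the Q values come from the DCNN.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng=random) -> int:
    """Return an action index: exploit with probability 1 - epsilon, else explore."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    u = rng.random()  # uniform draw in [0, 1)
    if u >= epsilon:
        # Exploitation: pick the action with the highest Q value.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Exploration: pick any action with equal probability.
    return rng.randrange(len(q_values))


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.1, 0.9, 0.3]
    picks = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
    print(picks.count(1) / len(picks))  # fraction of draws that chose the greedy action
```

A small $\epsilon$ keeps the agent mostly on its current best action while still sampling alternatives often enough to correct poor early estimates.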

4) REWARD FUNCTION
There are two principles for designing the reward function. First, the agent should be penalized if the current network throughput is not able to satisfy the NTV requirement. With such a design, the RL agent will learn from experience to avoid the corresponding actions. Second, as another goal of the problem is to save power, the reward should increase monotonically as the network consumes less power, provided that the required NTV is satisfied. As a result, the reward function is modeled by (15), where $\beta$ is a positive coefficient describing how fast the reward decays with the increase in power and $P_X^{(k),\mathrm{Total}}$ is the total power of the $k$th cell in Band $X$. The exponential function is chosen because it is continuous in its domain and able to handle interpolated power values. In this paper, $\beta$ is set to 0.004, which corresponds to a reward of 0.2 when the load is 25%.
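The stated correspondence between $\beta = 0.004$ and a reward of about 0.2 at 25% load can be checked numerically. The sketch below assumes that the power-dependent part of the reward has the form $e^{-\beta P}$, with $P$ the per-cell total power from the linear models of Section II; the penalty branch for unsatisfied NTV is not reproduced, and the function names are hypothetical.

```python
import math

BETA = 0.004  # decay coefficient stated in the text


def total_power(load: float) -> float:
    """Per-cell total power in watts from the linear DU and RU models; zero when off."""
    if load == 0.0:
        return 0.0
    return (1.68 * load + 266.98) + (153.50 * load + 93.95)


def power_reward(power_w: float, beta: float = BETA) -> float:
    """Assumed power-dependent reward term: exponential decay with power."""
    return math.exp(-beta * power_w)


if __name__ == "__main__":
    for load in (0.25, 0.5, 0.75, 1.0):
        print(f"load={load:.2f}  reward={power_reward(total_power(load)):.3f}")
```

Under this assumption the reward at 25% load evaluates to roughly 0.2, in line with the text, and it shrinks monotonically as the load and hence the power increase.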

5) Q FUNCTION
The accumulated reward of a state-action pair is recorded by the Q function $Q(s, a)$ and it is incrementally updated as the training progresses, i.e.,
$$Q(s, a) = Q(s, a) + \alpha\big(r + \delta Q(s', a') - Q(s, a)\big) \tag{16}$$
where $s'$ is the new state, $a'$ is the action in the new state, $r$ is the reward, $\alpha$ is the learning rate, and $\delta$ is the discounting factor. To approximate the Q function, both table-based and NN-based methods have been used in the literature. In this paper, the NN-based method is adopted. An NN-based Q function has two benefits. First, it does not need massive storage compared to a table-based method when the action space and state space are large. Second, it can handle complex inputs such as a mixture of 2D and 1D data and unseen states. An NN structure is proposed for approximation of the Q function. The NN accepts a state as input and outputs the Q value of each potential action. The NN consists of two parts, as shown in Fig. 8. The first part is a DCNN and the second part is a 3-layer fully-connected network. The DCNN maps a two-channel 2D image to a vector. One channel of the 2D image is the throughput map of the AOI and the other channel is the traffic map of the AOI. The two-channel image is passed through five convolution blocks in series, and each convolution block consists of a convolution layer, a rectifier (ReLU) [27], and a pooling layer. The convolution layer is responsible for extracting high-level features of the 2D input. The ReLU is a typical nonlinear activation in NNs. The pooling layer is responsible for reducing complexity and extracting dominant features. Then, after the five convolution blocks, the output is passed into a drop-out layer to further reduce complexity and avoid overfitting. Since a state consists of not only the 2D image but also a 1D vector storing the current loads of all the cells, the output of the drop-out layer is normalized and concatenated with the 1D load vector.
The concatenated 1D vector is input to the second part of the NN, i.e., the 3-layer fully-connected network, the output of which is the Q function. The objective function of the DCNN is the root mean squared error (RMSE) between the predicted Q value vector and the updated Q value vector.
To achieve the best performance of the DCNN, both the throughput map and the traffic map are normalized before being input to the DCNN. Let 2D matrices $M_1$ and $M_2$ denote the throughput map and the traffic map, respectively. The normalization of $M_1$ includes cutting-off and scaling, and $M_2$ is likewise normalized to $\tilde{M}_2$. The DCNN-Q learning is described in pseudo code in Fig. 9.
To begin with, the policy is initialized with equal probability. In each iteration of the training, the RL agent chooses an action according to the $\epsilon$-greedy policy. As soon as the action is determined, it is mapped to cell loads in all the bands. Then, all the cells adjust their loads according to the action. After these steps, the mutual interference has changed and hence the SINR in the AOI needs to be re-calculated. Then, the throughput map is updated and the new state is formed. Next, the reward is computed according to (15) and the next action is obtained from the policy function. The RL agent updates the Q function and the policy. These steps are repeated until the maximum number of iterations is reached.
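The incremental update in (16) that the agent applies in each iteration can be illustrated with a table-based stand-in. This is a sketch for intuition only: the paper approximates the Q function with a DCNN rather than a table, and the state labels, action labels, learning rate, and discounting factor used below are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_update(q: QTable, state: Hashable, action: Hashable, reward: float,
             next_state: Hashable, next_action: Hashable,
             alpha: float, delta: float) -> float:
    """Apply Q(s,a) += alpha * (r + delta * Q(s',a') - Q(s,a)) and return the new value."""
    td_target = reward + delta * q[(next_state, next_action)]
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


if __name__ == "__main__":
    q: QTable = defaultdict(float)
    q[("s1", "a1")] = 1.0  # pretend the next state-action pair already has a value
    new_value = q_update(q, "s0", "a0", reward=0.5,
                         next_state="s1", next_action="a1",
                         alpha=0.1, delta=0.9)
    print(new_value)
```

In the NN-based setting, the same target $r + \delta Q(s', a')$ replaces one entry of the predicted Q value vector, and the network is trained to reduce the error between the predicted and updated vectors.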

VI. RESULTS AND ANALYSIS
The proposed DCNN-Q power saving method is trained and tested according to the parameters listed in Table 5. For comparison, the always-full-load method, the threshold-based method, and the DCNN-Q method are discussed. The always-full-load method means that all cells in all bands operate at 100% load. The normalized mean reward with respect to the number of weeks trained is shown in Fig. 10. Normalization is done with respect to the mean reward after 10 weeks of training. It can be seen that there are fluctuations between 20 and 40 weeks, as the size of the training data is still small. After 40 weeks of training, performance starts to improve. After 200 weeks of training, the result is 13% better than after 10 weeks of training. As the length of training needs to balance performance and training cost, we use the 200-week trained RL agent to test the proposed DCNN-Q performance.
The NTV requirement and the throughput provided by these three methods over a week are illustrated in Fig. 11. The step size is half an hour. The always-full-load method provides a constant network throughput, which is 30% higher than the highest NTV peak during a week. This headroom is the foundation for an intelligent power saving method. The threshold-based method is able to adjust its network throughput according to the NTV. It can be seen that the threshold-based method is aggressive when the required NTV is low and conservative when the required NTV is high. DCNN-Q behaves differently: it is more conservative when the NTV is low by reserving a larger safety margin, and more aggressive when the NTV is high. It can also be observed that the change of configuration in DCNN-Q is sharper. This is because the action space of the DCNN-Q is discrete.
Network power consumption over a week is illustrated in Fig. 12. The power consumption of the always-full-load method is constant. The threshold-based method has the lowest power consumption when the NTV is low and has higher power consumption when the NTV is high. Its range of power consumption is relatively wide. Conversely, the DCNN-Q has a limited range of power consumption values. Fig. 13 depicts the normalized aggregate power consumption of the three methods. Normalization is done relative to the threshold-based method. The always-full-load method consumes the most power as expected, which is 41% higher than the threshold-based method. The proposed DCNN-Q method is able to save 19% power compared to the threshold-based method and 42% compared to the always-full-load method. This demonstrates that the proposed method is able to achieve significant power saving.
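The relationship between the reported percentages can be cross-checked with a few lines of arithmetic. The sketch below takes the normalized aggregate power of the threshold-based method as 1, uses the rounded figures quoted in the text (41% higher for always-full-load, 19% lower for DCNN-Q), and recomputes the relative savings; the variable and function names are hypothetical, and small differences from the reported values are expected because the inputs are rounded.

```python
def saving(candidate: float, reference: float) -> float:
    """Fraction of power saved by the candidate relative to the reference."""
    return 1.0 - candidate / reference


THRESHOLD = 1.0               # normalization reference
ALWAYS_FULL_LOAD = 1.41       # 41% higher than the threshold-based method
DCNN_Q = THRESHOLD - 0.19     # 19% lower than the threshold-based method

if __name__ == "__main__":
    print(f"DCNN-Q vs always-full-load:    {saving(DCNN_Q, ALWAYS_FULL_LOAD):.3f}")
    print(f"threshold vs always-full-load: {saving(THRESHOLD, ALWAYS_FULL_LOAD):.3f}")
```

The recomputed savings are about 0.43 and 0.29, which agree with the reported 42% and the roughly 30% quoted for the threshold-based method to within rounding.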

VII. CONCLUSION
To investigate power saving for mobile networks, it is important to establish practical power and network traffic models. Based on our in-house measurement, linear models are sufficiently accurate to describe base station power consumption in terms of load. The power of the RU is more sensitive to load change, whereas the power of the DU is steady. Power reduction is achieved via the adaptation of loads of the network and dynamic switching on and off according to the required NTV. A polynomial model for synthesizing the NTV is proposed, describing traffic fluctuations over one week. The threshold-based method, which relies on heuristically set thresholds, serves as the benchmark and is able to reduce power consumption by 30% compared to the always-full-load method. As a significant enhancement, the centralized DCNN-Q method is proposed. The DCNN-Q uses a DCNN, which accepts a joint input of 2D images and a 1D vector, to approximate the Q function in the Q-learning framework. The proposed DCNN-Q method is capable of saving 19% power compared to the threshold-based method. This demonstrates that DCNN-Q is a promising solution to confine mobile network power when both the data volume and the size of a network are soaring. For future work, instead of the centralized method proposed in this paper, a distributed learning framework would be another direction of research. Also, optimization of energy efficiency, i.e., bits/joule, is essential to consider for green network research.