BESS Aided Renewable Energy Supply Using Deep Reinforcement Learning for 5G and Beyond

Abstract—The year 2020 witnessed the unprecedented development of 5G networks, along with the widespread deployment of 5G base stations (BSs). Nevertheless, the enormous energy consumption of BSs and the incurred huge energy cost have become significant concerns for mobile operators. With the continuous decline of renewable energy costs, equipping the power-hungry BSs with renewable energy generators could be a sustainable solution. In this work, we propose an energy storage aided renewable energy supply solution for the BS, which can supply clean energy to the BS and store surplus energy for backup usage. Specifically, to flexibly regulate the battery's discharging/charging, we propose a deep reinforcement learning based regulating policy, which can adapt to the dynamic renewable energy generation as well as the varying power demands. Our experiments using real-world data on renewable energy generation and power demands demonstrate that our power supply solution can achieve a cost saving ratio of 77.9%, compared to the case with traditional power grid supply.


I. INTRODUCTION
THE 5G network is considered a promising technology that will significantly improve the way we live [1]. Compared to 4G/LTE, it can ensure users higher bandwidth and lower latency and thus enable various cutting-edge mobile services, such as the Internet of Vehicles [2] and Virtual Reality. However, due to the adoption of high frequency bands by 5G base stations (BSs), their signal coverage range is much shorter than that of 4G/LTE. Consequently, mobile operators need to deploy a large number of 5G BSs to tackle the problem of poor signal coverage. This results in an ultra-dense BS deployment, especially in "hotspot" areas, as illustrated in Fig. 1.
Building and operating such large-scale BSs require enormous investments and resources. According to a field survey in the cities of Guangzhou and Shenzhen, China, the full-load power consumption of a typical 5G BS is about 2 ∼ 3 times that of a 4G one [5]. Considering the ultra-dense deployment of 5G BSs, this could lead to a tenfold increase in energy consumption. In addition, with the increasing emphasis on environmental protection, many governments have shut down some coal-fired power plants, resulting in severe power shortages in some areas. In this regard, how to effectively reduce energy consumption and improve energy efficiency are critical problems.
Renewable energies like solar and wind are eco-friendly with zero carbon emissions and have become popular in more and more scenarios in recent years [6]. Owing to the continuing price decline of photovoltaic (PV) modules and wind turbines, the installation cost of renewable energy has decreased dramatically over the past decade; e.g., a 61% reduction in solar equipment cost from 2010 to 2017 has been reported [7]. Such cost reductions shorten the payback period of a renewable energy investment from a couple of years to several months [8]. These observations indicate the great potential of renewable energy in fossil fuel replacement and carbon emission reduction.
This has inspired mobile operators to utilize renewable energy as an auxiliary power supply to meet the huge power demand of 5G BSs. In some developing countries, solar power has already been applied to supply BSs, and in some of them it accounts for over 8% of the total electricity usage [9]. By installing PV panels and wind turbines near the BSs, the maximum power from the solar and wind generators can reach up to 8.5 kW and 6.0 kW, respectively, which could remarkably cut down the energy supplied by the traditional power grid.
To maximize the utilization of renewable energy, energy storage can be strategically utilized so that energy can be provided continuously, as renewable (solar or wind) energy is intermittent and unstable. Meanwhile, most BSs are equipped with backup batteries to safeguard the BS's normal functioning against power outages, which makes them a natural energy storage. Besides, with the continuous price decline of battery storage in recent years [10], [11], combining battery storage with renewable energy generators could offer even greater cost-reduction potential. Specifically, i) when the generated renewable power is less than the power demand (e.g., during peak hours), the battery can be discharged to flatten the peak power demand, and ii) when the generated renewable power is more than the power demand (e.g., during off-peak hours), the battery can be charged to store the surplus renewable energy.

Fig. 1. A vision of the future radio access network (RAN) in 5G and beyond, which consists of macro and small cells, and also includes mobile and space BSs. For the purpose of green communication, all the BSs could be supplied by both renewable energy and the power grid.
In this paper, we propose a battery energy storage system (BESS) aided renewable energy supply solution for the 5G network and beyond. Aiming at energy cost reduction for mobile operators, our solution maximizes the utilization of renewable energy and thus minimizes the utilization of the power grid (i.e., fossil energy). Specifically, the energy charge can be continuously reduced by the generated renewable power, and the demand charge can be reshaped and flattened through strategic battery discharging/charging operations.
When designing the optimal control strategy for battery discharging/charging operations, we face several challenges. Firstly, the renewable energy generation and power demand are highly varying in both spatial and temporal dimensions and thus hard to predict. Secondly, owing to the physical constraints of the battery discharging/charging operations (e.g., discharge/charge efficiency), it is complicated to design the optimal battery controlling policy. Thirdly, as the battery's capacity and lifetime are limited and shortened along with the discharge/charge cycles, it is necessary yet non-trivial to trade off the cost of the battery's degradation/replacement against the gain of renewable energy storage.
By tackling the above challenges, we make the following contributions in this work:
• We present the BESS aided renewable energy supply paradigm for 5G BS operations, in which the battery discharging/charging operation is modelled as an optimization problem. The model is comprehensive, taking into account the practical considerations of dynamic power demand and renewable energy generation, as well as battery specifications and physical constraints.
• To cope with the intermittent renewable energy generation and dynamic BS power demand, we propose a deep reinforcement learning (DRL) based battery discharging/charging operating policy. It can update the network parameters by interacting with the environment in real-time, so as to improve its decision-making efficiency.
• We conduct extensive evaluations using a real-world BS deployment scenario and BS traffic load traces. The results show that the proposed DRL-based battery discharging/charging policy can effectively utilize renewable energy and cut down the energy cost, with a cost saving ratio of up to 77.9%.
The rest of the paper is organized as follows. In Section II, we introduce the background of this paper. In Section III, we give the system models and the formulation of the problem, and then propose the BESS aided renewable energy supply solution in Section IV. We develop a DRL-based battery discharging/charging controlling policy in Section V. We evaluate the proposed method by experiments with a real-world dataset in Section VI. We present the related work in Section VII and conclude the paper in Section VIII.

II. BACKGROUND

A. Base Station Power Demand
The power demand pattern of a BS is mainly determined by its location and associated with the behavior of the users there. Usually the demand also shows a periodic pattern (e.g., with a one-day or one-week period). As shown in Fig. 2, in this paper we mainly consider three types of BSs, deployed in residential, office, and comprehensive areas, which account for nearly ninety percent of the total demands [12]. In detail, the characteristics of these power demand patterns are as follows.
• Power Demand of BSs in Residential Areas: The power demands of this type of BS increase rapidly in the

B. Energy Cost of 5G BS
The energy cost of a mobile operator typically consists of two components: i) the energy charge, i.e., the charge for the total consumed electricity (in kWh) throughout the entire billing cycle, which is the interval from the end of one billing statement date to the next billing statement date for electricity (e.g., one month), and ii) the demand charge, i.e., the charge for the peak power demand (in kW) during the billing cycle. Specifically, the demand charge is regarded as a penalty for the extra load burden imposed on the power grid.
For example, for a commercial data center consuming 10 MW on peak and 6 MW on average, the monthly energy charge and demand charge amount to around $24,000 and $165,500, respectively [13]. The demand charge could be up to 8× the energy charge; therefore, effectively cutting down the demand charge could remarkably reduce the energy cost. However, there seems to be no practical way to flatten the peak power demands of 5G BSs; e.g., shifting real-time demands from mobile users to off-peak hours could lead to long delays for some classes of jobs [14].

III. SYSTEM MODEL
In this section, we present the system model, basic assumptions, and problem formulation. For clarity, the major notations used in this paper are explained in Table I.

A. Scenario Overview
As illustrated in Fig. 3, the proposed BESS aided renewable energy supply solution deployed at each 5G BS mainly includes: i) a renewable energy generator, e.g., a PV panel and a wind turbine, which is deployed near the 5G BS system and generates renewable energy for the system, ii) a battery storage, which stores the surplus renewable energy and acts as a power source for the BS as needed, and iii) a controller, which obtains the environment state (i.e., the measurement data) so as to control the battery discharging/charging operations through control signals. In addition to the standard meter, as shown in Fig. 3, an additional generation meter is installed in the BS power supply system to measure the renewable energy generation. Furthermore, following commands from the controller, the distribution panel is responsible for switching power between renewable energy and grid energy, ensuring a continuous and stable electricity supply for the BS.
As the essential component of the BESS aided renewable energy supply solution, the controller determines how efficient this paradigm is. Specifically, at each scheduling point, the controller needs to decide the amount of power supplied from either the battery or the power grid. The scheduling decisions should be made upon the power demands and battery states in real-time, so that the utilization of renewable energy is enhanced and the total energy cost is minimized.
Note that the feasibility of such an implementation as illustrated in Fig. 3 has been preliminarily verified in practice. According to [15], small integrated renewable energy generators have been provided by some commercial companies for BS systems, and they are easily deployed in both open rural and crowded urban environments.

B. BS Power Supply and Demand
The power of each 5G BS is mainly supplied from three sources: the power grid, generated renewable energy, and stored energy. In particular, i) when the generated renewable energy is more than the power demand (e.g., during off-peak hours), each 5G BS is supplied only by the renewable energy (i.e., off-grid) and the surplus renewable energy is stored in the battery, and ii) when the generated renewable energy is less than the power demand (e.g., during peak hours), each 5G BS is supplied by all three sources in a cooperative way.
In this paper, we consider a discrete time model, where the entire billing cycle (e.g., one month) is equally split into T consecutive slots of length Δt, denoted by T = {1, 2, . . . , T}. For an arbitrary 5G BS, the power demand during the entire billing cycle can be represented by a power demand vector d = [d(1), d(2), . . . , d(T)], where d(t) is the power demand in time slot t, which can be obtained from the power meter readings at each BS.

C. Renewable Energy Generation
By harvesting energy from renewable resources, the BSs can be powered in an environmentally friendly and cost-efficient way. In this paper, to keep the model extensible, we denote the renewable energy generation vector as g = [g(1), g(2), . . . , g(T)]. In this work, we choose two typical renewable energy sources as the auxiliary power supply, i.e., solar energy (g_s(t)) and wind energy (g_w(t)). Accordingly, for an arbitrary time slot t, the renewable energy generation can be represented by g(t) = g_s(t) + g_w(t). We assume that if the total generated renewable energy is beyond the power demand (i.e., g(t) > d(t)), the power is supplied in proportion to the renewable energy generated by each source. The generation of both varies during a certain period (e.g., one day) and is affected by similar factors such as weather, temperature, and wind speed.
1) Solar Energy Generation: The power generated by the solar PV system mainly depends on three factors: global horizontal irradiance (GHI(t)), outdoor temperature (Temp(t)), and time of day (ToD(t)). By arranging solar PV cells in series/parallel, the solar PV system harvests energy and converts it into DC power to charge the battery storage and supply the power demand. The power generated by the solar PV at time slot t can be measured by g_s(t) = F_S(GHI(t), Temp(t), ToD(t)), where F_S(·) is a known, non-linear function defined in PVLIB [16]. Accordingly, the solar energy generation during the entire billing cycle can be represented by the vector g_s = [g_s(1), g_s(2), . . . , g_s(T)].
2) Wind Energy Generation: The power generated by the wind turbine fluctuates randomly with time and mainly depends on the wind velocity (WV(t)), weather system (WS(t)), and hub height (HH(t)). The wind turbine typically generates energy in two stages: it first converts wind power into mechanical energy, and then transforms it into electricity. The power generated by the wind turbine at time slot t can be calculated by g_w(t) = F_W(WV(t), WS(t), HH(t)), where F_W(·) is a known, non-linear function defined in [17]. Accordingly, the wind energy generation during the entire billing cycle can be represented by the vector g_w = [g_w(1), g_w(2), . . . , g_w(T)].
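To make the shape of such a generation model concrete, the sketch below implements a generic piecewise wind power curve. It is only illustrative: the exact non-linear function F_W is the one defined in [17], and the cut-in, rated, and cut-out speeds used here (as well as the 6.0 kW rated power, matching the maximum wind output cited in the introduction) are assumptions.

```python
# A minimal sketch of a wind turbine power curve, standing in for the
# non-linear function F_W of [17]. The cut-in/rated/cut-out speeds and
# the cubic ramp below rated speed are illustrative assumptions, not
# the exact model used in the paper.
def wind_power_kw(wv, rated_kw=6.0, cut_in=3.0, rated_speed=12.0, cut_out=25.0):
    """Approximate generated power (kW) for wind velocity wv (m/s)."""
    if wv < cut_in or wv >= cut_out:
        return 0.0                      # below cut-in or above cut-out: no output
    if wv >= rated_speed:
        return rated_kw                 # plateau at rated power
    # cubic growth between cut-in and rated speed
    frac = (wv**3 - cut_in**3) / (rated_speed**3 - cut_in**3)
    return rated_kw * frac
```

Evaluating this curve per time slot yields the g_w(t) entries of the wind generation vector.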

D. Battery Specification
At an arbitrary time slot t, the state of the battery is modeled as χ(t) = [SoE(t), SoC(t), DoD(t)], where SoE, SoC, and DoD represent the state of effective capacity, state of charge, and depth of discharge of the battery, respectively. Specifically, i) SoE indicates the current effective capacity of the battery, as a percentage of its initial capacity (denoted as π), ii) SoC indicates the energy currently stored in the battery, as a percentage of the current effective capacity, and iii) DoD indicates how much energy the battery has released, as a percentage of the current effective capacity.
To simplify the optimization problem, we discretize the SoC of the battery into M equal-spaced states (e.g., M = 10, i.e., {10%, 20%, . . . , 100%}). Accordingly, the DoD is also discretized (e.g., releasing 10% from 90%, i.e., 90% → 80%). Besides, for an arbitrary time slot t, in order to prevent the battery from over-discharging/overcharging, we use SoC_max and SoC_min to indicate the upper and lower bounds on the SoC, respectively, i.e., SoC_min ≤ SoC(t) ≤ SoC_max.
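The discretization above can be sketched as follows; the specific bound values soc_min and soc_max are illustrative placeholders for SoC_min and SoC_max.

```python
# Discretizing a continuous SoC reading into M equal-spaced states, as
# described above (e.g., M = 10 gives {10%, 20%, ..., 100%}). Clamping to
# [soc_min, soc_max] mirrors the over-discharge/overcharge bounds; the
# bound values here are illustrative.
def discretize_soc(soc, M=10, soc_min=0.1, soc_max=1.0):
    soc = min(max(soc, soc_min), soc_max)  # enforce SoC bounds
    # snap to the nearest 1/M level, with a floor of one level
    level = max(1, round(soc * M))
    return level / M
```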

IV. BESS AIDED RENEWABLE ENERGY SUPPLY
The battery storage is deployed at the 5G BS; it can be charged with the surplus renewable energy (generated by the solar PV and wind turbine system) and discharged to reshape the power demand, so as to maximize the utilization of renewable energy (or minimize the utilization of fossil fuel) and reduce the electricity bill.
We define the battery discharging/charging operations by a battery operation vector b = [b(1), b(2), . . . , b(T)], where b(t) is a real-valued variable indicating the amount of the discharging/charging operation. In detail, i) a positive value indicates discharging power from the battery storage to the 5G BS during time slot t, ii) a negative value indicates charging the battery storage from the renewable energy, and iii) a zero value indicates that no discharging/charging operation is performed.
Meanwhile, the discharging/charging operations are constrained by the maximum charging rate and maximum discharging rate, denoted as R+ and R−, respectively, i.e., the largest power with which the battery can be recharged or can supply the BS in a time slot, so that −R+ ≤ b(t) ≤ R−.
Besides, the battery storage needs to meet two conditions in discharging/charging operations: the battery can only be charged when there is surplus renewable energy after supplying the 5G BS, and the battery cannot be simultaneously charged and discharged in any time slot. Due to the power losses (e.g., AC-DC conversion and battery leakage [18]) incurred when discharging from the battery storage to the BS (or charging from the renewable energy to the battery storage), we denote the charging and discharging efficiencies by α ∈ (0, 1) and β ∈ (0, 1), respectively, and scale the actual discharged/charged power accordingly. Given the power demand of the 5G BS (i.e., d(t)), the renewable energy generation (i.e., g(t)), and the battery discharging/charging operation (i.e., b(t)), we can derive the power supplied by the power grid in an arbitrary time slot t as p(t) = max{d(t) − g(t) − b(t), 0}, yielding the grid power consumption vector p = [p(1), p(2), . . . , p(T)].
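One plausible reading of this per-slot power balance is sketched below: a positive b discharges the battery toward the BS (delivering only β·b after losses), a negative b stores surplus renewables, and the grid covers whatever demand remains. The efficiency values and the exact placement of the loss terms are assumptions; the paper's own equations may place them differently.

```python
# Illustrative per-slot power balance under the assumptions above. alpha
# (charging efficiency) is unused on the grid side because charging is
# only allowed from surplus renewables; it would matter when updating the
# battery's stored energy.
def grid_power(d, g, b, alpha=0.95, beta=0.95):
    """Power drawn from the grid in one slot (kW)."""
    if b > 0:                       # discharging: battery helps cover demand
        supplied = beta * b         # delivered power after conversion loss
        return max(d - g - supplied, 0.0)
    # charging (b <= 0): allowed only from surplus renewables, so the
    # grid still covers any unmet demand
    return max(d - g, 0.0)
```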

A. Energy Cost
The billing policy for the mobile operator's energy cost throughout the entire billing cycle typically consists of two components, the energy charge and the demand charge, which is widely adopted in previous works [13], [14], [19]. We introduce them in detail as follows.
• Energy Charge: The charge for the total consumed electricity (in kWh) throughout the entire billing cycle, with unit price λ_e (in $/kWh).
• Demand Charge: The charge for the peak power consumption supplied by the power grid (in kW) during the entire billing cycle, with unit price λ_d (in $/kW).
Therefore, the incurred cost of the energy charge of the whole system in each time slot t can be represented by C_e(t) = λ_e · p(t) · Δt. Accordingly, the incurred cost of the demand charge of the whole system in each time slot t can be represented by C_d(t) = λ_d · max{p(t) − p_max, 0}, where p_max records the peak power consumption during the past t − 1 time slots. For an arbitrary time slot t, if p(t) − p_max > 0, p_max is updated to p(t) accordingly.
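The two-part billing scheme above can be sketched as follows: the energy charge accumulates per kWh each slot, while the demand charge is assessed incrementally on the running peak p_max. The rates λ_e and λ_d used here are illustrative, not the paper's values.

```python
# A small sketch of the two-part bill: energy charge accrues lambda_e per
# kWh each slot; demand charge adds lambda_d * (p(t) - p_max) whenever a
# new billing-cycle peak appears. Rates are illustrative assumptions.
def billing(grid_power_kw, dt_h=1.0, lam_e=0.1, lam_d=15.0):
    energy_cost, demand_cost, p_max = 0.0, 0.0, 0.0
    for p in grid_power_kw:
        energy_cost += lam_e * p * dt_h          # $ per kWh consumed
        if p > p_max:                            # new billing-cycle peak
            demand_cost += lam_d * (p - p_max)   # incremental demand charge
            p_max = p
    return energy_cost, demand_cost
```

Note that summing the incremental demand charges over the cycle equals λ_d times the final peak, matching the per-slot cost C_d(t) defined above.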

B. Investment Cost
Every use of the equipment (solar PV, wind turbine, and battery storage) incurs a certain reduction of its lifetime, which is essential for the investor. Therefore, it is important to understand, detail, and quantify the various factors influencing the performance loss curves. For the accuracy of our model, we quantify the investment cost in every time slot as follows.
1) Renewable Energy Generator Cost: As the modules of a renewable energy generation system age, they gradually lose some performance. In this paper, we assume the decline of the system is linear and positively related to its usage time. We denote the lifetime of the renewable energy generator as L, which indicates the total time it can be used. For an arbitrary time slot t, the remaining lifetime of the renewable energy generator is denoted as l(t), which is constrained by 0 ≤ l(t) ≤ L. The renewable energy generator has to be discarded and replaced by a new one if l(t) ≤ 0. Given the remaining lifetime of the renewable energy generator at time t − 1, the remaining lifetime at time t is updated by l(t) = l(t − 1) − u(t)Δt, where u(t) is an indicator that equals 1 if the generator is in use during slot t and 0 otherwise. We formulate the using cost of the renewable energy generator in each time slot t as C_u(t) = λ · u(t)Δt / L, where λ is the investment cost of a new renewable energy generator.
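One plausible implementation of this linear depreciation model is sketched below; the lifetime L, investment cost λ, and per-slot formula are assumptions consistent with the linear-decline description, not necessarily the paper's exact parameters.

```python
# A sketch of the linear depreciation model above: the generator's
# remaining lifetime shrinks by dt whenever it is in use (u = 1), and each
# slot of use is charged a pro-rata share lambda * u * dt / L of the
# investment cost. L and lam below are illustrative values.
def generator_step(l_remaining, in_use, dt=1.0, L=87600.0, lam=5000.0):
    """Return (new remaining lifetime, using cost for this slot)."""
    u = 1.0 if in_use else 0.0
    l_new = l_remaining - u * dt        # lifetime update l(t) = l(t-1) - u*dt
    cost = lam * u * dt / L             # pro-rata investment cost
    return l_new, cost
```

Under this pro-rata scheme, the using costs summed over the whole lifetime L add up to exactly λ, the price of a replacement unit.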
We extend the model of the renewable energy generator to the specific systems, i.e., the solar PV system and the wind turbine system. In detail, i) for the solar PV system, we denote the remaining lifetime, using cost, and investment cost as l_s(t), C_us(t), and λ_s, respectively, and ii) for the wind turbine system, we denote them as l_w(t), C_uw(t), and λ_w. Accordingly, we can derive the using costs of the solar PV system and the wind turbine system by replacing the symbols in Eq. (20).
2) Battery Storage Degradation Cost: Every discharge/charge cycle does some "harm" to the battery (typically lead-acid) and reduces its capacity and lifetime. In particular, a deep discharge severely affects its internal structure and can even permanently damage the battery (e.g., an over-discharge). The battery has to be discarded and replaced by a new one when its effective capacity drops down to the "ineffective" level, denoted by SoE_ine in this paper.
As illustrated in Fig. 4, each DoD level has a corresponding number of discharge/charge cycles; thus, we can formulate the battery storage degradation cost from the relationship between the two. Given the state of the battery at time slot t, i.e., χ(t) = [SoE(t), SoC(t), DoD(t)], the SoE decrease of the battery during this time slot can be measured by ΔSoE(t) = ΔDoD(t)/h(DoD(t)),

Fig. 4. Relationship between DoD levels and battery lifetime (in number of discharge/charge cycles) for a Li-ion battery [20].
where h(·) maps an input DoD level to the total number of discharge/charge cycles the battery can sustain at that level (exemplified in Fig. 4), and ΔDoD(t) gives the increase of DoD during the time slot. With the above expression of the SoE decrease in each time slot t, we can then formulate the degradation cost of the battery storage in each time slot t as C_b(t) = λ_b · ΔSoE(t), where λ_b is a coefficient converting the battery degradation into a monetary cost, with the unit of "$/SoE decrease." To sum up, the total investment cost in each time slot t can be calculated as C_inv(t) = C_us(t) + C_uw(t) + C_b(t).
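The cycle-based degradation model can be sketched as below. The DoD-to-cycles table is an illustrative stand-in for the curve of Fig. 4, and λ_b is an assumed value; only the structure (a partial discharge of depth ΔDoD at level DoD consumes roughly ΔDoD/h(DoD) of capacity) follows the text.

```python
# A sketch of the cycle-based degradation model: h maps a DoD level to the
# rated number of discharge/charge cycles at that depth (cf. Fig. 4), so a
# partial discharge of depth delta_dod consumes roughly
# delta_dod / h(dod) of effective capacity, priced at lambda_b per unit of
# SoE decrease. The cycle table and lambda_b are illustrative.
CYCLES_AT_DOD = {0.1: 15000, 0.3: 7000, 0.5: 3000, 0.8: 1500, 1.0: 500}

def degradation_cost(dod, delta_dod, lam_b=200.0):
    # pick the nearest tabulated DoD level as h(DoD)
    level = min(CYCLES_AT_DOD, key=lambda k: abs(k - dod))
    soe_decrease = delta_dod / CYCLES_AT_DOD[level]  # dSoE = dDoD / h(DoD)
    return lam_b * soe_decrease
```

As expected from Fig. 4, the same ΔDoD is costlier at deeper discharge levels, which is precisely what discourages the policy from deep-discharging the battery.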

C. Optimization Formulation and Difficulty Analysis
The battery discharging/charging operations are controlled by the controller. Given the state χ(t − 1) of the battery storage in time slot t − 1, the state χ(t) in time slot t can be updated accordingly. For the entire billing cycle T, we need to find the optimal battery discharging/charging controlling policy that minimizes the total electricity bill, i.e., min_b Σ_{t∈T} [C_e(t) + C_d(t) + C_inv(t)], where C_inv(t) is the total investment cost in slot t. Solving this optimization problem faces the following difficulties.
1) Uncertainty of Renewable Energy: Renewable energy generation is affected by multiple factors such as outdoor temperature and wind velocity. Owing to the unpredictable and intermittent nature of these factors, it is hard to accurately forecast the renewable energy generation (i.e., g(t)) and make the optimal discharging/charging operations (i.e., b(t)) of the battery storage without accurate information in advance. Therefore, we need a method to tackle the uncertainty of renewable energy generation.
2) Dynamics of Power Demand: In our modeled problem, if we assume the power demand (i.e., d(t)) is known in advance, the problem can essentially be optimized in an offline way. However, such an assumption is unrealistic in practice. In fact, traditional offline optimization methods (e.g., dynamic programming [21], [22]) can hardly find the global optimal solution, as the power demand can be obtained only when the workload arrives at the 5G BS. Thus, an online method that deals with the dynamic power demands (i.e., d(t)) and makes optimal discharging/charging operations (i.e., b(t)) is in great need.
3) High Computation Complexity: The optimization problem in Eq. (26) has embedded NP-hard subproblems. Firstly, in every time slot t, the controller needs to search the action space (mainly determined by M) to find the optimal discharging/charging operation (i.e., b(t)). To simplify the optimization problem, in this paper we discretize the SoC of the battery into M equal-spaced states; however, in real scenarios, the state of the battery is continuous, which leads to an enormous search space. Secondly, during the entire billing cycle (i.e., T), it is challenging for the controller to continuously make the optimal discharging/charging operation.
To tackle the above three challenges, we propose an online discharging/charging operation controlling method based on deep reinforcement learning (DRL) in the following section.

V. A DRL-BASED BATTERY OPERATION APPROACH
The recent breakthrough of deep reinforcement learning (DRL) [23] provides a promising technique for effective experience-driven control, which exploits past experience (e.g., historical battery discharging/charging operations) for better decision-making by adapting to the current state of the environment. We consider DRL particularly suitable for online discharging/charging operations because: i) it is capable of handling a high-dimensional state space (such as in AlphaGo [24]), which is an advantage over traditional Reinforcement Learning (RL) [25], and ii) it is able to deal with highly dynamic time-variant environments, such as the time-varying power demand and renewable energy generation. Next, we introduce the basic components and concepts of DRL and the proposed DRL-based battery discharging/charging controlling policy in detail.

A. Components & Concepts
A typical DRL framework consists of five key components: agent, state, action, policy, and reward.The concept and design of each component in our DRL-based battery discharging/charging controlling policy is explained as follows.
• Agent: The role of the agent is to make decisions in every episode by interacting with the environment. Specifically, at the beginning of each time slot, it determines the discharging/charging operation (i.e., b(t)) according to the current state (e.g., d(t), g(t), and χ(t)) of the environment. The objective is to find an optimal battery discharging/charging controlling policy that minimizes the total electricity bill during the entire billing cycle.
• State: At each episode, the agent first observes the state of the current environment to take an action. In order to take the optimal action in each episode, the current state should cover as much information as possible. In this paper, we define the state vector of the current environment as s(t) = [d(t), g(t), χ(t), p_max], which includes the current information on the power demand, the renewable energy generation, the battery storage, and the peak power consumption.
• Action: After observing the state of the environment, the agent takes an action accordingly. In our problem, the action is to control the battery discharging/charging operation in each time slot, i.e., b(t); specifically, i) whether the battery should be discharged or charged, and ii) how much energy should be discharged or charged. We denote the action taken at time t by a(t), which is equivalent to b(t).
• Policy: The battery discharging/charging controlling policy ψ(s(t)) : S → A defines the mapping from the state space to the action space, where S and A represent the state space and the action space, respectively. Specifically, the controlling policy can be represented by the set of mappings a(t) = ψ(s(t)), which map the state of the environment to the action at time slot t.
• Reward: After interacting with the environment, the agent receives a reward r(t) (calculated by the reward function R(s(t), a(t))), which indicates the effect of the action in this episode and is used to update the controlling policy. The objective of the agent is to find a policy ψ that maximizes the total reward through continuous interaction with the environment. The design of the reward function significantly affects the performance of the DRL-based algorithm, and we detail it in the next subsection.
To sum up, at each episode, the agent observes the state s(t), takes an action a(t) generated by the policy ψ, and receives a reward r(t) calculated by the reward function R(s(t), a(t)). The objective of the proposed DRL-based battery discharging/charging controlling policy is to take the optimal action in every episode so as to maximize the total reward.

B. Reward Function Design
At the end of each time slot, the agent evaluates the performance of the action using a reward function, which transforms the performance statistics into a numerical utility value. For an arbitrary time t, the agent observes the state s(t), takes the action a(t), and adopts the following reward function to assess the performance of the controlling action:

R(s(t), a(t)) = V_e(t) + V_d(t) + V_i(t),

in which:
• V_e(t) = −C_e(t) measures the reward of the incremental energy charge caused by the action in time slot t.
• V_d(t) = −C_d(t) measures the reward of the incremental demand charge caused by the action in time slot t.
• V_i(t) = −C_inv(t) measures the reward of the investment cost caused by the action in time slot t.
In the DRL-based framework, the objective is to maximize the expected cumulative discounted reward E[Σ_{t=1}^{T} γ^{t−1} r(t)], where γ ∈ (0, 1] is the discount factor indicating the degree of emphasis on future rewards; a higher γ indicates a higher emphasis on future rewards.
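The discounted objective above can be computed for a concrete reward trace as follows; the γ value in the test is arbitrary.

```python
# The cumulative discounted reward sum_t gamma^(t-1) * r(t) for a reward
# trace, with slots indexed from zero here. Gamma close to 1 weights
# future rewards almost as heavily as immediate ones.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```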

C. Learning Process Design
The learning process of the algorithm adopts a deep neural network (DNN) called a Deep Q-Network (DQN) to derive the correlation between each state-action pair (s(t), a(t)) and its value function Q(s(t), a(t)), i.e., the expected discounted cumulative reward. If the environment is in state s(t) and action a(t) is taken, the value function of the state-action pair (s(t), a(t)) can be represented as Q(s(t), a(t)) = E[r(t) + γ max_{a(t+1)} Q(s(t + 1), a(t + 1))]. After obtaining the value of each state-action pair (s(t), a(t)), the agent selects the action a(t) with the ε-greedy policy ψ, that is, it randomly selects an action with probability ε, and chooses the action with the maximum Q(s(t), a(t)), i.e., argmax_{a(t)} Q(s(t), a(t)), with probability 1 − ε.
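The ε-greedy selection can be sketched as follows, with q_values a list of Q-estimates indexed by the discretized actions:

```python
# Epsilon-greedy action selection as described above: with probability
# epsilon the agent explores a random action index, otherwise it exploits
# argmax_a Q(s, a).
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```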
As illustrated in Fig. 5, two effective techniques were introduced in [23] to improve stability: the replay buffer and the target network. Specifically,
• Replay Buffer: Unlike traditional reinforcement learning, DQN applies a replay buffer to store state transition samples in the form ⟨s(t), a(t), r(t), s(t + 1)⟩ collected during learning. Every κ time steps, the DRL-based agent updates the DNN with a mini-batch of experiences from the replay buffer by means of stochastic gradient descent (SGD), with learning rate σ. A higher learning rate leads to faster parameter updates; at the same time, however, the algorithm is more affected by abnormal data and may oscillate or fail to converge. Compared with Q-learning (which only uses immediately collected samples), randomly sampling from the replay buffer allows the DRL-based agent to break the correlation between sequentially generated samples and learn from more independently and identically distributed past experiences. Thus, the replay buffer smooths out learning and avoids oscillation or divergence.
• Target Network: There are two neural networks with the same structure but different parameters in DQN, the main net and the target net. Q(s, a; θ) and Q̂(s, a; θ⁻) represent the current Q-value and the target Q-value, generated by the main net and the target net, respectively. The DRL-based agent uses the target net to estimate the target Q-value Q̂ for training the DQN. Every τ time steps, the target net copies the parameters from the main net, whose parameters are updated in real-time. With the target net, the target Q-value remains unchanged for a period of time, which reduces the correlation between the current Q-value and the target Q-value and improves the stability of the algorithm. Accordingly, the DQN can be trained with the loss L(θ) = E[(Q̂ − Q(s, a; θ))²], where θ is the parameter set of the main net and the target Q-value is calculated by Q̂ = r + γ max_{a′} Q̂(s′, a′; θ⁻), where θ⁻ is the parameter set of the target net, updated every τ time slots by copying from the main net.
To sum up, the learning process is depicted by the pseudocode in Algorithm 1. The controller first initializes the replay buffer and the parameters (i.e., θ and θ⁻) of the main net and the target net, respectively (Lines 1-3 in Algorithm 1). After obtaining the value of each state-action pair (s(t), a(t)), the agent selects the action a(t) with the ε-greedy policy ψ, then performs the action a(t) and interacts with the environment (Lines 6-7 in Algorithm 1). Next, the agent receives the reward r(t), observes the next state s(t + 1) of the environment, and stores the transition ⟨s(t), a(t), r(t), s(t + 1)⟩ into the RB (Lines 8-9 in Algorithm 1). Every κ time steps, the agent updates the DNN by Eq. (30) with a mini-batch of experiences from the replay buffer by means of stochastic gradient descent (SGD), and the target net copies the parameters of the main net every τ time steps (Lines 10-13 in Algorithm 1). During the learning process, we set the learning rate σ to 0.001, the ε in the ε-greedy method to 0.9, the discount factor γ to 0.9, and the step parameters τ and κ both to 2000. For the whole battery discharging/charging scheduling process, the algorithm has an overall computational complexity of O(C_conv · T), where C_conv = Σ_{i=1}^{n} C_i^{in} C_i^{out} is the sum of the products of the input channels (neurons) and output channels (neurons) of the i-th linear layer, leading to a faster convergence speed compared to other DRL algorithms.
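The complexity term C_conv can be computed directly from the layer widths of the DNN; the widths below are placeholders, not the paper's architecture:

```python
def conv_cost(layer_widths):
    """C_conv = sum_i C_i_in * C_i_out over the linear layers of an MLP whose
    consecutive layer widths are given; the per-time-step forward cost is
    O(C_conv), hence O(C_conv * T) over T scheduling slots."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_widths, layer_widths[1:]))
```

For example, a network with layer widths 4 → 32 → 16 → 3 gives C_conv = 4·32 + 32·16 + 16·3.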

VI. PERFORMANCE EVALUATION
We evaluate the performance of the proposed DRL-based battery discharging/charging controlling policy through extensive numerical analysis.

A. Experiment Setup
1) BS and Power Consumption Data:
In order to show the performance of the proposed method, we mainly consider 5G BSs deployed in three areas, i.e., the resident area, office area, and comprehensive area, whose power consumption within a one-week period is illustrated in Fig. 2; we assume the power consumption of the same type of BS in different cities (e.g., Beijing, Shanghai, and Guangzhou) is the same. For simplicity, we denote the BSs deployed in the resident, office, and comprehensive areas as type I, type II, and type III, respectively. We will apply the BESS aided renewable energy supply solution to different types of BSs in different cities under different weather conditions and evaluate its performance through massive simulation experiments.
2) Renewable Energy Generation Data: In Section III-C, we introduced the factors that impact the generation of renewable energy. For simplicity, we divide the weather conditions into three types; accordingly, the output power patterns of the solar PV and the wind turbine are each divided into three types. Specifically, for the solar PV, the weather conditions are divided into clear days, partially cloudy days, and cloudy days; for the wind turbine, they are divided into high, middle, and low wind velocity. The output power patterns of the solar PV and wind turbine under different weather conditions are illustrated in Fig. 6, and the time slot Δt is 15 minutes in our experiment.
3) Equipment Parameter Settings: In this study, we use 15 Panasonic Sc330 solar modules, each with a power rating of 330 W, and a JFNH-5kW wind turbine from Qingdao Jinfan Energy Science and Technology Co., Ltd. For the battery storage, we consider the mainstream lithium-ion (Li-ion) battery on the current market. We then refer to [14], [26], [27] for the parameter settings of the electricity billing policy and the battery configurations; the main parameter settings are summarized in Table II.
4) Scenario Settings: As the generation of renewable energy is significantly affected by the weather conditions, we choose three representative cities in China, i.e., Beijing, Shanghai, and Guangzhou, which have different weather patterns during the billing cycle window (i.e., from 1st June 2020 to 30th June 2020). We compare and analyze the overall energy cost (including the energy charge, demand charge, and investment cost), the detailed controlling results, and the return on investment (ROI) for the three types of BSs (i.e., type I, type II, and type III) in these cities, and the daily weather conditions in these cities during the billing cycle window are shown in Fig. 7. Specifically, i) Beijing has more clear days during the billing cycle window; ii) Shanghai is in the plum rain season during the billing cycle window, and thus has more high-wind days but fewer clear days; and iii) in Guangzhou, the cloudy days and low-wind days are relatively more numerous than in the other two cities.

Fig. 6. (a) The solar PV output power patterns under different weather conditions (i.e., GHI(t), Temp(t), and ToD(t)) in a one-day period. (b) The wind turbine output power patterns under different weather conditions (i.e., WV(t), WS(t), and HH(t)) in a one-day period.

Fig. 7. The daily weather conditions in Beijing, Shanghai, and Guangzhou. The weather data is obtained from [28], and the billing cycle window is from 1st June 2020 to 30th June 2020.

B. Performance Under Different Weather Conditions
As shown in Fig. 6, the output power patterns of the solar PV and wind turbine are each divided into three types under different weather conditions. Accordingly, the weather pattern can be divided into nine types, i.e., all combinations of the three solar conditions and the three wind conditions. The power supply patterns of the 5G BSs in the resident, office, and comprehensive areas under different weather conditions in a one-day period are illustrated in Fig. 8, Fig. 10, and Fig. 11 (in the Appendix), respectively. As we can see, the BESS aided renewable energy supply solution can significantly reduce the power drawn from the grid (i.e., both the energy charge and the demand charge). Specifically, as radiation and wind velocity increase, the renewable energy generation increases accordingly; it can cover most of the power demand and reduce the power supplied from the power grid. In particular, on high-wind days, the power demand can be supplied entirely by the renewable energy and the battery storage, requiring no power from the grid.
After calculating the power supply paradigm under different weather patterns, we can derive the electricity bills of the three types of BSs during the billing cycle in different cities (i.e., under the different weather patterns illustrated in Fig. 7); the results of all the set scenarios are summarized in Table III.
Specifically, for a single 5G BS without the proposed power supply paradigm, the energy charge and the demand charge are $45.6 and $22.8, respectively. After utilizing the BESS aided renewable energy supply solution on the 5G BS, however, the electricity bill is significantly reduced. Especially in Shanghai, which has relatively more high-wind days, the energy charge and the demand charge can be reduced to $3.8 and $9.1, respectively. Although there exists equipment degradation during the discharge/charge cycles, the investment cost remains at a well-accepted level. The highest cost savings for a BS utilizing the proposed power supply paradigm in Beijing, Shanghai, and Guangzhou in one billing cycle are $50.4, $50.7, and $49.5, respectively. Accordingly, the saving ratio can be up to 74.4%, 74.8%, and 73.2%, respectively.
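One plausible reading of how such a saving ratio is computed is sketched below; the netting of the investment cost against the baseline bill is an assumption, and the paper's exact accounting may differ:

```python
def saving_ratio(baseline_bill, bill_with_bess, investment_cost):
    """Cost saving ratio over one billing cycle: the saving (baseline
    electricity bill minus the bill under the BESS solution, net of the
    per-cycle investment cost) divided by the baseline bill."""
    saving = baseline_bill - bill_with_bess - investment_cost
    return saving / baseline_bill
```

The function is deliberately unit-agnostic; any consistent currency works.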

C. Performance Under Different Types of BSs
As the different types of BSs have diverse power demands, resulting in different energy charges and demand charges, the performance of deploying the BESS aided renewable energy supply solution can differ. Specifically, as shown in Table III, the type I BS has the highest cost saving among the three types of BSs, i.e., $50.4 in Beijing, $50.7 in Shanghai, and $49.5 in Guangzhou. This is because the type I BS has the biggest power demand and peak value (near 1450 watts), giving it great potential for energy saving and peak power shaving. Besides, as the type II BS's power demands are relatively small, the generated and stored renewable energy can effectively reduce the power grid supply; therefore it has the highest saving ratio, i.e., 76.4% in Beijing, 77.9% in Shanghai, and 75.6% in Guangzhou.

D. ROIs of Different Scenarios
The return on investment (ROI) is a financial metric defined as the benefit (cost saving in our case) divided by the total investment. It indicates the profitability of an investment and has been widely used to evaluate investment efficiency [29]. Typically, a bigger ROI value indicates a higher investment efficiency. With the costs of the renewable energy generator and the battery storage (given in Table II), the total investments can be calculated, and the ROIs can thus be derived from the results in Table III.
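The ROI definition above can be sketched as follows; the split of the investment into generator and battery components mirrors the text, but the argument names are placeholders, not the Table II values:

```python
def roi_percent(cost_saving, generator_cost, battery_cost):
    """Return on investment as a percentage: the benefit (cost saving over
    the billing cycle) divided by the total investment, i.e., the renewable
    energy generator cost plus the battery storage cost."""
    total_investment = generator_cost + battery_cost
    return 100.0 * cost_saving / total_investment
```

A larger saving or a cheaper installation both raise the ROI, which is why falling equipment prices (Section VI-D) improve the economics.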

TABLE IV ROIS OF DIFFERENT TYPES OF BSS IN DIFFERENT CITIES
The ROIs of the different types of BSs deployed in different cities are shown in Table IV. Specifically, the type I BS has the highest ROI, reaching 5.43% in Beijing, 5.46% in Shanghai, and 5.33% in Guangzhou, which indicates a relatively high investment efficiency for the operators. This is because the type I BS has the biggest cost saving.
As the equipment cost is estimated to decrease dramatically in the future [30], the ROI could rise significantly in 5G and beyond. Additionally, as we can see, a city with more clear and high-wind days obtains a bigger ROI value; thus the proposed solution is more suitable for cities with more sunny and windy days.
It is worth noting that we assume the deployed renewable energy generator and battery storage supply power to only a single 5G BS, and thus the surplus renewable energy (when the battery is full) is discarded. This leads to a relatively low utilization in this work. In practice, the generated renewable energy could be supplied to multiple BSs [5], so that the ROI and the utilization of the renewable energy could be further improved.

E. Total Electricity Bill Under Different Algorithms
In order to reflect the performance of the proposed method, we mainly compare the total electricity bill with those of two baseline algorithms.
• AC: uses the actor-critic (AC) method [31], one of the DRL methods, to make the discharging/charging scheduling operations.
• Max: satisfies the BS's current power demand to the greatest extent.
As shown in Fig. 9, because Max only meets the current power demand to the greatest extent and lacks predictability, it may consume too much power in the beginning and fail to discharge continuously when the power demand is high later.
AC needs to train two networks (i.e., the actor network and the critic network), resulting in poor stability. Compared to AC, DQN can complete the operation in a shorter time and converges within 300 iterations, proving the efficiency of the proposed method.
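The greedy behavior of the Max baseline can be sketched as follows; the single-battery state model, argument names, and per-slot energy units are illustrative assumptions, not the paper's exact simulator:

```python
def max_policy(demand, generation, soc, capacity, max_rate):
    """'Max' baseline: meet the BS's current net demand from the battery to
    the greatest extent, and charge any surplus renewable generation.
    Returns the signed battery power (positive = discharge, negative =
    charge) and the updated state of charge, all in per-slot energy units."""
    net = demand - generation
    if net > 0:
        # Shortfall: discharge as much as the battery and rate limit allow.
        discharge = min(net, soc, max_rate)
        return discharge, soc - discharge
    # Surplus: charge the battery up to its free capacity and rate limit.
    charge = min(-net, capacity - soc, max_rate)
    return -charge, soc + charge
```

Because the rule looks only at the current slot, it can drain the battery early and leave nothing for a later demand peak, which is exactly the weakness noted above.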

VII. RELATED WORK
The most relevant related literature can be divided into the following two categories.

A. Base Station Energy-Saving Method
With the increase of BS power consumption, energy-efficient cellular networks have recently received significant attention. One commonly used scheme is to switch the BS on/off according to its traffic load [32]-[34]. Intuitively, we can switch on the BS when its traffic load is high, and switch it off or turn it to sleep mode when its traffic load is low. In addition, by combining with AI, the accuracy of traffic load prediction can be improved, so that the corresponding energy-saving policies can be elaborately formulated. However, due to the shutdown of some BSs, the traffic latency may increase, degrading the QoS of wireless services.
For energy management, peak power shaving is a preferable approach to overcome the uneconomical and inefficient nature of peak power supply, flattening the load curve by reducing the peak load and shifting it to times of lower load [35], [36]. Specifically, peak power shaving is achieved by charging the energy storage system when demand is low (off-peak period) and discharging it when demand is high (on-peak period). For task offloading, the total power consumption can also be reduced by dispatching tasks to BSs with lower loads [37], [38]. The relevant literature is summarized in Table V.
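The charge-off-peak/discharge-on-peak rule just described can be sketched as a simple time-of-day policy; the hour-set interface and rate/capacity arguments are illustrative only (in this work the schedule is instead learned by the DRL agent):

```python
def peak_shaving_action(hour, on_peak_hours, soc, capacity, rate):
    """Rule-based peak shaving: discharge the storage during on-peak hours
    and charge it during off-peak hours, subject to the state of charge and
    the power rate limit. Positive = discharge, negative = charge."""
    if hour in on_peak_hours:
        return min(rate, soc)               # discharge toward the peak load
    return -min(rate, capacity - soc)       # charge from the off-peak grid
```

Such a fixed rule flattens the load curve but cannot adapt to the stochastic renewable generation and demand that motivate the DRL approach.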

B. Battery Storage Optimal Control
The optimal control of energy storage has been extensively studied in the past. Most related works formulate an optimization problem that aims to maximize the revenue generated by the battery storage co-located with a renewable energy generator. Babacan et al. [39] proposed a convex program to minimize the electricity bill of operators. Ratnam et al. [40] aimed to maximize the daily operational savings that accrue to customers while penalizing large voltage swings stemming from reverse power flow and peak load. Kazhamiaka et al. [41] studied the profitability of residential PV-storage systems in three jurisdictions and set up an integer linear program to determine the battery controlling policy. These works assume that the generation of renewable energy and the power demand are known in advance and can be optimized in an offline manner. However, these assumptions are impractical in the real world.
Several papers study the optimal control of batteries under uncertainty and randomness. Guan et al. [42] utilized a reinforcement learning method to minimize the homeowner's cost by taking the action that yields the best expected reward. EnergyBoost [18] can predict the renewable energy generation and the power demand. However, these works only apply to the home scenario, which generates far smaller power demands than a 5G BS. Therefore, we propose the DRL-based method to tackle the large and constrained state and action spaces and the uncertainty of renewable energy generation and power demand.

VIII. CONCLUSION
To cope with the ever-increasing electricity bill for mobile operators in the 5G era, we proposed a BESS aided energy supply solution for the 5G BS system, which models the battery discharging/charging as an optimization problem. With our proposed solution, besides the power grid, a BS can be powered by the renewable energy and the battery storage, to cut down the total energy cost. To solve the problem under dynamic power demands and renewable energy generation, we developed a DRL-based approach to the BESS operation that accommodates all factors in the modeling phase and makes decisions in real time. To evaluate the performance of our solution, we chose three cities with different weather patterns for experiments. The experimental results show that our power supply solution can achieve a cost saving ratio of up to 74.8% during the entire billing cycle and improve the renewable energy utilization.
In the future, with further development of communication technology (e.g., B5G/6G), there will be more mobile BSs and aerial BSs equipped with more batteries, which may rely heavily on renewable energy. Designing an effective battery discharging/charging policy that also ensures high QoS of mobile networks is an interesting and challenging problem for future work.

Fig. 5 .
Fig. 5. The framework of the learning process in DQN. For simplicity, we denote s(t + 1) as s′. After interacting with the environment, the agent (i.e., the controller) determines the specific discharging/charging operation.
The nine weather patterns are: the clear & high-wind day, clear & middle-wind day, clear & low-wind day, partially cloudy & high-wind day, partially cloudy & middle-wind day, partially cloudy & low-wind day, cloudy & high-wind day, cloudy & middle-wind day, and cloudy & low-wind day.

Fig. 8 .
Fig. 8. The power supply pattern of a single 5G BS in the resident area under different power supply methods and different weather conditions in a one-day period.

Fig. 10 .
Fig. 10. The power supply pattern of a single 5G BS in the office area under different power supply methods and different weather conditions in a one-day period.

Fig. 11 .
Fig. 11. The power supply pattern of a single 5G BS in the comprehensive area under different power supply methods and different weather conditions in a one-day period.

Algorithm 1: Battery Controlling Algorithm With DRL
Input: Power demand of BS d(t) and renewable energy generation g(t), 1 ≤ t ≤ T
Output: Discharging/charging actions a(t), 1 ≤ t ≤ T
1 Initialize replay buffer (RB) to capacity N;
2 Initialize main net Q with random weights θ;
3 Initialize target net Q̂ with weights θ⁻ = θ;
4 for t = 1 to T do
5   Observe the current state s(t);
6   Select action a(t) with the ε-greedy policy ψ;
7   Perform a(t) and interact with the environment;
8   Receive reward r(t) and observe the next state s(t + 1);
9   Store ⟨s(t), a(t), r(t), s(t + 1)⟩ into the RB;
10  Every κ steps, sample a mini-batch of experiences from the RB;
11  Update θ by SGD on the sampled mini-batch;
12 end
13 Set Q̂ = Q by every τ steps;

TABLE III RESULTS SUMMARY (ONE BILLING CYCLE)