Exploiting battery storages with reinforcement learning: a review for energy professionals

The transition to renewable production and smart grids is driving massive investment in battery storages, and reinforcement learning (RL) has recently emerged as a potentially disruptive technology for the control and optimization of battery storage systems. A surge of papers has appeared in the last two years applying reinforcement learning to the optimization of battery storages in buildings, energy communities, energy harvesting Internet of Things networks, renewable generation, microgrids, electric vehicles and plug-in hybrid electric vehicles. This article reviews these applications from four perspectives. Firstly, the type of optimization problem is analyzed; the literature can be divided into approaches that optimize either financial targets or energy efficiency. Secondly, the approaches for handling user comfort are analyzed for applications that may impact a human user. Thirdly, this paper discusses the approaches used to model and reduce battery degradation. Fourthly, the articles are categorized by application context, and applications likely to attract a high amount of research are identified. The paper concludes with a list of unresolved challenges.


I. INTRODUCTION
The transition to renewable production and smart grids is driving massive investment in battery storages, evidenced by numerous very recent reviews on the subject (e.g. [1]-[5]). Reinforcement learning (RL) has recently emerged as a potentially disruptive technology for the control and optimization of battery storage systems. For example, Lee & Choi [6] formulate the optimization of a domestic energy storage system as a mixed-integer linear programming (MILP) problem as well as an RL problem, reporting significant performance improvements with the RL approach. In the energy domain, Perera and Kamalaruban [7] note a dip in recent publications on model predictive control, mirrored by a rapid increase in RL publications. Yang et al. [8] note that RL is well suited for complex problems with nonlinearity and uncertainty, which is often the case in next generation electric systems.
The associate editor coordinating the review of this manuscript and approving it for publication was Vitor Monteiro.
A few reviews on RL applications to intelligent energy systems include battery storages among the energy resources that have been studied. Perera and Kamalaruban [7] review RL applications across six major sectors: building energy management system (BEMS), dispatch, vehicle energy systems, energy devices, grid and energy markets. Battery storages appear as subcategories of some of these. Yang et al. [8] have an even broader scope including RL applications to smart grid, microgrids, integrated energy systems and energy internet. Glavic [9] reviews RL applications for controlling power grids, so batteries are discussed only for the purpose of grid support. Vázquez-Canteli & Nagy [10] review RL applications to demand response and discuss battery storages to the extent that they are featured in such applications. Wang and Hong [11] review RL applications for building controls and note works involving batteries, without further analyzing the control and optimization problems involving batteries. Frikha et al. [12] identify significant interest in RL in Internet of Things (IoT) applications and identify battery energy consumption and lifetime management as one of the key challenges. In all of the above-mentioned reviews, significant background knowledge of RL theory is required from the reader. Different RL techniques are discussed and their application to specific problems is analyzed critically. Further, none of these reviews has a section dedicated to battery storages. Rather, batteries are discussed if they appear in the context of energy systems.
A surge of papers on RL applications for battery storages has appeared in the last two years, so at the time of writing, a critical mass of literature exists, meriting a dedicated review. This review is targeted at researchers and practitioners applying battery storages in different areas of green electrification, who wish to understand the disruptive potential of RL to their field. RL technology consists of algorithms that are interfaced to real or simulated systems in such a way that the algorithms learn to achieve specified optimization targets as they interact with the system. Since the great majority of our target audience are not RL experts, the objective of this paper is to review this research in a way that is understandable to this audience. The reviewed works apply batteries to a range of innovative applications in buildings, energy communities, energy harvesting IoT networks, renewable generation, microgrids, electric vehicles (EV), plug-in hybrid electric vehicles (PHEV) as well as hybrid electric vehicles (HEV). Thus, our presentation is aimed beyond battery experts to the broader energy community working on such applications. Key RL concepts are introduced within a general framework of an RL agent managing a battery storage system, without assuming prior knowledge of RL or machine learning from the reader. The reviewed works are analyzed with reference to this framework.
This paper is structured as follows. Section 2 introduces a general conceptual overview for RL agents managing a battery storage system, without assuming prior knowledge of RL or machine learning from the reader. Several examples of systems including batteries are presented in the context of this framework. Section 3 presents the methodology of the literature review and an overview of the papers that were included in the review. The objectives, scope and approach of each paper are studied through four different aspects in sections 4-7. Each paper is discussed in each of these sections to the extent that the relevant aspects were explicitly discussed in the paper. Section 4 assesses the literature with respect to the main optimization objective of the RL application, with three categories emerging: optimization of energy efficiency, minimization of operational costs and minimization of investment costs. Section 5 discusses how user comfort has been handled or ignored in applications that may impact a human user. Section 6 discusses the various levels of abstraction used to model the battery and how battery degradation has been included in the optimization. Section 7 summarizes the review by discussing the literature according to specific application areas, so that readers interested in a specific area such as electric vehicle charging will understand the focus of the research and open challenges in their field. Section 8 discusses our proposal for handling the main problem that was encountered in the literature review: there is a rapidly growing body of research, but it is difficult to identify the works with breakthrough performance, due to the great diversity in problem formulations and experimental setups. Benchmark RL environments have successfully addressed this problem in other fields, and the closest such works to the topic of this paper are identified. Section 9 concludes the paper with recommendations for overcoming the main unresolved challenges.
Figure 1 provides a graphical overview of the categorization in sections 4-7. Each of the four boxes corresponds to one of the four main categories analyzed in sections 4-7. Each of these categories has been indicated as being mandatory or optional. The optional category is not applicable to all of the papers selected for the review. The boxes within the main box are subcategories analyzed in their own subsection.

II. GENERAL CONCEPTUAL OVERVIEW FOR REINFORCEMENT LEARNING AGENTS MANAGING A BATTERY STORAGE
Major categories of machine learning methods include supervised, unsupervised and reinforcement learning. Supervised learning applications can be further categorized as regression and classification problems. Regression involves predicting a value based on several input datasets; for example, the price of an electricity market could be predicted based on weather and power system data. Classification involves choosing one out of several possible categories; for example, the categories could include a normal operating mode and several failure modes. In all cases, supervised learning methods require a training set, in which the correct output has been labelled for each input sample. If such labelled training data is not available, unsupervised learning can be applied to some problems. For example, if a time series dataset is available for a system running in a normal operating mode, an unsupervised learning algorithm can be trained to recognize a deviation from that normal operating mode, but it will not be able to classify the specific failure mode. In the energy domain, supervised and unsupervised learning methods are used mainly for time series forecasting and condition monitoring, rather than decision making. RL differs from these methods in the sense that it learns to make better decisions by interacting with an environment and adjusting its actions according to feedback.
Figure 2 shows a general framework of an RL agent managing a battery storage. The figure introduces the key concepts that are used throughout this paper.
The environment consists of the battery storage and the system in which the storage is used. A few examples are presented in the following, to give the reader an idea of the great diversity of environments that researchers have developed to support their diverse RL problem formulations. If RL is used to minimize gasoline consumption of a PHEV, the gasoline tank and the engine should be modelled in the environment at a suitable level of abstraction [13]. The availability of V2G (vehicle-to-grid) needs to be considered when modelling the possibility to sell energy from the vehicle batteries to the grid [14], but details such as grid inverters may be abstracted away at the discretion of the authors [15]. For a wireless EV charging system, the EV characteristics and the traffic environment need to be considered [16]. For optimizing revenues of a wind farm with battery storage, the environment simulates the settlement scheme of the electricity market [17]. It is noted that the terms PHEV and HEV are used inconsistently in the literature. In this paper, all vehicles with an internal combustion engine and a battery are categorized as PHEVs. Further, EVs and HEVs are categorized so that the latter has another energy source such as a hydrogen fuel cell to complement the battery.
The RL agent takes actions, which impact the environment. The actions are specific to the application. Examples are bidding on various electricity markets [18], selecting between battery packs [19] or controlling the power of the engine in a PHEV [20].
The environment provides the RL agent with state information, which the agent considers when taking an action. The State of Charge (SoC) is a very commonly used state variable. Depending on the level of detail chosen by the authors, additional variables such as the temperature of batteries can be included [19]. Additional state variables depend on the specific application. For example, the energy management of PHEVs, EVs and HEVs is usually formulated in terms of a power demand state variable, which specifies the momentary power demand that must be jointly supplied by the on-board energy sources (e.g. [21]). As another example, relevant state information for the electricity management of a building's HVAC (Heating, Ventilation and Air Conditioning) includes indoor temperatures and occupancy [22].
The state may include additional exogenous variables that the RL agent cannot affect, but which are useful information for the RL agent as it determines the best action to take in the present state. For example, relevant market prices or weather data can be included if they are known at the time of taking the action, or if a forecast is available [23]. Otherwise, the price and weather can be treated as an unknown not included in the state information, but which can be taken into account in the reward [15]. Some studies use historical weather observation data instead of forecasts [24], so the system cannot be deployed as such to an online environment in which only uncertain weather forecasts are available.
The environment must implement a mapping to a next state given a current state and an action. The mapping can be constructed analytically with equations (e.g. [15]). Another approach is to use an energy simulator and implement a wrapper around it to realize the state, action and reward interfaces [25]-[27]. In some cases, an energy simulator is not sufficient. For example, in self-driving vehicles that need to consider other vehicles and traffic lights, Wegener et al. [28] include a traffic simulator in the environment.
Finally, the RL agent requires feedback in the form of rewards to train its machine learning model, which determines the action based on the state information. The reward is generated by the environment and should penalize the agent for disadvantageous actions and reward it for advantageous actions. Depending on the objective of the paper, the reward is usually based on electricity costs [15], grid stability [29] or energy efficiency related criteria [23]. If the RL agent is allowed to impact the user's comfort, for example by rescheduling appliances, adjusting indoor temperature or changing the charging behavior of EVs, a discomfort related penalty can be included in the reward [30]. A penalty for battery degradation can also be included in the reward.
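To make the state, action, reward and transition concepts concrete, the following Python sketch shows a deliberately simple environment. It is an illustration under our own assumptions and is not taken from any reviewed paper: the class name `BatteryEnv`, the lossless battery, the price series and all parameter values are hypothetical. The state is the pair (SoC, current price), the action is a charging or discharging power, and the reward is the negative electricity cost.

```python
from dataclasses import dataclass


@dataclass
class BatteryEnv:
    """Toy environment: one lossless battery trading against a known price series."""
    prices: list                 # price per step, e.g. EUR/kWh (hypothetical data)
    capacity_kwh: float = 10.0   # assumed battery capacity
    max_power_kw: float = 5.0    # assumed power limit
    step_hours: float = 1.0
    soc: float = 0.5             # state of charge as a fraction of capacity
    t: int = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.soc, self.t = 0.5, 0
        return (self.soc, self.prices[self.t])

    def step(self, power_kw):
        """Apply an action. Positive power charges, negative power discharges."""
        power_kw = max(-self.max_power_kw, min(self.max_power_kw, power_kw))
        energy_kwh = power_kw * self.step_hours
        # Truncate the energy moved so that the SoC stays inside [0, 1].
        energy_kwh = max(-self.soc * self.capacity_kwh,
                         min((1.0 - self.soc) * self.capacity_kwh, energy_kwh))
        self.soc += energy_kwh / self.capacity_kwh
        # Reward: negative cost of purchased energy (revenue when discharging).
        reward = -self.prices[self.t] * energy_kwh
        self.t += 1
        done = self.t >= len(self.prices)
        next_price = None if done else self.prices[self.t]
        return (self.soc, next_price), reward, done
```

An agent would call `reset()` once per episode and then `step()` repeatedly, using the returned reward as the training signal. Charging when the price is low and discharging when it is high yields a positive total reward in this toy setting.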
As training progresses, the RL agent learns to take actions that result in a high reward, but this may result in suboptimal solutions. To avoid this, the RL practitioner is able to force the training process to occasionally choose random actions, in a technique called exploration. Some authors may use their expertise of the specific battery energy management application to achieve more intelligent and computationally effective exploration. For example, in managing a HEV battery, Lian et al. [31] force the exploration to occur close to the Brake Specific Fuel Consumption curve, which is known to be the optimal region of operation for this kind of application. Zhou et al. [32] achieve a similar result with a heuristic algorithm developed to constrain the exploration.
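A common way to realize exploration is the epsilon-greedy rule, sketched below in Python. This is a generic illustration, not the specific method of [31] or [32]. The optional `allowed` argument is our own hypothetical device for showing how domain knowledge could restrict exploration to a subset of actions.

```python
import random


def epsilon_greedy(q_values, epsilon, allowed=None, rng=random):
    """Pick an action index: random with probability epsilon, else the best one.

    q_values: estimated value of each action in the current state.
    allowed:  optional iterable of action indices the agent may choose from
              (hypothetical hook for domain-informed exploration).
    """
    candidates = list(range(len(q_values))) if allowed is None else list(allowed)
    if rng.random() < epsilon:
        return rng.choice(candidates)                    # explore
    return max(candidates, key=lambda a: q_values[a])    # exploit
```

With `epsilon=0.0` the agent always exploits its current estimates, and with `epsilon=1.0` it always explores. Practitioners often start with a high value and reduce it as training progresses.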
The internals of the RL agent involve details understandable to machine learning practitioners. As the primary target audience of this article is energy practitioners, this review does not focus on these aspects. However, a brief overview is provided as follows. The RL agent essentially implements a mapping from the state space to the action space. An early approach was Q-learning, in which the mapping was captured in a table, so the learning process involved updating the values in this table. More recently, due to increases in computational power and the resulting progress in deep neural networks, such networks are now commonly used to implement the mapping instead of the Q-table. The basic approach involves a single neural network that is used to make this mapping. However, this approach does not always result in stable training performance. To overcome this, the concept of value was introduced: the value quantifies how good a particular state is, so this is a different concept than the reward. To exploit the value concept, the actor-critic network was introduced. The actor implements the mapping from the state space to the action space, and the critic computes the value, which is used in the training process of the actor. However, a weakness of actor-critic methods is that a small adjustment to the weights of the actor network may cause a jump to a region in which performance is poor, so the training may not converge, and thus fail to optimize the reward function. Several variants of the actor-critic have been proposed to cope with this issue. In particular, PPO (Proximal Policy Optimization) limits the changes to the actor network parameters at each training step, improving the stability of the training process. Further innovations involving several neural networks have been developed, with Deep Deterministic Policy Gradient (DDPG) [73], [74] and Twin Delayed DDPG (TD3) being among the most commonly used. In general, these can be considered implementation details that are encapsulated in the ''Reinforcement learning agent'' box of Figure 2, so the choice of implementation method does not directly impact the formulation of state, action and reward. This is an encouraging observation, in the sense that battery domain experts could be more involved in the formulation in the future. However, there is a notable consideration in the choice of algorithm that will impact the formulation of state and action spaces. Some of the algorithms only support discrete state and action spaces, so if a state or action variable is of a continuous nature, the practitioner must define a limited number of discrete values for it.
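The table-based approach and the need for discretization can be illustrated with a short Python sketch. It shows the standard tabular Q-learning update together with a helper that maps a continuous variable such as the SoC to a bin index. The function names, the bin count and the values of the learning rate `alpha` and discount factor `gamma` are our own illustrative assumptions.

```python
from collections import defaultdict


def discretize(value, low, high, n_bins):
    """Map a continuous value in [low, high] to a bin index in 0..n_bins-1."""
    clipped = max(low, min(high, value))
    return min(n_bins - 1, int((clipped - low) / (high - low) * n_bins))


def q_update(q_table, state, action, reward, next_state, n_actions,
             alpha=0.1, gamma=0.95):
    """One tabular Q-learning update of the entry for (state, action)."""
    # Value of the best action available in the next state.
    best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])


# Usage: a table whose unseen entries default to zero.
q_table = defaultdict(float)
state = discretize(0.55, 0.0, 1.0, n_bins=10)       # SoC of 55 % falls into bin 5
next_state = discretize(0.65, 0.0, 1.0, n_bins=10)
q_update(q_table, state, action=1, reward=1.0, next_state=next_state, n_actions=3)
```

A finer discretization gives a more precise representation but enlarges the table, which is one reason why neural networks replaced the table for larger problems.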
Batteries are systems with complex chemical phenomena governing their charging, discharging and aging behavior. These phenomena are specific to the battery chemistry, and the development of such chemistries is an active area of research. However, as will be discussed in more detail in section VI, RL practitioners either explicitly or tacitly ignore these phenomena, or model them in a simplified way. For example, many authors assume that charging or discharging power can be expressed as the product of the battery capacity, SoC difference over a time period and a charging/discharging efficiency constant, so the battery chemistry is not considered or even mentioned (e.g. [116]). Such equations are implemented in the reinforcement learning environment, and they govern how the environment transitions from one state to the next upon receiving an action from the RL agent.
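As a hedged illustration, this kind of simplified model can be written as a SoC update equation. The notation is our own and is not taken from any specific reviewed paper: $E_{\mathrm{cap}}$ is the battery capacity, $\Delta t$ the time step, $P^{\mathrm{ch}}_t$ and $P^{\mathrm{dis}}_t$ the non-negative charging and discharging powers, and $\eta_{\mathrm{ch}}$, $\eta_{\mathrm{dis}}$ constant efficiencies. The exact placement of the efficiency constants varies between papers.

```latex
\mathrm{SoC}_{t+1} = \mathrm{SoC}_t
  + \frac{\Delta t}{E_{\mathrm{cap}}}
    \left( \eta_{\mathrm{ch}} \, P^{\mathrm{ch}}_t
         - \frac{P^{\mathrm{dis}}_t}{\eta_{\mathrm{dis}}} \right)
```

Setting one of the two powers to zero and solving for the other recovers a power expressed through the capacity, the SoC difference over the time step and an efficiency constant.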
Very few authors use more sophisticated models that are configured for a specific battery chemistry. For example, a lithium-ion battery model distinguishes between terminal and open circuit voltage and between internal and transient resistance [91]. In another example, the charge and discharge behavior of a lead-acid battery is specified in the context of the system that the battery is used in, comprising a diesel generator and an inverter connected load [96]. The established way to capture aging in the reviewed articles was to add an aging penalty term to the reward function. The most common approaches are to penalize situations in which a minimum or maximum SoC threshold has been crossed (e.g. [130]), or to penalize deviations from a reference SoC (e.g. [70]). Unfortunately, the findings of such studies cannot be expressed in terms of equivalent full cycles, which is an established metric of battery lifetime.
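For readers unfamiliar with the metric, the following Python sketch shows one common convention for counting equivalent full cycles from a SoC trajectory: the accumulated SoC swing is divided by two, so that one complete discharge followed by one complete charge counts as one cycle. This is an assumption on our part. Other conventions exist, for example counting only the discharged energy, and the function name is hypothetical.

```python
def equivalent_full_cycles(soc_trace):
    """Equivalent full cycles from a sequence of SoC fractions in [0, 1].

    The absolute SoC changes between consecutive samples are summed and
    divided by 2.0, the total swing of one full discharge plus one full charge.
    """
    swing = sum(abs(b - a) for a, b in zip(soc_trace, soc_trace[1:]))
    return swing / 2.0
```

For example, a battery that moves from 50 % to 75 % and back to 50 % has accumulated a quarter of an equivalent full cycle under this convention.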

III. LITERATURE REVIEW METHODOLOGY
Various terms for battery storages are used in the literature, such as energy storage system, battery storage, battery energy storage system, battery and storage. To capture these and other variants, the following search string was used: ''reinforcement learning'' AND (storage OR battery). The search string was applied to all fields. The search results are shown in Table 1. The hits were studied manually to select the relevant papers to be included in the review. The ''storage'' term resulted in many irrelevant articles on data storage or industrial warehouse type of storage; however, including this term in the search was important to find several relevant articles not using the word ''battery'' but rather ''energy storage''. The Elsevier (ScienceDirect) and IEEE (IEEE Xplore) search engines returned a large number of hits. These were sorted by relevance and studied in batches of 25. The search was stopped upon encountering a batch with no relevant papers. The search was limited to papers published since 2016. The search in IEEE Xplore was limited to journal articles, including early access.
It is notable that this approach of using a simple search string resulted in a large number of articles, many of which were not considered relevant. Thus, the approach relies heavily on manual work and judgement on the part of the authors. An extensive list of criteria for article selection was developed for this purpose and it is discussed in the next paragraph. Since this is a new, incipient field, the number of hits was manageable, and the final number of papers selected for inclusion in the review was considered suitable for a review paper. The search was restricted to the publishers ScienceDirect, IEEE and MDPI. When the search was repeated in the Web of Science database and limited to journal articles, these publishers emerged as the top 3 publishers.
The following principles were used to guide the manual selection process:
• Several articles addressed non-battery energy storage systems such as fuel cells (e.g. [33]), ultracapacitors [34], natural gas storage tanks [35] and thermal storages such as hot water tanks [36], boilers [37], chilled water tanks [38], [39] and ice storage [40], or exploited the building structures themselves as a passive thermal energy storage [41]. Such works were not selected, unless these storages were used in addition to a battery.
• Papers planning to incorporate batteries in future work (e.g. [42]) were not selected.
• Approaches that were generally applicable to distributed energy resources, including batteries, were not selected if they did not explicitly consider battery energy management (e.g. [43], [44]).
• Papers only indirectly related to battery management were not selected. A few examples of such indirectly related works are as follows. Biemann et al. [45] optimize the temperature in a data center to ensure desirable operating temperature for batteries; Wang et al. [46] use a lightweight RL approach on IoT sensor nodes with limited battery capacity; Bing et al. [47] design an energy efficient gait for a battery powered mobile robot, but do not consider battery energy management.
• If the same authors published several highly similar papers, only one was selected.
• Although our search string covered all applications of RL to batteries, no recycling related application was encountered within the search results.
Figure 3 plots the papers that were manually selected for the review according to the year of publication. An exponential growth in publications is observed, indicating that RL has good potential to become a disruptive technology in battery management. It is notable that the review was performed in the summer of 2021, so the numbers for 2021 in Figure 3 are expected to be significantly higher by the end of 2021.

IV. PROFITABILITY AND ENERGY EFFICIENCY
In this section, each of the papers in Figure 3 is categorized either under ''optimization of energy efficiency'', ''optimization of operational costs'' or ''optimization of investment cost''.

A. OPTIMIZATION OF ENERGY-EFFICIENCY
A significant portion of the reviewed works aimed to optimize energy efficiency without considering financial objectives. Energy efficiency is understood in different ways depending on the application, and the applications attracting significant amounts of research are summarized as follows. In the case of PHEVs, HEVs and EVs, many authors use RL to split the demand for driving power between different battery packs or between the battery and other power sources such as a fuel cell, ultracapacitor or internal combustion engine. The power split optimization is an instantaneous problem, but RL has also been used for energy optimization of EVs, including flying ones, over the course of a single trip. In built environments, batteries are optimized in conjunction with other energy resources such as PV, domestic loads and EV chargers, either in the scope of a single building or a microgrid. IoT sensor networks consist of potentially large numbers of battery-powered IoT nodes with no grid connection and limited or non-existent battery recharging possibilities. In such cases, electricity cost is not considered relevant, and most authors focus on minimizing energy consumption to maximize the battery lifetime. Table 2 summarizes all of the papers according to the four aspects around which this review is structured: profitability & energy efficiency, management of user discomfort, battery losses & degradation and context of use of the battery. Each of these aspects is discussed further in the text of sections 4, 5, 6 and 7 respectively. The table summarizes these aspects for each paper, and the works are ordered according to the publication year.

B. OPTIMIZATION OF OPERATIONAL COSTS
The optimization of operational costs of systems with batteries is a popular objective for RL; however, since RL is capable of multi-objective optimization, the financial objectives are often complemented by other application specific objectives or battery degradation related objectives. Several authors are not explicit about what kind of electricity market is being considered, but an analysis of the papers that did specify this reveals the following list of markets against which optimizations have been performed: day-ahead, intraday, real-time pricing, time-of-use pricing and frequency reserve markets. Using the battery storage to increase self-consumption of renewable generation and to reduce purchase of grid power is a common theme for RL applications to buildings, energy communities and microgrids. Such works have been included in Section IV.A if there was no financial element to the optimization. Otherwise, they have been included in this section. The charging of EVs and fleets of EVs is another topic of interest; however, a wide variety of formulations for the optimization problem were encountered in the literature. Finally, a few authors propose a new market for prosumers, energy communities or microgrids, and perform their optimization against that market. The analyzed papers are summarized in Table 3, sorted by the year of publication.

C. OPTIMIZATION OF INVESTMENT COST
Whereas most works focus on real-time operation or short-term planning of the operation of an existing system, a minority of works seek to minimize investment cost. The dimensioning or placement of the battery storages is a common optimization problem shared by these works. However, most authors do this optimization in the context of a larger problem that also considers investments in other kinds of energy storages. Another kind of investment cost optimization involves optimizing the lifespan of the battery by limiting the battery degradation. Such considerations have been incorporated as additional optimization criteria in several of the short-term optimization works in Section IV.A and Section IV.B. Works that have battery lifetime maximization as the main optimization criterion are covered in this section. The reviewed works are listed in Table 4, sorted by the year of publication.

V. MANAGEMENT OF USER DISCOMFORT
The reviewed papers can be categorized under three approaches for considering a human user. The first approach is to ignore the user, since the nature of the system and goals of the optimization problem are such that users are not impacted. The majority of the reviewed papers use this approach. The second approach is to impose constraints on the RL agent to ensure that users are not impacted. The third approach is to permit the RL agent to take actions that cause some inconvenience or discomfort to the users, and to penalize the agent for such impacts. In this analysis, all such impacts are referred to with the term user comfort.
For optimizing the battery usage of EVs and PHEVs as they are being driven, RL applications are concerned with energy efficiency. Most works are concerned with battery management decisions that do not impact passengers (e.g. [13], [19], [71]). However, a minority of works do consider the driving experience. The acceleration of the vehicle may be limited for reasons of energy efficiency [77] or safety [28], [102]. For traffic flow management with self-driving EVs, Qu et al. [76] aim to reduce stop-and-go traffic waves. In an autonomous flying taxi system, Yun et al. [147] penalize the RL system for any in-flight collisions. EV charging systems have the potential to disrupt the lives of EV drivers. Tuchnitz et al. [29] define user comfort as avoiding the trouble of giving input to a system that coordinates EV charging; this benefit is questionable since the EVs are not compensated. In an autonomous EV, Cao et al. [120] find a tradeoff between the total electricity cost and the waiting time to charge the vehicle. Zhang et al. [123] aim to minimize the time that the EV owner spends getting access to the charging point and waiting for the EV to charge. Yan et al. [156] propose a driver anxiety concept that captures the likelihood of the EV not having sufficient SoC for making an unexpected trip. For an EV taxi fleet, Tang et al. [124] define user comfort as customer waiting times. References [119], [153], [154] and Li and Wan [132] define a user requirement for the SoC and minimize the charging cost within this constraint.
Residential sector applications require careful consideration of potential user comfort issues. When the battery is used to shift electricity purchases and sales, user comfort is not impacted [15], [24], [121]. This remains true if local PV generation is added to the mix [23], [146], [158]. Lee and Choi [6], [30] consider a smart home with a capability to reschedule appliances and EV battery charging, and penalize for dissatisfaction resulting from rescheduling. Nakabi and Toivanen [142] consider household loads that respond to a dynamic price signal, and assume that user comfort is incorporated in the load controllers as a price elasticity parameter. Lee et al. [22] optimize a HVAC system and minimize the deviations of the indoor environment outside an ideal range. Lee and Choi [130] and Yu et al. [136] consider appliance agents for home energy management systems which are optimized against two criteria: reducing electricity bills while satisfying the consumer comfort level for heating and the consumer preferences for appliances.

VI. BATTERY LOSSES AND DEGRADATION
Some works assume that no loss occurs during the charging, discharging, and idling of the battery. The battery aging and degradation is also not considered by many works. In some cases the authors state this explicitly (e.g. [15], [144]). In some kinds of applications, ignoring these effects can be justified, as they are not directly related to the optimization problem. For example, Sultan et al. [98] optimize the selection of active sensors in an IoT sensor network to achieve the required data communication task so that the total energy consumption at the IoT nodes is minimized. As another example, Sangoleye et al. [99] find the optimal base station for each IoT node to connect to. In both of these examples, authors assume that these optimizations achieve the ultimate goal of the research, which is to prolong the battery lifetime at the sensor nodes through reducing the energy consumption. Diverse approaches are used by the authors that do consider energy losses and degradation. With respect to the RL problem formulation, these approaches can be categorized as capturing losses in the environment, imposing constraints on the RL agent to avoid degradation, or including minimization of degradation as an optimization criterion in the reward function. These approaches are not mutually exclusive, and ideally authors capture charging and discharging inefficiencies in the environment and additionally consider degradation in the reward function (e.g. [18]).

A. CAPTURING BATTERY LOSSES IN THE REINFORCEMENT LEARNING ENVIRONMENT
The RL environment usually uses a set of equations that define how the SoC is impacted by the control action taken by the RL agent. The SoC is often a state variable and may in some cases be used in the reward function, for example in formulas that capture battery degradation. A common battery modelling approach in the environment of the RL agent is to capture energy losses resulting from battery operations with factors for charging and discharging efficiency, and to impose limits to charging and discharging power (e.g. [22], [24], [29], [116], [133], [142]). Whereas most authors capture losses as a simple coefficient for charging and discharging efficiency, a few authors use more detailed models. Chen et al. [71] use a non-linear battery model and Zhang et al. [96] model the charging and discharging dynamics in detail for a specific type of battery, the lead-acid battery. Kolodziejczyk et al. [141] model the maximum charging and discharging power as non-linear functions of SoC. Totaro et al. [95] model how charging and discharging efficiencies as well as the battery storage capacity degrade over time. For problem formulations that permit selling battery energy to the grid, the inverter efficiency as a function of discharging power is a significant factor taken into account only in a minority of the works [23]. Aljohani et al. [91] include temperature in their battery model to ensure an accurate tracking of SoC over the duration of a trip. Liu et al. [94] consider energy management in the specific application of Formula-E races, and carefully model the impact of ambient temperature and vehicle speed on battery temperature.
VOLUME 10, 2022
In general, the battery is captured in the environment of the RL agent by a set of equations defined by the authors. An alternative approach would be to use a battery simulation [83]. A few authors use RL to improve such simulation models. Unagar et al. [176] use machine learning to infer the battery model's parameters; RL is used to avoid the need for the labelled training data that supervised learning methods would require. Kim et al. [177] use RL to obtain a more accurate method for estimating the SoC of lithium-ion batteries than what has been possible with model-based methods.
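The common modelling approach described in this subsection, with charging and discharging efficiency factors and power limits, can be illustrated with a minimal sketch. It is not taken from any of the reviewed works: the names `BatteryParams` and `step_soc`, the sign convention (positive power charges the battery) and all numeric values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BatteryParams:
    # All values below are illustrative assumptions, not data from the reviewed works.
    capacity_kwh: float = 10.0
    max_charge_kw: float = 5.0
    max_discharge_kw: float = 5.0
    eta_charge: float = 0.95
    eta_discharge: float = 0.95


def step_soc(soc: float, power_kw: float, dt_h: float, p: BatteryParams) -> Tuple[float, float]:
    """Advance the SoC (0..1) by one timestep of length dt_h hours.

    power_kw > 0 requests charging, power_kw < 0 requests discharging.
    Returns the new SoC and the terminal power that was actually applied.
    """
    # Impose the charging and discharging power limits.
    power_kw = max(-p.max_discharge_kw, min(p.max_charge_kw, power_kw))
    if power_kw >= 0:
        # Only a fraction eta_charge of the terminal energy is stored.
        delta_kwh = power_kw * dt_h * p.eta_charge
        # Do not charge beyond a full battery.
        delta_kwh = min(delta_kwh, (1.0 - soc) * p.capacity_kwh)
        applied_kw = delta_kwh / (p.eta_charge * dt_h)
    else:
        # Delivering energy drains more from the battery than reaches the terminals.
        delta_kwh = power_kw * dt_h / p.eta_discharge
        # Do not discharge below an empty battery.
        delta_kwh = max(delta_kwh, -soc * p.capacity_kwh)
        applied_kw = delta_kwh * p.eta_discharge / dt_h
    return soc + delta_kwh / p.capacity_kwh, applied_kw
```

In this sketch the losses live entirely in the environment: the agent requests a power, and the environment decides how much energy actually ends up in, or comes out of, the battery.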

B. IMPOSING CONSTRAINTS ON THE AGENT TO PREVENT BATTERY DEGRADATION
One approach for limiting battery degradation is to impose constraints, so that an external logic overrides the actions taken by the RL agent in case these constraints are violated. A simple approach is to define minimum and maximum SoC and charging and discharging power thresholds as hard constraints [49], [58], [140], [167]. Nyong-Bassey et al. [72] take this constraint as the starting point for power pinch analysis, which anticipates SoC threshold violations and takes actions ahead of time to ensure that the violations will not occur. For self-driving EVs, Tang et al. [124] implement a constraint that the vehicle must reach a charging station before its SoC drops below a minimum threshold.
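The simple hard-constraint approach can be sketched as an override function that sits between the agent and the battery. This is an illustrative sketch only: the function name `override_action`, the thresholds and the lossless SoC bookkeeping are assumptions, and the reviewed works differ in how they implement the override.

```python
def override_action(soc, requested_kw, dt_h, capacity_kwh=10.0,
                    soc_min=0.2, soc_max=0.8, max_power_kw=5.0):
    """Return the power that the external logic lets through.

    Positive power charges the battery, negative power discharges it.
    Losses are ignored here to keep the constraint logic visible (assumption).
    """
    # Hard threshold on charging and discharging power.
    power = max(-max_power_kw, min(max_power_kw, requested_kw))
    # Largest charge and discharge power that keeps the SoC inside [soc_min, soc_max].
    max_charge = max(0.0, (soc_max - soc) * capacity_kwh / dt_h)
    max_discharge = max(0.0, (soc - soc_min) * capacity_kwh / dt_h)
    return max(-max_discharge, min(max_charge, power))
```

Because the override happens outside the agent, the agent can propose any action during training without ever driving the modelled battery outside the permitted SoC window.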

C. INCORPORATING BATTERY DEGRADATION INTO THE REWARD FUNCTION
Reducing battery degradation is included in the multi-objective optimization problem by adding a penalty term to the reward function. A simple approach is to penalize situations in which the battery SoC falls below a minimum threshold or exceeds a maximum threshold (e.g. [6], [18], [30], [130], [145]). Other authors penalize SoC deviations from a reference value [31], [54], [56], [59], [70], [89], [114]. Zhou et al. [32] do this only when the SoC is outside an ideal operating range of 60-85%, and Zhou et al. [57] do this only when the SoC is under the reference value. Qi et al. [60] add a penalty term to the reward function when the SoC is outside the 20-80% range. Cao et al. [150] do the same for the 20-90% range, and Silva et al. [119] penalize when the SoC is less than 20%. Yang et al. [149] include a battery depreciation cost that is proportional to the charge/discharge power at each timestep. Chen et al. [13] include the minimization of the maximum battery discharge power as one of the optimization criteria. Cao et al. [128] determined that battery degradation is a linear function of charge/discharge cycles in the short term, and incorporate this penalty into the reward function. Shang et al. [127] consider the number of operating cycles and the SoC in individual cycles. Muriithi and Chowdhury [146] capture the degradation of a lithium-ion battery in terms of depth of discharge.
Roesch et al. [125] use a sophisticated battery degradation model to capture the impacts of irregular charging and discharging cycles on battery degradation. Yang et al. [17] penalize the number of switches between charging, idle and discharging modes. Cao and Xiong [50] do not explicitly consider degradation, but formulate the RL problem to avoid energy losses by avoiding high discharge currents, an approach which will have side benefits related to mitigating degradation.
A minority of works consider the impact of temperature on aging and degradation. Sui and Song [170] consider a battery pack and propose an intelligent controller to select between batteries to avoid overheating caused by excessively frequent charging and discharging of any single battery. Li et al. [19] go further and consider diverse 'high energy' and 'high power' battery packs [100]. The above-mentioned approaches include temperature as an aspect of the optimization problem by incorporating the temperature effects into the reward function. Xie et al. [162] use a thermal model and SoH degradation for the aging of a lithium-ion battery.
The majority of works use SoC in their reward formulations, but SoH (State of Health) is used in some papers. Xiong et al. [55] define SoH as the ratio between the present and rated battery capacity. Wu et al. [151] define SoH in terms of capacity fade. Mendil et al. [163] describe the battery state jointly by SoC and SoH.
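The two most common penalty styles discussed in this subsection, penalizing SoC outside a permitted range and penalizing deviation from a reference value, can be sketched as follows. The function names, the linear and quadratic penalty shapes, and the weights are assumptions made for illustration; individual works choose their own functional forms.

```python
def range_penalty(soc, low=0.2, high=0.8, weight=1.0):
    """Zero inside [low, high]; grows linearly with the size of the violation (assumed shape)."""
    violation = max(0.0, low - soc) + max(0.0, soc - high)
    return weight * violation


def reference_penalty(soc, soc_ref=0.6, weight=1.0):
    """Quadratic penalty on the deviation from a reference SoC (assumed shape)."""
    return weight * (soc - soc_ref) ** 2


def reward(energy_cost, soc, degradation_weight=10.0):
    """Multi-objective reward: negative energy cost minus a weighted degradation penalty."""
    return -energy_cost - range_penalty(soc, weight=degradation_weight)
```

The weight sets the trade-off between the primary objective and degradation, which is one reason why results reported with different reward formulations are hard to compare directly.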

VII. CATEGORIZATION BY APPLICATION
This section categorizes the reviewed works by application. Each paper is categorized under only one application, unless it strongly fits under several categories. Generic works that do not mention any application are not discussed in this section. The pie chart in Figure 7 categorizes the reviewed articles by application and gives an indication of which kinds of RL battery management applications are expected to receive a high number of publications in the future. However, in addition to the information in Figure 7, the following insights from the analysis of individual papers should be considered:
• The problem of managing the power split of one or more battery packs and other sources of power has become a well-established line of research, in which the RL problem formulation was quite similar across all the works in the 'EV & HEV driving' and 'PHEV driving' categories.
• Additional 'EV charging' applications are included in the 'Buildings' category.
• No works were found addressing stand-alone PV plants, so PV does not appear as a separate category. However, PVs are a central element in many of the works in the categories 'Buildings', 'Energy communities', 'Grid-connected microgrids', 'Isolated microgrids' and 'Multi-carrier systems'.
• A few works address wind farms. Wind power was also covered by works in some of the other categories, but to a much lesser extent than PV. This is unsurprising, since rooftop PV is becoming increasingly common, whereas windmills are usually not welcomed in the vicinity of buildings.
• A huge body of research was encountered related to IoT, but only a small minority addressed battery management. As the IoT community begins to address the practical issues related to deploying and maintaining IoT systems, research in this category may grow significantly.

A. VEHICLE
1) LAND
a: POWER SPLIT
i) PHEV
In contrast to EVs, the battery management of PHEVs has the additional consideration of switching between battery power and fuel. Most authors minimize fuel consumption [13], [32], [71], [107]. Other authors additionally penalize actions that wear down the battery [20], [31], [54], [56], [57], [60], [74] and the engine [21], [89]. RL formulations for optimizing the driving performance of PHEVs have power demand and SoC as the state variables. Some authors add velocity [71] and road slope [59]. The action involves controlling either the engine power (e.g. [20], [31], [56], [115]) or the battery power (e.g. [13], [71]). The problem is usually framed as a question of satisfying the power demand for moving the vehicle forward (heading demand) with the engine and the battery, but in the case of a tracked vehicle, the power demand consists of the heading demand as well as the steering demand [48], [56]. In contrast to the majority of the research, Wu et al. [117] and Tan et al. [115] perform their optimization based on cost, taking into account the price of electricity and diesel. The above-mentioned authors consider emissions only indirectly through minimizing gasoline consumption. However, Hofstetter et al. [58] add tailpipe NOx emissions as a constraint to the optimization problem. For a fuel cell-PHEV hybrid powertrain, Li et al. [175] propose a framework for achieving optimal battery sizing parameters with minimal operation cost and component degradation.
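The typical PHEV formulation described above (state: power demand and SoC; action: engine power; objective: fuel consumption) can be sketched as a single environment step. Everything specific here is an assumption for illustration: the function name `phev_step`, the constant fuel rate per kWh of engine output, the lossless battery, and the parameter values. Real works use engine maps and more detailed battery models.

```python
def phev_step(soc, demand_kw, engine_kw, dt_h=1.0 / 3600.0,
              capacity_kwh=10.0, fuel_g_per_kwh=250.0):
    """One step of a toy PHEV power-split environment.

    State: (demand_kw, soc). Action: engine_kw.
    The battery covers whatever part of the demand the engine does not.
    """
    engine_kw = max(0.0, engine_kw)          # the engine cannot absorb power
    battery_kw = demand_kw - engine_kw       # > 0 discharges, < 0 charges the battery
    new_soc = soc - battery_kw * dt_h / capacity_kwh
    new_soc = min(1.0, max(0.0, new_soc))    # keep the SoC physically meaningful
    fuel_g = fuel_g_per_kwh * engine_kw * dt_h
    reward = -fuel_g                         # minimizing fuel = maximizing reward
    return new_soc, reward
```

A battery wear penalty of the kind discussed in Section VI.C could be subtracted from `reward` to obtain the multi-objective variants mentioned above.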

ii) EV AND HEV
Most works on EVs and HEVs that do not have a gasoline engine involve selections between different types of battery packs [19], [100] or selections between the battery and other on-board power sources such as fuel cells [104] and ultracapacitors [70]. Cao & Xiong [50] aim to reduce energy losses by avoiding high discharge currents, and Xiong et al. [55] optimize the state of health of HEV batteries. Whereas most works assume a human driver, He et al. [77] and Wegener et al. [28] consider self-driving vehicles, with which it is possible to include energy-efficient acceleration in the optimization.

b: CHARGING
i) CAR
EV charging optimization targets include the following: reducing peak load [29], reducing charging costs for the EV [119], [132], [139], [153], [154], reducing both charging cost and waiting time [123], reducing charging cost based on knowledge of user behavior [156], minimizing the cost for the charging station with a PV and battery storage [155], minimizing the cost of several such stations [131], and aggregating several stations within a local market operated by an aggregator [135].

iii) RAIL
For rail applications, stationary batteries are a viable alternative to wireless charging. Regenerative braking by the train can be used to charge the batteries, which are then used to power the train when it drives. Zhu et al. [87] and Yang et al. [90] optimize such a system by minimizing power consumption from the grid and minimizing the losses from regenerative braking. Wireless charging is generally not investigated for rail transport, since established solutions for connecting to the grid are available. However, Ko [164] proposes a wireless charging infrastructure for trams.

iv) BUS
Gao et al. [129] optimize the charging/discharging schedules of electric buses in battery swapping stations with V2G capability. The objective is to minimize the station's electricity bill. Wu et al. [103] optimize the energy management of hybrid electric buses by penalizing for overtemperature and degradation of the battery. Lee et al. [6] design a wireless charging system and minimize the battery size and charging times.

c: SELF-DRIVING VEHICLES
Self-driving vehicles could be coordinated to achieve smoother traffic flow than what is possible with human drivers. One goal formulation is to reduce stop-and-go traffic waves or other abrupt velocity changes, since this reduces acceleration/deceleration cycles and thus battery degradation [28], [76]. Guo et al. [102] minimize fuel consumption and travel time while having safety overrides to avoid hazardous actions.

d: TRIP PLANNING
RL has been applied for the trip planning of human-driven EVs, HEVs or PHEVs [91], [94] and mobile robots [79]. This can include either route planning [79], [91] or optimizations made for a predetermined route [94].

2) AERIAL
Batteries in unmanned aerial vehicles (UAVs) are used for flying and data transmission. Flying applications include the minimization of flight path length [73], maximizing flight time [83] and using only locally generated wind power for charging the fleet [75]. The following data transmission applications were encountered. Wang et al. [105] propose a framework for the UAVs to independently select their transmit power in the presence of a jammer. Li et al. [106] propose an RL-based flight resource allocation framework to minimize the overall data packet loss to avoid additional energy consumption from retransmission.

3) MARINE
Battery management for short-distance electric ships involves optimization of decision making for battery usage and charging. The on-board energy storages include a battery and a fuel cell. When the ship is in port, on-shore power can be used to charge the battery, while when it is at sea only the fuel cell can be used to charge the battery. The authors minimize the total cost, which consists of hydrogen fuel cost, fuel cell degradation, battery degradation and on-shore electricity cost [84], [134].

B. GRID
1) MICROGRID
a: GRID CONNECTED
i) ELECTRIC
With respect to the RL applications reviewed in this article, grid-connected microgrids are very similar to the energy communities discussed in Section VII.C.2. The main difference is that a microgrid operates in a geographically constrained area, all energy resources must be physically connected to the microgrid, and power flow limits must be observed at the point of common coupling with the utility grid [127]. Nakabi and Toivanen [142] run a market for household loads within the microgrid, in which loads participate in microgrid-level demand response. Kolodziejczyk et al. [141] consider an aggregated load without specifying the type of load. Liu et al. [111] introduce a distributed framework to coordinate loads, distributed generation units and storage. Shuai et al. [159] perform a multi-objective optimization to minimize the operating cost of a microgrid with PV, wind and diesel, considering fuel prices, power exchange costs of the utility grid and curtailment costs of PV and wind. Nunna et al. [138] trade aggregated battery capacity on intra-microgrid markets as well as inter-microgrid markets. Lu et al. [112] minimize grid peak power consumption. Wang et al. [116] envision new auction-based markets in which microgrids can participate. Guo et al. [140] propose a new market to balance cost minimization objectives of the microgrids and the utility. Hua et al. [114] maximize self-sufficiency, minimize cost of non-renewable generation and minimize battery degradation. Qiu et al. [49] exploit the operational difference between batteries with different chemistries to achieve better efficiency. Duan et al. [165] optimize battery lifetime.

ii) MULTI-CARRIER
Multi-carrier systems involve the use of electricity alongside other forms of energy and, in some cases, freshwater production. Variants of a multi-energy microgrid involve electricity, heat and freshwater production [173], electricity and heat [122] and electricity, gas and heat [148], [149].
Nyong-Bassey et al. [72] designed an isolated microgrid with a battery, fuel cell and diesel generator, so that an electrolyzer can use excess PV to replenish the fuel cell, aiming to minimize the need for the diesel generator.

b: ISOLATED
In the case of isolated microgrids, purchases from an external electricity market are either not possible [95], [168] or a last resort to complement local fossil-fuel-based emergency generation [96]. Phan and Lai [169] and Zhang et al. [96] note that the trend towards a decentralized electric power system should in some seashore regions be complemented with a move to decentralized freshwater production, so a desalination plant is added to the microgrid. Nie et al. [68] curtail loads to keep the microgrid operational for as long as possible.

2) GRID SUPPORT
Applications for grid support can be categorized into market-driven applications and other applications in which the financial incentive has not been specified. Market-driven applications include energy arbitrage [128] and frequency reserves participation [86], [152], [161]. Other applications include PV generation peak shaving [82], loss minimization in distribution networks [150], mitigating voltage deviations in low voltage distribution networks with high PV penetration [157] and frequency instability reduction not related to frequency reserve market participation [108].

C. BUILDING
1) SINGLE BUILDING
Buildings are a common context for RL agents managing battery storages in coordination with other energy resources.
The main difference between the works is the type of other energy resources available and their flexibility in terms of possibilities for rescheduling or curtailment. Only PV is considered in [23], [24], [69], [109], [133], [136], [144], [146]. Liu et al. [137] and Kim and Lim [15] consider EV chargers along with PV. Lee and Choi [6] and Lee and Choi [30] include reschedulable appliances and an EV charger; Alfaverh et al. [121] only consider appliances. The optimization objectives for PV-related works can be categorized as maximizing the PV generation through Maximum Power Point Tracking [51], maximizing PV self-consumption [23], [69] or minimizing electricity bills. The last of these requires assuming a specific type of electricity contract, such as real-time pricing [144], [146], day-ahead markets [133] or Time-of-Use pricing [15].
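As a worked illustration of the bill-minimization objective under a Time-of-Use contract, the following sketch computes the bill of a building with PV and a battery for a given battery schedule. The tariff values, the peak hours, the function names and the assumption that exported energy earns nothing are all hypothetical and chosen only to keep the example small.

```python
def tou_price(hour, peak_hours=range(8, 20), peak=0.30, off_peak=0.10):
    """Price per kWh for the given hour of day (assumed two-level tariff)."""
    return peak if hour % 24 in peak_hours else off_peak


def electricity_bill(load_kwh, pv_kwh, battery_kwh, start_hour=0):
    """Bill for hourly series of equal length.

    battery_kwh > 0 means the battery charges (extra demand),
    battery_kwh < 0 means it discharges (covers part of the load).
    """
    bill = 0.0
    for t, (load, pv, batt) in enumerate(zip(load_kwh, pv_kwh, battery_kwh)):
        net_import = load - pv + batt
        # Assumption: exports to the grid are not remunerated.
        bill += max(0.0, net_import) * tou_price(start_hour + t)
    return bill
```

With a 1 kWh load in hour 7 (off-peak) and hour 8 (peak) and no PV, the bill without a battery is 0.40, whereas charging 1 kWh in hour 7 and discharging it in hour 8 gives 0.20. The negative of such a bill is a natural reward signal for the RL agent.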

2) ENERGY COMMUNITY
Communities of buildings offer further optimization opportunities with shared batteries. An aggregator can trade the capacity of the batteries and other flexible energy resources on utility markets [22], [130], [145]. Alternatively, a local market can be established to avoid buying and selling from the grid [14], [118], [143], [158].

[...] take battery management as one criterion in a multi-objective optimization that aims to reduce the latencies and dropped packets for the IoT computation tasks. Banerjee et al. [80] selectively activate IoT nodes in an outdoor network to minimize the increased energy requirement for data transmission when the node is exposed to high outdoor temperature and direct sunlight. Teng et al. [171] reduce both the battery investment cost and data transmission delay with an intelligent power transmission policy. Conti et al. [52] allow IoT nodes to offload computation to a fog-computing node.

2) ENERGY HARVESTING
If minimizing energy consumption is not sufficient or practical for prolonging the battery lifetime, energy harvesting approaches are used to recharge the battery. Sangoleye et al. [99] identify the best base station to connect to for energy harvesting, whereas Chen et al. [101] migrate computation tasks to nodes that are best positioned for harvesting. Elmagid et al. [67] schedule packet transmissions in a way that is optimal for energy harvesting. Chu et al. [61] use battery forecasts to optimize access of IoT nodes to energy harvesting. Chu et al. [63] optimize the access and power control policies. Li et al. [62] perform simultaneous energy harvesting and data transfer by finding a transmission scheduling strategy to minimize data loss. For maximizing the throughput of large multiple-access channel energy harvesting networks, Sharma et al. [64] propose an optimal power control policy. Temesgene et al. [88] perform an optimization at virtual small cells that jointly minimizes harvested energy and the volume of dropped traffic. Cao et al. [120] minimize the distance travelled, and thus the energy consumption, of a battery-powered mobile wireless sensor charger. Faraci et al. [75] go further, using a fleet of drones as the mobile wireless charger, and using only locally generated wind power for charging the drones. In a V2I (vehicle-to-infrastructure) roadside unit, a battery is periodically recharged and RL can be used to optimize the quality of service of the communication link without draining the battery before the next recharging period [53]. For energy harvesting in an underwater relay network, Wang et al. [66] propose an optimal online power allocation policy to ensure the quality of data transmission.

E. WIND FARM AND TIDAL
Power production from wind and tidal sources needs to be traded ahead of time, based on forecasts. RL applications to batteries in this context include management of uncertain generation forecasts [85], [167], management of uncertain generation and market forecasts [17], smoothing fluctuations in generation [65], [97] and optimizing the revenue of a wind farm with other generation resources on site [140].

F. FACTORY
Batteries are emerging as an element of factory energy systems, either for rescheduling production tasks to lower electricity price periods [125] or to ensure the continuity of production during outages [113].

VIII. DISCUSSION
Comparisons between original research works and attempts to synthesize them are hindered by the fact that each author has a unique formulation of the RL problem, resulting in unique environments, state and action spaces and reward formulations. This field could greatly benefit from the availability of benchmark environments for the different applications of batteries identified in Section VII. The OpenAI Gym is an open-source project for creating such environments that implement a standard interface for the RL agent to connect to [179]. A range of benchmark environments implementing the OpenAI Gym interface are available for video games [180]. Similar benchmarks are not available for the energy domain, although a few works in the energy domain implement the OpenAI Gym interface for the following applications: maximum power point tracking of PV installations [181], building energy management [25]-[27], microgrid energy management [142], and demand response for building cooling [182]. Building on such works, the emergence of a range of open-source benchmark environments for diverse battery applications could greatly speed up the research on RL applications for battery management and improve the possibilities to comparatively assess similar works and identify the superior RL designs. The closest work in this direction that was found is by Henry & Ernst [183], who published precisely such an environment for electricity distribution systems, but it does not involve batteries. Specific areas of research that are expected to see significant numbers of publications in the future have been discussed in conjunction with Figure 7. The following unsolved challenges have been identified for further research:
• A number of solutions exist for the problem of managing a battery in conjunction with diverse local energy resources in a building or microgrid. Approaches are split into two bodies of research: optimizing energy efficiency goals and minimizing electricity bills. As any deployments will require financial investments, proponents of the former approach should consider adjusting their research targets to obtain benefits that can serve as the basis for a return-on-investment calculation. Further research challenge: a cost-benefit perspective should be included in RL problem formulations motivated by energy efficiency.
• The battery is modelled as part of the environment used in the RL agent's training process. Various levels of abstraction have been used in the modelling, and only a minority of works try to capture the characteristics of a specific type of battery, such as a lithium-ion or lead-acid battery. The chosen level of abstraction can cause a significant difference between the performance of the RL agent that has been reported in a scientific publication, and the performance of the same agent when it is deployed to manage a physical battery. Further research challenge: the trained agents should be deployed to physical batteries and the performance should be compared to the performance achieved against the battery model.
• Long-term battery degradation is captured in a minority of works, which use diverse ways to define the degradation and to incorporate it into a multi-objective optimization problem. This issue, in combination with the varying levels of abstraction in modelling the battery, prevents direct comparisons between the performance reported in different works. Thus, researchers will have difficulties in identifying the most promising lines of research. Developers and implementers cannot be expected to assess how these issues will impact the performance of an RL agent, should it be deployed. Further research challenge: a benchmark battery model is needed to assess the efficacy of RL solutions aiming to mitigate battery degradation.
• Innovative battery management solutions can cause inconvenience or discomfort to human users of the system that contains the battery. The identification and resolution of these issues remain largely unsolved. Some authors ignore these issues, some define constraints on user comfort, and some include comfort as one aspect of a multi-objective optimization problem. As these issues receive more attention from researchers, it is possible that original and unique formulations of user comfort will further complicate the comparisons between the performance of different research works. Further research challenge: the end user of the system that contains the battery needs to be identified and standard approaches for quantifying user comfort are needed; for example, if the battery is used in conjunction with smart building loads, established standard metrics for indoor air quality and thermal comfort should be identified and adapted to the RL problem formulation.
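To make the idea of a benchmark environment with a standard interface more concrete, the following minimal sketch shows what a battery environment following the reset/step pattern of the Gym interface might look like. It deliberately does not import or subclass any library: the class name `ToyBatteryEnv`, the observation layout, the one-hour timestep and all parameter values are assumptions for illustration, not an existing benchmark.

```python
class ToyBatteryEnv:
    """Gym-style (reset/step) battery arbitrage environment; a sketch, not a Gym subclass."""

    def __init__(self, prices, capacity_kwh=10.0, max_power_kw=5.0, efficiency=0.95):
        self.prices = list(prices)        # assumed price per kWh for each one-hour step
        self.capacity_kwh = capacity_kwh
        self.max_power_kw = max_power_kw
        self.efficiency = efficiency      # applied to both charging and discharging
        self.t = 0
        self.soc = 0.5

    def _observation(self):
        price = self.prices[self.t] if self.t < len(self.prices) else 0.0
        return (self.soc, price)

    def reset(self):
        self.t = 0
        self.soc = 0.5
        return self._observation()

    def step(self, action_kw):
        # Positive action charges from the grid, negative action sells to the grid.
        power = max(-self.max_power_kw, min(self.max_power_kw, action_kw))
        if power >= 0:
            stored = min(power * self.efficiency, (1.0 - self.soc) * self.capacity_kwh)
            grid_kwh = stored / self.efficiency       # energy bought
            self.soc += stored / self.capacity_kwh
        else:
            drawn = min(-power / self.efficiency, self.soc * self.capacity_kwh)
            grid_kwh = -drawn * self.efficiency       # energy sold (negative)
            self.soc -= drawn / self.capacity_kwh
        reward = -grid_kwh * self.prices[self.t]      # cost is negative reward
        self.t += 1
        done = self.t >= len(self.prices)
        return self._observation(), reward, done, {}
```

Any agent that only calls `reset()` and `step(action)` could then be evaluated against the same environment, which is the property that would make the reported results of different works comparable.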

IX. CONCLUSION
The objective of this manuscript has been to provide an application-oriented review of RL applications to battery systems. In particular, this review aims to introduce energy domain experts to RL and to describe the diverse applications that have been recently published involving batteries. A fourfold approach has been undertaken for this purpose. Firstly, the motivations of the RL research have been analyzed either from an energy-efficiency or financial perspective. Secondly, any efforts to identify and mitigate impacts on end users were analyzed. Thirdly, approaches for modelling charging and discharging losses as well as battery degradation were analyzed. Fourthly, the reviewed literature was categorized according to the application.
One key finding is that the batteries are modelled at a high level of abstraction. The great majority of works do not specify the battery chemistry. The RL solutions are trained and validated against these simplified battery models, and there is a lack of further validation against high fidelity models or physical batteries. Further multidisciplinary research involving battery experts is needed. This article intends to provide such experts with necessary background knowledge and an understanding of the state-of-the-art.
Our literature search was general and thus covered all lifecycle phases of the battery. The great majority of articles addressed real-time control or short-term optimizations. Thus, the focus of the research is on the operation phase of the battery lifecycle. A few works addressed the planning phase, in order to optimize the battery investment cost. None of the reviewed works addressed second-life battery applications, decommissioning or recycling.

FIGURE 2. General framework of an RL agent managing a battery storage.

FIGURE 3. Papers that were selected manually for the literature review.

FIGURE 4. Reviewed articles primarily aiming at optimizing energy-efficiency.

FIGURE 5. Reviewed articles primarily aiming at optimizing operational costs.

FIGURE 6. Reviewed articles primarily aiming at optimizing investment cost.

TABLE 2. Papers primarily aiming at optimizing energy-efficiency.

TABLE 3. Papers primarily aiming at optimizing operational costs.

TABLE 4. Papers primarily aiming at optimizing investment cost.