A Data-Driven Solution for Energy Management Strategy of Hybrid Electric Vehicles Based on Uncertainty-Aware Model-Based Offline Reinforcement Learning

Energy management strategy (EMS) is the key technology for improving the fuel efficiency of hybrid electric vehicles (HEVs). In recent years, the development of artificial intelligence has enabled tremendous advances in training and deploying deep neural network-based EMSs with reinforcement learning (RL). However, in contrast to deep learning fields such as computer vision and natural language processing, which mainly rely on large-scale offline datasets, most RL-based policies must be trained online by trial and error, with nearly arbitrary initial performance. Such a paradigm is inefficient and unsafe for industrial automation and can only tackle EMS problems in simulation. Considering that large historical interactive datasets are readily available in the EMS domain, if an RL algorithm could extract a policy purely offline from a previously collected dataset and improve upon the data-logging policy, the issues that currently prevent the widespread use of RL methods, including sample inefficiency, unsafe exploration, and the simulation-to-real gap, could be mitigated to a great extent. To this end, this article presents a feasible algorithmic framework for model-based offline RL. Unlike vanilla RL approaches, which take no measures against distributional shift, a data-driven dynamic model is built before the policy is trained with RL. Two techniques, namely conservative MDP and state regularization, are then incorporated and proved effective against model overexploitation. By incorporating the guidance of uncertainty awareness, a near-optimal policy can be obtained using only the dataset from a suboptimal controller.


I. INTRODUCTION
Considering the depletion of fossil resources and climate change, reducing emissions and improving energy efficiency have become focal points of public attention [1], [2]. Limited by the technical problems faced by power batteries and fuel cells, the development of pure electric vehicles and fuel cell vehicles is in a dilemma [3]. Hybrid electric vehicles (HEVs), which not only meet the requirements of low fuel consumption and low emissions but also ensure sufficient cruising range, are a smooth transition from traditional fuel vehicles to pure electric vehicles [4]. Since an HEV is equipped with two or more power sources, developing an intelligent energy management strategy (EMS) is crucial to improving the dynamic performance of the vehicle and achieving the targets of energy conservation and CO2 emission reduction.
The EMSs of HEVs can be categorized into rule-based, optimization-based, and learning-based methods. Rule-based EMSs include deterministic rules and fuzzy logic rules, which are usually set by a priori knowledge and generally have strong reliability and robustness [5]. However, the performance of rule-based EMSs is suboptimal, and they often depend on preset parameters, which also means they lack self-adaptability in practical application [6]. As an alternative, optimization-based EMSs have become a research hotspot, especially in academia. They can be roughly split into two directions: global optimization and real-time optimization. Global optimization methods such as dynamic programming (DP) can only be used when the driving conditions are known in advance and require excessive computing resources [7]. Therefore, DP is often used as a reference against which other strategies are compared. Real-time optimization-based EMSs, such as the equivalent fuel consumption minimization strategy (ECMS) [8] and model predictive control [9], [10], do not rely on artificially set rules and can approximate the optimal solution under certain circumstances. Nevertheless, due to complex traffic conditions, the uncertainty of driver operation, and the computational limitations of real controllers, such EMSs still yield suboptimal behavior, though they have recently been adopted by an increasing number of mass-production HEVs.

With the vigorous development of artificial intelligence (AI), learning-based control strategies are considered a more promising direction in the field of EMS. In particular, reinforcement learning (RL) based EMSs have developed continuously, especially after the great success of AlphaGo, which is considered a major milestone in AI research [11]. RL is an area of machine learning concerned with how an intelligent agent is trained to make a sequence of decisions by trial and error [12]. Different from other machine learning paradigms, including supervised and unsupervised learning, the goal of RL is to learn the optimal behavior in an environment by observing how the environment responds to a given action. In recent years, a growing number of tabular and approximate RL-based solutions have been applied to EMS tasks. They can be broadly classified into value-based [13], [14], [15], policy-based [16], and mixed value-and-policy methods [17], [18]. All the approaches above are typically viewed as online RL, which requires the controller to interact with the environment while learning the policy. Such a paradigm is often inefficient and potentially unsafe and thus may only be applied if a high-precision simulation model is available. As shown in Fig. 1, depending on whether a simulation model is provided, current online RL-based algorithms face the following issues when applied to industrial EMS problems.

A. Online RL Algorithm
Most online RL algorithms successfully applied to virtual games in the AI domain cannot be directly applied to EMS tasks in the real physical world, because online physical data collection has a nontrivial cost. Specifically, as shown in Fig. 1(a), learning a deep RL policy, with the policies and values represented by function approximators (usually deep neural networks), in an online manner often necessitates millions to billions of costly environment interactions [19]. Due to this barrier of sample inefficiency, it remains notably more challenging to directly apply online RL algorithms to safety-critical EMS applications [20]. To address these problems, most RL-based EMS research focuses on applying online RL algorithms virtually, i.e., building a high-fidelity physics-based simulation model before an online RL algorithm is used for training.

B. Simulation-Based RL Solution
The success of online RL depends on two aspects: 1) a learning framework, e.g., deep Q-learning (DQN) [21], deep deterministic policy gradient (DDPG) [22], proximal policy optimization [23], or asynchronous advantage actor-critic (AC) [24], which use different techniques to effectively and efficiently approximate values and/or policies during policy learning, and 2) a high-fidelity simulator, e.g., OpenAI Gym [25], which serves as the environment. Unlike the relatively simple environments (such as board games or Atari) in which AI experts propose and compare RL algorithms, the high-precision simulation model for EMS problems is difficult to build, though most existing RL-based EMS algorithms assume access to one. Due to the simulation-to-real gap, policies trained in simulation often do not transfer to the real world [26]. To bridge this gap, it is often desirable to fine-tune the policy trained on the simulation model via further online interactions. However, it is observed, concurrently with this research, that there exists a distributional shift between the virtual online and real online state-action distributions. This leads to severe bootstrap error during fine-tuning, which can destroy the initial policy obtained via the simulation-based RL solution [27].
Considering the issues of current RL algorithms, including sample inefficiency, potentially unsafe deployment, and the simulation-to-real gap, a data-driven model-based offline RL algorithm specifically designed for EMS problems is proposed. Compared with the related literature, the contributions of this research are as follows.
1) A feasible data-driven model-based offline RL algorithm is proposed that can alleviate the problems of sample inefficiency and unsafe online exploration faced by online RL algorithms, without the need for an accurate physics-based simulation model. It is also found that the proposed approach can solve, to a large extent, the distributional shift problem inherent in off-policy RL algorithms.

2) Unlike vanilla data-driven model-based offline RL approaches, which take no measures against model inaccuracy, a novel uncertainty-aware model-based offline RL (UMORL) algorithm is proposed. By incorporating the guidance of uncertainty awareness, i.e., conservative MDP and state regularization, a near-optimal policy can be obtained using only the dataset from a suboptimal logging-policy controller.

The rest of this article is organized as follows. Section II introduces the problem formulation of the given EMS. The unconstrained and uncertainty-aware data-driven model-based offline RL frameworks for the EMS problem are detailed in Section III. Section IV discusses the hardware-in-the-loop (HIL) results of the proposed framework in contrast to other algorithms. Finally, Section V concludes this article.

II. PROBLEM FORMULATION
As shown in Fig. 2, a parallel HEV is selected as the research object. It includes a 1.5 L four-cylinder turbocharged gasoline engine, an electric motor placed between the engine and the gearbox (P2 reference application), a lithium-ion battery pack, a six-speed automatic transmission, and two clutches. The parameters of the vehicle are listed in Table I. As the main objective of this study is to propose a new conceptual RL-based EMS training framework that can utilize fixed offline data without additional online data collection, for the repeatability of the test, a general powertrain example from the Simulink Powertrain Blockset in MATLAB 2019b is used to generate offline data and to verify the feasibility of the proposed algorithms. The objective of a charge-sustaining EMS is to find the optimal distribution of the power demand between the internal combustion engine (ICE) and the battery over a trip of length t, which can be defined as follows:

$$J = \int_{0}^{t} \dot{m}_f(\tau)\, d\tau \qquad (1)$$

where $\dot{m}_f$ is the fuel consumption rate and t is the time duration over which the fuel consumption is calculated. The minimization of J is also subject to constraints related to physical operation limitations and the requirement to sustain the battery SOC over a given trip. This makes the EMS task a constrained, finite-time optimal control problem (OCP).
To solve this OCP, several online RL-based algorithms can be applied. However, depending on whether a physics-based simulation model is used, the challenges of sample inefficiency, unsafe exploration, and the simulation-to-real gap need to be tackled before these RL methods can be widely used in industry. Roughly speaking, as shown in Fig. 3(a) and (b), online RL methods can be divided into on-policy and off-policy RL. On-policy RL learns and updates policies from samples generated by its own policy.
Off-policy RL, on the other hand, estimates and improves a policy that may differ from the one used to select actions. Although a high-fidelity physics-based simulation model is difficult to build, there often exist many suboptimal EMS controllers capable of generating a large amount of interactive data containing informative behavior. If a highly rewarding policy can be learned offline from this previously collected dataset in a data-driven manner, the application scenarios for RL-based algorithms will be greatly expanded. To this end, a data-driven model-based offline RL algorithm that can effectively leverage the previously collected data is proposed. Fig. 3(c) shows the conceptual pipeline of the proposed offline RL. In contrast to its on-policy and off-policy counterparts, a fitted dynamic model, instead of a physics-based simulation model, is constructed from the offline dataset before policy learning. The detailed modeling process will be given in Section III, but first a brief explanation is given of why off-policy RL algorithms (which appear to be a more direct data-driven solution) cannot be applied to offline settings.
In principle, any off-policy RL algorithm (such as DQN [21], DDPG [22], or TD3 [28]) can utilize the offline dataset, assuming the data were generated by interaction with the environment. However, due to the distributional shift between the behavior and the target policy, the policies learned by such a paradigm are usually poor. More specifically, as defined in (2) and (3), the loss function of a (deep neural network-based) Q function minimizes the distance between the predicted Q value and the target Q value through gradient descent. For most online RL algorithms (such as DQN), the target policy is generated by maximizing the Q value, but the behavior policy for the offline data could be any policy. This results in a large divergence between the estimated Q value and the actual Q value, yielding suboptimal or even poor final policy behavior. In Section IV, the results of fully off-policy DQN, DDPG, and TD3 are also provided for reference

$$\min_{Q}\ \mathbb{E}_{s \sim D,\, a \sim \pi_{\beta}}\left[\left(Q(s,a) - y(s,a)\right)^{2}\right] \qquad (2)$$

$$y(s,a) = r + \gamma\, \mathbb{E}_{a' \sim \pi_{target}}\left[Q(s', a')\right] \qquad (3)$$

where s, a, r, and s' denote the state, action, reward, and next state after taking action a, respectively; $\pi_{\beta}$ is the behavior policy that selected action a in state s; $\pi_{target}$ is the target policy; Q(s, a) is the estimated state-action value; y(s, a) is the target state-action value; and $\mathbb{E}$ is the expectation operator.
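To make the source of the bootstrap error concrete, a minimal PyTorch-style sketch of the loss in (2) and (3) follows; the network and buffer names are hypothetical, and this is an illustration rather than the exact implementation used in this article. The key point is that the target action is drawn from the target policy, not from the behavior policy that produced the offline data, so the Q function is evaluated at state-action pairs it may never have been trained on.

```python
import torch
import torch.nn.functional as F

def q_loss(q_net, q_target_net, policy_target, batch, gamma=0.99):
    """TD loss of (2)-(3) computed on a mini-batch of offline transitions."""
    s, a, r, s_next = batch  # (s, a, r, s') sampled from the fixed dataset D

    with torch.no_grad():
        # The target action comes from pi_target, NOT from the behavior
        # policy pi_beta that generated the data. Offline, this action can
        # fall far outside the dataset's support, so Q(s', a') is queried
        # where it was never trained: the source of the bootstrap error.
        a_next = policy_target(s_next)
        y = r + gamma * q_target_net(s_next, a_next)  # target value, (3)

    q = q_net(s, a)              # predicted state-action value
    return F.mse_loss(q, y)      # squared TD error, (2)
```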

III. UNCERTAINTY-AWARE MODEL-BASED OFFLINE RL
In this section, an RL-based EMS framework that incorporates heuristic domain knowledge is first detailed. This framework facilitates the RL learning process while guaranteeing safety during both learning and execution. Then, the pipeline of the proposed data-driven model-based offline RL is described. Finally, after explicitly considering the inaccuracy of the learned model, a UMORL algorithm is proposed that attempts to conservatively extract an HEV EMS with the maximum possible utility out of the available datasets.

A. Safe RL Mechanism Incorporating Domain Knowledge
In this article, a control framework, namely adaptive ECMS (A-ECMS), that is widely adopted in the EMSs of mass-production HEVs is utilized [29]. As shown in Fig. 4, the key of the ECMS method is the introduction of the equivalent factor (EF), which allocates a certain cost to the use of electric energy and converts it into equivalent fuel consumption (EFC), so that the use of electric energy is equivalent to saving a certain amount of fuel. As shown in (5), by using ECMS, the global problem of minimizing fuel consumption in a charge-sustaining HEV is reduced to the local problem of minimizing EFC. The value of the EF has a large influence on EMS performance and should be reasonably estimated. After the EF is determined, several candidate values of the control variable, i.e., the battery power (discretized from the minimum to the maximum limit), are evaluated, and the value that gives the lowest EFC is selected to distribute the energy flow. It should be noted that the EF is the control variable for A-ECMS, whereas the battery power is the control variable for classic ECMS that acts on the EMS and is strongly dependent on the selected EF value

$$\dot{m}_{f,eqv}(t) = \dot{m}_f(t) + p(\mathrm{SOC})\,\frac{\mathrm{EF} \cdot P_{mot}(t)}{Q_{lhv}} \qquad (5)$$

where $\dot{m}_{f,eqv}(t)$, $P_{mot}(t)$, $\dot{m}_f(t)$, $p(\mathrm{SOC})$, and $Q_{lhv}$ are the EFC, demand motor power, real fuel consumption, multiplicative SOC penalty function, and fuel lower heating value, respectively.
This control framework is adopted for two reasons: 1) the offline data in this research come from an existing PID-based A-ECMS (referred to as the "logging policy" for simplicity), which has been tuned well but is not globally optimal.
2) A safe RL mechanism that guarantees safe execution can be easily implanted. To be clearer, in order to keep the battery SOC within an allowable range, a shield function should be incorporated under the guidance of a risk metric. Compared with other EMS control frameworks, in which this risk metric may be difficult to define, in ECMS a multiplicative penalty function can be used to add a safety shield before an action is taken. Equation (6) shows a general form of this penalty function. As indicated in Fig. 4, this penalty function transforms the constrained optimization problem into an unconstrained one by multiplying a measure of constraint violation

$$p(\mathrm{SOC}) = 1 - \left(\frac{\mathrm{SOC}(t) - \mathrm{SOC}_{target}}{(\mathrm{SOC}_{max} - \mathrm{SOC}_{min})/2}\right)^{a} \qquad (6)$$

where $\mathrm{SOC}_{max}$ and $\mathrm{SOC}_{min}$ are the maximum and minimum SOC, respectively, and a is a hyperparameter that defines the penalty function curvature. The existing PID-based A-ECMS corrects the EF in real time through the proportional (P), integral (I), and derivative (D) units according to the deviation between the current SOC value and the target SOC value.
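As an illustration of the decision rule in (5) and (6), a short Python sketch follows. The SOC bounds, target SOC, candidate grid, and the engine_fuel_rate map are placeholders rather than the calibration used in this article; the sketch only shows how the multiplicative penalty reshapes the equivalent cost and how the candidate with the lowest EFC is selected.

```python
import numpy as np

def soc_penalty(soc, soc_tgt=0.6, soc_min=0.4, soc_max=0.8, a=3):
    """Multiplicative SOC penalty of (6): ~1 near the target SOC,
    deviating sharply as SOC approaches the allowed limits."""
    return 1.0 - ((soc - soc_tgt) / ((soc_max - soc_min) / 2.0)) ** a

def ecms_step(p_demand, soc, ef, engine_fuel_rate, q_lhv=42.5e6, n_cand=21):
    """Pick the engine/motor power split with the lowest EFC, as in (5)."""
    p_mot_candidates = np.linspace(-30e3, 30e3, n_cand)  # motor power grid [W]
    best_efc, best_p_mot = np.inf, 0.0
    for p_mot in p_mot_candidates:
        p_eng = p_demand - p_mot           # engine covers the remainder
        m_dot_f = engine_fuel_rate(p_eng)  # real fuel rate [kg/s], assumed map
        # equivalent fuel consumption: fuel + penalized electric equivalent
        efc = m_dot_f + soc_penalty(soc) * ef * p_mot / q_lhv
        if efc < best_efc:
            best_efc, best_p_mot = efc, p_mot
    return best_p_mot, best_efc
```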
In order to implement RL algorithms, the given EMS problem can be formulated as an MDP, which consists of a set of state variables and a transition probability distribution $p^{a_t}_{s_t, s_{t+1}}$, where $p^{a_t}_{s_t, s_{t+1}}$ represents the probability of making a transition from state $s_t$ to state $s_{t+1}$ using action $a_t$.

B. Data-Driven Model-Based Offline RL Framework
The principal difference between online RL approaches and the proposed data-driven model-based offline RL is that a data-driven dynamic model is trained on the offline dataset before policy training. Since the dynamic model is developed solely from the offline data, a physics-based high-precision simulation model is not required, while a safe trial-and-error planning process is enabled. This could potentially unlock the widespread use of RL algorithms in industrial automation, where physics-based models are difficult, or in some cases impossible, to build.
In this research, since the state, i.e., the battery SOC, changes little from one time step to the next, which may either drive the neural network to learn an identity map or require more training data to capture the small variations needed to accurately predict the next state, the technique of learning the state difference $s_{t+1} - s_t$ (rather than the next state $s_{t+1}$) from the current state $s_t$ and action $a_t$ is adopted. To be specific, as shown in (7), the parameters of a multilayer perceptron $\hat{\Delta}_{\phi}$ are first optimized via maximum likelihood estimation with mini-batch stochastic optimization using Adam [30]. The dynamic model is then parameterized as $M_{\phi}(s_t, a_t)$, as indicated in (8). This also encourages local continuity, which is beneficial for the subsequent policy learning phase

$$\phi^{*} = \arg\min_{\phi}\ \frac{1}{|D|}\sum_{(s_t, a_t, s_{t+1}) \in D}\left\|\hat{\Delta}_{\phi}\!\left(\frac{s_t-\mu_s}{\sigma_s}, \frac{a_t-\mu_a}{\sigma_a}\right) - \Delta\right\|^{2} \qquad (7)$$

$$M_{\phi}(s_t, a_t) = s_t + \hat{\Delta}_{\phi}\!\left(\frac{s_t-\mu_s}{\sigma_s}, \frac{a_t-\mu_a}{\sigma_a}\right) \qquad (8)$$
where $\Delta = s_{t+1} - s_t$ and $(s_t, a_t, s_{t+1}) \in D$. $\frac{s_t - \mu_s}{\sigma_s}$ and $\frac{a_t - \mu_a}{\sigma_a}$ are the normalized state and action, respectively, and $\mu_s$, $\sigma_s$, $\mu_a$, $\sigma_a$ are the means and standard deviations of the states and actions in dataset D.
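A minimal PyTorch sketch of (7) and (8) is given below, assuming a Gaussian likelihood with fixed variance so that maximum likelihood reduces to a squared-error fit; the layer sizes and names are illustrative, not the configuration used in this article.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """MLP predicting the state difference s_{t+1} - s_t, as in (7)-(8)."""
    def __init__(self, s_dim, a_dim, mu_s, sigma_s, mu_a, sigma_a, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, s_dim),
        )
        # dataset statistics used to normalize the inputs
        self.mu_s, self.sigma_s = mu_s, sigma_s
        self.mu_a, self.sigma_a = mu_a, sigma_a

    def forward(self, s, a):
        s_n = (s - self.mu_s) / self.sigma_s   # normalized state
        a_n = (a - self.mu_a) / self.sigma_a   # normalized action
        delta = self.net(torch.cat([s_n, a_n], dim=-1))
        return s + delta                       # M_phi(s_t, a_t) of (8)

def train_step(model, optimizer, s, a, s_next):
    """One Adam mini-batch step on the squared-error form of (7)."""
    pred = model(s, a)
    loss = ((pred - s_next) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```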
After the model is built, a wide variety of planning algorithms can be used for policy training. In this work, DDPG, one of the widely studied AC algorithms and one capable of operating over continuous action and state spaces, is chosen as the policy learning framework.
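For completeness, a condensed DDPG update step is sketched below. It is a generic textbook form with assumed hyperparameters, not the exact tuning of this article; the transitions are taken from rollouts of the learned model $M_{\phi}$ rather than from a physical environment.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_tgt, critic_tgt,
                opt_actor, opt_critic, batch, gamma=0.99, tau=0.005):
    """One actor-critic update on transitions rolled out from M_phi."""
    s, a, r, s_next = batch

    # critic: regress onto the TD target built from the target networks
    with torch.no_grad():
        y = r + gamma * critic_tgt(s_next, actor_tgt(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # actor: deterministic policy gradient, ascend the critic's value
    actor_loss = -critic(s, actor(s)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # slowly track the online networks with Polyak averaging
    with torch.no_grad():
        for net, tgt in ((critic, critic_tgt), (actor, actor_tgt)):
            for p, p_tgt in zip(net.parameters(), tgt.parameters()):
                p_tgt.mul_(1 - tau).add_(tau * p)
```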

C. Uncertainty-Aware Model-Based RL Algorithm
Direct use of data-driven model-based offline RL can be challenging. This is because the learned data-driven model is unlikely to be globally accurate, as the offline dataset may not span the entire state-action space. Thus, planning with such a learned model without any safeguards against model inaccuracy may result in model over-exploitation, yielding suboptimal behavior.
To overcome this, as shown in Fig. 5, a conservative data-driven model-based offline RL framework that explicitly takes model inaccuracy into consideration is proposed. First, a conservative MDP dynamics is constructed, which pessimistically describes the reward of the learned model. To be specific, the MDP dynamics are modeled with an ensemble of neural networks $\{M_{\phi_1}(s_t, a_t), M_{\phi_2}(s_t, a_t), \ldots\}$. Each model is initialized with different weights and optimized with different mini-batch sequences. The epistemic uncertainty is measured and partitioned by comparing ensemble discrepancies.
If the discrepancy is within a threshold (a tunable hyperparameter), the state-action pair is flagged as "certain," and all certain regions receive the reward as originally defined. However, if a large discrepancy is detected, the pair is flagged as "uncertain," and a large negative reward replaces the original reward. In this way, policies that visit "uncertain" regions of the state-action space are heavily punished, which provides the first layer of the safeguard against model over-exploitation.
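A sketch of this first safeguard follows. The discrepancy metric (maximum pairwise distance between the ensemble members' next-state predictions) and the threshold and penalty values are assumptions for illustration; the exact metric and values are tunable design choices.

```python
import torch

def conservative_reward(models, s, a, r, disc_threshold=0.05, penalty=-100.0):
    """First safeguard: penalize state-action pairs where the ensemble
    of learned dynamics models disagrees (high epistemic uncertainty)."""
    with torch.no_grad():
        # shape: (n_models, batch, s_dim)
        preds = torch.stack([m(s, a) for m in models])
    # discrepancy: max pairwise L2 distance among member predictions
    per_sample = preds.permute(1, 0, 2)                 # (batch, n_models, s_dim)
    disc = torch.cdist(per_sample, per_sample).amax(dim=(1, 2))
    uncertain = disc > disc_threshold                   # flagged "uncertain"
    return torch.where(uncertain, torch.full_like(r, penalty), r)
```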
Note that the proposed conservative MDP relies on the model's ability to judge whether a given state-action pair is "certain" and is not an inherent property of the real MDP. As a result, some regions may be marked as "certain" but actually be risky. To address this, the conservative MDP dynamics are further constrained by partitioning a given state into a "certain" or "uncertain" region. As defined in (11), if the SOC is within an allowable range, the state is flagged as "certain" and the originally defined reward is received. Under other circumstances, a large negative reward is applied that forces the policy to detour around the uncertain regions. This provides a double guarantee against model over-exploitation

$$\tilde{r}(s_t, a_t) = \begin{cases} r(s_t, a_t), & \mathrm{SOC}_{min} \le \mathrm{SOC}(t) \le \mathrm{SOC}_{max} \\ r_{penalty}, & \text{otherwise} \end{cases} \qquad (11)$$
where $\mathrm{SOC}_{min}$ and $\mathrm{SOC}_{max}$ are the allowable minimum and maximum SOC in a given driving cycle, and $r_{penalty}$ is the large negative reward.
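The second safeguard then reduces to a range check on the SOC component of the state, as in (11); a sketch with illustrative bounds and penalty value is given below.

```python
import torch

def regularized_reward(r, soc, soc_min=0.4, soc_max=0.8, penalty=-100.0):
    """Second safeguard (state regularization): keep the original reward
    only while the SOC stays inside the allowable range; otherwise
    replace it with a large negative reward, as in (11)."""
    certain = (soc >= soc_min) & (soc <= soc_max)
    return torch.where(certain, r, torch.full_like(r, penalty))
```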

D. Experiment Setup
The HIL test is used to evaluate the real-time performance and effectiveness of the proposed strategy on a real controller. Fig. 6 shows the HIL experimental test platform, including the real-time control system based on MotoTron, a real controller that meets full production, environmental, and packaging requirements for HEV applications, and the vehicle simulation system running on National Instruments products, both of which support rapid C code generation and online calibration. Because the algorithms run on production-intent hardware and can flow into production development, the HIL results give a realistic indication of real-world feasibility.

IV. RESULTS AND DISCUSSIONS
This section verifies the behavior of the proposed algorithm from three aspects. First, the performance of the unconstrained data-driven model-based offline RL is demonstrated. For comparison, the optimal DP strategy and the strategies trained by fully off-policy DQN, DDPG, and TD3 using the same offline dataset are also provided. Next, the results of the proposed UMORL algorithm are introduced in detail. Finally, the robustness of UMORL under unknown conditions is evaluated by discussing the results of the vehicle under different driving cycles.

A. Results of the Unconstrained Data-Driven Model-Based Offline RL

Fig. 7 shows the SOC trajectories of the proposed unconstrained data-driven model-based offline RL algorithm compared with fully off-policy DQN, fully off-policy DDPG, fully off-policy TD3, DP, and the logging policy under the FTP75 driving cycle. Note that the fixed offline dataset is generated by running the logging policy only under the FTP75 driving cycle. The corresponding EFC and terminal SOC for these algorithms are provided in Table II. It is obvious that the fully off-policy DQN, DDPG, and TD3 algorithms cannot sustain the battery SOC within the specified range. This is mainly due to the distributional shift inherent in off-policy RL applied to offline settings.

B. Results of the Proposed UMORL

Fig. 8 and Table III show the results of the proposed UMORL, which adds safeguards against model over-exploitation in two stages. Compared with the unconstrained approach, whose SOC trajectory vibrates relatively drastically, the SOC trajectory of the constrained policy augmented with the conservative MDP behaves much more smoothly, and this also contributes to a small fuel consumption reduction. When the second safeguard, i.e., state regularization, is added, the EFC and terminal SOC are close to those of DP, indicating that the proposed methods are effective in offline settings almost without the issue of model over-exploitation. The detailed engine and motor operating points of the proposed UMORL algorithm, in contrast with the logging policy and the optimal DP algorithm, are shown in Fig. 9. It is not difficult to see that the engine and motor operating areas are located in similarly high fuel-efficiency regions for UMORL and DP, whilst those of the logging policy diverge into some inefficient areas.

C. Influence of Offline Data Quality on Final Behavior
To investigate whether the quality of the offline data has an impact on the final behavior of the proposed UMORL, a new dataset of the same size, generated from another logging policy, is collected. In contrast to the previous logging policy, which achieves near-optimal performance (termed logging policy A in this article for simplicity), the new logging policy (termed logging policy B) exhibits suboptimal behavior. As can be observed from Fig. 10 and Table IV, the policy behavior of the proposed UMORL is close to that of the corresponding logging policy in terms of SOC trajectory, EFC, and terminal SOC. Considering that 1) the quality of the offline data plays a crucial role in the performance achievable with the proposed model-based offline RL algorithm, and 2) a number of existing logging policies are available that can approximate the optimum under different driving conditions, is it possible for the proposed UMORL algorithm to learn a near-optimal policy from mixed offline datasets generated by different logging policies in an arbitrary driving cycle? To verify this, the proposed UMORL algorithm is trained on the combined datasets of the two logging policies. As can be seen from Fig. 11 and Table V, compared with the policies learned from the separate datasets, combining the datasets results in a learned policy that behaves in between. This is mainly due to the reduced constraint effect (or increased distributional shift) caused by the broader state-action region generated by the divergent logging policies. It is thus suggested to select the offline dataset generated only by the best available policy in order to learn a near-optimal policy using the proposed UMORL algorithm.

D. Results When the Driving Cycle Is Changed
Finally, the policy trained by the proposed algorithm under the FTP75 driving cycle is applied to the WLTP, UDDS, and LA92 driving cycles in order to validate whether the proposed algorithm can generalize to data points outside the training samples without severe overfitting. As can be seen from Figs. 12-14 and Tables VI-VIII, compared with the fine-tuned logging policy (corresponding to logging policy A in Section IV.C), the proposed UMORL algorithm achieves lower fuel consumption without sacrificing too much SOC-sustaining performance in the new driving cycles without further training.

V. CONCLUSION
In this article, a data-driven model-based solution for EMS of HEVs based on UMORL has been proposed. The major findings are summarized as follows.
1) A feasible data-driven model-based offline RL algorithm has been proposed. Compared with the widely used fully off-policy counterparts, even the naïve model-based offline RL algorithm can solve, to a large extent, the distributional shift problem inherent in off-policy RL applied to offline settings.

2) Since the learned data-driven model is unlikely to be globally accurate, as the offline dataset may not span the entire state-action space, planning with such a model without any safeguards against model inaccuracy can result in model over-exploitation, yielding suboptimal behavior. To address this, two safeguards, namely conservative MDP and state regularization, are augmented in the proposed data-driven model-based offline RL algorithm, and they are proved effective against model over-exploitation.
3) The quality of the offline data plays a crucial role in the performance achievable with the proposed model-based offline RL algorithm, and it is suggested to select the offline dataset generated only by the best available policy in order to learn a near-optimal policy.

For future work, the proposed algorithm is planned to be tested on a real HEV in a road test. In addition, depending on the quality of the datasets, it is often desirable to fine-tune the learned policy via further online interactions. However, due to the distributional shift between the offline and online datasets, severe bootstrap error during fine-tuning may destroy the initial policy obtained via the proposed offline RL. How to solve this problem using a modified model-based offline RL algorithm will be follow-up work.