Online Data-Driven Energy Management of a Hybrid Electric Vehicle Using Model-Based Q-Learning

The energy management strategy of a hybrid electric vehicle directly determines the fuel economy of the vehicle. As the supervisory control strategy that divides the required power among its multiple power sources, the engine and the battery, energy management has so far been studied mainly using rule-based and optimization-based approaches. Recently, studies using various machine learning techniques have also been conducted. In this paper, a novel control framework implementing model-based Q-learning is developed for the optimal control problem of hybrid electric vehicles. As an online energy management strategy, the new approach can learn the characteristics of the current driving environment and adaptively change the control policy through learning. In particular, the proposed algorithm separates the internal powertrain environment from the external driving environment so that each can be learned within the reinforcement learning framework, which results in a simpler and more intuitive control strategy that can be explained using the vehicle state approximation model. The proposed algorithm is tested and verified through simulations, and the simulation results show a near-optimal solution. The simulation results are compared with conventional rule-based strategies and with optimal control solutions acquired from Dynamic Programming.


I. INTRODUCTION
Energy management strategies for hybrid electric vehicles (HEVs) are one of the most important factors determining the fuel economy performance of a vehicle. Coordinating multiple power sources, generally fossil fuel energy and electric energy in HEVs, the energy management strategy is a supervisory control method to operate each power source by determining when and how much energy to use according to the driving environment [1].
Simple and applicable rule-based approaches are mainly used for the controllers of real vehicles. These approaches usually focus on obtaining the best efficiency for each powertrain component, and the calibration of the control parameters is based on heuristics or the engineer's intuition. Examples of rule-based control can be found in [2], [3]. More mathematical approaches based on optimal control theory have also been studied. One of the most widely known algorithms is Dynamic Programming (DP) [4]. DP is a powerful tool that yields the best achievable fuel economy of the vehicle. Therefore, the results of DP simulations for HEVs can be used to gain intuition about the control policy of the powertrain [5], [6]. However, DP cannot be used for real-time control because it needs the entire driving speed profile before vehicle departure.
The associate editor coordinating the review of this manuscript and approving it for publication was Canbing Li.
At the same time, optimization-based control strategies for real-time application have been developed in various ways. Among the most widely studied are strategies based on instantaneous optimization, such as the Equivalent Consumption Minimization Strategy (ECMS) [7]-[9] and the Pontryagin Minimum Principle (PMP) [10], [11]. ECMS and PMP have the advantage that they can be used as real-time control strategies that optimize fuel efficiency through an equivalence calculation between fuel and electrical energy. However, similar to DP, these strategies must reflect future driving information to achieve high fuel economy; this information enters through an equivalent factor or co-state that represents the balance between fuel and electrical energy usage. As a result, to approach the fuel efficiency of DP, it is necessary to calculate an optimized solution that reflects the driving conditions of the vehicle [12], [13]. Accordingly, recent studies have attempted to predict and utilize future driving conditions. However, it is not easy to accurately predict future driving speed profiles, and learning and adapting the optimal control method as the driving conditions change requires a sophisticated algorithm and a heavy computational burden [14], [15]. Because of these problems, recent approaches have attempted to solve hybrid control problems using machine learning.
Reinforcement Learning (RL), a field of machine learning that has been actively researched in recent years, provides a framework that is well suited to control problems [16]. RL was developed on the foundations of dynamic programming. Therefore, problems previously solved using DP, such as the HEV optimal control problem, fit naturally into the RL framework. In fact, RL techniques have been applied to HEV control, building on previous studies on stochastic dynamic programming (SDP) [17]-[19].
Much work has been done on RL for energy management strategies of HEVs, especially using Q-learning. In [20], RL was applied to the power management strategies of HEVs, in which a Temporal Difference (TD)-learning algorithm was used to derive the optimal control policy. In [21], RL was applied to the power management strategy for a plug-in hybrid electric vehicle (PHEV), in which the remaining distance to travel was chosen as a state variable and the immediate reward was defined as the sum of the fuel consumption cost and the battery energy usage cost. In [22], RL was used to optimize the power distribution between the battery and the ultra-capacitor for a PHEV; in that work, the transition probability matrices were updated based on the driving cycle and the Kullback-Leibler divergence rate. In [23], an RL-based energy management strategy was presented for a hybrid electric tracked vehicle, in which Q-learning and the Dyna algorithm were applied to generate the optimal control policy. In [24], a predictive energy management strategy based on RL and velocity prediction was proposed and applied to a parallel HEV. More recently, [25] utilized a Deep Q Network (DQN), which combines Q-learning with a deep neural network, for HEV control.
In this paper, as in previous studies, we conducted a study on the HEV optimal control problem using RL. In particular, to apply RL to HEV control, we constructed a novel RL framework more suitable for the HEV optimal control problem based on previous studies [26], [27]. Especially, by separating the vehicle's internal powertrain environment and the vehicle's driving environment on the learning framework, we constructed a model-based Q-learning algorithm for energy management strategy of HEV, which is a more intuitive and explanatory learning framework for vehicle powertrain control. Accordingly, this approach was developed not just to find a generalized offline control policy according to many different driving patterns, but also to develop an online data-driven energy management strategy in which the vehicle controller is optimized with respect to the current given driving environment, thus allowing it to adaptively change the control policy according to change in the environmental data. The contribution of this paper is that by developing a novel optimal control framework using model-based Q learning applied to the HEV optimal control problem, the characteristics of the HEV optimal control problem and the intuition of the RL control technique for the HEV controller are better understood.
The remaining chapters are organized as follows. Chapter II describes the HEV simulation model used in this paper. In Chapter III, the optimization problem to be solved is defined, and a novel algorithm using RL is proposed. Chapter IV discusses the feasibility and various features of the proposed algorithm based on simulation results, and finally, Chapter V gives the conclusions.

II. VEHICLE MODELING
In this study, the fuel efficiency performance and validity of the proposed algorithm are tested in vehicle simulation, so a reliable vehicle powertrain model is essential. We use a vehicle powertrain model composed of quasi-static component models. For the powertrain structure, a parallel HEV is used, as shown in Fig. 1.
First, for engine modeling, a quasi-static engine fuel consumption model is utilized. It is assumed that the engine transient behavior, such as the combustion dynamics, is much faster than the vehicle system-level dynamics relevant to energy flow analysis. The fuel consumption rate of the engine, ṁ, is represented using a map, as given in Fig. 2 and (1), as a function of the engine torque T_eng and engine speed ω_eng. For the motor, the efficiency η_mot is calculated using a pre-determined map, and the battery power output P_bat is expressed using the motor torque T_mot and motor speed ω_mot, as shown in (2).
The efficiency of the motor η_mot is a function of the motor torque T_mot and motor speed ω_mot, as given in Fig. 3. If the machine is used as a motor, then k = −1, and if the machine is used as a generator, k = 1. It is also assumed that the effects of the transient dynamics of the electric motor are sufficiently small to be neglected. The battery power in (2) changes the State of Charge (SOC) of the battery, as modeled by the SOC dynamics in (3), which are derived from the equivalent circuit model of the battery shown in Fig. 4. Here, the open circuit voltage of the battery is V_oc, the electric power consumed outside the battery is P_bat, the internal resistance is R, and the battery capacity is Q_bat. For the battery model, a simple internal resistance model is used. The open circuit voltage and internal resistance of the battery are determined by a pre-determined map, as shown in Fig. 5. For the powertrain, the drivetrain dynamics from the transmission input shaft to the wheel can be expressed as shown in (4), (5), and (6) when the clutch is engaged. Here, T_wh is the wheel torque, T_eng is the engine torque, T_mot is the motor torque, T_gb_loss is the torque loss in the transmission, γ_gb is the gear ratio, T_fd_loss is the final drive torque loss, ζ_gb is the final drive gear ratio, ω_t is the transmission input speed, ω_wh is the wheel speed, and T_t is the transmission input torque. The loss for the gear box is given as a three-dimensional map, as given in (7), as a function of T_t, ω_t, and the gear step number i_gb.
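The motor and battery relations described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the closed-form SOC rate is the standard result for an internal-resistance equivalent circuit and is assumed here to match the form of (3); the power expression with the exponent k is an assumed reading of (2); and the constant values of V_oc, R, and Q_bat, as well as all function names, are hypothetical, whereas the paper uses maps (Fig. 3 and Fig. 5).

```python
import math


def battery_power(t_mot, w_mot, eta_mot):
    """Electrical power drawn from the battery [W] (assumed form of (2)).

    Uses the sign convention of the text: exponent k = -1 when the machine
    works as a motor (mechanical power >= 0) and k = +1 as a generator.
    """
    p_mech = t_mot * w_mot
    k = -1 if p_mech >= 0 else 1
    return p_mech * eta_mot ** k


def soc_rate(p_bat, v_oc, r_int, q_bat):
    """SOC time derivative [1/s] for an internal-resistance battery model."""
    disc = v_oc ** 2 - 4.0 * r_int * p_bat
    if disc < 0:
        raise ValueError("requested power exceeds what the battery can deliver")
    current = (v_oc - math.sqrt(disc)) / (2.0 * r_int)  # positive = discharge
    return -current / q_bat


def step_soc(soc, p_bat, dt, v_oc=270.0, r_int=0.1, q_bat=6.5 * 3600.0):
    """Forward-Euler SOC update over dt seconds (hypothetical parameters)."""
    return soc + dt * soc_rate(p_bat, v_oc, r_int, q_bat)
```

In this sketch, motoring makes the battery supply more than the mechanical power and generating returns less than the mechanical power, so the SOC falls while motoring and rises while generating.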
For the final drive gear, T_fd_loss is given as a function of the final drive input speed ω_fd and the final drive input torque T_fd. The vehicle model can be described simply as (9) and (10). Here, v is the vehicle speed, R_tire is the tire radius, F_brake is the brake force, and F_loss is the road load loss, which includes the road grade. M_veh is the vehicle mass and M_eq is the equivalent mass for the rotating inertia of the powertrain components. Finally, f_0, f_1, and f_2 are the driving resistance coefficients. Some of the vehicle model parameters are shown in Table 1.
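As a rough illustration of the longitudinal vehicle model, the following Python sketch computes the road load and the wheel torque needed to follow a given speed and acceleration. The quadratic road-load polynomial in f_0, f_1, f_2 with an added grade term, and the force balance (M_veh + M_eq) dv/dt = T_wh / R_tire − F_brake − F_loss, are assumed forms of (9) and (10); the function names and the `slope_rad` argument are hypothetical.

```python
import math


def road_load(v, f0, f1, f2, mass=0.0, slope_rad=0.0, g=9.81):
    """Road-load force [N]: polynomial driving resistance plus a grade term."""
    return f0 + f1 * v + f2 * v ** 2 + mass * g * math.sin(slope_rad)


def wheel_torque_demand(v, accel, m_veh, m_eq, r_tire, f0, f1, f2,
                        slope_rad=0.0):
    """Wheel torque [N*m] needed to follow (v, accel) with no braking."""
    inertia_force = (m_veh + m_eq) * accel
    loss_force = road_load(v, f0, f1, f2, m_veh, slope_rad)
    return (inertia_force + loss_force) * r_tire
```

Multiplying the resulting wheel torque by the wheel speed v / R_tire gives a power demand of the kind used as the driving-environment state later in the paper.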
Based on these vehicle models, the algorithm presented in the paper was tested and verified. The next chapter describes the algorithm.

III. ONLINE DATA-DRIVEN ENERGY MANAGEMENT STRATEGY FOR HYBRID ELECTRIC VEHICLE
In this paper, RL is used for energy management of the HEV. In RL, learning is accomplished through feedback, in the form of rewards given for the outcomes of the agent's actions. Unlike supervised learning, which explicitly corrects undesired behaviors, RL focuses on online performance, which makes it better suited to real-time control strategies for HEVs. Among the RL algorithms, Q-learning is utilized in this study. Q-learning allows the optimal control to be learned online: the Q-function is learned using the temporal difference method based on interactions between the controller and the environment. Based on Q-learning, as mentioned in the introduction, a novel energy management strategy has been developed specifically for the optimal power distribution problem of HEV control. The optimal control problem is explained first, followed by the new energy management strategy.

A. OPTIMAL CONTROL PROBLEM
First, the optimal control problem is defined to minimize the expected total cost over an infinite horizon as shown in (11).
The minimization is subject to the associated constraints. Here, x_k is the state variable, g is the instantaneous cost, γ is the discount factor that weights future costs relative to the cost at the current time step, J_π(x_0) is the expected cost when the system starts at state x_0 and follows the policy π, and u is the engine power P_e, which is discretized into N_u levels, where N_u is the number of discretized control inputs. The state variable x_k spans a four-dimensional state space composed of SOC_k, P_dem,k, v_k, and E_on,k, as given in (13).
Here, SOC is the battery state of charge, P_dem is the required power demand, v is the vehicle speed, and E_on is the engine on/off state. The engine on/off state is considered in order to avoid the fuel consumption caused by frequent engine on/off switching. The instantaneous cost g is defined by the equation below.
Here, W_fuel is the instantaneous fuel consumption and β is the coefficient of the engine on/off penalty. ζ(SOC) is a term that penalizes SOC deviation to sustain the charge, as given below.
Here, µ and C_Penalty are positive constants for the SOC deviation penalty. The optimal control problem minimizes the expected cost over an infinite horizon rather than a finite one; the resulting control policy is therefore time invariant and can be easily implemented as a real-time vehicle controller. Note that this definition differs from the finite-horizon problem that DP normally solves and from Monte-Carlo methods, which learn from complete episodes of experience; in addition, the final SOC constraint used in DP is handled here through the instantaneous cost.
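The structure of the instantaneous cost can be sketched in Python as follows. The exact expressions for g and ζ(SOC) are not reproduced in this text, so the forms below are assumptions made only for illustration: a quadratic penalty weighted by µ inside the admissible SOC window, a flat penalty C_Penalty outside it, and a penalty β charged whenever the engine on/off state changes. The target SOC of 0.60 and the lower bound of 0.45 appear in the simulation section; the upper bound and all numeric weights are hypothetical.

```python
def soc_penalty(soc, soc_ref=0.60, soc_min=0.45, soc_max=0.75,
                mu=100.0, c_penalty=1e3):
    """Assumed charge-sustaining penalty zeta(SOC)."""
    if soc < soc_min or soc > soc_max:
        return c_penalty  # flat penalty outside the admissible window
    return mu * (soc - soc_ref) ** 2


def stage_cost(w_fuel, eng_on_prev, eng_on, soc_next, beta=0.5):
    """Assumed instantaneous cost g: fuel + engine switching + SOC penalty."""
    switch = 1.0 if eng_on != eng_on_prev else 0.0
    return w_fuel + beta * switch + soc_penalty(soc_next)
```

With a cost of this shape, holding the SOC at its target and keeping the engine state unchanged leaves only the fuel term, which is the behavior a charge-sustaining policy should converge toward.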

B. MODEL-BASED Q-LEARNING
In this paper, to apply the Q-learning algorithm to the HEV control problem, a new energy management strategy based on the RL framework is developed. First, in Q-learning, the optimal cost J*(x_k) and the optimal control policy π*(x_k) are obtained from the Q-function as
$$J^*(x_k) = \min_{u} Q(x_k, u), \qquad \pi^*(x_k) = \arg\min_{u} Q(x_k, u).$$
Further, the Q-function value is updated by a temporal-difference rule based on the Bellman equation.
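A minimal sketch of tabular, cost-minimizing Q-learning is given here in Python to make the selection and update steps concrete. It shows the textbook temporal-difference update, not necessarily the exact form of the paper's update equation; the learning rate `alpha`, the flattened integer state index, and the list-of-lists table layout are assumptions introduced for illustration.

```python
def greedy_action(q, x):
    """Index of the control with the minimum Q value in state index x."""
    row = q[x]
    return min(range(len(row)), key=row.__getitem__)


def q_update(q, x, u, cost, x_next, alpha=0.1, gamma=0.95):
    """One temporal-difference update of Q(x, u) for a cost-minimizing agent.

    q is a list of rows, one per state, each holding one value per control.
    """
    target = cost + gamma * min(q[x_next])  # Bellman target with min over u
    q[x][u] += alpha * (target - q[x][u])
    return q[x][u]
```

Because the problem is posed as cost minimization, both the greedy choice and the bootstrap target use a minimum over controls rather than the maximum seen in reward-maximizing formulations.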
When the system is in some state x_k (i.e., in this HEV control problem, when the vehicle is in the state given by SOC_k, P_dem,k, v_k, and E_on,k), the control u_k with the minimum Q value is selected. Under the action u_k, the state x_k changes to x_{k+1} with immediate reward g_k; then, based on g_k and the Q value at the new state x_{k+1}, the Q-function value Q(x_k, u_k) is updated with the Bellman equation. Equation (18) presents the baseline of the Q-learning algorithm. Based on this algorithm, a new online data-driven energy management strategy using model-based Q-learning is proposed in this study. In conventional Q-learning, the most important issue is convergence. When applying Q-learning to various problems, including the HEV control problem, convergence is often difficult to guarantee, or the state dimension is so large that convergence takes a long time. Additionally, there is the curse of dimensionality, as in DP. In this paper, we propose an algorithm that fits the framework of the HEV optimal control problem. Fig. 6 and Fig. 7 present the concept of the algorithm and its pseudo code, respectively. The idea of the algorithm is as follows. In HEV control problems, the states in (13) can be divided into stochastic and deterministic parts. That is, the driving environment of the vehicle (i.e., P_dem and v) evolves probabilistically with uncertainty, while the state of the vehicle (i.e., SOC and E_on) evolves deterministically under the control input given by the control policy and the driving environment. Further, the fuel consumption W_fuel can be modelled deterministically for the given driving condition and control input. In many existing papers using Q-learning or DQN, the vehicle and environment states are grouped together and trained entirely without a model.
This model-free property is an advantage of Q-learning. However, when the vehicle powertrain and the vehicle-dynamics states can be modeled, model-based techniques should be considered. In this study, the algorithm is built around an approximation of the vehicle model, as shown in Fig. 6, so that there is an inner loop in which learning can be conducted separately. In the proposed algorithm, the control u_k is first chosen based on Q(x_k, u_k). However, unlike the conventional Q-learning algorithm, in which an ε-greedy policy is often used, here the action u_k is selected based only on Q(x_k, u_k) (i.e., the minimum Q value) without any exploration strategy. Instead, the Q-function value is updated based on interactions between the agent and the vehicle state approximation model using the driving cycle information. While the optimal action u_k is chosen and implemented in the environment, the agent updates the Q-function value by evaluating all admissible actions u_k with the vehicle model (to limit the computational burden, the number of actions evaluated in the inner loop can be reduced). The reward g_{k+1} and the vehicle part of the state x_{k+1} (SOC and E_on) resulting from each action u_k are obtained using the vehicle state approximation model, and the Q-function value is updated by combining these data with the driving cycle part of the state x_{k+1} (P_dem and v).
The underlying meaning of this structure is that in terms of the exploration-exploitation dilemma, by separating the deterministic vehicle model state from the stochastic vehicle driving environment state, exploration of the control according to the vehicle driving environment is increased while exploitation of the control policy is secured. Thus, the proposed algorithm works differently from existing Q-learning or DQN, where an ε-greedy policy is used often. That is, random selection of the control input for the HEV control problem decreases the fuel economy performance for exploration. Additionally, considering that these random control inputs in the exploratory strategy can cause undesirable behavior or even fatal errors in the vehicle system, the proposed algorithm has the advantage of stability and robustness, which is very important for vehicle control characteristics. Further, similar to DQN, experience replay could be conducted by updating Q using the vehicle model for different actions, which helps convergence.
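The inner-loop idea described above can be sketched in Python as follows. This is an illustrative reading of the text, not the pseudo code of Fig. 7: the dictionary-based Q table, the `model` callable standing in for the vehicle state approximation model, the tuple encodings of the vehicle and driving states, and the learning rate `alpha` are all hypothetical.

```python
from collections import defaultdict


def model_based_sweep(q, veh, drv, drv_next, actions, model,
                      alpha=0.5, gamma=0.95):
    """Update Q(x_k, u) for every admissible u using the vehicle model.

    veh      : hashable vehicle state, e.g. (soc_index, engine_on)
    drv      : observed driving state at step k, e.g. (p_dem_index, v_index)
    drv_next : observed driving state at step k+1
    model    : callable (veh, drv, u) -> (veh_next, cost)
    """
    for u in actions:
        veh_next, cost = model(veh, drv, u)  # deterministic vehicle part
        best_next = min(q[(veh_next, drv_next, a)] for a in actions)
        key = (veh, drv, u)
        q[key] += alpha * (cost + gamma * best_next - q[key])


def select_action(q, veh, drv, actions):
    """Greedy (minimum-Q) control, with no epsilon-greedy exploration."""
    return min(actions, key=lambda u: q[(veh, drv, u)])
```

A table created with `q = defaultdict(float)` starts every entry at zero. Only the observed driving transition (`drv` to `drv_next`) comes from the environment; every control is evaluated through the model, which is how the sketch obtains exploration over controls without applying random inputs to the vehicle.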
On the other hand, the vehicle state approximation model is updated using the information obtained from interactions between the agent and environment as in the equation below.
The vehicle model is defined as above and updated using the results of the interaction between the actual agent and the environment. Note that it is still possible to have a model-free property, which is an advantage of Q-learning. The initial approximation of the vehicle model only helps faster learning and convergence of the algorithm. In other words, even if the model is not accurate, it can be modified by learning from the driving data, which allows optimal control to be explored. The vehicle approximation model (battery SOC and fuel consumption) is given as a four-dimensional look-up table that is a function of the state and control as written in equations below.
W_fuel = f_fuel(P_dem, v, E_on, u)
Fig. 8 and Fig. 9 present examples of the battery SOC model and the fuel consumption model, respectively. The advantage of the proposed algorithm is that, unlike existing Q-learning-based energy management strategies, it separates the vehicle model from the environment. With the vehicle state approximation model, the future vehicle state and reward (battery SOC, engine on/off state, fuel consumption) can be derived when a control input is given along with the current vehicle state information. For the driving cycle, however, it is not easy to accurately predict the change in vehicle speed and the required power demand. Therefore, for the vehicle powertrain, the state approximation model is built into the control algorithm through modeling, starting from an initial value and updated through learning. For the vehicle driving cycle, in contrast, the driving data are learned through interactions between the agent and the environment, as in existing Q-learning-based energy management strategies. Thus, with the vehicle state approximation model, the uncertainty of the state transition model can be significantly reduced, and the decision-making process can be explained more explicitly, unlike in conventional RL, which lacks visibility into the learning process.
Therefore, the vehicle-related states and control can be learned using full backups, and the driving environment can be learned using sample and shallow backups. Compared to the SDP algorithm, in which the driving cycle information is expressed as a transition probability matrix (TPM), the proposed algorithm stores the instantaneous driving cycle information in the Q-function value through the Bellman update, as if the TPM were updated at every moment. At the same time, using the vehicle model, the proposed algorithm can derive the optimal control value by examining the vehicle state change and the reward for all possible control inputs, as in SDP.
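The look-up-table form of the vehicle state approximation model and its online correction can be illustrated with the Python sketch below. The update equation of the model is not reproduced in this text, so the exponential-averaging rule used here is an assumption; the class name `FuelTable`, the integer grid indices, and the `rate` parameter are hypothetical.

```python
class FuelTable:
    """Look-up table W_fuel = f_fuel(P_dem, v, E_on, u) on discrete indices."""

    def __init__(self, initial=0.0, rate=0.2):
        self.table = {}
        self.initial = initial  # prior estimate for cells not yet visited
        self.rate = rate        # assumed model learning rate

    def predict(self, p_dem, v, e_on, u):
        """Current fuel estimate for one grid cell."""
        return self.table.get((p_dem, v, e_on, u), self.initial)

    def update(self, p_dem, v, e_on, u, measured):
        """Move the visited cell toward the measured fuel consumption."""
        key = (p_dem, v, e_on, u)
        old = self.table.get(key, self.initial)
        self.table[key] = old + self.rate * (measured - old)
        return self.table[key]
```

Under this assumed rule, an inaccurate initial table is gradually corrected in the cells that the vehicle actually visits, which is consistent with the text's remark that the initial model only speeds up learning and can be modified by learning from the driving data.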

IV. SIMULATION ANALYSIS
The effectiveness of the proposed algorithm described above was verified through vehicle simulations. We investigated how well the learning process is actually conducted using the proposed algorithm, and how close the fuel economy obtained through learning is to that of deterministic dynamic programming (DDP), which represents the optimal fuel economy. Additionally, simulation results with the conventional rule-based strategy are presented for comparison. First, discretization of the parameters is performed as in Table 2.

A. SIMULATION USING STANDARD DRIVING CYCLE
Standard driving cycles, the Urban Dynamometer Driving Schedule (UDDS) and the Highway Fuel Economy Test (HWFET), are used for the learning process and the vehicle simulation. Fig. 10 presents the learning curve for UDDS, in which the cumulative reward decreases rapidly as iterations are repeated. As the iterative learning continues, the cumulative reward value becomes smaller and convergence can be confirmed. Fig. 11 presents the battery SOC results for each simulation for UDDS. In the first iteration, nothing has been learned yet, so the battery SOC decreases; this is because the controller selects the control that minimizes the immediate fuel consumption and SOC deviation penalty without considering the discounted cost of the next state. Thus, the battery SOC falls until it reaches the minimum boundary SOC value (0.45). However, as the learning process is repeated, the battery SOC is sustained near the target battery SOC value (0.60). The simulation for HWFET is conducted in the same way: the strategy is simulated repeatedly on the HWFET driving cycle for learning, and the fuel efficiency performance is measured. Table 3 presents the equivalent fuel economy performance of the strategy for UDDS and HWFET, trained on each cycle separately. The simulation results show that in the case of UDDS, the RL-based strategy achieves a fuel economy of 24.9 km/l, which is 95.4% of the optimal fuel economy obtained with DDP. For HWFET, the fuel economy of the RL-based strategy is 25.7 km/l, which is 98.1% of the DDP result. In both cases, we confirmed that the results of the RL-based strategy are better than the results of the rule-based strategy. The RL-based strategy exhibits engine operating points very similar to those of DDP, as shown in Fig. 12. Fig. 12 shows that in both the DDP and RL-based strategies, the engine is operated near the optimal operating line, which has a relatively high Brake Specific Fuel Consumption (BSFC) efficiency. However, the fuel economy of the RL-based strategy cannot reach that of DDP even though it is trained repeatedly on the driving cycle. This is because the optimization problem is defined over an infinite time horizon rather than a finite driving cycle, so the derived optimal control is not tailored to the deterministic case.

B. ONLINE DATA-DRIVEN LEARNING
The learning ability of the proposed strategy was also tested through simulation. In these simulations, the strategy is first trained on one driving cycle (UDDS or HWFET), and learning is then performed again on the other driving cycle (HWFET or UDDS) to determine whether new learning occurs on top of the previously learned data. Fig. 13 presents the equivalent fuel economy results as learning is performed for the UDDS driving cycle using data pre-learned on the HWFET driving cycle. The equivalent fuel economy increases as iterations are repeated. Similarly, Fig. 14 shows the process of re-learning the HWFET driving cycle using the data learned on the UDDS driving cycle. Here too, the fuel economy increases as learning is repeated, eventually converging to a constant value. Table 4 shows the fuel efficiency performance on the UDDS and HWFET cycles for strategies trained on different cycles. For the UDDS driving cycle, learning with UDDS only yields the best fuel economy, while for the HWFET driving cycle, learning with HWFET only yields the best fuel economy. In contrast, a strategy trained on one cycle shows reduced fuel efficiency on the other cycle. When both cycles are learned, the fuel economy is similar to the best fuel efficiency for each cycle. The simulation results show that even when the proposed algorithm is deployed in a driving environment different from the one it was initially trained on (for example, from UDDS to HWFET, or HWFET to UDDS), a competitive fuel economy can be obtained based on the generalized control policy learned from existing data.

C. APPROXIMATION MODEL LEARNING
Finally, the model-free approach is tested via simulation. Vehicles are generally exposed to various driving environments, and their performance deteriorates naturally over time. For example, engine aging or degradation of the powertrain can occur in real vehicles, so adapting the controller to such changes in component performance is necessary to minimize the loss of fuel efficiency. One advantage of the proposed strategy is that the algorithm can learn by itself and find the optimal control under such changes. In this case study, we deliberately changed the fuel consumption map of the engine model, under the assumption that performance degradation makes the engine consume more fuel in the high-torque region, and we verified that the proposed algorithm adapts by learning and deriving the optimal control rules for this change. The fuel consumption map is intentionally modified as shown in Fig. 15, where the faint region at high engine torque indicates increased fuel consumption. With this modified engine model, the RL-based energy management strategy is implemented with the HWFET driving cycle.
As a result, the strategy dynamically changes the existing set of parameters according to the change in elements to find the optimum control rule. Fig. 16 presents the vehicle fuel consumption approximation model in the RL-based energy strategy before and after the change in the fuel consumption map. Fig. 17 presents the BSFC map of the engine and the simulation results for the engine operating points with the original fuel consumption map data. In contrast, Fig. 18 presents the simulation results for the engine operating points with the changed fuel consumption map data. The BSFC map changes because the fuel consumption map is intentionally modified; however, the engine operating points remain in the same area after the 1st iteration, as seen in Fig. 18 (a). After a few iterations, the engine operating points move to the most efficient area of the BSFC map, as seen in Fig. 18 (b) and (c). Additionally, as learning proceeds, the fuel economy increases from 23.6 km/l at the 1st iteration to 24.6 km/l at the 20th iteration.
These results show that the control algorithm can adaptively find the optimal control policy when the performance or characteristics of the vehicle powertrain are changed, and those changes can be found using the vehicle state approximation model. This is possible by constructing the control framework for the powertrain model and the driving environment separately. Thus, the characteristics of the HEV optimal control and learning process of the RL control technique can be explained and understood more simply and intuitively through the vehicle powertrain engineering view, while the conventional RL-based strategy cannot explicitly describe the

V. CONCLUSION
In this study, an RL-based control strategy was developed for the optimal control problem of HEVs. In the proposed RL-based control strategy, the transition probability of the vehicle's driving speed profile is learned online based on the driving data, and the control strategy is optimized based on model-based Q-learning. To obtain an improved fuel economy in HEVs, it is necessary not only to increase the efficiency of the vehicle powertrain, but also to characterize the speed profile of the vehicle for use in the control strategy. The proposed control strategy provides a powerful mathematical framework that uses reinforcement learning to model the driving cycle information from a stochastic viewpoint and then solves the HEV supervisory control problem by optimization, using a model-based approach with an explainable and tunable vehicle state approximation model. As future work, experimental validation of the proposed control strategy is needed, since the strategy has so far been verified only in simulation. Further, the trade-off between the computational burden and the fuel economy performance of the strategy should be investigated based on experimental evidence. Finally, by incorporating other practical issues such as emissions or drivability, we expect that the proposed strategy can be advanced so that it is more practical and realistic.
CHANGBEOM KANG received the B.S. degree in mechanical and aerospace engineering from Seoul National University, South Korea, in 2012, and the M.S. degree in mechanical and aerospace engineering from Seoul National University, South Korea, in 2014, where he is currently pursuing the Ph.D. degree in mechanical and aerospace engineering. His research interests are energy management of hybrid electric vehicles and plug-in hybrid electric vehicles using Pontryagin's maximum principle and machine learning.