Multi-Agent Reinforcement Learning Based Actuator Control for EV HVAC Systems

While electric vehicles (EVs) continue to draw more attention as an alternative to traditional fossil fuel vehicles, the relatively short driving range of EVs is often pointed out as their biggest drawback. In terms of energy consumption, one of the most energy-intensive systems in EVs is the heating, ventilation, and air conditioning (HVAC) system. Most HVAC systems use On/Off or PID control for the actuators, but these control methods have low efficiency and are difficult to apply in multiple-input multiple-output systems. In this paper, we propose a novel multi-agent deep reinforcement learning (MADRL) method to efficiently control the low-level actuators of the EV HVAC system. Through this method, multiple objectives such as setpoint temperature, subcooling, and efficiency can be considered simultaneously by giving independent rewards to each actuator agent. The proposed method is evaluated on a simulator of an actual vehicle, and experimental results show that the MADRL-based method consumes, on average, only 53% of the energy of PID control in the transient phase.

The associate editor coordinating the review of this manuscript and approving it for publication was Christopher H. T. Lee.

VARIABLES
β  Importance ratio in the single-agent reward function.

I. INTRODUCTION
One of the biggest drawbacks of current electric vehicles (EVs) is their relatively short driving range compared to traditional fossil fuel vehicles. The simplest way to improve the short driving range is to increase battery capacity, and a significant amount of related research is underway [1]. Another approach is to improve the efficiency of the energy-consuming systems in EVs [2], [3], [4], [5]. One of the most energy-intensive systems in EVs is the heating, ventilation, and air conditioning (HVAC) system, which is a complex nonlinear thermo-fluid dynamics system composed of a variety of components such as a compressor, heat exchangers, and electric expansion valves (EXVs) to control vehicle climate conditions. The importance of the efficient control of the HVAC system in EVs is greater than that of conventional vehicles as the proportion of energy consumption by the HVAC system is larger in EVs [3]. Most HVAC systems use On/Off control and PID control [6], [7], [8]. On/Off control is simple but has problems such as low efficiency and a shortening of the lifespan of the components. PID control, also widely used for its simple implementation, does not require the dynamics of the target system and therefore can be applied to both linear and nonlinear systems. However, PID control involves a number of issues. First, in most cases, it is difficult to use PID control in multiple-input multiple-output (MIMO) systems as each control output is coupled with a single feature, which makes it hard to reflect complex objectives with multiple features in PID control. Second, PID control requires a setpoint that is typically based on human expertise and therefore might not be optimal. In particular, determination of the target degree of subcool (DOS), which greatly influences the efficiency of the system [9], [10], depends on the prior experience of humans. Further details about subcooling are explained in Section II. 
Third, since PID control is a type of feedback control that relies on errors from the current observation of the system, it is difficult to consider the entire trajectory. As a result, if the coefficients are not tuned well, oscillation can occur in the system.
Recently, approaches based on reinforcement learning (RL) to HVAC control have been studied. RL methods are designed to maximize the reward of the global trajectory [11] and are widely applied to various control problems. In HVAC systems, the policies of RL control a set of actuators, satisfying the multiple objectives of the MIMO system via reward engineering [12], [13]. In Refs. [12], [13], the classical RL method SARSA has been applied to vehicle HVAC systems, showing promising performance compared to conventional control methods. However, the classical tabular RL method is inadequate for a large continuous system because of the curse of dimensionality [11]. In building HVAC systems, RL-based control has shown promising efficiency and generalization performance compared to baseline algorithms [14], [15], [16], [17], [18], [19], [20]. However, Refs. [14] and [18] can be used only for the discrete control problem, and while Refs. [15], [17], [18] propose a high-level control that outputs the desired setpoint temperature, the efficiency of the low-level subsystem actuator control to reach the setpoint temperature is not considered. Therefore, there is room for improving control at the low level. Meanwhile, the control methods proposed in [19] and [20] are specific to the HVAC systems of buildings and are not applicable to vehicle HVAC systems.
In this paper, to address the above mentioned challenges, we propose a novel control method based on multi-agent deep reinforcement learning (MADRL) for the EV HVAC system, with the following features. First, our method is based on a multi-agent architecture, which enables the use of actuator-specific reward functions to minimize the energy consumption of the MIMO system. Second, using the novel reward functions, our method finds the setpoints needed for feedback control without human expertise. Third, our method achieves more energy-efficient control than conventional methods in terms of the entire trajectory while achieving comparable target convergence performance. To the best of our knowledge, this research is the first attempt to use deep reinforcement learning for the continuous control of low-level subsystem actuators in the EV HVAC system.
The remainder of this paper is organized as follows. In Section II, we introduce our problem's objective functions, the concept of subcooling, and deep reinforcement learning. In Section III, we explain our MADRL-based method, including state representation, action representation, and reward functions. In Section IV, we show that our proposed method outperforms the grid search and conventional PID control in efficiency while meeting a target temperature. As a training environment, we employed the commercial simulation software GT-SUITE® [21], which is widely used for industrial vehicle HVAC modeling. Conclusions and future work are provided in the last section.

II. PROBLEM STATEMENTS AND PRELIMINARIES
A. PROBLEM STATEMENTS
The HVAC system provides a cool or warm airflow into the cabin through a variety of heat exchanges for passengers' thermal comfort during driving. The heat exchangers (e.g., evaporator, heater core, low temperature radiator, etc.) transport thermal energy that is converted into the change in cabin air temperature.
The HVAC system has multiple modes depending on the usage for high performance and efficiency. Each mode adopts a different circuit design by closing or opening multi-way valves. Because of such distinct circuit design, each mode involves a different set of actuators, and therefore, every mode has its own specific objective and constraints.

1) OBJECTIVE
In this paper, we consider the cabin air conditioning (A/C) mode, which is used when the cabin demands low-temperature air. The actuators involved in this mode are the compressor and one EXV. The objective of the mode is to meet the given setpoint of the air temperature from the evaporator while minimizing the work. Regarding the work, only that done by the compressor is considered, since other contributions such as the cooling fan work and blower work are relatively small.
To be more specific, the objective can be divided into two parts: before and after reaching the target temperature, as shown in Fig. 1. Before reaching the target temperature (we call this stage the transient phase), the system operates to find the optimal trajectory of the control inputs, which minimizes the weighted sum of the total work done and the time taken until the convergence of the temperature. The system should reach the target temperature within a given time:

min Σ_{t=0}^{t_c} W_t + δ · t_c,  s.t. |T_{t_c} − T_target| < λ,  t_c ≤ t_limit,   (1)

where W_t is the work done by the compressor at time step t, t_c is the time of convergence, δ determines the weight of the time penalty, and λ is the range of convergence of the temperature. After reaching the target temperature (we call this stage the steady phase), the system operates to maintain the temperature within the desired range while conducting minimum work. The system becomes steady by fixing the inputs of the actuators to certain values. While there can be multiple input combinations that maintain the same target temperature, each of them may differ in efficiency. Therefore, in the steady phase, the objective is to find the optimal combination of control inputs rather than finding the trajectory. This optimal combination should satisfy Eq. (2):

min W_t,  s.t. |T_t − T_target| < λ.   (2)

2) USE OF SUBCOOLING
A liquid is subcooled when it exists at a temperature below its normal condensing point. Fig. 2 shows that subcooling (SC) exists to the upper left of the saturation line in the pressure-enthalpy diagram. In HVAC systems, condenser SC has a significant effect on the coefficient of performance (COP) [22]. In other words, maintaining the DOS at the target level is equivalent to achieving a certain level of efficiency. In conventional control methods, an EXV is controlled to reach the desired DOS using PID control or model predictive control to improve the COP [9], [10], [23], [24]. Our method is first validated to control the EXV to meet the target DOS provided by experts as an auxiliary target of efficiency. In the steady phase, the system is expected to reach not only the target temperature but also the optimal DOS. The objective function of the steady phase can be modified as follows:

min W_t,  s.t. |T_t − T_target| < λ,  |SC_t − SC_target| < λ_SC,   (3)

where SC_t is the current degree of subcool, SC_target is the target DOS, and λ_SC is the range of convergence of the DOS. However, as stated in [23], finding the optimal DOS is difficult and often requires numerous assumptions. Moreover, if the system is newly introduced or updated, finding the optimal DOS requires trial and error. In such cases when the target DOS is unknown, our method can be trained to directly minimize the work. Details are further discussed in Section IV.

3) CONSTRAINTS
Due to physical limits and safety issues, the compressor has a limit on the available change in actuation per time step, which is called the slew rate constraint. The slew rate constraint can be expressed as Eq. (4), where u^1_t and u^2_t are the control values of the compressor and EXV at time step t, respectively:

|u^1_t − u^1_{t−1}| ≤ Δu^1_max.   (4)

Also, each actuator's operating range has physical limits. This constraint is shown in Eq. (5):

u^i_min ≤ u^i_t ≤ u^i_max,  i ∈ {1, 2}.   (5)
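Taken together, the two constraints amount to clipping each requested control value, first by the slew rate and then by the operating range. A minimal Python sketch; the helper name and the rpm numbers in the example are illustrative assumptions, not values from this work:

```python
def clip_control(u_prev, u_req, u_min, u_max, slew_max):
    """Clip a requested control value to the slew-rate and operating-range limits."""
    # Slew rate (Eq. 4): the change per time step may not exceed slew_max.
    u = max(u_prev - slew_max, min(u_prev + slew_max, u_req))
    # Operating range (Eq. 5): the value must stay within the physical limits.
    return max(u_min, min(u_max, u))
```

For example, with hypothetical compressor limits of 800–8600 rpm and a slew rate of 500 rpm per step, a request to jump from 3000 to 8000 rpm is cut to 3500 rpm.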

B. PRELIMINARIES
1) REINFORCEMENT LEARNING
We formulate the HVAC control problem as a Markov decision process [11] with the tuple (S, A, p, r, γ), where S is the continuous state space, A is the continuous action space, p : S × A × S → R≥0 is the unknown state transition dynamics that denotes the probability density of the next state s′ ∈ S given the current state s ∈ S and action a ∈ A, and r : S × A → R is a reward function. γ ∈ [0, 1) is a discount factor of future rewards, and π(a|s) is a stochastic policy of action a given state s ∈ S. The agent chooses action a_t based on policy π(a_t|s_t) for every time step given state s_t and reaches the next state s_{t+1} following the stochastic transition dynamics p(s_{t+1}|s_t, a_t).
The goal of RL is to maximize the expected cumulative reward from the current policy π for every time step:

J(π) = E_{(s_t, a_t)∼π} [ Σ_t γ^t r(s_t, a_t) ].   (6)

Using the policy function π(a|s), the state value function V or the action value function Q (or Q-function) is obtained to approximate the expected cumulative reward. The state value function V^π(s) of a policy π is the expected cumulative reward starting from state s upon executing π:

V^π(s) = E_π [ Σ_t γ^t r(s_t, a_t) | s_0 = s ].   (7)

Next, the Q-function Q^π(s, a) of a policy π is the expected return starting from state s, taking action a, and then following π:

Q^π(s, a) = E_π [ Σ_t γ^t r(s_t, a_t) | s_0 = s, a_0 = a ].   (8)

Recently, deep neural networks have been widely used to train the policy or value function of RL algorithms [25], [26], [27], [28]. In this work, we deployed soft actor-critic (SAC) [29], which is a state-of-the-art model-free RL algorithm for continuous control domains. SAC has a high sample efficiency, as the algorithm is trained based on maximum entropy [30] with an actor-critic architecture [31]. Moreover, SAC tends to converge more stably and requires less hyperparameter tuning than other RL algorithms.
In SAC, the goal of the agent is to maximize not only the expected sum of the rewards from the current policy π but also the expected entropy of the policy:

J(π) = Σ_t E_{(s_t, a_t)∼π} [ r(s_t, a_t) + α H(π(·|s_t)) ],   (9)

where H(π(·|s_t)) = E_{a∼π} [−log π(a|s_t)]. The critic, which is the soft Q-function Q_θ of a policy π_φ, is trained by minimizing the critic objective (Eq. 10), while the actor π_φ is updated by minimizing the actor objective (Eq. 11), where θ̄ is the set of parameters of the target network:

J_Q(θ) = E [ ( Q_θ(s_t, a_t) − ( r(s_t, a_t) + γ E_{s_{t+1}} [ Q_θ̄(s_{t+1}, a_{t+1}) − α log π_φ(a_{t+1}|s_{t+1}) ] ) )² ],   (10)

J_π(φ) = E_{s_t, a_t∼π_φ} [ α log π_φ(a_t|s_t) − Q_θ(s_t, a_t) ].   (11)

For better exploration, the temperature coefficient α is automatically adjusted for the maximum entropy policy by minimizing the α objective, where H_0 is the desired minimum expected entropy threshold:

J(α) = E_{a_t∼π_φ} [ −α log π_φ(a_t|s_t) − α H_0 ].   (12)
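To make the SAC objectives concrete, here is a single-sample sketch in plain Python. The network outputs (Q-values and log-probabilities) are assumed to be pre-computed, so this illustrates only the loss arithmetic, not a full implementation; in practice these terms are averaged over mini-batches and differentiated by an autodiff framework:

```python
import math

def sac_losses(q, q_targ_next, logp_next, logp_new, r, done,
               log_alpha, gamma=0.99, target_entropy=-2.0):
    """Single-sample SAC objectives, given pre-computed network outputs."""
    alpha = math.exp(log_alpha)
    # Soft Bellman target: y = r + gamma * (1 - d) * [Q_targ(s', a') - alpha * log pi(a'|s')]
    y = r + gamma * (1.0 - done) * (q_targ_next - alpha * logp_next)
    critic_loss = (q - y) ** 2                         # critic objective (Eq. 10)
    actor_loss = alpha * logp_new - q                  # actor objective (Eq. 11)
    alpha_loss = -alpha * (logp_new + target_entropy)  # temperature objective
    return critic_loss, actor_loss, alpha_loss
```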

3) MULTI-AGENT RL
Multi-agent reinforcement learning (MARL) is widely studied in control systems with multiple components [32], [33], [34], [35], [36], [37]. MARL enables a more delicate design of the action and reward functions compared to single-agent RL algorithms. To define our MARL problem, we introduce a Markov game (MG) [32], [34], [38], [39]. The MG is a framework that generalizes the Markov decision process to multiple agents interacting simultaneously in a shared environment. The MG is defined with the tuple (N, S, A, p, r, γ), where N denotes the number of interacting agents (N > 1), S is the continuous state space, A = A^1 × · · · × A^N is the joint action space, which is the collection of the continuous action spaces of the individual agents i ∈ {1, . . . , N}, p : S × A × S → R≥0 is the unknown state transition probability, and r = r^1 × · · · × r^N is the collection of the reward functions of the individual agents. In MARL, the joint action a = a^1 × · · · × a^N and the joint policy π(a|s) = Π_i π^i(a^i|s^i) are defined with the collection of the actions and policies of the individual agents. A MARL agent chooses joint action a_t based on joint policy π(a_t|s_t) for every time step given state s_t and reaches the next state s_{t+1} following the stochastic transition dynamics p(s_{t+1}|s_t, a_t). The goal of MARL is the same as the goal of RL (Eq. 6).

A. MULTI-AGENT IN THE HVAC SYSTEM
As mentioned in Section II-A, our model has two different objectives, Eq. 1 for the transient phase and Eq. 2 for the steady phase. Each objective is optimized by its own agent, namely a transient agent for the transient phase and a steady agent for the steady phase. Each agent is composed of a compressor agent and an EXV agent, which are responsible for the corresponding actuators of the system. With an identical architecture for the compressor and EXV agents (Fig. 4), the transient and steady agents are trained for their respective phase using their own novel reward functions. Each reward function reflects the objectives defined in Eq. 1 and Eq. 3.
The compressor agent contains a policy network π^1_θ whose primary objective is to find the optimal control value of the compressor to meet the target temperature. Similarly, the EXV agent contains a policy network π^2_θ whose primary objective is to find the optimal control value of the EXV to meet the target DOS. In general, the compressor has a larger influence on the system than the EXV. Before the compressor converges, most of the features including compressor work, temperature, and SC predominantly depend on the compressor. Only after the target temperature is reached and the compressor converges can the EXV agent observe the change in SC that is induced solely by itself. Then it becomes possible to control the EXV agent to meet the target DOS with appropriate feedback. In other words, EXV control requires long-term exploration, in which the agent learns to meet the target DOS after the convergence of the target temperature.
In the case of single-agent RL, the agent outputs the control values of both the compressor and the EXV from a single neural network. Also, the reward is a single scalar value unified from r^1_t and r^2_t:

r_t = β r^1_t + (1 − β) r^2_t.   (13)
In this case, the agent tends to learn to control the more dominant component, which is the compressor, while struggling with the less dominant component, the EXV. One possible solution to address this issue is to adjust each reward's relative importance with a ratio constant, β. Unfortunately, finding the right value of β requires extensive hyperparameter searching with exponentially growing costs for increasing numbers of actuators. By splitting each actuator into independent agents, the relative magnitude of the reward is no longer an issue. Likewise, this structure can be easily expanded to HVAC systems with more actuators by simply adding more agents.
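The difference between the two reward schemes can be stated in a few lines of Python; the β-weighted form follows the Eq. (13) combination described above, and the function names are ours:

```python
def single_agent_reward(r1, r2, beta=0.5):
    # Single-agent RL: one scalar unified from the two rewards (Eq. 13),
    # so beta must be tuned so neither actuator dominates training.
    return beta * r1 + (1.0 - beta) * r2

def multi_agent_rewards(r1, r2):
    # Multi-agent RL: each actuator agent keeps its own reward, so the
    # relative magnitude of r1 and r2 no longer needs a tuned constant.
    return {"compressor": r1, "exv": r2}
```

Adding another actuator to the multi-agent form only adds a key, rather than another weight to search.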

B. STATE REPRESENTATION
The compressor and EXV agents share most of their important features from the system observations. The common features are as follows:

(T_{t−1}, T_t, SC_{t−1}, SC_t, T_t − T_target, SC_t − SC_target, ū^1_{t−1}, ū^2_{t−1}).   (14)
Here, T_{t−1}, T_t, SC_{t−1}, and SC_t are the observed values of the temperature and SC of the previous and current time steps, respectively. As the HVAC system reacts gradually rather than instantly, information from the previous time steps is required for the agents to decide the next action. This nature of the HVAC system can be easily inferred from a mathematical modeling of each system component, where most of the dynamics are differential equations with time [40], [41].
In the above equation, the current errors of the temperature and SC are denoted by T_t − T_target and SC_t − SC_target. The agents can decide the magnitude and direction of the action based on the current error. Moreover, the control values of the previous step are also included as (ū^1_{t−1}, ū^2_{t−1}). To make the most of the information available, each agent's current state contains the other agent's action taken one step before. The control values of the different actuators are normalized with their upper and lower limits for a relative scale, as follows:

ū^i_t = (u^i_t − u^i_min) / (u^i_max − u^i_min),  i ∈ {1, 2}.   (15)
Besides the common features, the EXV agent receives additional information, C_t, which is a binary flag indicating whether the target temperature has been reached. C_t, as given by Eq. (16), is 1 when the error of the temperature is below λ for two consecutive action steps and 0 otherwise:

C_t = 1 if |T_t − T_target| < λ and |T_{t−1} − T_target| < λ, and C_t = 0 otherwise.   (16)

The reward changes depending on C_t, and thus C_t provides a clear understanding of the status. More details about the reward function are explained in Section III-D.
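As a sketch, the two-consecutive-steps condition for C_t can be computed from a history of temperature errors; this helper is ours, written to match the prose description above:

```python
def target_reached_flags(temp_errors, lam):
    """C_t = 1 when |T - T_target| < lam for two consecutive action steps."""
    flags = []
    for i, err in enumerate(temp_errors):
        # The flag needs both the current and the previous error inside lam.
        prev_ok = i > 0 and abs(temp_errors[i - 1]) < lam
        flags.append(1 if abs(err) < lam and prev_ok else 0)
    return flags
```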
To sum up, the state of each agent is as follows:

s^1_t = (T_{t−1}, T_t, SC_{t−1}, SC_t, T_t − T_target, SC_t − SC_target, ū^1_{t−1}, ū^2_{t−1}),
s^2_t = (T_{t−1}, T_t, SC_{t−1}, SC_t, T_t − T_target, SC_t − SC_target, ū^1_{t−1}, ū^2_{t−1}, C_t).   (17)
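A minimal sketch of assembling these state vectors in Python; the dictionary layout and the operating-range numbers are illustrative assumptions rather than the paper's configuration:

```python
def normalize(u, u_min, u_max):
    # Rescale a control value to [0, 1] using its operating range.
    return (u - u_min) / (u_max - u_min)

def build_state(T_prev, T, sc_prev, sc, T_target, sc_target,
                u1_prev, u2_prev, limits, c_flag=None):
    """Assemble an agent state; the EXV agent appends the flag C_t."""
    state = [T_prev, T, sc_prev, sc,
             T - T_target, sc - sc_target,
             normalize(u1_prev, *limits["compressor"]),
             normalize(u2_prev, *limits["exv"])]
    if c_flag is not None:      # EXV agent only
        state.append(c_flag)
    return state
```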

C. ACTION REPRESENTATION
The action values (a^1_t, a^2_t) of the policy networks, which are in the [−1, 1] range, are converted to control values via mapping functions:

a^i_t ∼ π^i_θ(·|s^i_t),  i ∈ {1, 2}.   (18)

For clarity, we call the output of the agent the action value and the input of the actuator the control value.
As each actuator needs a different control value, different mapping functions are applied. For the compressor agent, the action value a^1_t is mapped to the control value u^1_t considering two types of constraints. The first constraint (Eq. 5) is the operating range of the component, which is the maximum and minimum rpm values of the compressor. The other constraint (Eq. 4) is the slew rate, which is the allowed change of rpm per second for the safety of the component. The compressor mapping function is as follows:

u^1_t = u^1_{t−1} + a^1_t · min(Δu^1_max, u^1_max − u^1_{t−1})  if a^1_t ≥ 0,
u^1_t = u^1_{t−1} + a^1_t · min(Δu^1_max, u^1_{t−1} − u^1_min)  if a^1_t < 0.   (19)

The compressor action value a^1_t is first rescaled to the domain of the actuator control value using the slew rate and the operating range. The rescaled value is then added to the previous control value u^1_{t−1} to get the next control value u^1_t. In this way, we can assure that the control value always stays in the allowed range.
One notable thing to address is that the mapping function is asymmetric. When a^1_t is rescaled with an affine transformation to the change in the control value, the allowed ranges of increment and decrement can differ [Fig. 5(a)]. If a^1_t is rescaled uniformly in this range, the portion of either increment or decrement can be relatively larger than the other, which can be problematic for the training process. Specifically, the agent collects training data with a random policy in the early stages of the training. The distribution of the training data can be biased towards either increment or decrement if the range is unbalanced, and this can greatly influence the training time and quality. To address this issue, the action value is rescaled differently depending on its sign. Positive action values always increase the control value within the allowed range, while negative action values work in the opposite way [Fig. 5(b)]. While a^1_t is rescaled to u^1_t, a^2_t is directly rescaled to u^2_t using the operating range of the EXV component. The slew rate is not considered in this case. The EXV mapping function is as follows:

u^2_t = u^2_min + ((a^2_t + 1) / 2) (u^2_max − u^2_min).   (20)
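The two mapping functions can be sketched in Python as follows; the sign-dependent form for the compressor follows the asymmetric-rescaling description above and is our reconstruction rather than the exact published formula:

```python
def compressor_control(a, u_prev, u_min, u_max, slew):
    """Map a compressor action a in [-1, 1] to the next control value."""
    if a >= 0.0:
        # Positive actions always increase the value within the allowed headroom.
        du = a * min(slew, u_max - u_prev)
    else:
        # Negative actions always decrease the value within the allowed headroom.
        du = a * min(slew, u_prev - u_min)
    return u_prev + du

def exv_control(a, u_min, u_max):
    # The EXV action is rescaled directly to the operating range (no slew rate).
    return u_min + (a + 1.0) / 2.0 * (u_max - u_min)
```

Because each sign maps onto its own headroom, random early-training actions increase and decrease the control value with equal probability, which is the balance the asymmetry argument asks for.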

D. REWARD FUNCTION
The reward function is designed to enable multi-objective control in the current MIMO system. As described in Fig. 4, two different control agents are trained, one for each corresponding phase. The objectives of each phase are expressed as reward components.

1) REWARD FUNCTION FOR THE TRANSIENT AGENT
The reward function for the transient agent in Eq. (21) is based on Eq. (1) from Section II. The reward function has two components, r^Temp_t and r^Work_t, so the compressor and EXV agents receive rewards composed of a temperature reward and a work reward. In the transient phase, the convergence of SC is not considered; hence, the only termination condition for the transient phase is the convergence of the temperature. The reward functions of the compressor and EXV agents are r^1_t and r^2_t, respectively:

r^1_t = r^Temp_t + r^Work_t,  r^2_t = r^Work_t.   (21)

Note that the ρ variables determine the shape and scale of the reward components. The r^Temp_t component is related to the convergence of the temperature to the target value. As T_t approaches T_target, r^Temp_t converges to 0. In contrast, r^Temp_t converges to −(ρ_1 + ρ_3) as T_t gets further away from T_target. Meanwhile, the r^Work_t component is related to the work done by the compressor and linearly decreases as the work W_t increases. W_t is linearly mapped to [−(ρ_1 + ρ_3), 0] using its minimum and maximum values, which are 500 W and 4000 W. The reason for mapping the temperature and work rewards to the same range is to prevent either reward from being overly dominant. In a multi-objective problem, an imbalance between reward components may cause some objectives to be ignored in training (Fig. 6).
Each agent's reward function is designed to satisfy the objectives of the transient phase (Eq. 1). In the compressor agent case, the reward r^1_t consists of the sum of r^Temp_t and r^Work_t. The EXV agent only takes the work reward because SC is not considered in the transient phase and the EXV control value has only a subtle influence on the change in temperature. It is worth noting that the objective of the transient phase is to meet the target temperature, not the target SC.
One important thing to address in the transient phase is the sign of the rewards. Contrary to the steady phase, every reward in the transient phase has a negative value. With a negative reward, the total episode reward decreases as the length of the episode increases. Therefore, negative rewards force the agent to balance between terminating the episode early and minimizing the work, which paves the way to achieve our objective function in Eq. 1. This property of negative rewards is used to reflect the objective of faster convergence.
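A sketch of the transient rewards in Python. The text specifies only the limits of r^Temp_t and the linear mapping of r^Work_t; the exponential shaping below and the ρ values are our assumptions for illustration:

```python
import math

RHO1, RHO2, RHO3 = 1.0, 0.5, 1.0   # hypothetical shape/scale constants
W_MIN, W_MAX = 500.0, 4000.0       # compressor work range in W (from the text)

def r_temp(T, T_target):
    # Assumed smooth shaping: 0 at the target, approaching -(rho1 + rho3) far away.
    return -(RHO1 + RHO3) * (1.0 - math.exp(-RHO2 * abs(T - T_target)))

def r_work(W):
    # Linear map of W from [W_MIN, W_MAX] onto [-(rho1 + rho3), 0].
    return -(RHO1 + RHO3) * (W - W_MIN) / (W_MAX - W_MIN)

def transient_rewards(T, T_target, W):
    r1 = r_temp(T, T_target) + r_work(W)   # compressor agent
    r2 = r_work(W)                         # EXV agent: work reward only
    return r1, r2
```

Both rewards are never positive, so longer episodes accumulate more penalty, encoding the preference for fast convergence.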

2) REWARD FUNCTION FOR THE STEADY AGENT
The reward function for the steady agent in Eq. (22) is based on Eq. (3) from Section II. This reward function is designed to be two-fold due to the trajectory of the temperature. As the steady phase comes after the transient phase (Fig. 1), the steady agent should be trained not only in the steady phase but also in the transient phase, and thus should have two reward functions. When the temperature is far from its target (C_t = 0, transient phase), the reward function is designed to reach the target temperature as fast as possible to enter the steady phase. Then, after reaching the target temperature (C_t = 1, steady phase), the reward function changes to meet the objectives of the steady phase (Eq. 3). Compared to the transient agent, an additional reward component r^SC_t is used for the convergence of the target SC in this case. Also, the sign of both reward components for the steady agent is positive, for the following reasons. As the objective of the steady phase is to maintain the target temperature and SC in a stable manner, the steady agent should be encouraged to stay in the desired state as long as possible. With a positive reward, we can compensate the agent as much as necessary. Note that composing the two reward functions of this agent with a single sign is important; if negative and positive rewards are mingled, the agent will struggle to understand the true effect of the action. Therefore, a positive reward is applied for the steady agent's transient phase, differing from the negative reward for the transient agent.
The shapes of the temperature and work rewards for the steady agent are the same as those for the transient agent; they are only translated to positive values. Here, r^Temp_t converges to ρ_1 + ρ_3 as T_t approaches T_target and converges to 0 otherwise. And r^Work_t is linearly mapped to [0, ρ_1 + ρ_3] using the minimum and maximum values of W_t. Similar to r^Temp_t, r^SC_t converges to ρ_1 + ρ_3 as SC_t reaches SC_target and converges to 0 otherwise (Fig. 7). Before reaching the target temperature (C_t = 0), r^1_t is the same as r^Temp_t. After reaching the target temperature (C_t = 1), the compressor agent receives both r^Temp_t and r^Work_t. The reason for adding r^Work_t when C_t = 1 is to train the steady agent to keep the temperature in the desired range (|T_t − T_target| < λ). As r^Work_t is positive, r^1_t is always much higher when the temperature is in the desired range, from which we can be assured that the convergence of the temperature has a higher priority than efficiency.
In this case, r^2_t is the same as r^Work_t before reaching the target temperature. After reaching the target temperature, the EXV agent receives r^SC_t, with which it is trained to reach the target SC. When the target SC is unknown, the EXV agent can instead be trained with r^Work_t after reaching the target temperature (C_t = 1). Here, the EXV agent is trained to find the action that minimizes the work while maintaining the temperature. As a byproduct, the resultant SC can be used as a target SC value. An agent trained without a target SC is validated in Section IV-B2.

Fig. 9 illustrates our MADRL training framework for the HVAC system control. First, both the compressor and EXV agents receive the tuple (s_t, a_t, s_{t+1}) by interacting with the HVAC simulator. Using the tuple, the reward functions of the compressor and EXV agents calculate r^1_t and r^2_t, respectively, and store the transition (s_t, a_t, r_t, s_{t+1}) in their experience replay memories. Then, a mini-batch of k transitions is randomly sampled from each experience replay memory and given to each network. The networks output the current action value, current Q-value, and target Q-value for each agent. These outputs are passed to multiple loss functions that calculate the actor loss, critic loss, and α loss. The parameters of each network are updated using the gradient of the corresponding loss. After the training, in inference, the actor network is separated; it takes the state as an input and outputs the control values for the compressor and EXV.
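The phase-dependent reward switching described above can be sketched as follows; the component rewards are passed in already computed (positive-valued), and the assignment of components per agent follows our reading of the text:

```python
def steady_rewards(r_temp_pos, r_work_pos, r_sc_pos, c_flag):
    """Two-fold steady-agent rewards, switching on the flag C_t."""
    if c_flag == 0:
        # Transient phase: converge to the target temperature first.
        r1 = r_temp_pos          # compressor agent
        r2 = r_work_pos          # EXV agent
    else:
        # Steady phase: hold the temperature, then pursue efficiency and SC.
        r1 = r_temp_pos + r_work_pos
        r2 = r_sc_pos            # or r_work_pos when no SC target is known
    return r1, r2
```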

E. MODEL ARCHITECTURE AND TRAINING PROCESS
Both the compressor and EXV agents are based on SAC [29]. In the initialization of the training process, the actor and critic networks use Xavier initialization. Then, the parameters of the target critic network are synchronized with those of the critic network. Also, the entropy coefficient α and the experience replay memory M are initialized.
The training process consists of multiple nested loops, where the outer loop denotes the training episode for each iteration, the second loop denotes the training step of each episode, and the innermost loop denotes the training of each component. In line 8, the initial states s 1 0 , s 2 0 are obtained from the HVAC simulator. In lines 10-12, using the current state s t , the action, reward, and next state are obtained, which forms the transition sample (s t , a t , r t , s t+1 ) for every time step. In line 14, the transition sample is stored in the replay memory, and in lines 15-19, a mini-batch is randomly sampled from the replay memory and the networks are updated. In line 20, the target critic network is soft-updated with hyperparameter τ .
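The nested loops above can be sketched as follows. The `env` and agent interfaces here are placeholders invented for illustration; the real training interacts with the GT-SUITE simulator and performs SAC updates:

```python
import random
from collections import deque

def train(env, agents, episodes, steps, batch, tau=0.005):
    """Nested training loops: episodes -> steps -> per-agent updates."""
    memories = {name: deque(maxlen=100000) for name in agents}
    for _ in range(episodes):
        states = env.reset()                     # initial states per agent
        for _ in range(steps):
            actions = {n: ag.act(states[n]) for n, ag in agents.items()}
            next_states, rewards = env.step(actions)
            for n, agent in agents.items():
                # Store the transition in the agent's replay memory.
                memories[n].append((states[n], actions[n],
                                    rewards[n], next_states[n]))
                if len(memories[n]) >= batch:
                    # Sample a mini-batch and update the networks.
                    agent.update(random.sample(list(memories[n]), batch))
                    agent.soft_update(tau)       # soft-update the target critic
            states = next_states
```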

A. EXPERIMENTAL SETUP
In this section, the MADRL-based control algorithm is validated for the EV HVAC system. The HVAC system is formulated with the commercial software GT-SUITE® [21], which has been widely employed by vehicle manufacturers and component suppliers. As one of the key aims of this research is to facilitate implementation in an industrial setup, it is natural to choose a widely used industrial simulator. In our experiment, the vehicle thermal system circuit is implemented as explained in Section II. For the experiments, the thermal system circuit requires additional conditions such as the vehicle speed and ambient temperature; these conditions are chosen here based on a common scenario using the A/C circuit. In particular, the target evaporator outlet temperature, which is the main control objective, is set within a range of values; the detailed experimental conditions are listed in Table 2.

B. EXPERIMENTAL RESULTS
As explained in Section III, the transient and steady agents of the MADRL-based control model are trained separately. In this section, we evaluate our model in terms of the temperature convergence and the work efficiency. Also, we show that our method works well without a target DOS and can even find an effective target DOS. Additionally, we compare the MADRL-based control model with conventional PID control.

1) CONVERGENCE AND EFFICIENCY EVALUATION
The performance of the temperature convergence is evaluated differently for each phase. In the transient phase, the ability to reach the target temperature within a limited time is tested (Fig. 10). The time limit (t_limit) is 40 steps, equivalent to 120 s. For every target temperature, the transient agent succeeds in reaching the target within the given time. In the steady phase, the ability to reach both the target temperature and SC is evaluated. Fig. 11 shows the case when the target temperature is 6 °C; the rest of the experimental results are presented in Fig. 12. The target DOS corresponding to the target temperatures is obtained from domain experts. In Fig. 11, we can see that the trajectory of the MADRL control model with a SC target converges to the target temperature and the target SC. As a result, the control values reach almost the same values as PID control by the end. For every experiment in Fig. 12, the agents succeed in reaching the target temperature and target SC with less than 0.5 °C error.
In terms of efficiency, we evaluate the work done by each control model in each phase. In Table 3, the work done by the MADRL control model is compared with that of PID control for the transient phase. Comparing the results, the improvement by the transient agent is larger in the low temperature zone, where the cooling load is greater than in the high temperature zone. In particular, when the target temperature is 4 °C, the transient agent consumes only 11% of the energy of PID control.
Such performance improvement is largely due to faster convergence. As shown in Fig. 10, Fig. 12, and Table 4, PID control converges to the target temperature more slowly than the transient agent. One of the reasons for this slow convergence is the overshoot that is easily observable in PID control. The overshoot is observable not only in the temperature but also in the SC. The overshoot of SC also contributes to the inefficiency of the system, since a proper SC promises a better COP. Conversely, the trajectories of the transient agent show little or no overshoot.
When we compare the work done by the steady agent and the transient agent in Table 3, the transient agent shows better efficiency, indicating that the reward functions work as intended. As the transient agent is penalized if the work done by the compressor is high, it utilizes the compressor more mildly to use less work. Lastly, in the steady phase, the steady agent trained with the SC target is compared with PID control in Table 5. Specifically, the work values are estimated based on the average of the last 10 steps of the trajectory in the steady phase. The steady agent shows a performance similar to that of PID control. This result is consistent with the trajectories of both methods, as they converge to the same temperature and DOS.

2) TRAINING THE MADRL MODEL WITHOUT A SUBCOOLING TARGET
As mentioned in Section III, conventional PID control requires a target DOS for feedback control, whereas the MADRL-based control model can be trained without a target DOS. Without a SC target, the agent is expected to reach the target temperature while minimizing the work. Since the target DOS is only used in the steady phase, this experiment is only for the steady agent.
We first validate whether the model successfully reaches the target temperature. In Fig. 12, the agent reaches the target temperature in every case. In Table 5, we compare the work done by the MADRL control model trained without a SC target and that by PID control. Our model shows comparable efficiency in all cases, averaging 99% of the efficiency of PID control. In particular, when the target temperature is 4 °C, the MADRL control model outperforms PID control by 8%. It is also worth noting that the model converges to a SC similar to that of PID control in most cases. Even when the converged DOS differs, the control model shows reasonable efficiency, varying by less than 5% from the PID control results even in the worst case. The DOS targets used in this work are optimal values found by domain experts, and finding them requires broad experience and numerous heuristics. If the DOS targets were not optimal, the performance gap between PID control and our method would be even greater.
For further validation, we conducted a brief grid search (Table 6). The grid search is executed by maintaining a fixed set of actions until the observations converge. The resolutions of the actions are 1000 rpm and 0.2 mm for the compressor and EXV, respectively. Each grid search result is grouped by the final temperature, and each group contains final temperature values differing by less than 1 °C from the target temperatures of our experiment. In Table 5, the best grid search results are selected among those that satisfy the convergence condition (|ΔT| ≤ 0.5 °C). Compared to the grid search results, both the MADRL control model and PID control achieve greater efficiency for all target temperatures. In fact, the grid search results show that finding the appropriate DOS for each temperature is difficult: for target temperatures of 4 and 6 °C, fewer than three combinations of actions reach the desired temperature. Applying a finer resolution would yield better performance but would also increase the cost of computation. Moreover, if the number of actuators increases, the cost increases exponentially.
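The grid-search baseline can be illustrated as an exhaustive sweep over fixed action pairs held until convergence. This sketch is a hedged reconstruction: the grid resolutions (1000 rpm, 0.2 mm) come from the text, but the rpm and EXV ranges and the `simulate_to_convergence` interface standing in for the vehicle simulator are assumptions.

```python
from itertools import product

RPM_GRID = range(1000, 9000, 1000)                    # compressor speed, 1000 rpm resolution (range assumed)
EXV_GRID = [round(0.2 * k, 1) for k in range(1, 11)]  # EXV opening in mm, 0.2 mm resolution (range assumed)

def grid_search(simulate_to_convergence, target_temp, tol=0.5):
    """Sweep all fixed (rpm, exv) action pairs, keep those whose converged
    temperature lies within `tol` of the target, and return the lowest-work
    hit as (rpm, exv, work), or None if no pair converges to the target."""
    hits = []
    for rpm, exv in product(RPM_GRID, EXV_GRID):
        temp, work = simulate_to_convergence(rpm, exv)
        if abs(temp - target_temp) <= tol:
            hits.append((rpm, exv, work))
    return min(hits, key=lambda h: h[2]) if hits else None
```

The sweep costs `len(RPM_GRID) × len(EXV_GRID)` simulator runs; with n actuators the cost is the product of n grid sizes, which is why the cost grows exponentially as actuators are added.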

V. CONCLUSION
In this paper, we proposed a MADRL-based control method for an EV HVAC system. The key conclusions of this work are as follows. First, through the multi-agent architecture, various objectives such as temperature, subcooling, and efficiency can be pursued simultaneously. Second, the proposed method reduces the energy consumption in the transient phase to 53% of that of PID control. Third, experiments show that it is possible to control the system with a similar level of efficiency without an optimal subcooling setpoint, relying only on the reward function. Furthermore, the converged SC values can be used as new setpoints.
Our study was validated in the A/C mode under a simulated environment. In future research, we will validate our model in the heat pump mode with more actuators. We also expect to generalize our model to a real EV with minimal tuning.

JEONGHOON LEE received the B.S. and M.S. degrees in mechanical engineering from Hanyang University, Seoul, and the Ph.D. degree from the School of Mechanical and Aerospace Engineering, KAIST, where he studied thermal comfort evaluation inside a passenger vehicle compartment using 3-D image reconstruction. He is a Technical Fellow with Hanon Systems Company and played a leading role in applying CO2 sensors, humidity sensors, and external variable compressor control to mass production in vehicles. He is currently focusing on industrial artificial intelligence, as well as simulation and test using domain knowledge-based AI dynamics modeling and embedding it into control units.
VOLUME 11, 2023