A Reinforcement Learning Approach for Transient Control of Liquid Rocket Engines

Nowadays, liquid rocket engines use closed-loop control at most near steady operating conditions. The control of the transient phases is traditionally performed in open-loop due to highly nonlinear system dynamics. This situation is unsatisfactory, in particular for reusable engines. The open-loop control system cannot provide optimal engine performance due to external disturbances or the degeneration of engine components over time. In this paper, we study a deep reinforcement learning approach for optimal control of a generic gas-generator engine's continuous start-up phase. It is shown that the learned policy can reach different steady-state operating points and convincingly adapt to changing system parameters. A quantitative comparison with carefully tuned open-loop sequences and PID controllers is included. The deep reinforcement learning controller achieves the highest performance and requires only minimal computational effort to calculate the control action, which is a big advantage over approaches that require online optimization, such as model predictive control.


I. INTRODUCTION
THE demands on the control system of liquid rocket engines have significantly increased in recent years [1], in particular for reusable engines. Advanced mission scenarios, e.g. in-orbit maneuvers or propulsive landings, require deep throttling and re-start capabilities. The aging of reusable engines also requires a robust control system as the performance of engine components might degrade over time, e.g. due to soot depositions [2]-[4], increased leakage mass flows caused by seal aging [5], or turbine blade erosions [6]. The cost-efficient operation of a reusable launch vehicle is only possible if the engines possess a long service life without expensive maintenance.
Nowadays, most liquid rocket engines use predefined valve sequences to drive the system from the start signal to a desired steady-state and to shut down the engine safely. These control sequences are usually determined during costly ground tests. Closed-loop control is at most used near steady operating conditions to maintain a desired combustion chamber pressure and mixture ratio [7]. The resulting lower deviations of the controlled variables decrease the amount of extra propellant to be carried, which in turn increases the payload capacity of the launch vehicle. Although the importance of closed-loop control has been evident for many years, the majority of rocket engines still employ valves operated by pneumatic actuators, which are too inefficient for a sophisticated closed-loop control system. The development of an all-electric control system started in the late 1990s in Europe [8]. The future European Prometheus engine will have such a system [9]. Other countries are also well advanced in the research and development of electrically operated flow control valves [10]. Due to the electrification of the actuators and the increased demands, the interest in closed-loop solutions has grown recently and will rise further in the future.
Furthermore, optimal control of the engine operation, including the transient phases, is the only way to realize high-performing systems, which also comply with the aforementioned demands on the control system of future liquid rocket engines [11]. One way to solve optimal control problems is to use reinforcement learning (RL). Although the application of such modern methods of artificial intelligence seems unorthodox in this setting, it offers certain advantages. First, given a suitable simulation environment, RL algorithms can automatically generate optimal transient sequences. Second, the trained RL controller requires only minimal computational effort to calculate the control action, so it can easily be used for closed-loop control of the demanding transient phases. Third, RL is well suited for complex control tasks, including multiple objectives and multiple regimes [12]. Optimal control using RL [13] has been studied in many different areas, from robotics [14], [15] and medical science [16] to flight control [17], [18] and process control [19]. Furthermore, the benefits of an intelligent engine control system, where artificial intelligence techniques are used for control reconfiguration and condition monitoring, have already been investigated in the space shuttle era [20], [21].
The objective of our work is analogous to the investigation of Pérez-Roca et al. [22], where a model predictive control (MPC) approach to control the start-up transient of a liquid rocket engine was studied. After the derivation of a suitable state-space model [23], a linear MPC controller was synthesized. The controller completes the start-up and can track the end-state references with sufficient accuracy. MPC and RL have specific advantages and disadvantages. The work presented here aims to evaluate the capabilities and limitations of RL for liquid rocket engine control. Our main contributions are the following:
• formulation of optimal start-up control as an RL problem
• training and evaluation of the RL controller for multiple operating conditions and degrading turbine efficiencies
• quantitative comparison with carefully tuned open-loop sequences and PID controllers
The remainder of this paper is structured as follows: Section II describes the basics of RL and presents pseudocode of the RL algorithm used. The simulation environment and its coupling with the RL algorithms are outlined in Section III. Section IV discusses the test case. Section V reports the results, including the comparison with the performance of PID controllers. Finally, Section VI provides concluding remarks.

II. REINFORCEMENT LEARNING
In this section, we review basic RL concepts [24]. RL algorithms can be used to solve optimal control problems stated as Markov decision processes (MDPs) [25]. MDPs provide a mathematical framework for modeling decision making in situations where the system may change in a stochastic manner. Standard MDPs work in discrete time: at each time step, the controller (usually called the agent in RL) receives information on the state of the system and takes an action in response. The decision rule is called a policy in RL. The action changes the state of the system, and the latest transition is evaluated via a reward function. The optimal control objective is to maximize the (expected) cumulative reward from each initial state. Formally, an MDP consists of the state-space $X$ of the system, the action (input) space $U$, the transition function (dynamics) $f$ of the system, and the reward function $\rho$ (negative costs). Due to the origins of the field in artificial intelligence, the usual notation would be $S$ for the state-space, $A$ for the action space, $P$ for the dynamics, and $R$ for the reward function. In this paper, notation inspired by control theory is used. As a result of the action $u_k$ applied in state $x_k$ at discrete time step $k$, the state changes to $x_{k+1}$ and a scalar reward $r_{k+1} = \rho(x_k, u_k, x_{k+1})$ is received. The goal is to find a policy $\pi$, so that $u_k = \pi(x_k)$, that maximizes the cumulative reward, typically the expected discounted sum over the infinite horizon:

$$\mathrm{E}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{k+1} \right\} \tag{1}$$

where $\gamma \in (0, 1]$ is the discount factor. The mapping from a state $x_0$ to the value of the cumulative reward for a policy $\pi$ is called the (state) value function $V^\pi(x_0)$:

$$V^\pi(x_0) = \mathrm{E}\left\{ \sum_{k=0}^{\infty} \gamma^k \rho(x_k, \pi(x_k), x_{k+1}) \right\} \tag{2}$$

The control objective is to find an optimal policy $\pi^*$ that leads to the maximal value function, for all $x_0$:

$$V^*(x_0) = \max_{\pi} V^\pi(x_0) \tag{3}$$

Although state-value functions suffice to define optimality, it is useful to define action-value functions, called Q-functions.
The action-value function gives the expected cumulative reward if one starts in state $x$, takes an arbitrary action $u$ (which may not have come from the policy), and then forever after acts according to policy $\pi$:

$$Q^\pi(x, u) = \mathrm{E}_{x'}\left\{ \rho(x, u, x') + \gamma V^\pi(x') \right\} \tag{4}$$

where the prime notation indicates quantities at the next discrete time step. The optimal Q-function $Q^*$ is defined using $V^*$. Once an optimal Q-function $Q^*$ is available, an optimal policy $\pi^*$ can be computed by

$$\pi^*(x) \in \arg\max_{u} Q^*(x, u) \tag{5}$$

while the formula to compute $\pi^*$ from $V^*$ is more complicated. As a consequence of the definitions, the Q-functions $Q^\pi$ and $Q^*$ fulfill the Bellman equations:

$$Q^\pi(x, u) = \mathrm{E}_{x'}\left\{ \rho(x, u, x') + \gamma Q^\pi(x', \pi(x')) \right\} \tag{6}$$

and

$$Q^*(x, u) = \mathrm{E}_{x'}\left\{ \rho(x, u, x') + \gamma \max_{u'} Q^*(x', u') \right\} \tag{7}$$

which are of central importance in RL. The crucial advantage of RL algorithms is that they do not require a model of the system dynamics. Instead, an optimal policy can be found by learning from samples of transitions and rewards. The problem formulation with MDPs and the associated solution techniques also handle nonlinear, stochastic dynamics, and nonquadratic reward functions. Perhaps the most popular RL algorithm is Q-learning. In Q-learning, one starts from an arbitrary initial Q-function $Q_0$ and updates it using observed state transitions and rewards. The update rule is of the following form:

$$Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma \max_{u'} Q_k(x_{k+1}, u') - Q_k(x_k, u_k) \right] \tag{8}$$

where $\alpha_k \in (0, 1]$ is the learning rate. The term inside the square bracket is nothing else than the difference between the updated estimate of the optimal Q-value of $(x_k, u_k)$ and the current estimate $Q_k(x_k, u_k)$. Under mild assumptions on the learning rate, and provided that a suitable exploratory policy is used to obtain samples, i.e. data tuples of the form $(x_k, u_k, x_{k+1}, r_{k+1})$, Q-learning asymptotically converges to $Q^*$, which satisfies the Bellman optimality equation. The reader is referred to [26] for a description of similar RL algorithms. Q-learning and its many variants require that Q-functions and policies are exactly represented, e.g. as a table indexed by the discrete states and actions.
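The tabular Q-learning update described above can be illustrated with a minimal Python sketch. The five-state chain environment, the uniformly random exploration policy, and all hyperparameter values are assumptions chosen for illustration only; they are not part of the engine study.

```python
import random


def chain_step(x, u):
    """Hypothetical 5-state chain: u=1 moves right, u=0 moves left.

    Reaching state 4 ends the episode with reward 1; all other rewards are 0.
    """
    x_next = min(x + 1, 4) if u == 1 else max(x - 1, 0)
    done = x_next == 4
    return x_next, (1.0 if done else 0.0), done


def q_learning(n_states, n_actions, step, episodes=500, alpha=0.1,
               gamma=0.9, max_steps=100, seed=0):
    """Tabular Q-learning with a uniformly random exploratory policy."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        x = 0
        for _ in range(max_steps):
            u = rng.randrange(n_actions)          # exploratory behavior policy
            x_next, r, done = step(x, u)
            # Target: reward plus discounted best Q-value of the next state
            # (no bootstrapping from a terminal state).
            target = r + (0.0 if done else gamma * max(Q[x_next]))
            Q[x][u] += alpha * (target - Q[x][u])  # update of the form (8)
            x = x_next
            if done:
                break
    return Q


Q = q_learning(5, 2, chain_step)
greedy_policy = [max(range(2), key=lambda a: Q[x][a]) for x in range(4)]
```

Because Q-learning is off-policy, the greedy policy extracted from the learned table can differ from the random policy that generated the samples.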
Especially for the control of physical systems, the states and actions are continuous; moreover, exact representations are in general impossible. Standard tabular Q-learning does not work in this setting. Fortunately, methods like Q-learning can be combined with function approximation. We denote approximate versions of the Q-function and the policy by $\hat{Q}(x, u; \theta)$ and $\hat{\pi}(x; w)$, where $\theta$ and $w$ are the parameters of parametric approximators. There are many different function approximators to choose from.
The combination of RL with deep neural networks (DNNs) as function approximators leads to the field of deep RL. In recent years, deep RL algorithms have achieved impressive results, such as reaching super-human performance in the game of Go. Besides the sensational results in board games or video games, those algorithms are successfully used in areas like robotics. In deep Q-learning, one uses a neural network to approximate the Q-function. Neural networks can represent any smooth function arbitrarily well given enough parameters, and therefore they can learn complex Q-functions. Loss functions and gradient descent optimization are used to fit the parameters of the models. Gradient estimates are usually averaged over individual gradients computed for a batch of experiences.
Nevertheless, the simple training procedure is unstable, because sequential observations are correlated, and techniques like experience replay have to be used. Correlated experiences are saved into a replay buffer. When batches of experiences are needed for training, these batches are generated by sampling from the replay buffer in a randomized order. A further reason for the simple training procedure's instability is that the target values depend on the parameters one wants to optimize. The solution is to use a so-called target network, $\hat{Q}(x, u; \theta^-)$, with target parameters $\theta^-$, which slowly track the online parameters. While deep Q-learning solves problems with continuous state-spaces, it can only handle discrete and low-dimensional action spaces. The reason for that is the following: (deep) Q-learning requires fast maximization of Q-functions over actions. When there is a finite number of discrete actions, this poses no problem. However, when the action space is continuous, this is highly non-trivial (and would be a computationally very expensive subroutine).
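The two stabilization techniques, experience replay and slowly tracking target parameters, can be sketched in a few lines of Python. The class name `ReplayBuffer`, the function `soft_update`, and the tracking rate `tau` are illustrative assumptions; the text only states that the target parameters slowly track the online parameters, and an exponential moving average is one common way to realize that.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer; the oldest transitions are dropped when it is full."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, transition):
        # transition = (x, u, r, x_next, d)
        self.data.append(transition)

    def sample(self, batch_size):
        # Sampling without replacement in random order breaks the
        # correlation between sequential observations.
        return self.rng.sample(list(self.data), batch_size)


def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a small step toward its online counterpart.

    tau is an assumed tracking rate, not a value taken from the paper.
    """
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

A small `tau` keeps the regression targets nearly fixed between updates, which is the purpose of the target network.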
The deep deterministic policy gradient (DDPG) [27] algorithm is specially adapted for environments with continuous action spaces. It uses neural networks to approximate both the Q-function and a deterministic policy, i.e. the policy network deterministically maps a state to a specific action. For exploration, one adds noise sampled from a stochastic process N to the actions of the deterministic policy and updates it by a gradient-based learning rule. As in deep Q-learning, the DDPG algorithm uses a replay buffer and target networks to improve stability during neural network training. Further details of the DDPG algorithm and its performance on different simulated physics tasks are given by Lillicrap et al. [27].
Although the DDPG algorithm is quite powerful, it has a direct successor, the Twin Delayed DDPG (TD3) [28] algorithm, which further improves the stability by employing three critical tricks. The first trick addresses a particular failure mode of the DDPG algorithm: if the Q-function approximator develops an incorrect sharp peak for some actions, the policy will quickly exploit that peak and then have brittle or incorrect behavior. This failure mode can be averted by smoothing out the Q-function over similar actions. For this, one computes the action that is used to form the Q-learning target in the following way:

$$u'(x') = \mathrm{clip}\left( \hat{\pi}(x'; w^-) + \mathrm{clip}(\epsilon, -c, c),\; x_{\mathrm{Low}},\; x_{\mathrm{High}} \right) \tag{9}$$

where $\epsilon \sim \mathcal{N}(0, \sigma)$ is noise sampled from a Gaussian process (target policy noise) and $w^-$ denotes the parameters of the target policy. The action is based on the target policy, but with clipped noise added (target noise clip $c$). After adding the noise, the target action is also clipped to lie in the valid action range $(x_{\mathrm{Low}}, x_{\mathrm{High}})$. The second trick is to learn two Q-functions $\hat{Q}(x, u; \theta_i)$, for $i = 1, 2$, instead of one and use the smaller of the two Q-values to form the target. This improvement reduces overestimation in the Q-function.

The third trick is to update the policy less frequently than the Q-functions (policy delay) to damp the volatility that arises in the DDPG algorithm. Algorithm 1 shows the full pseudocode of the TD3 algorithm. The done signal $d$ is equal to one when $x'$ is a terminal state and otherwise equal to zero. The done signal guarantees that the agent gets no additional rewards after the current state at the end of an episode.

Algorithm 1 Twin Delayed DDPG (TD3)
 1: Input: initial policy parameters $w$, Q-function parameters $\theta_1$, $\theta_2$, empty replay buffer $D$
 2: Set target parameters equal to main parameters: $w^- \leftarrow w$, $\theta_1^- \leftarrow \theta_1$, $\theta_2^- \leftarrow \theta_2$
 3: repeat
 4:   Observe state $x$ and select action $u = \mathrm{clip}(\hat{\pi}(x; w) + \epsilon, x_{\mathrm{Low}}, x_{\mathrm{High}})$, where $\epsilon \sim \mathcal{N}$
 5:   Execute $u$ in the environment
 6:   Observe next state $x'$, reward $r$, and done signal $d$ to indicate whether $x'$ is terminal
 7:   Store $(x, u, r, x', d)$ in replay buffer $D$
 8:   If $x'$ is terminal, reset environment state
 9:   if it is time to update then
10:     for $j$ in range(however many updates) do
11:       Randomly sample a batch of transitions $B = \{(x, u, r, x', d)\}$ from $D$
12:       Compute target actions $u'(x')$ according to (9)
13:       Compute targets $y = r + \gamma (1 - d) \min_{i=1,2} \hat{Q}(x', u'(x'); \theta_i^-)$
14:       Update Q-functions by one step of gradient descent on the mean squared error between $\hat{Q}(x, u; \theta_i)$ and $y$ over $B$, for $i = 1, 2$
15:       if $j$ mod policy delay $= 0$ then
16:         Update policy by one step of gradient ascent on the batch mean of $\hat{Q}(x, \hat{\pi}(x; w); \theta_1)$
17:         Update target networks so that the target parameters slowly track the main parameters
18:       end if
19:     end for
20:   end if
21: until convergence
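The target computation inside the update loop of Algorithm 1 combines the first two tricks (target policy smoothing and the minimum over two Q-functions) with the done signal. The following NumPy sketch shows one possible form of that step. The function name `td3_targets`, the callables passed in, the argument names `u_low` and `u_high` for the valid action range, and all default hyperparameter values are assumptions for illustration, not the settings used in this work.

```python
import numpy as np


def td3_targets(r, x_next, d, target_policy, target_q1, target_q2,
                gamma=0.99, sigma=0.2, c=0.5, u_low=-1.0, u_high=1.0,
                rng=None):
    """Compute TD3 regression targets for a batch of transitions.

    r, d: arrays of shape (n,); x_next: array of shape (n, state_dim).
    target_policy(x) returns actions of shape (n, action_dim);
    target_q1/target_q2(x, u) return Q-values of shape (n,).
    """
    rng = np.random.default_rng() if rng is None else rng
    u_det = target_policy(x_next)
    # Trick 1: add clipped Gaussian noise, then clip to the valid range.
    noise = np.clip(rng.normal(0.0, sigma, size=u_det.shape), -c, c)
    u_next = np.clip(u_det + noise, u_low, u_high)
    # Trick 2: use the smaller of the two target Q-values.
    q_min = np.minimum(target_q1(x_next, u_next), target_q2(x_next, u_next))
    # The done signal d removes the bootstrap term for terminal states.
    return r + gamma * (1.0 - d) * q_min
```

The third trick (policy delay) does not appear here because it only affects how often the policy and target networks are updated, not how the targets are formed.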
In addition to enhancements that improve the stability of the training process, research is also carried out to speed up the learning process of RL agents [29]. Besides DDPG, TD3, or SAC [30], which are so-called off-policy algorithms, there are also state-of-the-art on-policy algorithms like TRPO [31] or PPO [32]. Nevertheless, on-policy methods are considerably less sample efficient and need longer training times to reach equivalent performance. From a control perspective, reinforcement learning converts the system identification problem and the optimal control problem to machine learning problems. Similar to explicit model predictive control, it also removes one of the main drawbacks of model predictive control, namely the need to solve a complex optimization problem online to compute the control action.
The main advantages of RL for control:
• no derivation of a suitable state-space model, model order reduction or linearization needed
• direct use of a nonlinear simulation model
• ideal for highly dynamic situations (no complex online optimization needed)
• complex reward functions enable complicated goals
The main disadvantage of RL for control:
• stability of the controller is in general not guaranteed
Concerning the last point (stability), we would like to make a remark. The controller's output can always be tested using the simulation environment, and there has been promising recent work on certifying stability of RL policies [33].

III. SIMULATION ENVIRONMENT AND RL IMPLEMENTATION
A suitable simulation environment for our intended use is given by EcosimPro [34]. EcosimPro is a modeling and simulation tool for 0D or 1D multidisciplinary continuous and discrete systems. The system description is based on differential-algebraic equations and discrete events. Within a graphical user interface, one can combine different components, which are arranged in several libraries. Of particular interest are the European Space Propulsion System Simulation (ESPSS) libraries, which are commissioned by the European Space Agency (ESA). These EcosimPro libraries are suited for the simulation of liquid rocket engines and have continuously been upgraded in recent years.
We use the TD3 implementation of Stable-Baselines [35]. Stable-Baselines is a set of improved implementations of RL algorithms based on OpenAI Baselines. It features a common interface for many modern RL algorithms and additional wrappers for preprocessing, monitoring, and multiprocessing. We encapsulate our simulation environment into a custom OpenAI Gym environment using an interface between EcosimPro and Python. Hence, we can directly use Stable-Baselines for training and testing. A big advantage of the RL approach is that it works regardless of whether one uses a lumped parameter model, continuous state-space models, surrogate models employing artificial neural networks [36], [37], or a combination of the above.
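To make the encapsulation idea concrete, the following sketch shows the reset/step interface that Gym-style environments expose. It is not the interface used in this work: the class `ToyStartupEnv` is hypothetical, the plant is a first-order lag standing in for the EcosimPro engine model, and the gain, time constant, horizon, and reward are placeholders. Only the 25 Hz interaction rate (a step of 0.04 s) is taken from the text. A real setup would additionally subclass the Gym environment base class and declare the observation and action spaces.

```python
import numpy as np


class ToyStartupEnv:
    """Gym-style reset/step wrapper around a stand-in plant.

    The first-order lag below is NOT an engine model; it only marks the
    place where the call into the external simulation would go.
    """

    def __init__(self, p_ref=100.0, dt=0.04, horizon=125,
                 gain=100.0, tau=0.3):
        self.p_ref, self.dt, self.horizon = p_ref, dt, horizon
        self.gain, self.tau = gain, tau    # placeholder plant parameters
        self.p, self.k = 0.0, 0

    def _obs(self):
        # Normalized set-point error as the only observation.
        return np.array([(self.p - self.p_ref) / self.p_ref],
                        dtype=np.float32)

    def reset(self):
        self.p, self.k = 0.0, 0
        return self._obs()

    def step(self, action):
        # Valve position in [0, 1]; accepts a scalar or a length-1 sequence.
        u = float(np.clip(np.asarray(action, dtype=float).ravel()[0],
                          0.0, 1.0))
        # Placeholder dynamics: explicit Euler step of a first-order lag.
        self.p += self.dt / self.tau * (self.gain * u - self.p)
        self.k += 1
        reward = -abs(self.p - self.p_ref) / self.p_ref
        done = self.k >= self.horizon
        return self._obs(), reward, done, {}
```

Keeping the simulator behind `reset` and `step` is what allows the same training code to be reused when the plant model is exchanged, for example for a surrogate model.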

IV. TEST CASE
The engine architecture considered to study the suitability of an RL approach for the control of the transient start-up is shown in Fig. 1. It is similar to the architecture of the European Vulcain 1 engine [38], which powered the cryogenic core stage of the Ariane 5 launch vehicle before it was replaced by Vulcain 2. The engine uses a gas-generator cycle, where a small amount of the propellants is burned in a small combustion chamber, the gas-generator (GG). The gas-generator is operated at a fuel-rich mixture ratio of 0.9. The produced hot-gas is used to drive the turbines before it is exhausted. The turbines power the pumps which force the propellants into the combustion chambers. LH2 is used to cool the nozzle and main combustion chamber before it gets burned. A convergent-divergent nozzle, which usually includes an uncooled nozzle extension (NE), accelerates the combustion gases to generate thrust. The actuators are given by five flow control valves (VCO, VCH, VGO, VGH, VGC). VCO and VCH are the main combustion chamber valves that regulate the propellant flow to the combustion chamber. VGO and VGH, the gas-generator valves, are used to control the gas-generator pressure and mixture ratio. The turbine valve, VGC, is located downstream of the gas-generator and is used to determine the hot-gas flow ratio between the LOX and LH2 turbines. Thus, this valve mainly influences the global mixture ratio (PI, pump-inlet). Further actuators are the ignition systems (IGN) for the main combustion chamber and the gas-generator, as well as a turbine starter. The turbine starter produces hot-gas for a short period to spin up the turbines during the start-up.
To start the engine and reach steady-state conditions, a succession of discrete events, including valve openings and chamber ignitions, is necessary. The start-up sequence of an engine, i.e. the chronological order of oxidizer and fuel valve openings, as well as the precise ignition timings, determines the engine's thermodynamic conditions and mechanical stresses during start-up. A non-ideal start-up sequence can damage the engine, e.g. by excessive temperatures. These high temperatures can substantially damage the turbine blades or at least reduce their life expectancy [39]. An optimal start-up sequence leads to a smooth ignition of the combustion chamber and gas-generator with low thermal and mechanical stresses. An open-loop start-up sequence (OLS) for a steady-state chamber pressure of 100 bar is shown in Fig. 2. The sequence does not correspond exactly to the Vulcain 1 start-up sequence, but it is realistic for such an engine cycle. The flow control valves are opened monotonically until the end positions are reached. First the VCH valve starts to open at t = 0.1 s, followed by VCO at t = 0.6 s. A fuel-lead transient is usually used for a smooth ignition of the combustion chamber, which takes place at t = 1.0 s. At this point, the main combustion chamber is burning at low pressure, only fed by the tank pressurization. At t = 1.1 s, the turbine starter activates to spin up the turbopumps, which start to build up the pressure in the main combustion chamber and at the gas-generator valves VGO, VGH. At t = 1.4 s and t = 1.5 s, the gas-generator valves VGH and VGO open and the gas-generator is ignited. The VGC valve is set to a fixed position during the entire start-up sequence. At t = 2.6 s, the turbine starter is burned out and the engine reaches steady-state conditions after approximately 4 s. The valve positions in Fig. 2 are tuned to reach a main combustion chamber pressure $p_{CC}$ of 100 bar, a global mixture ratio $MR_{PI}$ of 5.2 and a gas-generator mixture ratio $MR_{GG}$ of 0.9.
Although RL can solve discrete or hybrid control problems, there are controllability and observability issues during the first phase of discrete events due to very low mass flows [21]. Thus we focus on the fully continuous phase starting at t = 1.5 s. We study different reference values for the combustion chamber pressure, namely 80 and 100 bar. The reference mixture ratios remain the same, 5.2 for the global mixture ratio and 0.9 for the mixture ratio of the gas-generator. For a combustion chamber pressure of 80 bar, the valve timings are the same, but the final valve positions were adjusted accordingly (see Fig. 9). Furthermore, we study the effect of degrading turbine efficiencies on the start-up transient. This scenario has practical relevance for future reusable engines. The use of cryogenic propellants leads to significant thermostructural challenges in the operation of turbopumps. Since thermal stresses depend on the temperature gradient, they can cause significant loads on the metal parts that have to react to these stresses. The resulting fatigue deformation [39] affects the performance of the turbines. Furthermore, the aging of seals can cause increased leakage mass flows, which in turn decrease the turbine efficiency [5]. Additional reasons are turbine blade erosions [6] and soot depositions on the turbine nozzles by fuel-rich gases when using hydrocarbons as fuel. These soot depositions can decrease the effective nozzle area by up to 20 % [2], thus reducing the turbopump performance. Furthermore, soot depositions are a main shortcoming for reusable engines due to the unpredictable impacts on engine re-start [40].
To study the effect of degrading turbine efficiencies for our generic test case, we simulate and evaluate the performance of the open-loop start-up sequence, a family of PID controllers, and our RL agent for 16 different combinations of LOX and LH2 turbine efficiencies. For each turbine, 4 different efficiencies are considered, ranging from 100 % to 85 % of the nominal value.
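The 16 test combinations follow from pairing four efficiency levels per turbine. A short sketch, assuming equally spaced levels (the text gives only the range of 100 % to 85 % and the number of levels, so the intermediate values 95 % and 90 % are an assumption):

```python
from itertools import product


def efficiency_grid(levels=(1.00, 0.95, 0.90, 0.85)):
    """All (LOX turbine, LH2 turbine) efficiency factor combinations.

    The factors scale the nominal turbine efficiencies; the intermediate
    levels are assumed to be equally spaced.
    """
    return list(product(levels, levels))


grid = efficiency_grid()
```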
The reward, which is used to evaluate a start-up sequence and to train the RL agent, consists of three terms: $r_{sp}$, $r_{GG}$, and $r_{valve}$. The first term, $r_{sp}$, evaluated for $x_i \in [p_{CC}, MR_{GG}, MR_{PI}]$, penalizes deviations from the desired set-point for all controlled variables. Each reward component in this term is clipped to a maximum value of 0.2 to improve training and to balance the accumulated reward during start-up and steady-state. The second term, $r_{GG}$, additionally penalizes high mixture ratios in the gas-generator. High mixture ratios are dangerous because they result in increased temperatures and thus possibly damaging conditions for the turbines. The last reward component, $r_{valve}$, where $s$ is the change in valve position between two time steps, penalizes excessive valve motion. By adding this component, we encourage the agent to move the valves as little as possible to avoid valve wear, valve oscillations, and valve jittering. Altogether, this reward allows the agent to trade off between reaching the desired reference point as fast as possible, avoiding steady-state errors, minimizing overshoots, and reducing valve motion as much as possible. Fig. 2 shows all 3 components of the cumulative reward for the nominal OLS for 100 bar. Since the valves are only moved once in the OLS, the contribution of $r_{valve}$ to the total reward is low. As the overshoot in the gas-generator mixture ratio is also small (small $r_{GG}$), the total reward is mainly composed of the set-point error $r_{sp}$. To train and use an RL agent, one needs to define the observation and action space of the agent. The observation space, i.e. the variables the agent receives from the environment at each time step, should at least contain sufficient information to unambiguously define the state of the system.
In our set-up, the observation space

$$X = [p_{CC,ref},\ e_{CC},\ e_{PI},\ e_{GG},\ Pos_{VGO},\ Pos_{VGH},\ Pos_{VGC},\ \omega_{LOX},\ \omega_{LH2}] \tag{15}$$

contains 9 variables, where $e_i = x_i - x_{i,ref}$ is the absolute error for each controlled variable, $Pos_{VGO}$, $Pos_{VGH}$, and $Pos_{VGC}$ are the positions of all control valves, and $\omega_{LOX}$ and $\omega_{LH2}$ are the rotational speeds of the turbopumps. The observation space is normalized with the reference steady-state values. All variables in our observation space are measurable in real engines. Thus our approach is not limited to simulation environments, where one could possibly use variables that are impossible to measure directly in real engines (e.g. the turbine efficiencies). The agent's action space $U$ consists of all 3 gas-generator valve positions, $U = [Pos_{VGO}, Pos_{VGH}, Pos_{VGC}]$.
At each time step, the RL agent receives observations from the environment and sends control signals to the flow control valves of the engine. The frequency of interaction between the controller (RL agent and PID) and the environment is set to 25 Hz.
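As an illustration of how the 9-variable observation could be assembled at each 25 Hz step, consider the following sketch. The function `build_observation`, the dictionary keys, and the particular normalization (errors divided by their reference values, reference pressure and pump speeds divided by nominal values) are assumptions; the text only states that the observation is normalized with the reference steady-state values.

```python
import numpy as np

# Order of the controlled variables in the observation: CC, PI, GG.
CONTROLLED = ("p_cc", "mr_pi", "mr_gg")


def build_observation(meas, ref, valve_pos, omega, nominal):
    """Assemble the observation vector in the order of the observation space.

    meas, ref: measured and reference values of the controlled variables.
    valve_pos: positions of VGO, VGH, VGC (assumed to lie in [0, 1]).
    omega: turbopump speeds; nominal: assumed normalization constants.
    """
    errors = [(meas[k] - ref[k]) / ref[k] for k in CONTROLLED]
    obs = [
        ref["p_cc"] / nominal["p_cc"],
        *errors,
        valve_pos["VGO"], valve_pos["VGH"], valve_pos["VGC"],
        omega["LOX"] / nominal["omega_LOX"],
        omega["LH2"] / nominal["omega_LH2"],
    ]
    return np.asarray(obs, dtype=np.float32)
```

Note that no turbine efficiency enters the vector; the agent has to infer degraded efficiencies from the relationship between valve positions, pump speeds, and errors.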

V. RESULTS
In this section, we assess the performance of our RL controller. For this, we use a discrete approximation of the integrated absolute error (IAE) over one entire episode for each controlled variable, evaluated at the discrete time steps $t_j$. Furthermore, we evaluate the average steady-state values of the controlled variables from t = 3.5 s to t = 5.0 s and the value of the cumulative reward. Before we turn to the performance of closed-loop control, let us record the downsides of open-loop sequences (OLS). The first column in Fig. 5 shows the resulting engine start-up for the nominal OLS and degrading turbine efficiencies. For the latter, the steady-state values deviate strongly from the reference values. The minimum steady-state value of the main combustion chamber pressure is 92 bar. The steady-state of the global mixture ratio varies between 4.9 and 6.0. To prevent fuel or oxidizer from running out during a mission in the event of a persisting mixing ratio deviation, the loaded propellants must be increased, which reduces the payload capacity of the launch vehicle. A further negative effect is that the temperature in the combustion chamber can rise significantly due to a shift in the mixing ratio, which could reduce the engine's service life. Additionally, the steady-state value of the mixture ratio of the gas-generator changes too. The temperature in the gas-generator is sensitive to the mixture ratio, and an increased temperature can also damage the turbines. These damaging conditions are especially problematic for reusable engines, which must possess a long service life. The same implications apply to the 80 bar case, as Fig. 8 shows.
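The two evaluation measures introduced at the start of this section can be sketched as follows. The rectangle-rule form of the IAE approximation is an assumption (the exact discretization is not recoverable from the text), and the function names are hypothetical; the averaging window of 3.5 s to 5.0 s is taken from the text.

```python
import numpy as np


def iae(t, x, x_ref):
    """Rectangle-rule approximation of the integrated absolute error.

    t: sample times t_j; x: controlled variable at t_j; x_ref: set-point.
    """
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    dt = np.diff(t)                       # step lengths t_j - t_{j-1}
    return float(np.sum(np.abs(x[1:] - x_ref) * dt))


def steady_state_mean(t, x, t_start=3.5, t_end=5.0):
    """Average of x over the steady-state window [t_start, t_end]."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    mask = (t >= t_start) & (t <= t_end)
    return float(np.mean(x[mask]))
```

For a signal sampled at 25 Hz that stays 10 bar below a 100 bar set-point for 5 s, the first function returns 50 bar·s, which is the kind of number that makes fast and slow start-ups directly comparable.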
Those unfavorable effects can be counteracted with a closed-loop control system. First, we tune a family of PID controllers to achieve the start-up. The process of controlling the chamber pressure of the main combustion chamber, the mixture ratio of the gas-generator, and the global mixture ratio by manipulating VGO, VGH, and VGC is coupled. For example, changing VGO affects not only the mixture ratio of the gas-generator but also the other two controlled variables. Nevertheless, for rocket engine control near steady-state conditions, the standard approach is to use separate PID controllers and tune the control loops at different speeds to avoid oscillations [7]. Hence, we also use three separate controllers.
The first controller manipulates VGO to control the mixture ratio of the gas-generator, the second controller manipulates VGH to control the chamber pressure of the main combustion chamber, and the third controller manipulates VGC to control the global mixture ratio. Starting far away from the reference point can be problematic for a simple PID controller because the integrator begins to accumulate a significant error during the rise. Consequently, a large overshoot may occur. Modern PID controllers use different methods to address this problem of integrator windup. We use a simple feedback loop, where the difference between the actual and the commanded actuator position is fed back to the integrator, to avoid the effects of saturation. If there is no saturation, our anti-windup scheme has no effect. The ratio between the time constant for the anti-windup and the integration time is 0.1 for all PID controllers.
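The anti-windup feedback described above is commonly known as back-calculation. A minimal discrete-time sketch of one such PID loop is given below; the class name, the explicit Euler discretization, the unfiltered derivative, and the actuator limits of 0 and 1 are assumptions, while the ratio of 0.1 between the anti-windup time constant and the integration time is taken from the text.

```python
class PIDAntiWindup:
    """PID controller with back-calculation anti-windup (illustrative)."""

    def __init__(self, kp, ti, td, dt, u_min=0.0, u_max=1.0, tt_ratio=0.1):
        self.kp, self.ti, self.td, self.dt = kp, ti, td, dt
        self.u_min, self.u_max = u_min, u_max
        self.tt = tt_ratio * ti        # anti-windup (tracking) time constant
        self.integral = 0.0            # integral term in actuator units
        self.prev_error = None

    def update(self, error):
        deriv = 0.0 if self.prev_error is None else \
            (error - self.prev_error) / self.dt
        self.prev_error = error
        # Commanded position before saturation.
        u_cmd = self.kp * error + self.integral + self.kp * self.td * deriv
        # Actual position after actuator limits.
        u_act = min(max(u_cmd, self.u_min), self.u_max)
        # Integrator input: error term plus the fed-back difference between
        # actual and commanded position (zero when there is no saturation).
        self.integral += self.dt * (self.kp / self.ti * error
                                    + (u_act - u_cmd) / self.tt)
        return u_act
```

While the actuator is saturated, the feedback term drives the integral toward a bounded value instead of letting it grow with the accumulated error, which limits the overshoot once the set-point is approached.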
For PID parameter tuning, we directly use the simulation model coupled with a genetic algorithm [41] of the Distributed Evolutionary Algorithms in Python (DEAP) framework [42]. To guarantee a fair comparison, we use the reward function to calculate the fitness value of a certain parameter combination. Table IV presents the optimal PID parameters, which maximize the reward function. The genetic algorithm uses a population of 5000 valid individuals and evolves the population for 20 generations. Fig. 3 shows that the best PID controllers open the valves in a nonmonotonic way, which leads to a faster start-up. Furthermore, the PID controllers fulfill their main task: the feedback loops lead to an adjustment of the valve positions at lower turbine efficiencies and significantly reduce the deviations from the reference values of the controlled variables. Due to the structure of PID controllers, with their proportional, integral, and derivative terms, the shape of the control input is restricted and does not provide optimal control. Fig. 5 shows that the optimized PID controllers lead to certain overshoots of the main combustion chamber pressure and the global mixture ratio. It is possible to eliminate the overshoots by changing the PID parameters, but this would significantly increase the settling time. For our parameters, there is still an error in combustion chamber pressure after 4 s even for nominal efficiencies. The settling time is not the only reason for a large error in the combustion chamber pressure in the case of lower turbine efficiencies. For the lowest turbine efficiencies, a combustion chamber pressure of 100 bar is physically no longer possible while maintaining the other constraints (especially the desired gas-generator mixture ratio). A specific disadvantage of PID controllers is that degenerating efficiencies or other system parameters cannot be considered directly as further input variables. Fig. 6 shows that for the 80 bar start-up VGC oscillates slightly. It is challenging to tune a single family of PID controllers for different reference combustion chamber pressures. For even lower combustion chamber pressures (deep throttling), it becomes more and more difficult to achieve a convincing performance for all operating conditions. The prevention of oscillations leads to an increased settling time for all reference values. All in all, the performance of the PID controllers is not perfect but satisfactory for the case of 100 and 80 bar and fixed mixture ratios. Now we examine the performance of our RL approach. The comparison of Fig. 3 and Fig. 4 shows that at first glance the RL agent's behavior shows strong similarities to the PID controllers. The flow control valves are opened in a nonmonotonic way. Nevertheless, the agent achieves an even faster start-up, as presented in Fig. 5. The RL controller can better control the combustion chamber pressure and the global mixing ratio. The control of the gas-generator mixture ratio is comparatively good. Furthermore, the RL agent can directly take the firing of the turbine starter into account. The action changes at t = 2.6 s, which is the time when the firing of the turbine starter stops. Similar to the PID controllers, the RL agent can handle degrading turbine efficiencies to a certain extent. It can detect deviating efficiencies because the relationship between valve positions and controlled variables changes, and it adjusts the start-up accordingly. A prerequisite for this is that the valve positions are also included in the observation space, and that experiences with different efficiencies were generated during the training. Table I compares the rewards, steady-state values, and IAEs of the studied approaches for nominal turbine efficiencies and both main combustion chamber pressures of 100 bar and 80 bar. The open-loop sequences are satisfactory for the nominal start-ups.
Nevertheless, both IAEs and rewards show that improvement is possible. One can start up faster if the valves are opened nonmonotonically. Why is this not done for realistic start-up sequences? As already mentioned, it is common practice to determine the control sequences through tests on test benches, which is expensive and time-consuming. With non-reusable engines, the demands on the control system are less stringent, and one can accept good but not optimal sequences as long as substantial development costs are saved. Another reason is that, as a rule, disturbances influence the start-up anyway and cancel out the advantages of optimized sequences. The advantages can only be realized by closing the control loop. The tuned PID controllers are better than the open-loop sequences concerning the value of the reward. The RL agent is even better. The RL agent and the PID controllers also achieve decent steady-state values.
Table II compares the rewards, steady-state values, and IAEs of the studied approaches for degrading turbine efficiencies. We present the mean and standard deviation of the measures instead of giving all values for the 16 different combinations of turbine efficiencies. For the steady-state values, the minimum and maximum values are also listed in Table II. As already seen in Fig. 5, the OLS results in large deviations for degrading turbine efficiencies. For 100 bar, the steady-state main combustion chamber pressure ranges from 92 bar to 100 bar. Furthermore, degrading turbine efficiencies strongly influence the overall mixture ratio MR PI. Large deviations in MR PI (here from 4.9 to 6.0) pose two major problems. First, the fuel and oxidizer tank volumes are designed for the nominal mixture ratio. Deviations in MR PI result in a non-ideal utilization of the propellants, thus lowering the launcher's performance.
Second, the mixture ratio in the main combustion chamber is directly affected by the overall mixture ratio, potentially resulting in more damaging conditions for the main combustion chamber. The cumulative reward for the OLS increases to a mean value of −14.9 with a large standard deviation of 5.2.
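The IAE used in the comparison above is the time integral of the absolute tracking error. The following is a minimal sketch of how such a value can be computed from a sampled trajectory, assuming uniform sampling and trapezoidal integration. The function name `integral_absolute_error`, the time step, and the sample pressures are hypothetical, and any normalization applied in the tables is not reproduced here.

```python
def integral_absolute_error(values, reference, dt):
    """Approximate the integral of |reference - value| dt (trapezoidal rule)."""
    errors = [abs(reference - v) for v in values]
    if len(errors) < 2:
        return 0.0
    # Trapezoidal rule: full weight for interior samples, half weight at the ends.
    return dt * (sum(errors) - 0.5 * (errors[0] + errors[-1]))


# Hypothetical chamber pressure samples (bar), one every 0.5 s, reference 100 bar.
pressure = [0.0, 40.0, 80.0, 100.0, 100.0]
iae = integral_absolute_error(pressure, reference=100.0, dt=0.5)
```

A faster start-up reaches the reference earlier and therefore accumulates less error, which is why a lower IAE indicates better tracking.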
The performance of both closed-loop controllers highlights the benefits of closed-loop control for degrading turbine efficiencies. The mean and standard deviations of the cumulative rewards are much smaller for the PID controllers and the RL agent. The additional reduction for the RL agent is mainly due to an even faster start-up. The mean steady-state value of p CC is 96.0 bar for the agent, which is slightly closer to 100.0 bar than the value for the PID controllers and much closer than the value for the open-loop sequence. Furthermore, the maximum deviation is the smallest. The advantages of closed-loop control and especially the RL approach are also reflected in the mixture ratios, which are much closer to their nominal values compared to the OLS. The IAEs also show that the RL agent performs better than the PID controllers.
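The PID tuning procedure described earlier in this section, a genetic algorithm whose fitness is the reward obtained in simulation, can be sketched as follows. This is an illustration only: it uses a small plain-Python genetic algorithm instead of DEAP, a toy first-order plant instead of the engine simulator, and an assumed reward (negative accumulated absolute error). All names, bounds, and population sizes are hypothetical.

```python
import random


def simulate_reward(kp, ki, kd, reference=1.0, dt=0.01, steps=400):
    """Closed-loop reward of a PID controller on a toy plant dx/dt = (u - x) / tau."""
    tau = 0.5  # assumed plant time constant, stand-in for the engine model
    x, integral, prev_error = 0.0, 0.0, reference
    reward = 0.0
    for _ in range(steps):
        error = reference - x
        integral += error * dt
        derivative = (error - prev_error) / dt
        u = kp * error + ki * integral + kd * derivative
        u = min(max(u, 0.0), 2.0)  # assumed actuator limits
        x += dt * (u - x) / tau    # explicit Euler step of the plant
        prev_error = error
        reward -= abs(error) * dt  # assumed reward: penalize tracking error
    return reward


def tune_pid(pop_size=30, generations=15, seed=0):
    """Evolve (kp, ki, kd) with elitism, uniform crossover, and Gaussian mutation."""
    rng = random.Random(seed)
    bounds = [(0.0, 10.0), (0.0, 10.0), (0.0, 0.1)]  # assumed search ranges

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def crossover(a, b):
        return [ga if rng.random() < 0.5 else gb for ga, gb in zip(a, b)]

    def mutate(ind):
        return [min(max(g + rng.gauss(0.0, 0.1 * (hi - lo)), lo), hi)
                for g, (lo, hi) in zip(ind, bounds)]

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda ind: simulate_reward(*ind), reverse=True)
        elite = ranked[: pop_size // 5]  # keep the best 20 % unchanged
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=lambda ind: simulate_reward(*ind))


best_gains = tune_pid()
best_reward = simulate_reward(*best_gains)
```

Using the same reward for the PID fitness and for the RL agent keeps the comparison fair, since both controllers are then optimized for the same objective.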

VI. CONCLUSION AND OUTLOOK
In this work, we presented an RL approach for the optimal control of the fully continuous phase of the start-up of a gas-generator cycle liquid rocket engine. Using a suitable engine simulator, we employed the TD3 algorithm to learn an optimal policy. The policy achieves the best performance compared with carefully tuned open-loop sequences and PID controllers for different reference states and varying turbine efficiencies. Furthermore, the prediction of the control action takes only 0.7 ms, which allows a high interaction frequency and, in comparison to MPC, enables the real-time use of RL algorithms for closed-loop control. The modest computational requirements should be met by the current generation of engine control units. A potential drawback of the RL approach is the lack of stability guarantees. Nevertheless, the control system can be tested using a high-fidelity simulation model, and there is ongoing work on certifying stability of RL policies [33].
The present work can be improved in many directions. It is necessary to carefully examine the performance of the controller when various disturbances occur. Disturbance rejection, integration of filtering, and observer design will be the focus of future work. Furthermore, even the most sophisticated models usually have prediction errors due to effects that are not included or model misspecifications. Therefore, it is essential to ensure that controllers trained in a simulation environment are robust enough to be used in real applications. There are RL approaches that explicitly consider modeling errors. Domain randomization [43] can produce agents that generalize well to a wide range of environments. Another issue with RL is implementing hard state constraints. In liquid rocket engine control, for example, one would like to impose hard constraints that limit the maximum rotational speed of the turbopumps and the maximum temperatures to prevent damage to the engine. It is possible to approximate hard state constraints by carefully tuning the reward function, e.g. one can give the agent a sizeable negative reward upon constraint violation and possibly terminate the training episode. Besides, there has been recent work on implementing hard constraints in RL using constrained policy optimization [44].
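The reward-based approximation of hard state constraints mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function `shaped_reward`, the state keys, the limit values, and the penalty magnitude are all hypothetical and are not taken from the reward function used in this work.

```python
def shaped_reward(tracking_error, state, limits, penalty=100.0):
    """Return (reward, done, violated) for one environment step.

    If any monitored state exceeds its limit, the agent receives a large
    negative reward and the episode is flagged for termination.
    """
    violated = sorted(name for name, limit in limits.items() if state[name] > limit)
    if violated:
        return -penalty, True, violated
    return -abs(tracking_error), False, violated


# Hypothetical limits on turbopump speed and gas temperature.
limits = {"turbopump_speed_rpm": 30000.0, "gas_temperature_k": 1000.0}

safe = shaped_reward(2.0, {"turbopump_speed_rpm": 25000.0, "gas_temperature_k": 900.0}, limits)
unsafe = shaped_reward(2.0, {"turbopump_speed_rpm": 31000.0, "gas_temperature_k": 900.0}, limits)
```

Such a penalty only discourages violations during training; it does not guarantee that the learned policy never violates the constraint, which is why constrained policy optimization is of interest.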
We would like to conclude this publication with an outlook on the potential advantages of this approach for rocket engine control. Controllers trained with RL can depend on many input variables, can be used for very different operating conditions, and can include multiple objectives. The thrust control of rocket engines is crucial for improving the performance of the launch vehicle, but it is particularly critical when using rocket engines for the soft landing of returning rocket stages. Deep throttling domains of an engine, i.e. the 25-100% range of nominal thrust, are not expected to pose a problem for RL controllers. Regarding multiple objectives, one can modify the reward function to optimize both the system's performance and damage mitigation [45]. The coupling of sophisticated health monitoring systems, possibly based on machine learning techniques, with suitable policies trained by RL can further increase the reliability of launch systems. Given a suitable simulation environment, end-to-end RL may even enable the training of integrated flight and engine control systems. Overall, it is hoped that the current work will serve as a basis for future studies regarding the application of RL in the field of rocket engine control.

ACKNOWLEDGMENT
The authors would like to thank Wolfgang Kitsche, Robson Dos Santos Hahn, and Michael Börner for valuable discussions concerning the start-up of a gas-generator cycle liquid rocket engine.

APPENDIX I IMPLEMENTATION AND TRAINING DETAILS
The agent is trained for 100 000 time steps, which is equal to approximately 1.5 hours of simulation time. The agent's hyperparameters are tuned with Optuna [46] and are presented in Table III. For exploration, we use action noise sampled from an Ornstein-Uhlenbeck process [47]. Table IV shows the parameters of all three PID controllers and the corresponding controlled variable and control valve. Fig. 9 shows the nominal OLS for a main combustion chamber pressure of 80 bar. The valve positions commanded by the PID controllers and the RL agent for 80 bar are shown in Fig. 6 and Fig. 7. Finally, Fig. 8 compares controller performances for different degraded turbine efficiencies.
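The Ornstein-Uhlenbeck exploration noise mentioned above can be sketched with a simple Euler-Maruyama discretization. This is an illustration only: the class `OrnsteinUhlenbeckNoise`, the helper `noisy_action`, the default values of `theta`, `sigma`, and `dt`, and the assumed action range of 0 to 1 are hypothetical and do not reproduce the tuned values of Table III.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise from a discretized Ornstein-Uhlenbeck process.

    Update rule: x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1).
    All default parameter values are assumptions for illustration.
    """

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.size = size
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance the process by one step and return the new noise vector."""
        scale = self.sigma * math.sqrt(self.dt)
        self.state = [
            x + self.theta * (self.mu - x) * self.dt + scale * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def noisy_action(policy_action, noise, low=0.0, high=1.0):
    """Add exploration noise to a policy action and clip it to the valid range."""
    return [min(max(a + n, low), high) for a, n in zip(policy_action, noise.sample())]


noise = OrnsteinUhlenbeckNoise(size=3, seed=0)
action = noisy_action([0.2, 0.5, 0.9], noise)
```

Because consecutive samples are correlated, the perturbed valve commands change smoothly over time, which tends to suit actuators better than independent noise at every step.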