Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation

Deep reinforcement learning (DRL) has attracted much attention as an approach to solve optimal control problems without mathematical models of systems. In general, however, optimal control problems may be subject to constraints. In this study, we consider optimal control problems with constraints that require the completion of temporal control tasks. We describe the constraints using signal temporal logic (STL), which is useful for time-sensitive control tasks since it can specify continuous signals within bounded time intervals. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), which is called a $\tau$-CMDP. We formulate the STL-constrained optimal control problem as the $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we also demonstrate the learning performance of the proposed algorithm.


I. INTRODUCTION
Reinforcement learning (RL) is a machine learning method for sequential decision making problems [1]. In RL, a learner, which is called an agent, interacts with an environment and learns a desired policy automatically. Recently, RL with deep neural networks (DNNs) [2], which is called Deep RL (DRL), has attracted much attention for solving complicated decision making problems such as playing video games [3]. DRL has been studied in various fields and many practical applications of DRL have been proposed [4]- [6]. On the other hand, when we apply RL or DRL to a problem in the real world, we must specify a state space of an environment for the problem beforehand. The states of the environment need to include sufficient information in order to determine a desired action at each time. Additionally, we must design a reward function for the task. If we do not design it to evaluate behaviors precisely, the learned policy may not be appropriate for the task.
Recently, controller design methods for temporal control tasks such as periodic, sequential, or reactive tasks have been studied in the control systems community [7]. In these studies, linear temporal logic (LTL) has often been used. LTL is one of the temporal logics developed as formal methods in the computer science community [8] and can express a temporal control task in a logical form.
LTL has also been applied to RL for temporal control tasks [9]. By using RL, we can obtain a policy to complete a temporal control task described by an LTL formula without a mathematical model of a system. The given LTL formula is transformed into an ω-automaton, a finite-state machine that accepts all traces satisfying the LTL formula. The transformed automaton can express states that include sufficient information to complete the temporal control task. We regard a system's state and an automaton's state as an environment's state for RL. The reward function for the temporal control task is designed based on the acceptance condition of the transformed automaton. Additionally, DRL algorithms for satisfying LTL formulae have been proposed in order to solve problems with continuous state-action spaces [10], [11].
In real-world problems, it is often necessary to describe temporal control tasks with time bounds. Unfortunately, LTL cannot express such time bounds. Then, metric interval temporal logic (MITL) and signal temporal logic (STL) are useful [12]. MITL is an extension of LTL and has time-constrained temporal operators. Furthermore, STL is an extension of MITL. Although LTL and MITL have predicates over Boolean signals, STL has inequality-form predicates over real-valued signals, which is useful to specify a dynamical system's trajectories within bounded time intervals. Additionally, STL has a quantitative semantics called robustness that evaluates how well a system's trajectory satisfies a given STL formula [13]. In the control systems community, controller design methods to complete tasks described by STL formulae have been proposed [14], [15], where the control problems are formulated as constrained optimization problems using models of systems. Model-free RL-based controller design methods have also been proposed [16]-[19]. In [16], Aksaray et al. proposed a Q-learning algorithm for satisfying a given STL formula. The satisfaction of the given STL formula is based on a finite trajectory of the system. Thus, as the environment's state for a temporal control task, we use the extended state consisting of the current system's state and the previous system's states instead of using an automaton as in [9]. Additionally, we design a reward function using the robustness for the given formula. In [17], Venkataraman et al. proposed a tractable learning method using a flag state instead of the previous system's state sequence to reduce the dimensionality of the environment's state space. However, these methods cannot be directly applied to problems with a continuous state-action space because they are based on the classical tabular Q-learning algorithm. For problems with continuous spaces, in [18], Balakrishnan et al. introduced a partial signal and applied a DRL algorithm to design a controller that partially satisfies a given STL specification and, in [19], we proposed a DRL-based design of a network controller to complete an STL control task with network delays.
On the other hand, for some control problems, we aim to design a policy that optimizes a given control performance index under a constraint described by an STL formula. For example, in practical applications, we should operate a system so as to satisfy a given STL formula with minimum fuel costs. In this study, we tackle the problem of obtaining an optimal policy for a given control performance index among the policies satisfying a given STL formula, without a mathematical model of the system.

A. CONTRIBUTION:
The main contribution is to propose a DRL algorithm to obtain an optimal policy for a given control performance index such as fuel costs under a constraint described by an STL formula. Our proposed algorithm has the following three advantages. 1) We directly solve control problems with continuous state-action spaces. We apply DRL algorithms for problems with continuous spaces such as deep deterministic policy gradient (DDPG) [20] and soft actor critic (SAC) [21]. 2) We obtain a policy that not only satisfies a given STL formula but also is optimal with respect to a given control performance index. We consider the optimal control problem constrained by a given STL formula and formulate the problem as a constrained Markov decision process (CMDP) [22]. In the CMDP problem, we introduce two reward functions: one is the reward function for the given control performance index and the other is the reward function for the given STL constraint. To solve the CMDP problem, we apply a constrained DRL (CDRL) algorithm with the Lagrangian relaxation [23]. In this algorithm, we relax the CMDP problem into an unconstrained problem using a Lagrange multiplier in order to utilize standard DRL algorithms for problems with continuous spaces. 3) We introduce a two-phase learning algorithm in order to make it easier to learn a policy satisfying the given STL formula. In a CMDP problem, it is important to satisfy the given constraint. The agent needs many experiences satisfying the given STL formula in order to learn how to satisfy the formula. However, it is difficult to collect such experiences while considering both the control performance index and the STL constraint in the early learning stage since the agent may prioritize optimizing its policy with respect to the control performance index. Thus, in the first phase, the agent learns its policy without the control performance index in order to obtain experiences satisfying the STL constraint easily, which is called pre-training.
After obtaining many experiences satisfying the STL formula, in the second phase, the agent learns its optimal policy for the control performance index under the STL constraint, which is called fine-tuning.
Through simulations, we demonstrate the learning performance of the proposed algorithm.

B. RELATED WORKS:
1) Classical RL for satisfying STL formulae
Aksaray et al. proposed a method to design policies satisfying STL formulae based on the Q-learning algorithm [16]. However, in this method, the dimensionality of the environment's state tends to be large. Thus, Venkataraman et al. proposed a tractable learning method to reduce the dimensionality [17]. Furthermore, Kalagarla et al. proposed an STL-constrained RL algorithm using a CMDP formulation and an online learning method [24]. However, since these are tabular-based approaches, we cannot directly apply them to problems with continuous spaces.
2) DRL for satisfying STL formulae
DRL algorithms for satisfying STL formulae have been proposed [18], [19]. However, these studies focused on satisfying a given STL formula as the main objective. On the other hand, in this study, we regard the given STL formula as a constraint of a control problem and tackle the STL-constrained optimal control problem using a CDRL algorithm with the Lagrangian relaxation.

3) Learning with demonstrations for satisfying STL formulae
Learning methods with demonstrations have been proposed [25], [26]. In these studies, a reward function is designed using demonstrations, which is an imitation learning approach. On the other hand, in this study, we do not use demonstrations to design a reward function for satisfying STL formulae. Instead, we design the reward function for satisfying STL formulae using the robustness and the log-sum-exp approximation [16].

C. STRUCTURE:
The remainder of this paper is organized as follows. In Section II, we briefly review STL and the Q-learning algorithm for learning a policy satisfying STL formulae. In Section III, we formulate an optimal control problem under a constraint described by an STL formula as a CMDP problem. In Section IV, we propose a CDRL algorithm with the Lagrangian relaxation to solve the CMDP problem. We relax the CMDP problem into an unconstrained problem using a Lagrange multiplier in order to utilize DRL algorithms for unconstrained problems with continuous spaces. In Section V, by numerical simulations, we demonstrate the usefulness of the proposed method. In Section VI, we conclude the paper and discuss future work.

D. NOTATION:
N_{≥0} is the set of nonnegative integers. R is the set of real numbers. R_{≥0} is the set of nonnegative real numbers. R^n is the n-dimensional Euclidean space. For a set A ⊂ R, max A and min A are the maximum and the minimum values of A, respectively, if they exist.

II. PRELIMINARIES

A. SIGNAL TEMPORAL LOGIC
We consider the following discrete-time stochastic dynamical system:

x_{k+1} = f(x_k, a_k) + ∆_w w_k,  (1)

where x_k ∈ X, a_k ∈ A, and w_k ∈ W are the system's state, the agent's control action, and the system noise at k ∈ {0, 1, ...}, respectively. X = R^{n_x}, A ⊆ R^{n_a}, and W = R^{n_x} are the system's state space, the control action space, and the system noise space, respectively. The system noise w_k is an independent and identically distributed random variable with a probability density p_w : W → R_{≥0}. ∆_w is a nonsingular matrix that is a weighting factor of the system noise. f : X × A → X is a function that describes the system dynamics. Then, we have the transition probability density p_f(x'|x, a) := |∆_w^{-1}| p_w(∆_w^{-1}(x' − f(x, a))). The initial state x_0 ∈ X is sampled from a probability density p_0 : X → R_{≥0}. For a finite system trajectory of length K + 1, x_{k1:k2} denotes the partial trajectory for the time interval [k1, k2], where 0 ≤ k1 ≤ k2 ≤ K.

STL is a specification formalism that allows us to express real-time properties of real-valued trajectories of systems [12]. We consider the following syntax of STL:

Φ ::= G_{[0,K_e]} φ | F_{[0,K_e]} φ,
φ ::= G_{[k_s,k_e]} ϕ | F_{[k_s,k_e]} ϕ | φ ∧ φ | φ ∨ φ,
ϕ ::= ψ | ¬ϕ | ϕ ∧ ϕ | ϕ ∨ ϕ,
where K e , k s , and k e ∈ N ≥0 are nonnegative constants for the time bounds, Φ, φ, ϕ, and ψ are the STL formulae, ψ is a predicate in the form of h(x) ≤ d, h : X → R is a function of the system's state, and d ∈ R is a constant. The Boolean operators ¬, ∧, and ∨ are negation, conjunction, and disjunction, respectively. The temporal operators G T and F T refer to Globally (always) and Finally (eventually), respectively, where T denotes the time bound of the temporal operator.
. The Boolean semantics of STL is recursively defined as follows:

x_{0:K} ⊨_k ψ ⇔ h(x_k) ≤ d,
x_{0:K} ⊨_k ¬ϕ ⇔ x_{0:K} ⊭_k ϕ,
x_{0:K} ⊨_k ϕ_1 ∧ ϕ_2 ⇔ x_{0:K} ⊨_k ϕ_1 and x_{0:K} ⊨_k ϕ_2,
x_{0:K} ⊨_k ϕ_1 ∨ ϕ_2 ⇔ x_{0:K} ⊨_k ϕ_1 or x_{0:K} ⊨_k ϕ_2,
x_{0:K} ⊨_k G_{[k_s,k_e]} ϕ ⇔ x_{0:K} ⊨_{k'} ϕ, ∀k' ∈ [k + k_s, k + k_e],
x_{0:K} ⊨_k F_{[k_s,k_e]} ϕ ⇔ ∃k' ∈ [k + k_s, k + k_e] s.t. x_{0:K} ⊨_{k'} ϕ.

The quantitative semantics of STL, which is called robustness, is recursively defined as follows:

ρ(x_{0:K}, ψ, k) = d − h(x_k),
ρ(x_{0:K}, ¬ϕ, k) = −ρ(x_{0:K}, ϕ, k),
ρ(x_{0:K}, ϕ_1 ∧ ϕ_2, k) = min(ρ(x_{0:K}, ϕ_1, k), ρ(x_{0:K}, ϕ_2, k)),
ρ(x_{0:K}, ϕ_1 ∨ ϕ_2, k) = max(ρ(x_{0:K}, ϕ_1, k), ρ(x_{0:K}, ϕ_2, k)),
ρ(x_{0:K}, G_{[k_s,k_e]} ϕ, k) = min_{k' ∈ [k+k_s, k+k_e]} ρ(x_{0:K}, ϕ, k'),
ρ(x_{0:K}, F_{[k_s,k_e]} ϕ, k) = max_{k' ∈ [k+k_s, k+k_e]} ρ(x_{0:K}, ϕ, k'),

which quantifies how well the trajectory satisfies the given STL formula [13]. The horizon length of an STL formula is recursively defined as follows:

hrz(ψ) = 0,
hrz(¬ϕ) = hrz(ϕ),
hrz(ϕ_1 ∧ ϕ_2) = max(hrz(ϕ_1), hrz(ϕ_2)),
hrz(ϕ_1 ∨ ϕ_2) = max(hrz(ϕ_1), hrz(ϕ_2)),
hrz(G_{[k_s,k_e]} ϕ) = k_e + hrz(ϕ),
hrz(F_{[k_s,k_e]} ϕ) = k_e + hrz(ϕ).

hrz(φ) is the required length of the state sequence to verify the satisfaction of the STL formula φ.
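The recursive robustness semantics above can be sketched in code. The following is a minimal illustrative evaluator (not the paper's implementation): formulae are nested tuples, a predicate is a pair (h, d) meaning h(x) ≤ d, and a positive robustness value implies Boolean satisfaction.

```python
def rho(x, phi, k=0):
    """Robustness of trajectory x (a list of states) w.r.t. formula phi at time k."""
    op = phi[0]
    if op == "pred":                      # psi: h(x) <= d, robustness d - h(x_k)
        _, h, d = phi
        return d - h(x[k])
    if op == "not":
        return -rho(x, phi[1], k)
    if op == "and":
        return min(rho(x, phi[1], k), rho(x, phi[2], k))
    if op == "or":
        return max(rho(x, phi[1], k), rho(x, phi[2], k))
    if op == "G":                         # globally: min over [k+ks, k+ke]
        _, ks, ke, sub = phi
        return min(rho(x, sub, k2) for k2 in range(k + ks, k + ke + 1))
    if op == "F":                         # finally: max over [k+ks, k+ke]
        _, ks, ke, sub = phi
        return max(rho(x, sub, k2) for k2 in range(k + ks, k + ke + 1))
    raise ValueError("unknown operator: %s" % op)
```

For instance, F_{[0,3]}(x ≥ 1.0) can be encoded as `("F", 0, 3, ("pred", lambda s: -s, -1.0))`, whose robustness over a trajectory is the largest value of x_k − 1.0 in the window.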

B. Q-LEARNING FOR SATISFYING STL FORMULAE
In this section, we review the Q-learning algorithm [16] for learning a policy satisfying a given STL formula. Although we often regard the current state of the dynamical system (1) as the environment's state for RL, the current system's state is not sufficient to determine an action for satisfying a given STL formula. Thus, Aksaray et al. defined the following extended state using previous system's states:

z_k := [x_{k−τ+1}^⊤ x_{k−τ+2}^⊤ ... x_k^⊤]^⊤ ∈ Z,

where τ = hrz(φ) + 1 for the given STL formula Φ = G_{[0,K_e]} φ (or Φ = F_{[0,K_e]} φ) and Z = X^τ is the extended state space. We show a simple example in Fig. 1. We operate a one-dimensional dynamical system to satisfy the STL formula

Φ = G_{[0,10]}(F_{[0,3]} ψ_b ∧ F_{[0,3]} ψ_g),

where ψ_b and ψ_g are predicates specifying the blue region and the green region in Fig. 1, respectively. That is, at any time in the time interval [0, 10], the system should enter both the blue region and the green region before 3 time steps have elapsed, where there is no constraint on the order of the visits. Let the current system's state be x_k = 1.5. Note that the desired action for the STL formula differs depending on the past state sequence. For example, in the case where x_{k−3:k} = −0.5, 0.5, 1.0, 1.5, we should operate the system to the blue region right away. On the other hand, in the case where x_{k−3:k} = −1.5, −2.5, −0.5, 1.5, we do not need to move it. Thus, we regard not only the current system's state but also previous system's states as the environment's state for RL. Additionally, Aksaray et al. designed the reward function R_STL : Z → R using the robustness and the log-sum-exp approximation. The robustness of a trajectory x_{0:K} with respect to the given STL formula Φ is as follows:

ρ(x_{0:K}, Φ, 0) = min_{k ∈ [0,K_e]} ρ(x_{k:k+τ−1}, φ, 0)  if Φ = G_{[0,K_e]} φ,
ρ(x_{0:K}, Φ, 0) = max_{k ∈ [0,K_e]} ρ(x_{k:k+τ−1}, φ, 0)  if Φ = F_{[0,K_e]} φ.

We consider the following problem:

max_π E_{p_π}[ρ(x_{0:K}, Φ, 0)].  (5)

The min (or max) above is approximated by the log-sum-exp:

min A ≈ −(1/β) log Σ_{a∈A} exp(−βa),  max A ≈ (1/β) log Σ_{a∈A} exp(βa),
where β > 0 is an approximation parameter. We can approximate min{···} or max{···} with arbitrary accuracy by selecting a sufficiently large β. Then, (5) can be approximated as follows (for Φ = G_{[0,K_e]} φ):

max_π E_{p_π}[−(1/β) log Σ_{k=0}^{K_e} exp(−β ρ(x_{k:k+τ−1}, φ, 0))].

Since the log function is a strictly monotonic function and β > 0 is a constant, we have

max_π E_{p_π}[Σ_{k=0}^{K_e} −exp(−β ρ(x_{k:k+τ−1}, φ, 0))].

Thus, we use the following reward function R_STL : Z → R to satisfy the given STL formula Φ:

R_STL(z_k) = −exp(−β ρ(z_k, φ, 0))  if Φ = G_{[0,K_e]} φ,
R_STL(z_k) =  exp( β ρ(z_k, φ, 0))  if Φ = F_{[0,K_e]} φ.  (8)
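The log-sum-exp trick and the resulting per-step STL reward can be sketched as follows; this is an illustrative snippet, with `soft_min` a smooth under-approximation of min that tightens as β grows, and `stl_reward` following the form of (8).

```python
import math

def soft_min(values, beta):
    """Log-sum-exp approximation: -(1/beta) * log(sum(exp(-beta * v))).
    Always a lower bound on min(values); approaches min(values) as beta grows."""
    return -(1.0 / beta) * math.log(sum(math.exp(-beta * v) for v in values))

def stl_reward(rho_z, beta, outer="G"):
    """Per-step STL reward from the robustness rho_z of the extended state:
    -exp(-beta*rho) for an outer 'globally', exp(beta*rho) for an outer 'eventually'."""
    if outer == "G":
        return -math.exp(-beta * rho_z)
    return math.exp(beta * rho_z)
```

Note that summing `stl_reward` over time recovers (up to the monotone log transformation) the smoothed robustness of the whole trajectory, which is what makes the per-step reward decomposition possible.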
To design a controller satisfying an STL formula using the Q-learning algorithm, Aksaray et al. proposed a τ-MDP ⟨Z, A, p_z, R_STL⟩ as follows:
• Z = X^τ is the extended state space and A is the control action space.
• p_z is a transition probability density for the extended state. When the system's state is updated by x' ∼ p_f(·|x, a), the extended state is updated by z' ∼ p_z(·|z, a) as follows:

z' = [z[1]^⊤ z[2]^⊤ ... z[τ−1]^⊤ x'^⊤]^⊤,

where z[j] denotes the (j+1)-th system's state in z. Fig. 2 shows an example of the transition. We consider the sequence that consists of the τ system's states x_{k−τ+1}, x_{k−τ+2}, ..., x_k as the extended state at time k. In the transition, the oldest system's state x_{k−τ+1} is removed from the sequence and the other system's states x_{k−τ+2}, ..., x_k are shifted to the left. After that, the next system's state x_{k+1}, sampled from p_f(·|x_k, a_k), is appended to the tail of the sequence. The next extended state z_{k+1} depends on the current extended state z_k and the agent's action a_k.
• R_STL : Z → R is the STL-reward function defined by (8).

FIGURE 2. Illustration of an extended state transition. We consider the case τ = 3. The next extended state z_{k+1} depends on the current extended state z_k and the agent's action a_k.
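The shift-and-append transition of the extended state can be sketched in a few lines; the list-based representation here is an illustrative simplification.

```python
def initial_extended_state(x0, tau):
    """z_0 is the initial system state repeated tau times."""
    return [x0] * tau

def step_extended_state(z, x_next):
    """Drop the oldest state, shift the rest left, append the new system state."""
    return z[1:] + [x_next]
```

Because the next extended state is a deterministic function of the current extended state and the newly sampled system state, the extended state is Markov even though the raw system state is not sufficient for the STL task.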

III. PROBLEM FORMULATION
We consider the following optimal policy design problem constrained by a given STL formula Φ, where the system model (1) is unknown:

max_π E_{p_π}[Σ_{k=0}^K γ^k R(x_k, a_k)]  subject to  x_{0:K} ⊨ Φ,  (9)

where γ ∈ [0, 1) is a discount factor and R : X × A → R is a reward function for a given control performance index. E_{p_π}[·] is the expectation value with respect to the distributions p_0, p_f, and π. We introduce the following τ-CMDP ⟨Z, A, p_{z0}, p_z, R_STL, R_z⟩, which is an extension of a τ-MDP [16], to deal with the problem (9).
• p_{z0} is a probability density for the initial extended state z_0 = [x_0^⊤ ... x_0^⊤]^⊤ with x_0 ∼ p_0.
• p_z is a transition probability density for the extended state. When the system's state is updated by x' ∼ p_f(·|x, a), the extended state is updated by z' ∼ p_z(·|z, a) as follows:

z' = [z[1]^⊤ z[2]^⊤ ... z[τ−1]^⊤ x'^⊤]^⊤.

• R_STL : Z → R is the STL-reward function defined by (8) for satisfying the given STL formula Φ.
• R_z : Z × A → R is a reward function given by

R_z(z, a) = R(z[τ−1], a),

where R : X × A → R is the reward function for the given control performance index.
We design an optimal policy with respect to R_z under satisfying the STL formula using a model-free CDRL algorithm [23]. Then, we define the following functions:

J(π) := E_{p_π}[Σ_{k=0}^K γ^k R_z(z_k, a_k)],  J_STL(π) := E_{p_π}[Σ_{k=0}^K γ^k R_STL(z_k)],

where E_{p_π}[·] is the expectation value with respect to the distributions p_{z0}, p_z, and π. We reformulate the problem (9) as follows:

max_π J(π)  (10)
subject to  J_STL(π) ≥ l_STL,  (11)

where l_STL ∈ R is a lower threshold. In this study, l_STL is a hyper-parameter for adjusting the satisfiability of the given STL formula. The larger l_STL is, the more conservatively the agent learns a policy to satisfy the STL formula. We call the constrained problem with (10) and (11) a τ-CMDP problem.
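The two objectives and the constraint check can be estimated from rollouts as plain discounted sums; the following is an illustrative single-trajectory sketch (in practice the expectations are averaged over many rollouts).

```python
def discounted_sum(rewards, gamma):
    """Discounted sum over one rollout: sum_k gamma^k * r_k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def satisfies_constraint(stl_rewards, gamma, l_stl):
    """Empirical check of the constraint J_STL >= l_STL on one rollout."""
    return discounted_sum(stl_rewards, gamma) >= l_stl
```

Because the STL rewards of (8) are negative for a "globally" outer operator, the threshold l_STL is typically negative as well (e.g., l_STL = −40 in the simulations of Section V).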
In the next section, we propose a CDRL algorithm with the Lagrangian relaxation to solve the τ -CMDP problem.

IV. DEEP REINFORCEMENT LEARNING UNDER A SIGNAL TEMPORAL LOGIC CONSTRAINT
We propose a CDRL algorithm with the Lagrangian relaxation to obtain an optimal policy for the τ -CMDP problem.
Our proposed algorithm is based on the DDPG algorithm [20] or the SAC algorithm [21], which are DRL algorithms derived from the Q-learning algorithm for problems with continuous state-action spaces. In both algorithms, we parameterize the agent's policy π using a DNN, which is called an actor DNN. The agent updates the parameter vector of the actor DNN based on J(π). However, in this problem, the agent cannot directly use J(π) since the mathematical model of the system p_f is unknown. Thus, we approximate J(π) using another DNN, which is called a critic DNN. Additionally, we use the following two techniques proposed in [3]: experience replay and target networks.
In the experience replay, the agent does not update the parameter vectors of the DNNs immediately when obtaining an experience. Instead, the agent stores the obtained experience in the replay buffer D, randomly selects some experiences from D, and updates the parameter vectors of the DNNs using the selected experiences. On the other hand, we cannot directly apply the DDPG algorithm or the SAC algorithm to the τ-CMDP problem since they are algorithms for unconstrained problems. Thus, we consider the following Lagrangian relaxation [27]:

min_{κ≥0} max_π L(π, κ),  (12)

where L(π, κ) is a Lagrangian function given by

L(π, κ) = J(π) + κ(J_STL(π) − l_STL),  (13)

and κ ≥ 0 is a Lagrange multiplier. We can thus relax the constrained problem into an unconstrained one. In practice, we input a pre-processed state ẑ, stated in Section IV-D, to the DNNs instead of an extended state z.
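The primal-dual structure of the relaxation can be illustrated on a toy scalar problem (an illustrative assumption, not the paper's control problem): take J(θ) = −θ² and J_STL(θ) = θ with threshold l_STL = 1, so the constrained optimum is θ = 1 with multiplier κ = 2. The policy parameter ascends the Lagrangian while κ descends it, projected onto κ ≥ 0.

```python
def primal_dual(theta=0.0, kappa=0.0, lr=0.01, steps=20000, l_stl=1.0):
    """Projected gradient descent-ascent on L(theta, kappa) = -theta**2
    + kappa * (theta - l_stl). Returns the final (theta, kappa)."""
    for _ in range(steps):
        grad_theta = -2.0 * theta + kappa      # d/dtheta of the Lagrangian
        theta += lr * grad_theta               # gradient ascent on the "policy"
        kappa -= lr * (theta - l_stl)          # gradient descent on the multiplier
        kappa = max(0.0, kappa)                # projection onto kappa >= 0
    return theta, kappa
```

With these toy objectives the iteration settles at the saddle point (θ, κ) ≈ (1, 2): whenever the constraint θ ≥ 1 is violated, κ grows and pushes θ up; once it is satisfied, κ shrinks and the unconstrained objective dominates again.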

A. DDPG-LAGRANGIAN
We parameterize a deterministic policy µ_θµ using a DNN as shown in Fig. 3, which is the actor DNN. Its parameter vector is denoted by θ_µ. In the DDPG-Lagrangian algorithm, the parameter vector θ_µ is updated by maximizing (13). However, J(µ_θµ) and J_STL(µ_θµ) are unknown. Thus, as shown in Fig. 4, J(µ_θµ) and J_STL(µ_θµ) are approximated by two separate critic DNNs, which are called the reward critic DNN and the STL-reward critic DNN, respectively. The parameter vectors of the reward critic DNN and the STL-reward critic DNN are denoted by θ_r and θ_s, respectively. θ_r and θ_s are updated by decreasing the following critic loss functions:

J(θ_r) = E_{(z,a,z')∼D}[(Q_{θr}(z, a) − t_r)²],  (14)
J(θ_s) = E_{(z,a,z')∼D}[(Q_{θs}(z, a) − t_s)²],  (15)

where Q_{θr}(·, ·) and Q_{θs}(·, ·) are the outputs of the reward critic DNN and the STL-reward critic DNN, respectively. The target values t_r and t_s are given by

t_r = r + γ Q_{θr⁻}(z', µ_{θµ⁻}(z')),  t_s = s + γ Q_{θs⁻}(z', µ_{θµ⁻}(z')),

where r = R_z(z, a), s = R_STL(z), Q_{θr⁻}(·, ·) and Q_{θs⁻}(·, ·) are the outputs of the target reward critic DNN and the target STL-reward critic DNN, respectively, and µ_{θµ⁻}(·) is the output of the target actor DNN. θ_r⁻, θ_s⁻, and θ_µ⁻ are the parameter vectors of the target reward critic DNN, the target STL-reward critic DNN, and the target actor DNN, respectively. These parameter vectors are slowly updated by the following soft update:
θ_r⁻ ← ξθ_r + (1 − ξ)θ_r⁻,  θ_s⁻ ← ξθ_s + (1 − ξ)θ_s⁻,  θ_µ⁻ ← ξθ_µ + (1 − ξ)θ_µ⁻,  (16)

where ξ > 0 is a sufficiently small positive constant. The agent stores experiences in the replay buffer D and randomly selects some experiences from D for the updates of θ_r and θ_s. E_{(z,a,z')∼D}[·] is the expected value under the random sampling of the experiences from D. In the standard DDPG algorithm [20], the parameter vector of the actor DNN is updated by decreasing

J(θ_µ) = E_{z∼D}[−Q_{θr}(z, µ_{θµ}(z))],

where E_{z∼D}[·] is the expected value with respect to z sampled randomly from D. However, in the DDPG-Lagrangian algorithm, we consider (13) as the objective instead of J(µ_θµ). Thus, the parameter vector of the actor DNN θ_µ is updated by decreasing the following actor loss function:

J_a(θ_µ) = E_{z∼D}[−Q_{θr}(z, µ_{θµ}(z)) − κ Q_{θs}(z, µ_{θµ}(z))].  (17)
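The soft (Polyak) target update can be sketched as follows; representing the parameters as plain lists of floats is an illustrative simplification of the per-tensor update.

```python
def soft_update(target_params, main_params, xi):
    """Each target parameter slowly tracks its main parameter with rate xi."""
    return [xi * p + (1.0 - xi) * tp for tp, p in zip(target_params, main_params)]
```

A small ξ (e.g., ξ = 0.01 in the simulations) keeps the bootstrap targets slowly moving, which stabilizes the temporal-difference updates of both critics.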
The Lagrange multiplier κ is updated by decreasing the following loss function:

J(κ) = κ(E_{z0∼p_{z0}}[Q_{θs}(z_0, µ_{θµ}(z_0))] − l_STL),  (18)

where E_{z0∼p_{z0}}[·] is the expected value with respect to p_{z0}.
Remark: κ is a nonnegative parameter adjusting the relative importance of the STL-reward critic DNN against the reward critic DNN in updating the actor DNN. Intuitively, if the agent's policy does not satisfy (11), we increase the parameter κ, which increases the relative importance of the STL-reward critic DNN. On the other hand, if the agent's policy satisfies (11), we decrease the parameter κ, which decreases the relative importance of the STL-reward critic DNN.
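The multiplier update implied by this loss is a single projected gradient step; the following sketch takes a hypothetical scalar estimate `j_stl_estimate` of E[Q_θs] at initial states as input.

```python
def update_multiplier(kappa, j_stl_estimate, l_stl, lr):
    """One gradient-descent step on J(kappa) = kappa * (j_stl_estimate - l_stl),
    clipped at zero so the multiplier stays nonnegative."""
    kappa = kappa - lr * (j_stl_estimate - l_stl)
    return max(0.0, kappa)
```

This matches the Remark: when the constraint is violated (estimate below l_STL) the step increases κ, raising the weight of the STL-reward critic in the actor loss; when the constraint holds with slack, κ decays toward zero.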

B. SAC-LAGRANGIAN
SAC is a maximum entropy DRL algorithm that obtains a policy maximizing both the expected sum of rewards and the expected entropy of the policy. It is known that a maximum entropy algorithm improves exploration by acquiring diverse behaviors and is robust to estimation errors [21]. In the SAC algorithm, we design a stochastic policy π. We use the following objective with an entropy term instead of J(π):
J_ent(π) = E_{p_π}[Σ_{k=0}^K γ^k (R_z(z_k, a_k) + αH(π(·|z_k)))],  (19)

where H(π(·|z_k)) = E_{a∼π}[−log π(a|z_k)] is the entropy of the stochastic policy π and α ≥ 0 is an entropy temperature. The entropy temperature determines the relative importance of the entropy term against the sum of rewards. We use the Lagrangian relaxation for the SAC algorithm as in [28], [29]. Then, a Lagrangian function with the entropy term is given by

L_ent(π, κ) = J_ent(π) + κ(J_STL(π) − l_STL).  (20)

We model the stochastic policy π_θπ using a Gaussian whose mean and standard deviation are outputted by a DNN with a reparameterization trick [30] as shown in Fig. 5, which is the actor DNN.

FIGURE 5. Illustration of an actor DNN with a reparameterization trick. The DNN outputs the mean µ_θπ(ẑ) and the standard deviation σ_θπ(ẑ) for an input ẑ. We use the reparameterization trick to sample an action, where ε is sampled from a standard normal distribution N(0, 1).

The parameter vector is denoted by θ_π. Additionally, we need to estimate J_ent(π_θπ) and J_STL(π_θπ) to update the parameter vector θ_π, as in the DDPG-Lagrangian algorithm. Thus, J_ent(π_θπ) and J_STL(π_θπ) are also approximated by two separate critic DNNs as shown in Fig. 4. Note that, in the SAC-Lagrangian algorithm, the reward critic DNN estimates not only J(π_θπ) but also the entropy term. The parameter vectors are also updated using the experience replay and the target network technique. θ_r and θ_s are updated by decreasing the following critic loss functions:
J(θ_r) = E_{(z,a,z')∼D}[(Q_{θr}(z, a) − t_r)²],  (21)
J(θ_s) = E_{(z,a,z')∼D}[(Q_{θs}(z, a) − t_s)²],  (22)

where r = R_z(z, a), s = R_STL(z), and Q_{θr}(·, ·) and Q_{θs}(·, ·) are the outputs of the reward critic DNN and the STL-reward critic DNN, respectively. The target values are computed by

t_r = r + γ V_{θr}⁻(z'),  V_{θr}⁻(z') = E_{a'∼π_θπ}[Q_{θr⁻}(z', a') − α log π_θπ(a'|z')],
t_s = s + γ V_{θs}⁻(z'),  V_{θs}⁻(z') = E_{a'∼π_θπ}[Q_{θs⁻}(z', a')],

where Q_{θr⁻}(·, ·) and Q_{θs⁻}(·, ·) are the outputs of the target reward critic DNN and the target STL-reward critic DNN, respectively, and E_{a'∼π_θπ}[·] is the expected value with respect to π_θπ. The parameter vectors θ_r⁻ and θ_s⁻ are slowly updated as in (16). In the standard SAC algorithm, the parameter vector of the actor DNN θ_π is updated by decreasing

J(θ_π) = E_{z∼D, a∼π_θπ}[α log π_θπ(a|z) − Q_{θr}(z, a)],

where E_{z∼D, a∼π_θπ}[·] is the expected value with respect to the experiences z sampled from D and the stochastic policy π_θπ. However, in the SAC-Lagrangian algorithm, we consider (20) as the objective instead of (19). Thus, the parameter vector of the actor DNN θ_π is updated by decreasing the following actor loss function:
J_a(θ_π) = E_{z∼D, a∼π_θπ}[α log π_θπ(a|z) − Q_{θr}(z, a) − κ Q_{θs}(z, a)].  (23)

The Lagrange multiplier κ is updated by decreasing the following loss function:

J(κ) = κ(E_{z0∼p_{z0}, a∼π_θπ}[Q_{θs}(z_0, a)] − l_STL),  (24)

where E_{z0∼p_{z0}, a∼π_θπ}[·] is the expected value with respect to p_{z0} and π_θπ. The entropy temperature α is updated by decreasing the following loss function:

J(α) = E_{z∼D, a∼π_θπ}[−α(log π_θπ(a|z) + H_0)],  (25)

where H_0 is a lower bound of the entropy, which is a hyper-parameter. In [21], the parameter H_0 is selected based on the dimensionality of the action space. Additionally, in the SAC algorithm, the double Q-learning technique [31] is adopted to mitigate the positive bias in the updates of θ_π, where we prepare two critic DNNs and two target critic DNNs. We also adopt this technique in the SAC-Lagrangian algorithm.
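The temperature update implied by the loss J(α) = E[−α(log π + H_0)] is a single gradient step; the following sketch takes a hypothetical batch-mean `mean_log_pi` of log π(a|z) as input and clips α at zero.

```python
def update_temperature(alpha, mean_log_pi, h0, lr):
    """One gradient-descent step on J(alpha) = -alpha * (mean_log_pi + h0).
    The gradient w.r.t. alpha is -(mean_log_pi + h0)."""
    return max(0.0, alpha + lr * (mean_log_pi + h0))
```

The sign behavior matches the intent: when the policy entropy −E[log π] falls below the target H_0 (the policy is too deterministic), the bracket is positive and α increases, strengthening the entropy bonus; when the entropy exceeds the target, α decreases.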

C. PRE-TRAINING AND FINE-TUNING
In this study, it is important to satisfy the given STL constraint. In order to learn a policy satisfying a given STL formula, the agent needs many experiences satisfying the formula. However, it is difficult to collect such experiences while considering both the control performance index and the STL constraint in the early learning stage since the agent may prioritize optimizing its policy with respect to the control performance index. Thus, we propose a two-phase learning algorithm. In the first phase, which is called pre-training, the agent focuses on learning a policy satisfying the given STL formula Φ in order to store experiences receiving high STL-rewards in the replay buffer D; that is, the agent learns its policy considering only the STL-rewards.

Pre-training for DDPG-Lagrangian
The parameter vector of the actor DNN θ_µ is updated by decreasing

J_a(θ_µ) = E_{z∼D}[−Q_{θs}(z, µ_{θµ}(z))]  (26)

instead of (17). On the other hand, θ_s is updated by (15).

Pre-training for SAC-Lagrangian
The parameter vector of the actor DNN θ_π is updated by decreasing

J_a(θ_π) = E_{z∼D, a∼π_θπ}[α log π_θπ(a|z) − Q_{θs}(z, a)]  (27)

instead of (23). On the other hand, θ_s is updated by (22), where V_{θs}⁻ is computed by

V_{θs}⁻(z') = E_{a'∼π_θπ}[Q_{θs⁻}(z', a') − α log π_θπ(a'|z')].

In the second phase, which is called fine-tuning, the agent learns the optimal policy constrained by the given STL formula. In the DDPG-Lagrangian algorithm, the actor DNN θ_µ is updated by (17). In the SAC-Lagrangian algorithm, the actor DNN θ_π is updated by (23).
Remark: The two-phase learning may become temporarily unstable because it changes the objective functions discontinuously. In such a case, we may start the second phase by changing the objective functions from those used in the first phase smoothly and slowly.
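The switch between the two phases can be sketched as a selection of the actor loss by the update counter; the scalar arguments here stand in for batch-mean critic outputs and are illustrative (with `log_pi` and `alpha` zero, the same function covers the DDPG-style deterministic case).

```python
def actor_loss(c, k_pre, q_r, q_s, kappa, log_pi=0.0, alpha=0.0):
    """Actor loss selection for the two-phase schedule.
    c: update counter, k_pre: number of pre-training updates."""
    if c < k_pre:
        # pre-training: only the STL-reward critic drives the policy
        return alpha * log_pi - q_s
    # fine-tuning: Lagrangian objective with both critics
    return alpha * log_pi - q_r - kappa * q_s
```

Decreasing the pre-training loss pushes the policy toward high STL-reward behavior regardless of the control performance index, filling the replay buffer with constraint-satisfying experiences before the Lagrangian objective takes over.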

D. PRE-PROCESS
If τ is large, it is difficult for the agent to learn its policy due to the large dimensionality of the extended state space. Then, a pre-process is useful in order to reduce the dimensionality, which is related to [17]. In the previous study, a flag state for each sub-formula is defined as a discrete state, and the flag discrete state space is combined with the system's discrete state space. On the other hand, in this study, it is assumed that the system state space is continuous. If we used discrete flag states, the pre-processed state space would be a hybrid state space that has both discrete and continuous values. Thus, we consider the flag state as a continuous value and input it to the DNNs as shown in Fig. 6.

FIGURE 6. Example of constructing a pre-processed state. We consider the 1-dimensional system and the two STL sub-formulae: φ_1 = F_{[2,7]}(x ≥ 0.0) and φ_2 = F_{[0,7]}(x ≥ 0.2). For each sub-formula, we compute the flag value using the extended state z_k, which is regarded as a continuous value in [−0.5, 0.5]. After that, we construct the pre-processed state using z_k[τ−1] (= x_k), f̂¹_k, and f̂²_k and input it to the DNNs.
We introduce a flag value f^i for each STL sub-formula φ_i, where it is assumed that k_e^i = τ − 1, ∀i ∈ {1, 2, ..., M}.
Definition 3 (Pre-process): For an extended state z, a flag value f^i of an STL sub-formula φ_i is defined as follows:

f^i = max{ (k' + 1)/τ : z[k'] ⊭ ϕ_i, k' ∈ {0, ..., τ − 1} }  if φ_i = G_{[k_s^i, τ−1]} ϕ_i,
f^i = max{ (k' + 1)/τ : z[k'] ⊨ ϕ_i, k' ∈ {0, ..., τ − 1} }  if φ_i = F_{[k_s^i, τ−1]} ϕ_i.

Note that max ∅ = −∞ and the flag value represents the normalized time lying in (0, 1] ∪ {−∞}. Intuitively, for φ_i = G_{[k_s^i, τ−1]} ϕ_i, the flag value indicates the time duration in which ϕ_i is always satisfied, whereas, for φ_i = F_{[k_s^i, τ−1]} ϕ_i, the flag value indicates the instant when ϕ_i is satisfied. The flag values f^i, i ∈ {1, 2, ..., M}, calculated above are transformed into f̂^i as follows:

f̂^i = f^i − 0.5  if f^i ≠ −∞,  f̂^i = −0.5  otherwise,

so that each f̂^i lies in the bounded interval [−0.5, 0.5].
Remark: The assumption k_e^i = τ − 1 is important for the pre-processed state to retain the Markov property. For example, for a sub-formula φ_i = F_{[2,4]} ϕ_i with τ = 8, f̂^i_{k+1} depends not only on f̂^i_k but also on z_k[5]. If the pre-processed state is given by [z_k[7] f̂_k]^⊤, the agent with DNNs observes the environment only partially; the agent also needs z_k[5] and z_k[6] as parts of the pre-processed state. Thus, the case k_e^i = τ − 1 is the most effective in terms of reducing the dimensionality of the extended state space.
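A plausible sketch of the flag computation for an "eventually" sub-formula, following the flag-state idea of [17] (the exact normalization here is an assumption): the flag stores the normalized latest time in the extended state at which the inner predicate held, with max ∅ = −∞, and is then shifted into [−0.5, 0.5] for the DNN input.

```python
def flag_value_F(z, holds, tau):
    """Flag for F_[ks, tau-1] phi: normalized latest time k' with phi true at z[k'].
    `holds` is the (assumed) Boolean check of the inner predicate; -inf if never true."""
    times = [kp for kp in range(tau) if holds(z[kp])]
    return (max(times) + 1) / tau if times else float("-inf")

def transform_flag(f):
    """Map the flag from (0, 1] (or -inf) into the bounded interval [-0.5, 0.5]."""
    return f - 0.5 if f != float("-inf") else -0.5
```

The bounded transformation matters for learning: feeding −∞ (or an unbounded sentinel) into a DNN would dominate the other inputs, whereas [−0.5, 0.5] is on the same scale as the normalized system state.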

E. ALGORITHM
Our proposed algorithm to design an optimal policy under the given STL constraint is presented in Algorithm 1. In line 1, we select a DRL algorithm such as the DDPG algorithm or the SAC algorithm. From lines 2 to 4, we initialize the parameter vectors of the DNNs, the entropy temperature (if the algorithm is the SAC-Lagrangian algorithm), and the Lagrange multiplier. In line 5, we initialize a replay buffer D. In line 6, we set the number of pre-training steps K_pre. In line 7, we initialize a counter for updates. In line 9, the agent receives an initial state x_0 ∼ p_0. From lines 10 to 11, the agent sets the initial extended state z_0 = [x_0^⊤ ... x_0^⊤]^⊤ and computes the pre-processed state ẑ_0. One learning step is done between lines 13 and 25. In line 13, the agent determines an action a_k based on the pre-processed state ẑ_k for exploration. In line 14, the state of the system changes by the determined action a_k and the agent receives the next state x_{k+1}, the reward r_k, and the STL-reward s_k. From lines 15 to 16, the agent sets the next extended state z_{k+1} using x_{k+1} and z_k and computes the next pre-processed state ẑ_{k+1}. In line 17, the agent stores the experience (ẑ_k, a_k, ẑ_{k+1}, r_k, s_k) in the replay buffer D. In line 18, the agent randomly samples I experiences {(ẑ^(i), a^(i), ẑ'^(i), r^(i), s^(i))}_{i=1}^I from the replay buffer D. If the learning counter satisfies c < K_pre, the agent pre-trains the parameter vectors by Algorithm 3. Then, the parameter vectors of the reward critic DNN θ_r and the STL-reward critic DNN θ_s are updated by (14) and (15) (or (21) and (22)), respectively, and the parameter vector of the actor DNN θ_µ (or θ_π) is updated by (26) (or (27)). In the SAC-based algorithm, the entropy temperature α is updated by (25). On the other hand, if the learning counter satisfies c ≥ K_pre, the agent fine-tunes the parameter vectors by Algorithm 4. Then, the parameter vector of the actor DNN θ_µ (or θ_π) is updated by (17) (or (23)) and the other parameter vectors are updated in the same way as in the case c < K_pre. The Lagrange multiplier is updated by (18) (or (24)). In line 24, the agent updates the parameter vectors of the target DNNs by (16). In line 25, the learning counter is updated. The agent repeats the process between lines 13 and 25 in a learning episode.

FIGURE 9. Control of a two-wheeled mobile robot under an STL constraint. The working area is 0.5 ≤ x^(0) ≤ 4.5, 0.5 ≤ x^(1) ≤ 4.5, colored gray. The initial state of the system is sampled randomly in 0.5 ≤ x^(0) ≤ 2.5, 0.5 ≤ x^(1) ≤ 2.5, −π/2 ≤ x^(2) ≤ π/2, colored red. The region 1 labeled by ϕ_1 is 3.5 ≤ x^(0) ≤ 4.5, 3.5 ≤ x^(1) ≤ 4.5 and the region 2 labeled by ϕ_2 is 3.5 ≤ x^(0) ≤ 4.5, 1.5 ≤ x^(1) ≤ 2.5. These regions are colored blue.

V. EXAMPLE
We consider STL-constrained optimal control problems.

Algorithm 1 Two-phase DRL-Lagrangian to design an optimal policy under an STL constraint (excerpt):
9: Receive an initial state x_0 ∼ p_0.
10: Set the initial extended state z_0 using x_0.
12: for discrete-time step k = 0, ..., K do
13:   Determine an action a_k based on the state ẑ_k.
14:   Execute a_k and receive the next state x_{k+1}, the reward r_k, and the STL-reward s_k.
15:   Set the next extended state z_{k+1} using x_{k+1} and z_k.
16:   Compute the next pre-processed state ẑ_{k+1} by Algorithm 2.

A. EVALUATION
We apply the SAC-Lagrangian algorithm to design a policy constrained by an STL formula. In all simulations, the DNNs have two hidden layers, each with 256 units, and all layers are fully connected. The activation functions for the hidden layers and for the output of the actor DNN are rectified linear unit (ReLU) functions and hyperbolic tangent functions, respectively. We normalize x^(0) and x^(1) as x^(0) − 2.5 and x^(1) − 2.5, respectively. The size of the replay buffer D is 1.0 × 10^5, and the size of the mini-batch is I = 64. We use Adam [32] as the optimizer for all main DNNs, the entropy temperature, and the Lagrange multiplier. The learning rate of the optimizer for the Lagrange multiplier is 1.0 × 10^{−5} and the learning rates of the other optimizers are 3.0 × 10^{−4}. The soft update rate of the target network is ξ = 0.01. The discount factor is γ = 0.99. The target H_0 for updating the entropy temperature is −2.0. The STL-reward parameter is β = 100. The agent learns its control policy for 6.0 × 10^5 steps. The initial values of both the entropy temperature and the Lagrange multiplier are 1.0.
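As an illustration of the network shape described above (not the authors' implementation), a two-hidden-layer MLP with 256 ReLU units per layer, a tanh-bounded output, and the stated normalization of x^(0) and x^(1) can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden=256, out_dim=2):
    # two fully connected hidden layers of 256 units each
    dims = [in_dim, hidden, hidden, out_dim]
    return [(rng.standard_normal((a, b)) * 0.01, np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def actor_forward(params, x):
    h = np.asarray(x, dtype=float).copy()
    h[:2] -= 2.5                      # normalize x(0), x(1) as x - 2.5
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0)  # ReLU hidden layers
    W, b = params[-1]
    return np.tanh(h @ W + b)         # tanh output bounds the action
```

In practice these networks would be implemented in a deep learning framework and trained with Adam at the stated learning rates; the NumPy forward pass above only fixes the architecture.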
For performance evaluation, we introduce the following three indices:
• a reward learning curve shows the mean of the sum of rewards $\sum_{k=0}^{K} \gamma^k R_z(z_k, a_k)$ over 100 trajectories,
• an STL-reward learning curve shows the mean of the sum of STL-rewards $\sum_{k=0}^{K} \gamma^k R_{STL}(z_k)$ over 100 trajectories, and
• a success rate shows the number of trajectories satisfying the given STL constraint out of 100 trajectories.
We prepare 100 initial states sampled from p_0 and generate 100 trajectories using the learned policy for each evaluation. We show the results for K_pre = 0 (Case 1) and K_pre = 300000 (Case 2). We do not use pre-training in Case 1. All simulations were done on a computer with an AMD Ryzen 9 3950X 16-core processor, an NVIDIA GeForce RTX 2070 Super, and 32 GB of memory, and were implemented in Python.
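The three indices can be computed from rollout data roughly as follows; the data layout (per-step (reward, STL-reward) pairs and per-trajectory satisfaction flags) is our assumption for illustration.

```python
import numpy as np

def discounted_sum(values, gamma=0.99):
    # sum_{k=0}^{K} gamma^k * v_k
    return sum(g * v for g, v in zip(np.power(gamma, range(len(values))), values))

def evaluate(trajectories, satisfied_flags, gamma=0.99):
    """trajectories: list of [(reward, stl_reward), ...] per trajectory.
    satisfied_flags: 1 if the trajectory satisfies the STL constraint, else 0."""
    rewards = [discounted_sum([r for r, _ in tr], gamma) for tr in trajectories]
    stl = [discounted_sum([s for _, s in tr], gamma) for tr in trajectories]
    # mean reward, mean STL-reward, and success rate over the trajectories
    return np.mean(rewards), np.mean(stl), np.mean(satisfied_flags)
```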

1) Formula 1
We consider the case where the constraint is given by (34). In this simulation, we set K = 1000 and l_STL = −40. The dimension of the extended state z is determined by τ = 100. The reward learning curves and the STL-reward learning curves are shown in Figs. 10 and 11, respectively. In Case 1, it takes many steps to learn a policy such that the sum of STL-rewards is near the threshold l_STL = −40. The reward learning curve decreases gradually while the STL-reward curve increases. This is an effect of the lack of experiences satisfying the STL formula Φ. If the agent cannot satisfy the STL constraint during its explorations, the Lagrange multiplier κ becomes large, as shown in Fig. 12. Then, the STL term −κQ_{θ_s} of the actor loss J(π_θ) becomes larger than the other terms. As a result, the agent updates the parameter vector θ_π considering only the STL-rewards. On the other hand, in Case 2, the agent can obtain enough experiences satisfying the STL formula in the 300000 pre-training steps. The agent learns a policy such that the sum of the STL-rewards is near the threshold relatively quickly and fine-tunes the policy under the STL constraint after pre-training. According to the results in both cases, our proposed method is useful for learning the optimal policy under the STL constraint. Additionally, as the sum of STL-rewards obtained by the learned policy increases, the success rate for the given STL formula also increases, as shown in Fig. 13.
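The growth of κ described above is the dual update of the Lagrangian relaxation at work. A generic projected dual-ascent rule (not necessarily the paper's exact updates (18)/(24)) illustrates the mechanism: κ grows while the discounted STL return stays below l_STL, which makes the −κQ_{θ_s} term dominate the actor loss.

```python
def update_multiplier(kappa, stl_return, l_stl, lr=1e-5):
    # Gradient ascent on the dual variable: the violation (l_stl - stl_return)
    # is positive while the constraint is unsatisfied, so kappa increases.
    # Project onto kappa >= 0 after the step.
    return max(0.0, kappa + lr * (l_stl - stl_return))
```

With the small learning rate 1.0 × 10^{−5} used for the multiplier, κ changes slowly compared with the policy, which is the usual timescale separation in Lagrangian RL methods.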

2) Formula 2
We consider the case where the constraint is given by (35). In this simulation, we set K = 500 and l_STL = 35. The dimension of the extended state z is determined by τ = 50. We use the reward function R_STL(z) = exp(β1(ρ(z, φ)))/exp(β) instead of (8) to prevent the sum of STL-rewards from diverging to infinity. The reward learning curves and the STL-reward learning curves are shown in Figs. 14 and 15, respectively. In Case 1, although the reward learning curve stays above −20, the STL-reward learning curve stays much below the threshold l_STL = 35. On the other hand, in Case 2, the agent learns a policy such that the sum of STL-rewards is near the threshold l_STL = 35 and fine-tunes the policy under the STL constraint after pre-training. Our proposed method is useful not only for the formula Φ_1 but also for the formula Φ_2. Additionally, as the sum of STL-rewards obtained by the learned policy increases, the success rate for the given STL formula also increases, as shown in Fig. 16.

B. ABLATION STUDIES FOR PRE-PROCESSING
In this section, we show ablation studies for the pre-processing introduced in Section IV.D. We conduct the experiment for Φ_1 using the SAC-Lagrangian algorithm. Without pre-processing, the dimensionality of the input to the DNNs is 300; with pre-processing, it is 5. The STL-reward learning curves for each case are shown in Fig. 17. The agent without pre-processing cannot improve the performance of its policy with respect to STL-rewards. These results show that pre-processing is useful for a problem constrained by an STL formula with a large τ.

C. COMPARISON WITH ANOTHER DRL ALGORITHM
In this section, we compare the SAC-based algorithm with other algorithms: DDPG [20] and TD3 [31]. TD3 is an extended DDPG algorithm with the clipped double Q-learning technique to mitigate the positive bias of the critic estimation. For the DDPG-Lagrangian algorithm and the TD3-Lagrangian algorithm, we need to set a stochastic process generating exploration noises. We use the following Ornstein-Uhlenbeck process.
n_{k+1} = n_k + p_1(p_2 − n_k) + p_3 ε,
where ε is a noise generated by a standard normal distribution N(0, 1). We set the parameters (p_1, p_2, p_3) = (0.15, 0, 0.3). For the TD3-Lagrangian algorithm, the target policy smoothing and the delayed policy updates are the same as in the original paper [31]. The target policy smoothing is implemented by adding noises sampled from the normal distribution N(0, 0.2) to the actions chosen by the target actor DNN, clipped to (−0.5, 0.5). For the delayed policy updates, the agent updates the actor DNN and the target DNNs every 2 learning steps. Other experimental settings, such as hyperparameters, optimizers, and DNN architectures, are the same as for the SAC-Lagrangian algorithm.

FIGURE 17. STL-reward learning curves for the case without pre-processing (red) and the case with pre-processing (blue). We consider the formula Φ_1. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively. The dashed line shows the threshold l_STL = −40. The gray line shows 300000 steps.
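The Ornstein-Uhlenbeck exploration noise with the stated parameters (p_1, p_2, p_3) = (0.15, 0, 0.3) can be generated as below; the discretization n_{k+1} = n_k + p_1(p_2 − n_k) + p_3 ε is our reading of the process, with ε drawn from N(0, 1) at each step.

```python
import numpy as np

def ou_noise(steps, p1=0.15, p2=0.0, p3=0.3, dim=2, seed=0):
    """Generate temporally correlated exploration noise for `steps` steps."""
    rng = np.random.default_rng(seed)
    n = np.zeros(dim)
    out = []
    for _ in range(steps):
        # mean-reverting step toward p2 plus Gaussian perturbation
        n = n + p1 * (p2 - n) + p3 * rng.standard_normal(dim)
        out.append(n.copy())
    return np.array(out)
```

Because consecutive samples are correlated, this noise tends to push the deterministic DDPG/TD3 actor in a consistent direction for several steps, which helps exploration in physical control tasks.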
We conduct experiments for Φ_1. We show the reward learning curves and the STL-reward learning curves in Figs. 18 and 19, respectively. Although all algorithms can improve the policy with respect to rewards after fine-tuning, the DDPG-Lagrangian algorithm cannot improve the policy with respect to the STL-rewards. The STL-reward curve of the DDPG-Lagrangian algorithm stays much below the threshold. On the other hand, the TD3-Lagrangian algorithm and the SAC-Lagrangian algorithm can learn policies such that the STL-rewards exceed the threshold. These results show the importance of the double Q-learning technique, which mitigates positive biases in the critic estimations, in the fine-tuning phase. Indeed, the technique is used in both the TD3-Lagrangian algorithm and the SAC-Lagrangian algorithm. We therefore also show, in Fig. 20, the result for the case where we do not use the double Q-learning technique in the SAC-Lagrangian algorithm. Although the agent can learn a policy such that the STL-rewards are near the threshold in the pre-training phase, the performance of the agent's policy with respect to the STL-rewards is degraded in the fine-tuning phase.
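The clipped double Q-learning target that both TD3 and SAC use takes the minimum of two critic estimates when bootstrapping; a minimal sketch with placeholder critic values follows (the function name is ours).

```python
import numpy as np

def clipped_double_q_target(r, q1_next, q2_next, gamma=0.99):
    # Bootstrapping from min(Q1', Q2') counteracts the overestimation
    # that a single maximized critic accumulates.
    return r + gamma * np.minimum(q1_next, q2_next)
```

Because the minimum is a lower bound on both estimates, the learned values stay conservative, which is the mitigation of positive bias discussed above.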

VI. CONCLUSION
We considered a model-free optimal control problem constrained by a given STL formula. We modeled the problem as a τ-CMDP, which is an extension of a τ-MDP.

FIGURE. STL-reward learning curves for the formula Φ_1. The red, blue, and green curves show the results of the DDPG-Lagrangian algorithm, the TD3-Lagrangian algorithm, and the SAC-Lagrangian algorithm, respectively. The solid curves and the shades represent the average results and standard deviations over 10 trials with different random seeds, respectively.

To solve the τ-CMDP problem with continuous state-action spaces, we
proposed a CDRL algorithm with the Lagrangian relaxation. In the algorithm, we relaxed the constrained problem into an unconstrained problem to utilize a standard DRL algorithm for unconstrained problems. Additionally, we proposed a practical two-phase learning algorithm to make it easier to obtain experiences satisfying the given STL formula. Through numerical simulations, we demonstrated the performance of the proposed algorithm. First, we showed that an agent with our proposed two-phase algorithm can learn its policy for the τ-CMDP problem. Next, we conducted ablation studies for the pre-processing that reduces the dimensionality of the extended state and showed its usefulness. Finally, we compared three CDRL algorithms and showed the usefulness of the double Q-learning technique in the fine-tuning phase.
On the other hand, the syntax in this study is restrictive compared with the general STL syntax. Relaxing the syntax restriction is future work. Furthermore, we may not be able to apply our proposed methods directly to high-dimensional decision making problems because it is difficult to obtain experiences satisfying a given STL formula for such problems. Solving this issue is also an interesting direction for future work.