Reinforcement Learning for Two-Stage Permutation Flow Shop Scheduling—A Real-World Application in Household Appliance Production

Solving production scheduling problems is a difficult and indispensable task for manufacturers with a push-oriented planning approach. In this study, we tackle a novel production scheduling problem from a household appliance production at the company Miele & Cie. KG, namely a two-stage permutation flow shop scheduling problem (PFSSP) with a finite buffer and sequence-dependent setup efforts. The objective is to minimize idle times and setup efforts in lexicographic order. In extensive and realistic data, the identification of exact solutions is not possible due to the combinatorial complexity. Therefore, we developed a reinforcement learning (RL) approach based on the Proximal Policy Optimization (PPO) algorithm that integrates domain knowledge through reward shaping, action masking, and curriculum learning to solve this PFSSP. Benchmarking of our approach with a state-of-the-art genetic algorithm (GA) showed significant superiority. Our work thus provides a successful example of the applicability of RL in real-world production planning, demonstrating not only its practical utility but also showing the technical and methodological integration of the agent with a discrete event simulation (DES). We also conducted experiments to investigate the impact of individual algorithmic elements and a hyperparameter of the reward function on the overall solution.


I. INTRODUCTION
Production scheduling problems are widely studied in the academic literature, and solving them has a significant impact on the success of manufacturing companies. A particular case is the permutation flow shop scheduling problem (PFSSP), first introduced by [1]. PFSSPs typically involve a sequence of jobs that must be processed by multiple machines in a specific order. Methods for solving PFSSPs include heuristics, numerical methods, and metaheuristics.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhiwu Li.
In recent years, RL has attracted growing attention as an alternative approach for solving scheduling problems [2], [3]. However, many approaches do not move beyond the academic context because they abstract from real-world requirements, as [4] and [5] point out.
Our paper addresses a specific PFSSP, a so-called two-stage PFSSP, that occurs in a household appliance production at Miele. A two-stage PFSSP encompasses two distinct production stages with multiple machines in at least one of the stages. Thus, it is necessary not only to plan the sequence of jobs within the first stage, but also to consider how this sequence affects the material flow in the second stage. Due to this interdependence, two-stage PFSSPs are more difficult to solve than traditional PFSSPs. The concrete problem considers one machine in the first stage and multiple machines in the second stage. Moreover, it has a finite buffer connecting the stages, sequence-dependent setup efforts in the first stage, and machine shifts. The interdependence also arises from a trade-off between objectives: the goal is to minimize idle times in the second stage and the setup effort required for changing product types in the first stage. To the best of our knowledge, this problem has not yet been addressed in the literature.
Since the problem's complexity precludes an exact solution and RL generally outperforms heuristics for multi-objective problems [4], we utilize RL for our solution. To this end, we formulated the PFSSP as a Markov decision process (MDP). Furthermore, we developed an RL approach comprising algorithmic elements such as action masking, reward shaping, and curriculum learning to incorporate domain-specific knowledge. For training the agent, we utilized the professional DES software FlexSim to create a simulation model of the environment. We then conducted experiments with realistic data sets from production to benchmark our solution against a state-of-the-art GA. The results show that our RL approach performs better, especially on complex problem instances. In addition, we investigate how each algorithmic element contributes to the success of the solution and how weighting factors in the reward function can be used to lexicographically optimize the two objectives.
The paper is motivated by the research gap identified in the literature. Recent work on solving PFSSPs using RL is shown in Table 1. Our work differs from the literature in that it addresses a PFSSP variant that has not been considered before. Moreover, the literature mainly considers the optimization of only one objective. In most cases this is the makespan, e.g., in [12], [14], [16], [17], [18], [19], [20], and [21]. In contrast, based on the real-world requirements of the household appliance production, we optimize two objectives in lexicographic order. Furthermore, most of the papers considered combine RL with heuristics, metaheuristics, or other methods to generate feasible and near-optimal schedules. We focus, however, on an approach where RL is used exclusively. One advantage of this is a shorter runtime, since a trained RL agent generates a schedule in one iteration, whereas combined methods usually require several iterations. This is especially relevant for runtime-intensive simulation models. In order to evaluate the agent in real-world usage, we draw the problem instances from real-world production. This differs from the literature reviewed, which mostly employs either the Taillard data set [6] or synthetic data. As a methodology for benchmarking our approach, we use a metaheuristic, as in the majority of the literature reviewed.

II. PROBLEM FORMULATION
The real-world problem can be defined as a non-linear Mixed Integer Program (MIP) that precisely formalizes all relevant constraints and objectives. It is a two-stage PFSSP with multiple second-stage assembly stations, finite buffers, sequence-dependent setup efforts, and station-related shift windows. Each production job comprises two sequential operations, corresponding to the two consecutive stages, for manufacturing one product. As displayed in Figure 5, a single pre-assembly station (PAS) in the first stage fills a central buffer with semi-finished products (SFP). The SFP are limited to a small set of generic types, which are afterward converted to specific products. At the following order penetration point, each SFP must be assigned to a predefined non-identical final assembly station (FAS) in the second stage. Subject to a static buffer capacity limit, the buffer may initially contain a number of SFP of different types. Thus, some jobs can skip the first stage by directly assigning an SFP from the buffer to them. As a further complication, the PAS must be suitably set up for each SFP type, which leads to sequence-dependent setup efforts. Moreover, the stations can have different shift windows, which can affect the material flows. On the FAS, some jobs have predecessor relationships to other jobs on the same FAS. This means that they cannot be released until the predecessor jobs have been completed. Johnson [1] demonstrated that a specialization of the problem for makespan minimization without sequence-dependent setup efforts, finite buffers, work shifts, and release conditions can be solved exactly and efficiently in polynomial time. However, the approach in this form cannot be applied to our extended problem, nor were we able to identify suitable efficient algorithms in the existing literature. Considering the given conflicting objectives and further constraints, we assume that the problem is complex and that traditional algorithms are unlikely to solve it within an acceptable runtime.
The problem consists of the following parameters:
The following variables are required to solve the MIP:
The following auxiliary definitions allow a better modeling of the constraints:
$\zeta_{i,j,k}$ is a binary value that determines whether job $i$ precedes job $j$ on station $k$, i.e., whether the completion time of $i$ is less than or equal to the start time of $j$:
$$\zeta_{i,j,k} = \begin{cases} 1, & \text{if both jobs are assigned to station } k \text{ and the completion time of } i \text{ is less than or equal to the start time of } j \\ 0, & \text{otherwise} \end{cases}$$
Based on this, $\zeta^{*}_{i,j}$ is a binary value that determines directly preceding jobs on the PAS:
$$\zeta^{*}_{i,j} = \begin{cases} 1, & \text{if } i \text{ directly precedes } j \text{ on station } k = 1 \\ 0, & \text{otherwise} \end{cases}$$
$\lambda_i$ determines the actual processing time required for job $i$ on the PAS. When the SFP of a job is initially available in the buffer, the job is not assigned to the PAS and the processing time is zero. It should be noted that a setup operation can take place separately from the actual processing and therefore does not affect the processing time on the PAS.
Finally, $C_{i,k}$ determines the completion time of a job $i$ on a station $k$.
Based on the real use case, the primary objective is to minimize the sum of idle times from all FAS (1).In the course of lexicographical optimization, the second objective is to minimize sequence-dependent setup efforts on the PAS (2).
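The lexicographic order of the two objectives can be illustrated with a minimal sketch. It assumes that each candidate schedule is summarized by a hypothetical pair of totals (idle time, setup effort); the class and function names are illustrative and not part of the MIP.

```python
from typing import NamedTuple


class ScheduleMetrics(NamedTuple):
    """Hypothetical summary of one schedule (names are illustrative)."""
    idle_time: float     # objective (1): summed idle time of all FAS
    setup_effort: float  # objective (2): summed setup effort on the PAS


def best_schedule(candidates):
    """Return the lexicographically smallest candidate.

    Tuples compare element by element, so setup effort only breaks ties
    between candidates with identical idle time.
    """
    return min(candidates)


candidates = [
    ScheduleMetrics(idle_time=120.0, setup_effort=10.0),
    ScheduleMetrics(idle_time=0.0, setup_effort=45.0),
    ScheduleMetrics(idle_time=0.0, setup_effort=38.0),
]
best = best_schedule(candidates)
```

In this toy example, the schedule with low setup effort but non-zero idle time loses against both idle-free schedules, and the setup effort then decides between the remaining two.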
The constraints are as follows:
• The first constraint (3) ensures that a station can process only one job at a time.
• The assigned FAS can only start a job when the job is available in the buffer (4).
• Only jobs whose SFP types are initially available in the buffer can start directly on the FAS (5).
• Jobs assigned to stations must be realized in exactly one shift (6).
• The start time of a job must be after the shift's begin timestamp (7).
• The completion time of a job must be before the shift's end timestamp (8).
• Jobs can only be released when their predecessor jobs have been completed (9).
• The number of SFP in the central buffer must not exceed its upper capacity bound (10).
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

III. METHOD

A. REINFORCEMENT LEARNING
Reinforcement Learning is a type of Machine Learning in which an agent is trained to behave optimally in an environment [22]. At each time step $t$, the agent selects an action $a_t$ based on a representation of the relevant information about the environment, called the state $s_t$. When taking an action, the agent receives a reward $r_{t+1}$, which represents the desirability of the action taken in terms of the optimization goal. The behavior of the agent is determined by its policy $\pi(a_t|s_t)$, which maps states to actions. Policies are often modelled as neural networks parameterized by $\theta$. RL problems are typically formulated as a Markov decision process (MDP), which is a 5-tuple that encompasses: 1) a set of states $S$, 2) a set of actions $A$, 3) a transition probability function $T(s_{t+1}|s_t, a_t)$, 4) a reward function $R(s_t, a_t)$ determining the reward $r_{t+1}$, and 5) a discount factor $\gamma \in [0, 1)$ for discounting future rewards.
Over the last decade, several RL algorithms have been developed. One of the most widespread is the Proximal Policy Optimization (PPO) algorithm developed by OpenAI [23], which is a so-called policy-based method. We select PPO as the basis for our RL solution, as it proved more robust and less sensitive to hyperparameters than other RL algorithms in preliminary experiments (see also [24]).
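The goal of PPO is to learn parameters $\theta$ that maximize the expected cumulative discounted reward:
$$\max_{\theta} \; \mathbb{E}_{\pi_\theta}\left[\sum_{t} \gamma^{t} r_{t+1}\right]$$
PPO achieves this by maximizing a clipped surrogate objective:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$$
where $\hat{A}_t$ is the estimated advantage at time step $t$ (calculated by a generalized advantage estimator [25]), $r_t(\theta)$ is the likelihood ratio $r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{old}}(a_t|s_t)$, and $\epsilon$ is a small positive scalar that limits the update step size. This objective ensures moderate updates of $\theta$, so that $\pi_\theta$ does not deviate too much from $\pi_{\theta_{old}}$.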
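To make the clipping behavior concrete, the following minimal sketch evaluates the clipped surrogate objective for a small batch in plain Python. The function name and the default clip range of 0.2 are assumptions for illustration; the hyperparameters actually used are those listed in Table 4, and the experiments rely on the Ray RLlib implementation rather than on this code.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch of time steps.

    logp_new / logp_old: log-probabilities of the taken actions under the
        current and the old policy.
    advantages: estimated advantages (e.g., from a generalized advantage
        estimator).
    epsilon: clip range; 0.2 is an illustrative value only.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # likelihood ratio r_t(theta)
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Taking the minimum yields a pessimistic bound on the improvement.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For a positive advantage, a ratio above $1+\epsilon$ yields no additional gain, so the optimizer has no incentive to move the policy further in that direction; for a negative advantage, the unclipped term dominates and the penalty is kept in full.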

B. MDP DEFINITION
An adequate formulation of the MDP is crucial for successfully applying RL to this optimization problem. For ease of understanding, we explain and specify the MDP using the concrete problem instance at Miele. The household appliance production consists of a PAS where SFP types can be produced. After being produced, these are stored in a buffer with a static capacity of $b_{max} = 2900$. Four final assembly stations pick SFPs out of the buffer to produce the final products. The planning period is a week with five working days. The MDP is described in the following.

1) STATE SPACE
The state space must include all information necessary for the agent to fulfill the demands of the final assembly stations, i.e., there must always be enough SFPs available in the buffer so that the demanded household appliances can be produced and the final assembly stations are not idle (1). We therefore introduce several features that condense the information about when and how many SFPs are demanded at the current time step. These features cover all relevant parameters of the MIP.
• $u_{\tau,24}(t)$: The number of SFPs of type $\tau$ required by all FAS in the next 24 hours.
• $u_{\tau,end}(t)$: The number of SFPs of type $\tau$ required by all FAS until the end of the planning period.
• $b_\tau(t)$: The number of SFPs of type $\tau$ in the buffer.
• $v_{\tau,24}(t)$: The number of SFPs of type $\tau$ still to be produced so that the FAS demand is covered in the next 24 hours.
• $v_{\tau,end}(t)$: The number of SFPs of type $\tau$ still to be produced so that the FAS demand is covered until the end of the planning period.
• $t_\tau(t)$: The duration for which the demand for SFP type $\tau$ is still covered by the buffer.
All variables are normalized and clipped between 0 and 1. We incorporate $v_{\tau,24}$, $v_{\tau,end}$, and $t_\tau$ of all SFP types into the state space. If the number of SFPs of type $\tau$ in the buffer exceeds the number required until the end of the planning period (i.e., $b_\tau > u_{\tau,end}$), $v_{\tau,end}$ is set to 0 and $t_\tau$ is set to 1. Similarly, $v_{\tau,24}$ is set to 0 if there are enough SFPs of the required type in the buffer for the next 24 hours. Additionally, we add $b_u(t)$ to the state space, which represents the sum of all SFPs in the buffer divided by $b_{max}$.
The second objective is to minimize the setup efforts, as seen in (2). In order for the agent to take the setup efforts into account, the last produced SFP type $\tau'(t)$ is included in the state space as a one-hot-encoded vector.
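The features described in this subsection can be assembled into a flat observation as in the following minimal sketch. The function names, the ordering of the components, and the assumption that the per-type features arrive already normalized are illustrative choices, not a description of the implementation used in the experiments.

```python
def clip01(x):
    """Clip a normalized feature to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def build_state(v_24, v_end, t_cov, buffer_counts, b_max, last_type):
    """Concatenate per-type features, buffer fill level, and a one-hot vector.

    v_24, v_end, t_cov: lists with one normalized value per SFP type
        (assumed to be computed elsewhere from the simulation data).
    buffer_counts: number of SFPs per type currently in the buffer.
    b_max: static buffer capacity.
    last_type: index of the SFP type produced last on the PAS.
    """
    n_types = len(v_24)
    features = [clip01(x) for x in v_24 + v_end + t_cov]
    b_u = clip01(sum(buffer_counts) / b_max)  # overall buffer utilization
    one_hot = [1.0 if tau == last_type else 0.0 for tau in range(n_types)]
    return features + [b_u] + one_hot


state = build_state(
    v_24=[0.5] * 8, v_end=[0.2] * 8, t_cov=[1.0] * 8,
    buffer_counts=[100] * 8, b_max=2900, last_type=3,
)
```

With eight SFP types, this layout yields three per-type blocks of eight values, one buffer utilization value, and eight one-hot entries.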
Taking all components into account, the state is represented by the following vector:
$$s_t = \left[v_{0,24}(t), \ldots, v_{7,24}(t),\; v_{0,end}(t), \ldots, v_{7,end}(t),\; t_0(t), \ldots, t_7(t),\; b_u(t),\; \tau'(t)\right]$$

2) ACTION SPACE
Since there are eight different SFP types, we define the action space to be discrete, where an action $a$ is an integer with $a \in \{0, 1, \ldots, 7\}$ that identifies an SFP type. When the RL agent selects an action, a predefined amount $o_1$ of the selected SFP type is produced on the PAS. In this way, the production schedule is not given from the start, but is built up successively as the agent interacts with the environment. In retrospect, the production plan is then created by stringing together the actions. $o_1$ is held constant over the whole planning period and for each SFP type. Since the processing time is the same for each SFP type, a constant $o_1$ ensures equidistant time steps for the agent-environment interactions. Otherwise, a more sophisticated approach such as a so-called Semi-MDP would be needed to cope with varying production amounts and therefore non-equidistant time steps [26], [27]. We define $o_1$ to be 50 based on domain expertise. Smaller values allow the agent to produce more fine-grained and thus potentially better production plans, but they prolong training because each episode then requires more steps.
Furthermore, we use action masking to integrate domain knowledge into the agent in order to shorten the training time and increase the probability of successful training. Action masking is a technique known from reinforcement learning approaches for computer games [28] and traffic signal control [29], among others. It involves masking undesirable or non-permissible actions at each decision point, so that the agent cannot select them. We use action masking to ensure that only SFP types are produced for which there is a need until the end of the planning horizon ($v_{\tau,end} > 0$). This restriction ensures that the agent does not produce any SFPs that are not required and therefore does not waste production capacity needed for required SFPs.
We define the mask as a vector $m_a(t) = [m_{a,0}(t), m_{a,1}(t), \ldots, m_{a,7}(t)]$, where each element is defined as
$$m_{a,\tau}(t) = \begin{cases} 1, & \text{if } v_{\tau,end}(t) > 0 \\ 0, & \text{otherwise} \end{cases} \quad (12)$$
The values for the mask are computed immediately before each decision point of the agent. For undesirable actions, the probability that the agent will choose that action is set to 0.
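A minimal sketch of how such a mask can be applied to the output of a policy network is given below. It assumes that the policy produces one logit per SFP type and renormalizes a softmax over the permitted actions only; the function names are illustrative, and RL libraries usually realize the same effect by adding a large negative value to masked logits.

```python
import math


def action_mask(v_end):
    """Allow only SFP types with remaining demand until the end of the horizon."""
    return [1 if v > 0 else 0 for v in v_end]


def masked_probabilities(logits, mask):
    """Softmax over the permitted actions; masked actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be permitted")
    peak = max(l for l, m in zip(logits, mask) if m)  # for numerical stability
    weights = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]


mask = action_mask([0.0, 0.3, 0.0, 0.1])
probs = masked_probabilities([5.0, 1.0, 2.0, 1.0], mask)
```

In this toy example, the action with the highest logit is masked out, so the probability mass is split evenly between the two permitted actions.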

3) REWARD SHAPING
In RL, objectives have to be converted into a reward function. With regard to the objective functions (1) and (2), the obvious approach is to punish high idle times in the FAS stage as well as high setup efforts in the PAS stage. The agent would then learn to minimize idle times and setup efforts in order to evade punishment. For this purpose, we introduce $d(t)$, which is the sum of the idle times of all FAS between the last and the current decision time step $t$. The reward function is then
$$r_1(t) = -\alpha_{idle} \, d(t) - \alpha_{se} \, q_{\tau,\tau'}(t) \quad (13)$$
where $q_{\tau,\tau'}(t)$ is the sequence-dependent setup effort in the last time step and $\alpha_{idle}$ and $\alpha_{se}$ represent the weights of the respective objectives. An alternative approach for the reward function is achieved by reward shaping. Reward shaping is used to guide the learning process of an agent towards the desired behavior [30], [31]. It involves modifying the reward function by integrating domain expertise to make it easier for the agent to learn the optimal policy. This shortens the training time and increases the probability of successful training.
Therefore, instead of directly minimizing idle times, we shape the reward to focus on avoiding critical demands so that idle times do not occur in the first place. The advantage of using criticality is that it provides finer granularity in evaluating the agent's performance, which helps guide the learning process in the desired direction.
To this end, we define a reward function $r_{v/t,\tau}$ for each SFP type that punishes critical requirements more than non-critical ones. Critical in this context means that the ratio $v_{\tau,24}(t)/t_\tau(t)$, i.e., the required amount of SFP type $\tau$ relative to the remaining time for which the demand is covered, is high. The higher this value, the more likely idle times in a FAS will occur because the required quantity of type $\tau$ cannot be produced in time. Conversely, a smaller ratio indicates a less critical demand. The fewer SFPs are required (low value of $v_{\tau,24}$) or the more time is available for production (high value of $t_\tau$), the more this ratio moves towards its minimum of 0.
We define a piecewise linear function consisting of 3 sections (see Figure 1) to map the criticality of an SFP type to a continuous range of values. In the first range $[0, 0.036)$, representing the non-critical situation, the reward decreases from 0 (no demand for this SFP type) to −2. The second range $[0.036, 0.072)$ represents the critical situation. Here, the reward drops from −2 to −7 with increasing ratio $v_{\tau,24}(t)/t_\tau(t)$. All ratios $\geq 0.072$ reflect a highly critical situation, represented by a reward of −7. The sum of these rewards is denoted as $r_{v/t}$:
$$r_{v/t} = \sum_{\tau} r_{v/t,\tau}$$
Additionally, we define a penalty value of −1.5 for all demands that are required in less than $t_{mgn} = 30$ min. The sum of these rewards for all SFP types is denoted as $r_{mgn}$:
$$r_{mgn} = \sum_{\tau} r_{mgn,\tau}$$
This motivates the agent to keep a sufficiently large stock in the buffer for all SFP types, so that a safety margin in terms of time is maintained.
To account for setup effort, we use $q_{\tau,\tau'}(t)$ as we did in (13). The total reward of this alternative approach using reward shaping is thus given by:
$$r_2(t) = r_{v/t} + r_{mgn} - \alpha_{se} \, q_{\tau,\tau'}(t) \quad (15)$$
Both approaches, $r_1$ and $r_2$, will be compared in the experiments.
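The shaped reward can be sketched in a few lines of Python. The breakpoints and reward levels are taken from the description of the three sections; everything else is an assumption made for illustration: the function names, the treatment of an uncovered demand (zero remaining time) as highly critical, the representation of the remaining coverage time in minutes, and the form of the setup term as a weighted penalty.

```python
import math

T_MGN_MINUTES = 30.0   # safety margin t_mgn
MARGIN_PENALTY = -1.5  # penalty per SFP type inside the margin


def criticality_reward(v_24, t_cov):
    """Piecewise linear reward for one SFP type based on the ratio v_24 / t_cov."""
    if v_24 <= 0.0:
        return 0.0   # no demand for this type
    if t_cov <= 0.0:
        return -7.0  # assumption: uncovered demand counts as highly critical
    ratio = v_24 / t_cov
    if ratio < 0.036:    # non-critical: 0 down to -2
        return -2.0 * ratio / 0.036
    if ratio < 0.072:    # critical: -2 down to -7
        return -2.0 - 5.0 * (ratio - 0.036) / 0.036
    return -7.0          # highly critical


def shaped_reward(v_24, t_cov, coverage_minutes, setup_effort, alpha_se):
    """Sum of criticality rewards and margin penalties minus weighted setup effort.

    coverage_minutes: hypothetical per-type time until the demand is due;
        use math.inf for types without remaining demand.
    """
    r_vt = sum(criticality_reward(v, t) for v, t in zip(v_24, t_cov))
    r_mgn = sum(MARGIN_PENALTY for m in coverage_minutes if m < T_MGN_MINUTES)
    return r_vt + r_mgn - alpha_se * setup_effort


example = shaped_reward(
    v_24=[0.018, 0.0], t_cov=[1.0, 1.0],
    coverage_minutes=[20.0, math.inf], setup_effort=2.0, alpha_se=0.5,
)
```

In the example, the first type contributes a criticality reward of about −1 and a margin penalty of −1.5, the second type contributes nothing, and the setup term subtracts a further 1.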

4) EPISODE LENGTH
An episode ends when the buffer covers the demand for SFPs for the remaining planning period, but at the latest at $s_{max}$. However, if the buffer capacity is reached due to misplanning (see (10)) while the FAS cannot continue production because the required SFPs are missing from the buffer, the episode is terminated prematurely.
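The termination rules can be summarized in a small helper. This is a sketch only: the function name, the boolean inputs, and the interpretation of $s_{max}$ as a step limit are assumptions for illustration.

```python
def episode_finished(step, s_max, demand_covered, buffer_full, fas_starved):
    """Return a pair (done, aborted) for the current decision point.

    demand_covered: the buffer covers the demand of the remaining period.
    buffer_full: the buffer capacity is reached.
    fas_starved: the FAS cannot produce because required SFPs are missing.
    """
    if buffer_full and fas_starved:
        return True, True    # deadlock caused by misplanning
    if demand_covered or step >= s_max:
        return True, False   # regular end of the episode
    return False, False
```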

C. CURRICULUM LEARNING
The two objectives (minimization of idle times and setup efforts) conflict with each other. For instance, to minimize setup effort, there should ideally be no change of SFP types on station 1. This, in turn, would mean that the demand for other SFP types would not be met, resulting in idle times. Conflicting optimization goals pose a challenge for the learning process. To guide the agent in the learning process, we use curriculum learning [32], [33], i.e., the agent is exposed to a sequence of tasks with increasing complexity. We divide the learning problem into 3 tasks from easy to hard. The agent trains on the easiest task until it masters it and then proceeds to the next harder task. The resulting algorithm is presented in Algorithm 1.
Since minimizing idle time is the main objective, it is trained in the first task. The punishment for setup effort from (15) is removed for this purpose:
$$r_{2,easy}(t) = r_{v/t} + r_{mgn} \quad (16)$$
To further simplify this task, the original action mask in (12) is restricted so that only the 3 most critical SFP types can be produced. We define the mask as $m_{easy}(t) = [m_{easy,0}(t), m_{easy,1}(t), \ldots, m_{easy,7}(t)]$, where each element is defined as:
$$m_{easy,\tau}(t) = \begin{cases} 1, & \text{if } v_{\tau,24}(t)/t_\tau(t) \text{ is among the 3 greatest values} \\ 0, & \text{otherwise} \end{cases} \quad (17)$$
At the end of each episode, the sum of all idle times $\sum_t d(t)$ is added to a queue $D$ with a capacity of 100. The agent trains on the first task until the mean of $D$, i.e., the average sum of idle times of the last 100 episodes, is less than 100 s. The task is then considered solved, and the agent switches to the next task. In task 2, the original action mask (12) is used. Thus, the agent must now learn to choose, from a larger set of SFP types, those that minimize idle times. In the last task, the penalty for setup effort is added to the reward function, resulting again in the original reward function (15). To reduce the training time, we use a parallel algorithm that leverages the computing resources of multiple CPUs on a machine.
It is theoretically possible that the initial buffer allocation is so poor that $\mathrm{mean}(D) < 100$ s will never be reached. In this case, this condition would have to be relaxed. However, from the practical experience of domain experts, it is known that this case rarely occurs.
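The task-switching logic of the curriculum can be sketched as follows. The queue size and threshold follow the text; several details are assumptions made for illustration and are marked in the comments: the class and function names, the requirement that the queue be full before a switch, the reuse of the same criterion for the switch from task 2 to task 3, and the clearing of the queue after a switch.

```python
from collections import deque

# Mask and reward used in each task, named after the symbols in the text.
TASK_SETTINGS = {
    1: ("m_easy", "r_2_easy"),
    2: ("m_a", "r_2_easy"),
    3: ("m_a", "r_2"),
}


def easy_mask(ratios, k=3):
    """Permit only the k SFP types with the greatest criticality ratio."""
    top = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(ratios))]


class Curriculum:
    def __init__(self, window=100, threshold_s=100.0):
        self.idle_sums = deque(maxlen=window)  # queue D
        self.threshold_s = threshold_s
        self.task = 1

    def settings(self):
        return TASK_SETTINGS[self.task]

    def end_episode(self, idle_sum_s):
        """Record the summed idle time of an episode and update the task."""
        self.idle_sums.append(idle_sum_s)
        # Assumption: the queue must be full before the mean is evaluated.
        full = len(self.idle_sums) == self.idle_sums.maxlen
        mean = sum(self.idle_sums) / len(self.idle_sums)
        # Assumption: the same criterion triggers the switch to task 3.
        if full and mean < self.threshold_s and self.task < 3:
            self.task += 1
            self.idle_sums.clear()  # assumption: each task is judged afresh
        return self.task
```

A training loop would call `settings()` before collecting trajectories and `end_episode()` once per finished episode.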

D. GENETIC ALGORITHM (GA)
In order to benchmark the performance of the RL agent in a representative way, a suitable metaheuristic algorithm was employed. Here, we utilized a Non-dominated Sorting GA (NSGA-II), which is a widely used multi-objective GA in the current scheduling literature as well as in many real-world applications [34], [35]. The GA was implemented in the Python framework pymoo, which provides a set of modern, suitably preconfigured metaheuristics for multi-objective optimization [36]. Appropriate and modified GA operators were used for fair comparability and are described in the following subsections.

1) SCHEDULE ENCODING AND EVALUATION
A solution candidate of a schedule (=individual) is represented as a sequential vector of SFP types $\tau$ (see Table 2) to be processed on the PAS from left to right. The individual's SFP sequence includes all SFP types required by the FAS to complete all production jobs. In order to evaluate (encode) an individual and to determine its fitness, the simulation model introduced in Section IV-A1 is utilized. Upon completing the full simulation of the SFP sequence, the individual's fitness value, which corresponds to the objective function value (see Eq. 1 and 2), can be calculated from the obtained simulation metrics.
Algorithm 1 (recoverable steps): For each trajectory, the sum of idle times is appended to $D$ at the end of the episode. Task 1 uses $m_{easy}$ and $r_{2,easy}$, task 2 uses $m_a$ and $r_{2,easy}$, and task 3 uses $m_a$ and $r_2$.

2) INITIALIZATION
Due to the limited capacity of the central buffer, it was not possible to generate feasible individuals for the base population by randomly permuting the SFP sequence. As a result, the number of viable sequences for processing the SFP on the PAS is restricted and must be determined with consideration of this bottleneck. For this purpose, let the matrix $M$ be the starting point for generating a feasible individual. Each row corresponds to a specific FAS, where the columns (from left to right) determine the predefined sequence based on $r_{i,j}$ (see Table 2). As also depicted in Figure 2, the individual is generated as follows:
1) For a diversity of possible sequences, $M$ undergoes a line-wise shuffling using a randomly generated permutation matrix $P$: $M \leftarrow M \times P$.
2) $M$ is flattened column-wise to create a valid sequence for the PAS, in which the buffer capacity cannot be exceeded. However, the setup effort is still high when using this vector.
3) From left to right: Remove all types from the sequence that are initially available in the buffer and therefore do not need to be processed on the PAS.
4) The sequence is clustered according to a randomly selected group size $n$, grouping identical SFP types and reducing setup efforts. The clustering starts with the first element of the sequence, followed by moving the next $n - 1$ elements of the same type in front of the current element. Afterwards, the next element in the sequence is selected and the process is repeated.
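Steps 2 to 4 of the initialization can be sketched as follows. This is one possible reading of the procedure under stated assumptions, not the implementation used for the benchmark: the function names are illustrative, each sequence entry is assumed to consume one buffer unit in step 3, and the clustering in step 4 is interpreted as pulling the next $n - 1$ later entries of the same type directly next to the current entry. The shuffling in step 1 is omitted.

```python
from collections import Counter
from itertools import zip_longest


def flatten_column_wise(rows):
    """Step 2: interleave the per-FAS type sequences column by column."""
    skip = object()  # padding marker for rows of unequal length
    columns = zip_longest(*rows, fillvalue=skip)
    return [tau for column in columns for tau in column if tau is not skip]


def drop_buffered(sequence, initial_buffer):
    """Step 3: from left to right, skip entries served by the initial buffer.

    Assumption: each entry of the sequence consumes one buffer unit.
    """
    stock = Counter(initial_buffer)
    kept = []
    for tau in sequence:
        if stock[tau] > 0:
            stock[tau] -= 1
        else:
            kept.append(tau)
    return kept


def cluster_sequence(sequence, group_size):
    """Step 4: group up to group_size entries of the same type together."""
    remaining = list(sequence)
    clustered = []
    while remaining:
        current = remaining.pop(0)
        group = [current]
        i = 0
        while i < len(remaining) and len(group) < group_size:
            if remaining[i] == current:
                group.append(remaining.pop(i))  # pull entry next to current
            else:
                i += 1
        clustered.extend(group)
    return clustered
```

For example, clustering the sequence `[0, 1, 0, 2, 1, 0, 0]` with a group size of 3 yields `[0, 0, 0, 1, 1, 2, 0]`, which reduces the number of type changes from five to three.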

3) CROSSOVER AND MUTATION
We used the well-known Job Order Crossover and Swap Mutation as operators to create neighbor individuals. Figure 3 illustrates the essential principle of the operators. Job Order Crossover combines two parent individuals: First, a random selection of SFP types from the first parent is transferred positionally to the offspring individual (see blue markings). Then, the free positions of the offspring are filled with the remaining SFP types from the second parent in the order of the second parent (see red markings) [37]. The Swap Mutation only changes the structure of a single individual by interchanging the positions of two random genes (see green markings) [38].
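Both operators can be sketched for sequences that contain repeated SFP types. The sketch assumes that both parents are permutations of the same multiset of types and that the crossover keeps all positions of the selected types from the first parent; the function names and the optional `keep` argument are illustrative and independent of the pymoo operators used in the experiments.

```python
import random


def job_order_crossover(parent_a, parent_b, rng, keep=None):
    """Keep the positions of selected types from parent_a, fill from parent_b.

    keep: set of SFP types copied positionally from parent_a; drawn at
        random if omitted.
    """
    if keep is None:
        types = sorted(set(parent_a))
        keep = set(rng.sample(types, k=rng.randint(1, len(types))))
    # Remaining genes in the order in which they appear in parent_b.
    filler = iter(gene for gene in parent_b if gene not in keep)
    return [gene if gene in keep else next(filler) for gene in parent_a]


def swap_mutation(individual, rng):
    """Exchange the genes at two distinct random positions."""
    mutant = list(individual)
    i, j = rng.sample(range(len(mutant)), 2)
    mutant[i], mutant[j] = mutant[j], mutant[i]
    return mutant


rng = random.Random(0)
parent_a = [0, 0, 1, 2, 2, 1]
parent_b = [2, 1, 0, 1, 2, 0]
child = job_order_crossover(parent_a, parent_b, rng, keep={0})
mutant = swap_mutation(parent_a, rng)
```

Because both operators only rearrange genes, every offspring contains the same number of entries per SFP type as its parents, so the demand of the FAS remains covered.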

IV. EXPERIMENTS

A. EXPERIMENTAL SETUP

1) PROBLEM INSTANCE, DATA, AND SIMULATION MODEL
Our approach is evaluated on realistic data from Miele production for 5 different weeks (w1, w2, w3, w4, w5). Each data set contains the following information for one week:
• The initial amount of SFPs in the buffer ($b_{\tau,0}$) for each type.
• A setup efforts matrix to determine $q_{\tau,\tau'}$.
• All work shifts $S_k$ (including maintenance and break times) for the scheduling horizon for each station $k$.
• A set of jobs $J$ with precedence constraints (determined by $r_{i,j}$). An example for all jobs assigned to a final assembly station for week 2 is shown in Figure 4.
• All processing times for all jobs on each station ($p_{i,k}$).
As a basis for training the RL agent, a high-level simulation model of the household appliance production at Miele was developed with the DES software FlexSim. Only the relevant parts and dependencies for the planning problem (see Section II) were modeled. Prior to each training episode or inference, the model is initialized with a week-specific data set. At each step, FlexSim collects all the data necessary to compute the state vector and reward and passes it to the agent. Based on these inputs, the agent decides which SFP type to produce next and sends that decision back to FlexSim. This is how the agent determines the production sequence in the PAS. A screenshot of the simulation model and its integration into the RL workflow is depicted in Figure 5.

2) SOFTWARE ARCHITECTURE AND REINFORCEMENT LEARNING IMPLEMENTATION
We use a PPO algorithm from the Ray RLlib module [39], which we modify according to Algorithm 1 for training the RL agent. Since Ray RLlib requires Python, we encapsulated the FlexSim simulation model as a Python class. This class inherits from the environment base class of the Gymnasium module [40] and represents the training environment for the RL agent. FlexSim offers an RL connector that is used to communicate with the environment class via a socket connection.
For training, the environment class is handed over to Ray RLlib. The training process is parallelized with several environment instances on multiple CPUs, speeding up data collection. For inference, the trained agent $\pi^*$ is wrapped in an HTTP server. At every decision event, the simulation model queries the HTTP server for the next action. The software architecture is shown in Figure 6.
To generate a schedule $S^*$ for a week, the sampled actions are recorded sequentially in $S^*$. To do this, an environment env is first initialized with the data belonging to a week, such as $b_{\tau,0}$, $S_k$, etc. The environment provides the current state $s_t$ and action mask $m_a$ for each decision point, which are fed into $\pi^*$ in order to sample the next action $a_t$. $a_t$ is appended to $S^*$ and fed back into the environment until the planning period is over or the buffer provides enough SFPs for all jobs (done=True). Algorithm 2 illustrates the generation of a schedule.
Algorithm 2 (recoverable steps):
while not done do
    sample $a_t$ from $\pi^*$ given $s_t$ and $m_a$
    append $a_t$ to $S^*$
    $s_t$, $m_a$, done = env.step($a_t$)
end while
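The rollout described above can be written as a short Python loop. The sketch below replaces the FlexSim model and the trained agent with hypothetical stand-ins: `ToyEnv` and `greedy_policy` are illustrative only, and the assumption that `reset()` and `step()` return the triple (state, mask, done) follows the notation of Algorithm 2 rather than the Gymnasium API.

```python
class ToyEnv:
    """Hypothetical stand-in: each action produces one lot of an SFP type."""

    def __init__(self, demand_lots):
        self.initial = list(demand_lots)

    def reset(self):
        self.remaining = list(self.initial)
        return self._observe()

    def step(self, action):
        self.remaining[action] -= 1
        return self._observe()

    def _observe(self):
        mask = [1 if r > 0 else 0 for r in self.remaining]
        done = not any(mask)  # all demand is covered
        return list(self.remaining), mask, done


def greedy_policy(state, mask):
    """Hypothetical policy: pick the permitted type with the largest demand."""
    allowed = [a for a, m in enumerate(mask) if m]
    return max(allowed, key=lambda a: state[a])


def generate_schedule(env, policy):
    """Record the chosen actions until the environment signals done."""
    schedule = []
    state, mask, done = env.reset()
    while not done:
        action = policy(state, mask)
        schedule.append(action)
        state, mask, done = env.step(action)
    return schedule


schedule = generate_schedule(ToyEnv([2, 0, 1]), greedy_policy)
```

In the actual setup, the policy call would be answered by the trained agent and the environment by the simulation model; the loop structure stays the same.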

B. BENCHMARK WITH GENETIC ALGORITHM
In this experiment, the RL approach is benchmarked against the GA described in Section III-D. The hyperparameters for the RL algorithm were determined in preliminary experiments (Table 4). RL is trained with 700k steps, where one step is defined as producing $o_1$ SFPs. To ensure a fair comparison, the GA is stopped after the same number of steps. Both algorithms search for the optimal sequence of SFPs to be produced for five realistic data sets in order to lexicographically minimize idle times first and setup efforts second. Each trial is repeated 10 times to evaluate the stability and variance of the performance of the algorithms. The results are shown in Table 5. Here, an attempt is evaluated as failed if the algorithm was not able to reduce the idle time to 0. The average of the setup efforts SE refers only to the successful attempts of a week. For a more detailed evaluation of the setup efforts, see Figure 7.
The results show that RL is significantly more robust in finding feasible solutions than GA. In fact, with RL all trials were successful, whereas with GA 38% of all trials failed. However, the failure rate fluctuates considerably between weeks. For example, the number of failed trials for week 1 is 8, whereas for week 2 it is 0. A possible explanation is the differing complexity of the problem instances, which may make it more difficult to minimize the conflicting objectives in some weeks. Further research should be undertaken to investigate the impact of the instance's structure on the problem's complexity. Furthermore, the average setup efforts in weeks 1-4 are in some cases significantly smaller with RL than in the successful trials with GA. For example, in week 3 the setup effort with RL is on average 36.6% smaller than with GA, and in week 1 it is even 76.1% smaller. Likewise, the variance of the setup efforts for weeks 1-4 is smaller with RL, as can be seen from the boxplots. In week 5, the GA found no successful solution in 4 of the 10 trials. The remaining 6 trials, on the other hand, show on average a 5.6% smaller setup effort compared to the RL results.
Overall, it can be concluded that our proposed RL approach leads much more robustly to successful solutions with on average less setup effort compared to state-of-the-art metaheuristics.

C. ABLATION STUDIES
In the following, we examine the impact of reward shaping (RS), action masking (AM), and curriculum learning (CRCL) on the performance of the RL solution. For this purpose, we conducted several experiments with modified variants of our algorithm, in which we successively disabled some of these elements to quantify their impact on the success rate of the training. We tested the following variants:
• AM+CRCL+RS: Original variant, serving as baseline.
• CRCL+RS: No AM was used, i.e., the agent could choose any action at any decision point. Accordingly, the curriculum learning protocol was adapted so that the first task was skipped.
• AM+RS: No curriculum learning was used. However, the action space of the agent was restricted so that only those SFPs may be produced for which there is a need until the end of the planning horizon (see Eq. 12).
• RS: Neither action masking nor curriculum learning was used.
• AM+CRCL: The natural approach of using idle time and setup effort directly as punishment is chosen as the reward function (see Eq. 13).
Each variant is trained 10 times each for weeks 1 and 2. According to the success rate of GA, weeks 1 and 2 represent a difficult and a rather easy planning problem, respectively. The results are shown in Figure 8.
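To make the role of AM in these variants concrete, the following is a minimal sketch of masked action sampling. It assumes a discrete policy that emits one logit per SFP type and a binary mask marking the types that may still be produced. The function names `masked_probs` and `masked_sample` and all numeric values are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def masked_probs(logits, mask):
    """Convert policy logits into probabilities over allowed actions only."""
    logits = np.asarray(logits, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one action must be allowed")
    # Forbidden actions get a logit of -inf, so their probability is exactly 0.
    masked = np.where(mask, logits, -np.inf)
    # Subtract the largest allowed logit for numerical stability.
    exp = np.exp(masked - masked[mask].max())
    return exp / exp.sum()

def masked_sample(logits, mask, rng):
    """Sample one allowed action index from the masked distribution."""
    probs = masked_probs(logits, mask)
    return int(rng.choice(len(probs), p=probs))

# Hypothetical example: four SFP types, type 1 is no longer needed.
rng = np.random.default_rng(0)
logits = [2.0, 0.5, -1.0, 1.0]
mask = [1, 0, 1, 1]
probs = masked_probs(logits, mask)
action = masked_sample(logits, mask, rng)
```

Without the mask (as in the CRCL+RS and RS variants), the agent would have to learn from the reward signal alone that producing an unneeded SFP type is pointless.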
Omitting AM or CRCL leads to a 40% reduction in the rate of successful training runs for week 1. This highlights the ability of AM and CRCL to enable the agent to find better policies in challenging planning problems. For week 2, omitting CRCL leads to a reduction of only 20% due to the lower problem complexity, while omitting AM has no effect on the success rate at all. The success rate for the RS variant is also noteworthy. Here, omitting both AM and CRCL for week 1 leads to a 70% reduction in the success rate compared to the original variant. This emphasizes that not only the individual algorithmic elements but also the interaction of AM and CRCL helps the agent to find good policies. The highest impact on the success rate, however, comes from reward shaping, as can be seen in the results of AM+CRCL. In this case, for week 1, no successful agent could be trained in any of the 10 trials. For week 2, the success rate also dropped significantly to only 20%.
It can be concluded from the ablation studies that integrating domain knowledge through RS, AM, and CRCL is necessary to generate successful and performant schedules. This holds all the more, the more complex the data set is. Notably, these algorithmic building blocks are necessary to overcome the difficulty of solving for the competing objectives in the two interdependent stages. At the same time, however, this also shows that these building blocks are sufficient to provide the agent with enough assistance during training. Thus, a purely RL-based approach to solving this optimization problem is possible, whereas many rather runtime-intensive approaches in the literature combine RL with other methods, as shown in Section I.

D. SETUP EFFORTS WEIGHTING FACTOR
To mitigate the challenge of conflicting objectives, curriculum learning was applied, so that the agent first learns to minimize the idle times and then the setup efforts. However, balancing these objectives through the weight α_se in the reward function is also crucial, since it determines the trade-off between minimizing idle times and setup efforts. Therefore, in this experiment we investigate the impact of this weight by varying it from 0 to 16 in steps of 0.5 and training 5 runs for each of these values for weeks 1 and 2.
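The size of this sweep can be sketched as follows. The grid and run counts follow the description above, assuming 5 runs per weight value and per week. The function `weighted_penalty` is only a hypothetical, simplified stand-in that shows how a weight like α_se trades off the two objectives; it is not the shaped reward function used in this work.

```python
import numpy as np

# Weight grid from the text: 0 to 16 (inclusive) in steps of 0.5.
alpha_values = np.arange(0.0, 16.0 + 0.5, 0.5)
weeks = (1, 2)
runs_per_setting = 5  # assumption: 5 runs per weight value and per week

# One entry per training run: (week, alpha_se, run index).
experiments = [
    (week, float(alpha), run)
    for week in weeks
    for alpha in alpha_values
    for run in range(runs_per_setting)
]

def weighted_penalty(idle_time, setup_effort, alpha_se):
    """Hypothetical linear penalty: idle time plus weighted setup effort.

    With alpha_se = 0 only idle time is penalized; larger values shift
    the emphasis toward setup effort.
    """
    return -(idle_time + alpha_se * setup_effort)
```

Under these assumptions the grid contains 33 weight values, which amounts to 330 training runs over both weeks.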
The average sums of the idle times and setup efforts for each α_se are shown in Figure 9. First, it is noticeable that between 0, where the agent focuses only on minimizing idle times, and an only slightly higher value of 0.5, there is a significant reduction in setup efforts. For instance, in week 1, setup efforts are reduced from an average of 1829 to 343.
As α_se increases, setup efforts continue to decrease, as expected, since the agent gives more weight to minimizing them. However, an excessively high α_se leads the agent to prefer accepting idle times rather than planning in a way that ensures the final assembly lines always have an adequate supply of SFPs. This effect is observed for week 1 starting at an α_se of 9.5, where idle times are on average 1041 s. As α_se increases further, the effect becomes more severe, so that at α_se = 15 there are more than 8 h of idle time. For week 2, this effect occurs somewhat later (from α_se = 12), but with the same magnitude in terms of the resulting idle times, e.g., ca. 8.9 h at α_se = 13.
This effect thus depends on the complexity of the data set. Therefore, it would be ideal to fine-tune α_se for each week to achieve the lowest possible setup effort. However, it is worth noting that there is a fairly wide range of values for α_se in which setup effort remains low and no idle time occurs, while further increasing α_se results in only a small reduction in setup effort. This mitigates the need for fine-tuning. For example, doubling α_se from 3.5 to 7 reduces the setup effort by only ca. 19, i.e., 8.1%, for week 1.
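Choosing α_se from such a sweep can be sketched as a lexicographic selection: idle time is compared first, and setup effort only breaks ties. The function name `pick_alpha` and the toy numbers below are hypothetical and serve only as an illustration; they are not results from the experiments.

```python
def pick_alpha(results):
    """Return the weight whose (idle time, setup effort) pair is lexicographically smallest.

    results maps each candidate weight to a tuple (idle_time, setup_effort).
    Python compares tuples element by element, which matches the
    lexicographic order of the two objectives.
    """
    return min(results, key=lambda alpha: (results[alpha][0], results[alpha][1]))

# Toy values (hypothetical): idle time in seconds, setup effort in arbitrary units.
toy_results = {
    0.0: (0.0, 900.0),    # no idle time, but high setup effort
    2.0: (0.0, 300.0),
    6.0: (0.0, 250.0),    # no idle time and lowest setup effort among those
    12.0: (500.0, 200.0), # lower setup effort, but idle time occurs
}
best_alpha = pick_alpha(toy_results)
```

In the toy data, the largest weight yields the lowest setup effort but is rejected because it causes idle time, mirroring the effect described above.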

V. CONCLUSION AND FUTURE RESEARCH
In this paper, we addressed a scheduling problem occurring in one of Miele's household appliance productions: a two-stage PFSSP with a finite buffer, sequence-dependent setup efforts, and work shifts. The objective is to minimize the idle times and setup efforts in lexicographic order. For this purpose, the problem was formulated as a Markov decision process and then solved with RL. We developed an RL approach that integrates domain knowledge through reward shaping, action masking, and curriculum learning.
Experiments on realistic data show the superiority of our approach over a state-of-the-art GA. They also demonstrated that incorporating domain knowledge is critical to successful planning on complex data sets. In addition, we investigated adjusting the setup effort weighting parameter to ensure minimization of idle time and setup effort in lexicographic order. Moreover, we developed a software architecture to connect the DES to the Ray RLlib framework. Our work thus provides a successful example of the applicability of RL in real-world production planning.
In the future, we plan to further develop the system for real-world use, addressing three aspects in particular: robustness, explainability, and universality.
In real-world production, sporadic disturbances (e.g., because of machine failures) are unavoidable. Therefore, our approach has to be extended so that the agent can generate sequences that avoid idle times and keep setup efforts as low as possible despite such disturbances. The approach should make it possible to integrate historical data on disturbances such as machine failures, so that the plans are made only as robust as necessary and the actual optimization goals are not unnecessarily deprioritized.
Furthermore, the agent's decisions are to be made transparent so that production planners working with this system can understand and evaluate them. Finally, an agent is to be developed that performs well across a variety of problem instances with varying complexity in a zero-shot or few-shot learning manner. Such a universal agent would save training time, since currently a new agent has to be trained for each week.

FIGURE 1. Reward Function for punishing the criticality of an SFP type.

Algorithm 1 Parallelized PPO With Action Masking and Curriculum Learning
1: Initialize parameters θ.
2: Initialize Task = 1 and Queue D with length 100.
3: for each iteration do
4:   for each actor do {parallelized across CPUs}
5:     Collect set of trajectories by running policy π_θ in the

FIGURE 2. An exemplary creation of an individual (SFP sequence) for the base population, taking into account the bottleneck of the central buffer and the clustering of identical SFP types to minimize PAS setup efforts.

FIGURE 3. Simplified representation of Job Order Crossover and Swap Mutation.

FIGURE 4. Example of all jobs of an FAS for week 2. Each color represents a different type.

FIGURE 5. High-Level Simulation Model in FlexSim with conceptual integration of RL.

Algorithm 2 Schedule Generation With Trained PPO Agent
1: Input: Trained PPO agent π*, Environment env, S* = ∅
2: s_t, m_a = env.reset()
3: while not done do
4:   sample a_t from π*(s_t) using action mask m_a
5:   append a_t to S*
6:
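The schedule generation loop of Algorithm 2 can be sketched in Python as follows. Because the listing above is cut off after step 5, the environment transition (`env.step`) is an assumption, and `ToyEnv` and `toy_policy` are hypothetical stand-ins for the simulation environment and the trained PPO agent. The sketch is meant only to illustrate the control flow, not the actual integration with the DES.

```python
import numpy as np

class ToyEnv:
    """Hypothetical environment: each SFP type has a remaining demand."""

    def __init__(self, demand):
        self.initial = np.array(demand, dtype=int)
        self.remaining = self.initial.copy()

    def _mask(self):
        # Only types with open demand may be produced.
        return self.remaining > 0

    def reset(self):
        self.remaining = self.initial.copy()
        return self.remaining.copy(), self._mask()

    def step(self, action):
        # Assumed transition: producing a type reduces its demand by one.
        self.remaining[action] -= 1
        mask = self._mask()
        return self.remaining.copy(), mask, not mask.any()

def toy_policy(state):
    """Hypothetical policy: logits equal to the remaining demand per type."""
    return state.astype(float)

def generate_schedule(policy, env, rng):
    """Roll out the policy once and return the sequence of chosen actions."""
    schedule = []
    state, mask = env.reset()
    done = not mask.any()
    while not done:
        # Masked softmax: forbidden actions receive probability 0.
        logits = np.where(mask, policy(state), -np.inf)
        probs = np.exp(logits - logits[mask].max())
        probs /= probs.sum()
        action = int(rng.choice(len(probs), p=probs))
        schedule.append(action)
        state, mask, done = env.step(action)
    return schedule

schedule = generate_schedule(toy_policy, ToyEnv([2, 0, 3]), np.random.default_rng(0))
```

In this toy setting the resulting sequence always contains exactly the demanded number of each type, because the mask removes a type as soon as its demand is met.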

FIGURE 6. Software Architecture for training and inference of Ray RLlib RL agents with FlexSim.

FIGURE 7. Setup efforts as boxplots for all successful trials. The results for GA at week 1 are not shown here because they are significantly larger than the other values.

FIGURE 9. Setup efforts and idle times for different α_se.

TABLE 1. Comparison of recent RL approaches for solving the PFSSP.

TABLE 5. Performance of RL and GA.