Deep Reinforcement Learning for Minimizing Tardiness in Parallel Machine Scheduling With Sequence Dependent Family Setups

Parallel machine scheduling with sequence-dependent family setups has attracted much attention from academia and industry due to its practical applications. In a real-world manufacturing system, however, solving the scheduling problem becomes challenging since it is required to address urgent and frequent changes in demand and due-dates of products. To minimize the total tardiness of the scheduling problem, we propose a deep reinforcement learning (RL) based scheduling framework in which trained neural networks (NNs) are able to solve unseen scheduling problems without re-training even when such changes occur. Specifically, we propose state and action representations whose dimensions are independent of production requirements and due-dates of jobs while accommodating family setups. At the same time, an NN architecture with parameter sharing was utilized to improve the training efficiency. Extensive experiments demonstrate that the proposed method outperforms the recent metaheuristics, rule-based, and other RL-based methods in terms of total tardiness. Moreover, the computation time for obtaining a schedule by our framework is shorter than those of the metaheuristics and other RL-based methods.


I. INTRODUCTION
As the competition among enterprises intensifies, production scheduling becomes one of the essential decision-making problems in modern manufacturing systems. Specifically, manufacturers should fulfill production orders under the sequence-dependent family setup time (SDFST) requirement that occurs when two products belonging to different families are consecutively processed on a machine [1]. Furthermore, since customer demands frequently and unpredictably change, it is required to deal with the variabilities associated with the production requirements and due-dates of the products [2]. Accordingly, there is a challenge in developing a scheduling method that is able to obtain high-quality schedules while accommodating the variabilities.
The associate editor coordinating the review of this manuscript and approving it for publication was Hao Shen.
We focus on the unrelated parallel machine scheduling problem (UPMSP) with SDFST, which has attracted a great deal of attention in various domains such as semiconductor [3]-[5], chemical [6], and food industries [7]. A UPMSP aims to allocate each job to one of the machines, where the processing times of a job on different machines are not related. This scheduling problem is known to be NP-hard for minimizing the total tardiness [8].
Metaheuristics have been successfully adopted for solving UPMSPs with SDFST under due-date constraints [9]-[12]. Unfortunately, they are not guaranteed to find a high-quality schedule for large-scale scheduling problems within a specific time limit. As an alternative, manufacturers have actively employed rule-based methods due to their short computation time and ease of implementation [13]. However, schedules obtained by the rule-based methods may not be satisfactory since their decisions are made in a myopic manner [14].
To overcome the drawbacks of rule-based methods, reinforcement learning (RL) approaches have been actively investigated for decades [15], [16]. The purpose of RL is to learn an adaptive policy that maximizes the expected cumulative reward. In recent years, due to the remarkable success of deep reinforcement learning (DRL), which utilizes deep neural networks (DNNs) [17], several studies have shown promising results on scheduling problems in manufacturing systems [18]-[21]. Yet, there are still two challenges in solving UPMSPs with a DRL-based method while addressing SDFST as well as the variabilities in production requirements and due-dates. First, the size of the state space might become large when accommodating due-dates and sequence-dependent setups in the state representation of a neural network, which leads to difficulties in function approximation and DNN generalization [21]. Second, since learning complexity grows quickly as the numbers of jobs and machines increase, it is intractable to re-train a DNN whenever such variabilities occur in large-scale manufacturing systems.
To this end, we propose a DRL-based method for minimizing tardiness for UPMSP with SDFST to address the above challenges. It is worth noting that learning DNN parameters robust to changes in production requirements and due-dates is the primary concern of our work. The contributions of this paper are summarized as follows:
• We design a novel state representation whose dimensionality is independent of production requirements and due-dates of jobs while accommodating SDFST. Given a state, an action is executed by periodically determining a setup status of a machine and a family of a job.
• To reduce the size of the network and increase the training efficiency, we propose a DNN architecture in which network parameters are shared among several hidden layers [22]. In the experiments, the effectiveness of the proposed state representation and parameter sharing is demonstrated.
• To validate the performance of the proposed method, we tested our method on large-scale datasets. Experimental results showed that the proposed method outperforms recent metaheuristics, rule-based, and other RL-based methods in terms of the total tardiness for all datasets. Moreover, the computation time taken by the proposed method was shorter than those of other RL-based methods and metaheuristics considered. Finally, the robustness of the proposed method was investigated by solving scheduling problems with stochastic processing and setup time.
The rest of this paper is organized as follows. Section II introduces the previous approaches for solving the UPMSP with SDFST and the implications of studies on RL-based scheduling methods. Section III defines the scheduling problem considered in this paper. Details of the proposed method and its training algorithm are presented in Section IV. Performance comparisons with considered alternatives are carried out in Section V. Finally, Section VI concludes this paper.

II. LITERATURE REVIEW

A. UPMSPs WITH SDFST UNDER DUE-DATES CONSTRAINTS
UPMSPs with due-date related objectives are classical scheduling problems that have been comprehensively researched for decades [23]. In particular, several studies have addressed SDFST as main constraints [24]-[27]. To reduce the number of setups in a scheduling problem, batch scheduling heuristics were popularly adopted by forming batches of jobs processed on a machine without a setup [28]. To minimize the weighted tardiness, two-level batch heuristics were proposed under the assumption of common due-dates [29] and identical jobs [30]. Additionally, the batch apparent tardiness cost with setup (BATCS) heuristic was suggested by incorporating the processing time, slack time, and family setup time [31]. For the total tardiness objective, an improved simulated annealing heuristic was developed by incorporating a batch-based repair method [9].
On the other hand, metaheuristics have been adopted for solving UPMSPs with SDFST to minimize the total tardiness. They employed the batch formation technique to restrict the solution space to schedules with a small number of setup changes. The scheduling problem with primary customer constraints was successfully solved by the iterated greedy (IG) algorithm [10], which enhances an initial solution through iterative neighborhood selection and exhaustive local search. More recently, Pinheiro et al. [11] developed an improved IG by designing six local search operations such as the batch swap and job insertion. They showed competitive results in solving scheduling problems with six machines and 50 jobs. In [12], the artificial bee colony algorithm was proposed, which applies crossover operations to exchange job sequences.

B. REINFORCEMENT LEARNING ON UPMSPs
Markov processes have been broadly adopted to solve optimal control problems in the presence of abrupt changes in system dynamics [32]. Recently, several Markov random processes have been employed for their advantages in modeling fuzzy systems, such as Markov chaotic systems [33] and semi-Markov jump nonlinear systems [34]. For solving scheduling problems, a manufacturing system is formulated as a Markov decision process (MDP) [35], and RL is utilized to learn a policy of the MDP. RL aims to train an agent by interacting with an environment that consists of everything outside the agent. In particular, Q-learning (QL) [36], one of the representative model-free RL approaches, has been widely adopted to solve scheduling problems. Given a state observed from the environment that corresponds to a manufacturing system, the agent makes scheduling decisions by predicting the estimated value of an action, called the Q-value.
For solving UPMSPs by utilizing RL-based methods, Zhang et al. adopted QL to minimize the weighted tardiness [37], [38]. They employed a linear basis function to approximate Q-values for given state features indicating the status of all jobs and machines. With six heuristics designed as actions, QL-based adaptive rule selection outperformed the individual rules. Although [37] dealt with sequence-dependent setups, the agents in [37] were only validated on the same scheduling problems as those used for training. The method in [38] was tested on various scheduling problems without considering setups.
Yuan et al. [39], [40] addressed ready time constraints and machine breakdown for minimizing the total tardiness and number of tardy jobs, respectively. They adopted a tabular method that stores Q-values by exploring state-action pairs. To represent the state in a table of limited size, the values of continuous attributes should be discretized into several groups. For instance, in [40], the mean lateness of jobs was classified as being greater than or less than zero. As alternative methods for representing the state, unsupervised learning methods, such as self-organizing map [41] and k-means nearest neighbor [42], [43], have also been adopted for solving scheduling problems.

C. DEEP REINFORCEMENT LEARNING FOR SOLVING SCHEDULING PROBLEMS
Compared to the traditional RL methods, DRL aims to approximate Q-values from high-dimensional state features through DNNs [17]. This helps an agent to learn a nonlinear policy from large and continuous state spaces. Besides, [17] proposed a target network to enhance learning stability, and experience replay for reducing correlations among state-action pairs. With the extensive application of DRL to various decision-making problems such as energy supply control [44], vehicle routing [45], and electricity market pricing [46], it has gradually gained prominence in the field of scheduling. Many recent studies have widened the scope of DRL to scheduling problems in computer resource management [47], [48] and distributed systems [49], [50]. Those studies encoded the state as a 2D matrix of resources and upcoming timesteps. Since they employed fully connected networks, the matrix was flattened into a one-dimensional vector.
In addition, DRL-based methods have been employed for solving scheduling problems in manufacturing systems. In a hybrid flow shop scheduling problem, an agent allocates jobs from a given state that indicates whether each machine is idle, busy, or finished [51]. Among various shop scheduling problems, several researchers have investigated DRL-based methods for minimizing the makespan in job shop scheduling problems. Lin et al. [52] represented machine status and statistics of processing time in the state to infer dispatching rules for each machine. In [20], the setup time was considered by utilizing the setup history and setup status of all machines. Recently, convolutional neural networks were employed to address the states represented in 2D matrices indicating relationships between jobs, operations, and machines [19], [53].
On the other hand, a few DRL approaches were developed for solving scheduling problems with due-date related objectives. In [54], due-dates of all waiting jobs were represented in the state for maximizing throughput. To minimize the total tardiness in a single machine scheduling problem, [55] utilized the slack time, which refers to the difference between the processing time and remaining time until due-dates of a job. Washneck et al. [18] accommodated setup constraints in a semiconductor production scheduling problem for minimizing due-date deviations. Since their state included the setup status of all machines and due-dates of all jobs, the DNNs in [18] must be trained again when solving new scheduling problems whose number of jobs differs from that used in training. Luo et al. [21] focused on new job insertions in the flexible job shop scheduling problem under ready time constraints with the total tardiness objective. For a state that consists of seven features, an action is determined by selecting one of the heuristics. Yet, they did not consider setup constraints.

III. PROBLEM DESCRIPTION
In this section, we describe the UPMSP with SDFST considered in this paper. There are $N_J$ jobs, where the $j$th job is denoted as $J_j$. Each job belongs to one of $N_F$ families from the set $F = \{1, 2, \ldots, N_F\}$. Each job can be processed by any one of the $N_M$ machines, where the $i$th machine is denoted as $M_i$. The processing time is denoted as $p_{i,j}$ when the job $J_j$ is performed on $M_i$. A job is finished after being processed once on one of the machines. Let $P(f)$ be the total number of the jobs that belong to family $f$, which indicates the production requirements of family $f$. As a result, the following equation holds:
$$\sum_{f \in F} P(f) = N_J. \tag{1}$$
At the beginning of the scheduling, the due-date of each $J_j$, denoted as $d_j$, is given. Let $G_i$ denote the current setup status of $M_i$. $G_i$ is an element of $F$ and is equal to $g$ if a job of family $g$ can be processed on the machine without a setup change. If a job of family $f$ is assigned on $M_i$ whose $G_i$ is $g$, the setup time, denoted as $\sigma_{f,g}$, is incurred before the job is processed on the machine.
The goal of the scheduling is to allocate each job to one of the machines in order to minimize the total tardiness, denoted as $TT$, which is defined as
$$TT = \sum_{j=1}^{N_J} \max(0, c_j - d_j), \tag{2}$$
where $c_j$ is the completion time of $J_j$. When there is no setup time, the scheduling problem considered becomes equivalent to the problem in [8], which is proven to be NP-hard. Finally, the assumptions made in this paper are listed below.
• At the beginning of the scheduling, all jobs are ready to be processed, and all machines have been set up.
• There is no machine breakdown.
• After the setup change of a machine is finished for processing a job, the machine immediately starts to process the job.
• Each machine can process only one job at a time.
• Preemption is not allowed.
• The moving time for each job is zero.
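As an illustration of the objective, the following minimal sketch evaluates the total tardiness of a given job sequence on a single machine. The function names, the list-based data layout, and the convention that no setup is incurred when the job family equals the current setup status are assumptions made for this example only; they are not part of the original formulation.

```python
def total_tardiness(completion, due):
    """Sum of max(0, c_j - d_j) over all jobs."""
    return sum(max(0.0, c - d) for c, d in zip(completion, due))


def machine_completion_times(sequence, proc, family, setup, initial_status):
    """Completion times of jobs processed back-to-back on one machine.

    sequence: job indices in processing order
    proc[j]: processing time of job j on this machine
    family[j]: family of job j
    setup[f][g]: setup time to run family f when the machine status is g
    Assumption: no setup is incurred when the family equals the status.
    """
    t, status, done = 0.0, initial_status, {}
    for j in sequence:
        f = family[j]
        if f != status:
            t += setup[f][status]  # family change: pay the setup first
            status = f
        t += proc[j]
        done[j] = t
    return done


# Toy instance: three jobs, two families (0 and 1), one machine.
proc = [3.0, 2.0, 4.0]
family = [0, 0, 1]
due = [3.0, 4.0, 8.0]
setup = [[0.0, 1.0], [2.0, 0.0]]
done = machine_completion_times([0, 1, 2], proc, family, setup, initial_status=0)
tt = total_tardiness([done[j] for j in range(3)], due)
```

In this toy instance, the third job incurs a family setup before processing, so only the second and third jobs are tardy.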

IV. PROPOSED METHOD
We propose a DRL-based scheduling method in this section, which is divided into three subsections. First, we describe the MDP for solving the UPMSP with the SDFST constraint. Second, the proposed parameter sharing architecture is introduced. Finally, a training algorithm and the flowchart are described.

A. MDP FORMULATION
For employing DRL, the scheduling problem considered in this paper is formulated as an MDP. We denote the state, action, and reward at timestep $k$ as $s_k$, $a_k$, and $r_k$, respectively. After the agent executes an action $a_k$, the state transition takes place from $s_k$ to $s_{k+1}$. Then, reward $r_k$ and next state $s_{k+1}$ are observed. We denote the time interval from $s_k$ until $s_{k+1}$ as the period $k$. In this paper, we assume that the time spent in each period, denoted as $T$, is constant. We define the action, state, reward, and state transition in detail in the following.

1) ACTION
When an action is defined for each pair of a job and a machine, the size of the action space grows quickly as the numbers of jobs and machines increase. To reduce the size of the action space, we define an action as a tuple of a job family and a machine setup status. The action set $A$ is defined as follows.
$$A = \{(f, g) \mid f \in F,\ g \in F\}. \tag{3}$$
Then, we denote the feasible action set at $s_k$ as $A_k \subset A$. An action $a_k = (f_k, g_k)$ indicates that a job of family $f_k$ will be assigned on a machine whose setup status is $g_k$ during the period $k$, where $a_k \in A_k$. We note that $a_k$ is feasible only if there is a waiting job of family $f_k$ and an idle machine whose setup status is $g_k$ in the period $k$.
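The family-based action space and its feasibility condition can be sketched as follows. This is a minimal illustration: the helper names and the representation of waiting families and idle setup statuses as Python sets are assumptions of this example.

```python
from itertools import product


def action_set(num_families):
    """All (job family, machine setup status) pairs, families numbered 1..N_F."""
    return list(product(range(1, num_families + 1), repeat=2))


def feasible_actions(num_families, waiting_families, idle_statuses):
    """Keep (f, g) only if a job of family f is waiting and an idle
    machine currently has setup status g."""
    return [(f, g) for (f, g) in action_set(num_families)
            if f in waiting_families and g in idle_statuses]


# Three families: the action space has 3 * 3 = 9 elements regardless of
# how many jobs or machines the problem contains.
all_actions = action_set(3)
feasible = feasible_actions(3, waiting_families={1, 2}, idle_statuses={3})
```

The size of the action space depends only on the number of families, which is what allows the same output layer to be reused when the numbers of jobs change.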

2) STATE
When adopting RL-based methods to solve scheduling problems, the state is usually represented by utilizing observations on the current status of machines and jobs. If the dimension of a state varies as production requirements and due-dates of jobs change, the DNN is required to be re-trained whenever such changes occur. To solve the scheduling problems without re-training DNNs even when such changes occur, we propose a family-based state representation whose dimensionality is independent of the production requirements and due-dates of jobs. The proposed state consists of five 2D matrices and one vector. Table 1 describes them in terms of notations, names, and dimensions. To aid the understanding of $s_k$, we present an example of $s_k$ in a scheduling problem with three families, as depicted in Fig. 1. Next, we define the details of $s_k$. It is noted that the index $k$ is omitted in each of the five matrices and the vector for the sake of conciseness.
• $S_w$: To accommodate the due-dates of waiting jobs, the $f$th row of this matrix holds the numbers of waiting jobs of family $f$ whose due-dates fall into each of the periods. Each waiting job $J_j$ is counted separately with respect to its remaining time until the due-date, denoted as $\delta_j$, as follows.
$$S_w(f, n) = \begin{cases} |\{J_j \in W_k(f) : \lceil \delta_j / T \rceil \le 1 - H_w\}|, & n = 1, \\ |\{J_j \in W_k(f) : \lceil \delta_j / T \rceil = n - H_w\}|, & 1 < n < 2H_w, \\ |\{J_j \in W_k(f) : \lceil \delta_j / T \rceil \ge H_w\}|, & n = 2H_w, \end{cases} \tag{4}$$
where $W_k(f)$, $\lceil \cdot \rceil$, and $H_w$ refer to the set of waiting jobs that belong to family $f$ at $s_k$, the ceiling function, and a positive integer smaller than $\max_j d_j / T$, respectively. As indicated by Eq. (4) and the vertical line on the top of Fig. 1(a), $S_w(f, 1)$ and $S_w(f, 2H_w)$ refer to the numbers of waiting jobs of family $f$ whose $\delta_j$ are at most $(1 - H_w)T$ and larger than $(H_w - 1)T$, respectively. This representation aims to restrict the number of columns in $S_w$ to $2H_w$ while capturing the due-dates of all waiting jobs in the matrix. The practical rationale behind Eq. (4) is that jobs whose due-dates are quite far from the current period have a smaller impact on executing an appropriate action than the other jobs. In Fig. 1(a), $S_w(1, H_w)$ is equal to 6, which indicates that there are six waiting jobs whose family is 1 and $\lceil \delta_j / T \rceil = 0$.
• $S_p$: This matrix contains the numbers of in-progress jobs of each family whose remaining processing time falls into each of the periods. In-progress jobs refer to the ones that are currently being processed on a machine. Since the tardiness of in-progress and finished jobs is already determined, due-dates of in-progress jobs are not relevant to minimizing the tardiness. Meanwhile, the remaining processing time of in-progress jobs affects the completion time of the waiting jobs that will be assigned after in-progress jobs are completed. Therefore, each job $J_j$ that belongs to $P_k(f)$ is classified according to the remaining processing time, denoted as $\rho_j$, where $P_k(f)$ is the set of in-progress jobs of the family $f$ at $s_k$. Then, we define $S_p(f, n)$ as below.
$$S_p(f, n) = \begin{cases} |\{J_j \in P_k(f) : \lceil \rho_j / T \rceil = n\}|, & n < H_p, \\ |\{J_j \in P_k(f) : \lceil \rho_j / T \rceil \ge H_p\}|, & n = H_p, \end{cases} \tag{5}$$
where $H_p$ is a positive integer less than $\max_j \rho_j / T$. We note that the number of columns in $S_p$ is constrained to $H_p$ in a similar way to Eq. (4). For example, in Fig. 1(b), $S_p(2, 1)$ is equal to 1, which indicates that there exists one job satisfying the following two conditions: the job belongs to $P_k(2)$, and its remaining processing time is not longer than $T$.
• $S_s$: To capture the family setup time for predicting the tardiness that will be incurred after $s_k$, $S_s(f, g)$ represents the required setup time to process a job of family $f$ on a machine whose setup status is $g$. Specifically, $S_s(f, g)$ is defined as follows.
$$S_s(f, g) = \begin{cases} \sigma_{f,g}, & (f, g) \in A_k, \\ \sigma_{\max}, & \text{otherwise}, \end{cases} \tag{6}$$
where $\sigma_{\max}$ refers to the maximum setup time. If $(f, g) \notin A_k$, $S_s(f, g)$ is set to $\sigma_{\max}$ to consider the worst case.
• $S_u$, $S_a$, and $S_f$: These are devised to incorporate the history of the job, machine, and agent status until $s_k$, which has been known to be effective for solving scheduling problems based on DRL [20]. First, $S_u(f, 1)$ and $S_u(f, 2)$ respectively refer to the amounts of processing and setup time until $s_k$ that have been spent for processing jobs of the family $f$. $S_u(f, 3)$ is the number of finished jobs whose family is $f$. Next, $S_a$ represents the last action $a_{k-1}$. By using one-hot encoding [56], $S_a(\cdot, 1)$ and $S_a(\cdot, 2)$ denote $N_F$-dimensional vectors that indicate the family of a job and the setup status of a machine, respectively. Finally, $S_f$ consists of the three historic features that cannot be grouped into a specific family. $S_f(1)$, $S_f(2)$, and $S_f(3)$ are respectively equal to $r_{k-1}$, $k$, and a binary value that is set to 1 if $s_k$ is terminal and 0 otherwise.
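The binning behind $S_w$ and $S_p$ can be sketched as follows. This is an illustrative reading of the description above rather than the authors' implementation: the column index is taken as $\lceil \delta_j / T \rceil + H_w$ clipped to $[1, 2H_w]$ for waiting jobs and $\lceil \rho_j / T \rceil$ clipped to $[1, H_p]$ for in-progress jobs, and the function names and input layout are assumptions of this example.

```python
import math


def due_date_column(delta, T, H_w):
    """1-indexed column of S_w for a job whose remaining time to its
    due-date is `delta` (negative when the job is already overdue)."""
    return min(max(math.ceil(delta / T) + H_w, 1), 2 * H_w)


def build_S_w(waiting, num_families, T, H_w):
    """waiting: list of (family, delta) pairs, families numbered 1..N_F."""
    S = [[0] * (2 * H_w) for _ in range(num_families)]
    for f, delta in waiting:
        S[f - 1][due_date_column(delta, T, H_w) - 1] += 1
    return S


def build_S_p(in_progress, num_families, T, H_p):
    """in_progress: list of (family, remaining processing time) pairs."""
    S = [[0] * H_p for _ in range(num_families)]
    for f, rho in in_progress:
        n = min(max(math.ceil(rho / T), 1), H_p)
        S[f - 1][n - 1] += 1
    return S


# Two families, T = 10, H_w = 3: the matrix has 2 * H_w = 6 columns no
# matter how many jobs are waiting or how far away their due-dates are.
S_w = build_S_w([(1, -5.0), (1, -3.0), (2, 1000.0)], 2, 10.0, 3)
S_p = build_S_p([(2, 4.0)], 2, 10.0, 5)
```

Jobs that are far overdue or far from their due-dates are absorbed by the first and last columns, so the matrix dimensions stay fixed when production requirements or due-dates change.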

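Algorithm 1 State Transition by an Action
1: $\mathcal{M}$ ← set of machines whose setup status is $g_k$
2: $M^*$ ← machine in $\mathcal{M}$ with the shortest sum of processing time for the jobs in $W_k(f_k)$
3: while $W_k \neq \emptyset$ and $\min_i E_i$ does not exceed the end time of period $k$ do
4:   $M_i$ ← machine with the smallest $E_i$
5:   $g \leftarrow G_i$
6:   if $M^* = M_i$ then
7:     $f \leftarrow f_k$
8:   else
9:     $f \leftarrow \arg\min_{f'} \sigma_{f', g}$
10:  end if
11:  $J_j$ ← job in $W_k(f)$ with the earliest due-date
12:  Assign $J_j$ on $M_i$
13:  Update $E_i$
14:  Update $W_k$
15: end while
16-17: Obtain $s_{k+1}$, $A_{k+1}$, and $r_k$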

3) REWARD
The reward proposed in this paper is motivated by [37] in the sense that $r_k$ is calculated by considering job delays which occurred only during the period $k$. However, since we assumed that the time spent in each period is always equal to $T$, the reward in [37] is redefined in this paper to accommodate such assumption by using the following clip function.
Then, $r_k$ is defined as follows.
Eq. (8) states that the reward is equivalent to the negative sum of the job tardiness clipped by Eq. (7). When computing $r_k$, we assumed that $c_j$ is set to be infinite if $J_j$ is waiting until $s_{k+1}$. As a result, the total sum of rewards is equal to the negative of $TT$, which was proven in [37].
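To make the per-period reward concrete, the sketch below uses one possible clipping form: the tardiness that a job accrues inside period $k$ is taken as the overlap between its tardy interval $[d_j, c_j]$ and the period $[kT, (k+1)T]$. This specific form, the function name, and the assumption of non-negative due-dates are ours and are not taken from Eqs. (7) and (8); the sketch is only meant to show why the rewards sum to the negative total tardiness.

```python
import math


def period_reward(k, T, due, completion):
    """Negative tardiness accrued inside period k = [k*T, (k+1)*T].

    completion[j] may be math.inf for a job that is still waiting,
    in which case the job accrues tardiness until the period ends.
    Assumed form (not from the source): overlap of [d_j, c_j] with the period.
    """
    start, end = k * T, (k + 1) * T
    accrued = 0.0
    for d, c in zip(due, completion):
        accrued += max(0.0, min(c, end) - max(d, start))
    return -accrued


# Two jobs with T = 5: job 0 is due at 3 and completes at 9 (tardiness 6),
# job 1 is due at 12 and completes at 11 (no tardiness).
due = [3.0, 12.0]
completion = [9.0, 11.0]
rewards = [period_reward(k, 5.0, due, completion) for k in range(3)]
total = sum(rewards)
```

Under this assumed form, a job that is overdue and still waiting contributes the same penalty in a period whether its completion time is already known or set to infinity, which is consistent with the convention stated above.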

4) STATE TRANSITIONS
After executing $a_k$, the state transition from $s_k$ to $s_{k+1}$ occurs. Algorithm 1 describes the procedure for the state transition. Let $E_i$ be the time when the current job performed on $M_i$ will be finished. In line 1, we obtain a machine set $\mathcal{M}$ that consists of the machines whose setup status is $g_k$. Among $\mathcal{M}$, we select the machine $M^*$ for which the sum of processing time for the jobs in $W_k(f_k)$ is the shortest (line 2). Lines 3-15 are repeated while the following two conditions hold: there is at least one waiting job, and the minimum of $E_i$ among all machines does not exceed the end time of the current period $k$. In line 4, the machine $M_i$ whose $E_i$ is the smallest among all machines is selected. Then, the setup status of the machine, called $g$, is obtained (line 5). In lines 6-10, we determine the family of the next job that will be performed on $M_i$, denoted as $f$. If $M^*$ is equal to $M_i$, $f$ is set to $f_k$ by following $a_k$ (line 7). Otherwise, $f$ is selected to minimize the setup time incurred on $M_i$ (line 9). Among waiting jobs whose family is $f$, the job $J_j$ with the earliest due-date is selected (line 11). After $J_j$ is assigned on $M_i$ (line 12), both $E_i$ and $W_k$ are updated (lines 13 and 14). Finally, $s_{k+1}$, $A_{k+1}$, and $r_k$ are obtained (lines 16 and 17).
Fig. 3 depicts a schedule that is built during the periods $k$ and $k + 1$. At $s_k$, $M_1$ and $M_2$ are respectively performing $J_1$ and $J_2$, and the jobs $J_3$ to $J_7$ are waiting to be assigned. Since $a_k$ is equal to $(2, 1)$, a job of the family 2 should be assigned on a machine whose setup status is 1. By following lines 2 and 11 in Algorithm 1, $J_5$ is assigned on $M_2$, and $\sigma_{2,1}$ is incurred before processing the job. Meanwhile, $J_3$ and $J_4$, which both belong to family 1, are consecutively allocated to $M_1$ whose $G_1$ is 1. At the end of the period $k$, $s_{k+1}$ and $r_k$ are obtained as the consequences of the state transition.
After executing $a_{k+1} = (1, 3)$, the job $J_7$ of family 3 is allocated on $M_1$ whose setup status is 1. In summary, $J_3$, $J_4$, and $J_5$ are scheduled according to $a_k$, and $J_6$ and $J_7$ according to $a_{k+1}$.
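The assignment loop described for Algorithm 1 can be sketched as follows. This is a simplified reading, not the authors' code: the dictionary-based data layout and function name are assumptions, the action is assumed to be feasible, and the fallback to the minimum-setup family when no job of family $f_k$ remains is an addition of this example. The computation of $s_{k+1}$ and $r_k$ is omitted.

```python
def state_transition(machines, waiting, action, period_end, proc, setup):
    """Assign waiting jobs until the period ends.

    machines: list of dicts {"free_at": E_i, "status": G_i}
    waiting:  list of dicts {"id": j, "family": f, "due": d_j}
    proc[i][j]: processing time of job j on machine i
    setup[f][g]: setup time for family f on a machine with status g
    Returns (job id, machine index, finish time) triples in assignment order.
    """
    f_k, g_k = action
    # Lines 1-2: among machines with status g_k, pick the one with the
    # shortest total processing time over waiting jobs of family f_k.
    candidates = [i for i, m in enumerate(machines) if m["status"] == g_k]
    target = min(candidates, key=lambda i: sum(
        proc[i][j["id"]] for j in waiting if j["family"] == f_k))
    log = []
    # Lines 3-15: keep assigning while jobs wait and a machine frees up in time.
    while waiting and min(m["free_at"] for m in machines) <= period_end:
        i = min(range(len(machines)), key=lambda n: machines[n]["free_at"])
        g = machines[i]["status"]
        families = sorted({j["family"] for j in waiting})
        if i == target and f_k in families:
            f = f_k  # follow the action on the selected machine
        else:
            f = min(families, key=lambda h: setup[h][g])  # cheapest setup
        job = min((j for j in waiting if j["family"] == f),
                  key=lambda j: j["due"])  # earliest due-date first
        machines[i]["free_at"] += setup[f][g] + proc[i][job["id"]]
        machines[i]["status"] = f
        waiting.remove(job)
        log.append((job["id"], i, machines[i]["free_at"]))
    return log


machines = [{"free_at": 0.0, "status": 1}, {"free_at": 0.0, "status": 2}]
waiting = [{"id": 0, "family": 1, "due": 10.0},
           {"id": 1, "family": 2, "due": 5.0},
           {"id": 2, "family": 2, "due": 3.0}]
proc = [[4.0, 4.0, 4.0], [4.0, 4.0, 4.0]]
setup = {1: {1: 0.0, 2: 2.0}, 2: {1: 2.0, 2: 0.0}}
log = state_transition(machines, waiting, (2, 1), 5.0, proc, setup)
```

In this toy run, the action $(2, 1)$ forces a family-2 job onto the machine with status 1, while the other machine follows the minimum-setup, earliest-due-date rule.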

B. DNN ARCHITECTURE
A deep Q-network (DQN) was employed to estimate a Q-value given a state [17]. DQN takes a state $s$ as an input and outputs Q-values for all possible actions, called $Q_\theta(s, a)$, where $\theta$ denotes the network parameters and $a \in A$. Fig. 2 depicts the proposed fully-connected network architecture with parameter sharing, an approach that has been adopted for solving single lot-sizing [57] and job scheduling problems in computing platforms [22].
The input of DQN is a matrix that is constructed by concatenating the five matrices $S_w$, $S_p$, $S_s$, $S_u$, and $S_a$. Each row vector of the input is connected to a block that consists of several hidden layers. To reduce the network parameter size and increase the training efficiency, the parameters of all blocks are set to be the same. The last hidden layer is composed by concatenating the values in the last layers of the $N_F$ blocks and $S_f$. Finally, the number of nodes in the output layer is equal to the number of all possible actions. The ReLU function [58] was adopted as the activation function for all layers except the output layer, so that the output layer can represent negative Q-values.
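A forward pass through such a parameter-sharing network can be sketched in plain Python as follows. This is a minimal illustration with randomly initialized weights: the helper names and the initialization range are assumptions, and the block sizes (64, 32, 16) and the values $H_w = 6$ and $H_p = 5$ follow the settings reported later in the experiments.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def dense(layer, x):
    """Affine map W x + b for a layer given as (W, b)."""
    W, b = layer
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_in, n_out, rng):
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def q_values(family_rows, s_f, shared_block, output_layer):
    """Apply the same block to every family row, concatenate the block
    outputs with S_f, and map them to one Q-value per action."""
    features = []
    for row in family_rows:
        h = row
        for layer in shared_block:  # identical weights for every row
            h = relu(dense(layer, h))
        features.extend(h)
    features.extend(s_f)
    return dense(output_layer, features)  # no ReLU: Q-values may be negative


rng = random.Random(0)
N_F, H_w, H_p = 3, 6, 5
# Row width after concatenating S_w, S_p, S_s, S_u, and S_a column-wise.
row_width = 2 * H_w + H_p + N_F + 3 + 2
sizes = [row_width, 64, 32, 16]
shared_block = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output_layer = init_layer(N_F * 16 + 3, N_F * N_F, rng)
rows = [[rng.random() for _ in range(row_width)] for _ in range(N_F)]
q = q_values(rows, [0.0, 1.0, 0.0], shared_block, output_layer)
```

Because one block is reused for every family row, the number of hidden-layer parameters does not grow with the number of rows, and none of the layer sizes depend on the numbers of jobs or machines.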

C. TRAINING DQN
Algorithm 2 describes the training procedure of the proposed DQN. Given a scheduling problem for training DQN, a scheduling process is repeated until there is no waiting job (lines 3-16). We refer to the completion of one scheduling process as an episode, and $e$ indicates the index of the episode currently being performed. The scheduling processes continue until $e$ reaches the number of training episodes, denoted as $N_E$.

Algorithm 2 Training DQN
1: Initialize $\theta$, $\bar{\theta}$, and $B$
2: for $e = 1, \ldots, N_E$ do
3:   $k \leftarrow 0$
4:   Initialize $N_J$ jobs and $N_M$ machines
5:   Observe $s_k$ and $W_k$
6:   while $W_k \neq \emptyset$ do
7:     With probability $\varepsilon$ select $a_k$ randomly from $A_k$
8:     otherwise $a_k \leftarrow \arg\max_{a \in A_k} Q_\theta(s_k, a)$
9:     Get $s_{k+1}$, $A_{k+1}$, $r_k$, $W_{k+1}$ from Algorithm 1
10:    Store transition $(s_k, a_k, r_k, s_{k+1})$ in $B$
11:    Sample $N_{TR}$ transitions $(s_u, a_u, r_u, s_{u+1}) \in B$
12:    Calculate loss $L$ from (9)-(11)
13:    Perform a gradient descent step on $L$ w.r.t. $\theta$
14:    $k \leftarrow k + 1$
15:  end while
16:  Synchronize $\bar{\theta}$ to $\theta$ at every $N_U$ episodes
17: end for
18: return Q-network

At the start of an episode, $k$ is set to 0, and $N_J$ jobs and $N_M$ machines are initialized (lines 3 and 4). In line 5, the agent initially observes $s_k$ with $W_k$. For each timestep $k$, the agent selects $a_k$ from the ε-greedy policy [36] presented in lines 7 and 8, where $\varepsilon \in [0, 1]$ is the probability of selecting a random action. After the state transition takes place from $s_k$ to $s_{k+1}$ (line 9), the transition, represented as a quadruple of state, action, reward, and next state, is stored in the replay buffer (line 10). The replay buffer and its size are denoted as $B$ and $N_B$, respectively. If the number of stored transitions in $B$ exceeds $N_B$, the oldest ones are removed. Line 11 indicates that $N_{TR}$ transitions are sampled to train DQN, where $N_{TR}$ is the number of sampled transitions. Given a transition $(s_u, a_u, r_u, s_{u+1})$, the temporal difference error, called $\eta_u$, is calculated from the prediction $Q_\theta(s_u, a_u)$ and the target Q-value [17] as follows.
$$\eta_u = r_u + \gamma \max_{a'} Q_{\bar{\theta}}(s_{u+1}, a') - Q_\theta(s_u, a_u), \tag{9}$$
where $\gamma$ and $\bar{\theta}$ respectively indicate the discount factor [36] and the parameters of a target DQN which has the same network architecture as the DQN for training. In Eq. (9), $r_u + \gamma \max_{a'} Q_{\bar{\theta}}(s_{u+1}, a')$ indicates a target Q-value, which is set to $r_u$ when $s_{u+1}$ is terminal. We denote the temporal difference loss from sampled transitions as $L(\theta)$, which is defined as follows.
where $h$ is the loss function, defined as below.
In Eq. (11), we adopted Huber loss [59] instead of mean-squared error for further enhancing the stability of DQN training [20]. By calculating $L(\theta)$, the network parameter $\theta$ is updated (lines 12 and 13). Line 16 shows that the target network parameter $\bar{\theta}$ is periodically replaced with $\theta$ [17]. Finally, the trained DQN is acquired at the end of $N_E$ episodes (line 18). Fig. 4 depicts the overall flowchart of the proposed algorithms.
After DQN training, a test procedure is implemented to solve the scheduling problems whose production requirements and due-dates change from those of the training problems. During the test procedure, random actions (line 7 in Algorithm 2) are no longer executed. The rest of the procedure is identical to Algorithm 2 except for lines 10-13 and 16, which are required only for training a DQN.
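The loss computation in lines 11-13 of Algorithm 2 can be illustrated with the following sketch. It is not the authors' implementation: the error is taken as target minus prediction, the Huber threshold of 1 is a common default assumed here, and averaging over the sampled transitions is likewise an assumption, since the exact forms of Eqs. (10) and (11) are not reproduced in this text.

```python
def td_error(r, q_sa, next_q_target, gamma=1.0, terminal=False):
    """Temporal difference error: target Q-value minus predicted Q-value.

    r: reward r_u
    q_sa: prediction Q_theta(s_u, a_u)
    next_q_target: Q-values of the target network at s_{u+1}
    The target reduces to r when s_{u+1} is terminal.
    """
    target = r if terminal else r + gamma * max(next_q_target)
    return target - q_sa


def huber(eta, kappa=1.0):
    """Quadratic near zero, linear for large errors (threshold kappa assumed)."""
    a = abs(eta)
    return 0.5 * eta * eta if a <= kappa else kappa * (a - 0.5 * kappa)


def td_loss(errors):
    """Mean Huber loss over sampled transitions (averaging is an assumption)."""
    return sum(huber(e) for e in errors) / len(errors)


eta_1 = td_error(-2.0, -5.0, [-1.0, -3.0])        # non-terminal transition
eta_2 = td_error(-2.0, -5.0, [], terminal=True)   # terminal transition
loss = td_loss([0.5, 2.0])
```

Compared with the squared error, the linear branch limits the gradient magnitude for large errors, which is the usual motivation for preferring the Huber loss in DQN training.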

V. EXPERIMENTS

A. DATASETS
We prepared 8 datasets that simulated the semiconductor wafer preparation facilities in South Korea. Table 2 presents $N_M$, $N_J$, $N_F$, and the due-date tightness, denoted as $\tau$, for the scheduling problems in each dataset. Except for $\tau$, datasets 1 and 3 are equivalent to datasets 2 and 4, respectively. Moreover, $N_M$ and $N_J$ of datasets 5-8 are 2.5 times larger than those of datasets 1-4, respectively.
Each dataset has 330 different scheduling problems, which are divided into 300 problems for training the proposed DQN and the remaining 30 problems for validating the trained DQN. Fig. 5 depicts the distributions of production requirements for each family. In particular, across all datasets, production requirements were perturbed by at least 37%.
For each scheduling problem, $d_j$ of the $N_J$ jobs were set to be uniformly distributed between $L(0.5 - \tau)$ and $L(1.5 - \tau)$, where $L$ indicates an expected makespan adopted in several studies [9], [11], [30]. To simulate the real-world scenario, each machine is performing a job at the beginning of the scheduling, where its remaining processing time was randomly generated.
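The due-date generation rule can be sketched as follows. The expected makespan $L$ is treated here as a given input, since its formula is defined in the cited studies rather than in this text; the function name and the numerical values in the example are illustrative assumptions.

```python
import random


def sample_due_dates(num_jobs, expected_makespan, tau, rng):
    """Draw due-dates uniformly from [L(0.5 - tau), L(1.5 - tau)].

    A larger due-date tightness tau shifts the whole interval earlier,
    which makes the due-dates harder to meet.
    """
    lo = expected_makespan * (0.5 - tau)
    hi = expected_makespan * (1.5 - tau)
    return [rng.uniform(lo, hi) for _ in range(num_jobs)]


rng = random.Random(42)
due_dates = sample_due_dates(num_jobs=100, expected_makespan=200.0, tau=0.4, rng=rng)
```

With these illustrative values, all due-dates fall between 20 and 220 time units.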

B. EXPERIMENTAL SETTINGS
Experiments were conducted on a Xeon E5 2.2-GHz PC with 126-GB memory. When employing DRL-based methods, choosing appropriate values of hyperparameters plays a crucial role in the performance of DNN. Since it is challenging to determine optimal values of hyperparameters due to their huge search space, we implemented the random search [60], and found the values that achieved the best performance.
For performing a gradient descent step, we adopted the RMSProp optimizer [61], where the learning rate is set to $2.5 \times 10^{-3}$. In the ε-greedy policy, the initial value of $\varepsilon$ is set to 0.2. The value decays linearly to zero until $e$ reaches $0.9 \times N_E$. Besides, $N_B$, $N_U$, $N_{TR}$, and $\gamma$ are set to $10^5$, 50, 64, and 1, respectively. Finally, we set $N_E = 10^5$ and $T = \frac{2}{3}\bar{p}$ in datasets 1-4, and $N_E = 1.5 \times 10^4$ and $T = \frac{1}{2}\bar{p}$ in datasets 5-8, where $\bar{p}$ is the average processing time over all pairs of jobs and machines. Each shared block consists of three hidden layers, where the numbers of nodes in the first, second, and third layers are 64, 32, and 16, respectively. For each dataset, the DQNs stored when $e$ is equal to $0.91 \times N_E$, $0.92 \times N_E$, ..., $0.99 \times N_E$, and $N_E$ were used for performance comparisons.
Figs. 6(a) and (b) illustrate the mean tardiness results (in hours) on dataset 1 by varying $H_w$ and $H_p$, respectively. In the experiments, the mean tardiness is calculated by dividing $TT$ by $N_J$. We note that the values of $H_w$ and $H_p$ were chosen in the ranges described in Eqs. (4) and (5), respectively. In Fig. 6(a), $TT$ significantly decreased until $H_w$ reaches 6, which reveals the validity of $S_w$. On the other hand, the performance changes were negligible when $H_w$ exceeds 6. This implies that the advantage of classifying waiting jobs in terms of their due-dates is diminished by the increase in the dimension of $S_w$. As shown in Fig. 6(b), the performance improvement achieved by varying $H_p$ was less significant than that achieved by varying $H_w$. This can be attributed to the fact that $S_w$ contains more observations than $S_p$, since the number of waiting jobs is larger than that of in-progress jobs in most periods. As a result, the rest of the experiments were carried out with the best values of $H_w$ and $H_p$, which were 6 and 5, respectively.
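The exploration schedule described above can be written as a short function. The function and parameter names are illustrative; the initial value 0.2 and the decay horizon of $0.9 \times N_E$ are the settings stated in the text.

```python
def epsilon(e, num_episodes, eps0=0.2, decay_fraction=0.9):
    """Exploration rate at episode e: linear decay from eps0 to zero,
    reaching zero at decay_fraction * num_episodes and staying there."""
    horizon = decay_fraction * num_episodes
    return max(0.0, eps0 * (1.0 - e / horizon))


# With 1000 episodes the rate reaches zero at episode 900, so the last
# 10% of training is carried out with purely greedy actions.
schedule = [epsilon(e, 1000) for e in (0, 450, 900, 1000)]
```

Under this schedule, the models stored during the final 10% of the episodes are all obtained from greedy rollouts.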

C. PERFORMANCE COMPARISON
In order to show the effectiveness of the proposed method, we compared performances of IG in [11] and two RL-based methods, which are two-phase DQN (TPDQN) [18], and QL method with linear basis functions (LBF-Q) [37], respectively. The parameters of IG were the same as those in [11]. Since the production scheduling in the semiconductor manufacturing systems is usually carried out on an hourly basis [14], IG was terminated after an hour. For LBF-Q and TPDQN, ten models were stored as described in Section V-B for performance comparisons, respectively. The rest of the hyperparameters were the same as those in [18], [37], respectively.
Furthermore, we compared our method with four rule-based methods: BATCS [31], shortest setup time with earliest due date (SSTEDD), least slack remaining (LSR) [13], and COVERT [62]. LSR and COVERT are widely adopted to minimize due-date related objectives, while BATCS is effective for scheduling problems with SDFST. SSTEDD first selects the jobs that require the shortest setup time and then chooses the job with the earliest due-date among them. Table 3 presents the mean tardiness results (in hours) of our method and the other methods. Among the rule-based methods, SSTEDD outperformed the others on all datasets. Meanwhile, LSR and COVERT yielded $TT$ that was 3.2 times longer than that of the other rule-based methods in the best case and 11.3 times longer in the worst case. This suggests that addressing sequence-dependent setups is crucial for minimizing $TT$ in the scheduling problems considered. $TT$ achieved by LBF-Q and TPDQN was longer than that of SSTEDD for all datasets. This may be because family setups were not accommodated in their state and action representations. Although IG performed better than the rule-based and other RL-based methods on all datasets, $TT$ achieved by IG was 13% to 55% longer than that of the proposed method. Based on these results, the proposed method appears to be effective for solving scheduling problems even when the production requirements and due-dates differ from those used in training.
Table 4 presents the average computation time taken by our method, TPDQN, LBF-Q, IG, and SSTEDD. Among the four rule-based methods, we present only the results of SSTEDD, whose average computation time was the shortest. The computation times of LBF-Q and TPDQN were longer than those of the proposed method for all datasets. This might be related to the fact that the proposed method computes Q-values at each period, whereas LBF-Q and TPDQN compute Q-values whenever a job is allocated to a machine. Moreover, the ratio of the average computation time for datasets 5-8 to that for datasets 1-4 was 5.56 for LBF-Q, 6.23 for TPDQN, and 3.98 for the proposed method. This observation can be attributed to the fact that the number of parameters of the proposed DQN is independent of $N_M$ and $N_J$, unlike those of LBF-Q and TPDQN.
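The SSTEDD rule used as the strongest rule-based baseline can be sketched as a two-stage selection. The snippet below is only an illustration of the rule as described in the text; the `Job` class, the `sstedd_select` function, and the dictionary-based setup-time table keyed by (current family, next family) are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Job:
    job_id: int
    family: str
    due_date: float


def sstedd_select(
    waiting: Sequence[Job],
    current_family: str,
    setup_time: Dict[Tuple[str, str], float],
) -> Job:
    """Keep the jobs with the shortest setup time, then pick the earliest due-date."""
    def setup(job: Job) -> float:
        # Hypothetical lookup: setup from the machine's current family to the job's family.
        return setup_time[(current_family, job.family)]

    shortest = min(setup(job) for job in waiting)
    candidates = [job for job in waiting if setup(job) == shortest]
    return min(candidates, key=lambda job: job.due_date)


if __name__ == "__main__":
    setups = {("A", "A"): 0.0, ("A", "B"): 3.0, ("B", "A"): 2.0, ("B", "B"): 0.0}
    jobs = [Job(1, "B", 10.0), Job(2, "A", 20.0), Job(3, "A", 15.0)]
    # The machine is set up for family A, so jobs 2 and 3 need no setup.
    print(sstedd_select(jobs, "A", setups))
```

In this toy case, job 1 has the earliest due-date overall but is passed over because it would require a family setup, which is the behavior that distinguishes SSTEDD from purely due-date-driven rules such as LSR.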
Compared to the best rule, the computation time of the proposed method was 8.8 to 12 times longer for datasets 1-4 and 6.7 to 7.9 times longer for datasets 5-8. Nevertheless, the results demonstrate that the proposed method built a schedule in less than 20 s for all datasets. Different from metaheuristics and rule-based methods, the proposed DRL-based method can quickly obtain a new schedule by using the trained DQN, which suggests the viability of the proposed method in terms of computation time for real-world manufacturing systems with parallel machines.
To examine the robustness of the proposed method when both processing and setup times are stochastic, we carried out additional performance comparisons. Specifically, both $p_{i,j}$ and $\sigma_{f,g}$ were not known in advance and were drawn from uniform distributions over $[0.8p_{i,j}, 1.2p_{i,j}]$ and $[0.8\sigma_{f,g}, 1.2\sigma_{f,g}]$, respectively. Each test scheduling problem was solved 30 times with different random seeds. Table 5 shows the average and standard deviation of the mean tardiness (in hours) for SSTEDD, IG, LBF-Q, and our method. For datasets 1 and 5, both the average and standard deviation of $TT$ yielded by the proposed method were the lowest among all methods. Based on these results, the proposed method seems to be robust even when processing and setup times are stochastic.
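The stochastic setting above can be reproduced in outline with a seeded sampler. The sketch below is an assumption-laden illustration, not the authors' experimental code: the helper names `sample_duration` and `realized_durations` are hypothetical, and the nominal values are toy numbers standing in for $p_{i,j}$ or $\sigma_{f,g}$.

```python
import random
from typing import List, Sequence


def sample_duration(nominal: float, rng: random.Random, spread: float = 0.2) -> float:
    """Draw a realized time uniformly from [0.8 * nominal, 1.2 * nominal]."""
    return rng.uniform((1.0 - spread) * nominal, (1.0 + spread) * nominal)


def realized_durations(nominal_times: Sequence[float], seed: int) -> List[float]:
    """Sample one realization of all nominal times using a dedicated seeded generator."""
    rng = random.Random(seed)
    return [sample_duration(t, rng) for t in nominal_times]


if __name__ == "__main__":
    nominal = [10.0, 4.0]  # toy nominal processing or setup times (hypothetical)
    # One realization per seed, mirroring the 30 repetitions per test problem.
    runs = [realized_durations(nominal, seed) for seed in range(30)]
    print(len(runs), runs[0])
```

Using a separate `random.Random` instance per seed keeps each repetition reproducible and independent of any global random state.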
We further analyzed the effectiveness of the proposed state representation and parameter sharing architecture. To isolate the performance improvements achieved by each of them, we compared the proposed method with two modified baseline methods. First, we adopted the production attributes-based state representation (PABS) proposed in [37]. Next, the proposed family-based state representation was used as a one-dimensional vector (FBS-1D) by flattening and concatenating $S_w$, $S_p$, $S_s$, $S_u$, $S_a$, and $S_f$. For PABS and FBS-1D, we used a fully-connected network with three hidden layers whose numbers of nodes were the same as those mentioned in Section IV-B. The rest of the details were identical to those of the proposed method. Note that PABS and FBS-1D were the same except for the state representation. Fig. 7 shows the mean tardiness results (in hours) of PABS, FBS-1D, and our method on datasets 1-8. For all datasets, FBS-1D consistently outperformed PABS with respect to $TT$. Specifically, $TT$ achieved by FBS-1D was 70% lower than that of PABS for datasets 1-4 and 34% lower for datasets 5-8, which demonstrates the superiority of the proposed state representation. Furthermore, compared to FBS-1D, the proposed method performed 30% better for datasets 1-4 and 47% better for datasets 5-8. Based on the above observations, parameter sharing appears to be more efficient for solving scheduling problems with large numbers of jobs and machines.
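The FBS-1D baseline input can be pictured as a row-major flattening of each state component followed by concatenation. The sketch below is illustrative only: the component shapes in `toy_state` are hypothetical (the actual dimensions follow from the state design defined earlier in the paper), and `flatten_state` is a name introduced for this example.

```python
from itertools import chain
from typing import Dict, List, Sequence

Matrix = List[List[float]]

# Component order follows the text: S_w, S_p, S_s, S_u, S_a, S_f.
COMPONENT_ORDER = ("S_w", "S_p", "S_s", "S_u", "S_a", "S_f")


def flatten_state(
    components: Dict[str, Matrix],
    order: Sequence[str] = COMPONENT_ORDER,
) -> List[float]:
    """Flatten each 2-D component row by row and concatenate them into one vector."""
    flat: List[float] = []
    for name in order:
        flat.extend(chain.from_iterable(components[name]))
    return flat


if __name__ == "__main__":
    # Hypothetical toy state with two families (one row per family).
    toy_state = {
        "S_w": [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]],
        "S_p": [[1.0, 0.0], [0.0, 1.0]],
        "S_s": [[0.5], [0.2]],
        "S_u": [[0.0], [1.0]],
        "S_a": [[1.0], [0.0]],
        "S_f": [[0.0], [1.0]],
    }
    vector = flatten_state(toy_state)
    print(len(vector), vector)
```

A fully-connected network fed with such a vector has no built-in notion of which entries belong to the same family, which is one plausible reading of why the parameter-sharing variant fared better in the comparison.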

VI. CONCLUSION
In this paper, we proposed a DRL-based method for solving UPMSPs with the SDFST constraint to minimize the total tardiness. To cope with the variabilities in production requirements and due-dates while addressing SDFST, we proposed a novel state representation whose dimension is invariant to such variabilities. Furthermore, we suggested a parameter sharing architecture to learn the DQN parameters effectively and reduce the number of parameters. As a result, the trained DQN was able to quickly solve unseen scheduling problems whose production requirements and due-dates are different from those considered in training.
To examine the performance of the trained DQN, the proposed method was compared to IG, four rule-based methods, LBF-Q, and TPDQN. The experimental results demonstrated that our method outperformed the existing methods in terms of the total tardiness for all datasets. Moreover, the computation time of the proposed method was shorter than those of IG and the two RL-based methods. Further experiments verified that both the proposed state representation and the parameter sharing architecture contributed to the performance improvements.
Yet, the proposed DQN has a limitation: when the number of families changes, a re-training procedure is required. To address this limitation, we plan to develop state and action representations whose dimensions are independent of the number of families. Furthermore, some assumptions made in this paper related to the ready time of jobs and machine breakdown will be relaxed in future work. Finally, the network architecture can be improved by utilizing other deep learning models, such as recurrent DNNs and convolutional neural networks.