Predictive Maintenance Decision Making Based on Reinforcement Learning in Multistage Production Systems

Predictive maintenance has become increasingly prevalent in modern production systems that are challenged by high-mix low-volume production and short production life cycles. It helps to prevent costly equipment failures and to reduce the significant production loss caused by unscheduled machine breakdowns. Although important, decision models for joint predictive maintenance and production in manufacturing systems have not been fully explored. Therefore, we propose a reinforcement learning based decision model that brings together production system modeling and approximate dynamic programming. We start by developing a state-based model through an analysis of the dynamics of a multistage production system with predictive maintenance. The model provides an approach to quantitatively evaluate the impact of various disruptions and of maintenance decisions on production. Then a reinforcement learning method is proposed to explore optimal maintenance policies that minimize the production and maintenance cost. To further improve the performance of the production system, machine stoppage bottlenecks are defined, and an event-based indicator is proved to identify bottlenecks from production data. We test the proposed models in simulation case studies. The proposed predictive maintenance decision model is compared with three policies: state-based policy (SBP), time-based policy (TBP), and greedy policy (GP). The numerical studies show that the proposed decision model outperforms these policies, with a system cost that is 9.68%, 39.07%, and 39.56% lower than SBP, TBP, and GP, respectively. In addition, the research shows that bottleneck identification and mitigation can help manufacturing systems achieve more than 9.00% throughput improvement.


I. INTRODUCTION
Production systems must maintain high productivity and low production cost to succeed in a highly competitive business environment. However, unexpected random machine failures can significantly impact the operation of production systems. The systems are forced to stay in transient states, which causes them to suffer from more unscheduled machine breakdowns with extended durations, and hence leads to low productivity and quality rate [1].
Maintenance plays a central role in reducing machine failures, improving productivity, and keeping the functional level of products [2]. Predictive maintenance has been recognized as one of the most promising maintenance strategies for production systems because of its high efficiency and low cost [3]. Predictive maintenance is an approach that makes maintenance decisions based on the real machine health conditions [4]. It can reduce unscheduled equipment breakdowns and prevent maintenance events that are not necessary [5]. Predictive maintenance is recognized as a promising technique that revolutionizes the production industry [6].
An optimal decision model is essential to successfully deploy predictive maintenance in production systems. In single stage production systems, all machines are independent of one another. Conducting predictive maintenance based on each machine's health degradation can optimize system performance. However, it is not trivial to make decisions in multistage production systems because of the complex interdependencies among machines. In these systems, a machine's health degradation can not only result in the breakdown of the machine, but also starve or block the adjacent machines, which may cause production loss. Predictive maintenance decision models should be adapted to take the interdependencies into account in order to reach a system-wide optimal policy [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Agustin Leobardo Herrera-May.
This research proposes a decision model for integrated production and maintenance decision making in multistage production systems. First, we propose a Markov chain model through analyzing the transient dynamics of multistage production systems with predictive maintenance control. In the existing analytical models, systems mainly use corrective maintenance to bring downtime machines back to operation while ignoring predictive maintenance. We extend the models to production systems where machines have multiple deterioration states, and predictive maintenance is employed to restore machines to better health conditions. Second, we integrate approximate dynamic programming and the Markov chain model. The existing research usually considers the modeling and control of multistage production systems separately. We bring together the research efforts, and investigate their applications in predictive maintenance decision making.
Machine stoppage bottlenecks refer to the machines whose random failure most strongly impedes throughput. The performance of the production system can be effectively improved if we give higher maintenance priorities to bottleneck machines. The challenge is to define and identify bottleneck machines. Therefore, we propose in this manuscript the definition and identification method for machine stoppage bottlenecks.
The remainder of the paper is organized as follows. The literature review is presented in Section II. Section III introduces the system description and assumptions. Section IV proposes the Markov chain model. Sections V and VI describe the dynamic maintenance decision model. Section VII defines machine stoppage bottlenecks and establishes an event-based identification method. Section VIII carries out the numerical case study to validate the proposed models. Conclusions are summarized in Section IX.

II. LITERATURE REVIEW
In an effort to improve production and maintenance efficiency, extensive studies have been carried out in the past decades on joint maintenance and production control [8], [9]. Zhang et al. [10] established a reinforcement learning based algorithm to exploit optimal maintenance policies. The model can be applied to systems whose degradation modes are either known or unknown. Yang et al. [11] investigated the integrated optimization of preventive maintenance and production scheduling in a multi-stage single-machine production system. A Markov decision process framework is formulated, and the R-learning algorithm is applied to optimize the system rewards. In [12], a repairable multistate system is considered to optimize the achievement over a finite time period under maintenance resource constraints. A Markov decision process is established, and it is solved with a reinforcement learning based algorithm. Peng [13] proposed a Markov decision process model for systems with continuous degradation states. The transition and value functions of system states are approximated with Gaussian process regression. Although the aforementioned models are very useful for improving system performance in their applications, they are mostly developed for systems with one or two stages. The methods are difficult to extend to multistage manufacturing systems.
Although some research has been carried out on multistage manufacturing systems to establish predictive maintenance decision models, most of it relies on heuristic rules without a systematic understanding of system dynamics [14]. As a result, the models may not adequately exploit optimal maintenance decisions. It is desirable to integrate production system models into maintenance decision making algorithms to make comprehensive decisions in multistage production systems. For example, Iravani and Duenyas [15] discussed the application of reinforcement learning in production maintenance. It has the potential to improve maintenance efficiency by 5-20% compared with conventional maintenance policies. Xia et al. [16] proposed a predictive maintenance control policy for serial production systems, which utilizes a global-objective model (GOM) to make machine-level decisions and a maintenance time window method (MTW) to make system-level decisions. The multi-unit production system is simplified by assuming that the entire production system needs to be stopped to perform predictive maintenance. Chang et al. [17] introduced a supervisory control algorithm to schedule predictive maintenance in maintenance opportunity windows (MOW). The algorithm assumes that all machine failures can be predicted. It strives to satisfy the maintenance needs while only slightly impeding production efficiency.
Bottlenecks in production systems have been analyzed in numerous references. Cui et al. [18] studied bottlenecks in a serial production system, and established an identification method with accumulated starvation and blockage time. However, the identification method can only be analytically proved in production systems without random machine failure. Zhang et al. [19] analyzed three types of bottlenecks in Markovian production systems where machines can have multiple states. An approximation method is proposed to find bottlenecks. Li et al. [20] utilized a Markov chain model to identify market demand bottlenecks. A continuous improvement algorithm is established based on the bottleneck identification method. This manuscript will extend the bottleneck identification methods to production systems with maintenance.
In summary, the current literature fails to provide a systematic approach to joint production and maintenance decision making in multistage production systems. Although extensive research efforts have been reported to optimize maintenance policies, they are mainly for single stage production systems. Many current decision models for multistage production systems rely on tribal knowledge or ad-hoc rules. The complex nature of a production system makes it difficult to utilize these methods to optimize system performance. The ever-growing complexity of the business environment puts huge pressure on production systems. Dynamic decision models that integrate production system modeling and optimal control methods are imperative to reduce maintenance cost and improve productivity. This is the main concern of the paper.

III. SYSTEM ASSUMPTIONS AND NOMENCLATURE
A. SYSTEM ASSUMPTIONS AND BACKGROUND
In this paper, we consider a serial production system as shown in Figure 1. We use $M_l$, $1 \le l \le M$, to represent the $l$th production machine and $b_l$ to represent the $l$th buffer. The end-of-line machine $M_{M+1}$ is the market demand virtual machine. The last buffer $b_M$ is the finished-goods buffer. The following definitions and assumptions are made.
1) The capacity of each buffer is finite.
2) The cycle time of each machine is identical and equals one time unit [21], [22].
[...] states, which are denoted as $\alpha_l \in \{0, 1, \ldots, N_l\}$. State $\alpha_l = 0$ represents the as-good-as-new state, while state $\alpha_l = N_l$ represents the failure state. Each machine has one failure state.
5) Without maintenance, machine $M_l$, $1 \le l \le M$, will eventually fail and almost never revives on its own. We assume that machine $M_l$, $1 \le l \le M$, can transfer from state $\alpha_l = i$ to state $\alpha_l = j$ with a probability of $\theta_{ij}$, $i \ne N_l$.
6) When machine $M_l$, $1 \le l \le M$, fails, corrective maintenance (CM) shall be performed to bring the machine to one of the non-failure states. We use $\beta_{N_l j}$, $j \ne N_l$, to denote the transition probability of machine $M_l$ from state $N_l$ to state $j$ with CM.
7) Before machine $M_l$, $1 \le l \le M$, fails, predictive maintenance (PM) can be performed to restore the machine to a healthier state. We assume that the duration of PM is shorter than that of CM on average. PM usually has a lower cost rate than CM. With PM, the transition probability of machine $M_l$ from state $i$ to state $j$ is denoted as [...]

Remark 2: If predictive maintenance is performed on machine $M_l$, $l = 1, \ldots, M$, when it is in state $\alpha_l = i$, the maintenance continues until machine $M_l$ successfully transfers to another state. The expected length of the predictive maintenance is estimated as 1 [...]

Nomenclature:
$s(t)$: system's state at time $t$
$c$: maintenance decision of the system
$c(t)$: maintenance decision of the system at time $t$
$c_l(t)$, $1 \le l \le M$: maintenance decision of machine $M_l$ at time $t$
$r(s(t), c(t))$: cost of the system when it is in state $s(t)$ with control decision $c(t)$
$\theta$: real valued weights of the neural network function approximator
$\pi$: maintenance policy
$V^{\pi}(s)$: value function in state $s$ with maintenance policy $\pi$
$V(s, \theta)$: optimal value function in state $s$
$L(\theta)$: mean-squared error in the Bellman equation
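Assumptions 5) to 7) and Remark 2 can be illustrated with a small sampling sketch for a single machine. This is a minimal sketch under stated assumptions: the 4-state machine, every probability, and the names `THETA`, `BETA_CM`, `PM`, `next_state`, and `pm_duration` are hypothetical. It also assumes that a machine under PM or CM that does not change state simply stays under maintenance for another time unit, which makes the maintenance duration geometric.

```python
import random

N_FAIL = 3  # hypothetical 4-state machine: 0 = as good as new, 3 = failure

# Illustrative numbers only; none of these probabilities come from the paper.
THETA = {  # degradation without maintenance, one row per non-failure state
    0: [0.90, 0.10, 0.00, 0.00],
    1: [0.00, 0.85, 0.15, 0.00],
    2: [0.00, 0.00, 0.80, 0.20],
}
BETA_CM = [0.30, 0.00, 0.00, 0.70]  # CM: repaired to state 0, else still failed
PM = {  # PM: move to a healthier state, else remain under maintenance
    1: [0.50, 0.50, 0.00, 0.00],
    2: [0.40, 0.10, 0.50, 0.00],
}

def next_state(state, do_pm, rng):
    """Sample the next health state of one machine for one time unit."""
    if state == N_FAIL:
        row = BETA_CM            # a failed machine receives corrective maintenance
    elif do_pm and state in PM:
        row = PM[state]
    else:
        row = THETA[state]
    return rng.choices(range(len(row)), weights=row)[0]

def pm_duration(state, rng):
    """Number of time units until PM moves the machine out of `state`."""
    steps = 1
    while next_state(state, True, rng) == state:
        steps += 1
    return steps

rng = random.Random(0)
samples = [pm_duration(2, rng) for _ in range(20000)]
# Under the geometric assumption the mean should be close to 1 / (1 - 0.5) = 2.
mean_duration = sum(samples) / len(samples)
```

With a stay probability of 0.5 in state 2, the sampled mean PM length should be close to two time units under this assumption.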

IV. ANALYTICAL ANALYSIS OF MULTISTAGE MANUFACTURING SYSTEMS
At each time $t$, the production system is modelled by a Markov chain with state defined as [...]. The production system dynamics are described with [...], where $TH_l(t)$ and $TH_{l+1}(t)$ are the throughputs of machines $M_l$ and $M_{l+1}$ at time $t$. When $M_l$, $1 < l \le M + 1$, is not the first machine in the system, its throughput $TH_l(t)$ at time $t$ can be determined by comparing its rated speed $v_l(\alpha_l(t))$ in state $\alpha_l(t)$, the buffer level $b_{l-1}(t-1)$ of the immediate upstream buffer $b_{l-1}$ at time $t - 1$, and the available buffer space [...]. For the first machine $M_1$, the throughput $TH_1(t)$ is determined by comparing the rated speed $v_1(\alpha_1(t))$ of the machine in state $\alpha_1(t)$ and the available buffer space [...]. Therefore, the throughput of each machine is summarized in the following Lemma 1.
Lemma 1: In the production system defined in Section III.A, the throughput of each machine is expressed as: [...]
We suppose that at time $t$, the system is in state [...] at the following time step $t + 1$ is estimated as: [...] at the following time step $t + 1$ is estimated as: [...] At time $t + 1$, given the machines' states $M(t + 1)$, the buffer level can be calculated with Equations 1 and 2. Therefore, the transition probability from states [...] The process is repeated until all the transition probabilities are obtained. The process is summarized in Algorithm 1.

Algorithm 1
Step 0. Input system state $s(t)$ and maintenance decision $c(t)$.
Step 1. For all machines' states $M(t + 1)$
    Compute the buffer levels for the following time step $t + 1$ with Equations 1 and 5.
    Compute the transition probability from $s(t)$ to $s(t + 1)$ with Equations 6 and 7.
End For

Remark 3: According to Equation 7, the transition of the system state is determined by the transition of the machines' states. When the system is in state $s(t)$, it can transfer to at most $\prod_{l=1}^{M+1} N_l$ other states, where $\prod_{l=1}^{M+1} N_l$ is the total number of possible combinations of machines' states. Therefore, each time we compute the transition probability, we only need to take these $\prod_{l=1}^{M+1} N_l$ transitions into consideration. The probabilities for the system to transfer to all other states equal zero.
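The loop in Algorithm 1 can be sketched as follows. This is a minimal sketch, not the paper's exact equations: it assumes the usual buffer balance (new level = old level + inflow - outflow), a throughput equal to the minimum of rated speed, upstream parts, and downstream free space as described in the text, and independent machine state transitions. The names `throughputs`, `step_buffers`, `machine_state_transitions`, and `capacities` are hypothetical.

```python
from itertools import product

def throughputs(speeds, buffers, capacities):
    """speeds[l]: rated speed of machine l in its current state.
    buffers[l]: level of the buffer after machine l at the previous time step.
    capacities[l]: capacity of that buffer. There is one more machine than buffers."""
    n = len(speeds)
    th = []
    for l in range(n):
        limits = [speeds[l]]
        if l > 0:
            limits.append(buffers[l - 1])               # parts available upstream
        if l < n - 1:
            limits.append(capacities[l] - buffers[l])   # free space downstream
        th.append(min(limits))
    return th

def step_buffers(speeds, buffers, capacities):
    """Buffer balance: add the upstream throughput, remove the downstream one."""
    th = throughputs(speeds, buffers, capacities)
    return [buffers[l] + th[l] - th[l + 1] for l in range(len(buffers))]

def machine_state_transitions(rows):
    """rows[l][j] = P(machine l moves to state j) under the current decision.
    Returns {(j_1, ..., j_n): probability}, assuming independent machines."""
    out = {}
    for combo in product(*[range(len(r)) for r in rows]):
        p = 1.0
        for l, j in enumerate(combo):
            p *= rows[l][j]
        if p > 0.0:
            out[combo] = p
    return out

# Two production machines plus the virtual demand machine, two buffers of size 3.
new_levels = step_buffers([1, 1, 0], [0, 2], [3, 3])
combos = machine_state_transitions([[0.9, 0.1], [0.5, 0.5]])
```

Only machine-state combinations with nonzero probability are kept, which mirrors the observation in Remark 3 that all other system transitions have probability zero.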

V. PREDICTIVE MAINTENANCE DECISION MODEL FORMULATION
The maintenance decision making problem is a typical Markov Decision Process (MDP), which can be described with a 4-tuple $[S, C, P, R]$:
• $S$ denotes the state space,
• $C$ denotes the action space,
• $P: S \times S \times C \to [0, 1]$ denotes the transition probability matrix with element $P(s, s', c)$, $\forall s, s' \in S$ and $\forall c \in C$,
• $R: S \times C \to \mathbb{R}$ denotes the cost function, where $r(s, c) \in \mathbb{R}$, $\forall s \in S$ and $\forall c \in C$, represents the immediate cost of the system in state $s$ under maintenance decision $c$.
System cost $r(s(t), c(t))$ at each time $t$ consists of inventory cost, backlog cost, and maintenance cost, which is calculated as follows: [...] (8) where $g_b$ is the inventory cost per part, and $g_-$ is the backlog cost per part. The backlog incurred in the system is measured as the production loss in the last virtual machine [...].
A maintenance policy is a function from system states to maintenance decisions. Policy $\pi(s, c)$ denotes executing maintenance decision $c \in C$ in state $s \in S$. The value function of the policy is $V^{\pi}(s)$, which estimates the discounted cost when the system is in state $s$ and follows policy $\pi$ thereafter: [...] where $s(0)$ refers to the initial system state and $\gamma$, $0 < \gamma < 1$, is the discount factor. The objective of the decision model is to find an optimal maintenance policy $\pi^*$ that minimizes the expected discounted cost over an infinite horizon, i.e. $\forall \pi \ne \pi^*$, $V^{\pi^*}(s) \le V^{\pi}(s)$. The optimal policy satisfies Bellman's optimality equation, which can be expressed as: [...]
Solving Equation 10 encounters the curse of dimensionality, especially when the system state space $S$ and the action space $C$ become large. It is difficult, if not impossible, to apply traditional dynamic programming methods to solve the problem. In the following section, we will introduce an approximate dynamic programming algorithm for exploring the optimal maintenance policy.
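For reference, the textbook discounted-cost forms that match the definitions in this section are sketched below. They are written in this section's notation as an assumption about the general shape of the value function and of Bellman's optimality equation for a cost-minimizing MDP, not as a verbatim restatement of the paper's numbered equations.

```latex
% Discounted cost of following policy \pi from initial state s
V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r\big(s(t), c(t)\big) \,\middle|\, s(0) = s,\ \pi \right]

% Bellman optimality equation for the minimum discounted cost
V^{*}(s) = \min_{c \in C} \Big\{ r(s, c) + \gamma \sum_{s' \in S} P(s, s', c)\, V^{*}(s') \Big\}
```

The sum over $s'$ in the second expression is what makes exact dynamic programming expensive: it has to be evaluated for every state and every decision.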

VI. PREDICTIVE MAINTENANCE DECISION MODEL FORMULATION
The recent breakthrough of reinforcement learning (RL) in AlphaGo shows that the method is a good approach to handle dynamic programming with large state and action spaces [23], [24]. Some pioneering works have applied RL to solve optimal control problems in production systems. McDonnell et al. [25] employed a reinforcement learning approach for specifying the payoffs in reconfiguration games in a heterarchical production system. Csáji et al. [26] presented an adaptive iterative distributed scheduling algorithm in a market-based production control system. The algorithm uses a triple-level learning mechanism, and is based on deep reinforcement learning.
Therefore, we propose to use reinforcement learning (RL) to perform maintenance decision making. Instead of directly exploring maintenance policies, the aim of the method is to learn the optimal value function $V^*(s)$, $s \in S$. The optimal policies are obtained based on the value function estimations. A neural network function approximator is adopted to represent the value function. The neural network function approximator is organized in a standard multilayer perceptron architecture, which is parameterized by real valued weights $\theta = [\theta_1, \ldots, \theta_n]$. In the network, the value function $V(s, \theta)$ is computed with a feed-forward flow of activation from the input neurons to the output neurons, passing through one or more layers of hidden neurons [27].
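The feed-forward computation of $V(s, \theta)$ can be sketched in a few lines. This is a minimal sketch assuming one hidden layer with a tanh activation and a single linear output; the activation function, the initialization, and the names `init_mlp` and `value` are assumptions, and the 17-input, 9-hidden-neuron sizes are taken from the case study in Section VIII.

```python
import math
import random

def init_mlp(n_in, n_hidden, seed=0):
    """Create the weights theta of a one-hidden-layer perceptron."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.gauss(0.0, 0.1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def value(theta, s):
    """Forward pass: input -> tanh hidden layer -> single linear output."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, s)) + b)
        for row, b in zip(theta["W1"], theta["b1"])
    ]
    return sum(w * h for w, h in zip(theta["W2"], hidden)) + theta["b2"]

theta = init_mlp(n_in=17, n_hidden=9)
v = value(theta, [1.0] * 17)   # estimated value of one encoded system state
```

How the system state is encoded into the 17 inputs is not specified here, so the all-ones input is only a placeholder.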
The neural network function approximator is trained with a reinforcement learning algorithm. However, conventional reinforcement learning algorithms with neural network function approximator can oscillate or diverge when they learn directly from consecutive samples [28]. To overcome the limitations, we adopt a mechanism named experience replay, which can randomize the training samples and break the strong correlations among them. The idea behind experience replay is to randomly sample a mini-batch of previous experiences of the system, and smooth out learning over many historical experiences. The training process is summarized in Algorithm 2.

The main steps of the algorithm are introduced as follows.
1. Replay Memory Generation and Update. In each time step $t$, the production system selects and executes maintenance actions according to an ε-greedy policy. With probability ε, the system selects and executes a random action $c(t)$. With probability $1 - ε$, it selects and executes action $c(t)$ by solving [...]. Then the experience $e_t = (s(t), c(t), r(t), s(t+1))$ is stored in a data set $D_t = (e_1, \ldots, e_t)$, which is named the replay memory data set. The size of the replay memory data set is fixed. It stores the latest $N$ experiences.
2. Approximation Update. The network is trained by iteratively adjusting the parameters $\theta$ to reduce the mean-squared error in the Bellman equation, which is defined as [...], where $\theta^-$ are the parameters $\theta$ from the previous iteration. Stochastic gradient descent is applied to reduce the error in Equation 12. During each iteration, a set of experiences is randomly drawn from the replay memory pool and utilized to compute the target [...]. Then $\theta$ is updated with a gradient descent step on [...], where $\alpha = 1/n$ in the $n$th training iteration.
3. Stopping Criteria. The algorithm determines the final neural network function approximator and the corresponding maintenance policy by repeating the aforementioned steps for a fixed number of iterations.
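The three steps above can be sketched on a toy problem. This is a minimal sketch under stated assumptions, not the paper's algorithm: a single 4-state machine with a hypothetical model (`P`, `COST`), one weight per state standing in for the neural network, a sampled target `r + GAMMA * V(s', theta_prev)`, and a one-step lookahead for the greedy action. All names and numbers are illustrative.

```python
import random
from collections import deque

GAMMA = 0.9
STATES = [0, 1, 2, 3]   # toy single machine, 3 = failed
ACTIONS = [0, 1]        # 0 = keep running, 1 = predictive maintenance
# Hypothetical model: P[c][s] is the next-state distribution, COST[c][s] the stage cost.
P = {
    0: [[0.8, 0.2, 0, 0], [0, 0.8, 0.2, 0], [0, 0, 0.8, 0.2], [0.5, 0, 0, 0.5]],
    1: [[1, 0, 0, 0], [0.9, 0.1, 0, 0], [0.9, 0, 0.1, 0], [0.5, 0, 0, 0.5]],
}
COST = {0: [0, 0, 0, 10], 1: [2, 2, 2, 10]}

def value(theta, s):
    return theta[s]     # one weight per state (one-hot features)

def greedy_action(theta, s):
    """Pick the action with the lowest one-step cost plus discounted value."""
    def lookahead(c):
        expected = sum(p * value(theta, s2) for s2, p in enumerate(P[c][s]))
        return COST[c][s] + GAMMA * expected
    return min(ACTIONS, key=lookahead)

def train(iterations=1000, memory_size=500, batch_size=15, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(STATES)
    theta_prev = list(theta)                 # parameters from the previous iteration
    memory = deque(maxlen=memory_size)       # keeps only the latest experiences
    s = 0
    for n in range(1, iterations + 1):
        # Step 1: epsilon-greedy action, epsilon decaying linearly from 1 to 0.1.
        eps = 1.0 - 0.9 * (n - 1) / max(iterations - 1, 1)
        c = rng.choice(ACTIONS) if rng.random() < eps else greedy_action(theta, s)
        s_next = rng.choices(STATES, weights=P[c][s])[0]
        memory.append((s, c, COST[c][s], s_next))
        # Step 2: random mini-batch, targets from theta_prev, one gradient step.
        batch = rng.sample(list(memory), min(batch_size, len(memory)))
        grad = [0.0] * len(theta)
        for bs, _, br, bs2 in batch:
            target = br + GAMMA * value(theta_prev, bs2)
            grad[bs] += (target - value(theta, bs)) / len(batch)
        alpha = 1.0 / n
        theta_prev, theta = theta, [w + alpha * g for w, g in zip(theta, grad)]
        s = s_next
    return theta                             # Step 3: stop after a fixed number of iterations

learned = train(iterations=200)
```

Sampling the mini-batch at random from the replay memory, rather than using the most recent transitions, is what breaks the correlation between consecutive samples.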
Note to Practitioner. In order to apply the proposed maintenance decision model, the production systems should be able to continuously track the health status of machines. In addition, the production systems should have access to production data including the states of machines, buffer levels, maintenance decisions, etc. Then, the current system state s(t) is adopted as the input to the trained neural network function approximator. Equation 11 is adopted to determine the optimal maintenance control. The decision-making process is shown in the following Procedure 1.

Procedure 1
Step 1. Collect the current system state $s(t)$.
Step 2. Compute $V(s, \theta)$ by running the forward propagation of the trained network.
Step 3. Determine the optimal maintenance decision according to Equation 11.
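Procedure 1 can be sketched as a one-step lookahead over the candidate decisions. This is a minimal sketch that assumes the decision rule minimizes the immediate cost plus the discounted expected value of the next state; the function `select_decision`, its arguments, and the toy numbers are hypothetical stand-ins for the trained network and the system model.

```python
def select_decision(state, decisions, stage_cost, transition, value_fn, gamma=0.95):
    """Return the decision with the lowest immediate cost plus discounted expected value.
    transition(state, c) yields (next_state, probability) pairs."""
    def lookahead(c):
        expected = sum(p * value_fn(s2) for s2, p in transition(state, c))
        return stage_cost(state, c) + gamma * expected
    return min(decisions, key=lookahead)

# Toy single-machine example with made-up values (state 3 = failed).
V = {0: 0.0, 1: 1.0, 2: 5.0, 3: 40.0}
MODEL = {
    (0, "run"): [(0, 0.8), (1, 0.2)],
    (0, "pm"): [(0, 1.0)],
    (2, "run"): [(2, 0.8), (3, 0.2)],
    (2, "pm"): [(0, 0.9), (2, 0.1)],
}

def cost(state, c):
    return 2.0 if c == "pm" else 0.0

choice_healthy = select_decision(0, ["run", "pm"], cost, lambda s, c: MODEL[(s, c)], V.get)
choice_worn = select_decision(2, ["run", "pm"], cost, lambda s, c: MODEL[(s, c)], V.get)
```

In this toy setting the rule keeps a healthy machine running and schedules PM once the machine is one step away from failure.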

VII. MACHINE STOPPAGE BOTTLENECKS
If the end-of-line machine $M_{M+1}$ is starved, the production system fails to satisfy the market demand. The demand that is not satisfied is defined as unsatisfied market demand (USMD). It can be calculated as the production loss of $M_{M+1}$ caused by starvation.
Machine stoppage is the most direct cause of USMD. Identifying and mitigating machine stoppage bottlenecks has been considered one of the most cost-effective approaches to reduce USMD. Machine stoppage bottlenecks (MSB) refer to the machines whose stoppage most significantly impacts USMD. They are defined in the following Definition 1.
[...] where $1 \le l \le M$ and $l \ne m$. In the definition, $DTD_m$ refers to the average downtime duration (DTD) of machine $M_m$, $1 \le m \le M$, because of random machine failures or predictive maintenance. It measures the average time that the system takes to recover $M_m$ from down states to working states.
According to the definition, machine $M_m$ is the MSB if a small decrease of its DTD leads to the greatest decrease of USMD. However, identifying the MSB with its definition is challenging because there is no closed-form expression of USMD in multistage production systems, let alone of its derivatives. It is necessary to develop a method that can identify the MSB with production or simulation data.
We adopt [...] The partial derivative in Definition 1 can be estimated as [...] It is noted that $\bar{v}_{M+1}$ is constant. Therefore, $n_i$ can be utilized as the indicator to determine the value of $\partial USMD / \partial DTD_i$. $n_i$ measures the number of times that machine $M_i$ causes the starvation of the end-of-line machine $M_{M+1}$. It can be determined directly from the collected production information or simulation results. Therefore, Proposition 1 presents the indicator for MSB identification: the machine with the highest $n_i$ is the MSB.
Remark 4: Since the transition of end-of-line machine $M_{M+1}$ is assumed to be time dependent, the probability distribution of the machine can be obtained by solving a sequence of balance equations: [...] (21) and [...] The average speed $\bar{v}_{M+1}$ [...], and it is constant.
Note to Practitioner. The application of the bottleneck identification method is shown in Procedure 2.

Procedure 2
Step 1. Collect production information.
Step 2. Compute $n_i$ of each machine.
Step 3. The machine with the highest $n_i$ is identified as MSB.
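Procedure 2 can be sketched as a simple count over a production log. This is a minimal sketch with a hypothetical log format and a hypothetical attribution rule: each starvation episode of the end-of-line machine is blamed on the most downstream machine that is stopped when the episode starts. The rule the paper uses to attribute a starvation event to a machine is not reproduced here, so `count_starvation_causes` only illustrates the counting step.

```python
from collections import Counter

def count_starvation_causes(log):
    """log: one (starved, down) pair per time step, where starved is True when the
    end-of-line machine is starved and down is the set of stopped machine indices."""
    counts = Counter()
    previously_starved = False
    for starved, down in log:
        if starved and not previously_starved and down:
            # Assumed rule: blame the most downstream stopped machine.
            counts[max(down)] += 1
        previously_starved = starved
    return counts

def bottleneck(counts):
    """The machine with the highest count n_i is reported as the MSB."""
    return max(counts, key=counts.get)

log = [
    (False, set()), (True, {5}), (True, {5}), (False, set()),
    (True, {3, 4}), (False, set()), (True, {5}),
]
n = count_starvation_causes(log)
msb = bottleneck(n)
```

Counting episodes rather than starved time steps follows the description of $n_i$ as a number of times, which is also an interpretation rather than a confirmed detail.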

VIII. NUMERICAL STUDIES
This section performs numerical studies to analyze the proposed maintenance decision model and the bottleneck identification method. The simulation environment is TensorFlow 1.14.0 with Python 3.6, and the simulation is performed on a laptop with an Intel i7-5500U CPU and 16.0 GB RAM.

A. ANALYSIS OF MAINTENANCE DECISION MODEL
We now consider a serial production system that consists of 6 machines and 5 buffers. The system parameters are given in Tables 1-3. In the system, machine $M_l$, $1 \le l \le 5$, has 4 states, which are denoted as $\alpha_l \in \{0, 1, 2, 3\}$. For ease of discussion, we assume that the machine's health condition degrades from state $\alpha_l = 0$ to state $\alpha_l = 3$. The speed of machine $M_l$, $1 \le l \le 5$, is
$$v_l(\alpha_l) = \begin{cases} 1, & \alpha_l = 0, 1, 2 \\ 0, & \alpha_l = 3. \end{cases}$$
We assume the market demand virtual machine $M_6$ has 2 states, i.e. $\alpha_6 \in \{0, 1\}$. The speed of the virtual machine is
$$v_6(\alpha_6) = \begin{cases} 1, & \alpha_6 = 0 \\ 0, & \alpha_6 = 1. \end{cases}$$
The maintenance cost for each machine is summarized in Table 4. The inventory cost is \$60 per part and time unit, and the backlog cost is \$100 per part. Each time step lasts 5 minutes and equals the cycle time of the machines.
First, the implementation of the proposed maintenance decision making algorithm is demonstrated. The core of the algorithm is the construction and training of the neural network, which uses system state $s(t)$ as input and computes the corresponding value function $V(s(t))$. The neural network is a feed-forward network, which has an input layer, a hidden layer, and an output layer. In this case, the input layer consists of 17 input neurons. The output layer is a fully-connected linear layer with a single output. The hidden layer is fully-connected, and the number of its neurons is determined based on [29] as 9.
The maximum number of iterations is $J = 1000$. The probability ε in ε-greedy linearly decreases from 1 to 0.1 during the training process. The training process is repeated 10 times. In each training, the training data is generated with simulation. The transition $\{s(t), c(t), r(t), s(t+1)\}$ is recorded in the replay memory. The sizes of replay memory and mini-batch are given in Table 5. The average CPU time, the average system cost $r(t)$, and their standard deviations are given in Table 6. It is observed that the training time increases from 253 s to 713 s as the sizes of replay memory and mini-batch increase. The results also indicate that the average system cost decreases first and then increases when the sizes of replay memory and mini-batch increase. The proposed maintenance decision making algorithm achieves the least system cost when the replay memory has a size of 10000 and the mini-batch has a size of 15. The corresponding CPU time is 312 s, which is the second least CPU time. Therefore, in this case, the sizes of replay memory and mini-batch are set to 10000 and 15, respectively.
FIGURE 2. WIP inventory, production loss and maintenance decisions with RLP. The figures plot the trajectories of WIP inventory, the production loss of the last virtual machine, and the maintenance decision of each machine with RLP during operations. The horizontal axis shows production cycles. The vertical axis shows WIP, production loss, and maintenance decision, respectively.
Second, we compare the performance of the maintenance policy suggested by the proposed algorithm, which is denoted as the reinforcement learning policy (RLP), with that of the following three policies that are widely accepted by manufacturers.
• State-based policy (SBP): Predictive maintenance is performed on machine $M_l$, $1 \le l \le 5$, if it reaches state $\alpha_l = 2$. The policy is based on the observation that reactive maintenance is usually much more costly than predictive maintenance. The policy strives to reduce the reactive maintenance events and hence reduce the maintenance cost.
• Time-based policy (TBP): Preventative maintenance is periodically performed on machine $M_l$, $1 \le l \le 5$. This policy is usually determined based on experience or the equipment maintenance manual instead of actual machine health conditions. In this case, we use simulation to search for the maintenance frequency for each machine that leads to the lowest discounted system cost.
• Greedy policy (GP): It is also denoted as the reactive-maintenance-only policy. All the machines are made to produce parts until they reach the breakdown states. No predictive maintenance is applied.
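The three baseline policies can be written as simple decision rules. This is a minimal sketch assuming a decision vector with one entry per production machine, where 1 requests maintenance and 0 does not; this encoding, the function names, and the example periods are hypothetical.

```python
def state_based_policy(states, threshold=2, failure=3):
    """SBP: request PM on each machine that has reached the threshold state but not failed."""
    return [1 if threshold <= a < failure else 0 for a in states]

def time_based_policy(t, periods):
    """TBP: request maintenance on machine l every periods[l] time steps."""
    return [1 if t > 0 and t % p == 0 else 0 for p in periods]

def greedy_policy(states):
    """GP: never request PM; failed machines are handled by reactive maintenance only."""
    return [0] * len(states)

sbp = state_based_policy([0, 2, 3, 1, 2])
tbp = time_based_policy(12, periods=[3, 4, 5, 6, 7])
gp = greedy_policy([0, 1, 2, 3, 0])
```

None of these rules looks at buffer levels or at the other machines, which is the information the learned policy can exploit.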
Each policy is simulated for 10,000h in the production system. Table 7 summarizes the results. It can be observed that the proposed RLP outperforms all the other policies in terms of mean maintenance cost, mean inventory cost, and mean backlog cost. To be specific, RLP causes 9.27% less maintenance cost, 0.08% less inventory cost, and 1.43% less backlog cost than SBP. It causes 9.27% less maintenance cost, 0.43% less inventory cost, and 1.46% less backlog cost than TBP. It causes 9.95% less maintenance cost, 2.99% less inventory cost, and 3.11% less backlog cost than GP.
Its distinctive characteristics allow the proposed RLP to achieve the best performance among all 4 policies. Figure 2 plots the trajectories of WIP inventory, the production loss of the last virtual machine, and the maintenance decision of each machine with RLP. The results are discussed as follows:
1. Maintenance actions are scheduled together. When machine $M_l$, $2 \le l \le 5$, is stopped for maintenance, there exist opportunities to perform maintenance on machines $M_1, \ldots, M_{l-1}$ at the same time. On the one hand, turning off these machines will not result in additional production loss [30]. On the other hand, it helps to prevent the system from accumulating excess inventory, which leads to high inventory cost.
2. Predictive maintenance is scheduled more frequently on upstream machines. Machine $M_l$, $1 \le l \le 5$, can cause the production loss of the last virtual machine $M_6$ if all the buffers between machines $M_l$ and $M_6$ are empty, i.e. $\sum_{k=l}^{5} b_k(t) = 0$. At any moment, it takes less time for machine $M_l$ to cause the production loss than for its upstream machines $M_1, \ldots, M_{l-1}$. Therefore, stopping a downstream machine carries more risk of causing the production loss of the last virtual machine $M_6$, which leads to high backlog cost.
3. Predictive maintenance is scheduled when the inventory level is sufficiently high. This indicates that RLP can balance the production and maintenance needs of the production system. When the system has a low inventory level, performing predictive maintenance can lead to high production loss and cause high backlog cost. When the inventory level is high, stopping a machine for predictive maintenance can not only reduce the risk of high reactive maintenance cost, but also help to decrease inventory cost. RLP helps the production system to capture these opportunities and optimize the system cost.

B. BOTTLENECK IDENTIFICATION AND IMPROVEMENT
The effect of improving the MSB is investigated. The serial production system in the previous section is still adopted. The production line is simulated for 10,000 h to generate production data. Since the end-of-line machine $M_6$ models the market demand, the $n_l$ of machines $M_1$ to $M_5$ is calculated and given in Table 8. It is found that machine $M_5$ has the greatest $n_l$, and the machine is the MSB. In addition, Definition 1 is also utilized to identify the MSB. Both methods identify the same MSB.
The DTD of machine $M_5$ is reduced by 20%, and the process is repeated four times. Table 9 records the USMD and the MSB before and after the improvement. The result indicates that the USMDs in the first two improvements are significantly higher than the USMDs in the last two improvements. The MSB also transfers from machine $M_5$ to $M_4$ after the third improvement. This is because, when machine $M_5$ is the MSB, reducing its DTD most effectively reduces USMD. When the MSB transfers to another machine, continuing to reduce the DTD of machine $M_5$ cannot effectively reduce USMD.
To further validate the bottleneck identification method, 10,000 production systems are randomly generated. The parameters of the production systems are randomly selected from Table 10. The identification indicator proposed in Proposition 1 is adopted to find bottlenecks. The production data are generated by simulations with a simulation time of 10,000 h. Definition 1 is also utilized to find bottlenecks. The results show that the methods find identical bottlenecks in all the cases.
A bottleneck improvement method (BIM) can be established by identifying and improving bottlenecks. The improvement method is summarized as follows.
1) Collect production information and compute the value of $n_i$.
2) Reduce the DTD of MSB by 5%.
3) Repeat step 2 until MSB transfers to another machine.
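The improvement loop above can be sketched as follows. This is a minimal sketch in which `find_msb` is a hypothetical stand-in for steps that require production data or simulation (computing $n_i$ and picking the machine with the highest value); the toy rule used in the example, the largest DTD, is not the paper's indicator and only serves to make the loop runnable.

```python
def bottleneck_improvement(dtd, find_msb, reduction=0.05, max_rounds=100):
    """dtd: dict mapping machine index to average downtime duration.
    find_msb(dtd): returns the current bottleneck machine for these durations."""
    dtd = dict(dtd)                      # work on a copy
    msb = find_msb(dtd)                  # step 1: identify the current MSB
    history = []
    for _ in range(max_rounds):
        dtd[msb] *= 1.0 - reduction      # step 2: reduce the DTD of the MSB by 5%
        history.append((msb, dtd[msb]))
        if find_msb(dtd) != msb:         # step 3: stop once the MSB moves
            break
    return dtd, history

def toy_find_msb(dtd):
    """Toy stand-in: treat the machine with the largest DTD as the bottleneck."""
    return max(dtd, key=dtd.get)

improved, history = bottleneck_improvement({4: 10.0, 5: 11.0}, toy_find_msb)
```

The `max_rounds` guard is an added safety limit so the sketch always terminates even if the bottleneck never moves.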
The method is compared with two improvement methods. The first method is denoted as the worst machine improvement method (WMIM), where the DTD of the machine with the highest breakdown frequency is reduced by the same percentage as that in BIM. The second method is denoted as the random machine improvement method (RMIM), where the DTD of a randomly selected machine is reduced by the same percentage as that in BIM. The improvements obtained with the three methods are given in Table 11. It is shown that the proposed iterative process leads to the largest throughput improvement, which is about 3.05 times higher than the improvement by WMIM and 3.21 times higher than the improvement by RMIM.

IX. CONCLUSION AND FUTURE WORK
This paper establishes an integrated decision model for joint production and maintenance decision making. A Markov chain model is developed to investigate the transient behavior of the production system. A reinforcement learning approach is proposed to optimize the production and maintenance cost. This research integrates production system modeling and approximate dynamic programming. It establishes a systematic approach to dynamic decision making in multistage production systems. The presented control method is compared with three other commonly applied maintenance policies. The result indicates that the proposed control method can balance the production and maintenance needs of the production system. It leads to the lowest production and maintenance cost, which is 5.25% lower than SBP, 7.96% lower than TBP, and 8.87% lower than GP. The research also investigates machine stoppage bottlenecks. The numerical studies show that the identification and mitigation of machine stoppage bottlenecks can effectively improve system throughput (approximately 9.00% improvement). It leads to an improvement that is about 3 times higher than both WMIM and RMIM.
The research presents some managerial insights into predictive maintenance decision-making and continuous improvement in production systems. Firstly, when machine $M_l$ is stopped for maintenance, there are maintenance opportunity windows for all the machines upstream of $M_l$. On the plant floor, operation managers can schedule the maintenance work of these machines together without causing excessive production loss. Secondly, predictive maintenance decision-making is closely related to the occupancies of the buffers. Plant floor managers should consider performing predictive maintenance when the buffer levels between the machine and the end-of-line machine are sufficiently high. Thirdly, when plant floor managers improve the performance of production systems through bottleneck identification and improvement, they should continuously identify bottlenecks, and only improve the performance of the current bottlenecks.
In the research, the standard feedforward neural network and stochastic gradient descent method can be improved to achieve higher computation efficiency and better maintenance policies. In addition, the current model assumes discrete machine deterioration states. However, it is not uncommon that a machine's health condition is described with continuous functions. For example, remaining useful life has been widely accepted as an indicator to denote machines' health condition. The proposed decision model should be extended to optimize production and maintenance when machines have continuous health states. We also plan to apply the decision model to broader areas. For instance, it has been proved in [31] that machines can be temporarily turned off for energy saving. The decision model will be extended to optimize the energy saving control such that energy efficiency improvement can be achieved without causing additional production loss. These will be the future work of the research.