Improving prescriptive maintenance by incorporating post-prognostic information through chance constraints

Maintenance is one of the critical areas in operations in which a careful balance between preventive costs and the effect of failures is required. Thanks to the increasing availability of data, decision-makers can now use models to better estimate, evaluate, and achieve this balance. This work presents a maintenance scheduling model that considers prognostic information provided by a predictive system. In particular, we developed a prescriptive maintenance system based on run-to-failure signal segmentation and a Long Short-Term Memory (LSTM) neural network. The LSTM network returns the prediction of the remaining useful life when a fault is present in a component. We incorporate such predictions and their inherent errors, via chance constraints, in a decision support system based on a stochastic optimization model. These constraints control the number of failed components and consider the physical distance between them to reduce sparsity and minimize the total maintenance cost. Experimental results show that this approach can compute solutions for relatively large instances in reasonable computational time. Furthermore, the decision-maker can identify a suitable operating point depending on the balance between costs and failure probability.


I. INTRODUCTION
Operational areas within organizations are under ever-increasing pressure to improve their performance. Social and political demands and competitors are just some of the drivers pushing companies to be more efficient and effective with their resources and assets. This pressure, in turn, has added a tremendous burden to maintenance, an area that must keep a delicate balance between the effects of failures and the cost of preventive measures. Furthermore, the increasing complexity of current production systems makes this balance even more challenging, making condition-based maintenance policies hard to define and implement. To deal with these difficulties, maintenance areas have turned to operational data, taking advantage of the many sensors and telemetry systems that are now available. Here, predictive analytics tools have helped convert data into information, transforming the constant flow from sensors and actuators to detect and even predict changes in the state of the system [1], [2]. The development of frameworks like Prognostics and Health Management (PHM) [3], [4] has further increased the need for fault prediction [5]-[8] as well as for estimating the remaining useful life (RUL) of a component after a fault appears [9]-[12].
However, this is only a partial solution. As systems grow, so will the number of detections and diagnoses, and what maintenance areas need are reliable plans that help them balance the cost of preventive measures against the costs caused by undetected or untreated failures [13], [14]. In this setting, prescriptive analytics tools might hold the key to improving the efficiency and efficacy of these complex systems, taking advantage of the plethora of operational data sources that are now available, provided these systems can handle the uncertainties inherent in prognostics. Researchers have recently been dealing with uncertainty and component connections in maintenance planning from the decision-making perspective [14]-[16].
Covering both aspects will be essential for large and complex production facilities like wind farms [17], solar generators, and even scientific instruments such as the ALMA radio telescope [5]. In this work, we will focus on the last step of PHM for decision-making in maintenance, which covers the two aspects mentioned before.

A. OUR CONTRIBUTION
Our work has the following novel contributions:
1) We propose a stochastic model with chance constraints to handle unexpected failures and address components with different levels of uncertainty in decision-making for maintenance, with the aim of minimizing the total cost. In addition, the model considers the distance between components in each maintenance period and the total residual RUL.
2) We study and describe the effect of varying the chance-constraint parameters on the resulting schedule.

II. PROBLEM DESCRIPTION
Let N be the set of components distributed over K machines, which might be in different sites, as shown in Figure 1. Each machine has a list of components on which a predictive system, like the one described in [5], has detected a degradation fault. Furthermore, each component has a predicted RUL distribution provided by this predictive system. The machines are not necessarily identical, and we assume that their components are independent between machines and within each machine. If one of a machine's components fails, we consider that the machine fails. This type of setting arises in several applications like manufacturing [18], offshore wind farms [17], and scientific instruments like the ALMA radio telescope [19], among others. Our goal is to schedule the maintenance of this set of components so as to minimize the maintenance cost, considering the distance between machines and balancing the machines' availability. In our work, we consider a one-year planning horizon with monthly maintenance decisions.

A. PREDICTIVE SYSTEM
LSTM networks are a type of artificial recurrent neural network (RNN) architecture proposed by Hochreiter and Schmidhuber [20] to deal with the vanishing gradient problem. One LSTM unit comprises three gates: an input gate, an output gate, and a forget gate. It also has a memory cell that remembers values over arbitrary time intervals, while the three gates regulate the flow of information into and out of the cell. This type of RNN has been found extremely successful in many applications [21]. A typical LSTM [22] is illustrated in Figure 2.
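To make the role of the three gates concrete, the following minimal sketch performs a single forward step of a scalar LSTM cell. It is only an illustration of the gating mechanism described above: the function names, the scalar state, and the `params` layout are hypothetical choices made for this example and do not correspond to the network trained in this work.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One forward step of a scalar LSTM cell.

    params maps each of "forget", "input", "output", "candidate" to a
    tuple (w_x, w_h, b); this layout is an assumption of the sketch.
    """
    def unit(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = unit("forget", sigmoid)       # how much of the old memory is kept
    i = unit("input", sigmoid)        # how much new information is written
    o = unit("output", sigmoid)       # how much of the memory is exposed
    g = unit("candidate", math.tanh)  # candidate value for the memory cell

    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # new hidden state (cell output)
    return h, c


if __name__ == "__main__":
    # Arbitrary illustrative weights, not trained values.
    weights = {name: (0.5, 0.1, 0.0)
               for name in ("forget", "input", "output", "candidate")}
    h, c = 0.0, 0.0
    for signal in (0.2, 0.4, 0.9):
        h, c = lstm_step(signal, h, c, weights)
    print(h, c)
```

The memory cell `c` is only ever modified through the forget and input gates, which is what allows the cell to retain values over long time intervals.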
We have developed an RUL prediction system based on LSTM neural networks. This network was pre-trained using run-to-failure data with degradation faults such as the ones described in [5]. The data for each component was analyzed and clustered, with each cluster having a catastrophic failure threshold. The system identifies which cluster best represents the detected fault, after which it uses the corresponding analytical model to predict the RUL's distribution. As a result, the mean $\bar{r}_i$ and the standard deviation $\bar{\sigma}_i$ of the RUL estimate are available for each component $i$. A general diagram of the developed prediction system is shown in Figure 3.

III. PROPOSED MAINTENANCE SCHEDULING MODEL
The scheduling model formulation is based on the ideas developed in [23]. However, unlike that work, which aims at meeting given demands, our approach aims to use the components as much as possible before the end of their respective RUL.

A. DYNAMIC MAINTENANCE COST
A dynamic maintenance cost function models the trade-off between the cost of preventive maintenance $C_p$ (early repair before failure) and the corrective maintenance cost $C_c$ that deals with unexpected failures [24]. Typically, corrective maintenance costs are higher than preventive maintenance ones. The dynamic maintenance cost $C_{i,t_{i,0}}(t)$ of component $i$ is defined in equation (1), where $R_{i,t_{i,0}}$ is the residual RUL of component $i$, which started at time $t_{i,0}$.
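As a purely illustrative sketch of such a trade-off, the code below assumes a ratio form that is common in the condition-based maintenance literature (expected cost of a maintenance cycle divided by expected operating time) together with a normally distributed RUL. Both the functional form and the normality are assumptions of this example and need not coincide with the cost function used in this work; the function names are hypothetical.

```python
import math


def survival(t, mean, std):
    """P(R > t) for a normally distributed RUL (assumption of this sketch)."""
    return 0.5 * (1.0 - math.erf((t - mean) / (std * math.sqrt(2.0))))


def dynamic_cost(t, mean, std, t0=0.0, c_p=100_000.0, c_c=400_000.0, steps=1000):
    """Expected cost per day when maintenance is planned t > 0 days ahead.

    Assumed form: (c_p * P(R > t) + c_c * P(R <= t)) / (E[uptime] + t0),
    with E[uptime] = integral of P(R > z) for z in [0, t].
    """
    if t <= 0:
        raise ValueError("t must be positive")
    p_fail = 1.0 - survival(t, mean, std)
    expected_cost = c_p * (1.0 - p_fail) + c_c * p_fail
    # Trapezoidal rule for the expected operating time up to day t.
    dz = t / steps
    uptime = sum(
        0.5 * (survival(k * dz, mean, std) + survival((k + 1) * dz, mean, std)) * dz
        for k in range(steps)
    )
    return expected_cost / (uptime + t0)


if __name__ == "__main__":
    for day in (10, 40, 70, 90, 120):
        print(day, round(dynamic_cost(day, mean=90.0, std=10.0), 1))
```

Under these assumptions, maintaining too early wastes useful life (high cost per day of operation), while maintaining too late makes the corrective cost dominate, so the cost per day is lowest at an intermediate time.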

B. SCHEDULING MODEL
The prescriptive maintenance problem is modeled as an optimization problem with objective function (2) subject to constraints (3)-(8), where $C_{i,t_{i,0}}(t)$ is the dynamic maintenance cost defined in Section III-A. The parameters and decision variables are summarized in Table 1.
The objective function, given by equation (2), minimizes the total maintenance cost of a set of $|G|$ components. This total comprises, for each component, its dynamic maintenance cost, its nominal functional cost, the cost of additional repair time, the cost of the distance between components, and the cost related to residual RUL.
Constraint (3) guarantees that each component enters maintenance only once in the planning horizon. The chance constraint (4), in contrast, restricts the number of components that run out of RUL before their scheduled maintenance to a threshold $\rho$ with a probability of at least $1 - \epsilon$. In that constraint, the Bernoulli random variable $\zeta_{i,t}$ is 1 if $R_{i,t_{i,0}} < t$ and 0 otherwise, and $\rho$ sets an upper bound on the number of components with catastrophic failure. The probability of not achieving the bound set by $\rho$ is at most $\epsilon$.
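The meaning of the chance constraint can be checked empirically for a given schedule. The sketch below estimates, by Monte Carlo sampling, the probability that more than `rho` components fail before their planned maintenance, and compares it with a tolerance `eps` playing the role of the violation probability. It assumes independent, normally distributed RULs and expresses both maintenance times and RULs in days; these choices and all function names are hypothetical.

```python
import random


def violation_probability(schedule_day, rul_mean, rul_std, rho,
                          n_samples=10_000, seed=0):
    """Estimate P(number of components failing before maintenance > rho)."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(n_samples):
        failed = sum(
            1
            for day, mu, sd in zip(schedule_day, rul_mean, rul_std)
            if rng.gauss(mu, sd) < day  # sampled RUL ends before maintenance
        )
        if failed > rho:
            violations += 1
    return violations / n_samples


def satisfies_chance_constraint(schedule_day, rul_mean, rul_std, rho, eps, **kwargs):
    """True if the estimated violation probability does not exceed eps."""
    return violation_probability(schedule_day, rul_mean, rul_std, rho, **kwargs) <= eps


if __name__ == "__main__":
    days = [60.0, 75.0, 90.0, 95.0]
    means = [80.0, 80.0, 100.0, 90.0]
    stds = [10.0, 5.0, 15.0, 5.0]
    print(violation_probability(days, means, stds, rho=1))
```

Such a sampling check is useful for validating a schedule after the fact, but it cannot be embedded directly in a mixed-integer model, which is why a tractable approximation of the constraint is needed.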
Constraints (5) ensure that at most $\bar{M} + \gamma_t$ work-hours are needed for maintenance in each period $t$. If additional work-hours are needed, then additional costs are added to the total maintenance cost. Constraints (6) determine the maximum distance between components planned for maintenance in period $t$, which reduces the dispersion of the components in each maintenance period. Finally, constraints (7)-(8) represent the total number of days before and after the end of the RUL at which components enter maintenance in period $t$.
VOLUME 4, 2016
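For a fixed assignment of components to periods, the quantities bounded by the work-hour and distance constraints can be computed directly. The sketch below is a minimal illustration: the per-component repair hours, the planar coordinates, and the use of Euclidean distance are assumptions of this example, and the function names are hypothetical.

```python
import math
from itertools import combinations


def period_overtime(assignment, hours, m_bar=160.0):
    """Work-hours above the regular capacity m_bar, for each used period."""
    load = {}
    for component, period in assignment.items():
        load[period] = load.get(period, 0.0) + hours[component]
    return {period: max(0.0, total - m_bar) for period, total in load.items()}


def period_max_distance(assignment, location):
    """Largest pairwise distance among components maintained in the same period."""
    points = {}
    for component, period in assignment.items():
        points.setdefault(period, []).append(location[component])
    return {
        period: max((math.dist(a, b) for a, b in combinations(pts, 2)), default=0.0)
        for period, pts in points.items()
    }


if __name__ == "__main__":
    assignment = {"a": 1, "b": 1, "c": 2}          # component -> period
    hours = {"a": 100.0, "b": 80.0, "c": 50.0}     # hypothetical repair hours
    location = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (10.0, 10.0)}
    print(period_overtime(assignment, hours))
    print(period_max_distance(assignment, location))
```

In the optimization model these quantities are decision-dependent, so they appear as variables linked to the assignment by linear constraints rather than being computed after the fact.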

C. SAFE APPROXIMATION OF CHANCE CONSTRAINT
The usage of chance constraint (4) in a decision-making model makes it computationally challenging. To make this constraint tractable, an upper bound on the left-hand side of this inequality can be computed using the Markov and generalized Bernstein inequalities, as proposed in Proposition 1 in [25]. Hence, any $z \in \{0,1\}^{|T| \times |G|}$ that satisfies the resulting safe-approximation inequality (a double sum over $i \in G$ and $t \in T$ involving the constant $\rho^*$) will also satisfy constraint (4). Figure 4 shows the behavior of the values of $\rho^*$ for $\rho = 11$ and 150 components. The figure shows that for $\epsilon$ very close to 0 the bound is smaller, which means that the limit on the number of components entering corrective maintenance is tightened further while never exceeding $\rho$. As the value of $\epsilon$ increases, this condition becomes less strict.
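The idea of a safe approximation can be illustrated with the plain Markov inequality alone. If X is the (integer) number of failed components and E[X] <= eps * (rho + 1), then P(X > rho) = P(X >= rho + 1) <= E[X] / (rho + 1) <= eps, so a linear condition on the expected number of failures is sufficient for the chance constraint. This is a simpler and typically more conservative bound than the Bernstein-based constant of [25], and it is shown here only as a sketch; `eps` stands for the violation probability and all function names are hypothetical. The exact distribution of X for independent components is computed to verify the bound.

```python
def failure_count_pmf(probs):
    """Exact distribution of the number of failures among independent components."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this component survives
            nxt[k + 1] += mass * p       # this component fails
        pmf = nxt
    return pmf


def exact_violation(probs, rho):
    """P(X > rho) computed from the exact distribution."""
    return sum(failure_count_pmf(probs)[rho + 1:])


def markov_safe(probs, rho, eps):
    """Sufficient (not necessary) linear condition for P(X > rho) <= eps."""
    return sum(probs) <= eps * (rho + 1)


if __name__ == "__main__":
    probs = [0.01] * 50   # hypothetical failure probabilities of scheduled components
    print(markov_safe(probs, rho=11, eps=0.1), exact_violation(probs, rho=11))
```

Because the condition is only sufficient, some schedules that satisfy the original chance constraint are rejected; tighter bounds such as the one in [25] reduce this conservatism.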

D. STOCHASTIC MIP SCHEDULING MODEL
To deal with the non-linear terms, we linearize the chance constraint, using its safe approximation, and the dynamic cost. We then approximate the stochastic optimization model with a set of scenarios sampled from the predicted RUL distribution of each component.

1) Chance constraint linearization
Using the safe approximation defined in Section III-C, we can reformulate chance constraint (4) following the strategy proposed in [23], by defining an auxiliary decision variable as follows:
$P_{i,t} := E(\zeta_{i,t}) = P(R_{i,t_{i,0}} \le t), \quad \forall i \in G, \ \forall t \in T.$ (11)
Considering $\bar{P}_{i,t}$ as an upper bound of $P_{i,t}$, with $0 \le P_{i,t} \le \bar{P}_{i,t} \le 1$, the chance constraint can be rewritten as a safe approximation expressed as a double sum over $t \in T$ and $i \in G$. Analogously, we linearize the non-linear term $C_{i,t_{i,0}}(t) z_{i,t}$ of the objective function by defining $w_{i,t} := \theta_{i,t} z_{i,t}$, where $\theta_{i,t} = C_{i,t_{i,0}}(t)$ and $0 \le \theta_{i,t} \le \bar{\theta}_{i,t} \le C_c$. The resulting linear constraints on $w_{i,t}$ enter the final optimization model.
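A product between a binary variable z and a bounded quantity theta in [0, theta_bar] is commonly linearized with four inequalities: w <= theta_bar * z, w <= theta, w >= theta - theta_bar * (1 - z), and w >= 0. The sketch below checks numerically that these inequalities force w = theta * z for both values of z. This is the textbook construction; whether it coincides term by term with the constraints used in this work is an assumption, and the function name is hypothetical.

```python
def linearization_interval(z, theta, theta_bar):
    """Interval [lo, hi] of values of w allowed by the four linear inequalities.

    Requires z in {0, 1} and 0 <= theta <= theta_bar.
    """
    if z not in (0, 1):
        raise ValueError("z must be binary")
    if not 0.0 <= theta <= theta_bar:
        raise ValueError("theta must lie in [0, theta_bar]")
    lo = max(0.0, theta - theta_bar * (1 - z))  # w >= 0 and w >= theta - theta_bar*(1 - z)
    hi = min(theta_bar * z, theta)              # w <= theta_bar*z and w <= theta
    return lo, hi


if __name__ == "__main__":
    for z in (0, 1):
        print(z, linearization_interval(z, theta=250.0, theta_bar=400.0))
```

For z = 1 the interval collapses to the single point theta, and for z = 0 it collapses to 0, so the linear inequalities reproduce the product exactly at every binary solution.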

2) Scenarios
Since each component has its own RUL distribution provided by the predictive system, we create a set of scenarios, $S$, such that each scenario contains one sample from each component's RUL distribution, i.e.,
$s_k = (r_{s_k,1}, r_{s_k,2}, \ldots, r_{s_k,|G|}), \quad k \in \{1, \ldots, |S|\},$ (21)
where each $r_{s_k,i}$ follows the RUL distribution of component $i$, and $\bar{r}_i$, $\bar{\sigma}_i$ represent the mean and standard deviation of the RUL estimate of component $i$, respectively.
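Scenario generation of this kind can be sketched in a few lines. The example below assumes that each component's RUL is normally distributed with the mean and standard deviation returned by the predictive system, and it truncates negative draws at zero; both are assumptions of this sketch, as are the function name and the example values.

```python
import random


def sample_scenarios(rul_mean, rul_std, n_scenarios=100, seed=42):
    """Return n_scenarios lists, each holding one sampled RUL per component."""
    rng = random.Random(seed)  # fixed seed so that experiments are reproducible
    return [
        [max(0.0, rng.gauss(mu, sd)) for mu, sd in zip(rul_mean, rul_std)]
        for _ in range(n_scenarios)
    ]


if __name__ == "__main__":
    means = [45.0, 120.0, 200.0]   # hypothetical mean RULs in days
    stds = [5.0, 15.0, 30.0]       # hypothetical standard deviations in days
    scenarios = sample_scenarios(means, stds)
    print(len(scenarios), scenarios[0])
```

Each inner list plays the role of one scenario vector, with one entry per component.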

3) Optimization model
Considering the information on the distribution of the RUL of each component and the linearization of the non-linear terms of both the chance constraint and the dynamic cost function described in Section III-D1, we can formulate our prescriptive maintenance problem as a stochastic mixed-integer model with objective (23)-(24) subject to constraints (25)-(34), where constraint (28) holds $\forall t \in T$, $\forall s \in S$. Here, $\bar{\theta}_{i,t}$ is the upper bound on $\theta_{i,t}$ introduced in Section III-D1, $r_i$ has the same distribution as defined in (22), $t_{i,0}$ represents the days elapsed since the last emission of the predictive information until the moment the scheduling process is carried out, and $\rho^*$ is the safe approximation constant defined in (10). For simplicity, we consider $\bar{P}_{i,t} = 1$, $\forall i \in G$, $\forall t \in T$. The model aims to minimize the average cost over all the scenarios, which is described by equations (23) and (24). Constraints (25)-(29) guarantee that every component enters maintenance only once during the planning horizon and that at most $\bar{M} + \gamma_t$ work-hours are needed for maintenance in each period $t$. These constraints also reduce the geographical dispersion between the components attended in each period, considering the distance between them. The model aims to use each component as much as possible and to reduce the number of days by which each component enters maintenance after the end of its RUL, in each period $t$ and each scenario.
The constraints (30)-(32) represent the linearization of the chance constraint (4), whereas the linearization of the dynamic cost is given by the equations (33)-(34).
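The scenario-averaged treatment of residual RUL can be illustrated for a fixed schedule. The sketch below computes, for every scenario and component, the days of useful life left unused (maintenance before the sampled RUL ends) and the days of delay (maintenance after it ends), and averages a penalty over the scenarios. The penalty weights `cost_early` and `cost_late` and the function names are hypothetical and are not mapped to specific cost parameters of this work.

```python
def residual_days(maintenance_day, sampled_rul):
    """Days of unused life (early) and days past the end of the RUL (late)."""
    early = max(0.0, sampled_rul - maintenance_day)
    late = max(0.0, maintenance_day - sampled_rul)
    return early, late


def average_residual_cost(schedule_day, scenarios, cost_early, cost_late):
    """Average over scenarios of the total early/late penalty of a schedule."""
    total = 0.0
    for scenario in scenarios:
        for day, rul in zip(schedule_day, scenario):
            early, late = residual_days(day, rul)
            total += cost_early * early + cost_late * late
    return total / len(scenarios)


if __name__ == "__main__":
    schedule = [30.0, 60.0]                       # planned maintenance day per component
    scenarios = [[40.0, 50.0], [30.0, 70.0]]      # sampled RULs, one list per scenario
    print(average_residual_cost(schedule, scenarios, cost_early=1.0, cost_late=2.0))
```

Penalizing delay more heavily than unused life pushes the schedule towards maintaining components shortly before, rather than after, the end of their RUL.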

IV. EXPERIMENTAL SETTINGS
The proposed prescriptive maintenance system was implemented in Python 3.8.10 using Gurobi 9.1.1 as the mixed-integer optimization solver. The experiments were run on a computer with an Intel® Core™ i5-3230M processor (2.6 GHz × 4 cores), 8 GB of RAM, and Linux Mint 20.1 Ulyssa (64 bits) as the operating system.
The model settings were as follows: the planning horizon for maintenance was set to one year, i.e., $H = |T| = 12$, with each month as a period with an operational length of 30 days, $O_p = 30$. The preventive and corrective costs were $C_p = 100000$ and $C_c = 400000$, respectively. Other related costs were $C^{+} = 10000$, $C^{d}_{cl} = 10000$, $C^{r+}_{cl} = 11000$, $C^{r-}_{cl} = 22000$, and $V_{i,t} = 5000$. The maximum number of work-hours was set to $\bar{M} = 160$, and 100 scenarios were generated. These cost values were chosen to evaluate both the dynamic cost and the performance of the proposed model.
A public repository with all the benchmark instances tested with our methodology can be found at [26].

V. COMPUTATIONAL RESULTS
A simulated problem with $|G| = 250$ components distributed over $K = 9$ machines, shown in Figure 5b, was used as one of the instances to test the model's performance. We set $\epsilon = 0.1$ and $\rho = 11$; this implies that at most about 5% of the components ($11/250 = 4.4\%$) enter corrective maintenance due to a catastrophic failure with a probability of at least $1 - \epsilon$. We solved the model using multiple scenarios sampled from the RUL distribution, as described in Section III-D2, and we assumed that the predictive system provided the information on the same day that the scheduling model was executed; therefore, we set $t_{i,0} = 0$. The minimum maintenance cost of each scenario is shown in Figure 5a, where the red dashed line represents the average maintenance cost over all the scenarios. In the resulting recommendation, all components enter maintenance before the end of their RUL, with maintenance planned fewer than 12 days before the estimated RUL $\bar{r}_i$, $\forall i \in G$, as illustrated in Figure 5c (which presents no orange After RUL bars). Across all scenarios, in 3.41% of the cases a component goes into maintenance after the end of its sampled RUL $r_{s,i}$, $\forall s \in S$, $\forall i \in G$. This confirms that less than 5% of the components go into corrective maintenance, in line with the limit we set, as reflected in the orange bars in Figure 5d. Constraint (4) introduces a tuning parameter, $\epsilon$, that helps the decision-maker balance the different costs. Figure 5e shows that smaller values of $\epsilon$ lead to a higher maintenance cost, since the model tries to increase the machines' availability by scheduling maintenance earlier. For the case study, if $\epsilon \ge 0.1$, the maintenance cost decreases almost linearly, showing slight changes in some periods. Figures 6a to 6b show the effect on the schedule of increasing $\epsilon$ from $10^{-8}$ to 0.1. The analysis shows significant changes, with several grouping modifications in each period.
On the other hand, when we increased $\epsilon$ from 0.1 to 0.2, there were only small changes in the placement of some components: one component moved from period 2 to period 1, two components from period 9 to period 10, and two components from period 10 to the next period. Varying $\epsilon$ thus changes the schedule and affects both the maintenance cost and the computational effort. In the figures, for each period, a red box indicates a component with a residual RUL of less than ten days, an orange box one with a residual RUL between 11 and 20 days, and a green box one with a residual RUL greater than 20 days.
We also tested the performance in instances with 500 and 1000 components distributed over 20 machines, measuring the time required to solve them. Our instances with 1000 components were solved in around 12 minutes. The results are summarized in Table 2.

VI. CONCLUSION
The increasing complexity of systems has made it harder for operational areas to develop well-balanced maintenance policies. The availability of data has significantly helped to obtain better information, but decision support tools are crucial to improve efficiency and the effective use of resources and assets. Furthermore, to be helpful, these tools need to embrace the uncertainty inherent in predictive analytics outputs such as RUL predictions.
Our work shows an initial approach to doing this. Our model performs well in the reported experiments, solving instances with 1000 components in around 12 minutes, even when there are different levels of uncertainty in the predicted RUL. This approach complements predictive systems, taking advantage of their information. Furthermore, the scheduling model can handle large sets of components within reasonable processing time and give robust recommendations to the decision-makers.

He also founded and was the initial director of the UAI Systems Center, a center dedicated to technology transfer and solving complex real-life problems using operations research tools. His research is focused on the design and development of decision support tools and algorithms. Before joining UAI, he was a researcher at Siemens Corporate Research in Princeton, NJ, developing decision support algorithms for smart grids and energy management. Prior to this, he worked at Booz Allen Hamilton, leading operations research projects in Chile, Argentina, Brazil, Peru, and Canada.
Rodrigo holds an electrical engineering degree and a master of science in engineering, focused on control systems, from Universidad Católica de Chile, and an M.Phil. and a Ph.D. from Columbia University in industrial engineering and operations research.