An Improved Grasshopper Optimization Algorithm Based Echo State Network for Predicting Faults in Airplane Engines

In today’s age of industrialization, sensor devices installed on equipment generate a vast amount of data. One of the engineers’ main jobs is utilizing these data to provide better solutions to industrial problems. This availability of extensive data partly led to the creation of predictive maintenance (PdM). In PdM, existing and previous conditions of devices are used to predict their future behavior for optimal maintenance. Most PdM approaches are typical time-series prediction problems. Machine learning models like Recurrent Neural Networks (RNNs) are excellent tools for time-series prediction. However, most RNNs suffer from training issues due to the unstable gradient problem. Thus, networks such as the Echo State Network (ESN) were designed to solve it. The ESN sidesteps the gradient problem by training only the output weights using simple linear regression. Despite this ease, the selection of ESN parameters and topology is a considerable design challenge. This problem is often formulated as a typical optimization problem, and metaheuristic algorithms are known to be excellent tools for solving optimization problems. Hence, in this work, we design an improved Grasshopper Optimization Algorithm (GOA) based ESN. The proposed technique uses a new solution representation with simplified attraction and repulsion mechanisms to enhance performance. Our target application is to predict the Remaining Useful Life (RUL) of turbofan engines. The method outperforms the Cuckoo Search (CS), Differential Evolution (DE), Particle Swarm Optimization (PSO), Binary PSO (BPSO), the original GOA, the classical ESN, the deep ESN, and the LSTM. We have provided all implemented codes and data at the GitHub repository: https://github.com/bala-221/Airplane-fault-prediction


I. INTRODUCTION
Complexities in engineering design and manufacturing, combined with changing operating and environmental conditions, have led to substantial reliability issues in the industry.
To overcome these challenges, Prognostics and Health Management (PHM) was developed. PHM's primary aspect is ''prediction,'' which consists of the use of present and past component states of any dynamic system to estimate its future behavior. Engineers can use the prediction results for two purposes. Firstly, it serves as a reliable alarm system to stop machine efficiency degradation, machine malfunctioning, or even disastrous system failure [1]. Secondly, it can be employed to schedule maintenance appropriately, resulting in a significant reduction in down-times. Moreover, it also offers potential maintenance cost savings, since there is a shift from preventive to predictive maintenance [2]. Generally, there are two approaches to prognostics: model-based methods and data-driven methods [3]. In model-based methods, we require an understanding of the physics of the system. This information is then used to create mathematical formulae that explain the system's dynamics. We then use the equations to estimate the system's future health. However, for complex dynamic systems, it is often challenging to obtain a correct analytical model, especially if the system is operating in noisy and unpredictable environments.
In contrast, data-driven methods employ machine learning and pattern recognition to create a link between operation signals and the health state of systems. This link is then used to detect changes in the system as well as to predict future behavior. Classical data-driven techniques for non-linear system prediction include projection pursuit [4], multivariate adaptive regression splines [5], and Autoregressive Integrated Moving Average (ARIMA) models [6]. These methods depend on state patterns of similar previous systems to forecast future states. They are more suitable when the general first principle of a system is not available, or when it is so complicated that finding an accurate analytical model is extremely costly. Data-driven methods have shifted over the years towards the use of adaptive system models like fuzzy neural systems and neural networks [7]. These models have provided better performance than classical prediction methods.
One powerful class of neural networks is the Recurrent Neural Network (RNN). These networks have cyclic connections that loop previous signals back into the network. This architecture makes RNNs particularly suitable for sequential data prediction. However, most RNNs have complicated training schedules and suffer from the unstable gradient problem [8]. Gated networks such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) have been used efficiently to solve the vanishing and exploding gradient problems of RNNs. Despite their extensive and successful application, they are more complex than the ESN in terms of architecture and implementation. The ESN solves the unstable gradient problem by applying simple least-squares regression to train only the output weight connections. Despite its simplicity, the ESN has been shown to perform as well as or even better than the LSTM and GRU on multivariate time-series prediction tasks [9]. In this work, we also compare against a deep LSTM network [10]. In the next section, we discuss the classical ESN.

A. ECHO STATE NETWORK
ESN is a relatively modern RNN, which substitutes a dynamic reservoir for the hidden layer of RNNs. Some of its unique characteristics include: (i) The randomly generated connection of weights within the reservoir is used as an information processing unit. (ii) The randomly generated reservoir serves as a feature space into which the inputs are mapped. (iii) Only its output weights require training; other weights are often generated randomly. (iv) Readout weight connections are often learned through a simple linear regression technique. A schematic of the classical ESN is shown in Figure 1. It has three layers: input, reservoir, and a readout layer. Moreover, it usually has an input bias unit that is excluded here for clarity. General guidelines for designing a good ESN are provided in [11], [25]. They suggested the design of a random, large, and sparsely connected reservoir layer. Thus, the units within the ESN's reservoir could number in the hundreds to thousands depending on task complexity. The connection weights are mostly obtained via a uniform distribution with symmetry around zero.

In this work, we use a discrete ESN with J inputs. These J input neurons receive inputs $I(t) = [I_1(t), I_2(t), \ldots, I_J(t)]^T$ at each time interval t. Additionally, it has K reservoir units with states $x(t) = [x_1(t), x_2(t), \ldots, x_K(t)]^T$, and L output units $O(t) = [O_1(t), O_2(t), \ldots, O_L(t)]^T$ that provide the prediction results. From Figure 1, the reservoir update equation is:

$$x(t) = f\big(W_{in}^T I(t) + W_r\, x(t-1) + W_{back}^T O(t-1)\big) \qquad (1)$$

Additionally, the output equation is:

$$O(t) = g\big(W_{out}^T x(t) + W_{inout}^T I(t) + W_{outout}^T O(t-1) + W_{biasout}^T\big) \qquad (2)$$

where f and g are the input and output activation functions (AFs), respectively. $W_{in}$ (J × K) is the input weight matrix connecting the inputs to the reservoir, $W_r$ (K × K) holds the reservoir connection weights, and $W_{out}$ (K × L) holds the output weights. Moreover, $W_{back}$ (L × K) holds the optional feedback weights from the output back to the reservoir. Other optional connections are: weights between inputs and output units $W_{inout}$ (J × L), output cycle weights $W_{outout}$ (L × L), and bias-to-output weights $W_{biasout}$ (1 × L). As already stated, only the output weights require training; all remaining weight connections are usually generated randomly and often stay constant throughout ESN training. If $W_{outGen}$ denotes all weight connections to the output, then (2) can be transformed to:

$$O(t) = g\big(W_{outGen}\, [I(t); x(t); O(t-1)]\big) \qquad (3)$$

with [ · ; · ; · ] representing matrix concatenation. The main goal of the training is to find $W_{outGen}$. If we assume that g in (3) is invertible, then the equation may be formulated as:

$$W_{outGen}\, [I(t); x(t); O(t-1)] = g^{-1}\big(O(t)\big) \qquad (4)$$

We calculate the output weights $W_{outGen}$ by employing least-squares linear regression, which minimizes the variance between the target value and the network's actual output. One important feature of the echo state network is the echo state property (ESP) [27]. The ESP posits that the impact of previous states and initial inputs on future states should decrease over time and not persist or grow. The ESP is often ensured in practical applications if the reservoir weight matrix is scaled so that its spectral radius (SR), which is the highest absolute eigenvalue of $W_r$, is set to a magnitude of less than 1. However, several works have proven that since the inputs to the ESN also affect its ESP, it is possible to attain the ESP even if the SR is greater than one for certain inputs.
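To make (1) and (3) concrete, the following is a minimal MATLAB sketch of one reservoir update and readout. The dimensions, scaling values, and tanh/identity activation choices are illustrative assumptions, not the exact configuration used later in this paper; for convenience the code stores the weight matrices already transposed relative to the (J × K) convention above.

% Minimal ESN forward-pass sketch (illustrative sizes; Win is stored as K x J here)
J = 14; K = 300; L = 1;            % inputs, reservoir units, outputs (assumed)
Win   = 2*rand(K, J) - 1;          % input weights drawn uniformly from [-1, 1]
Wr    = 2*rand(K, K) - 1;          % reservoir weights
Wback = zeros(K, L);               % optional feedback weights (disabled here)
Wr = Wr / abs(eigs(Wr, 1)) * 0.9;  % scale the spectral radius below 1 for the ESP

x = zeros(K, 1);  O = zeros(L, 1); % initial reservoir state and output
I = 2*rand(J, 1) - 1;              % one (synthetic) input sample
x = tanh(Win*I + Wr*x + Wback*O);  % reservoir update, cf. (1), with f = tanh
WoutGen = zeros(L, J + K + L);     % untrained readout placeholder
O = WoutGen * [I; x; O];           % readout, cf. (3), with identity g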

Summary of ESN training:
Main Inputs: an input/output training sequence $(I(t), O_{desired}(t))$, $t = 1, \ldots, T$.
Step 1: The number of reservoir units K, the spectral radius of the reservoir (SR), and the washout time $T_o$ are initialized. The washout time is the number of training samples for which the reservoir activations are discarded rather than collected. It is required so that the impact of the initialized random weights is reduced. After the initial parameters are set, the weights are randomly generated while ensuring the ESP.
Step 2: Collection of reservoir activations:
• Setting the initial network states (e.g., x(0) = 0), which are used to find x(1).
• Computing the reservoir states x(t) of the T time steps after washout using (1).
• The reservoir states x(t), inputs I(t), and outputs O(t − 1) for $t = T_o, \ldots, T$ are collected in a matrix S of size $(J + K + L) \times (T - T_o + 1)$.
Step 3: The output weights are then computed by using the following definition from (4):

$$W_{outGen} = Y_{desired}\, S^{-1} \qquad (5)$$

with $Y_{desired}$ denoting the desired output of the training data. The Moore-Penrose pseudoinverse method is used to compute the inverse in (5). This marks the end of the training, and the ESN is now ready for prediction by applying inputs and forecasting the outputs using (1) and (2). Moreover, as a means to reduce over-fitting, regularization may be applied to the output weights as:

$$W_{outGen} = Y_{desired}\, S^T \big(S S^T + reg \cdot \mathbf{I}\big)^{-1} \qquad (6)$$

where reg represents the regularization term and $\mathbf{I}$ an identity matrix. Since the ESN's inception, different variants have been proposed [26]. However, ESNs with leaky integrator neurons in their reservoir, called Leaky Integrator ESNs (LI-ESNs), have been shown to give impressive performance [28]. The leaky neurons perform a leaky integration of their activations from previous time steps. Thus, (1) is transformed to:

$$x(t) = (1 - a)\, x(t-1) + a\, f\big(W_{in}^T I(t) + W_r\, x(t-1) + W_{back}^T O(t-1)\big) \qquad (7)$$

with a as the leakage/decay rate, with value often in the range [0, 1]. This value is set so that a neuron neither retains nor leaks more activation than it had. The leakage rate has been shown to control the ''velocity'' of the reservoir dynamics. Throughout this article, we use the LI-ESN. Despite its easy training, ESN parameter and topology selection is a significant challenge. Initial studies have provided guides on ESN design [11]. However, these suggestions depend on the target application and are often fuzzy. Thus, more recently, many works have employed metaheuristic techniques to select the ESN's design parameters [12], [29].
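As an illustration of Steps 2 and 3, the MATLAB sketch below collects states into S after washout and solves for the output weights by the pseudoinverse of (5) and the ridge form of (6). The toy sizes, the synthetic target, and the leakage rate are assumptions, and the optional feedback and output columns of S are omitted for brevity.

% Readout training sketch: leaky state collection after washout, then (5)/(6)
J = 3; K = 50; T = 1000; To = 100; a = 0.5; reg = 1e-6;   % assumed values
I = 2*rand(J, T) - 1;                       % synthetic input sequence
Ydesired = sum(I, 1);                       % synthetic target (illustrative)
Win = 2*rand(K, J) - 1;
Wr  = 2*rand(K, K) - 1;  Wr = Wr/abs(eigs(Wr, 1))*0.9;
x = zeros(K, 1);  S = zeros(J + K, T - To + 1);
for t = 1:T
    x = (1 - a)*x + a*tanh(Win*I(:, t) + Wr*x);       % leaky update, cf. (7)
    if t >= To, S(:, t - To + 1) = [I(:, t); x]; end  % collect after washout
end
Yd = Ydesired(To:T);                                  % targets aligned with S
WoutGen = Yd * pinv(S);                               % pseudoinverse, cf. (5)
WoutGen = Yd * S' / (S*S' + reg*eye(J + K));          % ridge alternative, cf. (6)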
In this work, we develop a grasshopper optimization algorithm (GOA) based method to optimize the ESN [13]. This algorithm has proven to have excellent performance compared with Particle Swarm Optimization (PSO), Differential Evolution (DE), and the Genetic Algorithm (GA) on selected mathematical benchmark functions. We test our technique on fault prediction of turbofan engines. The main contributions of this work include:
• Development of a new solution representation for ESN optimization.
• Design of an improved Grasshopper Optimization Algorithm (GOA) with simplified attraction and repulsion schedules.
• Testing the optimized ESN on the remaining useful life (RUL) prediction of turbofan engines.
The remaining portions of the paper are as follows: Section II discusses works related to this article. In Section III, the Grasshopper Optimization Algorithm is introduced. Section IV outlines the methodology of our research. The RUL prediction of the turbofan engine is explained in Section V, while Section VI discusses the results of our study. In Section VII, we discuss some of the challenges of the ESN. Finally, Section VIII gives concluding remarks.

II. RELATED WORK
This section briefly discusses some selected works that use metaheuristic algorithms (MAs) to optimize the ESN. For a general review, we refer the reader to Bala et al. [12]. One of the pioneering works in the application of MAs to ESN optimization is Ferreira and Ludermir [14]. They employed the well-known GA to tune the ESN. In later work [15], they proposed a new technique called RCDESIGN, which employs the GA to select the appropriate ESN parameters and topology.
Additionally, Ma et al. [16] developed a deep ESN for time-series forecasting. The GA was used to tune the values of leaky rate, spectral radii, and reservoir sizes of the stack of ESNs within the deep ESN. Moreover, a GA optimized double reservoir ESN (DRESN) was developed by Zhong et al. [17]. The technique was better than the PSO optimized ESN for turbofan engine RUL estimation.
The PSO has also been successfully used to fine-tune the ESN. For instance, Wang and Yan [18] proposed a binary PSO (BPSO) for ESN optimization. They defined the choice of output weight connections as a binary feature selection problem; the BPSO was then used to select appropriate output connections. In [19], a PSO-based tuning of the ESN was developed. The method selects a fraction of the reservoir weights and then tunes their values using PSO. The authors highlighted that the technique is less mathematically involved, since the spectral radius does not have to be computed. Chouikhi et al. [20] extended the work in [19]: together with the fraction of reservoir weights, a fraction of the input weights and feedback weights are also chosen for the PSO optimization. The method outperformed the one in [19].
Additionally, Amaya and Alvares [21] proposed an ESN tuning method based on the artificial bee colony algorithm. The method was tested on the RUL prediction of turbofan engines and performed better than the classical ESN. Moreover, cuckoo search optimization has been used to optimize the ESN [22], [23], where it was found to outperform the classical ESN and other methods. In recent research, Wang et al. [24] used the DE algorithm for ESN tuning. The method was found to outperform the GA-optimized ESN on the prediction of electricity consumption. In the next section, we present the classical GOA algorithm.

III. THE GRASSHOPPER OPTIMIZATION ALGORITHM
The Grasshopper Optimization Algorithm (GOA) is a swarm-based metaheuristic developed in 2016 [30]. It mimics the natural swarming behavior of grasshoppers and locusts [31]. Even though grasshoppers can be found as isolated individuals, they are mostly organized in one of the most fascinating swarms found in nature. One exciting aspect of these swarms is that they can exist in both the larval and adult phases of grasshopper development. The larval swarm is characterized by slow movement in small steps. In contrast, adult swarms move in long steps with abrupt jumps. The authors mimicked this behavior to fit the exploitation and exploration mechanisms required to form most metaheuristics. Thus, the exploration is like the adult swarm movement, and the exploitation copies the larval swarms [30]. They modeled the location of a grasshopper within the swarm as [32]:

$$X_i = S_i + G_i + A_i \qquad (8)$$

where $X_i$ represents the location of the i-th grasshopper, $S_i$ is the social interaction of grasshopper i with the other grasshoppers within the swarm, $A_i$ is the wind advection affecting the i-th grasshopper, and $G_i$ is the gravitational pull on grasshopper i. The main engine of the GOA is the social interaction between the grasshoppers, given as [30]:

$$S_i = \sum_{j=1,\, j \neq i}^{m} Z(d_{ij})\, \hat{d}_{ij} \qquad (9)$$

where $d_{ij} = |x_j - x_i|$ is the distance between grasshoppers i and j, and $\hat{d}_{ij} = (x_j - x_i)/d_{ij}$ is a unit vector from the i-th to the j-th grasshopper. Z is a function representing the strength of the social forces, given as [32]:

$$Z(r) = f e^{-r/l} - e^{-r} \qquad (10)$$

with f denoting the attraction strength and l the attractive length scale. The Z function elicits forces of attraction and repulsion between the grasshoppers. A plot of the function is shown in Figure 2. One can observe from the figure that when the distance is in the range [0, 2.079], the grasshoppers/agents repel each other to avoid impact. There is neither attraction nor repulsion when the distance is exactly 2.079, and agents are then said to be in the comfort zone. As the distance extends beyond 2.079, the function keeps increasing until the distance reaches a value of around 4. This range [2.079, 4] is called the attraction phase. Here, the grasshoppers cooperate to reach the food source. Different values of f and l would give different zones of repulsion, comfort, and attraction. However, for the GOA, the values f = 0.5 and l = 1.5 are used. Despite the excellent modeling of the different zones by (10), it gives a value of almost zero when the distance goes beyond 10 (Figure 2). Thus, to solve this issue, the distances between the agents are mapped into the range [1, 4].
A schematic of the different agent movement zones is further given in Figure 3. From the figure, when the distance between a grasshopper and the target (a grasshopper at the food source) is less than the comfort-zone distance, the repulsion force on the grasshopper is greater than the attraction. When the distance is at the comfort zone, the forces of repulsion and attraction are equal. As the grasshopper travels outside the comfort zone, the force of attraction becomes more significant than the force of repulsion.
Equation (8) is slightly modified in the GOA. The gravity force is dropped, and the direction of the wind is assumed to be towards the target. Thus, the new position of grasshopper i in the d-th dimension is modeled as [30]:

$$X_i^d = c \left( \sum_{j=1,\, j \neq i}^{m} c\, \frac{uB_d - lB_d}{2}\, Z\big(|x_j^d - x_i^d|\big)\, \frac{x_j - x_i}{d_{ij}} \right) + T_d \qquad (11)$$

where $lB_d$ and $uB_d$ are the lower and upper bound values on the d-th dimension, respectively, and $T_d$ is the value of the d-th dimension of the target (best solution found so far). Moreover, c is a decreasing coefficient that shrinks the repulsion, comfort, and attraction zones with iterations. Its value is computed as [30]:

$$c = c_{max} - l\, \frac{c_{max} - c_{min}}{maxIter} \qquad (12)$$

with maxIter denoting the maximum iteration number and l the present iteration number. The constants are set to $c_{max} = 1$ and $c_{min} = 0.00001$ [30].

The basic GOA is shown in Algorithm 1. It commences by initializing the algorithm constants: the maximum iteration count (maxIter), the number of grasshoppers in the swarm (m), $c_{max}$, and $c_{min}$. Next, the initial random population/swarm is generated. Subsequently, the fitness/cost of each agent within the population is found by using an evaluation function (line 3). The best agent within the swarm is set as the target T in line 4. The main GOA iteration starts at line 5. Within the loop, the value of c is updated by using (12). Then, for each search agent, its distance from the other agents is found and normalized into the range [1, 4]. In line 9, the position of the current agent is updated by (11). If the new position extends beyond the boundary conditions, it is brought back (line 10). The fitness of each new position is found. Then the best new position of the search agents is compared with that of the target; if it is better, it replaces the target. The algorithm keeps iterating until the maximum iteration count is reached, and the target is returned.

Algorithm 1 The GOA Algorithm [13]
1: Fix algorithm parameters
2: Fix the grasshopper population X_i {i = 1, 2, . . . , m}
3: Find the cost of each grasshopper
4: T ← best grasshopper
5: for l = 1 to maxIter do
6:   Update c using (12)
7:   for each X_i in the swarm do
8:     Normalize the distances between the swarm agents into [1, 4]
9:     Update the position of the current agent using (11)
10:    If the new position goes out of the search boundary, return it back
11:    Find the fitness of the new positions
12:  end for
13:  Update T if a better solution is found
14: end for
15: Return T as the best agent.
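A minimal MATLAB sketch of one pass of the update in (11)-(12) follows. The fitness function, bounds, and swarm size are placeholders, and the formula used to map distances into [1, 4] is one simple choice, since the text does not give its exact form.

% Classical GOA update sketch: one position update of the whole swarm
m = 20; D = 5; lB = -10; uB = 10;            % swarm size, dimensions, bounds (assumed)
cMax = 1; cMin = 1e-5; maxIter = 100; l = 1; % iteration counter l at the first iteration
Zf = @(r) 0.5*exp(-r/1.5) - exp(-r);         % social force, cf. (10), f = 0.5, l = 1.5
X = lB + (uB - lB)*rand(m, D);               % random swarm
fit = sum(X.^2, 2);                          % placeholder sphere fitness
[~, b] = min(fit);  T = X(b, :);             % target = best agent so far
c = cMax - l*(cMax - cMin)/maxIter;          % shrinking coefficient, cf. (12)
Xnew = X;
for i = 1:m
    s = zeros(1, D);
    for j = [1:i-1, i+1:m]
        dVec = X(j, :) - X(i, :);
        d = norm(dVec);
        dMap = 1 + 3*mod(d, 4)/4;            % map the distance into [1, 4] (one choice)
        s = s + c*(uB - lB)/2 * Zf(dMap) * dVec/(d + eps);
    end
    Xnew(i, :) = c*s + T;                    % position update, cf. (11)
end
Xnew = min(max(Xnew, lB), uB);               % clamp agents back into the bounds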

IV. IMPROVED GOA ALGORITHM
The methods developed in this research are discussed here. Inspired by the grasshopper optimization algorithm (GOA), we develop a new hybrid version, taking motivation from several other metaheuristics.
The procedure is shown in Algorithm 2. It proceeds as follows: firstly, algorithm parameters such as the maximum generation (maxGen), the size of the population, and the number of children (numChildren) are set. Then, the population is initialized, the cost of each agent is found, and the best agent is saved. In line 7, we begin the main loop of the algorithm, which is repeated maxGen times. In this loop, the first task is to divide the population into two groups: the best 25% form the top group, and the remaining 75% make up the bottom group. Subsequently, in lines 10-28, we create children from the initial population. In this for-loop, we perform a selection process like that of the genetic algorithm (GA). For the first parent, called hopperOne, we run a rand function, which generates a random number uniformly distributed between 0 and 1. If the number generated is less than 0.7, we select the first parent (hopperOne) from a random position in the top group (line 12). Otherwise, we choose hopperOne from a random point in the bottom group (line 15). This makes the first parent more likely to come from the top group. For the second parent (hopperTwo), we reverse the case above, making the second parent more likely to come from the bottom group (lines 13 and 16).
Next, we find the absolute difference between the two parents (diff). For each gene in diff that is less than 3, we perform a repulsion. The repulsion mirrors the case where grasshoppers have come too close in flight and need to repel each other to avoid collision. However, if the gene difference is greater than 7, we perform an attraction, representing the case where the two grasshoppers are too far apart and may miss the food source; thus, they need to attract each other. The procedures depicting attraction and repulsion are given in Algorithms 3 and 4, respectively.
In the attraction procedure (Algorithm 3), for each gene where $K_i$ is greater than 7, we calculate a ratio term as in line 2. We then add this ratio to the inferior parent's gene, as in lines 4 or 6, and return the child.
In contrast, Algorithm 4 shows the repulsion procedure, which proceeds as follows. For each gene where $K_i$ is less than 3, we generate a val term. Next, we run a randi(2) function, which gives a value of 1 or 2; based on this, we randomly add the val term to the gene of parent one or parent two and then return the child.
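The following MATLAB sketch mirrors the control flow of Algorithms 3 and 4. The exact ratio and val formulas are not reproduced in this text, so illustrative placeholders are used for them; only which genes are touched, the inferior-parent update, and the randi(2) choice follow the description above.

% Hedged sketch of the attraction (Alg. 3) and repulsion (Alg. 4) operators.
function child = attractGenes(h1, h2, child, cost1, cost2)
    d = abs(h1 - h2);
    for g = find(d > 7)                          % far-apart genes: attract
        ratio = floor(d(g)/2);                   % placeholder ratio term (assumption)
        if cost1 <= cost2                        % h2 is the inferior parent
            child(g) = h2(g) + sign(h1(g) - h2(g))*ratio;  % pull it towards h1
        else                                     % h1 is the inferior parent
            child(g) = h1(g) + sign(h2(g) - h1(g))*ratio;
        end
        child(g) = min(max(child(g), 1), 10);    % keep the gene a valid index in 1..10
    end
end

function child = repelGenes(h1, h2, child)
    d = abs(h1 - h2);
    for g = find(d < 3)                          % too-close genes: repel
        val = randi(3);                          % placeholder repulsion step (assumption)
        if randi(2) == 1                         % randomly pick which parent's gene moves
            child(g) = h1(g) + val;
        else
            child(g) = h2(g) + val;
        end
        child(g) = min(max(child(g), 1), 10);    % keep the gene a valid index in 1..10
    end
end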
Algorithm 2 The Improved GOA Algorithm
1-8: [Initialization, as described above: fix maxGen, the population size, and numChildren; initialize the population; find the cost of each agent; T ← best agent; then begin the main loop (line 7): for gen = 1 to maxGen do]
9:   Divide the population into top and bottom groups
10:  for X_i = 1 to numChildren do
11:    if rand < 0.7 then
12:      Select a random top agent as hopperOne
13:      Select a random bottom agent as hopperTwo
14:    else
15:      Select a random bottom agent as hopperOne
16:      Select a random top agent as hopperTwo
17:    end if
18:    diff = abs(hopperOne − hopperTwo)
19:    for each gene K_i in diff do
20:      if K_i < 3 then
21:        Simulate repulsion
22:      else if K_i > 7 then
23:        Simulate attraction
24:      end if
25:      Get other genes by randomly choosing from hopperOne or hopperTwo
26:    end for
27:    Find the cost of the new child and save it
28:  end for
29:  Merge parents and children and sort them by cost
30:  50% of the next population comes from the best of the merge
31:  The other 50% are chosen randomly from the remaining merged population
32:  Update T if a better solution is found
33: end for
34: Return T as the best agent.

We then return to Algorithm 2 line 25, where the other genes of the child are obtained by randomly choosing from either
the gene of parent one or two. The new child is validated against the search bounds and then saved. The same approach is applied to all children. They are then merged with their parents and sorted by cost. We then proceed to create the next generation: 50% of the subsequent population is selected from the fittest of the merged population, as in line 30, while the other 50% are selected randomly from the remaining merged population. The best agent is updated if a better solution is found. The outer loop keeps repeating until maxGen iterations are reached, and we report the final best agent.
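Putting the pieces together, here is a condensed MATLAB sketch of one generation of the improved GOA (population split, biased parent selection, child creation, and the 50/50 survivor selection). The group sizes and probabilities follow the text; evalCost is a hypothetical handle to the cost routine of Section IV-C, and the crossover of the "other genes" reuses the helper functions sketched above.

% One generation of the improved GOA. pop is an m x D matrix of integer genes in 1..10.
[cost, order] = sort(cost);  pop = pop(order, :);  % sort agents by cost (best first)
nTop = ceil(0.25*m);                               % best 25% / bottom 75% split
children = zeros(numChildren, D);
for k = 1:numChildren
    if rand < 0.7                                  % hopperOne biased to the top group
        p1 = randi(nTop);  p2 = nTop + randi(m - nTop);
    else
        p1 = nTop + randi(m - nTop);  p2 = randi(nTop);
    end
    h1 = pop(p1, :);  h2 = pop(p2, :);
    pick = rand(1, D) < 0.5;                       % other genes: uniform crossover
    child = h1;  child(~pick) = h2(~pick);
    child = repelGenes(h1, h2, child);             % genes with |h1 - h2| < 3
    child = attractGenes(h1, h2, child, cost(p1), cost(p2));  % genes with |h1 - h2| > 7
    children(k, :) = min(max(child, 1), 10);       % validate against the gene bounds
end
merged = [pop; children];
mCost  = [cost; evalCost(children)];               % evaluate the children
[~, order] = sort(mCost);  merged = merged(order, :);
half = floor(m/2);                                 % 50% elitist survivors ...
rest = merged(half+1:end, :);
idx  = randperm(size(rest, 1), m - half);          % ... 50% random from the remainder
pop  = [merged(1:half, :); rest(idx, :)];          % (survivor costs carried along similarly)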

A. SOLUTION REPRESENTATION
Six parameters of the ESN are optimized: reservoir size, reservoir connectivity, leaky rate, spectral radius, regularization term, and input scaling. However, the reader should note that for the input scaling (scaling of the input weight connections), the 14 selected inputs (in the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) data, described in Section V) are scaled individually, each with its own gene. Every gene of a solution is an integer that indexes into a predefined array of candidate values for its parameter. A typical solution/agent can be {1, 4, 6, 7, 8, 9, 10, 4, 8, 9, 2, 4, 5, 6, 7, 8, 1, 2, 1, 4}. We then use these values to index into the arrays of the parameters. Thus, this solution would be transformed into reservoir size = 100, reservoir connectivity = 0.3, spectral radius = 1.2, leaky rate = 0.7, etc. The usefulness of this solution representation becomes apparent when recalling how the repulsion and attraction of Algorithms 3 and 4 operate on integer gene differences.

B. INITIAL POPULATION
To obtain the initial population, we select a random integer from 1 to 10 for each gene of an agent. This is repeated to create each initial member of the population.
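For concreteness, the MATLAB sketch below decodes an integer agent and builds an initial population. The candidate-value arrays are illustrative assumptions, chosen only to be consistent with the worked example above (reservoir size 100, connectivity 0.3, spectral radius 1.2, leaky rate 0.7).

% Decoding an integer solution into ESN parameters (candidate arrays are illustrative)
resSizes = 100:100:1000;                                 % gene value 1 -> size 100
connVals = [0.05 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9];   % gene value 4 -> 0.3
srVals   = 0.7:0.1:1.6;                                  % gene value 6 -> 1.2
leakVals = 0.1:0.1:1.0;                                  % gene value 7 -> 0.7
agent = [1 4 6 7 8 9 10 4 8 9 2 4 5 6 7 8 1 2 1 4];      % example agent from the text
K    = resSizes(agent(1));  conn = connVals(agent(2));
SR   = srVals(agent(3));    leak = leakVals(agent(4));
% the remaining genes index the regularization term and the per-input scaling values

% Initial population: a random integer from 1 to 10 for every gene of every agent
m = 20;  D = numel(agent);
pop = randi(10, m, D);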

C. OBJECTIVE FUNCTION
We adopt the mean squared error (MSE) on the validation data as the cost function. Thus, for each agent, we set aside a part of the training data for validation. Then, we select the parameters encoded by the agent and train the ESN. The MSE of the trained ESN on the validation data is taken as the cost of the agent:

$$MSE = \frac{1}{N} \sum_{t=1}^{N} \big(O_{desired}(t) - O(t)\big)^2 \qquad (13)$$

where N is the number of validation samples. For the input scaling, each column of the input weight matrix is normalized by dividing it by its largest singular value and is then scaled by the corresponding parameter. For the spectral radius, the reservoir matrix is normalized by dividing it by its largest absolute eigenvalue before it is scaled to the chosen spectral radius.
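A MATLAB sketch of this cost evaluation follows. trainESN and predictESN are hypothetical helper names standing in for the training procedure of Section I-A, and Itrain/Ytrain/Ival/Yval denote the train/validation splits; only the scaling steps and the MSE of (13) are taken directly from the text.

% Agent cost = validation MSE of the ESN that the agent encodes, cf. (13)
Wr = Wr / max(abs(eig(Wr))) * SR;             % rescale reservoir to the chosen SR
for j = 1:size(Win, 2)
    % each column divided by its largest singular value (its 2-norm, for a vector)
    Win(:, j) = Win(:, j) / norm(Win(:, j)) * inScale(j);
end
WoutGen = trainESN(Win, Wr, leak, reg, Itrain, Ytrain);   % hypothetical helper
Ypred   = predictESN(Win, Wr, leak, WoutGen, Ival);       % hypothetical helper
cost    = mean((Yval - Ypred).^2);            % validation MSE, cf. (13)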

E. STOPPING CRITERION
The stopping condition adopted here is the maximum iteration count, which is set to 100.

V. TURBOFAN ENGINE DEGRADATION PREDICTION
This section explains the turbofan engine degradation prediction. Figure 4 shows a pictorial view of the prediction procedure. The main element of the method is the turbofan engine (shown in Figure 5), which provides the driving thrust for most modern airplanes. Its main parts include a fan, a low-pressure compressor (LPC), a high-pressure compressor (HPC), a low-pressure turbine (LPT), and a high-pressure turbine (HPT). The engine degradation dataset is obtained from the prognostics data center of the National Aeronautics and Space Administration (NASA) of the United States [33]. It is simulated with the NASA-developed Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) program.
To produce the C-MAPSS data, identical turbofan engines were driven under different operating conditions, and faults of varying magnitudes were injected into each engine unit until the system failed. These faults were injected by varying the inputs of the C-MAPSS software. These inputs are presented in Table 1. As the inputs (faults) and operating conditions are varied, the sensor output signals shown in Table 2 are measured at a regular frequency. The engine keeps running until it fails to meet the healthy-state criterion determined by the health index parameters given at the bottom of Table 2. These health index parameters are represented by margins. First, they are normalized to the range [0, 1], where 1 indicates a healthy machine and 0 denotes an engine with a stall margin reduced to a defined limit (representing an unhealthy state). The limits are set to 15% for the LPC, HPC, and fan stall margins, and to 2% for the EGT margin. The health index of an engine is taken as the minimum of the HPC, LPC, fan, and EGT margins. The varying operational conditions and the injected faults are the sole causes of engine failure. More details can be found in [33].

FIGURE 4. Firstly, we extract run-to-failure sensor signals from the turbofan engine. The data is then preprocessed (filtering and normalization) and fed to the echo state network (ESN) for training. In the linear regression training process, our proposed algorithm is used to select the right parameters for the network. After training, the ESN is ready for deployment. In this mode, it provides the predicted remaining useful life of an engine when it receives a given sensor signal.
The data has some interesting properties and is an outstanding research benchmark for modern prognostic approaches. For instance, it includes a multi-dimensional output from a dynamic system (the turbofan engine), and the simulation incorporated measurement and process noise. This dual-state noise inclusion creates a delicate noise feature often associated with real data.
The C-MAPSS data is partitioned into four parts, as depicted in Table 4. The table also shows the operational modes and the types of injected faults for each part of the data. The fault types are HPC and fan degradation. Moreover, each data part is further subdivided into a training and a testing set. However, signals of the test set are intentionally truncated several cycles before failure; hence, the prediction task is to find the test set's RUL. To verify the found RUL, the real RUL is provided within the datasets for comparison. Since this study is a comparison of different optimization techniques for the ESN, we use only the training samples of the C-MAPSS data and further divide them into train/validation/test data.

A. RUL APPROXIMATION
The subsequent task is to estimate each machine's RUL from the dataset. A simple approximation is to use the number of time steps before failure as the engine's RUL, so that if the engine fails at time cycle 234, for example, then at the start of the signal the RUL is set to 234, and the RUL at failure is set to 0. This is a good estimate [35]. However, as discussed in [36], in most cases the deterioration in the engine does not become prominent until some time cycles have passed. Their experiments highlighted that it is unwise to estimate the RUL of the engine before it begins to show any sign of wear; this would be like estimating a new engine's RUL. Hence, most authors commence the RUL forecast only after a certain number of time cycles has passed [37]. To achieve this, the RUL is held constant at an initial value until a certain time has elapsed. The value of this constant is set to 125 in [35], [36], [38]. Figure 6 shows an example of the RUL plot. It compares the old RUL estimate with the new RUL, which has a constant value at the beginning of each machine run.
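For example, the piece-wise RUL target described above can be built in MATLAB as follows (the failure cycle is the example value from the text):

% Piece-wise RUL target: capped at 125 early in life, then linear decay to 0 at failure
failCycle = 234;                 % example failure cycle from the text
t   = 1:failCycle;               % elapsed cycles of one engine run
rul = min(failCycle - t, 125);   % early-life RUL held at 125 instead of failCycle - t
% rul(1) is 125 rather than 233, and rul(end) is 0 at the failure cycle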

B. SIGNAL PREPROCESSING
Like most machine learning processes, particularly those with noisy data, the input signals may require preparation before being fed to the network. This preprocessing may involve noise reduction, feature selection, and normalization. It was observed that some of the 21 sensor signals of the C-MAPSS data stayed unchanged during the run-to-failure simulation. Since they do not reflect system degradation, we remove such signals. We select fourteen signals from the 21 sensor signals, namely signals (2, 3, 4, 7, 8, 9, 11, 12, 13, 14, 15, 17, 20, 21), as done in [36], [37], [39]. We also perform normalization to fit the data into the span [−1, 1] of our tanh squashing function. The normalization equation is:

$$d_{i,j}^{norm} = \frac{2\,\big(d_{i,j} - d_j^{min}\big)}{d_j^{max} - d_j^{min}} - 1 \qquad (14)$$

where $d_{i,j}$ denotes the i-th data point of signal j and $d_{i,j}^{norm}$ is its normalized form. $d_j^{min}$ and $d_j^{max}$ represent the minimum and maximum values of the original sensor data of sensor j, respectively. After normalization, a Gaussian filter is used to smooth the signals.
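A short MATLAB sketch of this preprocessing step follows. The sensor matrix is a placeholder, and smoothdata's default Gaussian window is an assumption, since the exact filter settings are not specified here.

% Per-sensor min-max normalization into [-1, 1], cf. (14), then Gaussian smoothing
data = rand(500, 14);                               % placeholder (time steps x 14 sensors)
dmin = min(data, [], 1);  dmax = max(data, [], 1);  % per-column (per-sensor) extrema
dataNorm   = 2*(data - dmin) ./ (dmax - dmin) - 1;  % implicit expansion (R2016b+)
dataSmooth = smoothdata(dataNorm, 1, 'gaussian');   % smooth along the time dimension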

VI. RESULTS AND DISCUSSIONS
This section outlines the study's findings. We employ the LI-ESN in all simulations and assume that all ESNs have no feedback weights. A washout time of 300 cycles was used. Moreover, both the input and reservoir weight values were randomly initialized in [−1, 1], matching our tanh squashing function. Additionally, the codes were run on MATLAB R2018b on an OptiPlex 7060 Dell PC with an Intel Core i5-8500 CPU @ 3.00 GHz (6 cores) running Windows 10 Pro 64-bit (10.0, Build 18362).

A. MACKEY-GLASS SERIES
Here, the deepESN [40], [41] is applied to next-step prediction on the Mackey-Glass (MG) synthetic time-series. The main aim of this experiment is to compare two versions of the deepESN with the shallow (classical) ESN and determine which is best. The deepESN is a type of ESN with many layers. Both the deep and shallow ESNs are used to predict the next time step of the MG time-series. The series was developed in 1977 by Michael Mackey and Leon Glass to explain physiological control systems, and it is known for its nonlinear chaotic behavior. The series, with its constants assigned their most common values, is given as:

$$\frac{dx(t)}{dt} = \frac{0.2\, x(t-\tau)}{1 + x(t-\tau)^{10}} - 0.1\, x(t) \qquad (15)$$

This work uses the value τ = 17. Figure 7 shows a typical graph of the trajectory of this series.
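The series in (15) can be generated numerically, for example with a simple Euler discretization in MATLAB; the unit step size and the constant initial history below are assumptions.

% Euler approximation of the Mackey-Glass series, cf. (15), with tau = 17
tau = 17; N = 20000;
x = 1.2*ones(1, N + tau);                       % constant initial history (assumed)
for t = tau + 1 : N + tau - 1
    x(t+1) = x(t) + 0.2*x(t-tau)/(1 + x(t-tau)^10) - 0.1*x(t);
end
mg = x(tau+1:end);                              % discard the artificial history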
The training length was set to 14000 time steps and the test length to 6000. Moreover, we used a reservoir connectivity of 0.01 and a spectral radius of 1.25 for each network. Additionally, we set the reservoir size to 500 for the shallow ESN, while for the deep ESNs, the size of each reservoir was set to the total reservoir size divided by the number of layers. For example, for a five-layered network, each reservoir would have a size of 500/5 = 100. We implemented two types of deepESNs. The first (deepESNv1) has the same reservoir and inter-layer weights for all the layers. In contrast, the second (deepESNv2) has distinct reservoir and inter-layer weights for each layer. Each ESN was trained and tested twenty times, and the average, standard deviation, best, and worst mean squared errors (MSE) are reported, together with the average run times of each network. Initially, we ran the deepESNs with different numbers of layers (3 to 20) and found that 3 layers was best, because the results deteriorated as the number of layers increased beyond 3. Table 5 shows the results of the experiments. From the table, the shallowESN outperformed the deepESNs. This shows that, for this prediction task, there is no advantage to layering. Additionally, deepESNv2, with different reservoir and inter-layer weights, was better than deepESNv1 with fixed weight values. Regarding time complexity, on average, the deepESNs took more than twice the time needed to train and test the shallowESN.
From the results in this section and our preliminary implementation of the deepESN for turbofan engine RUL prediction (see Table 6), we find that the deepESN offers no advantage, at least for our test cases. Thus, all subsequent experiments were done with the shallow ESN on the C-MAPSS data.

B. RUL PREDICTION OF TURBOFAN ENGINES
This section provides the experimental findings for applying the optimized ESN to the RUL prediction of turbofan engines. The proposed method is compared with other existing works on the same task. We use all the training data in C-MAPSS in this work. For train_FD002 and train_FD004, the train/validation/test split is 24,500/10,500/15,000 samples, while for train_FD001 and train_FD003 the split is 9,800/4,200/6,000 samples. This division follows the relative sizes of the datasets. Table 7 presents the best settings for the implemented algorithms. The values of these parameters were set by carefully considering suggestions from the existing literature. For each of the population-based algorithms, we set the population size to 20 and the maximum iteration count to 100.

1) AVERAGE TIMINGS
• Concerning average timings, the proposed algorithm is faster than four of the methods (CS, GOA, BPSO, and PSO). However, it is slower than DE, the classical ESN, the deepESN, and the LSTM.

2) AVERAGE RESULTS
• On average, we can also observe that the metaheuristic-optimized ESNs are better than the classical ESN. This shows that optimizing the ESN with such algorithms can significantly enhance prediction accuracy. Nonetheless, the GOA and PSO fall short of the classical ESN in many cases. The PSO was inferior because the authors in [20] did not attempt to optimize the reservoir's spectral radius, and thus the ESN was not guaranteed to have the echo state property (ESP). In contrast, the GOA algorithm produced many outliers that pulled the average MSE up.
• To further put things into perspective, we have plotted the predicted RUL versus the actual RUL of some selected machines in Figure 8. Without loss of generality, this is for the FD001 data set, and the predicted path is that of our proposed method. The graph shows how the predicted values closely follow the actual path.

3) STATISTICAL SIGNIFICANCE
To further test for the statistical significance of our work, we conducted the Wilcoxon rank-sum non-parametric test.
We need this test since statistics such as the mean and standard deviation of Table 8 may have resulted from chance. Ten trials of our proposed method are compared with those of the other techniques. The Wilcoxon rank-sum null hypothesis is that the two trial groups come from continuous distributions with equal medians; loosely, that the two groups are not significantly different. The alternative hypothesis is that the medians differ, i.e., that our results are statistically significant. Table 9 shows this comparison. Each entry in the table compares the ten trials of the proposed improved GOA with another method at the 5% significance level. From the low p-values (mostly < 0.05), the null hypothesis is rejected in 30 of the 32 tests.
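Each entry of such a table can be produced with MATLAB's ranksum function (Statistics and Machine Learning Toolbox); the two ten-trial vectors below are placeholder data.

% Wilcoxon rank-sum test at the 5% level between ten trials of two methods
oursMSE  = 0.010 + 0.002*rand(1, 10);   % ten trial MSEs of the proposed method (placeholder)
otherMSE = 0.014 + 0.002*rand(1, 10);   % ten trial MSEs of a compared method (placeholder)
[p, h] = ranksum(oursMSE, otherMSE);    % h = 1 means the equal-medians null is rejected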

4) CONVERGENCE PLOTS
The graphs in Figure 9 show the cost (validation MSE) per iteration for each case study. Each data point is the average of the ten conducted trials. Here we compare the convergence of the proposed algorithm with the other methods. From the graphs, we can see that our proposed method has the fastest convergence in 3 out of the 4 cases. These graphs also reflect the results obtained in Table 8. Additionally, in each iteration of the proposed algorithm, we collect the average cost of the population. The graphs of Figure 10 show this average for each case study. Again, each data point is the mean of the ten conducted trials. It can be seen that the values rise and fall as the execution of the algorithm proceeds. This happens because we occasionally accept inferior solutions into our population (see Algorithm 2, line 31). This acceptance of occasionally worse solutions is the kind of behavior that helps most metaheuristics escape local optima.

5) SENSITIVITY ANALYSIS
Here, we test the sensitivity of the proposed algorithm to adjustments of its parameters. The test was conducted on the FD004 test case without loss of generality. The parameters varied were: the number of children, numChildren (Alg. 2, line 10); the percentage of agents in the bottom group, Pa (see Alg. 2, line 9); and a parameter we call takingFrom (see Alg. 2, line 30). Table 10 shows the average of 10 runs for each setting.
Regarding the numChildren, it can be observed that as the number of children increases, the performance of the algorithm improves. However, this comes at the expense of increased time.
Considering the percentage of agents in the bottom group, we see that as the percentage increases, the performance improves. This shows that there is value in keeping more agents in the bottom group. In the takingFrom rows, we can see that taking 25% was the best, compared to taking 75% or 50%.

VII. CHALLENGES OF ESN
It is worth noting that for real big-data prediction tasks, the ESN may suffer from instability and high computational cost.

A. INSTABILITY
Since the data is very large, a huge reservoir (a reservoir with many units) is required. The instability often occurs due to multicollinearity between the reservoir state values [44]. Multicollinearity is a situation where there is an almost linear relationship between the independent variables (here, the reservoir states). This means that some of the reservoir units are redundant. This problem often causes the matrix S to be ill-conditioned, making the output weights too sensitive to changes in S and Y. This leads to inaccurate estimates of the output weights and thus limits the generalization ability of the network.
To remedy this problem, we use regularization techniques such as Tikhonov regularization (ridge regression) [44], [45] instead of the ordinary least squares method. Ridge regression adds a small bias that breaks the correlations between the state values (see equation (6)) and thus decreases the variance of the estimated output weights. Another technique used to solve the collinearity problem is Principal Component Analysis (PCA) [16].

B. HIGH COMPUTATIONAL COST
Again, this is caused by the considerable size of the reservoir, which makes the linear regression take a long time, because it often involves the multiplication of large matrices and finding their inverse, their Singular Value Decomposition (SVD), or a QR decomposition. The following methods may be employed to reduce the computational cost.
One proposed solution is the use of Principal Component Regression (PCR) [46]. PCR runs the regression only on selected principal components of the design matrix S. Another interesting ridge regression method is the divide-and-conquer kernel ridge regression of [47], [48]. Here, the design matrix is partitioned into batches, and kernel ridge regression is run on each subset; the final global predictor is obtained from the mean of the local regressions. Thirdly, if the above solutions prove insufficient, the ridge regression may be formulated as an optimization problem and solved with stochastic gradient descent (SGD) [28].
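As an illustration of the third option, the ridge objective can be minimized with plain SGD over the columns of the design matrix; the toy data, learning rate, and epoch count below are assumptions.

% Ridge regression via SGD: minimizes ||W*S - Y||^2 + reg*||W||^2 one column at a time
S = rand(50, 1000);  Y = rand(1, 1000);        % placeholder states and targets
W = zeros(size(Y, 1), size(S, 1));             % output weights to be learned
lr = 1e-3;  epochs = 20;  reg = 1e-6;          % assumed hyper-parameters
for e = 1:epochs
    for t = randperm(size(S, 2))               % shuffle the samples each epoch
        err = W*S(:, t) - Y(:, t);             % prediction error on one sample
        W   = W - lr*(err*S(:, t)' + reg*W);   % stochastic gradient step on the ridge loss
    end
end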

VIII. CONCLUSION
We presented an improved Grasshopper Optimization Algorithm (GOA) based optimized echo state network (ESN) for predicting the remaining useful life (RUL) of turbofan engines. To improve the original GOA, we developed a new solution representation that enabled a simplified attraction and repulsion of grasshopper agents. Additionally, we compared the proposed technique with methods such as the Cuckoo Search (CS), Particle Swarm Optimization (PSO), Binary PSO (BPSO), Differential Evolution (DE), the original GOA, the classical ESN, the deep ESN, and the LSTM, with encouraging results. Since this improved algorithm is entirely new, it would be interesting to investigate how it performs on other optimization problems as future work.

CODE AVAILABILITY
All programming codes and data generated in the experiments are publicly available at the GitHub repository: https://github.com/bala-221/Airplane-fault-prediction