Long Short Term Memory Water Quality Predictive Model Discrepancy Mitigation Through Genetic Algorithm Optimisation and Ensemble Modeling

A predictive long short-term memory (LSTM) model developed on a particular water quality dataset will only apply to the dataset and may fail to make an accurate prediction on another dataset. This paper focuses on improving LSTM model tolerance by mitigating discrepancies in model prediction capability that arises when a model is applied to different datasets. Two predictive LSTM models are developed from the water quality datasets, Baffle and Burnett, and are optimised using the metaheuristic genetic algorithm (GA) to create hybrid GA-optimised LSTM models that are subsequently combined with a linear weight-based technique to develop a tolerant predictive ensemble model. The models successfully predict river water quality in terms of dissolved oxygen concentration. After GA-optimisation, the RMSE values of the Baffle and Burnett models decrease by 42.42% and 10.71%, respectively. Furthermore, two ensemble models are developed from the GA-hybrid models, namely the average ensemble and the optimal weighted ensemble. The GA-Baffle RMSE values decrease by 5.05% for the average ensemble and 6.06% for the weighted ensemble, and the GA-Burnett RMSE values decrease by 7.84% and 8.82%, respectively. When tested on unseen and unrelated datasets, the models make accurate predictions, indicating the applicability of the models in domains outside the water sector. The consistent and similar performance of the models on any dataset illustrates the successful mitigation of discrepancies in the predictive capacity of individual LSTM models by the proposed ensemble scheme. The observed model performance highlights the datasets on which the models could potentially make accurate predictions.


I. INTRODUCTION
Rivers are valuable inland water resources utilised for human consumption, agricultural needs, industrial and recreational purposes. Increased urbanisation, poor water infrastructure, and climate change have increased pressure on rivers, necessitating efficient water management. To effectively manage The associate editor coordinating the review of this manuscript and approving it for publication was Artur Antonyan .
rivers, the quality of the water must be continuously monitored [1].
Water quality is commonly evaluated through expensive and time-consuming laboratory analyses. This process includes, but is not limited to, water sample collection from the relevant river, the correct storage and transportation of samples to the laboratory, chemical laboratory tests and analysis, after which the quality of the water can be evaluated. There is more than enough room for error and inefficiency in this layered process [2]. The ability to predict water quality beforehand can greatly increase the efficiency of water management [3].
This study proposes the optimized Long Short-Term Memory (LSTM) model, which is an advanced recurrent neural network (RNN), for water quality prediction. The LSTM is the most appropriate network for sequential data in which temporal dependency is an implicit feature and when the retention of information of the earlier stages of the sequential data is necessary for forecasting future trends [4]. Such is the case with time-sequential water data used for water quality prediction.
Discrepancies in LSTM predictive model capability can arise when the model, developed using a particular water quality dataset, is applied to different water quality datasets for prediction purposes. The model will probably not make an accurate prediction on other water quality datasets. The LSTM models tend to be case study-specific.
This research aims to improve the tolerance of LSTM prediction models by mitigating these discrepancies through optimising the LSTM network using the metaheuristic genetic algorithm (GA). Two different GA-optimised LSTM models, used as base models, will be fused using a linear weight-based approach to create a final tolerant LSTM ensemble model. This research produced three main contributions. The first was adaptation and optimisation through the successful adaptation of the LSTM network for water quality prediction for two temporal-based water quality datasets taken from different rivers and time periods. Both models were subsequently optimized by GA, to improve efficiency and robustness, resulting in two-hybrid GA-optimised LSTM prediction models. The second contribution was the combination of the two-hybrid GA-LSTM prediction models, using a weightbased technique to develop a single more tolerant ensemble model. Generalization, the third contribution, explored the possible use of the final GA-optimised LSTM-based ensemble prediction model in areas other than water quality. The purpose was to assert the tolerance and thus the relevance of the final ensemble model in the wider field of LSTM and ensemble prediction models.
The paper progresses as follows: Sec. I contains the introduction, Sec. II details the LSTM and Sec. III the GA, Sec. IV describes the weight-based combination technique, while Sec. V discusses the water bodies (rivers) and relevant water quality parameters used. After which, Sec. VI describes the development of the robust and tolerant water quality prediction LSTM based ensemble scheme, while Sec. VII focuses on the results and analysis, and Sec. VIII concludes the paper with further recommendations.

II. LONG-SHORT TERM MEMORY (LSTM) NETWORK
Most artificial neural networks (ANNs) are feed-forward neural networks [5], incapable of capturing sequences and accounting for the temporal nature of data and thus cannot model memory [10]. The RNN is a Deep Neural Network (DNN) that has a looping mechanism, allowing FIGURE 1. Internal structure of LSTM recurrent network cell adapted from [16].
for the information retention of the previously processed elements in the sequence, thus enabling the modeling of memory [8], [11].
RNNs are incapable of learning long time dependencies and retaining information of the beginning of the sequence, skewing results [10], [12]. RNNs are trained using backpropagation, where the gradients of the previous layer are used to calculate the gradients of the current layer [13]. Small weight adjustments will result in smaller weight adjustments with each subsequent layer [7], [8] until the internal weights are barely adjusted, and earlier network layers cannot learn anything [9]. This shrinking of gradients [13] is referred to as the vanishing gradient [9]. The LSTM is widely used to mitigate the complications created by the vanishing gradient [8] and was first suggested by Hochreiter and Schmidhuber in 1997 [14]. The opposite, the phenomena of gradients increasing exponentially in size [6], [8], referred to as the exploding gradient can also occur [13]. Gradient clipping, which entails shrinking the gradient when norms exceed a particular threshold, is used to mitigate this [6], [8].
The LSTM can scale to longer sequences with its unique architecture [10], which is illustrated in Figure 1 [16]. Each LSTM cell has a cell state, the network memory [6], [15] which is regulated by the forget, input, and output LSTM gates that control which information is added or removed from the cell state, ensuring that only relevant information is used to make predictions [15].
The forget gate (f t ) controls information removal and retention. The input of the LSTM cell at the current time step (x t ) and the output from the previous hidden state (h t−1 ) are both individually multiplied by the weight of the forget gate (W f ) and summed together, the result of which is added to a bias vector (b f ) then passed through the logistic sigmoid activation (σ ) [15] as shown in (1). Notably, each gate has a different set of weights. The equations below are expressed in accordance to Figure 1 [16] and were adapted from [4]: When the previous hidden state and current input are multiplied by the weight of the input gate (W i ), added to the bias vector (b i ) and passed through the logistic sigmoid activation, VOLUME 10, 2022 it determines which values are updated. And hence the input gate (i t ) can update the cell state [15] in (2): An intermediate cell state (c t ) is calculated, where the output of the previous hidden state and the current input is multiplied by the weight of the intermediate cell state (W c ), the product of which is added to a bias vector (b c ) and passed through a tanh function to regulate the network [6], [15] as expressed in (3):c The Hadamard element-wise product (•) of the output of the forget gate and the memory of the previous state (c t−1 ) is then used to calculate the current cell state (c t ) when it is added (through point-wise addition) to the product of the output of the intermediate cell state and the output of the input gate. Thus the sigmoid output determines what should be retained from the tanh output [15] as expressed in (4): Using the current cell state, the output gate (o t ) determines what the next hidden state should be when the previous hidden state and the current input are multiplied by the weight of the output gate (W o ), then added to a bias vector (b o ) and finally passed through a logistic sigmoid activation [15] as shown in (5): The hidden state (h t ) is then calculated by passing the current cell state through the tanh activation and multiplying it by the output gate in (6), determining the information carried by the hidden state to the next time step [15].
The output is dependent on the input at the current time step and the previous hidden state and is modulated between one and zero by the sigmoid functions. The gate will block a signal if the output is zero. The model learns the weights and biases of each gate through the minimisation of the difference between the LSTM outputs and the training samples [17]. LSTMs have several parameters, such as the number of layers, the number of units in the hidden layer, time window size, batch size, etc., referred to as hyperparameters [18], which influence network behaviour [4] and thus should be optimised before the training process [18].
Hyperparameter optimisation can be manual or automatic [4]. The manual adjustment of each hyperparameter and the interpretation of the effect of the adjustment on the network are dependent on the knowledge and experience of the researcher. Automatic optimisation ranges from the exhaustive grid search, which explores all the possible hyperparameter combinations, and the random search algorithms, which converge slowly until the global optima are found, to the more complex model-based algorithms [20]. Hyperparameters influence model underfitting or overfitting, affecting the final accuracy of the network [4]. The temporal nature of the chosen data dictates the need for the optimal number of LSTM units in hidden layers and time window size, which should contain an appropriate dataset background enabling the LSTM to learn enough from past information [19]. A large time window will lead to model overfitting, while a small window will neglect important information [10].

III. GENETIC ALGORITHM
Innately stochastic metaheuristic algorithms such as GA mimic observed natural behaviour to solve complex optimization problems within limited time and computational capacity. GA is known to lessen search complexity and to find optimal (or close to optimal) values for tunable LSTM hyperparameters [4]. Thus this paper suggests a hybrid solution integrating GA with the LSTM network [19] to find the optimal hyperparameters. GA-optimisation of the LSTM network is initialised by a generated population of random individuals [4], where each individual represents a potential solution-hyperparameter values that are expressed as binary strings [19]. Individuals are arbitrarily selected from the search space and are evaluated by the fitness function defined before the optimisation process. Individuals with higher fitness functions are considered to be near-optimal solutions and are preserved for the next generation, while the rest are disregarded [10], replicating the natural evolution where stronger individuals are more likely to reproduce [4]. Figure 2 [4] illustrates the basic GA process with six distinct stages: initialization, fitness calculation, termination condition check, selection, crossover, and mutation [10].
Thus with every generation, the worst performing individuals are removed, and new individuals are added to the population to add genetic diversity to the evolutionary process through crossover, mutation, and selection genetic operators [4]. The crossover operator creates new individuals by replacing some of the portions of the two-parent individuals [4] by only utilising information in the search space, without generating new information [10]. The mutation operator generates new information by mutating portions of individual strings [10] to create unique individuals [4]. An inappropriate definition of the solution representation could lead to incorrect choice of mutation and crossover operators [4]. The selection operator effectively exploits the information accumulated through the GA search by selecting the stronger individuals as their offspring will have higher fitness (fitness function) and thus survive the next generation [4]. The generated individual goes through the process of selection, crossover, and mutation while calculating its fitness to the model and verifying the termination criteria. Once the termination criteria are satisfied, the process stops [10].
Iterative modifications and evaluation of the population allow GA to find optimal or near-optimal hyperparameters in a search space with many peaks and valleys while traditional gradient-based methods get stuck at the local optima. Also, diversity injected into the population enables the execution of a global search by allowing the exploration of new areas in the search space. Thus GA is the appropriate choice for combinatorial optimization where a thorough search of all the possible solutions is required and would demand enormous computational power [4].

IV. ENSEMBLE MODELS THROUGH WEIGHT-BASED FUSION
The inherent human approach to problem-solving that involves making informed decisions based on several inputs [21] forms the basis of ensemble learning [22]. Ensemble models are developed by combining a finite number of different neural networks to improve the overall prediction. Similarly, a human would combine the knowledge of several differently sourced opinions to make a final decision. Individual networks are independently trained, and their predictions are combined using a mathematical rule [23] to form a final single ensemble model prediction. Ensemble forecasting can alleviate challenges in time series forecasting where data is volatile, dynamic, and non-stationary such as in the geology, energy, water quality, and finance sectors, enhancing the forecasting accuracy instead of a single model forecast [24].
Weight-based ensemble approaches use weighting schemes for the ranking of features or models [25], such as weighted linear combination techniques. The weights allocated to each model can be equal, such as in the simple average ensemble model. They can also be determined through a mathematical rule, as with the weighted ensemble model. The most appropriate way to combine individual LSTM model forecasts is unknown and undecided as the research into LSTM applicability in ensemble forecasting is scarce [24]. This study chose to evaluate the combined forecast of the LSTM models through a linear function of the individual model forecasts [26]. These linear combinations can be calculated as follows from [26]: Let Y be the actual time series that will be forecasted using n different models, where LetŶ (i) be the forecast obtained from the i'th model (i = 1, 2, . . . , n) where [26]: A linear combination of these n forecasted series of the original time series can be shown as follows [26]: which is produced by: where f is a linear function of the individual forecasts and which results in [26]: where w i is the weight assigned to the i'th forecast. The added weights amount to unity [26]. All the models are assigned equal weights in the simple average approach [26]: For the weighted ensemble model, greater weight is assigned to the more skilled model, indicating greater trust in the model [23]. In contrast, all the models are allocated the same weight regardless of skill in the average ensemble, rendering it incapable of dealing with extreme values, such as outliers and skewed distributions [24]. These weights are small positive values. The sum of all the weight coefficients in the ensemble must be equal to one [23]. The more skilled model is emphasised in the weighted ensemble. Thus it is expected to perform better than the average ensemble [23]. Effective ensemble forecasting requires a considerable amount of diversity between individual LSTM models [24]. This study accounts for diversity in LSTM models through different time window sizes and as a result, a different number of LSTM units, which is advantageous for the modeling of highly non-linear statistical dependencies. Consequently, an ensemble model with LSTM based models of different time window sizes will be capable of handling the nonstationary and dynamic nature of real-world time series [24]. Tuning the hyperparameters of each LSTM based model in the ensemble will increase their quality, thus enhancing the overall quality of the ensemble [27] provided that a sufficient amount of diversity exists between each model. This study highlighted the weight-based approach for combining the two-hybrid GA-LSTM models.

V. WATER BODIES AND WATER QUALITY PARAMETERS
For the development of the models, data was taken from two water bodies (rivers), the Burnett river and the Baffle river.
The Burnett River is in southeast Queensland, Australia, named after J.C. Burnett, the first explorer of the river in 1847 [28]. The river rises on the western slope of the Burnett Range, east of the Eastern Highlands [28] at VOLUME 10, 2022 Mount Gaete [29] and flows a 435 km course into the Coral Sea of the South Pacific Ocean at Burnett Heads [29]. The Auburn, Boyne, and Barambah rivers are tributaries to the river [28]. Small crops and sugar cane are grown in the areas around the river, which are part of the South East Queensland and Brigalow Belt bio-regions [30]. The Baffle Creek, also known as the Baffle river, was named by politician and pastoralist William Henry Walsh in the 1850s. He was unable to track down raiders through the dense bush along the banks of the creek, leaving him baffled, hence the name [31]. The river is in southeast Queensland, Australia, and flows a 124 km course from Arthurs Seat down into the Coral Sea of the South Pacific Ocean [32]. The tributaries into the river from the right are the Scrubby, Granite, Grevillea, and Three Mile creeks and the Euleilah and Island creeks from the left[33]. The Burnett and Baffle datasets included water quality parameters such as water temperature ( • C), pH, electrical conductivity (mS/cm), dissolved oxygen (mg/L), and turbidity (NTU).
The dissolved oxygen (DO) concentration is the most relevant water quality parameter as it reflects the equilibrium between the oxygen-producing and consuming processes in a river [34]; it is the amount of free non-bonded, non-compound oxygen present in water [35]. DO is an obvious criterion of river health, as DO is directly or indirectly influenced by other water quality parameters, such as temperature, salinity, oxygen depletion, pH, turbidity, etc. [34]. River-dwelling organisms, such as fish, invertebrates, plants, and bacteria, use DO in water just as land organisms use atmospheric oxygen. Extreme DO levels can adversely impact water quality and harm aquatic life [35].DO levels range between 6 and 14 mg/L [36], with healthy rivers at 6.5 to 8 mg/L [37]. The monitoring and prediction of DO levels ahead of time with predictive models will aid in optimising water quality control measures [34] and is thus of paramount importance [35].
Water temperature is a physical property and a measure of the average thermal energy of the water [38]. The river temperature is dependent on four factors: the type and depth of the river, the environment surrounding the river, and the season of temperature recording [38]. There are no typical water temperature ranges that apply to all rivers. However, rivers have annual temperature patterns. Temperatures that deviate from the pattern should be viewed in context. Rivers and streams exhibit faster and greater temperature fluctuations than lakes and oceans. Observed seasonal temperatures across American rivers on average can be as low as between 1 to 4.5 • C and as high as between 30 to 35 • C [38]. Water temperature influences the physical and chemical properties of water. An increase in temperature will decrease the solubility of gases (such as oxygen) in water. Thus warmer waters hold less dissolved oxygen [35]. Other water quality parameters such as electrical conductivity, salinity, compound toxicity, water density, and pH are also affected [38].
The pH is a measure of the activity of the hydrogen ion (H+) in water. pH ranges from 0 to 14 and is reported as the reciprocal of the logarithm of the hydrogen ion activity. A river with a pH of 7 has 10 −7 moles per liter of hydrogen ions [39]. River pH ranges from 6.5 to 8.5 and is slightly lower for groundwater pH at 6 to 8.5 [39].
Conductivity measures the ability of a river to pass an electrical current, which increases with the presence of inorganic dissolved solids with either a negative or positive charge in the water. Organic compounds have low conductivity in water [40]. The conductivity of rivers ranges from 50 to 1500 µmhos/cm (0.05 to 1.5 mS/cm). Some inland freshwaters have observed ranges from 150 to 500 µmhos/cm (0.15 to 0.5 mS/cm) and heavily polluted industrial waters at 10,000 µmhos/cm (10 mS/cm) [40].
Turbidity, an optical characteristic of water, is a measure of the relative clarity of river water [41]. Turbidity measures the light that is scattered by suspended particles when light is shone through the water sample [42]. Solid particles suspended in the water include sediment (clay and silt), a variety of microscopic organisms, fine inorganic and organic matter, algae, and plankton [41].
High turbidity lessens the light penetration of the water, altering the ecological productivity, habitat quality, and aesthetic value of the river, causing harm to fish and aquatic life. Pollutants, such as bacteria and metals, also attach themselves to suspended particles enabling further pollution [42]. At high turbidity levels, suspended solid particles absorb heat, causing an increase in water temperature, thereby causing a decrease in DO concentration. Suspended particles also reduce the sunlight penetrating the river, inhibiting the photosynthesis of river plants, hindering DO production [43]. Turbidity levels can range from 1 to 1000 NTUs. Lower values indicate low turbidity and healthier rivers [43]. On average, river turbidity levels usually range from 10 to 25 NTUs [41].
The observed typical values for each water quality parameter has been summarised in Table 1.

VI. A ROBUST AND TOLERANT WATER QUALITY PREDICTION LSTM BASED ENSEMBLE SCHEME
The methodology used to develop a robust and tolerant water quality prediction GA-optimised LSTM based ensemble scheme is described below through a sequence of steps and is illustrated in Figure 3.

A. WATER QUALITY DATASETS AND DATA PREPARATION
Two water quality datasets were used to develop two LSTM models. The historical water quality data was publicly available from the Ambient Estuarine Water Quality Monitoring Programme on the Queensland Government open data The data was pre-processed and cleaned in accordance with the following steps, prior to model development (yellow block in Figure 3): 1) Only highly problematic inconsistent observations, such as a negative pH were removed, etc., were strategically removed using Table 1 as a reference as certain incongruous values could represent a reality that should be recorded. 2) Statistical analysis through evaluation of the minimum, maximum, mean, standard deviation and the first, second, and third quartile of each parameter was used to remove outliers along with the interquartile range (IQR), found using the upper boundary 3) Duplicate observations were removed to prevent repetition from distorting results. 4) Observations at irregular time intervals were removed, while missing values at regular intervals were calculated through interpolation. The correlation between the parameters was evaluated using Spearman's Correlation to identify which parameters could be inputs for the predictive models. Correlation measures the association between two parameters and is expressed as a value between −1 and 1. When there is a positive correlation (closer to one), one parameter increases as the other parameter increases. When there is a negative correlation (closer to negative one), one parameter increases as the other parameter decreases. When the correlation is neutral (close to zero), there is no relationship between the parameters. A strong correlation indicates a strong relationship between two parameters, implying they can be used to build a model [6], [46]. The Spearman's coefficients of each parameter in relation to DO (target feature) are shown below in Table 2.
DO shares the strongest negative correlation with temperature for both datasets as shown in Table 2, thus as water temperature increases, DO levels will decrease. The correlation between DO and electrical conductivity and turbidity is insignificant for both datasets. In contrast, pH behaves differently in each dataset and has a moderate positive correlation with DO in the Burnett dataset but a weak negative correlation in the Baffle dataset. Thus highlighting the difference between causation and correlation, which are not equitable and are often confused when working with timeseries data [47]. A causal relationship exists between two parameters when three requirements are met: an association VOLUME 10, 2022 between the parameters, appropriate time order, and the elimination of other parameters [48], under experimental conditions. A causal stimulus can also be tested. If manipulation of a parameter causes a sufficient change in the other parameter, a causal relationship can be established [49].
Any assumed causation between pH and DO was disregarded due to the different correlation values in datasets and no documented causal relationship between the parameters at the time of this study. Thus DO and the water temperature were chosen as input parameters for the models. The other parameters were removed, and new Burnett and Baffle datasets only containing the input parameters were created.

B. DEVELOPMENT OF A MULTIVARIATE MULTI-STEP STACKED LSTM MODEL
Two water quality predictive LSTM models (brown circle in Figure 3) were developed and evaluated using the free open source publicly available Keras Python library and were defined by efficient numerical libraries using Tensorflow backend, in Google Colab (Google Colaboratory), a hosted Jupyter notebook service which executes python code through the browser [50].
The Burnett model was developed as a multivariate multistep stacked LSTM model. Models are defined as a sequence of layers in Keras. DO and water temperature, in the input layer, are used to predict the target parameter, DO. The model has two hidden layers. The first hidden layer has more LSTM units than the second hidden layer. The dense output layer connects the whole model and outputs a multi-step prediction of DO values 24-time steps ahead. Root Mean Square Propagation (RMSprop) is the chosen optimiser, and Rectified Linear Unit (ReLU) is the activation function. The Baffle model was developed with a similar architecture, using the same two input features, two hidden layers (the first layer has more LSTM units than the second layer) and a dense output layer predicting DO values 24-time steps ahead, with RMSprop and ReLU as the optimiser and activation function respectively. The Baffle model has more LSTM units than the Burnett model due to the structural nature of the different datasets.
The Baffle and Burnett datasets both have a total of 52 560 observations. The Burnett dataset has the observations spread across three years, while the Baffle dataset has the observations spread across a single year. Hence it is easier for the LSTM network to pick up variations in the Burnett data and learn trends over three years. Therefore an LSTM model with an overall smaller architecture was used for the Burnett model. The Baffle data is denser, with many close clusters of similar points with little variation between them, increasing the difficulty of learning a trend to make a prediction. Thus a larger LSTM was used to accommodate the densely spaced Baffle dataset.
ReLU is linear for positive inputs, retaining linearity properties when training neural networks with backpropagation and behaves like a nonlinear function when the input is negative and outputs a value of zero [6], [15]. Thus still affording hidden layers the opportunity to be activated, while sigmoid and tanh functions can only approximate values close to zero but not zero [15]. Other advantages of the ReLU include representational sparsity, linear behaviour, and computational simplicity and implementation at a lower computational cost as it excludes the computation of an exponential function, unlike the sigmoid and tanh functions [6], [51].
The Root Mean Square Propagation (RMSprop) optimiser is an adaptive learning rate method developed to address the problem of drastically diminishing learning rates observed when using the Adagrad optimiser [52].
The pre-processed data was split into the training, validation, and testing datasets in a 50%, 20%, 30% ratio, respectively. The training dataset was the largest to ensure there were enough samples for the LSTM network to learn from. Water quality parameters with different ranges and scales were normalised before the data was fed to the models. Normalisation reduced model training difficulty and ensured the model was independent of input unit choice. The mean (x) and standard deviation (σ x ) of the original data (x) were used to calculate the normalised data (x ) in both the training and validation datasets, in accordance to (14) [27] x = x The test dataset was not normalised, allowing for the generality of the model to be assessed, thus enabling the development of a better predictive algorithm. If the test data is normalised with the entire dataset, the testing process will validate the model and not assess model generality [27]. Model predictions were scaled back to the original scale for each parameter by reversing the normalisation calculation in (14) before the evaluation of performance metrics.

C. TRIAL-AND-ERROR OPTIMISATION OF THE LSTM MODEL
The LSTM models were initially optimised in terms of time window size and the number of LSTM units in the two hidden layers using trial-and-error (orange triangles in Figure 3).
Three arbitrary values were chosen for the time window size, the number of LSTM units in the first hidden layer, and the number of LSTM units in the second hidden layer, and as previously stated, there are more units in the first hidden layer than the second layer.
The LSTM model was trained and a prediction was made by the model. The RMSE value was calculated to evaluate the accuracy of the model prediction. The smaller the RMSE value, the more accurately the model can predict DO from the historical data. In the second run, the three values were arbitrarily lowered. The process was repeated, and the RMSE was recorded. In round three, the values were arbitrarily increased. The process was repeated, and the RMSE was recorded. Thereafter, arbitrary combinations of the number of LSTM units in the two hidden layers were chosen, while the time window size remained constant. The process was repeated, and the RMSE was recorded. The time window size was then changed while the LSTM units remained constant, and the same procedure was followed.
The behaviour of the model was observed with each change and influenced the choice of the values chosen for the subsequent run. If lower RMSE values were achieved at small time window sizes, then consistently lower values were chosen for the window size, until the RMSE values started increasing again at the smaller window size values. This strategy was also used to find the number of LSTM units and the combination of the time window size and LSTM units that would produce the lowest RMSE value.
The hyperparameter values for the Burnett model found through trial and error were 32 LSTM units in the first hidden layer, 16 LSTM units in the second layer, and a time window size of 100-time steps. The Baffle model values were 64 LSTM units in the first hidden layer and 32 LSTM units in the second hidden layer, with a time window size of 150-time steps. As expected, the Baffle LSTM model network architecture is larger than the Burnett model, with a bigger window size and almost double the amount of LSTM units.

D. HYBRID GENETIC ALGORITHM OPTIMISED LSTM BURNETT AND BAFFLE MODELS
The Burnett and Baffle LSTM models were optimised with GA utilising the ''Distributed Evolutionary Algorithms in Python'' package, referred to as DEAP [53], using Keras and Tensorflow. The models were optimised separately. The hyperparameter values found through the trial-and-error process were used as the initial values for the GA-optimisation process shown in Figure 4 [10]. The GA-optimisation process was used to find the optimal time window size and the optimal number of units in the first and second hidden layers for each model (light green hexagon in Figure 3). Genetic parameters were specified using the DEAP package, such as a population size of 70, mutation rate of 0.15, crossover rate of 0.7, and the number of generations at 10. These values were chosen for similar studies [10]. Ordered crossover, shuffle mutation, and roulette wheel selection, were also chosen and specified using DEAP as they produced the best possible results from the available options [53].
During the optimisation process, the search space was explored by the genetic operators, and the population became composed of possible solutions (optimal hyperparameter values), in the form of chromosomes encoded by binary bits, which represent the number of the LSTM units and the time window size [10]. The binary solution was of length 14. The first six digits were for the time window size, the subsequent four digits were for the number of LSTM units in the first hidden layer, and the last four digits were for the number of LSTM units in the second hidden layer. The selection and recombination operators search for the best solution within the solution-composed population. Each solution is evaluated with the predefined fitness function, and the solution with the best performance is chosen for reproduction [10].
Defining the fitness function before the optimisation process is crucial. The RMSE value was used to evaluate the fitness of each solution [10] and had to be minimised for optimisation to occur. DEAP was used to define the Fitness Maximum as −1.0 for minimisation [53]. The chromosome that provided the combination of hyperparameter values that resulted in the lowest RMSE value was considered the optimal or near-optimal solution. The termination criteria were satisfied by the optimal solution. The optimal solution was implemented by the LSTM model, to create the GA-optimised LSTM version of the model, which could now be used to make a prediction. If the termination criteria were not satisfied, the cycle of selection, crossover, and mutation would continue until the optimal solution was found [10]. The pseudo-code of the GA-Burnett and GA-Baffle models is shown in Algorithm 1.

Algorithm 1 GA-Optimised LSTM Model
1: Split the data into training (30%), validation (50%) and test (20%) data. 2: The LSTM is evaluated on the validation data. Probability of 0.7 for crossover of chromosomes; 8: Evaluation of the freshly generated chromosome through use of the fitness function; 9: return RMSE of 1000 Stopping condition as minimisation of RMSE is the aim 10: end if 11: Choose the best individual chromosome which represents the optimal time window, the optimal number of units in the first hidden layer and the optimal number of units in the second layer. 12: Apply the optimal window size and optimal number of units in the two hidden layers in the LSTM and make a prediction on the unseen test data.
The optimal time window size for the hybrid GA-optimised Burnett and Baffle LSTM models were 57 and 63-time steps, respectively. The optimal LSTM units in the first hidden layer for the GA-Burnett and GA-Baffle models were 10 and 12 LSTM units, respectively, and 8 and 10 LSTM units in the second layer for the GA-Burnett and GA-Baffle models, respectively. Again, the Baffle model has an overall larger architecture than the Burnett model. These values are summarised in Table 3.

E. ENSEMBLE LSTM-BASED MODEL THROUGH A WEIGHT-BASED TECHNIQUE
The linear weight-based technique (dark green rectangles in Figure 3) combined the predictions made by the models by allocating a weight to the prediction made by each model, which was indicative of the contribution the model made to the new ensemble model. The ensemble model was created using dataset C, a Baffle river (different to the Baffle dataset 2019 used for the Baffle model) water quality dataset beginning from January 2015 and ending in December 2015, 48 observations per day and a total of 17 520 observations for the year. The process of the development of the weighted and average ensemble models from the two GA-optimised LSTM base models, GA-Burnett model, and GA-Baffle model is detailed below and was inspired by [54]. An illustration of the process is shown in Figure 5.
The Baffle 2015 dataset was divided into the training, validation, and testing datasets at 30%, 50%, and 20%, respectively. The holdout validation dataset, which was unseen by both models during the training process, was larger than the training dataset as it was used to estimate the optimal weight contribution of each model in the ensemble model. The validation dataset had to be large and representative to prevent the model from over-fitting.
GA-Burnett and GA-Baffle models defined according to their optimal hyperparameters were trained on the training dataset and made predictions using the validation dataset as the test dataset. The predictions were evaluated through comparison to the actual DO values by calculating the RMSE value. A lower RMSE value indicated greater model performance.
The optimal weight coefficients for each model prediction were found through an exhaustive grid search, comprised of weight coefficients starting from 0.0, increasing increments of 0.1, and ending in 1.0. The sum of the two possible weight coefficients must be equal to one. The weight coefficient values were multiplied by the GA-Burnett and GA-Baffle model predictions until the optimal weight combination was found, through a function that minimises the RMSE value. The validation dataset, which was unseen by the individual models, was used to simultaneously perform the grid search to find the optimal weight combination, while each model made a prediction on the validation dataset, with RMSE minimisation as the final goal. The performance of the models was compared to each other and the ensemble models, using the RMSE value. The optimal weight combination represented the extent of the contribution of each model to the final weighted ensemble model. The weighted ensemble was evaluated on the test dataset.
In an average ensemble model, each model contributes equally to the ensemble prediction. Thus the continuousvalued output is the average of the individual member predictions. As there are only two members in the ensemble, the weight coefficient of each model is 0.5. The more skilled model has a larger weight coefficient than the less skilled model in a weighted ensemble model. The average ensemble will be compared to the weighted ensemble. A well-configured weighted ensemble model is expected to outperform an average ensemble model. The pseudo-code for the ensemble model is shown in Algorithm 2.

F. PERFORMANCE METRICS
The ensemble models were assessed by computing common performance metrics for continuous output models, such as the Mean Squared Error (MSE), Mean Absolute Error (MAE), Mean Absolute Error Percentage (MAPE) and the a, b = GA-Burnett, GA-Baffle 9: Define a function to evaluate the ensemble prediction by calculating the RMSE score with the ensemble prediction and the true values of the validation data; 10: while w a + w b = 1 and w a , w b > 0 do 11: minimise RMSE score for the weighted ensemble 12: end while 13: The best weight for each model will form the weight combination that gives the lowest RMSE on the validation data 14: The best weight combination for the weighted ensemble can be used to make a prediction on the unseen test data 15: For the average ensemble, w a and w b are equal i.e. 0.5 16: The average ensemble is also used to make a prediction on the unseen test data Root Mean Squared Error (RMSE). The MSE (15) [10] is a measure of average squared difference between the predicted values and the actual values [55]. The smaller the MSE, the more accurate the model prediction: where n is the number of samples, y i is the desired output and y i is the predicted output value of the observation made by the model i th . The MAE (16), MAPE (17) and RMSE (18) were evaluated in accordance to the following equations [10]: The MAE defined in (16) is the average of the absolute difference between the predicted values and the actual values [56]. The smaller the MAE value, the closer the predicted values are to the actual values.
The MAPE defined in (17) is the average of the absolute percentage of the difference between the estimated values and the actual values, providing a measure of the error in a comprehensible percentage. The smaller the MAPE, the better the forecast [57].
The RMSE value defined in (18) is the quadratic average of the differences between the predicted and actual values and measures the accuracy of a model on a dataset and not between datasets as it is scale dependent [58]. Outliers disproportionately affect the RMSE value [59]. A good prediction has a low RMSE. RMSE value of zero indicates a perfect fit between model and data [58]. The R 2 score defined by (19) [56] is the coefficient of determination [60] and indicates how well the model fits the data, thus measuring how well unseen data is likely to be predicted by the model. Here y is VOLUME 10, 2022 the average of the actual values [56]: .
The Median Absolute Error, defined by MedAE in (20) [56] is a measure of the error and finds the median of all the absolute differences between the predicted and actual values [56]: The maximum error defined by MaxError in (21) [56], calculates the maximum residual error by capturing the worst error between the predicted value and the desired outcome [56]: The explained variance defined by the EV score in (22) [56], indicates how well the model accounts for the variation of a dataset. The closer the score is to one, the better the performance of the model. Var is variance, the square of the standard deviation [56]:

G. ASSESSMENT OF MODELS ON UNSEEN DATA
The four models: the GA-Burnett, GA-Baffle, average ensemble, and weighted ensemble, were tested on three multivariate and one univariate publicly available online datasets. Wind Power Forecasting Dataset [61] includes seven key columns labeled ''wp1'' to ''wp7'' for the wind power measurements of seven different wind farms. As the data were normalised by the data provider, the original values, units, and scale are unknown. The data is publicly available and visible on the Kaggle: Machine Learning and Data Science Community as part of their Global Energy Forecasting Competition 2012 on Wind Forecasting. The Air Temperature Dataset is referred to as the Jena Climate Dataset on Kaggle [62] and includes various meteorological parameters along with air temperature, such as air pressure, air density, etc., which are used to predict the air temperature. The Pollution Dataset, referred to as the Beijing PM2.5 Data dataset [63] on Kaggle, includes parameters such as temperature, pollution level, pressure, dew point, wind speed, and direction, etc., which is used to predict the future pollution level in terms of PM2.5 (atmospheric particulate matter (PM) that has a diameter less than 2.5 micrometers) [64]. The daily minimum temperature in Melbourne is a univariate dataset [65] containing the single temperature parameter used to predict the future minimum temperature. It was viewed on Kaggle while the original dataset was hosted by the Data Market Qlik Sense Data Sources.
Comparison is one of the best performance measures. Thus the predictive capability of four classical time series forecasting methods, the Autoregression (AR), Moving Average (MA), Autoregressive Moving Average (ARMA), and Autoregressive Integrated Moving Average (ARIMA) was compared to that of the four models on the univariate dataset.   A summary of all the datasets used in this study is presented in Table 4. As the wind power dataset was normalized externally and the original values were unknown, the mean and standard deviation could not be calculated.

VII. RESULTS AND ANALYSIS A. TRIAL-AND-ERROR OPTIMISED BURNETT AND BAFFLE LSTM MODELS PREDICTIVE CAPABILITY
The Burnett model in Figure 6 has a greater ability to predict DO concentration 24-time steps ahead than the Baffle model in Figure 7, from historical water temperature and DO data (blue line) as seen from the well-aligned red points (predicted DO values) and green points (actual DO values). The green and red points in Figure 7 are hardly aligned and only intersect at one point. Table 5 supports this notion with a detailed comparison of model predictive ability in terms of the performance metrics. The hyperparameters of the Burnett and Baffle LSTM models optimised through the trial-and-error process are in The Burnett model has significant predictive capability compared to the Baffle model, achieving a R 2 score value of 0.80, while the Baffle model scores half that value at 0.36. The Burnett model is also more capable of accounting for variation in data with an explained variance (EV ) score of 0.81, while the Baffle model scored 0.47. The maximum error from the Burnett model is 0.14 and is less than the 0.21 of the Baffle model. The median absolute errors (MedAE) are 0.13 and 0.40 for the Burnett and Baffle models, respectively. The maximum error (MaxError) of the Burnett model is close to the median absolute error, while the median absolute error of the Baffle model is greater than the maximum error, further reinforcing the superiority of the Burnett model.
The Burnett model performs better, despite having a smaller time window size of 100-time steps than the Baffle model with 150-time steps per window. The Burnett model is fed data over a longer period of three years than the Baffle model, as is evident by the three peaks in the graph shown as a repeated trend in Figure 6. The Baffle model is fed denser spaced data over a single year which is evident from the single peak in the graph in Figure 7. The greater period of data allowed for the greater diversity of data and the repetition of the same trend three times. That enabled the Burnett model time window to effectively capture the appropriate dataset background, resulting in a better trained Burnett model with a greater predictive capacity than the Baffle model.

B. HYBRID GA-OPTIMISED LSTM BURNETT AND BAFFLE MODELS PERFORMANCE CAPACITY 1) GA-BURNETT AND BURNETT LSTM MODELS
The GA-Burnett model in Figure 8 shows the predicted DO values 24-time steps ahead with a time window size of 57-time steps of historical DO data (blue line), almost  Table 3. The graph in Figure 8 is similar to the graph in Figure 6, where the red and green points (predicted and actual DO values) align at numerous instances. This similarity of model performance is mirrored by Table 5. The improvement in the performance of the Burnett model after GA-optimisation, resulting in the GA-Burnett model was notable as shown by the observable difference between the performance metric values achieved by the Burnett and GA-Burnett models on the same dataset in Table 5.
The GA-Burnett model has RMSE, MAE, MSE and MAPE values of 0.25, 0.17, 0.06 and 2.5 respectively. These values are slightly lower than the Burnett model. After GA-optimisation, the RMSE value of the Burnett model decreases by 10.71%. The MAE value is reduced by 5.55%, whilst the MSE is decreased by 25%. The MAPE value is reduced by 0.2%.
The GA-Burnett model has a bigger maximum error than the Burnett model at 0.22, but the median absolute error of the GA-Burnett model at 0.12 is similar to the Burnett model, implying that the maximum error of the GA-Burnett is due to an outlier prediction. Both the R 2 score and explained variance score of 0.84 are only slightly greater than the Burnett model. The R 2 score of the Burnett model increased by 5% and explained variance score improved by 3.7% after GA-optimisation.
There is a notable yet relatively small change in predictive ability, despite the obvious GA-optimisation of the Burnett model. The Burnett model may have already been optimised to a great extent before the GA-optimisation, as shown by the good performance metric values in Table 5 and the close alignment of the green and red points in Figure 6, and any further optimisation would noticeable improve but not greatly alter the predictive ability of the model.

2) GA-BAFFLE AND BAFFLE LSTM MODELS
The performance of the Baffle model has significantly improved after GA-optimisation, as illustrated by the performance of the GA-Baffle model in Figure 9. There is VOLUME 10, 2022 a closer alignment of the red and green points in the graph, compared to the green and red points that only intersected once in Figure 7 for the Baffle model. This improved performance between the Baffle and GA-Baffle LSTM models is illustrated by the substantial difference in performance metrics shown in  The prominent performance improvement that existed between the Baffle and GA-Baffle models, was not present between the Burnett and GA-Burnett models. After optimization, the time window size for the Baffle model changed from 150-time steps to 63-time steps, less than half the size it was. The units in the first and second hidden layers were reduced from 64 to 12 and from 32 to 10 units, respectively. The architecture of the GA-Baffle model is much smaller than that of the Baffle model. The difference in the architecture of the Baffle model after GA-optimisation is greater than the difference observed by the Burnett model and is responsible for the significant performance improvement.

3) GA-BURNETT AND GA-BAFFLE LSTM MODELS
The Burnett model has performed better than the Baffle model every time, primarily due to the structure of the dataset used to train the Burnett model. The already well-performing Burnett model was only slightly optimised by GA, while the Baffle model was greatly optimised by GA but not enough for the GA-Baffle model to outperform the GA-Burnett model as shown by the values in Table 5. Table 5 summarises the performance metrics of all four models alongside one another for comparison. The Baffle model has the largest and the worst RMSE, MSE, MAE, and MAPE values, while the GA-Burnett model has the lowest and the best values for these performance metrics. The GA-Burnett, Burnett, and GA-Baffle models all have similar R 2 scores, around 0.8. Similarly, all their explained variance scores are also approximately 0.8. The Baffle model has a much lower R 2 score of 0.36 and explained variance score of 0.47, almost half the value of the other models. The Baffle model only fits the data and accounts for the variation in the data half as well as the other models.

4) GA-OPTIMISED LSTM MODELS AND LSTM MODELS
The GA-optimised models have larger maximum errors than their original counterparts. The median absolute errors of the GA-optimised models are smaller than their maximum errors. Despite producing more accurate and consistent predictions, the GA-optimised models seem to be prone to outlier predictions. Overall, the GA-Burnett model shows the best predictive capability, while the Baffle model has the worst performance from all four models. In general, both GA-optimised models show improved predictive performance compared to their respective original counterparts. This performance improvement is representative of the improved robustness of the original LSTM models.

C. ENSEMBLE MODEL PREDICTIVE CAPABILITY
The GA-Burnett and GA-Baffle models contributed equally to the average ensemble model. An optimal weight combination of 60% of the GA-Burnett model and 40% of the GA-Baffle model was found for the weighted ensemble model. There is only a 10% difference in the weight combination of the ensembles. The optimal weight combination did not deviate much from the equal weight combination after the exhaustive grid search. This was largely due to the similarities in the architecture of the two GA-optimised LSTM based models. The only architectural difference between the base models is the number of units in the hidden layers. The GA-Burnett and GA-Baffle models have 1344 and 1904 trainable parameters, respectively, amounting to a difference of 560 parameters and 6 minutes of computation time.

1) AVERAGE AND WEIGHTED ENSEMBLE MODELS
The negligible difference in the performance of the ensemble models is shown in  The maximum error of the average ensemble at 0.31 and the weighted ensemble at 0.29 are similar. Both models have the same median absolute error of 0.09. The large difference between the maximum error and median absolute error in both ensembles reaffirms the trend of random outlier predictions observed in the GA-optimised LSTM based models. The behavior of the base models has been transferred to ensemble models.
In Table 6, the overall difference in the performance of the ensemble models is negligible and can be attributed to the LSTM based model weight contribution in each ensemble, where the GA-Burnett model only has 10% greater power in the weighted ensemble than in the average ensemble. Thus the difference in performance and predictive capability of the ensembles can only be slight. Table 6 shows the performance of the GA-optimised and ensemble models on the same Baffle 2015 dataset. The RMSE value of the GA-Baffle model decreased by 5.05% with the average ensemble and 6.06% with the weighted ensemble. A larger decrease was observed by the RMSE value of the GA-Burnett model, which was reduced by 7.84% with the average ensemble and 8.82% with the weighted ensemble.

2) ENSEMBLE MODELS AND GA-OPTIMISED LSTM MODELS
The The R 2 score of the GA-Baffle model increased by 1.97% with the average ensemble and by 1.8% with the weighted ensemble. The R 2 score of the GA-Burnett model improved by 3.05% with the average ensemble and by 2.93% with the weighted ensemble. The explained variance score of the GA-Baffle model increased by 1.03% with the average ensemble and by 0.92% with the weighted ensemble. The explained variance score of the GA-Burnett model improved by 2.56% with the average ensemble and by 2.45% with the weighted ensemble.
The difference in the model performance of all four models can be considered notable but not significant. This demonstrates consistency in model performance and prediction capability and is evidence of model robustness and improved model tolerance. Overall, the weighted ensemble performs the best, with the average ensemble performing almost as well. The performance difference between the ensembles is negligible.
The  Table 7 shows the consistency in the performance of all four models. The GA-Burnett, GA-Baffle, average, and weighted ensemble models have similar RMSE, MSE, and MAE values with negligible differences. All the R 2 and explained variance (EV ) scores are low and very far from the numeric value of one, indicating that despite consistently low RMSE, MAE, and MSE values, none of the models were able to predict the generation of wind power accurately. The models achieved consistently low maximum errors. The values used to train the models were normalised (by the data provider) positive decimals less than one. Thus in context, these seemingly low values are quite large. The weighted ensemble performs the best by a small margin. In general, both the ensemble models outperform the GA-optimised LSTM models. Consequently, all four models show consistent but poor predictive capability on this dataset.
The structure and data preparation of the wind power dataset could be responsible for this behaviour. The dataset has more than one parameter, but the parameters refer to the same feature, wind power from different wind farms.
In practice, each of these values is treated as a different parameter as they come from different systems (wind farms), while in essence, they are the same feature-wind power values. This data structure differs from the data structure of the water quality datasets used to develop the models. These datasets had several different parameters, each representative of a unique feature of the same system, such as water temperature, dissolved oxygen, to name a few. The models struggle to make predictions on datasets that do not have multiple features from the same system.
The models were developed on data that was normalised using the mean and standard deviation of the dataset. The manner in which the wind power data was normalised by the data provider is not known and might differ from the normalisation carried out on the other datasets, by this study. This possible difference in data preparation leads to the weaker performance of the models on the dataset. In conclusion, the models exhibit great consistency but low tolerance on the wind power generation multivariate dataset.

2) AIR TEMPERATURE
In Table 8, the similar RMSE, MSE, and MAE values of each model point towards the consistency in the predictive capability of each model. The R 2 and explained variance (EV ) scores are similar, consistently large, and very close to the numeric value of one, highlighting the good predictive capability and capacity to account for data variation of each model on the air temperature dataset. This good performance contrasts with the poor performance of the models on the wind power generation dataset. As with the wind power generation forecast, the weighted ensemble model performs the best, but the performance is comparable to that of the other models with negligible differences.
The dataset contained various related meteorological parameters in addition to air temperature values, such as air density and wind speed, to name a few, all recorded as part of the same system. Thus the dataset contained several parameters representing different features within the same system, which is similar in structure to the water quality datasets used to develop the models and could be a possible reason for the excellent model performance on this particular dataset.
All the air temperature dataset values were normalised before training using the mean and standard deviation of the data. This method of data preparation was applied to datasets used to develop the models. The similarity in the air temperature data structure and data preparation to the water quality datasets used to develop the models is the cause of the consistent and good model predictive capacity.
The air temperature dataset is the largest dataset with has 420 551 samples. The wind power dataset is one of the smaller datasets with only 18 757 samples as shown in Table 4. Thus the models might be better suited to larger datasets with multiple features from the same system. Interestingly, the GA-Burnett model performed the worst on this dataset. The GA-Burnett has the smallest architecture of all  the models and thus a lower capacity for a large dataset. In summary, the models exhibit great consistency and high tolerance on the air temperature multivariate dataset. Table 9 shows the model performance consistency, which was present with the previous two datasets. The RMSE values of the models were consistently similar to one another. The same trend was witnessed with the MSE and MAE values. The R 2 and explained variance (EV ) scores were exceedingly and consistently similar and close to 0.5. Thus the models are moderately capable of making predictions and have an average ability to account for the variation in the pollution dataset. As with the other datasets, the ensemble models perform better than the individual GA-optimised LSTM models, with the weighted ensemble model performing the best, but by an insignificant margin.

3) POLLUTION LEVEL
The pollution dataset has 43 800 samples (Table 4 and is smaller than the air temperature dataset but larger than the wind power dataset. The models seem to perform moderately or poorly on smaller datasets. In summary, the models exhibit great consistency and moderate tolerance on the pollution dataset.  Table 10 shows the model performance on the daily minimum temperature dataset in terms of RMSE values. Although the model performance difference is not large, it is more prominent than the notable but insignificant model performance difference observed on the multivariate datasets. The GA-Burnett and GA-Baffle models that performed similarly on the multivariate datasets exhibit a distinct difference in performance on the univariate dataset. The GA-Burnett model with the smallest network architecture performed better than the other models on the univariate dataset with a RMSE of 2.92. The GA-Baffle model performed poorly compared to the other models with a RMSE of 4.28, as its architecture might have been too large for the univariate dataset. The average ensemble performed worse than the weighted ensemble, which is composed of 60% GA-Burnett and 40% GA-Baffle, with RMSE values of 3.33 and 2.99, respectively. Thus the ensemble model that comprised more of the model with the smaller architecture performed better. The univariate dataset only considers one feature (temperature) for model training and predictions. The models were developed on multivariate datasets with the two features, DO and water temperature, used for model training. Thus the architecture of the models catered for multi-feature datasets. The univariate dataset is the smallest dataset with 3650 samples. This is possibly why the model with the smallest architecture performs the best on this dataset. In general, the models do not perform well on the dataset, further emphasising that the models do not perform well on smaller datasets.

2) MODELS AND CLASSICAL TIME SERIES FORECASTING METHODS
It can be seen from Table 10 that the classical methods perform better than the models on the univariate dataset. The ARIMA method performs the best with the lowest RMSE of 2.316, with the ARMA method achieving almost the same value at 2.320. The AR and MA methods achieved RMSE values of 2.389 and 2.98, respectively. MA is the worst performing method. The performance of the classical methods is comparable, rendering the performance difference insignificant.
Many of the classical methods outperform the models on the univariate dataset, shown in Table 10. The ARIMA method achieved the lowest RMSE value. As mentioned, the GA-Baffle model has the highest RMSE value. MA, the poorest performing classical method, has the same performance capabilities as the best performing model, GA-Burnett. Hence classical time series forecasting methods might still be more appropriate for univariate datasets than LSTM and ensemble models. Of all the models, the GA-Burnett model most closely replicated the behaviour of the classical methods, on the univariate datasets, due to the small architecture of the model. Thus the models show a lower tolerance on univariate datasets than on multivariate datasets. The models with smaller architectures have a higher tolerance on univariate datasets than the models with larger architectures. Table 11 shows the number of trainable parameters for each model, based on the number of units in the two hidden layers and the computation time taken to train each model. Trainable parameters are the number of trainable elements in a network-the parameters that are changed during gradient computation by the optimiser after the application of backpropagation [66]. All the models were developed, trained, and tested on an Aspire A315-53 laptop, with a processor (CPU) of Intel(R) Core(TM) i5-7200U, installed RAM of 4 GB DDR4 (Double Data Rate 4), of which 3,88 GB is usable and storage of 1 TB HDD (hard disk drive) with an effective storage of 930 GB.

F. COMPUTATION TIME AND TRAINABLE PARAMETERS
As seen in Table 11  The GA-optimisation of the LSTM models had the greatest impact on model robustness and computation time from all the processes in the ensemble development by decreasing the trainable parameters and hence computation time of the original Baffle and Burnett models. This concurs with literature, especially by Krstanovic and Paulheim [27], which suggested that configuring the individual LSTM base models of an ensemble through hyperparameter tuning will increase the base model quality, enhancing the resultant ensemble model quality.
The weight-based combination of the two models did not impact the number of parameters and caused a slight change in computation time. However, it did have an impact on model performance and model tolerance. The weighted ensemble model had the best performance on many of the datasets. This is evident from Table 12, which shows the best and worstperforming models and their performance difference on each dataset in terms of RMSE values.
In four out of the five datasets, the weighted ensemble model performed better than the other models to differing extents. At times, the difference in model performance was significant and sometimes notable but insignificant. The worst performing models on each dataset were either the GA-Burnett or GA-Baffle model. Thus the weight-based combination of the GA-optimised models improved the performance of the individual GA-optimised models, even if only marginally in certain instances and without substantially increasing computation time, indicating increased model robustness and tolerance. Table 13 shows Spearman's coefficients for the most significant water quality parameters in relation to DO for the three water quality datasets. The datasets show a significant negative correlation between DO and temperature, implying that as water temperature increases, DO concentration decreases, concurring with the research literature. The negative correlation is similar for the Burnett and Baffle 2019 datasets at −0.437 and −0.419, respectively, and is much stronger for the Baffle 2015 dataset at −0.752.

G. DESCRIPTIVE STATICS AND CORRELATIONS OF THE WATER QUALITY DATASETS
The relationship between pH and DO is unclear. Both the Burnett and Baffle 2015 datasets show a positive correlation. The Baffle 2015 correlation at 0.731 is much stronger than the Burnett at 0.361. In contrast, pH and DO share a weak negative correlation in Baffle 2019. Correlation does not necessarily translate to causation, and as previously stated, there was no documented correlation between DO and pH at the time of this study. The negligible correlation in Baffle 2019 does not support the existence of a causal relationship, and thus this study excluded pH from the predictive DO models.  The different pH-DO correlations should not be overlooked. The Baffle 2015 dataset has 17 520 observations spread over a single year. Similarly, the Burnett dataset with a total of 17 520 observations per year. The Burnett dataset has this density of data over three years, culminating in 52 560 observations. Perhaps if the pH-DO relationship is observed for over three years, the correlation becomes weaker, and thus the Baffle 2015 dataset has a higher positive pH-DO correlation. The Baffle 2019 is the densest dataset with 52 560 observations over a single year. Densely spaced datasets indicate closer clusters of similar observations with little variation between them. Thus more detailed daily observations could imply no correlation between pH and DO. Hence it is plausible to assume that if the Baffle 2015 had more observations per day, the strong pH-DO correlation might not exist.
In Table 14 which shows the statics of the water quality datasets, the pH minimum and maximum of the Baffle 2019 dataset are 4.53 and 8.21, respectively. The minimum pH is lower than the typically observed 6.5 to 8.5 pH values, indicating highly acidic waters. It is improbable that an external event caused the low pH as the other water quality parameter values (DO and temperature) are unaffected and in range. There is a greater possibility of the incorrect recording of pH for the Baffle 2019 dataset. Despite the discrepancy in pH values, all the datasets have an average pH of around 7, which falls well within the range of typical pH values and indicates neutral river waters.
From Table 14, the mean temperature of the datasets are similar and fall well within the range of typically observed water temperatures shown in Table 1 with 24.5 • C, 25.1 • C and 24.8 • C for the Burnett, Baffle 2019 and Baffle 2015, respectively. The mean temperatures also fall on the upper end of the typical temperature range, indicating that both rivers are in warm areas. Both rivers are in the fairly warm southeast Queensland, Australia. Thus this is a realistic depiction. The maximum temperature for all the datasets is around 32 to 33 • C, and the minimum temperatures for the Burnett, Baffle 2019, and 2015 are 11.8 • C, 16.8 • C, and 15.7 • C, respectively, implying that Baffle river is overall warmer than the Burnett river. It is also possible that the Burnett dataset that spans over three years, unlike the Baffle datasets, would have a larger temperature range due to the longer period.
The Burnett DO values fall within the range of the typical observed values from 6 to 14 mg/L. The minimum DO values of the Baffle 2019 and 2015 datasets are 4.0 mg/L and 3.9 mg/L, respectively. They are out of range, possibly due to the higher temperatures observed in the Baffle river, which would lower the DO concentration. The minimum DO values are observed at the minimum temperature values, which are higher for the Baffle datasets. The average DO values for the datasets are around 6.5 to 7 mg/L, falling within the typical range for healthy river waters from 6.5 to 8 mg/L.

VIII. CONCLUSION AND FUTURE WORK
The proposed LSTM based ensemble scheme improved the tolerance (mitigated the discrepancies of the individual LSTM models) of the hybrid GA-optimised LSTM water quality prediction models, for different water quality datasets taken from different sites and different times. Three main contributions and many observations were made by this study to achieve this result.

A. DEVELOPMENT OF THE LSTM FOR WATER QUALITY PREDICTION
Water quality prediction increases the efficiency of water quality monitoring, enabling effective water management, which is necessary for the preservation of rivers. Two LSTM models, the Burnett and Baffle LSTM models were developed from two different time-sequential water quality datasets with differently spaced data structures from two different time periods. Both models can successfully predict water quality ahead of time in terms of dissolved oxygen concentration, using historical dissolved oxygen concentration values and corresponding water temperature values. This is evident from the low RMSE, MSE, MAE, and MAPE values, along with the high R 2 and explained variance scores achieved by the models for water quality prediction.

B. PREDICTIVE LSTM MODELS AND DATA STRUCTURE
Overall, the Burnett LSTM model performed better than the Baffle LSTM model due to differences in dataset structure. The Burnett dataset had 52 560 observations spread over three years, and the Baffle dataset 2019 was more densely spaced, with 52 560 observations over a single year. The Burnett dataset allowed the Burnett model to observe a repeated trend over three years with greater variation in data, while the Baffle dataset only exhibited the trend to the Baffle model once with a single year of densely spread data with little variation.

C. GA-OPTIMISATION OF LSTM MODELS
The Burnett and Baffle LSTM models were successfully optimised using GA to increase their efficiency and robustness, resulting in the hybrid GA-optimised LSTM models, GA-Burnett and GA-Baffle LSTM models. Both the GA-optimised models outperformed their original counterparts, showing an enhancement in model performance through hyperparameter tuning. GA-optimisation had the biggest impact on decreasing the overall computation time and the number of trainable parameters. The computation time of the Baffle model was reduced by 50 minutes and the trainable parameters by 28 456 parameters after GA-optimisation, thus significantly reducing the overall computation time of the final ensemble model. Thus base model optimisation was crucial for ensemble model development.
The improvement in performance by the Baffle model after optimisation was much greater than the Burnett model. Unlike the Baffle model, the Burnett model was welloptimised through the trial-and-error method and could not be further improved. Thus when models, are developed from datasets that are spread over long periods and exhibit repeated trends, they are easier to optimise using trial-and-error methods. While models based on more densely spaced datasets with little variation in observations over shorter periods, not exhibiting repeated trends, require a more powerful optimisation technique, such as GA. After the Burnett and Baffle models were GA-optimised, the performance of the models became comparable. Model performance was not similar before optimisation.

D. WEIGHT BASED ENSEMBLE SCHEME
A linear weight-based technique combined the GA-Burnett and the GA-Baffle LSTM models to create two ensemble models, the average and weighted ensemble models. Due to increased robustness and improved predictive capability, the ensemble models performed better than the individual GA-optimised LSTM base models, even if it was only by a slight margin in certain instances and without significantly increasing the computation time. The performance difference between the two ensembles was notable but not significant, with the weighted ensemble only slightly outperforming the average ensemble, mainly due to the similar weight coefficients of the base models.
The similarity of the weight coefficients was due to the similarity in the architecture of the two base models, as the only difference between the base models was the number of LSTM units in the two hidden layers and the time window size. When two similar base models with similar performance capacities are combined to create an ensemble model, both the base models will have an equal or an almost equal contribution. The weighted ensemble was the more powerful of the two ensembles as the GA-Burnett, which performed better than the GA-Baffle had a greater contribution in the weighted ensemble. The similarity of the base models and the resultant similarity of the ensemble models were the cause of the similar behaviour exhibited by all four models: weighted ensemble, average ensemble, GA-Burnett, and GA-Baffle.

E. GENERALISATION AND MODEL TOLERANCE
The behaviour and predictive capability of all the models were similar and consistent. The weighted ensemble model only marginally outperformed the other models on the multivariate datasets. On particular datasets, the models had consistent good predictive capabilities and consistent moderate to poor predictive capacities on other datasets. Thus this study has mitigated the discrepancies and improved the tolerance of the individual LSTM models developed from different datasets, taken from different rivers and periods through the employment of the GA-optimised LSTM based ensemble scheme, to a large extent. Thus further asserting the relevance and tolerance of the developed models, especially the weighted ensemble model, in the wider field of LSTM and ensemble prediction models.
As the models were developed on multivariate datasets, greater consistency in model performance was observed on the multivariate datasets than on the univariate dataset. Classical forecasting methods outperformed the LSTM models on the univariate dataset, illustrating that the classical methods were more appropriate than LSTM models, even LSTM based ensemble models for making predictions on univariate datasets. The smaller LSTM models were better suited to univariate datasets than bigger LSTM models. In general, the models performed better on bigger datasets than smaller datasets. The models performed better on datasets with multiple related features in a single system and when the data was prepared (normalised) in the same way it was for the development of the models. Hence the models tend to perform well on datasets that are similar in structure, preparation, and size to the datasets used to develop them.

F. FUTURE WORK
This work was limited to two LSTM based models. Future studies can identify the optimum number of LSTM based models required to make the most tolerant ensemble model for water quality prediction and possibly in other areas such as energy, finance, geology, and many more. The Burnett and Baffle river datasets were taken from different sites and times, but both rivers are situated in southeast Queensland, Australia, and flow into the Coral Sea of the South Pacific Ocean. More diverse water quality datasets, possibly from different regions, could be used to increase the tolerance of the final ensemble model. The GA-optimised LSTM based models had a different number of LSTM units and time window sizes but were similar in other regards. The use of LSTM based models with greater architectural differences and diversity in terms of the number of hidden layers, input parameters, various activation functions, and optimisers could be explored to enhance the tolerance of the ensemble. Different metaheuristic algorithms, such as particle swarm optimisation, could be used to optimize the LSTM network to gauge the effect of other optimisation algorithms on the tolerance of the ensemble model. The focus of this study was the prediction of DO levels, but water temperature also influences other water quality parameters. Thus the development of a water temperature prediction model along with a DO prediction model should be encouraged. This paper mentioned the correlation between DO and pH. Whether this correlation translates into causation or any other possible link between the two parameters and what their potential pairing could hold for the development of water quality prediction models should be explored. ADNAN M. ABU-MAHFOUZ (Senior Member, IEEE) received the M.Eng. and Ph.D. degrees in computer engineering from the University of Pretoria. He is currently the Centre Manager of the Emerging Digital Technologies for 4IR (EDT4IR) Research Centre, Council for Scientific and Industrial Research (CSIR), an Extraordinary Professor with the University of Pretoria, a Professor Extraordinaire with the Tshwane University of Technology, and a Visiting Professor with the University of Johannesburg. His research interests include wireless sensor and actuator networks, low power wide area networks, software defined wireless sensor networks, cognitive radio, network security, network management, and sensor/actuator node development. He is a member of many IEEE Technical Communities. He is an Associate Editor of IEEE ACCESS, IEEE INTERNET OF THINGS, and IEEE TRANSACTION ON INDUSTRIAL INFORMATICS. He has participated in the formulation of many large and multidisciplinary research and development successful proposals (as a Principal Investigator or a main author/contributor). He is the founder of the smart networks collaboration initiative that aims to develop efficient and secure networks for the future smart systems, such as smart cities, smart grid, and smart water grid.