Proposition of New Ensemble Data-Intelligence Models for Surface Water Quality Prediction

Accurate prediction of water quality (WQ) parameters is a pivotal decision-support tool in sustainable water resources management. In this study, five ensemble machine learning (ML) models, namely Quantile Regression Forest (QRF), Random Forest (RF), radial support vector machine (SVM), Stochastic Gradient Boosting (GBM), and Gradient Boosting Machines (GBM_H2O), were developed to predict the monthly biochemical oxygen demand (BOD) of the Euphrates River, Iraq. For this purpose, monthly average data of water temperature (T), turbidity, pH, electrical conductivity (EC), alkalinity (Alk), calcium (Ca), chemical oxygen demand (COD), sulfate (SO4), total dissolved solids (TDS), total suspended solids (TSS), and BOD measured over a ten-year period were used. The performance of these standalone models was compared with that of integrative models developed by coupling the ML models with two feature extraction algorithms, i.e., Genetic Algorithm (GA) and Principal Component Analysis (PCA). The reliability of the models was evaluated using the coefficient of determination (R2), root mean square error (RMSE), mean absolute error (MAE), Nash-Sutcliffe model efficiency coefficient (NSE), Willmott index (d), and percent bias (PBIAS). Among the standalone models, QRF attained the best performance. The integrative PCA-QRF model performed considerably better than the standalone models and those integrated with GA, with R2, RMSE, MAE, NSE, d, and PBIAS values of 0.94, 0.12, 0.05, 0.93, 0.98, and 0.3, respectively.


I. INTRODUCTION

A. THE IMPORTANCE OF SURFACE WATER QUALITY MONITORING AND DETECTION
Human life relies heavily on the availability of water: people depend on it for drinking, cooking, farming, personal hygiene, and industrial and manufacturing purposes [1], [2]. Water is also important in other activities such as biotransformation and electric power generation [3]. Owing to this reliance, both surface water and groundwater bodies are exposed to various levels of contamination from different contaminants [4], [5]. This has made the prediction of WQ a difficult task in recent times, and many scholars have dedicated much effort to WQ assessment because of its importance to human life [6], [7].
Water resources in the Iraqi region have been under a high level of stress over the last two decades for several reasons, such as the damming of the Tigris and Euphrates Rivers, variations in global climate, and the decrease in local annual rainfall rates [8]-[10]. Water salinity is a critical issue in Iraq that affects WQ for domestic, agricultural, and industrial purposes [11], [12]. Poor drainage and irrigation practices have brought about low water table and soil salinization in the region; agricultural developments and other human activities have affected the quality of water in the Euphrates Basin. However, these impacts are not obvious at the point of water source for irrigation. Therefore, WQ management is necessary for the effective management of all water-related resources [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Anandakumar Haldorai. VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

B. MACHINE LEARNING MODELS LITERATURE REVIEW
The need for effective, dependable, accurate, and flexible prediction models has increased recently owing to the growing recognition of surface water pollution and the increasing interest in WQ assessment [14]. Such models are expected to describe the mechanisms of WQ deterioration precisely [15]. Researchers have modeled surface and underground WQ using soft computing tools such as ML models because of their reliability and accuracy [16], [17]. However, ML models have shown limited ability to generalize when handling the complicated and highly nonlinear relationships among the modeling parameters [18]. Based on the literature reported in the Scopus database (2014-2021), substantial attention has been given to BOD simulation using ML models. Figure 1 reports the major keyword occurrence clusters and the time span covered by the literature. Over 144 keywords were identified, indicating the significance of this topic in modeling river water quality. The exploration of new ML models capable of solving environmental engineering problems is ongoing, and modeling WQ using new sophisticated models remains of interest to researchers and scientists [19]-[21]. Although the literature reveals different versions of ML models applied to surface WQ modeling, such as artificial neural networks, kernel models, fuzzy logic, genetic programming, adaptive neuro-fuzzy inference system models, and several others [7], several new versions of ML models are yet to be explored for modeling surface WQ phenomena. The efficiency of integrative intelligence models in WQ modeling has also been noted [9], [36]-[39].
Further, although ML models are the commonly used predictive models in surface WQ prediction, they are still facing several limitations, such as the need to tune their internal parameters, the need for time-consuming algorithms, poor generalization capability, and the need for human intervention during the modeling process. Hence, there is a need for models that are flexible enough to address the complicated nature of most environmental engineering problems [22].

C. THE SIGNIFICANCE OF THE SELECTED CASE STUDY
The accurate determination of BOD is necessary for water pollution control because it is an important index of water quality [23]. BOD is delicate and tedious to analyze. It approximates the amount of biodegradable organic matter in the water and is an essential indicator of water pollution. In addition, BOD is regarded as a foremost parameter for representing aquatic system health, and its proper quantification can contribute to the development of strategies for water resources protection and safety. Furthermore, whereas a parameter such as dissolved oxygen (DO) can be measured with in-situ instruments, a BOD measurement takes at least five days. Accurate prediction of WQ parameters in a study area can save cost, energy, and time; this is why much effort is given to modeling approaches for predicting these valuable parameters [24]. Modeling approaches are even more important in developing countries, where the budget for environmental quality assessment and monitoring is low compared with developed countries. This research was conducted to predict monthly BOD for the Euphrates River in Iraq. Five different ensemble ML models were developed for this purpose. These models were selected because of their wide implementation and confirmed potential in hydrological, climatological, and environmental research [25]-[28]. The modeling results were compared with several well-established studies on river WQ prediction from diverse regions around the world.

D. RESEARCH MOTIVATION AND OBJECTIVES
Several review articles have recently been presented on the progress of ML development for river WQ [7], [29], [30]. These reviews emphasize the exploration of new versions of ML models for modeling river WQ because of the limitations of existing ML models. For instance, classical models such as artificial neural network (ANN), fuzzy logic (FL), and support vector machine (SVM) suffer from the need to tune their internal parameters [31]-[34]. Another issue reported in previous studies is the importance of selecting significant and relevant predictors for the target parameter [16], [35]. As the prediction matrix is highly influenced by the input feature selection, integrating a prior approach for better understanding the effects of the predictors is an essential step in ML model development. Previous studies have shown an encouraging trend in this direction. For instance, an improved Grey relational analysis (IGRA) algorithm was integrated with a Long Short-Term Memory (LSTM) predictive model to simulate the DO concentration at Tai Lake and Victoria Bay [36]. In another study, the water quality index (WQI) was predicted at Rawal Lake using Gaussian Naïve Bayes coupled with several ML models [37]. Recently, some authors tested the capacity of quantum teaching and learning based optimization as a feature selection method for WQI determination using a weighted extreme learning machine model for groundwater samples collected in the Dharmapuri district of Tamil Nadu [38]. Several other scholars adopted similar methodologies for surface WQ simulation [39]-[41]. All those studies confirmed the significance of coupled ML models for modeling surface water quality and for better understanding the substantial correlation between the simulated WQ parameters.
Hence, the current research was motivated by the need to explore more reliable and robust soft computing predictive models and to investigate the parameters that most strongly influence the prediction of BOD in a river located in a semi-arid region. The objectives of the current research are (i) to explore the capacity of five ML models, namely Quantile Regression Forest (QRF), Random Forest (RF), radial support vector machine (SVM), Stochastic Gradient Boosting (GBM), and Gradient Boosting Machines (GBM_H2O), for river BOD prediction, and (ii) to identify the prediction matrix using statistical correlation. The prediction capability of the proposed ML models was further enhanced by integrating two feature selection approaches (GA and PCA). The ultimate goal of the current research was to develop a reliable and robust intelligence model for river water quality prediction.

II. CASE STUDY AND DATA DESCRIPTION
This study focused on the prediction of WQ parameters in the Euphrates River at Ramadi City, Anbar state, Iraq. The coordinates of the measurement point are 33°26′15″ N latitude and 43°16′52″ E longitude (Figure 2). The laboratory measurements were conducted on samples from the intake of a large drinking water treatment plant at Ramadi City. The climate of the region is semi-arid, with extreme summer temperatures exceeding 45 °C and cold weather during winter [42]. Sampling was done monthly for 10 years (2004-2013). The quality of water in the Euphrates River basin is mainly affected by human activities, especially domestic and agricultural activities. The salt level of the river has increased tremendously along the stream course. Furthermore, industrial discharge of untreated sewage water into the river also contributes to the sources of contaminants. Therefore, this study is relevant as it provides an intelligent system for WQ monitoring of the studied river. To date, no studies have been reported from this perspective; hence, this is a novel contribution with respect to the proposed methodology. The statistical properties of the WQ parameters are presented in Table 1. WQ prediction models can aid in determining the trend of decline in WQ at any point. BOD and DO have been commonly used WQ parameters for decades; hence, this study focused on the prediction of both parameters, as their accurate prediction is essential for supporting protective initiatives.

A. QUANTILE REGRESSION FOREST (QRF) MODEL
QRF is one of the popular ML models. In this study, an RF model was first developed, and quantile regression was then applied at the final prediction stage to obtain the quantile RF model. The regression algorithm was applied in this model since the model output (i.e., BOD) is a continuous parameter. The concept of an RF model is based on the aggregation of several decision trees to establish the model output [43], as shown in Figure 3a. A decision tree (DT) is a decision support tool that relies on a tree-like structure consisting of links and nodes to reach potential model outputs. The starting point of each DT is a parent node that serves as a decision point; the parent node keeps creating branches until a decision is reached.
During the modeling process, the training dataset is first randomly bootstrapped into sub-training sets (1, ..., n), as in Figure 3a; each resulting bootstrapped sub-training set is used to establish a DT for different combinations of predictor parameters, starting with the topmost predictor. The response of each DT is an estimate of the response parameter; the responses for the 0.01 and 0.99 quantiles can be estimated from the DTs used to construct the RF, while the median response (i.e., a quantile of 0.50) can be calculated as the final output of the model. The data within the range of the 0.01 and 0.99 quantiles represent the prediction interval. RF analysis demands a careful selection of the number of DTs, which ranges from some hundreds to thousands of trees. The optimum number of DTs in this study was determined using the out-of-bag (OOB) error technique [44]. A trained model is achieved when the prediction errors have been minimized; once this is achieved, the model is said to be the optimized/best-trained model. In this study, the mean squared error (MSE) was used to minimize the total error for each node of each DT. The error was estimated at each data splitting point, with the minimum MSE representing the best estimate. The study in [45] provides a detailed description of the formation of a DT and how RF works.
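As an illustration of how a prediction interval can be read off an ensemble of DTs, the following minimal Python sketch draws one bootstrapped sub-training set and takes the 0.01, 0.50, and 0.99 quantiles of a set of per-tree predictions. It is a simplification and not the implementation used in this study: the formulation of Meinshausen weights the training responses found in the tree leaves, whereas here the quantiles are taken directly over per-tree outputs. The function names and the five tree predictions are hypothetical.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrapped sub-training set)."""
    return [rng.randrange(n) for _ in range(n)]

def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers, with 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def prediction_interval(tree_predictions, lower=0.01, upper=0.99):
    """Return (lower quantile, median, upper quantile) of the per-tree outputs."""
    return (quantile(tree_predictions, lower),
            quantile(tree_predictions, 0.50),
            quantile(tree_predictions, upper))

rng = random.Random(0)
sample_rows = bootstrap_indices(10, rng)  # row indices used to grow one tree

# Hypothetical BOD predictions from five trees for a single month.
tree_preds = [1.0, 2.0, 3.0, 4.0, 5.0]
low, median, high = prediction_interval(tree_preds)
```

In practice an RF uses hundreds of trees, so the outer quantiles are far better resolved than in this five-tree toy case.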

B. RANDOM FOREST (RF) MODEL
RF is a supervised learning approach, introduced by Breiman [43], that combines the bagging ensemble ML algorithm built on classification and regression trees with the random subspace technique. Despite its simplicity, it is an effective tool that relies on the "divide and conquer" principle to solve multi-regression and prediction problems [46]-[48]. It has low sensitivity to multi-collinearity and achieves stable performance on unbalanced datasets. RF adopts the bootstrapping method to resample the original dataset and generate subsets of similar size to the original set. The trees are then constructed from the generated subsets, and the results (prediction or regression) of the individual trees are pooled to arrive at the final outcome [48], [49]. RF has found successful application in environmental engineering [9] and other fields of study [50]. Detailed information on the mathematical formulation of RF models can be found in [43], [51], [52]. The randomForest and caret packages were used to train the predictive model. The model parameter was controlled through resampling folds with RMSE as the selection metric, using a grid search over randomly selected parameter values.

C. SUPPORT VECTOR MACHINE (SVM) MODEL
SVM is an ML subcategory that was first proposed and developed by [53] for addressing both classification and regression problems. It is a robust approach based on statistical learning theory [54]. The principle of the SVM model hinges on first assessing the level of dependence of the target parameter (ŷ) on the predictive parameters (x) before obtaining a regression function using the relation [55]:
$$f(x) = \omega \cdot \phi(x) + b \tag{1}$$
where ϕ represents the functions that replace complex nonlinear expressions with simpler linear ones, ω is the regression function weight, and b is the regression function bias; both are obtained by minimizing the deviation of f(x) from the observed value (ŷ). SVM adopts the ε-insensitive loss function for the evaluation of this deviation (ϑ) [56], [57]:
$$\vartheta_{\varepsilon}\left(\hat{y}, f(x)\right) = \max\left(0,\; \left|\hat{y} - f(x)\right| - \varepsilon\right) \tag{2}$$
The following risk-structure function is also minimized to obtain the corresponding weight and bias:
$$S = \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i=1}^{n} \vartheta_{\varepsilon}\left(\hat{y}_i, f(x_i)\right) \tag{3}$$
The hyperparameters are represented as C and ε. The Lagrange multiplier technique is used to minimize S to achieve the regression equation in Eq. 4, where the kernel function is represented as K and α_i and α_i^* are the Lagrange multipliers [53]:
$$f(x) = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^{*}\right) K(x_i, x) + b \tag{4}$$
Numerous kernel functions were tested in this study, and based on certain performance metrics and time efficiency, the linear function was selected, following the reported literature [58], [59]. The regression function of the support vector machine model is presented in Figure 3b.
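The ε-insensitive loss described above can be illustrated with a short Python sketch: deviations smaller than ε cost nothing, and larger deviations are penalized linearly. The sketch also evaluates a regression function with a linear kernel, i.e., a weighted sum of the predictors plus a bias. The function names, weights, and sample values are hypothetical and serve only to show the mechanics.

```python
def epsilon_insensitive_loss(observed, predicted, epsilon):
    """Zero inside the epsilon tube, linear in the excess deviation outside it."""
    return max(0.0, abs(observed - predicted) - epsilon)

def linear_svr_predict(weights, bias, x):
    """Regression function for a linear kernel: dot(weights, x) + bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# Hypothetical values: a deviation of 0.05 falls inside a tube of width 0.1.
inside_tube = epsilon_insensitive_loss(2.0, 2.05, epsilon=0.1)
# A deviation of 0.5 exceeds the tube by 0.4.
outside_tube = epsilon_insensitive_loss(2.0, 2.5, epsilon=0.1)

# Hypothetical weights and bias for two predictors.
prediction = linear_svr_predict([0.5, -1.0], 0.2, [2.0, 1.0])
```

A larger ε widens the tube and yields fewer support vectors, while C controls how strongly deviations outside the tube are penalized.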

D. STOCHASTIC GRADIENT BOOSTING (GBM) MODEL
Friedman [60] first developed the GBM algorithm as a combination of gradient descent with the boosting algorithm. Hence, GBM was developed as an ensemble learning algorithm that merges boosting and DTs; each new model is built along the gradient descent path of the loss function of the earlier model. The GBM algorithm was developed to train the classification function F*(X), which minimizes the loss function between the real function and the classification function. The loss function distribution is important in the implementation of the GBM model [60], even though the model can be applied to all loss functions. Friedman [61] suggested the surrogate loss function (multi-class log-loss) for the K-class problem. The mathematical expression of the loss function is as follows:
$$L\left(\{y_k, F_k(X)\}_{1}^{K}\right) = -\sum_{k=1}^{K} y_k \log p_k(X)$$
where X = {x_1, x_2, ..., x_n} represents the input parameters, y is the output parameter, K represents the number of classes, and the probability is represented as p_k(X). This gives rise to the following equation:
$$\tilde{y}_{ki} = -\left[\frac{\partial L\left(\{y_{ji}, F_j(x_i)\}_{j=1}^{K}\right)}{\partial F_k(x_i)}\right] = y_{ki} - p_k(x_i)$$
where y_ki − p_k(x_i) are the existing residuals; hence, K trees are induced at iteration m, each with L terminal nodes R_klm. For each tree, a separate line search can be used to resolve the terminal node value γ_klm:
$$\gamma_{klm} = \frac{K-1}{K}\, \frac{\sum_{x_i \in R_{klm}} \tilde{y}_{ki}}{\sum_{x_i \in R_{klm}} \left|\tilde{y}_{ki}\right|\left(1 - \left|\tilde{y}_{ki}\right|\right)}$$
The updating of each of the functions leads to the formation of the GBM. The GBM algorithm has been detailed in [62]. The GBM algorithm depends on three key parameters: (i) the number of trees (boosting iterations, M), (ii) the interaction depth (the maximum tree depth, J), and (iii) the shrinkage (the learning rate, v). Better performance and generalization of the GBM model depend on proper tuning of these hyper-parameters.

E. GRADIENT BOOSTING MACHINE (GBM_H2O)
Another popular supervised ML model is the GBM_H2O model, which was developed by [60], [61]; it is an efficient tool for solving both classification and regression problems [63], [64]. Boosting learns multiple classifiers by manipulating the sample weights during the training phase and later merges these classifiers linearly to improve the classification performance. Friedman extended boosting to regression tasks in 2001 by introducing the GBM, an additive model that minimizes the loss function. The GBM model is first initialized to a constant value that minimizes the loss function; in each training iteration, the negative gradient of the loss function is then estimated as the residual of the current model. A new regression tree (RT) is trained to fit the current residual, the RT is added to the previous model, and the residual is updated. The process continues until the maximum number of iterations set by the user is reached. Because each RT is fitted to the residuals of the preceding model, the GBM progressively corrects the poor fit of the earlier iterations.
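The iterative procedure described above (constant initialization, residual computation, fitting an RT to the residual, and adding it to the model) can be sketched in a few lines of Python. The sketch assumes a squared-error loss, for which the negative gradient equals the ordinary residual, a single predictor, and one-split trees (stumps); these choices, the function names, and the toy data are assumptions made for illustration and do not reproduce the GBM or GBM_H2O configurations used in this study. The arguments n_trees and shrinkage play the roles of M and v.

```python
def fit_stump(x, residual):
    """Best one-split regression tree on a 1-D feature, chosen by squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # both sides of the split stay non-empty
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_trees, shrinkage):
    """Additive model: constant start, then shrunken stumps fitted to residuals."""
    init = sum(y) / len(y)  # the constant that minimizes squared error
    stumps = []
    pred = [init] * len(y)
    for _ in range(n_trees):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is y - F.
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residual)
        stumps.append(stump)
        pred = [pi + shrinkage * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: init + shrinkage * sum(s(xi) for s in stumps)

def training_mse(model, x, y):
    """Mean squared error of a fitted model on the data it was trained on."""
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy data (hypothetical): a step-like response to a single predictor.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]
baseline = gradient_boost(x, y, n_trees=0, shrinkage=0.1)
boosted = gradient_boost(x, y, n_trees=50, shrinkage=0.1)
```

A smaller shrinkage generally needs more trees to reach the same training error, which is why these hyper-parameters are usually tuned together.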

F. PRINCIPAL COMPONENT ANALYSIS
The principal component analysis (PCA) is a well-recognized feature selection approach that works based on un-supervised pattern recognition. It abstracts the frequent pattern that scores the highest in the simulated matrix [65]. The mathematical procedure of the PCA approach is working on the base to allocate the minimum error between the observed and predicted values due to the variance of the principal component [66]. The variance component (a k,i ) is calculated using:  attained, the d' is calculated through: where λ k presents the scatter matrix of eigen values against the e k . The variance direction is followed the direction of the eigen values [67].

G. GENETIC ALGORITHM
GA is a realistic method that is based on Darwin's principle [68]. In the current study, GA was adopted to sort the best-fit predictors through an evolutionary process [69]. Successive iterations were computed, after which the values were filtered and the optimal solution was configured. The best value for the optimal feature is computed as follows [70]: where χ_th represents the individual feature (i.e., a predictor water quality parameter) and k is the constant for the selective pressure, which lies between 1 and 2. The last term, R_χth, defines the ranking of the individual features.

IV. MODELING RESULTS AND ANALYSIS
The modeling procedure adopted in this research is summarized as a flowchart in Figure 4.

A. PREDICTORS SELECTION
In this study, five ensemble data-intelligence models (i.e., QRF, RF, SVM, GBM, and GBM_H2O) were developed for surface water BOD prediction. In addition, the integration of the PCA and GA feature selection approaches was investigated as the second modeling scenario. The models' performances were compared using multiple statistical criteria, including the coefficient of determination (R2), root mean square error (RMSE), mean absolute error (MAE), Nash-Sutcliffe model efficiency coefficient (NSE), Willmott index (d), and percent bias (PBIAS) [71], [72], as well as graphical presentation. Because a wise selection of the predictors (water quality parameters) to include in the prediction formula has a more beneficial effect on overall performance than the choice of the modeling algorithm itself, the feature selection approaches were employed to identify the minimal subset of features for optimal learning. The performance of the feature selection techniques was compared with that of the benchmark models built on the full set of data covering the potential causal parameters: temperature (T), turbidity, pH, EC, alkalinity (Alk), Ca, COD, SO4, TDS, and TSS. Initially, the collected data were analyzed in terms of their correlations, as shown in Figures 5 and 6. Figure 5 shows the Pearson correlation coefficients, at the 0.05 significance level, between BOD (the predictand) and the predictors. It seems that the biological reactor of this particular case study is mainly influenced by the water temperature, which could be due to the climate characteristics of the region. Thus, it can be concluded that T, turbidity, pH, EC, Alk, COD, TDS, and TSS are significantly associated with BOD at the 0.05 level.
The results of the Goodman and Kruskal tau measure are presented in Figure 6, which shows the association between the input parameters and the target parameter. The number of distinct values of each input parameter is given in the diagonal elements, and the forward and backward tau measures are reported in the off-diagonal elements. The associations from T, turbidity, pH, EC, Alk, Ca, COD, SO4, TDS, and TSS to BOD were 0.24, 0.75, 0.08, 0.79, 0.39, 0.20, 0.49, 0.57, 0.79, and 0.52, respectively. Apparently, all predictors were associated with the predictand, suggesting potential predictability of BOD from the selected parameters. From the above analysis, it can be judged that each test conveys a different view of the association between the water quality parameters. Therefore, the entire data set was used to build the benchmark models.
Using the GA approach, the search of the feature space was conducted repeatedly within resampling iterations. The training data were split according to the 5-fold cross-validation resampling method specified in the control function, so the entire GA search was performed 5 separate times. For the first fold, four fifths of the data were employed in the search while the remaining fifth was used to estimate the external performance, since those data points were not used in the search. The internal and external average accuracy estimates computed from the 5 out-of-sample predictions are shown in Figure 7. The gafs() function of the R software, with 100 generations and 50 individuals, was used to assess the chromosomes of each generation, based on a random forest model and 5-fold cross-validation. In the final search using the entire training set, only 4 features (among the ten) were selected at iteration 49, namely T, pH, COD, and TSS, with RMSE, R2, and MAE values of 0.3463, 0.7138, and 0.2742, respectively, based on the external performance. Accordingly, the optimal four predictors were used to build the integrative GA-ML models.
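The external resampling around the GA search can be pictured with the following minimal Python sketch, which only produces the index sets for a 5-fold scheme: in each fold, four fifths of the rows would be handed to the search and the remaining fifth kept aside for the external performance estimate. It does not implement the GA itself and is not the gafs() routine; the function name, the seed, and the sample count of 90 are placeholders rather than figures from this study.

```python
import random

def k_fold_indices(n_samples, k, seed):
    """Return k (search_rows, held_out_rows) pairs covering every row once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in folds:
        held = set(held_out)
        # Rows not held out are available to the feature search in this fold.
        search = [i for i in indices if i not in held]
        splits.append((search, held_out))
    return splits

# Placeholder sample count; the feature search would run once per split.
splits = k_fold_indices(n_samples=90, k=5, seed=1)
```

Because the held-out rows never enter the search, the error computed on them gives a less optimistic picture than the internal estimate.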
In the same manner, the collected data were analyzed in terms of their associations using the PCA approach. Figure 8 depicts the biplot of the first two components, which carry the most variance. The first component is dominated by the T parameter, while the second component is dominated by pH and Ca. Moreover, the scree plot, which shows how much of the variability in the data each component explains, is given in Figure 9, where the x-axis and the y-axis represent the component and the importance of that component, respectively. As can be seen from the figure, after the second component there is a significant drop-off in the incremental impact of each additional component. The eigenvalue of each component was calculated as given in Table 2. Only components with an eigenvalue of at least about 1 were retained. The idea behind this is that if the eigenvalue is much less than 1, the component accounts for less variance than a single original parameter contributes. Accordingly, only the first four components were used to build the models.
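The eigenvalue rule described above can be sketched as follows. For standardized predictors, the eigenvalues of the correlation matrix sum to the number of predictors, so an eigenvalue of 1 corresponds to the variance of one original parameter. The eigenvalues below are hypothetical and are not the values of Table 2; the function names and the strict cut-off of 1.0 are likewise assumptions made for illustration.

```python
def explained_variance(eigenvalues):
    """Share of the total variance carried by each component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

def retain_components(eigenvalues, threshold):
    """1-based positions of the components whose eigenvalue reaches the threshold."""
    return [i + 1 for i, ev in enumerate(eigenvalues) if ev >= threshold]

# Hypothetical eigenvalues for ten standardized predictors (NOT Table 2 values).
eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.6, 0.4, 0.2, 0.15, 0.1, 0.05]

shares = explained_variance(eigenvalues)
kept = retain_components(eigenvalues, threshold=1.0)
cumulative_share = sum(shares[:len(kept)])
```

With these made-up numbers the first four components are kept and together carry about 85% of the variance, which mirrors the kind of trade-off read from a scree plot.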

B. MODELS PERFORMANCES
The performances of the five ensemble ML models developed in the study were evaluated in terms of learning accuracy using WQ data collected from the Euphrates River. Before being fed into the models, the data were randomly partitioned into 75% for training and the remaining 25% for validation [14], [16], [73]. The statistical performance of the five ML models is reported in Table 3. In general, the results were highly competitive among the five models, and an acceptable level of predictive performance was observed. The prediction power of the models was ranked based on their average performance across the six statistical measures. The results over the validation phase were presented graphically using a Taylor diagram and boxplots. The R2, RMSE, MAE, NSE, d, and PBIAS values were 0.9, 0.16, 0.07, 0.87, 0.97, and 0 for QRF and 0.84, 0.19, 0.1, 0.81, 0.96, and 0 for GBM_H2O, respectively (Table 3). In other words, the lowest RMSE, MAE, and PBIAS and the highest R2, NSE, and d were obtained from these two models, while the remaining models were less accurate. Figure 10a presents the boxplots of the five ML models in comparison with the observed dataset over the validation phase. The middle line indicates the median BOD, and the whiskers extend to the minimum and maximum values of the samples. The lower and upper edges of the box correspond to the 25th and 75th percentiles. It can clearly be observed that the QRF model achieved the closest prediction, as its box is the nearest in shape to that of the observed dataset, whereas the GBM model showed the worst prediction accuracy in comparison with the other models.
The Taylor diagram presents the results in a two-dimensional graph in which the observed dataset is indicated as a circle along the abscissa and the performance of each model is expressed by its distance from the observed data, based on the RMSE, the standard deviation, and the correlation statistic (Figure 10b). In agreement with the boxplot, the QRF model was the nearest to the observed dataset and the GBM model was the furthest. The correlation coefficient of the QRF model was about 0.95, and the centered pattern RMS difference between the two patterns was 0.16.
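The three statistics combined in a Taylor diagram are linked by a law-of-cosines relation: the squared centered pattern RMS difference equals the sum of the two variances minus twice the product of the two standard deviations and the correlation coefficient. The following Python sketch computes the statistics and checks this relation on a small set of hypothetical observed and predicted values; the function name and the numbers are assumptions made for illustration only.

```python
import math

def taylor_statistics(observed, predicted):
    """Return (std of observed, std of predicted, correlation, centered RMS difference)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    dev_o = [o - mean_o for o in observed]
    dev_p = [p - mean_p for p in predicted]
    std_o = math.sqrt(sum(d * d for d in dev_o) / n)
    std_p = math.sqrt(sum(d * d for d in dev_p) / n)
    corr = sum(a * b for a, b in zip(dev_o, dev_p)) / (n * std_o * std_p)
    # Centered: the means are removed before the difference is taken.
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_p)) / n)
    return std_o, std_p, corr, crmsd

# Hypothetical observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.2, 1.9, 3.3, 3.8]
std_o, std_p, corr, crmsd = taylor_statistics(observed, predicted)

# Law-of-cosines identity underlying the diagram geometry.
identity_gap = crmsd ** 2 - (std_o ** 2 + std_p ** 2 - 2.0 * std_o * std_p * corr)
```

Because the means are removed, the centered RMS difference does not capture overall bias, which is one reason PBIAS is reported alongside the diagram.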
For the integrative GA-ML models, the closest distribution around the 1:1 line was observed for the QRF model, with R2, RMSE, MAE, NSE, d, and PBIAS values of 0.85, 0.19, 0.09, 0.82, 0.96, and 0, respectively, while the remaining models performed with less accuracy than QRF. The boxplots of the results obtained from the integrative GA-ML models during the validation stage are shown in Figure 11a. As with the benchmark models, the distribution of the QRF model was the most similar to the observed one, followed by GBM_H2O > GBM > SVM > RF. The Taylor diagram (Figure 11b) confirmed that the best performance was from the QRF model, while the RF and SVM models were the furthest and the other models fell in between. The correlation coefficient between the QRF predictions and the observed data is less than 0.95, and the centered pattern RMS difference between the two patterns is ∼0.19. The integrative GA-ML models did not perform as well as the benchmark models, indicating that the features selected by the GA were not representative of the entire BOD data.
Overall, the ML models coupled with the PCA approach performed better than those coupled with the GA approach, producing the lowest prediction error. QRF again outperformed GBM_H2O in terms of the statistical performance metrics: the R2, RMSE, MAE, NSE, d, and PBIAS of QRF were 0.94, 0.12, 0.05, 0.93, 0.98, and 0.3, respectively, while those of GBM_H2O were 0.89, 0.16, 0.09, 0.87, 0.97, and 0.3, respectively. QRF and GBM_H2O were followed by SVM > GBM > RF. The boxplots of the results obtained from the integrative PCA-ML models during the validation stage are shown in Figure 12a. The distribution from QRF was the most similar to the observed one, and its interquartile range was the closest to that of the observed values, followed by GBM_H2O > GBM > SVM > RF. This was further confirmed by the Taylor diagram (Figure 12b), which shows that the best performance was from the QRF model, while the RF and SVM models were the worst and the other models fell in between. The correlation coefficient between the QRF predictions and the observed data is greater than 0.95, and the centered pattern RMS difference between the two patterns is ∼0.12. The PCA-based models outperformed both the benchmark and the GA-ML models.
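For reference, the six statistical criteria reported in this section can be computed as in the minimal Python sketch below. Several conventions are assumptions, since the exact formulas are not restated here: R2 is taken as the squared Pearson correlation, d follows the original Willmott index of agreement, and PBIAS is computed as 100 times the sum of (predicted − observed) divided by the sum of observed, a sign convention that differs among authors. The function name and the toy values are hypothetical.

```python
import math

def evaluate(observed, predicted):
    """Return R2, RMSE, MAE, NSE, Willmott d, and PBIAS (%) as a dictionary."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]
    sse = sum(e * e for e in errors)
    sst = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    # Potential error term in the denominator of the Willmott index.
    potential = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2
                    for o, p in zip(observed, predicted))
    return {
        "R2": cov * cov / (sst * var_p),
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "NSE": 1.0 - sse / sst,
        "d": 1.0 - sse / potential,
        "PBIAS": 100.0 * sum(errors) / sum(observed),
    }

# Toy values (hypothetical): a perfect prediction and a constant over-prediction.
obs = [1.0, 2.0, 3.0, 4.0]
perfect = evaluate(obs, obs)
shifted = evaluate(obs, [o + 1.0 for o in obs])
```

The shifted case shows why several criteria are reported together: a constant offset leaves R2 at 1 while RMSE, NSE, and PBIAS all reveal the bias.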

V. DISCUSSION
Redundant and irrelevant predictors significantly deteriorate the performance of regression models and cause overfitting in prediction models. Therefore, extracting a smaller subset containing the most relevant predictors can be useful, since it also saves time in data collection and computation [74], [75].
In this study, two feature selection approaches were integrated with five different ensemble learning artificial intelligence models (i.e., QRF, RF, GBM_H2O, GBM, and SVM) in order to improve the accuracy of surface water BOD prediction at the Euphrates River. These two approaches can be broadly categorized into filter methods (PCA) and wrapper methods (GA) [76]. It was concluded that PCA outperforms both the GA approach and the benchmark models in terms of predictability. The GA works by searching the space of possible feature subsets and evaluating each subset using an ML algorithm. Such methods are known as greedy algorithms because they aim to find the best possible combination of features, i.e., the one that results in the best-performing model [77]. This in turn is computationally expensive, and impractical in the case of an exhaustive search. In PCA, by contrast, each predictor is evaluated with a statistical performance metric and ranked according to its performance indicator. The top-performing features are then selected through truncation selection before an ML model is applied. Hence, the method is considered a pre-processing step, as it does not consider the complex interactions between predictors and is independent of the learning algorithm [78]. As mentioned earlier, the PCA method is well known to be computationally efficient [79]. However, one shortcoming of this method is that it can become stuck in a local optimum when the complex interactions among predictors are ignored [79], [80]. Many researchers have argued that wrapper methods (the GA) take the interactions among predictors into consideration but are not as computationally efficient as filter methods (the PCA) because of the larger space to search [81]-[83].
The main drawback of the GA is that it needs a large population size and a large number of generations, both of which are time consuming [76]. The GA returns the optimal feature selection, and the model attains better predictive performance, when the population size and the number of generations are large. A small data set for feature selection may cause overfitting, which is why the performance of the GA in this study was not superior to that of the baseline models.
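The cost argument above can be made concrete with a small sketch of a GA wrapper, in which every candidate feature subset requires one model fit, so the total number of fits equals population size times number of generations. Everything here is an illustrative assumption: the data are synthetic, the 1-nearest-neighbour regressor is a stand-in learner (not one of the five models in this study), and the names `make_data`, `knn_mse`, `tournament`, and `ga_select` are hypothetical.

```python
import random

def make_data(n, rng):
    """Synthetic data: y depends on features 0 and 1 only; 2-5 are noise."""
    X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(n)]
    y = [2.0 * r[0] - 1.5 * r[1] + rng.gauss(0, 0.1) for r in X]
    return X, y

def knn_mse(mask, Xtr, ytr, Xva, yva):
    """Wrapper fitness: validation MSE of a 1-NN regressor on masked features."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("inf")  # an empty subset cannot predict anything
    err = 0.0
    for xv, yv in zip(Xva, yva):
        nearest = min(range(len(Xtr)),
                      key=lambda i: sum((Xtr[i][j] - xv[j]) ** 2 for j in cols))
        err += (ytr[nearest] - yv) ** 2
    return err / len(Xva)

def tournament(scored, rng):
    """Pick the better of two randomly drawn (error, mask) pairs."""
    a, b = rng.sample(scored, 2)
    return a[1] if a[0] <= b[0] else b[1]

def ga_select(Xtr, ytr, Xva, yva, pop_size, generations, rng, p_mut=0.1):
    """Binary-mask GA; returns (best_mask, best_error, number_of_model_fits)."""
    n_feat = len(Xtr[0])
    pop = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_size)]
    best_mask, best_err, fits = None, float("inf"), 0
    for _ in range(generations):
        scored = []
        for mask in pop:
            err = knn_mse(mask, Xtr, ytr, Xva, yva)
            fits += 1
            scored.append((err, mask))
            if err < best_err:
                best_err, best_mask = err, mask[:]
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = tournament(scored, rng), tournament(scored, rng)
            cut = rng.randint(1, n_feat - 1)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)                          # bit-flip mutation applied
        pop = nxt
    return best_mask, best_err, fits

X, y = make_data(90, random.Random(0))
Xtr, ytr, Xva, yva = X[:60], y[:60], X[60:], y[60:]

small = ga_select(Xtr, ytr, Xva, yva, pop_size=6, generations=3,
                  rng=random.Random(1))
large = ga_select(Xtr, ytr, Xva, yva, pop_size=20, generations=10,
                  rng=random.Random(1))
for name, (mask, err, fits) in (("small", small), ("large", large)):
    print(name, "mask:", mask, "validation MSE:", round(err, 3), "fits:", fits)
```

Because the fitness is measured on a small hold-out set, a subset that scores well may simply fit that hold-out sample, which is one way the overfitting risk mentioned above can arise.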
The combination of PCA with the quantile regression forest model outperformed all the applied models in terms of the statistical performance criteria. The QRF model, introduced by Meinshausen [84] as a generalization of random forests [43], allows conditional quantiles to be inferred. The robustness of the QRF method is attributed to its non-parametric and accurate way of estimating conditional quantiles for high-dimensional predictor variables. The method has been shown to be consistent across multiple different scenarios, suggesting that the algorithm is competitive in terms of predictive power.
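For reference, the way QRF generalizes random forests can be written compactly. The notation below is an assumption that follows the standard formulation of [84] rather than symbols defined in this paper: $k$ trees with random parameters $\theta_t$, training pairs $(X_i, Y_i)$, $i = 1, \dots, n$, and $R_{\ell(x,\theta)}$ the leaf region of a tree that contains the query point $x$.

```latex
% Weight of training sample i in one tree, and averaged over the forest
w_i(x, \theta) = \frac{\mathbf{1}\{X_i \in R_{\ell(x,\theta)}\}}
                      {\#\{j : X_j \in R_{\ell(x,\theta)}\}},
\qquad
w_i(x) = \frac{1}{k} \sum_{t=1}^{k} w_i(x, \theta_t)

% Random forest: weighted mean of the observed responses
\hat{\mu}(x) = \sum_{i=1}^{n} w_i(x)\, Y_i

% Quantile regression forest: weighted empirical distribution and its quantiles
\hat{F}(y \mid X = x) = \sum_{i=1}^{n} w_i(x)\, \mathbf{1}\{Y_i \le y\},
\qquad
\hat{Q}_{\alpha}(x) = \inf\{\, y : \hat{F}(y \mid X = x) \ge \alpha \,\}
```

The same weights $w_i(x)$ serve both estimators; a random forest keeps only the weighted mean, whereas QRF retains the full weighted distribution of the responses, from which any conditional quantile can be read off.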
It is worth mentioning that the span of the dataset used in the current study provided satisfactory information for the development and learning process of the ML models. Several data spans have been adopted in the literature; in this study, however, ten years of monthly observations were adequate to construct the ML models.
The current modeling is associated with some limitations, such as the tuning of the internal parameters of the SVM model, for which other advanced non-linear functions could be used [85]. In addition, metaheuristic optimization algorithms are another option for enhancing the learning process of the ML models [86].

VI. CONCLUSION
This study proposed five relatively newly explored ML models for predicting the BOD of surface water. These models were considered in this work as a robust approach to predicting WQ parameters rather than relying on laboratory analysis. As a further enhancement, two feature selection approaches (GA and PCA) were integrated with the developed ML models to improve their predictive performance. Various categories of water parameters, including physical, chemical, and biological parameters, were used as the input attributes for developing the proposed models. The data for model construction comprised 10 years of laboratory records covering 2004-2013. The outcome of the research showed that the PCA-QRF model provided reliable BOD predictions compared with the other established models. Furthermore, the proposed model required less approximation of the input parameters, which is extremely valuable for catchments with limited environmental or ecological information. Generally, the proposed ML models predicted the WQ parameters of the Euphrates River accurately. Future studies will aim to predict other WQ parameters and to include more input attributes, such as climatological or hydrological factors.
Conflict of Interest: The authors declare no conflict of interest to any party.
He taught several subjects related to water resources and hydrological engineering, such as water management, fluid mechanics, hydrology, dams engineering, and statistics. In addition, he was involved in some consultancy projects, such as the design of a wastewater treatment plant and the environmental impact assessment of several projects. He has supervised many undergraduate and postgraduate students. His research interests include using artificial intelligence methods in water resources management. He is interested in machine learning, statistical, and stochastic modeling, and has worked with a variety of statistical software since 2011. He has published several articles in peer-reviewed international journals in this field. In 2017, he was selected for the Fulbright Visiting Scholar Program.
AITAZAZ AHSAN FAROOQUE is currently working as an Associate Professor with the Faculty of Sustainable Design Engineering, University of Prince Edward Island. His research focuses on the fundamental understanding and development of state-of-the-art precision agriculture (PA) technologies for Eastern Canada's agriculture industry. The development of innovative and novel PA systems utilizes knowledge of engineering design, development and management, instrumentation, design and evaluation of sensors and controllers, and development of hardware and software for automation of machines to sense targets in real time for spot application of agrochemicals on an as-needed basis to improve farm profitability while maintaining environmental sustainability.
He is actively working on machine vision, application of multispectral and thermal imagery using drone technology, delineation of management zones for site-specific fertilization, electromagnetic induction methods, remote sensing, and digital photography technique for mapping, bio-systems modeling, artificial neural networks, deep learning, and analog and digital sensor integration into agricultural equipment for real-time soil, plant, and yield mapping. He has been evaluating the variable rate technologies for potential environmental risks.

ACKNOWLEDGEMENT
KHALED MOHAMED KHEDHER has been an Assistant Professor with the Civil Engineering Department, College of Engineering, King Khalid University, since 2018. Previously, he worked for many years as an Assistant Professor in civil engineering at universities in France, Tunisia, and Saudi Arabia. He has participated in many international and national conferences and symposiums related to civil engineering and new technologies, such as GIS, remote sensing, and geosciences. He has also contributed to many articles and papers in ISI journals with high impact factors. To date, he has published more than 60 articles online with international publishers such as Elsevier, Taylor & Francis, MDPI, Wiley, and Springer.
ZAHER MUNDHER YASEEN is currently working as the Research Director with Al-Ayen University, Iraq. He is also a Lecturer and a Researcher in the field of civil engineering. The scope of his research is quite broad, covering water resources engineering, environmental engineering, knowledge-based system development, and the implementation of data analytic and artificial intelligence models. He has published over 210 research articles in international journals, with a total of over 5000 citations (Google Scholar H-index = 41). He has collaborated with more than 400 researchers across over 40 countries. He has served as a reviewer for more than 110 international journals.