Water Quality Management Using Hybrid Machine Learning and Data Mining Algorithms: An Indexing Approach

One of the key functions of global water resource management authorities is river water quality (WQ) assessment. A water quality index (WQI) is developed for water assessments considering numerous quality-related variables. WQI assessments typically take a long time and are prone to errors during sub-index generation. This can be tackled through the latest machine learning (ML) techniques renowned for superior accuracy. In this study, water samples were taken from the wells in the study area (North Pakistan) to develop WQI prediction models. Four standalone algorithms, i.e., random trees (RT), random forest (RF), M5P, and reduced error pruning tree (REPT), were used in this study. In addition, 12 hybrid data-mining algorithms (combinations of the standalone algorithms with bagging (BA), cross-validation parameter selection (CVPS), and randomizable filtered classification (RFC)) were also used. Using the 10-fold cross-validation technique, the data were separated into two groups (70:30) for algorithm creation. Ten random input permutations were created using Pearson correlation coefficients to identify the best possible combination of datasets for improving the algorithm prediction. The variables with very low correlations performed poorly, whereas hybrid algorithms increased the prediction capability of numerous standalone algorithms. Hybrid RT-Artificial Neural Network (RT-ANN) with RMSE = 2.319, MAE = 2.248, NSE = 0.945, and PBIAS = −0.64 outperformed all other algorithms. Most algorithms overestimated WQI values except for BA-RF, RF, BA-REPT, REPT, RFC-M5P, RFC-REPT, and ANN-Adaptive Network-Based Fuzzy Inference System (ANFIS).

Inadequate sewage networks, uncontrolled and improperly planned urbanization, and dumping of industrial trash, pesticides, and fertilizers contribute to water pollution [3]. Such pollution is more evident in local rivers or water channels closer to urban developments.
With both non-point and point sources, river pollution is becoming a more significant problem and presents a tough challenge to global water management authorities. Such pollution seriously deteriorates water quality (WQ). WQ degradation substantially impacts aquatic life and the availability of clean water for drinking and agricultural purposes [4]. The pollution challenge is harder to tackle in developing countries, which frequently go through periods of economic fluctuation. Further, each development action can have severe environmental consequences. For example, as the population and the demand for resources grow, the need for more agricultural production puts pressure on the organic fertility of soils, increasing the demand for artificial fertilizers to enhance yield [5]. Accordingly, surplus fertilizers are frequently dumped into rivers and waterways, polluting surface and underground water sources [1]. This increases the need for WQ assessment and surveillance.
WQ surveillance and evaluation are critical for environmental, climate, and human health protection. This can be achieved through timely, efficient, and long-term water management plans. WQ is assessed through the water quality index (WQI), which helps guide policymakers' actions and decisions. However, calculating WQI is not a simple process due to the involvement of multiple sub-indices and equations. WQI is a non-dimensional index derived from defined WQ variables. It uses variables such as pH (potential of hydrogen), DO (dissolved oxygen), TSS (total suspended solids), BOD (biological oxygen demand), AN (ammoniacal nitrogen), COD (chemical oxygen demand), and others [6]. The associated metrics enable a definite evaluation of WQ. Measurements of variables such as Ca2+, Mg2+, NO3, and others are commonly used to estimate groundwater quality indicators (GQIs) [7], [8], [9].
Several aspects of water, including physical, chemical, biological, and radiological, are included in the assessment of WQ [10]. In addition, WQI is a frequently used technique for assessing the effectiveness or failure of WQ management measures [11]. Some examples of WQIs include the Canadian WQI (CQI), United States National Sanitation Foundation WQI (NSFWQI), Interim National Water Quality Standards for Malaysia (INWQS), British Columbia WQI (BCWQI), Oregon WQI (OWQI), Florida Stream WQI (FWQI), and others. WQI is calculated through multiple methods and algorithms around the globe. However, WQI calculation is not a straightforward process, and the associated computations have many drawbacks [12]: 1) the computation algorithms are complex; 2) the process is lengthy; 3) the computations are verbose and hard to understand; and 4) the process is subject to inconsistencies and errors, as there is no uniform WQI approach and WQI computations frequently use different algorithms.
Some experts have used a non-physical strategy to address these difficulties. Accordingly, they suggest using artificial intelligence (AI) to forecast WQI [13], [14], [15]. AI-based modeling eliminates the need for sub-index computations and quickly generates WQI values. Such AI algorithms are gaining popularity because of their nonlinear structures, capacity to forecast complicated events, ability to handle large datasets, and lack of sensitivity to missing data [16]. For WQI modeling, classic AI algorithms based on artificial neural networks (ANN) and the adaptive network-based fuzzy inference system (ANFIS) have been extensively developed. On the other hand, environmental scientists have researched more robust and trustworthy AI algorithms [17], [18], [19]. However, the methodology and quality of data gathering and analysis are critical to the predictive capability of AI systems.
Data mining is a form of AI algorithm developed to tackle nonlinear equations and reduce AI's drawbacks. It has been used to quantify suspended sediment yield [20], approximate benchmark water loss [21], [22], [23], and replicate direct sunlight [24]. New algorithms such as M5P, random tree (RT), random forest (RF), bagging (BA), reduced error pruning tree (REPT), instance-based k-nearest neighbors (IBK), random committee (RC) are currently explored in hydrological processes, climate science, and hydraulic systems [12], [20], [22], [23], [25], [26]. Another prominent solution for different environmental and hydrological issues includes the usage of tree-based algorithms such as decision trees [27], [28]. Furthermore, the known powerful machine learning (ML) tool for both linear and nonlinear regression problems is the support vector machine (SVM), which is used in a range of scientific problems with remarkable forecast accuracy [27], [29], [30], [31]. DT and SVM algorithms have been used to predict parameters of WQ, such as TDS (total dissolved solids), TSS, BOD, and COD.
Granata et al. [32] developed a regression tree (RT) algorithm and a support vector regression (SVR) algorithm for predicting wastewater quality indicators and found that the SVR model provided the best results. Kayaalp et al. [33] developed a hybrid SVR model using monthly WQ parameter data with the firefly algorithm (FFA) to forecast WQI. The algorithm showed a significant increase in prediction performance compared to the standalone SVR model. Kamyab-Talesh et al. [34] investigated the optimization of the SVM algorithm to identify the factors with the highest impact on the WQI. The authors observed that nitrate is the most crucial parameter for WQI prediction. Wang et al. [35] analyzed three ML algorithms, SVR, SVR-GA (genetic algorithm), and SVR-PSO (particle swarm optimization), to predict WQI and compared their performance. Since decision tree-based algorithms (i.e., M5P, RF, RT, REPT, and others) have no hidden units and offer modeling clarity, they can produce better modeling results than ANFIS and ANN [36]. Furthermore, integrated modeling gives more reliable results than using standalone algorithms.
VOLUME 10, 2022
Researchers from Iran [12] have introduced a new WQI focused on the characteristics and conditions of rivers and lakes because previous algorithms are time-consuming and not accurate enough to be trusted. In addition, they added more parameters to their algorithm to improve prediction accuracy. However, its feasibility has not yet been tested because of the diverse weather conditions that vary between the arid and moist seasons. As a result, applying this index to specific locations may be risky and yield variable results. However, since our study area lies in a similar climate region and has the same meteorological and climatic properties, we can rely on this algorithm for our study area.
Northern Pakistan has gained economic significance over the last decade because of the China-Pakistan Economic Corridor (CPEC) project [1], [37]. Urban areas such as Gilgit city are experiencing an economic boom because of recent development, which has improved linkage and connectivity through the upgrading of the Karakoram Highway (KKH). However, this enhanced connectivity also invites urban sprawl, and the region is expected to face many environmental issues, including WQ degradation [10], [38]. Very recently, Maqsoom et al. [38] mapped this region's groundwater susceptibility and found moderate to high susceptibility, particularly around Gilgit city. Moreover, Awais et al. [1] assessed nitrate contamination in this region and found a moderate to high risk of groundwater nitrate contamination. Overall, the two studies found that the water in the extreme north appears to be of good quality, with minimal contamination, and is protected by natural vegetation. However, as the system approaches a developed region, Gilgit city, WQ rapidly degrades because of improper and unregulated discharges, a typical trend of built-up regions [39].
The objective of this research is to forecast the WQI for the region along the KKH. To achieve this, the current study utilizes four standalone algorithms (M5P, RT, RF, and REPT) and 12 hybrid data-mining algorithms obtained by combining the randomizable filtered classifier (RFC), CV parameter selection (CVPS), and BA with the four standalone algorithms. It was expected that WQI could be accurately forecasted using a standalone decision tree algorithm, as its ability to predict diverse hydrological events has been demonstrated in the literature mentioned earlier. However, combining it with classifier algorithms was intended to enhance the precision further and reduce the fundamental flaws of the given algorithms. Therefore, a combination was proposed and utilized in this study.
The current study differentiates itself from published works as two new hybrid algorithms were tested for WQI analysis. Moreover, the outcomes of the hybrid algorithms were compared with previously established algorithms and techniques to identify a more robust and more accurate algorithm. This study will benefit this fast-growing region, which is expected to be heavily affected by human activities that cause environmental issues such as water pollution, and will help policymakers in the CPEC region with better water management.
The rest of the paper is organized as follows. First, the study area is explained in Section 2. Then Section 3 describes the research methodology, followed by the presentation of the algorithms used in this research in Section 4. Section 5 compares the algorithms and their performance. Section 6 presents results and pertinent discussions. Finally, Section 7 concludes the study and explains the key takeaways, limitations, and future direction for further expanding the current research.
The study area is located in northern Pakistan, in the Gilgit and Hunza-Nagar districts near the Pakistan-China border. It is a 20-kilometer buffer along the 236-kilometer (146.6-mile) stretch of the traditional Silk Route/KKH from Gilgit to Khunjerab Pass, encompassing hilly terrain. This route has tremendous importance as it connects Pakistan and China and is considered the backbone of the CPEC project [1], [38]. The study area is a part of the Himalaya, Hindukush, and Karakoram mountain ranges, with elevations ranging from 1294 meters to 7330 meters. The Hunza and Gilgit rivers flow through this region and provide domestic water. The area is located at a high altitude and receives heavy snow in winter, which melts in summer and thus provides freshwater [1], [37]. In the past, this region had excellent WQ, but the local WQ is deteriorating due to the recent construction and other development associated with CPEC. This calls for a WQ study of the region to better manage the groundwater and surface water in line with the global sustainability goals. Figure 1 shows the study area and the locations of the water wells from which the water samples were taken and analyzed for the research. Figure 1 further shows the water channels, district boundary, and elevation in the study area.
Figure 2 shows the methodology flow chart of this research and the associated steps. Data collection was performed first, and different WQ parameters were subsequently calculated from the water samples. The data were then divided into testing and validation datasets. From the testing datasets, the best input combination was identified. Finally, multiple algorithms were applied to the best combinations, and an algorithm assessment was conducted to select the best possible algorithm to predict WQI. The detailed steps of the method are subsequently presented and discussed.

A. DATA COLLECTION AND PREPARATION
Water samples were collected from randomly selected water wells distributed so that they covered the entire study area. Overall, data were collected from 39 locations. To minimize the impact of seasonality, the samples were taken over two years, 2020 and 2021. Subsequently, multiple WQ parameters were calculated. These parameters include pH, DO, TDS, conductivity, salinity, chloride, total alkalinity, total hardness, sulfate, nitrate, and WQI. Pakistan's WQ Index (PKWQI) was calculated using these datasets. The laboratory of COMSATS University Islamabad, Wah Campus, was used for the WQ parameter calculations. Using the 10-fold cross-validation method, the dataset was partitioned into two subsets for algorithm training and testing (70:30). This ratio is one of the most popular modeling strategies for spatial [22], [23], [26], [40] and temporal [20], [22], [23], [24], [25], [40] predictions. The PKWQI was created using the NSFWQI equation. In the index, the cleanness of water depends on the value of PKWQI: the higher the PKWQI value, the cleaner the river (a WQI of 80 or higher denotes a clean river), and vice versa [15]. The PKWQI formula is used to determine the PKWQI, as shown in (1),
where W_i is the weight of variable i (between 0 and 1), and SI_i is the sub-index resulting from the quality-index curve (0-100). The calculation techniques align with the NSFWQI [34], [40]. Table 1 shows the ranges of quality parameters for the PKWQI. It classifies the WQI into seven classes based on the ranges of WQI. For example, WQI values below 15 are classified as very low-quality water, while WQI values above 85 are classified as very good-quality water, as stated in Table 1.
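The index computation described above can be illustrated with a short Python sketch. It assumes that (1) takes the standard NSFWQI weighted-sum form, PKWQI = sum of W_i * SI_i over all variables, which is not reproduced in this text; the function name, the weights, and the sub-index values are hypothetical and chosen only for illustration.

```python
def weighted_wqi(weights, sub_indices):
    """Weighted-sum index: sum of W_i * SI_i over all variables (assumed form of (1))."""
    if len(weights) != len(sub_indices):
        raise ValueError("weights and sub_indices must have the same length")
    if any(not 0.0 <= w <= 1.0 for w in weights):
        raise ValueError("each weight must lie between 0 and 1")
    if any(not 0.0 <= s <= 100.0 for s in sub_indices):
        raise ValueError("each sub-index must lie between 0 and 100")
    total_weight = sum(weights)
    if total_weight <= 0.0:
        raise ValueError("weights must not all be zero")
    # Dividing by the total weight keeps the result on the 0-100 scale even
    # if the supplied weights do not sum exactly to 1.
    return sum(w * s for w, s in zip(weights, sub_indices)) / total_weight


# Hypothetical weights and sub-index values for three variables.
example_wqi = weighted_wqi([0.5, 0.3, 0.2], [90.0, 70.0, 40.0])
print(round(example_wqi, 2))
```

Under this assumed form, a sample with high sub-indices on the heavily weighted variables lands in the upper classes of Table 1, which matches the rule that higher WQI values indicate cleaner water.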
The general rule is that the higher the WQI value, the better the WQ. Table 2 shows the WQ parameters used and the results of the multicollinearity analysis. These parameters were selected based on the literature [6], [12], [14], [32], [34], [41] and are among the standard characteristics used for WQ assessment. The variance inflation factor (VIF) value for all factors is less than 5 and thus satisfies the maximum threshold [42]. It can therefore be stated that there is no multicollinearity among the selected parameters.
According to the descriptive data (see Table 3), the WQI varies from 11.45 to 87.45 (the maximum possible value is 100). Thus, the WQ in the study area ranges from excellent to unsuitable for drinking [43]. The average pH is 7.9 for the training dataset and 7.5 for the testing dataset. Overall, the region has a slightly basic pH. The mean total hardness is 104.6 for the training dataset and 103.50 for the testing dataset, which means this area has moderately hard water. The TDS values, 90.9 and 90.4 for the training and testing datasets, respectively, also indicate that the area possesses hard water. The data were normalized (X_i) to a 0 to 1 range to increase prediction ability using the following relation [44], [48]:
$$X_i = \frac{x_i - X_{\min}}{X_{\max} - X_{\min}}$$
where X_i is the normalized value of a variable (i.e., BOD, COD, etc.), x_i is the value at a given location, and X_min and X_max are the variable's minimum and maximum values.
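The normalization step can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual min-max form, (x_i - X_min) / (X_max - X_min), which is consistent with the definitions of X_i, x_i, X_min, and X_max given above; the function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale one variable to the 0-1 range using its own minimum and maximum."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant variable carries no information and cannot be rescaled.
        raise ValueError("cannot normalize a variable with zero range")
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical readings of one WQ variable at four locations.
normalized = min_max_normalize([20.0, 35.0, 50.0, 80.0])
print(normalized)
```

Each variable is rescaled independently, so variables measured in different units end up on a common scale before modeling.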

B. CONSTRUCTING THE INPUT COMBINATION
Before modeling, the ideal input combination and the best value for each algorithm's operator must be identified. Ten factors were examined as potential inputs, and the correlation coefficients (CCs) between each input and WQI were used to rank them, as presented in Table 4. CCs range from −1 to +1, where −1 means a strong negative relation, +1 means a strong positive relation, and 0 means no relation between the two variables. Ca+Mg, SO4, and NO3 relate strongly to the WQI, while pH has no relation to the WQI. TDS and salinity have a moderate relation. A total of 10 input combinations were constructed for this purpose, as presented in Table 5. NO3, which has the highest CC value, was the first variable included in the algorithm (Table 5); used alone, it gives the best single-variable estimate of the WQI and is therefore the most effective variable. Each variable with the next highest CC (i.e., SO4, then Ca+Mg, then CO3+HCO3, etc.) was then added to the preceding combination until the final variable with the lowest CC (i.e., pH) was included. Each algorithm's most successful (i.e., most predictive) combination was determined by applying fixed input variable values (or default values) to all ten input combinations. The testing phase was evaluated using the root mean square error (RMSE) criterion.
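The construction of the nested input combinations can be sketched as follows. This is a minimal illustration, not the authors' code: it computes Pearson CCs directly, ranks the inputs by the absolute value of the CC (an assumption, since the text only says "highest CC"), and builds one combination per added variable. The variable names and values in the example are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def nested_input_combinations(inputs, target):
    """Rank inputs by |CC| with the target, then add them one at a time.

    inputs: dict mapping a variable name to its list of values.
    Returns a list of combinations; combination k holds the k strongest inputs.
    """
    ranked = sorted(
        inputs,
        key=lambda name: abs(pearson(inputs[name], target)),
        reverse=True,
    )
    return [ranked[: k + 1] for k in range(len(ranked))]


# Hypothetical data: three candidate inputs and a WQI target.
wqi = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "pH": [1.0, -1.0, 0.0, -1.0, 1.0],
    "NO3": [2.0, 4.0, 6.0, 8.0, 10.0],
    "SO4": [5.0, 3.0, 4.0, 1.0, 2.0],
}
for combo in nested_input_combinations(candidates, wqi):
    print(combo)
```

Each resulting combination would then be fed to every algorithm with default settings, and the combination with the lowest testing RMSE would be retained.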

C. DETERMINING THE OPERATOR'S OPTIMUM VALUES
After establishing the optimal input parameters, trial and error was used to obtain the optimal values for each algorithm's operators. Since operators have no universal optimum value (values vary between studies), various values should be examined using the hit-and-trial approach to determine the most efficient value. To achieve this, each algorithm was first run using default settings. Based on these findings, higher and lower values were randomly entered until the optimal value was found. The batch size for all the algorithms was set to 100, and each model was run for 100 iterations. DT algorithms were used as classifiers, and random projection was used as the filter for all the algorithms. The minimum variance proportion was set to 0.001, and the number of decimal places for output values was 3. Fifteen hidden layers were used for the ANN algorithm to obtain a single output.
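The hit-and-trial idea can be illustrated with a small Python sketch. It is a simplification under stated assumptions: instead of entering higher and lower values at random, it deterministically tries one step above and one step below the current best value and stops when neither improves the score. The function name, the step size, and the toy error function are hypothetical; in practice `evaluate` would train a model with the given operator value and return its testing RMSE.

```python
def hit_and_trial(evaluate, default, step, max_rounds=20):
    """Search around a default operator value for the lowest error.

    evaluate: callable returning an error score (lower is better).
    Returns the best value found and its score.
    """
    best_value, best_score = default, evaluate(default)
    for _ in range(max_rounds):
        # Try one lower and one higher value than the current best.
        candidates = [best_value - step, best_value + step]
        score, value = min((evaluate(v), v) for v in candidates)
        if score >= best_score:
            break  # neither neighbor improves the error, so stop
        best_value, best_score = value, score
    return best_value, best_score


# Toy error curve with its minimum at 7 (stands in for a testing RMSE).
best, err = hit_and_trial(lambda v: (v - 7) ** 2, default=3, step=1)
print(best, err)
```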

IV. DESCRIPTIONS OF THE ALGORITHMS
This study uses sixteen ML algorithms to predict WQI, divided into two groups. Jupyter Notebook was used to implement the algorithms and process the obtained data. The most essential packages used are TensorFlow, scikit-learn, ANFIS, and weka-pyscript, and the most important libraries used are NumPy, Matplotlib, and pandas [45], [46], [47], [48]. In this study, the WQI was first predicted and evaluated using the individual algorithms in Group 1. Following that, ensemble algorithms based on the algorithms in Groups 1 and 2 were created to assess the accuracy of the WQI prediction. Finally, the sixteen algorithms were analyzed and evaluated in two categories to choose the best algorithm to predict WQI. The two groups of algorithms are explained below.
Group 1 contains six algorithms: M5P, RT, RF, REPT, ANN, and BPNN (back propagation neural network), as discussed below.

1) M5P
M5P, a machine-learning algorithm, is the first member of the DT group included in this study. M5P is a robust decision-tree algorithm used in various applications [49], [50], [51], [52], [53]. It works like a regression tree, with constants acting as the leaves [54]. The M5P algorithm was derived from the M5 algorithm given by Quinlan [55]. The classification and regression tree [26] algorithm modifies M5P [44]. As the M5P method is centered on classification and regression analysis, it uses a divergence metric to generate a decision tree. It calculated continuous parameters using a decision tree whose nodes hold linear regression functions that produce numerical outputs.

2) RT
The RT algorithm is a well-known DT technique first developed in 2000 [56], [57]. In contrast to typical DTs, it builds DTs from a random selection of columns. In addition, RT offers flexible and quick training [58]. From the training dataset containing features and labels, the RT built the DT by formulating its own set of rules and then used those rules to make the predictions.

3) RF
RF was first suggested by Breiman [59]. It falls within the categories of supervised ML and ensemble ML and builds on RT [27], [60]. The RF algorithm draws sample subsets from the original data, creates a DT for each subset, and aggregates the forecasts of the individual trees. Each DT was built with around two-thirds of the dataset, and the algorithm was evaluated with the remaining data. This type of evaluation is known as "out-of-bag" (OOB) evaluation. More details about the utility of RF algorithms in natural science areas are given in [61], [62], [63], [64], and [65].
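The "around two-thirds" figure follows from bootstrap sampling: drawing n samples with replacement from n records leaves roughly 63% of the distinct records in the bag and the rest out of the bag. The sketch below illustrates this with the standard library only; it does not train any trees, and the function name and sample size are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample and return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    # Records never drawn form the out-of-bag set used for evaluation.
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the run is repeatable
in_bag, out_of_bag = bootstrap_split(1000, rng)
unique_fraction = len(set(in_bag)) / 1000
print(f"distinct in-bag records: {unique_fraction:.3f}")
print(f"out-of-bag records: {len(out_of_bag) / 1000:.3f}")
```

In a full RF, each tree would be fitted on its own in-bag sample, and its predictions on its out-of-bag records would be pooled to estimate the error without a separate validation set.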

4) REPT
REPT is a fast learner in which the DTs are created based on information gain or variance reduction [66]. Reduced-error pruning with backfitting is the primary approach used in this strategy. Pruning procedures are used to reduce the size of a DT. The REPT algorithm examines each node of the DT and reduces the number of branches for as long as the tree's accuracy is not compromised [67]. In this study, REPT considered each node for pruning and replaced the subtree at a node with a leaf whenever doing so did not compromise performance. Operating iteratively, REPT continued removing nodes until further pruning became harmful.
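The pruning rule can be made concrete with a small sketch of reduced-error pruning on a regression tree. It is a generic illustration, not the Weka REPTree implementation: the tree is assumed to be already grown, every node stores the mean training target as `value`, and a subtree is replaced by a leaf whenever that does not increase the squared error on held-out pruning data. The `Node` class and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: float                    # mean training target at this node
    feature: Optional[int] = None   # split feature index; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def predict(node, x):
    """Follow the splits down to a leaf and return its value."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def squared_error(node, rows):
    return sum((y - predict(node, x)) ** 2 for x, y in rows)


def reduced_error_prune(node, rows):
    """Prune bottom-up using the held-out rows that reach this node."""
    if node.feature is None:
        return node
    left_rows = [(x, y) for x, y in rows if x[node.feature] <= node.threshold]
    right_rows = [(x, y) for x, y in rows if x[node.feature] > node.threshold]
    node.left = reduced_error_prune(node.left, left_rows)
    node.right = reduced_error_prune(node.right, right_rows)
    leaf = Node(value=node.value)
    # Keep the split only if it beats a single leaf on the pruning data.
    if squared_error(leaf, rows) <= squared_error(node, rows):
        return leaf
    return node


# Hypothetical tree whose split does not help on the pruning data.
tree = Node(5.0, 0, 0.5, Node(0.0), Node(10.0))
pruned = reduced_error_prune(tree, [([0.2], 5.0), ([0.8], 5.0)])
print(pruned.feature is None)
```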

5) ANN
ANNs are computer systems modeled after the biological neural networks that make up animal brains. An ANN is made up of artificial neurons, a collection of linked units or nodes that resemble the neurons in a biological brain [68]. The neurons were grouped into layers, and the best possible match was made for each input layer to form a single group. Signals went from the first layer (the input layer) to the last layer (the output layer) by going through the middle layer (the hidden layer). Neurons were assigned a threshold at which the signal was only transmitted once the aggregate signal exceeded it. The process was repeated many times till the convergence was achieved.

6) BPNN
The BPNN was created to solve the challenge of multi-layer perceptron training. The addition of a differentiable transfer function at each node of the network and the use of error backpropagation to adjust the internal network weights after each training period were the BPNN's key innovations. During the iterations, backpropagation fine-tuned the weights of every neuron based on the error rate obtained in the previous epoch. Proper tuning of the weights ensured lower error rates, thus making the BPNN more reliable by improving its generalization. Because of its capacity to construct complicated decision boundaries in the feature space, the BPNN was chosen as a classifier by Hornik et al. [69].
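The weight-update mechanism can be illustrated with a deliberately tiny network: one input, one tanh hidden neuron, and one linear output. This is a sketch under stated assumptions rather than the configuration used in the study; it updates the weights after every sample instead of once per training period, and the initial weights, learning rate, and data are hypothetical.

```python
import math


def train_tiny_bpnn(samples, epochs=200, lr=0.1):
    """Train y = w2 * tanh(w1 * x + b1) + b2 with error backpropagation."""
    w1, b1, w2, b2 = 0.5, 0.0, 0.5, 0.0  # fixed start for repeatability
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            # Forward pass through the differentiable transfer function.
            h = math.tanh(w1 * x + b1)
            y = w2 * h + b2
            err = y - target
            total += 0.5 * err * err
            # Backward pass: propagate the error from output to hidden layer.
            grad_w2 = err * h
            grad_b2 = err
            grad_pre = err * w2 * (1.0 - h * h)  # derivative of tanh is 1 - h^2
            grad_w1 = grad_pre * x
            grad_b1 = grad_pre
            # Gradient-descent update of every weight.
            w2 -= lr * grad_w2
            b2 -= lr * grad_b2
            w1 -= lr * grad_w1
            b1 -= lr * grad_b1
        losses.append(total / len(samples))
    return (w1, b1, w2, b2), losses


# Hypothetical normalized input and target pairs.
data = [(0.0, 0.1), (0.25, 0.3), (0.5, 0.5), (0.75, 0.7), (1.0, 0.9)]
weights, losses = train_tiny_bpnn(data)
print(f"first-epoch loss {losses[0]:.4f}, last-epoch loss {losses[-1]:.4f}")
```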

2) RFC
RFC is a data-classification approach that uses randomly filtered data [74]. The filter uses the training dataset with a specific structure [75]. RFC was used to train the M5P, RF, RT, and REPT base learners to predict WQI, similar to how the bagging and CVPS algorithms were trained, resulting in four hybrid algorithms: RFC-M5P, RFC-RF, RFC-RT, and RFC-REPT. The validation dataset was run through the filter without affecting its structure to ensure the algorithm was of good quality. Several random seeds were used to create each base classifier from the same data. The result was the average of the classifiers' predictions. Subsequently, the class was utilized to construct a random classifier committee. The committee members were then categorized, and the randomizable interface was implemented.

3) ANFIS
The Adaptive Neuro-Fuzzy Inference System (ANFIS) combines two techniques, i.e., fuzzy logic [18] and ANNs, into a single unit. Because of this combination, the algorithm handles large, complex data structures quickly and efficiently, which shortens the execution time. First, the ANFIS mapped input characteristics to input membership functions (MFs) and then the input MFs to a set of if-then rules. Subsequently, the rules were converted to a set of output characteristics, and the output characteristics to output MFs. Lastly, the output MFs were transformed into a single-valued output or a decision associated with the output.

V. COMPARISON AND ASSESSMENT OF ALGORITHMS
Six statistical metrics were used to analyze the algorithms quantitatively. These metrics have been used in the past by several researchers to assess the performance of data mining algorithms. The metrics used include the root mean square error (RMSE) [1], coefficient of determination (R²) [76], mean absolute error (MAE) [77], Nash-Sutcliffe efficiency (NSE) [78], percentage of bias (PBIAS) [79], and percent of relative error index (PREI) [80]. RMSE is the square root of the mean squared difference between the actual and predicted values. The greater the RMSE, the higher the error in the model. MAE is the mean of the absolute errors between all actual and predicted values. The lower the mean error, the more reliable the prediction model.
Similarly, R² depicts the fit of the model against the actual values. A higher R² means a high correlation between actual and predicted values, and the model generates good results. NSE compares the relative magnitude of the residual variance with the measured data variance. Its values range from negative infinity to 1, where 1 means perfect prediction of values, and values close to 1 show higher accuracy. PBIAS indicates whether the predicted data are overestimated or underestimated relative to the actual dataset. Its optimal value is 0, which means perfect estimation, and values lower or higher than 0 mean that the predictions are underestimated or overestimated. Finally, PREI calculates the error percentage. The higher the ratio, the higher the error. Overall, these metrics give information on how accurate each model is and what type of limitation it has, e.g., the model provides an overestimated prediction or the model does not fit well. These metrics are calculated using the following relations from Breiman et al. [79] and Breiman [80].
where WQI_predicted and WQI_measured are the predicted and measured WQI mean values, respectively. Visual comparisons were also performed to evaluate the algorithms. Scatter plots and box plots were the two approaches used for visual comparisons. Scatter plots are frequently used to assess algorithm performance and study the distribution of the datasets employed [81], [82]. For example, scatter plots are used to study the data organization and density. Box plots are a standard tool for assessing the density and distribution of datasets and findings. The datasets or results are separated into four quartiles. It is possible to look at extreme values (minimums and maximums), medians, and first (lower) and third (upper) quartiles. Such a boxplot helps to understand how all these models calculate the WQI and over what ranges, which is used to compare the accuracy and overall results among all models.
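Because the metric equations are not reproduced in this text, the following Python sketch uses the commonly cited definitions of RMSE, MAE, NSE, and PBIAS; these are assumptions and may differ in detail from the relations used by the authors. In particular, PBIAS is written here as 100 * sum(measured - predicted) / sum(measured), so that negative values indicate overestimation; the opposite sign convention is also in use. PREI is omitted because its definition is not given here. The sample values are hypothetical.

```python
import math


def rmse(measured, predicted):
    """Root mean square error."""
    n = len(measured)
    return math.sqrt(sum((p - m) ** 2 for m, p in zip(measured, predicted)) / n)


def mae(measured, predicted):
    """Mean absolute error."""
    return sum(abs(p - m) for m, p in zip(measured, predicted)) / len(measured)


def nse(measured, predicted):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over measured variance."""
    mean_m = sum(measured) / len(measured)
    residual = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    variance = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - residual / variance


def pbias(measured, predicted):
    """Percent bias; negative means overestimation under the assumed convention."""
    return 100.0 * sum(m - p for m, p in zip(measured, predicted)) / sum(measured)


# Hypothetical measured and predicted WQI values.
measured = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 73.0, 79.0]
print(f"RMSE  = {rmse(measured, predicted):.3f}")
print(f"MAE   = {mae(measured, predicted):.3f}")
print(f"NSE   = {nse(measured, predicted):.3f}")
print(f"PBIAS = {pbias(measured, predicted):.3f}")
```

An NSE of 1 and a PBIAS of 0 would correspond to a perfect prediction, which is why both are read alongside RMSE and MAE rather than in isolation.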

VI. RESULTS AND DISCUSSIONS
Following the holistic method adopted in this study, the results of the pertinent analyses are presented as follows:

A. THE IDEAL INPUT COMBINATION
Different input combinations based on CCs were constructed using a variety of WQ characteristics, as presented in Table 5. pH emerged as the least relevant predictor of WQI when using the previously presented equations to calculate it. The same has been indicated by [83] and [13]. On the other hand, pH was the most critical predictor of WQI in the research by Mohammadpour et al. [84], which is the opposite of this study's findings. The 16 algorithms were trained using the ten input combinations discussed previously. A testing dataset was used to assess these combinations, as presented in Table 6, and the most effective combination was chosen for modeling and further study. The training results reveal how well the algorithms fit the training dataset; because all models were built on this dataset, these data points were not used in the algorithms' evaluation. The best possible combination was instead identified based on the RMSE of the testing data (Table 6). From the testing dataset, it can be seen that, on average, the combination with five input variables has the lowest RMSE. Combinations with more than five variables are also close to the lowest RMSE, which indicates that adding more variables does not contribute much to the error. However, combinations with fewer than five variables have relatively high RMSE, which is understandable because fewer inputs provide less information and therefore produce larger errors.

B. ALGORITHM'S PERFORMANCE
The 16 algorithms were tested (Figures 3 and 4), with the predicted WQI compared against the measured WQI for each model at each testing data point. All of the algorithms performed well, and RT-ANN, BA-RT, RF, BA-RF, and BA-M5P had the highest prediction power. Figure 3 shows how the measured and predicted WQI differ at each testing data point and how large the difference is. No significant deviation between measured and predicted values can be seen, and no systematic error pattern can be identified in any model, so overall, all models gave reliable results. Figure 3a-p depicts the variation between the predicted and measured values, while Figure 4a-p depicts the fit of each model by plotting the measured WQI against the predicted WQI. Figure 4a-p shows that all models fit well, as most data points fall near the straight line that represents a perfect fit. The RF algorithm has the minimum error among the standalone algorithms. The error ranges for RT and REPT were between ±10; these algorithms failed to estimate the results accurately. The predictive value of standalone algorithms was improved by hybrid algorithms, notably the bagging algorithm (compare Figure 4a with e, c with g, and d with h). The RFC-RT, RFC-M5P, CVPS-M5P, M5P, and BA-M5P algorithms are highly accurate at predicting the maximum WQI values, as shown by the box plots of measured and estimated WQI values. Only RFC-RF correctly estimated the lower values (see Figure 5).

PREI, which evaluates the tendency of algorithms to over- or underestimate the WQI, was used to analyze the results, as shown in Figure 5. Although it has been established that all models predict reliable WQI, it must still be checked whether each model overestimates or underestimates the outcome. Only then can model accuracy be judged (a model is accurate when it predicts close to the actual values, i.e., has a lower RMSE value). If all the values are overestimated or underestimated, something is wrong with the model, and it needs refinement or tuning. This overestimation or underestimation can be assessed through the PREI calculation. Figure 5 shows that all the models have PREI values close to zero.
Further, all models have different PREI values at each testing datapoint, which means that the models performed reasonably accurately and did not predict biased values, i.e., systematically overestimated or underestimated ones. RT-ANN, BA-RT, and BA-RF performed well in the PREI analysis, as their PREI values are close to zero; a PREI range of ±10 is usually considered acceptable. Nevertheless, directly analyzing the algorithms' predictions to compare their effectiveness has drawbacks. Algorithms with stronger prediction power are easy to spot, but determining the optimal algorithm and ranking the algorithms by success is difficult.
As a result, quantitative data that give more substantial evidence of each algorithm's performance are required, as presented in Table 7. Boxplots summarize a dataset's central tendency, range, and overall distribution. Hence, boxplots were used to compare the predictions of the models across the whole dataset, as shown in Figure 6.
The boxplots show that all models have almost the same range and a similar distribution, except RT and BA-RF, which have wider ranges and distributions. Since the difference is minimal, these cannot be declared outliers. Furthermore, the boxplots show that the best models are RT-ANN, BA-RT, RF, BA-RF, BA-M5P, and M5P, while RFC-RT, RFC-REPT, and BPNN predicted the lowest values and have relatively lower accuracy. The hybrid RT-ANN (R² = 0.951) had the highest prediction success, while the BPNN (R² = 0.752) had the lowest. The RT-ANN algorithm had the best MAE (2.284) and the lowest RMSE (2.319). An algorithm has excellent prediction ability when the NSE is 0.75 to 1 [85].
As a result, all algorithms performed well, but RT-ANN outperformed the competitors (NSE = 0.945). According to the PBIAS metric, all algorithms except BA-RF, RF, BA-REPT, REPT, RFC-M5P, RFC-REPT, and ANN-ANFIS overestimated WQI. Based on their performance results, the algorithm's ranking from best
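The evaluation metrics quoted in this section (RMSE, MAE, NSE, and PBIAS) can be illustrated with a minimal sketch. The formulas below are the commonly used definitions and are an assumption here, since the section does not restate them; in particular, the PBIAS sign convention (negative values indicating overestimation) is assumed because it agrees with the reported results. The WQI values are hypothetical and serve only to exercise the functions.

```python
import math

def rmse(obs, sim):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

def mae(obs, sim):
    """Mean absolute error between measured and predicted values."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0.75 to 1 is rated excellent."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance

def pbias(obs, sim):
    """Percent bias. Assumed convention: negative means the model overestimates."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)

# Hypothetical measured and predicted WQI values (not taken from the study).
measured = [52.0, 61.0, 70.0, 48.0, 66.0]
predicted = [54.0, 60.0, 73.0, 50.0, 67.0]

print("RMSE :", round(rmse(measured, predicted), 3))
print("MAE  :", round(mae(measured, predicted), 3))
print("NSE  :", round(nse(measured, predicted), 3))
print("PBIAS:", round(pbias(measured, predicted), 3))
```

In this toy example the predictions are, on balance, higher than the measurements, so PBIAS comes out negative under the assumed convention.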

C. DISCUSSION
To forecast WQI along the KKH stretch from Gilgit to Khunjerab Pass, six standalone algorithms (the tree-based M5P, RF, RT, and REPT, and the neuron-based ANN and BPNN) were used in this study. In addition, ten new hybrid algorithms were created by merging the standalone algorithms with the BA, CVPS, and RFC algorithms. The sixteen algorithms were compared in terms of performance. Previously, researchers [12], [21], [23] have compared the predictive power of several standalone tree-based algorithms with that of the neuron-based algorithm (ANFIS), which was hybridized with meta-heuristic optimization techniques. The findings of these studies [12], [21], [23] show that standalone neuron-based algorithms have low prediction capacities due to significant flaws, and that hybridization can considerably improve their forecasts. Their results also show that standalone tree-based algorithms perform similarly to ANFIS hybridized with meta-heuristic optimization, which outperforms tree-based algorithms in prediction power.
In the current study, hybrid algorithms improved the performance of some standalone tree-based algorithms, but not all. On their own, tree-based algorithms offer high predictive potential. For example, the best algorithm in a relevant study was BA-RT, with an R² value of 0.941, while in this research, the best algorithm is RT-ANN, with an R² of 0.951. Overall, the comparison shows an improvement in the algorithms; e.g., the published M5P has an R² value of 0.923, whereas M5P in this research reaches 0.929.
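The idea behind the bagging (BA) hybrids discussed above can be sketched as follows: several copies of a base learner are trained on bootstrap resamples of the training data, and their predictions are averaged. This is only an illustration under stated assumptions, not the implementation used in this study. A one-split regression stump on a single input stands in for the tree-based base learners, and the function names (`fit_stump`, `fit_bagging`) and data values are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on a single input variable."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All inputs identical in this resample: fall back to the mean.
        mean_y = sum(ys) / len(ys)
        return lambda x: mean_y
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagging(xs, ys, n_estimators=25, seed=0):
    """Train stumps on bootstrap resamples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

# Hypothetical data: one pollutant concentration versus WQI.
xs = [10, 20, 30, 40, 50, 60, 70, 80]
ys = [85, 80, 72, 65, 58, 50, 44, 40]

single = fit_stump(xs, ys)
bagged = fit_bagging(xs, ys)
for x in (15, 45, 75):
    print(x, round(single(x), 2), round(bagged(x), 2))
```

A single stump can output only two distinct values, whereas the bagged ensemble averages many differently placed splits and therefore follows the trend in the data more smoothly. This is one intuition for why bagging tends to help unstable base learners.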
Apart from the structure of an algorithm, determining the appropriate mix of input variables is one of the most critical influences on performance. Because the variety of point and non-point sources of pollution generates nonlinear interactions between factors and WQ, the impact of combining variables on the result varies from catchment to catchment. Some studies failed to consider alternative variable combinations while determining the optimum set. Other researchers added all factors at the same time [6]. Similarly, some researchers used different approaches to pick the optimal input variables, such as multiple linear regression (dependent on CC) [41].
The current study shows that different input combinations lead to distinct outcomes. Therefore, different variable input combinations should be tried to increase performance and select the most effective set. Each algorithm may have its own ''best'' combination. The outcomes are determined by the structure of each algorithm and by how well the dataset (its structure and distribution) fits that structure. As mentioned earlier, the newly proposed hybrid algorithms performed better than the existing algorithms by at least 2%. RF is an exception: the standalone RF model has higher accuracy than its associated hybrid models. Hybrid models usually perform better, but in this study, RF alone does, although the difference is not large.
To simulate WQI, Sahoo et al. [86] utilized ANFIS, and Yaseen et al. [13] employed a hybrid ANFIS. According to our findings, all standalone and hybrid algorithms produced better WQI predictions than any previous algorithm examined for WQI prediction. Hence, based on the results, these algorithms can be used in any part of the world for WQI estimation and prediction. These algorithms can handle large long-term datasets and lower the cost of WQI estimation, as only the WQ parameters are needed for the algorithms to predict the WQI. The inputs of the algorithms used in this research can be modified to accommodate the differing conditions of other regions, or the same variable combinations may be retained.
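The input-selection step discussed above, in which candidate input combinations are derived from correlation coefficients (CCs), can be sketched as follows. The sketch ranks each WQ variable by the absolute Pearson CC with the WQI and then builds nested input sets (the best variable alone, the best two, and so on). The nesting scheme is an assumption made for illustration, and the variable names and sample values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranked_input_sets(features, target):
    """Rank variables by |CC| with the target and return nested input sets."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return [ranked[:k] for k in range(1, len(ranked) + 1)]

# Hypothetical samples of three WQ variables and the corresponding WQI.
samples = {
    "fecal_coliform": [10, 20, 30, 40, 50],
    "BOD": [2.0, 2.5, 2.2, 3.5, 3.9],
    "pH": [7.1, 7.4, 7.0, 7.3, 7.2],
}
wqi = [80, 72, 61, 50, 41]

for input_set in ranked_input_sets(samples, wqi):
    print(input_set)
```

Each resulting input set would then be used to train and score every algorithm, so that the most effective combination can be selected per algorithm rather than assumed in advance.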

VII. CONCLUSION
This study investigated the performance of six standalone algorithms (RT, REPT, RF, M5P, ANN, and BPNN) and ten hybrid data-mining algorithms (hybrids of the standalone algorithms with CVPS, RFC, and BA) for forecasting the WQI in Northern Pakistan. The goal was to develop algorithms for WQI prediction and assess the WQ in the study area. According to the modeling procedure, the most influential variable for the WQI was fecal coliform concentration, followed in order of relevance by BOD, NO3, DO, EC, COD, PO4, turbidity, TS, and pH. It was discovered that different variable combinations led to varying degrees of algorithm performance. The predictive power was best when the variables with the highest CCs were used as inputs. Low-CC variables have a detrimental impact on predictive power. Compared to the standalone algorithms, the hybrids demonstrated an enhanced prediction accuracy rate (i.e., better than that of the standalone algorithms) as they have > 0. 9

A. PRACTICAL AND RESEARCH IMPLICATIONS
This research compares the implementation of new and existing algorithms for WQI assessment. It is important to note that these algorithms can give stable outputs with a short-term dataset. The stability can, however, be increased with a longer-term dataset. As a result, these algorithms may be highly efficient in emerging areas with minimal measuring networks or where gauging networks have only recently been constructed. According to our results, the recommended RT-ANN algorithm appears practical and cost-effective for assessing WQI in Northern Pakistan. In the future, relevant research can be conducted using the proposed algorithms in developed and developing countries. The proposed algorithms can be especially beneficial in underdeveloped nations, since the costs of testing various WQ parameters are large and may generally be unaffordable. However, local climatic variations need to be considered before applying this algorithm.
Moreover, the research outcomes can be valuable for water management authorities, who can take preventative measures to safeguard against the leaching of detrimental pollutants and chemicals into the water resources, thus ensuring a relatively better WQ.

B. LIMITATIONS AND FUTURE PROSPECTS
This research has some limitations that point to potential future research areas. Firstly, the datasets used in this research were based on two years of sampling, a comparatively small sample, so long-term analysis was not possible. The performance of these algorithms on long-term datasets, such as data covering the last decade, can be investigated in the future. Secondly, the important WQ parameters COD and BOD were not considered in the present study due to some practical limitations; it would be valuable to consider them in future research. Moreover, this research used statistical and ML algorithms, which provided highly accurate results; it would be beneficial to use deep learning algorithms, for instance, convolutional neural networks, to cross-check the results and compare them with this study. Further, in addition to the correlation tests, other analyses, such as PCA, should be conducted in the future.
He has published more than 90 research papers in peer-reviewed international journals and conferences and has authored two book chapters.
ALI HASSAN CHEEMA received the B.E. degree in civil engineering from COMSATS University Islamabad, Pakistan, in 2022. He worked on several projects related to town planning at the local level. His research interests include data analytics tools, data modeling, and town planning.
FAHIM ULLAH received the Ph.D. degree from the School of Built Environment, University of New South Wales (UNSW), Sydney, Australia. He is currently a Senior Lecturer in construction project management at the University of Southern Queensland (UniSQ), where he was a Casual Lecturer for four years. He also taught various courses in project management at The University of Sydney as a Lead Lecturer. Previously, he worked for three years as a Lecturer at the National University of Sciences and Technology (NUST), Pakistan, where he taught the courses of construction engineering and management and project management at three schools. Further, he has more than two years of industry experience as an Assistant Manager (Planning) and a Planning Engineer. His research interests include construction management, project management, digital built environment, digital technologies, and disruptive innovation. He has been awarded multiple research grants and best paper awards. He has published more than 70 high-quality research articles on construction, projects, smart cities, real estate, and property management. In addition, he has edited multiple special issues in Q1 journals related to digital disruptions in the built environment and industry 5.0 technologies.
ABDULLAH ALHARBI received the Master of Science degree in information technology from the Rochester Institute of Technology, Rochester, NY, USA, and the master's degree in information assurance and cybersecurity and the Ph.D. degree in computer science from the Florida Institute of Technology, Melbourne, Florida, USA. He is currently an Assistant Professor in computer science at King Saud University (KSU), Riyadh, Saudi Arabia. He is also the Dean of the College of Applied Computer Sciences, KSU, Muzahmiyah Branch. He is also the CEO of the Information Security Association (Hemaya), a non-profit organization. He is also a Research Fellow at the Center of Excellence for Information Assurance, KSU, where he was the Department of Administrative Sciences Chair at the Community College. He also got an Information Assurance and Cybersecurity Graduate Certificate from the Florida Institute of Technology. His research interests include wearable devices security, transparent and continuous security, alternative authentication, usable security, and behavioral biometrics.
MUHAMMAD IMRAN (Member, IEEE) is currently working as a Senior Lecturer with the School of Science, Engineering and Information Technology, Federation University Australia. Previously, he worked as an Associate Professor with King Saud University (KSU), Saudi Arabia. His research interests include mobile and wireless networks, the Internet of Things, big data analytics, cloud/edge computing, and information security. He was the Founding Leader of the Wireless Networks and Security (WINS) Research Group, KSU, from 2013 to 2021. His research is financially supported by several national and international grants. He has completed several international collaborative research projects with reputable universities. He has published more than 300 research papers in peer-reviewed, highly-reputable international conferences (90), journals (198), editorials (15), book chapters (one), and two edited books. Many of his research articles are among the highly cited and most downloaded. His research has been cited more than 11,500 times with an H-index of 55, and an i-10 index of 175 (Google Scholar). He has received a number of awards and fellowships.
He served as an Editor-in-Chief for European Alliance for Innovation (EAI) Transactions on Pervasive Health and Technology and an Associate Editor for IEEE Communications Magazine. He is serving as an Associate Editor for top-ranked international journals, such as IEEE Network, Future Generation Computer Systems, and IEEE ACCESS. He has served or is serving as a Guest Editor for about two dozen special issues in journals, such as IEEE Communications Magazine, IEEE Wireless Communications Magazine, Future Generation Computer Systems, IEEE ACCESS, and Computer Networks. He has been involved in about 100 peer-reviewed international conferences and workshops in various capacities, such as the chair, the cochair, and a technical program committee member. He was consecutively awarded Outstanding Associate Editor of IEEE ACCESS, in 2018 and 2019, besides many others. VOLUME 10, 2022