Empirical Comparison of Approaches for Mitigating Effects of Class Imbalances in Water Quality Anomaly Detection

Imbalanced class distribution and missing data are two common problems in the water quality anomaly detection domain. Learning algorithms trained on an imbalanced dataset can yield an overrated classification accuracy driven by a bias towards the majority class at the expense of the minority class. On the other hand, missing values in data can induce complexity in the learning classifiers during data analysis. These two problems pose substantial challenges to the performance of learning algorithms in real-life water quality anomaly detection problems. Hence, they need to be carefully considered and addressed to achieve better performance. In this paper, the performance of several combinations of techniques for dealing with imbalanced classes, in the context of a binary-imbalanced water quality anomaly detection problem and in the presence of missing values, is extensively compared. The methods considered include seven missing data methods and eight resampling methods, applied to ten different state-of-the-art learning classifiers chosen for the diversity of their learning philosophies. The different classifiers are evaluated using stratified 5-fold cross-validation, based on three performance evaluation metrics, namely accuracy, ROC-AUC and F1-measure. Further experiments are carried out on nineteen variants of homogeneous and heterogeneous ensemble techniques embedded with resampling and missing value strategies during their training phase, as well as on an optimized deep neural network model. The experimental results show an improvement in the performance of the learning classifiers, especially when dealing with the class imbalance problem (on the one hand) and the incomplete data problem (on the other hand). Furthermore, the neural network model exhibits superior performance when dealing with both problems.


I. INTRODUCTION
There is a consensus that easy public access to water of good quality leads to improved health and living conditions, and has a direct impact on the economy and national security of countries. Furthermore, given the massive amount of data currently generated by water utilities and the impact of the water industry on the lives of people [1], there is a need to implement better ways of water quality monitoring and prediction based on new and advanced technologies such as new and enhanced machine learning and data mining techniques [2]. Imbalanced class distribution (ICD) and missing values (MV) in data are two common problems in data analysis that are synonymous with data quality issues [3]-[5]. MV and ICD continue to be prevalent in numerous real-world problems and across many application areas [6], [7], including the water quality anomaly detection domain. Consequently, they have continued to attract considerable attention from researchers, because the majority of conventional predictive machine learning algorithms are not developed to handle these challenges in data: they assume completeness of data and a balanced class distribution [8], [9]. As a result, predictive or classification algorithms perform sub-optimally on these kinds of datasets if the problems are not properly handled, resulting in biased, inaccurate and low-quality predictive performance of the classifiers [7], [8], [10]. Moreover, previous investigations of missing values and imbalanced class distribution challenges in water quality anomaly detection have taken place mostly in isolation, even though their harmful effects on the classifiers' performance are well acknowledged in many research works.
The associate editor coordinating the review of this manuscript and approving it for publication was Xinyu Du. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Although some authors have experimented with a combination of both methods, their effects and interactions on learning algorithms were not fully showcased. This study aims to demonstrate that considering the effects of both missing values and imbalanced class distribution in water quality anomaly detection offers a means of better understanding the problems, and thereby offers ways of mitigating their effects on the performance of learning algorithms. This study also experimentally verifies our hypotheses that these two problems harm the performance measures of learning classifiers, which is a reason to consider them in tandem given their prevalence in the examined domain. Additionally, the imbalanced class problem cannot be considered solved without addressing the challenge posed by missing values in data, since missing values also harm the performance of learning classifiers and often occur together with an imbalanced class distribution. Class imbalance is a term that refers to a dataset that has an uneven distribution of classes, whereby one or more of the classes have a larger number of instances than the others. In a binary-class scenario, for example, the class with the most frequent instances is referred to as the majority class, while the class with the rarest instances is the minority class. Anomalies in water quality are rare events of interest in real life, but predicting these rare events from an imbalanced learning perspective using traditional machine learning approaches poses significant challenges to researchers [11]. Reasons for such challenges include the inability of traditional classifiers to cope with imbalanced scenarios, as a result of problems that include the treatment of rare events as noise, the overlapping of minority and majority classes, biases towards the majority class induced by performance metrics such as accuracy, and disjuncts in imbalanced data that entail a small sample size with a high feature dimensionality [12].
Missing data are prevalent in almost all research domains that rely on sensors for data generation, such as monitoring the quality of water in an urban water distribution network. Missing data is a term that refers to the absence of values or observations that are usually anticipated to be present in a dataset [4]. Missing data in a water quality anomaly detection dataset can arise for several reasons, such as faulty sensor readings and measurement errors that could result from a low signal-to-noise ratio during digital signal processing, mistakes and mishandling of data during generation or reporting by personnel, and sometimes the outright deletion of data [13]. Approaches to handling missing data, using statistical or machine learning methodologies, are well documented in the literature [14]. The process of replacing or substituting missing values is collectively termed missing data imputation [13]. Missing values, if not adequately addressed, induce an element of complexity into data analysis, and not only affect the performance of machine learning (ML) algorithms but also reduce the value that could be derived in terms of accurately detecting an anomaly in the water distribution system [13].
The importance of data reliability and completeness in ensuring data quality cannot be overemphasised, particularly due to their relevance in enhancing the predictive performance of learning algorithms and, by extension, the value of the information that can be derived from the data [5]. These reasons motivate researchers to address both problems, in order to produce more reliable outcomes or conclusions that can be inferred from a dataset [7], [10]. Numerous strategies or methods have been reported in scholarly works to deal with incomplete data and class-imbalance problems [6], [7].
Several solutions have been proposed in the literature for dealing with water quality anomaly detection. For example, a study using tree-based ensemble approaches is carried out in [15]. Several machine learning and deep learning approaches to anomaly detection in water quality based on time-series data are examined in [16]. Multi-objective machine learning for feature selection on support vector machines and ensemble generation on decision trees is proposed in [17] to solve online anomaly detection of drinking-water quality on time-series data. The authors in [18] further proposed two imbalance boosting-based ensemble models, namely SMOTEBoost and RUSBoost, which use oversampling and undersampling techniques respectively to balance the training data. The authors finally applied multi-objective pruning on the base models of the ensembles to optimize the prediction and generalisation performance of their models. The authors in [19] proposed two models, namely adaptive learning rate BP neural network and 2-step isolation, and random forest, to predict water quality based on both physical and biological indicators in an urban water supply scenario. Most of these works focused on specific missing value and class-imbalance methods in dealing with the challenges in this domain. We also observe that the majority of these works focused on evaluating only the training set, preprocessed with MV and ICD methods. However, applying MV and ICD methods on the training set is one case, while evaluating the classifiers on the imbalanced test set (unseen data), with or without MV, is a different case altogether.
This work experimentally studies the mitigating effects of applying missing data and imbalanced class distribution methods in a water quality anomaly classification problem, based on two hypotheses: 1) missing data harm performance measures; 2) class imbalance harms performance measures. In this paper, we conduct an exploratory study to compare the performance of selected MV, data-level ICD and ensemble approaches previously proposed in the literature on different classifiers. Specifically, we evaluate the effect of a combination of seven MV methods (replacing missing values with zero, listwise deletion, mean, mode, missForest, expectation-maximization (EM), and multiple imputation by chained equations (MICE)) and eight resampling methods (ROS, SMOTE, ADASYN, RUS, Tomek links, RENN, SMOTE + Tomek links, SMOTE + ENN) on the performance of ten different classifiers (LR, k-NN, LDA, SVC, NB, DT, RF, AdaBoost, ANN and DNN). Furthermore, we empirically evaluate the performance of 19 static ensembles of heterogeneous and homogeneous approaches, which include bagging, boosting, stacking and their variants embedded with resampling strategies during their training phase, as well as an optimized DNN model. To the best of our knowledge, such a comprehensive experimental study of the combination of several techniques for anomaly detection in a drinking-water quality classification problem is not common. Furthermore, the comparison of these models provides useful insight into their benefits and differences. It is worth noting that the purpose of this study is not to examine all existing methods but to focus on methods that are frequently presented in the literature for our dataset problem.
The three major contributions and objectives of this study are as follows: 1) to compare the effect of 7 missing data algorithms on 10 different machine learning models; 2) to compare the effect of combining 7 missing data and 8 resampling methods on 10 different machine learning models; 3) to investigate the performance of different formulations of machine learning algorithms, including different ensembles and deep neural network models, on imbalanced class distribution and missing data. The rest of the paper is organized as follows. Section II presents a brief overview of classifiers, resampling and missing value methods. The experimental setup, dataset and performance metrics are reported in Section III. Section IV reports the experimental results, including statistical tests and the computational time complexity of the models. A discussion that summarizes the experimental findings is reported in Section V. Lastly, Section VI concludes the paper.

A. CLASSIFIER

1) CLASSIFICATION PROBLEM DEFINITION
A classifier is a mathematical function that assigns class labels to training data instances [20]. Consider a dataset whose test features $X_{\text{Test}} \in \mathbb{R}^{n \times k}$ form a matrix of $n$ test examples and $k$ features, together with a vector $y_{\text{Test}} \in \{0, 1\}^{n}$ that classifies an event as False (0) or True (1). The aim is to estimate a function $f$ such that $\hat{y} = f(x)$ minimises the misclassification, where $\hat{y}_i$ is the prediction for the $i$-th test example and $y_i$ is the $i$-th classification.
Depending on the assumptions of the learning model, classifiers can be categorized as either parametric (a fixed number of parameters for the data distribution) or nonparametric (no fixed number of parameters for the data distribution) [21]. Ten supervised learning classifiers are considered in this study. They are selected based on their appropriateness for our dataset problem and their different learning philosophies (linear, density-based, instance-based, tree-based and neural network-based models), in order to cover a broad spectrum of families of learning algorithms [22]. This ensures a robust assessment of the effects of the missing data and class-imbalance methods on the selected classifiers evaluated in this study [3], [7], [8], [14], [23].

B. RESAMPLING METHOD
Resampling methods aim to transform a dataset distribution to account for the imbalanced or skewed nature of the class labels in classification tasks, in order to arrive at a fairer and more acceptable decision boundary [6]. Numerous resampling techniques in the literature are inspired by k-NN or Euclidean distance, and can be broadly categorized into four groups: 1) over-sampling the minority class, 2) under-sampling the majority class, 3) hybrid combinations of under-sampling used in conjunction with over-sampling methods and 4) creating an ensemble with a balanced dataset [6]. Another commonly adopted categorization distinguishes methods that select the data examples to keep, methods that select the examples to delete, and hybrids of both. Since this study is concerned with a binary classification problem, throughout this paper we refer to the majority class (False), or normal state, as class 0 and to the minority class (True), or abnormal state, as class 1. In this study, the minority class 1, which is poorly represented in the data space in comparison to class 0, is the class of interest.
Class-imbalance solutions are broadly categorized into four different approaches [10], [11], [24]:

1) DATA-LEVEL
This approach addresses the class-imbalance problem by resampling the class distribution during preprocessing. Techniques that perform these class modifications are collectively known as resampling algorithms. The resampling algorithms handle class-imbalance problems by over-sampling the minority class, under-sampling the majority class, or a hybrid approach that combines over-sampling and under-sampling.
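As a minimal illustration of the data-level idea, the sketch below implements plain random over-sampling (ROS) and random under-sampling (RUS) using only the Python standard library. The function names, the toy data and the fixed seed are assumptions made for this example; they are not the implementation used in the experiments.

```python
import random
from collections import Counter


def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    # Draw (with replacement) as many extra minority rows as are missing.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = majority + minority + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, minority_label=1, seed=0):
    """Keep all minority rows and a random, equally sized subset of majority rows."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    kept_majority = rng.sample(majority, len(minority))
    keep = kept_majority + minority
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 8 normal readings (class 0) and 2 anomalies (class 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
print(Counter(y_ros))  # class counts after over-sampling
print(Counter(y_rus))  # class counts after under-sampling
```

ROS keeps every original observation but repeats minority rows, whereas RUS discards majority rows; this difference is one reason the two families can behave very differently on a heavily imbalanced dataset.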

2) ALGORITHM-LEVEL
This approach modifies the learning algorithms themselves so that they can handle an imbalanced class distribution. An example of an algorithm-level approach is an ensemble classification method that incorporates an internal resampling technique before creating the ensembles.

3) COST-SENSITIVE
The algorithms in this approach take into account the cost associated with the different class instances (minority or majority classes) by assigning each a different cost; in the process, the learning algorithm is modified to take the assigned costs into account. For example, a high misclassification cost is assigned to the minority class during the learning process to underscore its importance as the class of interest, weakening the majority class in the process. The cost-sensitive approach can either use a direct method, whereby the cost is assigned directly to the class instances, or a meta-learning approach that employs a data-level technique for preprocessing during training, or some postprocessing steps.

4) MULTIPLE CLASSIFIERS ENSEMBLE (MCE)
This approach involves combining an ensemble learning algorithm with one of the data-level, cost-sensitive or algorithm-level approaches during preprocessing.

C. MISSING DATA METHOD
A missing data method (MDM) is a form of data cleaning that usually forms part of data preprocessing [5]. MDMs can be broadly categorised into two strategies for handling incomplete data [5], [25]: 1) missing data toleration, which ignores, deletes or removes missing values in either the training or the test dataset; 2) missing data imputation, which entails filling missing values in a dataset with suitable estimated values, rather than leaving them empty. Gaining an understanding of the pattern and mechanism of missing values is a critical step that informs the type of strategy to use in any given scenario [4], in addition to knowing the percentage of missing values in the data and the sample size [23]. The mechanisms for missing values are broadly categorized into three: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [4].
The assumption on the nature or mechanism of missing values in data can be derived by understanding the data collection process, as well as through statistical investigation and testing [23]. Different missing value methods induce biases and uncertainties, particularly if the methods are based on certain assumptions concerning the missing value mechanisms. It is also worth noting that mean and mode imputations are reported in the literature to induce more uncertainty during missing value estimation than the MICE method. This is, for example, because MICE takes into account all the available information from other instances in the data and then averages the results to provide better estimates of the unknown true missing value. Various strategies for handling missing values in data exist in the literature; they include statistical, machine learning, model-based (using maximum likelihood with EM) and ensemble approaches [7]. Missing data strategies are broadly categorized into four: (1) case deletion (filling with a value, ignoring data with missing values, or deleting or dropping missing values), (2) imputation strategies (mean, median, multiple imputation and machine learning such as k-NN), (3) model-based imputation strategies (maximum likelihood with the EM algorithm) and (4) machine learning-based strategies (ensemble approach with RF) [7].
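The simplest of the strategies above can be sketched in a few lines. The example below, written with the Python standard library only, shows listwise deletion and column-wise filling with zero, the mean or the mode. The helper names and the toy readings are hypothetical, missing entries are represented by `None`, and the more advanced methods (missForest, EM, MICE) are deliberately not reproduced here.

```python
from statistics import mean, mode


def listwise_delete(rows):
    """Drop every row that contains at least one missing value (None)."""
    return [r for r in rows if None not in r]


def impute_column_wise(rows, strategy):
    """Fill None entries column by column.

    `strategy` is a callable that maps the observed values of one column
    to the single value used to fill that column's gaps.
    """
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        fills.append(strategy(observed))
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


# Toy sensor readings (hypothetical): two features, two missing entries.
rows = [
    [6.0, 0.2],
    [None, 0.2],
    [6.0, None],
    [9.0, 0.5],
]

zero_filled = impute_column_wise(rows, lambda observed: 0.0)
mean_filled = impute_column_wise(rows, mean)
mode_filled = impute_column_wise(rows, mode)
complete_cases = listwise_delete(rows)

print(mean_filled[1])   # first column gap replaced by the column mean
print(mode_filled[2])   # second column gap replaced by the column mode
print(complete_cases)   # only the rows without gaps remain
```

Each of these single-value strategies ignores the relationships between features, which is the intuition behind the higher uncertainty reported for mean and mode imputation relative to MICE.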
With the increased availability of computational resources, more complex and advanced missing data techniques have become available. A recent development is the use of computational and artificial intelligence approaches as candidate solutions to the missing data recovery task, addressing identified disadvantages of the well-known missing data methods, such as low prediction performance. In [26], the authors proposed a computational intelligence technique, the non-iterative neural-like structures of the Successive Geometric Transformation Model (SGTM), to handle missing data. The authors reported an increased estimation accuracy in comparison with the arithmetic mean algorithm. The authors in [27] proposed a solution to missing values using the General Regression Neural Network (GRNN). The method is reported to show improved accuracy when compared to previous methods. To further improve the accuracy of missing value estimation, specifically in data collected through IoT devices, the authors in [28] proposed an ensemble method (GRNN-SGTM) that combines the GRNN network and the SGTM neural-like structure. The ensemble method was shown to be more effective than the standalone GRNN and SGTM methods. The combination of neural networks and evolutionary computing has also been well studied in the literature as an effective way of estimating missing values [29]. In our view, the imputation ability of one method over another seems highly problem dependent. Secondly, the more advanced methods require greater computational resources to complete their operations, which may not be justifiable given that our dataset has less than 1% MV in both the training and test sets [23]. Hence, the focus of this work is on testing the most popular MV methods. For now, we leave the use of more complex techniques for estimating missing values to future work.

III. EMPIRICAL FRAMEWORK

A. EXPERIMENTAL SETUP
The experiments are conducted using 'SPyDER' (Scientific Python Development EnviRonment) on the Anaconda Python distribution environment. In this paper, the aim is to combine three approaches and observe the effects of this combination on an anomaly detection problem in drinking-water quality classification with imbalanced class distribution and incomplete values in data. Hence, the experimental simulation is a five-way repeated-measures strategy, which allows the main effect factors (10 classifiers, 7 missing data methods, 8 resampling methods, stratified 5-fold cross-validation [20] and 3 performance metrics) to be evaluated against their interaction with the random effect factor (one dataset). The experiments are conducted on a system with an Intel Xeon CPU @ 3.20 GHz and 16 GB RAM. The default settings of the examined classifiers were kept throughout the experiments. Fig. 1 depicts the general framework followed in conducting the experimental data analysis. The MV and ICD methods considered in this paper were selected using their popularity and citation rates as criteria. All the methods used in this study are listed and summarized in Table 1. They comprise seven missing value methods, eight resampling methods (three oversampling, three undersampling, and two hybrids), ten classifiers, nineteen variants of homogeneous and heterogeneous ensemble methods and one optimized deep neural network model.
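To make the role of stratification concrete, the following sketch builds stratified folds with the Python standard library only. It is an illustrative assumption of how such a split can be formed, not the code used in the experiments; the function name, the toy label vector and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(y, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class proportions mirror those of y."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin so every fold gets its share.
        for position, i in enumerate(indices):
            folds[position % k].append(i)

    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Toy labels (hypothetical): 80 normal observations and 10 anomalies.
y = [0] * 80 + [1] * 10
splits = list(stratified_kfold_indices(y, k=5))
for train_idx, test_idx in splits:
    anomalies_in_test = sum(y[i] for i in test_idx)
    print(len(train_idx), len(test_idx), anomalies_in_test)
```

Without stratification, a random split of a heavily imbalanced dataset can leave some folds with very few or no minority examples, which makes fold-level F1-measure estimates unstable.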

B. DATASET
The dataset used in all our experiments is obtained from the GECCO 2018 industrial challenge project [1], sourced from Thüringer Fernwasserversorgung, a public water utility company located in Germany. The dataset is time-series based and is made up of ten independent variables and one dependent variable. The instances have null values in all the columns except the 'Time' and 'EVENT' columns. We assume that the data are missing completely at random (MCAR), which implies that the probability of a value being missing is the same for all observations; that is, there is no relationship with other data, present or missing, that makes an observation more likely to be missing. Moreover, on inspection of the dataset, we observe that the missing data are all within a certain range. The dataset defines a classification problem intended for drinking-water quality anomaly detection, in which the goal is to predict whether or not there is an event. 'EVENT' is the dependent variable to be predicted, as either 'True' or 'False'. The training and test dataset components are summarized in Table 2. The majority of the data belongs to the majority class 0 (False), whereas the remainder belongs to the minority class 1 (True). The data was collected continuously for over 98 days between 03/08/2016 and 13/02/2017, at an interval of 60 seconds between readings. The dataset has a time-series variable that was not included in the current study for two reasons. Firstly, the goal of this paper is to investigate the mitigating effects of MV and ICD methods on learning classifiers. Secondly, time-series analysis on this dataset has been addressed in previous studies such as [15], [16].

C. PERFORMANCE METRICS FOR IMBALANCED CLASS PROBLEM
Performance metrics for comparing experimental results in imbalanced classification problems are fundamental in discovering the quality of relationships between the data and the predicted event targets [58], [59]. In other words, a performance metric aims to show how well a learning algorithm can predict given some data observations. The most commonly accepted performance evaluation measures in imbalanced classification problems are accuracy, sensitivity, specificity, precision, recall, balanced accuracy, ROC-AUC, F1-measure, G-mean and the Matthews correlation coefficient [12]. However, the authors in [2] advocate using a combination of these different metrics to better assess the goodness of fit of models, since reliance on only one metric, such as accuracy, may be misleading: the accuracy metric is biased toward the majority class. Although there have been studies on strategies for selecting performance metrics for evaluating classifiers in an imbalanced scenario, such as [59], [60], this is beyond the scope of this study. Hence, the performance metrics ROC-AUC and F1-measure considered in this study have been selected based on their wide adoption in imbalanced classification problems, including previous studies on water quality anomaly detection, because they take class distribution into consideration. In our opinion, this approach provides a fair comparison with earlier studies in this domain.
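The bias of accuracy toward the majority class can be seen in a small worked example. The sketch below, using only the Python standard library, computes accuracy, F1-measure and a pairwise (rank-based) ROC-AUC for a hypothetical classifier that always predicts the normal class on a toy set with an 80:1 class ratio. The function names and the toy labels are assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative.

    Ties count as one half, which is the usual pairwise definition of AUC.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy test set (hypothetical): 80 normal observations and 1 anomaly.
y_true = [0] * 80 + [1]
always_normal = [0] * 81  # a classifier that never raises an alarm

print(accuracy(y_true, always_normal))    # 80/81, close to 0.99
print(f1_measure(y_true, always_normal))  # 0.0, the anomaly is never found
print(roc_auc(y_true, always_normal))     # 0.5, no better than chance
```

The classifier looks excellent under accuracy alone while being useless for the class of interest, which is why the F1-measure and ROC-AUC are reported alongside it.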

A. EXPERIMENT 1: COMPARISON OF MISSING DATA METHODS
This experiment aims to compare the effect of 7 missing data methods on 10 different machine learning models on the training set. The observations allow us to verify our hypothesis that missing values harm the performance measures of learning algorithms; in our case, the training dataset has less than 1% MV (see Table 2 above). The examined ML models were evaluated on the training data; our intuition is that this gives a reasonable estimate of the performance of the different ML algorithms on the future test set (unseen data). Table 3 presents the results when applying only the MV methods, namely replacing missing values with zero, listwise deletion, mean, mode, missForest, EM, and MICE. The results also provide a robust estimate of the performance of the different machine learning algorithms on the imbalanced training dataset. The graphical results are shown in Fig. 12 in Appendix A. Fig. 2 shows the data distribution before applying the imbalance methods, with the majority class 0 labels (False) in red outnumbering the minority class 1 labels (True) in blue. Additionally, based on the results in Table 3, the classifiers were ranked in terms of F1-measure for each MV method applied. The statistical results are outlined in Table 10 in Appendix B. We observe that RF, DT, k-NN, SVC and AdaBoost were consistently the top-5 performers in terms of F1-measure, while the neural network models (ANN and DNN), in addition to LR, LDA and NB, were the worst performers. The reason for this can be attributed to the learning philosophies of the different learning algorithms. The results also show that the neural networks are the most sensitive to the imbalanced training set, and that the high accuracy results obtained by the different ML classifiers are misleading. This observation is in line with findings reported in numerous works in the literature. All the best performing combinations are highlighted in bold.

B. EXPERIMENT 2: COMPARISON OF MISSING VALUE IN COMBINATION WITH RESAMPLING METHODS
This experiment aims to compare the effect of the combination of 7 missing data and 8 resampling methods on the 10 machine learning models. The observations allow us to verify our hypothesis that class imbalance harms the performance measures of learning algorithms; in our case, the training set has a majority-to-minority ratio of 80:1 (see Table 2 and Fig. 2). Table 4 presents the results of combining MV and resampling methods to estimate the performance of the different machine learning algorithms, while Fig. 3 shows the visual distribution maps of the training dataset after applying the different resampling methods. It is observed that the two classes overlap significantly, which is one factor in an imbalanced dataset scenario that not only affects the performance of learning classifiers but also adds complexity to the learning algorithms, for example when further oversampling the training set [61]. In this experiment, SMOTE + ENN combined with all the MV methods exhibited the best performance across all the learning models in terms of F1-measure, with low statistical variances. SMOTE + Tomek came a close second, also across all the learning models in terms of F1-measure. ROS also exhibited good performance; however, its F1-measure across the various combinations of methods had higher statistical variances than that of SMOTE + ENN, in addition to the possibility of exacerbating the effect of class overlap on the classifiers' performance [61]. Generally, we observe that all the undersampling methods (RUS, Tomek links and RENN) performed the worst across all combinations of methods, indicating their unsuitability for this dataset. In the experiments reported in Table 4, MV and resampling methods were both applied to the training set.
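For readers unfamiliar with SMOTE, the core interpolation step can be sketched as follows using only the Python standard library. This is a simplified illustration under stated assumptions: the function name, the toy minority points, the neighbourhood size and the seed are hypothetical, and the ENN cleaning step that follows SMOTE in the SMOTE + ENN hybrid is not shown.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between neighbours.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical): four anomalous readings in two dimensions.
minority_points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_sketch(minority_points, n_new=6)
for point in new_points:
    print(point)
```

Because the synthetic points are placed between existing minority points, SMOTE can push minority examples into regions occupied by the majority class when the classes overlap, which is the situation that a subsequent cleaning step such as ENN is intended to mitigate.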

C. EXPERIMENT 3: COMPARISON OF ENSEMBLE AND DNN MODELS WITH MISSFOREST AND SMOTE + ENN
Despite the significant progress achieved in machine learning research, conventional machine learning methods may not achieve satisfactory performance when dealing with imbalanced data with missing values. This is because of the inability of these methods to cope with an imbalanced dataset and their assumption of a balanced class distribution. Several studies using water quality anomaly detection (WQAD) data and traditional machine learning algorithms have previously been conducted, but challenges associated with missing values and class imbalance leave room for improvement. Ensemble learning and DNNs have proven to be more efficient approaches to imbalanced dataset problems than traditional individual classification models [62], [63].
Recently, deep learning evaluated in combination with appropriate data preprocessing techniques (missing value and resampling methods) has shown promise in improving predictive performance [63], as well as in other water quality anomaly detection studies. Motivated by these previous findings, we propose and employ a deep neural network in combination with missing value and resampling preprocessing methods to determine its effectiveness in addressing class imbalance with missing values, in comparison with the ensemble methods.
This section presents the results obtained using ensemble approaches, where we considered the pool of top-5 performing heterogeneous machine learning classifiers obtained in experiments 1 and 2, namely RF, DT, k-NN, SVC and AdaBoost (the neural network model was evaluated separately). The top-5 machine learning classifiers were used to implement the different variants of voting-based, bagging, boosting and stacking ensemble models, which were then compared to the optimized DNN model.
We used grid search to find the best hyperparameters of the optimized DNN, using the F1-measure to choose the best model. We tested the following hyperparameters: neurons =  Figure 4.
The experimental results are shown in Tables 5-7, while those of the optimized DNN are shown in Table 8. In all the experiments, the missForest MV method and the SMOTE + ENN resampling method were applied to the training set due to their better performance observed in experiments 1 and 2. The pictorial representation of the balanced training data using the SMOTE + ENN method is shown in Fig. 5. The models were all evaluated on the imbalanced test set, using accuracy, ROC-AUC and F1-measure as the performance evaluation metrics. We observe that even though the classifiers benefited from the combination of MV and ICD methods in the training set, their performance measures suffered significantly when they were evaluated on the imbalanced test set. This is attributed to the sensitivity of classifiers to the level of imbalance in the presence of class overlap, which leads to difficulty in distinguishing between the two classes because of nearly equal prior probability estimates for both classes [12].

D. STATISTICAL TEST
In this subsection, we apply statistical tests to evaluate and validate the results obtained from the experiments conducted. We have evaluated 10 classifiers on 7 missing value and 8 resampling methods, using F1-measure as the main performance score.
Recall that in experiment 1 (Table 3), ten learning algorithms were evaluated using seven different missing value methods. Similarly, in experiment 2 (Table 4), ten learning algorithms were evaluated using seven different missing value and eight resampling methods. The evaluations were based on three performance metrics, namely accuracy, ROC-AUC and F1-measure. Lastly, in experiment 3 (Tables 5-7), the top-5 machine learning classifiers from experiment 2 were used to implement the various variations of voting-based ensemble models, which were then compared to the bagging, boosting and stacking ensemble models. Based on the performance of the DNN in experiment 2, we went on to develop an optimized DNN to verify whether we could obtain a better result. The result for the optimized DNN is presented in Table 8.
Our aim here is to find out if the performance differences between the different learning algorithms are statistically significant. To assess the results obtained for each classifier, we adopt the non-parametric Friedman test proposed in [64]. The Friedman test is firstly used to evaluate the acceptance or rejection of the null hypothesis (H 0 ) that all classifiers perform equally for a given significance or risk level (alpha level). Therefore, in our case, the null-hypothesis being tested is that all the classifiers performed the same and the observed differences are merely random or by coincidence.
The Friedman test ranks the algorithms for each data set separately. The best-performing algorithm gets rank 1, the second-best rank 2, and so on. In the case of ties, it assigns average ranks. Then, the Friedman test compares the average ranks of the algorithms and calculates the Friedman statistic. If a statistically significant difference in performance is detected, we proceed with a post hoc test. We use the post hoc Nemenyi test to compare all the classifiers to each other. In this procedure, the performance of two classifiers is significantly different if their average ranks differ by more than some critical distance (CD). The critical distance depends on the number of algorithms, the number of data sets, metrics and the critical value (for a given significance level p) that is based on the Studentized range statistic [64].
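The ranking step described above can be sketched in plain Python. The function names and the small score table are hypothetical and serve only as an illustration; the statistic uses the textbook formula without a tie correction, so library implementations that correct for ties may return slightly different values.

```python
def average_ranks(scores):
    """Rank one dataset's scores: highest score gets rank 1, ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score is tied with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def friedman_statistic(score_table):
    """score_table[i][j] is the score of classifier j on dataset i.

    Returns (average rank per classifier, Friedman chi-squared statistic).
    """
    n_datasets = len(score_table)
    k = len(score_table[0])
    rank_rows = [average_ranks(row) for row in score_table]
    avg = [sum(row[j] for row in rank_rows) / n_datasets for j in range(k)]
    chi2 = 12 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in avg) - k * (k + 1) ** 2 / 4
    )
    return avg, chi2


# Hypothetical F1 scores: 4 datasets (rows) x 3 classifiers (columns).
table = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.70, 0.75, 0.50],
    [0.90, 0.90, 0.40],  # tie between the first two classifiers
]
avg_ranks, chi2_f = friedman_statistic(table)
print(avg_ranks, chi2_f)
```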
As recommended in [64], caution should be applied regarding the statistical procedure used when testing multiple classifiers on a single dataset, in order to avoid biased estimation and Type I errors, because the mean performance and variance computed from repeated random training/test samples are related. To avoid this trap, we test the ten classifiers on the seven different missing value methods and eight different resampling methods, so that the dataset distributions differ slightly from one another. The intuition is that the multiple dataset distributions created by the different missing value and resampling methods are used only to evaluate the performance measures, while the differences in performance over the independent missing value and resampled datasets provide the sources of variance and a sample of independent measurements. In this way, the statistical test takes the form of comparing multiple classifiers over multiple datasets. For comparing multiple random variables, the non-parametric rank-based Friedman test is recommended in [64].
According to Demsar, 2006 [58], let $r_i^j$ be the rank of the $j$-th of $K$ classifiers on the $i$-th of $D$ datasets. Under the null hypothesis, the Friedman test statistic is formalized in (1). $\chi_F^2$ is distributed according to the chi-squared distribution, with (K-1) and (K-1)(D-1) degrees of freedom. Given a large enough $\chi_F^2$ value, the null hypothesis that there is no difference between the classifiers can be rejected.
The average rank of the $j$-th classifier, $R_j = \frac{1}{D}\sum_{i} r_i^j$, is the quantity the Friedman test uses to compare the classifiers.
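For reference, a sketch of the statistic in the standard form given by Demsar, rewritten with the symbols used here ($K$ classifiers, $D$ datasets). Whether (1) was typeset in exactly this form is an assumption.

```latex
\[
R_j = \frac{1}{D}\sum_{i=1}^{D} r_i^j ,
\qquad
\chi_F^2 = \frac{12D}{K(K+1)}\left[\sum_{j=1}^{K} R_j^2 - \frac{K(K+1)^2}{4}\right]
\]
\[
F_F = \frac{(D-1)\,\chi_F^2}{D(K-1) - \chi_F^2}
\]
```

In that formulation, $\chi_F^2$ follows a chi-squared distribution with $K-1$ degrees of freedom, whereas the derived statistic $F_F$ follows an F-distribution with $(K-1)$ and $(K-1)(D-1)$ degrees of freedom.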
According to [64], the post hoc Nemenyi test states that the performance of two or more classifiers is significantly different if their corresponding average ranks differ by at least the critical difference. In other words, the post hoc Nemenyi test is used to report any significant differences between the individual classifiers. The critical difference (CD) is expressed using the formula in (2), where $q_\alpha$ is the critical value based on the Studentized range statistic, using significance levels of α = 0.05 and α = 0.10; K is the total number of classifiers; and D is the number of datasets used in the study, which in our case are the resampled datasets.

The average ranks of the classifiers for the two analyzed experimental scenarios, 1) interactions between classifiers, missing value and F1-measure, and 2) interactions between classifiers, missing values, resampling and F1-measure, are shown in Table 10 in the appendix. The average ranks are used to produce the graphical representation of the post hoc Nemenyi test results in Figs. 6-9. For the experiment 3 scenario, the overall average ranks of the 19 different ensemble models and the optimized DNN model, considering all three performance metrics (balanced accuracy, ROC and F-score), are shown in Table 11, also in the appendix.
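As a minimal sketch, the critical difference can be computed with the standard Nemenyi expression $CD = q_\alpha \sqrt{K(K+1)/(6D)}$. The function name is hypothetical, and the critical values for ten classifiers (2.920 at α = 0.10 and 3.164 at α = 0.05) are taken from the table published by Demsar; that these are the exact values used in this study is an assumption.

```python
import math


def critical_difference(q_alpha, n_classifiers, n_datasets):
    """Nemenyi critical difference: q_alpha * sqrt(K(K+1) / (6D))."""
    k = n_classifiers
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


# Critical values for K = 10 classifiers (assumed, from Demsar's published table).
Q_ALPHA_010 = 2.920
Q_ALPHA_005 = 3.164

# Ten classifiers compared over the seven missing value methods (D = 7).
cd_mv_010 = critical_difference(Q_ALPHA_010, 10, 7)
cd_mv_005 = critical_difference(Q_ALPHA_005, 10, 7)

# Ten classifiers compared over the eight resampling methods (D = 8).
cd_rs_010 = critical_difference(Q_ALPHA_010, 10, 8)
cd_rs_005 = critical_difference(Q_ALPHA_005, 10, 8)

print(round(cd_mv_010, 3), round(cd_mv_005, 3), round(cd_rs_010, 3), round(cd_rs_005, 3))
```

Under these assumptions, the computed values agree to rounding with the critical distances quoted for the missing value comparison (4.725 and 5.12) and the resampling comparison (4.420 and 4.789).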
For our ten classifiers, seven missing value and eight resampling methods, df = 10-1 = 9 for classifiers, df = (10-1) * (7-1) = 54 for classifiers and missing value methods, and df = (10-1) * (8-1) = 63 for classifiers and resampling methods. We first apply the Friedman test to the average ranking of the ten classifiers when applying only the missForest imputation method. We report a Friedman test statistic = 35.166 and a very small p-value = 4.002E-06. This result shows that the performances of the classifiers are statistically significantly different, since the p-value < α = 0.05, which indicates that the performances of the individual classifiers are not equivalent. Moreover, the very small p-value and the large Friedman test value provide strong evidence against the null hypothesis; hence, we reject the null hypothesis. We can thus proceed to apply the post hoc Nemenyi test to identify any significant differences between individual classifiers.
The diagrams in Figs. 6-9 show the average ranked performances of the ten classifiers together with the critical distance of the post hoc Nemenyi test. The results are presented for the significance levels α = 0.10 and α = 0.05.
Fig. 6 shows the ranked performance of the classifiers with CD = 4.725 at a significance level of α = 0.10. We see in Fig. 6 that the ANN, DNN, LR, LDA and NB classifiers are significantly different from the best-performing classifier, RF, which has the lowest rank across the missing value methods. Fig. 7 shows the average rank performance of the classifiers with CD = 5.12 at a significance level of α = 0.05. Similarly, we observe in Fig. 7 that RF is the best classifier, having the lowest rank across all the missing value methods, and is statistically better than the ANN, DNN, LR, LDA and NB classifiers, as indicated by the CD bar. In both Fig. 6 and Fig. 7, the top-5 performing classifiers appear on the right-most part of the diagram, and the least performing classifiers on the left-most part. Classifiers that do not differ significantly are connected by a bold horizontal line.
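The comparison against the best-ranked classifier can be sketched as follows. The function name, the model labels and the average ranks are hypothetical and are not the values in Table 10; only the decision rule (a rank gap of at least CD counts as significant) follows the text.

```python
def not_significantly_worse(avg_ranks, cd):
    """Return the best model and all models whose rank gap to it is below cd."""
    best = min(avg_ranks, key=avg_ranks.get)
    group = sorted(
        (name for name, rank in avg_ranks.items() if rank - avg_ranks[best] < cd),
        key=avg_ranks.get,
    )
    return best, group


# Hypothetical average ranks, for illustration only.
ranks = {"A": 1.5, "B": 3.0, "C": 6.0, "D": 8.5}
best, group = not_significantly_worse(ranks, cd=4.725)
print(best, group)
```

Models returned in `group` correspond to those that would be joined to the best model by the connecting line in a critical difference diagram.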
Next, we perform the Friedman test on the average ranking of the ten classifiers when applying the missForest missing value method combined with the eight resampling methods. We report a Friedman test statistic = 55.50 and a very small p-value = 1.187E-09. The result shows statistically significantly different performance (p-value < α = 0.05), indicating that the performances of the classifiers are not equivalent. Hence, once again, we reject the null hypothesis and proceed to apply the post hoc Nemenyi test to compare all classifiers to each other. Fig. 8 shows the average ranked performance of the ten classifiers with CD = 4.420 at a significance level of α = 0.10. We observe in Fig. 8 that the LR, LDA, DT, NB and k-NN classifiers are statistically significantly different from the best-performing classifier, this time DNN, which has the lowest rank across all the resampling methods in combination with the missForest imputation method. Fig. 9 shows the average ranked performance of the classifiers with CD = 4.789 at a significance level of α = 0.05. Similarly, we observe in Fig. 9 that the LR, LDA, DT, NB and k-NN classifiers are significantly different from the best-performing classifier, DNN, which again has the lowest rank across all the resampling methods with missForest imputation. In both Fig. 8 and Fig. 9, the top-5 performing classifiers (DNN, ANN, SVC, RF and ABC) appear on the right-most part of the diagram, and the least performing classifiers (k-NN, NB, DT, LDA and LR) on the left-most part. It is evident in this study that RF is the classifier least affected by imbalanced data and that it behaves well in comparison to the other nine classifiers. On the other hand, the neural network-based classifiers, specifically DNN, are the most sensitive to imbalanced class data, based on the earlier results in Fig. 6 and Fig. 7.
These preliminary results suggest that DNN is a worthy candidate for further study on this problem, which we pursue by investigating an optimized DNN.
Finally, we perform the statistical test on the average global ranking of the 19 ensembles and the optimized DNN model, with df = 20-1 = 19. This test yields a Friedman test statistic = 24.2745 and a very small p-value = 5.3562E-06 at a significance level of α = 0.05. The result shows statistically significantly different performance (p-value < α = 0.05), meaning that the performances of the models are not equivalent. Hence, once again, we reject the null hypothesis and proceed to apply the post hoc Nemenyi test to compare all 20 models to each other. Fig. 10 shows the average rank performance of the 20 models with CD = 17.1182 at a significance level of α = 0.05. The result reveals that although the DNN_mF_SMENN model consistently performs better than the other methods, there is no significant difference at α = 0.05 (risk level) between DNN_mF_SMENN, which performed best (lowest-ranked), and the 18 ensemble models that are grouped and connected with a bold line, signifying performance similar to that of DNN_mF_SMENN. The differences in performance between these methods are not significant because the statistical tests we use are conservative [64]. The ENSEM9 model, in contrast, performed significantly worse than DNN_mF_SMENN. A similar result is observed for α = 0.1, with CD = 16.0334, as depicted in Fig. 11. An interesting observation from the average ranking diagrams is that the stacking classifier models rank above most of the ensemble and RUSB methods on the left-hand side of Fig. 10 and Fig. 11, which indicates that stacking is a potentially good candidate for this dataset problem.

E. COMPUTATIONAL COMPLEXITY
In this section, we discuss the computational cost-effectiveness of our proposed DNN model in comparison to the other models. Algorithm complexity comprises two factors, namely time complexity and space complexity. Time complexity is the amount of time required by an algorithm to complete, represented by the number of computational steps that a processor would take to solve a dataset problem using a given algorithm. The runtime of each model depends on the input parameters, with a dependency function T representing the model's time computational complexity. Time computational complexity is a function that represents the dependency between the input dataset size and the number of floating-point operations (FLOP) required by the algorithm to complete, which is described as T(D), where D is the size of the input dataset. Space complexity is the amount of memory required by an algorithm to complete [65], [66].
In machine learning research, one of the goals of complexity evaluation is to understand how an algorithm scales as the input data size grows. In other words, complexity evaluation compares different models to determine which model takes fewer resources in terms of time and space as the input data size (D) grows. Big-Oh notation is a standard approach to evaluating the computational complexity or efficiency of an algorithm. Big-Oh notation [65] is a formal way of denoting the order of a function of some input, based on the maximum number of operations an algorithm performs. Each machine learning model has its own order of function, irrespective of the exact number of operations. Hence, to compare the complexities of models, it is sufficient to compare their Big-Oh orders, ignoring constant factors, which indicate how the runtime of an algorithm grows with the input dataset size. Big-Oh notation focuses on the main computational operations while ignoring the low-level details of the mathematical operations [65]. In this section, we focus only on the time complexity of the learning algorithms, since computation speed usually attracts more interest, irrespective of the problem definition.
Evaluating the computational complexity of machine learning algorithms is no trivial endeavour, because the complexity of a learning algorithm depends on many factors, such as the algorithm's implementation, the type of dataset problem, and the other parameters passed to the algorithm. For instance, the complex voting-based ensemble methods rely on other algorithms. We do not include the ensembles in our current analysis; however, the complexity of an ensemble method is the product of the complexity of the original model and the number of voting models in the ensemble implementation. In the bagging ensemble methods, for instance, the training size is replaced by the size of each bag. Table 9 shows the approximate time computational complexity of the models used in this study based on the training dataset size N, the number of features P, and their specific implementations, such as the number of trees and their depth for tree-based methods, the number of support vectors, the number of operations for comparing nearest-neighbor distances, and the number of neurons in each layer of the neural network.
In general, training neural networks is time-consuming and computationally expensive. The training time of a neural network is the training time per epoch multiplied by the number of epochs required to reach the optimal solution [66]. An epoch is a single forward and backward propagation pass through the entire training dataset, usually in batches of a predefined size [66].
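The relation between epochs, batches and total training time can be written as a small estimate. The function name, its parameters and all numeric values below are hypothetical; the sketch assumes a constant time per batch, which real hardware only approximates.

```python
import math


def estimated_training_seconds(n_samples, batch_size, seconds_per_batch, n_epochs):
    """Training time = (batches per epoch * time per batch) * number of epochs."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    seconds_per_epoch = batches_per_epoch * seconds_per_batch
    return seconds_per_epoch * n_epochs


# Hypothetical setting: 1000 training samples, batches of 32,
# 0.01 s per batch (forward plus backward pass), 50 epochs.
total = estimated_training_seconds(1000, 32, 0.01, 50)
print(f"{total:.1f} s")
```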
To assess the impact that the computational environment has on time complexity in terms of the hardware configuration, we tested on two different hardware configurations: 1) an Intel i3 CPU with 6GB RAM, and 2) Kaggle's cloud platform with 4 CPU cores and 16GB RAM, similar to the works in [66], [67]. The time complexity gains shown in Table 9 indicate that, apart from the dataset size and the machine learning algorithm, the computational environment has a significant impact on the runtime of an algorithm, which is in agreement with earlier studies [66], [67]. It is also worth noting that applying preprocessing strategies (MV, resampling of imbalanced data and data normalization) before training the models also helps to reduce the computational time.
We observe in Table 9 that even though methods such as linear regression and GaussianNB had lower time complexity than the other models and the proposed DNN, their predictive power suffered because they are not able to take account of the exact and intrinsic properties of the dataset in the way the more complex models can. This is evident from the results in Tables 3-4, as well as from the analysis of the statistical test.

V. DISCUSSION
In this section, we summarize our findings and draw some conclusions from the experiments conducted, as follows:
1) Applying missing value methods benefited all the classifiers, as some classifiers such as k-NN and SVM do not work in the presence of missing values, unlike the tree-based methods, which can handle missing values. This explains the better performance of the tree-based methods, especially RF, even with an imbalanced training set. The statistical test also supports this observation.
2) Upon applying both missing value and resampling methods, the neural network methods were the better performers, whereas methods like LR and LDA performed poorly because they are less able to capture the intrinsic characteristics of the dataset. However, the neural network methods are the most sensitive to imbalanced class distribution and showed the most bias toward the majority class.
3) An interesting observation was the slightly improved performance of the optimized DNN method over the ensemble methods, with the caveat that optimization was not performed on the ensemble methods, which could have given different results. The result seems to support the intuition that DNNs implement functions of higher complexity, so that they are able, with the same amount of resources, to address more difficult problems [66]. However, for neural networks to generalize well, aside from training on a large amount of data, the test data must be similar to the training data, allowing the output decision to interpolate between the training samples. This assumption is evident in the performance of the optimized DNN when tested on an unseen imbalanced dataset that differs from the balanced data used during training.
4) Training a neural network includes both forward and backward propagation, whereas deploying a trained network on a test set (unseen new data points) involves only forward propagation. Thus, estimating the execution time of a model incorporates both model training and deployment [67]. Hence, we can infer that predicting the execution time for the entire deep neural network is possible by combining these two execution time results. We can also infer that a low time complexity does not necessarily mean a high predictive power.
5) We also observed that the RBF-kernel SVM had the highest time computational complexity in comparison to the other models. This observation is in line with conclusions in the literature about the computational cost of the SVM algorithm. SVM is an effective learning algorithm, but its computation and storage requirements increase rapidly with the number of training vectors, based on its learning philosophy of separating support vectors from the rest of the training dataset. Hence, the complexity of SVM is assumed to be $O(n^3)$ and depends on the number of training samples, the number of features, the type of kernel function and the regularization parameter [68], [69].
6) The experimental analysis was mainly performed on a binary classification problem. However, various methods exist to handle multi-class and multi-label problems as multiple binary classification tasks.

A. LIMITATION OF STUDY
In this study, we compared different strategies for mitigating the effect of imbalanced data with missing values and proposed an optimized DNN model as a way of achieving a better performance.
One apparent limitation of the proposed DNN method observed in this study is the high sensitivity of neural networks to an imbalanced dataset, in comparison to the tree-based and k-NN algorithms. We observed the model showing bias toward the majority class; such biases inherited in the training phase are also transferred into the test or prediction stage. Another possible limitation of DNN methods is what is often referred to as the 'black box problem', which centers on transparency in how the model arrives at a decision. Such transparency is critical for problems like water quality anomaly detection.
The process of training neural networks is the most challenging aspect of applying the method and, in general, is by far the most time-consuming, both in terms of the effort required to configure the process and the computational complexity required to execute it. Other limitations and challenging factors associated with the DNN method are well articulated in [70].
In this study, default settings were maintained throughout the experiments for all the other learning models. However, their performance could still be improved by conducting detailed parameter tuning, especially for the tree-based and ensemble models. Notwithstanding, the results obtained serve as a baseline performance for future improvements to the learning models for this particular dataset problem.

B. THREATS TO VALIDITY
In machine learning and data science experiments, a discussion of how threats to validity were dealt with is crucial [71]. This is to ensure that the dataset supports the conclusions being made and that they result from the treatment applied (MV, resampling and learning algorithm), and not just from chance.
A threat to internal validity concerns the influences that have an impact on the empirical results obtained. To mitigate threats to internal validity, the standard Spyder Python data mining tool was used for all the algorithms experimented on, running on an Intel Xeon CPU@3.20GHz system with 16GB RAM. Secondly, all the parameter settings are outlined in Table 1 to allow for the reproducibility of the experiments. Before building all the learning algorithms, the dataset was first standardized, considering that our dataset is composed of numeric features. This is important to ensure that all the features lie within the same range, so that large-valued features do not dominate small-valued ones. The hyperparameters of the optimized DNN algorithm were selected using a grid search approach to achieve better performance on our dataset. To ensure confidence in the reliability of the experiments conducted, all the results obtained were also validated by all co-authors for accuracy.
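A minimal sketch of the standardization step (zero mean, unit variance per feature) is given below. The function names and the two-feature example are hypothetical, and estimating the statistics from the training rows only is an assumption made here for illustration rather than a detail stated in the study.

```python
import statistics


def fit_standardizer(train_rows):
    """Estimate the per-feature mean and population standard deviation."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply z = (x - mean) / std to every feature of every row."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Two hypothetical numeric features on very different scales.
train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
means, stds = fit_standardizer(train)
z = standardize(train, means, stds)
print(z)
```

After the transformation both features occupy the same range, so neither dominates distance-based or gradient-based learners purely because of its units.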
Threats to external validity account for the generalization of the results obtained outside the experimental setting or framework, and the limits needed to be applied. In all our experiments, we utilized a real-world dataset from a reliable source, which in no small measure boosts the reliability of our conclusions. Analysing the dataset on several learning algorithms in addition to repeated cross-validation gives us confidence in the reliability of our experimental results and conclusions.

VI. CONCLUSION AND FUTURE WORK
In dealing with missing values and class imbalance in the water quality anomaly detection classification problem, this paper empirically evaluated the argument that a combination of missing value and resampling methods can improve the performance of classifiers when these preprocessing methods are applied to the training set before fitting a learning model. We experimentally evaluated the performance of several homogeneous and heterogeneous static ensemble classifiers with MV and ICD methods applied to the training set, as well as a DNN model optimized using grid search.
This paper aimed to observe the effects of combining these variations of strategies on the performance of classifiers, specifically on a two-class anomaly detection problem for a drinking-water quality dataset. For the experiments, we considered seven missing data and eight resampling methods on ten different classifiers. The experimental results revealed that classifiers benefit from combining MV and ICD preprocessing methods to enhance their classification performance in terms of the F1-measure metric. In particular, the tree-based RF algorithm performed consistently well across combinations of these varying MV and ICD strategies. Generally, the performance of all the ensemble model implementations was low when tested on the imbalanced test set. However, there was a slight improvement using DNN. This is because evaluating classifiers when applying MV and ICD methods to the training set only is one case, whereas evaluating the classifiers on an imbalanced (unseen) test set is a different case altogether, one that tests the robustness of a given classifier. These experimental observations lead us to accept the two alternative hypotheses that missing values and imbalanced data harm the performance metrics of learning classifiers, and to reject the null hypotheses.
In future work, we aim to implement two dynamic selection techniques, namely dynamic classifier selection and dynamic ensemble selection methods [11], [72], that suit and improve the water quality anomaly detection prediction problem over the traditional voting and boosting approaches examined in this study. Additionally, DNN has shown promise based on the results obtained in this study. The authors would also like to pay detailed attention to DNN approaches [2] as an area of interest in future work.

Fig. 12 shows the graphical results when applying missing value methods on the training set to evaluate the classifiers.

APPENDIX B
The average rank of classifiers for the two analysed scenarios: interactions between classifiers, missing value and F1-measure; and interactions between classifiers, missing values, resampling and F1-measure are shown in Table 10. The average rankings were used to show the CD diagram of the post hoc Nemenyi test results in Figs. 6-9. The global average rank of ensemble and the proposed DNN models across the three performance scores (balanced accuracy, ROC and F1-score) are presented in Table 11. These average rankings were used to show the CD diagrams of the post hoc Nemenyi test depicted in Fig. 10 and Fig. 11.
BHEKISIPHO TWALA (Senior Member, IEEE) received the M.Sc. degree in statistics from the University of Southampton, U.K., in 2005, and the Ph.D. degree in machine learning and statistical science from Open University, U.K. He held a postdoctoral research position with Brunel University, U.K., mainly focusing on empirical software engineering research and further looking at data quality issues in software engineering. He is currently a Professor in Artificial Intelligence and Data Science and the Executive Dean of the Faculty of Engineering and Built Environment, Durban University of Technology (DUT), South Africa. He has held several prestigious positions, namely as the Director of the School of Engineering, University of South Africa, the Director of the Institute for Intelligent Systems, University of Johannesburg, the Head of the Department of Electrical and Electronic Engineering, University of Johannesburg. He is the author or coauthor of more than 180 journal articles, a book, book chapters, and other publications. His broad research interests include image and signal processing, multivariate statistics, applied and theoretical machine learning, knowledge discovery and reasoning with uncertainty, and the interface between statistics and computing. His many honors include the Prestigious Annual TW Kambule NSTF Research Award, in 2016, for his work in building on diverse experience in using artificial intelligence to solve several problems in many industries. He has made more than 50 keynote and related presentations at national and international forums. He is also the Editor-in-Chief and an Associate Editor of several prestigious international journals within his area of research.
CLINTON OHIS AIGBAVBOA received the Ph.D. degree in engineering management. He is currently a Full Professor of Sustainable Human Development with the Department of Construction Management and Quantity Surveying and the Director of the Sustainable Human Settlement and Construction Research Centre, Faculty of Engineering and the Built Environment, University of Johannesburg, South Africa. He has published over 500 research articles and several scholarly research books in his areas of interest. He has extensive knowledge in practice, research, training, and teaching. His recent research interest includes sustainable human settlement and construction research in the era of the fourth industrial revolution (4IR). He is an active postgraduate degree supervisor and has supervised over 40 masters and six Ph.D. students to completion. He is currently an Editor of the Journal of Construction Project Management and Innovation and has received national and international recognition in his field of research.