A Comprehensive Unsupervised Framework for Chronic Kidney Disease Prediction

The incidence, prevalence, and progression of chronic kidney disease (CKD) have evolved over time, especially in countries with varied social determinants of health. In most countries, diabetes and hypertension are the main causes of CKD. The global guidelines classify CKD as a condition that results in decreased kidney function over time, as indicated by glomerular filtration rate (GFR) and markers of kidney damage. People with CKD are likely to die at an early age. It is crucial for doctors to diagnose conditions associated with CKD at an early stage, because early detection may prevent or even reverse kidney damage and enables better treatment and proper care for patients. Many regional hospitals and clinics have a shortage of nephrologists or general medical staff to diagnose the symptoms, so patients wait longer for a diagnosis. This research therefore proposes that an intelligent system classifying patients as 'CKD' or 'Non-CKD' can help doctors deal with multiple patients and provide diagnoses faster. In time, organizations can implement the proposed machine learning framework in regional clinics with lower medical expert retention, providing early diagnosis to patients in regional areas. Although several researchers have tried to address the situation by developing intelligent systems using supervised machine learning methods, to date only limited studies have used unsupervised machine learning algorithms. The primary aim of this research is to implement and compare the performance of various unsupervised algorithms and identify the combinations that provide the best accuracy and detection rate. This research implemented four unsupervised algorithms, K-Means Clustering, DB-Scan, Isolation Forest, and Autoencoder, and integrated them with various feature selection methods. Integrating feature reduction methods with the K-Means Clustering algorithm achieved an overall accuracy of 99% in classifying the clinical data of CKD and Non-CKD cases.

Chronic kidney disease is defined as a state where one is either suffering from severe kidney damage and/or has a glomerular filtration rate (GFR) of less than 60 ml/min/1.73 m² for more than 3 months. The use of GFR as the best indicator of renal function to identify different stages of CKD, with each successive stage defining a more severe decrease in GFR and the last stage defining kidney failure with a GFR < 15 ml/min/1.73 m² [12], was also advocated. Kidney disease often does not cause any major symptoms in the early stages, making it difficult to detect. Early detection is considered a crucial factor in the management and control of chronic kidney disease.
This research aims to ascertain whether Chronic Kidney Disease is present at an early stage by deploying various unsupervised algorithms on patients' data and validating the classifications to ensure their accuracy. Intending to support medical personnel and nephrologists, a novel and efficient model for predicting Chronic Kidney Disease at an early stage, even before the clinical diagnosis, is proposed. The time and monetary costs of CKD diagnosis also have to be minimized by using a limited number of tests to cover the population. This is where feature selection plays its part, as a reduced model which uses fewer features while still maintaining high performance is preferable. Because the symptoms of CKD overlap with those of other diseases, the most important features need to be identified so that patients are not subjected to more tests than necessary for the diagnosis of CKD [6]. A selection technique is therefore desired to ensure the selection of the most significant features.
There have been a number of research initiatives in the field of machine learning for forecasting kidney disease, but very few use unsupervised feature learning. Unsupervised methods have received attention recently [7] due to their non-dependency on labeled data, and they are suitable for training models when the data are imbalanced. The prospects of the unsupervised approach for CKD were explored and further investigated. There have also been some notable works based on semi-supervised learning in predicting CKD.

A. RESEARCH APPROACH
This research aims to build an intelligent machine learning model that can be used reliably to establish a CKD diagnosis. The model classifies the clinical data into 'CKD' and 'Non-CKD' and can also be used to confirm an initial diagnosis. To do so, various feature selection methods and unsupervised machine learning algorithms are implemented, so that a combination of feature selection and machine learning algorithms can be identified which optimizes accuracy. Unsupervised learning can extract patterns from unlabeled CKD-related clinical data, and these extracted patterns can be used to classify patients as 'CKD' or 'Non-CKD'. Various feature selection mechanisms related to filter methods, wrapper methods, embedded methods, and unsupervised methods are implemented to identify the most important features and reduce the number of input variables to the machine learning model. Algorithms such as K-Means clustering, Isolation Forest, DB-Scan, and Autoencoder are implemented on various sets of selected features. Evaluation metrics are generated and compared with the performance of existing machine learning models.

II. PREVIOUS WORK
Khamparia et al. [8] proposed a novel deep learning framework for CKD classification in which a stacked autoencoder model utilizing multimedia data for feature selection was combined with a SoftMax regression classifier. Although autoencoders have been used primarily within supervised pipelines, they automatically learn the hidden feature representation of data in an unsupervised manner. The learned feature representation can then be used as input to supervised classifiers, which makes the entire model a semi-supervised learning model. This paper claimed that the multimodal model outperformed conventional classifiers used for chronic kidney disease. In late 2020, Ebiaredoh-Mienye et al. [9] introduced a feature learning and classification approach which integrated an unsupervised enhanced sparse autoencoder (SAE) and supervised Softmax regression. The challenge of an imbalanced dataset in applying machine learning algorithms was addressed in this work and a robust semi-supervised learning model was proposed [9]. They applied this to three different diseases, obtaining a 98% accuracy for Chronic Kidney Disease (CKD).
Gopika and Vanitha [15] proposed a model based on clustering the test results for detecting Chronic Kidney Disease and identifying its different stages in 2017. Clusters for the different stages of chronic kidney disease were established. K-means, k-medoids and Fuzzy C-Means were the most commonly used classifiers; Fuzzy C-Means achieved an accuracy of 89%. Polat et al. [18] succeeded in the early diagnosis of Chronic Kidney Disease using an SVM classifier in 2017. The significance of their work was the use of feature selection algorithms to reduce the dimension of the dataset. The two feature selection methods employed were the wrapper and filter approaches. The filtered subset evaluator with the Best First search engine feature selection method combined with the SVM classifier resulted in an accuracy of 98.5%. This demonstrated that feature selection methods can play a significant role in the performance of a model. In 2020, Ogunleye et al. [6] proposed an approach to diagnosing chronic kidney disease using the Extreme Gradient Boosting (XGBoost) model. They used the University of California Irvine (UCI) CKD dataset with all 25 features and attained an accuracy of 98.7%. Wang et al. [19] also employed the CKD dataset from the UCI machine learning data warehouse in late 2018. An Associative Classification Technique implementing several algorithms (ZeroR, OneR, Naive Bayes, J48, and IBk (k-nearest-neighbor)) based on the Apriori associative algorithm was proposed, of which IBk achieved the best result: 99.0% accuracy. No feature reduction technique was used. Rady and Anwar [20] compared several data mining techniques for predicting kidney disease stages in 2019. In their work, hidden information was extracted from clinical and laboratory patient data, which assisted physicians in maximizing the accuracy of disease severity stage identification. However, they used only the data of 361 Indian CKD patients, which is only a part of the UCI Machine Learning Repository dataset. Different data mining classifiers, Probabilistic Neural Networks (PNN), Multilayer Perceptron (MLP), Support Vector Machine (SVM) and Radial Basis Function (RBF) algorithms, were deployed. PNN achieved the best classification and prediction performance in terms of accuracy, sensitivity and specificity, with a maximum accuracy of 96.7% for the five stages of CKD. Rustam et al. [21] analysed gene expression data using Random Forest and Support Vector Machine (SVM) for detecting chronic kidney disease in 2019. A hybrid model that combined RF and SVM, called RF-SVM, was proposed to effectively predict CKD using highly dimensional gene expression data. The data were collected from the Gene Expression Omnibus (GEO) database; of the 48 samples, 36 were used for training and 12 for testing. The accuracy of the RF-SVM algorithm was 83.4%, which outperformed some other hybrid models, but the research was limited by the small dataset. Fig. 1 shows the framework of the proposed method and the steps involved. Pseudo code for the proposed method is given below.

III. PROPOSED METHOD
Initially, data preparation and standardization methods were implemented on the dataset to clean and prepare the data for further processing, as can be seen in the pseudo code and Fig. 1. The dataset is part of the online data repository of the University of California Irvine (UCI) and contains data of 400 patients [22]. It consists of 24 clinical attributes and 1 class attribute, with 250 CKD cases and 150 Non-CKD cases. Missing data is a significant problem in real-world datasets, especially in the medical field. On average, every patient record and attribute has a few missing values. Fig. 2 shows the missing values present in the UCI dataset. Data preparation methods were implemented to handle the missing values. The proportion of missing values per variable ranges from 0.3% (1 missing value) to 38% (152 missing values), as shown in Fig. 2. Table 1 shows the environment setup used for the proposed method.

C. CHARACTER ENCODING
Before addressing the missing values in the dataset, character encoding is performed to convert the categorical attribute values into binary numbers. Since most machine learning models only accept numerical variables as input, it is important to convert textual information into binary values. Categorical features such as 'poor' or 'good', 'no' or 'yes', and 'not present' or 'present' are converted to '0' or '1' binary values.
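As a minimal sketch of this encoding step (assuming pandas, a local file name 'ckd.csv', and the UCI dataset's usual column abbreviations appet, htn, and pcc, none of which are specified in this section):

```python
import pandas as pd

# Illustrative file name and column abbreviations from the UCI CKD dataset.
df = pd.read_csv("ckd.csv")

binary_maps = {
    "appet": {"poor": 0, "good": 1},          # appetite
    "htn":   {"no": 0, "yes": 1},             # hypertension
    "pcc":   {"notpresent": 0, "present": 1}, # pus cell clumps
}
for col, mapping in binary_maps.items():
    df[col] = df[col].map(mapping)            # text categories -> 0/1
```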

D. HANDLING MISSING VALUES
After performing the character encoding, missing values in the dataset are handled using the 'mean imputation' method, see Fig. 1. Only one feature has attribute values for all cases, whereas the rest of the attributes have some missing values. This is to be expected with real-life patient data. It is important to handle missing data because any result based on a dataset with non-random missing values could be biased. To tackle the issue, the following method was used:

1) MEAN IMPUTATION
During the data preparation process, the dataset is analyzed to check for missing attribute values. A statistical method known as 'mean imputation' is then implemented on the dataset. Mean imputation is a process of replacing missing values of a certain attribute with the mean of the non-missing values of that attribute, see equation 1. The imputed values are calculated as the weighted average value of the items for the current or previous instances. Using this method, the missing values in the dataset are filled in.
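A minimal sketch of plain (unweighted) mean imputation with scikit-learn's SimpleImputer; the stand-in array and the use of SimpleImputer are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Stand-in feature matrix with some missing entries.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, np.nan], [7.0, 6.0]])

# Each NaN is replaced by the mean of the non-missing values in its column.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)  # column means 13/3 and 11/3 fill the gaps
```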

E. DATA TRANSFORMATION
Data transformation changes the values of the dataset so that all can be used for further processing. This research uses the data standardization method. Data standardization can increase the accuracy of the machine learning models.
The mean imputation referred to above (equation 1) can be expressed as:

$$\hat{y} = \frac{\sum_{i} w_i\, y_i}{\sum_{i} w_i} \quad (1)$$

where $y_i$ is the value of variable $y$ for item $i$ and $w_i$ is the weight assigned to item $i$.

1) STANDARDIZATION OF DATA
Standardization converts the data to a mean of 0 and a standard deviation of 1. The conversion formula is:

$$Z = \frac{X - \mu}{\sigma}$$

where Z is the standardized score, X the observed value, µ the mean of the sample, and σ the standard deviation of the sample.
The value ranges of the features before and after standardization of the data, are displayed in Table 2.
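For illustration, the per-feature standardization $Z = (X - \mu)/\sigma$ can be performed with scikit-learn's StandardScaler; the stand-in data below is an assumption:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the imputed 400-patient, 24-feature matrix.
X_imputed = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(400, 24))

X_std = StandardScaler().fit_transform(X_imputed)  # z = (x - mu) / sigma per feature
print(X_std.mean(axis=0).round(6)[:3], X_std.std(axis=0).round(6)[:3])
```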
F. DATA REDUCTION
Dimensionality reduction, or data reduction, is used to reduce the input variables to the machine learning model by identifying the most useful features/attributes in the dataset. It is crucial to implement data reduction because using a large number of input variables can result in poor performance of the machine learning algorithms.

1) REASON FOR FEATURE REDUCTION
In order to limit the time and monetary costs of CKD diagnosis, the smallest number of tests that is sufficient for the widest range of people needs to be selected. This is where feature selection plays a role, as it is desirable to reduce the number of features while still maintaining high performance. Also, correlated features are redundant and might degrade the performance of machine learning algorithms. Reducing the dimension of the dataset and removing irrelevant features can produce a comprehensive model for classification. The main challenge of the feature reduction procedure is to recognize the best subset of features in order to achieve the best classification result [23]-[26].
The correlation between the features is depicted in Figure 3. It can be seen that packed cell volume and hemoglobin, as well as packed cell volume and red blood cell count, have positive correlation coefficients of about 0.85 and 0.7 respectively. Another positive relationship, with a correlation coefficient of 0.68, was detected between red blood cell count and hemoglobin. On the other hand, the lowest correlations can be seen for hypertension with hemoglobin and with red cell volume, with approximate correlation values of −0.6.

2) FEATURE SELECTION METHODS
Feature selection techniques are important for unsupervised machine learning algorithms, as they are essential to extract the best attributes for classification. The main purpose of feature selection is to remove the subset of input features which are not important for classification [18]. This can decrease the cost of training and obtain higher accuracy [27]. Feature selection allows the machine learning model to remove non-informative and redundant predictors and to establish a CKD diagnosis more quickly with less clinical data. Classifying the patients into 'CKD' and 'Non-CKD' classes as quickly as possible can help clinics/hospitals to allocate hospital resources to the patients that require them. Various feature selection methods are implemented in this research and are integrated with various unsupervised machine learning algorithms.
Feature selection methods are generally divided into three categories: Filter, Wrapper, and Embedded methods. An appropriate feature selection improves the performance of the classifier and reduces the computing time by using optimized data in the dataset [18], [23], [28], [29]. Although traditional feature selection algorithms are used frequently, they suffer from explainability issues; e.g., when working with clinical data, it is often difficult to explain why some of the features are removed from the provided dataset. Each of the categories of feature selection algorithms has its explainability limitation, making it difficult to clarify why certain features are selected without diving deep into the mathematical formulation. Filter methods do not leverage the model's characteristics to filter the features. Although Wrapper methods do leverage a model's predictions, they choose a subset of features solely based on accuracy or another similar scoring. Embedded methods, even though their selection is calculated as part of the training process, have to incorporate each model's individuality, and it is often difficult and tedious to provide explanations for every single model. Considering these drawbacks, an unsupervised feature selection technique based on model-agnostic explanations is required for this work, and SHAP (SHapley Additive exPlanations) was adopted. This approach assigns SHAP values, which are contribution values to a model's output, for each feature of each data point. These SHAP values determine the feature importance, so that the contribution information of each feature can be used to sort the features based on their importance. Selecting a subset of features based on SHAP values means selecting the first features after ordering them based on the feature contributions to the model's prediction. Feature selection methods based on SHAP values have proven their superiority for solving various classification problems in recent years [30]. The motivation to use such an approach is based on the growing need for model interpretation.
In this research, all 24 features were ranked using 6 feature selection techniques which belong to four different types of feature selection methods. A set-theory-based rule is presented, combining several feature selection methods. The four kinds of feature selection techniques that are utilized are illustrated in Figure 4.

a: FILTER METHODS
Filter feature selection methods make use of statistical techniques to estimate the relationship between each independent input variable and the output (target) variable. The filter methods evaluate the significance of the feature variables based on their inherent characteristics, without the incorporation of any learning algorithm. These methods are computationally inexpensive and not subject to overfitting [27].

PEARSON
The correlation coefficient formula quantifies the linear dependence between two continuous variables. It returns values between −1 and +1. The Pearson correlation coefficient of two variables is computed as:

$$r = \frac{N\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}}$$

where N is the number of pairs of scores, $\sum xy$ the sum of the products of paired scores, $\sum x$ the sum of x scores, $\sum y$ the sum of y scores, $\sum x^2$ the sum of squared x scores, and $\sum y^2$ the sum of squared y scores.
The Pearson product-moment correlation coefficient, or simply the Pearson correlation coefficient r, determines the strength of the linear relationship between two variables. The stronger the association between the two variables, the closer the coefficient will be to +1 or −1. Values of exactly 1 or −1 signify that all the data points lie on the straight line of 'best fit'. The closer the value lies to 0, the larger the independent variation in the variables [46].
After applying the Pearson correlation between each feature and the target variable (Class), the features can be ranked as illustrated in Figure 5: based on Pearson correlation, hemoglobin is the most highly correlated with the target variable and potassium the least correlated, making hemoglobin a highly important and potassium a least important feature.
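A sketch of this ranking with pandas, assuming a DataFrame whose label column is named 'class'; the feature columns and data are stand-ins:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(400, 3)), columns=["hemo", "pcv", "sc"])  # stand-ins
df["class"] = rng.integers(0, 2, size=400)

# Absolute Pearson correlation of every feature with the class label,
# sorted so the most correlated (most important) feature comes first.
ranking = df.corr()["class"].drop("class").abs().sort_values(ascending=False)
print(ranking)
```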

CHI-2
A chi-square test is used in statistics to test the independence of two events. Given the data of two features, the observed counts and expected counts are obtained. Chi-Square measures how much the expected count and observed count deviate from each other [47]. A contingency table and the expected values have to be calculated before the chi-square calculation. A contingency table represents the joint distribution of one feature against another; it is used to study the relationship between two features. The expected count for each cell is the product of the corresponding row and column totals divided by the sample size. Observed values are the actual values calculated from the sample. The expected counts are then contrasted with the observed counts, cell by cell: the larger the difference, the higher the resulting statistic, which is the chi-square. The formula for chi-square is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed count and E the expected count. When two features are independent, the observed count is close to the expected count, and thus the Chi-Square value is small. In order to find the feature importance, the chi-square between each feature and the target variable (Class) is calculated. The higher the Chi-Square value between a feature and the target column, the more dependent the feature is on the target, and the stronger the case for selecting it for model training. After applying the Chi-2 technique, the features can be ranked as illustrated in Figure 6.
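A sketch of chi-square feature scoring with scikit-learn; note that chi2 requires non-negative inputs, so min-max scaling (an assumption here) is applied first, and the data are stand-ins:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))    # stand-in features
y = rng.integers(0, 2, size=400)  # stand-in class labels

# chi2 needs non-negative values, so min-max scaling replaces standardization here.
X_pos = MinMaxScaler().fit_transform(X)
selector = SelectKBest(chi2, k=10).fit(X_pos, y)
print(selector.scores_)           # higher score = more class-dependent feature
```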

b: WRAPPER METHODS
Wrapper methods create several models which have different subsets of input feature variables. Later the features that result in the best performing model according to the performance metric are selected [29]. The main idea behind a wrapper method is to search for the set of features which work best for a specific classifier as shown in figure 7:

RECURSIVE FEATURE ELIMINATION
Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on the attributes that remain. It performs a greedy search to find the best performing feature subset [31]. It uses the model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. It iteratively creates models and determines the best or the worst performing feature at each iteration. The subsequent models use the remaining features until all the features are explored. The features are then ranked based on the order of their elimination. In the worst case, if a dataset contains N features, RFE will do a greedy search over 2N combinations of features. Here RFE is used with the Logistic Regression classifier to select the top features, as depicted in Figure 8.
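A sketch of RFE with a Logistic Regression estimator, as used here; the stand-in data and the choice of 10 retained features are illustrative:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))
y = rng.integers(0, 2, size=400)

# Recursively drop the weakest feature until 10 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)
print(rfe.support_)  # True for the selected features
print(rfe.ranking_)  # 1 = selected; larger = eliminated earlier
```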

c: EMBEDDED METHODS
Machine learning models that have feature selection naturally incorporated as part of learning are called Embedded feature selection methods [50]. Built-in feature selection is incorporated in some models, which means that the model includes the predictors that help maximize accuracy, as illustrated in Figure 9. Models like Logistic Regression (LR) with an L1 penalty (Lasso regression) intrinsically conduct feature selection [48]. It is a linear model that uses a cost function of the form:

$$J = \mathrm{Loss}(y, \hat{y}) + \alpha \sum_{j} |a_j|$$

where $a_j$ is the coefficient of the j-th feature. The final term is called the L1 penalty, and α is a hyperparameter that tunes the intensity of this penalty term. The higher the coefficients of the features, the higher the value of the cost function. So, the idea of Lasso regression is to optimize the cost function by reducing the absolute values of the coefficients. Obviously, this works only if the features have previously been scaled, for example using standardization or other scaling techniques. The α hyperparameter value must be found using a cross-validation approach. In minimizing the cost function, Lasso regression automatically selects the features that are useful, discarding the useless or redundant ones; discarding a feature means making its coefficient equal to 0. A feature importance plot created with LR is shown in Figure 10.
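A sketch of L1-penalised logistic regression in scikit-learn, where the penalty strength is expressed through C (the inverse of α in the cost function above); the data and C value are stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))
y = rng.integers(0, 2, size=400)

# L1 ("Lasso") penalty drives the coefficients of uninformative features to zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])  # indices of features kept by the penalty
print(selected)
```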

RANDOM FOREST
Random forest (RF) is another common feature selection technique. It consists of extracting the feature importance rank from tree-based models [49]. A feature's importance is essentially the mean, over the individual trees, of the improvement in the splitting criterion produced by that variable. In other words, it is the magnitude of the impurity reduction obtained when splitting a tree on that specific variable. This can be used to rank the features and then select a subset. RF feature importance is biased towards features with more categories. Besides, if two features are highly correlated, both of their scores decrease regardless of the quality of the features. As mentioned, Random Forest uses the mean decrease in impurity (Gini index) to estimate a feature's importance; the lower the Gini index at a node, the purer that node. The Gini index at node i is defined as:

$$G_i = 1 - \sum_{k=1}^{K} p_{i,k}^{2}$$

where the second term is the sum of the squared probabilities of each class for the samples at node i. The Gini index of feature j is measured for each node of a tree where feature j was used and averaged over all trees in the ensemble. If all the samples that reach a node are linked with a single class, that node can be called pure. This can give a good estimate of the threshold value to set when selecting features based on their importance, as shown in Figure 11.
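A sketch of extracting the impurity-based importance ranking from a Random Forest; the stand-in data and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))
y = rng.integers(0, 2, size=400)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# feature_importances_ holds the Gini-based mean decrease in impurity,
# normalised to sum to 1; larger values indicate more important features.
order = np.argsort(rf.feature_importances_)[::-1]
print(order[:10])  # indices of the ten most important features
```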

d: UNSUPERVISED FEATURE SELECTION METHOD SHAP (SHAPLEY ADDITIVE EXPLANATIONS)
The SHAP approach assigns SHAP values, which are contribution values to a model's output, for each feature of each data point [51]. These SHAP values encode the importance of a feature for the model. The mean of the columns of each matrix is calculated, and the vectors of mean SHAP values for each class are summed and ordered in decreasing order. The first position of the resulting vector contains the most important feature, the second position contains the second most important, and so on. Since SHAP provides a means to interpret the model's decisions by indicating the importance of the dataset features, a feature selection algorithm based on the most important features according to the absolute SHAP values can provide good results [30]. Here, the Tree SHAP explainer approach is used with the Isolation Forest model for feature selection, and the feature importance plot is shown in Figure 12.
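A sketch of the Tree SHAP + Isolation Forest ranking described above, using the shap package; the stand-in data are an assumption:

```python
import numpy as np
import shap
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))  # stand-in for the preprocessed features

iso = IsolationForest(random_state=0).fit(X)
# Tree SHAP decomposes each sample's anomaly score into per-feature contributions.
shap_values = shap.TreeExplainer(iso).shap_values(X)

# Mean absolute SHAP value per feature, sorted descending, gives the ranking.
importance = np.abs(shap_values).mean(axis=0)
print(np.argsort(importance)[::-1][:10])
```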

3) OUTCOMES OF FEATURE SELECTION PROCESS
The 24 features were ranked using Pearson, Chi-2, RFE, Random Forest, Logistic Regression, and SHAP. The rankings are shown in Figures 5-12.

G. UNSUPERVISED MACHINE LEARNING ALGORITHMS
This section discusses the unsupervised machine learning algorithms implemented in this research. After applying the feature selection methods described above, preprocessed datasets are created. These datasets are used for training and testing the machine learning models. Since all the classification models are unsupervised, separate training and testing data are not required; moreover, the dataset is limited in size, so the whole set of 400 data points was passed to each model as preliminary training data. The training allows the models to generate a distinct set of data points (Z0 to Z3). As illustrated in Fig. 14, each classifier, K-means clustering, DB-Scan, Autoencoder, and I-Forest, has a distinct point that is used to separate the data into clusters of CKD and Non-CKD cases. These clusters are used to classify the data into classes.

1) K-MEANS CLUSTERING
Unsupervised algorithms can make predictions or inferences from unlabeled data. Clustering unlabeled data based on inferences is very useful when working with clinical data. K-means clustering is a centroid-based unsupervised clustering algorithm that can be used for classification. The preprocessed dataset created with the feature selection methods is used to train the algorithm and extract a data point (Z0). This data point is used to classify the data into 'CKD' and 'Non-CKD' cases. Similar data points are clustered together to find an underlying pattern for assessment. K-means delivers the final output through a process called iterative refinement. It tries to minimize the sum of the squared distances between the data points and the cluster's centroid. The centroid is defined as the arithmetic mean of all the data points that belong to that cluster. The number of groups is denoted by K, and each data point is iteratively assigned to one of these clusters based on the identified similarities among the features. The initial number of clusters 'K' has to be provided as an input. This can sometimes be a delicate issue, and users sometimes end up running the system multiple times with different values of K, after which a comparison is made to select the best value of 'K'. However, various methods are available for getting a reasonably stable approximation of K. K-means most commonly uses the 'Euclidean Distance' to determine the distance between two data points (Zn and Zm). One of the key advantages of K-means is that, even if the number of features is very high, it can still complete the computation in a reasonable time provided the value of 'K' is kept relatively small [32].
Given a set of d-dimensional real vector observations $(y_1, y_2, \ldots, y_n)$, K-means clustering partitions the n observations into $k$ ($\le n$) sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the variance:

$$V = \arg\min_{S} \sum_{i=1}^{k} \sum_{y \in S_i} \lVert y - \mu_i \rVert^2$$

where $\mu_i$ denotes the mean of $S_i$ and V is the variance.
The number of clusters was set to six by parameter tuning, and the actual class labels in each cluster were checked. Except for cluster 1, the clusters reflect CKD patients, as seen in the tables below: cluster 1 only contains non-CKD cases, while the majority of cases in the remaining clusters are CKD cases. To categorise a new data point in the future, it can be given as test data; the Euclidean distance to each cluster centroid is calculated to discover which one is closest, and the point is then labeled under that cluster.
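A sketch of this clustering and nearest-centroid assignment with scikit-learn's KMeans (the data are stand-ins; k = 6 follows the tuning reported above):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))  # stand-in for the preprocessed features

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
print(np.bincount(kmeans.labels_))  # sizes of the six clusters

# A future patient record is assigned to the nearest centroid
# (Euclidean distance), i.e. the cluster it would be labeled under.
x_new = rng.normal(size=(1, 24))
print(kmeans.predict(x_new))
```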

2) DB-SCAN
DB-Scan is Density-Based Spatial Clustering of Applications with Noise. The goal of DB-Scan is to find core samples of high density and expand them into clusters. It is most suitable for data which contain clusters of similar density [33].
DB-Scan detects density-connected clusters by discovering one of their core objects p and computing all objects which are density-reachable from p. The collection of density-reachable objects is found by iteratively computing density-reachable objects. DB-Scan checks the neighborhood N of each object p in the database. If N(p) of an object p consists of at least µ objects, i.e., if p is a core object, a new cluster X containing all objects of N(p) is created. Then, the neighborhood of all objects q ∈ X which have not yet been processed is checked. If an object q is also a core object, the neighbors of q which are not already assigned to cluster X are added to X, and their neighborhoods are checked in the next step. This procedure is repeated until no new object can be added to the current cluster X.
DBSCAN aims at discovering clusters which are high-density regions of the dataset. It applies two hyperparameters, Eps (the neighborhood radius) and minPts (the minimum number of neighbors), to decide whether a point is a core point. It defines a point as a core point if there are at least minPts sample points in its Eps neighborhood. The points within the Eps neighborhood of a core point are said to be directly reachable from that core point. A point q is reachable from a core point p if there exists a path from q to p where each point is directly reachable from the next point. The parameter values of minPts and Eps corresponding to the highest clustering accuracy were selected.
The whole dataset, comprising 400 data points, was passed to the DB-Scan model for training. Parameter values for Eps and minPts were selected as 3.6 and 150 respectively by hyperparameter tuning. Based on these parameter values, DB-Scan treats some data points as a cluster and the other data points as outliers, labeling them as −1. There is only one cluster. Table 4 depicts the number of elements in the cluster and the number of outliers. The cluster consists of 174 data points, of which 150 are non-CKD cases, and all the outliers are CKD cases. To classify a new data point in the future, it can be given as test data to this DB-Scan model, which checks whether the given sample is within Eps distance of one of the core samples. If it is, it takes the label of the core sample (it is classified as a non-CKD case); if it is not, it is an outlier (a CKD case).
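A sketch with scikit-learn's DBSCAN, using the tuned values eps = 3.6 and min_samples = 150 reported above, on stand-in data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))  # stand-in for the preprocessed features

# eps and min_samples mirror the tuned values reported in the text.
db = DBSCAN(eps=3.6, min_samples=150).fit(X)
labels = db.labels_             # -1 marks outliers (treated here as CKD cases)
print((labels == -1).sum(), "outliers,", (labels == 0).sum(), "points in cluster 0")
```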

3) AUTOENCODER
An autoencoder neural network is an unsupervised deep learning technique that consists of two components: an encoder and a decoder. The main concept is that both encoder and decoder are trained together, minimizing the discrepancy between the original data and its reconstruction [34].
The encoder e(x) represents a mapping of an input x with higher dimensions to a hidden compressed representation, and the decoder d(x) maps this compressed representation back to a reconstructed version of x, such that d(e(x)) ≈ x.
The reconstruction error of autoencoder networks can be used to classify CKD and non-CKD cases. Here the encoder has two layers, one input layer and one hidden layer, whereas the decoder has one hidden layer and one output layer. Encoder/decoder networks are fully (densely) connected neural networks with rectified linear unit (ReLU) activations between layers. An encoder network, defined as $e(x): X \rightarrow Z$, maps from the input space $X \in \mathbb{R}^M$ to a latent embedding $Z \in \mathbb{R}^D$, and a decoder network, $d(e(x)): Z \rightarrow X$, maps the embedding Z back to the input space. The optimization over the encoder and decoder networks is:

$$\min_{\phi, \psi} \; \mathbb{E}_x \left[ \lVert x - d_{\psi}(e_{\phi}(x)) \rVert_2^2 \right]$$

where φ and ψ are the parameters of the encoder and decoder neural networks, respectively. The expectation is taken over the training data, and the loss is the squared 2-norm distance between the input x and the reconstructed input. The training parameters for the autoencoder are the number of times the algorithm trains on the training data (epochs) and the number of samples processed before the model is updated (batch size). The MSE loss between inputs and outputs, see equation 10, gives the anomaly score of the autoencoder for each data point that passes through it.
The MSE is defined as:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2 \quad (10)$$

where n is the number of data points, $x_i$ are the observed values, and $\hat{x}_i$ are the predicted values. The tuning parameters for the autoencoder are given in Table 5. Here the encoder of the model consists of two layers that encode the data into lower dimensions, and the decoder consists of two layers that reconstruct the input data. The reconstruction errors are considered to be anomaly scores. The model is compiled with the Mean Squared Logarithmic Error loss and the Adam optimizer.
The model is then trained with 40 epochs and a batch size of 50, and in the testing phase, scores are sorted in ascending order and a threshold is set such that scores of more than the threshold result in a cluster of CKD instances, while those below that threshold result in a cluster of non-CKD cases.
Fine-tuning of this threshold is done by comparing the anomaly scores with the actual class labels. (Note that class labels are not given as input to the model.) As a result, based on this threshold, there are two clusters: cluster 1 contains all cases with a loss MSE of more than the threshold value, which are mapped as 1, and cluster 2 contains all cases with a loss MSE of less than the threshold value, which are mapped as 0.
The clusters obtained using the autoencoder with all features considered are shown in the table below. Cluster 1 has a total of 260 datapoints, with 250 of them belonging to CKD. Cluster 2 has 140 datapoints, all of which are non-CKD cases. The model and threshold value can be used to cluster new data in the future.
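A sketch of such a reconstruction-error autoencoder in Keras; the layer widths, the sigmoid output, and the stand-in (non-negative, as MSLE requires) data are assumptions, while the MSLE loss, Adam optimizer, 40 epochs, and batch size of 50 follow the text:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 24))  # stand-in; MSLE needs non-negative inputs,
                                 # so min-max scaled features would be used

n = X.shape[1]
model = keras.Sequential([
    keras.Input(shape=(n,)),
    keras.layers.Dense(12, activation="relu"),    # encoder layers (widths assumed)
    keras.layers.Dense(6, activation="relu"),
    keras.layers.Dense(12, activation="relu"),    # decoder layers
    keras.layers.Dense(n, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="msle")      # MSLE loss + Adam, per the text
model.fit(X, X, epochs=40, batch_size=50, verbose=0)

# Per-sample reconstruction MSE is the anomaly score; scores above a tuned
# threshold form the CKD cluster, scores below it the non-CKD cluster.
recon = model.predict(X, verbose=0)
scores = ((X - recon) ** 2).mean(axis=1)
```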

4) ISOLATION FOREST
Isolation Forest (IForest) 'isolates' observations by randomly selecting a feature and then randomly selecting a 'split value' between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and is used as the decision function. Random partitioning produces noticeably shorter paths for anomalies, so a forest of random trees collectively produces shorter path lengths for such samples [35]. Tuning parameters for the Isolation Forest are given in Table 7; they are the number of trees in the forest, the maximum number of features, and the sub-sampling size. During the test phase, the Isolation Forest finds the path length of a data point in all the isolation trees and computes the average path length. The higher the path length, the more normal the point, and vice versa. Based on the average path length, it calculates the anomaly score; the decision_function of IForest can be used to obtain this. For IForest, the lower the score, the more anomalous the sample. Scores are sorted and a threshold is set such that scores less than the threshold result in a cluster of CKD instances, while those above the threshold result in a cluster of non-CKD cases. Fine-tuning of this threshold is done by comparing the anomaly scores with the actual class labels. (Note that class labels are not given as input to the model.) As a result, based on this threshold, there are two clusters: cluster 1 contains all cases with an anomaly score less than the threshold value and is mapped as 1; cluster 2 contains all cases with anomaly scores more than the threshold value and is mapped as 0.
The clusters obtained using Isolation Forest with all features considered are shown in Table 8. Cluster 1 has a total of 250 data points, with 232 of them belonging to CKD. Cluster 2 has 150 data points, with 132 of them belonging to non-CKD cases. This model and threshold value can be used to cluster new data in the future.
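A sketch of the Isolation Forest scoring and thresholding with scikit-learn; the hyperparameters and the percentile threshold are illustrative stand-ins for the tuned values in Table 7:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))  # stand-in for the preprocessed features

# n_estimators / max_samples stand in for the tuned values in Table 7.
iso = IsolationForest(n_estimators=100, max_samples=256, random_state=0).fit(X)
scores = iso.decision_function(X)        # lower score = more anomalous

# Threshold tuned against the class labels; samples below it form the
# CKD cluster (mapped as 1), samples above it the non-CKD cluster (0).
threshold = np.percentile(scores, 62.5)  # illustrative: ~250 of 400 flagged
pred_ckd = (scores < threshold).astype(int)
```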

H. CLUSTER VALIDATION
The clusters generated from each algorithm are evaluated using cluster validation methods. These methods are used to compare the performance of each cluster.
Validation can be done in two ways:
1. Internally
2. Externally

1) INTERNAL VALIDATION
Internal validation processes evaluate the connectedness, i.e., how well a pair of data points within the same cluster is connected to each other, and the compactness, i.e., how close the data points placed inside the same cluster are to each other. Internal measures do not require any prior cluster labelling or ground truths. Acceptable clusters have minimal 'Connectedness' and 'Compactness' [36], [37].
This section examines how the clusters have been validated using various internal metrics and discusses the indexes that were used.

a: DAVIES-BOULDIN INDEX (DBI)
The metric works on the basis of the ratio of within-cluster distances to between-cluster distances. The smaller the values are, the better the clustering. Note that, to make it consistent with the other indices used in this research, the reverse of the Davies-Bouldin Index (1 − Davies-Bouldin Index) [38] was used. The Davies-Bouldin Index for n clusters can be calculated using the following expression [39]:

$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\bar{d}_i + \bar{d}_j}{d(x_i, x_j)} \right)$$

where $\bar{d}_i$ is the average distance of the points in cluster $c_i$ to its centroid $x_i$, and d is the Euclidean distance between the points. Figure 16 illustrates the Davies-Bouldin score for all the classifiers without and with feature reduction. In both cases, it can be seen that K-means performs well, with good scores.

b: CALINSKI-HARABASZ INDEX
Calinski-Harabasz is a ratio-type index that evaluates cluster validity by comparing the average between-cluster and within-cluster sums of squares. A higher value indicates better clustering [40]. The index, CH, is defined as:

$$CH = \frac{V_b}{V_w} \cdot \frac{N - k}{k - 1}$$

where $V_b$ is the overall between-cluster variance, $V_w$ is the overall within-cluster variance, N is the number of observations, and k denotes the total number of clusters. Figure 17 depicts the Calinski-Harabasz Index for all the classifiers without and with feature reduction.

c: SILHOUETTE COEFFICIENT SCORE
The silhouette coefficient score is one of the most widely used internal cluster validation techniques. The Silhouette Coefficient score is derived for each sample from the mean within-cluster (intra-cluster) distance and the mean nearest-cluster distance, generally using the following equation [38]:

$$c = \frac{q - p}{\max(p, q)}$$

where c is the Silhouette Coefficient score, p is the mean within-cluster (intra-cluster) distance, and q is the mean distance between a sample and the nearest cluster that the sample is not a part of.
The metric is primarily an intuitive graphical tool that aids the user in visually assessing cluster quality. Figure 18 depicts the silhouette score for all the classifiers without and with feature reduction.
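The three internal indexes discussed above are all available in scikit-learn; a sketch on stand-in K-means clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (davies_bouldin_score, calinski_harabasz_score,
                             silhouette_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))  # stand-in for the preprocessed features
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

print(davies_bouldin_score(X, labels))     # lower is better (reported as 1-DBI above)
print(calinski_harabasz_score(X, labels))  # higher is better
print(silhouette_score(X, labels))         # in [-1, 1], higher is better
```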

2) EXTERNAL VALIDATION
External validation techniques gauge the degree to which cluster labels match externally supplied class labels. These class labels have not been used in any of the processes discussed in the previous sections. The 'True Rate of Detection' (the 'Recall' measure) for each of the clusters was observed. Several validation methods have been applied.
This section will provide a detailed inspection of the quality of the clustering using various External metrics.

a: ADJUSTED RAND INDEX (ARI)
The Rand Index (RI) is a similarity measure between two sets of clusters computed by considering all pairs of samples assigned to the same or different clusters in the predicted and the true clusterings. Scores closer to 1 signify better clustering [42], [43]. The ARI results are shown in Figure 19.
The raw RI score is adjusted for chance as follows:

$$ARI = \frac{RI - \mathbb{E}[RI]}{\max(RI) - \mathbb{E}[RI]}$$

b: V-MEASURE
The V-measure combines homogeneity, the measure of a cluster holding only members of a single specific class, and completeness, whether all members of a given class are allocated to the same cluster [44]:

$$v = \frac{(1 + \beta) \times h \times c}{\beta \times h + c}$$

where v is the V-measure, h is homogeneity, and c is completeness. The default value of β is 1, signifying equal weighting of homogeneity and completeness. Figure 21 shows the V-measure score for all the classifiers without and with feature reduction. In both cases, it can be seen that K-means performs well, with good scores.
c: CONFUSION MATRIX
The effectiveness and accuracy of the four unsupervised machine learning methods can be evaluated using performance indicators. A positive classification occurs when a person is classified as having CKD; when a person is not classified as having CKD, the classification is negative. True Positive (TP) indicates instances correctly categorized as CKD, and True Negative (TN) instances correctly categorized as non-CKD. False Positive (FP) indicates non-CKD cases incorrectly classified as CKD, and False Negative (FN) indicates CKD cases incorrectly classified as non-CKD. Table 10 gives more explanation.

d: ACCURACY
Accuracy is the most intuitive performance measure. It is simply the ratio of correctly predicted observations to the total observations:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
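A sketch computing accuracy together with the other external metrics reported in this work, using scikit-learn on stand-in labels:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, adjusted_rand_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=400)  # stand-in ground-truth labels
y_pred = y_true.copy()
y_pred[:4] ^= 1                        # stand-in predictions with 4 errors

print(accuracy_score(y_true, y_pred))       # (TP + TN) / (TP + TN + FP + FN)
print(precision_score(y_true, y_pred))      # TP / (TP + FP)
print(recall_score(y_true, y_pred))         # TP / (TP + FN), the detection rate
print(f1_score(y_true, y_pred))             # harmonic mean of precision and recall
print(adjusted_rand_score(y_true, y_pred))  # chance-adjusted cluster agreement
```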

IV. RESULTS AND DISCUSSIONS
Validation scores obtained by considering all 24 features for DB-Scan, K-means, I-Forest, and Autoencoder are given in Table 11. Both K-means and the autoencoder have a 100% recall, indicating that all CKD cases were correctly predicted. K-means clustered 253 anomalies as CKD, although only 250 of these are true CKD cases, giving it a precision of 98 percent. Smaller values for the Davies-Bouldin score and higher values for the mutual information scores, adjusted Rand scores, V-measure scores, silhouette scores and Calinski-Harabasz scores indicate how good the clustering is. All the internal validation scores, such as the silhouette score, Calinski-Harabasz score and Davies-Bouldin score, are slightly better for DB-Scan than for K-means. However, with an accuracy of 99.3 percent and an F1-score of 99.4 percent, K-means clustering outperforms the other three approaches. Validation scores for the reduced feature sets and the final reduced feature set are shown in Table 11, Table 12 and Table 13 respectively. For these highly reduced feature sets, the autoencoder yielded an unsatisfactory result, while DB-Scan and Isolation Forest produced acceptable results. However, K-means had a low Davies-Bouldin score, high other cluster validation scores and a low computational time, which indicates that K-means performs well with a reduced feature set. It has a 99% accuracy and a 99.2% F1-score.

A. COMPARISON OF THE PROPOSED MODEL WITH PREVIOUS WORK
There are only a limited number of studies using unsupervised systems and algorithms to address the early detection of CKD. However, there are some studies based on semi-supervised and supervised learning worth mentioning. Relevant studies have been included for performance comparison in Table 14.
From the comparison table, it can be seen that no existing work in detecting CKD achieved an accuracy of more than 99.0%, whereas the proposed method achieved a maximum accuracy of 99.3% using the K-means clustering algorithm. Most studies did not employ feature selection techniques, and those that did did not clearly state why some features were left out. This research sorted out the most important features for disease prediction, leaving out the less important ones. Using an unsupervised method combined with appropriate feature selection techniques led to an improvement in accuracy for detecting CKD.

V. CONCLUSION AND FUTURE WORK
This work developed an approach for improved prediction and detection of Chronic Kidney Disease based on various unsupervised machine learning approaches, including the autoencoder, Isolation Forest, DB-Scan and K-means. Considering all 24 features resulted in an accuracy of 91% for I-Forest, 94% for DB-Scan, 97.5% for the Autoencoder and 99.3% for K-means clustering. To reduce the time and financial expenses of CKD diagnosis, six feature selection strategies, which fall into four distinct categories of feature selection methods, were used. The best features were selected using a set-theory-based rule, which combines multiple feature selection approaches. The data were then classified and validated. For the reduced feature set, K-means also outperformed the other unsupervised algorithms, with 99% accuracy.
The suggested technique can assist clinicians in managing numerous patients and providing CKD diagnoses more quickly. Organizations can, over time, use the suggested machine learning architecture in regional clinics with reduced medical expert retention, allowing patients in regional locations to receive an early diagnosis. As an extension of this work, the five different stages of Chronic Kidney Disease could be detected in a similar manner. This would support the medical community not just in detecting the existence of the disease, but also in identifying its stages.
LINTA ANTONY received the M.Tech. degree. She has been a Junior Research Fellow with C-MET since 2018. Her current research interests include biomedical engineering, artificial intelligence, estimation and detection, 3-D imaging, sensors, and DOA estimation.
SAMI AZAM (Member, IEEE) is currently a Leading Researcher and a Senior Lecturer with the College of Engineering and IT, Charles Darwin University, Australia. His research interests include computer vision, signal processing, artificial intelligence, and biomedical engineering. He has a number of publications in peer-reviewed journals and international conference proceedings.
EVA IGNATIOUS is currently a Ph.D. Researcher with Charles Darwin University, Australia. Her research interests include biomedical signal processing (interesting features and abnormalities found in bio-signals), theoretical modeling and simulation (breast cancer tissues), applied electronics (thermistors), process control and instrumentation, and embedded/VLSI systems. She has considerable research experience with one U.S. patent and two Indian patents for the development of thermal sensor-based breast cancer detection at its early stages together with the Centre for Materials for Electronics Technology (C-MET), an autonomous scientific society under Ministry of Electronics and Information Technology (MeitY), Government of India.
RYANA QUADIR received the bachelor's degree in computer science and engineering from North South University (NSU) and the master's degree from Daffodil International University (DIU), Bangladesh. She has been a Software QA Automation Engineer for the last six years at Stibo DX, a developer of content and digital asset management systems for media companies in Europe. Her research interests include machine learning, fuzzy systems, AI in medical diagnostics, and machine learning methods in psychiatry.
ABHIJITH REDDY BEERAVOLU is currently pursuing the M.S. degree in information systems and data science with Charles Darwin University, Casuarina, NT, Australia. He is also a Computer Science Enthusiast who is interested in anything related to computers. His research interests include reading books on History and making comparisons with the current world, to make sense of the reality and its progression. He is also interested in reading and analyzing information related to cognitive and behavioral psychology and trying to integrate them into various technological ideas.
MIRJAM JONKMAN (Member, IEEE) is currently a Lecturer and a Researcher with the College of Engineering, IT and Environment. Her research interests include biomedical engineering, signal processing, and the application of computer science to real life problems.
FRISO DE BOER is currently a Professor with the College of Engineering, IT and Environment, Charles Darwin University, Australia. His research interests include signal processing, biomedical engineering, and mechatronics.