A Weighted k-Nearest Neighbours Ensemble With Added Accuracy and Diversity

Ensembles based on <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula>NN models are considered effective in reducing the adverse effect of outliers, primarily by identifying the observations closest to a test point in the given training data. The class label of the test point is then estimated by a majority vote over the class labels of these nearest observations. When identifying the closest observations, certain training patterns may possess higher regulatory power than others. Assigning weights to observations, and computing weighted distances accordingly, is therefore important in this setting. This paper proposes a <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula>NN ensemble that identifies the nearest observations based on their weighted distance, in relation to the response variable, via support vectors. This is done by building a large number of <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula>NN models, each on a bootstrap sample from the training data along with a randomly selected subset of features from the given feature space. The estimated class of a test observation is decided by majority voting over the estimates given by all the base <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula>NN models. The ensemble is assessed on 14 benchmark and simulated datasets against other classical methods, including <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula>NN based models, using the Brier score, classification accuracy and Cohen's kappa as performance measures. On both the benchmark and simulated datasets, the proposed ensemble outperformed the competing methods in the majority of cases, giving better overall classification performance than the other methods on 8 datasets. The analyses on simulated datasets reveal that the proposed method is effective in classification problems that involve noisy features in the data.
Furthermore, feature weighting and randomization also make the method robust to the choice of <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula>, i.e., the number of nearest observations in a base model.


I. INTRODUCTION
A wide range of supervised learning techniques has been introduced to deal with classification problems. Among these, the nearest neighbour (NN) model is one of the top ranked methods; it classifies an unseen observation on the basis of its neighbourhood in a given feature space. Although it is an efficient method, it suffers from the issue of over-fitting. One of the most fundamental, simple and appealing approaches to overcoming this disadvantage is the k-Nearest Neighbours (kNN) method [1]. This technique can be used for both classification and regression problems in machine learning.
The associate editor coordinating the review of this manuscript and approving it for publication was Fu-Kwun Wang.

In the context of classification, the kNN approach estimates a class value for a new/unseen instance by finding its k nearest neighbours whose classes are known [2], [3], [4]. It was initially developed to perform discriminant analysis in situations where reliable parametric estimates of probability densities are unknown. Nowadays, the kNN method is a preferred choice for classifying data when there is little or no prior knowledge about the distribution of the data [5]. It is a widely used classifier because of its simplicity, its robustness to noisy training data and its effectiveness on large training data [5].
In spite of the above advantages, this technique has some disadvantages: a high computational cost due to computing the distance of each query point to all the training samples; a large memory requirement, proportional to the training set size; a low accuracy rate on multidimensional data; the choice of the distance metric to be used in distance based learning; the selection of the most informative attributes; and the need to determine the value of k, the number of nearest neighbours [6]. Examples of attempts to make the kNN method fast, among several others, can be found in [7] and [8].
The choice of the parameter k affects the performance of kNN. When k is too small, the method is more sensitive to data points that lie outside the general pattern of the data. Similarly, when k is too large, too many points from other classes may fall in the neighbourhood [1]. There is no definitive way to choose k; however, the square root of the total number of observations in the given dataset is often used as a rule of thumb. Generally, an odd value of k is taken so as to avoid ties during decision making, and an error or accuracy plot is usually used to find the most suitable value. Thus k is the core deciding factor in kNN.
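The rule of thumb above can be stated in a few lines of code. This is a minimal illustration of the heuristic only (the function name is ours, and the square-root rule is a starting point, not a guarantee of optimality):

```python
import math

def rule_of_thumb_k(n_train: int) -> int:
    """Heuristic starting value for k: round(sqrt(n)), made odd to avoid
    voting ties in binary classification."""
    k = max(1, round(math.sqrt(n_train)))
    if k % 2 == 0:
        k += 1  # shift even values to the next odd number
    return k

print(rule_of_thumb_k(100))  # sqrt(100) = 10, shifted to 11
```

In practice this value would only seed the search; an error or accuracy plot over a grid of k values around it is still needed to pick the final neighbourhood size.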
kNN can use different distance metrics. The most common is the Euclidean distance, the straight-line distance between a training sample and a test sample [9]. Other choices used in the literature are the Manhattan distance, the sum of the absolute differences of the Cartesian coordinates of two points, and the Hamming distance, which counts the positions at which two binary data strings differ. Different distances can yield different accuracy results for a given problem. Other modifications in terms of the distance metric can be found in [10], [11], and [12].
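The three metrics mentioned above can be written compactly; a minimal sketch:

```python
def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
print(hamming("1011", "1001"))     # 1
```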
Theoretically, the kNN classification performance is determined by the estimate of the conditional class probabilities of the query point in a local region of the data space, whose extent is set by the distance of the kth nearest neighbour to the query point. Different values of k yield different conditional class probability estimates, which makes the selection of the neighbourhood size k sensitive and thus affects the estimate. For a very small value of k, the estimate tends to be poor owing to noise, sparseness, and ambiguous or mislabelled points. Increasing k smooths the estimate by taking into account a larger region around the query point. On the other hand, a very large value of k over-smooths the estimate by introducing outliers from other classes, which degrades classification performance [13] and reduces the generalizability of the method.
To deal with this sensitivity to the neighbourhood size k, several weighted voting methods have been developed in the literature.
The basic idea behind weighted kNN (WkNN) is to give more weight to nearby points and less weight to points that are far away. Any function whose value decreases as the distance increases can be used as a weight; the simplest and most common choice is the inverse distance function. Thus different weights are assigned to the k neighbours based on their distance to the test point [14]. Many related models based on weighting schemes for kNN classifiers have been proposed; some of them can be found in [13], [15], [16], [17], and [18].
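Inverse-distance weighted voting can be sketched as follows. This is an illustration of the general WkNN idea, not any particular published implementation; the small constant `eps` is our own guard against division by zero when a neighbour coincides with the test point:

```python
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, query, k, eps=1e-9):
    """Classify `query` by an inverse-distance weighted vote of its
    k nearest training points (Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, label)
        for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + eps)  # closer neighbours weigh more
    return max(votes, key=votes.get)
```

With a decreasing weight function like this, a single distant neighbour from another class contributes very little to the vote, which is exactly the robustness the weighting is meant to buy.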
Although WkNN shows better performance than other weighting methods for kNN in many empirical comparisons, it still suffers from the issue of outliers in the data, particularly when the sample size is small [19], [20]. Moreover, an irregular class distribution in a dataset affects both kNN and WkNN, and the latter is also affected by the choice of k [13], [21]. The presence of irrelevant features in the given feature space also affects kNN, most often in data with a large number of features.
Ensemble procedures have been proposed to improve performance in such situations [22], [23], [24], [25], [26]. Pooling a large number of base models is termed ensemble learning. This approach has gained considerable interest, leading to efficient and sophisticated techniques that improve the classification accuracy of base models. Diversity and accuracy are the two desirable characteristics of ensemble methods: the errors made by the classifiers should be uncorrelated, and weak models should not take part in decision making. The most frequently used procedure is bagging (bootstrap aggregation), which pools the results of classifiers fitted on bootstrap samples drawn randomly from the training data [27], [28], [29]. Similarly, an extended weighted voting ensemble method groups base classifiers and then selects the group with the highest vote based on the single classifiers' weights [30]. Motivated by this idea, this work explores feature weighting, instead of assigning weights to observations, in conjunction with random feature-subset selection for ensemble construction. The proposed weighting leads to accurate base models, while building them on random feature subspaces introduces diversity. Moreover, the proposed method is robust to the choice of k in that only regulatory features take part in decision making, whereas the rest are penalized by assigning them low weights. The key features of the proposed ensemble method, which make the proposal novel, are:
• Base models in the ensemble are accurate due to feature weighting.
• The ensemble consists of diverse models due to bootstrapping and feature subsets.
• The ensemble is robust to the choice of k because feature weighting by support vectors works equally well for large and small neighbourhood sizes.

II. RELATED WORK
Several methods based on the ideas of weighting and randomization have been proposed in the literature, yet the selection of the neighbourhood size k remains a major open problem for kNN. The Modified k Nearest Neighbour (MkNN) method uses weights, like weighted kNN, while checking the validity of each data point when classifying the nearest neighbours. The validity of every sample in the training set is computed according to its neighbours, and a weighted kNN is then performed on the test samples [15]. The distance-weighted k-nearest neighbour rule (DWkNN) [13] is another approach to improving the classification performance of kNN; it uses a dual distance-weighted function based on WkNN to determine the class of the query point by majority weighted voting [13]. It is well known that kNN is time consuming because its classification time is proportional to the number of features and the number of training instances [15], [31]. To overcome this issue, the ''k Nearest Neighbours on Feature Projections'' (kNNFP) method [31] was proposed. This technique stores training instances as their sorted projections on each feature dimension separately, achieving faster classification of a new instance than kNN. kNNFP first makes one prediction per feature and then classifies a new instance through a majority vote over these individual per-feature classifications. The basic assumption of kNNFP is that all features are equally relevant, that is, each feature has the same power in the voting. To some extent, this reduces the hindrance of irrelevant features; however, if the irrelevant features are large in number, voting alone is not sufficient. To overcome this issue, WkNNFP [17] was introduced, which investigates the effect of incorporating feature weights in voting by multiplying the vote of each feature by its weight.
This algorithm stores all the projections of training instances on linear features in memory as sorted values. The vote of each feature and its distance from the test point are then computed to give the final classification. Model based k nearest neighbour [16] models the input data and uses the fitted model for classification. This approach not only improves accuracy but is also more efficient in terms of execution time, and it chooses an appropriate value of k automatically.
In the context of ensemble approaches, pooling a large number of base kNN models, also known as multiple feature subsets (MFS), has been proposed in the literature. A random subset of the feature space is used for each base model of the ensemble, and the final class is predicted by pooling the results of the base models [32]. Rank Nearest Neighbour (RNN) is another modification of the classic kNN, which tries to improve accuracy with less execution time by assigning ranks to the training data of each category [18]. A similar technique, random kNN, is used to classify high dimensional datasets: the features are ranked on the basis of their discriminative power and the resulting set of top ranked features is used for the final model [33]. Another approach uses Term Frequency-Inverse Document Frequency (TF-IDF) as a weighting scheme to assign weights to permission features [34]. Similarly, the work in [35] first ranks the permission attributes using Information Gain (IG) as a feature selection method; weights are then assigned to the ranked features by applying ensemble extra trees, producing feature subsets that represent the attribute properties. The final feature subsets are used to update the sets of observations with the 5, 10 and 20 top ranked features, and weightages are computed using ensemble extra trees on the updated datasets to produce a permission feature model.
The BagInRand [36] technique adapted bagging to kNN classifiers by inducing randomness in the distance metrics. The Double-Bagging approach uses out-of-bag observations to train a second classifier: at each bootstrap step, the out-of-bag sample is used to perform a linear discriminant analysis, and the discriminant variables of each bootstrap sample are incorporated as supplementary predictors for the classification tree [37]. The work proposed in [38] and [39] reduces the data size and speeds up execution by removing samples that are identical and do not provide extra information.
Although model based kNN improves prediction performance and reduces the size of the training data, the procedure fails on class-imbalanced problems. The k-d tree nearest neighbour [40] is built by recursively dividing the training dataset with half-planes and using the resulting structure for multi-dimensional observations. This method produces a perfectly balanced tree in less time; however, it is computationally complex and may misclassify data patterns. A hybrid method based on SVM and kNN, which deals with multi-class problems and gives appreciable performance, is proposed in [41]. The authors in [42] suggested combining different base kNN learners using various distance function weights acquired by a genetic algorithm. To introduce diversity in the ensemble, the authors in [43] used different metrics as perturbation parameters for distance calculation. To select diverse and accurate optimal base models, the work in [44] tried to improve bagging by applying an optimization process called Selecting Base Classifiers on Bagging (SBCB). The present work proposes feature weighting for identifying the nearest neighbours used to estimate the class label of a test/unseen observation: features that do not possess high discriminative ability are assigned low weights, while those that actually regulate the response in the training data are assigned high weights. All the above mentioned methods fail to provide a framework of ensemble construction that builds accurate and diverse models and is robust to the choice of k. Therefore, this paper proposes an ensemble method based on k nearest neighbour classifiers that achieves the two desirable characteristics of accuracy and diversity and is also robust to the choice of the number of nearest neighbours.

III. PROPOSED METHOD
The proposed technique gives weights to the k nearest neighbours, like weighted kNN, by identifying discriminative features. Furthermore, for extra randomization and to avoid over-fitting, the base kNN models are built on bootstrap samples with random feature subsets. This leads to an ensemble of diverse/random and accurate kNN (DRkNN) models.
Suppose a training dataset $L = (X_{n \times p}, Y)$, where $X_{n \times p}$ is a feature space with $n$ observations and $p$ features, $Y$ is a binary response variable, and $X_{1 \times p}$ is a test point. Let $B$ bootstrap samples be drawn from $L$, each with a random sub-sample of $d \le p$ features. To determine the $k$ observations in the neighbourhood of the test point, a kNN model is fitted using a distance formula with feature weights $w$ computed from a support vector machine (SVM). Each model uses majority voting to predict the class of the test point, so $B$ estimates of the unseen observation are obtained, i.e. $\hat{Y}_1, \hat{Y}_2, \hat{Y}_3, \ldots, \hat{Y}_B$. The final predicted class of the unknown observation $X_{1 \times p}$ is a second-round majority vote over the predictions given by the base models. The ordinary kNN model determines the nearest observations by the Euclidean distance

$$d(x_0, x_i) = \sqrt{\sum_{j=1}^{p} (x_{0j} - x_{ij})^2}, \quad (1)$$

whereas the proposed method uses the Euclidean distance in conjunction with the weights $w$ given by the SVM classifier:

$$d_w(x_0, x_i) = \sqrt{\sum_{j=1}^{d} w_j (x_{0j} - x_{ij})^2}. \quad (2)$$
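The weighted distance is straightforward to compute once the weights are available. A minimal sketch, assuming the feature weights w have already been obtained (e.g. from an SVM fitted on the training data):

```python
def weighted_euclidean(x0, xi, w):
    """Weighted Euclidean distance: each squared coordinate difference is
    scaled by its feature weight, so weakly weighted (noisy) features have
    little influence on which neighbours are selected."""
    return sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x0, xi)) ** 0.5

# Uniform weights recover the ordinary Euclidean distance:
print(weighted_euclidean((0, 0), (3, 4), (1.0, 1.0)))  # 5.0
# Zeroing a feature's weight removes it from the distance entirely:
print(weighted_euclidean((0, 0), (3, 4), (1.0, 0.0)))  # 3.0
```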

Feature Weights w:
As the SVM algorithm uses a hyperplane $H$ to classify the data points into their respective classes, i.e.

$$H: \; w^{T}\psi(x) + b = 0,$$

the distance between a given point $\psi(x_0)$ and the hyperplane $H$ is given by

$$D(\psi(x_0), H) = \frac{|w^{T}\psi(x_0) + b|}{\|w\|_2}, \quad (3)$$

where $\|w\|_2$ is the Euclidean norm, defined as

$$\|w\|_2 = \sqrt{\sum_{j} w_j^2}.$$

The weight vector is the argument that maximizes the distance given in Equation 3, that is,

$$w = \arg\max_{w} \; D(\psi(x_0), H).$$

Based on the above description, the proposed algorithm takes the following steps for the weighted distance calculations in the base kNN models and the ensemble formation.
1) Take B bootstrap samples from the given training data $L = (X_{n \times p}, Y)$, considering a random subset of features with each sample.
2) For each bootstrap sample and its random feature subset, build a base kNN model using Equation 2 to find the weighted distances of the observations to a test point.
3) Use majority voting to estimate the class label of the test point based on the nearest neighbours identified in each of the B models.
4) Allow the B models collectively to vote for the final class label prediction.
The pseudocode of the proposed DRkNN ensemble is given in Algorithm 1 and its flowchart is shown in Figure 1.
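The four steps above can be sketched as follows. This is an illustrative simplification, not the paper's Algorithm 1: the feature weights are assumed to be precomputed (for instance, magnitudes of a linear SVM's coefficients), and the helper names and default parameter values are our own:

```python
import random
from collections import Counter

def weighted_euclidean(x0, xi, w):
    """Weighted Euclidean distance of Equation 2."""
    return sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x0, xi)) ** 0.5

def knn_vote(X, y, query, k, w):
    """Majority vote among the k nearest training points (Step 3)."""
    dists = sorted((weighted_euclidean(x, query, w), lab) for x, lab in zip(X, y))
    return Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]

def drknn_predict(X, y, query, feature_weights, B=25, d=None, k=3, seed=0):
    """Steps 1-4: B bootstrap samples, each with a random feature subset;
    base kNN models vote, then the B base predictions vote again."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    d = d or max(1, int(p ** 0.5))  # subset size d <= p (default is our choice)
    preds = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # Step 1: bootstrap sample
        feats = rng.sample(range(p), d)              # ... with a feature subset
        Xb = [[X[i][j] for j in feats] for i in idx]
        yb = [y[i] for i in idx]
        wb = [feature_weights[j] for j in feats]     # SVM-derived weights
        qb = [query[j] for j in feats]
        preds.append(knn_vote(Xb, yb, qb, k, wb))    # Steps 2-3
    return Counter(preds).most_common(1)[0][0]       # Step 4: second-round vote
```

Because each base model sees a different bootstrap sample and feature subset, the models are diverse; the weights keep each model accurate by down-weighting noisy features in the neighbour search.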

IV. EXPERIMENT AND RESULTS
To assess the performance of the proposed method and compare the results with those of other state-of-the-art methods, a total of 14 benchmark datasets are considered. These datasets are described in the following sub-section.

A. BENCHMARK DATASETS
The benchmark datasets used in this paper are taken from openml. A brief description of these datasets is given in Table 1. The table shows the number of features p, number of observations n, class-wise distribution and a hyperlink to the source of each dataset.

B. EXPERIMENTAL SETUP
Each dataset is divided into two mutually exclusive parts, i.e., 70% for training and the remaining 30% for testing. The R packages caret [45] and kknn [46] are used for kNN and weighted kNN, respectively, while the R library rknn [47] is used for random kNN. Similarly, the R libraries kernlab [48] and randomForest [49] are used for the SVM and random forest models, respectively. The underlying parameters of all the methods are fine tuned via 10-fold cross validation.
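The holdout protocol and one of the reported performance measures can be made concrete with a short sketch. The function names and the seed are illustrative, not the paper's code (which uses R):

```python
import random

def train_test_split(X, y, test_frac=0.30, seed=42):
    """70/30 holdout split into mutually exclusive training and testing parts."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = int(round(test_frac * len(X)))
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def brier_score(y_true, p_hat):
    """Mean squared difference between predicted class probabilities and
    the 0/1 outcomes; lower is better."""
    return sum((p - t) ** 2 for p, t in zip(p_hat, y_true)) / len(y_true)
```

Accuracy and Cohen's kappa are computed on the same held-out 30%, and tuning happens only inside the 70% training part via 10-fold cross validation.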

C. DISCUSSION ON RESULTS FROM BENCHMARKING
It is clear from Table 2 that the proposed method shows good overall performance on 8 out of 14 datasets. RkNN shows better results than the others on the KCB1 dataset. Similarly, RkNN also shows higher accuracy and kappa on the Sonar dataset, while the Brier score on this dataset is lower for the proposed method. kNN shows high accuracy and a low prediction error on the Vine dataset, while kappa is highest for SVM. The proposed method shows the highest accuracy on the MC dataset, while SVM and RkNN give the best results on the same dataset in terms of kappa and Brier score, respectively. The proposed and the other kNN based techniques are also assessed for different values of the parameter k, i.e., k = 3, 5, 7; the results are given in Table 3. It is clear from the table that the proposed method outperforms the others on the majority of the datasets in terms of all the performance metrics, showing that the proposed method is not affected by the parameter k as much as the other kNN based methods. Boxplots of these results are also provided for visual comparison.

For the simulated datasets, out of 200 samples, 100 are generated from a distribution with fixed parameters and assigned to one class, i.e., Class 0, and the remaining 100, generated from the same distribution with different parameter values, are reserved for Class 1. The remaining features are randomly generated so that they do not affect the response. The idea is to have two different sets of features: one consisting of informative variables, where feature weighting is expected to work, and the other of non-informative variables, which are expected to be ignored by the algorithm while identifying the nearest neighbours for a new/unseen observation. In the second scenario, the observations on the features are generated in such a manner that they do not differ significantly between the two classes, so there is no distinction between important and non-important features.
Furthermore, in this scenario, all the features carry equal importance. Table 4 gives a brief description of the synthetic datasets: the first column shows the ID of each dataset, and the second and third columns give the feature distributions for Class 0 and Class 1, respectively. The same experimental setup as for the benchmark datasets is used for the synthetic datasets, and the results are given in Table 5.
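A generator in the spirit of these scenarios can be sketched as follows. The exact distributions and parameter values of the paper's datasets are those listed in Table 4; the Gaussians, means, and feature counts below are illustrative assumptions only:

```python
import random

def make_synthetic(n_per_class=100, n_informative=3, n_noise=5, seed=0):
    """Two classes of 100 samples each: informative features come from
    Gaussians whose mean differs between the classes, while noise features
    follow the same distribution for both classes and so carry no signal."""
    rng = random.Random(seed)
    X, y = [], []
    for label, mean in ((0, 0.0), (1, 2.0)):  # shifted mean separates classes
        for _ in range(n_per_class):
            informative = [rng.gauss(mean, 1.0) for _ in range(n_informative)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            X.append(informative + noise)
            y.append(label)
    return X, y
```

Setting the two class means equal would mimic the scenario in which no feature discriminates between the classes and all features carry equal importance.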
It is evident from the results on the simulated datasets that the proposed method outperforms the others in simulation scenarios D1 and D2, where some of the features are important. The results also reveal that the proposed method is robust to the choice of k. In scenario D3, where all features carry equal importance, the proposed method does not outperform the other methods. For further illustration, boxplots of the results are given in Figure 7.

V. CONCLUSION
Inspired by weighted kNN, the proposed method classifies the given data on the basis of the k nearest neighbours, assigning weights to features by means of support vectors. To make the base models diverse, random feature subsets are used. The final class of the test point is assigned by a majority vote over the predicted classes given by all base models. The results of the proposed ensemble are compared with those of base kNN, weighted kNN, random kNN, random forest and support vector machine on 14 datasets, with different values of k tried for the kNN base models. The performance is measured using accuracy, Cohen's kappa and the Brier score. The results clearly show that the proposed ensemble outperformed the other standard procedures on 8 out of 14 datasets.
The main intuition behind the efficiency of the suggested method is that each base learner is constructed on a randomly selected bootstrap sample drawn from the training observations with a random subset of features, which ensures diversity among the models, while feature weighting ensures accuracy. The technique draws bootstrap samples from the training data and fits a hyperplane to find the support vectors used for the weighting scheme, which makes the method more efficient. Our simulation analysis also revealed that the proposed method is effective on datasets with non-informative features. Moreover, the robustness of the proposed method was tested using different values of k.
The proposed method could be further improved by model selection based on out-of-bag observations or sub-sampling. For additional randomness in the base models, the idea of randomly projecting the given feature space into lower dimensions could also be used; this might be very helpful in high dimensional settings. Model selection as given in [50] could also be incorporated for further improvement. Furthermore, considering the ideas of feature engineering and feature weighting [51], [52], [53], [54], [55], [56], [57] in conjunction with the proposed method might open further research avenues for improved classification and prediction. Incorporating the above mentioned ideas might increase the execution time of the proposed method; this issue could be addressed by using parallel computing to parallelize Steps 1 and 2 of the proposed algorithm.