Challenges in KNN Classification

The KNN algorithm is one of the most popular data mining algorithms. It has been widely and successfully applied to data analysis applications across a variety of research topics in computer science. This paper illustrates that, despite its success, there remain many challenges in KNN classification, including K computation, nearest neighbor selection, nearest neighbor search and classification rules. Having established these issues, recent approaches to their resolution are examined in more detail, thereby providing a potential roadmap for ongoing KNN-related research, as well as some new classification rules regarding how to tackle the issue of training sample imbalance. To evaluate the proposed approaches, some experiments were conducted with 15 UCI benchmark datasets.


INTRODUCTION
NN (nearest neighbor) classification is an efficient approach to approximation that was first proposed as a method of nonparametric discrimination in statistics [1]. However, it has long suffered from the key issue of overfitting [2], [3]. KNN (K nearest neighbor) classification was also advocated by Fix and Hodges [1] as a possible solution to this issue. Here, the K objects in a training dataset that are closest to a test object are found. A label is then assigned according to the majority class in this neighborhood. This is the standard prediction approach in KNN classification, known as the majority rule.
KNN classification has the remarkable property that, under very mild conditions, the error rate of a KNN algorithm tends towards being Bayes optimal as the sample size tends towards infinity [4]. For any data analysis application, if establishing a model with some training dataset is proving troublesome, it is likely that a KNN algorithm will provide the best solution [5]. As a result, KNN algorithms have been widely used in research and are considered to be one of the top-10 data mining algorithms [6]. In this era of big data, KNN approaches provide a particularly efficient way of identifying useful patterns and developing case-based reasoning algorithms for AI (artificial intelligence) [7], [8].
As with other classification algorithms, KNN classification is a two-phase procedure: model training and test data prediction. In the training phase, KNN only involves finding a suitable K for a given training dataset [9]. The most common method for this is cross-validation. In the prediction phase, the first step is a search for the K data points in the training dataset that are most relevant to a query (test data/sample). Without other information, the K most relevant data points are taken to be the K nearest neighbors of the test data within the training dataset. After this, the test data is assigned the class that occurs most frequently amongst the K neighbors. This is referred to as the majority rule (which is similar to the Bayesian rule). The above procedure indicates that there are four main challenging issues in KNN classification: K computation, nearest neighbor selection, nearest neighbor search, and the classification rule.
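The prediction phase described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance, a full scan of the training data, and hypothetical names (`knn_predict`, `train_X`, `train_y`) that do not come from the cited work.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point (complete scan).
    scored = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    scored.sort(key=lambda pair: pair[0])
    # Majority rule: the most frequent class amongst the k nearest neighbors.
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data (assumed for illustration): two "a" points and three "b" points.
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_y = ["a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (0.05, 0.1), k=3))  # a
print(knn_predict(train_X, train_y, (0.05, 0.1), k=5))  # b
```

The same query receives different labels for K = 3 and K = 5, which is a small-scale view of why K computation and the treatment of unbalanced classes are listed among the challenges.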
It can be very difficult to set a suitable K for a given training dataset. Training samples often have different distributions in the sample space, which can lead to there being no obviously suitable K for the whole training sample space. This has resulted in two research directions. One is to set different K values for different sample subspaces [10], [11]. The other is to set different K values for different test samples [12]. Zhang et al. report that efficiency is good when the sample space is clustered into 3-5 subspaces [11]. Setting different K values for different test samples, however, is time-consuming.
Nearest neighbor selection has been studied extensively. It is really a procedure of determining a proximity measure. Much of the related research has focused on constructing distance functions for measuring proximity. Song, et al. proposed two KNN methods for measuring the informativeness of test data, Locally Informative-KNN (LI-KNN) and Globally Informative-KNN (GI-KNN) [13]. Each of them serves as a query-based distance metric to measure the closeness between objects. However, no distance function has yet been identified that is suitable for all training samples regardless of their distribution. In other words, there remains a need for a distance function for the selection of the K nearest neighbor points that can work effectively across most training samples. In addition, feature selection is useful for choosing the K nearest neighbors [14]. This can be a lazy procedure because it depends on the data mining task.
Nearest neighbor search is particularly challenging and remains unsolved because finding all K nearest neighbors of each test data requires a search of the complete sample space. As a result, KNN classification is often referred to as a lazy data mining method. Some efforts have been devoted to resolving the problem of how to undertake a truly effective nearest neighbor search. Most recent reports have focused on seeking K approximate nearest neighbors. Li, et al. examined 16 approximate nearest neighbor search algorithms in different domains [15]. They then proposed a nearest neighbor search method that empirically achieves both high query efficiency and high recall on the majority of the datasets under a wide range of settings.
The majority rule has been used widely and successfully in real applications as a KNN classification principle. However, if a training dataset has unbalanced classes, the majority rule is unable to work effectively. This is why there is little research that has sought to extend the use of KNN algorithms to cost/risk-sensitive learning.
The rest of this paper is organized as follows: Section 2 briefly introduces the problem of K computation. The methods for nearest neighbor selection are reviewed in Section 3. Nearest neighbor search is the topic of Section 4. An overview of the issues surrounding classification rules is provided in Section 5. Section 6 evaluates the efficiency of some new classification rules that have been developed to improve KNN classification. Some conclusions are put forward in Section 7.

K COMPUTATION
Setting a suitable K for a given training dataset is a key step in KNN classification. There are two main ways in which this can be accomplished. The first option is for the data analyst employing KNN classification to assume that the users will provide the K for their datasets. However, it is clearly challenging for users to establish effective K values in this way.
The second option is to use all of the samples in the given training dataset, i.e., to attack the issue of K computation with training samples. There are three main approaches to dealing with K computation. We briefly outline them below.
Setting a single K value for the whole sample space: Since its inception, almost all approaches to KNN classification have focused on setting only one suitable K for a given training dataset. A natural way of going about this is to use cross-validation to find a K for the training dataset. This method has been successfully applied in real applications in statistics and data mining [16]. One way of computing an optimal K for a dataset is to use what is known as holdout cross-validation. First of all, a suitable K for each sample in a dataset is searched for. Then, the K that delivers the highest classification efficiency is chosen for the whole sample space. Another technique is to use m-fold cross-validation. This first partitions the training dataset into m mutually disjoint subsets. Then, cross-validation is used to generate a viable K for each subset. Finally, the K that delivers the best classification efficiency is chosen for the whole sample space.
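As an illustration of choosing a single K by m-fold cross-validation, the sketch below uses the common variant that scores each candidate K by its accuracy pooled over all folds, which may differ in detail from the per-subset procedure described above. The function names, the candidate list, and the deterministic fold assignment are assumptions made for the example.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Majority vote among the k training items nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def choose_k_by_cv(data, candidate_ks, m=5):
    """Return (best K, accuracy) over m disjoint folds of `data`.

    `data` is a list of (features, label) pairs. Ties are resolved in
    favor of the earliest candidate (an arbitrary choice for this sketch).
    """
    folds = [data[i::m] for i in range(m)]  # m mutually disjoint subsets
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = 0
        for i, held_out in enumerate(folds):
            # Train on every fold except the held-out one.
            train = [item for j, fold in enumerate(folds) if j != i for item in fold]
            correct += sum(knn_predict(train, x, k) == y for x, y in held_out)
        acc = correct / len(data)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

In practice the folds would usually be shuffled and stratified by class; the slicing used here only keeps the example deterministic.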
Since the KNN method was invented by Fix and Hodges [1], a lot of research has been devoted to setting a single K for the whole sample space. For example, Loftsgaarden and Quesenberry treated the KNN method as a nonparametric solution for estimating a multivariate density function [17]. Sebestyen took the KNN method as an important tool of decision making and included it in his monograph "Decision-Making Processes in Pattern Recognition" [18]. Cover and Hart designed a KNN algorithm for pattern classification [2]. Wettschereck and Dietterich experimentally compared the nearest-neighbor and nearest-hyperrectangle algorithms [3]. Hastie and Tibshirani proposed a KNN method for classification and regression [19]. They designed a general model based on a local Linear Discriminant Analysis, called the DANN algorithm. Singh, et al. applied the KNN method to image understanding [20]. They designed a nearest neighbor algorithm that aims to find the nearest average distance rather than the nearest maximum number of neighbors. Peng, et al. advocated applying other classification rules to KNN classification. A locally adaptive neighborhood morphing classification method was developed to minimize bias [21]. Chen and Shao applied the KNN method to estimate the value of missing data [22]. A jackknife variance estimation was designed to improve nearest-neighbor imputation. Domeniconi and Gunopulos presented an adaptive KNN algorithm for pattern classification [23]. The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the neighborhood of the test data. Tao, et al. proposed an RKNN (reverse k nearest neighbor) method for retrieving dynamic multidimensional datasets [24]. It utilizes a conventional data-partitioning index on the dataset and does not require any pre-computation. Zhu and Basir designed a fuzzy KNN algorithm for remote sensing image classification [25]. In the algorithm, each nearest neighbor provides evidence on the belongingness of the input pattern to be classified, and this evidence is evaluated based on a measure of disapproval to achieve adaptive capability during the classification process. Zhang and Zhou applied the KNN method to multilabel classification [26]. They designed the ML-KNN algorithm, a multi-label lazy learning approach. Qin, et al. applied the KNN method to cost-sensitive classification [27]. Neighbors in the minority class are assigned a much higher cost. Liu, Wu and Zhang designed a KNN algorithm for multilabel classification [28]. A nearest neighbor selection was designed for multilabel classification, and a certainty factor was further adopted to address the problem of unbalanced and uncertain data. Gallego, et al. developed a clustering-based KNN classification in which the K nearest neighbors are taken as the initial points [29].
Gou et al. proposed a KNN algorithm based on local constraint representation [30]. Specifically, it first finds the K nearest neighbors of the test data in each class of the training data according to the Euclidean metric. Then it constructs K local mean vectors in each class according to the K nearest neighbors found in the previous step. Finally, it uses the local mean vectors to fit the test data and combines local constraints to obtain the final representation-based distance metric. It predicts the class label of the test data through this representation-based distance metric. In addition, Gou et al. also proposed a KNN algorithm based on local mean representation [31]. It differs from the previous algorithm. Although it also begins by constructing a local mean vector in each class of the training data, in the subsequent steps it performs two linear representations of the test data to obtain the optimal relationship representation. Finally, it predicts the class label of the test data through a new metric function based on the local mean.
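The local mean idea that underlies these methods can be illustrated with a deliberately simplified classifier. The sketch below is not the algorithm of [30] or [31]: it builds a single local mean vector per class from the class-specific K nearest neighbors and assigns the class whose local mean is closest in Euclidean distance, omitting the representation and constraint steps. All names are hypothetical.

```python
import math


def local_mean_predict(train_X, train_y, query, k):
    """Assign `query` to the class whose local mean vector is nearest."""
    best_label, best_dist = None, math.inf
    for label in sorted(set(train_y)):
        # Class-specific k nearest neighbors of the query.
        members = [x for x, y in zip(train_X, train_y) if y == label]
        nearest = sorted(members, key=lambda x: math.dist(x, query))[:k]
        # Local mean vector: coordinate-wise average of those neighbors.
        mean = [sum(column) / len(nearest) for column in zip(*nearest)]
        distance = math.dist(mean, query)
        if distance < best_dist:
            best_label, best_dist = label, distance
    return best_label
```

Because every class contributes the same number of neighbors, a rule of this kind is less dominated by the largest class than a plain majority vote.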
Setting different K values for different samples: It is well known that instances are nonuniformly distributed in a sample space. It therefore seems reasonable that different test data should be given different K values. Thus, some recent work has proposed setting an optimal K for each test sample. For example, Guo et al. have presented an approach where a KNN model is constructed for a sample that replaces the sample itself, to serve as the basis of classification [32]. The value of K is automatically determined, can vary according to the sample and is optimal in terms of classification accuracy. Li et al. have proposed an improved KNN classification algorithm that uses different numbers of nearest neighbors for different categories, rather than having a fixed number across all categories [33]. More samples (nearest neighbors) are used to decide whether test data should be classified to a category if there are more samples for that category in the training dataset. Yu et al. have put forward a method, called optimally pruned K-nearest neighbors (OP-kNNs), that can compete with other state-of-the-art methods while remaining fast [34]. Setting an optimal K for each sample has also been proposed in the context of graph sparse reconstruction (see [12]), called G-optimal-K. When compared with the three preceding optimal-K methods in experiments, the G-optimal-K approach produced the best results. We will therefore briefly examine this approach in greater detail.
The idea of Zhang et al. was to revise conventional KNN algorithms using a sparse reconstruction framework [12]. This can generate different K values for different test samples and make the best use of prior knowledge in the training dataset. The G-optimal-K approach was designed with three regularization terms. A reconstruction process between training and test samples is adopted to obtain a K value for every test sample. In the reconstruction process, a least squares loss function is applied to achieve the minimal reconstruction error. An L1-norm is then applied to generate element-wise sparsity for selecting the different K values for the different test samples. To improve the reconstruction performance, an L2,1-norm is employed to generate row sparsity, thus removing noisy samples. Finally, an LPP (Locality Preserving Projection) regularization term is suggested to preserve the local sample structure.
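One way to write an objective with the ingredients listed above is shown below. This is a sketch under assumed notation, not the exact formulation of [12]: the columns of X (d x n) are taken to be the training samples, the columns of Y (d x m) the test samples, W (n x m) the reconstruction weights, and the rho terms non-negative trade-off parameters. The LPP regularizer is left abstract as R_LPP(W) because its concrete form is not given in the text.

```latex
\min_{W \in \mathbb{R}^{n \times m}} \;
  \lVert Y - X W \rVert_F^2
  + \rho_1 \lVert W \rVert_1
  + \rho_2 \lVert W \rVert_{2,1}
  + \rho_3 \, R_{\mathrm{LPP}}(W),
\qquad
K_j = \lVert w_j \rVert_0 .
```

Under this reading, w_j is the j-th column of W, and the number of its nonzero entries gives the K value of the j-th test sample, while a row of W that is driven entirely to zero by the L2,1-norm corresponds to a training sample that is discarded as noisy.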
K value approximation: The G-optimal-K approach has since been extended to approximate computation in an algorithm called Ktree (see [35]). This is illustrated in Fig. 1. To get an optimal K for test data, one must compute a K value for each new data item, one by one, before predicting its class. A major issue with K computation is that it is both expensive and time-consuming if users would like to set different K values for predicting the classes of different data. Therefore, Zhang et al. advocated approximating the optimal K of the test data with its nearest neighbor's optimal K by using what is called a Ktree [35]. A Ktree is built to rapidly search for the nearest neighbor and K value for the data. The K computation proceeds as follows. In the training phase, the KNN method is used to build a Ktree for the training dataset, where each leaf node is a training sample with an optimal K value. In the prediction phase, the KNN method is used to search the Ktree to obtain the nearest neighbor of the test data. The optimal K of the nearest neighbor is assigned to the test data.
The K value approximation delivers three results. First, different K values can be set using the training samples. Second, the approximated K values work well compared with K values computed in real time. Third, the approximated K values can be trained before a data mining task is given, whereas a real-time K is obtained only after the task is given, which is a lazy procedure.
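The approximation step can be sketched as follows. The example assumes that an optimal K has already been learned for every training sample (the hypothetical list `train_k`) and replaces the tree search with a linear scan, so it illustrates only the borrowing of K from the nearest neighbor, not the Ktree data structure itself.

```python
from collections import Counter
import math


def predict_with_borrowed_k(train_X, train_y, train_k, query):
    """Classify `query` using the K stored with its nearest training sample."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    # Step 1: the nearest training sample lends its precomputed K to the query.
    k = train_k[order[0]]
    # Step 2: ordinary majority-rule KNN with the borrowed K.
    votes = Counter(train_y[i] for i in order[:k])
    return k, votes.most_common(1)[0][0]
```

For example, with `train_k = [1, 1, 3, 3, 3]` a query lying next to the first training sample is classified with K = 1, while a query lying next to the third is classified with K = 3.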

NEAREST NEIGHBOR SELECTION
As mentioned above, nearest neighbor selection is really a procedure of determining a proximity measure. Most of the research relating to nearest neighbor selection involves studying the distance function or similarity metrics for measuring the proximity between KNN classification objects. Numerous techniques have been developed for modifying KNN classification in terms of distance measurement selection/construction and this has become a hot topic in KNN algorithm research [36], [19], [21], [37]. Currently a variety of distance measures are available, such as Euclidean, Hamming, Minkowski, Mahalanobis, Canberra, Chebyshev, Quadratic, Correlation, Chi-square, and hyperrectangle [38], Value Difference Metrics [39] and Minimal Risk Metrics [40], with an additional option being grey distance [41]. However, distance functions generally do not perform consistently well, even under specified conditions [42]. This makes the use of a KNN approach highly experience dependent. Various attempts have been made to remedy this situation. Amongst these, DANN (Discriminant Adaptive Nearest Neighbor) is notable for carrying out a local linear discriminant analysis to deform the distance metric based on the 50 nearest neighbors [19]. LFM-SVM (Local Flexible Metric based on Support Vector Machine) also deforms the metric by feature weighting. Here, the weights are inferred by training an SVM on the entire dataset [23]. HkNN (K-local Hyperplane distance Nearest Neighbor) uses a collection of 15-70 nearest neighbors from each class to span a linear subspace for that class. Classification is then based not on the distance to prototypes but on the distance to linear subspaces [37].
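A few of the distance measures named above are simple enough to state directly. The following sketch gives textbook definitions for numeric vectors (and, for Hamming, equal-length sequences); the function names are chosen for the example and are not tied to any particular library.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def euclidean(u, v):
    return minkowski(u, v, 2)


def chebyshev(u, v):
    """Largest coordinate-wise difference (the limit of Minkowski as p grows)."""
    return max(abs(a - b) for a, b in zip(u, v))


def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(u, v))


def canberra(u, v):
    """Sum of |a - b| / (|a| + |b|), skipping coordinates where both are zero."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)
```

Different measures can rank the same candidate neighbors differently, which is one reason the choice of measure tends to be dataset dependent.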
There are other kinds of distance defined by the data properties. Examples here include: tangent distance using the USPS zip code dataset [43], shape context-based distance using the MNIST digit dataset [44], distances between histograms of textons using the CUReT dataset [45] and geometric blur-based distances using Caltech-101 [46]. These measures can be extended by kernel techniques so as to estimate a curved local neighborhood [47]. This makes the space around the samples nearer to or further from the test data, depending on class-conditional probability distributions. There have also been many other efforts to measure the proximity between samples. For example, Blanzieri and Ricci presented a minimum risk metric (MRM) for classification tasks that exploits estimates of the posterior probabilities [40]. The MRM is optimal, in the sense that it optimizes the finite misclassification risk, whereas the Short and Fukunaga metric minimizes the difference between finite risk and asymptotic risk. Domeniconi, et al. built a locally adaptive nearest-neighbor classification method to try to minimize bias [36]. A chi-squared distance analysis was employed to compute a flexible metric for producing neighborhoods that are highly adaptive to query locations. Neighborhoods are elongated along less relevant feature dimensions and constricted along the most influential ones. Peng, et al. advocated an adaptive KNN classification method for minimizing bias [47]. Quasiconformal transformed kernels were applied to compute neighborhoods over which the class probabilities tend to be more homogeneous. Athitsos, et al. applied the KNN method to information retrieval [48]. A method, BoostMap, was designed for efficient nearest neighbor retrieval under computationally expensive distance measures. Chen, et al. developed a KNN search that uses a distance lower bound to avoid calculating the distance itself if the lower bound is already larger than the global minimum distance [49]. They constructed a lower bound tree (LB-tree) by agglomeratively clustering all the sample points to be searched. Li, et al. proposed a KNN algorithm with local probability centers of each class [33]. It can reduce the number of negative contributing points in a training set, which are the known samples falling on the wrong side of the ideal decision boundary, by restricting their influence regions. Liu, Wu and Zhang presented a nearest neighbor selection designed for multilabel classification [28]. The target labels of test data are predicted with the help of relevant and reliable data, which are explored using the concept of the shelly nearest neighbor. Song, et al. proposed two KNN methods for measuring the informativeness of test data [13]. That is, Locally Informative-KNN (LI-KNN) and Globally Informative-KNN (GI-KNN) were constructed as query-based distance metrics to measure the closeness between objects. Zhang, Cao and Wang developed a weighted heterogeneous distance metric (WHDM) [50]. With the WHDM, the RRSB (reduced random subspace-based Bagging) algorithm is proposed for constructing an ensemble classifier, which can increase the diversity of the component classifiers without damaging their accuracy. Gou, et al. designed a generalized mean distance-based KNN classifier (GMDKNN) [51]. The multi-local mean vectors of a test data in each class are calculated by adopting its class-specific K nearest neighbors.
More recently, a new measure named neighborhood counting has been proposed that defines the similarity between two data points by using the number of neighborhoods [42]. To measure the similarity between two data points, all neighborhoods covering both data points are counted, and the number of such neighborhoods is taken as the measure of similarity. As the features of high-dimensional data are often correlated, the above kinds of measures can easily become meaningless. Some approaches have been designed to deal with this issue, e.g., by applying variable aggregation to define the measure [36], [52], [53]. Aside from the above kinds of measures, another strategy that can be applied is to consider the geometrical placement of the neighbors rather than their actual distances [54]. This approach is effective in some cases, but is in conflict with human intuition when the data is manifold. López et al. proposed a nearest neighbor classification for high-dimensional data [55]. Specifically, it first sorts all the features through the FR strategy. Then it selects the first r features with larger weights. Finally, a new distance function is constructed to predict the class label based on the selected features and the test data. Feng et al. proposed a new distance measurement function to solve the problem of class imbalance [56]. It first reconstructs the data with a projection matrix, and then calculates the KL divergence between different classes. Finally, it obtains a distance metric matrix by solving the proposed objective function. In order to solve the problem of outliers, Mehta et al. proposed a KNN classification through harmonic mean distance [57]. It looks for the K nearest centroid neighbors of the test data in each class. In addition, it also calculates a local centroid mean vector in each class, and uses the nearest harmonic mean distance between the test data and the local centroid mean to predict the class label of the test data. Syaliman et al. proposed a KNN algorithm based on local mean and distance weight [58]. It not only finds the local mean vector in each class, but also applies weights based on the distance. Classes farther from the test data receive smaller weights, whereas classes closer to the test data receive larger weights. Jiao et al. proposed a paired distance metric for KNN [59]. It avoids the traditional practice of using only one distance metric for the global data. It also sets weights on features and resolves the uncertainty in the output of the classifier. Nguyen et al. proposed a large-margin distance metric learning approach [60]. It can maximize the margin of each training example. It also solves the optimization problem arising from the non-convexity of the proposed formulation. Weinberger et al. proposed a distance measurement function for KNN [61]. It aims to make the K nearest neighbors always belong to the same class and to maximize the distance between different classes. Goldberger et al. proposed a new Mahalanobis distance measure that can learn a low-dimensional linear embedding of the data [62]. In this way, it can reduce the computational complexity of the algorithm and speed up KNN classification. Mensink et al. proposed two distance-based classifiers, i.e., KNN and NCM (nearest class mean) [63]. In addition, they introduced a new distance measurement function to improve the performance of these classifiers. Their experiments verified that the NCM algorithm has better performance. Davis et al. used information theory to learn a Mahalanobis distance function [64]. It minimizes the KL divergence between two Gaussian distributions by constraining the distance function. Finally, it learns a Mahalanobis matrix A by optimizing the objective function. Nguyen et al. proposed a new metric learning method through maximization of the Jeffrey divergence [65]. Specifically, it first generates two multivariate Gaussian distributions from local pairwise constraints. Then it maximizes the Jeffrey divergence of these two distributions to obtain a linear transformation. Finally, it turns the problem into an unconstrained optimization problem and solves it. Globerson et al. proposed a new metric learning method for classification tasks [66]. It makes points of the same class close to each other and points of different classes far apart. It optimizes the equivalent convex dual form of the proposed function to solve for the metric matrix. Fisher et al. discussed the application of multiple measures to the same principles in classification problems [67]. Wang et al. proposed a feature extraction algorithm [68]. Its core idea is to draw data of the same class towards a sample's neighborhood, and to push data of different classes away from it as much as possible. It avoids the small sample size problem of traditional LDA (Linear Discriminant Analysis). In addition, it has also been extended to nonlinear feature extraction through a kernel function.
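Several of the metric learning methods above learn a Mahalanobis-type distance. For reference, the general form of such a distance is given below, with A a positive semidefinite matrix; the symbols x, y and L are notation assumed for this note rather than taken from the cited papers.

```latex
d_A(x, y) = \sqrt{(x - y)^{\top} A \,(x - y)},
\qquad A \succeq 0 .
```

If A is factorized as A = L^T L, then d_A(x, y) equals the Euclidean distance between Lx and Ly, so learning A is equivalent to learning a linear transformation of the data. When L has fewer rows than the data has features, the transformation is also a low-dimensional linear embedding, which reduces the cost of each distance computation.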
Lately a very different approach to nearest neighbor selection has been adopted, called the shell-KNN algorithm, as mentioned in Section 2 [16], [69]. It first selects the K nearest neighbors using distance functions. Then, the shell nearest neighbors are chosen from the K nearest neighbors. This has therefore been described as quadratic-selection.
Generally, there is an expectation that all the selected nearest neighbors for a test data will be ideally distributed around the test data, as illustrated in Fig. 2. Fig. 2 shows an ideal nearest neighbor distribution for test data A. However, the collection of training samples for real applications is effectively random and the training samples often have different distributions. Consequently, the nearest neighbors of a test data will not have an ideal distribution. Other potential cases are shown in the subsequent figures. According to experiments conducted by Zhang [16], [69], it is much more efficient to take nearest neighbors closely distributed around test data A. Quadratic-selection is used to identify the nearest neighbors within the shell surrounding test data A. First of all, one searches for the K nearest neighbors of test data A in the training dataset. Each feature/attribute is then treated as an axis and the left and right nearest neighbors are selected from the K nearest neighbors for test data A. Finally, the nearest neighbors within the shell of test data A are selected from the left and right nearest neighbors on all axes.
Quadratic-selection can only identify those nearest neighbors that are distributed around test data A. It is therefore worth taking note of the following cases.
Point 1: Some nearest neighbors amongst the K nearest neighbors will be selected many times.
Point 2: The number of selected shell-nearest neighbors of an item of test data, S, is less than or equal to K, i.e., S ≤ K.
Point 3: Different test data will produce different S values.
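Quadratic-selection can be sketched as two passes over the data. The interpretation of "left" and "right" used here is an assumption made for the example: on each axis, the left neighbor is the one amongst the K nearest neighbors with the largest coordinate not exceeding that of the test data, and the right neighbor is the one with the smallest coordinate exceeding it. The function name is hypothetical.

```python
from collections import Counter
import math


def shell_neighbors(train_X, query, k):
    """Return a Counter mapping training indices to their selection counts."""
    # First selection: the ordinary K nearest neighbors by Euclidean distance.
    knn = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))[:k]
    picks = Counter()
    # Second selection: on every axis keep the closest neighbor on each side.
    for axis in range(len(query)):
        left = [i for i in knn if train_X[i][axis] <= query[axis]]
        right = [i for i in knn if train_X[i][axis] > query[axis]]
        if left:
            picks[max(left, key=lambda i: train_X[i][axis])] += 1
        if right:
            picks[min(right, key=lambda i: train_X[i][axis])] += 1
    return picks
```

The keys of the returned counter are the shell nearest neighbors, so there are never more of them than K, and the counts record how often each neighbor was selected across the axes.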
In the case of point 1 above, the number of times a nearest neighbor is selected can be used to compute the weight of the nearest neighbor. This is a new way of setting weights for samples. We will discuss how to make use of this case in detail in Section 5.
In relation to point 2, the set of selected shell nearest neighbors of an item of test data is a subset of the K nearest neighbors. Note that, when the nearest neighbors of the test data are further away, as shown in Fig. 5, the quadratic-selection approach may fail to locate them.
Point 3 reflects the fact that different test data can be given different K values. In the shell-KNN algorithm, all nearest neighbors are expected to be distributed around the test data. Some points therefore need to be discarded from the selected K nearest neighbors. In other words, the number of shell-nearest neighbors should be less than or equal to K for each item of test data.

NEAREST NEIGHBOR SEARCH
It is often suggested in the literature that KNN classification is a lazy form of learning. It does not need to train any model to fit the given training samples, beyond setting the K values. This means that, for each item of test data, KNN classification has to search the whole training sample space to obtain the K nearest neighbors. This is a time-consuming procedure that narrows the range of possible applications for KNN algorithms. To address this, Samet designed an algorithm for finding the K nearest neighbors of a test data [70]. It adopted a pruning technique with maxnearestdist as the distance upper bound.
For supporting the increasing functionalities of smartphones, it is important to fast locate one of the nearest objectives for a customer. Therefore, there are recently some reports on searching K approximate nearest neighbors. For example, Jegou, Douze and Schmid proposed a product quantization for nearest neighbor search [71]. The idea is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. And then, many improved models have been developed for spatiotemporal data. Li and Hu applied the product quantization to develop an approximate nearest neighbor search algorithm for high order data [15]. They incorporated the high order structures of data into the process of designing a more effective subspace decomposition way. Pan, et al. designed a Product quantization with dual codebooks for approximate nearest neighbor search [72]. It uses dual codebooks simultaneously to reduce the quantization error in each subspace. Also, a grouping strategy was presented to group the database vectors by their encoding modes, and thus the extra memory cost caused by dual codebooks can be reduced. Chiu, et al. presented a ranking model with learning neighborhood relationships embedded in the index space [73]. In this model, the nearest neighbor probabilities are estimated by employing neural networks to characterize the neighborhood relationships, i.e., the density function of nearest neighbors with respect to the query. Lu, et al. advocated an approximate nearest neighbor search via virtual hypersphere partitioning [74]. The idea is to impose a virtual hypersphere, centered at the query, in the original feature space and only examine points inside the hypersphere. Munoz, et al. developed a large scale approximate nearest neighbor search for high dimensional data [75]. They used a nearest neighbor graph created over large collections. 
This graph is created based on the fusion of multiple hierarchical clustering results, where a minimum-spanning-tree structure is used to connect all elements in a cluster. Malheiros and Walter constructed a data-partitioning error-controlled strategy for approximate nearest neighbor searching [76]. By changing the size of the candidate neighborhood, the precision and performance are balanced when searching for neighbors. Tellez, et al. thought, for intrinsically high-dimensional data, the only possible solution is to compromise and use approximate or probabilistic approaches [77]. And then, they proposed a singleton indexes for nearest neighbor search. Ozan, et al. designed a vector quantization method for approximate nearest neighbor search which enables faster and more accurate retrieval on publicly available datasets [78]. The vector quantization is defined as a multiple affine subspace learning problem and the quantization centroids are explored on multiple affine subspaces. Dasgupta and Sinha theoretically studied three Randomized Partition Trees for Nearest Neighbor Search [79]. And then, they combined classical k-d tree partitioning with randomization and overlapping cells.
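The product quantization idea described above (decompose the space into subspaces and quantize each separately) can be sketched in pure Python. This is a toy illustration under several assumptions: a very small Lloyd-style k-means with a fixed initialization, equal-width subspaces, and hypothetical function names. It is not the implementation of [71].

```python
import math


def kmeans(points, n_codes, iters=10):
    """Tiny Lloyd's k-means; centroids start at the first n_codes points."""
    centroids = [list(p) for p in points[:n_codes]]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            groups[j].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def split(vec, n_sub):
    """Cut a vector into n_sub equal-width sub-vectors (extra dims are dropped)."""
    step = len(vec) // n_sub
    return [vec[i * step:(i + 1) * step] for i in range(n_sub)]


def train_pq(data, n_sub, n_codes):
    """Learn one codebook per subspace."""
    return [kmeans([split(v, n_sub)[s] for v in data], n_codes) for s in range(n_sub)]


def encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    parts = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda c: math.dist(part, cb[c]))
            for part, cb in zip(parts, codebooks)]


def approx_sq_dist(query, code, codebooks):
    """Squared distance between an exact query and a quantized database vector."""
    parts = split(query, len(codebooks))
    return sum(math.dist(part, cb[c]) ** 2 for part, cb, c in zip(parts, codebooks, code))
```

A database is searched by encoding every vector once and then ranking the codes by `approx_sq_dist` for each query, so the stored vectors are never compared directly; the price is the quantization error introduced in each subspace.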
To compare these approximate nearest neighbor search algorithms, Li, et al. examined 16 representative algorithms selected from different domains [80]. They then proposed a nearest neighbor search method that empirically achieves both high query efficiency and high recall on the majority of the datasets under a wide range of settings.
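The product quantization idea of [71] described above (split the space into subspaces and quantize each one separately) can be illustrated with a minimal sketch. This is not the implementation of [71]: the function names, the tiny hand-written codebooks and the toy vectors are all assumptions made for illustration, and in practice the codebooks would be learned, for example by k-means in each subspace.

```python
from math import dist

def split(vec, m):
    """Split a vector into m equal-length sub-vectors (one per subspace)."""
    step = len(vec) // m
    return [tuple(vec[i * step:(i + 1) * step]) for i in range(m)]

def encode(vec, codebooks):
    """Quantize each sub-vector separately: keep the index of its nearest centroid."""
    return [min(range(len(cb)), key=lambda j: dist(sub, cb[j]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]

def approx_sq_dist(query, code, codebooks):
    """Approximate squared distance between a raw query and an encoded vector."""
    subs = split(query, len(codebooks))
    return sum(dist(q, cb[j]) ** 2 for q, cb, j in zip(subs, codebooks, code))

# Hypothetical codebooks: 2 subspaces with 2 centroids each (normally learned).
codebooks = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 0.0)]]
database = [(0.1, 0.0, 0.0, 0.9), (0.9, 1.0, 1.0, 0.1)]
codes = [encode(v, codebooks) for v in database]

# Search over the compact codes instead of the raw database vectors.
query = (1.0, 1.0, 1.0, 0.0)
nearest = min(range(len(codes)),
              key=lambda i: approx_sq_dist(query, codes[i], codebooks))
```

Each database vector is stored as one small integer per subspace, and distances are approximated from the centroids, which is where the savings in memory and search time come from.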
Given that KNN classification is an approximate solution, recent effort has been directed at overcoming the lazy aspects of KNN classification with an improved approach to approximation, known as K*Tree [35], that achieves good prediction efficiency and accuracy when compared to standard KNN classification algorithms. K*Tree is not a classifier. It is just a particular kind of tree that holds a range of useful information in its leaf nodes, so that the K nearest neighbors of the test data can be obtained quickly. An example of a K*Tree is provided in Fig. 6.
Fig. 6. A K*Tree
K*Tree is an extension of the Ktree illustrated in Fig. 1(b), with samples added to each leaf node. These samples include the K nearest neighbors of the data in a leaf node and the K nearest neighbors of the nearest neighbor of the data in the leaf node. In other words, in a K*Tree, each of the samples in its leaf nodes can be expressed as $(k;\ e_1, \langle e_{11}, e_{12}, \ldots, e_{1k} \rangle;\ e_2, \langle e_{21}, e_{22}, \ldots, e_{2k} \rangle;\ \ldots;\ e_i, \langle e_{i1}, e_{i2}, \ldots, e_{ik} \rangle;\ \ldots;\ e_n, \langle e_{n1}, e_{n2}, \ldots, e_{nk} \rangle)$, where k is an integer value, $\langle e_1, e_2, \ldots, e_i, \ldots, e_n \rangle$ is a vector of the training samples that have the same k value, and $\langle e_{i1}, e_{i2}, \ldots, e_{ik} \rangle$ is a vector of the k nearest samples of $e_i$. However, there are many samples with the same K value, so information has to be added to the leaf nodes of a K*Tree sample by sample.
K*Tree can be used in the following way. For an item of test data T, the K*Tree is first searched to obtain the nearest sample, $e_i$, of T in a leaf node. The K value in this leaf node is then taken as the K value of T. Finally, the K nearest samples of T are searched for only amongst the samples attached to the nearest sample $e_i$. This means that KNN classification using the K*Tree approach does not need to search the whole training sample space, significantly enhancing its performance.
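The lookup procedure just described can be sketched as follows. This is only an illustration under stated assumptions, not the K*Tree implementation of [35]: the leaf node is assumed to have been reached already by the tree search, the dictionary layout of `leaf` is hypothetical, and the candidate set is simplified to the nearest sample plus the neighbors stored with it.

```python
from math import dist

def kstar_neighbors(test_x, leaf):
    """Return approximate K nearest training samples of test_x from one leaf.

    `leaf` is a hypothetical structure {"k": int, "samples": {e_i: [e_i1, ..., e_ik]}},
    where every sample is a tuple of feature values.
    """
    k, samples = leaf["k"], leaf["samples"]
    # Step 1: find the stored sample e_i that is nearest to the test data.
    e_i = min(samples, key=lambda e: dist(e, test_x))
    # Step 2: the candidates are e_i and the neighbors attached to it.
    candidates = {e_i, *samples[e_i]}
    # Step 3: rank only these candidates, not the whole training set.
    return sorted(candidates, key=lambda e: dist(e, test_x))[:k]

# Toy leaf with K = 2 and two stored samples (illustrative values only).
leaf = {
    "k": 2,
    "samples": {
        (0.0, 0.0): [(0.0, 1.0), (1.0, 0.0)],
        (5.0, 5.0): [(5.0, 6.0), (6.0, 5.0)],
    },
}
result = kstar_neighbors((0.2, 0.9), leaf)
```

Only three candidates are ranked here, regardless of how large the training set is, which is the source of the speed-up over an exhaustive search.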
Clearly, K*Tree is only an approximation of finding the K nearest neighbors of an item of test data. There are some particular things to note about it:

P1 The test data's nearest neighbor will definitely be included in the set of K nearest neighbors for the test data. The other K-1 nearest neighbors will be close enough to the test data.
P2 The prediction efficiency of KNN classification using K*Tree is almost the same as that of KNN classification using Ktree.
P3 KNN classification with K*Tree is much more robust than KNN classification with Ktree and standard KNN classification.
From the above, it can be seen that K*Tree classification approaches are very different from traditional KNN classification methods. KNN classification with K*Tree is not a lazy learning approach because the K*Tree has to be trained before predicting the test data. Its advantage is that there is no need to search the whole sample space. However, there are still two main limitations in K*Tree classification as follows.
1) K*Tree classification provides only an approximate solution to a query. It is not clear how to estimate the confidence of the answer.
2) It is not clear which tree is the best structure for storing the K values and the nearest neighbors for K*Tree classification.

CLASSIFICATION RULES
Once the K nearest neighbors of an item of test data have been selected from the training samples, the class label of the test data needs to be predicted using a classification rule/principle. In general, the most widely used rules for KNN classification are the majority rule and its various weighted forms. Recently, two classification rules have been proposed for datasets with imbalanced classes. This section recalls these classification rules and also puts forward some suggestions for improving the classification rules of shell nearest neighbor classification.

The majority classification rule
This is a simple yet efficient approach to classification that predicts the class of the test data according to the class of the majority of the K nearest neighbors. For a training dataset, D, with n features and a decision attribute, let $\{c_1, c_2, \ldots, c_m\}$ be the domain of the decision attribute, Y. One can obtain the K nearest neighbors of a query (test data), T = (X, Y), from the training dataset. KNN(T) is the set of these K nearest neighbors. The majority rule for KNN classification is as follows:
$$Y = \arg\max_{c_j} \sum_{(X_i, Y_i) \in KNN(T)} I(Y_i = c_j) \qquad (2)$$
where $I(\cdot)$ is an indicator function.
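As a minimal sketch of the majority rule, the following counts the labels of the K nearest training samples and returns the most frequent one. The function name, the Euclidean distance and the toy data are assumptions made for illustration.

```python
from collections import Counter
from math import dist

def knn_majority(train, x, k):
    """Predict the label of x by majority vote among its k nearest samples.

    `train` is a list of (features, label) pairs; Euclidean distance is assumed.
    """
    neighbors = sorted(train, key=lambda s: dist(s[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tie, Counter keeps first-seen order, so the nearer class wins.
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "b"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b")]
label = knn_majority(train, (0.2, 0.2), k=3)
```

With k = 3 the three nearest samples carry the labels a, a and b, so the query is assigned to class a even though class b is more frequent in the training set overall.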

The weighting classification rule
In the majority rule, the K nearest neighbors are implicitly assumed to have equal weight for any decision, regardless of their distance from the test data. The weighting rule instead adheres to the notion that it is conceptually preferable to give different weights to the K nearest neighbors according to their distance from the test data, with closer neighbors having greater weight. The distance weighting-based classification rule is as follows: where $d(X_i, X)$ is the distance between $(X_i, Y_i)$ and the test data T.
A general weighting classification rule is as follows: A kernel function classification rule is as follows: where $K(X_i, X)$ is a kernel function. Eq. (5) is constructed for a numerical decision attribute. If the decision attribute is categorical, $Y_i$ can be replaced with the indicator function.
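To make the idea of distance weighting concrete, the sketch below weights each neighbor's vote by the inverse of its distance to the test data. The inverse-distance weight is an assumed, commonly used choice and may differ from the exact weighting of the rules above; the function name, the `eps` parameter and the toy data are also illustrative.

```python
from collections import defaultdict
from math import dist

def knn_distance_weighted(train, x, k, eps=1e-12):
    """Predict the label of x with votes weighted by inverse distance (assumed form)."""
    neighbors = sorted(train, key=lambda s: dist(s[0], x))[:k]
    scores = defaultdict(float)
    for features, label in neighbors:
        # Closer neighbors get larger weights; eps avoids division by zero.
        scores[label] += 1.0 / (dist(features, x) + eps)
    return max(scores, key=scores.get)

train = [((0.0,), "a"), ((2.0,), "b"), ((2.2,), "b")]
label = knn_distance_weighted(train, (0.5,), k=3)
```

In this toy case the single close neighbor of class a outweighs the two more distant neighbors of class b, whereas an unweighted majority vote over the same three neighbors would return b.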

CF (Certainty Factor) classification rule
There are numerous variations upon the above classification rules. However, none of these classification rules works well for the imbalanced datasets typical of real applications, such as tumor diagnosis in medical tests or investments in the stock market. Generally, this is a challenge when the data mining tasks are sensitive to cost or risk.
To deal with this issue, ways of increasing the competitiveness of minor classes in imbalanced data classification have been proposed in [81], [82], known as CF-KNN (Certainty Factor KNN) classification. These are summarized below.
The CF measure is incorporated into the KNN classification as follows. For a test sample T = (X, Y), assume $p(Y = c_i|D)$ is the ratio of $c_i$ in the training dataset, D, and $p(Y = c_i|KNN(T))$ is the ratio of $c_i$ in the set of K nearest neighbors, KNN(T). If $p(Y = c_i|KNN(T)) \geq p(Y = c_i|D)$, the CF is computed as follows:
$$CF(Y = c_i, KNN(T)) = \frac{p(Y = c_i|KNN(T)) - p(Y = c_i|D)}{1 - p(Y = c_i|D)}$$
This means that $CF(Y = c_i, KNN(T)) \geq 0$. If $p(Y = c_i|KNN(T)) < p(Y = c_i|D)$, the CF is computed as follows:
$$CF(Y = c_i, KNN(T)) = \frac{p(Y = c_i|KNN(T)) - p(Y = c_i|D)}{p(Y = c_i|D)}$$
This means that $CF(Y = c_i, KNN(T)) < 0$. The CF strategy for KNN classification can be defined as follows:
$$Y = \arg\max_{c_i} CF(Y = c_i, KNN(T))$$
According to the description of CF, $CF(Y = c_i, KNN(T))$ will have a value in the range [-1, 1]. If $CF(Y = c_i, KNN(T)) > 0$, the belief that the class of the query should be predicted to be $Y = c_i$ is increased. If $CF(Y = c_i, KNN(T)) < 0$, however, this belief is decreased. If $CF(Y = c_i, KNN(T)) = 0$, the belief that the class of the query should be predicted to be $Y = c_i$ is the same as it is for the training set D. In other words, the class of T is undetermined for binary classification applications.
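The effect of the CF strategy on a minority class can be seen in a small sketch. It assumes the usual certainty factor definition, in which the gain over the prior is normalized by 1 - p(Y = ci|D) when the class is over-represented among the neighbors and by p(Y = ci|D) otherwise; the function names and the toy label lists are illustrative.

```python
from collections import Counter

def certainty_factor(p_knn, p_prior):
    """Certainty factor of a class given its neighborhood ratio and its prior ratio."""
    if p_prior >= 1.0:
        return 0.0  # only one class exists, so the neighborhood adds no evidence
    if p_knn >= p_prior:
        return (p_knn - p_prior) / (1.0 - p_prior)
    return (p_knn - p_prior) / p_prior

def cf_predict(neighbor_labels, train_labels):
    """Return the class with the largest CF and the CF value of every class."""
    prior, local = Counter(train_labels), Counter(neighbor_labels)
    n, k = len(train_labels), len(neighbor_labels)
    cf = {c: certainty_factor(local[c] / k, prior[c] / n) for c in prior}
    return max(cf, key=cf.get), cf

# Toy imbalanced setting: 90% majority class, 10% minority class.
train_labels = ["maj"] * 90 + ["min"] * 10
neighbor_labels = ["maj", "maj", "maj", "min", "min"]
label, cf = cf_predict(neighbor_labels, train_labels)
```

Here the minority class accounts for 40% of the neighbors but only 10% of the training set, so its CF is positive and it is selected, whereas the majority rule would select the majority class.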

Lift classification rule
The above CF-KNN classification can be modified by using the lift measure, giving what is known as Lift-KNN classification:
$$Lift(Y = c_i, KNN(T)) = \frac{p(Y = c_i|KNN(T))}{p(Y = c_i|D)}$$
Clearly, $Lift(Y = c_i, KNN(T)) \geq 0$. If the lift of a class is less than or equal to 1, the probability of the class is not increased for the K nearest neighbors. However, if $Lift(Y = c_i, KNN(T)) > 1$, the probability of the class is increased for the K nearest neighbors. In that case, the Lift-KNN strategy for KNN classification can be defined as follows:
$$Y = \arg\max_{c_i} Lift(Y = c_i, KNN(T))$$
If $Lift(Y = c_i, KNN(T)) = 1$, the class of T is undetermined for binary classification applications.
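A corresponding sketch for the lift-based rule is given below. It assumes the standard definition of lift as the ratio of a class's share among the K nearest neighbors to its share in the training set; the function name and the toy label lists are illustrative.

```python
from collections import Counter

def lift_predict(neighbor_labels, train_labels):
    """Return the class with the largest lift and the lift value of every class."""
    prior, local = Counter(train_labels), Counter(neighbor_labels)
    n, k = len(train_labels), len(neighbor_labels)
    # Lift above 1 means the class is over-represented among the neighbors.
    lift = {c: (local[c] / k) / (prior[c] / n) for c in prior}
    return max(lift, key=lift.get), lift

# Same toy imbalanced setting: 90% majority class, 10% minority class.
train_labels = ["maj"] * 90 + ["min"] * 10
neighbor_labels = ["maj", "maj", "maj", "min", "min"]
label, lift = lift_predict(neighbor_labels, train_labels)
```

The minority class is four times as frequent among the neighbors as in the training set, so it has the larger lift and is chosen as the prediction.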

Shell KNN classification rule
Recalling Fig. 2 in Section 3, shell nearest neighbors are well distributed around the test data A. In real applications, the shell nearest neighbors may not be ideal, due to the fact that training examples are randomly collected. This case is illustrated in Fig. 7 as follows. With the quadratic-selection rule, sample B is selected many more times than samples C, D, E and F, although all five nearest neighbors are very close to test data A. This is a very interesting case when using KNN classification in data mining applications. In this paper, we formally discuss this case as follows.
Using the quadratic-selection rule, we can obtain 2n samples, left ( From Eq. (12), if a nearest neighbor is selected many times, it will be the winner. This is another way of addressing imbalanced classes with Case 1 (mentioned in Section 3). For Fig. 7, let the label of B be c1, the label of C be c2, the label of D, E and F be c3, and n = 10. From Eq. (2), the label of T is assigned to c3. From Eq. (12), the label of T is voted to be c1. If we instead let the label of B be c1 and the label of C, D, E and F be c2, then, from Eq. (2), the label of T is predicted to be c2. In the case of Eq. (12), the label of T is not determined, because c1 and c2 receive the same number of votes. To address this issue when mining imbalanced data, we can modify Eq. (13) as follows: where count(Xi, Yi) is the number of times the nearest neighbor (Xi, Yi) is selected, and "l" is greater than 1.

EXPERIMENTS
A large number of new ideas have been presented in this paper. For the sake of simplicity, just a few representatives of the new classification rules in Section 5 will be evaluated in this section. The classification accuracy of the new classification rules was compared with the majority classification rule (referred to here as the standard KNN classification rule) across 15 datasets, as shown in Table 1, below.

Experimental setting
The 15 datasets above were downloaded from the UCI Machine Learning Repository. They include 5 binary datasets and 10 multi-class datasets. The datasets needed to be slightly modified to meet the requirements of the evaluation. Specifically, in each binary dataset, 80% of the samples belonging to one class were deleted, with all of the remaining samples forming the new dataset. In the multi-class datasets, if the number of classes was odd, we removed 80% of the samples belonging to the odd-numbered classes (i.e., 1, 3, 5...). If the number of classes was even, we removed 80% of the samples belonging to the even-numbered classes (i.e., 2, 4, 6...). A set of experiments was conducted on the above datasets using the classification rules mentioned in Section 5. The primary goal was to compare their performance with that of the standard classification rule, but a further objective was to test their effect on class imbalance in the data. Each dataset was first divided into a training set and a test set using 10-fold cross-validation. Then, all the classification rules were examined on the original datasets (with no samples deleted) using different K values. After this, all the classification rules were tested using different K values on the unbalanced datasets. Finally, K=5 was selected to perform 10 experiments on the unbalanced datasets, to examine the average and variance of the classification accuracy.
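The construction of the unbalanced datasets can be sketched as follows. The text does not state which 80% of the samples of a class were deleted, so the random choice, the fixed seed and all names in this sketch are assumptions made for illustration.

```python
import random

def make_unbalanced(samples, classes_to_thin, drop=0.8, seed=0):
    """Delete a fraction `drop` of the samples of each class in `classes_to_thin`.

    `samples` is a list of (features, label) pairs; deleted samples are chosen
    at random, which is an assumption rather than a documented detail.
    """
    rng = random.Random(seed)
    kept = []
    for c in sorted({label for _, label in samples}):
        members = [s for s in samples if s[1] == c]
        if c in classes_to_thin:
            rng.shuffle(members)
            members = members[:len(members) - round(drop * len(members))]
        kept.extend(members)
    return kept

# Toy two-class dataset with 50 samples per class, labeled 1 and 2.
samples = [((float(i),), i % 2 + 1) for i in range(100)]
# The number of classes is even, so the even-numbered class is thinned.
unbalanced = make_unbalanced(samples, classes_to_thin={2})
counts = {c: sum(1 for _, y in unbalanced if y == c) for c in (1, 2)}
```

After thinning, class 1 keeps all 50 samples while class 2 keeps 10, giving the kind of class imbalance the modified datasets are meant to exhibit.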
It should be noted that, for the binary datasets, we obtained not only the classification accuracy (ACC), but also the sensitivity (SEN) and specificity (SPE) on the unbalanced datasets.
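For reference, the three measures can be computed from the confusion counts as in the sketch below, which uses the standard definitions of ACC, SEN and SPE. Which class is treated as positive is not stated in the text, so the `positive` argument and the toy labels are assumptions.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (ACC, SEN, SPE) for binary labels, given the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    acc = (tp + tn) / len(pairs)   # share of all samples classified correctly
    sen = tp / (tp + fn)           # share of positive samples recovered
    spe = tn / (tn + fp)           # share of negative samples recovered
    return acc, sen, spe

y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1]
acc, sen, spe = binary_metrics(y_true, y_pred, positive=1)
```

On unbalanced data, SEN and SPE separate the performance on the two classes, which ACC alone can hide when one class dominates.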

KNN Classification of the original datasets
The first set of experiments was conducted using the downloaded datasets without any changes. The results are shown in Fig. 8, which presents the classification accuracy of the classification rules for the 15 original datasets with different K values. It can be seen that the accuracy of these classification rules does not differ greatly for most of the datasets, especially in relation to the OCCUDS, CNAE, Isolet, Letter, Segments, Vehicle and Waveform datasets. The ShellKNN classification rule performs the worst on the Yeast dataset, but the best on the German, Ionosphere and USPS datasets. Overall, the ShellKNN classification rule performs well. For the other classification rules, there was little difference in their performance on the original datasets, though the weighted classification rules performed particularly poorly on some datasets, such as Chess and Yeast.

Unbalanced datasets with binary classes
The second set of experiments was conducted with the modified datasets where there were unbalanced binary classes. The results are presented in Fig. 9, which shows the classification accuracy of all of the classification rules on the class-unbalanced datasets, with Chess, German, Ionosphere and Isolet being the binary datasets. For the German, Ionosphere and Isolet datasets, the performance of the Majority and Weighted classification rules was not very good because these two classification rules do not consider the importance of small sample classes when dealing with class imbalances. The CF, Lift and ShellKNN classification rules slightly increased the competitiveness of the small classes in the unbalanced classifications. Thus, in most cases, their performance was much better than that of the Majority and Weighted classification rules. Table 2 shows the ACC, SEN, and SPE results for the binary datasets. Here, it can be seen that the ShellKNN classification rule performed the best for the binary-class datasets, with the Majority classification rule being the worst.

Unbalanced multi-class datasets
The third set of experiments was conducted using the modified datasets with multi-class imbalances. The results for the average classification accuracy and variance across the 10 experiments are presented in Tables 3 and 4.
With regard to the variance, all of the classification rules were stable: the variance was small and the results were robust. For the average classification accuracy, the ShellKNN, Lift, and CF classification rules improved the average classification accuracy by 2.74%, 0.89%, and 0.32%, respectively, in relation to the Majority classification rule. In particular, for the USPS and Waveform datasets, the ShellKNN classification rule improved the classification accuracy by 11.48% and 21%, respectively.
Across the various unbalanced datasets, the ShellKNN classification rule generally performed well in comparison to the other classification rules.

CONCLUSION
This paper has systematically reviewed the latest research on KNN classification that addresses its four most challenging issues. It has focused on introducing the main results recently established within our research group. These can be distinguished from other extant approaches by their interest in delivering new research directions for KNN classification.
Although the approaches presented here are both efficient and promising, there are still some open issues that require further research:
1) How to set different K values for different kinds of test data so that the results delivered by the KNN classification algorithm will offer the best possible performance whilst remaining robust.
2) How to establish the best tree structures for building KTree and K*Tree when a decision tree is not the best data structure for saving or rapidly searching for the K values and the nearest neighbors of a leaf node.
3) How to make KNN classification efficient when mining big data.

ACKNOWLEDGMENT
This work has been supported in part by the Natural Science Foundation of China under grants 61836016 and 61672177.