Instance-Based Classification through Hypothesis Testing

Classification is a fundamental problem in machine learning and data mining. During the past decades, numerous classification methods have been presented based on different principles. However, most existing classifiers cast the classification problem as an optimization problem and do not address the issue of statistical significance. In this paper, we formulate the binary classification problem as a two-sample testing problem. More precisely, our classification model is a generic framework that is composed of two steps. In the first step, the distance between the test instance and each training instance is calculated to derive two distance sets. In the second step, the two-sample test is performed under the null hypothesis that the two sets of distances are drawn from the same cumulative distribution. After these two steps, we have two p-values for each test instance, and the test instance is assigned to the class associated with the smaller p-value. Essentially, the presented classification method can be regarded as an instance-based classifier based on hypothesis testing. The experimental results on 40 real data sets show that our method is able to achieve the same level of performance as state-of-the-art classifiers and has significantly better performance than existing testing-based classifiers. Furthermore, we can handle outlying instances and control the false discovery rate of test instances assigned to each class under the same framework.


INTRODUCTION
Classification is a fundamental data analysis procedure that is used ubiquitously across different fields. Thousands of classification algorithms (classifiers) have been developed during the past decades [1]. These classifiers range from simple models such as k-nearest neighbors (k-NN) [2] to more sophisticated models such as support vector machine (SVM) [3] and random forests (RF) [4].
Despite the advances in the development of new classifiers, no single classification algorithm achieves the best performance on all data sets [1]. This indicates that different classifiers are complementary to each other in different contexts. Therefore, it is still necessary to develop new and alternative classifiers based on principles that remain unexplored.
The motivation behind this research is based on the following observations. First, existing non-lazy classifiers typically formulate the classification problem as an optimization problem. Such optimization-based learning strategies can always generate the target classifiers, regardless of the statistical significance of learnt models. Second, classifiers such as logistic regression are able to provide probability values for categorizing an unknown test instance. However, it is not an easy task to determine a universal probability threshold to ensure that the classification of the test instance into the corresponding class is statistically significant. Last but not least, existing classifiers cannot control the number of misclassified test instances in terms of metrics such as false discovery rate (FDR). Such capability is quite important in the scenario of biological data analysis, in which the prediction results will be further validated by wet-lab experiments that can be costly and time-consuming [5]. Thus, we need to add some notion of statistical significance to classifiers.
In fact, the classification problem has already been formulated as a hypothesis testing problem in [6]. More recently, several research efforts [7], [8] have extended the initial formulation in [6] in different respects. However, the following observations motivate this research. First, existing testing-based classification methods suffer from certain theoretical drawbacks, as discussed and summarized in Section 2. Second, only simulated data sets and several small real data sets have been empirically tested, which makes it difficult to establish the practical usefulness of such a testing-based formulation. Third, the connection between this new formulation and existing classification methods has never been discussed. Finally, the potential benefit of the testing-based classification model remains unexplored.
Based on the above observations, we present a new testing-based classification formulation, in which the null hypothesis is that, informally, the test instance does not belong to any class. To precisely define the null hypothesis, we focus on the classification problem in a two-class setting. First, we calculate the distance between the test instance and each training instance in the training data set. In this way, we generate two sets of distances for one test instance that needs to be classified. Then, the hypothesis testing problem can be cast as a two-sample testing problem [9], in which each sample corresponds to a set of distances. In this formulation, the null hypothesis is that the two sets of distances are drawn from the same cumulative distribution.
Two-sample testing is a fundamental problem in statistics. We employ the classical Wilcoxon-Mann-Whitney (WMW) test for quantifying the statistical significance in terms of p-values. To alleviate the effect of outlying and irrelevant training instances, we further apply the WMW test to two distance sets that are generated from k-NNs of the test instance.
The testing-based classification formulation has several salient features. First of all, it can provide p-values for each test instance to quantify the statistical significance of classifying this instance to certain classes. Accordingly, we can detect outlying test instances that do not belong to any class if the p-values with respect to all classes are larger than the significance level threshold. Second, we can control the FDR of test instances that are assigned to each class based on their p-values.
We evaluate our method on forty data sets from the UCI repository [10] and the KEEL-dataset repository [11] with respect to the standard classification task. The experimental results show that our method is able to achieve the same level of performance as state-of-the-art classifiers. Meanwhile, it can handle outlying test instances and control the FDR of test instances assigned to each class in a natural manner.
The main contributions of this paper can be summarized as follows.
(1) The binary classification issue is formulated as a two-sample testing problem. Since two-sample testing is a fundamental problem in statistics and many well-known tests are available in the literature, it can be expected that we may introduce many effective testing-based classifiers in the near future.
(2) The classification model that integrates hypothesis testing and the k-NN method is presented. This formulation can alleviate the effect of outlying and irrelevant training instances to improve the classification accuracy significantly.
(3) A comprehensive performance comparison over 40 real data sets is conducted. The experimental results demonstrate that the testing-based classifier is able to achieve the same level of performance as standard classifiers such as SVM and decision tree.
(4) Some interesting connections between our testing-based classifiers and existing classification methods are presented.
(5) The advantage of the testing-based classification model on handling outliers and controlling the Type I error rate in terms of FDR is empirically investigated.
The rest of this paper is organized as follows. Section 2 discusses some previous works that are related to our method. Section 3 presents the details of our method. Section 4 reports experimental results on 40 real data sets. Section 5 discusses the relationship between our method and other approaches. Finally, Section 6 concludes this paper.

Instance-based learning
Instance-based learning is a lazy learning scheme in which the training instances are simply stored. When a new instance is encountered, a set of similar training instances are retrieved to classify the unknown testing instance. The most basic instance-based method is the k-nearest neighbor algorithm (k-NN) [2] [12], which assigns a new instance to the most common class among its k-NNs in training instances.
Essentially, our method can be considered an instance-based learning approach since the two-sample test is conducted on the distance sets generated from all training instances or from the k-NNs. This indicates that it is feasible to apply techniques developed for instance-based learning during the past decades (e.g. [13], [14], [15]) to further improve our method.

Classification based on hypothesis testing
Liao & Akritas [6] introduce a classification method based on hypothesis testing, abbreviated as TBC. Suppose there are two classes (positive vs. negative) in the training set, i.e., a binary classification problem; the task is to allocate a new instance $t^*$ to one of the two classes. The basic idea of TBC is that, if $t^*$ is placed into the wrong class, then the difference between the two samples will be blurred. To implement this idea, two tests of the equality of the means of the two samples are conducted, in which $t^*$ is placed into the set of positive instances and the set of negative instances, respectively. Accordingly, we obtain two p-values $p^+$ and $p^-$, where $p^+$ ($p^-$) is generated from the test in which $t^*$ is assumed to belong to the positive (negative) class. If $p^+ < p^-$, then $t^*$ is classified as a positive instance. Otherwise, $t^*$ is classified as a negative instance. This method works well when the theoretical p-values can be computed and compared. However, TBC has two problems. First, when the number of features of the data set is larger than the sample size of one class, the p-values cannot be computed at all because of the singularity of the sample covariance matrix. Second, when the instances from the two classes are well separated, the p-values will be equal to zero.
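The TBC decision rule described above can be illustrated with a minimal one-dimensional sketch. It is not the implementation of [6]: the pooled two-sample t statistic is used here as a univariate stand-in for the test of equal means, and the function names `pooled_t` and `tbc_1d` are hypothetical. Because both tests have the same degrees of freedom, comparing the two p-values is equivalent to comparing the absolute values of the two statistics, which also sidesteps p-values that are numerically zero.

```python
import math

def pooled_t(a, b):
    """Pooled-variance two-sample t statistic for the means of a and b."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    # Assumption: the pooled variance is non-zero (no degenerate samples).
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def tbc_1d(test, positives, negatives):
    """Place the test value into each class in turn and keep the class for
    which the two samples stay better separated (larger |t|, smaller p)."""
    t_plus = abs(pooled_t(positives + [test], negatives))
    t_minus = abs(pooled_t(positives, negatives + [test]))
    return "positive" if t_plus > t_minus else "negative"

# Illustrative data only.
positives = [1.0, 1.2, 0.9, 1.1]
negatives = [3.0, 3.2, 2.9, 3.1]
print(tbc_1d(1.05, positives, negatives))
print(tbc_1d(3.05, positives, negatives))
```

Placing the test value into the wrong class inflates the pooled variance and shrinks the difference in means, which is the "blurring" effect the method relies on.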
Ghimire & Wang [7] improve the TBC method by introducing a minimum distance into the method and come up with a new classifier for image pixels. Their new method works well in the context of image pixel classification.
Modarres [16], [17], [18] studies the properties of squared Euclidean interpoint distances (IPDs) between different samples which are taken from multivariate Bernoulli, multivariate Poisson and multinomial distributions. And he also discusses some applications based on IPDs within one sample and across two samples in different distributions.
Afterwards, Guo & Modarres [8] develop a classification method based on hypothesis testing, abbreviated as IDC. It is capable of classifying high-dimensional instances by employing testing methods based on the IPDs between different instances. Several different test statistics based on IPDs are discussed in [8], and we take the Baringhaus and Franz (BF) statistic as the example. Given two sets of training instances, i.e., one positive set $D^+$ and one negative set $D^-$, IDC first computes the average IPDs within $D^+$, within $D^-$ and between $D^+$ and $D^-$, which are denoted by $\bar{d}_{D^+}$, $\bar{d}_{D^-}$ and $\bar{d}_{D^+D^-}$, respectively. Then, it calculates the BF statistic on the original training sets ($BF_0$) and after placing $t^*$ into $D^+$ ($BF_1$) and into $D^-$ ($BF_2$), respectively. Note that $|BF_1 - BF_0|$ ($|BF_2 - BF_0|$) can be used to measure the change in the value of BF when $t^*$ is assigned to $D^+$ ($D^-$). Therefore, if $|BF_1 - BF_0| < |BF_2 - BF_0|$, $t^*$ is classified as a positive instance; otherwise, $t^*$ is labelled as a negative instance.

Asymmetric classification error control
In binary classification, most classifiers are constructed to minimize the overall classification error, which is a weighted sum of type I error (misclassifying a negative instance as a positive one) and type II error (misclassifying a positive instance as a negative one). However, in many realistic applications, different types of errors are often asymmetric, which have different costs and need to be treated with different weights.
The cost-sensitive classification (CSC) method [19], [20] can solve this problem to some extent. It takes the misclassification costs into consideration and aims to minimize the total cost of both errors. Another method is the Neyman-Pearson (NP) classification [21], which is inspired by classical NP hypothesis testing. It is a novel statistical framework for handling asymmetric type I/II error priorities and can seek a classifier that minimizes the type II error while maintaining the type I error below a user-specified level α [22], [23]. CSC and NP classification are fundamentally different approaches that have their own pros and cons [21]. A main advantage of the NP classification is that it is a general framework that allows users to control type I classification error under α with a high probability.
It is very easy to control the type I error in terms of FDR in our formulation since the p-values of each test instance with respect to different classes will be generated in the classification phase. In other words, such testing-based classification formulation provides a unified framework for controlling the asymmetric classification error in a natural way.

Two-sample testing
Given two independent random samples $G_X$ and $G_Y$, where $G_X = \{x_1, x_2, ..., x_n\}$ is drawn from the X population and $G_Y = \{y_1, y_2, ..., y_m\}$ is drawn from the Y population, the general two-sample testing problem is concerned with the null hypothesis that the two samples are drawn from identical populations [9]: $H_0: F_X(t) = F_Y(t)$ for all $t$, where $F_X$ and $F_Y$ are the cumulative distribution functions for the X population and the Y population, respectively.

Problem formulation
We consider the binary classification problem, in which the training set $D$ is composed of two disjoint sets $D^+$ and $D^-$. $D^+ = \{t^+_1, t^+_2, ..., t^+_m\}$ and $D^- = \{t^-_1, t^-_2, ..., t^-_n\}$ are called the positive training set and the negative training set, respectively. Given a test instance $t^*$, the classification task is to decide its class label (positive vs. negative).
We formulate the binary classification problem as a two-sample testing problem. In this formulation, the first sample $G_X$ is the set of distances between the test instance $t^*$ and the training instances in $D^+$, where the $i$th observation is the distance between $t^*$ and the $i$th positive training instance. Similarly, each observation in the second sample $G_Y$ is the distance between the test instance and a training instance in $D^-$. To conduct the standard classification task, we may test the null hypothesis against the two one-sided alternative hypotheses ($F_X(t) < F_Y(t)$ and $F_X(t) > F_Y(t)$) to obtain two one-sided p-values ($p_X$ and $p_Y$). If $p_X < p_Y$, we label $t^*$ as a positive instance. Otherwise, we classify $t^*$ as a negative instance.
To handle the multi-classification problem with Q classes (Q > 2), we can explore the one-vs-rest strategy by regarding the set of instances from one class as the positive training set and using the set of instances from the remaining classes as the negative training set. For each of Q binary classification problems, we first conduct the two-sample testing to generate a one-sided p-value for the corresponding class. Then, we can assign the test instance to the class that has the smallest p-value.
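The formulation above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the Euclidean distance is assumed (the text does not fix a distance measure), the one-sided p-value is taken from the large-sample normal approximation of the rank-sum statistic without tie or continuity correction, and all function names (`euclidean`, `midranks`, `wmw_pvalue_smaller`, `classify`, `classify_one_vs_rest`) are hypothetical.

```python
import math

def euclidean(a, b):
    # Assumed distance measure between two feature vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def midranks(values):
    """1-based ranks of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def wmw_pvalue_smaller(x, y):
    """One-sided p-value (normal approximation) for the alternative that
    the values in x tend to be smaller than the values in y."""
    nx, ny = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:nx])                        # rank sum of the x sample
    mean = nx * (nx + ny + 1) / 2
    var = nx * ny * (nx + ny + 1) / 12
    z = (w - mean) / math.sqrt(var)
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z

def classify(test, positives, negatives):
    """Binary case: one p-value per class, smaller p-value wins."""
    gx = [euclidean(test, t) for t in positives]
    gy = [euclidean(test, t) for t in negatives]
    p_pos = wmw_pvalue_smaller(gx, gy)
    p_neg = wmw_pvalue_smaller(gy, gx)
    return ("positive" if p_pos < p_neg else "negative"), p_pos, p_neg

def classify_one_vs_rest(test, by_class):
    """Multi-class case: by_class maps each label to its training instances."""
    pvals = {}
    for label, members in by_class.items():
        rest = [t for other, ts in by_class.items() if other != label for t in ts]
        gx = [euclidean(test, t) for t in members]
        gy = [euclidean(test, t) for t in rest]
        pvals[label] = wmw_pvalue_smaller(gx, gy)
    return min(pvals, key=pvals.get), pvals

# Illustrative data only.
pos = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.05, 0.1)]
neg = [(0.8, 0.9), (0.9, 0.8), (0.85, 0.7), (0.95, 0.95)]
print(classify((0.12, 0.18), pos, neg))
```

In this sketch each class receives the p-value of the alternative that distances to that class tend to be smaller; how this maps onto the symbols $p_X$ and $p_Y$ is a labeling choice of the sketch.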

k-NN variants
In the above problem formulation, the distances to all training instances are utilized in the hypothesis testing. However, the existence of outlying and irrelevant training instances may decrease the classification accuracy. To alleviate this issue, we can conduct the hypothesis testing on two samples that are derived from the k-NNs of the test instance.
Under $H_0$, two natural k-NN variants can be formulated. Similar to the k-NN classifier, the first variant directly takes the k-NNs of the test instance to generate two samples. The distances from the test instance to these $k$ nearest training instances are divided into two groups according to the class label, where each group corresponds to one sample in our scenario. The second variant takes the $k_1$ nearest instances from $D^+$ and the $k_2$ nearest instances from $D^-$ to generate two distance sets, where $k_1 / k_2 = n / m$. The rationale behind the second variant is that, if the null hypothesis is true, then the number of k-NNs from each class is proportional to the number of training instances in that class. Since $k_1 = k_2$ when $n = m$, we can take the same number of k-NNs from each class in this case.
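The two variants differ only in how the distance sets are built, which the following sketch makes concrete. The Euclidean distance and the function names `knn_direct` and `knn_separate` are assumptions introduced for illustration.

```python
import math

def euclidean(a, b):
    # Assumed distance measure between two feature vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_direct(test, positives, negatives, k):
    """Variant 1: take the k nearest training instances regardless of class,
    then split their distances by class label."""
    labelled = [(euclidean(test, t), "+") for t in positives]
    labelled += [(euclidean(test, t), "-") for t in negatives]
    nearest = sorted(labelled, key=lambda pair: pair[0])[:k]
    gx = [d for d, lab in nearest if lab == "+"]
    gy = [d for d, lab in nearest if lab == "-"]
    return gx, gy

def knn_separate(test, positives, negatives, k1, k2):
    """Variant 2: take the k1 nearest positives and the k2 nearest negatives."""
    gx = sorted(euclidean(test, t) for t in positives)[:k1]
    gy = sorted(euclidean(test, t) for t in negatives)[:k2]
    return gx, gy

# Illustrative one-dimensional data.
positives = [(1.0,), (2.0,), (6.0,)]
negatives = [(3.0,), (4.0,), (5.0,)]
print(knn_direct((0.0,), positives, negatives, k=4))
print(knn_separate((0.0,), positives, negatives, k1=2, k2=2))
```

With the direct variant, one of the two distance sets can be empty or very small, for example when all $k$ neighbours share a label; the sketch returns the sets as they are and leaves the handling of that case open.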

The choice of testing methods
The testing of two-sample differences has been extensively investigated in the literature. One widely used test for this problem is the WMW test, which is also called the Mann-Whitney U test or Wilcoxon rank-sum test [24]. To obtain the test statistic in the WMW test, $G_X$ and $G_Y$ are merged to form a combined sample $G_Z = \{z_1, z_2, ..., z_{m+n}\}$. Then, the observations in $G_Z$ are ordered: $z_{(1)} \le z_{(2)} \le ... \le z_{(m+n)}$. According to the ordered list, $R_{i1}$ is defined as the rank of the $i$th observation of the first sample in the combined sample, and the test statistic is the sum of these ranks, whose null distribution is approximately normal for large samples. Based on this normal approximation, we can calculate the one-sided p-value to test the null hypothesis against a one-sided alternative in which the two distribution functions differ in a fixed direction for some $t$.
In our classification model, the choice of testing method is very flexible since the samples to be tested are unidimensional. That is, we can use any univariate two-sample testing method in our classifier. Therefore, we can also employ testing methods such as the pooled t-test, the two-sample Kolmogorov-Smirnov test [25] and the precedence test instead of the WMW test. In Section 5, we will further show that the use of different testing methods establishes the connection between our formulation and existing classification models.
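For reference, the rank-sum statistic of the WMW test and its large-sample normal approximation can be written in the usual textbook form below. The symbols $W$, $\mu_W$, $\sigma_W^2$ and $z$ are notational assumptions introduced here and need not match the symbols of the original derivation; $G_X$ has $n$ observations, $G_Y$ has $m$ observations, and ties are ignored.

```latex
% Rank sum of the first sample in the combined, ordered sample G_Z
W = \sum_{i=1}^{n} R_{i1}

% Mean and variance of W under H_0 (no ties)
\mu_W = \frac{n\,(m + n + 1)}{2}, \qquad
\sigma_W^2 = \frac{m\,n\,(m + n + 1)}{12}

% Standardized statistic, approximately N(0, 1) under H_0
z = \frac{W - \mu_W}{\sigma_W}

% One-sided p-values from the standard normal CDF \Phi
p_{\text{lower}} = \Phi(z), \qquad p_{\text{upper}} = 1 - \Phi(z)
```

A small $W$ means that the observations of $G_X$ occupy the low ranks, so $p_{\text{lower}}$ is small when the distances in $G_X$ tend to be smaller than those in $G_Y$.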

Handling outliers and FDR control
As we have argued, the testing-based classification model has the advantage of controlling the FDR of classified test instances and handling outlying instances under the same framework. In general, we assign the test instance to the class that has the smallest p-value among the $Q$ p-values, where $Q$ is the number of classes. However, it is inappropriate to do so when none of the $Q$ p-values is significant. We can use the FDR [26] to tackle this problem. Because our method returns $Q$ p-values for every test instance, we obtain $Q$ sets of p-values from all test instances. Every p-value set is first sorted in non-descending order: $p_1 \le p_2 \le ... \le p_u$, where $u$ is the number of test instances. Given a significance level $\alpha$, let $i_{max}$ be the largest index for which $p_i \le \frac{i}{u}\alpha$. If $i \le i_{max}$, then the test instance corresponding to $p_i$ is assigned to the current class. After conducting FDR control on all $Q$ p-value sets, we can label the test instances that are not assigned to any class as outliers.
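The procedure above can be sketched as follows, assuming that the step-up rule of Benjamini and Hochberg is the FDR procedure meant by [26]. The function names `bh_accept` and `assign_with_fdr`, and the choice to return a set of accepted classes per test instance, are assumptions made for illustration.

```python
def bh_accept(pvalues, alpha=0.05):
    """Indices of the p-values accepted by the step-up rule: find the
    largest rank i with p_(i) <= i * alpha / u and accept ranks 1..i."""
    u = len(pvalues)
    order = sorted(range(u), key=lambda i: pvalues[i])
    i_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / u:
            i_max = rank
    return set(order[:i_max])

def assign_with_fdr(pvalues_by_class, alpha=0.05):
    """pvalues_by_class[c][j] is the p-value of test instance j for class c.
    Returns one set of accepted classes per test instance; an empty set
    marks the instance as an outlier."""
    u = len(next(iter(pvalues_by_class.values())))
    accepted = [set() for _ in range(u)]
    for label, pvals in pvalues_by_class.items():
        for j in bh_accept(pvals, alpha):
            accepted[j].add(label)
    return accepted

# Illustrative p-values for four test instances and two classes.
result = assign_with_fdr({
    "A": [0.001, 0.5, 0.012, 0.9],
    "B": [0.8, 0.6, 0.7, 0.002],
})
print(result)
```

A test instance could in principle be accepted by more than one class; the sketch reports all accepting classes and leaves the tie-breaking rule open, since the text above does not specify one.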

Data sets and experimental settings
We have conducted experiments on 40 data sets from the UCI repository [10] and the KEEL-dataset repository [11]. Among these data sets, the number of instances ranges from 80 to 10092 and the number of features varies from 2 to 90. Most data sets have fewer than 10 classes and only six of them have more than 10 classes. The detailed characteristics of these data sets are given in Appendix A. Moreover, instances with missing values are discarded and the numeric feature values are normalized into the interval [0, 1] during pre-processing.
In the experiment, we perform 10-fold cross-validation (CV) and count the number of instances which have been correctly classified to compute a classification accuracy value. For every data set, we repeat the 10-fold CV experiment 10 times and record the average and standard deviation of 10 accuracy values as the final results.

TABLE 1
The average accuracy over forty data sets for IBT-U and IBT-U-K variants (k = 3). Columns: Methods, Avg accuracy.

All instances vs. k-NNs
In the first experiment, we compare several variants of our formulation to check which one is better in practice. Since our method is a classifier that combines instance-based learning and hypothesis testing, we use the abbreviation IBT to denote such a classification model. To distinguish different variants, IBT-U denotes the classification model in which the Mann-Whitney U test is applied to the distance sets derived from all training instances. Similarly, IBT-U-K denotes the classification model in which the distance sets are generated according to the k-NNs of the test instance. Furthermore, the two k-NN variants are denoted by IBT-U-K-D (k-NNs are obtained Directly without considering the class label) and IBT-U-K-S (k-NNs are obtained Separately from different classes), respectively. Additionally, the parameter k for the two k-NN variants is set to 3, 5, 7 and 9, respectively. The detailed experimental results on these three variants are given in Appendix B, C and D and their average accuracies are summarized in Table 1 and Table 2.
As shown in Table 1, the performance of IBT-U is much worse than that of the two k-NN variants. This indicates that it is worthwhile to use the k-NN strategy in the testing-based classification model. As shown in Table 2, the average classification accuracies of the two k-NN variants are quite similar when k is varied from 3 to 9. In the forthcoming sections, we use IBT-U-K-D (k = 3) as the representative of our classifiers in the performance comparison.

Our method vs. Other testing-based classifiers
In the second experiment, we compare our method with two previous methods, TBC [6] and IDC [8], which also use hypothesis testing to solve a classification problem. The detailed experimental results are given in Appendix E and their average accuracies are presented in Table 3.
In the implementation of TBC, we employ Hotelling's $T^2$ test as the testing method, which has been utilized in [6]. We use the Hotelling's $T^2$ statistics instead of p-values in the classification since the generated p-values are often zero. In the implementation of IDC, we use the Baringhaus and Franz (BF) statistic as the test statistic and assume equal prior probabilities in spite of unequal sample sizes.
For TBC, the classification accuracies on five data sets (Cleveland, Dermatology, Hepatitis, Movement libras and
Among these three methods, our method achieves the best performance for the following reasons. First, our method only considers the k-NNs of the test instance, while TBC and IDC utilize all training instances without considering the existence of outlying and irrelevant ones. Second, our method employs a hypothesis testing strategy that is totally different from that used in TBC and IDC.

Our method vs. Classic classifiers
In the third experiment, we compare our method with three classic classifiers: k-NN, support vector machine (SVM) and decision tree (DT). The detailed experimental results are given in Appendix F and G and their average accuracies are presented in Table 4.
For SVM, k-NN and DT, we use the functions fitcecoc, fitcknn and fitctree with their default parameter settings in Matlab 2018b, respectively. The reason for using fitcecoc function is that it can generate a multi-class model for SVM.
As shown in Table 4, our method is able to achieve the same level of performance as these classic classifiers. Concretely, among the 40 data sets, there are 13, 19 and 18 data sets on which our method produces higher classification accuracies than k-NN, SVM and DT, respectively. In short, our method is competitive with these classic classifiers with respect to the overall performance.

Handling outliers through FDR control
In the last experiment, we investigate the potential of our method on outlier detection and FDR control. The balance data set from UCI is used as an example, which has 625 instances and three classes (L, B and R). There are 288, 49 and 288 instances in the three classes respectively, as shown in Table 5. If we take a subset of the 576 (288+288) instances from the class L and R as training instances and use the 49 instances from the class B as test instances, then it is obvious that all test instances should be considered as outliers.
We randomly take 80 percent of the instances from the classes L and R to compose the training set. In order to obtain the average performance, 10 different random training sets are generated. We use IBT-U as the classifier and the significance level for FDR is set to 0.05. The experimental results show that 48 of the 49 test instances are labelled as outliers on average. Specifically, there are at most 2 test instances that cannot be labelled as outliers, and they are usually different when the training set is different. Therefore, our method is able to recognize outliers and control the FDR of classification results at the same time.

RELATIONSHIP TO OTHER APPROACHES
Our classification method is a two-phase approach: two distance sets are first generated and then the two-sample test is conducted. As we have discussed, we may use different significance testing methods in the second phase. In this section, we show that the use of different testing methods leads to different classifiers that are closely related to existing classification models.

Connection to Nearest Centroid Classifier
The nearest centroid (mean) classifier is one of the most widely used instance-based classification models [27]. In the training phase, only the centroid for each class is calculated and stored. In the classification phase, the distance between one unknown instance and each centroid is calculated to find the nearest centroid. Then, this new test instance is assigned to the class of its nearest centroid.
If the pooled t-test is employed as the significance testing procedure in our model, then we can reveal some interesting connections between our method and the nearest centroid classifier. To simplify the analysis, we first consider the scenario of univariate data set and then discuss the case of multivariate data set.
Given two one-dimensional sets $D^+ = \{t^+_1, t^+_2, ..., t^+_m\}$ and $D^- = \{t^-_1, t^-_2, ..., t^-_n\}$, their centroids (means) can be easily computed by $C_{D^+} = \frac{1}{m}\sum_{i=1}^{m} t^+_i$ and $C_{D^-} = \frac{1}{n}\sum_{j=1}^{n} t^-_j$. Given an unknown instance $t^*$, the distances between $t^*$ and these two centroids can be measured by $d^+ = |t^* - C_{D^+}|$ and $d^- = |t^* - C_{D^-}|$. The nearest centroid classification method assigns $t^*$ to the positive or the negative class according to whether $d^+ < d^-$.
In our method, two samples $G_X = \{x_i = |t^* - t^+_i| : i = 1, ..., m\}$ and $G_Y = \{y_j = |t^* - t^-_j| : j = 1, ..., n\}$ are first generated. Then, we test the null hypothesis against two alternative hypotheses ($F_X(t) < F_Y(t)$ and $F_X(t) > F_Y(t)$) on the two samples to obtain two one-sided p-values ($p_X$ and $p_Y$). Finally, our method assigns $t^*$ to the positive (negative) class if $p_X < p_Y$ ($p_X > p_Y$).
Note that when the pooled t-test is employed in our method, we obtain two t statistics ($t_X$ and $t_Y$). We can get $p_X < p_Y \Leftrightarrow \bar{d}_X < \bar{d}_Y$, where $\bar{d}_X$ and $\bar{d}_Y$ denote the means of $G_X$ and $G_Y$. Similarly, we can also get $p_X > p_Y \Leftrightarrow \bar{d}_X > \bar{d}_Y$. Therefore, our method assigns $t^*$ to the positive class if $\bar{d}_X < \bar{d}_Y$. Otherwise, we label $t^*$ as a negative instance. According to the triangle inequality, we can get $d^+ = |t^* - C_{D^+}| = |\frac{1}{m}\sum_{i=1}^{m}(t^* - t^+_i)| \le \frac{1}{m}\sum_{i=1}^{m}|t^* - t^+_i| = \bar{d}_X$, in which the equality holds if and only if $t^* \ge \max_i t^+_i$ or $t^* \le \min_i t^+_i$; the same bound holds for $d^-$ and $\bar{d}_Y$ with respect to $D^-$. When $d^+ = \bar{d}_X$ and $d^- = \bar{d}_Y$, our method assigns the test instance to the same class label as the nearest centroid classification method. The above analysis establishes the equivalence between our method and the nearest centroid classifier under very strict constraints: (1) the data set is one-dimensional, and (2) the test instance is no less (more) than all training instances in each class.
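A small numerical check of the one-dimensional argument is sketched below. It compares the distance to the centroid with the mean distance to the training instances, which is the quantity that decides the pooled t-test comparison; the function names `centroid_distance` and `mean_distance` and the data are hypothetical.

```python
import math

def centroid_distance(test, values):
    """|t* - C_D|: distance from the test value to the class centroid."""
    return abs(test - sum(values) / len(values))

def mean_distance(test, values):
    """Mean of |t* - t_i| over the class, i.e. the mean of the distance set."""
    return sum(abs(test - v) for v in values) / len(values)

def label_by(score, test, positives, negatives):
    """Assign the class with the smaller score (centroid or mean distance)."""
    if score(test, positives) < score(test, negatives):
        return "positive"
    return "negative"

positives = [1.0, 2.0, 3.0]
negatives = [7.0, 8.0, 9.0]

# Test value lies outside the range of both classes: the bound is tight,
# so both rules see the same two numbers.
t_outside = 4.0
print(centroid_distance(t_outside, positives), mean_distance(t_outside, positives))
print(label_by(centroid_distance, t_outside, positives, negatives),
      label_by(mean_distance, t_outside, positives, negatives))

# Test value lies inside the range of the positives: the inequality is strict.
t_inside = 2.0
print(centroid_distance(t_inside, positives), mean_distance(t_inside, positives))
assert centroid_distance(t_inside, positives) < mean_distance(t_inside, positives)
assert math.isclose(centroid_distance(t_outside, positives),
                    mean_distance(t_outside, positives))
```

When the test value falls inside the range of a class, the mean distance exceeds the centroid distance, so the two rules are no longer guaranteed to agree.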
For the multivariate case, it is very difficult to analyze their relationship in a quantitative manner. One naive con- then our method and the nearest centroid classification method will produce the same classification result.

Connection to k-NN Classifier
The k-NN classifier is one of the most popular classification methods in the literature [28]. In our formulation, if the precedence test [9] is employed as the significance testing method, then we may uncover some interesting connections between our method and the k-NN classifier.
We still consider the binary classification problem in which the training data is composed of $m$ positive instances from $D^+$ and $n$ negative instances from $D^-$. Given an unknown instance $t^*$, the k-NN classification method finds its $k$ nearest neighbors (k-NNs) to conduct the classification. These k-NNs can be divided into two groups: $k^+$ positive instances from $D^+$ and $k^-$ negative instances from $D^-$, where $k = k^+ + k^-$. If $k^+ > k^-$, then $t^*$ is classified as a positive instance. Otherwise, $t^*$ is assigned to the negative class.
The precedence test is a two-sample test based on the order of early failures [29]. Given two independent samples, $G_X = \{x_1, x_2, ..., x_m\}$ and $G_Y = \{y_1, y_2, ..., y_n\}$, let $x_{(1)} \le x_{(2)} \le ... \le x_{(m)}$ and $y_{(1)} \le y_{(2)} \le ... \le y_{(n)}$ denote their order statistics. The precedence test is based on the number of observations from one sample which exceed (precede) some threshold specified by the other sample. More precisely, the test statistic $W_r$ is the number of observations in $G_X$ that precede the $r$-th order statistic $y_{(r)}$ from $G_Y$. Alternatively, one can use the number of observations in $G_Y$ that exceed the $s$-th order statistic $x_{(s)}$ from $G_X$ as the test statistic $W_s$. Large values of these two test statistics lead to the rejection of the null hypothesis that the two distributions are equal.
In our problem formulation, $G_X$ ($G_Y$) is the set of distances between $t^*$ and the instances in $D^+$ ($D^-$). Then, $x_{(1)}, x_{(2)}, ..., x_{(k^+)}, y_{(1)}, y_{(2)}, ..., y_{(k^-)}$ are the $k$ distance values between $t^*$ and its k-NNs. If we use the precedence test as the significance testing method and suppose that $x_{(k^+)} \le y_{(k^-+1)} \le x_{(k^++1)}$, we can set $r = k^- + 1$ to obtain the corresponding test statistic $W_r = k^+$ for testing the null hypothesis against the alternative hypothesis ($F_X < F_Y$). Alternatively, if we let $s = k^+ + 1$, we can obtain another test statistic $W_s = k^-$ for testing the null hypothesis against the alternative hypothesis ($F_X > F_Y$). In this way, we also obtain two p-values, $p_X$ and $p_Y$. Finally, $t^*$ is assigned to the positive (negative) class if the former (latter) is smaller.
If we further assume that the positive training set and the negative training set have the same size, i.e., $m = n$, then the two p-values are totally determined by the two test statistics: $p_X < p_Y \Leftrightarrow W_r > W_s \Leftrightarrow k^+ > k^-$. Therefore, our method and the k-NN classifier generate the same classification result under the above assumptions. From this perspective, we may regard our method equipped with the precedence test as a generalized "statistical" k-NN classifier.
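The link between the precedence statistic and the k-NN vote can be checked numerically. The sketch below is illustrative only: it covers the statistic $W_r$ with $r = k^- + 1$, assumes there are no tied distances, and uses hypothetical function names (`knn_counts`, `precedence_w_r`, `ordering_condition_holds`).

```python
def knn_counts(gx, gy, k):
    """Number of positive and negative instances among the k smallest
    distances, where gx (gy) holds the distances to positives (negatives)."""
    labelled = sorted([(d, "+") for d in gx] + [(d, "-") for d in gy])
    k_plus = sum(1 for _, lab in labelled[:k] if lab == "+")
    return k_plus, k - k_plus

def precedence_w_r(gx, gy, r):
    """Number of observations in gx that precede the r-th smallest value of gy."""
    y_r = sorted(gy)[r - 1]
    return sum(1 for x in gx if x < y_r)

def ordering_condition_holds(gx, gy, k_plus, k_minus):
    """Check x_(k+) <= y_(k- + 1) <= x_(k+ + 1), assuming 1 <= k+ < len(gx)
    and k- < len(gy)."""
    xs, ys = sorted(gx), sorted(gy)
    return xs[k_plus - 1] <= ys[k_minus] <= xs[k_plus]

# Illustrative distance sets without ties.
gx = [0.1, 0.2, 0.9, 1.2]   # distances to positive training instances
gy = [0.3, 0.7, 1.0, 1.5]   # distances to negative training instances
k = 3

k_plus, k_minus = knn_counts(gx, gy, k)
w_r = precedence_w_r(gx, gy, r=k_minus + 1)
print(k_plus, k_minus, w_r, ordering_condition_holds(gx, gy, k_plus, k_minus))
print("positive" if k_plus > k_minus else "negative")
```

When the ordering condition fails, for instance because further positive instances lie between the $k$-th neighbour and $y_{(k^-+1)}$, the count returned by `precedence_w_r` can exceed $k^+$, which is why the equivalence is stated only under that condition.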

CONCLUSION
Due to the importance of the classification problem, many effective classification algorithms have been proposed by different research communities. However, most work on classification does not address the issue of statistical significance. In this direction, several initial research efforts have investigated the feasibility of constructing a classifier through significance testing. Unfortunately, this interesting idea has not received much attention during the past 10 years, mainly for the following reasons: (1) there are still no testing-based classifiers that can achieve the same level of performance as the state-of-the-art methods on real data sets; (2) the potential benefit of deploying such testing-based classifiers is still not clear.
Based on the above observations, this paper takes one step further towards this direction by formulating the classification problem as a two-sample testing problem. This new formulation enables us to generate several testing-based classifiers that have comparable performance with standard classifiers such as SVM. In addition, we show that it is quite easy to handle outlying test instances and control the FDR of classification results based on the p-values associated with each test instance.
We believe this paper will contribute to the development of the testing-based classification model, which may become a promising new family of classifiers. As the study of the testing-based classification model is still in its infancy, many research issues remain unexplored and should be investigated in future work. For example, since all the existing testing-based classifiers are based on the idea of instance-based learning, how to build a non-lazy testing-based classifier is an interesting and challenging issue.

APPENDIX A
The detailed characteristics of the forty data sets are given in Table 5.

APPENDIX B
The detailed experimental results of IBT-U are given by Table 6.

APPENDIX C
The detailed experimental results of IBT-U-K-D are given by Table 7.

APPENDIX D
The detailed experimental results of IBT-U-K-S are given by Table 8.

APPENDIX E
The detailed experimental results of TBC and IDC are given in Table 9.

APPENDIX F
The detailed experimental results of k-NN are given in Table 10.

APPENDIX G
The detailed experimental results of SVM and DT are given in Table 11.

TABLE 5
The detailed characteristics of the forty data sets. For each data set, the number of instances without (with) missing values is provided outside (inside) the parentheses in the second column. The class distribution information, i.e. the number of instances in every class, is given in the 5th column. The last column provides links to download the corresponding data set.