An Improved kNN Classifier for Epilepsy Diagnosis

Electroencephalogram (EEG) signals are important for detecting seizures and diagnosing epilepsy. In this paper, a weighted k-nearest neighbor classifier based on the Bray Curtis distance (WBCKNN) is proposed for the automatic detection of epilepsy. The Fourier transform first converts the time-domain characteristics of the signal into the frequency domain, which exposes more useful information. The WBCKNN classifier reduces the sensitivity to the neighborhood size k and is robust, so it can classify EEG signals more accurately in different situations. WBCKNN is applied to a public dataset and tested by k-fold cross-validation. Experimental results show that the best accuracy on the two-class and three-class problems is 99.67% and 99%, respectively, which is an improvement over the other classifiers compared. In addition, this method is superior to traditional methods in sensitivity, specificity and false alarm rate of epilepsy classification. This method can be applied in the medical market to help doctors diagnose epilepsy.


I. INTRODUCTION
Epilepsy is the most common brain disease in the world. It affects about one in 200 people and may last a person's whole life [1]. This neurological disease is caused by sudden disturbances of brain function. The frequent seizures that characterize epilepsy can directly threaten the patient's life, but the transition from one interictal stage to the next is not sudden. Therefore, the diagnosis and treatment of epileptic seizures are imperative to prevent further seizures or the death of the patient.
EEG is the measure of the electrical activity of the brain due to the postsynaptic potentials of cortical nerve cells, representing the superposition of the potentials produced by the synchronous firing of fibers or neurons [2]. It is an important tool for studying the temporal dynamics of the human brain's large-scale neuronal circuits [3]. At the same time, the EEG contains a wealth of neurophysiological and pathological information, which makes it an effective tool for evaluating epilepsy and other brain disorders. Experienced experts can even predict an upcoming seizure by examining changes in the EEG record. However, the visual detection of epileptic seizures from EEG signals is error-prone and time-consuming. Visual examination relies mainly on the knowledge and experience of doctors, which leads to many uncertainties. Computers have long been proposed to solve this problem: automated systems that recognize electroencephalographic changes have been under study for several years [4], and automated detection techniques based on EEG signals have been widely used in the diagnosis of epilepsy. Because automated inspection involves fewer uncertainties, it can produce more accurate results faster than visual inspection.
The associate editor coordinating the review of this manuscript and approving it for publication was Simone Bianco.
In recent years, a large body of literature has described classifiers that attempt to identify seizures through automatic monitoring. These EEG classifiers can be roughly divided into five categories: linear classifiers, neural networks, nonlinear Bayesian classifiers, nearest neighbor classifiers, and combinations of classifiers [5]. For example, a new approach for feature extraction and pattern recognition of ictal EEG based on EMD and SVM was proposed in [6]. Petrosian et al. [7] applied recurrent neural networks combined with signal wavelet decomposition to the problem. In 2013, Labate et al. [8] implemented a multiscale permutation entropy complexity measure and an SVM classifier for automatic seizure detection. Paulose and Bedeeuzzaman [9] developed a method that finds the median of the time series data and processes the lower and upper parts of the data separately for classification. In 2017, a novel seizure detection algorithm using mutual information and SVM was studied [10]. In 2018, a new algorithm was used to classify and detect epilepsy risk levels in EEG signals with the help of expectation maximization and improved expectation maximization [11]. Alam et al. [12] derived the IMF patterns by extracting amplitude and frequency modulated oscillatory patterns and further applied higher order statistics. In [13], Long Short-Term Memory (LSTM) networks were introduced for epileptic seizure prediction using EEG signals, expanding the use of deep learning algorithms with convolutional neural networks (CNN). Samiee et al. [14] classified EEG signals for epileptic seizure detection by the width of the largest local Gabor binary patterns, using the sparse rational decomposition as a feature. Among these algorithms for epilepsy classification, some are based on time-domain features, while others are based on frequency-domain features. Some algorithms have complicated classification steps and suffer from an insufficient diagnosis rate.
In this article, our research aims to reduce the cost of epilepsy diagnosis. It can also reduce the workload of doctors, help them improve the efficiency of diagnosis, and reduce the possibility of misdiagnosis, which enables patients to receive timely and effective treatment and reduces the burden on both doctors and patients. To solve the above problems, we propose a new epilepsy classification system, which consists of two parts: processing the EEG signal using a Fourier transform, and the WBCKNN classifier. Our main contributions include:
1. In order to get more useful information from the EEG signal, the data is processed using the Fourier transform. The classification results before and after data processing are compared experimentally.
2. The WBCKNN algorithm is proposed and used in the diagnosis of epilepsy. The WBCKNN method has better robustness and is not easily affected by outliers.
3. The WBCKNN method is evaluated by performing two sets of experiments and is compared with other algorithms. The possibility and reliability of its practical application for seizure detection are verified.
The paper is organized as follows. The EEG dataset, data processing and WBCKNN technique are given in Section II. Simulation experiments and evaluation methods are introduced in Section III. The experimental results, the comparison with other classifiers and the advantages of WBCKNN are given in Section IV. Finally, we present our conclusion in Section V.

II. THE FOUNDATION OF THE WBCKNN AND MATERIALS

A. DATA INTERPRETATION AND PREPROCESSING

1) DATASET
The EEG dataset used in this paper is from the work of Andrzejak et al. [15]. There are five sets (A, B, C, D and E), each containing 100 single-channel EEG segments. Each EEG signal is 23.6 s long and was digitized at 173.61 Hz, so each sample is a time series of 4097 values. Sets A and B contain segments extracted from surface EEG recordings of five healthy volunteers in an awake state with eyes open (A) and eyes closed (B). Sets C, D and E are taken from an EEG archive of pre-surgical diagnosis from five epileptic patients. Segments in Set D were recorded from within the epileptogenic zone, and those in Set C from the hippocampal formation of the opposite hemisphere of the brain. While Sets C and D contain only activity measured during seizure-free intervals, Set E contains only seizure activity. The amplitude change of each category is shown in Fig. 1.

2) DATA PREPROCESSING
In the analysis of time series or signal information, frequency domain analysis and time domain analysis are the most commonly used methods. The method used in this paper is based on frequency domain analysis. Analyzing the signal in the frequency domain makes it easy to find features that are hard to find in the time domain. This analysis is not only simple but also well suited to complex signals.
The discrete Fourier transform (DFT) is the Fourier transform that is discrete in both the time domain and the frequency domain. We used the following formula to process the EEG samples:

$$X(k) = \sum_{i=0}^{N-1} x(i)\, e^{-j 2\pi k i / N}, \quad k = 0, 1, \cdots, N-1 \tag{1}$$

where $X(k)$ is the discrete Fourier coefficient, $N$ is the length of the transformed data and $x(i)$ is the input signal in the time domain.

Consider an input EEG dataset that contains several time series samples $x_i$, where $i = 1, 2, \cdots, n$ and $n$ is the total number of samples in the dataset. Each sample can be represented as $x_i = \{x_{i1}, x_{i2}, \cdots, x_{iN}\}$, where $N$ is the total number of time series features contained in the sample. By means of Eq. (1), we convert each sample from the time domain $x_i$ to the frequency domain to obtain the new sample $X_i$, giving $X = \{X_1, X_2, \cdots, X_n\}$. Each $X_i$ is likewise represented as $X_i = \{X_{i1}, X_{i2}, \cdots, X_{iN}\}$. After the above processing, the time-domain features have been transformed into frequency-domain features. The data is then processed by means of Eq. (2), which converts the sample features from the complex number domain to the real number domain and yields the samples used for classification.
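The preprocessing described above can be illustrated with a minimal NumPy sketch. The DFT step follows Eq. (1); the complex-to-real step is an assumption here, since this sketch simply takes the modulus of each Fourier coefficient, and the function name `to_frequency_features` is hypothetical.

```python
import numpy as np


def to_frequency_features(samples):
    """Map time-domain samples of shape (n, N) to real frequency-domain features."""
    samples = np.asarray(samples, dtype=float)
    # Eq. (1): discrete Fourier transform applied to every sample (row).
    spectrum = np.fft.fft(samples, axis=1)
    # Assumption: the complex coefficients are made real by taking the modulus.
    return np.abs(spectrum)


# Toy signal: exactly one cosine cycle over 8 points, so only bins 1 and 7 are non-zero.
t = np.arange(8)
toy_signal = np.cos(2 * np.pi * t / 8)
features = to_frequency_features([toy_signal])
print(np.round(features[0], 6))
```

For a real-valued input the magnitude spectrum is symmetric, so in practice only the first half of each feature vector carries independent information.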

B. DESCRIPTION OF WBCKNN
The k-nearest-neighbours (kNN) method is a nonparametric classification method that is simple but effective in many cases [32]. Its major drawbacks are its low efficiency, since it is a lazy learning method, and its dependency on the selection of a "good value" for k: to apply kNN we need to choose an appropriate value for k, and the success of classification depends heavily on this value. In addition, the Euclidean distance is commonly used in the kNN algorithm, usually to measure the similarity of two low-dimensional samples. However, the characteristic parameters used in the diagnosis of epilepsy are often numerous and correlated, and the Euclidean distance cannot characterize the similarity of data in a high-dimensional space. Therefore, this distance measure has certain limitations in identifying and classifying epilepsy signals. To make the kNN classifier less sensitive to k and to improve its sensitivity in the task of classifying epilepsy, we propose a weighted kNN classification method based on the Bray Curtis distance. An introduction to the Bray Curtis distance is given in [33]. The Bray Curtis distance between data in a high-dimensional space can directly represent the similarity between data. The core of the kNN algorithm is to classify a sample by calculating its distance to each training sample in space. Therefore, the kNN algorithm can classify accurately only if the distances between patients of the same type differ clearly from the distances between patients of different types. The EEG data is time-continuous, and the characteristic parameters of the samples are correlated to some degree. The Euclidean distance cannot capture this correlation. The Bray Curtis distance, however, is able to mine the correlation between the feature parameters, and it is meaningful for measuring the similarity of data in a high-dimensional space. Therefore, the Bray Curtis distance can be used as the distance measure in the kNN algorithm to effectively improve the recognition accuracy in EEG signal recognition.
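As a minimal illustration of the metric discussed above, the Bray Curtis distance between two feature vectors is the sum of absolute differences divided by the sum of absolute sums. The function name `bray_curtis` and the zero-denominator convention are assumptions of this sketch, not part of the paper.

```python
import numpy as np


def bray_curtis(y, x):
    """Bray Curtis distance: sum(|y - x|) / sum(|y + x|)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    denominator = np.abs(y + x).sum()
    if denominator == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return float(np.abs(y - x).sum() / denominator)


same = bray_curtis([1, 2, 3], [1, 2, 3])      # identical samples
disjoint = bray_curtis([1, 0], [0, 1])        # no overlap between samples
partial = bray_curtis([1, 2, 3], [2, 2, 5])   # 3 / 15
print(same, disjoint, partial)
```

For non-negative features, such as magnitude spectra, the value stays between 0 (identical samples) and 1 (maximally different samples).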
The WBCKNN classifier was used to classify epileptic seizures. In epilepsy classification, the $n$ training samples from $m$ different classes are defined as $X = \{x_i \in R^d\}_{i=1}^{n}$, the $m$ classes are represented as the set $C = \{C_1, C_2, \cdots, C_m\}$, and the class label of $x_i$ is $c_i \in C$. The training samples labeled $C_j$ form the training subset of the j-th class. For a given sample $y = \{y_1, y_2, \cdots, y_N\}$ to be tested, its class label is determined as follows:

(a) Calculate the distance between $y$ and each sample $x_i$ in the training set using the Bray Curtis distance:

$$b(y, x_i) = \frac{\sum_{l=1}^{N} |y_l - x_{il}|}{\sum_{l=1}^{N} |y_l + x_{il}|} \tag{3}$$

where $N$ is the number of features of the sample. Obviously, $b(y, x_i) = 0$ means the samples are exactly the same, while $b(y, x_i) = 1$ is the maximum difference that can be observed between two samples.

(b) Find the k nearest neighbors of the query sample $y$ in each class according to $b(y, x_i)$, so that a total of $k \times m$ samples are selected. The k samples from class $C_j$ closest to $y$, ordered by increasing distance, are represented as

$$\{x^{NN}_{1j}, x^{NN}_{2j}, \cdots, x^{NN}_{kj}\} \tag{4}$$

(c) Calculate k local mean vectors by using the k nearest neighbors in each class:

$$v^{NN}_{rj} = \frac{1}{r} \sum_{i=1}^{r} x^{NN}_{ij}, \quad r = 1, 2, \cdots, k \tag{5}$$

where $v^{NN}_{1j}, v^{NN}_{2j}, \cdots, v^{NN}_{kj}$ denote the k local mean vectors from class $C_j$.

(d) Calculate the Bray Curtis distance between $y$ and the local mean vectors of each class, and record the k distances in each class as

$$B_j = \{b(y, v^{NN}_{1j}), b(y, v^{NN}_{2j}), \cdots, b(y, v^{NN}_{kj})\} \tag{6}$$

(e) Calculate the generalized mean distances based on $B_j$. The definition of the generalized mean is given in [34]. The r-th generalized mean distance is calculated from the Bray Curtis distances between $y$ and the first r local mean vectors in each class:

$$g^{j}_{r}(y) = \left( \frac{1}{r} \sum_{i=1}^{r} b(y, v^{NN}_{ij})^{p} \right)^{1/p} \tag{7}$$

where $p$ is a non-zero real number. When $p = 1$, the generalized mean becomes the arithmetic mean. When $p \to 0$, the generalized mean becomes the geometric mean. When $p = -1$, the generalized mean becomes the harmonic mean. The k generalized mean distances in class $C_j$ are denoted as $G_j = \{g^{j}_{1}(y), g^{j}_{2}(y), \cdots, g^{j}_{k}(y)\}$.

(f) Calculate the weighted distance of each class. The weight of the generalized mean distance $g^{j}_{r}(y)$ from class $C_j$ is defined in (8), and the weighted distance $D_j(y)$ between the k nearest neighbors and $y$ in class $C_j$ is calculated from these weights and the distances in $G_j$.

(g) Finally, the weighted distances of all classes are expressed as $D(y) = \{D_1(y), D_2(y), \cdots, D_m(y)\}$, where $m$ is the number of categories. The category of the sample $y$ to be classified is determined by

$$C = \arg\min_{C_j} D_j(y) \tag{11}$$
The above is the specific classification process of the WBCKNN classifier, which is illustrated in Figure 2.
The pseudocode of the proposed WBCKNN method is summarized in Algorithm 1.
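The classification steps can be sketched in Python as follows. This is a rough illustration under stated assumptions rather than the authors' implementation: the local mean vectors are taken as cumulative means of the ordered neighbors, the rank-based weight `1 / r` is a hypothetical stand-in for the paper's weight formula (which this sketch does not reproduce), and the names `bray_curtis_rows` and `wbcknn_predict` are invented for the example.

```python
import numpy as np


def bray_curtis_rows(y, rows):
    """Bray Curtis distance from the vector y to every row of `rows`."""
    num = np.abs(rows - y).sum(axis=1)
    den = np.abs(rows + y).sum(axis=1)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def wbcknn_predict(X, labels, y, k=3, p=1.0):
    """Return the class whose weighted generalized mean distance to y is smallest."""
    if p == 0:
        raise ValueError("p must be a non-zero real number")
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    y = np.asarray(y, dtype=float)
    class_distance = {}
    for c in np.unique(labels):
        Xc = X[labels == c]
        kc = min(k, len(Xc))
        # (a)-(b): the kc nearest neighbors of y inside class c, closest first.
        order = np.argsort(bray_curtis_rows(y, Xc))[:kc]
        neighbors = Xc[order]
        ranks = np.arange(1, kc + 1)
        # (c): r-th local mean vector = mean of the first r neighbors (assumption).
        local_means = np.cumsum(neighbors, axis=0) / ranks[:, None]
        # (d): distances from y to the local mean vectors; clipped for negative p.
        b = np.maximum(bray_curtis_rows(y, local_means), 1e-12)
        # (e): r-th generalized mean of the first r distances.
        g = (np.cumsum(b ** p) / ranks) ** (1.0 / p)
        # (f): hypothetical rank weights 1/r, NOT the paper's weight formula.
        w = 1.0 / ranks
        class_distance[c] = float(np.sum(w * g) / np.sum(w))
    # (g): the class with the smallest weighted distance wins.
    return min(class_distance, key=class_distance.get)


X_train = [[1, 1, 1, 1], [1, 2, 1, 1], [2, 1, 1, 1],
           [9, 1, 1, 8], [8, 2, 1, 9], [9, 1, 2, 9]]
y_train = ["normal"] * 3 + ["seizure"] * 3
first = wbcknn_predict(X_train, y_train, [8, 1, 1, 9], k=2, p=-1.0)
second = wbcknn_predict(X_train, y_train, [1, 1, 2, 1], k=2, p=-1.0)
print(first, second)
```

Because neighbors are searched per class, every class contributes exactly k candidates, which is what lets the decision depend on class-wise distances rather than on a majority vote.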

III. EXPERIMENTAL SETTINGS
The WBCKNN technique is implemented in Python 3.7. Its main purpose is to improve the accuracy of the early diagnosis of epilepsy by accurately classifying the state of epilepsy patients, so that doctors can give patients timely treatment. Therefore, the WBCKNN technique is evaluated in terms of accuracy (ACC), sensitivity (SEN), specificity (SPE), and false alarm rate (FAR). Among them, ACC is the main evaluation index: it is the ratio of the number of patients correctly classified as normal or abnormal to the total number of patient records evaluated in the experiments. ACC most intuitively reflects whether the WBCKNN technique achieves its purpose of accurate classification, and it also serves as the main basis for our comparison with other advanced classifiers. In statistical learning theory, k-fold cross-validation, independent tests and self-consistency tests are commonly used to test the validity of a model. This paper uses k-fold cross-validation to test the performance of WBCKNN.
In addition, we conducted comparative experiments with six widely used classification models: k-nearest neighbor (kNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGBoost) and deep neural network (DNN). These classifiers are representative of machine learning, ensemble learning, and deep learning, and they are also widely used in medical diagnosis. We also conducted a comparative test before and after data processing. The evaluation standards and experimental settings are described in detail below.

A. INTRODUCTION OF EVALUATION INDICATORS
The classification performance of the WBCKNN classifier is determined by computing FAR, SEN and SPE along with ACC. They are defined as:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \quad SEN = \frac{TP}{TP + FN}, \quad SPE = \frac{TN}{TN + FP}, \quad FAR = \frac{FP}{FP + TN}$$

Here TP and TN represent the total number of correctly detected true positive events and true negative events, respectively. FP and FN represent the total number of erroneously positive events and erroneously negative events.

Algorithm 1 WBCKNN Classifier
Require:
  $X = \{x_i \in R^d\}_{i=1}^{n}$: the training EEG data with n training samples
  $C = \{C_1, C_2, \cdots, C_m\}$: the class set with m classes
  the training subset from the j-th class
  $y$: the given query sample
  $k$: the neighborhood size
  $p$: the exponent of the generalized mean
Input: processed data X; query sample y; k; p
Output: the category C of the query sample y
Step 1: Begin
Step 2: Calculate the distance $b(y, x_i)$ between y and $x_i$ using (3)
Step 3: Find the k nearest neighbors of y in each class by $b(y, x_i)$
Step 4: Calculate the k local mean vectors $v^{NN}_{1j}, \cdots, v^{NN}_{kj}$ of every class $C_j$
Step 5: For each class $C_j$, calculate $B_j = \{b(y, v^{NN}_{1j}), \cdots, b(y, v^{NN}_{kj})\}$ using (6)
Step 6: Calculate the k generalized mean distances $G_j = \{g^{j}_{1}(y), g^{j}_{2}(y), \cdots, g^{j}_{k}(y)\}$ based on $B_j$
Step 7: Calculate the weighted distance of each class, $D(y) = \{D_1(y), D_2(y), \cdots, D_m(y)\}$, using (10)
Step 8: Get the category $C = \arg\min_{C_j} D_j(y)$
Step 9: Return C
Step 10: End
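The four indicators can be computed directly from the confusion counts. The sketch below assumes the usual confusion-matrix definitions, with FAR taken as the false positive rate FP / (FP + TN); the function name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute ACC, SEN, SPE and FAR from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, q in pairs if t == positive and q == positive)
    tn = sum(1 for t, q in pairs if t != positive and q != positive)
    fp = sum(1 for t, q in pairs if t != positive and q == positive)
    fn = sum(1 for t, q in pairs if t == positive and q != positive)

    def ratio(numerator, denominator):
        # Guard against empty classes in small toy examples.
        return numerator / denominator if denominator else 0.0

    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "SEN": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
        "FAR": ratio(fp, fp + tn),  # assumption: FAR is the false positive rate
    }


# Toy case: 3 seizure segments (label 1) and 5 seizure-free segments (label 0).
truth = [1, 1, 1, 0, 0, 0, 0, 0]
guess = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(truth, guess)
print(metrics)
```

Under this definition FAR equals 1 - SPE, so reporting both mainly serves readability.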

B. K-FOLD CROSS-VALIDATION
K-fold cross-validation is a method that ensures that each subsample is used for both training and testing, which reduces the generalization error. The method divides the data into k groups. Each time, one of the k subsets is used as the test set and the other k − 1 subsets are put together to form the training set. The final classification result is the average over the k folds. In each set of experiments, we performed 2-fold, 5-fold and 10-fold cross-validation.
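The procedure can be sketched with NumPy alone. This is an illustrative outline, not the experimental code: the helper names `k_fold_indices`, `cross_validate` and `nearest_neighbour`, the fixed random seed, and the toy one-dimensional data are all assumptions, and a plain 1-nearest-neighbour rule stands in for the classifier under test.

```python
import numpy as np


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds disjoint groups."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)


def cross_validate(X, labels, predict_fn, n_folds=5, seed=0):
    """Average test accuracy over the folds; predict_fn(train_X, train_y, x) -> label."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    folds = k_fold_indices(len(X), n_folds, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # The other n_folds - 1 groups together form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predictions = np.array(
            [predict_fn(X[train_idx], labels[train_idx], x) for x in X[test_idx]]
        )
        accuracies.append(np.mean(predictions == labels[test_idx]))
    return float(np.mean(accuracies))


def nearest_neighbour(train_X, train_labels, x):
    """Stand-in classifier: label of the closest training sample (L1 distance)."""
    return train_labels[np.argmin(np.abs(train_X - x).sum(axis=1))]


# Two well-separated toy classes with ten samples each.
data = np.concatenate([np.arange(10) * 0.1, 10 + np.arange(10) * 0.1]).reshape(-1, 1)
target = np.array([0] * 10 + [1] * 10)
mean_accuracy = cross_validate(data, target, nearest_neighbour, n_folds=5)
print(mean_accuracy)
```

Any classifier with the same `predict_fn(train_X, train_y, x)` signature can be passed in, which keeps the fold logic independent of the model being evaluated.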

C. EXPERIMENTAL EXPLANATION
In this paper, we conducted two sets of experiments. They are defined as follows:
Experiment 1: Classify patients with epilepsy. The purpose of this experiment is to distinguish patients who are having seizures from those who are not. It can help to detect a seizure while it is happening and to alert doctors as soon as possible so that they can rescue the patient.
Experiment 2: Three-class classification of sets A, D and E. There are differences in EEG signals between healthy people and patients with seizures, and classification can help to diagnose whether a person has the disease. At the same time, the EEG signals of clinically healthy people are similar to those of patients during seizure-free intervals. This experiment can also help to distinguish these two groups of people, thereby improving the accuracy of the diagnosis.
The experimental process is shown in Fig. 3.

IV. RESULTS AND DISCUSSION
In this section, the experimental results obtained by the WBCKNN classifier are presented. We also analyze the characteristics and advantages of the WBCKNN system, which provides a theoretical basis for its effectiveness.

A. RESULTS OF THE EXPERIMENT
The results of the first experiment, the {C, D}-E classification, are shown in Tables 1, 2 and 3, which correspond to 2-fold, 5-fold and 10-fold cross-validation, respectively.
The experimental results of the three-class classification of A, D and E are given in Tables 4, 5 and 6, which correspond to 2-fold, 5-fold and 10-fold cross-validation, respectively.

B. ANALYSIS OF EXPERIMENTAL RESULTS AND ANALYSIS OF WBCKNN
The proposed EEG signal classification system has three characteristics: (a) the relationship between samples is measured by the Bray Curtis distance; (b) a generalized mean is introduced; (c) a distance weighting mechanism is introduced.
Before classification, we performed a Fourier transform on the data. This allows us to get more useful information from the EEG signals. The experimental results confirm this advantage: in Tables 1-6, the classification results of each classifier after data processing are better than the corresponding results obtained without data processing.
In the WBCKNN classifier, we used the Bray Curtis distance to measure the similarity between two samples. Replacing the original Euclidean distance in the kNN classifier with the Bray Curtis distance better reflects the relationship between samples, and this way of capturing the correlation lays a good foundation for the subsequent classification. In addition, we used local mean vectors, which are an effective means of reducing both the influence of outliers on the classification results and the sensitivity to k.
The proposed method uses the generalized mean distance. Given a set of k positive real numbers $\{z_1, z_2, \cdots, z_k\}$, the generalized mean is defined as

$$M_p(z_1, z_2, \cdots, z_k) = \left( \frac{1}{k} \sum_{i=1}^{k} z_i^{p} \right)^{1/p}$$

where $p$ is a non-zero real number. It is worth noting that the generalized mean is affected by the value of $p$. As $p$ decreases, the generalized mean is more strongly affected by the numbers with smaller values. As $p$ increases, the generalized mean is more strongly influenced by the numbers with larger values. Samples can therefore be treated differently according to their importance in the kNN classifier. Introducing the generalized mean into epilepsy classification strengthens the classification contribution of the neighbors with a small Bray Curtis distance. It also helps to overcome the negative effects of outliers in the k-neighborhood.
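The effect of p can be checked numerically with a small sketch. The function name `generalized_mean` and the toy values are assumptions made for illustration only.

```python
import numpy as np


def generalized_mean(values, p):
    """Generalized (power) mean of positive numbers for a non-zero exponent p."""
    if p == 0:
        raise ValueError("p must be a non-zero real number")
    values = np.asarray(values, dtype=float)
    return float(np.mean(values ** p) ** (1.0 / p))


# One small distance (a close neighbor) and one large distance (a possible outlier).
distances = [1.0, 4.0]
arithmetic = generalized_mean(distances, 1)    # ordinary average of the two values
harmonic = generalized_mean(distances, -1)     # pulled toward the small value
very_negative = generalized_mean(distances, -5)
very_positive = generalized_mean(distances, 5)
print(very_negative, harmonic, arithmetic, very_positive)
```

A negative exponent therefore lets the closest local mean vectors dominate the class distance, which matches the stated goal of limiting the influence of distant neighbors.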
Finally, we assign a weight to each generalized mean distance, so that the samples closer to the test sample contribute more to the classification. Therefore, the proposed classification method not only reduces the sensitivity to k but also has better robustness in classification. From the results of Experiment 1 we can draw the following conclusions:
1) Compared with the other six traditional methods, the proposed WBCKNN shows significant improvements in accuracy, sensitivity and specificity.
2) The classification specificity of WBCKNN reaches 100%, and the classification accuracy reaches 99.67%.
Based on the results of Experiment 1, we consider that WBCKNN performs better in classifying the activity measured in the seizure phase. Therefore, WBCKNN can help doctors to classify patients with epilepsy. According to the results shown in Tables 4, 5 and 6, we can draw conclusions similar to those of Experiment 1:
1) The classification performance of the WBCKNN method is better than that of the other traditional methods.
2) The classification accuracy reaches 99% in 10-fold cross-validation. Sensitivity and specificity also reach a relatively good level.
The above experimental results also show that, compared with the traditional kNN classifier, the proposed method greatly improves the classification performance and has better robustness.
We also compared our results with those reported for earlier methods. Table 7 gives the classification accuracy of our method and of previous methods. As these results show, for the {C, D}-E classification task our method with 5-fold cross-validation obtains the highest classification accuracy reported so far, 99.67%. For the A, D, E classification task, we used 10-fold cross-validation to obtain the highest classification accuracy, reaching 98.72%. This accuracy is the same as that reported by Hassan.

V. CONCLUSION
Detecting epilepsy is a difficult task. Effective analysis of seizure activity is critical to helping patients choose the right treatment. This article presents a new epilepsy classification system. The system consists of two parts: one processes the data through the Fourier transform, and the other uses the WBCKNN classifier to make the decision. Using this hybrid system, we interpret EEG signals and make decisions. Experimental verification shows that the accuracy rates on the two-class and three-class problems are 99.67% and 99%, respectively. Compared with the existing classification methods for seizure and non-seizure EEG signals, this method has better classification accuracy. Therefore, we consider it a diagnostic decision support mechanism for epilepsy treatment.

ACKNOWLEDGMENT
The findings achieved herein are solely the responsibility of the authors.