MS4PS: A Mentor-Student Architecture for Patient-Specific Seizure Detection With Combination of Transfer Learning and Active Learning

Privacy protection, high labeling cost, and varying characteristics of seizures among patients and at different times are the main obstacles to building seizure detection models. Considering these issues, we propose a novel Mentor-Student architecture for Patient-Specific seizure detection (MS4PS). It contains a new method of knowledge transferring called mentor-select-for-student, which exploits the knowledge of a mentor model by using this model to select data for training a student model, making it possible to avoid transferring patient data and the negative influence of transferring parameters/structures of pre-trained models. It also contains a new method of active learning, which uses both an experienced mentor model and a quick-learning student model to select high-quality samples for doctors to label. Each of the two models is coupled with a particular sample selection strategy that combines uncertainty/certainty and the distance between the unlabeled samples and labeled seizure samples. The proposed method can quickly train a suitable detector for a patient at his/her first epilepsy diagnosis with the help of: (1) an experienced mentor model that chooses the most category-certain electroencephalography (EEG) data segments; (2) a student model (detector itself) that chooses the most category-uncertain EEG data segments; (3) doctors who label these data segments selected by both the mentor model and student model. By replacing or improving the mentor model and refining the historical models of patients when they come next time, the MS4PS system can be sustainably promoted. The proposed method is tested on the CHB-MIT and NEO datasets, and the results demonstrate its effectiveness and efficiency.


I. INTRODUCTION
Epilepsy is a chronic neurological disorder of the brain that causes death in many patients every year. If the symptoms are detected in a timely manner and patients receive appropriate treatment, they are likely to live a normal life. Currently, many devices can record the electroencephalography (EEG) of subjects; therefore, doctors can check whether they have epilepsy and analyze their conditions using EEG. However, checking EEG signals is a time-consuming and challenging task for doctors.
The associate editor coordinating the review of this manuscript and approving it for publication was Guangcun Shan.
To relieve the burden on doctors and improve efficiency, many machine learning-based methods have been proposed to build automatic seizure detectors for EEG, such as the dictionary-learning-based method [1], the SVM-based method [2], and the GMM-based method [3].
In recent years, with the surge in the development of deep learning [4]-[6], many researchers [7]-[9] have attempted to employ deep learning methods to train patient-independent seizure detectors, and their works have shown great improvement and potential. However, there are at least three main obstacles to building deep-learning-based seizure detectors: privacy protection of patient data, high labeling costs, and the varying characteristics of epilepsy EEG data.
The privacy protection of patient data and high labeling costs hinder the acquisition of sufficient training data. To reduce the need for labeled data and improve the quality of labeled data, deep transfer learning [10] and deep active learning (AL) [11] methods have been employed in patient-independent seizure detection [12]-[15] and have shown good performance.
The seizure characteristics vary among patients and even at different times for the same patient. This makes it difficult for patient-independent seizure-detection models to fit specific patients. Several studies [16]-[18] have shown that building a patient-specific seizure detector for each patient seems more suitable than building a patient-independent seizure detector for all patients. Because it is still difficult to train a suitable detector for a specific patient who has no historically labeled data (e.g., when he/she sees a doctor for the first time) or not enough historically labeled data, it is reasonable to employ transfer learning. However, commonly used transfer learning methods can lead to negative transfer [18], [19] when the distributions of the source and target domains are different.
In light of these obstacles and solutions, we propose a novel Mentor-Student architecture for Patient-Specific seizure detection (MS4PS) that combines transfer learning and active learning. To the best of our knowledge, no researchers have proposed this method for patient-specific seizure detection.
Instead of transferring patients' data or copying a pre-trained model's parameters/structures and then fine-tuning [12]-[14], this method exploits the knowledge of a mentor model by using the mentor model to select data for the learning of the student model. Transferring knowledge through mentor-select-for-student makes it possible to avoid transferring patient data and the negative influence of transferring parameters/structures (according to [18], [19] and our experiments, transferring parameters/structures could be of little help or even hinder the learning process in patient-specific cases).
This method contains a new method of active learning, which uses both an experienced mentor model and a quick-learning student model to select samples for doctors to label. Each of the two models is coupled with a particular sample selection strategy that combines the uncertainty/certainty and the distance between the unlabeled samples and labeled seizure samples. The new active learning method has the potential to significantly reduce doctors' burden of searching for and labeling seizures in severely ill-balanced EEG data.
Overall, the proposed method could quickly train a good student model for a specific patient at his/her first epilepsy diagnosis with the help of: (1) an experienced mentor model that chooses the most category-certain EEG data segments; (2) a student model itself that chooses the most category-uncertain EEG data segments; (3) doctors who label these data segments selected by both the mentor model and student model. These trained student models for specific patients would be saved, reused, and further trained when these patients have an epilepsy diagnosis next time. Moreover, the mentor model can be replaced and improved when sufficient labeled data are collected. This means the mentor-student system could work better and better.
The main contributions of this paper are as follows:
1) A novel method that combines transfer learning and active learning with a mentor-student architecture for patient-specific epilepsy detection is proposed. This method makes training a patient-specific seizure detector easy and quick, and the mentor-student system could be sustainably promoted.
2) A new knowledge transferring method, which uses a mentor model to select data for the student model's learning, is proposed for the mentor-student system. It protects patients' privacy by not transferring their data and avoids the negative influences of transferring a model's parameters/structures.
3) A novel active learning method, which uses both an experienced mentor model and a quick-learning student model to select samples for doctors to label, is proposed for the mentor-student system to relieve doctors' burden of searching for and labeling seizures in severely ill-balanced EEG data.
On the CHB-MIT [20] and NEO [21] datasets, 39 different methods (including MS4PS methods, two state-of-the-art transfer learning methods [12]-[14], two classic active learning methods, and many other methods) are compared. Note that the two transfer learning methods (denoted in this paper by S_M(TF)-R and S_M(TAL)-R) are not exactly the same as the methods in [12]-[14] because, to make the comparison among the 39 methods fair, we force them all to use our model and dataset. The S_M(TF)-R method replaces the source model's classifier with a task-specific classifier (in [14], it is an SVM layer; in [12], [13], and in this study, it consists of fully-connected and softmax layers) and then fine-tunes this task-specific classifier; the S_M(TAL)-R method does the same but fine-tunes all the model's layers. The two classic active learning methods are random and maximum entropy (denoted in this paper by S(TAL)-R and S(TAL)-U).
The top 5 of the 39 on Avg_F1 (average F1-score over given budgets and patients) and the top 1 of the 39 on Avg_seizures (average number of seizure samples selected over given budgets and patients) are all MS4PS methods. The best MS4PS method surpasses the two transfer learning methods, random and maximum entropy by 30.5%, 26.7%, 19.3% and 9% on Avg_F1, and by 39, 39, 39, and 5 on Avg_seizures. These results demonstrate the feasibility of MS4PS for patient-specific seizure detection problems.

II. RELATED WORKS
A. PATIENT-SPECIFIC SEIZURE DETECTION
Many works study patient-independent seizure detection. For example, [1] employs a real-time method based on dictionary learning and sparse representation; [22] employs a method based on a deep neural network that combines a seizure representation part to eliminate inter-subject noise and an attention mechanism part to enhance interpretability.
VOLUME 10, 2022
To adapt to the variation in seizure characteristics among patients and at different times, some patient-specific seizure detection algorithms have been proposed. Most of them assume that enough labeled EEG data are available and try to improve the performance of patient-specific detectors by using elaborate models and features. For example, [2] uses a group of SVMs and features extracted through empirical mode decomposition (EMD) and common space patterns (CSP); [17] uses a voting SVM system and features containing both the temporal-domain and spectral-domain information of EEG; [23] uses an RVM model and the harmonic multiresolution and self-similarity-based fractal features from EEG data; and [16] builds a predictor based on a spatio-temporal-spectral hierarchical GCN with an active pre-ictal interval learning scheme (STS-HGCN-AL).
In contrast, our work assumes that labeled EEG data are insufficient and addresses this with MS4PS, which combines transfer learning and active learning. To the best of our knowledge, only a few works have considered the problem of insufficient labeled EEG data and tried to solve it. For example, [18] employs a probabilistic framework for training a personalized neonatal seizure detector based on transfer learning and semi-supervised learning. It uses a source-task model (a patient-independent seizure detector) as the initial target-task model (personalized seizure detector) and labels specific patients' EEG data with the source-task model when doctors are absent. This work is the most relevant to ours. However, our work transfers knowledge through mentor-select-for-student rather than copying and refining model parameters/structures, and we use the pre-trained (patient-independent) model to select samples for active learning rather than to provide a hypothesis label for semi-supervised learning.

B. ACTIVE LEARNING
Active learning aims to alleviate the need for plenty of labeled data in training a model by selecting the most informative data for labeling [11]. The most commonly used active learning methods are uncertainty-based methods and representativeness-based methods.
The uncertainty-based methods select the samples that are most uncertain for the current model, which speeds up model learning. The representativeness-based methods select data that well represent the overall input patterns of the unlabeled data, reducing the redundancy of the selected data. Although they can achieve good results in many areas, they have some weak points: uncertainty-based methods rely on a reasonably good model and may select redundant samples [24]; representativeness-based methods need more labeled data than uncertainty-based ones for a model to converge. Therefore, methods that combine uncertainty and representativeness usually yield better results [24], [25]. Combining different active learning strategies is popular and valuable, so the combinations of uncertainty, certainty and representativeness (measured by distance) are considered in our work.
Few works have used active learning methods in seizure detection. The work most similar to ours is the cost-sensitive deep active learning method [15]. It integrates uncertainty, misclassification cost, and diversity to construct a utility function for the sample-selection strategy in the labeling process and develops a new generic double-deep neural network (double-DNN) to obtain the utility. The critical difference between our work and [15] is that our work uses both models to select samples, instead of using one model for selecting samples and the other for obtaining the cost-sensitive utility.

C. TRANSFER LEARNING
Transfer learning aims to reduce the need for labeled data in the target domain by applying source domain data and source domain model to the target domain [10].
The most commonly used transfer learning methods in EEG data analysis are domain adaptation, improved CSP algorithms, deep neural network-based (DNN-based) algorithms, and subspace learning, according to [19].
Our work employs DNN-based algorithms. There are many related works. For example, [12] uses the ImageNet dataset to pre-train GoogLeNet, ResNet101, and VGG19 as source domain models, then replaces each source domain model's classifier layer with a task-specific classifier and uses EEG data to fine-tune only the task-specific classifier; [13] does almost the same as [12] but pre-trains different source domain models; [14] uses the ImageNet dataset to pre-train ten different source domain models, then replaces each source domain model's classifier layer with two different kinds of task-specific classifiers, and finally uses EEG data to fine-tune either all the layers of the model or only the task-specific classifier.
These works focus on reusing the feature extractor pre-trained with source domain data in the target model. Meanwhile, they suppose that the labeled target domain data are sufficient for fine-tuning. Unlike them, our work trains a target model from scratch with the assistance of the pre-trained source domain model and supposes that the labeled data are not sufficient and should be accumulated through active learning.

III. METHOD
As shown in Fig. 1, the student model is a personalized model for a patient. The main process for training and using this model is as follows. First, the unlabeled EEG data segments of a specific patient are added to the unlabeled data pool after data processing (re-referencing and splitting); if available, the patient's historically labeled data and student model are loaded from the Database and Modelbase, respectively. Second, the segments from the unlabeled data pool and the key segments (labeled seizure segments of this patient) from the labeled data pool go through both the mentor model and the student model to obtain predictions. Third, a batch of EEG data segments is selected with specific active learning (AL) strategies and sent to a doctor for labeling: α * batch_size segments are selected based on predictions from the mentor model, and (1 − α) * batch_size segments based on predictions from the student model. Fourth, the doctor labels the selected EEG data segments and adds them to the labeled data pool. Fifth, all data in the labeled data pool are used to train the student model, and α is updated according to (8). The second to fifth steps are iterated until the budget runs out, after which the labeled data and the trained student model are saved into the Database and Modelbase, respectively. The pseudocode of the MS4PS method is presented in Algorithm 1.
When a patient comes for the first-time checking, there are no historically labeled data in the Database and no trained student model in the Modelbase for him/her. However, when the patient comes for the second-time or nth-time checking, there are historically labeled data in the Database and a trained student model in the Modelbase for him/her. These historical data and the historical model can be reloaded to further fine-tune the student model and assist the doctor's diagnosis.
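The five-step process described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the callables `mentor_pick`, `student_pick`, `ask_doctor`, `train_student`, and `update_alpha` are hypothetical placeholders, segments are assumed to be represented by hashable identifiers, and the starting value `alpha=0.5` is an assumption, since the text does not state the initial α.

```python
def ms4ps_loop(pool, mentor_pick, student_pick, ask_doctor, train_student,
               update_alpha, budget=10, batch_size=32, alpha=0.5):
    """Sketch of the mentor-student active learning loop for one patient.

    Hypothetical callables (not from the paper):
      mentor_pick(unlabeled, labeled, k)  -> k items chosen with the mentor model
      student_pick(unlabeled, labeled, k) -> k items chosen with the student model
      ask_doctor(items)                   -> list of (item, label) pairs
      train_student(labeled)              -> retrains the student on all labels
      update_alpha(mentor_labeled, student_labeled) -> new alpha in [0, 1]
    """
    unlabeled = list(pool)
    labeled = []
    for _ in range(budget):
        # Split the batch between the two selectors according to alpha.
        n_mentor = round(alpha * batch_size)
        n_student = batch_size - n_mentor
        by_mentor = mentor_pick(unlabeled, labeled, n_mentor)
        rest = [s for s in unlabeled if s not in by_mentor]
        by_student = student_pick(rest, labeled, n_student)
        # The doctor labels both groups; they join the labeled pool.
        mentor_labeled = ask_doctor(by_mentor)
        student_labeled = ask_doctor(by_student)
        labeled.extend(mentor_labeled + student_labeled)
        unlabeled = [s for s in rest if s not in by_student]
        # Retrain the student on everything labeled so far, then update alpha.
        train_student(labeled)
        alpha = update_alpha(mentor_labeled, student_labeled)
    return labeled, alpha
```

Keeping the selectors and the trainer as injected callables means the same loop covers both the first-time checking (empty student model) and later visits (reloaded student model and labeled pool).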
In the following, we describe in detail the mentor and student models, the active learning strategies, knowledge transferring from mentors to students, and how to calculate the parameter α.

Algorithm 1
[...] labeled data pool $D^l_T$ ($T$ = Budget)
1: for $t = 0$ to Budget do
2:   use $f_1$ to select α * batch_size samples from $D^u_t$.
3:   label the selected samples (denote them with $D^s_t$).
[...]
5:   train $S_t$ with $D^l_{t+1}$ to $S_{t+1}$
[...]
8:   calculate α using (8)
9: end for

[...] this institution. It can be any useful model, and in this paper, we use a deep Convolutional Neural Network (CNN) trained with the NEO dataset [21] as the mentor model.

2) STUDENT MODEL
For each patient, a student model should be built to capture his/her personalized characteristics. When a patient comes for the first-time checking, his/her specific student model must be built from scratch or from a general model for all patients. When the patient comes for the second and the nth time, his/her historical student model could be reloaded and further fine-tuned. A student model could be any useful model, and in this paper, we use a deep CNN as the student model.

B. ACTIVE LEARNING STRATEGY
All active learning strategies try to select the most informative samples for labeling. We name these strategies after their measurements of informativeness; those used in this paper are as follows.

1) UNCERTAINTY STRATEGY
Here the uncertainty of each sample x for the current model is measured using its entropy as
$$U(x) = -\sum_{i=0}^{1} p(c_i|x) \log p(c_i|x)$$
where $c_0$ is the normal category, and $c_1$ is the seizure category in the case of seizure detection. The $p(c_i|x)$ can be calculated by softmax, as
$$p(c_i|x) = \frac{e^{z_i}}{e^{z_0} + e^{z_1}}$$
where $z_0$ and $z_1$ are the outputs of the last fully-connected layer of the model in this paper. The uncertainty strategy selects the samples that are the most uncertain, that is, closest to the classification hyperplane of the current model, as shown in Fig. 2 (red dotted circle). The uncertainty measurement can be calculated with either a student model or a mentor model; intuitively, we expect the measurement computed with the student model to be more useful for MS4PS.
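As a small illustration of the uncertainty strategy, the sketch below computes softmax probabilities from two-class logits, scores each sample by its entropy, and picks the highest-entropy samples. The function names and the toy logits are illustrative assumptions; only the use of softmax and entropy comes from the text.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the class axis (numerically stabilized)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def entropy(probs, eps=1e-12):
    """Entropy of each row of class probabilities; eps avoids log(0)."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_most_uncertain(logits, k):
    """Indices of the k samples with the highest prediction entropy."""
    scores = entropy(softmax(logits))
    return np.argsort(-scores)[:k]

# Toy logits (z_0, z_1) for three segments: confident normal,
# nearly undecided, confident seizure.
logits = np.array([[4.0, -4.0], [0.1, -0.1], [-3.0, 3.0]])
picked = select_most_uncertain(logits, 1)
```

Here the second segment, whose two logits are almost equal, is the one selected.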

2) CERTAINTY STRATEGY
In contrast to the uncertainty measurement, the negative entropy is used to measure the certainty of each sample x for the current model, as
$$C(x) = \sum_{i=0}^{1} p(c_i|x) \log p(c_i|x)$$

3) DISTANCE STRATEGY
It is well known that EEG samples of epilepsy are ill-balanced, and seizure samples are more critical than normal samples both for model learning and for doctors analyzing patients' conditions. To select more seizure samples than normal ones, the distance strategy employs the Kullback-Leibler divergence to measure the distance between unlabeled EEG samples and the seizure samples already labeled, and then selects the samples with the minimum distance, as shown in Fig. 2 (blue dotted circle). The distance is the Kullback-Leibler divergence between $p(c_i|x)$ and $q(c_i|DX)$, where $p(c_i|x)$ is the predicted probability of sample x belonging to class $c_i$, and $q(c_i|DX)$ can be calculated as
$$q(c_i|DX) = \frac{1}{|DX|} \sum_{x' \in DX} p(c_i|x')$$
where DX represents all seizure samples in the labeled data pool and |DX| is the number of labeled seizure samples.
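The distance strategy can be sketched as follows. The direction of the divergence, KL(p || q), is an assumption made for this example, because the text names the Kullback-Leibler divergence without fixing the order of its arguments; the function name and toy probabilities are likewise illustrative.

```python
import numpy as np

def kl_distance_to_seizures(probs_unlabeled, probs_seizure, eps=1e-12):
    """KL(p || q) between each unlabeled prediction p and the mean
    prediction q over labeled seizure samples (direction is an assumption)."""
    q = probs_seizure.mean(axis=0)          # q(c_i | DX)
    p = probs_unlabeled                     # p(c_i | x), one row per sample
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1)

# Predictions for two labeled seizure segments and three unlabeled ones.
probs_seizure = np.array([[0.2, 0.8], [0.1, 0.9]])
probs_unlabeled = np.array([[0.9, 0.1], [0.15, 0.85], [0.5, 0.5]])

dist = kl_distance_to_seizures(probs_unlabeled, probs_seizure)
closest = np.argsort(dist)[:1]              # minimum-distance sample
```

The second unlabeled segment has almost the same prediction as the seizure average, so its distance is close to zero and it is selected first.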

4) MIXTURE OF UNCERTAINTY/CERTAINTY STRATEGY AND DISTANCE STRATEGY
The uncertainty/certainty measurement and the distance measurement can be mixed into a single, better measurement controlled by an empirical parameter γ ∈ [0, 1], which we set to 0.5 in this paper.
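One plausible way to mix the two measurements is a weighted difference, in which a high (un)certainty raises the score and a large distance to the labeled seizure samples lowers it. The exact mixing formula is not given in this text, so the form below is an assumption for illustration only, as are the function name and the toy values.

```python
import numpy as np

def mixed_score(informativeness, distance, gamma=0.5):
    """Hypothetical mixture: gamma weights the uncertainty (or certainty)
    term, and (1 - gamma) penalizes distance to labeled seizure samples."""
    return gamma * informativeness - (1.0 - gamma) * distance

# Toy values for three unlabeled segments.
uncertainty = np.array([0.69, 0.10, 0.60])
distance = np.array([1.20, 0.05, 0.10])

mud = mixed_score(uncertainty, distance)    # uncertainty mixed with distance
best = int(np.argmax(mud))
```

With γ = 0.5, the third segment wins: it is nearly as uncertain as the first but much closer to the known seizure samples.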

C. KNOWLEDGE TRANSFERRING AND PARAMETER α
Based on the mentor-student architecture, we propose a different way of knowledge transferring named mentor-select-for-student, which exploits a mentor model's knowledge by using the mentor model to select data for the student model's learning and trains a student model from scratch instead of a model transferred from a mentor model. Transferring knowledge through mentor-select-for-student (not the patients' data or model parameters/structures) makes it possible to protect the patients' privacy and avoid the negative influence of transferring parameters/structures. Institutions that own superior knowledge can safely distribute their models to help others.
Intuitively, the mentor model should select the most certain samples for training a student model, just like a mentor teaches students with his/her most certain knowledge, and the student model should select the most uncertain samples for training itself, just like a student studies the knowledge that confuses him/her for capability improvement.
In our method, a student model learns from the data chosen by a mentor model and by itself, under the supervision of a doctor. The influence of the mentor model decreases as the performance of the student model improves. It is controlled by the parameter α, computed according to (8), where N(seizure, M, t) is the number of seizure samples selected by the mentor model at budget t, N(seizure, S, t) is the number of seizure samples selected by the student model at budget t, N(seizure) = N(seizure, M, t) + N(seizure, S, t), and β is an empirical parameter that handles the case N(seizure) = 0.
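To make the role of α concrete, the sketch below uses a smoothed ratio of the seizure samples found by the mentor to those found by both models. This is only an assumed form consistent with the description (α follows the mentor's share of seizure finds, and β keeps the value defined when no seizure sample was found); it is not claimed to be the paper's equation (8), and the function name is hypothetical.

```python
def mentor_share(n_seizure_mentor, n_seizure_student, beta=1.0):
    """Hypothetical alpha update: smoothed fraction of seizure samples
    that were selected by the mentor model in the current round."""
    total = n_seizure_mentor + n_seizure_student
    return (n_seizure_mentor + beta) / (total + 2.0 * beta)

alpha_no_seizures = mentor_share(0, 0)   # falls back to an even split
alpha_mentor_led = mentor_share(4, 0)    # mentor found all seizure samples
alpha_student_led = mentor_share(0, 4)   # student found all seizure samples
```

Under this assumed form, α shrinks as the student model finds a growing share of the seizure samples, which matches the stated intent that the mentor's influence fades as the student improves.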

IV. EXPERIMENTS
To verify the performance of MS4PS, we design experiments to simulate two kinds of scenarios: 1) patients come for the first-time checking, in which their personalized models must be trained from scratch. 2) patients come for the second-time checking, in which historical models and labeled data could be used for them.
In each experiment, we try different settings to compare and find the best one. Each setting is named with a three-part pattern, "selection model(training method)-AL strategy". For simplicity and clarity, some abbreviations are used, and all of them are shown in Table 1. For example, M-MCD&S(TAL)-MUD denotes the following setting: both a mentor model and a student model are used for selecting samples, with the MCD AL strategy for the mentor model and the MUD AL strategy for the student model, and all layers of the student model are trained; S_M(TF)-R denotes the setting in which the student model, initialized with the parameters of the mentor model, is used for sample selection with the R AL strategy, and only its fully-connected layers are trained.
The CHB-MIT [20] and NEO [21] EEG datasets are used in this paper. Both comply with the international 10-20 system of EEG electrode positions and are sampled at 256 Hz. The NEO dataset is labeled by three experts; therefore, there are disagreements. The CHB-MIT dataset has definite labels. In CHB-MIT, there are 24 folders for 23 patients (chb01 and chb21 are from the same patient; here we regard them as different patients), and each folder contains many .edf files. In NEO, there is only one .edf file for each patient. In this paper, [...] Fig. 3. Secondly, the NEO dataset is relabeled with the following rule: for each duration of EEG data, if two or three experts annotated it as seizure, it is annotated with the seizure-category label; otherwise, it is annotated with the normal-category label. Then, each .edf file of CHB-MIT and NEO is split into seizure durations and normal durations. Finally, each duration of EEG data is split into 10-second segments. Seizure durations are split with an overlap of 8 seconds and normal durations with no overlap. The statistics of the segments are shown in Table 2 and Table 3.
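The relabeling rule and the segmentation step can be sketched as below. The 256 Hz rate, the 10-second window, the 8-second overlap for seizure durations, and the two-of-three voting rule come from the text; the function names, the array layout (channels × samples), and the choice to drop a trailing partial window are assumptions of this example.

```python
import numpy as np

FS = 256                 # sampling rate in Hz
WINDOW = 10 * FS         # 10-second segment length in samples

def split_duration(signal, is_seizure):
    """Cut a (channels, samples) duration into 10-second segments.

    Seizure durations use an 8-second overlap (2-second stride); normal
    durations use no overlap. A trailing partial window is dropped
    (an assumption of this sketch)."""
    step = 2 * FS if is_seizure else WINDOW
    starts = range(0, signal.shape[1] - WINDOW + 1, step)
    return [signal[:, s:s + WINDOW] for s in starts]

def majority_label(expert_votes):
    """expert_votes: (3, n) array of 0/1 annotations, one row per expert.
    A duration is labeled seizure (1) if at least two experts marked it."""
    return (np.asarray(expert_votes).sum(axis=0) >= 2).astype(int)

duration = np.zeros((18, 30 * FS))              # 30 seconds, 18 channels
seizure_segments = split_duration(duration, True)
normal_segments = split_duration(duration, False)
labels = majority_label([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
```

For a 30-second duration, the overlapping split yields 11 seizure segments while the non-overlapping split yields 3, which illustrates how the overlap enlarges the scarce seizure class.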

B. STUDENT MODEL AND MENTOR MODEL
The same model structure (Fig.5) is used for both the student model and the mentor model.

1) MODEL INPUT
In this paper, the raw data format is chosen as the input of the model, although there are other commonly used formats such as frequency features [26] and spectrograms [27]. For each EEG segment, the 18 channels of 1-D time-domain signals are stacked in a specific order to form a 2-D matrix (Fig. 4) that contains information from both the time and spatial domains. As the input of the model used in this paper, the matrix has a shape of 18 × 2560.

2) MODEL STRUCTURE
To effectively extract the features in a 2-D input format, a deep CNN [28] is used. It consists of 5 convolutional blocks, two fully-connected blocks, and a softmax block, as shown in Fig.5. The details are provided in the Appendix.

C. HOW TO GET MENTOR MODEL
A deep CNN, as shown in Fig. 5, is trained with the NEO [21] dataset to serve as the mentor model. All seizure segments, together with randomly sampled normal segments numbering twice as many as the seizure segments, are used for training. The model is trained until its performance no longer improves. The Adam algorithm is chosen as the optimization algorithm, the learning rate is set to 0.0001, and the cross-entropy function is selected as the loss function.
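The composition of the mentor's training set (all seizure segments plus twice as many randomly sampled normal segments) can be sketched as follows. The function name, the label encoding (1 for seizure, 0 for normal), and the fixed seed are assumptions made for this example.

```python
import numpy as np

def sample_training_indices(labels, ratio=2, seed=0):
    """Return indices of all seizure segments plus `ratio` times as many
    randomly chosen normal segments (capped at the number available)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    seizure_idx = np.flatnonzero(labels == 1)
    normal_idx = np.flatnonzero(labels == 0)
    n_normal = min(ratio * len(seizure_idx), len(normal_idx))
    chosen_normal = rng.choice(normal_idx, size=n_normal, replace=False)
    return np.concatenate([seizure_idx, chosen_normal])

toy_labels = np.array([1] * 5 + [0] * 50)
train_idx = sample_training_indices(toy_labels)
```

Undersampling the normal class this way keeps the mentor's training data at a fixed 1:2 ratio regardless of how imbalanced the raw recordings are.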

D. EVALUATION CRITERIA
Two different performance metrics that cover different demands are considered here.

1) F1-SCORE
If a relatively adequate test dataset can be obtained for each patient, as for the selected patients from CHB-MIT in this paper, the F1-score is a good metric for evaluating the performance of a student model. The F1-score is a commonly used metric that balances Recall and Precision as their harmonic mean, and is defined as
$$F1 = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
where Recall = TP/(TP+FN) and Precision = TP/(TP+FP), TP (true positive) is the number of segments correctly detected as the seizure class, FN (false negative) is the number of segments incorrectly detected as the normal class, TN (true negative) is the number of segments correctly detected as the normal class, and FP (false positive) is the number of segments incorrectly detected as the seizure class.
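A direct computation of the F1-score from segment counts is shown below. Returning 0 when no seizure segment is correctly detected is a convention assumed for this sketch, chosen so that the metric stays defined when TP = 0.

```python
def f1_score(tp, fp, fn):
    """F1-score from true positives, false positives, and false negatives.
    Returns 0.0 when tp == 0 (a convention assumed in this sketch)."""
    if tp == 0:
        return 0.0
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * precision * recall / (precision + recall)

# 8 seizure segments found, 2 false alarms, 4 seizure segments missed.
example_f1 = f1_score(tp=8, fp=2, fn=4)
```

Note that TN does not enter the formula, which is why the F1-score is suitable for data in which normal segments vastly outnumber seizure segments.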

2) #seizure/B
In practice, when training a patient-specific model for a patient who comes for checking, it is hard to obtain a test dataset on which to calculate the F1-score. In this case, the number of seizure samples detected within a given Budget B (#seizure/B) could be a good metric for evaluating the performance of a student model. It is available during the doctor's diagnosis process, and it is intelligible and friendly to the doctor. It could be defined as
$$\#\text{seizure}/B = \sum_{k=1}^{B} (\#\text{seizure})_k$$
where $(\#\text{seizure})_k$ is the number of seizure segments detected in the kth AL loop. In this paper, B is set to 10 for the experiments of the first-time checking and to 3 for the experiments of the second-time checking.

3) AVERAGE METRICS
To alleviate the performance bias caused by random factors, the average metrics Avg_F1 and Avg_seizures are employed, as given in (11) and (12), where T is the number of times each experiment is repeated (T = 5), P is the number of patients (P = 23 or P = 1), and B is the Budget (B = 10 for the first-time checking and B = 3 for the second-time checking).
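The averaging can be illustrated with arrays indexed by repetition, patient, and AL loop. The exact normalization of the two averages is not reproduced in this text, so the forms below are assumptions: Avg_F1 is taken as the mean F1-score over all T × P × B entries, and Avg_seizures as the mean over repetitions and patients of the per-run total of selected seizure segments.

```python
import numpy as np

def avg_f1(f1):
    """Mean F1-score over repetitions, patients, and budgets.
    f1 has shape (T, P, B). The normalization is an assumption."""
    return float(np.mean(f1))

def avg_seizures(seizures_per_loop):
    """Mean over repetitions and patients of the number of seizure
    segments selected within the budget. Input shape is (T, P, B),
    holding the count selected in each AL loop."""
    per_run = seizures_per_loop.sum(axis=2)     # total per run and patient
    return float(per_run.mean())

T, P, B = 5, 23, 10
toy_f1 = np.full((T, P, B), 0.25)
toy_counts = np.ones((T, P, B))                 # one seizure segment per loop

toy_avg_f1 = avg_f1(toy_f1)
toy_avg_seizures = avg_seizures(toy_counts)
```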

E. THE FIRST TIME CHECKING
In this section, experiments are assumed to be in such a scenario: patients come for the first-time checking, in which their personalized models must be trained from scratch.

1) TARGET
The experiments are designed to answer the following questions:
1) Compared with training a student model from scratch, could the common knowledge transferring method, copying and refining the parameters of a pre-trained model, bring improvement for patient-specific seizure detection?
2) Of the many different AL methods composed by coupling different models (such as S, S_M, M, and their combinations) with elementary sample selection strategies (as described in Section III.B), which is the best for MS4PS?
3) Does the method of transferring knowledge through mentor-select-for-student work well in training good patient-specific seizure detectors?
4) Is the distance strategy (proposed in Section III.B) helpful in MS4PS?

2) DATA PREPARING
The CHB-MIT dataset is used for training the student model. Because chb16 has too few seizure segments (see Table 2), it is excluded, leaving 23 patients' data for the experiments. As shown in Fig. 6, for each patient, all segments are used. In each experiment, the segments of each patient are shuffled and divided into a training dataset and a testing dataset containing 80% and 20% of the total, respectively.
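The per-patient shuffle and 80%/20% split can be sketched as below. The function name, the use of index arrays, and the seed argument (which would be varied across the repeated experiments) are assumptions of this example.

```python
import numpy as np

def shuffle_split(n_segments, train_fraction=0.8, seed=0):
    """Shuffle segment indices and split them into training and testing
    index arrays holding train_fraction and the remainder, respectively."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_segments)
    n_train = int(round(train_fraction * n_segments))
    return order[:n_train], order[n_train:]

train_part, test_part = shuffle_split(100)
```

Because segments from seizure durations overlap by 8 seconds, a random split of this kind can place overlapping segments on both sides of the split; this is a property of the protocol as described, not something the sketch changes.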

3) STUDENT MODELS
There are two ways to obtain an initial student model: using an empty model (denoted as S) and transferring from another pre-trained model such as the mentor model (denoted as S_M). For the empty model S, we train all its layers (denoted as S(TAL)). For S_M, we fine-tune all its layers (denoted as S_M(TAL)) or fine-tune only the fully-connected layers (denoted as S_M(TF)).

5) OTHER SETTINGS
The experiments are repeated 5 times to alleviate the influence of random factors. In each experiment, the student model is randomly initialized, Budget is set to 10, batch_size is set to 32, training times (epoch) is set to 10, and the learning rate is set to 0.0001. The Adam algorithm is chosen as the optimization algorithm and cross-entropy as the loss function.

F. THE SECOND TIME CHECKING
In this section, experiments are assumed to be in such a scenario: patients come for the second-time checking, in which historical models and labeled data could be used for them.

1) TARGET
The experiments are designed to answer the following questions:
1) Does fine-tuning the historical model work better than retraining a new model from scratch for MS4PS?
2) Of the two methods, reusing the historically labeled segments and not using these segments, which is better for MS4PS?
3) Could the student model be sustainably improved through MS4PS as a patient comes more times and more of his/her data are accumulated?

2) DATA PREPARING
The CHB-MIT dataset of 23 patients (except chb16) is used here. For each patient, only the EEG data containing seizure are used, and they are chronologically divided into two groups (see Fig.7): the first one is used for simulating the first-time checking (with the same setting as E. THE FIRST TIME CHECKING) to get a historical student model and labeled data; and the second for simulating the second-time checking. In each experiment, segments of the second group are shuffled and divided into a training dataset and a testing dataset with 80% and 20% of the total, respectively.

3) OTHER SETTINGS
The experiments are repeated 5 times to alleviate the influence of random factors. In each experiment, the student model is randomly initialized, Budget is set to 3, batch_size is set to 32, training times (epoch) is set to 7, and the learning rate is set to 0.0001. The Adam algorithm is chosen as the optimization algorithm, and cross-entropy as the loss function.

V. RESULTS AND ANALYSIS
In this section, we analyze the results of the experiments and answer the questions raised in Section IV.
A. THE FIRST TIME CHECKING
1) Avg_F1
For each patient, the Avg_F1 of different AL methods is shown in Table 10-Table 15 of the Appendix, and the F1-score curves are shown in Fig. 9-Fig. 14 of the Appendix. The Avg_F1 is 0 for patients 4, 6, and 21 in almost every experiment. The main reason is that it is too hard to select seizure segments of these patients for training the model due to the severe imbalance between seizure segments and normal segments, as shown in Table 2. Table 4 shows the Avg_F1 (P=23, T=5, B=10), and Fig. 8 shows the Avg_F1 (P=23, T=5) curves of all 39 methods in Table 4. When coupled with the same AL strategy, S_M(TF) performs worse than S_M(TAL). This implies that the difference between the NEO dataset and the CHB-MIT dataset is so significant that the feature extractor (the convolutional layers in this paper) trained with segments of the NEO dataset cannot be used directly for segments of the CHB-MIT dataset. Meanwhile, when coupled with most AL strategies, the empty model S works better than S_M. Model S achieves its top Avg_F1 of 0.241 with S(TAL)-U, and S_M achieves its top Avg_F1 of 0.218 with S_M(TAL)-C. The above results show that transferring knowledge through the parameters of the pre-trained model (denoted as S_M) does not bring improvement for patient-specific seizure detection models in this paper. This finding is consistent with that of [18].
Of all the cases, the top 5 Avg_F1 performances respectively. These top 5 settings all use our new knowledge transferring method, mentor-select-for-student, which indicates that the proposed knowledge transferring method of MS4PS works well.
Coupling model S with U/MUD performs better than coupling it with other AL strategies, and coupling model M with C/MCD performs better than coupling it with other AL strategies. This is consistent with our intuition (see III. Method).
In addition, the distance strategy cannot be said to help all models, and a patient-by-patient analysis of the results is difficult and uninformative. Nevertheless, M-MCD&S(TAL)-MUD and M-MCD&S(TAL)-MCD outperform M-C&S(TAL)-U and M-C&S(TAL)-C by 0.012 and 0.009 on Avg_F1, respectively, suggesting that the distance strategy does help in these cases.
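The Avg_F1 values discussed above are averages over patients (P), repetitions (T), and budget steps (B). The sketch below illustrates one way such an aggregation could be computed. The function names and the nested-list layout are assumptions made for illustration, and the convention that F1 is 0 when no seizure segment is correctly detected is likewise an assumption, although it matches the zero scores reported for patients 4, 6, and 21.

```python
from statistics import mean
from typing import Dict, List


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for the seizure class; taken as 0.0 when there is no true positive."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# f1[patient][repeat][budget_step] -> F1 of the student model at that step.
F1Table = Dict[str, List[List[float]]]


def avg_f1_per_patient(f1: F1Table) -> Dict[str, float]:
    """Average over repetitions (T) and budget steps (B) for each patient."""
    return {p: mean(mean(run) for run in runs) for p, runs in f1.items()}


def avg_f1_overall(f1: F1Table) -> float:
    """Average over patients (P), repetitions (T), and budget steps (B)."""
    return mean(avg_f1_per_patient(f1).values())


# Toy data: two hypothetical patients, 2 repetitions, 2 budget steps each.
toy = {
    "p1": [[0.0, 0.5], [0.5, 1.0]],
    "p2": [[0.0, 0.0], [0.0, 0.0]],  # no seizure segment ever detected
}
print(avg_f1_per_patient(toy))
print(avg_f1_overall(toy))
```

Taking the mean of per-patient means equals the overall mean only when every patient has the same number of repetitions and budget steps, as in the setting described here. The toy example also shows how a patient with all-zero F1 lowers the overall average.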

2) Avg_seizures
Table5 shows the Avg_seizures of the different AL methods. M-MCD&S(TAL)-MUD selects the most seizure segments, 45. If we regard the AL methods with R as a basic-level doctor who tries to find seizure segments, and the other AL methods as higher-level doctors, then Table6 shows that M-MCD&S(TAL)-MUD raises the efficiency of diagnosis to 7.5 times that of a basic-level doctor.
These results for the first-time checking show that M-MCD&S(TAL)-MUD is the best method in this paper, on both Avg_F1 and Avg_seizures. Furthermore, because M-MCD&S(TAL)-MUD selects more seizure segments than M-C&S(TAL)-U, the results support the usefulness of the distance strategy.

B. THE SECOND TIME CHECKING
Note that only the top 1 AL method of the first-time checking is employed in the second-time checking.

1) Avg_F1
Table7 shows the Avg_F1 of the second-time checking. The reloaded student model performs better than the empty student model under the same conditions. When the segments of the first-time checking are used to assist the second-time checking, the performance improves by 0.059 and 0.262 for the reloaded student model and the empty student model, respectively.

Table7. Avg_F1 of the second-time checking. The field name "segment" means using historical segments, "non-segment" means not using historical segments. "Reloaded" means using the reloaded student model. "Empty" means using an empty student model.
resource for training a model online. In that case, a slight performance degradation occurs (in Table8, 20 instead of 24 seizure segments are selected).
Meanwhile, using the historically labeled segments of the first-time checking always yields at least as many seizure segments as not using them.
These results for the second-time checking show that the reloaded student model is better than the empty student model, and that reusing historically labeled segments helps improve performance. This indicates that the patient-specific models are improved continually through MS4PS as patients return and their data accumulate.

Table8. Avg_seizures of the second-time checking. The field name "segment" means using historical segments, "non-segment" means not using historical segments. "Reloaded" means using the reloaded student model. "Empty" means using an empty student model. M-MCD&S(Fixed)-MUD means that the student model is not trained.

C. FURTHER DISCUSSION
Given the complexity of the proposed method, we further discuss how and why MS4PS achieves its gains, and support the reasoning with the results shown in Table4-Table5 and Fig.8. We believe several factors contribute to these gains.
1) The knowledge transferring method named mentor-select-for-student. Transferring knowledge through mentor-select-for-student makes it possible to avoid the negative transfer [19] caused by transferring parameters/structures. In our experiments (see Table4), negative transfer appears as S M (TAL)-R and S M (TF)-R working worse than S(TAL)-R, implying that the transferred model S M is of no help and even hinders the learning process in the patient-specific case.
2) The AL method using both a mentor model and a student model to select samples. As shown in Table4 and Fig.8, the top 5 of the 39 methods all use both a mentor model and a student model, implying that using two models has more potential for performance improvement than using only one. In the top 2 MS4PS methods, M-MCD&S(TAL)-MUD and M-C&S(TAL)-U, the mentor model is coupled with certainty-based AL methods, and the student model is coupled with uncertainty-based AL methods. This is consistent with our intuition that the mentor model should select the most certain samples for training the student model, just as a mentor teaches students with his/her most certain knowledge, and that the student model should select the most uncertain samples for training itself, just as a student studies the knowledge that confuses him/her in order to improve.
3) The distance strategy for handling imbalanced EEG samples. The EEG samples of epilepsy are imbalanced, and seizure samples are more critical than normal samples, both for model learning and for doctors analyzing patients' conditions. To select more seizure samples than normal ones, we introduce the distance strategy. As shown in Table4 and Table5, M-MCD&S(TAL)-MUD gets a higher Avg_F1 and selects more seizure samples than M-C&S(TAL)-U, which supports the usefulness of the distance strategy.
As shown in Fig.8, the top 5 of the 39 methods in Table4 are all MS4PS methods. The more of these factors an MS4PS method includes, the better it performs. Combining all the above factors yields the best MS4PS method, M-MCD&S(TAL)-MUD.
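To make the interplay of the three factors above more concrete, the following sketch scores unlabeled segments by certainty (for the mentor) or uncertainty (for the student) and discounts the score by the distance to the nearest labeled seizure sample. It is only an illustration under stated assumptions. The scoring formulas, the way the distance term is combined with the base score, and all function names are hypothetical and are not the exact C/U/MCD/MUD definitions of this paper.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def uncertainty(p_seizure: float) -> float:
    """1.0 at p = 0.5 (most uncertain), 0.0 at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p_seizure - 1.0)


def certainty(p_seizure: float) -> float:
    """Complement of uncertainty: 1.0 at p = 0 or p = 1, 0.0 at p = 0.5."""
    return abs(2.0 * p_seizure - 1.0)


def min_distance(x: Vector, labeled_seizures: List[Vector]) -> float:
    """Euclidean distance from x to the nearest labeled seizure sample."""
    if not labeled_seizures:
        return 0.0  # no seizure labeled yet, so the distance term has no effect
    return min(math.dist(x, s) for s in labeled_seizures)


def select(probs: List[float], feats: List[Vector],
           labeled_seizures: List[Vector],
           base_score: Callable[[float], float], k: int = 1) -> List[int]:
    """Indices of the k samples with the highest distance-discounted score."""
    scores = [
        base_score(p) / (1.0 + min_distance(x, labeled_seizures))
        for p, x in zip(probs, feats)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy data: three unlabeled segments and one labeled seizure sample.
feats = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
seizures = [(1.0, 0.0)]
mentor_probs = [0.95, 0.90, 0.99]   # hypothetical mentor outputs
student_probs = [0.50, 0.45, 0.55]  # hypothetical student outputs

print(select(mentor_probs, feats, seizures, certainty))     # mentor, with distance
print(select(mentor_probs, feats, [], certainty))           # mentor, certainty only
print(select(student_probs, feats, seizures, uncertainty))  # student, with distance
print(select(student_probs, feats, [], uncertainty))        # student, uncertainty only
```

In this toy setting the distance term shifts both models toward the segment lying closest to the known seizure sample, which is the intended effect of a distance strategy on imbalanced data.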

VI. CONCLUSION
The main obstacles to building good seizure detection models are privacy protection, high labeling costs, and the varying characteristics of seizures among patients and at different times.
In this paper, we have proposed a novel mentor-student architecture for patient-specific seizure detection. It contains a new way of knowledge transferring named mentor-select-for-student, which exploits the mentor model's knowledge by using the mentor model to select data for the student model's learning, making it possible to protect the data of patients and avoid the negative influence of transferring parameters/structures of pre-trained models. It also contains a new way of active learning, which uses both an experienced mentor model and a quick-learning student model to select samples for doctors to label. Each of the two models is coupled with a particular sample selection strategy that combines uncertainty/certainty and the distance between unlabeled samples and labeled seizure samples.
The proposed method could quickly train a suitable detector for a patient at his/her first epilepsy diagnosis, with the help of: (1) an experienced mentor model that chooses the most category-certain EEG data segments; (2) a student model (detector itself) that chooses the most category-uncertain EEG data segments; (3) the doctors who label these data segments selected by both the mentor model and student model. By replacing or improving the mentor model and refining the historical models of patients when they come next time, the MS4PS system could be sustainably promoted.
Besides its potential advantages in protecting the privacy of patients, achieving good performance for patient-specific seizure detection, and sustainably improving its capability, MS4PS has another merit: it could be easily generalized to other object-specific problems that face obstacles similar to those of patient-specific seizure detection, for example, patient-specific depression detection.
This paper still has some limitations: (1) the labeled seizure segments are too few for some patients in CHB-MIT, which causes zero performance for them in almost every experiment and drags down the overall performance of MS4PS; (2) the primary attention is paid to verifying the feasibility of MS4PS rather than improving seizure detection performance. Further performance improvement could be made by carefully designing the deep neural networks and feature inputs for seizure detection, which will be part of our future work.

APPENDIX
In this section, we explain why the parameters Budget and β are set to the values used in this paper, and give further details of the DNN model and the results.

A. BUDGET
We set the Budget to a small number, 10, in THE FIRST TIME CHECKING. There are two main reasons for this: (1) most patients in CHB-MIT have only a few seizure segments (see Table2); (2) a good student model (seizure detector) should pick seizure samples out of the EEG data as fast as possible. We set the Budget to 3 in THE SECOND TIME CHECKING, mainly because only a small number of segments are left for simulating the second-time checking after some have been used for simulating the first-time checking (see Fig.7).

B. β
In our work, we set β to 0.5. This is because, when neither the mentor model nor the student model can select seizure segments, both of them should have an equal opportunity to make a tentative choice.
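One possible reading of this equal-opportunity argument is that β acts as the probability of letting the mentor model, rather than the student model, make the tentative choice. The sketch below simulates that reading. This interpretation of β and the function name `pick_selector` are assumptions made for illustration; the exact role of β is defined in the method section of the paper.

```python
import random


def pick_selector(beta: float, rng: random.Random) -> str:
    """Return 'mentor' with probability beta, otherwise 'student'."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return "mentor" if rng.random() < beta else "student"


rng = random.Random(0)  # fixed seed so that the run is repeatable
draws = [pick_selector(0.5, rng) for _ in range(10_000)]
mentor_share = draws.count("mentor") / len(draws)
print(f"mentor share: {mentor_share:.3f}")
```

With β = 0.5 the two models are chosen about equally often. Under this reading, a value closer to 1 would favor the mentor and a value closer to 0 would favor the student.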

C. DETAILS OF DNN
The details of the deep neural network are shown in Table9.

D. DETAILS OF RESULTS
Table4 shows the Avg_F1 averaged over 23 patients, 5 repetitions, and 10 budgets for the 39 different AL methods. Here, the 39 AL methods are divided into 6 subgroups. Table10-Table15 show the Avg_F1 of each patient, averaged over 5 repetitions and 10 budgets, and Fig.9-Fig.14 show the F1-score curves along Budget for each patient, averaged over 5 repetitions.