Cluster Embedding Joint-Probability-Discrepancy Transfer for Cross-Subject Seizure Detection

Transfer learning (TL) has been applied in seizure detection to deal with differences between subjects or tasks. In this paper, we consider cross-subject seizure detection that does not rely on patient history records; that is, knowledge is acquired from other subjects through TL to improve seizure detection performance. We propose a novel domain adaptation method, named the Cluster Embedding Joint-Probability-Discrepancy Transfer (CEJT), for data distribution structure learning. Specifically, 1) the joint probability distribution discrepancy is minimized to reduce the distribution shift between the source and target domains and to strengthen the discriminative knowledge of classes; 2) clustering is performed on the target domain, and the class centroids of the source domain are used as the clustering prototypes of the target domain to enhance the data structure. Manifold regularization is used to improve the quality of the clustering prototypes. In addition, a correlation-alignment-based source selection metric (SSC) is designed to select the most favorable subjects, reducing the computational cost and avoiding some negative transfer. Experiments on 15 patients with focal epilepsy from the Children's Hospital, Zhejiang University School of Medicine (CHZU) database show that CEJT outperforms several state-of-the-art approaches and can promote the application of seizure detection.

Seizures are among the most important clinical manifestations of epilepsy. Patients exhibit unusual behaviors and sensations during seizures and sometimes lose consciousness. In recent years, driven by data, researchers have begun to build epileptic seizure detection models through machine learning, correlation analysis, and time-frequency analysis [1]. Automatically identifying seizures in electroencephalography (EEG) recordings can provide neurologists with an objective reference for epilepsy diagnosis, treatment and evaluation [2], [3], [4], [5].
Most existing EEG-based seizure detection methods focus on patient-dependent scenarios, in which the training and testing data originate from the same patient or the data collected from all patients are mixed for model training and testing [6], [7], [8]. Such patient-dependent methods have been studied extensively, but they rely strongly on patient history records. Their high accuracy can be attributed to a basic assumption that training and testing data follow the same distribution. However, in real scenarios, the onset and propagation of abnormal electrical activity in the brain have been shown to differ [9]. Moreover, EEG is greatly affected by age and individual differences; especially in children, the frequency, amplitude and rhythm of EEG background activity change significantly with age [10]. Faced with the more diverse data distributions of different subjects, patient-dependent seizure detection methods become insufficient for new patients.
Due to the significant individual differences in EEG signals, the training data and the actual testing data do not satisfy the assumption of being independent and identically distributed. How to establish a cross-subject seizure detection model that can overcome individual differences is a long-standing issue. Domain adaptation is a natural choice: it offers the possibility of generalizing a classifier learned from well-labeled source domains to an unlabeled target domain, where observations from the source and target domains are often drawn from different distributions. In cross-subject seizure detection, the domain consisting of multiple subjects with sufficient labeled data is called the source domain, and the domain consisting of subjects with unlabeled data is called the target domain. In this paper, we propose a domain adaptation-based learning framework to develop a robust cross-subject seizure detection algorithm, which can eliminate the influence of distribution differences between patients. There are two key contributions in the paper: 1) We design a simple but effective evaluation metric for source-domain transferability, the correlation-alignment-based source selection (SSC), to select the most favorable subjects in multi-source transfer learning.
2) We propose a new domain adaptation algorithm, the Cluster Embedding Joint-Probability-Discrepancy Transfer (CEJT), which unifies the cluster learning and joint probability distribution discrepancy. Minimizing the joint probability distribution discrepancy can reduce the difference between domains and strengthen the discriminative knowledge of categories, and the clustering learning can deeply explore the data distribution structure of the target domain. In this study, we validate the proposed cross-subject seizure detection model consisting of the transferability evaluation metric SSC and domain adaptation algorithm CEJT on the dataset collected from the Children's Hospital, Zhejiang University School of Medicine (CHZU).
The remainder of this paper is organized as follows: Section II introduces related work on domain adaptation and transfer learning based seizure detection. Section III describes the details of the seizure detection framework composed of CEJT and SSC. Section IV presents the experimental studies to compare the performance of CEJT with several state-of-the-art (SOTA) domain adaptation methods and to verify the effectiveness of the proposed framework. Finally, Section V draws the conclusions.

A. Domain Adaptation
The most commonly used domain adaptation approaches include instance-based adaptation and feature representation adaptation [11]. It is generally believed that distribution differences can be compensated either by instance-based adaptation, such as weighting the source-domain samples to better match the target-domain distribution, or by feature transformation-based methods that project the features of the two domains into another subspace with a small distribution shift.
Feature-based approaches seek a unified transformation, or a separate transformation for each domain, that projects data from the two domains into a domain-invariant space to reduce distribution differences between domains while preserving data properties in the original space. Such methods often rely on a distance metric, the maximum mean discrepancy (MMD), which measures the distance between two distributions in a reproducing kernel Hilbert space (RKHS). Pan et al. [12] propose transfer component analysis (TCA), which uses MMD to learn transfer components across domains in an RKHS. TCA assumes that there is a feature map such that the marginal distributions of the two domains are close after the mapping. Joint distribution adaptation (JDA) [13] addresses the limitation that TCA only considers the marginal distribution shift by also taking the conditional distribution shift into account through pseudo-labels of the target domain. Adaptation regularization based transfer learning (ARTL) [14] builds a domain-invariant classifier by introducing the structural risk loss. Domain-invariant classifiers tend to perform better than feature transformations alone. Manifold embedded distribution alignment (MEDA) [15] is the first to quantitatively evaluate the importance of the marginal and conditional distributions when performing distribution alignment. Joint geometrical and statistical alignment (JGSA) [16] relaxes the strong assumption that the source and target domains must be transformed uniformly, learning two coupled projections while reducing the geometric and distributional shifts.
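As a minimal illustration of the MMD idea, the sketch below assumes a linear kernel, in which case the squared empirical MMD reduces to the squared distance between the two sample means. The function names and the toy feature vectors are hypothetical; the methods cited above generally use richer kernels and learned projections.

```python
def mean_vector(samples):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def linear_mmd_squared(source, target):
    """Squared MMD under a linear kernel: ||mean(source) - mean(target)||^2."""
    mu_s = mean_vector(source)
    mu_t = mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Toy (hypothetical) two-dimensional features for a source and a target domain.
source = [[0.0, 1.0], [2.0, 3.0]]
target = [[1.0, 1.0], [3.0, 5.0]]
mmd2 = linear_mmd_squared(source, target)
```

A value of zero only means that the domain means coincide; higher-order differences between the distributions are invisible to a linear kernel, which is one reason kernelized variants are preferred in practice.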
Instance-based adaptation is often not considered separately, but is usually combined with feature matching to achieve domain adaptation. Long et al. [17] state that there are some source-domain samples unrelated to the target domain in feature matching, and propose transfer joint matching (TJM), which introduces an $\ell_{2,1}$-norm regularization term to achieve instance weighting. Locality preserving joint transfer (LPJT) [18] explicitly weights samples from the source and target domains, and reduces the influence of outliers through landmark selection.
Deep domain adaptation utilizes deep networks to enhance domain adaptation performance, where discrepancy-based methods have been extensively studied. The deep domain confusion network (DDC) by Tzeng et al. [19] adds an adaptation layer with an MMD metric to the convolutional network, and the domain discrepancy loss of the adaptation layer is used to improve the original objective function. Rather than using a single layer and linear MMD, the Deep Adaptation Network (DAN) [20] measures domain discrepancy by considering all task-specific layers and designs an optimal multi-kernel selection strategy to improve the effectiveness of embedding matching. The joint adaptation network (JAN) [21] aligns the joint distribution of features and labels in multiple domain-specific layers based on joint MMD. CORrelation ALignment (CORAL), which learns a linear transformation to align second-order statistics between domains, has been extended to deep networks [22]. Adversarial-based methods encourage domain confusion through adversarial objectives, resulting in domain-invariant representations. The domain-adversarial neural network (DANN) adds an adversarial mechanism to the deep transfer network. Yu et al. [23] prove that the adversarial network also suffers from probability distribution mismatch, and propose a dynamic adversarial adaptation network (DAAN) to dynamically learn domain-invariant representations.
In summary, among shallow methods, most feature matching algorithms consider a linear combination of the aligned marginal and conditional distributions, which is not equivalent to the joint distribution. Meanwhile, existing domain-invariant classifiers often use the squared loss or the hinge loss, and the learned classifier generally labels the target samples separately, failing to make full use of the structural information in the target-domain data.

B. Transfer Learning Based Seizure Detection
Domain adaptation has been applied to seizure detection in the past. Yang et al. [24] use the large-margin projected transductive SVM to reduce the distribution difference between training and testing data, and realize the adaptive recognition of EEGs. In [25], the TSK fuzzy system and distribution alignment are jointly optimized, and the proposed TL-SSL-TSK has strong interpretability and adaptability in epilepsy recognition. Jiang et al. [26] introduce a semi-supervised learning method based on [25] to exploit the unlabeled testing data. In [27], feedforward neural networks, fuzzy systems, and transductive transfer learning are unified into a generalized hidden-mapping model for seizure recognition. Recently, a cross-domain epilepsy EEG signal classification model with knowledge utilization maximization [28] has been proposed, which makes full use of the global data structure of the source and target domains and adds a pairwise constraint regularization term to utilize the association information between the labeled samples. In [29], from the perspective of error consistency, a regularization term for knowledge transfer is proposed and unified with the TSK fuzzy classifier to achieve online calibration. The effectiveness of these algorithms for EEG differences between different states has been demonstrated, but their performance under individual differences is not well studied.
Deep transfer learning has also been used for seizure detection. Zhang et al. [30] convert EEG signals into time-frequency maps and adopt three fine-tuned deep networks, VGG16, VGG19 and ResNet50, for classification. In [31], a unified adversarial learning framework is proposed to extract epilepsy-specific representations while removing inter-patient noise. Cao et al. [32] perform quadratic feature extraction on the mean amplitudes of sub-band spectrum representing brain activity rhythms through a deep pre-trained network and develop a deep network for epileptic state classification. Most of these studies initialize the network parameters or carry out secondary feature extraction through pre-trained models; they realize model transfer but cannot effectively solve the problem of individual differences in EEG signals.

III. DATASET AND METHODS
This section introduces the proposed cross-subject seizure detection framework, which uses data from multiple source subjects to help the target subject build a domain adaptation classification model. As shown in Fig. 1, the multi-channel EEG signals are first filtered and segmented, and then wavelet packet decomposition (WPD) is performed to extract statistical features of the EEGs. The correlation-alignment-based source selection (SSC) is then used to evaluate the transferability of subjects in the source domain. Finally, the selected source subjects are used together with the target subject to build the Cluster Embedding Joint-Probability-Discrepancy Transfer learning (CEJT) classification model.

A. CHZU Dataset and Feature Extraction
The EEG signals used in this study are obtained from the Children's Hospital, Zhejiang University School of Medicine (CHZU). The recording time of the EEG signals for each subject is either 2 hours or 16 hours. The EEG signals are collected with the international 10-20 lead system; each record contains 21 scalp EEG channels, and the sampling frequency is 1000 Hz. In the study, 15 children with focal epilepsy are analyzed. We first divide the EEG signal into interictal and ictal states, where the interictal state refers to signals recorded one hour or more before a seizure onset and one hour after the previous seizure. The EEG signals are further segmented into 2-second frames. For the ictal state, the overlap rate between two adjacent frames is 50%, while for the interictal state there is no overlap between EEG frames. Table I lists the specifications of the CHZU dataset. In order to utilize both time-domain and frequency-domain EEG knowledge, wavelet packet decomposition (WPD) is adopted for feature extraction. In the experiment, we perform a 7-layer WPD on the pre-processed EEG signal, and the first 11 sub-bands covering 0-40 Hz are selected. Then, 5 statistical features, namely the mean amplitude, standard deviation, median, kurtosis and skewness, are extracted from each sub-band, generating a 55-dimensional feature vector for each EEG channel. Finally, over all 21 channels, each EEG frame is represented by a feature vector of dimension 21 × 55 = 1155.
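The framing and dimension bookkeeping described above can be checked with a short sketch. It assumes that frame boundaries are computed on sample indices and that the 7-level WPD yields 128 equal-width, frequency-ordered sub-bands up to the Nyquist frequency; the helper name `frame_starts` and the 10-second toy signal length are hypothetical, and the wavelet decomposition itself is not implemented here.

```python
def frame_starts(n_samples, frame_len, overlap):
    """Start indices of fixed-length frames with a fractional overlap in [0, 1)."""
    step = int(frame_len * (1.0 - overlap))
    return list(range(0, n_samples - frame_len + 1, step))


FS = 1000            # sampling frequency in Hz (from the text)
FRAME_LEN = 2 * FS   # 2-second frames

# Ten seconds of signal: ictal frames overlap by 50%, interictal frames do not.
ictal = frame_starts(10 * FS, FRAME_LEN, overlap=0.5)
interictal = frame_starts(10 * FS, FRAME_LEN, overlap=0.0)

# Sub-band width of a 7-level WPD at 1000 Hz and the resulting feature dimension.
LEVELS, N_BANDS, N_STATS, N_CHANNELS = 7, 11, 5, 21
band_width_hz = (FS / 2) / (2 ** LEVELS)      # 500 Hz split into 128 sub-bands
covered_hz = N_BANDS * band_width_hz          # upper edge of the 11th sub-band
feature_dim = N_CHANNELS * N_BANDS * N_STATS  # 21 channels x 55 features
```

Under these assumptions each sub-band is about 3.9 Hz wide, so the first 11 sub-bands end near 43 Hz, which is in line with the 0-40 Hz range quoted in the text.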
B. Cluster Embedding Joint-Probability-Discrepancy Transfer
1) Problem Settings and Notations: A domain $\mathcal{D}$ contains three parts: a feature space $\mathcal{X}$, a probability distribution $P(X)$ and a label space $\mathcal{Y}$, where $X \in \mathcal{X}$. For simplicity, we use the subscripts $s$ and $t$ to indicate the source domain and the target domain, respectively. The key notations used in this paper and the corresponding descriptions are shown in Table II.
We assume that the feature spaces and label spaces of the two domains are the same, i.e., $\mathcal{X}_s = \mathcal{X}_t$ and $\mathcal{Y}_s = \mathcal{Y}_t$. We seek a latent common space shared by the source and target domains through a projection $P \in \mathbb{R}^{d \times m}$, in which the domain shift is minimized and the discriminative knowledge is transferred from $D_s$ to $D_t$. On this basis, we aim to design an adaptive classifier by exploring two learning strategies: distribution adaptation and label propagation. Thus, we adopt projected clustering, which treats the target-domain samples within the same cluster as a whole in order to emphasize the data distribution structure of the target domain. CEJT is formulated as finding a projection that yields new representations of the respective domains, together with the labels of the target domain, such that 1) in the projected space, the clustering of the target domain is achieved through the class centroids of the source domain, 2) the distribution matching of the same class and the distinguishability of different classes in the source and target domains are jointly explored, and 3) the local manifold is introduced to improve the quality of the cluster centroids.
2) Projected Clustering: Projected clustering aims to jointly optimize cluster centroids and labels in the embedding space so that samples within the same cluster share the same label. Since all the source-domain labels are available, the class centroids of the source data can be obtained by calculating the mean of the sample features of each class after projection. Based on the discriminative structure of the source data and the sample distribution structure of the target data, pseudo-labels are assigned to the target samples under the guidance of the class centroids. The projected clustering can then be expressed as (1), where $\alpha > 0$ is a trade-off parameter, $P \in \mathbb{R}^{d \times m}$ is the projection matrix, and $F \in \mathbb{R}^{m \times C}$ contains the cluster centroids. $E_s \in \mathbb{R}^{n_s \times C}$ is a constant matrix used to calculate the class centroids of the source data in the projected space, with each element $E_{ij} = 1/n_s^j$ if $y_{s,i} = j$ and $E_{ij} = 0$ otherwise, where $n_s^j$ is the number of source samples in class $j$. $\hat{Y}_t \in \mathbb{R}^{n_t \times C}$ is the one-hot encoded matrix of the predicted labels of the target domain.
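The two ingredients of projected clustering described above, class centroids of the labeled source data and centroid-guided pseudo-labels for the target data, can be sketched as follows. This is an illustration on hypothetical, already-projected 2-D features and does not learn the projection $P$; the function names are placeholders, and every class is assumed to contain at least one source sample.

```python
def class_centroids(features, labels, n_classes):
    """Mean feature vector of each class (the role played by E_s in the text)."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for x, y in zip(features, labels):
        counts[y] += 1
        for k, value in enumerate(x):
            sums[y][k] += value
    # Assumes counts[c] > 0 for every class c.
    return [[s / counts[c] for s in sums[c]] for c in range(n_classes)]


def assign_to_nearest(features, centroids):
    """Pseudo-label each sample with the index of its closest centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(range(len(centroids)), key=lambda c: sq_dist(x, centroids[c]))
            for x in features]


# Hypothetical projected features (0 = interictal, 1 = ictal).
source_x = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
source_y = [0, 0, 1, 1]
target_x = [[1.0, 1.5], [5.0, 3.0], [0.5, 0.0]]

centroids = class_centroids(source_x, source_y, n_classes=2)
pseudo_labels = assign_to_nearest(target_x, centroids)
```

In the full method the centroids and the projection are refined jointly, so the pseudo-labels change from one iteration to the next; the sketch shows a single assignment step only.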
3) Joint Probability Distribution Discrepancy: The core goal of domain adaptation is to match the different distributions of the source and target domains. The maximum mean discrepancy (MMD) criteria for the marginal distribution and the conditional distribution, and their linear combination, are commonly used for distribution alignment. Here, we use a more natural MMD criterion based on the joint probability distribution to measure the distribution difference between the source and target domains. The objective is to increase the discriminability between different classes while aligning the joint distributions of the source and target domains. Therefore, the joint probability distribution discrepancy is adopted. It is built from two terms, $M_T$ and $M_D$, balanced by a trade-off parameter $\mu > 0$: $M_T$ measures the distribution difference between the same classes of the source and target domains, and $M_D$ measures the distribution difference between different classes of the two domains. In these terms, $P_s(x_s \mid y_s^c)$ represents the conditional probability, and $P_s(y_s^c)$ is the prior probability of class $c$ in the source domain. $M_T$ and $M_D$ are further expressed through the MMD-based marginal distribution discrepancy, where $C$ is the number of categories and $E[\cdot]$ denotes the mathematical expectation. Besides, $N_s = Y_s/n_s$ and $\hat{N}_t = \hat{Y}_t/n_t$, in which $Y_s = [y_{s,1}; \ldots; y_{s,n_s}] \in \mathbb{R}^{n_s \times C}$ and $\hat{Y}_t = [\hat{y}_{t,1}; \ldots; \hat{y}_{t,n_t}] \in \mathbb{R}^{n_t \times C}$ are the one-hot coding matrices of the true labels of the source samples and the predicted labels of the target samples, respectively. Similarly, $M_s = F_s/n_s$ with $F_s = [Y_s(:, 1), \ldots, Y_s(:, C)] \otimes 1_{C-1}$ (the symbol $\otimes$ denotes the Kronecker product, and $1_{C-1}$ is the all-one vector of dimension $C - 1$), and $\hat{M}_t = \hat{F}_t/n_t$, with $\hat{F}_t$ constructed from $\hat{Y}_t$.
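The role of $N_s = Y_s/n_s$ can be made concrete: multiplying the features by $N_s$ gives, for each class, the class-conditional mean scaled by the class prior. The sketch below computes these weighted means and a same-class discrepancy between domains. It assumes untransformed (linear) features and assumes that the same-class term is the squared Frobenius distance between the weighted class means of the two domains, as in common joint-probability MMD formulations; the exact expression used by CEJT may differ, and all names and toy values are hypothetical.

```python
def one_hot(labels, n_classes):
    """One-hot label matrix Y with one row per sample."""
    return [[1.0 if y == c else 0.0 for c in range(n_classes)] for y in labels]


def weighted_class_means(features, labels, n_classes):
    """Rows c of (X N)^T with N = Y / n: class mean times class prior."""
    n = len(features)
    dim = len(features[0])
    y = one_hot(labels, n_classes)
    return [[sum(features[i][k] * y[i][c] for i in range(n)) / n
             for k in range(dim)]
            for c in range(n_classes)]


def same_class_discrepancy(xs, ys, xt, yt_pseudo, n_classes):
    """Assumed form of the same-class term: ||X_s N_s - X_t N_t||_F^2."""
    ms = weighted_class_means(xs, ys, n_classes)
    mt = weighted_class_means(xt, yt_pseudo, n_classes)
    return sum((a - b) ** 2
               for c in range(n_classes)
               for a, b in zip(ms[c], mt[c]))


# Hypothetical features; target labels are pseudo-labels.
xs = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [4.0, 8.0]]
ys = [0, 0, 1, 1]
xt = [[2.0, 2.0], [6.0, 6.0]]
yt_pseudo = [0, 1]

m_t = same_class_discrepancy(xs, ys, xt, yt_pseudo, n_classes=2)
```

Because the class prior enters each weighted mean, a class that is rare in one domain contributes less to the discrepancy than under a purely conditional alignment.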
Converted to the trace form, the joint probability distribution discrepancy can be rewritten as (7). To verify the effectiveness of the proposed method, the clustering-based joint probability distribution discrepancy $L_{jpd}$ is computed on the EEGs of the 15 subjects in the CHZU dataset (Table I) and compared with the $L_{jpd}$ obtained without the clustering method. As shown in Fig. 2, the proposed method yields a smaller $L_{jpd}$ than the variant without clustering on almost all subjects. The comparison indicates that applying clustering reduces the joint probability distribution difference, thus enhancing the seizure detection performance.
4) Structure Consistency: The quality of the cluster centroids plays an important role in whether the algorithm can accurately classify samples in the target domain. In real applications, high-dimensional data are generally considered to reside on low-dimensional manifolds with nonlinear geometric structures. Relevant studies [33] have shown that introducing the local manifold structure can improve the clustering performance on data with nonlinear characteristics. As a simple but effective trick, we add a Laplacian regularization term (10) to exploit the similar geometrical properties of neighboring points, where $X = [X_s, X_t]$ and $W$ is the affinity matrix. In the definition of $W$, $\langle x_i, x_j \rangle$ represents the inner product of $x_i$ and $x_j$, and $N_p(x_i)$ denotes the set of $p$-nearest neighbors of point $x_i$. The Laplacian matrix is $L = D - W$, where $D$ is a diagonal matrix with diagonal entries $D_{ii} = \sum_{j=1}^{n_s+n_t} W_{ij}$.
5) Regularization: In the knowledge transfer and manifold regularization, the structure of the data is constrained, but we do not want to lose the data attributes of the target domain. To avoid information loss, we introduce a regularization term (12) to preserve the energy of the original signal. After several algebraic steps and the removal of constant terms, the minimization problem of (12) can be written as (13), where $\hat{I}$ is a diagonal matrix defined by $\hat{I}_{ii} = 1$ if $x_i \in X_t$ and $\hat{I}_{ii} = 0$ otherwise.
6) Overall Formulation and Optimization Procedure: By combining (1), (7), (10) and (13), we arrive at the final CEJT formulation (14), where $\beta > 0$, $\lambda > 0$ and $\rho > 0$ are penalty parameters, $E = [E_s; 0_{n_t \times C}]$, $V = \mathrm{diag}(0_{n_s \times n_s}, I_{n_t})$ and $Y = [0_{n_s \times C}; \hat{Y}_t]$. $H$ is a centering matrix defined as $H = I_n - (1/n) 1_n$, where $1_n$ denotes the $n \times n$ all-ones matrix and $n = n_s + n_t$. The constraint $P^T X H X^T P = I_m$ is introduced to avoid trivial solutions.
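The graph Laplacian used in the structure-consistency term can be sketched as follows. The construction assumes a cosine-similarity affinity between points that are $p$-nearest neighbors of one another in either direction, which is a common choice in manifold regularization and is consistent with the inner product and neighbor sets mentioned in the text; the exact affinity used by CEJT is not reproduced here, and the helper names and toy points are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity <a, b> / (||a|| ||b||) of two non-zero vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b)))


def knn_sets(points, p):
    """Indices of the p nearest neighbours (Euclidean) of each point, excluding itself."""
    sets = []
    for i, x in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: sum((u - v) ** 2 for u, v in zip(x, points[j])))
        sets.append(set(order[:p]))
    return sets


def graph_laplacian(points, p):
    """L = D - W with an assumed cosine affinity on the symmetrised p-NN graph."""
    n = len(points)
    nbrs = knn_sets(points, p)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (j in nbrs[i] or i in nbrs[j]):
                w[i][j] = cosine(points[i], points[j])
    degree = [sum(row) for row in w]  # D_ii = sum_j W_ij
    return [[(degree[i] if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical stacked source and target points, X = [X_s, X_t].
points = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
lap = graph_laplacian(points, p=1)
```

Each row of the resulting matrix sums to zero, and points that are not neighbors contribute no coupling, so the regularizer only penalizes projections that separate nearby samples.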
In (14), the labels of the target domain are needed for the projected clustering and the calculation of the joint probability discrepancy. It is very difficult to obtain the best $\hat{Y}_t$ by optimizing (14) directly, so we update it by assigning each target sample to the nearest class centroid, as in (15). In addition, there are two variables, $P$ and $F$, to optimize. We update each of them alternately while keeping the other variables fixed. When the other variables are fixed, the optimization problem with respect to $F$ becomes (16). Taking the derivative of (16) with respect to $F$ and setting it to zero gives (17). Next, substituting (17) into (14) to replace $F$, the optimization of $P$ can be written as (18). According to constrained optimization theory, the Lagrange function of (18) is formed with the Lagrange multipliers $(\sigma_1, \ldots, \sigma_m)$. The optimal solution is then obtained by calculating the eigenvectors of (19) corresponding to the $m$ smallest eigenvalues. The proposed CEJT is summarised in Algorithm 1, whose iterative steps are:
a) … (8) and (9).
b) Update $P$ by solving the generalized eigenvalue problem in (19).
c) Update $F$ by Equation (17).
d) Update $\hat{Y}_t$ by Equation (15).

C. Correlation-Alignment-Based Source Selection
Correlation alignment (CORAL) [34] minimizes the domain shift by aligning the second-order statistics of the source and target distributions. Inspired by correlation alignment, we design an evaluation metric for source selection to find subjects that have a high correlation with the target domain. It can thus reduce the computational cost while avoiding some negative transfer.
Assume there is a target domain $T$ with an unlabeled feature matrix $X_t$ and $z$ labeled source domains $S_1, \ldots, S_z$, where $X_{s,i}$ is the feature matrix of the $i$-th source domain. The SSC between the $i$-th source domain and the target domain, $ssc(S_i, T)$, is defined from covariance matrices, where $C_{S_i}$ is the covariance of $X_{s,i}$, $C_T$ is the covariance of $X_t$, and $C_{S_i}^c$ represents the covariance of the $c$-th category in the source domain. The term $dif(S_i, T)$ measures the distribution difference between the $i$-th source domain and the target domain, and $dis(S_i)$ measures the inter-class discriminability of the $i$-th source domain. For the target domain $T$, a larger $ssc(S_i, T)$ indicates a higher transferability of the $i$-th source domain. Therefore, we select the $\hat{z} \in (1, z)$ source subjects with the highest $ssc(S_i, T)$.
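The covariance-based comparison behind source selection can be sketched as follows. The example ranks hypothetical source subjects only by the squared Frobenius distance between their covariance matrix and that of the target, which is the usual CORAL-style measure of domain difference; it deliberately omits the inter-class discriminability term $dis(S_i)$ and does not reproduce how SSC combines the two terms. Note that in this sketch a smaller distance is better, whereas a larger $ssc(S_i, T)$ is better in the text.

```python
def covariance(samples):
    """Unbiased sample covariance matrix of row-wise feature vectors."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(col) / n for col in zip(*samples)]
    centered = [[x[k] - mean[k] for k in range(dim)] for x in samples]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
            for a in range(dim)]


def coral_distance(source, target):
    """Squared Frobenius norm of the covariance difference between two domains."""
    cs, ct = covariance(source), covariance(target)
    d = len(cs)
    return sum((cs[a][b] - ct[a][b]) ** 2 for a in range(d) for b in range(d))


def rank_sources(sources, target, top_k):
    """Names of the top_k sources with the smallest covariance gap to the target."""
    ranked = sorted(sources, key=lambda name: coral_distance(sources[name], target))
    return ranked[:top_k]


# Hypothetical 2-D features for one target subject and three candidate sources.
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
sources = {
    "A": [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0]],    # same covariance, shifted mean
    "B": [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]],    # variance along one axis only
    "C": [[0.0, 0.0], [1.0, -1.0], [2.0, -2.0]],  # opposite correlation sign
}
selected = rank_sources(sources, target, top_k=2)
```

Source "A" is ranked first even though its mean differs from the target, because second-order statistics ignore mean shifts; this is one reason a selection metric may need additional terms beyond the covariance gap.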
We take subject P15 as an example target domain to show the process of source selection in Fig. 3, where all the remaining subjects are taken as source-domain data. First, the ssc scores between P15 and each source subject are calculated, and the top $\hat{z}$ subjects are selected. Then we test the effect of different numbers of source subjects on the classification results. As Fig. 3 shows, $\hat{z} = 5$ performs the best, and is also slightly better than using all subjects as the source (ALL). Similar results can be obtained for the other patients when each is tested independently as the target domain.

A. Experimental Settings
To show the effectiveness of the proposed transfer learning algorithm, experimental studies on the CHZU focal epilepsy dataset are carried out in this section. The accuracy, G-mean, sensitivity and $F_1$ score are used as the performance measures:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN},$$
$$\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}, \qquad F_1 = \frac{2PR}{P + R},$$
where TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative detections, respectively. $P$ and $R$ are the precision and recall, calculated by
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.$$
On the one hand, the proposed domain adaptation algorithm is compared with two classical learning methods without transfer learning abilities, i.e., SVM and KNN. On the other hand, the proposed CEJT algorithm is also compared with seven classical domain adaptation approaches, i.e., TCA [12], JDA [13], TJM [17], ARTL [14], MEDA [15], JGSA [16] and our previous work, Joint-Probability-Discrepancy-Based Domain Adaptation (JPDDA). JPDDA learns a domain-invariant classifier with structural risk minimization while performing joint probability distribution discrepancy minimization and manifold consistency maximization. The domain adaptation algorithms TCA, JDA, TJM and JGSA learn a transformation on all data in $X_s$ and $X_t$ to obtain a common feature space across the source and target domains. The classification model is then trained on the mapped source data using SVM.
In our experiments, the parameters of the learning algorithms are optimized on the given search grids. For KNN, the optimal number of nearest neighbors is selected from $\{1, 2, \ldots, 10\}$. The best value of the trade-off parameter in SVM is searched in $\{2^{-6}, 2^{-4}, 2^{-2}, 2^{0}, 2^{2}, 2^{4}\}$. Except for TCA, the domain adaptation algorithms need to iteratively update the target-domain labels during feature matching, and the number of iterations is set to $T = 10$. For TCA, JDA, TJM, JGSA and CEJT, the dimension of the common feature space is set to 50, and the manifold feature dimension of MEDA is also set to 50.
The optimal distribution adaptation parameters in all transfer learning algorithms (e.g., $\lambda$ in our CEJT) are searched in the range $\{2^{-6}, 2^{-4}, 2^{-2}, 2^{0}, 2^{2}, 2^{4}\}$. For the domain-invariant classifiers ARTL, MEDA, JPDDA and CEJT, we obtain the optimal manifold regularization parameters by searching within $\{2^{-6}, 2^{-4}, 2^{-2}, 2^{0}, 2^{2}, 2^{4}\}$. Finally, the trade-off parameter $\mu$ in CEJT is set to 1, and the trade-off parameter $\alpha$ is set to 0.25. In the proposed SSC method, we select the 5 source subjects with the highest $ssc(S_i, T)$ for each target subject to compose the source domain.
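The four performance measures named in this subsection can be computed from the confusion counts as in the sketch below. It uses the standard definitions, with the ictal state taken as the positive class and G-mean taken as the geometric mean of sensitivity and specificity; the function name and the example counts are hypothetical, and no guard against empty classes is included.

```python
import math


def detection_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, G-mean and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall of the positive (ictal) class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "g_mean": g_mean,
        "f1": f1,
    }


# Hypothetical counts with far more interictal (negative) than ictal frames.
m = detection_metrics(tp=80, tn=900, fp=100, fn=20)
```

With imbalanced data such as this, accuracy is dominated by the interictal class, which is why G-mean and the $F_1$ score are reported alongside it.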

B. Comparisons Among Different Learning Methods
The detailed results of the different algorithms for each subject are listed in Table III. For clarity, the highest accuracy, G-mean, sensitivity and $F_1$ score are highlighted in bold in the table. The results show that CEJT has the best classification performance. Domain adaptation algorithms are generally better than non-transfer learning algorithms, because domain adaptation methods take into account the distribution differences between the source and target domains. It is worth noting that, compared with the combination of a feature transformation and a classifier (e.g., TCA, JDA, TJM and JGSA), domain-invariant classifiers (e.g., ARTL, MEDA, JPDDA and CEJT) perform better because they carry out feature matching and classification jointly. JPDDA and CEJT are superior to the state-of-the-art domain adaptation algorithms, because the joint probability distribution difference strengthens the discriminative knowledge of classes while aligning the source and target domains. This is also confirmed by the t-SNE visualization shown in Fig. 4: the EEG features extracted by JPDDA and CEJT are more distinguishable between different categories after the feature transformation. The clustering learning in CEJT further improves the performance by utilizing the data distribution structure of the target domain.
Further, we perform statistical tests on the performance of the proposed algorithm and the existing methods. The nonparametric Friedman test is used to evaluate whether the difference in performance among the methods is statistically significant, and the rank of each algorithm is determined. A post-hoc test is then performed to verify whether the difference between the top-ranked algorithm and the others is significant. Table V shows that the proposed algorithm significantly outperforms ARTL as well as the algorithms ranked lower than ARTL. Meanwhile, it can be seen from Tables III and IV that the proposed algorithm outperforms MEDA and JPDDA to some extent, although the improvement is not statistically significant.
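The ranking step of the Friedman test can be sketched as follows. The example uses the standard chi-square form of the Friedman statistic computed from average ranks, with rank 1 given to the best method for each subject and tied scores sharing the mean rank; the accuracy values are hypothetical, and the post-hoc test and the critical values are not shown.

```python
def average_ranks(scores):
    """Rank methods per subject (1 = best, ties share the mean rank), then average."""
    n_subjects = len(scores)
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda j: -row[j])
        pos = 0
        while pos < n_methods:
            end = pos
            # Extend the block while the next method has the same score.
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n_subjects for t in totals]


def friedman_statistic(scores):
    """Chi-square form: 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4)."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


# Hypothetical accuracies: rows are subjects, columns are three methods.
scores = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.70],
    [0.92, 0.80, 0.85],
    [0.91, 0.89, 0.75],
]
ranks = average_ranks(scores)
stat = friedman_statistic(scores)
```

The statistic is compared against a chi-square distribution with $k - 1$ degrees of freedom; a significant result only indicates that the methods differ somewhere, which is why a post-hoc comparison against the top-ranked method follows.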
Further, we visualize the decision boundaries obtained by JPDDA and the proposed CEJT for comparison in Fig. 5, where the data of subject P06 is used as the target domain. JPDDA uses the squared loss as its structural risk function, so it classifies the target domain by labeling the samples individually. CEJT, in contrast, introduces clustering to take advantage of the data structure of the target domain, so that the labels of the target domain can be adjusted cluster by cluster. Fig. 5 is consistent with this assumption.

C. Ablation Study
We conduct ablation experiments to analyze the significance of each loss in CEJT. The joint probability distribution discrepancy, structure consistency, and regularization are removed in turn, and the average accuracy and G-mean are shown in Table VI. When the weight of the joint probability distribution discrepancy, $\lambda$, is set to 0, our method degenerates to a traditional clustering algorithm. A comparison with Table III shows that its overall performance is still better than that of some transfer learning algorithms. A possible explanation is that, in projected clustering, the class centroids of the source domain guide the clustering of the target domain and thereby play a role in aligning the distributions. When the weight of the structure consistency, $\rho$, is set to 0, the average accuracy and G-mean drop severely, which confirms that the structure consistency affects the quality of the cluster centroids. In addition, the overall performance decreases slightly after removing the regularization term, indicating that preserving the original information can moderately improve the performance.

D. Comparison Among Different Source Selection Strategies
This subsection validates the effectiveness of the proposed source selection strategy in finding the most beneficial source subjects. Fig. 6 shows the classification results when using different source selection methods: the Euclidean distance ($L_2$), Earth Mover's distance (EMD), A-distance and CORAL distance between the source and target domains, and Domain Transferability Estimation (DTE) [35]. Using ALL sources without selection is also included for comparison. As observed, the proposed SSC algorithm is superior to the other selection strategies in both classification accuracy and G-mean score, and is even slightly better than using all sources without selection, while greatly reducing the computational cost and avoiding, to some extent, the negative transfer caused by unrelated subjects. Specifically, compared with

E. Comparison Among Different Seizure Detection Methods
We compare the proposed approach with a set of competitive state-of-the-art (SOTA) seizure detection algorithms.
Compared with ARTL and MEDA, JPDDA and CEJT consider the MMD distance information between classes, and the inter-class discriminability is more obvious. Thanks to projection clustering, the source and target domains are better aligned and the samples are more compact after CEJT domain adaptation.
… adversarial training, resulting in a better cross-subject seizure detection model. The compared deep domain adaptation methods, with the mean amplitudes of sub-band spectrum (MAS) [32] as input, include:
• Deep Domain Confusion (DDC) [19], which is a single-layer deep adaptation method with the MMD loss.
• Deep CORAL (DCORAL) [22], which is a deep neural network with the CORAL loss.
• … [37], which is a deep network that achieves efficient domain transfer by making the source and target data indistinguishable.
The overall performance of all the compared methods is reported in Table VII. Our method clearly outperforms several state-of-the-art seizure detection models and simple deep domain adaptation methods, validating its effectiveness in cross-subject seizure detection. Its superiority is observed on 11 subjects. It is worth noting that the model in [31] performs poorly. One possible explanation is that the algorithm was validated in an idealized setting where the numbers of interictal and ictal samples are equal, whereas in this study most subjects have more interictal samples than ictal samples, which is more consistent with the actual situation. In addition, for subject P08, the compared algorithms perform much better than the proposed algorithm. A likely reason is that the extracted wavelet packet features are not strong enough to represent the EEG in this case, which reminds us that future work should attend not only to the classification model but also to the properties of the features themselves.
In the above experiments, all unlabeled target data are used for transfer learning. In this section, we show the performance with varying amounts of target data used in transfer learning. First, the unlabeled target data (denoted as All) are divided into two parts: one part is used to measure the distribution difference from the source domain during model training (denoted as Transfer), and the other, which is not visible during training, is used for testing (denoted as Test). One-third, one-half, and two-thirds of the target data are used in turn to measure the distribution difference, as shown in Table VIII. In every case, the average accuracy on Transfer is higher than that on Test. When one-third of the target data is used for transfer learning, the average accuracy on Test is the lowest (84.0%).
When more data are used in transfer learning, the average accuracy on Test improves, reaching a maximum of 85.0%. The average accuracy on All reaches its highest value, 85.9%, when half of the data is used for transfer learning, only 0.4% lower than that reported in Table III. Therefore, it is feasible to use only part of the data to measure the distribution of the target domain for transfer learning. This also shows that the proposed framework places low data requirements on practical applications.
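The partial-data protocol can be sketched as follows. The sketch assumes a random split of the target sample indices, which the text does not specify; the function name `split_target`, the sample count of 600, and the fixed seed are illustrative only.

```python
import random

def split_target(n_samples, transfer_fraction, seed=0):
    """Randomly split target sample indices into (transfer, test) index lists.

    The transfer part is visible (without labels) during adaptation;
    the test part is held out entirely until evaluation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_transfer = round(n_samples * transfer_fraction)
    return sorted(indices[:n_transfer]), sorted(indices[n_transfer:])

# The three settings described above, applied to a hypothetical 600 target samples.
splits = {
    name: split_target(600, fraction)
    for name, fraction in [("one-third", 1 / 3), ("one-half", 1 / 2), ("two-thirds", 2 / 3)]
}
```

Accuracy on All is then computed over the union of the two parts, so it lies between the Transfer and Test accuracies.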

V. CONCLUSION
The effectiveness of domain adaptation in coping with variations among different subjects or tasks has been demonstrated in seizure detection. In the proposed cross-subject seizure detection framework, when the number of source subjects is large, the source selection metric SSC reduces the computational cost and the impact of irrelevant subjects on the subsequent classification modeling. CEJT unifies clustering learning, feature matching, and discriminative structure, and performs well in handling individual differences in EEG signals. However, the number of selected source subjects is determined through simple experiments and is therefore not adapted to each individual; we will explore this issue in future work.

ETHICAL STANDARDS
This study was approved by the Second Affiliated Hospital of Zhejiang University and registered in the Chinese Clinical Trial Registry (ChiCTR1900020726). All patients gave their informed consent prior to their inclusion in the study.