Multi-Source Decentralized Transfer for Privacy-Preserving BCIs

Transfer learning, which utilizes labeled source domains to facilitate learning in a target domain, is effective in alleviating the high intra- and inter-subject variations in electroencephalogram (EEG) based brain-computer interfaces (BCIs). Existing transfer learning approaches usually use the source subjects' EEG data directly, leading to privacy concerns. This paper considers a decentralized privacy-preserving transfer learning scenario: there are multiple source subjects, whose data and computations are kept local, and only the parameters or predictions of their pre-trained models can be accessed for privacy protection; how, then, can effective cross-subject transfer be performed for a new subject with unlabeled EEG trials? We propose an offline unsupervised multi-source decentralized transfer (MSDT) approach, which first generates a pre-trained model from each source subject, and then performs decentralized transfer using the source model parameters (in gray-box settings) or predictions (in black-box settings). Experiments on two datasets from two BCI paradigms, motor imagery and affective BCI, demonstrated that MSDT outperformed several existing approaches that do not consider privacy protection at all. In other words, MSDT achieved both high privacy protection and better classification performance.

BCI (aBCI) [5], which usually aims to identify the emotion states, e.g., positive, neutral, and negative, induced by videos, audio, images, etc. These two paradigms have been widely used in human-machine interaction, health management, neural rehabilitation, etc., and are the focus of this paper.

The conventional pipeline for processing EEG signals in MI or aBCI consists of signal preprocessing, feature extraction, and model training. The latter two can also be integrated into a single neural network. Signal preprocessing and model training procedures in MI and aBCI are similar. Since the discriminative information in MIs is mainly spatial, commonly extracted features include the log-variance of spatially filtered EEGs, e.g., common spatial pattern (CSP) [6], [7], and tangent space features from the covariance matrices of EEG trials [8]. For aBCI, commonly used features include time domain, frequency domain [9], and entropy features [10].

A major challenge of BCIs is that EEG signals are non-stationary, with high variations across sessions, subjects, tasks, and devices [11]. Transfer learning (TL) [11] is a promising approach to alleviate this problem. Various TL approaches have been proposed for BCIs in the last decade, e.g., adaptive CSP [12], data alignment [13], [14], instance-based TL [15], [16], feature-based TL [17], [18], and deep TL [19]. For aBCIs, existing TL approaches mainly include feature-based TL [20] and adversarial-based deep TL [21], [22].

However, most existing TL approaches require access to the source EEG data or features [17], [18], [19], [20]. EEG signals contain rich private information [23], [24], e.g., health status, emotion, psychological state, personal identity, etc.
Consequently, the source EEG data may not be directly shared due to regulations such as the European General Data Protection Regulation (GDPR), the China Personal Information Protection Law, or user privacy concerns. For example, the raw EEG signals or extracted features can be used to perform personal identification or authentication [25]. Additionally, due to high data collection costs, EEG data are usually organized distributedly in small datasets from different groups, instead of a large centralized dataset like ImageNet [26] in computer vision.

To protect data privacy in TL, we consider a decentralized scenario: all data and computations of each source subject are kept local, and only the parameters or predictions of the pre-trained source models are accessible to the new subject.

Fig. 1. Illustration of the proposed MSDT approach, which contains two stages in the gray-box setting. In source model pre-training, a set of source models Θ = {θ_m}_{m=1}^M are trained for the source subjects, one from each. In decentralized transfer, these source models are transmitted to the new subject, who has some unlabeled instances X_t, for privacy-preserving TL.
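The CSP log-variance features mentioned above for MI can be sketched as follows. This is a minimal two-class CSP pipeline under common conventions (trace-normalized covariances, generalized eigendecomposition, filters taken from both ends of the spectrum); the function names and trial shapes are illustrative, not from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_filters=2):
    """Compute CSP spatial filters from two classes of EEG trials.

    trials_*: arrays of shape (n_trials, n_channels, n_samples).
    Returns W of shape (n_filters, n_channels): spatial filters whose
    outputs have maximally different variance between the two classes.
    """
    cov_a = np.mean([t @ t.T / np.trace(t @ t.T) for t in trials_a], axis=0)
    cov_b = np.mean([t @ t.T / np.trace(t @ t.T) for t in trials_b], axis=0)
    # Generalized eigenproblem: cov_a w = lambda (cov_a + cov_b) w
    eigvals, eigvecs = eigh(cov_a, cov_a + cov_b)
    order = np.argsort(eigvals)
    # Keep filters from both ends of the eigenvalue spectrum
    idx = np.r_[order[:n_filters // 2], order[-(n_filters - n_filters // 2):]]
    return eigvecs[:, idx].T

def log_variance_features(trials, W):
    """Log-variance of the spatially filtered EEG: the standard CSP feature."""
    feats = []
    for t in trials:
        z = W @ t                       # spatially filtered trial
        var = np.var(z, axis=1)
        feats.append(np.log(var / var.sum()))
    return np.array(feats)
```

A classifier (e.g., LDA or SVM) would then be trained on the resulting feature vectors.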

II. BACKGROUND
This section introduces background knowledge on tangent space feature extraction for MI, privacy-preserving machine learning, and multi-source TL in BCIs, which will be used in the next section.

A. Tangent Space Feature Extraction for MI

For MI, the discriminative information is mainly spatial, and can be obtained from the covariance matrices of the EEG trials. These covariance matrices are symmetric positive definite (SPD) and lie on a Riemannian manifold [8]. Tangent space mapping can map a Riemannian-space SPD matrix P_i ∈ R^{c×c}, where c is the number of EEG channels, to a Euclidean tangent space vector x_i around an SPD reference matrix M, which is usually the Riemannian or Euclidean mean:

x_i = upper( log( M^{-1/2} P_i M^{-1/2} ) ),

where log(·) is the matrix logarithm and upper(·) vectorizes the upper triangular part of a symmetric matrix.

B. Privacy-Preserving Machine Learning

Privacy-preserving machine learning [27] addresses the increasing privacy concerns in real-world applications, and can be realized by federated learning [28], differential privacy [29], etc. However, several challenges hinder the broad adoption of these techniques in BCIs, e.g., technology immaturity, performance degradation, and deployment difficulties.
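The tangent space mapping above can be sketched in numpy. The whitening by M^{-1/2}, matrix logarithm, and upper-triangle vectorization follow the standard Riemannian construction (with the usual sqrt(2) scaling of off-diagonal entries so Euclidean norms approximate Riemannian distances at M); the function names are illustrative.

```python
import numpy as np

def spd_logm(P):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(P)
    return (V * np.log(w)) @ V.T

def spd_invsqrtm(P):
    """Inverse matrix square root of an SPD matrix."""
    w, V = np.linalg.eigh(P)
    return (V * (1.0 / np.sqrt(w))) @ V.T

def tangent_space_vector(P, M):
    """Map an SPD covariance matrix P to a tangent space vector at M.

    P is whitened by M^{-1/2}, log-mapped, and the upper triangle of
    the resulting symmetric matrix is vectorized, giving a Euclidean
    feature vector of length c*(c+1)/2.
    """
    M_isqrt = spd_invsqrtm(M)
    S = spd_logm(M_isqrt @ P @ M_isqrt.T)
    c = S.shape[0]
    iu = np.triu_indices(c)
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return scale * S[iu]
```

At the reference point itself (P = M), the tangent vector is zero, as expected for a log map.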

Recently, there is a trend in the computer vision community to use data-free TL to balance the cross-domain learning performance and source privacy protection. In data-free TL, knowledge can be transferred from the source model parameters or predictions. Various such approaches have been proposed, e.g., deep hypothesis transfer [30], [31], virtual domain construction [32], knowledge distillation [33], etc.

In BCIs, the privacy of each source subject should be individually addressed, requiring all data and computations to be kept local. This paper considers MSDT, which only accesses the parameters or predictions of the pre-trained source models, to protect source privacy and improve the target learning performance.

The data augmentation approaches in Table I were used for MI classification, where C_noise, C_mult and C_freq are three hyper-parameters set as 2, 0.05 and 0.2, respectively, according to [47]. More specifically, noise injection adds uniform noise to each EEG trial, data flipping flips the EEG signal of each channel, data scaling multiplies the original EEG signal by a coefficient close to 1, and frequency shift uses the Hilbert transform [48] to shift the frequency of the EEG signal. Some example EEG trials before and after augmentation are shown in Fig. 2.

2) Feature Extraction: After data augmentation, we extract commonly used features from the EEG signals [11], [17], such as tangent space vectors for MI, and differential entropy features for aBCI (Section II-A). The extracted features of each subject are denoted as X in this paper.

The final target model is a weighted ensemble of the adapted models.

1) Information Maximization: Minimizing the conditional entropy H(Ŷ_t|X_t) and maximizing the marginal entropy H(Ŷ_t), where h(p) = −Σ_i p_i log p_i is the information entropy on the predicted probabilities p, make the target predictions more confident and more class-balanced (diverse), respectively. When there are multiple source models to adapt, we minimize the sum of the negative mutual information of all adapted source models, i.e.,

L_IM = Σ_{m=1}^{M} [ (1/n_t) Σ_{i=1}^{n_t} h( ρ(θ_m(x_{t,i})) ) − h(ȳ_m) ],

where ȳ_m = (1/n_t) Σ_{i=1}^{n_t} ρ(θ_m(x_{t,i})) is the average class probability in T_u, and ρ is the softmax function.
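The information maximization objective described above can be sketched in numpy: for each adapted source model, the mean conditional entropy is minimized and the entropy of the average prediction is maximized. Names and shapes are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def info_max_loss(logits_per_source):
    """Sum of negative mutual information over the adapted source models.

    logits_per_source: list of (n_t, K) logit arrays, one per source
    model, on the unlabeled target data. For each model, the loss is
    the mean conditional entropy minus the entropy of the average
    class probability; minimizing it makes the target predictions
    confident yet class-balanced.
    """
    loss = 0.0
    for logits in logits_per_source:
        p = softmax(logits)                # rho(theta_m(x_t))
        cond = entropy(p).mean()           # ~ H(Y_t | X_t)
        marg = entropy(p.mean(axis=0))     # ~ H(Y_t), entropy of avg prediction
        loss += cond - marg
    return loss
```

Confident, class-balanced predictions yield a lower loss than uniform (uninformative) predictions.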

By minimizing the above loss, predictions of the adapted models are forced to resemble one-hot encodings.

2) Source Consistency Regularization: If a target sample x_{t,i} is well adapted to multiple source models, then the conditional probabilities p(y|x_{t,i}) predicted by different source models should be similar. More specifically, for a particular target sample x_{t,i}, we define the source consistency loss as a penalty on the disagreement among the class probabilities predicted by the different adapted source models, where θ_m(x_{t,i})_k is the probability of x_{t,i} belonging to the k-th class.

In practice, the pre-trained source models may be only available as black-box APIs for querying. This subsection extends the gray-box MSDT approach to black-box MSDT using knowledge distillation [55], which is frequently used to transfer knowledge from a pre-trained model.
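The paper's exact source consistency loss is not reproduced here; one plausible instantiation, used purely for illustration, penalizes pairwise squared differences between the class probabilities predicted by different adapted source models on each target sample.

```python
import numpy as np

def source_consistency_loss(probs_per_source):
    """Illustrative source consistency penalty (assumed pairwise form).

    probs_per_source: list of M arrays of shape (n_t, K), the class
    probabilities theta_m(x_t) of each adapted source model on the
    target data. Pairwise squared differences are penalized so that
    well-adapted models agree on each target sample.
    """
    M = len(probs_per_source)
    loss = 0.0
    for m in range(M):
        for mp in range(m + 1, M):
            diff = probs_per_source[m] - probs_per_source[mp]
            loss += np.mean(np.sum(diff ** 2, axis=1))
    return loss
```

The loss is zero when all source models produce identical predictions, and grows with their disagreement.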

Our goal is to learn a single student model θ_t by querying multiple source APIs. This is different from MSDT-G, in which we learn a set of target adapted models {θ_m}_{m=1}^M. Additionally, MSDT-G requires knowing the parameters of each source model, whereas MSDT-B only needs to query the source models for their outputs (the model parameters are not required).

We randomly initialize the student model θ_t = (g_t ∘ f_t), and replace the source consistency regularization by the following unsupervised knowledge distillation loss between the weighted queries and the student model predictions:

L_kd = E_{x_t ∈ T_u} KL( Σ_{m=1}^M α_m θ_m(x_t) ‖ ρ(θ_t(x_t)) ),

where KL is the Kullback-Leibler (KL) divergence [56] loss, θ_m is a black-box model API, and α_m is the source transferability estimated by (11) on the source API predictions of the target data. By minimizing the knowledge distillation loss, the target model can learn the responses of the source APIs.
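The black-box distillation loss described above can be sketched as follows: the teacher signal is the transferability-weighted mixture of source API predictions, and the student is trained to match it under a KL divergence. Function names and shapes are illustrative.

```python
import numpy as np

def kd_loss(api_probs, alphas, student_logits, eps=1e-12):
    """Unsupervised knowledge distillation loss for black-box transfer (sketch).

    api_probs: list of M arrays of shape (n_t, K), the probability
    outputs returned by the source-model APIs on the target data.
    alphas: M transferability weights (summing to 1).
    student_logits: (n_t, K) logits of the student model theta_t.
    Returns the mean KL divergence from the weighted API predictions
    (teacher) to the student predictions.
    """
    teacher = sum(a * p for a, p in zip(alphas, api_probs))
    e = np.exp(student_logits - student_logits.max(axis=1, keepdims=True))
    student = e / e.sum(axis=1, keepdims=True)
    kl = np.sum(teacher * (np.log(teacher + eps) - np.log(student + eps)), axis=1)
    return kl.mean()
```

The loss vanishes when the student reproduces the weighted teacher distribution exactly.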

Note that, since there is only one student model, the information maximization term reduces to the single-model case:

L_IM = (1/n_t) Σ_{i=1}^{n_t} h( ρ(θ_t(x_{t,i})) ) − h(ȳ_t),

where ȳ_t = (1/n_t) Σ_{i=1}^{n_t} ρ(θ_t(x_{t,i})). The overall loss function for black-box MSDT combines the information maximization and knowledge distillation terms. The pseudocode of MSDT in the gray-box and black-box settings is summarized in Algorithm 1.

The backbone network consisted of two fully connected layers, 505 each followed by layer normalization and ReLU activation.
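A minimal numpy sketch of the described backbone (two fully connected layers, each followed by layer normalization and ReLU) is given below; the dimensions, initialization scheme, and class name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each sample's features to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

class Backbone:
    """Two fully connected layers, each followed by LayerNorm and ReLU."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # He-style initialization (an assumption for this sketch)
        self.W1 = rng.standard_normal((in_dim, hidden_dim)) * np.sqrt(2.0 / in_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.standard_normal((hidden_dim, hidden_dim)) * np.sqrt(2.0 / hidden_dim)
        self.b2 = np.zeros(hidden_dim)

    def forward(self, x):
        h = relu(layer_norm(x @ self.W1 + self.b1))
        return relu(layer_norm(h @ self.W2 + self.b2))
```

A classifier head g_t would then map the backbone output f_t(x) to class logits.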

Although the average accuracy of MSDT-G was lower than that of MS-MDA on SEED, MS-MDA needs to know the EEG data of each individual user, i.e., it may leak the identity and EEG data privacy of the source users. Additionally, MS-MDA performed poorly on MI2, possibly due to its structure being specifically designed for aBCI tasks.

Table VI compares the cross-subject classification accuracies under the gray-box and black-box settings. MSDT-G and MSDT-B always achieved the best or the second-best performance, while protecting the privacy of each source subject.

D. Privacy-Protection Capability
This subsection discusses the privacy-protection capabilities of different approaches.

Private knowledge of the source subjects may be in the form of raw EEG data, extracted features, model parameters, or model APIs. Table VII compares the privacy risks and memory requirements of different approaches.

Source data based approaches need to access the source EEG data or features, which contain rich private information, e.g., emotion, health state, and personal identity. The literature [25] has shown that at least the personal identity can easily leak. Model parameter based approaches, e.g., SHOT, SHOT-Ens and MSDT-G, have lower privacy risks.

We also performed person identification and authentication experiments to verify the source privacy protection ability of MSDT. Person identification finds out who the subject is: it models each subject as a separate class, and tries to identify the subject's category. Person authentication seeks to prove or disprove the subject's claimed identity [25]; a binary classifier is used to admit or reject the claimed identity.

Specifically, for source data based approaches, we extracted EEG features and then adopted a support vector machine (SVM) [70] as the base classifier for person identification and authentication. For identification, we selected 80% of the samples from each source subject as a single class, combined them altogether as the training data, and used the remaining samples as the test data. For M source subjects, the learning task is an M-class classification problem. For authentication, the target trials were used as test data. If the maximum prediction probability of a target trial x_t was below 2/M, then x_t was considered as not belonging to the training set.
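The 2/M rejection rule described above can be sketched as a small helper: given a trial's predicted probabilities over the M source subjects, the claimed identity is rejected when the classifier's maximum probability falls below the threshold. The function name is illustrative.

```python
import numpy as np

def authenticate(prob, n_sources):
    """Decide whether a target trial plausibly belongs to a source subject.

    prob: (M,) predicted class probabilities of a target trial over
    the M source subjects. Following the 2/M rule described above,
    a trial whose maximum probability is below 2/M is rejected as
    not belonging to any training (source) subject.
    """
    return float(np.max(prob)) >= 2.0 / n_sources
```

For M = 10 sources, the threshold is 0.2: a confident prediction (max probability 0.5) is accepted, while a near-uniform one (max 0.1) is rejected.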

The results are shown in Table VIII. The person identification accuracy was over 99% from the raw source data. Model parameter or API based approaches do not suffer from person identification attacks. All approaches had low privacy risk in person authentication.

had similar low performance, possibly because the training data were too small to learn good feature representations.

In summary, for a specific subject with a small number of training samples, hand-crafted features followed by an MLP classifier may be a good choice.

We conducted an ablation analysis to evaluate how each loss in MSDT contributed to the final performance.

This work considered decentralized transfer learning in BCIs, without accessing the source data or features. We proposed MSDT-G, which uses the source model parameters in gray-box settings, and MSDT-B, which uses source model APIs in black-box settings. The cross-subject transfer learning performances in Tables IV-VI show that MSDT-G and MSDT-B achieved the best or second-best performance. We also investigated whether MSDT can protect the identity of the source subjects. The results in Table VIII showed that using source model parameters or APIs is much safer than using source EEG data or features.

MSDT still has some limitations. First, its performance on SEED could be improved. Second, it needs some unlabeled EEG trials from the new subject, which may limit its application to online BCIs. Additionally, developing a practical BCI system also needs to consider the training data size, model generalization, etc. Thus, our future research will: 1) introduce