Semi-Supervised Learning for Auditory Event-Related Potential-Based Brain–Computer Interface

A brain–computer interface (BCI) is a communication tool that analyzes neural activity and relays the translated commands to carry out actions. In recent years, semi-supervised learning (SSL) has attracted attention for visual event-related potential (ERP)-based BCIs and motor-imagery BCIs as an effective technique that can adapt to the variations in patterns among subjects and trials. The applications of the SSL techniques are expected to improve the performance of auditory ERP-based BCIs as well. However, there is no conclusive evidence supporting the positive effect of SSL techniques on auditory ERP-based BCIs. If the positive effect could be verified, it will be helpful for the BCI community. In this study, we assessed the effects of SSL techniques on two public auditory BCI datasets—AMUSE and PASS2D—using the following machine learning algorithms: step-wise linear discriminant analysis, shrinkage linear discriminant analysis, spatial temporal discriminant analysis, and least-squares support vector machine. These backbone classifiers were firstly trained by labeled data and incrementally updated by unlabeled data in every trial of testing data based on SSL approach. Although a few data of the datasets were negatively affected, most data were apparently improved by SSL in all cases. The overall accuracy was logarithmically increased with every additional unlabeled data. This study supports the positive effect of SSL techniques and encourages future researchers to apply them to auditory ERP-based BCIs.


I. INTRODUCTION
As a brain-computer interface (BCI) performs communication based on neural activity measurement. It is a useful tool for patients with paralysis who find it difficult to express their feelings and thoughts via body movements [1], [2]. Among devices such as functional near-infrared spectroscopy (fNIRS), functional magnetic resonance imaging (fMRI), and electroencephalogram (EEG) used to measure neuronal activities, EEGs have attracted considerable research attention due to their noninvasive monitoring potential with a high The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou . temporal resolution for capturing information of the neuronal responses [3], [4]. When a patient with paralysis operates an EEG-based BCI, a paradigm evoking event-related potential (ERP) is commonly used [5]- [7]. An ERP is evoked by a visual, auditory, or tactile rare stimulus and contains characteristic positive/negative amplitude peaks with their latencies (e.g., P100, N100, P200, N200, and P300) appearing a few hundred milliseconds after the stimulus. A P300 speller, which mainly uses the temporal features of positive amplitude peaks appearing approximately 300 ms after the aforementioned stimuli is presented, is regarded as the most well-known paradigm in BCI community [8], [9]. In order to use such ERP-based BCI regardless of the environment, combining them with wearable devices is a hot research topic for BCI researchers [10]. Some studies developed portable ERP-based BCIs through the use of wearable EEG devices [11]- [13]. These developments make routine use of ERP-based BCIs possible for not only patients with paralysis but also for healthy people.
Classifying ERPs is predicated on machine learning (ML) methods such as linear discriminant analysis (LDA) [14] and support vector machine (SVM) [15]. To classify them, the ML methods consider the temporal features including peak amplitudes and latencies among trials, and spatial features over the channels [16]. Human-specific parameters such as fatigue and concentration change during the operation of the BCI as well as on a day-to-day basis, and the characteristics of the ERP change accordingly. This is one of the central challenges of the ML methods, To deal with this challenge, an increasing number of studies have been conducted on robust universal feature extraction methods such as deep learning [17], [18]. However, if the parameters of the ML-based classifier are fixed after the initial training, it is not possible to mitigate the effects of these occasional changes. This leads to a sudden accuracy drop and results in inhibition of the routine use of ERP-based BCIs.
Semi-supervised learning (SSL) is an effective technique for adapting ML-based trained classifiers to varying patterns owing to the use of unlabeled new data. It has recently been applied to various biological signal-based applications [19]- [21] including visual ERP-based and motor imagery (MI)-based BCIs [22]- [25]. In the visual ERP-and MI-based BCIs, SSL has indicated its efficiency of adapting pre-trained LDA and SVM families: SVM [22], least squares SVM (LS-SVM) [23], spectral regression kernel discriminant analysis (SRKDA) [24], and shrinkage regularized based LDA (SKLDA) [25]. In most cases, an incremental update strategy is naively applied every time a single-trial unlabeled data is recorded. Although SSL has recently been combined with a deep learning method in an MI-based BCI [26], LDA and SVM series are still used/applied in the mainstream. BCI operation protocol generally consists of pre-training phase in which the labeled data is used for training a classifier, and testing phase in which the trained classifier infers the class of unlabeled data for controlling the system and storing the unlabeled data. In SSL, the trained classifier is updated by unlabeled data in the testing phase to tune the parameters optimally, combining older and new data. Thus, the technique can mitigate the effect of changes on human-specific parameters during BCI operation even with a shortened pre-training phase [22], [25].
In this study, we investigate the adaptability of SSL to auditory ERP-based BCIs. In patients with amyotrophic lateral sclerosis (ALS), the final stage of the disease involves the loss of eye-related functions (e.g., manipulation of eyes and opening and closing of the eyelids) [27]. Therefore, auditory paradigms have been gaining attention to meet the needs of BCI for a wider range of patient groups [28], [29]. However, to the best of our knowledge, previously proposed SSL techniques have not been applied to auditory ERP-based BCIs. In visual and auditory ERP-based BCI, the characteristic peaks of ERPs represented by N200 and P300 differ between auditory and visual stimuli [30], [31]. Moreover, the accuracy of the auditory ERP-based BCI is less than that of the visual one because of its weaker ERP evoked by the auditory stimulus [32]. These differences may cause a negative effect for the application of SSL techniques to auditory ERP-based BCIs. In contrast, if the positive effect could be verified in the auditory ERP-based BCIs, it will be helpful for BCI communities.
To evaluate the effect of SSL for auditory ERP-based BCI, we used multiple public datasets (AMUSE and PASS2D) and traditional ML algorithms for ERP-based BCIs as backbone classifiers: step-wise LDA (SWLDA) [33] and LS-SVM [23]. In addition, two advanced ML algorithms named SKLDA [34] and spatial-temporal discriminant analysis (STDA) [35] were also investigated as backbone classifiers. Employing the two datasets and four ML algorithms, we evaluate the effect of SSL for auditory ERP-based BCI. We expect that the obtained results will encourage future studies related to auditory ERP-based BCIs to use SSL techniques for their paradigms.
This paper is organized as follows. Section II details the datasets, prepossessing methods, classification algorithms including SSL techniques, and evaluation metrics. Section III presents the results and discussions of the experiments; the supervised learning techniques and SSL techniques using four ML algorithms are compared. Section IV presents the conclusions of our study.

II. MATERIALS AND METHODS
Public datasets enable us to compare the obtained results with those of previous studies. For this study, we used two public auditory ERP-based BCI datasets: auditory multiclass spatial ERP (AMUSE) [36], [37] and predictive auditory spatial speller with two-dimensional cues (PASS2D) [38]. These datasets were collected by Neurotechnology Group, Technische Universität Berlin, Germany (available from BNCI Horizon 2020 database [39]) and they have been used in many studies [40], [41]. Therefore, we utilize them for comparing our results with those of other studies, enabling future researches to compare their results with ours.
The presented auditory stimuli must be easily recognizable by all users with minimal or no loss of cognitive function. The elements of auditory stimulus are broadly divided into two factors: (1) timbre and (2) sound direction. Several studies proposed methods for differentiating recognizable timbres using synthetic beeps [37], natural speech [42], and natural sounds (i.e., ducks, birds, and frogs) [33]. Furthermore, some other studies proposed methods to differentiate the directionality of recognizable sounds by physically placing loudspeakers evenly around the user or by varying the interaural level difference or interaural time difference when using headphones [36], [38]. AMUSE and PASS2D datasets present paradigms of auditory stimuli that can recognize both VOLUME 9, 2021 FIGURE 1. (A) Experimental flow of AMUSE dataset. Forty eight trials were conducted in pre-training phase. After training classifiers, 29 to 165 trials were conducted in testing phase. (B) Stimulus selection flow of trials. Each trial consisted of 15 sequences, and each sequence consisted of 6 sub-trials. Sound duration was 40 ms, and ISI was 135 ms. (C) Explanation of stimulus sounds. Six ring sounds were presented from each six directions. (D) Experimental flow of PASS2D dataset. Twenty seven trials were conducted in pre-training phase. After training classifiers, 29 to 127 trials were conducted in testing phase. (E) Stimulus selection flow of trials. Each trial consisted of 12 sequences, and each sequence consisted of 9 sub-trials. Sound duration was 100 ms, and ISI was 125 ms. (F) Explanation of stimulus sounds. Three different tones were presented from each of the three directions (left, right, and both sides).
differences based on combinations of the aforementioned methods. Both datasets are provided in the binary data container formats used in MATLAB (.mat).

A. AMUSE DATASET
The paradigm for the AMUSE dataset was proposed by Schreuder et al. [36] and published by Höhne et al. [37] as a dataset with 21 healthy subjects. EEG data were recorded with 60 Ag/AgCl electrodes based on the placement of the International 10-20 System. Electrooculogram (EOG) data were recorded with two bipolar channels over the eyes; the reference channel was placed on the nose. These data were amplified using a Brain Products amplifier (Brain Products Co., Munich, Germany) with a sampling rate of 1000 Hz, and filtered through an analog bandpass filter (between 0.1 and 250 Hz; the publicly available data are distributed in a downsampled form to 250 Hz).
The experimental design of the AMUSE dataset is shown in Fig. 1 (A)-(C). Firstly, training data for constructing classifiers were measured in pre-training phase. The pretraining phase consists of 48 trials. Secondly, classifiers were trained to classify target and nontarget ERP features, as shown in Fig. 2. Thirdly, subjects typed digital letters as they liked using the trained classifier and the BCI system in 47010 VOLUME 9, 2021 testing phase. The trained classifiers inferred which letter the subject wanted to select. The minimum and maximum number of trials in the testing phase were 29 and 165, respectively. Lastly, the BCI performance was evaluated. The trials were composed of sequences and sub-trials shown in Fig. 1 (B). The target direction was visually and aurally instructed prior to each trial. The number of sub-trials in each sequence was equal to the kinds of sound stimuli. Discriminable synthetic beep stimuli were randomly presented from six audio speakers evenly spaced around the subject (i.e., audio speakers were arranged every 60 • ) as shown in Fig. 1 (C). Based on the idea of a classic oddball paradigm, any of the six directions could be a target (probability 16.7%), which leaves the others as nontargets (probability 83.3%). Each stimulus was presented for 40 ms. The inter-stimulus interval (ISI) was 135 ms. Furthermore, the stimulus onset asynchrony (SOA) or inter-stimulus onset interval, which is the sum of the stimulus presentation time and interval time for sound stimuli, was 175 ms.

B. PASS2D DATASET
The paradigm for this dataset was proposed and published by Höhne et al. [38] as a dataset with 12 healthy subjects. EEG data were recorded using a Fast'n Easy Cap (Easy-Cap GmbH, Inning, Germany) with 63 Ag/AgCl electrodes based on the placement of the International 10-20 System. The reference channel was placed on the nose. These data were amplified using a Brain Products amplifier (Brain Products Co., Munich, Germany) with a sampling rate of 1000 Hz and filtered through an analog bandpass filter (between 0.1 and 250 Hz; the publicly available data are distributed in a downsampled form to 250 Hz.).
The experimental design of the PASS2D dataset is shown in Fig. 1 (D)-(F). Similar to AMUSE dataset, the flow consists of pre-training phase, training the classifiers, testing phase, and BCI performance evaluation. The pre-training phase consisted of 27 trials. In testing phase, subjects typed virtual pre-defined letters until all selections were correctly perfomed. The minimum and maximum number of trials in the testing phase were 29 and 127, respectively. The trials were composed of sequences and sub-trials shown in Fig. 1 (E). The target direction was visually and aurally instructed prior to each trial. The number of sub-trials in each sequence was equal to the kinds of sound stimuli. Discriminable synthetic beep stimuli were randomly presented from headphones and were generated by three kinds of pitch (high/medium/low) and location (left/middle/right). Based on the idea of a classic oddball paradigm, any of the three pitches and directions could be a target (probability 11.1%), leaving the others as nontargets (probability 88.9%). The current target cue was presented to the subject three times before starting each pre-training recording trial, and the corresponding number on a 3 × 3 grid was further highlighted on the screen. Each stimulus was presented for 100 ms. The ISI and SOA were 125 ms and 225 ms, respectively.

C. PREPROCESSING
To compare the effects of SSL techniques for traditional and advanced ML algorithms, we performed the same preprocessing approach on the two datasets.
Each EEG signal was low-pass filtered with 40 Hz using a fourth-order Butterworth filter. Furthermore, the filtered signal was segmented into epochs from -150 ms to 800 ms based on the stimulus onset. If the peak-to-peak voltage difference of any channel was greater than 100 µV, the epoch was considered contaminated with some artifacts (e.g., motion and ocular artifacts), and it was discarded [38]. Then, the remaining epochs were downsampled to 50 Hz. Finally, we performed baseline correction using the average values of the first 150 ms data. Since the amplitude values can be used as temporal features in ERP analysis, baseline corrected 800 ms epoch (40 time points) after the onset were flattened in the channel direction and used as a feature vector x ∈ R D , where D denotes the dimensionality of the features (the number of channels and time points).

D. CLASSIFICATION
In this study, for all classification algorithms, a binary classifier to determine whether an auditory stimulus was a target or nontarget was generated based on a previous study [37]. The typical target and nontarget ERPs are described in Fig. 2. There are differences of peak amplitudes, the latencies ( Fig. 2 (A)), and spatial patterns over the channels (Fig. 2 (B)) between target and nontarget ERPs. The binary classifiers were trained to recognize such differences. It should be noted VOLUME 9, 2021 that the binary classification problem swaps the definitions of class labels, depending on the target sounds. If there are N s types of target sounds, N s binary classifiers will be trained (k = 1, . . . , N s ). In the AMUSE and PASS2D datasets, six and nine types of target sounds were presented, respectively. Thus, six binary classifiers for the AMUSE (N s = 6) and nine binary classifiers for the PASS2D (N s = 9) were trained.
The feature vectors in the pre-training phase X tr = where N L indicates the number of all pretraining trials (4,320 or 2,916), were used to train the classifiers. We did not emphasize the ERP waveform of training epochs by an averaging process. To evaluate the performance of trained classifiers, we used the averaged feature vectors over sequences in the testing phase where N U indicates the number of averaged featured vectors. In this case, an averaging process was applied to a group of data corresponding to the stimulus sound.

1) SWLDA
SWLDA is commonly used in auditory ERP-based BCI studies as a classifier, including for feature selection functions [33], [43]. The classification module of SWLDA is the same as that of Fisher LDA (FLDA), and it learns the decision boundaries for binary labeled (target and nontarget) training feature vectors X tr . In the dataset for the k-th auditory sound, N L 1 trials in class 1 are denoted by X L 1 , and N L 2 trials in class 2 are denoted by The FLDA finds the class separability in the direction w ∈ R D by maximizing, where S B ∈ R D×D , S W ∈ R D×D , and (·) T are the betweenclass and within-class scatter matrices, and the transpose operator, respectively. The between-class and within-class scatter matrices can be written as, wherex ∈ R D is the overall mean of X tr , andx L c ∈ R D is the mean of X L c . In the testing phase, the likelihood p k,j for the k-th auditory sound to the j-th testing data can be obtained by linearly multiplying the k-th learned weight vector w k by the averaged feature vector x j as, The label corresponding to the sound stimulus that presented the highest likelihood is set to 1, and the others are set to 0 to determine the prediction vectorŷ j ∈ R N s to the j-th testing data.
The feature selection module of SWLDA involves a stepwise approach that selects important feature variables automatically with a combination of forward and backward steps [44]. The p-values are calculated using least-squares regression of the feature variables. The feature variable set has a lower p-value compared to the input threshold (p-value < 0.10), and it is pooled into the FLDA module. The unpooled feature variables are added to the feature set one by one, and the p-value is updated in the FLDA module. Each time the p-value is updated, the feature variable is removed from the variable set if the p-value is greater than the threshold value (p-value > 0.15). This loop is repeated until the feature variables no longer meet the input or output criteria. Since ERPs have subject-specific shapes, the feature variables were selected for each subjects.

2) SKLDA
FLDA estimates between-class and within-class covariance matrices S b , S w . Generally, these matrices are not biased and have acceptable properties. However, when the training data is high-dimensional, the covariance matrix cannot be estimated accurately because the estimates of the maximum and minimum eigenvalues of the original covariance matrix deviate significantly.
In order to deal with this problem, SKLDA was proposed to estimate the covariance matrices more precisely through adjusting the extreme eigenvalues of the covariance matrix towards the average eigenvalue [16], [45]. To adjust the estimation error, the new covariance matrixS is calculated as follows,S where γ is a tuning parameter γ ∈ [0, 1], v is an average eigenvalue of S, and I is the identity matrix. In this study, we used optimal tuning parameter calculated using the following equation [34], where s ij is the element in the i-th row and j-th column of covariance vector S and z ij (n) is computed as follows.
where (x n ) i and (x) i are the i-th element of the feature vector x n and mean vectorx, To train a weight vector, SKLDA adopts vectorized (one-way) training samples with (M × P) dimensions where M denotes the number of channels and P denotes the temporal points of each channel. Similar to that in LDA, the training data is high dimensional.

3) STDA
In traditional classifiers for auditory ERP-based BCI such as SWLDA and LS-SVM, D(= MP)-dimensional samples where each sample is the concatenation of P temporal features from each of M channels have been used as training and testing data. Therefore, the dimensionality tends to be very high.
STDA adopts spatial-temporal (two-way) training samples X i ∈ R M ×P (i = 1, 2, · · · , N ) to separately find the two projection matrices from spatial and temporal dimensions and effectively reduces the original feature dimension for discriminant analysis [35]. The spatial and temporal projection matrices are expressed as, W 1 ∈ R M ×L and W 2 ∈ R P×L respectively. Here, L should be lower than M and P. The projection matrices are optimized using traditional LDA algorithm. After the projection matrices are optimized asW 1 andW 2 , a training data is transformed to a lower dimensional sample as: where f i ∈ R L 2 and vec(·) denotes the vectorization operator.
× 1 and × 2 denote multiplication of matrices in a particular dimension. The projected samples are used for training an LDA classifier.

4) SSL WITH SWLDA, SKLDA, AND STDA
In SSL, the backbone pre-trained SWLDA, SKLDA, and STDA update their weight vectors using the unlabeled data measured in testing phase. These updates are based on incremental update strategy that updates with each one-trial testing data. We refer to it as SSL in this study. Each testing dataset inferred as to which class it belongs to by the classifier trained until the previous trial. The estimated labels are not true labels but called pseudo-labels and used for re-training the classifiers. SSL uses training data and pseudo-labeled ''original'' testing data. Here, original testing data indicates that the data are not subjected to averaging process across trials.
The three LDA-based classifiers commonly use Eqs. (1), (4), and (5) for testing data. From Eq. (5), we can determine the stimulus-induced ERPs contained in the averaged testing data x j . We assign binary pseudo-labels (i.e., target and nontarget) to all original testing data based on the estimation result for the averaged testing data, merge a set of original testing data and pseudo labels into the original training data, and retrain the model. Not only the classification module but also the important feature variables obtained in the stepwise approach and projected samples in STDA projection approach are retrained by SSL.

5) LS-SVM
SVM constructs a hyperplane or a set of hyperplanes using the first layer of vectors closest to the plane, called support vectors. This classification algorithm is frequently used in BCI communities [15], [46]. Instead of solving computationally intensive quadratic programming of traditional SVM, LS-SVM, which uses a set of linear equations for training, has been proposed [23], [47]. The optimization problem of LS-SVM is: where ϕ(x i ) denotes a mapping function from the original space to a high-dimensional space, e i represents an error for the i-th feature, γ denotes the regularization term (γ ≥ 0), and b represents the bias term. The labels here are the binaries of −1 and 1 (i.e., y i ∈ {−1, 1}). Introducing a positive-definite kernel K ∈ R N L ×N L , the problem in Eq. (10) can be represented as a set of linear equations: where y ∈ R N L denotes the label vector; all elements in vector 1 ∈ R N L are 1. I ∈ R N L ×N L denotes an identity matrix, and a ∈ R N L represents the dual variable vector.
The optimal parameters for a and b are calculated by, where the model matrix H ∈ R N L ×N L is represented as, The decision function of LS-SVM can be expressed as, In this study, we used the linear kernel. LS-SVM was applied for each sound stimulus; thus, N s decision functions were calculated and the auditory sound tied to the highestvalue function was selected as the estimated target for the testing data x j .

6) SEMI-SUPERVISED LS-SVM
Semi-supervised SVM for P300 BCI was proposed by Li et al. [22]; Subsequently, Gu et al. [23] reduced the computational cost by introducing LS-SVM, and they developed the application by sequentially updating the self-training algorithm in an online P300 BCI called SUST-LSSVM. By adding new unlabeled feature vectors in the testing phase {x j } N U j=1 to the original feature vectors {x i } N L i=1 , the model matrix H can be updated using, To update the parameters in SUST-LSSVM, the predicted labels for new data {y j } N U j=1 obtained using the initial pretrained LS-SVM classifier were merged with the initial labels . Similar to the semi-supervised SWLDA, the important feature variables found by a step-wise approach with LDA were applied in SUST-LSSVM.

E. EVALUATION
The effectiveness of the SSL techniques was evaluated based on two metrics: BCI accuracy and information transfer rate (ITR). The BCI accuracy indicates how accurately the user can select the desired command, which purely evaluates the performance of the supervised/semi-supervised classification algorithms. ITR is proportional to BCI accuracy; however, it evaluates the performance of the BCI on a timescale that considers the number of commands per minute, thereby allowing a direct comparison of performance with previous studies that use the same/other BCI frameworks, unlike BCI accuracy.
For each test data, the number of data in the pre-training phase is validated in an unchanging state (i.e., the trained classifier remains unchanged over all testing data); however, in SSL, each test data can be added to the training dataset. Thus, we incrementally added test data to training datasets and retrained the classifier and plotted the changes in BCI accuracy and ITR.

1) BCI ACCURACY
Considering the application for the BCI system, the BCI accuracy should be defined as the accuracy of selecting a correct command or a character connected to one sound stimulus in each trial. For each trial of the testing data, we intentionally defined one sound stimulus as the target and the other (i.e., N s −1) sound stimuli were classified as nontargets. Since only one auditory sound is identified as a target label per trial, if a nontarget is targeted and misclassified, the target label is automatically misclassified as a nontarget as well. Under this unbalanced data scenario, the correct classification rate of the target data is of paramount importance when assessing the performance of the BCI. For example, subject ''ja'' included in the PASS2D contains 72-trial data (i.e., it contains 72 target data). If all target label data are classified correctly, the BCI accuracy will be 72/72 = 100%. This value was used as a metric of the accuracy of the BCI as with other BCI studies [5], [22].

2) ITR
A well-known metric in BCI studies is ITR, which can fit letter choices, typing speed, and classification accuracy into a single value [2], [48]. It is important to evaluate from a perspective other than classification accuracy because the classification accuracy tends to increase with a reduction in the number of classes and an increase in the number of sequences. However, such paradigm manipulation lowers the number of choices of letters and typing speed. ITR allows us to evaluate the amount of information per unit time (second) of an implemented BCI system quantitatively, independent of the paradigm and the dataset. It can be calculated as, where T denotes the average time for a command selection, N denotes the number of selectable commands, and P represents the BCI accuracy (0 ≤ P ≤ 1). The first term in Eq. (19)-log 2 N -denotes the information for N ; the second term-P·log 2 P-denotes the information entropy defined by Shanon [49], and the last term denotes the information for the misclassification rate.

3) CHANGES IN BCI PERFORMANCE THROUGH SSL
First, we designed a backbone supervised learning classifier using all labeled data (pre-training phase data) and then plotted BCI performance (BCI accuracy and ITR) of SSL for each trial to evaluate the change in BCI accuracy and ITR when unlabeled data (testing phase data) incrementally provided pseudo-labels. Each plot denotes five trials, as displaying all plots would be too detailed. Note that because the number of trials in the testing phase varies between subjects, the length of the plot also varies between subjects. To evaluate the effect of the two different categorical independent variables (classifier type and learning method), we used a two-way analysis of variance (ANOVA), which allows us to evaluate the main effect of each independent variable and the interaction between them. We noted that there was no interaction between the factors of classifier and learning method (F(3, 21) = 0.05, p = 0.98). There was no significant difference for the main effect of the classifiers (F(3, 21) = 0.11, p = 0.96). The main effect of learning method was probably significant considering that the p-value is under 0.1 (F(1, 21) = 2.80, p < 0.10) [50], [51]. Therefore, regardless of the classifiers, SSL techniques probably result in positive effect for AMUSE dataset.

III. RESULTS AND DISCUSSIONS
In the PASS2D dataset, the four classifiers (SWLDA, SKLDA, STDA, and LS-SVM) showed the averaged BCI accuracy as 94. 34  The two-way ANOVA revealed that there was no interaction between the factors of classifier and learning method (F(3, 10) = 0.07, p = 0.97). There was no significant difference for main effect of the classifiers (F(3, 10) = 0.29, p = 0.83). On the other hand, there was significant difference due to the main effect of learning method (F(1, 10) = 5.42, p = 0.02). For PASS2D dataset, the positive effect of SSL technique was apparently confirmed.
The overall accuracy of AMUSE dataset was lower than that of PASS2D dataset. SKLDA classifiers with SSL showed the best performance in AMUSE dataset (see Fig. 3). In addition, the improvement rate for SKLDA classifiers was the highest among four classifiers. It is known that SKLDA is effective when the number of samples is extremely lower compared to the feature dimensionality [16]. However, the number of training samples in the dataset is considered to be sufficient. Therefore, we speculate that the characteristics of the ERPs were changed during the BCI experiments, and the ideal sample for calculating the exact covariance matrix was insufficient for each trial compared to the feature dimensionality. Using SKLDA, the covariance matrices could be regularized and adjusted as ideal and would lead to the high improvement rate of SSL. SWLDA classifiers showed the best performance in PASS2D dataset (see Fig. 4). It is assumed that the ERPs are more easily classified in PASS2D dataset. In this case, SWLDA classifiers are probably ideal for sophisticated data in which the ERPs tend to be  clearly classified. It is difficult to choose the best classifier for auditory ERP-based BCI because the superiority or inferiority can be changed by the BCI paradigms and proficiency of subjects in controlling the BCI. However, it is clear that SSL could significantly improve the performance of auditory ERP-based BCIs.

B. CHANGES IN BCI PERFORMANCE THROUGH SSL
To investigate the effect of SSL in more detail, we plotted the changes in BCI accuracy and ITR when the number of unlabeled data in testing phase for re-training increased. Fig. 5 shows the BCI accuracy and ITR averaged across subjects. Since the number of testing data varied from subject to subject, the results of other subjects were supplemented according to the subject with the largest number of tests VOLUME 9, 2021 (i.e., if 150 trials was the maximum, the results of the subject who has only 120 trials were concatenated with the same results as 120-th trial from 121-st to 150-th one). The vertical axis denotes accuracy and ITR. The horizontal axis is the number of trials of unlabeled data added in SSL; thus, the leftmost points are the results of classifying testing data using supervised learning.
Similar to previous studies, the accuracy was logarithmically increased by SSL [22], [24]. The tendency was more remarkable for AMUSE dataset because of its lower overall accuracy (i.e., there is still room to grow). From these results, it is clear that SSL has the potential to improve the performance of auditory ERP-based BCIs, especially with a small number of unlabeled data.

C. DIFFERENCES OF EFFECT OF SSL AMONG SUBJECTS
Figs. 6 and 7 plot the changes in BCI accuracy and ITR of each subject as the number of unlabeled data added for training is increased incrementally in the AMUSE and PASS2D datasets, respectively. The horizontal axis is the number of trials of unlabeled data added in SSL; thus, the leftmost points are the results of supervised learning. Unlike the previous subsection, the number of unlabeled data is shown uncorrected in order to make comparisons for each individual. In both figures, each graph denotes a classifier ((a) SWLDA, (b) SKLDA, (c) STDA, and (d) LS-SVM). Although the BCI performances of most subjects were incrementally improved by the four classifiers, for each subject, each classifier showed different tendencies. For example, the accuracy of subject ''fch'' in Fig. 6 was incrementally improved by LS-SVM. However, other three classifiers decreased the accuracy in the middle or end of the incremental updating. On the other hand, the accuracy of ''fcl'' seems to be incrementally increased by the SWLDA, SKLDA, and STDA, but not by the LS-SVM. In previous studies, the dimension of features and parameters of classifiers are commonly optimized for each subject [52], [53]. This results suggest that we will need to select the best classifier for each subject. To analyze the effect of SSL from the viewpoints of spatial patterns of ERPs, the topological maps of P300 and weight vectors w of the SKLDA classifier are shown in Fig. 8. The improvement of BCI performance was remarkable for subject ''fcj'' by SKLDA classifier in AMUSE dataset; thus, we selected this setting for the visualization. In addition, the dimensions of the weight vectors of SWLDA, STDA, and LS-SVM were reduced from the MP-dimensions, but the weight vectors of SKLDA remained full (2900 for AMUSE dataset and 3150 for PASS2D dataset), hence the weights could be visualized as topography. The feature dimensions selected by the SWLDA was 972 for AMUSE dataset and 1108 for PASS2D dataset averaged across subjects and trials. The selected dimensions were also used for LS-SVM. STDA projected the feature vectors from the full dimensionality to four dimensions (L = 2) after finding two projection matrices. The color gradation of left four maps in the figure indicates the strength of peak amplitude of P300. In contrast to the color gradation of nontarget which is not extremely 47016 VOLUME 9, 2021 different between pre-training and testing phase, the strength of target P300 was apparently increased in parietal area in testing phase compared to pre-training phase. Following the changes of P300, the value of weight vectors in position FCz was increased in testing phase compared to pre-training phase (see right maps in Fig. 8). Halder et al. showed that there is training effect in auditory ERP-based BCI by which the peaks of target P300 in parietal become stronger as testing sessions are increased [54]. It suggests that one of the reasons why the SSL improved the accuracy is, the updating weight vectors followed the increase of peak strength of P300 in parietal area.
The changes of topographic maps of P300 and weight vectors of the SKLDA classifier for subject ''nw'' of PASS2D dataset is shown in Fig. 9. We assessed the reason of accuracy decreasing in SSL during testing phase by checking topological maps. The leftmost topographic maps show the weight vectors of SKLDA and target P300 in pre-training phase. The other maps show the sequential changes of updated weight vectors and P300. The subject ''nw'' has 96 trials in testing phase, thus 95-th trial is the last update of weight vector. As shown in Fig. 7, the accuracy was decreased while updating weight vectors by SKLDA with SSL. The reason can be explained by the Fig. 9. The strength of peaks of P300 in parietal area was decreased around 24-th trial and increased again after 48-th trial. This changes indicated that the weight vectors trained in pre-training phase is more suitable for the P300 after 48-th trial than updated weight vectors at 24-th trial. During the testing phase, subjects were forced to be more focused. Changing from the pre-training phase to the actual BCI test task may cause some subjects to feel fatigue and emotional changes after the pre-training phase. They indicated that the BCI causes considerable fatigue, and would not prefer to use it very often [6]. Therefore, it is difficult to increase the peak of P300 linearly and it probably leads to the negative effect of SSL. Interestingly, the decrease of BCI performance was mitigated when the STDA and LS-SVM classifiers were applied. The projected features of STDA and support vectors of LS-SVM probably prevented the reduction. It suggests that combining feature extraction methods and the SKLDA would be effective for solving the accuracy reduction issue. Furthermore, applying an SSL technique after collecting some trial unlabeled data instead of every trial would be useful.
Tables 1 and 2 summarise individual accuracy and ITR of classifiers when all unlabeled data were used for SSL and were not used for re-training classifiers. From subsections III-A and III-B, SSL has been proven to improve the averaged BCI performance of auditory ERP-based BCI. However, when we focus on each individual, some subjects' accuracy decreased by SSL. In addition, different subjects resulted in the decline of BCI performance depending on the classifiers. It would be necessary to optimize the selection of classifiers for each individual as well. It has also been confirmed in previous studies that the trend of improvement in BCI accuracy stops and the accuracy stabilizes as the number of additional unlabeled data increases [22], [55]. This means that the more the unlabeled data increased, the weaker the effect of SSL becomes. The used datasets included a sufficient number of pre-training data. Therefore, the parameters of classifiers would have already been sophisticated before re-training and the SSL not changing the parameter significantly would lead to the stable BCI accuracy.

D. COMPARISON WITH PREVIOUS STUDIES
We calculated ITR to compare our results with other previous researches, even if their paradigms are different. A previous study using SKLDA showed that an SSL technique improved the ITR of MI-based BCI from 0.32 bits/min to 0.84 bits/min [25]. Our results using SKLDA classifiers showed that the ITRs were improved from 6.37 bits/min to 7.15 bits/min in AMUSE dataset, and from 5.22 bits/min to 5.53 bits/min in PASS2D dataset. Furthermore, there has been a previous study on visual ERP-based BCI using SVM classifiers. Li et al. showed that the ITR improved from 9.12 bits/min to 14.03 bits/min [22]. Compared to the study, our results showed less improvements of ITRs using LS-SVM classifiers (from 6.44 bits/min to 6.97 bits/min in AMUSE VOLUME 9, 2021  dataset and from 5.28 bits/min to 5.46 bits/min in PASS2D dataset). It might be attributed to the difference of number of data used for training backbone classifiers. In the previous study, there were only three trials of training data. Thus, the effect of the new data and the range of improvement in accuracy would have been greater. Although it is not possible to show a consistent trend because the experimental setting is not the same across studies, this study could show that SSL is effective for the four mentioned classifiers in auditory ERP-based BCIs.

E. CALCULATION TIME
In the case of the setting mentioned in the previous section, the learning process of the SSL techniques is performed on a trial-by-trial basis in the testing phase by adding testing data to the data obtained in the pre-training phase. Here, the high computational cost of the incremental re-training makes command selection time consuming, and it has a negative impact on the performance of the BCI. Thus, it is important to clarify the relationship between the addition of unlabeled data and the computational cost of re-training. Fig. 10 shows the computation time for re-training four classifiers by SSL techniques. To determine the computation time, a laptop with 64-bit Windows 10 was used for the simulation. The CPU was an Intel Core i7-10710U, 16 GB of RAM, and data were recorded and read on a 512 GB SSD. The analysis code was implemented in Python 3.6.  The top table and the bottom table  present the results for AMUSE dataset and PASS2D dataset, respectively. During the actual ERP-based BCI operation, there is a possibility that re-training for each command selection may be required in time. In the BCI system, the presentation of the whole auditory stimuli is followed by estimation of the target sound. Therefore, there is one trial between the completion of the command selection and the start of the next command estimation. If the calculation time for re-training is less than the time length of the trial, the negative effect of SSL techniques on BCI performance can be avoided. One trial of sound stimulation required 15.62 s for the AMUSE dataset and 30.25 s for the PASS2D dataset in the testing phase. They are shown in Fig. 10 as purple dash line.
Among the four classifiers, SWLDA had the lowest computational cost, and the re-training time did not exceed the  The top table and the bottom table present the  results for AMUSE dataset and PASS2D dataset, respectively. stimulus time for both datasets. i.e., even a paradigm other than an ERP-based BCI paradigm could use SWLDA for semi-supervised BCI without loss of interval training time. The plots show that it is easy to infer class label of testing data within the acceptable processing time for SWLDA, SKLDA, and LS-SVM, but difficult for STDA. For SWLDA and SKLDA, re-training time increases linearly. Therefore, the number of additional unlabeled data affecting the BCI performance can be predicted easily. The calculation cost of LS-SVM increases with the number of total training samples (the complexity order is O(N 3 ) [56]). The time length was initially less than one second, but it increased drastically following the increasing of trials and exceeded the time limit for selecting one command in the AMUSE and PASS2D datasets. The estimation of the processing time of STDA is more difficult. STDA optimizes two projection matrices, and transforms the original samples into low-dimensional features, and re-trains new classifier. The more the data is accumulated, the more difficult it is to optimize the projection matrices. Such a method with multiple processes seems to be incompatible with SSL.
In this study we updated the classifiers in every trial in testing phase. However, updating in every trial is not necessary. Updating classifiers every few trials is also expected to raise the BCI performance. Therefore, we should determine the timing of applying SSL in actual online BCI applications considering its processing time. We hope that our results concerning processing time will help the future designs.

F. LIMITATIONS
There are three limitations in this study: (1) Lack of consideration of subjects' motivation: The simulation results do not consider using an SSL technique at the time of measurement. If the subject participated in the experimental task, the online testing phase shows the results on the screen or speaker and notifies the subject regarding the command that was selected. This feedback effect raises the motivation of subjects and improves the BCI accuracy [31], [57]. If more accurate feedback is returned for SSL, subjects may be more motivated and focused on the task, which can result in more discrimination of unlabeled data. Therefore, it is necessary to prepare a BCI that incorporates SSL and validates the case of online testing and measurement simultaneously.
(2) Variation in the number of trials during the testing phase: The number of trials of the labeled data (in the pretraining phase) was the same across subjects in both datasets (AMUSE dataset: 48 trials, PASS2D dataset: 27 trials). However, the number of trials in the testing phase varied across subjects, thereby making it impossible to assess the same number of trials. In particular, a few subjects had extremely low numbers of unlabeled data compared to other subjects (see Figs. 6 and 7). Since matching the number of unlabeled data to the lowest state resulted in a very small number of trials that could be validated, the equalization process was discarded in this study, and the incremental update strategy was conducted using the number of trials conducted for each subject. One future work is to investigate the effect of longer-term SSL with some amount of data for each subject.
(3) Data reliability: All data, including the original training data and the newly added unlabeled data, were treated uniformly. Therefore, the quality of the data is not considered, and there is a concern that the effect of poor quality data may be significant. Furthermore, when the estimated labels of unlabeled data in the testing phase are different from the true labels, the accuracy of the classification model will decrease. The problem is related to user's motivation or classifier's incompleteness; if the user is not focused on the selection of commands or is disheartened while playing BCI, ERPs will not occur and all data will be classified as nontarget. In fact, immature classification models often assign incorrect labels to unlabeled data. Our evaluation showed that the accuracy worsened with SSL based on a low-accuracy classifier before re-training started (see Tables. 1 and 2). To avoid such a decrease in accuracy, it may be useful to calculate measures such as data reliability and reflect those values as constraints in SSL. This method using unlabeled data more intelligently is called safe SSL [58], [59], and it has the potential to improve our results in future studies.

IV. CONCLUSION
This study investigated the adaptability of SSL to auditory ERP-based BCIs. Four classifiers (SWLDA, SKLDA, STDA, and LS-SVM) and their SSL expansion versions were applied to two public datasets (AMUSE and PASS2D).
As subject averages, the BCI accuracies and ITRs for all four classifiers were improved by combining SSL techniques into the backbone classifiers. Thus, we verified the positive effect of SSL on auditory ERP-based BCIs regardless of traditional and advanced ML classifiers. However, the sequential plots, which showed the changes in BCI accuracy with incremental updates of the classifiers and the changes in EEG topographical maps, suggested that the BCI accuracy was sometimes deteriorated by SSL especially in the beginning of testing phase for a few subjects because of weakening of the ERP. In addition, it was revealed that except SWLDA, the remaining three classifiers cannot be applied for updating parameters in every trial of auditory ERP-based BCIs because of their high computational cost. Thus, auditory ERP-based BCI designers should update backbone classifiers by SSL techniques when some trials were finished and should make SSL techniques more stable. We expect that our study will encourage future researchers working on auditory paradigms to use SSL techniques and that it will help improve the performance of their BCI frameworks. VOLUME 9, 2021