Cross-Subject Tinnitus Diagnosis Based on Multi-Band EEG Contrastive Representation Learning

Electroencephalography (EEG) is an important technology for exploring the central nervous mechanism of tinnitus. However, it is hard to obtain consistent results across previous studies due to the high heterogeneity of tinnitus. In order to identify tinnitus and provide theoretical guidance for its diagnosis and treatment, we propose a robust, data-efficient multi-task learning framework called Multi-band EEG Contrastive Representation Learning (MECRL). In this study, we collect resting-state EEG data from 187 tinnitus patients and 80 healthy subjects to generate a high-quality, large-scale EEG dataset for tinnitus diagnosis, and then apply the MECRL framework to the generated dataset to obtain a deep neural network model that accurately distinguishes tinnitus patients from healthy controls. Subject-independent tinnitus diagnosis experiments show that the proposed MECRL method is significantly superior to other state-of-the-art baselines and generalizes well to unseen subjects. Meanwhile, visualization experiments on key parameters of the model indicate that the electrodes with high classification weights for tinnitus EEG signals are mainly distributed in the frontal, parietal and temporal regions. In conclusion, this study facilitates our understanding of the relationship between electrophysiological and pathophysiological changes in tinnitus and provides a new deep learning method (MECRL) to identify neuronal biomarkers of tinnitus.


I. INTRODUCTION
ELECTROENCEPHALOGRAPHY (EEG) is a medical technique that reads scalp electrical activity generated by brain structures. It has been shown to represent the macroscopic activity of the surface layer of the brain underneath [1], [2], [3], [4]. Owing to its non-invasiveness and real-time reflection of brain state [5], EEG has been widely applied in exploring the central mechanism, diagnosis and treatment of tinnitus. Tinnitus is the self-perception of sound in the absence of an external sound source. Previous studies regarded tinnitus as a lesion of the cochlea or auditory pathways. However, many imaging studies have shown that tinnitus involves not only the auditory cortex but is also a disorder of central plasticity involving a large number of non-auditory brain regions and network connections [6]. Abnormalities in the temporal, frontal, parietal, cingulate and other brain regions across different EEG frequency bands have been widely studied [7], [8]. For example, Schmidt et al. discovered that network connectivity with the precuneus is an invariant marker of long-term tinnitus [9]. Araneda et al. found that the executive function deficit caused by changes in the prefrontal cortex may be a key factor in the generation and persistence of tinnitus [10]. Other studies have shown that these central changes associated with tinnitus reflect dysfunction of the default mode network (DMN) [11], [12]. The default mode network may treat tinnitus as the norm, which is an important reason why tinnitus persists [13]. However, due to the complex central mechanism of tinnitus and its numerous influencing factors, many previous studies have not reached comprehensive, consistent results. Therefore, many attempts have been made to diagnose tinnitus from EEG signals using machine learning and deep learning methods [14], [15].
For the EEG-based machine learning methods, spectral features of the EEG signals are first extracted by methods such as the fast Fourier transform (FFT); feature selection is then performed based on clinical experience, and the selected features are supplied to machine learning models (e.g., K-nearest neighbors (KNN), Naive Bayes and Support Vector Machine (SVM)) to complete classification or regression tasks [16], [17], [18]. Some methods, including Random Forest [19] and XGBoost [20], take advantage of automatic selection of optimal feature combinations and achieve promising performance. However, these machine learning methods hard-encode the EEG signals and fail to comprehend the semantic information within them, so it is difficult to further improve their performance. On the other hand, owing to the unparalleled fitting capabilities of deep neural networks and the structural similarity between artificial neurons and real nerve cells, more and more studies have developed EEG-based deep learning methods for various downstream tasks [14], [21], [22], [23]. Unfortunately, the high individual variability of EEG [24] significantly limits the generalization capabilities of these methods, which leads to a wide performance gap between seen and unseen subjects [25]. That is why many EEG-based deep learning models remain restricted to the laboratory stage and have seldom achieved large-scale adoption.
In order to address the above issues and deal with the high individual variability of EEG signals, we take tinnitus diagnosis as a representative task and propose a robust, data-efficient multi-task learning framework called Multi-band EEG Contrastive Representation Learning (MECRL). The MECRL framework consists of four components, namely the EEG data sampler, multi-band EEG encoder, feature fusioner and linear classifier, and three layer-by-layer progressive learning tasks are designed to capture the semantics of EEG. The whole training process of MECRL consists of a representation learning process and a classification process. During the representation learning process, the EEG data sampler generates various mini-batches by randomly sampling subjects and segments from the dataset, applying different sampling strategies to support different auxiliary tasks. The multi-band EEG encoder is composed of a spatial module and a temporal module, which automatically integrate the multi-dimensional information of EEG (e.g., the spatiotemporal and frequency domains) and extract discriminative multi-band features. The feature fusioner fuses the extracted multi-band features into a unified representation. During the classification process, the linear classifier is attached after the encoder and the whole model is fine-tuned via the supervised learning paradigm.
A tinnitus dataset is collected by recruiting 187 tinnitus patients and 80 healthy subjects from the department of Otolaryngology, Sun Yat-sen Memorial hospital, Sun Yat-sen University. Experiments are conducted on this tinnitus dataset. The results show that the proposed MECRL framework achieves superior performance over the state-of-the-art baseline models in the cross-subject tinnitus diagnosis task because it not only takes the multi-dimensional information of EEG into account, but also utilizes multiple self-supervised learning tasks to help the encoder understand the semantics of EEG and align the individual variability. Furthermore, the results demonstrate the validity of the data from high-weight electrodes in tinnitus diagnosis, which can undoubtedly help us explore the causes of chronic tinnitus.
Overall, the key contributions of this paper are as follows:
- We have collected a resting-state EEG dataset by recruiting 187 tinnitus patients and 80 healthy subjects from the Department of Otolaryngology, Sun Yat-sen Memorial Hospital, Sun Yat-sen University. Compared with previous studies [14], [26], the dataset has higher spatial precision (more electrodes) and a larger number of subjects, which eases the high individual variability of EEG data when applying deep learning methods and makes it possible to carry out a larger range of subject-independent experiments to uncover macroscopic EEG differences between chronic tinnitus patients and healthy individuals. The dataset can be accessed at drive.google.com/drive/folders/1Su1IWGyZlED-lINUTVGcY9dIL_1k1ip_.
- A novel multi-task learning framework called Multi-band EEG Contrastive Representation Learning (MECRL) is proposed. The framework helps the deep residual network-based encoder understand the semantics of EEG data and align the individual variability by constructing different self-supervised auxiliary tasks. After training, the model generates discriminative representations that are used to distinguish tinnitus patients from healthy controls accurately.
- Subject-independent comparison experiments and visualization experiments are conducted on the collected EEG dataset to demonstrate the superiority of our framework for cross-subject tinnitus diagnosis compared with state-of-the-art methods. Meanwhile, the visualization of the key parameters of the model helps us explore the central nervous mechanism of chronic tinnitus together with our medical collaborators.

The rest of this paper is organized as follows. Section II describes the proposed MECRL framework. Section III presents the details and results of the subject-independent experiments. Section IV visualizes the key parameters of our model and discusses the pathogenic mechanism of chronic tinnitus. Section V concludes the paper.

II. METHODS
The Multi-band EEG Contrastive Representation Learning (MECRL) framework consists of four components, namely the EEG data sampler, multi-band EEG encoder, feature fusioner, and linear classifier. The first three components constitute the representation learning process, which is shown in Fig. 1. To help the encoder learn the semantics of EEG and align the individual variability, the following three contrastive learning tasks are designed:
- Task A: Because the EEG signals of tinnitus patients differ significantly from those of healthy controls, the features extracted from different tinnitus patients should be more similar to each other than to those extracted from healthy controls. Task A is a subject-wise task.
- Task B: Due to the individual variability, the features extracted from EEG segments of the same subject should be more similar than those extracted from different subjects. Task B is designed to allow the model to identify individual biases so that it does not confuse them with the general differences associated with tinnitus disorders. It is a segment-wise task.
- Task C: Because multiple clips obtained by frequency division of the same raw EEG segment are temporally consistent, the features extracted from the same segment should be more similar than those extracted from other segments. Task C is a band-wise task.

In order to support the various auxiliary tasks, the EEG sampler uses different sampling strategies to generate mini-batches; at each step of the representation learning process, the corresponding mini-batch is fed to the model and the corresponding contrastive loss is calculated. By minimizing the contrastive loss, the model parameters are continuously optimized, so that the model can integrate multi-dimensional information from the EEG signals and learn discriminative representations across subjects.

Task A is a supervised contrastive learning task [27], while tasks B and C are self-supervised contrastive learning tasks [28]. The three tasks correspond to different aspects of the EEG semantics in tinnitus diagnosis. Moreover, the similarity between the positive samples of these three contrastive learning tasks is progressive, which means we can obtain a more semantically informative and uniformly distributed feature space with appropriate parameter settings, especially the temperature coefficients [29]. This enables the multi-band EEG encoder to extract higher-quality and more robust cross-subject representations. The whole training process consists of the representation learning process and the classification process. During the representation learning process, the multi-band EEG encoder is trained with the designed contrastive learning tasks. To assist each task, the EEG sampler applies a different sampling strategy per task to build distinct mini-batches, which enriches the model's learning materials. After the representation learning process, a mapping is established from EEG segments to discriminative feature vectors. In the classification process, we add a linear classifier after the multi-band EEG encoder. The labeled data and a cross-entropy loss are used to train the linear classifier and fine-tune the encoder. Finally, the well-trained model is applied to classify tinnitus patients and healthy controls.
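Concretely, the self-supervised tasks B and C follow the normalized-temperature cross-entropy (NT-Xent) form of [28]. The sketch below is our NumPy illustration of that general form, not the authors' code; `nt_xent_loss` and its arguments are hypothetical names:

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.07):
    """NT-Xent contrastive loss for N positive pairs (z1[n], z2[n]).

    z1, z2: (N, D) feature arrays; tau: temperature coefficient.
    Every other sample in the 2N-sized batch acts as a negative.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # for cosine similarity
    sim = z @ z.T / tau                                # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    # index of the positive partner for each of the 2N anchors
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Well-aligned positive pairs (e.g., two clips of the same segment in task C) drive this loss down, while random pairs keep it near log(2N-1).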
In the following subsections, we will introduce the components of the MECRL framework in detail.

A. EEG Data Sampler
EEG datasets tend to contain a small number of individuals, which makes it difficult for EEG-based deep learning models to capture discriminative features and learn task-related representations across subjects due to overfitting. To address this issue, some studies divide the EEG data into isometric segments and treat these segments as independent samples to enlarge the dataset. However, this practice introduces personal variance when segments from a single subject appear multiple times in the dataset, which may confuse the model, leading it to mistake individual characteristics for universal ones. To solve this problem and support the multi-task learning framework, we design an EEG data sampler that generates mini-batches by randomly sampling subjects and segments from the dataset. Suppose that the EEG data in the tinnitus dataset is $\{X_i^c \in \mathbb{R}^{B \times M \times T_i}\}$, where $c$ denotes the class of the data, $B$ is the number of bands, $M$ is the number of electrodes, and $T_i$ is the number of time points of the EEG signal from subject $i$. In order to adapt to the multi-task learning framework, the EEG data sampler adopts the following three sampling strategies to support the different contrastive learning tasks.
- Strategy A: To explore the general difference in EEG signals between tinnitus patients and healthy controls, the sampler first randomly samples equal numbers of tinnitus subjects and healthy subjects. A single EEG segment is then sampled from each selected subject, so that data from the same individual never appears twice in the mini-batch. This helps the EEG encoder capture the general differences in the EEG signals between tinnitus patients and healthy controls and avoids introducing individual biases.
- Strategy B: To help the model recognize individual biases, the sampler first randomly samples two EEG segments from a single subject as positive samples, and then constructs negative samples by sampling segments from multiple separate subjects.
- Strategy C: To capture the temporal consistency among the multiple clips obtained by frequency division of the same raw EEG segment, the sampler first randomly samples clips of a raw EEG segment from a single subject as positive samples, and then constructs negative samples by sampling clips from other segments of the same subject.

By applying different sampling strategies, the EEG data sampler provides abundant learning materials to the model. Moreover, the random sampling and the coordination of multiple sampling strategies introduce additional stochasticity into the training process, which makes it easier for the model to escape from local optima and avoid overfitting, and also makes the learned representations more robust.
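As an illustration, strategy A can be sketched as follows. This is a minimal example with hypothetical data structures (dicts mapping subject IDs to segment lists), not the authors' implementation:

```python
import random

def sample_strategy_a(tinnitus, healthy, n_per_class, rng=random):
    """Strategy A: equal subjects per class, one segment per subject.

    tinnitus, healthy: dicts mapping subject id -> list of EEG segments.
    Returns (segments, labels); no subject contributes more than once,
    so individual biases do not enter the mini-batch.
    """
    batch, labels = [], []
    for subjects, label in ((tinnitus, 1), (healthy, 0)):
        chosen = rng.sample(list(subjects), n_per_class)  # distinct subjects
        for sid in chosen:
            batch.append(rng.choice(subjects[sid]))       # one segment each
            labels.append(label)
    return batch, labels
```

Strategies B and C differ only in how positives and negatives are drawn (within-subject pairs and within-segment clips, respectively).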

B. Multi-Band EEG Encoder
The multi-band EEG encoder takes multi-band EEG segments as input and transforms them into subject-independent representations related to tinnitus diagnosis. Structurally, it is composed of multiple deep neural networks in parallel, each responsible for processing the EEG data of a single frequency band. In order to integrate the rich spatial and temporal information in EEG data, each network consists of a regularized spatial convolution layer (the spatial module), which extracts the spatial correlation of the electrode positions, and a deep residual network (the temporal module), which explores the dependence of the EEG signals in the time domain. These two modules are described in the following subsections.
1) Spatial Module: Tinnitus is considered by mainstream scholars to be a long-term neurological disorder, which may be caused by abnormal neural activities involving multiple brain regions and frequency bands. To explore the functional connections among different brain regions and remove redundant electrodes, we utilize a spatial module containing a regularized spatial convolution layer to assign weights to the different electrodes. The module can be formulated as:

$$\tilde{X}_{ij}^{b} = f_{\mathrm{spatial}}\big(X_{ij}^{b}; W^{b}\big)$$

where $X_{ij}^{b} \in \mathbb{R}^{M \times T}$ denotes the $j$-th EEG segment of subject $i$ in the $b$-th frequency band, $W^{b} \in \mathbb{R}^{K_1 \times M \times 1}$ denotes the convolution kernel of the spatial convolution in the $b$-th neural network, and $\tilde{X}_{ij}^{b} \in \mathbb{R}^{K_1 \times T}$ denotes the output of the spatial module. $T$ is the length of the EEG segments, $K_1$ is the number of spatial convolution filters, and $f_{\mathrm{spatial}}$ denotes the spatial convolution.
In recent years, EEG recordings have tended to contain more and more electrodes as acquisition equipment has been upgraded. This presents both opportunities and challenges for EEG-based deep learning methods, since most of these electrodes are redundant and may introduce extra noise or even worsen the performance of downstream tasks. In order to reduce the spatial redundancy and explore which brain regions are more indicative for tinnitus diagnosis, we apply an L1 regularization loss (denoted $L_{region}$) to the kernels of the spatial convolution layer to force them to be sparse, which allows only valid information to pass smoothly into the next layer.
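Since the kernel has shape $K_1 \times M \times 1$, one reading of the spatial module is a learned linear mixing of the $M$ electrodes at every time point, with the L1 penalty applied to the mixing weights. A NumPy sketch under that assumption (function and variable names are ours):

```python
import numpy as np

def spatial_module(X, W):
    """Spatial 'convolution': mix M electrodes into K1 virtual channels.

    X: (M, T) EEG segment for one band; W: (K1, M) spatial filters
    (the K1 x M x 1 kernel collapses to a per-time-point matmul).
    Returns the (K1, T) filtered output.
    """
    return W @ X

def l1_region_loss(kernels):
    """L_region: L1 sparsity penalty summed over all bands' spatial kernels.

    Driving most weights toward zero leaves only the informative
    electrodes contributing to the next layer.
    """
    return sum(np.abs(W).sum() for W in kernels)
```

During training, `l1_region_loss` would be added to the contrastive losses so the optimizer shrinks the weights of uninformative electrodes.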
2) Temporal Module: The EEG signals change dynamically over time and contain abundant temporal information related to various neural activities. To detect the dependence of the EEG signals hidden in the time domain, we apply a deep residual network as the temporal module, which not only has strong fitting ability but is also tractable to optimize. The network architecture follows [30], and the process can be formulated as:

$$h_{ij}^{b} = f_{\mathrm{temporal}}^{b}\big(\tilde{X}_{ij}^{b}\big)$$

where $h_{ij}^{b} \in \mathbb{R}^{K_2}$ is the feature extracted from the $j$-th EEG segment of subject $i$ in the $b$-th frequency band, $K_2$ is the dimension of the extracted feature, and $f_{\mathrm{temporal}}^{b}$ is the function map of the $b$-th deep residual network.

C. Feature Fusioner
The feature fusioner is utilized to fuse the extracted multi-band features and generate a unified representation. To comprehensively consider the information obtained from each single band, an attention mechanism is introduced, and the process can be formulated as:

$$h_{ij} = f_{\mathrm{fusion}}\big(h_{ij}^{1}, h_{ij}^{2}, \ldots, h_{ij}^{B}\big)$$

where $h_{ij}^{b}$ denotes the feature from the $b$-th frequency band and $f_{\mathrm{fusion}}$ denotes the attention layer used to fuse the multi-band features. In order to enhance the expressiveness of the model, a projector composed of three MLP layers is introduced [31]:

$$z_{ij} = W_3\,\delta\big(W_2\,\delta(W_1 h_{ij})\big)$$

where $\delta(\cdot)$ denotes the activation function. Note that each contrastive learning task has its own projector.
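One common realization of this kind of fusion is a softmax-weighted sum over the $B$ band features, followed by the three-layer projector. The sketch below assumes that form, since the exact attention layer is not specified in the text (all names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse_bands(H, a):
    """Attention fusion: softmax-weighted sum of per-band features.

    H: (B, K2) feature matrix, one row per frequency band;
    a: (B,) learned attention scores.
    Returns the (K2,) unified representation.
    """
    w = np.exp(a - a.max())        # numerically stable softmax
    w = w / w.sum()
    return w @ H

def project(h, W1, W2, W3):
    """Three-layer MLP projector z = W3 d(W2 d(W1 h)); in MECRL each
    contrastive learning task would have its own set of weights."""
    return W3 @ relu(W2 @ relu(W1 @ h))
```

With all attention scores equal, the fusion degenerates to a plain average over bands; training would sharpen the weights toward the more informative bands.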

D. Representation Learning: Loss Function and Training
The three contrastive learning tasks correspond to three different aspects of the EEG semantics. The loss function of each task can be formulated as follows:

$$L_A = -\frac{1}{N_A}\sum_{n=1}^{N_A}\frac{1}{\sum_{k=1}^{N_A}\mathbb{1}_{n\neq k}\,\mathbb{1}_{\tilde{y}_n=\tilde{y}_k}}\sum_{p=1}^{N_A}\mathbb{1}_{n\neq p}\,\mathbb{1}_{\tilde{y}_n=\tilde{y}_p}\log\frac{\exp(\mathrm{sim}(z_n,z_p)/\tau_A)}{\sum_{k=1}^{N_A}\mathbb{1}_{n\neq k}\exp(\mathrm{sim}(z_n,z_k)/\tau_A)}$$

$$L_B = -\frac{1}{N_B}\sum_{n=1}^{N_B}\log\frac{\exp(\mathrm{sim}(z_n,z_n^{+})/\tau_B)}{\sum_{k=1}^{N_B}\mathbb{1}_{n\neq k}\exp(\mathrm{sim}(z_n,z_k)/\tau_B)}$$

$$L_C = -\frac{1}{N_C}\sum_{n=1}^{N_C}\log\frac{\exp(\mathrm{sim}(z_n,z_n^{+})/\tau_C)}{\sum_{k=1}^{N_C}\mathbb{1}_{n\neq k}\exp(\mathrm{sim}(z_n,z_k)/\tau_C)}$$

where $\tau_A, \tau_B, \tau_C \in \mathbb{R}^{+}$ are scalar temperature parameters and $z_n^{+}$ denotes the positive sample of $z_n$. $\mathbb{1}_{n\neq k}$ is an indicator function whose value is 1 when $n \neq k$ and 0 when $n = k$. $\mathbb{1}_{\tilde{y}_n=\tilde{y}_k}$ returns 1 when the label of the $n$-th sampled subject is the same as that of the $k$-th sampled subject and 0 when the labels differ, while $\mathbb{1}_{\tilde{y}_n\neq\tilde{y}_k}$ is just the opposite. $N_A, N_B, N_C$ denote the batch sizes of the respective tasks and $\mathrm{sim}(\cdot,\cdot)$ measures the cosine similarity of its input vectors. Finally, the loss function of the representation learning process (denoted $L_{rpt}$) is:

$$L_{rpt} = L_{region} + L_A + L_B + L_C$$

During the representation learning process, the optimizer adjusts the model parameters by reducing $L_{rpt}$, and the model understands the EEG data better after continuous optimization.
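Task A's supervised contrastive loss can be sketched in NumPy as follows. This is our illustrative implementation of the supervised contrastive form of [27], in which every sample sharing a class label acts as a positive; it is not the authors' exact code:

```python
import numpy as np

def supcon_loss(z, y, tau=0.1):
    """Supervised contrastive loss (Task A style).

    z: (N, D) projected features; y: (N,) class labels (tinnitus/healthy).
    All samples sharing a label are positives; self-pairs are excluded.
    Each label must occur at least twice in the batch.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # indicator 1_{n != k}
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / pos.sum(axis=1)
    return per_anchor.mean()
```

Features that cluster by diagnosis label yield a lower loss than randomly scattered features, which is exactly the pressure that shapes the shared feature space.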

E. Linear Classifier
After the representation learning process, the multi-band EEG encoder gains the capacity to transform EEG segments into cross-subject representations. In the classification process, we attach a linear classifier after the feature fusioner, train the classifier, and fine-tune the previous components. The classifier is trained via the supervised learning paradigm, using the cross-entropy loss $L_{cls}$, which can be formulated as:

$$\hat{y}_{ij} = f_{\mathrm{classifier}}(h_{ij}), \qquad L_{cls} = -\big[y\log\hat{y}_{ij} + (1-y)\log(1-\hat{y}_{ij})\big]$$

where $f_{\mathrm{classifier}}$ denotes the function map of the linear classifier, $\hat{y}_{ij}$ denotes the predicted probability of tinnitus, and $y$ denotes the true label.

III. EXPERIMENTS
A. Data and Preprocessing

1) Participants: The study groups consist of a tinnitus group with 187 tinnitus patients (77 females and 110 males, mean age = 43.24 years, range 14-73 years, std = 14.205) and a control group with 80 healthy subjects (37 females and 43 males, mean age = 39.84 years, range 20-62 years, std = 14.362). All tinnitus patients have suffered from tinnitus for at least 3 months. Patients with Meniere's disease, pulsatile tinnitus, central nervous system disorders, otosclerosis, or a history of previous middle ear surgery are excluded. All tinnitus patients are recruited from the Department of Otolaryngology, Sun Yat-sen Memorial Hospital, Sun Yat-sen University. Before collecting data, routine audiological examinations including otoscopy and pure-tone audiometry are performed for all participants. Moreover, the Tinnitus Handicap Inventory (THI) questionnaire and tinnitus-specific assessments, including tinnitus pitch and loudness matching measurements, are performed for all tinnitus patients. Tinnitus severity is assessed by the THI questionnaire, which has 25 items evaluating the self-perceived level of handicap caused by tinnitus on a scale of 0-100. The hearing state is calculated as the average threshold at 500, 1000, 2000, 4000 and 8000 Hz. A low-frequency tinnitus pitch means at most 1000 Hz, a medium-frequency pitch means 1000 Hz to 4000 Hz, and a high-frequency pitch means more than 4000 Hz. The details of the study groups are listed in Table I.
2) EEG Data Collection: A high-density EEG system with 128 channels (EGI, Eugene) and a NetAmps 200 amplifier are used to collect the resting-state EEG recordings from all participants. The sampling rate is 1000 Hz and impedances are kept below 50 kΩ. The CZ electrode is used as the reference electrode. During the EEG data collection process, participants are asked to sit on a chair, open their eyes and focus on a cross mark on the computer screen to keep awake. The whole process lasts for about 7 minutes.
3) Data Preprocessing: We apply the EEGLAB v13.0.0 toolbox in MATLAB R2013a to preprocess the raw EEG data. The raw data is first resampled at 250 Hz, band-pass filtered between 0.5 Hz and 80 Hz, notch filtered at 50 Hz, and re-referenced to the 56th and 107th electrodes, which are located over the bilateral mastoids. Next, evident artifacts are removed from the raw EEG data manually after visual inspection, and independent component analysis (ICA) is utilized to remove muscle, eye movement, and heartbeat artifacts. Then, the EEG data is segmented into 2 s slices and a basic finite impulse response (FIR) filter of the EEGLAB toolbox is applied to transform the slices into 8 frequency bands that are significant for tinnitus diagnosis: delta (2-3.5 Hz), theta (4-7.5 Hz), alpha1 (8-10 Hz), alpha2 (10-12 Hz), beta1 (13-18 Hz), beta2 (18.5-21 Hz), beta3 (21.5-30 Hz) and gamma (30.5-44 Hz) [32], [33], [34], [35], [36]. The process of extracting each frequency band of the EEG signals is shown in Fig. 2. Finally, for each frequency band, we calculate the mean and standard deviation of the voltage value of each EEG segment on each electrode, and then electrode-level zero-mean normalization is performed on each EEG segment, which normalizes the data while maintaining the dynamic pattern of the EEG signals.
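The band-decomposition step can be sketched with SciPy's FIR tools. EEGLAB's basic FIR filter is similar in spirit, but the filter length and design below are illustrative assumptions, not the authors' exact settings:

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 250  # Hz, sampling rate after resampling
BANDS = {  # band edges in Hz
    "delta": (2, 3.5), "theta": (4, 7.5), "alpha1": (8, 10),
    "alpha2": (10, 12), "beta1": (13, 18), "beta2": (18.5, 21),
    "beta3": (21.5, 30), "gamma": (30.5, 44),
}

def split_bands(segment, fs=FS, numtaps=101):
    """Band-pass a (channels, time) EEG segment into the 8 bands.

    Returns a dict mapping band name -> filtered (channels, time) array.
    filtfilt applies the FIR filter forward and backward, giving
    zero-phase filtering that preserves temporal alignment.
    """
    out = {}
    for name, (lo, hi) in BANDS.items():
        taps = firwin(numtaps, [lo, hi], pass_zero=False, fs=fs)
        out[name] = filtfilt(taps, [1.0], segment, axis=-1)
    return out
```

Note that a 101-tap FIR at 250 Hz has a fairly wide transition band, so very narrow bands such as delta are only approximated; longer filters (or filtering before slicing into 2 s segments) would sharpen the separation.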

B. Model Training
The entire training process of MECRL consists of the representation learning process and the classification process. In the representation learning process, the EEG sampler first applies the different sampling strategies (strategies A, B and C) to generate the corresponding mini-batches (mini-batches A, B and C), which are then fed to the multi-band EEG encoder, made up of 8 parallel deep neural networks, each composed of the spatial module and the temporal module. The EEG segments of the different frequency bands first pass through the spatial module to integrate spatial information and streamline the channels (electrodes), so that only the data from the electrodes with better tinnitus diagnosis value in each frequency band flows into the next layer. The EEG data then flows into the temporal module to learn its dynamic patterns in the time domain. After that, the outputs of the multi-band EEG encoder are fed to the feature fusioner to generate a unified representation. Finally, the different contrastive losses (L_A, L_B and L_C) are calculated and the model parameters are continuously optimized by minimizing them. After the representation learning process, we attach an MLP classifier behind the model and train it using the supervised learning paradigm, which constitutes the classification process.

C. Experiment Setting
In order to ensure the scientific rigor of the subject-independent tinnitus diagnosis experiments, the training and test sets are obtained by stratified sampling (randomly sampling 90% of the subjects in the tinnitus group and the control group as the training set, with the remaining subjects as the test set), which guarantees that the subjects in the test set are unseen by the model. Under this experimental setting, we can better evaluate the effectiveness of different models for clinical tinnitus diagnosis.
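The subject-wise stratified split can be sketched as follows. The key point is that sampling operates on subjects, not segments, so no test subject's data is seen during training (a hypothetical helper, not the authors' code):

```python
import random

def subject_split(tinnitus_ids, healthy_ids, train_frac=0.9, seed=0):
    """Stratified subject-wise split: train_frac of the subjects in each
    class go to the training set; the rest form a fully unseen test set."""
    rng = random.Random(seed)
    train, test = [], []
    for ids in (list(tinnitus_ids), list(healthy_ids)):
        rng.shuffle(ids)
        cut = round(len(ids) * train_frac)
        train += ids[:cut]
        test += ids[cut:]
    return train, test
```

With the paper's cohort (187 tinnitus, 80 healthy), this yields 240 training subjects and 27 held-out test subjects per run.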
For the architecture of the multi-band EEG encoder, the number of spatial convolution filters is set to 16 and the number of temporal convolution filters increases from 16 to 128 from the first layer to the deeper layers. The temporal convolution filter length is set to 32 and the kernel length of average pooling is set to 16 to extract relatively stable averaged features. The output dimension of the projector is set to 32. The temperature coefficients of the three auxiliary tasks are set to 0.1, 0.07, and 0.04 respectively. In the representation learning process, we train the model for 300 epochs with early stopping (maximal tolerance of 30 epochs without a decrease in validation loss). An Adam optimizer with a cosine annealing learning rate scheduler and three warm restarts is applied to optimize the model. The initial learning rate of the optimizer is set to 0.0007, and the weight decay is set to 0.015 empirically.
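The cosine-annealing schedule with warm restarts (SGDR-style) can be written out explicitly. The initial period of 30 epochs and the doubling of the restart period below are our assumptions; with them, restarts fall at epochs 30, 90 and 210, which is consistent with three warm restarts inside the 300-epoch budget:

```python
import math

def sgdr_lr(epoch, base_lr=0.0007, t0=30, t_mult=2, eta_min=0.0):
    """Cosine-annealed learning rate with warm restarts (SGDR).

    Within each cycle the LR decays from base_lr to eta_min along a
    cosine; at each restart it jumps back to base_lr. Cycle i lasts
    t0 * t_mult**i epochs.
    """
    t_cur, t_i = epoch, t0
    while t_cur >= t_i:          # locate the position inside the current cycle
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

PyTorch's `CosineAnnealingWarmRestarts` scheduler implements the same schedule for an attached optimizer.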
For the linear classifier in the classification process, there are two hidden layers with 64 and 32 units respectively. Rectified linear units (ReLUs) are used between the two layers. We use cross-entropy loss and an Adam optimizer to optimize the parameters. The learning rate is set as 0.0005 empirically, and the weight decay is set to 0.025. The batch size is set as 64 empirically. The classifier is trained for 100 epochs.

D. Evaluation Measures
In our experiments, five evaluation measures are adopted to evaluate the performance of the different models, namely ACC, AUC, Precision, Recall and F1-score. ACC is the segment-wise accuracy of the model for tinnitus diagnosis. AUC stands for "Area Under the ROC Curve": the ROC (receiver operating characteristic) curve shows the performance of a classification model at every classification threshold, and the AUC aggregates that performance across all possible thresholds. Precision is the proportion of true tinnitus EEG segments among the segments the model predicts as tinnitus, while Recall is the proportion of all tinnitus EEG segments that are correctly predicted. F1-score combines the precision and recall of a model into a single metric by taking their harmonic mean. The larger the ACC, AUC, Precision, Recall and F1-score, the better the classification performance.
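For reference, all five measures reduce to simple counting over the segment-level predictions. A self-contained NumPy sketch follows (scikit-learn's metrics module provides equivalent, battle-tested implementations):

```python
import numpy as np

def diagnosis_metrics(y_true, y_prob, thresh=0.5):
    """ACC, AUC, Precision, Recall and F1 for binary tinnitus diagnosis.

    y_true: (N,) 0/1 labels; y_prob: (N,) predicted tinnitus probabilities.
    """
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    y_pred = (y_prob >= thresh).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = np.mean(y_pred == y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # AUC as the probability that a random positive outscores a random
    # negative (ties count half) -- equivalent to the area under the ROC.
    pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
    auc = np.mean(pos[:, None] > neg[None, :]) + 0.5 * np.mean(pos[:, None] == neg[None, :])
    return {"ACC": acc, "AUC": auc, "Precision": prec, "Recall": rec, "F1": f1}
```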

E. Subject-Independent Tinnitus Diagnosis Experiment
To demonstrate the superiority of our MECRL framework, we select the traditional machine learning methods based on feature engineering and several competitive EEG-based deep learning methods as baselines and reproduce them on our EEG dataset.
1) v-SVM [37] is a variant of the Support Vector Machine (SVM) algorithm, which can be interpreted as finding a maximal separation between subsets of the convex hulls of the data; the convex hulls are controlled by the choice of the parameter v. Because traditional machine learning methods can hardly handle complex EEG signals, the Power Spectral Density (PSD) of each electrode is calculated as the input, which is a common way of hard-encoding EEG data in disease diagnosis.
2) MLP [38] stands for Multilayer Perceptron, a basic deep learning model with strong nonlinear fitting ability. It takes the PSD of the EEG electrodes as input and outputs the predicted probability of tinnitus.
3) EEGNet [39] does not hard-encode the EEG data, but uses Convolutional Neural Networks (CNNs) to automatically extract features and complete the classification tasks. However, since it has no special mechanism for individual variability, this method suffers from overfitting in practical applications.
4) SiameseAE [14] is an auto-encoder-based Siamese network that takes ABRs (auditory brainstem responses) as input for tinnitus diagnosis. ABRs are single-channel evoked potentials recorded with an EEG sensor. To adapt the SiameseAE method to the EEG dataset, we use the potentials of the corresponding electrode (electrode no. 13), located on the midfrontal line, as its input. The method designs a variety of loss functions to align individual differences and achieves good results. However, the auto-encoder structure is simple and does not take the spatial and frequency information of the EEG data into account, which limits further improvement of the diagnosis performance.
5) SMeta-SAE [40] is an ABR-based cross-dataset tinnitus diagnosis method using SiameseAE as the backbone model. The introduction of meta-learning alleviates the lack of large-scale datasets. However, similar to SiameseAE, it does not consider spatial and frequency information.
6) 4D-CNN [41] is a deep learning method that fully considers the multi-dimensional information of the EEG data. Through frequency division and topographic map extraction, the two-dimensional EEG signals are converted into four-dimensional signals as the input of a 4D convolutional neural network to complete the classification tasks. Similar to EEGNet, the overfitting problem limits the performance of the method in the cross-subject tinnitus diagnosis task.

The results of the subject-independent tinnitus diagnosis experiment are listed in Table II, in which the mean and standard deviation are reported over 10 runs with different stratified sampling results. From Table II, we can see that the proposed MECRL method achieves an outstanding tinnitus diagnosis accuracy of 91.34%. It outperforms the second- and third-best methods (i.e., 4D-CNN and SMeta-SAE) by 8.364% and 12.336% respectively, which reflects the excellent performance of the MECRL method. Moreover, we find that the methods taking handcrafted features as input (i.e., v-SVM and MLP) generally perform worse than the methods extracting deep features from the raw EEG data (i.e., EEGNet, SiameseAE, SMeta-SAE, 4D-CNN and MECRL). This is because the transformation from raw EEG data to handcrafted features loses much information that is useful for tinnitus diagnosis. In contrast, a deep neural network can dynamically extract effective information from the EEG signals according to the needs of the downstream tasks. However, since EEGNet and 4D-CNN are trained under a simple supervised learning paradigm, the deep neural network easily takes random noise in the complex signals as discriminative features when training samples are insufficient, making the model perform poorly on data from unseen subjects. Unlike them, MECRL deconstructs the complex EEG data through a series of self-supervised learning tasks.
These tasks are progressive and semantically complementary, so the MECRL model can effectively map EEG segments into semantically rich and discriminative representations. This explains why MECRL outperforms the other methods.
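The contrastive objective behind such self-supervised representation learning can be illustrated with a generic NT-Xent-style loss, in which two views of the same EEG segment form a positive pair and all other segments in the batch act as negatives. This is a minimal NumPy sketch of the general technique, not the authors' exact loss formulation:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss over a batch of paired embeddings.

    z1[i] and z2[i] are embeddings of two views (positive pair) of
    the same EEG segment; all other rows serve as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit norm -> cosine sim
    sim = z @ z.T / temperature                       # (2N, 2N) logits
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
loss_random = nt_xent_loss(z1, rng.normal(size=(8, 16)))          # unrelated views
loss_aligned = nt_xent_loss(z1, z1 + 0.01 * rng.normal(size=(8, 16)))  # near-identical views
print(loss_aligned < loss_random)  # aligned views yield a lower loss
```

Minimizing such a loss pulls representations of the same segment together while pushing apart those of different segments, which is what makes the learned feature space discriminative across subjects.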
In terms of stability, since highly complex input data introduces greater uncertainty, v-SVM and MLP, which take simple handcrafted features as input, show smaller standard deviations of accuracy, i.e. they are more consistent across experiments. Different from the other EEG-based deep learning methods, the representation learning process of MECRL is equivalent to self-supervised pre-training of the model parameters. Therefore, compared with methods that start training from completely random initial parameters, the MECRL method has stronger stability, i.e. a smaller standard deviation.

F. Parameter Analysis
By applying the regularized spatial module, the EEG data from electrodes at different spatial locations is fed into the backbone encoder with different weights. As training progresses, the weights of electrodes in brain regions that are more important for tinnitus diagnosis are continuously amplified, while the weights of electrodes in irrelevant brain regions tend toward zero. To explore the role of the EEG data in different frequency bands for tinnitus diagnosis and to trace abnormal brain regions, we extract the weight coefficients of the electrodes in each frequency band, which are visualized in Fig. 3. From the figure, we can see that the electrodes with high weights in each frequency band are sparse and concentrated, which implies that many EEG electrodes are likely to be unhelpful for specific downstream tasks. The results also illustrate that high-weight electrodes in each frequency band are spatially clustered, and the brain areas in which these electrodes are located may be related to the onset of chronic tinnitus.
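A spatial module of this kind can be sketched as a learnable per-electrode gate whose L1 regularization drives the weights of uninformative electrodes toward zero. The following is a simplified NumPy illustration under assumed shapes and hyperparameters, not the authors' implementation:

```python
import numpy as np

class SpatialGate:
    """Per-electrode weights applied before the backbone encoder.

    An L1 penalty on the weights encourages sparsity, so electrodes
    that do not help the downstream task are pushed toward zero.
    """

    def __init__(self, n_electrodes, l1=1e-1, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.uniform(0.5, 1.5, size=n_electrodes)  # initial gates
        self.l1 = l1

    def forward(self, x):
        # x: (n_electrodes, n_samples) raw EEG segment
        return self.w[:, None] * x

    def step(self, grad_w, lr=0.1):
        # gradient step on the task loss plus the L1 subgradient
        self.w -= lr * (grad_w + self.l1 * np.sign(self.w))
        self.w = np.clip(self.w, 0.0, None)  # keep gates non-negative

gate = SpatialGate(n_electrodes=4)
x = np.ones((4, 10))
y = gate.forward(x)  # electrode-weighted input to the encoder
# Electrodes that receive no task gradient decay under the L1 penalty alone:
for _ in range(200):
    gate.step(grad_w=np.zeros(4))
print(gate.w)  # gates shrink to zero without a supporting task signal
```

In the real model the task gradient counteracts this decay for informative electrodes, so only their weights survive, yielding the sparse, concentrated weight maps described above.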
To validate the traceability of abnormal brain regions associated with tinnitus, we retrain the single-band version of the model (with the spatial module removed) using the EEG data of the 32 highest-weight electrodes and the 32 lowest-weight electrodes in each frequency band respectively. The results are shown in Fig. 4. From the figure, we can see that in each frequency band the model trained with high-weight electrodes performs much better than the model trained with low-weight electrodes. What is more, the performance of the model using the 32 highest-weight electrodes is very close to that of the model using all electrodes, and even slightly better than the all-electrode model in the gamma band. This suggests that the brain regions where these high-weight electrodes are located are associated with the onset and diagnosis of tinnitus. In addition, we find that the single-band version of the model performs much worse than the multi-band version. This shows that frequency-domain information is essential for tinnitus diagnosis, which is one of the reasons for the poor performance of EEGNet, SiameseAE and SMeta-SAE. More medical discussion can be found in Section IV.
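Selecting the electrode subsets for this retraining experiment amounts to ranking electrodes by their learned weights and slicing the data accordingly. A small sketch with made-up weights and an assumed 128-channel montage (in the paper the weights come from the trained spatial module):

```python
import numpy as np

def split_by_weight(weights, k=32):
    """Return indices of the k highest- and k lowest-weight electrodes."""
    order = np.argsort(weights)       # ascending by weight
    return order[-k:], order[:k]      # (top-k, bottom-k)

rng = np.random.default_rng(1)
weights = rng.random(128)             # hypothetical learned electrode weights
top, bottom = split_by_weight(weights, k=32)

eeg = rng.normal(size=(128, 1000))    # (electrodes, samples) toy recording
eeg_top, eeg_bottom = eeg[top], eeg[bottom]
print(eeg_top.shape, eeg_bottom.shape)  # (32, 1000) (32, 1000)
```

The two sliced datasets then train two otherwise identical single-band models, whose accuracy gap reflects how much diagnostic signal the high-weight electrodes carry.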

G. Ablation Study
In this section, an ablation study is conducted to demonstrate the role of the different auxiliary tasks as well as the representation learning process in our MECRL framework. Specifically, we compare the proposed MECRL framework with four variant models, e.g. MECRL-Task A, in which the multi-band EEG encoder is trained only with tasks B and C. The results in Table III show that the designed auxiliary tasks as well as the representation learning process effectively help the model learn cross-individual feature representations from the EEG data, which is helpful for tinnitus diagnosis.

H. Overfitting Analysis
To further demonstrate the robustness of the proposed MECRL framework when training data is scarce, we modify the original procedure for generating the training set and the test set. Specifically, we randomly sample 50% of the subjects in the tinnitus group and the control group as the training set, and use the remaining subjects as the test set. That is, instead of the 9:1 training-test ratio, we have a 5:5 training-test ratio. Under this new setting (i.e. the 5:5 training-test setting), the training data is almost cut in half, and the larger test set allows us to evaluate the models more accurately. Moreover, we display the loss curves of MECRL during classification under the two experimental settings in Fig. 5 to check for overfitting. The results of the subject-independent tinnitus diagnosis experiment under the 5:5 setting are shown in Table IV. From the table, we can find that under the 5:5 setting, MECRL still performs better than the baselines. Meanwhile, compared with the 9:1 setting, the accuracy of EEGNet, SiameseAE, SMeta-SAE, 4D-CNN and MECRL decreases by 3.41%, 3.06%, 2.84%, 4.82% and 1.34% respectively under the 5:5 setting. This is because the reduction of training data makes parameter optimization difficult for data-driven deep neural networks. However, the drop for MECRL is the smallest among all deep learning methods, reflecting its robustness in the face of scarce training data. This is due to the self-supervised pre-training process (i.e. the representation learning process), which gives the generated feature space a well-organized internal structure and rich semantic information.
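A subject-level stratified split of this kind keeps all segments of each subject on one side and preserves the tinnitus/control ratio in both splits. A minimal NumPy illustration using the paper's group sizes (187 tinnitus, 80 controls); the subject IDs and helper function are hypothetical:

```python
import numpy as np

def subject_split(subject_ids, labels, test_ratio=0.5, seed=0):
    """Split subject IDs into train/test sets, stratified by group label.

    All EEG segments of a subject go to the same side, so the test
    set contains only subjects unseen during training.
    """
    rng = np.random.default_rng(seed)
    train, test = [], []
    for group in np.unique(labels):
        ids = rng.permutation(subject_ids[labels == group])
        n_test = int(round(len(ids) * test_ratio))
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test

# 187 tinnitus subjects (label 1) and 80 controls (label 0), as in the paper
ids = np.arange(267)
labels = np.array([1] * 187 + [0] * 80)
train, test = subject_split(ids, labels, test_ratio=0.5)
print(len(train), len(test))  # roughly a 5:5 subject split per group
```

Setting `test_ratio=0.1` instead would reproduce the original 9:1 configuration within the same procedure.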
In addition, from Fig. 5, we can find that the training loss and test loss both converge under either experimental setting, which shows that overfitting in MECRL is not as serious as in the other baselines. This also explains its excellent performance in the EEG-based tinnitus diagnosis task.

IV. MEDICAL DISCUSSION
Resting-state EEG markers are promising for large-scale, low-cost, noninvasive screening of tinnitus subjects. Robust automatic classification of such neural signatures, predictive at the individual patient level, would both advance the neurobiological understanding of tinnitus and carry important clinical implications. Therefore, our work aims to differentiate tinnitus patients from healthy people, and further to identify characteristic tinnitus features extracted from the EEG data using a robust, data-efficient multi-task learning framework called Multi-band EEG Contrastive Representation Learning (MECRL). The deep model achieves higher accuracy (ACC 0.9134 and AUC 0.9256) for tinnitus diagnosis than the other models.
To further explore potential brain regions that contribute to the successful tinnitus classification, we extract and visualize the spatial weights of the spatial module, which are shown in Fig. 3. From Fig. 3 we can find that the electrodes helpful for the classification and diagnosis of tinnitus are mainly distributed in the frontal (Fp1 and Fz), parietal (Pz), central cranial (Cz), temporal (T5 and T3) and occipital (O1) regions in each frequency band, as well as nearby auxiliary electrodes. Essentially, the EEG biomarkers highlighted here assist the interpretation of the mechanism of central neurophysiological changes in tinnitus. These high-classification-weight electrodes are distributed in the frontal, parietal, temporal and occipital regions, which are core regions of the default mode network (DMN). The DMN is generally considered to consist of core regions including the auditory cortex, dorsolateral prefrontal cortex, parahippocampal gyrus and precuneus/posterior cingulate gyrus [42], [43], which are normally active in the absence of external stimuli. The DMN's functions include monitoring the external environment, maintaining self-awareness, generating spontaneous thoughts, and processing episodic memory, cognition and emotion [44]. Consistent with previous studies [11], [12], [45], our results suggest that DMN abnormality is a significant central neural characteristic of tinnitus. Dysregulation of the default mode network, which controls self-representational processing, is related to the pathogenesis of tinnitus [9]. In other words, the sound (tinnitus) becomes an integral part of the patient's self [46]. This may be why tinnitus becomes persistent and difficult to treat.
Furthermore, the pathological DMN is associated with changes in brain oscillatory activity. The activity of the DMN is largely organized synchronously and mediated by alpha rhythms [47]. Early EEG studies suggested that the attenuation of the alpha rhythm and the increase of theta and of high-frequency bands (beta and gamma) are electrophysiological manifestations of tinnitus [48], [49], [50]. However, the significance of DMN dysfunction in tinnitus remains a mystery [51], and it is not clear how the neural oscillations of the DMN measured with EEG relate to the spatial distribution of the electrode nodes over the cortex [52], [53]. Interestingly, by using machine learning to analyze multiple EEG bands, our results show that more electrodes carry significant classification weight in the alpha1, beta1, beta2 and beta3 bands, and fewer in the alpha2 band. This suggests that the slowing of the alpha rhythm in tinnitus EEG may be a surface phenomenon, while the deeper pathological significance lies in the alteration of the DMN underlying the alpha rhythm. This is a significant neuro-biomarker of the EEG oscillations of tinnitus, which facilitates our understanding of the relationship between electrophysiological and pathophysiological changes in tinnitus.
Several limitations must be acknowledged. This is a cross-sectional study with a limited sample size, and we mainly investigate the differences in cortical activity between chronic tinnitus patients and healthy subjects. The effects of highly heterogeneous tinnitus factors, such as laterality, hearing condition and tinnitus severity, on brain networks are not further explored in the present study. In future studies, it would be very interesting to explore the influence of these factors on the central brain networks of tinnitus patients by categorizing them on the basis of tinnitus characteristics.

V. CONCLUSION
In this study, we propose a robust, data-efficient multi-task framework named MECRL for cross-subject tinnitus diagnosis. It effectively addresses the overfitting problem that deep neural networks suffer from due to the lack of large-scale datasets and high individual variability. In the framework, we design several layer-by-layer progressive learning tasks to deconstruct the semantics of EEG data, which help the deep residual network-based encoder integrate multi-dimensional information and align individual variability. Moreover, the self-supervised learning tasks do not depend on annotated data and complement each other. This allows us to efficiently train deep models without large-scale annotated EEG datasets, extracting discriminative subject-independent representations that precisely distinguish tinnitus subjects from healthy controls. Experiments are conducted on a tinnitus dataset collected from the Department of Otolaryngology, Sun Yat-sen Memorial Hospital, Sun Yat-sen University, and the results confirm the effectiveness of the proposed MECRL framework over the state-of-the-art baselines.