DiffMDD: A Diffusion-Based Deep Learning Framework for MDD Diagnosis Using EEG

Major Depressive Disorder (MDD) is a common yet debilitating mental disorder that affects millions of people worldwide. Early and accurate diagnosis is therefore highly valuable. Recently, EEG, a non-invasive technique for recording the spontaneous electrical activity of the brain, has been widely used for MDD diagnosis. However, EEG still poses challenges in data quality and data size: (1) A large amount of noise is inevitable during EEG collection, making it difficult to extract discriminative features from raw EEG; (2) It is difficult to recruit a large number of subjects to collect sufficient and diverse data for model training. Both challenges cause the overfitting problem, especially for deep learning methods. In this paper, we propose DiffMDD, a diffusion-based deep learning framework for MDD diagnosis using EEG. Specifically, we extract more noise-irrelevant features to improve the model's robustness by designing the Forward Diffusion Noisy Training Module. Then we increase the size and diversity of the data to help the model learn more generalized features by designing the Reverse Diffusion Data Augmentation Module. Finally, we re-train the classifier on the augmented dataset for MDD diagnosis. We conducted comprehensive experiments to test the overall performance and each module's effectiveness. The framework was validated on two public MDD diagnosis datasets, achieving state-of-the-art performance.

Major Depressive Disorder (MDD) is a common mental disorder that affects about 350 million individuals worldwide, as reported by the World Health Organization (WHO) [1]. MDD patients exhibit abnormal thinking and behavior patterns. They consequently experience feelings of sadness, meaninglessness and hopelessness, so their quality of life declines significantly. MDD brings so much pain to patients that some even die by suicide. According to the WHO, MDD kills more than one million individuals each year [1]. Early and precise diagnosis of MDD is crucial for preventing the development of the disease and potentially saving lives. Extensive efforts have been devoted to improving diagnostic procedures for MDD over the years. Typical clinical MDD diagnosis methods rely primarily on external observation of behaviors and mental states, as well as standardized questionnaires such as the DSM, BDI, and PHQ [2]. However, they depend on the expertise of medical professionals, resulting in subjective diagnostic outcomes. Apart from external symptoms, MDD patients' brain activity patterns also exhibit abnormalities [3]. For example, MDD patients exhibit lower brain activity levels compared with healthy individuals [2]. Brain activity is typically measured through spontaneous electrical activity recording methods such as electroencephalography (EEG), a non-invasive technique with electrodes placed on the scalp [4], [5]. EEG records the amplitude variations of brain waves over time. Fig. 1 (a) shows the signal components, decomposed by Independent Component Analysis (ICA), of a healthy control subject's and an MDD patient's multi-channel EEG. We can see that the MDD patient's EEG fluctuates less frequently and less dramatically, indicating that the MDD patient's brain activity is less active than that of the healthy subject. By further analyzing EEG, we can also observe many brain activity differences between healthy people and MDD patients, such as the energy distribution across frequency bands and the relationships among functional regions and hemispheres [6]. Due to its high temporal resolution, low cost and non-invasive nature, EEG has gained considerable attention and has become a promising diagnostic tool for MDD.
However, the data quality and data size problems are extremely challenging for MDD diagnosis by EEG. (1) A large amount of noise is inevitable during EEG collection, making it difficult to extract discriminative features from raw EEG. Noise from many sources, such as environmental and machine noise, eye movements and heartbeat artifacts, is inevitable during EEG collection [7]. As shown in Fig. 1 (b), both subjects' EEG contains a considerable amount of noise from many sources. The dashed lines highlight some typical kinds of noise: the upper dashed regions probably mainly contain environmental sinusoidal noise, and the lower one probably mainly contains sudden movement noise. In some studies, pre-processing methods such as band filtering and ICA are used to remove noise from raw EEG [8], [9], [10], [11], [12]. However, they rely heavily on human inspection, which is time-consuming and does not generalize, and they cannot remove noise completely. To tackle the problem efficiently, automatic methods should be proposed to improve the model's robustness to noise. (2) It is difficult to recruit a large number of subjects to collect sufficient and diverse data for model training. Due to the difficulty of recruiting subjects, privacy issues and the high cost of data cleaning [13], the number of subjects in an MDD diagnosis dataset is usually small (e.g., 30 [14], 51 [15], and 64 [16], with a range between 12 and 213 [2]). The clipped EEG samples that belong to the same subject are closely related to each other, so the data lacks diversity [17]. Although existing EEG-based MDD diagnosis studies can draw some reasonable conclusions from the limited data, the small data size still severely constrains model performance. Taken together, the above two challenges both cause the overfitting problem, especially for deep learning methods, whose performance usually depends on data quality and data size. The existing deep learning methods for MDD diagnosis using EEG [18], [19] focus on direct training on the original datasets, and their performance is limited.
In order to address the above challenges, we propose a novel diffusion-based deep learning framework for MDD diagnosis using EEG. The forward and reverse diffusion processes are utilized and integrated to improve the model's performance on MDD diagnosis. Our main contributions are:
• We propose DiffMDD, a 3-step diffusion-based deep learning framework for MDD diagnosis using EEG. It was validated on two public real-world datasets with subject-independent 10-fold cross-validation, achieving state-of-the-art performance.
• We extract more noise-irrelevant features to improve the model's robustness by designing the Forward Diffusion Noisy Training Module.
• We increase the size and diversity of data to help the model learn more generalized features by designing the Reverse Diffusion Data Augmentation Module.

II. RELATED WORK

A. MDD Diagnosis by EEG Using Deep Learning
In recent years, many studies have focused on MDD diagnosis by EEG using deep learning. The early-stage DL-based methods for MDD diagnosis rely on handcrafted features, such as MLP [20] and PNN [21]. More recent models based on CNN architectures automatically extract locally invariant features, especially short-term ones. For example, DeprNet [22] uses the ConvNet architecture with 5 stacked CNN layers to capture temporal features. InceptionNet [19], [23] uses kernels of different sizes to capture features at different time scales, and uses channel-wise attention in high-level layers to learn channel importance. Reference [8] first extracts spectral features and feeds the time-frequency maps into a 2D-CNN. More recently, since LSTM is capable of capturing long-term features, some works use hybrid architectures. References [18] and [24] build CNN-LSTM models to capture short-term and long-term features at the same time. Reference [25] stacks GCN and GRU models to capture the spatial-temporal features of EEG, with the brain connectivity graphs built in an adaptive way. However, the existing studies train the models directly with the original data, and their performance is constrained by the poor data quality and small data size of the MDD diagnosis problem using EEG. The above DL models are prone to overfit during training and fail to generalize well to unseen subjects. In our work, we build a base CNN-Transformer classifier to encode both short-term and long-term features. But unlike previous studies, we address the poor data quality and data sparsity problems with diffusion-based modules.

B. Denoising Diffusion Probabilistic Models
In our work, to handle the data quality and data sparsity problems, the forward and reverse diffusion processes are utilized for noisy training and data augmentation, respectively. They are both standard processes in Denoising Diffusion Probabilistic Models (DDPMs). DDPMs are a class of deep generative models that involve both a forward (inference) and a reverse (generative) diffusion chain. They have achieved SOTA performance on many generative tasks in multiple fields, such as image generation [26] and speech synthesis [27]. They beat Generative Adversarial Networks (GANs) owing to their training stability and higher generation quality. DDPM [28] first clearly defined the forward and reverse diffusion processes, and it is the prototype for all subsequent studies. However, the original reverse diffusion process in DDPM works step by step and is very time-consuming. Some studies accelerate the reverse diffusion process by modifying the original conditional probability formulas or the noise parameters, such as DDIM [29]. There are also studies focusing on guided and conditional data generation using DDPMs. It has been proved that using a classifier to guide the reverse process [30] and modifying the gradient [31] during training are mathematically equivalent. In our work, we use the forward diffusion process for initial noisy training, and mainly use the reverse diffusion process in the guided version of DDIM for data augmentation. Unlike typical classifier-guided diffusion models, in the reverse diffusion process we use the initially trained classifier to both guide and condition the diffusion model. In this way, the two diffusion-based modules are integrated, and the classifier's output features, used as conditions, can provide more information to the reverse diffusion for better quality. Besides, instead of directly using typical U-Nets and temporal Transformers, we also design a spatial Transformer block in the diffusion model to capture the spatial relationships of different EEG channels. By incorporating the diffusion-based modules, the model learns noise-irrelevant and more generalized features, and its generalization performance on unseen patients is improved.

III. METHODOLOGY

A. Problem Definition
We formulate our MDD diagnosis problem as a binary classification problem. We denote $x_{sub} \in \mathbb{R}^{C \times N}$ as an original EEG recording from a subject, where $C$ is the number of EEG channels ($C = 19$ in our work) and $N$ is the number of sampling points in a whole EEG recording. Each $x_{sub}$ has a ground-truth label $\hat{y}_{sub} \in \{0, 1\}$, where 0 denotes that the subject is healthy and 1 denotes that the subject has MDD. Then we clip each $x_{sub} \in \mathbb{R}^{C \times N}$ into frames of equal length as our EEG samples. Each EEG sample is denoted as $x \in \mathbb{R}^{C \times n}$, where $n$ is the number of sampling points in an EEG sample ($n = 1280$ in our experiments). Each sample also has a ground-truth label $\hat{y}$ with $\hat{y} = \hat{y}_{sub}$. Our MDD diagnosis task is defined as learning a classifier $f$ that maps each EEG sample $x$ into the likelihood that it comes from an MDD patient: $y = f(x)$.

B. Framework Overview

Our DiffMDD framework consists of three steps. First, the Forward Diffusion Noisy Training Module injects Gaussian noise into the EEG samples and initially trains the classifier on them, improving its robustness to noise. Second, the Reverse Diffusion Data Augmentation Module is designed to tackle the data sparsity problem by generating new EEG samples. In order to improve the generation quality and make the training process more stable, in addition to guiding the diffusion model with the initially trained classifier's output gradients, we also condition the diffusion model with the classifier's output EEG embeddings. Specifically, in the diffusion model, we use a 1D-CNN-Transformer to capture both short-term and long-term temporal features, and a spatial Transformer with hemisphere and functional region embeddings to capture the relationships between different EEG channels. The generated new data of high quality and diversity can help the classifier learn more generalized features. Finally, we re-train the classifier on the augmented EEG data to obtain the MDD diagnosis results. The diffusion-based modules are closely related to each other, and both help alleviate the overfitting problem in MDD diagnosis using EEG. They regularize the classifier, help it learn more noise-irrelevant and generalized features, and make it perform better on unseen subjects.

C. Forward Diffusion Noisy Training
EEG has an inherently low signal-to-noise ratio because of the inevitable environmental noise and artifacts, such as eye movements, during its collection. The existing directly trained MDD diagnosis models using EEG and deep learning [19], [22], [32] are sensitive to noise, and may fail to extract discriminative features from raw EEG. Therefore, we design this forward diffusion-based module to regularize the classifier and help it learn more noise-irrelevant features from the noisy EEG data.
In our Forward Diffusion Noisy Training Module, we first inject Gaussian noise of random time-steps $t$ into each original EEG sample $x_0$ using the Markov process

$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$,

which admits the closed form

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$,

where the noise schedule parameters $\beta_t \in (0, 1)$, $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. The maximum number of time-steps $T$ is set to 1000 here. In this way, we obtain EEG samples $x_t$ of different noise levels. We choose to inject Gaussian noise mainly because it can theoretically better guide the subsequent reverse diffusion process [30]. Besides, it is also a typical type of noise during EEG collection.
After that, we use the Gaussian noise-injected EEG samples $x_t$ to initially train our classifier $f_{noisy}$. The classifier's architecture is shown in Fig. 3 and described in Subsection Re-Training and Classification.
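The noise-injection step above can be sketched in a few lines. This is a minimal illustration, assuming a linear β schedule from 1e-4 to 0.02 (the common DDPM default; the paper does not state its schedule) and drawing one random time-step per sample:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Inject Gaussian noise into a clean sample x0 at time-step t using
    the closed form x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = np.cumprod(1.0 - betas)        # abar[t] = prod_{s<=t} alpha_s
    eps = rng.standard_normal(x0.shape)   # ground-truth noise
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps
    return xt, eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear schedule
rng = np.random.default_rng(0)
x0 = rng.standard_normal((19, 1280))      # one EEG sample: C=19 channels, n=1280 points
t = rng.integers(0, T)                    # random time-step, as in the module
xt, eps = forward_diffuse(x0, t, betas, rng)
```

Because the closed form is invertible given ε, the clean sample can be recovered exactly from `xt`, which is a handy sanity check when implementing this module.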

D. Reverse Diffusion Data Augmentation
Besides the large amount of noise, the lack of data for MDD diagnosis using EEG is another important cause of the severe overfitting problem. Due to the cost of EEG collection and cleaning and privacy issues in healthcare, the size of an MDD diagnosis dataset is usually small, especially for powerful deep learning based models. So we design this module for data augmentation, and it can help our classifier learn more generalized features. The initially trained classifier $f_{noisy}$ is relatively reliable for MDD diagnosis, and we find that it is useful for conditioning and guiding our diffusion model. In this way, our diffusion model can generate new EEG samples corresponding to the original EEG samples and their labels, and the guidance can improve the quality of the generated data [30].
Additionally, there are relationships between EEG's different channels. According to [32], these relationships reflect brain functional patterns. Besides, the EEG channels of the same functional region (e.g., prefrontal or frontal lobes) and hemisphere (i.e., left, right or central) are usually strongly related to each other [33]. These relationships are useful for distinguishing MDD patients from healthy subjects. In order to better capture the channels' relationships in EEG, we design a spatial Transformer in the diffusion model in addition to the temporal 1D-CNN-Transformer.
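The hemisphere and region groupings can be derived directly from the 10-20 channel names (odd-numbered electrodes lie on the left, even-numbered on the right, and "z" electrodes on the midline; the letter prefix identifies the lobe). The sketch below is illustrative only: the 16-dimensional split of the position embedding and the random embedding tables are assumptions, not the paper's stated design.

```python
import numpy as np

CHANNELS = ["Fp1", "F3", "C3", "P3", "O1", "F7", "T3", "T5", "Fz",
            "Fp2", "F4", "C4", "P4", "O2", "F8", "T4", "T6", "Cz", "Pz"]

def hemisphere(ch):
    # 10-20 convention: odd index -> left, even -> right, 'z' -> midline
    last = ch[-1]
    if last.lower() == "z":
        return "central"
    return "left" if int(last) % 2 == 1 else "right"

def region(ch):
    # group by the electrode letter prefix (Fp, F, C, P, O, T)
    return ch.rstrip("0123456789z")

rng = np.random.default_rng(0)
d = 16  # assumed: each half of the 32-dim channel position embedding
E_h = {h: rng.standard_normal(d) for h in {hemisphere(c) for c in CHANNELS}}
E_r = {r: rng.standard_normal(d) for r in {region(c) for c in CHANNELS}}

# position embedding of each channel = [region embedding ; hemisphere embedding]
pos = np.stack([np.concatenate([E_r[region(c)], E_h[hemisphere(c)]])
                for c in CHANNELS])
```

In the actual model these lookup tables would be learned layers, but the concatenation structure matches the paper's description of combining the functional region and hemisphere embedding layers.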
As shown in Fig. 2, in the reverse diffusion process we need to predict and remove noise from each noise-injected EEG sample $x_t$. The architecture of our diffusion model $\epsilon_\theta$ is shown in Fig. 4. It takes $x_t$, its time-step $t$ and its trained EEG embedding $x_t^{EEG}$ output by $f_{noisy}$ as input, and outputs the non-guided predicted noise $\epsilon_\theta(x_t, x_t^{EEG}, t)$. The architecture is mainly built on CNN layers, temporal and spatial transformer layers, and residual layers. Specifically, considering the fact that the EEG channels of the same functional region (e.g., prefrontal or frontal lobes) and hemisphere (i.e., left, right or central) are usually strongly related to each other [33], we concatenate the brain functional region embedding layer and the hemisphere embedding layer to represent the position embedding of each channel in the spatial transformer layers. We minimize the Mean Squared Error (MSE) between the non-guided predicted noise $\epsilon_\theta(x_t, x_t^{EEG}, t)$ and the ground-truth noise $\epsilon$ as the loss function:

$L_{\epsilon} = \frac{1}{K} \sum_{k=1}^{K} \left\| \epsilon^{(k)} - \epsilon_\theta(x_t^{(k)}, x_t^{EEG,(k)}, t) \right\|_2^2$,

where $K$ is the number of noise-injected EEG samples.
After training the diffusion model $\epsilon_\theta$, given each randomly sampled noise-injected EEG sample $x_t$, we infer our targeted new EEG $\hat{x}_0$ using the classifier-guided reverse diffusion process based on Denoising Diffusion Implicit Models (DDIM), the accelerated version of DDPM. Taking the output embeddings and gradients of the initially trained model as condition and guidance, the DDIM reverse diffusion process can be defined as

$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \dfrac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, x_t^{EEG}, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2}\, \epsilon_\theta(x_t, x_t^{EEG}, t) + \sigma_t \epsilon_t$,

where the variance $\sigma_t = \eta \sqrt{(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t)} \sqrt{1-\bar{\alpha}_t/\bar{\alpha}_{t-1}}$. Then we define the conditional diffusion based on the score function:

$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)$,

where $\nabla$ represents the gradient derived from the model. Finally, we get the update equation in the reverse diffusion process:

$\hat{\epsilon}_\theta = \epsilon_\theta(x_t, x_t^{EEG}, t) - \sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log p_{f_{noisy}}(y \mid x_t)$,

where $\hat{\epsilon}_\theta$ is the guided predicted noise of $x_t$, which replaces $\epsilon_\theta(x_t, x_t^{EEG}, t)$ in the DDIM update above, and $y$ is the ground-truth label of $x_t$. We set the maximum value of $t$ to $T' = 100$, since DDIM sampling can skip steps.
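A single deterministic DDIM update (η = 0, so σ_t = 0) with the guidance folded into the predicted noise can be sketched as follows. The trained noise network and classifier gradient are replaced by toy stand-ins here, so this only demonstrates the update arithmetic, not the full sampler:

```python
import numpy as np

def ddim_step(xt, t, t_prev, abar, eps_pred, grad_logp, scale=1.0):
    """One deterministic DDIM update (sigma_t = 0) with classifier guidance:
    the predicted noise is shifted by the classifier's score gradient."""
    eps_hat = eps_pred - scale * np.sqrt(1.0 - abar[t]) * grad_logp
    x0_hat = (xt - np.sqrt(1.0 - abar[t]) * eps_hat) / np.sqrt(abar[t])
    return np.sqrt(abar[t_prev]) * x0_hat + np.sqrt(1.0 - abar[t_prev]) * eps_hat

# toy demonstration with a known x0 and eps (stand-ins for the trained networks)
T = 100
betas = np.linspace(1e-4, 0.02, T)        # assumed schedule
abar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((19, 1280))
eps = rng.standard_normal((19, 1280))
t, t_prev = 99, 49
xt = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps
# with a perfect noise prediction and zero guidance, the step lands exactly
# on the time-t_prev point of the same forward trajectory
x_prev = ddim_step(xt, t, t_prev, abar, eps, grad_logp=0.0)
```

The jump from t = 99 to t_prev = 49 illustrates why DDIM sampling can skip steps and use a reduced T′ = 100 budget.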

E. Re-Training and Classification
Finally, we re-train the classifier $f_{noisy}$ from the Forward Diffusion Noisy Training Module using the augmented EEG dataset obtained in the Reverse Diffusion Data Augmentation Module. The classifier inherits its initial parameters from the initially trained one in Forward Diffusion Noisy Training, and the data inputs have the same form. As shown in Fig. 3, we build the classifier on a 1D-CNN-Transformer. In this module, we set $t = 0$ for each original and synthesized EEG sample $x_t$. We define the CNN part as

$x_t^l = \mathrm{MaxPooling}_l(\mathrm{BatchNorm}_l(\mathrm{Conv}_l(x_t^{l-1})))$,

where $\mathrm{Conv}_l$, $\mathrm{BatchNorm}_l$ and $\mathrm{MaxPooling}_l$ denote the convolution layer, the batch normalization layer and the max-pooling layer of the $l$-th CNN layer, respectively, and $x_t^l$ denotes the output of the $l$-th CNN layer. The number of CNN layers $L$ is set to 2 here, and $x_t^L \in \mathbb{R}^{19 \times 32}$. Then we feed the output of the last CNN layer into a standard transformer encoder layer [34] and get the embedding of all steps. The core module is the classic multi-head self-attention mechanism:

$\mathrm{head}_i = \mathrm{softmax}\!\left( \frac{Q_i K_i^\top}{\sqrt{d_k}} \right) V_i, \quad \mathrm{MultiHead} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O$,

where $\mathrm{head}_i$ is the $i$-th head, the matrices $Q_i$, $K_i$, $V_i$ consist of the queries, keys and values of the $i$-th head respectively, and $W^O$ is the head projection matrix. Here the hidden dimension $d_k = 8$ and the head count $H = 4$. Finally, we take the last step-output as our EEG embedding $x_t^{EEG} \in \mathbb{R}^{32}$, because a temporal model usually integrates the most information in the last step. We also use an embedding layer $T_{emb}$ to obtain the noise time-step embedding $x_t^{noise} \in \mathbb{R}^{32}$ of $x_t$, and feed them together to a fully connected layer for the final classification. We use the cross-entropy (CE) function as the loss:

$L_{CE} = -\frac{1}{m} \sum_{j=1}^{m} \left[ \hat{y}_j \log y_j + (1 - \hat{y}_j) \log(1 - y_j) \right]$,

where $m$ is the number of noise-injected EEG samples, $\hat{y}_j$ is the ground-truth label of the $j$-th noise-injected sample (the same as that of the original EEG sample) and $y_j$ is its final prediction score output by our re-trained classifier $f$.
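The multi-head self-attention step, with the stated sizes ($d_k$ = 8, H = 4, a 32-step sequence of 32-dimensional CNN outputs), can be sketched in numpy. The random projection matrices are stand-ins for learned parameters, and the sketch omits the residual connections and feed-forward part of the full transformer encoder layer:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, H=4, dk=8):
    """Classic multi-head self-attention over a sequence x of shape (steps, d_model)."""
    heads = []
    for i in range(H):
        Q, K, V = x @ Wq[i], x @ Wk[i], x @ Wv[i]   # each (steps, dk)
        A = softmax(Q @ K.T / np.sqrt(dk))          # attention weights, rows sum to 1
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo      # (steps, d_model)

rng = np.random.default_rng(0)
steps, d_model, H, dk = 32, 32, 4, 8
Wq = rng.standard_normal((H, d_model, dk)) * 0.1
Wk = rng.standard_normal((H, d_model, dk)) * 0.1
Wv = rng.standard_normal((H, d_model, dk)) * 0.1
Wo = rng.standard_normal((H * dk, d_model)) * 0.1
x = rng.standard_normal((steps, d_model))   # stand-in for the CNN output x_t^L
out = multi_head_self_attention(x, Wq, Wk, Wv, Wo)
emb = out[-1]   # last step-output used as the 32-dim EEG embedding
```

Taking `out[-1]` mirrors the paper's choice of the last step-output as $x_t^{EEG}$.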

IV. EXPERIMENTS

A. Datasets and Preprocessing
We used two public MDD diagnosis datasets, Mumtaz2016 [16] and Arizona2020 [6], in our experiments. The Mumtaz2016 dataset contains EEG recordings of 34 MDD patients and 30 healthy control subjects. The subjects were recruited from the outpatient clinic of Hospital Universiti Sains Malaysia (HUSM), and the MDD patients were diagnosed based on the DSM-IV criteria [35]. The Arizona2020 dataset consists of EEG recordings of 121 participants in total. The EEG recordings were collected by the University of New Mexico, and the participants were recruited from a broad survey of BDI scores taken at Arizona State University. In our experiments, we chose the 23 current or past MDD patients and the 19 healthy control subjects with the lowest BDI scores as the target subjects, for better data balance and data quality. For both datasets, all the subjects were in the resting state during EEG collection. The EEG electrodes were placed in accordance with the 10-20 system, and we used their 19 common channels: Fp1, F3, C3, P3, O1, F7, T3, T5, Fz, Fp2, F4, C4, P4, O2, F8, T4, T6, Cz and Pz.
The EEG signals were resampled to 256 Hz. 0.5 Hz high-pass and 50 Hz low-pass filters were used to remove environmental noise. For the Arizona2020 dataset, periods with poor quality were detected and removed using EEGLAB. All the EEG recordings were around 5 minutes long, and we clipped them into 5 s long samples using a sliding window with a 2.5 s stride. Finally, for the Mumtaz2016 dataset, there are 14,035 EEG samples in total, including 7,223 positive and 6,812 negative samples. For the Arizona2020 dataset, there are 10,582 EEG samples in total, including 5,710 positive and 4,872 negative samples.
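The clipping step above is consistent with the sample length used in the Methodology (256 Hz × 5 s = 1280 points per sample). A minimal sketch of the sliding-window clipping, assuming an exactly 5-minute recording:

```python
import numpy as np

def clip_recording(x_sub, fs=256, win_s=5.0, stride_s=2.5):
    """Clip a (C, N) EEG recording into overlapping (C, n) samples."""
    n = int(win_s * fs)          # 1280 points per sample
    step = int(stride_s * fs)    # 640-point hop (50% overlap)
    starts = range(0, x_sub.shape[1] - n + 1, step)
    return np.stack([x_sub[:, s:s + n] for s in starts])

# a 5-minute, 19-channel recording yields (300 - 5) / 2.5 + 1 = 119 samples
x_sub = np.zeros((19, 256 * 300))
frames = clip_recording(x_sub)
```

The 50% overlap explains why a few dozen subjects yield over ten thousand samples, and also why samples of one subject are strongly correlated, motivating the subject-independent evaluation below.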

B. Experimental Setup
1) Performance Metrics: Considering the fact that samples of the same subject are closely related to each other, to avoid data leakage we utilized 10-fold subject-independent cross-validation (i.e., all of the samples of one subject were either in the training set or the test set) to evaluate all the models.
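The key property of a subject-independent split is that folds are assigned per subject, not per sample. A minimal sketch (round-robin subject assignment is an assumption; any grouping that keeps each subject in exactly one fold satisfies the protocol):

```python
import numpy as np

def subject_independent_folds(subject_ids, k=10):
    """Assign each subject (not each sample) to one of k folds, so all samples
    of a subject land entirely in either the training or the test set."""
    subjects = sorted(set(subject_ids))
    fold_of_subject = {s: i % k for i, s in enumerate(subjects)}
    return np.array([fold_of_subject[s] for s in subject_ids])

# toy check: 40 subjects, 3 samples each
subject_ids = [s for s in range(40) for _ in range(3)]
folds = subject_independent_folds(subject_ids, k=10)
```

In practice the same effect is obtained with scikit-learn's `GroupKFold`, using subject IDs as the groups.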
We also use common binary classification metrics, including sample-wise accuracy (ACC), F1 score, recall and precision, for performance evaluation. Besides, we introduce the metric of subject-wise accuracy, which indicates the rate of correctly diagnosed subjects. In accordance with the real MDD diagnosis scenario, if over 50% of the samples of one subject are correctly classified, then he or she is counted as correctly diagnosed.
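The subject-wise accuracy rule above (majority vote over a subject's samples) can be computed as follows:

```python
import numpy as np

def subject_wise_accuracy(subject_ids, y_true, y_pred):
    """A subject counts as correctly diagnosed when over 50% of his or her
    samples are correctly classified."""
    subject_ids = np.asarray(subject_ids)
    correct = np.asarray(y_true) == np.asarray(y_pred)
    subjects = np.unique(subject_ids)
    hits = sum(correct[subject_ids == s].mean() > 0.5 for s in subjects)
    return hits / len(subjects)

# toy example: subject 0 has 2/3 samples right, subject 1 has only 1/3
acc = subject_wise_accuracy([0, 0, 0, 1, 1, 1],
                            [1, 1, 1, 0, 0, 0],
                            [1, 1, 0, 0, 1, 1])
# acc == 0.5: one of the two subjects is correctly diagnosed
```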
2) Implementation: We used PyTorch to implement all the models. Our model was trained with the Adam optimizer with a learning rate of 1e-3 for 50 epochs in the first and second modules, and the learning rate was changed to 1e-4 in the last module. We adopted an early stopping strategy to avoid overfitting. We set the batch size to 128 and the rate to 0.5. The transformer blocks of our classifier and diffusion model have 4 heads and 32 hidden dimensions. We trained all the models on a server with an Intel Core i9 10900K CPU and 4 NVIDIA RTX 3090 GPUs.
3) Compared Methods: We compare our model with the following MDD diagnosis methods trained with the same subject-independent training policy (we directly cite the reported results of EEGNet [36] trained with the same subject-independent policy, and implemented all the other methods ourselves): LR (Logistic Regression) is a traditional linear classifier. SVM [37] (Support Vector Machine) uses a linear kernel function to maximize the inter-class margins. XGBoost [38] is an advanced ensemble learning method based on Gradient Boosting Decision Trees (GBDT). All three are classic machine learning methods for MDD diagnosis. Their inputs are representative handcrafted features, including the spectral energy values of different bands [32]. 1D-CNN-LSTM [24], [39] utilizes a 1D-CNN and a Bi-LSTM to learn short-period and long-period temporal features, respectively. 1D-CNN-Transformer utilizes a 1D-CNN and a Transformer encoder layer to learn short-period and long-period temporal features. CWT-1D-CNN uses the Continuous Wavelet Transform (CWT) to transform the raw EEG signals into the time-frequency domain and extract spectral features, and uses a 1D-CNN to extract temporal features. CWT-2D-CNN [8] also adopts the CWT, but then feeds the time-frequency maps into a 2D-CNN to extract local features simultaneously. EEGNet [36] is a pure 1D-CNN based architecture with a global pooling layer at the end. DeprNet [22] is a deep 1D-CNN based architecture with dense layers at the end. InceptionNet [19] utilizes kernels of different sizes to capture temporal features of different scales, and utilizes SE attention to learn channel importance. GC-GRU [25] combines the GCN [40] and GRU models to capture the spatial-temporal features of EEG, where the brain connectivity graphs are sliced by time and built in an adaptive way. In addition, we also compare with the SOTA methods for emotion recognition. TSception [41] utilizes InceptionNet in both the time and spatial dimensions to capture spatial-temporal features. GRU-Conv [42] utilizes a hybrid architecture with a GRU module to capture temporal features and a CNN to capture spatial features.

C. Results
1) Overall Performance: Table I compares the performance of our model with other MDD diagnosis methods on the two public datasets. In conclusion, our model achieves state-of-the-art performance on almost all the metrics, especially F1, ACC and subject-wise ACC. We mainly focus on these three metrics because they better reflect the comprehensive performance of the models. A higher F1 indicates a better balance between precision and recall, which is more important for MDD diagnosis than a single precision or recall score. Higher ACCs indicate a smaller probability of misdiagnosing a patient. We can find that all the models perform better on the Mumtaz2016 dataset than on the Arizona2020 dataset, mainly due to the unequal data quality of the two datasets. On the Mumtaz2016 dataset, all the models achieve relatively balanced recall and precision, and F1 and ACC are also high. On the other hand, some of the models fail to balance precision and recall well on the Arizona2020 dataset, and F1 and ACC are also lower. However, we can still draw some common conclusions on both datasets. The ML methods, LR, SVM and XGBoost, perform worse than the DL methods. Compared to our proposed model, they perform nearly 20% worse in F1 and ACC, and nearly 15% worse in subject-wise ACC on the Mumtaz2016 dataset. They also perform nearly 20% worse in F1, and nearly 10% worse in ACC and subject-wise ACC on the Arizona2020 dataset, and they balance recall and precision badly. The reason may be that they rely heavily on handcrafted features, and some useful information can be lost during feature extraction.
The deep learning models are based on different architectures. The hybrid 1D-CNN-LSTM performs nearly 10% worse in F1 and ACCs compared with our model on both datasets. The 1D-CNN-Transformer performs better than it because of its superiority in capturing long-term features, but is still over 5% worse than our model on both datasets. This indicates that simply stacking sequence models may cause overfitting and cannot improve performance efficiently. Our classifier is also a base 1D-CNN-Transformer, but the diffusion-based noisy training and data augmentation steps of the framework regularize the learning procedure, thus alleviating the overfitting problem and helping the model perform better. CWT-1D-CNN and CWT-2D-CNN transform the raw EEG into the time-frequency domain and extract features with CNNs, but the 2D version performs about 5% worse than the 1D version in F1 and ACCs on both datasets. The 2D-CNN architecture is not suitable since local invariance is not the same along the time axis and the frequency axis. CWT-1D-CNN still performs 10% worse than our model in F1 and ACCs on both datasets, indicating that the CNN architecture may not be powerful enough, and that the time-frequency transform may cause a loss of information. EEGNet, the shallow plain 1D-CNN model, performs almost 10% worse than our model in F1 and ACCs on the Mumtaz2016 dataset, and around 5% worse on the Arizona2020 dataset. InceptionNet, the wide 1D-CNN model, performs almost equally with EEGNet, but it fails to balance recall and precision well on the Arizona2020 dataset. Similar to the situation for CWT-1D-CNN and CWT-2D-CNN, the 2D version of InceptionNet, TSception, performs worse than the 1D version (by about 5% in both F1 and ACCs on both datasets). This also indicates the importance of designing a suitable spatial feature encoder for EEG. DeprNet, the deep 1D-CNN model, performs the best among all the compared MDD diagnosis models, but is still nearly 5% worse than our model in F1 and ACCs on both datasets. The above three models are all entirely based on 1D-CNNs and all use the raw data in the time domain. These 1D-CNN architectures face the problem of overfitting, mainly due to the low signal-to-noise ratio of EEG and the insufficient amount of high-quality data. GRU-Conv performs over 5% worse than our model in F1 and ACC on Mumtaz2016, over 10% worse in F1 and ACCs on Arizona2020, and worse than DeprNet. This indicates that the CNN architecture is more suitable than the GRU for feature extraction in the time domain. GC-GRU's performance is similar to DeprNet's, still nearly 5% worse than our model in F1 and ACC on both datasets. This indicates that GNNs are also applicable when the brain connectivity graphs are built in a reasonable way, but they still face the overfitting problem.
Compared with other MDD diagnosis methods and other EEG-based DL methods in related fields, our diffusion-based model can address the severe overfitting problem caused by the poor quality of EEG and the limited data size. Our Forward Diffusion Noisy Training Module helps the model learn noise-irrelevant features from Gaussian-noise-injected data. Based on this, the Reverse Diffusion Data Augmentation Module can generate more EEG data for further training, which helps the classifier learn more generalized features. Both diffusion-based modules regularize the classifier and improve its generalization performance. It achieves the best performance in the ACC scores, both sample-wise and subject-wise, as well as the F1 score. This indicates that our model not only performs the best from the comprehensive perspective of classification, but also generalizes well when diagnosing MDD for a new patient.
2) Ablation Study: To investigate the contribution of each module, we conducted an ablation study. Since our model consists of three modules, we consider the single Classification Module as our baseline, and add the Forward Diffusion Noisy Training Module and the Reverse Diffusion Data Augmentation Module respectively to study their effectiveness. As shown in Fig. 5, both diffusion-based modules contribute significantly to the model performance, but the Reverse Diffusion Data Augmentation Module contributes more. When the two modules are both removed, the model's performance drops by around 5% in F1 and ACC on the Mumtaz2016 dataset. On the Arizona2020 dataset, it drops by over 10% in F1, and around 5% in ACC and subject-wise ACC. This indicates that the single classifier faces the problem of overfitting and fails to capture noise-irrelevant and generalized features, especially on the Arizona2020 dataset. When we add the Forward Diffusion Noisy Training Module, F1, ACC and subject-wise ACC increase by around 2% on the Mumtaz2016 dataset. Recall increases dramatically (over 6%), while precision drops slightly (around 1%), so the overall performance still improves. On the Arizona2020 dataset, F1 increases by 8.2% and ACC by 1.9%. This indicates that our Forward Diffusion Noisy Training Module can help the model learn noise-irrelevant features and thus improve the model performance, especially helping balance recall and precision. When we add the Reverse Diffusion Data Augmentation Module to the base classifier, we find that it improves the model performance on both datasets even more significantly. On the Mumtaz2016 dataset, it improves F1 by over 3%, and ACC and subject-wise ACC by around 2%. On the Arizona2020 dataset, it improves F1 by 8.8%, ACC by 3.7% and subject-wise ACC by 2.4%. This indicates that our Reverse Diffusion Data Augmentation Module can provide more valid and diverse data for our model to learn generalized features. In conclusion, the two designed diffusion-based modules regularize the model in different ways, and both alleviate the severe overfitting problem. They help the classifier learn noise-irrelevant and more generalized features for MDD diagnosis, so they improve the model's overall performance.
3) Comparison With Other Data Augmentation Methods: We also conducted comparative experiments with other data augmentation methods to validate the effectiveness of our Reverse Diffusion Data Augmentation Module. The compared methods include basic time-series augmentation methods and GAN-based methods [43], [44]. First, we find that our Reverse Diffusion Data Augmentation Module performs the best on both datasets. WGAN-GP performs the best among the compared data augmentation methods, but our module outperforms it by around 2% in F1, ACC, and subject-wise ACC on both datasets, indicating better overall performance. Besides, the GAN-based methods generally perform better than the basic time-series augmentation methods, but the gaps are narrow. Among the basic time-series augmentation methods, Flipping performs the worst and STFT Shuffling performs the best. The reason may be that random perturbation in the time-frequency domain can effectively enrich data diversity. GANs were the state-of-the-art augmentation methods before DDPM, and WGAN-GP performs the best among them, mainly due to its improved distance metric and training procedure. In summary, our Reverse Diffusion Data Augmentation Module can synthesize new EEG samples of high quality and diversity. By doing so, it effectively alleviates the overfitting problem caused by limited data in the MDD diagnosis setting, and helps our model learn generalized features and perform better.
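The basic time-series augmentations compared above are simple to implement. As a minimal sketch (the paper does not specify its exact variants, so the function names, the sign-negation form of Flipping, and the jitter noise level are our assumptions), the following operates on EEG arrays of shape (channels, time):

```python
import numpy as np

rng = np.random.default_rng(0)

def flipping(x):
    """Flipping: negate the signal amplitude (one common variant)."""
    return -x

def jittering(x, sigma=0.05):
    """Jittering: add small Gaussian noise in the time domain."""
    return x + rng.normal(0.0, sigma, x.shape)

def time_reversal(x):
    """Another basic variant: reverse each channel along the time axis."""
    return x[..., ::-1]
```

Each transform preserves the array shape, so augmented samples can be appended directly to the training set with the original labels.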
4) Sensitivity Analysis of Hyper-Parameters: To test the robustness of our model, we conducted sensitivity experiments for two important hyper-parameters in the diffusion-based modules: the data augmentation size and the max noise timestamp in the forward diffusion process.
Fig. 6 shows how the model's F1 and ACC on both datasets vary with the augmentation size. As the augmentation size increases from 0 to 1 times the original data size, the model's performance also increases. However, the performance gain slows as the augmentation size grows. For instance, F1 and ACC increase by around 1.5% on Mumtaz2016 when the augmentation size increases from 0 to 0.5 times, while the increase is smaller than 1.0% from 0.5 to 1.0 times. Finally, at an augmentation size of 2 times, the model's performance on both datasets is nearly equal to its performance at 1 times. This indicates that, on one hand, data augmentation improves the size and diversity of EEG samples and hence alleviates overfitting. On the other hand, it has limitations, since the generated samples may lose diversity as the augmentation size grows.
Fig. 7 shows how the model's F1 and ACC on both datasets vary with the max noise timestamp in the forward diffusion process. When the max noise timestamp increases from 0 to 1000, the model's performance also increases (by around 2% in F1 and ACC on both datasets). However, when the max noise timestamp exceeds 1000, the performance drops, and the model performs almost equally on both datasets at max noise timestamps of 0 and 2000. The reason may be that adding a moderate amount of noise during training helps the model learn more noise-irrelevant features, whereas too much noise makes the model overfit more easily, so its performance drops instead. Overall, according to Fig. 6 and Fig. 7, DiffMDD is robust to changes in the hyper-parameters of the diffusion-based modules.
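The noise injection controlled by the max noise timestamp follows the standard DDPM closed form. As a hedged sketch (the schedule length, the linear beta schedule, and the function names are our assumptions, not the paper's exact settings), noising an EEG segment at a timestep sampled below the max noise timestamp might look like:

```python
import numpy as np

T = 2000                                   # assumed schedule length
betas = np.linspace(1e-4, 0.02, T)         # standard linear DDPM schedule
alpha_bar = np.cumprod(1.0 - betas)        # cumulative product \bar{alpha}_t

def noisy_sample(x0, t, rng):
    """Closed-form forward diffusion:
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def noisy_training_sample(x0, t_max, rng):
    """Sample a timestep uniformly below the max noise timestamp, then noise x0."""
    t = rng.integers(0, t_max)
    return noisy_sample(x0, t, rng)
```

A larger `t_max` thus exposes the classifier to more heavily corrupted copies of each segment, which matches the trade-off observed in Fig. 7.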
5) Generated Data Visualization: To validate the quality of our generated data, Fig. 8 visualizes the CWT results of the average band power of healthy subjects and MDD patients in the original and generated data in the time-frequency domain. In general, our generated data is quite similar to the original data. For example, in both the original and the generated data, the band power of healthy subjects' EEG is more evenly distributed across frequency bands. On the contrary, the band power of MDD patients' EEG concentrates in the low-frequency bands, which indicates that their brains are less active. However, since noise is introduced in the diffusion processes, there are also some reasonable differences between the generated and original data. For instance, the differences between the EEG of the two groups are less significant in the generated data, while the volatility over time is more pronounced. The similarities between the generated and original data demonstrate the quality of our generated data, while the small yet reasonable differences improve diversity, providing more information for the model to learn more generalized features.
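The time-frequency band power compared in Fig. 8 comes from a continuous wavelet transform. As a hedged sketch (the wavelet family, the number of cycles, and the normalization used in the paper are not specified, so those choices are our assumptions), a self-contained complex-Morlet CWT power computation for one EEG channel:

```python
import numpy as np

def morlet_cwt_power(x, fs, freqs, n_cycles=6.0):
    """CWT band power of a 1-D signal via convolution with complex Morlet wavelets."""
    power = np.empty((len(freqs), len(x)))
    for i, f in enumerate(freqs):
        sigma = n_cycles / (2.0 * np.pi * f)           # Gaussian width in seconds
        tw = np.arange(-5 * sigma, 5 * sigma, 1.0 / fs)
        wavelet = np.exp(2j * np.pi * f * tw) * np.exp(-tw**2 / (2 * sigma**2))
        wavelet /= np.abs(wavelet).sum()               # simple L1 normalization
        power[i] = np.abs(np.convolve(x, wavelet, mode="same")) ** 2
    return power
```

Averaging such power maps over segments and subjects in each group yields the kind of comparison shown in Fig. 8; the signal must be longer than the lowest-frequency wavelet for `mode="same"` to return the signal length.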

V. CONCLUSION
In this paper, we propose DiffMDD, a novel diffusion-based deep learning framework for MDD diagnosis using EEG. Considering the large amount of noise contained in non-invasive EEG, we first design the Forward Diffusion Noisy Training Module to help the model learn noise-irrelevant features from Gaussian-noise-injected data. Based on this, we design the Reverse Diffusion Data Augmentation Module to generate more EEG data for further training. The diffusion model is conditioned and guided by the output gradients and EEG embeddings of the classifier initially trained in the first step, and the generated data, of high quality and diversity, helps the classifier learn more generalized features. Finally, we re-train the classifier on the augmented EEG data to obtain the MDD diagnosis results. The diffusion-based modules regularize the classifier, help it learn more noise-irrelevant and generalized features, and make it perform better on unseen subjects.
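The gradient-guided generation summarized above follows the general classifier-guidance recipe for diffusion models, in which the reverse-step mean is shifted along the classifier's log-probability gradient. A minimal sketch of one guided reverse step (`eps_model`, `grad_log_p`, the guidance scale `s`, and the variance choice `sigma^2 = beta_t` are placeholders and our assumptions, not the paper's exact design):

```python
import numpy as np

def guided_reverse_step(x_t, t, eps_model, grad_log_p, betas, rng, s=1.0):
    """One reverse DDPM step with classifier guidance.

    eps_model(x_t, t)  -> predicted noise (stand-in for the trained diffusion net)
    grad_log_p(x_t, t) -> gradient of log p(y | x_t) from the trained classifier
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps = eps_model(x_t, t)
    # posterior mean of the unguided reverse step
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    sigma2 = betas[t]
    # classifier guidance: shift the mean toward higher p(y | x_t)
    mean = mean + s * sigma2 * grad_log_p(x_t, t)
    if t == 0:
        return mean  # final step is deterministic
    return mean + np.sqrt(sigma2) * rng.standard_normal(x_t.shape)
```

Iterating this step from pure noise down to t = 0, with the label y fixed per class, yields class-conditional synthetic samples of the kind used for augmentation.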
Our model achieves state-of-the-art performance in F1 and accuracy under 10-fold subject-independent cross-validation on two public MDD diagnosis datasets. Additionally, our ablation studies of the diffusion-based modules show that both modules contribute significantly to the model performance, with the Reverse Diffusion Data Augmentation Module contributing more. The comparison of the Reverse Diffusion Data Augmentation Module with other augmentation methods, as well as the visualization of the original and generated EEG samples, demonstrates the effectiveness of our data augmentation method. Besides, the sensitivity analysis of the two hyper-parameters in the diffusion-based modules shows the robustness of our model. The comparison of the band power of the original and generated EEG demonstrates the reliability of our generated data.
In future work, we plan to collect and conduct experiments on more and larger datasets to draw more generalizable conclusions. We also plan to further explore generative models for EEG and comprehensively assess the generation quality. In addition, we will seek to cooperate with a hospital and deploy our model in practice.
DiffMDD: A Diffusion-Based Deep Learning Framework for MDD Diagnosis Using EEG

Yilin Wang, Sha Zhao, Member, IEEE, Haiteng Jiang, Shijian Li, Benyan Luo, Tao Li, and Gang Pan, Senior Member, IEEE

Fig. 1. The three-channel EEG of a healthy control subject and an MDD patient in the resting state. The raw EEG was decomposed into a signal component and a noise component using ICA.

Fig. 2. The overall framework of our diffusion-based DiffMDD. The forward and reverse diffusion processes are also illustrated.

Figure 2 illustrates the overview of our model, which integrates the forward and reverse diffusion processes to improve the generalization ability. It mainly contains three modules: (1) Forward Diffusion Noisy Training, (2) Reverse Diffusion Data Augmentation, and (3) Re-training and Classification. First, the Forward Diffusion Noisy Training Module is designed to tackle the data quality problem. To help the model learn noise-irrelevant features, we adopt the forward diffusion process and inject Gaussian noise into the original EEG data to initially train the classifier. The second Reverse Diffusion

Fig. 3. The architecture of our MDD diagnosis classifier, which is based on a 1D-CNN-Transformer.

Fig. 4. The architecture of our diffusion model. Each residual layer consists of a temporal 1D-CNN-Transformer and a spatial transformer.

Fig. 5. The ablation study of model modules on the Mumtaz2016 and Arizona2020 datasets.
All the data augmentation methods are used together with the Forward Diffusion Noisy Training Module and the base Classification Module. The GAN-based methods use the same base 1D-CNN-Transformer architecture as our classifier to build the discriminator, and the same architecture as the diffusion model in our Reverse Diffusion Data Augmentation Module to build the generator. The results are shown in Tab. II.

Fig. 7. Sensitivity analysis of the max noise timestamp in the forward diffusion process on the Mumtaz2016 and Arizona2020 datasets. A larger max noise timestamp indicates that more noise is added in the Forward Diffusion Noisy Training Module.

Fig. 8. Comparison of the average band power of healthy subjects and MDD patients in the original and generated data in the time-frequency domain.

TABLE I. OVERALL PERFORMANCE COMPARISON WITH OTHER MDD DIAGNOSIS METHODS ON THE MUMTAZ2016 AND ARIZONA2020 DATASETS. "-" DENOTES THAT THE CORRESPONDING VALUE IS NOT PROVIDED.

TABLE II. COMPARISON OF REVERSE DIFFUSION DATA AUGMENTATION WITH OTHER AUGMENTATION METHODS ON THE MUMTAZ2016 AND ARIZONA2020 DATASETS.