MASTF-net: An EEG Emotion Recognition Network Based on Multi-Source Domain Adaptive Method Based on Spatio-Temporal Image and Frequency Domain Information

In the field of neuroscience, the electroencephalogram (EEG) is a crucial indicator of emotion. EEG emotion recognition based on domain adaptation (DA) offers good objectivity and high time resolution, and is the preferred method for studying the brain's response to emotional stimuli. However, because EEG emotion characteristics are markedly unstable, a model that merges all source domains into a single source struggles to predict the emotion corresponding to EEG signals across subjects. To address cross-subject emotion analysis, we propose an EEG emotion recognition network with a cross-subject multi-source adaptive method (MASTF-net), in which the EEG features of different subjects are regarded as different domains. By analyzing the invariance of the target domain and the uniqueness of each source domain, this method performs emotion analysis for different subjects according to spatio-temporal images and frequency domain information. First, features of the EEG image are extracted along the frequency and time dimensions. Second, combined with the serialized EEG frequency characteristics of local brain regions, independent classification modules are established for the different domains to recognize the emotion feature distribution of different subjects. In addition, a feature extraction method for differential entropy (DE) data of EEG is proposed based on frequency band division, which provides stable feature input for our network structure. Finally, experiments are conducted on the SEED dataset. The experimental results show that our method achieves better classification accuracy in the cross-subject setting, and that MASTF-net is superior to other relevant multi-source domain methods and models. For cross-subject emotion analysis, the highest accuracy of our method reaches $88.19\%$.


I. INTRODUCTION
Emotion, as a form of physiological information, serves as a crucial indicator of individuals' current mental state during everyday communication [1], [2]. Its significance is evident in various contexts, such as mental state assessment [3], emotion classification, and the development of user-friendly interfaces.
From a health perspective, frequent fluctuations in emotion often contribute to the onset of mental illness. In an exploration of this phenomenon, Jutta and Lan [4] discovered a strong correlation between depression and the implementation of emotion regulation strategies. Similarly, Radford [5] investigated the recognition of nonverbal cues and emotional signals in patients diagnosed with Alzheimer's disease. Furthermore, Weilong and Baoliang [6] directed their research towards understanding psychological issues through traditional means, such as the analysis of facial expressions and language, which enables the refinement and classification of emotional problems. However, these methods often fail to yield satisfactory results, as they can be influenced by the subjects' personal circumstances. Therefore, there is a need for a measurement method that can effectively capture the genuine sentiments of the subjects.
In recent years, non-invasive brain-computer interfaces, such as electroencephalography (EEG), have gained widespread usage in acquiring brain signals and analyzing psychological disorders. These techniques are appealing due to their reliability, accessibility, and high precision [7], [8]. Researchers have focused on developing methods for detecting mental health conditions based on EEG signals, with the goal of identifying the onset of mental illness and enabling early and effective treatment initiation. For instance, Yildirim et al. [9] extracted features based on spectral power and phase-locked values, and conducted emotion recognition research using a swarm intelligence (SI) algorithm. Reference [10] utilized the spatial information extracted from EEG signals to classify individuals with depression. Similarly, Bingtao et al. [11] proposed a network framework for diagnosing severe depression based on EEG data. Additionally, Behshad and Mohammad [12] investigated the nonlinear characteristics of EEG signals to differentiate between patients with depression and healthy individuals. These studies highlight the potential of EEG-based approaches in diagnosing mental health disorders and facilitating timely intervention.
Emotion recognition based on EEG signals can reflect changes in the subject's emotions more accurately than methods based on facial and language information. However, given the individual differences in EEG signals, it remains challenging to share a network model trained on specific subjects with other subjects for testing. Although detection based on the EEG signals of a single subject improves the efficiency of model training, it is difficult to apply in cross-subject and cross-session scenarios.
To address the issue of emotion analysis among multiple subjects, domain adaptation (DA) has been widely employed in this line of research. DA improves learning in the unlabeled target domain by transferring knowledge from the source domains, which can significantly reduce the number of labeled samples required [13].
In practical applications, we often encounter scenarios with multiple source domains, such as cross-subject emotion analysis, where EEG data from different subjects are considered multi-source domain data. In recent years, many researchers have tended to merge multiple source domain data into one domain. For example, Yiming et al. [14] and He and Yiming [15] employed deep adaptation networks (DANs) [16] for emotion recognition based on EEG signals, using maximum mean discrepancy (MMD) [17] as a distance metric between source and target domains.
Combining multiple source domain data into one domain disrupts the unique characteristics of EEG signals from different subjects and neglects the differences in EEG among subjects, resulting in significant biases in cross-subject applications. To address this issue, we introduce a multi-branch structure [13], establishing one-to-one DA branches between the target domain and each source domain to preserve the specific features of each source domain. Similar methods appear in [13], [18], [19]. Pan and Zheng [18] proposed the MSFR-GCN model for EEG-based emotion and cognition recognition, using graph convolutional networks (GCN) to obtain graph representations of EEG signals. Wenpeng and Bing [19] developed an end-to-end deep convolutional neural network (CNN) for cross-subject transfer using attention-based CNNs. However, these methods only consider the spatial features of different electrode locations in EEG signals and neglect their temporal sequence characteristics, leading to suboptimal model performance.
To address multi-source domain adaptation and to effectively utilize both temporal and spatial features in EEG-based emotion recognition, we propose an EEG emotion recognition network using a multi-source domain adaptive method based on spatio-temporal image and frequency domain information (MASTF-net). In MASTF-net, a dual-branch shared feature extraction network extracts spatial and temporal features from EEG signals respectively, and a one-to-one DA branch feature extractor is established to capture domain-specific features. In this way, we account for the differences in EEG signals among subjects as well as for the temporal and spatial features.
To sum up, the main contributions of our work are as follows:
• We propose a module (L-RNN) for extracting serial EEG temporal features from local brain regions, which is used to extract features of emotional state differences between left and right brain regions and learn the differences between brain regions under different emotions. [23].
In recent years, there has been a growing interest among researchers in the field of artificial intelligence (AI) and, more specifically, in algorithms based on machine learning (ML). This increased interest can be attributed to the higher robustness, improved accuracy, and technological diversification offered by AI. However, most research has relied on expensive large-scale equipment to collect experimental data, which poses a major obstacle to the study of biological signals. Recently, BCI has come to play an important role in emotion recognition, because it allows simple equipment to overcome the research difficulties caused by large equipment. BCI establishes a connection between the computer and the subject's brain physiological signals; this connection is non-invasive, so collecting data causes no harm to the subject. In addition, researchers have begun to pay attention to the application of deep learning in BCI-based emotion analysis. Alexander and Yongtian [24] used deep learning (DL) algorithms for EEG, especially for emotional assessment tasks. In addition, some models retrieve emotional states from EEG signals based on DL algorithms [6], [25]. These models adopt DL to analyze EEG signals converted to different forms. For example, EEG signals were transformed into graphical representations in [26], and vectors were separated between each hemisphere in [6].

B. DOMAIN ADAPTATION VIA DEEP LEARNING
In EEG emotion analysis involving multiple subjects, many researchers use one-to-one domain adaptation (DA) after combining the EEG signals from different subjects into a single source domain, which ignores the differences in marginal distribution between EEG domains. To solve this problem, Chen et al. proposed MS-MDA [13], which regards the data of each subject as a separate domain, thus avoiding the destruction of the marginal distributions of the multiple EEG source domains while also taking domain-invariant features into account.
To solve the problem of EEG emotion classification across subjects, MASTF-net is proposed for estimating emotions from EEG. The network is composed of a common feature extraction module with two branches and a domain feature extraction module based on a multilayer perceptron (MLP). In addition, the EEG data are drawn from three domains: frequency, time and space.

III. MATERIALS

A. DATASET
In this article, we use SEED and SEED-IV. In SEED, there are three emotion labels, where 0 represents sadness, 1 represents neutral emotion, and 2 represents positive emotion. Fifteen people participated in the test, including 7 men and 8 women. EEG signals were recorded with the ESI NeuroScan system at a sampling rate of 1000 Hz while the subjects watched 15 different movie clips. In SEED-IV, there are four emotion labels, where 0 represents happy, 1 represents sad, 2 represents neutral emotion and 3 represents fear. In these two datasets [19], each subject has gone through three session experiments, and the session number is represented by $n_s \in \{1, 2, 3\}$, while $n_p \in \{1, 2, 3, \ldots, 15\}$ indicates the subject number. To make the meaning of the symbols in the paper clear, the symbols are listed in Table 1.

B. SIGNAL PREPROCESSING
EEG is one of the physiological signals that can truly reflect people's inner activities. However, during EEG data collection, the acquisition equipment and acquisition environment often introduce interference from eye movement, electromyography (EMG) or other high-frequency noise. Therefore, the collected EEG data can only be used after appropriate preprocessing. In the SEED dataset, the raw EEG signal is down-sampled to 200 Hz, and then the noise interference is reduced by 1 Hz to 75 Hz bandpass filtering. Next, the signal features are extracted by DE. Because DE features have the best performance in feature selection, they are chosen as the input of MASTF-net. The SEED dataset provides DE features in five bands, which are delta (1-4 Hz), theta (4-8 Hz), alpha (8-14 Hz), beta (14-31 Hz) and gamma. The DE feature is extracted as follows [26]:

$$h(x) = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right) dx = \frac{1}{2}\log\left(2\pi e\sigma^2\right) \tag{1}$$

where the input $x$ follows the Gaussian distribution $N(\mu, \sigma^2)$, $\mu$ and $\sigma$ represent the mean and the standard deviation, respectively, and $e$ is Euler's number. According to [26], the raw EEG data are input in the form of a vector, and the DE feature $h(x)$ is obtained by transforming the raw EEG input through Equation (1). The dimension of a DE feature map is 310 (62 channels × 5 frequency bands), where the 62 channels correspond to the 62-channel ESI NeuroScan system used to collect the EEG signals, and the five frequency bands are denoted by the letters δ, θ, α, β and γ, respectively.
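As a minimal sketch of the closed form of the DE feature under the Gaussian assumption, $\frac{1}{2}\log(2\pi e\sigma^2)$, the snippet below computes DE for one band-filtered segment. The function name `differential_entropy`, the synthetic segment, and the choice of a 200-sample window are illustrative assumptions, not details taken from the SEED pipeline.

```python
import math
import random
import statistics


def differential_entropy(samples):
    """DE of a segment assumed Gaussian: 0.5 * ln(2 * pi * e * sigma^2)."""
    variance = statistics.pvariance(samples)  # population variance sigma^2
    return 0.5 * math.log(2 * math.pi * math.e * variance)


# Hypothetical segment: 200 samples (one second at 200 Hz) drawn from N(0, 2^2).
rng = random.Random(0)
segment = [rng.gauss(0.0, 2.0) for _ in range(200)]
de_value = differential_entropy(segment)

# Closed-form value for sigma = 2, for comparison with the estimate above.
expected = 0.5 * math.log(2 * math.pi * math.e * 4.0)
```

In practice the same function would be applied per channel and per frequency band, giving the 62 × 5 layout described above.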

C. FEATURE SELECTION ANALYSIS
To understand how the strength of EEG signals changes over time, the data of three different emotions (sad, neutral, and happy) are analyzed. Because the movie clips in SEED differ in length, we select the first 60 seconds and the last 60 seconds of each movie clip to judge in which time period emotional activity is most obvious. Here, 100 EEG data are randomly selected, each of which consists of four frames with an interval of 0.1 second between frames. Then, the EEG data are averaged and mapped into the EEG topographic map. The color in the EEG topographic map represents the intensity of the EEG signal: the darker the color, the more obvious the emotional activity of the EEG, as shown in Fig. 1, where (a) is the map averaged over 100 random samples in the first 60 seconds and (b) is the map averaged over 100 random samples in the last 60 seconds. The comparison between Fig. 1(a) and (b) shows that the period of emotional activity always appears in the early stage of the emotional test. As time goes on, the emotional activity becomes weaker and weaker.
Because different movie clips have different lengths, the dimensions of the DE features differ. If the data samples are not filtered, the number of samples per label will vary too much, which affects the final classification results. To keep the data samples balanced, the number of samples for each label should be consistent. The duration of each trial is not uniform and, from a psychological point of view, significant emotional changes often occur in the first part of a period of time. Therefore, to balance the number of samples and obtain better classification results, the first 120 DE features of each trial are used in this paper, as shown in Fig. 2. The total number of training data for each session is $1 \times 15 \times n_t \times \lfloor 120/L_t \rfloor$, where 1 represents one session, 15 represents the total number of subjects, $n_t$ represents the total number of trials in each session of one person, and $L_t$ represents the length of each sample, as shown in Fig. 2. Because there are three sessions in total, the total number of samples is $3 \times 15 \times n_t \times \lfloor 120/L_t \rfloor$, where 3 represents the total number of sessions. $L_t$ ranges from 4 to 8, which guarantees that the total number of samples is at least 10000 and therefore sufficient.
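The sample-count formula can be checked with a short sketch. It assumes $n_t = 15$ trials per session, matching the 15 movie clips of SEED described in the dataset section; the function name `total_samples` is hypothetical.

```python
def total_samples(n_sessions, n_subjects, n_trials, window_length, kept_features=120):
    """Total samples = sessions * subjects * trials * floor(kept_features / L_t)."""
    return n_sessions * n_subjects * n_trials * (kept_features // window_length)


# Assumption: 15 trials per session (one per SEED movie clip).
counts = {L_t: total_samples(3, 15, 15, L_t) for L_t in range(4, 9)}
smallest = min(counts.values())
```

Under this assumption the count decreases as $L_t$ grows, and the smallest value (at $L_t = 8$) still exceeds 10000, which is consistent with the bound stated above.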

IV. METHODS
A new architecture is proposed that combines multi-source domain learning and spatio-temporal EEG image representation, as shown in Fig. 3. The model mainly includes three modules: two common feature extraction modules (TFEM and SFEM) and one module (DACM) for independent adaptive domain feature classification. TFEM evaluates the relationship between EEG data from the left and right hemispheres according to the brain electrode positions in different local areas, and extracts the temporal features of EEG signals based on the contribution of each element at each spatial level. SFEM learns the spatial information contained in EEG data through a spatial extraction network, and extracts the time frame features of EEG images with a gated recurrent unit (GRU) network [27].
In DACM, an independent discriminant branch is established for each different source domain data to learn the features of multiple specific domains.Next, we will describe the details of each module.

B. TEMPORAL FEATURE EXTRACTION MODULE (TFEM)
Fig. 1 shows the level of brain activity in different regions: EEG activity is high in the frontal, temporal, and occipital lobes, but relatively weak in the parietal lobe. The most active regions of the brain are the left prefrontal lobe, right prefrontal lobe, left temporal lobe, right temporal lobe, left occipital lobe, and right occipital lobe, which shows that EEG signals in these regions are more discriminative. In selecting EEG signal data, the relevant knowledge of EEG signals is considered. The frontal and temporal brain regions, especially the prefrontal regions, are mainly involved in regulating emotions and memory functions. The occipital lobe is mainly responsible for visual function, so the electrode signals in the occipital region are also essential for visual emotion analysis. The parietal lobe is mainly responsible for tactile and motor functions and plays a relatively weak role in emotion analysis. The results shown in Fig. 1 are consistent with this physiological knowledge.

1) FEATURE SELECTING (FS) OF EEG TOPOGRAPHIC MAP REGIONS
As shown in Fig. 3, the dimension of $r_{n_p}^{n_s}$ is $n_c \times n_b \times L_t$. The features of $r_{n_p}^{n_s}$ are extracted by FS in eight regions. The process of FS is as follows. We divide $r_{n_p}^{n_s}$ according to $L_t$, where $r_{n_p}^{n_s} = [a_1, a_2, \ldots, a_i, \ldots, a_{L_t}]$, $i \in \{1, 2, \ldots, L_t\}$, and each $a_i$ has dimension $n_c \times n_b$. As shown in Figure 4, 36 electrodes are selected from the $n_c$ electrodes, and the 36 electrodes are divided into eight regions $A_1, A_2, \ldots, A_j, \ldots, A_8$. $a_i^{A_j}$ represents the DE features of region $A_j$ at time step $i$. The dimension of $a_i^{A_j}$ is $n_{A_j} \times n_b$, where $n_{A_j}$ represents the total number of electrodes in region $A_j$.
To obtain the difference between the left- and right-brain EEG features while retaining the time information of the EEG signal, we merge the EEG features of the same region over the $L_t$ time steps into one matrix, $R^{A_j} = [a_1^{A_j}, a_2^{A_j}, \ldots, a_{L_t}^{A_j}]$, and pair the symmetrical regions as $R_{n_p}^{n_s} = \{(R^{A_j}, R^{A_{j+1}})\}$, $j \in \{1, 3, 5, 7\}$, where $R_{n_p}^{n_s}$ is the result of FS.

2) L-RNN AND TFEM STRUCTURE CONFIGURATIONS
Time, space and frequency information can all be extracted from EEG signals. To also consider the relationship between the two cerebral hemispheres within the same time period, an L-RNN module consisting of multiple parallel RNNs and an LSTM [30] is established, where each layer of L-RNN is used to extract information from different regions, including electrodes, physiological region information, and hemisphere-based relationships. The structure of L-RNN is shown in Fig. 5.
Two parallel RNNs with the same structure are used to initially extract the time information of the DE features, and the time-based information is processed through the RNN hidden layer. The traditional RNN unit suffers from the vanishing gradient problem, which impairs the processing of the DE features of EEG. If the DE features are processed through a simple RNN alone, the needs of EEG analysis may not be met. Therefore, after the feature information is extracted through the RNNs, the EEG information of different regions is no longer processed by an RNN; instead, an LSTM unit is used to further extract the EEG features of the local hemispheres. Compared with the RNN, the LSTM both improves the memory of time information and mitigates the vanishing gradient problem.
Each input of the L-RNN module is a pair of symmetrical EEG features $\{R^{A_j}, R^{A_{j+1}}\}$. After the features of each region are obtained, LSTM is used to obtain the feature information between the left and right brain over the time series in order to capture the activity difference between the left and right brain, which is represented as $x_i^{A_{j,j+1}} = \mathrm{ReLU}\left(b_i^{A_j} \ominus b_i^{A_{j+1}}\right)$, where $\ominus$ denotes subtraction of the corresponding positions of the matrices, $b$ is the feature extracted by the LSTM, and the ReLU activation function is applied after $\ominus$ to give the final output $x_i^{A_{j,j+1}}$ of L-RNN. The final output is a matrix with dimension $L_t \times 16$.
Each L-RNN module extracts the EEG features of two symmetrical hemisphere regions $A_j$ and $A_{j+1}$. To obtain the DE feature information of all important regions from $A_1$ to $A_8$, multiple parallel L-RNNs are spliced together to form a TFEM module, as shown in Fig. 6. The input of each TFEM module is the DE features of the eight regions. The paired EEG features are input into four L-RNN modules with the same structure, and the results obtained by the L-RNNs are flattened and spliced [31] after a fully connected layer (FC layer) [32]. A vector with a length of 64 is obtained after Norm [32].
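The left-right difference step can be illustrated with a small sketch that subtracts two feature matrices position by position and applies ReLU. It assumes the LSTM outputs for the two symmetrical regions are already available as nested lists of equal shape; the function name `relu_difference` and the toy values are hypothetical.

```python
def relu_difference(left, right):
    """Element-wise left minus right, followed by ReLU (negative values become 0)."""
    return [
        [max(0.0, l_val - r_val) for l_val, r_val in zip(l_row, r_row)]
        for l_row, r_row in zip(left, right)
    ]


# Toy 2 x 2 example; in the paper the output has shape L_t x 16.
left_features = [[3.0, 1.0], [2.0, 5.0]]
right_features = [[1.0, 4.0], [2.0, 1.0]]
asymmetry = relu_difference(left_features, right_features)
```

Only positions where the left-region activation exceeds the right-region activation survive the ReLU, so the output encodes a one-sided asymmetry.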

C. SPATIAL FEATURE EXTRACTION MODULE (SFEM)
The TFEM module plays an important role in extracting features from the EEG signal, but it only collects EEG features of locally important brain regions and does not attend to the overall changes across the whole brain. Therefore, to supplement the TFEM module, we propose a feature extraction module for global EEG features (SFEM). The SFEM module contains two sub-modules: a module for converting EEG signals into images (ETIM) and a spatial convolution module (SCM), shown in Fig. 7 and Fig. 8, respectively. In ETIM, DE features are mapped into the EEG topographic map, and SCM is used to obtain the global EEG spatial feature information. The SFEM module aims to learn the feature representation of global EEG information.

1) EEG TOPOGRAPHIC MAP FEATURE MAPPING
The DE features are transformed into an image representation of the EEG signal features, and the spatial information is obtained from the image-based EEG feature map [33]. The EEG mapping conversion method we use is as follows. First, the specific positions of the 3D brain electrodes (the Cartesian coordinates of the scalp positions) are obtained [34], and the three-dimensional electrode coordinates are mapped to two-dimensional space by isometric projection. Second, the DE features are assigned to the corresponding electrode positions, and the EEG feature image is then constructed by cubic spline interpolation [35]. Finally, the dimension of the EEG topographic map is $n_b \times h \times w$, where $h$ is the height, $w$ is the width, and $n_b$ is the number of channels of the EEG topographic map, which represent the five frequency bands. The height $h$ and width $w$ can be chosen as required. In this paper, we set the size of the EEG image to 32 × 32. Therefore, the size of each sample image is 5 × 32 × 32. The specific conversion process is as follows [36]:

$$elev = \operatorname{atan2}\left(z, \sqrt{x^2 + y^2}\right) \tag{4}$$

Equation (4) calculates the elevation angle $elev \in \left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$ formed between the xy plane and the direction of the electrode in three-dimensional space, where $x$, $y$ and $z$ represent the three-dimensional coordinates of the input electrode.

$$az = \operatorname{atan2}(y, x) \tag{5}$$

Equation (5) is the counterclockwise angle in the xy plane measured from the positive direction of the x axis, and the angle value is within the range of $(-\pi, \pi)$.

$$\hat{x} = \left(\frac{\pi}{2} - elev\right)\cos(az) \tag{6}$$

$$\hat{y} = \left(\frac{\pi}{2} - elev\right)\sin(az) \tag{7}$$

Equation (6) and Equation (7) respectively calculate the electrode coordinates $\hat{x}$ and $\hat{y}$ mapped from 3D to 2D. Through the above EEG-to-image method (ETIM), we generate five feature maps corresponding to the five frequency sub-bands of the 62 electrode positions, as shown in Fig. 7. Before training the model, the EEG features of the 15 subjects are converted into images. We transform matrix $S$ into matrix $Q$ through ETIM, where $S = \{r_1^{n_s}, r_2^{n_s}, \ldots, r_{15}^{n_s}\}$ and $Q = \{q_1^{n_s}, q_2^{n_s}, \ldots, q_{n_p}^{n_s}, \ldots, q_{15}^{n_s}\}$. For sample $q_{n_p}^{n_s}$, the dimension is $n_b \times 32 \times 32 \times L_t$. $q_{n_p}^{n_s}$ is rewritten as $[p_1, p_2, \ldots, p_i, \ldots, p_{L_t}]$, where the dimension of $p_i$ is $n_b \times 32 \times 32$ and $n_b = 5$.
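The 3D-to-2D electrode mapping described above can be sketched as follows. This is an assumption-laden illustration: it takes the elevation as the angle above the xy plane, the azimuth as the counterclockwise angle from the positive x axis, and a radius of $\pi/2 - elev$ on the 2D plane. The function name `project_electrode` and the unit-sphere test coordinates are hypothetical.

```python
import math


def project_electrode(x, y, z):
    """Map a 3D electrode position to 2D using elevation and azimuth angles."""
    elev = math.atan2(z, math.hypot(x, y))  # angle above the xy plane
    az = math.atan2(y, x)                   # counterclockwise from the +x axis
    radius = math.pi / 2 - elev             # 0 at the vertex, pi/2 on the equator
    return radius * math.cos(az), radius * math.sin(az)


# Hypothetical unit-sphere positions: vertex, right side, front.
vertex_2d = project_electrode(0.0, 0.0, 1.0)
right_2d = project_electrode(1.0, 0.0, 0.0)
front_2d = project_electrode(0.0, 1.0, 0.0)
```

Electrodes at the top of the head land at the origin and those nearer the equator land farther out, so relative distances along the scalp are roughly preserved before interpolation to the 32 × 32 grid.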

2) SPATIAL FEATURE EXTRACTION STRUCTURE BASED ON SCM AND GRU
To obtain the global EEG spatial features, SCM and GRU are used to extract global EEG features based on time frames. In the SCM shown in Fig. 8, ResNet18 [37] is used as the basic network. The network first learns the spatial spectrum features in EEG images with 2D convolution layers and then uses GRU to learn the time series features. GRU is chosen because image data are more expensive to process than signal data: since SFEM only serves to make up for the incomplete global feature extraction of TFEM, using LSTM here would reduce training efficiency, so the faster and more efficient GRU is selected in place of LSTM [38].
As shown in Fig. 9, this structure enables the GRU unit to discard useless outdated information and update its own state according to the newly obtained input. The input at each time step $t$ is expressed as $x_t$. The states of the reset gate and the update gate are controlled according to the state at time $t-1$ and the input at time $t$, where the reset gate is represented as

$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \tag{8}$$

where $r_t$ is the reset gate signal, $W_r$ is the weight matrix of the reset gate, $h_{t-1}$ represents the state information of the previous GRU unit, and $\sigma$ represents the logistic sigmoid function, which converts data to a value in the range 0-1. The update gate is represented as

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \tag{9}$$

where $z_t$ is the update gate signal and $W_z$ represents the weight matrix of the update gate. The closer the gate control signal is to 1, the more data are memorized; the closer it is to 0, the more is forgotten. Once $r_t$ is obtained, the current candidate state $\tilde{h}_t$ can be computed; $\tilde{h}_t$ depends on the state information $h_{t-1}$ at the previous time step and is expressed as

$$\tilde{h}_t = \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t]\right) \tag{10}$$

where $W_h$ represents the weight matrix of the candidate state. Finally, the current state $h_t$ is obtained from $z_t$, $h_{t-1}$ and $\tilde{h}_t$.
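A single GRU step can be sketched for the scalar case (one input value, one hidden value) to make the role of each gate concrete. Two points are assumptions rather than details taken from the paper: biases are omitted, and the final update uses the common convention $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$; other references swap the roles of $z_t$ and $1 - z_t$. The function name `gru_step` and the weight values are hypothetical.

```python
import math


def sigmoid(value):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def gru_step(x_t, h_prev, w_r, w_z, w_h):
    """One scalar GRU step; each w_* is (weight on h_prev, weight on x_t)."""
    r_t = sigmoid(w_r[0] * h_prev + w_r[1] * x_t)            # reset gate
    z_t = sigmoid(w_z[0] * h_prev + w_z[1] * x_t)            # update gate
    h_cand = math.tanh(w_h[0] * (r_t * h_prev) + w_h[1] * x_t)  # candidate state
    # Assumed update convention: blend previous state and candidate with z_t.
    return (1.0 - z_t) * h_prev + z_t * h_cand


# Run a short hypothetical sequence through the cell.
state = 0.0
for value in [0.5, -0.2, 0.9]:
    state = gru_step(value, state, w_r=(0.4, 0.3), w_z=(0.1, 0.7), w_h=(0.6, 0.8))
```

Because the new state is a convex combination of the previous state and a tanh output, it stays within (-1, 1) whenever the initial state does.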
In order to obtain the spatial information of the electrode position in EEG images, SCM is built to learn spatial features from multiple frequency dimensions as shown in Fig. 8. Table 2 shows the specific parameter configurations for the spatial convolution module.
To extract the global EEG features without ignoring the time dimension $L_t$, SCM modules are used to extract the spatial position information of EEG at different time steps and GRU is used to extract the time features. Each $p_i$ is processed by the SCM module with weight sharing, as shown in Fig. 10. The input dimension of GRU is $L_t \times 128 \times 8 \times 8$, which consists of $L_t$ parallel SCM outputs. Each GRU unit obtains a feature vector with a length of 256. The outputs of all GRU units are spliced and flattened, a feature vector with a length of 64 is obtained through the FC layer, and the final result is then obtained by Norm. The whole process is shown in Fig. 10. The configuration of the GRU and FC layer is shown in Table 3.

D. DOMAIN ADAPTIVE CLASSIFICATION MODULE (DACM)
Domain adaptation (DA) [39] is a special case of transfer learning. The idea is to map data features from different domains (such as two different datasets) to the same feature space, so that the target domain can be enhanced by the data from other domains. To handle multi-source domain adaptation in cross-subject EEG emotion analysis, we introduce the domain-specific feature extractor (DSFE) and domain-specific classifier (DSC) from [13] to solve the cross-subject emotion classification problem. TFEM and SFEM each produce a 64-dimensional feature vector, and the two vectors are spliced to form the input of DACM. The structure of DACM is shown in Fig. 11. DACM contains two sub-modules, $\mathrm{DSFE}_{n_p}$ and $\mathrm{DSC}_{n_p}$, each consisting of three FC layers. $\mathrm{DSFE}_{n_p}$ has an input dimension of 128 and an output dimension of 32. $\mathrm{DSC}_{n_p}$ has an input dimension of 32 and an output dimension of 3. $\mathrm{DSFE}_{n_p}$ and $\mathrm{DSC}_{n_p}$ are represented as follows

$$y_{n_p}^{n_s} = \mathrm{DSFE}_{n_p}\left(x_{n_p}^{n_s}\right) \tag{12}$$

$$Y_{n_p}^{n_s} = \varphi_{n_p}\left(y_{n_p}^{n_s}\right) \tag{13}$$

where $x_{n_p}^{n_s}$ is a vector with a length of 128, $y_{n_p}^{n_s}$ is a vector with a length of 32, $\varphi_{n_p}$ represents $\mathrm{DSC}_{n_p}$, and $Y_{n_p}^{n_s}$ is the output of $\varphi_{n_p}$, a vector of length 3.
To obtain the individual features of each source domain, 14 modules $\mathrm{DACM}_{n_p}$ ($n_p = 1, 2, \ldots, 14$) are constructed, in which the 14 $\mathrm{DSFE}_{n_p}$ extract the features of the corresponding 14 source domains. At the same time, to map each source domain and the target domain to the same space, the target domain feature $x_{15}^{n_s}$ is also used as input to train the 14 $\mathrm{DSFE}_{n_p}$, as shown in Figure 3. The 14 $\mathrm{DSC}_{n_p}$ provide the classification results for the 14 source domains.

E. LOSS FUNCTION DESIGN AND OPTIMIZATION
To estimate the distance between the source domain and the target domain, $L_{mmd}$ [40] (14) is used to gradually align each specific source domain to the target domain without changing the spatial distribution of the target domain. $L_{mmd}$ is computed between the source domain feature $x_{n_p}^{n_s}$ and the target domain feature $x_{15}^{n_s}$, both extracted by $\mathrm{DSFE}_{n_p}$. The classification results of the specified domain are obtained through the softmax classifier. The final result is composed of the 14 outputs of $\mathrm{DSC}_{n_p}$. To make the predictions of each classifier converge, the cross-entropy loss $L_{cls}$ is introduced to measure the prediction loss, which is given by

$$L_{cls} = -\sum_{m=1}^{M} \hat{Y}_{n_p, m}^{n_s} \log Y_{n_p, m}^{n_s} \tag{15}$$

where $M$ represents the total number of sample labels and $\hat{Y}_{n_p}^{n_s}$ represents the sample label value.
Finally, if the target domain is not controlled and the predictions are simply averaged into the final result, the variance will be high. To keep the gap between the 14 target domain features in $\mathrm{DSFE}_{n_p}$ from becoming too large, a loss $L_{disc}$ [13] (16) is introduced, which calculates the sum of the feature differences of the 14 independent target domains in order to control the feature distribution of the target domains. The final loss (17) is the weighted sum of the three losses, where ϵ, µ and η are the three weighting hyperparameters. Minimizing $L_{mmd}$ yields domain-invariant features for each pair of source and target domains, minimizing $L_{cls}$ yields more accurate classifiers for predicting the source domain data, and minimizing $L_{disc}$ yields more convergent multiple classifiers [13].
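The three losses can be illustrated with a simplified, dependency-free sketch. Several choices here are assumptions made for brevity and are not taken from the paper: the MMD uses a linear kernel (squared distance between feature means), the discrepancy is the mean absolute difference between every pair of branch outputs for one target sample, and the mapping of the weights 1, 0.3 and 0.03 onto $L_{cls}$, $L_{mmd}$ and $L_{disc}$ is only one possible reading of the hyperparameters. All function names are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def linear_mmd(source_feats, target_feats):
    """Assumed linear-kernel MMD: squared distance between the feature means."""
    src_mean, tgt_mean = mean_vector(source_feats), mean_vector(target_feats)
    return sum((a - b) ** 2 for a, b in zip(src_mean, tgt_mean))


def cross_entropy(probabilities, label):
    """Cross-entropy for one sample given softmax probabilities and its class index."""
    return -math.log(probabilities[label])


def discrepancy(branch_features):
    """Mean absolute difference over all pairs of branch outputs for one target sample."""
    total, pairs = 0.0, 0
    for i in range(len(branch_features)):
        for j in range(i + 1, len(branch_features)):
            diffs = [abs(a - b) for a, b in zip(branch_features[i], branch_features[j])]
            total += sum(diffs) / len(diffs)
            pairs += 1
    return total / pairs if pairs else 0.0


def total_loss(l_cls, l_mmd, l_disc, w_cls=1.0, w_mmd=0.3, w_disc=0.03):
    """Weighted sum of the three losses; the weight assignment is an assumption."""
    return w_cls * l_cls + w_mmd * l_mmd + w_disc * l_disc


loss = total_loss(
    cross_entropy([0.5, 0.25, 0.25], 0),
    linear_mmd([[1.0, 0.0]], [[0.0, 0.0]]),
    discrepancy([[0.0, 0.0], [1.0, 1.0]]),
)
```

In a full implementation the MMD would typically use Gaussian kernels and all terms would be averaged over mini-batches and over the 14 branches.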
Algorithm 1
…
3: $R_j^{n_s}$ = FS($r_j^{n_s}$), $q_j^{n_s}$ = ETIM($r_j^{n_s}$)
4: $x_j^{n_s}$ = Contact(TFEM($R_j^{n_s}$), SFEM($q_j^{n_s}$))
5: if $j \neq 15$ then
6: $y_j^{n_s}$ = DSFE$_j$($x_j^{n_s}$)
7: $y_{15}^{n_s}$ = DSFE$_j$($x_{15}^{n_s}$)
8: $Y_j^{n_s}$ = DSC$_j$($y_j^{n_s}$)
9: $L_{cls}$($Y_j^{n_s}$, $\hat{Y}_j^{n_s}$)
…
return $Y_z^{n_s}$
24: end for
25: save model

V. EXPERIMENTS
To evaluate MASTF-net, different validation methods are used on SEED and SEED-IV. Model evaluation is mainly carried out through multi-source learning in the cross-subject setting. In this evaluation, one subject is used as the target domain data, and the other subjects are used as the source domain data. During training, the model is trained on all source domain data and evaluated on the target domain data. In this way, all subjects are evaluated in turn, and the corresponding mean and standard deviation are calculated. In addition, in the ablation experiment, the results obtained with DE features and with PSD features as inputs are compared, as shown in Fig. 12.
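The leave-one-subject-out protocol described above can be sketched as a simple loop. The callback `dummy_train_and_score` is a placeholder that returns made-up numbers purely so the loop runs; it stands in for training the network on the source subjects and scoring on the target subject. Using the population standard deviation is also an assumption, since the text does not say which estimator is used.

```python
import statistics


def leave_one_subject_out(subject_ids, train_and_score):
    """Use each subject once as the target domain and all others as source domains."""
    accuracies = []
    for target in subject_ids:
        sources = [subject for subject in subject_ids if subject != target]
        accuracies.append(train_and_score(sources, target))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


def dummy_train_and_score(sources, target):
    """Placeholder only: returns an arbitrary value instead of a real accuracy."""
    assert target not in sources and len(sources) == 14
    return 0.80 + 0.01 * (target % 3)


mean_acc, std_acc = leave_one_subject_out(list(range(1, 16)), dummy_train_and_score)
```

Replacing the placeholder with a real training routine leaves the evaluation loop unchanged.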

A. IMPLEMENTATION DETAILS
In the experiment process, because of the use of dual branch network, L t must be consistent for SFEM and TFEM.When converting EEG features into images through ETIM, the dimension of the image is set to 32 × 32 .In the training phase, the cross entropy loss function is mainly used.ϵ is set to 1, µ is set to 0.3 and η is set to 0.03 during training.The Adam optimizer is used with a learning rate lr of 0.01.output SFEM and output TFEM represent the feature output dimensions of the SFEM module and TFEM module, respectively.During the experiment, we find that a significant amount of time and memory consumption is required for converting DE features into image data during data reading.Therefore, in order to improve experimental efficiency, the DE features of all subjects are preprocessed into images in advance and images are saved in numpy format.Then, the image file is directly read to reduce time.The specific parameters are shown in Table 4.
The model is implemented in Python, EEG features are stored in numpy format before training, and training is run on an Nvidia GeForce RTX 3060 GPU with 8 GB of memory. The main algorithm flow is shown in Algorithm 1. $S$ represents the DE features of the 15 subjects. The first fourteen subjects are labeled source domain data, and the fifteenth subject is unlabeled target domain data used for testing. First, feature selection and image transformation are applied to $r^{n_s}_{n_p}$ to obtain $R^{n_s}_{n_p}$ and $q^{n_s}_{n_p}$. Then, $R^{n_s}_{n_p}$ and $q^{n_s}_{n_p}$ are fed separately into the two branches (TFEM and SFEM) to extract common features. Contact denotes the concatenation of the common features extracted by the TFEM and SFEM branches along the second dimension, resulting in $x^{n_s}_{n_p}$. When the input is labeled data, the model is trained using both the source domain data and the target domain data, and the classification loss ($L_{cls}$) is calculated for each domain branch. When the input is unlabeled test data, the feature difference loss ($L_{disc}$) between the target-domain outputs of all branches and the domain loss ($L_{mmd}$) between the source domain and the target domain are calculated. During testing, the common features of the test data are first obtained in the same way as during training. The test data are then passed through the domain adaptation branches in turn, which returns the prediction of every branch. Finally, the predictions of all branches are averaged to give the final result.
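The final averaging step at test time can be illustrated with a short Python sketch. It assumes that each domain adaptation branch returns a class-probability vector for a test sample and that the vectors are averaged element-wise before the most probable class is chosen; whether probabilities or raw scores are averaged is not stated in the text, and the function name is hypothetical.

```python
def average_branch_predictions(branch_probs):
    """Average the per-branch class probabilities and pick the best class.

    branch_probs: list with one probability vector per domain branch
    (hypothetical layout; 14 branches and 3 classes in the SEED setting).
    """
    n_branches = len(branch_probs)
    n_classes = len(branch_probs[0])
    averaged = [
        sum(probs[c] for probs in branch_probs) / n_branches
        for c in range(n_classes)
    ]
    # The predicted label is the index of the largest averaged probability.
    label = max(range(n_classes), key=averaged.__getitem__)
    return averaged, label
```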

B. EXPERIMENTAL RESULTS AND COMPARISON
Extensive emotion classification experiments are performed on SEED and SEED-IV [6], and MASTF-net is compared with other models.
As shown in Table 5, we evaluate the performance for each participant on SEED and SEED-IV. Each participant's three sessions are denoted Session I, Session II and Session III.
The proposed model is compared with the following models: DDC [41], DAN [13], DCORAL [42], MS-MDA [13], IAG [43], GMSS [44] and MSFR-GCN [18]. As shown in Table 5, MASTF-net is superior to the other methods both in the results for each session and in the average results.
In our study, we run the experiments for four of these methods (DDC, DAN, DCORAL and MS-MDA) on the various sessions ourselves. The results for IAG, GMSS and MSFR-GCN are cited directly from the literature. Notably, because a large sliding window was used for feature extraction in SEED-IV, the experiments suffered from a severe lack of training samples. Therefore, we re-extract the DE features of the five frequency bands from SEED-IV using a sliding window of length 1.
From the results in Table 5, it can be seen that MASTF-net achieves the best performance on both the SEED and SEED-IV datasets, surpassing MS-MDA and MSFR-GCN by 1.33% and 1.41%, respectively, on the SEED dataset. On the SEED-IV dataset, our method also outperforms MS-MDA, indicating that combining global spatial information with temporal feature information, as MASTF-net does, has a significant impact on emotion classification results. MASTF-net's performance on SEED-IV is similar to that of GMSS, IAG and MSFR-GCN, but MASTF-net exhibits a relatively smaller deviation during training, showing that it delivers competitive performance with improved stability, a lower standard deviation and stronger generalization ability.
Different hyperparameters are tested to evaluate MASTF-net, and MASTF-net is again compared with other models.
Multiple comparative experiments are conducted with different sample lengths $L_t$. As shown in Table 6, the larger $L_t$ is, the better the TFEM module can learn time-based features from the DE features; TFEM and SFEM cannot learn the key knowledge from a short time series. In addition, Table 6 shows that the performance of MASTF-net decreases as the batch size increases. When $L_t$ is 4 and the batch size is 128, the results decrease significantly. When $L_t$ is 8 and the batch size is set to 32, MASTF-net achieves its highest average accuracy of 88.23%. The results also show that our method still obtains better experimental results when the number of iterations is small, which indicates that it can improve the classification accuracy of emotion recognition for multiple subjects in different scenarios.
As shown in Table 7, MASTF-net and the other models are trained for 1500, 2500 and 3500 epochs, respectively. It can be observed that the accuracy of all models improves as training proceeds for longer, and that MASTF-net achieves the highest testing accuracy. This also indicates that MASTF-net brings a more significant performance improvement to EEG-based emotion recognition. During the experiment, we saved the results of each run. We used bar charts and confusion matrices to record the performance for all participants in all sessions, as shown in Fig. 13. According to the experimental results, MASTF-net can effectively predict positive and negative emotions, but the prediction of neutral emotions is not as good as that of positive and negative emotions, especially in the second session. A possible reason is that the neutral mood change is relatively weak compared with the other two moods, which reduces the average accuracy.
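The per-class analysis behind such confusion matrices and bar charts can be illustrated with a minimal Python sketch. The integer label encoding (0 = negative, 1 = neutral, 2 = positive) and the function names are assumptions made for this example; the sketch only shows how per-class accuracy follows from the confusion matrix.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of samples of each true class that were predicted correctly."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]
```

A lower value for the neutral class than for the other two classes would correspond to the behaviour described above.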

VI. DISCUSSION
In this section, we investigate the proposed algorithm MASTF-net. We conduct ablation experiments on MASTF-net, including ablation experiments on TFEM and SFEM without the domain loss function, as well as ablation experiments on data from different frequency bands. Additionally, we analyze the performance of MASTF-net when using DE features and PSD features as inputs separately.

A. ABLATION STUDY
In order to better understand whether each module can play a key role, the results of the ablation experiment are shown in Table 8.
The TFEM module learns only the DE features of eight key regions, so some electrode positions are ignored by it. To show that the SFEM module compensates for this shortcoming of TFEM in the global space, in the ablation experiment TFEM alone replaces the dual-branch feature extraction module and is compared with the full MASTF-net. The experimental results indicate that the SFEM module improves the overall performance.
To show that the domain loss function and the difference loss are effective, in the ablation experiment we remove these two loss functions and use only the cross-entropy loss to predict the results. The experimental results show that if only the prediction loss is used and the domain loss and difference loss are abandoned, the accuracy of the final prediction declines. The experimental results are shown in Table 8.
According to the results in Table 8, the testing accuracy decreases significantly without $L_{mmd}$ and $L_{disc}$, indicating the importance of processing each source domain with the DACM modules during training. In addition, if only the TFEM module is used as the common feature extractor, the classification accuracy is lower than with the dual-branch common feature extractor, which shows that the SFEM module compensates for the lack of global spatial features in the TFEM module.
To investigate how MASTF-net extracts low-level common feature information, we conducted multiple experiments on the data of the five frequency bands, as shown in Table 9. According to Table 9, on SEED and SEED-IV, δ has the least impact on the experiment, while β has a significant impact on the prediction results of both datasets. In SEED-IV, fear, negative and neutral emotions each account for about one quarter of the data, and according to [18], the β and γ bands are associated with positive emotions, while the α band is associated with neutral emotions. Therefore, if the high frequency bands are overly emphasized, the overall effect will not be good. Our findings are consistent with the physiological research results of MSFR-GCN [18].

B. INPUT FEATURE COMPARISON
SEED provides not only DE features but also PSD features. To improve the effectiveness of cross-subject emotion classification and evaluate the impact of PSD and DE features, repeated experiments are conducted on PSD features in the same experimental environment, and the results are shown in Fig. 12. We randomly select the DE and PSD feature data of all subjects in one of the three sessions. The abscissa in Fig. 12 represents the subject number $n_p$, and the ordinate represents accuracy. MASTF-net_DE and MASTF-net_PSD represent the experimental results using DE and PSD features as MASTF-net inputs, respectively. TFEM_DE and TFEM_PSD likewise use DE and PSD features as network inputs. However, unlike MASTF-net_DE and MASTF-net_PSD, and in order to serve as ablation experiments, they use only the TFEM module as the common feature extractor instead of the dual-branch feature extraction module, to demonstrate the high generalization of the TFEM module. The experimental results show that DE outperforms PSD on SEED under the same model and experimental environment. Therefore, we conclude that for SEED, DE is more conducive than PSD to the adaptation of EEG emotion recognition across subjects. The experimental results also show that our method performs well in the field of adaptive EEG emotion analysis.
Furthermore, MASTF-net fails to achieve satisfactory results in cross-subject emotion classification when using PSD features as input, as shown in Fig. 12. Further optimization is needed to improve the generalization ability of MASTF-net so that it achieves satisfactory results with PSD features as well as with DE features as network input. Finally, although MASTF-net demonstrates favorable results compared with the comparative methods, its model complexity leads to a longer training time. In future work, we will adjust the model structure of MASTF-net based on the weighted proportion of different frequency bands in emotion prediction to enhance its efficiency.

VII. CONCLUSION AND FUTURE WORK
In this paper, MASTF-net is compared with other models, namely DDC [41], DAN [13], DCORAL [42] and MS-MDA [13], and it is clear that MASTF-net brings a significant improvement in cross-subject emotion classification. A dual-branch network is used to build a shared network that extracts the common time, space and frequency information of EEG signals. On the one hand, the L-RNN module is used to obtain local area features of EEG in high-frequency active areas. On the other hand, 3D EEG data is mapped to a 2D EEG topographic map to extract global spatial EEG features.
In addition, the experiments show that the length $L_t$ has a significant impact on the cross-subject emotion classification results. As can be seen from Table 6, the best experimental results are obtained when $L_t$ is set to 8, which shows that EEG signals have a relatively long time correlation. We also find that MASTF-net achieves higher accuracy when a smaller batch size is chosen. When selecting the loss function, three aspects are considered. First, the inter-domain loss is used to calculate the difference between the source domain and the corresponding target domain. Second, the domain-invariant loss is used to improve the stability of the target domain after feature extraction. Third, the prediction loss is used to predict the classification results.
Finally, MASTF-net uses a relatively complex network structure. Compared with models of simpler architecture, MASTF-net performs well in cross-subject emotion classification but suffers from a longer training time. The training time increases linearly with the number of source domains: the larger the number of source domains, the larger the model and the longer the training time. In the next step, our goal is to improve the training efficiency without reducing the model accuracy. The main idea is to add a discriminator to the domain feature branches so that unnecessary branches can be discarded to reduce training time.

FIGURE 1. Continuous EEG data for a specific time period are selected from SEED, and continuous EEG topographic maps are obtained for three different labels (sad, neutral, happy). (a) EEG topographic maps of the subject when the emotional change is active. (b) EEG topographic maps of the subject when the emotional change is not active. One frame of the EEG topographic map is obtained from SEED every 0.1 seconds.
1. Fig. 1(a) shows the EEG topographic map that is averaged after 100 random samples in the first 60 seconds, and Fig. 1(b) shows the EEG topographic

FIGURE 2. The first 120 values of each continuous DE feature in SEED are extracted and divided into samples of length $L_t$. EEG data samples are then drawn through the $L_t$-second window (green dotted window).
A. INPUT DATA OF MASTF-NET
When training the network, we use labeled training data and unlabeled test data as the input of the network. The input data is represented by a matrix $S = \{r^{n_s}_1, r^{n_s}_2, \ldots, r^{n_s}_{n_p}, \ldots, r^{n_s}_{15}\}$, as shown in Fig. 3. $r^{n_s}_1, r^{n_s}_2, \ldots, r^{n_s}_{14}$ are the labeled source domain data, which are used to train the model. $r^{n_s}_{15}$ is the unlabeled target domain data, which is used for model training and for testing model accuracy. The total number of samples is $3 \times 15 \times n_t \times (120/L_t)$, and each matrix $r^{n_s}_{n_p}$ represents a training sample containing $n_c$ electrode positions in $n_b$ frequency bands with a length of $L_t$. The dimension of matrix $r^{n_s}_{n_p}$ is $\mathbb{R}^{n_c \times n_b \times L_t}$, where $n_c$ is the total number of electrodes used for collecting EEG and $n_b$ is the total number of frequency bands.
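The sample count above can be checked with a short Python sketch. It assumes that the 120 values are split into non-overlapping windows of length $L_t$ (which matches the factor $120/L_t$) and, for the worked example, that $n_t = 15$ trials per session and $L_t = 8$; the function names are hypothetical.

```python
def split_into_windows(sequence, window_len):
    """Split a sequence into consecutive, non-overlapping windows.

    Trailing values that do not fill a whole window are dropped.
    """
    n_windows = len(sequence) // window_len
    return [
        sequence[i * window_len:(i + 1) * window_len]
        for i in range(n_windows)
    ]


def total_samples(n_sessions, n_subjects, n_trials, n_values, window_len):
    """Number of samples: sessions x subjects x trials x windows per trial."""
    return n_sessions * n_subjects * n_trials * (n_values // window_len)


# Worked example under the assumptions stated above.
windows = split_into_windows(list(range(120)), 8)
n_total = total_samples(3, 15, 15, 120, 8)
```

Under these assumptions each trial yields 15 windows, and the total over three sessions and 15 subjects is 3 × 15 × 15 × 15 samples.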

FIGURE 3. The structure of MASTF-net. The model mainly includes three kinds of modules: two common feature extraction modules (TFEM and SFEM) and multiple modules (DACM$_{n_p}$) for independent adaptive domain feature classification. $S$ contains a sample $r^{n_s}_{n_p}$ of each subject. TFEM evaluates the relationship between EEG data from the left and right hemispheres according to the brain electrode positions in different regions and extracts the temporal features of EEG signals based on the contribution of each region at each spatial level. SFEM learns the spatial information contained in EEG data through a spatial extraction network and extracts the time frame features of EEG images with a gated recurrent unit (GRU) network. DACM establishes separate domain feature extraction branches (DSFE$_{n_p}$) and classifiers (DSC$_{n_p}$) for different domains according to the common features extracted by TFEM and SFEM. FS represents feature selection, which is detailed in Section IV-B1. ETIM is described in Section IV-C1.

FIGURE 4. To explore how the features of the frontal, temporal and occipital lobes in the left and right hemispheres change when emotions change, the brain is divided into eight regions with frequent EEG activity during emotional changes.

FIGURE 5. The structure of L-RNN. Each L-RNN input is a symmetrical EEG feature. $t$ represents the input time sequence, $A_j$ represents the location of the EEG region, and $a^{A_j}_i$ represents the EEG feature in region $A_j$ of the corresponding time series $t$. ⊖ represents the subtraction of the elements at the corresponding positions of vectors $b^{A_j}_i$ and $b^{A_{j+1}}_i$. The final output is a feature matrix with dimension $L_t \times 16$.

Fig. 5. $L_t$ represents the input time sequence, $A_j$ represents the location of the hemisphere region, and $a^{A_j}_i$ represents the DE feature in region $A_j$ of the corresponding time series $L_t$. The final output result is a vector of length 16. We input $a^{A_j}_i$, $a^{A_{j+1}}_i$ into the parallel RNN according to the time sequence $L_t$ to get b

FIGURE 6. The structure of TFEM. The EEG features of eight areas are input into the L-RNN module in pairs. The output of the L-RNN module is flattened and spliced after the FC layer, and finally a vector with a length of 64 is obtained after Norm.

FIGURE 7. EEG data of five frequency bands are mapped to the corresponding positions of the EEG map to form a five-channel EEG map.

FIGURE 8. The structure of SCM. SCM is used to extract the global spatial features of EEG, and the dimension of the final output feature is 128 × 8 × 8.

FIGURE 9. The traditional structural unit of GRU, which extracts the time information in EEG DE features.

FIGURE 10. The structure of SFEM. The SFEM module contains $L_t$ SCM sub-modules. Each SCM obtains the global EEG spatial features, and weights are shared among the $L_t$ SCM modules. The SFEM module learns the feature representation of global EEG information.

FIGURE 11. The structure of DACM. The DACM module contains two sub-modules, DSFE$_{n_p}$ and DSC$_{n_p}$, which consist of three FC layers. The input dimension of DSFE$_{n_p}$ is 128, and its output dimension is 32. The input dimension of DSC$_{n_p}$ is 32, and its output dimension is 3.

FIGURE 12. To evaluate the impact of PSD features and DE features, repeated experiments are conducted on PSD features in the same experimental environment. In addition, if only the TFEM module is used as the common feature extractor, the classification accuracy is lower than with the dual-branch common feature extractor, which shows that the SFEM module compensates for the lack of global spatial features in the TFEM module.

FIGURE 13. For the 45 cross-subject emotion classification results, obtained with each subject in each session as the target domain, the confusion matrix shows the number of true and false predictions of each class, and the histogram shows the accuracy of each type of emotion prediction.


TABLE 1. Meaning of each symbol.

TABLE 2. Parameter configuration used in SCM.

TABLE 3. Configuration of GRU and FC layer.

TABLE 4. The hyperparameter values used in the experiments in this paper.

TABLE 5. Comparison of different models on SEED and SEED-IV based on different sessions.

TABLE 6. Evaluation of MASTF-net with different batch sizes and sample lengths $L_t$ on SEED.

TABLE 7. Comparison of different models on SEED and SEED-IV based on different epochs.

TABLE 8. Ablation experiment for the loss function and common feature extraction on SEED.

TABLE 9. The proportion of frequency bands for TFEM and SFEM on SEED and SEED-IV.