Joint Modality Features in Frequency Domain for Stress Detection

Rich feature extraction is essential to train a good machine learning (ML) framework. These features are generally extracted separately from each modality. We hypothesize that richer features can be learned when modalities are explored jointly, and that these joint modality features can perform better than those extracted from individual modalities. To investigate this hypothesis, we study two physiological signal modalities: electrodermal activity (EDA) and electrocardiogram (ECG). We pursue three objectives for subject-independent stress detection. First, for the first time in the literature, we apply our proposed framework in the frequency domain. The frequency-domain decomposition of a signal effectively separates it into periodic and aperiodic components, and we can correlate their behaviour by focusing on each band of the signal spectrum. Second, we show that our framework outperforms late fusion, early fusion and other notable works in the field. Finally, we validate our approach on four benchmark datasets to show its generalization ability.


I. INTRODUCTION
STRESS is defined as the nervous system's reaction to a danger or an instruction [1]. Stress has been taken seriously in recent years as it affects many people. This tendency could be due to changing work styles, cultural demands, varying lifestyles, etc. [2]. In some circumstances, stress can be beneficial up to a point, for example in high-pressure situations such as work, exams, and so on. Stress is no longer beneficial once it crosses a certain level; it then harms an individual's emotional state, health, quality of life, and productivity [3]. If stressful events occur frequently and a person becomes highly concerned, the body remains stressed the rest of the time, leading to severe health issues [4]. As a result, the importance of stress detection systems has grown compared to a decade ago. Protecting individuals from the growing effects of stress is critical, mainly because stress is unavoidable. Timely stress diagnosis and control are therefore crucial for improving an individual's mental health and overall well-being [5].
Automatic stress detection mainly uses three modalities: psychological, physiological, and behavioral [6]. The Hypothalamic Pituitary Adrenal (HPA) axis and the Autonomic Nervous System (ANS) are the two key components that respond to stress by attempting to restore physiological balance [7]. This response is accompanied by changes in heart activity, sweat gland activity, skin temperature, etc. As effective stress markers, physiological signals can thus provide information on ANS activity. In addition, among the physiological signals, ECG and EDA provide a realistic view of an individual's stress level [8]. The frequency-domain analysis of physiological signals has received less attention than the time-domain analysis, and frequency-domain analysis for stress detection in particular has received little attention. The signal's transitory properties can be used to comprehend the signal's frequency-domain interpretation [9]. Frequency-domain analysis is most useful when looking for periodic behavior in a signal [10]. This paper describes a joint modality feature learning method for stress detection in the frequency domain. The proposed method uses a deep neural network to learn a joint-modal mapping. The ECG and EDA frequency bands are identified, and features are extracted from the power spectral density (PSD). These features are used for joint modality feature learning.
This study differs from earlier works in the following aspects. Most physiological signal-based stress detection studies used time-domain and time-frequency-domain features. Frequency-domain analysis, despite its importance, receives less attention than time-domain analysis. We therefore incorporate joint modality feature learning in the frequency domain for stress detection in this study. We use autoencoders to learn a joint representation from the features of different modalities. The frequency bands of ECG and EDA that contribute the most to stress detection are also evaluated.
The main contributions of this work are summarized as follows:
1) Frequency-domain analysis is performed on ECG and EDA signals. The frequency bands of the ECG and EDA are identified. We analyze the performance of each frequency band of ECG and EDA separately to identify the band that performs best for stress detection.
2) The ECG and EDA signals are divided into fixed-duration segments of varying lengths. The frequency analysis framework developed above is investigated for each segment duration separately to study the influence of segment duration on overall performance.
3) We propose an auto-encoder-based framework to learn a joint modality feature representation from ECG and EDA signals. Results obtained using all the bands (whole signal) and using only the individually best-performing bands (band level) are analyzed.
4) We build an optimal CRNN-SE model consisting of convolutional layers, Long Short-Term Memory (LSTM) layers and Squeeze-Excitation (SE) modules for use as a classifier in all of our experiments.
5) Finally, we evaluate the developed framework on four benchmark datasets to study its generalization capability.
The remainder of the paper is structured as follows. Section II reviews recent works on joint modality feature learning and frequency-domain analysis of physiological signals; the research gap is identified, and the objectives of the current proposal are established. Section III contains details of our proposed framework. Section IV presents the results obtained and the analysis performed on four benchmark datasets. Section IV-E compares the performance of the proposed method with other appropriate methods from the recent literature, and Section V concludes the paper.

II. RELATED WORKS
This section reviews prior works in joint modality feature learning and frequency domain analysis on physiological signals.

A. JOINT MODALITY FOR NON-PHYSIOLOGICAL SIGNAL APPLICATIONS
Zhen et al. [11] proposed a CNN-based cross-modal learning framework for text-image matching. The modalities used were images and text. Two sub-networks (an image CNN and a text CNN) with weight-sharing constraints at the fully connected layer were developed to learn the cross-modal correlation between the modalities. A discrimination loss was used for cross-modal learning. A linear classifier was trained using the features obtained from the cross-modal representation space. For text-image matching, a modality-invariant framework was proposed by Liu et al. [12]. The proposed framework fine-tunes a pre-trained image CNN and a text RNN with an auxiliary adversarial loss to improve the distribution consistency of the two groups of embeddings (image and text). The distributions of images and text were more similar after adversarial learning, which improved retrieval accuracy. A cross-modal representation for audio-video retrieval was proposed by Surís et al. [13]. Visual and audio embeddings were obtained by projecting the two modalities into a common feature space with deep neural networks. The joint features were used for a retrieval task in which the query came from either of the two modalities. Cross-entropy was employed as the classification loss function. This loss was optimized together with a cosine similarity loss to provide the best results.
A modality-invariant (MI) representation for multimodal sentiment analysis was proposed by Hazarika et al. [14]. Text, image and video were used for multiclass classification using a Transformer. Joint modality features were obtained by training an encoder with text, image and video. In MI learning, all modalities for the task are mapped to a common subspace for distributional alignment. Although multimodal signals come from a variety of sources, they are all used to achieve the same goal. Individual modalities are projected into a common subspace and aligned by minimising the Central Moment Discrepancy (CMD) loss. The learned representations are used as joint modality feature representations.

B. AUTOENCODER BASED WORKS IN FREQUENCY DOMAIN
A Frequential Stacked Sparse Auto-Encoder (FSSAE) was proposed by Feng et al. [15] for detecting Sleep Apnea (SA) using ECG features. The RR intervals are the input to the FSSAE module. This module transforms time-domain RR intervals into frequency-domain RR intervals. Mean Square Error (MSE) was used to calculate the reconstruction loss. Features retrieved from the hidden layer were used to train a separate time-dependent, cost-sensitive (TDCS) model. An auto-encoder-based system for detecting epilepsy using electroencephalogram (EEG) data was proposed by Sharathappriyaa et al. [16]. The Harmonic Wavelet Packet Transform (HWPT) and the Katz approach (yielding the Fractal Dimension (FD)) are applied to the source EEG signal. The FD and HWPT outcomes were fed into an auto-encoder to map a high-dimensional vector into a lower-dimensional embedding. This lower-dimensional feature vector was found to yield higher classification rates. The cost function used to train the autoencoder was MSE. An approach for classifying emotional states in the valence-arousal plane using a stacked autoencoder was proposed by Bagherzadeh et al. [17]. Physiological signals from the DEAP database, including electromyogram (EMG), EEG, and other peripheral signals, were used. Time and spectral features were extracted from these source signals. These features were used to train multiple stacked autoencoders. MSE was used as the reconstruction loss. Majority voting was used to make the final classification decision. A Supervised Denoising Autoencoder (SDAE) that learns a low-dimensional representation of ECG dynamics to detect false arrhythmia alarms was proposed by Lehman et al. [18]. MSE and binary cross-entropy were used to calculate the reconstruction and classification losses, respectively.
However, the use of autoencoders for joint modal feature learning in physiological signals, particularly in the frequency domain, has received relatively little attention. Hence, we propose a framework for subject-independent stress detection using features extracted from the ECG and EDA signals.

III. METHODOLOGY
An outline of the proposed framework is given in Figure 1. Frequency bands of EDA and ECG signals are identified. Features are extracted from the PSD. These features are used to learn a joint modality feature representation using an Autoencoder. The obtained joint modality features are used to train a CRNN-SE model to differentiate between stressed and unstressed subjects. Each of the modules is explained in detail below.

A. DATASET DETAILS
The following four benchmark datasets are used in this study.

1) ASCERTAIN
This dataset includes the electroencephalogram (EEG), EDA and ECG physiological signals and facial activity recordings of 58 subjects. The average age of the participants was 30. The physiological signals were recorded while the subjects watched emotional video clips. 36 video clips from [19] were used. The length of the videos was 58 to 128 seconds. The sampling rates of EDA and ECG were 128 Hz and 256 Hz, respectively. After seeing each video clip, the subjects were asked to give valence and arousal ratings on a 7-point scale, expressing their emotional perception. The valence rating ranges from -3 to 3, and the arousal rating ranges from 0 to 6 [20]. Based on the valence and arousal ratings [21], we assigned the label 1 to stressed samples and 0 to unstressed samples. In the 2-D valence-arousal plane shown in Figure 2, the high-arousal, low-valence (HALV) quadrant is considered as stressed. As a result, samples with high arousal and low valence were labeled as stressed, and all others as unstressed. The mean value of the ratings is used as the threshold to determine whether arousal or valence is high or low.
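The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `label_stress`, the use of strict inequalities, and the choice to compute the mean over the list passed in (rather than, say, per subject) are assumptions, and the ratings are made-up values.

```python
from statistics import mean

def label_stress(valence, arousal):
    """Label each rating pair 1 (stressed) if it falls in the
    high-arousal / low-valence (HALV) quadrant, else 0.

    The mean rating is used as the high/low threshold. Strict
    inequalities are an assumption of this sketch.
    """
    v_mean = mean(valence)
    a_mean = mean(arousal)
    return [1 if (a > a_mean and v < v_mean) else 0
            for v, a in zip(valence, arousal)]

# Hypothetical ratings for five clips
# (valence in [-3, 3], arousal in [0, 6]).
valence = [-2, 3, -1, 2, 0]
arousal = [5, 1, 2, 6, 4]
labels = label_stress(valence, arousal)
print(labels)  # [1, 0, 0, 0, 1]
```

Here the mean valence is 0.4 and the mean arousal is 3.6, so only the first and last clips fall in the HALV quadrant.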

2) CLAS
The Plethysmography (PPG), EDA, and ECG physiological data were collected from 62 subjects with a mean age of 20. The sampling rate was 256 Hz. Most of the subjects were students. The subjects were involved in five different activities, including three problem-solving tasks and two perceptive tasks. Image and video-clip stimuli were used to provoke the emotional reactions of subjects in the perceptive tasks. 16 emotionally classified 30-second clips from the DEAP database [23] were used as video-clip stimuli. We had 59 subjects after eliminating subjects who did not have complete information. Stress labels were assigned using the predefined stimulus tags provided in the dataset [24].
VOLUME 4, 2016

3) MAUS
The dataset captured simple physiological signals under various mental load situations. The N-back task was used to create a mental workload in 22 subjects, 20 of whom were male and 2 of whom were female. GSR, Wrist-PPG, Fingertip-PPG, and ECG signals were recorded for 35 minutes with a sampling rate of 100 Hz for Wrist-PPG and 256 Hz for the others. There was a five-minute rest period at the start of the trial. The N-back task, consisting of six trials, was performed after the rest interval. In the N-back task, the subject had to remember the last N one-digit values in a succession of rapidly displayed digits. The participant was instructed to reply by pressing the space bar on the keyboard when a stimulus was identical to the digit shown N positions earlier. The difficulty of the tasks served as ground truth. As a larger N generates a greater level of mental effort, 2-back and 3-back tasks were labeled as "high" mental workload states, and 0-back tasks were labeled as "low" [25].

4) WAUC
The study involved 48 participants who performed the NASA Revised Multi-Attribute Task Battery II under three different activity level conditions. The speed of a stationary bike or a treadmill was changed to manipulate physical activity. Six neural and physiological modalities (ECG, EDA, breathing rate, electroencephalography, skin temperature, and blood volume pulse) were recorded during the activity, along with 3-axis accelerometer data. After each experimental session, subjects were asked to complete the NASA Task Load Index questionnaire. The questionnaire rating was converted to a binary value, and subjects were labeled (low mental workload or high mental workload) using the average rating, which is given in the dataset [26], as a threshold. We had 45 subjects after removing those who lacked the necessary information.
For subject independence, we fixed the training and testing subject IDs. The first 42, 43, 18 and 36 subjects of the ASCERTAIN, CLAS, MAUS and WAUC datasets, respectively, are used for training. The remaining 16 subjects each of ASCERTAIN and CLAS, 4 subjects of MAUS and 9 subjects of WAUC are used for testing. We addressed the class imbalance problem by applying the Synthetic Minority Oversampling Technique (SMOTE) [27] to the training data.
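A subject-independent split of this kind can be sketched as follows. This is an illustrative sketch only: the tuple layout `(subject_id, features, label)`, the function name `split_by_subject`, and the toy data are assumptions, and the oversampling step is deliberately left out.

```python
def split_by_subject(samples, n_train_subjects):
    """Split samples so that no subject appears in both sets.

    samples: list of (subject_id, features, label) tuples (assumed layout).
    The first n_train_subjects subject IDs (in sorted order) go to training,
    the remaining subjects to testing.
    """
    subject_ids = sorted({s[0] for s in samples})
    train_ids = set(subject_ids[:n_train_subjects])
    train = [s for s in samples if s[0] in train_ids]
    test = [s for s in samples if s[0] not in train_ids]
    return train, test

# Toy data: 22 subjects (as in MAUS), three segments per subject.
samples = [(sid, [0.0], sid % 2) for sid in range(1, 23) for _ in range(3)]
train, test = split_by_subject(samples, n_train_subjects=18)

train_subjects = {s[0] for s in train}
test_subjects = {s[0] for s in test}
print(len(train_subjects), len(test_subjects))  # 18 4
```

Any oversampling such as SMOTE would then be fitted on `train` only, so that synthetic samples never leak information from the test subjects.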

B. FREQUENCY BAND AND FEATURE EXTRACTION
The frequency bands of the ECG and EDA signals are identified based on prior works in the literature by Kwon et al. The power spectral density (PSD) of the Heart Rate Variability (HRV) extracted from each band of the ECG is computed using Welch's approach. The frequency-domain module of the Python library pyHRV [32] is used for this purpose. From these PSDs, we extracted a total of 51 frequency-domain measures, including peak frequencies, relative powers, logarithmic powers, and absolute powers. A complete list of the 51 measures is available in [32]. The PSD of each band of the EDA is likewise computed using Welch's approach. From these PSDs, we extracted a total of 40 statistical features (5 bands with 8 features each): mean, median, minimum, maximum, variance, standard deviation, kurtosis and skewness.
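As an illustration of the eight per-band statistics, the sketch below computes them from an already-estimated PSD using only the Python standard library. It is not the authors' implementation: the function name `band_statistics`, the half-open band interval, the use of population (rather than sample) moments, the non-excess definition of kurtosis, and the synthetic PSD are all assumptions. Only the 0.15-0.25 Hz band edges are taken from the text.

```python
from statistics import mean, median, pstdev, pvariance

def band_statistics(freqs, psd, low, high):
    """Return eight statistics of the PSD values with low <= f < high."""
    values = [p for f, p in zip(freqs, psd) if low <= f < high]
    mu = mean(values)
    sd = pstdev(values, mu)
    if sd > 0:
        # Standardised third and fourth moments (population form).
        skew = mean(((v - mu) / sd) ** 3 for v in values)
        kurt = mean(((v - mu) / sd) ** 4 for v in values)
    else:
        skew, kurt = 0.0, 0.0
    return {
        "mean": mu,
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "variance": pvariance(values, mu),
        "std": sd,
        "kurtosis": kurt,
        "skewness": skew,
    }

# Synthetic PSD sampled every 0.01 Hz from 0 to 0.49 Hz (made-up values).
freqs = [i / 100 for i in range(50)]
psd = [1.0 / (1 + i) for i in range(50)]

# Band of EDA covering 0.15-0.25 Hz.
feats = band_statistics(freqs, psd, low=0.15, high=0.25)
print(sorted(feats))
```

Repeating the call for five bands gives the 5 x 8 = 40 values mentioned above.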
FIGURE 3. An overview of the proposed auto-encoder for learning the joint modality representation. ECG and EDA features are concatenated (U_ECG_EDA) and given as input to the encoder. The outcome of the embedded layer h_2(.) is taken as the joint modality feature representation and used to train a CRNN-SE model.

C. AUTO-ENCODER BASED JOINT MODALITY LEARNING MODULE
ECG and EDA modalities are simultaneously mapped to a single subspace, and we use adversarial learning to learn this subspace, termed the joint modality subspace. Unlike other works in the literature, we investigate this joint (also referred to as shared, cross, or common) modality subspace in the frequency domain for the first time.
We propose an auto-encoder-based framework to achieve this objective. The architecture of the proposed Joint Modality Auto-encoder (JMAE) is shown in Figure 3. First, we concatenate the ECG features U_ECG and the EDA features U_EDA into a single input vector, U_ECG_EDA. The first, second and third fully connected layers are h_1(.), h_2(.) and h_3(.), respectively. The last layer is an output layer, Y_joint, of the same length as the input vector U_ECG_EDA. The first, second and third hidden layers constitute the parameter vector θ(.) to be learnt by minimizing a cost (reconstruction) function. The cost function is selected such that the distributions of ECG and EDA are aligned in the joint subspace.
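The data flow through the JMAE can be sketched with a plain-Python forward pass. This is only a shape-level illustration under stated assumptions: the weights are random and untrained, the ReLU activation and the initialisation are assumptions not given in the text, and the helper names `dense` and `forward` are hypothetical. The feature counts (51 ECG, 40 EDA) and hidden layer lengths (95, 100, 95) are those reported for the whole-signal setting in Sections III-B and IV-B.

```python
import random

def dense(n_in, n_out, rng):
    """Create a randomly initialised fully connected layer (weights, biases)."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layer, x, relu=True):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

rng = random.Random(0)
n_ecg, n_eda = 51, 40
sizes = [n_ecg + n_eda, 95, 100, 95, n_ecg + n_eda]
layers = [dense(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

# Placeholder feature vectors standing in for U_ECG and U_EDA.
u_ecg = [rng.random() for _ in range(n_ecg)]
u_eda = [rng.random() for _ in range(n_eda)]
u_ecg_eda = u_ecg + u_eda            # concatenated input U_ECG_EDA

h1 = forward(layers[0], u_ecg_eda)   # h_1(.)
h2 = forward(layers[1], h1)          # h_2(.): joint modality features
h3 = forward(layers[2], h2)          # h_3(.)
y_joint = forward(layers[3], h3, relu=False)  # reconstruction Y_joint

print(len(u_ecg_eda), len(h2), len(y_joint))  # 91 100 91
```

After training, only the mapping up to `h2` would be needed to produce the joint features passed on to the classifier.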

Algorithm 1 (steps 12 to 21)
12: /* The decoder function returns a Y_n from a hidden representation h_n(.) */
13:
15: min_θ (Loss)
16: i ← i + 1
17: end while
18: return θ
19: θ ← Parameters
20: Loss ← MSE, cosine similarity and KL divergence
21: end procedure

Based on different works in the frequency domain, we investigated the following three cost functions: MSE, cosine similarity, and Kullback-Leibler (KL) divergence. The cost function represents the difference between the input U_ECG_EDA and the reconstructed Y_joint. The proposed model was trained with the Adam optimizer using the default learning rate and a mini-batch size of 64. The pseudo-code for training the JMAE is summarized in Algorithm 1.

1) MSE
MSE is calculated as shown in Eqn. 1, where a_i is the target value (U_ECG_EDA), p_i is the predicted value (Y_joint), and N is the number of values compared.

$$ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (a_i - p_i)^2 \tag{1} $$

The cost function value ranges from 0 to ∞. The reconstructed Y_joint is similar to the input U_ECG_EDA if the MSE value is near 0; otherwise they are dissimilar.

2) Cosine similarity
The cosine similarity is computed between a_i, the target value (U_ECG_EDA), and p_i, the predicted value (Y_joint), as shown in Eqn. 2.

$$ \mathrm{Cos}(a, p) = 1 - \frac{\sum_{i} a_i p_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} p_i^2}} \tag{2} $$

The cost function has a value between 0 and 1 when all feature values are non-negative. A value near 0 implies that Y_joint is similar to U_ECG_EDA, while a value near 1 indicates that they are dissimilar.

3) KL divergence
The KL divergence is a measure of dissimilarity between a_i, the target value (U_ECG_EDA), and p_i, the predicted value (Y_joint), as shown in Eqn. 3.

$$ \mathrm{KL}(a \,\|\, p) = \sum_{i} a_i \log \frac{a_i}{p_i} \tag{3} $$

The cost function value ranges from 0 to ∞. The two distributions (U_ECG_EDA and Y_joint) are similar if the value is close to 0; otherwise the distributions are dissimilar.
The results obtained with each loss are compared in Table 3 of the results section.
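For concreteness, the three cost functions can be written as plain Python functions over a target vector `a` and a prediction `p`. This is a sketch rather than the training code used in the paper: the function names are hypothetical, and normalising both vectors to sum to one before computing the KL divergence (with a small `eps` for numerical safety) is an assumption, since the text does not say how the feature vectors are turned into distributions.

```python
import math

def mse_loss(a, p):
    """Mean squared error between target a and prediction p."""
    return sum((ai - pi) ** 2 for ai, pi in zip(a, p)) / len(a)

def cosine_loss(a, p):
    """One minus the cosine similarity between a and p."""
    dot = sum(ai * pi for ai, pi in zip(a, p))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    return 1.0 - dot / (norm_a * norm_p)

def kl_divergence(a, p, eps=1e-12):
    """KL(a || p) after normalising both non-negative vectors to sum to 1."""
    sum_a, sum_p = sum(a), sum(p)
    a_norm = [ai / sum_a for ai in a]
    p_norm = [pi / sum_p for pi in p]
    return sum(ai * math.log((ai + eps) / (pi + eps))
               for ai, pi in zip(a_norm, p_norm) if ai > 0)

target = [1.0, 2.0, 3.0]
print(mse_loss([1.0, 2.0], [1.0, 4.0]))        # 2.0
print(cosine_loss([1.0, 0.0], [0.0, 1.0]))     # 1.0
print(kl_divergence(target, target))           # 0.0
```

All three values are 0 (up to rounding) for a perfect reconstruction, which is the behaviour described for each loss above.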

D. CLASSIFIER
We selected a CRNN-SE model having 2 convolutional layers, one LSTM layer and two SE modules as our classifier in all our experiments. Details for this choice are given in Appendix A. For frequency domain analysis, each signal is broken into segments of duration 5 sec each. Details for this choice are given in Appendix B.
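The segmentation step can be sketched as below. The function name `segment_signal` is hypothetical, and the use of non-overlapping windows with the trailing partial window discarded is an assumption; the text only states the 5-second duration.

```python
def segment_signal(signal, fs, duration_s=5):
    """Cut a 1-D signal into non-overlapping segments of duration_s seconds.

    fs is the sampling rate in Hz. Samples that do not fill a complete
    segment at the end are dropped (an assumption of this sketch).
    """
    seg_len = int(fs * duration_s)
    n_segments = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

# Toy signal: 12 seconds sampled at 256 Hz.
signal = list(range(256 * 12))
segments = segment_signal(signal, fs=256, duration_s=5)
print(len(segments), len(segments[0]))  # 2 1280
```

Frequency-domain features would then be extracted from each segment independently.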
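All the models are trained with the Adam optimizer using the default learning rate and a mini-batch size of 64. Binary Cross-Entropy (BCE), given by Eqn. 4, is taken as the loss function. Here, y_i^act is the actual label and y_i^pred is the predicted label for each of the N samples.

$$ \mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i^{act} \log\left(y_i^{pred}\right) + \left(1 - y_i^{act}\right) \log\left(1 - y_i^{pred}\right) \right] \tag{4} $$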
An early-stopping strategy halts training if the loss does not decrease for 30 epochs in succession. Accuracy and F1-score are used to evaluate the performance of the various models.
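The loss and the stopping rule can be illustrated with a short, framework-free sketch. The class name `EarlyStopping`, the clipping constant `eps`, and the simulated loss curve are assumptions; the text does not state whether the training or validation loss is monitored, so the sketch simply takes whatever loss value it is given.

```python
import math

def bce(y_act, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over all samples."""
    total = 0.0
    for y, p in zip(y_act, y_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_act)

class EarlyStopping:
    """Stop when the loss has not decreased for `patience` epochs in a row."""

    def __init__(self, patience=30):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated loss curve: improves for three epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.7] * 40
stopper = EarlyStopping(patience=30)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at)  # 32
```

The last improvement happens at epoch 2, so the thirtieth consecutive epoch without improvement is epoch 32.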

IV. RESULTS AND DISCUSSION
This section presents the results obtained by applying our proposed framework to the four benchmark datasets.

A. SELECTION OF FREQUENCY BAND
To study the performance of each ECG and EDA frequency band, the features obtained from each band are used to train a separate CRNN-SE classifier. Table 1 shows the frequency band analysis for the ECG signal, and Table 2 shows the frequency band analysis for the EDA signal. The results show that the HF band (0.15-0.40 Hz) of the ECG and the b band (0.15-0.25 Hz) of the EDA achieved the highest accuracy and F1-score on all four datasets. This means that, in the 0.15-0.25 Hz range, both ECG and EDA have features with higher discriminative capacity for identifying stress. For a hardware implementation, a band-pass filter can be used to extract these richer features from the frequency transform of the ECG and EDA signals. It will be interesting to investigate whether this band range is also valid for other physiological signals, e.g., EEG.

B. BAND LEVEL VS WHOLE SIGNAL
We investigated the proposed framework on the whole signal (using all the ECG and EDA frequency bands) as well as at the band level (using the bands with the highest performance, as obtained in Section IV-A). For the whole-signal case, 51 frequency-domain features from the ECG signal and 40 frequency-domain features from the EDA signal are used to train a JMAE_whole module. The first hidden layer h_1(.) is a fully connected layer of length 95. The second hidden layer h_2(.) is a fully connected layer of length 100. The third hidden layer h_3(.) is another fully connected layer of length 95. The joint modality features obtained from JMAE_whole are used to report the results in the third and fourth columns of Table 3.

We validated the proposed model by performing K-fold cross-validation on the best-performing model (MSE loss). The value of K is chosen to be 5. The joint features obtained from the JMAE model are split into 5 folds. Classification accuracy and F1-score (mean ± standard deviation) are given in Table 4. On all the datasets, the cross-validation results outperformed the previous results (Table 3) by 2.7-4% (absolute). We infer that this increase is due to subject dependence during cross-validation.
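The 5-fold protocol and the mean ± standard deviation summary can be sketched as follows. The helper name `k_fold_indices`, the contiguous unshuffled folds, the use of the sample standard deviation, and the per-fold accuracies are illustrative assumptions, not values or code from the paper.

```python
from statistics import mean, stdev

def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, test_indices) pairs over contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(n_samples=23, k=5)
print([len(test_idx) for _, test_idx in folds])  # [5, 5, 5, 4, 4]

# Hypothetical per-fold accuracies, used only to show the summary format.
fold_accuracies = [0.90, 0.92, 0.91, 0.89, 0.93]
summary = (mean(fold_accuracies), stdev(fold_accuracies))
print(f"{summary[0]:.3f} ± {summary[1]:.3f}")
```

Because the folds are formed over samples rather than subjects, segments from one subject can appear in both the training and test folds, which is the subject dependence referred to above.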

C. T-SNE VISUALISATION
To further investigate the joint feature learning achieved by our model, we plot the t-distributed Stochastic Neighbour Embedding (t-SNE) before and after joint feature learning. The t-SNE approach projects multi-dimensional points onto a two-dimensional or three-dimensional space such that points that are close in the original space remain close in the projection. Similarly, distant points tend to remain far apart in the t-SNE projection. With t-SNE, we project the joint features into a 2-D space. The feature visualizations of U_ECG_EDA (regular features) and h_2(.) (learnt joint features), using the MSE cost function on the whole signal of all the benchmark datasets, are shown in Figure 4. The red dots represent ECG features, and the green dots represent EDA features. Joint feature learning aims to bring the features of different modalities into a shared space. In the visualization, we observed close overlap between the modalities (ECG and EDA) after joint feature learning. This indicates that the gap between the distributions of the modalities is significantly reduced.

D. GENERALIZATION CAPABILITIES
The proposed model is tested on four benchmark datasets to assess the framework's generalization capabilities. These tests help verify that the framework does not overfit to a specific dataset collected in a given environment. To this end, we ensured that the four benchmark datasets were gathered in different scenarios: CLAS and ASCERTAIN were collected while subjects watched emotional video clips, whereas MAUS and WAUC were collected while subjects performed mental workload tasks (combined with physical activity in WAUC). We found that the performance on all four datasets followed the same patterns.

E. COMPARISON WITH OTHER WORKS
This section contrasts the results obtained by our proposed JMAE framework with recent works on the ASCERTAIN, CLAS, MAUS and WAUC datasets. An overview of the metrics (accuracy, F1-score and AUC) is given in Table 5. The majority of the stress detection studies used time- and frequency-domain features [24], [25], [26], [33]-[36], [39]-[41]. Our proposed JMAE-based features are learned from frequency-domain measures. They perform better than the time- and frequency-domain feature-based frameworks on the ASCERTAIN, MAUS and WAUC datasets by 15-17%.
Most works [24], [25], [33]-[35], [38], [39] reported performance in a subject-dependent scenario. The performance of these works is usually higher owing to prior knowledge of the testing subject during the training process itself. However, our proposed framework also outperforms [...]

A few works utilized traditional handcrafted features in conjunction with Machine Learning (ML) models such as Support Vector Machine, Naive Bayes and Random Forest [24], [25], [26], [33]-[35], [38], [39]. A few others trained end-to-end deep learning (DL) models such as CNNs [36], [37]. The proposed framework trains one DL model (the auto-encoder) and uses its intermediate-layer output to train a second DL model (the classifier). Our framework outperformed existing ML and DL works on the ASCERTAIN, MAUS and WAUC datasets by 8-17%.

Using ECG biomarkers, several stress-related abnormalities can be detected: Coronary Artery Disease (CAD) [42], myocardial ischemia [43], stroke, atrial fibrillation, and cardiac arrhythmias [44]. Using EDA/GSR biomarkers, another set of abnormalities caused by stress can also be detected: brain and heart attack [45], epilepsy [46], blood pressure [47], and depression [48]. Traditional approaches built separate classifiers using these modalities (biomarkers) and then took the final decision of stressed or not (late fusion techniques). Our approach concatenates the two biomarkers (feature fusion up to this point) and then learns a joint representation (our contribution) to yield the best feature representation for stress detection. This is in line with the clinical practice of diagnosing by simultaneously monitoring several physiological signals; clinical decisions are rarely made by monitoring only one physiological signal. Our results are better than those of other works in the literature that are based on single modalities, early fusion, or late fusion.
It is interesting to note that the framework based on band-level features (JMAE_band) performs better than all the other works on the ASCERTAIN and MAUS datasets by 11-15%. This reinforces the richer nature of our proposed JMAE-based features.
The results indicate that learning joint features of different modalities from the shared space can enhance the performance of the models. The proposed model performs better than other existing works on the ASCERTAIN, MAUS, and WAUC datasets. On the CLAS dataset, the accuracy of [39] is higher owing to ensemble voting on a subject-dependent model.

V. CONCLUSION
We proposed a joint modality features-based framework in the frequency domain for stress detection. We validated our framework using two physiological signal modalities, EDA and ECG. Frequency bands of ECG and EDA are identified. Features extracted from the PSD are used to train CRNN models with SE modules. The proposed framework was tested on four benchmark datasets. The High Frequency (HF) band (0.15-0.40 Hz) of ECG and the b frequency band (0.15-0.25 Hz) of EDA were found to have the most impact on the overall performance. Our promising findings encourage us to continue further study into joint modality learning with more than two modalities.

APPENDIX A SELECTION OF CRNN ARCHITECTURE
The following sections provide information on selecting the number and location of the different convolutional layers, LSTM layers, and SE modules. Each convolutional layer is always followed by batch normalization and max-pooling layers. All the models have two fully connected layers (FC1 and FC2) and a sigmoid output layer. The performance of individual modalities on two datasets for the different models is presented in Table 7. Model 7 yielded the highest performance on the ASCERTAIN EDA and CLAS EDA features. Model 5 yielded the highest performance on the ASCERTAIN ECG and CLAS ECG features. We selected the Model 7 architecture for the rest of the experiments using both ECG features and EDA features.

APPENDIX B SELECTION OF SEGMENT DURATION
Model 7 is used as the framework for both EDA features and ECG features. Each signal (ECG/EDA) is divided into segments of fixed duration. Four cases are considered: segment durations of 2 sec, 5 sec, 10 sec and 15 sec. In each case, Model 7 is trained, and the performances obtained are reported in Table 8. The highest performance is observed for a segment duration of 5 sec. The overall performance of the 5 sec segmented signals (Table 8, rows 2 and 6, for both ECG and EDA features) is higher than the baseline performance of the full signal (Table 7, rows 1 and 5 for ECG features, rows 3 and 6 for EDA features).

V RAMANA MURTHY ORUGANTI received his Masters and PhD degrees in Electrical Engineering from IIT Delhi, India. He is an Assistant Professor in the Department of Electrical and Electronics Engineering, Amrita Vishwa Vidyapeetham, India. His past affiliations include NUS (Singapore), NTU (Singapore), University of Canberra (Australia) and Carnegie Mellon University (US). His research focuses on medical image processing and affective computing. He is a Member of IEEE and the ACM.