SigRep: Toward Robust Wearable Emotion Recognition With Contrastive Representation Learning

Extracting emotions from physiological signals has become popular over the past decade. Recent advancements in wearable smart devices have enabled capturing physiological signals continuously and unobtrusively. However, signal readings from different smart wearables are lossy due to user activities, making it difficult to develop robust models for emotion recognition. Also, the limited availability of data labels is an inherent challenge for developing machine learning techniques for emotion classification. This paper presents a novel self-supervised approach inspired by contrastive learning to address the above challenges. In particular, our proposed approach develops a method to learn representations of individual physiological signals, which can be used for downstream classification tasks. Our evaluation with four publicly available datasets shows that the proposed method surpasses the emotion recognition performance of state-of-the-art techniques for emotion classification. In addition, we show that our method is more robust to losses in the input signal.


I. INTRODUCTION
Emotion recognition is becoming an increasingly important field in human-computer interaction. Common displays of emotion are speech [1], facial expressions [2], gestures [3], and physiological signals [4]. Among them, physiological signals are one of the most reliable means, as they originate from the activity of the Autonomic Nervous System (ANS) and can hardly be triggered or suppressed by conscious or intentional control [4].
Before the emergence of smart wearable devices, physiological signals could only be obtained using medical sensing devices such as Electroencephalography (EEG) and Electrocardiograph (ECG) sensors. Such sensors are intrusive, nonportable, and cumbersome to use, making it challenging to embed emotion recognition technologies in real-life applications. Recent advancements in smart wearable devices have offered a paradigm shift in wearable sensing. Consumer-grade smart wearable devices such as smartwatches and fitness trackers are portable, non-invasive, and equipped with various sensors. They enable continuous monitoring of physiological signals and make affect detection technologies possible for daily usage.
The associate editor coordinating the review of this manuscript and approving it for publication was Ines Domingues.
Despite their merits, smart wearable devices are not as accurate as medical-grade devices. Their signals also tend to be lossy due to users' activities or environmental interference. These limitations can negatively impact the reliability of affect detection algorithms [5].
Deep learning models are generally robust to lossy signals; therefore, they can be used to develop robust affect detection algorithms [6]. Deep learning models also make representation learning feasible, which fully or partially eliminates the need for feature engineering. Feature engineering is the method of designing features using domain knowledge. It is a complex task that requires significant human time and effort, which can take even decades for an entire community of researchers [7]. A representation learning algorithm can discover a good set of features for a task in a fraction of the time required by manual feature engineering. However, deep learning models require an enormous amount of labelled data to work effectively. It is challenging and labour-intensive to identify and assign emotion labels to sensor reading segments.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Self-supervised and unsupervised learning minimise the need for labelled data, making representation learning using deep models more feasible. Researchers have studied self-supervised representation learning for physiological signals. However, most of these studies target high-frequency signals such as EEG and ECG [8], [9], and little research has been done on using representation learning techniques for low-frequency signals, such as heart rate and electrodermal activity, that are generated by widely available wearable devices.
This paper presents a novel self-supervised representation learning mechanism, SigRep, that works well with low-frequency physiological sensor data generated from commodity wearable devices. These representations can be easily adapted for downstream emotion recognition tasks with a limited amount of labelled data. With SigRep, we introduce a signal encoder consisting of 1D convolutions.
We use a block-like neural network architecture inspired by Inception. In the pre-training stage, our network learns to contrast signal samples with random augmentation. We train signal representations for individual signal modalities using a large set of unlabeled signals. Then we fuse pre-trained signal representations for emotion recognition tasks. In the emotion recognition phase, we keep the weights of representations frozen.
We extensively evaluate our method for i) classification performance on intact datasets, ii) behaviour of the method when data is lossy, iii) performance of the system when a smaller amount of labelled data is available for training, iv) significance of our encoder component, and v) effect of the individual signal modality on emotion recognition. The results show that our technique outperforms eight other state-of-the-art techniques in seven out of 12 tasks. Our contributions are as follows:
1) We adopt contrastive learning, a self-supervised training technique, to learn signal representations from low-frequency physiological sensor data, which can be effectively used for downstream emotion classifiers.
2) We propose an improvement to the conventional contrastive learning framework by proposing a new inception-inspired lightweight encoder, which offers better performance than a conventional encoder for downstream emotion classification tasks.
3) To demonstrate our proposed technique's performance, we conduct a series of experiments on four datasets. Experimental results show that our proposed approach offers significantly better performance than state-of-the-art methods. Results also show that the proposed approach requires a significantly smaller amount of labelled data and is more robust to data loss than a fully supervised model.

II. RELATED WORK
A. FEATURE ENGINEERING AND REPRESENTATION LEARNING FOR EMOTION RECOGNITION USING WEARABLE SENSING
Before the advent of deep learning, research on emotion recognition was based on feature engineering, which is essentially hand-crafting features based on domain knowledge. It is, however, challenging to design features from the high dimensional data captured from a multitude of wearable devices [10]. Recent deep learning methods aim to address this issue through representation learning, which automatically extracts features from the raw signal. These deep learning techniques are promising as they achieve higher accuracy than conventional approaches using hand-crafted features. For example, Santamaria-Granados et al. [11] compare a deep CNN with several classical machine learning methods for emotion recognition tasks, where the CNN learns features from raw electrocardiography (ECG) and electrodermal activity (EDA) signals and the classical methods use hand-crafted features. The comparisons show that representation learning outperforms feature engineering. Recently, Yang et al. [12] propose a hybrid neural network architecture to learn human emotion using EEG signals. The authors propose an architecture that concatenates a CNN and a long short-term memory (LSTM) network in parallel to learn from raw electroencephalography (EEG) signals and validate their method using publicly available datasets. Again, experimental results show improved accuracy of the CNN-LSTM over models using hand-crafted features. Although deep learning methods outperform classical machine learning methods [11]-[13], they require a large amount of labelled data to learn representative features [11]. It is challenging and labour-intensive to identify and assign emotion labels to sensor reading segments. Unsupervised feature learning techniques in deep learning address the requirement for a large amount of labelled data. These techniques can learn representations from unlabelled data; the representations can then be reused for multiple downstream tasks built around smaller labelled datasets [14].
However, in a recent review, Schmidt et al. [15] highlight the limited usage of unsupervised and semi-supervised learning methods for wearable-based affect recognition research.
Among the studies in this research area, the autoencoder is a widely used technique for unsupervised representation learning. The recently published CorrNet [16] uses autoencoder-based automatic feature extraction in a wearable signal-based emotion recognition task and outperforms the state-of-the-art baseline on the CASE dataset for arousal (74.03%) and valence (76.37%) detection. Martinez et al. [13] deploy a denoising autoencoder network to learn features from blood volume pulse (BVP) and electrodermal activity (EDA) signals and reuse the learned features to classify affective states. Tang et al. [17] also use a denoising autoencoder to learn features from EEG and peripheral physiological signals, and achieve accuracies of 93.97% and 83.53% for binary arousal-valence classification on the SEED and DEAP datasets, respectively.
Recently, a few research studies have used self-supervised learning on ECG [18] and EEG signals [9], [19]. Banville et al. [9] proposed a self-supervised strategy to automatically learn features from unlabelled EEG signals and demonstrated that EEG features learned in a self-supervised manner outperform traditional supervised features while performing similarly to a fully supervised model on sleep stage detection tasks. Furthermore, they demonstrated that self-supervised models outperform supervised methods in low data situations by extensive margins. Cheng et al. [20] propose a contrastive learning method for EEG and ECG signals. Their study shows that self-supervised representations yield performance comparable to that of a fully supervised counterpart.

B. EMOTION RECOGNITION USING LOSSY SENSOR DATA FROM WEARABLE DEVICES
Signal streams from wearable devices are inherently lossy, resulting in gaps in the signal streams [21]. Traditionally, researchers use statistical values such as the mean and median to fill the gaps in data [22]. However, filling gaps with such values is problematic in time-series data as those values do not reflect the qualities of the signals [23]. To overcome this problem, researchers have lately been looking into generating values to fill the gaps. For example, Che et al. [24] use a deep learning model, 'GRU-D', to impute missing data in multivariate time series. However, the lack of annotated data makes the 'GRU-D' technique less usable in wearable emotion recognition. Recently, approaches based on Generative Adversarial Networks (GANs) have become popular for data imputation [25], [26]. However, generative models are computationally heavy, whereas in this paper we focus on lightweight methods that can potentially be used in resource-constrained environments.

C. EMOTION RECOGNITION USING LIMITED LABELLED DATA
Addressing the problem of limited labelled data is a popular research area in machine learning. Transfer learning [27], [28] has been a popular approach to this challenge. The technique focuses on transferring knowledge from a model trained on a similar task to a new task. In wearable sensing, it is common to apply transfer learning to emotion recognition using models initially trained for activity recognition tasks. However, the signal modalities used in activity recognition (accelerometer and gyroscope) do not fully cover the physiological signals captured with wearable sensors.
Another way to address limited labelled data is by augmenting existing data to create new data points. The data augmentation method has been successfully used in computer vision. In wearable emotion recognition, the reflection of emotion has a personalised nature [29]. Given that existing wearable-based emotion recognition datasets consist of a limited number of subjects, data augmentation may not expand the inter-subject variability, leading to lower prediction performance. More studies are needed to understand how data augmentation can be used for emotion recognition using wearable devices.
FIGURE 1. Contrastive learning framework. This framework borrows elements from SimCLR [30]. Initially, two separate transformation operations (t, t′) selected from a set of transformations (τ) are applied to samples (x) in the training distribution. Then, transformed signals are used to train the encoder network f(.) and projection head g(.) to create latent vectors (z_i, z_j). Then, a contrastive loss is calculated between z_i and z_j to maximise the agreement. The calculated loss is propagated back through the network and weights are updated accordingly. After the training process, the projection head g(.) is detached. The encoder network f(.) and the latent representation h are used for downstream tasks.
In wearable sensing, datasets are usually large, as most of the sensors run in the background as a daemon process producing an enormous number of data points. However, due to the high cost of annotation, it is prohibitively expensive to label these large datasets. Self-supervised techniques can address these issues by learning meaningful representations from the data. However, studies using self-supervised techniques on wearable emotion recognition tasks are very limited (see Section II-A).
To summarise: (1) The majority of the already limited unsupervised/self-supervised feature learning approaches target high-frequency signals such as EEG and ECG, and little research has used representation learning techniques for low-frequency signals from wearable devices. Our focus is on physiological signals that can be captured with commodity smart wearable devices, and there is a clear gap in the literature regarding unsupervised/self-supervised methods that could be used for our purpose. (2) Advanced and computationally expensive techniques such as GANs can be used to address missing data challenges. There is still a need for lightweight techniques to account for data losses in the input signal in wearable sensing. (3) Limited labelled data is a universal challenge in machine learning and hence common in wearable sensing. Self-supervised learning and data augmentation are used to address the issue of limited labelled data. However, there is a gap in the literature regarding the suitability of these techniques for wearable sensing platforms.

III. MODEL ARCHITECTURE
Signal representations are core components of our research work. We propose a Self-Supervised Learning (SSL) paradigm to learn representations. In particular, we use contrastive learning [30], which learns an embedding space by minimising the distance between similar sample pairs while maximising the distance between dissimilar pairs. Contrastive learning has been shown to be one of the most powerful self-supervised learning paradigms [31].
FIGURE 2. Inception inspired block. We define an inception inspired block, based on the inception block [34], with Conv1D layers with different kernel sizes [1, 3, 5, 7] and a max-pooling layer. Each convolution layer has two filters and uses the rectified linear unit (ReLU) as the activation function. The max pool layer is configured to have a pool size of 3 and a stride of 1. All the parallel layers in this block use zero padding to keep the output width the same as the input width. An input vector to a block goes through each layer in parallel. In the end, the layer outputs are stacked together to construct the output.
We borrow elements from the SimCLR framework [30] for contrastive learning, which was originally proposed for visual representations. It simplifies the specialised architecture of contrastive learning yet outperforms previous self-supervised and semi-supervised learning methods on ImageNet. We bring SimCLR to the wearable sensor domain. We present our SimCLR-based contrastive learning framework in Fig. 1 and describe its components below.

A. DATA AUGMENTATION COMPONENT
In contrast to conventional learning paradigms, SSL techniques do not require manual data labelling. They use data augmentation to generate labels. The data augmentation component transforms input data x into two views (x̃_i, x̃_j) by applying transformations (t, t′). Informed by previous research, we randomly select each transformation from the following pool τ = {amplitude re-scaling, random DC shift, zero masking, additive noise} [32], [33]. When (x̃_i, x̃_j) are generated from the same input, we recognise them as a positive pair; otherwise, we consider them a negative pair. We use the following configurations for the signal transformations.
• Amplitude re-scale: This transformation selects a random scale factor scale from a uniform distribution scale ∈ (0.1, 1.9) and multiplies it with the input signal.
• Random DC shift: For this transformation, we select a random shift value shift from a uniform distribution shift ∈ (0.1, 0.9) and add it to the input signal.
• Zero mask: For this transformation, we select a random mask length w such that w ∈ (0.1 * l, 0.9 * l), where l is the signal length, and a random starting point s_id such that s_id < l/2. Then, we mask the input signal with zeros starting from s_id for a length of w. In the case where s_id + w > l, the zero mask is applied from s_id until the end of the signal.
• Additive noise: For this transformation, we generate a noise signal sampled from (−1, 1) with the same length as the input signal. Then we add the noise signal to the input signal.
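The four transformations listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, signals are plain lists of floats, the noise is assumed to be uniform on (−1, 1) because the distribution is not stated, and the two views are assumed to draw their transformations independently from the pool.

```python
import random

def amplitude_rescale(x, rng):
    # Multiply the signal by a factor drawn uniformly from (0.1, 1.9).
    scale = rng.uniform(0.1, 1.9)
    return [scale * v for v in x]

def dc_shift(x, rng):
    # Add a constant offset drawn uniformly from (0.1, 0.9).
    shift = rng.uniform(0.1, 0.9)
    return [v + shift for v in x]

def zero_mask(x, rng):
    l = len(x)
    w = int(rng.uniform(0.1 * l, 0.9 * l))    # mask length
    s_id = rng.randrange(0, max(1, l // 2))   # start index, s_id < l/2
    end = min(s_id + w, l)                    # clip at the end of the signal
    return x[:s_id] + [0.0] * (end - s_id) + x[end:]

def additive_noise(x, rng):
    # Assumption: noise is uniform on (-1, 1); the text gives only the range.
    return [v + rng.uniform(-1.0, 1.0) for v in x]

POOL = (amplitude_rescale, dc_shift, zero_mask, additive_noise)

def make_views(x, rng):
    # Draw two transformations independently and apply them to the same
    # input, yielding one positive pair.
    t_i = rng.choice(POOL)
    t_j = rng.choice(POOL)
    return t_i(x, rng), t_j(x, rng)
```

Calling make_views(x, random.Random(0)) on a signal chunk returns one positive pair; views generated from different chunks in the same batch act as negatives.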

B. ENCODER
For the encoder, we propose an Inception inspired network block [34], as illustrated in Fig. 3. The Inception network uses convolutional layers with multiple kernel sizes on the same level in a CNN. By having multiple convolution kernel sizes, the network can learn patterns of different lengths from the input signal. We pass the input to multiple Conv1D layers with kernel sizes 1, 3, 5, 7 and a max-pooling layer before stacking their outputs as the output. The encoder is trained to learn a function f(.), where h = f(x̃); h denotes the latent representation of the transformed signal x̃.
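To make the data flow through one block concrete, the following pure-Python sketch runs a forward pass with random weights. It is an illustration under assumptions, not the trained encoder: "stacking" is read as concatenation along the channel axis, the pooling window is clipped at the signal edges instead of padded, and all function names are hypothetical. Signals are stored as a list of time steps, each holding a list of channel values.

```python
import random

KERNEL_SIZES = (1, 3, 5, 7)
FILTERS = 2  # two filters per convolution branch

def conv1d_same_relu(x, weights, bias):
    # x: [length][c_in]; weights: [filters][kernel][c_in]; zero padding
    # keeps the output length equal to the input length.
    length, c_in = len(x), len(x[0])
    k = len(weights[0])
    pad = k // 2
    out = []
    for t in range(length):
        row = []
        for w_f, b_f in zip(weights, bias):
            acc = b_f
            for dk in range(k):
                src = t + dk - pad
                if 0 <= src < length:  # out-of-range taps contribute zero
                    acc += sum(w_f[dk][c] * x[src][c] for c in range(c_in))
            row.append(max(0.0, acc))  # ReLU
        out.append(row)
    return out

def maxpool1d_same(x, pool=3):
    # Pool size 3, stride 1; the window is clipped at the edges (assumption).
    length, c_in = len(x), len(x[0])
    half = pool // 2
    return [[max(x[s][c] for s in range(max(0, t - half), min(length, t + half + 1)))
             for c in range(c_in)] for t in range(length)]

def init_block(c_in, rng):
    # Random weights and zero biases for the four convolution branches.
    params = []
    for k in KERNEL_SIZES:
        w = [[[rng.uniform(-0.5, 0.5) for _ in range(c_in)]
              for _ in range(k)] for _ in range(FILTERS)]
        params.append((w, [0.0] * FILTERS))
    return params

def inception_block(x, params):
    # Run all branches in parallel and concatenate them channel-wise.
    branches = [conv1d_same_relu(x, w, b) for w, b in params]
    branches.append(maxpool1d_same(x))
    return [sum((br[t] for br in branches), []) for t in range(len(x))]
```

Under these assumptions a single-channel input of length 16 leaves the block as 16 time steps with 9 channels (4 branches × 2 filters plus the pooled input channel), so the width is preserved while the channel count grows by eight per block.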
The original Inception network proposed by Szegedy et al. [34] uses many inception blocks, resulting in approximately five million trainable parameters in the final network. In contrast, our proposed signal encoder uses only four inception inspired blocks, resulting in an encoder network with fewer than 5,000 trainable parameters. Therefore, our final encoder network can reduce resource consumption and helps avoid overfitting on smaller datasets.
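A rough parameter count illustrates why four such blocks stay well under the stated bound. The sketch below assumes that the blocks are chained directly, that each block concatenates its four two-filter convolution branches with the pooled input, and that no other trainable layers are present; the actual encoder may differ, so the totals are indicative only. The helper names are hypothetical.

```python
KERNEL_SIZES = (1, 3, 5, 7)
FILTERS = 2

def block_params(c_in):
    # Each Conv1D branch has kernel * c_in * filters weights plus one bias
    # per filter; the max-pooling branch has no trainable parameters.
    return sum(k * c_in * FILTERS + FILTERS for k in KERNEL_SIZES)

def block_out_channels(c_in):
    # Four convolution branches plus the pooled copy of the input channels.
    return len(KERNEL_SIZES) * FILTERS + c_in

def encoder_params(c_in, n_blocks=4):
    total = 0
    for _ in range(n_blocks):
        total += block_params(c_in)
        c_in = block_out_channels(c_in)
    return total
```

With these assumptions, a single-channel signal gives 40 + 296 + 552 + 808 = 1,696 parameters and a three-channel signal gives 1,952, both consistent with an encoder of fewer than 5,000 trainable parameters.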

C. PROJECTION HEAD
The projection head is a neural network component in the SimCLR framework. It is designed to learn a function g(.) on top of the representation h before calculating the contrastive loss. Chen et al. [30] experimented with the effect of having a non-linear, a linear, and no projection between the latent representation and the contrastive loss calculation, and reported that a non-linear projection on top of the representation outperforms the other two settings. Guided by this finding, we use a non-linear neural network for the projection head, consisting of two fully connected layers with 16 units each and ReLU as the activation function.

D. CONTRASTIVE LOSS
The contrastive loss function maximises the agreement between latent representations; positive pairs attract while negative pairs repel each other. In this work, we use the normalised temperature scaled cross-entropy loss (NT-Xent) as the loss function. Equation 1 defines the Contrastive loss, where l(i, j) is defined in Equation 2, and sim(i, j) is the cosine similarity of the i, j vectors.
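For reference, the NT-Xent loss as defined in SimCLR [30] computes, for each positive pair (i, j) in a batch of 2N views, l(i, j) = −log( exp(sim(i, j)/T) / Σ_{k≠i} exp(sim(i, k)/T) ), and averages l over all 2N ordered positive pairs. The sketch below implements that standard form in plain Python. The temperature is written T to avoid a clash with the transformation pool τ, and its default value of 0.5 is an assumption, as are the function names and the convention that consecutive vectors form a positive pair.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent(z, temperature=0.5):
    # z: list of 2N projection vectors; z[2k] and z[2k + 1] are the two
    # views of the same input (assumed layout).
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1   # index of the positive partner
        pos = cosine(z[i], z[j]) / temperature
        logits = [cosine(z[i], z[k]) / temperature for k in range(n) if k != i]
        log_denom = math.log(sum(math.exp(s) for s in logits))
        total += log_denom - pos             # equals -log(softmax of the positive)
    return total / n
```

The loss is small when each view is most similar to its own partner and grows when views of different inputs are closer than the partners, which is the attract and repel behaviour described above.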

E. EMOTION CLASSIFICATION HEAD
After the training, the projection head g(.) is detached. The encoder network f(.) and the latent representation h are used for downstream tasks of emotion classification. The emotion classification head is a tiny, fully connected neural network component used in the downstream emotion classification tasks. As illustrated in Fig. 4, the classification head is built with two fully connected layers with 16 and 8 units, respectively, followed by a softmax layer. Each fully connected layer uses ReLU activation, and the number of units in the softmax layer is equal to the number of classes used in the classification task.

IV. EXPERIMENTAL SETUP
A. DATASETS
We use multiple publicly available datasets, which consist of physiological signals captured using wearable devices. We provide a brief description of the datasets below. The AffectiveROAD [35] dataset consists of multimodal physiological and ambient sensor data captured during real-world driving. Data is collected from ten people across 14 driving sessions of 1.5 hours. Two wrist-worn Empatica E4 devices were used for data collection from both hands of the driver. A chest-worn device, BioHarness 3, was also used to collect data, but we only consider data from the wrist-worn devices for this work. Data streams have been annotated for stress from the perspective of an external party and later validated with the driver. We use the physiological signal streams from this dataset for representation training.
The continuously annotated signals of emotion (CASE) [36] dataset contains physiological signals (Electrocardiogram (ECG), Blood Volume Pulse (BVP), Electromyogram (EMG) and Electrodermal Activity (EDA)) captured from 30 participants while they were watching emotion-stimulating videos. Data streams were annotated with arousal and valence values from the perspective of the participant. The CASE dataset provides the arousal/valence rating in nine levels. However, in the literature [16], researchers have binned the nine levels into two-class and three-class configurations for evaluation. For comparison, we follow the class configuration proposed by Zhang et al. [16] in our evaluations. We use CASE for signal representation learning as well as for the emotion recognition tasks.
The CLAS [37] dataset consists of physiological signals (Electrocardiogram (ECG), Photoplethysmography (PPG) and Electrodermal Activity (EDA)) with an inertial signal (Accelerometer (ACC)) captured from 60 participants. Data were collected while participants engaged in various activities that elicit different cognitive load, affect and stress levels. Shimmer 3 GSR+ and Shimmer 3 ECG units were used in the data collection process. Our study only uses PPG, EDA, and ACC signals from the dataset in the representation learning and emotion recognition tasks.
The K-EmoCon [38] dataset contains multiple physiological signals (Electrocardiogram (ECG), Electroencephalogram (EEG), Blood Volume Pulse (BVP), Electrodermal Activity (EDA), Body Temperature (TEMP)) and inertial signals (Accelerometer (ACC)) from 32 participants during 16 debate sessions. Data of four participants were discarded due to sensor malfunctioning. The dataset contains annotations of arousal, valence, and categorical emotions from multiple perspectives: first-person (self-report), second-person (debate opponent) and third-person (external party). We use this dataset in both the representation learning and emotion recognition stages. In this dataset, arousal and valence values are reported at five different levels. However, due to the heavy imbalance of the class distribution, we bin the arousal/valence levels into two and three bins and create binary and three-class classification problems for arousal and valence.
Further, the authors have published intensity levels of the five categorical emotions (happy, sad, angry, cheerful and nervous). When we chunk the dataset for emotion prediction tasks, we treat the most intense emotion as the categorical emotion of that chunk. We turn those categorical emotions into a five-class classification problem. As previously mentioned, the dataset has been annotated from three different perspectives. For this research work, we select self-reported emotions for analysis.
The PPG dataset for motion compensation and heart rate estimation in daily life activities (PPG-DaLiA) [39] contains physiological signals (Blood Volume Pulse (BVP), Electrocardiogram (ECG), Electrodermal Activity (EDA), Body Temperature (TEMP)) and inertial signals (Accelerometer (ACC)) captured from 15 subjects while they engaged in a range of activities in daily life. The authors used a wrist-worn Empatica E4 device and a chest-worn RespiBAN device to capture signals. For our signal representation learning stage, we use signals captured from the wrist-worn device.
The wearable stress and affect detection (WESAD) dataset [40] contains physiological signals (Blood Volume Pulse (BVP), Electrocardiogram (ECG), Electrodermal Activity (EDA), Body Temperature (TEMP)) and inertial signals (Accelerometer (ACC)) captured from 15 subjects during a controlled environment study. The dataset also contains signals, including ECG, captured using a RespiBAN device. The authors present the affective state in their dataset as a binary classification problem (stress vs non-stress) and a three-class classification problem (baseline vs stress vs amusement). We use the partition of data recorded using the wrist-worn device in this paper's representation learning and emotion recognition stages. Our emotion recognition task addresses their three-class classification problem using the wearable signal partition. Further, the authors of the dataset present data with a few other class labels not specified in the dataset description. We ignore those signals in the emotion evaluation; however, we use the physiological and inertial signals in our representation training stage, as labels are not required for SSL.

B. DATA PRE-PROCESSING
Datasets used in this work are captured with various devices with different sampling frequencies. To unify the signal frequency, we chunk the continuous signals into windows of four seconds with a one-second overlap. The window size is based on the findings from the literature [16]. Then, we reconstruct the signal within each chunk and resample it to the target signal frequency. To minimise the signal resampling, we chose the most common sampling frequency for each signal type, as shown in Table 1. When we chunk signals, it is very important to have a proper convention for assigning the correct class label to each chunk. We use the majority agreement protocol to select the class label. If more than one label shares the majority within a chunk, we discard that chunk from the emotion recognition tasks. Table 1 summarises the signal chunks we use for representation learning, while Table 2 summarises the class distribution for each emotion recognition task.
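The chunking and labelling convention can be sketched as follows. This is an illustrative reading of the procedure, not the original pre-processing code: the function names are hypothetical, labels are assumed to be available per sample, a tie between the two most frequent labels is taken to be the case that is discarded (returned here as None), and resampling is omitted.

```python
from collections import Counter

def majority_label(window_labels):
    # Return the most frequent label, or None when the top two labels tie.
    counts = Counter(window_labels).most_common(2)
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def chunk_signal(samples, labels, fs, window_s=4, overlap_s=1):
    # Yield (window, label) pairs for four-second windows that overlap by
    # one second, so consecutive windows start three seconds apart.
    size = int(window_s * fs)
    step = int((window_s - overlap_s) * fs)
    for start in range(0, len(samples) - size + 1, step):
        window = samples[start:start + size]
        yield window, majority_label(labels[start:start + size])
```

For representation learning every window can be kept, since labels are not needed; for the emotion recognition tasks, windows whose label is None would be dropped.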

C. MODEL TRAINING
We use two main stages for model training: (1) representation training and (2) emotion recognition. For each training stage, we use the training parameters listed in Table 3. We implement

1) TRAINING SIGNAL REPRESENTATIONS
We train individual representations for each signal: ACC, BVP, EDA, TEMP, in a self-supervised manner. First, we preprocess datasets for representation learning (see Table 1). Second, we mix and shuffle datasets before using them as training data. We use the SimCLR framework illustrated in Fig. 1 for representation training, wherein we use the proposed inception inspired encoder architecture (see Fig. 3) as the encoder component of the SimCLR framework. We use 512 epochs and a batch size of 256 for encoder training. At the end of the training, the weights of each trained encoder network are saved for further usage in emotion recognition tasks.
We also train another set of signal representations with a basic encoder architecture illustrated in Fig. 5. As illustrated, the basic encoder is built by naively stacking Conv1D layers and MaxPool layers. In contrast, the inception inspired network is wider, with multiple Conv1D layers in parallel at each network level. In order to make the basic encoder and the inception inspired network comparable, we make the basic encoder deeper than the inception inspired encoder so that both networks have a similar number of trainable parameters. We keep the training procedure identical to the procedure described in the previous paragraph. We refer to this encoder as the 'Basic Encoder' in the rest of the paper. The purpose of this encoder is to test the significance of the inception inspired encoder.
With this SimCLR-based approach, we expect the encoder network to achieve a comprehensive understanding of the raw input signal. For the contrastive loss to be minimised, the encoder should be able to create similar latent representations for positive pairs regardless of the random augmentation added to the signal. To achieve that, the encoder should either learn to decode the augmentation or learn how to extract information about the underlying signal. Since the augmentation added is random in each run, and each augmentation has randomness within the method of augmentation, it is unlikely that the encoder network learns to decode the applied augmentation. Therefore, the only way for the contrastive loss to be minimised is for the encoder to learn the qualities of the underlying signal. For the same reason, the trained encoder should be able to retrieve information from a lossy signal, improving the robustness of downstream tasks.

2) TRAINING FOR EMOTION RECOGNITION
We use the representations learned in the previous step for the downstream task of emotion recognition. We build a new neural network by stacking the outputs of the individual signal representation networks. On top of the stack of representation embeddings, we implement a smaller neural network for the emotion recognition task, as shown in Fig. 4. The emotion recognition network is created with two fully connected layers with ReLU activation and a softmax layer. We keep the trained parameters of the representations frozen in this phase of training. We train the emotion recognition network in a fully supervised manner for the tasks and datasets listed in Table 2. We evaluate our emotion recognition model with the leave-one-user-out method and report the average accuracy and F1 scores.
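The leave-one-user-out protocol can be sketched as below. The code only illustrates how folds are formed and averaged; the function names are hypothetical, and fit_and_score stands in for training the classification head on frozen representations and scoring it on the held-out user.

```python
def leave_one_user_out(user_ids):
    # Yield (held_out_user, train_idx, test_idx) for every distinct user.
    for user in sorted(set(user_ids)):
        test_idx = [i for i, u in enumerate(user_ids) if u == user]
        train_idx = [i for i, u in enumerate(user_ids) if u != user]
        yield user, train_idx, test_idx

def mean(values):
    return sum(values) / len(values)

def evaluate(user_ids, fit_and_score):
    # fit_and_score(train_idx, test_idx) -> (accuracy, f1) is a hypothetical
    # callback supplied by the caller.
    scores = [fit_and_score(tr, te) for _, tr, te in leave_one_user_out(user_ids)]
    return mean([a for a, _ in scores]), mean([f for _, f in scores])
```

Because every chunk from the held-out user is excluded from training, the reported averages reflect performance on users the classifier has not seen.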

3) BASELINE MODEL TRAINING
To assess the performance of our model, we benchmark it against a fully supervised model. In this paper, we refer to it as the 'baseline model'. The only difference between the proposed model and the baseline model is that the encoder component in the proposed model is trained in a self-supervised manner, whereas that in the baseline model is trained in a supervised manner. We use the same amount of training data as that used for the representation-based emotion recognition model. Also, we keep the training parameters similar to those tabulated in Table 3.

D. SELF-SUPERVISED BENCHMARK
Although self-supervised learning is heavily used in computer vision and natural language processing tasks, only a few explorations have been conducted with time series sensor data from wearable devices. In addition, the majority of the existing self-supervised learning methods for wearable sensor signals are focused on downstream tasks such as activity recognition. Despite the lack of comparable works, we benchmark our work against the Sense & Learn framework [41], given that it is the most recent state-of-the-art self-supervised representation learning work with wearable sensor signals. In the Sense & Learn framework, the authors propose a generic representation learning framework for heterogeneous sensor signals. Saeed et al. [41] evaluate eight self-supervised tasks to train signal representations, evaluate them on multiple downstream tasks (activity recognition, sleep stage detection, stress detection and WiFi-sensing), and provide insights on choosing representation learning techniques for different downstream tasks.
We replicated the Sense & Learn framework [41] with the parameters used for stress detection, as it is the closest task to emotion recognition. We train representations using all eight proxy tasks and use them in the downstream emotion recognition task. Initially, we use 30-second data chunks, following the stress detection task proposed in the Sense & Learn framework. Representations based on all eight proxy tasks result in poor performance in emotion recognition. Prior work [16] suggests that sampling with smaller window sizes results in better emotion recognition accuracy in the context of wearable signal-based emotion recognition. Therefore, we attempt to evaluate with a smaller window size. However, small sample windows are theoretically impossible with the encoder architecture suggested in the Sense & Learn framework by Saeed et al. [41]. Therefore, we use the encoder architecture proposed in the current work to train representations with the proxy tasks defined in the Sense & Learn framework [41].
The eight proxy tasks we adapted can be summarised as follows.

1) T1: BLEND DETECTION
The blend detection task is defined as a three-class classification. The classification task's data samples and labels are generated by blending two signal samples with a random weight. The original sample without blending is labelled as class A. If the two signal samples are selected from different modalities, the blended sample is labelled as class B. If the two signal samples are from the same modality, the blended sample is labelled as class C. The random weight for blending is selected from a uniform distribution in the range (0, 1). Finally, negative log-likelihood is used as the loss function to train the classification task on these three classes.

2) T2: FUSION MAGNITUDE PREDICTION
In this task, signals are blended using the same strategy as in the previous task (T1). In the learning phase, the objective of the network is to predict the random weight used for blending. For a clean sample, the weight is considered zero.
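Tasks T1 and T2 share the same sample generation step, which can be sketched as follows. This is a minimal illustration and not the Sense & Learn implementation: the convex blending rule, the uniform sampling of the three classes, the `windows` data layout and the function names are all assumptions made for the example.

```python
import random

def blend(a, b, weight):
    """Convex combination of two equal-length windows (assumed blending rule)."""
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]

def make_blend_sample(windows, rng):
    """Return (sample, class_label, weight) for one training example.

    `windows` maps a modality name to a list of equal-length windows
    (hypothetical layout, at least two modalities and two windows each).
    Labels: 0 = clean (class A), 1 = blended across modalities (class B),
    2 = blended within one modality (class C).
    """
    modalities = sorted(windows)
    modality = rng.choice(modalities)
    anchor = rng.choice(windows[modality])
    label = rng.randrange(3)  # classes drawn uniformly: an assumption
    if label == 0:
        return list(anchor), 0, 0.0  # clean sample: the T2 target weight is zero
    weight = rng.uniform(0.0, 1.0)
    if label == 1:
        other_modality = rng.choice([m for m in modalities if m != modality])
    else:
        other_modality = modality
    # Never blend a window with itself, which would reproduce the clean sample.
    candidates = [w for w in windows[other_modality] if w is not anchor]
    other = rng.choice(candidates)
    return blend(anchor, other, weight), label, weight

rng = random.Random(0)
windows = {"BVP": [[1.0] * 4, [2.0] * 4], "EDA": [[10.0] * 4, [20.0] * 4]}
sample, label, weight = make_blend_sample(windows, rng)
```

The class label is the target for T1, while the returned weight is the regression target for T2.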

3) T3: FEATURE PREDICTION FROM A MASKED WINDOW
In this task, a random segment is selected from an input sample. Eight statistical values (mean, standard deviation, maximum, minimum, median, kurtosis, skewness, number of peaks) are computed from the selected segment. The segment is then masked with zeros. Finally, a model is trained to predict the statistics of the masked segment from the masked sample.
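The target construction for this task can be sketched as below. The exact definitions are not given in the text, so the example assumes population standard deviation, excess (Fisher) kurtosis, and a peak defined as a sample strictly greater than both neighbours; the function names are hypothetical.

```python
import random
import statistics

def count_peaks(x):
    """Number of strict local maxima (assumed definition of a peak)."""
    return sum(1 for i in range(1, len(x) - 1) if x[i - 1] < x[i] > x[i + 1])

def segment_features(seg):
    """Eight statistics in the order listed in the text."""
    n = len(seg)
    mean = statistics.fmean(seg)
    std = statistics.pstdev(seg)  # population standard deviation (assumption)
    if std > 0:
        skew = sum(((v - mean) / std) ** 3 for v in seg) / n
        kurt = sum(((v - mean) / std) ** 4 for v in seg) / n - 3.0  # excess kurtosis
    else:
        skew = kurt = 0.0  # constant segment: higher moments set to zero
    return [mean, std, max(seg), min(seg), statistics.median(seg),
            kurt, skew, float(count_peaks(seg))]

def mask_and_describe(window, seg_len, rng):
    """Pick a random segment, compute its statistics, then zero it out."""
    start = rng.randrange(len(window) - seg_len + 1)
    target = segment_features(window[start:start + seg_len])
    masked = list(window)
    masked[start:start + seg_len] = [0.0] * seg_len
    return masked, target

masked, target = mask_and_describe([0.5, 1.0, 3.0, 2.0, 5.0, 4.0, 1.5, 0.2], 4, random.Random(0))
```

The model input is `masked` and the regression target is `target`.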

4) T4: TRANSFORMATION RECOGNITION
The transformation recognition task is based on previous work of Saeed et al. [32]. One transformation from eight pre-defined transformations (permutation, channel shuffle, time-warp, scale, noise, rotation, flip, negation) is applied to the input sample per instance. Each transformation is labelled with a class-index. Then the representation learning model is trained to classify the respective class of the transformation.

5) T5: TEMPORAL SHIFT PREDICTION
An input sample is circularly shifted with a random interval in the temporal domain. The random shifting interval is divided into seven classes based on the shifting period. Then the representation learning model is trained to predict the seven classes of shifts.
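A minimal sketch of how a shifted sample and its class label could be generated is given below. The text does not specify how shift intervals map to the seven classes, so equal-width bins over the window length are assumed here; the function names are hypothetical.

```python
import random

def circular_shift(window, shift):
    """Rotate the window to the right by `shift` samples."""
    n = len(window)
    shift %= n
    return window[n - shift:] + window[:n - shift]

def shift_class(shift, n, n_classes=7):
    """Map a shift in [0, n-1] to one of `n_classes` equal-width bins (assumption)."""
    return shift * n_classes // n

def make_shift_sample(window, rng, n_classes=7):
    """Return (shifted_window, class_label) for one training example."""
    shift = rng.randrange(len(window))
    return circular_shift(window, shift), shift_class(shift, len(window), n_classes)

shifted, label = make_shift_sample(list(range(70)), random.Random(0))
```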

6) T6: MODALITY DENOISING
This task resembles a denoising autoencoder. A clean input sample is blended with a random sample from a different signal modality to generate the noisy signal. The blending process uses a random weight selected from a uniform distribution. Then a model is trained to re-generate the clean sample given the blended sample.

7) T7: ODD SEGMENT RECOGNITION
In the odd segment recognition task, an input sample is split into four segments of similar length. One of the segments is replaced with a segment of similar length taken from a random sample of a different modality. Then the representation learning model is trained as a four-class classification problem to predict the id of the replaced segment.
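The sample construction for this task can be illustrated as follows. This is a sketch under assumptions: the window length is taken to be divisible by four, `foreign` stands for a window already drawn from a different modality, and the function name is hypothetical.

```python
import random

def odd_segment_sample(window, foreign, rng, n_segments=4):
    """Replace one of `n_segments` equal segments of `window` with a
    same-length slice of `foreign`; return (sample, replaced_segment_id)."""
    seg_len = len(window) // n_segments
    segment_id = rng.randrange(n_segments)
    start = segment_id * seg_len
    src = rng.randrange(len(foreign) - seg_len + 1)  # random slice of the foreign window
    sample = list(window)
    sample[start:start + seg_len] = foreign[src:src + seg_len]
    return sample, segment_id

sample, segment_id = odd_segment_sample([0.0] * 8, [9.0] * 8, random.Random(0))
```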

8) T8: METRIC LEARNING WITH TRIPLET LOSS
For this task, a triplet (anchor, positive, negative) of samples is used as the input. The original sample is chosen as the anchor, while the positive is generated by applying a transformation to the anchor. The negative is selected from a different signal modality. Finally, the representation learning model is trained with triplet loss to minimise the distance between the anchor and the positive while increasing the distance between the anchor and the negative.
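For reference, the commonly used hinge form of the triplet loss on embedding vectors is sketched below. The distance metric and margin used in the benchmark are not stated in the text, so Euclidean distance and a margin of 1.0 are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# The negative is already far from the anchor, so the loss vanishes.
loss_easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# The negative coincides with the anchor, so the loss is large.
loss_hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```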

V. EVALUATIONS AND RESULTS
We evaluate our proposed emotion recognition model with four public datasets (CASE, CLAS, K-EmoCon, WESAD). As shown in Table 2, each dataset has different emotion and affective state labels. We evaluate them using the Leave One Subject Out (LOSO) method. We report the average categorical prediction accuracy and the average macro F1 scores. We could not find any existing benchmark for emotion recognition tasks in the K-EmoCon dataset.

A. EXPERIMENT 1: EVALUATION OF EMOTION RECOGNITION MODELS
1) EXPERIMENT
In this experiment, we evaluate the performance of SigRep emotion recognition models. We train the emotion recognition models for each classification task from each dataset. As tabulated in Table 2, we have 12 classification tasks from four different datasets. Because the class labels are heavily imbalanced in most tasks, the accuracy metric alone does not reflect model performance. Therefore we report the macro F1 score along with the prediction accuracy metric.
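The following toy example illustrates why accuracy alone can be misleading under class imbalance. It is not taken from the paper's results: the labels are made up, and the helper functions are a plain re-implementation of the standard accuracy and macro F1 definitions.

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced task: nine samples of class 0, one of class 1,
# and a classifier that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(accuracy(y_true, y_pred))            # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.474
```

A majority-class predictor reaches 90% accuracy here, while the macro F1 score exposes that the minority class is never recognised.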
As discussed previously, the current literature has very little work on using self-supervised techniques for wearable signal-based emotion recognition. The majority of existing supervised work is based on classic machine learning approaches. Therefore, benchmarking only against work published in the literature may not reflect the advantages of using SigRep in emotion recognition tasks. On the other hand, benchmarking only against self-supervised learning methods may not properly position SigRep within the existing literature. Therefore, we benchmark the performance of SigRep in two different scenarios: i) against the current state-of-the-art for each emotion recognition task from the literature, and ii) against other self-supervised learning methods.

2) RESULTS
For the CASE dataset, CorrNet [16] provides the state-of-the-art emotion recognition performance. The CASE dataset consists of arousal and valence levels in nine intensities. Due to the heavy class imbalance, CorrNet [16] uses only two-class and three-class configurations for evaluation. Following that, we also evaluate our method using only two-class and three-class configurations. Although the two-class results are on par with each other, the results of the three-class problem demonstrate the superior performance of the proposed method over CorrNet.
We used prediction results reported by Markova et al. [37] as the benchmark for the CLAS dataset. The CLAS dataset contains emotion data elicited in two ways, using (1) image and (2) video stimuli. The results are presented in Table 5.
For the WESAD dataset, most of the works in the literature focus on using ECG and EMG signals. The best performance for three-class affective state classification using ACC, BVP, EDA and TEMP signals is achieved by Schmidt et al. [40]. As we focus on commodity sensors (such as sensors built into a smartwatch), we only use the wrist-based signals and compare our performance with the results for wrist-based signals reported by Schmidt et al. [40]. Here too, our method shows superior performance, as shown in Table 5.
Except for CorrNet, which is based on a representation learning approach, other state-of-the-art results are based on classic machine learning approaches. In order to compare SigRep with other self-supervised learning-based methods, as mentioned before, we have re-implemented the self-supervised methods proposed in the Sense & Learn framework [41]. We benchmark the emotion recognition performance of SigRep against all eight proxy tasks proposed in the Sense & Learn framework [41]. Table 5 presents classification accuracy, and Table 6 presents the F1-Score for all 12 classification tasks.
As the results reflect, SigRep demonstrates the top accuracy for 7 out of 12 emotion recognition tasks and the second-best accuracy for 3 of the remaining 5 tasks. In the case of F1-Scores, SigRep achieves one of the top two F1 scores in 10 out of 12 tasks. Overall, SigRep demonstrates better emotion recognition performance than the benchmarked methods.

3) DISCUSSION
Diving deeper into the emotion recognition performance, we observe that out of the eight proxy tasks in the Sense & Learn framework [41], Tasks 3, 4 and 8 have achieved one of the top two accuracies and F1 scores frequently. To explain this observation, we analyse those proxy tasks and the proxy task proposed in SigRep.
The proxy task proposed in SigRep contrasts samples after adding a random augmentation to the signal components. Proxy task 8 in the Sense & Learn framework [41] contrasts samples from different modalities. Both proxy tasks share the element of learning to contrast distinct elements and identify similar elements at a higher level. SigRep applies random data augmentations before learning to contrast them. Those augmentations are amplitude re-scaling, random DC shift, additive noise and random zero masking.
[...] [41] to benchmark SigRep. The best result for each task is presented in bold text, while the second-best result is presented in italic.
The first three augmentations are similar to the transformations applied in task 4; zero masking is similar to task 3. At a higher level, the proxy task in SigRep contains the essence of proxy tasks 3, 4 and 8 of the Sense & Learn framework [41]. Based on that, we suggest that the combined effect of the proxy task in SigRep has resulted in better emotion recognition performance. Further, the findings of this experiment support the argument that the pre-training proxy task has an effect on the downstream prediction task. Also, we recommend using a combined proxy task consisting of signal transformations, zero masking and a contrastive learning approach for wearable signal-based emotion recognition.
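The four augmentations named in this discussion can be sketched as simple functions on a signal window. This is an illustration only: the parameter ranges (`low`, `high`, `max_shift`, `sigma`, `max_fraction`) and the function names are assumptions and are not taken from the SigRep implementation.

```python
import random

def rescale(x, rng, low=0.5, high=1.5):
    """Amplitude re-scaling by a random factor (assumed range)."""
    s = rng.uniform(low, high)
    return [s * v for v in x]

def dc_shift(x, rng, max_shift=0.5):
    """Add a random constant offset to the whole window (assumed range)."""
    d = rng.uniform(-max_shift, max_shift)
    return [v + d for v in x]

def add_noise(x, rng, sigma=0.05):
    """Additive Gaussian noise (assumed noise model and scale)."""
    return [v + rng.gauss(0.0, sigma) for v in x]

def zero_mask(x, rng, max_fraction=0.3):
    """Set one random contiguous segment to zero (assumed masking scheme)."""
    n = len(x)
    width = rng.randint(1, max(1, int(max_fraction * n)))
    start = rng.randrange(n - width + 1)
    out = list(x)
    out[start:start + width] = [0.0] * width
    return out

AUGMENTATIONS = (rescale, dc_shift, add_noise, zero_mask)

def random_augment(x, rng):
    """Apply one randomly chosen augmentation to the window."""
    return rng.choice(AUGMENTATIONS)(x, rng)

augmented = random_augment([0.1 * i for i in range(20)], random.Random(0))
```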

B. EXPERIMENT 2: EVALUATION OF ROBUSTNESS
1) EXPERIMENT
In real-life usage, signals captured from consumer-grade wearable devices can be lossy due to various reasons such as user movements, software errors, and malfunctioning sensors. These signal losses have been identified as a technical limitation by researchers using wearable devices in the wild [5]. To evaluate the robustness of our method to signal losses, we randomly drop data frames from every evaluation record. An evaluation record is a set of data frames from each signal modality and the target emotion label. To identify the threshold of noise robustness, we define a variable p, which corresponds to the probability of dropping a data frame. We gradually increase the value of p from 0 to 0.9 with a step of 0.1 for each evaluation round. We simulate the signal loss by replacing the corresponding data frame with a vector of zeros. We demonstrate our strategy to drop data frames in Algorithm 1. We ensure that at least one data frame has non-zero values. To avoid bias, we randomise the selection of signal frame dropping for each evaluation record. To compare the performance of robustness, we benchmark our proposed method against the baseline model, which is a fully supervised model (please see description in section IV-C3). Further, we evaluate the emotion recognition models based on eight proxy tasks presented in the Sense & Learn framework [41]. In this evaluation, we consider a scenario where there is a 50% chance of losing a signal frame.
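The frame-dropping procedure described above can be sketched as follows. This is not a reproduction of Algorithm 1: the record layout (one data frame per modality), the function name, and the way a surviving frame is guaranteed (sparing one randomly chosen frame when all would be dropped) are assumptions.

```python
import random

def drop_frames(record, p, rng):
    """Zero out each modality's data frame with probability `p`.

    `record` maps a modality name to its data frame (a list of floats).
    At least one frame is always left untouched.
    """
    names = list(record)
    dropped = [m for m in names if rng.random() < p]
    if len(dropped) == len(names):
        # Every frame was selected: spare one at random (assumed strategy).
        dropped.remove(rng.choice(dropped))
    return {m: [0.0] * len(record[m]) if m in dropped else list(record[m])
            for m in names}

rng = random.Random(1)
record = {"ACC": [1.0, 2.0], "BVP": [3.0], "EDA": [4.0], "TEMP": [5.0]}
# Sweep p from 0 to 0.9 in steps of 0.1, as in the evaluation protocol.
lossy = {k / 10: drop_frames(record, k / 10, rng) for k in range(10)}
```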

2) RESULTS
We report the observed accuracy for each classification task in Fig. 6. Interestingly, the SigRep model achieves higher accuracy than the baseline models for almost every p value. To quantify the robustness, we conduct a post-hoc test using the Tukey Honest Significant Difference (HSD) test on each scenario to determine the p value at which the loss of accuracy becomes significant. We identify the p values at which the accuracy starts to drop significantly in each task for both the SigRep and baseline models. We then average those p values for each setting and identify that when the average p value is greater than 0.27, the accuracy drop in the baseline setting becomes significant. In contrast, models in the SigRep setting demonstrate a significant drop in accuracy when the average p is greater than 0.55. This result indicates that our proposed method is significantly more robust than a model with a similar architecture trained in an end-to-end manner. Table 7 and Table 8 show the accuracy and F1-Score of SigRep and the Sense & Learn framework [41] at a 50% signal loss probability. Overall, the results suggest that SigRep achieves better accuracy and F1-Scores for all 12 emotion classification tasks. Further, for nine out of 12 tasks, Sense & Learn proxy tasks 3, 4 and 8 achieve the second-best results based on prediction accuracy, which is consistent with the results of the previous experiment.

TABLE 7. Emotion Recognition Model Performance with Lossy Signal (50%): Accuracy. The best accuracy for each classification task is highlighted in bold text while the second best accuracy is marked in italic format. S&L: T# refers to each proxy task proposed in the Sense & Learn framework [41].

TABLE 8. Emotion Recognition Model Performance with Lossy Signal (50%): F1-Score. The best F1-Score for each classification task is highlighted in bold text while the second best F1-Score is marked in italic format. S&L: T# refers to each proxy task proposed in the Sense & Learn framework [41].

3) DISCUSSION
Prior work indicates that representation learning can achieve a better understanding of the underlying data [14]. Also, as we discussed in Section IV-C, contrastive learning inherently offers robustness to noise and losses. Due to these aspects of our model, we conjecture that we achieve higher robustness than the baseline model.
The cost of data annotation is a major issue in physiological signal-based emotion recognition. Our proposed method addresses this challenge by adapting to the downstream task with less labelled data, leveraging the learned representations. To quantify the performance, we experiment by reducing the amount of labelled data used in the downstream task. Since we use leave-one-subject-out evaluation, we control the training data as a fraction of the subjects available for training in this experiment. Specifically, for each evaluation round, we leave out the evaluation subject and then drop 50% of the subjects from the remaining set for training. We keep the training parameters similar to Experiment 1. Similar to the previous experiment, we compare the performance of our model with that of the baseline model.
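The subject-level data reduction can be sketched as a generator of leave-one-subject-out splits. The subject identifiers, the function name and the rounding of the retained subject count (rounded down, with at least one subject kept) are assumptions for illustration.

```python
import random

def reduced_loso_splits(subjects, train_fraction, rng):
    """Yield (held_out_subject, training_subjects) pairs.

    For each round, one subject is held out for evaluation and only a
    random `train_fraction` of the remaining subjects is kept for training.
    """
    for held_out in subjects:
        pool = [s for s in subjects if s != held_out]
        k = max(1, int(train_fraction * len(pool)))  # rounding down: an assumption
        yield held_out, sorted(rng.sample(pool, k))

subjects = [f"S{i:02d}" for i in range(1, 11)]  # hypothetical subject ids
splits = list(reduced_loso_splits(subjects, 0.5, random.Random(0)))
```

With ten subjects and `train_fraction=0.5`, each round trains on four of the nine remaining subjects and evaluates on the held-out one.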

2) RESULTS
Classification accuracy for each task in each scenario is plotted in Fig. 7. As anticipated, with limited training data, classification accuracy drops significantly for all classification tasks. On average, for the baseline, a 50% reduction in training data causes the accuracy to drop by around 20% (calculated by comparing with the baseline trained on 100% of the training data); however, for the proposed method with learned representations, the average accuracy drop is around 10%. A t-test shows that the difference between the baseline and the SigRep method is significant in nine out of twelve classification tasks (p < 0.05). For the remaining three tasks, (g), (k) and (l), although the SigRep method demonstrates a higher accuracy, we do not find statistical significance.

D. EXPERIMENT 4: SIGNIFICANCE OF INCEPTION INSPIRED ENCODER
1) EXPERIMENT
Our proposed encoder architecture is built with Conv1D layers inspired by the inception architecture. To test the effect of the proposed architecture, we compare it with a simple stacked convolutions architecture built with Conv1D layers with a similar number of trainable parameters (see Fig. 5). We denote it as the "basic encoder". We train the basic encoder with the proposed SSL method with the same datasets and training configurations as the proposed inception inspired encoder. Then we train emotion classifiers for all 12 tasks using the representations learned with the basic encoder and evaluate emotion classification performance. We keep the evaluation conditions identical to our Experiment 1 (see Section V-A).

2) RESULTS
The results of this experiment are plotted in Fig. 8. For all classification tasks, the average accuracy of the proposed inception inspired encoder is higher than that of the basic encoder.
[...] (see Fig. 3) compared to a basic Conv1D architecture (see Fig. 5) for all 12 classification tasks. 95% confidence intervals are marked on each column. Overall, the inception inspired encoder shows higher average accuracy. In seven tasks, the accuracy gain is statistically significant.
We conduct t-tests on the results of each classification task for an in-depth analysis. We observe that for all twelve classification tasks, the inception inspired encoder performs better than the basic encoder, and for seven tasks, the inception inspired encoder significantly (p < 0.05) outperforms the basic encoder.

E. EXPERIMENT 5: PERFORMANCE AND ROBUSTNESS COMPARISONS OF INDIVIDUAL MODALITIES
1) EXPERIMENT
Our proposed model makes use of four types of signals (ACC, BVP, EDA, TEMP). Each type of signal carries independent and correlated pieces of information. In this experiment, we investigate the performance of individual modalities as well as their robustness to data losses. Some sensors are more reliable than others, so this experiment can potentially assist researchers in selecting sensors for their applications. We train emotion classification models for all twelve classification tasks using only a single signal modality in each run. All the evaluation rounds use the leave-one-subject-out evaluation method and similar training parameters as Experiment 1 (see Section V-A). We evaluate models in two settings: (1) without data losses and (2) with 50% data loss. The evaluation process with data loss is similar to our Experiment 2 (see Section V-B).

2) RESULTS
Fig. 9 shows the average accuracy of each signal modality as well as the combined modalities. Fig. 9(a) shows the results for the setting without data loss, while Fig. 9(b) shows the lossy signal scenario. As one would expect, combined modalities offer better performance than individual modalities, which is what we observe in Fig. 9. When comparing individual modalities, the BVP signal and ACC signal show higher accuracy than the EDA and TEMP signals. This observation can be explained based on the findings reported in the literature: when someone experiences an emotion, the bodily reaction is reflected faster in the heartbeat than in body temperature variations and skin conductance changes [42]. Also, the literature [43], [44] suggests a higher correlation between heart pulse and wrist accelerometer readings, providing better accuracy for the accelerometer.
Interestingly, when there is no data loss, the prediction accuracy of the combined model is not significantly (p > 0.05) higher than the prediction accuracy of any individual signal. However, when the signals are lossy, the combined models demonstrate significantly higher prediction accuracy than individual modalities in seven out of 12 tasks. This result attests that combined modalities can offer higher robustness compared to individual modalities.

VI. CONCLUSION AND FUTURE WORK
This paper presents a novel contrastive representation learning approach for emotion recognition using wearable signals. We achieve the following key results:
• We surpass the state-of-the-art methods in emotion classification performance on three widely used datasets (CASE, CLAS and WESAD) and create benchmark performance for the K-EmoCon dataset.
• We benchmark SigRep against state-of-the-art self-supervised methods for signal representation learning and show that SigRep outperforms them.
• We demonstrate that our self-supervised model using augmented data achieves significantly higher robustness to data losses than a fully supervised baseline. We also observe that combined modalities do not achieve significantly higher accuracy than individual modalities without data loss, but with data loss, combined modalities provide significantly better performance than individual modalities.
• We demonstrate that we can reduce the requirement of labelled data for downstream emotion classification tasks by learning representations.
In future work, we aim (1) to explore the effect of different fusion techniques on downstream task performance, and (2) to investigate the feasibility of using different self-supervised learning methods for on-device learning. Understanding the effect of fusion would help build better wearable signal representation based systems optimal for downstream tasks. On-device learning could improve representation based models on the go and personalise models after the deployment.
VIPULA DISSANAYAKE received the B.Sc. degree (engineering) in computer science and engineering from the University of Moratuwa, Sri Lanka, and the Master of Engineering degree from The University of Auckland, in 2019, where he is currently pursuing the Ph.D. degree with the Augmented Human Laboratory, Auckland Bioengineering Institute. His research interests include ubiquitous computing, machine learning, and human-computer interactions.
SACHITH SENEVIRATNE received the B.Sc. degree in computer science and engineering from the University of Moratuwa, Sri Lanka, and the Ph.D. degree in machine learning from Monash University, Australia. Currently, he is working as a Research Fellow at The University of Melbourne. His current research interests include deep learning, with a focus on contrastive representation learning and applications. He is broadly interested in self-supervised deep learning approaches across various disciplines, such as computer vision, NLP, and reinforcement learning.