PhyMER: Physiological Dataset for Multimodal Emotion Recognition With Personality as a Context

Physiological signals are widely used in the recognition of affective status. Recording of such physiological signals involves elicitation of emotions through different stimuli including video-based stimulus. Considering that the same stimulus videos often induce different emotions in different individuals, emotion recognition in such a scenario requires consideration of the individual differences in the consumption of the stimulus content. With this as our goal, we present a Physiological dataset for Multimodal Emotion Recognition (PhyMER) for studying emotion through physiological response with personality as a context. The PhyMER dataset consists of electroencephalogram (EEG), electrodermal activity (EDA), blood volume pulse (BVP), and skin temperature along with the personality traits of 30 participants. We collected the video-based stimulus dataset for emotion elicitation and developed a web-based annotation tool for labeling felt emotions. We compared the stimulus labels and the self-annotation of felt emotions labeled during physiological data recording. Correlation among personalities was analyzed to study the impact of personality on the intensity of emotions in arousal and valence dimensions. Finally, we proposed a baseline model for the classification of emotions using physiological signals. The dataset is publicly available to the academic community for analysis of affective states and the development of emotion recognition models.


I. INTRODUCTION
Emotions are behavioral phenomena that occur in response to an event or stimulus and are expressed through various behavioral and physiological changes. Emotions have been studied as discrete categories, as continuous values in various dimensions such as arousal, valence, and dominance, and in terms of changes in a set of components based on different subjective qualities. Ekman [1] introduced seven basic emotions, which are widely accepted as emotions independent of race, culture, or geography [2], [3]. Similarly, Parrott's [4] tree structure of emotions is a popular example of a discrete representation in which emotions are hierarchically organized as primary, secondary, and tertiary emotions. Plutchik [5] organized the discrete emotions in a wheel, known as Plutchik's wheel of emotions, with 8 basic emotions towards the center and fine-grained sub-categories towards the edge of the wheel. On the other hand, the dimensional view of emotions refers to the organization of emotions as continuous values across dimensions such as arousal, valence, and dominance, where the emotional states are interrelated systematically. For instance, Russell's Circumplex Model of Affect [6] places emotions in the dimensions of arousal and valence. Arousal represents the degree of an individual's excitement, while valence indicates the level of pleasant or unpleasant feelings. The component process model [7] is based on the coordinated changes in the individual in terms of components such as appraisal, motivation, physiology, and expression. Several studies such as [8], [9], and [10] demonstrated the multi-componential emotions in different scenarios. Mohammadi and Vuilleumier [10] used a component model to show the role of personality in the recognition of discrete emotions. The theory of constructed emotion [11] states that emotions are invisibly constructed by the brain based on the situation. The body-budgeting regions in the brain predict the experienced world by tweaking the neurons based on past experiences, and such predictions are sent to the rest of the body, controlling physiological processes such as heartbeat and respiratory rate [12].
The associate editor coordinating the review of this manuscript and approving it for publication was Alicia Fornés.
Emotions are felt and expressed in a significantly different manner among individuals [13]. Similarly, electroencephalogram (EEG) signals, one of the physiological signals used in emotion recognition, vary across individuals [14], [15], [16]. Therefore, physiological datasets with precise and fine-grained labels of such emotions that take individual differences into account are essential for the accurate analysis of human emotions.
One way to take individual differences into account is through personality. However, only a few studies have considered personality during emotion recognition. Emotions generate both behavioral and physiological changes, which include variations in facial expression and posture, or alterations in physiological activities such as heartbeat, neural activations, perspiration, and body temperature. Unfortunately, among the existing physiological emotion datasets, only a few consider individual personality as a crucial factor in identifying emotions [17], [18], [19]. Most of the existing datasets either use only one of the categorical or dimensional views of emotion or have coarse annotations for the different levels of the emotion dimensions. Annotations from both categorical and dimensional perspectives are important for emotion recognition research. Moreover, to study individuals' emotions more precisely, a multimodal dataset labeled with fine-grained emotions is required.
In this paper, we present a Physiological dataset for Multimodal Emotion Recognition (PhyMER) that encompasses a wide range of physiological signals recorded while participants viewed videos as emotional stimuli. Physiological signals are not only obtained from visual or physical evaluation but can also be obtained through off-the-shelf, non-invasive, consumer-grade devices, which offer a convenient and cost-effective means of acquiring physiological data. Although basic emotions are considered universal, the observation and consumption of the stimulus video content may differ among individuals. Therefore, we include personality traits in the dataset to provide individual context for emotion recognition.
The PhyMER dataset consists of physiological signals obtained from 30 Korean participants (15 male and 15 female) using two different wearable devices. To avoid bias due to differences in stimulus comprehension, participants from a similar age group were selected. The participants were university students aged between 20 and 30 years. The dataset was collected with video-based emotion stimuli, where the participants watched 23 stimulus videos of varying lengths (1 to 3 minutes). The modalities collected include EEG, blood volume pulse (BVP), electrodermal activity (EDA), and skin temperature (TEMP). A custom annotation tool was developed for self-assessment of the felt emotions and for recording the experiment times and emotion annotations to synchronize the signals collected from different devices.
To ensure the quality of the dataset, two annotation experiments were conducted. First, 28 evaluators labeled the stimulus videos to verify whether the stimulus videos collected by the experimenters could induce the expected emotions in the participants. Second, the participants of the physiological data collection experiment labeled their felt emotions while they watched the stimulus videos. The physiological signals collected using commercial equipment, as well as the personality traits, are publicly available for academic research. The main contributions of this paper are as follows.
• We present a physiological signal dataset with multiple physiological signals collected from 30 participants. We recorded EEG, EDA, BVP, and temperature information with annotations in both categorical and dimensional views, along with the individual personality traits of the participants, for the study of emotions in the presence of individual personality differences.
• We present an analysis of emotion elicitation following a video-based stimulus. We conducted experiments to analyze both stimulus and physiological signal annotations using two similar experiments involving participants of similar age groups. The video stimulus data based on Korean movies were collected and evaluated using inter-rater agreement analysis. Moreover, we analyzed the correlation between the felt emotions and personality traits to see how different personality traits affect emotion elicitation.
• We present an emotion recognition framework as a baseline method for the classification of seven basic emotions and the prediction of arousal and valence values. In this case, we performed both subject-dependent and subject-independent experiments for classification and prediction.
The rest of the paper is organized as follows. In Section II, we discuss the existing studies on multimodal emotion recognition, emotion recognition datasets, and personality as a context. Section III describes the overall dataset-building process, including the criteria for participant selection, the experiment scenario, the stimulus video selection process, and the devices used for data collection. Section IV provides an overview of the dataset and its statistical analysis. In Section V, we explain the classification and regression experiments for basic emotions and dimensional emotion values, respectively. In Section VI, we discuss the contribution and potential applications of the dataset, its limitations, and future research directions. Finally, we conclude the paper in Section VII.

II. RELATED WORK
Several research studies on emotion recognition using biosignal data have been conducted over the past few years. In this section, we summarize related studies involving physiological datasets and personality as a context.
Publicly available datasets have enabled rapid advancement in emotion recognition research. Over the past few years, several datasets based on a variety of modalities have been published. In this section, we review emotion recognition datasets involving the use of physiological signals, which are relevant to this study.
Similarly, the DREAMER [23] dataset consists of EEG and ECG signals collected from 23 participants stimulated by 18 stimulus videos. It is labeled for the arousal, valence, and dominance dimensions on a 5-point scale. The DECAF database [24] includes multiple modalities, including ECG, EOG, EMG, MEG, and near-infrared (NIR) video. It includes self-reported scores for valence, arousal, and dominance. The Multimodal Spontaneous Emotion Corpus for Human Behavior Analysis (BP4D+) [25] consists of multiple modalities, including 2D and 3D videos, thermal scans, respiration, blood pressure, GSR, and heart rate. It was collected from 140 participants who self-reported felt emotions from among 10 discrete emotions elicited using various tasks such as interviews, watching videos, pain induction using ice, and exposure to an unpleasant odor. K-EmoCon [26] includes audio, video, and physiological signals, including EEG, ECG, EDA, BVP, and TEMP, recorded during paired debates on a political topic. Emotions were labeled with 20 discrete emotions and the arousal and valence dimensions on a 5-point scale.
Different studies have interpreted context in diverse manners, encompassing aspects such as multimodality, inter-agent relationships within the scene, socio-cultural dynamics, and personality [19], [27]. Emotion expression differs among individuals as it is affected by several factors, including personality [28]. Personality refers to human characteristics that explain or predict individuals' behavior [29]. The relationship of personality with various emotional states has been studied in the past, for example, through the use of a personality model with a textual modality for emotion reasoning [30], [31]. Considering the potential variations in psychophysiological changes among individuals, incorporating personality into the analysis of physiological signals for emotion recognition can offer supplementary contextual information. Personality traits have been found to have an impact on perception, causing different reactions to emotional perception [10]. It is essential to examine the connection between personality and emotions to understand how personality traits influence emotional experiences. The widely used Big-5 personality trait model offers a valuable approach for identifying and characterizing human personality, comprising five key qualities: extraversion, neuroticism, conscientiousness, agreeableness, and openness. Commonly used personality assessment methods include the Neuroticism, Extraversion, and Openness Five-Factor Inventory (NEO-FFI) [32], the Goldberg Adjectives Scale [33], and the Newcastle Personality Assessment (NPA) [34].
The AMIGOS [18] dataset uses personality and mood information for emotion recognition of individuals and groups. It was collected from 40 participants while they watched 16 emotional video clips from several movies. It consists of 7 discrete emotions, 5-point annotations of arousal, valence, and dominance, and binary labels for liking and familiarity. The results showed weak linear correlations between emotions and personality. Personality has been used for behavior analysis in the Mission Survival II corpus [35], which consists of video and audio data labeled with task area functional roles and socio-emotional functional roles. The data was collected during meetings of 4 participants. The personality information was obtained using the Ten Item Personality Inventory [36]. The analysis showed a correlation between extraversion and audio features such as pitch and energy, indicating the need for further research on emotion recognition in the presence of personality traits. Similarly, ASCERTAIN [17] includes multiple physiological signals such as EEG, ECG, and GSR, as well as facial videos. It was collected from 58 participants while they watched short movie clips. Personality information was recorded through a Big-5 marker scale personality questionnaire. The study showed a weak correlation between emotions and personality. The MEmoR [19] dataset includes personality information of TV characters to reason about emotions. It is a multimodal dataset with video, audio, and text modalities focusing on emotion reasoning based on contextual information. This dataset consists of categorical labels (8 primary emotions and 24 fine-grained emotions) labeled by multiple annotators.
While several datasets use physiological signals and personality as a context, the existing datasets either deal with categorical emotions or provide coarse annotations in dimensional space. Therefore, in the present study, we present a physiological dataset with both discrete emotions and 9-point ratings on the arousal and valence dimensions for increased measurement precision and sensitivity to changes in physiological signals. A 9-point scale not only measures more precise levels of emotions but also allows for finer distinctions between small changes in physiological signals. Moreover, in this study, we focus on emotion recognition of Koreans as induced by video content in the native language of the participants, which supports better elicitation of emotions. We collected the physiological dataset after the stimulus dataset had been evaluated by evaluators from the same culture and age group. The characteristics of the existing databases related to this paper are summarized in Table 1.

III. DATASET BUILDING
The dataset construction was approved by the Chonnam National University Institutional Review Board (IRB); a dataset construction protocol and a consent form containing information on the data collection procedure, the purpose of data collection, and the type of data to be collected were approved by the IRB. The participants were briefed on the overall experiment both in verbal and written form before signing the consent documents. The participants provided written consent for the disclosure of the physiological signals as a public dataset. However, the data do not include any Personally Identifiable Information (PII) such as audio-visual information. For statistical purposes, only the age and gender of the participants are published along with the anonymized dataset. The overall experiment was conducted in three steps, as shown in Fig. 1 and discussed in the following sections: selection of the stimulus videos by two experimenters, annotation of the stimulus videos by 28 evaluators, and collection of physiological data from 30 participants.

A. STIMULUS DATA
As the reliability of the dataset depends primarily on the elicitation of emotions, the selection of the stimulus is a crucial step in dataset building. In this work, we prepared a stimulus video dataset for emotion stimulation and evaluated it using multi-rater annotation. We chose movie clips as the emotional stimuli as they are highly effective in evoking emotions [17], [24], [37]. As the experiment was conducted on Korean subjects, we decided to use Korean videos as stimuli for emotion elicitation to avoid any issues in perceiving the content due to linguistic and cultural differences. Initially, we acquired the video clips from the Korean Video Dataset for Emotion Recognition in the Wild (KVDERW) [38] based on 7 basic emotions (Happy, Sad, Angry, Surprise, Fear, Neutral, and Disgust). The KVDERW dataset is designed for emotion recognition using facial expressions in the scene, and the length of its videos is less than 10 seconds. Despite their brevity, we chose the KVDERW dataset for two main reasons. First, the videos were in the Korean language, which is the native language of our study participants. Second, the dataset allowed us to extend the video clips to our desired duration since they were sourced from movies. As the KVDERW dataset was constructed using Korean movie clips, we searched the web for extended versions of those clips and collected 25 video clips based on the availability of clearly distinguishable target segments with a single emotion. To complement the stimulus video set, five additional video clips were obtained from YouTube based on emotion-related tags on the video clips. The emotion-related keywords were prepared based on the basic emotion labels and were searched on YouTube. Five movie clips tagged with such emotion-related keywords were selected based on the observation of two researchers. We trimmed the clips to approximately 2 minutes each to ensure they contained a single emotion. The trim points were determined by identifying the start and end of the target scene, and each trimmed clip was initially confirmed to contain a single emotion by two experimenters through observation. Consequently, the clip durations varied, ranging from 61 to 122 seconds.
Two researchers selected the initial video set of 30 clips through visual observation, and the clips were further validated through multi-rater validation. To ensure the quality and appropriateness of the clips, we sought validation from 28 evaluators, comprising 15 males and 13 females, with ages ranging between 20 and 26 years and a mean age of 23.18 years. To maintain consistency in emotion perception, we specifically recruited evaluators for the stimulus video evaluation from the same age group (20-30 years) as the participants involved in recording physiological data. This approach aimed to ensure that emotions were perceived in a similar manner across both groups. Out of the 30 videos, five videos were randomly selected and held out as test videos to demonstrate the process and familiarize the evaluators with the annotation task. For the evaluation of the videos, the 28 evaluators watched the videos in a group. The stimulus videos were displayed on a 60-inch screen in an auditorium where the evaluators were seated with enough spacing to prevent any interaction among them. Each of the 30 videos, including the 5 test videos, was annotated by the evaluators for 7 basic emotions and for the arousal and valence dimensions using a 9-point continuous scale. The SAM interface was used to annotate the dimensional emotions on a scale of 1-9. The evaluators were asked to label the 5 test videos at the start of the annotation process to familiarize themselves with the annotation interface. During the annotation of these test videos, the evaluators interacted with the experimenter and asked questions, and the videos were frequently paused. This interaction helped the evaluators understand the annotation process and ensured that the annotation of the remaining 25 videos was smooth and uninterrupted. As our goal is to collect physiological signals labeled with the emotions felt by the participants, we instructed the evaluators to annotate the emotions they felt themselves rather than the emotions exhibited by the actors in the clips. Fig. 2 shows screenshots from randomly selected clips from each category to illustrate the type of content present in the stimulus videos.
Finally, the labels provided by the evaluators on the 25 videos were checked for agreement by calculating the percentage of evaluators who agreed on a single emotion. Based on this agreement, as well as the agreement with the original labels in the KVDERW dataset, 23 out of the 25 clips were selected as emotion stimulus videos. Table 2 shows the average arousal and valence labels for the stimulus videos grouped by the corresponding basic emotions. The percentage of agreement was examined to verify how well the videos were labeled with the expected emotions. For videos with highly contrasting views among the evaluators, the video labels were compared with the original labels in the KVDERW dataset, as selecting videos based only on the percentage of agreement would lead to the inclusion of videos that are likely to contain multiple emotions. It is possible that the same video clip evokes slightly different emotions in different people due to the influence of their personality traits. As shown in Table 2, the videos marked with an asterisk (*) were not included in the stimulus dataset for emotion elicitation due to a lack of dominant consensus among the evaluators. The clips needed to have agreement on non-contrasting emotions. For instance, although VID06 had 50% agreement for sad, 39.3% of the evaluators voted for anger, which was the expected emotion. Similarly, VID22 (Introduction) was selected by the experimenters as a stimulus for 'happy'; it was labeled as happy by 46.43% of the evaluators, while 53.57% labeled it as neutral. In such cases, the stimulus videos were excluded because they could elicit contrasting emotions that are not close in arousal-valence space. In contrast, videos with emotions closely situated in arousal-valence space, for example, anger and fear, were not excluded. For example, VID21 was not excluded despite having a low agreement percentage of 32.1% for anger because the second most frequently annotated emotion was fear with 28.6%. Videos such as VID12 and VID23, despite having conflicting emotion labels, were included in the study based on majority voting with an agreement of 42.86%.
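The per-video agreement percentage used in this selection step can be illustrated with a minimal sketch. The function name `label_agreement` is hypothetical, and the vote list is constructed only to reproduce the 50% and 39.3% shares quoted for VID06; the three remaining votes are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

def label_agreement(labels):
    """Return (label, percent) pairs sorted by descending share of votes."""
    counts = Counter(labels)
    total = len(labels)
    return [(label, 100.0 * n / total) for label, n in counts.most_common()]

# Illustrative votes from 28 evaluators for one clip (assumed, not real data).
votes = ["sad"] * 14 + ["angry"] * 11 + ["neutral"] * 3
ranking = label_agreement(votes)
top_label, top_share = ranking[0]         # most frequent label and its share
second_label, second_share = ranking[1]   # runner-up, used to spot contrasting votes
```

Inspecting the two most frequent labels together, rather than the top share alone, mirrors the reasoning above: a clip with a modest top share may still be kept when the runner-up emotion lies close to it in arousal-valence space.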

B. PARTICIPANT SELECTION
Thirty participants between 20 and 30 years of age (mean 23.56 years; 15 males, 15 females) were recruited two weeks before the beginning of the data acquisition experiments. All the participants were students from Chonnam National University. To determine the number of participants, we calculated the required sample size using G*Power [39] for Windows for a one-sample t-test with the smallest effect size (d) of 0.5, an alpha risk of 0.05, and a power of 0.80. An a priori power analysis with a one-tailed t-test suggested 28 as the minimum number of samples required. Accordingly, 30 participants were recruited for the experiments through an online advertisement on the university web portal, and each participant was provided with remuneration of approximately $15 per hour for their time. To avoid the impact of abnormal emotion elicitation, the participants were asked about their medical condition to ensure the absence of mental illness in the recent past. Other criteria included the absence of any signs and symptoms of common illness such as a minor headache or common cold. The Beck Depression Inventory (BDI) [40] test was conducted during the recruitment process to confirm that the participants did not have any psychological abnormalities. Due to the pandemic situation, further precautions were taken, such as screening for fever or headache before the study and wearing face masks during the experiments.
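As a rough sanity check on the order of magnitude of the sample size, the following sketch uses the textbook normal approximation for a one-tailed, one-sample test. It is not a reproduction of the G*Power calculation: G*Power works with the noncentral t distribution, so the approximation here is expected to give a somewhat smaller number than the value reported above. The function name `approx_sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def approx_sample_size(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a one-tailed, one-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha)  # one-tailed critical value
    z_beta = z.inv_cdf(power)         # quantile matching the target power
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)

# Same inputs as in the text: d = 0.5, alpha = 0.05, power = 0.80.
n_approx = approx_sample_size(0.5)
```

The approximation serves only as a lower-bound check; the reported minimum and the 30 recruited participants both lie above it.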

C. EXPERIMENT SCENARIO
After signing the consent form, the participants sat in front of a 22-inch screen with a resolution of 1920 × 1080 pixels. Stereo speakers were used for audio output, and the volume level was adjusted based on the participant's preference before the experiments. The participants were asked to position themselves comfortably in front of the screen and watch the stimulus videos as shown in Fig. 3. An experimenter assisted them in wearing an Emotiv Epoc X [41] EEG headset and an Empatica E4 [42] wristband. The EEG headset includes movable electrodes with wet saline terminals. The experimenter applied saline water to the contact points of the headset and placed the electrodes on the scalp. Similarly, the wristband was positioned correctly and connected to the experimenter's smartphone using a Bluetooth connection for recording. The EEG signals were wirelessly recorded at 256 Hz using a vendor-provided USB dongle attached to the experimenter's machine. The impedance of the electrodes was maintained by saline-based felts attached to the electrodes. EEG quality was verified using the vendor-provided recording software, which indicated contact quality and EEG signal quality. The participants were asked to avoid any movements to maintain the EEG quality. The experimenter commenced the experiment using a web-based annotation tool that we developed for the experiments. Each participant contributed about 44 minutes of recording divided into three sessions of about 15 minutes each, with breaks of at least 10-15 minutes to avoid emotional fatigue or possible discomfort due to prolonged wearing of the EEG headset. Participants wore the EEG headset for a maximum of 15 minutes during each session. However, due to the time needed for remounting the EEG headset after each break, the total time contribution per participant was about 3 hours.
The stimulus videos were not fully randomized, to avoid having the same basic emotion in consecutive videos. To determine the order, we shuffled the videos with the constraint that no two consecutive videos contain the same emotion. Two sets of predefined orders were used alternately for the subjects. After each video, a color bars video containing calm music was displayed for 1 minute to neutralize the emotion; a color bars video is considered to have a soothing effect [26], [37]. Three test videos were used to familiarize the participants with the experiment process and annotation tool. During the annotation of these test videos, the participants occasionally adjusted their posture and interacted with the experimenters to ask about the annotation tool. It was observed that the participants were able to annotate the third test video without movement and without disturbing the EEG signals. Using the test videos before the annotation of the stimulus videos was found to be effective in reducing potential noise in the signals due to a lack of familiarity with the annotation tool.
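One simple way to obtain an order that satisfies the constraint on consecutive emotions is rejection sampling: shuffle repeatedly until no two neighbors share an emotion. The sketch below is an assumption about how such an order could be generated, not the procedure actually used in the experiment; the function name, the clip identifiers, and the emotion assignments are all hypothetical.

```python
import random

def constrained_shuffle(videos, max_tries=10_000, seed=None):
    """Shuffle (video_id, emotion) pairs so no adjacent pair shares an emotion."""
    rng = random.Random(seed)
    order = list(videos)
    for _ in range(max_tries):
        rng.shuffle(order)
        # Accept the permutation only if every neighboring pair differs in emotion.
        if all(a[1] != b[1] for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found; the constraint may be unsatisfiable")

# Illustrative clip list (assumed, not the actual stimulus set).
clips = [("A1", "happy"), ("A2", "happy"), ("B1", "sad"), ("B2", "sad"),
         ("C1", "fear"), ("C2", "anger"), ("C3", "neutral")]
order = constrained_shuffle(clips, seed=0)
```

Calling the function with two different seeds would give two fixed orders that could then be alternated across subjects, in the spirit of the two predefined orders described above.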
An annotation interface was shown on the video player at the end of each video for the participants to report the emotions that they felt while watching the video. The participants were asked to report their emotions immediately; however, no time limit was imposed for labeling the emotions. The Self-Assessment Manikin (SAM) [43] was used for annotation as shown in Fig. 4. The SAM-based annotation interface was displayed only after the stimulus video and the recording of physiological signals had stopped. Participants selected a discrete emotion from a list of 7 basic emotions displayed in the first row of the annotation tool. Similarly, for a dimensional view of the emotions, the participants rated their felt emotions on a Likert scale of 1-9 represented by SAM icons on the annotation interface.

D. PHYSIOLOGICAL SIGNALS
We recorded 4 types of physiological signals: EEG, EDA, BVP, and TEMP. Electroencephalograms (EEG) capture electrical activity in the brain through electrodes placed on the scalp, while EDA, BVP, heart rate (HR), and TEMP were recorded using a wrist-worn sensor. Although various physiological signals can be used for emotion recognition, these modalities were selected based on their unobtrusiveness and ease of availability as consumer products. The analysis of emotions using consumer-grade devices allows emotion recognition to be applied in a wider range of domains.
In EEG, the electrodes capture signals generated by the movement of ions during the activation of neurons. Such activations are directly related to cognitive processes and various emotions [44]. EDA, also known as Galvanic Skin Response (GSR), refers to the change in electric potential in the skin in response to perspiration; it measures the effect of neural activities on the permeability of the sweat glands. The EDA signal represents the activity of the sympathetic nerves on the eccrine glands [45]. It is a non-invasive measurement of skin conductance as it uses a constant supply of low voltage [46]. Due to its non-invasiveness, affordability, and convenient acquisition, EDA has been used widely for several applications in affective computing, including smart and intelligent wearable devices.
Blood Volume Pulse (BVP) recorded using photoplethysmography is useful in emotion recognition studies as the change in heart rate affects arousal and valence [47]; heart rate has been found to have a positive correlation with valence [48]. Several features, including heart rate variability (HRV), inter-beat interval (IBI), and spectral features of the BVP, can be used for emotion recognition.
Peripheral skin temperature fluctuates with changes in emotional states and is used in emotion recognition studies [21], [26]. Notably, skin temperature can be recorded continuously in a non-intrusive way using wrist-worn sensors.

E. DEVICES
To record the EEG data, we used the Emotiv Epoc X, a wireless EEG headset with 14 electrodes (AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, AF4) and two reference electrodes located at the left and right mastoid processes. The Epoc X includes saline-based wet electrodes arranged according to the international 10-20 system as shown in Fig. 5. The Epoc X allows the recording of EEG signals at a sampling frequency of 128 Hz or 256 Hz.
To collect EDA, BVP, and temperature, we used the Empatica E4 wristband, a wrist-worn wearable device with multiple sensors. The E4 device captures various signals, including EDA at 4 Hz, BVP at 64 Hz, and peripheral skin temperature at 4 Hz, as listed in Table 3. The device is also equipped with accelerometer and gyroscope sensors. However, movement information is not included in the dataset because body movement was prohibited during the experiment to avoid mechanical disturbances during EEG recording. Physiological signals from the E4 device were recorded using a mobile device (Galaxy Z Flip 5G, 256 GB, 6.7 inches, 1080 × 2636) through a Bluetooth connection. Fig. 6 illustrates the devices used for data collection.

IV. DATASET CHARACTERISTICS
The data gathered from multiple sensors in the experiment were associated with UTC timestamps, ensuring an accurate time reference. To align the sensor signals with the actual duration of the experiment, the timestamps recorded using the annotation tool were used to clip the signals appropriately. The details of the collected dataset are presented in Table 4, providing an overview of the data acquired. To evaluate the stimulus data, a group of 28 individuals, referred to as evaluators, participated in the assessment process. These evaluators were responsible for reviewing and analyzing the stimulus videos.
Additionally, the physiological data and corresponding emotion labels were collected from a separate group of 30 participants, referred to as annotators. These individuals provided annotations by labeling the emotions they felt while watching the stimulus videos. By involving both evaluators and annotators, the study aimed to gather comprehensive insights into the stimulus data and its impact on physiological responses and emotional states. The use of multiple participants ensures a diverse range of perspectives and enhances the reliability and validity of the collected data.
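The timestamp-based clipping described at the start of this section can be sketched as follows, assuming each signal is stored as a plain list of samples together with the UTC time of its first sample and its sampling rate. The function name `clip_signal` and all values in the example are hypothetical; the actual file layout of the dataset may differ.

```python
def clip_signal(samples, start_ts, rate_hz, clip_start_ts, clip_end_ts):
    """Return the samples recorded between two UTC timestamps (in seconds)."""
    # Convert each timestamp to a sample index relative to the recording start.
    first = max(0, round((clip_start_ts - start_ts) * rate_hz))
    last = min(len(samples), round((clip_end_ts - start_ts) * rate_hz))
    return samples[first:last]

# Illustrative recording: 10 s of EDA at 4 Hz starting at UTC time 1000.0 s.
eda = list(range(40))
segment = clip_signal(eda, start_ts=1000.0, rate_hz=4,
                      clip_start_ts=1002.0, clip_end_ts=1005.0)
```

Because every device is clipped against the same pair of annotation-tool timestamps, signals with different sampling rates (for example, 4 Hz EDA and 256 Hz EEG) end up covering the same time window even though their sample counts differ.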

A. STIMULUS DATASET
We constructed a stimulus dataset using video clips from various sources and evaluated it through multiple annotations. A group of 28 evaluators labeled the videos with 7 basic emotions and dimensional emotions. To assess the consistency of the annotations, we calculated the agreement among the evaluators. The selection of stimulus videos was based on this agreement, and two videos with low agreement were removed from the stimulus set. We adopted Cronbach's alpha [49] as a measure of agreement among the evaluators; it is a commonly used inter-observer agreement measure for continuous labels. For stimulus labeling, Cronbach's alpha values of 0.97 and 0.96 were observed for valence and arousal, respectively. Similarly, for the categorical annotations we used Fleiss' kappa [50] as a metric for inter-rater agreement, and a moderate inter-rater agreement of 0.40 was observed.
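For reference, Cronbach's alpha for continuous ratings can be computed as in the following minimal sketch, which treats each rater as an "item" and each video as an observation. The function name and the toy ratings are illustrative assumptions, not the values collected in this study.

```python
import numpy as np

def cronbach_alpha(ratings):
    """ratings[i, j] = rating given to video i by rater j."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                               # number of raters
    rater_var = ratings.var(axis=0, ddof=1).sum()      # sum of per-rater variances
    total_var = ratings.sum(axis=1).var(ddof=1)        # variance of per-video totals
    return k / (k - 1) * (1 - rater_var / total_var)

# Hypothetical 9-point valence ratings: 3 videos rated by 3 raters.
toy = [[1, 2, 1],
       [5, 4, 5],
       [9, 8, 9]]
alpha = cronbach_alpha(toy)
```

Values close to 1 indicate that the raters order and scale the videos consistently.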

B. INTER-OBSERVER AGREEMENT IN THE DATASET
As the same stimulus videos were shown to all the participants, the annotations were validated through inter-annotator agreement analysis. Cronbach's alpha of 0.84 for arousal and 0.89 for valence was observed, indicating strong inter-annotator reliability among the 30 participants.
For categorical annotations of seven emotions, a moderate inter-rater agreement was observed with a Fleiss' kappa value of 0.50. We also analyzed the agreement of the categorical annotations based on the number of annotators agreeing on one of the seven basic emotions. As shown in Fig. 7, we computed percentage scores for each stimulus video. The results showed a high agreement among the annotators for most of the videos; 19 videos had an agreement of over 50%. VID05 had the highest agreement, with 100% agreement among the annotators. Videos VID23 and VID03 had the lowest agreement, at 36.67% and 40%, respectively. Such low agreement might be due to conflicting emotions resulting from participants' personality differences; the same stimulus video may impart slightly different emotions to different individuals. For example, VID23 contains a conversation between two characters, which caused 36.67% of the annotators to feel sad (the expected emotion), while 30% felt neutral and 16.67% felt angry, as seen in Fig. 7 (b). Similarly, VID03 (scary principal), in which a school principal shows abusive behavior towards a female student, may induce fear if the subjects perceive the video from the student's perspective, whereas subjects who observe the principal's abusive behavior from a general viewpoint may become angry. These differences in viewers' perspectives and in the manner in which individuals consume video content suggest that individual differences among the participants may have influenced their focus on different characters in the scene, indicating the need for personality-based profiling in video-based stimuli for eliciting emotions.
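The two categorical agreement measures used above can be sketched as follows. The input is a count matrix with one row per video and one column per emotion. The column order is an assumption. In the example row, the counts for sad (11), neutral (9), and angry (5) follow from the percentages reported for VID23 with 30 annotators, while the split of the remaining five votes is hypothetical.

```python
import numpy as np

def fleiss_kappa(counts):
    """counts[i, j] = number of raters assigning video i to category j."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()   # assumes the same number of raters per video
    p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()                                # observed agreement
    p_cat = counts.sum(axis=0) / (n_items * n_raters)    # category proportions
    p_exp = (p_cat ** 2).sum()                           # chance agreement
    return float((p_bar - p_exp) / (1 - p_exp))

def majority_agreement(counts):
    """Percentage of raters choosing the most frequent label of each video."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * counts.max(axis=1) / counts.sum(axis=1)

# Assumed column order: anger, disgust, fear, happiness, sadness, surprise, neutral.
vid23 = np.array([[5, 1, 1, 1, 11, 2, 9]])
vid23_agreement = majority_agreement(vid23)[0]
```

Fleiss' kappa corrects the observed agreement for chance, whereas the per-video percentage only reports the share of the majority label, so the two measures are complementary.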

C. DISTRIBUTION OF EMOTIONS IN AROUSAL-VALENCE SPACE
Stimulus videos were evaluated by 28 evaluators. Both the evaluators and the participants of the data collection experiments labeled the videos in both categorical and dimensional views. Based on the inter-rater agreement of annotations by stimulus evaluators, we selected 23 video clips for the data collection experiment.
As shown in Fig. 8 (a) and (b), the stimulus video evaluators and the physiological data annotators labeled the videos similarly for arousal and valence. Fig. 8 (a) shows the distribution of the average values of annotations in arousal and valence space as annotated by 28 evaluators for the stimulus videos. The majority of the videos were labeled as high arousal and low valence. This is due to the selection of stimuli based on the categorical labels. Fig. 8 (b) shows the average values of the annotations made during the physiological data acquisition experiment while watching the stimulus videos. The videos for happy and surprise appear in the positive valence quadrants, although videos with surprise were labeled both positive and slightly negative. It can be concluded that the videos were labeled following a similar trend and that the participants of the data acquisition experiment annotated the felt emotions as expected.

D. COMPARISON OF LABELS WITH THE STIMULUS VIDEO ANNOTATIONS
To evaluate the annotations, we calculated Spearman's correlation coefficient between the mean of the participants' annotations from the physiological data collection experiment and the mean of the annotations made by the evaluators for each video. As the videos were labeled by different sets of annotators of different sizes, we calculated the mean arousal and valence for each video. We observed a high correlation of 0.9708 for valence and 0.8702 for arousal. This strong rank correlation between the 9-point labels of arousal and valence indicates that the participants in the data acquisition experiment rated the videos in the same way as the participants of the stimulus labeling experiment. It also suggests that the participants were provided with proper information on the annotation process and rating scales, and that the selected stimulus video set is appropriate for emotion stimulation. The results show that valence is more consistent across participants than arousal.
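The comparison can be sketched as follows: ratings are first averaged per video within each group, and Spearman's coefficient is then the Pearson correlation of the ranks of the two mean vectors. The helper names and the toy ratings are assumptions for illustration only.

```python
import numpy as np

def average_ranks(x):
    """Rank values from 1..n, assigning tied values their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

# Hypothetical valence ratings (rows: raters, columns: videos).
evaluators = np.array([[2, 8, 4, 6], [3, 7, 5, 6], [1, 9, 4, 7]])
participants = np.array([[3, 7, 5, 6], [2, 8, 5, 7]])
rho = spearman(evaluators.mean(axis=0), participants.mean(axis=0))
```

Averaging per video before correlating makes the two groups comparable even though they contain different numbers of raters.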

E. PERSONALITY CORRELATION
For personality information, we used the Big Five personality traits, namely extraversion, neuroticism, conscientiousness, agreeableness, and openness, measured using the Newcastle Personality Assessor (NPA) questionnaire [34]. NPA is a short questionnaire with 12 questions to be answered on a Likert scale of 1-5, representing very uncharacteristic to very characteristic. We obtained the personality scores on a 4-point scale representing low, low-medium, medium-high, and high based on the NPA questionnaire interpretation [34]. We calculated the average Spearman's correlation coefficient between the personality scores and the valence and arousal values annotated for emotions felt while watching the 23 videos. Table 5 shows the correlation coefficients between the personality traits and the levels of the emotional dimensions. Valence was found to have a weak but significant negative correlation (p < 0.05) with openness, while the correlation of valence with the other personality traits was not significant. This implies that the personality traits do not necessarily affect the level of valence. However, arousal had a significant negative correlation with neuroticism and agreeableness, while conscientiousness had a positive correlation with arousal. Among the personality traits, a significant negative correlation was found between neuroticism and conscientiousness, indicating an inverse relationship between the two traits. The correlations between the personality traits and emotions in arousal and valence were not significant for all personality traits, suggesting that personality does not necessarily play a significant role in the intensity of felt emotions.
In general, the analysis of all stimulus videos did not show a significant correlation; however, as seen in Fig. 7, there was an anomaly in the annotation of interrelated emotions such as sadness and anger in a conversational context. As seen in VID23, the conversation between two characters may have two different aspects imparting distinct emotions to the viewer. To investigate the role of personality in such a scenario, we calculated Spearman's correlation between the personality traits and the annotations on VID23 and VID03 by all the participants. For VID23 and VID03, which potentially could induce low valence, a significant negative correlation of -0.447 between valence and openness was found. Such an inverse relationship between openness and valence suggests that the participants with low openness experienced higher valence. Moreover, in the case of VID23 and VID03, which were labeled as anger and fear by most participants, a negative correlation was found between arousal and personality traits, while for the overall dataset, neuroticism showed a negative correlation with arousal. In addition, openness showed a negative correlation with both arousal and valence, suggesting a lower level of emotion elicitation for the subjects having high scores in openness. These observations were found to agree with the existing studies [12].

F. DATA AVAILABILITY
The dataset is publicly available for academic research at https://sites.google.com/view/phymer-dataset. The data presented in this study were also available as a part of the Third Korean Emotion Recognition Challenge (KERC) 2021 (https://www.kaggle.com/c/kerc2021). The preprocessed data, divided into training, validation, and test sets, were open to the participants of the competition on Kaggle during the competition (from Aug 30 to Oct 31, 2021). The baseline model for the competition was an LSTM-based classification model, achieving an F1-score of 0.55.

V. BASELINE EXPERIMENTS
In this section, we present the implementation of the baseline method for multimodal emotion recognition. We also evaluate the performance of the proposed baseline model in a subject-independent and subject-dependent manner. As the range of physiological data such as heart rate, blood volume pulse, or electrodermal activity may differ among individuals, within-subject normalization was performed for subject-independent emotion analysis. We normalized the extracted features using the Robust Scaler [59], which scales the data based on the quantile range and is suitable for small data sizes.
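A minimal sketch of within-subject robust scaling is given below. It centers each feature on the subject's median and divides by the subject's interquartile range (25th to 75th percentile), which corresponds to the default behavior of scikit-learn's `RobustScaler`. Whether the baseline used these default quantiles is an assumption, and the function name and toy data are hypothetical.

```python
import numpy as np

def robust_scale_within_subject(features, subject_ids):
    """Scale each feature per subject using the median and interquartile range."""
    features = np.asarray(features, dtype=float)
    subject_ids = np.asarray(subject_ids)
    scaled = np.empty_like(features)
    for subject in np.unique(subject_ids):
        rows = subject_ids == subject
        block = features[rows]
        median = np.median(block, axis=0)
        q1, q3 = np.percentile(block, [25, 75], axis=0)
        iqr = np.where(q3 - q1 == 0, 1.0, q3 - q1)   # avoid division by zero
        scaled[rows] = (block - median) / iqr
    return scaled

# Two hypothetical subjects with very different baselines for one feature.
x = np.array([[1.0], [2.0], [3.0], [101.0], [102.0], [103.0]])
subjects = np.array([0, 0, 0, 1, 1, 1])
x_scaled = robust_scale_within_subject(x, subjects)
```

After scaling, both subjects occupy the same range, so a model trained on some subjects is not dominated by baseline offsets of others.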

A. PREPROCESSING
EEG data is highly susceptible to noise due to its sensitivity towards minor physiological and physical activities such as blinking of eyes, heartbeats, and muscular movements [53]. We applied a band-pass filter of 4-40 Hz to retain only the frequencies of four EEG bands (theta, alpha, beta, and gamma); the delta band was excluded as it is not relevant to this study, being observed mainly during sleep. Although a filter of this range eliminates certain noise components, frequency filtering alone cannot remove all the noise, as some artifacts may lie within the same frequency range as the EEG signals. We applied AWICA [54] to eliminate such noise.
AWICA is a threshold-based automatic artifact removal technique involving wavelet analysis and Independent Component Analysis (ICA). In this method, each channel of the EEG signal is partitioned into the 4 EEG bands through the Discrete Wavelet Transform (DWT). The artifactual wavelet components are selected quantitatively based on thresholds of kurtosis and Rényi's entropy [55]. Then ICA is performed on the wavelet components for automatic rejection of the artifactual components. ICA is a commonly used blind source separation technique for isolating the source signals from the recorded signals. Finally, artifact-free EEG channels are reconstructed through inverse ICA followed by the inverse DWT. Noise removal in EDA signals is done using a low-pass filter with a 2 Hz cutoff frequency [56].
Blood volume pulse (BVP) signals recorded using photoplethysmography are susceptible to motion artifacts (MA) [57]. However, in our experiments, we minimized the motion of the wrist wearing the sensor to avoid motion artifacts. As the dataset did not include motion information, the preprocessing of BVP signals mainly involved removing out-of-band noise through a fourth-order Butterworth band-pass filter (BPF) between 0.4 Hz and 4.0 Hz. For temperature (TEMP), only normalization was performed as the experiment process did not involve activities that could produce significant noise in the data. Eight samples for subject SUB10 (SUB10VID09-SUB10VID16) were excluded from the experiments due to a device malfunction during the data collection experiment.
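The BVP filtering step can be sketched with SciPy as follows. The cutoff frequencies and the 64 Hz sampling rate come from the text; the zero-phase application via `sosfiltfilt` and the interpretation of "fourth-order" as the order passed to `butter` are assumptions, and the test signal is synthetic.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(signal, fs, low, high, order=4):
    """Zero-phase Butterworth band-pass filter (second-order sections)."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

fs_bvp = 64                                   # E4 BVP sampling rate in Hz
t = np.arange(0, 30, 1 / fs_bvp)
pulse = np.sin(2 * np.pi * 1.2 * t)           # in-band component (about 72 bpm)
drift = 2.0 * np.sin(2 * np.pi * 0.05 * t)    # slow baseline drift, out of band
hum = 0.5 * np.sin(2 * np.pi * 20 * t)        # high-frequency noise, out of band
filtered = bandpass(pulse + drift + hum, fs_bvp, 0.4, 4.0)
```

The same helper could be applied to EEG with cutoffs of 4 Hz and 40 Hz, although the filter type used for EEG is not stated in the text.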

B. EMOTION RECOGNITION
To evaluate the dataset, we conducted subject-dependent and subject-independent experiments and compared the performance using different combinations of the features. For the subject-dependent method, we performed 5-fold cross-validation on the whole dataset, split into training and validation sets at a ratio of 80 and 20 percent, respectively, and compared it with the subject-independent method based on leave-one-subject-out cross-validation, where one subject's data was used for validation.
For the baseline experiments, we extracted various handcrafted features for different modalities. Inspired by the high performance of features from the Toolbox for Emotional feature extraction from Physiological signals (TEAP) [58], we followed [26] for feature extraction. For each of the 14 channels of EEG samples, we calculated the power spectrum for the theta (4-8 Hz), alpha (8-13 Hz), beta, and gamma (30-40 Hz) bands, and obtained 56 features (14 channels × 4 bands). In addition to the band power, we included two statistical features (mean and standard deviation) for each channel. Similarly, for EDA, BVP, and skin temperature (TEMP), we extracted various features as shown in Table 6.
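The two validation schemes can be sketched as index generators. The subject identifiers and the sample counts below are hypothetical; the seed of 42 mirrors the fixed seed reported for the experiments, but the exact splitting code of the baseline is not given in the text.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held-out subject, train indices, validation indices)."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        val = np.flatnonzero(subject_ids == subject)
        train = np.flatnonzero(subject_ids != subject)
        yield subject, train, val

def k_fold(n_samples, k=5, seed=42):
    """Yield (train indices, validation indices) for shuffled k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

# Hypothetical layout: 3 subjects with 4 recordings each.
subject_ids = np.repeat(["SUB01", "SUB02", "SUB03"], 4)
loso_splits = list(leave_one_subject_out(subject_ids))
kfold_splits = list(k_fold(len(subject_ids), k=5))
```

In the k-fold scheme, samples of one subject appear in both training and validation sets, whereas leave-one-subject-out never exposes the held-out subject during training.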
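A minimal sketch of the EEG feature vector is given below, using a plain FFT periodogram and the mean spectral density within each band. The beta limits of 13-30 Hz are an assumption (the conventional range between the stated alpha and gamma limits), as are the periodogram estimator and the function names; the TEAP-based implementation may differ.

```python
import numpy as np

# Band limits in Hz; the beta range is an assumed conventional value.
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 40)}

def band_powers(eeg, fs):
    """eeg: (n_channels, n_samples) -> mean periodogram power per band."""
    eeg = np.asarray(eeg, dtype=float)
    freqs = np.fft.rfftfreq(eeg.shape[1], d=1 / fs)
    psd = np.abs(np.fft.rfft(eeg, axis=1)) ** 2 / (fs * eeg.shape[1])
    powers = [psd[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)
              for lo, hi in BANDS.values()]
    return np.stack(powers, axis=1)               # (n_channels, 4)

def eeg_features(eeg, fs):
    """56 band-power features plus per-channel mean and standard deviation."""
    eeg = np.asarray(eeg, dtype=float)
    stats = np.stack([eeg.mean(axis=1), eeg.std(axis=1)], axis=1)
    return np.concatenate([band_powers(eeg, fs).ravel(), stats.ravel()])

# Synthetic check: a 10 Hz oscillation on 14 channels sampled at 128 Hz.
t = np.arange(0, 4, 1 / 128)
eeg = np.tile(np.sin(2 * np.pi * 10 * t), (14, 1))
features = eeg_features(eeg, fs=128)
```

With 14 channels, this yields 56 band-power values followed by 28 statistical values per sample.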
We performed experiments involving the classification of the seven basic emotions and the prediction of arousal and valence values labeled by the participants. Following feature extraction, we employed Extreme Gradient Boosting (XGBoost) [56] based classifier and regressor models, chosen for their demonstrated effectiveness in various classification and regression tasks [26]. Different combinations of EDA and BVP features (Table 6) extracted in the preprocessing step were concatenated with the personality features to perform the classification and prediction in the presence and the absence of personality features. The proposed emotion recognition framework is illustrated in Fig. 9. The baseline system consists of handcrafted feature extraction and emotion recognition modules. The feature extraction module involves the preprocessing of different physiological signals, from which statistical and signal features are extracted. In the classification module, the extracted features are fused and used for the recognition of both categorical and dimensional emotion labels.
We evaluated the classification results using the F1-score and the Matthews Correlation Coefficient (MCC), considering the class imbalance in the dataset. The F1-score is a commonly used evaluation metric for classification tasks, representing the harmonic mean of precision and recall. As the F1-score does not consider the true negatives, MCC can better represent the classification accuracy [57]. The best MCC score was obtained with the combination of EEG and EDA, with no significant improvement when personality features were added.
To assess the influence of multimodality on emotion recognition, we conducted a comparison of classifications using different modalities. The results, presented in Table 7, indicate that the incorporation of BVP and TEMP does not necessarily lead to performance improvement. Furthermore, when combining personality features with physiological signals, we observed only a slight increase in performance, suggesting a weak impact of personality on emotion labeling. As the use of multiple modalities resulted in an increase in performance, it appears that further research on the effective fusion of the modalities needs to be performed. As EEG signals vary considerably across subjects, there was a large difference between the subject-dependent and subject-independent classification results. Arousal and valence prediction was evaluated using the mean absolute error (MAE) and the concordance correlation coefficient (CCC), as shown in Table 8, which shows that the inclusion of personality features improves the prediction of both arousal and valence.
Table 9 presents the F1-scores for each categorical emotion, comparing the performance when different combinations of modalities are used in the presence and absence of personality features. We can see from Table 9 and Table 10 that the use of all modalities improved the performance for some emotions in some cases. Disgust, happiness, and anger showed higher performance when personality information was used, while the performance did not improve in the case of neutral, sad, fear, and surprise. All the experiments were run on an Nvidia GeForce RTX 3080Ti with 12 GB of memory. The experiment codes were implemented in Python, and to ensure the reproducibility of the experiment results, a fixed random seed of 42 was set for all the experiments.
We expect that performance improvement can be achieved through different feature engineering and multimodal feature fusion techniques. Moreover, this dataset can provide value to multimodal emotion recognition research through further analysis of personality as a context. The dataset contains annotations for both categorical and continuous affects, enabling its use in the development of multitask models. The diverse range of annotations within the dataset provides an opportunity to enhance the complexity of models and further improve the performance of multitask models in future studies.
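To illustrate why MCC is informative under class imbalance, the following sketch computes a macro-averaged F1-score and the multiclass MCC from a confusion matrix. The macro averaging is an assumption, since the averaging scheme is not stated in the text, and the labels are toy data.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def macro_f1(cm):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    tp = np.diag(cm).astype(float)
    denom = cm.sum(axis=0) + cm.sum(axis=1)
    f1 = np.where(denom > 0, 2 * tp / np.where(denom > 0, denom, 1), 0.0)
    return float(f1.mean())

def mcc(cm):
    """Multiclass Matthews correlation coefficient."""
    c, s = int(np.trace(cm)), int(cm.sum())
    p, t = cm.sum(axis=0), cm.sum(axis=1)      # predicted and true class counts
    denom = np.sqrt(float((s * s - int((p ** 2).sum())) *
                          (s * s - int((t ** 2).sum()))))
    return float((c * s - int((p * t).sum())) / denom) if denom > 0 else 0.0

# Toy imbalanced case: the classifier always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
cm = confusion_matrix(y_true, y_pred, n_classes=2)
scores = (macro_f1(cm), mcc(cm))
```

In this toy case the accuracy is 0.9, yet the MCC is 0, reflecting that the predictions carry no information about the minority class.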
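The two regression metrics can be sketched as follows. The CCC follows Lin's definition, which penalizes both low correlation and systematic offsets between predictions and labels; the toy values are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.abs(y_true - y_pred).mean())

def ccc(y_true, y_pred):
    """Concordance correlation coefficient (population moments)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return float(2 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2))

# Hypothetical 9-point arousal labels and predictions shifted by one point.
labels = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
shifted = labels + 1.0
metrics = (mae(labels, shifted), ccc(labels, shifted))
```

A constant offset leaves the Pearson correlation at 1 but lowers the CCC, which is why CCC is often preferred for continuous affect prediction.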

VI. DISCUSSION
This paper primarily focuses on dataset acquisition and the annotation protocol. We generated a stimulus dataset, designed a data collection experiment, and developed a web-based data annotation tool for physiological data generation. Moreover, a baseline method for emotion classification was developed to evaluate the dataset.
The experiment protocol was carefully designed to simulate the real-world scenario to the greatest extent possible. Although the ideal data acquisition procedure would require signal recording while the participants are unaware of the collection process, this is practically impossible for EEG-based datasets, as certain restrictions are required to avoid signal noise. Despite the experiment being conducted in laboratory settings, the comfortable posture for watching stimulus videos, the use of consumer-grade commercial equipment, and personalized audio levels make the data collection natural and close to a real-world scenario. A key feature that sets our dataset apart from existing ones is the inclusion of fine-grained emotion annotation, which uses a 9-point scale to provide a dimensional perspective of emotions. This enhances the richness and granularity of the emotional information captured in the dataset, offering a more comprehensive view of emotional states than available datasets.
Context plays an important role in emotional representation. The context in emotion recognition can be represented by environmental and socio-cultural factors. As this dataset focuses on the bio-signal data collected in a uniform environment, only the personality traits have been included in the dataset to provide the individual nature of the participants as a context for emotion recognition. The individual differences and their impact on emotion recognition can be studied in the context of various other factors such as age, gender, and ethnicity. However, this dataset focuses on adults from a homogeneous population.
With the increasing use of consumer-grade wearable sensors, the collection and analysis of physiological data has become more convenient and non-obtrusive. Several accessories such as wristbands and headsets have been widely accepted as they can provide personalized information. However, behavioral differences among individuals also need to be considered during the analysis of different types of emotions. Individuals with reserved or introverted personalities may not respond to the same stimuli as actively as extroverts. In other words, the degree of emotional expression may not be the same for individuals with different personalities; therefore, analysis of fine-grained continuous values in different dimensions is essential.
As a result, the same stimulus videos often induce different emotions in different individuals. Such differences may arise when the stimulus contains interrelated emotions and participants watch the videos with a focus on a different character. For example, a video with a character showing anger at another character may induce either anger or sadness based on which character the participant focuses on. Therefore, the role of personality in distinguishing closely related emotions requires further exploration, although there is no significant correlation between personality traits and felt emotions in general.
While the dataset's primary intended use is for studying emotions during video content consumption, it holds potential for broader applications concerning the analysis of physiological signal variations among individuals in different emotional scenarios. The dataset, collected using consumer-grade portable headsets in response to video-based stimuli, offers multichannel EEG data that could contribute to advancements in brain-computer interfaces (BCI) by exploring temporal and spatial information. This physiological signal-based emotion recognition method also has potential for application in the analysis of emotions in immersive environments such as virtual reality (VR) based games. We also conducted experiments with VR devices with 360-degree videos as emotion stimuli to study the application of physiological signals in an immersive environment. However, our preliminary experiments showed that both mechanical and electronic noise affect the data collection process due to the simultaneous use of VR and EEG devices. Thus, the application domain of physiological datasets is limited to situations with minimal movement.
The dataset compilation has certain limitations, primarily related to noise interference in EEG sensors. Unavoidable sources of noise, such as blinking of eyes and subtle muscular movements during data collection, may have affected the dataset. To ensure effective use of the dataset, appropriate noise removal techniques are crucial. Another limitation is the potential bias arising from the possible elicitation of multiple emotions. Efforts were made to mitigate this bias through careful stimulus video selection and labeling before the experiments. The chosen stimulus video clips were intended to depict a single emotion, and 28 evaluators of the same age group verified this. However, the personalities of these evaluators might differ from those of the 30 annotators participating in the physiological data collection experiments. Despite the thorough selection process, one video (VID21) was found to induce both anger and fear, both being negative valence emotions. Considering individual differences, this video was included in the dataset. Consequently, the inclusion of samples recorded through stimulation using such videos is a limitation that may hinder the distinction between emotions like anger and fear. Additionally, the stimulus data collection process involved partial use of an existing emotion recognition dataset. Although the reconstructed stimulus dataset was evaluated by 28 evaluators, the final selection was made by the experimenters based on evaluator agreement. Videos with emotion labels different from those in the original dataset were excluded, which might conflict with the purpose of the evaluation experiment. However, as the videos in the source dataset were labeled by multiple annotators, it is expected that any decision bias would be minimal.
Similarly, familiarity with the stimulus videos, human limitations in identifying one's own emotions, and homogeneous demographics are some of the limitations to be considered during the use of this dataset. The order of the stimulus videos was predefined by conditionally randomizing the order, where the order was changed if consecutive videos had the same emotion label. Although the order was made different for two consecutive subjects, the order was the same for alternate subjects. Such partial randomization may have induced order bias in the dataset. Another limitation of the present study is the use of the NPA questionnaire for personality data collection. A short questionnaire might have limited the reliability to some extent in the case of a small sample size. Although multiple modalities were considered in this study, we explicitly excluded the video modalities, which are common in multimodal emotion recognition.
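One simple way to realize such conditional randomization is rejection sampling: the order is reshuffled until no two consecutive videos share an emotion label. This is only a sketch of the idea; the actual procedure used in the experiment is not specified beyond the description above, and the video identifiers and labels below are hypothetical.

```python
import random

def conditional_shuffle(labels, seed=None, max_tries=1000):
    """Return a random video order with no two consecutive equal labels.

    labels maps a video identifier to its emotion label.
    """
    rng = random.Random(seed)
    videos = list(labels)
    for _ in range(max_tries):
        rng.shuffle(videos)
        if all(labels[a] != labels[b] for a, b in zip(videos, videos[1:])):
            return list(videos)
    raise RuntimeError("no valid order found within max_tries")

# Hypothetical stimulus set: 8 videos, 2 per emotion.
toy_labels = {"V1": "happy", "V2": "happy", "V3": "sad", "V4": "sad",
              "V5": "fear", "V6": "fear", "V7": "anger", "V8": "anger"}
order = conditional_shuffle(toy_labels, seed=0)
```

Drawing a new order with a different seed for every participant, rather than reusing an order for alternate subjects, would be one way to reduce the order bias noted above.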

VII. CONCLUSION
In this paper, we presented a multimodal bio-signal dataset for emotion recognition with both categorical and dimensional perspectives and analyzed the importance of personality as a context. The dataset includes various physiological signals obtained from 30 participants, where the emotions were induced through stimulus videos. We observed strong inter-observer agreement among the annotators. To evaluate the dataset quality, we performed baseline experiments using a multimodal classification model, which achieved an F1-score of up to 0.73 with multiple physiological modalities.
We plan to continue our research work in multimodal emotion recognition using physiological signals. We will improve the baseline classification and prediction models through improved feature extraction using deep learning techniques. As a further study, the stimulus dataset will be investigated with multiple emotion labels.
The database is publicly available for academic research in the field of emotion recognition, in the hope that it will be beneficial in the development of new emotion recognition methods and algorithms.

FIGURE 1. Overall data collection process for physiological data collection.

FIGURE 2. Example clips for seven basic emotions.

FIGURE 3. A participant watching a stimulus video during the experiment. The participant wore an EEG headset and a wristband for data recording.

FIGURE 4. The emotion annotation interface based on the Self-Assessment Manikin (SAM) for labeling 7 basic emotions and two emotion dimensions on a 9-point scale.

FIGURE 5. Position of EEG channels on the scalp.

FIGURE 6. Devices used for data collection.

FIGURE 7. Heatmap showing the agreement percentage during (a) stimulus evaluation and (b) annotation during physiological data collection.

FIGURE 8. Distribution of the mean arousal and valence in stimulus dataset (a) and those labeled by annotators during the physiological data collection experiment (b).

FIGURE 9. The overall framework for emotion recognition using physiological signals. The feature extraction part includes extraction of the features from each channel of EEG and other modalities. The emotion recognition module includes classification of seven basic emotions and prediction of 9-point labels for arousal and valence.

TABLE 1. Summary of the existing emotion recognition datasets based on physiological signals.

TABLE 2. Stimulus video clips with corresponding emotions labeled by evaluators.

TABLE 3. Wearable devices used in the experiments.

TABLE 5. Correlation among personality traits and emotion dimensions annotated for 23 videos.

TABLE 6. Features extracted from physiological signals.

TABLE 7. Classification results of seven emotions.

TABLE 8. Subject-dependent and subject-independent prediction of arousal and valence.

TABLE 9. Subject-dependent classification performance for each class in the presence of personality features.

TABLE 10. Subject-independent classification performance for each class in the presence of personality features.