Perceptual Borderline for Balancing Multi-Class Spontaneous Emotional Data

Speech is a behavioural biometric signal that can provide important information for understanding human intentions as well as emotional status. This paper focuses on the speech-based identification of seniors' emotional status during their interaction with a virtual agent playing the role of a health professional coach. Under real conditions, only a small set of task-dependent spontaneous emotions can be identified. The number of identified samples differs greatly from one emotion to another, which results in an imbalanced dataset problem. This research proposes the dimensional model of emotions as a perceptual representation space alternative to the generally used acoustic one. The main contribution of the paper is the definition of a perceptual borderline for the oversampling of minority emotion classes in this space. This limit, based on arousal and valence criteria, leads to two methods of balancing the data: Perceptual Borderline oversampling and Perceptual Borderline SMOTE (Synthetic Minority Oversampling TEchnique). Both methods are implemented and compared to the state-of-the-art approaches of Random oversampling and SMOTE. The experimental evaluation was carried out on three imbalanced datasets of spontaneous emotions acquired in human-machine scenarios in three different cultures: Spain, France and Norway. The emotion recognition results obtained by neural network classifiers show that the proposed perceptual oversampling methods led to significant improvements over the state of the art for all scenarios and languages.


I. INTRODUCTION
Speech is a biometric signal that is able to provide information about the identity of the speaker [1], [2], the content of the message [3], [4] or the language used to code it [5]. In addition, speech can be analysed to identify the current emotional status of the speaker, which could be an important cue to improve mood prediction as well as to monitor some mental disorders such as anxiety or depression, among others [6]. However, speech, as behavioural biometric data, can also be affected by many other factors such as the speaker's habits, personality, culture or the specific task being performed. As a consequence, any behaviour analysis, prediction or monitoring becomes a challenge.
This work concerns well-being conversational systems for behavioural change, which is a research topic of growing interest [7]. In particular, the EMPATHIC European project develops new interaction paradigms for personalized virtual coaches to promote healthy and independent aging. A natural speech interaction between humans and machines implies, among other things, that the machine understands the emotional state of the user, hence the importance of emotion recognition. The paper is centered on the speech-based identification of the senior's emotional status during their interaction with a virtual agent playing the role of a health professional coach. The knowledge of the senior's emotions allows the system to react accordingly [8]-[10]. In addition, the outcomes of these interactions in terms of moods and emotions could provide useful and real-time information to the elderly care support systems as well as to caregivers [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Bijan Najafi.
This objective faces important challenges, the most important of which stem from the need to process spontaneous emotions triggered in real scenarios. Given the difficulty of implementing these conditions, research on the machine analysis of human emotions has mostly been carried out on simulations of the six basic emotions [12] performed by professional actors in the lab [13]. However, the features selected for acted and realistic emotions show significant differences [14]. Furthermore, only a small subset of the emotions defined by Ekman [15] can be distinguished in real scenarios. Moreover, this subset is strongly dependent on the task. For instance, a political debate on TV [16] or a podcast interview [17] cannot be expected to show the same human emotions as a human-machine scenario [10], [18], [19]. Indeed, fear is not expected in any of the mentioned corpora.
Emotions, unlike moods, are triggered by specific events and do not last more than a few seconds after which the basal mood appears again [12], [20]. Importantly, these emotions are usually the ones to be considered for the analysis of human behaviour as for instance in the proposed research scenario where the virtual coach needs to adapt its conversation accordingly [9]. In terms of Artificial Intelligence and Pattern Recognition techniques these facts pose an important problem for the automatic identification of human emotions: the huge difference among the number of samples acquired for each of the spontaneous emotions identified [21].
As a consequence, the number of objects in one class, or emotion, is considerably lower than in other classes. This situation, referred to as the imbalanced dataset problem, typically degrades classification performance in several data mining applications, including pattern recognition, telecommunications and bioinformatics [21], [22]. After two decades of research, learning from imbalanced data is still a focus [23], [24]. In addition, multi-class imbalanced classification is not as well developed as binary classification [25]. In this paper we are dealing with this more complicated situation. Indeed, when it comes to multi-class imbalanced data, we can easily lose performance on one class while trying to gain on another [22], [24]. Given this problem, a more in-depth understanding of the nature of the class imbalance problem is needed. Recent trends focus on analyzing not only the disproportion between classes, but also other difficulties related to the nature of the data [26].
Speech emotion recognition is naturally a multi-class classification problem, in which the imbalanced nature of the dataset can affect performance. Indeed, most of the test segments are assigned to the majority class due to the data skewness. However, all the emotions should have the same importance. Moreover, minority classes are the focus of attention in many scenarios [16].
This work examines the intrinsic properties of the speech samples to find a suitable borderline between majority and minority emotion classes. In a previous work [19] we proposed the dimensional model of emotions (Valence-Arousal-Dominance, or VAD) as an additional parameter representation space, which has been shown to improve emotion recognition performance. On the other hand, some studies [27] have shown that the oversampling of borderline samples is an effective way to deal with data imbalance and remains the state of the art.
The main contribution of this paper is the proposal of the VAD model of emotions as a representation space alternative to the generally used acoustic one to ground the oversampling borderline criteria. To our knowledge, this is the first time that the dimensional parameters are involved in a balancing process of emotional data. The work is based on the following hypothesis: if two emotions are close in the dimensional space, they are acoustically nearby and the emotion classifier can confuse them. It is also assumed that the information extracted from the perceptual space is more reliable than that from the acoustic space. Indeed, the perceptual space is the result of a manual annotation procedure, whereas the acoustic one is automatically estimated by a machine.
The perceptual borderline gives rise to oversampling approaches in the context of multi-class oversampling, either by increasing the number of samples on the frontier (by replication or by artificial synthesis) or by establishing a variable number of neighbours for the artificial synthesis in order to avoid overlapping between class samples, resulting in additional contributions [21], [23], [24]. The proposed methodology has been evaluated on three very imbalanced corpora consisting of emotional speech samples acquired in a spontaneous human-machine scenario in three different cultures, namely Spain, France and Norway, resulting in significant improvements of emotion identification rates.
The paper is organized as follows. Section II illustrates related works and Section III describes the dimensional model of emotions. Section IV develops the perceptual oversampling methodology. Then Section V shows the description and analysis of the EMPATHIC data for three languages and Section VI describes the experimental framework. Finally Section VII shows and discusses experimental results and Section VIII presents the main conclusions of the work.

II. RELATED WORK
Studies have shown that for several basic classifiers, a balanced dataset provides improved overall classification performance compared to an imbalanced one [28]. Hence the interest in balancing data. Solutions addressing the problem of imbalanced data tend to focus on the data level and the classifier level. Hybrid methods tend to combine their advantages. Undersampling [29] and oversampling [27], [30] are common data sampling methods. Undersampling removes some data from the majority class, which can lead to a loss of discriminating samples. Oversampling appends replicated data to the original dataset, so multiple instances of certain examples become "tied", leading to overfitting [28]. Furthermore, marginal and noisy examples are also replicated.
In order to introduce some generalization, the Synthetic Minority Over-sampling TEchnique (SMOTE) [31] was proposed. This approach aims to overcome imbalance in the original datasets by artificially generating data samples. To this end, it produces synthetic samples between an example and its nearest neighbours. The SMOTE method generates the same number of synthetic data samples for each original minority example. This procedure does not pay attention to neighbouring examples, which increases the occurrence of overlapping between classes [28]. To avoid this effect, various adaptive sampling methods [32] have been put forward. Some representative works include Borderline-SMOTE [33] and Adaptive Synthetic sampling (ADA-SYN) [34]. In the case of Borderline-SMOTE, borderline samples are identified and then over-sampled. On the other hand, ADA-SYN creates different amounts of synthetic data according to their distribution: more synthetic data are generated for minority class samples that are harder to learn than for minority samples that are easier to learn [34].
VOLUME 9, 2021
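The interpolation step of SMOTE described above can be illustrated with a minimal sketch. It is not the reference implementation of [31]: the function name `smote`, its parameters and the use of plain Euclidean distance are assumptions made for illustration only.

```python
import numpy as np

def smote(minority, n_per_sample, k=5, seed=0):
    """Generate n_per_sample synthetic points for every minority sample.

    Hypothetical helper: each synthetic point lies on the segment between
    a minority sample and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for i in range(n):
        for _ in range(n_per_sample):
            j = rng.choice(neighbours[i])   # pick one of the k neighbours
            gap = rng.random()              # position on the segment, in [0, 1)
            synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because the same `n_per_sample` is used for every minority example, the sketch also makes visible the limitation discussed in the text: samples far from and close to other classes are oversampled alike.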
Algorithm-level methods modify classifiers to alleviate their bias towards majority class. The most popular approaches are grouped as cost-sensitive learning. The classifier is modified to introduce varying penalty for each class. These methods have been used in many classification systems, including boosting, decision trees and Support Vector Machines (SVM) and recently Deep Neural Networks (DNN) [35].
Most of the mentioned work has been carried out in a binary classification context. Multi-class classification of imbalanced data has not been well developed because of the complexity of the task. Several difficulties can arise, namely:
- class overlapping may appear with more than two groups,
- class label noise may affect the problem, and
- borders between classes may be far from being clearly defined.
Therefore, data sampling procedures that take into account the variety of characteristics of the classes and the balanced performance of all of them should be proposed [26].
As mentioned above, corpora of spontaneous emotions acquired in realistic scenarios are scarce because a huge effort is needed for their development. They are also more challenging because the emotional state of the speakers is not predetermined. In fact, Speech Emotion Recognition (SER) from spontaneous data is still a challenging task and several further steps must be taken before SER can be considered ready for usage "in the wild" [36]. As a consequence, spontaneous SER systems generally perform worse than acted SER systems [37]-[40]. In addition, they face the problem of imbalance in data.
Few research studies have addressed the balancing of emotional training data. They generally target small-sample environments and acted data, in which the ratio between majority and minority classes is not very large. For example, a selective SMOTE algorithm based on acoustic parameters is described in [41]. The approach aims to avoid oversampling noisy samples. It is validated on small acted datasets (SAVEE [42], EMO-DB [43] and CASIA [44]) using an SVM classifier. In [45], Random and SMOTE oversampling are applied to balance the IMPROV [46] and IEMOCAP [47] datasets. In both acted datasets, the majority/minority class ratio is about 10%. The SMOTE technique is also employed with emotional YouTube data and evaluated with three learning techniques: multinomial Naive Bayes, decision trees and SVM [21]. Oversampling and undersampling are combined in [39] to increase movie data. An SVM algorithm then classifies the samples into fear/not fear.
As mentioned above, the Random oversampling approach suffers from replicating noisy samples and SMOTE performs blind interpolation with fuzzy class boundaries. The objective of this research is to overcome these drawbacks in the context of multi-class prediction. In this paper we first show the advantage of a borderline based on the intrinsic nature of the data. In this context, both ways of oversampling, by replication and by synthesis, are investigated. On the other hand, SMOTE varieties (such as Borderline-SMOTE and ADA-SYN) focus on fixing the number of artificial samples to prevent samples from overlapping. We then suggest using a dynamic number of neighbours to solve the same issue. This number is based on the perceptual borderline. This work is carried out in the context of a spontaneous emotional dataset in which the ratio between majority and minority classes is extremely high.

III. DIMENSIONAL SPACE OF EMOTIONS
Emotions are traditionally represented by two models: a discrete categorical model and a continuous dimensional one.
The categorical model is based on a set of mutually exclusive discrete "basic" categories (Fear, Surprise, ...). Thus, even though a person might undergo two emotions simultaneously, each emotion belongs to one and only one of the basic categories [48]. Several classification proposals have been introduced ([12], [15], [49], ...), but they do not seem to converge to the same final categories. The dimensional emotion representation is based on some psychological understandings. In the dimensional model, emotions are treated as being dimensional or continuous rather than discrete. The emotion plane is viewed as a continuous space where each point corresponds to a separate emotion state. The VAD is a three-dimensional model defined by the Valence, Arousal and Dominance axes. However, most models posit the existence of two fundamental dimensions: valence (or pleasantness) and intensity (or arousal), as represented in Figure 1. Valence varies from −1 (unpleasant) to 1 (pleasant) and can therefore be characterized as the level of pleasure. Arousal, on the other hand, represents the intensity of the emotional state and ranges from −1 (passive) to 1 (active) [51].
According to previous studies (see Section II), multi-class classification requires a deeper understanding of the intrinsic nature of imbalanced data. In particular, it is interesting to analyse the type of examples present in each class and their relations to the other classes [26]. When dealing with emotions, these analyses can be conducted in an emotional representation space, for instance the 2-D one. Indeed, the relationships between emotions in the 2-D plane are easier to interpret than in the high-dimensional acoustic one. In addition, the information extracted from these 2-D relationships is more reliable than that extracted from the acoustic space, as it comes from the manual annotation of the human perception of the emotional dimensions and not from sets of automatically calculated acoustic parameters.

IV. PERCEPTUAL OVERSAMPLING
We denote by borderline data the minority class samples that are at the frontier with the majority class. Our proposal consists in defining a border based on the intrinsic nature of our data, which in this case is based on the human perception of emotions through a manual annotation procedure. More precisely, this frontier is set in the 2-D emotional space, where each point represents the valence and arousal values perceived by annotators. For this reason, it is called the "Perceptual Borderline". This border will allow a better over-sampling of the data by replication/synthesis as well as the extension of the SMOTE algorithm to multi-class problems.

A. PERCEPTUAL BORDERLINE OVERSAMPLING (PBO)
The objective of the Perceptual Borderline Oversampling (PBO) approach is to over-sample the borderline samples more than the others. This oversampling is performed by replication. The PBO algorithm is described as follows:
PBO: Perceptual Borderline Oversampling
1) All training samples are projected into the 2-D space. A sample s is represented by its arousal and valence coordinates s(a,v).
2) The gravity center G of the majority class is computed.
3) For a sample s of the minority class:
a) Compute the distance d(s,G) between s and G.
b) Compare d(s,G) to a predefined threshold.
• if d(s,G) ≤ threshold, s is considered borderline. It is replicated R1 times.
• if d(s,G) > threshold, s is not borderline. It is replicated R2 times, with R2 ≤ R1.
4) Take another sample s and go to step 3.
Figure caption: distances between the center of gravity of the majority class (G) and samples of the minority classes are computed. Then, examples are considered to be borderline samples or not depending on these distances.
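The PBO steps listed above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name `pbo_indices` and its arguments are hypothetical, and "replicated R times" is assumed here to mean that the sample appears R times in total in the oversampled training set.

```python
import numpy as np

def pbo_indices(va, labels, majority, threshold, r1, r2):
    """Return the sample indices of the training set after PBO replication.

    va:     (n, 2) arousal/valence coordinates of the training samples
    labels: (n,) class labels; `majority` names the majority class
    r1, r2: replication counts for borderline / non-borderline samples
            (assumption: total number of occurrences, with r2 <= r1)
    """
    va = np.asarray(va, dtype=float)
    labels = np.asarray(labels)
    centre = va[labels == majority].mean(axis=0)   # gravity center G
    dist = np.linalg.norm(va - centre, axis=1)     # d(s, G) for every sample
    # Majority samples are kept once; minority samples get R1 or R2 copies
    copies = np.where(labels == majority, 1,
                      np.where(dist <= threshold, r1, r2))
    return np.repeat(np.arange(len(labels)), copies)
```

The returned indices can be used to index both the 2-D annotations and the acoustic feature vectors, so that the replication decided in the perceptual space is applied to the data actually fed to the classifier.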

B. PERCEPTUAL BORDERLINE SMOTE (PB-SMOTE)
This method is similar to the previous one except that the oversampling is carried out by the SMOTE algorithm, i.e. the algorithm generates synthetic samples (see Section II). The objective of Perceptual Borderline SMOTE (PB-SMOTE) is to investigate the impact of the perceptual borderline on synthetic oversampling. The algorithm is described as follows:
PB-SMOTE: Perceptual Borderline SMOTE
1) Training samples are projected into the 2-D space.
2) The gravity center G of the majority class is computed.
3) For a sample s of the minority class:
a) Compute the distance d(s,G) between s and G.
b) Compute the distance d(s,Si) between s and its N neighbours Si.
c) Compare d(s,G) to a predefined threshold.
• if d(s,G) ≤ threshold, s is considered borderline. It is assigned an oversampling rate R=R1.
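A possible reading of these steps is sketched below for a single minority class. Several points are assumptions and not taken from the listing above: non-borderline samples are given a rate R2 by analogy with PBO, neighbours are searched among samples of the same minority class, and synthesis uses the standard SMOTE interpolation in whatever space `features` lives in. The name `pb_smote` and all parameter names are hypothetical.

```python
import numpy as np

def pb_smote(features, va, centre, threshold, r1, r2, n_neighbours=5, seed=0):
    """Synthesise samples for ONE minority class (illustrative sketch).

    features: (n, d) vectors to interpolate (assumed space for synthesis)
    va:       (n, 2) arousal/valence coordinates of the same samples
    centre:   gravity center G of the majority class in the 2-D space
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    va = np.asarray(va, dtype=float)
    n = len(features)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(n_neighbours, n - 1)
    # d(s, G): distance of each minority sample to the majority center
    d_centre = np.linalg.norm(va - np.asarray(centre, dtype=float), axis=1)
    # R = R1 for borderline samples; R2 otherwise is an assumption
    rates = np.where(d_centre <= threshold, r1, r2)
    # d(s, S_i): distances to the other minority samples
    d_pair = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    np.fill_diagonal(d_pair, np.inf)
    neighbours = np.argsort(d_pair, axis=1)[:, :k]
    synthetic = []
    for i in range(n):
        for _ in range(rates[i]):
            j = rng.choice(neighbours[i])
            synthetic.append(features[i] + rng.random() * (features[j] - features[i]))
    return np.array(synthetic), rates
```

Passing the 2-D coordinates as `features` interpolates in the perceptual space, whereas passing acoustic vectors interpolates in the acoustic space; the sketch leaves that choice to the caller.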

C. STRETCHY SMOTE (S-SMOTE)
The SMOTE algorithm is based on a previously fixed number of neighbours. Stretchy SMOTE (S-SMOTE) is an extension of the SMOTE algorithm aimed at dealing with multi-class oversampling. Instead of considering a fixed number, the S-SMOTE method manages a dynamic number of neighbours, which depends on the distance to the majority class. The algorithm is described as follows:

S-SMOTE: Stretchy SMOTE
• Training data are projected into the 2-D space.
• Compute R, the ratio between majority and minority classes.
• The gravity center G of the majority class is also computed.
• For each sample s of a minority class:
1) Compute the distance d(s,G) between s and G.
2) Compare d(s,G) to a predefined threshold.

V. DATA ANALYSIS
This research is carried out in the context of the EMPATHIC project, where seniors interact with a virtual agent playing the role of a health professional coach. In this context, three highly imbalanced corpora of spontaneous emotions were acquired in human-machine scenarios in three different languages and cultures: Spanish, French and Norwegian.
The recordings were carried out in four main regions of Europe: the Basque Country in Spain, the Île-de-France and Bourgogne regions in France and the Oslo area in Norway. Participants were required to be over 64 years old, healthy and living independently. In the next subsections we analyse the perception experiments and their results as well as the emotion distribution in the three datasets.

A. PERCEPTUAL ANNOTATION OF EMOTIONS
The audio files of the three datasets were labelled by native collaborators in terms of emotions. Two dialogues are recorded per speaker: the first one implements an introductory session through some general aspects about the senior's lifestyle whereas the second one simulates a coaching session on healthy habits for nutrition. As a result, the corpus consists of 134 Spanish audio files, 76 French and 62 Norwegian files. Each file was annotated by three persons. The time limits that indicate changes in emotional state were also set by the annotators.
The perceived emotion was labelled according to both the categorical and the three-dimensional models. Thus, each emotion is described by four parameters:
• its category: Calm, Sad, Happy, Puzzled and Tense,
• its arousal level: excited (1), slightly excited (0) and neutral (−1),
• its valence: positive (1), neither positive nor negative (0) and negative (−1),
• its dominance level: rather dominant (1), neither dominant nor intimidated (0) and rather intimidated (−1).
For each language and corpus, the intersection of the three annotations is considered as the final label. The segments on which the annotators agree constitute the samples of the corpora. When only two annotators agree, we evaluate the level of confidence of their agreement to decide whether the segment will be included or discarded. To this end, the inter-annotator agreement W is computed as follows: where V is the per-event agreement, also known as the Jaccard index [52], and G is the global agreement (intersection over union). For these analyses, α is empirically set to 0.5. The agreement score W varies between 0 (for total disagreement) and 1 (for total agreement). If W ≥ threshold, the intersection is performed; otherwise the segment is removed. In this work, the threshold is fixed to 0.6.

B. DATASET IMBALANCE
The durations of the audio files are reported in Table 2 for each corpus and emotional category. Table 2 shows that the three datasets are highly imbalanced, which is the typical situation when dealing with spontaneous emotions and realistic tasks. In fact, the minority classes represent between 0.2% and 5% of the database, which are very low percentages.
For the sake of comparison, only the classes common to all datasets are considered for the experiments, namely Calm, Happy and Puzzled. The Calm category is designated as the majority class, whereas Happy and Puzzled are designated as minority classes. Ratios between the majority class and the minority classes are reported in Table 3. Table 3 shows that the Norwegian corpus is less balanced than the French one, which is in turn less balanced than the Spanish one. In addition, the Norwegian corpus faces not only the imbalance problem but also the small size of the dataset. The combination of these issues presents a new challenge to the community [28], [53].

C. EMOTIONS DISTRIBUTION
The number of speakers is 67 for the Spanish experiments, 38 for the French ones and 31 for the Norwegian ones. Let us classify these speakers into four groups:
- speakers who are always perceived as Calm, and thus labeled Calm,
- speakers whose audio files include segments labeled as Calm and segments labeled as Happy,
- speakers who generate both Calm and Puzzled segments, and
- speakers whose audio files include segments labeled as Calm, segments labeled as Happy and segments labeled as Puzzled.
Each group is represented by a bar in Figure 5. The x-axis indicates the number of speakers and the y-axis represents the percentage of segments per emotion. We notice that most Spanish and French speakers express all three emotions. However, the majority of Norwegians show only two emotions. Figure 5 also shows the speaker dependency of emotions. In fact, Calm segments are present in all the recordings, whereas Happy and Puzzled are concentrated in a subset of speakers. In particular, the Puzzled category only appears in the dialogues of six Norwegian speakers.
Thus, in this emotion recognition task, taking or skipping some speakers of the training/test set can lead to the presence or absence of emotions in that set. Hence the importance of choosing an adapted experimental protocol.

VI. EXPERIMENTAL PROTOCOL
As explained above, in Subsection V-C, some speakers do not show certain emotions, so an experimental protocol such as "leave one speaker out" does not guarantee the presence of all emotions in all the folds. As a consequence, we opted for a protocol where all the emotions are present in all the train/test folds. However, this experimental protocol can cause performance losses for the Norwegian system, as we will discuss in Section VII.
Each dataset has been divided into ten partitions. In order to obtain the same distribution of emotions in all partitions, each partition contains 10% of the occurrences of each emotion. The experimental protocol is based on 10 training/test folds where:
• training/test folds are not overlapping,
• folds cover the whole dataset,
• each test fold matches a partition,
• the training set contains the rest.
In the following subsections we will analyse the emotion distribution per training/test fold in terms of both categories and continuous dimensions.
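The partitioning protocol described above can be sketched as a stratified split. The helper names `stratified_partitions` and `folds`, as well as the shuffling and its seed, are assumptions made for illustration; the source only states that each partition holds 10% of each emotion.

```python
import numpy as np

def stratified_partitions(labels, n_parts=10, seed=1):
    """Split sample indices into n_parts partitions with equal class shares."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_parts)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Split the shuffled samples of this class into n_parts nearly equal chunks
        for p, chunk in enumerate(np.array_split(idx, n_parts)):
            parts[p].extend(chunk.tolist())
    return [np.array(sorted(p), dtype=int) for p in parts]

def folds(labels, n_parts=10, seed=1):
    """Yield (train_idx, test_idx): each test fold is one partition,
    the training set contains all remaining samples."""
    parts = stratified_partitions(labels, n_parts, seed)
    everything = np.arange(len(labels))
    for test_idx in parts:
        yield np.setdiff1d(everything, test_idx), test_idx
```

With a class of only ten samples, each test fold receives exactly one of them, which mirrors the situation reported for the Norwegian "Puzzled" class.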

A. CATEGORICAL DISTRIBUTION
The majority class contains enough data for all training/test folds, but minority classes can show a lack of data in some cases.
As defined by the protocol, the distribution of emotion categories is similar for the ten training/test folds. This is convenient for the Spanish and the French datasets but not for the Norwegian one. Indeed, the reduced amount of data in the Norwegian dataset results in only one sample of the "Puzzled" class per test fold. At the same time, training folds include very few "Puzzled" samples, which are not enough to train the emotion recognition models. As a consequence, recognition of the "Puzzled" class is expected to be very difficult.

B. DIMENSIONAL DISTRIBUTION
Unlike categorical parameters, which are discrete labels, dimensional parameters correspond to continuous values. Figure 6 shows the gravity center of each class in the 2-D model space, for each of the training and each of the test folds, for the three datasets. Regarding training datasets, gravity centers are quite close to each other. This is due to the abundance of data. For the test data, the situation is the opposite. For test folds, Spanish gravity centers largely overlap, i.e. classes are mixed. Nevertheless, French and Norwegian classes do not overlap so much. This bias comes from cultural differences and also from differences in the perception of annotators. We also notice that Spanish gravity centers are outside the circle. This reflects very low values of arousal for all classes along with positive values of valence. This is not the case for the other languages.
Regarding the imbalance of the 2-D model parameters, Figure 6 also shows a clear imbalance of arousal and a slightly better balance of valence samples. Also, high activation values as well as negative values of valence do not appear in these datasets. This is related to the nature of the task, since high-arousal and negative emotions, such as stress or nervousness (see Figure 1), are not expected in the EMPATHIC human-machine interaction task previously described. Moreover, only a small part of the dimensional space is occupied by the annotated data. For more information on the distribution of classes, the Euclidean distances between the center of gravity of the majority class, i.e. the Calm category, and those of each of the minority classes, i.e. Happy and Puzzled, are computed and reported in Tables 4 and 5 for training and test folds respectively. These tables show that Calm is mostly closer to Puzzled for French but closer to Happy for the Spanish and Norwegian datasets.
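The distances between gravity centers described above reduce to a few array operations. The sketch below is only an illustration; the function name `centre_distances` and the default majority label are assumptions.

```python
import numpy as np

def centre_distances(va, labels, majority="Calm"):
    """Euclidean distance between the gravity center of the majority class
    and the gravity center of every other class in the 2-D space."""
    va = np.asarray(va, dtype=float)
    labels = np.asarray(labels)
    g = va[labels == majority].mean(axis=0)   # gravity center of the majority class
    return {str(c): float(np.linalg.norm(va[labels == c].mean(axis=0) - g))
            for c in np.unique(labels) if c != majority}
```

Applied separately to each training and test fold, such a function yields one distance per minority class and fold.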

C. EVALUATION METRICS
Traditionally, the most frequent metrics for classification tasks are Accuracy and Error Rate, also noted as 1−Accuracy. Additional information can also be obtained from the analysis of the confusion matrix. Furthermore, Precision, Recall and F value provide specific information about the quality and sensitivity of the model. By convention the class label of the minority class is positive and the class label of the majority class is negative. The F value is defined as:
$$F_\beta = \frac{(1+\beta^2)\cdot \mathrm{Recall} \times \mathrm{Precision}}{(\beta^2 \cdot \mathrm{Precision}) + \mathrm{Recall}}$$
where β is a coefficient to adjust the relative significance of Precision versus Recall. If β < 1 then more significance is given to Precision, whereas if β > 1 then more significance is given to Recall. For β = 1, both are equally significant.
Many studies in the literature illustrate the ineffectiveness of Accuracy in the imbalanced learning scenario [28], [54]. The F value is a popular evaluation metric that lessens this effect. It combines Precision and Recall, which are effective metrics in the information retrieval community, where the imbalance problem exists [33].
Nevertheless, in the context of emotion recognition on imbalanced datasets, the average Recall, denoted as Unweighted Accuracy (UA), is commonly used [38], [40], [45], [55]. It is defined as the average of the Recall obtained on each class:
$$\mathrm{UA} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Recall}_i$$
where Recall_i is the Recall value for class i and N is the number of classes. Unlike Accuracy, UA gives the same importance to all classes. For this research, Unweighted Accuracy as well as F value are used as evaluation metrics. We consider the F2 score, which is the F value when β = 2.
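As an illustration of the metrics discussed in this subsection, the sketch below derives per-class Precision and Recall, the F value and UA from a confusion matrix. The function names and the convention that rows are true classes and columns are predicted classes are assumptions of this example.

```python
import numpy as np

def per_class_scores(confusion):
    """confusion[i, j] = number of class-i samples predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    support = confusion.sum(axis=1)     # true samples per class
    predicted = confusion.sum(axis=0)   # predictions per class
    # Undefined ratios (empty class or no prediction) are reported as 0
    recall = np.divide(tp, support, out=np.zeros_like(support), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(predicted), where=predicted > 0)
    return precision, recall

def f_beta(precision, recall, beta=2.0):
    """F value per class; beta = 2 gives the F2 score."""
    denom = beta ** 2 * precision + recall
    return np.divide((1 + beta ** 2) * precision * recall, denom,
                     out=np.zeros_like(denom), where=denom > 0)

def unweighted_accuracy(recall):
    """Average of the per-class Recall values."""
    return float(np.mean(recall))
```

For a classifier that assigns almost everything to the majority class, Accuracy stays high while UA drops towards 1/N, which is why UA is preferred in this setting.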
Confidence interval measurement is commonly used in speech recognition [56], [57] to measure the reliability of the error or recognition rate. This measure can be applied in many classification problems, including speech emotion recognition [54]. According to [56], the 'true' classification rate has a probability x of falling in the confidence interval [P−,P+], where P is the measured classification rate. The limits of the confidence interval [P−,P+] are obtained as follows: where N is the number of tests, z 90% = 1.64, z 98% = 2.33, etc. and ± being + for P+ and − for P−. In this work, UA represents the classification rate P. Confidence levels are calculated for the best results reported in Subsection VII-F through Tables 16, 17, 18, 19 and 20.

VII. EXPERIMENTS AND RESULTS
Inspired by the source-filter model of speech production, researchers have extracted a large number of acoustic parameters to analyse speech content. Some of these acoustic features have subsequently been used to recognize emotions [58]. These are parameters derived from the time domain (e.g. speech rate), frequency domain (e.g. pitch), amplitude domain (e.g. energy) and spectral distribution domain (e.g. relative energy in different frequency bands) [59], [60]. Recently, the raw audio as well as spectrograms [61], [62] have been supplied as input to a Convolutional Neural Network to extract the acoustic characteristics for emotion recognition purposes [63]. In this research, we use the following acoustic parameters: "zero crossing rate", "energy", "energy entropy", "spectral centroid", "spectral spread", "spectral entropy", "spectral flux", "spectral rolloff", "13 mfcc coefficients", "13 chroma features", "log(F0)" and "Harmonic Noise Ratio (HNR)". These parameters are extracted using the tools parselmouth [64] for F0 and HNR and pyaudioanalysis [65] for the rest. Based on frame-by-frame extraction [36], acoustic features are computed for each 50 milliseconds of signal. Then the average and the standard deviation of all frame-level features are computed over the segment, leading to only one vector per speech segment. Hence each sample is represented by a 72-dimensional vector of acoustic information.
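The segment-level representation can be illustrated with a short sketch. The parameter list above amounts to 8 + 13 + 13 + 2 = 36 values per frame, so concatenating their mean and standard deviation gives the 72 dimensions mentioned in the text. The function name `segment_vector` and the ordering (means first, then standard deviations) are assumptions; the frame-level extraction itself is not reproduced here.

```python
import numpy as np

def segment_vector(frame_features):
    """Collapse frame-level features into one vector per speech segment.

    frame_features: (n_frames, 36) array, one row per 50 ms frame
    returns:        (72,) array, per-feature mean followed by per-feature std
    """
    frame_features = np.asarray(frame_features, dtype=float)
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.std(axis=0)])
```

Whatever the segment duration, the resulting vector has a fixed size, which is what allows a multi-layer perceptron with a fixed input layer to be used.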
The experiments are carried out using a deep neural network predictor implemented with the Theano library [66]. Different hyper-parameters have been tested empirically on a development set (10% of fold0 of the training set). As a result of this procedure, the first two layers of the multi-layer perceptron have 50 neurons each with a ReLU activation function, and the output layer is a Softmax. The RMSProp algorithm [67] is used for optimization. The number of epochs is set to 100 after some training/dev evaluations to avoid overfitting. Batch normalization is performed with a small batch size of 16 samples. All experiments were run on a GPU machine (with an Nvidia TITAN Xp graphics card). The average time is one hour per experiment. The random seed is set to 1 in order to make the experiments reproducible.
It is important to mention that the oversampling process concerns only the training data. So, even if the test set is also highly imbalanced, no test data oversampling is performed. For comparison purposes, the same oversampling rate R is adopted in all the experiments; it is set to the ratio between the number of samples of the majority class and that of the minority class.
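A minimal sketch of how such a rate could be derived from the training labels is given below. It assumes that one rate is computed per minority class; the function name and the class counts are hypothetical.

```python
from collections import Counter

def oversampling_rates(train_labels):
    """Return R = (majority count) / (minority count) for each minority class.
    Only training labels are passed in; the test set is never oversampled."""
    counts = Counter(train_labels)
    majority = max(counts.values())
    return {label: majority / n for label, n in counts.items() if n < majority}

# Invented class counts, for illustration only.
train_labels = ["Calm"] * 60 + ["Happy"] * 12 + ["Puzzled"] * 6
rates = oversampling_rates(train_labels)
```

With these toy counts, Happy would be oversampled at R = 5 and Puzzled at R = 10.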

A. BASELINE EXPERIMENTS
A preliminary set of experiments was carried out without oversampling for the three datasets. The evaluation performance is reported in Table 6 in terms of Precision and Recall per class. In addition, the overall results per dataset are also reported in terms of average of Precision, UA, F2 value and Accuracy scores.
In the absence of oversampling, the model overfits the majority class (Calm) and almost all test samples are classified as Calm. Hence, the Calm scores are high. Overall Accuracy, which is mostly driven by majority class identification, is also high. The minority classes, in contrast, are rarely recognized and their Recall values are very low. In fact, Precision values for the minority classes are not informative because almost no test samples are assigned to the Happy or Puzzled categories. Thus, these Precision values are computed as a ratio of two very small numbers. Nevertheless, tasks involving the recognition of spontaneous emotions frequently need to focus on minority classes.
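The effect described above can be reproduced on a toy example. The sketch below assumes that UA is the unweighted mean of the per-class Recall values, as is customary; the label counts are invented and the classifier is replaced by a constant Calm prediction.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall)}; None marks an undefined ratio."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        precision = tp / predicted if predicted else None  # nothing assigned to c
        recall = tp / actual if actual else None
        scores[c] = (precision, recall)
    return scores

def unweighted_accuracy(scores):
    """Mean of the per-class recalls (assumed definition of UA)."""
    recalls = [r for _, r in scores.values() if r is not None]
    return sum(recalls) / len(recalls)

labels = ["Calm", "Happy", "Puzzled"]
y_true = ["Calm"] * 90 + ["Happy"] * 7 + ["Puzzled"] * 3
y_pred = ["Calm"] * 100                  # a model that always answers Calm

scores = per_class_scores(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
ua = unweighted_accuracy(scores)
```

Here Accuracy reaches 0.9 while UA stays at one third, and the Precision of Happy and Puzzled is undefined because no sample is assigned to them.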

B. BORDERLINE SAMPLES DISTRIBUTION
The distribution of borderline samples is computed a priori in order to understand the impact of the proposed perceptual borderline sampling. To this end, we counted the total number of borderline samples. Then, we deduced the proportion of each class (see Figure 7). For Spanish and French, there are more Puzzled borderline samples than Happy ones. Nevertheless, Table 4 shows that the centers of gravity of the Spanish folds for Calm are closer to the centers of gravity of Happy folds than to the centers of gravity of Puzzled folds. This can be due to the higher number of Puzzled samples for Spanish. In addition, their distribution could be somewhat flat because of the coarse annotation.
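The counting procedure can be sketched as follows, assuming that a boolean borderline flag is already available for every minority sample (the borderline criterion itself is defined elsewhere in the paper). The function name and the toy data are hypothetical.

```python
from collections import Counter

def borderline_proportions(labels, is_borderline):
    """Share of each class among the samples flagged as borderline."""
    counts = Counter(l for l, b in zip(labels, is_borderline) if b)
    total = sum(counts.values())
    return {l: n / total for l, n in counts.items()} if total else {}

# Invented minority samples and flags, for illustration only.
labels = ["Happy", "Puzzled", "Puzzled", "Happy", "Puzzled"]
flags = [True, True, True, False, False]
proportions = borderline_proportions(labels, flags)
```

In this toy case, two thirds of the borderline samples are Puzzled and one third are Happy.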

C. PERCEPTUAL BORDERLINE (PB)
Random oversampling (RO) and Perceptual Borderline Oversampling (PBO) experiments are performed in order to increase the size of the minority class training data. Both are carried out under the same conditions, as follows:
• RO algorithm: all the samples of minority classes are replicated R times.
• PBO algorithm: described in Subsection IV-A, it deals with a pair of (minority/majority) classes. It is adapted to the three-class context as follows:
- Happy class: borderline samples are replicated R1 = α * R times and other samples R2 = (2-α) * R times.
- Puzzled class: borderline samples are replicated R1 = β * R times and other samples R2 = (2-β) * R times.
Note that RO is equivalent to PBO for α = β = 1.
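The replication rule above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: borderline flags are taken as given, "replicated R times" is read as R copies in the oversampled set, and non-integer rates are rounded, which the text does not specify. The function name and the sample identifiers are hypothetical.

```python
def pbo_oversample(samples, is_borderline, rate, weight):
    """Replicate borderline samples R1 = weight * rate times and the other
    samples R2 = (2 - weight) * rate times (weight plays the role of α or β)."""
    r1 = round(weight * rate)          # rounding is an assumption
    r2 = round((2 - weight) * rate)
    oversampled = []
    for sample, borderline in zip(samples, is_borderline):
        oversampled.extend([sample] * (r1 if borderline else r2))
    return oversampled

happy = ["h1", "h2", "h3"]             # invented sample identifiers
flags = [True, False, False]           # h1 lies at the perceptual borderline

pbo = pbo_oversample(happy, flags, rate=8, weight=0.75)   # α = 0.75
ro = pbo_oversample(happy, flags, rate=8, weight=1.0)     # α = 1 reduces to RO
```

With α = 0.75 and R = 8, the borderline sample is copied 6 times and each other sample 10 times, so fewer replicas are placed near the majority class than with plain random oversampling.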

1) SPANISH RESULTS
Unweighted Accuracy (UA) is reported in Table 7 for α and β values ranging from 0.25 up to 1.75. We notice that, for each pair (α, β), UA is higher when α and β are close to each other and β ≥ α. Indeed, according to Figure 7, there are more Puzzled samples at the borderline than Happy samples. The best PBO performance (58.69%) is reached for α = β = 0.75. This corresponds to an absolute gain of 1.6% compared to RO, for which UA = 57.09% (α = β = 1). Since Calm is closer to Happy than to Puzzled (Table 4), it seems that the model is more overfitted to Happy than to Calm.

2) FRENCH RESULTS
The Unweighted Accuracy of the French emotion experiments is reported in Table 8. Table 4 shows that the centers of gravity of the French folds for Calm are closer to the centers of gravity of Puzzled folds than to the centers of gravity of Happy folds. In addition, Figure 7 shows a higher number of borderline Puzzled samples than Happy samples. Hence α ≤ β. Indeed, UA is mostly higher under the diagonal. The best PBO performance is obtained for α = 0.25 and β = 1.0. This corresponds to an absolute gain of about 3% compared to RO.
Only two exceptions are noticed when α β or β α.

3) NORWEGIAN RESULTS
The Unweighted Accuracy of the Norwegian emotion experiments is reported in Table 9. According to Table 4, the centers of gravity of the Norwegian folds for Calm are closer to the centers of gravity of Happy folds than to the centers of gravity of Puzzled folds, as for the Spanish experiments. In addition, Figure 7 shows a considerably higher number of borderline Happy samples than Puzzled samples. Hence α ≥ β. Indeed, UA is mostly higher above the diagonal.
The test folds contain only one Puzzled sample. Moreover, the training set contains very few such samples (12), which explains the low scores and the larger number of exceptions than in the other datasets (mainly when β ≥ 1.25). The best PBO performance is obtained for α = 1.75 and β = 0.25. This corresponds to an absolute gain of about 6% compared to RO.

D. PERCEPTUAL BORDERLINE SMOTE (PB-SMOTE)
We now apply the perceptual borderline algorithm of Subsection IV-B. This algorithm is similar to the PBO algorithm of Subsection IV-A, which was applied in the experiments of the previous section, except that new samples are now artificially synthesized instead of being obtained by replication. The SMOTE algorithm is the state-of-the-art oversampling method by artificial synthesis. The number of neighbours is set to 5 for these experiments. SMOTE is equivalent to Perceptual Borderline SMOTE (PB-SMOTE) when α = β = 1.
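A minimal sketch of this idea is given below. It is not the algorithm of Subsection IV-B verbatim: borderline flags are taken as given, the number of synthetic samples per original sample is obtained by rounding α * R or (2-α) * R, and neighbours are searched among the minority samples with the Euclidean distance. All names and the two-dimensional toy points are hypothetical.

```python
import math
import random

def nearest(x, pool, k):
    """The k samples of pool closest to x (x itself excluded)."""
    others = [p for p in pool if p is not x]
    return sorted(others, key=lambda p: math.dist(x, p))[:k]

def interpolate(x, neighbour, rng):
    """SMOTE step: a random point on the segment between x and a neighbour."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbour)]

def pb_smote(minority, is_borderline, rate, weight, k=5, seed=1):
    """Generate round(weight * rate) synthetic samples per borderline sample
    and round((2 - weight) * rate) per other sample."""
    rng = random.Random(seed)
    synthetic = []
    for x, borderline in zip(minority, is_borderline):
        n_new = round((weight if borderline else 2 - weight) * rate)
        neighbours = nearest(x, minority, k)   # needs at least 2 minority samples
        for _ in range(n_new):
            synthetic.append(interpolate(x, rng.choice(neighbours), rng))
    return synthetic

# Invented 2-D minority samples; the first two are flagged as borderline.
minority = [[0.0, 0.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.0], [4.0, 3.0], [5.0, 2.5]]
flags = [True, True, False, False, False, False]

pb = pb_smote(minority, flags, rate=4, weight=0.5)
plain = pb_smote(minority, flags, rate=4, weight=1.0)   # reduces to SMOTE
```

Setting the weight below 1 shifts the synthesis effort away from the borderline samples, while a weight of 1 gives every sample the same number of synthetic neighbours, as in standard SMOTE.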

1) SPANISH RESULTS
Unweighted Accuracy (UA) is reported in Table 10 for α and β values ranging from 0.25 up to 1.75. Table 10 shows that UA values are generally higher when β ≥ α, as in the case of the PBO algorithm. However, there are more exceptions when α, β > 1. The best PB-SMOTE performance is obtained for α = 1.0 and β = 1.75. This corresponds to an absolute gain of about 1.4% compared to SMOTE.

2) FRENCH RESULTS
Unweighted Accuracies (UA) obtained through the French emotion recognition experiments are reported in Table 11 for α and β values ranging from 0.25 up to 1.75. Table 11 shows that UA is generally higher when β ≥ α, as in the case of the PBO algorithm. There are some exceptions when α ≥ 1.5 and/or β ≥ 1.5. This table also suggests that high oversampling ratios for both classes, Happy and Puzzled, result in a high number of synthetic samples generated in the same area. This could lead to the overlapping of the synthetic samples.
The best PB-SMOTE performance is obtained for α = 0.25 and β = 1.0. This corresponds to an absolute gain of more than 3% compared to SMOTE.

3) NORWEGIAN RESULTS
For highly imbalanced datasets, the minority class is often poorly represented and lacks a clear structure. Hence, methods that rely on relations between minority objects (like SMOTE) tend to fail [26]. In this case, the imbalance rate for the Puzzled class is greater than 220, which is very high. Let us now see the impact of the perceptual borderline on SMOTE performance in such a case. To this end, Unweighted Accuracy is reported in Table 12. UA is generally higher above the diagonal. When α ≥ 1.25 and β ≥ 1.25, UA is often lower. In this case, we are generating more samples in the borderline area than in the rest of the acoustic space. The UA degradation could be due to a number of new artificial samples of the Puzzled class that overlap the new Happy ones.

E. STRETCHY SMOTE (S-SMOTE)
The S-SMOTE algorithm (see Subsection IV-C) fixes the oversampling ratio R as the ratio between majority and minority classes, which is the same for all samples. As a consequence, the borderline is not used to select samples for which a different ratio has to be applied, as in the PBO (Subsection IV-A) and PB-SMOTE (Subsection IV-B) algorithms.
When new samples are generated by the SMOTE technique, a fixed number of neighbours N has to be considered. We now want to verify the hypothesis that borderline samples should have fewer neighbours than the others, as the S-SMOTE algorithm proposes. To this end, we denote by N1 the number of neighbours of the borderline samples and by N2 the number of neighbours to be considered for the other samples.
When N1 = N2, Stretchy SMOTE is equivalent to SMOTE.
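The two-level neighbourhood used in these experiments can be sketched as follows. This is a simplified illustration, not the algorithm of Subsection IV-C: borderline flags are taken as given, every sample produces the same integer number of synthetic samples, and only the size of the neighbourhood differs (n1 for borderline samples, n2 for the others). The function name and the toy points are hypothetical.

```python
import math
import random

def stretchy_smote(minority, is_borderline, rate, n1, n2, seed=1):
    """SMOTE with n1 neighbours for borderline samples and n2 for the others;
    each sample yields `rate` synthetic samples."""
    rng = random.Random(seed)
    synthetic = []
    for x, borderline in zip(minority, is_borderline):
        k = n1 if borderline else n2
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(rate):
            neighbour = rng.choice(neighbours)
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Invented minority samples on a line; only the first one is borderline.
minority = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]]
flags = [True, False, False, False]

synthetic = stretchy_smote(minority, flags, rate=3, n1=1, n2=3)
```

With n1 = 1, the synthetic samples derived from the borderline sample stay on the segment towards its single nearest neighbour, whereas the other samples may interpolate towards more distant ones. Choosing n1 = n2 recovers standard SMOTE.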
In order to compare S-SMOTE to SMOTE, the same total number of neighbours is considered. To this end,

1) SPANISH RESULTS
For the Spanish data we obtained:
- if N1 ≤ N2 and the difference N1 − N2 is small, then generally UA(N1/N2) ≥ UA(N2/N1);
- if N1 ≥ N2, then generally UA(N1/N2) ≤ UA(N2/N1).
As shown in Figure 6, the classes are close to each other. Therefore, when the number of neighbours is high, a new synthetic sample of one class may overlap a sample of a different class. The best performance of S-SMOTE corresponds to UA = 57.36% and is reached for N1 = 4 and N2 = 8. It corresponds to an absolute gain of about 0.7% compared to the best SMOTE result, which is UA = 56.71%.

2) FRENCH RESULTS
For the French dataset we found that if N1 < N2 then UA(N1/N2) ≥ UA(N2/N1). According to Figure 6, the French classes are more distant from each other than the Spanish classes, so they are less confused when the number of neighbours is high.
The best S-SMOTE performance (UA = 62.16%) is obtained for N1 = 3 and N2 = 7. This corresponds to an absolute gain of about 1.7% compared to the best SMOTE performance (UA = 60.43%), which is achieved for N1 = N2 = 7.

3) NORWEGIAN RESULTS
We can draw roughly the same conclusions for the Norwegian data as for the Spanish data. However, the scores are considerably lower.
The best S-SMOTE performance corresponds to UA = 49.92% for N1 = 4 and N2 = 8, while the best SMOTE score is UA = 49.19% for N1 = N2 = 3. The absolute improvement is about 0.7%.
Figure 5 shows that some emotions can only be found in a reduced set of speakers. As a consequence, a classical experimental protocol such as leave-one-speaker-out would lead to the lack of certain emotions in the training or in the test set. In addition, the distribution of data between training and test folds could be different. Hence, we opted for a protocol where all emotions are present in all the training/test folds as described in Section VI, which is now named Protocol 1. However, Protocol 1 results in very low scores for the Norwegian system (Subsections VII-C, VII-D and VII-E). Indeed, the cardinality of the test set is always equal to 1 for the Puzzled class. In this Subsection we analyze the best results obtained. The overall results per dataset are also reported in terms of average of Precision scores, UA and F2 value. Confidence intervals are calculated for UA as described in Subsection VI-C, resulting in a different confidence level for each protocol and dataset.

F. DISCUSSION OF RESULTS
The best performances are obtained for the French dataset and are similar for both protocols. In fact, in this case, the three classes are more distant from each other (see Figure 6) and the confidence level is 92% for both protocols. The Spanish dataset is less imbalanced (Table 3) than the French and Norwegian datasets, but its classes overlap more. Protocol 1 provides better performance than Protocol 2, but the confidence level is lower than for French, namely 80% for Protocol 1 and 85% for Protocol 2. Norwegian results vary depending on the protocol. This dataset is extremely imbalanced and contains very few samples of the Puzzled class (see Table 3), which results in a very difficult task and, consequently, lower results are expected. The confidence level for Protocol 1 is 98%. Even if the UA is low, the system classifies two classes well, but not three, because of the lack of samples of one class. On the contrary, the UA for Protocol 2 is higher because of the absence of Puzzled samples in many test folds. The confidence level is 80%.
Let us now compare the results in Table 6 obtained without oversampling with the results in Tables 16 and 17, in terms of Recall per class and UA. We notice that Recall decreases for the majority class and increases for minority ones resulting in a significant improvement of the UA scores, when oversampling is applied.
Finally, we highlight that the proposed approach (PBO) improves the classification performance in both protocols.
Tables 18 and 19 report emotion recognition Precision (Pre.) and Recall (Rec.) per class for SMOTE and Perceptual Borderline SMOTE (PB-SMOTE) when Protocol 1 and Protocol 2 are used, respectively. In addition, the overall results per dataset are also reported in terms of average of Precision scores, UA and F2 value.

2) PERCEPTUAL BORDERLINE SMOTE
In general terms, we can draw similar conclusions to those from the previous tables. Moreover, we can conclude that PB-SMOTE performs better than SMOTE in terms of UA with confidence levels of 80% and 92% for Spanish data, 92% and 98% for French data, and 98% and 90% for Norwegian data. In addition, we notice that SMOTE performs slightly worse than RO for all languages and protocols. Furthermore, PBO is generally the most efficient, except for French and Protocol 2. Table 20 shows the best emotion recognition scores for SMOTE and Stretchy SMOTE (S-SMOTE) with Protocol 1.

3) STRETCHY SMOTE
This table also shows that the overall performances obtained by S-SMOTE are close to those obtained by PB-SMOTE, resulting in confidence levels around 80%. Notice that in S-SMOTE, α and β are fixed to 1 and thus only the number of neighbours varies. As a consequence, borderline samples are oversampled at the same ratio as the other samples. As a result, the blind interpolation issue of SMOTE arises and the new synthetic samples of one class could overlap some examples of other classes.

VIII. CONCLUDING REMARKS
This research addresses the understanding of human behaviour through the analysis of the speech signal in the context of human-machine interaction. More specifically, we focus on the speech-based identification of the seniors' emotional status during their interaction with a virtual agent playing the role of a health professional coach. In this realistic scenario, only a very small set of spontaneous emotions was identified in human perception experiments, and the number of samples differs hugely across emotions, resulting in an imbalanced dataset problem.
To deal with this issue, several data balancing algorithms have been proposed in the context of binary classification. Nevertheless, most of these methods are ill-suited to a multi-class classification problem. Indeed, we may lose performance on one class while trying to gain it on another. Consequently, there is a need for new methods that consider the specific characteristics of the data and its nature. Some examples of new information to be additionally considered are the distribution of classes or their boundaries.
Emotions are often represented in a continuous dimensional space. Our contribution is to take advantage of this perceptual, visual and easy-to-interpret dimensional space to examine class borderlines in this space. This limit, based on the perceived arousal and valence, leads to two methods of balancing the data: the Perceptual Borderline Oversampling, where the oversampling is carried out by replication, and the Perceptual Borderline SMOTE, which generates artificial samples. In this case, the proposed methods avoid the overlap of the samples of the different classes.
The additional contribution of this work, denoted by Stretchy SMOTE, is an alternative to Perceptual Borderline SMOTE, where the same oversampling rate is applied to the data of the same class but the number of neighbors is variable. This number depends on the distance between the minority class samples and the majority class, which is also calculated in the perceptual emotion plane.
These proposals are implemented and compared to state-of-the-art approaches, namely Random Oversampling and Synthetic Minority Oversampling. The experimental evaluation was carried out on three extremely imbalanced datasets of spontaneous emotions acquired in human-machine scenarios in three different cultures: Spain, France and Norway. The emotion recognition results obtained by neural network classifiers show that the proposed perceptual oversampling methods lead to significant improvements when compared to the state-of-the-art, for all scenarios and languages.