Coupling Sentiment and Arousal Analysis Towards an Affective Dialogue Manager

We present the technologies and host components developed to power a speech-based dialogue manager with affective capabilities. The overall goal is that the system adapts its response to the sentiment and arousal level of the user inferred by analysing the linguistic and paralinguistic information embedded in his or her interaction. A linguistic-based, dedicated sentiment analysis component determines the body of the system response. A paralinguistic-based, dedicated arousal recognition component adjusts the energy level to convey in the affective system response. The sentiment analysis model is trained using the CMU-MOSEI dataset and implements a hierarchical contextual attention fusion network, which scores an Unweighted Average Recall (UAR) of 79.04% on the test set when tackling the task as a binary classification problem. The arousal recognition model is trained using the MSP-Podcast corpus. This model extracts the Mel-spectrogram representations of the speech signals, which are exploited with a Convolutional Neural Network (CNN) trained from scratch, and scores a UAR of 61.11% on the test set when tackling the task as a three-class classification problem. Furthermore, we highlight two sample dialogues implemented at the system back-end to detail how the sentiment and arousal inferences are coupled to determine the affective system response. These are also showcased in a proof of concept demonstrator. We publicly release the trained models to provide the research community with off-the-shelf sentiment analysis and arousal recognition tools.


I. INTRODUCTION
The market penetration of smart devices is increasing every year and is changing the way users interact with technology. For instance, the launch of voice-based Virtual Assistants (VAs), such as Siri™ (Apple), Alexa™ (Amazon), Cortana™ (Microsoft), Bixby™ (Samsung), Celia™ (Huawei), or Google Assistant™, has advanced the Human-Computer Interaction (HCI) field, as these deploy hardware and software components that allow users to interact verbally with these assistants towards a natural interaction. Current VAs focus on the analysis of the linguistic information to provide this sort of natural interaction. Nevertheless, human-human communication is more complex, as nonverbal communication is a fundamental and decisive aspect of the interaction. Hence, to boost the user experience when interacting with VAs towards a more natural and realistic interaction, there is a need to power these assistants with affective capabilities by means of affective computing technologies [1], [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
Research works on VAs with affective capabilities can be found in the literature. Among the most recent examples, we highlight the EMPATHIC Virtual Coach [3] and the Ryan agent [4]. The former [3] modifies the agent's voice based on the user's emotional state, which is inferred from the user's face and the paralinguistic information embedded in the user's voice recorded during the interaction. The latter [4] includes an affective dialogue manager able to generate responses based on the inferred emotions of the users. Despite considering multimodal information (the system features sentiment analysis and face emotion recognition), the information inferred from a single modality is sufficient to determine the affective response.
We present the technologies developed for sentiment and arousal analysis, so that a speech-based dialogue manager can adapt the system response to the sentiment and arousal level conveyed by the user during the interaction. We utilise a customised smartphone app as the gateway for users to communicate and interact with the system. The dialogue manager features a dedicated sentiment analysis component, which exploits the linguistic information embedded in the user's voice, and a dedicated arousal recognition component, which analyses the paralinguistic information. While the output of the former determines the body of the system answer, the output of the latter conditions the level of energy to convey in the response. We detail two of the sample dialogues deployed at the back-end of the system, to exemplify the system logic in the specific use case of an agent that engages its users with short affective dialogues at different points throughout their working day [5], and provide a proof of concept demonstrator to showcase the implemented affective dialogue manager. An additional contribution of this work is the public release of the Application Programming Interfaces (APIs) developed to interact with the trained models, in an attempt to provide the research community with off-the-shelf sentiment analysis and arousal recognition tools.
The scientific contribution of this work focuses on determining the optimal sentiment analysis and arousal recognition models to deploy in the system, which are trained using the CMU-MOSEI dataset [6] and the MSP-Podcast corpus [7], respectively. The CMU-MOSEI dataset is annotated in terms of both sentiment and emotion. Although the sentiment annotations are in the continuous space, the emotional annotations are in the categorical space. Hence, for a fine-grained arousal recognition, we opt for the MSP-Podcast corpus, as it provides affective annotations in the continuous space. In the sentiment analysis literature, a range of conventional [8], [9] and deep learning [10] approaches have been explored. Recurrent Neural Networks (RNN) are a specific deep learning technique suitable for sentiment analysis, as it is a sequence modelling task with variable length inputs. The goal of an RNN is to learn an embedded representation of the input sequence, which is then coupled with a classification block responsible for the actual inference. This embedded representation usually corresponds to the hidden state of the RNN produced at the last time step of the input sequence, which encodes information from the whole sequence, but excludes the previous hidden states from the preceding computations, losing potential information. The experiments we conduct target the assessment of this aspect, as we hypothesise that an effective fusion of the hidden states learnt at each time step could help improve the performance of the sentiment analysis models. Following the current trends in the Artificial Intelligence (AI) domain, researchers have recently started investigating the utilisation of Transformers [11], [12] and Large Language Models (LLM) [13] for sentiment analysis. In the paralinguistic-based affective computing literature, a wide range of feature representations [14] and network architectures [15], [16] have been studied, highlighting the dependency of the models' performance on the available data and the targeted application. Thus, we compare the performance of arousal recognition models trained with different neural network architectures exploiting hand-crafted and deep-learnt representations extracted from the speech signals.
The rest of the paper is organised as follows. Section II introduces the datasets explored to train the sentiment analysis and the arousal recognition models. Section III describes the methodology followed. Specifically, this section provides an overview of the composite system and reports on the research conducted in both research areas. Section IV summarises and analyses the results obtained from the experiments conducted. Section V provides a proof of concept demonstrator, showcasing the overall affective dialogue manager implemented, and Section VI concludes the paper.

II. DATASETS
This section introduces the two datasets exploited in this work. Section II-A presents the CMU-MOSEI dataset [6], used to train the sentiment analysis models, while Section II-B describes the MSP-Podcast corpus [7], employed to train the arousal recognition models.

A. CMU-MOSEI DATASET
The data used for training the sentiment analysis model belongs to the CMU-MOSEI dataset [6]. This is one of the largest gender-balanced multimodal datasets for sentiment analysis and emotion recognition in English, containing more than 3 000 video clips with language, vision, and acoustic features extracted from over 65 hours of video. To download and process the data, we use the CMU Multimodal Data SDK¹ [17]. For the purpose of our study, we exploit the original sequences available, and their corresponding 300-dimensional word embedding representations extracted using Global Vectors (GloVe) [18]. We opt for the exploitation of the GloVe word embeddings for consistency with previous works in the literature exploiting the CMU-MOSEI dataset [6], [17]. Early processing of the corpus using the available SDK includes the alignment of both linguistic representations and the segmentation of the original videos into the corresponding sentences, and their splitting into the pre-defined train, development, and test partitions. The compiled vocabulary contains 16 824 tokens. Table 1 synthesises the statistics of the resulting data.
Each sentence in the dataset is annotated with a sentiment score in the range [−3, 3], determined by 3 crowdsourced annotators. These scores correspond to highly negative (−3), negative (−2), weakly negative (−1), neutral (0), weakly positive (1), positive (2), and highly positive (3) sentiments. In this work, we aim to tackle the sentiment analysis task as a binary classification problem to properly support the envisioned use cases of the presented dialogue manager (cf. Section V). Consequently, we map the scores ∈ [−3, −1] to the negative class, and the scores ∈ [1, 3] to the positive class. Although related works in the literature based on this corpus cluster the sentences corresponding to the neutral sentiment into the negative class [19], we exclude these sentences to minimise biasing our models towards the negative class. Table 2 summarises the number of positive and negative sentences belonging to the resulting train, development, and test partitions.
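The mapping from sentiment scores to binary classes can be sketched as follows. This is a minimal illustration, not the released implementation: the function names are hypothetical, and treating every score strictly between −1 and 1 as neutral (and therefore excluded) is an assumption, since the text only states the two class intervals.

```python
from typing import Optional


def sentiment_to_binary(score: float) -> Optional[str]:
    """Map a sentiment score in [-3, 3] to a binary label.

    Scores in [-3, -1] become "negative" and scores in [1, 3] become
    "positive". Any other score returns None so the caller can drop it.
    """
    if not -3.0 <= score <= 3.0:
        raise ValueError(f"score {score} is outside [-3, 3]")
    if score <= -1.0:
        return "negative"
    if score >= 1.0:
        return "positive"
    # Assumption: scores strictly between -1 and 1 are treated as neutral.
    return None


def binarise(scored_sentences):
    """Keep only the (sentence, label) pairs whose score maps to a class."""
    labelled = []
    for sentence, score in scored_sentences:
        label = sentiment_to_binary(score)
        if label is not None:
            labelled.append((sentence, label))
    return labelled
```

Returning `None` for the neutral region keeps the exclusion step explicit, so the filtering can be audited separately from the labelling.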

B. MSP-PODCAST CORPUS
The data explored for training the arousal recognition model belongs to the MSP-Podcast corpus [7], which was gathered from freely available English podcasts. The selected podcasts were converted into the audio format 16 kHz/16 bit single-channel PCM. The resulting recordings were segmented, so that information from a single speaker was contained in each audio segment. The corpus was annotated via crowdsourcing in terms of emotional attributes (arousal, valence, and dominance), and categorical emotions. For the purpose of our study, we focus only on the arousal-related annotations, as arousal seems to be more prominent in the paralinguistic information embedded in the user's voice [20].
¹ https://github.com/A2Zadeh/CMU-MultimodalDataSDK
To annotate the audio segments in terms of arousal, the annotators rated the perceived level of arousal of the speaker using a seven-point Likert scale; i.e., the annotators were asked to rate whether the speaker was perceived to be very calm (1), calm (2), somewhat calm (3), neutral (4), somewhat active (5), active (6), or very active (7). Each segment was evaluated by several annotators, and the gold standard was determined as the average value among the annotations provided by the individual annotators.
As the envisioned affective dialogue manager does not need to infer arousal information with this level of granularity, we simplify the problem by clustering the arousal annotations into three levels [21]: the annotations ∈ [1, 3] are assigned to the low arousal class, the annotations ∈ (3, 5] to the mid arousal class, and the annotations ∈ (5, 7] to the high arousal class. Table 3 summarises the number of audio samples assigned to the low, mid, and high arousal classes belonging to the resulting train, development, and test partitions.
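The gold-standard computation and the three-level clustering described above can be sketched as follows. The function names are hypothetical and chosen for illustration only; the interval boundaries are those given in the text.

```python
from statistics import mean


def gold_standard(ratings):
    """Average the seven-point Likert ratings given by the individual annotators."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ratings)


def arousal_to_class(value: float) -> str:
    """Cluster an arousal value in [1, 7] into the low, mid, or high class.

    [1, 3] -> "low", (3, 5] -> "mid", (5, 7] -> "high".
    """
    if not 1.0 <= value <= 7.0:
        raise ValueError(f"arousal value {value} is outside [1, 7]")
    if value <= 3.0:
        return "low"
    if value <= 5.0:
        return "mid"
    return "high"
```

For instance, a segment rated 2, 3, and 4 by three annotators has a gold standard of 3 and therefore falls into the low arousal class.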

III. METHODOLOGY
The architecture and the information workflow of the overall system are depicted in Figure 1. A smartphone app acts as a gateway, so the users can record their own voice to communicate and interact with the system. The resulting media file is then transferred via the Internet to the system back-end. Upon reception, the speech file is transcribed using the off-the-shelf Automatic Speech Recognition (ASR) service provided by Google Cloud. The benefit of this approach is that the back-end can exploit the recorded speech file, and the resulting transcription, separately.
The proposed back-end architecture contains three main blocks: i) the sentiment analysis component, ii) the arousal recognition component, and iii) the dialogue manager component. Sections III-A and III-B describe the methodology followed to determine the best sentiment analysis, and arousal recognition models, respectively, to deploy in the respective components. The implementation of the dialogue manager component is detailed in Section V. This engineering-based section emphasises how the sentiment and the arousal information inferred is coupled to affectively adapt the system response to the current affective state of the user interacting with the system.

VOLUME 12, 2024
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

A. SENTIMENT ANALYSIS COMPONENT
This section describes the methodology followed to determine the best sentiment analysis model. Section III-A1 details the pre-processing applied to the sentences belonging to the CMU-MOSEI dataset (cf. Section II-A), Section III-A2 introduces the models implemented, and Section III-A3 summarises their training details.

1) DATA PREPARATION
Each sentence in the CMU-MOSEI dataset is composed of a different number of tokens. The first step is, therefore, the homogenisation of the sequence lengths, so these can be used to train our neural networks. According to the results obtained from our data analysis (cf. Table 1), the longest sentence belongs to the training partition and has a total of 310 tokens. Thus, we fix the length of the sequences to train our networks to 310 time steps. This parameter determines the maximum length of the sentences that can be analysed with our models at inference time. To bring every training sentence to 310 time steps, we repeat the sequence of tokens of the shorter sentences until reaching the desired sequence length, avoiding zero-padding. Nonetheless, the sentences with their original lengths are used when evaluating the performance of the models. To overcome the imbalanced data in terms of the positive and the negative sentiments (cf. Table 2), we upsample the under-represented class via replication, so that the same number of samples for each sentiment is used for training the networks at each epoch.
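The two preparation steps above, repeating tokens up to the fixed length and upsampling by replication, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and drawing the replicated samples at random with a fixed seed is an assumption, as the text does not state how the replicated samples are chosen.

```python
import random
from itertools import cycle, islice

MAX_LEN = 310  # length of the longest training sentence, as reported in the text


def repeat_to_length(tokens, length=MAX_LEN):
    """Repeat a token sequence cyclically until it has exactly `length` items."""
    if not tokens:
        raise ValueError("cannot repeat an empty sequence")
    return list(islice(cycle(tokens), length))


def upsample_by_replication(samples, labels, seed=0):
    """Replicate samples of the smaller classes until all classes are equally sized."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Assumption: the extra copies are drawn at random from the class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    rng.shuffle(balanced)
    return balanced
```

For example, `repeat_to_length(["a", "b", "c"], 7)` yields `["a", "b", "c", "a", "b", "c", "a"]`, so no padding symbol ever reaches the network.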

2) MODELS DESCRIPTION
The sentiment analysis networks implemented in this work are composed of two main blocks: the first block is responsible for learning the embedded representations of the input sequences, while the second block is responsible for the actual classification. The first block features a single-layer, bidirectional Gated Recurrent Unit-Recurrent Neural Network (GRU-RNN) with 128 hidden units. We select the use of a GRU-RNN to overcome the vanishing gradient problem suffered by other RNNs, such as the Long Short-Term Memory-Recurrent Neural Network (LSTM-RNN). The embedded representation learnt at the output of this block can be mathematically represented as
$$h_i = \mathrm{GRU}(w_i), \quad i \in [1, s], \qquad (1)$$
where $w_i$ corresponds to the word embedding representations extracted from the sequence of words $[w_1 \cdots w_s]$ in the sentence. The second block is composed of two stacked fully connected layers, preceded by two dropout layers with probability 0.3. The first layer contains 32 neurons and uses the Rectified Linear Unit (ReLU) as the activation function. The second layer has as many neurons as classes we need to classify our samples and uses Softmax as the activation function, so that the outputs of the network can be interpreted as probability scores.
The embedded representations learnt at the output of the first block, $h_i$, encapsulate the salient information from the input sequences. Hence, the way this information is exploited determines the performance of the overall model. In this work, we exploit the embedded representations $h_i$ using the following network architectures.
i) Baseline Network (Baseline RNN). The baseline network uses the last hidden state of the GRU-RNN as a standalone representation of the input sequence, $h$. This embedded representation is then fed to the second block of the network for the actual classification.
ii) Hierarchical Naïve Fusion Network (H-N). This network fuses the sequence of embedded representations, $h_i$, by averaging the representations over the whole sequence. This can be mathematically formulated as:
$$h = \frac{1}{s} \sum_{i=1}^{s} h_i. \qquad (2)$$
We refer to this approach as a naïve fusion method, since no parameters need to be trained by the network.
iii) Hierarchical Contextual Attention Fusion Network (H-CA). Based on the methodology presented in [22] and adapted from [23], this approach fuses the information by computing contextual attention scores as follows:
$$u_i = \tanh(W h_i + b), \qquad (3)$$
$$\alpha_i = \frac{\exp(u_i^\top u)}{\sum_{j=1}^{s} \exp(u_j^\top u)}, \qquad (4)$$
$$h = \sum_{i=1}^{s} \alpha_i h_i. \qquad (5)$$
In this approach, $W$, $b$, and $u$ are defined as trainable parameters. The parameter $u$ can be interpreted as a contextual tensor, which contributes to the identification of the relevant words in the sentences.
iv) Convolutional Fusion Network (CNN). The fusion of the sequence of embedded representations is performed using a 1-dimensional convolutional layer with 256 and 128 input and output channels, respectively, a kernel size of 3, and a stride of 1. The parameters selected guarantee a smooth integration of this convolutional block into the baseline network for a fair and effective comparison between the models. Batch normalisation is applied to the output of the convolution, and the resulting representation is transformed using a ReLU function. Finally, a 1-dimensional adaptive average pooling is applied to obtain 2 values as a result of the fusion. The final representation is reshaped into a 1-dimensional tensor $h$, ready to be fed into the classification block of the network.
v) Convolutional Contextual Attention Fusion Network (CNN-CA). This final network combines the approaches described for the H-CA and the CNN networks. First, the contextual attention scores from the sequence of embedded representations are computed as defined in Equations (3) and (4). Then, $h_i$ is transformed into an intermediate representation mathematically defined as
$$h'_i = \alpha_i h_i. \qquad (6)$$
This new representation $h'_i$ is then exploited using a 1-dimensional convolutional layer, as described for the CNN network.
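To make the difference between the naïve and the contextual attention fusion concrete, the following dependency-free sketch implements both on plain Python lists. It assumes the common attention form u_i = tanh(W h_i + b), with softmax-normalised scores u_i · u, as used in hierarchical attention networks; the function names and the list-based parameter layout are hypothetical, and the sketch is not the released model code.

```python
import math


def mean_fusion(hidden_states):
    """H-N style fusion: average the hidden states over all time steps."""
    steps = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / steps for d in range(dim)]


def contextual_attention_fusion(hidden_states, W, b, u):
    """H-CA style fusion: attention-weighted sum of the hidden states.

    W is a list of k rows of length d, while b and u have length k.
    Returns the fused vector and the attention weights.
    """
    scores = []
    for h in hidden_states:
        # u_i = tanh(W h_i + b)
        u_i = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_k)
               for row, b_k in zip(W, b)]
        # Similarity between u_i and the context vector u.
        scores.append(sum(a * c for a, c in zip(u_i, u)))
    # Numerically stable softmax over the time steps.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    fused = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
             for d in range(dim)]
    return fused, alphas
```

With all-zero parameters every time step receives the same weight, so the attention fusion reduces to the naïve average; training W, b, and u is what lets the network emphasise the relevant words.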

3) NETWORKS TRAINING
At the initialisation of each network, the pseudo-random number generator is manually seeded for a fair comparison, and reproducibility of the results. The models described in Section III-A2 are trained using the Categorical Cross-Entropy as the loss to optimise. As the optimiser, we use Adam with a fixed learning rate of $10^{-4}$. The network parameters are updated in batches of 256 samples, and their gradients are clipped at 1. The networks are trained during a maximum of 100 epochs, and we implement an early stopping mechanism to stop training when the validation loss does not improve for 20 consecutive epochs. Using this early stopping mechanism, we determine the number of epochs needed for training the networks, while minimising the chances of overfitting.
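The early stopping logic described above can be sketched as follows, using the reported settings (a maximum of 100 epochs and a patience of 20 epochs). The class and function names are hypothetical, and the per-epoch validation losses are passed in as a list purely for illustration; in practice they would come from evaluating the network on the development partition.

```python
class EarlyStopping:
    """Signal a stop when the monitored value has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, value):
        """Record a validation value; return True when training should stop."""
        if value < self.best:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_training(validation_losses, max_epochs=100, patience=20):
    """Walk through per-epoch validation losses; return (epochs_run, best_loss)."""
    stopper = EarlyStopping(patience)
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs], start=1):
        epochs_run = epoch
        if stopper.step(loss):
            break
    return epochs_run, stopper.best
```

If the loss last improves at epoch 2, training stops after epoch 22, once 20 consecutive epochs have passed without improvement.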

B. AROUSAL RECOGNITION COMPONENT
This section describes the methodology followed to determine the best arousal recognition model. Section III-B1 details the pre-processing applied to the speech samples belonging to the MSP-Podcast corpus (cf. Section II-B), Section III-B2 introduces the models implemented, and Section III-B3 summarises their training details.

1) DATA PREPARATION
The feature representations to extract from the original audio files play a vital role in the paralinguistic analysis. Hence, we aim to compare the performance of the arousal models when exploiting the functionals of the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [24] extracted using openSMILE [25], and the Mel-spectrogram representations of the audio signals. The former extracts an 88-dimensional feature vector representation of each audio signal as a whole. The Mel-spectrograms are computed using 128 Mels and a hop size of 128 samples. The audio signals in the MSP-Podcast corpus have different durations. To homogenise their duration for training the models, we window the Mel-spectrogram representations so that they contain the information equivalent to 5 seconds of the original audio signals using an overlap of 50 %. Each windowed representation is stored as an image of 224 × 224 pixels for further processing. As the speech samples are imbalanced with respect to the arousal classes (cf. Table 3), we use a weighted random sampler to select the samples to use for training the models at each epoch. With this strategy, the samples corresponding to the less represented classes are used more often for training the models than the samples corresponding to the most represented classes.
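The windowing and the class-balanced sampling can be illustrated with the sketch below. It combines the 16 kHz sampling rate reported for the corpus with the hop size of 128 samples, which gives 125 spectrogram frames per second and 625 frames per 5-second window. The function names, the handling of signals shorter than one window, and the inverse-frequency weighting are assumptions made for illustration; the text does not specify them.

```python
import random
from collections import Counter

SAMPLE_RATE = 16_000   # Hz, as reported for the corpus
HOP_SIZE = 128         # samples between consecutive spectrogram frames
WINDOW_SECONDS = 5
OVERLAP = 0.5


def window_bounds(n_frames, sample_rate=SAMPLE_RATE, hop=HOP_SIZE,
                  seconds=WINDOW_SECONDS, overlap=OVERLAP):
    """Return (start, end) frame indices of the windows over a spectrogram."""
    win = (seconds * sample_rate) // hop        # 625 frames per window
    step = max(1, int(win * (1 - overlap)))     # 312 frames between window starts
    if n_frames <= win:
        # Assumption: a short signal yields a single, shorter window.
        return [(0, n_frames)]
    return [(start, start + win) for start in range(0, n_frames - win + 1, step)]


def sample_weights(labels):
    """Weight each sample by the inverse frequency of its class (an assumption)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def draw_epoch_indices(labels, seed=0):
    """Draw one epoch of sample indices, with replacement, using the class weights."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    return rng.choices(indices, weights=sample_weights(labels), k=len(labels))
```

With inverse-frequency weights, every class has the same total probability of being drawn, so samples from the less represented classes are selected more often than those from the most represented ones.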

2) MODELS DESCRIPTION
To model the different features extracted from the speech files (cf. Section III-B1), we explore different network architectures, which we proceed to describe.
i) MLP. The eGeMAPS features are modelled using a Multi-Layer Perceptron (MLP) composed of two main blocks. The first block acts as a feature adapter, as it uses a linear layer to convert the original features into a 512-dimensional representation. This 512-dimensional representation is then fed to the classification block, which implements two linear layers with 32 and 3 output

3) NETWORKS TRAINING
At the initialisation of each network, the pseudo-random number generator is manually seeded for a fair comparison, and reproducibility of the results. We train the models described in Section III-B2 to minimise the Categorical Cross-Entropy loss, using Adam as the optimiser with a learning rate of $10^{-3}$. The networks are trained in batches of 128 samples and during a maximum of 200 epochs. We implement an early stopping mechanism to stop training when the validation error does not improve for 20 consecutive epochs. With this early stopping mechanism, we determine the number of epochs needed for training the networks, while minimising the chances of overfitting. We select the Unweighted Average Recall (UAR) as the metric to compare the ground truth and the inferred arousal annotations, and, therefore, we define (1 − UAR) as the validation error to monitor the training process.
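The UAR is the mean of the per-class recalls, so every class contributes equally regardless of its size. A minimal, dependency-free sketch of the metric and of the derived validation error follows; the function names are hypothetical.

```python
def unweighted_average_recall(y_true, y_pred):
    """Average the recall of each class present in the ground truth."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def validation_error(y_true, y_pred):
    """Validation error monitored during training: 1 - UAR."""
    return 1.0 - unweighted_average_recall(y_true, y_pred)
```

A model that always predicts the majority class in a three-class problem obtains a recall of 1 for that class and 0 for the other two, i.e. a UAR of one third, however imbalanced the data are.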

IV. EXPERIMENTAL RESULTS
This section reports the results obtained from the experiments conducted. Section IV-A compares the performance of the sentiment analysis models that implement the different network architectures described in Section III-A2. Section IV-B analyses how the performance of the arousal recognition models is impacted by choosing different feature representations of the speech signals and different network architectures to analyse the information extracted (cf. Section III-B2).

A. SENTIMENT ANALYSIS MODELS
To assess the performance of our sentiment analysis models, we compute the UAR between the inferred and the ground truth annotations. We consider the UAR as the most suitable metric to use in this case, as it is not impacted by the imbalanced data. Hence, the chance level in terms of UAR for the binary classification problem is 50.00 %.
The performance of the binary sentiment analysis models trained is summarised in Table 4. To contextualise the performance of our models, we apply a state-of-the-art transformer-based binary sentiment analysis model to infer the sentiment corresponding to the sentences belonging to both the development and the test partitions. Specifically, we select the pre-trained, off-the-shelf binary sentiment model available from the pipeline API of the Transformers library² [27]. This model was trained based on the DistilBERT architecture [28] and fine-tuned on the SST2 dataset [29]. The results obtained with this pre-trained model are included in Table 4.
Comparing the results obtained, we observe that all our models achieve a higher performance than the state-of-the-art transformer-based binary sentiment model on the test set. The highest performance on the development partition is obtained with the baseline RNN, scoring a UAR of 75.84 %. Nevertheless, the H-CA network scores the best performance on the test set, with a UAR of 79.04 %, surpassing the baseline network.
The sentiment analysis component in the affective dialogue manager architecture (cf. Section III) deploys the H-CA network-based sentiment analysis model trained. The sentiment analysis component is implemented through a simple API, which is publicly available with the aim to provide an off-the-shelf sentiment analysis tool to the community³.

B. AROUSAL RECOGNITION MODELS
The results obtained from the network architectures described in Section III-B2 are reported in Table 5. As can be observed, the best performance is obtained using the

V. PROOF OF CONCEPT DEMONSTRATOR
Coupling the sentiment analysis and arousal recognition technologies developed and hosted in their corresponding components (cf. Figure 1), we can power a dialogue manager with affective capabilities. As depicted in Figure 1, the proposed system initialises the dialogue, and, then, the users record their voice via the smartphone app to answer. The content of the answer is open to the user. Upon reception of the recorded file, the back-end of the system runs the generated transcription and the received speech file through the sentiment analysis and the arousal recognition components, respectively. Open dialogue systems are a [...] Hence, we opt for a rule-based dialogue manager. We employ predefined sentences, containing the body of the system answer, and interjections, to convey different levels of energy in the system response, which the dialogue manager selects according to the sentiment and arousal information inferred by the corresponding components to determine an affective system response to the users' input.
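The rule-based selection can be pictured with the following minimal sketch, in which the inferred sentiment picks the body of the answer and the inferred arousal picks the interjection. All sentences, interjections, and names below are hypothetical placeholders; they are not the ones deployed in the system.

```python
# Hypothetical bodies of the system answer, selected by the inferred sentiment.
BODIES = {
    "positive": "I'm glad to hear that, {nickname}.",
    "negative": "I'm sorry to hear that, {nickname}.",
}

# Hypothetical interjections, selected by the inferred arousal level.
INTERJECTIONS = {
    "low": "Hmm.",
    "mid": "Oh.",
    "high": "Wow!",
}


def affective_response(sentiment, arousal, nickname):
    """Couple the sentiment and arousal inferences into one system response."""
    body = BODIES[sentiment].format(nickname=nickname)
    interjection = INTERJECTIONS[arousal]
    return f"{interjection} {body}"
```

Keeping the two lookups separate mirrors the division of labour between the components: two sentiment classes and three arousal levels yield six possible responses per dialogue turn without enumerating them by hand.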
As a proof of concept demonstrator, we integrate the proposed affective dialogue manager at the back-end of a larger companion system which interacts with its users at specific, relevant points in time during the day and gathers users' and context-related information to determine timely and personalised recommendations that can support wellbeing, wellness, and productivity [5]. For this use case, the affective dialogue manager aims at improving the user experience when interacting with the companion system. For a more natural interaction, the dialogue manager addresses the users using a nickname of their choice, which is adapted to each user on the fly. Considering the deployment scenario, we define two of the short dialogues in which the users are engaged at wake-up time (cf. Dialogue 1), and at the end of the working time (cf. Dialogue 2). We also showcase these dialogue scenarios in a video demonstration⁵.
It is worth mentioning that the smartphone app showcased in the video demonstration utters the dialogue manager responses with the Text-to-Speech (TTS) functionality provided by Android. The authors agree that emotional TTS is an emerging research field [30], which could potentially be applied in the proposed affective dialogue manager for a more natural system response. Nevertheless, as herein we focus on the affective capabilities of a dialogue manager from a user analysis perspective, we consider the synthesis aspect of the affective dialogue manager as future work.
Dialogue 2
Affective dialogue designed at the end of the working day considering the outcomes of the sentiment analysis and the arousal recognition models when analysing the open user's response
Ethical concerns are inherent to voice-based HCI applications, especially those related to privacy [31]. In our case, users actively press a button on the smartphone interface to start and stop the audio recording. We opted for this approach to gain users' trust, avoiding the impression that they were being continuously recorded. When sending sensitive data, such as voice, over the Internet, the connection between the smartphone and the system back-end needs to be secured and encrypted; for instance, using the HTTPS protocol. Finally, the raw recordings should be deleted after processing and providing the answer to the users, in order not to store personal data and to minimise the damage of potential data leaks associated with exposing the back-end system to the public Internet.

VI. CONCLUSION AND FUTURE WORK
In this work, we presented a speech-based affective dialogue manager system powered by sentiment analysis and arousal recognition capabilities to create an instantaneous affective profile of the user, so it can be used to condition and adjust the system response. The research conducted on the sentiment analysis problem focused on analysing the information loss experienced by using the hidden state of a recurrent neural network produced at the last time step as the embedded representation encoding the whole input sentence. The best model implemented a hierarchical contextual attention fusion network, which exploited the hidden states produced during all the time steps of the input sentence as the embedded representations. The research conducted on the arousal recognition problem focused on assessing the suitability of using different feature representations of the speech signals and using different network architectures to exploit the information extracted. The best model extracted the Mel-spectrogram representations of the speech signals and used a CNN trained from scratch to generate deep learnt representations. Overall, the deployed sentiment analysis model was able to infer whether the input sentence conveyed a negative or a positive sentiment, while the deployed arousal recognition model was able to infer whether the speaker conveyed a low, mid, or high level of arousal. Furthermore, we provided a proof of concept demonstrator of the implemented affective dialogue manager and presented two of the dialogues supported at the back-end of the system, exemplifying how the inferred affective information determined the system response.
Future work includes the assessment of the proposed affective dialogue manager with real users. To evaluate the effectiveness of the proposed solution, it would be relevant to also compare it with existing dialogue managers. Further investigations could consider exploring more advanced neural network architectures to improve the performance of the models trained. Motivated by the recent trends in the field of AI, a thorough analysis of Transformer-based architectures for sentiment analysis could be conducted with the aim to understand and determine which architecture works best and why. Furthermore, the study of LLMs in this problem is also a promising research direction, based on the excellent performance of such models in a wide range of problems and applications. Additionally, powering the dialogue manager system with natural language understanding capabilities and emotional TTS to utter the system responses would be encouraging directions to support open dialogues between the users and the system and to increase the affective perception of the system by the users, respectively.

TABLE 3. Number of low, mid, and high arousal samples belonging to the train, development, and test partitions of the resulting MSP-Podcast corpus when considering the task as a three-class classification problem.

FIGURE 1. Block diagram illustrating the architecture of the affective dialogue manager system implemented and presented in this work, including the interaction workflow. The system answer is determined by considering the sentiment and the arousal information inferred from the linguistic and the paralinguistic analysis, respectively, of the voice-based user's response.
neurons, respectively, preceded by 2 dropout layers with probability 0.3. The outputs of the first linear layer are transformed using a ReLU function, and the outputs of the second linear layer use a Softmax activation function, so the network outputs can be interpreted as probability scores.

ii) Scratch CNN. This network exploits the Mel-spectrogram representations of the audio signals using a CNN trained from scratch. It is composed of two main blocks. The first block extracts deep learnt representations from the input Mel-spectrograms. For this, we implement 3 convolutional layers with 32, 64, and 128 filters, respectively, each with a kernel size of 3 × 3 and a stride of 1. After each convolutional layer, we use batch normalisation, and the layer outputs are transformed using a ReLU function. The first two layers are followed by a 2-dimensional max-pooling layer with a kernel size of 2 × 2, while the third layer is followed by a 2-dimensional adaptive average pooling layer, so that this feature extraction block produces a 512-dimensional representation of the input data. The second block of the network is responsible for the actual classification and implements the same architecture as the classification block of the MLP network described above.

iii) Pre-trained CNN. This network also exploits the Mel-spectrogram representations of the audio signals, but using a pre-trained CNN. It is also composed of a feature extraction block and a classification block. We choose the same architecture for the classification block as in the MLP and the Scratch CNN architectures. The difference lies in the feature extraction block: in this case, we apply a pre-trained ResNet-18 [26] network without its last layer to extract deep learnt representations from the input Mel-spectrograms, and we fine-tune the network during the training process. This network produces a 512-dimensional representation of the input data. For this reason, we engineered the previous network architectures to also produce a 512-dimensional representation at the output of the feature extraction block. This way, we can fairly compare the performance of the three different architectures proposed.
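As a rough check of how the Scratch CNN feature extraction block can arrive at a 512-dimensional representation, the following sketch traces the tensor shape through the three convolutional stages. The text does not state the convolution padding, the output size of the adaptive average pooling, or the input Mel-spectrogram dimensions; the padding of 1, the 2 × 2 pooling output, and the 64 × 128 input used here are assumptions chosen for illustration (128 filters × 2 × 2 = 512 is one configuration consistent with the description, not necessarily the one used in this work). The function names are hypothetical.

```python
# Sketch: trace shapes through a 3-stage CNN feature extractor.
# ASSUMPTIONS (not given in the text): padding=1 for the 3x3 convolutions,
# adaptive average pooling to a 2x2 grid, and a 64x128 input Mel-spectrogram.

def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def max_pool_out(size, kernel=2):
    """Output length of non-overlapping max pooling (stride equals kernel)."""
    return size // kernel


def feature_dim(height, width, filters=(32, 64, 128), adaptive_out=(2, 2)):
    """Flattened feature size after the last (adaptive) pooling layer."""
    channels = 1  # single-channel Mel-spectrogram input (assumed)
    for i, n_filters in enumerate(filters):
        height, width = conv_out(height), conv_out(width)
        channels = n_filters
        if i < len(filters) - 1:
            # First two stages: 2x2 max pooling halves each spatial axis.
            height, width = max_pool_out(height), max_pool_out(width)
        else:
            # Last stage: adaptive pooling fixes the spatial grid size.
            height, width = adaptive_out
    return channels * height * width


print(feature_dim(64, 128))  # 128 channels * 2 * 2 = 512
```

Because the last pooling layer is adaptive, the flattened size does not depend on the input resolution, which is what allows the three architectures to share the same classification block.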

TABLE 4. Summary of the results obtained in terms of UAR (%) when tackling the sentiment analysis task as a binary classification problem. The performance of the models is assessed in both development and test partitions.

TABLE 5. Summary of the results obtained in terms of UAR (%) when tackling the arousal recognition task as a 3-class classification problem. The performance of the models is assessed in both development and test partitions.

The highest UAR is obtained with the Scratch CNN architecture exploiting the Mel-spectrogram representations of the audio files, with a UAR of 61.61 % and 61.11 % on the development and the test partitions, respectively. The lowest UAR on the test set, 57.81 %, is obtained with the MLP architecture exploiting the functionals of the eGeMAPS feature set. This is similar to the performance obtained with the pre-trained CNN on the Mel-spectrogram representations, which scores a UAR of 57.98 % on the test set. These results suggest the suitability of exploiting the Mel-spectrogram representations of the audio signals with CNNs trained from scratch for arousal recognition. The arousal recognition component in the affective dialogue manager architecture (cf. Section III) extracts the Mel-spectrogram representations of the input speech signals and deploys the trained Scratch CNN-based model. The arousal recognition component is implemented through a simple API, which is publicly available with the aim of providing an off-the-shelf arousal recognition tool to the community.⁴
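For readers less familiar with the metric, UAR is the mean of the per-class recalls, so each class contributes equally regardless of how many samples it has. The sketch below is a minimal pure-Python illustration; the function name, the class labels ("low", "mid", "high"), and the toy predictions are hypothetical and are not taken from the experiments reported here.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class has the same weight."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced 3-class example (illustrative labels only).
y_true = ["low"] * 6 + ["mid"] * 2 + ["high"] * 2
y_pred = ["low"] * 6 + ["mid", "low"] + ["low", "low"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.2f}, accuracy = {accuracy:.2f}")  # UAR = 0.50, accuracy = 0.70
```

In this toy case a classifier biased towards the majority class reaches 70 % accuracy but only 50 % UAR, which illustrates why UAR is a common choice when class distributions are imbalanced.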

Dialogue 1
Affective dialogue designed at wake-up time considering the outcomes of the sentiment analysis and the arousal recognition models when analysing the user's open response.
← [SYSTEM] Good morning, [NICKNAME]! How did you sleep tonight?

TABLE 1. Statistics of the resulting CMU-MOSEI dataset per partition after aligning the linguistic and word embedding representations of the original data, and segmenting the original videos into the corresponding sentences.

TABLE 2. Number of negative and positive samples belonging to the train, development, and test partitions of the resulting CMU-MOSEI dataset when considering the task as a binary classification problem.
Hi [NICKNAME], your working day is finally over. Do you have some plans for this afternoon?