Robust Semisupervised Generative Adversarial Networks for Speech Emotion Recognition via Distribution Smoothness

Despite the recent great achievements in speech emotion recognition (SER) with the development of deep learning, the performance of SER systems depends strongly on the amount of labeled data available for training. Obtaining sufficient annotated data, however, is often extremely time-consuming and costly and sometimes even prohibitive because of privacy and ethical concerns. To address this issue, this article proposes the semisupervised generative adversarial network (SSGAN) for SER to capture underlying knowledge from both labeled and unlabeled data. The SSGAN is derived from a GAN, but the discriminator of the SSGAN can not only classify its input samples as real or fake but also distinguish their emotional class if they are real. Thus, the distribution of realistic inputs can be learned to encourage label information sharing between labeled and unlabeled data. This article proposes two advanced methods, i.e., the smoothed SSGAN (SSSGAN) and the virtual smoothed SSGAN (VSSSGAN), which, respectively, smooth the data distribution of the SSGAN via adversarial training (AT) and virtual adversarial training (VAT). The SSSGAN smooths the conditional label distribution given inputs using labeled examples, while the VSSSGAN smooths the conditional label distribution without label information (“virtual” labels). To evaluate the effectiveness of the proposed methods, four publicly available and frequently used corpora are selected to conduct experiments in intradomain and interdomain situations. The results illustrate that the proposed methods are superior to the state-of-the-art methods. Specifically, in experimental settings with mismatched and semimatched unlabeled training sets, the SSSGAN and VSSSGAN are more robust than the SSGAN because of the distributional smoothness.


I. INTRODUCTION
Speech emotion recognition (SER), which aims to identify emotion states from speech signals, is an active research field in artificial intelligence. SER helps computers perceive human intentions and can be applied in many scenarios, such as human-computer interaction, intelligent voice services, and psychotherapy assistance. Over the past decades, various supervised learning methods, including support vector machines (SVMs) [1], [2], Gaussian mixture models (GMMs) [3]-[5], and hidden Markov models (HMMs) [6]-[8], have been utilized to build high-performance SER systems. However, due to their shallow structure, these methods cannot extract distinct emotion knowledge from speech signals, which limits SER system performance. Fortunately, deep learning has achieved great success in speech recognition and image processing [9]-[14] owing to its capability for deep representation learning. Considerable effort has been made to introduce deep learning into SER [15]-[19], further improving performance compared to traditional supervised learning methods.
Nevertheless, one challenge of deep learning is its substantial data requirements. Obtaining a sufficiently large labeled training dataset is a time-consuming task. Furthermore, there is no unified consensus on emotional annotation. However, a large amount of unlabeled data can easily be collected from the Internet. In this context, unsupervised learning, which aims to distill the internal structure of inputs without any label information, is utilized to extract latent representations of the training set. Then, the learned representations are fed into supervised learning methods for emotion recognition using labeled examples. Huang et al. [20] applied several unsupervised learning methods, such as K-means clustering, the sparse autoencoder (SAE), and sparse restricted Boltzmann machines, to learn emotion-related features from unlabeled data, which were then fed into an SVM to recognize emotional states. Eskimez et al. [21] investigated four different unsupervised learning methods, i.e., the denoising autoencoder (DAE), variational autoencoder (VAE), adversarial autoencoder (AAE) and adversarial variational Bayes, to learn the intrinsic structure of input data. Then, these features were fed into a classifier with three fully connected layers for emotion recognition.
Although combining unsupervised and supervised learning methods to build an SER system can relieve the dependence on labeled data, an intrinsic conflict exists between unsupervised and supervised learning methods. Unsupervised learning methods seek to retain as much information as possible to reconstruct the input from the learned representations, while supervised learning methods aim to distill discriminative representations for classification. Hence, most recent research focuses on semisupervised learning, which can simultaneously deploy unsupervised and supervised learning to distill latent and discriminative features from a combination of labeled and unlabeled data. Zhang et al. [22] proposed a novel enhanced semisupervised learning method in which two strategies are applied to improve model performance: complementary audio and visual information is considered to improve the performance of the supervised path, and mislabeled data are reprocessed until the confidence of the method cannot be further increased. This approach relieves the noise accumulation problem. Deng et al. [23] proposed a semisupervised autoencoder (SSAE) for SER. The model consists of an unsupervised path and a supervised path. The unsupervised path performs reconstruction using an autoencoder, while the supervised path learns discriminative features for classification. The unsupervised and supervised learning paths work together with several shared hidden layers, and the outputs of the last shared layer are treated as common representations of the inputs. Specifically, a pseudoclass introduced to represent the unlabeled data is used for classification and regularization. In the training stage, the parameters of the SSAE are optimized under a joint objective function of reconstruction and classification loss. Previous research shows that semisupervised learning can achieve good performance with the combination of a limited quantity of labeled data and a large amount of unlabeled data.
As mentioned above, semisupervised learning is an effective way to relieve the data scarcity problem. Therefore, this article focuses on building an SER method based on semisupervised learning and proposes the semisupervised generative adversarial network (SSGAN) for SER. Similar to the generative adversarial network (GAN) [24], the SSGAN consists of two neural networks with different purposes, that is, a generator and a discriminator. The generator generates fake examples from a random vector drawn from a prior probability distribution, making the generated samples as similar to real examples as possible. In contrast, the discriminator performs classification. Different from the GAN, in the SSGAN, the set of emotion categories is expanded with a new pseudoclass that represents the generated data. In this way, the SSGAN not only learns the distribution of the input data but also shares label information between labeled and unlabeled samples. Since the SSGAN aims to learn the probability distribution of the input data, it promotes the generalization performance by exploring the distributional region subject to perturbations, that is, the region over which we are likely to receive new input data. Salimans et al. [25] applied the SSGAN in image processing; however, to the best of our knowledge, this is the first time the SSGAN has been exploited for SER.
Although the SSGAN promotes generalization performance by learning the distributional region of inputs to make the model robust to disturbances, it rarely explores the directions along which the output is most sensitive to perturbations within that region. Previous work [26] discovered that a model may output an erroneous category when the input is disturbed by a small perturbation in a specific direction, called the adversarial direction. This result demonstrates that the mapping between the input and output distributions is anisotropic. To address this issue, adversarial training (AT) was proposed to assign the input the same label as its neighbors in the adversarial direction. Encouraged by the performance of AT, this article proposes a smoothed SSGAN (SSSGAN) for SER that extends the SSGAN by means of AT. The SSSGAN can smooth the mapping between the distribution of the input and the corresponding class labels. In the training phase, the AT loss of the labeled examples is added to the loss function of the SSGAN, where it acts as a regularization term. Furthermore, this article proposes the virtual smoothed SSGAN (VSSSGAN) for SER to relieve the dependence on labeled data when performing distribution smoothing. The VSSSGAN extends the SSGAN with virtual adversarial training (VAT), which smooths the distribution without label information. Concretely, the VSSSGAN assumes that unlabeled examples hold corresponding "virtual" emotion classes. For each category, the distributions of labeled and unlabeled data are consistent; therefore, the VSSSGAN exploits the adversarial directions without label information. To evaluate the proposed methods, we conduct several experiments on four publicly available datasets, i.e., the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, the Berlin Database of Emotional Speech (EmoDB), the FAU Aibo Emotion Corpus (AEC) and MSP-IMPROV, in different scenarios.
The contributions of this article are summarized as follows: (1) We propose the SSGAN for SER to relieve the dependence on labeled data. In contrast to the GAN, the prediction categories of the discriminator extend the emotion classes with an additional fake class. Therefore, the discriminator of the SSGAN not only detects the source of its inputs but also performs emotion recognition. The SSGAN benefits from the ability of the GAN and semisupervised learning to learn knowledge from both labeled and unlabeled data. (2) Previous research has shown that there are adversarial directions in the output distribution [26]. To further promote the generalization performance of the SSGAN, we propose the SSSGAN and VSSSGAN for SER, which can smooth the mapping between the input and output distributions in the label and feature space, respectively. Specifically, the VSSSGAN smooths the output distribution without label information, which can relieve the dependence on labeled data. (3) We conduct several experiments on the IEMOCAP dataset and three other publicly available corpora to evaluate the effectiveness of the proposed methods in intradomain and interdomain situations. The results indicate that the proposed methods are superior to state-of-the-art methods. Specifically, the SSSGAN and VSSSGAN achieve better results than the SSGAN, demonstrating that AT and VAT can promote the generalization of the model.

The remainder of this article is organized as follows. First, related work is discussed in Section II. Then, we present the proposed methods in Section III. Next, Section IV presents experiments on four corpora. Finally, we draw conclusions and note directions for future research in Section V.

II. RELATED WORKS
Deep learning has shown good effectiveness in speech recognition and image processing. Researchers have therefore attempted to introduce deep learning into SER. Han et al. [15] utilized a deep neural network (DNN) to extract high-level features, which are fed into an extreme learning machine for emotion classification. Trigeorgis et al. [16] combined convolutional neural networks (CNNs) with long short-term memory (LSTM) to build an end-to-end SER model that directly takes the raw signal as input. Meng et al. [27] proposed a novel end-to-end model, named ACRNN, that cascades a dilated CNN with a residual block, a BiLSTM with an attention mechanism and a softmax layer with center loss to recognize speech emotion from raw input data, i.e., 3-D log-Mel spectrograms. Mirsamadi et al. [28] proposed a deep recurrent neural network (RNN) to extract temporal emotion representations and applied local attention to make the model focus on features of interest. Zhao et al. [18] proposed a compact convolutional RNN (CRNN) via binarization to reduce the computational overhead. However, the impressive performance of these approaches depends on the availability of labeled examples for training. One elegant solution is to distill knowledge from a combination of labeled and unlabeled data. Therefore, this article focuses on semisupervised learning to overcome the data scarcity problem.
Self-training is a common semisupervised learning strategy that starts with training a weak classifier using labeled data. Then, the classifier is used to estimate the confidence with which each unlabeled sample belongs to each emotion category. Unlabeled data whose confidence exceeds a predefined threshold are added to the labeled dataset for the next training epoch. This procedure is repeated until a satisfactory classifier is obtained. Esparza et al. [29] utilized k-nearest neighbor (KNN) classification to generate fuzzy labels for unlabeled data. Then, an SVM was trained on the combined labeled and unlabeled data with fuzzy labels. Zhang et al. [30] proposed cooperative learning (CL) for SER, which benefits from active learning and self-training. CL can reduce the workload of manual annotation via human-machine collaboration. Unlike self-training, our proposed methods learn internal structure information and discriminative representations from the combination of labeled and unlabeled data, deploying supervised and unsupervised learning simultaneously. Thus, experimental deviations due to the accumulation of classification errors do not arise.
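For illustration, the generic self-training loop described in this section can be sketched as follows; the SVM classifier, the 0.9 confidence threshold and the function names are illustrative assumptions rather than the exact setups of [29] or [30].

```python
# A minimal self-training sketch: an SVM is retrained while high-confidence
# pseudo-labeled examples are moved from the unlabeled pool to the labeled set.
import numpy as np
from sklearn.svm import SVC

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    clf = SVC(probability=True)
    for _ in range(max_rounds):
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing confident enough to pseudo-label; stop early
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]
    return clf
```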
Co-training is another common semisupervised learning strategy employed to address the accumulation of experimental error. First, the labeled data are used to train classifiers on different feature sets. Then, each classifier selects the unlabeled data with the highest label confidence. These newly labeled data are added to the training set of another classifier to update that classifier's parameters. This process continues iteratively until all classifiers no longer change or a preset number of learning rounds has passed. Liu et al. [31] proposed an enhanced co-training method for SER from the perspective of temporal features and statistical features. Zhang et al. [32] proved that adding unlabeled data with a co-training method for SER can enhance the performance of purely supervised learning methods. Zhang et al. [33] proposed an extension of co-training named collaborative semisupervised learning (cSSL). In contrast to other co-training methods, cSSL involves not only feature types but also classification models. Compared with co-training, the methods proposed in this article do not extract different types of feature sets from different aspects, which reduces the dependence on expert knowledge.
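A single co-training round over two feature views might look like the sketch below; the logistic-regression classifiers, the 0.9 threshold and all names are illustrative assumptions, not the exact procedures of [31]-[33].

```python
# One co-training round: each view's classifier pseudo-labels the unlabeled data
# it is most confident about, and those examples are handed to the other view.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training_round(views_lab, y_lab, views_un, threshold=0.9):
    """views_lab / views_un: one feature matrix per view (e.g., temporal vs.
    statistical features) describing the same labeled / unlabeled utterances."""
    clfs = [LogisticRegression(max_iter=1000).fit(v, y_lab) for v in views_lab]
    selected, pseudo = [], []
    for clf, v_un in zip(clfs, views_un):
        proba = clf.predict_proba(v_un)
        idx = np.where(proba.max(axis=1) >= threshold)[0]
        selected.append(idx)
        pseudo.append(clf.classes_[proba[idx].argmax(axis=1)])
    # The caller moves the examples confidently labeled by one view into the other
    # view's labeled set and repeats until the classifiers stop changing.
    return selected, pseudo
```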
Semisupervised learning has attracted growing interest for addressing the scarcity of labeled data. It reaps the advantages of both unsupervised and supervised learning. Unsupervised learning, e.g., with an autoencoder (AE) or a GAN, attempts to learn the internal structure of the input without label information. By contrast, supervised learning aims to distill discriminative features according to labeled data. These approaches work together with several shared hidden layers, and the output of the last shared hidden layer is treated as a common representation that retains the intrinsic information along with discriminative attributes. In the training stage, the model parameters are updated by optimizing the joint objective function. Huang et al. [34] utilized ladder networks to project static acoustic features to high-level representations. The model is trained with a weighted combination of the supervised loss on top of the encoder layers and the unsupervised loss in the decoder layers. Rana et al. [35] proposed the semisupervised adversarial autoencoder (SSAAE) for SER. To further improve the performance, a multitask mechanism (emotion classification as the main task, speaker and gender classification as auxiliary tasks) is employed in the supervised path. In this article, our approaches also make use of semisupervised learning to learn knowledge from a combination of labeled and unlabeled data. Nevertheless, our proposed methods seek to learn the distribution of the input. They are robust to input perturbations, which promotes generalization performance.
Previous research in [36] discovered that several neural networks are vulnerable to adversarial samples. That is, these samples, which are similar to correctly classified samples drawn from a data distribution, confuse neural networks, resulting in misclassification. Therefore, Szegedy et al. [36] proposed training the neural networks on both the training set and adversarial samples to promote generalization. Goodfellow et al. [26] explored the adversarial direction, in which the model assigns each sample a label that is similar to those of its neighbors. To reduce the dependence on label information, Miyato et al. [37] utilized "virtual" labels to explore the adversarial direction. These "virtual" labels are estimated by the current model, which is trained on the limited labeled data. The results show that generalization is improved by relieving the impact of the adversarial samples. Luo et al. [38] proposed smooth neighbors on teacher graphs (SNTG), which encourages neighboring samples on the teacher graph to learn similar features while keeping non-neighbors apart. The experimental results suggest that SNTG is robust to noisy labels. In this article, the conditional label distribution given the input is smoothed by AT or VAT to improve the generalization from the perspective of classification.

III. METHODOLOGY
This section introduces the framework of the proposed methods. We consider a dataset consisting of N labeled examples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) together with a collection of unlabeled examples, where y_1, y_2, ..., y_N ∈ {1, 2, ..., K} and K is the total number of emotion states. Our goal is to learn a conditional label distribution p(y|x) from the combination of labeled and unlabeled examples to predict the emotion classes of given test examples. First, we describe the framework of the SSGAN in detail. Then, we present the proposed method named the SSSGAN, which smooths the output distribution p(y|x) according to labeled data. Finally, the VSSSGAN, which smooths the output distribution p(y|x) without label information, is described.

A. SEMISUPERVISED GENERATIVE ADVERSARIAL NETWORK
The SSGAN, a semisupervised learning paradigm, learns knowledge from a combination of labeled and unlabeled data. The framework of the SSGAN is shown in Fig. 1.
The most classic construction of the SSGAN, which incorporates a generator G(z) and a discriminator D(x), is shown in [39]. The generator G generates fake examples from noise samples z drawn from a prior distribution p(z), and the discriminator D is a classifier. However, the predictions of D comprise the K emotion states and an additional pseudoclass (fake) rather than only the binary classes (i.e., real or fake). Therefore, the SSGAN not only captures the distribution of the input but also models the discriminative information needed to perform emotion recognition simultaneously.
Mathematically, we assume that p_model(y = K + 1|x) denotes the probability of the input belonging to the fake category, which corresponds to the probability 1 − D(x) in the traditional GAN. Since the minibatch consists of half of the training data and half of the generated data in [39], the loss function of the discriminator D can be expressed as:

L_D = −E_{x,y∼p_data(x,y)} log p_model(y|x) − E_{x∼G} log p_model(y = K + 1|x). (1)

In Eq. (1), the first term is the expectation of the input under the class labels, which can be applied to perform emotion recognition. In the training phase, the parameters of the discriminator D are updated to minimize Eq. (1), which is maximized by updating the parameters of the generator G.
Recently, a more advanced SSGAN was proposed in [25] that not only exploits a small amount of real labeled data but, more importantly, makes use of a large amount of real unlabeled data. Thus, the first term of Eq. (1) can be updated by splitting the real-data contribution into a supervised part and an unsupervised part:

L_supervised = −E_{x,y∼p_data(x,y)} log p_model(y|x, y < K + 1),
L_unsupervised = −E_{x∼p_data(x)} log[1 − p_model(y = K + 1|x)] − E_{x∼G} log p_model(y = K + 1|x). (2)

Thus, the overall loss function of the discriminator D of the SSGAN can be calculated by

L_D = L_supervised + L_unsupervised. (3)

The SSGAN is trained with a combination of the objective functions of supervised and unsupervised learning. During training, the parameters of the SSGAN are updated under the min-max optimization strategy until the discriminator not only recognizes the emotion category but also distinguishes whether the input is real or fake data. Because the unlabeled and labeled data follow the same conditional label distribution, the SSGAN can learn knowledge from the unlabeled data by minimizing the unsupervised term of Eq. (3). In this way, the SSGAN learns knowledge from a combination of the unlabeled and labeled data. Meanwhile, the latent discriminative features are preserved for emotion recognition.
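As a concrete illustration, the following minimal PyTorch sketch computes the discriminator loss of Eqs. (2) and (3), assuming the discriminator outputs K + 1 logits with the last one reserved for the fake pseudoclass; the tensor names and the choice of K = 4 are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

K = 4  # number of real emotion classes; logit index K is the fake pseudoclass

def ssgan_discriminator_loss(logits_lab, y_lab, logits_unlab, logits_fake):
    # L_supervised: cross entropy over the K real emotion classes only (Eq. (2)).
    l_sup = F.cross_entropy(logits_lab[:, :K], y_lab)
    # p_model(y = K + 1 | x): softmax probability of the fake pseudoclass.
    p_fake_unlab = F.softmax(logits_unlab, dim=1)[:, K]
    log_p_fake_gen = F.log_softmax(logits_fake, dim=1)[:, K]
    # L_unsupervised: real unlabeled inputs should not be called fake,
    # generated inputs should be called fake (Eq. (2)).
    l_unsup = -(torch.log(1.0 - p_fake_unlab + 1e-8).mean() + log_p_fake_gen.mean())
    return l_sup + l_unsup  # Eq. (3)
```

In a training loop, logits_lab, logits_unlab and logits_fake would come from labeled real, unlabeled real and generated minibatches, respectively.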

B. SMOOTHED SEMISUPERVISED GENERATIVE ADVERSARIAL NETWORK
In Section III-A, the SSGAN is proposed to distill knowledge from a combination of the unlabeled and labeled data. Concretely, the SSGAN models the distribution of the input data, and label information is shared via the assumption that the labeled data and unlabeled data follow the same distribution. In this way, the SSGAN learns the distributional region of the input, the region over which we expect to obtain new input data, to make the model robust to input perturbations. However, Goodfellow et al. [26] proved that there is a specific direction of the perturbation in the input space, that is, the adversarial direction, that leads to the greatest reduction in the model's probability of correct classification.
To further improve the robustness, we propose the SSSGAN, which employs AT to smooth the output distribution p(y|x). The framework of the SSSGAN is shown in Fig. 2. Using labeled data, AT explores the adversarial direction r_adv, in which each input datum is assigned a label that is similar to the labels of its neighbors. Finally, p(y|x) is smoothed along r_adv to improve the generalization performance and make the model robust to adversarial perturbations. The objective function of AT can be defined as:

L_adv(x_l) = Div[q(y|x_l), p_model(y|x_l + r_adv)], (4)

where

r_adv = argmax_{r, ||r|| ≤ ε} Div[q(y|x_l), p_model(y|x_l + r)],

x_l denotes a labeled sample, and Div[q, p_model] is a divergence measure between the probabilities q and p_model, where q is the true conditional label probability and p_model is a parametric probability. On the basis of [26], the cross entropy Div[q, p_model] = −Σ_i q_i log p_model,i is utilized to measure the divergence of the distributions q and p_model, where q_i and p_model,i denote the probabilities of the ith class. Since q(y|x_l) in Eq. (4) is unknown, it is approximated by the one-hot vector h(y; y_l), whose entry corresponding to y_l is set to one and whose other entries are set to zero. Moreover, the parametric function p_model(y|x_l + r_adv) is the output distribution of the adversarially attacked input x_l. The goal of Eq. (4) is to approximate the true label distribution q(y|x_l) with a parametric function p_model(y|x_l + r_adv) that is robust to adversarial attacks on x_l. Since AT can smooth the output labels to overcome the adversarial attack, it is applied to improve the robustness of the SSGAN. Therefore, the loss function of the discriminator D of the SSSGAN can be expressed as:

L_SSSGAN = L_D + α L_adv, (5)

where L_adv is averaged over the labeled examples and α is a balance factor that controls the contributions of the SSGAN loss and the distribution smoothness based on labeled examples. However, the computation of Eq. (5) is intractable since the exact adversarial perturbation r_adv in Eq. (4) is difficult to obtain. An elegant solution provided by [26] is to approximate r_adv via a linear approximation of L_adv(x_l) with respect to r. When the norm is assumed to be L_2, the adversarial perturbation can be approximated by

r_adv ≈ ε g / ||g||_2, where g = ∇_{x_l} Div[h(y; y_l), p_model(y|x_l)]. (6)

According to Eq. (6), the gradient g can be efficiently computed by backpropagation in the training phase. Finally, the optimal adversarial perturbation r_adv in Eq. (4) is computed.
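The PyTorch sketch below shows one way to realize Eqs. (4)-(6): the gradient of the cross entropy with respect to the labeled input is L2-normalized and scaled by ε to obtain r_adv, and the resulting AT loss would be added to the SSGAN loss with weight α as in Eq. (5). The model interface (K + 1 logits, the last for the fake class) and the ε value are assumptions.

```python
import torch
import torch.nn.functional as F

def at_loss(model, x_lab, y_lab, epsilon=1.0):
    x = x_lab.clone().requires_grad_(True)
    # Divergence between the one-hot label h(y; y_l) and p_model(y|x_l): cross
    # entropy over the K real emotion classes (the last logit is the fake class).
    loss = F.cross_entropy(model(x)[:, :-1], y_lab)
    g = torch.autograd.grad(loss, x)[0]
    # Eq. (6): r_adv is the L2-normalized input gradient scaled by the radius epsilon.
    r_adv = epsilon * g / (g.norm(dim=1, keepdim=True) + 1e-8)
    # Eq. (4): the adversarially perturbed input should keep its original label.
    return F.cross_entropy(model(x_lab + r_adv.detach())[:, :-1], y_lab)
```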

C. VIRTUAL SMOOTHED SEMISUPERVISED GENERATIVE ADVERSARIAL NETWORK
Section III-B briefly describes the proposed SSSGAN, which smooths the output of the conditional label distribution according to the label information. In other words, the smoothness of the SSSGAN depends on the labeled data. However, not all label information is provided. Thus, we propose the VSSSGAN, which benefits from both the SSGAN and VAT [37], where VAT uses a combination of labeled and unlabeled data to smooth the distribution q(y|x) without label information. The framework of the VSSSGAN is presented in Fig. 3. Mathematically, we assume that x* represents a sample from the combination of labeled and unlabeled data. As a consequence, according to Eq. (4), the objective function can be written as:

L_vadv(x*) = Div[q(y|x*), p_model(y|x* + r_qadv)]. (7)

However, since no direct information exists to compute q(y|x_ul) for an unlabeled sample x_ul, the computation of the adversarial perturbation r_qadv on the entire dataset is intractable. In [37], an approximation strategy that assumes that each unlabeled datum has a corresponding "virtual" label is applied to address this issue. The "virtual" label is produced by the currently estimated parametric model p̂_model(y|x), which is then used to explore the adversarial direction. Finally, p̂_model(y|x*) replaces q(y|x*) in Eq. (7), which can be rewritten as:

L_vadv(x*) = Div[p̂_model(y|x*), p_model(y|x* + r_vadv)], with r_vadv = argmax_{r, ||r|| ≤ ε} Div[p̂_model(y|x*), p_model(y|x* + r)]. (8)

The loss function of the discriminator D of the VSSSGAN is then

L_VSSSGAN = L_D + β L_vadv, (9)

where L_vadv is averaged over all labeled and unlabeled examples and β is a balance factor that is utilized to control the contributions of the SSGAN and the local smoothness. Similar to AT, the computation of the optimal r_vadv in Eq. (8) is intractable. As discussed in Section III-B, a linear approximation of L_vadv with respect to r is used to approximate the optimal r_vadv. We assume the norm is L_2; thus, the optimal r_vadv can be written as

r_vadv ≈ ε g / ||g||_2, where g = ∇_r Div[p̂_model(y|x*), p_model(y|x* + r)], (10)

and g can be computed by backpropagation in the training phase.
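Below is a hedged PyTorch sketch of the VAT smoothness term of Eqs. (8)-(10): the "virtual" label distribution is taken from the current model, a random direction is refined by a single gradient step to approximate r_vadv, and the resulting divergence would be weighted by β as in Eq. (9). The one-step power iteration, the ξ and ε values and the model interface follow common VAT practice [37] and are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, epsilon=1.0):
    with torch.no_grad():
        # "Virtual" labels: the current model's prediction over the K real classes.
        p = F.softmax(model(x)[:, :-1], dim=1)
    d = torch.randn_like(x)
    d = xi * d / (d.norm(dim=1, keepdim=True) + 1e-8)
    d.requires_grad_(True)
    div = F.kl_div(F.log_softmax(model(x + d)[:, :-1], dim=1), p,
                   reduction="batchmean")
    g = torch.autograd.grad(div, d)[0]
    # Approximate r_vadv: normalized gradient direction scaled by the radius epsilon.
    r_vadv = epsilon * g / (g.norm(dim=1, keepdim=True) + 1e-8)
    return F.kl_div(F.log_softmax(model(x + r_vadv.detach())[:, :-1], dim=1), p,
                    reduction="batchmean")
```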

IV. EXPERIMENTS
In this section, we evaluate the effectiveness of the proposed methods on four publicly available corpora. First, the experimental setup is briefly introduced, including the selected datasets, acoustic features, model configurations and evaluation measures. Then, the proposed methods are compared with state-of-the-art methods in several scenarios. Finally, several experiments are conducted in intra- and interdomain settings to evaluate the robustness of the proposed methods.

A. THE SELECTED DATASET
Due to its wide use in the affective computing community, the IEMOCAP database [40] is utilized for our experimental evaluation. IEMOCAP, collected by the Speech Analysis and Interpretation Laboratory at the University of Southern California, was recorded as an audiovisual emotion dataset, including context, video, face and audio information. This dataset consists of five dyadic sessions, approximately ten hours of recordings in total. Each session involves a dyadic interaction between a male and a female speaker in scripted and improvised spoken communication scenarios. In the scripted scenario, actors were asked to communicate specified semantic and emotional content, while in the improvised scenario, the emotions of the subjects were elicited in hypothetical situations. Each session was segmented into turns, which are annotated with an emotion label (i.e., neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, excited and other) by three evaluators. Majority agreement is employed to assign an emotion label to each turn. In line with previous research [41], only four emotion categories (i.e., neutral, sadness, happiness and anger) are taken into consideration in our experiments. In addition, excited examples are merged into happiness. A total of 5 531 turns (i.e., 1 708 for neutral, 1 084 for sadness, 1 636 for happiness and 1 103 for anger) are utilized for our experiments. Additionally, three other publicly available corpora are chosen as unlabeled training sets, namely, the EmoDB [42], the AEC [43] and MSP-IMPROV [44]. The EmoDB contains 535 utterances of German speech under seven different specified emotional states, i.e., neutral, anger, fear, joy, sadness, disgust and boredom. The AEC is a spontaneous emotional speech dataset that consists of voice conversations between 51 German children from two different schools and the Japanese intelligent robot Aibo. The conversations are segmented into utterances that are assigned to one of five emotional labels: anger, emphatic, neutral, positive, and rest. MSP-IMPROV is a multimedia emotion dataset recorded by 12 actors with scripted sentences and controlled emotion. It presents four different scenes to guide the corresponding emotional reactions: happy, angry, sad and neutral. In total, MSP-IMPROV contains 8 438 utterances with emotional annotations. Table 1 summarizes the four chosen datasets and reveals their differences.

B. ACOUSTIC FEATURES
In the experiments, we employ the same acoustic features as those used in the baseline of the INTERSPEECH 2009 Emotion Challenge for a fair comparison [45]. We utilize the open-source openSMILE toolkit [46] to extract the feature set according to the description in Table 2. First, 16 low-level descriptors (LLDs) are extracted from the raw signals, including the zero-crossing rate (ZCR), root mean square (RMS) energy, pitch frequency (normalized to 500 Hz), harmonics-to-noise ratio (HNR) and Mel-frequency cepstral coefficients (MFCCs) 1-12. Second, the first-order delta coefficients of these 16 LLDs are calculated and appended to the feature set. Then, a set of functionals is applied to the 16 LLDs and their deltas, i.e., the mean, standard deviation, kurtosis, skewness, minimum and maximum values, relative position and range, as well as the offset and slope of the linear regression line and its mean square error (MSE). In total, (16 + 16) × 12 = 384 acoustic features are taken into consideration in our experiments.
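As a practical sketch (not the paper's exact tooling commands), the 384-dimensional feature vector of one utterance can be obtained by calling the openSMILE command-line tool with the INTERSPEECH 2009 emotion configuration; the binary name, the configuration path and the output flags vary between openSMILE releases, so the values below are assumptions.

```python
import subprocess

def extract_is09(wav_path, arff_path,
                 smile_bin="SMILExtract",
                 config="config/IS09_emotion.conf"):
    # One call per utterance; the appended ARFF row holds the 384 functionals.
    subprocess.run([smile_bin, "-C", config, "-I", wav_path, "-O", arff_path],
                   check=True)

extract_is09("utterance_0001.wav", "is09_features.arff")
```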

C. MODEL CONFIGURATION AND EVALUATION MEASURE
As discussed in Section III, the frameworks of the proposed methods consist of two different neural networks that represent the generator and the discriminator. In our experiments, the networks are multilayer feedforward neural networks mathematically defined as:

h^(l) = f(W^(l) h^(l−1) + b^(l)), l = 1, 2, ..., L,

where h^(0) = x, the matrix W^(l) and the vector b^(l) are, respectively, the weights and biases, which are trainable parameters, f(·) denotes a nonlinear activation function, i.e., the rectified linear unit (ReLU) [47], and L is the total number of layers. The input x is fed into the first hidden layer to perform a weighted linear transformation, and the trainable parameters are updated according to the objective function in the training stage. We utilize the Adam optimization algorithm [48] to optimize the parameters. The numbers of hidden layers for the generator and discriminator are both set to four, and each layer has the same number of hidden nodes. We utilize a grid search over the learning rate {0.1, 0.01, 0.001, 0.0001} and the number of hidden nodes {128, 256, 512, 1 024} to obtain the optimal values. The hyperparameters α, β and ε are set to 1 to simplify the experimental parameter search. The impact of these hyperparameters is evaluated in the following experiments. Before being fed into the neural networks, the input and target data are standardized to zero mean and unit variance on the training dataset. In addition, since the categories of the dataset are unbalanced, we use the officially recommended measure, the unweighted average recall (UAR), as our evaluation criterion. The UAR is the unweighted average of the class-specific recalls. We also apply K-fold cross-validation to reduce the influence of the limited amount of data. In each experiment, the entire dataset is divided into K folds according to speaker. We select K − 2 folds as the training set and one fold as the validation set and use the remaining fold as the test set. The experiment is repeated K times. Since the IEMOCAP dataset was recorded by 10 speakers, 10-fold cross-validation is performed. We report the mean and deviation of the UAR over all repeated experiments. Moreover, a one-sided z-test is computed to test for significance. To evaluate the performance of the proposed methods, Table 3 presents several comparison algorithms, i.e., two supervised methods and four semisupervised learning methods, that achieve good performance on the IEMOCAP dataset. A DNN and an SVM are selected as baseline supervised methods. The DNN has a framework similar to that of the SSGAN, but it does not have the unsupervised loss shown in Eq. (2). The SVM is the linear SVM utilized in the INTERSPEECH 2009 Emotion Challenge, trained with a small amount of labeled data. In addition, our proposed methods are compared with four semisupervised learning methods, namely, self-training and a denoising autoencoder (DAE) in combination with an SVM, as well as the SSAE [23] and the semisupervised ladder autoencoder (SS-LAE) [41]. To ensure a fair comparison, the structure of the adopted decoder is consistent with the generator of the proposed method, and the validation procedure matches that of our proposed methods. The results of the comparison methods are shown in Table 3.
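To make the evaluation protocol concrete, the sketch below runs per-speaker cross-validation with standardization fitted on the training portion and reports the UAR as the macro-averaged recall; the scikit-learn MLP merely stands in for the SSGAN discriminator, the 512-node hidden layers correspond to one grid value, and the separate validation fold and grid search described above are omitted for brevity.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def speaker_cv_uar(X, y, speakers):
    """X: (n_utterances, 384) features; y: emotion labels; speakers: speaker IDs."""
    uars = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        scaler = StandardScaler().fit(X[train_idx])
        clf = MLPClassifier(hidden_layer_sizes=(512,) * 4, activation="relu",
                            solver="adam")
        clf.fit(scaler.transform(X[train_idx]), y[train_idx])
        pred = clf.predict(scaler.transform(X[test_idx]))
        uars.append(recall_score(y[test_idx], pred, average="macro"))  # UAR
    return float(np.mean(uars)), float(np.std(uars))
```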
The experimental results in Table 3 show that our proposed methods are superior to the two supervised methods and the four semisupervised learning methods with different quantities of labeled data in terms of the average UAR. This performance may be attributed to the application of the GAN to learn the distribution of the input, which promotes the generalization capability. In addition, the semisupervised learning structure encourages the model to learn knowledge from a combination of labeled and unlabeled data. Compared with the other methods, the SSSGAN and VSSSGAN achieve significant improvements at p < 0.05. Although the SSGAN performs nearly as well as the SSSGAN and VSSSGAN, given 2 400 labeled data points, we obtain improvements of 1.5% and 0.9% when AT and VAT, respectively, are applied to smooth the output of the conditional label distribution. One possible explanation is that AT and VAT can explore the adversarial direction of the input, which can improve the robustness of the proposed methods.
In addition to the performance comparison between the proposed methods and state-of-the-art methods, we investigate the impact of the number of labeled data. Fig. 4 shows the performance of the proposed methods with 300, 600, 1 200 and 2 400 labeled data.
As Fig. 4 shows, the performance of the proposed methods improves as the quantity of labeled data increases. Notably, when the number of labeled data is doubled, the performance increases only slowly. These results suggest that our proposed methods do not always benefit from an increase in the amount of labeled data. Moreover, compared with the SSGAN, the VSSSGAN achieves a 1.2% improvement in the UAR with 600 labeled data points. Meanwhile, the relative improvement is 0.7%, 0.4% and 0.9% with 300, 1 200 and 2 400 labeled data points, respectively. These results suggest that the number of labeled data influences the extent of the performance improvement. One possible explanation is that there is less label information for smoothing when few labeled data are available, whereas beyond a certain point, providing more labeled data does not further help smooth the conditional label distribution. Furthermore, the VSSSGAN achieves better performance than the SSSGAN when fewer labeled data are provided. In contrast, if more labeled data are available, the SSSGAN is superior to the VSSSGAN. This result indicates that more label information can help the model smooth along the adversarial direction.

E. IMPACT OF CONTRIBUTION FACTORS α, β AND ε
In this section, we consider the influence of the contribution factors α, β and ε. As discussed above, α, β and ε control the smoothing applied to the SSGAN by AT or VAT. In Section IV-D, α and β are set to 1 to simplify the parameter search, which means that the smoothness term applied by AT or VAT is weighted equally with the SSGAN loss. The experimental results in Fig. 5 suggest that the SSSGAN and VSSSGAN obtain the best performance of 59.5% and 58.9% in terms of the UAR when the contribution factors α and β are set to 1.5 and 2.0, respectively. Notably, both the SSSGAN and VSSSGAN achieve higher performance with lower contribution factors: larger contribution factors significantly reduce the performance. One possible explanation is that increasing the weight of the smoothness term decreases the contribution of the SSGAN; thus, less label information is utilized for classification. Therefore, with higher contribution factor values, the VSSSGAN outperforms the SSSGAN, which may be due to the fact that the VSSSGAN can smooth the output of the conditional label distribution without label information.
In addition to estimating the impact of the contribution factors α and β, this section investigates the influence of the smoothness radius ε ∈ {0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0}. Fig. 6 shows the performance with different values of ε using 2 400 labeled data points and the default settings in Section IV-D.
The performance of the SSSGAN and VSSSGAN deteriorates as the smoothness radius ε increases. This result is expected because ε defines the radius around the input vector over which the conditional probability p(y|x) is smoothed. As ε increases, the model seeks to learn the smoothed function but cannot capture the distribution of the input. Specifically, at low ε, the SSSGAN outperforms the VSSSGAN, possibly because it is easier for the SSSGAN to explore the adversarial direction using the label information. In addition, without label information, the VSSSGAN can still obtain performance nearly as impressive as that of the SSSGAN.

F. IMPACT OF TRAINING WITH OUT-OF-DOMAIN UNLABELED DATA
In Section IV-D, we evaluated our proposed methods in the domain-matched (intradomain) condition, where the training and test sets come from the same dataset. In this section, we evaluate our proposed methods in mismatched and semimatched cross-corpus settings of the unlabeled training data. In the mismatched cross-corpus setting (interdomain), the unlabeled training set consists of data from domains different from the test set. In contrast, in the semimatched cross-corpus setting, some data from the same domain as the test set are also used for training. We again consider the four semisupervised learning methods presented in Section IV-D and follow the same experimental setup.
Let us first consider the mismatched cross-corpus situation, in which the unlabeled examples are randomly selected from the EmoDB, AEC, MSP-IMPROV and different combinations of the three, while the labeled data are chosen from IEMOCAP. Clearly, the unlabeled data differ from the IEMOCAP corpus, as shown in Table 1. Table 4 presents the experimental results for the proposed methods. Different selections of out-of-domain unlabeled data do not noticeably impact the performance compared with that shown in Table 3. Therefore, the proposed methods can resist the variation among different corpora due to the model smoothness.
Encouraged by the results of the mismatched condition, we further evaluate our proposed methods under the semimatched condition, where interdomain and intradomain data are combined for training. Here, we randomly select a partition of IEMOCAP in combination with a mixed partition of the AEC, EmoDB and MSP-IMPROV, which are treated as unlabeled training data. Fig. 7 shows the results obtained by our proposed methods under the semimatched and matched conditions. In contrast to the matched condition, the whole training set used in the semimatched condition is augmented by including other corpora, but such augmentation fails to improve the performance. As the number of labeled data increases, the performance of the proposed methods is better under the matched condition than under the semimatched condition. One possible explanation is that the training set used in the semimatched condition introduces a domain-mismatch problem that is harmful for classification. As a consequence, providing more in-domain labeled data can relieve the domain-mismatch problem and improve the performance. Notably, similar results are achieved under the semimatched and matched conditions when only a few in-domain labeled data are given. This result indicates that our proposed methods have a strong learning ability that overcomes the harmful impact of domain mismatch. Specifically, when 1 200 labeled samples are available, the VSSSGAN achieves better performance under the semimatched condition than under the matched condition.

V. CONCLUSIONS AND OUTLOOK
In contrast to previous works focused on unsupervised learning with a GAN, this article focuses on deep semisupervised learning with a GAN. We consider both generative and discriminative training. The predictions of the discriminator of the SSGAN comprise the emotional states and an additional fake class (representing generated data). Therefore, the SSGAN not only learns the distributional structure of the combination of labeled and unlabeled data but also performs classification simultaneously. This approach can relieve the dependence on labeled data and promote the generalization performance. Several experiments are conducted to evaluate the effectiveness of the SSGAN. The results indicate that the SSGAN is superior to other semisupervised learning methods. Considering the adversarial directions that exist in the SSGAN, this article further proposes the SSSGAN and VSSSGAN to smooth the output of the conditional label distribution. Specifically, the SSSGAN utilizes the label information to perform smoothing, while the VSSSGAN performs smoothing without label information, which relieves the dependence on labeled data. The results demonstrate that the SSSGAN and VSSSGAN improve the recognition performance by smoothing along the adversarial directions. In addition, the SSSGAN and VSSSGAN are evaluated in mismatched and semimatched conditions. The results suggest that the SSSGAN and VSSSGAN have the capability to overcome domain mismatch, learning prior knowledge from a different corpus and incorporating that knowledge into the classification. In other words, the proposed methods are robust to perturbations in the data.
A large number of previous works have shown that multimodal methods are superior to single-modality methods. In addition to audio, humans can express their emotions through linguistic content, facial expressions, and so on. Thus, our future research will focus on multimodal semisupervised learning for emotion recognition so that the model can benefit from the information of different modalities to obtain better performance.