Transfer Subspace Learning for Unsupervised Cross-Corpus Speech Emotion Recognition

In many practical applications, a speech emotion recognition model learned on a source (training) domain degrades significantly when applied to a novel target (testing) domain because of the mismatch between the two domains. Aiming at learning a better speech emotion recognition model for the target domain, this paper investigates unsupervised cross-corpus speech emotion recognition (SER), in which the training and testing speech signals come from two different speech emotion corpora. Moreover, the training speech signals are labeled, while the label information of the testing speech signals is entirely unknown. To deal with this problem, we propose a simple yet effective method called transfer subspace learning (TRaSL). TRaSL learns a projection matrix with which the source and target speech signals can be transformed from the original feature space to the label space. The transformed source and target speech signals in the label space share similar feature distributions. Consequently, the classifier learned on the labeled source speech signals can effectively predict the emotional states of the unlabeled target speech signals. To evaluate the performance of the proposed TRaSL method, we carry out extensive cross-corpus SER experiments on four speech emotion corpora: IEMOCAP, EmoDB, eNTERFACE, and AFEW 4.0. Compared with recent state-of-the-art cross-corpus SER methods, the proposed TRaSL achieves more satisfactory overall results.


I. INTRODUCTION
Speech emotion recognition (SER) has been a very attractive research field in affective computing, pattern recognition, and human-computer interaction (HCI). A major task of SER is to give computers the ability to recognize human emotional states, such as happiness, anger, and disgust, from speech signals [1]. In recent years, many effective methods have been proposed to deal with this problem [2], [3]. However, most current SER methods depend heavily on one common assumption, namely that the training and testing speech samples belong to the same corpus. In this case, the features extracted from the training and testing speech sequences can be assumed to obey the same or similar marginal probability distributions. In many practical situations, however, the training and testing samples may belong to different domains, e.g., they may be recorded with different equipment or collected under different environments. In this scenario, the marginal probability distribution of the feature vectors of the training speech would be quite different from that of the testing speech. This creates a problem more difficult yet more interesting than conventional SER, i.e., cross-corpus SER. To distinguish the training and testing speech corpora in the cross-corpus SER problem, the two corpora are referred to as the source corpus and the target corpus, respectively. In the work of [4], Deng et al. classified cross-corpus speech emotion recognition into two categories, the semi-supervised case and the unsupervised case; the main difference between them is whether the label information of the target domain is available. Cross-corpus SER in general follows this classification.
In this paper, we investigate unsupervised cross-corpus SER, in which the training and testing speech signals come from two different speech emotion corpora. The training speech signals are labeled, while the label information of the testing speech signals is completely unknown. Due to this setting, the training and testing speech signals may have different feature distributions. To deal with this problem, we propose a novel method called transfer subspace learning (TRaSL). Our preliminary work [5] reduced the discrepancy between the source and target domains to complete the classification, but the structures of the two domains were not considered. The basic idea of TRaSL is to learn a projection matrix that transforms the source and target speech signals from the original feature space into a common subspace. In this common subspace, the source and target speech signals are enforced to obey similar feature distributions, and hence we can train a classifier, e.g., a support vector machine (SVM), on the labeled source speech signals such that it can accurately predict the emotional states of the target speech signals. Motivated by the works of [6], [7], we construct a label space based on the label information provided by the source speech corpus to serve as the predefined common subspace for TRaSL.
The main contributions of this paper for unsupervised cross-corpus speech emotion recognition are summarized as follows: 1) A new framework called TRaSL for dealing with unsupervised cross-corpus SER is proposed. In the TRaSL model: (a) a projection matrix is learned to transform the source and target speech signals from the original feature space to the common space; (b) in the common space, the disparity between the source and target feature vectors is reduced; and (c) the structures of the source and target domains are enforced to be similar, which preserves enough discriminant information for further model learning. 2) We use four representative cross-corpus SER methods and SVM as baselines to conduct extensive evaluation experiments under the designed protocol and discuss the experimental results in depth. The remainder of this paper is organized as follows: Section II presents recent work on cross-corpus SER. In Section III, we describe the central idea of the TRaSL framework for cross-corpus SER, along with the optimization method used to solve it. Extensive experiments evaluating the TRaSL framework are reported in Section IV. Finally, we conclude the paper in Section V.

II. RELATED WORK
A. DOMAIN ADAPTATION
Domain adaptation (DA) is a representative transfer learning approach that uses labeled source domain samples to improve the performance of a target domain model [20]. In the DA problem, the labeled source domain and the unlabeled target domain share the same categories and feature space, but the feature distributions differ, i.e., X_s = X_t while P_s(X) ≠ P_t(X), where X_s and X_t are the feature spaces and P_s(X) and P_t(X) are the feature distributions of the source and target domains, respectively. DA can be broadly categorized into two groups according to whether the target domain samples have some labels or are entirely unlabeled. The former is referred to as semi-supervised DA, while the latter is called unsupervised DA. While semi-supervised DA is generally performed by utilizing the correspondence information obtained from labeled target domain data to learn the domain-shifting transformation (e.g., [6]), unsupervised DA is based on one of the following strategies: (i) imposing certain assumptions on the class of transformations between domains [8], or (ii) assuming the availability of certain discriminative features that are common to both domains [9], [10], [13].
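The distribution mismatch P_s(X) ≠ P_t(X) can be made concrete with a small numerical check. The following is an illustrative sketch only; the function name and the linear-kernel simplification of the discrepancy (the distance between domain means) are our assumptions, not taken from the paper.

```python
import numpy as np

def mean_discrepancy(X_s, X_t):
    """Distance between the mean feature vectors of two d x N feature
    matrices -- a simple linear-kernel surrogate for distribution mismatch."""
    mu_s = X_s.mean(axis=1)
    mu_t = X_t.mean(axis=1)
    return float(np.linalg.norm(mu_s - mu_t))

# Toy source/target features drawn from shifted distributions.
rng = np.random.default_rng(0)
X_s = rng.normal(0.0, 1.0, size=(5, 200))   # "source": mean 0
X_t = rng.normal(0.5, 1.0, size=(5, 200))   # "target": mean 0.5 (covariate shift)

# The cross-domain gap is far larger than the within-domain gap.
print(mean_discrepancy(X_s, X_t), mean_discrepancy(X_s, X_s[:, :100]))
```

A cross-domain discrepancy that clearly exceeds the within-domain one is exactly the situation in which a classifier trained on the source corpus alone degrades on the target corpus.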

B. CROSS-CORPUS SPEECH EMOTION RECOGNITION
Cross-corpus SER is a new learning setting that allows the source and target samples to come from different distributions. Consequently, how to deal with this problem is an important and challenging question in current research, and some researchers have focused on it and proposed effective methods. Schuller et al. [11] employed multiple normalization schemes to investigate the cross-corpus SER problem, which may be the first study of cross-corpus SER. Thereafter, more diverse cross-corpus SER methods were proposed in sequence [4], [6], [12], [14]-[18]. For example, Deng et al. [4], [12], [14] proposed a series of autoencoder-based domain adaptation methods for cross-corpus SER, in which autoencoder networks are exploited to learn new representations for source and target speech samples. In the work of [15], Hassan et al. proposed an importance-weighted support vector machine (IW-SVM) to cope with cross-corpus SER problems. IW-SVM leverages three transfer learning methods [20], i.e., kernel mean matching (KMM) [21], the Kullback-Leibler importance estimation procedure (KLIEP) [9], and unconstrained least-squares importance fitting (uLSIF) [23], to learn a set of importance weights for target speech samples such that the feature distribution mismatch between source and target speech samples is relieved. Besides the above methods, it is also worth mentioning the selective transfer machine (STM) [24], [25], which was proposed for personalized (cross-subject) facial action unit detection. STM inherits the ability of KMM [21] to eliminate the feature distribution difference between source and target samples and also retains the discriminative ability of the SVM.
The abovementioned subspace learning algorithms focus on finding latent common feature representations to cope with the feature-matching problem, but do not also take the importance of feature selection into account. Recently, a transfer non-negative matrix factorization (TNNMF) method was proposed by Song et al. [16] for cross-corpus SER tasks. In TNNMF, the maximum mean discrepancy (MMD) [26] is used to reduce the feature distribution difference between the originally distinct source and target speech signals. Zong et al. [6], [7] proposed a novel domain adaptation method called the domain-adaptive least squares regression (DALSR) model to handle cross-corpus SER. DALSR learns a regression coefficient matrix to bridge the source and target speech corpora. Although DALSR considers the importance of feature selection, it must fix the number of auxiliary samples from the target corpus. More recently, Song et al. [40] presented a feature-selection-based transfer subspace learning (FSTSL) method for the cross-corpus SER problem, which treats feature selection as an additional constraint. FSTSL considers the feature distribution difference but neglects the discriminant information of the model.
Different from these studies, TRaSL constructs a label space from the label information provided by the source speech corpus, which serves as the predefined common subspace. TRaSL considers both the feature distribution difference and the discriminant property of the model, and the importance of feature selection is also taken into account.

III. PROPOSED METHOD

A. BASIC IDEA OF TRaSL
In this section, we first introduce the basic idea of TRaSL. For a better understanding of the TRaSL framework, Fig. 1 gives the architecture of the proposed model. As Fig. 1(a) shows, the goal of the proposed TRaSL framework is to learn a projection matrix with which the source and target speech signals can be transformed from the original feature space to the label space. In the label space, the source and target speech signals share similar feature and structure distributions, as depicted in Fig. 1(b). A classifier is then trained on the projected source speech features and their given labels to predict the categories of the projected target signals.

B. TRaSL MODEL FRAMEWORK
For clarity, we first define some notations. Throughout the paper, matrices are written in upper-case letters and vectors in lower-case letters.
Suppose we have two different speech corpora serving as the source and target corpus, respectively. Their feature matrices are denoted by X_s ∈ R^(d×N_s) and X_t ∈ R^(d×N_t), where d is the dimension of the speech feature vector and N_s and N_t are the numbers of source and target speech signals, respectively. In the unsupervised cross-corpus SER case, the label information of the source speech signals is available and, following the works of [4], [6], we denote it in matrix form. Specifically, let L_s ∈ R^(c×N_s) be the label matrix corresponding to the source feature matrix X_s, where c is the number of speech emotion states and the i-th column l_i^s of L_s is a class label vector whose j-th element takes the value 0 or 1 according to the rule: l_ij^s = 1 if x_i^s belongs to the j-th emotion state, and l_ij^s = 0 otherwise.
By using these source label vectors, we construct a new subspace to serve as the predefined common subspace. TRaSL aims at learning a projection matrix U that projects the source speech feature matrix X_s from the original feature space into the common subspace spanned by the columns of L_s, which can be formulated as the following optimization problem:

min_U ||U^T X_s − L_s||_F^2.   (1)

Meanwhile, with the projection matrix U, the target speech feature matrix X_t can also be mapped into the predefined common subspace, where the projected source and target speech features are enforced to share similar distributions. To achieve this goal, following the MMD criterion [26] and TNNMF [16], we minimize the distance between the mean projected source speech feature vector and the mean projected target speech feature vector:

min_U ||U^T μ_s − U^T μ_t||_2^2,   (2)

where μ_s = (1/N_s) Σ_i x_i^s and μ_t = (1/N_t) Σ_i x_i^t are the mean source and target feature vectors. Besides reducing the discrepancy between the source and target domains, the structures of both domains in the common subspace are also expected to be similar. Motivated by the work of Gretton et al. [28], we impose the difference between the projected covariance matrices of the source and target feature vectors on the objectives of Eqs. (1) and (2); the structure is constrained simply by minimizing this covariance difference, which is formulated as Eq. (3):

min_U ||U^T (Σ_s − Σ_t) U||_F^2,   (3)

where Σ_s and Σ_t are the covariance matrices of the source feature vectors x_i^s and the target feature vectors x_i^t, respectively. Minimizing the combination of the objective functions in Eqs. (1), (2), and (3) gives the optimization problem

min_U ||U^T X_s − L_s||_F^2 + λ1 ||U^T μ_s − U^T μ_t||_2^2 + λ2 ||U^T (Σ_s − Σ_t) U||_F^2,

where λ1 and λ2 are trade-off parameters that control the balance among the three terms.
Besides the combination described above, we introduce an L_{2,1}-norm term on the transpose of U to serve as a regularizer that selects the important features contributing to SER [6] during the feature projection. This yields our TRaSL model, shown in Eq. (4):

min_U ||U^T X_s − L_s||_F^2 + λ1 ||U^T μ_s − U^T μ_t||_2^2 + λ2 ||U^T (Σ_s − Σ_t) U||_F^2 + ||U^T||_{2,1}.   (4)
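As an illustration only, the TRaSL objective can be evaluated numerically for a candidate projection U. The sketch below follows the notation of the text; the function name, the use of `np.cov` for the covariance matrices, and the reading of ||U^T||_{2,1} as the sum of L2 norms of the rows of U are our assumptions, not released code.

```python
import numpy as np

def trasl_objective(U, X_s, L_s, X_t, lam1, lam2):
    """Evaluate the four-term TRaSL objective for a given projection U
    (a sketch; not the authors' implementation)."""
    # Regression of projected source features onto the label space.
    fit = np.linalg.norm(U.T @ X_s - L_s, "fro") ** 2
    # MMD-style mean-matching term between the projected domains.
    mu_s, mu_t = X_s.mean(axis=1), X_t.mean(axis=1)
    mmd = np.linalg.norm(U.T @ (mu_s - mu_t)) ** 2
    # Covariance (structure) matching term.
    C_s, C_t = np.cov(X_s), np.cov(X_t)
    cov = np.linalg.norm(U.T @ (C_s - C_t) @ U, "fro") ** 2
    # L2,1 regularizer on U^T: here taken as the sum of row norms of U,
    # which encourages entire feature dimensions to be zeroed out.
    l21 = float(np.sum(np.linalg.norm(U, axis=1)))
    return fit + lam1 * mmd + lam2 * cov + l21
```

A solver would minimize this quantity over U; the row-sparsity induced by the last term is what performs the feature selection discussed in the text.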

C. OPTIMIZATION OF THE TRaSL FRAMEWORK
The TRaSL model can be solved using the inexact augmented Lagrange multiplier (IALM) method [27]. More specifically, by introducing two auxiliary variables Q and K satisfying U = Q and U = K, Eq. (4) can first be reformulated as

min_{U,Q,K} ||K^T X_s − L_s||_F^2 + λ1 ||Q^T μ_s − Q^T μ_t||_2^2 + λ2 ||Q^T (Σ_s − Σ_t) Q||_F^2 + ||U^T||_{2,1},   (5)

and we then convert the optimization problem of Eq. (5) into a constrained one:

min_{U,Q,K} ||K^T X_s − L_s||_F^2 + λ1 ||Q^T μ_s − Q^T μ_t||_2^2 + λ2 ||Q^T (Σ_s − Σ_t) Q||_F^2 + ||U^T||_{2,1}, s.t. U = Q, U = K.   (6)

Subsequently, the Lagrange function of Eq. (6) is obtained as

L(U, Q, K, T1, T2) = ||K^T X_s − L_s||_F^2 + λ1 ||Q^T μ_s − Q^T μ_t||_2^2 + λ2 ||Q^T (Σ_s − Σ_t) Q||_F^2 + ||U^T||_{2,1} + tr(T1^T (U − Q)) + tr(T2^T (U − K)) + (μ/2)(||U − Q||_F^2 + ||U − K||_F^2),   (7)

where T1 and T2 are the Lagrange multipliers and μ > 0 is the penalty parameter. Finally, to obtain the optimal U, we iteratively minimize the Lagrange function of Eq. (7) with respect to one variable while fixing the others, until convergence. More specifically, we perform the following five steps:
1. Fix U, Q, T1, T2, and μ, and update K. The subproblem becomes

min_K ||K^T X_s − L_s||_F^2 + (μ/2)||U − K + T2/μ||_F^2,

whose closed-form solution is K = (2 X_s X_s^T + μ I)^(−1) (2 X_s L_s^T + μ U + T2), where I is the identity matrix.
2. Fix U, K, T1, T2, and μ, and update Q by minimizing Eq. (7) with respect to Q:

min_Q λ1 ||Q^T μ_s − Q^T μ_t||_2^2 + λ2 ||Q^T (Σ_s − Σ_t) Q||_F^2 + (μ/2)||U − Q + T1/μ||_F^2.

3. Fix Q, K, T1, T2, and μ, and update U. The subproblem can be rewritten as

min_U ||U^T||_{2,1} + (μ/2)(||U − Q + T1/μ||_F^2 + ||U − K + T2/μ||_F^2).

According to Lemma 4.1 in [15], the optimal U can be obtained row by row from q_i, t_1i, t_2i, and k_i, the i-th rows of Q, T1, T2, and K, respectively.
4. Update T1, T2, and μ: T1 ← T1 + μ(U − Q), T2 ← T2 + μ(U − K), and μ ← min(ρμ, μ_max) with ρ > 1.
5. Check convergence: stop when ||U − Q||_∞ ≤ ε and ||U − K||_∞ ≤ ε, where ε denotes a small convergence tolerance.
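Two of the IALM steps admit compact implementations. The sketch below covers the closed-form K-step and the multiplier/penalty update, under the splitting assumed above (helper names are hypothetical; the Q- and U-steps are omitted).

```python
import numpy as np

def update_K(U, T2, mu, X_s, L_s):
    """Closed-form K-step: minimize
    ||K^T X_s - L_s||_F^2 + (mu/2)||U - K + T2/mu||_F^2 over K,
    i.e. solve (2 X_s X_s^T + mu I) K = 2 X_s L_s^T + mu U + T2."""
    d = X_s.shape[0]
    A = 2.0 * (X_s @ X_s.T) + mu * np.eye(d)
    B = 2.0 * (X_s @ L_s.T) + mu * U + T2
    return np.linalg.solve(A, B)

def update_multipliers(T1, T2, mu, U, Q, K, rho=1.1, mu_max=1e6):
    """Dual ascent on the Lagrange multipliers plus penalty growth."""
    T1 = T1 + mu * (U - Q)
    T2 = T2 + mu * (U - K)
    mu = min(rho * mu, mu_max)
    return T1, T2, mu
```

In a full solver these would alternate with the Q-minimization and the row-wise L2,1 proximal step for U until the constraints U = Q and U = K are satisfied to tolerance.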
D. CROSS-CORPUS SER USING THE TRaSL MODEL

After learning the optimal U* with the solving method of Section III-C, we predict the emotion states of the target speech samples as follows: each target speech signal is assigned the emotion label according to the criterion

emotion_label(j) = arg max_k [U*^T X_t]_(kj),

where [U*^T X_t]_(kj) denotes the k-th element of the j-th column (target speech signal) of the projected matrix U*^T X_t.
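The prediction rule amounts to a column-wise arg max over the projected target features; a minimal sketch (the function name is ours):

```python
import numpy as np

def predict_emotions(U_opt, X_t):
    """Predict target emotion labels: for the j-th target sample, pick
    the index k of the largest entry in the j-th column of U*^T X_t."""
    scores = U_opt.T @ X_t          # c x N_t label-space representation
    return np.argmax(scores, axis=0)
```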

IV. EXPERIMENTS

A. SPEECH EMOTION DATABASES
In this section, we conduct extensive cross-corpus SER experiments to evaluate the performance of the proposed TRaSL method. Four popular speech emotion corpora are employed: EmoDB [30], the audio part of eNTERFACE [31], the audio part of AFEW 4.0 [32], and the IEMOCAP database [34]. The detailed information of these speech emotion databases is shown in Table 1.
• The second dataset is the eNTERFACE corpus, which is composed of 1287 emotion videos from 43 subjects, categorized into the six basic emotions.
• For the IEMOCAP database, the happy class (1636 samples) includes both the happy and excitement classes. This is the standard data selection used in many experiments with the IEMOCAP database [10], [36], [37], [38].

B. EXPERIMENTAL SETTINGS
Following the experimental protocol of [5], [6], we each time select two of the speech corpora and pick the samples belonging to the emotion states common to both, which serve as the source and target corpus alternately. There are therefore twelve experiments in total, and each group of experiments consists of two sub-experiments. For convenience, these twelve experiments are denoted by Exp.1, Exp.2, ..., Exp.12, and their detailed source and target speech corpora are listed in Tables 2 and 3. Following [35], each speech signal is represented by a 384-dimensional feature vector, i.e., 16 acoustic low-level descriptors (LLDs), such as zero-crossing rate (ZCR), root-mean-square frame energy (RMS energy), and Mel-frequency cepstral coefficients (MFCC), together with 12 statistical functionals [33], such as standard deviation and kurtosis. As evaluation metrics, we employ the weighted average recall (WAR) and the unweighted average recall (UAR) [11], which are widely used in cross-corpus speech emotion recognition, to report the performance of all methods. WAR is the ordinary recognition accuracy, while UAR is the mean of the per-class accuracies, i.e., the sum of the per-class accuracies divided by the number of classes, without regard to the number of instances per class. Since the sample classes are imbalanced in the cross-corpus evaluations, as shown in Table 1, i.e., the numbers of samples in the different classes differ greatly, it is more appropriate to evaluate the results with both WAR and UAR.

TABLE 2. Results of the Exp.1 to Exp.12 cross-corpus speech emotion recognition experiments in terms of WAR; for each group, the comparative experiments use the emotion states common to both corpora, and the best results are highlighted in bold.
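For reference, WAR and UAR can be computed as follows; the toy example shows why UAR is informative on imbalanced data (the function name is ours).

```python
import numpy as np

def war_uar(y_true, y_pred):
    """WAR: overall accuracy. UAR: mean of per-class recalls, which
    ignores how many instances each class has."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    war = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    uar = float(np.mean(recalls))
    return war, uar

# Imbalanced toy corpus: 8 'angry' (0), 2 'happy' (1); everything predicted angry.
war, uar = war_uar([0] * 8 + [1] * 2, [0] * 10)
print(war, uar)   # WAR 0.8 looks fine, but UAR 0.5 exposes the collapse
```

This is exactly the failure mode discussed later for EmoDB: a class-collapsed predictor can still post a high WAR while its UAR drops to chance level.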
For comparison, we choose KMM [20], KLIEP [9], uLSIF [23], DALSR [6], and DoSL [5] to conduct the experiments under the same protocol as our TRaSL. Besides, we select a linear SVM without any domain adaptation ability as the baseline for all the comparison methods. The detailed trade-off parameter settings of all the methods are as follows: 1) For the baseline SVM, we use a linear kernel function and set C = 1; for a fair comparison, the linear kernel function is adopted for all methods throughout the experiments. 2) For the KMM method, two important parameters must be set: B, the upper limit of the importance weights, and ε. Following the suggestion of [7], they are set as B = 1000 and ε = (√n_tr − 1)/√n_tr, where n_tr denotes the number of training samples. 3) The experimental results of the comparison methods are taken directly from [5], since the comparative experiment settings are exactly the same as those of [5]. Finally, the parameters (λ1, λ2) of our TRaSL are empirically fixed at (2, 7), (121, 1), (8, 3), (18, 5), (14, 20), and (18, 10) for the Exp.7, Exp.8, ..., Exp.12 experiments, respectively. Meanwhile, we use the method described in Section III-D for TRaSL to predict the emotion labels of the target speech samples.

C. RESULTS AND ANALYSIS
In this section, we report the results of the evaluated methods, including various DA methods. The experimental results in terms of WAR and UAR for all twelve experiments are listed in Tables 2 and 3, respectively. The regular numbers are the recognition rates, and the subscript numbers are the relative ranks of WAR and UAR for each method. To observe the influence of different source and target domains on the results and the overall performance of each method, we calculate the average (Avg) results over all experiments for each method and over all methods for each experiment, shown in the last row and last column of these tables. From the results, we make the following observations. First, in all experiments our TRaSL framework achieves promising performance increases over the SVM without any domain adaptation ability. Additionally, TRaSL achieves both the best UAR and the best WAR among all methods in eight of the twelve cases, including Exp.5, Exp.7, and Exp.10 to Exp.12. Meanwhile, the UAR of TRaSL in Exp.9 and the WAR of DoSL in Exp.4 are very competitive with the highest results in their respective experiments, as shown by the comparison between KMM and TRaSL in Exp.9 (30.39% vs. 29.38%) and between our DoSL and TRaSL in Exp.4 (55.35% vs. 55.30%).
Second, we observe that DALSR achieves the best UAR among all the comparison methods in Exp.8, which shows its effectiveness there. Although our TRaSL does not perform best in terms of UAR in this experiment, the difference from DALSR (the highest) is actually small, and TRaSL achieves the highest WAR: the UAR and WAR of DALSR are (44.41%, 52.27%), while those of our TRaSL are (43.44%, 52.71%). In addition, in Exp.3 the results of KMM, KLIEP, and uLSIF show large gaps from the baseline SVM; we conjecture that these three methods may have predicted all samples as the same label.
Third, our results convincingly show that limited label information provided by a small number of source samples leads to low recognition rates. Moreover, the gaps between WAR and UAR of the other four comparison methods are much larger than those of DALSR, DoSL, and TRaSL. Owing to the dominant share of Anger samples in EmoDB, we consider that most EmoDB samples may be mistakenly predicted as Angry by the four comparison methods, leading to the larger gaps between WAR and UAR. Meanwhile, in the experiments using EmoDB as the target database (Exp.1, Exp.8, and Exp.10), there is a large gap between WAR and UAR for all methods. A further interesting finding from the tables is that the results obtained when EmoDB is the source database are clearly lower than those obtained when the same database is the target, e.g., Exp.3 versus Exp.7. This is mostly due to the class imbalance present in the EmoDB database, as shown in Table 1. In contrast to Exp.1 to Exp.7 and Exp.10, the UAR of our proposed method in Exp.9 is lower than that of KMM, and most methods achieve low recognition rates there; this is probably caused by the imbalance of the labeled samples across the classes of the source corpus.

D. EFFECTIVENESS VERIFICATION
To verify the above analysis and further observe how the data imbalance between the source and target databases affects cross-corpus speech emotion recognition, we select three pairs of experiments from Exp.1 to Exp.6, in which each database serves as the source and the target database in turn, and draw the confusion matrices of all the comparison methods in Figs. 3, 4, and 5, respectively. In Fig. 3, the database on the left of ''→'' is the source database and the one on the right is the target database; e.g., in IEMOCAP → EmoDB, IEMOCAP is the source database and EmoDB is the target database. From these confusion matrices, several interesting findings can be obtained. (1) Performance on different emotions: From the confusion matrices of TRaSL in Figs. 3, 4, and 5, we see that the Angry and Neutral expressions are much easier to recognize than the others, while the Happy expression is confused more often than any other expression. Additionally, the confusion matrix of TRaSL in Fig. 5 shows a large gap between the recognition rates of the Anger and Happy expressions when EmoDB and IEMOCAP are the source and target databases, respectively. These results coincide with the analysis of the above experiments. (2) The impact of an imbalanced database: Examining the confusion matrices of Exp.1, which lie in the first column of Fig. 3, we can clearly see that almost all EmoDB samples are predicted as Angry by SVM, KMM, KLIEP, and uLSIF, which confirms our previous analysis: the dominant share of Anger samples in EmoDB leads to this phenomenon. It also explains the large gaps between WAR and UAR for these four methods, and indicates that the proposed TRaSL method is less affected by the extreme class imbalance in EmoDB and is more applicable to this challenging experiment. (3) Performance with limited label information:
Comparing the confusion matrices in the last column of Fig. 3 with those in Fig. 5, we find that in Fig. 5 the results of all methods are seriously affected by the limited label information provided by the source database, and hence most target samples are wrongly predicted. This is because the model cannot be adequately trained with a small number of source samples. Although the TRaSL method promisingly alleviates these extreme mispredictions, nearly 70% of the Happy and Sad samples in Fig. 5 are still wrongly predicted. Consequently, DA methods, including the proposed TRaSL, still have much room for improvement when the source database contains only a small number of samples.

E. FURTHER VERIFICATION
Transfer learning is widely used in many fields. To further illustrate the effectiveness of the proposed TRaSL and the impact of different features on algorithm performance, we choose one baseline, taken directly from [40], and several state-of-the-art transfer learning methods, including DR [41], [42], TCA [43], GFK [44], STM [25], and TNNMF [16], to conduct the experiments. In the baseline method, the training and testing data come from the same corpus. In this section, we choose Exp.7 and Exp.8 as representatives. The WAR results are shown in Tables 4 and 5. From the tables, it is clear that TRaSL achieves better performance than the other transfer learning methods. Besides, we observe that the dimensionality-reduction-based transfer learning algorithms, i.e., TCA, GFK, DR, and STM, which do not take the importance of feature selection into account, do not achieve excellent results. Furthermore, the WAR of the baseline method is much higher than that of the other methods, which indicates that different feature distributions greatly influence the recognition rate; it also shows the necessity of studying cross-corpus speech emotion recognition.

F. ABLATION STUDIES
In order to see how the terms of the objective function affect the performance of TRaSL, we conduct ablation studies of the model. The final objective function, shown in Eq. (4), is composed of three parts. In this section, we conducted three kinds of experiments.
TRaSL-I: In Eq. (4), the ||U^T||_{2,1} term serves as the regularizer that selects the important features. To probe its impact on SER, we remove this term from Eq. (4) and denote the resulting model TRaSL-I.
TRaSL-D: To check the effect of the discriminant property on the model, the term ||U^T(Σ_s − Σ_t)U||_F^2, which preserves the discriminant information, is removed; we name the resulting model TRaSL-D.
TRaSL: TRaSL is the full objective function, which includes the feature distribution difference term and the discriminant property term, while also taking the importance of feature selection into account. The ablation results on the speech signal features are shown in Table 6. From the results, TRaSL achieves a promising increase over TRaSL-I and TRaSL-D, which demonstrates the effectiveness of the proposed framework. Furthermore, as the table shows, TRaSL-I yields a lower recognition rate, which indicates that selecting important features greatly influences the results. Besides, to verify the effectiveness and robustness of TRaSL, experiments are also carried out using the spectrogram features of the EmoDB and IEMOCAP databases. To obtain more training data, the utterances are divided into several segments, and all segments of the same utterance share the same label. Researchers have pointed out that a segment longer than 250 ms can provide enough emotional information [45], [46]. Similar to [47], in this work the segment length is set to 265 ms. Following the experimental setting in Section IV-B, we select the samples belonging to the emotion states common to the two datasets, which serve as the source and target corpus alternately. Meanwhile, we randomly select three groups of samples from the EmoDB and IEMOCAP databases, respectively; the numbers of samples in each group are (200, 800), (1000, 2000), and (3000, 5000). In Table 7, the sample numbers of EmoDB and IEMOCAP in EXP.1 and EXP.2 are 200 and 800, respectively; similarly, they are 1000 and 2000 in EXP.3 and EXP.4, and 3000 and 5000 in EXP.5 and EXP.6. The results show that our TRaSL also achieves promising results with the spectrogram-based statistical features.
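The 265 ms segmentation described above can be sketched as follows; the function name and the non-overlapping-window assumption are ours ([47] may use a different hop), and every segment inherits its utterance's emotion label.

```python
import numpy as np

def split_utterance(signal, sr, seg_ms=265):
    """Split an utterance into non-overlapping fixed-length segments
    (265 ms by default, following the setting described in the text);
    a trailing remainder shorter than one segment is dropped."""
    seg_len = int(sr * seg_ms / 1000)
    n_full = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

sr = 16000
utt = np.zeros(sr * 2)                  # a 2-second dummy utterance
segs = split_utterance(utt, sr)
print(len(segs), len(segs[0]))          # 2 s at 16 kHz -> 7 segments of 4240 samples
```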

G. PARAMETER SENSITIVITY
The proposed TRaSL framework has two important trade-off parameters, λ1 and λ2, whose selection affects its performance. The obvious question is therefore whether the performance of TRaSL is sensitive to the selection of λ1 and λ2. To investigate this point, we conduct experiments in which the value of one trade-off parameter is fixed while the other is varied. As representatives, we select two pairs of experiments, Exp.1, Exp.2, Exp.7, and Exp.8, and report the average recognition accuracy (WAR). The preset search space for both λ1 and λ2 is [0.001, 0.01, 0.1, 1, 10, 100], and the fixed λ1 and λ2 values are consistent with the experiments in Section IV-B. The WAR results for these parameters are shown in Fig. 6. We can see that the performance of TRaSL varies only slightly with respect to changes in λ1 and λ2 in all experiments, which indicates that TRaSL achieves near-optimal recognition performance over a wide range of parameter values, i.e., TRaSL is not very sensitive to its trade-off parameters.
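The sensitivity sweep over the preset grid can be sketched as below, with a stand-in evaluation function; the names are hypothetical, and a real `evaluate` would train TRaSL with the given (λ1, λ2) and return the WAR on the target corpus.

```python
import itertools

def grid_search(evaluate, grid=(0.001, 0.01, 0.1, 1, 10, 100)):
    """Sweep the (lambda1, lambda2) trade-off grid used in the
    sensitivity study and return the pair with the highest score."""
    return max(itertools.product(grid, grid), key=lambda p: evaluate(*p))

# Toy stand-in: pretend WAR peaks at lambda1 = 1, lambda2 = 10.
best = grid_search(lambda l1, l2: -abs(l1 - 1) - abs(l2 - 10))
print(best)   # (1, 10)
```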

V. CONCLUSION AND FUTURE WORKS
In this work, we propose an unsupervised transfer subspace learning (TRaSL) model that transforms the original sample features of the source and target databases into a predefined common subspace, thereby dealing with the unsupervised cross-corpus speech emotion recognition (SER) problem. With the TRaSL model, we learn a projection matrix to transform the source and target speech samples from the original feature space, where their feature distributions differ greatly, into the label space, where the transformed samples obey similar feature distributions. The classifier learned on the transformed labeled source speech samples is then used to predict the speech emotion categories of the unlabeled target speech samples. Extensive cross-corpus SER experiments on the four speech emotion corpora are conducted to evaluate the proposed TRaSL method, and the results demonstrate its superiority over recent state-of-the-art cross-corpus SER methods. The investigations also imply that both the label information provided by the source database and the class imbalance of the target domain constrain domain adaptation: the amount of label information provided by the source database determines how many emotional samples the model can learn from, while imbalance in the target domain tends to collapse the predictions into a single emotion class.
In this work, we mainly aim to transform the source and target speech signals so that they share similar feature distributions for SER. We also expect that a more sophisticated feature selection method can be designed to further improve the performance. With the development of deep learning techniques, their strong nonlinear representation ability should help bridge the source and target domains; one of our future works will study how to introduce convolutional neural networks into the TRaSL method.