Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion

An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this article, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech. More specifically, we first investigate the effectiveness of incorporating the generative adversarial networks (GANs) with CDVAE-VC. Then, we consider the concept of domain adversarial training and add an explicit constraint to the latent representation, realized by a speaker classifier, to explicitly eliminate the speaker information that resides in the latent code. Experimental results confirm that the degree of disentanglement of the learned latent representation can be enhanced by both GANs and the speaker classifier. Meanwhile, subjective evaluation results in terms of quality and similarity scores demonstrate the effectiveness of our proposed methods.


I. INTRODUCTION
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Wen-Chin Huang is with the Graduate School of Informatics, Nagoya University, Japan. This work was done while he was with the Institute of Information Science, Academia Sinica, Taipei, Taiwan. e-mail: wen.chinhuang@g.sp.m.is.nagoya-u.ac.jp.
Hao Luo, Hsin-Te Hwang, Chen-Chou Lo, Yu-Huai Peng and Hsin-Min Wang are with the Institute of Information Science, Academia Sinica, Taipei, Taiwan.
Yu Tsao is with the Research Center of Information Technology Institute of Information Science, Academia Sinica, Taipei, Taiwan.

Voice conversion (VC) aims to convert the speech from a source to that of a target without changing the linguistic content [1]. Speaker voice conversion [2] is a typical type of VC and refers to the process of converting speech from a source speaker to a target speaker. In addition, a wide variety of applications could be solved by applying VC, such as accent conversion [3], personalized speech synthesis [4], [5], and speaking-aid device support [6]-[8]. Since the spectral property plays an important role in characterizing speaker individuality, spectral conversion has been intensively studied in VC. In this work, we focus on spectral mapping in speaker voice conversion.
Numerous VC approaches have been proposed. The Gaussian mixture model (GMM)-based method [9], [10] has been a popular statistical approach that estimates the joint density of the source-target feature vectors, which requires a training procedure and has a well-known disadvantage that the converted outputs generally suffer from an over-smoothing issue. Frequency warping methods, such as vocal tract length normalization [11], weighted frequency warping [12] and dynamic frequency warping [13], are able to keep spectral details while providing inferior speaker identity conversion quality to that of statistical approaches. Exemplar-based methods [14]-[18] require much less training data and are capable of modeling the high-dimensional spectra. In recent years, deep neural networks (DNNs) have established supremacy in a wide range of research fields, including VC [19]-[22]. DNNs have been utilized for not only spectral mapping but also neural vocoding [23]-[25]. It has been shown that employing neural vocoders as the waveform generation module can greatly improve the performance of VC systems [26]-[31]. It has also been shown that VC systems, whether implemented in high-dimensional or low-dimensional features, benefit from spectral detail compensation [15], [18], [32].
Nonetheless, most of the approaches described above rely on the availability of parallel training data, which is often not accessible in real-world scenarios. Thus, the development of non-parallel VC methods has been gaining attention [33]. One approach is to construct a pseudo-parallel dataset from a non-parallel corpus [34]. Another family of approaches utilizes a pre-trained automatic speech recognition model to compute the phonetic posteriorgram (PPG) as the speaker-independent linguistic feature, followed by a PPG-to-acoustic mapping to generate converted features [35], [36]. A recently popular approach is to use DNNs to model the probability distribution of the target features; state-of-the-art models such as variational autoencoders (VAEs) [37] and generative adversarial networks (GANs) [38] have been successfully applied to non-parallel VC [36], [39]-[46].
In this work, we focus on VAE-based VC (VAE-VC) [39]. Specifically, the spectral conversion function is composed of an encoder-decoder pair. The encoder encodes the input spectral feature into a latent code; the decoder mixes the latent code and a specified target speaker code to generate the converted feature. The encoder-decoder network and the speaker codes are trained by back-propagation of the reconstruction error, along with a Kullback-Leibler (KL)-divergence loss that regularizes the distribution of the latent code.

Fig. 1: Illustration of how entangled latent representation affects the conversion performance in a general VAE-VC framework. The residual source speaker information in the latent code will be mixed with the given target speaker code, resulting in a mixed speaker identity in the converted feature. Thus, the performance might be harmed.
The degree of disentanglement of the latent representation is crucial to the success of many speech processing frameworks [47]-[51], including VAE-VC. Since we focus on the task of speaker voice conversion, the degree of disentanglement is defined as the amount of (source) speaker information residing in the latent code, i.e., the independence of the latent code and the speaker code [52]. An illustration is given in Figure 1. If the latent code is entangled with multiple components (e.g., in the VC task, the source speaker information remains in the latent code), then during conversion the decoder will draw the speaker information from both the given target speaker code and the residual source speaker information in the latent code, which harms the conversion performance. From the success of VAE-VC, we can infer that, at least to some extent, the decoder is trained to rely more on the information in the given speaker code than on the speaker characteristics remaining in the latent code; otherwise, conversion by changing the speaker code would not work. Although the success may be a natural result of model optimization, we doubt whether the performance is robust enough. For instance, in [53], it was demonstrated that the performance of autoencoder-based VC models was sensitive to the latent space dimension. This raises the need to design better schemes for making the latent code more independent of the speaker.
In our prior work [54], we proposed a cross-domain VAE-based VC framework (referred to as CDVAE-VC in the following discussion). The motivations of CDVAE-VC are: (1) although the effectiveness of VAE-VC using vocoder spectra (e.g., the STRAIGHT spectra, SPs [55]) has been confirmed, the use of other types of spectral features, such as mel-cepstral coefficients (MCCs) [56], which are related to human perception and have been widely used in VC, has not been properly investigated; (2) since modeling the low- and high-dimensional features alone has its respective shortcomings, based on multi-target/task learning [57], [58], it is believed that a model capable of simultaneously modeling two types of spectral features can yield better performance even if they are from the same feature domain. To this end, CDVAE-VC [54] extended the VAE-VC framework to jointly consider two kinds of spectral features, namely SPs and MCCs. By introducing two additional cross-domain reconstruction losses and a latent similarity constraint into the training objective, the latent representations encoded from the input SPs and MCCs are pulled toward each other and are capable of self- or cross-reconstructing the input features. We speculated that the success of CDVAE-VC came from the fact that a more disentangled latent representation was learned. Furthermore, we observed a positive correlation between the conversion performance and the extent to which the latent code was disentangled.
In this work, we extend the CDVAE-VC framework by incorporating the concept of adversarial training to improve the degree of disentanglement as well as the conversion performance. First, we directly combine CDVAE-VC with GANs. GANs have shown the ability to enhance the output of the decoder in encoder-decoder network based VC frameworks [45]. Therefore, it is expected that such a combination can improve the quality of converted speech. Second, inspired by the idea of domain adversarial training (DAT) [59], we add a speaker classification training objective to the latent variables, in order to explicitly project away speaker-related information. A similar idea has been applied to several speech processing tasks, such as speech recognition [60]-[62], speech enhancement [59], VC [45], [63] and singing VC [64]. Here, we utilize DAT by considering cross-domain features to further facilitate a more disentangled latent representation.
Designing a clear evaluation metric for the degree of disentanglement has long been an open problem in the field of machine learning. In image modeling, visual inspection has been a standard and intuitive approach [65], [66]. However, visual inspection is not perfectly feasible for speech processing tasks since it is hard to quantify the difference in voices as a specific latent variable changes. In previous works [45], [53], [67], a classifier-based metric has been proposed. Since the metric is also based on a trained classifier, it has limitations in comparing the disentanglement between different latent codes obtained by different models due to different training conditions and dynamics. Following [68], we utilize the parallel data that exist in most benchmark VC datasets and derive a novel metric for measuring disentanglement. The key assumption is that an ideal encoder should encode a pair of parallel sentences uttered by two different speakers to similar latent codes. We measure the cosine similarity between such latent codes to evaluate how well the encoder disentangles the latent codes.
The remainder of this paper is organized as follows. In Section II, we first review the VAE-VC and its extended version, CDVAE-VC. Section III introduces how to combine GANs with CDVAE-VC. Then, we describe how to add an adversarial speaker classifier objective to the latent code in Section IV. In Section V, we first examine our proposed mechanisms one by one, using conventional objective and subjective evaluation metrics adopted in VC. Disentanglement measurements of our proposed methods and how they are related to the VC performance are presented afterwards. Finally, we conclude the paper with discussions in Section VI.

II. BACKGROUND
In conventional VC frameworks, the acoustic features of the source speaker are converted to those of the target speaker in different feature streams. Many studies focus on the conversion of spectral features [10] and thus formulate VC as follows. Given N source speaker's spectral frames $X_s = \{x_{s,1}, \ldots, x_{s,N}\}$, the goal is to find a conversion function $f(\cdot)$ such that
$$\hat{x}_{t,n} = f(x_{s,n}). \tag{1}$$
Note that the second subindices on both sides of the equation are both $n$, which means that the converted spectral feature sequence has the same length as that of the source. In the rest of the article, we drop the frame or the speaker indices for simplicity.

Fig. 2: Illustration of the conversion phase of the VAE-VC [39] framework. Following traditional VC systems, a vocoder first parameterizes the waveform into acoustic features, which are then converted in different streams, and finally the converted features are used to synthesize the converted waveform by a vocoder.
In the following subsections, we describe two VAE-based VC frameworks. Throughout the paper, we use "bar" to indicate the reconstructed features, and "hat" to indicate the converted features.

A. VAE-VC
Figure 2 depicts the conversion process of a typical VAE-VC system [39]. The core of VAE-VC is an encoder-decoder network. During training, given an observed (source or target) spectral frame $x$, a speaker-independent encoder $E_\theta$ with parameter set $\theta$ encodes $x$ into a latent code: $z = E_\theta(x)$. The speaker code $y$ of the input frame is then concatenated with the latent code, and passed to a conditional decoder $G_\phi$ with parameter set $\phi$ to reconstruct the input. This reconstruction process can be expressed as:
$$\bar{x} = G_\phi(z, y) = G_\phi(E_\theta(x), y). \tag{2}$$
The model parameters can be obtained by maximizing the variational lower bound:
$$L_{vae}(\theta, \phi; x, y) = L_{recon}(x, y) + L_{lat}(x), \tag{3}$$
$$L_{recon}(x, y) = \mathbb{E}_{q_\theta(z|x)}\big[\log p_\phi(x|z, y)\big], \tag{4}$$
$$L_{lat}(x) = -D_{KL}\big(q_\theta(z|x) \,\|\, p(z)\big), \tag{5}$$
where $q_\theta(z|x)$ is the approximate posterior, $p_\phi(x|z, y)$ is the data likelihood, and $p(z)$ is the prior distribution of the latent space. $L_{recon}$ is simply a reconstruction term as in any vanilla autoencoder, whereas $L_{lat}$ regularizes the encoder to align the approximate posterior with the prior distribution. In the conversion phase, one could use (2) to formulate the conversion function $f$:
$$\hat{x} = f(x) = G_\phi(E_\theta(x), \hat{y}), \tag{6}$$
where $\hat{y}$ is the target speaker code.
The VAE framework makes several assumptions. First, $p_\phi(x|z, y)$ is assumed to follow a normal distribution whose covariance is an identity matrix. Second, $p(z)$ is set to be a standard normal distribution. Third, the expectation over $z$ is approximated by sampling via a linear-transformation based re-parameterization trick [37]. With these simplifications, we can avoid intractability and optimize the autoencoder parameter sets $\theta \cup \phi$ and the speaker codes via back-propagation.
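Under these three assumptions the lower bound has a simple single-sample estimate. The following is a minimal sketch in plain Python, not the implementation used in this work: it assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance, and the function names (`reparameterize`, `kl_to_standard_normal`, `vae_lower_bound`) are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (re-parameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_bar):
    """log N(x; x_bar, I): the identity covariance reduces it to a squared error."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_bar))
    return -0.5 * squared_error - 0.5 * len(x) * math.log(2.0 * math.pi)


def vae_lower_bound(x, x_bar, mu, log_var):
    """Single-sample estimate of L_recon + L_lat, where x_bar is the decoder
    output for a latent code drawn with reparameterize()."""
    return gaussian_log_likelihood(x, x_bar) - kl_to_standard_normal(mu, log_var)
```

Because the covariance is fixed to the identity, maximizing the reconstruction term is equivalent to minimizing the squared error between the input frame and its reconstruction.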

B. CDVAE-VC
In [54], we proposed the CDVAE-VC framework to utilize spectral features of different properties extracted from the same observed speech frame. As depicted in Figure 3, the CDVAE framework is formed by a collection of encoder-decoder pairs, one for each kind of spectral feature. Considering the SPs and MCCs as two kinds of spectral features (denoted as $x_{SP}$ and $x_{MCC}$), within-domain and cross-domain reconstruction losses are defined. In short, we introduce two extra reconstruction streams. By minimizing the cross-domain reconstruction loss, we enforce $z_{SP}$ to contain enough information to reconstruct $x_{MCC}$, and vice versa. As a result, the behavior of the encoders for both feature domains is constrained to be the same, i.e., they are expected to extract similar latent information from different types of input spectral features. To explicitly reinforce this constraint, a latent similarity L1 loss, defined as $\lVert z_{SP} - z_{MCC} \rVert_1$, can be included in the final objective $L_{cdvae}$ (14). The model parameters can be learned by maximizing (14). In the conversion phase, there are four conversion paths (i.e., two within-domain and two cross-domain paths). As reported in [54], the CDVAE MCC-MCC path gave the best performance in terms of subjective evaluation, which matched the assumption that MCCs are more related to human perception.
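The four reconstruction streams and the latent similarity term can be sketched in one line each. This is only a plausible form, assuming every stream reuses the VAE terms of Section II-A with one encoder-decoder pair per feature domain; the arrow notation $a \to b$ (encode with domain $a$, decode to domain $b$) and the symbol $L_{sim}$ are hypothetical, and the exact formulation and weighting in [54] may differ.

```latex
% Hypothetical notation: a, b range over the two feature domains {SP, MCC}.
\begin{align}
z_{a} &= E_{a}(x_{a}), \\
L_{a \to b}(x_{a}, x_{b}, y) &=
  \mathbb{E}_{q(z_{a} | x_{a})}\big[\log p(x_{b} | z_{a}, y)\big]
  - D_{KL}\big(q(z_{a} | x_{a}) \,\|\, p(z)\big), \\
L_{sim}(x_{SP}, x_{MCC}) &= \lVert z_{SP} - z_{MCC} \rVert_{1}.
\end{align}
```

With $a = b$ this gives the two within-domain streams and with $a \neq b$ the two cross-domain streams. An objective that is maximized would then sum the four stream terms and subtract the similarity term.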

III. INCORPORATING CDVAE-VC WITH GANS
Minimizing the reconstruction loss in VAE-VC and CDVAE-VC tends to result in blurry spectra, similar to the over-smoothing effects in other VC frameworks. It is expected that introducing a GAN objective [38] can guide the output spectra to be more realistic. In this section, we present the main concepts and system architectures of the combination of GANs and the VAE-VC and CDVAE-VC frameworks.

A. The GAN objective in the general VAE-VC
We follow [69] and incorporate a GAN objective into the decoder in the original VAE-VC. Assume that the real data distribution of any spectral frame admits density $p^*$, and the autoencoding process defined in (2) induces a conditional probability $p_{\bar{x}}$. From the perspective of data distributions, the goal is to enhance the decoder network $G$ in (2) such that $p_{\bar{x}}$ best approximates the real data distribution $p^*$:
$$p^* \approx p_{\bar{x}}. \tag{15}$$
A typical GAN [38] realizes the above-mentioned probability approximation by introducing a discriminator $D_\psi$ with parameter set $\psi$ that judges whether an input follows a true and natural probability distribution or an artificial one. Together with a generator $G$ that tries to produce realistic output features, these two components play a min-max game and seek an equilibrium with the Jensen-Shannon divergence $D_{JS}$ as the objective, which corresponds to the following loss:
$$L_{gan}(x) = \mathbb{E}_{x \sim p^*}\big[\log D_\psi(x)\big] + \mathbb{E}_{z \sim q_\theta(z|x)}\big[\log\big(1 - D_\psi(G_\phi(z, y))\big)\big]. \tag{16}$$
To facilitate stable training, in this work we adopt a Wasserstein GAN (WGAN) [70], [71]. In the WGAN, the following Wasserstein distance is derived:
$$W(p^*, p_{\bar{x}}) = \sup_{\lVert D \rVert_L \le 1} \mathbb{E}_{x \sim p^*}\big[D(x)\big] - \mathbb{E}_{x \sim p_{\bar{x}}}\big[D(x)\big], \tag{17}$$
where the supremum is over all 1-Lipschitz functions $D: X \to \mathbb{R}$. Based on the above distance, the following WGAN loss can be defined:
$$L_{wgan}(x) = \mathbb{E}_{x \sim p^*}\big[D_\psi(x)\big] - \mathbb{E}_{z \sim q_\theta(z|x)}\big[D_\psi(G_\phi(z, y))\big], \tag{18}$$
where $D_\psi$ is now a 1-Lipschitz discriminator. Finally, we can combine the objectives of VAE and WGAN by assigning the decoder of VAE as the generator of WGAN. As a result, combining the WGAN loss (18) and the VAE loss (3) results in a VAEGAN objective:
$$L_{vaegan}(\theta, \phi, \psi; x, y) = L_{vae}(x, y) + \alpha L_{wgan}(x), \tag{19}$$
where $\alpha$ is the weight of the WGAN loss. This objective is shared across the encoder, decoder, and discriminator. As in standard GAN training, the discriminator is first updated by maximizing this objective, and the encoder and decoder are then updated by minimizing the objective. Therefore, the components are optimized in an alternating order. GANs produce more realistic (in our case, sharper) outputs because they optimize a loss function between two distributions in a more direct fashion.
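On a mini-batch, the WGAN term is estimated from discriminator scores and then weighted into the combined objective. The following is a minimal sketch, assuming the scores have already been computed by some discriminator; the function names are hypothetical, and the default weight is the value of α reported in Section V-A.

```python
def wgan_loss(scores_real, scores_reconstructed):
    """Empirical WGAN loss: mean discriminator score on real frames minus the
    mean score on frames produced by the decoder."""
    mean_real = sum(scores_real) / len(scores_real)
    mean_fake = sum(scores_reconstructed) / len(scores_reconstructed)
    return mean_real - mean_fake


def vaegan_objective(l_vae, l_wgan, alpha=50.0):
    """Combined objective: VAE term plus the alpha-weighted WGAN term."""
    return l_vae + alpha * l_wgan
```

The discriminator step pushes `wgan_loss` up by separating the two score distributions, while the encoder and decoder step pushes it down by making reconstructed frames score like real ones.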
The VAW-GAN-VC method in [41] has a similar motivation to better model spectral features to improve feature generation. However, there is a fundamental difference between the training procedures of VAW-GAN-VC and the training procedures here. In VAW-GAN-VC, the objective of WGANs is to minimize the Wasserstein distance of the two distributions of the converted features and the real target features. Although this is a strong objective, it also brings some limitations. The original VAE-VC and CDVAE-VC consider only auto-encoding in the training phase, and perform conversion by changing the speaker code in the conversion phase. In other words, multiple conversion pairs are integrated into one model, sometimes referred to as "multi-target" training in VC. VAW-GAN-VC, in contrast, needs to consider not only auto-encoding but also conversion in the training phase, since the discriminator needs to discriminate the real target features and the converted features in order to align the distribution of the latter to that of the former. As a result, VAW-GAN-VC is trained to convert from one source to one target, which limits the

B. CDVAE-VC with GANs (CDVAE-GAN)
Now we can combine the GAN objective with CDVAE-VC, which we will refer to as CDVAE-GAN, where the derivation of the objective is as simple as replacing the VAE loss in (19) with the CDVAE objective defined in (14). However, in practice, combining CDVAE-VC with GANs is not as trivial as replacing the encoder and decoder in VAE-GAN with CDVAE. For each kind of feature, a separate discriminator should be trained, i.e., $D_{SP}$ and $D_{MCC}$ should be considered. It seems natural to train two discriminators jointly with the whole network. However, as mentioned above, the MCC-MCC path in CDVAE-VC performs best among the four paths in the conversion phase. Introducing a discriminator for SPs might not necessarily benefit the quality of the output MCCs. To determine the best architecture, we examine the effect of three settings: combining CDVAE with only $D_{SP}$, with only $D_{MCC}$, and with both $D_{SP}$ and $D_{MCC}$. Detailed experimental results will be shown in Sections V-C and V-D.

IV. ADVERSARIAL SPEAKER CLASSIFIER (CLS)
As discussed above, the viability of the family of VAE-VC frameworks relies on the decomposition of the input, which is assumed to be composed of phonetic representation and speaker information. Ideally, the latent code extracted using the encoder should contain solely phonetic information and be free of any speaker information. However, this decomposition is not explicitly guaranteed. To this end, we investigate the effect of an adversarial speaker classifier to explicitly force the latent code to be speaker independent.

A. The classifier loss
An adversarial speaker classifier $C_\Psi$ with parameter set $\Psi$ tries to classify which speaker the latent code comes from. We will refer to this classifier as CLS. Specifically, given a latent code $z$, the CLS predicts a posterior probability $P(y \mid z)$, which is the probability that $z$ is extracted from an input frame produced by speaker $y$. Therefore, we can define the CLS loss as the cross-entropy between the predicted posterior and the one-hot ground truth vector, which reduces to the negative log-posterior of the true speaker:
$$L_{cls}(\Psi; z, y) = -\log P(y \mid z). \tag{20}$$
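For a single latent code, a cross-entropy loss of this kind can be sketched as follows. The sketch assumes the classifier outputs one unnormalized score (logit) per training speaker and that the posterior is obtained with a softmax; both the function names and this output parameterization are assumptions made for illustration.

```python
import math


def speaker_posterior(logits):
    """Softmax over per-speaker logits (max-subtracted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cls_loss(logits, speaker_index):
    """Cross-entropy against a one-hot target: -log P(true speaker | z)."""
    return -math.log(speaker_posterior(logits)[speaker_index])
```

A classifier that cannot tell speakers apart assigns a uniform posterior, so for K speakers the loss approaches log K, which is the value the encoder side of the min-max game drives it toward.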
The training process is divided into three phases, as depicted in Figure 4. Phase one involves the training of the VAE. In phase two, to pre-train the classifier, we first use the trained VAE obtained in phase one to extract latent codes from the same training set. The classifier is then trained with these latent codes to minimize (20). In the third phase, we train the whole network using an alternating update schedule, similar to the one described in Section III-A. Specifically, the encoder and the decoder are first frozen and the discriminator and classifier are trained to maximize $L_{wgan}$ and minimize $L_{cls}$ defined in (18) and (20), respectively, and thus they can discriminate self-reconstructed features and classify latent codes correctly. Then, we freeze these modules and train the encoder and decoder to not only minimize $L_{cdvae}$ in (14), but also optimize $L_{wgan}$ and $L_{cls}$ so that they can fool the frozen components.
The described training scheme also plays a min-max game between {encoders, decoders} and {discriminator, classifier}. An ideally trained model should contain encoders that learn to project away as much speaker information as possible and decoders that can generate realistic and natural output spectra given an inferred latent code with a specific speaker code. Algorithm 1 summarizes the training procedure of CDVAE-CLS-GAN.

V. EXPERIMENTAL EVALUATIONS

A. Experimental settings
We conducted all experiments on the Voice Conversion Challenge (VCC) 2018 dataset. The WORLD vocoder was used to extract acoustic features, including 513-dimensional SPs, 513-dimensional aperiodicity signals (APs), and fundamental frequency ($F_0$). 35-dimensional MCCs were then extracted from the SPs. The SPs were then normalized to unit-sum, and the normalizing factor was used as the energy of SPs. The 0-th coefficient of MCCs was taken out as the energy of MCCs. We further applied Min-Max normalization to SPs and MCCs. In the conversion phase, the converted SPs in VAE systems and the converted MCCs in CDVAE systems (excluding CDVAE-GAN with $D_{SP}$) were obtained. The energy and AP were kept unmodified, and $F_0$ was converted using a linear mean-variance transformation in the log-$F_0$ domain.
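The linear mean-variance transformation in the log-$F_0$ domain can be sketched as follows. This is an illustrative sketch rather than the code used in the experiments: it assumes unvoiced frames are marked by $F_0 = 0$ and are passed through unchanged, and the function names are hypothetical.

```python
import math


def logf0_stats(f0_values):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    mean = sum(logs) / len(logs)
    variance = sum((v - mean) ** 2 for v in logs) / len(logs)
    return mean, math.sqrt(variance)


def convert_f0(f0_values, source_stats, target_stats):
    """Map source log-F0 statistics onto the target's; unvoiced frames stay 0."""
    mean_s, std_s = source_stats
    mean_t, std_t = target_stats
    converted = []
    for f in f0_values:
        if f > 0:
            log_f = (math.log(f) - mean_s) / std_s * std_t + mean_t
            converted.append(math.exp(log_f))
        else:
            converted.append(0.0)
    return converted
```

In practice the statistics of each speaker would be computed once over that speaker's training utterances and reused for every converted utterance.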
The detailed network architectures are shown in Table I. We adopted the fully convolutional network (FCN) [72] based CDVAE-VC as our baseline system [68], which consumes continuous spectral frames extracted from the whole utterance and outputs a sequence of converted frames of the same length. This model has been confirmed to outperform the frame-wise CDVAE-VC counterpart. We also adopted a gradient penalty regularization [71] in the WGAN objective to stabilize the training. Layer normalization [73], the gated linear units activation function, and skip connections were also used to more effectively propagate the conditional information. Following [68], the latent space and speaker representation were set to 16-dimensional. We used a mini-batch of 16 and the Adam optimizer with a fixed learning rate of 0.0001. The hyper-parameters α and λ were set to be 50 and 1000, respectively, according to a held-out validation set. For CDVAE-GAN, we first pre-trained the CDVAE for 100000 steps. Then, we adversarially trained the discriminator(s) with the whole network for 10000 steps. We followed a common WGAN training scheme [70], [71] such that the discriminator(s) were updated for 5 iterations followed by 1 iteration of encoder and decoder update. For CDVAE-CLS-GAN, after training the CDVAE for 100000 steps, we pre-trained the classifier with the latent code extracted from the encoders for 30000 steps. Then, we trained the whole network for 10000 steps. After experimenting with different training schemes, here we updated the discriminator and the classifier for 1 iteration followed by 5 iterations of encoder and decoder update.
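The two alternating update ratios described above can be written as a small schedule generator. This is a sketch only: the generator name and the string labels are hypothetical, and how the reported step counts map onto individual updates is not specified here.

```python
def alternating_schedule(cycles, adversary_updates, autoencoder_updates):
    """Yield which module group to update at each iteration.

    "adversary" stands for the discriminator(s) and, when present, the
    classifier; "autoencoder" stands for the encoders and decoders.
    """
    for _ in range(cycles):
        for _ in range(adversary_updates):
            yield "adversary"
        for _ in range(autoencoder_updates):
            yield "autoencoder"


# CDVAE-GAN: 5 discriminator updates, then 1 encoder/decoder update.
gan_cycle = list(alternating_schedule(1, 5, 1))
# CDVAE-CLS-GAN: 1 discriminator/classifier update, then 5 encoder/decoder updates.
cls_gan_cycle = list(alternating_schedule(1, 1, 5))
```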
The following models are compared in order to examine the effectiveness of our proposed methods.
• VAE: The FCN version of the VAE-VC model introduced in [39]. This model is only used to evaluate the impact of cross-domain features on the degree of disentanglement.
• CDVAE: The FCN model in [68], which is the baseline model in our experiments.
• CDVAE-GAN SP: The CDVAE with $D_{SP}$.
• CDVAE-GAN BOTH: The CDVAE with $D_{SP}$ and $D_{MCC}$.
• CDVAE-CLS-GAN MCC: The CDVAE with $D_{MCC}$ and CLS.
For simplicity, in the rest of the paper, we use brackets to surround the type of feature used during conversion, and that path will be used in CDVAE-based methods. For instance, CDVAE-GAN MCC [MCC] uses the MCC and the MCC-MCC path. In addition, if MCC is used in CDVAE and CDVAE-CLS, we additionally compare systems incorporating the global variance (GV) post-filter [74] to enhance the output, as in the original CDVAE [54].

B. Evaluation methodology

1) Objective evaluation metrics:
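• Mel-Cepstrum distortion (MCD): MCD measures the spectral distortion in the MCC domain, and is a commonly adopted objective metric in the field of VC. It is calculated as:
$$\mathrm{MCD}\,[\mathrm{dB}] = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{K} \big(mcc^{(c)}_d - mcc^{(t)}_d\big)^2},$$
where $K$ is the dimension of the MCCs, and $mcc^{(c)}_d$ and $mcc^{(t)}_d$ denote the d-th dimensional coefficients of the converted and target MCCs, respectively. A dynamic time warping (DTW) based alignment is performed to find the corresponding frame pairs between the non-silent converted and target MCC sequences beforehand.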
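• Global variance (GV): GV serves as a metric for the over-smoothness of the output features. GV is usually calculated dimension-wise over all non-silent frames in the evaluation set. The d-dimensional GV value is calculated as follows:
$$\mathrm{GV}(d) = \frac{1}{T} \sum_{t=1}^{T} \big(mcc^{(c)}_d(t) - \overline{mcc}^{(c)}_d\big)^2,$$
where $T$ is the number of non-silent frames and $\overline{mcc}^{(c)}_d$ is the mean of all converted d-th dimensional MCC coefficients.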
• Modulation Spectrum (MS): MS [75] is defined as the log-scaled power spectrum of a given feature sequence. The temporal fluctuation of the sequence is first decomposed into individual modulation frequency components, and their power values are represented as the MS. In this work we measure the MS of MCCs. Different from previous works that measured the MS of a specific dimension of the MCC sequence, here we report the average over all dimensions. We also measure an MS distortion (MSD), which is calculated for each dimension d.

2) Subjective evaluation methods:
We recruited 14 participants for the following two subjective evaluations.
• The mean opinion score (MOS) test on naturalness: Subjects were asked to evaluate the naturalness of the converted and natural speech samples on a scale from 1 (completely unnatural) to 5 (completely natural).
• The VCC [33] style test on similarity: This paradigm was adopted by the VCC organizing committee. Listeners were given a pair of speech utterances consisting of a natural speech sample from a target speaker and a converted speech sample. Then, they were asked to determine whether the pair of utterances can be produced by the same speaker, with a 4-level confidence of their decision, i.e., sure or not sure.
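Two of the objective metrics in this subsection, MCD and GV, can be sketched in plain Python. The sketch assumes the standard dB-scaled MCD, MCC frames from which the 0-th (energy) coefficient has already been removed, and converted/target sequences that are already DTW-aligned and free of silent frames; the function names are hypothetical.

```python
import math


def mcd_frame(mcc_converted, mcc_target):
    """Mel-cepstrum distortion in dB for one aligned frame pair."""
    squared = sum((c - t) ** 2 for c, t in zip(mcc_converted, mcc_target))
    return 10.0 / math.log(10.0) * math.sqrt(2.0 * squared)


def mcd(converted_frames, target_frames):
    """Average MCD over aligned frame pairs."""
    values = [mcd_frame(c, t) for c, t in zip(converted_frames, target_frames)]
    return sum(values) / len(values)


def global_variance(frames):
    """Per-dimension variance of a sequence of MCC frames."""
    num_frames = len(frames)
    num_dims = len(frames[0])
    gv = []
    for d in range(num_dims):
        mean_d = sum(frame[d] for frame in frames) / num_frames
        gv.append(sum((frame[d] - mean_d) ** 2 for frame in frames) / num_frames)
    return gv
```

An over-smoothed output typically shows a lower `global_variance` than the target in most dimensions, even when its `mcd` is small.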

C. Applying GANs to different features
We first compare CDVAE-GAN SP, CDVAE-GAN MCC, and CDVAE-GAN BOTH, and likewise CDVAE-CLS-GAN SP, CDVAE-CLS-GAN MCC, and CDVAE-CLS-GAN BOTH. As shown in Table II, CDVAE-GAN BOTH and CDVAE-CLS-GAN BOTH gave the highest MCD, while in Figures 5b, 5c, 6b and 6c, we can see that in terms of GV and MS, CDVAE-GAN MCC and CDVAE-CLS-GAN MCC yielded curves closer to the target curves, whereas the curves of the other models deviated more from the target curves. This is consistent with a common observation in the VC literature that MCD, which measures the sample mean, often yields results opposite to those of GV and MS, both of which reflect the sample variance [10], [76]. This result suggests that modeling both feature domains simultaneously does not always yield better results. As for perceptual performance, our internal listening tests revealed that CDVAE-GAN MCC gave the best results among the three models. Note that although CDVAE-GAN SP and CDVAE-CLS-GAN SP gave the lowest MCD compared with the other two models, they do not necessarily outperform their MCC counterparts in listening tests. We speculate that fitting the SP domain tends to give more over-smoothed output features, resulting in low MCDs without improving perceptual performance. The result is reasonable since the MCC-MCC path is used when performing conversion.

D. Effectiveness of GANs
Next, we examine the effectiveness of combining GANs with CDVAE and CDVAE-CLS. Based on the discussion in the previous subsection, we focus on CDVAE-GAN MCC. These results are consistent with our findings in the objective evaluations, suggesting that GANs enhance the variance of output features and thus have the potential to replace the GV post-filtering process commonly involved in traditional MCC-based VC systems [10]. This is advantageous since the model can then be freed from the post-filtering process in the online conversion phase, which may benefit real-time applications.

E. Effectiveness of CLS
Next, we evaluate the effectiveness of the adversarial speaker classifier. Looking at the CDVAE and CDVAE-GAN models and their counterparts with CLS, a trend of increase in MCD values can be observed in Table II. On the other hand, Figures 5a, 6a and 7 show that applying CLS to CDVAE and CDVAE-GAN MCC yields similar GV values, but with MS values closer to those of the target, as well as a smaller MSD. These results imply that CLS can improve objective statistics.
Table III and Figure 8 show the subjective evaluation results. The effectiveness of CLS can be confirmed by the following observations: The speech naturalness was improved in all conversion pairs by adding CLS to CDVAE, CDVAE w/ GV, and CDVAE-GAN MCC. This is consistent with our aforementioned findings from the objective evaluations. Furthermore, the conversion similarity is greatly improved when incorporating CLS in CDVAE and CDVAE w/ GV, and is slightly improved when added to CDVAE-GAN MCC. This confirms our initial motivation of CLS, which is to increase speaker similarity by eliminating source speaker identity in the latent code.

F. Disentanglement Measure
In this section, we investigate the degree of disentanglement of the VC models involved in this study. We use a novel metric that was recently proposed in [68] as the disentanglement measurement, termed DEM. The main design concept of DEM is that a pair of sentences of the same content uttered by the source and target speakers should have similar latent codes since the phonetic contents are the same. Therefore, we can use the cosine similarity to measure the distance of the latent codes obtained from the paired utterances. Specifically, the procedure to calculate DEM is as follows: 1) extracting the latent codes of a pair of parallel utterances spoken by the source and target speakers; 2) aligning the frame sequences of the pair of utterances using DTW; 3) calculating the frame-wise cosine similarity, and then taking the average over the entire sequence.
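The three steps above can be sketched in plain Python as follows. This is a minimal sketch, not the evaluation code of [68]: it assumes the DTW alignment is computed with a Euclidean frame distance on a separate set of alignment features (for example, MCCs) that have one frame per latent code, and all function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dtw_path(seq_a, seq_b):
    """Return the list of aligned (index_a, index_b) pairs found by DTW."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return path


def dem(latent_src, latent_tgt, feats_src, feats_tgt):
    """Average frame-wise cosine similarity of latent codes over the DTW path."""
    path = dtw_path(feats_src, feats_tgt)
    sims = [cosine_similarity(latent_src[i], latent_tgt[j]) for i, j in path]
    return sum(sims) / len(sims)
```

A score near 1 means the two speakers' parallel utterances are mapped to nearly the same direction in the latent space, which is what a well-disentangled encoder is expected to do.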
As with other popular evaluation metrics, e.g., MCD and MSD, computing DEM requires parallel data. Since parallel data are usually available in standardized VC datasets, DEM is a simple but effective measure of the degree of disentanglement of the latent codes. Table IV shows the evaluation results of DEM. First, we observe that CDVAE [SP] yields higher DEM scores than VAE [SP]. This confirms that introducing cross-domain features indeed increases the degree of disentanglement. Next, comparing the corresponding methods in the upper and lower halves of the table, which used SP and MCC as input features, respectively, the DEM scores in the upper half are consistently higher than those in the lower half. This result is somewhat reasonable because SPs (513-dimensional) have a higher dimensionality than MCCs (35-dimensional) and carry much more detailed information. As a result, in terms of the cosine similarity measure, higher DEM scores are observed for the methods in the upper half than for those in the lower half.
One interesting finding here is that when incorporating GANs in the CDVAE and CDVAE-CLS models, the DEM scores are consistently and significantly improved. This result indicates that during the training of CDVAE-GAN MCC, although this was not our original expectation, the discriminator not only benefits the decoders but also indirectly guides the latent codes to be better disentangled.
As for CLS, we first observe that including CLS in CDVAE improves the DEM score when using MCC yet degrades it when using SP. Although this makes the effectiveness of CLS somewhat less convincing, we note that CDVAE-CLS [SP] and CDVAE-CLS [MCC] have nearly identical DEM scores. This interesting finding shows that the CLS forces the encoders to encode different features into similar contents. On the other hand, including CLS in the CDVAE-GAN models boosts the DEM scores of cross-gender pairs, which confirms that CLS can help the encoders eliminate speaker-dependent information, such as gender.
Finally, we compare the results of the similarity tests of CDVAE [MCC], CDVAE-GAN MCC, and CDVAE-CLS-GAN MCC in Figure 8 and the DEM results in Table IV. CDVAE-CLS-GAN MCC achieves the highest similarity scores in Figure 8 and gives the highest DEM scores in Table IV. The result verifies the positive correlation between the conversion performance and the degree of disentanglement of the latent codes.

VI. CONCLUSIONS
In this paper, we have extended the cross-domain VAE based VC framework by integrating GANs and CLS into the training phase. The GAN objective was used to better approximate the distribution of real speech signals. The CLS, on the other hand, was applied to the latent code as an explicit constraint to eliminate speaker-dependent factors. Objective and subjective evaluations confirmed the effectiveness of the GAN and CLS objectives. We have also investigated the correlation between the degree of disentanglement and the conversion performance. A novel evaluation metric, DEM, that measures the degree of disentanglement in VC was derived. Experimental results confirmed a positive correlation between the degree of disentanglement and the conversion performance.
In the future, we will exploit more acoustic features in the CDVAE system, including rawer features, such as the magnitude spectrum, and hand-crafted features, such as line-spectral pairs. An effective algorithm that can optimally determine the latent space dimension is also worthy of study. Finally, it is worthwhile to generalize this disentanglement framework to extract speaker-invariant latent representation from unknown source speakers in order to achieve many-to-one VC.
We have made the source code publicly accessible so that readers can reproduce our results. 2

Fig. 3: Illustration of the training phase of the CDVAE-VC [54] framework. In this framework, each feature has its own set of encoder and decoder. During training, by minimizing the loss derived from the within- and cross-domain reconstruction paths, the latent codes z_SP and z_MCC learn to reconstruct not only corresponding input features but also the cross-domain features.
E_SP and G_SP are the encoder and decoder for SPs, and E_MCC and G_MCC are the encoder and decoder for MCCs; x_{S-S} and x_{M-M}, respectively, denote the generated SPs and MCCs from the within-domain reconstruction paths; x_{M-S} and x_{S-M}, respectively, denote the generated SPs and MCCs from the cross-domain reconstruction paths. Note that L_recon(·, y) calculates the reconstruction loss between the first argument and the corresponding input feature.

Fig. 4: Illustration of the training procedure of our proposed CDVAE-CLS-GAN model. Phase 1: A CDVAE is trained. Phase 2: The latent codes are used to train the CLS. Phases 3-A and 3-B: The encoders and decoders, and the CLS and discriminators, are trained in alternating order.

d
represent the d-th dimensional coefficient of the converted MCCs and the target MCCs, respectively. In practice, MCD is calculated in an utterance-wise manner.
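As a rough illustration of an utterance-wise MCD computation, the sketch below uses the commonly used definition MCD [dB] = (10 / ln 10) * sqrt(2 * sum_d (c_d - t_d)^2), where c_d and t_d are the d-th coefficients of a converted frame and a target frame. This form is an assumption, since the defining equation is not reproduced here, and the function names are hypothetical. The two frame sequences are assumed to be already time-aligned and restricted to non-silent frames.

```python
import math


def frame_mcd(converted, target):
    """MCD in dB for one pair of MCC vectors (assumed standard definition)."""
    squared_diff = sum((c - t) ** 2 for c, t in zip(converted, target))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_diff)


def utterance_mcd(converted_frames, target_frames):
    """Average the frame-level MCD over one time-aligned utterance."""
    if len(converted_frames) != len(target_frames):
        raise ValueError("frame sequences must be time-aligned to equal length")
    if not converted_frames:
        raise ValueError("at least one frame is required")
    values = [frame_mcd(c, t) for c, t in zip(converted_frames, target_frames)]
    return sum(values) / len(values)
```

A corpus-level score can then be obtained by averaging `utterance_mcd` over all evaluation utterances.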

Fig. 5: Global variance curves of all non-silent frames averaged over all conversion pairs for the compared models.

Fig. 6: Average modulation spectrum curves over all dimensions of all non-silent frames over all conversion pairs for the compared models.

Fig. 7: Modulation spectrum distortion curves of all non-silent frames over all conversion pairs for the compared models.

Fig. 8: Similarity results over all speaker pairs for the compared models.

TABLE I: Model architectures. Conv-h×w-n indicates a convolutional layer with kernel size h×w and n output channels. LReLU indicates the leaky ReLU activation function. FC indicates a fully-connected linear layer. LN indicates the layer normalization layer. 1

TABLE II: Mean Mel-cepstral distortions [dB] of all non-silent frames in the evaluation set for the compared models.

TABLE IV: The results of DEM: the cosine similarity of the latent codes extracted from non-silent frames of parallel utterances of source-target pairs.