ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion

This article proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to be used in all the hidden layers of the networks. Second, it achieves many-to-many conversion by simultaneously learning mappings among multiple speakers using only a single model instead of separately learning mappings between each speaker pair using a different model. This enables the model to fully utilize available training data collected from multiple speakers by capturing common latent features that can be shared across different speakers. Owing to this structure, our model works reasonably well even without source speaker information, thus making it able to handle any-to-many conversion tasks. Third, we introduce a mechanism, called conditional batch normalization, which switches batch normalization layers in accordance with the target speaker. This particular mechanism has been found to be extremely effective for our many-to-many conversion model. We conducted speaker identity conversion experiments and found that ConvS2S-VC obtained higher sound quality and speaker similarity than baseline methods. We also found from audio examples that it could perform well in various tasks including emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion.


I. INTRODUCTION
Voice conversion (VC) is a technique for converting para/non-linguistic information contained in a given utterance, such as the perceived identity of a speaker, while preserving linguistic information. Potential applications of this technique include speaker-identity modification [1], speaking aids [2], [3], speech enhancement [4]-[6], and accent conversion [7].
Many conventional VC methods are designed to use parallel utterances of source and target speech to train acoustic models for feature mapping. A typical pipeline of the training process consists of extracting acoustic features from source and target utterances, performing dynamic time warping (DTW) to obtain time-aligned parallel data, and training an acoustic model that maps the source features to the target features frame-by-frame. Examples of such acoustic models include Gaussian mixture models (GMMs) [8]-[10] and deep neural networks (DNNs) [11]-[15]. Some attempts have also been made to develop methods that require no parallel utterances, transcriptions, or time alignment procedures. Recently, deep generative models such as variational autoencoders (VAEs), cycle-consistent generative adversarial networks (CycleGANs), and star generative adversarial networks (StarGANs) have been used with notable success for non-parallel VC tasks [16]-[20].
One limitation of conventional methods, including those mentioned above, is that they focus mainly on learning to convert only the local spectral features and less on converting prosodic features such as the fundamental frequency (F0) contour, duration, and rhythm of the input speech. This is because the acoustic models in these methods are designed to describe mappings between local features only. This prevents a model from discovering word-level or sentence-level suprasegmental conversion rules. In most methods, the entire F0 contour is simply adjusted using a linear transformation in the logarithmic domain while the duration and rhythm are usually kept unchanged. However, since these features play as important a role as local spectral features in characterizing speaker identities and speaking styles, it would be desirable if these features could also be converted more flexibly. To overcome this limitation, we need a model that can learn to convert entire feature sequences by capturing and utilizing long-term dependencies in source and target speech. To this end, we adopt a sequence-to-sequence (seq2seq or S2S) learning approach.
The S2S learning approach offers a general and powerful framework for transforming one sequence into another variable-length sequence [21], [22]. This is made possible by using encoder and decoder networks, where the encoder encodes an input sequence into an internal representation and the decoder generates an output sequence in accordance with that internal representation. The original S2S model employs recurrent neural networks (RNNs) to model the encoder and decoder networks, where common choices for the RNN architectures involve long short-term memory (LSTM) networks and gated recurrent units (GRUs). This approach has attracted a lot of attention in recent years after being introduced and applied with notable success in various tasks such as machine translation, automatic speech recognition (ASR) [22], and text-to-speech (TTS) [23]-[29]. The original S2S model suffers from the constraint that all input sequences are forced to be encoded into a fixed-length internal vector. This limits the ability of the model especially when it comes to long input sequences, such as long sentences in text translation problems. To overcome this limitation, a mechanism called "attention" [30] has been introduced, which enables the network to learn where to pay attention in the input sequence for each item in the output sequence.
While RNNs are a natural choice for modeling long sequential data, recent work has shown that convolutional neural networks (CNNs) with gating mechanisms also have excellent potential for capturing long-term dependencies [31], [32]. In addition, they are suitable for parallel computations using GPUs unlike RNNs. To exploit this advantage of CNNs, an S2S model was recently proposed that adopts a fully convolutional architecture [33]. With this model, the decoder is designed using causal convolutions so that it enables the model to generate an output sequence autoregressively. This model with an attention mechanism is called the ConvS2S model and has already been applied successfully to machine translation [33] and TTS [27], [28]. Inspired by its success in these tasks, we propose a VC method based on the ConvS2S model, which we call ConvS2S-VC, along with an architecture tailored for use with VC.
In a wide sense, VC is a task of converting the domain of speech. Here, the types of domain include speaker identities, emotional expressions, speaking styles, and accents, but for concreteness, we will restrict our attention to speaker identity conversion tasks in the following. When we are interested in converting speech among multiple speakers, one naive way of applying the S2S model is to prepare and train a model for each speaker pair. However, this can be inefficient since the model for one pair of speakers fails to use the training data of the other speakers for training, even though there must be a common set of latent features that can be shared across different speakers, especially when the languages are the same. To fully utilize available training data collected from multiple speakers, we further propose an extension of the ConvS2S model that allows for many-to-many VC, which can learn mappings among multiple speakers using only a single model.
One important advantage of using fully convolutional networks is that they enable the use of batch normalization in all the hidden layers. This is practically beneficial since batch normalization is known to be significantly effective in not only accelerating training but also improving the generalization ability of the resulting models. Indeed, as described later, it also positively affected our pairwise model. However, as for the many-to-many model, the distributions of the layer inputs can change depending on the source and target speakers, which may affect model training. To stabilize the layer input distributions, we introduce a mechanism, called conditional batch normalization, which switches batch normalization layers in accordance with the source and target speakers. This particular mechanism was experimentally found to work very well.

II. RELATED WORK
Note that some attempts have recently been made to apply S2S models to VC problems, including the ones we proposed previously [34], [35]. Although most S2S models typically require sufficiently large parallel corpora for training, collecting a sufficient number of parallel utterances is not always feasible. Thus, particularly in VC tasks, one challenge is how best to train S2S models using a limited amount of training data.
One idea involves using text labels as auxiliary information for model training, assuming they are readily available. For example, Miyoshi et al. proposed combining acoustic models for ASR and TTS with an S2S model [36], where an S2S model is used to convert the context posterior probability sequence produced by the ASR model and the TTS model is finally used to generate a target speech feature sequence. Zhang et al. also proposed an S2S model-based VC method guided by an ASR system, which augments inputs with bottleneck features obtained from a pretrained ASR system [37]. Subsequently, Zhang et al. proposed a shared model for TTS and VC tasks, which enables joint training of the TTS and VC functions [38]. Recently, Biadsy et al. proposed an end-to-end VC system called Parrotron, which is designed to train the encoder and decoder along with an ASR model on the basis of a multitask learning strategy [39]. Our method differs from these methods in that our model does not rely on ASR or TTS models and requires no text annotations for model training. Instead, we introduce several techniques to stabilize training and test prediction.
Haque et al. proposed a method that enables many-to-many VC similar to ours [40]. As detailed in Subsection IV-C, our many-to-many model differs in that it does not necessarily require source speaker information for the encoder, thus enabling it to also handle any-to-many VC tasks.
In addition, our method differs from all the methods mentioned above in that it adopts a fully convolutional model, which can be potentially advantageous in several ways, as already mentioned.

III. CONVS2S-VC
In this section, we start by describing a pairwise one-to-one conversion model and then present its multi-speaker extension that enables many-to-many VC. The overall architecture of the pairwise conversion model is illustrated in Fig. 1.

A. Feature Extraction and Normalization
First, we define the acoustic features to be converted. Although one interesting option would be to directly convert time-domain signals, given the recent significant advances in high-quality neural vocoder systems [32], [41]-[50], we find it reasonable to convert acoustic features such as the mel-cepstral coefficients (MCCs) [51] and log F0, since we can expect to generate high-fidelity signals using a neural vocoder once a sufficient set of acoustic features has been obtained. In such systems, the model size of the converter can be made small enough for the system to work well even when a limited amount of training data is available. Hence, in this paper we choose to use the MCCs, log F0, aperiodicity, and voiced/unvoiced indicator of speech as the acoustic features, as detailed below.
We first use the WORLD analyzer [52] to extract the spectral envelope, the log F0, the coded aperiodicity, and the voiced/unvoiced indicator within each time frame of a speech utterance, then compute I MCCs from the extracted spectral envelope, and finally construct an acoustic feature vector by stacking the MCCs, the log F0, the coded aperiodicity, and the voiced/unvoiced indicator. Thus, each acoustic feature vector consists of I + 3 elements. Here, the log F0 contour is assumed to be filled with smoothly interpolated values in unvoiced segments. At training time, we normalize the MCCs x_{i,n} (i = 1, . . . , I) and the log F0 x_{I+1,n} at frame n as x_{i,n} ← (x_{i,n} − μ_i)/σ_i (i = 1, . . . , I + 1), where μ_i and σ_i denote the mean and standard deviation of the i-th feature computed over all the voiced segments of the training samples of the same speaker.
To accelerate and stabilize training and inference, we have found it useful to apply a trick similar to the one introduced by Wang et al. [53]. Specifically, we divide the acoustic feature sequence obtained above into non-overlapping segments of equal length r and use the stack of the acoustic feature vectors in each segment as a new feature vector, so that the new feature sequence becomes r times shorter than the original. Furthermore, we add the sinusoidal position encodings [54] to the reshaped feature sequence before feeding it into the model.
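As a concrete illustration, the frame-stacking and position-encoding steps above can be sketched in numpy as follows. This is a minimal sketch under our own conventions: the function names are not from the original system, and trailing frames beyond a multiple of r are assumed to be dropped.

```python
import numpy as np

def reshape_features(X, r):
    """Stack every r consecutive frames of X (shape D x N) into one column,
    shortening the sequence by a factor of r. Trailing frames beyond a
    multiple of r are dropped (an assumption of this sketch)."""
    D, N = X.shape
    N_r = N // r
    # (D, N_r, r) -> (r, D, N_r) -> vertical concatenation of the r frames
    return X[:, :N_r * r].reshape(D, N_r, r).transpose(2, 0, 1).reshape(r * D, N_r)

def positional_encoding(D, N):
    """Sinusoidal position encodings as in Vaswani et al. [54]:
    sine on even feature indices, cosine on odd ones."""
    pos = np.arange(N)[None, :]                   # (1, N) positions
    i = np.arange(D)[:, None]                     # (D, 1) feature indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / D)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
```

The reshaped sequence and the position encodings have matching shapes, so the encodings can simply be added before the sequence is fed into the encoder.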

B. Model
We hereafter use X^(s) = [x^(s)_1, . . . , x^(s)_{N_s}] ∈ R^{D×N_s} and X^(t) = [x^(t)_1, . . . , x^(t)_{N_t}] ∈ R^{D×N_t} to denote the source and target speech feature sequences of non-aligned parallel utterances, where N_s and N_t denote the lengths of the two sequences and D denotes the feature dimension. We consider an S2S model that aims to map X^(s) to X^(t). Our pairwise conversion model is inspired by and built upon the models presented by Vaswani et al. [54] and Tachibana et al. [27], with the difference being that it involves an additional network, called a target reconstructor. This network plays an important role in ensuring that the encoders preserve contextual information about the source and target speech, as explained below. Our model thus consists of four networks: source and target encoders, a target decoder, and a target reconstructor.
As with many S2S models, our model has an encoder-decoder structure (Fig. 1). The source and target encoders are expected to extract contextual information from source and target speech. Given the contextual vector sequence pair produced by the encoders, we can compute a contextual similarity matrix between the source and target speech, which can be used to warp the time-axis of the source speech. We can then generate the feature sequence of the target speech by letting the target decoder transform each element of the time-warped version of the contextual vector sequence of the source speech. This idea can be formulated as follows.
The source encoder takes X^(s) as the input and produces two internal vector sequences K, V ∈ R^{D′×N_s}, and the target encoder takes X^(t) as the input and produces an internal vector sequence Q ∈ R^{D′×N_t}:

[K; V] = SrcEnc(X^(s)),  Q = TrgEnc(X^(t)),

where [ ; ] denotes vertical concatenation of matrices (or vectors) with compatible sizes and D′ denotes the dimension of the internal vectors. Q and the pair (K, V) can be metaphorically interpreted as the queries and the key-value pairs in a hash table. By using the query and key pair, we can define an attention matrix A ∈ R^{N_s×N_t} as

A = softmax(K^⊤ Q / √D′),

where softmax denotes a softmax operation performed along the first axis. A can be seen as a similarity matrix, where the (n, m)-th element indicates the similarity between the n-th and m-th frames of the source and target speech. The peak trajectory of A can therefore be interpreted as a time-warping function that associates the frames of the source speech with those of the target speech. The time-warped version of the value vector sequence V is thus given as

R = VA,

which will be passed to the target decoder to generate an output sequence:

Y = TrgDec(R).

Since the target speech feature sequence X^(t) is of course not accessible at test time, we want to use a feature vector that the target decoder has generated as the input to the target encoder for the next time step so that feature vectors can be generated one-by-one recursively. To enable the model to behave in this way, first, we must ensure that the target encoder and decoder do not use future information when producing an output vector at each time step. This can be ensured by simply constraining the convolution layers in the target encoder and decoder to be causal. Note that causal convolution can easily be implemented by padding the input with δ(κ − 1) zero vectors on both the left and right sides and removing δ(κ − 1) elements from the end of the convolution output, where κ is the kernel size and δ is the dilation factor.
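The attention computation above can be sketched numerically as follows. This is a minimal numpy illustration assuming the standard scaled dot-product form; random matrices stand in for actual encoder outputs, and the dimensions are arbitrary.

```python
import numpy as np

def softmax(Z, axis=0):
    """Numerically stable softmax along the given axis."""
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Dp, Ns, Nt = 16, 40, 50             # internal dim, source/target lengths
K = rng.standard_normal((Dp, Ns))   # keys   (from the source encoder)
V = rng.standard_normal((Dp, Ns))   # values (from the source encoder)
Q = rng.standard_normal((Dp, Nt))   # queries (from the target encoder)

# Attention matrix: each column is a distribution over source frames.
A = softmax(K.T @ Q / np.sqrt(Dp), axis=0)   # (Ns, Nt)
# Time-warped value sequence, passed on to the target decoder.
R = V @ A                                    # (Dp, Nt)
```

Because the softmax is taken along the first axis, each column of A sums to one, so each column of R is a convex combination of source value vectors.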
Second, the output sequence Y must correspond to a time-shifted version of X^(t) so that at each time step the decoder will be able to predict the target speech feature vector that is likely to be generated at the next time step. To this end, we include an L1 loss

L_dec = ‖Y_{:,1:N_t−1} − X^(t)_{:,2:N_t}‖_1

in the training loss to be minimized, where we have used the colon operator : to specify the range of indices of the elements in a matrix or a vector we wish to extract. For ease of notation, we use : itself to represent all elements along an axis. For example, X^(t)_{:,2:N_t} denotes the submatrix consisting of the elements in all the rows and columns 2, . . . , N_t of X^(t). Third, the first column of X^(t) must correspond to an initial vector with which the recursion is assumed to start. We thus assume that it is always set at an all-zero vector.
The source and target encoders are free to ignore the information contained in the feature vector inputs when finding a time alignment between the source and target speech. One natural way of ensuring that K, V, and Q contain the information necessary for finding an appropriate time alignment is to assist them in preserving sufficient information for reconstructing the input feature sequences. To this end, we introduce a target reconstructor that aims to reconstruct the feature sequence of target speech X^(t) from K, V, and Q via R = VA:

X̃^(t) = TrgRec(R),

and include a reconstruction loss

L_rec = ‖X̃^(t) − X^(t)‖_1   (8)

in the training loss to be minimized. This idea was introduced in our previous work [35]. We call Eq. (8) the context preservation loss. Although the reconstructor and the decoder may appear to have similar roles, the difference is that the reconstructor is only responsible for making each column of R contain sufficient information about the current value of the target feature sequence so that the decoder can concentrate on predicting the future value using that information.
As detailed in Subsection V-B, all the networks are designed using fully convolutional architectures with gated linear units (GLUs) [31] and residual connections. The output of the GLU block used in the present model is defined as GLU(X) = B_1(L_1(X)) ⊙ sigmoid(B_2(L_2(X))), where X is the layer input, L_1 and L_2 are dilated convolution layers, B_1 and B_2 are batch normalization layers, ⊙ is the elementwise product, and sigmoid is a sigmoid gate function. Similar to LSTMs, GLUs can mitigate the vanishing gradient problem.
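A rough numpy sketch of the gating in a GLU block follows. Here the 1 × 1 convolutions reduce to matrix products over the channel axis, a simple per-channel normalization stands in for trained batch normalization, and dilation and residual connections are omitted; all names and simplifications are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu_block(X, W1, W2, eps=1e-5):
    """GLU gating on a (channels x time) input X: two linear 'convolutions'
    (W1, W2), each followed by a normalization, with the second branch
    squashed by a sigmoid and used as a multiplicative gate."""
    def norm(H):  # stand-in for batch normalization over the time axis
        return (H - H.mean(axis=1, keepdims=True)) / (H.std(axis=1, keepdims=True) + eps)
    return norm(W1 @ X) * sigmoid(norm(W2 @ X))
```

Because the gate output lies in (0, 1), the block can pass or suppress each channel-time element of the first branch, which is the mechanism credited with easing gradient flow.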

C. Constraints on Attention Matrix
It would be natural to assume that the time alignment between parallel utterances is usually monotonic and nearly linear. This implies that the diagonal region of the attention matrix A should always be dominant. We expect that imposing such a restriction on A can significantly reduce the training effort, since the search space for A is greatly reduced. To penalize A for not having a diagonally dominant structure, we introduce a diagonal attention loss (DAL) [27]:

L_dal = (1/(N_s N_t)) ‖A ⊙ W_{N_s×N_t}(ν)‖_1,

where ⊙ is the elementwise product and W_{N_s×N_t}(ν) ∈ R^{N_s×N_t} is a non-negative weight matrix whose (n, m)-th element is defined as w_{n,m} = 1 − e^{−(n/N_s − m/N_t)²/(2ν²)}. Fig. 2 shows plots of W_{N_s×N_t}(ν). Each time point of the target feature sequence must correspond to only one or at most a few time points of the source feature sequence. This implies that two different columns of A must be as orthogonal as possible. Although the DAL with a sufficiently small ν can induce orthogonality, it may also lead to the undesirable situation where the time alignment between the two sequences is forced to be strictly linear. Thus, ν must not be set to too small a value, so as to enable reasonably flexible time alignments. To achieve orthogonality while allowing ν to take a moderately greater value, we propose introducing another loss to constrain A, which we call the orthogonal attention loss (OAL):

L_oal = (1/N_t²) Σ_{m ≠ m′} a_m^⊤ a_{m′},

where a_m denotes the m-th column of A.
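The DAL weight matrix and loss can be sketched as follows. This is a minimal numpy illustration of the w_{n,m} formula above; zero-based indices are used for simplicity, and the function names are ours.

```python
import numpy as np

def dal_weight(Ns, Nt, nu):
    """Non-negative weight matrix of the diagonal attention loss:
    w[n, m] = 1 - exp(-(n/Ns - m/Nt)^2 / (2 nu^2)).
    Near zero along the diagonal band, close to 1 far from it."""
    n = np.arange(Ns)[:, None] / Ns
    m = np.arange(Nt)[None, :] / Nt
    return 1.0 - np.exp(-(n - m) ** 2 / (2.0 * nu ** 2))

def dal(A, nu):
    """Diagonal attention loss: mean of the attention mass weighted by W,
    so off-diagonal attention is penalized."""
    Ns, Nt = A.shape
    return np.sum(A * dal_weight(Ns, Nt, nu)) / (Ns * Nt)
```

A smaller nu narrows the low-penalty band around the diagonal, which is the trade-off against alignment flexibility discussed above.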

D. Training Loss
Given examples of parallel utterances, the total training loss for the ConvS2S-VC model to be minimized is given as

L = E_{X^(s),X^(t)} { L_dec + λ_r L_rec + λ_d L_dal + λ_o L_oal },   (11)

where E_{X^(s),X^(t)} {·} is the sample mean over all the training examples and λ_r ≥ 0, λ_d ≥ 0, and λ_o ≥ 0 are regularization parameters, which weigh the importance of L_rec, L_dal, and L_oal relative to L_dec.

E. Conversion Process
At test time, we can convert a source speech feature sequence X^(s) via the following recursion: the source encoder first computes [K; V] = SrcEnc(X^(s)); then, starting from the all-zero initial vector ỹ_0, for n = 1, 2, . . ., the target encoder computes the query q_n from the previously generated vectors ỹ_0, . . . , ỹ_{n−1}, the attention vector and time-warped context are obtained as a_n = softmax(K^⊤ q_n / √D′) and r_n = V a_n, and the target decoder produces the next feature vector ỹ_n from r_1, . . . , r_n. However, as Fig. 3 shows, it transpired that with this algorithm the attended time point does not always move forward monotonically and continuously and can occasionally become stuck at the same time point or suddenly jump to a distant time point, even though the diagonal and orthogonal losses are considered in training. To assist the attended point to move forward monotonically and continuously, we limit the paths through which the attended point is allowed to move by forcing the attention weights at time points distant from the peak of the attention distribution obtained at the previous time step to zeros. This can be implemented, for instance, as follows: letting n̂ denote the peak index of a_{n−1}, we set the elements of a_n outside the range [n̂ − N_0, n̂ + N_1] to zeros and then renormalize it as a_n ← a_n / sum(a_n), where sum(·) denotes the sum of all the elements in a vector. Note that we set N_0 and N_1 at the nearest integers corresponding to 160 ms and 320 ms, respectively. Fig. 3 shows an example of how different the attention matrices look when this procedure has been undertaken. After we obtain R with the above algorithm, we can use the target reconstructor to compute Ỹ = TrgRec(R) and use it instead of Y as the feature sequence of the converted speech.
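The attention-windowing heuristic described above can be sketched as follows. This is a minimal numpy illustration; the exact window boundaries relative to the previous peak are our own convention.

```python
import numpy as np

def constrain_attention(a, prev_peak, N0, N1):
    """Zero out attention weights outside [prev_peak - N0, prev_peak + N1]
    and renormalize, so the attended point can only move within a window
    around the peak found at the previous time step."""
    Ns = len(a)
    mask = np.zeros(Ns)
    lo = max(prev_peak - N0, 0)
    hi = min(prev_peak + N1 + 1, Ns)   # +1: inclusive upper bound
    mask[lo:hi] = 1.0
    a = a * mask
    return a / a.sum()
```

At each decoding step, `prev_peak` would be `np.argmax` of the previous attention vector, with N0 and N1 chosen to correspond to 160 ms and 320 ms of frames.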
Once Y or Ỹ has been obtained, we adjust the mean and variance of the generated feature sequence so that they match the pretrained mean and variance of the feature vectors of the target speaker. We can then generate a time-domain signal using the WORLD vocoder or any recently developed neural vocoder [32], [41]-[50]. Note that in the following experiments, we chose to use Ỹ for final waveform generation, as it resulted in better-sounding speech.

F. Real-Time System Design
Real-time requirements must be considered when building VC systems. If we want our model to work in real time, first, we must not allow the source encoder to use future information, just as with the target encoder and decoder during training. This requirement can easily be implemented by constraining the convolution layers in the source encoder (and the target reconstructor, if we assume it is used to generate the converted feature sequence) to be causal. Another point we must consider is that the speaking rate and rhythm of the input speech cannot be changed drastically at test time. One simple way of keeping them unchanged is to set A to an identity matrix, so that the time-warped sequence R = VA reduces to V. In this way, the autoregressive recursion is no longer needed and the conversion can be performed in a sliding-window fashion as

Y = TrgDec(V).

We will show later how these modifications affect the VC performance. Note that even under this setting, the ability to learn and apply conversion rules that capture long-term dependencies remains effective.

G. Impact of Batch Normalization
As mentioned earlier, using fully convolutional architectures allows the use of batch normalization in all the hidden layers of the networks, which is not straightforward for architectures that include recurrent modules. One benefit of using batch normalization layers is that they enable the networks to use a higher learning rate without vanishing or exploding gradients. Batch normalization is also believed to help regularize the networks, making them generalize better and mitigating overfitting. The effect of batch normalization will be verified experimentally in Section V.

IV. MANY-TO-MANY CONVS2S-VC

A. Model and Training Loss
We now describe an extension of the ConvS2S model that enables many-to-many VC. Here, the idea is to use a single model to achieve mappings among multiple speakers. The model consists of the same set of networks as the pairwise model. The only difference is that each network takes a speaker index as an additional input.
Let X^(1), . . . , X^(K) be examples of the acoustic feature sequences of speech of K different speakers reading the same sentence. Given a single pair of parallel utterances X^(k) and X^(k′), where k and k′ denote the source and target speaker indices (integers), the source encoder takes X^(k) and the source speaker index k as the inputs and produces two internal vector sequences K^(k) and V^(k), whereas the target encoder takes X^(k′) and the target speaker index k′ as the inputs and produces an internal vector sequence Q^(k′):

[K^(k); V^(k)] = SrcEnc(X^(k), k),  Q^(k′) = TrgEnc(X^(k′), k′).   (12)

The attention matrix A^(k,k′) and the time-warped version R^(k,k′) of V^(k) are then computed using K^(k) and Q^(k′):

A^(k,k′) = softmax(K^(k)⊤ Q^(k′) / √D′),  R^(k,k′) = V^(k) A^(k,k′).

The outputs of the reconstructor and decoder given the input R^(k,k′) with target speaker conditioning are finally given as

X̃^(k′) = TrgRec(R^(k,k′), k′),  Y = TrgDec(R^(k,k′), k′).

The loss functions L_dec, L_rec, L_dal, and L_oal to be minimized given this single training example are defined in the same way as in the pairwise model. With the above model, it would be reasonable to also consider the case where k = k′. Minimizing the corresponding loss ensures that the input feature sequence X^(k) remains unchanged when the source and target speakers are the same. We call this loss the identity mapping loss (IML) and denote it by L_iml. Hence, the total training loss to be minimized becomes

L = E_{X^(k),X^(k′)} { L_dec + λ_r L_rec + λ_d L_dal + λ_o L_oal } + λ_i E_{X^(k)} { L_iml },

where E_{X^(k),X^(k′)} {·} and E_{X^(k)} {·} denote the sample means over all the training examples of parallel utterances of speakers k and k′, and λ_i ≥ 0 is a regularization parameter, which weighs the importance of the IML.

Fig. 4. Examples of the attention matrices predicted from test input female speech using the many-to-many model: with batch normalization (left) and with conditional batch normalization (right).

B. Conditional Batch Normalization
The left figure in Fig. 4 shows the attention matrix predicted from input female speech using the many-to-many model with regular batch normalization layers. As this example shows, attention matrices predicted by the many-to-many model tended to become blurry, mostly resulting in unintelligible speech. We conjecture that this was caused by the fact that the distributions of the inputs to the hidden layers can change in accordance with the source and/or target speakers. To normalize the layer input distributions on a speaker-dependent basis, we propose using conditional batch normalization layers for the many-to-many model. Each element y_{b,d,n} of the output Y = B(X) of a regular batch normalization layer is defined as

y_{b,d,n} = γ_d (x_{b,d,n} − μ_d(X)) / σ_d(X) + β_d,

where X denotes the layer input given by a three-way array with batch, channel, and time axes, x_{b,d,n} denotes its (b, d, n)-th element, μ_d(X) and σ_d(X) denote the mean and standard deviation of the d-th channel components of X computed along the batch and time axes, and γ = [γ_1, . . . , γ_D] and β = [β_1, . . . , β_D] denote the parameters to be learned. In contrast, the output of a conditional batch normalization layer is defined as

y_{b,d,n} = γ^k_d (x_{b,d,n} − μ_d(X)) / σ_d(X) + β^k_d,

where the only difference is that the parameters γ^k = [γ^k_1, . . . , γ^k_D] and β^k = [β^k_1, . . . , β^k_D] are conditioned on the speaker index k. Note that a similar idea, called conditional instance normalization, has been introduced to modify the instance normalization process for image style transfer [55] and non-parallel VC [56].
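A minimal numpy sketch of a conditional batch normalization layer as described above follows. Statistics are computed on the fly from the input (as in training mode); the speaker-indexed gamma and beta tables are the only speaker-dependent parts, and the function name is ours.

```python
import numpy as np

def conditional_batch_norm(X, gamma, beta, k, eps=1e-5):
    """X: (B, D, N) input with batch, channel, and time axes.
    gamma, beta: (K, D) speaker-conditioned scale/offset tables.
    k: speaker index selecting which row of gamma/beta to apply.
    Per-channel statistics are computed over the batch and time axes,
    exactly as in regular batch normalization."""
    mu = X.mean(axis=(0, 2), keepdims=True)
    sigma = X.std(axis=(0, 2), keepdims=True)
    Xn = (X - mu) / (sigma + eps)
    return gamma[k][None, :, None] * Xn + beta[k][None, :, None]
```

Switching the target speaker thus amounts to indexing a different (gamma, beta) pair while the normalization statistics themselves stay speaker-independent.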

C. Any-to-Many Conversion
With the models presented above, the source speaker must be known and specified during both training and inference. However, there can be situations where the source speaker is unknown or arbitrary. We call VC tasks in such scenarios any-to-one or any-to-many VC. Our many-to-many model can be modified to handle any-to-many VC tasks by not allowing the source encoder to take the source speaker index k as an input at either training or test time. The modified version can be formulated by simply replacing Eq. (12) in the many-to-many model with

[K^(k); V^(k)] = SrcEnc(X^(k)),  Q^(k′) = TrgEnc(X^(k′), k′),

so that the source encoder no longer depends on the source speaker index.

V. EXPERIMENTS

A. Experimental Settings
To evaluate the effects of the ideas presented in Sections III and IV, we conducted objective and subjective evaluation experiments involving a speaker identity conversion task. For the experiment, we used the CMU Arctic database [57], which consists of recordings of 1132 phonetically balanced English utterances spoken by four US English speakers. We used all the speakers (clb (female), bdl (male), slt (female), and rms (male)) for training and evaluation. Thus, in total there were 12 different combinations of source and target speakers. The audio files for each speaker were manually divided into 1000 and 132 files, which were provided as training and evaluation sets, respectively. All the speech signals were sampled at 16 kHz. As already detailed in Subsection III-A, for each utterance, the spectral envelope, log F 0 , coded aperiodicity, and voiced/unvoiced information were extracted every 8 ms using the WORLD analyzer [52]. Then, 28 MCCs were extracted from each spectral envelope using the Speech Processing Toolkit (SPTK) [58]. The reduction factor r was set to 3. Hence, the dimension of the acoustic feature was D = (28 + 3) × 3 = 93.

B. Network Architectures
We use the notations in Table I to describe the network architectures. The architectures of all the networks in the pairwise and many-to-many models are detailed in Table II. Note that in Table II the layer index is omitted for simplicity of notation and each layer has a different set of free parameters even though the same symbol is used.
All the networks were trained simultaneously with random initialization. Adam optimization [59] was used for model training, where the mini-batch size was 16 and 25,000 iterations were run. The learning rate and the exponential decay rate of the first moment for Adam were set at 0.00015 and 0.9, respectively.

D. Objective Performance Measures
The test dataset consists of speech samples of each speaker reading the same sentences. Thus, the quality of a converted feature sequence can be assessed by comparing it with the feature sequence of the reference utterance.
1) Mel-Cepstral Distortion (MCD):
To evaluate the quality of the converted spectral envelope, we used the mel-cepstral distortion (MCD)

MCD [dB] = (10/ln 10) √(2 Σ_{i=1}^{28} (x̂_i − x_i)²)

between converted and reference MCC vectors x̂ and x to measure their difference. Here, we used the average of the MCDs taken along the DTW path between the converted and reference feature sequences as the objective performance measure for each test utterance.

2) Log F 0 Correlation Coefficient (LFC):
To evaluate the F0 contour of converted speech, we used the correlation coefficient between the predicted and target log F0 contours [60] as the objective performance measure. Since the converted and reference utterances were not necessarily aligned in time, we computed the correlation coefficient after properly aligning them. Here, we used the MCC sequences X̂_{1:28,1:N} and X_{1:28,1:M} of the converted and reference utterances to find a phoneme-based alignment, assuming that the predicted and reference MCCs at corresponding frames were sufficiently close. Given the log F0 contours x̂_{29,1:N} and x_{29,1:M} of the converted and reference utterances, we then computed the correlation coefficient between the two sequences aligned along the obtained DTW path to measure the similarity between the two log F0 contours. In the current experiment, we used the average of the correlation coefficients taken over all the test utterances as the objective performance measure for log F0 prediction. Thus, the closer it is to 1, the better the performance. We call this measure the log F0 correlation coefficient (LFC).
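Once the two log F0 contours are time-aligned, the per-utterance LFC reduces to a Pearson correlation coefficient, which can be sketched as:

```python
import numpy as np

def lfc(f0_conv, f0_ref):
    """Pearson correlation coefficient between two log-F0 contours that
    have already been aligned in time (e.g., along an MCC-based DTW path,
    as described above)."""
    return np.corrcoef(f0_conv, f0_ref)[0, 1]
```

A value of 1 indicates that the converted contour follows the target contour perfectly up to an affine transformation.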

3) Local Duration Ratio (LDR):
To evaluate the speaking rate and the rhythm of converted speech, we used the local slopes of the DTW path between the converted and reference utterances to determine the objective performance measure. If the speaking rate and the rhythm of the two utterances are exactly the same, all the local slopes should be 1. Hence, the better the conversion, the closer the local slopes become to 1. To compute the local slopes, we undertook the following process. Given the MCC sequences X̂_{1:28,1:N} and X_{1:28,1:M} of the converted and reference utterances, we first performed DTW on X̂_{1:28,1:N} and X_{1:28,1:M}. If we use (p_1, q_1), . . . , (p_j, q_j), . . . , (p_J, q_J) to denote the obtained DTW path, where (p_1, q_1) = (1, 1) and (p_J, q_J) = (M, N), we computed the slope of the regression line fitted to the 33 consecutive local points for each j:

s_j = Σ_{j′=j−16}^{j+16} (p_{j′} − p̄_j)(q_{j′} − q̄_j) / Σ_{j′=j−16}^{j+16} (p_{j′} − p̄_j)²,

where p̄_j = (1/33) Σ_{j′=j−16}^{j+16} p_{j′} and q̄_j = (1/33) Σ_{j′=j−16}^{j+16} q_{j′}, and then computed the median of s_1, . . . , s_J. We call this measure the local duration ratio (LDR). The greater this ratio, the longer the duration of the converted utterance is relative to the reference utterance. In the following, we use the mean absolute difference between the LDRs and 1 (in percentages) as the overall measure for the LDRs. Thus, the closer it is to zero, the better the performance. For example, if the converted speech is 2 times faster than the reference speech, the LDR will be 0.5 everywhere, and so its mean absolute difference from 1 will be 50%.
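The local-slope computation can be sketched as follows. This is a minimal numpy illustration; regressing q on p is our reading of the direction that makes a slope greater than 1 correspond to a locally longer converted utterance, and the function name is ours.

```python
import numpy as np

def ldr(p, q, half=16):
    """Median of local regression slopes along a DTW path (p_j, q_j):
    for each interior j, fit a regression line to the 33 consecutive
    points centered at j and take its slope (q regressed on p)."""
    slopes = []
    for j in range(half, len(p) - half):
        pw = p[j - half:j + half + 1].astype(float)
        qw = q[j - half:j + half + 1].astype(float)
        pm, qm = pw.mean(), qw.mean()
        slopes.append(np.sum((pw - pm) * (qw - qm)) / np.sum((pw - pm) ** 2))
    return float(np.median(slopes))
```

For instance, a path in which the converted axis advances twice as fast as the reference axis yields a slope of 2 everywhere, i.e., the converted utterance is twice as long.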

1) Sprocket:
We chose the open-source VC system called sprocket [61] for comparison in our experiments. To run this method, we used the source code provided by its author [62]. Note that this system was used as a baseline system in the Voice Conversion Challenge (VCC) 2018 [63].
2) RNN-S2S-VC: To evaluate the effect of the fully convolutional architecture adopted in ConvS2S-VC, we implemented its recurrent counterpart [35], inspired by the architecture introduced in an S2S model-based TTS system called Tacotron [23], and treated it as another baseline. Although the original Tacotron used mel-spectra as the acoustic features, the baseline system was designed to use the same acoustic features as our system. The architecture was specifically designed as follows. The encoder consisted of a bottleneck fully connected prenet followed by a stack of 1 × 1 1D GLU convolutions and a bidirectional LSTM layer. The decoder was an autoregressive content-based attention network, consisting of a bottleneck fully connected prenet followed by a stateful LSTM layer producing the attention query, which was then passed to a stack of two unidirectional residual LSTM layers, followed by a linear projection to generate the features. Note that we replaced all rectified linear unit (ReLU) activations with GLUs, as in our model. We also designed and implemented a many-to-many extension of the above RNN-based model.

F. Objective Evaluations
1) Effects of Regularization Techniques: Tables III, IV, and V show the average MCDs (with 95% confidence intervals), LFCs, and LDR deviations of the converted speech obtained using the pairwise and many-to-many models under the (λ_r, λ_o) settings (0, 0), (1, 0), and (1, 2000) for the pairwise conversion model and the (λ_r, λ_i, λ_o) settings (0, 0, 0), (1, 0, 0), (1, 1, 0), and (1, 1, 2000) for the many-to-many model. Owing to the limited amount of training data, the models trained without the DAL did not successfully produce recognizable speech. Thus, we omit the results obtained when λ_d = 0. As the results show, although there are a few exceptions, both the pairwise and many-to-many models performed better for most speaker pairs in terms of the MCD measure when all the regularization terms were simultaneously taken into account during training.
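For reference, the GLU activation that replaces the ReLUs in both the baseline and proposed models splits its input channels in half and uses one half as a sigmoid gate on the other. A minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def glu(x, axis=-1):
    """Gated linear unit (Dauphin et al.): split the given axis in
    half and gate the first half with the sigmoid of the second,
    halving the number of channels."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))
```

Unlike a ReLU, the gate lets the network learn, per channel and per frame, how much of the linear path to pass through.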
We also found that the effects of L rec and L oal on the LFC and LDR measures were less significant than on the MCD measure. Fig. 5 shows examples of how each of the regularization techniques can affect the prediction of the attention matrices by the many-to-many model at test time.
As these examples show, the CPL tended to have a notable effect on promoting the monotonicity and continuity of the attention predictions. However, it also had the negative effect of blurring the predicted attention distributions. The OAL and IML contributed to counteracting this adverse effect.

2) Effects of Normalization Techniques: Well-known normalization techniques for neural network training include instance normalization (IN) [64], weight normalization (WN) [65], and batch normalization (BN) [66]. For our many-to-many model, other choices include conditional IN (CIN) and conditional BN (CBN). We compared the effects of these normalization methods on both the pairwise and many-to-many models on the basis of the MCD, LFC, and LDR measures. Note that all the normalization layers in Table II are excluded in the WN counterparts. The average MCDs, LFCs, and LDR deviations obtained using these normalization methods are shown in Tables VI, VII, and VIII.
As the results show, BN worked better than IN and WN when applied to the pairwise conversion model, especially in terms of the MCD and LFC measures. However, naively applying it directly to the many-to-many model did not work satisfactorily, as anticipated in Subsection IV-B. This was also the case with IN. Although CIN was found to perform poorly, CBN worked significantly better.
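To make the CBN mechanism concrete, the following sketch normalizes with batch statistics shared across speakers while switching the scale and shift parameters according to the target speaker. This is a simplified NumPy illustration with hypothetical class and argument names; running statistics and backpropagation are omitted.

```python
import numpy as np

class ConditionalBatchNorm:
    """Batch normalization whose affine parameters (gamma, beta) are
    switched per target speaker, while the normalization statistics
    are computed from the batch as usual (training mode only)."""

    def __init__(self, num_channels, num_speakers, eps=1e-5):
        # One (gamma, beta) pair per target speaker.
        self.gamma = np.ones((num_speakers, num_channels))
        self.beta = np.zeros((num_speakers, num_channels))
        self.eps = eps

    def __call__(self, x, speaker_id):
        # x: (batch, channels, time)
        mean = x.mean(axis=(0, 2), keepdims=True)
        var = x.var(axis=(0, 2), keepdims=True)
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        # Apply the scale/shift belonging to the requested target speaker.
        g = self.gamma[speaker_id][None, :, None]
        b = self.beta[speaker_id][None, :, None]
        return g * x_hat + b
```

With plain BN, all speakers would share a single (gamma, beta) pair; CBN instead gives each target speaker its own pair, which is the switching behavior described above for the many-to-many model.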
3) Comparisons With Baseline Methods: Tables IX, X, and XI show the average MCDs, LFCs, and LDRs obtained with the proposed and baseline methods. As Tables IX and X show, the pairwise versions of ConvS2S-VC and RNN-S2S-VC performed comparably to each other and significantly better than sprocket. The effect of the many-to-many extension was noticeable for both ConvS2S-VC and RNN-S2S-VC, revealing the advantage of exploiting the training data of all the speakers. The many-to-many ConvS2S-VC performed better than its RNN counterpart. This demonstrates the effect of the convolutional architecture. Since sprocket is designed to keep the speaking rate and rhythm of input speech unchanged, the performance gains over sprocket in terms of the LDR measure show how well the competing methods are able to predict the speaking rate and rhythm of target speech. As Table XI shows, both the pairwise and many-to-many versions of RNN-S2S-VC and ConvS2S-VC obtained LDR deviations closer to 0 than sprocket.
As mentioned earlier, one important advantage of the proposed model over its RNN counterpart is that it can be trained efficiently thanks to the nature of the convolutional architectures. In fact, whereas the pairwise and many-to-many versions of the RNN-based model took about 30 and 50 hours to train, the two versions of ConvS2S-VC required considerably less training time.

4) Any-to-Many Conversion: The modifications described in Subsection IV-C make it possible to handle any-to-many VC tasks. We evaluated how these modifications actually affected the performance. Table XII shows the average MCDs, LFCs, and LDR deviations obtained with the any-to-many setting under a closed-set condition, where the speaker of the input speech is unknown but seen in the training data. Whereas the pairwise and the default many-to-many versions must be informed of the speaker of each input utterance at test time, the any-to-many version requires no such information. This is convenient in practical scenarios of VC applications, but because of this disadvantage in the test condition, the problem becomes more challenging. As the results show, the MCDs and LFCs obtained with the any-to-many version were only slightly worse than those obtained with the default many-to-many model despite this disadvantage. It is also worth noting that they were better than those obtained with sprocket and the pairwise versions of ConvS2S-VC and RNN-S2S-VC, all of which were trained under a speaker-dependent closed-set condition.

We further evaluated the performance of the any-to-many model under an open-set condition, where the speaker of the test utterances is unseen in the training data. We used the utterances of the speaker lnh (female) as the test input speech. The results are shown in Table XIII. For comparison, Table XIV shows the results of sprocket on the same speaker pairs under a speaker-dependent closed-set condition.
As these results show, the proposed model with the open-set any-to-many setting still performed better than sprocket, even though sprocket had an advantage in both the training and test conditions.

5) Performance With Real-Time System Settings:
We evaluated the MCDs and LFCs obtained with the many-to-many model under the real-time system setting described in Subsection III-F. The results are shown in Table XV. As the results show, the MCDs and LFCs were only slightly worse than those obtained with the default setting, despite the disadvantages of using causal convolutions for all the networks and forcing the attention matrices to be exactly diagonal (instead of having them be predicted). Comparing Table XV with the sprocket results in Tables IX and X also shows how well the proposed method can perform with the real-time system setting.
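The causal convolutions required by the real-time setting can be illustrated as follows: each output frame depends only on the current and past input frames, which is achieved by left-padding the input. This is a minimal single-channel NumPy sketch; the actual networks use multi-channel GLU convolutions.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: y[t] = sum_k kernel[k] * x[t - k],
    so y[t] never depends on future samples. Causality is enforced
    by left-padding with K - 1 zeros."""
    K = len(kernel)
    x_pad = np.concatenate([np.zeros(K - 1), x])
    # Each window x_pad[t:t+K] holds x[t-K+1], ..., x[t].
    return np.array([np.dot(kernel[::-1], x_pad[t:t + K])
                     for t in range(len(x))])
```

For example, the kernel [0, 1] acts as a one-frame delay, illustrating that only past samples influence each output.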

G. Subjective Listening Tests
We conducted mean opinion score (MOS) tests to compare the sound quality and speaker similarity of the converted speech samples obtained with the proposed and baseline methods.
In the sound quality test, we included speech samples synthesized in the same way as in the proposed and baseline methods (namely, with the WORLD synthesizer) using the acoustic features directly extracted from real speech samples. Hence, the scores of these samples indicate the upper limit of the achievable performance. We also included speech samples produced by the pairwise and many-to-many versions of RNN-S2S-VC and by sprocket in the stimuli. Speech samples were presented in random order to eliminate bias with respect to the order of the stimuli. Ten listeners participated in our listening tests. Each listener was presented with 6 × 10 utterances and asked to evaluate their naturalness by selecting 5: Excellent, 4: Good, 3: Fair, 2: Poor, or 1: Bad for each utterance. The results are shown in Fig. 6. As the results show, the pairwise ConvS2S-VC performed slightly better than sprocket and significantly better than the two versions of RNN-S2S-VC. The many-to-many ConvS2S-VC performed better than all the other methods, revealing the effect of the many-to-many extension, and came close to the upper limit obtained with the analysis-synthesis samples.

In the speaker similarity test, each subject was given a converted speech sample and a real speech sample of the corresponding target speaker and was asked to evaluate how likely they were to have been produced by the same speaker by selecting 5: Definitely, 4: Likely, 3: Fair, 2: Not very likely, or 1: Unlikely. As in the sound quality test, we used converted speech samples generated by the pairwise and many-to-many versions of RNN-S2S-VC and by sprocket for comparison. Each listener was presented with 5 × 10 pairs of utterances. As the results in Fig. 7 show, both the pairwise and many-to-many versions of ConvS2S-VC performed better than all the other methods.
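The MOS reported for each method is simply the mean of the listeners' 5-point ratings; a normal-approximation confidence interval can be attached as in the following small sketch (the paper does not specify how its intervals were computed, so this is only one common choice).

```python
import numpy as np

def mos_with_ci(scores, z=1.96):
    """Mean opinion score with a normal-approximation 95% confidence
    interval (z = 1.96) over a list of 5-point listener ratings."""
    s = np.asarray(scores, dtype=float)
    m = float(s.mean())
    # Standard error of the mean, scaled by the chosen z value.
    half = z * float(s.std(ddof=1)) / np.sqrt(len(s))
    return m, (m - half, m + half)
```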

H. Audio Examples of Various Conversion Tasks
Although we only considered a speaker identity conversion task in the above experiments, ConvS2S-VC can also be applied to other tasks. Audio samples of ConvS2S-VC tested on several tasks, including speaker identity conversion, emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion, are provided at [67]. These examples indicate that ConvS2S-VC can perform reasonably well in various tasks other than speaker identity conversion.

VI. CONCLUSION
This paper proposed a voice conversion (VC) method based on the ConvS2S learning framework. The proposed method provides a natural way of converting the F0 contour, speaking rate, and rhythm as well as the voice characteristics of input speech, along with the flexibility to handle many-to-many, any-to-many, and real-time VC tasks without relying on automatic speech recognition (ASR) models or text annotations. Through ablation studies, we demonstrated the individual effect of each of the ideas introduced in the proposed method. Objective and subjective evaluation experiments on a speaker identity conversion task showed that the proposed method could perform better than baseline methods. Furthermore, audio examples showed the potential of the proposed method to perform well in various tasks including emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion.