An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning

Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With the recent advances in theory and practice, we are now able to produce human-like voice quality with high speaker similarity. In this paper, we provide a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discuss their promise and limitations. We will also report the recent Voice Conversion Challenges (VCC), the performance of the current state of technology, and provide a summary of the available resources for voice conversion research.


I. INTRODUCTION
Voice conversion (VC) is a significant aspect of artificial intelligence. It is the study of how to convert one's voice to sound like that of another without changing the linguistic content. Voice conversion belongs to the general technical field of speech synthesis, which converts text to speech or changes the properties of speech, for example, voice identity, emotion, and accent. Stewart, a pioneer in speech synthesis, commented in 1922 [1] that "the really difficult problem involved in the artificial production of speech-sounds is not the making of a device which shall produce speech, but in the manipulation of the apparatus." As voice conversion is focused on the manipulation of voice identity in speech, it represents one of the challenging research problems in speech processing.
There has been a continuous effort in the quest for effective manipulation of speech properties since the debut of computer-based speech synthesis in the 1950s. The rapid development of digital signal processing in the 1970s greatly facilitated the control of the parameters for speech manipulation. While the original motivation of voice conversion could be simply novelty and curiosity, the technological advancements from statistical modeling to deep learning have made a major impact on many real-life applications and benefited consumers, such as personalized speech synthesis [2], [3], communication aids for the speech-impaired [4], speaker de-identification [5], voice mimicry [6] and disguise [7], and voice dubbing for movies.
Berrak Sisman is with the Information Systems Technology and Design (ISTD) Pillar of Singapore University of Technology and Design (SUTD), Singapore.
Junichi Yamagishi is with National Institute of Informatics, Japan and University of Edinburgh, United Kingdom.
Simon King is with the University of Edinburgh, United Kingdom.
Haizhou Li is with the Department of Electrical and Computer Engineering, National University of Singapore.
In general, a speaker can be characterized by three factors: 1) linguistic factors that are reflected in sentence structure, lexical choice, and idiolect; 2) supra-segmental factors such as the prosodic characteristics of a speech signal; and 3) segmental factors that are related to short-term features, such as spectrum and formants. When the linguistic content is fixed, the supra-segmental and the segmental factors are the relevant factors concerning speaker individuality. An effective voice conversion technique is expected to convert both the supra-segmental and the segmental factors. Despite much progress, voice conversion is still far from perfect. In this paper, we celebrate the technological advances and, at the same time, expose their limitations. We will discuss the state-of-the-art technology from historical and technological perspectives.
A typical voice conversion pipeline includes speech analysis, mapping, and reconstruction modules, as illustrated in Figure 1, and is referred to as the analysis-mapping-reconstruction pipeline. The speech analyzer decomposes the speech signals of a source speaker into features that represent supra-segmental and segmental information, the mapping module changes them towards the target speaker, and finally the reconstruction module re-synthesizes time-domain speech signals. The mapping module has taken centre stage in many of the studies. These techniques can be categorized in different ways, for example, based on the use of training data (parallel vs. non-parallel), the type of statistical modeling technique (parametric vs. non-parametric), the scope of optimization (frame level vs. utterance level), and the workflow of conversion (direct mapping vs. inter-lingual). Let's first give an account from the perspective of the use of training data.
The early studies of voice conversion were focused on spectrum mapping using parallel training data, where speech of the same linguistic content is available from both the source and target speaker, for example, vector quantization (VQ) [8] and fuzzy vector quantization [9]. With parallel data, one can align the two utterances using Dynamic Time Warping [10]. The statistical parametric approaches can benefit from more training data for improved performance; to name a few, Gaussian mixture model [11]-[13], partial least square regression [14] and dynamic kernel partial least squares regression (DKPLS) [15].
One of the successful statistical non-parametric techniques is based on non-negative matrix factorization (NMF) [16] and is known as the exemplar-based sparse representation technique [17]-[20]. It requires a smaller amount of training data than the parametric techniques, and addresses the over-smoothing problem well. The family of sparse representation techniques includes phonetic sparse representation and group sparsity implementation [21], [22], which greatly improved the voice quality on small parallel training datasets.
The studies on voice conversion towards non-parallel training data [23]-[28] open up opportunities for new applications. The challenge is how to establish the mapping between non-parallel source and target utterances. The INCA alignment technique by Erro et al. [27] represents one of the solutions to the non-parallel data alignment problem [29]. With the alignment techniques, one is able to extend the voice conversion techniques from parallel data to non-parallel data, such as the extension to DKPLS [30] and the speaker model alignment method [31]. The Phonetic Posteriogram (PPG)-based approach [32] represents another direction of research towards non-parallel training data. While the alignment technique does not use external resources, the PPG-based approach makes use of an automatic speech recognizer to generate an intermediate phonetic representation [33], [34] as the inter-lingual between the speakers. Successful applications include Phonetic Sparse Representation [22].
Wu and Li [6], and Mohammadi and Kain [35] provided an overview of voice conversion systems from the perspective of time alignment of speech features followed by feature mapping, which represents the statistical modeling school of thought. The advent of deep learning techniques represents an important technology milestone in voice conversion research [36]. It has not only greatly advanced the state-of-the-art, but also transformed the way we formulate the voice conversion research problems. It also opens up a new direction of research beyond the parallel and non-parallel data paradigm. Nonetheless, the studies on statistical modeling approaches have provided profound insights into many aspects of the research problems that serve as the foundation work of today's deep learning methodology. In this paper, we will give an overview of voice conversion research by providing a perspective that reveals the underlying design principles from statistical modeling to deep learning.
Deep learning's contributions to voice conversion can be summarized in three areas. Firstly, it allows the mapping module to learn from a large amount of speech data and therefore tremendously improves voice quality and similarity to the target speaker. With neural networks, we see the mapping module as a nonlinear transformation function [37] that is trained from data [38], [39]. LSTM represents a successful implementation with parallel training data [40]. Deep learning has also made a great impact on non-parallel data techniques. The joint use of DBLSTM and i-vector [41], KL divergence and DNN-based approach [42], variational autoencoder [43], average modeling [44] and DBLSTM-based Recurrent Neural Networks [32], [45] bring the voice quality to a new height. More recently, Generative Adversarial Networks such as VAW-GAN [46], CycleGAN [47]-[49], and StarGAN [50] further advance the state-of-the-art.
Secondly, deep learning has created a profound impact on vocoding technology. Speech analysis and reconstruction modules are typically implemented using a traditional parametric vocoder [11]-[13], [51]. The parameters of such vocoders are manually tuned according to some oversimplified assumptions in signal processing. As a result, the parametric vocoders offer a suboptimal solution. A neural vocoder is a neural network that learns to reconstruct an audio waveform from acoustic features [52]. For the first time, the vocoder becomes trainable and data-driven. The WaveNet vocoder [53] represents one of the popular neural vocoders, and it directly estimates waveform samples from the input feature vectors. It has been studied intensively, for example, speaker dependent and independent WaveNet vocoder [54], [55], quasi-periodic WaveNet vocoder [56], [57], adaptive WaveNet vocoder with GANs [58], factorized WaveNet vocoder [59], and refined WaveNet vocoder with VAEs [60], which are known for their natural-sounding voice quality. The WaveNet vocoder is also widely adopted in traditional voice conversion pipelines, such as GMM [54] and sparse representation [61], [62] systems. Other successful neural vocoders include the WaveRNN vocoder [63] and WaveGlow [64], which are excellent vocoders in their own right.
Thirdly, deep learning represents a departure from the traditional analysis-mapping-reconstruction pipeline. All the above techniques largely follow the voice conversion pipeline as in Figure 1. As a neural vocoder is trainable, it can be trained jointly with the mapping module [58] and even with the analysis module to become an end-to-end solution [53].
Voice conversion research used to be a niche area in speech synthesis. However, it has become a major topic in recent years. In the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020), voice conversion papers represent more than one-third of the papers under the speech synthesis category. The growth of the research community was accelerated by collaborative activities across academia and industry, such as the Voice Conversion Challenge (VCC) 2016, which was first launched [65]-[67] at INTERSPEECH 2016. VCC 2016 is focused on the most basic voice conversion task, that is, voice conversion for parallel training data recorded in an acoustic studio. It establishes the evaluation methodology and protocol for performance benchmarking, which are adopted widely in the community. VCC 2018 [68]-[70] proposes a non-parallel training data challenge, and also connects voice conversion with anti-spoofing of speaker verification studies. VCC 2020 puts forward a cross-lingual voice conversion challenge for the first time. We will provide an overview of the series of challenges and the publicly available resources in this paper.
This paper is organized as follows: In Section II, we present the typical flow of voice conversion that includes feature extraction, feature mapping and waveform generation. In Section III, we study the statistical modeling for voice conversion with parallel training data. In Section IV, we study statistical modeling for voice conversion without parallel training data. In Section V, we study the deep learning approaches for voice conversion with parallel training data, and beyond parallel training data. In Section VI, we explain the evaluation techniques for voice conversion. In Sections VII and VIII, we summarize the series of voice conversion challenges, and publicly available research resources for voice conversion. We conclude in Section IX.

II. TYPICAL FLOW OF VOICE CONVERSION
The goal of voice conversion is to modify a source speaker's voice to sound as if it is produced by a target speaker. In other words, a voice conversion system only modifies the speaker-dependent characteristics of speech, such as formants, fundamental frequency (F0), intonation, intensity and duration, while carrying over the speaker-independent speech content.
The core module of a voice conversion system performs the conversion function. Let's denote the source and target speech signals as X and Y respectively. As will be discussed later, voice conversion is typically applied to some intermediate representation of speech, or speech feature, that characterizes a speech frame. Let's denote the source and target speech features as x and y. The conversion function can be formulated as follows,
$$y = F(x) \tag{1}$$
where F(·) is also called the mapping function in the rest of this paper. As illustrated in Figure 1, a typical voice conversion framework is implemented in three steps: 1) speech analysis, 2) feature mapping, and 3) speech reconstruction, which we call the analysis-mapping-reconstruction pipeline. We discuss each step in detail next.

A. Speech Analysis and Reconstruction
Speech analysis and reconstruction are two crucial processes in the 3-step pipeline. The goal of speech analysis is to decompose speech signals into some form of intermediate representation for effective manipulation or modification with respect to the acoustic properties of speech. There have been many useful intermediate representation techniques that were initially studied for speech communication and speech synthesis. They come in handy for voice conversion. In general, the techniques can be categorized into model-based representations and signal-based representations.
In model-based representation, we assume that the speech signal is generated according to an underlying physical model, such as the source-filter model, and express a frame of speech signal as a set of model parameters. By modifying the parameters, we manipulate the input speech. In signal-based representation, we do not assume any model, but rather represent speech as a composition of controllable elements in the time domain or frequency domain. Let's denote the intermediate representation for the source speaker as x; speech analysis can then be described by a function,
$$x = A(X) \tag{2}$$
Speech reconstruction can be seen as an inverse function of the speech analysis, which operates on the modified parameters and generates an audible speech signal. It works in tandem with speech analysis. For example, a vocoder [51] is used to express a speech frame with a set of controllable parameters that can be converted back into a speech waveform. The Griffin-Lim algorithm is used to reconstruct a speech signal from a modified short-time Fourier transform after amplitude modification [71]. As the output speech quality is affected by the speech reconstruction process, speech reconstruction is also one of the important topics in voice conversion research. Let's denote the modified intermediate representation and the reconstructed speech signal for the target speaker as y and Y = R(y); voice conversion can then be described by a composition of three functions,
$$Y = (R \circ F \circ A)(X) = R(F(A(X))) \tag{3}$$
which represents the typical flow of a voice conversion system as a 3-step pipeline. As the mapping is applied frame-by-frame, the number of converted speech features y is the same as that of the source speech features x if speech duration is not modified in the process.
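The three-step composition described above can be illustrated with a toy example. The following is only a minimal sketch, assuming a deliberately simple signal-based analysis (Hann-windowed framing), an identity mapping in place of a trained conversion function, and overlap-add reconstruction; the function names `analyze`, `identity_mapping` and `reconstruct` are hypothetical and not taken from any toolkit.

```python
import numpy as np

def analyze(signal, frame_len=256, hop=128):
    """A(X): split a waveform into Hann-windowed frames (toy signal-based analysis)."""
    # Periodic Hann window: with 50% overlap, shifted copies sum to exactly 1.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])

def identity_mapping(frames):
    """F(x): placeholder mapping that leaves every frame unchanged."""
    return frames.copy()

def reconstruct(frames, hop=128):
    """R(y): overlap-add the frames back into a time-domain waveform."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal(2048)   # stand-in for a source waveform
x = analyze(X)                  # x = A(X)
y = identity_mapping(x)         # y = F(x)
Y = reconstruct(y)              # Y = R(y)
```

With the identity mapping, the output matches the input everywhere except the first and last half frame, where only one window contributes. The number of converted frames equals the number of source frames, as expected when duration is not modified.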
While speech analysis and reconstruction make voice conversion possible, just like other signal processing techniques, they inevitably also introduce artifacts. Many studies were devoted to minimizing such artifacts. We next discuss the most commonly used speech analysis and reconstruction techniques in voice conversion.
1) Signal-based Representation: Pitch Synchronous OverLap and Add (PSOLA) is an example of signal-based representation techniques. It decomposes a speech signal into overlapping speech segments [72], each of which represents one of the successive pitch periods of the speech signal. By overlap-and-adding these speech segments with different pitch periods, we can reconstruct the speech signal with a different intonation. As PSOLA operates directly on the time-domain speech signal [72], the analysis and reconstruction do not introduce significant artifacts. While the PSOLA technique is effective for modification of the fundamental frequency of speech signals, it suffers from several inherent limitations [73], [74]. For example, unvoiced speech signal is not periodic, and the manipulation of the time-domain signal is not straightforward.
Harmonic plus Noise Model (HNM) represents another signal-based representation approach. It works under the assumption that a speech signal can be represented as a harmonic component plus a noise component, with the two delimited by the so-called maximum voiced frequency [75]. The harmonic component is modeled as the sum of harmonic sinusoids up to the maximum voiced frequency, while the noise component is modeled as Gaussian noise filtered by a time-varying autoregressive filter. As the HNM decomposition is represented by some controllable parameters, it allows for easy modification of speech [76], [77].
2) Model-based Representation: The model-based technique assumes that the input signal can be mathematically represented by a model whose parameters vary with time. A typical example is the source-filter model that represents a speech signal as the outcome of an excitation of the larynx (source) modulated by a transfer (filter) function determined by the shape of the supralaryngeal vocal tract. A vocoder, a short form of voice coder, was initially developed to minimize the amount of data that are transmitted for voice communication. It encodes speech into slowly changing control parameters, such as linear predictive coding and mel-log spectrum approximation [78], that describe the filter, and re-synthesizes the speech signal with the source information at the receiving end. In voice conversion, we convert the speech signals from a source speaker to mimic the target speaker by modifying the controllable parameters.
The majority of vocoders are designed based on some form of the source-filter model of speech production, such as mixed excitation with a spectral envelope, and glottal vocoders [79]. STRAIGHT, or "Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum", is one of the popular vocoders in speech synthesis and voice conversion [80]. It decomposes a speech signal into: 1) a smooth spectrogram which is free from periodicity in time and frequency; 2) a fundamental frequency (F0) contour which is estimated using a fixed-point algorithm; and 3) a time-frequency periodicity map which captures the spectral shape of the noise and its temporal envelope. STRAIGHT is widely used in voice conversion because its parametric representation facilitates the statistical modeling of speech, which allows for easy manipulation of speech [11], [81], [82].
Parametric vocoders are widely adopted for analysis and reconstruction of speech in voice conversion studies [8], [9], [11], [12], [46], [47], [83], [84], and continue to play a major role today [17], [21], [22]. The traditional parametric vocoders are designed to approximate the complex mechanics of human speech production under certain simplified assumptions. For example, the interaction between F0 and formant structure is ignored, and the original phase structure is discarded [85]. The assumptions of a stationary process in the short-time window and of a time-invariant linear filter also give rise to "robotic" and "buzzy" voice. Such problems become more serious in voice conversion as we modify both F0 and the formant structure of speech, among others, at the same time. We believe that vocoding can be improved by considering the interaction between the parameters.
3) WaveNet Vocoder: Deep learning offers a solution to some of the inherent problems of parametric vocoders. WaveNet [53] is a deep neural network that learns to generate high-quality time-domain waveforms. As it does not assume any mathematical model, it is a data-driven solution that requires a large amount of training data.
The joint probability of a waveform $X = x_1, x_2, \ldots, x_N$ can be factorized as a product of conditional probabilities:
$$p(X) = \prod_{n=1}^{N} p(x_n \mid x_1, x_2, \ldots, x_{n-1}) \tag{4}$$
A WaveNet is constructed with many residual blocks, each of which consists of 2 × 1 dilated causal convolutions, a gated activation function and 1 × 1 convolutions. With additional auxiliary features h, WaveNet can also model the conditional distribution p(x|h) [53]. Eq. (4) can then be written as follows:
$$p(X \mid h) = \prod_{n=1}^{N} p(x_n \mid x_1, x_2, \ldots, x_{n-1}, h) \tag{5}$$
A typical parametric vocoder performs both analysis and reconstruction of speech. However, most of today's WaveNet vocoders only cover the function of speech reconstruction. They take some intermediate representations of speech as the input auxiliary features, and generate the speech waveform as the output. The WaveNet vocoder [55] remarkably outperforms the traditional parametric vocoders in terms of sound quality. Not only can it learn the relationship between input features and output waveform, but it also learns the interaction among the input features. It has been successfully adopted as part of the state-of-the-art speech synthesis [3], [86]-[89] and voice conversion [54], [55], [57], [60]-[62], [86], [90]-[97] systems.
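To make the role of the 2 × 1 dilated causal convolution more concrete, the sketch below implements only that one operation in NumPy and stacks it with doubling dilations. It is an illustration under simplifying assumptions: scalar weights stand in for learned filters, and the gated activation, residual connections and 1 × 1 convolutions of an actual WaveNet are omitted. The helper names are hypothetical.

```python
import numpy as np

def dilated_causal_conv(x, w_past, w_now, dilation):
    """Kernel-size-2 causal convolution: y[t] = w_past * x[t - dilation] + w_now * x[t]."""
    # Zero-pad on the left only, so no output ever depends on a future sample.
    past = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return w_past * past + w_now * x

def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [1, 2, 4, 8]          # doubling dilations, one per stacked layer
x = np.arange(32, dtype=float)    # stand-in for a waveform segment
h = x
for d in dilations:
    h = dilated_causal_conv(h, 0.5, 0.5, d)

print(receptive_field(dilations))  # prints 16
```

Because each layer only looks backwards, the output at time n depends on x_n and earlier samples only, which is what allows a network to model each conditional term of the factorization while doubling the dilation grows the receptive field exponentially with depth.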
There have been promising studies on using vocoding parameters as the intermediate representations in WaveNet vocoding. A speaker independent WaveNet vocoder [55] is studied by utilizing the STRAIGHT vocoding parameters, such as F0, aperiodicity, and spectrum, as the inputs of WaveNet. In this way, WaveNet learns a sample-by-sample correspondence between the time-domain waveform and the input vocoding parameters. When such a WaveNet vocoder is trained on speech signals from a large speaker population, we obtain a speaker independent vocoder [55]. By adapting the speaker independent WaveNet vocoder with speaker specific data, we obtain a speaker dependent vocoder that generates personalized voice output [58], [60]. The study on the WaveNet vocoder also opens up opportunities for the use of other non-vocoding parameters as the input. For example, a recent study adopts the phonetic posteriogram (PPG) in WaveNet vocoding with promising results in voice conversion with non-parallel training data [94]-[97]. Another study adopts the latent code of an autoencoder and a speaker embedding as the speech representation for the WaveNet vocoder [98].
4) Recent Progress on Neural Vocoders: More recently, the speaker independent WaveRNN-based neural vocoder [63] became popular as it can generate human-like voices from both in-domain and out-of-domain spectrograms [99]-[101]. Another well-known neural vocoder that achieves high-quality synthesis performance is WaveGlow [64]. WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms [102]. WaveGlow benefits from the best of Glow and WaveNet so as to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. We note that WaveGlow is implemented using only a single network with a single cost function, that is, to maximize the likelihood of the training data, which makes the training procedure simple and stable [103].
WaveNet [53] uses an auto-regressive (AR) approach to model the distribution of waveform sampling points, which incurs a high computational cost. As an alternative to auto-regression, a neural source-filter (NSF) waveform modeling framework is proposed [104], [105]. We note that NSF is straightforward to train and fast at generating waveforms. It is reported to be 100 times faster than the WaveNet vocoder, while achieving comparable voice quality on a large speech corpus [106].

B. Feature Extraction
With speech analysis, we derive vocoding parameters that usually contain spectral and prosodic components to represent the input speech. The vocoding parameters characterize the speech in a way that we can reconstruct the speech signal later on after transmission. This is particularly important in speech communication. However, such vocoding parameters may not be the best for transformation of voice identity. More often, the vocoding parameters are further transformed into speech features, a step that we call feature extraction in Figure 1, for more effective modification of the acoustic properties in voice conversion.
For the spectral component, feature extraction aims to derive low-dimensional representations from the high-dimensional raw spectra. Generally speaking, the spectral features should be able to represent speaker individuality well. The features should not only fit the spectral envelope well, but also be convertible back to the spectral envelope. They should have good interpolation properties that allow for flexible modification.
The magnitude spectrum can be warped to the Mel or Bark frequency scale, which are perceptually meaningful for voice conversion. It can also be transformed into the cepstral domain with a finite number of coefficients using the Discrete Cosine Transform of the log-magnitude. Cepstral coefficients are less correlated. In this way, the high-dimensional magnitude spectrum is transformed into a lower-dimensional feature representation. The commonly used speech features include Mel-cepstral coefficients (MCC), linear predictive cepstral coefficients (LPCC), and line spectral frequencies (LSF). Typically, a speech frame is represented by a feature vector.
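The dimensionality reduction described above can be sketched as follows. This is a simplified illustration, assuming a plain orthonormal DCT-II applied to the log-magnitude of one frame; the Mel or Bark warping step is omitted, so the result is not a Mel-cepstrum. The function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    mat[0, :] /= np.sqrt(2.0)
    return mat

def cepstral_features(magnitude, n_coeffs):
    """Keep the first n_coeffs DCT coefficients of the log-magnitude spectrum."""
    log_mag = np.log(np.maximum(magnitude, 1e-10))  # floor avoids log(0)
    return dct_matrix(len(magnitude))[:n_coeffs] @ log_mag

def envelope_from_cepstrum(cepstrum, n_bins):
    """Invert the truncated transform to obtain a smoothed magnitude envelope."""
    return np.exp(dct_matrix(n_bins)[:len(cepstrum)].T @ cepstrum)

# Toy 257-bin magnitude spectrum (strictly positive) reduced to 24 coefficients.
magnitude = 1.5 + np.cos(np.linspace(0.0, 3.0 * np.pi, 257))
features = cepstral_features(magnitude, 24)
envelope = envelope_from_cepstrum(features, 257)
```

Keeping all 257 coefficients recovers the spectrum exactly, while truncation to 24 yields a smoothed envelope, which illustrates the requirement that the features remain convertible back to the spectral envelope.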
Short-time analysis has been the most practical way of speech analysis. Unfortunately, it inherently ignores the temporal context of speech, which is crucial in voice conversion. Many studies have shown that multiple frames [18], [107], dynamic features [62], and phonetic segments serve as effective features in feature mapping.
For the prosodic component, feature extraction can be used to decompose prosodic signals, such as fundamental frequency (F0), aperiodicity (AP), and energy contours, into speaker dependent and independent parameters [82]. In this way, we can carry over the speaker independent prosodic patterns, while converting speaker dependent ones during the feature mapping.

C. Feature Mapping
In the typical flow of voice conversion, feature mapping performs the modification of speech features from source to target speaker. Spectral mapping seeks to change the voice timbre, while prosody conversion seeks to modify the prosody features, such as fundamental frequency, intonation and duration. So far, spectral mapping remains the center of many voice conversion studies.
During training, we learn the mapping function, F(·) in Eq. (1), from training data. At run-time inference, the mapping function transforms the acoustic features. A large part of this paper is devoted to the study of the mapping function. In Section III, we will discuss the traditional statistical modeling techniques with parallel training data. In Section IV, we will review the statistical modeling techniques that do not require parallel training data. In Section V, we will introduce a number of deep learning approaches, which cover 1) parallel training data of paired speakers; and 2) beyond parallel data of paired speakers.

III. STATISTICAL MODELING FOR VOICE CONVERSION WITH PARALLEL TRAINING DATA
Most of the traditional voice conversion techniques assume availability of parallel training data. In other words, the mapping function is trained on paired utterances of the same linguistic content spoken by the source and target speakers. Voice conversion studies started with statistical approaches [108] in the late 1980s, which can be grouped into parametric and non-parametric mapping techniques. Parametric techniques make assumptions about the underlying statistical distributions of speech features and their mapping. Non-parametric ones make fewer assumptions about the data, but seek to fit the training data with the best mapping function, while maintaining some ability to generalize to unseen data.
Parametric techniques, such as Gaussian mixture model (GMM) [109], Dynamic Kernel Partial Least Squares Regression, and the PSOLA mapping technique [73], represent a great success in the recent past. The vector quantization approach to voice conversion is a typical non-parametric technique. It maps codewords between source and target codebooks [8]. In this method, a source feature vector is approximated by the nearest codeword in the source codebook, and mapped to the corresponding codeword in the target codebook. To reduce the quantization error, fuzzy vector quantization was studied [9], [110], where continuous weights for individual clusters are determined at each frame according to the source feature vector. The converted feature vector is defined as a weighted sum of the centroid vectors of the mapping codebook. Recently, the non-negative matrix factorization approach marks a successful non-parametric implementation.
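The difference between the hard and the fuzzy codebook mapping can be sketched in a few lines. This is an illustrative sketch only: the paired codebooks are assumed to be given, and the fuzzy weights use a fuzzy c-means style membership function with a hypothetical `fuzziness` parameter, which is an assumption rather than the exact weighting used in [9], [110].

```python
import numpy as np

def vq_convert(x, src_codebook, tgt_codebook):
    """Hard VQ mapping: nearest source codeword -> paired target codeword."""
    idx = np.argmin(np.linalg.norm(src_codebook - x, axis=1))
    return tgt_codebook[idx]

def fuzzy_vq_convert(x, src_codebook, tgt_codebook, fuzziness=2.0):
    """Fuzzy VQ mapping: weighted sum of target centroids with continuous weights."""
    dist = np.linalg.norm(src_codebook - x, axis=1) + 1e-12  # avoid division by zero
    # Membership decays with distance (fuzzy c-means style; an assumption here).
    weights = dist ** (-2.0 / (fuzziness - 1.0))
    weights /= weights.sum()
    return weights @ tgt_codebook

# Toy paired codebooks with two codewords each.
src_codebook = np.array([[0.0, 0.0], [1.0, 0.0]])
tgt_codebook = np.array([[10.0, 10.0], [20.0, 20.0]])
x = np.array([0.1, 0.0])

hard = vq_convert(x, src_codebook, tgt_codebook)        # snaps to the first target codeword
soft = fuzzy_vq_convert(x, src_codebook, tgt_codebook)  # lies between the two target codewords
```

The hard mapping can only output one of the target codewords, whereas the fuzzy mapping varies continuously with the source vector, which is how it reduces the quantization error.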
We will discuss a typical frame-level mapping paradigm under the assumption of parallel training data, as illustrated in Figure 2. During the training phase, given parallel training data from a source speaker x and a target speaker y, frame alignment is performed to align the source speech vectors and target speech vectors to obtain the paired speech feature vector z = {x, y}. Dynamic time warping is a commonly used feature-based alignment technique. A speech recognizer, which is equipped with phonetic knowledge, can also be used to perform model-based alignment. Frame alignment has been well studied in speech processing. In voice conversion, a large body of literature has been devoted to the design of the frame-level mapping function.
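Feature-based frame alignment with dynamic time warping can be sketched as below. This is a minimal textbook version, assuming Euclidean frame distances and the basic three-way step pattern without slope constraints; the function name `dtw_align` is hypothetical.

```python
import numpy as np

def dtw_align(src, tgt):
    """Return the (source_frame, target_frame) index pairs on the optimal DTW path."""
    n, m = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the last cell to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy one-dimensional feature sequences; the target repeats its first frame.
src = np.array([[0.0], [1.0], [2.0]])
tgt = np.array([[0.0], [0.0], [1.0], [2.0]])
path = dtw_align(src, tgt)
# Paired feature vectors z = {x, y}, one row per aligned frame pair.
z = np.array([np.concatenate([src[i], tgt[j]]) for i, j in path])
```

For this toy input the path is (0, 0), (0, 1), (1, 2), (2, 3): the first source frame is paired with both repeated target frames, and the paired vectors z can then be used to train a frame-level mapping function.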

A. Gaussian Mixture Models
In the Gaussian mixture modeling (GMM) approach to voice conversion [109], we represent the relationship between two sets of spectral envelopes, from source and target speakers, using a Gaussian mixture model. The Gaussian mixture model is a continuous parametric function that is trained to model the spectral mapping. In [109], harmonic plus noise (HNM) features are used in the feature mapping, which allows for high-quality modifications of speech signals. The GMM approach is seen as an extension to the vector quantization approach [8], [9] that results in improved voice quality. However, the speech quality is affected by some factors, e.g., spectral movement with inappropriate dynamic characteristics caused by the frame-by-frame conversion process, and excessive smoothing of converted spectra [111]-[113].
To address the frame-by-frame conversion issue, a maximum likelihood estimation technique was studied to model the spectral parameter trajectory [11]. This technique aims to estimate an appropriate spectrum sequence using dynamic acoustic features. To address the over-smoothing issue, or the muffled effect, the joint density Gaussian mixture model (JD-GMM) was studied [2], [11] to jointly model the sequences of spectral features and their variances using maximum likelihood estimation, which increases the global variance of the spectral features. The JD-GMM method involves two phases: off-line training and run-time conversion. During the training phase, a Gaussian mixture model (GMM) is adopted to model the joint probability density p(z) of the paired feature vector sequence z = {x, y}, which represents the joint distribution of source speech x and target speech y:
$$p(z) = \sum_{k=1}^{K} w_k \, \mathcal{N}\left(z \mid \mu_k^{(z)}, \Sigma_k^{(z)}\right) \tag{6}$$
where K is the number of Gaussian components, $w_k$ is the weight of the k-th component, and $\mu_k^{(z)}$ and $\Sigma_k^{(z)}$ are the mean vector and the covariance matrix of the k-th Gaussian component $\mathcal{N}(z \mid \mu_k^{(z)}, \Sigma_k^{(z)})$, respectively. To estimate the model parameters of the JD-GMM, the expectation-maximization (EM) algorithm [114]-[117] is used to maximize the likelihood on the training data.
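The run-time conversion phase is not spelled out in this section. A common choice for joint-density GMMs is the frame-wise minimum mean-square-error rule, which converts a source frame to the posterior-weighted conditional mean of y given x; the sketch below assumes that rule and should not be read as the exact procedure of [2], [11], which additionally considers dynamic features and global variance. The function name and the toy parameters are hypothetical, and the GMM is assumed to be already trained.

```python
import numpy as np

def gmm_convert(x, weights, means, covs):
    """Convert one source frame x using a joint-density GMM over z = [x, y]."""
    d = len(x)
    log_post = np.empty(len(weights))
    cond_means = []
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        mu_x, mu_y = mu[:d], mu[d:]
        cov_xx, cov_yx = cov[:d, :d], cov[d:, :d]
        diff = x - mu_x
        solved = np.linalg.solve(cov_xx, diff)          # cov_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(cov_xx)
        # Log of w_k * N(x | mu_x, cov_xx), the unnormalized posterior of component k.
        log_post[k] = np.log(w) - 0.5 * (diff @ solved + logdet + d * np.log(2 * np.pi))
        # Conditional mean of y given x under component k.
        cond_means.append(mu_y + cov_yx @ solved)
    post = np.exp(log_post - log_post.max())            # stable normalization
    post /= post.sum()
    return post @ np.array(cond_means)

# Toy single-component model with 1-D source and 1-D target features.
weights = np.array([1.0])
means = np.array([[0.0, 1.0]])
covs = np.array([[[1.0, 0.5], [0.5, 1.0]]])
y_hat = gmm_convert(np.array([2.0]), weights, means, covs)  # 1 + 0.5 * 2 = 2.0
```

With a single component the rule reduces to linear regression; with several components the posterior weights blend the per-component regressions, and this averaging is one source of the over-smoothing discussed in this section.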
A post-filter based on modulation spectrum modification is found useful to address the inherent over-smoothing issue in statistical modeling [118], such as the GMM approach, and it effectively compensates the global variance. The GMM approach is a parametric solution [119]-[123]. It represents a successful statistical modeling technique that works well with parallel training data.

B. Dynamic Kernel Partial Least Squares
The family of parametric techniques also includes linear [73], [74] or non-linear mapping functions. With the local mapping functions, each frame of speech is typically transformed independently of the neighboring frames, which causes temporal discontinuities in the output [74].
To take into account the time-dependency between speech features, a dynamic kernel partial least squares (DKPLS) technique was studied [15]. This method is based on a kernel transformation of the source features to allow non-linear modeling, and on the concatenation of adjacent frames to model the dynamics. The non-linear transformation takes advantage of the global properties of the data, which the GMM approach does not. It was reported that DKPLS outperforms the GMM approach [109] in terms of voice quality. This method is simple and efficient, and does not require massive tuning. More recently, DKPLS-based approaches have been studied to overcome the over-fitting and over-smoothing problems through a feature combination strategy [124].
While statistical modeling for the mapping of spectral features has been well studied, conversion of prosody is often achieved by simply shifting and scaling F0, which is not sufficient for high-quality voice conversion. Hierarchical modeling of prosody, for different linguistic units at several distinct temporal scales, represents an advanced technique for prosody conversion [82], [125]-[127]. DKPLS has created a platform for multi-scale prosody conversion through wavelet transform [128] that shows significant improvement in naturalness over the F0 shifting and scaling technique.
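For reference, the baseline F0 shifting and scaling mentioned above can be sketched as follows. This is a minimal illustration that assumes the transformation is applied to log-F0 using per-speaker means and standard deviations, and that unvoiced frames are marked by an F0 value of zero; the function and parameter names are hypothetical.

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Shift and scale log-F0 from source to target speaker statistics.

    mu_* and sigma_* are the mean and standard deviation of log-F0 over
    voiced frames of each speaker (assumed to be estimated beforehand).
    Frames with F0 == 0 are treated as unvoiced and left unchanged.
    """
    f0_src = np.asarray(f0_src, dtype=float)
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_conv[voiced] = np.exp((log_f0 - mu_src) * (sigma_tgt / sigma_src) + mu_tgt)
    return f0_conv
```

Because the same two global statistics are applied to every frame, this transformation cannot change the shape of the intonation contour at different temporal scales, which is the limitation that hierarchical prosody modeling aims to address.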

C. Frequency Warping
Parametric techniques, such as GMM [109] and DKPLS [15], usually suffer from over-smoothing because they use the minimum mean square error [81] or the maximum likelihood [11] function as the optimization criterion. As a result, the system produces acoustic features that represent a statistical average, and fails to capture the desired details of temporal and spectral dynamics.
Additionally, parametric techniques generally employ low-dimensional features, as discussed in Section II.B, such as the Mel cepstral coefficients (MCC) or line spectral frequencies (LSF), to avoid the curse of dimensionality. The low-dimensional features, however, are doomed to lose spectral details because they have low resolution. Statistical averaging and low-resolution features both lead to the muffled effect of output speech [129].
To preserve the necessary spectral details during conversion, a number of frequency warping-based methods were introduced. The frequency warping technique directly transforms the high-resolution source spectrum to that of the target speaker through a frequency warping function. In recent literature, the warping function is either realized by a single parameter, such as in VTLN-based approaches [26], [130]-[133], or represented as a piecewise linear function [73], [129], [134], which has become a mainstream solution.
The goal of the piecewise linear warping function is to align a set of frequencies between the source and target spectrum by minimizing the spectral distance or maximizing the correlation between the converted and target spectrum. More recently, the parametric frequency warping technique was combined with a non-parametric exemplar-based technique, which achieves good performance [107].
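The effect of a piecewise linear warping function on a single spectral frame can be sketched as follows. This is an illustrative example only: the anchor frequencies `src_anchors` and `tgt_anchors` are assumed to be given (in practice they would be estimated by the distance or correlation criteria above), and linear interpolation between frequency bins is our own simplification.

```python
import numpy as np

def warp_spectrum(spectrum, freqs, src_anchors, tgt_anchors):
    """Warp one spectral frame with a piecewise linear frequency mapping.

    src_anchors[i] on the source frequency axis is mapped to tgt_anchors[i]
    on the target axis; both must be increasing and cover the range of freqs.
    """
    # For every output frequency, look up the source frequency it came from
    # (the inverse of the piecewise linear warping function).
    src_freq = np.interp(freqs, tgt_anchors, src_anchors)
    # Read the source spectrum at those (generally off-grid) frequencies.
    return np.interp(src_freq, freqs, spectrum)
```

For instance, with anchors that map 1000 Hz to 2000 Hz, a spectral peak located at 1000 Hz in the source frame appears at 2000 Hz in the warped frame, while the high-resolution shape of the spectrum around it is retained.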

D. Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) [135] is an effective data mining technique that has been widely used, especially for reconstruction of high-quality signals, such as in speech enhancement [136], [137], speech de-noising [138], [139], and noise and speech estimation [140]. It factorizes a matrix into two matrices, a dictionary and an activation matrix, with the property that all three matrices have no negative elements. The NMF-based techniques are shown to be effective in voice conversion with very limited training data. They mark major progress in the non-parametric approach to voice conversion since the vector quantization technique was introduced. Successful implementations include non-negative spectrogram deconvolution [141], locally linear embedding (LLE) [142], and unit selection [20]. In NMF-based approaches, a target spectrogram is constructed as a linear combination of exemplars. Therefore, the over-smoothing problem can also arise. To overcome the over-smoothing problem, several effective techniques were developed, which we summarize next.
1) Sparse Representation: One effective way to alleviate the over-smoothing problem is to apply a sparsity constraint to the activation matrix, referred to as exemplar-based sparse representation.
As illustrated in Figure 3, a pair of dictionaries $A$ and $B$ are first constructed from speech feature vectors, which we call aligned exemplars, from source and target. $[A; B]$ is also called the coupled dictionary. At run-time, let's consider a speech utterance as a sequence of speech feature vectors that form a spectrogram matrix. The matrix of a source utterance $X$ can be represented as
$$X \approx AH.$$
Due to the non-negative nature of the spectrogram, the NMF technique is employed to estimate the source activation matrix $\hat{H}$, which is constrained to be sparse. Mathematically, we estimate $\hat{H}$ by minimizing an objective function,
$$\hat{H} = \arg\min_{H \geq 0} \; d(X, AH) + \lambda \|H\|_1,$$
where $d(\cdot, \cdot)$ measures the distance between the source spectrogram and its reconstruction, and $\lambda$ is the sparsity penalty factor. To estimate the activation matrix $\hat{H}$, a generalised Kullback-Leibler (KL) divergence is used as $d(\cdot, \cdot)$. It is assumed that source and target dictionaries $A$ and $B$ can share the same source activation matrix $\hat{H}$. Therefore, the converted spectrogram for the target speaker can be written as
$$Y = B\hat{H},$$
where the activation matrix $\hat{H}$ serves as the pivot to transfer source utterance $X$ to target utterance $Y$.
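The run-time procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation of any cited system: we assume an L1 sparsity penalty and the commonly used multiplicative update rule for the generalised KL divergence with a fixed dictionary, and the function names, the default `lam`, and the iteration count are hypothetical.

```python
import numpy as np

def estimate_activations(X, A, lam=0.1, n_iter=200, eps=1e-12):
    """Estimate a sparse, non-negative H such that X is approximately A @ H.

    X: (n_bins, n_frames) source spectrogram, A: (n_bins, n_exemplars)
    source dictionary. The dictionary stays fixed; only H is updated.
    """
    H = np.ones((A.shape[1], X.shape[1]))
    ones = np.ones_like(X)
    for _ in range(n_iter):
        ratio = X / (A @ H + eps)
        # Multiplicative update: keeps H non-negative; lam shrinks activations.
        H *= (A.T @ ratio) / (A.T @ ones + lam + eps)
    return H

def convert_spectrogram(X, A, B, lam=0.1, n_iter=200):
    """Apply the source activations to the paired target dictionary B."""
    H = estimate_activations(X, A, lam, n_iter)
    return B @ H
```

Since the columns of $A$ and $B$ are aligned exemplars, no explicit mapping function is trained; the conversion relies entirely on the shared activation matrix.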
The sparse representation framework continues to attract much attention in voice conversion. Recent studies include its extension to a discriminative graph-embedded NMF approach [19], phonetic sparse representation for spectrum conversion [22], and its application to timbre and prosody conversion [143], [144].
2) Phonetic Sparse Representation: As the frame-level mapping is done at the acoustic feature level, the coupled dictionary $[A; B]$ is called the acoustic dictionary. With the scripts of the training data and a general purpose speech recognizer, we are able to obtain phonetic labels and their boundaries. Studies have shown that the strategy of dictionary construction plays an important role in voice conversion [145]. The idea of selecting a sub-dictionary according to the run-time speech content shows improved performance [21].
Phonetic sparse representation [22] is an extension to sparse representation for voice conversion. It is built on the idea of phonetic sub-dictionaries, and dictionary selection at run-time. The study shows that multiple phonetic sub-dictionaries consistently outperform a single dictionary in exemplar-based sparse representation voice conversion [21], [22]. However, the phonetic sparse representation relies on a speech recognizer at run-time to help select the sub-dictionary.
3) Group Sparse Representation: Sisman et al. [62] proposed group sparse representation to formulate both exemplar-based sparse representation [141] and phonetic sparse representation [22] under a unified mathematical framework. With the group sparsity regularization, only the phonetic sub-dictionary that is relevant to the input features is likely to be activated at run-time inference. Unlike phonetic sparse representation, which relies on a speech recognizer for both training and run-time inference, group sparse representation only requires the speech recognizer during training when we build the phonetic dictionary. It was reported that group sparse representation provides similar performance to that of phonetic sparse representation when performing both spectrum and prosody conversion [62].

IV. STATISTICAL MODELING FOR VOICE CONVERSION WITH NON-PARALLEL TRAINING DATA
It is more straightforward to train a mapping function from parallel than from non-parallel training data. However, parallel training data are not always available. In real-world applications, there are situations where only non-parallel data are available. Intuitively, if we can derive the equivalents of speech frames or segments between speakers from non-parallel data, we are able to establish or to refine the mapping function using the conventional linear transformation parameter training, such as GMM, DKPLS or frequency warping.
There were a number of attempts to do so. For example, one idea is to find a source-target mapping between unsupervised feature clusters [146]. Another is to use a speech recognizer to index the target training data so that we can retrieve similar frames from the target database for an unknown source frame at run-time [147]. Unfortunately, each of the steps may produce errors that accumulate and may lead to poor parameter estimation [146]. There was also a study that uses a hidden Markov model (HMM) trained for the target speaker; the parameters of the GMM-based linear transformation function are then estimated in such a way that the converted source vectors exhibit maximum likelihood with respect to the target HMM [148]. This method shows comparable performance to methods that use parallel data. However, it requires that the orthography of the training utterances be known, which limits its use.
Next we will discuss three clusters of studies and their representative work: 1) INCA algorithm, 2) unit selection algorithm, and 3) speaker modeling algorithm.

A. INCA Algorithm
INCA refers to an Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment method [27]. It learns a mapping function by finding the nearest neighbor of each source vector in the target acoustic space. It is based on the hypothesis that an iterative refinement of the basic nearest neighbor method, in tandem with the voice conversion system, would lead to a progressive alignment improvement. The main idea is that the intermediate voice, $x_s^k$, obtained after the previous nearest neighbor alignment can be used as the source voice during the next iteration.
During training, the optimization process is repeated until the current intermediate voice, $x_s^k$, is close enough to the target voice, $y_t$. INCA represents a successful framework for the non-parallel training data problem, where the nearest neighbor search step (INCA alignment) and the conversion step (a parametric mapping function) iterate to optimize the mapping function, as illustrated in Figure 4.
INCA was first implemented with the GMM approach [109] for voice conversion to estimate a linear mapping function. As INCA does not require any phonetic or linguistic information, it not only works for non-parallel training data, but also works for cross-lingual voice conversion. Experiments show that the INCA implementation of a cross-lingual system achieves similar performance to its intra-lingual counterpart that is trained on parallel data [27].
INCA was further implemented with the DKPLS approach [15] that was discussed in Section III.B for parallel training data. The idea [30] is to use the INCA alignment algorithm [27] to find the corresponding frames from the source and target datasets, which allows the DKPLS regression to find a non-linear mapping between the aligned datasets. It was reported [30] that the INCA-DKPLS implementation produces high-quality voice that is comparable to an implementation with parallel training data on the same amount of training data.
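The alternation between the nearest neighbor search step and the conversion step can be sketched as follows. This is a simplified illustration, not the algorithm of [27] in full: we assume a one-directional nearest neighbor search, substitute a least-squares affine transform for the GMM or DKPLS mapping function, and stop when the alignment no longer changes; all function names are hypothetical.

```python
import numpy as np

def nearest_neighbors(X, Y):
    """Index of the closest target frame in Y for every frame in X."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def fit_affine(X, Y):
    """Least-squares affine map from X to Y (stand-in for the conversion step)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return W

def apply_affine(X, W):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ W

def inca(X_src, Y_tgt, n_iter=20):
    """Iterate alignment and conversion on non-parallel frames."""
    X_k = X_src.copy()  # intermediate voice, initialised with the source
    prev_idx, W = None, None
    for _ in range(n_iter):
        idx = nearest_neighbors(X_k, Y_tgt)        # alignment step
        if prev_idx is not None and np.array_equal(idx, prev_idx):
            break                                   # alignment has converged
        W = fit_affine(X_src, Y_tgt[idx])           # conversion step
        X_k = apply_affine(X_src, W)                # new intermediate voice
        prev_idx = idx
    return W, X_k
```

Note that the mapping is always re-estimated from the original source frames; only the alignment is computed from the intermediate voice, which is what allows it to improve progressively.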

Fig. 5: Run-time inference of the unit selection algorithm, which doesn't model a mapping function with parameters, but rather searches for the output feature sequence directly from the target speaker database using dynamic programming, and optimizes the output at the utterance level.

B. Unit Selection Algorithm
The unit selection algorithm has been widely used to generate natural-sounding speech in speech synthesis. It is known to produce high speaker similarity and voice quality [75], [149], [150] because the synthesized waveform is formed of sound units taken directly from the target speaker [151]. The unit selection algorithm optimizes the unit selection from a voice inventory of a target speaker. It was suggested [152] to make use of a unit selection synthesis system to generate parallel versions of the training sentences from non-parallel data. With the resulting pseudo-parallel data, the statistical modeling techniques for parallel training data, which we discuss in Section III, can be readily applied. While this approach produces satisfactory voice quality [152], it requires a large speech database to develop the voice inventory, which is not always practical in reality.
Another idea is to follow what we do in unit selection speech synthesis by defining a speech feature vector as a unit [24]. Given an utterance of $M$ speech feature vectors $X = \{x_1, x_2, ..., x_M\}$ from the source speaker, dynamic programming is applied to find the sequence of feature vectors $y_i$ from the target speaker that minimizes a cost function,
$$\hat{Y} = \arg\min_{y_1, ..., y_M} \left\{ \sum_{i=1}^{M} d_1(x_i, y_i) + \sum_{i=2}^{M} d_2(y_{i-1}, y_i) \right\},$$
where $d_1(\cdot)$ represents the acoustic distance between a source and a target feature vector, while $d_2(\cdot)$ is the concatenative cost between two target feature vectors. With the acoustic distance, we make sure that the retrieved speech features from the target speaker are close to those of the source; with the concatenative cost, we encourage the consecutive speech frames from the target speaker database to be retrieved together in a multi-frame segment. As illustrated in Figure 5, the unit selection algorithm is a non-parametric solution because we don't model the conversion with parameters. It optimizes the output by applying dynamic programming to find the best feature vector sequence from the target speaker database. The mapping function $Y = F(X)$ is defined by the cost function Eq. 11 itself, and optimized at the utterance level.
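The dynamic programming search can be sketched as a Viterbi-style recursion over the frames of the target speaker database. This is an illustrative sketch under our own assumptions: both $d_1$ and $d_2$ are taken to be Euclidean distances, and `concat_weight` is a hypothetical parameter that balances the two costs.

```python
import numpy as np

def unit_selection(X, Y_db, concat_weight=1.0):
    """Select one target frame per source frame by dynamic programming.

    X: (M, D) source feature vectors, Y_db: (N, D) target speaker database.
    Returns the selected database indices and the selected frames.
    """
    M = X.shape[0]
    # d1: acoustic distance between every source frame and every target frame.
    target_cost = np.linalg.norm(X[:, None, :] - Y_db[None, :, :], axis=2)
    # d2: concatenative cost between consecutive selected target frames,
    # indexed as [previous frame, current frame].
    concat_cost = concat_weight * np.linalg.norm(
        Y_db[:, None, :] - Y_db[None, :, :], axis=2)

    cost = target_cost[0].copy()
    backptr = np.zeros((M, Y_db.shape[0]), dtype=int)
    for i in range(1, M):
        total = cost[:, None] + concat_cost
        backptr[i] = total.argmin(axis=0)
        cost = total.min(axis=0) + target_cost[i]

    # Trace back the lowest-cost path at the utterance level.
    path = np.empty(M, dtype=int)
    path[-1] = cost.argmin()
    for i in range(M - 1, 0, -1):
        path[i - 1] = backptr[i, path[i]]
    return path, Y_db[path]
```

The search costs on the order of $M N^2$ operations for a database of $N$ frames, so practical systems would typically prune the candidates per frame; the full search is kept here for clarity.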

C. Speaker Modeling Algorithm
The techniques for text-independent speaker characterization are readily available for non-parallel training data, where a speaker can be modeled by a set of parameters, such as a GMM or an i-vector. It is possible to make use of such speaker models to perform voice conversion.
Mouchtaris et al. [153] used a GMM-based technique to model the relationship between reference speakers in advance and apply the relationship to a new speaker. Toda et al. [154] proposed an eigenvoice approach that performs two mappings, one to map from the source speaker to an eigenvoice (or average voice) trained from reference speakers, and another from the eigenvoice to the target speaker. Although these approaches don't require parallel training data from the source and target speakers, they do require parallel data from some reference speakers.
In speaker verification, the joint factor analysis method [155] decomposes a supervector into speaker independent, speaker dependent and channel dependent components, each of which is represented by a low-dimensional set of factors. This aims to disentangle the speaker from other speech content for effective speaker verification. Inspired by this idea, we argue [156] that a similar decomposition would be useful in voice conversion, where we would like to separate speaker information from the linguistic content, and apply factor analysis on the speaker specific component.
With factor analysis, the speaker specific component can be represented by a low-dimensional set of latent variables via the factor loadings. One of the ideas [156] is to estimate the phonetic component and factor loadings from non-parallel prior data. In this way, during the training process, we only estimate a low-dimensional set of speaker identity factors and a tied covariance matrix instead of a full conversion function from the source-target parallel utterances. Even though parallel utterances are still required for estimating the conversion function, the use of prior data allows us to obtain a reliable model from far fewer training samples than those required by conventional JD-GMM [157].
Another idea is to perform the voice conversion in the i-vector [155] speaker space, where the i-vector is used to disentangle a speaker from the linguistic content. The primary motivation is that an i-vector can be extracted in an unsupervised manner regardless of speaker or speech content, which opens up new possibilities especially for non-parallel data scenarios where source and target speech is of different content or even in different languages [28], [45], [158]. Kinnunen et al. [159] studied a way to shift the acoustic features of input speech towards target speech in the i-vector space. The idea is to learn a function that maps the i-vector of the source utterance to that of the target. With the mapping function, we are able to convert the source speech frame-by-frame to the target. This technique requires neither parallel data nor text transcription.

V. DEEP LEARNING FOR VOICE CONVERSION
Voice conversion is typically a research problem with scarce training data. Deep learning techniques are typically data driven and rely on big data. However, this is actually the strength of deep learning in voice conversion. Deep learning opens up many possibilities to benefit from abundantly available training data, so that the voice conversion task can focus more on learning the mapping of speaker characteristics. For example, it shouldn't be the job of the voice conversion task to infer low level detail during speech reconstruction; a neural vocoder can learn from a large database to do so [98]. Nor should it be a task of voice conversion to learn how to represent an entire phonetic system of a spoken language; a general purpose acoustic model of a neural ASR [160] or TTS [161] system can learn from a large database to do so. By leveraging the large database, we free up the conversion network from using its capacity to represent low level detail and general information, so that it can instead focus on the high level semantics necessary for speaker identity conversion.
Deep learning techniques also transform the way we implement the analysis-mapping-reconstruction pipeline. For effective mapping, we need to derive an adequate intermediate representation of speech, as discussed in Section II. The concept of embedding in deep learning provides a new way of deriving the intermediate representation, for example, a latent code for linguistic content, and a speaker embedding for speaker identity. It also makes the disentanglement of speaker from content much easier.
In this section, we will summarize how deep learning helps address existing research problems, such as parallel and non-parallel data voice conversion. We will also review how deep learning breaks new ground in voice conversion research.

A. Deep Learning for Frame-Aligned Parallel Data
The study on deep learning approaches for voice conversion started with parallel training data, where we use a neural network as an improved regression function to approximate the mapping function y = F(x) under the frame-level mapping paradigm in Figure 2.

1) DNN Mapping Function:
The early studies on DNN-based voice conversion methods are focused on spectral transformation. The DNN mapping function, $y = F(x)$, has some clear advantages over other statistical models, such as GMM and DKPLS. For instance, it allows for non-linear mapping between source and target features, and there is little restriction on the dimension of features to be modeled. We note that conversion of other acoustic features, such as fundamental frequency and energy contour, can also be done similarly [162].
Desai et al. [81] proposed a DNN to map a low-dimensional spectral representation, such as mel-cepstral coefficients (MCEP), from source to target speaker. Nakashika et al. [163] proposed to use Deep Belief Nets (DBNs) to extract latent features from source and target cepstrum coefficients, and use a neural network with one hidden layer to perform conversion between latent features. Mohammadi et al. [164] furthered the idea by studying a deep autoencoder trained on multiple speakers to derive a compact representation of speech spectral features. A high-dimensional representation of the spectrum has also been used in a more recent work [165] for spectral mapping, together with dynamic features and a parameter generation algorithm [166]. Chen et al. [167] proposed to model the distributions of spectral envelopes of source and target speakers respectively through a layer-wise generative training.
Generally speaking, DNN for spectrum and/or prosody transformation requires a large amount of parallel training data from paired speakers, which is not always feasible. But it opens up opportunities for us to make use of speech data from multiple speakers beyond source and target, to better model the source and the target speakers, and to discover better feature representations for feature mapping.
2) LSTM Mapping Function: To model the temporal correlation across speech frames in voice conversion, Nakashika et al. [168] explored the use of Recurrent Temporal Restricted Boltzmann Machines (RTRBM), a type of recurrent neural network. The success of Long Short-Term Memory (LSTM) [169], [170] in sequence-to-sequence modeling inspired the study of LSTM in voice conversion, which leads to an improvement in the naturalness and continuity of the speech output.
The LSTM network architecture consists of a set of memory blocks and peephole connections, which support the storage of and access to long-range contextual information [171] in linear memory cells. It learns the optimal amount of contextual information for voice conversion. A bidirectional LSTM (BLSTM) network is expected to capture sequential information and maintain long-range contextual features from both the forward sequence and the backward sequence [45].
Sun et al. [40] and Ming et al. [172] proposed a deep bidirectional LSTM network (DBLSTM) by stacking multiple hidden layers of the BLSTM network architecture, which is shown to outperform DNN voice conversion even without using dynamic features. While the DBLSTM-based voice conversion approach generates high-quality synthesized voice, it typically requires a large speech corpus from source and target speakers for training, which limits the scope of its applications in practice [40].
Just like the GMM approach, DNN and LSTM techniques rely on an external frame aligner during training data preparation, as illustrated in Figure 2. At run-time, the conversion process follows the typical flow of the 3-step pipeline, and doesn't change the speech duration during the conversion.

B. Encoder-decoder with Attention for Parallel Data
The research problems of voice conversion are centered around alignment and mapping, which are interrelated both during training and at run-time inference, as illustrated in Figure 2. During training, more accurate alignment helps build a better mapping function, which explains why we prefer parallel training data. At run-time inference, the frame-level mapping paradigm doesn't change the duration of the speech during the conversion. While it is possible to model and predict the duration for voice conversion output, it is not straightforward to incorporate duration model and
The attention mechanism [173], [174] in encoder-decoder structure neural networks brings about a paradigm change. The idea of attention was first successfully used in machine translation [173], speech recognition [175], and sequence-to-sequence speech synthesis [86], [176]-[178], which led to many parallel studies in voice conversion [179]-[181]. With the attention mechanism, the neural network learns the feature mapping and alignment at the same time during training. At run-time inference, the network automatically decides the output duration according to what it has learnt. In other words, the frame aligner in Figure 2 is no longer required.
There are several variations based on recurrent neural networks, such as SCENT [179] and AttS2S-VC [181]. They follow the widely-used architecture of encoder-decoder with attention [180], [182]. Suppose that we have a source speech $x = \{x_1, x_2, ..., x_{T_s}\}$. The encoder network first transforms the input feature sequences into hidden representations, $h = \{h_1, h_2, ..., h_{T_h}\}$, at a lower frame rate with $T_h < T_s$, which are suitable for the decoder to deal with. At each decoder time step, the attention module aggregates the encoder outputs by attention probabilities and produces a context vector. Then, the decoder predicts output acoustic features frame by frame using context vectors. Furthermore, a post-filtering network is designed to enhance the accuracy of the converted acoustic features to generate the converted speech $y = \{y_1, y_2, ..., y_{T_y}\}$. During training, the attention mechanism learns the mapping dynamics between the source sequence and the target sequence. At run-time inference, the decoder and the attention mechanism interact to perform the mapping and alignment at the same time. The overall architecture is illustrated in Figure 6.
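The computation of a context vector at one decoder time step can be sketched as follows. This is a minimal illustration that assumes dot-product scoring between the decoder state and the encoder outputs; the systems cited above may use other scoring functions, and the function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_outputs):
    """Aggregate encoder outputs into one context vector.

    decoder_state: (D,) current decoder hidden state.
    encoder_outputs: (T_h, D) hidden representations h_1, ..., h_{T_h}.
    Returns the context vector (D,) and the attention probabilities (T_h,).
    """
    scores = encoder_outputs @ decoder_state   # one score per encoder frame
    probs = softmax(scores)                    # attention probabilities
    context = probs @ encoder_outputs          # soft combination of frames
    return context, probs
```

Because the probabilities are recomputed at every decoder step, the number of decoder steps is not tied to the number of encoder frames, which is how the output duration can differ from the input duration.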
While recurrent neural networks represent an effective implementation for sequence-to-sequence conversion, recent studies have shown that convolutional neural networks with gating mechanisms also learn long-term dependencies well [53], [183]. Such a network employs an attention mechanism that effectively makes parallel computation possible for encoding and decoding. During decoding, the causal convolution design allows the model to generate an output sequence in an autoregressive manner. Kameoka et al. proposed a convolutional neural network implementation for voice conversion [184], called ConvS2S-VC. Recent studies show that ConvS2S-VC outperforms its recurrent neural network counterparts in both pairwise and many-to-many voice conversion [181].
Fig. 7: Training a CycleGAN with cycle-consistency loss of L1 norm for voice conversion with non-parallel training data of paired speakers. L1 norm represents the least absolute errors.
The encoder-decoder structure with attention marks a departure from the frame-level mapping paradigm. The attention doesn't perform the mapping frame-by-frame, but rather allows the decoder to attend to multiple speech frames and uses the soft combination to predict an output frame in the decoding process. With the attention mechanism, the duration of the converted speech $T_y$ is typically different from that of the source speech $T_s$ to reflect the differences of speaking style between source and target. This represents a way to handle both spectral and prosody conversion at the same time. Studies have attributed the improvement of voice quality to the effective attention mechanism. The attention mechanism also represents the first step towards relaxing the rigid requirement of parallel data in voice conversion.

C. Beyond Parallel Data of Paired Speakers
In Sections III and IV, we studied statistical modeling for voice conversion with parallel training data and non-parallel training data. The advent of deep learning has broken new ground for voice conversion research. We now go beyond the paradigm of parallel and non-parallel training data. By non-parallel training data, we refer to the case where non-parallel utterances from the source and target speakers are required. However, recent studies show that deep learning has enabled many voice conversion scenarios without the need for parallel data. In this section, we summarize the studies into four scenarios: 1) Non-parallel data of paired speakers, 2) Leveraging TTS systems, 3) Leveraging ASR systems, and 4) Disentangling speaker from linguistic content.

1) Non-parallel data of paired speakers:
Voice conversion with non-parallel training data is a task similar to image-to-image translation, which is to find a mapping from a source domain to a target domain without the need for parallel training data. Let's draw a parallel between image-to-image translation and voice conversion. In image translation, we would like to translate a horse to a zebra, where we preserve the structure of the horse and change the coat of the horse to that of a zebra [185]-[190]; in voice conversion, we would like to transform one voice to that of another, while preserving the linguistic and prosodic content.
CycleGAN is based on the concept of adversarial learning [191], which is to train a generative model to find a solution in a min-max game between two neural networks, called the generator (G) and the discriminator (D). It is known to achieve remarkable results [185] on several tasks where paired training data does not exist, such as image manipulation and synthesis [185], [188], [192]-[195], speech enhancement [196], speech recognition [197], and speech synthesis [198], [199].
As the speech data are non-parallel, alignment is not easily achieved. Kaneko and Kameoka first studied a CycleGAN [47], [48], [200], [201] that incorporates three loss functions: adversarial loss, cycle-consistency loss, and identity-mapping loss, to learn forward and inverse mappings between source and target speakers.
The adversarial loss measures how distinguishable the data distribution of converted features is from that of the source features $x$ or target features $y$. For the forward mapping, it is defined as follows:
$$\mathcal{L}_{ADV}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P(x)}[\log(1 - D_Y(G_{X \to Y}(x)))].$$
The closer the distribution of converted data is to that of target data, the smaller the loss becomes.
The adversarial loss only tells us whether $G_{X \to Y}$ follows the distribution of target data and does not ensure that the contextual information, which represents the general sentence structure we would like to carry over from source to target, is preserved. To ensure that we maintain consistent contextual information between $x$ and $G_{X \to Y}(x)$, the cycle-consistency loss, which is presented in Figure 7, is introduced,
$$\mathcal{L}_{CYC}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\| G_{Y \to X}(G_{X \to Y}(x)) - x \|_1] + \mathbb{E}_{y \sim P(y)}[\| G_{X \to Y}(G_{Y \to X}(y)) - y \|_1],$$
where $\| \cdot \|_1$ refers to an L1 norm function, or least absolute errors, that is known to produce sharper spectral features. This loss encourages $G_{X \to Y}$ and $G_{Y \to X}$ to find an optimal pseudo pair of $(x, y)$ through circular conversion.
To encourage the generator to find the mapping that preserves the underlying linguistic content between the input and output [202], an identity mapping loss is introduced as follows,
$$\mathcal{L}_{ID}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P(y)}[\| G_{X \to Y}(y) - y \|_1] + \mathbb{E}_{x \sim P(x)}[\| G_{Y \to X}(x) - x \|_1].$$
Combining the three loss functions, we have the total loss as,
$$\mathcal{L}_{FULL} = \mathcal{L}_{ADV}(G_{X \to Y}, D_Y) + \mathcal{L}_{ADV}(G_{Y \to X}, D_X) + \lambda_{CYC} \mathcal{L}_{CYC}(G_{X \to Y}, G_{Y \to X}) + \lambda_{ID} \mathcal{L}_{ID}(G_{X \to Y}, G_{Y \to X}),$$
where $\lambda_{CYC}$ and $\lambda_{ID}$ are trade-off parameters.
The optimal mapping functions, $G^*_{X \to Y}$ and $G^*_{Y \to X}$, are obtained by solving the min-max game defined as:
$$G^*_{X \to Y}, G^*_{Y \to X} = \arg\min_{G_{X \to Y}, G_{Y \to X}} \max_{D_X, D_Y} \mathcal{L}_{FULL}.$$
CycleGAN represents a successful deep learning implementation to find an optimal pseudo pair from non-parallel data of paired speakers. It doesn't require any frame alignment mechanism such as dynamic time warping or attention. Experimental results show that, with non-parallel training data, CycleGAN achieves comparable performance to that of a GMM-based system that is trained on twice the amount of parallel data [47]. Moreover, with the adversarial training, it effectively overcomes the over-smoothing problem, which is known to be one of the main factors leading to speech-quality degradation. We note that more recently, CycleGAN-VC2, an improved version of CycleGAN-VC, has been studied [201], which further improves CycleGAN by incorporating three new techniques: an improved objective (two-step adversarial losses), an improved generator (2-1-2D CNN), and an improved discriminator (PatchGAN). CycleGAN has been successfully applied in mono-lingual [48], [203] and cross-lingual voice conversion [204], emotional voice conversion [205], [206] and rhythm-flexible voice conversion [207].
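How the three loss terms combine can be sketched numerically as follows. This is only an illustration of the loss arithmetic on batches of feature vectors, not a training procedure: the generators and discriminators are passed in as plain Python callables, the expectations are replaced by batch means, and the default values of `lam_cyc` and `lam_id` are illustrative assumptions.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error, used as the batch estimate of the L1 terms."""
    return np.mean(np.abs(a - b))

def adversarial_loss(real, fake, D, eps=1e-12):
    """Batch estimate of E[log D(real)] + E[log(1 - D(fake))]."""
    return np.mean(np.log(D(real) + eps)) + np.mean(np.log(1.0 - D(fake) + eps))

def cycle_consistency_loss(x, y, G_xy, G_yx):
    """Penalise failure to recover the input after a round trip."""
    return l1(G_yx(G_xy(x)), x) + l1(G_xy(G_yx(y)), y)

def identity_loss(x, y, G_xy, G_yx):
    """Penalise changes to inputs that already belong to the output domain."""
    return l1(G_xy(y), y) + l1(G_yx(x), x)

def total_loss(x, y, G_xy, G_yx, D_x, D_y, lam_cyc=10.0, lam_id=5.0):
    """Total objective: generators minimise it, discriminators maximise it."""
    adv = adversarial_loss(y, G_xy(x), D_y) + adversarial_loss(x, G_yx(y), D_x)
    return (adv
            + lam_cyc * cycle_consistency_loss(x, y, G_xy, G_yx)
            + lam_id * identity_loss(x, y, G_xy, G_yx))
```

As a toy check, if the forward generator adds a constant offset and the inverse generator subtracts it, the cycle-consistency term is zero while the identity term is not, which shows that the two terms constrain different aspects of the mapping.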
Unlike the encoder-decoder structure, CycleGAN follows a generative modeling architecture that doesn't explicitly model some internal representations to support flexible manipulation, such as voice identity, duration of speech, and emotion. Therefore, it is more suitable for voice conversion between a specific source and target pair. Nonetheless, it represents an important milestone towards non-parallel data voice conversion.
2) Leveraging TTS systems: We have discussed the deep learning architectures for voice conversion that do not involve text. One of the important aspects of voice conversion is to carry forward the linguistic content from source to target. Voice conversion and TTS systems are similar in the sense that they both aim to generate high-quality speech with the appropriate linguistic content. A TTS system provides a mechanism for the speech to adhere to the linguistic content. The ideas to leverage the TTS mechanism can be motivated in different ways. Firstly, a TTS system is trained on a large speech database that offers a high-quality speech reconstruction mechanism given the linguistic content; secondly, a TTS system is equipped with a high-quality attention mechanism that is needed by voice conversion.
Fig. 8: The upper panel is a TTS flow, and the lower panel is a voice conversion flow. Both follow a similar encoder-decoder with attention architecture. The voice conversion leverages the TTS system that is linguistically informed.
As illustrated in Figure 8, encoder-decoder models with attention have recently shown considerable success in modeling a variety of complex sequence-to-sequence problems. Tacotron [87], [176], [208] represents one of the successful text-to-speech (TTS) implementations that has been extended to voice conversion [3], [179].
Zhang et al. proposed a joint training system architecture for both text-to-speech and voice conversion [3] by extending the model architecture of Tacotron, which features a multi-source sequence-to-sequence model with a dual input and a dual attention mechanism. By taking only text as input, the system performs speech synthesis. The system can also take either voice alone, or both text and voice, as input for voice conversion. The multi-source encoder-decoder model is trained with a decoder that is linguistically informed via the TTS joint training, as illustrated by the shared decoder in Figure 8. Experiments show that the joint training has improved the voice conversion task with or without text input at run-time inference.
Park et al. proposed a voice conversion system, known as Cotatron, that is built on top of a multi-speaker Tacotron TTS architecture [161]. At run-time inference, the pretrained TTS system is used to derive speaker-independent linguistic features of the source speech. This process is guided by the transcription of the input speech; as such, a text transcription of the source speech is required at run-time inference. The system uses the TTS encoder to extract speaker-independent linguistic features, or to disentangle the speaker identity. The decoder then takes the attention-aligned speaker-independent linguistic features as the input, and the target speaker identity as the condition, to generate a target speaker's voice. In this way, voice conversion leverages the attention mechanism, or shared attention, from TTS, as shown in Figure 8. Cotatron is designed to perform one-to-many voice conversion. A study [209] that shares a similar motivation with [161], but is based on the Transformer instead of Tacotron, suggests transferring knowledge from a learned TTS model to benefit from large-scale, easily accessible TTS corpora.
Fig. 9: Training phase of the average modeling approach that maps PPG features to MCEP features for voice conversion [44]. The figure shows a multi-speaker average model and a target speaker mapping function.
Zhang et al. [210] proposed to improve the sequence-to-sequence model [179] by using text supervision during training. A multi-task learning structure is designed, which adds auxiliary classifiers to the middle layers of the sequence-to-sequence model to predict linguistic labels as a secondary task. The linguistic labels can be obtained either manually or automatically with alignment tools. With the linguistic label objective, the encoder and decoder are expected to generate meaningful intermediate representations that are linguistically informed. The text transcripts are only required during training. Experiments show that learning with linguistic labels effectively improves the alignment quality of the model, thus alleviating issues such as mispronunciation.
The neural representation of deep learning has facilitated the interaction between TTS and voice conversion. By leveraging TTS systems, we hope to improve the training and run-time inference of voice conversion by adhering to linguistic content. However, such techniques usually require a large training corpus. Recent studies introduced a framework for creating limited-data VC systems [209], [211], [212] by bootstrapping from a speaker-adaptive TTS model. How voice conversion can benefit from TTS systems without involving large training data deserves future study.
3) Leveraging ASR systems: Deep learning approaches for voice conversion typically require a large parallel corpus for training. This is partly because we would like to learn the latent representations that describe the phonetic systems. The requirement of training data has limited the scope of potential applications. We know that most ASR systems are already trained with a large corpus. They already describe the phonetic systems well in different ways. The question is how to leverage the latent representations in ASR systems for voice conversion.
One of the ideas is to use the context posterior probability sequence produced by the ASR model with sequence-to-sequence learning to generate a target speech feature sequence [160]. In this model, the system has an encoder-decoder structure similar to Figure 6, except that it uses a speech recognizer as the encoder and a speech synthesizer as the decoder. Another study guides a sequence-to-sequence voice conversion model with an ASR system, which augments the inputs with bottleneck features [179]. Recently, an end-to-end speech-to-speech sequence transducer, Parrotron [213], was studied. Parrotron learns to convert the speech spectrogram of any speaker, with multiple accents and imperfections, to the voice of a single predefined target speaker. Parrotron accomplishes this by using an auxiliary ASR decoder to predict the transcript of the output speech, conditioned on the encoder latent representation. The multi-task training of Parrotron optimizes the decoder to generate the target voice and, at the same time, constrains the latent representation to retain linguistic information only.
The ASR decoder aims to disentangle the speaker's identity from the speech. The above techniques adopt the encoder-decoder with attention architecture. Another way to look at voice conversion is that speech consists of two components: a speaker-dependent component and a speaker-independent component. If we are able to decompose speech signals into the two components, we can carry over the latter and convert only the former to achieve voice conversion. The average modeling technique represents one of the successful implementations [41], where we build a mapping function to convert phonetic posteriorgram (PPG) [32] features to acoustic features. The PPG features are derived from an ASR system and can be considered speaker-independent. We train the mapping function from multi-speaker, non-parallel speech data. In this way, one does not need to train a full conversion model for each target speaker. The average model can be adapted towards the target with a small amount of target speech. The training and adaptation of the average model are illustrated in Figure 9.
There were several follow-up studies along this direction. For example, Tian et al. propose a PPG-to-waveform conversion [94], and an average model with speaker identity [155] as a condition [44]. Zhou et al. propose to use PPG as the linguistic features for cross-lingual voice conversion [158]. Liu et al. propose to use PPG for emotional voice conversion [214]. Zhang et al. also show that the average model framework can benefit from a small amount of parallel training data using an error reduction network [215].
4) Disentangling speaker from linguistic content: In the context of voice conversion, speech can be considered as a composition of speaker voice identity and linguistic content. If we are able to disentangle the speaker from the linguistic content, we can change the speaker identity independently of the linguistic content. The auto-encoder [216] represents one of the common techniques for speech disentanglement and reconstruction. There are other techniques, such as instance normalization [217] and vector quantization [218], [219], that are effective in disentangling the speaker from the content.
Fig. 10: A typical auto-encoding network for voice conversion, where the encoders and decoder learn to disentangle speaker from linguistic content. At run-time, the linguistic content of the source speech represented by latent code and speaker embedding of a target speaker are combined to generate target speech.
An auto-encoder learns to reproduce its input as its output. Therefore, parallel training data is not required. An encoder learns to represent the input with a latent code, and a decoder learns to reconstruct the original input from the latent code. The latent code can be seen as an information bottleneck which, on one hand, lets pass the information necessary for perfect reconstruction, e.g. speaker-independent linguistic content, and on the other hand, forces some information to be discarded, e.g. speaker, noise and channel information [83]. The variational auto-encoder (VAE) [220] is the stochastic version of the auto-encoder, in which the encoder produces distributions over latent representations, rather than deterministic latent codes, while the decoder is trained on samples from these distributions. The variational auto-encoder is more suitable than the deterministic auto-encoder for synthesizing new samples.
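The difference between a deterministic latent code and a distribution over latent codes can be sketched as follows. The snippet assumes a diagonal Gaussian posterior parameterised by a mean and a log-variance per dimension, which is a common choice but an assumption here; the function names are hypothetical. It draws a latent sample with the reparameterization trick and computes the KL divergence to a standard normal prior that typically regularises the encoder.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps element-wise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


rng = random.Random(0)                       # fixed seed for repeatability
z = sample_latent([1.0, -2.0], [0.0, 0.0], rng)  # one stochastic latent code
```

During training the decoder would reconstruct the input from `z` rather than from the mean, so that nearby latent codes decode to plausible outputs.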
Chorowski et al. [98] provide a comparison of three auto-encoding neural networks by studying how they learn a representation from speech data to separate speaker identity from the linguistic content. It was shown that the discrete representation, that is, the latent code obtained from VQ-VAE, preserves the most linguistic content while also being the most speaker-invariant. Recently, a group latent embedding technique for VQ-VAE was studied to improve the encoding process; it divides the embedding dictionary into groups and uses the weighted average of atoms in the nearest group as the latent embedding [221].
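The discrete latent code of a VQ-VAE is obtained by replacing each encoder output with its nearest entry (atom) in a learned embedding dictionary. A minimal sketch of that lookup is given below, assuming squared Euclidean distance and a tiny hand-written codebook; it illustrates plain vector quantization only and does not implement the group latent embedding of [221].

```python
def quantize(frame, codebook):
    """Return (index, atom) of the codebook entry nearest to `frame`
    in squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(frame, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-dimensional codebook with three atoms.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index_a, atom_a = quantize([0.9, 1.2], codebook)
index_b, atom_b = quantize([3.0, 3.5], codebook)
```

Because many slightly different encoder outputs collapse onto the same atom, fine speaker-specific detail tends to be discarded while coarse phonetic information survives.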
The concept of a VAE-based voice conversion framework [43] is illustrated in Figure 10. The decoder reconstructs the utterance by conditioning on the latent code extracted by the encoder, and separately on a speaker code, which could be a one-hot vector [43], [222] for a closed set of speakers, or an i-vector [155], bottleneck speaker representation [223], or d-vector [224] for an open set of speakers. By explicitly conditioning the decoder on speaker identity, the encoder is forced to capture speaker-independent information in the latent code from a multi-speaker database.
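As a rough illustration of the conditioning described above, the sketch below builds a decoder input from a latent code and a one-hot speaker code. Concatenation is only one common way to condition a decoder and is an assumption here, as are the helper names and the toy dimensions.

```python
def one_hot(speaker_index, num_speakers):
    """One-hot speaker code for a closed set of speakers."""
    return [1.0 if i == speaker_index else 0.0 for i in range(num_speakers)]


def decoder_input(latent_code, speaker_code):
    """Condition the decoder by concatenating content and speaker codes."""
    return list(latent_code) + list(speaker_code)


latent = [0.3, -0.1]  # hypothetical latent code of one source frame
reconstruction_input = decoder_input(latent, one_hot(0, 3))  # source speaker
conversion_input = decoder_input(latent, one_hot(1, 3))      # target speaker
```

At conversion time only the speaker code changes, while the latent code of the source speech is kept.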
Just like other auto-encoders, the VAE decoder tends to generate over-smoothed speech. This can be problematic for voice conversion because the network may generate poor-quality, buzzy-sounding speech. Generative adversarial networks (GANs) [225] were proposed as one of the solutions to the over-smoothing problem. GANs offer a general framework for training a data generator in such a way that it can deceive a real/fake discriminator that attempts to distinguish real data from fake data produced by the generator. By incorporating the GAN concept into VAE, VAE-GAN was studied for voice conversion with non-parallel training data [46] and in cross-lingual voice conversion [204]. It was shown that VAE-GAN [225] produces more natural sounding speech than the standard VAE method [43], [223].
A recent study on sequence-to-sequence non-parallel voice conversion [226] shows that it is possible to explicitly model the transfer of other aspects of speech, such as source rhythm, speaking style, and emotion to the target speech.

VI. EVALUATION OF VOICE CONVERSION
Effective assessment of voice quality is required to validate the algorithms, to measure the technological progress, and to benchmark a system against the state of the art. Typically, we report the results in terms of objective and subjective measurements.
To provide an objective evaluation, a reference speech is required. The common objective evaluation metrics include Mel-cepstral distortion (MCD) [227] for spectrum, and PCC [228] and RMSE [229]-[231] for prosody. We note that such metrics are not always correlated with human perception, partly because they measure the distortion of acoustic features rather than the waveform that humans actually listen to.
Subjective evaluation metrics, such as the mean opinion score (MOS) [2], [232]-[234], preference tests [18], [235] and best-worst scaling [236], could represent the intrinsic naturalness and similarity to the target. We note that, for subjective evaluation to be meaningful, a large number of listeners are required, which is not always possible in practice.

A. Objective Evaluation
1) Spectrum Conversion: To provide an objective evaluation, first of all, we need a reference utterance spoken by the target speaker. Ideally, the converted speech is very close to the reference speech. We can measure the differences between them by comparing their spectral distances. However, there is no guarantee that the converted speech and the reference speech are of the same length. In this case, a frame aligner is required to establish the frame-level mapping.
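Dynamic time warping (DTW) is one widely used frame aligner. The following is a minimal sketch, assuming a user-supplied frame distance `dist` and the basic step pattern (diagonal, vertical, horizontal); practical toolkits usually add slope constraints and operate on spectral feature vectors rather than the scalar toy sequences used here.

```python
def dtw_path(seq_a, seq_b, dist):
    """Return (total_cost, path), where path is a list of aligned
    index pairs (i, j) from the start to the end of both sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # diagonal step
                                 cost[i - 1][j],      # advance seq_a only
                                 cost[i][j - 1])      # advance seq_b only
    # Backtrack from the end along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy scalar "frames": the second sequence repeats one frame.
total_cost, path = dtw_path([1, 2, 3], [1, 2, 2, 3], lambda a, b: abs(a - b))
```

The returned path gives the frame pairs over which frame-level distances between converted and reference speech can then be computed.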
Mel-cepstral distortion (MCD) [227] is commonly used to measure the difference between two spectral features [62], [237]-[239]. It is calculated between the converted and target Mel-cepstral coefficients, or MCEPs [240], [241], denoted by $\hat{y}$ and $y$. Suppose that each MCEP vector consists of 24 coefficients; we have $\hat{y} = \{m^{c}_{k,i}\}$ and $y = \{m^{t}_{k,i}\}$ at frame $k$, where $i$ denotes the $i$-th coefficient in the converted and target MCEPs. The MCD at frame $k$ is then
$$\mathrm{MCD}\,[\mathrm{dB}] = \frac{10}{\ln 10} \sqrt{2 \sum_{i=1}^{24} \left( m^{t}_{k,i} - m^{c}_{k,i} \right)^{2}}.$$
We note that a lower MCD indicates better performance. However, the MCD value is not always correlated with human perception. Therefore, subjective evaluations, such as MOS and similarity score, are also conducted.
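As a sketch of how MCD is typically computed, the snippet below uses the commonly used definition: 10/ln 10 times the square root of twice the summed squared differences between target and converted MCEP coefficients of one frame, averaged over aligned frames. Whether the energy-related 0th coefficient is excluded varies between studies and is left to the caller here; the function names are hypothetical.

```python
import math


def mcd_frame(converted, target):
    """MCD in dB for one pair of aligned MCEP vectors."""
    squared_error = sum((t - c) ** 2 for c, t in zip(converted, target))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_error)


def mcd_utterance(converted_frames, target_frames):
    """Average frame-level MCD over two frame-aligned MCEP sequences."""
    if len(converted_frames) != len(target_frames):
        raise ValueError("sequences must be frame-aligned first")
    per_frame = [mcd_frame(c, t)
                 for c, t in zip(converted_frames, target_frames)]
    return sum(per_frame) / len(per_frame)


# Toy example with 24 coefficients per frame.
reference = [[0.0] * 24, [0.0] * 24]
converted = [[0.0] * 24, [1.0] + [0.0] * 23]  # one coefficient off by 1.0
utterance_mcd = mcd_utterance(converted, reference)
```

In this toy case the first frame contributes 0 dB and the second contributes (10/ln 10)·√2 dB, so the utterance-level value is half of the latter.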
2) Prosody Conversion: Speech prosody of an utterance is characterized by phonetic duration, energy contour, and pitch contour. To effectively measure how close the prosody patterns of converted speech are to those of the reference speech, we need to provide measurements for the three aspects.
The alignment between converted speech and the reference speech provides information about how much the phonetic durations differ from one another. From the alignment, we can derive how many frames deviate, on average, from the ideal diagonal path, for example as the frame disturbance [242], to report the differences in phonetic duration.
Pearson Correlation Coefficient (PCC) [62], [205] and Root Mean Squared Error (RMSE) have been widely used as evaluation metrics to compare prosody contours, such as F0 or energy contours, between two speech utterances: PCC measures their linear dependence, while RMSE measures their average deviation.
We next take the measurement of two F0 contours as an example. The PCC between the aligned pair of converted and target F0 sequences is given by
$$\rho(F0^{c}, F0^{t}) = \frac{\mathrm{cov}(F0^{c}, F0^{t})}{\sigma_{F0^{c}} \, \sigma_{F0^{t}}},$$
where $\mathrm{cov}(\cdot,\cdot)$ denotes the covariance, and $\sigma_{F0^{c}}$ and $\sigma_{F0^{t}}$ are the standard deviations of the converted F0 sequences ($F0^{c}$) and the target F0 sequences ($F0^{t}$), respectively. We note that a higher PCC value represents better F0 conversion performance.
The RMSE between the converted F0 and the corresponding target F0 is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( F0^{c}_{k} - F0^{t}_{k} \right)^{2}},$$
where $F0^{c}_{k}$ and $F0^{t}_{k}$ denote the converted and target F0 features at frame $k$, respectively, and $K$ is the length of the F0 sequence, or the total number of frames. We note that a lower RMSE value represents better F0 conversion performance. The same measurement applies to energy contours as well.
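Both measures are straightforward to compute once the two contours are frame-aligned. The sketch below uses the standard definitions, with PCC as the covariance divided by the product of the standard deviations and RMSE as the square root of the mean squared difference; the F0 values in Hz are made up for illustration.

```python
import math


def pcc(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up aligned F0 contours (Hz) over voiced frames.
converted_f0 = [100.0, 110.0, 120.0]
target_f0 = [103.0, 114.0, 120.0]
f0_pcc = pcc(converted_f0, target_f0)
f0_rmse = rmse(converted_f0, target_f0)
```

Note that a contour can correlate perfectly with the target and still have a large RMSE, for instance when it is shifted by a constant offset, which is why the two measures are reported together.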
Other generally accepted metrics for prosody transfer include F0 Frame Error (FFE) [243] and Gross Pitch Error (GPE) [244]. We note that GPE reports the percentage of voiced frames whose pitch values are more than 20% different from the reference, while FFE reports the percentage of frames that either contain a 20% pitch error or a voicing decision error [245].
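The two error rates can be sketched as below. The snippet assumes frame-aligned F0 tracks in which a value of 0 marks an unvoiced frame, and it computes GPE over frames that are voiced in both tracks; both conventions are assumptions, and implementations differ in such details.

```python
def gpe_and_ffe(converted_f0, reference_f0, tolerance=0.2):
    """Return (GPE, FFE) in percent for two frame-aligned F0 tracks.
    An F0 value of 0 is taken to mean 'unvoiced' (assumed convention)."""
    both_voiced = 0
    gross_errors = 0
    voicing_errors = 0
    for c, r in zip(converted_f0, reference_f0):
        c_voiced, r_voiced = c > 0, r > 0
        if c_voiced != r_voiced:
            voicing_errors += 1          # voicing decision error
        elif r_voiced:
            both_voiced += 1
            if abs(c - r) > tolerance * r:
                gross_errors += 1        # pitch off by more than 20%
    gpe = 100.0 * gross_errors / both_voiced if both_voiced else 0.0
    ffe = 100.0 * (gross_errors + voicing_errors) / len(reference_f0)
    return gpe, ffe


reference = [100.0, 100.0, 0.0, 200.0, 0.0]
converted = [105.0, 130.0, 0.0, 0.0, 150.0]
gpe, ffe = gpe_and_ffe(converted, reference)
```

Here one of the two jointly voiced frames has a gross pitch error, and two further frames disagree on voicing, so three of the five frames count towards FFE.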

B. Subjective Evaluation
Mean Opinion Score (MOS) has been widely used in listening tests [40], [61], [62], [246]-[251]. In MOS experiments, listeners rate the quality of the converted voice using a 5-point scale: "5" for excellent, "4" for good, "3" for fair, "2" for poor, and "1" for bad. There are several evaluation methods that are similar to MOS, for example: 1) DMOS [252]-[254], which is a "degradation" or "differential" MOS test, requiring listeners to rate the sample with respect to a reference, and 2) MUSHRA [255]-[257], which stands for MUltiple Stimuli with Hidden Reference and Anchor, and requires fewer participants than MOS to obtain statistically significant results.
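A MOS is usually reported together with a confidence interval. The sketch below computes the mean rating and a 95% interval half-width using a normal approximation (z = 1.96), which is an assumption that is reasonable only for a sufficiently large number of ratings; the ratings themselves are made up.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of the confidence interval)."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


ratings = [5, 4, 4, 3, 4]  # made-up ratings on the 5-point scale
mos, ci = mos_with_ci(ratings)
```

Two systems whose intervals overlap substantially cannot be reliably separated from such a test alone, which is one reason why a large number of listeners is needed.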
Another popular subjective evaluation is the preference test, also denoted as the AB/ABX test [2], [11], [40], [258]. In AB tests, listeners are presented with two speech samples and asked to indicate which one has more of a certain property, for example naturalness or similarity. In ABX tests, two samples are given as in AB tests, together with an extra reference sample. Listeners need to judge whether A or B is more like X in terms of naturalness, similarity, or even emotional quality [205]. We note that it is not practical to use AB and/or ABX tests for the comparison of many VC systems at the same time. The MUSHRA test mentioned above is a voice quality test from telecommunication [259], in which the reference natural speech and several other converted samples of the same content are presented to the listeners in a random order. The listeners are asked to rate the speech quality of each sample between 0 and 100.
It is known that people are good at picking the extremes, but their preferences for anything in between might be fuzzy and inaccurate when presented with a long list of options. Best-Worst Scaling (BWS) [236] has been proposed for voice conversion quality assessment [22], where listeners are presented with only a few randomly selected options each time. With many such BWS decisions, Best-Worst Scaling can handle a long list of options and generates more discriminating results, such as a voice quality ranking, than MOS and preference tests.
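One simple way to turn many best-worst decisions into a ranking is the counting score, i.e. the number of times a system was chosen as best minus the number of times it was chosen as worst, divided by the number of times it was shown. The sketch below uses this counting analysis as an assumption; the studies cited above may use a different estimator, and the trial data are made up.

```python
from collections import defaultdict


def bws_scores(trials):
    """trials: iterable of (shown_systems, best, worst).
    Returns {system: (best_count - worst_count) / times_shown}."""
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for systems, chosen_best, chosen_worst in trials:
        for system in systems:
            shown[system] += 1
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}


# Made-up trials: each shows three of four systems A-D.
trials = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "D"), "A", "D"),
    (("B", "C", "D"), "B", "C"),
]
scores = bws_scores(trials)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Scores lie between -1 (always worst) and 1 (always best), so systems can be ranked even though no listener ever compared all of them at once.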
We note that subjective measures can represent the intrinsic naturalness and similarity of a voice conversion system. However, such evaluations can be time-consuming and expensive as they involve a large number of listeners.

C. Evaluation with Deep Learning Approaches
The study of perceptual quality evaluation seeks to approximate human judgement with psychoacoustically motivated computational models. It provides insights into how humans perceive speech quality in listening tests, and suggests assessment metrics that are required in speech communication, speech enhancement, speech synthesis, voice conversion and any other speech production or transmission applications. Perceptual Evaluation of Speech Quality (PESQ) [260] is an ITU-T recommendation that is widely used as an industry standard. It provides an objective speech quality evaluation that predicts the human-perceived speech quality.
However, the PESQ formulation requires the presence of reference speech, which considerably restricts its use in voice conversion applications and motivates the study of perceptual evaluations without the need for reference speech. Metrics that do not require reference speech are called non-intrusive evaluation metrics. For example, Fu et al. [261] propose Quality-Net, an end-to-end model to predict PESQ ratings, which serve as a proxy for human ratings. Yoshimura et al. [262] and Patton et al. [263] propose CNN-based naturalness predictors to predict human MOS ratings, among other non-intrusive assessment metrics [264]-[266].
Lo et al. [267] propose MOSNet, another non-intrusive assessment technique based on deep neural networks, which learns to predict human MOS ratings. MOSNet scores are highly correlated with human MOS ratings at the system level, and fairly correlated at the utterance level. While it is a non-intrusive evaluation metric for naturalness, MOSNet can also be modified and re-purposed to predict the similarity scores between target speech and converted speech. It provides similarity scores with fair correlation values to human ratings on the VCC 2018 dataset. MOSNet, which is free and open-source, marks a recent advancement towards automatic perceptual quality evaluation [268].
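The distinction between system-level and utterance-level correlation can be sketched as follows: at the utterance level, predicted and human scores are correlated over all utterances, whereas at the system level the scores are first averaged per system. The records below are made up and the helper names are hypothetical; the snippet only illustrates the two ways of computing the correlation, not MOSNet itself.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def utterance_and_system_correlation(records):
    """records: list of (system_id, human_score, predicted_score)."""
    human = [h for _, h, _ in records]
    predicted = [p for _, _, p in records]
    utterance_level = pearson(human, predicted)

    per_system = defaultdict(lambda: ([], []))
    for system, h, p in records:
        per_system[system][0].append(h)
        per_system[system][1].append(p)
    system_human = [sum(hs) / len(hs) for hs, _ in per_system.values()]
    system_pred = [sum(ps) / len(ps) for _, ps in per_system.values()]
    return utterance_level, pearson(system_human, system_pred)


# Made-up scores for three systems with two utterances each.
records = [("A", 2, 3), ("A", 4, 3), ("B", 3, 4),
           ("B", 5, 4), ("C", 1, 2), ("C", 3, 2)]
utterance_corr, system_corr = utterance_and_system_correlation(records)
```

In this toy data the predictor captures each system's average quality but not the variation between utterances of the same system, so the system-level correlation is higher than the utterance-level one.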

VII. VOICE CONVERSION CHALLENGES
In this section, we give an overview of the series of voice conversion challenges, which provide shared tasks with common data sets and evaluation metrics for fair comparison of algorithms. The voice conversion challenge (VCC) has been a biennial event since 2016. In a challenge, a common database is provided by the organizers. The participants build voice conversion systems using their own technology, and the organizers evaluate the performance of the converted speech. The main evaluation methodology is a listening test in which crowd-sourced evaluators rate the naturalness and speaker similarity.
The 2016 challenge offered a standard voice conversion task using a parallel training database [269]. The 2018 challenge features a more advanced conversion scenario using a non-parallel database [270]. The 2020 challenge puts forward a cross-lingual voice conversion research problem. A summary of VCC 2016, VCC 2018 and VCC 2020 is also provided in Table I.

A. Why is the Challenge Needed?
As described earlier, many of the voice conversion approaches are data-driven; hence speech data are required to train models and to evaluate conversion. To compare such data-driven methods with each other precisely, a common database that specifies training and evaluation data explicitly is needed. However, such a common database did not exist until 2016. Without common databases, researchers have to re-implement others' systems with their own databases before trying any new ideas. In such a situation, it is not guaranteed that the re-implemented system achieves the performance reported in the original work.
To address the same problem, the TTS community gave birth to the first Blizzard challenge in 2005. Since then, the challenge has defined various standard databases for TTS and has made comparisons of TTS much fairer and easier. The motivations of VCC are exactly the same as those of the Blizzard challenges. VCC introduced a few standard databases for voice conversion and also defined the common training and evaluation protocols. All the converted speech submitted by the participants for the challenges has been released publicly. In this way, researchers can compare the performance of their voice conversion system with that of other state-of-the-art systems without the need for re-implementation.
Another need for voice conversion standard databases arose from the biometric speaker recognition community. As voice conversion technology could be misused for attacking speaker verification systems, anti-spoofing countermeasures are required [271]. This is also called presentation attack detection. Anti-spoofing techniques aim at discriminating between fake artificial inputs presented to biometric authentication systems and genuine inputs. If sufficient knowledge and data regarding the spoofed data are available, a binary classifier can be constructed to reject artificial inputs. Therefore, the common VCC databases are also important for anti-spoofing research. With a large amount of converted speech data from advanced voice conversion systems, researchers in the biometric community can develop anti-spoofing models to strengthen the defence of speaker recognition systems, and to evaluate their vulnerabilities.

B. Overview of the 2016 Voice Conversion Challenge
We first overview the 2016 voice conversion challenge [269] and its datasets¹. As the first shared task in voice conversion, a parallel voice conversion task and its evaluation protocol are defined for VCC 2016. The parallel dataset consists of 162 common sentences uttered by both source and target speakers. The source speakers and the target speakers are each four native speakers of American English (two females and two males). In the challenge, the participants develop the conversion systems and produce converted speech for all possible source-target pair combinations. In total, eight speakers (plus two unused speakers) are included in the VCC 2016 database. The number of test sentences for evaluation is 54.
The main evaluation methodology adopted for the ranking is subjective evaluation of the perceived naturalness and the speaker similarity of the converted samples to the target speakers. The naturalness is evaluated using the standard five-point scale mean opinion score (MOS) test, ranging from 1 (completely unnatural) to 5 (completely natural). The speaker similarity is evaluated using the Same/Different paradigm [272]. Subjects are asked to listen to two audio samples and to judge whether they are speech signals produced by the same speaker on a four-point scale: "Same, absolutely sure", "Same, not sure", "Different, not sure" and "Different, absolutely sure." As the perceived speaker similarity to a target speaker and the perceived voice quality are not necessarily correlated, it is important to use a scatter plot to observe the trade-off between the two aspects. In the 2016 challenge, 17 participants submitted their conversion results. Two hundred native listeners of English joined the listening tests. It is reported that the best system, using GMM and waveform filtering, obtained an average of 3.0 in the five-point scale evaluation for the naturalness judgement, and about 70% of its converted speech samples were judged to be the same as the target speakers by listeners. However, it is also confirmed that there was still a huge gap between target natural speech and the converted speech. We observe that achieving both good quality and speaker similarity remained an unsolved challenge at that time. More details of VCC 2016 can be found in [272]. Details of the best performing systems are reported in [273].

C. Overview of the 2018 Voice Conversion Challenge
Next we give an overview of the 2018 voice conversion challenge [270] and its datasets². VCC 2018 offers two tasks: parallel and non-parallel voice conversion. A dataset and its evaluation protocol are defined for each task. The dataset for the parallel conversion task is similar to that of the 2016 challenge, except that it has a smaller number of common utterances uttered by source and target speakers. The source speakers and the target speakers are each four native speakers of American English (two females and two males), but they are different speakers from those used for the 2016 challenge. Like the 2016 challenge, the participants were asked to develop conversion systems and to produce converted data for all possible source-target pair combinations.
VCC 2018 introduced a non-parallel voice conversion task for the first time. The same target speakers' data in the parallel task are used as the target. However, the source speakers are four native speakers of American English (two females and two males) different from those of the parallel conversion task, and their utterances are also all different from those of the target speakers. As in the parallel voice conversion task, the participants needed to produce converted data for all possible source-target pair combinations. In total, twelve speakers are included in the VCC 2018 database. Each of the source and target speakers has a set of 81 sentences as training data, which is half of that for VCC 2016. The number of test sentences for evaluation is 35. In the 2018 challenge, 23 participants submitted their conversion results to the parallel conversion task, with 11 of them additionally participating in the non-parallel conversion task. The same evaluation methodology as the 2016 challenge was adopted for the 2018 challenge, and 260 crowd-sourced native listeners of English joined the listening tests. It was reported that, in both tasks, the best system, using a phone encoder and a neural vocoder, obtained an average of 4.1 in the five-point scale evaluation for the naturalness judgement, and about 80% of its converted speech samples were judged to be the same as the target speakers by listeners. It was also reported that the best system has similar performance in both the parallel and non-parallel tasks, in contrast to results reported in the literature.
² The VCC2018 dataset is available at https://doi.org/10.7488/ds/2337.
In VCC 2018, the spoofing countermeasure was introduced as a supplement to the subjective evaluation of voice quality, which brought together the voice conversion and speaker verification research communities. More details of the 2018 challenge can be found in [270]. Details of the best performing systems are reported in [274], [275].
From this challenge, we observed that new speech waveform generation paradigms such as WaveNet and phone encoding have brought significant progress to the voice conversion field. Further improvements have been achieved in the follow-up papers [276], [277], and new VC systems that exceed the challenge's best performance have already been reported.

D. Overview of the 2020 Voice Conversion Challenge
The 2020 voice conversion challenge³ consists of two tasks: 1) non-parallel training in the same language (English); and 2) non-parallel training over different languages (English-Finnish, English-German, and English-Mandarin).
In the first task, each participant trains voice conversion models for all source and target speaker pairs using up to 70 utterances, including 20 parallel utterances and 50 non-parallel utterances in English, for each speaker as the training data. Overall, 16 voice conversion models (i.e., 4 sources by 4 targets) are to be developed. In the second task, each participant develops voice conversion models for all source and target speaker pairs using up to 70 utterances for each speaker (i.e., in English for the source speakers, and in Finnish, German, or Mandarin for the target speakers) as the training data. Overall, 24 conversion systems (i.e., 4 sources by 6 targets) are to be developed.
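The number of conversion models per task follows directly from enumerating all source-target pairs, as the short sketch below shows. The speaker labels are hypothetical placeholders rather than the identifiers used in the challenge database.

```python
from itertools import product

# Hypothetical speaker labels (placeholders, not the official IDs).
sources = ["S1", "S2", "S3", "S4"]
task1_targets = ["T1", "T2", "T3", "T4"]              # English targets
task2_targets = ["X1", "X2", "X3", "X4", "X5", "X6"]  # Finnish, German, Mandarin targets

task1_pairs = list(product(sources, task1_targets))
task2_pairs = list(product(sources, task2_targets))
num_task1_models = len(task1_pairs)
num_task2_models = len(task2_pairs)
```

The two counts are 4 × 4 = 16 and 4 × 6 = 24, matching the numbers of models stated for the two tasks.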
In the 2020 challenge, the participants are allowed to mix and combine different source speakers' data to train speaker-independent models. Moreover, the participants can also use orthographic transcriptions of the released training data to develop their voice conversion systems. Last but not least, the participants are free to perform manual annotations of the released training data, which can effectively improve the quality of the converted speech.
The 2020 challenge organizers also built several baseline systems, including the top system of the previous challenge, on the new database. The code of the CycleVAE-based baseline⁴ and the cascade ASR + TTS based VC⁵ is released so that participants can build the basic systems easily and focus on their own innovation. The 2020 challenge also features a multifaceted evaluation. In addition to the traditional evaluation metrics, the challenge also reports the speech recognition, speaker recognition, and anti-spoofing evaluation results on the converted speech. The challenge is underway at the time we submit this manuscript.

E. Relevant Challenges -ASVspoof Challenge
The spoofing capability against automatic speaker verification is a topic related to voice conversion, which has also been organized as technology challenges. The ASVspoof series of challenges are such biennial events, which started in 2013. As in the voice conversion challenges, the organizers release a common database including many pairs of spoofed audio (converted, generated or replay audio) and genuine audio to the participants, who build anti-spoofing models using their own technology. The organizers rank the detection accuracy of the anti-spoofing results submitted by the participants.
In 2015, the first anti-spoofing database including various types of spoofed audio using voice conversion and TTS systems was constructed. This database became a reference standard in the automatic speaker verification (ASV) community [278], [279]. The main focus of the 2017 challenge was a replay task, where a large quantity of real-world replay speech data were collected [280]. In 2019, an even larger database including converted, generated, and replay speech data was constructed [281]. The best performing systems in the 2016 and 2018 voice conversion challenges were also used for generating advanced spoofed audio [282]. The challenges revealed that some anti-spoofing systems outperform human listeners in detecting spoofed audio.

VIII. RESOURCES
In addition to the voice conversion challenge databases described above, the CMU-Arctic database [283] and the VCTK database [284] are also popular for voice conversion research. The current version of the CMU-Arctic database⁶ has 18 English speakers, and each of them reads out the same set of around 1,150 utterances, which are carefully selected from out-of-copyright texts from Project Gutenberg. This is suitable for parallel voice conversion since the sentences are common to all the speakers. The current version (ver. 0.92) of the CSTR VCTK corpus⁷ has speech data uttered by 110 English speakers with various dialects. Each speaker reads out about 400 sentences, which are selected from newspapers, the rainbow passage and an elicitation paragraph used for the speech accent archive.
Since the rainbow passage and an elicitation paragraph are common to all the speakers, this database can be used for both parallel and non-parallel voice conversion.
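As a hedged illustration of how the prompts shared across speakers determine what is usable as parallel data, the sketch below selects the utterances that two speakers both read. The corpus layout (a mapping from speaker ID to utterance ID to audio path), the speaker IDs, the utterance IDs, and the file paths are all hypothetical and do not reflect the actual directory structure of CMU-Arctic or VCTK.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: speaker ID -> {utterance ID -> audio path}.
Corpus = Dict[str, Dict[str, str]]


def parallel_pairs(corpus: Corpus, src: str, tgt: str) -> List[Tuple[str, str, str]]:
    """Return (utterance ID, source path, target path) for prompts both speakers read."""
    shared = sorted(set(corpus[src]) & set(corpus[tgt]))
    return [(utt, corpus[src][utt], corpus[tgt][utt]) for utt in shared]


def non_parallel_ids(corpus: Corpus, src: str, tgt: str) -> Tuple[List[str], List[str]]:
    """Return the utterance IDs read by only one of the two speakers."""
    src_only = sorted(set(corpus[src]) - set(corpus[tgt]))
    tgt_only = sorted(set(corpus[tgt]) - set(corpus[src]))
    return src_only, tgt_only


# Toy corpus with made-up speaker IDs, utterance IDs, and paths.
toy_corpus: Corpus = {
    "spk_a": {"rainbow_01": "spk_a/rainbow_01.wav", "news_17": "spk_a/news_17.wav"},
    "spk_b": {"rainbow_01": "spk_b/rainbow_01.wav", "news_42": "spk_b/news_42.wav"},
}

pairs = parallel_pairs(toy_corpus, "spk_a", "spk_b")
src_only, tgt_only = non_parallel_ids(toy_corpus, "spk_a", "spk_b")
```

In this toy setting, only the shared passage forms a parallel pair, while the speaker-specific newspaper sentences remain available for non-parallel training.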
Since neural networks are data hungry and generalization to unseen speakers is key to successful conversion, large-scale but lower-quality databases such as LibriTTS and VoxCeleb are also used for training some of the components required for voice conversion (e.g., the speaker encoder). The LibriTTS corpus [285] has 585 hours of transcribed speech data uttered by a total of 2,456 speakers. The recording conditions and audio quality are less than ideal, but this corpus is suitable for training speaker encoder networks or for generalizing any-to-any speaker mapping networks. The VoxCeleb database [286] is an even larger speech database consisting of about 2,800 hours of untranscribed speech from over 6,000 speakers. It is an appropriate database for training noise-robust speaker encoder networks.
There are many open-source toolkits for training VC models. For instance, sprocket [287] supports GMM-based conversion, and ESPnet [288] supports cascaded ASR and TTS systems. In addition, the community has published many open-source implementations of neural-network-based voice conversion on GitHub.

IX. CONCLUSION
This article provides a comprehensive overview of voice conversion technology, covering the fundamentals and practice up to July 2020. We reveal the underlying technologies and their relationships, from statistical approaches to deep learning, and discuss their promise and limitations. We also study the evaluation techniques for voice conversion. Moreover, we report on the series of voice conversion challenges and on resources that are useful for researchers and engineers starting voice conversion research.

Fig. 1: The typical flow of a voice conversion system. The pink box represents the training of the mapping function, while the blue box applies the mapping function at run-time in a 3-step pipeline process Y = (R ∘ F ∘ A)(X).
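The composition in the caption of Fig. 1 can be read as three stages applied from right to left. The sketch below illustrates only that ordering. It assumes that A, F, and R stand for analysis, mapping, and reconstruction, and the three stage functions are placeholders that record the call order rather than process audio.

```python
from functools import reduce
from typing import Callable


def compose(*stages: Callable) -> Callable:
    """compose(R, F, A)(x) evaluates R(F(A(x))): the rightmost stage runs first."""
    return reduce(lambda outer, inner: lambda x: outer(inner(x)), stages)


# Placeholder stages (hypothetical): each one only tags its input with its name.
def analysis(x: str) -> str:
    return f"A({x})"


def mapping(features: str) -> str:
    return f"F({features})"


def reconstruction(features: str) -> str:
    return f"R({features})"


convert = compose(reconstruction, mapping, analysis)
result = convert("X")  # the string "R(F(A(X)))"
```

In an actual system, each placeholder would be replaced by a signal-processing or learned component with matching input and output types.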

Fig. 2: Training and run-time inference of voice conversion with parallel training data under the frame-level mapping paradigm. The pink boxes represent the training algorithms of the models that result in the mapping function F(x) in the blue box for run-time inference. Dotted box (1) includes examples of statistical approaches, and dotted box (2) includes examples of deep learning approaches.

Fig. 4: The training of a frame-level mapping function is an iterative process between the nearest neighbor search step (INCA alignment) and the conversion step (a parametric mapping function).
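The alternation described in the caption of Fig. 4 can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the published INCA algorithm: the parametric mapping is a per-dimension affine function fitted by least squares, the nearest neighbor search runs in one direction only (converted source to target), and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]
Affine = List[Tuple[float, float]]  # one (scale, bias) pair per feature dimension


def sq_dist(a: Frame, b: Frame) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_neighbors(converted: List[Frame], target: List[Frame]) -> List[int]:
    """Alignment step: index of the closest target frame for each converted frame."""
    return [min(range(len(target)), key=lambda j: sq_dist(c, target[j]))
            for c in converted]


def fit_affine(src: List[Frame], tgt: List[Frame]) -> Affine:
    """Conversion step (assumed form): least-squares fit of y = a*x + b per dimension."""
    n = len(src)
    params = []
    for d in range(len(src[0])):
        xs = [f[d] for f in src]
        ys = [f[d] for f in tgt]
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        a = cov / var if var > 0 else 0.0  # constant input: predict the target mean
        params.append((a, my - a * mx))
    return params


def apply_affine(params: Affine, frames: List[Frame]) -> List[List[float]]:
    return [[a * x + b for (a, b), x in zip(params, f)] for f in frames]


def iterative_align(source: List[Frame], target: List[Frame], max_iters: int = 20):
    """Alternate alignment and mapping re-estimation until the alignment is stable."""
    params: Affine = [(1.0, 0.0)] * len(source[0])  # start from the identity mapping
    history: List[float] = []  # total squared distance after each alignment step
    prev_idx = None
    idx: List[int] = []
    for _ in range(max_iters):
        converted = apply_affine(params, source)
        idx = nearest_neighbors(converted, target)
        history.append(sum(sq_dist(c, target[j]) for c, j in zip(converted, idx)))
        if idx == prev_idx:  # alignment unchanged, so the mapping would not change
            break
        params = fit_affine(source, [target[j] for j in idx])
        prev_idx = idx
    return params, idx, history
```

Because each step minimizes the same total squared distance, over the alignment and then over the affine parameters, the recorded distortion cannot increase between iterations in this simplified setting. Like other alternating procedures, it may still stop at an alignment that differs from the true frame correspondence.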