Introduction
Text-to-speech (TTS) technology synthesizes speech waveforms from input texts through several processes, including text analysis, linguistic feature extraction, acoustic feature prediction, and waveform generation [1]. With recent advances in deep learning, TTS models have significantly improved the quality of synthesized speech compared with traditional statistical parametric models. At present, they can generate natural and human-level quality speech after being trained for several hours on single-speaker or multi-speaker recordings [2], [3], [4], [5]. This advancement has made TTS technology attractive across diverse speech-related applications.
Recently, there has been growing interest in using TTS with personalized custom voices for applications such as personalized voice assistants and broadcasting [6]. In these applications, generating personalized voices for new speakers not included in the training data presents a challenge. This challenge arises from the quality gap between the synthesized speech of a trained speaker and that of a new speaker, which is often caused by factors such as a lack of training data for the new speaker or speaker characteristics that do not match the data used during training [7]. To close this quality gap, speaker adaptation techniques have been applied so that models adapt better to new speakers not included in the training data.
Two main approaches have been studied for generating speech for new speakers: zero-shot speaker adaptation [8] and fine-tuning a pretrained TTS model to personalize the natural voices of new speakers [9], [10]. Zero-shot learning utilizes a single pretrained model to imitate unseen speech patterns and features, and significant advances have been achieved by applying it to TTS [11], [12], [13], [14]. However, the zero-shot approach generates relatively inconsistent personalized voices with distorted naturalness for a given new speaker. Additionally, when speakers have strong accents or nonstandard pronunciations, the similarity of the synthesized speech decreases further [15]. In contrast, the fine-tuning approach generally adapts a pretrained TTS model by optimizing all of its parameters using a limited amount of new-speaker data.
Although adapting the TTS model for a target speaker can improve the synthesized speech quality, several problems arise. First, fine-tuning all the parameters of the TTS model incurs significant computational cost and time consumption [6]. Second, the adapted TTS model for each target speaker needs to be stored individually, which requires considerable storage space [16], [17]. Therefore, reducing the number of adaptation parameters is necessary for fine-tuning.
To mitigate the abovementioned problems, parameter-efficient fine-tuning (PEFT) approaches have been proposed. For instance, AdaSpeech leverages acoustic condition modeling and conditional layer normalization (CLN) at the mel-decoder stage to achieve parameter efficiency while fine-tuning TTS models [6]. Meanwhile, Meta-StyleSpeech [18] employs metalearning techniques for style modeling, enabling fast adaptation to a new speaker’s style with minimal data. Furthermore, adapter-based methods have been introduced as PEFT techniques [19], [20], [21]; they achieve efficiency by selectively fine-tuning only a subset of parameters rather than the entire model, thereby reducing the computational load and storage requirements. However, these approaches have typically focused on fine-tuning the acoustic models of two-stage TTS models [4], [22], [23]. Because acoustic feature prediction and waveform synthesis are processed independently in two-stage TTS models, the fine-tuned intermediate features are not jointly optimized, which limits TTS performance [24].
In recent years, end-to-end (E2E) TTS models have been widely studied to provide higher-quality expression compared with two-stage TTS models. One representative E2E TTS model is the variational inference with adversarial learning for end-to-end text-to-speech (VITS) model [24], which mainly comprises a variational autoencoder (VAE) augmented with normalizing flow (NF) [25], [26] and is trained through adversarial training [27]. Another notable E2E model is YourTTS [11], which builds upon the VITS framework and incorporates a speaker encoder for zero-shot multi-speaker adaptation and multilingual training. Additionally, NaturalSpeech [12] achieves high-quality single-speaker TTS by modifying the VITS model structure, introducing a bidirectional NF alongside differentiable duration modeling and phoneme pretraining, which significantly enhances the expressiveness and naturalness of the synthesized speech. However, an issue arises when PEFT is applied to these VITS-based models: the modules of a VITS-based model are connected through probability distributions, so applying PEFT to a specific module changes the probability distribution of that module's output, and whether this updated distribution remains suitable as the input of the subsequent module is uncertain. Without more sophisticated fine-tuning, high-quality synthesized speech cannot be guaranteed.
To address this issue, a recent study [15] proposed a zero-shot learning and PEFT method for VITS-based models, which improved the zero-shot adaptation performance by altering the VAE model structure to prevent overfitting and introducing a specific discriminator for speaker information, thereby enhancing the overall model performance. In addition, the speaker encoder was based on the ECAPA-TDNN architecture [28], which was modified to extract speaker embeddings and pretrained to effectively capture speaker characteristics. In this model [15], the baseline TTS model was trained using speaker embeddings extracted from the pretrained speaker encoder to aid the model’s flow and duration predictor during training. PEFT was applied through adapters to the prior encoder, specifically targeting the flow-based decoder and text encoder. This approach demonstrated impressive performance in speaker adaptation. However, this method relied on a pretrained speaker encoder, did not consider multi-speaker adaptation, and only applied the adapter to the prior encoder.
Thus, this paper presents a PEFT approach for VITS models and demonstrates the effectiveness of applying PEFT to multiple specific modules within the E2E architecture, providing a new method for improving TTS performance for multi-speaker adaptation. To realize PEFT for the VITS model, we propose three specific strategies. First, we incorporate low-rank adaptation (LoRA) [29] for fine-tuning the VITS model. LoRA reduces the complexity of neural network parameters by decomposing them into lower-dimensional representations [30]; consequently, it adapts only a subset of model parameters using low-rank matrices rather than the entire parameter set. In this study, LoRA is applied to several modules: the attention network of the text encoder, the WaveNet [2] structure in both the flow network and the posterior encoder, the HiFi-GAN generator [31], and two linear projection layers. Second, LoRA-based fine-tuning is expanded with CLN [6] for multi-speaker fine-tuning, because LoRA alone does not capture diverse speaker-specific variations, resulting in suboptimal performance in multi-speaker adaptation. CLN uses a small conditioning layer to obtain scale and bias vectors for normalization instead of standard layer normalization, and it is applied to the text encoder and the stochastic duration predictor (SDP) of the VITS model by replacing layer normalization (LN). Lastly, to achieve intelligibility and naturalness comparable to those of full fine-tuning, the expressiveness of the prior distribution should be increased [12]. Therefore, this work additionally applies a modified version of the residual adapter [22], [32], [33], which can be flexibly inserted into the output of any module. In our model, the residual adapter is inserted into the text encoder output of the VITS model to enhance the representation of the prior distribution.
We conducted experimental evaluations on the widely adopted multi-speaker VCTK [34] and Libri-TTS-100 [35] datasets to measure the voice quality of the proposed fine-tuning method against several objective and subjective metrics. These datasets were chosen to test the robustness of our process to different data characteristics. The VCTK dataset was characterized by many audio samples per speaker and a generally calm and consistent tone of voice. In contrast, the Libri-TTS-100 dataset comprised significantly more speakers despite similar sample numbers, with variations in the tone of each speaker. Because VCTK and Libri-TTS-100 were composed of controlled and stable speech, we repeated experiments using the Common Voice dataset [36] to evaluate the performance of the proposed PEFT method under various accent conditions, which was essential for building personalized custom voices. Moreover, we conducted additional experiments using a Korean multi-speaker dataset to further investigate the model’s adaptability to different languages. Using these datasets, we verified the performance of our multi-speaker fine-tuning method with four speakers. The speech performances of different models, where fine-tuned TTS models were evaluated according to different combinations of fine-tuning modules (e.g., LoRA, CLN, and residual adapter), were compared in terms of the number of tuning parameters and speech quality measures. To measure speech quality, we used five objective metrics: speaker embedding cosine similarity (SECS) [37], word error rate (WER), character error rate (CER) [38], nonintrusive objective speech quality assessment for TTS (NISQA-TTS) [39], and mean opinion score (MOS) prediction by a fine-tuned wav2vec2.0 model (WV-MOS) [40]. In addition, to measure reliable TTS perception quality in terms of human-level quality, we used a comparative mean opinion score (CMOS) as a subjective metric [41].
The main contributions of this study are as follows:
To implement PEFT in the VITS model, we applied LoRA to the prior encoder and other specific modules within the E2E model, achieving speech quality comparable to that of a fully fine-tuned model with a 90% reduction in model parameters.
To handle speaker-specific variation with improved multi-speaker PEFT performance, CLN replaced the LN in the text encoder and the SDP, allowing the model to train an additional speaker with only 0.02M parameters.
To improve the expressiveness of the prior distribution, the residual adapter was integrated into the text encoder output. With only 0.15M parameters, this integration improved the WER, CER, and NISQA-TTS scores.
The remainder of this paper is organized as follows. Section II provides background knowledge to help readers understand our work. Section III describes the VITS model architecture used as the baseline TTS model. Section IV proposes the PEFT method using LoRA, CLN, and the residual adapter for multi-speaker adaptation. Section V evaluates the performance of the VITS models with the proposed PEFT, including several ablation studies and visualization experiments. Finally, Section VI concludes the paper.
Background
This section provides background knowledge to help readers understand our work. First, we give a general review of TTS models. Next, we explain the flow-based generative models used in TTS systems.
A. Overview of Text-to-Speech Models
Recently, neural TTS systems have made significant advances in terms of performance. Two-stage TTS structures are commonly used to generate speech. These systems use acoustic models to predict predetermined acoustic features, such as mel-spectrograms, and then synthesize waveforms using a vocoder [31], [42]. When predicting these acoustic features, acoustic models can be categorized into two groups: autoregressive (AR) and non-autoregressive (NAR) TTS systems. Typical sequence-to-sequence AR-TTS systems include WaveNet [2] and Tacotron 1 and 2 [3], [22], and Transformer TTS [23] was the first model to use a transformer network in TTS. These AR-TTS systems sequentially generate the frames of a mel-spectrogram by relying on the previous frame to effectively capture long-term dependencies. However, such systems suffer from slow inference and robustness errors, such as word skipping and repetition. Thus, NAR-TTS systems have been developed to address these problems. For instance, FastSpeech [43] overcomes problems such as repetition in AR-TTS and parallelizes the process with a duration predictor to improve the speed and robustness of speech synthesis. FastSpeech2 [4] refines this setup using a variance adaptor for pitch and energy, although it still depends on an external text-speech alignment tool. Meanwhile, Glow-TTS [5] advances the field by learning alignment directly during training using monotonic alignment search (MAS).
Despite the progress in NAR-TTS systems, the abovementioned cascaded acoustic-model/vocoder pipeline still has problems. In two-stage models, the latter model is trained on samples generated by earlier models or leverages pretrained models without modification, and fine-tuning for high-quality speech synthesis is problematic because the two models must be trained separately. Furthermore, training-inference mismatches occur for both the mel-spectrogram and the duration, as the models are trained with ground-truth values but rely on predicted values during inference. Because of these problems, E2E models utilizing efficient training methods have been widely studied [44], [45]. Among these models, VITS [24] has succeeded in producing more natural speech than two-stage models by integrating the TTS model and a neural vocoder within an E2E framework using a VAE to enhance the synthetic speech quality. Moreover, VITS addresses the one-to-many problem of TTS by employing an SDP, enabling the generation of varied rhythms. Consequently, E2E models based on the VITS architecture have been widely adopted [11], [12], [46].
B. Flow-Based Generative Model
Flow-based models are increasingly being used in various generative models because of their ability to compute the exact likelihood of data by applying invertible transformations [47]. To estimate the exact density, the latent variable of a generative model should follow a simple distribution, such as a Gaussian. This leads to NF [25], which transforms a simple distribution into a complex one by applying a sequence of invertible transformations. The transformation is applied iteratively through the change-of-variables formula:
\begin{align*} \log p_{\theta}\left(c\right) &= \log p_{\theta}\left(z\right) + \sum_{i=1}^{k} \log\left|\det\left(J\left(f_{i}^{-1}\left(c\right)\right)\right)\right|, \tag{1}\\ z &= f_{k}^{-1}\circ f_{k-1}^{-1}\circ \cdots \circ f_{1}^{-1}\left(c\right). \tag{2}\end{align*}
When implementing NF, two conditions must be satisfied: the Jacobian matrix of the transformation should be easy to calculate, and the transformation should be easily invertible. These requirements have been effectively addressed using the affine coupling layer proposed previously [48], which simplifies the Jacobian computation and ensures invertibility. The affine coupling layer operates by partitioning the input into two parts and transforming one part conditionally on the other, facilitating a simple calculation of the Jacobian determinant. Additionally, the limitation that the affine coupling layer leaves part of the dimensions unchanged has been overcome with the introduction of an invertible 1×1 convolution in subsequent work.
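To make the coupling mechanism concrete, the following is a minimal PyTorch sketch of an affine coupling layer (an illustrative example following the standard formulation in [48], not an implementation from the works cited here); the hidden size is an arbitrary placeholder. The split-and-condition design is what makes the Jacobian triangular and the inverse available in closed form.

```python
import torch
import torch.nn as nn

class AffineCouplingLayer(nn.Module):
    """Minimal affine coupling layer: one half of the input is transformed
    conditioned on the other half, giving a triangular Jacobian and an exact inverse."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # Small network predicting log-scale and shift from the untouched half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x_a, x_b = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x_a).chunk(2, dim=-1)
        y_b = x_b * torch.exp(log_s) + t      # affine transform of one half
        log_det = log_s.sum(dim=-1)           # log|det J| is just the sum of log-scales
        return torch.cat([x_a, y_b], dim=-1), log_det

    def inverse(self, y):
        y_a, y_b = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y_a).chunk(2, dim=-1)
        x_b = (y_b - t) * torch.exp(-log_s)   # exact inverse without matrix inversion
        return torch.cat([y_a, x_b], dim=-1)
```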
The WaveGlow model [50] further extended these structures by incorporating the WaveNet architecture, significantly enhancing the ability to model complex audio signals. This model structure is utilized in the baseline model, VITS, which incorporates a VAE with an NF framework. This integration improves the expressiveness of the prior distribution and significantly improves the quality of speech synthesis by leveraging the ability of flows to construct complex probability distributions from a simple distribution. More detailed explanations are provided in Section III.
Baseline TTS Model
In this section, we explain the VITS model [24], which is employed as the baseline model in this work, with a focus on the network architecture and training process. VITS is a parallel E2E model that utilizes a VAE to learn latent variables serving as intermediate representations between the acoustic model and the waveform generator in a fully integrated training process. This integration facilitates the smooth flow of information from the acoustic model to the waveform generator, resulting in consistent personalized voice quality.
Fig. 1 depicts the training procedure of the baseline VITS model, which comprises three primary components: a prior encoder, a posterior encoder, and a HiFi-GAN generator [31]. The prior encoder comprises a transformer-based text encoder, a flow-based decoder, MAS [5], and an SDP. The text encoder uses multiple feed-forward transformer blocks [51] to transform the input phonemes $c_{\mathrm{text}}$ into a hidden representation, and MAS aligns this representation with the latent variables to obtain the phoneme durations $d$. The SDP is a flow-based duration model trained by maximizing a variational lower bound of the duration log-likelihood, where $q_{\phi}(u,v \mid d, C_{\mathrm{text}})$ is an approximate posterior over the auxiliary variables $u$ and $v$:
\begin{align*} \log p_{\theta}\left(d \mid C_{\mathrm{text}}\right) \ge \mathbb{E}_{q_{\phi}\left(u,v \mid d, C_{\mathrm{text}}\right)}\left[\log \frac{p_{\theta}\left(d-u, v \mid C_{\mathrm{text}}\right)}{q_{\phi}\left(u,v \mid d, C_{\mathrm{text}}\right)}\right]. \tag{3}\end{align*}
The flow-based decoder is constructed by arranging a stack of WaveNet [2] residual blocks within a stack of affine coupling layers [47]. Through the change of variables, the probability of the latent variables conditioned on the text, $p_{\theta}(z \mid c)$, is expressed as
\begin{equation*} p_{\theta}\left(z \mid c\right) = N\left(f_{\theta}\left(z\right); \mu_{\theta}\left(c\right), \sigma_{\theta}\left(c\right)\right)\left|\det \frac{\partial f_{\theta}\left(z\right)}{\partial z}\right|. \tag{4}\end{equation*}
The posterior encoder and the HiFi-GAN generator, as shown in Fig. 1, correspond to the encoder and decoder of the VAE, respectively. The former extracts the latent representation $z$ from the waveform $x$, whereas the latter generates the reconstructed waveform $\hat{x}$ from $z$:
\begin{align*} z &= Enc\left(x\right) \sim q\left(z \mid x\right), \tag{5}\\ \hat{x} &= Dec\left(z\right) \sim p\left(x \mid z\right). \tag{6}\end{align*}
The training loss for a conditional VAE is derived from the evidence lower bound of the marginal log-likelihood $\log p_{\theta}(x \mid c)$:
\begin{equation*} \log p_{\theta}\left(x \mid c\right) \ge \mathbb{E}_{q_{\phi}\left(z \mid x\right)}\left[\log p_{\theta}\left(x \mid z\right) - \log \frac{q_{\phi}\left(z \mid x\right)}{p_{\theta}\left(z \mid c\right)}\right]. \tag{7}\end{equation*}
The reconstruction term is implemented as the $L_{1}$ distance between the target and predicted mel-spectrograms:
\begin{equation*} L_{\mathrm{recon}} = \left\| x_{\mathrm{mel}} - \hat{x}_{\mathrm{mel}} \right\|_{1}. \tag{8}\end{equation*}
In addition, the KL loss in the latent space is defined using the prior distribution $p_{\theta}(z \mid c)$ and the posterior distribution $q_{\phi}(z \mid x_{\mathrm{lin}})$, where $x_{\mathrm{lin}}$ denotes the linear-scale spectrogram of the target speech:
\begin{equation*} L_{\mathrm{KL}} = \log q_{\phi}\left(z \mid x_{\mathrm{lin}}\right) - \log p_{\theta}\left(z \mid c\right). \tag{9}\end{equation*}
Finally, the HiFi-GAN generator $G$ synthesizes the predicted speech waveform from the latent variable $z$ and is trained adversarially against a discriminator $D$. The adversarial losses for the generator and the discriminator are defined as
\begin{equation*} L_{\mathrm{adv}}\left(G\right) = \mathbb{E}_{\left(z\right)}\left[\left(D\left(G\left(z\right)\right) - 1\right)^{2}\right], \tag{10}\end{equation*}
\begin{equation*} L_{\mathrm{adv}}\left(D\right) = \mathbb{E}_{\left(x,z\right)}\left[\left(D\left(x\right) - 1\right)^{2} + \left(D\left(G\left(z\right)\right)\right)^{2}\right]. \tag{11}\end{equation*}
Combining these terms with the duration loss $L_{\mathrm{dur}}$ from the SDP and the feature-matching loss $L_{\mathrm{fm}}(G)$, the total training loss is
\begin{equation*} L_{\mathrm{total}} = L_{\mathrm{recon}} + L_{\mathrm{kl}} + L_{\mathrm{dur}} + L_{\mathrm{adv}}\left(G\right) + L_{\mathrm{fm}}\left(G\right). \tag{12}\end{equation*}
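As an illustration only (not the authors' training code), the loss terms in (8)–(12) can be assembled as in the following sketch, where every tensor argument is a placeholder computed elsewhere in the training loop (e.g., log_q and log_p stand for the log-densities of the posterior and prior evaluated at the sampled latent z):

```python
import torch.nn.functional as F

def generator_losses(mel_hat, mel, log_q, log_p, d_fake, dur_loss, fm_loss):
    """Generator-side objective of Eq. (12); all tensors are placeholders."""
    l_recon = F.l1_loss(mel_hat, mel)        # Eq. (8): L1 mel reconstruction
    l_kl = (log_q - log_p).mean()            # Eq. (9): KL between posterior and prior at sampled z
    l_adv_g = ((d_fake - 1) ** 2).mean()     # Eq. (10): least-squares adversarial loss
    return l_recon + l_kl + dur_loss + l_adv_g + fm_loss   # Eq. (12)

def discriminator_loss(d_real, d_fake_detached):
    """Eq. (11); d_fake_detached is D applied to a detached generator output."""
    return ((d_real - 1) ** 2).mean() + (d_fake_detached ** 2).mean()
```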
Proposed Method
This section proposes three approaches to fine-tuning the baseline VITS model. First, we incorporate LoRA for fine-tuning the VITS model to reduce the complexity of the neural network parameters by decomposing them into lower-dimensional representations. Second, LoRA-based fine-tuning is expanded with CLN for multi-speaker fine-tuning. Third, we apply the residual adapter to the text encoder output of the VITS model to enhance the representation of the prior distribution. Fig. 2 illustrates how the parameter-efficient modules—LoRA, CLN, and residual adapter—are integrated into the VITS architecture, with specific colors used for each module. Compared with Fig. 1, Fig. 2 also shows how the latent variable z is affected by the inserted parameter-efficient modules.
A. Reduction of Model Parameters Based on LoRA
Instead of optimizing all model parameters, the LoRA-based fine-tuning method optimizes the parameters of a low-rank decomposition [29]. Assuming that the pretrained model parameter is a weight matrix $W$, LoRA freezes $W$ and learns only a low-rank update $\Delta W = W_{up}W_{dw}$, whose rank is much smaller than the dimensions of $W$, so that the number of trainable parameters is greatly reduced.
Fig. 3(a) illustrates an example of the application of LoRA to a weight matrix as it is applied in our fine-tuning process. For a pretrained weight matrix $W$, the adapted output $h$ for an input $x$ is computed as
\begin{equation*} h = Wx + \Delta Wx = Wx + W_{up}W_{dw}x. \tag{13}\end{equation*}
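For illustration, a minimal PyTorch sketch of the LoRA update in (13) is given below; it is not the implementation used in this work, and the rank and initialization are placeholder choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer W plus a trainable low-rank update W_up W_dw."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the pretrained weights frozen
            p.requires_grad = False
        self.W_dw = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.W_up = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start

    def forward(self, x):
        # Eq. (13): h = Wx + W_up W_dw x
        return self.base(x) + x @ self.W_dw.T @ self.W_up.T
```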
Network architectures of (a) LoRA applied to a weight matrix, (b) LoRA applied to the attention matrices in the transformer-based text encoder, (c) LoRA applied to a WaveNet residual block, and (d) LoRA applied to the MRF in the HiFi-GAN generator.
In this study, the LoRA module is integrated into six different modules of the baseline VITS model, as illustrated in Fig. 2: two LoRAs for the two linear projection layers, and four LoRAs for the attention matrices in the transformer-based text encoder, each of the two WaveNets, and an upsampling layer of the generator. The linear projection layers are an important part of the VITS architecture because they project the outputs of the posterior and prior encoders onto their respective distributions. Each layer uses a simple linear mapping to produce the statistics (mean and variance) of the corresponding distribution, and LoRA is applied to this mapping.
In addition to the linear projection layer, Fig. 3(b) shows the network architecture of LoRA applied to the attention matrices in the transformer-based text encoder. As shown in Fig. 3(b), the self-attention module of each transformer block includes four weight matrices for the query, key, value, and output projections, to which LoRA is applied.
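Continuing the LoRALinear sketch above, the four attention projections of a transformer block could be wrapped as follows; the attribute names (q_proj, k_proj, v_proj, o_proj) are hypothetical and depend on the actual implementation.

```python
def add_lora_to_attention(attn, rank=8):
    # attn is assumed to expose its projections as nn.Linear attributes.
    for name in ("q_proj", "k_proj", "v_proj", "o_proj"):
        base = getattr(attn, name)                  # existing nn.Linear
        setattr(attn, name, LoRALinear(base, rank=rank))
    return attn
```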
As mentioned in Section III, the VITS architecture contains the WaveNet structure in two modules. The posterior encoder comprises 16 noncausal WaveNet residual blocks, whereas the flow-based decoder consists of four affine coupling layer stacks [48], each containing four WaveNet residual blocks. WaveNet fundamentally works through stacked residual blocks, each containing a dilated convolution layer, two activation functions (tanh and sigmoid), and a 1×1 convolution. As shown in Fig. 3(c), LoRA is applied to the conditioning projection $V_{lora}$ that injects the speaker embedding $g^{\prime}$ into the gated activation unit:
\begin{equation*} z = \tanh\left(W_{f} \ast x + V_{lora}g^{\prime}\right) \odot \sigma\left(W_{g} \ast x + V_{lora}g^{\prime}\right), \tag{14}\end{equation*}
where $W_{f}$ and $W_{g}$ denote the filter and gate weights of the dilated convolution $\ast$, $\sigma(\cdot)$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.
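A minimal sketch of this gated activation with a low-rank conditioning path is shown below; the module layout and shapes are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedActivationWithLoRA(nn.Module):
    """Gated activation of Eq. (14), with the speaker-conditioning projection
    realized as a trainable low-rank (LoRA-style) pair while the convolution stays frozen."""
    def __init__(self, channels, spk_dim, dilation=1, rank=8):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                              dilation=dilation, padding=dilation)  # W_f and W_g stacked
        self.v_dw = nn.Linear(spk_dim, rank, bias=False)            # low-rank conditioning
        self.v_up = nn.Linear(rank, 2 * channels, bias=False)

    def forward(self, x, g):
        # x: (B, C, T) hidden sequence, g: (B, spk_dim) speaker embedding g'
        cond = self.v_up(self.v_dw(g)).unsqueeze(-1)                # (B, 2C, 1), broadcast over time
        filt, gate = (self.conv(x) + cond).chunk(2, dim=1)
        return torch.tanh(filt) * torch.sigmoid(gate)               # Eq. (14)
```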
Finally, we apply LoRA to the generator, whose MRF model structure is described in Fig. 3(d). The MRF facilitates the formation of several different receptive field patterns to enrich speech with details and textures. Fine-tuning the MRF is therefore crucial for generating personalized voices, and LoRA is applied to the ConvTranspose layer connected to the MRF.
B. Conditional Layer Normalization for Multi-Speaker TTS
To handle speaker-specific variations with improved multi-speaker PEFT performance, we incorporate CLN by replacing LN [53] in the text encoder and SDP. Fig. 4(a) shows the conditional network comprising two linear layers, $W_{\gamma}$ and $W_{\beta}$, which take the speaker embedding $g^{\prime}$ as input and produce the scale and bias vectors used for normalization:
\begin{equation*} \gamma_{s} = g^{\prime} W_{\gamma}, \quad \beta_{b} = g^{\prime} W_{\beta}. \tag{15}\end{equation*}
Network architecture of the (a) conditional normalization layer and (b) residual adapter used for the proposed fine-tuning approach.
Without CLN, all model parameters for each new speaker must be stored. By adjusting only the normalization parameters for each speaker, however, the model can achieve high-quality adaptation during multi-speaker optimization while significantly reducing the number of parameters. Specifically, only 0.02M parameters per speaker need to be stored when CLN is applied during fine-tuning, which corresponds to approximately 0.05% of the model parameters required for full fine-tuning.
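The following PyTorch sketch illustrates one way CLN as in (15) could be realized, with the two speaker-conditioning layers as the only trainable parameters; dimensions and names are assumptions rather than the paper's code.

```python
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Layer normalization whose scale and bias are predicted from the speaker embedding g'."""
    def __init__(self, hidden_dim, spk_dim):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_dim, elementwise_affine=False)  # plain normalization
        self.W_gamma = nn.Linear(spk_dim, hidden_dim)  # Eq. (15): gamma_s = g' W_gamma
        self.W_beta = nn.Linear(spk_dim, hidden_dim)   # Eq. (15): beta_b  = g' W_beta

    def forward(self, h, g):
        # h: (B, T, hidden_dim), g: (B, spk_dim)
        gamma = self.W_gamma(g).unsqueeze(1)           # broadcast over the time axis
        beta = self.W_beta(g).unsqueeze(1)
        return gamma * self.ln(h) + beta
```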
C. Residual Adapter for Expressive TTS
Although the application of LoRA and CLN provided enhanced performance, limitations in naturalness and pronunciation compared with full fine-tuning persisted. To address this issue, the expressiveness of the prior distribution of the new-speaker data must be enhanced during fine-tuning [12]. Accordingly, we attempted to increase the rank of the LoRA matrix applied to the text encoder; however, this did not yield a performance improvement. Therefore, a residual adapter [22], [32] was integrated into the text encoder output.
Fig. 4(b) shows the network architecture of the residual adapter, a modified version of the vanilla adapter [33], used for the proposed fine-tuning approach. As shown in Fig. 4(b), the residual adapter operates by first projecting the text encoder output $h_{lora}$ into a lower-dimensional space through the down-projection feed-forward layer $FF_{down}$, applying a ReLU nonlinearity, and then projecting the result back to the original dimension through the up-projection layer $FF_{up}$, followed by layer normalization.
The adapter incorporates a residual connection to ensure stable training and minimize the disruption to the original model architecture. This connection enables the original input $h_{lora}$ to be preserved and added to the adapter output:
\begin{equation*} h_{adp} = h_{lora} + LN\left(ReLU\left(FF_{down}\left(h_{lora}\right)\right)FF_{up}\right). \tag{16}\end{equation*}
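A minimal sketch of such a residual adapter, following (16), is given below; the bottleneck width is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck adapter added to the text encoder output, as in Eq. (16)."""
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.ff_down = nn.Linear(hidden_dim, bottleneck_dim)
        self.ff_up = nn.Linear(bottleneck_dim, hidden_dim)
        self.ln = nn.LayerNorm(hidden_dim)

    def forward(self, h_lora):
        # h_adp = h_lora + LN(ReLU(FF_down(h_lora)) FF_up)
        return h_lora + self.ln(self.ff_up(torch.relu(self.ff_down(h_lora))))
```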
Experiments and Results
A. Dataset
We utilized four datasets—VCTK [34], Libri-TTS-100 [35], Common Voice [36], and the Korean Multi-Speaker Speech Synthesis (KMSSS)1—to evaluate the performance of the TTS model using the proposed fine-tuning approaches. These datasets were selected for their different characteristics. For instance, the VCTK dataset comprised around 400 sentences per speaker from 109 speakers. The audio format was 16-bit PCM with a sampling rate of 48 kHz. This dataset was characterized by a similar number of speech samples per speaker and low variability in speech. Meanwhile, the Libri-TTS-100 dataset had a similar number of speech samples as VCTK but comprised 247 speakers. The total length of the audio data was approximately 54 h, with a sampling rate of 24 kHz. This dataset had fewer samples per speaker, an inconsistent number and length of utterances per speaker, and more variability in speech. The Common Voice dataset consisted of mono-channel, 16-bit MP3 audio files at a sampling rate of 48 kHz. In this experiment, we organized a subset of 144 English speakers, each with ~1,000 samples, to ensure balanced data for fine-tuning. Compared with VCTK and Libri-TTS-100, this dataset offered greater variation in speech, including various accents and dialects, recorded by volunteers from diverse linguistic backgrounds. Lastly, to investigate the model’s adaptability to non-English languages, a dataset was constructed from the KMSSS dataset by taking 184 speakers, each speaking 500 utterances at a sampling rate of 48 kHz.
For multi-speaker fine-tuning, a VITS model was pretrained using 100 speakers from the VCTK dataset, with five speakers for validation and four for fine-tuning and testing. In contrast, we pretrained, validated, and tested the VITS model with 220, 14, and 13 speakers, respectively, for the Libri-TTS-100 dataset, where we selected the four speakers with the highest number of samples in the test data for fine-tuning. For the Common Voice dataset, the VITS model was pretrained using 130 speakers, whereas ten and four speakers were used for validation and testing, respectively. Note that two out of the four speakers recorded speeches in environments with slight background noise, adding diversity to the data. Similarly, for the Korean dataset, we used 160 speakers for training, 20 for validation, and 4 for testing.
B. Experimental Setup
In our experimental setup, we resampled all the speech data at a sampling rate of 22 kHz. Then, we normalized the raw text sequences and converted the normalized sequences into the International Phonetic Alphabet sequence using an open-source phonemizer2 [55].
To obtain our pretrained VITS model, we utilized the AdamW optimizer [56] with the hyperparameters set as
C. Evaluation Metrics
To evaluate the performance of the proposed fine-tuning approaches, we compared the synthesized speech with the reference speech using five objective metrics: SECS, WER, CER, NISQA-TTS3, and WV-MOS4.
SECS measured the cosine similarity between the speaker embedding of the synthesized speech and that of the reference speech. This value, which ranged from −1 to 1, indicated how closely the speakers’ vocal characteristics matched. We computed the speaker embeddings using the H/ASP model [37], a publicly available speaker verification model5 trained on VoxCeleb2 [57], a large-scale speech dataset.
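As an illustration (not the exact evaluation script), SECS reduces to a cosine similarity between two speaker embeddings; embed_speaker below stands in for the pretrained H/ASP verification model.

```python
import torch
import torch.nn.functional as F

def secs(emb_synth: torch.Tensor, emb_ref: torch.Tensor) -> float:
    """Speaker embedding cosine similarity in [-1, 1]."""
    return F.cosine_similarity(emb_synth.unsqueeze(0), emb_ref.unsqueeze(0)).item()

# Usage sketch: emb_synth = embed_speaker(synth_wav); emb_ref = embed_speaker(ref_wav)
```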
WER (%) and CER (%) respectively indicated the percentages of recognized word and character errors in the synthesized transcript relative to the ground-truth text. For synthesized speech transcription, we used NeMo’s stt_en_conformer_transducer_large model6 [38], which was based on the conformer transducer architecture, and computed these error rates using the Levenshtein distance algorithm7 [58]. A lower value suggests fewer pronunciation errors in the synthesized speech, indicating higher fidelity of the synthesized audio to the provided transcription.
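For reference, WER based on the Levenshtein (edit) distance can be computed with a short dynamic-programming routine such as the one below; CER is obtained by operating on characters instead of words. This is a generic sketch, not the toolkit used in the paper.

```python
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (single-row dynamic programming)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return 100.0 * levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)
```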
NISQA-TTS was designed to predict the naturalness of synthetic speech, providing a nonintrusive evaluation without needing a reference signal in TTS systems. This metric predicted the naturalness score on a five-point scale consistent with the human MOS evaluation. Our work used the NISQA-TTS model to estimate the naturalness of the synthetic speech generated by our TTS system.
WV-MOS evaluated the overall quality of the utterances generated by each model and provided a score that ranged from 1 to 5 points. For MOS prediction, the WV-MOS model utilized a neural network architecture, wav2vec2.0, which was pretrained in a contrastive self-supervised manner, making it useful for various downstream tasks. The pretrained wav2vec2.0 model was fine-tuned using listening evaluation results from the Voice Conversion Challenge 2018 dataset [59]. In our study, we used WV-MOS to measure the overall quality of the generated speech for each fine-tuning method.
Objective metrics are not always reliable for measuring the perceived quality of speech synthesized by TTS models. Therefore, subjective evaluation is required to accurately assess speech quality. In this study, we compared the quality of the speech synthesized using our fine-tuning approach with that of the original speech using a CMOS on a seven-point scale ranging from −3 to 3. Ten people participated in the subjective test by listening to 10 randomly selected pairs of original and synthesized speech samples.
D. Performance Evaluation
To examine the effectiveness of the different proposed fine-tuning approaches on the objective and subjective quality of the synthesized speech, we generated speech samples from the TTS models after applying different combinations of the proposed approaches. Table 1 compares the objective quality of the fine-tuned TTS models according to different combinations of the proposed fine-tuning approaches, and its rightmost column compares the number of model parameters trained by each fine-tuned TTS model. In Table 1, Proj, AT, WN, and MRF denote the LoRA approach applied to the linear projection layers, the attention matrices in the transformer-based text encoder (shown in Fig. 3(b)), the WaveNet (shown in Fig. 3(c)), and the MRF in the generator (shown in Fig. 3(d)), respectively. In addition, CLN and ADP signify the proposed approaches using the CLN and residual adapter, as described in Figs. 4(a) and 4(b), respectively. Moreover, to investigate the effect of fine-tuning the projection layers connected to the speaker embedding $g^{\prime}$, these speaker embedding projection layers are denoted as SEPL.
As revealed by Table 1, Model 1, in which only AT was fine-tuned, performed poorly overall. However, the metric scores of Model 2 showed that tuning the WN was necessary to improve the overall speech quality, naturalness, and intelligibility. Model 3, trained by fine-tuning only LoRA modules, indicated that it was difficult to capture the unique characteristics of the speaker’s voice, making it challenging to represent the speaker accurately. Next, we fine-tuned the SEPLs in Model 4, which showed that tuning the SEPL increased the SECS score. Subsequently, we fine-tuned the VITS model with AT, WN, and SEPL together to create Model 5, which showed that fine-tuning the critical parts of the VITS model resulted in a higher overall quality of the synthesized speech. Further, Table 1 indicates that Model 6 improved the overall performance of the generated speech, particularly in terms of naturalness. However, Model 7 provided lower overall quality but a higher SECS score than Models 1 to 4, which implied that SEPL and CLN could contribute to speaker similarity.
Based on the observation from Model 7, we applied SEPL and CLN to the subsequent fine-tuned models, Models 8 to 11. As shown in Table 1, Model 8 demonstrated higher performance than Model 6, highlighting the importance of CLN in the multi-speaker fine-tuning process with an additional increase of only 0.02M parameters. Meanwhile, Model 9 demonstrated higher performance in terms of WER, CER, and NISQA-TTS than Model 8 because of the addition of ADP. Moreover, Model 10 further improved speaker similarity and overall speech quality compared with Model 9 because the MRF was fine-tuned.
Lastly, we employed all the proposed approaches to fine-tune the VITS model, referred to as Model 11. As shown in the 11th row of Table 1, Model 11 outperformed Models 1–10 in the objective metrics, with a 10% increase in the number of model parameters. Interestingly, Model 11 achieved only slightly lower performance than the full fine-tuning method. Finally, to investigate the effect of CLN on multi-speaker TTS, we fully fine-tuned the VITS model with CLN. The last two rows of Table 1 show that CLN considerably contributed to improving all the objective metrics compared with the model with full fine-tuning.
Additionally, we performed a subjective test on the speech synthesized by the models whose NISQA-TTS scores were higher than 3.0. In particular, we chose Models 8–11 and the two fully fine-tuned models with and without CLN adaptation. Table 2 compares the CMOS of these top six fine-tuned models, revealing that CMOS was closely related to either NISQA-TTS or WV-MOS. Although the proposed approaches had slightly lower CMOS values than full fine-tuning, the listening test confirmed that they produced comparable perceptual quality.
E. Effect of Speaker-Related Techniques on Speaker Representation
We conducted a series of experiments to understand the effect of fine-tuning the speaker-related modules on speaker representations. Fig. 5 illustrates the t-distributed stochastic neighbor embedding (t-SNE) [60] plots of the latent vectors z of speech synthesized for the test speakers in the VCTK dataset to compare the speaker clustering performance of different VITS models. As shown in Fig. 5(a), the latent vectors of Model 1 were distributed randomly, implying that the speakers were not clustered. In contrast, Model 7 (shown in Fig. 5(b)) provided better speaker clustering than Model 1, which implied that CLN was effective for speaker representation. We then plotted the latent vectors from Model 11, which showed the best subjective and objective performance, as presented in Tables 1 and 2, and observed better speaker clustering than that of Model 7. Next, we compared the t-SNE plots of the fully fine-tuned models without and with CLN, as shown in Figs. 5(d) and 5(e), respectively. Thus, CLN was also demonstrated to be effective for speaker clustering, resulting in better objective and subjective quality scores.
Comparison of the t-SNE plots of latent vectors predicted by different models: (a) Model 1, (b) Model 7, (c) Model 11, (d) fully fine-tuned model, and (e) fully fine-tuned model with CLN, where the latent vectors were obtained from the test speakers on the VCTK dataset.
F. Setting the Rank of LoRA
This section describes the performance when the rank $r$ of the LoRA matrices is changed. Following the subspace similarity analysis in [29], we compare the update matrices learned with ranks 8 and 96, denoted $\Delta W(8)$ and $\Delta W(96)$, by measuring the normalized overlap between their top-$i$ and top-$j$ eigenvectors:
\begin{align*} \phi\left(\Delta W(8), \Delta W(96), i, j\right) = \frac{\left\|\left({\Delta W\left(8\right)}^{i}\right)^{T}{\Delta W\left(96\right)}^{j}\right\|_{F}^{2}}{\min(i,j)}, \tag{17}\end{align*}
where $\Delta W(r)^{i}$ denotes the matrix formed by the top-$i$ eigenvectors of the update learned with rank $r$, and $\|\cdot\|_{F}$ is the Frobenius norm. The measure lies between 0 and 1, with larger values indicating greater subspace overlap.
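A small sketch of this subspace-similarity measure, assuming the two update matrices are available as tensors, could look as follows:

```python
import torch

def subspace_similarity(delta_w_a, delta_w_b, i, j):
    """Normalized overlap between the top-i and top-j directions of two LoRA
    updates, as in Eq. (17); the result lies in [0, 1]."""
    u_a, _, _ = torch.linalg.svd(delta_w_a)   # left singular vectors of each update
    u_b, _, _ = torch.linalg.svd(delta_w_b)
    overlap = u_a[:, :i].T @ u_b[:, :j]
    return (overlap.norm(p="fro") ** 2 / min(i, j)).item()
```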
Similarity of eigenvectors between the LoRA update matrices ΔW(8) and ΔW(96).
Lastly, we applied the proposed method to fine-tune the models with LoRA ranks
G. Comparison of Synthesis and Training Speed
We evaluated the speech synthesis and training speed of our model, focusing on the additional complexity incurred when a new module was added. We measured two indices: the average fine-tuning time per epoch (in seconds) and the real-time factor (RTF). All measurements were performed on a single A100 GPU with a batch size of 1, and the fine-tuning time was measured using 1,546 sentences over 20 epochs. Table 4 compares the average fine-tuning time and RTFs of the different models. A comparison of Model 8 with Models 5 and 6 indicated that CLN increased the average fine-tuning time per epoch from 23.76 to 26.36 s. MRF also increased the average fine-tuning time by 2.18 s, as revealed by the comparison of Models 10 and 8. Notably, the average fine-tuning time of Model 11, which was fine-tuned with all the proposed approaches, was much shorter than that of the fully fine-tuned model. In contrast, the RTF was proportional to the number of added modules, which corresponded to the additional model parameters given in the rightmost column of Table 1. As expected, the fully fine-tuned model had the lowest RTF among all the models compared in Table 4. However, Model 11, which had the highest RTF among the proposed PEFT methods, remained at a real-time level. When we ran inference with Model 11 on lower-resource GPUs, such as the NVIDIA TITAN X and RTX 2080 Ti, we confirmed that the proposed method could operate in real time under low-resource conditions.
H. Objective Quality According to Different Datasets
In this section, we decompose the performance evaluation results shown in Table 1 into the results according to each dataset: VCTK and Libri-TTS-100. We conducted an ablation study using different fine-tuning methods and assessed the performance of the proposed method. We compared the results with those of full fine-tuning and ground truth. Table 5 presents the evaluation results using the objective metrics of different methods on the VCTK dataset, whereas Table 6 provides the results using the objective metrics on the Libri-TTS-100 dataset.
As shown in Tables 5 and 6, Model 5, which fine-tuned AT, WN, and SEPL together, showed higher overall quality of the synthesized speech, achieving SECS values of 0.591 and 0.526 and WV-MOS scores of 3.85 and 3.49. By additionally fine-tuning Proj on top of Model 5, Model 6 improved the overall performance of the generated speech with NISQA-TTS scores of 2.84 ± 0.17 and 3.11 ± 0.31. However, when this configuration was applied to fine-tuning with four speakers, the performance was lower than that of fine-tuning with a single speaker. Model 7 involved fine-tuning with four speakers, demonstrating higher performance and highlighting the importance of CLN in the multi-speaker fine-tuning process; moreover, its objective performance improved compared with that of Model 6.
Model 10 involved fine-tuning the MRF by LoRA, which improved the overall quality across both datasets. Model 11 further enhanced this performance by incorporating ADP to improve the expressiveness of the prior distribution, resulting in the best performance. Overall, the models fine-tuned on the VCTK dataset outperformed those fine-tuned on the Libri-TTS-100 dataset; however, owing to its nature, the Libri-TTS-100 dataset yielded a higher naturalness score, and CLN had a more significant impact during the fine-tuning process.
In addition to the VCTK and Libri-TTS-100 datasets, the performance of the fine-tuned models was evaluated on the Common Voice dataset to assess the robustness of the proposed PEFT method against diverse accents and speech styles. Table 7 presents the evaluation results using the objective metrics of the different methods on the Common Voice dataset. Notably, the WV-MOS score of the ground truth samples was 3.78, which was lower than the scores obtained from both the VCTK and Libri-TTS-100 datasets. This was due to the characteristics of the Common Voice dataset, such as its various accents and slight background noise, which increased the CER and WER. Compared with the results in Tables 5 and 6, the tendency of performance variations according to different combinations of the proposed fine-tuning approaches was similar to that observed on the VCTK and Libri-TTS-100 datasets. In other words, full fine-tuning with CLN provided better performance than conventional full fine-tuning and overall performance comparable to the ground truth, with an SECS score of 0.696, demonstrating the effectiveness of the fine-tuning process. Moreover, Model 11 performed the best among the models fine-tuned with the proposed PEFT method. It also maintained WV-MOS and NISQA-TTS scores comparable to those of full fine-tuning, suggesting that the proposed PEFT method could be effective even when applied to more challenging speech samples.
I. Adaptation Results With the Korean Dataset
In this section, we apply the proposed PEFT method to the KMSSS dataset to examine the effect of variations in pronunciation and tone across languages on the model’s adaptability. Table 8 presents the evaluation results using the objective metrics of the different methods on the KMSSS dataset. According to the results, although the proposed PEFT method was applied to a pretrained Korean TTS model, the performance difference between Model 11 and full fine-tuning with CLN remained consistent with that observed on the English datasets, as shown in Tables 5–7. This implied that although the PEFT method was developed using English datasets, it could be applied to other languages as well. Rather, the most critical factor for applying PEFT lies in the ability of the pretrained TTS model to produce proper synthetic and personalized speech so that the proposed fine-tuning method can achieve optimal performance. In conclusion, the proposed PEFT method can be effectively applied in various scenarios, regardless of the language.
J. Comparison With Zero-Shot TTS Models
In this section, we compare the performance of our model, which incorporates the proposed PEFT method, with that of two zero-shot TTS models: YourTTS [11] and XTTS [14]. YourTTS is a VITS-based E2E TTS model that utilizes the H/ASP model’s output as the speaker embedding and applies a speaker consistency loss to ensure high speaker similarity between synthetic and ground truth speech. Meanwhile, XTTS builds on Tortoise [61] but introduces several novel modifications to enable multilingual training, enhance zero-shot TTS performance, and achieve faster training and inference. As we aimed to evaluate zero-shot performance, we utilized the open-source YourTTS8 model and XTTS-v29 without any fine-tuning. To evaluate these three TTS models, including our Model 11, we prepared 120 samples by taking 60 samples from each of the VCTK and Libri-TTS-100 datasets.
Table 9 compares the objective metrics of Model 11, YourTTS, and XTTS-v2. Model 11 outperformed YourTTS in all objective metrics. Meanwhile, XTTS-v2 achieved slightly better WER, CER, and NISQA-TTS values than Model 11, but Model 11 showed much better performance in terms of SECS, which is the most important metric for personalized speech. Therefore, we concluded that the proposed PEFT method was more effective than the zero-shot TTS models for generating personalized speech.
K. Effect of LoRA on Information Flow in the Flow-Based Decoder
In this subsection, we investigate whether the application of LoRA to the flow-based decoder affects its invertibility during inference. Fig. 8 illustrates the three-dimensional (3D) t-SNE plots of each k-th latent variable from the forward and backward passes of the flow-based decoder.
3D t-SNE plots of the latent variables in the k-th forward flow and backward flow.
To measure the difference between the initial forward data distribution and the final backward distribution, we used the centered kernel alignment (CKA) [62]. CKA quantifies the similarity between pairs of neural network representations and effectively calculates the similarity of representation distributions invariant to isotropic scaling. Using CKA, we can robustly assess the similarity of data distributions before and after passing through flow layers. Note that the CKA calculations were performed using open-source data10.
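For reference, a linear-CKA variant [62] can be sketched as follows; this is an illustrative implementation, not necessarily the exact configuration used here.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear centered kernel alignment between two representations of shape (N, D)."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    hsic_xy = (x.T @ y).norm(p="fro") ** 2
    hsic_xx = (x.T @ x).norm(p="fro") ** 2
    hsic_yy = (y.T @ y).norm(p="fro") ** 2
    return (hsic_xy / (hsic_xx.sqrt() * hsic_yy.sqrt())).item()
```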
Table 10 compares the CKA accuracy between the latent variables of the first forward layer and the last backward layer outputs for the fully fine-tuned model, Model 2, and Model 11. We compared Model 11 with Model 2 because Model 2 was the first configuration to apply LoRA to the WaveNet in the flow network.
As shown in the table, Model 11 achieved a CKA accuracy of 96.19, whereas the fully fine-tuned model achieved 96.31. Such a high similarity score for Model 11 demonstrated that integrating LoRA into the flow-based decoder preserved flow invertibility. Furthermore, Model 2 achieved a CKA accuracy of 95.98. Although Model 2 showed lower speech performance because of its fewer tuning parameters, the CKA scores indicated that the invertibility of the flow transformations was maintained in this case as well. These results confirmed that integrating LoRA did not affect the invertibility of the flow-based transformations.
Conclusion
In this paper, we proposed several fine-tuning approaches to improve the performance of an E2E multi-speaker TTS model by efficiently adapting it to new speakers. To this end, we first proposed a LoRA-based fine-tuning approach to achieve speech quality comparable to that of a fully fine-tuned model by updating a smaller number of model parameters. Second, a CLN-based fine-tuning approach was proposed to handle speaker-specific variation with improved multi-speaker PEFT performance. Third, the residual adapter was integrated into the text encoder output to improve the expressiveness of the prior distribution. We constructed the VITS models using the VCTK, Libri-TTS-100, Common Voice, and Korean multi-speaker datasets according to different combinations of the proposed fine-tuning approaches (i.e., LoRA, CLN, and residual adapter). The model performance was evaluated using five objective measures, namely, SECS, WER, CER, NISQA-TTS, and WV-MOS, as well as a subjective listening test involving the measurement of CMOS. The performance comparison revealed that LoRA improved the overall objective measures but was limited in improving the subjective quality for multi-speaker TTS. However, combining LoRA and CLN improved the speech quality compared with using only LoRA. In addition, the VITS model fine-tuned using all the proposed approaches provided objective and subjective speech quality comparable to that of the fully fine-tuned model. Next, we investigated the effect of the proposed fine-tuning approaches on speaker clustering. The t-SNE comparison showed that CLN was effective in separating speakers in the latent space. Finally, the comparison of complexity by measuring the average fine-tuning time and RTF showed that the proposed fine-tuning approaches were realized with less complexity than the full fine-tuning approach.
Despite these promising results, the proposed approaches have limitations that require future work. First, although the proposed PEFT method achieved good performance, it is still not as effective as full fine-tuning. Second, the baseline model structure has exhibited limited adaptability when dealing with challenging datasets such as Common Voice. These limitations can be mitigated by enhancing the adaptability of the pre-trained model through structural modifications or adding new modules to the baseline model architecture. Furthermore, additional adapters can be integrated into various components of the system beyond the prior encoder of the VITS to assess their potential for further performance enhancement. By focusing on these aspects, we aim to advance the adaptability and efficiency of PEFT approaches in multi-speaker TTS systems.