
Leveraging Low-Rank Adaptation for Parameter-Efficient Fine-Tuning in Multi-Speaker Adaptive Text-to-Speech Synthesis



Abstract:

Text-to-speech (TTS) technology is commonly used to generate personalized voices for new speakers. Despite considerable progress in TTS technology, personal voice synthesis remains problematic in achieving high-quality custom voices. In addressing this issue, fine-tuning a TTS model is a popular approach. However, it must be applied once for every new speaker, which results in both time-consuming model training and excessive storage of the TTS model parameters. Therefore, to support a large number of new speakers, a parameter-efficient fine-tuning (PEFT) approach must be used instead of full fine-tuning, as well as an approach to accommodate multiple speakers with a small number of parameters. To this end, this work first incorporates a low-rank adaptation-based fine-tuning method for variational inference with adversarial learning for end-to-end TTS (VITS) model. Next, the approach is extended with conditional layer normalization for multi-speaker fine-tuning, and the residual adapter is further applied to the text encoder outputs of the VITS model to improve the intelligibility and naturalness of the speech quality of personalized speech. The performance of the fine-tuned TTS models with different combinations of fine-tuning modules is evaluated using the Libri-TTS-100, VCTK, and Common Voice datasets, as well as a Korean multi-speaker dataset. Objective and subjective quality comparisons reveal that the proposed approach achieves speech quality comparable to that of a fully fine-tuned model, with around a 90% reduction in the number of model parameters.
Illustration of the block diagrams of the (a) baseline VITS model and (b) TTS model employing the proposed parameter-efficient fine-tuning (PEFT) method combining LoRA, CLN, and a residual adapter.
Published in: IEEE Access ( Volume: 12)
Page(s): 190711 - 190727
Date of Publication: 11 December 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Text-to-speech (TTS) technology synthesizes speech waveforms from input texts through several processes, including text analysis, linguistic feature extraction, acoustic feature prediction, and waveform generation [1]. With recent advances in deep learning, TTS models have significantly improved the quality of synthesized speech compared with traditional statistical parametric models. At present, they can generate natural and human-level quality speech after being trained for several hours on single-speaker or multi-speaker recordings [2], [3], [4], [5]. This advancement has made TTS technology attractive across diverse speech-related applications.

Recently, there has been a growing interest in using TTS for personalized voice assistants and broadcasting [6] with personalized custom voices. In these applications, generating personalized voices for new speakers not included in training data presents a challenge. This challenge arises because of the quality gap between the synthesized speeches of a trained speaker and those of a new speaker. This gap is often caused by factors such as the lack of training data for the new speaker or characteristics of the speaker that do not match the data used during training [7]. To address the issue associated with such a quality gap, speaker adaptation techniques have been applied to better adapt to new speakers not included in training data.

Two main methods are being studied for generating speech for new speakers: speaker adaptation methods based on zero-shot learning [8] or fine-tuning a pretrained TTS model to personalize the natural voices of new speakers [9], [10]. Zero-shot learning utilizes a single pretrained model to imitate unseen speech patterns and features. Significant advances have been achieved by applying zero-shot learning to TTS [11], [12], [13], [14]. However, the zero-shot approach generates a relatively inconsistent personalized voice with distorted naturalness for a given new speaker. Additionally, when the speakers pronounce strong accents or nonstandard pronunciations, the similarity of the synthesized speech further decreases [15]. In contrast, the fine-tuning approach generally adapts a pretrained TTS model by optimizing all the parameters of the TTS model using a limited amount of new-speaker data.

Although adapting the TTS model for a target speaker can improve the synthesized speech quality, several problems arise. First, fine-tuning all the parameters of the TTS model incurs significant computational cost and time consumption [6]. Second, the adapted TTS model for each target speaker needs to be stored individually, which requires considerable storage space [16], [17]. Therefore, reducing the number of adaptation parameters is necessary for fine-tuning.

To mitigate the abovementioned problems, parameter-efficient fine-tuning (PEFT) approaches have been proposed. For instance, AdaSpeech leverages acoustic condition modeling and conditional layer normalization (CLN) at the mel-decoder stage to achieve parameter efficiency while fine-tuning TTS models [6]. Meanwhile, Meta-StyleSpeech [18] employs metalearning techniques for style modeling, enabling fast adaptation to a new speaker’s style with minimal data. Furthermore, adapter-based methods have been introduced as PEFT [19], [20], [21], and they achieve efficiency by selectively fine-tuning only a subset of parameters rather than the entire model, thereby reducing the computational load and storage requirements. However, these approaches have typically focused on fine-tuning the acoustic models of two-stage TTS models [4], [22], [23]. Because acoustic feature representation and waveform synthesis in two-stage TTS models are processed independently, the TTS performance is limited because of the independence of the fine-tuned intermediate features [24].

In recent years, end-to-end (E2E) TTS models have been widely studied to provide higher-quality expression than two-stage TTS models. One representative E2E TTS model is the variational inference with adversarial learning for end-to-end TTS (VITS) model [24], which mainly comprises a variational autoencoder (VAE) augmented with normalizing flow (NF) [25], [26] and is trained adversarially [27]. Another notable E2E model is YourTTS [11], which builds upon the VITS framework and incorporates a speaker encoder for zero-shot multi-speaker adaptation and multilingual training. Additionally, NaturalSpeech [12] achieves high-quality single-speaker TTS by modifying the VITS model structure, introducing a bidirectional NF alongside differentiable duration modeling and phoneme pretraining, which significantly enhances the expressiveness and naturalness of the synthesized speech. However, an issue persists when PEFT is applied to these VITS-based models: the connection between the modules in a VITS-based model is represented by a probability distribution. Applying PEFT to a specific module therefore changes the probability distribution of that module's output, and whether the updated distribution remains suitable as input for the subsequent module is uncertain. Without more sophisticated fine-tuning, high-quality synthesized speech cannot be guaranteed.

To address this issue, a recent study [15] proposed a zero-shot learning and PEFT method for VITS-based models, which improved the zero-shot adaptation performance by altering the VAE model structure to prevent overfitting and introducing a specific discriminator for speaker information, thereby enhancing the overall model performance. In addition, the speaker encoder was based on the ECAPA-TDNN architecture [28], which was modified to extract speaker embeddings and pretrained to effectively capture speaker characteristics. In this model [15], the baseline TTS model was trained using speaker embeddings extracted from the pretrained speaker encoder to aid the model’s flow and duration predictor during training. PEFT was applied through adapters to the prior encoder, specifically targeting the flow-based decoder and text encoder. This approach demonstrated impressive performance in speaker adaptation. However, this method relied on a pretrained speaker encoder, did not consider multi-speaker adaptation, and only applied the adapter to the prior encoder.

Thus, this paper presents a PEFT approach for VITS models and demonstrates the effectiveness of applying PEFT to multiple specific modules within the E2E architecture, providing a new method for improving TTS performance in multi-speaker adaptation. To realize PEFT for the VITS model, we propose three specific strategies. First, we incorporate low-rank adaptation (LoRA) [29] for fine-tuning the VITS model. LoRA reduces the number of trainable parameters by decomposing weight updates into lower-dimensional representations [30]; consequently, it adapts only a small set of low-rank matrices rather than the entire set of model parameters. In this study, LoRA is applied to several modules: the attention network of the text encoder, the WaveNet [2] structure in both the flow network and the posterior encoder, the HiFi-GAN generator [31], and two linear projection layers. Second, LoRA-based fine-tuning is expanded with CLN [6] for multi-speaker fine-tuning, because LoRA alone does not capture diverse speaker-specific variations, resulting in suboptimal performance in multi-speaker adaptation. CLN uses a small conditioning layer to obtain scale and bias vectors for normalization instead of standard layer normalization (LN), and it replaces LN in the text encoder and the stochastic duration predictor (SDP) of the VITS model. Lastly, to match the intelligibility and naturalness achieved by full fine-tuning, the expressiveness of the prior distribution must be increased [12]. Therefore, this work additionally applies a modified version of the residual adapter [22], [32], [33], which can be flexibly inserted at the output of any module; in our model, the residual adapter is inserted at the text encoder outputs of the VITS model to enhance the representation of the prior distribution.

We conducted experimental evaluations on the widely adopted multi-speaker VCTK [34] and Libri-TTS-100 [35] datasets to measure the voice quality of the proposed fine-tuning method against several objective and subjective metrics. These datasets were chosen to test the robustness of our process to different data characteristics. The VCTK dataset was characterized by many audio samples per speaker and a generally calm and consistent tone of voice. In contrast, the Libri-TTS-100 dataset comprised significantly more speakers despite similar sample numbers, with variations in the tone of each speaker. Because VCTK and Libri-TTS-100 were composed of controlled and stable speeches, we repeated experiments using the Common Voice datasets [36] to evaluate the performance of the proposed PEFT method under various accent conditions, which was essential for building personalized custom voices. Moreover, we conducted additional experiments using a Korean multi-speaker dataset to further investigate the model’s adaptability to different languages. Using these datasets, we verified the performance of our multi-speaker fine-tuning method with four speakers. The speech performances of different models, where fine-tuned TTS models were evaluated according to different combinations of fine-tuning modules (e.g., LoRA, CLN, and residual adapter), were compared in terms of the number of tuning parameters and speech quality measures. To measure speech quality, we used five objective metrics: speaker embedding cosine similarity (SECS) [37], word error rate (WER), character error rate (CER) [38], nonintrusive objective speech quality assessment for TTS (NISQA-TTS) [39], and mean opinion score (MOS) prediction by a fine-tuned wave2vec2.0 model (WV-MOS) [40]. In addition, to measure reliable TTS perception quality in terms of human-level quality, we used a comparative mean opinion score (CMOS) as a subjective metric [41].

The main contributions of this study are as follows:

  • To implement PEFT in the VITS model, we applied LoRA to the prior encoder and other specific modules within the E2E model, achieving speech quality comparable to that of a fully fine-tuned model with a 90% reduction in model parameters.

  • To handle speaker-specific variation with improved multi-speaker PEFT performance, CLN replaced the LN in the text encoder and the SDP, allowing the model to train an additional speaker with only 0.02M parameters.

  • To improve the expressiveness of the prior distribution, the residual adapter was integrated into the text encoder output. With only 0.15M parameters, this integration improved the WER, CER, and NISQA-TTS scores.

The remainder of this paper is organized as follows. Section II provides background knowledge that helps readers understand our work. Section III describes the VITS model architecture used as the baseline TTS model. Section IV proposes the PEFT method using LoRA, CLN, and the residual adapter for multi-speaker adaptation. Section V evaluates the performance of the VITS models with the proposed PEFT, including several ablation studies and visualization experiments. Finally, Section VI concludes the paper.

SECTION II.

Background

This section provides background knowledge that helps readers understand our work. First, we give a general review of TTS models. Next, we explain the flow-based generative models used in TTS systems.

A. Overview of Text-to-Speech Models

Recently, neural TTS systems have made significant advances in performance. Two-stage TTS structures are commonly used to generate speech: acoustic models predict predetermined acoustic features, such as mel-spectrograms, and a vocoder then synthesizes waveforms [31], [42]. When predicting these acoustic features, acoustic models can be categorized into two groups: autoregressive (AR) and non-autoregressive (NAR) TTS systems. Typical sequence-to-sequence AR-TTS systems include WaveNet [2] and Tacotron 1 and 2 [3], [22]; Transformer TTS [23] was the first model to use a transformer network in TTS. These AR-TTS systems sequentially generate mel-spectrogram frames by conditioning on the previous frame to effectively capture long-term dependencies. However, this can compromise inference speed and robustness, leading to errors such as missing words and repetition. Thus, NAR-TTS systems have been developed to address these problems. For instance, FastSpeech [43] overcomes problems such as repetition in AR-TTS and parallelizes the process with a duration predictor to improve the speed and robustness of speech synthesis. FastSpeech2 [4] refines this setup with a variance adaptor for pitch and energy, although it still depends on an external text-speech alignment tool. Meanwhile, Glow-TTS [5] advances the field by learning the alignment directly during training using monotonic alignment search (MAS).

Despite the progress in NAR-TTS systems, the abovementioned cascaded acoustic-model/vocoder pipeline still has problems. In two-stage models, the latter model is trained on samples generated by earlier models or leverages pretrained models without modification. In addition, fine-tuning for high-quality speech synthesis is problematic because the two models must be trained separately. Furthermore, training-inference mismatches occur for both the mel-spectrogram and the duration, as the models are trained with ground-truth values but rely on predicted values during inference. Because of these problems, E2E models utilizing efficient training methods have been widely studied [44], [45]. Among these models, VITS [24] has succeeded in producing more natural speech than two-stage models by integrating the TTS model and a neural vocoder within an E2E framework using a VAE to enhance the synthetic speech quality. Moreover, VITS addresses the one-to-many problem of TTS by employing an SDP, enabling the generation of varied rhythms. Consequently, E2E models based on the VITS architecture have been widely adopted [11], [12], [46].

B. Flow-Based Generative Model

Flow-based models are increasingly being used in generative modeling because of their ability to compute the exact likelihood of data by applying invertible transformations [47]. To estimate the exact density, the latent variable of a generative model should follow a distribution as simple as a Gaussian. This leads to NF [25], which transforms a simple distribution into a complex one by applying a sequence of invertible transformations. The log-likelihood is obtained by iteratively applying the change-of-variables formula:
\begin{align*} \log p_{\theta}(c) &= \log p_{\theta}(z) + \sum_{i=1}^{K} \log \left| \det\left( J\left( f_{i}^{-1}(c) \right) \right) \right|, \tag{1}\\ z &= f_{K}^{-1} \circ f_{K-1}^{-1} \circ \cdots \circ f_{1}^{-1}(c), \tag{2}\end{align*}
where K is the number of layers in the flow-based decoder, \circ is the composition operator, and J(\cdot) denotes the Jacobian.

When implementing NF, two conditions must be satisfied: the Jacobian matrix of the transformation should be easy to compute, and the transformation should be easily invertible. These requirements have been effectively addressed by the affine coupling layer proposed previously [48], which simplifies the Jacobian computation and ensures invertibility. The affine coupling layer partitions the input into two parts and transforms one part conditionally on the other, facilitating a simple calculation of the Jacobian determinant. Additionally, the limitation that the affine coupling layer leaves part of the input unchanged has been overcome by the invertible 1\times 1 convolution [49], which permutes features (mixing between channels) and enhances the model's flexibility.
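To make the coupling mechanism concrete, the following is a minimal PyTorch sketch of an affine coupling layer in the spirit of [48]; the module names, channel sizes, and the small conditioning network are illustrative assumptions, not the VITS implementation.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal affine coupling layer: half the channels pass through unchanged,
    the other half is scaled and shifted conditioned on the first half."""
    def __init__(self, channels: int, hidden: int = 192):
        super().__init__()
        # Small network predicting log-scale and bias from the unchanged half.
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=1)              # split channels
        log_s, t = self.net(x_a).chunk(2, dim=1)  # conditional scale/bias
        y_b = x_b * torch.exp(log_s) + t
        # The Jacobian is triangular, so its log-determinant is just sum(log_s).
        logdet = log_s.sum(dim=[1, 2])
        return torch.cat([x_a, y_b], dim=1), logdet

    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=1)
        log_s, t = self.net(y_a).chunk(2, dim=1)
        x_b = (y_b - t) * torch.exp(-log_s)
        return torch.cat([y_a, x_b], dim=1)
```

Because only half of the channels are transformed per layer, stacking several such layers (with channel permutations in between) is what allows the flow to model a complex distribution while keeping both the inverse and the Jacobian determinant cheap.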

The WaveGlow model [50] further extended these structures by incorporating the WaveNet architecture, significantly enhancing its capability to model complex audio signals. This structure is also employed in the baseline VITS model, which combines a VAE with an NF framework. This integration improves the expressiveness of the prior distribution and significantly improves the quality of speech synthesis by leveraging the ability of a flow to construct a complex probability distribution from a simple one. More detailed explanations are provided in Section III.

SECTION III.

Baseline TTS Model

In this section, we explain the VITS model [24], which is employed as the baseline model in this work, with a focus on the network architecture and training process. VITS is a parallel E2E model that utilizes a VAE to learn latent variables that serve as intermediate representations between the acoustic model and the waveform generator in a fully integrated training process. This integration improves the smooth flow of information from the acoustic model to the waveform generator, resulting in the consistency of the personalized voice quality.

Fig. 1 depicts the training procedure of the baseline VITS model, which comprises three primary components: a prior encoder, a posterior encoder, and a HiFi-GAN generator [31]. The prior encoder comprises a transformer-based text encoder, a flow-based decoder, MAS [5], and an SDP. The text encoder uses multiple feed-forward transformer blocks [51] to transform the input phonemes c_{\text{text}} into hidden representations h_{\text{text}}. These representations are then processed by a linear projection layer to produce f_{\theta}(c) with mean \mu_{\theta}(c) and variance \sigma_{\theta}^{2}(c), which are used to construct the prior distribution. Here, c is defined as [c_{\text{text}}, A], where A is the alignment between the text c_{\text{text}} and the target speech x, selected from all potential alignments through MAS. In parallel, the SDP is trained using the speaker embedding g, the alignment A, and the hidden representation h_{\text{text}} of the text encoder output, as depicted in Fig. 1. During inference, the SDP estimates the alignment from the text alone. The SDP models the phoneme duration d by introducing two random variables, u and v, in view of variational Bayes estimation. The two random variables are sampled from an approximate posterior distribution to optimize a variational lower bound on the log-likelihood of the phoneme duration:
\begin{equation*} \log p_{\theta}(d \mid c_{\text{text}}) \ge \mathbb{E}_{q_{\phi}(u,v \mid d, c_{\text{text}})}\left[ \log \frac{p_{\theta}(d-u, v \mid c_{\text{text}})}{q_{\phi}(u,v \mid d, c_{\text{text}})} \right]. \tag{3}\end{equation*}
The duration loss L_{\text{dur}} is the negative of this variational lower bound.

FIGURE 1. Block diagram of the baseline VITS model.

The flow-based decoder is constructed by arranging a stack of WaveNet [2] residual blocks within a stack of affine coupling layers [47]. The probability of the latent variables conditioned on the text, p_{\theta}(z \mid c), can be expressed as
\begin{equation*} p_{\theta}(z \mid c) = N\left( f_{\theta}(z); \mu_{\theta}(c), \sigma_{\theta}(c) \right) \left| \det \frac{\partial f_{\theta}(z)}{\partial z} \right|, \tag{4}\end{equation*}
where f_{\theta}(z) is the output of the flow-based decoder and \left| \det \frac{\partial f_{\theta}(z)}{\partial z} \right| is the Jacobian determinant, which enables the exact likelihood computation and the inverse transformation.

The posterior encoder and the HiFi-GAN generator, as shown in Fig. 1, correspond to the encoder and decoder of the VAE, respectively. The former extracts the latent representation z from the waveform x, whereas the latter generates the reconstructed waveform \hat{x} from z:
\begin{align*} z &= \mathrm{Enc}(x) \sim q(z \mid x), \tag{5}\\ \hat{x} &= \mathrm{Dec}(z) \sim p(x \mid z). \tag{6}\end{align*}

The training loss for a conditioned VAE is derived from the evidence lower bound of the marginal log-likelihood p_{\theta}(x \mid c), which is maximized as
\begin{equation*} \log p_{\theta}(x \mid c) \ge \mathbb{E}_{q_{\phi}(z \mid x)}\left[ \log p_{\theta}(x \mid z) - \log \frac{q_{\phi}(z \mid x)}{p_{\theta}(z \mid c)} \right], \tag{7}\end{equation*}
where p_{\theta}(z \mid c) is the prior distribution of z in equation (4), q_{\phi}(z \mid x) is an approximate posterior distribution, and \log p_{\theta}(x \mid z) is the likelihood function for a data point x. Equation (7) decomposes into a reconstruction loss measured at the output of the HiFi-GAN generator and a Kullback-Leibler (KL) divergence loss. The reconstruction loss L_{\text{recon}} is defined as the L1 loss between the target and predicted mel-spectrograms, x_{\text{mel}} and \hat{x}_{\text{mel}}, respectively:
\begin{equation*} L_{\text{recon}} = \left\| x_{\text{mel}} - \hat{x}_{\text{mel}} \right\|_{1}. \tag{8}\end{equation*}

In addition, the KL loss in the latent space is defined using the prior distribution p_{\theta} and the posterior distribution q_{\phi} of the baseline model as
\begin{equation*} L_{\mathrm{KL}} = \log q_{\phi}(z \mid x_{\text{lin}}) - \log p_{\theta}(z \mid c), \tag{9}\end{equation*}
where x_{\text{lin}} is the linear spectrogram of x, as shown in the bottom left part of Fig. 1.

Finally, the HiFi-GAN generator G synthesizes the predicted speech \hat{x} from the intermediate representation z. In the VITS framework, G comprises a series of transposed convolutions, each followed by a multi-receptive field fusion (MRF) module. The adversarial loss of the generator G is defined as
\begin{equation*} L_{\text{adv}}(G) = \mathbb{E}_{z}\left[ \left( D(G(z)) - 1 \right)^{2} \right], \tag{10}\end{equation*}
where D is the GAN discriminator, composed of a multi-period discriminator and a multi-scale discriminator, as shown in the top right part of Fig. 1; it is trained with the adversarial loss
\begin{equation*} L_{\text{adv}}(D) = \mathbb{E}_{(x,z)}\left[ \left( D(x) - 1 \right)^{2} + \left( D(G(z)) \right)^{2} \right]. \tag{11}\end{equation*}
In addition to L_{\text{adv}}(G), a feature-matching loss L_{\text{fm}}(G) is used as a reconstruction loss of the HiFi-GAN discriminator, obtained by summing the L1 losses between the feature maps extracted from the intermediate layers of each discriminator. Consequently, the total loss of the VITS model combines the VAE and GAN losses to facilitate E2E learning:
\begin{equation*} L_{\text{total}} = L_{\text{recon}} + L_{\text{KL}} + L_{\text{dur}} + L_{\text{adv}}(G) + L_{\text{fm}}(G). \tag{12}\end{equation*}
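As a concrete illustration of equations (10)-(12), the following minimal PyTorch sketch computes the least-squares adversarial terms and combines them with the remaining loss terms; the tensors and the unweighted sum are placeholders for illustration, not the authors' implementation.

```python
import torch

def generator_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Eq. (10): L_adv(G) = E[(D(G(z)) - 1)^2]
    return torch.mean((d_fake - 1.0) ** 2)

def discriminator_adv_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Eq. (11): L_adv(D) = E[(D(x) - 1)^2 + D(G(z))^2]
    return torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)

def total_generator_loss(l_recon, l_kl, l_dur, l_adv_g, l_fm):
    # Eq. (12): sum of the VAE, duration, adversarial, and feature-matching terms.
    return l_recon + l_kl + l_dur + l_adv_g + l_fm
```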

SECTION IV.

Proposed Method

This section proposes three approaches to fine-tuning the baseline VITS model. First, we incorporate LoRA for fine-tuning the VITS model to reduce the complexity of neural network parameters by decomposing them into lower dimensional representations. Second, LoRA-based fine-tuning is expanded with CLN for multi-speaker fine-tuning. Third, we apply the residual adapter to the text encoder outputs of the VITS model, which can enhance the representation of the prior distribution of the text encoder output. Fig. 2 illustrates how the parameter-efficient modules—LoRA, CLN, and residual adapter—are integrated into the VITS architecture, with specific colors used for each module. Compared with Fig. 1, Fig. 2 also indicates that the latent variable changes from z to z^{\prime } for a given new-speaker embedding g^{\prime } after applying such modules in the baseline VITS model.

FIGURE 2. Block diagram of the proposed fine-tuned VITS model.

A. Reduction of Model Parameters Based on LoRA

Instead of optimizing all model parameters, the LoRA-based fine-tuning method optimizes the parameters of a low-rank model [29]. Assuming that the pretrained model parameters are \Phi_{0}, fine-tuning finds \Phi = \Phi_{0} + \Delta\Phi, where \Delta\Phi is the parameter change introduced during fine-tuning. If \Delta\Phi can be reparameterized by a much smaller set of parameters \Theta (|\Theta| \ll |\Phi|), then \Phi is expressed as \Phi = \Phi_{0} + \Delta\Phi(\Theta).

Fig. 3(a) illustrates an example of applying LoRA to a weight matrix in the same way it is applied in our fine-tuning process. For a pretrained weight matrix W \in R^{m \times d}, its update can be constrained by a low-rank decomposition W + \Delta W = W + W_{\text{up}} W_{\text{dw}}, where W_{\text{up}} \in R^{m \times r} and W_{\text{dw}} \in R^{r \times d} with rank r \ll \min(m, d). During training, W is frozen and does not receive gradient updates, whereas W_{\text{up}} and W_{\text{dw}} contain the trainable parameters. Both W and \Delta W = W_{\text{up}} W_{\text{dw}} are multiplied by the same input x \in R^{d}, and the output vector h \in R^{m} is
\begin{equation*} h = Wx + \Delta W x = Wx + W_{\text{up}} W_{\text{dw}} x. \tag{13}\end{equation*}
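For illustration, the following minimal PyTorch sketch wraps a frozen linear layer with a rank-r LoRA update as in equation (13); the initialization and module names are our own assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: h = W x + W_up (W_dw x), with W frozen (Eq. (13))."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # W receives no gradient updates
        m, d = base.out_features, base.in_features
        self.W_dw = nn.Parameter(torch.randn(rank, d) * 0.01)  # r x d
        self.W_up = nn.Parameter(torch.zeros(m, rank))         # m x r, zero init so dW = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.W_dw.t() @ self.W_up.t()
```

For example, a 192x192 attention query projection could be wrapped as `LoRALinear(nn.Linear(192, 192), rank=8)`, adding only 2 x 192 x 8 trainable parameters per matrix.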

FIGURE 3. Network architectures of (a) LoRA applied to a weight matrix, (b) LoRA applied to the attention matrices in the transformer-based text encoder, (c) LoRA applied to a WaveNet residual block, and (d) LoRA applied to the MRF in the HiFi-GAN generator.

In this study, the LoRA module is integrated into six different parts of the baseline VITS model, as illustrated in Fig. 2: one LoRA for each of the two linear projection layers, and four more for the attention matrices of the transformer-based text encoder, each of the two WaveNets, and an upsampling layer of the generator. The linear projection layer is an important part of the VITS architecture because it projects the distributions of the posterior and prior encoders. Each layer uses a 192\times 384 matrix to split the 384 output channels and derive the mean and variance. LoRA is applied to the matrix W \in R^{192 \times 384} with W_{\text{up}} \in R^{192 \times r} and W_{\text{dw}} \in R^{r \times 384}, where r is set to 8 throughout this paper, following the setting described previously [29].

In addition to the linear projection layer, Fig. 3(b) shows the network architecture of LoRA applied to the attention matrices in the transformer-based text encoder. As shown in Fig. 3(b), the self-attention module of each transformer block includes four weight matrices: W_{\mathrm {q}},W_{\mathrm {k}},W_{\mathrm {v}} , and W_{\mathrm {o}} . Of these, W_{\mathrm {q}},W_{\mathrm {k}} , and W_{\mathrm {v}} have a size of 192\times 192 and project the input features c_{\mathrm {text}} into queries, keys, and values, respectively, to handle the dimensionality of the input and output effectively. Among these four matrices, LoRA is applied to W_{\mathrm {q}} and W_{\mathrm {v}} , which are crucial for generating queries and values.

As mentioned in Section III, the VITS architecture contains the WaveNet structure in two modules. The posterior encoder comprises 16 noncausal WaveNet residual blocks, whereas the flow-based decoder consists of four affine coupling layer stacks [48], each containing four WaveNet residual blocks. WaveNet fundamentally works through stacked residual blocks, each containing a dilated convolution layer, two activation functions, and a 1\times 1 convolutional layer, as shown in Fig. 3(c). WaveNet uses the gated activation unit [52], in which each layer combines a filter activation \tanh(\cdot) that extracts features of the input with a gate activation \sigma(\cdot) that decides how much of this information is passed to the next layer. In the VITS model, WaveNet is responsible for embedding conditional information, which is important for generating specific speakers' voices. This is achieved by incorporating a new-speaker embedding g' as a global condition within the WaveNet structure. To fine-tune the conditioning part, LoRA is applied to a 1\times 1 convolution layer, V_{\text{lora}}, in each block of the WaveNet that takes the information of g'. The operation of the conditional WaveNet can be formulated as
\begin{equation*} z = \tanh\left( W_{f} \ast x + V_{\text{lora}} g' \right) \odot \sigma\left( W_{g} \ast x + V_{\text{lora}} g' \right), \tag{14}\end{equation*}
where \ast denotes a convolution operation, \odot represents element-wise multiplication, and x is the input.
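The sketch below illustrates equation (14) for one residual block: a frozen 1x1 conditioning convolution plus a trainable low-rank (LoRA) update applied to the speaker embedding g'. The channel sizes, the use of two separate convolutions for the filter and gate paths, and the zero initialization are illustrative assumptions rather than the exact VITS code.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """One WaveNet-style gated block with a LoRA update on the 1x1 conditioning conv."""
    def __init__(self, channels: int = 192, gin_channels: int = 256, rank: int = 8):
        super().__init__()
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # W_f
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)     # W_g
        self.cond = nn.Conv1d(gin_channels, channels, kernel_size=1)                 # pretrained V (frozen)
        for p in self.cond.parameters():
            p.requires_grad = False
        # Low-rank update V_lora = V_up V_dw applied to the speaker embedding g'.
        self.v_dw = nn.Conv1d(gin_channels, rank, kernel_size=1, bias=False)
        self.v_up = nn.Conv1d(rank, channels, kernel_size=1, bias=False)
        nn.init.zeros_(self.v_up.weight)  # start from the pretrained conditioning behaviour

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # x: (B, channels, T) hidden sequence, g: (B, gin_channels, 1) speaker embedding
        c = self.cond(g) + self.v_up(self.v_dw(g))   # V g' + dV g'
        return torch.tanh(self.filter_conv(x) + c) * torch.sigmoid(self.gate_conv(x) + c)
```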

Finally, we apply LoRA to the generator whose MRF model structure is described in Fig. 3(d). MRF facilitates the formation of several different receptive field patterns to enrich speech with details and textures. Therefore, MRF fine-tuning is crucial for generating personalized voice; thus, LoRA is applied to the ConvTranspose layer connected to the MRF.

B. Conditional Layer Normalization for Multi-Speaker TTS

To handle speaker-specific variations with improved multi-speaker PEFT performance, we incorporate CLN by replacing LN [53] in the text encoder and SDP. Fig. 4(a) shows the conditional network comprising two linear layers, W_{\gamma} and W_{\beta}. These layers project the extracted speaker representation onto a scale vector \gamma_{s} and a bias vector \beta_{b}, which are essential components of CLN. Specifically, the speaker embedding vector g' is processed through W_{\gamma} and W_{\beta}, which produce \gamma_{s} and \beta_{b}, respectively:
\begin{equation*} \gamma_{s} = g' \times W_{\gamma}, \qquad \beta_{b} = g' \times W_{\beta}. \tag{15}\end{equation*}
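A minimal sketch of CLN following equation (15) is given below, assuming a hidden size of 192 and a speaker-embedding dimension of 256 (both placeholders); only W_{\gamma} and W_{\beta} are speaker-specific and need to be stored per speaker.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """CLN sketch: per-speaker scale/bias predicted from the speaker embedding g'."""
    def __init__(self, hidden: int = 192, spk_dim: int = 256):
        super().__init__()
        self.ln = nn.LayerNorm(hidden, elementwise_affine=False)  # normalization only
        self.W_gamma = nn.Linear(spk_dim, hidden)
        self.W_beta = nn.Linear(spk_dim, hidden)
        # Initialize so CLN starts out as plain LayerNorm (gamma = 1, beta = 0).
        nn.init.zeros_(self.W_gamma.weight); nn.init.ones_(self.W_gamma.bias)
        nn.init.zeros_(self.W_beta.weight); nn.init.zeros_(self.W_beta.bias)

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # x: (B, T, hidden) hidden sequence, g: (B, spk_dim) speaker embedding
        gamma = self.W_gamma(g).unsqueeze(1)   # scale vector, (B, 1, hidden)
        beta = self.W_beta(g).unsqueeze(1)     # bias vector
        return gamma * self.ln(x) + beta
```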

FIGURE 4. Network architecture of the (a) conditional layer normalization and (b) residual adapter used for the proposed fine-tuning approach.

Without CLN, all model parameters for each new speaker must be stored. However, by adjusting only the normalization parameters for each speaker, the model can achieve high-quality adaptation during multi-speaker optimization while significantly reducing the number of parameters. Specifically, applying CLN during fine-tuning requires storing only 0.02M parameters per speaker, which corresponds to ~0.05% of the model parameters required for full fine-tuning.

C. Residual Adapter for Expressive TTS

Although the application of LoRA and CLN enhanced performance, limitations in naturalness and pronunciation compared with full fine-tuning persisted. To address this issue, the expressiveness of the prior distribution of the new-speaker data must be enhanced during fine-tuning [12]. Accordingly, we attempted to increase the rank of the LoRA matrix applied to the text encoder; however, this did not yield a performance improvement. Therefore, a residual adapter [22], [32] was integrated into the text encoder output.

Fig. 4(b) shows the network architecture of a residual adapter, a modified version of the vanilla adapter [33], used for the proposed fine-tuning approach. As shown in Fig. 4(b), the residual adapter operates by initially projecting the text encoder output h_{\mathrm {lora}} through a down-projection feed-forward network {FF}_{\mathrm {down}} , which reduces the dimensionality to a lower-dimensional bottleneck. A rectified linear unit activation function [54] is then applied to add the nonlinearity to the output of {FF}_{\mathrm {down}}\mathrm {.} Next, the dimension is restored by an up-projection feed-forward network {FF}_{\mathrm {up}} . This residual adapter requires only 0.15M parameters.

The adapter incorporates a residual connection to ensure stable training and minimize the disruption to the original model architecture. This connection enables the original input h_{\text{lora}} to bypass the adapter and merge with the adapter output, which effectively allows the network to start training from a near-identity state and is crucial for maintaining the initial performance level. Note that we also incorporate dropout and LN with zero initialization of the final layer so that the residual adapter initially operates as an identity function. Accordingly, the residual adapter can be expressed as
\begin{equation*} h_{\text{adp}} = h_{\text{lora}} + \mathrm{LN}\left( \mathrm{FF}_{\text{up}}\left( \mathrm{ReLU}\left( \mathrm{FF}_{\text{down}}\left( h_{\text{lora}} \right) \right) \right) \right). \tag{16}\end{equation*}
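A minimal sketch of the residual adapter in equation (16) follows; the bottleneck width and dropout rate are assumptions chosen only to illustrate the structure.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck FFN with a residual connection, dropout, and a zero-initialized
    LayerNorm so the adapter starts out as an identity mapping (Eq. (16))."""
    def __init__(self, hidden: int = 192, bottleneck: int = 64, p_drop: float = 0.1):
        super().__init__()
        self.ff_down = nn.Linear(hidden, bottleneck)   # FF_down
        self.ff_up = nn.Linear(bottleneck, hidden)     # FF_up
        self.dropout = nn.Dropout(p_drop)
        self.ln = nn.LayerNorm(hidden)
        # Zero-initialize the final normalization so the adapter branch outputs zero
        # at the start of fine-tuning and the block behaves as an identity function.
        nn.init.zeros_(self.ln.weight)
        nn.init.zeros_(self.ln.bias)

    def forward(self, h_lora: torch.Tensor) -> torch.Tensor:
        h = self.ff_up(self.dropout(torch.relu(self.ff_down(h_lora))))
        return h_lora + self.ln(h)
```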

SECTION V.

Experiments and Results

A. Dataset

We utilized four datasets—VCTK [34], Libri-TTS-100 [35], Common Voice [36], and the Korean Multi-Speaker Speech Synthesis (KMSSS)1—to evaluate the performance of the TTS model using the proposed fine-tuning approaches. These datasets were selected for their different characteristics. For instance, the VCTK dataset comprised around 400 sentences spoken by 109 speakers. The audio format was a 16-bit PCM with a sampling rate of 48 kHz. This dataset was characterized by a similar number of speech samples per speaker and low variability in speech. Meanwhile, the Libri-TTS-100 dataset had a similar number of speech samples as VCTK but comprised 247 speakers. The total length of the audio data was approximately 54 h, with a sampling rate of 24 kHz. This dataset had fewer samples per speaker, an inconsistent number and length of speeches per speaker, and more variability in speech. The Common Voice dataset consisted of mono-channel, 16-bit MPEG-3 audio files at a sampling rate of 48 kHz. In this experiment, we organized a subset of 144 English speakers, each with ~1,000 samples, to ensure balanced data for fine-tuning. Compared with VCTK and Libri-TTS-100, this dataset offered a greater variation in speech, including various accents and dialects, recorded by volunteers from diverse linguistic backgrounds. Lastly, to investigate the model’s adaptability to non-English languages, a dataset was constructed from the KMSSS dataset by taking 184 speakers, where each speaker spoke 500 utterances at a sampling rate of 48 kHz.

For multi-speaker fine-tuning, a VITS model was pretrained using 100 speakers from the VCTK dataset, with five speakers for validation and four for fine-tuning and testing. In contrast, we pretrained, validated, and tested the VITS model with 220, 14, and 13 speakers, respectively, for the Libri-TTS-100 dataset, where we selected the four speakers with the highest number of samples in the test data for fine-tuning. For the Common Voice dataset, the VITS model was pretrained using 130 speakers, whereas ten and four speakers were used for validation and testing, respectively. Note that two out of the four speakers recorded speeches in environments with slight background noise, adding diversity to the data. Similarly, for the Korean dataset, we used 160 speakers for training, 20 for validation, and 4 for testing.

B. Experimental Setup

In our experimental setup, we resampled all the speech data at a sampling rate of 22 kHz. Then, we normalized the raw text sequences and converted the normalized sequences into the International Phonetic Alphabet sequence using an open-source phonemizer2 [55].
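As an illustration of this text front end, a short sketch using the open-source phonemizer package with an eSpeak backend is given below; the exact arguments and backend used in our setup may differ.

```python
# Requires the `phonemizer` package and an installed eSpeak backend; this is a
# generic usage sketch, not the authors' exact preprocessing script.
from phonemizer import phonemize

text = "Text-to-speech synthesizes speech from text."
ipa = phonemize(text, language="en-us", backend="espeak", strip=True)
print(ipa)  # IPA phoneme sequence fed to the text encoder
```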

To obtain our pretrained VITS model, we utilized the AdamW optimizer [56] with hyperparameters \beta_{1} = 0.8, \beta_{2} = 0.99, and weight decay \lambda = 0.01. The learning rate was initially set to 2 \times 10^{-4} and followed an exponential decay schedule with a factor of 0.991^{1/8}. The pretraining phase was conducted over 400 epochs using four NVIDIA A100 graphics processing units (GPUs). In the fine-tuning phase, all hyperparameters were maintained except for the learning rate and batch size, which were adjusted to 1 \times 10^{-5} and 32, respectively. Each fine-tuning process was run for 150 epochs on a single A100 GPU. To evaluate the model performance, we excluded 15 speech samples per speaker from the test dataset and used the remaining data for fine-tuning. We fine-tuned four speakers to evaluate the multi-speaker fine-tuning performance.
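For reference, a sketch of the fine-tuning optimizer and learning-rate schedule described above, written with PyTorch; the placeholder model, the per-epoch scheduler step, and the parameter filtering are our assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(192, 192)  # stand-in for the fine-tuned VITS modules
# In practice only the LoRA/CLN/adapter parameters would have requires_grad=True.
trainable = [p for p in model.parameters() if p.requires_grad]

optimizer = torch.optim.AdamW(trainable, lr=1e-5, betas=(0.8, 0.99), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.991 ** (1 / 8))

for epoch in range(150):
    # ... one pass over the adaptation data (batch size 32) goes here ...
    optimizer.step()      # would follow loss.backward() in a real training loop
    scheduler.step()      # decay the learning rate (assumed once per epoch)
```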

C. Evaluation Metrics

To evaluate the performance of the proposed fine-tuning approaches, we compared the synthesized speech with the reference speech using five objective metrics: SECS, WER, CER, NISQA-TTS3, and WV-MOS4.

SECS measured the cosine similarity between the speaker embedding of the synthesized speech and the reference speech audio. This value, which ranged from −1 to 1, indicated how closely the speaker’s vocal characteristics match. We computed the speaker embedding using the H/ASP model [37], a publicly available speaker verification model5 trained on VoxCeleb2 [57], a large-scale speech dataset.
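A minimal sketch of the SECS computation follows, assuming the speaker embeddings have already been extracted by a speaker verification model (the extraction itself is not shown).

```python
import torch
import torch.nn.functional as F

def secs(emb_synth: torch.Tensor, emb_ref: torch.Tensor) -> float:
    """Cosine similarity between synthesized and reference speaker embeddings."""
    return F.cosine_similarity(emb_synth.unsqueeze(0), emb_ref.unsqueeze(0)).item()

# Example with random placeholder embeddings of dimension 512.
print(secs(torch.randn(512), torch.randn(512)))
```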

WER (%) and CER (%) indicated the percentages of word and character recognition errors, respectively, in the transcript of the synthesized speech relative to the ground-truth text. For synthesized speech transcription, we used NeMo's stt_en_conformer_transducer_large model6 [38], which is based on the conformer transducer architecture, and computed these error rates using the Levenshtein distance algorithm7 [58]. A lower value suggests fewer pronunciation errors in the synthesized speech, indicating higher fidelity of the synthesized audio in adhering to the provided transcription.
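A small, self-contained sketch of the WER/CER computation with the Levenshtein distance is shown below; in our setup the ASR model provides the hypothesis transcript, which is omitted here.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[-1]

def wer(ref_text: str, hyp_text: str) -> float:
    ref, hyp = ref_text.split(), hyp_text.split()
    return 100.0 * levenshtein(ref, hyp) / max(len(ref), 1)

def cer(ref_text: str, hyp_text: str) -> float:
    ref, hyp = list(ref_text), list(hyp_text)
    return 100.0 * levenshtein(ref, hyp) / max(len(ref), 1)

print(wer("the cat sat", "the cat sad"), cer("the cat sat", "the cat sad"))
```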

NISQA-TTS was designed to predict the naturalness of synthetic speech, providing a nonintrusive evaluation without needing a reference signal in TTS systems. This metric predicted the naturalness score on a five-point scale consistent with the human MOS evaluation. Our work used the NISQA-TTS model to estimate the naturalness of the synthetic speech generated by our TTS system.

WV-MOS evaluated the overall quality of the utterances generated by each model and provided a score that ranged from 1 to 5 points. For MOS prediction, the WV-MOS model utilized a neural network architecture, wav2vec2.0, which was pretrained in a contrastive self-supervised manner, making it useful for various downstream tasks. The pretrained wav2vec2.0 model was fine-tuned using listening evaluation results from the Voice Conversion Challenge 2018 dataset [59]. In our study, we used WV-MOS to measure the overall quality of the generated speech for each fine-tuning method.

Objective metrics are not always reliable for measuring the perceived quality of speech synthesized by TTS models. Therefore, subjective evaluation is required to accurately assess speech quality. In this study, we compared the quality of synthesized speech obtained using our fine-tuning approach with that of the original speech using a CMOS on a seven-point scale ranging from −3 to 3. Ten people participated in the subjective test by listening to 10 randomly selected pairs of original and synthesized speeches.

D. Performance Evaluation

To examine the effectiveness of the proposed fine-tuning approaches on the objective and subjective quality of synthesized speech, we generated speech samples from the TTS models after applying different combinations of the proposed approaches. Table 1 compares the objective quality of the fine-tuned TTS models according to different combinations of the proposed fine-tuning approaches. The rightmost column of Table 1 compares the number of model parameters trained by each fine-tuned TTS model. In Table 1, Proj, AT, WN, and MRF denote the LoRA approach applied to the linear projection layers, the attention matrices in the transformer-based text encoder (shown in Fig. 3(b)), WaveNet (shown in Fig. 3(c)), and the MRF in the generator (shown in Fig. 3(d)), respectively. In addition, CLN and ADP signify the proposed approaches using the CLN and residual adapter, as described in Figs. 4(a) and 4(b), respectively. Moreover, to investigate the effect of fine-tuning the projection layers connected to the speaker embeddings g' on the performance, the four speaker embedding projection layers in the VITS model were either fine-tuned (✓) or frozen (✘), denoted as SEPL (speaker embedding projection layer) in the table.

TABLE 1. Comparison of the Number of Trained Model Parameters and Objective Quality Between the Fine-Tuned TTS Models According to Different Combinations of the Proposed Fine-Tuning Approaches Applied to the VCTK and Libri-TTS-100 Datasets

As revealed by Table 1, Model 1, in which only AT was fine-tuned, performed poorly overall. However, the metric scores of Model 2 showed that tuning WN was necessary to improve the overall speech quality, naturalness, and intelligibility. Model 3, trained by fine-tuning only the LoRA modules, indicated that it was difficult to capture the unique characteristics of the speaker's voice, making it challenging to represent the speaker accurately. Next, we fine-tuned the SEPLs in Model 4, which showed that tuning the SEPL increased the SECS score. Subsequently, we fine-tuned the VITS model with AT, WN, and SEPL together to create Model 5, which showed that fine-tuning these critical parts of the VITS model resulted in a higher overall quality of the synthesized speech. Further, Table 1 indicates that Model 6 improved the overall performance of the generated speech, particularly in terms of naturalness. However, Model 7 provided a lower overall quality but a higher SECS score than Models 1 to 4, which implied that SEPL and CLN could contribute to speaker similarity.

Based on the observation from Model 7, we applied SEPL and CLN to the subsequent fine-tuned models, Models 8 to 11. As shown in Table 1, Model 8 demonstrated higher performance than Model 6, highlighting the importance of CLN in the multi-speaker fine-tuning process with an additional increase of only 0.02M parameters. Meanwhile, Model 9 demonstrated higher performance in terms of WER, CER, and NISQA-TTS than Model 8 because of the addition of ADP. Moreover, Model 10 further improved speaker similarity and overall speech quality compared with Model 9 because the MRF was also fine-tuned.

Lastly, we employed all the proposed approaches to fine-tune the VITS model, referred to as Model 11. As shown in the 11th row of Table 1, Model 11 outperformed Models 1-10 on the objective metrics, with a 10% increase in the number of model parameters. Interestingly, Model 11 achieved only slightly lower performance than the full fine-tuning method. Finally, to investigate the effect of CLN on multi-speaker TTS, we fully fine-tuned the VITS model with CLN. The last two rows of Table 1 show that CLN considerably improved all the objective metrics compared with the model with full fine-tuning alone.

Additionally, we performed a subjective test on the synthesized speeches with the models whose NISQA-TTS was higher than 3.0. In particular, we chose Models 8–11 and two models with full fine-tuning with/without CLN adaptation. Table 2 compares the CMOS of the top six fine-tuned models, revealing that CMOS was closely related to either NISQA-TTS or WV-MOS. Although the proposed approaches had slightly lower CMOS values than the case of full fine-tuning, the participants’ survey confirmed that they demonstrated comparable listening results.

TABLE 2. Comparison of the Subjective Scores of the Top Six Fine-Tuned Models Measured in Terms of CMOS

E. Effect of Speaker-Related Techniques on Speaker Representation

We conducted a series of experiments to understand the effect of fine-tuning the speaker-related modules on speaker representations. Fig. 5 illustrates the t-distributed stochastic neighbor embedding (t-SNE) [60] plots of the latent vectors z of synthesized speeches from the test speakers in the VCTK dataset to compare the speaker clustering performance of different VITS models. As shown in Fig. 5(a), the latent vectors of Model 1 were distributed randomly, implying that the speakers were not clustered. In contrast, Model 7 (shown in Fig. 5(b)) provided better speaker clustering than Model 1, which implied that CLN was effective for speaker representation. We then plotted the latent vectors from Model 11, which showed the best subjective and objective performance, as presented in Tables 1 and 2, and achieved better speaker clustering than Model 7. Next, we compared the t-SNE plots of the fully fine-tuned models with and without CLN, as shown in Figs. 5(d) and 5(e), respectively. These plots demonstrated that CLN was also effective for speaker clustering, resulting in better objective and subjective quality scores.

FIGURE 5. Comparison of the t-SNE plots of latent vectors predicted by different models: (a) Model 1, (b) Model 7, (c) Model 11, (d) fully fine-tuned model, and (e) fully fine-tuned model with CLN, where the latent vectors were obtained from the test speakers on the VCTK dataset.

F. Setting the Rank of LoRA

This section examines the choice of rank r = 8 for LoRA-based fine-tuning. To this end, we compared r = 8 and r = 96. As shown in Fig. 3(b), LoRA was first applied to W_{\mathrm {q}} and W_{\mathrm {v}} in an attention module of the transformer-based text encoder because they are crucial for generating queries and values. Note that W_{\mathrm {q}} and W_{\mathrm {v}} are both (192\times 192) matrices. The LoRA matrix \Delta W(r)=W_{\mathrm {up}}W_{\mathrm {dw}} with rank r was then applied to W_{\mathrm {q}} or W_{\mathrm {v}}. Each \Delta W(r) for W_{\mathrm {q}} or W_{\mathrm {v}} was processed through singular value decomposition or eigenvalue decomposition to obtain its singular vectors or eigenvectors. We computed the similarity between the subspaces obtained from \Delta W(8) and \Delta W(96) as
\begin{equation*} \phi\left( \Delta W(8), \Delta W(96), i, j \right) = \frac{\left\| \left( \Delta W(8)^{i} \right)^{T} \Delta W(96)^{j} \right\|_{F}^{2}}{\min(i, j)}, \tag{17}\end{equation*}
where \Delta W(8)^{i} contains the top-i singular vectors or eigenvectors of \Delta W(8) and \Delta W(96)^{j} contains the top-j singular vectors or eigenvectors of \Delta W(96). Here, we compared r = 8 with r = 96 because r = 96 yields the same number of elements as W_{\mathrm {q}} (or W_{\mathrm {v}}), since 192\times 192 = 192\times 96 + 96\times 192. Fig. 6 depicts the similarity of eigenvectors between \Delta W(8) and \Delta W(96) for the attention query matrix W_{\mathrm {q}} and the attention value matrix W_{\mathrm {v}}. Accordingly, it appears that the rank r = 96 can be reduced to r = 8. Next, we repeated this experiment by applying LoRA to 1) a 1\times 1 convolution layer V_{\mathrm {lora}} in each block of the WaveNet and 2) the ConvTranspose layer C_{\mathrm {MRF}} connected to the MRF. Even though V_{\mathrm {lora}} and C_{\mathrm {MRF}} are (192\times 384) and (192\times 512) matrices, respectively, we again compared r = 8 with r = 96, whereas a more exact choice of r would be 128 or 139 because 192\times 384 = 192\times 128 + 128\times 384 and 192\times 512 \approx 192\times 139 + 139\times 512. Fig. 7 likewise depicts the similarity of the singular vectors between \Delta W(8) and \Delta W(96) for the WaveNet layer matrix V_{\mathrm {lora}} and the ConvTranspose layer matrix C_{\mathrm {MRF}}, which implies that rank r = 8 could be reduced even further. Consequently, we set r = 8 following the previous recommendation [29] and these supporting experiments.
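A sketch of the subspace-similarity measure in equation (17) is given below, using the right singular vectors of the low-rank updates; the choice of right (rather than left) singular vectors and the random example matrices are assumptions made only for illustration.

```python
import torch

def subspace_similarity(dw_8: torch.Tensor, dw_96: torch.Tensor, i: int, j: int) -> float:
    """Project the top-i singular vectors of dW(8) onto the top-j singular vectors
    of dW(96) and normalize the squared Frobenius norm by min(i, j) (Eq. (17))."""
    v8 = torch.linalg.svd(dw_8, full_matrices=False).Vh[:i]    # (i, d)
    v96 = torch.linalg.svd(dw_96, full_matrices=False).Vh[:j]  # (j, d)
    return (torch.linalg.norm(v8 @ v96.t()) ** 2 / min(i, j)).item()

# Example with random rank-8 and rank-96 updates for a 192x192 attention matrix.
dw8 = torch.randn(192, 8) @ torch.randn(8, 192)
dw96 = torch.randn(192, 96) @ torch.randn(96, 192)
print(subspace_similarity(dw8, dw96, i=8, j=8))
```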

FIGURE 6. Similarity of eigenvectors between \Delta W(8) and \Delta W(96) for (a) the attention query matrix W_{\mathrm {q}} and (b) the attention value matrix W_{\mathrm {v}}. The map illustrates the similarity between the eigenvectors of \Delta W(8) and the top 8 eigenvectors of \Delta W(96).

FIGURE 7. Similarity of singular vectors between \Delta W(8) and \Delta W(96) for (a) a 1\times 1 convolution layer V_{\mathrm {lora}} in a block of the WaveNet and (b) the ConvTranspose layer C_{\mathrm {MRF}} connected to the MRF. The map illustrates the similarity between the singular vectors of \Delta W(8) and the top 8 singular vectors of \Delta W(96).

Lastly, we applied the proposed method to fine-tune models with LoRA rank r = 1 to further evaluate the performance. We measured each model's performance using the NISQA-TTS and WV-MOS metrics. Table 3 compares the NISQA-TTS and WV-MOS scores of the TTS models when LoRA ranks of r = 1, 8, and 96 were applied on the VCTK and Libri-TTS-100 datasets. The results demonstrated that increasing the rank did not improve the fine-tuning performance of the TTS model; instead, it led to a performance decline. Consequently, the TTS model with r = 8 achieved the highest performance among the three ranks, confirming that r = 8 was the best choice among the tested ranks.

TABLE 3. Performance Comparison of Different LoRA Ranks r = 1, 8, and 96 Using the NISQA-TTS and WV-MOS Metrics Averaged Across the VCTK and Libri-TTS Datasets

G. Comparison of Synthesis and Training Speed

We evaluated the speech synthesis and training speed of our model, focusing on the additional complexity incurred when each new module was added. We measured two indices: the average fine-tuning time per epoch (in seconds) and the real-time factor (RTF). All measurements were performed on a single A100 GPU with a batch size of 1, and the fine-tuning time was measured using 1,546 sentences over 20 epochs. Table 4 compares the average fine-tuning time and RTFs of the different models. A comparison of Model 8 with Models 5 and 6 indicated that CLN increased the average fine-tuning time per epoch from 23.76 to 26.36 s, and a comparison of Models 10 and 8 showed that the MRF added a further 2.18 s. Nevertheless, Model 11, which was fine-tuned with all the proposed approaches, still fine-tuned much faster than the fully fine-tuned model. In contrast, the RTF grew in proportion to the number of added modules, which corresponded to the additional model parameters given in the rightmost column of Table 1. As expected, the fully fine-tuned model had the lowest RTF among all the models compared in Table 4. However, Model 11, which had the highest RTF among the proposed PEFT methods, still synthesized speech in real time. When Model 11 was run on lower-resource GPUs, such as the NVIDIA TITAN X and RTX 2080 Ti, the proposed method was confirmed to operate in real time under low-resource conditions as well.
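To make the RTF measurement concrete, a minimal sketch is given below, assuming a generic synthesize(text) callable that returns a waveform array at a known sampling rate; the function names and the 22,050 Hz rate are placeholders rather than details of the actual codebase.

```python
import time

def measure_rtf(synthesize, sentences, sample_rate=22050):
    """Real-time factor: total synthesis time divided by the total duration
    of the generated audio, accumulated over a list of test sentences."""
    total_synth_time, total_audio_time = 0.0, 0.0
    for text in sentences:
        start = time.perf_counter()
        waveform = synthesize(text)                 # placeholder TTS call returning a 1-D array
        total_synth_time += time.perf_counter() - start
        total_audio_time += len(waveform) / sample_rate
    return total_synth_time / total_audio_time      # RTF < 1.0 means faster than real time
```

When timing on a GPU, the device should also be synchronized (e.g., torch.cuda.synchronize()) before reading the clock so that asynchronous kernels are fully accounted for.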

TABLE 4 Complexity Comparison of the Average Fine-Tuning Time Per Epoch and the RTF According to Different TTS Models

H. Objective Quality According to Different Datasets

In this section, we break down the performance evaluation results shown in Table 1 by dataset: VCTK and Libri-TTS-100. We conducted an ablation study over the different fine-tuning methods and assessed the performance of the proposed method, comparing the results with those of full fine-tuning and the ground truth. Table 5 presents the evaluation results using the objective metrics of the different methods on the VCTK dataset, whereas Table 6 provides the corresponding results on the Libri-TTS-100 dataset.

TABLE 5 Comparison of the Number of Trained Model Parameters and Objective Quality Between the Fine-Tuned TTS Models According to Different Combinations of the Proposed Fine-Tuning Approaches Applied to the Publicly Available VCTK Dataset
TABLE 6 Comparison of the Number of Trained Model Parameters and Objective Quality Between the Fine-Tuned TTS Models According to Different Combinations of the Proposed Fine-Tuning Approaches Applied to the Publicly Available Libri-TTS-100 Dataset

As shown in Tables 5 and 6, Model 5, which fine-tuned AT, WN, and SEPL together, showed higher overall quality of the synthesized speech, achieving SECS values of 0.591 and 0.526 and WV-MOS scores of 3.85 and 3.49 on the two datasets. By additionally fine-tuning Proj, Model 6 improved the overall quality of the generated speech, with NISQA-TTS scores of 2.84 ± 0.17 and 3.11 ± 0.31. However, when Model 6 was fine-tuned with four speakers, its performance was lower than that obtained by fine-tuning with a single speaker. Model 7, which was also fine-tuned with four speakers, achieved higher performance, highlighting the importance of CLN in the multi-speaker fine-tuning process; its objective performance also improved over that of Model 6.
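For reference, SECS is commonly computed as the cosine similarity between speaker embeddings of the synthesized and ground-truth utterances; the sketch below assumes the embeddings have already been extracted by a speaker encoder, which is outside the snippet.

```python
import numpy as np

def secs(emb_synth: np.ndarray, emb_ref: np.ndarray) -> float:
    """Speaker embedding cosine similarity between a synthesized utterance
    and a reference utterance, each represented by a fixed-size vector."""
    return float(
        np.dot(emb_synth, emb_ref)
        / (np.linalg.norm(emb_synth) * np.linalg.norm(emb_ref))
    )
```

Values closer to 1 indicate that the synthesized voice is closer to the target speaker.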

Model 10, which fine-tuned the MRF with LoRA, showed an improvement in the overall quality across both datasets. Model 11 further enhanced this performance by incorporating ADP to improve the expressiveness of the prior distribution, resulting in the best performance. Overall, the models fine-tuned on the VCTK dataset outperformed those fine-tuned on Libri-TTS-100; however, owing to the characteristics of the Libri-TTS-100 dataset, it yielded a higher naturalness score, and CLN had a more significant impact during its fine-tuning process.

In addition to the VCTK and Libri-TTS-100 datasets, the performance of the fine-tuned models was evaluated on the Common Voice dataset to assess the robustness of the proposed PEFT method against diverse accents and speech styles. Table 7 presents the evaluation results using the objective metrics of the different methods on the Common Voice dataset. Notably, the WV-MOS score of the ground-truth samples was 3.78, which was lower than the scores obtained from both the VCTK and Libri-TTS-100 datasets. This was due to the characteristics of the Common Voice dataset, such as its various accents and slight background noise, which increased the CER and WER. Compared with the results in Tables 5 and 6, the tendency of performance variation across the different combinations of the proposed fine-tuning approaches was similar to that observed on the VCTK and Libri-TTS-100 datasets. In other words, full fine-tuning with CLN provided better performance than conventional full fine-tuning and an overall performance comparable to the ground truth, with an SECS score of 0.696, demonstrating the effectiveness of the fine-tuning process. Moreover, the performance of Model 11 was the best among the models fine-tuned using the proposed PEFT method, and it maintained WV-MOS and NISQA-TTS scores comparable to those of full fine-tuning, suggesting that the proposed PEFT method can be effective even when applied to more challenging speech samples.

TABLE 7 Comparison of the Number of Trained Model Parameters and Objective Quality Between the Fine-Tuned TTS Models According to Different Combinations of the Proposed Fine-Tuning Approaches Applied to the Publicly Available Common Voice Dataset

I. Adaptation Results With the Korean Dataset

In this section, we apply the proposed PEFT method to the KMSSS dataset to examine how variations in pronunciation and tone across languages affect the model’s adaptability. Table 8 presents the evaluation results using the objective metrics of the different methods on the KMSSS dataset. Although the proposed PEFT method was applied to fine-tune a pretrained Korean TTS model, the performance difference between Model 11 and full fine-tuning with CLN on the Korean dataset was consistent with that on the English datasets, as shown in Tables 5–7. This implies that even though the PEFT method was developed using English datasets, it can be applied to other languages as well. Rather, the most critical factor for applying PEFT is whether the pretrained TTS model is capable of producing proper synthetic and personalized speech when the proposed fine-tuning method is applied, so that optimal performance can be achieved. In conclusion, the proposed PEFT method can be effectively applied in various scenarios, regardless of the language.

TABLE 8 Comparison of the Number of Trained Model Parameters and Objective Quality Between the Fine-Tuned TTS Models According to Different Combinations of the Proposed Fine-Tuning Approaches Applied to the Publicly Available KMSSS Dataset

J. Comparison With Zero-Shot TTS Models

In this section, we compare the performance of our model, which incorporated the proposed PEFT method, with two zero-shot TTS models: YourTTS [11] and XTTS [14]. YourTTS is a VITS-based E2E TTS model that utilizes the H/ASP model’s output as the speaker embedding and applies a speaker consistency loss to ensure high speaker similarity between synthetic and ground-truth speech. Meanwhile, XTTS builds on Tortoise [61] but introduces several novel modifications to enable multilingual training, enhance zero-shot TTS performance, and achieve faster training and inference. Because we aimed to evaluate zero-shot TTS performance, we utilized the open-source YourTTS8 model and XTTS-v29 without any fine-tuning. To evaluate these three TTS models, including our Model 11, we prepared 120 samples by taking 60 samples each from the VCTK and Libri-TTS-100 datasets.

Table 9 compares the objective metrics of Model 11, YourTTS, and XTTS-v2. Model 11 outperformed YourTTS in all objective metrics. Meanwhile, XTTS achieved slightly better WER, CER, and NISQA-TTS values than Model 11, but Model 11 performed much better in terms of SECS, which is the most important metric for personalized speech. Therefore, we concluded that the proposed PEFT method was more effective than the zero-shot TTS models for generating personalized speech.

TABLE 9 Comparison of the Objective Metrics Between Model 11, Your-TTS, and XTTS. The Results are Averaged Across the Test Samples From the VCTK and Libri-TTS

K. Effect of LoRA on Information Flow in the Flow-Based Decoder

Herein, we investigate whether applying LoRA to the flow-based decoder affects its invertibility during inference. Fig. 8 illustrates the three-dimensional (3D) t-SNE of each k-th latent variable from f_{k}(\cdot) and f_{k}^{-1}(\cdot) , as described in equation (3). The final output (k=4 ) was passed backward through the same layers to reconstruct the original input, as depicted in Fig. 8(b) (full fine-tuning) and Fig. 8(d) (Model 11).
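As a complementary sanity check of the same property, the toy sketch below builds an affine coupling layer whose conditioning network contains a LoRA-style 1\times 1 convolution and verifies that a forward pass followed by the inverse pass reconstructs the input; the module and its sizes are illustrative and do not reproduce the VITS flow decoder.

```python
import torch
import torch.nn as nn

class ToyAffineCoupling(nn.Module):
    """Splits channels into (xa, xb); xb is scaled and shifted using statistics
    predicted from xa, so the layer remains exactly invertible even when the
    conditioning network contains a LoRA update."""
    def __init__(self, channels: int = 192, rank: int = 8):
        super().__init__()
        half = channels // 2
        self.base = nn.Conv1d(half, channels, kernel_size=1)              # stands in for pretrained weights
        self.lora_dw = nn.Conv1d(half, rank, kernel_size=1, bias=False)   # LoRA down-projection
        self.lora_up = nn.Conv1d(rank, channels, kernel_size=1, bias=False)
        nn.init.zeros_(self.lora_up.weight)                               # LoRA starts as a no-op

    def _stats(self, xa):
        h = self.base(xa) + self.lora_up(self.lora_dw(xa))
        return h.chunk(2, dim=1)                                          # (log_s, t)

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self._stats(xa)
        return torch.cat([xa, xb * torch.exp(log_s) + t], dim=1)

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self._stats(ya)                                        # xa is untouched, so stats match
        return torch.cat([ya, (yb - t) * torch.exp(-log_s)], dim=1)

x = torch.randn(2, 192, 50)                                               # (batch, channels, frames)
flow = ToyAffineCoupling()
print(torch.allclose(flow.inverse(flow(x)), x, atol=1e-5))                # True: round trip recovers x
```

Because the coupling transform only reads the untouched half of the channels, any change to the conditioning network, including a LoRA update, leaves the analytic inverse intact.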

FIGURE 8. 3D t-SNE plots of the latent variables in the k-th forward flow and backward flow (\boldsymbol {k}\mathbf {=1,\cdots,4} ): (a) forward flows and (b) backward flows of the flow-based decoder after full fine-tuning, and (c) forward flows and (d) backward flows of the flow-based decoder after applying LoRA in Model 11. The plots were obtained using 32 samples from the VCTK dataset.

To measure the difference between the initial forward data distribution and the final backward distribution, we used centered kernel alignment (CKA) [62]. CKA quantifies the similarity between pairs of neural network representations and is invariant to isotropic scaling, which allows a robust assessment of how similar the data distributions are before and after passing through the flow layers. Note that the CKA calculations were performed using open-source code.10
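For reference, the linear form of CKA can be computed as follows; this is a generic sketch of the published formulation [62] applied to two matrices of latent variables (samples \times features), not the authors' exact script.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear centered kernel alignment between two representation matrices of
    shape (n_samples, n_features); invariant to isotropic scaling and to
    orthogonal transformations of the feature space."""
    X = X - X.mean(axis=0, keepdims=True)     # center every feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

# Hypothetical usage: compare the latents entering the first forward flow with
# those returned by the last backward flow over a small batch of utterances.
z_forward = np.random.randn(32, 192)
z_backward = z_forward + 0.01 * np.random.randn(32, 192)
print(linear_cka(z_forward, z_backward))      # close to 1.0 for nearly identical latents
```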

Table 10 compares the CKA accuracy between the latent variables of the first forward layer and the last backward layer outputs for the fully fine-tuned model, Model 2, and Model 11. We compared Model 11 with Model 2 because Model 2 was the first configuration in which LoRA was applied to the WaveNet in the flow network.

TABLE 10 Comparison of the CKA Accuracy Between the Latent Variables of the First Forward and the Last Backward Layer Outputs, Applied to the Full Fine-Tuned Model, Model 2, and Model 11

As shown in the table, Model 11 achieved a CKA accuracy of 96.19, whereas that of the fully fine-tuned model was 96.31. Such a high similarity score for Model 11 demonstrates that integrating LoRA into the flow-based decoder preserved flow invertibility. Furthermore, Model 2 achieved a CKA accuracy of 95.98; although Model 2 showed lower speech quality because of its fewer tuned parameters, its CKA score indicated that the invertibility of the flow transformations was likewise maintained. These results confirmed that integrating LoRA did not affect the invertibility of the flow-based transformations.

SECTION VI.

Conclusion

In this paper, we proposed several fine-tuning approaches to improve the performance of an E2E multi-speaker TTS model by efficiently adapting it to new speakers. To this end, we first proposed a LoRA-based fine-tuning approach to achieve speech quality comparable to that of a fully fine-tuned model while updating a much smaller number of model parameters. Second, a CLN-based fine-tuning approach was proposed to handle speaker-specific variation and improve multi-speaker PEFT performance. Third, a residual adapter was integrated into the text encoder output to improve the expressiveness of the prior distribution. We constructed the VITS models using the VCTK, Libri-TTS-100, Common Voice, and Korean multi-speaker datasets according to different combinations of the proposed fine-tuning approaches (i.e., LoRA, CLN, and the residual adapter). The model performance was evaluated using five objective measures, namely, SECS, WER, CER, NISQA-TTS, and WV-MOS, as well as a subjective listening test measuring CMOS. The performance comparison revealed that LoRA improved the overall objective measures but was limited in improving the subjective quality for multi-speaker TTS, whereas combining LoRA and CLN improved the speech quality compared to using LoRA alone. In addition, the VITS model fine-tuned with all the proposed approaches provided objective and subjective speech quality comparable to that of the fully fine-tuned model. Next, we investigated the effect of the proposed fine-tuning approaches on speaker clustering; the t-SNE comparison showed that CLN was effective in separating speakers in the latent space. Finally, the complexity comparison based on the average fine-tuning time and RTF showed that the proposed fine-tuning approaches incurred less fine-tuning complexity than the full fine-tuning approach while maintaining real-time synthesis.

Despite these promising results, the proposed approaches have limitations that call for future work. First, although the proposed PEFT method achieved good performance, it is still not as effective as full fine-tuning. Second, the baseline model structure exhibited limited adaptability when dealing with challenging datasets such as Common Voice. These limitations can be mitigated by enhancing the adaptability of the pretrained model through structural modifications or by adding new modules to the baseline model architecture. Furthermore, additional adapters can be integrated into components of the system beyond the prior encoder of the VITS model to assess their potential for further performance improvement. By focusing on these aspects, we aim to advance the adaptability and efficiency of PEFT approaches in multi-speaker TTS systems.
