Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

Neural text-to-speech (TTS) systems have made significant progress in generating natural synthetic speech. However, neural TTS requires large amounts of paired training data, which limits its applicability to a small number of resource-rich languages. Previous work on low-resource TTS has addressed this data hunger through transfer learning from a multilingual model to low-resource languages, but it still relies heavily on the availability of paired data for the target languages. In this paper, we propose a text-inductive language adaptation framework for low-resource TTS that reduces the cost of collecting paired data for low-resource languages. To inject textual knowledge during transfer learning, our framework employs a two-stage adaptation scheme that utilizes both text-only and supervised data for the target language. In the text-based adaptation stage, we update the language-aware embedding layer with a masked language model objective using text-only data for the target language. In the supervised adaptation stage, the entire TTS model is updated using paired data for the target language. We also propose a graphone-based multilingual training method that jointly uses graphemes and International Phonetic Alphabet symbols (referred to as graphones) for resource-rich languages, while using only graphemes for low-resource languages. This approach facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages. Through extensive evaluations, we demonstrate that 1) our framework with text-based adaptation outperforms the previous supervised transfer learning approach and 2) the proposed graphone-based training method further improves the performance of both multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieved highly intelligible synthetic speech with a character error rate of around 6% for a target language.


I. INTRODUCTION
Recent neural text-to-speech (TTS) systems [2] have made remarkable progress in producing natural-sounding synthetic speech. However, neural TTS requires large amounts of data, thus restricting its accessibility to a limited number of resource-rich languages that represent only a small fraction of the thousands of languages spoken worldwide [3], [4]. This limitation is primarily due to the challenges of collecting sufficient TTS corpora for low-resource languages, where well-designed recording environments and careful annotation are essential. Previous studies on low-resource TTS [5], [6] have mainly focused on transfer learning from multilingual models to target low-resource languages. While this framework is simple and effective, it heavily relies on the availability of high-quality paired TTS corpora for the target language, and performance declines significantly when the amount of paired data is limited [7].
To address the cost of data collection for TTS, previous research has explored training TTS models with unannotated data, such as untranscribed speech [8], [9] or text-only data [9], [10], [11]. Text-only data is particularly appealing because it is easy to collect and does not contain sensitive speaker-related information. We hypothesize that transfer learning from a multilingual model to low-resource languages (referred to as low-resource language adaptation) can be improved by exploiting textual information for the target language. Our work therefore aims at leveraging text-only data to improve low-resource language adaptation, even when only limited paired data is available.
There has been some work on efficient adaptation methods for natural language processing (NLP) tasks, which adapt BERT [12] models fine-tuned for specific languages and tasks to new languages. Previous work has demonstrated the feasibility of transferring a masked language model (MLM) to new languages by learning a new embedding layer [13] or adapter [14] with the original MLM objectives. This approach can efficiently adapt the model to target languages without labeled data, since it only requires updating the embedding layer or adapter network with text-only data. Inspired by this approach, our framework adapts a multilingual TTS model to a new language using text-based adaptation with an MLM objective.

Fig. 1. Overview of our proposed framework for low-resource TTS: In contrast to the previous approach (indicated by the region enclosed by the red dashed line), which employs transfer learning from a multilingual model to the target language in a supervised manner, our framework progressively performs language adaptation based on text-based adaptation (indicated by the region enclosed by the blue line) and supervised adaptation. We also propose a graphone-based multilingual training method to facilitate pronunciation knowledge transfer from resource-rich to low-resource languages. de, ru, hu, and it stand for the language codes defined in Table I, i.e., German, Russian, Hungarian, and Italian, respectively.
In this paper, we propose a text-inductive language adaptation framework for low-resource TTS. Our approach, illustrated in Fig. 1, differs from previous methods by incorporating textual information to improve language adaptation. Our framework includes a two-stage adaptation process: 1) text-based adaptation with an MLM objective using text-only data for the target language, and 2) supervised adaptation using a limited amount of paired data for the target language. During the text-based adaptation, only a language-aware embedding layer is updated with text data including the target language. In the supervised adaptation stage, we further update the entire TTS model using the paired data for the target language. In addition, we propose a graphone-based training method to further improve text-inductive language adaptation, which jointly employs graphemes and International Phonetic Alphabet (IPA) symbols (referred to as graphones) for resource-rich languages while using only graphemes for low-resource languages. The main contributions of this work are as follows:
- We propose a graphone-based multilingual training method for low-resource TTS. This method allows flexible switching between different types of tokens (i.e., graphemes and IPA symbols) to facilitate the transfer of pronunciation knowledge from resource-rich to low-resource languages. It improves the performance of low-resource language adaptation without requiring grapheme-to-phoneme modules for target languages, as well as that of multilingual TTS training itself.
- Our experimental evaluations demonstrate that our framework significantly improves the performance of low-resource language adaptation for the target languages. With only 5 minutes of paired data for fine-tuning, our method achieves highly intelligible synthetic speech with a character error rate of around 6% for a target language. The subjective evaluations showed that the proposed text-inductive graphone-based language adaptation method achieved significantly higher naturalness and speaker similarity than baseline methods for two target languages.

The remainder of the paper is organized as follows. Section II reviews previous studies relevant to our work and presents the preliminary knowledge needed to understand our method. In Section III, we detail our proposed text-inductive language adaptation framework. In Section IV, we investigate the effectiveness of our method and provide further discussion. Finally, Section V concludes this paper and outlines future work.

II. BACKGROUND AND RELATED WORK
To understand and motivate our method, we provide a brief review of the related work and the background knowledge including mathematical notations.

A. Low-Resource Language Adaptation for TTS
Early research on multilingual TTS [18], [19], [20] focused on developing TTS models for several resource-rich languages. To accommodate a wider range of languages, recent studies have also focused on low-resource languages. Collecting a sufficient amount of high-quality paired training data for low-resource languages can be challenging; therefore, previous work has adapted a TTS model trained on resource-rich languages to low-resource languages [5], [6], [7], [21]. This approach first pretrains a multilingual TTS model and then fine-tunes it for low-resource languages, thereby improving performance by exploiting the multilingual knowledge embedded in the pretrained TTS model.
The previous transfer-learning-based approach is illustrated in the region enclosed by the red dashed line in Fig. 1. Let the paired datasets of the resource-rich languages and the target low-resource language be denoted by $\mathcal{D}^{\text{paired}}$ and $\widetilde{\mathcal{D}}^{\text{paired}}$, respectively. First, the entire set of parameters of the TTS model is trained on $\mathcal{D}^{\text{paired}}$. Next, the initial parameters for the adaptation are set to the pretrained parameters obtained from the multilingual training. Finally, the entire set of TTS model parameters is fine-tuned on $\widetilde{\mathcal{D}}^{\text{paired}}$. A drawback of this approach is its heavy reliance on the amount of $\widetilde{\mathcal{D}}^{\text{paired}}$, which presents a challenge when developing a TTS system for low-resource languages.

B. Other Multilingual Low-Resource TTS Approaches
To address the data collection problem for multilingual low-resource TTS, previous research has explored the use of unannotated data [9], [22]. Some studies have trained TTS models with untranscribed speech using a vector-quantized variational autoencoder [8] or an unsupervised automatic speech recognition (ASR) model [23]. In addition, other work has used text-only data to train TTS models in low-resource scenarios [11], [22]. However, these approaches require training the entire model from scratch to adapt to new languages using unpaired data. In contrast, our proposed method efficiently incorporates textual information of the target language during low-resource language adaptation, rather than during the pretraining stage. In a previous study, a phonetic transformation network was trained to map symbol sequences of different languages by using a pretrained ASR model [6]. While this method requires an ASR model trained on speech corpora of the target language, our approach adapts the language-aware embedding layer using only text data with MLM objectives. Previous multilingual TTS work has explored various input tokens, including bytes [7], [9], [11], [24], IPA symbols [11], [25], [26], articulatory features [21], and sentence-piece tokens [9]. Our work proposes a graphone-based training method for multilingual low-resource TTS, which allows the flexible use of both graphemes and IPA symbols for each language.

C. Cross-Lingual Language Model Adaptation
Cross-lingual representation learning has been thoroughly investigated in the field of NLP [27], [28]. In particular, multilingual BERT [12] has shown remarkable cross-lingual transferability, even achieving zero-shot cross-lingual transfer [29], [30]. Although multilingual language model pretraining is effective, training new models from scratch to accommodate additional languages can be expensive. Consequently, several studies have investigated the adaptation of existing pretrained language models to new languages in a parameter-efficient manner. Artetxe et al. demonstrated the feasibility of adapting a monolingual MLM to new languages by learning only a new embedding layer using the MLM objective [13]. Other research has explored language adaptation of MLMs by using matrix factorization [31] or by introducing additional parameters with adapters [14], [32]. Our study investigates this parameter-efficient adaptation framework in the context of TTS and proposes a text-based adaptation method that adapts a language-aware embedding layer using text-only data for the target languages. Similar to a previous study that expanded a multilingual pretrained language model to support 1600 languages [33], our text-based adaptation enables TTS frontend adaptation using languages not present in TTS corpora, thereby increasing the number of languages accommodated by TTS systems. Moreover, we propose a graphone-based training approach for both MLM pretraining and supervised TTS training, specifically designed for TTS applications, which can improve TTS performance.

D. Unsupervised Text Pretraining for Multilingual TTS
Previous work [11] has proposed an unsupervised multilingual text pretraining method and demonstrated its effectiveness for multilingual TTS, even achieving highly intelligible zero-shot cross-lingual transfer. That work utilized text-only data for zero-shot TTS, with the aim of developing a TTS system for a low-resource language for which only textual resources are available. In contrast, our work proposes a method for incorporating text-only data in few-shot fine-tuning. Our method addresses a practical scenario where a limited amount of high-quality speech-text paired data is available. The previous work used text data through MLM pretraining [29], [30], whereas our approach incorporates text data via efficient MLM adaptation [13], [14]. Moreover, we propose a graphone-based training method and demonstrate its capacity to improve the performance of both multilingual TTS and low-resource adaptation.
Here we describe the text pretraining framework of the previous work [11]. Let $X = (x_n \in \mathcal{V} \mid n = 1, \ldots, N)$ denote the input text token sequence of length $N$, where $\mathcal{V}$ denotes a vocabulary constructed for pretraining. Note that our graphone-based training method presented in Section III-D extends $\mathcal{V}$ by using both UTF-8 bytes and IPA symbols, while the previous work [11] used only a single type of token. Let $\mathcal{D}^{\text{text}}$ and $\mathcal{L}^{\text{text}}$ denote the text dataset used in the multilingual text pretraining and the set of language IDs included in $\mathcal{D}^{\text{text}}$, respectively. Note that $\mathcal{L}^{\text{text}}$ can include languages that are not present in the paired training data for multilingual TTS. Let $X^m$ denote the masked token sequence. We then use $X^m$ and a language ID $l^{\text{text}} \in \mathcal{L}^{\text{text}}$ as the input of the model. Let the token embedding sequence and language embedding be $Z^m = (z^m_n \in \mathbb{R}^d \mid n = 1, \ldots, N)$ and $e_l \in \mathbb{R}^d$, respectively. Then $Z^m$ and $e_l$ are obtained through embedding layers as:

$$Z^m = \mathrm{Embed}(X^m; \theta_T), \tag{1}$$
$$e_l = \mathrm{Embed}(l^{\text{text}}; \theta_L), \tag{2}$$

where $\theta_T$ and $\theta_L$ denote the model parameters of the token embedding and language embedding layers, respectively. The previous work [11] used a bottleneck layer to project the token and language embeddings obtained in (1) and (2) to a hidden encoder input. Let $\theta_B$ denote the model parameters of the bottleneck layer. We now define the network with the model parameters $\theta_{\text{lae}} = \{\theta_B, \theta_T, \theta_L\}$ as the language-aware embedding layer, which jointly embeds the token sequence $X^m$ and the language ID $l^{\text{text}}$:

$$H_{\text{in}} = \mathrm{LAE}(X^m, l^{\text{text}}; \theta_{\text{lae}}), \tag{3}$$

where $H_{\text{in}} = (h_{\text{in},n} \in \mathbb{R}^d \mid n = 1, \ldots, N)$ denotes the hidden vector sequence for the encoder input. Let $H_{\text{out}} = (h_{\text{out},n} \in \mathbb{R}^d \mid n = 1, \ldots, N)$ denote the hidden vector sequence of the encoder output, and let $\theta_E$ and $\theta_P$ denote the model parameters of the encoder and a prediction network, respectively. Then the conditional probability $p(X \mid X_{-\Pi})$, which is used to compute the training objective, is obtained as:

$$H_{\text{out}} = \mathrm{Encoder}(H_{\text{in}}; \theta_E), \tag{4}$$
$$p(X \mid X_{-\Pi}) = \mathrm{Softmax}(\mathrm{Predict}(H_{\text{out}}; \theta_P)), \tag{5}$$

where $\mathrm{Softmax}(\cdot)$ denotes a softmax function and $X_{-\Pi}$ denotes the unmasked tokens. Let $\Pi = (\pi_k \in \mathbb{N} \mid k = 1, \ldots, K)$ denote the indexes of the $K$ masked tokens. The training objective can then be defined as:

$$\mathcal{L}_{\text{mlm}} = -\sum_{k=1}^{K} \log p(x_{\pi_k} \mid X_{-\Pi}), \tag{6}$$
$$\{\hat{\theta}_E, \hat{\theta}_{\text{lae}}\} = \mathop{\arg\min}_{\{\theta_E, \theta_{\text{lae}}\}} \mathcal{L}_{\text{mlm}}. \tag{7}$$
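To make the formulation above concrete, the following is a minimal PyTorch sketch of the language-aware embedding layer in (1)-(3) and the masked-position loss in (6). The module names, dimensions, and the concatenation-based bottleneck are illustrative assumptions, not the exact implementation of [11].

```python
import torch
import torch.nn as nn

class LanguageAwareEmbedding(nn.Module):
    """Jointly embeds a token sequence X^m and a language ID l ((1)-(3))."""
    def __init__(self, vocab_size: int, num_langs: int, d: int = 256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d)  # theta_T
        self.lang_emb = nn.Embedding(num_langs, d)    # theta_L
        self.bottleneck = nn.Linear(2 * d, d)         # theta_B (assumed: concat + linear)

    def forward(self, tokens: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        z = self.token_emb(tokens)                            # Z^m: (B, N, d)
        e = self.lang_emb(lang_id)[:, None, :].expand_as(z)   # e_l broadcast over N
        return self.bottleneck(torch.cat([z, e], dim=-1))     # H_in: (B, N, d)

def mlm_loss(logits: torch.Tensor, targets: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """(6): cross-entropy over the masked positions Pi only.
    logits: (B, N, |V|); targets: (B, N); mask: (B, N) boolean marking Pi."""
    return nn.functional.cross_entropy(logits[mask], targets[mask])
```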
Other previous studies have also explored the use of text pretraining for TTS. For example, several studies have proposed employing contextual embeddings from pretrained word vectors [22], autoregressive language models [34], and MLMs [35], [36], [37] to improve the pronunciation and prosody of TTS systems. Other work [10], [38] has focused on training a TTS encoder with MLM objectives, jointly utilizing various token types suited for TTS. Similar to our work, Jia et al. [10] jointly used graphemes and phonemes to improve neural TTS. While they focus on improving the pronunciation and prosody of a monolingual TTS model, our focus is on adapting the TTS model to low-resource languages. In their approach, the sequences of both token types are concatenated to pretrain the TTS encoder, which captures the relationship between the two token types. In contrast, our graphone-based training method does not concatenate the two types of sequences but rather combines the vocabularies derived from the different tokens. Unlike Jia et al.'s method, which requires the alignment of phonemes and graphemes during pretraining and supervised learning, our graphone-based training method allows flexible switching between using both tokens or only graphemes for each language. It thus allows the use of only grapheme tokens during language adaptation and joint tokens during pretraining.

III. PROPOSED TEXT-INDUCTIVE LANGUAGE ADAPTATION
In this section, we outline our proposed methods for low-resource language adaptation. In line with previous studies on low-resource multilingual TTS [7], [11], we use Transformer TTS [39] as the backbone TTS model. Our TTS model employs the architecture used in our previous work [11], which includes an additional bottleneck layer. Since previous studies have shown the effectiveness of transfer learning for low-resource TTS [6], [7], our framework leverages text-inductive transfer learning from a multilingual model to low-resource languages. In Section III-A, we describe the multilingual training procedure for obtaining the pretrained model used in the subsequent low-resource adaptation. To develop TTS for low-resource languages, we could re-train an existing multilingual TTS model by incorporating target language data alongside the original rich multilingual dataset. While this approach is expected to yield high performance, as in the previous work [11], its training cost is significantly high. Therefore, we adapt the pretrained multilingual TTS model to the target language data, thereby efficiently constructing a TTS system for the desired low-resource language. Section III-B presents our text-based adaptation method, which fine-tunes the language-aware embedding layer for new languages using text-only data. In Section III-C, we discuss supervised adaptation using paired data from target languages and the inference procedure. Finally, in Section III-D, we present our proposed graphone-based multilingual training method for improved cross-lingual transfer in TTS.

A. Multilingual TTS Training
In this stage, we build a multilingual TTS model based on unsupervised pretraining (Fig. 2(a)) and supervised learning (Fig. 2(b)). As described in Section II-D, we perform multilingual text-only pretraining of the language-aware embedding layer and encoder of the TTS model with the MLM objective formulated in (6). As shown in our previous work [11], this text-only pretraining significantly improves the performance of multilingual TTS. We then perform supervised learning of the TTS model using multilingual paired speech-text data.
Since we described the unsupervised text pretraining in Section II-D, we detail the supervised learning in this section. Let $\mathcal{D}^{\text{paired}}$ and $\mathcal{L}^{\text{paired}}$ denote the multilingual paired training data and the corresponding set of language IDs, respectively, as shown in Fig. 2(b). Note that we assume $\mathcal{L}^{\text{paired}} \subset \mathcal{L}^{\text{text}}$. Let $Y = (y_t \in \mathbb{R}^D \mid t = 1, \ldots, T)$ denote the speech feature sequence of length $T$. The model parameters $\{\theta_E, \theta_{\text{lae}}\}$ are initialized with those obtained in the pretraining described in Section II-D. Let $\theta_D$ denote the model parameters of the decoder. The speech features are predicted with teacher forcing as:

$$\hat{Y} = \mathrm{Decoder}(\mathrm{Encoder}(\mathrm{Bottleneck}(Z, e_l; \theta_B); \theta_E), Y; \theta_D), \tag{8}$$

where $Z$ is the unmasked token embedding sequence. Let $\mathcal{L}_{\text{tts}}(\hat{Y}, Y)$ denote the training objective of the TTS model. We freeze the language-aware embedding layer while updating the encoder and decoder. The training process can then be written as:

$$\{\hat{\theta}_E, \hat{\theta}_D\} = \mathop{\arg\min}_{\{\theta_E, \theta_D\}} \mathcal{L}_{\text{tts}}(\hat{Y}, Y). \tag{9}$$

By freezing the language-aware embedding layer obtained in the unsupervised text pretraining (Section II-D), the model maps the text-based hidden representations to multilingual speech features. This allows the language-aware embedding layer to be adapted to target languages with text-only data in the subsequent stage described in Section III-B.
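As a minimal sketch of the freezing scheme in (9), the language-aware embedding layer keeps its pretrained weights while only the encoder and decoder receive gradient updates. The module attribute names below are illustrative stand-ins for the actual Transformer TTS implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real Transformer TTS modules.
model = nn.ModuleDict({
    "lae": nn.Linear(16, 16),      # language-aware embedding layer, theta_lae
    "encoder": nn.Linear(16, 16),  # theta_E
    "decoder": nn.Linear(16, 16),  # theta_D
})

# Freeze theta_lae from the text pretraining; optimize only {theta_E, theta_D} as in (9).
for p in model["lae"].parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```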

B. Text-Based Adaptation
In our text-inductive language adaptation framework, we propose a text-based adaptation method that injects textual knowledge for the target language during transfer learning. Fig. 2(c) illustrates the text-based adaptation. Our goal is to adapt the multilingual TTS model obtained in Section III-A to a new low-resource language. We use the language-aware embedding layer as a language adapter, updated with the MLM objective defined in (6).
Let $\mathcal{L}^{\text{adap}}$ and $\widetilde{\mathcal{D}}^{\text{text}}$ denote the set of target languages and the text data containing $\mathcal{L}^{\text{adap}}$, respectively. Using $\widetilde{\mathcal{D}}^{\text{text}}$, we adapt the language-aware embedding layer with the training objective described in (6) and (7). Let $\theta_E^{(\text{pre})}$ and $\theta_{\text{lae}}^{(\text{pre})}$ denote the parameters of the encoder and the language-aware embedding layer obtained with the multilingual text pretraining in (7), respectively. We then initialize the parameters for the text-based adaptation as:

$$\{\theta_E, \theta_{\text{lae}}\} \leftarrow \{\theta_E^{(\text{pre})}, \theta_{\text{lae}}^{(\text{pre})}\}, \tag{10}$$

and let $\tilde{\theta}_{\text{lae}}$ denote the parameters of the language-aware embedding layer obtained by the text-based adaptation.
We now consider two strategies regarding the utilization of text-only data. 1) Multilingual MLM adaptation: $\widetilde{\mathcal{D}}^{\text{text}}$ includes languages other than $\mathcal{L}^{\text{adap}}$, i.e., $\mathcal{L}^{\text{adap}} \subsetneq \widetilde{\mathcal{L}}^{\text{text}}$, where $\widetilde{\mathcal{L}}^{\text{text}}$ denotes the set of language IDs included in $\widetilde{\mathcal{D}}^{\text{text}}$. Here we select languages that belong to the same language family as the target language for cross-lingual transfer; a small selection sketch is given after this paragraph. It should be noted that while recent work [42] has proposed a metric for TTS transfer learning, there is no universal metric to measure language similarity. In this study, we select similar languages for cross-lingual transfer based on the language families defined in Glottolog [43]; language selection based on language family has been used in previous studies on spoken language processing tasks [4], [44]. Specifically, as mentioned in Section IV-A, we choose French and Spanish from the same Romance family for adaptation to Italian, and Malayalam and Telugu from the same Dravidian family for adaptation to Tamil. As described in Section IV-A, we use the full text data of the selected languages during the multilingual MLM adaptation. The multilingual MLM adaptation aims to capture the cross-lingual relationship more explicitly at the adaptation stage. However, this strategy incurs higher training costs because it uses text data from multiple languages. 2) Monolingual MLM adaptation: $\widetilde{\mathcal{D}}^{\text{text}}$ contains only $\mathcal{L}^{\text{adap}}$, i.e., $\mathcal{L}^{\text{adap}} = \widetilde{\mathcal{L}}^{\text{text}}$. This strategy has the advantage of less training time because it uses only the target language in the adaptation stage. However, it reduces the amount of data and the number of languages used for text-based adaptation, undermining the benefits of cross-lingual knowledge sharing.
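The family-based selection above can be expressed as a simple lookup; the following sketch encodes only the groupings named in this paper (Glottolog families for the languages involved) and is not a general language-similarity metric.

```python
# Hypothetical helper: map each language code to its Glottolog family and pick
# same-family languages for the multilingual MLM adaptation.
FAMILY = {
    "it": "Romance", "fr": "Romance", "es": "Romance",
    "ta": "Dravidian", "ml": "Dravidian", "te": "Dravidian",
}

def adaptation_languages(target: str) -> list:
    """Return the target plus its same-family languages."""
    fam = FAMILY[target]
    return [lang for lang, f in FAMILY.items() if f == fam]

print(adaptation_languages("it"))  # ['it', 'fr', 'es']
print(adaptation_languages("ta"))  # ['ta', 'ml', 'te']
```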
Furthermore, we investigate two schemes for the MLM-based training. 1) MLM-Update-All: During the text-based adaptation, we update the parameters of both the encoder and the language-aware embedding layer with the MLM objective in (6). This scheme can be written as:

$$\{\tilde{\theta}_E, \tilde{\theta}_{\text{lae}}\} = \mathop{\arg\min}_{\{\theta_E, \theta_{\text{lae}}\}} \mathcal{L}_{\text{mlm}}, \tag{11}$$

where $\tilde{\theta}_E$ denotes the parameters of the encoder obtained with the text-based adaptation. 2) MLM-Freeze-Enc: As an alternative, we update only the language-aware embedding layer, because our final goal is to adapt this layer to the target language. We then perform the text-based adaptation as:

$$\tilde{\theta}_{\text{lae}} = \mathop{\arg\min}_{\theta_{\text{lae}}} \mathcal{L}_{\text{mlm}}. \tag{12}$$

While the scheme defined in (12) is more computationally efficient than the scheme defined in (11), it can introduce error propagation since the entire model is not optimized. In the ablation study presented in Section IV-D, we compare the different schemes for the utilization of text-only data and for updating the parameters.
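The two schemes in (11) and (12) differ only in which parameter groups the optimizer sees. A hedged sketch, again with assumed module names:

```python
import torch.nn as nn

def mlm_adaptation_params(model: nn.ModuleDict, scheme: str) -> list:
    """Select the parameters updated by the MLM objective during text-based adaptation."""
    if scheme == "MLM-Update-All":    # (11): update encoder and embedding layer
        parts = ["encoder", "lae"]
    elif scheme == "MLM-Freeze-Enc":  # (12): update the embedding layer only
        parts = ["lae"]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [p for name in parts for p in model[name].parameters()]

model = nn.ModuleDict({"lae": nn.Linear(8, 8), "encoder": nn.Linear(8, 8)})
params = mlm_adaptation_params(model, "MLM-Freeze-Enc")  # theta_lae only
```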

C. Supervised Adaptation and Inference
We then perform supervised adaptation with a limited amount of paired data for the target language. Let $\widetilde{\mathcal{D}}^{\text{paired}}$ denote the paired data used in this stage, and let $\{\hat{\theta}_E, \hat{\theta}_D\}$ denote the parameters obtained with the multilingual training described in Section III-A. We use the pretrained parameters $\{\tilde{\theta}_{\text{lae}}, \hat{\theta}_E, \hat{\theta}_D\}$ as the initial parameters of the supervised adaptation. Note that the previous supervised language adaptation described in Section II-A initializes the parameters with $\{\hat{\theta}_{\text{lae}}, \hat{\theta}_E, \hat{\theta}_D\}$, i.e., without text-based adaptation. In the supervised adaptation, we can update the whole set of model parameters as:

$$\{\bar{\theta}_{\text{lae}}, \bar{\theta}_E, \bar{\theta}_D\} = \mathop{\arg\min}_{\{\theta_{\text{lae}}, \theta_E, \theta_D\}} \mathcal{L}_{\text{tts}}(\hat{Y}, Y). \tag{13}$$

Since we have already adapted the language-aware embedding layer in Section III-B, as an alternative to (13), we can update only the remaining parameters as:

$$\{\bar{\theta}_E, \bar{\theta}_D\} = \mathop{\arg\min}_{\{\theta_E, \theta_D\}} \mathcal{L}_{\text{tts}}(\hat{Y}, Y). \tag{14}$$

Finally, for a more resource-efficient supervised adaptation scheme, we consider updating only the decoder as:

$$\bar{\theta}_D = \mathop{\arg\min}_{\theta_D} \mathcal{L}_{\text{tts}}(\hat{Y}, Y). \tag{15}$$

We denote these schemes as TTS-Update-All (13), TTS-Freeze-LAE (14), and TTS-Freeze-Enc (15); they are compared in the ablation study described in Section IV-D.
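Analogously, the three supervised adaptation schemes in (13)-(15) reduce to choosing which parameter groups to optimize. A sketch under the same assumed module layout:

```python
import torch.nn as nn

# Parameter groups for the supervised adaptation schemes (13)-(15).
SCHEMES = {
    "TTS-Update-All": ("lae", "encoder", "decoder"),  # (13)
    "TTS-Freeze-LAE": ("encoder", "decoder"),         # (14)
    "TTS-Freeze-Enc": ("decoder",),                   # (15)
}

def supervised_adaptation_params(model: nn.ModuleDict, scheme: str) -> list:
    return [p for part in SCHEMES[scheme] for p in model[part].parameters()]
```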
At inference time, the entire TTS model trained in Section III-C synthesizes speech from the input text for the target language. Let $\{\bar{\theta}_{\text{lae}}, \bar{\theta}_E, \bar{\theta}_D\}$ denote the parameters obtained with the supervised adaptation. The speech features are predicted autoregressively from the encoder output as:

$$\hat{y}_t = \mathrm{Decoder}(H_{\text{out}}, \hat{y}_{1:t-1}; \bar{\theta}_D), \tag{16}$$

and the whole inference process can be written as:

$$\hat{Y} = \mathrm{Decoder}(\mathrm{Encoder}(\mathrm{LAE}(X, l; \bar{\theta}_{\text{lae}}); \bar{\theta}_E); \bar{\theta}_D). \tag{17}$$

Note that the synthetic speech waveform is generated from the predicted features $\hat{Y}$ with a pretrained neural vocoder.
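The following is a conceptual sketch of the inference pipeline in (16) and (17): the adapted language-aware embedding layer and encoder produce $H_{\text{out}}$, the decoder predicts mel features autoregressively, and a separately pretrained neural vocoder converts them to a waveform. All modules here are toy stand-ins, not the actual Transformer TTS or HiFi-GAN implementations.

```python
import torch
import torch.nn as nn

d, n_mels = 256, 80
lae, encoder = nn.Linear(d, d), nn.Linear(d, d)  # adapted theta_lae, theta_E (stand-ins)
decoder_step = nn.Linear(d + n_mels, n_mels)     # one autoregressive step (theta_D)

def vocoder(mel: torch.Tensor) -> torch.Tensor:
    return torch.zeros(mel.shape[0] * 256)       # placeholder for a pretrained HiFi-GAN

h = encoder(lae(torch.randn(1, 12, d)))          # H_out from token + language embedding
y, frames = torch.zeros(1, n_mels), []
for _ in range(50):                              # autoregressive mel prediction, (16)
    y = decoder_step(torch.cat([h.mean(dim=1), y], dim=-1))
    frames.append(y)
mel = torch.stack(frames, dim=1)                 # hat{Y}: (1, T, n_mels), (17)
wav = vocoder(mel[0])                            # waveform from the neural vocoder
```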

D. Graphone-Based Training for Low-Resource TTS
We propose a graphone-based multilingual training method to improve cross-lingual transferability for low-resource TTS. Fig. 3 illustrates the method. While our previous multilingual pretraining method [11] used a single token type, our method mixes both graphemes and IPA symbols during the multilingual training described in Section III-A. In contrast to previous work on jointly using different types of tokens for TTS pretraining [10], we refrain from explicitly combining the sequences of the two token types. Rather, we expand the vocabulary $\mathcal{V}$ to include both token types and then use the corresponding paired training data accordingly. This approach is more memory-efficient and allows flexible switching between using both tokens or only graphemes for each language. While we use both token types for resource-rich languages, we use only graphemes for the low-resource language adaptation described in Section III-B and Section III-C, because powerful grapheme-to-phoneme (G2P) modules are often not available for low-resource languages. Our graphone-based training method captures the relationship between graphemes and IPA symbols for resource-rich languages, resulting in better pronunciation when the model is applied to low-resource languages.

Let $\mathcal{D}$ and $\widetilde{\mathcal{D}}$ denote the datasets used for the multilingual training (Section III-A) and the low-resource adaptation (Section III-B and Section III-C), respectively, where $\mathcal{D} = \mathcal{D}^{\text{text}} \cup \mathcal{D}^{\text{paired}}$ and $\widetilde{\mathcal{D}} = \widetilde{\mathcal{D}}^{\text{text}} \cup \widetilde{\mathcal{D}}^{\text{paired}}$. Let $\mathcal{D}_l$ and $\widetilde{\mathcal{D}}_l$ denote the per-language subsets of these datasets, where $\mathcal{D} = \bigcup_l \mathcal{D}_l$ and $\widetilde{\mathcal{D}} = \bigcup_l \widetilde{\mathcal{D}}_l$. In the multilingual training described in Section III-A, we use both token types, as shown in Fig. 3. Let $\mathcal{D}^{\text{Gpm}}_l$ and $\mathcal{D}^{\text{IPA}}_l$ denote the datasets using graphemes and IPA symbols as the input text tokens, respectively. Note that $\mathcal{D}^{\text{Gpm}}_l \cap \mathcal{D}^{\text{IPA}}_l = \emptyset$ and that we use UTF-8 bytes for the grapheme tokens, as described in Section IV-A. We also extend the vocabulary $\mathcal{V}$ to include both types of tokens. Let $\mathcal{V}_l$ denote the vocabulary set for each language, where $\mathcal{V} = \bigcup_l \mathcal{V}_l$, and let $\mathcal{V}^{\text{Gpm}}_l$ and $\mathcal{V}^{\text{IPA}}_l$ denote the vocabulary sets for graphemes and IPA symbols, respectively. For resource-rich languages, we assume:

$$\mathcal{D}_l = \mathcal{D}^{\text{Gpm}}_l \cup \mathcal{D}^{\text{IPA}}_l, \quad \mathcal{V}_l = \mathcal{V}^{\text{Gpm}}_l \cup \mathcal{V}^{\text{IPA}}_l, \tag{18}$$

whereas, when a G2P module is not available for a language,

$$\mathcal{D}_l = \mathcal{D}^{\text{Gpm}}_l, \quad \mathcal{V}_l = \mathcal{V}^{\text{Gpm}}_l, \tag{19}$$

i.e., $\mathcal{D}^{\text{IPA}}_l = \emptyset$ and $\mathcal{V}^{\text{IPA}}_l = \emptyset$. Our graphone-based multilingual training framework thus allows us to switch between using both tokens ($\mathcal{D}^{\text{IPA}}_l \neq \emptyset$ and $\mathcal{V}^{\text{IPA}}_l \neq \emptyset$ in (18)) and using only graphemes ($\mathcal{D}^{\text{IPA}}_l = \emptyset$ and $\mathcal{V}^{\text{IPA}}_l = \emptyset$ in (19)) for each language. In Section IV-E, we discuss the effectiveness of our graphone-based training for multilingual TTS.

TABLE I DATASET STATISTICS
During the low-resource language adaptation described in Section III-B and Section III-C, per-language G2P modules are often unavailable or inaccurate. We thus use only the grapheme tokens. Let $\widetilde{\mathcal{D}}^{\text{Gpm}}_l$ and $\widetilde{\mathcal{D}}^{\text{IPA}}_l$ denote the datasets for a low-resource language $l$ using graphemes and IPA symbols as the input text tokens, respectively. Since we cannot use a G2P module to obtain the IPA symbols for the target language, $\widetilde{\mathcal{D}}^{\text{IPA}}_l = \emptyset$ holds. Therefore, the dataset for a low-resource language can be written as:

$$\widetilde{\mathcal{D}}_l = \widetilde{\mathcal{D}}^{\text{Gpm}}_l. \tag{20}$$

Capturing the relationship between graphemes and IPA symbols during multilingual training, as in (18), facilitates the transfer of pronunciation knowledge from resource-rich to low-resource languages during the language adaptation. This leads to improved performance of low-resource language adaptation when only graphemes are available, as formulated in (20).
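The switching rule in (18)-(20) amounts to taking unions of per-language token inventories and datasets, with the IPA part empty whenever no G2P module exists. A minimal sketch with toy token sets:

```python
# Hypothetical per-language vocabulary construction following (18)-(20):
# V_l = V_l^Gpm ∪ V_l^IPA, with V_l^IPA = ∅ when no G2P module is available.
def language_vocab(byte_tokens: set, ipa_tokens) -> set:
    return byte_tokens | (ipa_tokens or set())

vocab = set()
vocab |= language_vocab({0x63, 0x69}, {"tʃ", "i"})  # resource-rich: bytes + IPA, (18)
vocab |= language_vocab({0xE0, 0xAE}, None)         # no G2P available: bytes only, (19)
print(len(vocab))  # merged multilingual graphone vocabulary V
```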

IV. EXPERIMENTAL EVALUATIONS

A. Experimental Setting

1) Dataset:
We used publicly available datasets for all experiments. Table I shows the amount of data for each language. For the text pretraining described in Section II-D, we used text data from either Voxpopuli [45] or CC100 [46], [47]. Since CC100 contains significantly more data than Voxpopuli, we randomly selected 2 million lines to match the data scale between the two sources. As a result, our model was trained on data covering several European and Indic languages. For the TTS training data, we used high-quality speech-text paired datasets. In the multilingual training presented in Section III-A, we used the 14 languages shown in the table. For the European languages (de, fr, es, el, nl, ru, hu, fi), we used CSS10 [48] as in the previous work [11]. We used Google Language Resources [49] for hi, gu, and ml, while we used CMU Indic [50] for pa and te. For the target languages of the language adaptation (Section III-B and Section III-C), we chose two languages with different writing systems and from different language families: Italian from the European languages and Tamil from the Indic languages. Although these are not genuinely low-resource languages, we chose this simulated low-resource setting, as in previous work [51], for two reasons: 1) to ensure a sufficient number of evaluators for the subjective evaluation, and 2) to be able to use a powerful multilingual ASR model for the objective assessment. In the supervised adaptation described in Section III-C, we used M-AILABS [53] and Google Language Resources [49] for Italian and Tamil, respectively. We used a male speaker (Riccardo) for Italian, while we chose a female speaker for Tamil. We used 100 utterances for the test set and 5 utterances for the development set, because we wanted to ensure a substantial allocation to the test set, and this development set size provided stable validation curves. For evaluation, the number of sentences used for the supervised learning described in Section III-C was varied among 50, 250, and 500.
2) Text and Speech Features: We set the sampling frequency of the audio data to 16 kHz. For the audio analysis, we used an 80-dimensional mel filter bank, a fast Fourier transform (FFT) length of 1024 samples, and a frame shift of 256 samples. For the grapheme tokens, we used UTF-8 bytes as in previous studies [7], [11], [24]. We extracted IPA symbols using open-source toolkits: espeak-ng for European languages and Epitran for Indic languages. Table I shows the use of IPA symbols for each language when using our graphone-based multilingual training described in Section III-D: "Y" indicates the availability of IPA symbols, while "N" indicates the use of byte tokens only. Note that we used only byte tokens when not using the graphone-based multilingual training. In the multilingual training discussed in Section III-A, we performed training using both bytes and IPA symbols, as in (18). However, since gu was not supported by Epitran, only graphemes were used for it, as in (19). For the low-resource adaptation described in Section III-B and Section III-C, we used only byte tokens, as in (20). As mentioned in Section IV-A, our evaluations are conducted under simulated conditions, and thus the languages used in the evaluations are not actually low-resource languages. Therefore, a comparison between a TTS method using less effective G2P modules for real low-resource languages and our method using grapheme-based tokens is left as future work. For Italian and Tamil, 97.06% and 5.56% of the character symbols, respectively, were included in the multilingual pretraining. Due to the UTF-8 byte tokenization, which factorizes multi-byte characters, all byte tokens from Italian and Tamil were included in the multilingual training. We conducted a similar study with real low-resource languages, specifically Friulian (a Romance language like Italian) and Tulu (a Dravidian language like Tamil), which resulted in the inclusion of all the byte tokens from these languages in the multilingual pretraining. This suggests that the grapheme sets used to train our TTS models have similar overlap ratios in the simulated and real low-resource conditions. While we used byte tokens for inference of the TTS models obtained by low-resource adaptation, we compared inference of multilingual TTS (Section III-A) with bytes and with IPA symbols in Section IV-E. In our investigation, we found that 94.64% and 97.06% of the IPA symbols from the low-resource language adaptation were present in the multilingual training for Italian and Tamil, respectively.
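Since UTF-8 byte tokenization factorizes multi-byte characters, the byte inventory of an unseen script is still covered by multilingual training, as noted above. A small illustration:

```python
# UTF-8 byte tokenization: Latin characters map to single bytes, while each
# Tamil character expands to three bytes, so even unseen scripts reuse the
# same 256-entry byte inventory.
def byte_tokens(text: str) -> list:
    return list(text.encode("utf-8"))

print(byte_tokens("ciao"))   # [99, 105, 97, 111] -- one byte per character
print(byte_tokens("தமிழ்"))  # 15 byte tokens for 5 Tamil code points
```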
3) Model Configurations: Our model was based on Transformer TTS [39], and we followed the model hyperparameters described in our previous work [11]. The baseline method was defined as a low-resource TTS model trained without the text-based adaptation described in Section III-B, which corresponds to previous transfer-learning-based methods [6], [7]. Note that the baseline model used the multilingual text-only pretraining (Fig. 2(a)) in the same manner as the proposed models. For the neural vocoder, we used a HiFi-GAN [54] model trained for 2M iterations with LibriTTS [55], VCTK [56], and CSS10. As in the previous work [11], we used x-vectors for the speaker embedding. We used an x-vector extractor trained on VoxCeleb1 and VoxCeleb2 [57], published in SpeechBrain [58]. ESPnet2-TTS [59] was used for the implementation and experiments.

4) Training Configurations of Multilingual Training:
For the text pretraining described in Section II-D, we set the learning rate to 1.0 and the warm-up steps to 100,000, and performed training for three epochs (2.4M iterations) in both the bytes and graphones cases. During training, we used the Noam optimizer [60] with the gradient accumulation set to 2 and the batch size set to 32, employing a single NVIDIA Tesla V100 GPU. For the multilingual supervised learning presented in Section III-A, we trained the model for 200 epochs until the training loss converged. We set the learning rate to 1.0 and the warm-up steps to 50,000, using the Noam optimizer. The gradient accumulation was set to 4, and the batch bins setting used for the ESPnet2 [61] recipe was set to 400,000. When using our graphone-based training method described in Section III-D, the vocabulary $\mathcal{V}$ defined in Section II-D is constructed from the UTF-8 bytes and IPA symbols, and it includes a start/end-of-sentence token ([SOS/EOS]). To obtain the masked tokens $X^m$ during the training procedures described in Section II-D, we use the same masking ratio and categories as in the original BERT pretraining [12] for each token type. The tokens with indices $\Pi$ are chosen randomly from 15% of all tokens: 80% of them are replaced by the [MASK] token, 10% of them are replaced by random tokens, and the remaining 10% are left unchanged. $\mathcal{L}_{\text{mlm}}$ is computed as in (6) over the 15% of tokens with indices $\Pi$.
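A sketch of this masking procedure (15% selection with the 80/10/10 replacement rule); token IDs and the [MASK] ID are illustrative:

```python
import random

def mask_tokens(tokens: list, vocab_size: int, mask_id: int,
                rng: random.Random) -> tuple:
    """BERT-style masking: select ~15% of positions as Pi; of those, 80% become
    [MASK], 10% become random tokens, and 10% are left unchanged."""
    masked, indices = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < 0.15:
            indices.append(i)                          # position i joins Pi
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_id                    # 80%: [MASK]
            elif r < 0.9:
                masked[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return masked, indices

x_m, pi = mask_tokens(list(range(20)), vocab_size=300, mask_id=299,
                      rng=random.Random(0))
```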

5) Training Configurations of Low-Resource Language Adaptation:
For our main evaluations of low-resource language adaptation presented in Section IV-B and Section IV-C, we used the MLM-Update-All scheme described in Section III-B and the TTS-Update-All scheme described in Section III-C. We compared the other schemes in the ablation study presented in Section IV-D.
In the text-based adaptation described in Section III-B, the learning rate, optimizer settings, and masking strategy were the same as those used in the multilingual pretraining described in Section II-D. In the monolingual MLM adaptation scheme described in Section III-B, training was performed using only the text data of the target language. In the multilingual MLM adaptation scheme, for Italian, we selected French and Spanish from the same Romance family and used the multilingual text data of these three languages; for Tamil, we chose Telugu and Malayalam from the same Dravidian family and employed the multilingual text data of these three languages. During the multilingual MLM adaptation, we used the full text-only data shown in Table I for the selected languages. For both the monolingual and multilingual cases, we performed the text-based adaptation for four epochs. In the supervised adaptation described in Section III-C, we performed training using the paired data for the target language. The number of training iterations for the supervised adaptation was decided based on the convergence of the baseline method. The supervised adaptation was performed for 100, 200, and 300 epochs when the training data contained 50, 250, and 500 utterances, respectively, with 200 iterations per epoch. The learning rate and optimizer settings were the same as those used in the multilingual supervised learning described in Section III-A. We tried several other learning rate scheduling methods and chose the one used in the multilingual supervised learning, which yielded the highest baseline performance.
6) Evaluation Metrics: For the objective evaluation of the synthetic speech quality, we used common metrics: mel-cepstral distortion (MCD) [62] and log fundamental frequency root mean square error (F0 RMSE). For the MCD evaluation, we set the mel-cepstrum dimension to 25. When calculating the MCD and F0 RMSE, we used FastDTW [63] to align feature sequences with different lengths. To evaluate the speaker similarity of the synthesized speech, we used the x-vector extracted with the model described in Section IV-A3 and computed the cosine similarity between the x-vectors of the synthesized and natural speech utterances (XV Sim). To evaluate the intelligibility of the synthesized speech, we calculated the character error rate (CER), a common metric for TTS intelligibility [64], using a pretrained multilingual ASR model [65]. Specifically, we employed the publicly available large model from the Whisper repository. According to the original paper [65], the model shows word error rates of 4.2% and 20.6% for Italian and Tamil, respectively, on FLEURS [66]. The use of this well-established public model to calculate CER provides a consistent basis for comparison with the results of other studies. The training data for this ASR model is not publicly available and may include the transcripts from our test set; however, it should be noted that the synthetic audio data for the evaluations is not included in the training data of the ASR model. For the subjective naturalness evaluation, we conducted mean opinion score (MOS) tests and preference AB tests. For the MOS and AB tests on naturalness described in Section IV-C, we hired 30 and 25 native speakers, respectively, through Amazon Mechanical Turk [67]. For the XAB tests on speaker similarity described in Section IV-C, we used an alternative crowdsourcing platform [68] due to budget constraints, recruiting 40 native Japanese speakers for each evaluation case. It is worth noting that previous work [69] has reported that non-native speaker ratings have very strong correlations with native speaker ratings. Also, given that this evaluation focuses on speaker similarity, we expect the effect of non-native evaluators to be minimal. To ensure the quality of the subjective ratings, we only recruited participants with Human Intelligence Task (HIT) approval rates above 85%.
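For reference, the sketch below computes the two automatic metrics that do not need forced alignment: CER with a pretrained Whisper model and XV Sim with a SpeechBrain x-vector extractor. The file names are placeholders, and the exact checkpoints may differ from the ones used in our experiments; the openai-whisper, jiwer, and speechbrain packages are assumed to be installed.

```python
import torch
import torchaudio
import whisper                                   # pip install openai-whisper
from jiwer import cer                            # pip install jiwer
from speechbrain.pretrained import EncoderClassifier

# CER of synthetic speech against the reference transcript.
asr = whisper.load_model("large")
hyp = asr.transcribe("synthetic.wav")["text"]
print("CER:", cer("reference transcript", hyp))

# Cosine similarity between x-vectors of synthetic and natural speech (XV Sim).
xv = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
wav_syn, _ = torchaudio.load("synthetic.wav")
wav_nat, _ = torchaudio.load("natural.wav")
emb_syn = xv.encode_batch(wav_syn).squeeze()
emb_nat = xv.encode_batch(wav_nat).squeeze()
print("XV Sim:", torch.nn.functional.cosine_similarity(emb_syn, emb_nat, dim=-1).item())
```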

B. Objective Evaluations of Low-Resource Language Adaptation
We conducted an objective evaluation of the low-resource adaptation described in Section III. We compared a baseline method without the text-based adaptation described in Section III-B against the proposed method that includes the text-based adaptation. As described in Section IV-A3, the baseline model used the multilingual text-only pretraining in the same manner as the proposed models. We assumed low-resource scenarios without G2P modules and used bytes as the input tokens of these models. We also investigated the proposed methods with the graphone-based training described in Section III-D, which use only byte tokens for the target languages but both bytes and IPA symbols for the other languages, as listed in Table I.
Tables II and III list the results for Italian and Tamil, respectively. In the tables, Bytes and Graphones denote the models using only bytes and those using graphone-based multilingual training (Section III-D), respectively. Multi-MLM and Mono-MLM denote the multilingual MLM adaptation and monolingual MLM adaptation methods described in Section III-B, respectively.
Looking at Table II, we can see that the proposed text-based adaptation method improves the naturalness, speaker similarity, and intelligibility of the adaptation to Italian. The effect of text-based adaptation is particularly significant when the number of utterances in the training data is small. Specifically, when the number of utterances is 50, the baseline method had a CER of 17.29%, while the proposed method achieved a CER of less than half of that. Moreover, the introduction of graphone-based training improved MCD, F0 RMSE, and XV Sim, while byte-based training often outperformed graphone-based training in terms of CER.
In the Tamil results of Table III, comparing the proposed byte-based method and the baseline byte-based method, we observe improvement trends in MCD and F0 RMSE but not in CER. On the other hand, graphone-based training improved all metrics, especially in the 50-utterance and 250-utterance cases. These results are probably due to the inherent differences between Italian and Tamil. In particular, the writing systems may play a key role: Italian uses a Latin alphabet with 21 distinct characters, while Tamil uses a script with 247 characters. Moreover, no language that shares the Tamil script was part of the multilingual training, as listed in Table I. This circumstance is likely to have limited the effectiveness of language adaptation using only grapheme tokens for Tamil, making knowledge transfer via graphone-based training more powerful. Overall, our framework combining the text-based adaptation described in Section III-B and the graphone-based training method presented in Section III-D improved the naturalness, speaker similarity, and intelligibility of low-resource language adaptation for Italian and Tamil, two languages with markedly different linguistic characteristics.
When comparing the performance of the Multi-MLM and Mono-MLM approaches described in Section III-B, the results varied depending on the evaluation metric and language, with no clear superiority for either method. In the case of Tamil, Mono-MLM showed a tendency to outperform Multi-MLM. Therefore, language adaptation could be more effective when focusing on fitting the model to the data distribution of the target language, rather than promoting cross-lingual knowledge sharing through the use of multilingual text data. These findings suggest that using only text data from the target language for text-based adaptation is a viable strategy, which substantially reduces the required training time.

C. Subjective Evaluations
As described in Section IV-A, we conducted subjective evaluations on naturalness. Fig. 4 shows the results of the MOS tests for Italian and Tamil. Due to budget constraints, we used the case with 50 training utterances for both target languages, which represents the minimal amount of training data and is of particular significance for practical applications. Our proposed models showed higher MOS values compared to the baseline model for Italian, with the graphone-based multilingual training method (described in Section III-D) yielding the highest MOS value of 3.61 in the Proposed (Graphones, Multi-MLM) case. In the Tamil case, the proposed models outperformed the baseline model, with Proposed (Graphones) showing higher MOS values than Proposed (Bytes), by larger margins than in the Italian case.
To further validate these results, we performed subjective AB tests on naturalness, comparing three different methods for both languages. In Fig. 5, Baseline (Bytes), Proposed (Bytes), and Proposed (Graphones) correspond to Baseline (Bytes), Proposed (Bytes, Multi-MLM), and Proposed (Graphones, Multi-MLM) in Fig. 4, respectively. We observed that, for Italian, both Proposed (Bytes) and Proposed (Graphones) significantly outperformed Baseline (Bytes), thus confirming the effectiveness of our text-based adaptation method presented in Section III-B. Although Proposed (Graphones) had a higher preference score than Proposed (Bytes), no statistical significance was observed. In the Tamil case, Proposed (Bytes) showed a higher preference score than Baseline (Bytes) without statistical significance. Conversely, Proposed (Graphones) significantly outperformed Baseline (Bytes), indicating the effectiveness of our method, which integrates text-based adaptation (Section III-B) and graphone-based training (Section III-D). Notably, in contrast to the Italian case, Proposed (Graphones) outperformed Proposed (Bytes) with statistical significance, indicating the effectiveness of graphone-based training, which is consistent with the objective evaluation results detailed in Section IV-B.
In the naturalness evaluations, the differences between the baseline and the proposed methods are small in absolute terms, as shown in Fig. 4. Similarly, in the AB tests shown in Fig. 5, the baseline was preferred over the proposed methods in at least 43% of cases. Although we used the worker selection criterion to improve the quality of the ratings [70], as described in Section IV-A6, this did not fully mitigate the problem of inconsistency in raters' responses. Nevertheless, it should be noted that a statistically significant difference was observed between the baseline and the proposed methods. Furthermore, the proposed method that combines the text-inductive adaptation method and the graphone-based training method demonstrated an improvement in the objective metrics of naturalness for Italian and Tamil, as described in Section IV-B. These results confirm the effectiveness of the proposed text-inductive graphone-based adaptation method for two languages with very different linguistic profiles, underlining its potential for practical scenarios.
Finally, to evaluate speaker similarity, we also conducted subjective XAB tests, where each rater determined whether synthetic speech sample A or B more closely resembles ground-truth speech sample X. The detailed settings are described in Section IV-A6. For Italian, shown in Fig. 6(a), our proposed methods, both the byte-based (Proposed (Bytes)) and graphone-based (Proposed (Graphones)) methods, showed significantly higher speaker similarity compared to the baseline method (Baseline (Bytes)). We did not observe a significant difference between Proposed (Bytes) and Proposed (Graphones). For Tamil, in Fig. 6(b), although Proposed (Bytes) did not show a significant difference when compared to Baseline (Bytes), the graphone-based proposed method (Proposed (Graphones)) showed a significant improvement in speaker similarity over both the baseline (Baseline (Bytes)) and the byte-based proposed method (Proposed (Bytes)). Similar to the results of the naturalness evaluations, we observed different trends between Italian and Tamil, which we attribute to their very different linguistic characteristics. However, the proposed text-inductive graphone-based language adaptation (Proposed (Graphones)) improves on the baseline method for both Italian and Tamil.

D. Ablation Studies on Adaptation Strategies
To compare the different schemes for the text-based adaptation described in Section III-B and the supervised adaptation described in Section III-C, we conducted ablation studies. In the evaluations described in Section IV-B and Section IV-C, we updated all parameters of the TTS model as in (13) during the supervised adaptation described in Section III-C. We also investigated a scheme that freezes the language-aware embedding layer as in (14) (denoted as TTS-Freeze-LAE) and a scheme that updates only the decoder as in (15) (TTS-Freeze-Enc) during the supervised adaptation. TTS-Update-All refers to the case where all parameters are updated as in (13). Furthermore, we compared the different schemes for the text-based adaptation, denoted as MLM-Update-All and MLM-Freeze-Enc, as described in Section III-B. Note that we followed (13) for the supervised adaptation in MLM-Freeze-Enc, while we followed (11) for the text-based adaptation in TTS-Update-All, TTS-Freeze-LAE, and TTS-Freeze-Enc. Therefore, TTS-Update-All corresponds to MLM-Update-All defined in Section III-B. Fig. 7 shows the results.
For both Italian and Tamil, the cases where only the decoder was updated (TTS-Freeze-Enc) yielded the worst performance. This confirms that updating the encoder during the supervised adaptation is required, even with a small number of utterances. The performance gaps between the other schemes for the supervised adaptation were small, showing varying trends depending on the evaluation metrics and languages. Also, MLM-Freeze-Enc showed performance comparable to TTS-Update-All. This suggests that we only need to update the language-aware embedding layer as in (12), which is more computationally efficient than the scheme formulated in (11).

E. Investigation of Graphone-Based Multilingual Training
The proposed graphone-based training described in Section III-D can also be used for typical multilingual TTS with resource-rich languages. We thus investigated the effectiveness of graphones for the multilingual TTS training presented in Section III-A. Since our previous work [11] demonstrated that multilingual TTS with bytes only outperforms multilingual TTS with IPA symbols only for almost all languages used in the evaluation, we compared the byte-based model (referred to as Bytes-Only) and graphone-based models here. We also compared two cases for our graphone-based models: employing bytes during inference (referred to as Graphone-Bytes) and using IPA symbols during inference (referred to as Graphone-IPA). Fig. 8 shows the results. We found that the graphone-based training outperformed byte-only training across all languages and metrics. In the byte-only method, the average MCD was 7.22, the average F0 RMSE was 0.33, and the average CER was 13.46, while in the graphone-based training method using bytes for inference, the average MCD was 6.50, the average F0 RMSE was 0.31, and the average CER was 9.20. These results, together with those in Section IV-B, suggest that graphone-based training is effective for both multilingual training and low-resource adaptation. Interestingly, the Graphone-Bytes model outperformed the Graphone-IPA model across multiple languages and evaluation metrics, possibly due to inaccuracies in the extraction of IPA symbols from text. This finding implies that by adopting the Graphone-Bytes model, we can improve the byte-based multilingual TTS model while simultaneously mitigating the negative impact of inaccuracies in the G2P operation.
As shown in Table I, the training data did not include IPA symbols for Gujarati (gu). Nevertheless, substantial improvements were observed in both the MCD and F0 RMSE metrics. This suggests that by incorporating graphones from other languages during the multilingual training process, it is possible to improve performance for low-resource languages where G2P knowledge is not available. Fig. 9 shows the cosine similarity between the encoder output sequences from bytes and IPA symbols for the same utterance when using the graphone-based training method. We observe that the cosine similarity exhibits a diagonal trend, indicating that the encoder outputs from the two token types correspond to each other. This suggests that the model embeds the input bytes and IPA symbols in shared representations, and that the graphone-based training method facilitates knowledge transfer from IPA symbols to bytes.
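The similarity map in Fig. 9 can be reproduced schematically as follows; h_bytes and h_ipa stand in for the real encoder outputs of the same utterance tokenized as bytes and as IPA symbols.

```python
import torch

h_bytes = torch.randn(42, 256)  # (N_bytes, d): encoder output from byte tokens
h_ipa = torch.randn(35, 256)    # (N_ipa, d): encoder output from IPA tokens

# Pairwise cosine similarity; a diagonal ridge indicates the two token
# sequences are embedded in a shared representation space.
a = torch.nn.functional.normalize(h_bytes, dim=-1)
b = torch.nn.functional.normalize(h_ipa, dim=-1)
sim = a @ b.T                   # (N_bytes, N_ipa) cosine-similarity map
```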

V. CONCLUSION
In this paper, we proposed a text-inductive language adaptation framework for low-resource TTS, incorporating textual information to improve language adaptation. Our two-stage adaptation process leverages both text-only data and a limited amount of paired data for the target language. We also proposed a graphone-based multilingual training method that leverages both graphemes and IPA symbols for resource-rich languages while using only graphemes for low-resource languages, promoting the transfer of pronunciation knowledge. Our experimental evaluations demonstrate that our framework outperforms previous supervised transfer learning approaches and that the graphone-based training method further enhances the performance of multilingual TTS and low-resource language adaptation. With only 5 minutes of paired data for fine-tuning, our method achieves highly intelligible synthetic speech with a CER of around 6% for target languages. We also conducted extensive ablation studies to investigate the effectiveness of different language adaptation strategies. These advances contribute to the ongoing efforts to make TTS systems accessible to a wider range of languages, especially those with limited resources.
Limitation and future work: This study has several limitations that warrant further investigation. First, due to the limited availability of TTS training data across various languages, our evaluations of low-resource language adaptation were restricted to Italian and Tamil. Future research should investigate a wider range of languages, spanning multiple language families, in order to provide more comprehensive results. The effectiveness of the text-inductive language adaptation and graphone-based training methods showed different trends for Italian and Tamil; evaluating a wider range of languages would provide a more detailed understanding of the effectiveness of each method per language. We are particularly interested in evaluations of tonal languages (e.g., Punjabi). For the evaluation of such languages, it is desirable to gather several dozen evaluators for subjective evaluations and also to secure a high-performance ASR model for the languages. Under these conditions, we would like to evaluate whether the performance of the proposed text-based adaptation can be improved by adding accent labels to the input text. We also found that text-based adaptation for byte-input models showed limited effectiveness when applied to Tamil as the target language. Therefore, our future work will focus on further improving the text-based adaptation method and exploring more effective methods for text injection during language adaptation. Finally, investigating how the amount and languages of text-only data affect the performance of language adaptation is also compelling future work.

APPENDIX A
ATTENTION MAPS

Fig. 10 shows the attention alignments from all layers and heads in the baseline model trained on 500 utterances in Italian. This model corresponds to Bytes under the condition of 500 Utts. (47.1 min.) in Table II, which shows high robustness with a CER of 5.29%. As shown in Fig. 10, the attention map shows diagonal patterns for some layers and heads, while it shows off-diagonal patterns for the rest. In Fig. 11, we selected the diagonal attention maps and compared them between different models. While the baseline model trained on 500 utterances (Fig. 11(a)) maintains the diagonal alignment, some alignments are corrupted when the number of utterances is reduced to 50 (Fig. 11(b)). Fig. 11(c) shows the attention alignment of the proposed model corresponding to Bytes, Mono-MLM under the condition of 50 Utts. (5.1 min.) in Table II. As shown in Fig. 11(c), the proposed model has a more continuous alignment, which is consistent with the higher robustness shown in Table II.

APPENDIX B
INVESTIGATION OF LANGUAGES USED FOR MULTILINGUAL TRAINING

In this section, we investigated the effect of the diversity of languages included in the multilingual training on language adaptation performance. The details of these investigations are presented in Table IV. The target language was Italian, using Mono-MLM (defined in Section IV-B) for the text-based adaptation, with 50 utterances as the training dataset for the supervised adaptation. In Table IV, All languages denotes the configuration described in Table I, the same as Bytes, Mono-MLM in Table II. W/o Indic denotes a configuration that excludes Indic languages and includes German, French, Spanish, Greek, Dutch, Russian, Hungarian, and Finnish for multilingual training. Indic only refers to multilingual training with Indic languages only, i.e., Hindi, Punjabi, Gujarati, Malayalam, and Telugu. Case (4) in Table IV describes multilingual training with Latin-script languages, like Italian, while Case (5) excludes the Romance languages French and Spanish, which belong to the same family as Italian. In Cases (6), (7), and (8), only Romance, Germanic, and Uralic languages were used, respectively.
The results show that All languages performs consistently best across several evaluation metrics. The W/o Indic configuration showed lower performance than All languages, highlighting the beneficial role of Indic languages in multilingual pretraining for Italian language adaptation. On the other hand, the Indic only condition resulted in a character error rate of over 100%, indicating that the model failed to generate intelligible speech with training data of only 50 utterances. This result underlines the crucial role of language selection in multilingual training and suggests the indispensability of the European languages. Furthermore, (4) underperformed compared to (2), indicating the merit of integrating languages with different scripts. The results of (4) and (6) show that including Uralic and Germanic languages gives better results than using Romance languages only, confirming the effectiveness of multilingual training with a larger number of languages. We can see that (7) and (8) show worse results on all evaluation metrics compared to (6) with Romance languages. These results clarify that the Romance languages, to which the target language belongs, contribute the most to the performance of the language adaptation. Interestingly, the results of (5) and (6) showed that a configuration with more languages but without Romance languages performed better than training only with Romance languages, suggesting the importance of increasing language variation in multilingual training.

Fig. 2. Procedure of proposed text-inductive adaptation for low-resource TTS. (a) First, we adapt the language-aware embedding layer of the pretrained multilingual TTS model using text-only data. (b) Then, we adapt the whole TTS model using limited paired speech-text data. This illustration corresponds to MLM-Freeze-Enc described in Section III-B and TTS-Freeze-LAE described in Section III-C. de, ru, hu, and it stand for the language codes defined in Table I, i.e., German, Russian, Hungarian, and Italian, respectively.

Fig. 3. Proposed graphone-based multilingual training method to improve the transfer of pronunciation knowledge from resource-rich to low-resource languages. It allows us to switch between using both tokens or only graphemes.

Fig. 4. Subjective MOS results for two target languages. 50 utterances were used as training data for each target language. Error bars indicate 95% confidence intervals.

Fig. 5. Subjective AB test results on naturalness for two target languages. 50 utterances were used as training data for each target language. Bold within bar graphs indicates the score of the case that outperformed the competing case with statistical significance (p < 0.05).

Fig. 6. Subjective XAB test results on speaker similarity for two target languages. 50 utterances were used as training data for each target language. Bold within bar graphs indicates the score of the case that outperformed the competing case with statistical significance (p < 0.05).

Fig. 7. Ablation study on text-based and supervised language adaptation strategies.

Fig. 9. Cosine similarity of encoder outputs from phone and byte tokens when using the proposed graphone-based multilingual training method. An utterance sampled from the Hindi test set was used for visualization.

Fig. 10. Attention alignments from all layers and heads in the baseline model trained on 500 utterances.

TABLE IV
INVESTIGATION OF LANGUAGES FOR MULTILINGUAL TRAINING