Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

This work applies hierarchical transfer learning to implement deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based systems typically require a large amount of training data. In recent years, while DNN-based TTS has achieved remarkable results for high-resource languages, it still suffers from a data scarcity problem for low-resource languages. In this article, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages. We make use of a high-resource language and a joint multilingual dataset of low-resource languages. A pre-trained monolingual TTS on the high-resource language is fine-tuned on the low-resource language using the same model architecture. Then, we apply partial network-based transfer learning from the pre-trained monolingual TTS to a multilingual TTS, and finally from the pre-trained multilingual TTS to a multilingual TTS with style transfer. Our experiments on Indonesian, Javanese, and Sundanese show adequate quality of synthesized speech. In the evaluation, our multilingual TTS reaches a mean opinion score (MOS) of 4.35 for Indonesian (ground truth = 4.36), 4.20 for Javanese (ground truth = 4.38), and 4.28 for Sundanese (ground truth = 4.20). For parallel style transfer evaluation, our TTS model reaches an F0 frame error (FFE) of 9.08%, 10.13%, and 8.43% for Indonesian, Javanese, and Sundanese, respectively. The results indicate that the proposed strategy can be effectively applied to low-resource target languages. With a small amount of training data, our models are able to learn step by step from a smaller TTS network to larger networks, produce intelligible speech approaching the real human voice, and successfully transfer speaking style from a reference audio.


I. INTRODUCTION
Speech is the most natural verbal communication tool and is easily understood by humans [1]. The ability of a computer to process voice signals is necessary in the area of human-computer interaction (HCI). It helps the computer to communicate and interact with humans, or to serve as a communication device between people without impairments and people with visual or speech impairments. Text-to-speech (TTS) studies how a computer can read text or symbols and pronounce them by producing sound waves automatically [2]. The purpose of building TTS is to produce synthesized speech that can be easily understood and is indistinguishable from speech produced by real humans [3]. In general, modern TTS involves three main processes: text analysis, acoustic modeling, and speech waveform synthesis [4]. Model-based TTS research was dominated by statistical parametric speech synthesis (SPSS) [5]-[9] until recent years, in which deep learning has delivered extraordinary achievements in various fields [10]-[12]. This has attracted many researchers to exploit DNNs in all stages of the TTS process. Beyond parametric speech synthesis (BPSS), which applies DNNs to learn both features and rules, has been increasingly researched [13]-[22]. Tacotron-2 [13], a state-of-the-art DNN-based TTS that can be trained end-to-end using <text, audio> data pairs, successfully produces human-like synthesized speech. Besides producing high-quality synthetic speech, DNN-based TTS opens many possibilities to produce speech in various voices, speaking styles, and emotional states. E2E-prosody [23], Tacotron-GST [24], and Mellotron [25] proposed DNN-based prosody models that play an important role in transferring a reference speaking style when generating synthesized speech. These works showed satisfactory results.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
Despite its remarkable performance, DNN-based TTS depends strongly on a large amount of training data to learn latent data patterns. This data dependency is one of the most serious problems for DNN-based models. The scale of the DNN model and the size of the required training data correlate almost linearly [26]. Based on our preliminary study, a minimum of 10 hours of data is required to train a Tacotron-2-based single-speaker monolingual TTS on the Indonesian domain. With less than 3 hours of training data, the model is unable to produce intelligible speech. As for multi-speaker multilingual TTS, 10 hours of data is still insufficient to train the model. This study confirms that the larger the TTS network, the larger the amount of training data needed. Rather than building a bigger dataset, which is expensive and requires human effort, it is necessary to find alternative strategies to train the model on low-resource language domains.
There have been several efforts to train a DNN-based TTS model using a small amount of annotated <text, audio> data pairs. Semi-supervised training was proposed in [27] to make use of textual and acoustic knowledge from large non-parallel text and speech corpora for training end-to-end TTS with a small amount of parallel data. Other studies used cycle-consistency training with an automatic speech recognition (ASR) model to train TTS [28], [29]. In the training process, ASR is used to produce transcripts from audio, while TTS reconstructs audio from transcripts. However, these approaches still require a large amount of unlabelled text and unlabelled audio corpora, which have limited availability in low-resource domains. A cross-lingual speech chain was proposed in [30] that applies cycle-consistency training to cross-lingual ASR-TTS. The work in [31] proposed an approach to discover a cross-lingual symbol mapping from abundant source data.
Transfer learning is an interesting option to overcome the lack of data in low-resource languages by allowing what has been learned in a source domain to be exploited to improve generalization in a target domain. According to the classification of transfer learning approaches in traditional machine learning in [32], [33], there are four transfer categories: instance transfer, feature-representation transfer, parameter transfer, and relational-knowledge transfer. For deep learning in particular, the study in [26] classifies deep transfer learning (DTL) into four different categories: instances-based DTL, mapping-based DTL, network-based DTL, and adversarial-based DTL. Several DNN-based systems have successfully applied deep transfer learning, including TTS [31], image classification [34], [35], machine translation [36], [37], automatic speech recognition [38]-[40], language identification [41], and sentiment classification [42].
We propose hierarchical transfer learning, a network-based DTL, to train the TTS model on low-resource (target) languages by utilizing a high-resource (source) language. This strategy is a multi-stage learning scheme inspired by the human learning process, which accumulates knowledge from previous learning step by step, from a simple task to more complex ones. Furthermore, we exploit the benefit of using a joint multilingual dataset of low-resource languages to maximize latent variable learning from more data of other languages. For this reason, we develop DNN-based multilingual multi-speaker TTS with and without style transfer by extending the Tacotron-2 architecture with additional networks for multi-speaker, multilingual, and style transfer support. Adding a multilingual component has two benefits. First, the TTS model can learn from more data of other languages, and more data helps the network parameters generalize better. Second, it allows a native speaker of one language to speak fluently in other languages. We train the TTS models using the proposed hierarchical transfer learning in several stages. Each transfer stage is motivated by the transfer of particular knowledge from previous learning: parameter generalization, including the alignment map between text input and spectrogram output, from a high-resource language; pronunciation learning from a phonologically close language; and multilingual multi-speaker learning from a joint multilingual dataset. After these learned capabilities are transferred, the TTS model at the last stage learns to imitate the speaking style of a reference audio.
Our experiment uses an English dataset as the source domain and a joint multilingual dataset of Indonesian, Javanese, and Sundanese as the target domain. These target languages are phonologically close. Using the International Phonetic Alphabet (IPA), Indonesian has 32 phonemes, while the other languages have all Indonesian phonemes plus three additional phonemes for Javanese and one additional phoneme for Sundanese. The models are able to generate synthesized speech that is close to a real human voice when trained using less than 1 hour of monolingual data and 11 hours of joint multilingual data (Javanese and Sundanese account for less than three hours each). Our study reports that these amounts are inadequate to train the TTS models from scratch. In comparison, single-speaker monolingual Tacotron-2 uses more than 24 hours, multi-speaker monolingual Mellotron uses 44 hours and 41.7 hours, and monolingual E2E-Prosody uses 147 hours and 296 hours for single-speaker and multi-speaker, respectively. TTS evaluation for Indonesian and Sundanese reaches a smaller MOS difference from real human speech than the baseline Tacotron-2 on English, and better mel-cepstral distortion (MCD) than the baseline E2E-Prosody on English. As for style transfer, our model on female speakers provides better FFE than Mellotron.
VOLUME 8, 2020
In summary, our main contributions are as follows:
1. We present a Tacotron-2-based TTS that supports multi-speaker, multilingual, and style transfer synthesis by adding new network components. The multilingual component enables the TTS model to be trained on a joint multilingual dataset. The joint dataset helps the TTS model improve generalization on a low-resource language by using more data from other, phonetically similar languages, and allows a speaker of one language to speak fluently in other languages with or without style transfer.
2. We propose a hierarchical transfer learning scheme to train TTS for low-resource languages in several stages. First, it takes a model pre-trained on a high-resource single-speaker monolingual source domain and fine-tunes it on a single-speaker monolingual target domain. Second, we use a partial network-based DTL from the pre-trained single-speaker monolingual TTS to build a multi-speaker multilingual TTS that is fine-tuned using a joint multilingual dataset. Finally, a similar partial network-based DTL is used to build a multi-speaker multilingual TTS with style transfer from the pre-trained multi-speaker multilingual TTS model.
The rest of the paper is organized as follows: Section II presents previous related works. Section III introduces our DNN-based TTS architecture and proposed hierarchical transfer learning. Section IV provides implementation details. Section V presents the experimental results and Section VI concludes the study.

II. RELATED WORKS

A. END-TO-END DNN-BASED TTS
A recent and promising form of beyond parametric speech synthesis (BPSS) is the end-to-end TTS system, which combines the main stages of the TTS process into a DNN framework that can be trained directly using <text, audio> data pairs. Such an integrated end-to-end TTS system has several advantages [4]: it does not require phoneme-level alignment and reduces the need for laborious feature engineering; it is easier to condition on various attributes, such as speakers, languages, or high-level features such as sentiment; it is easier to adapt to new data; and it tends to be more robust than a multi-stage model, in which the errors of each component can accumulate. Tacotron-2 [13], a simplification of Tacotron [15], is a fully end-to-end DNN-based TTS system that can be trained directly using <text, audio> data pairs and directly processes raw orthographic text to produce spectrograms. Tacotron-2 uses WaveNet [16] as a vocoder conditioned on the mel-spectrogram instead of the Griffin-Lim algorithm used in Tacotron.
To convey human-like speech, the TTS system needs to model prosody, which includes paralinguistic information (intention, attitude, and emotion), pitch, rhythm, intonation, stress, and style. Tacotron [15] and Tacotron-2 [13] do not model prosody explicitly. E2E-Prosody [23] added a reference encoder network to the Tacotron architecture to model prosody derived from a reference audio. Tacotron-GST [24] proposed modeling speech style using global style tokens (GST) by adding a style token layer that consumes the reference encoder outputs [23] using a multi-head attention scheme [43]. Recently, Mellotron [25] combined GST, pitch, and rhythm for style transfer and significantly reduced the F0 frame error (FFE) between synthesized audio and reference audio.
Different from these end-to-end TTS models, we add a multilingual component to exploit the benefits of using a joint multilingual dataset. We also extend Tacotron-2 to support style transfer using GST [24] and by conditioning the decoder on pitch and rhythm obtained from a reference audio signal, as applied in Mellotron. GST can express various expressive styles without requiring explicit prosody labels. The GST network is jointly trained with the whole model and is driven only by the reconstruction loss of the Tacotron-2 decoder. However, unlike Mellotron, which processes text at the phoneme level, our approach works at the character level. Therefore, it does not need a phonetic dictionary, which requires human annotation effort. As for the vocoder, we employ WaveGlow [44] instead of the WaveNet used by Tacotron-2. WaveNet produces very natural speech waveforms but is very slow due to its autoregressive generation process, whereas WaveGlow is a non-autoregressive vocoder that provides fast, efficient, and high-quality audio synthesis.

B. LOW-RESOURCE PROBLEM
Deep learning has a very strong dependence on a large amount of training data. Previous studies on data efficiency for training DNN-based TTS models [27]-[29] are less suitable for low-resource languages. Even though these approaches do not require a large amount of parallel data, they still need a large amount of non-parallel text and speech corpora. The ASR-TTS approach proposed in [30] needs an additional ASR system to assist TTS learning. Mapping-based DTL is explored in [31] by adding a phonetic transformation network (PTN) model to learn a mapping between source and target linguistic symbols. An ASR system is used to train the PTN separately. However, this approach can only be applied to the same TTS network. It does not have the flexibility to transfer the learning to a more complex network.
Different from the solutions proposed in [27]-[29], our proposed strategy does not require a large amount of non-parallel text and audio corpora. Our strategy is simpler than [30] as it does not need an additional system such as ASR. Similar to [31], we apply a DTL approach. However, our DTL is a network-based approach that is more flexible than the mapping-based DTL applied in [31]: with multiple stages of transfer learning, the previously learned DNN parameters can be passed on to a larger network. The hierarchical transfer learning scheme proposed in this article is an extension of our previous work [45]. That prior work used a TTS model pre-trained on a source domain and fine-tuned it on a target domain using the same monolingual single-speaker TTS model. Our new transfer scheme can be applied both to the same TTS model trained on a monolingual target domain and to different, more complex models trained on a joint multilingual target domain. The new scheme exploits the benefit of generalization learning from other, phonetically similar languages, allows a speaker of one language to speak other languages, and transfers speaking style from one speaker to another.
FIGURE 1. TTS Architecture. The prediction network of T2-mlms contains T2, whereas the T2-mlms-gst prediction network contains T2-mlms, such that T2 ⊂ T2-mlms ⊂ T2-mlms-gst. All models use WaveGlow as a vocoder.

III. METHODS
This section explains the proposed TTS architecture and hierarchical transfer learning training strategy.

A. MODEL ARCHITECTURES
The TTS architecture in our work consists of three modules: an encoder module that converts inputs into feature representations; a decoder module that converts the feature representations into acoustic parameters (the mel-spectrogram); and a vocoder module that produces sound signals from the mel-spectrogram.
The encoder-decoder network is also called the spectrogram prediction network, since it predicts the spectrogram output from the text input.
The entire proposed multilingual multi-speaker TTS model, illustrated in Figure 1, is a sequence-to-sequence (seq-to-seq) Tacotron-2 network [13] with some additions: style embedding as in [24], pitch contour and attention map as in [25], language embedding, and speaker embedding. These additional networks handle multilingual and multi-speaker synthesis and the transfer of speaking style, pitch, and rhythm from a reference audio.
There are three TTS models used in our study: T2, a Tacotron-2-based encoder-decoder architecture; T2-mlms, an extension of T2 that adds language embedding and speaker embedding; and T2-mlms-gst, an extension of T2-mlms that adds the GST encoder, pitch, and rhythm components for prosody transfer.

1) TEXT ENCODER
The text encoder generates a $T_X \times d_X$-dimensional representation of the grapheme sequence, where $T_X$ is the length of the encoded text (usually the same as the transcript length) and $d_X$ is the dimension of the encoded text. We adopt the text encoder network used in Tacotron-2. It consists of a learnable $d_X$-dimensional character embedding, followed by stacked convolutional layers with filters that span 5 characters to model the long-term context (N-grams) of the input sequence. After batch normalization and ReLU activation, the output of the convolutional layers is passed into a single bi-LSTM with $d_X$ units.

2) MULTILINGUAL MULTI-SPEAKER
To model multilingual and multi-speaker synthesis, we use a learnable $d_L$-dimensional language embedding network and a $d_S$-dimensional speaker embedding network that are jointly trained with the TTS task without the need for changes in the loss metrics. The training process updates the parameters so that languages/speakers that are similar with respect to the synthesis task lie close to each other in the vector space. For both the multilingual and the multi-speaker model, we use a channel-wise embedding concatenated with the encoder output.
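The channel-wise concatenation described above can be illustrated with a minimal sketch. It assumes plain Python lists in place of tensors and toy dimensions; the function name `concat_embeddings` and all example values are hypothetical and serve only to show how the same language and speaker vectors are appended to every encoder time step.

```python
def concat_embeddings(encoder_out, lang_emb, spk_emb):
    """Append the same language and speaker vectors to every encoder step.

    encoder_out: list of T_X vectors, each of length d_X
    lang_emb:    one vector of length d_L
    spk_emb:     one vector of length d_S
    Returns T_X vectors of length d_X + d_L + d_S.
    """
    return [list(h) + list(lang_emb) + list(spk_emb) for h in encoder_out]


# Toy example: T_X = 3 characters, d_X = 4, d_L = 2, d_S = 3 (hypothetical sizes).
encoder_out = [[0.1 * (t + 1)] * 4 for t in range(3)]
lang_emb = [1.0, 0.0]
spk_emb = [0.5, 0.5, 0.5]

conditioned = concat_embeddings(encoder_out, lang_emb, spk_emb)
print(len(conditioned), len(conditioned[0]))
```

Because the embeddings are only appended, the text part of each vector is left untouched, which is what later allows weights learned on text-only inputs to be reused.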

3) GLOBAL STYLE TOKEN
To model acoustic expressiveness, we apply a style embedding using a GST encoder to capture the speaking style from a reference audio as in [24]. The GST encoder calculates a $d_G$-dimensional style embedding that corresponds to the mel-spectrogram of a reference audio. It consists of a reference encoder network as in [23] followed by a style token layer, both of which are jointly trained with the rest of the model, driven by the reconstruction loss from the TTS decoder.

4) PITCH CONTOUR
GST only offers rough control over expressive speech characteristics. To provide finer and more detailed control, we add networks that condition on melodic information such as pitch and rhythm. In addition to the GST network, we adopt the scheme in [25] to explicitly model expressive speech variables, such as the fundamental frequency contour (F0) or pitch, the voicing decision (voiced/unvoiced), and rhythm variables. The pitch contour is extracted from the reference audio using the YIN algorithm [46] with a harmony threshold between 0.1 and 0.25. The pitch is passed to a convolutional layer followed by ReLU to obtain a $d_P$-dimensional pitch representation.

5) RHYTHM
Rhythm, also called the alignment map, is learnt from text and spectrogram as described in [13] by using location-sensitive attention [47], which is an extension of additive attention [48]. The alignment map is a $T_M \times T_X$-dimensional matrix that contains the alignment (or attention weight) between an input text $X$ with a length of $T_X$ characters and a reference mel-spectrogram $M$ with a length of $T_M$ frames. By learning the alignment map during training, we can control the rhythm during inference. The alignment map is extracted using a forced aligner from <reference audio, transcription> data pairs, as in [25]. TTS can produce the same rhythm as the reference audio using the extracted alignment map.

6) WAVEGLOW
WaveGlow is a non-autoregressive vocoder that is able to convert mel-spectrograms into waveforms faster than real time [44]. WaveGlow combines the flow-based generative model Glow [49] and WaveNet [16] to generate waveforms non-autoregressively, making it possible to speed up the training process on a large scale while maintaining the naturalness of synthesized speech. The WaveGlow vocoder consists of a single network that is trained using a single cost function to maximize the likelihood of the training data, which makes the training procedure simpler and more stable.

B. PREDICTION NETWORK FORMULATION
The following section describes the spectrogram prediction network formulation for T2, T2-mlms, and T2-mlms-gst in more detail. The spectrogram prediction network is the encoder and decoder part of the architecture illustrated in Figure 1. It is a seq-to-seq model that converts an input text sequence $X = (x_1, \ldots, x_{T_X})$ into an output spectrogram sequence $Y = (y_1, \ldots, y_{T_Y})$. Each $y_t$ is predicted based on all previous outputs $y_1, \ldots, y_{t-1}$. The prediction is computed using the attention-based encoder-decoder scheme.

1) MODEL T2
The proposed T2 model adopts the spectrogram prediction network used by Tacotron-2 [13]. In the T2 model, the encoder processes the input text sequence $X = (x_1, \ldots, x_{T_X})$, where $T_X$ is the number of characters in the normalized text, and converts it into $T_X \times d_X$-dimensional hidden representations $H = (h_1, \ldots, h_{T_X})$ in the following way:
$$H = \mathrm{Encoder}(X; \theta_e) \tag{1}$$
where $\theta_e$ denotes the encoder model parameters. The hidden representations $H = (h_1, \ldots, h_{T_X})$ are processed by the decoder network to produce the predicted mel-spectrogram $Y = (y_1, \ldots, y_{T_Y})$, from which the vocoder generates speech waveforms. To produce output $y_t$, the decoder calculates a new decoder hidden state $s_t$ based on the prior state $s_{t-1}$, the prior output $y_{t-1}$, and the attention context vector $c_t$. The decoder state $s_t$ is formulated as follows:
$$s_t = \mathrm{Decoder}(s_{t-1}, y_{t-1}, c_t; \theta_d) \tag{2}$$
where $\theta_d$ denotes the decoder model parameters and $c_t$ is the context vector, computed using the attention scheme:
$$c_t = \sum_{i=1}^{T_X} \alpha_{t,i} h_i \tag{3}$$
where $\alpha_{t,i}$ is the attention weight, calculated as follows:
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_X} \exp(e_{t,j})} \tag{4}$$
where $e_{t,i}$ is the attention score or energy, calculated using location-sensitive attention as follows:
$$e_{t,i} = w^{\top} \tanh(W s_{t-1} + V h_i + U f_{t,i} + b) \tag{5}$$
$$f_t = F * \alpha_{t-1}$$
where $s_{t-1}$ is the decoder hidden state from the prior time step, $h_i$ is the $i$-th encoder hidden state, and $f_{t,i}$ is the location feature ($*$ is a 1-dimensional convolution operator). $U$, $V$, $W$, and $F$ are trainable weight matrices, $w$ is a trainable weight vector, and $b$ is a trainable bias. Finally, the output mel-spectrogram $Y = (y_1, \ldots, y_{T_Y})$ and the stop token $Z = (z_1, \ldots, z_{T_Y})$ are produced. For each time step $t$, $y_t$ and $z_t$ are calculated using the following equation:
$$y_t = f_{FC}(s_t), \qquad z_t = f_{ST}(s_t)$$
where $f_{FC}$ is a fully connected network that processes the decoder state $s_t$ by a linear projection to produce the predicted output, and $f_{ST}$ is a linear projection followed by a sigmoid to predict when the generation should stop. A stacked convolutional post-net consumes $Y = (y_1, \ldots, y_{T_Y})$ to obtain $\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_{T_Y})$ by adding a residual prediction to improve the overall reconstruction as follows:
$$\hat{Y} = Y + \mathrm{PostNet}(Y)$$

2) MODEL T2-MLMS
The T2-mlms model extends T2 with a language embedding and a speaker embedding. The text hidden representation $H = (h_1, \ldots, h_{T_X})$ is concatenated with both embeddings to produce $H' = (h'_1, \ldots, h'_{T_X})$. Each $h'_i$ is computed as follows:
$$h'_i = h_i \oplus l \oplus q$$
where $\oplus$ is a concatenation operator, $l$ is the language embedding, and $q$ is the speaker embedding. With this additional information, Equation (3) is changed into:
$$c_t = \sum_{i=1}^{T_X} \alpha_{t,i} h'_i$$
where $h'_i$ is the concatenation of the text, language, and speaker embeddings. Likewise, Equation (5) is also changed into:
$$e_{t,i} = w^{\top} \tanh(W s_{t-1} + V h'_i + U f_{t,i} + b)$$

3) MODEL T2-MLMS-GST
The T2-mlms-gst model is an extension of T2-mlms that adds a $d_G$-dimensional style embedding $g$, a $d_P$-dimensional pitch embedding, and a $T_M \times T_X$-dimensional rhythm $R$. Pitch $P$ is extracted from the reference audio $M = (m_1, \ldots, m_{T_M})$ with a length of $T_M$, and $R$ is the $T_M \times T_X$-dimensional alignment map between the reference audio $M$ and the text. During the training process, the ground truth audio is used as the reference audio $M$ and $R$ is set to ''none''. During model inference, $R$ is extracted using T2-mlms-gst by performing a teacher-forced forward pass on any desired reference audio $M$. The length $T_Y$ of the predicted mel-spectrogram is equal to the length $T_M$ of the reference mel-spectrogram because we apply forced alignment. In the T2-mlms-gst model, the text hidden representation $H = (h_1, \ldots, h_{T_X})$ is concatenated with the language embedding $l$, the speaker embedding $q$, and the style embedding $g$ to produce $H'' = (h''_1, \ldots, h''_{T_X})$. Each $h''_i$ is computed as follows:
$$h''_i = h_i \oplus l \oplus q \oplus g$$
The style embedding $g$ is generated by the GST network from the reference audio and formulated as follows:
$$g = \mathrm{GST}(M; \theta_G)$$
where $\theta_G$ denotes the GST network parameters and $M = (m_1, \ldots, m_{T_M})$ is the mel-spectrogram of the reference audio.
With this addition, in the T2-mlms-gst model, Equation (3) is changed into:
$$c_t = \sum_{i=1}^{T_X} \alpha_{t,i} h''_i$$
where $h''_i$ is the hidden representation of the text encoder concatenated with the language, speaker, and style embeddings.
Here, the formulation differs slightly between model training and inference. During training, the attention weight $\alpha_{t,i}$ is calculated using Equation (4) by changing Equation (5) into:
$$e_{t,i} = w^{\top} \tanh(W s_{t-1} + V h''_i + U f_{t,i} + b)$$
During inference, the attention weight $\alpha_{t,i}$ is obtained from the extracted alignment map $R$, as follows:
$$\alpha_{t,i} = R_{t,i}$$
Meanwhile, the pitch information $P = (p_1, \ldots, p_{T_M})$ is extracted from the reference audio $M$ using the YIN algorithm and processed through the pre-net-F0 decoder. Pitch $p_t$ is concatenated with the previous spectrogram output $y_{t-1}$ that is processed through the pre-net decoder. This information is used by the decoder to find the decoder hidden state $s_t$. It is calculated with a new equation, replacing Equation (2):
$$s_t = \mathrm{Decoder}(s_{t-1}, y_{t-1} \oplus p_t, c_t; \theta_d) \tag{18}$$
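The difference between the two modes can be sketched in a few lines. This is a minimal illustration with plain Python lists, not the actual model code: the attention energies are passed in as given numbers (the location-sensitive energy of Equation (5) is not computed here), and the function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Normalize attention energies e_{t,i} into weights alpha_{t,i}."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, hidden):
    """Weighted sum of encoder states: c_t = sum_i alpha_{t,i} * h_i."""
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]


def attention_weights(t, energies=None, alignment_map=None):
    """Training: softmax over energies. Inference: row t of the alignment map R."""
    if alignment_map is not None:
        return alignment_map[t]
    return softmax(energies)


# Toy encoder states for T_X = 3 characters (hypothetical values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Training-style step: weights come from attention energies.
train_weights = attention_weights(0, energies=[2.0, 0.5, 0.5])
train_context = context_vector(train_weights, hidden)

# Inference-style step: weights are read from a T_M x T_X alignment map R.
R = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
infer_context = context_vector(attention_weights(1, alignment_map=R), hidden)
```

Since each row of $R$ fixes which characters a frame attends to, reusing $R$ from a reference utterance reproduces that utterance's timing regardless of what the learned attention would have chosen.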

C. HIERARCHICAL TRANSFER LEARNING
Our models are trained using the teacher-forcing procedure, the standard maximum-likelihood training, by feeding the ground truth spectrogram frame instead of the predicted one into the decoder network. Thus, $y_{t-1}$ in Equations (2) and (18) is replaced by the ground truth $y^{gt}_{t-1}$ during the training process. The model is optimized by minimizing the summed mean squared error (MSE) of an objective function defined over the following quantities: $Y^{gt}$, the ground truth/target mel-spectrogram; $Y$, the predicted mel-spectrogram; $\hat{Y}$, the predicted mel-spectrogram after the post-net; $Z^{gt}$, the ground truth stop token sequence; and $Z$, the predicted stop token sequence.
There are four layers of training stages in our proposed hierarchical transfer learning architecture, as illustrated in Figure 2. In the first layer, the monolingual single-speaker T2 model is trained on a high-resource language source domain. We use English as the source domain. In the second layer, the pre-trained monolingual single-speaker T2 model from the first layer is fine-tuned on a low-resource language target domain. We train our T2 model for Indonesian, Javanese, and Sundanese separately. In the third layer, the pre-trained model obtained in the second layer is transferred to initialize the multilingual multi-speaker T2-mlms model, which is then fine-tuned on a multilingual multi-speaker target domain. We use a joint multilingual dataset of Indonesian, Javanese, and Sundanese. In the fourth layer, the pre-trained T2-mlms is transferred to partially initialize the multilingual multi-speaker T2-mlms-gst model with style transfer, which is then fine-tuned on the same joint dataset as the target domain. Each model optimization in the hierarchical transfer learning scheme is shown in Algorithm 1 and Algorithm 2.
The same is done on the Javanese/Sundanese dataset to produce the single-speaker monolingual T2-jv/T2-su. In the third layer, the T2-id model parameters are transferred to initialize T2-mlms, which is then fine-tuned on the joint multilingual dataset, ID-JV-SU. In the fourth layer, the T2-mlms-gst model is partially initialized using the pre-trained T2-mlms and fine-tuned using the same multilingual dataset.
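The four layers can be summarized as a small dependency table. The sketch below is only an illustration of the ordering constraint, assuming the Indonesian branch of the hierarchy; the label "T2-en" for the English source model and the helper names are hypothetical, and the Javanese and Sundanese single-speaker models are left out for brevity.

```python
# Each stage: (model name, architecture, initialized from, training data).
# "T2-en" is a hypothetical label for the English source model.
STAGES = [
    ("T2-en", "T2", None, "English (LJSpeech)"),
    ("T2-id", "T2", "T2-en", "Indonesian, single speaker"),
    ("T2-mlms", "T2-mlms", "T2-id", "joint ID-JV-SU"),
    ("T2-mlms-gst", "T2-mlms-gst", "T2-mlms", "joint ID-JV-SU"),
]


def training_order(stages):
    """Return model names so that every model appears after its initializer."""
    trained, order = set(), []
    pending = list(stages)
    while pending:
        progressed = False
        for stage in list(pending):
            name, _, parent, _ = stage
            if parent is None or parent in trained:
                order.append(name)
                trained.add(name)
                pending.remove(stage)
                progressed = True
        if not progressed:
            raise ValueError("a stage depends on a model that is never trained")
    return order


def needs_partial_transfer(stage, stages):
    """True when the initializer has a different (smaller) architecture."""
    _, arch, parent, _ = stage
    if parent is None:
        return False
    parent_arch = next(a for n, a, _, _ in stages if n == parent)
    return parent_arch != arch


for stage in STAGES:
    print(stage[0], "partial transfer:", needs_partial_transfer(stage, STAGES))
```

Only the third and fourth layers change the architecture, so only they need the partial weight transfer; the second layer can copy all parameters directly.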

Algorithm 1 (concluding steps)
Compute argmax ..., where $G_3$ and $P_3$ are the speaking style and the pitch representations extracted from the reference audio (ground truth audio $Y_3$), whereas $\theta_{mlms\_gst}$ denotes the model parameters fine-tuned from $\theta_{mlms\_gst\_init}$.
Output: $\theta_2$, $\theta_{mlms}$, and $\theta_{mlms\_gst}$

Algorithm 2 Transfer Parameter Weights for Multi-Models
Input: Target model parameters $\theta_{target}$ and source model parameters $\theta_{source}$, where structure($\theta_{source}$) ⊂ structure($\theta_{target}$).
1. For each $\theta^s_w \subset \theta_{source}$ that corresponds to $\theta^t_w \subset \theta_{target}$, where $\theta^s_w$ and $\theta^t_w$ are trainable weight vectors, matrices, or 3D-tensors of a layer in our DNN models:
...
   # update $\theta^t_w$ using element-wise update
6. If $\theta^s_w$ is a vector, for each $w^s_a \in \theta^s_w$: ...

Transfer learning between models with the same structure is quite simple. All network parameters of the model pre-trained on the source domain are transferred to initialize the model in the next layer, which is then further fine-tuned on the target domain. If the transfer is from a simpler model to a more complex model, such as T2-mlms in the third layer and T2-mlms-gst in the fourth layer, an additional process is required to transfer the learned weights to a different model structure (see the second and fourth steps of Algorithm 1).
A different layer structure has a different trainable weight shape. In our models, layers that are shared but differ in structure are found in the attention and the decoder. Hence, the weight matrices of these layers have different dimensions in the predecessor model and the successor model. After the successor model is created using the standard initialization, we transfer the predecessor model parameters to the corresponding successor model parameters using Algorithm 2. Transferring a weight matrix (or vector or tensor), even if only partially, is more effective than training the whole weight matrix from scratch. The partial weight matrix transfer allows us to fine-tune the higher-dimensional weight matrix of the successor model by making use of the learned lower-dimensional weight matrix of the predecessor model.
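The idea of a partial, element-wise weight transfer can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 2: weights are nested Python lists rather than tensors, the smaller source block is assumed to map onto the leading indices of each dimension of the larger target weight, and the layer names and values are hypothetical.

```python
import copy


def copy_into(source, target):
    """Overwrite the leading block of `target` with `source`, element by element.

    Both arguments are (nested) lists: a vector, a matrix, or a 3D tensor.
    Every dimension of `source` must be no larger than that of `target`.
    """
    for i, value in enumerate(source):
        if isinstance(value, list):
            copy_into(value, target[i])
        else:
            target[i] = value


def transfer_weights(source_params, target_params):
    """Return target parameters with matching layers partially initialized."""
    new_params = copy.deepcopy(target_params)
    for name, weights in source_params.items():
        if name in new_params:  # layers that exist only in the target are kept
            copy_into(weights, new_params[name])
    return new_params


# Hypothetical layer names and toy values.
prior = {"attention.memory": [[1.0, 2.0], [3.0, 4.0]], "decoder.bias": [5.0]}
successor_init = {
    "attention.memory": [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]],
    "decoder.bias": [0.2, 0.2],
    "language.embedding": [[0.3, 0.3]],
}
successor = transfer_weights(prior, successor_init)
```

Entries that have no counterpart in the smaller model keep their standard initialization and are learned during fine-tuning, while the copied block starts from the values already learned by the predecessor.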

IV. EXPERIMENTS

A. DATASET
Our work utilizes publicly available datasets: LJSpeech [50], an English speech corpus with a total duration of about 24 hours; TITML-IDN [51], an Indonesian (ID) speech corpus with an average of 43 minutes for each speaker; OpenSLR jv-ID [52], a Javanese (JV) speech corpus with an average of 10 minutes for each speaker; and OpenSLR su-ID [52], a Sundanese (SU) speech corpus with an average of 7 minutes for each speaker. The T2 model for Indonesian, Javanese, and Sundanese uses a subset of the corpus consisting of one female speaker for each language, as shown in Table 1. For T2-mlms and T2-mlms-gst, we use a joint multi-speaker multilingual dataset, referred to as the 10ID-10JV-10SU dataset, as shown in Table 2. Data pre-processing was carried out to equalize the audio sample rate to 16000 Hz and to clean up the text transcriptions.

B. MODEL IMPLEMENTATION
We implement our TTS model using the PyTorch library [53]. For the T2 model, we modify the open-source code of NVIDIA Tacotron-2 [54] to support text processing for Indonesian, Javanese, and Sundanese. For the T2-mlms model, we add embedding networks to handle speaker and language identity. For the T2-mlms-gst model, we add a reference encoder network for style embedding [23], a GST network as in [24], and pitch and rhythm as in [25].
For each model, we use the same feature representations for both text and acoustic features. For text features, the grapheme level is used to produce encoded text with a dimension of 512. We use an 80-channel mel-spectrogram as the acoustic feature. We use a language embedding dimension of 8, a speaker embedding dimension of 128, a style embedding dimension of 256, and a 1-dimensional pitch embedding. More details about the spectral analysis and the model hyper-parameters can be seen in Table 3.
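As a quick consistency check on these sizes, the width of the per-character representation can be worked out for each model. The arithmetic below assumes that the embeddings are joined to each encoder output vector by plain concatenation with no intermediate projection; this is an assumption, since any additional projection layer would change the numbers.

```latex
\begin{aligned}
\dim(h_i) &= d_X = 512 && \text{(T2)}\\
\dim(h_i \oplus l \oplus q) &= 512 + 8 + 128 = 648 && \text{(T2-mlms)}\\
\dim(h_i \oplus l \oplus q \oplus g) &= 648 + 256 = 904 && \text{(T2-mlms-gst)}
\end{aligned}
```

Under this assumption, any layer that consumes the encoder representation grows from 512 to 648 and then to 904 input channels across the three models, which is the kind of shape mismatch that the partial weight transfer has to bridge.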

C. TRAINING SETUP
Each model is trained using two schemes: training from scratch and transfer learning. Each training scenario uses a batch size of 32. We use 300K training steps, except for T2 with transfer learning, which uses 10K steps. The spectrogram prediction network is trained with the standard maximum-likelihood estimation (MLE) procedure, feeding the correct output instead of the prediction on the decoder side, which is referred to as teacher forcing. We use the Adam optimizer [55] with default parameters, a learning rate starting at 1e-3, and a weight decay of 1e-6. Models are trained using a single NVIDIA DGX-1 GPU. Table 4 shows the model names referred to in this article along with the training setup information: the architecture, language, dataset, and pre-trained model for transfer learning. For T2 on Javanese and Sundanese, we use the pre-trained T2 on IDF-01 instead of on LJS, as explained in our prior work [45].
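The difference between teacher forcing and free-running decoding can be shown with a toy autoregressive loop. This is only a minimal sketch: the scalar "frames", the all-zero start frame, and the names `run_decoder` and `toy_step` are hypothetical stand-ins for the actual decoder, which operates on mel-spectrogram frames.

```python
from typing import Callable, List


def run_decoder(step_fn: Callable[[float], float],
                targets: List[float],
                teacher_forcing: bool) -> List[float]:
    """Run an autoregressive decoder for len(targets) steps.

    With teacher forcing, the ground-truth previous frame is fed to the next
    step; otherwise the decoder's own previous prediction is fed back.
    """
    prev = 0.0  # ASSUMPTION: an all-zero start frame
    outputs = []
    for target in targets:
        pred = step_fn(prev)
        outputs.append(pred)
        prev = target if teacher_forcing else pred
    return outputs


def toy_step(prev: float) -> float:
    """Hypothetical stand-in for one decoder step."""
    return 0.5 * prev + 1.0


targets = [1.0, 2.0, 3.0]
forced = run_decoder(toy_step, targets, teacher_forcing=True)
free = run_decoder(toy_step, targets, teacher_forcing=False)
print(forced)  # [1.0, 1.5, 2.0]
print(free)    # [1.0, 1.5, 1.75]
```

The two runs diverge from the third step onward, because only the teacher-forced run sees the correct previous frame. At synthesis time no ground truth is available, so the model runs in the free-running mode.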
As for the vocoder, we use WaveGlow pre-trained on the LJS dataset [56] and fine-tune it on the IDF-01 dataset for the single-speaker monolingual T2 model. Our experiments suggest that WaveGlow trained on the English LJS dataset gives poor results when applied to Indonesian speech, so it needs to be fine-tuned on the Indonesian domain. Our experiments also indicate that WaveGlow trained on a female speaker produces good synthesized speech only for speakers of the same gender and poor results for male speakers. Thus, for multi-speaker multilingual TTS, we further fine-tune WaveGlow on the 10ID-10JV-10SU dataset.

D. MODEL EVALUATION
To evaluate our models, we use subjective assessments involving 9-20 respondents and objective assessments based on acoustic features. Two subjective evaluations are employed: semantically unpredictable sentences (SUS) [57] to measure the intelligibility of speech, and a mean opinion score (MOS) [58] on a scale of 1 to 5 in increments of 1 to measure the quality of speech synthesis. The objective evaluations use four metrics as in [23]: mel-cepstral distortion (MCD_K) [59], gross pitch error (GPE) [60], voicing decision error (VDE) [60], and F0 frame error (FFE) [61].
Before calculating the objective metrics, we apply padding according to the type of domain to equalize the length of the signal frames, because not all models produce the same signal length as the reference audio. For the MCD_K evaluation, we use 13 mel-frequency cepstral coefficients (MFCCs), producing the MCD_13 metric. We extract pitch and voicing decisions using the YIN algorithm [46] to calculate GPE, VDE, and FFE.
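The padding and distortion computation can be sketched as follows. The sketch uses one common form of MCD_K, the frame-averaged Euclidean distance between coefficient vectors; some definitions additionally multiply by a constant factor to express the result in dB. Right-padding with zero frames is an assumption, since the padding value depends on the feature domain, and the names `pad_to_same_length` and `mcd` are hypothetical.

```python
import math
from typing import List, Tuple

Frames = List[List[float]]  # one list of K cepstral coefficients per frame


def pad_to_same_length(a: Frames, b: Frames,
                       pad_value: float = 0.0) -> Tuple[Frames, Frames]:
    """Right-pad the shorter sequence with constant frames (ASSUMPTION: zeros)."""
    k = len(a[0])
    target = max(len(a), len(b))

    def pad(x: Frames) -> Frames:
        return x + [[pad_value] * k for _ in range(target - len(x))]

    return pad(a), pad(b)


def mcd(ref: Frames, syn: Frames) -> float:
    """Mean over frames of the Euclidean distance between coefficient vectors."""
    ref, syn = pad_to_same_length(ref, syn)
    dists = [math.sqrt(sum((r - s) ** 2 for r, s in zip(fr, fs)))
             for fr, fs in zip(ref, syn)]
    return sum(dists) / len(dists)


# Toy example with K = 2 coefficients; MCD_13 would use 13 per frame.
ref_frames = [[0.0, 0.0], [3.0, 4.0]]
syn_frames = [[0.0, 0.0]]  # one frame shorter than the reference
print(mcd(ref_frames, syn_frames))  # 2.5
```

Here the synthesized sequence is padded to two frames, the per-frame distances are 0 and 5, and their mean is 2.5.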

V. RESULTS AND ANALYSIS
This section presents the comparison of alignment learning using two training schemes: training from scratch and hierarchical transfer learning schemes for all models. It also presents the evaluation of the speech synthesis produced by the TTS models trained using the transfer learning scheme.

A. ALIGNMENT LEARNING
TTS is a sequence-to-sequence (seq-to-seq) problem: given a text sequence, it produces a sound wave sequence. As is typical of seq-to-seq problems, it is important to learn the alignment between the input sequence and the output sequence. A TTS model that fails to learn a reasonable alignment is unable to synthesize intelligible speech. We use the location-sensitive attention scheme [47] to learn the mapping between the input text and the output mel-spectrogram. This mapping is referred to as an alignment map or attention map. A TTS model that is able to produce intelligible speech can be recognized by an alignment that forms a diagonal map. In the attention-based encoder-decoder framework, the diagonal map shows that alignment learning between the encoder steps and the decoder steps has been successful.
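How close an attention map is to a diagonal can also be checked numerically. The heuristic below is a hypothetical diagnostic added for illustration only; it is not a measure used in our evaluation, and the name `diagonal_deviation` is an assumption. It compares the peak encoder position of each decoder step with the ideal diagonal.

```python
from typing import List


def diagonal_deviation(attention: List[List[float]]) -> float:
    """Mean normalized distance between each decoder step's peak encoder
    position and the ideal diagonal. 0.0 means a perfectly diagonal map.

    `attention[t][j]` is the weight of encoder step j at decoder step t.
    """
    n_dec = len(attention)
    n_enc = len(attention[0])
    total = 0.0
    for t, weights in enumerate(attention):
        peak = max(range(n_enc), key=lambda j: weights[j])
        # Normalize both positions to [0, 1] so the map need not be square.
        dec_pos = t / (n_dec - 1) if n_dec > 1 else 0.0
        enc_pos = peak / (n_enc - 1) if n_enc > 1 else 0.0
        total += abs(dec_pos - enc_pos)
    return total / n_dec


diagonal_map = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]]
stuck_map = [[1.0, 0.0, 0.0]] * 3  # attention never leaves the first symbol

print(diagonal_deviation(diagonal_map))  # 0.0
print(diagonal_deviation(stuck_map))     # 0.5
```

A map whose attention stays on the first input symbol, a typical sign of failed alignment learning, receives a clearly higher deviation than a diagonal map.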
Our preliminary study suggests that training the T2 model from scratch needs at least 10 hours of data to produce good-quality synthesized speech. With less than 3 hours of data, the model was unable to produce clear synthesized speech, and with less than 1 hour it could not produce intelligible speech at all. However, for larger models such as T2-mlms and T2-mlms-gst, 10 hours of training data is still insufficient. The standard training scheme fails to produce an accurate alignment map; hence, the models are unable to produce intelligible speech. The results are different when we apply the proposed hierarchical transfer learning scheme. This scheme learns fast and produces a reasonable alignment map using less than 1 hour of training data for the T2 model (39, 16, and 18 minutes for Indonesian, Javanese, and Sundanese, respectively) and 11 hours of data for T2-mlms and T2-mlms-gst. The learning process is shown in Figure 3, and the alignment maps are shown in Figure 4, Figure 5, and Figure 6 for T2, T2-mlms, and T2-mlms-gst, respectively.
These figures show the effectiveness of the transfer learning strategy applied to the single-speaker monolingual TTS model T2 and to the multi-speaker multilingual TTS models with and without style transfer, T2-mlms-gst and T2-mlms. Using this learning scheme, all models can quickly learn the alignment. This is not the case for the standard training scheme, in which the model is still unable to produce a reasonable map after 300K iterations. Extending training to 500K iterations still does not enable the T2-mlms-gst model to learn the proper mapping between the text input and the mel-spectrogram output.

B. INTELLIGIBILITY AND NATURALNESS
The intelligibility and naturalness of the speech synthesized by the TTS models are evaluated using SUS and MOS for a female speaker. The results are shown in Table 5, which contains the MOS evaluations of T2 for Indonesian (ID), Javanese (JV), and Sundanese (SU), T2-mlms-tl, and T2-mlms-gst-tl. For the SUS evaluations, we report the T2 and T2-mlms models only. For comparison, the table also presents the MOS result of the baseline Tacotron-2 [13] for English (ENG). For each language, the MOS of the real human voice is presented as the ground truth, together with a ''diff'' that contains the MOS difference between the ground truth and the synthesized speech produced by our models.
The MOS and SUS results demonstrate that, using less than 1 hour of training data, our T2 model gives a MOS comparable to that of the baseline Tacotron-2 trained on a large English dataset (24.6 hours). Likewise, the more complex multi-speaker multilingual models, T2-mlms and T2-mlms-gst, can be trained using 11 hours of the joint multilingual dataset. Using the proposed learning scheme, our multilingual model provides an even better ''diff'' value than that of the English Tacotron-2 (''diff'' = 0.056), specifically on Indonesian and Sundanese, with a ''diff'' of 0.017 and -0.08, respectively. The SUS evaluations show that our models are able to produce intelligible synthesized speech. The SUS evaluation on Indonesian has the best performance, with a word accuracy of 98.96%, whereas the SUS accuracies for Javanese and Sundanese are 98.52% and 97.53%, respectively.
Overall, the SUS and MOS evaluations show that the model trained on the joint multilingual dataset performs better than the models trained on the monolingual datasets. Using the joint multilingual dataset can significantly improve the naturalness and the intelligibility of the synthesized speech in each language. The model can generalize a language better by benefitting from the other languages included in the joint multilingual dataset. Interestingly, Sundanese, which has the least amount of data in the joint dataset (1.7 hours), has the largest MOS improvement, an increment of 0.36 from the T2-su-tl MOS to the T2-mlms-gst-tl MOS on Sundanese, and gives a better MOS than the ground truth. Javanese, with 2.3 hours of data in the joint dataset, provides a MOS increment of 0.123 from the T2-jv-tl MOS to the T2-mlms-tl MOS on Javanese. As for Indonesian, which contributes 7 hours of data, a MOS increment of 0.074 is obtained from the T2-id-tl MOS to the T2-mlms-tl MOS on Indonesian.
In addition to the obvious benefit of using more data from other languages, we can see that the closer the phonetic similarity, the more benefit a language can gain. Sundanese, which has only one phoneme in addition to those of Indonesian (while Javanese has three), gains the most from its phonetic similarity to Indonesian, the language that contributes the largest amount of data to the joint dataset, when combined with style transfer.

C. PARALLEL STYLE TRANSFER
Parallel style transfer is the transfer of speaking style using the same sentence for the synthesized speech signal and the reference signal. To evaluate the style transfer performance, we use the GPE, VDE, FFE, and MCD_K metrics proposed in E2E-Prosody [23], each of which reflects the acoustic prosody correlation. For the MCD_K evaluation, we use MCD_13 with K = 13 MFCC coefficients. Table 6 shows the GPE, VDE, FFE, and MCD evaluation results of the speech synthesized by our models trained using the transfer learning scheme: T2 for each language, T2-mlms, and T2-mlms-gst. The table also presents the evaluation results of E2E-Prosody [23] and Mellotron [25] for comparison. It also displays the metrics per gender: F is for female, M is for male, and F/M is for both. The metrics are calculated by comparing the synthesized speech signal and the reference signal for the speakers listed in the ''speaker'' and ''ref'' columns, respectively. The synthesized speech is produced by the TTS model listed in the ''model'' column.

1) PITCH TRACKING
Pitch tracking between synthesized speech and reference audio can be measured using FFE. The FFE metric combines two metrics: GPE, which compares the pitch magnitude between the synthesized speech signal and the reference signal, and VDE, which compares the voicing decisions (voiced/unvoiced). For prosody transfer, the lower the FFE, the more successful the style transfer.
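The relation between the three pitch metrics can be sketched as follows. The sketch assumes the conventional definitions: a gross pitch error is counted when both frames are voiced and the pitch deviates by more than 20% of the reference, and an F0 value of 0 marks an unvoiced frame. The function name `pitch_errors` and the toy F0 tracks are hypothetical.

```python
from typing import Sequence, Tuple


def pitch_errors(ref_f0: Sequence[float], syn_f0: Sequence[float],
                 tol: float = 0.2) -> Tuple[float, float, float]:
    """Return (GPE, VDE, FFE) for two equal-length F0 tracks.

    ASSUMPTION: F0 == 0 marks an unvoiced frame, and `tol` is the usual 20%.
    """
    assert len(ref_f0) == len(syn_f0) and len(ref_f0) > 0
    total = len(ref_f0)
    both_voiced = 0
    gross = 0    # both voiced, pitch deviates by more than tol * reference
    voicing = 0  # voicing decisions disagree
    for r, s in zip(ref_f0, syn_f0):
        r_voiced, s_voiced = r > 0, s > 0
        if r_voiced != s_voiced:
            voicing += 1
        elif r_voiced:
            both_voiced += 1
            if abs(s - r) > tol * r:
                gross += 1
    gpe = gross / both_voiced if both_voiced else 0.0
    vde = voicing / total
    ffe = (gross + voicing) / total  # frames with either kind of error
    return gpe, vde, ffe


ref = [100.0, 100.0, 0.0, 200.0, 0.0]
syn = [105.0, 150.0, 0.0, 0.0, 0.0]
gpe, vde, ffe = pitch_errors(ref, syn)
print(gpe, vde, ffe)
```

In this toy case, one of the two jointly voiced frames has a gross pitch error (GPE = 0.5), one of the five frames has a voicing mismatch (VDE = 0.2), and two of the five frames are erroneous in total (FFE = 0.4). GPE is normalized by the jointly voiced frames, whereas VDE and FFE are normalized by all frames.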
From Table 6 we can see that, by applying style transfer, the T2-mlms-gst model provides a better FFE than T2 and T2-mlms, which do not apply it. The FFE results also demonstrate the effectiveness of our proposed hierarchical transfer learning for learning style transfer in the T2-mlms-gst model. Using far less training data (11 hours of the joint multilingual dataset), our multilingual models give much better performance than the monolingual E2E-Prosody, which uses 147 hours and 296 hours of training data for single-speaker and multi-speaker models, respectively. In most cases, especially on female speakers, our T2-mlms-gst model is also better than Mellotron trained using 44 hours of the LJS-Sally dataset and 41.7 hours of the LibriTTS dataset. Moreover, our model is capable of cross-lingual speaker transfer, which neither E2E-Prosody nor Mellotron supports. Our model allows speakers of one language to speak fluently in other languages. However, there is a gender bias in the FFE results: FFE on male speakers is slightly worse than FFE on female speakers. Figure 7 shows the comparison of pitch tracking between the synthesized speech signal and the reference signal for the same sentence in each language. We can see that the model applying prosody transfer, T2-mlms-gst-tl, is able to imitate the reference pitch contours well. T2-mlms-tl, which does not apply prosody transfer, produces different contours.

2) MEL-SPECTROGRAM
MCD_13 is a metric for measuring the distortion between the synthesized signal and the reference signal using 13 MFCC coefficients. A lower MCD_13 score indicates better performance. From Table 6, the MCD_13 scores of our models for Indonesian and Sundanese are better than those of E2E-Prosody. Unlike the gender bias observed for FFE, the MCD_13 scores differ among speakers regardless of their gender. MFCC is computed by applying the discrete cosine transform (DCT) to the mel-spectrogram.
The mel-spectrogram comparison between the reference audio and the synthesized audio of the same texts is illustrated in Figure 8, which shows the mel-spectrograms of the reference audio (top row) and of the speech synthesized by T2-mlms-gst (middle row) and T2-mlms (bottom row) for Indonesian (left column), Javanese (middle column), and Sundanese (right column). T_X is the length of the embedded input text, T_Y is the length of the mel-spectrogram predicted by the models, and T_M is the length of the reference mel-spectrogram. The sentences used are the same as in Figure 7.

3) RHYTHM
Figure 9 shows the alignment maps between the text sentences used in Figure 8 and their corresponding speech signals in Indonesian, Javanese, and Sundanese. The same texts, represented as encoder steps, are mapped to the reference signals (a), the speech signals predicted by T2-mlms-gst (b), and the speech signals predicted by T2-mlms (c). From this figure, we can also see that the synthesis by the T2-mlms model has a different number of decoder steps from that of the reference signal, whereas forced alignment, obtained by feeding the rhythm to the T2-mlms-gst model, produces the same number of decoder steps as the reference. The higher the number of decoder steps, the slower the rhythm of the speech; conversely, the fewer the decoder steps, the faster the rhythm.
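The link between decoder steps and speaking rate follows from simple arithmetic: each decoder step emits a fixed number of mel-spectrogram frames, and each frame advances the signal by one hop. The sketch below uses the 16000 Hz sample rate of our data; the hop length of 256 samples and one frame per decoder step are assumptions made for illustration (the actual values are among the hyper-parameters in Table 3), and the function name is hypothetical.

```python
def speech_duration_seconds(decoder_steps: int,
                            hop_length: int = 256,      # ASSUMPTION
                            sample_rate: int = 16000,
                            frames_per_step: int = 1    # ASSUMPTION
                            ) -> float:
    """Duration of the speech generated by a given number of decoder steps."""
    return decoder_steps * frames_per_step * hop_length / sample_rate


# The same sentence rendered with two different numbers of decoder steps.
fast = speech_duration_seconds(250)
slow = speech_duration_seconds(500)
print(fast, slow)  # 4.0 8.0
```

Under these assumptions, doubling the number of decoder steps for the same text doubles the duration of the utterance, which corresponds to a slower rhythm.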

VI. CONCLUSION
Our work develops a Tacotron-2-based multi-speaker multilingual TTS with and without style transfer by adding several new components: speaker embedding, language embedding, style embedding, pitch embedding, and rhythm. To train the models, we propose hierarchical transfer learning, a network-based transfer learning approach that benefits from previous learning on a high-resource (source) language. Pre-trained model parameters are transferred to the same model, which is fine-tuned on a low-resource (target) language, and to a more complex model, which is fine-tuned on a joint multilingual dataset with phonetic similarity.
The experimental results demonstrate that the hierarchical transfer learning scheme is an effective choice for low-resource target languages. Alignment learning, which is crucial in attention-based encoder-decoder TTS models, is successfully transferred from the source to the target domain by fine-tuning the pre-trained source model on a small amount of target data. Moreover, the model can benefit from using a joint multilingual dataset for better generalization. The multilingual TTS models are able to generate intelligible, human-like synthesized speech. In addition, our multi-speaker multilingual TTS with style transfer is able to adequately transfer the speaking style of one speaker to another speaker of the same language or of a different one.
Although the proposed strategy performs well on a joint multilingual dataset with phonetic similarity, studying the transfer learning strategy on a low-resource domain using a multilingual dataset with large differences in linguistic aspects, such as phonetics, phonology, and grapheme symbol diversity, remains a challenge.