Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech

Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied independently regardless of the TTS model architecture or the cause of speech quality degradation and efficiently without increasing the inference time or model complexity. The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves previous models in terms of both naturalness and intelligibility.


I. INTRODUCTION
STATE-OF-THE-ART text-to-speech (TTS) systems can synthesize speech that is almost indistinguishable from human speech [1]-[4]. Nevertheless, several factors can degrade speech quality. First, it is well known that limited training data results in quality degradation of the synthesized speech [5]. Therefore, system developers have needed large-scale training data to synthesize high-quality speech, despite the high cost of data collection. Second, oversimplified or inaccurate target data during knowledge distillation can degrade speech quality. Knowledge distillation has been proposed for TTS to improve inference speed or reduce model size [6]-[8]. However, oversimplified or inaccurate data generated by a teacher model causes information loss in the target data for the student model, thus degrading the speech quality.
In this paper, we propose a novel method called perceptually guided TTS to improve the speech quality of TTS models directly. We incorporate perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted score, into the conventional training loss function for TTS. Here, we utilize the mean opinion score (MOS) as the quality score since it is the most widely used subjective evaluation metric for TTS. To predict the MOS of synthesized speech automatically, we pre-train a deep-learning-based MOS prediction model. The proposed method is independent of the TTS model architecture and of the cause of speech quality degradation. It is also efficient in that it does not increase the inference time or complexity of the model.

II. RELATED WORK
Many studies have proposed perceptual loss to improve the quality of outputs generated by a deep generative model. There are generally two orthogonal approaches for defining perceptual loss. The first approach is based on style reconstruction loss proposed by Gatys et al. [9]. It assumes that a neural network trained for classification has the perceptual information that a generative model needs to learn. Then, it tries to make the feature representations of the generative model similar to those of the pre-trained classification model. Here, perceptual loss is defined as the distance between the feature representations from the generative model and those from the classification model. This approach has been successfully applied in various fields, including image style transfer [9], [10], audio inpainting [11], speech enhancement [12], neural vocoding [2], [13], and expressive TTS [14].
The second approach uses a perceptual evaluation metric, such as the image aesthetic score, perceptual evaluation of speech quality (PESQ) [15], or short-time objective intelligibility (STOI) [16], to learn perceptual information more directly. For the image enhancement task, Talebi and Milanfar [17] have proposed to maximize the aesthetic score of an image enhanced by a convolutional neural network (CNN). They calculated perceptual loss using a pre-trained image assessment model and used the perceptual loss to train an image enhancement model. For the speech enhancement task, Zhao et al. [18] and Fu et al. [19] have proposed to fine-tune a pre-trained speech enhancement model by maximizing the modified STOI and approximated PESQ function, respectively. Martín-Doñas et al. [20] have proposed a perceptual metric for speech quality evaluation (PMSQE) by introducing two disturbance terms inspired by the PESQ algorithm. Kolbaek et al. [21] investigated six loss functions including the standard loss function (i.e., mean square error) and perceptual loss function (i.e., PMSQE). Zhang et al. [22] have proposed an approximate gradient descent algorithm to optimize PESQ and STOI directly for speech separation. For unit selection TTS, Peng et al. [23] optimized the concatenative cost function concerning its correlation with the MOS. For neural TTS, Baby et al. [24] have proposed a TTS model selection method using the phone error rate (PER) as a perceptual metric. In this paper, we define perceptual loss using a MOS prediction model and make a neural TTS model learn to maximize the MOS of speech. Since we use the perceptual evaluation metric, MOS, to learn perceptual information, our method follows this second approach.
Despite a large number of prior studies using perceptual loss, only a few of those have been proposed for neural TTS. One of them is the work done by Baby et al. [24], which selected the TTS model with the lowest PER. The authors used the PER as a criterion for selecting the best model after training is completed, not as a loss function for training the model. Our method differs from their work in two aspects: 1) we use the MOS, not the PER, as a perceptual metric, and 2) we use the perceptual metric during training, not after training. We argue that these two aspects make our method more advantageous for TTS. This is because the MOS is a more widely used metric for synthetic speech evaluation than the PER, and the use of a perceptual metric during training allows a direct update of model parameters. To the best of our knowledge, this is the first study to use MOS-based perceptual loss for training a neural TTS model.

III. METHOD
Recent neural TTS systems generally consist of two modules: a text-to-Mel-spectrogram conversion model and a vocoder. In this paper, we call the text-to-Mel-spectrogram conversion model the "TTS model" for simplicity of notation and focus on the TTS model, not the vocoder. Our method can be applied to an arbitrary TTS model regardless of the model architecture or training method since it only requires the Mel-spectrogram generated by a TTS model during training.

A. MOS PREDICTION MODEL
To directly improve the perceptual quality of synthesized speech, we introduce perceptual loss based on the predicted MOS. We slightly modify MOSNet+STC+SD [25], an improved version of MOSNet [26], and pre-train it to predict the MOS of synthetic speech from its Mel-spectrogram. MOSNet is a deep neural network that predicts a MOS from a 257-dim linear spectrogram. It consists of 12 convolutional layers, one bidirectional long short-term memory (BLSTM) layer, two fully connected (FC) layers, and a global average pooling layer. The CNN-BLSTM network consisting of the convolutional layers and the BLSTM layer acts as a feature extractor that extracts frame-level features. The outputs of the two FC layers are frame-level scores, and the final utterance-level score is obtained by the global average pooling layer.
Here, the ground truth frame-level MOSs are assumed to be the same as the ground truth utterance-level MOS, and the loss function is defined as a weighted sum of frame-level and utterance-level mean square error (MSE) losses. Fig. 1 shows the architecture of the MOS prediction model used in this study. In [25], we proposed to use multi-task learning (MTL) with spoofing type classification (STC) and spoofing detection (SD) to improve the generalization ability of MOSNet and called the proposed model MOSNet+STC+SD. Note that we modify the model in this paper to combine it with a TTS model: to use an 80-dim Mel-spectrogram generated by a TTS model instead of the 257-dim linear spectrogram as the input, we change the number of BLSTM units from 128 to 32. Since synthetic speech can threaten an automatic speaker recognition system [27], we refer to synthetic speech as "spoofing speech" and define spoofing detection (SD) as a binary classification task that discriminates between human and synthetic speech. STC is a multi-class classification task that identifies the subject that generated the input spectrogram, called the "spoofing type." Here, a spoofing type can be a speech generation system, such as Transformer TTS or FastSpeech, or a human speaker. Both auxiliary tasks share the feature extractor with MOS prediction except for the final FC layer (indicated by "Shared layers" in green in the figure). Each auxiliary task has a task-specific layer that consists of an FC layer, a global average pooling layer, and a softmax layer. The task-specific layer of SD (marked by blue rectangles) outputs the probabilities of synthetic and human speech for the input spectrogram. The task-specific layer of STC (marked by yellow rectangles) outputs the probabilities of all spoofing types in the training data. The total loss function for training the model is composed of four terms: the MSE losses for the utterance- and frame-level MOSs and the cross-entropy losses for the STC and SD tasks.
For simplicity of notation, we denote MOSNet+STC+SD as MTL-MOSNet.
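The frame-to-utterance pooling and the weighted MSE objective described above can be sketched in plain Python. The weights of 1.0 (utterance level) and 0.8 (frame level) match those reported in the implementation details; the function names themselves are illustrative, not from the paper's code.

```python
# Sketch of the MOSNet-style MOS objective: frame-level scores are averaged
# into an utterance-level score (global average pooling), and the training
# loss is a weighted sum of the utterance-level and frame-level MSE terms.
# Weights 1.0 and 0.8 follow the paper's implementation details; everything
# else is an illustrative simplification.

def utterance_score(frame_scores):
    """Global average pooling over frame-level MOS predictions."""
    return sum(frame_scores) / len(frame_scores)

def mos_loss(frame_scores, target_mos, w_utt=1.0, w_frame=0.8):
    """Weighted sum of utterance-level and frame-level MSE losses.

    Every frame shares the utterance-level ground truth MOS, as assumed
    in the paper.
    """
    utt = utterance_score(frame_scores)
    utt_mse = (utt - target_mos) ** 2
    frame_mse = sum((s - target_mos) ** 2 for s in frame_scores) / len(frame_scores)
    return w_utt * utt_mse + w_frame * frame_mse
```

In the full model, the two cross-entropy terms for STC and SD would be added to this quantity with their respective weights.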
Since there is a domain mismatch between the MOS prediction dataset and the TTS dataset, we augment the MOS prediction dataset with audio samples from the TTS dataset. In this paper, although the "domain" also includes the speaker and recording environment, it primarily refers to the language, since we use an English dataset for MOS prediction and a Korean dataset for TTS. Therefore, we augment with speech data uttered in the same language as the TTS dataset. For convenience, we only use the existing TTS dataset, although it is possible to augment with speech data uttered by speakers other than the target speaker. This data augmentation process is the first step of our method, as shown in Fig. 2. The next step is to train the MOS prediction model to minimize the MSE loss between the ground truth MOS and the predicted MOS on the augmented training data. Here, we assume that all audio samples in the TTS dataset have a ground truth MOS of 5, since obtaining exact ground truth MOSs by a subjective test is expensive and time-consuming. This assumption is reasonable in this paper because the TTS dataset we use was recorded by a professional speaker in a clean environment. Nevertheless, if the speech data in a TTS dataset was not recorded in such a controlled environment, it might be better to perform a small MOS test and take the average of the resulting scores as the ground truth MOS.
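The augmentation step (Step 1 in Fig. 2) can be sketched as follows. The data representation, simple (audio id, score) pairs, is an illustrative assumption, not the paper's actual code.

```python
# Sketch of Step 1 (Fig. 2): augment the MOS prediction training set with
# TTS-dataset audio, assigning every TTS recording a ground truth MOS of 5
# because it is clean, professionally recorded speech. Field names and the
# pair representation are illustrative assumptions.

def build_augmented_trainset(vcc_pairs, tts_clips, assumed_mos=5.0):
    """vcc_pairs: list of (audio_id, ground_truth_mos) from the MOS dataset.
    tts_clips: list of audio ids from the TTS dataset (no human ratings).
    Returns one combined list of (audio_id, mos) pairs."""
    augmented = list(vcc_pairs)
    augmented += [(clip, assumed_mos) for clip in tts_clips]
    return augmented
```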

B. PERCEPTUALLY GUIDED TTS
After pre-training the MOS prediction model, we use it to calculate the perceptual loss for TTS (see Step 3 in Fig. 2). We define perceptual loss as the L1 loss between the maximum MOS (i.e., 5) and the predicted MOS, so that minimizing the perceptual loss is equivalent to maximizing the predicted MOS. The perceptual loss, L_per, is defined as follows:

L_per = |5 - MOS_pred|, (1)

where MOS_pred denotes the MOS predicted by the pre-trained model from the generated Mel-spectrogram.

Then we combine the perceptual loss (L_per) with the conventional loss function (L_con) of a TTS model. We use the Transformer and FastSpeech TTS models to validate our method for the limited training data scenario and the knowledge distillation scenario, respectively. Conventionally, the L1 or L2 distance between the target and the predicted Mel-spectrogram is used as the main loss function for recent TTS. Based on this loss, denoted as "Mel loss," various loss functions can be used for TTS. Transformer TTS [4], a state-of-the-art TTS model, has a post-net that refines the generated Mel-spectrogram. It also has a stop linear layer that predicts the probability of the "positive stop token," which determines the end of the utterance at inference time. Therefore, the loss function of Transformer TTS consists of the L2 Mel losses from before and after the post-net and the binary cross-entropy loss for stop token prediction. According to [28], the loss function can also include guided attention loss [29], which forces the attention alignment to be diagonal; we also utilize this loss in this paper. Finally, we define the conventional loss function L_con for Transformer TTS as follows:

L_con = L_bp + L_ap + L_stop + L_attn, (2)

where L_bp, L_ap, L_stop, and L_attn represent the L2 Mel loss from before the post-net, the L2 Mel loss from after the post-net, the binary cross-entropy loss for stop token prediction, and the guided attention loss, respectively.

FastSpeech [6] uses knowledge distillation, taking the Mel-spectrogram and character durations extracted from the teacher model, a pre-trained Transformer TTS, as targets for training. Accordingly, the loss function of FastSpeech consists of the L2 Mel losses from before and after the post-net and the cross-entropy loss for duration prediction. Finally, the conventional loss function L_con for FastSpeech is defined as follows:

L_con = L_bp + L_ap + L_dur, (3)

where L_dur refers to the cross-entropy loss for duration prediction.
The proposed perceptual loss can be combined with any conventional loss function for TTS, but one issue must be considered when combining the two losses. Since the purpose of a MOS test is to evaluate a fully trained speech generation system, the MOS prediction model is trained using Mel-spectrograms from such a system as input. Therefore, the Mel-spectrograms predicted in the early stages of TTS training (i.e., from a system that is not yet fully trained) are out-of-domain data for MOS prediction. This makes the MOS prediction model unable to predict reliable MOSs for TTS outputs during the early stages of training. To address this problem, inspired by [30], we first set the weight for perceptual loss to a low value and gradually increase it as training progresses. We define the final loss function L as the weighted sum of L_con and L_per, formulated as follows:

L = λ L_con + L_per, (4)

where λ, the weight for the conventional loss, is initially set to a high value, λ_max, and is gradually reduced with a step size of γ to a certain level, λ_min. At training epoch n, λ is defined as follows:

λ = max(λ_max - γn, λ_min), (5)

where λ_max, λ_min, and γ are determined experimentally. The parameters of the TTS model are updated to minimize the final loss function. Under the supervision of perceptual loss, the TTS model can learn to maximize speech quality directly.
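A minimal sketch of this training objective in plain Python follows, assuming the weighted sum takes the form λ·L_con + L_per with λ decaying linearly from λ_max to λ_min; this form is our reading of the text, and all names are illustrative.

```python
# Sketch of the perceptually guided loss: the perceptual term is the L1
# distance between the maximum MOS (5) and the predicted MOS, and the total
# loss weights the conventional TTS loss by a lambda that decays linearly.
# The linear-decay schedule is an assumption based on the paper's description.

MAX_MOS = 5.0

def perceptual_loss(predicted_mos):
    """L1 distance to the maximum possible MOS."""
    return abs(MAX_MOS - predicted_mos)

def lam(epoch, lam_max, lam_min, gamma):
    """Conventional-loss weight: linear decay clipped at lam_min."""
    return max(lam_max - gamma * epoch, lam_min)

def total_loss(l_con, predicted_mos, epoch, lam_max, lam_min, gamma):
    """Weighted sum of the conventional and perceptual losses."""
    return lam(epoch, lam_max, lam_min, gamma) * l_con + perceptual_loss(predicted_mos)
```

With the first-scenario settings reported in the experiments (λ_max = 90, λ_min = 20, γ = 1), λ starts at 90 and reaches its floor of 20 after 70 epochs, so the perceptual term's relative influence grows over training.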

IV. EXPERIMENTS

A. DATASET
In the experiments, we demonstrate that our method improves the quality of synthesized speech by considering two different scenarios. A summary of these two scenarios is presented in Table 1. To train the MOS prediction model, we use the evaluation results of the Voice Conversion Challenge (VCC) 2018 [31]. A total of 36 voice conversion systems were submitted to the VCC 2018 and evaluated along with human speech. A total of 20580 utterances were rated by 267 listeners on a scale from 1 (completely unnatural) to 5 (completely natural). The ground truth MOS of an utterance was defined as the average of all the scores it received, yielding 20580 <audio, ground truth MOS> pairs in the evaluation results. For more details about the dataset, please refer to our previous work [32]. For each scenario, we train the MOS prediction model on the augmented dataset consisting of both the VCC 2018 evaluation results and the TTS dataset, as explained in Section III-A.

B. IMPLEMENTATION DETAILS
We implement the MOS prediction model using PyTorch and train it on a GTX 1080 Ti GPU. The weights for the utterance-level MOS prediction loss, frame-level MOS prediction loss, STC loss, and SD loss are 1, 0.8, 1, and 1, respectively, similar to those in our previous work [25]. We train the model with the Adam optimizer with a learning rate of 10^-4. We implement TTS models based on ESPnet [28], which is a widely used end-to-end speech processing toolkit. Parallel WaveGAN [33] is trained on the same TTS dataset and used as the neural vocoder for each scenario. Generated audio samples and information about the Korean language are available online at https://wkadldppdy.github.io/perceptualTTS/index.html.
For the first scenario, we train a Transformer TTS model on the Small TTS dataset using two GTX 1080 Ti GPUs. Compared to the original paper [4], we reduce the number of layers from six to three due to the limited training data and adopt character embeddings instead of phoneme ones. Also, as in [34], layer normalization is applied to character embeddings.

TABLE 2. Instructions for the subjective intelligibility test.

Score | Instructions
------+-------------------------------------------------------------------
  5   | There is no degradation, and the sentence sounds very clear.
  4   | There is degradation in less than or equal to 1/3 of the sentence,
      | but the meaning of the original sentence is fully conveyed.
  3   | There is degradation in less than or equal to 1/3 of the sentence,
      | but the sentence makes sense itself.
  2   | There is degradation in less than or equal to 1/3 of the sentence,
      | and neither is the meaning of the original sentence fully conveyed
      | nor does the sentence make sense itself.
  1   | There is degradation in more than 1/3 of the sentence.

* Degradation: inaccurate, repeated, or skipped pronunciation.

In the second scenario, we require a teacher model for knowledge distillation. We first train Transformer TTS on the Large TTS dataset using two GTX 1080 Ti GPUs and call it Transformer-L. Then, we train FastSpeech on a single GTX 1080 Ti GPU using Transformer-L as the teacher model and call it FastSpeech-L. When FastSpeech is perceptually guided, we call it P-FastSpeech-L.
As discussed in Section III-B, we initially set the value of λ to a high value and gradually reduce it to a certain level. λ max , λ min , and γ in Eq. 5 are set to 90, 20, and 1, respectively, for the first scenario and 60, 56, and 0.2, respectively, for the second scenario.

C. EVALUATION
For subjective evaluation, we conduct both naturalness and intelligibility MOS tests, which were also performed in the Blizzard Challenge 2020 [35]. For each TTS model, 20 listeners rate 25 audio samples, which results in 500 evaluated data points. For the naturalness test, listeners score each sample in the range from 1 to 5 in increments of 0.5. Then we report the MOS of a model as the average of the 500 evaluated data points with a 95% confidence interval.
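The reported MOS and its 95% confidence interval can be computed from the 500 evaluated data points as follows, assuming the usual normal approximation (mean ± 1.96 × standard error); the paper does not state its exact CI procedure, so this is an assumption.

```python
import math

# Sketch of a MOS with a 95% confidence interval, using the normal
# approximation: mean +/- 1.96 * (sample std / sqrt(n)). This is a common
# convention, assumed here rather than taken from the paper.

def mos_with_ci(scores, z=1.96):
    """Returns (mean, half_width) so the CI is mean +/- half_width."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mean, half_width
```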
In the case of the intelligibility test, the same listeners score each sample on a scale of 1 to 5 in increments of 1. They are provided with input texts of speech samples, unlike in the naturalness test. Since we conduct the naturalness test before the intelligibility test, the listeners listen to the samples first without input texts and thus the input texts do not affect the naturalness test. Unlike general MOS tests, we provide the listeners with input texts for the following reasons. The speech generated by neural TTS models often suffers from repeated or skipped pronunciations due to imperfect alignment between the text and Mel-spectrogram. Since the purpose of TTS is to generate speech that exactly matches the input text, scores should be deducted for utterances with such problems. However, some utterances still make sense even with repeated or skipped words, and they can be problematic when an input text is not available to listeners. The listeners might misunderstand that the word was repeated for emphasis or hesitation or that the skipped word never existed in the input text, thus giving high scores to the utterance.
Meanwhile, intelligibility is related to aspects such as pronunciation and articulation [36]. An in-depth analysis of listeners' scores requires listeners to follow granular evaluation instructions considering those aspects. By introducing the concepts of adequacy and comprehensibility used in the machine translation field [37], we can create more granular instructions. For TTS, adequacy refers to how well the meaning of the input text is conveyed to the output speech, and comprehensibility refers to how much the output speech is understandable without access to the input text. Even if the TTS model synthesizes the utterance not exactly according to the text input, the utterance might be adequate and comprehensible. For example, in English, when the script "I am going to school" is pronounced as "I'm going to school," the meaning of the input text is still fully conveyed. In this case, it is not desirable to assign a low score. Furthermore, even if the original meaning is not fully conveyed, it is necessary to distinguish between comprehensible and incomprehensible utterances. Considering the above discussion, we create instructions for the intelligibility test, shown in Table 2.
For objective assessment, we compute the phone error rate (PER) of 200 samples using a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) phone recognizer. We use the Kaldi toolkit to train the phone recognizer on the combination of the Small and Large TTS datasets. The 200 sentences consist of 47 "long" sentences (excluded from the Large TTS dataset) and 153 relatively "short" sentences (60 sentences from the Large TTS test data and 93 sentences not in any TTS dataset). Here, "long" sentences contain 188 phonemes on average, whereas "short" sentences contain 43. Since the length of the input sentence can affect the quality of synthesized speech, we report the PERs for long and short sentences separately, in addition to the overall PER.
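The PER itself is the standard phone-level edit-distance rate. A plain-Python sketch of the scoring step is shown below; obtaining the recognized phone sequence from the Kaldi recognizer is outside this sketch, which assumes phone lists as input.

```python
# Sketch of the phone error rate (PER): the Levenshtein distance between the
# recognized and reference phone sequences, divided by the reference length.
# The paper obtains hypotheses from a Kaldi GMM-HMM recognizer; only the
# scoring step is shown here.

def edit_distance(ref, hyp):
    """Standard dynamic-programming Levenshtein distance."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)]

def per(ref_phones, hyp_phones):
    """Phone error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref_phones, hyp_phones) / len(ref_phones)
```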

V. RESULTS AND DISCUSSION
This section presents the results and analysis for the two scenarios: limited training data and knowledge distillation. Note that the findings of this study must be seen in light of some limitations. First, as in the majority of multi-task learning studies, the loss weights are determined heuristically. This not only makes the experiments time-consuming but can also yield suboptimal results. Second, a fundamental domain mismatch remains between the Mel-spectrograms generated by TTS and those of the MOS prediction dataset. As discussed in Section III, to handle the domain mismatch, we augment the MOS prediction dataset with TTS speech data and initially set the weight for perceptual loss to a low value. However, these methods cannot eliminate the domain mismatch completely. Finally, during the subjective intelligibility test, the listeners' prior access to the input texts biases their listening behavior. Manual dictation by human subjects could directly address this problem, since scripts would not be provided to them; instead, we adopt automatic phone recognition, which can serve as a proxy for such dictation. Therefore, the PER results can compensate for the bias in the subjective intelligibility test results, but the bias itself still remains. In the future, we will develop a more advanced approach to overcome these limitations.

A. RESULTS FOR MOS PREDICTION
Before discussing the TTS results, we present the MOS prediction results. For the limited training data scenario, we train MTL-MOSNet on the augmented dataset containing both the evaluation results of VCC 2018 and the Small TTS dataset. For the knowledge distillation scenario, MTL-MOSNet is trained on the dataset augmented with the Large TTS dataset. Note that the final goal of these MOS prediction models is to improve the performance of a TTS model by providing perceptual loss. However, the goal can be achieved only if the MOS prediction models work properly. Therefore, we focus on validating that the MOS prediction models predict reasonable MOSs, including MTL-MOSNet trained without the augmented dataset.
To specifically validate the generalization ability of the MOS prediction models, we test them on the MOS evaluation results from the VCC 2016 [38]. There is no specification of the utterances or raters for those MOS evaluation results, which means that we can only measure system-level performance. The performance of MOS prediction is measured in terms of the linear correlation coefficient (LCC) [39], Spearman's rank correlation coefficient (SRCC) [40], and mean squared error (MSE). Fig. 3 shows scatter plots for (a) MTL-MOSNet trained on the dataset without augmentation, (b) MTL-MOSNet trained on the dataset augmented with the Small TTS dataset, and (c) MTL-MOSNet trained on the dataset augmented with the Large TTS dataset. These results show that the pre-trained MOS prediction models predict reasonable MOSs from input Mel-spectrograms (an LCC above 0.8 generally suggests a strong positive association between two variables).
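For reference, the three metrics can be implemented in plain Python as follows; Spearman's correlation is computed here as Pearson's correlation on average ranks, which is the standard definition.

```python
# Plain-Python sketches of the three MOS-prediction metrics: linear
# correlation coefficient (Pearson), Spearman's rank correlation (Pearson
# on average ranks, handling ties), and mean squared error.

def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def lcc(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def _ranks(x):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def srcc(x, y):
    return lcc(_ranks(x), _ranks(y))
```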

B. RESULTS FOR LIMITED TRAINING DATA
We compare the performance of Transformer TTS and perceptually guided Transformer TTS trained on the Small TTS dataset, denoted as Transformer-S and P-Transformer-S, respectively. Table 3 lists their naturalness MOSs and PERs. We can see that our method, P-Transformer-S, increases the naturalness MOS by more than 1. The p-value of the paired t-test is lower than 0.01, indicating that the improvement is highly significant (a p-value below 0.05 is taken as statistically significant). The PER decreases for both long and short sentences, and we obtain a relative improvement of 26.4% overall. Here, both PERs for long sentences are high (i.e., over 40%), not only because Transformer TTS generates unstable attention alignments when converting long sentences into speech [41] but also because it lacks training data. Fig. 4 shows the subjective intelligibility test results on the Small TTS dataset. It clearly demonstrates that P-Transformer-S outperforms Transformer-S. For an in-depth analysis using the instructions in Table 2, we define the ratio of "fully conveyed" as FCR = (N4 + N5)/Ntot and the ratio of "though makes sense" as TMSR = N3/(N1 + N2 + N3). Here, Nn is the number of evaluated data points with a score of n, and Ntot is the total number of evaluated data points (i.e., 500). When the denominator of TMSR is less than 3, we do not define TMSR and denote it as "-" because the sum of N1, N2, and N3 is too small. FCR focuses on highly intelligible evaluated data where the meaning of the original sentence is fully conveyed, whereas TMSR focuses on evaluated data where the sentence at least makes sense even though the meaning of the original sentence is not fully conveyed. Therefore, a higher FCR represents better adequacy, and a higher TMSR represents better comprehensibility.
Since FCR increases from 49.4% to 91.6% and TMSR increases from 41.9% to 88.1%, we can say that our method improves the intelligibility of Transformer-S.
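The two statistics can be computed directly from the per-score counts, as sketched below; any counts passed in are hypothetical examples, not the paper's data.

```python
# Sketch of the two intelligibility summary statistics: FCR, the fraction of
# ratings with score 4 or 5 among all ratings, and TMSR, the fraction of
# score-3 ratings among those rated 3 or below. TMSR is undefined (reported
# as "-") when the number of low ratings is less than 3.

def fcr(counts):
    """counts: dict mapping score (1..5) to the number of ratings."""
    total = sum(counts.values())
    return (counts.get(4, 0) + counts.get(5, 0)) / total

def tmsr(counts):
    low = counts.get(1, 0) + counts.get(2, 0) + counts.get(3, 0)
    if low < 3:  # too few low ratings; undefined, reported as "-"
        return None
    return counts.get(3, 0) / low
```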

C. RESULTS FOR KNOWLEDGE DISTILLATION
The second, fourth, and fifth rows of Table 4 present the naturalness MOSs and PERs of Transformer-L, FastSpeech-L, and P-FastSpeech-L, respectively. In terms of naturalness MOS, P-FastSpeech-L outperforms FastSpeech-L by 0.38 (p-value < 0.01), moving closer to the teacher model (Transformer-L). P-FastSpeech-L also achieves a 7.25% relative improvement in the overall PER. Here, in contrast to the PERs on short sentences, the PERs of the FastSpeech models on long sentences are almost half that of Transformer-L because FastSpeech-L is more robust to the length of the input text. As noted in numerous studies such as [41], the attention mechanism of a neural TTS model often fails to align the input text and output Mel-spectrogram in the latter part of a long text. Unlike Transformer-L, FastSpeech-L does not use an attention mechanism; therefore, it can produce speech more reliably than Transformer-L even when the input text is long. Fig. 5 shows the intelligibility test results on the Large TTS dataset. Given that FCR increases from 88.6% to 98.2% and TMSR increases from 56.1% to 88.9%, P-FastSpeech-L outperforms FastSpeech-L. It is also comparable to Transformer-L, which shows an FCR of 98.6% and a TMSR of 100.0%. Besides the second scenario, we perform an additional experiment to investigate whether the proposed method is also effective for the state-of-the-art TTS model, Transformer TTS. We train perceptually guided Transformer TTS and call it P-Transformer-L. We then compare P-Transformer-L with Transformer-L and a system called "GT (Mel)," in which we convert the ground truth Mel-spectrogram into a waveform using Parallel WaveGAN.
The results of GT (Mel), Transformer-L, and P-Transformer-L are presented in the first, second, and third rows of Table 4, respectively. In terms of naturalness MOS, the p-values for all possible pairs among the three systems are higher than 0.29. Therefore, the pairwise differences between the naturalness MOSs of the three systems are not statistically significant, and we cannot tell which one is better. More specifically, although P-Transformer-L shows a lower MOS than Transformer-L, we cannot conclude that the proposed method degrades the naturalness of Transformer-L. Because the proposed method works effectively for Transformer-S and FastSpeech-L, we analyze why it cannot improve Transformer-L by focusing on the differences from the other two models. In contrast to Transformer-S, Transformer-L is trained with enough data, which results in better naturalness and intelligibility than Transformer-S. For especially long sentences, whereas the speech synthesized by Transformer-S shows poor naturalness and intelligibility from the beginning to the end of a sentence, the speech synthesized by Transformer-L shows high naturalness and intelligibility at least before the latter part of a sentence. As discussed earlier, the latter part of the synthesized speech can be unintelligible because Transformer-L uses an attention mechanism. However, FastSpeech-L, which does not use an attention mechanism, can generate speech with high intelligibility through the end of a long input text.
P-Transformer-L does not outperform Transformer-L because of this characteristic of Transformer-L. For long text input, Transformer-L often generates speech that has high quality up to a certain point but is unintelligible at the end of the sequence. For such speech, the quality varies significantly between the former and latter parts. A single MOS value alone is then insufficient to evaluate the whole sentence, and thus the perceptual loss based on a single MOS loses its power to guide the TTS model. In terms of the PER, P-Transformer-L shows a relative degradation of 1.12% for long sentences but achieves a relative improvement of 5.78% for short sentences. The PER of the GT (Mel) system is only reported for long sentences since there are no recordings for the 93 short sentences. The intelligibility test results are shown in the first three rows of Fig. 5. With perceptual training, the FCR of Transformer-L increases from 98.6% to 99.4%, which is only 0.2% lower than that of GT (Mel). These results show that our method improves the intelligibility of Transformer-L for short sentences.

D. ABLATION STUDIES
We conduct ablation studies to verify the effectiveness of the proposed data augmentation method for MOS prediction. We train both Transformer-S and FastSpeech-L under the supervision of MTL-MOSNet trained only on the evaluation results of VCC 2018 (i.e., without the TTS dataset). The results are in Table 5. For the limited training data scenario, the overall PER for P-Transformer-S without data augmentation is 35.39%, which is worse than 30.08% for P-Transformer-S but better than 40.88% for Transformer-S. For the knowledge distillation scenario, the overall PER for P-FastSpeech-L without data augmentation is 8.56%, which is worse than 8.31% for P-FastSpeech-L but better than 8.96% for FastSpeech-L. As can be seen from the results, the proposed method can reduce the overall PER even without data augmentation. Moreover, we can observe that using data augmentation leads to a larger relative improvement in PER.
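The relative improvements quoted above follow the usual definition, (baseline − new)/baseline, and can be reproduced with a one-line helper:

```python
# Relative improvement in percent: how much the new error rate reduces the
# baseline error rate, relative to the baseline.

def rel_improvement_pct(baseline, new):
    return 100.0 * (baseline - new) / baseline

# e.g., Transformer-S (40.88% overall PER) vs. P-Transformer-S (30.08%)
# yields the 26.4% relative improvement reported in Section V-B.
```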

VI. CONCLUSION
We proposed a perceptual training method for a TTS model to improve speech quality independently and efficiently. We first trained the MOS prediction model on the augmented data and then used the model to calculate the perceptual loss for the TTS model. Under the supervision of perceptual loss, the TTS model learned to maximize perceptual speech quality directly. The experimental results for two scenarios show that the proposed method improves the previous TTS models in terms of both naturalness and intelligibility. In future work, we will develop a sophisticated approach to automatically find the optimal loss weights instead of using heuristically determined values. We will also extend our study to other speech generation tasks, such as multi-speaker TTS.