Fast Neural Speech Waveform Generative Models With Fully-Connected Layer-Based Upsampling

Although end-to-end (E2E) text-to-speech (TTS) models with a HiFi-GAN-based neural vocoder (e.g., VITS and JETS) can achieve human-like speech quality with fast inference, their inference speed on a CPU still has room for improvement in practical implementations because the HiFi-GAN-based neural vocoder unit is a bottleneck. Additionally, HiFi-GAN is widely used not only for TTS but also for many speech and audio applications. To accelerate HiFi-GAN while maintaining the synthesis quality, Multi-stream (MS)-HiFi-GAN, iSTFTNet and MS-iSTFT-HiFi-GAN have been proposed. Although inverse short-time Fourier transform (iSTFT)-based fast upsampling is introduced in iSTFTNet and MS-iSTFT-HiFi-GAN, we show for the first time that the predicted intermediate features input to the iSTFT layer are completely different from the original STFT spectra due to the redundancy of the overlap-add operation in iSTFT. To further improve the synthesis quality and inference speed, we propose FC-HiFi-GAN and MS-FC-HiFi-GAN, which replace the iSTFT layer with trainable fully-connected (FC) layer-based fast upsampling without overlap-add operation. The experimental results for unseen speaker synthesis and E2E TTS conditions show that, compared with iSTFTNet and MS-iSTFT-HiFi-GAN, the proposed methods slightly accelerate inference and significantly improve the synthesis quality in JETS-based E2E TTS. Therefore, the iSTFT layer can be replaced by the proposed trainable FC layer-based upsampling without overlap-add operation in HiFi-GAN-based neural vocoders.


I. INTRODUCTION
In recent years, text-to-speech (TTS) technology, which generates speech waveforms from input text, has become able to synthesize speech as natural as human speech by using deep learning techniques such as Tacotron 2 [1] combined with a WaveNet-based neural vocoder [2]. However, this system requires large computing resources such as a GPU to generate speech waveforms, so it was necessary to reduce the model size and improve the inference speed. To achieve high-speed inference while maintaining the synthesis quality, several end-to-end (E2E) TTS models have been proposed that can synthesize speech waveforms directly from input text or phoneme sequences with a single neural network [3], [4], [5], [6], [7], [8], [9]. In particular, VITS [7] and JETS [8] can achieve human-like quality and real-time inference. However, these models have room for improving the inference speed with a single CPU because the HiFi-GAN [10]-based neural vocoder unit used in these models is a bottleneck, although the Glow-TTS [11]-based acoustic model for VITS and the FastSpeech 2 [4]-based acoustic model for JETS can realize quite fast inference with a single CPU.
The associate editor coordinating the review of this manuscript and approving it for publication was Tao Huang.
To accelerate the inference speed of HiFi-GAN while maintaining the synthesis quality, Multi-stream (MS)-HiFi-GAN [50] and iSTFTNet [51] have been proposed by replacing the final 4× upsampling layers of HiFi-GAN with lightweight fast upsampling layers. Additionally, by efficiently combining these models, MS-iSTFT-HiFi-GAN [53] has also been proposed in a VITS-based E2E TTS model and can realize 4 times faster inference than vanilla HiFi-GAN while maintaining the synthesis quality. Focusing on iSTFTNet, this architecture can reasonably achieve the acceleration of HiFi-GAN by using the inverse short-time Fourier transform (iSTFT)-based fast upsampling. However, we show for the first time that the intermediate features input to the iSTFT layer are completely different from the original STFT spectra due to the redundancy of the overlap-add operation in iSTFT. This means that the iSTFT-based upsampling does not work as expected and there is room for improvement.
To further improve the synthesis quality and inference speed of iSTFTNet and MS-iSTFT-HiFi-GAN, we propose simple but efficient models, FC-HiFi-GAN and MS-FC-HiFi-GAN, which replace the iSTFT layer-based upsampling using fixed weights based on the Fourier basis and overlap-add operation with trainable fully-connected (FC) layer-based lightweight upsampling without overlap-add operation. In experiments for analysis-synthesis-based unseen speaker synthesis and VITS- and JETS-based E2E TTS conditions, we show that the proposed methods realize fast and high-fidelity synthesis as iSTFTNet and MS-iSTFT-HiFi-GAN do, are slightly faster in inference than iSTFTNet and MS-iSTFT-HiFi-GAN, and significantly improve the synthesis quality for JETS-based E2E TTS owing to the trainable but lightweight upsampling without overlap-add operation.
The rest of this paper is organized as follows. Conventional HiFi-GAN-based fast neural vocoders and the E2E TTS models VITS and JETS are briefly introduced in Sec. II. The issues for iSTFT-based upsampling are explained in Sec. III. FC-HiFi-GAN and MS-FC-HiFi-GAN are then proposed in Sec. IV. Section V describes experiments to compare the proposed FC-HiFi-GAN and MS-FC-HiFi-GAN with the conventional models for analysis-synthesis-based unseen speaker synthesis and VITS- and JETS-based E2E TTS conditions. Finally, conclusions are presented in Sec. VI.

II. CONVENTIONAL MODELS
A. HiFi-GAN-BASED FAST NEURAL VOCODERS
1) HiFi-GAN [10]
HiFi-GAN is a GAN-based neural vocoder consisting of a generator and two superior discriminators. The generator synthesizes speech waveforms from acoustic features, such as mel-spectrograms, by progressively upsampling the input features (8× → 8× → 2× → 2×) using transposed convolutional layers with residual blocks as shown in Fig. 1(a). With the efficient upsampling-based generator and sophisticated discriminators, HiFi-GAN can realize high-fidelity and fast speech synthesis.
31410 VOLUME 12, 2024. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
2) MS-HiFi-GAN [50]
As in Multi-band MelGAN [28], HiFi-GAN can be easily accelerated by replacing the last two layers for the final 4× upsampling with a multi-rate signal processing [54]-based sub-band synthesis filter [55] as used in [16], where the four sub-band output waveforms are upsampled by zero-padding and then a full-band speech waveform is synthesized by the synthesis filter. However, the multi-band structure with a constant synthesis filter is too restrictive to train HiFi-GAN because the sophisticated HiFi-GAN discriminators can easily distinguish between real and synthetic speech. By replacing the sub-band synthesis filter, which can be regarded as a convolutional layer with fixed weights without bias, with a trainable convolutional layer without bias, MS-HiFi-GAN can be successfully trained by decomposing the four output waveforms in a data-driven manner. MS-HiFi-GAN can thus accelerate the inference speed of HiFi-GAN while maintaining the synthesis quality. The architecture of the MS-HiFi-GAN generator is shown in Fig. 1(b).
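The zero-padding upsampling followed by a bias-free convolution described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the causal filter form, and the pure-delay kernels (which stand in for trained weights) are all assumptions made for the example.

```python
def zero_stuff(signal, factor):
    """Upsample by inserting factor - 1 zeros after every sample."""
    out = [0.0] * (len(signal) * factor)
    out[::factor] = signal
    return out


def fir_filter(signal, kernel):
    """Causal convolution without bias: y[n] = sum_k kernel[k] * signal[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if n - k >= 0:
                acc += w * signal[n - k]
        out.append(acc)
    return out


def multistream_synthesis(streams, kernels):
    """Zero-stuff each stream by the number of streams, filter it, and sum."""
    factor = len(streams)
    filtered = [fir_filter(zero_stuff(s, factor), k) for s, k in zip(streams, kernels)]
    return [sum(samples) for samples in zip(*filtered)]


# Hypothetical example: four streams and kernels that are pure delays,
# which simply interleaves the four streams into one full-rate signal.
streams = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
kernels = [[0.0] * b + [1.0] for b in range(4)]
full_band = multistream_synthesis(streams, kernels)
print(full_band)
```

With fixed kernels this corresponds to a constant synthesis filter; letting the kernel values be learned corresponds to the trainable layer of MS-HiFi-GAN.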
3) iSTFTNet [51]
Similar to the sub-band synthesis filter in Multi-band MelGAN [28], iSTFT can also be regarded as an upsampling operation. To both accelerate HiFi-GAN and make the best use of the input mel-spectrogram structure, iSTFTNet replaces the last two layers for the final 4× upsampling of HiFi-GAN with iSTFT-based fast upsampling as shown in Fig. 1(c). In iSTFTNet, the amplitude and phase components of the STFT spectra are predicted by a 1D convolutional layer before the iSTFT layer. Compared with MS-HiFi-GAN with trainable lightweight upsampling, iSTFTNet can also successfully accelerate the inference speed of HiFi-GAN while maintaining the synthesis quality, although the iSTFT layer with fixed weights based on the Fourier basis is not trainable.
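To make the "fixed weights based on the Fourier basis" concrete, the sketch below converts one frame of predicted amplitude and phase into time-domain samples with an inverse real DFT. It is an illustration only: the function name is hypothetical, windowing and the overlap-add across frames are omitted, and the use of M/2 + 1 amplitude and M/2 + 1 phase bins per frame is an assumption (it is consistent with the M + 2 hidden features mentioned for the FC layer).

```python
import cmath
import math


def frame_from_mag_phase(magnitude, phase):
    """Inverse real DFT of one frame given M/2 + 1 magnitude and phase bins."""
    num_bins = len(magnitude)
    fft_len = 2 * (num_bins - 1)
    spectrum = [m * cmath.exp(1j * p) for m, p in zip(magnitude, phase)]
    frame = []
    for n in range(fft_len):
        # DC and Nyquist bins contribute once; every other bin also stands
        # for its complex-conjugate mirror bin, hence the factor 2.
        acc = spectrum[0].real + spectrum[-1].real * (-1) ** n
        for k in range(1, num_bins - 1):
            acc += 2.0 * (spectrum[k] * cmath.exp(2j * math.pi * k * n / fft_len)).real
        frame.append(acc / fft_len)
    return frame


# Hypothetical input: M = 16 with energy only in bin 1, i.e. one cosine period.
M = 16
magnitude = [0.0] * (M // 2 + 1)
phase = [0.0] * (M // 2 + 1)
magnitude[1] = M / 2
frame = frame_from_mag_phase(magnitude, phase)
print([round(v, 3) for v in frame])
```

Every multiplier in this transform is a fixed complex exponential, so nothing in the layer itself can be adapted during training.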

4) MS-iSTFT-HiFi-GAN [53]
By combining the trainable convolutional layer-based upsampling of MS-HiFi-GAN and the iSTFT-based upsampling of iSTFTNet, MS-iSTFT-HiFi-GAN has been proposed to further accelerate the HiFi-GAN-based neural vocoder. MS-iSTFT-HiFi-GAN is introduced in the speech waveform synthesizer component for VITS-based E2E TTS. The architecture of the MS-iSTFT-HiFi-GAN generator is depicted in Fig. 2(e). Although MS-iSTFT-HiFi-GAN is twice as fast as MS-HiFi-GAN and iSTFTNet, it can still maintain the synthesis quality.

B. E2E TTS MODELS
1) VITS [7]
VITS is proposed as an E2E TTS model extended from Glow-TTS [11]. In the training of Glow-TTS, the target mel-spectrograms are converted to Gaussian white noise by the Flow [56]-based decoder, and the alignment between the hidden features converted from the input text and the converted white noise is gradually obtained by monotonic alignment search (MAS) [11] without external aligners. In the inference, the upsampled hidden features are converted to the target mel-spectrograms by the Flow-based inverse transformation. In VITS, the target linear-spectrograms are converted to latent variables based on a variational autoencoder (VAE) [57], and the latent variables, instead of mel-spectrograms, are converted not only to Gaussian white noise by the Flow-based decoder but also to the target speech waveforms by the HiFi-GAN-based neural vocoder. All the network components are jointly trained with the same discriminators as HiFi-GAN, and the intermediate latent variables are optimized to minimize the training loss. As a result, VITS can realize higher-quality TTS than the cascade model with Glow-TTS and HiFi-GAN [7]. The architecture of the VITS generator is shown in Fig. 3(a). In MS-iSTFT-VITS, MS-iSTFT-HiFi-GAN (Fig. 2(e)) is used for the neural vocoder instead of vanilla HiFi-GAN [53].
2) JETS [8]
Compared with VITS, which efficiently introduces three kinds of deep generative models, Flow [56], VAE [57] and GAN [29], JETS is a simpler E2E TTS model that nevertheless realizes higher synthesis quality than VITS [8]. JETS is realized by joint training of a FastSpeech 2 [4]-based acoustic model and a HiFi-GAN-based neural vocoder with the same discriminators as HiFi-GAN, without intermediate mel-spectrograms or external aligners, although FastSpeech 2 [4] itself requires an external aligner such as the Montreal Forced Aligner [58]. In JETS, an alignment training framework proposed in [59] with MAS is introduced, and the alignment between the hidden features converted from the input text sequences and the target mel-spectrogram sequences is gradually obtained during training, as in VITS.
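The monotonic alignment search mentioned above is, at its core, a dynamic program over a text-by-frame log-likelihood matrix. The sketch below shows the general idea in plain Python; the function name and the toy matrix are illustrative assumptions, and practical implementations are considerably more optimized than this.

```python
def monotonic_alignment_search(log_lik):
    """Return, for each frame, the index of the text token it is aligned to.

    log_lik[i][j] is the log-likelihood of frame j under text token i.
    The path starts at token 0, ends at the last token, and at each frame
    either stays on the same token or advances by exactly one token.
    Assumes there are at least as many frames as tokens.
    """
    n_text, n_frames = len(log_lik), len(log_lik[0])
    neg_inf = float("-inf")
    best = [[neg_inf] * n_frames for _ in range(n_text)]
    best[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = best[i][j - 1]
            advance = best[i - 1][j - 1] if i > 0 else neg_inf
            best[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and best[i - 1][j - 1] >= best[i][j - 1]:
            i -= 1
    return alignment


# Toy example: token 0 explains frames 0-1 well, token 1 explains frames 2-3.
toy_log_lik = [
    [0.0, 0.0, -5.0, -5.0],
    [-5.0, -5.0, 0.0, 0.0],
]
print(monotonic_alignment_search(toy_log_lik))
```

The number of frames assigned to each token gives a duration, which is how the alignment can be obtained without an external aligner.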

III. ISSUES FOR ISTFT LAYER-BASED UPSAMPLING
In this section, we first show that the iSTFT-based upsampling used in iSTFTNet and MS-iSTFT-HiFi-GAN does not work as expected. As described in Sec. II-A3, the amplitude and phase components of the STFT spectra are inferred by the 1D convolutional layer before the iSTFT layer, and high-fidelity speech waveforms can be synthesized by the final iSTFT layer-based fast upsampling in iSTFTNet. To explain the actual behavior of iSTFTNet, Figure 4 shows the magnitude and phase components of the STFT spectrum of an original female speech waveform used in the analysis-synthesis experiments in Sec. V, those estimated by iSTFTNet, and those reanalyzed from the speech waveform synthesized by using the estimated STFT spectrum. The estimated magnitude and phase components (Fig. 4(b)) differ from those of the original (Fig. 4(a)). This result indicates that iSTFTNet cannot perfectly predict the magnitude and phase components of the STFT spectra. However, the reanalyzed magnitude and phase components (Fig. 4(c)) are indistinguishable from those of the original (Fig. 4(a)). Additionally, there is room for improvement in the iSTFT layer-based upsampling with untrainable fixed weights based on the Fourier basis compared with MS-HiFi-GAN with trainable fast upsampling [50].

IV. PROPOSED FULLY-CONNECTED LAYER-BASED TRAINABLE UPSAMPLING WITHOUT OVERLAP-ADD OPERATION: FC-HiFi-GAN AND MS-FC-HiFi-GAN
As described in Sec. III, the iSTFT layer-based upsampling has the following issues.
• The intermediate features inferred by the 1D convolutional layer in iSTFTNet are completely different from the original STFT spectra.
Additionally, FC can be realized with fewer calculations than iSTFT. When $M$ is a power of 2, the inverse FFT is applied. Direct prediction of waveform samples is also more suitable for GAN-based training in the time domain than the indirect estimation of STFT spectra by the iSTFT-based upsampling. With these features, the proposed FC layer-based upsampling with trainable weights without overlap-add operation is expected to further improve the inference speed and synthesis quality compared to the iSTFT layer-based upsampling with fixed weights based on the Fourier basis and overlap-add operation.
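A minimal sketch of FC layer-based upsampling without overlap-add is given below, assuming each frame is represented by an (M + 2)-dimensional hidden vector that a bias-free weight matrix maps to N output samples. The function name and the random values standing in for trained weights and hidden features are assumptions for illustration only.

```python
import random


def fc_upsample(hidden_frames, weight):
    """Map each (M + 2)-dim hidden vector to N samples and concatenate the frames."""
    waveform = []
    for h in hidden_frames:
        # x = W h without bias; each frame yields N new samples,
        # so consecutive frames never overlap and nothing is summed.
        waveform.extend(sum(w * v for w, v in zip(row, h)) for row in weight)
    return waveform


M, N, T = 16, 4, 5  # FFT length, shift length, number of frames
rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(M + 2)] for _ in range(N)]  # stand-in for trained W
hidden = [[rng.uniform(-1, 1) for _ in range(M + 2)] for _ in range(T)]  # stand-in for features
waveform = fc_upsample(hidden, weight)
print(len(waveform))
```

Because every output sample depends on exactly one frame, there is a one-to-one relation between what the layer predicts and what appears in the waveform.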

V. EXPERIMENTS
A. EXPERIMENTS OF ANALYSIS-SYNTHESIS CONDITION FOR UNSEEN SPEAKER SYNTHESIS WITH MULTI-SPEAKER MODELS
To evaluate the proposed FC-HiFi-GAN and MS-FC-HiFi-GAN and to compare them with the conventional HiFi-GAN, MS-HiFi-GAN, iSTFTNet and MS-iSTFT-HiFi-GAN in a fundamental analysis-synthesis condition, experiments on unseen speaker synthesis with multi-speaker models were first conducted. Some of the speech samples used in the experiments are available online.

1) EXPERIMENTAL CONDITIONS
a: DATASET
We used the JVS corpus [61], which consists of 100 parallel and 30 non-parallel sentences read by each of 100 Japanese speakers with a sampling frequency of 24 kHz. The utterances of 90 speakers (jvs011 to jvs100) were used for the training set, and the 30 non-parallel sentences of the remaining 10 speakers (jvs001 to jvs010), who were not included in the training set, were used for the test set. The input acoustic features were 80-dimensional mel-spectrograms band-limited to 7600 Hz, where the FFT and hop sizes were 1,024 and 256 samples, respectively.

b: MODEL SETTING
In the experiments, HiFi-GAN-based models were trained and inferred by modifying a PyTorch [62]-based open source implementation, and each model was trained up to 2.5 million iterations by using an NVIDIA Tesla V100 GPU. As shown in Figs. 1 and 2, the upsampling rates and kernel sizes of the transposed convolutional layers for HiFi-GAN were [8, 8, 2, 2] and [16, 16, 4, 4], those for MS-HiFi-GAN, iSTFTNet and FC-HiFi-GAN were [8, 8] and [16, 16], and those for MS-iSTFT-HiFi-GAN and MS-FC-HiFi-GAN were [4, 4] and [8, 8], respectively. The number of initial channels of all the models was 512, as in the HiFi-GAN V1 model [10]. The model configuration of HiFi-GAN was the default setting, where only the sampling frequency was changed from 22,050 Hz to 24 kHz. The model configurations of the other models were modified from that of HiFi-GAN.
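Since every generator must turn one input frame into one hop (256 samples) of waveform, the upsampling that is left for the lightweight output stage follows directly from the listed rates. The short check below makes this arithmetic explicit; the dictionary and function names are illustrative, and how the remaining factor is divided inside each output stage is not modeled here.

```python
from math import prod

HOP_SIZE = 256  # hop size of the mel-spectrogram analysis, in samples

# Upsampling rates of the transposed convolutional layers, as listed in the text.
upsampling_rates = {
    "HiFi-GAN": [8, 8, 2, 2],
    "MS-HiFi-GAN": [8, 8],
    "iSTFTNet": [8, 8],
    "FC-HiFi-GAN": [8, 8],
    "MS-iSTFT-HiFi-GAN": [4, 4],
    "MS-FC-HiFi-GAN": [4, 4],
}


def remaining_factor(rates, hop_size=HOP_SIZE):
    """Upsampling factor left for the output stage after the transposed convolutions."""
    conv_factor = prod(rates)
    if hop_size % conv_factor:
        raise ValueError("hop size is not a multiple of the convolutional upsampling")
    return hop_size // conv_factor


remaining = {name: remaining_factor(rates) for name, rates in upsampling_rates.items()}
for name, factor in remaining.items():
    print(f"{name}: transposed convs x{prod(upsampling_rates[name])}, output stage x{factor}")
```

The factor of 4 obtained for iSTFTNet and FC-HiFi-GAN matches the final 4× upsampling that these models replace.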

c: EVALUATION CRITERIA
As objective evaluation criteria, the mel-cepstral distortion (MCD) and log $f_o$ root mean square error (log $f_o$ RMSE) between the original and synthesized speech waveforms were evaluated. These values were calculated by using ESPnet2-TTS [63]. To measure real-time factors (RTFs), we used an Intel Xeon 6152 CPU (with one core). A mean opinion score (MOS) test with a five-point scale (5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad) [64] was conducted to evaluate the subjective perceptual quality of the ground truth and synthesized speech waveforms. For the MOS test, the 30 non-parallel utterances of two female (jvs004 and jvs008) and two male (jvs001 and jvs003) speakers were used. In the MOS test, twenty adult native Japanese speakers without hearing loss listened to the original and synthesized speech samples using headphones and evaluated 140 sentences in total, consisting of 20 sentences for each model and for the ground truth samples (6 × 20 + 20 = 140).
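For orientation, the three objective measures can be written out as below. This sketch uses the commonly cited textbook definitions and is not the ESPnet2-TTS code: the function names are hypothetical, the inputs are assumed to be already time-aligned, and details such as which cepstral coefficients are included are left to the caller.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients."""
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        total += scale * math.sqrt(sum((r - s) ** 2 for r, s in zip(ref, syn)))
    return total / len(ref_frames)


def log_fo_rmse(ref_fo, syn_fo):
    """RMSE of log fo over frames that are voiced (fo > 0) in both signals."""
    diffs = [math.log(r) - math.log(s) for r, s in zip(ref_fo, syn_fo) if r > 0 and s > 0]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def real_time_factor(synthesis_seconds, num_samples, sampling_rate):
    """Time spent synthesizing divided by the duration of the generated audio."""
    return synthesis_seconds / (num_samples / sampling_rate)


# Hypothetical values: 0.5 s of computation for 1 s of 24 kHz audio.
print(real_time_factor(0.5, 24000, 24000))
```

An RTF below 1 means that synthesis runs faster than real time on the measured hardware.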

2) RESULTS OF ANALYSIS-SYNTHESIS EXPERIMENTS
The results of the objective and subjective evaluations for unseen speaker synthesis with multi-speaker models are shown in Table 1. Additionally, Figure 7 shows the result of the T-test for the MOS tests in Table 1. First of all, the proposed MS-FC-HiFi-GAN realized the fastest inference speed and highest synthesis quality compared with the other models, although there was no significant difference between the MOS value of MS-FC-HiFi-GAN and those of the other models. As expected, FC-HiFi-GAN and MS-FC-HiFi-GAN without overlap-add operation realized slightly faster inference than iSTFTNet and MS-iSTFT-HiFi-GAN with overlap-add operation while maintaining the synthesis quality by the trainable FC layer.
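The text does not state which variant of the T-test was applied to the MOS values. Purely as an illustration, the sketch below computes the statistic and degrees of freedom of Welch's unequal-variance t-test for two sets of opinion scores; the function name and the scores are hypothetical, and converting the statistic into a p-value requires the Student t distribution, which is not part of the Python standard library.

```python
import math
import statistics


def welch_t(scores_a, scores_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    # Squared standard errors of the two sample means.
    se2_a = statistics.variance(scores_a) / len(scores_a)
    se2_b = statistics.variance(scores_b) / len(scores_b)
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(scores_a) - 1) + se2_b ** 2 / (len(scores_b) - 1)
    )
    return t_stat, dof


# Hypothetical opinion scores for two systems.
t_stat, dof = welch_t([4.0, 5.0, 4.0, 5.0], [3.0, 4.0, 3.0, 4.0])
print(t_stat, dof)
```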

B. EXPERIMENTS OF E2E TTS CONDITION
The analysis-synthesis condition is a simpler problem because the inputs were ground truth mel-spectrograms. For this reason, there was no significant difference between the MOS values of iSTFTNet and FC-HiFi-GAN, or between those of MS-iSTFT-HiFi-GAN and MS-FC-HiFi-GAN.
Therefore, we evaluated the performance of each neural vocoder in the E2E TTS condition, which is a more complex problem than the analysis-synthesis condition. For E2E TTS models, we introduced VITS [7], which was used in MS-iSTFT-HiFi-GAN. Additionally, we introduced JETS, which is expected to realize higher quality and more stable synthesis than VITS [8]. The neural vocoder part of each E2E TTS model was changed to iSTFTNet, MS-HiFi-GAN, MS-iSTFT-HiFi-GAN, and MS-FC-HiFi-GAN, and the inference speed and synthesis quality of these models for the E2E TTS condition were compared.

1) EXPERIMENTAL CONDITIONS
a: DATASET
In the TTS experiments, we first used LJSpeech [65] with a sampling frequency of 22.05 kHz. Following the default setting of ESPnet2-TTS [63], 12,600 utterances, 250 utterances and 250 utterances were used for the training, validation and test sets, respectively. Although LJSpeech is widely used in TTS experiments, as in [7], [8], [51], and [53], the original recordings were distributed as 128 kbps MP3 files and they may contain artifacts introduced by the MP3 encoding [65]. To evaluate the neural vocoder models using a higher quality corpus, we introduced the Hi-Fi TTS dataset [66] with a sampling frequency of 44.1 kHz. In the experiments, a clean female speaker corpus (Reader ID: 92) was selected, and normal-band (24 kHz) and full-band (44.1 kHz) E2E TTS models were trained combined with these neural vocoders. Following the default setting of the Hi-Fi TTS dataset [66], 35,146 utterances, 50 utterances and 100 utterances were used for the training, validation and test sets, respectively.

b: MODEL SETTING
In the experiments, VITS- and JETS-based E2E TTS models were trained and inferred by modifying the PyTorch-based open source implementation provided in ESPnet2-TTS [63].
Each model was trained up to 1.0 million iterations by using four NVIDIA Tesla V100 GPUs.
The FFT and hop sizes of acoustic feature extraction for sampling frequencies of 22.05 kHz and 24 kHz were also 1,024 and 256 samples, respectively. Accordingly, the upsampling rates and kernel sizes of the transposed convolutional layers in the neural vocoder part for 22.05 kHz and 24 kHz were the same as those used in the experiments for the analysis-synthesis condition. The model configurations of VITS and JETS with HiFi-GAN for LJSpeech (22.05 kHz) were the default settings. The model configurations of the other models for 22.05 kHz and 24 kHz were modified from the default settings.
The FFT and hop sizes of acoustic feature extraction for full-band VITS and JETS with a sampling frequency of 44.1 kHz were 2,048 and 512 samples, respectively. Accordingly, the upsampling rates and kernel sizes of the transposed convolutional layers for HiFi-GAN were [8, 8, 2, 2, 2] and [16, 16, 4, 4, 4] [63], those for MS-HiFi-GAN, iSTFTNet and FC-HiFi-GAN were [8, 8, 2] and [16, 16, 4], and those for MS-iSTFT-HiFi-GAN and MS-FC-HiFi-GAN were [4, 4, 2] and [8, 8, 4], respectively. The model configuration of full-band VITS with HiFi-GAN for 44.1 kHz was the default setting. The model configurations of full-band VITS with the other models were modified from the default setting of full-band VITS. The model configurations of full-band JETS with these models were modified from the default setting of JETS for 22.05 kHz.

c: EVALUATION CRITERIA
As objective evaluation criteria, the MCD, log $f_o$ RMSE and RTF were also evaluated as in the analysis-synthesis condition. Additionally, the character error rate (CER) of automatic speech recognition (ASR) was measured as in [8] and [63] to evaluate the stability of E2E TTS models. The CER was calculated by a Conformer-based ASR model trained using the LibriSpeech corpus [67] by ESPnet [68]. A MOS test with a five-point scale was also conducted to evaluate the subjective perceptual quality of the ground truth and synthesized speech waveforms. In the MOS test, twenty adult native English speakers without hearing loss listened to the original and synthesized speech samples using headphones and evaluated 390 sentences in total, consisting of 10 sentences for each model and for the ground truth samples ((12 × 10 + 10) × 3 [LJSpeech, Hi-Fi TTS (24 kHz) and Hi-Fi TTS (44.1 kHz)] = 390).
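The CER is conventionally the character-level edit distance between the ASR transcript of the synthesized speech and the reference text, divided by the reference length. The following sketch shows that definition in plain Python; the function names are illustrative, and any text normalization applied before scoring in the actual evaluation is not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("kitten", "sitting"))
```

A lower CER indicates that the synthesized speech is recognized more reliably, which is why it serves as a proxy for synthesis stability.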

2) RESULTS OF E2E TTS EXPERIMENTS
TABLE 2. Results of objective and subjective evaluations of normal-band E2E TTS conditions using LJSpeech corpus and Hi-Fi TTS dataset.

TABLE 3. Results of objective and subjective evaluations of full-band E2E TTS conditions using Hi-Fi TTS dataset.

The results of the subjective and objective evaluations for normal-band and full-band E2E TTS conditions are presented in Tables 2 and 3. Additionally, Figure 8 shows the results of the T-test for the MOS tests in Tables 2 and 3. First, JETS-based models realized significantly higher-quality synthesis than VITS-based models and outperformed VITS-based models in terms of the MCD, log $f_o$ RMSE, and CER for both normal-band and full-band E2E TTS conditions, as reported in [8]. Although the proposed JETS-based MS-FC-HiFi-GAN did not achieve the highest synthesis quality, falling behind MS-HiFi-GAN (LJSpeech and full-band Hi-Fi TTS) or HiFi-GAN (normal-band Hi-Fi TTS), it realized significantly higher synthesis quality than the conventional MS-iSTFT-VITS [53] (VITS-based E2E TTS with MS-iSTFT-HiFi-GAN), and realized the fastest inference and lowest CER compared with the other models for both normal-band and full-band E2E TTS conditions. Additionally, there were significant differences between the MOS values of JETS with FC-HiFi-GAN and JETS with iSTFTNet for both normal-band and full-band conditions, and between those of JETS with MS-FC-HiFi-GAN and JETS with MS-iSTFT-HiFi-GAN for the LJSpeech corpus.
Fig. 6 shows the STFT spectra of a speech waveform in the test set and the intermediate features of JETS-based E2E TTS models with iSTFTNet, FC-HiFi-GAN, MS-iSTFT-HiFi-GAN, and MS-FC-HiFi-GAN, together with mel-spectrograms of the original speech waveform and those synthesized by JETS-based E2E TTS models with iSTFTNet, FC-HiFi-GAN, MS-iSTFT-HiFi-GAN, and MS-FC-HiFi-GAN. The intermediate features of iSTFTNet are again completely different from the STFT spectra of the original speech waveform, as shown in Fig. 4 for the analysis-synthesis condition. The intermediate features of MS-iSTFT-HiFi-GAN show the same tendency as those of iSTFTNet. In contrast, the intermediate features of the proposed FC-HiFi-GAN and MS-FC-HiFi-GAN, which are input to the trainable FC layer-based fast upsampling layer, are optimally trained to synthesize high-fidelity speech waveforms as shown in Fig. 6(c) and (e). Additionally, the harmonic structures of the proposed (c) FC-HiFi-GAN and (e) MS-FC-HiFi-GAN are clearer than those of (b) iSTFTNet and (d) MS-iSTFT-HiFi-GAN, as shown in the red and green boxes in Fig. 6.
Consequently, compared with iSTFTNet and MS-iSTFT-HiFi-GAN with iSTFT layer-based upsampling using fixed weights and overlap-add operation, the proposed FC-HiFi-GAN and MS-FC-HiFi-GAN with the trainable FC layer-based fast upsampling layer without overlap-add operation can realize slightly faster inference and significantly improve the synthesis quality for JETS-based E2E TTS. Therefore, the iSTFT layer-based upsampling can be replaced by the proposed FC layer-based upsampling in HiFi-GAN-based neural vocoders. The results of the experiments are summarized as follows:
• The proposed JETS-based models can significantly improve the synthesis quality with lower CER compared to the VITS-based models.
• JETS with the proposed MS-FC-HiFi-GAN can realize higher MOS values than the conventional MS-iSTFT-VITS in all the conditions.
• In many conditions, the proposed FC-HiFi-GAN can realize higher MOS values than the conventional iSTFTNet.
• The proposed MS-FC-HiFi-GAN can realize significantly higher MOS values than MS-iSTFT-HiFi-GAN in many conditions.

VI. CONCLUSIONS
HiFi-GAN is widely used not only for TTS but also for many speech and audio applications. Although iSTFTNet and MS-iSTFT-HiFi-GAN have been proposed to accelerate HiFi-GAN while maintaining the synthesis quality, we pointed out for the first time that the predicted intermediate features input to the iSTFT layer are completely different from the original STFT spectra due to the redundancy of the overlap-add operation in iSTFT. To further improve the synthesis quality and inference speed of the HiFi-GAN-based neural vocoder, we proposed FC-HiFi-GAN and MS-FC-HiFi-GAN, which replace the iSTFT layer with trainable FC layer-based fast upsampling without overlap-add operation. The results of experiments for unseen speaker synthesis with multi-speaker models and for E2E TTS with VITS- and JETS-based normal-band and full-band models demonstrated that, compared with iSTFTNet and MS-iSTFT-HiFi-GAN with iSTFT-based upsampling using fixed weights based on the Fourier basis and overlap-add operation, the proposed methods can slightly accelerate the inference speed and significantly improve the synthesis quality in JETS-based E2E TTS. Consequently, the iSTFT layer-based upsampling can be replaced by the proposed FC layer-based upsampling in HiFi-GAN-based neural vocoders.

FIGURE 1. Architectures of (a) HiFi-GAN, (b) Multi-stream HiFi-GAN, (c) iSTFTNet, and (d) proposed FC-HiFi-GAN generators. T, T.Conv, ResBlock and Conv1d are the number of frames of mel-spectrograms for the analysis-synthesis condition or hidden features for the E2E TTS condition, a transposed convolutional layer, a residual block and a 1-dimensional convolutional layer, respectively.

FIGURE 2. Architectures of (e) MS-iSTFT-HiFi-GAN and (f) proposed MS-FC-HiFi-GAN generators. T, T.Conv, ResBlock and Conv1d are the number of frames of mel-spectrograms for the analysis-synthesis condition or hidden features for the E2E TTS condition, a transposed convolutional layer, a residual block and a 1-dimensional convolutional layer, respectively.

FIGURE 4. (a) Amplitude and phase components of the STFT spectrum of an original speech waveform (jvs001-BASIC5000-0025), (b) those estimated by iSTFTNet trained using the JVS corpus, and (c) those reanalyzed from the speech waveform synthesized by using (b).
The reanalyzed magnitude and phase components (Fig. 4(c)) are indistinguishable from those of the original (Fig. 4(a)). When the fast Fourier transform (FFT) length and shift length of acoustic feature analysis in STFT are $M$ and $N$, $Q = M/N$ samples are summed for each sample in iSTFT by the overlap-add operation (Fig. 5(a)). Therefore, the overlap-add operation in iSTFT has "redundancy" for $Q \geq 2$. Owing to this redundancy and the GAN-based training in the time domain, the magnitude and phase components estimated by iSTFTNet, which differ from those of the original, can still synthesize high-fidelity speech waveforms. In other words, iSTFTNet is trained to estimate STFT spectra that yield high-quality speech waveforms through the overlap-add operation, and GAN-based training in the time domain imposes no restriction in the STFT domain. Therefore, direct estimation of speech waveform samples in the time domain is more suitable for GAN-based training in the time domain than the indirect estimation of STFT spectra introduced in iSTFTNet.
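The redundancy can be illustrated with a few lines of plain Python. In this sketch (the function name and the toy frame values are assumptions, and windowing is omitted), frames of length M are overlap-added with shift N; the code counts how many frame samples are summed into each output sample and then shows that two different sets of frames produce exactly the same waveform.

```python
def overlap_add(frames, hop):
    """Sum frames shifted by hop samples; also count contributions per output sample."""
    fft_len = len(frames[0])
    out = [0.0] * ((len(frames) - 1) * hop + fft_len)
    counts = [0] * len(out)
    for t, frame in enumerate(frames):
        for m, value in enumerate(frame):
            out[t * hop + m] += value
            counts[t * hop + m] += 1
    return out, counts


M, N = 16, 4  # FFT length and shift length, so Q = M / N = 4
frames_a = [[float(t + 1)] * M for t in range(8)]
wave_a, counts = overlap_add(frames_a, N)
print(max(counts))  # number of frame samples summed into an interior output sample

# Shift 0.5 from frame 1 to frame 0 at a position where the two frames overlap:
# sample N of frame 0 and sample 0 of frame 1 both land on output index N.
frames_b = [list(frame) for frame in frames_a]
frames_b[0][N] += 0.5
frames_b[1][0] -= 0.5
wave_b, _ = overlap_add(frames_b, N)
print(frames_a != frames_b, wave_a == wave_b)
```

Because a loss defined only on the waveform cannot tell such frame sets apart, the frames fed to the overlap-add need not resemble the true STFT of the output.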

FIGURE 5. (a) iSTFTNet with overlap-add operation, and (b) proposed FC-HiFi-GAN without overlap-add operation. T is the number of frames, and M and N are the FFT length and shift length of acoustic feature analysis.
In the iFFT, $M$ audio samples are calculated with $2M\log_2 M$ real-number multiplications and $3M\log_2 M$ real-number additions. Then, $2MN\log_2 M$ real-number multiplications and $3MN\log_2 M + N(Q-1) = 3MN\log_2 M + M - N$ real-number additions are required to synthesize $N$ audio samples in the iSTFT because iSTFT is calculated by frame shifting and overlap-add with shift length $N$. Conversely, FC without bias is calculated as $x = Wh$, where $x \in \mathbb{R}^{N \times 1}$, $W \in \mathbb{R}^{N \times (M+2)}$, and $h \in \mathbb{R}^{(M+2) \times 1}$ are the vector of $N$ audio samples, the trainable weight matrix of the fully-connected layer, and the vector of hidden features, respectively. Then, $(M+2)N$ real-number multiplications and $(M+1)N$ real-number additions are required to synthesize $N$ audio samples in the FC. In iSTFTNet and FC-HiFi-GAN, $M = 16$ and $N = 4$. Therefore, FC-based upsampling can realize faster inference than iSTFT-based upsampling. The FC layer-based upsampling differs from the iSTFT layer-based upsampling in the following important points.
• The weights of FC layer-based upsampling are trainable.
• The FC layer-based upsampling can directly predict speech waveform samples without overlap-add operation (Fig. 5(b)), and it is more suitable for GAN-based training in the time domain.
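Plugging the stated values of M and N into the operation counts above gives a concrete comparison. The helper names below are illustrative; the formulas are the ones given in the text.

```python
import math


def istft_cost(fft_len, hop):
    """Real multiplications and additions to synthesize hop samples with iSTFT."""
    log_m = math.log2(fft_len)
    mults = 2 * fft_len * hop * log_m
    adds = 3 * fft_len * hop * log_m + fft_len - hop
    return int(mults), int(adds)


def fc_cost(fft_len, hop):
    """Real multiplications and additions to synthesize hop samples with bias-free FC."""
    return (fft_len + 2) * hop, (fft_len + 1) * hop


M, N = 16, 4
print("iSTFT (multiplications, additions):", istft_cost(M, N))
print("FC    (multiplications, additions):", fc_cost(M, N))
```

For M = 16 and N = 4, these formulas give 512 multiplications and 780 additions for iSTFT against 72 multiplications and 68 additions for FC.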

FIGURE 7. Result of T-test for MOS tests in Table 1. Values for p < 0.05 (statistically significant) are bold with yellow highlighting.

FIGURE 8. Results of T-test for MOS tests in Tables 2 and 3. Values for p < 0.05 (statistically significant) are bold with yellow highlighting.

TABLE 1. Results of objective and subjective evaluations of analysis-synthesis condition for unseen speaker synthesis with multi-speaker models.
