1. INTRODUCTION
Text-to-speech (TTS) and voice conversion (VC) have been actively studied as methods for generating desired speech. Recently developed TTS and VC systems commonly adopt a two-stage approach: the first model predicts an intermediate representation (e.g., a mel spectrogram) from the input data (e.g., text or speech), and the second model synthesizes speech from the predicted intermediate representation. The second model, called a neural vocoder, has been extensively studied through autoregressive models (e.g., WaveNet [1] and WaveRNN [2]) and non-autoregressive models, including distillation-based models (e.g., Parallel WaveNet [3] and ClariNet [4]), flow-based models building on Glow [5] (e.g., WaveGlow [6]), diffusion-based models [7], [8] (e.g., WaveGrad [9] and DiffWave [10]), and generative adversarial network (GAN)-based models [11] (e.g., [12]–[27]). This study focuses on GAN-based models because they are fast, lightweight, and capable of high-quality synthesis.