I. Introduction
Text-to-Speech (TTS) technology has seen remarkable research advances over the last several years. TTS converts written text into natural-sounding speech, improving accessibility and enabling a more expressive human interaction experience with digital devices. Early neural TTS systems modeled the output distribution autoregressively, conditioning each step on previously generated output, e.g., WaveNet [1] and Tacotron [2]. However, autoregressive models generate speech sequentially, which makes inference slow and limits their efficiency. Later, non-autoregressive TTS emerged, often built on teacher-student training, to increase training and inference speed and minimize the latency issues of autoregressive TTS: FastSpeech 2 [3] generates speech frames in parallel rather than sequentially, and Transformer-based TTS [4] replaces the recurrent networks of Tacotron 2 [5] with multi-head attention, addressing the long-range dependency problem by directly connecting inputs at different time steps through self-attention. Owing to this self-attention mechanism, it demonstrated a remarkable capability to generate high-quality synthesized speech.

Building on this ongoing research, TTS methods have improved the naturalness and fluency of synthesized speech. However, because of the trade-off between speed and quality, non-autoregressive models frequently compromise synthesis quality in favor of faster processing. Moreover, enforcing full parallelism while maintaining coherence and fidelity in the synthesized speech tends to degrade speech variability, such as different speaking rates, accents, or emotional expressions.
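To make the mechanism concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in NumPy. It illustrates only the property discussed above, that every frame attends directly to every other frame regardless of temporal distance; the function name, shapes, and random projections are illustrative assumptions, not the exact formulation used in [4].

    # Minimal sketch of scaled dot-product self-attention (illustrative only;
    # names, shapes, and the single-head setup are assumptions, not the exact
    # formulation of Transformer-based TTS [4]).
    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        """x: (T, d) sequence of T frames; w_q/w_k/w_v: (d, d_k) projections."""
        q, k, v = x @ w_q, x @ w_k, x @ w_v       # project to queries/keys/values
        scores = q @ k.T / np.sqrt(k.shape[-1])   # (T, T): every frame vs. every frame
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True) # row-wise softmax
        return weights @ v                        # each output mixes all time steps

    rng = np.random.default_rng(0)
    T, d, d_k = 8, 16, 16
    x = rng.standard_normal((T, d))
    w = [rng.standard_normal((d, d_k)) for _ in range(3)]
    out = self_attention(x, *w)  # (T, d_k); frame 0 attends to frame T-1 directly

Because the (T, T) score matrix links every pair of time steps in a single operation, the path length between distant frames is constant, which is the property credited above with resolving long-range dependencies.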