Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network

In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However, the pure-data-driven nature and the lack of prior knowledge of audio signals degrade the pitch controllability of WN. For instance, it is difficult for WN to precisely generate the periodic components of audio signals when the given auxiliary fundamental frequency (<inline-formula><tex-math notation="LaTeX">$F_{0}$</tex-math></inline-formula>) features are outside the <inline-formula><tex-math notation="LaTeX">$F_{0}$</tex-math></inline-formula> range observed in the training data. To address this problem, QPNet with two novel designs is proposed. First, the PDCNN component is applied to dynamically change the network architecture of WN according to the given auxiliary <inline-formula><tex-math notation="LaTeX">$F_{0}$</tex-math></inline-formula> features. Second, a cascaded network structure is utilized to simultaneously model the long- and short-term dependencies of quasi-periodic signals such as speech. The performances of single-tone sinusoid and speech generations are evaluated. The experimental results show the effectiveness of the PDCNNs for unseen auxiliary <inline-formula><tex-math notation="LaTeX">$F_{0}$</tex-math></inline-formula> features and the effectiveness of the cascaded structure for speech generation.


I. INTRODUCTION
Raw waveform generation of audio signals such as speech and music is a core technique in many applications, including text-to-speech (TTS), voice conversion (VC), and music synthesis. However, because of the extremely high temporal resolution (sampling rates are usually higher than 16 kHz) and the very long-term dependence of audio signals, directly modeling raw waveform signals is challenging. To overcome these difficulties, conventional synthesis techniques usually encode audio signals into acoustic features with a low temporal resolution and then decode audio waveforms on the basis of these acoustic features. This analysis-synthesis (encoding-decoding) technique is called a vocoder [1], [2], and it is often built on a source-filter [3] speech production model that includes source excitations and vocal tracts. However, because of the oversimplified assumptions about the speech generation mechanism, the loss of temporal details and phase information leads to serious quality degradation in conventional vocoders such as STRAIGHT [4] and WORLD [5].
Owing to the recent development of deep learning, many neural-based audio generation models [6]-[17] have been proposed to generate raw audio waveforms without the various assumptions imposed on conventional vocoders. That is, advanced deep network architectures directly model the long-term dependence of high-temporal-resolution audio waveforms. In this paper, we focus on WaveNet (WN) [6], which is one of the state-of-the-art audio generation models and has been applied to a variety of applications such as music generation [18], TTS [19], [20], speech coding [21], speech enhancement [22], [23], and VC [24]-[28]. The core of WN is an autoregressive (AR) network that models the probability distribution of each audio sample conditioned on auxiliary features and a specific number of previous samples called a receptive field. To handle the very long-term dependence of audio signals, a stacked dilated convolution neural network (DCNN) [29] structure is utilized to efficiently extend the receptive field. Furthermore, the WN vocoder [30]-[33], which conditions WN on the acoustic features extracted by conventional vocoders to recover the lost information, achieves significant speech quality improvements by replacing the synthesis process of traditional vocoders.
Although WN attains excellent performance in high-fidelity speech generation, its fixed architecture is inefficient, and the lack of prior audio-related knowledge limits the pitch controllability of the WN vocoder. Specifically, because of the quasi-periodicity of speech, each sample may have a specific dependent field related to its periodicity instead of a fixed receptive field that presumably includes many redundant previous samples. The requirement of a long receptive field for modeling speech dependency leads to a huge network and high demands for computation power. Moreover, the data-driven architecture without prior speech knowledge only implicitly models the relationship between the periodicity of waveform signals and the auxiliary fundamental frequency ($F_0$) features, so it may not generate speech with the precise pitch corresponding to the auxiliary $F_0$ values, especially in an unseen $F_0$ case. However, pitch controllability is an essential feature in the definition of a vocoder.
To address these problems, inspired by the source-filter model [3] and the code-excited linear prediction (CELP) codec [34], [35], we propose Quasi-Periodic WaveNet (QPNet) [36], [37] with a pitch-dependent dilated convolution neural network (PDCNN). Specifically, the generation process of periodic signals can be modeled as the generation of a single pitch-cycle signal (short-term correlation) followed by the extension of this single-cycle signal to form the whole periodic sequence on the basis of pitch (long-term correlation). As a result, we develop QPNet, which includes two cascaded WNs with different DCNNs. Vanilla WN with fixed DCNNs is the first stage, which models the relationship between the current sample and a specific segment of the nearest previous samples, and the second stage utilizes the PDCNNs to link the correlations of the relevant segments in the current and previous cycles. The pitch-adaptive architecture allows each sample to have an exclusive receptive field length corresponding to the auxiliary $F_0$ features and improves the pitch controllability by introducing periodicity information into the network. The proposed QPNet with the improved pitch controllability is more in line with the definition of a vocoder. Furthermore, because QPNet extends the receptive field, which is highly related to the modeling capability, more efficiently, it can use a more compact network while achieving acceptable quality similar to that of vanilla WN.
The paper is organized as follows. In Section II, we review recent neural-based speech generation models. In Section III, a brief introduction to WN is presented. In Section IV, we describe the concepts and details of QPNet. In Sections V and VI, we report objective and subjective experimental results to evaluate the effectiveness of QPNet for generating high-temporal-resolution periodic sinusoid signals and quasi-periodic speech, respectively. Finally, the conclusion is given in Section VII.

II. RELATED WORK
Recent mainstream speech generation techniques use AR models such as WN [6] and SampleRNN [7] to model the very long-term dependence of speech signals with high temporal resolution. In contrast to vanilla WN, which is conditioned on linguistic and $F_0$ features to generate speech, using an AR model as a vocoder conditioned on handcrafted acoustic features is a more efficient way to train the AR model and make it generate the desired speech. Many acoustic features have been applied to these AR vocoders, such as the Mel-cepstral coefficients (mcep) with band aperiodicity (ap) and $F_0$ features, which are extracted by WORLD [30]-[32] or STRAIGHT [38], and Mel-spectrograms with $F_0$ features [33].
Furthermore, to achieve acceptable speech quality, the basic AR vocoders usually require a huge network for the long receptive field. Although the speech quality of these basic AR vocoders is significantly higher than that of the traditional vocoders, the AR mechanism and the complicated network structure make it difficult for these AR vocoders to generate speech in real time [6], [7]. To tackle this issue, the authors of FFTNet [8] and WaveRNN [9] proposed more compact AR vocoders with specific network structures based on speech-related knowledge and efficient computation mechanisms. Moreover, AR models generating glottal excitation [39], [40] and linear predictive coding (LPC) residual [10] signals have been proposed to ease the burden of modeling speaker identity and spectral information. Because of the speaker-independent characteristic of these source signals, the requirements for the network capacity and speaker adaptation of these glottal vocoders and LPCNet are greatly reduced.
In addition, flow-based [41], [42] non-AR vocoders have been proposed for efficient parallel generation. For example, parallel WaveNet [11] and ClariNet [12] with inverse autoregressive flow (IAF) [43], and WaveGlow [13] and FloWaveNet [14] with Glow [44], model an invertible transformation between a simple probability distribution of noise signals and a target distribution of speech signals for generating waveforms from a known noise sequence.
Non-AR vocoders with mixed sine-based excitation inputs produced on the basis of $F_0$ and Gaussian noise [16], [17] or periodic sinusoid signals and aperiodic Gaussian noise inputs [15] have also been proposed to simultaneously generate whole waveforms while attaining pitch controllability via the manipulation of the periodic inputs. However, to synchronize the phases of generated and ground truth waveforms during training, these models need a handcrafted design of the input signal or a GAN [45] structure, which increases the complexity of the models. Moreover, directly applying these models to related applications such as music generation is not straightforward because of the tailored architectures.
Instead of carefully designed inputs and specific networks, we proposed a simple module, the PDCNN, which can be easily applied to any DCNN-based generative model to improve its audio signal modeling capability by introducing pitch information into the network. We applied PDCNNs to WN to develop a pitch-dependent adaptive network, QPNet [36], [37], for speech generation with arbitrary $F_0$ values. In this paper, we further evaluate the periodical modeling capability of QPNet with PDCNNs for nonspeech sinusoid signal generation and comprehensively explore the effectiveness of the QPNet model with different cascade orders, network structures, and adaptive dilation sizes.

III. WAVENET FOR SPEECH GENERATION

A. WaveNet
Because an audio waveform is a sequential signal with a strong long-term dependency, WN [6] models audio signals in an AR manner, predicting the distribution of each waveform sample on the basis of its previous samples. The conditional probability function can be formulated as
$$P(\mathbf{x}) = \prod_{t=1}^{T} P(x_t \mid x_{t-r}, \ldots, x_{t-1}),$$
where $t$ is the sample index, $T$ is the sequence length, $x_t$ is the current audio sample, and $r$ is a specific length of the previous samples called a receptive field. Instead of the general recurrent structure for AR modeling, WN applies stacked convolution neural networks (CNNs) with a dilated mechanism and a causal structure to model the very long-term dependence and causality of audio signals.
Since the modeling capability of WN is highly related to the number of previous samples taken into consideration for predicting the current sample, the dilated mechanism improves the efficiency of extending the receptive field length. Moreover, a categorical distribution is applied to model the conditional probability, and audio signals are encoded into 8 bits by using the µ-law algorithm. The categorical distribution is flexible enough to model an arbitrary distribution of target speech. Taken together, the data flow of WN is as follows: previous audio samples pass through a causal layer and several residual blocks with DCNNs, gated structures, and residual and skip connections. Specifically, the gated structure for enhancing the modeling capability of the network is formulated as
$$z^{(o)} = \tanh\left(V_{f,k} * z^{(i)}\right) \odot \sigma\left(V_{g,k} * z^{(i)}\right),$$
where $z^{(i)}$ and $z^{(o)}$ are the input and output feature maps of the gated structure, respectively, $V$ is a trainable convolution filter, $*$ is the convolution operator, $\odot$ is an element-wise multiplication operator, $\sigma$ is a sigmoid function, $k$ is the layer index, and $f$ and $g$ denote the filter and gate, respectively. Finally, the summation of all skip connections is processed by two ReLU [46] activations with 1×1 convolutions and one softmax layer to output the predicted distribution of the current audio sample.
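As a rough illustration of the 8-bit µ-law encoding mentioned above, the following Python sketch uses the standard companding formula with µ = 255. The function names, the clipping step, and the rounding rule are assumptions made for illustration; the text only states that the signals are encoded into 8 bits using the µ-law algorithm.

```python
import math


def mu_law_encode(x, mu=255):
    """Map a sample in [-1, 1] to an integer class in [0, mu] (assumed rounding rule)."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(q, mu=255):
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    y = 2.0 * q / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
```

With 256 classes, the categorical output of the network can be read as a distribution over these integer codes, and a sampled code is mapped back to a waveform value by the decoder.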
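Furthermore, to guide the WN model to generate desired contents, the vanilla WN is conditioned on not only previous samples but also linguistic and $F_0$ features. The conditional probability is modified as
$$P(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} P(x_t \mid x_{t-r}, \ldots, x_{t-1}, \mathbf{h}),$$
where $\mathbf{h}$ is the vector of the auxiliary features (linguistic and $F_0$ features), and the gated activation with auxiliary features becomes
$$z^{(o)} = \tanh\left(V^{(1)}_{f,k} * z^{(i)} + V^{(2)}_{f,k} * \mathbf{h}'\right) \odot \sigma\left(V^{(1)}_{g,k} * z^{(i)} + V^{(2)}_{g,k} * \mathbf{h}'\right),$$
where $V^{(1)}$ and $V^{(2)}$ are trainable convolution filters, and $\mathbf{h}'$ denotes the temporally extended auxiliary features, whose temporal resolution matches that of the speech samples.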
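The gated activation with auxiliary features can be sketched for a single scalar channel as follows, assuming the usual WaveNet form in which a tanh filter branch is multiplied by a sigmoid gate branch. The arguments stand for the already-convolved inputs; their names are hypothetical and not taken from the paper.

```python
import math


def gated_activation(filter_in, gate_in, filter_aux=0.0, gate_aux=0.0):
    """tanh(filter branch) * sigmoid(gate branch) for one scalar channel.

    filter_in / gate_in: convolved waveform features for the two branches.
    filter_aux / gate_aux: convolved auxiliary features (zero when unconditioned).
    """
    gate = 1.0 / (1.0 + math.exp(-(gate_in + gate_aux)))
    return math.tanh(filter_in + filter_aux) * gate
```

Setting both auxiliary terms to zero recovers the unconditioned gated structure.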

B. WaveNet Vocoder
Many conventional vocoders [4], [5] are built on the basis of a source-filter architecture [3], which models the speech generation process as a spectral filter driven by the source excitation signal. However, the oversimplified assumptions, such as analysis windows with a fixed length, time-invariant linear filters, and stationary Gaussian processes, make the vocoders lose some essential information of speech such as phase and temporal details, which causes marked quality degradation. To address this problem, the authors of [30], [31] proposed the WN vocoder to generate raw speech waveforms. That is, the WN vocoder replaces the synthesis part of conventional vocoders to synthesize high-fidelity speech on the basis of the prosodic and spectral acoustic features extracted by conventional vocoders. Furthermore, conditioning WN on the acoustic features greatly reduces the amount of training data required, which makes WN more tractable.

C. Problems in Using WaveNet as a Vocoder
As a vocoder, WN achieves high speech quality, but it lacks pitch controllability, which is an essential feature of conventional vocoders. Specifically, the WN vocoder has difficulties in generating speech with precise pitch when conditioned on $F_0$ values that are outside the $F_0$ range of the training data [36]. Even when the $F_0$ and spectral features are within the observed range, an unseen combination of the auxiliary features still markedly degrades the generation performance of the WN vocoder [24]-[28]. The possible reasons for this problem are that WN lacks prior speech knowledge and does not explicitly model the relationship between the auxiliary $F_0$ feature and pitch. This defect makes the WN vocoder inconsistent with the definition of a vocoder. Moreover, since the fixed WN architecture assumes that each sample has the same receptive field length, the inefficient extension of the receptive field may lead to the costly requirements of a huge network and a large amount of computation power.

IV. QUASI-PERIODIC WAVENET
To improve the efficiency of extending the receptive field and pitch controllability, QPNet introduces the prior pitch information into WN by dynamically changing the network structure according to the auxiliary $F_0$ features. Specifically, as shown in Fig. 1, the main differences between WN and QPNet are the pitch-dependent dilated convolution mechanism handling the periodicity of audio signals and the cascaded structure simultaneously modeling the long- and short-term correlations. The pitch filtering in CELP, which is the basis of the PDCNN, and the details of QPNet are described as follows.

A. Pitch Filtering in CELP
Fig. 2 shows a flowchart of the CELP system [35], which includes an innovation signal codebook and two cascaded time-varying linear recursive filters. First, each innovation signal in the codebook is scaled and passed to the pitch filter (long delay) to generate the pitch periodicity of the speech, and then the linear-prediction filter (short delay) restores the spectral envelope to obtain the synthesized speech. Second, the mean-square errors between the original and synthesized speech signals are weighted by a linear filter to attenuate or amplify frequency components that are less or more perceptually important, respectively. Finally, the optimum innovation signal and the scaling factor are determined by minimizing the weighted mean-square error. To be more specific, the pitch-filtering process can be formulated as
$$c^{(o)}_t = g \times c^{(i)}_t + b \times c^{(o)}_{t - t_d},$$
where $c^{(i)}$ is the input, $c^{(o)}$ is the output, $t_d$ is the pitch delay, $g$ is the gain, and $b$ is the pitch filter coefficient. This periodic feedback structure handling the periodicity of signals is the basis of the proposed PDCNN, and the cascaded recursive structure modeling the hierarchical correlations is also applied to QPNet.
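The pitch-filtering recursion can be sketched in a few lines of Python. The sketch assumes a one-tap long-delay filter in which each output sample is the scaled input plus a scaled copy of the output one pitch delay earlier, with zero initial state; the function and argument names are illustrative only.

```python
def pitch_filter(c_in, t_d, g, b):
    """One-tap recursive pitch filter: out[t] = g * in[t] + b * out[t - t_d].

    c_in: input (innovation) samples; t_d: pitch delay in samples;
    g: gain; b: pitch filter coefficient. Zero initial state is assumed.
    """
    c_out = []
    for t, x in enumerate(c_in):
        feedback = c_out[t - t_d] if t >= t_d else 0.0
        c_out.append(g * x + b * feedback)
    return c_out
```

Feeding a single impulse into this filter produces a decaying pulse train with a spacing of `t_d` samples, which is the periodic feedback behavior that the PDCNN borrows.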

B. Pitch-dependent Dilated Convolution
The main idea of the PDCNN is that since audio signals have the quasi-periodic property, the network architecture can be dynamically adapted using the prior pitch information. Specifically, the dilated convolution can be formulated as
$$y^{(o)}_t = W^{(c)} * y^{(i)}_t + W^{(p)} * y^{(i)}_{t-d},$$
where $y^{(i)}$ and $y^{(o)}$ are the input and output of the DCNN layer, $W^{(c)}$ and $W^{(p)}$ are the trainable convolution filters for the current and previous samples, respectively, and $*$ is the convolution operator. The dilation size $d$ is constant for the vanilla DCNN but time-variant for the PDCNN.
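To make the difference between a constant and a time-variant dilation size concrete, the following Python sketch applies a two-tap causal dilated convolution to a scalar sequence. It assumes one weight for the current sample and one for the sample `d` steps back, with zero padding before the start of the sequence; the weights and names are hypothetical.

```python
def dilated_conv(y_in, dilations, w_c=1.0, w_p=1.0):
    """Two-tap causal dilated convolution with a per-sample dilation size.

    dilations[t] is the dilation used at time t: a constant list mimics the
    vanilla DCNN, and a varying list mimics the PDCNN. Samples before the
    start of the sequence are treated as zero (assumed padding).
    """
    out = []
    for t, d in enumerate(dilations):
        previous = y_in[t - d] if t - d >= 0 else 0.0
        out.append(w_c * y_in[t] + w_p * previous)
    return out
```

For example, `dilated_conv(y, [2] * len(y))` behaves like a fixed layer with dilation two, whereas passing a list that changes over time lets each sample look back by a different distance.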
To extend the receptive field length, the vanilla WN utilizes stacked chunks including DCNN layers with different dilation sizes. Specifically, each chunk contains a specific number of DCNN layers, and each layer (except the first layer) has twice the dilation size of the previous one. The dilation sizes of the first layers of the chunks are set to one, so the dilation size in each chunk increases exponentially with base two. As shown in Fig. 3, the dilation sizes of the PDCNN layers in the stacked adaptive chunks of QPNet follow the same extension rule but are multiplied by an extra dilated factor to match the instantaneous pitch of the current sample. The pitch-dependent dilated factor $E_t$ is derived from
$$E_t = \frac{F_s}{F_{0,t} \times a},$$
where $F_s$ is the utterance-wise constant sampling rate, $F_{0,t}$ is the fundamental frequency at speech sample index $t$, and $a$ is a hyperparameter called the dense factor, which indicates the number of samples in one cycle taken into consideration when predicting the current sample. Specifically, the grid sampling locations of each DCNN are controlled by the dilation size $d$, and the dilation size $d'$ of each PDCNN is controlled by the dilated factor $E_t$ as
$$d' = E_t \times d.$$
By setting the $F_0$ values and the dense factor $a$, the network can control the sparsity of the CNN sampling grids to attain the desired effective receptive field length. As shown in Fig. 4, since the sinusoids in Figs. 4 (a) and (b) have the same dense factors and sampling rates, the numbers of cycles in their effective receptive fields are the same even though their frequencies are different. The difference is the temporal sparsity of the effective receptive field. That is, fixing the number of sampling grids in each cycle by the dense factor and changing the gaps between the grid sampling locations by the instantaneous $F_0$ values lead to pitch-dependent and time-variant effective receptive field lengths. In summary, the dilated factor $E_t$ is the ratio of the effective receptive field length to the receptive field length, and the ratio of the receptive field length to the dense factor $a$ is the number of past cycles in the effective receptive field. With the pitch-dependent structure, each sample has an exclusive effective receptive field length, which is efficiently extended according to the auxiliary $F_0$ values.
In addition, since speech has voiced and unvoiced segments, we tried two settings of $E_t$ for the unvoiced segments: setting it to one, and calculating it from $F_0$ values interpolated between the adjacent voiced segments. The results in Section VI show that QPNet with the continuous $E_t$ from interpolated $F_0$ values achieves higher speech quality.
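The relation between the sampling rate, the instantaneous pitch, the dense factor, and the resulting dilation sizes can be illustrated with a short Python sketch. It assumes that the dilated factor is the sampling rate divided by the product of the pitch and the dense factor, and that each fixed dilation size is multiplied by this factor. Rounding the factor to an integer and flooring it at one are additional assumptions made here, since the text does not state how non-integer values are handled.

```python
def dilated_factor(fs, f0, dense_factor):
    """Pitch-dependent dilated factor: Fs / (F0 * a), as an integer >= 1.

    The rounding and the lower bound of one are assumptions of this sketch.
    """
    ratio = fs / (f0 * dense_factor)
    return max(1, round(ratio))


def pdcnn_dilations(fs, f0, dense_factor, blocks_per_chunk):
    """Dilation sizes of one adaptive chunk: the base-two rule times the factor."""
    e_t = dilated_factor(fs, f0, dense_factor)
    return [e_t * 2 ** k for k in range(blocks_per_chunk)]
```

At a sampling rate of 22,050 Hz, a 100 Hz pitch, and a dense factor of 8, the factor is 28, so the 8 samples taken per cycle span 224 samples, which is close to the 220.5-sample pitch period. When the computed ratio falls below one, the factor is held at one and the layer behaves like a fixed DCNN layer.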

C. Cascaded Autoregressive Network
Most audio signals are sequential and quasi-periodic, so audio generative models usually need to simultaneously model the long-term (periodicity) and short-term (aperiodicity) correlations of audio samples. As shown in Fig. 1, the proposed QPNet utilizes a cascaded architecture that contains a fixed macroblock and an adaptive (pitch-dependent) macroblock. The fixed macroblock models the sequential relationship between the current sample and a segment of the most recent samples. The adaptive macroblock models the periodic correlations of the current and related past segments in the successive cycles. Specifically, the fixed macroblock (macroblock 0 in Fig. 1) of QPNet is composed of several fixed chunks. Each fixed chunk consists of several stacked residual blocks with DCNNs (fixed blocks), conditional auxiliary features, gated activations, and residual and skip connections, similarly to the vanilla WN. The adaptive macroblock (macroblock 1 in Fig. 1) contains several adaptive chunks, which have similar stacked residual blocks but with PDCNNs (adaptive blocks). In summary, the cascaded structure of QPNet presumably mimics the generative procedure of CELP for quasi-periodic audio signal generation.

V. PERIODIC SIGNAL GENERATION EVALUATION
To evaluate the pitch controllability of the proposed QPNet with the PDCNNs, we first evaluated the generation quality of simple periodic but high-temporal-resolution signals. That is, the training data of QPNet were sine waves within a specific frequency range and the corresponding $F_0$ values. In the test phase, QPNet was conditioned on an $F_0$ value and a small piece of the related sine wave for the initial receptive field to generate sinusoid waveforms.

A. Model Architecture
In this section, to evaluate the effectiveness of the PDCNN, we compared three types of QPNet with two types of WN in terms of sine wave generation. Specifically, in addition to the basic QPNet, because a sinusoid is a simple periodic signal that can be modeled well by a pitch-dependent structure, the QPNet model with only adaptive residual blocks (pQPNet) was taken into account. The QPNet model with the reverse order of the fixed and adaptive macroblocks (rQPNet) was also considered. Moreover, a compact-size WN (WNc) model and a full-size WN (WNf) model were evaluated as references.
The details of the network architectures are shown in Table I. Since the numbers of CNN channels were the same for all models, the model sizes were proportional to the numbers of chunks and residual blocks. For instance, WNf contained 3 chunks with 10 residual blocks in each chunk, so the model size of WNf was larger than that of WNc, which had only 4 chunks with 4 residual blocks in each chunk. For all models, the learning rate was $1 \times 10^{-4}$ without decay, the minibatch size was one, the batch length was 22,050 samples, the number of training epochs was two, and the optimizer was Adam [47].

B. Evaluation Setting
Because the pitch range of most speech is around 80-400 Hz, the training sine waves were set to be in the same range with a step size of 20 Hz (e.g., 80, 100, and 120 Hz). Each model had a related one-dimensional $F_0$ value as its auxiliary feature. Since single-tone generation was evaluated, the auxiliary features of all samples in one utterance were the same. To prevent the networks from suboptimal training and from lacking generality for sinusoid generation with unseen $F_0$ values, both sinusoid and auxiliary signals were mixed with white noise.
The signal-to-noise ratio (SNR) of the sine waves was around 20 dB, and the noise of the auxiliary feature was a random sequence between -1 and 1. Random initial phases were also applied to the sinusoid signals. The number of training utterances was 4000, and each utterance was one second long. The ground truths were clean sinusoid signals, so each model was trained as a denoising network. The test data included 20 different $F_0$ values, which were 10-80 Hz with a step size of 10 Hz, 100-400 Hz with a step size of 100 Hz, and 450-800 Hz with a step size of 50 Hz, and each $F_0$ value had 10 test utterances with different phase shifts. Both training and test data were encoded into 8 bits using the µ-law, and the sampling rate was 22,050 Hz.
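The data setting above can be sketched as follows. The frequency grids follow the values stated in the text. The unit amplitude, the Gaussian form of the white noise, the seeding, and the function name are assumptions for illustration; the noise is scaled so that the requested SNR is met exactly.

```python
import math
import random

# F0 grids stated in the text (Hz).
TRAIN_F0 = list(range(80, 401, 20))
TEST_F0 = (list(range(10, 81, 10)) + list(range(100, 401, 100))
           + list(range(450, 801, 50)))


def noisy_sine(f0, fs=22050, duration=1.0, snr_db=20.0, seed=0):
    """Return (clean, noisy) sine waves with a random initial phase.

    White Gaussian noise is scaled so that the SNR of `noisy` with respect
    to `clean` equals `snr_db`. Unit amplitude is an assumption.
    """
    rng = random.Random(seed)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    n = int(fs * duration)
    clean = [math.sin(2.0 * math.pi * f0 * t / fs + phase) for t in range(n)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    p_signal = sum(c * c for c in clean) / n
    p_noise = sum(v * v for v in noise) / n
    scale = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    noisy = [c + scale * v for c, v in zip(clean, noise)]
    return clean, noisy
```

In a denoising setup of the kind described, the noisy wave would serve as the network input history and the clean wave as the target.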
In the test stage, the initial receptive field of each network was fed with the noisy test sine wave, and the length of the generated sinusoid was set to 1 s. The quality of each generated waveform was evaluated on the basis of the SNR and the root-mean-square error (RMSE) of the log $F_0$ value measured from the peak of the power spectral density (PSD). Moreover, the test data were divided into 10-40 Hz (under $1/2L$), 50-80 Hz (above $1/2L$), 100-400 Hz (inside), 450-600 Hz (under $3/2U$), and 650-800 Hz (above $3/2U$) subsets, where $L$ is the lower bound and $U$ is the upper bound of the inside $F_0$ range, which was the $F_0$ range of the training data. As a result, the under $1/2L$ and above $1/2L$ $F_0$ ranges form the lower outside $F_0$ range, and the under $3/2U$ and above $3/2U$ $F_0$ ranges form the higher outside $F_0$ range.
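The two objective measures and the subset grouping can be sketched in Python as follows. Several details are assumptions rather than the authors' procedure: the SNR is computed against a time-aligned clean reference, the PSD peak is searched over a caller-supplied list of candidate frequencies with a direct Fourier sum, and the natural logarithm is used for the log $F_0$ error.

```python
import math


def snr_db(reference, generated):
    """SNR in dB of `generated` against a time-aligned clean `reference`."""
    p_ref = sum(r * r for r in reference)
    p_err = sum((r - g) ** 2 for r, g in zip(reference, generated))
    return 10.0 * math.log10(p_ref / p_err)


def spectral_peak_hz(signal, fs, candidates_hz):
    """Return the candidate frequency with the largest spectral power."""
    best_f, best_p = None, -1.0
    for f in candidates_hz:
        w = 2.0 * math.pi * f / fs
        re = sum(x * math.cos(w * n) for n, x in enumerate(signal))
        im = sum(x * math.sin(w * n) for n, x in enumerate(signal))
        power = re * re + im * im
        if power > best_p:
            best_f, best_p = f, power
    return best_f


def log_f0_error(measured_hz, target_hz):
    """Absolute log-F0 error (natural logarithm assumed)."""
    return abs(math.log(measured_hz) - math.log(target_hz))


def f0_subset(f0, lower=80.0, upper=400.0):
    """Group a test F0 value following the subsets listed in the text."""
    if f0 <= lower / 2.0:
        return "under 1/2L"
    if f0 <= lower:  # the text groups 50-80 Hz together
        return "above 1/2L"
    if f0 <= upper:
        return "inside"
    if f0 <= 1.5 * upper:
        return "under 3/2U"
    return "above 3/2U"
```

The RMSE reported per subset would then be the square root of the mean of the squared log $F_0$ errors over the utterances in that subset.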

C. Dense Factor
To explore an efficient dense factor value for the PDCNNs, the sinusoid generation quality of the pQPNet models with different dense factors was evaluated. Since the chunk and block numbers of pQPNet were both set to four, the length of the receptive field was 61 samples. That is, the effective receptive field covered from 61 past cycles to less than one cycle as the dense factor varied from $2^0$ to $2^6$. Moreover, instead of containing a fixed number of past cycles for sinusoids with arbitrary pitch, the receptive field of WNf contained 11 past cycles for 80 Hz sinusoids and 56 past cycles for 400 Hz sinusoids when the sampling rate was 22,050 Hz. As a result, the effective receptive field of pQPNet with a dense factor of 2 already contained a number of past cycles comparable to that of WNf. Since pQPNet introduced prior pitch knowledge into the network, the number of past cycles required for modeling the sinusoids might be less than that of WNf.
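The receptive field lengths and cycle counts quoted above can be reproduced with a small calculation. The sketch assumes two-tap convolutions whose dilation sizes double within each chunk starting from one, and it ignores any extra context from the causal layer; the function names are illustrative.

```python
def receptive_field(chunks, blocks_per_chunk):
    """Receptive field in samples for stacked chunks of two-tap dilated layers."""
    return chunks * (2 ** blocks_per_chunk - 1) + 1


def past_cycles_fixed(rf, fs, f0):
    """Cycles of an f0-Hz sinusoid covered by a fixed receptive field."""
    return rf * f0 / fs


def past_cycles_adaptive(rf, dense_factor):
    """Cycles covered by the pitch-dependent effective receptive field."""
    return rf / dense_factor
```

Under these assumptions, 4 chunks of 4 blocks give 61 samples, and 3 chunks of 10 blocks give 3070 samples, which corresponds to about 11 cycles at 80 Hz and about 56 cycles at 400 Hz for a 22,050 Hz sampling rate. The adaptive count depends only on the dense factor, ranging from 61 cycles at a dense factor of 1 to slightly under one cycle at 64.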
The number of training epochs of the pQPNet models with dense factors from $2^2$ to $2^6$ was two. For dense factors of $2^0$ and $2^1$, pQPNet required at least 10 training epochs to attain stable results. As shown in Tables II and III, the network with the dense factor of $2^0$ was very unstable even after 10 training epochs. The results indicate that although a small dense factor gave the network a long effective receptive field, the overly sparse information from each past cycle might make it difficult to model signals well. For the inside and lower outside $F_0$ ranges, the networks with dense factors greater than $2^1$ achieved high SNR values. However, the performance of the network with a dense factor of $2^6$ markedly degraded when the auxiliary $F_0$ values were in the higher outside $F_0$ range. A possible reason is that the PDCNNs of the network degenerated to DCNNs because $E_t$ became one when the dense factor was $2^6$ and the $F_0$ values were higher than 350 Hz. Moreover, the log $F_0$ RMSE results show a similar tendency to the SNR results. The networks with dense factors of $2^0$ and $2^6$ achieved the lowest pitch accuracies, while the networks with dense factors of $2^2$ and $2^3$ achieved the highest pitch accuracies.
In conclusion, the PDCNN with an appropriate dense factor was found to be robust against conditions in the outside $F_0$ range, especially the lower outside $F_0$ range. For the higher outside $F_0$ range, the networks still had acceptable quality until the $F_0$ value exceeded 600 Hz. Therefore, we set the dense factor to $2^3$ for the models in the following evaluations because of the balance between the generative performance and the number of past cycles covered in the receptive field.

D. Network Comparison
As shown in Tables IV and V, the PDCNNs significantly improved pitch controllability.The PDCNNs made the QPseries networks achieve much higher SNR and lower log F 0 RMSE values than the same-size WNc network in both higher and lower outside F 0 ranges, and it shows the effectiveness of the PDCNNs to extend the effective receptive field length.Although full-size WNf attained similar SNRs to pQPNet, the log F 0 RMSE of WNf was much higher in the outside F 0 ranges.This indicates that WNf tended to generate the signals in the inside F 0 range instead of being consistent with the auxiliary F 0 feature, so the generated waveform of WNf might still be a perfect sinusoid signal but with an incorrect pitch.The  results also imply that the PDCNNs improved the periodical modeling capability using prior pitch knowledge.
In addition, because the task was the simple generation of a periodic signal, pQPNet, which has the longest effective receptive field and a pure PDCNN structure, attained the best generative performance among all QP-series networks. QPNet and rQPNet showed some quality degradation when the auxiliary $F_{0}$ values were far from the inside $F_{0}$ range, but they still outperformed WNc in both measurements and WNf in terms of log $F_{0}$ RMSE.

E. Discussion
In this section, several sinusoid generation examples are presented to examine the physical phenomena behind the objective results. As shown in Figs. 5 (a) and (b), the pQPNet with a dense factor of $2^{3}$ generated a clear sine wave with an SNR of 23.7 dB when conditioned on an outside auxiliary value of 500 Hz (under 3/2U). The PSD of this generated signal peaks at 502 Hz, which is very close to the ground truth, and the log $F_{0}$ error is less than 0.01. However, the results in Figs. 5 (c) and (d) show that the sine wave generated by the pQPNet with a dense factor of $2^{0}$ includes much harmonic noise, which results in a low SNR. Even though the generated signal is still periodic, the wrong PSD peak at the second harmonic component also causes a high log $F_{0}$ error. Moreover, the results in Figs. 5 (e) and (f) show that the pQPNet with a dense factor of $2^{6}$ generated a very noisy signal, which has a very low SNR and a wrong peak value.
[Fig. 5, panels (e) and (f): axes Frequency (Hz) and PSD (dB); annotations Peak (Hz) and SNR (dB).]
In addition, as shown in Figs. 6 (a) and (b), the pQPNet with a dense factor of $2^{3}$ still generated a clear sine wave with an SNR of 23.3 dB and a correct PSD peak when conditioned on an outside auxiliary value of 20 Hz (under 1/2L). However, the same-size WNc could not generate any meaningful signal, and the SNR of its generated signal is very low, as shown in Figs. 6 (c) and (d). By contrast, the WNf still generated a clear sine wave with an SNR of 33 dB, but its frequency is incorrect, as shown in Figs. 6 (e) and (f). Specifically, the PSD peak is at 120 Hz, which implies that the WNf tends to generate seen signals even when conditioned on an unseen auxiliary feature.
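The three quantities used in these examples (PSD peak, SNR, and log $F_{0}$ error) can be illustrated with a short script. This is only a sketch under stated assumptions: the exact SNR and peak-picking procedures are not spelled out in this section, so the SNR below is assumed to be the power ratio between a least-squares sinusoid fitted at the PSD peak and the residual, the peak is picked on a hypothetical 10 Hz grid, and all function names are illustrative. With such a definition, a clean sinusoid at the wrong frequency still has a high SNR, which matches the behavior described for WNf.

```python
import math


def sine(f_hz, fs_hz, n_samples, amplitude=1.0):
    """Sampled sinusoid of the given frequency and amplitude."""
    return [amplitude * math.sin(2.0 * math.pi * f_hz * n / fs_hz)
            for n in range(n_samples)]


def psd_peak_hz(signal, fs_hz, candidates_hz):
    """Candidate frequency with the largest DFT power (a simple peak picker)."""
    best_f, best_power = None, -1.0
    for f in candidates_hz:
        w = 2.0 * math.pi * f / fs_hz
        re = sum(x * math.cos(w * n) for n, x in enumerate(signal))
        im = sum(x * math.sin(w * n) for n, x in enumerate(signal))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return best_f


def sinusoid_snr_db(signal, fs_hz, f_hz):
    """Assumed SNR: the least-squares sinusoid at f_hz is the signal, the residual is noise."""
    w = 2.0 * math.pi * f_hz / fs_hz
    c = [math.cos(w * n) for n in range(len(signal))]
    s = [math.sin(w * n) for n in range(len(signal))]
    cc = sum(v * v for v in c)
    ss = sum(v * v for v in s)
    cs = sum(u * v for u, v in zip(c, s))
    xc = sum(x * v for x, v in zip(signal, c))
    xs = sum(x * v for x, v in zip(signal, s))
    det = cc * ss - cs * cs          # 2x2 normal equations for the cos/sin weights
    a = (xc * ss - xs * cs) / det
    b = (xs * cc - xc * cs) / det
    fitted = [a * ci + b * si for ci, si in zip(c, s)]
    signal_power = sum(v * v for v in fitted)
    noise_power = sum((x - v) ** 2 for x, v in zip(signal, fitted))
    return 10.0 * math.log10(signal_power / max(noise_power, 1e-20))  # floor avoids log(inf)


def log_f0_error(f0_target_hz, f0_measured_hz):
    """Absolute error between two frequencies in the natural-log domain."""
    return abs(math.log(f0_measured_hz) - math.log(f0_target_hz))


FS = 22050.0   # assumed sampling rate
N = 2205       # 0.1 s of signal
tone = sine(500.0, FS, N)
harmonic = sine(1000.0, FS, N, amplitude=0.05)   # mild second-harmonic "noise"
generated = [t + h for t, h in zip(tone, harmonic)]

peak = psd_peak_hz(generated, FS, range(10, 2001, 10))
snr = sinusoid_snr_db(generated, FS, peak)
err = log_f0_error(500.0, 502.0)
print(peak, round(snr, 2), round(err, 4))
```

In this toy case, the peak is found at 500 Hz and the SNR is about 26 dB. The last value shows that a 502 Hz peak for a 500 Hz target corresponds to a natural-log $F_{0}$ error of about 0.004, in line with the "less than 0.01" figure quoted for Fig. 5 (b).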
In conclusion, the SNRs reflect how noisy the generated signals are, which indicates whether the generated signals are clear sinusoids. Since this was a single-tone sinusoid generation test, high log $F_{0}$ RMSEs imply that the generated signals may include much harmonic noise or that the frequencies of these signals are incorrect. As a result, a generated signal with a high SNR and a high RMSE is a clear sinusoid with an inaccurate frequency, like the signal shown in Fig. 6 (e). A generated signal with a low SNR and a high RMSE is a noisy sinusoid with much harmonic noise, like the signal shown in Fig. 5 (c). The generated signal with a very low SNR is a

VI. SPEECH GENERATION EVALUATIONS
In this section, we evaluate the effectiveness of the PDCNNs for speech generation. The appropriate proportions of adaptive and fixed residual blocks, the continuous pitch-dependent dilated factor, and the order of the macroblocks are explored.

A. Model Architecture
The quality of speech generation was evaluated on the basis of 11 vocoders, which included three types of vocoder: QPNet, WN, and WORLD. Specifically, to explore the efficient receptive field extension by the PDCNNs, the compact-size QPNet vocoders were compared with the same-size WNc and double-size WNf vocoders. Furthermore, the evaluations included several variants of QPNet, such as models with different types of pitch-dependent dilated factor $E_{t}$ and different orders of the fixed and adaptive macroblocks. Specifically, the QPNet and rQPNet vocoders with continuous and discrete $E_{t}$ sequences were evaluated. For the unvoiced frames, the discrete $E_{t}$ sequence was set to ones, and the continuous $E_{t}$ sequence was calculated using interpolated $F_{0}$ values as mentioned in Section IV. In addition, the full-size QPNet and rQPNet vocoders, which were full-size WN vocoders cascaded with four extra adaptive residual blocks, were also taken into consideration to explore the effect of the ratio of adaptive to fixed residual blocks. The network architectures and model sizes are shown in Table VI. The learning rate was $1 \times 10^{-4}$ without decay, the minibatch size was one, the batch length was 20,000, and the optimizer was Adam [47] for all models. Since even the compact-size WNc had tens of millions of parameters, which was the same order of magnitude as that of WNf, the number of training iterations was empirically set to 200,000 for all models. Note that we did not evaluate speech generation using the pQPNet model because it failed to model the short-term correlation of speech according to our internal experiments.

B. Evaluation Setting
All models were trained in a multispeaker manner. The training corpus of these multispeaker NN-based vocoders consisted of the training sets of the "bdl" and "slt" speakers of CMU-ARCTIC [48] and all speakers of VCC2018 [49]. The total number of training utterances was around 3000, and the total training data length was around four hours. The evaluation corpus was composed of the SPOKE set of VCC2018, which included two female and two male speakers, and each speaker had 35 test utterances. All speech data were set to a sampling rate of 22,050 Hz and a 16-bit resolution. The waveform signals for the categorical output of the NN-based vocoders were further encoded into 8 bits using the µ-law. The 513-dimensional spectral (sp) and ap features and the one-dimensional $F_{0}$ feature were extracted using WORLD. The sp feature was further parameterized into 34-dimensional mcep, ap was coded into two-dimensional components, and $F_{0}$ was converted into continuous $F_{0}$ and the unvoiced/voiced (U/V) binary code for the auxiliary features [30]. The $F_{0}$ range of the SPOKE set was around 40-330 Hz, and the $F_{0}$ mean was around 150 Hz. The unseen outside auxiliary features were simulated by replacing the original $F_{0}$ values of the acoustic features with scaled $F_{0}$ values, and the scaling ratios were 1/2, 3/4, 5/4, 3/2, and 2. A demo and an open-source QPNet implementation can be found in [50].
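The 8-bit µ-law encoding of the waveform can be sketched as follows. This is a minimal illustration of standard µ-law companding with µ = 255 (256 categorical classes); the rounding scheme and the function names are assumptions made for the example and are not taken from the QPNet implementation in [50].

```python
import math

MU = 255  # 8-bit mu-law: class indices 0..255


def mulaw_encode(x, mu=MU):
    """Map a waveform sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))                       # clip to the valid range
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1.0) / 2.0 * mu + 0.5)           # shift to [0, mu] and round


def mulaw_decode(q, mu=MU):
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    y = 2.0 * q / mu - 1.0
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


for sample in (-1.0, -0.1, 0.0, 0.1, 1.0):
    code = mulaw_encode(sample)
    print(sample, code, round(mulaw_decode(code), 4))
```

The logarithmic companding allocates more of the 256 classes to small amplitudes, which is why it is commonly preferred to uniform 8-bit quantization when a waveform model predicts a categorical distribution over sample values.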

C. Objective Evaluation
For the objective evaluations, the ground truth acoustic features were extracted from natural speech utterances using WORLD, and the extraction error of WORLD was neglected. A speaker-dependent $F_{0}$ range was applied to the feature extraction of each speaker to improve the extraction accuracy, and the $F_{0}$ range was set following the process in [51]. Since WORLD was developed to extract $F_{0}$-independent spectral features [5], the WORLD-extracted sp feature was assumed to be independent of the $F_{0}$ feature in this paper. Therefore, the ground truth acoustic features for the scaled $F_{0}$ scenarios were the same natural spectral features with the $F_{0}$ feature scaled by an assigned ratio. The auxiliary features of the evaluated vocoders were the ground truth acoustic features. Mel-cepstral distortion (MCD) was applied to measure the spectral reconstruction capability of the vocoders, and the MCD was calculated between the auxiliary mcep and the WORLD-extracted mcep from the generated speech. The pitch accuracy of the generated speech was evaluated using the RMSE between the auxiliary $F_{0}$ and the WORLD-extracted $F_{0}$ of the generated speech in the logarithmic domain. The unvoiced/voiced (U/V) decision error, which was the percentage of differing unvoiced/voiced decisions in each utterance, was also taken into account in the evaluation of the prosodic prediction capability.
Since speech generation is more complicated than sine wave generation, we first conducted an objective evaluation of QPNet models with different dense factors for speech generation to check the consistency of the efficient dense factor value. As shown in Table VII, the tendency of the objective evaluation is similar to that of the sinusoid generation evaluation. That is, the QPNets with dense factors from $2^{1}$ to $2^{4}$ achieved similar generative performance, while the speech quality and pitch accuracy of the QPNets with dense factors of $2^{5}$ and $2^{6}$ markedly degraded because of the much shorter effective receptive field lengths. Specifically, as shown in Table VIII, the average effective receptive field lengths of the QPNets with dense factors of $2^{5}$ and $2^{6}$ are much shorter than the others, and the lengths were too short to cover even one cycle of a 150 Hz signal, which was the $F_{0}$ mean of the SPOKE set. Furthermore, although the QPNet with a dense factor of $2^{0}$ had the longest average effective receptive field length and achieved an acceptable MCD, its higher log $F_{0}$ RMSE and U/V error indicate its instability, which was also observed in the sinusoid generation evaluation. In conclusion, the dense factors of the following QPNet-series models were set to $2^{3}$ because this value gave the lowest log $F_{0}$ RMSE and U/V error with an acceptable MCD. The internal subjective evaluation results also show a preference for the utterances generated by the QPNet with a dense factor of $2^{3}$.
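The three objective measures can be sketched in a few lines. This is an illustrative sketch rather than the evaluation code of the paper: it assumes the commonly used MCD definition $\frac{10}{\ln 10}\sqrt{2\sum_{d}(c_{d}-\hat{c}_{d})^{2}}$ with the 0th (energy) coefficient excluded, computes the log $F_{0}$ RMSE only over frames that are voiced in both sequences, and marks unvoiced frames with $F_{0} = 0$. These conventions and the function names are assumptions.

```python
import math


def mcd_db(mcep_ref, mcep_gen):
    """Frame-averaged mel-cepstral distortion in dB (0th coefficient excluded)."""
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, gen in zip(mcep_ref, mcep_gen):
        total += const * math.sqrt(sum((r - g) ** 2 for r, g in zip(ref[1:], gen[1:])))
    return total / len(mcep_ref)


def log_f0_rmse(f0_ref, f0_gen):
    """RMSE of natural-log F0 over frames voiced (F0 > 0) in both sequences."""
    pairs = [(r, g) for r, g in zip(f0_ref, f0_gen) if r > 0 and g > 0]
    return math.sqrt(sum((math.log(g) - math.log(r)) ** 2 for r, g in pairs) / len(pairs))


def uv_error_percent(f0_ref, f0_gen):
    """Percentage of frames whose unvoiced/voiced decisions differ."""
    mismatches = sum((r > 0) != (g > 0) for r, g in zip(f0_ref, f0_gen))
    return 100.0 * mismatches / len(f0_ref)


# Toy frames: the last reference frame is unvoiced but generated as voiced.
f0_reference = [0.0, 100.0, 200.0, 0.0]
f0_generated = [0.0, 100.0, 200.0, 150.0]
print(uv_error_percent(f0_reference, f0_generated))
print(log_f0_rmse(f0_reference, f0_generated))
print(mcd_db([[0.0, 1.0]], [[0.0, 0.0]]))
```

In the scaled $F_{0}$ scenarios described above, the reference $F_{0}$ passed to such functions would be the scaled auxiliary $F_{0}$, while the reference mcep stays unchanged.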
As shown in Table IX, in terms of spectral prediction capability, the compact-size (r)QPNet vocoders with the proposed PDCNNs significantly outperformed the same-size WNc vocoder. The results confirm the effectiveness of the QP structure in skipping redundant samples on the basis of prior pitch knowledge, which extends the receptive field more efficiently.
However, the MCDs of the double-size WNf vocoder are lower than those of the compact-size (r)QPNet vocoders, and the full-size (r)QPNet vocoders with the largest network size also outperformed the WNf vocoder in terms of MCD. The results indicate that the MCD values are highly related to the network sizes, so a deeper network attains a more powerful spectral modeling capability. Furthermore, the systems with continuous pitch-dependent dilated factors achieved better MCDs than those with discrete ones, and this result is consistent with our internal subjective evaluation of speech quality. However, the MCD differences between the rQPNet and QPNet vocoders were not reflected in the perceptual quality, and the two vocoders had similar speech quality according to the internal evaluation.
The log $F_{0}$ RMSE results in Table X also show that both the compact-size QPNet and rQPNet vocoders attained markedly higher pitch accuracy than the same-size WNc vocoder, particularly when conditioned on unseen $F_{0}$ values with a large shift. The compact-size QPNet vocoder even achieved higher pitch accuracy than the WNf vocoder. The results indicate that the PDCNNs with prior pitch knowledge improved the pitch controllability of these vocoders for unseen $F_{0}$ values. However, the pitch accuracies of the full-size QPNet and rQPNet vocoders are lower than those of the compact-size (r)QPNet vocoders. A possible reason is that the unbalanced proportion of adaptive and fixed residual blocks impaired the pitch controllability. That is, for the full-size (r)QPNet vocoders, the number of fixed blocks is markedly larger than the number of adaptive blocks. Therefore, the network might be dominated by the fixed blocks, which reduced the influence of the adaptive blocks. Specifically, for the compact-size (r)QPNet vocoders with a dense factor of $2^{3}$, the receptive field length of the fixed blocks is 46 samples (the details of the receptive field length can be found in the Discussion), and the average effective receptive field length of the adaptive blocks is 384 samples, as shown in Table VIII. However, for the full-size (r)QPNet vocoders, the receptive field length of the fixed blocks is 3070 samples, which is much longer than the 384 samples of the four extra adaptive blocks. Therefore, the influence of the adaptive blocks might be very limited.
As shown in Table XI, the compact-size QPNet vocoder attained the lowest U/V decision error among all NN-based vocoders, which indicates a higher capability to capture U/V information. In conclusion, the compact-size QPNet vocoder with the proposed PDCNNs and continuous pitch-dependent dilated factors attained the highest accuracy of pitch and U/V information among the evaluated NN-based vocoders. Although the compact-size QPNet vocoder did not achieve the same spectral prediction capability as the WNf vocoder according to the MCD results, it is difficult to measure a perceptual quality difference on the basis of MCD alone. Therefore, we subjectively evaluated the compact-size QPNet (with continuous pitch-dependent dilated factors), WNc, and WNf vocoders in the next section. Moreover, although the WORLD vocoder had the best objective evaluation results, WORLD-generated speech usually lacks naturalness and contains buzz noise, which may not be reflected in the objective measurements. Therefore, we also included the WORLD vocoder in our subjective evaluations.

D. Subjective Evaluation
The subjective evaluations included the Mean Opinion Score (MOS) test for speech quality and the ABX preference test for perceptual pitch accuracy. Specifically, the naturalness of each utterance in the evaluation set for the MOS test was rated by several listeners on a scale of 1-5; the higher the score, the greater the naturalness of the utterance. The MOS evaluation set was composed of randomly selected utterances generated by the WORLD, WNf, WNc, and QPNet vocoders with auxiliary features with 1/2 $F_{0}$, 3/2 $F_{0}$, and unchanged $F_{0}$. The compact-size QPNet vocoder with continuous dilated factors was adopted and is abbreviated as QPNet in the subjective evaluations. We randomly selected 20 utterances from the 35 test utterances of each condition and each speaker to form the MOS evaluation set, so the number of utterances in the set was 960 (20 utterances × 4 speakers × 4 vocoders × 3 $F_{0}$ conditions). The MOS evaluation set was divided into five subsets, and each subset was evaluated by two listeners, so the total number of listeners was 10. All listeners took the test using the same devices in the same quiet room. Although the listeners were not native speakers, they had worked on speech or audio generation research.
In the ABX preference test, the listeners compared two test utterances (A and B) with one reference utterance (X) to evaluate which test utterance had a pitch contour more consistent with that of the reference utterance. Because natural speech with the desired scaled $F_{0}$ does not actually exist, and conventional vocoders usually have high pitch accuracy, we took the WORLD-generated speech as the reference. The ABX evaluation set consisted of the same generated utterances of the WNf, QPNet, and WORLD vocoders as the MOS evaluation set. The number of ABX utterance pairs was 240, and each pair was evaluated by two of the same 10 listeners as in the MOS test.
As shown in Fig. 7, for the female speaker set, the QPNet vocoder significantly outperformed the same-size WNc vocoder in all cases. Although the QPNet vocoder achieved slightly lower naturalness than the WNf vocoder in the unchanged $F_{0}$ (inside) case, the QPNet vocoder still attained markedly better naturalness than the WNf vocoder in the 1/2 $F_{0}$ (outside) case. The results indicate that halving the network size markedly degraded the speech modeling capability of the WN vocoder. However, the proposed PDCNNs significantly improved it, especially in the 1/2 $F_{0}$ case, in which QPNet obtained a long effective receptive field length. On the other hand, owing to the small dilated factors caused by the high $F_{0}$ values, many of the PDCNNs might degenerate to DCNNs in the 3/2 $F_{0}$ case. Specifically, when the dilated factors are less than or equal to one because of the high $F_{0}$ values, the dilation sizes of the PDCNN are also less than or equal to those of the DCNN. As a result, when conditioned on the auxiliary features with 3/2 $F_{0}$, although the QPNet vocoder still outperformed the WNc vocoder, the speech qualities of the WNf and WORLD vocoders are higher than that of the QPNet vocoder. In addition, as shown by the results of the male speaker set in Fig. 8, the QPNet vocoder achieved naturalness comparable to that of the WNf vocoder in all $F_{0}$ cases, which is significantly better than that of the WNc vocoder. Specifically, most of the 3/2 $F_{0}$ values of the male speakers are still within the range of normal female $F_{0}$, so the effective receptive field lengths of the QPNet vocoder are clearly longer than the receptive field length of the WNc vocoder in all the subjective evaluations of the male speaker set. On the other hand, the WORLD vocoder shows almost the same tendency in the evaluations of both the female and male speaker sets. That is, it shows lower naturalness than the WNf vocoder in the unchanged $F_{0}$ case and much lower speech quality than both the WNf and QPNet vocoders in the 1/2 $F_{0}$ case, whereas the naturalness of the WORLD vocoder only slightly degrades in the 3/2 $F_{0}$ case.
As shown in Figs. 9 and 10, the QPNet vocoder significantly outperformed the WNf vocoder in terms of pitch accuracy in all $F_{0}$ cases for both the female and male sets, except in the unchanged $F_{0}$ case of the female set, which may be caused by the naturalness degradation. The results confirm the pitch controllability improvement of the QPNet vocoder with the PDCNNs. In summary, the QPNet vocoder with the more compact network size achieved speech quality comparable to that of the WNf vocoder under most conditions, except for the female set with 3/2 $F_{0}$, because the higher $F_{0}$ values might make the PDCNNs degenerate to DCNNs. The QPNet vocoder conditioned on unseen $F_{0}$ values also attained markedly higher pitch accuracy than the WNf vocoder. Moreover, the QPNet vocoder achieved speech quality higher than or comparable to that of the WORLD vocoder under most conditions, except when conditioned on the unseen 3/2 female $F_{0}$.

E. Discussion
As shown in Fig. 11, the receptive field length of WNf is 3070 samples (the receptive field length of the 10 blocks in each chunk is $2^{0} + 2^{1} + \cdots + 2^{9} = 1023$, so the total length is $1023 \times 3$ plus one from the causal layer), that of WNc is 61 samples (each chunk contains $2^{0} + 2^{1} + 2^{2} + 2^{3} = 15$, so the total receptive field length is $15 \times 4 + 1 = 61$), and that of QPNet is 100-1000 samples (the receptive field length of the fixed blocks and the causal layer is $15 \times 3 + 1 = 46$, and that of the adaptive blocks is $15 \times E_{t}$; the pitch-dependent dilated factor $E_{t}$ with a dense factor of 8 was around 60 for 50 Hz and 6 for 500 Hz). Specifically, the receptive field lengths of WNf and WNc are constant because of the fixed network structure, and the receptive field length of QPNet is time-variant and pitch-dependent because of the QP structure.
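The receptive field arithmetic above can be reproduced with a short script. This is a minimal sketch that assumes $E_{t} = F_{s} / (F_{0} \times a)$ with $F_{s}$ = 22,050 Hz and a real-valued (unrounded) $E_{t}$; the function names and default arguments are hypothetical and only mirror the chunk and block counts quoted in the text.

```python
def stacked_receptive_field(num_chunks, blocks_per_chunk):
    """Receptive field of chunks with dilations 2^0..2^(B-1), plus one causal layer."""
    per_chunk = sum(2 ** i for i in range(blocks_per_chunk))
    return num_chunks * per_chunk + 1


def qpnet_effective_receptive_field(f0_hz, dense_factor=8, fs_hz=22050.0,
                                    fixed_chunks=3, adaptive_chunks=1,
                                    blocks_per_chunk=4):
    """Fixed part (with causal layer) plus adaptive part scaled by the assumed E_t."""
    per_chunk = sum(2 ** i for i in range(blocks_per_chunk))   # 15 for four blocks
    fixed = fixed_chunks * per_chunk + 1                       # 15 * 3 + 1 = 46
    e_t = fs_hz / (f0_hz * dense_factor)                       # assumed form of E_t
    return fixed + adaptive_chunks * per_chunk * e_t


print("WNf:", stacked_receptive_field(3, 10))
print("WNc:", stacked_receptive_field(4, 4))
for f0 in (50.0, 150.0, 500.0):
    print(f"QPNet at {f0:.0f} Hz:", round(qpnet_effective_receptive_field(f0), 1))
```

Under these assumptions, the QPNet length is about 873 samples at 50 Hz and about 129 samples at 500 Hz, which agrees with the 100-1000 sample range stated above.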
Fig. 11 also shows the effective receptive field length distributions of the female and male speakers of the SPOKE set. We find that the effective receptive field lengths of both the male and female speakers of the SPOKE set are clearly longer than the receptive field length of WNc, which concurs with the evaluation results showing that QPNet significantly outperforms WNc. Furthermore, most of the effective receptive field lengths of the female set are shorter than those of the male set, which is caused by the higher $F_{0}$ values of the female speakers. The distribution results also imply that the effective receptive field length of QPNet is close to the receptive field length of WNc when conditioned on the female 3/2 $F_{0}$ because most PDCNNs degenerate to DCNNs. In conclusion, the performance of AR models is highly related to the length of the receptive field.
However, the length of the receptive field may be more strongly correlated with the quality of the generated speech, whereas a balanced proportion of the adaptive and fixed modules may be an essential factor for the pitch accuracy. Specifically, although the full-size QPNet has the longest effective receptive field length and achieves the lowest MCD, the pitch accuracy of the full-size QPNet is still lower than that of the compact-size QPNet. A possible reason is that the full-size QPNet is dominated by the fixed blocks because the number of fixed blocks is much larger than the number of adaptive blocks, whereas the numbers of fixed and adaptive blocks of the compact-size QPNet are more balanced.
Furthermore, as shown in Tables I and VI, the number of trainable parameters of the compact-size QPNet model is around half of that of the WNf model, so only about 75% of the training time and 40% of the generation time were required. However, because of the very long effective receptive field, the memory usage of QPNet in the training stage was almost the same as that of WNf. The huge memory requirement in the training process limits the possible ratio of fixed to adaptive modules, which leads to an unbalanced proportion problem. Therefore, improving the efficiency of memory usage will be one of the main tasks of future QPNet research.

VII. CONCLUSION
In this paper, we propose a WaveNet-like audio waveform generation model named QPNet, which models quasi-periodic and high-temporal-resolution audio signals on the basis of an NN-based AR model with a novel PDCNN component and a cascaded AR structure. Specifically, the novel PDCNN component is a variant of a DCNN that dynamically changes the dilation size according to the conditioned $F_{0}$ for modeling the long-term correlations of audio samples. On the basis of the sinusoid generation evaluation results, the PDCNNs significantly improve the periodicity-modeling capability of the generation network using the introduced prior frequency information. Furthermore, the QPNet model as a vocoder models the short- and long-term correlations of speech samples on the basis of the cascaded fixed and adaptive macroblocks, respectively. The speech generation evaluation results indicate that the proposed QPNet vocoder attains much higher pitch accuracy than and speech quality comparable to the WN vocoder, especially when conditioned on unseen auxiliary $F_{0}$ values. The network size and generation time requirements of the QPNet vocoder are only half of those of the WN vocoder. In conclusion, the proposed QPNet model with the novel PDCNN component and compact cascaded network architecture significantly improves the pitch controllability of the vanilla WN model, which brings the QPNet vocoder more in line with the definition of a vocoder. In our future work, we will explore improvements in memory usage and optimize the proportion between the adaptive and fixed blocks.

$X^{(i)}$ is the input and $X^{(o)}$ is the output of the DCNN layer. The trainable 1×1 convolution filters $W^{(c)}$ and $W^{(p)}$ are for the current and past samples, respectively. The dilation size $d$ is constant for the vanilla DCNN but time-variant for the PDCNN. As shown in Figs. 3 and 4, instead of a fixed length of past speech samples, the effective receptive field of the PDCNN includes a pitch-variant length of speech samples. Specifically, although the sinusoids with different frequencies in Figs. 4 (a) and (b) have the same sampling rate and the number of sample points taken into account in both receptive fields is the same, the different $F_{0}$-dependent dilation sizes lead to different effective receptive field lengths.
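The difference between a fixed and a pitch-dependent dilation can be illustrated with a toy one-channel layer. This is a sketch under stated assumptions, not the QPNet implementation: it assumes that the layer computes a weighted sum of the current sample and the sample $d$ steps in the past, that the PDCNN dilation is the base dilation multiplied by $E_{t} = F_{s} / (F_{0} \times a)$, and that the result is rounded and floored at one. Scalar weights stand in for the 1×1 convolution filters, the gated activation and residual paths of WaveNet are omitted, and all names are hypothetical.

```python
def dilated_layer(x, w_current, w_past, dilations):
    """out[t] = w_current * x[t] + w_past * x[t - d_t]; samples before t = 0 are zero."""
    out = []
    for t, d in enumerate(dilations):
        past = x[t - d] if t - d >= 0 else 0.0
        out.append(w_current * x[t] + w_past * past)
    return out


def pitch_dependent_dilations(f0_hz_per_sample, base_dilation, dense_factor, fs_hz):
    """Per-sample dilation sizes: base dilation scaled by the assumed E_t, at least one."""
    dilations = []
    for f0 in f0_hz_per_sample:
        e_t = fs_hz / (f0 * dense_factor)
        dilations.append(max(1, int(e_t * base_dilation + 0.5)))
    return dilations


x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Vanilla DCNN: the same dilation size at every time step.
print(dilated_layer(x, 1.0, 1.0, [2] * len(x)))

# PDCNN: the dilation size follows the per-sample F0 (toy sampling rate of 1600 Hz).
f0_track = [100.0, 100.0, 100.0, 200.0, 200.0]
dilations = pitch_dependent_dilations(f0_track, base_dilation=1, dense_factor=8, fs_hz=1600.0)
print(dilations)
print(dilated_layer(x, 1.0, 1.0, dilations))
```

In this toy example, the lower $F_{0}$ yields a larger dilation, so the same number of taps reaches further into the past, which is the effect illustrated in Fig. 4.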

Fig. 4. Effective receptive fields with different $F_{0}$ values.


TABLE IX: MCD (dB) WITH FRAME-BASED 95% CONFIDENCE INTERVAL (CI) OF DIFFERENT GENERATION MODELS FOR SPEECH GENERATION

TABLE X: LOG $F_{0}$ RMSE WITH UTTERANCE-BASED 95% CI OF DIFFERENT GENERATION MODELS FOR SPEECH GENERATION

TABLE XI: U/V DECISION ERROR RATE (%) WITH UTTERANCE-BASED 95% CI OF DIFFERENT GENERATION MODELS FOR SPEECH GENERATION