Differentiable measures for speech spectral modeling

Autoregressive models for the envelope of speech power spectral densities (PSDs) are refined by the self-supervised spectral learning machine (S3LM) provided with differentiable spectral objective functions, including the Itakura-Saito divergence (ISD), the Kullback-Leibler divergence (KLD), the reverse KLD (RKLD) and the log spectral distortion (LSD). However, in order to assess the models more perceptually, a method is proposed based upon perturbations around perfect reconstruction analysis-synthesis configurations. In the cross-excitation analysis-synthesis assessment (CEASA) method, the residual signals generated by the analysis filters of the spectral models are injected as excitation into the synthesis filters derived from the same and other models, and the synthesized signals are evaluated by the perceptual evaluation of speech quality (PESQ) and the Itakura divergence (ID), averaged over the set of models obtained with the objective functions mentioned above. The results show a superior performance when the RKLD is used as the loss function for estimating the spectral models, with the ISD ranking close behind. The focus of these divergences on the spectral peaks is argued to be the most important factor for this behavior. Specifically, using the PESQ scores obtained with CEASA, the RKLD loss is found to improve the performance by 1.0%, 4.0% and 19.3% with respect to the open-loop analysis, the KLD and the LSD models, respectively, while the corresponding improvements for the ISD loss are 0.1%, 3.0% and 18.2%, and the RKLD models surpass the ISD models by 1.0% on average. Even though the spectral measures alone cannot unequivocally distinguish the better of the two, CEASA is shown to have enough sensitivity to separate their performances.
In summary, the learning machine S3LM fits models for the short-term spectral envelope of speech and, for the evaluation of its performance under several differentiable loss functions, the CEASA assessment tool has been developed. In addition, CEASA may be used for other assessments connected with speech analysis and synthesis.


I. INTRODUCTION
Models for the envelope of speech spectra [1] are important for various tasks that require speech analysis, such as speech coding, speech synthesis, automatic speech recognition and speech enhancement.
Autoregressive models for a speech power spectral density $S(e^{j\omega})$ may be obtained by applying the Wiener-Khinchin theorem to get the autocorrelation function [2]

$$R(m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(e^{j\omega})\, e^{j\omega m}\, d\omega \qquad (1)$$

for $m = 0, 1, \cdots, p$, in order to determine an autoregressive model of order $p$. This model may be obtained by the autocorrelation method of linear prediction, proposed by F. Itakura [3], [4]. The model may be represented by linear prediction coefficients [5] or by other transformed parameters.
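As an illustration of this classical pipeline, the sketch below computes the autocorrelation function from a one-sided PSD via a discrete form of the Wiener-Khinchin theorem and then solves for the prediction coefficients with the Levinson-Durbin recursion. The function name and interface are our own choices, not from the paper.

```python
import numpy as np

def lp_from_psd(psd, p):
    """Autocorrelation-method linear prediction from a one-sided PSD.

    psd : one-sided PSD with K samples up to Nyquist (K = N/2 + 1).
    p   : prediction order.
    Returns (a, E): the augmented prediction vector a = [1, a_1, ..., a_p]
    and the minimum square prediction error E.
    """
    K = len(psd)
    # Symmetric extension to the full spectrum, then an inverse FFT
    # realizes the Wiener-Khinchin integral of Eq. (1) discretely.
    full = np.concatenate([psd, psd[-2:0:-1]])
    R = np.fft.ifft(full).real[: p + 1]
    # Levinson-Durbin recursion for the normal equations.
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = R[0]
    for m in range(1, p + 1):
        acc = R[m] + a[1:m] @ R[m - 1:0:-1]
        k = -acc / E
        a_prev = a.copy()
        for i in range(1, m):
            a[i] = a_prev[i] + k * a_prev[m - i]
        a[m] = k
        E *= (1.0 - k * k)
    return a, E
```

For a PSD synthesized from a known first-order autoregressive model, the recursion recovers the generating coefficient, which is a convenient sanity check of the implementation.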
For instance, the analysis may require formant estimation and tracking [2], [6]. Despite the successful wide use of autoregressive analysis in speech applications [7], it has some shortcomings, such as inaccuracies in modeling the discrete spectra arising in harmonic segments of speech [8], [9]. An interesting approach to harmonic spectral envelope estimation is true-envelope linear predictive coding (TE-LPC), an iterative cepstral technique based on band-limited interpolation of the reference sub-sampled spectral envelope [10]. That work also proposes a residual spectral peak flatness measure for discrete spectra.
The shortcoming of straightforward autoregressive analysis of harmonic speech segments and other reasons motivate the improvement of autoregressive spectral estimation by means of machine learning methods. For instance, Cui et al. [9] show that adaptive changes performed by a deep neural network (DNN) to the spectra to be analyzed improve the quality of supervised spectral models.
Also, models for speech spectral envelopes play a significant role in speech synthesis, where a major problem is the oversmoothing of the reconstructed spectral envelopes [11]. In order to ameliorate this effect, restricted Boltzmann machines and deep belief networks have been proposed for modeling spectral envelopes [12]. It is also important to note that spectral envelope features can be efficiently detected by means of unsupervised methods [6].
Spectral envelopes may also be obtained from cepstral coefficients, as in the application of machine learning to speech emotion recognition in [13]. In addition, mel frequency cepstral coefficients (MFCCs) have also been reported in emotional speech synthesis [14].
In the performance evaluation of diverse speech solutions and applications [15], speech quality assessment is widely adopted. For instance, in [16], a DNN-based complex spectral mapping is proposed, and its results were evaluated using the algorithm described in ITU-T Rec. P.862 [17], [18], mostly known as PESQ. Another speech quality metric is the Virtual Speech Quality Objective Listener, known as ViSQOL [19], which uses spectral and temporal parameters to determine a listening quality objective (LQO) score on the 5-point quality scale. In connection with these applications, we propose an analysis-synthesis assessment method for the spectral models which is more suitable for evaluating their performance in action.
In this context, this work intends to improve the open-loop analytical (OLA) model using a machine learning algorithm in conjunction with several differentiable loss functions applied to the reference and reconstructed power spectral densities. The differentiable losses implemented in the S3LM architecture and used in the experimental tests were the ISD [3], [20], the KLD, the reverse KLD and the LSD. For each loss function, S3LM produces a distinctive spectral envelope model. The cross-excitation analysis-synthesis assessment (CEASA) was used to assess the fidelity of each spectral envelope model considered in the tests, and it permitted obtaining different synthesized speech signals. In summary, each spectral model is used as two filters, namely an analysis filter and a synthesis filter, which are associated in cascade. Further, a reference speech signal is put into the analysis filter and the synthesized signal comes out of the synthesis filter. Finally, in order to perform a better quality analysis, the synthesized signal is compared with the reference signal using both the PESQ and the ID [20], [21] algorithms. This procedure is carried out for all combinations of analysis and synthesis filters over all spectral model pairs generated with different losses for the spectrum of the same reference signal. In addition, two different window sizes for the speech signal are used to obtain spectral models which, beyond allowing one to analyze the impact of window length on the spectral fitting measures, also underline the need for a nonspectral assessment tool such as CEASA. This independent assessment is necessary because CEASA tends to amplify the distinction between different models and also dismisses seeming static spectral fit improvements brought about by window length changes, which turn out to be illusory.
Nowadays, different solutions based on both signal processing methods and machine learning algorithms are applied in several research areas [22], [23]. In the present work, we use signal processing techniques such as autoregressive models, prediction and perfect reconstruction in analysis-synthesis systems, which are integrated with machine learning structures to come up with tied spectral weighting layers (TSWLs). These techniques are used both in the proposed learning machine, for the layers and the losses, and in the CEASA diagnostic tool, which includes analysis-synthesis techniques based on perfect reconstruction.
It is noted that the CEASA assessment tool is intended to be used with rather high quality spectral models, since its analysis-synthesis system should operate around the perfect reconstruction condition. Further, under these conditions, PESQ-LQO is a trustworthy quality score, whose results are also corroborated by those given by the ViSQOL metric.
Addressing the issues raised above, this article presents the proposed S3LM in Section II, the most important measures for speech spectral analysis in Section III, the spectral measures used as loss functions and the comparison of the spectral models they lead to in Section IV, and the description of the CEASA method along with the results of its application to the speech spectral models in Section V. Finally, the major results of this article are drawn together in the conclusion in Section VI.

II. THE SELF-SUPERVISED SPECTRAL LEARNING MACHINE
As previously stated, we propose a learning machine whose input is a spectrogram given as a sequence of one-sided log PSDs with $K$ samples up to the Nyquist frequency for an $F_s = 16$ kHz sampling rate.
The network architecture of the proposed S3LM is composed of three tied spectral weighting layers (TSWLs), as shown in Fig. 1, with tied weight vector $w_0$ and tied bias vector $b_0$, both of the size $K$ of the PSDs, which are extended over the spectrogram for each training epoch.
The S3LM architecture performs spectral pre-processing using the TSWLs. The structure consists of artificial neurons applied to each spectral component. Rather than a fully connected network, the proposed model has a singly connected architecture with two hidden layers and the weights shared between the layers. This structure concentrates attention on each spectral bin for closer convergence within the same number of epochs while, at the same time, the strategy also brings about a reduction in the number of parameters and training time.
We will represent a single log PSD as

$$x(k) = 10 \log_{10} P(k) \qquad (2)$$

for $k = 0, 1, \cdots, K-1$, which forward propagates through the first three layers as

$$h_0 = \varphi(w_0 \circ x + b_0), \quad h_1 = \varphi(w_0 \circ h_0 + b_0), \quad h_2 = w_0 \circ h_1 + b_0 \qquad (3)$$

where $\varphi(\cdot)$ is the rectified linear unit (ReLU) activation function, $\circ$ represents the Hadamard or elementwise product, and $h_0$, $h_1$, and $h_2$ are the outputs of each weighting layer. So the modified log PSD is $h_2$, resulting in the modified PSD obtained as

$$P_2(k) = 10^{h_2(k)/10} \qquad (4)$$

for $k = 0, 1, \cdots, K-1$. And now the modified autocorrelation function is obtained by using the Wiener-Khinchin theorem [24] as

$$R(m) = \frac{1}{N} \sum_{k=0}^{N-1} P_2(k)\, e^{j 2\pi k m / N}, \quad N = 2(K-1), \qquad (5)$$

for $m = 0, 1, \ldots, p$, where $P_2$ is extended symmetrically over the full spectrum. From these autocorrelation coefficients a prediction analysis is performed, leading to the prediction coefficient vector

$$a = [1, a_1, a_2, \cdots, a_p]^T. \qquad (6)$$

For a general prediction coefficient vector $\alpha$, the square prediction error is

$$E(\alpha) = \alpha^T R\, \alpha \qquad (7)$$

where $R$ is the $(p+1) \times (p+1)$ Toeplitz reference augmented autocorrelation matrix whose entries are given by (5). For the special prediction coefficient vector $\alpha = a$, the minimum prediction error achieved is

$$E_p = a^T R\, a. \qquad (8)$$

After linear prediction analysis, the reconstructed PSD [4] is obtained as

$$\hat{P}_2(k) = \frac{E_p}{\left| \sum_{m=0}^{p} a_m\, e^{-j \pi k m/(K-1)} \right|^2} \qquad (9)$$

for $k = 0, 1, \cdots, K-1$ or, alternatively, by

$$\hat{P}_2(k) = \frac{E_p}{R_a(0) + 2 \sum_{m=1}^{p} R_a(m) \cos\!\left(\frac{\pi k m}{K-1}\right)} \qquad (10)$$

for $k = 0, 1, \cdots, K-1$, where the autocorrelation function of the linear prediction vector is

$$R_a(m) = \sum_{i=0}^{p-m} a_i\, a_{i+m}. \qquad (11)$$

Equation (9) is arguably simpler than Eq. (10) for gradient backpropagation.
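The tied-layer forward pass described above can be sketched as follows. For clarity the sketch uses numpy rather than the paper's PyTorch implementation, and the exact placement of the ReLU in the last layer is our assumption; the tied weights, tied biases and elementwise (Hadamard) products follow the text.

```python
import numpy as np

def tswl_forward(x, w0, b0):
    """Forward pass through three tied spectral weighting layers.

    x, w0, b0 are length-K vectors (x is a log PSD in dB).  The weight
    and bias vectors are shared (tied) across the layers, and every
    product is elementwise, so each spectral bin is processed
    independently -- a singly connected, not fully connected, network.
    """
    relu = lambda v: np.maximum(v, 0.0)
    h0 = relu(w0 * x + b0)    # first hidden layer
    h1 = relu(w0 * h0 + b0)   # second hidden layer
    h2 = w0 * h1 + b0         # modified log PSD (assumed linear output)
    return h2

# The modified PSD of Eq. (4) would then be: P2 = 10.0 ** (h2 / 10.0)
```

With the paper's initialization (weights at one, biases near zero), the layers start close to the identity map on nonnegative log PSDs, so training begins as a small perturbation of the input spectrum.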
Then, the log reconstructed PSD is obtained as

$$\hat{h}_2(k) = 10 \log_{10} \hat{P}_2(k) \qquad (12)$$

for $k = 0, 1, \cdots, K-1$, and either the PSD or the log PSD, $h_2$, may be used for computing the loss function according to the arguments of this function. The model is implemented using the deep learning framework PyTorch. The weights $w_0$ of S3LM are initialized to all ones while its biases $b_0$ are all initialized from samples of a zero-mean Gaussian distribution with a standard deviation $\sigma = 1 \times 10^{-4}$, and they are optimized by a stochastic gradient descent algorithm with a learning rate $\ell_r = 1 \times 10^{-4}$. Good convergence has been observed after 80 epochs.
The model was experimented with using the TIMIT Acoustic-Phonetic Continuous Speech Corpus [25]. TIMIT has 6300 utterances (10 sentences spoken by each of 630 speakers) stored as 16-bit WAV files with a sampling rate of 16 kHz. The speakers are distributed across 8 different dialect regions. The utterances in dialect regions 1 through 4 of the test set of the TIMIT dataset were selected for modeling K-sample one-sided log PSDs by self-supervised methods with K = 1025.
We used a workstation computer with 16 GB of RAM, an Intel ® Xeon ® E-2146G CPU at 3.50 GHz with 6 cores, and a single NVIDIA GPU card with 4 GB. Training and testing are simultaneous since S3LM is self-supervised.

III. MEASURES FOR SPECTRAL ANALYSIS
Based on square prediction errors, an important measure for comparing autoregressive models is the Itakura divergence (ID). For the reference augmented prediction vector $a$ and the estimated vector $\hat{a}$, the Itakura divergence [20], [21] is given by

$$\mathrm{ID}(a, \hat{a}) = \frac{\hat{a}^T R\, \hat{a}}{a^T R\, a}. \qquad (13)$$

This definition, originally called the "likelihood ratio" by Itakura [21], makes it clear that the minimum possible value for the ID is unity, corresponding to the minimum square prediction error condition and therefore coinciding with the result for open-loop linear prediction analysis. On the other hand, it has no inherent upper bound, even though a practical value of 1.4 has been mentioned as the frontier beyond which synthetic speech quality is too low to be useful [26].
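The Itakura divergence is a ratio of quadratic forms in the reference Toeplitz autocorrelation matrix, so it is a few lines of code. This is a sketch with names of our choosing; the matrix is built from its first row.

```python
import numpy as np

def itakura_divergence(a_ref, a_est, R_row):
    """Itakura 'likelihood ratio' between two autoregressive models.

    a_ref, a_est : augmented prediction vectors [1, a_1, ..., a_p].
    R_row        : first row R(0), ..., R(p) of the reference Toeplitz
                   autocorrelation matrix.
    Returns a_est^T R a_est / a_ref^T R a_ref, which is >= 1 whenever
    a_ref achieves the open-loop minimum square prediction error.
    """
    a_ref, a_est, R_row = map(np.asarray, (a_ref, a_est, R_row))
    n = len(R_row)
    # Toeplitz matrix: entry (i, j) is R(|i - j|).
    R = np.array([[R_row[abs(i - j)] for j in range(n)] for i in range(n)])
    return float((a_est @ R @ a_est) / (a_ref @ R @ a_ref))
```

For $R(0) = 1$, $R(1) = 0.5$, the optimal first-order predictor is $a = [1, -0.5]$, so evaluating the ID of that model against itself returns exactly one, while any other estimate returns a larger value.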
However, in order to be used as a loss function in comparing PSDs, the Itakura-Saito divergence (ISD) is more straightforward than the ID, and it is defined by [3], [20]

$$d_{IS}(P, Q) = \frac{1}{f_{Ny}} \int_{0}^{f_{Ny}} \left[ \frac{P(f)}{Q(f)} - \ln \frac{P(f)}{Q(f)} - 1 \right] df \qquad (14)$$

where $P(f)$ is the reference PSD, $Q(f)$ is the distorted or reconstructed PSD and $f_{Ny}$ is the Nyquist frequency. For sampled PSDs, the ISD is given by

$$d_{IS}(P, Q) = \frac{1}{K} \sum_{k=0}^{K-1} \left[ \frac{P(k)}{Q(k)} - \ln \frac{P(k)}{Q(k)} - 1 \right]. \qquad (15)$$

In Section II, where the proposed S3LM was described, the reference PSD is $P_2$ and the reconstructed PSD is $\hat{P}_2$. A more general spectral distortion measure, not conceived for measuring autoregressive spectral fit in particular, is the log-spectral distortion (LSD), which is expressed in dB as

$$\mathrm{LSD}(P, Q) = \sqrt{\frac{1}{K} \sum_{k=0}^{K-1} \left[ 10 \log_{10} \frac{P(k)}{Q(k)} \right]^2}. \qquad (16)$$

Notwithstanding their different constitutions, it is interesting to observe that both the square error and the ISD are instances of Bregman divergences [27], a class that also holds as a member the generalized Kullback-Leibler divergence (GKLD), defined by

$$d_{GKL}(p, q) = \sum_{k} \left[ p(k) \ln \frac{p(k)}{q(k)} - p(k) + q(k) \right]. \qquad (17)$$

In machine learning, it is usual to employ probability density functions (PDFs) or probability mass functions (PMFs). First, we observe that PSDs are nonnegative and, while log PSDs may take on negative values, they may be raised to 0 dB by subtracting the minimum value from the whole log-spectrum. Second, if we normalize the log PSDs so that they sum to unity, then the GKLD reduces to the KLD, the Kullback-Leibler divergence. The possibility of processing PSDs just as PDFs for KLD measures and modeling has already been pointed out in [28]. The KLD from PDF $q$ to PDF $p$ is defined as

$$D(p \,\|\, q) = \int_{S_p} p(x) \ln \frac{p(x)}{q(x)}\, dx \qquad (18)$$

as long as $S_p \subseteq S_q$, where $S_p$ and $S_q$ are, respectively, the supports of the $D$-variate PDFs $p$ and $q$, avoiding the occurrence of infinities [14] at points where $q(x) = 0$ and $p(x) > 0$ in (18). For PMFs, the KLD from $q$ to $p$ is computed as

$$D(p \,\|\, q) = \sum_{k} p(k) \ln \frac{p(k)}{q(k)} \qquad (19)$$

which represents the direct KLD as long as $p$ is a data PMF and $q$ is a latent variable PMF. By keeping these links while exchanging the positions of $p$ and $q$, we obtain the reverse KLD (RKLD) as

$$D(q \,\|\, p) = \sum_{k} q(k) \ln \frac{q(k)}{p(k)}. \qquad (20)$$
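The sampled forms of these measures translate directly into code. The following sketch, with function names of our choosing, implements the sampled ISD, the LSD in dB, and the direct and reverse KLD for PMFs:

```python
import numpy as np

def isd(P, Q):
    """Sampled Itakura-Saito divergence between PSDs P and Q."""
    r = P / Q
    return float(np.mean(r - np.log(r) - 1.0))

def lsd(P, Q):
    """Log-spectral distortion in dB between PSDs P and Q."""
    d = 10.0 * np.log10(P / Q)
    return float(np.sqrt(np.mean(d ** 2)))

def kld(p, q):
    """Direct KLD: p is the data PMF, q the latent-variable PMF."""
    return float(np.sum(p * np.log(p / q)))

def rkld(p, q):
    """Reverse KLD: the positions of p and q exchanged."""
    return kld(q, p)
```

All four vanish when the two spectra (or PMFs) coincide and are positive otherwise, which is the basic property required of a differentiable spectral loss.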

IV. DIFFERENTIABLE LOSS COMPARISONS
The open-loop analytical (OLA) analysis based on the autocorrelation method is used as a baseline for assessing the refinements brought about by the learning methods. Its objective function is a square prediction error, which is a square distance in a polynomial space provided with a time-varying inner product defined by the short-term autocorrelation function [8].
The differentiable losses that have been applied to the reference and reconstructed power spectral densities (PSDs) are the Itakura-Saito divergence (ISD), the Kullback-Leibler divergence (KLD), the reverse KLD (RKLD) and the log spectral distortion (LSD). Most of these measures are also used for assessing the reconstructed PSDs, including in addition Jeffrey's divergence (JD) [20], which provides a balance between the KLD and the RKLD as a measurement tool.
It is interesting to observe that it can be demonstrated that the minimization of the ISD with respect to the prediction coefficients is equivalent to the minimization of the square prediction error in polynomial space [4]. However, it may be argued that the path leading to the minimum may be different in an iterative approach.
Differentiable signal processing methods have made it possible to perform the short-time Fourier transform (STFT) with variable hop size windows [29]. This research has led us to discover some interesting spectral fitting differences that depend on the STFT window length.
All our PSDs have been obtained using sequences of 50% overlapping sine windows.
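The sine window at 50% overlap is a natural choice here because its square satisfies the constant-overlap-add condition, so a weighted overlap-add analysis-synthesis arrangement (which we assume is the intended use) reconstructs the signal exactly. The short check below verifies this property numerically:

```python
import numpy as np

N = 320                                   # 20 ms frame at 16 kHz
n = np.arange(N)
w = np.sin(np.pi * (n + 0.5) / N)         # sine window

# At 50% overlap, w(n)^2 in one frame plus w(n + N/2)^2 from the
# previous frame covers every sample; since sin^2 + cos^2 = 1 the
# squared windows sum exactly to one across the overlap.
overlap_sum = w[:N // 2] ** 2 + w[N // 2:] ** 2
print(np.allclose(overlap_sum, 1.0))      # True
```

This is why perturbations around the perfect reconstruction condition, as exploited by CEASA later on, are well defined for this windowing scheme.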
The performance of the various methods for female speakers and 20 ms long windows is shown in Table 1, where quality improvements (QI) are positive for a result greater than the OLA baseline when $M$ is a quality or similarity measure, and positive for a result smaller than the OLA baseline when $M$ is a divergence or distortion measure. More precisely, the quality improvement for measure $M$ is computed as

$$QI_M = \pm \frac{M(P_M, P_{ref}) - M(P_{OLA}, P_{ref})}{M(P_{OLA}, P_{ref})} \times 100\% \qquad (21)$$

where the plus sign is selected if $M$ is a quality measure and the minus sign if $M$ is a divergence measure, $P_M$ is the PSD for the model obtained with $M$ as the objective function, $P_{OLA}$ is the PSD for the model obtained by the open-loop analysis and $P_{ref}$ is the reference power spectral density. The utterances in dialect regions 1 through 4 of the test set of the TIMIT speech corpus [25] were selected for modeling K-sample one-sided log PSDs by self-supervised methods with K = 1025.
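The sign convention for the quality improvement can be encoded in a couple of lines. The helper below is our own reading of the definition above: relative change with respect to the OLA baseline, with the sign flipped for divergence measures so that an improvement is always positive.

```python
def quality_improvement(m_model, m_ola, is_quality):
    """Quality improvement (%) of a learned model over the OLA baseline.

    m_model : M(P_M, P_ref), the measure for the learned model.
    m_ola   : M(P_OLA, P_ref), the measure for the OLA baseline.
    is_quality : True for quality/similarity measures (bigger is
                 better), False for divergence/distortion measures
                 (smaller is better).
    """
    sign = 1.0 if is_quality else -1.0
    return sign * 100.0 * (m_model - m_ola) / m_ola
```

For instance, a divergence dropping from 1.0 to 0.9 registers as a +10% improvement, while a PESQ score rising from the baseline registers as positive directly.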
Using the self-supervised learning machine, several measures are alternatively used as the objective function, as listed in the leftmost column of Table 1 for female speakers and long windows; the measures appearing as headers of the remaining columns are the alternative measures used to compare the obtained PSDs with the corresponding reference PSDs. The same is shown for male speakers and long windows in Table 2, for female speakers and short windows in Table 3 and for male speakers and short windows in Table 4. However, short windows have been used only in a preliminary way for a couple of utterances.
By observing the results in the tables above, it stands out that the ISD is the only objective function that makes the learning machine improve quality under the ISD measure, which is arguably the most significant measure for speech PSD envelopes. The ISD objective function also brings about a quality improvement that is visible under the LSD measure. On the other hand, the KLD objective function, which is widely used in machine learning, consistently improves quality as measured by both the KLD and the JD; in most cases its quality improvement is also seen by the LSD, particularly in Tables 1 and 2, but it fails in Table 4.
The RKLD objective function may cause quality improvements to be detected by the KLD, the JD and the LSD measures in Table 1, even exceeding the quality improvement of the KLD as seen by itself in Tables 1 and 2, but it may also fail to have any quality improvement seen by any of those three measures as happens in Table 3. A similar behavior is displayed by the LSD objective function, which is able to cause quality improvements detectable by the KLD, the JD and the LSD measures in Table 1 and Table 2 but quality improvements fail to be seen by the KLD and the JD in Table 3. But a final analysis about these apparent shortcomings of the RKLD should be postponed till a more complete performance assessment is disclosed in Section V.
Further, the good performance of the RKLD is only partially offset by its weaker showing under the ISD measure, although it is still the best performing loss as seen by the ISD among the set that also includes the KLD and the LSD.
Finally, the LSD models behave in a rather contradictory manner, being the worst as measured by the ISD but beating the models obtained with the other losses by the greatest margin in several instances.
As a final overall observation, absolute scores are seen to improve for short spectral estimation windows when compared with long windows, particularly so when measured by the ISD. The improvement is also significant when measured by the LSD, except when the LSD itself is used as the loss function for male speakers. In this case, when the LSD is used as the loss function, the KLD and the JD also fail to register any improvement.

V. ASSESSMENT RESULTS
In order to assess the fidelity of the spectral envelope models under more neutral conditions, the cross-excitation analysis-synthesis assessment (CEASA) was used, which is depicted in Fig. 2 for the simple case involving two models, where two prediction vectors $a_1$ and $a_2$ are input from S3LM or any other modeling system for that matter. In its turn, CEASA uses the prediction vectors to build the analysis filters

$$A_1(z) = \sum_{m=0}^{p} a_{1,m}\, z^{-m} \quad \text{and} \quad A_2(z) = \sum_{m=0}^{p} a_{2,m}\, z^{-m}$$

where $p$ is the order of the models, and the synthesis filters are obtained as $H_1(z) = 1/A_1(z)$ and $H_2(z) = 1/A_2(z)$. For a given speech signal $s(n)$ and a spectral model, the speech signal is injected into the corresponding analysis filter, whose output is its residual signal, either $e_1(n)$ or $e_2(n)$, which is then injected into both synthesis filters. As a result, four synthesized signals are obtained, represented by $s_{11}(n)$, $s_{12}(n)$, $s_{21}(n)$, and $s_{22}(n)$. These synthesized signals, which provide a realization of their corresponding spectral models, are assessed by the PESQ algorithm, which provides a mean opinion score listening quality objective (MOS-LQO) measure [17], [18], and by the Itakura divergence (ID) [20], [21]. Both measures are represented by the block named Meas(·) in Fig. 2. By using each spectral model in turn for the analysis filter, two sets of measures are obtained for each synthesis filter, and the mean value of each set is ascribed to the spectral model of the corresponding synthesis filter.
The basis for the operation of CEASA analysis-synthesis filter cascade is the perfect reconstruction condition which prevails when both the analysis filter and the synthesis filter in the cascade connection are configured with the same prediction vector so that the synthesized signal will coincide with the input signal up to a time delay in the absence of numerical errors.
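The cross-excitation scheme and the perfect reconstruction condition can be sketched with standard filtering routines. This is an illustrative implementation with names of our choosing, assuming augmented prediction vectors [1, a_1, ..., a_p] as filter coefficients:

```python
import numpy as np
from scipy.signal import lfilter

def ceasa_pair(s, a1, a2):
    """Cross-excitation analysis-synthesis for two prediction vectors.

    Each residual e_i = A_i(z) s is re-synthesized through both
    synthesis filters H_j(z) = 1/A_j(z), yielding the four signals
    s_11, s_12, s_21 and s_22 of the CEASA scheme.
    """
    e1 = lfilter(a1, [1.0], s)        # analysis with model 1 (FIR A1)
    e2 = lfilter(a2, [1.0], s)        # analysis with model 2 (FIR A2)
    s11 = lfilter([1.0], a1, e1)      # matched pair: perfect reconstruction
    s12 = lfilter([1.0], a2, e1)      # cross-excited
    s21 = lfilter([1.0], a1, e2)      # cross-excited
    s22 = lfilter([1.0], a2, e2)      # matched pair: perfect reconstruction
    return s11, s12, s21, s22
```

With matched analysis and synthesis filters the cascade $H_i(z) A_i(z) = 1$, so $s_{11}$ and $s_{22}$ reproduce the input up to numerical error, while the cross-excited signals $s_{12}$ and $s_{21}$ deviate from it according to the mismatch between the two models, which is exactly what the PESQ and ID measurements then quantify.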
After investigating the application of different window lengths in spectral modeling, the divergences were found to decrease for shorter windows, as reported in Section IV. So we have decided to test the modeling algorithms with the longer 20 ms windows over dialect regions 1 through 4 of the test set of the TIMIT corpus [25] for female and male speakers, while the shorter 7.25 ms windows have been tested only for a couple of speakers on account of their CEASA scores, reported below.
In Table 5, longer windows are used for the spectral modeling of the utterances of female speakers, where the ISD displays a small but consistent performance advantage, which can also be checked for the case of shorter windows from female utterances in Table 8, as well as for male utterances in Tables 6 and 9 for longer and shorter windowing, respectively.
In order to check the reliability of the results, we have also assessed them using the ViSQOL measure [19], whose scores and attendant quality improvements for the machine learning methods over OLA are reported in Table 7, which, upon further comparison and ranking of the methods, is consistent with Tables 5 and 6.
However, if we keep to longer windows, the best performing loss is the RKLD, whether assessed by PESQ or ID. This best performance within this set of losses is hinted at by a qualitative analysis of the defining equation of the RKLD (20) in comparison with the defining equation of the KLD (19). As the weighting coefficients of the RKLD are the reconstructed masses q(k), when the RKLD is used as the loss, q(k) should converge to small values in regions where the data masses p(k) are rather small, since such regions would otherwise increase the argument of the log function unless q(k) converges to a comparably small value. This leads q(k) to place most of its probability mass near the peaks of p(k), which is a behavior valued by the ISD measure. An illustrated discussion of this convergence behavior under the minimization of the RKLD may be found in [30], where the equations for the divergences are the same as the abovementioned ones but the labels RKLD and KLD are exchanged.
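A small numerical illustration of this asymmetry follows. The PMFs below are purely illustrative, not from the paper: a sharply peaked reference p stands in for a normalized PSD, and two candidate models are compared, one keeping its mass on the peak and one leaking mass into the low-energy bins. Because the reverse KLD weights each log ratio by q(k), the leaky model is punished much more heavily under the RKLD than under the direct KLD.

```python
import numpy as np

def kld(p, q):
    """Direct KLD from q to p (reverse KLD is kld with args swapped)."""
    return float(np.sum(p * np.log(p / q)))

# Illustrative PMFs (assumed values, chosen only for the demonstration).
p = np.array([0.98, 0.01, 0.01])        # sharply peaked reference
q_peaky = np.array([0.90, 0.05, 0.05])  # model mass stays on the peak
q_leaky = np.array([0.60, 0.20, 0.20])  # model mass leaks off the peak

# Gap between the leaky and peaky models under each divergence.
rkld_gap = kld(q_leaky, p) - kld(q_peaky, p)   # reverse KLD gap
kld_gap = kld(p, q_leaky) - kld(p, q_peaky)    # direct KLD gap
print(rkld_gap > kld_gap)   # reverse KLD separates the two models more
```

The reverse KLD thus penalizes placing model mass off the spectral peaks more sharply, which is consistent with the peak-matching behavior argued above.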
Besides, as a matter of fact, shorter windows are found by CEASA to lead to lower performance than longer windows, contrary to what happens for pure spectral analysis in Section IV. This behavior is due to the dynamics of the synthesis filter in the assessment procedure. Further, it seems to indicate that shorter windows may be better for some spectral analysis tasks but longer windows are recommended for synthesis.
It may come as a surprise that the LSD models, which performed very well under all the measures but the ISD, have ranked last in both the PESQ and ID scores for long windows. This highlights the fact that spectral envelope models should be better at matching spectral peaks than overall spectral details, and this is captured more clearly by the CEASA assessment than by static spectral measures.
It is noticeable, by comparing the scores in Tables 5 and 6, that the spectral envelope models for male speakers fit their references more closely than those for female speakers. While modeling the spectral envelopes should not be affected by the local harmonic structure of the spectrum, this holds only when the density of harmonics is high enough for the spectrum to be approximately continuous. The latter is the condition for a lower-pitched speaker, usually a male speaker, which is consistent with the observation. Nonetheless, by referring again to the two tables mentioned above, we notice that the performances of the loss functions are ranked in the same order, irrespective of whether the speakers are female or male.
In short, using the PESQ scores obtained with CEASA, the RKLD loss is found to improve the performance by 1.0%, 4.0% and 19.3% with respect to the open-loop analysis, the KLD and the LSD models, respectively, while the corresponding improvements for the ISD loss are 0.1%, 3.0% and 18.2%, and the RKLD models surpass the ISD models by 1.0% on average.

VI. CONCLUSION
Spectral envelope models for speech signals have relied for quite some time on linear prediction analysis. This work proposes a refinement to open-loop analytical (OLA) models by using machine learning algorithms provided with differentiable losses. Losses that have been proposed previously in autoregressive analysis are investigated for this task, as well as popular divergences used in machine learning. Since the results obtained by spectral measures are not conclusive at first as to the most suitable losses, a quality assessment method is proposed based on the fundamental perfect reconstruction criterion for analysis-synthesis cascaded systems. Using the original speech signals and the analysis and synthesis filters from the spectral envelope models, all possible analysis-synthesis cascades are mounted in the proposed cross-excitation analysis-synthesis assessment (CEASA) method. For the whole set of signals, the reverse Kullback-Leibler divergence (RKLD) appears as the loss that most closely matches the PESQ MOS-LQO scores and the Itakura divergence (ID) estimates, with the Itakura-Saito divergence (ISD) ranking close behind in the CEASA assessment. As a by-product of these methods, shorter analysis windows have been found to lead to better spectral fitting, even though they are not the best for synthesis, as indicated by the CEASA assessment results. Future research should focus on the conception of loss functions more suitable to the task, such as perceptual losses properly adapted to the structure of the learning machine, as long as they are constrained to be differentiable with respect to the weights. Also, the measures of merit should be suited to the specific tasks in an evolution of the CEASA assessment tool.