Multi-Scale Spectral Loss Revisited

The Multi-Scale Spectral (MSS) loss is commonly used for comparing audio signals, as it provides a good trade-off between temporal and spectral resolution. However, some configuration choices, including window type and size, magnitude compression, as well as the distance between spectrograms, are often set implicitly, even though they can significantly impact the loss properties and the convergence of trained models. Particularly in the context of differentiable digital signal processing (DDSP), where learned parameters may explicitly control the frequency of synthesis components, the MSS loss often fails to provide informative gradients. The main goal of this letter is to gain a better understanding of how different configurations of the MSS loss affect this problem. As an illustrative example, we analyze the task of sinusoid frequency estimation via gradient descent to compare different configurations and their effect on the loss properties. Furthermore, we show that favorable configurations can also facilitate unsupervised training of a more complex DDSP additive synthesis autoencoder. Our results indicate that a careful configuration may benefit many applications where the MSS loss is utilized.


Simon Schwär and Meinard Müller, Fellow, IEEE
Index Terms-Audio-to-audio distances, audio synthesis, differentiable DSP, loss functions.

I. INTRODUCTION
A MULTITUDE of machine learning tasks require a loss function to compare audio signals, including many end-to-end approaches for sound, music, and speech synthesis. Spectral loss functions are among the most commonly used distances between audio signals and rely on an element-wise comparison of spectrograms, which can be computed from time-domain signals using the short-time Fourier transform (STFT). This way, signals are compared in terms of the temporal and spectral distribution of signal energy, which better correlates with human perception than, for example, the numerical similarity of waveforms [1]. The spectrogram, however, is limited by the fundamental trade-off between time and frequency resolution of the STFT and thus, without phase information, cannot achieve high temporal and spectral accuracy at the same time. This trade-off can be mitigated by comparing multiple spectrograms with different time-frequency resolutions in a combined loss function [2], [3], so that signals must conform at all these resolutions simultaneously to minimize the loss.

The authors are with the International Audio Laboratories (a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS), 91058 Erlangen, Germany (e-mail: simon.schwaer@audiolabs-erlangen.de; meinard.mueller@audiolabs-erlangen.de).
Digital Object Identifier 10.1109/LSP.2023.3333205

This concept of a Multi-Scale Spectral (MSS) loss is used extensively in the context of Differentiable Digital Signal Processing (DDSP) [4], [5]. DDSP was introduced as an umbrella term for the concept of back-propagating gradients through fixed DSP components, which allows for including domain knowledge (e.g., about the physics of a generative process) in model architectures to restrict them in a meaningful way (inductive bias).
The DDSP paradigm has recently proven useful for various tasks in audio signal processing and music information retrieval, including fundamental frequency estimation [6], musical source separation [7], as well as estimating parameters for piano [8] and singing voice synthesis [9] or artificial reverberation [10]. All these applications use a variant of the MSS loss, but in most approaches, certain DSP parameters cannot be estimated simply by comparing the target and output audio signals. In particular, the MSS loss has been shown to be highly irregular and non-convex for parameters that directly or indirectly control the frequency of tonal synthesis components [11], [12], [13]. As a result, optimization methods like stochastic gradient descent are unlikely to converge without additional means such as self-supervision (e.g., [6], [14]) or external pitch estimation (e.g., [4], [7]), which increase the overall complexity of the system. While randomizing configurations of the STFT has been proposed to improve training robustness [15], to our knowledge, no systematic analysis of the influence of different configurations on the loss behavior has been presented.
In this letter, we show that, in certain situations, the MSS loss is able to provide gradients that allow for convergence to the true frequency parameter of a sinusoid (we call these informative gradients in the following), and that a significant part of both the favorable and the unfavorable loss characteristics can be ascribed to the effects of spectral leakage. Some loss configurations may amplify unfavorable aspects, so that, e.g., the choice of window type and size, magnitude compression, or spectrogram distance can influence convergence behavior. To illustrate the differences between configurations, we consider the simple scenario of sinusoidal parameter estimation via gradient descent, where we can explicitly analyze the loss landscape for the frequency parameter. From this analysis, we derive three example configurations and compare their performance in an unsupervised training setup using a DDSP additive synthesis autoencoder [6]. These experiments provide evidence for spectral leakage as a possible underlying cause of the MSS loss' failure to provide informative gradients for frequency parameters.

II. DEFINITIONS & EXPERIMENTAL SETUP
An autoencoder consists of an encoder E : X → Z and a decoder D : Z → X, which are chosen so that D(E(x)) ≈ x for all x ∈ X. Often, E and D are NNs that jointly learn a suitable encoding and decoding by minimizing a loss function L : X × X → R between x and x̂ = D(E(x)) over a training dataset. This minimization is typically achieved by a form of stochastic gradient descent on L(x, x̂).

A. Multi-Scale Spectral Loss
If X = R^L is the space of real discrete audio signals of length L, the MSS loss is a popular choice for L(x, x̂). This loss function aggregates the distance between multiple spectrograms with specified window types, window sizes, and magnitude compressions to achieve a high temporal and spectral accuracy without enforcing phase coherence of the compared signals. Let Y_{w,N,p} ∈ R_+^{K×M} denote the compressed magnitude spectrogram of x, computed with window function w, window size N, and magnitude compression p (see (1)), and let Ŷ_{w,N,p} denote the corresponding spectrogram of x̂. With this, a generalized MSS loss can be defined as

    L_MSS(x, x̂) = Σ_{N ∈ 𝒩} Σ_{p ∈ 𝒫} d(Y_{w,N,p}, Ŷ_{w,N,p}),    (2)

where w is a window function as defined above, 𝒩 is a set of suitable window sizes, 𝒫 is a set of suitable compression functions, and d(·, ·) is a distance between two matrices. This formulation allows for many different configurations of w, 𝒩, 𝒫, and d. In Table I, we introduce a coding scheme for the configurations used in our experiments. As an example, the "original" MSS loss proposed in [4] is attained by the configuration (WH, S4, C4, D1). The small value ε in C1, C3, and C4 is used to avoid taking the logarithm of zero. The parameter γ ∈ R_+ in C2 can be used to control the strength of compression [16]. We set γ = 1 for comparability with C1.
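As a concrete illustration, the generalized MSS loss can be sketched in a few lines of NumPy. This is our own minimal sketch, not the authors' reference implementation; the function names and the example configuration shown (Hann window, log(1 + Y) compression, squared ℓ2 distance) are illustrative choices.

```python
import numpy as np

def spectrogram(x, N, window, p):
    """Compressed magnitude spectrogram: frame the signal with hop
    H = N/2 (as in the text), apply the window, take |rFFT| per frame,
    then compress element-wise with p."""
    H = N // 2
    M = (len(x) - N) // H + 1
    frames = np.stack([x[m * H : m * H + N] for m in range(M)])
    Y = np.abs(np.fft.rfft(frames * window(N), axis=-1))
    return p(Y)

def mss_loss(x, x_hat, sizes, window, p, d):
    """Generalized MSS loss: sum the distance d between spectrograms
    of x and x_hat over all window sizes in `sizes`."""
    return sum(
        d(spectrogram(x, N, window, p), spectrogram(x_hat, N, window, p))
        for N in sizes
    )

# One possible configuration (roughly WH, C2 with gamma = 1, D2):
hann = np.hanning                       # WH
c2 = lambda Y: np.log(1.0 + Y)          # C2, gamma = 1
d2 = lambda A, B: np.sum((A - B) ** 2)  # squared l2 distance (D2)
```

A configuration in the sense of Table I then corresponds to one choice of `window`, `sizes`, `p`, and `d` passed to `mss_loss`.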

B. Sinusoidal Frequency Estimation
As a defining property of a DDSP autoencoder, D becomes a fixed mapping from a latent parameter space Z to X, while only the encoder E is an NN with learnable parameters. The training objective for E given a fixed target signal x can thus be rephrased as a loss function L_x : Z → R defined by

    L_x(z) = L_MSS(x, D(z))    (3)

for an encoder output z = E(x). As opposed to a general autoencoder, the fixed properties of D for a given z considerably influence the convergence of E to a minimizer of L.
Typically (see, e.g., Section V), D is a non-trivial mapping with many control parameters which all have to be learned jointly. In the following, we consider a simpler but illustrative scenario where D is a single sinusoidal oscillator with fixed amplitude A = 1 that maps a single frequency parameter z ∈ R to an output signal x̂ with

    x̂(n) = A sin(2π z n / F_s)    (4)

for all n ∈ [0 : L − 1], where F_s = 16000 Hz is the sampling rate used in our experiments. We further assume that the target signal x is also generated according to (4), with frequency f_tgt = 1000 Hz. In this setting, we can visualize the loss landscape L_x(z) for different L_MSS as shown in Fig. 1. Particularly the original MSS loss (in black) appears to be very noisy and thus to provide uninformative gradients dL_x/dz for finding the optimal value z = f_tgt. Understanding the causes for this loss behavior and the differences between configurations in Fig. 1 is a main goal of this letter.
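The loss landscape L_x(z) of this toy scenario can be probed numerically. The following sketch is our own simplification, using a single Hann-windowed frame with log(1 + Y) compression rather than a full multi-scale configuration; it evaluates the loss on a grid of candidate frequencies z around the target.

```python
import numpy as np

FS = 16000       # sampling rate in Hz
L = 4096         # signal length in samples
F_TGT = 1000.0   # target frequency in Hz
n = np.arange(L)

def sinusoid(z):
    """Decoder D: a single oscillator with amplitude A = 1, cf. (4)."""
    return np.sin(2 * np.pi * z * n / FS)

def loss(z, N=1024):
    """Single-scale spectral loss between D(z) and the target
    (Hann window, log(1 + Y) compression, squared l2 distance)."""
    w = np.hanning(N)
    spec = lambda s: np.log1p(np.abs(np.fft.rfft(s[:N] * w)))
    return np.sum((spec(sinusoid(F_TGT)) - spec(sinusoid(z))) ** 2)

# Probe the loss landscape on a coarse grid around the target:
grid = np.arange(900.0, 1101.0, 10.0)
landscape = np.array([loss(z) for z in grid])
z_best = grid[np.argmin(landscape)]  # -> 1000.0
```

Evaluating `loss` on a much finer grid reproduces the kind of landscape plot shown in Fig. 1 for a single-scale configuration.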

III. SPECTRAL LEAKAGE
The truncation of x and x̂ in (1) leads to a "blurred" spectral representation of the sinusoids, since multiplication with a window function w in the time domain is equivalent to convolution in the frequency domain. Spectra of finite-length windows have multiple local maxima separated by zeros [17] that can be differentiated into a mainlobe (the central maximum around the sinusoid frequency up to the first zero on both sides) and sidelobes (all other local maxima). This effect of spectral leakage is illustrated in Fig. 2, showing the influence of different configuration choices on the spectrum of a windowed excerpt of x (in black) and x̂ with an arbitrarily chosen z (in light blue). Each plot depicts the discrete Fourier transform (DFT) bin frequencies as vertical lines and the DFT coefficient values as circles. It further shows the (approximate) continuous spectra of the sinusoids, illustrating that they are shifted versions of the symmetric window spectra centered around f_tgt and z. The DFT bin frequencies form a "rigid sampling grid" on the frequency axis, so that all coefficient values change when the window spectrum is shifted relative to the grid. In our example, f_tgt is equal to a DFT bin frequency, so that most DFT coefficients coincide with a zero of the window spectrum, while z lies between two bins and the coefficients are non-zero. This can lead to large numerical differences between the two DFT spectra, especially when sidelobes are prominent.
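This effect can be reproduced directly. The sketch below, our own illustration, compares the worst case of an off-bin sinusoid under a rectangular and a flat-top window; the flat-top coefficients used here are one common five-term definition (e.g., as used by SciPy), which is an assumption, since the letter does not specify the exact flat-top variant.

```python
import numpy as np

FS, N = 16000, 1024
n = np.arange(N)
df = FS / N  # DFT bin spacing: 15.625 Hz

# One common 5-term flat-top window (coefficients as in SciPy's flattop):
a = [0.21557895, 0.41663158, 0.277263158, 0.083578947, 0.006947368]
flattop = sum((-1) ** k * a[k] * np.cos(2 * np.pi * k * n / N) for k in range(5))
rect = np.ones(N)

def sidelobe_ratio(window, f):
    """Highest DFT magnitude outside the mainlobe region (peak bin +/- 6),
    relative to the peak magnitude, for a windowed sinusoid at f Hz."""
    X = np.abs(np.fft.rfft(np.sin(2 * np.pi * f * n / FS) * window))
    k0 = int(np.argmax(X))
    mask = np.ones(len(X), bool)
    mask[max(0, k0 - 6) : k0 + 7] = False
    return X[mask].max() / X[k0]

f_off = 64.5 * df  # worst case: halfway between two DFT bin frequencies
# The rectangular window leaves strong sidelobes across the whole spectrum,
# while the flat-top window suppresses them to a numerically negligible level:
r_rect = sidelobe_ratio(rect, f_off)
r_flat = sidelobe_ratio(flattop, f_off)
```

Sweeping `f` from on-bin (an integer multiple of `df`) to half-bin offsets shows the periodic change in leakage that the loss landscapes in Fig. 3 inherit.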

IV. SINUSOID FREQUENCY ESTIMATION
Without any blurring of the sinusoid spectra, L_x(z) would not provide informative gradients at all. In our example, dL_x/dz would be zero when the spectral peaks do not overlap, since L_MSS depends on the element-wise difference between spectra (i.e., the vertical distance between circles in Fig. 2). However, an ideal "kernel" for blurring has one unique maximum, since in this case L_x(z) can only be reduced when z moves closer to f_tgt and not by other changes of z relative to the DFT bins. In other words, when choosing a suitable L_MSS configuration for frequency estimation via gradient descent, we aim for a wide mainlobe and numerically negligible sidelobes. Fig. 3 illustrates L_x(z) with different configurations for L_MSS, which we discuss in the following.

A. Window Type
The window spectra of three different window types, Rectangular (WR), Hann (WH), and Flat Top (WF), are compared in Fig. 2(a) with an otherwise fixed configuration (S3, C2, D2). Many considerations influence the choice of window in practice [17, Ch. 5.3.3]: for example, narrower windows have a higher sensitivity (i.e., a better ability to detect sinusoids in noise), while wider windows have a better dynamic range (i.e., sidelobes of a loud sinusoid are less likely to mask a weaker sinusoid). For frequency estimation, the sidelobe level is a central property that influences the behavior of L_x(z). Using WR (and to a lesser extent WH) leads to strong periodic fluctuations of L_x(z) with a period of F_s/N (see Fig. 3(a)), due to the changes in spectral leakage depending on the relative value of z compared to the DFT bin frequencies. The low sidelobe levels of WF result in a smoother loss landscape with a unique local minimum at z = f_tgt.

B. Window Size
The width of mainlobe and sidelobes is also influenced by the window size, as illustrated in Fig. 2(b), using fixed (WH, C0, D2). The choice of a suitable set of window sizes was originally motivated by the resolution of amplitude comparisons, where a small window size leads to a better time resolution and a worse frequency resolution. For frequency estimation, the width of the locally convex sections depends on the maximal difference between f_tgt and z for which the mainlobes still overlap, so that the convex section is wider for smaller window sizes. Larger windows, conversely, increase the loss' ability to discriminate sinusoids with similar frequencies.
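The scaling of the mainlobe with window size can be verified numerically. The following sketch, our own illustration, locates the first null of a Hann window's magnitude spectrum via a zero-padded FFT; for a Hann window, this half-width is approximately 2 F_s / N, so halving N doubles the width of the region where mainlobes can overlap.

```python
import numpy as np

FS = 16000  # Hz

def mainlobe_halfwidth_hz(N, pad=64):
    """Walk down the Hann window's magnitude spectrum (zero-padded by
    `pad` for fine frequency resolution) until the first null and
    return its frequency in Hz (~ 2 * FS / N for a Hann window)."""
    W = np.abs(np.fft.rfft(np.hanning(N), n=pad * N))
    k = 1
    while W[k] < W[k - 1]:  # descend from the peak to the first null
        k += 1
    return (k - 1) * FS / (pad * N)

# A 512-sample window has an 8x wider mainlobe than a 4096-sample one,
# so its locally convex region around f_tgt is correspondingly wider:
w_small = mainlobe_halfwidth_hz(512)
w_large = mainlobe_halfwidth_hz(4096)
```

The same measurement applied to other window types (e.g., flat-top) shows their wider mainlobes at equal N, connecting this subsection to the window-type discussion above.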

C. Magnitude Compression
Compression enables a comparison of magnitudes over a wide dynamic range and is often perceptually motivated. However, it may also exacerbate the problem of periodically changing spectral leakage by decreasing the relative difference between mainlobe and sidelobe levels (see Fig. 2(c)). The simple logarithmic compression (C1) as used in [4] leads to large negative values at the zeros of the window spectrum, bounded below by log(ε). This amplifies the periodic behavior of L_x(z), as shown in Fig. 3(c) with fixed (WF, S3, D2). To mitigate this issue, we propose to replace ε with 1 (C2), so that the compressed value is always greater than or equal to 0. A factor γ ≥ 0 can further be chosen to adjust the compression strength.
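The numerical difference between the two compressions is easy to demonstrate. In this sketch (our own; the value of ε and the example coefficient magnitudes are illustrative assumptions), a near-zero coefficient at a window-spectrum null dominates the range of the C1-compressed spectrum, while under C2 it stays close to zero.

```python
import numpy as np

EPS, GAMMA = 1e-7, 1.0
c1 = lambda Y: np.log(Y + EPS)        # C1: large negative values, floor log(EPS)
c2 = lambda Y: np.log(1 + GAMMA * Y)  # C2: bounded below by 0

# Illustrative coefficient values: near a window-spectrum null vs. on
# the mainlobe of a sinusoid:
null, mainlobe = 1e-9, 100.0

# Under C1 the null dominates the numerical range of the compressed
# spectrum; under C2 it is negligible:
range_c1 = c1(mainlobe) - c1(null)  # ~ 20.7
range_c2 = c2(mainlobe) - c2(null)  # ~ 4.6
```

Because the null's compressed value under C1 changes drastically as leakage shifts relative to the DFT grid, the element-wise distance inherits this fluctuation, which is exactly the periodic behavior visible in Fig. 3(c).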

D. Spectrum Norm
We compare two distances between the spectrogram matrices, the ℓ1 norm (D1) and the squared ℓ2 norm (D2), in Fig. 3(d) with fixed (WH, S3, C0). Other considerations like outlier sensitivity are often relevant for choosing a distance, but in our setting, D1 slightly amplifies the periodic fluctuations in L_x(z).

E. Considered MSS Loss Configurations
Fig. 1 shows L_x(z) with three different configurations for L_MSS, which we chose not to represent a "best" configuration, but to illustrate how different choices affect the loss landscape. In addition to this qualitative comparison, Table II shows the Gradient-Sign Ranking Accuracy (GRA) [11] for these loss landscapes, specifying how often on average L_x(z) decreases when moving towards a random f_tgt by c cents from a random initial frequency z_0. While it is not an analytic evaluation of the loss, a larger GRA indicates that L_x(z) tends to provide informative gradients, while a value of 0.5 suggests that changes in L_x(z) are random. We calculate the GRA for step sizes c of 0.3, 3, 30, and 300 cents with all other settings as in [11]. The original MSS results in a GRA near 0.5 for smaller step sizes, so that gradient descent algorithms are unlikely to converge. In fact, for noise-free signals, this loss behavior is entirely a result of spectral leakage at different window sizes, amplified by the logarithmic compression C1. The modified Hann MSS uses D2 and S5, where all window sizes are prime instead of powers of two. This way, z does not coincide with the DFT bin frequencies of multiple window sizes at the same time, which reduces the amplitude of the fluctuations in L_x(z). The modified Hann MSS achieves a high GRA for c = 300 cents, while some artifacts still impact the GRA for smaller step sizes. The smooth MSS results in the fewest spectral leakage artifacts by using WF and C2, while also having the widest mainlobe overlaps. The high GRA values for small step sizes down to 0.3 cents indicate that this configuration also approximates local convexity with a unique minimum in these synthetic conditions. However, performance decreases for the largest step size due to vanishing gradients when the mainlobes do not overlap.
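The GRA can be estimated by simple Monte-Carlo sampling. The sketch below is our own simplified version: it uses a single-scale Hann/log(1 + Y)/squared-ℓ2 loss instead of the full configurations of Table II, and the sampling ranges for f_tgt and z_0 are our assumptions, not the exact protocol of [11].

```python
import numpy as np

rng = np.random.default_rng(0)
FS, N = 16000, 1024
n = np.arange(N)
w = np.hanning(N)

def loss(f_tgt, z):
    """Single-frame spectral loss (Hann window, log(1 + Y), squared l2)."""
    spec = lambda f: np.log1p(np.abs(np.fft.rfft(np.sin(2 * np.pi * f * n / FS) * w)))
    return np.sum((spec(f_tgt) - spec(z)) ** 2)

def gra(c_cents, trials=500):
    """Fraction of random (f_tgt, z0) pairs for which a step of c cents
    from z0 towards f_tgt decreases the loss (simplified GRA estimate)."""
    hits = 0
    for _ in range(trials):
        f_tgt = rng.uniform(100.0, 4000.0)
        z0 = f_tgt * 2.0 ** rng.uniform(-1.0, 1.0)  # start within +/- 1 octave
        step = 2.0 ** (c_cents / 1200.0)
        # step towards the target without overshooting it:
        z1 = min(z0 * step, f_tgt) if z0 < f_tgt else max(z0 / step, f_tgt)
        hits += loss(f_tgt, z1) < loss(f_tgt, z0)
    return hits / trials
```

Running `gra` for several step sizes and loss configurations reproduces the kind of comparison summarized in Table II.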

V. UNSUPERVISED DDSP AUTOENCODER
The simple scenario above does not consider other influences on the loss, like noise or mixed sinusoids with varying amplitudes. To investigate differences between the configurations from Table II in a more practically relevant scenario, we repeat a DDSP autoencoder experiment using the encoder (F_θ^sin) and sinusoidal synthesizer (S_sin) from [6], trained with the NSynth dataset [18], which consists of harmonic single notes with natural background noise and transients. The task of F_θ^sin is to estimate time-varying parameters for 100 sinusoids that are then synthesized by S_sin. Instead of relying on self-supervision as in [6], we use only a reconstruction loss (L_MSS with the respective configuration), resulting in fully unsupervised training. For comparable loss magnitudes, we multiply each L_MSS with an empirically estimated constant. The training loss with the different L_MSS configurations is shown in Fig. 4. With all other settings as in [6], only the smooth MSS leads to consistent convergence of F_θ^sin. To evaluate whether the model also learns to estimate meaningful parameters, we conduct two experiments. First, we create 1000 synthetic test signals of one second length with a random constant fundamental frequency (F0) between 30 and 800 Hz and 10 integer harmonics with random amplitudes. F_θ^sin trained on NSynth with the smooth MSS estimates the true frequencies with a mean error of 27±35 cents (original: 838±647 cents, mod. Hann: 1467±932 cents). Second, we evaluate the similarity between the output of S_sin and an input signal from MDB-melody-synth [19]. For this, we estimate the F0 of the output signal using CREPE [20] and compare it with the reference F0 from the dataset. Since CREPE relies on salient frequency components in its input, it would yield dissimilar F0 estimates if the output signal contained strong erroneous components. The raw pitch accuracy [21] for this comparison is 0.81±0.09 for F_θ^sin trained with the smooth MSS (original: 0.01±0.03, mod. Hann: 0.27±0.15). While these preliminary experiments suggest that differences between configurations are also relevant for complex scenarios, further research is needed to fully understand the loss behavior in practice.

VI. CONCLUSION
Our results indicate that the properties of the MSS loss for frequency estimation considerably depend on the numerical relation between mainlobe and sidelobes in the compared spectra.

Fig. 2. Influence of (a) window type, (b) window size, and (c) magnitude compression on the spectrum of x (black) and x̂ (light blue). Circles denote DFT coefficients and lines the approximate continuous spectra. In (a) and (b), some coefficient values are below the y-axis range, indicated by half circles. The default configuration is (WH, S3, C3) unless otherwise specified.

Fig. 4. Training loss for F_θ^sin with the original MSS (black), mod. Hann MSS (orange), and smooth MSS (red), with mean and variance of three runs each.

TABLE I
CONSIDERED MSS LOSS CONFIGURATION CHOICES

Let

    Y_{w,N,p}(k, m) = p(|Σ_{n=0}^{N−1} w(n) x(mH + n) e^{−2πikn/N}|)    (1)

be the spectrogram of x, where N ∈ ℕ is the window size in samples, H ∈ ℕ is the hop size in samples (we set H = N/2 throughout all experiments for simplicity, but generally, arbitrary hop sizes can be used), w ∈ R^N is a discrete window function, and p : R_+ → R is an (optional) magnitude compression function. Each time-frequency coefficient can be accessed with the time index m ∈ [0 : M − 1] with M = L/H, assuming for simplicity that L is a multiple of H, and frequency index k ∈ [0 : K − 1] with K = ⌈(N + 1)/2⌉, discarding negative frequencies since spectra of real signals are symmetric. To further simplify notation, we consider Y_{w,N,p} ∈ R_+^{K×M} to be a matrix. Analogously, we denote the spectrogram matrix of x̂ by Ŷ_{w,N,p}.

TABLE II
GRADIENT-SIGN RANKING ACCURACY FOR SELECTED CONFIGURATIONS