SPICE: Self-Supervised Pitch Estimation

We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate pitch without any form of supervision. The key observation is that pitch shift maps to a simple translation when the audio signal is analysed through the lens of the constant-Q transform (CQT). We design a self-supervised task by feeding two shifted slices of the CQT to the same convolutional encoder, and require that the difference in the outputs is proportional to the corresponding difference in pitch. In addition, we introduce a small model head on top of the encoder, which is able to determine the confidence of the pitch estimate, so as to distinguish between voiced and unvoiced audio. Our results show that the proposed method is able to estimate pitch at a level of accuracy comparable to fully supervised models, both on clean and noisy audio samples, although it does not require access to large labeled datasets.


I. INTRODUCTION
Pitch represents the perceptual property of sound that allows ordering based on frequency, i.e., distinguishing between high and low sounds. For example, our auditory system is able to recognize a melody by tracking the relative pitch differences along time. Pitch is often confused with the fundamental frequency ($f_0$), i.e., the frequency of the lowest harmonic. However, the former is a perceptual property, while the latter is a physical property of the underlying audio signal. Despite this important difference, outside the field of psychoacoustics pitch and fundamental frequency are often used interchangeably, and we will not make an explicit distinction within the scope of this paper. A comprehensive treatment of the psychoacoustic aspects of pitch perception is given in [1].
Pitch estimation in monophonic audio has received a great deal of attention over the past decades, due to its central importance in several domains, ranging from music information retrieval to speech analysis. Traditionally, simple signal processing pipelines were proposed, working either in the time domain [2], [3], [4], [5], in the frequency domain [6] or both [7], [8], often followed by post-processing algorithms to smooth the pitch trajectories [9], [10].
Until recently, machine learning methods had not been able to outperform hand-crafted signal processing pipelines targeting pitch estimation. This was due to the lack of annotated data, which is particularly tedious and difficult to obtain at the temporal and frequency resolution required to train fully supervised models. To overcome these limitations, a synthetically generated dataset was proposed in [11], obtained by re-synthesizing monophonic music tracks while setting the fundamental frequency to the target ground truth. Using this training data, the CREPE algorithm [12] was able to achieve state-of-the-art results when evaluated on the same dataset, outperforming signal processing baselines, especially under noisy conditions.
All authors are with Google Research. A shortened version of this manuscript is under review at ICASSP 2020.
In this paper we address the problem of lack of annotated data from a different angle. Specifically, we rely on self-supervision, i.e., we define an auxiliary task (also known as a pretext task) which can be learned in a completely unsupervised way. To devise this task, we started from the observation that for humans, including professional musicians, it is typically much easier to estimate relative pitch, related to the frequency interval between two notes, than absolute pitch, related to the actual fundamental frequency [13]. Therefore, we design SPICE (Self-supervised PItCh Estimation) to solve a similar task. More precisely, our network architecture consists of a convolutional encoder which produces a single scalar embedding. We aim at learning a model that linearly maps this scalar value to pitch, when the latter is expressed in a logarithmic scale, i.e., in units of semitones of an equally tempered chromatic scale. To do this, we feed two versions of the same signal to the encoder, one being a pitch-shifted version of the other by a random but known amount. Then, we devise a loss function that forces the difference between the scalar embeddings to be proportional to the known difference in pitch. For convenience, we perform pitch shifting in the domain defined by the constant-Q transform, because this corresponds to a simple translation along the log-spaced frequency axis. Upon convergence, the model is able to estimate relative pitch. To translate this output to an absolute pitch scale we apply a simple calibration step against ground truth data. Since we only need to estimate a single scalar offset, a very small annotated dataset can be used for this purpose.
Another important aspect of pitch estimation is determining whether the underlying signal is voiced or unvoiced. Instead of relying on handcrafted thresholding mechanisms, we augment the model in such a way that it can learn the level of confidence of the pitch estimation. Namely, we add a simple fully connected layer that receives as input the penultimate layer of the encoder and produces a second scalar value which is trained to match the pitch estimation error.
As an illustration, Figure 1 shows the CQT frames of one of the evaluation datasets (MIR-1k [14]), which are considered to be voiced and sorted by the pitch estimated by SPICE.
In summary, this paper makes the following key contributions:
• We propose a self-supervised (relative) pitch estimation model, which can be trained without having access to any labelled dataset.
• We incorporate a self-supervised mechanism to estimate the confidence of the pitch estimation, which can be directly used for voicing detection.
• We evaluate our model against two publicly available monophonic datasets and show that in both cases we outperform handcrafted baselines, while matching the level of accuracy attained by CREPE, despite having no access to ground truth labels.
• We also train and evaluate our model in noisy conditions, where background music is present in addition to monophonic singing, and show that in this case too we match the level of accuracy obtained by CREPE.
The rest of this paper is organized as follows. Section II contrasts the proposed method against the existing literature. Section III illustrates the proposed method, which is evaluated in Section IV. Conclusions and future remarks are discussed in Section V.

II. RELATED WORK
Pitch estimation: Traditional pitch estimation algorithms are based on hand-crafted signal processing pipelines, working in the time and/or frequency domain. The most common time-domain methods are based on the analysis of local maxima of the auto-correlation function (ACF) [2]. These approaches are known to be prone to octave errors, because the peaks of the ACF repeat at different lags. Therefore, several methods were introduced to be more robust to such errors, including, e.g., the PRAAT [3] and RAPT [4] algorithms. An alternative approach is pursued by the YIN algorithm [5], which looks for the local minima of the Normalized Mean Difference Function (NMDF), to avoid octave errors caused by signal amplitude changes. Different frequency-domain methods were also proposed, based, e.g., on spectral peak picking [15] or template matching with the spectrum of a sawtooth waveform [6]. Other approaches combine both time-domain and frequency-domain processing, like the Aurora algorithm [7] and the nearly defect-free F0 estimation algorithm [8]. Comparative analyses including most of the aforementioned approaches have been conducted on speech [16], [17], singing voices [18] and musical instruments [19]. Machine learning models for pitch estimation in speech were proposed in [20], [21]. The method in [20] first extracts hand-crafted spectral domain features, and then adopts a neural network (either a multi-layer perceptron or a recurrent neural network) to compute the estimated pitch. In [21], the consensus of other pitch trackers is used to obtain ground truth, and a multi-layer perceptron classifier is trained on the principal components of the autocorrelations of subbands from an auditory filterbank. More recently the CREPE [12] model was proposed, an end-to-end convolutional neural network which consumes audio directly in the time domain. The network is trained in a fully supervised fashion, minimizing the cross-entropy loss between the ground truth pitch annotations and the output of the model. In our experiments, we compare our results with CREPE, which is the current state-of-the-art.
Pitch confidence estimation: Most of the aforementioned methods also provide a voiced/unvoiced decision, often based on heuristic thresholds applied to hand-crafted features. However, the confidence of the estimated pitch in the voiced case is seldom provided. A few exceptions are CREPE [12], which produces a confidence score computed from the activations of the last layer of the model, and [22], which directly addresses this problem by training a neural network based on handcrafted features to estimate the confidence of the estimated pitch. In contrast, in our work we explicitly augment the proposed model with a head aimed at estimating confidence in a fully unsupervised way.
Pitch tracking and polyphonic audio: Often, post-processing is applied to raw pitch estimates to smoothly track pitch contours over time. For example, [23] applies Kalman filtering to smooth the output of a hybrid spectro-temporal autocorrelation method, while the pYIN algorithm [9] builds on top of YIN by applying Viterbi decoding to a sequence of soft pitch candidates. A similar smoothing algorithm is also used in the publicly released version of CREPE [12]. Pitch extraction in the case of polyphonic audio remains an open research problem [24]. In this case, pitch tracking is even more important to be able to distinguish the different melody lines [10]. A machine learning model targeting the estimation of multiple fundamental frequencies, melody, vocal and bass line was recently proposed in [25].
Self-supervised learning: The widespread success of fully supervised models was stimulated by the availability of annotated datasets. In those cases in which labels are scarce or simply not available, self-supervised learning has emerged as a promising approach for pre-training deep convolutional networks both for vision [26], [27], [28] and audio-related tasks [29], [30], [31]. Somewhat related to our paper are those methods that try to use self-supervision to obtain point disparities between pairs of images [32], where shifts in the spatial domain play the role of shifts in the log-frequency domain.

Audio frontend
The proposed pitch estimation model receives as input an audio track of arbitrary length and produces as output a time series of estimated pitch frequencies, together with an indication of the confidence of the estimates. The latter is used to discriminate between unvoiced frames, in which pitch is not well defined, and voiced frames.
To better illustrate our method, let us first introduce a continuous-time model of an ideal harmonic signal, that is:

$$x(t) = \sum_{k=1}^{K} a_k \sin(2\pi f_k t + \phi_k), \tag{1}$$

where $a_k$ and $\phi_k$ are the amplitude and phase of the $k$-th harmonic, $f_1 = f_0$ denotes the fundamental frequency and $f_k = k f_0$, $k = 2, \ldots, K$, its higher order harmonics. The modulus of the Fourier transform is given by

$$|X(f)| = \sum_{k=1}^{K} \frac{a_k}{2} \left[ \delta(f - f_k) + \delta(f + f_k) \right], \tag{2}$$

where $\delta$ is the Dirac delta function. Therefore, the modulus consists of spectral peaks at integer multiples of the fundamental frequency $f_0$. When the signal is pitch-shifted by a factor of $\alpha$, these spectral peaks move to $\tilde{f}_k = \alpha f_k$. If we apply a logarithmic transformation to the frequency axis, $\log \tilde{f}_k = \log \alpha + \log f_k$, i.e., pitch-shifting results in a simple translation in the log-frequency domain. This very simple and well known result is at the core of the proposed model. Namely, we preprocess the input audio track with a frontend that computes the constant-Q transform (CQT).
In the CQT domain, frequency bins are logarithmically spaced, as the center frequencies obey the following relationship:

$$f_k = f_{base} \, 2^{(k-1)/B}, \quad k = 1, \ldots, F_{max}, \tag{3}$$

where $f_{base}$ is the frequency of the lowest frequency bin, $B$ is the number of bins per octave, and $F_{max}$ is the number of frequency bins. Given an input audio track, the CQT produces a matrix $X$ of size $T \times F_{max}$, where $T$ depends on the selected hop length. Because the frequency bins are logarithmically spaced, pitch-shifting the input audio track by a factor $\alpha$ results in a translation of $\Delta k = B \log_2 \alpha$ bins in the CQT domain.
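As a small numerical illustration of this translation property, the sketch below computes log-spaced center frequencies and the bin shift caused by a pitch shift. It assumes the usual geometric spacing of CQT bins (here indexed from 0), and the function names are hypothetical; it is not taken from the authors' implementation.

```python
import math

def cqt_center_frequencies(f_base, bins_per_octave, n_bins):
    """Center frequencies of log-spaced bins: f_base * 2**(k / B), k = 0..n_bins-1."""
    return [f_base * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def shift_in_bins(alpha, bins_per_octave):
    """Translation (in bins) produced by pitch-shifting by a factor alpha."""
    return bins_per_octave * math.log2(alpha)

# Values of f_base, B and the number of bins follow the experimental section.
freqs = cqt_center_frequencies(f_base=32.70, bins_per_octave=24, n_bins=190)

# Shifting up by one semitone multiplies every frequency by 2**(1/12).
semitone_up = 2.0 ** (1.0 / 12.0)
delta_k = shift_in_bins(semitone_up, bins_per_octave=24)
print(round(delta_k, 6))  # one semitone corresponds to two half-semitone bins
```

With 24 bins per octave, each bin spans half a semitone, so a one-octave shift corresponds to 24 bins regardless of the starting pitch.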

Pitch estimation
The proposed model architecture is illustrated in Figure 2. Starting from the observation above, the model computes the modulus of the CQT $|X|$, and from each temporal frame $t = 1, \ldots, T$ (where $T$ is equal to the batch size during training) it extracts two random slices $x_{t,1}, x_{t,2} \in \mathbb{R}^F$, spanning the range of CQT bins $[k_{t,i}, k_{t,i} + F]$, $i = 1, 2$, where $F$ is the number of CQT bins in the slice and the offsets are sampled from a uniform distribution, i.e., $k_{t,i} \sim U(k_{min}, k_{max})$. Then, each vector is fed to the same encoder to produce a single scalar $y_{t,i} = \mathrm{Enc}(x_{t,i}) \in \mathbb{R}$. The encoder is a neural network with $L$ convolutional layers followed by two fully-connected layers. Further details about the model architecture are provided in Section IV.
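We design our main loss in such a way that $y_{t,i}$ is encouraged to encode pitch. First, we define the relative pitch error as

$$e_t = \left| (y_{t,1} - y_{t,2}) - \sigma (k_{t,1} - k_{t,2}) \right|. \tag{4}$$

Then, the loss is defined as the Huber norm of the pitch error, that is:

$$L_{pitch} = \frac{1}{T} \sum_{t} h(e_t), \tag{5}$$

where:

$$h(x) = \begin{cases} \dfrac{x^2}{2}, & |x| \le \tau, \\ \dfrac{\tau^2}{2} + \tau (|x| - \tau), & \text{otherwise}. \end{cases} \tag{6}$$

The pitch difference scaling factor $\sigma$ is adjusted in such a way that $y_t \in [0, 1]$ when pitch is in the range $[f_{min}, f_{max}]$, namely:

$$\sigma = \frac{1}{B \log_2 (f_{max} / f_{min})}. \tag{7}$$

The values of $f_{min}$ and $f_{max}$ are determined based on the range of pitch frequencies spanned by the training set. In our experiments we found that the Huber loss makes the model less sensitive to the presence of unvoiced frames in the training dataset, for which the relative pitch error can be large, as pitch is not well defined in this case.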
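The pitch loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the relative pitch error is taken to be the absolute difference between the output difference and the scaled offset difference, the sign convention is assumed, and the frequency range used to derive the scaling factor is hypothetical. Function names are illustrative only.

```python
import math

def huber(x, tau):
    """Huber function: quadratic for |x| <= tau, linear beyond."""
    a = abs(x)
    if a <= tau:
        return 0.5 * a * a
    return 0.5 * tau * tau + tau * (a - tau)

def pitch_scale(bins_per_octave, f_min, f_max):
    """Scaling sigma such that the range [f_min, f_max] spans one unit of output."""
    return 1.0 / (bins_per_octave * math.log2(f_max / f_min))

def pitch_loss(y1, y2, k1, k2, sigma, tau):
    """Mean Huber loss of the relative pitch error over a batch of frames."""
    errors = [abs((a - b) - sigma * (p - q))
              for a, b, p, q in zip(y1, y2, k1, k2)]
    return sum(huber(e, tau) for e in errors) / len(errors)

# Hypothetical pitch range of four octaves; tau = 0.25 * sigma as in Section IV.
sigma = pitch_scale(bins_per_octave=24, f_min=55.0, f_max=880.0)
tau = 0.25 * sigma

# Slice offsets for three frames, and encoder outputs that are exactly
# consistent with those offsets, so the loss should be (numerically) zero.
k1, k2 = [8, 12, 16], [16, 9, 10]
y2 = [0.40, 0.55, 0.30]
y1 = [y + sigma * (p - q) for y, p, q in zip(y2, k1, k2)]
print(pitch_loss(y1, y2, k1, k2, sigma, tau))
```

Because the Huber function grows only linearly beyond the threshold, frames with a very large error (for example unvoiced frames) contribute a bounded gradient.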
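In addition to $L_{pitch}$, we also use the following reconstruction loss

$$L_{recon} = \frac{1}{T} \sum_{t} \left( \| x_{t,1} - \hat{x}_{t,1} \|_2^2 + \| x_{t,2} - \hat{x}_{t,2} \|_2^2 \right), \tag{8}$$

where $\hat{x}_{t,i}$, $i = 1, 2$, is a reconstruction of the input frame obtained by feeding $y_{t,i}$ into a decoder, $\hat{x}_{t,i} = \mathrm{Dec}(y_{t,i})$. Therefore, the overall loss is defined as:

$$L = w_{pitch} L_{pitch} + w_{recon} L_{recon}, \tag{9}$$

where $w_{pitch}$ and $w_{recon}$ are scalar weights that determine the relative importance assigned to the two loss components.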
Given the way it is designed, the proposed model can only estimate relative pitch differences. The absolute pitch of an input frame is obtained by applying an affine mapping:

$$\hat{p}_{0,t} = b + s \cdot y_t, \tag{10}$$

which depends on two parameters. We consider two cases: estimating only the intercept $b$ while setting $s = 1/\sigma$, or estimating both the intercept $b$ and the slope $s$. This is the only place where our method requires access to ground truth labels. However, we can observe that: i) only very few labelled samples are needed, as only one or two parameters need to be estimated; ii) synthetically generated labelled samples could be used for this purpose; iii) some applications (e.g., matching melodies played at different keys) might require only relative pitch. Section IV provides further details on the robustness of the calibration process.
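The two calibration variants can be sketched as follows. The paper does not state which estimator is used, so ordinary least squares is an assumption here, and the data and function names are hypothetical.

```python
def fit_intercept(y, p, slope):
    """Least-squares intercept b for p ~ b + slope * y with the slope held fixed."""
    return sum(pi - slope * yi for yi, pi in zip(y, p)) / len(y)

def fit_affine(y, p):
    """Ordinary least-squares fit of both the intercept b and the slope s."""
    n = len(y)
    mean_y, mean_p = sum(y) / n, sum(p) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    cov = sum((yi - mean_y) * (pi - mean_p) for yi, pi in zip(y, p))
    s = cov / var_y
    return mean_p - s * mean_y, s

# Hypothetical encoder outputs and noise-free reference pitch values.
y = [0.10, 0.25, 0.40, 0.70]
p = [20.0 + 96.0 * v for v in y]

b_only = fit_intercept(y, p, slope=96.0)  # case 1: slope fixed, fit b
b, s = fit_affine(y, p)                   # case 2: fit both b and s
print(b_only, b, s)
```

Fixing the slope leaves a single parameter to estimate, which is why very few labelled frames are enough in that case.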
Note that pitch in (10) is expressed in semitones and it can be converted to frequency (in Hz) by:

$$\hat{f}_{0,t} = f_{base} \, 2^{\hat{p}_{0,t} / 12}. \tag{11}$$

Confidence estimation
In addition to the estimated pitch $\hat{p}_{0,t}$, we design our model such that it also produces a confidence level $c_t \in [0, 1]$. Indeed, when the input audio is voiced we expect to produce high confidence estimates, while when it is unvoiced pitch is not well defined and the output confidence should be low.
To achieve this, we design the encoder architecture to have two heads on top of the convolutional layers, as illustrated in Figure 2. The first head consists of two fully-connected layers and produces the pitch estimate $y_t$. The second head consists of a single fully-connected layer and produces the confidence level $c_t$. To train the latter, we add the following loss:

$$L_{conf} = \frac{1}{T} \sum_{t} \left| (1 - c_t) - \frac{e_t}{\sigma} \right|^2. \tag{12}$$

This way the model will produce high confidence $c_t \sim 1$ when the model is able to correctly estimate the pitch difference between the two input slices. At the same time, given that our primary goal is to accurately estimate pitch, during the backpropagation step we stop the gradients so that $L_{conf}$ only influences the training of the confidence head and does not affect the other layers of the encoder architecture.
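A minimal sketch of the confidence objective is given below. It assumes the loss penalises the squared gap between $1 - c_t$ and the pitch error normalised by $\sigma$; this form and the numeric values are assumptions for illustration. Plain Python has no gradients, so the stop-gradient is only indicated in a comment.

```python
def confidence_loss(confidences, errors, sigma):
    """Mean squared gap between (1 - c_t) and the normalised pitch error e_t / sigma.

    In an autodiff framework the gradient would be stopped so that this loss
    only updates the confidence head and leaves the rest of the encoder alone.
    """
    terms = [((1.0 - c) - e / sigma) ** 2 for c, e in zip(confidences, errors)]
    return sum(terms) / len(terms)

sigma = 1.0 / 96.0                   # hypothetical scaling factor
errors = [0.0, 0.5 * sigma, sigma]   # relative pitch errors for three frames

# Confidence values that exactly match the normalised errors give zero loss.
ideal = [1.0 - e / sigma for e in errors]
print(ideal)
print(confidence_loss(ideal, errors, sigma))
```

A frame with zero pitch error is thus pushed towards confidence 1, while a frame whose error equals $\sigma$ is pushed towards confidence 0.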

Handling background music
The accuracy of pitch estimation can be severely affected when dealing with noisy conditions. These emerge, for example, when the singing voice is superimposed over background music. In this case, we are faced with polyphonic audio and we want the model to focus only on the singing voice source. To deal with these conditions, we introduce a data augmentation step in our training setup. More specifically, we mix the clean singing voice signal with the corresponding instrumental backing track at different levels of signal-to-noise ratio (SNR). Interestingly, we found that simply augmenting the training data was not sufficient to achieve a good level of robustness. Instead, we also modified the definition of the loss functions as follows. Let $x^c_{t,i}$ and $x^n_{t,i}$ denote, respectively, the CQT of the clean and noisy input samples. Similarly, $y^c_{t,i}$ and $y^n_{t,i}$ denote the corresponding outputs of the encoder. The pitch error loss is modified by averaging four different variants of the error, that is:

$$e^{ab}_t = \left| (y^a_{t,1} - y^b_{t,2}) - \sigma (k_{t,1} - k_{t,2}) \right|, \quad a, b \in \{c, n\},$$

$$L_{pitch} = \frac{1}{4T} \sum_{t} \left[ h(e^{cc}_t) + h(e^{cn}_t) + h(e^{nc}_t) + h(e^{nn}_t) \right]. \tag{13}$$

The reconstruction loss is also modified, so that the decoder is asked to reconstruct the clean samples only. That is:

$$L_{recon} = \frac{1}{T} \sum_{t} \left( \| x^c_{t,1} - \hat{x}_{t,1} \|_2^2 + \| x^c_{t,2} - \hat{x}_{t,2} \|_2^2 \right). \tag{14}$$

The rationale behind this approach is that the encoder is induced to represent in its output only the information relative to the clean input audio samples, thus learning to denoise the input by separating the singing voice from noise.
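The mixing step of this augmentation can be illustrated with a short sketch that scales a backing track to reach a target SNR before adding it to the voice. The signals are synthetic sine waves and the function name is hypothetical; the paper does not describe how the mixing gain is computed, so the power-ratio definition of SNR is an assumption.

```python
import math

def mix_at_snr(voice, backing, snr_db):
    """Scale `backing` so that voice power / backing power matches snr_db, then add."""
    p_voice = sum(v * v for v in voice) / len(voice)
    p_backing = sum(b * b for b in backing) / len(backing)
    gain = math.sqrt(p_voice / (p_backing * 10.0 ** (snr_db / 10.0)))
    return [v + gain * b for v, b in zip(voice, backing)], gain

# Synthetic stand-ins for a singing voice and a backing track (0.1 s at 16 kHz).
voice = [math.sin(2 * math.pi * 220.0 * n / 16000.0) for n in range(1600)]
backing = [0.3 * math.sin(2 * math.pi * 330.0 * n / 16000.0) for n in range(1600)]

mixed, gain = mix_at_snr(voice, backing, snr_db=10.0)
print(len(mixed), gain)
```

During training, the target SNR could be drawn per example, for instance with `random.uniform(-5.0, 25.0)` to match the range reported in the experimental section.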

IV. EXPERIMENTS

Model parameters
First we provide the details of the default parameters used in our model. The input audio track is sampled at 16 kHz. The CQT frontend is parametrized to use $B = 24$ bins per octave, so as to achieve a resolution equal to one half-semitone per bin. We set $f_{base}$ equal to the frequency of the note $C_1$, i.e., $f_{base} = 32.70$ Hz, and we compute up to $F_{max} = 190$ CQT bins, i.e., to cover the range of frequency up to Nyquist. The hop length is set equal to 512 samples, i.e., one CQT frame every 32 ms. During training, we extract slices of $F = 128$ CQT bins, setting $k_{min} = 8$ and $k_{max} = 16$. The Huber threshold is set to $\tau = 0.25\sigma$ and the loss weights equal to, respectively, $w_{pitch} = 10^4$ and $w_{recon} = 1$. We increased the weight of the pitch-shift loss to $w_{pitch} = 3 \cdot 10^5$ when training with background music.
The encoder receives as input a 128-dimensional vector corresponding to a sliced CQT frame and produces as output two scalars representing, respectively, pitch and confidence. The model architecture consists of $L = 6$ convolutional layers. We use filters of size 3 and stride equal to 1. The number of channels is equal to $d \cdot [1, 2, 4, 8, 8, 8]$, where $d = 64$ for the encoder and $d = 32$ for the decoder. Each convolution is followed by batch normalization and a ReLU non-linearity. Max-pooling of size 3 and stride 2 is applied at the output of each layer. Hence, after flattening the output of the last convolutional layer we obtain an embedding of size 1024 elements. This is fed into two different heads. The pitch estimation head consists of two fully-connected layers with, respectively, 48 and 1 units. The confidence head consists of a single fully-connected layer with 1 output unit. The total number of parameters of the encoder is equal to 2.38M. Note that we do not apply any form of temporal smoothing to the output of the model.
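The figures quoted above can be cross-checked with some simple arithmetic. The sketch assumes that the convolutions and pooling layers use 'same' padding (so each pooling step halves the length, rounding up) and that the CQT bins follow a geometric spacing starting at $f_{base}$; both are assumptions, as the padding scheme is not stated.

```python
import math

# Frame period implied by the hop length and sampling rate.
sample_rate, hop = 16000, 512
frame_period_ms = 1000.0 * hop / sample_rate

# Center frequency of the highest CQT bin, assuming geometric spacing.
f_base, bins_per_octave, n_bins = 32.70, 24, 190
top_bin_hz = f_base * 2.0 ** ((n_bins - 1) / bins_per_octave)

def pooled_length(length, n_layers, stride=2):
    """Length after n_layers of stride-2 pooling, assuming 'same' padding."""
    for _ in range(n_layers):
        length = math.ceil(length / stride)
    return length

# Encoder channels: d * [1, 2, 4, 8, 8, 8] with d = 64.
channels = [64 * m for m in (1, 2, 4, 8, 8, 8)]
embedding_size = pooled_length(128, n_layers=6) * channels[-1]

print(frame_period_ms, round(top_bin_hz), embedding_size)
```

Under these assumptions the 128-bin input shrinks to length 2 after six pooling steps, and with 512 channels in the last layer the flattened embedding has 1024 elements, in agreement with the text. The top bin also stays below the 8 kHz Nyquist frequency.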
The model is trained using Adam with default hyperparameters and learning rate equal to $10^{-4}$. The batch size is set to 64. During training, the CQT frames of the input audio tracks are shuffled, so that the frames in a batch are likely to come from different tracks.

Datasets
We use three datasets in our experiments, whose details are summarized in Table I. The MIR-1k [14] dataset contains 1000 audio tracks with people singing Chinese pop songs. The dataset is annotated with pitch at a granularity of 10 ms and it also contains voiced/unvoiced frame annotations. It comes with two stereo channels representing, respectively, the singing voice and the accompaniment music. The MDB-stem-synth dataset [11] includes re-synthesized monophonic music played with a variety of musical instruments. This dataset was used to train the CREPE model in [12]. In this case, pitch annotations are available at a granularity of 29 ms. Given the mismatch of the sampling period of the pitch annotations across datasets, we resample the pitch time-series with a period equal to the hop length of the CQT, i.e., 32 ms. In addition to these publicly available datasets, we also collected in-house the SingingVoices dataset, which contains 88 audio tracks of people singing a variety of pop songs, for a total of 185 minutes. Figure 3 illustrates the empirical distribution of pitch values. For SingingVoices, there are no ground-truth pitch labels, so we used the output of CREPE (configured with full model capacity and enabling Viterbi smoothing) as a surrogate. We observe that MDB-stem-synth spans a significantly larger range of frequencies (approx. 5 octaves) than MIR-1k and SingingVoices (approx. 3 octaves).
We trained SPICE using either SingingVoices or MIR-1k and used both MIR-1k (singing voice channel only) and MDB-stem-synth to evaluate models in clean conditions. To handle background music, we repeated training on MIR-1k, but this time applying data augmentation by mixing in backing tracks with an SNR uniformly sampled from [-5 dB, 25 dB]. For the evaluation, we used the MIR-1k dataset, mixing the available backing tracks at different levels of SNR, namely 20 dB, 10 dB and 0 dB. In all cases, we apply data augmentation during training, by pitch-shifting the input audio tracks by an amount in semitones uniformly sampled in the set {−12, 0, +12}.

Baselines
We compare our results against two baselines, namely SWIPE [6] and CREPE [12]. SWIPE estimates the pitch as the fundamental frequency of the sawtooth waveform whose spectrum best matches the spectrum of the input signal. CREPE is a data-driven method which was trained in a fully-supervised fashion on a mix of different datasets, including MDB-stem-synth [11], MIR-1k [14], Bach10 [33], RWC-Synth [9], MedleyDB [34] and NSynth [35]. We consider two variants of the CREPE model, by using model capacity tiny or full, and we disabled Viterbi smoothing, so as to evaluate the accuracy achieved on individual frames. These models have, respectively, 487k and 22.2M parameters. CREPE also produces a confidence score for each input frame.

Evaluation measures
We use the evaluation measures defined in [24] to evaluate and compare our model against the baselines. The raw pitch accuracy (RPA) is defined as the percentage of voiced frames for which the pitch error is less than 0.5 semitones. To assess the robustness of the model accuracy to the initialization, we also report the interval ±2σ, where σ is the sample standard deviation obtained collecting the RPA values computed using the last 10 checkpoints of 3 separate replicas. For CREPE we do not report such an interval, because we simply run the model provided by the CREPE authors on each of the evaluation datasets. The voicing recall rate (VRR) is the proportion of voiced frames in the ground truth that are recognized as voiced by the algorithm. We report the VRR at a target voicing false alarm rate equal to 10%. Note that this measure is provided only for MIR-1k, since MDB-stem-synth is a synthetic dataset and voicing can be determined based on a simple silence thresholding.
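The two measures can be sketched as follows. This is an illustrative implementation of the definitions given above, not the evaluation code used for the reported results; the threshold search for the VRR and the toy data are assumptions.

```python
def raw_pitch_accuracy(est, ref, voiced, tol=0.5):
    """Share of voiced frames whose pitch error (in semitones) is below tol."""
    hits = [abs(e - r) < tol for e, r, v in zip(est, ref, voiced) if v]
    return sum(hits) / len(hits)

def recall_at_false_alarm(conf, voiced, target_far=0.10):
    """Best voicing recall over thresholds whose false-alarm rate is <= target_far."""
    n_voiced = sum(voiced)
    n_unvoiced = len(voiced) - n_voiced
    best = 0.0
    for thr in sorted(set(conf)):
        detected = [c >= thr for c in conf]
        far = sum(d and not v for d, v in zip(detected, voiced)) / n_unvoiced
        if far <= target_far:
            recall = sum(d and v for d, v in zip(detected, voiced)) / n_voiced
            best = max(best, recall)
    return best

# Toy example: pitch in semitones, three voiced frames and one unvoiced frame.
est = [60.2, 62.0, 64.7, 50.0]
ref = [60.0, 62.4, 64.0, 0.0]
voiced = [True, True, True, False]
conf = [0.9, 0.8, 0.3, 0.2]

rpa = raw_pitch_accuracy(est, ref, voiced)
vrr = recall_at_false_alarm(conf, voiced)
print(rpa, vrr)
```

In the toy example two of the three voiced frames are within half a semitone of the reference, and a threshold exists that accepts all voiced frames while rejecting the unvoiced one.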

Main results
The main results of the paper are summarized in Table II and Figure 4. On the MIR-1k dataset, SPICE outperforms SWIPE, while achieving the same accuracy as CREPE in terms of RPA (90.7%), despite the fact that it was trained in an unsupervised fashion and CREPE used MIR-1k as one of the training datasets. Figure 5 illustrates a finer grained comparison between SPICE and CREPE (full model), measuring the average absolute pitch error for different values of the ground truth pitch frequency, conditioned on the level of confidence (expressed in deciles) produced by the respective algorithm. When excluding the decile with low confidence, we observe that above 110 Hz, SPICE achieves an average error around 0.2-0.3 semitones, while CREPE achieves around 0.1-0.5 semitones.
We repeated our analysis on the MDB-stem-synth dataset. In this case the dataset has remarkably different characteristics from the SingingVoices dataset used for the unsupervised training of SPICE, in terms of both frequency extension (Figure 3) and timbre (singing vs. musical instruments). This explains why in this case the gap between SPICE and CREPE is wider (88.9% vs. 93.1%). Figure 6 repeats the fine-grained analysis for the MDB-stem-synth dataset, illustrating larger errors at both ends of the frequency range. We also performed a thorough error analysis, trying to understand in which cases CREPE and SWIPE outperform SPICE. We discovered that most of these errors occur in the presence of a harmonic signal in which most of the energy is concentrated above the fifth-order harmonics, i.e., in the case of musical instruments characterized by a spectral timbre considerably different from the one of singing voice. We also evaluated the quality of the confidence estimation by comparing the voicing recall rate (VRR) of SPICE and CREPE. Results in Table II show that SPICE achieves results comparable with CREPE (86.8%, i.e., between CREPE tiny and CREPE full), while being more accurate in the more interesting low false-positive rate regime (see Figure 7).
In order to obtain a smaller, thus faster, variant of the SPICE model, we used the MorphNet [36] algorithm. Specifically, we added to the training loss (9) a regularizer which constrains the number of floating point operations (FLOPs), using $\lambda = 10^{-7}$ as regularization hyper-parameter. MorphNet produces as output a slimmed network architecture, which has 180k parameters, thus more than 10 times smaller than the original model. After training this model from scratch, we were still able to achieve a level of performance on MIR-1k comparable to the larger SPICE model, as reported in Table II.
Table III shows the results obtained when evaluating the models in the presence of background music. We observe that SPICE is able to achieve a level of accuracy very similar to CREPE across different values of SNR.

Calibration
The key tenet of SPICE is that it is an unsupervised method. However, as discussed in Section III, the raw output of the pitch head can only represent relative pitch. To obtain absolute pitch, the intercept $b$ (and, optionally, the slope $s$) in (10) needs to be estimated with the use of ground truth labels. Figure 8 shows the fitted model for both MIR-1k and MDB-stem-synth as a dashed red line. We qualitatively observe that the intercept is stable across datasets. In order to quantitatively estimate how many labels are needed to robustly estimate $b$, we repeated 100 bootstrap iterations. At each iteration we resample at random just a few frames from a dataset, fit $b$ (and $s$) using these samples, and compute the RPA. Figure 9 reports the results of this experiment on MIR-1k (error bars represent 2.5% and 97.5% quantiles). We observe that using as few as 200 frames is generally enough to obtain stable results. For MIR-1k this represents about 0.09% of the dataset. Note that these samples can also be obtained by generating synthetic harmonic signals, thus eliminating the need for manual annotations.
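The bootstrap procedure can be sketched as below. The data are synthetic stand-ins (a linear relation between encoder output and pitch plus Gaussian noise), the slope is held fixed so that only the intercept is fitted, and all names and numeric values are hypothetical.

```python
import random

def rpa(est, ref, tol=0.5):
    """Fraction of frames whose pitch error (in semitones) is below tol."""
    return sum(abs(e - r) < tol for e, r in zip(est, ref)) / len(ref)

def bootstrap_rpa(y, p, slope, n_frames, n_iter=100, seed=0):
    """Refit the intercept on n_frames random frames, n_iter times.

    Returns the 2.5% and 97.5% quantiles of the RPA on the full set.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(len(y)) for _ in range(n_frames)]
        b = sum(p[i] - slope * y[i] for i in idx) / n_frames
        scores.append(rpa([b + slope * v for v in y], p))
    scores.sort()
    lo = scores[int(0.025 * (n_iter - 1))]
    hi = scores[int(0.975 * (n_iter - 1))]
    return lo, hi

# Synthetic stand-in for encoder outputs and reference pitch (semitones).
rng = random.Random(1)
y = [rng.random() for _ in range(2000)]
p = [20.0 + 48.0 * v + rng.gauss(0.0, 0.3) for v in y]

lo, hi = bootstrap_rpa(y, p, slope=48.0, n_frames=200)
print(lo, hi)
```

A narrow gap between the two quantiles indicates that the calibration is insensitive to which frames happen to be sampled.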

V. CONCLUSION
In this paper we propose SPICE, a self-supervised pitch estimation algorithm for monophonic audio. The SPICE model is trained to recognize relative pitch without access to labelled data and it can also be used to estimate absolute pitch by calibrating the model using just a few labelled examples. Our experimental results show that SPICE is competitive with CREPE, a fully-supervised model that was recently proposed in the literature, despite having no access to ground truth labels.

Fig. 1: CQT frames extracted from the MIR-1k dataset re-ordered based on the pitch estimated by the SPICE algorithm (in red).

Fig. 3: Range of pitch values covered by the different datasets.

Fig. 5: Pitch error on the MIR-1k dataset, conditional on ground truth pitch and model confidence.

Fig. 6: Pitch error on the MDB-stem-synth dataset, conditional on ground truth pitch and model confidence.

Fig. 9: Robustness of the RPA on MIR-1k when varying the number of frames used for calibration.

TABLE I: Dataset specifications.

TABLE II: Evaluation results.

TABLE III: Evaluation results on noisy datasets.