Dynamic Time Signature Recognition, Tempo Inference, and Beat Tracking Through the Metrogram Transform

This paper proposes a probabilistic approach for extracting time-varying and irregular time signature information from polyphonic audio extracts, subsequently providing beat and bar line positions given inferred time signature divisions. This is achieved via dynamically evaluating the beat tempo as a function of time through finding an optimal compromise in beat and bar alignment in the time and tempo domains. Time signature divisions are determined based on a new representation, termed the Metrogram, that presents time-varying information regarding rhythmic and metric periodicities in the Tempogram. Our methodology is characterised by its ability to provide a distribution over metric interpretations, offering insights into the diverse ways music can be rhythmically perceived. Results indicate high-level accuracy for a variety of polyphonic extracts containing irregular, complex, irrational, and time-varying time signatures. Accuracy rivalling state-of-the-art methodologies is also reported in a beat tracking task performed on the standard Ballroom Dataset. The paper offers insights into the field of dynamic time signature recognition and beat tracking, offering a valuable and versatile resource for the analysis, composition, and performance of music.


I. INTRODUCTION
The process of transcribing complex polyphonic performances to musical notation is a notoriously arduous task, requiring the ability to distinguish numerous instrumental lines, separable often only via remarkably subtle variances in timbre, frequency, and waveform characteristics. Automated music transcription is a field of great significance in the music and educational industry, especially for primarily improvised genres, such as jazz. In particular, one of the most significant and challenging tasks is that of time signature inference and beat tracking, especially in metrically ambiguous extracts, a common occurrence across genres [1], [2], [3]. Inaccurate beat tracking for performances with rubato (time-varying tempo) or multiple metric interpretations, regardless of the accuracy of note detections, will result in poor quality transcriptions given the misalignment of key rhythmic structures in the transcription [4], [5].
Several competing methods exist in the literature for tempo and time signature estimation from audio, primarily through the employment of Tempograms [21], [22] and similarity matrices [17], [20], respectively. Various probabilistic methods are employed to facilitate meter detection [19], [25], with Hidden Markov Models (HMMs) paired with bar pointer models, in particular, proving successful for beat and downbeat estimation tasks in genre-specific applications [6], [8], [9], [10], [12], [13], [15]. The advent of Deep Learning (DL) approaches has produced numerous genre-specific methods for beat and downbeat tracking, primarily through the employment of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Temporal Convolutional Networks (TCNs) [23], [24], [34], [35], [36], [37], [38]. HMM bar pointer models have also been successfully applied to irregular ("odd") time signature beat tracking tasks [6], [8], [13], [14]; however, these approaches assume a constant metric division (often a priori). Likewise, for methods that tackle time-varying time signatures using HMMs [10], [16] or RNNs [24], [39], the models employ cascade formats, such that joint beat and bar tracking is performed sequentially, resulting in beat lengths independent of metric properties. The subjectivity of perceived pulse (beat) positions and lengths is addressed in the literature through the employment of agents [5], [31], yet the process has not been generalised to joint time signature and beat tracking tasks. In general, the majority of algorithms assume either a constant tempo, beat length, or time signature division over time, providing methods for extracting each independently, and thus poor performance is reported for extracts with rubato and irregular beat and time signature changes [25]. Likewise, although high-level accuracy is reported for DL and HMM approaches on specific genres, poor performance is often reported on unseen genres [42], given that, fundamentally, the models are constrained by the training data and genre-specific rhythmic and metric properties.
This paper instead proposes transforms in conjunction with novel probabilistic tracking algorithms, aimed specifically at extracting time-varying rhythmic and metric features from non-genre-specific polyphonic extracts, such as providing metric interpretations and detecting time signature changes, rubato, and irregular beat length, which is often present in jazz and other improvisatory genres. As such, the paper evaluates the model on a custom dataset featuring a variety of metrically diverse extracts, with varying time signatures, metric modulations, and rubato, as well as on the Ballroom Dataset, in order to verify its capability in fixed time signature beat tracking against the state of the art.

A. OVERVIEW
This paper proposes a posterior distribution that is maximised in order to optimally fit the model to the audio extract, with respect to a time series of tempo values and a phase offset, which together determine the relative positions of the bar and beat (tatum) times in the extract. The system iteratively samples the hyperparameter space of the model, optimising the posterior conditioned on the sampled hyperparameters with respect to the tempo values and phase offset until satisfactory convergence is achieved for each sampling iteration. The optimised posterior probabilities for the sampled hyperparameter configurations are compared and iterated until an appropriate global solution is found. The posterior distribution is constructed based upon a Note Onset Detection Function (NODF), a 2D Morlet Convolver, and a Fundamental Tempogram, which is used finally to generate the Metrogram, as described in the following sections. A simplified overview of the proposed architecture is presented in Fig. 1. The analysis is presented in continuous time for simplicity, although of course, the practical implementation involves a time and frequency discretisation.

B. NOTE ONSET DETECTION FUNCTION
As input to the Tempograms, a note onset detection function (NODF) is required. This paper proposes a two-dimensional NODF that predominantly exploits frequency domain information in order to distinguish more complex instrumental lines in polyphonic and polytimbral environments, which may have otherwise been hidden in the analytic envelope of the time domain.
The proposed NODF is based on the smoothed time-derivative of a spectrogram-like representation, using variable-resolution wavelet basis functions in the analysis step, in which f(ω, t) denotes the magnitude of the wavelet analysis of the continuous-time (mono) audio signal x(t), evaluated at frequency ω and time t, with W_ω(t) a wavelet function for frequency ω. Here we employ the Morlet wavelet, defined as a complex exponential with a Gaussian envelope [44]:

W_ω(t) = e^{iωt} e^{−t²/(2σ_ω²)}.

A wavelet-based approach is employed in our analysis of polyphonic music due to its ability to finely tune time-frequency resolutions, at a semitone level, for example. A smoothed derivative approximation of f(ω, t) is then obtained by convolving, with respect to time, the first derivative of a Gaussian kernel, g_{σ_d}(t), with standard deviation σ_d. Subsequently, the output is summed over N harmonics of the frequency ω, where n is the harmonic index (1 ≤ n ≤ N) and σ_ω² + σ_d² is employed as a normalisation term; the resulting value is limited to be above zero so that it is suited to note onset detection, rather than note release. As such, the method is analogous to the Harmonic Product Spectrum (HPS) [11], [18] used in the detection of fundamentals; however, our approach employs summations instead of multiplications of harmonics. The inner function, f(ω, t) ∗ g_{σ_d}(t), can then be expanded analytically.

D(ω, t) is now evaluated for each of the 88 semitones, ω(s) = A₀ · 2^{(s−1)/12}, where A₀ = 2π × 27.5, to yield the NODF. The parameter σ_ω, which determines the time-frequency resolution of the analysis, is chosen to respect the logarithmic spacing of the musical pitches, such that the frequency width is one-sixth of the distance between adjacent semitones; a factor of 6 is chosen as a reasonable compromise between frequency resolution and the potential inclusion of non-equally-tempered tones. The frequency resolution resulting from the wavelet analysis step is obtained from the Fourier transform of the proposed wavelet, which is of course a Gaussian in the frequency domain with a standard deviation equal to the inverse of the wavelet parameter σ_ω. Thus, σ_ω can be expressed in terms of the semitone spacing as σ_ω(s) = 6/(ω(s+1) − ω(s)). A one-dimensional NODF (required in (10)) is then obtained as a weighted sum of the two-dimensional form (3) across s, corresponding to the 88 semitones:

D(t) = Σ_{s=1}^{88} V(s) D(ω(s), t),

where V(s) is a (normalised) weighting function with respect to the semitone number, s. The proposed weighting prioritises lower frequencies with strength q, which is typically desirable given the strong dependence of beat on the bass notes.
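As a concrete illustration of the steps above, the following is a minimal sketch of the two-dimensional NODF: Morlet analysis at the 88 semitone frequencies, a smoothed time derivative via a Gaussian-derivative filter, additive (HPS-like) harmonic summation, and half-wave rectification. Function names and default parameter values (`n_harmonics`, `sigma_d`) are illustrative assumptions, not the reference implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import fftconvolve

def morlet_response(x, sr, omega, sigma):
    """Magnitude of the Morlet wavelet analysis of x at angular frequency omega."""
    t = np.arange(-4 * sigma, 4 * sigma, 1 / sr)
    w = np.exp(1j * omega * t) * np.exp(-t**2 / (2 * sigma**2))
    # mode="same" keeps the output the length of the signal x
    return np.abs(fftconvolve(x, w, mode="same")) / sr

def nodf_2d(x, sr, n_harmonics=3, sigma_d=0.02):
    """Sketch of the 2D NODF D(omega(s), t) over the 88 semitones (assumed defaults)."""
    D = np.zeros((88, len(x)))
    for i, s in enumerate(range(1, 89)):
        om = 2 * np.pi * 27.5 * 2 ** ((s - 1) / 12)   # omega(s) = A0 * 2^((s-1)/12)
        # frequency width set to 1/6 of the gap to the next semitone
        sigma_w = 6 / (om * (2 ** (1 / 12) - 1))
        acc = np.zeros(len(x))
        for n in range(1, n_harmonics + 1):           # additive harmonic summation
            f = morlet_response(x, sr, n * om, sigma_w)
            # smoothed time derivative: Gaussian-derivative filter (order=1)
            acc += gaussian_filter1d(f, sigma_d * sr, order=1)
        # half-wave rectify (onsets, not releases) and normalise
        D[i] = np.maximum(acc, 0) / np.sqrt(sigma_w**2 + sigma_d**2)
    return D
```

A one-dimensional NODF then follows as a weighted sum over the semitone axis, with the weighting favouring the bass register as described above.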

C. FUNDAMENTAL TEMPOGRAM
An alternative to the conventional Tempogram, a Fundamental Tempogram (so termed because of its ability to attenuate harmonics and sub-harmonics of the true tempo), is now proposed in order to extract metric information from the NODF. As a starting point, this could be attempted by performing further wavelet analysis of the one-dimensional NODF D(t) (7), with frequency centred upon a candidate beats-per-minute (bpm) value ω_bpm; the wavelet function, W_{ω_bpm}(t), is defined in terms of the bpm (ω_bpm) and equivalent receptive field (standard deviation) σ_{ω_bpm}. However, given that information regarding note positioning and spacing in the frequency domain is lost when evaluating D(t) from D(ω(s), t), an alternative expression employing the two-dimensional NODF (3) is proposed, in which W_2D(s, t) is a 2D Morlet wavelet:

W_2D(s, t) = e^{iω_bpm t} e^{−t²/(2σ_{ω_bpm}²)} e^{−s²/(2σ_{ω_s}²)}.

Parameter σ_{ω_s} is the standard deviation of the wavelet in component s. The resulting function is equivalent to a standard Morlet function in the time domain, weighted by a Gaussian in the (semitone) frequency domain.
The purpose of increasing the dimensionality of the wavelet whilst evaluating the CWT Tempogram is to capture correlations between note positions at neighbouring tatum and bar lines. The s component of the wavelet effectively weights neighbouring note positions by a Gaussian with standard deviation σ_{ω_s} semitones, such that closer note values at neighbouring tatum and bar lines produce greater responses in the final CWT Tempogram; typically, σ_{ω_s} = 12 is employed. Thus, periodicities in certain frequency ranges are extracted with greater accuracy. The relationship between the three-dimensional space generated by this process and the final CWT Tempogram can be observed in the animation available here: https://youtu.be/dbARq9Y9p8k?si=TZmdXxPl7m54BnIv
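To make the construction concrete, the 2D Morlet wavelet described above can be sketched as follows; grid shapes and parameter values are illustrative, with the semitone axis s centred on the analysis position:

```python
import numpy as np

def morlet_2d(s_axis, t_axis, omega_bpm, sigma_t, sigma_s=12.0):
    """2D Morlet wavelet: a Morlet in time at the candidate tempo omega_bpm
    (receptive field sigma_t), weighted by a Gaussian over the semitone
    offset axis (sigma_s ~ 12 semitones, as quoted in the text)."""
    t = t_axis[None, :]                    # shape (1, n_t)
    s = s_axis[:, None]                    # shape (n_s, 1)
    return (np.exp(1j * omega_bpm * t)
            * np.exp(-t**2 / (2 * sigma_t**2))
            * np.exp(-s**2 / (2 * sigma_s**2)))
```

Convolving the 2D NODF with this kernel weights note content at neighbouring tatum and bar positions by its proximity on the semitone axis, before the magnitude is taken to form the CWT Tempogram.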
As a further aid to determining beat tempo, an Autocorrelation Tempogram is also constructed. This takes the one-dimensional NODF D(t) (7) as its input, autocorrelates it, and then smooths the result through convolution with a Gaussian function having the same standard deviation as the CWT Tempogram (σ_{ω_bpm}). As a consequence of their construction, R_CWT has harmonics associated with the rhythmic components, whilst R_AC has sub-harmonics. Harmonic and tempo ambiguity in Tempograms has been encountered previously in the literature, with certain methods proposed, such as exploiting the combined properties of the Autocorrelation (R_AC) and Fourier (R_DFT) Tempograms through multiplication [16], [27], [28] and performing octave removal [26]. However, as a consequence of the methods employed (such as the DFT as opposed to the CWT) to generate R_AC and R_CWT in [16], [27], [28], and the limitations of purely targeting octave removal in [26], fundamental "rhythmic" frequencies remain largely inseparable from their harmonics. This paper by contrast proposes combining the properties of the presented CWT and Autocorrelation Tempograms, employing the same receptive field σ_{ω_bpm} in each case, via computing the geometric mean of the two arrays such that harmonics and sub-harmonics are attenuated and the resulting Tempogram normalised. Employing the notation x₊ = max{0, x}, the final proposed Fundamental Tempogram is obtained as the (suitably normalised) geometric mean

R(ω_bpm, t) = ( [R_CWT(ω_bpm, t)]₊ [R_AC(ω_bpm, t)]₊ )^{1/2}.
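The combination step can be sketched directly. Given precomputed CWT and autocorrelation Tempogram arrays sharing the same bpm and time axes, the geometric mean of their rectified values suppresses peaks that do not coincide in both (harmonics of the fundamental tempo in one, sub-harmonics in the other); the per-frame normalisation here is an implementation assumption:

```python
import numpy as np

def fundamental_tempogram(R_cwt, R_ac):
    """Geometric mean of the rectified Tempograms, normalised per frame.
    Both inputs are (n_bpm, n_time) arrays with a shared receptive field."""
    R = np.sqrt(np.maximum(R_cwt, 0) * np.maximum(R_ac, 0))
    norm = R.max(axis=0, keepdims=True)
    return R / np.where(norm > 0, norm, 1.0)   # avoid division by zero
```

A peak survives only where both Tempograms respond, which is precisely the fundamental tempo line.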

D. METROGRAM TRANSFORM FOR TIME SIGNATURE RECOGNITION
As observed in the Tempogram plots (such as Fig. 2), tempo trajectories present in the Tempogram provide insight into the type of time signature divisions represented in the music, specifically via the ratios between the tempo lines. A few papers explore the concept of ratios in the Tempogram [7], [27], [29] in order to facilitate meter tracking and genre classification. However, in these papers, ratios are evaluated causally with respect to a specified tempo and for a limited discrete subset of metric divisions, and in [29], the dependency on time is lost given this fixed-tempo assumption. Here, however, we propose a transform that exploits the Tempogram properties to extract rhythmic ratios present in the music, independent of tempo information, thus enabling the time-varying metric characteristics to be evaluated in continuous form. To achieve this, the proposed transform, named the Metrogram, evaluates the multiplicative equivalent of the autocorrelation function in the frequency domain, expressed in terms of k, the rhythmic ratio to be evaluated, and Z(k), a weighting factor that is constant with respect to time; typically Z(k) ∝ k^{−(p+1)} is suitable (0.5 < p < 1.5), given that higher time signature divisions are inherently more sensitive to rhythmic ambiguity, and therefore larger Metrogram ratios should be weighted accordingly. Note that the evaluation of P(k, t) is not limited to integer values of k, and can thus be employed to detect polyrhythms in music, for example. To determine the primary time signature division as a function of time given the Metrogram, k(t) = argmax_k P(k, t) can be employed.
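Under our reading of this definition — a multiplicative, ratio-based analogue of autocorrelation over the tempo axis, P(k, t) ∝ Z(k) ∫ R(ω, t) R(kω, t) dω — a minimal sketch is as follows; the uniform bpm grid and the resampling by linear interpolation are implementation assumptions:

```python
import numpy as np

def metrogram(R, bpm_axis, ks, p=1.0):
    """Sketch of the Metrogram: correlate the Tempogram with a copy of
    itself resampled at k times each tempo, weighted by Z(k) = k^-(p+1).
    R is (n_bpm, n_time); bpm_axis must be uniformly spaced."""
    d_bpm = bpm_axis[1] - bpm_axis[0]
    P = np.zeros((len(ks), R.shape[1]))
    for i, k in enumerate(ks):
        for t in range(R.shape[1]):
            # R evaluated at k * omega, zero outside the analysed tempo range
            Rk = np.interp(k * bpm_axis, bpm_axis, R[:, t], left=0.0, right=0.0)
            P[i, t] = k ** -(p + 1) * np.sum(R[:, t] * Rk) * d_bpm
    return P
```

For a Tempogram with peaks at, say, 60 and 180 bpm, the ratio k = 3 dominates, indicating a triple division; non-integer values of k can likewise expose polyrhythms.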
The Fundamental Tempogram's equivalent receptive field is noteworthy; both the Autocorrelation and CWT Tempograms employ Gaussian window functions with a standard deviation of σ_{ω_bpm}, defining this receptive field. A small receptive field might miss correlations between adjacent bar lines, whilst a sufficiently large σ_{ω_bpm} might not capture localised metric variation. A suitable range, influenced by the musical genre, has been found experimentally to lie between 1.5 s < σ_{ω_bpm} < 5 s, a result in line with previous works [13]. Fig. 2 illustrates the receptive field's effect on the Fundamental Tempogram and the subsequent influence on the integer-valued Metrogram division assignments.

E. TEMPO INFERENCE AND BEAT TRACKING
Having defined the various input components to the tempo and beat tracking functions, we now describe time-varying tempo inference through simultaneous optimisation with respect to six probabilistic objective functions. Specifically, the various inputs are discretised into arrays, denoted by the NODF, D_1:N, the Fundamental Tempogram, R_1:N(ω_bpm), and the bar-aligned metric divisions extracted from the Metrogram, k̂_1:N, forming the input dataset x_1:N = {D_1:N, k̂_1:N, R_1:N(ω)}. In [6], [8], [13], [14], Markov models were applied to meter and rhythm; here, a posterior distribution is instead proposed directly in terms of the bpm over time, λ_1:N, and the phase offset within a bar, φ. This paper proposes a likelihood that is constructed to account for all metric phenomena previously described, with no dependency on training data, formed as a product over the indices j, 1 ≤ j ≤ 6, associated with the proposed probabilistic objective functions. Equation (16) encourages the maximisation of the fit with respect to the Fundamental Tempogram, and (17) maximises the fit of the tatum and division tempo trajectories with respect to the Tempogram array. Likewise, (18) and (20) maximise the fit of the bar lines with respect to the NODF, and (19) and (21) maximise the fit of the tatum lines. C_0, C_1, C_2, C_3 are scaling constants, and b_0, b_1, b_2, b_3 are biases. The function cos^{2r} θ is employed to enforce convergence on the possible bar and tatum line alignments, with hyperparameters r_1, r_2 utilised such that alignments are encouraged with strength r with respect to the time-domain NODF rendered previously. Note that previous methods that model time-varying time signature divisions employ cascade formats [10], [24], [39], such that joint beat and bar tracking is performed sequentially, as opposed to our proposed method, which simultaneously optimises with respect to beat and bar allocations, ensuring that beat length variations are appropriately modelled. The prior is then
specified as:

p(λ_1:N, φ) = p(φ) p(λ_1) ∏_{n=2}^{N} p(λ_n | λ_{n−1}).

Equations (23) and (24) are possible given that p(φ) and p(λ_1) are assumed to be uniform in the ranges 0 ≤ φ ≤ max_n{k̂_n}π and 0 ≤ λ_1 ≤ ω_bpm,max, and due to the Markovian nature of the proposed prior on λ_1:N. Thus, taking the partial derivative of the negative log of the posterior, L(λ_1:N, φ) := −log p(λ_1:N, φ | x_1:N), with respect to λ_n and φ results in (26) and (27), shown at the bottom of the next page, respectively. The model parameters can be initialised through maximisation over the Metrogram inner terms in (12), given the determined k̂_n values (multiplied by k̂_n for the beat tempo). Other models typically employ the Viterbi algorithm for optimisation [6], [8], [9], [10], [12], [13] or Monte Carlo (MC) techniques [15]; for the structure of our continuous-parameter model, an iterative gradient descent method [30] is appropriate, in which λ = {λ_n}_{n=1}^{N} and φ are updated along the negative gradients at each training epoch e. In order to avoid becoming trapped in local minima during convergence, a stochastic step is added: for every epoch, with probability p_s (p_s ≈ 0.2), the algorithm evaluates the posterior probability at a grid of φ values, φ = φ_e + n_s π, for −k̂_max/2 ≤ n_s ≤ k̂_max/2 and integer n_s, where k̂_max is the maximum division extracted from the whole extract and φ_e corresponds to the φ computed during the current epoch. The algorithm then updates φ according to the maximum posterior probability (MAP estimate) across the specified range of n_s. This specific range of φ is chosen given that each beat corresponds to π phase; thus, this step effectively samples the neighbouring beats to ensure that the correct beat-bar alignment position has been chosen.
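The optimisation loop described above can be sketched generically. The gradient functions stand in for the partial derivatives of the negative log posterior, and all interfaces here (`neg_log_post`, `grad_lam`, `grad_phi`) are assumed signatures rather than the paper's exact implementation:

```python
import numpy as np

def optimise(neg_log_post, grad_lam, grad_phi, lam0, phi0, k_max,
             lr=0.1, epochs=300, p_s=0.2, rng=None):
    """Gradient descent on the negative log posterior over the bpm
    trajectory lam and bar phase phi, with the stochastic phase step."""
    rng = rng if rng is not None else np.random.default_rng(0)
    lam, phi = np.array(lam0, float), float(phi0)
    for _ in range(epochs):
        lam = lam - lr * grad_lam(lam, phi)     # update the tempo trajectory
        phi = phi - lr * grad_phi(lam, phi)     # update the bar phase offset
        if rng.random() < p_s:                  # stochastic beat-realignment step
            ns = np.arange(-(k_max // 2), k_max // 2 + 1)
            cands = phi + ns * np.pi            # each beat spans pi of phase
            phi = float(cands[np.argmin([neg_log_post(lam, c) for c in cands])])
    return lam, phi
```

The stochastic step can only improve (or retain) the posterior, since the current φ is always among the candidates (n_s = 0), so it never undoes the gradient descent progress.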
Upon satisfactory convergence of the parameters λ and φ, the bar and tatum times, in terms of the sampling index n, can be determined by solving the corresponding alignment equations for n (with appropriate linear interpolation).

F. MONTE CARLO HYPERPARAMETER SAMPLING
Given the inherent subjectivity in time signature recognition [1], the proposed posterior exhibits multiple local maxima dependent on hyperparameter selections. This arises primarily due to polyrhythmic and polymetric elements, prevalent in genres like jazz [3]. For instance, a waltz in 3/4 could be perceived as 6/8, 12/8, or 2/2 due to a multi-layered rhythmic hierarchy (as observed in Fig. 4). Hyperparameters include q, the weight in the 2D NODF; σ_{ω_bpm}, the Tempogram receptive field; and σ_{ω_s}, the 2D wavelet standard deviation. To determine the most suitable maximum, a global posterior maximum with respect to the hyperparameters can be identified (MAP estimate) by employing Monte Carlo (MC) sampling of the model's hyperparameter space for J iterations. Subsequently, for each sampling iteration, the posterior is evaluated through gradient descent of the negative log posterior (as per (29)) conditioned on these hyperparameters, yielding the extended posterior. Each hyperparameter σ_{ω_s}, σ_{ω_bpm}, and q is assigned a separate gamma distribution as its prior, p(θ | α, β) = (β^α / Γ(α)) θ^{α−1} e^{−βθ}. Subsequently, the allocation results obtained from the sampling iteration that achieved the greatest posterior value are taken; the corresponding sampling iteration index, i, maximises the posterior over the hyperparameters σ_{ω_s}^{(j)}, σ_{ω_bpm}^{(j)}, q^{(j)} sampled from the priors during the jth sampling iteration. Note that the posteriors here are computed with respect to the converged λ_1:N, φ values resulting from the optimisation step. A notable benefit of this proposed approach is the ability to provide the user with a variety of metric interpretations, corresponding to each sampling iteration, alongside the single highest-probability assignment.
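The sampling loop admits a compact sketch. The Gamma shape/rate values below are illustrative assumptions chosen so that the prior means sit near the typical values quoted in the text (σ_{ω_s} ≈ 12 semitones, σ_{ω_bpm} ≈ 3 s), and `optimise_fn` stands in for the inner gradient-descent fit returning a converged log posterior and its beat/bar allocations:

```python
import numpy as np

def mc_hyperparameter_search(optimise_fn, J=20, rng=None):
    """Draw hyperparameters from Gamma priors, run the conditioned
    optimisation for each draw, and keep the MAP draw."""
    rng = rng if rng is not None else np.random.default_rng(0)
    priors = {"sigma_ws": (4.0, 1 / 3.0),    # (alpha, beta): prior mean alpha/beta = 12
              "sigma_wbpm": (6.0, 2.0),      # prior mean 3 s
              "q": (2.0, 2.0)}               # prior mean 1
    best = None
    for _ in range(J):
        # numpy parameterises the Gamma by shape and scale = 1/beta
        theta = {k: rng.gamma(a, 1 / b) for k, (a, b) in priors.items()}
        log_post, allocation = optimise_fn(**theta)
        if best is None or log_post > best[0]:
            best = (log_post, theta, allocation)
    return best
```

The non-winning draws need not be discarded: each corresponds to an alternative metric interpretation that can be reported alongside the MAP assignment.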

III. RESULTS AND DISCUSSION
In order to evaluate the performance of our approach on extracts with both beat and time-signature variations, a metrically diverse custom dataset is presented. Although several other datasets have been presented with either irregular or time-varying time signatures [6], [8], [10], [13], [14], [16], [32], none are available for extracts that exhibit both simultaneously with rubato, as is the case in improvisatory genres such as jazz. This custom dataset was recorded by the first author and features 43 extracts (primarily jazz piano) with primary time signature divisions of 2, 3, 4, 5, and 7 (35 extracts), as well as several metrically ambiguous extracts with dynamic (time-varying) time signature divisions and tempo (rubato). A custom-designed algorithm was developed for the labelling of ground truth assignments based upon quarter-speed playback and expert domain knowledge provided by the first author.
The following widely used evaluation techniques [33] are employed: the F-measure, Cemgil score, P-score, and the Mean Squared Error (MSE). As input to the optimisation step, the terms x_n = {D_n, k̂_n, R_n(ω)} are computed once every 1500 data samples at 44.1 kHz. This compression factor, found through experimentation, provides a suitable compromise between temporal accuracy and computational efficiency, a result consistent with previous findings [6], [31]. The results for the 35 primary division extracts are shown in Table I. One example extract's beat assignment results are plotted in Fig. 3. Likewise, another example extract, Fig. 5, is presented with its Fundamental Tempogram, Metrogram (with the division allocations highlighted in blue), and the final beat (grey vertical lines) and bar (bright orange vertical lines) allocations superimposed over the CWT NODF intensities and audio waveform. Fig. 4 additionally presents three metric interpretations provided by the algorithm for a metrically ambiguous orchestral extract. As depicted in Table I, the algorithm displays high-level accuracy for both tatum and bar line assignments, with a tatum F-measure mean of 0.967 attesting to its strong performance in handling rubato, syncopation, and rhythmic irregularities. Every time signature division in the dataset was accurately inferred. Evaluation metrics largely remain consistent across the different time signature divisions (see Table I). For the more subjective metric assignments, the algorithm was able to infer compatible interpretations in each case, as observed in the example extracts presented in Figs. 4 and 5.
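For reference, the beat F-measure reported in Table I can be sketched with a simple greedy matching under a ±70 ms tolerance window; reference implementations such as mir_eval differ in matching details, so this is a minimal illustration rather than the exact metric used:

```python
import numpy as np

def beat_f_measure(est, ref, tol=0.07):
    """F-measure for beat tracking: an estimated beat within +/- tol
    seconds of a still-unmatched reference beat counts as a true positive."""
    est, ref = np.asarray(est, float), np.asarray(ref, float)
    if len(est) == 0 or len(ref) == 0:
        return 0.0
    used = np.zeros(len(ref), bool)
    tp = 0
    for t in est:
        d = np.abs(ref - t)
        d[used] = np.inf                  # each reference beat matches at most once
        if d.min() <= tol:
            used[np.argmin(d)] = True
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(est), tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, estimates [1.01, 2.0, 3.5] s against references [1.0, 2.0, 3.0] s yield two hits out of three in each direction, i.e. an F-measure of 2/3.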
Our model is designed to account for localised metric and rhythmic variations, such as time-varying time signature divisions and beat tempo, providing a set of probability-ranked metric interpretations, and is hence more flexible than the majority of other approaches in the literature. Nevertheless, comparisons with state-of-the-art approaches are considered for fixed time signature recognition to specifically evaluate its fixed-division beat tracking capabilities. For this purpose, evaluation is performed on the Ballroom dataset (BDS) [43]; our algorithm's performance is shown in Table III. Its overall performance is then compared with a number of state-of-the-art algorithms in Table II. Overall, our results are competitive with the leading fixed time signature methods, surpassed only by some of the deep learning methods that employ 8-fold cross-validation. For context, k-fold cross-validation involves training on a fraction (k−1)/k of the dataset and testing on the remaining 1/k. As such, while these models may exhibit high F-measures for specific datasets, there is potential for overfitting to particular genre-specific metric and rhythmic features, as evidenced when tested on unfamiliar datasets [42] (the BDS is drum-tracked with minimal tempo variation). A distinctive feature of our approach is that it requires no training and generalises to arbitrary polyphonic audio extracts, facilitating broader applications in various rhythmic and metric domains, as exhibited by the performance on the custom dataset. Accordingly, an interesting consequence of the proposed method is that its lower BDS performance figures are predominantly attributed to the model providing localised and global alternative metric interpretations (such as hemiolas or polymetric features), especially in the Waltz category, a possibility that is not entertained by either the fixed or dynamic time signature methods. Indeed, this is a challenge faced previously in the literature, with numerous studies finding misalignment between objective beat tracking scores and subjective participant scores due to varying metric interpretations [1], [2], [3], [4].

IV. CONCLUSION
This paper presents a probabilistic approach for the extraction of time-varying and irregular time signatures from polyphonic audio extracts, whilst also providing beat tracking estimates according to the inferred metric properties. Central to this approach is the Metrogram, a novel representation that captures time-varying information on rhythmic and metric periodicities within the Tempogram. A unique feature of our methodology is its ability to provide a distribution over metric interpretations through hyperparameter sampling of the posterior, offering insights into the diverse ways music can be rhythmically perceived. This aspect is particularly crucial for handling complex, irrational, and fluctuating time signatures commonly found in polyphonic extracts, especially in rhythmically and metrically diverse genres such as jazz. To demonstrate this unique feature, this paper presents a dataset consisting of irregular, irrational, and time-varying time signatures, with overall high-level accuracy reported. This level of accuracy is attributable to the unique probabilistic approach to dynamically evaluating the optimal balance in beat and bar alignment across both time and tempo domains. Likewise, empirical evaluations of the algorithm's fixed time signature beat tracking capabilities are presented for the standard Ballroom Dataset, with accuracy rivalling state-of-the-art methodologies.
Looking ahead, the potential applications of this algorithm are extensive; its adaptability and capacity to interpret complex rhythmic and polymetric structures open new avenues for music composition, performance, and analysis. Ultimately, our algorithm balances adaptability and accuracy, demonstrating capability in handling complex rhythmic and metric structures and offering versatility across diverse polyphonic audio environments, paving the way for a deeper understanding and appreciation of the complexities inherent in music.

FIGURE 1. Simplified overview of the proposed architecture.

FIGURE 2. Fundamental Tempograms (left) with the corresponding Metrograms (right) for two receptive fields for a recording modulating from 3/4 to 4/4. The threshold division assignments over time are illustrated in blue in the Metrograms. Top: σ_{ω_bpm} = 0.5 s. Bottom: σ_{ω_bpm} = 3.0 s. X-axis = time. An animation showing the gradual transition between the extremes of the receptive field can be seen here: https://youtu.be/wTL18EMWcek

FIGURE 3. Predicted and ground truth bar and tatum alignment results superimposed over the audio waveform for an extract with a primary time signature division of 7. Red = predicted; orange = ground truth; x-axis = time (s).

FIGURE 4. Selection of metric interpretations provided by the algorithm for a metrically ambiguous orchestral extract (Piano Concerto No. 4 by the first author). A reduced score is shown for convenience (full scores are available in the links).