UNCERTAINTY is inherent in music analysis. A musical piece about which we have little prior knowledge can often be interpreted in various ways. One might, for example, have various degrees of belief in different possible interpretations of tempo and semantic structures, and when we try to transcribe the music we hear in an audio recording, we often find it difficult to identify the notes with absolute confidence. Even if in the end we need to determine which interpretation or transcription is the most reasonable, during the analysis it is important to keep all possibilities open with various degrees of belief. We should therefore take an approach that can evaluate, propagate, and integrate the uncertainties of interdependent musical elements or musical notes.

A natural way to manage uncertainty is to take a Bayesian approach and use Bayesian probabilities to indicate degrees of belief. For example, suppose we have a distorted die. If the probabilities of getting the numbers 1, 2, 3, …, 6 (called parameters) are known, we can evaluate the likelihood of a set of numbers (called observed data) obtained by casting the die many times. Note that the true values of the parameters do not vary stochastically. When the parameters are unknown, a probability distribution is used as a means of representing how strongly possible values are believed to be the true values. Such degrees of belief vary according to the amount of observed data. Before we get observed data, prior distributions tend to be widely spread. The more data we get, the sharper the peaks of posterior distributions become; that is, the degree of belief in a certain possibility increases. The objective of Bayesian inference is to calculate the posterior distributions of unknown variables by formulating probabilistic models defined by likelihood functions and prior distributions.
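The die example can be made concrete with a minimal NumPy sketch (ours, not from the paper) of conjugate Bayesian updating: a Dirichlet prior over the six face probabilities absorbs observed counts, and the posterior variance shrinks as more rolls are observed. The function names and all counts are illustrative.

```python
import numpy as np

def die_posterior(prior, counts):
    """Dirichlet posterior over face probabilities: add the observed
    counts to the prior pseudo-counts (Dirichlet-multinomial conjugacy)."""
    return np.asarray(prior, dtype=float) + np.asarray(counts, dtype=float)

def posterior_variance(alpha):
    """Variance of each face probability under Dirichlet(alpha)."""
    a0 = alpha.sum()
    mean = alpha / a0
    return mean * (1.0 - mean) / (a0 + 1.0)

# Flat prior: every loading of the die is equally believed a priori.
prior = np.ones(6)

# A few rolls: the posterior is still widely spread.
few = die_posterior(prior, [2, 1, 0, 1, 1, 1])
# Many rolls of a die loaded toward "6": the posterior sharpens.
many = die_posterior(prior, [20, 10, 10, 10, 10, 140])

# The more data we observe, the smaller the posterior uncertainty.
assert posterior_variance(many).max() < posterior_variance(few).max()
```

The assertion mirrors the point of the paragraph: degrees of belief concentrate as observations accumulate.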

A critical problem with the conventional Bayesian approach is that we have to specify the complexity of the probabilistic models in advance (e.g., the number of mixture components in Gaussian mixture models (GMMs) or the number of states in hidden Markov models (HMMs)). If model complexities are unknown, both the uncertainty of model complexities and that of model parameters should be dealt with appropriately. The conventional approach, however, forces us to train many models of different complexities independently and then select one according to some criterion. Such fine-grained model selection, or model-complexity control, is often impractical, especially in the optimization of models of combinatorial complexity.

A nonparametric Bayesian approach that avoids the model selection problem has recently attracted a lot of attention [1]. Here the term “nonparametric” means that the size of the parameter space (complexity) is not fixed and in theory an infinite number of parameters (infinite complexity) are considered. If an infinite amount of observed data were available, an infinite number of parameters would be needed to represent the variety of the data. In practice, however, only a limited number of parameters are needed because the amount of observed data is limited. The effective complexities of nonparametric models can thus be automatically adjusted according to the observed data. Such nonparametric models are essentially different from conventional parametric models: in a single nonparametric model, an infinite number of parametric models with different complexities are stochastically overlapped.
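The following sketch (illustrative, not the authors' code) hints at why the effective complexity stays small: the stick-breaking construction of a Dirichlet process defines infinitely many mixing weights, yet a truncated sample shows that almost all of the probability mass falls on a handful of components.

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Draw mixing weights from a (truncated) Dirichlet process via
    stick-breaking: beta_i ~ Beta(1, alpha) and
    pi_i = beta_i * prod_{j<i} (1 - beta_j)."""
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking(alpha=2.0, truncation=1000, rng=rng)

# Although 1000 components are allowed, almost all of the mass
# concentrates on a few of them: the effective complexity is small.
assert abs(weights.sum() - 1.0) < 1e-6
assert (weights > 0.01).sum() < 50
```

The concentration parameter `alpha` (a hypothetical value here) controls how quickly the weights decay, which is why such hyperparameters matter in Section V.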

In this paper, we propose a nonparametric Bayesian method for multipitch analysis, which is the basis of music transcription and music information retrieval (MIR). The method is called infinite latent harmonic allocation (iLHA), and our goal is to estimate multiple fundamental frequencies (F0s) from polyphonic audio signals. Instead of determining the values of F0s (parameters) and the number of them (complexity) uniquely, our method estimates a joint posterior distribution of all unknown variables when amplitude spectra of musical audio signals are given as observed data. We formulate nested infinite GMMs for observed spectra by using nonparametric priors called Dirichlet processes (DPs). These models can be obtained by taking the limit of the nested finite GMMs proposed by Goto [2] and Kameoka et al. [3] as the number of mixtures goes to infinity. More specifically, each spectral strip is allowed to contain an unbounded number of sound sources (harmonic structures), each of which is allowed to contain an unbounded number of harmonic partials. An important problem is that the parameters of the DPs (called hyperparameters) should be given appropriately because they affect the effective number of mixtures.

To avoid hyperparameter tuning, our models are formulated in a hierarchical Bayesian manner by putting prior distributions (called hyperprior distributions) on influential hyperparameters. Conventionally, we need to specify the hyperparameters of Dirichlet prior distributions on the relative weights of harmonic partials [2], [3]. Although these hyperparameters strongly impact the accuracy of F0 estimation, it is difficult to optimize them by hand. We instead put noninformative hyperprior distributions on the hyperparameters of DP priors of the infinite number of F0s and harmonic partials. This is reasonable because we have little knowledge of the hyperparameters. As shown in Fig. 1, we can completely automate iLHA by leveraging natural Bayesian treatment of parameters, complexities, and influential hyperparameters.

Figure 1
Fig. 1. Advantage of our method: We are not required to specify the number of spectral bases and the number of harmonic partials in advance. In addition, we do not have to adjust hyperparameters carefully.

The remainder of this paper is organized as follows. Section II introduces related work. Section III compares the parametric models of conventional methods with the nonparametric models of our method. Section IV describes a finite version of our method (LHA) and Section V explains our method (iLHA). Section VI reports our experiments. Section VII concludes this paper.



Many researchers have applied probabilistic models to multipitch analysis. Goto [2] proposed a probabilistic model for a single-frame amplitude spectrum (spectral strip) that contains multiple harmonic structures (see Section III) and used it to estimate the F0s of melody and bass lines from polyphonic audio signals. Kameoka et al. [3] estimated multiple F0s by using a similar model for grouping frequency components into multiple sound sources. Kameoka et al. [4] extended the model by capturing the temporal continuity of harmonic structures. Raphael [5] formulated an HMM based on a large number of chord hypotheses. Cemgil et al. [6] used a dynamic Bayesian network (DBN) to represent the sound generation process, i.e., to associate a music-score level with an audio-signal level. Raczyński et al. [7] also used a DBN to model temporal dependencies between musical notes. Emiya et al. [8] proposed a probabilistic model that jointly represents spectral envelopes and harmonic partials.

Recently, nonnegative matrix factorization (NMF) [9] has come to be considered promising. It regards time–frequency spectra as a nonnegative matrix and decomposes it into the product of two nonnegative matrices, one corresponding to a set of spectral bases and the other corresponding to a set of temporal activations. Smaragdis et al. [10] pioneered the use of NMF for music transcription. Virtanen et al. [11] and Peeling et al. [12] proposed Bayesian extensions of NMF. Raczyński et al. [13] and FitzGerald et al. [14] proposed harmonicity constraints for spectral bases, and Bertin et al. [15] further introduced smoothness constraints for temporal activations. Vincent et al. [16] proposed a method of training spectral bases from audio signals of isolated tones and adapting them to target polyphonic audio signals. Cont [17] developed NMF with sparsity constraints for real-time pitch tracking. Several variants of NMF—such as the complex NMF proposed by Kameoka et al. [18], the Itakura–Saito (IS) divergence NMF proposed by Févotte et al. [19], and the gamma process NMF proposed by Hoffman et al. [20]—have been applied to spectrogram decomposition, but F0s have not been estimated from the spectral bases thus obtained.

Many other approaches have also been proposed (see [21] for a review). For example, Marolt [22] and Klapuri [23] proposed auditory-model-based methods that use a peripheral hearing model. Computationally efficient approaches based on harmonic sums [24] and correlograms [25] have also been investigated. Pertusa and Iñesta [26] proposed a spectral-peak clustering method. Bello et al. [27] tackled grouping of frequency components by using a heuristic set of rules.

There have also been attempts to estimate the F0 contours of melody lines (vocal parts) from polyphonic audio signals. Dressler [28] used instantaneous frequency estimation, sinusoidal extraction, psychoacoustics, and auditory stream segregation. Ryynänen and Klapuri [29] formulated an HMM based on acoustic and musicological modeling, and Durrieu et al. [30] proposed a statistical method of extracting the main melody by using source/filter models. Poliner et al. [31] have reported a comparative evaluation of several approaches.

Most methods mentioned above can achieve good results if the number of sound sources and/or manual parameters are appropriately specified. However, it is difficult to always bring out the full potential of these methods in practice.



Our method is based on a nonparametric Bayesian extension of the conventional finite mixture models proposed by Goto [2] and Kameoka et al. [3]. Here we explain the conventional models for observed spectra and then derive our infinite mixture models by extending them.

A. Notations

Figure 2
Fig. 2. Gaussian mixture model for the $k$th basis (single basis). Each Gaussian corresponds to a harmonic partial, and the mixing weights represent the relative strengths of the $M$ harmonic partials.

Suppose that the given polyphonic audio signals contain $K$ bases, each of which consists of $M$ harmonic partials located at integral multiples of the F0 on a linear frequency scale. Each basis can be associated with multiple sounds at different temporal positions if these sounds are derived from the same pitch of the same instrument. We transform the audio signals into wavelet spectra. Let $D$ be the number of frames. Note that $K$ and $M$ are finite integers that in conventional methods are specified in advance. Our method considers the limit in which $K$ and $M$ go to infinity.

B. Conventional Finite Models and MAP Estimation

Probabilistic models evaluate how likely the observed data are to be generated from a limited number of parameters. Estimating multiple F0s therefore corresponds directly to finding the model parameters that give the highest probability to the generation of the observed data (model training).

Goto [2] first proposed probabilistic models of harmonic structures by regarding an amplitude spectrum (a spectral strip of a single frame) as a probability density function. As shown in Fig. 2, the amplitude distribution of basis $k\ (1\leq k\leq K)$ can be modeled by a harmonic GMM as follows:
$$\mathcal{M}_{k}(\boldsymbol{x})=\sum_{m=1}^{M}\tau_{km}\,\mathcal{N}\!\left(\boldsymbol{x}\,\middle|\,\boldsymbol{\mu}_{k}+\boldsymbol{o}_{m},\boldsymbol{\Lambda}_{k}^{-1}\right)\tag{1}$$
where $\boldsymbol{x}$ is a one-dimensional vector indicating a logarithmic frequency [cents].1 The Gaussian parameters (mean $\boldsymbol{\mu}_{k}$ and variance $\boldsymbol{\Lambda}_{k}^{-1}$) represent the F0 of basis $k$ and the degree of energy spread around the F0, and $\tau_{km}$ is the relative strength of the $m$th harmonic partial $(1\leq m\leq M)$ in basis $k$. We set $\boldsymbol{o}_{m}$ to $[1200\log_{2}m]$, which means that the $M$ Gaussians are placed in a harmonic relationship on the logarithmic frequency scale. One might think that the value of $\boldsymbol{\Lambda}_{k}^{-1}$ can be precomputed because the basis sound consists of $M$ sinusoidal signals (see Appendix I in [4]). This is true if these sinusoidal signals are stationary, but frequency-modulated sounds (e.g., vibrato) result in a larger value of $\boldsymbol{\Lambda}_{k}^{-1}$ because of the uncertainty principle of time–frequency resolution.

As shown in Fig. 3, the spectral strip of frame $d$ is modeled by mixing the $K$ harmonic GMMs as follows:
$$\mathcal{M}_{d}(\boldsymbol{x})=\sum_{k=1}^{K}\pi_{dk}\,\mathcal{M}_{k}(\boldsymbol{x})\tag{2}$$
where $\pi_{dk}$ is the relative strength of basis $k$ in frame $d$. Consequently, a polyphonic spectral strip can be represented by means of a nested finite GMM.
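As a concrete illustration, the following NumPy sketch evaluates (1) and (2) in one dimension. All parameter values (two bases, four partials, a variance of 400 cents²) are invented for the example, not taken from the paper.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def harmonic_gmm(x, mu_k, var_k, tau_k):
    """Eq. (1): density of one basis, with partials at offsets
    o_m = 1200*log2(m) cents above the F0 mu_k."""
    M = len(tau_k)
    offsets = 1200.0 * np.log2(np.arange(1, M + 1))
    return sum(tau_k[m] * gaussian_pdf(x, mu_k + offsets[m], var_k)
               for m in range(M))

def spectral_strip(x, pi_d, mus, variances, taus):
    """Eq. (2): weighted mixture of K harmonic GMMs for one frame."""
    return sum(pi_d[k] * harmonic_gmm(x, mus[k], variances[k], taus[k])
               for k in range(len(pi_d)))

# Illustrative values: two bases, four partials each.
x = np.linspace(3000.0, 8000.0, 2000)          # frequency axis in cents
taus = [np.array([0.5, 0.25, 0.15, 0.1])] * 2  # partial weights sum to 1
density = spectral_strip(x, pi_d=[0.6, 0.4],
                         mus=[4400.0, 5100.0],
                         variances=[400.0, 400.0], taus=taus)

# The model is a proper density: a Riemann sum over the axis is close to 1.
dx = x[1] - x[0]
assert abs(density.sum() * dx - 1.0) < 0.05
```

Because the mixing weights at both levels sum to one, the nested mixture remains a normalized density over frequency.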

Figure 3
Fig. 3. Nested Gaussian mixture model for mixed multiple bases. It is obtained by mixing multiple Gaussian mixture models in a weighted manner under the assumption of amplitude additivity.
Table I

Several methods that have been proposed for parameter estimation are listed in Table I. Goto [2] proposed a predominant-F0 estimation method (PreFEst) that estimates only the relative strengths $\boldsymbol{\tau}$ and $\boldsymbol{\pi}$ by allocating many GMMs ($\boldsymbol{\mu}$ and $\boldsymbol{\Lambda}$ are fixed) to cover the entire frequency range as F0 candidates. Kameoka et al. [3] proposed harmonic clustering (HC), which estimates all the parameters and selects the optimal number of bases by using the Bayesian information criterion. Although these methods yielded promising results, they analyze the spectral strips of different frames independently. Kameoka et al. [4] therefore proposed harmonic-temporal-structured clustering (HTC), which captures the temporal continuity of spectral bases. Because all the above methods use a maximum a posteriori (MAP) estimation strategy to train the finite models, the prior distribution of the relative strengths $\boldsymbol{\tau}$ has a large effect on the accuracy of F0 estimation.

C. Our Infinite Models and Bayesian Inference

We would like to discuss the limit of (1) and (2) as $K$ and $M$ diverge to infinity. Taking this infinite limit is reasonable even though there are only a finite number of discrete pitches (e.g., the standard piano has 88 keys): the F0s and spectral shapes of many instruments (strings, woodwinds, brasses, etc.) vary infinitely according to playing styles (vibrato, marcato, legato, staccato, etc.), and it is difficult to capture these variations with a parametric model of fixed complexity.

Although there are theoretically an infinite number of mixing weights $\{\pi_{d1},\pi_{d2},\ldots\}$ and $\{\tau_{k1},\tau_{k2},\ldots\}$, any finite amount of observed data in practice contains a finite number of bases and a finite number of harmonic partials. Most of the mixing weights must therefore be almost equal to zero. In other words, only a limited number of bases and a limited number of harmonic partials are allowed to become active. To realize such “sparse” GMMs, we put nonparametric prior distributions on the mixing weights as sparsity constraints. We developed a method of Bayesian inference called iLHA to train the nested infinite GMMs (see Section V).

1) Definition of Observed Data

In the context of Bayesian inference we need to explicitly define the observed data from the statistical viewpoint. More specifically, we regard each spectral strip as a histogram of observed frequencies as in [32]. If a spectral strip at frame $d\ (1\leq d\leq D)$ has amplitude $a$ at frequency $f$, we assume that frequency $f$ was observed $\lfloor\omega a\rfloor$ times in frame $d$, where $\omega$ is a scaling factor of the wavelet spectra. In other words, we suppose there are countable frequency “particles” (sound quanta), each corresponding to an independent and identically distributed (i.i.d.) observation. Note that there is a nontrivial issue in determining the value of $\omega$ (see Section III-C3). Assuming that amplitudes are additive, we can consider each observation to be generated from one of $M$ partials in one of $K$ bases.
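The quantization just described can be sketched in a few lines; the function name and the toy amplitudes below are ours, purely for illustration.

```python
import numpy as np

def spectrum_to_observations(freqs, amps, omega):
    """Turn one spectral strip into i.i.d. frequency 'particles':
    a bin with amplitude a contributes floor(omega * a) copies of
    its frequency (the histogram view described above)."""
    counts = np.floor(omega * np.asarray(amps)).astype(int)
    return np.repeat(np.asarray(freqs), counts)

# Toy strip: three bins with frequencies in cents and raw amplitudes.
freqs = np.array([4400.0, 5600.0, 6102.0])
amps = np.array([0.83, 0.41, 0.10])

obs = spectrum_to_observations(freqs, amps, omega=10.0)
# floor(10*0.83)=8, floor(10*0.41)=4, floor(10*0.10)=1, so N_d = 13.
assert len(obs) == 13
assert (obs == 4400.0).sum() == 8
```

Note how $\omega$ directly sets the number of observations $N_d$, which is exactly the arbitrariness discussed in Section III-C3.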

Let the total observations over all $D$ frames be represented by $\boldsymbol{X}=\{\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{D}\}$, where $\boldsymbol{X}_{d}=\{\boldsymbol{x}_{d1},\ldots,\boldsymbol{x}_{dN_{d}}\}$ is the set of observed frequencies in frame $d$. $N_{d}$ is the number of frequency observations (i.e., the sum of spectral amplitudes over all frequency bins in frame $d$) and $\boldsymbol{x}_{dn}\ (1\leq n\leq N_{d})$ is a one-dimensional vector that represents an observed frequency. We let $N=\sum_{d}N_{d}$ be the total number of observations over all frames.

Let the latent variables corresponding to $\boldsymbol{X}$ be similarly represented by $\boldsymbol{Z}=\{\boldsymbol{Z}_{1},\ldots,\boldsymbol{Z}_{D}\}$, where $\boldsymbol{Z}_{d}=\{\boldsymbol{z}_{d1},\ldots,\boldsymbol{z}_{dN_{d}}\}$. $\boldsymbol{z}_{dn}$ is a $KM$-dimensional vector in which only one entry, $z_{dnkm}$, takes a value of 1 and the others take values of 0 when frequency $\boldsymbol{x}_{dn}$ is generated from partial $m\ (1\leq m\leq M)$ of basis $k\ (1\leq k\leq K)$.

2) Positioning of Our Method

Our method can be viewed as an extension of a well-known topic modeling method called latent Dirichlet allocation (LDA) [33]. LDA was developed as a Bayesian extension of probabilistic latent semantic analysis (pLSA) [34] in the field of natural language processing. In LDA, each document is represented as a weighted mixture of multiple topics that are shared over all documents contained in observed data. Our method similarly represents frames as weighted mixtures of bases. An important difference between our method and LDA, however, is that iLHA represents each basis as a continuous distribution (a GMM) on the frequency space while LDA represents each topic as a discrete distribution over words (a set of unigram probabilities).

Another relevant extension of pLSA is probabilistic latent component analysis (PLCA) [35]. PLCA has been applied to source separation by assuming the time–frequency spectrogram to be a two-dimensional histogram of sound quanta. A major difference between our method and PLCA is that iLHA is based on a continuous distribution on the frequency space at each frame while PLCA is based on a two-dimensional discrete distribution on the space of frame-frequency pairs.

Our method is also similar to the standard NMF [10] in that it assumes temporal exchangeability of spectral strips (see Table I). Our method simultaneously trains the GMMs of all frames contained in the observed spectra; in other words, if we permuted the temporal sequence of spectral strips, the same results would be obtained. Although ignoring temporal structure in this way is not appropriate for music, it is known to work well in practice.

As discussed above, we fuse the topic modeling framework into the NMF-style decomposition. This is reasonable because any (local) maximum-likelihood solution of pLSA is proven to be a solution of NMF that uses Kullback–Leibler (KL) divergence as a cost function [36]. In addition, we propose a nonparametric Bayesian extension.

3) Limitations of Our Method

The amplitude quantization and the i.i.d. assumption are not justified in a physical sense: the amplitudes at the integral multiples of an F0 are correlated with each other when they are generated from a single harmonic sound. Besides this, there is arbitrariness in determining the total number of observations $N$ (i.e., the scaling factor $\omega$ by which raw wavelet spectra are multiplied). The larger $\omega$ is, the more observations we have, resulting in a more compact posterior distribution because of the reduced uncertainty. This criticism applies not only to topic models like [32], [35] but also to the probabilistic model underlying NMF with KL divergence, which assumes the value of amplitude to follow a Poisson distribution that is defined over nonnegative integers and has no scale parameter. Note that NMF with IS divergence [19] does not have this problem because it assumes the value of power (squared amplitude) to follow an exponential distribution that is defined over nonnegative real numbers and has a scale parameter. NMF with IS divergence is therefore scale invariant.
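The scale dependence can be checked numerically with the standard per-element cost functions of the two NMF variants (a self-contained sketch with illustrative values): the KL cost scales linearly with $\omega$, while the IS cost is unchanged.

```python
import math

def kl_cost(x, y):
    """Per-element generalized KL cost used in KL-NMF (Poisson model)."""
    return x * math.log(x / y) - x + y

def is_cost(x, y):
    """Per-element Itakura-Saito cost used in IS-NMF (exponential model)."""
    return x / y - math.log(x / y) - 1.0

x, y, omega = 3.0, 2.0, 100.0

# IS divergence is scale invariant; KL divergence scales with omega.
assert abs(is_cost(omega * x, omega * y) - is_cost(x, y)) < 1e-12
assert abs(kl_cost(omega * x, omega * y) - omega * kl_cost(x, y)) < 1e-9
```

This is why the choice of $\omega$ affects inference under the Poisson-style observation model but would not under an IS-style one.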

This is more problematic in the context of “nonparametric” Bayesian inference because a larger number of observations allows iLHA to activate more relatively small mixture components (i.e., bases and harmonic partials). We therefore need to perform a thresholding process according to the value of $\omega$ after training the weights of the bases. In our experiments, the accuracy of multipitch analysis varied little when we changed the value of $\omega$ (see Section VI).

Another limitation is that our method represents harmonic sounds in an oversimplified manner. We assume that harmonic sounds consist only of several sinusoidal signals corresponding to harmonic partials. Actually, however, measurable noisy components are widely distributed along the frequency axis even if the target musical pieces are played only by pitched instruments. iLHA is thus forced to use too many harmonic GMMs to represent such noisy components. This is another reason that we need the thresholding process in the end.



This section explains LHA, the finite version of iLHA, as a preliminary step to deriving iLHA. LHA deals with nested finite GMMs described in Section III in a Bayesian manner. First, we mathematically represent the LHA model by putting prior distributions on unknown variables. Then, we explain a training method of estimating posterior distributions.

A. Model Formulation

Fig. 4 shows a graphical representation of the LHA model. The full joint distribution is given by
$$p(\boldsymbol{X},\boldsymbol{Z},\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})=p(\boldsymbol{X}\mid\boldsymbol{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})\,p(\boldsymbol{Z}\mid\boldsymbol{\pi},\boldsymbol{\tau})\,p(\boldsymbol{\pi})\,p(\boldsymbol{\tau})\,p(\boldsymbol{\mu},\boldsymbol{\Lambda})\tag{3}$$
where the first two terms are likelihood functions and the other three terms are prior distributions. The likelihood functions are defined as
$$p(\boldsymbol{X}\mid\boldsymbol{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})=\prod_{dnkm}\mathcal{N}\!\left(\boldsymbol{x}_{dn}\,\middle|\,\boldsymbol{\mu}_{k}+\boldsymbol{o}_{m},\boldsymbol{\Lambda}_{k}^{-1}\right)^{z_{dnkm}}\tag{4}$$
$$p(\boldsymbol{Z}\mid\boldsymbol{\pi},\boldsymbol{\tau})=\prod_{dnkm}(\pi_{dk}\tau_{km})^{z_{dnkm}}\tag{5}$$
We then introduce conjugate priors as follows:
$$p(\boldsymbol{\pi})=\prod_{d=1}^{D}\mathrm{Dir}(\boldsymbol{\pi}_{d}\mid\alpha\boldsymbol{\nu})=\prod_{d=1}^{D}C(\alpha\boldsymbol{\nu})\prod_{k=1}^{K}\pi_{dk}^{\alpha\nu_{k}-1}\tag{6}$$
$$p(\boldsymbol{\tau})=\prod_{k=1}^{K}\mathrm{Dir}(\boldsymbol{\tau}_{k}\mid\beta\boldsymbol{\upsilon})=\prod_{k=1}^{K}C(\beta\boldsymbol{\upsilon})\prod_{m=1}^{M}\tau_{km}^{\beta\upsilon_{m}-1}\tag{7}$$
$$p(\boldsymbol{\mu},\boldsymbol{\Lambda})=\prod_{k=1}^{K}\mathcal{N}\!\left(\boldsymbol{\mu}_{k}\,\middle|\,\boldsymbol{m}_{0},(b_{0}\boldsymbol{\Lambda}_{k})^{-1}\right)\mathcal{W}(\boldsymbol{\Lambda}_{k}\mid\boldsymbol{W}_{0},c_{0})\tag{8}$$
where $p(\boldsymbol{\pi})$ and $p(\boldsymbol{\tau})$ are products of Dirichlet distributions and $p(\boldsymbol{\mu},\boldsymbol{\Lambda})$ is a product of Gaussian–Wishart distributions. $C(\alpha\boldsymbol{\nu})$ and $C(\beta\boldsymbol{\upsilon})$ are normalization factors, and $\alpha\boldsymbol{\nu}$ and $\beta\boldsymbol{\upsilon}$ are hyperparameters; $\boldsymbol{\nu}$ and $\boldsymbol{\upsilon}$ each sum to unity, and $\alpha$ and $\beta$ are often called concentration parameters. $\boldsymbol{m}_{0}$, $b_{0}$, $\boldsymbol{W}_{0}$, and $c_{0}$ are also hyperparameters: $\boldsymbol{m}_{0}$ is a Gaussian mean, $b_{0}$ is a scaling factor of the precision matrix, $\boldsymbol{W}_{0}$ is a scale matrix, and $c_{0}$ is a degree of freedom.

Figure 4
Fig. 4. Graphical representation of the nested finite Gaussian mixture models for LHA. First, finite sets of mixing weights, $\boldsymbol{\pi}$ and $\boldsymbol{\tau}$, are stochastically generated according to Dirichlet prior distributions. At the same time, $KM$ Gaussian distributions are stochastically generated according to a Gaussian–Wishart prior distribution. Then one of the $M$ harmonic partials in one of the $K$ bases is stochastically selected as a latent variable $\boldsymbol{z}_{dn}$ according to multinomial distributions defined by $\boldsymbol{\pi}$ and $\boldsymbol{\tau}$. Finally, frequency $\boldsymbol{x}_{dn}$ is stochastically generated according to the Gaussian distribution specified by $\boldsymbol{z}_{dn}$.

B. Variational Bayesian Inference

The goal of Bayesian inference is to compute the true posterior distribution of all unknown variables, $p(\boldsymbol{Z},\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda}\mid\boldsymbol{X})$. Because analytical calculation of this posterior is intractable, we use an approximation technique called variational Bayes (VB) [37], which limits the posterior distribution to an analytical form and optimizes it iteratively in a deterministic way. Another possible technique is Markov chain Monte Carlo (MCMC) [38], which sequentially generates samples (concrete values of the unknown variables) from the true posterior in a stochastic way by constructing a Markov chain that has the target distribution as its equilibrium distribution. It is generally difficult, however, to tell whether a Markov chain has reached a stationary distribution from which we can get samples within an acceptable error.

In the VB framework, we introduce a variational posterior distribution $q(\boldsymbol{Z},\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})$ and iteratively make it close to the true posterior $p(\boldsymbol{Z},\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda}\mid\boldsymbol{X})$. Here we assume that the variational distribution can be factorized as
$$q(\boldsymbol{Z},\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})=q(\boldsymbol{Z})\,q(\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})\tag{9}$$
To optimize $q(\boldsymbol{Z},\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})$, we use a variational version of the expectation–maximization (EM) algorithm [37]. We alternate the following VB-E and VB-M steps until a variational lower bound of the evidence $p(\boldsymbol{X})$ converges:
$$q^{\ast}(\boldsymbol{Z})\propto\exp\left(\mathbb{E}_{\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda}}\left[\log p(\boldsymbol{X},\boldsymbol{Z},\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})\right]\right)\tag{10}$$
$$q^{\ast}(\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})\propto\exp\left(\mathbb{E}_{\boldsymbol{Z}}\left[\log p(\boldsymbol{X},\boldsymbol{Z},\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})\right]\right).\tag{11}$$

C. Variational Posterior Distributions

We derive the formulas for updating variational posterior distributions according to (10) and (11).

1) VB-E Step

The optimal variational posterior distribution of the latent variables $\boldsymbol{Z}$ can be computed as follows:
$$\begin{aligned}\log q^{\ast}(\boldsymbol{Z})&=\mathbb{E}_{\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda}}\left[\log p(\boldsymbol{X},\boldsymbol{Z},\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})\right]+\mathrm{const.}\\&=\mathbb{E}_{\boldsymbol{\mu},\boldsymbol{\Lambda}}\left[\log p(\boldsymbol{X}\mid\boldsymbol{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})\right]+\mathbb{E}_{\boldsymbol{\pi},\boldsymbol{\tau}}\left[\log p(\boldsymbol{Z}\mid\boldsymbol{\pi},\boldsymbol{\tau})\right]+\mathrm{const.}\\&=\sum_{dnkm}z_{dnkm}\log\rho_{dnkm}+\mathrm{const.}\end{aligned}\tag{12}$$
where $\rho_{dnkm}$ is defined as
$$\log\rho_{dnkm}=\mathbb{E}_{\boldsymbol{\pi}_{d}}[\log\pi_{dk}]+\mathbb{E}_{\boldsymbol{\tau}_{k}}[\log\tau_{km}]+\mathbb{E}_{\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k}}\left[\log\mathcal{N}\!\left(\boldsymbol{x}_{dnm}\mid\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k}^{-1}\right)\right]\tag{13}$$
with $\boldsymbol{x}_{dnm}=\boldsymbol{x}_{dn}-\boldsymbol{o}_{m}$. Consequently, $q^{\ast}(\boldsymbol{Z})$ is obtained as a product of multinomial distributions:
$$q^{\ast}(\boldsymbol{Z})=\prod_{dnkm}\gamma_{dnkm}^{z_{dnkm}}\tag{14}$$
where $\gamma_{dnkm}=\rho_{dnkm}/\sum_{km}\rho_{dnkm}$ is called a responsibility; it indicates how likely it is that observed frequency $\boldsymbol{x}_{dn}$ was generated from harmonic partial $m$ of basis $k$. Let $n_{dkm}$ be the number of frequencies that were generated from harmonic partial $m$ of basis $k$ in frame $d$. This number and its expected value can be calculated as follows:
$$n_{dkm}=\sum_{n}z_{dnkm}\qquad\mathbb{E}[n_{dkm}]=\sum_{n}\gamma_{dnkm}\tag{15}$$
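As a sketch, the VB-E step can be written down for the one-dimensional case as follows (NumPy/SciPy). The function and variable names are ours, the toy call uses invented values, and the expectation of $\log|\boldsymbol{\Lambda}_k|$ uses the standard 1-D Wishart identity.

```python
import numpy as np
from scipy.special import digamma

def vb_e_step(x, alpha_d, beta, m, b, W, c, offsets):
    """Sketch of the VB-E step in one dimension: compute the
    responsibilities gamma of Eq. (14) for a single frame.
    alpha_d: (K,) Dirichlet parameters of q(pi_d)
    beta:    (K, M) Dirichlet parameters of q(tau_k)
    m, b, W, c: (K,) Gaussian-Wishart parameters of q(mu_k, Lambda_k)
    offsets: (M,) harmonic offsets o_m in cents."""
    K, M = beta.shape
    log_rho = np.empty((len(x), K, M))
    e_log_pi = digamma(alpha_d) - digamma(alpha_d.sum())              # Eq. (31)
    e_log_tau = digamma(beta) - digamma(beta.sum(1, keepdims=True))   # Eq. (32)
    for k in range(K):
        # E[log Lambda_k] for a one-dimensional Wishart posterior
        e_log_det = digamma(c[k] / 2.0) + np.log(2.0 * W[k])
        for j in range(M):
            xc = x - offsets[j] - m[k]                    # x_dnm - m_k
            quad = c[k] * W[k] * xc ** 2 + 1.0 / b[k]     # Eq. (33)
            log_rho[:, k, j] = (e_log_pi[k] + e_log_tau[k, j]
                                - 0.5 * np.log(2.0 * np.pi)
                                + 0.5 * e_log_det - 0.5 * quad)
    log_rho -= log_rho.max(axis=(1, 2), keepdims=True)    # for stability
    rho = np.exp(log_rho)
    return rho / rho.sum(axis=(1, 2), keepdims=True)      # gamma, Eq. (14)

# Toy call: 3 observations, K = 2 bases, M = 2 partials.
gamma = vb_e_step(np.array([4400.0, 5600.0, 5150.0]),
                  alpha_d=np.ones(2), beta=np.ones((2, 2)),
                  m=np.array([4400.0, 5150.0]), b=np.ones(2),
                  W=np.full(2, 1.0 / 400.0), c=np.full(2, 3.0),
                  offsets=np.array([0.0, 1200.0]))
assert np.allclose(gamma.sum(axis=(1, 2)), 1.0)
```

Normalizing in the log domain before exponentiating avoids underflow when one partial dominates.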

For convenience in executing the VB-M step, we calculate several sufficient statistics as follows:
$$\mathbb{S}_{k}[1]\equiv\sum_{dnm}\gamma_{dnkm}\tag{16}$$
$$\mathbb{S}_{k}[\boldsymbol{x}]\equiv\sum_{dnm}\gamma_{dnkm}\boldsymbol{x}_{dnm}\tag{17}$$
$$\mathbb{S}_{k}[\boldsymbol{x}\boldsymbol{x}^{T}]\equiv\sum_{dnm}\gamma_{dnkm}\boldsymbol{x}_{dnm}\boldsymbol{x}_{dnm}^{T}.\tag{18}$$

2) VB-M Step

Similarly, the optimal variational posterior distribution of the parameters $\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda}$ is given by
$$\log q^{\ast}(\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})=\log p(\boldsymbol{\pi})p(\boldsymbol{\tau})+\mathbb{E}_{\boldsymbol{Z}}\left[\log p(\boldsymbol{Z}\mid\boldsymbol{\pi},\boldsymbol{\tau})\right]+\log p(\boldsymbol{\mu},\boldsymbol{\Lambda})+\mathbb{E}_{\boldsymbol{Z}}\left[\log p(\boldsymbol{X}\mid\boldsymbol{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})\right]+\mathrm{const.}\tag{19}$$
This distribution can be factorized into the product of posterior distributions of the respective parameters:
$$q^{\ast}(\boldsymbol{\pi},\boldsymbol{\tau},\boldsymbol{\mu},\boldsymbol{\Lambda})=\prod_{d=1}^{D}q^{\ast}(\boldsymbol{\pi}_{d})\prod_{k=1}^{K}q^{\ast}(\boldsymbol{\tau}_{k})\prod_{k=1}^{K}q^{\ast}(\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k})\tag{20}$$
Since our model is based on conjugate prior distributions, each posterior distribution has the same form as the corresponding prior distribution:
$$q^{\ast}(\boldsymbol{\pi}_{d})=\mathrm{Dir}(\boldsymbol{\pi}_{d}\mid\boldsymbol{\alpha}_{d})\tag{21}$$
$$q^{\ast}(\boldsymbol{\tau}_{k})=\mathrm{Dir}(\boldsymbol{\tau}_{k}\mid\boldsymbol{\beta}_{k})\tag{22}$$
$$q^{\ast}(\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k})=\mathcal{N}\!\left(\boldsymbol{\mu}_{k}\mid\boldsymbol{m}_{k},(b_{k}\boldsymbol{\Lambda}_{k})^{-1}\right)\mathcal{W}(\boldsymbol{\Lambda}_{k}\mid\boldsymbol{W}_{k},c_{k})\tag{23}$$
where the variational parameters are given by
$$\alpha_{dk}=\alpha\nu_{k}+\mathbb{E}[n_{dk\cdot}]\tag{24}$$
$$\beta_{km}=\beta\upsilon_{m}+\mathbb{E}[n_{\cdot km}]\tag{25}$$
$$b_{k}=b_{0}+\mathbb{S}_{k}[1]\tag{26}$$
$$c_{k}=c_{0}+\mathbb{S}_{k}[1]\tag{27}$$
$$\boldsymbol{m}_{k}=\frac{b_{0}\boldsymbol{m}_{0}+\mathbb{S}_{k}[\boldsymbol{x}]}{b_{0}+\mathbb{S}_{k}[1]}=\frac{b_{0}\boldsymbol{m}_{0}+\mathbb{S}_{k}[\boldsymbol{x}]}{b_{k}}\tag{28}$$
$$\boldsymbol{W}_{k}^{-1}=\boldsymbol{W}_{0}^{-1}+b_{0}\boldsymbol{m}_{0}\boldsymbol{m}_{0}^{T}+\mathbb{S}_{k}[\boldsymbol{x}\boldsymbol{x}^{T}]-b_{k}\boldsymbol{m}_{k}\boldsymbol{m}_{k}^{T}\tag{29}$$
Here we have introduced a dot notation for improved readability: a dot “$\cdot$” denotes the sum over that index. For convenience in the subsequent sections, we also introduce notations using comparison operators ($>$ and $\geq$). For example, we write
$$n_{dk\cdot}=\sum_{m^{\prime}}n_{dkm^{\prime}}\qquad n_{dk>m}=\sum_{m^{\prime}>m}n_{dkm^{\prime}}.\tag{30}$$
The three terms of (13) can therefore be calculated as follows:
$$\mathbb{E}_{\boldsymbol{\pi}_{d}}[\log\pi_{dk}]=\psi(\alpha_{dk})-\psi\!\left(\sum_{k=1}^{K}\alpha_{dk}\right)\tag{31}$$
$$\mathbb{E}_{\boldsymbol{\tau}_{k}}[\log\tau_{km}]=\psi(\beta_{km})-\psi\!\left(\sum_{m=1}^{M}\beta_{km}\right)\tag{32}$$
$$\mathbb{E}_{\boldsymbol{\mu},\boldsymbol{\Lambda}}\left[\log\mathcal{N}\!\left(\boldsymbol{x}_{dnm}\mid\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k}^{-1}\right)\right]=-\frac{1}{2}\log(2\pi)+\frac{1}{2}\mathbb{E}_{\boldsymbol{\Lambda}_{k}}\left[\log|\boldsymbol{\Lambda}_{k}|\right]-\frac{1}{2}c_{k}(\boldsymbol{x}_{dnm}-\boldsymbol{m}_{k})^{T}\boldsymbol{W}_{k}(\boldsymbol{x}_{dnm}-\boldsymbol{m}_{k})-\frac{1}{2b_{k}}\tag{33}$$
where $\psi$ is the digamma function, defined as the logarithmic derivative of the gamma function.
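The corresponding VB-M step can be sketched for the one-dimensional, single-frame case (so the sums over $d$ reduce to sums over $n$). All names are ours, and the flat hyperparameters `alpha0`, `beta0` stand in for $\alpha\nu_k$ and $\beta\upsilon_m$.

```python
import numpy as np

def vb_m_step(x, gamma, offsets, alpha0, beta0, m0, b0, W0, c0):
    """Sketch of the VB-M step in one dimension for a single frame:
    accumulate the sufficient statistics (16)-(18) and apply the
    variational parameter updates (24)-(29)."""
    N, K, M = gamma.shape
    # Eqs. (24)-(25): Dirichlet pseudo-counts plus expected counts
    alpha_d = alpha0 + gamma.sum(axis=(0, 2))          # E[n_{dk.}]
    beta_km = beta0 + gamma.sum(axis=0)                # E[n_{.km}]
    # sufficient statistics, Eqs. (16)-(18), with x_dnm = x_dn - o_m
    xdm = x[:, None] - offsets[None, :]                # (N, M)
    S1 = gamma.sum(axis=(0, 2))                        # S_k[1]
    Sx = np.einsum('nkm,nm->k', gamma, xdm)            # S_k[x]
    Sxx = np.einsum('nkm,nm->k', gamma, xdm ** 2)      # S_k[xx]
    # Eqs. (26)-(29)
    b = b0 + S1
    c = c0 + S1
    m = (b0 * m0 + Sx) / b
    W_inv = 1.0 / W0 + b0 * m0 ** 2 + Sxx - b * m ** 2
    return alpha_d, beta_km, b, c, m, 1.0 / W_inv
```

With hard responsibilities and a nearly flat prior ($b_0\approx 0$), the posterior mean $m_k$ reduces to the sample mean of the frequencies assigned to basis $k$, as expected from (28).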

D. Variational Lower Bound

To judge convergence, we monitor the increase of the variational lower bound, whose maximization is equivalent to minimization of the KL divergence between the true and variational posteriors. Let ${\cal L}$ be the lower bound given by $$\eqalignno{{\cal L}=&\,\BBE\left[\log p({\mmb X},{\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})\right]-\BBE\left[\log q({\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})\right]\cr=&\,\BBE\left[\log p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})\right]+\BBE\left[\log p({\mmb Z}\vert{\mmb\pi},{\mmb\tau})\right]-\BBE\left[\log q({\mmb Z})\right]\cr&+\BBE\left[\log p({\mmb\pi})\right]+\BBE\left[\log p({\mmb\tau})\right]+\BBE\left[\log p({\mmb\mu},{\mmb\Lambda})\right]\cr&-\BBE\left[\log q({\mmb\pi})\right]-\BBE\left[\log q({\mmb\tau})\right]-\BBE\left[\log q({\mmb\mu},{\mmb\Lambda})\right].&\hbox{(34)}}$$ The calculation of these terms is described in Appendix I.



Our goal is to formulate and train nested infinite GMMs without model selection or hyperparameter tuning. To do this, we consider the limit of the nested finite GMMs described in Section IV as both $K$ and $M$ approach infinity. In addition, we put noninformative hyperprior distributions on influential hyperparameters in a hierarchical Bayesian manner and then calculate the posterior distributions of those hyperparameters. As a result of Bayesian inference, likely values are given large posterior probability densities and unlikely values are given small densities. Such informative posterior distributions naturally emerge from noninformative prior distributions as polyphonic spectra are observed; that is, uncertainty decreases as additional information is obtained. In the end we estimate F0s by taking MAP values of the posterior distributions.

A. Mathematical Preparation

We explain the Dirichlet process (DP) and the hierarchical Dirichlet process (HDP), which can be used as nonparametric Bayesian priors in our infinite models. In this section, mathematical symbols are defined according to the conventions customary in the nonparametric Bayesian literature, so these definitions are valid only within this section.

1) Dirichlet Process

The DP and its extensions play important roles in the theory of Bayesian nonparametrics [39]. Formally introduced by Ferguson [40] in 1973, the DP has in the past decade often been used as a building block of infinite mixture models.

A formal definition of the DP is that its marginal distributions must be Dirichlet distributed [40]. Let $\alpha$ be a positive real number and $G_{0}$ be a distribution over a sample space $\Theta$. We say a random distribution $G$ over $\Theta$ is DP distributed with concentration parameter $\alpha$ and base measure $G_{0}$ if $$\displaylines{\left(G(A_{1}),G(A_{2}),\ldots, G(A_{K})\right)\hfill\cr\hfill\sim{\rm Dir}\left(\alpha G_{0}(A_{1}),\alpha G_{0}(A_{2}),\ldots, \alpha G_{0}(A_{K})\right)\quad\hbox{(35)}}$$ for any finite measurable partition $\{A_{1},A_{2},\ldots, A_{K}\}$ of $\Theta$. The DP is thus a distribution over distributions. This is written $$G\sim{\rm DP}(\alpha,G_{0}).\eqno{\hbox{(36)}}$$ A concrete sample $\theta\in\Theta$ is then drawn from $G$ as follows: $$\theta\sim G.\eqno{\hbox{(37)}}$$

An alternative, constructive definition of the DP is known as the stick-breaking construction (SBC) [41]. As illustrated in Fig. 5, a random distribution $G$ can be written explicitly as a countably infinite sum of point masses (“atoms”): $$\eqalignno{\theta_{k}\sim&\,G_{0}&\hbox{(38)}\cr G(\theta)=&\,\sum_{k=1}^{\infty}\pi_{k}\delta_{\theta_{k}}(\theta)&\hbox{(39)}}$$ where $\delta_a(x)$ is the Dirac delta function that diverges to positive infinity at $x=a$, is otherwise equal to 0, and integrates to 1 with respect to $x$. The point mass $\pi_{k}$ of $\theta_{k}$ is given by $$\eqalignno{\pi_{k}^{\prime}\sim&\,{\rm Beta}(1,\alpha)&\hbox{(40)}\cr\pi_{k}=&\,\pi_{k}^{\prime}\prod_{i=1}^{k-1}\left(1-\pi^{\prime}_{i}\right).&\hbox{(41)}}$$ The distribution on the infinite number of mixing weights ${\mmb\pi}=\{\pi_{1},\pi_{2},\ldots, \pi_{\infty}\}$ is often written ${\mmb\pi}\sim{\rm GEM}(\alpha)$, where the letters stand for Griffiths, Engen, and McCloskey.
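The stick-breaking construction of (40) and (41) can be sketched directly in Python. The following truncated sampler of ${\mmb\pi}\sim{\rm GEM}(\alpha)$ is an illustrative sketch only; the truncation level and function name are our own choices.

```python
import random

def gem_weights(alpha, truncation=1000, rng=random):
    """Truncated stick-breaking sample of pi ~ GEM(alpha), Eqs. (40)-(41)."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)   # pi'_k ~ Beta(1, alpha)
        weights.append(remaining * v)     # pi_k = pi'_k * prod_i (1 - pi'_i)
        remaining *= 1.0 - v              # length of the remaining stick
    return weights

random.seed(0)
pi = gem_weights(2.0)
print(sum(pi))  # close to 1 for a deep truncation
```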

Figure 5
Fig. 5. Stick-breaking construction of the Dirichlet process. Starting with a stick of length 1, we break it at $\pi_{1}^{\prime}\sim{\rm Beta}(1,\alpha)$ and assign $\pi_{1}$ to be the length of the stick we just broke off. We obtain the infinite number of mixing weights, $\{\pi_{2},\pi_{3},\ldots, \pi_{\infty}\}$, by breaking the remaining portion recursively.
Figure 6
Fig. 6. Discreteness property of the Dirichlet process. $G$ becomes an infinite-dimensional discrete distribution when $G_{0}$ is a continuous distribution. The smaller $\alpha$ is, the fewer the atoms in $G$ that occupy most of its total probability mass; that is, $G$ becomes sparser.

An important property of the DP is that $G$ is almost surely a discrete distribution. As shown in Fig. 6, $G$ is an infinite-dimensional discrete distribution when $G_{0}$ is a continuous distribution. The DP can therefore be used as a prior distribution to formulate an infinite mixture model. In the case of an infinite GMM (iGMM), for example, $\Theta$ is a space of Gaussians (i.e., a space of means and variances), and $G_{0}$ is usually set to a Gaussian–Wishart distribution, a conjugate prior distribution over Gaussians. $G$ drawn from the DP is also a distribution over Gaussians. Every time an observation is generated, a Gaussian $\theta\in\Theta$ is drawn from $G$; that is, $\theta$ is selected from the infinite number of Gaussians $\{\theta_{1},\theta_{2},\ldots, \theta_{\infty}\}$ according to their probabilities $\{\pi_{1},\pi_{2},\ldots, \pi_{\infty}\}$. This is a straightforward extension of a conventional finite GMM.
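The discreteness of $G$ can be illustrated by lazily instantiating atoms and stick-breaking weights while sampling: repeated draws from $G$ hit the same atoms again and again. The sketch below is our own illustrative code, not part of the paper's method.

```python
import random

random.seed(1)

def draw_from_dp(alpha, base_sampler, n_draws):
    """Draw n_draws samples from G ~ DP(alpha, G0), instantiating the
    stick-breaking atoms (from G0) and weights (GEM(alpha)) on demand."""
    atoms, weights, remaining = [], [], 1.0
    samples = []
    for _ in range(n_draws):
        u = random.random()          # inverse-CDF sampling over the weights
        acc, idx = 0.0, 0
        while True:
            if idx == len(atoms):    # instantiate a new atom on demand
                v = random.betavariate(1.0, alpha)
                weights.append(remaining * v)
                remaining *= 1.0 - v
                atoms.append(base_sampler())
            acc += weights[idx]
            if u < acc or remaining < 1e-12:
                break
            idx += 1
        samples.append(atoms[idx])
    return samples

samples = draw_from_dp(1.0, random.random, 100)
print(len(set(samples)))  # far fewer distinct values than 100 draws
```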

Several extensions that increase the degrees of freedom of the standard DP have been proposed. For example, a beta two-parameter process [42] is obtained when $$\pi_{k}^{\prime}\sim{\rm Beta}(\alpha,\beta)\eqno{\hbox{(42)}}$$ where the positive real numbers $\alpha$ and $\beta$ are adjustable parameters of the beta distribution.

2) Hierarchical Dirichlet Process

We discuss how to simultaneously train tied infinite mixture models when the observed data consist of multiple groups, e.g., spectral strips (frames). Here a set of component distributions should be shared across the mixture models trained for different groups. Such parameter tying enables us to directly compare the compositions of different groups in terms of the mixing weights of the component distributions. This is similar to vector quantization (VQ) [43]. Let $N$ be the number of groups. In this setting, it is natural to use a DP for modeling the observed data of each group as follows: $$G_{n}\sim{\rm DP}(\alpha,G_{0})\quad(1\leq n\leq N)\eqno{\hbox{(43)}}$$ where $G_{n}$ is a random distribution on $\Theta$ for group $n$.

A problem is that if $G_{0}$ is a continuous distribution, the atoms (component distributions) drawn from $G_{n}$ for generating observations are almost surely disjoint from those drawn from $G_{n^{\prime}}$ $(n^{\prime}\neq n)$. This is because the $N$ DPs independently determine the positions of the countably infinite number of discrete atoms ${\mmb\theta}=\{\theta_{1},\ldots, \theta_{\infty}\}$ (cardinality $\aleph_{0}$) in the uncountably infinite continuous space $\Theta$ (cardinality $\aleph$).

Figure 7
Fig. 7. Overview of the hierarchical Dirichlet process. $G_{0}$ becomes an infinite-dimensional discrete distribution when $H$ is a continuous distribution. The smaller $\alpha$ is, the fewer the atoms in $G_{n}$ that occupy most of its total probability mass; that is, $G_{n}$ becomes sparser.

To solve this problem, we use an HDP [44] as a nonparametric prior distribution. As shown in Fig. 7, we consider the base measure $G_{0}$ itself to be distributed according to a top-level DP as follows: $$G_{0}\sim{\rm DP}(\gamma,H)\eqno{\hbox{(44)}}$$ where $\gamma$ is a concentration parameter and $H$ is a base measure over $\Theta$. In this model, $G_{0}$ is always a discrete distribution. The SBC of the top-level DP is given by $$\eqalignno{\theta_{k}\sim&\,H&\hbox{(45)}\cr G_{0}(\theta)=&\,\sum_{k=1}^{\infty}\pi_{k}\delta_{\theta_{k}}(\theta)&\hbox{(46)}}$$ where ${\mmb\pi}=\{\pi_{1},\ldots, \pi_{\infty}\}$ and ${\mmb\theta}=\{\theta_{1},\ldots, \theta_{\infty}\}$ are the point masses and positions of the atoms and ${\mmb\pi}\sim{\rm GEM}(\gamma)$. Similarly, the SBC of a lower-level DP is given by $$\eqalignno{\theta_{nk}\sim&\,G_{0}&\hbox{(47)}\cr G_{n}(\theta)=&\,\sum_{k=1}^{\infty}\pi_{nk}\delta_{\theta_{nk}}(\theta)&\hbox{(48)}}$$ where ${\mmb\pi}_{n}\sim{\rm GEM}(\alpha)$ and each $\theta_{nk}$ is selected from ${\mmb\theta}$. Note that $\theta_{nk}$ can be equal to $\theta_{nk^{\prime}}$ even if $k\neq k^{\prime}$ because $G_{0}$ is a discrete distribution. Another, direct representation based on the ${\mmb\theta}$ determined by the top-level DP is as follows: $$G_{n}(\theta)=\sum_{k=1}^{\infty}\pi_{nk}^{\ast}\delta_{\theta_{k}}(\theta)\eqno{\hbox{(49)}}$$ where ${\mmb\pi}_{n}^{\ast}\sim{\rm DP}(\alpha,{\mmb\pi})$ and the hyperparameter $\alpha$ controls the difference between ${\mmb\pi}$ and ${\mmb\pi}_{n}^{\ast}$. Therefore, only the point masses ${\mmb\pi}_{n}^{\ast}$ (mixing weights) differ between groups while the positions ${\mmb\theta}$ (component distributions) are shared across groups.
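The direct representation (49) can be sketched numerically: a truncated global weight vector ${\mmb\pi}\sim{\rm GEM}(\gamma)$ is shared by all groups, and each group draws its own weights ${\mmb\pi}_{n}^{\ast}\sim{\rm DP}(\alpha,{\mmb\pi})$, which for a truncated discrete base reduces to a finite Dirichlet draw. This is an illustrative sketch under our own truncation, not the paper's inference code.

```python
import random

random.seed(0)

def gem(alpha, K):
    """Truncated GEM(alpha) stick-breaking weights (leftover mass folded into the tail)."""
    w, rem = [], 1.0
    for _ in range(K):
        v = random.betavariate(1.0, alpha)
        w.append(rem * v)
        rem *= 1.0 - v
    w[-1] += rem
    return w

def dirichlet(concentrations):
    """Sample from Dir(concentrations) via normalized gamma draws."""
    g = [random.gammavariate(a, 1.0) if a > 0 else 0.0 for a in concentrations]
    s = sum(g)
    return [x / s for x in g]

K = 50
pi = gem(1.0, K)                 # global weights: pi ~ GEM(gamma), gamma = 1
alpha = 5.0                      # lower-level concentration
groups = [dirichlet([alpha * p for p in pi]) for _ in range(3)]  # pi*_n ~ DP(alpha, pi)
```

Each group reuses the same global atoms; only its weight vector differs, and larger `alpha` pulls the group weights closer to the shared `pi`.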

A remaining problem is how to adjust the influential hyperparameters $\alpha$ and $\gamma$. This problem is often solved by putting vague gamma hyperprior distributions on these hyperparameters and inferring their posterior distributions.

B. Model Formulation

We explain how to formulate nested infinite GMMs based on an HDP and generalized DPs by extending the nested finite GMMs described in Section IV.

First we discuss $K\rightarrow\infty$. An important requirement is that the basis models [harmonic GMMs represented by (1)] be shared as a global set across all $D$ frames, because each basis sound has a duration and may appear in different frames while only its weight varies. The HDP satisfies this requirement, and it can be explained from the generative point of view: after an unbounded number of bases are initially generated according to a top-level DP, an unbounded number of bases are selected in each frame according to a frame-specific DP. In practice, a limited number of bases are used to represent a spectral strip because the number of observed frequency particles $(n_{d\cdot\cdot})$ is limited. Mathematically speaking, in (6) we consider infinite-dimensional Dirichlet distributions, which are equivalent to the frame-specific DPs, and assume the hyperparameter ${\mmb\nu}$ to be distributed according to the top-level DP as follows: $$\eqalignno{\tilde{\nu}_{k}\sim&\,{\rm Beta}(1,\gamma)&\hbox{(50)}\cr\nu_{k}=&\,\tilde{\nu}_{k}\prod_{k^{\prime}=1}^{k-1}(1-\tilde{\nu}_{k^{\prime}})&\hbox{(51)}}$$ where $\gamma$ is the concentration parameter of the top-level DP.

Now we discuss $M\rightarrow\infty$. Because each basis is allowed to consist of a unique infinite set of harmonic partials (the basis models are independent of each other), instead of (7) we can use beta two-parameter processes as follows: $$\eqalignno{\tilde{\tau}_{km}\sim&\,{\rm Beta}(\beta\lambda_{1},\beta\lambda_{2})&\hbox{(52)}\cr\tau_{km}=&\,\tilde{\tau}_{km}\prod_{m^{\prime}=1}^{m-1}(1-\tilde{\tau}_{km^{\prime}})&\hbox{(53)}}$$ where $\beta$ is a positive real number and we let $\lambda_{1}$ and $\lambda_{2}$ sum to unity. Note that we use the size-biased permutation property of the SBC to encourage lower harmonic partials to have larger weights because, roughly speaking, the weights of the harmonic partials of an instrument sound decrease exponentially.
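Under (52) and (53) the stick draws are independent with $\BBE[\tilde{\tau}_{km}]=\lambda_{1}$ (since $\lambda_{1}+\lambda_{2}=1$), so the expected partial weights decay geometrically: $\BBE[\tau_{km}]=\lambda_{1}\lambda_{2}^{m-1}$. A short numerical check (the particular values of $\lambda_{1},\lambda_{2}$ are arbitrary, chosen only for illustration):

```python
# Expected harmonic-partial weights under (52)-(53): E[tau~_km] = lambda1,
# hence E[tau_km] = lambda1 * lambda2**(m-1), a geometric decay over partials.
lam1, lam2 = 0.6, 0.4          # must sum to unity
expected = [lam1 * lam2 ** (m - 1) for m in range(1, 11)]
print(expected[:3])            # strictly decreasing: 0.6, 0.24, 0.096...
```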

Because the hyperparameters $\alpha$, $\beta$, $\gamma$, and ${\mmb\lambda}$ are influential, we put hyperprior distributions on them as follows: $$\eqalignno{p(\alpha)=&\,{\rm Gam}(\alpha\vert a_{\alpha},b_{\alpha})&\hbox{(54)}\cr p(\gamma)=&\,{\rm Gam}(\gamma\vert a_{\gamma},b_{\gamma})&\hbox{(55)}\cr p(\beta)=&\,{\rm Gam}(\beta\vert a_{\beta},b_{\beta})&\hbox{(56)}\cr p({\mmb\lambda})=&\,{\rm Beta}(\lambda_{1}\vert u_{1},u_{2})&\hbox{(57)}}$$ where $a_{\{\alpha,\beta,\gamma\}}$ and $b_{\{\alpha,\beta,\gamma\}}$ are the shape and rate parameters of the gamma distributions and $u_{1}$ and $u_{2}$ are the parameters of the beta distribution. These distributions are set to be vague ($a_{\{\alpha,\beta,\gamma\}}=1.0$, $b_{\{\alpha,\beta,\gamma\}}=0.001$, and $u_{1}=u_{2}=1.0$ in the experiments described in Section VI).

Figure 8
Fig. 8. Graphical representation of the nested infinite Gaussian mixture models for iLHA. First the infinite sets of mixing weights ${\mmb\pi}$ and ${\mmb\tau}$ are stochastically generated according to an HDP and beta two-parameter processes (generalized DPs). At the same time, an infinite number of Gaussian distributions are stochastically generated according to a Gaussian–Wishart prior distribution. Then one of the harmonic partials contained in one of the bases is stochastically selected as a latent variable ${\mmb z}_{dn}$ according to the multinomial distributions defined by ${\mmb\pi}$ and ${\mmb\tau}$. Finally, a frequency ${\mmb x}_{dn}$ is stochastically generated according to the Gaussian distribution specified by ${\mmb z}_{dn}$.

Fig. 8 shows a graphical representation of the iLHA model. The full joint distribution is given by $$\displaylines{p({\mmb X},{\mmb Z},{\mmb\pi},\tilde{\mmb\tau},{\mmb\mu},{\mmb\Lambda},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu})=p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})p({\mmb\mu},{\mmb\Lambda})\hfill\cr\hfill\times p({\mmb Z}\vert{\mmb\pi},\tilde{\mmb\tau})p({\mmb\pi}\vert\alpha,\tilde{\mmb\nu})p(\tilde{\mmb\tau}\vert\beta,{\mmb\lambda})p(\alpha)p(\beta)p(\gamma)p({\mmb\lambda})p(\tilde{\mmb\nu}\vert\gamma)\quad\hbox{(58)}}$$ where $p({\mmb Z}\vert{\mmb\pi},\tilde{\mmb\tau})$ is given by plugging (52) into (5) and $p({\mmb\pi}\vert\alpha,\tilde{\mmb\nu})$ is given by (6). $p(\tilde{\mmb\nu}\vert\gamma)$ and $p(\tilde{\mmb\tau}\vert\beta,{\mmb\lambda})$ are defined according to (50) and (52) as follows: $$\eqalignno{p(\tilde{\mmb\nu}\vert\gamma)=&\,\prod_{k}{\rm Beta}(\tilde{\nu}_{k}\vert 1,\gamma)&\hbox{(59)}\cr p(\tilde{\mmb\tau}\vert\beta,{\mmb\lambda})=&\,\prod_{km}{\rm Beta}(\tilde{\tau}_{km}\vert\beta\lambda_{1},\beta\lambda_{2}).&\hbox{(60)}}$$

C. Collapsed Variational Bayesian Inference

Figure 9
Fig. 9. Graphical representation of the collapsed nested infinite mixture models for iLHA. After the original parameters ${\mmb\pi}$, $\tilde{\mmb\tau}$, ${\mmb\mu}$, and ${\mmb\Lambda}$ are integrated out, the auxiliary variables ${\mmb\eta}$, ${\mmb\xi}$, ${\mmb s}$, and ${\mmb t}$ are introduced to set up conjugacy between the hyperprior distributions and the marginalized likelihood function.

There are two problems in training the HDP mixture model. The first is that VB needs to assume independence between latent variables and parameters to factorize the posterior distribution as in (9). This assumption is sometimes too strong and leads to inaccurate posterior approximations. The second is that applying VB to hierarchical Bayesian models that have no conjugacy between priors and hyperpriors is generally difficult.

To solve these problems, we use a sophisticated version of VB called collapsed variational Bayes (CVB) [45]. It instead assumes independence between individual latent variables in a “collapsed” space in which the parameters are integrated out (marginalized). This is reasonable because the dependence between individual latent variables in the collapsed space is generally much weaker than that between the set of parameters and the set of latent variables in the non-collapsed space. In addition, we introduce auxiliary variables so that CVB can be applied to hierarchical Bayesian models.

Fig. 9 shows a graphical representation of the collapsed iLHA model. Integrating out ${\mmb\pi}$, $\tilde{\mmb\tau}$, ${\mmb\mu}$, and ${\mmb\Lambda}$, we obtain the marginal distribution given by $$\displaylines{p({\mmb X},{\mmb Z},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu})=p({\mmb X}\vert{\mmb Z})p({\mmb Z}\vert\alpha,\beta,{\mmb\lambda},\tilde{\mmb\nu})\hfill\cr\hfill\times p(\alpha)p(\beta)p(\gamma)p({\mmb\lambda})p(\tilde{\mmb\nu}\vert\gamma).\quad\hbox{(61)}}$$ The first term of (61) can easily be calculated by leveraging the conjugacy between $p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})$ and $p({\mmb\mu},{\mmb\Lambda})$ as follows: $$p({\mmb X}\vert{\mmb Z})=(2\pi)^{-{n_{\cdots}\over 2}}\prod_{k}\left({b_{0}\over b_{zk}}\right)^{1\over 2}{B({\mmb W}_{0},c_{0})\over B({\mmb W}_{zk},c_{zk})}\eqno{\hbox{(62)}}$$ where $B({\mmb W}_{0},c_{0})$ and $B({\mmb W}_{zk},c_{zk})$ are the normalization factors of the prior and posterior Gaussian–Wishart distributions. $b_{zk}$, $c_{zk}$, and ${\mmb W}_{zk}$ are obtained by substituting $z_{dnkm}$ for $\gamma_{dnkm}$ when calculating (26), (27), and (29).
Similarly, the second term of (61) can be calculated by leveraging the conjugacy between $p({\mmb Z}\vert{\mmb\pi},\tilde{\mmb\tau})$ and $p({\mmb\pi}\vert\alpha,\tilde{\mmb\nu})p(\tilde{\mmb\tau}\vert\beta,{\mmb\lambda})$ as follows: $$\displaylines{p({\mmb Z}\vert\alpha,\beta,{\mmb\lambda},\tilde{\mmb\nu})=\prod_{d}{\Gamma(\alpha)\over\Gamma(\alpha+n_{d\cdot\cdot})}\prod_{k}{\Gamma(\alpha\nu_{k}+n_{dk\cdot})\over\Gamma(\alpha\nu_{k})}\hfill\cr\hfill\times\prod_{km}{\Gamma(\beta)\Gamma(\beta\lambda_{1}+n_{\cdot km})\Gamma(\beta\lambda_{2}+n_{\cdot k>m})\over\Gamma(\beta\lambda_{1})\Gamma(\beta\lambda_{2})\Gamma(\beta+n_{\cdot k\geq m})}\quad\hbox{(63)}}$$ where $\Gamma$ is the gamma function.

We then introduce auxiliary variables by using a technique called data augmentation [45]. Let $\eta_{d}$ and $\xi_{km}$ be beta-distributed variables and $s_{dk}$ and ${\mmb t}_{km}$ be positive integers that satisfy $1\leq s_{dk}\leq n_{dk\cdot}$, $1\leq t_{km1}\leq n_{\cdot km}$, and $1\leq t_{km2}\leq n_{\cdot k>m}$. We can augment (63) as follows: $$\eqalignno{&p({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t}\vert\alpha,\beta,{\mmb\lambda},\tilde{\mmb\nu})\cr&\quad=\prod_{d}{\eta_{d}^{\alpha-1}(1-\eta_{d})^{n_{d\cdot\cdot}-1}\over\Gamma(n_{d\cdot\cdot})}\prod_{k}{n_{dk\cdot}\brack s_{dk}}(\alpha\nu_{k})^{s_{dk}}\cr&\qquad\times\prod_{km}{\xi_{km}^{\beta-1}(1-\xi_{km})^{n_{\cdot k\geq m}-1}\over\Gamma(n_{\cdot k\geq m})}\cr&\qquad\times{n_{\cdot km}\brack t_{km1}}(\beta\lambda_{1})^{t_{km1}}{n_{\cdot k>m}\brack t_{km2}}(\beta\lambda_{2})^{t_{km2}}&\hbox{(64)}}$$ where ${n\brack s}$ denotes an unsigned Stirling number of the first kind. We can confirm that (64) reduces to (63) by marginalizing out the auxiliary variables ${\mmb\eta}$, ${\mmb\xi}$, ${\mmb s}$, and ${\mmb t}$. The augmented marginal distribution is given by $$\displaylines{p({\mmb X},{\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu})=p({\mmb X}\vert{\mmb Z})\hfill\cr\hfill\times p({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t}\vert\alpha,\beta,{\mmb\lambda},\tilde{\mmb\nu})p(\alpha)p(\beta)p(\gamma)p({\mmb\lambda})p(\tilde{\mmb\nu}\vert\gamma).\quad\hbox{(65)}}$$
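The reduction of (64) to (63) rests on the classical identity $\sum_{s}{n\brack s}a^{s}=a(a+1)\cdots(a+n-1)$ relating unsigned Stirling numbers of the first kind to rising factorials, which is what marginalizing $s_{dk}$ (and likewise ${\mmb t}_{km}$) exploits. The following sketch verifies the identity numerically (our own illustrative code):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling1(n, s):
    """Unsigned Stirling number of the first kind [n s],
    via the recurrence [n s] = (n-1)*[n-1 s] + [n-1 s-1]."""
    if n == s:
        return 1
    if s == 0 or s > n:
        return 0
    return (n - 1) * stirling1(n - 1, s) + stirling1(n - 1, s - 1)

def rising_factorial(a, n):
    """a (a+1) ... (a+n-1)."""
    r = 1.0
    for i in range(n):
        r *= a + i
    return r

# sum_s [n s] a^s equals the rising factorial a(a+1)...(a+n-1)
n, a = 6, 2.5
lhs = sum(stirling1(n, s) * a ** s for s in range(n + 1))
print(lhs, rising_factorial(a, n))
```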

To apply CVB to approximate the true posterior distribution $p({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu}\vert{\mmb X})$, we assume that the variational posterior distribution can be factorized as follows: $$\displaylines{q({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu})=q(\alpha,\beta,\gamma,{\mmb\lambda})\hfill\cr\hfill\times q(\tilde{\mmb\nu})q({\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t}\vert{\mmb Z})\prod_{dn}q({\mmb z}_{dn})\quad\hbox{(66)}}$$ where we assumed independence between the hyperparameters, the auxiliary variables, and the elements of ${\mmb Z}$. We also use an approximation technique called variational posterior truncation; more specifically, we assume that $q(z_{dnkm})=0$ when $k>K^{+}$ or $m>M^{+}$. In practice, we set $K^{+}$ and $M^{+}$ to sufficiently large integers. This does not mean that the effective model complexities are fixed in advance: the larger the truncation levels, the more accurate the approximation becomes.

To optimize $q({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu})$, we use a variational EM algorithm that iterates the following steps: $$\eqalignno{q^{\ast}({\mmb z}_{dn})\propto&\,\exp\left(\BBE_{{\mmb Z}^{\neg dn},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu}}\left[\log\ {\rm Eqn.}\ (61)\right]\right)&\hbox{(67)}\cr q^{\ast}(\alpha,\beta,\gamma,{\mmb\lambda})\propto&\,\exp\left(\BBE_{{\mmb Z},{\mmb\eta},{\mmb s},{\mmb\xi},{\mmb t},\tilde{\mmb\nu}}\left[\log\ {\rm Eqn.}\ (65)\right]\right)&\hbox{(68)}\cr q^{\ast}(\tilde{\mmb\nu})\propto&\,\exp\left(\BBE_{{\mmb Z},{\mmb\eta},{\mmb s},{\mmb\xi},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda}}\left[\log\ {\rm Eqn.}\ (65)\right]\right)&\hbox{(69)}\cr q^{\ast}({\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t}\vert{\mmb Z})\propto&\,\exp\left(\BBE_{{\mmb Z},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu}}\left[\log\ {\rm Eqn.}\ (65)\right]\right)&\hbox{(70)}}$$ where $\neg dn$ denotes the set of indices excluding $d$ and $n$.

D. Variational Posterior Distributions

We derive the formulas for updating the variational posterior distributions according to (67)–(70).

1) CVB-E Step

An optimal variational distribution of ${\mmb Z}$ can be obtained as a product of multinomial distributions. The posterior probability that ${\mmb x}_{dn}$ was generated from the $m$th harmonic partial of basis $k$ is given by $$\eqalignno{&\log q^{\ast}(z_{dnkm}=1)\cr&\quad=\BBE_{{\mmb z}^{\neg dn}}\left[\log\left(\BBG[\alpha\nu_{k}]+n_{dk\cdot}^{\neg dn}\right)\right]\cr&\qquad+\BBE_{{\mmb z}^{\neg dn}}\left[\log\left({\BBG[\beta\lambda_{1}]+n_{\cdot km}^{\neg dn}\over\BBE[\beta]+n_{\cdot k\geq m}^{\neg dn}}\prod_{m^{\prime}=1}^{m-1}{\BBG[\beta\lambda_{2}]+n_{\cdot k>m^{\prime}}^{\neg dn}\over\BBE[\beta]+n_{\cdot k\geq m^{\prime}}^{\neg dn}}\right)\right]\cr&\qquad+\BBE_{{\mmb z}^{\neg dn}}\left[\log{\cal S}({\mmb x}_{dnm}\vert{\mmb m}_{zk}^{\neg dn},{\mmb L}_{zk}^{\neg dn},c_{zk}^{\neg dn})\right]+{\rm const.}&\hbox{(71)}}$$ where $\BBG[x]$ is the geometric average $(\BBG[x]=\exp(\BBE[\log x]))$ and ${\cal S}$ is the Student-t distribution defined by the three parameters ${\mmb m}_{zk}^{\neg dn}$, ${\mmb L}_{zk}^{\neg dn}$, and $c_{zk}^{\neg dn}$. ${\mmb L}_{zk}^{\neg dn}$ is given by $${\mmb L}_{zk}^{\neg dn}={b_{zk}^{\neg dn}\over 1+b_{zk}^{\neg dn}}c_{zk}^{\neg dn}{\mmb W}_{zk}^{\neg dn}\eqno{\hbox{(72)}}$$ where $b_{zk}^{\neg dn}$, $c_{zk}^{\neg dn}$, ${\mmb m}_{zk}^{\neg dn}$, and ${\mmb W}_{zk}^{\neg dn}$ are obtained according to (26)–(29) in which $z_{dnkm}$ is substituted for $\gamma_{dnkm}$ and the sums are calculated without ${\mmb z}_{dn}$. Each term of (71) can be approximated efficiently by using first-order and second-order Taylor expansions [45], [46], [47].

Equation (71) computes the geometric averages of three predictive distributions under the posterior distributions. These predictive distributions are derived from an infinite-dimensional Dirichlet distribution (a DP for an infinite mixture of iGMMs), a stick-breaking construction (a DP for an iGMM), and a Gaussian distribution. Interestingly, this corresponds to (13), which is based on the geometric averages of three likelihood functions under the posterior distributions. This implies that CVB is more robust to the local-optima problem than standard VB is.

2) CVB-M Step

We can optimize the variational posterior distributions of the hyperparameters analytically by also optimizing those of the auxiliary variables. First, $\alpha$, $\beta$, and $\gamma$ are gamma distributed as follows: $$\eqalignno{q^{\ast}(\alpha)\propto&\,\alpha^{a_{\alpha}+\BBE[s_{\cdot\cdot}]-1}e^{-\alpha\left(b_{\alpha}-\sum_{d}\BBE[\log\eta_{d}]\right)}&\hbox{(73)}\cr q^{\ast}(\beta)\propto&\,\beta^{a_{\beta}+\BBE[t_{\cdots}]-1}e^{-\beta\left(b_{\beta}-\sum_{km}\BBE[\log\xi_{km}]\right)}&\hbox{(74)}\cr q^{\ast}(\gamma)\propto&\,\gamma^{a_{\gamma}+K-1}e^{-\gamma\left(b_{\gamma}-\sum_{k}\BBE\left[\log(1-\tilde{\nu}_{k})\right]\right)}&\hbox{(75)}}$$ and ${\mmb\lambda}$ and $\tilde{\mmb\nu}$ are beta distributed as follows: $$\eqalignno{q^{\ast}({\mmb\lambda})\propto&\,\lambda_{1}^{u_{1}+\BBE[t_{\cdot\cdot 1}]-1}\lambda_{2}^{u_{2}+\BBE[t_{\cdot\cdot 2}]-1}&\hbox{(76)}\cr q^{\ast}(\tilde{\nu}_{k})\propto&\,\tilde{\nu}_{k}^{1+\BBE[s_{\cdot k}]-1}(1-\tilde{\nu}_{k})^{\BBE[\gamma]+\BBE[s_{\cdot>k}]-1}.&\hbox{(77)}}$$ Then ${\mmb\eta}$ and ${\mmb\xi}$ are beta distributed as follows: $$\eqalignno{q^{\ast}(\eta_{d})\propto&\,\eta_{d}^{\BBE[\alpha]-1}(1-\eta_{d})^{n_{d\cdot\cdot}-1}&\hbox{(78)}\cr q^{\ast}(\xi_{km}\vert{\mmb Z})\propto&\,\xi_{km}^{\BBE[\beta]-1}(1-\xi_{km})^{n_{\cdot k\geq m}-1}&\hbox{(79)}}$$ and ${\mmb s}$ and ${\mmb t}$ are multinomially distributed as follows: $$\eqalignno{q^{\ast}(s_{dk}=s\vert{\mmb Z})\propto&\,{n_{dk\cdot}\brack s}\BBG[\alpha\nu_{k}]^{s}&\hbox{(80)}\cr q^{\ast}(t_{km1}=t\vert{\mmb Z})\propto&\,{n_{\cdot km}\brack t}\BBG[\beta\lambda_{1}]^{t}&\hbox{(81)}\cr q^{\ast}(t_{km2}=t\vert{\mmb Z})\propto&\,{n_{\cdot k>m}\brack t}\BBG[\beta\lambda_{2}]^{t}.&\hbox{(82)}}$$

To optimize the variational posterior distributions, we need to calculate the expectations of these variables. If a random variable $x$ follows ${\rm Gam}(x\vert a,b)$ with shape parameter $a$ and rate parameter $b$, its expectations are given by $\BBE[x]=a/b$ and $\BBE[\log x]=\psi(a)-\log(b)$. If $x$ follows ${\rm Beta}(x\vert c,d)$ with parameters $c$ and $d$, its expectations are given by $\BBE[x]=c/(c+d)$ and $\BBE[\log x]=\psi(c)-\psi(c+d)$. Note that the distributions given by (79)–(82) are conditioned on ${\mmb Z}$; their expectations must therefore be averaged over ${\mmb Z}$. For example, we have the following conditional expectation: $$\BBE[\log\xi_{km}\vert{\mmb Z}]=\psi\left(\BBE[\beta]\right)-\psi\left(\BBE[\beta]+n_{\cdot k\geq m}\right).\eqno{\hbox{(83)}}$$ We use a Taylor expansion to average $\BBE[\log\xi_{km}\vert{\mmb Z}]$ over ${\mmb Z}$, but the digamma function $\psi$ diverges to negative infinity much faster than the logarithmic function does in the vicinity of the origin. To solve this problem, we use a method that treats the case $n_{\cdot k\geq m}=0$ exactly and applies a second-order approximation when $n_{\cdot k\geq m}>0$ [45].
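These closed-form expectations are easy to check by Monte Carlo simulation. The sketch below is illustrative only (the numerical digamma via a symmetric difference of `math.lgamma` is our own shortcut, not what the paper uses); it confirms $\BBE[x]=a/b$ and $\BBE[\log x]=\psi(a)-\log b$ for a gamma-distributed variable.

```python
import math
import random

def digamma(x, h=1e-5):
    """Numerical digamma psi(x) via a symmetric difference of log-gamma."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

a, b = 3.0, 2.0                 # Gam(x | a, b) with shape a and rate b
random.seed(0)
# random.gammavariate takes (shape, scale), so scale = 1/rate.
xs = [random.gammavariate(a, 1.0 / b) for _ in range(200000)]
mc_mean = sum(xs) / len(xs)
mc_logmean = sum(math.log(x) for x in xs) / len(xs)
print(mc_mean, a / b)                        # E[x] = a/b
print(mc_logmean, digamma(a) - math.log(b))  # E[log x] = psi(a) - log b
```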
We can similarly average the following conditional expectations: $$\eqalignno{\BBE[s_{dk}\vert{\mmb Z}]=&\,\BBG[\alpha\nu_{k}]\left(\psi\left(\BBG[\alpha\nu_{k}]+n_{dk\cdot}\right)-\psi\left(\BBG[\alpha\nu_{k}]\right)\right)&\hbox{(84)}\cr\BBE[t_{km1}\vert{\mmb Z}]=&\,\BBG[\beta\lambda_{1}]\left(\psi\left(\BBG[\beta\lambda_{1}]+n_{\cdot km}\right)-\psi\left(\BBG[\beta\lambda_{1}]\right)\right)&\hbox{(85)}\cr\BBE[t_{km2}\vert{\mmb Z}]=&\,\BBG[\beta\lambda_{2}]\left(\psi\left(\BBG[\beta\lambda_{2}]+n_{\cdot k>m}\right)-\psi\left(\BBG[\beta\lambda_{2}]\right)\right).&\hbox{(86)}}$$

To estimate the F0s in the end, we explicitly compute the variational posterior distributions of the integrated-out parameters ${\mmb\mu}$ and ${\mmb\Lambda}$. To do this, we execute the standard VB-M step once, using the $q({\mmb Z})$ obtained in the CVB-E step.

E. Variational Lower Bound

As in LHA, we monitor the increase of the variational lower bound of the evidence $p({\mmb X})$, which is given by $$\eqalignno{{\cal L}=&\,\BBE\left[\log p({\mmb X},{\mmb Z},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu})\right]-\BBE\left[\log q({\mmb Z},\alpha,\beta,\gamma,{\mmb\lambda},\tilde{\mmb\nu})\right]\cr=&\,\BBE\left[\log p({\mmb X}\vert{\mmb Z})\right]+\BBE\left[\log p({\mmb Z}\vert\alpha,\beta,{\mmb\lambda},\tilde{\mmb\nu})\right]-\BBE\left[\log q({\mmb Z})\right]\cr&+\BBE\left[\log p(\alpha)\right]+\BBE\left[\log p(\beta)\right]+\BBE\left[\log p(\gamma)\right]\cr&-\BBE\left[\log q(\alpha)\right]-\BBE\left[\log q(\beta)\right]-\BBE\left[\log q(\gamma)\right]\cr&+\BBE\left[\log p({\mmb\lambda})\right]+\BBE\left[\log p(\tilde{\mmb\nu}\vert\gamma)\right]-\BBE\left[\log q({\mmb\lambda})\right]\cr&-\BBE\left[\log q(\tilde{\mmb\nu})\right].&\hbox{(87)}}$$ The calculation of these terms is described in Appendix II.



This section reports the results of two comparative evaluation experiments. We compared LHA and iLHA with PreFEst and HTC because these four methods are based on the same idea for modeling harmonic structures. Using a different data set, we then compared iLHA with NMF-based and other methods. In the latter experiment, we investigated how significantly the value of the scaling factor $\omega$ (i.e., how many frequency particles are assumed to be observed in total) affects the accuracy of multipitch analysis.

A. Comparison with Conventional Parametric Methods

1) Experimental Conditions

We evaluated LHA and iLHA on a test set that was used in [4] and consisted of eight piano and guitar solo performances excerpted from the RWC music database [48]. The first 23 s of each piece were used for evaluation. Spectral analysis with a 16-ms time resolution was conducted using a wavelet transform with Gabor wavelets. The correct values and temporal positions of the actual F0s were prepared by hand as ground truth. Denoting by $g_{d}$, $e_{d}$, and $c_{d}$ the respective numbers of ground-truth, estimated, and correct F0s in frame $d$, we calculated the following frame-level recall rate, precision rate, and F-measure for each piece: $${\cal R}=100\cdot{\sum_{d}c_{d}\over\sum_{d}g_{d}}\quad{\cal P}=100\cdot{\sum_{d}c_{d}\over\sum_{d}e_{d}}\quad{\cal F}={2{\cal RP}\over{\cal R}+{\cal P}}\eqno{\hbox{(88)}}$$ and we averaged each of these measures over all pieces.
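The frame-level measures in (88) can be computed as follows. This is an illustrative sketch; the representation of F0s as per-frame sets (e.g., of MIDI note numbers) is our own choice, not specified in the paper.

```python
def frame_level_scores(ground_truth, estimated):
    """Frame-level recall, precision, and F-measure as in (88).
    ground_truth and estimated are lists (one entry per frame) of sets of
    F0s; an F0 is correct in a frame if it appears in both sets."""
    g = sum(len(gt) for gt in ground_truth)            # sum_d g_d
    e = sum(len(es) for es in estimated)               # sum_d e_d
    c = sum(len(gt & es) for gt, es in zip(ground_truth, estimated))
    R = 100.0 * c / g
    P = 100.0 * c / e
    F = 2 * R * P / (R + P)
    return R, P, F

# Two frames: {C4, E4} vs {C4}, and {G4} vs {G4, B4}
R, P, F = frame_level_scores([{60, 64}, {67}], [{60}, {67, 71}])
print(R, P, F)
```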

The prior and hyperprior distributions of LHA and iLHA were set to noninformative distributions. In LHA, Formula$K$ and Formula$M$ were set to 60 and 15. In iLHA, Formula$K^{+}$ and Formula$M^{+}$ were also set to 60 and 15. iLHA is not sensitive to these values, and no other tuning was needed for either method. To output F0s at each frame, we extracted bases whose expected weights Formula${\mmb\pi}$ were over a threshold that was optimized as in [4].
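The final thresholding step described above can be sketched as follows; the names `pi` (expected weights per frame and basis) and `f0_of_basis` are hypothetical, and the threshold is assumed to be tuned as in [4].

```python
# Report, on each frame, the F0s of the bases whose expected mixture
# weight exceeds a threshold. `pi` is a frames-by-bases array of
# expected weights; `f0_of_basis` maps each basis index to its F0.
def active_f0s(pi, f0_of_basis, threshold):
    detected = []
    for frame_weights in pi:
        detected.append({f0_of_basis[k]
                         for k, w in enumerate(frame_weights)
                         if w > threshold})
    return detected
```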

For comparison, we referred to the PreFEst and HTC experimental results reported in [4]. Although the ground-truth data in that study differed slightly from ours, it was close enough for a rough comparative evaluation. The numbers of bases, the priors, and the weighting factors of PreFEst and HTC were carefully tuned to optimize the results. Although such tuning is unrealistic in practice, it reveals the upper bounds of the potential performance of those methods.

2) Experimental Results

The results listed in Table II show that the performance of iLHA closely approached, and sometimes surpassed, that of HTC. This is consistent with the empirical finding of many studies on Bayesian nonparametrics that nonparametric models are competitive with optimally tuned parametric models. HTC outperformed PreFEst because HTC can appropriately deal with the temporal continuity of spectral bases, which implies that incorporating temporal modeling would also improve the performance of iLHA.

Table II

The results of LHA were worse than those of iLHA because LHA is not formulated in a hierarchical Bayesian manner and therefore requires precise priors. In fact, we confirmed that the results of PreFEst and HTC based on MAP estimation were drastically degraded when noninformative priors were used. The automated iLHA, in contrast, consistently performed well.

We found that model flexibility can be greatly enhanced while time-consuming fine tuning is made unnecessary. Conventional studies assumed that appropriate prior knowledge is required to constrain model flexibility (regularization). By using a truly flexible hierarchical model based on Bayesian nonparametrics, however, we can let the data speak for themselves, which naturally leads to good performance.

B. Comparison With NMF-Based Methods and Other Methods

1) Experimental Conditions

We then evaluated iLHA on a test set that was used in [16] and consisted of 50 piano solo performances excerpted from the MAPS piano database [8]. The first 30 s of each piece were used for evaluation. Spectral analysis with a 10-ms time resolution was conducted using a Gabor wavelet transform. The value of Formula$K^{+}$ was increased to 88 (the number of keys on a standard piano) because the piano pieces were much more sophisticated than those used in the first experiment. The time resolution and the value of Formula$K^{+}$ were equal to those used in [16], and performance was evaluated in terms of F-measures.
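Setting Formula$K^{+}$ to 88 matches one candidate basis per piano key. The corresponding candidate F0s can be enumerated with the standard equal-temperament mapping (A4, key 49, tuned to 440 Hz); this mapping is standard background, not taken from the paper.

```python
# Fundamental frequencies of the 88 piano keys under equal temperament,
# with key 49 (A4) tuned to 440 Hz: f(n) = 440 * 2**((n - 49) / 12).
def piano_f0s():
    return [440.0 * 2.0 ** ((n - 49) / 12.0) for n in range(1, 89)]
```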

For comparison, we referred to the experimental results of seven methods reported in [16]. We compared iLHA with four NMF-based methods: one using no constraints, one using harmonicity constraints (a subset of [13]), one using harmonicity and source-filter constraints [14], and one using harmonicity and spectral smoothness constraints [16]. Note that only the last one was manually tuned to yield the best results (the effect of hyperparameter tuning was investigated in [16]). We also compared it with a method based on harmonic sums [24], a method based on correlograms [25], and a method based on spectral peak clustering [26].

2) Experimental Results

Table III

The results listed in Table III show that iLHA was the second best among all the methods compared. Although the best NMF variant achieved a higher F-measure (67.0%) than iLHA did (61.2%), the fully automated iLHA is still competitive because non-optimal settings were reported to moderately degrade the performance of that NMF variant [16]. The F-measure of iLHA (61.2%) was close to that of NMF using only harmonicity constraints (60.5%). As discussed in Section III-C2, pLSA and PLCA are known to have a close connection to NMF, so the similarity between iLHA based on harmonic GMMs and NMF based on harmonicity constraints is supported both experimentally and theoretically. In addition, the gap between NMF using only harmonicity constraints and NMF adding spectral smoothness constraints implies that the performance of iLHA could be improved by incorporating spectral smoothness modeling.
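The pLSA/PLCA-NMF connection referred to here can be illustrated with the standard multiplicative updates for KL-divergence NMF, whose fixed points coincide with the EM updates of pLSA/PLCA up to normalization. This is a generic sketch of the unconstrained baseline, not the constrained variants compared in Table III.

```python
import numpy as np

# KL-divergence NMF with multiplicative updates (Lee & Seung style):
# X (frequencies x frames) is approximated by W @ H with W, H >= 0.
def kl_nmf(X, K, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, T)) + 1e-3
    for _ in range(n_iter):
        V = W @ H + 1e-12
        W *= (X / V) @ H.T / H.sum(axis=1)            # update bases
        V = W @ H + 1e-12
        H *= W.T @ (X / V) / W.sum(axis=0)[:, None]   # update activations
    return W, H
```

In the harmonicity-constrained variants, the columns of `W` would additionally be restricted to harmonic templates, which is the structural counterpart of the harmonic GMMs in iLHA.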

It is interesting that for almost all methods the precision Formula${\cal P}$ was higher than the recall Formula${\cal R}$. This means that many F0s were hard to detect because of the complex overlapping of multiple F0s. Solving this problem would require more accurate spectral modeling that removes the assumption of amplitude additivity on which both iLHA and NMF are based.

3) Impact of Scaling Factor

We investigated the impact of the scaling factor Formula$\omega$ described in Section III-C by testing three values: Formula$\omega=0.1, 1, 10$. The respective F-measures (61.2%, 60.6%, and 60.1%) were similar, indicating that the results are not sensitive to the value of Formula$\omega$. Automatic optimization of Formula$\omega$ would be an interesting research topic, as it tackles a limitation shared by many methods based on the assumption of amplitude quantization.
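One way to read the role of Formula$\omega$ is as a conversion from a continuous amplitude spectrum into integer counts of observed "frequency particles." The sketch below follows that illustrative interpretation; it is not the paper's exact preprocessing.

```python
import numpy as np

# Illustrative reading of the scaling factor omega: amplitudes are
# rescaled by omega and rounded to integer particle counts, so omega
# controls how many observations the model assumes in total.
def to_particle_counts(spectrum, omega):
    spectrum = np.asarray(spectrum, dtype=float)
    return np.rint(omega * spectrum).astype(int)
```

Under this reading, larger values of Formula$\omega$ sharpen the posterior (more pseudo-observations) while smaller values keep it diffuse, which is why sensitivity to Formula$\omega$ is worth checking.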



This paper presented a novel statistical method, called iLHA, for detecting multiple F0s in polyphonic music audio signals; it is the first method to apply Bayesian nonparametrics to multipitch analysis. We formulated nested infinite GMMs that represent polyphonic spectral strips in a hierarchical nonparametric Bayesian manner: each spectral strip is allowed to contain an unbounded number of spectral bases, each of which can contain an unbounded number of harmonic partials. The method was fully automated, except for the final thresholding process, by putting noninformative hyperprior distributions on influential hyperparameters. The joint posterior distribution of all unknown variables can be inferred efficiently within the VB framework. In experiments comparing iLHA with state-of-the-art methods that had been manually optimized by trial and error, we found that iLHA is competitive and that there is room for improvement through modeling of temporal continuity and spectral smoothness. One interesting future direction is to train the iLHA model with MCMC methods such as Gibbs sampling and its more efficient variants.

Bayesian nonparametrics is a powerful framework for avoiding the model selection problem faced in various areas of music information retrieval (MIR). For example, how many sections are needed to structure a musical piece? How many groups are needed to cluster listeners according to their tastes or musical pieces according to their contents? We can avoid these questions by assuming that, in theory, an infinite number of objects (sections or groups) lie behind the available observed data; unnecessary objects are automatically removed from consideration through statistical inference. Hoffman et al. recently applied this framework successfully to the calculation of musical similarity [49] and the detection of repeated patterns [32], and we also plan to use it in a wide range of applications.
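The intuition that unnecessary objects vanish automatically can be illustrated with the truncated stick-breaking construction of Dirichlet-process mixture weights; this is a generic sketch of the construction, not code from the paper.

```python
import numpy as np

# Truncated stick-breaking: draw nu_k ~ Beta(1, gamma) and set
# pi_k = nu_k * prod_{j<k} (1 - nu_j). The trailing weights decay
# geometrically in expectation, so with a large truncation K+ the
# "extra" components receive negligible weight.
def stick_breaking(gamma, k_max, seed=0):
    rng = np.random.default_rng(seed)
    nu = rng.beta(1.0, gamma, size=k_max)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - nu)[:-1]))
    return nu * remaining
```

This is why a finite truncation level such as Formula$K^{+}=60$ behaves, in practice, like an unbounded number of components.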


The nine terms of the variational lower bound of LHA in (34) can be calculated as follows: Formula TeX Source $$\eqalignno{&\BBE\left[\log p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})\right]\cr&\quad=\sum_{dnkm}\gamma_{dnkm}\BBE_{{\mmb\mu},{\mmb\Lambda}}\left[\log{\cal N}\left({\mmb x}_{dnm}\vert{\mmb\mu}_{k},{\mmb\Lambda}_{k}^{-1}\right)\right]\cr&\BBE\left[\log p({\mmb Z}\vert{\mmb\pi},{\mmb\tau})\right]\cr&\quad=\sum_{dnkm}\gamma_{dnkm}\left(\BBE_{{\mmb\pi}_{d}}[\log\pi_{dk}]+\BBE_{{\mmb\tau}_{k}}[\log\tau_{km}]\right)\cr&\BBE\left[\log p({\mmb\pi})\right]\cr&\quad=D\log C(\alpha{\mmb\nu})+\sum_{dk}(\alpha\nu_{k}-1)\BBE_{{\mmb\pi}_{d}}[\log\pi_{dk}]\cr&\BBE\left[\log p({\mmb\tau})\right]\cr&\quad=K\log C(\beta{\mmb\upsilon})+\sum_{km}(\beta\upsilon_{m}-1)\BBE_{{\mmb\tau}_{k}}[\log\tau_{km}]\cr&\BBE\left[\log p({\mmb\mu},{\mmb\Lambda})\right]\cr&\quad=\sum_{k}\BBE_{{\mmb\mu}_{k},{\mmb\Lambda}_{k}}\left[\log{\cal N}\left({\mmb\mu}_{k}\vert{\mmb m}_{0},(b_{0}{\mmb\Lambda}_{k})^{-1}\right){\cal W}({\mmb\Lambda}_{k}\vert{\mmb W}_{0},c_{0})\right]\cr&\BBE\left[\log q({\mmb Z})\right]\cr&\quad=\sum_{dnkm}\gamma_{dnkm}\log\gamma_{dnkm}\cr&\BBE\left[\log q({\mmb\pi})\right]\cr&\quad=\sum_{d}\log C({\mmb\alpha}_{d})+\sum_{dk}(\alpha_{dk}-1)\BBE_{{\mmb\pi}_{d}}[\log\pi_{dk}]\cr&\BBE\left[\log q({\mmb\tau})\right]\cr&\quad=\sum_{k}\log C({\mmb\beta}_{k})+\sum_{km}(\beta_{km}-1)\BBE_{{\mmb\tau}_{k}}[\log\tau_{km}]\cr&\BBE\left[\log q({\mmb\mu},{\mmb\Lambda})\right]\cr&\quad=\sum_{k}\BBE_{{\mmb\mu}_{k},{\mmb\Lambda}_{k}}\left[\log{\cal N}\left({\mmb\mu}_{k}\vert{\mmb m}_{k},(b_{k}{\mmb\Lambda}_{k})^{-1}\right){\cal W}({\mmb\Lambda}_{k}\vert{\mmb W}_{k},c_{k})\right]}$$ where the fifth and last terms can be obtained as follows: Formula TeX Source $$\eqalignno{&\BBE_{{\mmb\mu}_{k},{\mmb\Lambda}_{k}}\left[\log{\cal N}\left({\mmb\mu}_{k}\vert{\mmb m}_{0},(b_{0}{\mmb\Lambda}_{k})^{-1}\right){\cal W}({\mmb\Lambda}_{k}\vert{\mmb W}_{0},c_{0})\right]\cr&\quad={1\over
2}\log\left({b_{0}\over 2\pi}\right)+{1\over 2}\BBE_{{\mmb\Lambda}_{k}}\left[\log\vert{\mmb\Lambda}_{k}\vert\right]+\log B({\mmb W}_{0},c_{0})\cr&\qquad-{b_{0}\over 2}\left(c_{k}({\mmb m}_{k}-{\mmb m}_{0})^{T}{\mmb W}_{k}({\mmb m}_{k}-{\mmb m}_{0})+{1\over b_{k}}\right)\cr&\qquad+{c_{0}-2\over 2}\BBE_{{\mmb\Lambda}_{k}}\left[\log\vert{\mmb\Lambda}_{k}\vert\right]-{c_{k}\over 2}{\rm Tr}\left({\mmb W}_{0}^{-1}{\mmb W}_{k}\right)\cr&\BBE_{{\mmb\mu}_{k},{\mmb\Lambda}_{k}}\left[\log{\cal N}\left({\mmb\mu}_{k}\vert{\mmb m}_{k},(b_{k}{\mmb\Lambda}_{k})^{-1}\right){\cal W}({\mmb\Lambda}_{k}\vert{\mmb W}_{k},c_{k})\right]\cr&\quad=-\BBE_{{\mmb\Lambda}_{k}}\left[H\left[q({\mmb\mu}_{k}\vert{\mmb\Lambda}_{k})\right]\right]-H\left[q({\mmb\Lambda}_{k})\right]\cr&\quad={1\over 2}\BBE_{{\mmb\Lambda}_{k}}\left[\log\vert{\mmb\Lambda}_{k}\vert\right]+{1\over 2}\log\left({b_{k}\over 2\pi}\right)-{1\over 2}+\log B({\mmb W}_{k},c_{k})\cr&\qquad+{c_{k}-2\over 2}\BBE_{{\mmb\Lambda}_{k}}\left[\log\vert{\mmb\Lambda}_{k}\vert\right]-{c_{k}\over 2}}$$
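The expectations above repeatedly use Formula$\BBE_{{\mmb\Lambda}_{k}}[\log\vert{\mmb\Lambda}_{k}\vert]$ under a Wishart posterior Formula${\cal W}({\mmb\Lambda}_{k}\vert{\mmb W},c)$. The standard closed form under this (Bishop-style) parameterization is Formula$\sum_{i=1}^{D}\psi((c+1-i)/2)+D\log 2+\log\vert{\mmb W}\vert$; a sketch using SciPy's digamma:

```python
import numpy as np
from scipy.special import digamma

# E[log|Lambda|] for Lambda ~ Wishart(W, c) with D x D scale matrix W
# and c degrees of freedom:
#   E[log|Lambda|] = sum_{i=1}^{D} psi((c + 1 - i) / 2) + D*log 2 + log|W|.
def expected_logdet(W, c):
    D = W.shape[0]
    return (sum(digamma((c + 1 - i) / 2.0) for i in range(1, D + 1))
            + D * np.log(2.0) + np.log(np.linalg.det(W)))
```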


The 13 terms of the variational lower bound of iLHA in (87) can be calculated as follows: Formula TeX Source $$\eqalignno{\BBE\!\left[\log p({\mmb X}\vert{\mmb Z})\right]\!=\!&\,-{n_{\cdots}\over 2}\log(2\pi)\!+\!{1\over 2}\sum_{k}\log b_{0}\cr&-{1\over 2}\sum_{k}\BBF_{2}[\log b_{zk}]\cr&+\!\sum_{k}\log B({\mmb W}_{0},c_{0})\cr&-\sum_{k}\BBF_{1}\left[\log B({\mmb W}_{zk},c_{zk})\right]\cr\BBE\!\left[\log p({\mmb Z}\vert\alpha,\beta,{\mmb\lambda},\mathtilde{\mmb\nu})\right]\!=\!&\,\sum_{d}\log\left({\Gamma\!\left(\BBE[\alpha]\right)\over\Gamma\!\left(\BBE[\alpha]+n_{d\cdot\cdot}\right)}\right)\cr&+\!\sum_{dk}\BBF_{2}\!\!\left[\log\!\left({\Gamma\!\left(\BBG[\alpha\nu_{k}]\!+\!n_{dk\cdot}\right)\over\Gamma\!\left(\BBG[\alpha\nu_{k}]\right)}\right)\right]\cr&+\!\sum_{km}\!\BBF_{2}\!\!\left[\log\!\left({\Gamma\!\left(\BBE[\beta]\right)\over\Gamma\!\left(\BBE[\beta]\!+\!n_{\cdot k\geq m}\right)}\right)\right]\cr&+\!\sum_{km}\!\BBF_{2}\!\!\left[\log\!\left({\Gamma\!\left(\BBG[\beta\lambda_{1}]\!+\!n_{\cdot km}\right)\over\Gamma\!\left(\BBG[\beta\lambda_{1}]\right)}\right)\right]\cr&+\!\sum_{km}\!\BBF_{2}\!\!\left[\log\!\left({\Gamma\!\left(\BBG[\beta\lambda_{2}]\!+\!n_{\cdot k>m}\right)\over\Gamma\!\left(\BBG[\beta\lambda_{2}]\right)}\right)\right]\cr\BBE\!\left[\log p(\alpha)\right]\!=\!&\,-\log\Gamma(a_{\alpha})\!+\!a_{\alpha}\log b_{\alpha}\cr&+(a_{\alpha}-1)\BBE_{\alpha}[\log\alpha]-b_{\alpha}\BBE_{\alpha}[\alpha]\cr\BBE\!\left[\log p(\beta)\right]\!=\!&\,-\log\Gamma(a_{\beta})\!+\!a_{\beta}\log b_{\beta}\cr&+(a_{\beta}-1)\BBE_{\beta}[\log\beta]-b_{\beta}\BBE_{\beta}[\beta]\cr\BBE\!\left[\log p(\gamma)\right]\!=\!&\,-\log\Gamma(a_{\gamma})\!+\!a_{\gamma}\log b_{\gamma}\cr&+(a_{\gamma}-1)\BBE_{\gamma}[\log\gamma]-b_{\gamma}\BBE_{\gamma}[\gamma]\cr\BBE\!\left[\log p({\mmb\lambda})\right]\!=\!&\,\log{\Gamma(u_{1}+u_{2})\over\Gamma(u_{1})\Gamma(u_{2})}\cr&+(u_{1}-1)\BBE[\log\lambda_{1}]\cr&+(u_{2}-1)\BBE[\log\lambda_{2}]\cr\BBE\!\left[\log 
p(\mathtilde{\mmb\nu}\vert\gamma)\right]\!=\!&\,K\BBE[\log\gamma]\cr&+\!\sum_{k}\left(\BBE[\gamma]-1\right)\BBE\!\left[\log(1-\mathtilde{\nu}_{k})\right]\cr\BBE\!\left[\log q({\mmb Z})\right]\!=\!&\,\sum_{dnkm}\gamma_{dnkm}\log\gamma_{dnkm}\cr\BBE\!\left[\log q(\alpha)\right]\!=\!&\,-H\left[{\rm PosteriorGamma}(\alpha)\right]\cr\BBE\!\left[\log q(\beta)\right]\!=\!&\,-H\left[{\rm PosteriorGamma}(\beta)\right]\cr\BBE\!\left[\log q(\gamma)\right]\!=\!&\,-H\left[{\rm PosteriorGamma}(\gamma)\right]\cr\BBE\!\left[\log q({\mmb\lambda})\right]\!=\!&\,-H\left[{\rm PosteriorBeta}(\lambda_{1})\right]\cr\BBE\!\left[\log q(\mathtilde{\mmb\nu})\right]\!=\!&\,\sum_{k}-H\left[{\rm PosteriorBeta}(\mathtilde{\nu}_{k})\right]}$$ where Formula$\BBF_{1}$ and Formula$\BBF_{2}$ denote the first-order and second-order approximations based on Taylor expansion (see [45], [46], [47]).
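For a scalar function, the first- and second-order Taylor approximations denoted Formula$\BBF_{1}$ and Formula$\BBF_{2}$ take the following generic form (the multivariate versions used in [45], [46], [47] follow the same pattern); this sketch supplies the second derivative explicitly as `f2`.

```python
# Taylor approximations of E[f(x)] for x with mean mu and variance var:
#   F1: E[f(x)] ~ f(mu)
#   F2: E[f(x)] ~ f(mu) + 0.5 * f''(mu) * var
def taylor_expectation(f, f2, mu, var, order=2):
    approx = f(mu)
    if order >= 2:
        approx += 0.5 * f2(mu) * var  # second-order correction
    return approx
```

The second-order correction matters precisely when the posterior variance of the argument is not negligible, which is why (87) mixes Formula$\BBF_{1}$ and Formula$\BBF_{2}$ terms.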


The authors would like to thank Dr. H. Kameoka (The University of Tokyo/NTT, Japan) for providing the ground-truth transcriptions of eight pieces included in the RWC music database [48]. They would also like to thank Dr. V. Emiya (INRIA, France) for allowing them to use the valuable MAPS piano database [8].


This work was supported in part by CREST, JST, and KAKENHI 20800084. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Daniel Ellis.

The authors are with the National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8568, Japan.

Color versions of one or more of the figures in this paper are available online at

1Linear frequency Formula$f_{h}$ in hertz can be converted to logarithmic frequency Formula$f_{c}$ in cents as follows: Formula$f_{c}=1200\log_{2}\left(f_{h}/\left(440\cdot 2^{3/12-5}\right)\right)$.
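The footnote's conversion is straightforward to implement; the reference frequency Formula$440\cdot 2^{3/12-5}$ is approximately 8.176 Hz.

```python
import math

# Hz-to-cents conversion from the footnote:
# f_c = 1200 * log2(f_h / (440 * 2**(3/12 - 5))).
def hz_to_cents(f_hz):
    f_ref = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)  # ~8.176 Hz
    return 1200.0 * math.log2(f_hz / f_ref)
```

One octave corresponds to exactly 1200 cents, and one equal-tempered semitone to 100 cents, which is what makes this scale convenient for F0 analysis.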




Kazuyoshi Yoshii


Kazuyoshi Yoshii (M'08) received the Ph.D. degree in informatics from Kyoto University, Kyoto, Japan, in 2008.

He is currently a Research Scientist at the National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Japan. His research interests include probabilistic music analysis, blind source separation, and Bayesian nonparametrics.

Dr. Yoshii has received several awards including the IPSJ Yamashita SIG Research Award and the Best-in-Class Award of MIREX 2005. He is a member of the Information Processing Society of Japan (IPSJ) and Institute of Electronics, Information, and Communication Engineers (IEICE).

Masataka Goto


Masataka Goto received the D.Eng. degree from Waseda University, Tokyo, Japan, in 1998.

He is currently the leader of the Media Interaction Group, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Japan. He serves concurrently as a Visiting Professor at the Institute of Statistical Mathematics, an Associate Professor (Cooperative Graduate School Program) in the Graduate School of Systems and Information Engineering, University of Tsukuba, and a Project Manager of the MITOH Program (the Exploratory IT Human Resources Project) Youth division by the Information Technology Promotion Agency (IPA).

Dr. Goto has received 25 awards over the past 19 years, including the Commendation for Science and Technology by the Minister of MEXT “Young Scientists’ Prize,” the DoCoMo Mobile Science Awards “Excellence Award in Fundamental Science,” the IPSJ Nagao Special Researcher Award, and the IPSJ Best Paper Award.
