SECTION I

## INTRODUCTION

UNCERTAINTY is inherent in music analysis. A musical piece about which we have little prior knowledge can often be interpreted in various ways. One might, for example, have various degrees of belief in different possible interpretations of tempo and semantic structures, and when we try to transcribe the music we hear in an audio recording, we often find it difficult to identify the notes with absolute confidence. Even if in the end we need to determine which interpretation or transcription is the most reasonable, during the analysis it is important to keep all possibilities open with various degrees of belief. We should therefore take an approach that can evaluate, propagate, and integrate the uncertainties of interdependent musical elements or musical notes.

A natural way to manage uncertainty is to take a Bayesian approach and use Bayesian probabilities to indicate degrees of belief. For example, suppose we have a distorted die. If the probabilities of getting the numbers 1, 2, 3, ···, 6 (called parameters) are known, we can evaluate the likelihood for a set of numbers (called observed data) obtained by casting the die many times. Note that the true values of the parameters do not vary stochastically. When the parameters are unknown, a probability distribution is used as a means of representing how strongly possible values are believed to be the true values. Such degrees of belief vary according to the amount of observed data. Before we get observed data, prior distributions tend to be widely spread. The more data we get, the sharper the peaks of posterior distributions become. That is, the degree of belief in a certain possibility increases. The objective of Bayesian inference is to calculate posterior distributions of unknown variables by formulating probabilistic models defined by likelihood functions and prior distributions.
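The die example can be made concrete with a small sketch. It assumes a Dirichlet prior over the six face probabilities, which is a standard conjugate choice but is not specified in the text; the flat prior and the counts are illustrative values only.

```python
def dirichlet_posterior(prior, counts):
    """Posterior Dirichlet parameters after observing how often each face came up."""
    return [a + c for a, c in zip(prior, counts)]


def dirichlet_mean_var(params):
    """Mean and variance of each component of a Dirichlet distribution."""
    total = sum(params)
    means = [a / total for a in params]
    variances = [m * (1.0 - m) / (total + 1.0) for m in means]
    return means, variances


# Flat prior over the six face probabilities (an assumed, illustrative choice).
prior = [1.0] * 6
few_casts = [1, 0, 2, 0, 0, 3]         # 6 casts
many_casts = [100, 0, 200, 0, 0, 300]  # 600 casts with the same proportions

mean_few, var_few = dirichlet_mean_var(dirichlet_posterior(prior, few_casts))
mean_many, var_many = dirichlet_mean_var(dirichlet_posterior(prior, many_casts))

# The posterior variance of every face probability shrinks as data accumulate.
print([round(v, 5) for v in var_few])
print([round(v, 5) for v in var_many])
```

With the same observed proportions, the posterior after 600 casts is far more sharply peaked than after 6 casts, which is the sense in which the degree of belief increases with the amount of data.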

A critical problem in the conventional Bayesian approach is that we have to specify the complexity of the probabilistic models in advance (complexity means, for example, the number of mixtures in Gaussian mixture models (GMMs) or the number of states in hidden Markov models (HMMs)). If model complexities are unknown, both the uncertainty of model complexities and that of model parameters should be dealt with appropriately. The conventional approach, however, forces us to train many models of different complexities independently and then select one according to some criteria. Such exhaustive model selection, or model-complexity control, is often impractical, especially in the optimization of combinatorial-complexity models.

A nonparametric Bayesian approach avoiding the model selection problem has recently attracted a lot of attention [1]. Here the term “nonparametric” means that the size of a parameter space (complexity) is not fixed and in theory an infinite number of parameters (infinite complexity) are considered. If an infinite amount of observed data were available, an infinite number of parameters would be needed to represent the variety of the data. In practice, however, only a limited number of parameters are needed because the amount of observed data is limited. The effective complexities of nonparametric models can be automatically adjusted according to observed data. Such nonparametric models are essentially different from conventional parametric models. In a single nonparametric model, an infinite number of parametric models with different complexities are stochastically overlapped.

In this paper, we propose a nonparametric Bayesian method for multipitch analysis, which is the basis of music transcription and music information retrieval (MIR). The method is called infinite latent harmonic allocation (iLHA), and our goal is to estimate multiple fundamental frequencies (F0s) from polyphonic audio signals. Instead of determining the values of F0s (parameters) and the number of them (complexity) uniquely, our method estimates a joint posterior distribution of all unknown variables when amplitude spectra of musical audio signals are given as observed data. We formulate nested infinite GMMs for observed spectra by using nonparametric priors called Dirichlet processes (DPs). These models can be obtained by taking the limit of the nested finite GMMs proposed by Goto [2] and Kameoka et al. [3] as the number of mixtures goes to infinity. More specifically, each spectral strip is allowed to contain an unbounded number of sound sources (harmonic structures), each of which is allowed to contain an unbounded number of harmonic partials. An important problem is that the parameters of the DPs (called hyperparameters) should be given appropriately because they affect the effective number of mixtures.

To avoid hyperparameter tuning, our models are formulated in a hierarchical Bayesian manner by putting prior distributions (called hyperprior distributions) on influential hyperparameters. Conventionally, we need to specify the hyperparameters of Dirichlet prior distributions on the relative weights of harmonic partials [2], [3]. Although these hyperparameters strongly impact the accuracy of F0 estimation, it is difficult to optimize them by hand. We instead put noninformative hyperprior distributions on the hyperparameters of DP priors of the infinite number of F0s and harmonic partials. This is reasonable because we have little knowledge of the hyperparameters. As shown in Fig. 1, we can completely automate iLHA by leveraging natural Bayesian treatment of parameters, complexities, and influential hyperparameters.

Fig. 1. Advantage of our method: We are not required to specify the number of spectral bases and the number of harmonic partials in advance. In addition, we do not have to adjust hyperparameters carefully.

The remainder of this paper is organized as follows. Section II introduces related work. Section III compares parametric models of conventional methods and nonparametric models of our method. Section IV describes a finite version of our method (LHA) and Section V explains our method (iLHA). Section VI reports our experiments. Section VII concludes this paper.

SECTION II

## RELATED WORK

Many researchers have applied probabilistic models to multipitch analysis. Goto [2] proposed a probabilistic model for a single-frame amplitude spectrum (spectral strip) that contains multiple harmonic structures (see Section III) and used it to estimate the F0s of melody and bass lines from polyphonic audio signals. Kameoka et al. [3] estimated multiple F0s by using a similar model for grouping frequency components into multiple sound sources. Kameoka et al. [4] extended the model by capturing the temporal continuity of harmonic structures. Raphael [5] formulated a HMM based on a large number of chord hypotheses. Cemgil et al. [6] used a dynamic Bayesian network (DBN) to represent the sound generation process, i.e., to associate a music-score level with an audio-signal level. Raczyński et al. [7] also used a DBN to model temporal dependencies between musical notes. Emiya et al. [8] proposed a probabilistic model that jointly represents spectral envelopes and harmonic partials.

Recently, nonnegative matrix factorization (NMF) [9] has been considered to be promising. It regards time–frequency spectra as a nonnegative matrix and decomposes it into the product of two nonnegative matrices, one corresponding to a set of spectral bases and the other corresponding to a set of temporal activations. Smaragdis et al. [10] pioneered the use of NMF for music transcription. Virtanen et al. [11] and Peeling et al. [12] proposed Bayesian extensions of NMF. Raczyński et al. [13] and FitzGerald et al. [14] proposed harmonicity constraints for spectral bases, and Bertin et al. [15] further introduced smoothness constraints for temporal activations. Vincent et al. [16] proposed a method of training spectral bases from audio signals of isolated tones and adapting them to target polyphonic audio signals. Cont [17] developed NMF with sparsity constraints for real-time pitch tracking. Several variants of NMF—such as the complex NMF proposed by Kameoka et al. [18], the Itakura–Saito (IS) divergence NMF proposed by Févotte et al. [19], and the gamma process NMF proposed by Hoffman et al. [20]—have been applied to spectrogram decomposition, but F0s have not been estimated from the spectral bases thus obtained.

Many other approaches have also been proposed (see [21] for a review). For example, Marolt [22] and Klapuri [23] proposed auditory-model-based methods that use a peripheral hearing model. Computationally efficient approaches based on harmonic sums [24] and correlograms [25] have also been investigated. Pertusa and Iñesta [26] proposed a spectral-peak clustering method. Bello et al. [27] tackled grouping of frequency components by using a heuristic set of rules.

There have been attempts to estimate F0 contours of melody lines (vocal parts) from polyphonic audio signals. Dressler [28] used instantaneous frequency estimation, sinusoidal extraction, psychoacoustics, and auditory stream segregation. Ryynänen and Klapuri [29] formulated an HMM based on acoustic and musicological modeling, and Durrieu et al. [30] proposed a statistical method of extracting the main melody by using source/filter models. Poliner et al. [31] have reported a comparative evaluation of several approaches.

Most methods mentioned above can achieve good results if the number of sound sources and/or manual parameters are appropriately specified. However, it is difficult to always bring out the full potential of these methods in practice.

SECTION III

## PROBABILISTIC MODELS

Our method is based on a nonparametric Bayesian extension of the conventional finite mixture models proposed by Goto [2] and Kameoka et al. [3]. Here we explain the conventional models for observed spectra and then derive our infinite mixture models by extending the conventional models.

### A. Notations

Fig. 2. Gaussian mixture model for the $k$th basis (single basis). Each Gaussian corresponds to a harmonic partial, and the mixing weights represent the relative strengths of $M$ harmonic partials.

Suppose that given polyphonic audio signals contain $K$ bases, each of which consists of $M$ harmonic partials located at integral multiples of the F0 on a linear frequency scale. Each basis can be associated with multiple sounds of different temporal positions if these sounds are derived from the same pitch of the same instrument. We transform the audio signals into wavelet spectra. Let $D$ be the number of frames. Note that $K$ and $M$ are finite integers that in conventional methods are specified in advance. Our method considers that $K$ and $M$ go to infinity.

### B. Conventional Finite Models and MAP Estimation

Probabilistic models can evaluate how likely observed data is to be generated by using a limited number of parameters. Therefore, estimation of multiple F0s corresponds directly to finding model parameters that give the highest probability to the generation of the observed data (called model training).

Goto [2] first proposed probabilistic models of harmonic structures by regarding an amplitude spectrum (a spectral strip of a single frame) as a probability density function. As shown in Fig. 2, the amplitude distribution of basis $k$ $(1\leq k\leq K)$ can be modeled by a harmonic GMM as follows:

$${\cal M}_{k}({\mmb x})=\sum_{m=1}^{M}\tau_{km}{\cal N}\left({\mmb x}\big\vert{\mmb\mu}_{k}+{\mmb o}_{m},{\mmb\Lambda}_{k}^{-1}\right)\tag{1}$$

where ${\mmb x}$ is a one-dimensional vector indicating a logarithmic frequency [cents].$^{1}$ The Gaussian parameters (mean ${\mmb\mu}_{k}$ and variance ${\mmb\Lambda}_{k}^{-1}$) represent the F0 of basis $k$ and the degree of energy spread around the F0. $\tau_{km}$ is the relative strength of the $m$th harmonic partial $(1\leq m\leq M)$ in basis $k$. We set ${\mmb o}_{m}$ to $[1200\log_{2} m]$. This means that $M$ Gaussians are located to have the harmonic relationship on the logarithmic frequency scale. One might think that the value of ${\mmb\Lambda}_{k}^{-1}$ can be precomputed because the basis sound consists of $M$ sinusoidal signals (see Appendix I in [4]). This is true if these sinusoidal signals are stationary, but frequency-modulated sounds (e.g., vibrato) result in a larger value of ${\mmb\Lambda}_{k}^{-1}$ because of the uncertainty principle of time–frequency resolution.
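As an illustration of (1), the following minimal sketch evaluates a harmonic GMM for a single basis in the scalar case. The function names and the numerical values (F0 position in cents, precision, partial weights) are hypothetical and chosen only to make the example runnable.

```python
import math


def gaussian_pdf(x, mean, precision):
    """Density of a one-dimensional Gaussian parameterized by its precision."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(-0.5 * precision * (x - mean) ** 2)


def harmonic_gmm(x, mu_k, precision_k, tau_k):
    """Evaluate M_k(x) of (1): partial m sits 1200*log2(m) cents above the F0."""
    density = 0.0
    for m, weight in enumerate(tau_k, start=1):
        offset = 1200.0 * math.log2(m)  # o_m in cents
        density += weight * gaussian_pdf(x, mu_k + offset, precision_k)
    return density


# Hypothetical values: F0 at 4800 cents, standard deviation of 20 cents, three partials.
mu_k = 4800.0
precision_k = 1.0 / 20.0 ** 2
tau_k = [0.5, 0.3, 0.2]

peak = math.sqrt(precision_k / (2.0 * math.pi))
print(harmonic_gmm(mu_k, mu_k, precision_k, tau_k) / peak)           # close to tau_k[0]
print(harmonic_gmm(mu_k + 1200.0, mu_k, precision_k, tau_k) / peak)  # close to tau_k[1]
```

Because the partials are well separated relative to the assumed spread, the density at each partial position is dominated by that partial's weight, which is how $\tau_{km}$ encodes the relative strengths of the harmonic partials.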

As shown in Fig. 3, the spectral strip of frame $d$ is modeled by mixing $K$ harmonic GMMs as follows:

$${\cal M}_{d}({\mmb x})=\sum_{k=1}^{K}\pi_{dk}{\cal M}_{k}({\mmb x})\tag{2}$$

where $\pi_{dk}$ is a relative strength of basis $k$ in frame $d$. Consequently, the polyphonic spectral strip can be represented by means of a nested finite GMM.

Fig. 3. Nested Gaussian mixture model for mixed multiple bases. It is obtained by mixing multiple Gaussian mixture models in a weighted manner under the assumption of amplitude additivity.

TABLE I. MULTIPITCH ANALYSIS METHODS

Several methods that have been proposed for parameter estimation are listed in Table I. Goto [2] proposed a predominant-F0 estimation method (PreFEst) that estimates only relative strengths ${\mmb\tau}$ and ${\mmb\pi}$ by allocating many GMMs (${\mmb\mu}$ and ${\mmb\Lambda}$ are fixed) to cover the entire frequency range as F0 candidates. Kameoka et al. [3] proposed harmonic clustering (HC), which estimates all the parameters and selects the optimal number of bases by using the Bayesian information criterion. Although these methods yielded promising results, they analyze the spectral strips of different frames independently. Kameoka et al. [4] therefore proposed harmonic-temporal-structured clustering (HTC), which captures the temporal continuity of spectral bases. Because all the above methods use a maximum a posteriori (MAP) estimation strategy to train the finite models, a prior distribution of relative strengths ${\mmb\tau}$ has a large effect on the accuracy of F0 estimation.

### C. Our Infinite Models and Bayesian Inference

We would like to discuss the limit of (1) and (2) as $K$ and $M$ diverge to infinity. Taking the infinite limit is reasonable even though there are a finite number of discrete pitches (e.g., the standard piano has 88 keys). The F0s and spectral shapes of many instruments (strings, woodwinds, brasses, etc.) vary infinitely according to playing styles (vibrato, marcato, legato, staccato, etc.), and it is difficult to capture these variations when using a parametric model of fixed complexity.

Although there are theoretically an infinite number of mixing weights $\{\pi_{d1},\pi_{d2},\ldots, \pi_{d\infty}\}$ and $\{\tau_{k1},\tau_{k2},\ldots, \tau_{k\infty}\}$, a finite amount of observed data contains in practice only a finite number of bases and a finite number of harmonic partials. Most of the mixing weights must therefore be almost equal to zero. In other words, only a limited number of bases and a limited number of harmonic partials are allowed to become active. To realize such “sparse” GMMs, we put nonparametric prior distributions on mixing weights as sparsity constraints. We developed a method of Bayesian inference called iLHA to train the nested infinite GMMs (see Section V).

#### 1) Definition of Observed Data

In the context of Bayesian inference we need to explicitly define the observed data from the statistical viewpoint. More specifically, we regard each spectral strip as a histogram of observed frequencies as in [32]. If a spectral strip at frame $d(1\leq d\leq D)$ has amplitude $a$ at frequency $f$, we assume that frequency $f$ was observed $\lfloor\omega a\rfloor$ times in frame $d$, where $\omega$ is a scaling factor of wavelet spectra. In other words, we suppose there are countable frequency “particles” (sound quanta), each corresponding to an independent and identically distributed (i.i.d.) observation. Note that there is a nontrivial issue in determining the value of $\omega$ (see Section III-C3). Assuming that amplitudes are additive, we can consider each observation to be generated from one of $M$ partials in one of $K$ bases.

Let the total observations over all $D$ frames be represented by ${\mmb X}=\{{\mmb X}_{1},\ldots, {\mmb X}_{D}\}$, where ${\mmb X}_{d}$ is a set of observed frequencies ${\mmb X}_{d}=\{{\mmb x}_{d1},\ldots, {\mmb x}_{dN_{d}}\}$ in frame $d$. $N_{d}$ is the number of frequency observations (i.e., the sum of spectral amplitudes over all frequency bins in frame $d$) and ${\mmb x}_{dn}(1\leq n\leq N_{d})$ is a one-dimensional vector that represents an observed frequency. We let $N=\sum_{d}N_{d}$ be the total number of observations over all frames.
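The conversion from a spectral strip to the observation set ${\mmb X}_{d}$ can be sketched as follows. The helper names and the toy spectrum are hypothetical; only the rule that frequency $f$ with amplitude $a$ is observed $\lfloor\omega a\rfloor$ times comes from the text.

```python
import math


def spectrum_to_counts(amplitudes, omega):
    """Number of times each frequency bin is observed: floor(omega * a)."""
    return [math.floor(omega * a) for a in amplitudes]


def expand_observations(frequencies, counts):
    """Repeat each frequency according to its count to form the set X_d."""
    observations = []
    for f, c in zip(frequencies, counts):
        observations.extend([f] * c)
    return observations


# Toy spectral strip: three frequency bins (in cents) and their amplitudes.
frequencies = [4700.0, 4800.0, 6000.0]
amplitudes = [0.0, 1.5, 0.26]
omega = 10.0  # scaling factor, an illustrative value

counts = spectrum_to_counts(amplitudes, omega)
X_d = expand_observations(frequencies, counts)
N_d = len(X_d)
print(counts, N_d)
```

Doubling `omega` roughly doubles `N_d`, which is the arbitrariness in the total number of observations discussed in Section III-C3.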

Let the total latent variables corresponding to ${\mmb X}$ be similarly represented by ${\mmb Z}=\{{\mmb Z}_{1},\ldots, {\mmb Z}_{D}\}$, where ${\mmb Z}_{d}=\{{\mmb z}_{d1},\ldots, {\mmb z}_{dN_{d}}\}$. ${\mmb z}_{dn}$ is a $KM$-dimensional vector in which only one entry, $z_{dnkm}$, takes a value of 1 and the others take values of 0 when frequency ${\mmb x}_{dn}$ is generated from partial $m(1\leq m\leq M)$ of basis $k(1\leq k\leq K)$.

#### 2) Positioning of Our Method

Our method can be viewed as an extension of a well-known topic modeling method called latent Dirichlet allocation (LDA) [33]. LDA was developed as a Bayesian extension of probabilistic latent semantic analysis (pLSA) [34] in the field of natural language processing. In LDA, each document is represented as a weighted mixture of multiple topics that are shared over all documents contained in observed data. Our method similarly represents frames as weighted mixtures of bases. An important difference between our method and LDA, however, is that iLHA represents each basis as a continuous distribution (a GMM) on the frequency space while LDA represents each topic as a discrete distribution over words (a set of unigram probabilities).

Another relevant extension of pLSA is probabilistic latent component analysis (PLCA) [35]. PLCA has been applied to source separation by assuming the time–frequency spectrogram to be a two-dimensional histogram of sound quanta. A major difference between our method and PLCA is that iLHA is based on a continuous distribution on the frequency space at each frame while PLCA is based on a two-dimensional discrete distribution on the space of frame-frequency pairs.

Our method is also similar to the standard NMF [10] based on temporal exchangeability of spectral strips (see Table I). Our method simultaneously trains GMMs of all frames contained in the observed spectra. In other words, if we permute a temporal sequence of spectral strips, the same results would be obtained. Although such temporal modeling is not appropriate for music, it is known to work well in practice.

As discussed above, we fuse the topic modeling framework into the NMF-style decomposition. This is reasonable because any (local) maximum-likelihood solution of pLSA is proven to be a solution of NMF that uses Kullback–Leibler (KL) divergence as a cost function [36]. In addition, we propose a nonparametric Bayesian extension.

#### 3) Limitations of Our Method

The amplitude quantization and i.i.d. assumption are not justified in a physical sense. The amplitudes at the integral multiples of an F0 are correlated with each other when they are generated from a single harmonic sound. Besides this, there is arbitrariness in determining the total number of observations $N$ (the scaling factor $\omega$ applied to raw wavelet spectra). The larger $\omega$ is, the more observations we have, resulting in a more compact posterior distribution because of reduced uncertainty. This criticism can be applied not only to topic models like [32], [35] but also to probabilistic models of NMF with KL divergence. This NMF assumes the value of amplitude to follow a Poisson distribution that is defined over nonnegative integers and has no scale parameter. Note that another NMF with IS divergence [19] does not have such a problem because it assumes the value of power (squared amplitude) to follow an exponential distribution that is defined over nonnegative real numbers and has a scale parameter. Therefore, NMF with IS divergence is scale invariant.

This is more problematic in the context of “nonparametric” Bayesian inference because a larger number of observations allows iLHA to activate more relatively small mixture components (i.e., bases and harmonic partials). We therefore need to perform a thresholding process according to the value of $\omega$ after training the weights of bases. In our experiments, the accuracy of multipitch analysis varied little when we changed the value of $\omega$ (see Section VI).

Another limitation is that our method represents harmonic sounds in an oversimplified manner. We assume that harmonic sounds consist only of several sinusoidal signals corresponding to harmonic partials. Actually, however, measurable noisy components are widely distributed along the frequency axis even if the target musical pieces are played only by pitched instruments. iLHA is thus forced to use too many harmonic GMMs to represent such noisy components. This is another reason that we need the thresholding process in the end.

SECTION IV

## LATENT HARMONIC ALLOCATION

This section explains LHA, the finite version of iLHA, as a preliminary step to deriving iLHA. LHA deals with nested finite GMMs described in Section III in a Bayesian manner. First, we mathematically represent the LHA model by putting prior distributions on unknown variables. Then, we explain a training method of estimating posterior distributions.

### A. Model Formulation

Fig. 4 shows a graphical representation of the LHA model. The full joint distribution is given by

$$p({\mmb X},{\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})=p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})p({\mmb Z}\vert{\mmb\pi},{\mmb\tau})p({\mmb\pi})p({\mmb\tau})p({\mmb\mu},{\mmb\Lambda})\tag{3}$$

where the first two terms are likelihood functions and the other three terms are prior distributions. The likelihood functions are defined as

$$p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})=\prod_{dnkm}{\cal N}\left({\mmb x}_{dn}\vert{\mmb\mu}_{k}+{\mmb o}_{m},{\mmb\Lambda}_{k}^{-1}\right)^{z_{dnkm}}\tag{4}$$

$$p({\mmb Z}\vert{\mmb\pi},{\mmb\tau})=\prod_{dnkm}(\pi_{dk}\tau_{km})^{z_{dnkm}}\tag{5}$$

Then, we introduce conjugate priors as follows:

$$p({\mmb\pi})=\prod_{d=1}^{D}{\rm Dir}({\mmb\pi}_{d}\vert\alpha{\mmb\nu})=\prod_{d=1}^{D}C(\alpha{\mmb\nu})\prod_{k=1}^{K}\pi_{dk}^{\alpha\nu_{k}-1}\tag{6}$$

$$p({\mmb\tau})=\prod_{k=1}^{K}{\rm Dir}({\mmb\tau}_{k}\vert\beta{\mmb\upsilon})=\prod_{k=1}^{K}C(\beta{\mmb\upsilon})\prod_{m=1}^{M}\tau_{km}^{\beta\upsilon_{m}-1}\tag{7}$$

$$p({\mmb\mu},{\mmb\Lambda})=\prod_{k=1}^{K}{\cal N}\left({\mmb\mu}_{k}\vert{\mmb m}_{0},(b_{0}{\mmb\Lambda}_{k})^{-1}\right){\cal W}({\mmb\Lambda}_{k}\vert{\mmb W}_{0},c_{0})\tag{8}$$

where $p({\mmb\pi})$ and $p({\mmb\tau})$ are products of Dirichlet distributions and $p({\mmb\mu},{\mmb\Lambda})$ is a product of Gaussian–Wishart distributions. $C(\alpha{\mmb\nu})$ and $C(\beta{\mmb\upsilon})$ are normalization factors, and $\alpha{\mmb\nu}$ and $\beta{\mmb\upsilon}$ are hyperparameters. We let ${\mmb\nu}$ and ${\mmb\upsilon}$ sum to unity, respectively. $\alpha$ and $\beta$ are often called concentration parameters. ${\mmb m}_{0}$, $b_{0}$, ${\mmb W}_{0}$, and $c_{0}$ are also hyperparameters: ${\mmb m}_{0}$ is a Gaussian mean, $b_{0}$ is a scaling factor of the precision matrix, ${\mmb W}_{0}$ is a scale matrix, and $c_{0}$ is a degree of freedom.

Fig. 4. Graphical representation of nested finite Gaussian mixture models for LHA. First, finite sets of mixing weights, ${\mmb\pi}$ and ${\mmb\tau}$, are stochastically generated according to Dirichlet prior distributions. At the same time, $KM$ Gaussian distributions are stochastically generated according to a Gaussian–Wishart prior distribution. Then one of $M$ harmonic partials in one of $K$ bases is stochastically selected as a latent variable ${\mmb z}_{dn}$ according to multinomial distributions defined by ${\mmb\pi}$ and ${\mmb\tau}$. Finally, frequency ${\mmb x}_{dn}$ is stochastically generated according to a Gaussian distribution specified by ${\mmb z}_{dn}$.
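The generative process summarized in Fig. 4 can be sketched as a sampler. This is only an illustration under stated assumptions: frequencies are scalars (so the Wishart distribution reduces to a gamma distribution with shape $c_{0}/2$ and scale $2W_{0}$), ${\mmb\nu}$ and ${\mmb\upsilon}$ are taken to be uniform, and the function name `sample_lha` and all hyperparameter values are hypothetical.

```python
import math
import random


def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [g / total for g in draws]


def sample_lha(D, N_d, K, M, alpha, beta, m0, b0, W0, c0, seed=0):
    """Sample mixing weights, Gaussian parameters, and frequencies from (3)-(8)."""
    rng = random.Random(seed)
    nu = [1.0 / K] * K        # uniform base weights (assumption)
    upsilon = [1.0 / M] * M   # uniform base weights (assumption)
    pi = [sample_dirichlet([alpha * v for v in nu], rng) for _ in range(D)]
    tau = [sample_dirichlet([beta * u for u in upsilon], rng) for _ in range(K)]
    # Scalar Gaussian-Wishart prior: precision from a gamma, mean from a Gaussian.
    lam = [rng.gammavariate(c0 / 2.0, 2.0 * W0) for _ in range(K)]
    mu = [rng.gauss(m0, 1.0 / math.sqrt(b0 * l)) for l in lam]
    X = []
    for d in range(D):
        frame = []
        for _ in range(N_d):
            k = rng.choices(range(K), weights=pi[d])[0]   # choose a basis
            m = rng.choices(range(M), weights=tau[k])[0]  # choose a partial (0-based)
            mean = mu[k] + 1200.0 * math.log2(m + 1)      # add harmonic offset o_m
            frame.append(rng.gauss(mean, 1.0 / math.sqrt(lam[k])))
        X.append(frame)
    return X, pi, tau, mu, lam


X, pi, tau, mu, lam = sample_lha(D=2, N_d=5, K=3, M=4, alpha=1.0, beta=1.0,
                                 m0=4800.0, b0=1.0, W0=1.0 / 1600.0, c0=4.0)
print(len(X), len(X[0]))
```

Inference in LHA runs this process in reverse: given only `X`, it estimates posterior distributions over the quantities that the sampler draws first.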

### B. Variational Bayesian Inference

The goal of Bayesian inference is to compute a true posterior distribution of all unknown variables: $p({\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda}\vert{\mmb X})$. Because analytical calculation of the posterior distribution is intractable, we use an approximation technique, called variational Bayes (VB) [37], that limits the posterior distribution to an analytical form and optimizes it iteratively in a deterministic way. Another possible technique is Markov chain Monte Carlo (MCMC) [38], which sequentially generates samples (the concrete values of unknown variables) from the true posterior distribution in a stochastic way by constructing a Markov chain that has the target distribution as its equilibrium distribution. It is generally difficult, however, to tell whether or not a Markov chain has reached a stationary distribution from which we can get samples within an acceptable error.

In the VB framework, we introduce a variational posterior distribution $q({\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})$ and make it close to the true posterior $p({\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda}\vert{\mmb X})$ iteratively. Here we assume that the variational distribution can be factorized as

$$q({\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})=q({\mmb Z})q({\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})\tag{9}$$

To optimize $q({\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})$, we use a variational version of the expectation–maximization (EM) algorithm [37]. We iterate VB-E and VB-M steps alternately until a variational lower bound of evidence $p({\mmb X})$ converges as follows:

$$q^{\ast}({\mmb Z})\propto\exp\left(\BBE_{{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda}}\left[\log p({\mmb X},{\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})\right]\right)\tag{10}$$

$$q^{\ast}({\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})\propto\exp\left(\BBE_{\mmb Z}\left[\log p({\mmb X},{\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})\right]\right).\tag{11}$$

### C. Variational Posterior Distributions

We derive the formulas for updating variational posterior distributions according to (10) and (11).

#### 1) VB-E Step

An optimal variational posterior distribution of latent variables ${\mmb Z}$ can be computed as follows:

$$\begin{aligned}\log q^{\ast}({\mmb Z})=&\,\BBE_{{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda}}\left[\log p({\mmb X},{\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})\right]+{\rm const.}\\=&\,\BBE_{{\mmb\mu},{\mmb\Lambda}}\left[\log p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})\right]+\BBE_{{\mmb\pi},{\mmb\tau}}\left[\log p({\mmb Z}\vert{\mmb\pi},{\mmb\tau})\right]+{\rm const.}\\=&\,\sum_{dnkm}z_{dnkm}\log\rho_{dnkm}+{\rm const.}\end{aligned}\tag{12}$$

where $\rho_{dnkm}$ is defined as

$$\log\rho_{dnkm}=\BBE_{{\mmb\pi}_{d}}[\log\pi_{dk}]+\BBE_{{\mmb\tau}_{k}}[\log\tau_{km}]+\BBE_{{\mmb\mu}_{k},{\mmb\Lambda}_{k}}\left[\log{\cal N}\left({\mmb x}_{dnm}\vert{\mmb\mu}_{k},{\mmb\Lambda}_{k}^{-1}\right)\right]\tag{13}$$

where ${\mmb x}_{dnm}={\mmb x}_{dn}-{\mmb o}_{m}$. Consequently, $q^{\ast}({\mmb Z})$ is obtained as multinomial distributions given by

$$q^{\ast}({\mmb Z})=\prod_{dnkm}\gamma_{dnkm}^{z_{dnkm}}\tag{14}$$

where $\gamma_{dnkm}=\rho_{dnkm}/\sum_{k^{\prime}m^{\prime}}\rho_{dnk^{\prime}m^{\prime}}$ is called a responsibility that indicates how likely it is that observed frequency ${\mmb x}_{dn}$ is generated from harmonic partial $m$ of basis $k$. Let $n_{dkm}$ be the number of frequencies that were generated from harmonic partial $m$ of basis $k$ in frame $d$. This number and its expected value can be calculated as follows:

$$n_{dkm}=\sum_{n}z_{dnkm}\quad\BBE[n_{dkm}]=\sum_{n}\gamma_{dnkm}\tag{15}$$
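The normalization that turns $\log\rho_{dnkm}$ into responsibilities, and the expected counts of (15), can be sketched as follows. The function names are hypothetical, and subtracting the maximum before exponentiating is a common numerical safeguard that is assumed here rather than taken from the text.

```python
import math


def responsibilities(log_rho):
    """Normalize log rho over all (k, m) pairs for one observation x_dn."""
    peak = max(max(row) for row in log_rho)  # subtract the maximum for stability
    unnorm = [[math.exp(v - peak) for v in row] for row in log_rho]
    total = sum(sum(row) for row in unnorm)
    return [[u / total for u in row] for row in unnorm]


def expected_counts(gammas):
    """E[n_dkm] = sum over n of gamma_dnkm for one frame, as in (15)."""
    K, M = len(gammas[0]), len(gammas[0][0])
    counts = [[0.0] * M for _ in range(K)]
    for gamma in gammas:
        for k in range(K):
            for m in range(M):
                counts[k][m] += gamma[k][m]
    return counts


# Toy example with K = 2 bases and M = 2 partials; rho is proportional to 1, 3, 4, 2.
log_rho = [[math.log(1.0), math.log(3.0)],
           [math.log(4.0), math.log(2.0)]]
gamma = responsibilities(log_rho)
counts = expected_counts([gamma, gamma])  # two observations with the same rho
print(gamma, counts)
```

Adding the same constant to every entry of `log_rho` leaves the responsibilities unchanged, which is why the constant terms in (12) can be ignored.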

For convenience in executing the VB-M step, we calculate several sufficient statistics as follows:

$$\BBS_{k}[1]\equiv\sum_{dnm}\gamma_{dnkm}\tag{16}$$

$$\BBS_{k}[{\mmb x}]\equiv\sum_{dnm}\gamma_{dnkm}{\mmb x}_{dnm}\tag{17}$$

$$\BBS_{k}[{\mmb x}{\mmb x}^{T}]\equiv\sum_{dnm}\gamma_{dnkm}{\mmb x}_{dnm}{\mmb x}_{dnm}^{T}.\tag{18}$$

#### 2) VB-M Step

Similarly, an optimal variational posterior distribution of parameters ${\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda}$ is given by

$$\begin{aligned}\log q^{\ast}({\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})=&\,\log p({\mmb\pi})p({\mmb\tau})+\BBE_{\mmb Z}\left[\log p({\mmb Z}\vert{\mmb\pi},{\mmb\tau})\right]\\&+\log p({\mmb\mu},{\mmb\Lambda})+\BBE_{\mmb Z}\left[\log p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})\right]+{\rm const.}\end{aligned}\tag{19}$$

This distribution can be factorized into the product of posterior distributions of respective parameters as follows:

$$q^{\ast}({\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})=\prod_{d=1}^{D}q^{\ast}({\mmb\pi}_{d})\prod_{k=1}^{K}q^{\ast}({\mmb\tau}_{k})\prod_{k=1}^{K}q^{\ast}({\mmb\mu}_{k},{\mmb\Lambda}_{k})\tag{20}$$

Since our model is based on the conjugate prior distributions, each posterior distribution has the same form as the corresponding prior distribution:

$$q^{\ast}({\mmb\pi}_{d})={\rm Dir}({\mmb\pi}_{d}\vert{\mmb\alpha}_{d})\tag{21}$$

$$q^{\ast}({\mmb\tau}_{k})={\rm Dir}({\mmb\tau}_{k}\vert{\mmb\beta}_{k})\tag{22}$$

$$q^{\ast}({\mmb\mu}_{k},{\mmb\Lambda}_{k})={\cal N}\left({\mmb\mu}_{k}\vert{\mmb m}_{k},(b_{k}{\mmb\Lambda}_{k})^{-1}\right){\cal W}({\mmb\Lambda}_{k}\vert{\mmb W}_{k},c_{k})\tag{23}$$

where the variational parameters are given by

$$\alpha_{dk}=\alpha\nu_{k}+\BBE[n_{dk\cdot}]\tag{24}$$

$$\beta_{km}=\beta\upsilon_{m}+\BBE[n_{\cdot km}]\tag{25}$$

$$b_{k}=b_{0}+\BBS_{k}[1]\tag{26}$$

$$c_{k}=c_{0}+\BBS_{k}[1]\tag{27}$$

$${\mmb m}_{k}={b_{0}{\mmb m}_{0}+\BBS_{k}[{\mmb x}]\over b_{0}+\BBS_{k}[1]}={b_{0}{\mmb m}_{0}+\BBS_{k}[{\mmb x}]\over b_{k}}\tag{28}$$

$${\mmb W}_{k}^{-1}={\mmb W}_{0}^{-1}+b_{0}{\mmb m}_{0}{\mmb m}_{0}^{T}+\BBS_{k}[{\mmb x}{\mmb x}^{T}]-b_{k}{\mmb m}_{k}{\mmb m}_{k}^{T}\tag{29}$$

where we introduced a dot notation for improved readability. We let dot “⋅” denote the sum over that index. For convenience in the subsequent sections, we also introduce notations using comparison operators (> and $\geq$). For example, we write

$$n_{dk\cdot}=\sum_{m^{\prime}}n_{dkm^{\prime}}\quad n_{dk>m}=\sum_{m^{\prime}>m}n_{dkm^{\prime}}.\tag{30}$$

The three terms of (13) can therefore be calculated as follows:

$$\BBE_{{\mmb\pi}_{d}}[\log\pi_{dk}]=\psi(\alpha_{dk})-\psi\left(\sum_{k=1}^{K}\alpha_{dk}\right)\tag{31}$$

$$\BBE_{{\mmb\tau}_{k}}[\log\tau_{km}]=\psi(\beta_{km})-\psi\left(\sum_{m=1}^{M}\beta_{km}\right)\tag{32}$$

$$\begin{aligned}&\BBE_{{\mmb\mu}_{k},{\mmb\Lambda}_{k}}\left[\log{\cal N}\left({\mmb x}_{dnm}\vert{\mmb\mu}_{k},{\mmb\Lambda}_{k}^{-1}\right)\right]\\&\quad=-{1\over 2}\log(2\pi)+{1\over 2}\BBE_{{\mmb\Lambda}_{k}}\left[\log\vert{\mmb\Lambda}_{k}\vert\right]-{1\over 2}c_{k}({\mmb x}_{dnm}-{\mmb m}_{k})^{T}{\mmb W}_{k}({\mmb x}_{dnm}-{\mmb m}_{k})-{1\over 2b_{k}}\end{aligned}\tag{33}$$

where $\psi$ is the digamma function, which is defined as the logarithmic derivative of the gamma function.
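The updates (26)-(29) and the expectations (31) and (32) can be sketched for the scalar case as follows. The function names are hypothetical, and the digamma function is approximated here by a recurrence combined with a standard asymptotic series, because it is not available in the Python standard library; any accurate implementation could be substituted.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0 using psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def expected_log_weights(dirichlet_params):
    """E[log w_i] under Dir(params), the form shared by (31) and (32)."""
    psi_total = digamma(sum(dirichlet_params))
    return [digamma(a) - psi_total for a in dirichlet_params]


def update_gaussian_wishart(m0, b0, W0, c0, s1, sx, sxx):
    """Scalar version of (26)-(29) for one basis, given its sufficient statistics."""
    b_k = b0 + s1
    c_k = c0 + s1
    m_k = (b0 * m0 + sx) / b_k
    W_k_inv = 1.0 / W0 + b0 * m0 * m0 + sxx - b_k * m_k * m_k
    return m_k, b_k, 1.0 / W_k_inv, c_k


# Toy statistics for three fully assigned observations 1, 2, 3 (after offset removal).
s1, sx, sxx = 3.0, 6.0, 14.0
m_k, b_k, W_k, c_k = update_gaussian_wishart(m0=0.0, b0=1.0, W0=1.0, c0=1.0,
                                             s1=s1, sx=sx, sxx=sxx)
print(m_k, b_k, W_k, c_k)
print(expected_log_weights([1.0, 1.0]))
```

In the toy example the posterior mean `m_k` is pulled from the prior mean toward the sample mean in proportion to the effective count `s1`, which is the usual behavior of a conjugate Gaussian update.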

### D. Variational Lower Bound

To judge convergence, we examine the increase of the variational lower bound. Maximizing it is equivalent to minimizing the KL divergence between the variational and true posteriors. Let ${\cal L}$ be the lower bound given by

$$\begin{aligned}{\cal L}=&\,\BBE\left[\log p({\mmb X},{\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})\right]-\BBE\left[\log q({\mmb Z},{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda})\right]\\=&\,\BBE\left[\log p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})\right]+\BBE\left[\log p({\mmb Z}\vert{\mmb\pi},{\mmb\tau})\right]-\BBE\left[\log q({\mmb Z})\right]\\&+\BBE\left[\log p({\mmb\pi})\right]+\BBE\left[\log p({\mmb\tau})\right]+\BBE\left[\log p({\mmb\mu},{\mmb\Lambda})\right]\\&-\BBE\left[\log q({\mmb\pi})\right]-\BBE\left[\log q({\mmb\tau})\right]-\BBE\left[\log q({\mmb\mu},{\mmb\Lambda})\right].\end{aligned}\tag{34}$$

The calculation of these terms is described in Appendix I.
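The relation between the lower bound and the KL divergence follows from a standard identity, sketched here for reference. The shorthand ${\mmb\Theta}=\{{\mmb\pi},{\mmb\tau},{\mmb\mu},{\mmb\Lambda}\}$ is introduced only for this sketch, and all expectations are taken with respect to $q({\mmb Z},{\mmb\Theta})$:

$$\begin{aligned}\log p({\mmb X})=&\,\BBE\left[\log{p({\mmb X},{\mmb Z},{\mmb\Theta})\over q({\mmb Z},{\mmb\Theta})}\right]+\BBE\left[\log{q({\mmb Z},{\mmb\Theta})\over p({\mmb Z},{\mmb\Theta}\vert{\mmb X})}\right]\\=&\,{\cal L}+{\rm KL}\left(q({\mmb Z},{\mmb\Theta})\,\Vert\,p({\mmb Z},{\mmb\Theta}\vert{\mmb X})\right).\end{aligned}$$

Because $\log p({\mmb X})$ does not depend on $q$ and the KL divergence is nonnegative, ${\cal L}$ never exceeds $\log p({\mmb X})$, and every increase in ${\cal L}$ corresponds to an equal decrease in the KL divergence.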

SECTION V

## INFINITE LATENT HARMONIC ALLOCATION

Our goal is to formulate and train nested infinite GMMs without model selection and hyperparameter tuning. To do this, we consider the limit of the nested finite GMMs described in Section IV as both $K$ and $M$ approach infinity. In addition, we put noninformative hyperprior distributions on influential hyperparameters in a hierarchical Bayesian manner and then calculate the posterior distributions of those hyperparameters. As a result of Bayesian inference, likely values are given large posterior probabilistic densities and unlikely values are given small densities. Such informative posterior distributions naturally emerge from noninformative prior distributions as polyphonic spectra are observed. That is, uncertainty is decreased by getting additional information. In the end we estimate F0s by taking MAP values of the posterior distributions.

### A. Mathematical Preparation

We explain the Dirichlet process (DP) and the hierarchical Dirichlet process (HDP), which can be used as nonparametric Bayesian priors in our infinite models. In this subsection, mathematical symbols follow the conventions customary in the literature. Their definitions are therefore valid only within this subsection.

#### 1) Dirichlet Process

The DP and its extensions play important roles in the theory of Bayesian nonparametrics [39]. The DP was formally introduced by Ferguson [40] in 1973 and, in the past 10 years, has often been used as a building block of infinite mixture models.

A formal definition of the DP is that its marginal distributions must be Dirichlet distributed [40]. Let $\alpha$ be a positive real number and $G_{0}$ be a distribution over a sample space $\Theta$. We say a random distribution $G$ over $\Theta$ is DP distributed with concentration parameter $\alpha$ and base measure $G_{0}$ if TeX Source $$\displaylines{\left(G(A_{1}),G(A_{2}),\ldots, G(A_{K})\right)\hfill\cr\hfill\sim{\rm Dir}\left(\alpha G_{0}(A_{1}),\alpha G_{0}(A_{2}),\ldots, \alpha G_{0}(A_{K})\right)\quad\hbox{(35)}}$$ for any finite measurable partition $\{A_{1},A_{2},\ldots, A_{K}\}$ of $\Theta$. The DP is thus a distribution over distributions. This is written TeX Source $$G\sim{\rm DP}(\alpha,G_{0}).\eqno{\hbox{(36)}}$$ Then, a concrete sample $\theta\in\Theta$ is drawn from $G$ as follows: TeX Source $$\theta\sim G.\eqno{\hbox{(37)}}$$

An alternative constructive definition of the DP is known as the stick-breaking construction (SBC) [41]. As illustrated in Fig. 5, a random distribution $G$ can be written explicitly as a countably infinite sum of point masses (“atoms”): TeX Source \eqalignno{\theta_{k}\sim&\,G_{0}&\hbox{(38)}\cr G(\theta)=&\,\sum_{k=1}^{\infty}\pi_{k}\delta_{\theta_{k}}(\theta)&\hbox{(39)}} where $\delta_a(x)$ is the Dirac delta function that diverges to positive infinity at $x=a$, is otherwise equal to 0, and integrates to 1 with respect to $x$. The point mass $\pi_{k}$ of $\theta_{k}$ is given by TeX Source \eqalignno{\pi_{k}^{\prime}\sim&\,{\rm Beta}(1,\alpha)&\hbox{(40)}\cr\pi_{k}=&\,\pi_{k}^{\prime}\prod_{i=1}^{k-1}\left(1-\pi^{\prime}_{i}\right)&.\hbox{(41)}} The distribution on the infinite number of mixing weights ${\mmb\pi}=\{\pi_{1},\pi_{2},\ldots, \pi_{\infty}\}$ is often written ${\mmb\pi}\sim{\rm GEM}(\alpha)$, where the letters stand for Griffiths, Engen, and McCloskey.
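As a minimal sketch, the stick-breaking construction in (40) and (41) can be simulated with a finite number of breaks. The function name, the truncation level, and the use of Python's standard `random` module are assumptions made for illustration; the DP itself has infinitely many atoms, so the returned leftover stick length is the mass that the truncation ignores.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw the first `truncation` weights of pi ~ GEM(alpha), following (40)-(41)."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha)  # pi'_k ~ Beta(1, alpha)
        weights.append(fraction * remaining)    # pi_k = pi'_k * prod_{i<k} (1 - pi'_i)
        remaining *= 1.0 - fraction
    return weights, remaining


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (0.5, 5.0):
        weights, leftover = stick_breaking_weights(alpha, truncation=50, rng=rng)
        # A smaller alpha tends to concentrate the mass on fewer atoms.
        print(alpha, max(weights), leftover)
```

Because the weights and the leftover stick always sum to one, the leftover value gives a direct check on whether a chosen truncation level is large enough for a given $\alpha$.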

Fig. 5. Stick-breaking construction of the Dirichlet process. Starting with a stick of length 1, we break it at $\pi_{1}^{\prime}\sim{\rm Beta}(1,\alpha)$ and assign $\pi_{1}$ to be the length of the stick we just broke off. We obtain the infinite number of mixing weights, $\{\pi_{2},\pi_{3},\ldots, \pi_{\infty}\}$, by breaking the remaining portion recursively.
Fig. 6. Discretization property of the Dirichlet process. $G$ becomes an infinite-dimensional discrete distribution when $G_{0}$ is a continuous distribution. The smaller $\alpha$ is, the fewer atoms in $G$ occupy most of its total probability mass. This means that $G$ becomes more sparse.

An important property of the DP is that $G$ is discrete with probability one. As shown in Fig. 6, $G$ is an infinite-dimensional discrete distribution even when $G_{0}$ is a continuous distribution. The DP can therefore be used as a prior distribution to formulate an infinite mixture model. In the case of an infinite GMM (iGMM), for example, $\Theta$ is a space of Gaussians (i.e., a space of means and variances), and $G_{0}$ is usually set to a Gaussian–Wishart distribution, which is a conjugate prior distribution over Gaussians. $G$ drawn from the DP is then also a distribution over Gaussians. Every time an observation is generated, a Gaussian $\theta\in\Theta$ is drawn from $G$; that is, $\theta$ is selected from the infinite number of Gaussians $\{\theta_{1},\theta_{2},\ldots, \theta_{\infty}\}$ according to their probabilities $\{\pi_{1},\pi_{2},\ldots, \pi_{\infty}\}$. This is a straightforward extension of a conventional finite GMM.

Several extensions that increase the degrees of freedom of the standard DP have been proposed. For example, a beta two-parameter process [42] is obtained when $$\pi_{k}^{\prime}\sim{\rm Beta}(\alpha,\beta)\eqno{\hbox{(42)}}$$ where the positive real numbers $\alpha$ and $\beta$ are adjustable parameters of the beta distribution.

#### 2) Hierarchical Dirichlet Process

We discuss how to simultaneously train tied infinite mixture models when observed data consists of multiple groups, e.g., spectral strips (frames). Here a set of component distributions should be shared across mixture models trained for different groups. Such parameter tying enables us to directly compare compositions of different groups in terms of mixing weights of component distributions. This is similar to vector quantization (VQ) [43]. Let $N$ be the number of groups. In this setting, it is natural to use a DP for modeling observed data of each group as follows: TeX Source $$G_{n}\sim{\rm DP}(\alpha,G_{0})\quad(1\leq n\leq N)\eqno{\hbox{(43)}}$$ where $G_{n}$ is a random distribution on $\Theta$ for group $n$.

A problem is that if $G_{0}$ is a continuous distribution, atoms (component distributions) drawn from $G_{n}$ for generating observations are almost surely disjoint from those drawn from $G_{n^{\prime}}$ $(n^{\prime}\neq n)$. This is because the $N$ DPs independently determine the positions of their countably infinite sets of discrete atoms ${\mmb\theta}=\{\theta_{1},\ldots, \theta_{\infty}\}$ (cardinality $\aleph_{0}$) from the uncountably infinite continuous space $\Theta$ (cardinality $\aleph$).

Fig. 7. Overview of hierarchical Dirichlet process. $G_{0}$ becomes an infinite-dimensional discrete distribution when $H$ is a continuous distribution. The smaller $\alpha$ is, the fewer atoms in $G_{n}$ occupy most of its total probability mass. This means that $G_{n}$ becomes more sparse.

To solve this problem, we use an HDP [44] as a nonparametric prior distribution. As shown in Fig. 7, we consider the base measure $G_{0}$ itself to be distributed according to a top-level DP as follows: $$G_{0}\sim{\rm DP}(\gamma,H)\eqno{\hbox{(44)}}$$ where $\gamma$ is a concentration parameter and $H$ a base measure over $\Theta$. In this model, $G_{0}$ always becomes a discrete distribution. The SBC of the top-level DP is given by \eqalignno{\theta_{k}\sim&\,H&\hbox{(45)}\cr G_{0}(\theta)=&\,\sum_{k=1}^{\infty}\pi_{k}\delta_{\theta_{k}}(\theta)&\hbox{(46)}} where ${\mmb\pi}=\{\pi_{1},\ldots, \pi_{\infty}\}$ and ${\mmb\theta}=\{\theta_{1},\ldots, \theta_{\infty}\}$ are the point masses and positions of atoms and we have ${\mmb\pi}\sim{\rm GEM}(\gamma)$. Similarly, the SBC of a lower-level DP is given by \eqalignno{\theta_{nk}\sim&\,G_{0}&\hbox{(47)}\cr G_{n}(\theta)=&\,\sum_{k=1}^{\infty}\pi_{nk}\delta_{\theta_{nk}}(\theta)&\hbox{(48)}} where ${\mmb\pi}_{n}\sim{\rm GEM}(\alpha)$ and each $\theta_{nk}$ is selected from ${\mmb\theta}$. Note that $\theta_{nk}$ can be equal to $\theta_{nk^{\prime}}$ even if $k\neq k^{\prime}$ because $G_{0}$ is a discrete distribution. Another direct representation based on ${\mmb\theta}$ determined by the top-level DP is as follows: $$G_{n}(\theta)=\sum_{k=1}^{\infty}\pi_{nk}^{\ast}\delta_{\theta_{k}}(\theta)\eqno{\hbox{(49)}}$$ where ${\mmb\pi}_{n}^{\ast}\sim{\rm DP}(\alpha,{\mmb\pi})$ and the hyperparameter $\alpha$ controls the difference between ${\mmb\pi}$ and ${\mmb\pi}_{n}^{\ast}$. Therefore, only point masses ${\mmb\pi}_{n}^{\ast}$ (mixture weights) differ between groups while positions ${\mmb\theta}$ (component distributions) are shared across groups.
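The sharing of atoms across groups can be sketched with a truncated approximation. The sketch below assumes a finite truncation level $K$ (the last top-level weight absorbs the leftover stick), a standard normal distribution as a stand-in for the base measure $H$, and hypothetical function names. Under the truncation, the top-level weights have finite support, so by the defining property (35) the group-level weights are Dirichlet distributed with parameters $\alpha\pi_{k}$.

```python
import random


def gem_truncated(concentration, K, rng):
    """First K stick-breaking weights; the last one absorbs the leftover stick."""
    weights, remaining = [], 1.0
    for _ in range(K - 1):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights


def dirichlet(params, rng):
    """Dirichlet draw from normalized gamma variates.

    The tiny floor only guards against zero shape parameters caused by underflow.
    """
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [g / total for g in draws]


def sample_truncated_hdp(gamma, alpha, K, N, rng):
    """Return shared atoms, top-level weights, and one weight vector per group."""
    atoms = [rng.gauss(0.0, 1.0) for _ in range(K)]  # theta_k ~ H (assumed N(0, 1))
    top = gem_truncated(gamma, K, rng)               # pi ~ GEM(gamma), truncated
    groups = [dirichlet([alpha * p for p in top], rng)  # pi*_n ~ Dir(alpha * pi)
              for _ in range(N)]
    return atoms, top, groups
```

Every group reuses the same list `atoms` and differs only in its weight vector, which mirrors (49). A larger $\alpha$ keeps each group's weights closer to the top-level weights.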

A remaining problem is how to adjust the influential hyperparameters $\alpha$ and $\gamma$. This problem is often solved by putting vague gamma hyperprior distributions on these hyperparameters and inferring the posterior distributions.

### B. Model Formulation

We explain how to formulate nested infinite GMMs based on an HDP and generalized DPs by extending the nested finite GMMs described in Section IV.

First we discuss $K\rightarrow\infty$. An important requirement is that basis models [harmonic GMMs represented by (1)] should be shared as a global set across all $D$ frames because each basis sound has a duration and may appear in different frames while only its weight varies. The HDP can satisfy this requirement and we can explain the HDP from the generative point of view. After an unbounded number of bases are initially generated according to a top-level DP, an unbounded number of bases are selected in each frame according to a frame-specific DP. In practice, a limited number of bases are used to represent a spectral strip because the number of observed frequency particles $(n_{d\cdot\cdot})$ is limited. Mathematically speaking, in (6) we consider infinite-dimensional Dirichlet distributions, which are equivalent to the frame-specific DPs, and assume hyperparameter ${\mmb\nu}$ to be distributed according to the top-level DP as follows: TeX Source \eqalignno{\mathtilde{\nu}_{k}\sim&\,{\rm Beta}(1,\gamma)&\hbox{(50)}\cr\nu_{k}=&\,\mathtilde{\nu}_{k}\prod_{k^{\prime}=1}^{k-1}(1-\mathtilde{\nu}_{k^{\prime}})&\hbox{(51)}} where $\gamma$ is a concentration parameter of the top-level DP.

Now we discuss $M\rightarrow\infty$. Because each basis is allowed to consist of a unique infinite set of harmonic partials (basis models are independent of each other), instead of (7) we can use beta two-parameter processes as follows: TeX Source \eqalignno{\mathtilde{\tau}_{km}\sim&\,{\rm Beta}(\beta\lambda_{1},\beta\lambda_{2})&\hbox{(52)}\cr\tau_{km}=&\,\mathtilde{\tau}_{km}\prod_{m^{\prime}=1}^{m-1}(1-\mathtilde{\tau}_{km^{\prime}})&\hbox{(53)}} where $\beta$ is a positive real number and we let $\lambda_{1}$ and $\lambda_{2}$ sum to unity. Note that we used the size-biased permutation property of the SBC to encourage lower harmonic partials to have larger weights because roughly speaking, the weights of harmonic partials of an instrument sound decrease exponentially.
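The exponential decay encouraged by (52) and (53) can be checked with a short sketch. Because $\lambda_{1}+\lambda_{2}=1$, each stick fraction has mean $\lambda_{1}$, and because the fractions are independent, $\BBE[\tau_{km}]=\lambda_{1}\lambda_{2}^{m-1}$. The function names, the finite number of partials $M$, and the parameter values are assumptions made for illustration.

```python
import random


def expected_partial_weights(lam1, M):
    """E[tau_km] = lam1 * (1 - lam1)**(m - 1) for m = 1..M under (52)-(53)."""
    lam2 = 1.0 - lam1
    return [lam1 * lam2 ** (m - 1) for m in range(1, M + 1)]


def sample_partial_weights(beta, lam1, M, rng):
    """One draw of the first M weights from the beta two-parameter stick-breaking."""
    lam2 = 1.0 - lam1
    weights, remaining = [], 1.0
    for _ in range(M):
        fraction = rng.betavariate(beta * lam1, beta * lam2)  # tilde tau_km
        weights.append(fraction * remaining)                  # tau_km
        remaining *= 1.0 - fraction
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_partial_weights(4.0, 0.5, 5, rng) for _ in range(10000)]
    averages = [sum(d[m] for d in draws) / len(draws) for m in range(5)]
    print(averages)                          # Monte Carlo averages
    print(expected_partial_weights(0.5, 5))  # geometric decay for comparison
```

In this sketch $\lambda_{1}$ sets the decay rate of the expected weights, while $\beta$ controls how strongly individual draws may deviate from that geometric profile.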

Because hyperparameters $\alpha$, $\beta$, $\gamma$, and ${\mmb\lambda}$ are influential, we put hyperprior distributions on them as follows: TeX Source \eqalignno{p(\alpha)=&\,{\rm Gam}(\alpha\vert a_{\alpha},b_{\alpha})&\hbox{(54)}\cr p(\gamma)=&\,{\rm Gam}(\gamma\vert a_{\gamma},b_{\gamma})&\hbox{(55)}\cr p(\beta)=&\,{\rm Gam}(\beta\vert a_{\beta},b_{\beta})&\hbox{(56)}\cr p({\mmb\lambda})=&\,{\rm Beta}(\lambda_{1}\vert u_{1},u_{2})&\hbox{(57)}} where $a_{\{\alpha,\beta,\gamma\}}$ and $b_{\{\alpha,\beta,\gamma\}}$ are shape and rate parameters of the gamma distributions. $u_{1}$ and $u_{2}$ are parameters of the beta distribution. These distributions are set to be vague ($a_{\{\alpha,\beta,\gamma\}}=1.0$, $b_{\{\alpha,\beta,\gamma\}}=0.001$, and $u_{1}=u_{2}=1.0$ in our experiments described in Section VI).

Fig. 8. Graphical representation of nested infinite Gaussian mixture models for iLHA. First the infinite sets of mixing weights ${\mmb\pi}$ and ${\mmb\tau}$ are stochastically generated according to an HDP and beta two-parameter processes (generalized DPs). At the same time, an infinite number of Gaussian distributions are stochastically generated according to a Gaussian–Wishart prior distribution. Then one of the harmonic partials contained in one of the bases is stochastically selected as a latent variable ${\mmb z}_{dn}$ according to multinomial distributions defined by ${\mmb\pi}$ and ${\mmb\tau}$. Finally, frequency ${\mmb x}_{dn}$ is stochastically generated according to a Gaussian distribution specified by ${\mmb z}_{dn}$.

Fig. 8 shows a graphical representation of the iLHA model. The full joint distribution is given by TeX Source $$\displaylines{p({\mmb X},{\mmb Z},{\mmb\pi},\mathtilde{\mmb\tau},{\mmb\mu},{\mmb\Lambda},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu})=p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})p({\mmb\mu},{\mmb\Lambda})\hfill\cr\hfill\times p({\mmb Z}\vert{\mmb\pi},\mathtilde{\mmb\tau})p({\mmb\pi}\vert\alpha,\mathtilde{\mmb\nu})p(\mathtilde{\mmb\tau}\vert\beta,{\mmb\lambda})p(\alpha)p(\beta)p(\gamma)p({\mmb\lambda})p(\mathtilde{\mmb\nu}\vert\gamma)\!\quad\hbox{(58)}}$$ where $p({\mmb Z}\vert{\mmb\pi},\mathtilde{\mmb\tau})$ is given by plugging (52) into (5) and $p({\mmb\pi}\vert\alpha,\mathtilde{\mmb\nu})$ is given by (6). $p(\mathtilde{\mmb\nu}\vert\gamma)$ and $p(\mathtilde{\mmb\tau}\vert\beta,{\mmb\lambda})$ are defined according to (50) and (52) as follows: TeX Source \eqalignno{p(\mathtilde{\mmb\nu}\vert\gamma)=&\,\prod_{k}{\rm Beta}(\mathtilde{\nu}_{k}\vert 1,\gamma)&\hbox{(59)}\cr p(\mathtilde{\mmb\tau}\vert\beta,{\mmb\lambda})=&\,\prod_{km}{\rm Beta}(\mathtilde{\tau}_{km}\vert\beta\lambda_{1},\beta\lambda_{2}).&\hbox{(60)}}

### C. Collapsed Variational Bayesian Inference

Fig. 9. Graphical representation of collapsed nested infinite mixture models for iLHA. After the original parameters ${\mmb\pi}$, $\mathtilde{\mmb\tau}$, ${\mmb\mu}$, and ${\mmb\Lambda}$ are integrated out, the auxiliary variables ${\mmb\eta}$, ${\mmb\xi}$, ${\mmb s}$, and ${\mmb t}$ are introduced to set up conjugacy between hyperprior distributions and a marginalized likelihood function.

There are two problems in training the HDP mixture model. The first problem is that VB needs to assume the independence between latent variables and parameters to factorize a posterior distribution as in (9). This assumption is sometimes too strong and leads to incorrect posterior approximation. The second problem is that applying VB to hierarchical Bayesian models that have no conjugacy between priors and hyperpriors is generally difficult.

To solve these problems, we use a sophisticated version of VB called collapsed variational Bayes (CVB) [45]. It instead assumes independence between individual latent variables in a “collapsed” space in which parameters are integrated out (marginalized out). This is reasonable because the dependence between individual latent variables in the collapsed space is generally much weaker than the dependence between a set of parameters and a set of latent variables in the non-collapsed space. In addition, we introduce auxiliary variables to apply CVB to hierarchical Bayesian models.

Fig. 9 shows a graphical representation of a collapsed iLHA model. Integrating out ${\mmb\pi}$, $\mathtilde{\mmb\tau}$, ${\mmb\mu}$, and ${\mmb\Lambda}$, we obtain the marginal distribution given by TeX Source $$\displaylines{p({\mmb X},{\mmb Z},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu})=p({\mmb X}\vert{\mmb Z})p({\mmb Z}\vert\alpha,\beta,{\mmb\lambda},\mathtilde{\mmb\nu})\hfill\cr\hfill\times p(\alpha)p(\beta)p(\gamma)p({\mmb\lambda})p(\mathtilde{\mmb\nu}\vert\gamma).\quad\hbox{(61)}}$$ The first term of (61) can be easily calculated by leveraging conjugacy between $p({\mmb X}\vert{\mmb Z},{\mmb\mu},{\mmb\Lambda})$ and $p({\mmb\mu},{\mmb\Lambda})$ as follows: TeX Source $$p({\mmb X}\vert{\mmb Z})=(2\pi)^{-{n_{\cdots}\over 2}}\prod_{k}\left({b_{0}\over b_{zk}}\right)^{1\over 2}{B({\mmb W}_{0},c_{0})\over B({\mmb W}_{zk},c_{zk})}\eqno{\hbox{(62)}}$$ where $B({\mmb W}_{0},c_{0})$ and $B({\mmb W}_{zk},c_{zk})$ are normalization factors of prior and posterior Gaussian–Wishart distributions. $b_{zk}$, $c_{zk}$, and ${\mmb W}_{zk}$ are obtained by substituting $z_{dnkm}$ for $\gamma_{dnkm}$ in calculating (26), (27), and (29). Similarly, the second term of (61) can be calculated by leveraging conjugacy between $p({\mmb Z}\vert{\mmb\pi},\mathtilde{\mmb\tau})$ and $p({\mmb\pi}\vert\alpha,\mathtilde{\mmb\nu})p(\mathtilde{\mmb\tau}\vert\beta,{\mmb\lambda})$ as follows: TeX Source $$\displaylines{p({\mmb Z}\vert\alpha,\beta,{\mmb\lambda},\mathtilde{\mmb\nu})=\prod_{d}{\Gamma(\alpha)\over\Gamma(\alpha+n_{d\cdot\cdot})}\prod_{k}{\Gamma(\alpha\nu_{k}+n_{dk\cdot})\over\Gamma(\alpha\nu_{k})}\hfill\cr\hfill\times\prod_{km}{\Gamma(\beta)\Gamma(\beta\lambda_{1}+n_{\cdot km})\Gamma(\beta\lambda_{2}+n_{\cdot k>m})\over\Gamma(\beta\lambda_{1})\Gamma(\beta\lambda_{2})\Gamma(\beta+n_{\cdot k\geq m})}\quad\hbox{(63)}}$$ where $\Gamma$ is the gamma function.

We then introduce auxiliary variables by using a technique called data augmentation [45]. Let $\eta_{d}$ and $\xi_{km}$ be beta-distributed variables and $s_{dk}$ and ${\mmb t}_{km}$ be positive integers that satisfy $1\leq s_{dk}\leq n_{dk\cdot}$, $1\leq t_{km1}\leq n_{\cdot km}$, and $1\leq t_{km2}\leq n_{\cdot k>m}$. We can augment (63) as follows: \eqalignno{&p({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t}\vert\alpha,\beta,{\mmb\lambda},\mathtilde{\mmb\nu})\cr&\quad=\prod_{d}{\eta_{d}^{\alpha-1}(1-\eta_{d})^{n_{d\cdot\cdot}-1}\over\Gamma(n_{d\cdot\cdot})}\prod_{k}{n_{dk\cdot}\brack s_{dk}}(\alpha\nu_{k})^{s_{dk}}\cr&\qquad\times\prod_{km}{\xi_{km}^{\beta-1}(1-\xi_{km})^{n_{\cdot k\geq m}-1}\over\Gamma(n_{\cdot k\geq m})}\cr&\qquad\times{n_{\cdot km}\brack t_{km1}}(\beta\lambda_{1})^{t_{km1}}{n_{\cdot k>m}\brack t_{km2}}(\beta\lambda_{2})^{t_{km2}}&\hbox{(64)}} where ${n\brack s}$ denotes an unsigned Stirling number of the first kind. We can confirm that (64) reduces to (63) by marginalizing out the auxiliary variables ${\mmb\eta}$, ${\mmb\xi}$, ${\mmb s}$, and ${\mmb t}$. The augmented marginal distribution is given by $$\displaylines{p({\mmb X},{\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu})=p({\mmb X}\vert{\mmb Z})\hfill\cr\hfill\times p({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t}\vert\alpha,\beta,{\mmb\lambda},\mathtilde{\mmb\nu})p(\alpha)p(\beta)p(\gamma)p({\mmb\lambda})p(\mathtilde{\mmb\nu}\vert\gamma).\quad\hbox{(65)}}$$
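The two identities behind this augmentation can be checked numerically. Summing out an integer auxiliary variable uses $\sum_{s}{n\brack s}a^{s}=\Gamma(a+n)/\Gamma(a)$, and integrating out a beta-distributed auxiliary variable uses $\int_{0}^{1}\eta^{\alpha-1}(1-\eta)^{n-1}d\eta/\Gamma(n)=\Gamma(\alpha)/\Gamma(\alpha+n)$. The following is a minimal sketch in which the function names, the midpoint-rule integration, and the test values are illustrative assumptions.

```python
import math


def unsigned_stirling_first(n):
    """Return [c(n, 0), ..., c(n, n)], unsigned Stirling numbers of the first kind."""
    row = [1]  # c(0, 0) = 1
    for i in range(n):
        # Recurrence: c(i + 1, s) = i * c(i, s) + c(i, s - 1)
        nxt = [0] * (len(row) + 1)
        for s, value in enumerate(row):
            nxt[s] += i * value
            nxt[s + 1] += value
        row = nxt
    return row


def stirling_sum(n, a):
    """Sum_s c(n, s) * a**s, i.e., the integer auxiliary variable summed out."""
    return sum(c * a ** s for s, c in enumerate(unsigned_stirling_first(n)))


def gamma_ratio(n, a):
    """Gamma(a + n) / Gamma(a), computed through log-gamma for stability."""
    return math.exp(math.lgamma(a + n) - math.lgamma(a))


def beta_integral(alpha, n, steps=20000):
    """Midpoint-rule estimate of the integral of eta**(alpha-1) * (1-eta)**(n-1) on (0, 1)."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** (alpha - 1.0) * (1.0 - (i + 0.5) * h) ** (n - 1.0)
                   for i in range(steps))


if __name__ == "__main__":
    print(stirling_sum(5, 0.7), gamma_ratio(5, 0.7))
    print(beta_integral(2.5, 4) / math.gamma(4), 1.0 / gamma_ratio(4, 2.5))
```

Applying the first identity to each Stirling factor and the second to each beta factor of (64) recovers the corresponding gamma-function ratios in (63).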

To apply CVB to approximate the true posterior distribution $p({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu}\vert{\mmb X})$, we assume that the variational posterior distribution can be factorized as follows: $$\displaylines{q({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu})=q(\alpha,\beta,\gamma,{\mmb\lambda})\hfill\cr\hfill\times q(\mathtilde{\mmb\nu})q({\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t}\vert{\mmb Z})\prod_{dn}q({\mmb z}_{dn})\quad\hbox{(66)}}$$ where we assumed independence between hyperparameters, auxiliary variables, and elements of ${\mmb Z}$. We also use an approximation technique called variational posterior truncation. More specifically, we assume that $q(z_{dnkm})=0$ when $k>K^{+}$ or $m>M^{+}$. In practice, we set $K^{+}$ and $M^{+}$ to sufficiently large integers. This does not mean that effective model complexities are fixed in advance. The larger the truncation levels we use, the more accurate the approximation becomes.

To optimize $q({\mmb Z},{\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu})$, we use a variational EM algorithm that iterates the following steps: TeX Source \eqalignno{{\hskip-15pt}q^{\ast}({\mmb z}_{dn})\!\propto\!&\,\exp\!\left(\BBE_{{\mmb Z}^{\neg dn},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu}}\left[\log\ {\rm Eqn.}\ (61)\right]\right)&\hbox{(67)}\cr{\hskip-15pt}q^{\ast}(\alpha,\beta,\gamma,{\mmb\lambda})\!\propto\!&\,\exp\!\left(\BBE_{{\mmb Z},{\mmb\eta},{\mmb s},{\mmb\xi},{\mmb t},\mathtilde{\mmb\nu}}\left[\log\ {\rm Eqn.}\ (65)\right]\right)&\hbox{(68)}\cr{\hskip-15pt}q^{\ast}(\mathtilde{\mmb\nu})\!\propto\!&\,\exp\!\left(\BBE_{{\mmb Z},{\mmb\eta},{\mmb s},{\mmb\xi},{\mmb t},\alpha,\beta,\gamma,{\mmb\lambda}}\left[\log\ {\rm Eqn.}\ (65)\right]\right)&\hbox{(69)}\cr{\hskip-15pt}q^{\ast}({\mmb\eta},{\mmb\xi},{\mmb s},{\mmb t}\vert{\mmb Z})\!\propto\!&\,\exp\!\left(\BBE_{{\mmb Z},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu}}\left[\log\ {\rm Eqn.}\ (65)\right]\right)&\hbox{(70)}} where $\neg dn$ denotes a set of indices without $d$ and $n$.

### D. Variational Posterior Distributions

We derive the formulas for updating variational posterior distributions according to (67)–(70).

#### 1) CVB-E Step

An optimal variational distribution of ${\mmb Z}$ can be obtained as the product of multinomial distributions. The posterior probability that ${\mmb x}_{dn}$ was generated from the $m$th harmonic partial of basis $k$ is given by \eqalignno{&\log q^{\ast}(z_{dnkm}=1)\cr&\quad=\BBE_{{\mmb z}^{\neg dn}}\!\left[\log\left(\BBG[\alpha\nu_{k}]+n_{dk\cdot}^{\neg dn}\right)\right]\cr&\qquad+\!\BBE_{{\mmb z}^{\neg dn}}\!\!\left[\log\!\left({\BBG[\beta\lambda_{1}]\!+\!n_{\cdot km}^{\neg dn}\over\BBE[\beta]\!+\!n_{\cdot k\geq m}^{\neg dn}}\!\prod_{m^{\prime}=1}^{m-1}\!{\BBG[\beta\lambda_{2}]\!+\!n_{\cdot k>m^{\prime}}^{\neg dn}\over\BBE[\beta]\!+\!n_{\cdot k\geq m^{\prime}}^{\neg dn}}\right)\right]\cr&\qquad+\!\BBE_{{\mmb z}^{\neg dn}}\!\!\left[\log{\cal S}({\mmb x}_{dnm}\vert{\mmb m}_{zk}^{\neg dn},{\mmb L}_{zk}^{\neg dn},c_{zk}^{\neg dn})\right]\!\!+\!{\rm const.}&\hbox{(71)}} where $\BBG[x]$ is the geometric average $(\BBG[x]=\exp(\BBE[\log x]))$ and ${\cal S}$ is the Student-t distribution defined by the three parameters ${\mmb m}_{zk}^{\neg dn}$, ${\mmb L}_{zk}^{\neg dn}$, and $c_{zk}^{\neg dn}$. The parameter ${\mmb L}_{zk}^{\neg dn}$ is given by $${\mmb L}_{zk}^{\neg dn}={b_{zk}^{\neg dn}\over 1+b_{zk}^{\neg dn}}c_{zk}^{\neg dn}{\mmb W}_{zk}^{\neg dn}\eqno{\hbox{(72)}}$$ where $b_{zk}^{\neg dn}$, $c_{zk}^{\neg dn}$, ${\mmb m}_{zk}^{\neg dn}$, and ${\mmb W}_{zk}^{\neg dn}$ are obtained according to (26)–(29) in which $z_{dnkm}$ is substituted for $\gamma_{dnkm}$ and the sums are calculated without ${\mmb z}_{dn}$. Each term of (71) can be approximated efficiently by using first-order and second-order Taylor expansions [45], [46], [47].

Equation (71) calculates the geometric averages of three predictive distributions under posterior distributions. These predictive distributions are derived from an infinite-dimensional Dirichlet distribution (a DP for an infinite mixture of iGMMs), stick-breaking construction (a DP for an iGMM), and a Gaussian distribution. Interestingly, this corresponds to (13) based on the geometric averages of three likelihood functions under posterior distributions. This implies that CVB is more robust to the local-optima problem than standard VB is.

#### 2) CVB-M Step

We can optimize the variational posterior distributions of the hyperparameters analytically by optimizing those of the auxiliary variables. First, $\alpha$, $\beta$, and $\gamma$ are gamma distributed as follows: \eqalignno{q^{\ast}(\alpha)\propto&\,\alpha^{a_{\alpha}+\BBE[s_{\cdot\cdot}]-1}e^{-\alpha\left(b_{\alpha}-\sum_{d}\BBE[\log\eta_{d}]\right)}&\hbox{(73)}\cr q^{\ast}(\beta)\propto&\,\beta^{a_{\beta}+\BBE[t_{\cdots}]-1}e^{-\beta\left(b_{\beta}-\sum_{km}\BBE[\log\xi_{km}]\right)}&\hbox{(74)}\cr q^{\ast}(\gamma)\propto&\,\gamma^{a_{\gamma}+K-1}e^{-\gamma\left(b_{\gamma}-\sum_{k}\BBE\left[\log(1-\mathtilde{\nu}_{k})\right]\right)}&\hbox{(75)}} and ${\mmb\lambda}$ and $\mathtilde{\mmb\nu}$ are beta distributed as follows: \eqalignno{q^{\ast}({\mmb\lambda})\propto&\,\lambda_{1}^{u_{1}+\BBE[t_{\cdot\cdot 1}]-1}\lambda_{2}^{u_{2}+\BBE[t_{\cdot\cdot 2}]-1}&\hbox{(76)}\cr q^{\ast}(\mathtilde{\nu}_{k})\propto&\,\mathtilde{\nu}_{k}^{1+\BBE[s_{\cdot k}]-1}(1-\mathtilde{\nu}_{k})^{\BBE[\gamma]+\BBE[s_{\cdot>k}]-1}.&\hbox{(77)}} Then ${\mmb\eta}$ and ${\mmb\xi}$ are beta distributed as follows: \eqalignno{q^{\ast}(\eta_{d})\propto&\,\eta_{d}^{\BBE[\alpha]-1}(1-\eta_{d})^{n_{d\cdot\cdot}-1}&\hbox{(78)}\cr q^{\ast}(\xi_{km}\vert{\mmb Z})\propto&\,\xi_{km}^{\BBE[\beta]-1}(1-\xi_{km})^{n_{\cdot k\geq m}-1}&\hbox{(79)}} and ${\mmb s}$ and ${\mmb t}$ are multinomial distributed as follows: \eqalignno{q^{\ast}(s_{dk}=s\vert{\mmb Z})\propto&\,{n_{dk\cdot}\brack s}\BBG[\alpha\nu_{k}]^{s}&\hbox{(80)}\cr q^{\ast}(t_{km1}=t\vert{\mmb Z})\propto&\,{n_{\cdot km}\brack t}\BBG[\beta\lambda_{1}]^{t}&\hbox{(81)}\cr q^{\ast}(t_{km2}=t\vert{\mmb Z})\propto&\,{n_{\cdot k>m}\brack t}\BBG[\beta\lambda_{2}]^{t}.&\hbox{(82)}}
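As an illustration of how such an update might be coded, the gamma posterior of $\alpha$ in (73) has shape $a_{\alpha}+\BBE[s_{\cdot\cdot}]$ and rate $b_{\alpha}-\sum_{d}\BBE[\log\eta_{d}]$. The function name and the numerical inputs below are hypothetical; in an actual run the expectations would come from the current variational distributions of the auxiliary variables.

```python
def update_q_alpha(a_alpha, b_alpha, expected_s_total, expected_log_eta):
    """Return (shape, rate) of the gamma distribution q*(alpha) in (73).

    expected_s_total : E[s_..], the expected sum of all s_dk
    expected_log_eta : list of E[log eta_d], one entry per frame d
    """
    shape = a_alpha + expected_s_total
    # Each E[log eta_d] is non-positive, so subtracting the sum increases the rate.
    rate = b_alpha - sum(expected_log_eta)
    return shape, rate


if __name__ == "__main__":
    # Vague hyperprior (a = 1.0, b = 0.001) and made-up expectations.
    shape, rate = update_q_alpha(1.0, 0.001, 12.0, [-0.5, -1.5])
    print(shape, rate, shape / rate)  # the last value is E[alpha] = shape / rate
```

The updates for $\beta$ and $\gamma$ in (74) and (75) have the same form with their own sufficient statistics.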

To optimize the variational posterior distributions, we need to calculate the expectations of these variables. If a random variable $x$ follows ${\rm Gam}(x\vert a,b)$ with shape parameter $a$ and rate parameter $b$, its expectations are given by $\BBE[x]=a/b$ and $\BBE[\log x]=\psi(a)-\log(b)$. If $x$ follows ${\rm Beta}(x\vert c,d)$ with parameters $c$ and $d$, its expectations are given by $\BBE[x]=c/(c+d)$ and $\BBE[\log x]=\psi(c)-\psi(c+d)$. Note that the distributions given by (79)–(82) are conditioned on ${\mmb Z}$. The expectations must therefore be averaged over ${\mmb Z}$. For example, we now have the following conditional expectation: $$\BBE[\log\xi_{km}\vert{\mmb Z}]=\psi\left(\BBE[\beta]\right)-\psi\left(\BBE[\beta]+n_{\cdot k\geq m}\right).\eqno{\hbox{(83)}}$$ We use Taylor expansion to average $\BBE[\log\xi_{km}\vert{\mmb Z}]$ over ${\mmb Z}$, but the digamma function $\psi$ diverges to negative infinity much faster than the logarithmic function does in the vicinity of the origin. To solve this problem, we use a method that treats the case $n_{\cdot k\geq m}=0$ exactly and applies second-order approximation when $n_{\cdot k\geq m}>0$ [45]. We can similarly average the following conditional expectations: \eqalignno{{\hskip-15pt}\BBE[s_{dk}\vert{\mmb Z}]\!=\!&\,\BBG[\alpha\nu_{k}]\left(\psi\left(\BBG[\alpha\nu_{k}]\!+\!n_{dk\cdot}\right)\!-\!\psi\left(\BBG[\alpha\nu_{k}]\right)\right)&\hbox{(84)}\cr{\hskip-15pt}\BBE[t_{km1}\vert{\mmb Z}]\!=\!&\,\BBG[\beta\lambda_{1}]\left(\psi\left(\BBG[\beta\lambda_{1}]\!+\!n_{\cdot km}\right)\!-\!\psi\left(\BBG[\beta\lambda_{1}]\right)\right)&\hbox{(85)}\cr{\hskip-15pt}\BBE[t_{km2}\vert{\mmb Z}]\!=\!&\,\BBG[\beta\lambda_{2}]\left(\psi\left(\BBG[\beta\lambda_{2}]\!+\!n_{\cdot k>m}\right)\!-\!\psi\left(\BBG[\beta\lambda_{2}]\right)\right).&\hbox{(86)}}
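The closed-form expectations above can be sketched in a few lines. Since Python's standard library has no digamma function, the sketch uses a common approximation (upward recurrence followed by an asymptotic series); this approximation and all function names are assumptions for illustration, and a library routine could be substituted. The averaging over ${\mmb Z}$ by Taylor expansion is not covered here.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # 1/(12 x^2) - 1/(120 x^4) + 1/(252 x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def gamma_expectations(a, b):
    """E[x] and E[log x] for x ~ Gam(x | a, b) with shape a and rate b."""
    return a / b, digamma(a) - math.log(b)


def beta_expectations(c, d):
    """E[x] and E[log x] for x ~ Beta(x | c, d)."""
    return c / (c + d), digamma(c) - digamma(c + d)


def expected_log_xi_given_counts(expected_beta, n_geq):
    """Conditional expectation (83) for a fixed count n_{.k>=m}."""
    return digamma(expected_beta) - digamma(expected_beta + n_geq)
```

For a fixed count of zero, (83) evaluates to exactly zero, which is the case that the method cited in the text treats separately from the second-order approximation.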

To estimate F0s in the end, we explicitly compute the variational posterior distributions of the integrated-out parameters ${\mmb\mu}$ and ${\mmb\Lambda}$. To do this, we need to execute the standard VB-M step once using $q({\mmb Z})$ obtained in the CVB-E step.

### E. Variational Lower Bound

As in LHA, we monitor the increase of the variational lower bound of evidence $p({\mmb X})$, which is given by TeX Source \eqalignno{{\cal L}=&\,\BBE\left[\log p({\mmb X},{\mmb Z},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu})\right]-\BBE\left[\log q({\mmb Z},\alpha,\beta,\gamma,{\mmb\lambda},\mathtilde{\mmb\nu})\right]\cr=&\,\BBE\left[\log p({\mmb X}\vert{\mmb Z})\right]+\BBE\left[\log p({\mmb Z}\vert\alpha,\beta,{\mmb\lambda},\mathtilde{\mmb\nu})\right]-\BBE\left[\log q({\mmb Z})\right]\cr&+\BBE\left[\log p(\alpha)\right]+\BBE\left[\log p(\beta)\right]+\BBE\left[\log p(\gamma)\right]\cr&-\BBE\left[\log q(\alpha)\right]-\BBE\left[\log q(\beta)\right]-\BBE\left[\log q(\gamma)\right]\cr&+\BBE\left[\log p({\mmb\lambda})\right]+\BBE\left[\log p(\mathtilde{\mmb\nu}\vert\gamma)\right]-\BBE\left[\log q({\mmb\lambda})\right]\cr&-\BBE\left[\log q(\mathtilde{\mmb\nu})\right].&\hbox{(87)}} The calculation of these terms is described in Appendix II.

SECTION VI

## EVALUATION

This section reports the results of two comparative evaluation experiments. First, we compared LHA and iLHA with PreFEst and HTC because these four methods are based on the same idea for modeling harmonic structures. Using a different data set, we then compared iLHA with NMF-based methods and other methods. In the latter experiment, we investigated how significantly the value of the scaling factor $\omega$ (i.e., how many frequency particles are assumed to be observed in total) affects the accuracy of multipitch analysis.

### A. Comparison with Conventional Parametric Methods

#### 1) Experimental Conditions

We evaluated LHA and iLHA on a test set that was used in [4] and consisted of eight pieces of piano and guitar solo performances excerpted from the RWC music database [48]. The first 23 s of each piece were used for evaluation. Spectral analysis with a 16-ms time resolution was conducted using a wavelet transform with Gabor wavelets. The correct values and temporal positions of actual F0s were prepared by hand as ground truth. Denoting by $g_{d}$, $e_{d}$, and $c_{d}$ the respective numbers of ground-truth, estimated, and correct F0s on frame $d$, we calculated the following frame-level recall and precision rates and F-measure for each piece: $${\cal R}=100\cdot{\sum_{d}c_{d}\over\sum_{d}g_{d}}\quad{\cal P}=100\cdot{\sum_{d}c_{d}\over\sum_{d}e_{d}}\quad{\cal F}={2{\cal RP}\over{\cal R}+{\cal P}}\eqno{\hbox{(88)}}$$ and we averaged each of these measures over all pieces.
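The measures in (88) can be computed per piece as in the following minimal sketch. The function name and the per-frame counts are hypothetical, and the sketch assumes that each piece contains at least one ground-truth, one estimated, and one correct F0, so no division by zero is handled.

```python
def frame_level_scores(ground_truth, estimated, correct):
    """Recall and precision (in percent) and F-measure as defined in (88).

    Each argument is a list with one count per frame d: g_d, e_d, and c_d.
    """
    total_correct = sum(correct)
    recall = 100.0 * total_correct / sum(ground_truth)
    precision = 100.0 * total_correct / sum(estimated)
    f_measure = 2.0 * recall * precision / (recall + precision)
    return recall, precision, f_measure


if __name__ == "__main__":
    # Made-up counts for a two-frame piece.
    print(frame_level_scores([4, 4], [5, 5], [4, 4]))
```

Note that the counts are pooled over frames before the ratios are taken, so frames with many F0s contribute more than sparse frames. The averaging over pieces is then a plain mean of the per-piece values.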

The prior and hyperprior distributions of LHA and iLHA were set to noninformative distributions. In LHA, $K$ and $M$ were set to 60 and 15. In iLHA, $K^{+}$ and $M^{+}$ were also set to 60 and 15. iLHA is not sensitive to these values, and no other tuning was needed for either method. To output F0s at each frame, we extracted bases whose expected weights ${\mmb\pi}$ were over a threshold that was optimized as in [4].

For comparison, we referred to the PreFEst and HTC experimental results reported in [4]. Although the ground-truth data in that study was slightly different from ours, it was close enough for a rough comparison of performance. The number of bases, priors, and weighting factors of PreFEst and HTC were carefully tuned to optimize the results. Although such tuning is not realistic in practice, it allowed the upper bounds of potential performance to be investigated in the literature.

#### 2) Experimental Results

The results listed in Table II show that the performance of iLHA closely approached and sometimes surpassed that of HTC. This is consistent with the empirical findings of many studies on Bayesian nonparametrics that nonparametric models were competitive with optimally tuned parametric models. HTC outperformed PreFEst because HTC can appropriately deal with temporal continuity of spectral bases. This implies that incorporating temporal modeling would improve the performance of iLHA.

TABLE II FRAME-LEVEL F-MEASURES OF F0 DETECTION

The results of LHA were worse than those of iLHA because LHA is not formulated in a hierarchical Bayesian manner and requires precise priors. In fact, we confirmed that the results of PreFEst and HTC based on MAP estimation were drastically degraded when using noninformative priors. The automated iLHA, in contrast, consistently performed well.

We found that model flexibility can be greatly enhanced by making time-consuming fine tuning unnecessary. Conventional studies assumed that appropriate prior knowledge is required to constrain flexibility (called regularization). By using a truly flexible hierarchical model based on Bayesian nonparametrics, however, we can let the data speak for itself. This naturally results in optimal performance.

### B. Comparison With NMF-Based Methods and Other Methods

#### 1) Experimental Conditions

We then evaluated iLHA on a test set that was used in [16] and consisted of 50 pieces of piano solo performances excerpted from the MAPS piano database [8]. The first 30 s of each piece were used for evaluation. Spectral analysis with a 10-ms time resolution was conducted using a Gabor wavelet transform. The value of $K^{+}$ was increased to 88 (the number of keys on a standard piano) because the piano pieces were much more sophisticated than those used in the first experiment. The time resolution and the value of $K^{+}$ were equal to those used in [16], and performance was evaluated in terms of F-measures.

For comparison, we referred to the experimental results of seven methods reported in [16]. We compared iLHA with four NMF-based methods: one using no constraints, one using harmonicity constraints (a subset of [13]), one using harmonicity and source-filter constraints [14], and one using harmonicity and spectral smoothness constraints [16]. Note that only the last one was manually tuned to yield the best results (the effect of hyperparameter tuning was investigated in [16]). We also compared it with a method based on harmonic sums [24], a method based on correlograms [25], and a method based on spectral peak clustering [26].

#### 2) Experimental Results

TABLE III FRAME-LEVEL ACCURACY OF F0 DETECTION

The results listed in Table III show that iLHA was the second best among the seven methods. Although the best variant of NMF achieved a higher F-measure (67.0%) than iLHA (61.2%), the well-automated iLHA remains competitive, because non-optimal settings were reported to degrade the performance of NMF moderately [16]. The F-measure of iLHA (61.2%) was close to that of NMF using only harmonicity constraints (60.5%). As discussed in Section III-C2, pLSA and PLCA are proven to have a close connection to NMF. The similarity between iLHA based on harmonic GMMs and NMF based on harmonicity constraints is therefore supported both experimentally and theoretically. In addition, the difference between NMF using only harmonicity constraints and NMF with added spectral smoothness constraints implies that the performance of iLHA would be improved by incorporating spectral smoothness modeling.

It is interesting that in almost all methods the precision ${\cal P}$ was higher than the recall ${\cal R}$. This means that there were many F0s that were hard to detect because of the complex overlapping of multiple F0s. Solving this problem would require more accurate spectral modeling that removes the assumption of amplitude additivity underlying both iLHA and NMF.
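As a reading aid, the relation between ${\cal P}$, ${\cal R}$, and the F-measure at the frame level can be sketched as follows. The sketch assumes the standard definitions (precision is the fraction of detected F0s that are correct, recall is the fraction of ground-truth F0s that are detected, and the F-measure is their harmonic mean). The function name and the representation of each frame as a set of note numbers are illustrative assumptions, not the evaluation code used in the experiments.

```python
def frame_level_scores(estimated, reference):
    """Return (precision, recall, f_measure) over aligned frames.

    estimated, reference: lists of sets, one set of active note numbers
    per frame (an assumed representation for illustration).
    """
    if len(estimated) != len(reference):
        raise ValueError("frame sequences must have equal length")
    # Count correct detections, all detections, and all ground-truth F0s.
    n_correct = sum(len(e & r) for e, r in zip(estimated, reference))
    n_detected = sum(len(e) for e in estimated)
    n_truth = sum(len(r) for r in reference)
    precision = n_correct / n_detected if n_detected else 0.0
    recall = n_correct / n_truth if n_truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: every detected F0 is correct, but two true F0s are missed.
est = [{60}, {60, 64}, set()]
ref = [{60, 64}, {60, 64}, {67}]
precision, recall, f_measure = frame_level_scores(est, ref)
```

In this toy case the precision is 1.0 while the recall is 0.6, which is the same qualitative pattern as a method that misses F0s hidden by overlapping partials.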

#### 3) Impact of Scaling Factor

We investigated the impact of the scaling factor $\omega$ described in Section III-C. We tested three different values: $\omega=0.1, 1, 10$. The similarity of respective F-measures—61.2%, 60.6%, and 60.1%—indicates that the results are not sensitive to the value of the scaling factor $\omega$. The automatic optimization of $\omega$ would be an interesting research topic that tackles the limitation of many methods based on the assumption of amplitude quantization.

SECTION VII

## CONCLUSION

This paper presented a novel statistical method for detecting multiple F0s in polyphonic music audio signals. In this method, which is called iLHA and is the first to apply Bayesian nonparametrics to multipitch analysis, we formulated nested infinite GMMs that represent polyphonic spectral strips in a hierarchical nonparametric Bayesian manner. More specifically, each spectral strip is allowed to contain an unbounded number of spectral bases, each of which can contain an unbounded number of harmonic partials. The method was fully automated by putting noninformative hyperprior distributions on influential hyperparameters except for the final thresholding process. The joint posterior distribution of all unknown variables can be inferred efficiently according to the VB framework. In our experiments comparing iLHA with the state-of-the-art methods manually optimized by trial and error, we found that iLHA is competitive enough and there is room for improvement based on modeling of temporal continuity and spectral smoothness. One interesting future direction is to use MCMC methods such as Gibbs sampling and more efficient variants for training the iLHA model.

Bayesian nonparametrics is a powerful framework avoiding the model selection problem faced in various areas of music information retrieval (MIR). For example, how many sections are required for structuring a musical piece? How many groups are required for clustering listeners according to their tastes or musical pieces according to their contents? We can avoid these problems by assuming that in theory there is an infinite number of objects (sections or groups) behind available observed data. Unnecessary objects are automatically removed from consideration through statistical inference. Hoffman et al. recently successfully applied this framework to the calculation of musical similarity [49] and the detection of repeated patterns [32], and we also plan to use this powerful framework in a wide range of applications.
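The idea that only a few of the infinitely many objects receive non-negligible weight can be illustrated with a truncated stick-breaking construction, whose fractions follow a Beta(1, $\gamma$) distribution, consistent with the form of the prior term on $\tilde{\boldsymbol\nu}$ given in the Appendix. The following is a minimal sketch that only draws weights from such a prior; it does not perform the VB inference of iLHA, and the function name, truncation level, seed, and threshold are illustrative assumptions.

```python
import random


def stick_breaking_weights(gamma, truncation, rng):
    """Draw mixing weights from a truncated stick-breaking prior.

    Each fraction v_k ~ Beta(1, gamma) breaks off a part of the stick
    that remains after the previous components have taken their share.
    """
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, gamma)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last component takes what is left
    return weights


rng = random.Random(0)
weights = stick_breaking_weights(gamma=1.0, truncation=50, rng=rng)
# Count components whose weight exceeds an (arbitrary) small threshold.
effective = sum(1 for w in weights if w > 1e-3)
print(f"sum = {sum(weights):.6f}, components with weight > 0.001: {effective}")
```

With a small concentration parameter, most of the 50 available components typically end up with negligible weight, which is the sense in which unnecessary objects drop out of consideration.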

## APPENDIX

The nine terms of the variational lower bound of LHA in (34) can be calculated as follows:

$$
\begin{aligned}
\mathbb{E}\left[\log p(\boldsymbol{X}\mid\boldsymbol{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})\right]
&=\sum_{dnkm}\gamma_{dnkm}\,\mathbb{E}_{\boldsymbol{\mu},\boldsymbol{\Lambda}}\left[\log\mathcal{N}\left(\boldsymbol{x}_{dnm}\mid\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k}^{-1}\right)\right]\\
\mathbb{E}\left[\log p(\boldsymbol{Z}\mid\boldsymbol{\pi},\boldsymbol{\tau})\right]
&=\sum_{dnkm}\gamma_{dnkm}\left(\mathbb{E}_{\boldsymbol{\pi}_{d}}[\log\pi_{dk}]+\mathbb{E}_{\boldsymbol{\tau}_{k}}[\log\tau_{km}]\right)\\
\mathbb{E}\left[\log p(\boldsymbol{\pi})\right]
&=D\log C(\alpha\boldsymbol{\nu})+\sum_{dk}(\alpha\nu_{k}-1)\,\mathbb{E}_{\boldsymbol{\pi}_{d}}[\log\pi_{dk}]\\
\mathbb{E}\left[\log p(\boldsymbol{\tau})\right]
&=K\log C(\beta\boldsymbol{\upsilon})+\sum_{km}(\beta\upsilon_{m}-1)\,\mathbb{E}_{\boldsymbol{\tau}_{k}}[\log\tau_{km}]\\
\mathbb{E}\left[\log p(\boldsymbol{\mu},\boldsymbol{\Lambda})\right]
&=\sum_{k}\mathbb{E}_{\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k}}\left[\log\mathcal{N}\left(\boldsymbol{\mu}_{k}\mid\boldsymbol{m}_{0},(b_{0}\boldsymbol{\Lambda}_{k})^{-1}\right)\mathcal{W}(\boldsymbol{\Lambda}_{k}\mid\boldsymbol{W}_{0},c_{0})\right]\\
\mathbb{E}\left[\log q(\boldsymbol{Z})\right]
&=\sum_{dnkm}\gamma_{dnkm}\log\gamma_{dnkm}\\
\mathbb{E}\left[\log q(\boldsymbol{\pi})\right]
&=\sum_{d}\log C(\boldsymbol{\alpha}_{d})+\sum_{dk}(\alpha_{dk}-1)\,\mathbb{E}_{\boldsymbol{\pi}_{d}}[\log\pi_{dk}]\\
\mathbb{E}\left[\log q(\boldsymbol{\tau})\right]
&=\sum_{k}\log C(\boldsymbol{\beta}_{k})+\sum_{km}(\beta_{km}-1)\,\mathbb{E}_{\boldsymbol{\tau}_{k}}[\log\tau_{km}]\\
\mathbb{E}\left[\log q(\boldsymbol{\mu},\boldsymbol{\Lambda})\right]
&=\sum_{k}\mathbb{E}_{\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k}}\left[\log\mathcal{N}\left(\boldsymbol{\mu}_{k}\mid\boldsymbol{m}_{k},(b_{k}\boldsymbol{\Lambda}_{k})^{-1}\right)\mathcal{W}(\boldsymbol{\Lambda}_{k}\mid\boldsymbol{W}_{k},c_{k})\right]
\end{aligned}
$$

where the fifth and last terms can be obtained as follows:

$$
\begin{aligned}
&\mathbb{E}_{\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k}}\left[\log\mathcal{N}\left(\boldsymbol{\mu}_{k}\mid\boldsymbol{m}_{0},(b_{0}\boldsymbol{\Lambda}_{k})^{-1}\right)\mathcal{W}(\boldsymbol{\Lambda}_{k}\mid\boldsymbol{W}_{0},c_{0})\right]\\
&\quad=\frac{1}{2}\log\left(\frac{b_{0}}{2\pi}\right)+\frac{1}{2}\mathbb{E}_{\boldsymbol{\Lambda}_{k}}\left[\log\lvert\boldsymbol{\Lambda}_{k}\rvert\right]+\log B(\boldsymbol{W}_{0},c_{0})\\
&\qquad-\frac{b_{0}}{2}\left(c_{k}(\boldsymbol{m}_{k}-\boldsymbol{m}_{0})^{T}\boldsymbol{W}_{k}(\boldsymbol{m}_{k}-\boldsymbol{m}_{0})+\frac{1}{b_{k}}\right)\\
&\qquad+\frac{c_{0}-2}{2}\mathbb{E}_{\boldsymbol{\Lambda}_{k}}\left[\log\lvert\boldsymbol{\Lambda}_{k}\rvert\right]-\frac{c_{k}}{2}\mathrm{Tr}\left(\boldsymbol{W}_{0}^{-1}\boldsymbol{W}_{k}\right)\\[6pt]
&\mathbb{E}_{\boldsymbol{\mu}_{k},\boldsymbol{\Lambda}_{k}}\left[\log\mathcal{N}\left(\boldsymbol{\mu}_{k}\mid\boldsymbol{m}_{k},(b_{k}\boldsymbol{\Lambda}_{k})^{-1}\right)\mathcal{W}(\boldsymbol{\Lambda}_{k}\mid\boldsymbol{W}_{k},c_{k})\right]\\
&\quad=-\mathbb{E}_{\boldsymbol{\Lambda}_{k}}\left[H\left[q(\boldsymbol{\mu}_{k}\mid\boldsymbol{\Lambda}_{k})\right]\right]-H\left[q(\boldsymbol{\Lambda}_{k})\right]\\
&\quad=\frac{1}{2}\mathbb{E}_{\boldsymbol{\Lambda}_{k}}\left[\log\lvert\boldsymbol{\Lambda}_{k}\rvert\right]+\frac{1}{2}\log\left(\frac{b_{k}}{2\pi}\right)-\frac{1}{2}+\log B(\boldsymbol{W}_{k},c_{k})\\
&\qquad+\frac{c_{k}-2}{2}\mathbb{E}_{\boldsymbol{\Lambda}_{k}}\left[\log\lvert\boldsymbol{\Lambda}_{k}\rvert\right]-\frac{c_{k}}{2}
\end{aligned}
$$
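Of these terms, $\mathbb{E}[\log q(\boldsymbol{Z})]=\sum_{dnkm}\gamma_{dnkm}\log\gamma_{dnkm}$ is the simplest to evaluate, since it is the negative entropy of the responsibilities. A minimal sketch is given below. It assumes that the responsibilities for each observation $(d,n)$ are stored as one row flattened over the pairs $(k,m)$, and it uses the convention $0\log 0=0$; the function name and data layout are illustrative assumptions.

```python
import math


def expected_log_q_z(responsibilities):
    """Sum of gamma * log(gamma) over all entries, taking 0 * log(0) as 0.

    responsibilities: iterable of rows; each row holds gamma_{dnkm} for one
    observation (d, n), flattened over the pairs (k, m), and sums to 1.
    """
    total = 0.0
    for row in responsibilities:
        for g in row:
            if g > 0.0:  # skip zeros so that log(0) is never evaluated
                total += g * math.log(g)
    return total


# One maximally uncertain observation and one fully certain observation.
value = expected_log_q_z([[0.5, 0.5], [1.0, 0.0]])
```

The result is never positive, and it equals zero only when every observation is assigned to a single pair $(k,m)$ with certainty.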

## APPENDIX

The 13 terms of the variational lower bound of iLHA in (87) can be calculated as follows:

$$
\begin{aligned}
\mathbb{E}\left[\log p(\boldsymbol{X}\mid\boldsymbol{Z})\right]
&=-\frac{n_{\cdot\cdot\cdot}}{2}\log(2\pi)+\frac{1}{2}\sum_{k}\log b_{0}-\frac{1}{2}\sum_{k}\mathbb{F}_{2}[\log b_{zk}]\\
&\quad+\sum_{k}\log B(\boldsymbol{W}_{0},c_{0})-\sum_{k}\mathbb{F}_{1}\left[\log B(\boldsymbol{W}_{zk},c_{zk})\right]\\
\mathbb{E}\left[\log p(\boldsymbol{Z}\mid\alpha,\beta,\boldsymbol{\lambda},\tilde{\boldsymbol{\nu}})\right]
&=\sum_{d}\log\left(\frac{\Gamma\left(\mathbb{E}[\alpha]\right)}{\Gamma\left(\mathbb{E}[\alpha]+n_{d\cdot\cdot}\right)}\right)\\
&\quad+\sum_{dk}\mathbb{F}_{2}\left[\log\left(\frac{\Gamma\left(\mathbb{G}[\alpha\nu_{k}]+n_{dk\cdot}\right)}{\Gamma\left(\mathbb{G}[\alpha\nu_{k}]\right)}\right)\right]\\
&\quad+\sum_{km}\mathbb{F}_{2}\left[\log\left(\frac{\Gamma\left(\mathbb{E}[\beta]\right)}{\Gamma\left(\mathbb{E}[\beta]+n_{\cdot k\geq m}\right)}\right)\right]\\
&\quad+\sum_{km}\mathbb{F}_{2}\left[\log\left(\frac{\Gamma\left(\mathbb{G}[\beta\lambda_{1}]+n_{\cdot km}\right)}{\Gamma\left(\mathbb{G}[\beta\lambda_{1}]\right)}\right)\right]\\
&\quad+\sum_{km}\mathbb{F}_{2}\left[\log\left(\frac{\Gamma\left(\mathbb{G}[\beta\lambda_{2}]+n_{\cdot k>m}\right)}{\Gamma\left(\mathbb{G}[\beta\lambda_{2}]\right)}\right)\right]\\
\mathbb{E}\left[\log p(\alpha)\right]
&=-\log\Gamma(a_{\alpha})+a_{\alpha}\log b_{\alpha}+(a_{\alpha}-1)\,\mathbb{E}_{\alpha}[\log\alpha]-b_{\alpha}\mathbb{E}_{\alpha}[\alpha]\\
\mathbb{E}\left[\log p(\beta)\right]
&=-\log\Gamma(a_{\beta})+a_{\beta}\log b_{\beta}+(a_{\beta}-1)\,\mathbb{E}_{\beta}[\log\beta]-b_{\beta}\mathbb{E}_{\beta}[\beta]\\
\mathbb{E}\left[\log p(\gamma)\right]
&=-\log\Gamma(a_{\gamma})+a_{\gamma}\log b_{\gamma}+(a_{\gamma}-1)\,\mathbb{E}_{\gamma}[\log\gamma]-b_{\gamma}\mathbb{E}_{\gamma}[\gamma]\\
\mathbb{E}\left[\log p(\boldsymbol{\lambda})\right]
&=\log\frac{\Gamma(u_{1}+u_{2})}{\Gamma(u_{1})\Gamma(u_{2})}+(u_{1}-1)\,\mathbb{E}[\log\lambda_{1}]+(u_{2}-1)\,\mathbb{E}[\log\lambda_{2}]\\
\mathbb{E}\left[\log p(\tilde{\boldsymbol{\nu}}\mid\gamma)\right]
&=K\,\mathbb{E}[\log\gamma]+\sum_{k}\left(\mathbb{E}[\gamma]-1\right)\mathbb{E}\left[\log(1-\tilde{\nu}_{k})\right]\\
\mathbb{E}\left[\log q(\boldsymbol{Z})\right]
&=\sum_{dnkm}\gamma_{dnkm}\log\gamma_{dnkm}\\
\mathbb{E}\left[\log q(\alpha)\right]
&=-H\left[\mathrm{PosteriorGamma}(\alpha)\right]\\
\mathbb{E}\left[\log q(\beta)\right]
&=-H\left[\mathrm{PosteriorGamma}(\beta)\right]\\
\mathbb{E}\left[\log q(\gamma)\right]
&=-H\left[\mathrm{PosteriorGamma}(\gamma)\right]\\
\mathbb{E}\left[\log q(\boldsymbol{\lambda})\right]
&=-H\left[\mathrm{PosteriorBeta}(\lambda_{1})\right]\\
\mathbb{E}\left[\log q(\tilde{\boldsymbol{\nu}})\right]
&=\sum_{k}-H\left[\mathrm{PosteriorBeta}(\tilde{\nu}_{k})\right]
\end{aligned}
$$

where $\mathbb{F}_{1}$ and $\mathbb{F}_{2}$ denote the first-order and second-order approximations based on Taylor expansion (see [45], [46], [47]).
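Several of the terms above are logarithms of ratios of Gamma functions, for example $\log\left(\Gamma(\mathbb{E}[\alpha])/\Gamma(\mathbb{E}[\alpha]+n_{d\cdot\cdot})\right)$. Evaluating the Gamma function directly overflows for even moderate counts, so in practice such ratios are usually computed in the log domain. The following is a minimal sketch of that numerical device only, assuming that the expectation and the counts are already available; the function name and the numerical values are illustrative assumptions.

```python
import math


def log_gamma_ratio(numerator_arg, denominator_arg):
    """Return log(Gamma(numerator_arg) / Gamma(denominator_arg)) stably."""
    return math.lgamma(numerator_arg) - math.lgamma(denominator_arg)


# First sum of E[log p(Z | ...)]: one ratio per index d.
expected_alpha = 2.5            # stands in for E[alpha]; illustrative value
counts = [120.0, 340.0]         # stand in for n_{d..}; illustrative values
first_sum = sum(
    log_gamma_ratio(expected_alpha, expected_alpha + n) for n in counts
)
```

Because `math.lgamma` works with logarithms throughout, the same helper covers the other ratio terms by swapping which argument carries the added count.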

### ACKNOWLEDGMENT

The authors would like to thank Dr. H. Kameoka (The University of Tokyo/NTT, Japan) for providing the ground-truth transcriptions of eight pieces included in the RWC music databases [48]. They would also like to thank Dr. V. Emiya (INRIA, France) for allowing them to use the valuable MAPS piano database [8].

## Footnotes

This work was supported in part by CREST, JST, and KAKENHI 20800084. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Daniel Ellis.

The authors are with the National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8568, Japan (e-mail: k.yoshii@aist.go.jp; m.goto@aist.go.jp).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

$^{1}$Linear frequency $f_{h}$ in hertz can be converted to logarithmic frequency $f_{c}$ in cents as follows: $f_{c}=1200\log_{2}\left(f_{h}/\left(440\times 2^{(3/12)-5}\right)\right)$.
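The conversion in this footnote can be checked numerically with the short sketch below. The reference frequency $440\times 2^{(3/12)-5}$ is about 16.35 Hz, so 440 Hz maps to 5700 cents and each octave adds 1200 cents. The function and constant names are illustrative assumptions.

```python
import math

# Reference frequency from the footnote: 440 * 2^(3/12 - 5), about 16.35 Hz.
REFERENCE_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)


def hz_to_cents(f_hz):
    """Convert a linear frequency in hertz to a log frequency in cents."""
    if f_hz <= 0.0:
        raise ValueError("frequency must be positive")
    return 1200.0 * math.log2(f_hz / REFERENCE_HZ)


def cents_to_hz(f_cents):
    """Inverse conversion, from cents back to hertz."""
    return REFERENCE_HZ * 2.0 ** (f_cents / 1200.0)


a4_cents = hz_to_cents(440.0)
```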
