Multichannel high resolution NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain

—Several probabilistic models involving latent components have been proposed for modelling time-frequency (TF) representations of audio signals such as spectrograms, notably in the nonnegative matrix factorization (NMF) literature. Among them, the recent high resolution NMF (HR-NMF) model is able to take both phases and local correlations in each frequency band into account, and its potential has been illustrated in applications such as source separation and audio inpainting. In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures. The new model can represent a variety of stationary and non-stationary signals, including autoregressive moving average (ARMA) processes and mixtures of damped sinusoids. A fast variational expectation-maximization (EM) algorithm is proposed to estimate the enhanced model. This algorithm is applied to piano signals, and proves capable of accurately modelling reverberation, restoring missing observations, and separating pure tones with close frequencies.


I. INTRODUCTION
N ONNEGATIVE matrix factorisation was originally intro- duced as a rank-reduction technique, which approximates an o n -n e g a t i v em a t r i xV ∈ R F ×T as a product V ≈ WH of two non-negative matrices W ∈ R F ×S and H ∈ R S×T with S<min(F, T ) [1].In audio signal processing, it is often used for decomposing a magnitude or power TF representation, such as a Fourier or a constant-Q transform (CQT) spectrogram.The columns of W are then interpreted as ad i c t i o n a r yo fs p e c t r a lt e m p l a t e s ,w h o s et e m p o r a la c t i v a t ions are represented in the rows of H.S e v e r a la p p l i c a t i o n st o audio have been addressed, such as multi-pitch estimation [2]- [4], automatic music transcription [5], [6], musical instrument recognition [7], and source separation [7]- [10].
In the literature, several probabilistic models involving latent components have been proposed to provide a probabilistic framework to NMF.Such models include NMF with additive Gaussian noise [11], probabilistic latent component analysis (PLCA) [12], NMF as a sum of Poisson components [13], and NMF as a sum of Gaussian components [14].Although they have already proven successful in a number of audio applications such as source separation [11]- [13] and multipitch estimation [14], most of these models still lack of consistency in some respects.
Firstly, they focus on modelling a magnitude or power TF representation, and simply ignore the phase information.Inan application of source separation, the source estimates are then obtained by means of Wiener-like filtering [8]- [10], which consists in applying a mask to the magnitude TF representation of the mixture, while keeping the phase field unchanged.It can be easily shown that this approach cannot properly separate sinusoidal signals lying in the same frequency band, which means that the frequency resolution is limited by that of the TF transform.In other respects, the separated TF representation is generally not consistent, which means that it does not correspond to the TF transform of a temporal signal, resulting in artefacts such as musical noise.Therefore enhanced algorithms are needed to reconstruct a consistent TF representation [15].In the same way, in an application of model-based audio synthesis, where there is no available phase field to assign to the sources, reconstructing consistent phases requires employing ad-hoc methods [16], [17].
Secondly, these models generally focus on the spectral and temporal dynamics, and assume that all time-frequency bins are independent.This assumption is clearly not relevant in the case of sinusoidal or impulse signals for instance, and it is not consistent with the existence of spectral or temporal dynamics.Indeed, in the case of wide sense stationary (WSS) processes, spectral dynamics (described by the power spectral density) is closely related to temporal correlation (described by the autocovariance function).Reciprocally, in the case of uncorrelated processes (all samples are uncorrelated with different variances), temporal dynamics induces spectral correlation.In other respects, further dependencies in the TF domain may be induced by the TF transform, due to spectral and temporal overlap between TF bins.
In order to overcome the assumption of independent TF bins, Markov models have been introduced for taking the local dependencies between contiguous TF bins of a magnitude or power TF representation into account [18]- [20].However, these models still ignore the phase information.Conversely, the complex NMF model [21], [22], which was explicitly designed to represent phases alongside magnitudes in a TF representation, is based on a deterministic framework that does not represent statistical correlations.More recently, two probabilistic models have been proposed, which partially take the phase information into account.The multichannel NMF presented in [23] is able to exploit phase relationships between different sensors via the mixing matrix, but the phases and correlations of source signals over time and frequency are not modelled.The infinite positive semidefinite tensor factorization method presented in [24] is able to exploit phase information by modelling correlations over frequency bands, but correlations over time frames are still ignored.
Alternatively, the high resolution (HR) NMF model that we introduced in [25], [26], is able to model both phases and correlations over time frames (within frequency bands) in a principled way.We showed that this model offers an improved frequency resolution, able to separate sinusoids within the same frequency band, and an improved synthesis capability, able to restore missing TF observations.It can be used with both complex-valued and real-valued TF representations, such as the short-time Fourier transform (STFT) and the modified discrete cosine transform (MDCT).It also generalizes some popular models, such as the Itakura-Saito NMF model (IS-NMF) [14], autoregressive (AR) processes [27], and the exponential sinusoidal model (ESM), commonly used in HR spectral analysis of time series [27].
In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures.Contrary to the multichannel NMF [23] where convolution was approximated, convolution is here accurately implemented in the TF domain by following the exact approach proposed in [28].Consequently, correlations over time frames and over frequency bands are both taken into account.In order to estimate this multichannel HR-NMF model, we propose a fast variational EM algorithm.This paper further develops a previous work presented in [29], by providing a theoretical ground for the TF implementation of convolution.
The paper is structured as follows.The HR-NMF model is first introduced in the time domain, then the filter bank used to compute the TF representation is presented in Section II.We show in Section III how convolutions in the original time domain can be accurately implemented in the TF domain.The multichannel HR-NMF model in the TF domain is presented in Section IV, and the variational EM algorithm is derived in Section V.This model is applied to audio inpainting and source separation in Section VI.Finally, conclusions are drawn in Section VII.

NOTATION
The following notation will be used throughout the paper (words in italics refer to the state space representation): • ϕ:f r e q u e n c ys h i f to faT Fc o n v o l u t i o nk e r n e l ; • b ms (f, ϕ, τ):m o v i n ga v e r a g ep a r a m e t e r s( output weights); • a s (f, τ):a u t o r e g r e s s i v ep a r a m e t e r s( transition weights).

II. FROM TIME DOMAIN TO TIME-FREQUENCY DOMAIN
Before defining HR-NMF in the TF domain in Section IV, we first provide a simple definition of this model in the time domain.

A. HR-NMF in the time domain
The HR-NMF model of a multichannel signal v m (n) ∈ F (where Moreover, each source image y ms (f, t) for any s ∈ [0 ...S−1] is defined as where g ms is the impulse response of a causal and stable recursive filter, and x s (n) is a Gaussian process1 .Additionally , processes x s and w m for all s and m are mutually independent.In order to make this model identifiable, we will further assume that the spectrum of x s (n) is flat, because the variability of source s w.r.t.frequency can be modelled within filters g ms for all m.T h u sfi l t e rg ms represents both the transfer from source s to sensor m and the spectrum of source s.
The purpose of the next sections is to transpose this definition of HR-NMF into the TF domain.The advantages of switching to the TF domain are well-known: in this domain audio signals generally admit a sparse representation, and the overlap of different sound sources is reduced.In Section II-B, we introduce the filter bank notation that will be used in the following developments.Then the accurate implementation of filtering in the TF domain will be addressed in Section III.

B. Time-frequency analysis: filter bank notation
To perform the time-frequency analysis of a signal, we propose to use the general and flexible framework of perfect reconstruction (PR) filter banks [30], which include both the STFT and MDCT.In the literature, the STFT is often preferred over other existing TF transforms, because under some smoothness assumptions it allows the approximation of linear filtering by multiplying each column of the STFT by the frequency response of the filter.However we will show in Section III that such an approximation is not necessary, and that any PR filter bank will allow us to accurately implement convolutions in the TF domain.
We thus consider a filter bank [30], which transforms an input signal x(n) ∈ l ∞ (F) in the original time domain n ∈ Z (where l ∞ (F) denotes the space of bounded sequences on F) where D is the decimation factor, * denotes standard convolution, and h f (n) is an analysis filter of support [0 ...N − 1] with N = LD and L ∈ N.Thesynthesisfilters hf (n) of same support [0 ...N− 1] are designed so as to guarantee PR.This means that the output, defined as satisfies x ′ (n)=x(n − N ),w h i c hc o r r e s p o n d st oa no v e r a l l delay of N samples.Let

III. TF IMPLEMENTATION OF CONVOLUTION
In this section, we consider a stable filter of impulse response g(n) ∈ l 1 (F) (where l 1 (F) denotes the space of sequences on F whose series is absolutely convergent) and two signals Our purpose is to directly express the TF representation y(f, t) of y(n) as a function of x(f, t), i.e. to find a TF transformation T TF in Figure 1(a) such that if the input of the filter bank is x(n),thentheoutputisy(n−N ) (y is delayed by N samples in order to take the overall delay of the filter bank into account).The following developments further investigate and generalize the study presented in [28], which x(f, t) * y(f, t) Fig. 2. TF implementation of convolution focused on the particular case of critically sampled PR cosine modulated filter banks.The general case of stable linear filters is first addressed in Section III-A, then the particular case of stable recursive filters is addressed in Section III-B.

A. Stable linear filters
The PR property of the filter bank implies that the relationship between y(f, t) and x(f, t) is given by the transformation T TF described in the larger frame in Figure 1(b), where the input is x(f, t),t h eo u t p u ti sy(f, t),a n dt r a n s f o r m a t i o nT TD is defined as the time-domain convolution by g(n + N ).T h e resulting mathematical expression is given in Proposition 1.
Proposition 1.Let g(n) ∈ l 1 (F) be the impulse response of a stable linear filter, and Let y(f, t) and x(f, t) be the TF representations of these signals as defined in Section II-B.Then Proof.Firstly, applying equation (3) to signal y yields Secondly, equation (4) yields Lastly, equations ( 7) and ( 8) are obtained by successively substituting equations ( 6) and (10) into equation ( 9).
Remark 1.As mentioned in Section II-B, if |ϕ|≥K,t h e n frequency bands f and f − ϕ do not overlap, thus c g (f, ϕ, τ) can be neglected.
Equation (7) shows that a convolution in the original time domain is equivalent to a 2D-convolution in the TF domain, which is stationary w.r.t.time, and non-stationary w.r.t.frequency, as illustrated in Figure 2.

B. Stable recursive filters
In this section, we introduce a parametric family of TF filters based on a state space representation, and we show a relationship between these TF filters and equation (7).

Definition 1. Stable recursive filtering in TF domain is defined by the following state space representation
where

is the impulse response of ac a u s a la n ds t a b l er e c u r s i v efi l t e r ,t h e nt h eT Fi n p u t / o u t p u t system defined in Proposition 1 admits the state space representation (11),w h e r eP
Proposition 2 is proved in Appendix A. Proposition 3. In Definition 1, equation (11) can be rewritten in the form of equation (7),w he r e∀f ∀ϕ ∈ [−P b ...P b ],fi l t e rc g (f, ϕ, τ) is defined as the only stable (bounded-input, bounded-output) solution of the following recursion: Proposition 3 is proved in Appendix A.
Remark 2. In Definition 1, a g (f, τ) and b g (f, ϕ, τ) are overparametrised compared to g(n) in Proposition 1. Consequently, if the values of a g (f, τ) and b g (f, ϕ, τ) are arbitrary, then it is possible that no filter g(n) exists such that equation (8) holds, which means that this state space representation does no longer correspond to a convolution in the original time domain.In this case, we will say that the TF transformation defined in equation ( 11) is inconsistent2 .

IV. MULTICHANNEL HR-NMF IN TF DOMAIN
In this section we present the multichannel HR-NMF model in the TF domain, as initially introduced in [29].Here this model will be derived from the definition of HR-NMF provided in the time domain in Section II-A.
Following the definition in equation ( 1), the multichannel where r m a ld i s t r i b u t i o no fm e a n0 and variance σ 2 w : Then Proposition 2 shows how the convolution in equation (2) can be rewritten in the TF domain: the recursive filters g ms can be accurately implemented via equations ( 15) and ( 17), which come from Definition 1 3 .E a c hs o u r c ei m a g e y ms (f, t) for s ∈ [0 ...S − 1] is thus defined as and the latent components z s (f, t) ∈ F are defined as follows: where x s (f, t) ∼N F (0,σ 2 xs (t)), Q a ∈ N and a s (f, τ) defines a stable autoregressive filter.Note that the variance σ 2 xs (t) of x s (f, t) does not depend on frequency f .T h i sp a r t i c u l a rc h o i c ea l l o w su st om a k et h e model identifiable, as suggested in Section II-A (the variability w.r.t.frequency is already modelled via the the filters g ms ).
The random variables w m (f 1 ,t 1 ) and the prior mean µ s (f, t) ∈ F and the prior precision (inverse variance) ρ s (f, t) > 0 of the latent variable z s (f, t) are considered to be fixed parameters.
The set θ of parameters to be estimated consists of: (we further define a s (f, 0) = 1), 3 More precisely, compared to the result of Proposition 2, processes zs(f, t) and xs(f, t) as defined in Section IV are shifted L − 1 samples backward, in order to write bms(f, ϕ, τ) in a causal form.This does not alter the definition of HR-NMF, since equation ( 17) is unaltered by this time shift, and yms(f, t) is unchanged in equation (15). .This model encompasses the following special cases: ,t h e ne q u ation ( 14) reduces to v 0 (f, t)= S−1 s=0 b 0s (f, 0, 0)x s (f, t), thus v 0 (f, t) ∼N F (0, V ft ),w h e r em a t r i x V of coefficients V ft is defined by the NMF V = WH with W fs = |b 0s (f, 0, 0)| 2 and H st = σ 2 xs (t).T h e maximum likelihood estimation of W and H is then equivalent to the minimization of the Itakura-Saito (IS) divergence between matrix V and spectrogram V (where , hence this model is referred to as IS-NMF [14]. monochannel HR-NMF model [25], [26], [31] involving variance σ 2 w ,a u t o r e g r e s s i v ep a r a m e t e r sa s (f, τ) for all s ∈ [0 ..
Because it generalizes both IS-NMF and ESM models to multichannel data, the model defined in equation ( 14) is called multichannel HR-NMF.

V. VARIATIONAL EM ALGORITHM
In early works that focused on monochannel HR-NMF [25], [26], in order to estimate the model parameters we proposed to resort to an expectation-maximization (EM) algorithm based on a Kalman filter/smoother.The approach proved to be appropriate for modelling audio signals in applications such as source separation and audio inpainting.However, its computational cost was high, dominated by the Kalman filter/smoother, and prohibitive when dealing with high-dimensional signals.
In order to make the estimation of HR-NMF faster, we then proposed two different strategies.The first approach aimed to improve the convergence rate, by replacing the M-step of the EM algorithm by multiplicative update rules [33].However we observed that the resulting algorithm presented some nu-merical stability issues 4 .Thesecondapproachaimedtoreduce the computational cost, by using a variational EM algorithm, where we introduced two different variational approximations [31].We observed that the mean field approximation led to both improved performance and maximal decrease of computational complexity.
In this section, we thus generalize the variational EM algorithm based on mean field approximation to the multichannel HR-NMF model introduced in Section IV, as proposed in [29].Compared to [31], novelties also include a reduced computational complexity and a parallel implementation.

A. Review of variational EM algorithm
Va r i a t i o n a l i n f e r e n c e [ 3 4 ] i s n ow a c l a s s i c a l a p p r o a c h f o r estimating a probabilistic model involving both observed variables v and latent variables z,d e t e r m i n e db yas e tθ of parameters.Let F be a set of probability density functions (PDFs) over the latent variables z.F o ra n yP D Fq ∈F and any function φ(z),w en o t e⟨φ⟩ q = φ(z)q(z)dz.T h e nf o r any set of parameters θ,t h evariational free energy is defined as L(q; θ)= ln p(v, z; θ) q(z) The variational EM algorithm is a recursive algorithm for estimating θ.I tc o n s i s t so ft h et w of o l l o w i n gs t e p sa te a c h iteration i: • Expectation (E)-step (update q): • Maximization (E)-step (update θ): In the case of multichannel HR-NMF, θ has been specified in Section IV.We further define

.T − 1].T h ec o m p l e t es e to f variables consists of:
• the set v of observed variables v m (f, t) for m ∈ [0 ...M − 1] and for all f and t such that δ m (f, t)=1, . We use a mean field approximation [34]: F is defined as the set of PDFs which can be factorized in the form 4 Indeed, the convergence of multiplicative update rules was not proved in [33] (more specifically, there is no theoretical guaranteet h a tt h el o glikelihood is non-decreasing), whereas the convergence of EM strategies is well established.Besides, as stated in [33], we observed that multiplicative update rules may exhibit some numerical instabilities for small values of the tuning parameter (the variation of the log-likelihood oscillates instead of monotonically increasing), which was the reason for introducing a more stable tempering approach, that consists in making ε vary from 1 to a lower value over iterations.In this paper, we therefore preferredt ou s eas l o w e r method with guaranteed convergence.It is possible that the convergence rate could be improved in future using multiplicative update rules.
With this particular factorization of q(z),t h es o l u t i o no f( 1 9 ) is such that each PDF q sf t is Gaussian: In the following sections, we will use notation φ = ⟨φ⟩ q and γ φ = ⟨|φ − φ| 2 ⟩ q ,f o ra n yf u n c t i o nφ of the latent variables.

C. Variational EM algorithm for multichannel HR-NMF
According to the mean field approximation, the maximizations in equations ( 19) and ( 20) are performed for each scalar parameter in turn [34].The dominant complexity of each iteration of the resulting variational EM algorithm is 4MFST∆f ∆t,w h e r e∆f =1+2P b and ∆t =1+Q z (by updating the model parameters in turn rather than jointly, the complexity of the M-step has been divided by a factor (∆t) 2 compared to [31]).However we highlight a possible parallel implementation, by making a difference between parfor loops which can be implemented in parallel, and for loops which have to be implemented sequentially.
1) E-step: ,l e tρ s (f, t)=0 .C o n s i d e r i n gt h em e a n field approximation (21), the E-step defined in equation ( 19) leads to the updates described in Table I (where * denotes complex conjugation).Note that z s (f, t) has to be updated after γ zs (f, t).
2) M-step: The M-step defined in ( 20) leads to the updates described in Table II.The updates of the four parameters can be processed in parallel.
w end parfor end for

VI. SIMULATION RESULTS
In this section, we present a basic proof of concept of the multichannel HR-NMF model.The ability to accurately model reverberation and restore missing observations is illustrated in Section VI-A, and the ability to separate pure tones with close frequencies is illustrated in Section VI-B.

A. Audio inpainting
The following experiments deal with a single source (S =1) formed of a real piano sound sampled at 11025 Hz.A 1.25msshort stereophonic signal (M =2)hasbeensynthesizedbyfiltering the monophonic recording of a loud C3 piano note from the MUMS database [35] with two room impulse responses xs (t) (zs(f,t−τ) * (as(f,τ)zs(f,t−τ )−xs(f,t))) simulated using the Matlab code presented in [36]  5 .T h eT F representation v m (f, t) of this signal has then been computed by applying a critically sampled PR cosine modulated filter bank (F = R)withF =201frequency bands, involving filters of length 8F =1608samples.The resulting TF representation, of dimension F × T with T =7 7 ,i sd i s p l a y e di nF i g u r e3 .In particular, it can be noticed that the two channels are not synchronous (the starting time in the left channel is ≈ 0.04s, whereas it is ≈ 0.02sintherightchannel),whichsuggeststhat the order Q b of filters b ms (f, ϕ, τ) should be chosen greater than zero.

TABLE II M-STEP OF THE VARIATIONAL EM ALGORITHM
In the following experiments, we have set µ s (f, t)=0and ρ s (f, t)=1 0 5 .T h e s ev a l u e sf o r c ez s (f, t) to be close to zero ∀t ∈ [−Q z ...− 1] (since the prior mean and variance 5 Those impulse responses were simulated using 15625 virtual sources.The dimensions of the room were [20m, 19m, 21m], the coordinates of the two microphones were [19m, 18m 1.6m] and [15m, 11m, 10m], and those of the source were [5m, 2m, 1m].The reflection coefficient of the walls was 0.3. of z s (f, t) are µ s (f, t)=0and 1/ρ s (f, t)=1 0 −5 ), which is relevant if the observed sound is preceded by silence.The variational EM algorithm is initialized with the neutral values z s (f, t)=0 , γ zs (f, t)=σ 2 w = σ 2 xs (t)=1 , a s (f, τ)= {τ =0} ,a n db ms (f, ϕ, τ)= {ϕ=0,τ =0} .I no r d e r to illustrate the capability of the multichannel HR-NMF model to synthesize realistic audio data, we address the case of missing observations.We suppose that all TF points within the red frame in Figure 3 are unobserved: δ m (f, t)=0 ∀t ∈ [26 ...50] (which corresponds to the time range 0.47s-0.91s),and δ m (f, t)=1for all other t in [0 ...T− 1].Ineach experiment, 100 iterations of the algorithm are performed, and the restored signal is returned as y ms (f, t).
In the first experiment, a multichannel HR-NMF with Q a = Q b = P b =0is estimated.Similarly to the example provided in Section IV, this is equivalent to modelling the two channels by two rank-1 IS-NMF models [14] having distinct spectral atoms W and sharing the same temporal activation H,o rb yar a n k1m u l t i c h a n n e lN M F[ 2 3 ] .T h er e s u l t i n gT F representation y ms (f, t) is displayed in Figure 4.It can be noticed that wherever v m (f, t) is observed (δ m (f, t)=1 ), y ms (f, t) does not accurately fit v m (f, t) (this is particularly visible in high frequencies), because the length Q b of filters b ms (f, ϕ, τ) has been underestimated: the source to distortion ratio (SDR) 6 in the observed area is 11.7dB.In other respects, the missing observations (δ m (f, t)=0)c o u l dn o tb er e s t o r e d (y ms (f, t) is zero inside the frame, resulting in an SDR of 0dB in this area), because the correlations between contiguous TF coefficients in v m (f, t) have not been taken into account.In the second experiment, a multichannel HR-NMF model with Q a =2, Q b =3,a n dP b =1is estimated.The resulting TF representation y ms (f, t) is displayed in Figure 5.It can be noticed that wherever v m (f, t) is observed, y ms (f, t) better fits v m (f, t):theSDRis36.8dBintheobservedarea.Besides, the missing observations have been better estimated: the SDR is 4.8dB inside the frame.Actually, choosing P b > 0 was 6 The SDR between a data vector v and an estimate v is defined as 20 log 10 necessary to obtain this result, which means that the spectral overlap between frequency bands cannot be neglected in this multichannel setting.

B. Source separation
In this section, we aim to illustrate the ability of HR-NMF to separate pure tones with close frequencies, based on the autoregressive parameters a s (f, τ),i nad i f fi c u l tu n d e r d e t e rmined setting (M<S ) 7 .F o rs i m p l i c i t y ,w eh a v ec h o s e nt o deal with a 2s-long monophonic mixture (M =1), composed of a chord of S =2piano notes, one semitone apart (A4 and Ab4 from the MAPS database [37]   w =1 0 −2 ,a n dt h es a m ei n i t i a lv a l u e so ft h eo t h e r parameters as in the first stage.Then 100 iterations of the E-step are performed in order to let z s (f, t) and γ zs (f, t) converge to relevant values based on the learned parameters, and finally 100 iterations of the full variational EM algorithm are performed.
In order to assess the separation performance, we have evaluated the SDR obtained in the two configurations.In the first configuration, the SDR of A4 is 17.67 dB and that of Ab4 is 23.08 dB.In the second configuration, the SDR of A4 is increased to 22.37 dB and that of Ab4 is increased to 27.78 dB. Figure 7 focuses on the results obtained in the frequency band f =40,w h e r et h efi r s tp a r t i a l so fA 4a n dA b 4o v e r l a p , resulting in a challenging separation problem.The real parts of the original sources are represented as red solid lines.As expected, the two sources are not properly separated in the first configuration (IS-NMF), because the estimated signals (represented as black dashed lines) are obtained by multiplying the mixture signal by a nonnegative mask, and interferences cannot be cancelled.As a comparison, the signals estimated in the second configuration (represented as blue dots) accurately fit the partials of the original sources.Note however that this remarkable result was obtained by guiding the variational EM algorithm with relevant initial values.In future work, we will need to develop robust estimation methods, less sensitive to initialisation, in order to perform source separation in a fully unsupervised way.

VII. CONCLUSIONS
In this paper, we have shown that convolution can be accurately implemented in the TF domain, by applying 2Dfilters to a TF representation obtained as the output of a PR filter bank.In the particular case of recursive filters, we have also shown that filtering can be implemented by means of a state space representation in the TF domain.These results have then been used to extend the monochannel HR-NMF model initially proposed in [25], [26] to multichannel signals and convolutive mixtures.The resulting multichannel HR-NMF model can accurately represent the transfer from each source to each sensor, as well as the spectrum of each source.It also takes the correlations over frequencies into account.In order to estimate this model from real audio data, avariationalEMalgorithmhasbeenproposed,whichhasareduced computational complexity and a parallel implementation compared to [31].This algorithm has been successfully applied to piano signals, and has been capable of accurately modelling reverberation due to room impulse response, restoring missing observations, and separating pure tones with close frequencies.
In order to deal with more realistic music signals, the estimation of the HR-NMF model should be performed in a more informed way, e.g. by means of semi-supervised learning, or by using any kind of prior information about the mixture or about the sources.For instance, harmonicity and temporal or spectral smoothness could be enforced by re-parametrising the model, or by introducing some prior distributions of the model parameters.Because audio signals are sparse in the timefrequency domain, we observed that the multichannel HR-NMF model involves a small number of non-zero parameters in practice.In future work, we will investigate enforcing this property, by introducing a prior distribution of the filter parameters such as that proposed in [38], or a prior distribution of the variances of the innovation process x s (f, t) (modelling variances with a prior distribution is an idea that has been successfully investigated in earlier works [39]- [41]).In other respects, the model could also be extended in several ways, for instance by taking the correlations over latent components into account, or by using other types of TF transforms, e.g.wavelet transforms.
Regarding the estimation of the HR-NMF model, the mean field approximation involved in our variational EM algorithm is known to induce a slow convergence rate.The convergence could thus be accelerated by replacing the mean field approximation by a structured mean field approximation, like in [42].Such an approximation was already proposed to estimate the monochannel HR-NMF model [31], at the expense of a higher computational complexity per iteration.Some alternative Bayesian estimation techniques such as Markov chain Monte Carlo (MCMC) methods and message passing algorithms [34] could also be applied to the HR-NMF model.In other respects, we observed that the variational EM algorithm is hardly able to separate multiple concurrent sourcesin afullyunsupervisedframework,becauseofitshighsensitivity to initialisation.More robust estimation methods are thus needed, which could for instance take advantage of the algebra principles exploited in high resolution methods [32].
Lastly, the multichannel HR-NMF model could be used in avarietyofapplications,suchassourcecoding,audioinpainting, automatic music transcription, and source separation.

APPENDIX TF IMPLEMENTATION OF STABLE RECURSIVE FILTERING
Proof of Proposition 2. We consider the TF implementation of convolution given in Proposition 1, and we define g(n) as the impulse response of a causal and stable recursive filter, having only simple poles.Then the partial fraction expansion of its transfer function [43] shows that it can be written in the form g(n)=g 0 (n)+

•
z * :c o m p l e xc o n j u g a t eo fz ∈ C; • m:s e n s o ri n d e x( r e l a t e dt ot h em u l t i c h a n n e lm i x t u r e ) ; • s:s o u r c ei n d e x( r e l a t e dt ot h el a t e n tc o m p o n e n t s ) ; • n:t i m ei n d e xi nt h eo r i g i n a lt i m ed o m a i n ; • v m :o b s e r v e dm i x t u r e ; • w m :a d d i t i v ew h i t eG a u s s i a nn o i s eo fv a r i a n c eσ 2 w ; • y ms :s o u r c ei m a g e s( output variables); • z s :l a t e n tc o m p o n e n t s( state variables); • x s :l a t e n ti n n o v a t i o n s( input variables); • t:t i m ef r a m ei n d e xi nt h eT Fd o m a i n ; • f :f r e q u e n c yb a n di n d e xi nt h eT Fd o m a i n ; • τ :t i m es h i f to faT Fc o n v o l u t i o nk e r n e l ; Fig. 1.Time-frequency vs. time domain transformations , having only simple poles lying inside the unit circle.The moving average parameters b g (f, ϕ, τ) ∈ F define a sequence of finite support w.r.t.τ , and ∀f ∈ [0 .

Fig. 6 .
Fig. 6.Spectrogram of the mixture of the A4 and Ab4 piano notes.

2 w
a n d1 0 0i t e r a t i o n sa r ep e r f o r m e d .In a second stage, the variance parameters σ 2 xs (t) and σ are estimated from the observed mixture, and the separated signals are obtained as y 0s (f, t) for s ∈{ 0, 1}.I nt h efi r s t configuration, the spectral parameters learned in the first stage are kept unchanged, the values of the time activations σ 2 xs (t) are initialized to 1, and 30 iterations of multiplicative update rules are performed.In the second configuration, the spectral parameters learned in the first stage are kept unchanged, the variational EM algorithm is initialized with the time activations σ 2 xs (t) estimated in the first configuration, the value σ 2

Fig. 7 .
Fig.7.Separation of two sinusoidal components.The real parts of the two components y 0s (f, t) are plotted as red solid lines, their IS-NMF estimates are plotted as black dashed lines, and their HR-NMF estimatesa r ep l o t t e da s blue dots.