Multichannel Blind Music Source Separation Using Directivity-Aware MNMF With Harmonicity Constraints

In this paper we present a harmonic constrained Multichannel Non-Negative Matrix Factorization (MNMF) method for the task of blind music source separation. In this model, the mixing filter encodes the spatial information in terms of magnitude and phase differences between channels whereas the source variances are modelled using a harmonic constrained NMF structure. In this work, the spatial covariance matrix is obtained from the constant-Q transform to account for the frequency logarithmic scale inherent in music signals and reduce the dimensionality of the parameters. Moreover, to mitigate the strong sensitivity to parameter initialization, we propose to initialize the spatial weights with the output of the steered response power (SRP) with the phase transform (PHAT) algorithm. The proposed method has been evaluated for the task of music source separation using a multichannel classical chamber music dataset with several polyphony and reverberation setups. Furthermore, a comparison with other state-of-the-art signal decomposition methods has been accomplished showing reliable results in terms of BSS_EVAL metrics.


I. INTRODUCTION
Audio source separation aims to segregate constituent sound sources from an audio signal mixture. This task has been one of the most popular research problems in the music information retrieval community. Since most of the music audio is available in the form of mixtures, there are several applications of a system capable of music source separation -e.g. automatic creation of karaoke, acoustic emphasis, music transcription, music unmixing and remixing, music production and education purposes.
Many approaches have been addressed in the last two decades in order to achieve this separation. A typical approach consists of decomposing a time-frequency representation of the mixture signal using methods such as non-negative matrix factorization (NMF), independent The associate editor coordinating the review of this manuscript and approving it for publication was Lin Wang. component analysis (ICA), or probabilistic latent component analysis (PLCA). Among these factorization techniques, NMF has been widely used for music audio signals, as it allows to describe the signal as a non-subtractive combination of sound objects (or ''atoms'') over time. However, without further information, the quality of the separation using the aforementioned statistical methods is limited. One solution is to exploit the spectro-temporal properties of the sources. For example, spectral harmonicity and temporal continuity can be assumed for several musical instruments while percussive instruments are characterized by short bursts of broadband energy [1]. Speech source spectrogram can be modelled using a source-filter model [2]. Other approaches also used spatial localization of the sources [3], [4].
When training material is available, it is possible to learn the spectro-temporal patterns and the methods are referred to as supervised. In this way, several signal decomposition methods have been presented which provide superior results VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ to the blind scenario. Recently, deep neural networks (DNN) have been extensively used for this purpose. The existing methods mostly use DNN with either the spectrogram as the input signal representation [5], [6] or directly the time-domain representation [7], [8] to train such a system. Convolutional neural networks (CNN) [6], [9] and long short term memory (LSTM) [10], [11] networks are the popular choices for DNN model architectures adapted for music source separation. Some of the latest top-performing music source separation models are Open-Unmix [11], MMDenseLSTM [12], Demucs [13] and Meta TasNet [14]. The aforementioned approaches are developed for single channel mixtures. When multichannel signals are available, separation can be improved by taking into account the spatial locations of sources or the mixing process. Earlier NMF based approaches relied on stacking magnitude or power spectrograms of all channels into a 3-valence nonnegative tensor and decomposing it with non-negative tensor factorisation (NTF) methods [15] or other NTF-like nonnegative structured approximations [16], [17]. However, since the phase information is discarded, these approaches do not allow exploiting the interchannel phase differences (IPDs), but only the interchannel level differences (ILDs). ILD-based approaches perform well in close-miking scenarios [4], [18]. However, in the far-field case (i.e. when the microphone array size is much smaller than the distances between the sources and microphones) the ILDs are practically negligible and, therefore spatial information can only be exploited using IPDs. Multichannel non-negative matrix factorization (MNMF) based approaches model the latent source magnitude-or power-spectrograms with NMF while the spatial mixing system is modelled using a Gaussian probabilistic modelling applied directly to the complex-valued STFTs of all channels [19]- [21]. The spatial properties of the sources can be modelled using a spatial covariance matrix (SCM) which encodes magnitude and phase differences between the recorded channels. Authors in [19] proposed to estimate unconstrained SCM mixing filters together with an NMF magnitude model to identify and separate repetitive frequency patterns corresponding to a single spatial location. To mitigate the effect of the spatial aliasing, Nikunen and Virtanen [20] proposed an SCM model based on direction of arrival (DoA) kernels to estimate the inter-microphone time delay given a looking direction. Carabias et al. [22] proposed an SCM kernel-based model where the mixing filter is decomposed into two direction dependent SCMs to represent and estimate disjointly both time and level differences between array channels. The main drawback of these strategies is the large number of parameters that have to be tuned and thus, without any prior information, these methods are prone to converge to local minima, especially in reverberant environments. Prior information about the localization can be used to reduce the computational cost and the strong sensitivity to parameter initialization [23]. Alternatively, several studies have recently proposed to restrict the SCMs of sources to jointly diagonalize the full-rank matrices for multichannel blind source separation [21], [24]. Similarly, [25] uses a discrete Fourier transform (DFT) matrix as the diagonalizer under the DoA kernel-based model from [20] projecting the signals into the wavenumber domain [25]. Recent works have tried to exploit multichannel audio with deep neural network (DNN) based approaches. Deep-clustering methods are augmented with spatial information in [26] with large improvements over monophonic versions. Alternatively, several works [5], [27] combine DNN-based source spectrogram estimation with multichannel NMF-inspired spatial models. Finally, a fully spatial-spectral factorization DNN is proposed as deep tensor factorization in [28].
In this paper, we present a blind music source separation approach based on MNMF where the signal model is constrained to be harmonic. To account for the inherent logarithmic frequency scale in western music, we propose to use the Constant-Q transform (CQT) [29] as a timefrequency representation. Although CQT is complex-valued and the ILDs and IPDs are encoded similarly to the DFT, existing approaches in the literature only used the amplitude information [30]- [32]. In fact, to our best knowledge, this is the first work that exploited the phase information of the CQT within an MNMF scheme. Using CQT for music source separation is beneficial for three reasons: 1) The frequency scale can be adjusted to a semitone level with a lower dimensionality than STFT representation, 2) The IPDs can be used as spatial cues to enhance the separation results in farfield scenarios, 3) Perfect reconstruction of the time-domain signals from CQT representation is possible [33]. Similar to [23], we propose a prior localization scheme using the Steered response power (SRP) with phase transform (PHAT) [34] algorithm to initialize the model parameters and thus, reduce the computational complexity and increase the robustness w.r.t. the parameter initialization. Several experiments have been performed using a multichannel dataset of classical chamber music with different polyphony levels and reverberant conditions. Moreover, the proposed approach has been compared with other state-of-the-art signal decomposition based approaches showing better results in terms of BSS_EVAL metrics.
The paper is organized as follows. The introduction is presented in Section I. Section II introduces the problem formulation and briefly reviews the background of MNMF, the harmonic signal models and the CQT transform. The proposed harmonic constrained MNMF method is explained in Section III. Section IV presents the experimental results of the proposed methods in comparison with other stateof-the-art methods for the task of multichannel music source separation. Finally, the conclusions are presented in Section V.

A. PROBLEM SPECIFICATION
The problem considered in this work is to separate each source signal from a set of audio mixtures recorded from a microphone array. The observed signal can be expressed as where the mixture x m (l) consists of s ∈ [1, S] sources captured by microphones m ∈ [1, M ], and the time-domain sample index is denoted by l. The spatial response from source s to microphone m is represented by a mixing filter h ms (τ ) and the single-channel source signals are denoted by y s (l). Assuming the additive of complex spectra, the mixing model in Eq. (1) can be expressed in the frequency domain as:

B. MULTICHANNEL NMF
Multichannel NMF can be formulated based on a so-called local Gaussian model (LGM), which allows modelling and combining spatial and spectral cues in a systematic way.
LGM modelling [35] assumes that each source image (M-length complex-valued vector [y 1,sft , . . . , y M ,sft ] T ∈ C M ) is modelled as a zero-mean circular complex Gaussian random vector as follows where the complex-valued covariance matrix is positive definite Hermitian, and it is composed of two factors: 1) a spatial covariance H sft ∈ C M ×M representing the spatial characteristics of the sth source image at the TF point (f , t), and 2) a spectral variance v stf ∈ R representing the spectral characteristics of the sth source image at the TF point (f , t). These source variances v sft can be modelled using a classical NMF structure as where b skf and g skt represent both the basis functions and their corresponding time-varying gains for each source-dependent component k ∈ [1, K s ].
In the case of static sources, it is reasonable to assume that the spatial covariances are time-invariant, i.e., H sft = H sf [17], [35]. In addition, assuming the random vectors y sft to be independent in time, frequency and between sources, the mixture STFT coefficients in the multichannel mixing in Eq. (3) may be shown distributed as where H sf could be modelled as a rank-1 SCM or as a full-rank matrix. Note that the full-rank spatial model can be used even in an under-determined condition (M < S) unlike the rank-1 model (only valid for (M ≥ S)), but its parameter estimation is harder (and computationally demanding) because of the considerably larger number of parameters.

C. BEAMFORMING INSPIRED DoA-SCM MODEL
When dealing with spectrally similar sources (e.g. several speech signals), without further constraints, NMF-based approaches can lead to the situation where a single NMF component together with the corresponding SCM mixing filter represent multiple sources together at different spatial locations. To enforce SCMs at different frequencies to correspond to the same location, Nikunen and Virtanen [20] proposed to model the spatial covariance as a weighted sum of so-called DoA kernels which are rank-1 spatial covariances modelling plane waves coming from several predefined directions.
where the DoA kernels W fo ∈ C M ×M define the phase differences between every pair of microphones (n, m) for each direction indexed with o ∈ [1, O] and z ko ∈ R ≥0 is the spatial weights matrix that relates NMF components with spatial directions. Note that the spatial weights z ko are frequency independent and estimated directly during the factorization whereas the DoA kernels W fo are computed beforehand and the phase information is kept fixed. In particular, the DoA kernels are defined as where f = (i − 1)F s /F is the frequency in Hz, F s and F are the sampling frequency and the STFT length, respectively. τ nm (k o ) represents the time difference of arrival (TDoA) in seconds between the pair of microphone (n, m) and k o is a unit vector pointing towards the look direction from the geometrical center of the microphone array (i.e. the origin of the Cartesian coordinate system p = [0, 0, 0] T ). In the original method in [20], a posterior grouping strategy is required to relate components to sources.

D. HARMONIC NMF-BASED SOURCE MODEL
NMF modelling of each source consists in structuring the source variances v sft in Eq. (4) with NMF structure as in the single-channel NMF case. When dealing with musical instrument sounds, ideally, each basis function can represent a VOLUME 10, 2022 single pitch and the corresponding gains contain information about the onset and offset times of notes having that pitch. Several works [16], [36]- [41] proposed to restrict the source variances in Eq. (4) to be harmonic. The harmonicity constraint is particularly useful for the analysis and separation of musical audio signals since, by using this constraint, each basis can define a single fundamental frequency.
In [38], [39], the basis functions b skf in the model of Eq. (4) are defined as a weighted combination of n ∈ [1, N ] narrowband harmonic spectra (patterns) p knf ∈ R, which are arbitrarily fixed Each component k is associated with a single pitch (with its corresponding f 0 ) and the basis functions are obtained as a linear combination of harmonic patterns p knf with different shapes but sharing the same pitch which is fixed. Those harmonic patterns are weighted by a set of coefficients e snk which define the actual spectral representation for component k belonging to source s. Some approaches [41], [42] use an extension of the model in Eq. (8) using a single flat harmonic excitation where the basis functions for each note and instrument are reduced to a set of coefficients that define the shape of the harmonic spectral pattern: where each component k represents a single music note and the number of components K s is usually set to cover the whole dynamic range for source s; n ∈ [1, N ] is the number of harmonics; a skn is the amplitude of harmonic n for note k and instrument s; f 0 (k) is the fundamental frequency of note k; ω(f ) is the magnitude spectrum of the window function; and the spectrum of a harmonic component at frequency nf 0 (k) is approximated by ω(f − nf 0 (k)).
Although the excitation-filter (or source-filter) model has origins in speech processing and sound synthesis, a similar model can be extrapolated to musical instruments [36], [43]. In fact, each instrument can be represented using a single filter λ sf that corresponds to the resonant structure of the body of the instrument whereas the excitations can be represented as frequency components of unity magnitude at integer multiples of a certain fundamental frequency.
Some extensions for this source filter for music signals can be found in the literature. For instance, in [44] instead of defining an excitation for every possible pitch, they are given from a multipitch estimator. Additionally, the filter λ sf is represented as a linear combination of fixed elementary responses. In particular, the authors chose the elementary responses to consist of triangular bandpass magnitude responses uniformly spaced on a Mel frequency scale. Finally, in [40], the authors proposed to decompose the single flat excitation by a combination of a few excitation patterns to better model the changes in the timbre between notes across frequency.

E. THE CONSTANT-Q TRANSFORM
Choosing the proper signal representation for a specific task is not trivial, as there are many aspects to be considered. Depending on the used representation, some signal properties may be hidden or revealed. The advantageous features for a given task are more outstanding under certain representations. Moreover, in some circumstances, a specific representation can provide a priori information related to the sources.
Several signal representations have been used in the literature to adjust the time-frequency resolution to the characteristics of the signals to be analyzed such as ERB [45], MFCC [44], CQT [33] or chroma vectors [46].
In the field of music signal processing, CQT has been widely used with logarithmic spaced frequency bands that can be approximated to have a resolution of a multiple/submultiple of a musical semitone. In fact, CQT computes frequency coefficients similar to DFT but on a logarithmic frequency scale [29] as, where f ∈ [1, F] indexes the frequency bins of CQT and u * f (t) denotes the complex conjugate of u k (t) and T f are variable window lengths. The notation | · | infers rounding down towards the nearest integer. The basis functions u f (t) are complex-valued waveforms, also called time-frequency atoms, and are defined by (12) where f c is the center frequency of bin f , f s denotes the sampling rate, and ω(t) is a continuous window function (e.g. the Hann window) sampled at points determined by t/T f . The scaling factor C is given by, In order to have a bin spacing corresponding to an identical frequency ratio for every pair of adjacent notes, the center frequencies f c obey the following expression, where f 1 is the center frequency of the lowest-frequency bin and B determines the number of bins per octave. B is the most crucial parameter in CQT, because it determines the time-frequency resolution tradeoff. The window lengths T f are inversely proportional to f c in order to have the same Q-factor for all f bins and are given by, Consequently, CQT can be understood as computing DFT only for specific logarithmic spaced frequency bins. However, it is not invertible and is far less efficient than FFT, which supposes an important drawback. Some approaches have been focused on improving the efficiency and making it quasi-invertible [33], [47] or perfectly invertible [48] under certain implementation constraints.

III. PROPOSED DIRECTIONAL MNMF HARMONIC MODEL FOR BSS
In this work, a harmonic multi-excitation model based on SCM and MNMF for blind music source separation is presented. Specifically, we propose a signal model suitable for modelling the harmonic structure of musical instruments. To account for the western music logarithmically spaced frequency bands we adopted the CQT as signal representation.
In our proposal the mixing filter is decomposed as a linear combination of spatial kernels as in [20] and the source magnitude spectrograms as weighted combinations of harmonic constrained basis functions. In this work, we assume farfield (i.e., the microphone array size is much smaller than the distances between the sources and microphones). In this scenario, spatial information is obtained from the phase difference between the complex-valued CQT representation computed for each microphone. As commented in Section I, this is the first MNMF approach exploiting the phase information of the CQT. To alleviate the sensitivity of the system towards the parameter initialization, we propose to use prior information about the sources localization using a well-known source localization method.
The block diagram of the proposed method is depicted in Fig. 1. First, the SCM representation is computed from the CQT of the multichannel mixture. Second, the model parameters are estimated using two steps: 1) Initialization of the spatial weights using the SRP-PHAT algorithm [34], 2) Estimation of the source magnitude spectrogram using MNMF. Finally, a generalized Wiener filtering strategy is used to obtain the source reconstruction.

A. CQT-SCM SIGNAL REPRESENTATION
In this work, we used an SCM signal representation obtained from the complex-valued CQT transform (see Section II-E). From the original formulation of the CQT in Eq. (11), the overlap factor between successive window functions in the time domain is constant while window lengths T f are decreasing with increasing f . Consequently, the number of time frames varies as a function of the frequency bin. A way to overcome this issue is to use a ''rasterized'' CQT representation using a fixed hop size for all the center frequencies, that is, the smaller hop size in the representation. As a drawback, this procedure produces a highly redundant representation. To mitigate this problem, we propose to downsample the rasterized CQT representation to a fixed hop size of 32 ms for all the frequency bins. This downsampled representation will be used until the estimation of the Wiener-like masks (see Fig. 1). Then, an interpolation process will be carried out to obtain the inverse transform.
For each frequency bin f and time frame t, the magnitude square-rooted matrix CQT transform x Q ft of the captured signal at each sensor (16) where sgn(z) = z/|z| is the signum function for complex numbers. Then, the CQT-SCM X ft for a single time-frequency point (f , t) is defined from the multichannel captured signalx ft as the outer product where H and * stand for Hermitian transpose and complex valued conjugate, respectively. With the above definitions, the magnitude spectrum of each channel is at the diagonal of X Q ft , while the spatial properties of the mixture are restricted to its off-diagonal values, which encode the square magnitude cross correlation and phase difference between each microphone pair. Note that the SCM representation is independent of the absolute phase of the recorded signals.

B. PROPOSED MULTICHANNEL HARMONIC MULTI-EXCITATION MODEL
The SCM model in Eq. (5) allows estimating the source magnitude spectrograms v sft and the corresponding SCM mixing filter H fs yielding the desired BSS properties. Although the SCM mixing filter H fs considers the phase and amplitude differences between channels, estimating it jointly over all frequencies does not provide any explicit relation to the spatial location of the sources. Following the beamforming-inspired SCM mixing model in [22] and inspired in the original model in [20], [23] (see Eq. (6)), the SCM mixing filter H fs is modelled as a linear combination of DoA kernels W fo ∈ C M ×M multiplied by a spatial weights matrix Z ∈ R S×O + which relates sources s with spatial directions o. The SCM mixing filter is described as Note that the direct relation between sources to directions in Eq. (18) avoids the need for the grouping strategy after the factorization.
Regarding the source magnitude spectrogram v sft , in this work we propose to restrict the source basis functions to be harmonic. In fact, musical notes, excluding transients, VOLUME 10, 2022 are pseudo-periodic, and their spectra consist of regularly spaced frequency peaks. In this way, we propose to model the magnitude time-frequency spectrogram v sft in Eq. (5) of each harmonic instrument (or source) s in frame t and frequency f as a weighted sum of spectral bases with harmonic shape as where g kts denotes the time-varying activation of each musical note (i.e. the gains for each component). n ∈ [1, N ] is the number of harmonics, f 0 (k) is the fundamental frequency of note k and a ksn is the amplitude of the partial (or harmonic) n corresponding to the note k and instrument s. ω Q (f −nf 0 (k)) is the magnitude spectrum of the window function in the CQT domain of the n-th harmonic component at frequency nf 0 (k). The proposed CQT-SCM model is obtained by combining the DoA kernel based SCM mixing filter from Eq. (6) with the harmonic source spectrogram model from Eq. (19). Thus, the proposed signal model is given bỹ

C. SPATIAL WEIGHTS INITIALIZATION
For the sake of reducing the algorithm complexity and increasing the robustness w.r.t. to the parameter initialization, we constraint the spatial weights Z using prior information about the sources DoAs obtained using the well-known SRP-PHAT algorithm. Similar to [23], the DoA estimation is computed from the STFT of the multichannel signal. For each time frame, the maximum of the SRP function is scaled to one. Assuming stationary sources, the source directions can be estimated by averaging the SRP functions over time. In this work, we select all the peaks with SRP higher than 75% of the SRP value of the highest peak and we also impose a minimum distance of 10 degrees between peaks. Note that this scheme allows reducing the number of directions to be analyzed during the decomposition process (i.e., to reduce the size of Z), while allowing to relate several locations o to a single source s. Thus, a reverberant condition can be modelled as a weighted combination of anechoic responses similar to the scheme proposed in [20].

D. PARAMETER ESTIMATION
For the estimation of the signal model parameters in (20), we propose to use the majorization-minimization algorithm presented in [19], [20], [22]. Using this approach, the cost function can be described using both Euclidean (EUC) and Itakura Saito (IS) divergence. However, in this work, we use the IS divergence since it is better suited for audio modelling in comparison to EUC [49]. Following a similar approach to [19], we obtain the multiplicative updates (MU) via auxiliary functions for the case of the IS divergence. Other methods in [20], [23] derived the expectation-maximization (EM) algorithms for the IS divergence. However, as demonstrated in [19], MU updates provide faster convergence than the EM algorithms.
The IS divergence of the observed and estimated multichannel signal using the CQT-SCM domain can be expressed as where tr(X) = M m=1 x mm is the trace of a square matrix X. Note that Eq. (21) reduces to IS NMF when M = 1. For a given observation X Q , the cost function in (21) together with the proposed model in (20) can be written as (22) where constant terms are omitted. To minimize this complex-valued function f (Z, G, A), we follow the optimization scheme of majorization in [19] using an auxiliary positive definite function f + which allows the factorization under non-negativity of the parameters restriction. Then, the derivation of the algorithm updates is achieved via partial derivation of function f + w.r.t. each model parameter and setting these derivatives to zero. For the sake of brevity, we write directly the multiplicative update rules for each free parameter. Further information about the derivation of these rules can be found in [19], [22].
Similar to [25], in this work, we assume far-field and therefore, we assume the DOA kernels W fo to be fixed during the factorization. Therefore, from the proposed signal model in Eq. (20) only the spatial weights z so and the source variance parameters, a ksn and g kts are the free parameters to be optimized. The update rules for the non-negative parameters in (20) are In addition, scaling the parameters is required for eliminating trivial scale indeterminacies and avoids numerical instabilities. Then, the scaling procedure is presented as follows, Note that this scaling procedure is carried out after the updates of z so , g kts and a ksn .

E. SOURCE RECONSTRUCTION
The source signals are reconstructed using the model parameters previously estimated. To ensure that the reconstruction process is conservative, we propose to follow a generalized Wiener filtering strategy. The generalized Wiener masks represent the relative energy contribution of each source with respect to the energy of the mixture. The estimated CQT magnitude spectrogram for each sound source s and microphone m can be defined from our proposed model in Eq. 20 as Then, we apply the generalized Wiener mask to reconstitute each source from the mixture based on the power spectrum ratio between the reference signals in the CQT domain as y Q ftsm =y Q ftsm so tr(W fo ) m z so K k=1 g kts N n=1 a ksn ω Q (f − nf 0 (k)) x Q ftm (28) where x Q ftm ∈ C is the time-frequency CQT transform of the input multichannel mixture. Finally, the multichannel time-domain signals are obtained by the inverse CQT transform [33] ofỹ Q ftsm . The whole proposed MNMF algorithm for music source separation is summarized in Algorithm 1.

IV. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, the proposed method in Section III is evaluated for the task of multichannel instrumental music source separation using a popular dataset of small ensembles. Moreover, the performance of our framework has been compared to other state-of-the-art algorithms to demonstrate the reliability of our proposal.

A. DATASETS
In this study, we have evaluated our method using the University of Rochester Multimodal Music Performance (URMP) dataset presented in [50]. This dataset contains single-channel audio recordings and ground-truth annotations for 44 classical chamber music pieces ranging from duets to quintets (11 duets, 12 trios, 14 quartets, and 7 quintets) and played by 14 different common instruments in orchestra. For each VOLUME 10, 2022 piece, the musical score in MIDI format and the high-quality individual instrument audio recordings are provided. The multichannel mixtures were generated by simulating the spatial position of the sources. In this regard, mixing filters were simulated with the Roomsim Toolbox [51] for a rectangular room of dimensions 3.55 m × 4.45 m × 2.5 m and a linear array of eight omnidirectional microphones. The reverberation time RT 60 1 of the room was set to 65 ms, 250 ms and 500 ms, and the inter-microphone distance to 5 cm. The distance between the sources and the geometrical center of the array was fixed to 2 m and the source directions of arrival varied between 0 • and 180 • with a minimal spacing of 30 • . The overview of the recording configuration and room layout is illustrated in Fig. 2.
The anechoic material from the URMP dataset was convolved with the obtained IRs resulting in two-channels, four-channels and eight-channels datasets for the three RT 60 combinations. Therefore, this process provided nine different setups for each set of audio signals. Note that each audio excerpt has a duration of 30 seconds and is sampled at 44.1 kHz.

B. EXPERIMENTAL SETUP
In this paper, the time-frequency representation is obtained from CQT using 12 bins per octave. Regarding the harmonic decomposition scheme, the number of harmonic components per note N and the number of notes per instrument K are set to 20 and 115, respectively. Note that the entire range of MIDI notes possible for any instrument has to be covered, i.e. from the MIDI note 21 to the MIDI note 135.
As for the mixing filter decomposition, the number of look directions is equal to the number of sources that compound the audio mixture (see Section III-B). Note that only azimuthal angles are considered. Finally, the maximum number of iterations for the parameters estimation loop is set to 300, since we have observed that this value is adequate for the convergence of the reconstruction error.

C. EVALUATION METRICS
In this work, we have objectively evaluated the performance of the separation method by comparing each separated signal to the spatial images of the original sources and using objective measures by BSS_Eval toolbox [52]. These metrics are commonly accepted and represent a standard approach in the specialized scientific community for testing the quality of separated signals, allowing a fair comparison with other stateof-the-art methods. These metrics assume that each separated signal produces a distortion model that can be expressed as: whereŝ j is the estimated source signal for instrument j, s j is the original signal of the instrument j, e target j is the error term associated with the target distortion component, e interf j is the 1 RT 60 is the time required for reflections of a direct sound to decay by 60 dB below the level of the direct sound. error term due to interference of the other sources, and e artif j is the error term attributed to the numerical artifacts of the separation algorithm. In this way, BSS_EVAL provides the following metrics based on energy ratios for each separated signal: the source to distortion ratio (SDR), the source to interference ratio (SIR), the source to artifacts ratio (SAR), and the source image spatial distortion ratio (ISR) [52].

D. ALGORITHMS FOR COMPARISON
In this subsection, we present the state-of-the-art algorithms used for comparing the separation performance of our proposal. Note that we also include two unrealistic baseline methods to show the extreme separation performances, here denoted as oracle mask separation and energy distribution. The different approaches compared here are the following:

1) ORACLE MASK SEPARATION
This method computes the optimal value of the Wiener mask at each frequency and time component using the downsampled CQT representation and assuming that the signals to be separated are known in advance. Therefore, this approach represents the upper bound for the best separation that can be reached with the used time-frequency representation.

2) ENERGY DISTRIBUTION (ED)
This procedure uses the mixture signal divided by the number of sources as input for the evaluation. This evaluation provides a starting point for the separation algorithms.

3) FASST
In our evaluation we have included the results obtained by the multichannel method presented in [45]. This method decomposes the magnitude spectrogram of the mixture signal into a sum of basis spectra representing individual pitches scaled by time-varying amplitudes. In this approach, each basis spectrum is modelled as a weighted sum of narrowband spectra representing a few adjacent harmonic partials, thus enforcing harmonicity and spectral smoothness while adapting the spectral envelope to each instrument. The author used a generalized expectation-maximization (GEM) algorithm to estimate the model parameters.

4) MULTICHANNEL NMF
The approach presented in [17] decomposes the multichannel audio spectrogram using NMF with the Itakura Saito divergence. The method considers two variants, instantaneous and convolutive mixing that are compared here and referred to as Mult. NMF inst. and Mult. NMF conv., respectively. In order to estimate the mixing and source parameters, we have used the implementation provided by the authors using the multiplicative update (MU) algorithm.

5) ILRMA
In this approach, the independent frequency vectors in Independent Vector Analysis (IVA) are extended to lowrank matrices, which correspond to the power spectrograms of estimated sources, using NMF decomposition [53]. This signal model (independence between sources and low-rank decomposition of source spectrograms) is theoretically equivalent to conventional MNMF only when the spatial covariance matrix of each source in MNMF is constrained to a rank-1 matrix, which yields a computationally efficient algorithm for so-callen independent low-rank matrix analysis (ILRMA). Note that ILRMA is applicable to the determined case (M = S). In the overdetermined case (M > S), dimensionality reduction using principal component analysis (PCA) should be applied so that M = S. Therefore, only those signals for the dataset that satisfy the condition M ≥ S have been considered for the evaluation of this method.

6) HARMONIC NTF
We have included in the evaluation results for the MNMF-based method that relies only on the amplitude information of the multichannel signal. In this case, only the magnitude spectrograms are considered and the phase information of the multichannel signal is discarded. This signal model follows the same harmonic structure described in Eq. (9) and includes a panning matrix within the model that models the contribution of each source to each channel of the input signal as in [18], [54].

7) DSB+Wiener
We have also reported results for a spatial beamforming [34] method. Specifically, we have implemented a Delay and Sum Beamforming (DSB) design which consists of time aligning and summing the microphone signals. This technique uses the geometrical information of the microphone array to filter and enhance the sources coming form a specific direction. To allow a fair comparison with NMF-based approaches, a postprocessing Wiener filtering stage is applied to the output of DSB [55].

8) DoA-MNMF
We have included as the baseline for our experiments the results of the beamforming-inspired SCM model in [20] using the implementation provided by the authors.

E. VARIANTS OF THE PROPOSED METHOD
We have also presented results of two variants of our own model. In that sense, we have been considered both the CQT and STFT signal representations when comparing the models in order to know the influence of using these two different representations. Thus, we have the following two scenarios:

F. RESULTS
We start by analyzing the performance of the method in a semi-anechoic scenario. Fig. 3 shows the separation results averaged over all audio excerpts for the room with RT 60 = 65 ms. Each column corresponds to a different array size (2, 4 and 8 microphones). Note that the DoA-MNMF and both proposed variants are evaluated using two different initializations for the spatial weights (ground-truth source DoA represented in light color vs SRP-PHAT estimation represented in dark color). The best results are obtained with the Oracle mask separation method (SDR ≈ 11.5 dB, SIR ≈ 17 dB, SAR ≈ 12.5 dB, ISR ≈ 17 dB). This measure informs us about the best separation that can be achieved using the selected Wiener mask with the downsampled CQT representation. Moreover, the results show that the proposed system achieves significantly better average SDR and ISR than the other compared methods. Observe that only the SB + Wiener, DoA − MNMF and both proposed variants are able to surpass the SDR score obtained by ED (i.e. the mixture signal divided by the number of sources), which is a clear indication of the difficulty of the task at hand.
Concerning our variants, Proposal-CQT achieves better results than Proposal-STFT regardless of the score and the number of channels used. This is due to the robustness of the CQT representation against not perfectly tuned notes of the instruments composing the music excerpts (see Section II-E), especially string instruments. In comparison to the DoA-MNMF method, our Proposal-CQT obtains a clearly higher SDR result (2.8 dB vs 1.1 dB with 2 channels, 3 dB vs 1.2 dB with 4 channels and 3.1 dB vs 1.8 dB with 8 channels), while maintaining a similar level of artifacts and interferences. This results indicate that the harmonic constrain of our model improves the separation of harmonic instruments in classical music mixtures, and does not introduce additional artifacts. It is also worth noting that our model has fewer parameters due to the use of the harmonicity constraint, which may be helping in the convergence to a better solution. Both the DoA-MNMF and the proposed variants perform similarly well when initialized with ground-truth and SRP-PHAT source locations. Consequently, the method can be successfully used in a blind fashion in combination with SRP-PHAT.
As expected, the DSB + Wiener approach outperforms all the other methods in terms of added artifacts, as demonstrated by the SAR score. However, this approach fails to provide a meaningful isolation of the sound sources, resulting in a poor SIR value. Compared to the remaining decomposition methods, our approach performs much better in terms of SDR and SIR metrics, while offering a comparable SAR score. Although FASST and Mult. NMF conv. also make use of a convolutive mixing model, they suffer to discriminate the sources because the phase differences are not constrained, and the amplitude differences are small when the microphones are close to each other. Phase constraining is a clear advantage of the DoA-MNMF and both proposed variants, in which the beamforming-based MNMF model imposes a small set of directions of arrival to each source and benefits from the spatial diversity of the sources. ILRMA, which uses an instantaneous mixing model, performs better than the other decomposition methods in terms of SDR and SIR with 8 channels, and offers a better SAR metric than our method. This may be attributed to the more strict mathematical constraints of our model, which provides much better isolation at the expense of slightly higher artifacts. Also, note that ILRMA is determined, so only in . Source separation metrics averaged over the URMP dataset [50] under moderate reverberation (RT 60 = 250 ms) with simulated room. Each column corresponds to a different array size (2, 4 and 8 microphones). Method abbreviated as (GT) uses ground-truth source DoA (represented in light color) whereas (SRP) uses SRP-PHAT estimation (represented in dark color). The error bars represent 95% confidence intervals. the 8-channel array case the method is evaluated over all excerpts. Harmonic NTF obtains slightly better SDR than ILRMA, probably due to its harmonic constraint. However, it does not provide proper isolation from the sources, as can be observed from the SIR value obtained. Fig. 4 illustrates the separation results in a moderate reverberant scenario (i.e., for the room with RT 60 = 250 ms). In this condition, the SDR performance of all methods, except DSB + Wiener, goes below the reference value provided by ED. When initialized with SRP-PHAT, our proposals still manage to obtain better SDR than the other decomposition methods, especially with 4 and 8 channels. Again, compared to the DoA-MNMF approach, our harmonic-constrained model based on the CQT obtains a much better approximation to the real sources according to the SDR metric. In terms of interferences, our method performs best with 2 channels, whereas Mult. NMF conv. offers the best SIR score with 4 and 8 channels. Nonetheless, Proposal-CQT always seems to benefit from ground-truth initialization of the source directions, reaching the best SIR score with 2 channels and 8 channels. Proposal-CQT variant also performs best in reconstructing the spatial image of the sources (ISR) with 8 channels, and slightly outperforms Mult. NMF conv. and Mult. NMF inst. with 2 and 4 channels.
The separation results for the room with RT 60 = 500 ms are shown in Fig. IV-D7. This is an extremely challenging scenario, since the mixture at each channel is affected by a strong reverberation tail and echos coming from multiple directions. As can be seen, the behavior is similar to the RT 60 = 250 ms case. DSB + Wiener obtains the best results in terms of SDR and SAR performance. However, it provides the worst SIR value, resulting in a really poor isolation of VOLUME 10, 2022 FIGURE 5. Source separation metrics averaged over the URMP dataset [50] under reverberant conditions (RT 60 = 500 ms) with simulated room. Each column corresponds to a different array size (2, 4 and 8 microphones). Method abbreviated as (GT) uses ground-truth source DoA (represented in light color) whereas (SRP) uses SRP-PHAT estimation (represented in dark color). The error bars represent 95% confidence intervals. the sound sources. Concerning our proposals, better SDR and ISR are obtained compared to the other decomposition methods. In terms of interferences, Mult. NMF conv. again offers the best SIR score. Even so, our method initialized with SRP-PHAT shows competitive results regarding this score. In this case, the ground-truth initialization does not benefit our proposals. This can be due to the fact that, with high reverberation and small arrays, constraining each source to a single direction of arrival is not significantly beneficial in terms of interferences due to the strong reflections coming from other sources. Fig. 6 reports the separation metrics of our CQT variant (initialized with SRP-PHAT) as a function of the number of sources in the mixture. As shown, the results scale as expected according to the complexity of the mixture. For two and three sources under slight reverberation, our system gets a very high level of isolation with very low artifacts even in the 2-channel case. With four and five sources, the separation scores decrease significantly, but the method is still able to offer acceptable results under low reverberation, especially using the 8-channel array (around 1.6 dB SDR and 2.5 dB SIR for five sources). We provide sound samples of separated signals at the web page of results. 2 Table 1 shows the computational time of our proposed method variants. Note that the computation time increases as the number of sources in the mixture increases. As can be seen, the CQT variant takes much less time than the STFT variant. Regarding the spatial weights initialization, it can be clearly observed how initializing with SRP-PHAT algorithm FIGURE 6. Separation metrics of our method for different numbers of sources in URMP dataset [50] with the simulated room. Initialization is done with SRP-PHAT. Each column corresponds to a different array size (2, 4 and 8 microphones). The error bars represent 95% confidence intervals. offers a great advantage in terms of computational time with respect to considering the full-rank version of Z.

V. CONCLUSION
In this paper, we present a harmonic constrained MNMF-based method for the task of blind music source separation. In the proposed signal model, the mixing filter encodes the spatial information in terms of magnitude and phase differences between channels, whereas the source variances are modelled using a harmonic constrained NMF structure. In order to reduce the dimensionality of the signal model parameters, we propose to use the CQT as timefrequency representation. Thus, the SCM is obtained from the CQT to account to the frequency logarithmic scale inherent in music signals. To our best knowledge, this is the first work that exploited the phase information of the CQT within an MNMF scheme. Moreover, the SRP-PHAT algorithm is used to initialize the model parameters and thus, reduce the computational complexity and increase the robustness.
The proposed method has been evaluated for the task of multichannel music source separation of classical chamber music ensembles with several polyphony and reverberation setups. The results obtained in the evaluation showed a reliable performance in terms of BSS_EVAL metrics in comparison with other signal decomposition approaches.
Finally, as future work, we would investigate the integration of the score information within the signal model. Thus, we would expect to improve the separation results and reduce the computational complexity by initializing the time-varying gains of the model. Moreover, we will try to improve the results in high reverberation scenarios by evolving the phase model. Finally, since the current system needs to know the microphone arrangement, we will work on developing a blind system respect to the microphone array.