PoP-IDLMA: Product-of-Prior Independent Deeply Learned Matrix Analysis for Multichannel Music Source Separation

Independent deeply learned matrix analysis (IDLMA) is a state-of-the-art determined audio source separation method based on pretrained deep neural networks (DNNs). Owing to the excellent expression power of DNNs, IDLMA can handle a wider range of sources than conventional source models such as nonnegative matrix factorization (NMF). However, owing to its supervised nature, the separation performance of IDLMA often degrades in the presence of timbral mismatches between the training data and the to-be-separated data. In this paper, we propose two source models that encompass the NMF- and DNN-based source models by constructing a prior distribution of the source power spectrogram (product of priors: PoP) on the basis of the product-of-experts concept. Since the NMF-based source model works well in a fully blind situation, the proposed models can handle the timbral mismatch without losing the expression power of DNNs. By introducing the PoP-based source models into IDLMA, we propose IDLMA extensions (PoP-IDLMAs) and derive their efficient parameter estimation algorithms on the basis of the majorization–minimization algorithm. Experimental results demonstrate the effectiveness of the proposed PoP-IDLMAs and show that the proposed models greatly improve the source power estimation in frequency bands above 500 Hz.

Digital Object Identifier 10.1109/TASLP.2023.3293044

Blind source separation (BSS) plays an important role in multichannel audio source separation and has thus far been well studied [1]. The BSS problem is divided into two situations: the underdetermined situation (the number of microphones M is smaller than that of sources N) and the (over-)determined situation (M ≥ N). In this article, we focus on the determined situation. For the determined situation, a typical BSS approach is to assume the statistical independence of the sources, as in frequency-domain independent component analysis [2], [3], [4], [5], independent vector analysis (IVA) [6], [7], and independent low-rank matrix analysis (ILRMA) [8]. In this approach, the determined BSS problem is formulated as the problem of finding a demixing filter (the inverse system of the mixing process) simultaneously with the estimation of the source power spectrograms. For example, ILRMA uses a source model based on nonnegative matrix factorization (NMF) [9], [10]. The NMF represents each time frame of a source power spectrogram by a sum of common spectral templates weighted by their activations, i.e., it approximates a source power spectrogram by a low-rank nonnegative matrix. This representation is suited for capturing recurring spectral patterns, and ILRMA achieves state-of-the-art performance among determined BSS methods.
Alongside extensions for the fully blind situation [11], [12], [13], ILRMA has been extended to a spatially blind but source-supervised situation, where the mixing system is still unknown but training data for each source are available. This extension is named independent deeply learned matrix analysis (IDLMA) [14]. It is constructed by replacing the NMF-based source model with a source model based on a pretrained deep neural network (DNN) in the ILRMA framework. Owing to the flexible expression power of a DNN, IDLMA works well even for sources to which the NMF assumption is not suited (e.g., a singing voice).
Although a DNN-based source model can handle a wider range of sources, its performance is often degraded by a timbral mismatch between the training data and an observed signal. One cause of this performance degradation is the supervised nature of the DNN-based source model. For example, the higher-frequency components greatly fluctuate owing to musical instrument types and performers' skills, which makes the DNN training difficult. Indeed, we experimentally observed such a performance degradation of the DNN-based source model particularly in higher frequency bands, as we will show in Section V-F. To alleviate this problem while maintaining the capability of handling various sources, we should extend the DNN-based source model to include an adaptive mechanism against timbral mismatches.
In this article, we propose a source model capable of handling timbral mismatches by unifying the NMF- and DNN-based source models. The idea behind the proposed model is to combine unsupervised and supervised source models. Unlike the DNN-based source model, the NMF-based source model can work well in an unsupervised manner. We pretrain only the DNN part and use the NMF part in an unsupervised manner. Hence, the NMF part accounts for the time-frequency components that are difficult for the DNN part to represent. The NMF and DNN parts are described with probability distributions of a source power spectrogram. To combine the two distributions in a Bayesian manner, we use the product-of-experts (PoE) technique [15]. PoE represents a probability distribution as a product of multiple probability distributions called experts. By associating the distributions of the NMF and DNN parts with the experts, we can construct a prior distribution of the source power spectrogram as their product. Each of the two distributions can be seen as a prior distribution of the source power spectrogram in the ILRMA/IDLMA framework. Named after this aspect, we call the proposed prior distribution the product of priors (PoP).
By replacing the DNN-based source model with the PoP-based source model, we propose an IDLMA extension named PoP-IDLMA (see Fig. 1). Furthermore, we propose a variant of PoP-IDLMA by taking the limit of one of the hyperparameters of the PoP-based source model under a certain condition. To distinguish them, we call the former t-PoP-IDLMA and the latter G-PoP-IDLMA. For both PoP-IDLMAs, we derive efficient parameter estimation algorithms based on the majorization–minimization (MM) algorithm [16]. We conducted experiments on determined source separation and showed the effectiveness of the proposed methods.
While we focus on the IDLMA and ILRMA families throughout this article, the idea of PoP can be extended for underdetermined source separation methods such as multichannel NMF (MNMF) [17], [18] and fast MNMF [19], [20] because they use generative models of a source power spectrogram similarly to the determined source separation methods. We leave such extensions as our future work.
The remainder of this article is organized as follows. In Section II, we briefly describe ILRMA and IDLMA. In Section III, we propose the PoP-based source model and introduce it to IDLMA for constructing t-PoP-IDLMA. We also derive its parameter estimation algorithm on the basis of the MM algorithm. In Section IV, we present G-PoP-IDLMA and derive its parameter estimation algorithm similarly to t-PoP-IDLMA. In Section V, we show the effectiveness of the proposed methods through multichannel music source separation experiments. In Section VI, we conclude this article.
This article is partially based on our previous conference article [21], with the following five contributions. (i) We propose a PoP-based source model by combining the prior distributions of the NMF- and DNN-based source models. Note that the method presented in [21] is used for the DNN part. (ii) We extend the PoP-based source model so that it can avoid the DNN retraining cost caused by changing hyperparameters. (iii) We introduce these source models into the IDLMA framework and propose efficient parameter estimation algorithms for t- and G-PoP-IDLMAs. (iv) Through music source separation experiments, we demonstrate the effectiveness of t- and G-PoP-IDLMAs and (v) show that the NMF part improves the source power estimation in the frequency bands where the DNN part fails to estimate.

A. Formulation of Determined Audio Source Separation
In this section, we formulate a determined audio source separation problem with M microphones and N sources (M ≥ N). The short-time Fourier transforms (STFTs) of the source, observed, and separated signals are defined as

s_ij = (s_ij1, . . . , s_ijN)^T ∈ C^N,   (1)
x_ij = (x_ij1, . . . , x_ijM)^T ∈ C^M,   (2)
y_ij = (y_ij1, . . . , y_ijN)^T ∈ C^N,   (3)

where i = 1, . . . , I, j = 1, . . . , J, n = 1, . . . , N, and m = 1, . . . , M are the indices of frequency bins, time frames, sources, and channels, respectively. The superscript T denotes the transpose operator. When the mixing system is time-invariant and the analysis window is sufficiently longer than the reverberation time, x_ij is represented as an instantaneous mixture:

x_ij = A_i s_ij,   (4)

where A_i ∈ C^{M×N} is the mixing matrix. If M = N and A_i is nonsingular, we can write y_ij as

y_ij = W_i x_ij,   (5)

where W_i = (w_i1, . . . , w_iN)^H ∈ C^{N×M} is the demixing matrix and the superscript H is the Hermitian transpose operator. In ILRMA and IDLMA, y_ijn is assumed to follow an isotropic complex Gaussian distribution with zero mean and variance r_ijn ∈ R_{≥0}:

p(y_ijn; r_ijn) = (1 / (π r_ijn)) exp(−|y_ijn|² / r_ijn).   (6)

The variance r_ijn corresponds to the (i, j)th entry of the power spectrogram of source n, and we call it the source power spectrogram. With this assumption, the source separation problem is formulated as a maximum likelihood estimation problem with respect to r_ijn and W_i for a given x_ijm. Let X_m and Y_n be I × J complex matrices consisting of {x_ijm}_{i,j} and {y_ijn}_{i,j}, respectively. By taking the negative of the log-likelihood function, we obtain the cost function

L = Σ_{i,j,n} [ |w_in^H x_ij|² / r_ijn + log r_ijn ] − 2J Σ_i log |det W_i| + const.,   (7)

where the log-determinant term of (7) comes from (5) and the change-of-variables formula. For brevity, we represent the I × J nonnegative matrix consisting of {r_ijn}_{i,j} as R_n ∈ R^{I×J}_{≥0}. ILRMA and IDLMA represent R_n with an NMF and a DNN, respectively. To distinguish them, we hereafter add the superscripts (NMF) and (DNN) to R_n for ILRMA and IDLMA, respectively.
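For concreteness, the cost (7) can be evaluated numerically. The following NumPy sketch is our own illustration, not code from the paper; the array shapes, the function name `idlma_cost`, and the `eps` regularizer are assumptions:

```python
import numpy as np

def idlma_cost(X, W, R, eps=1e-12):
    """Evaluate the negative log-likelihood cost (7) up to constants.

    X : observed STFT, complex array of shape (I, J, M)
    W : demixing matrices W_i, complex array of shape (I, N, M)
        (the nth row of W[i] plays the role of w_in^H)
    R : source power spectrograms r_ijn, nonnegative array of shape (I, J, N)
    """
    J = X.shape[1]
    # y_ij = W_i x_ij for every frequency bin and frame
    Y = np.einsum('inm,ijm->ijn', W, X)
    cost = np.sum(np.abs(Y) ** 2 / (R + eps) + np.log(R + eps))
    # minus 2J * sum_i log|det W_i|
    cost -= 2.0 * J * np.sum(np.log(np.abs(np.linalg.det(W)) + eps))
    return cost
```

With W set to the identity, the separated signals equal the observations and the determinant term vanishes, which gives a quick sanity check.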
B. ILRMA [8]

1) Representation of R^(NMF)_n: Fig. 2 shows the source model of ILRMA. In this model, the source power spectrogram R^(NMF)_n is represented as a product of two nonnegative matrices of rank at most K:

R^(NMF)_n = T_n V_n,   (8)

or equivalently,

r^(NMF)_ijn = Σ_k t_ikn v_kjn,   (9)

where k = 1, . . . , K is the index of the NMF bases. The matrices T_n ∈ R^{I×K}_{≥0} and V_n ∈ R^{K×J}_{≥0} are the basis and activation matrices consisting of {t_ikn}_{i,k} and {v_kjn}_{k,j}, respectively. The column vectors of T_n represent the spectral patterns of source n, and the row vectors of V_n are the energies of the corresponding bases.
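The low-rank structure of (8) is easy to reproduce numerically; the snippet below is our own illustration with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 6, 10, 3          # frequency bins, time frames, number of bases
T = rng.random((I, K))      # basis matrix: K spectral templates in its columns
V = rng.random((K, J))      # activation matrix: per-frame gains of each basis
R_nmf = T @ V               # r_ijn = sum_k t_ikn v_kjn, entrywise nonnegative
```

The product is entrywise nonnegative and has rank at most K, which is exactly the low-rank assumption discussed above.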
2) Parameter Estimation Algorithm: By substituting (9) into (7), we obtain the cost function of ILRMA as

L_ILRMA = Σ_{i,j,n} [ |w_in^H x_ij|² / (Σ_k t_ikn v_kjn) + log Σ_k t_ikn v_kjn ] − 2J Σ_i log |det W_i| + const.   (10)

The minimization of (10) can be performed by iteratively updating the parameters of the NMF-based source model (t_ikn and v_kjn) and those of the spatial model (W_i) [8].
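The MM-derived multiplicative updates for t_ikn and v_kjn presented below can be sketched for a single source as follows. This is our own NumPy sketch, not code from the paper; `P` denotes the separated power |w_in^H x_ij|² (held fixed during the source-model update), and `update_tv` is a hypothetical name:

```python
import numpy as np

def update_tv(P, T, V, eps=1e-12):
    """One MM update of the basis and activation matrices in the
    square-root multiplicative form used by ILRMA.

    P : separated power spectrogram of one source, shape (I, J)
    T : basis matrix, shape (I, K);  V : activation matrix, shape (K, J)
    """
    R = T @ V + eps
    T = T * np.sqrt(((P / R**2) @ V.T) / ((1.0 / R) @ V.T + eps))
    R = T @ V + eps
    V = V * np.sqrt((T.T @ (P / R**2)) / (T.T @ (1.0 / R) + eps))
    return T, V
```

Each update keeps T and V nonnegative and does not increase the source-model part of (10), which is the MM guarantee.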
The parameter estimation algorithm of ILRMA is based on the MM algorithm [16], which guarantees that (10) does not increase at each update. In the MM algorithm, we design an auxiliary function that upper-bounds the original cost function and touches it at the current estimate.

Algorithm 1: Function IP({W_i}_i, {U_in}_{i,n}, {e_n}_n)
1: Function IP
2:   for i = 1, . . . , I do
3:     for n = 1, . . . , N do
4:       w_in ← (W_i U_in)^{−1} e_n
5:       w_in ← w_in / (w_in^H U_in w_in)^{1/2}
6:     end for
7:   end for
8:   return {W_i}_i
9: end Function

By using the auxiliary function, we can derive update rules that do not increase the original cost function:

Theorem 1: Let f(θ) be a cost function and f⁺(θ, θ̃) be its auxiliary function satisfying f(θ) = min_{θ̃} f⁺(θ, θ̃). The cost function f(θ) is not increased by iteratively performing the following updates:

θ̃ ← argmin_{θ̃} f⁺(θ, θ̃),  θ ← argmin_θ f⁺(θ, θ̃).   (11)
By adequately designing an auxiliary function of (10), we can obtain the following update rules for t_ikn and v_kjn [8]:

t_ikn ← t_ikn ( Σ_j |w_in^H x_ij|² v_kjn / r²_ijn / Σ_j v_kjn / r_ijn )^{1/2},   (12)
v_kjn ← v_kjn ( Σ_i |w_in^H x_ij|² t_ikn / r²_ijn / Σ_i t_ikn / r_ijn )^{1/2},   (13)

where r_ijn = Σ_k t_ikn v_kjn. Since the terms in the outer parentheses in (12) and (13) are nonnegative, the nonnegativity of t_ikn and v_kjn always holds once their initial values are nonnegative. The demixing matrix W_i is updated by the iterative projection (IP) algorithm [22]:

{W_i}_i ← IP({W_i}_i, {U_in}_{i,n}, {e_n}_n),  U_in = (1/J) Σ_j x_ij x_ij^H / r_ijn,   (14)

where IP(·, ·, ·) is defined in Algorithm 1 and e_n is the N-dimensional unit vector whose nth element is one. This algorithm guarantees that (10) does not increase at each update [23]. It is also used in IDLMA and our proposed methods, as we will show in Sections II-C, III-C, and IV-B. After the parameter estimation, projection back (PB) [5] is applied to y_ij to resolve the scale uncertainty between w_in and r_ijn:

y_ij ← diag(d_i) y_ij,   (15)

where diag(d_i) ∈ C^{N×N} is the matrix that has the elements of d_i ∈ C^N on its main diagonal and zeros elsewhere, and d_i is the scale obtained by the PB technique.

C. IDLMA [14]

1) Representation of R^(DNN)_n: In IDLMA, the source power spectrogram R^(DNN)_n is obtained with a pretrained DNN of source n, DNN_n. Let |·|^{·τ} denote the elementwise τth power of the absolute values of a matrix. DNN_n converts |Y_n|^{·1} into the source magnitude spectrogram Σ_n ∈ R^{I×J}_{≥0}:

Σ_n = DNN_n(|Y_n|^{·1}).   (16)

Let σ_ijn be the (i, j)th entry of Σ_n. We obtain r^(DNN)_ijn as

r^(DNN)_ijn = max(σ_ijn, ε_1)²,   (17)

where max(·, ·) returns the maximum of the two inputs and ε_1 is a small value used to prevent numerical instability.

2) Parameter Estimation Algorithm:
The parameter estimation algorithm of IDLMA consists of two stages: separation and DNN training stages. The separation stage is performed after the DNN training stage. We describe the separation stage in this section and the DNN training stage in Section II-C3.
In the separation stage, the parameters of the source and spatial models are estimated from the observed signals X_m. The cost function of IDLMA is defined by replacing r_ijn with r^(DNN)_ijn in (7):

L_IDLMA = Σ_{i,j,n} [ |w_in^H x_ij|² / r^(DNN)_ijn + log r^(DNN)_ijn ] − 2J Σ_i log |det W_i| + const.   (18)

As in ILRMA, the parameter estimation algorithm of IDLMA consists of iterative updates of the source model and the demixing matrices.
The source power spectrogram R^(DNN)_n is updated in accordance with (16) and (17), where Y_n is obtained with the current estimates of W_i. For the update of W_i, we can use the IP algorithm because the terms of (18) involving W_i have the same form as those of ILRMA:

{W_i}_i ← IP({W_i}_i, {U^(DNN)_in}_{i,n}, {e_n}_n),  U^(DNN)_in = (1/J) Σ_j x_ij x_ij^H / r^(DNN)_ijn.   (19)

The PB technique is applied to y_ij after every update of W_i, which can reduce linear distortion.
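A minimal sketch of one IP sweep at a single frequency bin follows. This is our own illustration under stated assumptions: the nth row of `W` stores w_in^H, the variances `Rf` are given, and `ip_sweep` is a hypothetical name:

```python
import numpy as np

def ip_sweep(Xf, W, Rf, eps=1e-12):
    """One iterative-projection sweep at a single frequency bin.

    Xf : observed frames x_ij at this bin, shape (J, M)
    W  : demixing matrix whose nth row is w_in^H, shape (N, M)
    Rf : source variances r_ijn at this bin, shape (J, N)
    """
    J, M = Xf.shape
    N = W.shape[0]
    for n in range(N):
        # weighted covariance U_in = (1/J) sum_j x_ij x_ij^H / r_ijn
        U = (Xf.T * (1.0 / (Rf[:, n] + eps))) @ Xf.conj() / J
        e_n = np.zeros(N, dtype=complex)
        e_n[n] = 1.0
        w = np.linalg.solve(W @ U, e_n)                    # w_in <- (W_i U_in)^{-1} e_n
        w = w / np.sqrt(np.real(w.conj() @ U @ w) + eps)   # normalization step
        W[n] = w.conj()                                    # store w_in^H as a row
    return W
```

Because each row update exactly minimizes the per-source subproblem, the bin-wise cost (quadratic terms minus the log-determinant) is non-increasing over sweeps.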

3) DNN Training:
In the DNN training stage, we train DNN n so that it can estimate a clean magnitude spectrogram from a noisy magnitude spectrogram. The point of IDLMA is that a cost function for the DNN training is consistent with the cost function (18) used in the separation stage in a maximum likelihood sense.
Let s_ijn be the (i, j)th element of the clean complex spectrogram of source n. The cost function for the DNN training is derived by replacing w_in^H x_ij and r^(DNN)_ijn in L_IDLMA with s_ijn and σ²_ijn, respectively:

C^(n)_IDLMA = Σ_{i,j} [ (|s_ijn|² + ε_2) / (σ²_ijn + ε_2) + log (σ²_ijn + ε_2) ],   (20)

where ε_2 is a small value used to prevent numerical instability. The right-hand side of (20) equals, up to constants independent of σ²_ijn, the Itakura–Saito divergence between |s_ijn|² + ε_2 and σ²_ijn + ε_2. When ε_2 is negligibly small, the minimization of (20) with respect to σ²_ijn is equivalent to the maximum likelihood estimation of (18) with respect to σ²_ijn. Hence, the DNN training with C^(n)_IDLMA corresponds to the emulation of the maximum likelihood estimation with respect to σ_ijn in the separation stage.
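As a sketch, the training criterion can be written as a mean Itakura–Saito divergence between regularized powers. This is our own illustration (the function name and the default `eps2` are assumptions); in practice this quantity would be minimized with respect to the DNN parameters:

```python
import numpy as np

def is_divergence_loss(s_pow, sigma_pow, eps2=1e-5):
    """Mean Itakura-Saito divergence between the regularized clean power
    s_pow = |s_ijn|^2 and the DNN estimate sigma_pow = sigma_ijn^2,
    matching (20) up to constants independent of sigma_pow.
    """
    a = s_pow + eps2
    b = sigma_pow + eps2
    return np.mean(a / b - np.log(a / b) - 1.0)
```

The divergence is zero exactly when the estimate matches the target and positive otherwise, in either direction of mismatch.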

III. PROPOSED t-POP-IDLMA
In this section, we propose the PoP-based source model by unifying the NMF- and DNN-based source models on the basis of PoE [15]. PoE designs a probability distribution of a random variable by multiplying multiple probability distributions of the variable. The multiplication is analogous to an "and" operation of multiple conditions, and the designed distribution has high values at events where such conditions tend to be satisfied simultaneously. In the proposed model, we treat the source power spectrogram r_ijn as a latent variable and construct its prior distribution by multiplying the NMF- and DNN-based probability distributions of r_ijn. By using the prior distribution (i.e., PoP), we define the source model through the marginalization of r_ijn.
A. t-PoP-Based Source Model

1) Prior Distribution: As in ILRMA and IDLMA, y_ijn is assumed to obey an isotropic Gaussian distribution with zero mean and variance r_ijn. To clarify that r_ijn is a latent variable, we rewrite (6) as

p(y_ijn | r_ijn) = (1 / (π r_ijn)) exp(−|y_ijn|² / r_ijn).   (21)

Following PoE, we can define a prior distribution of r_ijn with a set of hyperparameters θ_ijn = {θ^(NMF)_ijn, θ^(DNN)_ijn} as

p(r_ijn; θ_ijn) ∝ q(r_ijn; θ^(NMF)_ijn) q(r_ijn; θ^(DNN)_ijn),   (22)

where q(r_ijn; θ^(NMF)_ijn) and q(r_ijn; θ^(DNN)_ijn) are the NMF- and DNN-based probability distributions with sets of parameters θ^(NMF)_ijn and θ^(DNN)_ijn, respectively. For the right-hand side of (22) to be a probability distribution, a normalization constant should exist. Unfortunately, it cannot always be described in an explicit form. However, by adequately choosing the probability distributions for q(r_ijn; θ^(NMF)_ijn) and q(r_ijn; θ^(DNN)_ijn), we can write the right-hand side of (22) in a closed form, which helps the derivation of a parameter estimation algorithm.
Let us choose an inverse gamma distribution for q(r_ijn; θ^(NMF)_ijn) and q(r_ijn; θ^(DNN)_ijn):

q(r_ijn; θ^(·)_ijn) = IG(r_ijn; ν^(·)_ijn/2, (ν^(·)_ijn/2) r̃^(·)_ijn),   (23)
IG(r; α, β) = (β^α / Γ(α)) r^{−(α+1)} exp(−β/r),   (24)

where θ^(·)_ijn = {ν^(·)_ijn, r̃^(·)_ijn}, α > 0 is the shape parameter, β > 0 is the scale parameter, and Γ(·) is the gamma function. Since a product of two inverse gamma distributions is, up to normalization, also an inverse gamma distribution, we can explicitly write the proposed PoP as

p(r_ijn; θ_ijn) = IG(r_ijn; α_ijn, β_ijn),
α_ijn = (ν^(NMF)_ijn + ν^(DNN)_ijn)/2 + 1,  β_ijn = (ν^(NMF)_ijn r̃^(NMF)_ijn + ν^(DNN)_ijn r̃^(DNN)_ijn)/2.   (25)

It should be noted that we can combine more than two probability distributions in the same manner.
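That the product of two inverse gamma densities is again inverse gamma (up to normalization) can be checked numerically. The sketch below is our own check with generic shape/scale parameters; the combined shape is α₁ + α₂ + 1 and the combined scale is β₁ + β₂:

```python
import numpy as np
from math import lgamma

def ig_logpdf(r, alpha, beta):
    """Log density of the inverse gamma distribution
    IG(r; alpha, beta) = beta^alpha / Gamma(alpha) * r^{-alpha-1} * exp(-beta/r)."""
    return alpha * np.log(beta) - lgamma(alpha) - (alpha + 1.0) * np.log(r) - beta / r

a1, b1 = 2.0, 1.5   # first expert (e.g., the NMF-based prior)
a2, b2 = 3.0, 0.5   # second expert (e.g., the DNN-based prior)
r = np.linspace(0.05, 20.0, 7)
lhs = ig_logpdf(r, a1, b1) + ig_logpdf(r, a2, b2)
rhs = ig_logpdf(r, a1 + a2 + 1.0, b1 + b2)
```

The two log densities differ only by an r-independent normalization constant, confirming the closed form of the PoP.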
2) Source Model: The proposed source model is defined as the marginal distribution

p(y_ijn; θ_ijn) = ∫ p(y_ijn | r_ijn) p(r_ijn; θ_ijn) dr_ijn.   (28)

An inverse gamma distribution is a conjugate prior of a Gaussian distribution with respect to its variance (see [24], for example). Thus, we can compute the marginal distribution in a closed form. The resulting distribution is identical to a complex isotropic Student's-t distribution with DoF parameter ν_ijn and scale parameter r̃_ijn determined by θ_ijn:

p(y_ijn; θ_ijn) = (1 / (π r̃_ijn)) ( 1 + (2/ν_ijn) |y_ijn|² / r̃_ijn )^{−(ν_ijn + 2)/2}.   (29)

A smaller ν_ijn leads to a more heavy-tailed probability distribution, i.e., it controls the Gaussianity of the distribution. We call the source model (29) the t-PoP-based source model.

B. Interpretation of t-PoP-Based Source Model
We have thus far derived the proposed t-PoP-based source model. In this section, we provide an interpretation of the t-PoP-based source model to bridge it with the source models of ILRMA and IDLMA. On the basis of this interpretation, we parameterize θ^(NMF)_ijn and θ^(DNN)_ijn.

Suppose that only the NMF-based prior q(r_ijn; θ^(NMF)_ijn) is used in (22). The resulting source model is a complex isotropic Student's-t distribution with DoF parameter ν^(NMF)_ijn and scale parameter r̃^(NMF)_ijn. It coincides with the source model of a Student's-t-distribution-based extension of ILRMA (t-ILRMA) [25]. Furthermore, by invoking the fact that a complex isotropic Student's-t distribution becomes a complex isotropic Gaussian distribution as the DoF parameter goes to infinity, we obtain a complex isotropic Gaussian distribution with zero mean and variance r̃^(NMF)_ijn. In analogy with ILRMA, we parameterize it as

R̃^(NMF)_n = T̃_n Ṽ_n,   (30)

or equivalently,

r̃^(NMF)_ijn = Σ_k t̃_ikn ṽ_kjn,   (31)

where t̃_ikn and ṽ_kjn are the (i, k)th and (k, j)th entries of the basis and activation matrices T̃_n ∈ R^{I×K}_{≥0} and Ṽ_n ∈ R^{K×J}_{≥0}, respectively. We can provide a similar interpretation for q(r_ijn; θ^(DNN)_ijn). Fig. 4 shows the relationship between the proposed and conventional source models, where t-IDLMA [14] is a Student's-t-distribution-based extension of IDLMA. As in the IDLMA family, r̃^(DNN)_ijn is estimated from |Y_n|^{·1} by using a pretrained DNN, DNN_n.
The above interpretations reveal the relationship between the proposed source model and the source models of ILRMA and IDLMA. With the notation (31), we can rewrite the parameters of the t-PoP-based source model, ν_ijn and r̃_ijn, as

ν_ijn = ν^(NMF)_ijn + ν^(DNN)_ijn + 2,   (33)
r̃_ijn = ( ν^(NMF)_ijn Σ_k t̃_ikn ṽ_kjn + ν^(DNN)_ijn r̃^(DNN)_ijn ) / ( ν^(NMF)_ijn + ν^(DNN)_ijn + 2 ).   (34)

C. Separation Stage of t-PoP-IDLMA

1) Cost Function: The parameter estimation algorithm of t-PoP-IDLMA consists of two stages as in IDLMA. In the DNN training stage, the DNN is trained with the training data of each source, as we will describe in Section III-D. In the separation stage, we estimate t̃_ikn, ṽ_kjn, r̃^(DNN)_ijn, and W_i by minimizing the negative log-likelihood of the t-PoP-based source model:

L_t-PoP = Σ_{i,j,n} [ ((ν_ijn + 2)/2) log ( 1 + (2/ν_ijn) |w_in^H x_ij|² / r̃_ijn ) + log r̃_ijn ] − 2J Σ_i log |det W_i| + const.   (38)

2) Update Rule of W_i: The cost function L_t-PoP includes |w_in^H x_ij|² inside the logarithm, which makes it difficult to minimize L_t-PoP directly. To construct an upper bound of this term, we can use the following lemma [25]:

Lemma 1: For a concave function f(θ), its tangent line at a point θ_o is greater than or equal to f(θ):

f(θ) ≤ f(θ_o) + f′(θ_o)(θ − θ_o).   (39)

The equality holds if and only if θ = θ_o.
Since the logarithmic function is concave, we obtain

log ( 1 + (2/ν_ijn) |w_in^H x_ij|² / r̃_ijn ) ≤ (1/ζ_ijn) ( 1 + (2/ν_ijn) |w_in^H x_ij|² / r̃_ijn ) + log ζ_ijn − 1,   (40)

where ζ_ijn > 0 is an auxiliary variable. The equality of (40) holds if and only if

ζ_ijn = 1 + (2/ν_ijn) |w_in^H x_ij|² / r̃_ijn.   (41)

Hence, an auxiliary function of L_t-PoP is given as

L⁺_t-PoP = Σ_{i,j,n} [ ((ν_ijn + 2)/2) { (1/ζ_ijn) ( 1 + (2/ν_ijn) |w_in^H x_ij|² / r̃_ijn ) + log ζ_ijn − 1 } + log r̃_ijn ] − 2J Σ_i log |det W_i| + const.   (42)

The w_in-related terms in (42) are only the quadratic and log-determinant terms, which fits the requirements for using the IP algorithm [23]. Hence, the update rule of W_i is given as

{W_i}_i ← IP({W_i}_i, {Û_in}_{i,n}, {e_n}_n),  Û_in = (1/J) Σ_j x_ij x_ij^H / ξ_ijn,   (43)

where Ξ_n is an I × J matrix consisting of

ξ_ijn = ( ν_ijn r̃_ijn + 2 |w_in^H x_ij|² ) / ( ν_ijn + 2 ).   (44)

Note that ξ_ijn is the denominator of the first term of (42) in which the equality condition (41) has been substituted.
3) Update Rules of t̃_ikn and ṽ_kjn: By invoking (33) and (34), we find that the first and second terms of (42) include sums over k inside the reciprocal and logarithmic functions, respectively. These terms make it difficult to analytically solve the minimization of (42) with respect to t̃_ikn and ṽ_kjn. To overcome this problem, we derive update rules of t̃_ikn and ṽ_kjn on the basis of the MM algorithm.
For a reciprocal function, we can use the following lemma:

Lemma 2: For a series of nonnegative values {h_k}_k,

1 / Σ_k h_k ≤ Σ_k λ²_k / h_k,   (45)

where λ_k ≥ 0 is an auxiliary variable such that Σ_k λ_k = 1, and the equality holds if and only if λ_k = h_k / Σ_{k′} h_{k′}. This lemma can be proved by Jensen's inequality [24].

Using Lemma 2, we can obtain the following inequality for the first term of (42):

1 / ( ν^(NMF)_ijn Σ_k t̃_ikn ṽ_kjn + ν^(DNN)_ijn r̃^(DNN)_ijn ) ≤ Σ_k λ²_ijkn / (ν^(NMF)_ijn t̃_ikn ṽ_kjn) + λ̃²_ijn / (ν^(DNN)_ijn r̃^(DNN)_ijn),   (46)

where λ_ijkn ≥ 0 and λ̃_ijn ≥ 0 are auxiliary variables such that Σ_k λ_ijkn + λ̃_ijn = 1. The equality of (46) holds if and only if

λ_ijkn = ν^(NMF)_ijn t̃_ikn ṽ_kjn / (ν_ijn r̃_ijn),  λ̃_ijn = ν^(DNN)_ijn r̃^(DNN)_ijn / (ν_ijn r̃_ijn).   (47)

Using Lemma 1, we can derive the following inequality for the second term of (42):

log r̃_ijn ≤ r̃_ijn / γ_ijn + log γ_ijn − 1,   (48)

where γ_ijn > 0 is an auxiliary variable. The equality of (48) holds if and only if

γ_ijn = r̃_ijn.   (49)

Taken together, the upper bound of (42) is obtained as

L⁺⁺_t-PoP = Σ_{i,j,n} [ ((ν_ijn + 2)/ζ_ijn) |w_in^H x_ij|² { Σ_k λ²_ijkn / (ν^(NMF)_ijn t̃_ikn ṽ_kjn) + λ̃²_ijn / (ν^(DNN)_ijn r̃^(DNN)_ijn) } + r̃_ijn / γ_ijn + log γ_ijn ] + D^(t-PoP)_{\t̃,ṽ},   (50)

where D^(t-PoP)_{\t̃,ṽ} denotes the terms that do not include t̃_ikn or ṽ_kjn. By solving ∂L⁺⁺_t-PoP/∂t̃_ikn = 0 and ∂L⁺⁺_t-PoP/∂ṽ_kjn = 0 and substituting the equality conditions (41), (47), and (49) into the solutions, we can derive the following update rules:

t̃_ikn ← t̃_ikn ( Σ_j |w_in^H x_ij|² ṽ_kjn / (ν_ijn r̃_ijn ξ_ijn) / Σ_j ṽ_kjn / (ν_ijn r̃_ijn) )^{1/2},   (51)
ṽ_kjn ← ṽ_kjn ( Σ_i |w_in^H x_ij|² t̃_ikn / (ν_ijn r̃_ijn ξ_ijn) / Σ_i t̃_ikn / (ν_ijn r̃_ijn) )^{1/2},   (52)

where ξ_ijn is given as (44).

4) Update Rules of r̃^(DNN)_ijn and ν^(DNN)_ijn: For DNN_n, we adopt the DNN proposed in [21], where ν^(DNN)_ijn is represented by a weighted sum of anchors:

ν^(DNN)_ijn = Σ_{κ∈K} ρ^(κ)_ijn κ,   (53)

where K is a set of anchors and ρ^(κ)_ijn is the weight of anchor κ that satisfies Σ_{κ∈K} ρ^(κ)_ijn = 1. This representation avoids directly regressing ν^(DNN)_ijn, whose estimation error degrades the separation performance of IDLMA [21]. Hence, we can update r̃^(DNN)_ijn and ν^(DNN)_ijn by using (53) and the following rule:

r̃^(DNN)_ijn ← max(σ̃_ijn, ε_1)²,   (54)

where σ̃_ijn is the (i, j)th entry of the source magnitude spectrogram obtained using DNN_n.

5) Entire Procedure of Separation Stage: Fig. 1(a) shows the overview of the separation process of t-PoP-IDLMA, where the spatial and source models are iteratively updated. Algorithm 2 shows the entire parameter estimation algorithm of t-PoP-IDLMA in the separation stage, where I^(t-PoP)_(in) and I^(t-PoP)_(out) denote the numbers of inner and outer iterations, respectively. The inner iteration does not include the update of the DNN part, whereas the outer iteration includes it.
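Lemma 2 and its equality condition can be verified numerically. The following is a standalone check of the inequality, not code from the paper:

```python
import numpy as np

# Lemma 2: 1 / sum_k h_k <= sum_k lambda_k^2 / h_k for any lambda on the simplex;
# equality is attained at lambda_k = h_k / sum_k' h_k'.
rng = np.random.default_rng(3)
h = rng.random(5) + 0.1               # nonnegative values h_k
lam = rng.random(5)
lam = lam / lam.sum()                 # an arbitrary point on the simplex
bound = np.sum(lam ** 2 / h)          # right-hand side of the inequality
tight = h / h.sum()                   # equality-attaining auxiliary variables
tight_bound = np.sum(tight ** 2 / h)
```

The bound holds for any simplex point and collapses to equality at the stated choice, which is exactly what the MM derivation exploits.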

D. DNN Training
As in IDLMA, the DNN part is trained before the separation stage described in Section III-C. In the DNN training stage, we set r̃^(NMF)_ijn = 0 because the DNN part is responsible for the source components similar to the training data. In the spirit of IDLMA, we design a cost function for the DNN training to be consistent with the cost function (38) of the separation stage:

C^(n)_t-PoP = Σ_{i,j} [ ((ν^(DNN)_ijn + 2)/2) log ( 1 + (2/ν^(DNN)_ijn) |s_ijn|² / (σ̃²_ijn + ε_2) ) + log (σ̃²_ijn + ε_2) ],   (56)

where σ̃_ijn is estimated from a noisy mixture using DNN_n. The noisy mixture is generated by mixing the clean spectrogram of source n, s_ijn, and the spectrograms of the other sources.
The minimization of C^(n)_t-PoP with respect to the DNN parameters is performed in the same manner as in IDLMA.

IV. PROPOSED G-POP-IDLMA

A. G-PoP-Based Source Model

Although t-PoP-IDLMA can handle timbral mismatches, its DNN must be retrained whenever the hyperparameters of the prior are changed. On the basis of our interpretation in Section III-B, we introduce the following assumption.
Assumption 1: The DoF parameters ν^(NMF)_ijn and ν^(DNN)_ijn go to infinity while their ratio is kept such that ν^(NMF)_ijn / (ν^(NMF)_ijn + ν^(DNN)_ijn + 2) → η_ijn for some η_ijn ∈ [0, 1].

Since a complex isotropic Student's-t distribution becomes a complex isotropic Gaussian distribution as the DoF parameter goes to infinity, under Assumption 1 we can convert (29) into

p(y_ijn; θ_ijn) = (1 / (π r̃_ijn)) exp(−|y_ijn|² / r̃_ijn),   (58)

where

r̃_ijn = η_ijn r̃^(NMF)_ijn + (1 − η_ijn) r̃^(DNN)_ijn = η_ijn Σ_k t̃_ikn ṽ_kjn + (1 − η_ijn) r̃^(DNN)_ijn.   (59)

Since the source model is based on a complex isotropic Gaussian distribution, we call it the G-PoP-based source model. From (59), the source power spectrogram r̃_ijn is the η_ijn-weighted sum of the NMF- and DNN-based source models, which clarifies the relationship between the proposed source model and the source models of ILRMA and IDLMA, as shown in Fig. 4.

B. Separation Stage of G-PoP-IDLMA

1) Cost Function: The negative log-likelihood of the G-PoP-based source model yields the cost function

L_G-PoP = Σ_{i,j,n} [ |w_in^H x_ij|² / r̃_ijn + log r̃_ijn ] − 2J Σ_i log |det W_i| + const.   (60)

By setting η_ijn = 1 and η_ijn = 0, L_G-PoP reduces to the cost functions of ILRMA (10) and IDLMA (18), respectively. This finding clarifies that PoP-IDLMA encompasses the source models of ILRMA and IDLMA. Similarly to Section III-C, we describe the DNN training stage in Section IV-C and the separation stage in this section. In the following, we derive the update rules of W_i, t̃_ikn, ṽ_kjn, and r̃^(DNN)_ijn. Note that η_ijn is treated as a hyperparameter.
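The η-weighted combination of the NMF- and DNN-based powers is a one-liner; the following sketch (our own, with the hypothetical name `gpop_power`) also makes the two limiting cases explicit:

```python
import numpy as np

def gpop_power(r_nmf, r_dnn, eta):
    """G-PoP source power spectrogram as the eta-weighted sum of the
    NMF- and DNN-based parts (eta = 1: ILRMA-style, eta = 0: IDLMA-style)."""
    return eta * r_nmf + (1.0 - eta) * r_dnn
```

Setting eta to 1 recovers the ILRMA-style power and setting it to 0 recovers the IDLMA-style power, mirroring how the cost function reduces to (10) and (18).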
2) Update Rule of W_i: Since the w_in-related terms of L_G-PoP are only the quadratic and log-determinant terms, we can use the IP algorithm as in t-PoP-IDLMA. Hence, the update rule of W_i is defined as

{W_i}_i ← IP({W_i}_i, {Ũ_in}_{i,n}, {e_n}_n),  Ũ_in = (1/J) Σ_j x_ij x_ij^H / r̃_ijn.   (61)

It is identical to the update rule obtained by applying Assumption 1 to (43) because ξ_ijn → r̃_ijn as the DoF parameters go to infinity.

3) Update Rules of t̃_ikn and ṽ_kjn: As in Section III-C3, we construct an auxiliary function of L_G-PoP and derive the update rules of t̃_ikn and ṽ_kjn. The difficulty in directly minimizing L_G-PoP with respect to t̃_ikn and ṽ_kjn is that the first and second terms of (60) include sums over k inside the reciprocal and logarithmic functions, respectively. Applying inequalities (46) and (48) to these terms yields the following auxiliary function:

L⁺_G-PoP = Σ_{i,j,n} [ |w_in^H x_ij|² { Σ_k λ²_ijkn / (η_ijn t̃_ikn ṽ_kjn) + λ̃²_ijn / ((1 − η_ijn) r̃^(DNN)_ijn) } + r̃_ijn / γ_ijn + log γ_ijn ] + D^(G-PoP)_{\t̃,ṽ},   (62)

where D^(G-PoP)_{\t̃,ṽ} denotes the terms that do not include t̃_ikn or ṽ_kjn. Solving ∂L⁺_G-PoP/∂t̃_ikn = 0 and ∂L⁺_G-PoP/∂ṽ_kjn = 0 and substituting the equality conditions into the solutions yield

t̃_ikn ← t̃_ikn ( Σ_j |w_in^H x_ij|² ṽ_kjn / r̃²_ijn / Σ_j ṽ_kjn / r̃_ijn )^{1/2},   (63)
ṽ_kjn ← ṽ_kjn ( Σ_i |w_in^H x_ij|² t̃_ikn / r̃²_ijn / Σ_i t̃_ikn / r̃_ijn )^{1/2}.   (64)

4) Update Rule of r̃^(DNN)_ijn: In G-PoP-IDLMA, the DNN of the nth source, DNN_n, estimates the source magnitude spectrogram σ̃_ijn from |Y_n|^{·1}. As in IDLMA, we update r̃^(DNN)_ijn as r̃^(DNN)_ijn ← max(σ̃_ijn, ε_1)².

5) Entire Procedure of Separation Stage: Fig. 1(b) shows the overview of the separation process of G-PoP-IDLMA, where the spatial and source models are iteratively updated. Algorithm 3 shows the entire parameter estimation algorithm of G-PoP-IDLMA. It has inner and outer iterations to balance the update amounts of the DNN and NMF parts, similarly to Algorithm 2.

C. DNN Training
In the DNN training stage, we train DNN_n so that it can estimate the clean magnitude spectrogram |s_ijn| from a noisy mixture. Since the NMF part is responsible for the components not included in the training data, we can set η_ijn = 0 during the DNN training. The resultant cost function is given as

C^(n)_G-PoP = Σ_{i,j} [ (|s_ijn|² + ε_2) / (σ̃²_ijn + ε_2) + log (σ̃²_ijn + ε_2) ].   (67)

It is identical to the cost function for the DNN training in IDLMA (20). Hence, we can use the same DNN training procedure as in IDLMA.
It should be noted that (67) does not include η ijn . Thus, once the DNN is trained, it can be used for any η ijn values, which is the primary advantage of G-PoP-IDLMA compared with t-PoP-IDLMA.

V. EXPERIMENTAL EVALUATION
A. Experimental Settings 1) Common Settings: To evaluate the effectiveness of the proposed PoP-IDLMAs, we conducted experiments on determined multichannel music source separation using the DSD100 dataset [26]. This dataset consists of dev and test sets (50 songs per set) and separate recordings of vocals (Vo.), bass (Ba.), drums (Dr.), and other instruments. The recordings of Vo., Ba., and Dr. were used as dry sources.
We generated test data by extracting 30- to 60-s segments of the top 25 songs in the test set in alphabetical order and convolving them with the E2A impulse response (T60 = 300 ms) in the RWCP database [27]. The test data were composed of stereo and three-channel mixtures, where the number of channels equals that of sources, i.e., N = M. The other settings were as follows. Stereo mixtures: The stereo mixtures were generated with two recording conditions for each pair of Vo., Ba., and Dr. (i.e., Ba./Dr., Vo./Ba., and Vo./Dr.). The number of mixtures was 50 for each instrument pair. The recording conditions are shown in Fig. 5.
Three-channel mixtures: The three-channel mixtures were also generated with two recording conditions, which are shown in Fig. 6. The sources were Vo., Ba., and Dr. (Vo./Ba./Dr.). The number of mixtures was 50.
The sampling frequency was set at 8 kHz as in [14]. For the STFT, we used a Hamming window of 512 ms (4096 samples) with a frame shift of 256 ms (2048 samples). The evaluation metric was the source-to-distortion ratio (SDR) improvement computed using the BSS Eval toolbox [28].
2) Compared Methods: We compared the proposed PoP-IDLMAs with one BSS method and four source-supervised methods. The BSS method is ILRMA [8], which is the NMF-only counterpart of the proposed PoP-IDLMAs. The number of bases was set to K = 20. The initial values of t ikn and v kjn were drawn from a uniform distribution over [0,1), and W i was initialized with an identity matrix. We did not use t-ILRMA for the comparison because it showed a similar performance to ILRMA as shown in [25].
The source-supervised methods were the combination of the DNN and the Wiener filter (DNN+WF) [29], the combination of the full-rank spatial covariance model with DNN (FSCM+DNN) [30], IDLMA [14], and t-IDLMA [14]. IDLMA and t-IDLMA are the DNN-only counterparts of the proposed PoP-IDLMAs. For these four methods, we used the same DNN architecture as in [14]. Fig. 7(a) shows this architecture. It consists of four fully connected (FC) blocks, an FC layer, and a rectified linear unit (ReLU) nonlinearity [31]. Each FC block is composed of an FC layer with 2048 hidden units, a ReLU nonlinearity, and a dropout layer with a drop rate of 0.3. For t-IDLMA, we set the DoF parameter ν = 500, which provided the highest separation performance for the stereo and three-channel mixtures on average. For IDLMA and t-IDLMA, the demixing matrix W i was initialized with an identity matrix.
The proposed methods are t-PoP-IDLMA and G-PoP-IDLMA. We set the number of bases to K = 20 to match that of ILRMA. The initial values of t̃_ikn, ṽ_kjn, and W_i were set in the same manner as those of ILRMA. The numbers of inner and outer iterations were both set to 10, i.e., (I_(in), I_(out)) = (10, 10) for both t-PoP-IDLMA and G-PoP-IDLMA. For t-PoP-IDLMA, we used the same DNN architecture as in [21], which has two heads for ρ^(κ)_ijn and σ̃_ijn. Fig. 7(b) shows this architecture. We set the anchor set K = {1, 10, 100, 1000} and varied ν^(NMF)_ijn over 1, 10, 100, and 1000. For G-PoP-IDLMA, we varied η_ijn over 10^−2, 10^−4, 10^−6, 10^−8, and 10^−10 and used the same DNNs as those used in the source-supervised methods. Since the used values of ν^(NMF)_ijn and η_ijn were independent of i, j, and n, we hereafter drop these indices from the two parameters for simplicity.
3) DNN Training: For the DNN training, we used all 50 songs in the dev set of the DSD100 dataset as training data and the bottom 25 songs in alphabetical order in the test set as validation data. The DNNs were trained with the optimizer proposed in [32] with a batch size of 128. Gradient clipping [33] was applied to the weights of the DNNs so that their l2 norms were less than or equal to 10. We set ε_1 = 10^−1/2 and ε_2 = 10^−5, and the other training conditions were the same as those in [14].

B. Comparison of Average Spectra Between Training and Test Data
Before discussing the separation results, we examined the average spectra of the training and test data to show the timbral mismatches. Fig. 8 shows the average power spectra of the training and test data for each musical instrument. The spectra labeled Training and Test were computed from the clean audio signals used for the DNN training and the dry sources of the test data, respectively. For vocals and bass, the spectral differences between Training and Test were greater in the frequency band above 2000 Hz. For drums, the average spectrum of Training was clearly different from that of Test in the frequency band above around 500 Hz. These results show that the timbral mismatches were most pronounced in the higher frequency bands.

t-PoP-IDLMA exhibited a greater separation performance than G-PoP-IDLMA, but their differences in SDR were slight. This result shows that the unification of the NMF- and DNN-based source models has a greater impact on SDR than the difference in probability distribution.

C. Results for Stereo Mixtures
We observed a correlation between the η values and the significance of the timbral mismatches. A smaller η provided slightly higher SDR improvements for the Ba./Dr. mixtures, whereas a greater η provided higher SDR improvements for the Vo./Ba. mixtures. This tendency correlates with the significance of the spectral differences between the training and test data described in Section V-B. Although a clear tendency was not observed for the Vo./Dr. mixtures, this result suggests that a greater η should be used as the timbral mismatches become more significant. Table II shows the average SDR improvements for the three-channel mixtures. The SDR improvements of t-PoP-IDLMA monotonically increased in the range of ν^(NMF) used in Section V-C, and we increased ν^(NMF) until they started to decrease. The IDLMA family consistently worked well compared with the other methods, as in Section V-C. t-PoP-IDLMA (ν^(NMF) = 10^4, 10^5, and 10^6) and G-PoP-IDLMA (η = 10^−6, 10^−8, and 10^−10) outperformed the conventional methods, showing the effectiveness of the proposed methods in more severe situations.

D. Results for Three-Channel Mixtures
t-PoP-IDLMA achieved the highest SDR improvement with ν^(NMF) = 10^5. However, it had a lower SDR improvement than conventional IDLMA when ν^(NMF) = 10^0, which was the best hyperparameter for the stereo mixtures. By contrast, G-PoP-IDLMA worked stably with η = 10^−8 and 10^−10 for both the stereo and three-channel mixtures. This performance stability is another advantage of G-PoP-IDLMA.

E. Effect of η
As described in Section IV-A, the G-PoP-based source model is identical to the DNN-based source model of IDLMA when η = 0. However, we experimentally observed that G-PoP-IDLMA behaved differently from IDLMA even when η was decreased to a value close to zero (10^-10). To examine this phenomenon, we compared Σ_{i,j,n} r^(NMF)_{ijn} and Σ_{i,j,n} r^(DNN)_{ijn} along the iterations. We hereafter call these two quantities the energies of the NMF and DNN parts, respectively.
We experimentally found that the energies of the NMF and DNN parts automatically became balanced as the iterations proceeded. At the early iterations, the energy of the NMF part was small and the DNN part dominated the demixing matrix updates. At the late iterations, the energy of the NMF part gradually became the same as that of the DNN part. This observation indicates that η practically determines how confident the NMF part is only at the early iterations, where the NMF part is still converging and its estimates r^(NMF)_{ijn} are less useful for the demixing matrix estimation. Even when η is small, the NMF part affects the separation performance after a sufficient number of iterations are performed. This result clarifies the role and effectiveness of the NMF part.
If η uniformly affected all iterations, we would need to precisely control η along the iterations. However, owing to the automatic energy balancing, the proposed methods are free from such painstaking tuning. This is another advantage of the PoP-based source model.
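The energy comparison described above can be monitored with a few lines of code. The sketch below is a hypothetical helper (the names `part_energies`, `T`, `V`, and `r_dnn`, and the random inputs are illustrative), assuming the NMF part is parameterized by nonnegative basis and activation matrices as in ILRMA.

```python
import numpy as np

def part_energies(T, V, r_dnn):
    """Return the energies of the NMF and DNN parts.

    T:     (freq, basis) nonnegative spectral templates
    V:     (basis, time) nonnegative activations
    r_dnn: (freq, time)  DNN-estimated power spectrogram
    The NMF part's variances form the low-rank product r_nmf = T @ V;
    each "energy" is the sum of the variances over all bins and frames.
    """
    r_nmf = T @ V
    return r_nmf.sum(), r_dnn.sum()

# Illustrative usage: track the energy ratio over (mock) iterations.
rng = np.random.default_rng(0)
T = rng.random((1025, 8))
V = rng.random((8, 128))
r_dnn = rng.random((1025, 128))
e_nmf, e_dnn = part_energies(T, V, r_dnn)
ratio = e_nmf / e_dnn  # observed to approach 1 as iterations proceed
```

Logging this ratio at each iteration reproduces the diagnostic used above: it starts small while the DNN part dominates and approaches one as the NMF part converges.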

F. Effect of PoP-Based Source Model
To assess the effect of using the PoP-based source model, we compared G-PoP-IDLMA with IDLMA in terms of the frequency-band-wise source-to-noise ratio (FBW-SNR). The FBW-SNR is defined as SNR_{ω,n} = (1/#B_ω) Σ_{i∈B_ω} 10 log_10(·), where ω = 1, ..., 7 is the frequency band index, B_ω = {250(ω − 1) + 1, ..., 250ω}, #B_ω is the number of elements in B_ω, and a_{i,n,m_ref} is the (m_ref, n)th entry of the mixing matrix A_i. Fig. 9 shows the average FBW-SNRs over 50 mixtures for the Ba./Dr. mixtures, where G-PoP-IDLMA was run with η = 10^-10. The FBW-SNRs of G-PoP-IDLMA were higher than those of IDLMA in all the frequency bands, and the improvements over IDLMA were remarkable in the frequency bands above 500 Hz, which is consistent with the average spectral differences shown in Section V-B. In these frequency bands, the DNN outputs had many zeros, whereas the NMF part succeeded in the source power estimation. We observed the same trends for the other stereo mixtures, as shown in Figs. 10 and 11. The FBW-SNR gaps between G-PoP-IDLMA and IDLMA were particularly large for drums and bass. This is presumably because the spectrograms of these instruments tend to match the low-rank assumption of NMF. These results show that the NMF part can compensate for the source power estimation in the frequency bands where the DNN part fails in power spectrogram estimation.
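The band-wise aggregation in the FBW-SNR can be sketched as follows. Since the argument of the logarithm in the definition is not reproduced here, the per-bin SNR below (reference power over error power, summed over frames) is an illustrative stand-in; the 250-bin bands and the average over each band follow the definition above.

```python
import numpy as np

def fbw_snr(ref, est, band_size=250, n_bands=7):
    """Frequency-band-wise SNR: per-bin log-SNRs averaged over bands.

    ref, est: (freq, time) complex spectrograms of a reference source
    image and its estimate. The per-bin SNR used here is an assumed
    reference-power-over-error-power ratio, not the paper's exact term.
    """
    per_bin = 10 * np.log10(
        (np.sum(np.abs(ref) ** 2, axis=1) + 1e-12)
        / (np.sum(np.abs(ref - est) ** 2, axis=1) + 1e-12)
    )  # one log-SNR per frequency bin
    return np.array([per_bin[w * band_size:(w + 1) * band_size].mean()
                     for w in range(n_bands)])  # one dB value per band

# Illustrative usage with random spectrograms (1750 bins = 7 * 250).
rng = np.random.default_rng(0)
ref = rng.standard_normal((1750, 64)) + 1j * rng.standard_normal((1750, 64))
est = ref + 0.1 * (rng.standard_normal((1750, 64))
                   + 1j * rng.standard_normal((1750, 64)))
band_snrs = fbw_snr(ref, est)  # shape (7,), in dB
```

Averaging such band-wise values over mixtures yields curves like those in Figs. 9 to 11, making band-localized failures of a source model directly visible.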

VI. CONCLUSION
We proposed two source models that encompass the NMF- and DNN-based source models used in ILRMA and IDLMA, respectively. The proposed source models use the PoP, a prior distribution of the source power spectrogram, which is constructed by multiplying the NMF- and DNN-based probability distributions in accordance with the PoE concept. Since the PoP can be written as an inverse gamma distribution, we can introduce the PoP-based source models into the IDLMA framework without violating the generative modeling. The resultant IDLMA extensions are t- and G-PoP-IDLMAs. For the proposed PoP-IDLMAs, we derived efficient parameter estimation algorithms on the basis of the MM algorithm. Experimental results showed the effectiveness of the proposed PoP-IDLMAs and the importance of unifying the NMF- and DNN-based source models. Furthermore, the assessment of the results clarified that the NMF part can compensate for the source power estimation in the frequency bands where the DNN part fails in the estimation.