Majorization-Minimization Algorithm for Discriminative Non-Negative Matrix Factorization

This paper proposes a basis training algorithm for discriminative non-negative matrix factorization (NMF) with applications to single-channel audio source separation. With an NMF-based approach to supervised audio source separation, NMF is first applied to train the basis spectra of each source using training examples and then applied to the spectrogram of a mixture signal using the pretrained basis spectra at test time. The source signals can then be separated out using a Wiener filter. Here, a typical way to train the basis spectra is to minimize the dissimilarity measure between the observed spectrogram and the NMF model. However, obtaining the basis spectra in this way does not ensure that the separated signal will be optimal at test time due to the inconsistency between the objective functions for training and separation (Wiener filtering). To address this mismatch, a framework called discriminative NMF (DNMF) has recently been proposed. While this framework is noteworthy in that it uses a common objective function for training and separation, the objective function becomes more analytically complex than that of regular NMF. In the original DNMF work, a multiplicative update algorithm was proposed for the basis training; however, the convergence of the algorithm is not guaranteed and can be very slow. To overcome this weakness, this paper proposes a convergence-guaranteed algorithm for DNMF based on a majorization-minimization principle. Experimental results show that the proposed algorithm outperforms the conventional DNMF algorithm as well as the regular NMF algorithm in terms of both the signal-to-distortion and signal-to-interference ratios.


I. INTRODUCTION
Single-channel audio source separation is the challenging task of extracting individual source signals from a monaural recording of a mixture signal. Since the presence of noise or interference can severely degrade the performance of many audio applications such as automatic transcription of music, speech recognition, and voice conversion, many attempts have been made to address this problem [1]-[8]. One successful approach for monaural audio source separation involves applications of non-negative matrix factorization (NMF) [6], [10]. Although deep neural network-based methods [7]-[9] have been shown to work impressively in recent years, the NMF approach still remains attractive when only a limited amount of training data is available.
The associate editor coordinating the review of this manuscript and approving it for publication was Manuel Rosa-Zurera.
The basic idea of the NMF approach is to interpret the observed magnitude (or power) spectrogram of a signal as a non-negative matrix and factorize it into the product of non-negative matrices. This amounts to approximating the observed spectra by a linear sum of basis spectra scaled by time-varying amplitudes. In a supervised/semi-supervised source separation problem setting, NMF is first used to train the basis spectra of each sound source using individually recorded audio samples. At test time, NMF is applied to the spectrogram of a test mixture signal, where each subset of the basis spectra is fixed at the pretrained spectra. The source signals can then be separated out using a Wiener filter constructed by employing the estimated power spectrogram of each source. A typical way to train the basis spectra of each source is to minimize a divergence measure between the NMF model and the spectrogram of the training samples of that source. However, the basis spectra obtained in this way do not ensure that the separated signal at test time will be optimal since the objective functions for training and separation are inconsistent, namely a divergence measure for training and Wiener filtering for separation.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To address this mismatch between the training and test objectives, a framework called discriminative NMF (DNMF) has recently been proposed [11]. While many methods called ''discriminative NMF'' [12]-[17] have been proposed with the aim of enhancing the discriminative power of the basis spectra, in this paper, we use this term in relation to the work done by Weninger [11]. Note that the term ''discriminative'' is used in association with the discriminative models for classification and regression. The central idea of DNMF is that the basis spectra are trained in such a way that the output of the Wiener filter becomes as close to the spectrogram of each of the training examples as possible so that the separated signals become optimal at test time. This approach differs from the conventional supervised NMF framework in that it uses the training examples of all the sources to train the basis spectra for each of the sources. This is important since it helps to enhance the discriminative power of the basis spectra. However, the training criterion for DNMF becomes analytically more complex than the typical divergence measures used in the standard NMF framework, which causes difficulty as regards optimization of the basis spectra. In [11], Weninger proposed a multiplicative update (MU) algorithm for the basis training, where the multiplicative factor is obtained by dividing the negative parts by the positive parts of the partial derivative of the objective function as done in [18]. Although this way of obtaining update rules is indeed convenient in that it is applicable as long as an objective function is differentiable, one drawback is that the algorithm is generally not guaranteed to converge to a stationary point. To overcome this weakness, this paper proposes using a majorization-minimization (MM) principle to derive a convergence-guaranteed basis training algorithm for DNMF. We show in Sec. IV that using the present basis training algorithm instead of the conventional MU algorithm leads to notable improvements in source separation performance.
The rest of this paper is organized as follows. Sec. II reviews the standard NMF and DNMF approaches for single-channel source separation and the multiplicative update algorithm. In Sec. III, we introduce the MM principle, on which basis we derive the proposed algorithm. We show the experimental results in Sec. IV and conclude the paper in Sec. V.

II. DISCRIMINATIVE NON-NEGATIVE MATRIX FACTORIZATION

A. STANDARD NMF APPROACH
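We start by reviewing the standard NMF approach for single-channel source separation. Let the number of sources be $L$. We use $Y = (y_{\omega,t})_{\Omega \times T} \in \mathbb{R}_{\ge 0}^{\Omega \times T}$ to denote the power spectrogram of a mixture signal obtained using the short-time Fourier transform (STFT), where $\omega$ and $t$ are the frequency and time indices, respectively. With the supervised NMF approach, we factorize $Y$, interpreted as a non-negative matrix, into the product of a non-negative basis matrix $\tilde{W} = [\tilde{W}_1, \tilde{W}_2, \ldots, \tilde{W}_L]$ and a non-negative coefficient (activation) matrix $\hat{H} = [\hat{H}_1; \hat{H}_2; \ldots; \hat{H}_L]$, where $\tilde{W}_l = (w^l_{\omega,k})_{\Omega \times K_l} \in \mathbb{R}_{\ge 0}^{\Omega \times K_l}$ is assumed to be pretrained using the spectrogram of a training sample $S_l = (s^l_{\omega,t})_{\Omega \times T}$ for each $l = 1, 2, \ldots, L$. A common way to train $\tilde{W}_l$ is to solve
$$(\tilde{W}_l, \tilde{H}_l) = \operatorname*{arg\,min}_{W_l \ge 0,\, H_l \ge 0} D(S_l \mid W_l H_l) + \mu \|H_l\|_1, \tag{1}$$
where $D$ is a cost function that measures the dissimilarity of $S_l$ and $W_l H_l$. Here, we have assumed that $\mu \|H_l\|_1$ is used as a regularization term for promoting the sparsity of $H_l$, where $\mu$ is a regularization parameter that weighs the importance of the regularization term. Note that we can use other kinds of regularization terms, but here we omit them for simplicity. At test time, the concatenated basis matrix $\tilde{W}$ is fixed at the pretrained basis spectra and the activation matrix $H$ is estimated by solving
$$\hat{H} = \operatorname*{arg\,min}_{H} D(Y \mid \tilde{W} H) + \mu \|H\|_1 \tag{2}$$
subject to non-negativity. Typical choices for $D(Y \mid X)$ include the Euclidean distance, the generalized Kullback-Leibler (KL) divergence, and the Itakura-Saito (IS) divergence:
$$D_{\mathrm{EU}}(Y \mid X) = \sum_{\omega,t} (y_{\omega,t} - x_{\omega,t})^2, \tag{3}$$
$$D_{\mathrm{KL}}(Y \mid X) = \sum_{\omega,t} \left( y_{\omega,t} \log \frac{y_{\omega,t}}{x_{\omega,t}} - y_{\omega,t} + x_{\omega,t} \right), \tag{4}$$
$$D_{\mathrm{IS}}(Y \mid X) = \sum_{\omega,t} \left( \frac{y_{\omega,t}}{x_{\omega,t}} - \log \frac{y_{\omega,t}}{x_{\omega,t}} - 1 \right), \tag{5}$$
where $y_{\omega,t}$ and $x_{\omega,t}$ are the $(\omega, t)$th elements of $Y$ and $X$.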
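As a rough illustration of the training and test-time factorizations described above, the following sketch uses the well-known Lee-Seung multiplicative updates for the KL divergence, which the text does not spell out. The function names `kl_divergence` and `kl_nmf`, the `EPS` guard, and the column normalization of the basis matrix are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def kl_divergence(Y, X):
    """Generalized KL divergence D_KL(Y | X), summed over all entries."""
    Y = np.maximum(Y, EPS)
    X = np.maximum(X, EPS)
    return float(np.sum(Y * np.log(Y / X) - Y + X))


def kl_nmf(Y, K, mu=0.0, n_iter=100, W=None, seed=0):
    """KL-divergence NMF with an L1 penalty (weight mu) on the activations H.

    If W is given, it is kept fixed and only H is estimated (test-time mode);
    K is then ignored and taken from W.
    """
    rng = np.random.default_rng(seed)
    update_W = W is None
    if update_W:
        W = rng.random((Y.shape[0], K)) + EPS
    H = rng.random((W.shape[1], Y.shape[1])) + EPS
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # Multiplicative update of H; mu enters the denominator.
        H *= (W.T @ (Y / (W @ H + EPS))) / (W.T @ ones + mu + EPS)
        if update_W:
            W *= ((Y / (W @ H + EPS)) @ H.T) / (ones @ H.T + EPS)
            # Remove the scale ambiguity: unit-sum columns, compensated in H
            # so that the product W @ H is left unchanged.
            scale = W.sum(axis=0, keepdims=True) + EPS
            W /= scale
            H *= scale.T
    return W, H
```

In this sketch, training a source would correspond to `kl_nmf(S_l, K_l, mu)`, and test-time inference to `kl_nmf(Y, K, mu, W=W_tilde)` with the concatenated pretrained bases held fixed. With `mu > 0` the column normalization is a common heuristic and the monotone decrease of the penalized objective is not claimed here.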
A naïve way of obtaining the time-domain signal of the $l$th source is to simply use $\tilde{W}_l \hat{H}_l$ and the phase spectrogram of the mixture signal to obtain the complex spectrogram and perform the inverse STFT. However, the signals obtained in this way usually contain artifacts and often sound artificial. Another widely used way involves using the Wiener filter. Namely, once $\tilde{W}$ and $\hat{H}$ are obtained, the magnitude spectrogram of the $l$th source can be refined using the Wiener filter constructed using the estimated power spectrogram,
$$\mathsf{C}_l = \frac{\tilde{W}_l \hat{H}_l}{\sum_{l'} \tilde{W}_{l'} \hat{H}_{l'}} \odot \mathsf{Y}, \tag{6}$$
so that $\mathsf{C}_1, \ldots, \mathsf{C}_L$ are ensured to sum to the magnitude spectrogram $\mathsf{Y}$ of the test mixture signal, where $\odot$ and the fraction bar denote element-wise multiplication and division. Note that here we have used sans serif fonts to express magnitude spectrograms, $\mathsf{Y} = \sqrt{Y}$ and $\mathsf{C}_l = \sqrt{C_l}$, where $\sqrt{\cdot}$ denotes the element-wise square root.
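The Wiener filtering step can be sketched in a few lines. This is a minimal example under the assumption that each source model is the product of its basis and activation matrices; the function name `wiener_separate` and the small constant `eps` are illustrative choices, not part of the paper.

```python
import numpy as np


def wiener_separate(Y_mag, W_list, H_list, eps=1e-12):
    """Split the mixture magnitude spectrogram Y_mag among the sources.

    W_list[l] @ H_list[l] is the estimated power spectrogram of source l.
    The element-wise masks sum to one, so the estimates sum to Y_mag.
    """
    models = [W @ H for W, H in zip(W_list, H_list)]
    total = sum(models) + eps  # eps avoids division by zero in silent bins
    return [model / total * Y_mag for model in models]
```

Each returned array would then be combined with the mixture phase and passed to an inverse STFT to obtain a time-domain signal.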

B. DISCRIMINATIVE NMF
If we assume the Wiener filter is used to obtain source signals, the training and test objectives become inconsistent. Namely, the basis spectra are not necessarily trained in such a way that the separated signals at test time will be optimal. With the standard NMF approach, at test time, the basis matrix $W$ is used not only for estimating $H$ from $Y$ but also for constructing the Wiener filter in Eq. (6). To make the training objective consistent with this test inference procedure, Weninger [11] proposed introducing two separate basis matrices for these different purposes, $B$ and $W$, and formulating a bilevel optimization problem for training $B$ and $W$ so that $B$ will be optimized for estimating $H$ from $Y$ and $W$ will be optimized for obtaining $\mathsf{C}_1, \ldots, \mathsf{C}_L$ based on the Wiener filter:
$$\hat{H} = \operatorname*{arg\,min}_{H \ge 0} D(M \mid BH) + \mu \|H\|_1, \tag{8}$$
$$\hat{W} = \operatorname*{arg\,min}_{W \ge 0} \sum_{l} \alpha_l D\!\left( S_l \,\middle|\, \frac{W_l \hat{H}_l}{\sum_{l'} W_{l'} \hat{H}_{l'}} \odot M \right). \tag{9}$$
Here, $\alpha_l \ge 0$ is a constant that weighs the importance of source $l$. $M = (m_{\omega,t})_{\Omega \times T} \in \mathbb{R}_{\ge 0}^{\Omega \times T}$ denotes the power spectrogram of a mixture signal, which can be simply constructed by mixing the training samples $S_l$. When our goal is to reconstruct a single source $l$ only, we shall set $\alpha_l$ at 1 and $\alpha_{l'}$ at 0 for the other sources $l' \neq l$. Fig. 1 illustrates the training and test processes of DNMF using two sources.

C. MULTIPLICATIVE UPDATE ALGORITHM
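An inspection of Eqs. (1) and (9) shows that the training criterion for DNMF is more analytically complex than the objective function of standard NMF. In [11], Weninger proposed a two-stage iterative algorithm for solving the above optimization problem: First, $B$ and $H$ are obtained by solving Eq. (8) using a standard NMF algorithm. Second, by using the obtained $H$, the basis matrix $W$ is iteratively updated according to multiplicative update rules. Here, we set $\alpha_l = 1$ and $\alpha_{l'} = 0$ ($l' \neq l$) and define $W_{\bar{l}} = [W_1, \cdots, W_{l-1}, W_{l+1}, \cdots, W_L]$ and $H_{\bar{l}} = [H_1; \cdots; H_{l-1}; H_{l+1}; \cdots; H_L]$. When $D$ is defined as the KL divergence, the update rules are given by
$$W_l \leftarrow W_l \odot \frac{\left( \dfrac{S_l \odot \Lambda_{\bar{l}}}{\Lambda_l \odot \Lambda} \right) H_l^{\top}}{\left( \dfrac{M \odot \Lambda_{\bar{l}}}{\Lambda \odot \Lambda} \right) H_l^{\top}}, \tag{10}$$
$$W_{\bar{l}} \leftarrow W_{\bar{l}} \odot \frac{\left( \dfrac{M \odot \Lambda_l}{\Lambda \odot \Lambda} \right) H_{\bar{l}}^{\top}}{\left( \dfrac{S_l}{\Lambda} \right) H_{\bar{l}}^{\top}}, \tag{11}$$
where $\Lambda_l = W_l H_l$, $\Lambda_{\bar{l}} = W_{\bar{l}} H_{\bar{l}}$, and $\Lambda = \Lambda_l + \Lambda_{\bar{l}}$. Here, the multiplicative factors are given by dividing the negative parts by the positive parts of the partial derivatives of the objective function in Eq. (9) with respect to the elements of $W_l$ and $W_{\bar{l}}$, as done in [18]. Although this way of obtaining update rules is convenient in that it is generally applicable as long as an objective function is differentiable, one downside is that the algorithm is not guaranteed to converge to a stationary point.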

III. DNMF WITH MAJORIZATION-MINIMIZATION

A. MAJORIZATION-MINIMIZATION PRINCIPLE
To overcome the weakness of the conventional MU algorithm, in this paper, we propose employing an MM principle to derive a novel convergence-guaranteed algorithm for solving Eq. (9). When constructing an MM algorithm to minimize a certain objective function, the main issue is how to design an auxiliary function called a ''majorizer'' that is guaranteed to never be below the objective function. The following lemma shows that once we obtain an auxiliary function, we can develop an iterative algorithm such that the objective function is guaranteed to be non-increasing at each iteration.
Lemma 1: If we use $F(\Theta)$ to denote an objective function that we want to minimize with respect to $\Theta$ and use $F^+(\Theta, \bar{\Theta})$ to denote its auxiliary function, satisfying $F(\Theta) = \min_{\bar{\Theta}} F^+(\Theta, \bar{\Theta})$, then $F(\Theta)$ is non-increasing under the following updates of $\bar{\Theta}$ and $\Theta$:
$$\bar{\Theta} \leftarrow \operatorname*{arg\,min}_{\bar{\Theta}} F^+(\Theta, \bar{\Theta}), \tag{12}$$
$$\Theta \leftarrow \operatorname*{arg\,min}_{\Theta} F^+(\Theta, \bar{\Theta}). \tag{13}$$
Thus, if $F(\Theta)$ is bounded below, a stationary point of $F(\Theta)$ can be found by iteratively performing these updates.
Proof of Lemma 1: Suppose we set $\Theta$ to an arbitrary value $\tilde{\Theta}$. We will prove that $F(\Theta)$ is non-increasing after the updates of Eq. (12) and Eq. (13). Let $\bar{\Theta}_{\mathrm{new}}$ and $\Theta_{\mathrm{new}}$ denote the values obtained by these two updates. From Eq. (12), one obtains $F(\tilde{\Theta}) = F^+(\tilde{\Theta}, \bar{\Theta}_{\mathrm{new}})$. From Eq. (13), $F^+(\tilde{\Theta}, \bar{\Theta}_{\mathrm{new}}) \ge F^+(\Theta_{\mathrm{new}}, \bar{\Theta}_{\mathrm{new}})$, and by the definition of the auxiliary function, $F^+(\Theta_{\mathrm{new}}, \bar{\Theta}_{\mathrm{new}}) \ge F(\Theta_{\mathrm{new}})$. Hence, $F(\tilde{\Theta}) \ge F(\Theta_{\mathrm{new}})$.
It should be noted that this concept is adopted in many existing algorithms. For example, the expectation-maximization (EM) algorithm [19] builds a surrogate for a likelihood function of latent variable models using Jensen's inequality. It is also well known for its use in devising an algorithm for standard NMF [10], [20]. In general, if we can build a tight majorizer that is easy to optimize for the objective function of some optimization problems, we can expect to obtain a fast-converging algorithm. Another advantage of MM-based algorithms is that they have no hyperparameters to tune. This is in contrast to gradient-based methods, which usually require step-size settings.
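To make the two alternating updates of the lemma concrete, here is a small scalar sketch. The objective $f(x) = (x - 3)^2 + 4\log(1 + x)$ is a toy function chosen purely for illustration and is unrelated to the DNMF objective; its concave logarithmic term is majorized by a tangent line, which gives a closed-form minimizer at each step.

```python
import math


def f(x):
    """Toy objective: convex quadratic plus a concave log term (x >= 0)."""
    return (x - 3.0) ** 2 + 4.0 * math.log1p(x)


def majorizer(x, x_bar):
    """Upper bound of f: log(1 + x) is replaced by its tangent at x_bar."""
    tangent = math.log1p(x_bar) + (x - x_bar) / (1.0 + x_bar)
    return (x - 3.0) ** 2 + 4.0 * tangent


def mm_minimize(x0, n_iter=50):
    """Alternate the auxiliary-variable update and the parameter update."""
    x, history = x0, [f(x0)]
    for _ in range(n_iter):
        x_bar = x                      # tightens the bound: majorizer(x, x_bar) == f(x)
        x = 3.0 - 2.0 / (1.0 + x_bar)  # closed-form minimizer of the majorizer in x
        history.append(f(x))
    return x, history
```

Starting from `x0 = 0.0`, the recorded objective values never increase, and the iterates approach the stationary point $1 + \sqrt{2}$ of this toy function without any step size being chosen.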

B. DERIVATION OF MAJORIZERS
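Here, we derive majorizers for the objective function where $D$ is defined as the KL divergence and the IS divergence. When $D$ is defined as the KL divergence, the objective function in Eq. (9) is given by
$$f_{\mathrm{KL}}(W) \overset{c}{=} \sum_{l} \alpha_l \sum_{\omega,t} \left( -s^l_{\omega,t} \log g^l_{\omega,t} + s^l_{\omega,t} \log g_{\omega,t} + \frac{g^l_{\omega,t}\, m_{\omega,t}}{g_{\omega,t}} \right),$$
where we have used $g^l_{\omega,t}$ and $g_{\omega,t}$ to represent
$$g^l_{\omega,t} = \sum_{k} w^l_{\omega,k} h^l_{k,t}, \qquad g_{\omega,t} = \sum_{l} g^l_{\omega,t},$$
and $\overset{c}{=}$ to denote equality up to a constant term. First, let us focus on the term $g^l_{\omega,t} / g_{\omega,t}$. To construct a majorizer for this term, we can use the following inequality:
Lemma 2: For $a > 0$, $b > 0$, and $\lambda > 0$, we have
$$\frac{a}{b} \le \frac{\lambda a^2}{2} + \frac{1}{2 \lambda b^2}.$$
The equality holds if and only if $\lambda = 1/(ab)$.
Proof of Lemma 2: For $a, b, \lambda > 0$,
$$\left( a - \frac{1}{\lambda b} \right)^2 \ge 0 \iff \frac{a}{b} \le \frac{\lambda a^2}{2} + \frac{1}{2 \lambda b^2}.$$
The equality holds if and only if $a - \frac{1}{\lambda b} = 0$.
Since $m_{\omega,t}$ is non-negative, we can construct an upper bound for $g^l_{\omega,t} m_{\omega,t} / g_{\omega,t}$ according to the above lemma,
$$\frac{g^l_{\omega,t}\, m_{\omega,t}}{g_{\omega,t}} \le \frac{\lambda^l_{\omega,t}\, m_{\omega,t}\, (g^l_{\omega,t})^2}{2} + \frac{m_{\omega,t}}{2 \lambda^l_{\omega,t}\, g_{\omega,t}^2}. \tag{18}$$
The equality of Eq. (18) holds if and only if $\lambda^l_{\omega,t} = 1 / (g^l_{\omega,t}\, g_{\omega,t})$. In the following, we construct a majorizer for each of the four resulting terms, namely $-s^l_{\omega,t} \log g^l_{\omega,t}$, $s^l_{\omega,t} \log g_{\omega,t}$, and the two terms on the right-hand side of Eq. (18). We notice that the function $-\log x$ is convex. Since $s^l_{\omega,t}$ is non-negative, $-s^l_{\omega,t} \log g^l_{\omega,t}$ is convex in $g^l_{\omega,t}$. Hence, we can use Jensen's inequality to obtain a majorizer for this term as
$$-s^l_{\omega,t} \log g^l_{\omega,t} \le -s^l_{\omega,t} \sum_{k} \gamma^l_{k,\omega,t} \log \frac{w^l_{\omega,k} h^l_{k,t}}{\gamma^l_{k,\omega,t}}, \tag{20}$$
where $\gamma^l_{k,\omega,t}$ is a positive weight that sums to unity: $\sum_{k} \gamma^l_{k,\omega,t} = 1$. The equality of Eq. (20) holds if and only if $\gamma^l_{k,\omega,t} = w^l_{\omega,k} h^l_{k,t} / g^l_{\omega,t}$. The second term $s^l_{\omega,t} \log g_{\omega,t}$ is concave in $g_{\omega,t}$. Hence, we can use the fact that a tangent line to the graph of a differentiable concave function lies entirely above the graph:
$$s^l_{\omega,t} \log g_{\omega,t} \le s^l_{\omega,t} \left( \log \eta_{\omega,t} + \frac{g_{\omega,t} - \eta_{\omega,t}}{\eta_{\omega,t}} \right), \tag{23}$$
where $\eta_{\omega,t}$ is an arbitrary positive number. The equality of this inequality holds if and only if $\eta_{\omega,t} = g_{\omega,t}$.
Since a quadratic function is convex, we can apply Jensen's inequality to the third term, which yields
$$(g^l_{\omega,t})^2 \le \sum_{k} \frac{(w^l_{\omega,k} h^l_{k,t})^2}{\beta^l_{k,\omega,t}}, \tag{25}$$
where $\beta^l_{k,\omega,t}$ is also a positive weight that sums to unity: $\sum_{k} \beta^l_{k,\omega,t} = 1$. The equality of Eq. (25) holds if and only if $\beta^l_{k,\omega,t} = w^l_{\omega,k} h^l_{k,t} / g^l_{\omega,t}$. As regards the fourth term, we can use the fact that the function $1/x^2$ is convex for $x > 0$ and use Jensen's inequality to obtain a majorizer:
$$\frac{1}{g_{\omega,t}^2} \le \sum_{k} \frac{\theta_{k,\omega,t}^3}{(w_{\omega,k} h_{k,t})^2}, \tag{28}$$
where $w_{\omega,k}$ and $h_{k,t}$ denote the elements of the concatenated matrices $W$ and $H$, the sum runs over all $K = \sum_l K_l$ bases, and $\theta_{k,\omega,t}$ is a positive weight that sums to unity: $\sum_{k} \theta_{k,\omega,t} = 1$. We can confirm that the equality of this inequality holds if and only if $\theta_{k,\omega,t} = w_{\omega,k} h_{k,t} / g_{\omega,t}$. From Eqs. (18), (20), (23), (25), and (28), we can construct a majorizer for the objective function with KL divergence as
$$f^+_{\mathrm{KL}}(W, \bar{\Theta}) = \sum_{\omega,t} \Bigg[ \sum_{l} \alpha_l \sum_{k} \left( -s^l_{\omega,t} \gamma^l_{k,\omega,t} \log w^l_{\omega,k} + \frac{\lambda^l_{\omega,t} m_{\omega,t} (w^l_{\omega,k} h^l_{k,t})^2}{2 \beta^l_{k,\omega,t}} \right) + \frac{\sum_{l} \alpha_l s^l_{\omega,t}}{\eta_{\omega,t}}\, g_{\omega,t} + \left( \sum_{l} \frac{\alpha_l m_{\omega,t}}{2 \lambda^l_{\omega,t}} \right) \sum_{k} \frac{\theta_{k,\omega,t}^3}{(w_{\omega,k} h_{k,t})^2} \Bigg] + d,$$
where $\bar{\Theta}$ denotes the set of all the auxiliary variables, $\{\lambda^l_{\omega,t}\}$, $\{\gamma^l_{k,\omega,t}\}$, $\{\eta_{\omega,t}\}$, $\{\beta^l_{k,\omega,t}\}$, and $\{\theta_{k,\omega,t}\}$, and $d$ denotes a term that does not depend on $W$.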
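By using Lemma 2, Jensen's inequality, and the tangent-line inequality for concave functions, we can also derive a majorizer for the case of the IS divergence in a similar manner. In this case, the objective function in Eq. (9) is given by
$$f_{\mathrm{IS}}(W) = \sum_{l} \alpha_l \sum_{\omega,t} \left( \frac{s^l_{\omega,t}\, g_{\omega,t}}{m_{\omega,t}\, g^l_{\omega,t}} + \log g^l_{\omega,t} - \log g_{\omega,t} \right) + d.$$
Applying Lemma 2 to $g_{\omega,t} / g^l_{\omega,t}$, Jensen's inequality to $g_{\omega,t}^2$, $1/(g^l_{\omega,t})^2$, and $-\log g_{\omega,t}$, and the tangent-line inequality to $\log g^l_{\omega,t}$ yields
$$f^+_{\mathrm{IS}}(W, \bar{\Theta}) = \sum_{\omega,t} \Bigg[ \left( \sum_{l} \frac{\alpha_l s^l_{\omega,t} \lambda^l_{\omega,t}}{2 m_{\omega,t}} \right) \sum_{k} \frac{(w_{\omega,k} h_{k,t})^2}{\beta_{k,\omega,t}} + \sum_{l} \alpha_l \sum_{k} \left( \frac{s^l_{\omega,t} (\theta^l_{k,\omega,t})^3}{2 m_{\omega,t} \lambda^l_{\omega,t} (w^l_{\omega,k} h^l_{k,t})^2} + \frac{w^l_{\omega,k} h^l_{k,t}}{\eta^l_{\omega,t}} \right) - \left( \sum_{l} \alpha_l \right) \sum_{k} \gamma_{k,\omega,t} \log w_{\omega,k} \Bigg] + d',$$
where the auxiliary variables are indexed according to the term they bound ($\beta_{k,\omega,t}$ and $\gamma_{k,\omega,t}$ run over all the bases, whereas $\theta^l_{k,\omega,t}$ and $\eta^l_{\omega,t}$ are defined per source), $w_{\omega,k}$ and $h_{k,t}$ denote the elements of the concatenated matrices $W$ and $H$, and $d$ and $d'$ denote terms that do not depend on $W$.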
These majorizers are particularly noteworthy in that they can be minimized analytically with respect to w l ω,k since they are given as the sum of the reciprocal, logarithmic, first-order, and second-order functions.
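The elementary bounds used in this derivation can be checked numerically. The sketch below assumes that Lemma 2 takes the form $a/b \le \lambda a^2/2 + 1/(2\lambda b^2)$, which is what expanding the square $(a - 1/(\lambda b))^2 \ge 0$ from its proof gives, and that the Jensen and tangent-line bounds take their standard forms. All function names are hypothetical helpers introduced only for this check.

```python
import math
import random


def lemma2_bound(a, b, lam):
    """Upper bound on a / b implied by (a - 1 / (lam * b)) ** 2 >= 0."""
    return 0.5 * lam * a * a + 0.5 / (lam * b * b)


def jensen_neg_log(xs, weights):
    """Upper bound on -log(sum(xs)) for positive weights summing to one."""
    return -sum(w * math.log(x / w) for x, w in zip(xs, weights))


def jensen_square(xs, weights):
    """Upper bound on sum(xs) ** 2 for positive weights summing to one."""
    return sum(x * x / w for x, w in zip(xs, weights))


def jensen_inv_square(xs, weights):
    """Upper bound on 1 / sum(xs) ** 2 for positive weights summing to one."""
    return sum(w ** 3 / (x * x) for x, w in zip(xs, weights))


def tangent_log(g, eta):
    """Upper bound on log(g): the tangent line of log at eta."""
    return math.log(eta) + (g - eta) / eta
```

Each bound becomes tight at the usual choice of auxiliary variable: `lam = 1 / (a * b)` for the Lemma 2 bound, weights proportional to the summands for the three Jensen bounds, and `eta = g` for the tangent line.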

C. UPDATE RULES
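We can obtain the update rules for $w^l_{\omega,k}$ by setting the partial derivatives of the above majorizers with respect to $w^l_{\omega,k}$ at zero. Multiplying the resulting stationarity condition by $(w^l_{\omega,k})^3$ shows that, for both the KL divergence case and the IS divergence case, the optimal update of $w^l_{\omega,k}$ is given by the positive solution of a quartic equation of the form
$$c_4 (w^l_{\omega,k})^4 + c_3 (w^l_{\omega,k})^3 - c_2 (w^l_{\omega,k})^2 - c_0 = 0,$$
where the non-negative coefficients $c_4$, $c_3$, $c_2$, and $c_0$ are determined by the auxiliary variables, the activations, and the spectrograms $S_l$ and $M$. It is worth noting that since each element of $W$ is isolated in a separate term in $f^+_{\mathrm{KL}}(W, \bar{\Theta})$ and $f^+_{\mathrm{IS}}(W, \bar{\Theta})$, we can update each of the elements in parallel. Thus, this algorithm is well suited to parallel implementations. Furthermore, since each of these equations consists of non-negative fourth-order and third-order terms, a negative second-order term, and a negative zeroth-order term, its coefficients change sign only once. By Descartes' rule of signs, there is therefore only one positive solution, implying that there is no need to select among multiple solutions.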
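The uniqueness of the positive solution makes the element-wise update easy to compute numerically. The following sketch assumes that the stationarity condition can be written as a quartic with non-negative fourth- and third-order coefficients, a negative second-order term, and a negative zeroth-order term; the coefficient names `c4`, `c3`, `c2`, and `c0` and the bisection solver are illustrative choices rather than the authors' implementation.

```python
def positive_root(c4, c3, c2, c0, tol=1e-12):
    """Unique positive root of c4*w**4 + c3*w**3 - c2*w**2 - c0 = 0.

    Assumes c4 > 0, c3 >= 0, c2 >= 0, and c0 > 0, so the polynomial is
    negative at w = 0, positive for large w, and changes sign exactly once.
    """
    def p(w):
        return ((c4 * w + c3) * w - c2) * w * w - c0

    lo, hi = 0.0, 1.0
    while p(hi) <= 0.0:  # grow the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol * hi:  # plain bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if p(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because the root is unique, no root-selection logic is needed, and the same routine could be applied independently to every element of the basis matrix. A closed-form quartic solution or a few Newton steps would be alternatives to bisection.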
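$f^+_{\mathrm{KL}}(W, \bar{\Theta})$ is minimized with respect to the auxiliary variables when the equalities of Eqs. (18), (20), (23), (25), and (28) hold, namely when $\lambda^l_{\omega,t} = 1/(g^l_{\omega,t}\, g_{\omega,t})$, $\gamma^l_{k,\omega,t} = \beta^l_{k,\omega,t} = w^l_{\omega,k} h^l_{k,t} / g^l_{\omega,t}$, $\eta_{\omega,t} = g_{\omega,t}$, and $\theta_{k,\omega,t} = w_{\omega,k} h_{k,t} / g_{\omega,t}$.
Let $Y$ and $\Phi$ be the power and phase spectrograms of a test mixture signal and let $\tilde{B}$ and $\tilde{W}$ be the pretrained basis matrices. The test inference algorithm for the DNMF approach consists of computing $\hat{H}$ by solving
$$\hat{H} = \operatorname*{arg\,min}_{H \ge 0} D(Y \mid \tilde{B} H) + \mu \|H\|_1,$$
computing $\mathsf{C}_1, \ldots, \mathsf{C}_L$ using
$$\mathsf{C}_l = \frac{\tilde{W}_l \hat{H}_l}{\sum_{l'} \tilde{W}_{l'} \hat{H}_{l'}} \odot \mathsf{Y},$$
and performing the inverse STFT on $\mathsf{C}_l$ combined with the phase spectrogram $\Phi$ for all $l$. Note that the test inference algorithm for the standard NMF approach corresponds to a special case where $\tilde{B} = \tilde{W}$.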

IV. EXPERIMENTAL EVALUATIONS

A. SPEECH ENHANCEMENT TASK
First, we evaluated the effect of the proposed algorithm in a speech enhancement task, namely l ∈ {s, n}. For comparison, we tested (i) the standard supervised NMF method [21] with Euclidean distance (SNMF_EU), KL divergence (SNMF_KL), and IS divergence (SNMF_IS); (ii) DNMF using the MU-based basis training algorithm [11] with KL divergence (DNMF_MU_KL) and Euclidean distance (DNMF_MU_EU); and (iii) DNMF using the proposed basis training algorithm with KL divergence (DNMF_MM_KL) and IS divergence (DNMF_MM_IS). Note that we have excluded DNMF_MU_IS from the baselines since it was not studied in [11]. Also note that the results for DNMF_MM_EU are not provided. This is because we have yet to come up with an auxiliary function with a tractable form for the Euclidean distance case.

1) DATASET AND EXPERIMENTAL SETTINGS
We constructed the training and test datasets using speech signals excerpted from the Wall Street Journal (WSJ-0) corpus [22] and noise signals excerpted from the CHiME4 background noise database [23], which includes four types of noise recorded in a bus, cafe, pedestrian area, and street, respectively. The training dataset consisted of 600 utterances, each of which was created by mixing randomly selected utterances from si_tr_s and noise signals with signal-to-noise ratios (SNRs) set at {−5, 0, 5} dB. In the same way, we also created a validation dataset consisting of 90 utterances. Each of the four test datasets consisted of 100 utterances, half of which we created using speech signals in si_tr_s and the other half using speech signals of different speakers in si_dt_05. The SNRs for three of the four test datasets were set at {−5, 0, 5} dB, and those for the remaining dataset were randomly set within [−10, 10] dB.
All the audio signals were monaural and downsampled to 16 kHz. The STFT was computed using a Hanning window that was 32 ms long with a 16-ms overlap. We used the same number of bases for speech and noise, i.e., $K_s = K_n = K$. In this task, we tested $K \in \{25, 50, 100\}$. For $K = 100$, we evaluated the effectiveness of sparse regularization in the case of a large number of bases by setting $\mu \in \{0, 0.5, 1, 5, 10\}$. SNMF_KL was run for 100 iterations. For the DNMF algorithms, SNMF_KL was used for initialization. For the separation, the Wiener filter was constructed using the trained basis and activation matrices obtained using the standard NMF run for 100 iterations.

2) CONVERGENCE BEHAVIOR AND COMPUTATIONAL COST
We compared the convergence behaviors of the proposed algorithms with those of DNMF_MU_EU and DNMF_MU_KL within the first 500 iterations. For all the algorithms, we used the same initialization and evaluated the signal-to-distortion ratio (SDR) [24] improvements. Two examples are shown in Fig. 2. As can be seen from the example when tested on bus noise with $K = 100$, DNMF_MU_EU and DNMF_MU_KL did not decrease the objective functions monotonically. This indeed shows that each update in the MU algorithms does not guarantee a decrease in the objective functions. It is also worth noting that the objective function value does not directly reflect the speech enhancement performance, as shown in the experimental results when tested on street noise with $K = 50$. According to the SDR results obtained with the validation dataset as well as the setting in [11], in the following experiments, we set the iteration number at 150 for the proposed algorithms and 25 for the MU algorithms.
We compared the computational times of all the algorithms with $K = 50$ using the training data with a length of about one hour. The algorithms were implemented using MATLAB and run on an Intel Xeon Gold 5120 @2.2GHz processor. Table 1 shows the average computational time for updating $B$ or $W$ at each iteration and that of the entire process. Note that the total time of DNMF includes the time of computing $\tilde{B}$ for initialization and $\hat{H}$. Note also that the time complexity of the proposed algorithm is $O(\Omega K T L^2)$, whereas that of the standard NMF and DNMF algorithms with multiplicative update rules is $O(\Omega K T L)$. Since $L$ was 2 in the speech enhancement task, it did not have a significant impact on the computation time. Rather, the increase in the number of iterations in the proposed algorithm led to an increase in the total computation time.

3) SPEECH ENHANCEMENT PERFORMANCE
The speech enhancement performance was numerically evaluated in terms of SDRs, signal-to-interference ratios (SIRs), and signal-to-artifact ratios (SARs) [24]. Table 2 shows the average SDRs taken over all the test data with basis number $K \in \{25, 50, 100\}$. For each noise type with different $K$, we conducted 5 trials with different initializations. The average input SDR of the test data was about 0.063 dB. As Table 2 shows, increasing the number of bases did not always lead to an improvement in speech enhancement performance. Comparing the results of the standard NMF and DNMF algorithms, we found that the latter outperformed the former. This indicates the effectiveness of the ability to learn discriminative bases. Furthermore, the proposed algorithm performed best among all the algorithms based on the same divergence measure. Table 3 shows the average SDRs, SIRs, and SARs evaluated using $K = 25$ with various input SNRs. These results were averaged over the four noise types. As the results show, DNMF_MM_KL performed best among all the algorithms in terms of the SDR and SIR. Specifically, it achieved about 1.2-dB improvements over DNMF_MU_EU and DNMF_MU_KL, and about 1.7-dB improvements over SNMF_KL. This shows that the proposed algorithm with the KL divergence criterion had a better ability to learn discriminative bases than the baseline algorithms did. However, the SARs obtained with the proposed algorithms tended to be lower than those obtained with the baseline algorithms.
We also evaluated the effectiveness of sparse regularization. The results are shown in Table 5. We found that µ = 0.5 achieved the best score for each method except for DNMF_MM_IS, where the best performance was obtained without sparse regularization. DNMF_MM_KL outperformed the other methods regardless of the sparse regularization.

B. SINGLE-CHANNEL SOURCE SEPARATION
We also evaluated the performance of the proposed algorithms in source separation tasks.

1) DATASET AND EXPERIMENTAL CONDITIONS
We excerpted five recordings from Demixing Secrets Dataset 100 (DSD100) [25], which was used in the SiSEC 2016 MUS task. Each of the recordings consisted of four sources, namely bass, drums, vocals, and other. The task was thus a four-source separation problem, namely $\alpha_l = 1$ for $l \in \{1, 2, 3, 4\}$. Each of the recordings was about four to five minutes long. We divided each recording into two segments, namely a training data segment and a test data segment.
Here, we conducted two experiments. In the first experiment, we trained the basis matrix separately using the training data segment of each recording and tested on the test data segment. In the second experiment, we trained a shared basis matrix using the collection of the training data segments of all the recordings and tested on the test data segment of each recording. As in the speech enhancement task, we used monaural audio signals and downsampled them to 16 kHz. The STFT was computed using a 256-ms long Hanning window with 1/2 window overlap. Considering the characteristics of the four sources, we set the basis number at [10, 10, 15, 15] for bass, drums, vocals, and other, respectively, for the first experiment and [20, 20, 50, 50] for the second experiment. For each experiment, we also ran five trials with random initialization and evaluated the average SDR and SIR scores. SNMF_KL was run for 100 iterations and was used as the initialization for the DNMF algorithms. In the source separation experiments, we set the number of iterations for training $\tilde{W}$ at 25. Fig. 3 shows an example of the convergence behavior of the proposed algorithms. Table 4 shows the SDR and SIR improvements [dB]. As the results show, the proposed algorithm with KL divergence outperformed SNMF_KL for most of the test data. It is interesting to note that even though in the first experiment the standard NMF was relatively advantageous as regards the training condition, DNMF_MM_KL still obtained higher SDR and SIR scores. In the speech enhancement task, we confirmed that the proposed algorithm performed slightly better than the standard NMF method under the IS divergence criterion. However, this was found not to be the case for the source separation task. This implies that the discriminative basis training and/or MM strategies were less effective for the IS divergence than for the KL divergence. The reason for this will be examined more closely in our future work.

V. CONCLUSION
DNMF is noteworthy in that it directly uses the reconstruction errors of separated signals as the training criteria, which eliminates the inconsistency between the objective functions for training and separation in the conventional NMF method and can increase the discriminative power of the trained basis. However, such training criteria cause difficulty in optimization. This paper derived a novel majorizer for the objective function of DNMF and successfully developed an MM algorithm that is guaranteed to converge to a stationary point. Experimental results showed that the proposed algorithm with the KL divergence criterion achieved significant improvements in terms of the SDR and SIR over standard NMF and DNMF using the multiplicative update algorithm.