FastMVAE: A Fast Optimization Algorithm for the Multichannel Variational Autoencoder Method

This paper proposes a fast optimization algorithm for the multichannel variational autoencoder (MVAE) method, a recently proposed powerful multichannel source separation technique. The MVAE method can achieve good source separation performance thanks to a convergence-guaranteed optimization algorithm and the idea of jointly performing multi-speaker separation and speaker identification. However, one drawback is the high computational cost of the optimization algorithm. To overcome this drawback, this paper proposes using an auxiliary classifier VAE, an information-theoretic extension of the conditional VAE (CVAE), to train the generative model of the source spectrograms and using it to efficiently update the parameters of the source spectrogram models at each iteration of the source separation algorithm. We call the proposed algorithm “FastMVAE” (or fMVAE for short). Experimental evaluations revealed that the proposed fast algorithm can achieve high source separation performance in both speaker-dependent and speaker-independent scenarios while significantly reducing the computational time compared to the original MVAE method by more than 90% on both GPU and CPU. However, there is still room for improvement of about 3 dB compared to the original MVAE method.


I. INTRODUCTION
Blind source separation (BSS), a technique for separating out individual source signals from microphone array inputs without any information about the sources or array geometry, has a wide range of applications, including hearing aids, automatic speech recognition, music editing, and music information retrieval.
In BSS, the frequency-domain approach is usually preferred since it enables a fast implementation compared with the time-domain approach. It is also notable in that it provides the flexibility of allowing us to utilize various models for the time-frequency (TF) representations of source signals. One example of this approach involves independent vector analysis (IVA) [1], [2], which achieves frequency-wise source separation and permutation alignment simultaneously by assuming the magnitudes of the frequency components originating from the same source vary coherently over time.
The associate editor coordinating the review of this manuscript and approving it for publication was Lin Wang.
Multichannel extensions of non-negative matrix factorization (NMF), e.g., multichannel NMF (MNMF) [3], [4] and independent low-rank matrix analysis (ILRMA) [5]-[7], adopt the NMF concept for source spectrogram modeling in order to make use of the spectro-temporal structure underlying each source as a clue to separation. Specifically, the power spectrogram of each source signal is approximated as the linear sum of a limited number of basis spectra scaled by time-varying amplitudes. Because ILRMA reduces to IVA when it has only one flat basis spectrum, ILRMA has more flexibility than IVA in capturing the spectro-temporal structure of each source [6]. Each of these methods is designed to solve an inverse problem of estimating source signals based on a generative model of mixture signals. In this sense, these methods are categorized as a generative approach.
Meanwhile, given the recent advances achieved by deep neural network (DNN)-based speaker separation methods, including deep clustering (DC) [8], [9] and permutation invariant training (PIT) [10], [11], a discriminative approach has recently proved powerful in monaural source separation tasks, including both speaker-dependent and -independent scenarios [12]- [15]. The general idea is to train a DNN that predicts TF masks or TF embeddings from a given mixture signal based on spectro-temporal features. When multiple microphones are available, spatial information can be utilized to improve separation performance [16]. Although these methods can achieve reasonably good separation, the TF masking process can cause unwanted distortion or musical noise in the separated speech. To avoid distortion and artificial noise and fully exploit the benefits of multichannel inputs, some efforts have been made to integrate DNNs into traditional microphone array processing frameworks such as beamforming [17], [18].
The success of these single-channel DNN-based methods attests to the excellent ability of DNNs to capture and learn the structure of spectrograms. Recently, some attempts have also been made to incorporate DNNs into the generative approach mentioned earlier [19]-[23]. As an example, a multichannel source separation method using a conditional variational autoencoder (multichannel VAE or MVAE for short) [21] has been proposed with notable success in supervised determined source separation tasks. With the MVAE method, a conditional VAE (CVAE) [24] is trained using the spectrograms of clean speech samples along with the corresponding speaker ID as a conditioning class variable. This is done so that the trained decoder distribution can be used as a generative model of signals produced by all the sources included in a given training set, where the latent space variables and the class variables are the parameters to be estimated from an input mixture signal. This generative model is called the CVAE source model. In the separation phase, the MVAE algorithm iteratively updates the demixing matrix using the iterative projection (IP) method [25] and the underlying parameters of the CVAE source model using a gradient descent method, where the gradients of the parameters are computed by backpropagation. The separated signals can then be obtained by applying the estimated demixing matrix to the observed mixture signals. One important feature of this method is that it is designed to jointly perform multi-speaker separation and speaker identification. This is particularly reasonable since these two tasks are interdependent in the sense that the solution to one task can help find the solution to the other task. Another advantage worth noting is that by using a carefully chosen step size or applying a backtracking line search, the model parameters can be updated so as not to decrease the log-likelihood at each iteration of the algorithm.
However, one downside is the high computational cost of the backpropagation process involved in each iteration.
To address this drawback, this paper proposes an accelerated version of the MVAE algorithm called the “FastMVAE” (or fMVAE) algorithm. The idea is to use an auxiliary classifier VAE (ACVAE) [26], an information-theoretic extension of a CVAE, to pretrain the generative distribution of source spectrograms. An ACVAE consists of decoder, encoder, and classifier networks. These three networks are trained simultaneously so that the decoder can be used as a generative model of spectrograms conditioned on a speaker ID, and the encoder and classifier can be used to infer the latent variables characterizing the generative model and the speaker ID from a spectrogram input. Since the backpropagation process involved in the original MVAE algorithm can be replaced with the forward propagation of the trained encoder and classifier networks, the entire algorithm can be made extremely efficient. It should be noted that this paper is an extended full-paper version of our conference paper [27]. The additional contributions in comparison to [27] are as follows:
• To stabilize the parameter inference process of the fMVAE algorithm, especially in speaker-independent conditions, we propose an improved version of the fast algorithm based on a Product-of-Experts (PoE) framework [28] and evaluate the impact of different hyperparameter settings.
• We demonstrate the capability of the MVAE and fMVAE algorithms to handle speaker-independent scenarios by sufficiently increasing the variety of speakers and the number of samples in the training dataset.
The rest of this paper is structured as follows. After describing the formulation of the determined multichannel BSS problem and the MVAE method in Section II, we review related work in Section III. In Section IV, we present the core idea of the fMVAE algorithm along with the ACVAE concept and describe the details of the proposed algorithm. We demonstrate the effectiveness of the proposed method in both speaker-dependent and speaker-independent source separation tasks in Section V. Finally, we conclude the paper in Section VI.

II. MVAE METHOD

A. PROBLEM FORMULATION
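Let us consider a determined situation where $I$ source signals are captured by $I$ microphones. Let $x_i(f,n)$ and $s_j(f,n)$ denote the short-time Fourier transform (STFT) coefficients of the signal observed at the $i$th microphone and the $j$th source signal, where $f$ and $n$ are the frequency and time indices, respectively. We denote the vectors containing $x_1(f,n),\ldots,x_I(f,n)$ and $s_1(f,n),\ldots,s_I(f,n)$ by
$$\mathbf{x}(f,n) = [x_1(f,n),\ldots,x_I(f,n)]^{\mathsf T} \in \mathbb{C}^{I}, \tag{1}$$
$$\mathbf{s}(f,n) = [s_1(f,n),\ldots,s_I(f,n)]^{\mathsf T} \in \mathbb{C}^{I}, \tag{2}$$
where $(\cdot)^{\mathsf T}$ denotes transpose. In a determined situation, the relationship between observed signals and source signals can be described as
$$\mathbf{s}(f,n) = \mathbf{W}^{\mathsf H}(f)\,\mathbf{x}(f,n), \tag{3}$$
$$\mathbf{W}(f) = [\mathbf{w}_1(f),\ldots,\mathbf{w}_I(f)] \in \mathbb{C}^{I\times I}, \tag{4}$$
where $\mathbf{W}^{\mathsf H}(f)$ is called the demixing matrix and $(\cdot)^{\mathsf H}$ denotes the Hermitian transpose. The aim of BSS methods is to estimate the demixing matrices solely from the observed mixture signals.
In the following, we assume each source signal follows the local Gaussian model (LGM) [29], [30]. Namely, $s_j(f,n)$ is assumed to independently follow a zero-mean complex proper Gaussian distribution with variance (power spectral density) $v_j(f,n)$:
$$p(s_j(f,n)) = \mathcal{N}_{\mathbb{C}}(s_j(f,n)\,|\,0, v_j(f,n)), \tag{5}$$
where $v_j(f,n) = \mathbb{E}[|s_j(f,n)|^2]$. When $s_j(f,n)$ and $s_{j'}(f,n)$ ($j \neq j'$) are independent, $\mathbf{s}(f,n)$ follows
$$p(\mathbf{s}(f,n)) = \mathcal{N}_{\mathbb{C}}(\mathbf{s}(f,n)\,|\,\mathbf{0}, \mathbf{V}(f,n)), \tag{6}$$
where $\mathbf{V}(f,n) = \mathrm{diag}[v_1(f,n),\ldots,v_I(f,n)]$. From (3) and (6), the density of $\mathbf{x}(f,n)$ is obtained as
$$p(\mathbf{x}(f,n)) = |\det \mathbf{W}^{\mathsf H}(f)|^{2}\, \mathcal{N}_{\mathbb{C}}(\mathbf{W}^{\mathsf H}(f)\mathbf{x}(f,n)\,|\,\mathbf{0}, \mathbf{V}(f,n)), \tag{7}$$
where $|\det \mathbf{W}^{\mathsf H}(f)|^{2}$ is the Jacobian of the mapping $\mathbf{x}(f,n) \mapsto \mathbf{s}(f,n)$. Hence, the log-likelihood of the separation matrices $\boldsymbol{W}$ and the source variances $\boldsymbol{V}$ given the observed mixture signals $\boldsymbol{X}$ is given by
$$\log p(\boldsymbol{X}|\boldsymbol{W},\boldsymbol{V}) \overset{c}{=} 2N\sum_{f}\log|\det \mathbf{W}^{\mathsf H}(f)| - \sum_{f,n,j}\left(\log v_j(f,n) + \frac{|\mathbf{w}_j^{\mathsf H}(f)\mathbf{x}(f,n)|^{2}}{v_j(f,n)}\right), \tag{8}$$
where we have used $\overset{c}{=}$ to denote equality up to constant terms, and a bold italic font to indicate a set consisting of TF elements, namely, $\boldsymbol{W} = \{\mathbf{W}(f)\}_{f}$, $\boldsymbol{V} = \{v_j(f,n)\}_{f,n,j}$, and $\boldsymbol{X} = \{\mathbf{x}(f,n)\}_{f,n}$. Note that (8) will be split into frequency-wise source separation problems if there is no additional constraint imposed on $v_j(f,n)$. This indicates that there is a permutation ambiguity in the separated components for each frequency. Thus, we usually need to group together the separated components of different frequency bins that originate from the same source after or during source separation. This process is called permutation alignment.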

B. CVAE MODEL
One efficient way to eliminate the permutation ambiguity is to incorporate a constraint into v j (f , n) so that the spectral structures of sources can be utilized as a clue to the estimation of W. The idea of the MVAE method is to use a CVAE [24] conditioned on a class variable c to model the complex spectrograms S = {s(f , n)} f ,n of source signals. Here, c is a one-hot vector consisting of C elements, indicating to which class the spectrogram S belongs. For example, if we consider speaker IDs as the class category, each element of c will be associated with a different speaker, and c will be filled with 1 at the index of a certain speaker and with 0 everywhere else. A VAE is a stochastic neural network model consisting of an encoder and decoder, and a CVAE is an extended version that allows the encoder and decoder to include a conditioning class variable. In a CVAE, the decoder is modeled as a neural network (decoder network) that produces a set of parameters for a conditional distribution p θ (S|z, c) of data S given a latent space variable z and a class variable c, where θ denotes the network parameters. Figure 1 shows an illustration of CVAE. Computing the exact posterior p θ (z|S, c) = p θ (S|z, c)p(z)/p θ (S|c) of z given S and c is usually very difficult since p θ (S|c) involves an intractable integral over z. The idea of CVAEs is to sidestep the direct computation of this posterior by introducing another neural network (encoder network) for approximating the exact posterior p θ (z|S, c). As with the decoder network, the encoder network generates a set of parameters for the conditional distribution q φ (z|S, c), where φ denotes the network parameters. The goal is to learn the parameters of the encoder and decoder networks so that the encoder distribution q φ (z|S, c) becomes consistent with the posterior p θ (z|S, c) ∝ p θ (S|z, c)p(z). 
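Specifically, given $M$ class-labeled training samples $\{S_m, c_m\}_{m=1}^{M}$, we train the encoder and decoder networks so that $\mathrm{KL}[q_\phi(z|S,c)\,\|\,p_\theta(z|S,c)]$ is minimized. Since $\log p_\theta(S|c)$ equals the sum of this KL divergence and the variational lower bound, this process amounts to maximizing the training criterion
$$J(\phi,\theta) = \mathbb{E}_{(S,c)\sim p_D(S,c)}\Big[\mathbb{E}_{z\sim q_\phi(z|S,c)}[\log p_\theta(S|z,c)] - \mathrm{KL}[q_\phi(z|S,c)\,\|\,p(z)]\Big], \tag{9}$$
where $\mathbb{E}_{(S,c)\sim p_D(S,c)}[\cdot]$ can be approximated as the sample mean over $\{S_m, c_m\}_{m=1}^{M}$, and $\mathrm{KL}[\cdot\|\cdot]$ denotes the Kullback-Leibler divergence. While $p_D(S,c)$ can be approximated as the empirical distribution of $\{S_m, c_m\}_{m=1}^{M}$, $q_\phi(z|S,c)$, $p_\theta(S|z,c)$, and $p(z)$ are distributions that need to be modeled.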
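In the MVAE method, a CVAE is used to model the entire complex spectrogram $S$ of an utterance, conditioned on a speaker ID vector $c$. $p(z)$ and $q_\phi(z|S,c)$ are described as Gaussian distributions as with a regular CVAE:
$$p(z) = \mathcal{N}(z\,|\,\mathbf{0}, \mathbf{I}), \tag{10}$$
$$q_\phi(z|S,c) = \mathcal{N}(z\,|\,\mu_\phi(S,c), \mathrm{diag}(\sigma^2_\phi(S,c))), \tag{11}$$
where $\mu_\phi(S,c)$ and $\sigma^2_\phi(S,c)$ are the outputs of the encoder network. The decoder distribution is designed as a zero-mean complex proper Gaussian distribution so that it has the same form as the LGM in (5):
$$p_\theta(S|z,c) = \prod_{f,n}\mathcal{N}_{\mathbb{C}}(s(f,n)\,|\,0, \sigma^2_\theta(f,n;z,c)), \tag{12}$$
where $\sigma^2_\theta(f,n;z,c)$ denotes the $(f,n)$th element of the decoder output. Once the parameters $\theta$ and $\phi$ of the decoder and encoder are trained using speaker-labeled training utterances, the decoder with fixed $\theta$ can be used as a generative model of spectrograms for each speaker at test time.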
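Normalizing the mean and variance of each training sample is a common practice in neural network training. Similarly, in the CVAE training in the MVAE method, the total energy of each training utterance is normalized to 1. However, the total energy of the spectrogram of each source in a test mixture can vary from source to source and does not necessarily equal 1. So that the generative model can flexibly bridge this gap, a scale parameter $g$ is additionally incorporated into (12) and treated as a free parameter to be estimated at test time. Namely, the generative model of the complex spectrograms $S_j$ of utterances of speaker $j$ can be expressed as
$$p_\theta(S_j|z_j,c_j,g_j) = \prod_{f,n}\mathcal{N}_{\mathbb{C}}(s_j(f,n)\,|\,0, v_j(f,n)), \tag{13}$$
where
$$v_j(f,n) = g_j\,\sigma^2_\theta(f,n;z_j,c_j),$$
and $z_j$, $c_j$, and $g_j$ are the unknown parameters to be estimated. (13) is called the CVAE source model. We can immediately confirm that the decoder distribution in (12) corresponds to a particular case of (13) where $g_j = 1$. Since the CVAE source model is given in the same form as the LGM in (5), where $v_j(f,n)$ is given by $g_j\sigma^2_\theta(f,n;z_j,c_j)$, using it as the generative model of each source leads to the same form of the log-likelihood as (8):
$$\log p(\boldsymbol{X}|\boldsymbol{W},\boldsymbol{\Psi},\boldsymbol{G}) \overset{c}{=} 2N\sum_{f}\log|\det \mathbf{W}^{\mathsf H}(f)| - \sum_{f,n,j}\left(\log\big(g_j\sigma^2_\theta(f,n;z_j,c_j)\big) + \frac{|\mathbf{w}_j^{\mathsf H}(f)\mathbf{x}(f,n)|^{2}}{g_j\sigma^2_\theta(f,n;z_j,c_j)}\right),$$
where $\boldsymbol{\Psi} = \{z_j, c_j\}_{j}$ and $\boldsymbol{G} = \{g_j\}_{j}$. Since $z$ is assumed to follow $\mathcal{N}(z|\mathbf{0},\mathbf{I})$ when $\theta$ and $\phi$ are trained, it is reasonable to use this as the prior distribution for $z$ at test time as well. The prior $p(c)$ is the empirical distribution of the training examples $\{c_m\}_m$, expressed as a multinomial distribution. Thus, the log-posterior
$$\log p(\boldsymbol{W},\boldsymbol{\Psi},\boldsymbol{G}|\boldsymbol{X}) \overset{c}{=} \log p(\boldsymbol{X}|\boldsymbol{W},\boldsymbol{\Psi},\boldsymbol{G}) + \sum_{j}\log p(z_j) + \sum_{j}\log p(c_j) \tag{16}$$
is the objective function to be maximized with respect to $\boldsymbol{W}$, $\boldsymbol{\Psi}$, and $\boldsymbol{G}$. A stationary point of (16) can be found by iteratively updating these parameters so that (16) is guaranteed to be non-decreasing. To update $\boldsymbol{W}$, the following update rules, called the iterative projection (IP) [25], can be used:
$$\mathbf{w}_j(f) \leftarrow \big(\mathbf{W}^{\mathsf H}(f)\,\boldsymbol{\Sigma}_j(f)\big)^{-1}\mathbf{e}_j, \tag{17}$$
$$\mathbf{w}_j(f) \leftarrow \frac{\mathbf{w}_j(f)}{\sqrt{\mathbf{w}_j^{\mathsf H}(f)\,\boldsymbol{\Sigma}_j(f)\,\mathbf{w}_j(f)}}, \tag{18}$$
where $\boldsymbol{\Sigma}_j(f) = \frac{1}{N}\sum_{n}\mathbf{x}(f,n)\mathbf{x}^{\mathsf H}(f,n)/v_j(f,n)$ and $\mathbf{e}_j$ denotes the $j$th column of an $I \times I$ identity matrix. To update $\boldsymbol{G}$, the following update rule can be used:
$$g_j \leftarrow \frac{1}{FN}\sum_{f,n}\frac{|y_j(f,n)|^{2}}{\sigma^2_\theta(f,n;z_j,c_j)}, \tag{19}$$
where $y_j(f,n) = \mathbf{w}_j^{\mathsf H}(f)\mathbf{x}(f,n)$, and $F$ and $N$ are the numbers of frequency bins and time frames. Note that (19) maximizes (16) with respect to $g_j$ when $\boldsymbol{W}$ and $\boldsymbol{\Psi}$ are fixed.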
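The IP update in (17) and (18) can be illustrated with a minimal sketch for the two-channel case at a single frequency bin. This is not the authors' implementation: the function names `weighted_cov`, `quad`, and `ip_update`, the list-of-lists matrix layout, and the restriction to $I = 2$ are assumptions made only to keep the example dependency-free.

```python
import math

def weighted_cov(X, v):
    """Sigma_j = (1/N) * sum_n x(n) x(n)^H / v_j(n) for one frequency bin.

    X: list of N observation vectors [x_1, x_2] (complex numbers).
    v: list of N positive variances v_j(f, n) of source j.
    """
    N = len(X)
    S = [[0j, 0j], [0j, 0j]]
    for x, vn in zip(X, v):
        for r in range(2):
            for c in range(2):
                S[r][c] += x[r] * x[c].conjugate() / (vn * N)
    return S

def quad(a, S, b):
    """Sesquilinear form a^H S b."""
    return sum(a[r].conjugate() * S[r][c] * b[c]
               for r in range(2) for c in range(2))

def ip_update(W, S, j):
    """Update column j of the 2x2 matrix W = [w_1, w_2] via (17) and (18)."""
    # A = W^H Sigma_j
    WH = [[W[c][r].conjugate() for c in range(2)] for r in range(2)]
    A = [[sum(WH[r][k] * S[k][c] for k in range(2)) for c in range(2)]
         for r in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    # (17): w_j = A^{-1} e_j, i.e. the jth column of the 2x2 inverse
    if j == 0:
        w = [A[1][1] / det, -A[1][0] / det]
    else:
        w = [-A[0][1] / det, A[0][0] / det]
    # (18): normalize so that w_j^H Sigma_j w_j = 1
    scale = math.sqrt(quad(w, S, w).real)
    w = [wr / scale for wr in w]
    W_new = [row[:] for row in W]
    for r in range(2):
        W_new[r][j] = w[r]
    return W_new
```

After the update, the new $\mathbf{w}_j$ has unit energy under $\boldsymbol{\Sigma}_j$ and is $\boldsymbol{\Sigma}_j$-orthogonal to the other column, which is the property that makes each IP step a closed-form coordinate update. A practical implementation would batch this over all frequency bins with a linear algebra library.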
While keeping $\boldsymbol{W}$ and $\boldsymbol{G}$ fixed, a gradient descent method can be used to search for the optimal $z_j$ and $c_j$ that maximize (16), or equivalently $\log p_\theta(S_j|z_j,c_j,g_j) + \log p(z_j) + \log p(c_j)$, where $S_j = \{\mathbf{w}_j^{\mathsf H}(f)\mathbf{x}(f,n)\}_{f,n}$ is the current estimate of the $j$th source spectrogram. Note that estimating $c_j$ from a test mixture corresponds to identifying which speaker is present in the mixture signal. When updating $c_j$, the sum-to-one constraint must be taken into account. This is easily implemented by inserting an appropriately designed softmax layer that outputs $c_j$,
$$c_j = \mathrm{softmax}(u_j),$$
and treating $u_j$ as the parameter to be estimated instead.
The source separation algorithm of the MVAE method is summarized in Algorithm 1. This algorithm is noteworthy in that, if it is implemented appropriately, the log-likelihood of the model parameters is guaranteed to be non-decreasing at each iteration.
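The closed-form scale update in (19) can be checked numerically with a short sketch. The helper names `update_scale` and `neg_log_lik`, and the nested-list layout (frequency bins by frames), are illustrative assumptions rather than part of the original method description.

```python
import math

def update_scale(y, sigma2):
    """Scale update (19): g_j = mean over (f, n) of |y_j(f,n)|^2 / sigma2(f,n).

    y:      separated STFT coefficients y_j(f, n), nested list [F][N].
    sigma2: decoder output sigma_theta^2(f, n; z_j, c_j), nested list [F][N].
    """
    total, count = 0.0, 0
    for y_row, s_row in zip(y, sigma2):
        for y_fn, s_fn in zip(y_row, s_row):
            total += abs(y_fn) ** 2 / s_fn
            count += 1
    return total / count

def neg_log_lik(y, sigma2, g):
    """Terms of the negative log-likelihood that depend on g_j."""
    return sum(math.log(g * s_fn) + abs(y_fn) ** 2 / (g * s_fn)
               for y_row, s_row in zip(y, sigma2)
               for y_fn, s_fn in zip(y_row, s_row))
```

For example, with `y = [[2, 2j]]` and `sigma2 = [[1.0, 2.0]]`, the update gives `(4/1 + 4/2) / 2 = 3.0`, and perturbing this value in either direction increases `neg_log_lik`.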

III. RELATED WORK

A. ILRMA
Another reasonable way of constraining power spectrograms involves employing the NMF model [31]. The NMF model expresses the power spectrogram $v_j(f,n)$ of each source as the linear sum of $T_j$ basis spectra $b_{j,t}(f)$ scaled by time-varying activations $h_{j,t}(n)$:
$$v_j(f,n) = \sum_{t=1}^{T_j} b_{j,t}(f)\,h_{j,t}(n).$$
Note that a particular case where $T_j = 1$ and $b_{j,t}(f) = 1$ for all $j$ is equivalent to assuming the norm $r_j(n) = \sqrt{\sum_f |s_j(f,n)|^2}$ follows a complex Gaussian distribution with time-varying variance $h_j(n)$. This is analogous to the assumption in IVA that the magnitudes of the STFT coefficients in all frequency bands originating from the same source tend to vary coherently over time [32].
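The optimization algorithm of ILRMA consists of iteratively updating the demixing matrices $\boldsymbol{W}$ using the IP method, the basis templates $\boldsymbol{B} = \{b_{j,t}(f)\}_{f,j,t}$, and the activation matrix $\boldsymbol{H} = \{h_{j,t}(n)\}_{n,j,t}$ so that (8) is guaranteed to be non-decreasing at each iteration. To update $\boldsymbol{B}$ and $\boldsymbol{H}$, we can use the majorization-minimization (MM) algorithm [33]. The MM-based update rules can be derived as
$$b_{j,t}(f) \leftarrow b_{j,t}(f)\sqrt{\frac{\sum_{n}|y_j(f,n)|^{2}\,h_{j,t}(n)\,v_j(f,n)^{-2}}{\sum_{n}h_{j,t}(n)\,v_j(f,n)^{-1}}},$$
$$h_{j,t}(n) \leftarrow h_{j,t}(n)\sqrt{\frac{\sum_{f}|y_j(f,n)|^{2}\,b_{j,t}(f)\,v_j(f,n)^{-2}}{\sum_{f}b_{j,t}(f)\,v_j(f,n)^{-1}}},$$
where $y_j(f,n) = \mathbf{w}_j^{\mathsf H}(f)\mathbf{x}(f,n)$ and $v_j(f,n) = \sum_{t} b_{j,t}(f)h_{j,t}(n)$.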

B. DNN-BASED METHODS
Some attempts have recently been made to incorporate DNNs into the LGM-based multichannel source separation framework [19], [20]. With these methods, v j (f , n) is updated at each iteration as the output of pretrained DNNs. Independent deeply low-rank matrix analysis (IDLMA) [20] is a method designed to train a DNN for each source so that the jth DNN produces spectra related to source j when noisy spectra of the jth source are given as the input. Thus, each DNN can be seen as a source-dependent noise reduction system. One drawback of IDLMA is that updating v j (f , n) in this way does not guarantee an increase in the log-likelihood. Another drawback would be that it can perform poorly in speaker-independent scenarios.

C. VAE-BASED METHODS
Recently, deep generative models such as VAEs and generative adversarial networks (GANs) have proved powerful in source separation tasks [22], [23], [27], [34]- [40]. The idea of using a VAE to model the spectrum within each short-term frame was first proposed for single-channel speech enhancement [22]. This method, called VAE-NMF, enables speech enhancement in a semi-supervised manner by using a VAE to model the spectrogram of a target speaker and an NMF model to express unseen noise spectrograms. In this method, the Metropolis algorithm is used to iteratively update the latent space variable z. An extension of this model was subsequently developed, which incorporates a loudness gain for robust speech modeling and adopts a noise model based on alpha-stable distributions [23], [36]. The Monte Carlo expectation-maximization algorithms were used for estimating the model parameters.
To the best of our knowledge, the idea of incorporating the VAE concept into the multichannel framework was first introduced in a preprint article [41] and later published as a journal paper [21]. Unlike the above VAE-NMF methods, this method, namely the MVAE method, uses a CVAE with a fully convolutional architecture to model the entire spectrogram of an utterance of each source. While the original MVAE method was designed to deal with determined anechoic mixtures only, its modified versions have subsequently been proposed to handle underdetermined scenarios [34] and highly reverberant conditions [39]. Like the original version, these two versions use gradient descent (backpropagation) to update the source model parameters. Extensions of the VAE-NMF methods to multichannel inputs were later developed [35], [37], [42] for application to multichannel speech enhancement tasks. In these methods, the Markov chain Monte Carlo (MCMC) methods [43], such as Gibbs sampling and the Metropolis algorithm, are used to iteratively update the latent space variable as with the original VAE-NMF methods.
Although these methods have been shown to perform impressively compared with conventional NMF-based methods, the use of sampling and backpropagation to update latent space variables can be computationally expensive. To reduce the computational cost, we previously proposed to exploit the pretrained encoder of a CVAE as an approximate posterior estimator to infer the latent space variable z in [27]. With the same motivation, a fast algorithm for estimating the parameters of the VAE-NMF model was later derived based on the Bayesian inference in [38] for single-channel speech enhancement.

IV. FMVAE ALGORITHM

A. IDEA
In this section, we describe the idea of the proposed fast optimization algorithm for the MVAE method. Since the process of updating the parameters of the CVAE source model is more computationally costly than that of updating the other parameters, our main focus is on how to accelerate this process. When $\boldsymbol{W}$ is fixed, each element of $S_j$ is fixed at $s_j(f,n) = \mathbf{w}_j^{\mathsf H}(f)\mathbf{x}(f,n)$. Now, since the terms that depend on $z_j$ and $c_j$ in (16) are given as
$$\log p_\theta(S_j|z_j,c_j,g_j) + \log p(z_j) + \log p(c_j) \overset{c}{=} \log p(z_j,c_j|S_j,g_j), \tag{22}$$
we would like to find $z_j$ and $c_j$ that maximize the posterior $p(z_j,c_j|S_j,g_j)$ after updating $\boldsymbol{W}$. This posterior can be factorized as $p(z_j,c_j|S_j,g_j) = p(z_j|S_j,c_j,g_j)\,p(c_j|S_j,g_j)$. Here, we notice that the first factor, $p(z_j|S_j,c_j,g_j)$, resembles the encoder (or inference) distribution in the CVAE in (11), with the difference being that it is also conditioned on the scale parameter $g_j$. Since the total energy of each training utterance is assumed to be normalized to 1 in the CVAE training as mentioned earlier, $g_j$ can be thought of as a parameter that plays the role of normalizing the total energy of an unnormalized input $S_j$ to 1 at test time so that the scale of the encoder input is consistent with the training utterances. Specifically, the encoder distribution that allows for unnormalized inputs is implicitly assumed to be given as
$$q_\phi(z|S,c,g) = \mathcal{N}(z\,|\,\mu_\phi(S/g,c), \mathrm{diag}(\sigma^2_\phi(S/g,c))),$$
which reduces to (11) when $g = 1$. Thus, we can use the trained encoder $q_\phi(z_j|S_j,c_j,g_j)$ as an approximation of the first factor of the posterior $p(z_j,c_j|S_j,g_j)$. This means that if we could obtain the true distribution $p(c_j|S_j,g_j)$ or its approximate distribution $r(c_j|S_j,g_j)$, we would be able to find an approximation of the maximum point of the posterior $p(z_j,c_j|S_j,g_j)$ by finding the maximum point of the corresponding approximate distribution.
In this section, we review the concept of an auxiliary classifier VAE (ACVAE), present how this concept can be used to obtain r(c j |S j , g j ), and introduce the details of the proposed optimization algorithm.

B. AUXILIARY CLASSIFIER VAE
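An auxiliary classifier VAE (ACVAE) [26] is a CVAE variant that incorporates an information-theoretic regularization [44]. The regularization assists in making the decoder outputs as correlated as possible with the class variable $c$ by maximizing the mutual information between $c$ and an output $S \sim p_\theta(S|z,c)$ from the decoder, conditioned on $z$. The mutual information is expressed as
$$\begin{aligned} I(c,S|z) &= \mathbb{E}_{c\sim p(c),\,S\sim p_\theta(S|z,c),\,c'\sim p(c|S)}[\log p(c'|S)] + H(c) \\ &= \mathbb{E}_{S\sim p(S|z)}\Big[\mathrm{KL}[p(\cdot|S)\,\|\,r(\cdot|S)] + \mathbb{E}_{c'\sim p(c|S)}[\log r(c'|S)]\Big] + H(c) \\ &\ge \mathbb{E}_{S\sim p(S|z)}\Big[\mathbb{E}_{c'\sim p(c|S)}[\log r(c'|S)]\Big] + H(c) \\ &= \mathbb{E}_{c\sim p(c),\,S\sim p_\theta(S|z,c)}[\log r(c|S)] + H(c), \end{aligned} \tag{25}$$
where $H(c)$ denotes the entropy of $c$, which is a constant, and the equality holds if and only if $r(c|S) = p(c|S)$. This technique of lower bounding mutual information is known as variational information maximization [45]. The last line of (25) follows the lemma presented in [44]. Therefore, we can indirectly maximize $I(c,S|z)$ by increasing the lower bound with respect to $p_\theta(S|z,c)$ and $r(c|S)$. One way to achieve this involves expressing the variational distribution $r(c|S)$ as a neural network and training it along with $q_\phi(z|S,c)$ and $p_\theta(S|z,c)$. Specifically, $r(c|S)$ can be expressed as a multinomial distribution
$$r_\psi(c|S) = \mathrm{Mult}(c\,|\,\rho_\psi(S)). \tag{26}$$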
Here, $\mathrm{Mult}(c|\rho) \propto \prod_{i}\rho_i^{c_i}$ denotes a multinomial distribution, where $c = [c_1,\ldots,c_C]^{\mathsf T}$ and $\rho = [\rho_1,\ldots,\rho_C]^{\mathsf T}$. $\rho_\psi(S)$ denotes a neural network that takes $S$ as an input and produces a probability vector consisting of $C$ elements. (26) is called an auxiliary classifier.
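Therefore, the regularization term that we would like to maximize over the training samples with respect to $\phi$, $\theta$, and $\psi$ is given by
$$\mathcal{L}(\phi,\theta,\psi) = \mathbb{E}_{(S,c)\sim p_D(S,c),\,z\sim q_\phi(z|S,c)}\Big[\mathbb{E}_{c'\sim p(c),\,S'\sim p_\theta(S|z,c')}[\log r_\psi(c'|S')]\Big], \tag{27}$$
where $r_\psi(c|S)$ must satisfy the sum-to-one constraint. With the regularization term (27), the auxiliary classifier is trained using only the reconstructed spectrograms. Since we can also use the spectrograms of real speech to train the auxiliary classifier, we can further use the (negative) cross-entropy
$$\mathcal{I}(\psi) = \mathbb{E}_{(S,c)\sim p_D(S,c)}[\log r_\psi(c|S)] \tag{28}$$
as the training criterion. The entire training criterion to be maximized is thus given by
$$J(\phi,\theta) + \lambda_{\mathcal{L}}\,\mathcal{L}(\phi,\theta,\psi) + \lambda_{\mathcal{I}}\,\mathcal{I}(\psi), \tag{29}$$
where $J(\phi,\theta)$ is the CVAE training criterion in (9), and $\lambda_{\mathcal{L}} \ge 0$ and $\lambda_{\mathcal{I}} \ge 0$ are the parameters weighing the importance of the regularization terms. Figure 2 shows an illustration of ACVAE.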

C. FAST ALGORITHM
As mentioned above, the auxiliary classifier distribution r ψ (c|S) trained using {S m , c m } M m=1 is expected to be a good approximation of the conditional distribution p(c|S). Now, in the same way that we considered the encoder that flexibly allows for an unnormalized input, here we also consider an auxiliary classifier r ψ (c|S, g) that incorporates the global scale parameter g such that r ψ (c|S, g) = Mult(c|ρ ψ (S/g)).
Using the trained auxiliary classifier and encoder, we can obtain an approximation $p(z_j,c_j|S_j,g_j) \approx r_\psi(c_j|S_j,g_j)\,q_\phi(z_j|S_j,c_j,g_j)$. Since the maximum points of $r_\psi(c_j|S_j,g_j)$ and $q_\phi(z_j|S_j,c_j,g_j)$ can be found immediately, we can use these approximate distributions to find an approximate solution to $(z_j,c_j) = \mathrm{argmax}_{z_j,c_j}\,p(z_j,c_j|S_j,g_j)$ instead of the gradient descent update for increasing $\log p_\theta(S_j|z_j,c_j,g_j) + \log p(z_j) + \log p(c_j)$. Figure 3 shows the flowchart of the proposed algorithm for the $I = 2$ case. The algorithm is summarized in Algorithm 2. The main difference between the new algorithm and the original version is that the optimal $z_j$ and $c_j$ are estimated using the forward propagations of the two pretrained networks instead of using gradient descent updates. Specifically, $z_j$ is given as the mean of the encoder distribution $\mu_\phi(S_j/g_j,c_j)$. There are two possible ways to update the class variable $c_j$. One is to directly use the probability vector produced by the auxiliary classifier network
$$c_j \leftarrow \rho_\psi(S_j/g_j). \tag{31}$$
We hereafter refer to the proposed algorithm using this update rule as fMVAE_c. The other is to use the one-hot vector closest to the output of the auxiliary classifier
$$[c_j]_k \leftarrow \begin{cases} 1 & \text{if } k = \mathrm{argmax}_{k'}\,[\rho_\psi(S_j/g_j)]_{k'}, \\ 0 & \text{otherwise}, \end{cases} \tag{32}$$
where $[\cdot]_k$ is used to denote the $k$th element of a vector. We hereafter refer to the algorithm using this update rule as fMVAE_o. Here, the subscripts are the first letters of “continuous” and “one-hot”, respectively. $r_\psi(c_j|S_j,g_j)$ can be seen as a speaker recognizer trained with explicit supervision. Hence, the proposed algorithm is expected to perform better than the original version in terms of speaker identification accuracy. However, one downside is that it does not guarantee a non-decrease in the objective function because of the approximation $p(z_j,c_j|S_j,g_j) \approx r_\psi(c_j|S_j,g_j)\,q_\phi(z_j|S_j,c_j,g_j)$.
How this actually affects source separation performance will be discussed later.
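The two class-variable updates, (31) for fMVAE_c and (32) for fMVAE_o, differ only in how the classifier output is post-processed. A minimal sketch follows, assuming the auxiliary classifier output $\rho_\psi(S_j/g_j)$ is already available as a list of probabilities; the function names are hypothetical.

```python
def update_class_continuous(rho):
    """fMVAE_c, update (31): use the classifier probabilities directly."""
    return list(rho)

def update_class_onehot(rho):
    """fMVAE_o, update (32): one-hot vector at the most probable class."""
    k = max(range(len(rho)), key=lambda i: rho[i])
    return [1.0 if i == k else 0.0 for i in range(len(rho))]
```

With `rho = [0.1, 0.7, 0.2]`, the continuous rule keeps the vector unchanged, whereas the one-hot rule returns `[0.0, 1.0, 0.0]`. The one-hot variant matches the form of the class labels seen by the decoder during training, while the continuous variant retains the classifier's uncertainty.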

D. PRIOR-WEIGHTED INFERENCE
The encoder network is trained so that $q_\phi(z|S,c)$ becomes as close as possible to $p(z) = \mathcal{N}(z|\mathbf{0},\mathbf{I})$. However, through preliminary experiments, we found that at test time the trained encoder occasionally produced outliers that significantly deviated from the assumed distribution $\mathcal{N}(z|\mathbf{0},\mathbf{I})$. This may be because the encoder did not generalize very well due to the limited amount of training data or the mismatch between the training and test conditions. Since the decoder network was trained under the assumption that its input follows $\mathcal{N}(z|\mathbf{0},\mathbf{I})$, these outliers tended to negatively affect the resulting decoder outputs and eventually the estimate of $\boldsymbol{W}$. One heuristic way to address this problem is to reapply the prior distribution $p(z)$ during inference. In the following, we omit the source index $j$ in this subsection for simplicity of notation.

Algorithm 2 fMVAE Algorithm
Require: Network parameters $\theta$, $\phi$, $\psi$ trained using (29), observed mixture signal $\mathbf{x}(f,n)$, iteration number $L$, weight parameter $\alpha$
1: randomly initialize $\boldsymbol{W}$, $\boldsymbol{\Psi}$
2: optional: update $\boldsymbol{W}$ using a BSS method
3: for $\ell = 1$ to $L$ do
4:   for each source j of J do
5:     (updating source model parameters)
7:     initialize $g_j$ using (19)
8:     update $c_j$ using (31) or (32)
10:     update $z_j$ using (36)
11:     compute $\sigma^2_j(f,n; z_j, c_j, g_j = 1, \theta)$
12:     update $g_j$ using (19)
13:     compute $v_j(f,n) = g_j \cdot \sigma^2_j(f,n; z_j, c_j, g_j = 1, \theta)$
14:     (updating demixing matrices)
15:     update $\mathbf{w}_j(f)$ by IP method with (17), (18)
16:   end for
17: end for

As a way of reapplying the prior, we adopt the concept of product-of-experts (PoE) [28] and define $\hat{z}$ as
$$\hat{z} = \mathrm{argmax}_{z}\; q_\phi(z|S,c,g)\,p(z)^{\alpha}, \tag{34}$$
where $\alpha$ weighs the importance of the prior in the inference. Since both $q_\phi(z|S,c,g)$ and $p(z)$ are multivariate Gaussian distributions, (34) can be expressed as
$$q_\phi(z|S,c,g)\,p(z)^{\alpha} \propto \mathcal{N}(z\,|\,\mu, \boldsymbol{\Lambda}^{-1}), \tag{35}$$
where $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}_\phi^{-1} + \alpha\mathbf{I}$, $\boldsymbol{\Sigma}_\phi = \mathrm{diag}(\sigma^2_\phi(S/g,c))$, and $\mu = \boldsymbol{\Lambda}^{-1}\boldsymbol{\Sigma}_\phi^{-1}\mu_\phi(S/g,c)$. Therefore, the update rule for $z$ can be easily derived as
$$z \leftarrow (\boldsymbol{\Sigma}_\phi^{-1} + \alpha\mathbf{I})^{-1}\boldsymbol{\Sigma}_\phi^{-1}\mu_\phi(S/g,c). \tag{36}$$
Note that (36) reduces to the mean of the encoder distribution when $\alpha = 0$.
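Because the encoder covariance is diagonal, the PoE update (36) decouples across the latent dimensions: each element of the encoder mean is divided by $1 + \alpha\sigma^2_\phi$. The following minimal sketch illustrates this element-wise form; the function name `poe_update` and the plain-list inputs are assumptions for illustration.

```python
def poe_update(mu_phi, sigma2_phi, alpha):
    """Prior-weighted update of z for a diagonal Gaussian encoder.

    Element-wise: z = (1/sigma2 + alpha)^(-1) * (1/sigma2) * mu
                    = mu / (1 + alpha * sigma2).

    mu_phi:     encoder mean, list of floats.
    sigma2_phi: encoder variance (diagonal), list of positive floats.
    alpha:      non-negative weight of the prior N(z | 0, I).
    """
    return [m / (1.0 + alpha * s2) for m, s2 in zip(mu_phi, sigma2_phi)]
```

For instance, `poe_update([2.0, -4.0], [1.0, 3.0], 1.0)` returns `[1.0, -1.0]`: the second element, which has the larger encoder variance, is pulled more strongly toward the prior mean of zero, whereas `alpha = 0` leaves the encoder mean unchanged.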

E. POTENTIAL ADVANTAGE OF CVAE OVER REGULAR VAE IN TERMS OF SOURCE MODELING
Although the MVAE method uses a CVAE for source spectral modeling, one can also think of using a regular (unconditional) VAE, as in the VAE-NMF framework. In this case, all the factors of variations in speech spectra, including the speaker identity factor, will be encoded into the latent variables. However, this can lead to an overparametrized representation since even though the speaker identity factor should be considered time-invariant (unlike phoneme- and F0-related factors), the latent variables are allowed to vary over time. Hence, when estimating the latent variable sequence of each source in a given mixture, we would want to separate out only the speaker identity factor from the latent variable sequence and force it to be time-invariant so as not to allow it to change during the utterance. This is the motivation behind the idea of using a CVAE instead of a regular VAE. A quantitative comparison between these choices is provided in Subsection V-E.

V. EXPERIMENTAL EVALUATIONS
To evaluate the effectiveness of the proposed method, we conducted several multi-speaker source separation experiments in which we considered speaker-dependent and speaker-independent separation tasks. Specifically, the speaker-dependent and speaker-independent conditions indicate whether the test speaker is seen in the training dataset. It should be noted that even in the speaker-dependent condition, the training and test sets are disjoint at the sentence level. In this section, we first provide the details of the baseline algorithms in Subsection V-A and the network architectures used in the baseline and proposed methods in Subsection V-B. We then show how the dataset was created and present the experimental results obtained under the speaker-dependent condition in Subsections V-C -V-F. In Subsection V-G, we describe the large-scale dataset designed for the speaker-independent task and show the experimental results.

A. BASELINE METHODS FOR COMPARISON
We chose ILRMA [6], IDLMA [46], and the original MVAE method [21] as the baseline methods for comparison. We tested several different versions of the proposed and baseline methods. We use the terms “supervised/unsupervised” and “informed/uninformed” to properly categorize each version of the methods. The terms “supervised” and “unsupervised” indicate whether a method requires training examples of source signals prior to source separation, while the terms “informed” and “uninformed” indicate whether a method is informed about which sources are present in a test mixture signal. Categorization of each version is summarized in Table 1.
We set the number of bases to T_j = 10 for u.u.ILRMA and randomly initialized the basis spectra and activation matrices. For supervised ILRMA, basis spectra with T = 10 were pretrained for each speaker in the training dataset using the NMF algorithm. They were then concatenated and used as a unified model to represent all the sources in s.u.ILRMA, whereas only the basis spectra corresponding to the specific speakers present in a mixture signal were provided to the method in s.i.ILRMA.

B. NETWORK ARCHITECTURES
Figure 4 depicts the details of the network architectures employed in the MVAE and fMVAE methods. We used the same network architectures to train the CVAE and ACVAE. All the networks were designed to be fully convolutional so that they can handle input spectrograms of signals with arbitrary lengths. We used one-dimensional gated convolutional neural networks (CNNs) [47] to model spectrograms, which allows the networks to capture time dependencies in spectral sequences. Gated CNNs were initially introduced to model word sequences for language modeling and were shown to outperform long short-term memory (LSTM) language models trained in a similar setting. The effectiveness of employing a gated CNN to model a spectrogram has already been confirmed [48], [49]. By using O_{l−1} to denote the output of the (l − 1)th layer, the output of the lth layer O_l of a gated CNN can be written as

$$\mathbf{O}_l = (\mathbf{W}_l^f * \mathbf{O}_{l-1} + \mathbf{B}_l^f) \otimes \sigma(\mathbf{W}_l^g * \mathbf{O}_{l-1} + \mathbf{B}_l^g), \tag{37}$$

where W_l^f, W_l^g, B_l^f, and B_l^g are the weight and bias parameters of the lth layer, ⊗ denotes element-wise multiplication, and σ is the sigmoid function. The main difference between a gated CNN and a regular CNN layer is that a gated linear unit (GLU), namely the second term of (37), is used as a nonlinear activation function. Like LSTMs, GLUs have data-driven gates, which control the information passed on in the hierarchy. At each gated CNN layer in the encoder and decoder, a broadcast version of c is appended along the channel dimension to the output of the previous layer. Adam [50] was used to train the networks. Note that Algorithm 1 and Algorithm 2 correspond to s.u.MVAE and s.u.fMVAE_o/s.u.fMVAE_c, respectively. For s.i.MVAE and s.i.fMVAE, the correct class label c_j is given and fixed during the update. Figure 5 shows the learning curves of the CVAE and ACVAE training processes. The curves demonstrate that the networks were trained stably with fast convergence.
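The gated CNN layer described above can be illustrated with a minimal pure-Python sketch. It assumes a "same"-padded one-dimensional convolution along the time axis with an odd kernel size, and it represents tensors as nested lists indexed as [channel][frame]. The function names (conv1d, gated_conv_layer) and the tensor layout are illustrative choices, not taken from the authors' implementation.

```python
import math


def conv1d(x, weight, bias):
    """'Same'-padded 1-D convolution along time (cross-correlation form).

    x:      [in_ch][T] input feature map
    weight: [out_ch][in_ch][K] kernels (K assumed odd)
    bias:   [out_ch]
    Returns a [out_ch][T] feature map.
    """
    n_frames = len(x[0])
    kernel = len(weight[0][0])
    pad = kernel // 2
    out = []
    for o, w_o in enumerate(weight):
        row = []
        for t in range(n_frames):
            acc = bias[o]
            for i, x_i in enumerate(x):
                for k in range(kernel):
                    tau = t + k - pad
                    if 0 <= tau < n_frames:  # zero padding outside the signal
                        acc += w_o[i][k] * x_i[tau]
            row.append(acc)
        out.append(row)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_conv_layer(prev_out, c, w_f, b_f, w_g, b_g):
    """One gated CNN layer: (W_f * x + B_f) multiplied element-wise by sigmoid(W_g * x + B_g).

    The class label c is broadcast over time and appended along the
    channel dimension before the convolutions, as described in the text.
    """
    n_frames = len(prev_out[0])
    x = prev_out + [[c_k] * n_frames for c_k in c]
    linear = conv1d(x, w_f, b_f)
    gate = conv1d(x, w_g, b_g)
    return [[a * sigmoid(g) for a, g in zip(lin_row, gate_row)]
            for lin_row, gate_row in zip(linear, gate)]
```

With all gate weights and biases set to zero, the gate evaluates to 0.5 everywhere, so the layer simply halves the linear branch; this is a convenient sanity check. Because the layer only convolves along time, the same parameters apply to inputs of any length, which is the property the fully convolutional design relies on.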
For s.i.IDLMA, we used a fully connected neural network with four hidden layers. Each layer had 1024 units, and a rectified linear unit was used for the output of each layer, which was the same as the network architecture used in [46]. We implemented the training settings described in [46], namely using the Gaussian-IDLMA loss function and concatenation of the current, preceding, and succeeding frames to capture the temporal dependency, data augmentation, and regularization. The only difference was the optimization algorithm, where we used Adam to train the network for 700 epochs instead of Adadelta [51] for 200 epochs. More training details are available in [46].

C. DATASET FOR SPEAKER-DEPENDENT SEPARATION
As in the original MVAE paper [21], we used speech utterances of two male speakers (SM1, SM2) and two female speakers (SF1, SF2) excerpted from the Voice Conversion Challenge (VCC) 2018 dataset [52] for the speaker-dependent source separation experiment. The audio files for each speaker were about seven minutes long and manually segmented into 116 short sentences, where 81 and 35 sentences (about five and two minutes long, respectively) served as training and test sets, respectively.
We used two-channel mixture signals of two sources as the test data, which were synthesized using simulated room impulse responses (RIRs) generated using the image method [53] and real RIRs measured in an anechoic room (ANE) and an echo room (E2A). Figure 6 shows the configuration of the room used for simulating RIRs. To meet the instantaneous mixing model assumption, the reverberation times (RT60) [54] of the simulated RIRs were set at 78 and 351 ms, which were controlled by setting the reflection coefficient of the walls at 0.20 and 0.80, respectively. For the measured RIRs, we used the data included in the RWCP Sound Scene Database in Real Acoustic Environments [55]. The RT60 values of ANE and E2A were 173 and 225 ms, respectively. The test data included four pairs of speakers, i.e., SF1+SF2, SF1+SM1, SM1+SM2, and SF2+SM2. For each speaker pair, we generated ten mixture signals. Hence, there were a total of 40 test signals for each reverberation condition, each of which was about four to seven seconds long. All the speech signals were resampled at 16 kHz.
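As an illustration of how such test mixtures can be synthesized, the following sketch convolves each dry source with the RIR from that source to each microphone and sums the results per channel. The toy signals and impulse responses are placeholders, and the helper names (convolve, mix) are hypothetical; the actual RIRs in the experiments came from the image method and the RWCP database.

```python
def convolve(signal, rir):
    """Full linear convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(rir):
            out[n + k] += s * h
    return out


def mix(sources, rirs):
    """Synthesize a multichannel mixture.

    sources: list of dry source signals (lists of floats)
    rirs:    rirs[m][j] is the impulse response from source j to microphone m
    Returns one signal per microphone, zero-padded to a common length.
    """
    length = max(len(s) + len(rirs[m][j]) - 1
                 for m in range(len(rirs))
                 for j, s in enumerate(sources))
    mixture = []
    for mic_rirs in rirs:
        channel = [0.0] * length
        for j, s in enumerate(sources):
            for n, v in enumerate(convolve(s, mic_rirs[j])):
                channel[n] += v
        mixture.append(channel)
    return mixture


# Toy example: two sources, two microphones, very short placeholder RIRs.
toy_sources = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
toy_rirs = [
    [[1.0], [0.5]],        # microphone 0: RIRs from source 0 and source 1
    [[0.0, 1.0], [1.0]],   # microphone 1: source 0 arrives one sample later
]
toy_mixture = mix(toy_sources, toy_rirs)
```

With four speaker pairs and ten mixtures per pair, this procedure yields the 40 test signals per reverberation condition mentioned above.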

D. EXPERIMENTAL ANALYSIS OF WINDOW LENGTH, INITIALIZATION, AND WEIGHT PARAMETER α
In this subsection, we compare the separation performance across different STFT window lengths, different initialization methods for the MVAE and fMVAE algorithms, and different α settings.
Since all the methods are based on the instantaneous linear mixture model, the STFT window length may affect the separation performance of each of them, especially under reverberant conditions. We computed the STFT using a Hamming window with a length of {32, 64, 128, 256} ms and a frame shift of half the window length. In this experiment, all the MVAE and fMVAE methods were initialized by running u.u.ILRMA for 30 iterations. The MVAE or fMVAE algorithm was then run for 30 iterations, where Adam was used to update z_j and c_j in the MVAE methods with a step size of 0.01. We used α = 0 for fMVAE in this experiment. Table 2 shows the SDR scores obtained with each method. From these results, the window length that gave the best overall performance was 128 ms for the current dataset. Therefore, we conducted all the following experiments using a window length of 128 ms.
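To make the window settings concrete, the sketch below converts each window length in milliseconds into samples at the 16 kHz sampling rate and derives the half-window frame shift. The number of frequency bins assumes that the FFT size equals the window length, and the Hamming window uses the symmetric definition; both are assumptions, since the text does not specify them.

```python
import math

SAMPLE_RATE = 16000  # Hz, the sampling rate stated for the speech signals


def stft_frame_params(window_ms, sample_rate=SAMPLE_RATE):
    """Return (window length, frame shift, number of frequency bins) in samples.

    Assumes the FFT size equals the window length (not stated in the text).
    """
    win = int(round(window_ms * sample_rate / 1000))
    hop = win // 2            # shift by half of the window length
    n_bins = win // 2 + 1     # non-negative frequencies of a real FFT
    return win, hop, n_bins


def hamming(n):
    """Symmetric Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


for ms in (32, 64, 128, 256):
    win, hop, n_bins = stft_frame_params(ms)
    print(f"{ms:>3} ms: window={win} samples, shift={hop}, bins={n_bins}")
```

Under these assumptions, the 128 ms window selected for the remaining experiments corresponds to 2048 samples with a shift of 1024 samples.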
To confirm the impact of the initialization for the MVAE and fMVAE methods on the source separation performance, we compared the algorithms using the following three initialization methods: 1) random initialization with the demixing matrices initialized to identity matrices; 2) IVA; and 3) u.u.ILRMA. To keep the number of updates of the demixing matrices constant, each algorithm was run for 60 iterations in the random initialization case and, in the other cases, for 30 iterations after the initialization algorithm had been run for 30 iterations. Table 3 shows the SDR scores over the 160 test samples. From these results, we found that the methods adopting ILRMA for initialization achieved better performance than those using IVA for initialization. One possible reason could be that block permutation had occurred in IVA. It is worth noting that the MVAE methods with random initialization obtained more than 3 dB higher SDR improvements than when using IVA and ILRMA for initialization. Meanwhile, though random initialization slightly outperformed ILRMA in s.u.fMVAE_c and s.i.fMVAE, there were no noticeable differences. Therefore, we adopted random initialization in the following experiments.
Finally, we investigated how much the performance depends on the weight parameter α in the prior-weighted inference. We set α at {0, 1, 10, 50, 100, 200, 300, mean}, where "mean" indicates the data-dependent setting. Figure 7 shows the average SDR scores over 160 test signals. We found that the effectiveness of the prior distribution p(z) in improving the source separation performance was modest in the speaker-dependent case and that the SDRs started to decrease at α > 10, which indicates that a smaller value of α leads to better performance for the speaker-dependent case. Moreover, the curve of fMVAE_o was entirely above the curve of fMVAE_c regardless of the initialization method, which indicates that fMVAE_o is more effective in speaker-dependent scenarios.

E. SOURCE SEPARATION PERFORMANCE
In addition to SDRs, we used signal-to-interference ratios (SIRs) and signal-to-artifact ratios (SARs) [56] to evaluate the source separation performance. We also computed the perceptual evaluation of speech quality (PESQ)³ [57] and short-time objective intelligibility (STOI)⁴ [58] measures to assess the speech quality and intelligibility. All the criteria were calculated using a dry source as the reference signal. We first confirmed the effectiveness of conditional modeling by comparing the performance obtained with the CVAE source model and its unconditional counterpart under the MVAE framework. Table 4 shows the SDR, SIR, SAR, PESQ, and STOI scores. As can be seen from the results, the CVAE source model obtained a 1.7-dB higher SDR than a source model based on a regular VAE. Table 5 shows the scores obtained by each method with the optimal parameter setting. By comparing the supervised methods with the blind method (u.u.ILRMA), we confirmed that an appropriately pretrained source model could lead to considerably improved source separation performance. The MVAE methods achieved the best scores in both the uninformed and informed categories and significantly outperformed the other methods. The fMVAE method yielded an average SDR score that was 2.8 dB lower than the original MVAE method, but about 0.75 dB higher than the other baseline methods.
³ Code: https://github.com/vBaiCai/python-pesq
⁴ Code: https://github.com/mpariente/pystoi
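For intuition about what an SDR value measures, the following sketch computes a simplified, scale-invariant signal-to-distortion ratio. This is explicitly not the BSS Eval procedure of [56], which additionally decomposes the error into interference and artifact components to obtain SIR and SAR; here the estimate is only projected onto the dry reference, and everything left over is treated as distortion. The function name simplified_sdr is hypothetical.

```python
import math


def simplified_sdr(reference, estimate):
    """Scale-invariant SDR in dB (a simplification, not BSS Eval).

    The estimate is projected onto the reference to obtain the target
    component; the residual is treated as distortion. Raises
    ZeroDivisionError if the estimate is an exact scaled copy of the
    reference (zero distortion) or if the reference is all zeros.
    """
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    return 10.0 * math.log10(target_energy / distortion_energy)


reference = [1.0, 0.0, 1.0, 0.0]
estimate = [1.0, 0.1, 1.0, -0.1]   # reference plus a small orthogonal error
print(round(simplified_sdr(reference, estimate), 2))
```

In this toy case the target energy is 100 times the distortion energy, which corresponds to 20 dB. Differences of a few decibels between methods, such as the 2.8 dB gap reported above, therefore reflect a substantial change in the energy ratio.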

F. COMPUTATIONAL TIME
The average computational times of the MVAE and fMVAE methods with random initialization are summarized in Table 6. All the programs were run using an Intel(R) Core i7-7800X CPU @ 3.50 GHz and a TITAN V GPU with 12 GB of memory. Here, "runtime/iteration" means the computational time required to update the parameters once using the MVAE or fMVAE algorithm. The "total" time indicates the time taken by the entire process, including the time for constructing the system (e.g., loading the pretrained networks to a GPU), updating parameters, and performing the separation. Comparing the runtime per iteration, we found that the fMVAE algorithm was about 70 times faster than the MVAE algorithm. Moreover, fMVAE was found to reduce the computational time by more than 90% even when using a CPU. These results indicate a tradeoff between the source separation performance and computational time: the MVAE method provides better separation performance at a high computational cost, whereas fMVAE significantly reduces the computational cost at the expense of some performance degradation.
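The relationship between a speed-up factor and a percentage reduction in time can be made explicit with a small sketch. The helper names are hypothetical, and the calculation applies to any pair of timings; it uses only the "about 70 times" figure quoted above, not the entries of Table 6.

```python
def time_reduction_percent(speedup):
    """Percentage of time saved when a process becomes `speedup` times faster."""
    return 100.0 * (1.0 - 1.0 / speedup)


def required_speedup(reduction_percent):
    """Speed-up factor needed to save `reduction_percent` percent of the time."""
    return 1.0 / (1.0 - reduction_percent / 100.0)


print(round(time_reduction_percent(70), 1))  # per-iteration saving at 70x
print(round(required_speedup(90), 1))        # factor needed for a 90% saving
```

A 70-fold speed-up per iteration corresponds to a saving of roughly 98.6% of the per-iteration time, while any speed-up above 10 times already exceeds a 90% reduction. The reduction in total time can be smaller than the per-iteration figure, because the total also includes fixed costs such as loading the pretrained networks.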

G. SPEAKER-INDEPENDENT SEPARATION
In practical applications, the speakers in a given mixture signal are not always included in the training dataset. In this subsection, we show the performance of the MVAE and fMVAE methods in speaker-independent tasks and compare them with u.u.ILRMA, which requires no prior information about the speakers.
We created datasets using utterances from the Wall Street Journal (WSJ0) corpus [59]. All the utterances in the WSJ0 folder si_tr_s (around 25 hours) were used as the training set, which consists of 101 speakers in total. If there is a large number of utterances of a sufficiently wide variety of speakers in the training dataset, the trained model is expected to have an ability to express spectrograms of unseen speakers. When a test mixture contains unseen speakers, (31) can be interpreted as how similar speaker j is to the speakers in the training set, whereas (32) indicates the speaker in the training set most similar to speaker j. A test set was created by randomly mixing two different speakers selected from the WSJ0 folders si_dt_05 and si_et_05, where the number of speakers was 18. We generated test data using simulated RIRs with RT60 = 78 ms and RT60 = 351 ms, where 100 mixture signals were generated under each reverberation condition. The average SDRs of the datasets were about 0.60 dB and -0.78 dB, respectively. Other experimental conditions and network architectures were the same as those described in Subsection V-C.
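The random pairing of test speakers can be sketched as follows. The speaker identifiers are placeholders rather than actual WSJ0 speaker codes, the seed is arbitrary, and the function name sample_speaker_pairs is hypothetical; the sketch only illustrates drawing two different speakers per mixture from a pool of 18.

```python
import math
import random

# Placeholder identifiers for the 18 test speakers (not real WSJ0 codes).
SPEAKERS = [f"spk{i:02d}" for i in range(18)]

# Number of distinct unordered speaker pairs available in the pool.
N_UNORDERED_PAIRS = math.comb(len(SPEAKERS), 2)


def sample_speaker_pairs(speakers, n_mixtures, seed=0):
    """Draw n_mixtures pairs, each made of two different speakers."""
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    return [tuple(rng.sample(speakers, 2)) for _ in range(n_mixtures)]


# One set of 100 pairs per reverberation condition, as in the text.
pairs_per_condition = sample_speaker_pairs(SPEAKERS, n_mixtures=100)
```

Since 18 speakers give 153 possible unordered pairs, drawing 100 mixtures per condition covers a large share of the possible speaker combinations, although individual pairs may repeat.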
As in the speaker-dependent case, we first investigated the dependence of the separation performance on the α setting. Figure 8 shows the average SDR scores over the entire test dataset achieved with various α settings. Since the scores obtained with α = 200 and α = 300 increased continuously, we additionally evaluated the performance obtained when α = {500, 700, 1000, 1500, 2000}. The optimal α settings were 500 for s.u.fMVAE_o and 2000 for s.u.fMVAE_c, respectively. This was considerably different from the speaker-dependent case, where a smaller α performed better. From these results, we can assume that the proposed prior-weighted update rule was more effective under open-set conditions than under closed-set conditions. Table 7 summarizes the average SDR, SIR, SAR, PESQ, and STOI scores obtained with each method with random initialization. The results demonstrate the ability of the MVAE and fMVAE methods to handle speaker-independent scenarios with an increasing variety and amount of training data. Both the MVAE and fMVAE methods were superior to u.u.ILRMA, where s.u.MVAE achieved an improvement of more than 3.5 dB over u.u.ILRMA. As with the speaker-dependent case, the fMVAE methods provided less improvement than the MVAE method.

VI. CONCLUSION
This paper proposed a novel optimization algorithm for the MVAE method, which is called FastMVAE (or fMVAE). The proposed method exploits an auxiliary classifier VAE instead of a regular CVAE to learn the generative distribution of source signals and employs the trained auxiliary classifier and encoder for inference. We newly introduced a prior-weighted update rule for the latent variables of each CVAE source model and different update rules for the class label of each source. We conducted experiments to investigate the optimal window length, initialization, and weight parameter and performed speaker-dependent and speaker-independent source separation experiments to confirm the effectiveness of the proposed method. Experimental results revealed that fMVAE can significantly reduce computational time by more than 90% compared with the original MVAE method; the MVAE and fMVAE methods outperformed conventional methods under speaker-dependent conditions; and the MVAE and fMVAE methods can handle a speaker-independent scenario by using a large set of training data.