Joint Separation and Localization of Moving Sound Sources Based on Neural Full-Rank Spatial Covariance Analysis

This paper presents an unsupervised multichannel method that can separate moving sound sources based on an amortized variational inference (AVI) of joint separation and localization. A recently proposed blind source separation (BSS) method called neural full-rank spatial covariance analysis (FCA) trains a neural separation model based on a nonlinear generative model of multichannel mixtures and can precisely separate unseen mixture signals. This method, however, assumes that the sound sources hardly move, and thus its performance is easily degraded by the source movements. In this paper, we solve this problem by introducing time-varying spatial covariance matrices and directions of arrival of sources into the nonlinear generative model of the neural FCA. This generative model is used for training a neural network to jointly separate and localize moving sources by using only multichannel mixture signals and array geometries. The training objective is derived as a lower bound on the log-marginal posterior probability in the framework of AVI. Experimental results obtained with mixture signals of moving sources show that our method outperformed an existing joint separation and localization method and standard BSS methods.

Most BSS methods assume that sound sources hardly move, and thus their performance is easily degraded by the movement of the target sources. One solution is to allow source steering vectors (the rank-1 special forms of SCMs) to be time-variant by assuming a Markov process on the vectors [29]. If the geometry of microphones is available, we can efficiently constrain the steering vectors with the directions of arrival (DoAs) of sources estimated by source localization [30], [31]. Furthermore, the separation and localization can be performed jointly, based on a unified generative model [32], to complementarily compensate for their estimation errors. In this paper, we present an unsupervised neural method to perform joint separation and localization for moving sources. As shown in Fig. 1, we extend the neural FCA to handle the source movements by introducing the time-varying SCM and DoA for each source in the nonlinear generative model of a multichannel mixture signal. The joint separation and localization is performed by an inference model that predicts the parameters of each source from an input mixture. These inference and generative models are jointly trained in an unsupervised manner by maximizing a log-marginal posterior probability given only multichannel mixtures and array geometries.
The main contribution of this study is to combine a time-varying nonlinear (neural) BSS model with neural probabilistic inference. While full-rank SCMs generally outperform rank-1 SCMs, it has been difficult to impose a Markov process or temporal smoothness on them when modeling only with conjugate priors. We solve this problem by imposing the temporal smoothness of SCMs through constraints on the neural inference model instead of the generative model. The experimental results with mixture signals of moving sources show that our method significantly outperformed an existing joint localization and separation method as well as standard BSS methods.

II. BACKGROUND
This section introduces a nonlinear BSS method called neural FCA as a preliminary for the proposed method.

A. Blind Source Separation
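Existing BSS models typically represent an $M$-channel mixture signal $\mathbf{x}_{ft} \in \mathbb{C}^{M}$ as a sum of $N$ target (and noise) source signals $s_{nft} \in \mathbb{C}$ ($n = 1, \ldots, N$) in the time-frequency (TF) domain as follows:
$$\mathbf{x}_{ft} = \sum_{n=1}^{N} \mathbf{a}_{nf}\, s_{nft}, \tag{1}$$
where $\mathbf{a}_{nf} \in \mathbb{C}^{M}$ is the time-invariant steering vector for source $n$, and $f = 1, \ldots, F$ and $t = 1, \ldots, T$ are the frequency and time frame indices, respectively. The source signal $s_{nft}$ is represented by a complex Gaussian distribution as follows:
$$s_{nft} \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_{nft}), \tag{2}$$
where $\lambda_{nft} \in \mathbb{R}_{+}$ is the power spectral density (PSD) of source $n$. By marginalizing out the source signals $s_{nft}$, we obtain the following likelihood function of the multichannel mixture $\mathbf{x}_{ft}$:
$$\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\Big(\mathbf{0}, \sum_{n=1}^{N} \lambda_{nft}\, \mathbf{H}_{nf}\Big), \tag{3}$$
where $\mathbf{H}_{nf} = \mathbf{a}_{nf}\mathbf{a}_{nf}^{\mathsf{H}} \in \mathbb{S}_{+}^{M \times M}$ is the SCM of source $n$. By allowing $\mathbf{H}_{nf}$ to be full rank, this model, called FCA [14], can handle small source movements and reverberation.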
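The likelihood of the mixture under the full-rank model can be evaluated numerically as in the following minimal sketch. The function name `fca_log_likelihood` and the array layouts are illustrative assumptions, not part of the original method description; the code simply evaluates the zero-mean complex Gaussian log-density with covariance $\sum_n \lambda_{nft}\mathbf{H}_{nf}$.

```python
import numpy as np

def fca_log_likelihood(x, lam, H):
    """Log-likelihood of a mixture under a zero-mean complex Gaussian model.

    x:   (F, T, M) complex mixture spectrogram
    lam: (N, F, T) nonnegative PSDs of the sources
    H:   (N, F, M, M) Hermitian positive-definite SCMs (time-invariant)
    (Shapes and names are assumptions made for this sketch.)
    """
    M = x.shape[-1]
    # Mixture covariance Y[f, t] = sum_n lam[n, f, t] * H[n, f]
    Y = np.einsum("nft,nfij->ftij", lam, H)
    # log|Y|; the determinant of a Hermitian PD matrix is real and positive
    _, logdet = np.linalg.slogdet(Y)
    # Quadratic form x^H Y^{-1} x, computed with a linear solve
    Yinv_x = np.linalg.solve(Y, x[..., None])[..., 0]
    quad = np.real(np.einsum("ftm,ftm->ft", x.conj(), Yinv_x))
    return float(np.sum(-M * np.log(np.pi) - logdet - quad))
```

Using a linear solve instead of an explicit matrix inverse is a common choice for numerical stability when the covariance is close to rank deficient.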

B. Deep Spectral Model
A nonlinear (neural) source model has been utilized for precisely representing complex source spectra [23], [24], [25], [26], [27]. This model, called a deep spectral model, assumes that the PSD $\lambda_{nft}$ is represented by $D$-dimensional feature vectors $\mathbf{z}_{nt} \in \mathbb{R}^{D}$ ($t = 1, \ldots, T$) as follows:
$$\lambda_{nft} = g_{\theta,f}(\mathbf{z}_{nt}), \tag{4}$$
where $g_{\theta,f} : \mathbb{R}^{D} \to \mathbb{R}_{+}$ is a neural network with parameters $\theta$ that transforms the feature vector into the PSD. Assuming that $\mathbf{z}_{nt}$ follows the standard Gaussian distribution:
$$\mathbf{z}_{nt} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \tag{5}$$
this model can be trained as the decoder of a variational autoencoder by using clean source signals [23]. This supervised source model was reported to outperform NMF-based linear models in speech separation and enhancement [26], [27], [28].

C. Neural Full-Rank Spatial Covariance Analysis
The deep spectral model can be trained in an unsupervised manner by using only multichannel mixtures based on amortized variational inference [21], [33]. This method, called neural FCA, utilizes an inference (encoder) network to predict the latent source features as the posterior distribution $q_{\phi}(\mathbf{Z} \mid \mathbf{X})$, where $\phi$ represents the network parameters. By using the generative model of a mixture signal ((3)-(5)) as a decoder, the encoder and decoder are jointly trained to maximize the following evidence lower bound (ELBO) [33]:
$$\mathcal{L}_{\theta,\phi}(\mathbf{X}) = \mathbb{E}_{q_{\phi}}[\log p_{\theta}(\mathbf{X} \mid \mathbf{Z}, \mathbf{H})] - D_{\mathrm{KL}}[q_{\phi}(\mathbf{Z} \mid \mathbf{X}) \mid p(\mathbf{Z})], \tag{6}$$
where $\mathbb{E}_{q_{\phi}}[\cdot]$ is the expectation with respect to the posterior $q_{\phi}$, and $D_{\mathrm{KL}}[\cdot \mid \cdot]$ is the Kullback-Leibler (KL) divergence. The network parameters $\theta$ and $\phi$ are updated by stochastic gradient ascent, and the SCMs $\mathbf{H} \triangleq \{\mathbf{H}_{nf}\}_{n,f}$ are updated by an expectation-maximization (EM) algorithm [14]. This training can be considered nonlinear BSS performed on the training mixtures.
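The two ingredients of such an ELBO for a diagonal Gaussian posterior and a standard Gaussian prior can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the weight `beta` on the KL term (as used in annealing schemes) are assumptions introduced here.

```python
import numpy as np

def gaussian_kl(mu, var):
    """KL[N(mu, diag(var)) | N(0, I)], summed over all elements."""
    return 0.5 * float(np.sum(mu**2 + var - np.log(var) - 1.0))

def sample_latent(mu, var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

def elbo(log_lik, mu, var, beta=1.0):
    """Single-sample ELBO estimate: log-likelihood minus the weighted KL term.

    log_lik is assumed to be log p(X | Z, H) evaluated at a sampled Z.
    """
    return log_lik - beta * gaussian_kl(mu, var)
```

With `mu = 0` and `var = 1` the posterior equals the prior, so the KL term vanishes and the ELBO reduces to the log-likelihood.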

III. TIME-VARYING NEURAL FCA
We extend the original neural FCA to perform joint localization and separation for moving sound sources.

A. Generative Model of Multichannel Mixture Signals
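We assume that a mixture signal $\mathbf{x}_{ft}$ consists of $N$ directional moving sources $s_{nft}$ and a diffuse noise $\mathbf{n}_{ft}$ as follows:
$$\mathbf{x}_{ft} = \sum_{n=1}^{N} \mathbf{a}_{nft}\, s_{nft} + \mathbf{n}_{ft}, \tag{7}$$
where $\mathbf{a}_{nft} \in \mathbb{C}^{M}$ is a time-varying steering vector for source $n$ at time frame $t$. Assuming that both the source and noise signals follow the time-varying versions of (2) and (3), we obtain the following likelihood function:
$$\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\Big(\mathbf{0}, \sum_{n=0}^{N} \lambda_{nft}\, \mathbf{H}_{nft}\Big), \tag{8}$$
where $\mathbf{H}_{nft} \in \mathbb{S}_{+}^{M \times M}$ are time-varying SCMs for the sources ($n = 1, \ldots, N$) and noise ($n = 0$).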
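To exploit the localization results, we assume that the SCM $\mathbf{H}_{nft}$ for each source ($n = 1, \ldots, N$) follows a conjugate prior conditioned on a unit vector $\mathbf{u}_{nt} \in \mathbb{R}^{3}$ ($\|\mathbf{u}_{nt}\| = 1$) representing the DoA of each source:
$$\mathbf{H}_{nft} \mid \mathbf{u}_{nt} \sim \mathcal{IW}_{\mathbb{C}}\big(\nu, (\nu + M)\, \mathbf{G}_{f}(\mathbf{u}_{nt})\big), \tag{9}$$
where $\mathcal{IW}_{\mathbb{C}}(\nu, \boldsymbol{\Gamma}) \propto |\mathbf{H}|^{-(\nu+M)} |\boldsymbol{\Gamma}|^{\nu} \exp(-\mathrm{tr}(\boldsymbol{\Gamma}\mathbf{H}^{-1}))$ is the complex inverse Wishart distribution, and $\nu > M$ is a hyperparameter controlling the degrees of freedom of $\mathbf{H}_{nft}$. The mode of this prior is equal to the prior SCM $\mathbf{G}_{f}(\mathbf{u}_{nt})$ defined as:
$$\mathbf{G}_{f}(\mathbf{u}_{nt}) = \mathbf{b}_{f}(\mathbf{u}_{nt})\, \mathbf{b}_{f}(\mathbf{u}_{nt})^{\mathsf{H}} + \epsilon \mathbf{I}, \tag{10}$$
where $\epsilon > 0$ is a small number that makes $\mathbf{G}_{f}(\mathbf{u}_{nt})$ positive definite, and $\mathbf{b}_{f}(\mathbf{u}_{nt}) \in \mathbb{C}^{M}$ is the steering vector for direction $\mathbf{u}_{nt}$ computed from the inter-microphone time delays given by the array geometry. The SCM for noise $\mathbf{H}_{0ft}$, on the other hand, is assumed to be diffuse by replacing $\mathbf{G}_{f}(\mathbf{u}_{nt})$ in (9) with an identity matrix. Note that we do not formulate any temporal smoothness of $\mathbf{H}_{nft}$ because it is difficult to impose such a constraint in a generative model. As described in the next section, we instead introduce it through the inductive bias of the inference model.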
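A prior SCM of the form "rank-1 steering term plus a small multiple of the identity" can be built from the array geometry as in the sketch below. It assumes a far-field (plane-wave) model, a speed of sound of 343 m/s, and one particular sign convention for the phase; none of these details are specified in the text, and the function name `prior_scm` is hypothetical.

```python
import numpy as np

def prior_scm(mic_pos, u, freqs, eps=1e-3, c=343.0):
    """Prior SCM G_f(u) = b_f(u) b_f(u)^H + eps * I for each frequency.

    mic_pos: (M, 3) microphone positions in metres
    u:       (3,) direction of arrival (normalized inside)
    freqs:   (F,) centre frequencies of the STFT bins in Hz
    Far-field propagation and c = 343 m/s are assumptions of this sketch.
    """
    u = u / np.linalg.norm(u)
    # Relative arrival delay at each microphone for a plane wave from u
    delays = -(mic_pos @ u) / c                                   # (M,)
    # Steering vector b_f(u): unit-modulus phase shifts per microphone
    b = np.exp(-2j * np.pi * freqs[:, None] * delays[None, :])    # (F, M)
    # Rank-1 outer product b b^H, regularized to be positive definite
    G = b[:, :, None] * b[:, None, :].conj()                      # (F, M, M)
    return G + eps * np.eye(mic_pos.shape[0])
```

Because every entry of the steering vector has unit modulus, the rank-1 term has eigenvalues $M$ and $0$, so the smallest eigenvalue of the result equals `eps`.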

B. Inference Model
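Our inference model predicts the latent features $\mathbf{z}_{nt}$, SCMs $\mathbf{H}_{nft}$, and DoAs $\mathbf{u}_{nt}$ from a multichannel mixture signal $\mathbf{x}_{ft}$ (Fig. 1). For the estimates of $\mathbf{z}_{nt}$, following the original neural FCA [21], the inference model predicts the posterior distribution $q_{\phi}(\mathbf{Z} \mid \mathbf{X})$ as follows:
$$q_{\phi}(\mathbf{Z} \mid \mathbf{X}) = \prod_{n,t,d} \mathcal{N}\big(z_{ntd} \mid \mu_{\phi,ntd}(\mathbf{X}), \sigma^{2}_{\phi,ntd}(\mathbf{X})\big), \tag{11}$$
where $\mu_{\phi,ntd}(\mathbf{X}) \in \mathbb{R}$ and $\sigma^{2}_{\phi,ntd}(\mathbf{X}) \in \mathbb{R}_{+}$ are the outputs of the inference network. Since the time-varying SCM $\mathbf{H}_{nft}$ is difficult to estimate analytically, the inference model estimates it by (12) as the moving average of the masked observations added to the prior SCM $\mathbf{G}_{f}(\mathbf{u}_{nt})$ with a weight hyperparameter $\gamma_{0} \in \mathbb{R}_{+}$. In (12), $w_{\phi,nft}(\mathbf{X}) \in [0, 1]$ is a TF mask predicted by the inference network, and $\gamma \in (0, 1]$ is a decay hyperparameter controlling the smoothness of $\mathbf{H}_{nft}$. For numerical stability, $\mathbf{H}_{nft}$ is normalized so that $\mathrm{tr}(\mathbf{H}_{nft}) = M$. Lastly, the DoA $\mathbf{u}_{nt}$ is predicted by (13) as the moving average of unit vectors $\tilde{\mathbf{u}}_{\phi,nt}(\mathbf{X}) \in \mathbb{R}^{3}$ output by the network, where $\eta \in (0, 1]$ is a decay hyperparameter. The DoAs $\mathbf{u}_{nt}$ are also normalized to be unit vectors. These moving averages introduce the temporal smoothness of $\mathbf{H}_{nft}$ and $\mathbf{u}_{nt}$.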
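One plausible way to realize these moving averages is sketched below. The exact weighting is an assumption: the sketch uses two-sided exponential weights $\gamma^{|t-t'|}$, chosen because they reduce to a time-invariant sum over all frames when the decay equals 1, which matches the description of the noise SCM. The function names and array layouts are hypothetical.

```python
import numpy as np

def decay_weights(T, decay):
    """Two-sided exponential weights W[t, s] = decay ** |t - s| (assumed form)."""
    idx = np.arange(T)
    return decay ** np.abs(idx[:, None] - idx[None, :])

def smooth_scm(x, w, G, gamma=0.99, gamma0=0.1):
    """Moving average of masked observations plus a weighted prior SCM.

    x: (F, T, M) complex mixture, w: (F, T) mask in [0, 1],
    G: (F, T, M, M) prior SCM evaluated at the estimated DoAs.
    """
    T, M = x.shape[1], x.shape[2]
    # Masked outer products w * x x^H for every TF bin
    outer = w[..., None, None] * x[..., :, None] * x[..., None, :].conj()
    H = np.einsum("ts,fsij->ftij", decay_weights(T, gamma), outer) + gamma0 * G
    # Normalize so that tr(H) = M for numerical stability
    tr = np.real(np.trace(H, axis1=-2, axis2=-1))
    return H * (M / tr)[..., None, None]

def smooth_doa(u_raw, eta=0.99):
    """Moving average of raw DoA vectors (T, 3), renormalized to unit length."""
    u = decay_weights(u_raw.shape[0], eta) @ u_raw
    return u / np.linalg.norm(u, axis=-1, keepdims=True)
```

A decay close to 1 averages over many neighbouring frames and yields smooth trajectories, whereas a small decay lets the estimates follow frame-level fluctuations.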

C. Amortized Variational Inference for Unsupervised Training
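We train the inference and generative models by using only multichannel mixtures. The training objective for each mixture is the ELBO $\mathcal{L}_{\theta,\phi}$ with a regularization term on $\mathbf{H}_{nft}$:
$$\mathbb{E}_{q_{\phi}}[\log p_{\theta}(\mathbf{X} \mid \mathbf{Z}, \mathbf{H}_{\phi})] - D_{\mathrm{KL}}[q_{\phi}(\mathbf{Z} \mid \mathbf{X}) \mid p(\mathbf{Z})] + \log p(\mathbf{H}_{\phi} \mid \mathbf{U}_{\phi}), \tag{14}$$
where $\mathbf{H}_{\phi}$ and $\mathbf{U}_{\phi}$ are the sets of inference results obtained by (12) and (13), respectively. This objective is a lower bound on the following log-marginal posterior function:
$$\log p(\mathbf{H}_{\phi}, \mathbf{U}_{\phi} \mid \mathbf{X}) \overset{c}{=} \log \int p_{\theta}(\mathbf{X} \mid \mathbf{Z}, \mathbf{H}_{\phi})\, p(\mathbf{Z})\, d\mathbf{Z} + \log p(\mathbf{H}_{\phi} \mid \mathbf{U}_{\phi}), \tag{15}$$
where $\overset{c}{=}$ denotes equality up to an additive constant. The network parameters $\theta$ and $\phi$ are updated by using stochastic gradient ascent. This method can be regarded as training $q_{\phi}(\mathbf{Z} \mid \mathbf{X})$ and $\mathbf{H}_{\phi}$ with the original ELBO $\mathcal{L}_{\theta,\phi}(\mathbf{X})$, while the DoAs $\mathbf{U}_{\phi}$ regularize the SCMs $\mathbf{H}_{\phi}$ and are optimized to maximize the log-likelihood $\log p(\mathbf{H}_{\phi} \mid \mathbf{U}_{\phi})$. After training, these networks can be used to separate and localize moving sources in an unseen mixture. The source signals are obtained by a multichannel Wiener filter [14], [21] with the source images $\mathbf{Y}_{nft} \triangleq g_{\theta,f}(\boldsymbol{\mu}_{\phi,nt}(\mathbf{X}))\, \mathbf{H}_{nft}$.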
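The final separation step can be illustrated with a generic multichannel Wiener filter, which estimates each source image as $\mathbf{Y}_{nft}\big(\sum_{n'} \mathbf{Y}_{n'ft}\big)^{-1}\mathbf{x}_{ft}$ with $\mathbf{Y}_{nft} = \lambda_{nft}\mathbf{H}_{nft}$. The sketch below is a textbook form under assumed array layouts; the function name is hypothetical, and the first axis is assumed to cover all components, including the noise.

```python
import numpy as np

def multichannel_wiener_filter(x, lam, H):
    """Estimate multichannel source images from a mixture.

    x:   (F, T, M) complex mixture spectrogram
    lam: (N, F, T) PSDs, e.g. decoder outputs at the posterior means
    H:   (N, F, T, M, M) time-varying SCMs
    Returns source images of shape (N, F, T, M).
    """
    # Covariance of each source image: Y_n = lam_n * H_n
    Y = lam[..., None, None] * H
    # Covariance of the mixture: sum over all components
    Y_mix = Y.sum(axis=0)
    # Solve Y_mix v = x instead of inverting Y_mix explicitly
    v = np.linalg.solve(Y_mix, x[..., None])          # (F, T, M, 1)
    # Apply each source covariance; broadcasting adds the source axis
    return (Y @ v)[..., 0]
```

By construction, the estimated images of all components sum to the observed mixture, which is a convenient sanity check.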

IV. EXPERIMENTAL EVALUATION
We evaluated our method on simulated speech mixtures due to the need for reference signals. A demonstration with real recordings can be found at https://ybando.jp/projects/spl2023.

A. Dataset
We generated a dataset of multichannel mixtures of moving sources. The mixture signals were generated as observations of six-channel microphone arrays in a manner similar to the spatialized WSJ0-2mix dataset [34]. Each mixture consisted of two source signals randomly selected from the WSJ0 English speech corpus. The moving source signals were generated by convolving the source signals with time-varying room impulse responses (RIRs) [35] generated every 0.1 s. The array with random geometry was placed randomly around the center of a room having dimensions of 5 m × 5 m × 3 m. Each source was initially located at a random position and moved around the array at a constant speed along a horizontal circular arc. We sampled the angular velocities of the sources uniformly between 0°/s and 45°/s. The angular separation between the sources was at least 45° throughout the movement. The reverberation time (RT60) was fixed to 200 ms. The source signals were mixed with a signal-to-noise ratio (SNR) randomly chosen between −5 and +5 dB. The mixture signals were generated at 16 kHz, and Gaussian noise was added with an SNR of 30 dB. The dataset consisted of 20000, 5000, and 3000 mixtures for the training, validation, and test sets, respectively. For comparison, we also generated a static dataset in which no sources moved.

B. Experimental Condition
We used almost the same network configuration as that of the original neural FCA [21], whose inference and generative models consisted of temporal convolutional networks. We added dropout (p = 0.1) to avoid bad local optima and added two output layers to the inference model for estimating the TF mask and the DoA. To utilize the spatial information and array geometries, the input feature consisted of a log-power spectrogram and a DoA spectrogram [36] calculated with 1000 uniformly distributed three-dimensional directions.
The networks were trained with the Adam optimizer [37] for 200 epochs with a learning rate of 0.001. The hyperparameters $D$, $\nu$, $\epsilon$, $\gamma_{0}$, and $\eta$ were set to 50, $M + 1$, 0.001, 0.1, and 0.99, respectively. We set $\gamma$ to 0.99 for sources and 1 (time-invariant) for noise. We scaled $\log p(\mathbf{H}_{\phi} \mid \mathbf{U}_{\phi})$ in (14) by 0.001 to avoid over-constraining the SCMs. Following [21], we also performed cyclic annealing [38] to scale the KL term in (6). The spectrograms were obtained by the short-time Fourier transform with a window size of 512 samples and a hop length of 128 samples. The mixture spectrograms were fed to the network by splitting them into 500-frame clips. The batch size for training was 128 clips. These hyperparameters were determined empirically.
Our time-varying neural FCA was compared with existing BSS methods, a joint separation and localization method, and the original neural FCA. As BSS methods, we evaluated cACGMM [18], FCA [14], and FastMNMF2 [17]. An external frequency permutation solver was used for cACGMM and FCA as in [21], and the number of basis vectors for FastMNMF2 was set to 8. As a joint method, we evaluated the clustering of TF bins with a DoA-HMM [32]. This method can separate moving sound sources and was initialized with the localization results obtained by MUSIC [39]. The original (time-invariant) neural FCA was trained with the same input features as the proposed method. The number of iterations for estimating the SCM $\mathbf{H}_{nf}$ was 5. We evaluated the separation performance with the average signal-to-distortion ratio (SDR) in dB [40] and the localization performance with the average DoA error in degrees [41]. The DoA error was averaged over non-silent frames, i.e., frames in which the power of the oracle source signal was higher than −20 dB relative to its average.
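The localization metric can be sketched as follows, assuming that the DoA error is the angle between the estimated and reference unit vectors and that non-silent frames are selected by thresholding the frame power relative to its mean. Both definitions are assumptions made for illustration (the exact definition follows [41]), and the function names are hypothetical.

```python
import numpy as np

def non_silent_frames(power, threshold_db=-20.0):
    """Boolean mask of frames whose power exceeds threshold_db relative to the mean.

    power: (T,) frame-wise power of the oracle source signal.
    """
    level_db = 10.0 * np.log10(power / power.mean() + 1e-12)
    return level_db > threshold_db

def doa_error_deg(u_est, u_ref, active=None):
    """Mean angular error in degrees between DoA vectors of shape (T, 3)."""
    u_est = u_est / np.linalg.norm(u_est, axis=-1, keepdims=True)
    u_ref = u_ref / np.linalg.norm(u_ref, axis=-1, keepdims=True)
    # Clip to guard against rounding slightly outside [-1, 1]
    cos = np.clip(np.sum(u_est * u_ref, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    if active is not None:
        err = err[active]
    return float(err.mean())
```

Restricting the average to active frames avoids penalizing the estimates during pauses, where the reference direction carries no acoustic evidence.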

C. Experimental Results
The separation and localization performance is summarized in Table I. We can first see that the SDRs of the original neural FCA and the standard BSS methods were significantly degraded by the source movements. In contrast, our method (the bottom row) improved the average SDR for the moving condition by more than 4 dB over these methods. Although its SDR for the static condition was 1.8 dB worse than that of the original neural FCA, our method was still better than FastMNMF2. Furthermore, our method outperformed the DoA-HMM-based clustering in both SDR and DoA error, and outperformed MUSIC in DoA error, for both the static and moving conditions. As shown in Fig. 2, while the original neural FCA output unseparated sources or silence, our method successfully estimated the speech sources over almost all the time frames.
Our method consists of three extensions of the neural FCA: estimating $\mathbf{H}_{nf}$ with TF masking, estimating the time-varying $\mathbf{H}_{nft}$ by (12), and performing joint localization and separation. As shown in the bottom two rows of Table I, the time-varying extension degraded the performance for the static condition because it cannot exploit the statistics of the entire set of time frames. In contrast, this extension is key to improving the performance in the moving condition. The temporal smoothness of DoAs introduced by (13) was also important, as demonstrated in Fig. 3. Our method without the smoothness failed to track each source and estimated two DoAs that were almost the same. With the smoothness, on the other hand, the DoAs were estimated correctly.

V. CONCLUSION
We presented an unsupervised multichannel method that can separate and localize moving sound sources. Our method trains a joint separation and localization model only from multichannel mixture signals and array geometries. This training is based on an extension of neural FCA that incorporates the time-varying SCMs and DoAs of each source. The experimental results with moving sound sources demonstrated that our method outperformed existing BSS methods and a joint source separation and localization method. Our future work includes further extending the neural FCA to handle variable numbers of sound sources and long reverberation, which will enable the separation of real-world recordings.