Speaker Counting and Separation From Single-Channel Noisy Mixtures

We address the problem of speaker counting and separation from a noisy, single-channel, multi-source recording. Most works in the literature assume mixtures containing two to five speakers. In this work, we consider noisy speech mixtures with one to five speakers as well as noise-only recordings. We propose a deep neural network (DNN) architecture that predicts a speaker count of zero for noise-only recordings, and predicts the individual clean speaker signals and the speaker count for mixtures of one to five speakers. The DNN is composed of transformer layers and processes the recordings using the long-time and short-time sequence modeling approach to masking in a learned time-feature domain. The network uses an encoder-decoder attractor module with long short-term memory units to generate a variable number of outputs. The network is trained with simulated noisy speech mixtures composed of speech recordings from the WSJ0 corpus and noise recordings from the WHAM! corpus. We show that the network achieves 99% speaker counting accuracy and more than 19 dB improvement in the scale-invariant signal-to-noise ratio for mixtures of up to three speakers.


I. INTRODUCTION
SPEECH presence detection, suppression of undesired noise, and estimation of the speaker signals from noisy speech recordings containing one or more speakers are the three most important tasks for any speech analytics application. Speech mixtures containing an overlap of two or more speakers are found to be a bottleneck for the success of speech diarization, automatic speech recognition and speaker verification. A separate-and-process strategy is found to benefit these tasks, especially in time regions containing speaker overlap. Data-driven approaches using deep neural networks, in particular, have been successful for single-channel speech separation. Traditional approaches commonly assume the speech recording to be a mixture of two or more concurrent speakers. In long-form recordings, there exist time regions where none of the speakers are active, more commonly regions where a single speaker is speaking, and also regions of multi-speaker overlap. The number of speakers in the mixture is often unknown. It is desirable to have a single system that can detect speech presence, count the number of speakers, and separate individual sources from a mixture signal with an unknown number of speakers.
Early DNN approaches to source separation considered masking in the short-time Fourier transform domain [1], [2]. Recent advances showed that masking in a learned time-feature domain gives better separation than a short-time Fourier transform (STFT) based approach provided the window duration is sufficiently small, which is also desirable for low-latency/online source separation [3]. Further, dual-path processing to model the long-time and short-time relations in the data is found to be beneficial [4]. Convolutional layers are used in [3], recurrent layers in [4], transformers in [5], [6], and a combination of recurrent and transformer layers in [7], to compose the DNN separator. Strategies such as incorporating auxiliary speaker loss are also studied [8]. The number of speakers in the mixture is assumed to be known, a-priori, in [1], [2], [3], [4], [5], [6], [7], [8].
Several works [9], [10], [11], [12] have addressed the separation of an unknown number of speakers (mixtures of more than two speakers). In [9], a recursive multi-pass source extraction strategy is proposed for a STFT-domain masking network. The network, trained on two speaker mixtures, was found to generalize to mixtures of zero to two speakers. A recursive source separation scheme, using a one-vs-rest training strategy, was proposed in [10]. The approach was shown to generalize to an unseen number of sources. In the multi-decoder approach of [11], several decoders, one for each possible number of speakers, were used with a common encoder and trained along with a speaker-counting head. In [12], multiple architectures, one for each possible speaker count, were trained and a voice activity based criterion is used to infer the number of speakers. The above works require multiple forward passes [10], multiple decoders [11] or multiple models [12]. To overcome these limitations, we proposed an architecture in [13], in which the speaker counting is achieved using an encoder-decoder-attractor (EDA) module [14] and the speech separation module uses a network of transformer layers configured to model short-time, long-time and cross-speaker relationships.
Previous works on source separation focused mainly on clean speech mixtures, but in real recordings, the mixtures are often noisy and have reverberation. WHAM! [15] and WHAMR! [16] datasets were proposed to develop source separation algorithms for noisy and reverberant mixtures. Joint separation and enhancement was studied in [17], [18], but they assume a fixed number of sources. The approach in [12], which uses a different model for each speaker count, was recently extended to noisy and reverberant mixtures in [19].
In this paper, we develop a unified approach to the tasks of speech detection, speaker counting and separation from a noisy single-channel recording. The joint task is posed as that of "predicting zero or more speakers, and estimating the individual speaker signals when the predicted speaker count is greater than zero". The work uses the DNN architecture we recently proposed in [13]. Unlike [13], where the DNN is trained on mixtures of two or more speakers, in the present work we train it on noise-only signals (zero speakers), noisy single-speaker recordings, and noisy multi-speaker mixtures. We show that the proposed architecture (i) predicts a speaker count of zero for a noise-only recording, (ii) auto-encodes/enhances the single-speaker recordings, and (iii) estimates the individual source signals for noisy multi-speaker recordings. Further, we study the generalization capabilities of the trained network for different practical evaluation conditions and provide insights into the feature representations computed by the DNN at various layers, which provide cues for efficient architecture designs in the future.
II. SYSTEM OVERVIEW
Let $\mathbf{x}$ be the vector representation of a recording with a duration of L samples,

$$\mathbf{x} = \sum_{j=1}^{J} \mathbf{s}_j + \mathbf{v}, \qquad (1)$$

where J is the number of speakers, $\mathbf{s}_j$ is the j-th source signal and $\mathbf{v}$ is the noise signal. The source signals and the noise signal are assumed to be mutually uncorrelated. The goal in this work is to (i) predict J = 0 when $\mathbf{x} = \mathbf{v}$, i.e., when speech is absent, and (ii) predict J and estimate $\mathbf{s}_j$, ∀j, when $\mathbf{x}$ follows (1) with J ≥ 1. We consider the supervised learning approach using a DNN in this paper, and $f_\theta(\cdot)$ represents the function learned by the DNN architecture having parameters θ. We consider masking, using a DNN, in the learned time-feature (TF) domain [3], as shown in Fig. 1. The waveform encoder, with a learnable analysis filter-bank, converts the time-domain signal ($\mathbf{x}$) into the TF domain ($\mathbf{X}$). The masking network, using the EDA block in the architecture, predicts the number of speakers $\hat{J}$ and a mask $\mathbf{M}_j$ for each predicted speaker (when $\hat{J}$ > 0). The predicted masks are multiplied with the mixture TF representation and input to the waveform decoder. The decoder synthesis filter-bank reconstructs the source signals ($\hat{\mathbf{s}}_j$, ∀ j ∈ [1, $\hat{J}$]) independently. We refer to the proposed DNN architecture as "Speech Detector and Separator using Encoder-Decoder-Attractors" (SD-SepEDA) in the following sections.

III. SD-SEPEDA SEPARATOR
A block diagram of the proposed SD-SepEDA architecture is shown in Fig. 2. The individual blocks in the architecture are described in this section.

A. Time-Feature Representation
The L-dimensional mixture signal $\mathbf{x}$ is fed to the waveform encoder, which generates a non-negative, sub-sampled, time-feature representation $\mathbf{X}$. The encoder is composed of a 1D convolution layer (Conv1D) with H filters followed by a rectified linear unit (ReLU) activation. An over-complete representation using short-time windows is found to benefit source separation [3]; hence, we choose a kernel size equivalent to 2 ms (window length) with a stride of 1 ms (hop size), i.e.,

$$\mathbf{X} = \text{ReLU}(\text{Conv1D}(\mathbf{x})) \in \mathbb{R}^{H \times N},$$

where N is the number of time-frames.
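The encoder described above can be illustrated with a minimal NumPy sketch. It assumes an 8 kHz sampling rate (not stated in this section), so that the 2 ms window is 16 samples and the 1 ms hop is 8 samples; the value H = 64, the random filters and the function name `waveform_encoder` are placeholders for illustration, not the trained model.

```python
import numpy as np

def waveform_encoder(x, filters, stride):
    """Conv1D (no padding) followed by ReLU: returns X of shape (H, N)."""
    H, kernel = filters.shape
    N = (len(x) - kernel) // stride + 1          # number of time frames
    # Gather the overlapping signal frames, shape (N, kernel)
    idx = np.arange(kernel)[None, :] + stride * np.arange(N)[:, None]
    frames = x[idx]
    return np.maximum(filters @ frames.T, 0.0)   # ReLU -> non-negative

rng = np.random.default_rng(0)
fs = 8000                                    # assumed sampling rate
kernel, stride = 16, 8                       # 2 ms window, 1 ms hop at 8 kHz
x = rng.standard_normal(fs)                  # 1 s of placeholder signal
filters = rng.standard_normal((64, kernel))  # H = 64 is a placeholder
X = waveform_encoder(x, filters, stride)
print(X.shape)
```

With a 1 ms hop, a one-second input yields roughly one thousand frames, which is why a chunk of K = 250 frames corresponds to about 250 ms.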

B. Masking Network
Input processing: The masking network input X is first passed through a layer-norm (LN) layer [20], followed by a linear layer without bias. It is then segmented into chunks of size K frames with a 50% overlap between successive chunks. In the present work, we choose K = 250 which corresponds to a chunk size of 250 ms and a hop of 125 ms.
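The chunking step can be sketched as follows. This is an illustrative NumPy version only: the zero-padding of the last chunk and the function name `segment` are assumptions, since the text does not specify how the sequence boundary is handled.

```python
import numpy as np

def segment(X, K):
    """Split X of shape (H, N) into 50%-overlapping chunks -> (H, K, C)."""
    H, N = X.shape
    hop = K // 2
    # Number of chunks needed to cover all N frames (at least one)
    C = max(1, int(np.ceil((N - K) / hop)) + 1)
    pad = (C - 1) * hop + K - N              # assumed: zero-pad the tail
    Xp = np.pad(X, ((0, 0), (0, pad)))
    chunks = [Xp[:, c * hop : c * hop + K] for c in range(C)]
    return np.stack(chunks, axis=-1)

X = np.arange(2 * 1000, dtype=float).reshape(2, 1000)
V = segment(X, K=250)                        # 250 frames per chunk, hop of 125
print(V.shape)
```

Each frame (except near the boundaries) appears in two chunks, which is what later allows the chunking to be inverted by overlap-add.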
Dual-path block: The dual-path block (DPB), shown in Fig. 3, models the short-time and long-time relations in the input using transformers [21]. The block composition is identical to the SepFormer block proposed in [5]. The intra-chunk transformer performs attention across the time-steps within a chunk, while treating the chunk dimension as the batch. In contrast, the inter-chunk transformer performs attention across the chunks at every time-frame. The axes of the inputs and outputs of the transformers are appropriately permuted and reshaped as shown in Fig. 3. Skip connections are implemented across the intra-chunk and inter-chunk transformers after passing the transformer outputs through an LN layer.
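The axis permutations used by the two transformers can be illustrated with a stripped-down example. The sketch below uses a single-head dot-product self-attention without learned projections, layer normalization or skip connections, so it shows only which axis is attended over and which is treated as the batch; it is not the SepFormer block itself, and the tensor sizes are arbitrary.

```python
import numpy as np

def self_attention(Z):
    """Single-head dot-product self-attention on Z of shape (batch, T, H)."""
    scores = Z @ Z.transpose(0, 2, 1) / np.sqrt(Z.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # softmax over T
    return w @ Z

rng = np.random.default_rng(0)
H, K, C = 8, 6, 4                  # features, frames per chunk, chunks
V = rng.standard_normal((H, K, C))

# Intra-chunk: chunks are the batch, attention runs over the K frames
intra_in = V.transpose(2, 1, 0)                          # (C, K, H)
intra_out = self_attention(intra_in).transpose(2, 1, 0)  # back to (H, K, C)

# Inter-chunk: frame positions are the batch, attention runs over the C chunks
inter_in = V.transpose(1, 2, 0)                          # (K, C, H)
inter_out = self_attention(inter_in).transpose(2, 0, 1)  # back to (H, K, C)
```

Because the chunk axis is the batch in the intra-chunk case, changing one chunk leaves the intra-chunk output of every other chunk unchanged; the inter-chunk step is what propagates information across chunks.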
The transformers consist of a stack of transformer encoder layers and an optional position encoding, as shown in Fig. 4. The input feature matrix is optionally added to a positional encoding matrix and input to a stack of P transformer encoder layers followed by an LN layer. We use the sinusoidal position encoding in this work. The composition of the transformer encoder layer, shown in Fig. 4, closely follows the definition in [21], with pre-normalization and dropout. In this work, we use P = 4 layers in the intra-chunk transformer and P = 2 layers in the inter-chunk transformer.
Attractor generation and speaker counting: The purpose of the attractor generation module is to generate a conditioning vector for each speaker in the mixture signal and to facilitate speaker counting. The attractors are generated on chunk-level aggregated representations, as they capture speaker characteristics better than the fine frame-level features. A typical choice for the intra-chunk sequence aggregation is mean pooling, which gives equal importance to all the time frames, including the speech silences. In this work, we consider the self-attentive weighted subspace projection strategy shown in Fig. 5, which weights the samples based on their importance as computed by the attention weights. Such a strategy is found to encode the sequence characteristics effectively and improve the final performance, for example, in the context of speaker verification in [22]. In this strategy, r different weighted linear combinations of a low-dimension (H/r) projection of the input V are concatenated to generate the summary vector. The weights are computed from a higher-dimension (4H) projection of V, as shown in the top section of Fig. 5.
The EDA module, shown in Fig. 6, is composed of an LSTM encoder and an LSTM decoder. The operation of the EDA module is similar to the method proposed in [14]. The input $\mathbf{W}$ of chunk-level representations is first shuffled along the chunk dimension, to avoid learning sequence-related information and to model only the speaker attributes. The shuffled representations are fed to the LSTM encoder. The state $\mathbf{c}_{e,C}$ of the encoder at the last step (chunk C), comprising the hidden state and the cell (memory) state, encodes a file-level summary of the speakers in the recording, i.e.,

$$\mathbf{c}_{e,C} = \text{LSTM}_{\text{encoder}}(\text{shuffle}(\mathbf{W})).$$
The state $\mathbf{c}_{e,C}$ is used as the state at step 0 of the LSTM decoder. The LSTM decoder is run for J + 1 steps for a recording with J speakers. In step j, it takes a vector of zeros as the input and generates an embedding $\mathbf{a}_j$, referred to as the "attractor". The J + 1 attractors are fed to the attractor existence detection module. In addition, the first J attractors are multiplied element-wise with the DPB output V to generate J parallel channels, i.e., channel j is the element-wise product of $\mathbf{a}_j$ and V. The attractor existence detection module is similar to a binary classifier and is composed of a Linear-Sigmoid layer. The module is trained to predict a 1 or a 0 depending on whether or not the attractor should correspond to a speaker. For a recording with J speakers, the module is expected to predict a 1 for the attractors $\{\mathbf{a}_1, \ldots, \mathbf{a}_J\}$ and a 0 for $\mathbf{a}_{J+1}$.
During training, for a recording with J speakers, we extract J + 1 attractors. During inference, if the number of speakers is not known a priori, the attractors are generated sequentially as described in Algorithm 1, which returns the attractors A and the speaker count $\hat{J}$.
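A sequential generation loop of this kind could look like the sketch below. It is not a transcription of Algorithm 1: the stopping threshold of 0.5, the cap `max_speakers`, and the helper names `decoder_step` and `exist_prob` are assumptions introduced for illustration, and the toy stand-ins at the bottom replace the trained LSTM decoder and Linear-Sigmoid layer.

```python
import numpy as np

def generate_attractors(decoder_step, exist_prob, state, dim,
                        threshold=0.5, max_speakers=10):
    """Generate attractors one at a time until one is judged non-existent."""
    attractors = []
    zero_input = np.zeros(dim)               # decoder input is all zeros
    for _ in range(max_speakers):
        a, state = decoder_step(zero_input, state)
        if exist_prob(a) < threshold:        # first "0" ends the sequence
            break
        attractors.append(a)
    A = np.stack(attractors) if attractors else np.empty((0, dim))
    return A, len(attractors)

# Toy stand-ins (hypothetical): the state is a step counter and the
# attractor values shrink at each step, so the existence probability decays.
def toy_decoder_step(inp, state):
    return np.full(inp.shape, 2.5 - state), state + 1

def toy_exist_prob(a):
    return 1.0 / (1.0 + np.exp(-a.mean()))   # sigmoid of a linear readout

A, J_hat = generate_attractors(toy_decoder_step, toy_exist_prob, 0, dim=4)
print(J_hat, A.shape)
```

If the very first attractor is rejected, the loop returns an empty set and a count of zero, which corresponds to the noise-only case.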
Triple-path block: The triple-path block (TPB) extends the dual-path block processing with an additional inter-channel transformer, as shown in Fig. 3. In this block, the intra-chunk and inter-chunk transformers model the J channels independently. In the inter-channel transformer, the intra- and inter-chunk dimensions are treated as the batch size and the attention is applied along the channel dimension. We do not use position encoding in the inter-channel transformer, since the channel order is arbitrary. The inter-channel transformer is also bypassed for recordings with J = 1. The inter-channel transformer is similar to the strategy implemented in [23] for multi-channel speech enhancement.
Mask prediction: The mask prediction follows the steps from the SepFormer architecture [5]. The 4D output of the triple-path block is passed through a pReLU layer and the overlap-add method is used to invert the chunking operation. The signals are then fed to a Linear-Tanh layer and a Linear-Sigmoid layer in parallel, and the corresponding outputs are multiplied element-wise to compute a gated output. Finally, a Linear-ReLU layer predicts the source masks $\mathbf{M}_j$, ∀j ∈ {1, 2, . . . , J}, for the predicted sources. TF representations for the separated sources are obtained as

$$\hat{\mathbf{S}}_j = \mathbf{M}_j \odot \mathbf{X}, \quad j = 1, \ldots, J.$$
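The gated mask computation and the masking of the mixture representation can be sketched in a few lines of NumPy. The sketch starts after the overlap-add step, omits bias terms, and uses random weight matrices `W_t`, `W_s` and `W_o`; these names and the tensor sizes are illustrative assumptions, not the trained layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_masks(U, W_t, W_s, W_o):
    """U: (J, N, H) per-source features after overlap-add -> masks (J, N, H)."""
    gated = np.tanh(U @ W_t) * sigmoid(U @ W_s)   # Linear-Tanh x Linear-Sigmoid
    return np.maximum(gated @ W_o, 0.0)           # Linear-ReLU -> masks M_j

rng = np.random.default_rng(0)
J, N, H = 2, 5, 4
U = rng.standard_normal((J, N, H))
W_t, W_s, W_o = (rng.standard_normal((H, H)) for _ in range(3))
X = np.abs(rng.standard_normal((N, H)))   # non-negative mixture TF representation
M = predict_masks(U, W_t, W_s, W_o)
S_hat = M * X[None]                       # masked TF representation per source
```

Because the final activation is a ReLU, the masks are non-negative but not bounded above by one, so a mask can amplify as well as attenuate an encoder feature.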

C. Signal Reconstruction
The waveform decoder uses a 1D transpose-convolution layer (Tr-Conv1D) with the same number of filters, kernel size, and stride as the waveform encoder discussed in Section III-A.

D. Loss Function
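To train the DNN, we use a weighted combination of the signal estimation loss $\mathcal{L}_{\text{signal}}$ and the attractor existence loss $\mathcal{L}_{\text{attr}}$, i.e.,

$$\mathcal{L} = \mathcal{L}_{\text{signal}} + \eta \, \mathcal{L}_{\text{attr}},$$

where η > 0 is the weight value.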
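The negative of the scale-invariant signal-to-noise ratio (SI-SNR), averaged over the speakers in the mixture and computed in a permutation-invariant manner [24], is used for $\mathcal{L}_{\text{signal}}$. Let $\mathcal{P}$ denote the set of all possible permutations of the estimated source signals. The signal estimation loss $\mathcal{L}_{\text{signal}}$ is then computed as

$$\mathcal{L}_{\text{signal}} = \min_{\pi \in \mathcal{P}} \; -\frac{1}{J} \sum_{j=1}^{J} \text{SI-SNR}(\mathbf{s}_j, \hat{\mathbf{s}}_{\pi(j)}).$$

The SI-SNR measure, for a reference and estimated signal pair $(\mathbf{s}, \hat{\mathbf{s}})$, is defined as [3], [25]

$$\text{SI-SNR}(\mathbf{s}, \hat{\mathbf{s}}) = 10 \log_{10} \frac{\|\alpha \mathbf{s}\|^2}{\|\hat{\mathbf{s}} - \alpha \mathbf{s}\|^2},$$

where $\alpha = \langle \hat{\mathbf{s}}, \mathbf{s} \rangle / \langle \mathbf{s}, \mathbf{s} \rangle$ is the scale parameter. For a mixture with J speakers, $\mathcal{L}_{\text{attr}}$ is computed as the binary cross-entropy between the J + 1 attractor existence probabilities and their targets (1 for the first J attractors and 0 for the last one). To compensate for the scale differences between $\mathcal{L}_{\text{signal}}$ and $\mathcal{L}_{\text{attr}}$, we choose η = 10, unless otherwise stated.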
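The signal estimation loss can be illustrated with a short NumPy sketch of SI-SNR and its permutation-invariant average. The small constant `eps`, the absence of mean removal, and the exhaustive search over permutations are choices made for this illustration and are not confirmed details of the training code.

```python
import itertools
import numpy as np

def si_snr(s, s_hat, eps=1e-8):
    """Scale-invariant SNR in dB between reference s and estimate s_hat."""
    alpha = np.dot(s_hat, s) / (np.dot(s, s) + eps)   # scale parameter
    target = alpha * s
    noise = s_hat - target
    return 10.0 * np.log10((np.dot(target, target) + eps) /
                           (np.dot(noise, noise) + eps))

def pit_si_snr_loss(refs, ests):
    """Negative mean SI-SNR, minimized over all source permutations."""
    J = len(refs)
    best = np.inf
    for perm in itertools.permutations(range(J)):
        loss = -np.mean([si_snr(refs[j], ests[perm[j]]) for j in range(J)])
        best = min(best, loss)
    return best

rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(1000), rng.standard_normal(1000)
e1 = s1 + 0.1 * rng.standard_normal(1000)
e2 = s2 + 0.1 * rng.standard_normal(1000)
loss = pit_si_snr_loss([s1, s2], [e2, e1])   # estimates given in swapped order
```

The exhaustive search evaluates J! permutations, which is still small for J ≤ 5 (120 permutations).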

A. Datasets
We conducted the experiments on noisy-speech mixtures, simulated using single-speaker recordings from the WSJ0 corpus [26] and noises from the WHAM! corpus [15].
WSJ0-Mix dataset: WSJ0-JMix is a synthetic dataset of J-speaker mixtures (J ∈ {2, . . . , 5}), composed using clean WSJ0 recordings [26] and created as defined in [2] using an open-source Python tool. To create a mixture, (i) the clean signals were normalized, (ii) scaled with gain values sampled randomly, and (iii) summed and then normalized such that the peak amplitude of the mixture signal is 0.9. Let $g_1$, $g_2$ be gain values uniformly sampled from the range [0, 2.5] dB; the gains for the individual speaker signals for two- to five-speaker mixtures are defined in terms of $g_1$ and $g_2$. For each J, the dataset has 20 K, 5 K and 3 K recordings in the train, validation and test splits, respectively. These three splits have disjoint speakers. We used the "min" duration mode of the dataset (as defined in [2]) for the experiments in this paper, which also guarantees that the mixture signals have a full overlap between the speakers except for the intra-recording silences. We refer to the original recordings from WSJ0 as single-speaker mixtures (WSJ0-1Mix). The train, validation and test splits of WSJ0-1Mix have 8769, 3557 and 1770 recordings, respectively. We refer to the pool of 1-5 speaker mixtures as the WSJ0-Mix dataset.
WHAM!-WSJ0-Mix dataset: We paired the mixtures from the WSJ0-Mix dataset with noise signals sampled randomly from the WHAM! corpus. We used the same train, validation and test splits of the raw recordings defined in [15] to sample the noise signals for the corresponding splits of WSJ0-Mix. The noise signals were added to the speech mixtures at a mixture-signal-to-noise ratio (MSNR) sampled randomly from the range 30-40 dB, for all the splits of the dataset. Additionally, we evaluated the trained models using a wider range of MSNR values in Section VI. For the training examples, we paired each mixture with a different noise sample and a different MSNR value in each epoch. For validation and test, each mixture from WSJ0-Mix was paired with the same unique noise sample and MSNR value at all epochs.
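Adding noise at a prescribed MSNR can be sketched as below. The sketch assumes that the MSNR is defined as the ratio of the average power of the speech mixture to the average power of the scaled noise; the text does not spell out the definition, and the function name `add_noise_at_msnr` and the placeholder signals are illustrative.

```python
import numpy as np

def add_noise_at_msnr(mixture, noise, msnr_db):
    """Scale noise so that 10*log10(P_mixture / P_noise) equals msnr_db."""
    p_mix = np.mean(mixture ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_mix / (p_noise * 10.0 ** (msnr_db / 10.0)))
    return mixture + gain * noise, gain

rng = np.random.default_rng(0)
mixture = rng.standard_normal(8000)       # placeholder speech mixture
noise = rng.standard_normal(8000)         # placeholder noise segment
msnr_db = rng.uniform(30.0, 40.0)         # sampled as in the training range
noisy, gain = add_noise_at_msnr(mixture, noise, msnr_db)
```

Drawing a new noise segment and a new `msnr_db` for each training example in every epoch gives the dynamic pairing described above, while fixing the random seed per example reproduces the static pairing used for validation and test.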
We also created noise-only recordings, sampled from WHAM!, and refer to them as 0-speaker mixtures. We sampled 20 K, 5 K and 3 K examples for the train, validation and test splits. Each example in the 0-speaker mixtures set has a duration of 4 s. We refer to the cumulative set of 0-speaker mixtures and the 1-5 speaker mixtures of WSJ0-Mix added with WHAM! noises as the WHAM!-WSJ0-Mix dataset.
The mixture signal duration was limited to a maximum of 15 s during training and not limited for the test recordings. The duration of the mixture recordings in the test splits varied from 1.62 s to 13.87 s.

B. Training Details
The SpeechBrain [27] platform was used to train the models. The Adam optimizer [28] was used with an initial learning rate of $1.5 \times 10^{-4}$ and a batch size of 1. The learning rate was fixed for the first 20 epochs and halved afterwards whenever the validation SI-SNR did not increase for two consecutive epochs. A loss threshold of −30 dB was applied to the negative SI-SNR loss function. The norm of the gradients was clipped to 5. Automatic mixed precision was used to increase the training speed. Time-domain speed perturbation [29] was used as the audio augmentation scheme with perturbation factors of 95%, 100% and 105%. The model parameters corresponding to the best validation SI-SNR were used for the final evaluation.
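One possible reading of the learning-rate schedule is sketched below in plain Python. The interpretation of "not increasing for two consecutive epochs" as "the best of the last two epochs does not exceed the best value seen before them" is an assumption, as is the function name `update_lr`; this is not the SpeechBrain scheduler itself.

```python
def update_lr(lr, epoch, val_history, hold_epochs=20, patience=2, factor=0.5):
    """Return the learning rate to use after `epoch`.

    val_history holds the validation SI-SNR per epoch (higher is better),
    including the current epoch.
    """
    if epoch <= hold_epochs or len(val_history) <= patience:
        return lr                                  # fixed during the hold phase
    best_before = max(val_history[:-patience])
    if max(val_history[-patience:]) <= best_before:
        return lr * factor                         # no improvement: halve
    return lr

lr = 1.5e-4
lr_held = update_lr(lr, 10, [10.0, 9.0, 9.0, 9.0])       # within first 20 epochs
lr_halved = update_lr(lr, 25, [10.0, 12.0, 11.5, 11.9])  # stalled for two epochs
lr_kept = update_lr(lr, 25, [10.0, 12.0, 11.5, 12.3])    # improved again
```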

C. Performance Measures
We measured the performance of the proposed approach using the SI-SNR improvement (SI-SNRi), SDR improvement (SDRi) [30], and speaker-counting accuracy (SCA) measures. The number of sources is known and fixed during training, but the estimated number of sources during inference can be different from the ground truth. When the number of sources was under-estimated, an all-zero signal was used as the pseudo-estimate for the sources not accounted for in the output, to compute the SI-SNRi measure. On the other hand, if the number of sources was over-estimated, the subset of sources with the maximum average SI-SNRi was used for the performance metric computation. In all cases, the performance measures were computed for the optimal pairing of estimated and reference sources, i.e., accounting for the permutation ambiguity. Computation of SDR involves a projection of the estimated signal into the desired signal and interfering speaker signal subspaces [30]. This is not possible if either the desired signal or its estimate [31] for the computation of SDRi measure. SCA was calculated as the percentage of test recordings with correctly estimated speaker count ($\hat{J} = J$).
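The handling of a mismatched speaker count during evaluation can be sketched as a single search over ordered subsets of the estimates. The function name `match_sources` is hypothetical, the sketch assumes J ≥ 1, and the example uses a negative mean-squared error as a stand-in score, because the text does not say how SI-SNR is evaluated for an all-zero pseudo-estimate. Since the mixture is the same for every pairing, maximizing the average score is equivalent to maximizing the average improvement.

```python
import itertools
import numpy as np

def match_sources(refs, ests, score):
    """Return (best average score, estimate index chosen for each reference)."""
    refs, ests = list(refs), list(ests)
    J = len(refs)                                   # assumes J >= 1
    # Under-estimation: pad with all-zero pseudo-estimates
    ests += [np.zeros_like(refs[0])] * (J - len(ests))
    best_val, best_perm = -np.inf, None
    # Ordered J-subsets cover both subset selection and permutation
    for perm in itertools.permutations(range(len(ests)), J):
        val = np.mean([score(refs[j], ests[p]) for j, p in enumerate(perm)])
        if val > best_val:
            best_val, best_perm = val, perm
    return best_val, best_perm

def neg_mse(ref, est):
    return -np.mean((ref - est) ** 2)               # stand-in quality score

rng = np.random.default_rng(0)
r0, r1 = rng.standard_normal(100), rng.standard_normal(100)
junk = rng.standard_normal(100)
_, over = match_sources([r0, r1], [junk, r1 + 0.01, r0 + 0.01], neg_mse)
_, under = match_sources([r0, r1], [r1], neg_mse)
```

In the over-estimated example the spurious third output is ignored, and in the under-estimated example the missing source is paired with the all-zero signal.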

V. RESULTS
We evaluated the proposed SD-SepEDA model using (i) the WHAM!-WSJ0-Mix test sets, which represent the matched condition for the training, and (ii) the WSJ0-Mix test sets without noise. For each condition, we evaluated the model with speaker count estimation (as described in Algorithm 1) and with the ground-truth speaker count (as in the training stage). Table I shows the signal estimation performance for the WHAM!-WSJ0-Mix test sets. For J = 0, we found that only 1 out of the 3000 test samples is identified as having a speaker ($\hat{J} = 1$). For J = 1, the input MSNR is high (sampled from the range 30-40 dB) and hence the improvement is small. However, the results indicate that the model is capable of signal enhancement. For J > 2, the metrics reflect mainly the source separation quality, and we see that the performance degrades with an increase in J. The incorrect speaker count estimation contributes significantly to the performance degradation for J > 3, since inference using the ground-truth speaker count for the same model achieves better performance. Table I also shows the average performance across only the recordings with correct speaker count estimation. The SI-SNRi and SDRi measures are comparable on these recordings with correctly estimated speaker count. Table II shows the speaker confusion matrix. We see that the model has a tendency to under-estimate the number of speakers. We observed that the model does not estimate more than 5 speakers in general (the maximum number of speakers in the train set). For 0-3 speakers, the SCA obtained may be satisfactory for most practical applications.
TABLE III. PERFORMANCE FOR WSJ0-MIX TEST SETS AND COMPARISON WITH THE BASELINE SYSTEMS
TABLE IV. SPEAKER CONFUSION MATRIX FOR WSJ0-MIX TEST SETS
The performance obtained for the clean WSJ0-Mix test sets is shown in Tables III and IV. For J = 1, the table shows the reconstruction SI-SNR (no noise and no interfering speakers). We see that the reconstruction SI-SNR is close to 50 dB and the SCA is 100%. For the mixture recordings (J > 1), the performance trends are similar to the WHAM!-WSJ0-Mix test set results shown in Tables I and II. Table III also shows the performance of three baselines: the recursive separation system in [10], the MulCAT approach of [12], which uses multiple models, and our previously reported SepEDA model [13]. We note that the baseline systems in [12], [13] work only for mixtures of two or more speakers (J ≥ 2), as they are trained on two- to five-speaker mixtures, unlike the proposed model, which works for J ≥ 0. The method in [10] can, in theory, detect single-speaker recordings, but single-speaker recordings are not considered in the experiments in [10]. Comparing with the baselines, we see that SD-SepEDA has better performance than the current best architectures reported in the literature. Compared to the SepEDA model reported in [13], we see a slight degradation in performance for the condition where the speaker count is estimated. For the known number of sources condition, the proposed model is found to be better for J > 2. The speaker counting is also better for J < 5 for the proposed model compared to the SepEDA model of [13].
To investigate further, we studied the distribution of source-wise SI-SNRi for the mixture recordings (J > 1) in Fig. 7. For J = 2 and J = 3, the peak in the distribution is observed beyond 20 dB, and most of the recordings have more than 15 dB SI-SNRi. A significant left-leaning tail is observed for J > 3, though the peaks are beyond 15 dB SI-SNRi. This indicates that, for J > 3, a subset of speakers in the mixture are correctly estimated, and the remaining speakers may be confused. Fig. 8 shows the effect of the median pitch frequency of individual speakers on the separation quality for the two-speaker mixtures. The frame-level pitch values were computed using the Parselmouth Python tool [32] and the median was computed over the voiced speech regions. We see that the separation is poor closer to the diagonal, i.e., when the difference in median pitch frequency between the two speakers is small. There are two clusters with poor separation, in the low and high frequency regions, which correspond to same-gender speaker mixtures. We also studied the relation between SI-SNRi and the x-vector cosine similarity between the speakers for the two-speaker mixtures. For x-vector extraction, we used the pre-trained extractor available in the SpeechBrain framework [27]. Fig. 9 shows the scatterplot of cosine similarity against the SI-SNRi. We see that almost all recordings which have a poor SI-SNRi (< 15 dB) have a higher cosine similarity, but the converse is not true. The analysis shows that the architecture may be using not just timbre and pitch but also other information, such as the speech onset instances, for source separation.
Next, we study the performance as a function of the duration of the test samples. Fig. 10 shows the distribution of SI-SNRi of the test samples divided into three groups by duration for the mixtures of two-five speakers. The median performance is similar for the different duration groups for a given speaker count. However, the worst-case performance is improved as the duration of the sample increases, which can be seen for the mixtures with more than three speakers in Fig. 10.
We measured the computational complexity of the proposed architecture using the number of GigaFLOPs estimated using the open-source flop-counter tool.
As shown in Table V, the SCA degrades with decreasing MSNR except for J = 5, which is also reflected in the SI-SNRi values. At low MSNRs, the speaker count estimated is either 0 (i.e., no speech) or 5 (the maximum predicted by the model). This also explains the better SCA obtained at low MSNRs for J = 5. For the 20-30 dB condition, the performance is comparable to the 30-40 dB case.
SIR variation: For the analysis in this section, we used the WSJ0-Mix test sets with 2-5 speakers, but with different relative gains for the mixed signals, which translate to differences in the SIR. The training dataset as defined in Section IV-A has gains $g_1$, $g_2$ sampled from the range [0, 2.5] dB. For the evaluation in this section, we sampled the gains from three ranges: [0, 5] dB, [0, 10] dB and [0, 15] dB. The SCA and the separation accuracy degrade with an increase in the SIR range, as shown in Table VI. We observed that the speaker count is under-estimated for higher SIR cases, i.e., the weak sources are not identified by the model. The performance for the 5 dB condition is also comparable to the train condition (2.5 dB), shown in Table I. Table VI also shows the results for the known number of sources case. The performance degradation is less with the change in the SIR, and closer to the results reported in Table I. This shows that the separation architecture is robust to different SIR conditions, while the EDA module is sensitive to the SIR variation.
Reverberant test set: In this section, we study the performance on reverberant speech of the SD-SepEDA model trained on the WHAM!-WSJ0-Mix dataset. We created a reverberant test set using simulated room impulse responses (RIRs) and the WSJ0-Mix test sets. The RIRs were simulated using the pyroomacoustics [33] Python tool, closely following the procedure and the parameters used to create the WHAMR! dataset [16]. Each example in the WSJ0-Mix test sets was paired with a different set of source RIRs. The sources were first convolved with their corresponding RIRs, scaled according to the SIR value, and added to create the reverberant mixture. We used the SIR values from the WSJ0-Mix definitions to scale the sources. We considered three different reverberation conditions, referred to as low, medium and high.
TABLE VI. PERFORMANCE FOR DIFFERENT SIR CONDITIONS. THE COLUMN LABELS SHOW THE UPPER LIMIT OF THE RANDOM GAINS APPLIED TO SOURCES
For comparison, we also trained the SD-SepEDA model on a reverberated WHAM!-WSJ0-Mix dataset. The reverberated training examples were created dynamically. The reverberation time for the training set was sampled uniformly from the range 0.1-1.0 s, and the remaining parameters were sampled as for the test set described above. The model is trained to predict the reverberant sources given the reverberant mixture inputs, since our goal is source separation and not dereverberation. Table VII shows the results. For the model trained on anechoic speech, we see that the speaker counting is significantly affected by the reverberation. We observed that the speaker count is over-estimated in the reverberant cases. The better SCA for the five-speaker test set is misleading, because the network is observed to not predict more than five speakers. The SI-SNRi also degraded with the reverberation level. For the "low" reverberation condition, the SI-SNRi with the reverberant reference is better than with the direct component reference, indicating that the network estimates the reverberant sources at the output. This can be observed clearly in the single-speaker test set, where the SI-SNRi is closer to the clean speech results shown in Table I. Training the model with reverberant speech significantly improves the speaker-counting performance for the medium and high reverberation cases. The SI-SNRi performance for the high reverberation case is also improved. However, the performance for the low and medium reverberation cases is degraded compared to the model trained on non-reverberant mixtures. This is due to the training set, which has reverberation times sampled from the wider range of 0.1-1.0 s: the model performance is improved for high reverberation recordings at the cost of a degradation for low reverberation cases.

VII. DISCUSSION
We study the processing of a speech mixture by the network, using a synthetic 2-speaker mixture with partial speaker overlap (≈ 50% overlap). This is to understand the attention schemes computed by the different transformer layers in the network. We note that the proposed network is trained and evaluated in Section V with fully overlapping speaker mixtures and partial overlap mixture is used here only for illustration purposes. The source signals are from the WSJ0 test set.
Waveform encoder-decoder: The order of the encoder filters as a function of frequency is random, since neither the network architecture nor the training procedure encourages the encoder filters to be in sorted frequency order. Fig. 11 shows the frequency response of the encoder and decoder filters, sorted by the center frequencies of the encoder filters. We see that the learned center frequencies are on a warped scale, as also observed in other works [3]. Contrary to expectation, the decoder filter at the same index as an encoder filter was not at the same frequency. The waveform encoder-decoder pair is not found to be of the perfect-reconstruction type, i.e., passing the encoder output directly through the decoder does not reconstruct the original signal. However, there exists a mask to reconstruct the single-speaker signal, as we have seen in the results section for the single-speaker recordings.
We observed that the waveform-encoder output has 50% zeros on average. A closer look at the Conv1D filter weights showed that for several filters their negative version is also a filter, i.e., if c is a filter then −c is also a filter (hence 50% zeros). This shows that, with a suitably modified computation of the waveform-encoder output, the number of encoder filters could possibly be halved. A further investigation is beyond the scope of this paper.
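The link between filter pairs (c, −c) and the 50% zeros can be checked numerically. The sketch below uses random frames and a random filter as placeholders; it shows that the two ReLU outputs have complementary supports and that their difference recovers the signed projection, which is one way to see why a single filter could carry the same information. It is not the reformulation intended by the authors, whose expression is not available here.

```python
import numpy as np

rng = np.random.default_rng(0)
x_frames = rng.standard_normal((1000, 16))    # placeholder signal frames
c = rng.standard_normal(16)                   # placeholder encoder filter
filters = np.stack([c, -c])                   # a filter and its negative

out = np.maximum(x_frames @ filters.T, 0.0)   # ReLU outputs, shape (1000, 2)
zero_fraction = np.mean(out == 0.0)           # fraction of exact zeros

# The signed projection is recoverable from the two non-negative outputs
y = out[:, 0] - out[:, 1]
```

For every frame exactly one of the two outputs is zero, so half of the entries vanish, matching the average sparsity reported above.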
We also observed that some filters are repeated, i.e., have the same weights, indicating a redundant representation. This repetition of filters at the same frequency may be helping with the separation of sources with similar characteristics (same pitch range), since the masks can be disjoint between speakers for features of the same frequency filter. For the single-speaker recordings, we observed that the obtained masks have 60% non-zeros on average, which confirms the redundant representation by the encoder.
Signal estimation: Fig. 12 shows the spectrograms of the estimated sources and their corresponding reference signals. For the example shown, there is no leakage across the speaker channels during voiced speech regions. However, the silence/non-speech regions of one channel may leak into the other in practice.
Intra-chunk attention: Fig. 13 shows the intra-chunk attention computed by the four transformer layers in DPB and TPB. We see that the attention weights are concentrated around the main diagonal and that the spread around the diagonal increases with the layer index. This shows that the intra-chunk transformers attend to the current and neighboring frames only. The transformers accumulate temporal context through successive layers, similar to the dilated convolution blocks in ConvTasNet [3]. The spread around the diagonal is higher in the TPB for the two output channels, where the network's goal is to generate speaker-specific outputs.
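One simple way to quantify the spread around the diagonal is the attention-weighted mean distance |i − j| between the query frame i and the attended frame j. The sketch below is illustrative only: the metric, the function names and the Gaussian-shaped toy attention matrices are assumptions, not the measure used to produce Fig. 13.

```python
import numpy as np


def diagonal_spread(attn):
    """Attention-weighted mean distance |i - j| from the main diagonal.

    attn: (frames, frames) row-stochastic attention matrix.
    """
    attn = np.asarray(attn, dtype=float)
    idx = np.arange(attn.shape[-1])
    dist = np.abs(idx[:, None] - idx[None, :])
    # Expected distance per query frame, averaged over all query frames.
    return float((attn * dist).sum(axis=-1).mean())


def toy_attention(n_frames, width):
    """Hypothetical attention matrix concentrated around the diagonal."""
    idx = np.arange(n_frames)
    dist = np.abs(idx[:, None] - idx[None, :])
    weights = np.exp(-0.5 * (dist / width) ** 2)
    return weights / weights.sum(axis=-1, keepdims=True)


# A shallow layer (narrow band) versus a deeper layer (wider band).
narrow = diagonal_spread(toy_attention(50, width=1.0))
wide = diagonal_spread(toy_attention(50, width=4.0))
print(narrow, wide)
```

A purely diagonal attention matrix gives a spread of zero, and the value grows as the weights move away from the diagonal, so the metric could be tracked per layer to compare the temporal context of successive transformers.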
Inter-chunk attention: Fig. 14 shows the inter-chunk attention computed by the two transformer layers of DPB and TPB. In DPB (Fig. 14(a)), the first layer has higher attention weights for the active speech chunks and the second layer has higher weights for chunks with non-speech. The attention weights are spread over all the time frames, indicating that the representations computed in DPB are also global. The attention pattern for the first layer appears like a superposition of two similarity matrices, corresponding to two different, overlapping time regions. The different heads of the transformer may be focusing on different time regions. So, in practice, the overall capacity of the network, in terms of the number of sources it can separate, may be limited by the number of heads in the inter-chunk transformer layers.
A different behavior is observed in the inter-chunk transformers of the TPB (Fig. 14(b), (c)). The attention weights in these are concentrated along the diagonal, indicating the modeling of relationships in neighboring chunks, i.e., relationships spanning several hundred milliseconds. In each channel, the attention weights of the second transformer layer are concentrated along the diagonal only in the active speech regions of that particular channel, and they are more spread out during the silence regions. This indicates speaker-selective short-time modeling in the inter-chunk layers of TPB.
Inter-channel attention: Fig. 15 shows the intra-chunk mean and standard deviation of the cross-channel weights (anti-diagonal of the self-attention matrix), across the chunks for the two transformer layers. We see that the weights are always greater than zero, even for chunks which have only one active speaker, showing that there is information sharing across the two output channels. In the first layer, the cross-channel weights are mostly less than 0.5, whereas in the second layer they are greater than 0.5. This shows that the first layer gives higher importance to intra-channel features, while the second layer fuses the information from the other channel. Fig. 15 also shows that the attention weights have a small variance within a chunk.
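The cross-channel statistics can be sketched as follows for J = 2. The sketch assumes single-head scaled dot-product attention without learned projections and uses random features, so it illustrates only the bookkeeping (anti-diagonal extraction and per-chunk statistics), not the trained network. Since each softmax row sums to one, a cross-channel weight below 0.5 means that the channel attends mostly to itself.

```python
import numpy as np


def cross_channel_weights(feats):
    """Anti-diagonal of the per-frame 2x2 inter-channel attention matrix.

    feats: (frames, 2, dim) features of the two output channels.
    Returns an array of shape (frames, 2): the weight channel 0 puts on
    channel 1, and the weight channel 1 puts on channel 0.
    """
    dim = feats.shape[-1]
    # Single head, no learned projections: an assumption of this sketch.
    scores = feats @ feats.transpose(0, 2, 1) / np.sqrt(dim)  # (frames, 2, 2)
    scores -= scores.max(axis=-1, keepdims=True)               # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.stack([weights[:, 0, 1], weights[:, 1, 0]], axis=-1)


def chunk_statistics(cross, chunk_len):
    """Mean and standard deviation of the cross-channel weights per chunk."""
    chunks = cross.reshape(-1, chunk_len, cross.shape[-1])
    return chunks.mean(axis=1), chunks.std(axis=1)


rng = np.random.default_rng(0)
feats = rng.standard_normal((40, 2, 16))   # hypothetical: 4 chunks of 10 frames
cross = cross_channel_weights(feats)
chunk_mean, chunk_std = chunk_statistics(cross, chunk_len=10)
print(chunk_mean.shape, chunk_std.shape)
```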
Attractors: The EDA module generates 3 attractors for this example with J = 2. The existence probabilities of the generated attractors, for the example shown in Fig. 12, were {1.0, 1.0, 4.8e−7}. The module is found to predict the speaker existence with very high confidence. We found this observation to hold for most of the test set, indicating over-fitting to the training data conditions. This could be a reason for the poor SCA in situations where one or more signals in the mixture are loud compared to the other signals, unlike the train/test conditions of the WHAM!-WSJ0-Mix dataset. The attractors are found to be recording specific. Hence, they do not necessarily contain speaker information useful for speaker identification, or for re-identification of a given speaker across different blocks if the proposed method is applied in a block-wise manner to process long recordings.
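A minimal sketch of how a speaker count can be read off the attractor existence probabilities is given below. The threshold of 0.5 and the rule of stopping at the first attractor below the threshold are assumptions that follow the usual encoder-decoder attractor decoding convention; the function name is hypothetical.

```python
def count_speakers(existence_probs, threshold=0.5):
    """Count attractors until the first one judged not to exist.

    existence_probs: existence probabilities in the order of generation.
    threshold: decision threshold (0.5 is an assumption of this sketch).
    """
    count = 0
    for prob in existence_probs:
        if prob < threshold:
            break
        count += 1
    return count


# Existence probabilities reported for the example of Fig. 12.
print(count_speakers([1.0, 1.0, 4.8e-7]))  # prints 2
```

With probabilities this close to 0 and 1, the count is insensitive to the exact threshold, which reflects the high confidence discussed above.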
Gating mechanism: The output of the triple-path block, after overlap-add, goes through a gating scheme before predicting the masks. Fig. 16 shows short segments of the Linear-Tanh and Linear-Sigmoid layer outputs, for the two speaker channels. We see that, while the output of the Linear-Tanh layer is dense, the Linear-Sigmoid layer output is sparse. It provides speaker-sensitive selection across the feature dimension and also suppresses the features during speech pauses. Finally, the matrices are multiplied element-wise and fed to a Linear-ReLU layer to predict the mask.
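The gating scheme described above can be sketched for one speaker channel as follows. The layer sizes, the random weights and the function name are hypothetical; only the structure (Linear-Tanh and Linear-Sigmoid branches, element-wise product, Linear-ReLU) is taken from the text.

```python
import numpy as np


def gated_mask(x, w_tanh, b_tanh, w_sig, b_sig, w_out, b_out):
    """Gated mask prediction for one speaker channel.

    x: (frames, dim) features after overlap-add.
    """
    content = np.tanh(x @ w_tanh + b_tanh)                 # Linear-Tanh branch
    gate = 1.0 / (1.0 + np.exp(-(x @ w_sig + b_sig)))      # Linear-Sigmoid branch
    gated = content * gate                                 # element-wise product
    return np.maximum(gated @ w_out + b_out, 0.0)          # Linear-ReLU mask


rng = np.random.default_rng(0)
frames, dim = 20, 8                                        # hypothetical sizes
x = rng.standard_normal((frames, dim))
params = [
    rng.standard_normal((dim, dim)), np.zeros(dim),        # tanh branch
    rng.standard_normal((dim, dim)), np.zeros(dim),        # sigmoid branch
    rng.standard_normal((dim, dim)), np.zeros(dim),        # output layer
]
mask = gated_mask(x, *params)
print(mask.shape)
```

The sigmoid branch takes values in (0, 1), so it can only attenuate the tanh branch, which is in line with its role of selecting features and suppressing them during speech pauses.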

VIII. ABLATION EXPERIMENTS
In this section, we study the architecture and training choices. For this study, the architectures are trained on the WSJ0-Mix dataset alone, i.e., 1-5 speaker mixtures without noise. Table VIII shows the SI-SNRi and SCA on 1-5 speaker mixtures for the different models. Row V0 corresponds to the architecture trained with the default parameters. Table III shows the corresponding results for the architecture trained on 0-5 speaker mixtures (WHAM!-WSJ0-Mix dataset). The performance in Table VIII is better than that in Table III: the presence of noise-only samples and noise during training is found to deteriorate the speaker counting, especially for J > 2.
First, we study the impact of the weight η used for L_attr in the training loss function (rows V0-V2); η = 10 for the V0 model. SCA decreases with a decrease in η, which is also reflected in the SI-SNRi values for 4 and 5 speaker mixtures. For 2 and 3 speaker mixtures, SI-SNRi is marginally better for smaller η. Model V3 in the table shows the results for a simpler scheme that averages the intra-chunk features instead of using the attentive sequence aggregation scheme discussed in Section II. We see that the performance degrades for single-speaker recordings but is similar or better for the mixture signals.
The chunk-level features are shuffled prior to EDA in row V0. Row V4 shows the results for the model trained without shuffling. We see that the performance with shuffling is slightly better compared to EDA without shuffling.
Next, we study the ordering of the three transformers in the triple-path block. The V0 configuration has the layers in the order intraChunk-interChunk-interChannel. Architectures V5, V6 and V7 show the results for different orderings of the blocks. The configuration V7 performs much worse than the other configurations. Placing the inter-channel transformer at the beginning (V5) or at the end (V0) gives similar performance, though V0 is slightly better. The configuration V6 performs slightly worse than V0 and V5.
The experiments show that the performance is sensitive to the ordering of the transformer layers, whereas the other training parameters have only a marginal impact on the test performance.

IX. CONCLUSION
We proposed a DNN architecture to jointly estimate the speaker count and the individual sources from a single-channel speech mixture of an unknown number of speakers. The network is trained with noise-only signals (i.e., no speakers), single-speaker signals and mixtures of up to five speakers. While the network does not generalize to speaker counts not seen during training, it achieves more than 99% speaker counting accuracy for input signals with zero to three speakers. The SI-SNR improvement for recordings of one to three speakers is more than 19 dB. Through robustness analysis, we showed that the network generalizes to low-reverberation conditions and to a wider range of speaker mixing ratios than those observed during training. We also showed that the network operates by building short-time, medium-time and global file-level representations at different blocks. The analysis also provides insights for the design of compute-efficient transformer architectures for source separation. For example, masked attention transformers with a limited temporal context for each time frame can be used for all the intra-chunk transformers.
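As an illustration of the masked attention suggested above, the following minimal sketch restricts each time frame to a limited temporal context. It assumes single-head scaled dot-product attention, and the function name and the `context` parameter are hypothetical rather than part of the proposed architecture.

```python
import numpy as np


def banded_attention(q, k, v, context):
    """Scaled dot-product attention restricted to |i - j| <= context.

    q, k, v: (frames, dim) query, key and value matrices.
    context: number of neighboring frames visible on each side.
    """
    n_frames, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    idx = np.arange(n_frames)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= context
    # Frames outside the band receive zero weight after the softmax.
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
frames, dim, context = 12, 4, 2            # hypothetical sizes
q, k, v = (rng.standard_normal((frames, dim)) for _ in range(3))
out, weights = banded_attention(q, k, v, context)
print(out.shape)
```

This dense version only demonstrates the masking. The compute savings would come from an implementation that evaluates the scores inside the band only, which reduces the cost per chunk from quadratic to linear in the number of frames for a fixed context.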