State-of-the-Art Analysis of Deep Learning-Based Monaural Speech Source Separation Techniques

The monaural speech source separation problem is an important application in the signal processing field, and the recent integration of deep learning algorithms with signal processing has achieved remarkable performance improvements for speech source separation. This paper surveys state-of-the-art deep learning-based monaural speech source separation algorithms in the time-frequency (T-F), time, and hybrid domains. The motivation, algorithm, and framework of different deep learning models for monaural speech source separation are analyzed. The benchmarked algorithms in the T-F domain can be categorized as deep neural network (DNN), clustering, permutation, multi-task learning, computational auditory scene analysis (CASA), and phase reconstruction-based techniques, whereas the state-of-the-art time-domain approaches can be categorized as CNN, RNN, multi-scale fusion (MSF), and transformer-based techniques. The end-to-end post filter (E2EPF) is a hybrid algorithm combining T-F and time-domain processing to achieve enhanced results. Time-domain models have shown better separation performance than T-F and hybrid domain models while keeping model sizes small. Methods in the T-F, time, and hybrid domains are compared using SDR, SI-SDR, SI-SNR, PESQ, and STOI as quality assessment metrics on several benchmark datasets.


I. INTRODUCTION
The source separation problem occurs when undesired signals mix with the desired signal. The undesired signals include speakers other than the target speaker, interference, reverberation, and background noise. Automatic speech recognition (converting speech into text) [1], assisted living (providing suitable living conditions for older persons and persons with disabilities) [2], and hearing aids (improving the hearing capability of persons with hearing loss) [3] are among the many applications of monaural source separation [4], [5], [6], [7], [8], [9]. Hence, many researchers are interested in working on source separation problems due to their widespread applications. Source separation can be categorized into single-channel (monaural) and multichannel categories. In signal and speech processing, monaural speech source separation is challenging because it separates the target speaker from a mixture of speakers, background noise, and interference in a single-microphone recording. Speaker separation [10], [11], [12], [13], speech enhancement [14], [15], and speech de-reverberation and de-noising [16] come under the single-channel source separation category, as shown in Fig. 1.
Speaker separation extracts two or more speakers from a mixture of two or more speakers [10], [11], [12], [13]. Speech enhancement improves the intelligibility and perceptual quality of noisy speech signals [14], [15] and attempts to separate speech from noisy mixture signals. Speech de-reverberation and de-noising remove reverberation and suppress background noise from the target speaker signal [16], [17]. Speaker separation is the pre-processing stage in many speech-processing applications with multiple speakers, such as multi-speaker automatic speech recognition [18] and multi-speaker emotion recognition [19], [20]. Hence, researchers are motivated to improve speaker separation algorithms. Monaural (single-channel) source separation works with two learning methods: supervised learning (models use previous experience to produce outcomes) and unsupervised learning (models do not have previous experience). Existing review articles describe supervised single-channel speaker separation algorithms in either the signal processing [21], [22], [23] or the time-frequency [24], [25] domain. Conventional single-channel speaker separation techniques such as computational auditory scene analysis (CASA) [26] and non-negative matrix factorization (NMF) [27], [28] in the signal processing domain, and deep learning-based deep clustering (DC) [29], deep attractor networks (DANet) [30], and permutation invariant training (PIT) [31] in the T-F domain, have been reviewed in [32]. A comprehensive review with a background introduction, the formulation of speech separation, and the components of supervised separation, i.e., learning machines, training targets, and acoustic features, along with descriptions of monaural speech enhancement, speaker separation, speech de-reverberation, and multi-microphone techniques, is given in [17]. The articles [17], [32], [33], [34] present interesting reviews of deep learning applied to various speech processing problems. Nevertheless, these review articles cover deep learning-based speaker separation in the T-F domain only in a short portion of the overview. Recently, deep learning-based supervised time-domain algorithms have achieved significant progress, motivating a review of time-frequency, time, and hybrid domain approaches. This paper compares supervised monaural speaker separation algorithms based on deep learning in the T-F, time, and hybrid domains. Available objective performance metrics for evaluating separation models, training objectives, and datasets are introduced to provide the background information needed for deep-learning-based speaker separation in the T-F, time, and hybrid domains. Before the advent of deep learning, signal processing-based approaches performed audio source separation tasks. Signal processing-based speech source separation models can be classified as statistical, clustering, and factorization models, as shown in Fig. 2.
Statistical models include probabilistic models such as Gaussian mixture models (GMM) [35], [36], [37], [38], hidden Markov models (HMM) [39], [40], and factorial hidden Markov models [41], [42]. GMMs can work well for speakers of different genders, and HMMs separate same-gender speakers efficiently. GMM and HMM models assume that the source energy does not change from the mixture signal to the separated signal. This assumption limits the real-time performance of the models. Factorial HMM models [43], [44], [45], a gain-adapted minimum mean square error estimator [46], and a frame-based gain estimation technique [47] overcome this limitation at the cost of increased computational complexity. Clustering methods use computational auditory scene analysis (CASA) [26] and spectral clustering [48], [49], [50] to perform source separation tasks. These methods are based on the principles of auditory scene analysis and attempt to perform separation as the human auditory system does. CASA systems aim to separate a mixture of sound sources the way human ears do; hence, a CASA system can be interpreted as a machine listening system [51], [52]. Factorization models make use of the principle of non-negative matrix factorization (NMF) [27], [28]. Treating the source signal as non-negative keeps its energy unaltered throughout separation with HMM and GMM, as in [27], [53], [54], and [55]. However, real-world source signals can take both negative and positive values.
All these classification-based approaches estimate hard masks to classify each time-frequency (T-F) bin as belonging to one of the sources [56]. Due to this hard decision, essential information related to the sources can be lost, and signal processing-based approaches fail to work well in real-world scenarios. The success of deep learning in various research fields inspires researchers to perform supervised monaural speech source separation in the deep learning domain [57]. Deep learning models with many hidden layers are suitable for dealing with complex real-world data.
Deep learning-based single-channel speech source separation approaches perform separation in the T-F, time, and hybrid domains. In the T-F domain, DNN, clustering, permutation, multi-task learning, CASA, and phase reconstruction-based approaches are used to separate speakers from mixture signals. Deep clustering (DC) [29], deep attractor networks (DANet) [30], and permutation invariant training (PIT) [31] are benchmarked T-F domain approaches. These methods compute the spectra of signals to get into the T-F domain using the short-time Fourier transform (STFT) [58]. Separation is performed by calculating a mask function and multiplying it with the mixture spectrum to obtain the clean speech signal. These methods calculate a soft mask function instead of a hard mask and hence achieve better separation accuracy than signal processing-based approaches. The STFT is a suboptimal transformation for speech because it is a general-purpose transform that is not designed specifically for speech signals. The T-F domain methods only process the magnitude spectrum and leave the phase spectrum unchanged, which causes phase-magnitude decoupling. Phase reconstruction-based approaches overcome this limitation but with limited performance gains and increased complexity. The separation accuracy of T-F domain methods increases with the window size, at the cost of model size and complexity. STFT computation, phase-magnitude decoupling, and the long contextual window are the limitations of T-F domain methods and inspire researchers to work in the time domain. Time-domain approaches use data-driven representations instead of T-F domain spectrograms. In these methods, separate modules are designed for the data-driven representation and its inverse transformation, so these methods comprise encoder, separation, and decoder modules. The encoder converts the time-domain mixture speech signal into an encoded representation. The separation module calculates the mask function from the encoder output, and the calculated mask functions are multiplied with the encoded mixture to separate the sources. Then the decoder transforms the separated sources back into time-domain waveforms. Deep learning-based time-domain speech source separation can be categorized into CNN, RNN, multi-scale fusion (MSF), and transformer-based approaches, as well as techniques without an encoder-decoder framework. The time-domain audio separation network (TasNet) [59] and the convolutional TasNet (ConvTasNet) [60] are examples of time-domain audio source separation work. Time-domain approaches overcome the limitations of T-F domain approaches, namely STFT computation, magnitude-phase decoupling, and the long context window. The end-to-end post filter is a hybrid method performing separation in both the T-F and time domains. This paper reviews T-F, time, and hybrid domain deep learning-based monaural audio source separation approaches. Section II explains the performance measures used to compare audio source separation outcomes. Section III presents existing training objectives used to train deep learning models for speech source separation tasks. Section IV describes the available datasets for monaural speech source separation. Section V reviews the state-of-the-art deep learning-based monaural speech source separation algorithms in the T-F, time, and hybrid domains. Section VI compares the performance of speech source separation approaches using SDR, SI-SDR, SI-SNR, PESQ, and STOI on different datasets.
Section VII concludes the review of T-F, time, and hybrid domain speech source separation algorithms.

II. PERFORMANCE MEASURE
Subjective and objective are two types of performance measures for evaluating speech source separation outcomes.
Subjective measures are scores given by human listeners from their personal perspective on the outcomes of the separation tasks. Human perceptual involvement makes subjective measures a more reliable standard than objective measures, but they are time-consuming and expensive, which is why they are rarely used. Furthermore, different listeners can have different perspectives on a particular output. Objective measures are cheaper and faster; they perform a set of calculations to evaluate separation quality by comparing estimated outcomes with the clean separated sources. This paper focuses on objective measures because of their wide use in research to judge and compare separation accuracy. Commonly used objective metrics for monaural audio source separation algorithms are as follows: source-to-distortion ratio (SDR) [61], source-to-interference ratio (SIR) [61], source-to-artifact ratio (SAR) [61], signal-to-noise ratio (SNR) [61], scale-invariant SDR (SI-SDR) [62], scale-invariant SIR (SI-SIR) [62], scale-invariant SAR (SI-SAR) [62], scale-invariant SNR (SI-SNR) [60], [61], short-time objective intelligibility (STOI) [63], and perceptual evaluation of speech quality (PESQ) [64].
The predicted separated signal P̂ from the mixture can be decomposed into the target source signal and error terms due to interference, noise, and artifacts, i.e., P̂ = e_tar + e_int + e_noise + e_art [62], [65]. The SIR evaluates the amount of error in the predicted signal due to interference. It is computed as the log of the ratio of the target signal energy to the energy of the interference error signal. The SAR represents undesired artifacts in the estimated source signal compared to the actual source signal and is calculated as the log of the ratio of the energy of the target signal plus the interference and noise error signals to the energy of the artifact error signal. To make it independent of noise and interference, the numerator of the SAR formulation in TABLE 1 contains the error terms due to noise and interference. The SNR is defined as the log of the ratio of the energy of the target signal plus the interference error signal to the energy of the noise error signal. SIR, SAR, and SNR were introduced as performance measures for audio signals because the distortion error comprises the interference, noise, and artifact errors, and they are used separately for comparing different monaural source separation approaches [61]. Furthermore, the formulations of SIR, SAR, and SNR exhibit a nonlinear relationship, which may be unsuitable for proper analysis of machine learning algorithms. These metrics can be made scale-invariant to obtain a linear relationship between them. When the estimated signal is a scaled version of the target, scaling the estimate helps to obtain a perceptually enhanced output rather than boosting a particular metric. Scale-invariant metrics perform scaling to produce outcomes that are invariant to scale. Suppose the target component e_tar is a scaled version of the clean target signal, e_tar = αd_tar, where α is the scaling factor. In this case, the predicted signal can be decomposed as P̂ = e_tar + e, where e = e_int + e_art. SI-SDR, SI-SIR, and SI-SAR can be formulated as in TABLE 1. This decomposition gives ∥e∥² = ∥e_int∥² + ∥e_art∥², so the scale-invariant metrics produce a direct relationship between the signal distortion, interference, and artifact metrics.
The metrics can be made scale-invariant by normalizing the predicted and clean speech signals to zero mean before calculation [60]. Scale-invariant SNR (SI-SNR) is one of the most commonly used performance metrics for source separation approaches [61], [66]. SI-SDR is equivalent to SI-SNR when the error e is due to noise only, as illustrated in TABLE 1.
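For concreteness, the following minimal NumPy sketch computes SI-SDR with the zero-mean normalization and optimal target scaling described above; the sampling rate, signal lengths, and function name are illustrative assumptions rather than part of any benchmark implementation.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR (in dB) between an estimated and a clean source.

    Both signals are zero-meaned first, then the target is rescaled by the
    optimal projection factor alpha so that the metric ignores overall gain.
    """
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    # Optimal scaling of the target (projection of the estimate onto the target).
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    e_tar = alpha * target          # target component of the estimate
    e_res = estimate - e_tar        # residual (interference + noise + artifacts)
    return 10 * np.log10((np.sum(e_tar ** 2) + eps) / (np.sum(e_res ** 2) + eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)                 # hypothetical 1-s clean source
    noisy_estimate = clean + 0.1 * rng.standard_normal(16000)
    print(f"SI-SDR: {si_sdr(noisy_estimate, clean):.2f} dB")
```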
Short-time objective intelligibility (STOI) [63] is an objective measure of the intelligibility of time-domain signals for monaural audio source separation. It evaluates intelligibility by calculating the similarity of the short-time temporal envelopes of the time-domain reference speech signal and the predicted speech signal. STOI scores range over [0, 1] [63], and a higher predicted intelligibility value represents better accuracy of the separated speech. STOI is nowadays considered a standard measure for evaluating sound source separation performance [4], [67], [68]. Suppose that for one T-F unit the intermediate intelligibility measure is v_k(ℓ), as shown in TABLE 1, where ℓ is the frame index. T_k(m) and P_k(m) are the T-F units of the clean speech signal and the processed signal, respectively, for the k-th band, and m is the time index belonging to a region of X consecutive T-F units. P'_k(m) denotes the clipped and normalized processed speech signal. Suppose z ranges over the region of all existing frames; the objective intelligibility measure v is obtained by averaging the intermediate intelligibility measure over all bands and frames [63], as formulated in TABLE 1, where K is the number of one-third octave bands and I is the total number of frames. PESQ, recommended by the International Telecommunication Union (ITU) [69], [64], covers the distortion introduced by telecommunication networks and measures separated speech signal quality. PESQ estimates and compares the loudness spectra of the desired and separated speech calculated by an auditory transformation [69], [70].
PESQ scores range from −0.5 to 4.5, with higher scores representing better quality. PESQ evaluates only one-way noise distortion, i.e., the speech perceived by the receiver. It requires complex computations and access to the whole utterance, which may be undesirable. PESQ is a speech quality measure, while STOI is a speech intelligibility measure [64], [70]. Some metrics can evaluate a particular distortion while being meaningless for others. Therefore, recent works compute one or more numerical metrics covering both intelligibility and quality for a more accurate evaluation of separation performance.

III. TRAINING OBJECTIVES
Training objectives are essential for training neural network models properly. Training targets belong to three categories, i.e., masking-, mapping-, and signal approximation (SA)-based targets [71], [72], as in TABLE 2. Masking-based training targets are ideal time-frequency (T-F) masks that establish the time-frequency relationship between the desired speech signal and the mixture signal. The ideal binary mask (IBM), ideal ratio mask (IRM), and complex ideal ratio mask (cIRM) belong to the masking-based training targets. The exclusive allocation principle in auditory scene analysis [69] and the auditory masking phenomenon in audition [58] motivated the first training target, the ideal binary mask (IBM), in supervised monaural speech separation [71], [73], [74], [75]. The IBM is defined over a two-dimensional T-F representation of the noisy signal, as given in TABLE 2.
For T-F units whose signal-to-noise ratio (SNR) exceeds a threshold value (th), the IBM assigns the value 1, and 0 otherwise. With the IBM, the separated speech signal becomes distorted due to the hard masking decision. Hence, the IRM, or soft mask, was introduced to overcome the signal distortion associated with the IBM.
In the IRM, each time-frequency point of the mixed speech signal represents the ratio of the energy of the target speech signal to the energy of the mixed speech signal [76].
Let y(m) be the mixed speech signal, d(m) the desired speech signal, and η(m) the interference signal.
The STFT [58] of the mixture signal y(m) can be represented as Y(m, n) = D(m, n) + N(m, n), where n and m represent the frequency and time indices, respectively, and Y(m, n), D(m, n), and N(m, n) are the Fourier transforms of the mixed signal, the desired speech signal, and the interference signal, respectively. By multiplying the IRM function M_IRM(m, n) with the mixture signal Y(m, n) [58], the clean speech signal can be reconstructed as D̂(m, n) = M_IRM(m, n) Y(m, n). M_IRM(m, n) is the T-F ideal ratio mask function formulated in TABLE 2, and β is a tunable parameter for changing the magnitude value of the mask. |D(m, n)| and |N(m, n)| are the magnitude spectra of the clean speech and the interference noise, respectively. The ideal ratio mask [77] employs only magnitude information; however, the phase information of the desired signal spectrum is also essential [60], [67]. Hence, the cIRM [78] was proposed, which uses both the magnitude and phase information of the desired signal spectrogram to recover the target signal. The complex-domain mixture and clean speech spectrograms can be written as Y(m, n) = Y_r(m, n) + jY_c(m, n) and D(m, n) = D_r(m, n) + jD_c(m, n), where j ≜ √−1 and the subscripts r and c denote the real and imaginary components of the STFT. M̂_cIRM_r(m, n) and M̂_cIRM_c(m, n) [58] are the real and imaginary parts of the estimated cIRM function, and M_cIRM = M_cIRM_r + jM_cIRM_c is the cIRM, so the target spectrum is recovered as D̂(m, n) = M_cIRM(m, n) Y(m, n). The cost function J_cIRM for the cIRM is formulated in TABLE 2. The phase-sensitive mask (PSM) [67], [79] is an effectively calculated mask function for speech separation that accounts for the phase of the speech signal using the phase information of the spectrograms [80]. The T-F domain ideal PSM M_Ph(m, n) for speaker separation can be formulated as in TABLE 2, where θ_y(m, n) and θ_s(m, n) represent the phases of the mixed signal and the clean speech source for source s, respectively, and D_s(m, n) is the clean speech signal of the s-th source.
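As an illustration of these masking-based targets, the sketch below builds an IBM and an IRM from hypothetical clean-speech and interference waveforms using a standard STFT; the frame length, threshold, and β value are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def ideal_masks(desired, noise, fs=8000, nperseg=256, th_db=0.0, beta=0.5):
    """Compute the IBM and IRM from clean speech and interference signals.

    A minimal sketch: th_db is the IBM SNR threshold (in dB) and beta is the
    tunable IRM exponent mentioned in the text.
    """
    _, _, D = stft(desired, fs=fs, nperseg=nperseg)   # clean speech STFT D(m, n)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)     # interference STFT N(m, n)
    snr_db = 10 * np.log10((np.abs(D) ** 2 + 1e-8) / (np.abs(N) ** 2 + 1e-8))
    ibm = (snr_db > th_db).astype(float)              # hard, binary decision
    irm = (np.abs(D) ** 2 / (np.abs(D) ** 2 + np.abs(N) ** 2 + 1e-8)) ** beta
    return ibm, irm

# The estimated clean spectrum is then M(m, n) * Y(m, n), with Y = D + N.
```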
Mapping-based targets [77] are the spectra of the desired speech signal, which have a broad value range, i.e., [0, +∞), at all T-F points. In mapping-based approaches, the magnitude spectrum of the target speaker trains the deep learning model. The cost function J_mapping for mapping-based training targets is formulated in TABLE 2, where D̂(m, n) is the spectrum of the predicted signal of the desired speech source.
Hence, the value of the cost function should be minimized to reduce the difference between the desired signal and the estimated signal [77]. However, the spectrum of the clean speech signal may take values in the broad range [0, +∞) at every T-F point. Hence, mapping-based models are challenging to train [78] and struggle to produce the desired performance. SA-based training targets overcome this challenge by estimating the desired speech signal in the range [0, 1] at each T-F point.
Signal approximation-based training targets are signal spectra calculated by multiplying the estimated mask with the mixture signal in the T-F domain, with a range between [0, 1] [58]. In signal approximation (SA) [77], mapping decides the training target and masking estimates the desired speech; hence, SA is a combination of mapping and masking. As in the mapping-based algorithm, the magnitude spectrum of the desired signal is the target used to train the model. However, the predicted T-F mask and the spectrum of the mixture signal are multiplied to obtain the estimated speech spectrum, as in the masking-based approach. The cost function J_SA [58] for the SA-based approach can be formulated as in TABLE 2, where M̂_SA(m, n) is the predicted T-F mask used to obtain the estimated spectrum D̂(m, n) = M̂_SA(m, n) Y(m, n) for the SA-based method.
Hence, the SA-based approach increases accuracy in the source separation problem. SA-based training targets consider only real terms, while cSA-based [77] training targets use both the real and imaginary components of the signals to calculate the target signals. In the complex domain, the cost functions of the cSA-based method [77] can be calculated as J_1 for the real term and J_2 for the imaginary term, as shown in TABLE 2, where M̂_cSA_r(m, n) and M̂_cSA_c(m, n) are the real and imaginary parts of the complex signal approximation mask function.
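The SA and cSA cost functions referenced in TABLE 2 can be sketched as follows, assuming the masks act multiplicatively on the mixture spectrum Y and are compared against the clean spectrum D; the exact weighting and normalization used in the cited works may differ.

```python
import numpy as np

def sa_loss(mask, Y, D):
    """Signal-approximation cost: compare the masked mixture magnitude with the
    clean magnitude spectrum (a minimal sketch of J_SA)."""
    return np.mean((mask * np.abs(Y) - np.abs(D)) ** 2)

def csa_loss(mask_r, mask_c, Y, D):
    """Complex signal approximation: separate costs for the real (J_1) and
    imaginary (J_2) components, assuming the complex mask multiplies the
    complex mixture spectrum."""
    est = (mask_r + 1j * mask_c) * Y
    j1 = np.mean((est.real - D.real) ** 2)
    j2 = np.mean((est.imag - D.imag) ** 2)
    return j1, j2
```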

IV. DATASETS
Monaural speech source separation methods have been evaluated on various benchmarked datasets. Speech source separation datasets contain mixture, separated, and noise signals to facilitate separation, enhancement, and de-noising tasks.
The WSJ0 [81] corpus was created for automatic speech recognition (ASR) tasks. WSJ0-2mix [81] and WSJ0-3mix [81] are subsets of WSJ0 used to perform two- and three-speaker separation tasks in many state-of-the-art techniques. It contains 30 hours of speech spoken by 119 speakers. The WSJ0 hipster ambient mixture (WHAM!) [82] dataset is a noisy version of WSJ0-2mix suitable for speech de-noising tasks. WHAM! contains two-speaker mixture signals with noise; unique noise is added in the background to make each mixture noisy. WSJ0-2mix and WSJ0-3mix were further extended to WSJ0-4mix [83] and WSJ0-5mix [83] by modifying the basic script of the WSJ0 dataset [83]. To create WSJ0-4mix and WSJ0-5mix, four and five speakers, respectively, are randomly selected and mixed at random SNR values of 0-5 dB [83]. The WSJ0 hipster ambient mixture reverberant (WHAMR!) [84] dataset is a reverberant and noisy extension of WHAM! [82]; it contains artificial reverberation with noise in the background. The Texas Instruments Massachusetts Institute of Technology (TIMIT) [85] corpus is an acoustic-phonetic continuous speech corpus [85]. It contains 6300 utterances produced by 630 speakers, each speaking ten sentences [85]. The Telecommunication and Signal Processing (TSP) corpus [86] consists of 1444 sentences of 2.372 seconds spoken by 24 speakers; this dataset also includes children's speech signals [87]. The SSC (Speech Separation Challenge) was the standard corpus for evaluating separation systems in ICSLP 2006 [88]. It contains separate training, testing, and development sets. The training set contains 17000 utterances from 34 speakers (18 males, 16 females) [89]. The testing and development sets consist of separate noise and two-talker sentences; each set of two-talker sentences consists of speech at six different SNR values: −9, −6, −3, 0, 3, and 6 dB [89]. LibriSpeech [90], derived from LibriVox audiobooks [90], is a read-speech corpus for ASR. This dataset has 470 hours of speech signals spoken by 1252 speakers. The LJSpeech [91] dataset consists of 12522 training and 578 testing utterances out of 13100 utterances with lengths varying from 1 to 10 seconds; it is a single-speaker read-passage corpus [92]. LibriMix [93] is derived from LibriSpeech and WHAM! noises. It contains two- and three-speaker recordings of separated, mixed, and noise signals, making it beneficial for deep learning-based source separation tasks. LibriMix is a freely available dataset, whereas WSJ0 is commercially licensed. Recent works perform monaural speech source separation on both the WSJ0 and LibriMix datasets. The LibriMix dataset can be extended to more than three speakers: Libri5mix, Libri10mix, Libri15mix, and Libri20mix are 5-, 10-, 15-, and 20-speaker datasets, respectively, that can be created using the modified script of the LibriMix dataset [94], [95]. The VCTK dataset contains 109 speakers, each reading 400 newspaper sentences of 2-6 seconds in native English [96]. VCTK-2mix [96] is an open-source dataset derived from VCTK [96] and WHAM! noises [96]. It can be used as a test dataset for source separation in a noisy environment and helps to perform cross-dataset experiments [96]. TABLE 3 describes the benchmarked datasets for the source separation task.

V. DEEP LEARNING-BASED MONAURAL SPEECH SOURCE SEPARATION TECHNIQUES
The impressive performance of deep learning in various research fields motivates researchers to work on deep learning-based speaker separation problems. Recent approaches are available in the T-F, time, and hybrid domains. Techniques in the T-F domain transform the speech signal into the T-F domain before processing, while time-domain techniques perform separation in the time domain only, and the hybrid method performs the separation in both domains. Techniques in the T-F, time, and hybrid domains are explained in the following sections.

A. TIME-FREQUENCY DOMAIN SPEECH SOURCE SEPARATION TECHNIQUES
The T-F domain approaches use concepts of clustering, permutation, grouping, and phase reconstruction with deep learning models to perform separation tasks. These approaches can be classified as DNN and RNN, clustering, permutation, multi-task learning, CASA, and phase reconstruction-based, as presented in the following sections.

1) DNN AND RNN-BASED APPROACHES
Deep neural network (DNN)-based approaches were the first to solve speaker separation problems using deep learning [27]. These DNNs are feed-forward networks without recurrent connections. These methods outperform the signal-processing-domain speaker separation approaches and motivate researchers to use deep learning for the monaural speech source separation task [27]. NMF uses only non-negative templates to model the source signals; however, in real-world applications, sources can be non-linear and may generate both positive and negative values [97]. Hence, non-linear DNN models give more promising results than NMF models.
The DNN models are trained to classify the sources present in the mixture signal. These models capture contextual information by concatenating neighboring features of the audio signal, e.g., magnitude spectra, Mel-frequency cepstral coefficients (MFCCs), etc. However, increasing the number of concatenated neighboring features increases the complexity of the neural network model, which limits how far the concatenation window can be extended [4]. Hence, instead of deep neural networks, recurrent neural networks (RNNs) are used to capture the temporal information of time-series audio signals [4]. RNNs employ memory from previous time steps. Hierarchical RNNs, also known as deep recurrent neural networks (DRNNs), can provide information across multiple time scales [78], [98] and outperform the DNN-based approaches.
The long short-term memory recurrent neural network (LSTM RNN) method uses cSA as the training target and produces the real and imaginary components of the output separately. The detailed working of the LSTM block is given in [98]. Complex-domain monaural source separation approaches utilize the phase information of the target speech signal to retrieve the target audio signals. The LSTM RNN uses temporal information from the time-series data. Two parallel LSTM RNNs with similar configurations simultaneously calculate the real and imaginary terms in the cSA-based LSTM RNN approach [77]. Combining features increases network and system efficiency. The compound features, i.e., the amplitude modulation spectrogram (AMS) [99] (calculated using a 64-channel gammatone filterbank [100]), relative spectral transform and perceptual linear prediction (RASTA-PLP) [101], Mel-frequency cepstral coefficients (MFCC), the cochleagram response, and their deltas, are extracted by the feature extraction unit [69]. Fig. 3 shows the block diagram of the cSA-based LSTM RNN method [77]. During the training stage, LSTM RNN 1 uses the real and LSTM RNN 2 the imaginary components of the spectrograms of the target speech sources. The calculated complex mask and the mixture signal spectrum are multiplied to obtain the separated outputs. The predicted complex T-F mask is updated in each iteration, reducing the difference between the desired and calculated speech signals. During the test stage, the features of the mixture signals are applied as input to the trained LSTM RNNs. Then the compound module combines the predicted real and imaginary components of the output signal, and the reconstruction module reconstructs the estimated output speech. The cSA-based LSTM RNN algorithm has two advantages over SA-based DNN algorithms: (1) SA-based DNN approaches utilize only the magnitude spectrum to calculate the mask function, and the unprocessed phase spectrum of the mixture signal and the calculated mask function are then used to reconstruct the separated signal spectrum, whereas the cSA-based LSTM RNN method utilizes information about both the magnitude and phase of the desired signal to calculate the mask function [77]; (2) the LSTM RNN efficiently utilizes temporal information, and after training the LSTM RNN architecture exhibits good generalization ability [77].
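A minimal PyTorch sketch of the two-parallel-LSTM idea is given below; the feature dimension, hidden size, and layer count are illustrative assumptions, and the cited method additionally uses the compound acoustic features described above rather than raw spectra.

```python
import torch
import torch.nn as nn

class ParallelLSTMcSA(nn.Module):
    """Sketch of the cSA-style setup: two LSTMs with the same configuration
    predict the real and imaginary mask components from the mixture features
    (layer sizes here are illustrative, not taken from the paper)."""

    def __init__(self, feat_dim=129, hidden=256, layers=2):
        super().__init__()
        self.lstm_r = nn.LSTM(feat_dim, hidden, layers, batch_first=True)
        self.lstm_c = nn.LSTM(feat_dim, hidden, layers, batch_first=True)
        self.out_r = nn.Linear(hidden, feat_dim)
        self.out_c = nn.Linear(hidden, feat_dim)

    def forward(self, feats, mix_spec):
        # feats: (batch, frames, feat_dim); mix_spec: complex (batch, frames, feat_dim)
        h_r, _ = self.lstm_r(feats)
        h_c, _ = self.lstm_c(feats)
        mask = torch.complex(self.out_r(h_r), self.out_c(h_c))
        return mask * mix_spec          # estimated complex spectrum of the target

model = ParallelLSTMcSA()
feats = torch.randn(1, 100, 129)
mix = torch.complex(torch.randn(1, 100, 129), torch.randn(1, 100, 129))
est = model(feats, mix)                 # (1, 100, 129) complex tensor
```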
Ensemble learning [76] motivates training several small DNNs and combining them to perform a large task rather than training one big model for that task. Ensemble learning provides very high performance for regression and classification; it combines small models to provide an enhanced range and a flexible representation of the generalized problem [76]. Ensembles of DNNs are used to form multi-context networks. The training target for each neural network is the ideal ratio mask or signal approximation. Multi-context networks are of two types: multi-context averaging (MCA) and multi-context stacking (MCS) [76]. The MCA network averages all outputs from the small DNN ensembles to obtain the final outcome, whereas the ensembles in MCS at different context lengths are connected serially to produce the final result. The ensemble learning approach is suitable for efficient training but at the cost of design complexity [76].
2) CLUSTERING-BASED APPROACHES
DC [29] is a speaker-independent speech source separation technique that can work with any number of speakers. It transforms the T-F bins of the mixture spectrum into a high-dimensional embedding space and produces embedding vectors. Then K-means clustering clusters the embedding vectors to separate the sources. It resolves the output dimension mismatch problem of PIT. However, measuring the objective function between embedded sources instead of the ground-truth speech signals reduces the efficiency of mapping the sources properly. This limitation is overcome by DANet [30]. DANet also produces a high-dimensional embedding space, but instead of clustering, it creates attractor points and reduces the distance between the T-F bins corresponding to each source. Attractors are the centroid points of the sources in the embedding space, which helps to separate the T-F bins belonging to an individual source. The embedding spaces are updated in each iteration to minimize reconstruction errors. This approach faces a center mismatch problem, in which the true attractor points differ from the estimated attractor points; the center mismatch causes prediction of the wrong sources. The anchored DANet (ADANet) [102] approach overcomes the center mismatch problem by considering anchors instead of attractors in the embedding space. Anchors are a set of trainable reference points used in both the training and test stages to estimate the source assignment. ADANet improves performance compared to all previous DC-based approaches.
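The deep clustering objective behind DC and its variants can be sketched as follows, assuming E collects one embedding per T-F bin and A is the ideal one-hot source membership matrix; the normalization here is illustrative.

```python
import numpy as np

def dc_loss(E, A):
    """Deep-clustering affinity loss: E is a (TF, D) matrix of embeddings for
    each T-F bin, A is a (TF, S) one-hot source-membership matrix.  The loss
    compares the estimated and ideal pairwise affinities (Frobenius norm)."""
    return (np.linalg.norm(E @ E.T - A @ A.T, ord="fro") ** 2) / E.shape[0] ** 2

E = np.random.randn(500, 20)                      # 500 T-F bins, 20-D embeddings
A = np.eye(2)[np.random.randint(0, 2, 500)]       # random two-source membership
print(dc_loss(E, A))

# At test time the embeddings are clustered (e.g., with k-means) and each T-F
# bin is assigned to its nearest centroid to build binary separation masks.
```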
Attention deep clustering network (ADCNet) [103] is a recent T-F domain approach that uses multi-head self-attention and deep clustering to perform speaker separation [103]. Inspired by human auditory attention, ADCNet optimizes multi-head self-attention and deep clustering simultaneously [103].
This method captures comprehensive information on multiple time scales using multi-head self-attention. The basic deep clustering approach uses K-means clustering, which requires the number of clusters to be known in advance and is therefore not effective for big data [103]. ADCNet uses a density-based canopy K-means algorithm to overcome this limitation of the K-means algorithm; the improved K-means algorithm does not require the number of clusters in advance [103]. Encoder squash-norm deep clustering (ESDC) [12] is a state-of-the-art T-F domain single-channel speaker separation method. It enhances the discriminative learning ability of high-dimensional vectors by performing input feature encoding, embedding vector training, vector normalization, and vector clustering [12]. The node encoder establishes correlations using adjacency-based similarity between neighboring information and calculates scalar product features from the input feature vectors; the scalar product features represent the relationships between the input vectors. The training stage discriminates these feature vectors to improve the performance of the separation approach. Then squash-norm normalization is used in the vector normalization stage to increase the discriminative capability of the embedding feature vectors; this stage converts short vectors to zero vectors and long vectors to unit vectors. Finally, the clustering stage clusters the squash-norm embedding vectors using various clustering methods [12].

3) PERMUTATION-BASED APPROACHES
Permutation-based approaches involve permutation invariant training (PIT) [31]. In PIT, all possible permutations of the mixed sources are pooled, and the permutation with the lowest error is used to update the network. PIT solves the permutation problem but has an output dimension mismatch problem. Frame-level PIT (tPIT) [31], utterance-level PIT (uPIT) [77], [80], and constrained uPIT (cuPIT) [104] are the permutation-based approaches. tPIT works at the frame level to perform speech source separation and needs speaker tracking due to frame-level discontinuity, whereas real-world problems are at the utterance level. uPIT overcomes the frame discontinuity of tPIT by using a BLSTM trained with an utterance-level criterion to align the frames of the same speaker. cuPIT forms a delta-acceleration coefficient cost function by adding the acceleration and weighted delta of the output frames.
cuPIT is the best-performing PIT method, but due to its complexity, tPIT and uPIT are more frequently used. One-and-rest permutation invariant training (OR-PIT) [105] is a monaural talker-independent multi-speaker speech source separation algorithm that uses tPIT in its architecture. It recursively applies a source separation network to progressively separate sources from the mixture: the network separates one source at a time, and the remaining mixture signal is recursively fed back to the separation network to extract further sources [105]. In OR-PIT, easy-to-separate speakers are always separated first with high separation quality; however, separation quality degrades with further iterations. An iteration termination criterion tells the approach when to stop iterating [105]. The uPIT+DEF+DL [104] (uPIT + deep embedding features (DEF) + discriminative learning (DL)) is a T-F domain discriminative learning method with deep embedding features [106]. Single-channel speaker separation can be considered a permutation problem [29], [77].
PIT reduces the distance between estimates of the same speech signal but does not increase the distance between different speech signals. tPIT and uPIT have output dimension mismatch problems; hence, in many approaches, PIT is used together with DC. uPIT+DEF+DL is one of the approaches that uses uPIT and DC to perform separation tasks. The block diagram of uPIT+DEF+DL is shown in Fig. 4 [106]. The deep clustering (DC+) [29] stage extracts deep embedding features (DEF) by producing clusters in the embedding space known as embedding vectors. These embedding vectors are used by the uPIT stage to separate the sources in the pre-processing stage. Signals separated using DC and uPIT still have a possibility of remixing. Discriminative learning (DL) [107], [108], [109], [110], [111] reduces the chance of remixing of the separated signals using a discriminative loss function. This algorithm has four stages: DC, uPIT, discriminative learning, and joint training. In the deep clustering stage, a trained bidirectional long short-term memory (BLSTM) [112] network extracts deep embedding features (DEF) by projecting each T-F bin of the amplitude spectrum of the mixture signal |Y(m, n)| into a D-dimensional embedding vector E_m. The DEF extractor cost function J_dc can be formulated as J_dc = ∥EE^T − AA^T∥²_F, where E stacks the embedding vectors E_m and A stacks the source membership indicators A_s.
A_s is a binary membership matrix for source s in each T-F bin, and ∥·∥²_F denotes the squared Frobenius norm. The value of A_s is 1 for the s-th source having the maximum energy compared to the other sources, and 0 otherwise. The deep embedding vectors from DC are applied as input to uPIT to estimate soft masks for every source. uPIT selects the optimal permutation, i.e., the one with the minimum utterance-level mean square error cost J_uPIT, from all speaker permutations P.
uPIT aims to minimize J_uPIT so that the output predictions and their corresponding target sources become more similar. Discriminative learning (DL) helps to identify the difference between the target and the interferences by reducing the difference between each prediction and its corresponding target, so that the possibility of remixing decreases. Suppose the selected permutation ϕ* has the minimum mean square error among all permutations. The DL cost function J_DL is then computed from the loss of ϕ* and the losses of the remaining permutations ϕ from P excluding ϕ*, weighted by a regularization parameter µ ≥ 0. When µ = 0, J_DL and J_uPIT are the same, which is the condition of no discriminative learning. J_dc and J_DL are jointly optimized in joint training to obtain embedding features effectively [112]. The joint training loss function J_joint is a weighted combination of J_dc and J_DL, where ϒ ∈ [0, 1] controls the weights of J_dc and J_DL. The end-to-end post filter (E2EPF) [112] method with deep attention fusion features reduces residual interference in the pre-separated speech signals [112]. E2EPF uses both the magnitude and phase information of the pre-separated time-domain signals to maintain correct magnitude and phase values. The E2EPF reduces residual interference in the output signals of uPIT+DEF+DL.
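The permutation search that underlies J_uPIT can be sketched as below; for brevity the loss is a waveform-level MSE over hypothetical estimates, whereas the method described above evaluates it on utterance-level magnitude spectra.

```python
import itertools
import torch

def upit_loss(estimates, targets):
    """Utterance-level PIT sketch: estimates and targets are (S, T) tensors of
    S separated sources.  The permutation with the lowest utterance-level MSE
    is selected and its loss returned (a minimal sketch of J_uPIT)."""
    S = targets.shape[0]
    best = None
    for perm in itertools.permutations(range(S)):
        loss = torch.mean((estimates[list(perm)] - targets) ** 2)
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best          # (minimum loss, chosen permutation phi*)

est = torch.randn(2, 8000)
tgt = torch.randn(2, 8000)
loss, phi_star = upit_loss(est, tgt)
```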

4) MULTI-TASK LEARNING-BASED APPROACHES
Source separation approaches use multi-task learning (MTL) to perform several tasks simultaneously, sharing information across related tasks to improve model training.
Many recent approaches use MTL to obtain models that work simultaneously on different tasks over the same information. Convolutional BLSTM DNN (CBLDNN) [113] is a speaker-independent speech separation approach. It uses generative adversarial training (GAT) based on the generative adversarial network (GAN) and MTL. A GAN has a generator and a discriminator network. The generator of the GAN generates speech signals using the learned mapping between the mixture signal features and mask functions, and the discriminator then differentiates the generated speech features from the actual speech features. MTL extracts fbank-pitch-based features to improve the model's training [113]. This method reduces the numerical mean square error and simultaneously increases the perceptual quality of the speech. The shifted delta coefficient with multi-task learning using grid LSTM (SDC-MTL-Grid) [114] approach deals with single-channel speaker separation. During end-to-end training, the shifted delta coefficient (SDC) objective considers long-range time dynamics to calculate the mask functions; these contextual temporal dynamics align the same speaker's frames on the same side [114]. Multi-task learning (MTL) enhances the outcome of a single task by simultaneously learning more related tasks. MTL predicts T-F labels such as silence labels, single labels, and overlapped labels of the mixture signals. SDC and MTL jointly work with a grid LSTM to obtain impressive results in the approach known as SDC-MTL-Grid. MTL informs SDC about the overlapping regions during mask estimation because speech separation aims to separate the overlapping parts of the mixture signal [114]. Chimera networks incorporate MTL and DC with mask inference [104]; they use mask inference after the embedding layer. However, in the chimera++ network [104], the mask inference is at the output of the BLSTM hidden layer, which reduces the complexity of the network and increases working speed.

5) COMPUTATIONAL AUDITORY SCENE ANALYSIS (CASA)-BASED APPROACHES
CASA end-to-end (CASA-E2E) [115] is a speaker-independent single-channel speaker separation approach. It uses PIT and DC in the two stages of CASA, i.e., in simultaneous and sequential grouping, respectively. In the simultaneous grouping stage, frame-level PIT trains a BLSTM RNN to perform separation at the frame level. In the sequential grouping stage, clustering groups the frame-level separated spectra into utterance-level streams to identify the speakers. Deep computational auditory scene analysis (deep CASA) also employs simultaneous and sequential grouping [116], [117]. The simultaneous grouping works at the frame level to differentiate the desired signal from the mixture. When more than one speaker is to be separated, the separated frame-level spectra are applied to the sequential grouping stage to track the desired speaker [116].
Simultaneous grouping, shown in Fig. 5(a), works at the frame level to isolate the two speakers. The mixture signal spectrum Y(m, n) is the input to a Dense-UNet [116], [117], which predicts two complex ratio masks, cIRM 1(m, n) and cIRM 2(m, n). The masks and the mixture signal Y(m, n) are multiplied to produce two outputs, P̂_1(m, n) and P̂_2(m, n), which represent the estimated STFTs of the two speakers in the complex domain. Permutation invariant training (PIT) [77], [116] is the popular method for training a neural network with more than one target signal. PIT examines the permutations of all possible target output signals, and the permutation with the minimum loss optimizes the network during training. Frame-level PIT (tPIT) and utterance-level PIT (uPIT) are the two types of PIT [116]. In tPIT, the permutation of the target output signals can vary frame by frame, whereas in uPIT, each training utterance uses a fixed permutation. In simultaneous grouping, tPIT trains the Dense-UNet, and the tPIT loss organizes the complex outputs P̂_1(m, n) and P̂_2(m, n) into two streams, P̂_o1(m, n) and P̂_o2(m, n); the inverse STFT of these organized signals produces two time-domain signals, P̂_o1(m) and P̂_o2(m). A signal-to-noise ratio (SNR) objective helps to train the model properly so that the separation accuracy increases. The sequential grouping stage, as in [116], groups the frame-level predicted spectra P̂_1(m, n) and P̂_2(m, n) of the two speakers. Fig. 5(b) represents sequential grouping. The input to the sequential grouping stage is a stack of Y(m, n), P̂_1(m, n), and P̂_2(m, n). A temporal convolutional network (TCN) [118] comprising dilated convolutional blocks projects each frame-level input to a D-dimensional embedding vector E(w). A two-dimensional vector I(w) specifies the target labels for TCN training: if output 1 is speaker 1 and output 2 is speaker 2 in the Dense-UNet, then I(w) = [0, 1]; otherwise, I(w) = [1, 0]. During training, the E(w) of frames with the same tPIT pairing are drawn closer by a weighted objective function between E(w) and I(w), and otherwise pushed farther apart.
In sequential grouping, the K-means algorithm clusters E(w) and produces a binary value for each frame, which arranges the frame-level outputs into the final outputs of deep CASA. Causality can be enforced to make deep CASA causal [119], but this degrades the separation performance.
The Listen and Group [120] approach combines listening and grouping. It always keeps the order of the outputs unchanged since it is an autoregressive method. In the listening stage, mid-level representations of the magnitude spectrograms of the source and mixture signals are created simultaneously; the grouping stage then uses these representations to estimate the separated sources.

6) PHASE RECONSTRUCTION-BASED APPROACHES
The sign prediction net [121] is a phase reconstruction-based approach for T-F domain deep learning-based monaural speaker-independent speech source separation. Because the magnitude of the reconstructed signal suffers when the phase is inconsistent, it predicts the sign and computes the estimated phases [121]. Waveform approximation multiple input spectrogram inverse (WAMISI) [122] uses T-F masking, STFT, and inverse STFT as layers of a deep network to perform multi-speaker monaural speech source separation. It computes the loss on the reconstructed signal to incorporate the error due to phase inconsistency [59]. The sign prediction net and WAMISI use the phase spectra of the speech signals during separation to overcome the phase mismatch problem. These methods model phase explicitly but offer limited performance gains at increased complexity. TABLE 4 compares the advantages and disadvantages of the state-of-the-art T-F domain speech source separation approaches.

B. TIME DOMAIN SPEECH SOURCE SEPARATION TECHNIQUES
The limitations of T-F domain approaches, i.e., time-frequency decomposition, the long-duration window, and phase-magnitude decoupling, act as obstacles to obtaining the required frequency resolution. Most end-to-end time-domain speech source separation techniques solve these problems using an encoder-decoder framework. Time-domain methods can be used efficiently in real-time applications. These approaches do not use the STFT transformation; instead, they design encoder, separation, and decoder modules to perform the separation of speech sources. The encoder transforms the audio signal into a data-driven representation. The separation module calculates the mask function from the data-driven representation produced by the encoder, and this mask function is multiplied with the encoded mixture signal to separate the speakers. The decoder performs the inverse transformation of the encoder and converts the separated speaker signals into waveforms. Time-domain encoder-decoder frameworks for speech source separation can be categorized as recurrent neural network (RNN), convolutional neural network (CNN), transformer, and multi-scale fusion (MSF)-based approaches. However, Wavesplit works in the time domain without an encoder-decoder framework.

1) RNN-BASED APPROACHES
RNN-based techniques use LSTM, BLSTM, and RNN layers in their separation modules. The time-domain audio separation network (TasNet) [59] is an encoder-decoder framework. The encoder of TasNet estimates the weights of the mixture signal using a one-dimensional convolutional (1-D Conv) layer followed by ReLU and sigmoid activation functions, and the product of both activation outputs is given to the separation module. The separation module consists of deep LSTM layers followed by a fully connected layer with a softmax activation to calculate the mask function. The decoder performs a transposed 1-D Conv operation on the product of the mask and the encoded mixture to obtain the time-domain separated signals [59]. The separation module in TasNet LSTM [59] consists of unidirectional LSTM layers to preserve causality for real-time systems, whereas TasNet BLSTM [59], [123] uses bidirectional LSTM layers in the separation module for noncausal systems. The 1-D Conv layer in the encoder of TasNet has a receptive field much shorter than the length of the input sequence and hence cannot work well in an utterance-level framework [59]. DPRNN [124] replaces the one-dimensional convolutional network in the TasNet separator; it is smaller than TasNet and can work with long sequences by constructing a deep network using RNN layers. It consists of a segmentation layer, DPRNN blocks, and an overlap-add layer. The segmentation layer divides long input sequences into local chunks (intra-chunks) and global chunks (inter-chunks) [124]. In the DPRNN blocks, two RNNs, an intra-chunk RNN and an inter-chunk RNN, iteratively and alternately process the intra- and inter-chunks, respectively. The inter-chunk RNN aggregates the outputs of the intra-chunks to perform utterance-level processing, and the overlap-add stage then adds all the segments to obtain the separated source signals. The global processing stage of DPRNN suffers from the recurrent connection problem, which limits the performance of the approach [124]. The standard transformer relies on positional encoding to capture sequence order information; the improved transformer [125] integrates an RNN instead of positional encoding because positional encoding is not reliable for dual-path networks and creates model divergence during training.
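A minimal PyTorch sketch of the TasNet-style encoder/separator/decoder pipeline follows; the filter count, kernel size, and LSTM separator dimensions are illustrative assumptions and do not reproduce the published configurations.

```python
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    """Sketch of a TasNet-style encoder/separator/decoder for two speakers
    (dimensions and the LSTM separator are illustrative only)."""

    def __init__(self, n_filters=128, kernel=16, stride=8, n_src=2):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        self.separator = nn.LSTM(n_filters, 256, num_layers=2, batch_first=True)
        self.mask_layer = nn.Linear(256, n_filters * n_src)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, mixture):                       # mixture: (batch, 1, samples)
        w = torch.relu(self.encoder(mixture))         # (batch, filters, frames)
        h, _ = self.separator(w.transpose(1, 2))      # (batch, frames, 256)
        masks = torch.sigmoid(self.mask_layer(h))     # (batch, frames, filters * n_src)
        masks = masks.view(w.size(0), -1, self.n_src, w.size(1)).permute(0, 2, 3, 1)
        sources = [self.decoder(w * masks[:, s]) for s in range(self.n_src)]
        return torch.stack(sources, dim=1)            # (batch, n_src, 1, samples)

net = TinyTasNet()
out = net(torch.randn(1, 1, 8000))                    # two estimated waveforms
```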
Gated DPRNN [83] separates multiple voices simultaneously using gated neural networks and mainly focuses on separating an unknown number of speakers. For conventional two- and three-speaker separation approaches, complexity grows and performance degrades quadratically as the number of speakers increases, whereas for Gated DPRNN they scale only linearly with the number of speakers [83].

2) CNN-BASED APPROACHES
CNN-based approaches use convolutional neural networks in their separation modules. The fully convolutional TasNet (ConvTasNet) [60] uses only convolutional layers in all processing stages. It consists of encoder, decoder, and separation modules similar to TasNet, but instead of a deep LSTM network, the separation module consists of stacked dilated 1-D convolutional blocks similar to a temporal convolutional network (TCN) [118]. The convolutional operation processes consecutive segments in parallel, which increases processing speed and decreases model size. It incorporates global layer normalization (gLN) for noncausal systems and cumulative layer normalization for causal systems [60].
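The stacked dilated 1-D convolutional blocks used in a ConvTasNet-style separator can be sketched as follows; channel sizes, the depthwise structure, and the number of blocks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Sketch of a dilated 1-D convolutional block of the kind stacked in the
    ConvTasNet separator: a pointwise expansion, a dilated depthwise
    convolution, and a residual connection (sizes are illustrative)."""

    def __init__(self, channels=128, hidden=256, kernel=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel - 1) // 2
        self.block = nn.Sequential(
            nn.Conv1d(channels, hidden, 1),                     # 1x1 expansion
            nn.PReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=pad,
                      dilation=dilation, groups=hidden),        # depthwise, dilated
            nn.PReLU(),
            nn.Conv1d(hidden, channels, 1),                     # 1x1 projection
        )

    def forward(self, x):
        return x + self.block(x)        # residual connection keeps the input scale

# Stacking blocks with exponentially increasing dilation enlarges the receptive field.
stack = nn.Sequential(*[DilatedConvBlock(dilation=2 ** d) for d in range(4)])
y = stack(torch.randn(1, 128, 999))
```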
The speaker attractor network (SANet) [126] is an improved version of DANet. It uses a TCN, similar to ConvTasNet, to create embedding vectors. Attractors are then calculated from these embedding vectors using a mask-weighted average during training and approximated during the test phase using the K-means centroids of the embeddings [126]. In SANet, the numbers of speakers during training and testing can differ [126].
Neural architecture search (NAS) [127] is a technique for searching for the best neural network structure while minimizing human interaction. NasTasNet provides a search space for ConvTasNet using candidate operations. It helps to obtain better design parameters for ConvTasNet [127] and reduces GPU utilization with the best architecture. The auxiliary loss method with NAS is better for updating the parameters and achieving a balanced architecture for ConvTasNet [127]. The channel-aware audio separation network (CasNet) [128] is similar to TasNet with an added channel encoder and separates the mixture speaker signals with the help of channel embeddings and the FiLM technique [128]. It enhances the channel robustness of TasNet models. The channel encoder of CasNet consists of a residual net and a pooling layer. The residual net consists of two sub-blocks, i.e., the convolutional block and the residual block [128]. The convolutional block is composed of a 1-D convolutional layer followed by ReLU activation and batch normalization operations, and the residual block has two convolutional blocks and a squeeze-and-excitation layer [128].

3) TRANSFORMER-BASED APPROACHES
The basic transformer [125] is an encoder-decoder architecture with multi-head attention, originally proposed for sequence-to-sequence tasks as an alternative to RNN and convolutional models. The dual-path transformer network (DPTNet) [125] uses an improved transformer for context-aware modeling of extremely long speech sequences. It has an encoder, a decoder, and a separation layer, with a ReLU activation function following the encoder. The separation layer is a dual-path network with an improved transformer to calculate the mask function. The encoder output is segmented into overlapping intra- and inter-chunks, and intra- and inter-transformers process the segmented chunks at the utterance level [125]. The dual-path transformer stage can be repeated several times. A 2-D convolutional layer processes the output of the last inter-transformer to calculate the mask function for each source, and overlap-add transforms the mask back into a sequence. The mask is then multiplied with the encoded mixture signal to obtain the masked encoder features for a particular source. The decoder converts the masked encoder features into separated speech signals by performing the transposed encoder operation [125].
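The chunk segmentation and overlap-add steps shared by DPRNN- and DPTNet-style dual-path models can be sketched as follows; the chunk size and hop are illustrative, and the intra-/inter-chunk transformers that would process the chunked tensor are omitted.

```python
import torch

def segment(x, chunk, hop):
    """Split an encoded sequence (batch, channels, frames) into overlapping
    chunks, giving the 3-D tensor processed by intra-/inter-chunk blocks."""
    b, c, t = x.shape
    pad = (-(t - chunk) % hop) if t > chunk else chunk - t
    x = torch.nn.functional.pad(x, (0, pad))
    return x.unfold(2, chunk, hop)            # (batch, channels, n_chunks, chunk)

def overlap_add(chunks, hop, length):
    """Merge overlapping chunks back into a sequence by summing the overlaps
    (the overlap-add stage)."""
    b, c, n, k = chunks.shape
    out = torch.zeros(b, c, (n - 1) * hop + k)
    for i in range(n):
        out[..., i * hop:i * hop + k] += chunks[..., i, :]
    return out[..., :length]

x = torch.randn(1, 64, 999)
chunks = segment(x, chunk=100, hop=50)        # intra-chunks along the last dim
y = overlap_add(chunks, hop=50, length=999)   # back to (1, 64, 999)
```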
The globally attentive and locally recurrent (GALR) [129] network takes advantage of both the attention and recurrent mechanisms, applied alternately and iteratively. It uses BLSTM for local context modeling and multi-head attention for global context modeling [129]. GALR is globally attentive and locally recurrent, while DPTNet is locally attentive and globally recurrent [129].
The Sepformer [130] is an RNN-free transformer-based model for speech separation. It consists of multi-head attention and feed-forward layers. It learns both short- and long-term dependencies with a dual-path framework similar to DPRNN and uses a multi-scale pipeline composed of transformers [130]. The Sepformer applies permutation operations between the intra- and inter-transformers to model long-term dependencies across chunks [130].
The time-domain adaptive attention network (TAANet) [11] has two attention networks, channel attention and spatial attention, for local modeling and a self-attention network for global modeling. For local modeling it works at the frame level with a BLSTM, and for global modeling it works at the utterance level. Self-attention can pay more attention to the long-term dependencies of the speech sequence by calculating the correlations between all parts at different time scales.
The dual-path hybrid attention network (DPHA-Net) [131] is a transformer-based approach that utilizes a multistage aggregation training (MAT) strategy [131]. MAT is multistage training with improved feature-selective aggregation ability. Similar to other transformer-based approaches, DPHA-Net comprises encoding and chunking, separation, and overlap-add and decoding stages. The encoder consists of a 1-D convolutional layer with a ReLU activation function that transforms the 1-D input sequence into a 2-D output sequence. The output of the encoder is divided into chunks to produce a processable 3-D tensor. The DPHA-Net separation module processes this 3-D tensor to predict the mask function through intra- and inter-chunk processing units [131]. These units have a similar architecture and consist of multi-head self-attention (MHSA), element-wise attention (EA), adaptive feature fusion, global layer normalization (gLN), and permutation operations. The separation module of DPHA-Net is repeated for the required number of stages, and the outputs of the present and previous stages are aggregated to produce the final outcome of each stage and of the separation module [131]. The EA unit consists of a gated recurrent unit (GRU) layer followed by a sigmoid activation function and a second GRU to capture context information at various time steps. The adaptive feature fusion (AFF) unit consists of channel-wise attention and temporal attention operations. AFF enhances the feature extraction capability of the network by assigning suitable attention to the relevant time steps and channels [131].

4) MULTI-SCALE FUSION-BASED APPROACHES
Real-world speech signals show temporal scale variations due to the different word lengths and pronunciation characteristics of speakers, which motivates researchers to work with different receptive fields or scales. Multi-scale fusion (MSF) methods in the time domain process and fuse information at various time scales. In these methods, input from the bottom stage is processed through additional stages in the upward direction before returning to the bottom stage. Successive down-sampling and resampling of multi-resolution features (SuDoRM-RF) [132], FurcaNeXt [133], Sandglasset [134], and the asynchronous fully recurrent convolutional neural network (A-FRCNN) [135] are time-domain MSF-based single-channel speaker separation methods. SuDoRM-RF has an encoder, decoder, and separation architecture. The encoder and decoder have a 1D convolutional layer and a transposed 1D convolutional layer, respectively, to work opposite each other. The separation module consists of U-Conv blocks that work at multiple scales of the speech signal and calculate the mask function. The U-Conv block [132] is similar to U-Net and uses successive down-sampling and up-sampling operations to extract information from multiple resolutions [132].
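The successive down-sampling and up-sampling idea behind the U-Conv block can be sketched as follows. The number of scales, kernel size, and channel count are illustrative assumptions rather than the settings of SuDoRM-RF.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UConvBlockSketch(nn.Module):
    # U-Net-like block: successively downsample the features, then upsample each
    # resolution and fuse it back through skip connections; sizes are illustrative.
    def __init__(self, channels=128, depth=4):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=5, stride=2, padding=2, groups=channels)
            for _ in range(depth))
        self.fuse = nn.ModuleList(nn.Conv1d(channels, channels, kernel_size=1)
                                  for _ in range(depth))

    def forward(self, x):                               # x: (batch, channels, time)
        skips = []
        for down in self.down:                          # extract coarser and coarser resolutions
            skips.append(x)
            x = F.relu(down(x))
        for fuse in list(self.fuse)[::-1]:              # resample upward and fuse with the skips
            skip = skips.pop()
            x = skip + fuse(F.interpolate(x, size=skip.shape[-1]))
        return x

print(UConvBlockSketch()(torch.randn(2, 128, 1600)).shape)   # torch.Size([2, 128, 1600])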
FurcaNeXt [133] introduces variants of TCN with multiple branches for multi-scale feature dynamics. With different temporal receptive field scales, these branches characterize different speech speeds [133].
The Sandglasset [134] has a sandglass-like shape and processes the speech signal at multiple granularity levels. In the first half of the network blocks the feature granularity becomes gradually coarser, and it then becomes successively finer again toward the raw-signal level. It uses an RNN for local modeling and a self-attention network (SAN) for global modeling.
The A-FRCNN [135] introduces recurrent connections into convolutional neural networks and updates the network weights asynchronously. SuDoRM-RF and Sandglasset leave the lateral information between stages unprocessed, whereas the A-FRCNN processes information in the bottom-up, top-down, and lateral directions. It is similar to a U-Net with delay [135]. In the A-FRCNN, the input is first passed in the bottom-up direction through the stages, then fused in parallel between adjacent stages, and finally fused through the bottom stages with skip connections. Information moving upward becomes coarser at each stage because the convolutional layers operate at different scales [135].
Multi-scale group transfer TasNet (MSGT TasNet) [136] applies self-attention to small groups of the sequence instead of the whole sequence at once [136]. This group self-attention reduces the complexity of the model. In full self-attention, any two positions of the input sequence are correlated, so the complexity increases quadratically with the sequence length. Group self-attention, in contrast, correlates only local regions within fixed-length groups; for longer sequences the number of groups increases, and the complexity grows with the number of groups rather than quadratically with the length [136]. Because group self-attention performs no cross-group correlation, it loses global context information. MSGT TasNet therefore uses multi-scale fusion to capture global information, applying group self-attention on high-resolution scales for local context modeling and on low-resolution scales for global context modeling [136].
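The following sketch illustrates group self-attention on a feature sequence: the sequence is reshaped into fixed-length groups and multi-head self-attention is applied inside each group, so the attention cost grows with the number of groups rather than quadratically with the sequence length. The group length and head count are illustrative assumptions, not MSGT TasNet's settings.

import torch
import torch.nn as nn

class GroupSelfAttentionSketch(nn.Module):
    # Self-attention restricted to fixed-length groups: quadratic only in the group
    # length, linear in the number of groups (group_len and heads are assumptions).
    def __init__(self, dim=64, heads=4, group_len=50):
        super().__init__()
        self.group_len = group_len
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):              # x: (batch, time, dim), time divisible by group_len
        b, t, d = x.shape
        groups = x.reshape(b * (t // self.group_len), self.group_len, d)
        out, _ = self.attn(groups, groups, groups)   # each group attends only to itself
        return out.reshape(b, t, d)

print(GroupSelfAttentionSketch()(torch.randn(2, 1000, 64)).shape)   # torch.Size([2, 1000, 64])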

5) TIME-DOMAIN TECHNIQUES WITHOUT ENCODER-DECODER FRAMEWORK
Wavesplit [65] is a time-domain speaker separation approach without an encoder-decoder framework. It uses a residual convolutional network consisting of a speaker stack and a separation stack [65]. Fig. 6 shows the block diagram of the Wavesplit approach [65].
The speaker stack is the first stack and uses clustering to create a set of vectors representing the speakers in the mixture signal. These speaker representation vectors live in the time domain and are independent of frequency bins. K-means clustering applied to the speaker representation vectors then yields the speaker centroids. The separation stack takes the speaker centroids and the mixture signal as input to separate the speakers. The permutation problem is solved during training with the help of PIT, and in this way the speaker and separation stacks are trained simultaneously [65]. During training, the speaker stack creates a vector representation for every speaker, making distances between representations of the same speaker small and distances between different speakers large. The separation stack learns to separate the clean speaker signals using these representations. At test time, the speaker stack identifies the centroid for every speaker representation.
Wavesplit uses two training objectives: (1) the speaker-vector objective and (2) the reconstruction objective. The speaker-vector objective learns vector representations with small intra-speaker and large inter-speaker distances, while the reconstruction objective optimizes the quality of the separated speech. TABLE 5 illustrates the advantages and disadvantages of state-of-the-art time-domain audio source separation approaches.
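A toy version of the speaker-vector objective is sketched below: it pulls frame-level speaker vectors toward their own speaker centroid and pushes centroids of different speakers at least a margin apart. This is a deliberately simplified stand-in; the actual Wavesplit objectives also involve learned speaker embeddings and permutation-invariant assignment.

import torch
import torch.nn.functional as F

def speaker_vector_loss(frame_vecs, speaker_ids, margin=1.0):
    # Toy objective in the spirit of Wavesplit's speaker-vector loss: pull each frame
    # vector toward its speaker centroid and keep centroids of different speakers at
    # least `margin` apart. Simplified: the published loss also uses learned speaker
    # embeddings and permutation-invariant assignment. Assumes ids are 0..S-1.
    centroids = torch.stack([frame_vecs[speaker_ids == s].mean(0)
                             for s in range(int(speaker_ids.max()) + 1)])
    intra = ((frame_vecs - centroids[speaker_ids]) ** 2).sum(-1).mean()  # small intra-speaker distance
    dists = torch.cdist(centroids, centroids)                            # pairwise centroid distances
    off_diag = dists[~torch.eye(len(centroids), dtype=torch.bool)]
    inter = F.relu(margin - off_diag).mean()                             # large inter-speaker distance
    return intra + inter

vecs = torch.randn(200, 64)                # frame-level speaker representations
ids = torch.tensor([0] * 100 + [1] * 100)  # ground-truth speaker per frame
print(speaker_vector_loss(vecs, ids))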

C. HYBRID SPEECH SOURCE SEPARATION TECHNIQUE
The hybrid approaches work in both the T-F and time domains. GCD-TasNet [137] is a hybrid-domain approach. It creates an input feature map using a 1D convolutional layer in the time domain and the STFT spectrogram in the frequency domain. The concatenated features of both domains are then processed by an embedding network and a clustering approach to calculate the mask function [137]. The embedding network is similar to a TCN and increases the dimension of the input; clustering is then applied to the embeddings to calculate the mask function. The decoder consists of a transposed 1D convolutional block and an ISTFT, and the two reconstructions are added to obtain the separated speech signals [137].
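The hybrid feature map can be sketched as follows: a learned 1D convolutional encoding and an STFT magnitude spectrogram are computed from the same waveform at matching frame rates and concatenated along the feature axis. The window length, hop, and channel count are illustrative assumptions, not the GCD-TasNet configuration.

import torch
import torch.nn as nn

def hybrid_features(wav, n_fft=512, hop=128, conv_channels=256):
    # Concatenate a learned time-domain encoding with STFT magnitudes computed at the
    # same frame rate (window, hop, and channel count are illustrative assumptions).
    encoder = nn.Conv1d(1, conv_channels, kernel_size=n_fft, stride=hop, padding=n_fft // 2)
    conv_feats = torch.relu(encoder(wav.unsqueeze(1)))       # (batch, conv_channels, frames)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    mag = spec.abs()                                          # (batch, n_fft // 2 + 1, frames)
    frames = min(conv_feats.shape[-1], mag.shape[-1])         # align the two frame counts
    return torch.cat([conv_feats[..., :frames], mag[..., :frames]], dim=1)

wav = torch.randn(2, 16000)                                   # one second of 16 kHz audio
print(hybrid_features(wav).shape)                             # torch.Size([2, 513, 126])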
In the E2EPF with deep attention fusion features [112], the speech signal is first pre-processed to separate the mixture in the T-F domain, and the separated signal is then processed in the time domain to improve the separation outcome. The block diagram of the E2EPF is shown in Fig. 7(a) [112]. The E2EPF algorithm consists of an attention mechanism that extracts deep attention fusion features [112] from the speech signals and a post-filter for single-channel speech source separation.
The time-domain stage takes the pre-processed speech signals as input features, so both magnitude and phase information are used to separate the speech sources. The uPIT+DEF+DL [114] is the pre-processing stage and primarily separates the mixture signal in the T-F domain; however, residual interference remains in the speech separated by this stage. The E2EPF placed after the pre-separation stage improves separation performance by reducing these residual interferences. An attention module in the fully convolutional E2EPF network uses the features of the mixed signal and the pre-separated signal to calculate their similarity and thereby reduces the residual interference in the pre-processed signal. Furthermore, the E2EPF solves the magnitude-phase mismatch problem by separating the speech signals in the time domain. It consists of feature extraction [112], an attention module [112], and a post-filter [112]. In feature extraction, features are computed from the mixed speech signal $Y(m)$ and the pre-processed signals $O_s(m)$, $s = 1, 2, \ldots, S$, where $S$ is the total number of extracted sources. The 1D convolution operation [112] extracts deep features $W_y(m)$ and $W_s(m)$ from $Y(m)$ and $O_s(m)$, respectively, as $W_y(m) = \mathrm{ReLU}\big(Y(m) \circledast U_y(m)\big)$ and $W_s(m) = \mathrm{ReLU}\big(O_s(m) \circledast U_s(m)\big)$, where $\circledast$ denotes the 1D convolution.
Here $U_y(m)$ and $U_s(m)$ represent the basis functions of the 1D convolution operation [112], and the rectified linear unit $\mathrm{ReLU}(\cdot)$ is an optional nonlinear activation function. Nowadays, attention models are used successfully to solve sequence-to-sequence learning problems [138], [139], [140], [141], [142]. The attention mechanism works on the extracted features and pays more attention to reducing the interference and improving separation performance. The E2EPF applies $W_y(m)$ and $W_s(m)$ to a second 1D convolutional layer to compare the mixed and pre-separated speech, i.e., $W'_y(m) = W_y(m) \circledast U'_y(m)$ and $W'_s(m) = W_s(m) \circledast U'_s(m)$.
Here $W'_y(m)$ and $W'_s(m)$ are the features representing the mixture and the separated sources, respectively, and $U'_y(m)$ and $U'_s(m)$ represent the basis functions of the second 1D convolution operation. The correlation $g_{m,m'}$ between $W'_y(m)$ and $W'_s(m')$ is converted into a soft mask, i.e., the attention weight $h_{m,m'}$, using the global attention mechanism [141]: $g_{m,m'} = W'_y(m)^{\top} W'_s(m')$ and $h_{m,m'} = \exp(g_{m,m'}) / \sum_{m''} \exp(g_{m,m''})$. The weighted average of $W'_s(m')$ then gives the context function $\mathrm{Co}_{m',s}(m) = \sum_{m'} h_{m,m'}\, W'_s(m')$. The context vectors $\mathrm{Co}_{m',s}(m)$ and the deep mixture features $W'_y(m)$ are applied to the post-filter as attention fusion features. The post-filter of the E2EPF in Fig. 7(b) consists of a TCN similar to TasNet [60], which performs better than RNNs in various sequence modeling tasks [60], [115], [143], [144], [145]. The fully convolutional post-filter consists of stacked dilated 1D convolutional blocks (Conv blocks) with increasing dilation factors $(1, 2, \ldots, 2^{Z-1}$, where $Z$ is the number of convolutional blocks) in each TCN to capture a large temporal context, which can be enlarged further by repeating the $Z$ stacked dilated convolutional blocks (four times) [106]. Fig. 7(c) shows the construction of the Conv block [146]. Skip connections carry the input information between the present and successive blocks. Depth-wise separable convolution, commonly used in image processing tasks [147], [148], reduces the number of model parameters. The parametric rectified linear unit (PReLU) [149], a nonlinear activation function, improves model fitting with little overfitting risk and almost no extra computational cost, and global layer normalization (gLN) [60] follows the first 1D-conv and the depth-wise 1D-conv blocks.
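The correlation, soft-mask, and context-vector steps of the attention module can be written compactly as below. The sketch assumes plain dot-product (global) attention between the mixture and pre-separated feature sequences; the exact layer placement in [112] may differ.

import torch

def attention_fusion(w_y, w_s):
    # Dot-product global attention between mixture features w_y and pre-separated
    # features w_s: correlation -> softmax weights -> context vectors, concatenated
    # with the mixture features as attention fusion features (a sketch of [112]).
    g = torch.bmm(w_y, w_s.transpose(1, 2))        # g[m, m'] = <w_y(m), w_s(m')>
    h = torch.softmax(g, dim=-1)                   # attention weights over frames m'
    context = torch.bmm(h, w_s)                    # weighted average of w_s per mixture frame
    return torch.cat([w_y, context], dim=-1)       # features passed on to the post-filter

w_y = torch.randn(2, 126, 256)                     # deep features of the mixed signal
w_s = torch.randn(2, 126, 256)                     # deep features of one pre-separated source
print(attention_fusion(w_y, w_s).shape)            # torch.Size([2, 126, 512])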
A 1D convolutional layer followed by a ReLU nonlinearity, denoted as $F(\cdot)$, takes the output of the stacked dilated 1D convolutional blocks; the ReLU learns target masks similar to those of the T-F domain [106]. The predicted mask $Ma_s(m)$ of each source is the output of $F(\cdot)$, and applying it to the mixture features yields the estimated source representation.
$Es_s(m)$ is the estimated separated representation of the target source, and the 1D convolution operator with $U_E(m)$ as basis function reconstructs the predicted time-domain signal as $\hat{x}_s(m) = Es_s(m) \circledast U_E(m)$.

PIT-based approaches choose the best permutation to solve the permutation problem but still have the output-dimension mismatch problem. PIT separates the target speaker from the mixture signal in the T-F domain, but the separation may still contain interference because PIT only reduces the distance within the same speaker and leaves the distance between different speakers unchanged. Deep clustering is therefore used after PIT to increase the distance between different speakers. Multi-task learning performs multiple tasks simultaneously to improve the training of the model. Monaural speech separation with deep CASA uses tPIT to separate the speakers from the mixture signal in the simultaneous grouping stage and k-means clustering to track the speakers in the sequential grouping stage, providing good separation performance; de-noising before the simultaneous grouping stage further reduces complexity while improving performance. Deep CASA is the best T-F domain method but needs speaker tracking because of tPIT. Many recent monaural speaker separation methods use DC and PIT jointly to increase separation accuracy. The uPIT+DEF+DL uses uPIT and DC jointly to separate the different speech signals, followed by discriminative learning to increase the distance between separated sources and fine-tune the separated speech. The T-F domain phase reconstruction approaches try to solve the phase-magnitude decoupling problem, but their results are not comparable with those of time-domain approaches.

Time-domain approaches use data-driven representations instead of STFT features. These methods have not yet been analyzed on large datasets because these approaches do not scale or generalize well to large amounts of data. Time-domain approaches focus on local and global context modeling, for example with dual-path architectures. Existing time-domain approaches separate two or three speakers from the mixture; Gated DPRNN can separate more than three speakers but is still not independent of the number of speakers. The CNN-based ConvTasNet is the best model for local context modeling but, because of its convolutional structure, cannot perform global modeling efficiently. The SANet implements an improved version of the T-F domain DANet method in the time domain. It is categorized as CNN-based because it creates an embedding vector and produces attractors using a TCN, similar to ConvTasNet; in this way it transfers a T-F domain concept to the time domain. NasTasNet searches for the best ConvTasNet architecture using the NAS technique and can be used with speaker separation models to obtain the best model architecture with minimum human interaction. The Casnet enhances the channel robustness of TasNet models by making them aware of channel information.
The transformer-based DPTNet excels at global modeling for speech source separation with its dual-path design; DPTNet is locally attentive and globally recurrent, whereas GALR is globally attentive and locally recurrent. However, interchanging the attention and recurrent mechanisms in GALR degrades the separation performance compared with DPTNet. The Sepformer is a further advancement of the transformer-based approach with a built-in attention mechanism, and it outperforms all the other approaches in terms of SDR and SI-SNR on the WSJ0-2mix dataset. The TAANet is also a recent approach and incorporates CNN and attention in its architecture to perform local and global modeling, achieving impressive results.
The DPHA-Net aggregates the outputs of the present and previous stages to calculate the present-stage output. It extracts multi-head self-attention features, processes them with the EA unit, and fuses them using the AFF unit to enhance the feature extraction capability of the model.
Multi-scale fusion-based methods work at different scales and characteristics of the speech signal. SuDoRM-RF uses the U-Conv block for successive down- and up-sampling operations to extract information from multiple resolutions. FurcaNeXt proposes variants of TCN working on multiple branches for different speech characteristics. The Sandglasset has a sandglass-like structure and is the only method that works at the multi-granularity level of the speech signal. The A-FRCNN, unlike SuDoRM-RF and Sandglasset, also processes lateral information, using a U-Net with delay to improve separation performance. MSGT TasNet creates small groups from the input sequence and calculates the correlation between group elements using multi-head self-attention, which reduces the complexity of the model; cross-group (global) correlation is then captured using MSF. The information within a group is local context information, and the cross-group information is global context information. Wavesplit is a recent time-domain speech source separation approach that uses the concepts of clustering and permutation invariant training in the time domain. It is a multi-speaker separation algorithm with limited performance improvement for more than three speakers and reports separation results on different datasets; it is one of the most efficient speech separation algorithms.
GCD-TasNet is a hybrid-domain approach. It encodes STFT spectrograms and time-domain features from the raw input and concatenates the information as the encoder's output; the separation module consists of an embedding network similar to a TCN and a clustering operation to separate the speakers from these combined features. The E2EPF with deep attention fusion features is the state-of-the-art hybrid-domain algorithm for deep learning-based speaker separation. The T-F domain uPIT+DEF+DL preliminarily separates the target speech signal from the mixture, the attention module of the E2EPF then focuses on the remaining interference, and the post-filter in the time domain reduces that interference in the pre-separated speech signal. The attention module and post-filter are very good proposals for enhancing the pre-separated speaker signal.
The results of various deep learning-based single-channel speech source separation approaches are compared in terms of SDR, SI-SNR, PESQ, and STOI on WSJ0-2mix in the comparison tables. These tables show that time-domain models have a much smaller model size than T-F domain models while achieving better performance. With the encoder-decoder framework, these methods overcome the main drawbacks of T-F domain approaches, such as output-dimension mismatch, permutation ambiguity, and large context window size.

VII. CONCLUSION AND FUTURE SCOPE
This paper comprehensively studies and analyzes deep learning models for monaural speech source separation. Different models are categorized into the T-F, time, and hybrid domains. The methods have been described, some briefly and some in detail, to build the basic concepts of deep learning-based speech source separation. A comparative analysis of different deep learning-based speech source separation models in terms of SDR, SI-SNR, and PESQ has been provided to help readers understand the domain better. It is observed that T-F domain methods face several constraints in obtaining the required frequency resolution, which time-domain methods can overcome. Although numerous approaches have been designed for two- or three-speaker separation on datasets of a particular language, real-world speaker separation that is independent of the language and of the number of speakers is still a challenging problem. Time-domain approaches are still at an early stage; some recent ones have been designed to separate more than three speakers and some to work on languages other than English. The future prospects lie in the design of real-world deep learning models for all practical applications. Dataset creation and separation model design for more than three speakers and multiple languages, improved separation module design for the encoder-decoder framework, improved multi-scale fusion models that cover all scales of the speech signal, attention mechanisms designed analogously to human auditory attention, and deep learning models that combine the concepts of the T-F, time, and hybrid domain approaches are some areas for future research.