An Efficient Short-Time Discrete Cosine Transform and Attentive MultiResUNet Framework for Music Source Separation

The music source separation problem, where the task is to estimate the audio components present in a mixture, has long been at the centre of research activity. In recent frameworks, the problem is tackled with deep learning models that attempt to extract information about each component using Short-Time Fourier Transform (STFT) spectrograms as input. Most approaches assume that a single source is present at each time-frequency point, which allows that point of the mixture to be allocated to the desired source. Since this assumption is strong and is reported not to hold in practice, a problem arises from using the magnitude of the STFT as network input: the Fourier phase information is absent during the reconstruction of the separated sources. Recovering the Fourier phase is neither easily tractable nor computationally efficient. In this paper, we propose a novel Attentive MultiResUNet architecture that uses real-valued Short-Time Discrete Cosine Transform data as input. This avoids the phase recovery problem, since the appropriate values are estimated within the network itself, rather than by complex estimation or post-processing algorithms. The proposed network features a U-Net type structure with residual skip connections and an attention mechanism that correlates each skip connection with the decoder output of the previous level. The proposed network is used for the first time in source separation, is more computationally efficient than state-of-the-art separation networks, and features favourable performance compared to the state of the art at a fraction of the computational cost.


I. INTRODUCTION
Music production is achieved when recordings of individual sources (vocals and instruments) are arranged together and combined into an audio mixture. Music source separation (MSS) is the process of estimating these isolated sources, also called stems, from the audio mixture. (The associate editor coordinating the review of this manuscript and approving it for publication was Manuel Rosa-Zurera.) A general mixing model can be described by assuming a set of K microphones x(n) = [x_1(n), . . . , x_K(n)]^T observing a set of L sound sources s(n) = [s_1(n), . . . , s_L(n)]^T. Assuming no reverberation in a studio music desk mixture and stationary mixing, the mixing model
can be expressed as follows:

x(n) = As(n)    (1)

where A represents a K × L mixing matrix and n is the sample index of the N available data values. The general underdetermined source separation problem (where K < L) is impossible to solve exactly, since there is an infinite number of solutions. Several previous research attempts have offered methods to trace those solutions that are relevant to source separation. For more information on the identifiability and uniqueness of solutions to the underdetermined problem, one can refer to [1] and [2]. In the case of MSS, the problem features a two-channel (stereo, K = 2) mixture that contains L > 2 audio sources. Over the last years, the music source separation community, through the MUSDB18 [3] and DSD100 [4] datasets and corresponding competitions, has focused on a more feasible song stem decomposition, i.e. decomposing modern pop/rock stereo songs into four basic stems: vocals, bass, drums and other (a stem containing all the remaining instruments). This is a more viable task, since it is possible to extract these four stems from most modern mixtures, even in the case of monophonic mixtures.
In order to tackle this problem and estimate the sources s, many approaches have been proposed, which can be divided into two categories. The first group includes the signal-processing-based methods, which attempt to estimate the mixing matrix A, and consequently extract the source signals, by exploiting possible cues observed in the mixture signals, such as the sparse statistical profile of the source signals and their orientation in space. Methods belonging to the first category span from Non-negative Matrix Factorization (NMF) [5], Independent Subspace Analysis (ISA) [6], Independent Component Analysis [7], Directional Clustering [8], [9] and Bayesian modeling [10] to Robust Principal Component Analysis (PCA) [11]. For more traditional approaches, one can refer to [12] and [1]. The second group contains the deep learning frameworks, where deep neural networks have been employed as an alternative to the previously dominant audio source separation methods. Some of the most recent and state-of-the-art networks for this task include the CWS-PResUNet [13], the KUIELab-MDX-Net [14], the D3NET [15], DEMUCS [16] and the hybrid DEMUCS [17], among multiple other implementations. In this paper, our discussion will focus only on frameworks that belong to the second category, since the presented approach is based on deep learning.
Modern deep learning networks follow different routes for tackling the MSS problem. Among the first applications of deep convolutional networks in source separation, Simpson et al. attempted to separate either all sources [18] or the vocals, to create karaoke audio [19]. In [20], Huang et al. introduced the use of deep recurrent networks to solve the separation problem. In [21], Uhlich et al. introduced a combination of deep dense layers along with bidirectional LSTM layers to achieve separation. Some models, such as DEMUCS [16], Conv-TasNet [22] and Meta-TasNet [23], attempt to separate the signals in the time domain by using the mixture's waveform as input to the network. Other approaches, such as X-UMX [24], LASAFT [25], D3NET [15], CWS-PResUNet [13] and KUIELab-MDX-Net [14], transform the data into sparse representations, in order to enhance the signals' features and separate the sources more efficiently. Most approaches tend to rely on traditional transformations for this task, including the Short-Time Fourier Transform (STFT), instead of learned overcomplete dictionary approaches [26] that offer more sparsity. The advantage is that there are fast algorithms for calculating the traditional transformations and their properties are well documented.
Additionally, there are models which address the problem in both the time and the time-frequency domain, as performed by the Hybrid DEMUCS [17]. Since the mixing is considered linear and the transformations applied to the observed signals are linear, it is mathematically equivalent to perform separation either in the time domain or in these transform domains.
In [13], [14], [15], [24], [25], [27], and [17], the signal is transformed into a spectrogram using the Short-Time Fourier Transform (STFT). The spectrogram contains complex values, from which the magnitude and the phase of the signal are extracted. The magnitude is then used as input for training the networks and separating the sources. Most MSS approaches assume a time-frequency mutual exclusion between sources in the STFT domain, implying that only one source is dominant at each time-frequency point. This property is theoretically supported in the work of Liutkus et al. [28], assuming smoothness, local stationarity or periodicity for the time-frequency representation of the sources. In this case, mutually exclusive masks for each source are estimated and applied to the mixture to create the STFT representation of each separated source. The problem appears when this assumption ceases to hold, which, unfortunately, seems to be the case in real-world song mixtures, where there is time-frequency overlap between the requested stems of a song. This then also becomes a phase recovery problem, since many approaches relax the mutual exclusiveness of the magnitude of the spectrogram, allowing it to take arbitrary values after separation, but continue to use the phase information of the mixture. This problem has been documented in the works of Magron et al. [29], Magron, Drossos et al. [30] and Stoller et al. [31], where solutions, especially for the phase recovery problem, are discussed. In [17], the combined use of a temporal and a spectral branch provides important information to the network for performing efficient separation. A similar approach is applied in [14], since the signals' time waveform enhances the network performance.
On the other hand, in [13] the phase information is extracted after processing the spectrogram's magnitude in the employed network, while the LASAFT network in [25] decomposes the complex data into their real and imaginary values and transforms the stereo spectrograms into a 4-channel input, so that no information is lost and the phase information is estimated by the algorithm. Unfortunately, such processing leads to networks of increased complexity without significant performance improvement.
In this paper, we propose the use of an alternative transform, in order to avoid the phase recovery issue: the Short-Time Discrete Cosine Transform (STDCT), instead of the commonly used STFT. The main motivation is that the STDCT is equally sparse and linear, but, most importantly, it is a real-valued transform. Thus, the transform values can be presented directly to the network as input, and the network can infer the real values of the transformed separated sources. The return to the time domain can then be achieved without any further post-processing. In addition, we introduce a novel MultiResUNet architecture with attention modules for stereo audio mixtures. To increase performance, we train L separate Attentive MultiResUNets, one for each desired component. The proposed architecture is based on a similar architecture presented in [32], i.e. a fully convolutional layered network used for Biomedical Image Segmentation. This architecture is based on the original U-Net [33], but employs residual blocks that connect similar levels of the encoder and the decoder, making the network more robust and capable of analysing objects at different scales. This network was adapted to fit the source separation task. In addition, an attention module is incorporated at the end of the residual skip connection path that connects the same-level encoder and decoder layers. To the best of our knowledge, this is the first application of a multi-resolution U-Net with residual skip connections and attention to audio signal processing and source separation. Its major advantage is the considerably decreased computational cost, compared to other state-of-the-art source separation networks, while featuring performance that ranks behind only far more complicated networks. The outline of the proposed network is depicted in Fig. 1.
The paper is organised as follows. First, the data pre-processing is presented, with the introduction of the STDCT and its importance over STFT-based spectrograms. Next, the architecture of the proposed network is described in detail, with emphasis on the proposed ''multires block'', the ''res path'' and the attention modules, and their customisation for music source separation. The post-training steps that were introduced in order to complete the source separation task are also presented in detail. Finally, the performance of the proposed framework in MSS is investigated and compared with state-of-the-art approaches on the MUSDB18 dataset [3], with promising results. A detailed ablation study is also presented to validate the chosen hyper-parameters and modules of the architecture.

II. PROPOSED METHOD
The task here is to separate a stereo mixture signal x_s(n) = [x_1(n), x_2(n)]^T and estimate the L participant source signals s_s(n) = [s_1(n), s_2(n), . . . , s_L(n)]^T. Following most current literature, we also use the MUSDB18 dataset [3] in our experiments, which contains a collection of modern stereo songs, separated into L = 4 components, i.e. vocals, bass, drums and other. The other category includes anything that does not belong to the first three categories.

A. PREPROCESSING
The first step was to prepare the dataset for training the network. The MUSDB18 tracks were downsampled to 16 kHz and segmented into m-second patches. These stereo patches were transformed using either the STFT or the STDCT. To avoid possible overfitting, no augmentation was used, apart from the random choice of m-second patches for every batch and training step. Various augmentation techniques were proposed in [34], [35], and [36]; however, we preferred to keep the training framework as simple as possible.

B. DATA TRANSFORMATION
In recent years, as mentioned in Section I, most researchers have been using the STFT as the transform of choice. By using the STFT, the mixing model in (1) is transformed to X(t, f) = A(f)S(t, f), where t is the time frame index and f represents the normalised frequency. Assuming that the mixing is stationary and instantaneous (as in (1)), and since the STFT is a linear transformation, it follows that A(f) = A, where A is the mixing matrix of (1). The main idea behind the STFT is to divide a signal into shorter overlapping segments of equal length, apply the Fourier transform to each of them and place the results in a 2D fashion, creating a 2D image, i.e. a time-frequency spectrogram. The resulting spectrogram contains complex values, which are computationally expensive for neural networks to process. Therefore, the magnitude of this 2D image is commonly used for processing and separation. After separation, the separated output must be transformed back to the time domain using the inverse STFT. However, this process requires the separated sources' phase, a piece of information that is unknown, since only the magnitude was estimated. As mentioned in the previous section, there are many methods to estimate the phase of each individual source. All of them can invert the signal back to the time domain, but the process is complex and computationally expensive, and the offered improvement is not always noteworthy.
For this reason, we propose an alternative to the STFT: the Short-Time Discrete Cosine Transform (STDCT). The STDCT follows the same mechanism as the STFT, but uses the Discrete Cosine Transform instead of the Fourier Transform. More specifically, the audio signal is segmented into short overlapping segments of equal duration. Each of these frames is windowed and the 1D-DCT is applied on each frame. The DCT type-II is used; thus, assuming an input s(n) of length N_1, it can be expressed as follows [37]:

S(k) = β(k) Σ_{n=0}^{N_1−1} s(n) cos(π(2n + 1)k / (2N_1)), k = 0, . . . , N_1 − 1    (2)

where

β(k) = √(1/N_1) for k = 0, and β(k) = √(2/N_1) for k > 0    (3)

Since the DCT is a linear transform and the mixing is instantaneous and stationary, the STDCT can transform (1) into X(t, k) = AS(t, k), where t is the time frame index and k represents the DCT index, X(t, k) = [X_1(t, k), . . . , X_K(t, k)]^T and S(t, k) = [S_1(t, k), . . . , S_L(t, k)]^T. In addition, X_i(t, k) and S_j(t, k) are the local-segment DCTs of the mixture and source signals and are all real-valued. This simplifies the overall separation procedure, since the input matrix X now contains real values (including the sign), instead of complex values. Thus, the real-valued DCT ''spectrogram'' can be used in its present form for training the network, without any further processing and without losing any primal information. Therefore, the separation network can estimate the corresponding sources using real values and produce real-valued DCT ''spectrograms'' as source estimates.
In addition, the real-valued DCT ''spectrogram'' source estimate can be inverted directly to the time domain, since no phase recovery post-processing is necessary. Each column is transformed to the time domain using the 1D-iDCT. For the inverse transformation, the DCT type-III is employed, which is defined as follows [37]:

s(n) = Σ_{k=0}^{N_1−1} β(k) S(k) cos(π(2n + 1)k / (2N_1)), n = 0, . . . , N_1 − 1    (4)

where β(k) is given by (3). Once the segments are inverted to the time domain, the reconstruction of the complete audio waveform is performed using the overlap-and-add (OLA) method, in a similar manner to STFT reconstruction. Perfect reconstruction using the STDCT was investigated in the past [38] and is guaranteed via careful window selection, again similarly to the STFT. In our system, the Hamming window was selected after an exhaustive ablation study over a number of possible windows, which is not included in the paper for brevity. Overall, this yields a more elegant and computationally efficient solution, compared to the aforementioned STFT-based approaches.
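As an illustration, the forward and inverse STDCT described above can be sketched in a few lines of Python. The frame length, hop size and window-sum normalisation below are illustrative choices, not the exact implementation used in this work:

```python
import numpy as np
from scipy.fft import dct, idct


def stdct(x, frame_len=2048, hop=256, window=None):
    """Forward Short-Time DCT: frame, window, and apply DCT-II per frame."""
    if window is None:
        window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop:t * hop + frame_len] * window
                       for t in range(n_frames)], axis=1)
    # orthonormal DCT-II yields the real-valued "spectrogram"
    return dct(frames, type=2, norm='ortho', axis=0)


def istdct(X, hop=256, window=None):
    """Inverse STDCT: DCT-III per column, then overlap-add (OLA) with
    window-sum normalisation, giving perfect reconstruction."""
    frame_len, n_frames = X.shape
    if window is None:
        window = np.hamming(frame_len)
    frames = idct(X, type=2, norm='ortho', axis=0)  # inverts the DCT-II
    out = np.zeros((n_frames - 1) * hop + frame_len)
    wsum = np.zeros_like(out)
    for t in range(n_frames):
        out[t * hop:t * hop + frame_len] += frames[:, t] * window
        wsum[t * hop:t * hop + frame_len] += window ** 2
    return out / np.maximum(wsum, 1e-8)
```

Note that, unlike the STFT, the resulting representation is entirely real, so it can be fed to the network and inverted back without any phase handling.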
In [14], the authors suggest that their network's training is more efficient when a long frame window is chosen and only the lower frequencies are retained as input to the network. This results in a better separation scheme, which is reasonable, since the most significant information of most source components resides in the lower frequencies. This process therefore helps the network zoom in on the essential information of the mixture and recognise more efficiently the features that belong to each component. The only exception to this behaviour is the drums, where all frequencies contain information of equal importance. Consequently, we have included this strategy in our framework, by choosing different frame windows for every component, in the same spirit as [14], and keeping only the significant lower frequencies, except for the drums.
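Retaining only the lower DCT bins, and zero-padding them back before inversion, can be sketched as follows (the `keep_frac` value is a hypothetical example; the retained band differs per component):

```python
import numpy as np


def keep_low_bins(X, keep_frac=0.25):
    """Retain only the lowest DCT bins of an STDCT 'spectrogram'
    (bins along axis 0, frames along axis 1)."""
    n_keep = int(X.shape[0] * keep_frac)
    return X[:n_keep, :]


def restore_bins(X_low, full_bins):
    """Zero-pad a truncated STDCT back to full height before inversion."""
    pad = np.zeros((full_bins - X_low.shape[0], X_low.shape[1]))
    return np.vstack([X_low, pad])
```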

C. PROPOSED NETWORK ARCHITECTURE
The network we present in this study is a two-part architecture that follows the general U-Net configuration, containing an encoder and a decoder [33]. Each level of the encoder concatenates its output to the input of the decoder at the same level, i.e. skip connections. In a similar manner to [32], we incorporate two components, known as MultiRes blocks and Res paths, in the general U-Net architecture (see Fig. 1). VOLUME 10, 2022

1) MultiRes BLOCKS
The MultiRes blocks consist of three different groups of 3 × 3 Convolutional blocks with a gradually increasing number of filters F. In every group, the number of Convolutional blocks increases from 1 to 3, with each block containing a 3 × 3 Convolutional layer followed by a Rectified Linear Unit (ReLU). The quantity F, as proposed in [32], creates a connection between our model and the original U-Net [33] and is estimated as follows:

F = ⌊γ · P⌋    (5)

where P is the number of filters at the corresponding layer of the U-Net and γ is a scaling value. As suggested in [32], we assigned F/6, F/3 and F/2 filters to the Convolutional blocks of each respective group. Finally, there is a residual connection with a 1 × 1 Convolutional layer, which contributes additional spatial information and helps the network learn the data features more efficiently. Gradually increasing the number of filters offers a compromise between heavy memory operations and the quality of feature extraction, allowing us to use larger data inputs and acquire better audio quality. The architecture of the MultiRes blocks is shown in Figure 2, along with the size of each filter.
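The filter allocation of a MultiRes block can be summarised in a small helper, following F = γ · P and the F/6, F/3, F/2 split described above (the flooring of F and of the per-group counts is an assumption):

```python
def multires_filters(P, gamma=1.75):
    """Number of filters for the three convolutional groups of a
    MultiRes block, given the filter count P of the corresponding
    U-Net layer and the scaling value gamma (F = gamma * P)."""
    F = int(P * gamma)
    group_filters = [F // 6, F // 3, F // 2]  # per-group widths
    return F, group_filters
```

For example, a U-Net layer with P = 32 filters and γ = 1.75 gives F = 56 and groups of 9, 18 and 28 filters.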

2) RES BLOCKS
The Res path, on the other hand, is a shortcut between the encoder and the decoder, similar to U-Net's skip connections [33]. In [32], Ibtehaz and Rahman argued that the U-Net's skip connections can be improved by incorporating some Convolutional layers along these connections, filling a possible gap between the encoder's and the decoder's extracted features. Thus, the Res path is formed as a chain of Convolutional layers with residual connections, as shown in Figure 3. The sizes of the filters are also shown in Figure 3. Using this path, the feature maps from the encoder are transferred to the decoder, where they can be concatenated with the decoder's features, since they have the same size. The Res path assists the network in extracting improved features, since the information is more accurate, leading to better results. This has been documented for image processing tasks; in this study, we confirm that it is also the case for audio-related spectrogram images.

3) ATTENTION BLOCKS
A major upgrade we propose here is the introduction of a self-attention mechanism [39], [40], incorporated at the end of the residual convolutional layers that connect each level of the encoder with the corresponding level of the decoder. This module preserves the key features of the target source, while suppressing the features of the other components. It receives as input the output of the residual path of the corresponding encoder level x_r and the decoder output of the previous level x_d (see Fig. 1). The attention module (see Fig. 4) adds the two inputs after each has passed through a 1 × 1 convolutional layer (W_r, b_r and W_d, b_d respectively). The sum passes through a ReLU, is then processed by a 1 × 1 convolutional layer W_a1, b_a1 and a Softmax activation function, to discern the features belonging to the target source from the irrelevant ones, before entering another 1 × 1 Convolutional layer W_a2, b_a2. The product of this process forms the attention coefficients, which are multiplied with the output of the residual path and then processed by another convolutional layer W_a3, b_a3 and a ReLU activation function, forming the final response that proceeds to the next layer. The proposed attention module was inspired by [40], but is modified with additional convolutional layers and without the Sigmoid non-linearities, in order to accommodate the non-negative data. The attention coefficients a_i for each pixel i are given by

a_i = W_a2 ∗ Softmax(W_a1 ∗ ReLU(W_r ∗ x_r + b_r + W_d ∗ x_d + b_d) + b_a1) + b_a2    (6)

The output is given by the Hadamard product between the residual path and the attention image.
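A minimal NumPy sketch of this attention module follows; since every operation is a 1 × 1 convolution, each reduces to a per-pixel channel mixing, written here as an einsum. The weight shapes and the channel-wise softmax axis are illustrative assumptions rather than the exact configuration:

```python
import numpy as np


def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention_block(x_r, x_d, params):
    """Attention module sketch: x_r is the Res-path output, x_d the
    previous-level decoder output, both of shape (H, W, C).
    `params[name]` holds a (weight, bias) pair per 1x1 convolution."""
    conv1x1 = lambda x, W, b: np.einsum('hwc,cd->hwd', x, W) + b
    # add the two projected inputs, then ReLU
    z = np.maximum(conv1x1(x_r, *params['r']) + conv1x1(x_d, *params['d']), 0)
    # softmax over channels picks out target-source features (assumed axis)
    z = softmax(conv1x1(z, *params['a1']), axis=-1)
    a = conv1x1(z, *params['a2'])        # attention coefficients
    y = x_r * a                          # Hadamard product with the Res path
    return np.maximum(conv1x1(y, *params['a3']), 0)  # final conv + ReLU
```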

4) ARCHITECTURE SUMMARY
As presented in detail in Figure 1, the encoder is structured by chaining a repeated pattern of MultiRes blocks and 2 × 2 Max Pooling layers, which halve the data size each time and keep the number of channels intact. The Res paths are placed prior to the pooling operation of the encoder, in order to transfer vital information to the decoder without it being summarised by the pooling operation. On the other hand, the decoder uses a symmetrical cascade of MultiRes blocks followed by 2 × 2 Transposed Convolutional layers, which upsample the feature maps by 2 and halve the number of channels. Each MultiRes block of the decoder receives features from the previous level, as well as features from the Attention module. The only exception is the last MultiRes block prior to the output, which receives only the output of the previous level. Another novel element is that all convolutional layers in the network are activated by the ELU activation function [41]. In addition, batch normalisation is used in these layers. Another major difference with [32] is that we replaced the Sigmoid activation function of the output layer with a Linear function, since we need the output values to be real, i.e. to contain negative values as well.
Finally, the loss function that was used for the training process was the Mean Square Error (MSE), instead of the commonly used cross-entropy, since the objective is to perform regression, i.e. infer real numbers, and not perform classification. It is important to stress that the network delivers as output the mono STDCT representation of one desired stem. In other words, a different network is trained independently for each of the desired stems. The network can be trained to infer stereo outputs of the source with similar performance, however, this was not included in this study.

D. POST-PROCESSING
Since the task at hand was to extract the L participant sources, we trained different networks, one for each component. Therefore, we created 4 different networks, with the architecture shown in Figure 1, in order to predict the required sources, i.e. vocals, bass, drums, other. The output of each network is in the transform domain, thus it needs to be inverted back to the time domain to be audible. To enhance the separation quality of the separated source, we employed the following three steps:

1) SIGNAL ENERGY THRESHOLDING
Brunner et al. [42] claimed that listening to separated audio sources provides a direct indication of when an audio source should be suppressed. Essentially, the residues of separation from other sources can be suppressed by applying a binary mask to each estimated audio source, with the appropriate signal energy threshold. Since the proposed approach does not use masks on the separated outputs or the mixture signal, there is bound to be some residual energy in non-relevant time-frequency points, which contaminates the separated sources. To alleviate this, we applied a signal energy thresholding technique with a different approach to [42]. More specifically, we checked the energy E(j) of each of the output's data frames, estimated as:

E(j) = Σ_n ŝ_j²(n)    (7)

where ŝ_j denotes the j-th frame of the estimated stem. If the energy of the frame is less than the applied threshold t_i, we set the whole frame equal to zero:

ŝ_j(n) = 0, ∀n, if E(j) < t_i    (8)

In a similar manner to [42], we set the threshold value at t_i = 10⁻³ for vocals, bass and drums and t_i = 10⁻⁴ for the other stem. We proceed by sliding one frame at a time until the end of the data. This method not only successfully suppresses noisy parts of the estimated signals containing residues of other stems, but also avoids rough transitions between low-energy and high-energy audio segments, producing audio outputs that are as smooth and clean as possible.
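The thresholding step can be sketched as follows (frames are taken as the columns of the output representation; the mean-square energy normalisation is an illustrative assumption):

```python
import numpy as np


def energy_threshold(frames, t_i=1e-3):
    """Zero out low-energy frames of a separated stem.
    `frames` has shape (frame_len, n_frames); E(j) is the mean
    squared amplitude of frame j."""
    E = np.mean(frames ** 2, axis=0)
    out = frames.copy()
    out[:, E < t_i] = 0.0  # suppress frames below the threshold
    return out
```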

2) SOURCES RECONSTRUCTION TO TIME-DOMAIN
The output of the network is a 2D representation of each desired stem, which contains real values (both positive and negative). The reconstruction of each separated source can be generally described as follows:

ŝ(n) = OLA_t { iDCT[ Ŝ(t, k) ] }    (9)

More specifically, each column of the 2D representation is inverted to the time domain using (4). The resulting frames are used to reconstruct the time-domain version of the stem using overlap-and-add (OLA). It is worth noting that the goal of the network is to produce monophonic versions of the original stems. Nonetheless, it can be amended to create stereo versions, if required.

3) LOW-PASS FILTERING FOR HIGH FREQUENCY ARTIFACT COMPONENTS SUPPRESSION
The reconstructed signals, especially the bass and the other stems, need further manipulation in order to suppress high-frequency artifact residues. To remove this high-frequency noise from the separated signals, we applied 5th-order Butterworth low-pass filtering as a post-processing step. Low-pass filtering is not applied to the drums, since percussive instruments carry significant energy levels over all audible frequencies. In this work, a cutoff frequency of f_c = 5 kHz was chosen for the vocals and other, and f_c = 1 kHz for the bass. With these cutoffs, no significant degradation was observed in the filtered sources.
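This post-processing can be sketched with SciPy as follows (the zero-phase forward-backward application is our illustrative choice, not necessarily the exact filtering used in this work):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def lowpass(x, fc, fs=16000, order=5):
    """5th-order Butterworth low-pass filter, applied forward and
    backward (zero phase) to suppress high-frequency residues."""
    sos = butter(order, fc, btype='low', fs=fs, output='sos')
    return sosfiltfilt(sos, x)
```

A stem other than drums would then be filtered as `lowpass(stem, fc=5000)` (vocals/other) or `lowpass(stem, fc=1000)` (bass).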

III. EXPERIMENTS

A. DATASET
The proposed framework was trained and tested on the MUSDB18 dataset [3], which contains 150 full-length stereo audio tracks of different genres, sampled at 44.1 kHz. Each track consists of 4 different stems (i.e. vocals, bass, drums and other), with every stem included separately in the dataset. Out of the 150 tracks, 100 are used for training and 50 for testing. Furthermore, the training set has a pre-defined split that separates it into 86 tracks for training and 14 tracks for validating the network. In order to train the proposed network, all signals were downsampled to 16 kHz and no additional training data were involved in the process.

B. EVALUATION BENCHMARK
The performance of the proposed framework was evaluated by estimating the Signal-to-Distortion Ratio (SDR), developed by Vincent et al. [43]. For this process, the 50 tracks of the testing set were divided into 1-second windows. The separation performance was assessed by estimating the median SDR over all test segments, as suggested in [44].
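A simplified version of this evaluation can be sketched as follows; note that the full BSSEval SDR of Vincent et al. also projects the estimate onto allowed distortions of the reference, so the plain energy ratio below is only a common approximation:

```python
import numpy as np


def sdr(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB (energy ratio of
    the reference to the residual; an approximation of BSSEval SDR)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den + eps)


def median_sdr(reference, estimate, fs=16000):
    """Median SDR over non-overlapping 1-second windows."""
    n = len(reference) // fs
    scores = [sdr(reference[i * fs:(i + 1) * fs],
                  estimate[i * fs:(i + 1) * fs]) for i in range(n)]
    return float(np.median(scores))
```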

C. TRAINING PROCESS
The dataset preparation was a crucial part of the training procedure. Each stereo mixture song, along with the selected target source, was divided into m = 4-second segments. As suggested in [14], depending on the target source, the data were transformed by the STDCT with different frame lengths. In (5) for the MultiRes blocks, we selected the scaling coefficient γ = 1.75; this value maximises the network's capacity while preventing overfitting and underfitting. For the separation task, a different model was trained for each target source, thus L = 4 identical models were implemented, all trained with the Adam optimizer [45] and a learning rate of 0.0005.
Finally, the training epoch is defined by selecting all data batches in sequence, presenting them to the network and comparing the output with the desired target source, by using the Mean Square Error (MSE) as loss function. After each epoch the data are shuffled again to avoid overfitting.
Before estimating the separation performance, a 5th-order Butterworth low-pass filter was applied to all the sources except for the drums, where no significant degradation was observed compared to the others, as mentioned in Section II-D. This step was applied in the time domain, for further suppression of high-frequency noise, after inverting the output with the iSTDCT.
All the models were trained on a computer running Ubuntu 20.04 with an NVIDIA GeForce 3090 GPU with 24 GB, an Intel i9-11900F CPU and 64 GB of RAM. The networks were implemented in TensorFlow v2.5. The networks were trained for 20-25 epochs each, depending on their validation status, i.e. each training run stopped automatically after 5 consecutive epochs without improvement in performance.

D. PRELIMINARY COMPARISON BETWEEN STDCT AND STFT
This section presents a preliminary study of the properties of the STFT and the STDCT domains in the source separation problem. It is important to mention that in the STDCT domain, the term phase information denotes the sign of each data point. Two experiments were conducted to compare the two transforms, and more specifically the resilience of their phase information to interference from other sources. The same parameters are used for both transforms, i.e. frame size W = 2048, hop size H = 12.5% and a Hamming window. In the first experiment, each ground-truth stem was moved to the transform domain along with the corresponding song mixture. The magnitude of the source was kept intact, but the original phase was replaced by the mixture phase. The steps for this procedure are outlined below:
1) Transform the mixture and the corresponding ground-truth target stem with the STFT or the STDCT.
2) Combine the phase information of the mixture with the magnitude information of the ground-truth target stem.
3) Invert the produced signal back to the time domain using the corresponding inverse transform.
The results of this process are shown in Table 1. The interference in the phase affects the STFT more than the STDCT: the average SDR of 7.92 dB for STFT-based reconstruction is much lower than the average SDR of 11.81 dB for STDCT-based reconstruction. This might be due to the fact that the phase information in the STDCT is a sign, instead of a real number as in the case of the STFT. The second experiment replicates a scenario closer to the source separation problem.
Here, Ideal Ratio Masks (IRM) [46], extracted from the ground-truth stems, are employed as oracle masks that are applied to the song mixture to separate each stem. The two transforms are compared for the oracle mask estimation and separation. This scenario uses the assumption of time-frequency exclusion between stems in their time-frequency representation (STFT or STDCT). The IRM mask was selected, since it is transparent to the transform choice. The Multichannel Wiener Filter (MWF) [47], [48] oracle mask, which is commonly used by the yearly stem unmixing contest [49], was not selected in this experiment, since the STDCT cannot translate convolution directly into multiplication [50], and thus the derived Wiener solution in [47] and [48] cannot be applied directly. The results using IRM oracle masks for the two transforms are shown in Table 2. The performance cap imposed by the mutual exclusion assumption appears with the STDCT as well, but it is raised by approximately 1.7 dB. Consequently, the STDCT seems a more efficient framework for the source separation problem. The two experiments have demonstrated the superiority of the STDCT in two cases important for source separation. This was the initial motivation to start investigating the use of the STDCT transform, instead of the STFT, in deep-learning-based source separation frameworks. The performance cap that still exists, even in the STDCT case, when the time-frequency mutual exclusion of the stems is assumed (Table 2), and when phase containing interference is used during reconstruction (Table 1), motivated us to adapt the network to use negative values as inputs and not to use mutually exclusive masks for separation. Consequently, the network was forced to infer the sign of the STDCT, yielding a direct solution to this problem. In addition, the network is forced to untangle possible overlapping of the stems in several time-frequency points, thus relaxing the initial assumption.
Possible residual energy at some time-frequency points is removed by thresholding after separation, in order to filter out musical noise. The proposed solution thus offers a straightforward architecture that requires no sophisticated post-processing steps for the phase estimation of the stems.
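The residual-energy thresholding can be as simple as zeroing transform-domain coefficients below a small fraction of the frame's peak magnitude; the specific threshold value below is illustrative, as the paper does not fix one.

```python
import numpy as np

def threshold_residual(coeffs, ratio=1e-3):
    """Zero out transform-domain coefficients whose magnitude falls below
    a fraction of the peak magnitude, suppressing low-level residual
    energy (musical noise) after separation. The ratio is a hypothetical
    choice for illustration."""
    cutoff = ratio * np.abs(coeffs).max()
    out = coeffs.copy()
    out[np.abs(out) < cutoff] = 0.0
    return out
```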

E. NETWORK PERFORMANCE
As mentioned in Section III-B, the SDR metric was employed to measure the proposed Attentive MultiResUNet's performance. The results are shown in Table 3, where the framework is compared to the networks presented in [13], [14], [15], [16], [17], [22], [23], [25], and [24], when tested on the MUSDB18 dataset. Audio samples of the separated output stems produced by the proposed MultiResUNet, along with their ground truth, can be found here.2 Compared to all these networks, the Attentive MultiResUNet's performance comes third, with Hybrid Demucs holding the first place in bass, drums and average performance, and KUIELab-MDX-Net holding the first place in vocals and other. However, the importance of our proposal can be seen in Figure 5: the two networks in [17] and [14], whose performance is higher than that of the proposed Attentive MultiResUNet, are very large, with more than 80 million model parameters each. This size is justified by the complexity of their separation schemes; however, their need for computational power and their time-consuming training may discourage their casual use. On the other hand, the proposed Attentive MultiResUNet is a much smaller network (8.6 M parameters) that is easier to train, since it needs fewer resources, and still yields noteworthy results.
Therefore, even though the proposed framework may not achieve the best performance among all the state-of-the-art networks, its small size and robustness, compared to other more complex networks, demonstrate its merits.

1) STFT VS STDCT USING THE PROPOSED NETWORK
In this experiment, the aim is to compare the performance of the STFT and the STDCT within the proposed network. The stereo inputs were transformed using the STFT with the aforementioned window length W and hop size H. The complex stereo STFT representation was re-arranged as a 4-channel input (real-Left, imaginary-Left, real-Right, imaginary-Right) and presented to the proposed network for training. For the STDCT, the same window length and hop size were used, yielding a 2-channel input. Note that the STFT representation features W/2 elements along the frequency dimension (after removing the negative-frequency content), whereas the STDCT has W elements along the frequency dimension. The two representations therefore have tensors of equal size in terms of number of elements, but of different arrangement (1024 × 128 × 4 for the STFT and 2048 × 128 × 2 for the STDCT). In this manner, the STFT is modified to also present real numbers as input to the network, similarly to the STDCT. The results in Table 4 indicate the boost in performance offered by the STDCT. This might be because the real and imaginary parts of the same signal are treated as independent input channels, so the receptive fields of the first layers cannot encode their correlations and combine the information more productively.
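The two input arrangements can be sketched as follows, operating on pre-framed stereo signals. This is a shape-level illustration: the framing, hop handling and the exact channel ordering inside the network are simplified, but it makes the equal-element-count claim concrete.

```python
import numpy as np
from scipy.fft import dct

W, T = 2048, 128  # frame length and number of frames

def stft_input(frames_L, frames_R):
    """Stereo STFT re-arranged as a 4-channel real tensor
    (real-L, imag-L, real-R, imag-R), keeping W/2 positive-frequency bins.
    `frames_L`, `frames_R`: arrays of shape (T, W)."""
    chans = []
    for frames in (frames_L, frames_R):
        F = np.fft.rfft(frames * np.hamming(W), axis=-1)[:, :W // 2]
        chans += [F.real, F.imag]
    return np.stack(chans).transpose(2, 1, 0)   # (W/2, T, 4)

def stdct_input(frames_L, frames_R):
    """Stereo STDCT as a 2-channel real tensor with W bins per frame."""
    chans = [dct(f * np.hamming(W), norm='ortho', axis=-1)
             for f in (frames_L, frames_R)]
    return np.stack(chans).transpose(2, 1, 0)   # (W, T, 2)
```

Both tensors contain the same number of elements (1024 × 128 × 4 = 2048 × 128 × 2), only arranged differently along frequency and channel axes.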

2) LOSS FUNCTION CHOICE
An essential part of successful training is the loss function used for calibrating the network after each batch. The most common loss functions for the task are the Mean Square Error (MSE) and the Mean Absolute Error (MAE). Therefore, we conducted a series of experiments in order to check which yields the best results.
As shown in Table 5, the MAE fails to separate the vocals from the mixture, with a median SDR of −15.36 dB, while the MSE succeeds with an SDR of 8.27 dB. To verify this behaviour of the MAE, we repeated the same experiment with the other sources. Their performance is improved, but remains inferior to that of the MSE. Due to the very low performance on vocals, the MAE was excluded from our implementation.

3) FINAL LOSS SCHEME
After selecting the appropriate loss function, we investigated the formulation of the final loss scheme. The proposed network is trained in the transform domain, but it was necessary to decide whether the transform or the time domain would be used for calibrating the network, before conducting the rest of the experiments. Thus, we conducted three different experiments:
1) Training using the MSE in the transform domain (Loss 1)
2) Training using the MSE in the time domain (Loss 2)
3) Training with a combination of the MSE loss in the transform domain and the MSE loss in the time domain (Loss 3)
As presented in Table 6, the network underperforms with Loss 2; as expected, it was more difficult for the network to separate the mixture in the time domain. With Loss 3, where the two loss schemes are combined, the separation quality reaches 7.76 dB in terms of SDR; however, it is the first case, where the loss function is applied only in the transform domain, that yields the best results. Therefore, the transform-domain loss, i.e. Loss 1, was selected.
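Loss 3 can be sketched as below, using a framewise inverse DCT to obtain the time-domain term. The equal weighting of the two terms is a hypothetical choice, as the paper does not specify one. Note that for a plain orthonormal framewise DCT the two terms coincide by Parseval's theorem; in the full STDCT pipeline, with windowing and overlap-add, they differ.

```python
import numpy as np
from scipy.fft import idct

def mse(a, b):
    return np.mean((a - b) ** 2)

def combined_loss(pred_tf, target_tf, alpha=0.5):
    """Loss 3: MSE in the STDCT domain plus MSE in the time domain, the
    latter obtained by inverting each frame with the inverse DCT.
    `alpha` (the weighting) is a hypothetical choice for illustration."""
    tf_term = mse(pred_tf, target_tf)
    time_term = mse(idct(pred_tf, norm='ortho', axis=-1),
                    idct(target_tf, norm='ortho', axis=-1))
    return alpha * tf_term + (1.0 - alpha) * time_term
```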

4) ACTIVATION FUNCTION
Another important configuration choice when training a network is the activation function used in its layers. MSS deep networks usually employ the ReLU activation function; however, there are more advanced functions which might improve the network's performance. Therefore, we experimented with four different activation functions, the ReLU, the ELU, the GELU [51] and the SWISH [52], and the results are shown in Table 7. Out of the four, the ELU scored the best results in the evaluation, with an SDR of 8.27 dB, followed by the GELU, with an SDR of 8.10 dB, the SWISH, with an SDR of 7.91 dB, and the ReLU, with an SDR of 7.83 dB. The ability of the ELU, SWISH and GELU to allow negative values seems to boost the overall performance of the network, with the ELU showing a solid improvement over the ReLU; it was therefore included in the final framework.
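The property mentioned above, that the ELU, GELU and SWISH pass (attenuated) negative values where the ReLU clips them to zero, is easy to verify with their standard definitions:

```python
import numpy as np
from scipy.special import erf

def relu(x):
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    # Negative inputs map to alpha * (exp(x) - 1) in (-alpha, 0).
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # Exact GELU via the Gaussian CDF.
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def swish(x):
    # x * sigmoid(x)
    return x / (1.0 + np.exp(-x))
```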

5) OPTIMISER
Generally, when training a network for MSS, the usual optimiser for the task is Adam; however, there are cases, as in [14], where the RMSProp optimiser [53] is suggested. Furthermore, in the past few years, new optimisers, such as AdaBelief [54], have shown promising results in deep learning; it was thus important to test these options for the proposed network. As shown by the results in Table 8, AdaBelief does not perform well in the proposed framework, with an SDR of 5.96 dB, and RMSProp underperforms as well, with an SDR of 7.63 dB; we therefore excluded both from the final implementation.

6) ATTENTION MODULE
One of the main features of the proposed Attentive MultiResUNet is the inclusion of the attention module in its architecture. It was necessary to test its contribution against a network without this module. Therefore, we conducted a series of experiments to investigate the network's behaviour. The results in Table 9 show that the proposed attention module gives a significant boost to the base network, with an SDR of 8.27 dB, compared to an SDR of 7.71 dB for the network without attention. Consequently, we incorporated the module into the final architecture.
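The module correlates each skip connection with the decoder output of the previous level. The sketch below follows the additive attention-gate pattern popularised by Attention U-Net; it is an illustration of the general mechanism, not the exact formulation of the proposed module, and the weight matrices here are plain channel-wise projections.

```python
import numpy as np

def attention_gate(skip, gate, W_x, W_g, psi):
    """Re-weight the skip connection with coefficients computed from
    itself and a gating signal (the previous-level decoder output).
    skip, gate: (T, C); W_x, W_g: (C, C'); psi: (C', 1).
    Shapes and projections are illustrative assumptions."""
    q = np.maximum(skip @ W_x + gate @ W_g, 0.0)   # additive attention + ReLU
    alpha = 1.0 / (1.0 + np.exp(-(q @ psi)))       # sigmoid coefficients, (T, 1)
    return skip * alpha                            # attended skip connection
```

Since the coefficients lie in (0, 1), the gate can only attenuate the skip features, letting the network suppress regions of the encoder output that do not correlate with the decoder state.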

7) SCALER COEFFICIENT γ
The scaler coefficient γ, introduced in [32], plays an essential role in the proposed framework, since it specifies the number of filters in the MultiRes blocks; it was therefore essential to select a value that would boost the network's performance.
The γ values selected for the experiments had to keep the right balance between the number of filters in the MultiRes blocks and the network's efficiency. As shown in Table 10, with γ = 1.50 the MultiRes blocks' filters are insufficient to process the data, while with γ = 2.00 the number of filters leads to a degradation of the network's performance. Instead, a value of γ = 1.75 yields a more balanced Attentive MultiResUNet, giving its best performance.
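As a sketch of how γ sets the block width, the original MultiResUNet derives the filter counts of the three successive convolutions in a block from the scaled width W = γ·U (roughly one sixth, one third and one half of W). The exact split used in this paper may differ; the convention below is taken from [32] for illustration.

```python
def multires_filters(U, gamma):
    """Filter counts for the three successive convolutions of a MultiRes
    block, following the convention of the original MultiResUNet: the
    scaled width W = gamma * U is split roughly as 1/6, 1/3 and 1/2.
    The split used in the present paper may differ."""
    W = gamma * U
    return int(W * 0.167), int(W * 0.333), int(W * 0.5)
```

For example, with U = 32, raising γ from 1.50 to 2.00 grows every block proportionally, which is how the coefficient trades capacity against efficiency.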

8) LOW-PASS FILTERING VS NO LOW-PASS FILTERING
As mentioned in Section II-B, [14] proposes the use of reduced spectrograms for training, keeping only the lower frequencies of all components except the drums, for which nothing is discarded. This process zooms in on the information of the transformed data and enhances the separation scheme. Since most of the sources keep their essential information in the lower frequencies, we observed our network's behaviour both when keeping all the frequencies and when removing some of the higher ones. The experiments conducted in the second case used a frame length W = 3072 and the results are shown in Table 11. It appears that, by keeping only the lower frequencies, the separation is improved, with an SDR of 8.57 dB instead of the 8.27 dB obtained before removing the bins; therefore, this mechanism leads to a more efficient separation.
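The bin reduction amounts to slicing away the highest-frequency rows of the transform before training. A sketch, where the cutoff frequency is a hypothetical example (the paper uses source-dependent ranges and keeps all bins for drums); note that the bin spacing differs between the two transforms.

```python
import numpy as np

def truncate_bins(spec, keep_hz, sr=44100, frame_len=3072, stdct=True):
    """Discard the highest-frequency bins of an (F, T) transform, keeping
    content below `keep_hz`. For a framewise DCT-II of length W the bin
    spacing is sr / (2W); for the one-sided STFT it is sr / W. The cutoff
    value itself is illustrative, not taken from the paper."""
    spacing = sr / (2 * frame_len) if stdct else sr / frame_len
    n_keep = int(np.ceil(keep_hz / spacing))
    return spec[:n_keep]
```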

9) DROPOUT VS NO-DROPOUT
In the final experiment, we investigated the possible benefits of dropout layers [55], which are documented to prevent deep networks from overfitting and to improve their performance. As shown in Table 12, the performance in both cases is similar; however, the computational cost of the dropout layers led to their exclusion from the final implementation, since each epoch lasted almost twice as long as for the network without dropout layers.

IV. CONCLUSION AND FUTURE WORK
In this paper, we presented a robust and efficient Attentive MultiResUNet for music source separation. This network is less complex than most state-of-the-art networks (using less than 10% of the runner-up's parameters) and achieves the third-best performance in terms of SDR in our study. One of its most important contributions is the use of the STDCT as the input transform, which helps the network avoid the phase recovery problem and separate the mixtures without further processing. The addition of an attention module was also shown to boost performance. In the future, we will investigate more complex mechanisms for conveying information from the encoder to the decoder, as well as introducing temporal attention blocks in conjunction with the spatial attention used in the present architecture.