A Time-Frequency Attention Module for Neural Speech Enhancement

Speech enhancement plays an essential role in a wide range of speech processing applications. Recent studies on speech enhancement tend to investigate how to effectively capture the long-term contextual dependencies of speech signals to boost performance. However, these studies generally neglect the time-frequency (T-F) distribution information of speech spectral components, which is equally important for speech enhancement. In this paper, we propose a simple yet very effective network module, termed the T-F attention (TFA) module, which uses two parallel attention branches, i.e., time-frame attention and frequency-channel attention, to explicitly exploit position information and generate a 2-D attention map that characterises the salient T-F speech distribution. We validate our TFA module as part of two widely used backbone networks (residual temporal convolution network and Transformer) and conduct speech enhancement with four of the most popular training objectives. Our extensive experiments demonstrate that the proposed TFA module consistently leads to substantial enhancement performance improvements in terms of the five most widely used objective metrics, with negligible parameter overheads. In addition, we further evaluate the efficacy of speech enhancement as a front-end for a downstream speech recognition task. Our evaluation results show that the TFA module significantly improves the robustness of the system to noisy conditions.


I. INTRODUCTION
Speech signals in a real-world acoustic environment are inevitably corrupted by background noise, which can severely degrade speech quality and intelligibility. Speech enhancement seeks to separate the target speech signal from the background noise. It is an essential component in a number of speech processing systems, such as hearing aids, automatic speech recognition (ASR), speaker verification, and brain-computer interfaces. Monaural speech enhancement remains particularly challenging. Traditional signal processing-based methods have been extensively studied for a long time, mainly including spectral subtraction [1], Wiener filtering [2], and statistical model-based methods [3], [4], [5], [6]. These methods perform well for stationary noise but fail to handle non-stationary noise.
With the advent of deep learning, speech enhancement has made remarkable progress [7]. Techniques can be grouped into time-frequency (T-F) domain methods and time-domain methods, according to the way input signals are handled. Time-domain methods perform speech enhancement directly on the raw waveform, where a deep neural network (DNN) is optimized to learn the mapping from the noisy raw waveform to the clean one [8], [9], [10], [11], [12] via some latent feature representation. T-F domain methods first transform the noisy raw waveform into a T-F representation (spectrogram) and then map it to the clean one with well-designed training objectives. The most popular T-F domain training objectives include the ideal ratio mask (IRM) [13], the spectral magnitude mask (SMM) [13], the complex IRM (cIRM) [14], the phase-sensitive mask (PSM) [15], the target magnitude spectrum [16], and the log-power spectrum [17]. More recently, the instantaneous a priori signal-to-noise ratio (SNR), termed Xi, was proposed as a training objective to bridge the gap between deep learning and traditional statistical model-based methods [18], [19]. In this study, we adopt a T-F domain method, which allows for intuitive time-frequency analysis. We adopt the three most widely used training objectives (IRM, SMM, and PSM) and a recent one (Xi) in the experiments.
Advanced speech enhancement algorithms depend on a strong backbone network architecture. Multi-layer perceptrons (MLPs) were the most widely adopted backbone architecture in early studies. Furthering the idea of T-F masking in computational auditory scene analysis, Wang et al. [20] proposed employing an MLP to predict the ideal binary mask (IBM) [21] to separate speech from background noise. Subsequently, Xu et al. [17] adopted an MLP as a regression function to learn the mapping from the noisy log-power spectra to the clean ones. In [22], Chen et al. formulated speech enhancement as a sequence-to-sequence mapping, which effectively addresses the speaker generalization issue, and employed a recurrent neural network (RNN) with four long short-term memory (LSTM) layers to model the long-range contextual information of speech. The LSTM-RNN model demonstrates substantial performance improvements over MLPs [22], [23], [24]. However, deep LSTM-RNN architectures involve a large number of parameters, which significantly limits their scope of application.
Deep convolutional neural networks (CNNs) represent another successful backbone network architecture. Unlike RNNs, which process frames sequentially, CNNs perform filtering on speech frames in parallel and capture contextual information by stacking multiple layers. Recently, residual temporal convolution networks (ResTCNs) [25], which employ 1-D dilated convolutional modules and residual skip connections, have demonstrated impressive performance in modeling long-term dependencies and have outperformed RNNs across a broad range of sequence modeling tasks. ResTCNs have gained considerable success in speech enhancement [10], [19], [26], [27] and speaker separation [11], [28] as well. The self-attention based Transformer backbone network [29] has achieved state-of-the-art performance on many natural language processing tasks. More recently, Transformers have been successfully adopted for speech enhancement [27], [30], [31] and many other speech processing tasks such as speech synthesis and voice conversion. As the key component of Transformers, the multi-head self-attention (MHA) mechanism processes the whole sequence at once and computes the similarity between all time-steps to obtain the new representation, allowing the Transformer to model long-term dependencies more efficiently. In [27], Zhao et al. proposed employing an MHA module to produce dynamic representations, followed by a ResTCN model that learns a nonlinear mapping for speech dereverberation.
The generation of human speech is constrained by the physiological structure of the vocal production system and by the phonetic and phonotactic rules of a spoken language. The success of ResTCN and Transformer in speech enhancement mainly stems from their ability to effectively model the long-range temporal context of speech. We also note that the energy concentration of a speech utterance in time and frequency varies from utterance to utterance. To preserve the speech formant structure, a speech enhancement model should operate according to the energy concentration in the T-F plane of a spectrogram. This motivates us to investigate a dedicated mechanism to characterise the salient T-F speech distribution.
The idea of attention has been well studied to enable networks to attend to salient features in computer vision [32], [33], [34], speech emotion recognition [35], [36], and speaker verification [37]. In a preliminary study [38], we investigated an attention mechanism to model the speech distribution along the frequency dimension and demonstrated its efficacy. We proposed a functional neural module, termed T-F attention (TFA), as part of the backbone networks to attend to the salient T-F representation for speech enhancement [39]. The proposed TFA module consists of two parallel attention branches, i.e., time-dimension attention (TA) and frequency-dimension attention (FA), which produce two 1-D attention maps to guide the models to focus on 'where' (which time frames) and 'what' (which frequency-wise channels), respectively. The TA and FA branches are then combined to generate the final 2-D attention map, which assigns differentiated attention weights to each T-F spectral component, allowing the networks to capture the speech distribution in the T-F representation. In this paper, we further study the TFA module [39] across different backbone network architectures and training objectives, and evaluate its efficacy for a robust ASR system.
There have been attempts to capture long-range correlations in the T-F representation by applying self-attention along the time and frequency axes [40], [41], which was referred to as T-F attention [42]. In this paper, T-F attention (TFA) refers to a dedicated mechanism that differs from self-attention: it models the salient T-F distribution of speech signals. In particular, the T-F attention of [42] is based on self-attention, and its learned attention scores represent the similarity among T-F vectors. In contrast, the differentiated attention weights learned by our TFA represent how informative each T-F spectral component is. Such a T-F attention module can be used to augment existing neural speech enhancement solutions, including self-attention. It therefore differs from that in [42] in terms of both motivation and implementation. The main contributions of this work are as follows:
• We propose a simple yet very effective network module (TFA) to characterise the salient T-F distribution for speech enhancement. It can be flexibly integrated with existing backbone networks to improve performance.
• We design time-dimension attention (TA) and frequency-dimension attention (FA) to enable the models to focus on informative frames and frequency-wise channels, respectively. Comprehensive ablation studies validate their efficacy.
• We extensively evaluate the TFA module across different backbone networks and training objectives. The results confirm that our TFA module consistently provides significant performance gains in speech enhancement, as well as in robust speech recognition.

The remainder of this paper is organized as follows. In Section II, we formulate the research problem. In Section III, we propose a novel time-frequency attention mechanism for speech enhancement. In Section IV, we describe the experimental setup.
The experimental results are presented in Section V. Finally, Section VI concludes this study.

II. PROBLEM FORMULATION

A. Signal Model
Let a noisy speech signal be given by
$$x[n] = s[n] + d[n],$$
where s[n] and d[n] denote clean speech and uncorrelated additive noise, respectively, and n denotes the discrete-time index.
The noisy speech, x[n], is then analysed frame-wise using the short-time Fourier transform (STFT):
$$X[l,k] = S[l,k] + D[l,k],$$
where X[l,k], S[l,k], and D[l,k] denote the complex-valued STFT coefficients of the noisy speech, the clean speech, and the noise components at time-frame index l and discrete-frequency index k.

B. Training Objectives
A backbone network architecture is trained to optimize a designed training objective for speech enhancement. Studies show that optimizing the network with respect to a T-F mask improves the intelligibility and quality of the enhanced speech. Without loss of generality, we choose four widely used training objectives in this study, as summarized next.
1) Ideal Ratio Mask: The ideal ratio mask (IRM) [13] is one of the most popular masking-based training objectives, and it is defined as:
$$\text{IRM}[l,k] = \sqrt{\frac{|S[l,k]|^2}{|S[l,k]|^2 + |D[l,k]|^2}},$$
where |S[l,k]| and |D[l,k]| denote the spectral magnitudes of clean speech and noise, respectively. The value of the IRM ranges from 0 to 1.

2) Spectral Magnitude Mask: The spectral magnitude mask (SMM) [13] is defined on the STFT magnitudes of clean speech and noisy speech:
$$\text{SMM}[l,k] = \frac{|S[l,k]|}{|X[l,k]|},$$
where |X[l,k]| denotes the spectral magnitude of the noisy speech.

3) Phase-Sensitive Mask: The phase-sensitive mask (PSM) [15] is an extension of the SMM that introduces a phase error term to compensate for the use of the noisy speech phase:
$$\text{PSM}[l,k] = \frac{|S[l,k]|}{|X[l,k]|}\cos\left(\theta_{S-X}\right),$$
where $\theta_{S-X}$ denotes the difference between the clean speech phase and the noisy speech phase. From the definitions of the SMM and PSM, the upper bound of their values exceeds 1. To fit the output range of the sigmoid activation function, we clip the SMM and PSM values to between 0 and 1.
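To make the three masking-based targets concrete, the following is a minimal NumPy sketch (not the authors' TensorFlow implementation) that computes the IRM, SMM, and PSM from the STFT coefficients defined above, including the clipping of the SMM and PSM to [0, 1]; the small constant eps is an assumption added to guard against division by zero.

```python
import numpy as np

def masking_targets(S, D, X):
    """Compute IRM, SMM, and PSM from complex STFT coefficients.

    S, D, X: complex arrays of shape (L, K) for clean speech,
    noise, and noisy speech, respectively.
    """
    eps = 1e-12  # guard against division by zero (our assumption)
    irm = np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(D) ** 2 + eps))
    smm = np.abs(S) / (np.abs(X) + eps)
    # Phase error between the clean speech and noisy speech phases.
    theta = np.angle(S) - np.angle(X)
    psm = smm * np.cos(theta)
    # Clip SMM and PSM to [0, 1] to match the sigmoid output range.
    return irm, np.clip(smm, 0.0, 1.0), np.clip(psm, 0.0, 1.0)
```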

4) Instantaneous a priori SNR: The instantaneous a priori SNR (Xi) was proposed as a training objective in [18], [19] and is employed by statistical model-based methods [19]. To form the training objective, the instantaneous a priori SNR in dB, $\xi_{\text{dB}}[l,k] = 10\log_{10}(\xi[l,k])$, is mapped to the interval [0, 1] using the normal cumulative distribution function:
$$\bar{\xi}[l,k] = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\xi_{\text{dB}}[l,k] - \mu_k}{\sigma_k\sqrt{2}}\right)\right],$$
where erf denotes the error function, and the mean $\mu_k$ and variance $\sigma_k^2$ are calculated from the training set (over 1 000 randomly selected samples in this study). During inference, the a priori SNR estimate is computed as follows:
$$\hat{\xi}[l,k] = 10^{\left(\sigma_k\sqrt{2}\,\operatorname{erf}^{-1}\left(2\hat{\bar{\xi}}[l,k] - 1\right) + \mu_k\right)/10}.$$
With the IRM, SMM, and PSM as training objectives, we train DNNs to produce masks at run time. We then apply the resulting masks to the STFT spectral magnitude of noisy speech to obtain a clean version. The enhanced magnitude is then used with the noisy speech phase to reconstruct the clean speech waveform. For Xi, we adopt the minimum mean-square error log-spectral amplitude (MMSE-LSA) estimator as the statistical model, which uses $\hat{\xi}[l,k]$ to compute the spectral magnitude of the enhanced speech [44].
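The CDF mapping and its inverse are straightforward to express in code. Below is a sketch of the forward mapping used to form the training objective and the inverse mapping used at inference, assuming the formulation reconstructed above; mu_k and sigma_k denote the per-frequency statistics estimated from the training set.

```python
import numpy as np
from scipy.special import erf, erfinv

def xi_bar(xi_db, mu_k, sigma_k):
    """Map the instantaneous a priori SNR (dB) to [0, 1] via the normal CDF."""
    return 0.5 * (1.0 + erf((xi_db - mu_k) / (sigma_k * np.sqrt(2.0))))

def xi_hat(xi_bar_hat, mu_k, sigma_k):
    """Invert the CDF mapping and convert from dB back to the linear domain."""
    xi_db_hat = sigma_k * np.sqrt(2.0) * erfinv(2.0 * xi_bar_hat - 1.0) + mu_k
    return 10.0 ** (xi_db_hat / 10.0)
```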

III. PROPOSED METHOD

A. Time-Frequency Attention
A TFA module is a computational unit that takes an intermediate T-F representation $\mathbf{Y} \in \mathbb{R}^{L \times d_{\text{model}}}$ as input, i.e., L frames of $d_{\text{model}}$ frequency-wise feature channels, and generates an enhanced representation $\widetilde{\mathbf{Y}} \in \mathbb{R}^{L \times d_{\text{model}}}$ with differentiated T-F attention. The diagram of the proposed TFA module is illustrated in Fig. 1.
The distribution of speech signals over the T-F plane is indexed by time frame and frequency-wise channel. We aim to generate a position-aware attention map that assigns differentiated weights to position-specific speech components. In practice, we employ two parallel attention branches, termed TA and FA, which produce a 1-D time-frame attention map $\mathbf{T}_A \in \mathbb{R}^{1 \times L}$ and a 1-D frequency-dimension attention map $\mathbf{F}_A \in \mathbb{R}^{d_{\text{model}} \times 1}$. The two 1-D attention maps are then combined via a tensor multiplication operation, resulting in a position-aware 2-D T-F attention map $\mathbf{TF}_A \in \mathbb{R}^{L \times d_{\text{model}}}$. We next describe the detailed working of the proposed TFA module. As the long-range context correlations in the T-F representation are essential to locate the informative T-F regions, each attention branch adopts a two-step strategy to capture such correlations and generate the attention map: information aggregation and attention generation.
Information Aggregation: The TA and FA branches aggregate the whole-utterance information along the time and frequency dimensions, respectively. Global average pooling and max pooling are typical techniques for aggregating global spatial information [32], [33]. We adopt global average pooling, which produces global information descriptors that are expressive and general for the entire utterance. Specifically, the TA branch applies global average pooling along the frequency dimension of the input $\mathbf{Y}$ and generates a time-frame-wise statistic $\mathbf{Z}_T \in \mathbb{R}^{1 \times L}$ as follows:
$$\mathbf{Z}_T(l) = \frac{1}{d_{\text{model}}}\sum_{k=1}^{d_{\text{model}}} \mathbf{Y}(l,k),$$
where $\mathbf{Z}_T(l)$ is the l-th element of $\mathbf{Z}_T$. Similarly, the FA branch applies global average pooling along the time-frame dimension of the input $\mathbf{Y}$ and generates a frequency-wise statistic $\mathbf{Z}_F \in \mathbb{R}^{d_{\text{model}} \times 1}$. The k-th element of $\mathbf{Z}_F$ is given as:
$$\mathbf{Z}_F(k) = \frac{1}{L}\sum_{l=1}^{L} \mathbf{Y}(l,k).$$
Attention Generation: A two-layer fully-connected (FC) network is often used to provide channel attention [32], [33]. However, the use of FC layers brings a large parameter overhead, especially for long-duration speech. Alternatively, it has been suggested that effective channel attention can be implemented more efficiently via 1-D convolution [34].
We adopt two stacked dilated 1-D convolution layers of kernel size $k_{\text{tfa}}$ to capture the dependencies in the descriptors and learn a nonlinear interaction to produce the attention map. Specifically, given the descriptor $\mathbf{Z}_T$, the attention map in the TA branch is calculated as:
$$\mathbf{T}_A = \sigma\left(f\left(\delta\left(f\left(\mathbf{Z}_T\right)\right)\right)\right),$$
where f denotes a dilated 1-D convolution operation, and δ and σ refer to the rectified linear unit (ReLU) and sigmoid activation functions, respectively. The dilation rate d is set to 1 and 2 for the first and second convolution modules, respectively. A similar process is applied in the FA branch to generate the frequency-wise channel attention map:
$$\mathbf{F}_A = \sigma\left(f\left(\delta\left(f\left(\mathbf{Z}_F\right)\right)\right)\right).$$
The attention maps obtained from the two branches then interact through a tensor multiplication operation, resulting in our final 2-D T-F attention map $\mathbf{TF}_A$, written as:
$$\mathbf{TF}_A = \left(\mathbf{F}_A \otimes \mathbf{T}_A\right)^{\top},$$
where ⊗ denotes the tensor multiplication operation. The (l, k)-th element of the final 2-D attention map $\mathbf{TF}_A$ is computed as:
$$\mathbf{TF}_A(l,k) = \mathbf{T}_A(l)\,\mathbf{F}_A(k),$$
where $\mathbf{T}_A(l)$ and $\mathbf{F}_A(k)$ denote the l-th element of $\mathbf{T}_A$ and the k-th element of $\mathbf{F}_A$, respectively. The output of our TFA module, $\widetilde{\mathbf{Y}}$, is written as:
$$\widetilde{\mathbf{Y}} = \mathbf{Y} \odot \mathbf{TF}_A,$$
where ⊙ denotes element-wise multiplication.
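To make the two-branch design concrete, below is a minimal PyTorch sketch of the TFA forward pass (the paper's implementation uses TensorFlow 1.13, so this is a re-expression rather than the original code; the default kernel size and the non-causal 'same' padding are our assumptions):

```python
import torch
import torch.nn as nn

class TFA(nn.Module):
    """Time-frequency attention over an input of shape (batch, L, d_model)."""

    def __init__(self, d_model, k_tfa=3):
        super().__init__()
        pad1 = (k_tfa - 1) // 2  # "same" padding for dilation rate 1
        pad2 = k_tfa - 1         # "same" padding for dilation rate 2
        # TA branch: two stacked dilated 1-D convolutions over the time axis
        # (single channel, since the frequency dimension is pooled away).
        self.ta = nn.Sequential(
            nn.Conv1d(1, 1, k_tfa, dilation=1, padding=pad1), nn.ReLU(),
            nn.Conv1d(1, 1, k_tfa, dilation=2, padding=pad2), nn.Sigmoid())
        # FA branch: the same design over the frequency-channel axis.
        self.fa = nn.Sequential(
            nn.Conv1d(1, 1, k_tfa, dilation=1, padding=pad1), nn.ReLU(),
            nn.Conv1d(1, 1, k_tfa, dilation=2, padding=pad2), nn.Sigmoid())

    def forward(self, y):                   # y: (B, L, d_model)
        z_t = y.mean(dim=2, keepdim=True)   # (B, L, 1): pool over frequency
        z_f = y.mean(dim=1, keepdim=True)   # (B, 1, d_model): pool over time
        t_a = self.ta(z_t.transpose(1, 2))  # (B, 1, L) time-frame attention
        f_a = self.fa(z_f)                  # (B, 1, d_model) channel attention
        # Outer product of the two 1-D maps yields the 2-D T-F attention map.
        tf_a = t_a.transpose(1, 2) * f_a    # (B, L, d_model) by broadcasting
        return y * tf_a                     # element-wise re-weighting
```

The outer product of the two 1-D maps is realised via broadcasting, which matches the element-wise definition $\mathbf{TF}_A(l,k) = \mathbf{T}_A(l)\,\mathbf{F}_A(k)$ given above.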

B. Network Architecture
Our proposed TFA module is applicable to general neural speech enhancement architectures. To set the stage for this study, Fig. 2 illustrates a typical neural solution to speech enhancement, referred to as the backbone network in this paper. Here, two recently proposed backbone networks, ResTCN [19] and Transformer [29], [31], are employed. As the key component of the Transformer layer [29] is the MHA module, we term the Transformer layer an MHANet block. The network takes the noisy spectral magnitude as input, $|\mathbf{X}| \in \mathbb{R}^{L \times K}$, where L and K denote the number of time frames and frequency bins, respectively. The first layer consists of a 1-D convolution layer with frame-wise layer normalization followed by the ReLU activation function, which encodes the input into a latent T-F representation in $\mathbb{R}^{L \times d_{\text{model}}}$. The output of the first layer is then fed into B stacked ResTCN or MHANet blocks to perform T-F feature transformation. Following the last transformation block is the output layer, a 1-D convolution layer with a sigmoid activation function that generates the estimates of the IRM, SMM, PSM, and Xi.

1) TFA in ResTCN: In Fig. 3, we illustrate a backbone network based on the ResTCN block [19] and how the proposed TFA module works inside the ResTCN block. As shown in Fig. 3(a), the ResTCN block consists of three 1-D causal dilated convolutional modules. Each convolutional module employs a pre-activation design, where the input is pre-activated using frame-wise layer normalization followed by the ReLU activation function. We denote the kernel size, number of filters, and dilation rate of each convolutional module as a tuple of three elements. The first and third convolutional modules have a kernel size of 1, whilst the second convolutional module has a kernel size of k. The number of filters is $d_f$ for the first and second convolutional modules, and $d_{\text{model}}$ for the third convolutional module. The dilation rate, d, is employed in the second convolutional module, providing a contextual field over previous speech frames. The dilation rate is cycled as the block index b = {1, 2, 3, ..., B} increases:
$$d = 2^{(b-1) \bmod (\log_2(D)+1)},$$
where mod is the modulo operation and D = 16 is the maximum dilation rate.
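For clarity, the dilation-rate cycling can be sketched in a few lines of Python:

```python
import math

def dilation_rate(b, D=16):
    """Dilation rate of the b-th ResTCN block (b = 1, 2, ..., B), cycled
    through {1, 2, 4, 8, 16} for a maximum dilation rate of D = 16."""
    return 2 ** ((b - 1) % (int(math.log2(D)) + 1))

# For B = 10 blocks: [1, 2, 4, 8, 16, 1, 2, 4, 8, 16]
print([dilation_rate(b) for b in range(1, 11)])
```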
As shown in Fig. 3(b), the ResTCN block is augmented with a TFA module to attend to the salient T-F representation. A residual connection [45] is applied between the input and output of the block to facilitate gradient optimization.

2) TFA in MHANet: In Fig. 4, we illustrate how a TFA module is incorporated into the MHANet backbone [31]. As shown in Fig. 4(a), the MHANet block comprises two sub-blocks. The first is an MHA module, and the second is a two-layer fully connected feed-forward network (FFN). A residual connection is applied in each sub-block, followed by frame-wise layer normalization. To capture the T-F energy distribution of speech, as shown in Fig. 4(b), we incorporate a TFA module into the MHA module.
Given an intermediate latent T-F tensor $\mathbf{U} \in \mathbb{R}^{L \times d_{\text{model}}}$ as the input to the block, the MHA module first projects the input $\mathbf{U}$ to queries, keys, and values: $\mathbf{Q} = \mathbf{U}\mathbf{W}^Q$, $\mathbf{K} = \mathbf{U}\mathbf{W}^K$, and $\mathbf{V} = \mathbf{U}\mathbf{W}^V$, where $\{\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V\} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ are different learned linear projections. The queries, keys, and values are then split into H attention heads, indexed by h = {1, 2, 3, ..., H} and with dimensions $d_k$, $d_k$, and $d_v$, respectively, which enables the model to attend to different aspects of information. The scaled dot-product attention is applied to each head in parallel to generate the output:
$$\text{head}_h = \text{softmax}\left(\frac{\mathbf{Q}_h\mathbf{K}_h^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_h,$$
where $d_k = d_v = d_{\text{model}}/H$, and $\mathbf{K}_h^{\top}$ denotes the transpose of the h-th head of keys, $\mathbf{K}_h$. An upper-triangular mask is used to mask out the similarities that involve future frames. For a more detailed description of the attention function, we refer the reader to the original study [29]. The outputs of all attention heads are concatenated and linearly projected again, yielding the output of the MHA module:
$$\text{MHA}(\mathbf{U}) = \text{Concat}\left(\text{head}_1, \ldots, \text{head}_H\right)\mathbf{W}^O,$$
where $\mathbf{W}^O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$. The TFA module takes the output of the MHA module and conducts a T-F attention operation (described in Section III-A) to make the model focus on the informative spectral components, resulting in an augmented T-F representation. The two-layer FFN takes the output $\mathbf{U}' \in \mathbb{R}^{L \times d_{\text{model}}}$ of the first sub-block and performs two linear transformations with a ReLU activation after the first layer:
$$\text{FFN}(\mathbf{U}') = \max\left(0, \mathbf{U}'\mathbf{W}_1 + \mathbf{b}_1\right)\mathbf{W}_2 + \mathbf{b}_2,$$
where $\mathbf{W}_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$. The size of the input and output is $d_{\text{model}}$, and the inner layer has a size of $d_{\text{ff}}$.
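The per-head computation, including the upper-triangular causal mask, can be sketched as follows (a PyTorch illustration under the shapes given above, not the authors' code):

```python
import torch

def causal_attention_head(Q_h, K_h, V_h):
    """Scaled dot-product attention for one head with a causal mask.

    Q_h, K_h, V_h: (B, L, d_k) tensors for the h-th attention head.
    """
    d_k = Q_h.size(-1)
    scores = Q_h @ K_h.transpose(-2, -1) / d_k ** 0.5   # (B, L, L)
    # The upper-triangular mask removes similarities involving future frames.
    L = scores.size(-1)
    mask = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                 device=scores.device), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ V_h          # (B, L, d_v)
```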
Model Configuration: For the ResTCN backbone, we adopt the parameter settings as in [19].

IV. EXPERIMENTAL SETUP

A. Datasets and Feature Extraction
First, we describe the clean speech and noise data used in this study. For clean speech recordings, we use the train-clean-100 set from the Librispeech corpus [46] as the training set, which includes 28 539 utterances spoken by 251 speakers. The noise recordings in the training set are taken from the following datasets: the QUT-NOISE dataset [47], the Nonspeech dataset [48], the Environmental Background Noise dataset [49], [50], the RSG-10 dataset [51] (voice babble, F16, and factory welding are excluded for testing), the Urban Sound dataset [52] (street music recording no. 26270 is excluded for testing), the noise set from the MUSAN corpus [53], and coloured noise recordings (with an α value ranging from −2 to 2 in increments of 0.25). Noise recordings that are over 30 seconds in length are split into segments of 30 seconds or less. This gives a total of 6 809 noise recordings, each with a length less than or equal to 30 seconds. For validation experiments, we randomly select 1 000 clean speech and 1 000 noise recordings (without replacement) and remove them from the aforementioned clean speech and noise sets. Each clean speech recording is mixed with a random section of one noise recording at a randomly selected SNR level between −10 dB and 20 dB in 1 dB increments. This generates 1 000 noisy speech signals as the validation set.
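The paper does not detail how the coloured noise recordings are generated; a common approach, shown below as an assumed sketch, shapes the spectrum of white noise by $f^{-\alpha/2}$ so that the power spectral density is proportional to $1/f^{\alpha}$:

```python
import numpy as np

def colored_noise(n_samples, alpha, seed=None):
    """Generate noise with power spectral density proportional to 1/f**alpha
    (alpha = 0: white, 1: pink, 2: brown; negative values tilt towards blue).
    This generation procedure is an assumption, not taken from the paper."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]                   # avoid division by zero at DC
    spectrum *= freqs ** (-alpha / 2.0)   # shape the magnitude spectrum
    noise = np.fft.irfft(spectrum, n=n_samples)
    return noise / np.max(np.abs(noise))  # normalise to [-1, 1]
```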
For evaluation experiments, we adopt the recordings of four real-world noise sources (excluded from the training set), two non-stationary and two coloured. The two non-stationary noise sources are voice babble from the RSG-10 noise dataset [51] and street music from the Urban Sound dataset [52]. The two coloured noise sources are F16 and factory welding from the RSG-10 noise dataset [51]. For each of the four noise recordings, ten clean speech recordings randomly selected (without replacement) from the test-clean set of the Librispeech corpus [46] are mixed with a random segment of the noise recordings at the following SNR levels: {−5 dB, 0 dB, 5 dB, 10 dB, 15 dB}. This generates 200 noisy mixtures for evaluation.
In this study, a square-root-Hann window function is used for spectral analysis and synthesis, with a frame length of 32 ms (512 samples) and a frame-shift of 16 ms (256 samples).
The 257-point single-sided STFT magnitude spectrum of noisy speech, which includes both the DC frequency component and the Nyquist frequency component, is used as the input.
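For concreteness, the described analysis front-end can be reproduced with SciPy as follows (a sketch; the 16 kHz sampling rate is assumed from the Librispeech corpus):

```python
import numpy as np
from scipy.signal import stft, get_window

fs = 16000                                 # assumed Librispeech sampling rate
x = np.random.randn(fs)                    # stand-in for a 1 s noisy waveform
window = np.sqrt(get_window('hann', 512))  # square-root-Hann analysis window
# 32 ms frames (512 samples) with a 16 ms shift (256-sample overlap).
f, t, X = stft(x, fs=fs, window=window, nperseg=512, noverlap=256)
mag = np.abs(X).T                          # (L, 257) single-sided magnitudes
```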

B. Training Methodology
Here, we describe the details of training methodology used in this study. A mini-batch size of 10 noisy speech utterances is used for each training iteration. The noisy speech signals are created as follows: each clean speech recording selected for the mini-batch is mixed with a random section of a randomly selected noise recording at a randomly selected SNR level (−10 dB to 20 dB, in 1 dB increments). The selection order for the clean speech recordings is randomised for each epoch. For the three masking-based training objectives (i.e., IRM, SMM, and PSM), we adopt the mask approximation to learn the mask, where the mean-square error (MSE) is the loss function. For Xi, the cross-entropy is employed as the loss function [18], [19]. Each utterance in a mini-batch is padded with zeros, giving it the same number of time frames as the longest noisy utterance.
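The construction of a noisy training mixture at a chosen SNR can be sketched as follows (a NumPy illustration of the described procedure; the power-based scaling rule is a standard choice and an assumption on our part):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, seed=None):
    """Mix clean speech with a random section of a noise recording at a
    given SNR (dB); assumes the noise is at least as long as the clean."""
    rng = np.random.default_rng(seed)
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise_seg = noise[start:start + len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise_seg ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise_seg
```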
All models are trained from scratch. For ResTCN [19] and its TFA-augmented variants, the Adam algorithm with default hyper-parameters [54] and a learning rate of 0.001 is used for gradient descent optimisation. For MHANet [31] and its TFA-augmented variant, the Adam algorithm with the parameters in [29], i.e., β₁ = 0.9, β₂ = 0.98, and ε = 10⁻⁹, is used for training. Gradient clipping is applied to all models, where the gradients are clipped to [−1, 1]. As the training of MHANet is sensitive to the learning rate [29], [31], we adopt the warm-up training strategy of [29], where the learning rate is adjusted during training according to the rule:
$$lr = d_{\text{model}}^{-0.5} \cdot \min\left(\text{n\_step}^{-0.5},\ \text{n\_step} \cdot \text{w\_steps}^{-1.5}\right), \quad (18)$$
where n_step and w_steps denote the number of training steps and warm-up training steps, respectively. Following [31], w_steps = 40 000 is adopted for warm-up training in this work.
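Equation (18) corresponds to the standard Transformer schedule of [29] and can be written directly in Python (the d_model value here is illustrative):

```python
def transformer_lr(n_step, d_model=256, w_steps=40000):
    """Warm-up learning-rate schedule of (18): linear increase over the first
    w_steps training steps, then decay proportional to n_step ** -0.5.
    Requires n_step >= 1; d_model = 256 is an illustrative assumption."""
    return d_model ** -0.5 * min(n_step ** -0.5, n_step * w_steps ** -1.5)
```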

C. Evaluation Metrics
In our experiments, five widely used metrics are adopted for extensive speech enhancement evaluations: the perceptual evaluation of speech quality (PESQ) [55], extended short-time objective intelligibility (ESTOI) [56], and three composite metrics [57]. For the PESQ metric, we adopt wide-band PESQ [55], which typically produces a lower score than its narrow-band counterpart [27]. The value range of PESQ is [−0.5, 4.5], and ESTOI values typically lie in [0, 1]. The three composite metrics are mean opinion score (MOS) predictors of the signal distortion (CSIG) [57], the background-noise intrusiveness (CBAK) [57], and the overall signal quality (COVL) [57], respectively. The value range of the three composite metrics is [0, 5]. For all five metrics, a higher score indicates better enhancement performance. The word error rate (WER%) is adopted to evaluate downstream ASR performance, where a lower WER% indicates better speech recognition performance.

D. Comparative Models
We evaluate the proposed TFA module as part of two backbone networks (ResTCN [19] and MHANet [31]) on four training objectives. The proposed ResTCN and MHANet with the TFA module are denoted by "ResTCN+TFA" and "MHANet+TFA," respectively. In addition, we conduct an ablation study to validate the efficacy of each component (the TA and FA) of the TFA module. Similarly, we denote the ResTCN and MHANet variants with the TA and FA by "ResTCN+TA," "MHANet+TA," "ResTCN+FA," and "MHANet+FA," respectively. All models in this study are implemented using TensorFlow 1.13. The experiments were conducted on an NVIDIA Tesla V100 graphics processing unit (GPU) and an Intel Xeon Platinum 8163 CPU at 2.50 GHz (96 logical processors).

V. EXPERIMENTAL RESULTS

A. Training and Validation Error
We first examine the training and validation errors across the models. The error curves produced by each of the models on the four training objectives (i.e., IRM, SMM, PSM, and Xi) are shown in Figs. 5-6, 7-8, 9-10, and 11-12, respectively, where each model is trained for 250 epochs. We observe similar trends in the error curves across the training objectives. The purple curves are for ResTCN and MHANet, and the red curves are for ResTCN+TFA and MHANet+TFA. It can be easily observed that ResTCN+TFA and MHANet+TFA yield significantly lower training and validation errors than ResTCN and MHANet, which demonstrates the effect of the TFA module. In addition, the ablation study also confirms the efficacy of the FA and TA modules. ResTCN+FA (blue curves) and ResTCN+TA (yellow curves) produce significantly lower training and validation errors than ResTCN, and ResTCN+FA yields error curves close to those of ResTCN+TA. MHANet+FA (blue curves) and MHANet+TA (yellow curves) also achieve noticeably lower training and validation errors than MHANet. For MHANet, applying the TA module achieves lower training and validation errors than the FA module. Among the TFA, TA, and FA modules, the TFA module consistently produces the lowest training and validation errors across the training objectives.

B. Experiment on Enhancement Performance
Tables I and II list the wide-band PESQ and ESTOI scores, respectively, obtained by each of the models in all noisy test conditions for the four training objectives. The highest PESQ and ESTOI scores for each condition are highlighted in boldface. Compared to the unprocessed noisy recordings, our proposed models provide substantial improvements in both PESQ and ESTOI scores for all training objectives. Taking street music noise at an SNR of 5 dB as an example, ResTCN+TFA and MHANet+TFA with the SMM achieve PESQ gains of 0.70 and 0.65, and ESTOI gains of 19.40% and 19.43%, respectively. Among the four training objectives, PSM and Xi show better overall performance than IRM and SMM in terms of PESQ. In terms of ESTOI, no training objective shows a clear advantage.
It is also easy to observe that, for all training objectives, applying the TFA module significantly improves the PESQ and ESTOI scores of the ResTCN and MHANet backbones with negligible parameter overheads (2.72 K and 0.34 K parameters, respectively), demonstrating its effectiveness for speech enhancement. In the case of F16 noise at an SNR of 5 dB, for instance, ResTCN+TFA and MHANet+TFA with the IRM provide 0.24 and 0.19 PESQ improvements and 3.98% and 2.84% ESTOI improvements, respectively, over the corresponding baselines. From the comparison results, among the two baselines, ResTCN benefits more from the TFA module in most cases.
In addition, the performance evaluations of the TA and FA modules are also reported in the ablation study. The TA and FA modules produce two 1-D attention maps to model the energy distribution of speech along time and frequency dimensions, respectively. As shown in Tables I and II, both ResTCN and MHANet achieve performance gains, in terms of PESQ and ESTOI, due to the TA and FA modules in most cases. Overall, the TA module provides more PESQ and ESTOI gains than the FA module. This could be explained by the fact that the temporal attention mechanism assigns the differentiated attention weights along the time axis, acting like a soft voice activity detector (VAD). The temporal information could be more informative than the spectral one in speech enhancement. The TFA module effectively combines the TA and FA modules to produce a 2-D attention map for modeling the T-F distribution of speech spectral components, which attains the highest PESQ and ESTOI scores in almost all cases.
Tables III-V report the average CSIG, CBAK, and COVL scores for each of the SNR levels (covering four noise sources), respectively, and the highest scores are highlighted in boldface. It is obvious that applying the TFA module to ResTCN and MHANet significantly improves their performance in terms of the three composite metrics, across different training objectives. In the −5 dB SNR case, for instance, ResTCN+TFA and MHANet+TFA with the IRM improve CSIG by 0.23 and 0.17, CBAK by 0.1 and 0.1, and COVL by 0.17 and 0.12, respectively. Again, compared to MHANet, ResTCN benefits more from the TFA module.
The TA and FA modules also provide substantial improvements over the baselines in the three metrics. For instance, in the 5 dB case with the SMM as the training objective, applying the FA and TA modules to ResTCN improves CSIG by 0.21 and 0.21, CBAK by 0.12 and 0.12, and COVL by 0.19 and 0.20, respectively. For MHANet, applying the FA and TA modules improves CSIG by 0.07 and 0.12, CBAK by 0.08 and 0.10, and COVL by 0.09 and 0.13, respectively. Overall, the TA module performs slightly better than the FA module. In the case of Xi as the training objective at an SNR of 15 dB, MHANet+FA obtains the same CSIG score (4.18) as MHANet+TFA. In all other cases, the TFA module obtains the highest CSIG, CBAK, and COVL scores.
We also evaluate the models across different numbers of building blocks, with the IRM as the training objective; the MHANet variants denote MHANet models with 4 and 6 MHANet blocks, respectively. The experimental results are given in Table VI. It can be seen that the TFA module consistently affords substantial improvements to both ResTCN and MHANet. Here, we also report the evaluation results of ResTCN with self-attention (ResTCN+SA) [27], which further demonstrates the efficacy and efficiency of our TFA module. ResTCN+SA [27] employs a multi-head self-attention module as a pre-processing module followed by a ResTCN model. Compared to ResTCN+SA, ResTCN+TFA shows substantial superiority in terms of performance scores and parameter efficiency. In addition, we also study our TFA module with the recent Conformer [58], [59] and the 2-stage SA-TCN (2S-SA-TCN) [60] as the baseline backbones. A Conformer block consists of four stacked modules, i.e., an FFN module, a self-attention module, a convolution module, and a second FFN module. Each stage of 2S-SA-TCN includes a self-attention module followed by 24 ResTCN blocks. For the Conformer, the TFA module is incorporated into the convolution module (following the third convolution unit) and the self-attention module (as shown in Fig. 4(b)). For 2S-SA-TCN, the TFA module is incorporated into the self-attention module and the ResTCN block, as shown in Figs. 3(b) and 4(b), respectively. It is clear that the TFA module consistently provides significant improvements to both the Conformer and 2S-SA-TCN.

C. Experiment on ASR Performance
In real-world environments, speech enhancement is often used as a front-end to improve the noise robustness of an ASR system. In this section, we investigate the effectiveness of our proposed model as the front-end of a robust ASR system. DeepSpeech [61], an open-source ASR system developed using end-to-end deep learning techniques, is used in this study to conduct ASR experiments for evaluating front-end performance.
The RNN-based acoustic model and language model are used in DeepSpeech. Here, we treat the DeepSpeech ASR system as a black box, without fine-tuning during the experiment. Table VII presents the average WER% scores attained by all the models for each SNR condition, along with the WER% scores averaged across all SNR conditions. It can be seen that all front-end models achieve substantial performance gains in terms of WER% compared to the ASR performance on unprocessed noisy recordings. Overall, for the two backbones (ResTCN and MHANet) and four training objectives, our proposed TFA module attains the lowest average WER% scores over all conditions and performs the best under most SNR conditions. The evaluation shows that the proposed TFA module provides substantial performance improvements over the two baselines, i.e., ResTCN and MHANet. Similar performance trends are observed for the four training objectives, and on average, the IRM achieves the best performance. With the IRM as the training objective, the TFA module improves the ResTCN and MHANet baselines with relative WER% reductions of 15.78% and 8.58% over all conditions, respectively. In addition, the FA and TA modules provide relative WER% reductions of 10.46% and 7.40% for ResTCN, and of 4.89% and 4.84% for MHANet. The ablation results on ASR performance also illustrate the efficacy of the TA and FA modules.
In Table VIII, we compare the computation required by the models (ResTCN, ResTCN+TFA, MHANet, and MHANet+TFA) in terms of the real-time factor (RTF) [62], which is the ratio of the time taken to process a speech utterance to the duration of the utterance. The RTFs are measured on an NVIDIA Tesla V100 GPU, averaged over 10 executions, using a batch size of 20 noisy mixtures with a length of 7 seconds. It can be observed that applying the TFA module introduces only a marginal increase in RTF.
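For reference, the RTF measurement described above amounts to the following sketch (the model_fn callable and the timing loop are hypothetical, not the authors' benchmarking code):

```python
import time

def real_time_factor(model_fn, batch, duration_s, n_runs=10):
    """RTF = processing time / audio duration, averaged over n_runs executions.

    model_fn: callable that enhances a batch of noisy mixtures (hypothetical).
    duration_s: duration of the audio in the batch, in seconds.
    """
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        model_fn(batch)  # run the enhancement model on the batch
        times.append(time.perf_counter() - start)
    return (sum(times) / len(times)) / duration_s
```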

D. Comparative Study
In this section, we compare our proposed method with multiple state-of-the-art systems on the VoiceBank-DEMAND dataset [78]. As reported in Table IX, our proposed model, ResTCN+TFA with the Xi training objective (ResTCN+TFA-Xi), demonstrates highly competitive performance with respect to the five evaluation metrics. More significantly, our TFA module provides a simple and flexible way to improve existing network architectures for speech enhancement. It is worth noting that our focus is a lightweight module that augments existing enhancement systems, rather than a new system that goes beyond them.

VI. CONCLUSION
In this study, we propose the TFA module, a lightweight and flexible attention module designed to model the distribution of speech components in the T-F representation, improving the representational power of a network. Our TFA module consists of two parallel attention branches, i.e., the TA and FA modules, which produce two attention maps to model the speech distribution along time-frame and frequency dimensions, respectively. We evaluate the TFA module as part of ResTCN and Transformer backbone networks and adopt four widely used training objectives to conduct extensive speech enhancement experiments.
Our experimental results demonstrate that the TFA module consistently provides significant improvements to the baseline networks in terms of five metrics (PESQ, ESTOI, CSIG, CBAK, and COVL). Moreover, the evaluation results on the downstream ASR task also demonstrate the effectiveness of the TFA module. This reveals the importance of prior knowledge about the energy distribution of speech for speech enhancement, and the inability of previous models to capture that prior. We believe that the success of the TFA module offers a new idea for the design of network architectures to boost speech enhancement. In future studies, we plan to further investigate the effectiveness of the TFA module on other commonly used datasets and other speech processing tasks such as speech recognition. In addition, we will also extend our proposed TFA module to multi-channel scenarios.