Embedding Encoder-Decoder With Attention Mechanism for Monaural Speech Enhancement

The auditory selection framework with attention and memory (ASAM), which comprises an attention mechanism, an embedding generator, a generated embedding array, and a life-long memory, is used to deal with mixed speech. When ASAM is applied to speech enhancement, the discrepancy between the voice and noise feature memories is large, which increases the separability of noise and voice. However, ASAM cannot achieve desirable performance in speech enhancement because it fails to utilize the time-frequency dependence of the embedding vectors to generate the corresponding mask units. This work proposes a novel embedding encoder-decoder (EED) in which a convolutional neural network (CNN) serves as the decoder. The CNN structure is good at detecting local patterns, which can be exploited to extract correlated embedding information from the embedding array to generate the target spectrogram. This work evaluates a similar ASAM, an EED with an LSTM encoder and a CNN decoder (RC-EED), RC-EED with an attention mechanism (RC-AEED), other similar EED structures, and baseline models. Experiment results show that the RC-EED and RC-AEED networks perform well on the speech enhancement task under low signal-to-noise ratio conditions. In addition, RC-AEED exhibits superior speech enhancement performance over ASAM and achieves better speech quality than the deep recurrent network and the convolutional recurrent network.


I. INTRODUCTION
Speech enhancement has attracted considerable research attention for several decades. It aims to remove noise or reverberation from a noisy speech signal and improve the signal's intelligibility and quality. This challenging task is significant to real-world applications such as telephone conferencing, speech recognition systems, hearing aids, and conference recording. Obvious progress has been achieved in speech enhancement owing to the introduction of deep learning approaches, which outperform conventional methods, including spectral subtraction [1] and the Wiener filter method [2], that rely on the stationary-noise assumption. However, deep learning approaches still fail to achieve desirable performance under low signal-to-noise ratio (SNR) conditions.

The associate editor coordinating the review of this manuscript and approving it for publication was Lin Wang.

Recently, several fully convolutional network (FCN)-based speech enhancement algorithms were proposed. The FCN contains only CNN layers and discards the fully connected layer. This structure was proposed by Long et al. [3], who applied it to semantic segmentation. They replaced a densely connected network with the FCN and achieved obvious improvements in pixel accuracy, mean accuracy, and mean intersection over union. They interpreted this as a trade-off: the receptive field sizes of the filters are decreased to help extract information at a finer scale. Mao et al. [4] used the FCN structure for image denoising and proposed a symmetric convolutional encoder and deconvolutional decoder with skip layers. The studies in [5] and [6] introduced the FCN structure to speech enhancement and found that the performance of the proposed convolutional encoder-decoder (CED) can exceed that of structures with an RNN and a fully connected network.
Some new structures that combine LSTMs and CNNs were proposed [7]-[9] and were proven to have promising performance in speech enhancement. Tan and Wang [8] developed the convolutional recurrent network (CRN), which inserts two LSTM layers into the middle of a CED network. The structure can be regarded as an extension of the CED in [5]. The experiment results show that CRN outperforms two LSTM networks on trained and untrained speaker datasets when tested at −2 and −5 dB SNR levels.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Zhao et al. [9] proposed EHNet, which also merges CNN and LSTM components. The convolutional component is leveraged to exploit local patterns in a spectrogram, and the bidirectional recurrent component captures the dynamic correlations between consecutive frames. Under high SNR conditions, EHNet achieves better perceptual evaluation of speech quality (PESQ) [26] and word error rate (WER) scores than an RNN [10] and two DNNs [11], [12]. Ge et al. [7] also developed a similar structure that incorporates an attention mechanism. In this structure, an enhanced frame is generated from the surrounding input frames, and the frames containing speech information are given more attention.
This study proposes applying a unified auditory selection framework that models attention and memory (ASAM) [13] together with an embedding decoder to speech enhancement, forming a combined LSTM and CNN structure. ASAM is an auditory selection and speech separation framework with attention and life-long memory, which employs the same embedding generation method as deep clustering [14]. It maps each T-F unit in the mixture spectrogram to a high-dimensional vector, and the mapped vectors form an embedding array. The life-long memory is updated by a memory vector, and the memory vector serves as a prior speech feature to extract the target voice features from the embedding array. However, ASAM fails to utilize the time-frequency dependence of the embedding vectors to generate the corresponding mask units. As shown in Fig. 1, every unit of the frequency-selective attentional filter's mask [15] is generated by a single embedding from the array and the life-long memory, and the surrounding information is ignored. Inspired by the successful introduction of CNNs for feature extraction in sentiment classification [16], [17], we develop a similar CNN structure to generate an enhanced spectrum from the embeddings. The CNN is also an FCN, and the enhanced speech spectrogram is directly generated by its last layer. The spectrum generator is combined with an LSTM embedding generator to build a recurrent-convolutional embedding encoder-decoder network (RC-EED). CNNs are good at extracting local patterns and can ignore irrelevant data to form better representations [18]. A combined CNN and LSTM structure was also used in [8], [9] to extract local patterns from a spectrum and model the temporal dependencies in a latent space.
In the proposed model, the CNN embedding decoder is utilized to detect the local spatial patterns of the embedding array and generate the T-F units of the spectrum from the local embedding representations, which leverages more correlated embeddings to form the spectrum. In addition, we find that a memory block and attention mechanism similar to those in ASAM can be applied in the RC-EED network, with the embedding array filtered by a stacked mask. In our experiments, the results show that the RC-EED and the RC-EED combined with an attention mechanism (RC-AEED) perform well on the speech enhancement task under low SNR conditions, and both clearly outperform a similar ASAM model under different noise conditions. In particular, RC-AEED achieves a better PESQ score than RC-EED.
The rest of this paper is organized as follows. Section II introduces the proposed models. Section III presents the details of the experimental procedure and the analysis of the experimental results. Finally, section IV concludes the paper.

II. ALGORITHM DESCRIPTION
The RC-EED and RC-AEED are shown in Fig. 2; RC-AEED is the solid-line part of the figure. With noisy and clean speech spectra as input, the model is composed of two embedding encoders, one memory block, and one embedding decoder. The two embedding encoders generate a clean speech embedding representation and a noisy speech embedding array, respectively. The clean speech embedding representation is accumulated into a memory vector in the storage region. An attention mechanism then takes the memory vector and the noisy speech embedding array as input to form an ideal ratio mask (IRM) [19]-[21]. Afterward, the noisy speech embedding array is pointwise multiplied by a stacked IRM to develop an enhanced speech embedding array. The stacked IRM is formed by repeatedly overlaying the same mask. Finally, an embedding decoder transforms the enhanced speech embedding array into the enhanced speech spectrum.
The RC-EED has neither the attention structure nor the masking layer, and only an LSTM embedding encoder for noisy speech exists in the RC-EED. The dashed-line part of Fig. 2 shows the dataflow between the encoder and the decoder.

A. EMBEDDING ENCODER
The encoder aims to map each T-F unit of the speech spectrum to a high-dimensional embedding. In ASAM, this structure is constructed with LSTM layers and a fully connected layer, and the output of the structure is then reshaped into an embedding array. In [22], a spectrogram and a feature map generated from the speech waveform were concatenated into a hybrid-domain feature map. The new feature map is then fed into a structure of 1-D dilated CNN layers repeated four times.
Two embedding encoders are adopted in the embedding encoder-decoder models. One is similar to the LSTM block in ASAM, and the other has a structure similar to that in [22]. The LSTM encoder is made up of two LSTM layers and a fully connected layer. The structure of the CNN encoder in [22] is also followed to build the CNN encoder for comparison, but the repetition count is reduced to 3. As shown in Fig. 2, the CNN encoder is only used to deal with the noisy spectrum. The EED with a CNN encoder and a CNN decoder is named CC-EED. The encoding process can be expressed as follows:

A = EE(X), (1)
L = FC(NN(X)), (2)
A = reshape(L), (3)

where EE(·) represents an embedding encoder network, the input X is the speech spectrum corrupted with noise, and A ∈ R^{T×F×E} is the generated embedding array. Formulas (2) and (3) denote the intermediate process of (1), where NN(·) represents the LSTM or CNN structure, FC(·) refers to a fully connected layer, reshape(·) is the reshape operation, and L ∈ R^{T×FE}. The embedding representation of clean speech is generated only by an LSTM embedding encoder, and the embedding array of clean speech is pooled into a feature vector:

a = (1 / (TF)) Σ_{t,f} A^C_{t,f}, (4)

where A^C represents an embedding array of clean speech and a represents the embedding representation of clean speech. Note that the memory generation process is slightly simpler than that in [13], but we do not find any apparent influence on the experiment results.
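The shape bookkeeping in (1)-(3) can be made concrete with a minimal NumPy sketch. The LSTM/FC stage NN(·)/FC(·) is replaced here by a random placeholder array, since only the reshape from L ∈ R^{T×FE} to A ∈ R^{T×F×E} is being illustrated; the sizes T = 235, F = 257, E = 40 are taken from Section III.

```python
import numpy as np

# Shape-level sketch of the encoding in formulas (1)-(3). The LSTM/FC stage
# is stood in by random values, since only the shapes matter here.
T, F, E = 235, 257, 40          # frames, frequency bins, embedding dimension

X = np.random.randn(T, F)       # noisy speech spectrum (input to the encoder)
L = np.random.randn(T, F * E)   # placeholder for FC(NN(X)), L in R^{T x FE}
A = L.reshape(T, F, E)          # reshape(.) yields the embedding array A

print(A.shape)                  # one E-dimensional embedding per T-F unit
```

Each of the T × F time-frequency units thus owns its own 40-dimensional embedding, which is what the decoder later consumes.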

B. MEMORY BLOCK
A memory block with a 1 × E shape is developed in the proposed model. In every minibatch, embedding representations of the clean speech of different speakers are saved to the memory block. The operation producing the memory vector is the same as the method in [13], but the memory vector stores the general features of different voices. The memory update formula can be expressed as

m ← (m + a) / ||m + a||₂, (5)

where m ∈ R^E represents the memory vector and a is the embedding representation of clean speech. Note that m is initialized to a zero vector.
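A small sketch of the memory accumulation described above. The paper states only that the update follows [13] and that m starts from a zero vector; the normalized-accumulation rule used here (summing the incoming clean-speech embedding into the memory and renormalizing) is an assumption modeled on common life-long memory modules, and `update_memory` is a hypothetical helper name.

```python
import numpy as np

E = 40
m = np.zeros(E)                      # memory vector, initialized to zeros

def update_memory(m, a, eps=1e-8):
    """Accumulate a clean-speech embedding a into memory m.
    Assumed rule: normalized accumulation (the exact rule follows [13])."""
    s = m + a
    return s / (np.linalg.norm(s) + eps)

# One clean-speech embedding representation arrives per utterance.
for _ in range(5):
    a = np.random.randn(E)
    m = update_memory(m, a)

print(np.linalg.norm(m))             # memory stays (approximately) unit-norm
```

Because the memory is a single 1 × E vector rather than per-speaker slots, it stores a general voice feature shared across speakers, as the text notes.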

C. ATTENTION MECHANISM AND MASK
The attention mechanism aims to calculate the mask from the noisy speech spectrum embedding array and the memory vector. Every element of the mask is related to the embedding at the corresponding position and the memory vector. The attention weight is computed as

α_{t,f} = σ(A_{t,f} · m), (6)

where A_{t,f} represents the embedding at position (t, f) in the noisy speech spectrum embedding array, σ(·) is the sigmoid function, and α can be regarded as a T-F mask. In ASAM, the mask is directly used to filter the enhanced spectrum from the mixture. In the proposed model, the mask is utilized to update the noisy embedding array. The mask is first copied, and the copies are stacked together to match the shape of the embedding array.
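The masking and stacking steps can be sketched as follows. The attention weight is assumed here to be the sigmoid of the inner product between each T-F embedding and the memory vector (an attractor-style attention; the paper does not spell out the exact formula), after which the resulting T × F mask is stacked E times along the embedding axis and applied pointwise.

```python
import numpy as np

T, F, E = 235, 257, 40
A = np.random.randn(T, F, E)          # noisy speech embedding array
m = np.random.randn(E)                # memory vector

# Assumed attention: sigmoid of the inner product between each embedding
# and the memory vector, giving a T-F mask alpha with values in (0, 1).
alpha = 1.0 / (1.0 + np.exp(-(A @ m)))             # shape (T, F)

# The mask is copied E times along the embedding axis (the "stacked" mask)
# so it matches the embedding array, then applied pointwise.
stacked = np.repeat(alpha[:, :, None], E, axis=2)  # shape (T, F, E)
A_enh = A * stacked                                # enhanced embedding array

print(alpha.shape, A_enh.shape)
```

Stacking rather than broadcasting is shown explicitly to mirror the paper's description, though NumPy broadcasting (`A * alpha[:, :, None]`) would compute the same result.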

D. EMBEDDING DECODER
An enhanced speech spectrogram is reconstructed from the embedding array in this block. The process can be regarded as an inverse of the embedding generation. To overcome the problem of using a single embedding in ASAM, we propose to adopt a CNN structure to extract the features of locally correlated embeddings, as illustrated in Fig. 3. Given that the energy in the spectrogram is continuous along the time and frequency dimensions, we suppose that the embedding array generated from the spectrogram has a similar property and that the CNN is appropriate for exploiting rich local patterns. One way to construct an embedding decoder is to use fully connected layers, but such a structure could introduce significant irrelevant information when each T-F unit is generated. Multiple CNN structures are designed as the embedding decoder, which can be expressed by the following formula:

Y = ED(A), (7)

where ED(·) represents the embedding decoder, Y represents the enhanced speech spectrogram, and A represents an embedding array.

FIGURE 3. Embedding decoder based on a convolutional neural network.
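The two-layer CNN decoder described in Section III-B (3 × 3 kernels, channels 40 → 10 → 1) can be sketched with a naive "same"-padded convolution. This is a shape-level illustration, not the trained model: weights are random, activations and framework details are omitted, and a small T × F grid is used so the explicit loops stay fast.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Naive 'same' 2D convolution.
    x: (Cin, H, W), w: (Cout, Cin, 3, 3), b: (Cout,)."""
    cin, H, W = x.shape
    cout = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))       # zero-pad H and W by 1
    y = np.zeros((cout, H, W))
    for o in range(cout):
        for i in range(H):
            for j in range(W):
                y[o, i, j] = np.sum(xp[:, i:i+3, j:j+3] * w[o]) + b[o]
    return y

# Decoder sketch with the channel sizes from Section III-B: 40 -> 10 -> 1.
T, F, E = 16, 20, 40               # small T, F so the naive loops stay fast
A = np.random.randn(E, T, F)       # embedding array, channels-first layout
w1, b1 = np.random.randn(10, 40, 3, 3) * 0.01, np.zeros(10)
w2, b2 = np.random.randn(1, 10, 3, 3) * 0.01, np.zeros(1)

Y = conv2d_same(conv2d_same(A, w1, b1), w2, b2)[0]   # Y = ED(A), formula (7)
print(Y.shape)
```

Each output T-F unit is thus formed from a 3 × 3 neighborhood of embeddings rather than a single one, which is precisely the local-correlation property the decoder is designed to exploit.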

III. EVALUATIONS AND COMPARISONS

A. DATASET
This study selects the TIMIT corpus and NOISEX-92 [23] as the speech and noise datasets, respectively. NOISEX-92 is a commonly used noise dataset. A total of 1984 utterances are randomly extracted from the TIMIT corpus, and five noises are used in the training phase. The voices are corrupted with the five types of noise at SNRs of {−5, −4, −3, −2, −1, 0} dB. The noises are Babble, Factory1, Destroyerops, F16, and White. Another 300 utterances of seen and unseen speakers in TIMIT are randomly selected to construct the test set. Three unseen noises (i.e., Factory2, Leopard, and Hfchannel) and three seen noises (i.e., Babble, Factory1, and F16) from the NOISEX-92 corpus are adopted for testing. All waveforms are sampled at 16 kHz.

B. EXPERIMENT SETUP
The short-time Fourier transform (STFT) with a 512-sample Hamming window and a 256-sample overlap is used to extract the speech features from the waveform. We then obtain a speech spectrum with a shape of 235 × 257 and feed it to the neural network. Three models are chosen as the reference approaches: the deep recurrent network (DRN) [24], CRN [8], and ASAM. Four proposed models with similar structures, namely, RC-EED, the convolutional-convolutional EED (CC-EED), RC-AEED, and the recurrent-fully connected AEED (RF-AEED), are also evaluated. Note that DRN and CRN are completely different from the proposed RC-EED and RC-AEED, whereas ASAM, RC-EED, CC-EED, RF-AEED, and RC-AEED have similar structures.
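The front-end arithmetic above can be checked directly: a 512-sample window yields 512 // 2 + 1 = 257 non-redundant frequency bins, and 235 frames correspond to an utterance of 512 + 234 × 256 = 60416 samples (about 3.78 s at 16 kHz). A minimal framing sketch, with a random placeholder waveform:

```python
import numpy as np

# STFT framing sketch: 512-sample Hamming window, hop (overlap) of 256.
sr, win, hop = 16000, 512, 256
x = np.random.randn(60416)                      # placeholder waveform
n_frames = 1 + (len(x) - win) // hop            # number of full frames
window = np.hamming(win)

frames = np.stack([x[i * hop : i * hop + win] * window
                   for i in range(n_frames)])
spec = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrum

print(spec.shape)                               # the 235 x 257 input spectrum
```

This reproduces the 235 × 257 input shape that all the evaluated networks consume.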
The ASAM baseline is similar to that in [13]; its two embedding encoders consist of unidirectional LSTM layers and a fully connected layer. In the embedding encoder of the noisy speech, each LSTM layer has 784 cells, and the fully connected layer has 257 units. An LSTM layer with 120 cells is used in the other embedding encoder. Two different embedding arrays of size 235 × 257 × 40 are then generated. The embedding array produced from clean speech forms a memory vector of size 40, which is subsequently stored in a memory block. The shape of the memory block is the same as in the proposed method, and the memory block stores the speech features of all speakers.
RC-EED and CC-EED both have an embedding encoder-decoder structure. The embedding generator of the RC-EED is the same as that of the ASAM baseline. In the CC-EED, however, the generator is replaced by a CNN structure similar to that in [22]. The embedding decoders of the two models are two CNN layers with a kernel size of 3 × 3. The first layer has 40 input channels and 10 output channels, while the second layer has 10 input channels and 1 output channel. The decoder of RC-AEED has the same structure parameters as above. The RF-AEED, which replaces the CNN embedding decoder with a fully connected network, is built to verify the effectiveness of the CNN decoder. Its fully connected layers have 2048 and 257 units, respectively. During the test phase, the two AEED models directly use the memory vector generated in the training phase, and the memory vector is not updated again.
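The decoder configuration just described fixes its parameter count, which is worth stating explicitly given the parameter-efficiency comparison in Table 8. Counting one bias per output channel (the standard convention; the paper does not state the bias configuration):

```python
# Parameter count for the two-layer CNN decoder: 3x3 kernels,
# channels 40 -> 10 -> 1, assuming one bias per output channel.
def conv_params(cin, cout, k=3):
    return cout * cin * k * k + cout   # weights + biases

layer1 = conv_params(40, 10)   # 3x3x40x10 weights + 10 biases = 3610
layer2 = conv_params(10, 1)    # 3x3x10x1 weights + 1 bias     = 91
total = layer1 + layer2

print(layer1, layer2, total)   # 3610 91 3701
```

At roughly 3.7k parameters, the CNN decoder is tiny next to the LSTM encoder, consistent with the later observation that RC-EED has fewer parameters than CRN.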
Four different decoders in the RC-AEED model are also developed, and the performances of all the decoders are evaluated, as shown in Table 7.
The models are trained with the Adam optimizer at a learning rate of 0.001. The mean squared error in the time-frequency domain is used as the objective function. The models are trained at the utterance level with a minibatch size of 32.

C. RESULTS AND ANALYSIS
Short-time objective intelligibility (STOI) [25], PESQ, and scale-invariant signal-to-distortion ratio (SI-SDR) [27] are used as evaluation metrics in the experiments. Tables 1 and 2 present the experiment results on the three seen noises and untrained speakers. Generally, ASAM cannot achieve satisfactory performance in speech enhancement. DRN, CRN, and the proposed methods perform better than ASAM in nearly all conditions. In the −10 dB SNR condition, most of the STOI and PESQ scores of the DRN are inferior to those of the CRN and the RC-AEED, indicating that CRN and RC-AEED have better stability in low SNR environments. Table 2 shows that the proposed RC-AEED achieves the best PESQ among the seven models in all the noise conditions. When Babble is used as the test noise, RC-AEED generates 0.33, 0.56, 0.70, and 0.69 PESQ gains over the unprocessed mixtures in the −10, −5, 0, and 5 dB SNR conditions. In Table 1, RC-AEED and CRN have the highest STOI scores. RC-AEED exhibits better performance when tested on F16 noise, while CRN has greater STOI gain when tested on Babble. Specifically, in the −10 dB SNR case, RC-AEED exhibits the best performance in terms of both the STOI and PESQ metrics. Tables 3 and 4 show the experiment results on the three unseen noises and untrained speakers. The DRN, CRN, and proposed methods have better noise generalization than ASAM. RC-AEED achieves the best PESQ score in most noise conditions. The STOI score of RC-AEED is only inferior to those of CRN and RF-AEED in the −10 and −5 dB SNR conditions. RF-AEED has a higher STOI score on the Hfchannel noise when tested in low SNR conditions, but it cannot deliver consistent performance on the two other noises. Specifically, when Leopard noise is tested, only RC-AEED and CRN have slight STOI improvements in the −10 dB SNR condition, whereas the others lose some objective intelligibility of the speech.
Table 5 shows the mean experimental results of SI-SDR on the seen and unseen noises. There is no distinct difference in the SI-SDR metrics among the DRN, CRN, and proposed methods. Some tendencies can be observed from the five tables. RC-AEED, with its attention mechanism, obtains higher PESQ and STOI scores than RC-EED. This outcome means that a proper attention mechanism can boost the performance of the RC-EED and that the stacked masking layer can filter some interference in the embedding array, which is broadly analogous to the T-F masking method. Meanwhile, the performance of RF-AEED is inferior to that of RC-AEED in most cases, probably because the CNN can extract local patterns of the embedding array more efficiently [18], whereas the fully connected layers introduce a large amount of irrelevant information that affects the formation of the spectrum. Table 6 shows the performance of the DRN, CRN, and proposed methods on both trained and untrained speakers in the −10, −5, 0, and 5 dB SNR conditions, with all the seen and unseen noises used. Generally, a similar tendency is observed between the experiment results of the trained and untrained speakers. RC-AEED and RC-EED both have good PESQ scores on the trained and untrained speaker datasets.
We evaluated the metric results of CC-EED and RC-EED on the seen and unseen noises. In preliminary experiments, we found that CC-EED has poor generalization ability: it cannot perform well under unseen noise conditions and even degrades the speech. Therefore, Fig. 4 only shows the results on the seen noises. CC-EED also performs poorly under unseen SNR conditions, and RC-EED consistently outperforms CC-EED in all metrics.
As shown in Table 7, we evaluate the performance of different CNN-based embedding decoders under the same noise conditions as the evaluation in Table 6. The hyperparameters of the CNN structures are given in (input channels/output channels) format, and the kernel sizes are 3 × 3. In general, the models with deeper CNN-based decoders achieve higher STOI and PESQ scores. The three-layer decoder obtains the best PESQ, and the four-layer decoder obtains the best STOI. Fig. 5 illustrates a spectrogram of a clean utterance and three spectrograms of the corresponding noisy speech (seen noise) enhanced by different methods. Some black dotted boxes are marked in the spectrograms. The spectrogram generated by the CRN has fewer noise components. However, it also mistakenly suppresses some of the time-frequency energy of the speech signal. This problem is mitigated in the proposed methods, and clearer voice features can be found in the two rightmost spectrograms. Furthermore, the RC-AEED generates a better spectrogram. Table 8 shows the parameter efficiency comparison of RC-AEED, RC-EED, and CRN. The decoders of RC-AEED and RC-EED are both two-layer CNN structures. We find that the proposed models have the same order of magnitude of parameters and that RC-EED has fewer parameters than CRN.

IV. CONCLUSION
Multiple embedding encoder-decoder structures, RC-EED, CC-EED, and the AEEDs, are proposed in this study. The RC-EED and RC-AEED achieve good performance on the speech enhancement task under low SNR conditions and clearly outperform a similar ASAM model under different noise conditions. The comparison of the RC-EED and CC-EED shows that the RC-EED with an LSTM-based embedding encoder outperforms the CNN-based encoder-decoder. The RC-AEED is also compared to an AEED with a fully connected decoder to evaluate the efficiency of the CNN decoder. The DRN and the CRN are compared with the proposed methods in the experiments, and the results show that the RC-AEED achieves the best PESQ scores in almost all seen and unseen noise conditions. In the lower SNR conditions, namely, −10 and −5 dB, the RC-AEED delivers consistent speech enhancement performance. In lower SNR conditions, phase distortion has a greater impact on the quality of enhanced speech. In the future, we will explore handling the phase distortion problem by combining the RC-EED structure with a target complex spectrum.
TIAN LAN received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2009. He is currently an Associate Professor with the School of Information and Software Engineering, UESTC. His current research interests include medical image processing, speech enhancement, and natural language processing.
WENZHENG YE is currently pursuing the master's degree in software engineering with UESTC. His research interests include speech enhancement, speech recognition, and machine learning.
YILAN LYU was born in 1997. She is currently pursuing the master's degree. Her research interests include speech enhancement and speech separation.
JUNYI ZHANG received the Ph.D. degree in computer science from the Beijing University of Posts and Telecommunications, in 2012. His current research interests include digital signal processing, machine learning, and data mining.
QIAO LIU received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2010. He is currently a Full Professor with the School of Information and Software Engineering, UESTC. His current research interests include natural language processing, machine learning, and data mining.