Efficient Audio-Visual Speech Enhancement Using Deep U-Net With Early Fusion of Audio and Video Information and RNN Attention Blocks

Speech enhancement (SE) aims to improve speech quality and intelligibility by removing acoustic corruption. While various deep-learning-based SE models using audio-only (AO) input have achieved successful enhancement for non-speech background noise, audio-visual SE (AVSE) models have been studied to effectively remove competing speech. In this paper, we propose an AVSE model that estimates spectral masks for the real and imaginary components to account for phase enhancement. It is based on the U-net structure, which allows the decoder to perform information restoration by leveraging intermediate information from the encoding process and mitigates the vanishing gradient problem by providing direct paths to the encoder's layers. In the proposed model, we present early fusion to process audio and video with a single encoder, which effectively generates fused features that are easy to decode for SE while reducing the parameters of the encoder and decoder. Moreover, we extend the U-net using the proposed Recurrent-Neural-Network (RNN) attention (RA) blocks and the Res paths (RPs) in the skip connections and the encoder. While the RPs are introduced to resolve the semantic gap between the low-level and high-level features, the RA blocks are developed to find efficient representations with inherent frequency-specific characteristics for speech as a type of time-series data. Experimental results on the LRS2-BBC dataset demonstrated that AV models successfully removed competing speech and that our proposed model efficiently estimated complex spectral masks for SE. When compared with the conventional U-net model with a comparable number of parameters, our proposed model achieved relative improvements of about 7.23%, 5.21%, and 22.9% in the signal-to-distortion ratio, perceptual evaluation of speech quality, and FLOPS, respectively.


I. INTRODUCTION
Speech enhancement (SE) aims to improve sound quality and intelligibility by removing acoustic corruption from noisy speech recorded in real-world environments. Enhanced speech can be used in various fields such as mobile phones, teleconferencing, and hearing aids. In particular, it can be used as a pre-processing technique in automatic speech recognition (ASR) to improve the recognition performance for corrupted speech. Speech corruption may occur for a variety of reasons, including interferences, background noise, and reverberation caused by room acoustics. Therefore, various SE techniques have been developed through many studies, many of them based on Wiener filtering [1], minimum-mean-squared-error estimation [2]-[4], spectral subtraction [5], spectral masking [6], and Bayesian estimators [7]. Although these methods are effective for SE, they may require strong assumptions regarding noise types, the stationarity of noise, and the signal-to-noise ratio (SNR). With the recent great success of deep learning, various deep-learning-based SE models have been presented to address the aforementioned problems and have achieved better performance than the traditional methods. These SE models can be classified into two categories: the frequency-domain approach, which enhances the magnitude and phase components after applying the short-time Fourier transform (STFT) to the input signal [8]-[10], and the time-domain approach, which uses the input signal as it is [11]-[13].
The time-domain approach trains a model directly on waveforms, so that a clean audio signal can be obtained from a noisy audio signal as input without conversion into the frequency domain [11]-[13]. Since the two features, magnitude and phase, are inherently present in the time-domain signal, the phase component, which is difficult to estimate directly compared to the magnitude, can be considered implicitly by this approach.
The frequency-domain approach usually estimates a spectral mask that can remove noise. In particular, many of these models estimate only a mask to enhance the magnitude spectrum of the input noisy speech and reuse the phase of the noisy speech to restore the enhanced time-domain speech [14], [15]. The reason for not enhancing the phase spectrum is that the relationship between the phases of noisy speech and clean speech is not clear, making it difficult to effectively estimate the phase of clean speech through model training [16], [17]. However, following a study showing that objective and subjective speech quality can be improved through accurate phase estimation [18], models that handle the phase in addition to the magnitude spectrum have recently been studied. In [8], a deep learning model was presented that enhanced the real and imaginary parts of the spectrum by estimating a complex mask, showing better performance than the case where only the magnitude was enhanced and the phase of the noisy speech was reused. In addition, layers were deeply stacked based on the U-net structure to estimate the complex mask for noisy speech in [9], and a Long Short-Term Memory (LSTM) layer was added to the bottleneck of the U-net structure [9] to capture temporal dependency, which resulted in further improvement [10]. Furthermore, there was a study on SE combining the U-net structure and the Wiener filter as a combination of deep learning and signal processing techniques [19].
Although these models have been used effectively to enhance speech, the models, which used audio-only input, were usually trained to remove non-speech noise while retaining speech. Therefore, they frequently failed to successfully remove competing speech. To address this, there have been studies that used additional information beyond the input audio. In [20], a model was presented to enhance the target speaker's speech by distinguishing the target speech from competing speech based on the d-vector generated from a reference signal of the target speaker, and in [21] the target speech was enhanced by speaker identification using an auxiliary feature. Also, signals from earphone acceleration sensors were used in [22].
On the other hand, many studies have used video to separate and remove competing speech. In particular, visual features obtained by the speaker's lip or facial movements were directly related to the speech, allowing models to focus on the target speech and to increase speech quality.
Although there were studies in which small-sized models were trained for real-time operation [23], [24], in most audio-visual SE (AVSE) models, convolutional blocks used as the encoder are deeply stacked to extract relevant information from the large volume of video data. Using only the encoder outputs, therefore, it is difficult for the decoder of the model to successfully improve the target speech by restoring the information lost in the encoding process. Furthermore, due to the deeply stacked-layer structure, the effects of the vanishing gradient problem may be increased when updating weights via backpropagation.
To address these issues, we present an AVSE model that can remove competing speech based on the U-net structure in [9]. The skip connections in the U-net allow the decoder to effectively perform information restoration by leveraging intermediate information from the encoding process and mitigate the vanishing gradient problem by providing paths that drastically reduce the number of layers gradients pass through to reach the encoder's layers. In addition, conventional AVSE models increase the encoder parameters due to independent encoders for audio and video information and increase the burden on the decoder after both encoder outputs are concatenated. In the proposed model, through early fusion that processes audio and video with a single encoder, fused features that are easy to decode for SE can be effectively generated, which may reduce the number of parameters of the encoder and decoder. Also, while many models were trained using front-facing speaker data recorded in a clean environment, we use the LRS2-BBC dataset [25] to learn a model that can work robustly in real-world environments. Moreover, we extend the U-net using the proposed Recurrent-Neural-Network (RNN) attention (RA) blocks and the Res paths (RPs) of [26] in the skip connections. We add the RA blocks to existing convolutional neural network (CNN) layers to increase the effective receptive fields and to find efficient representations with inherent frequency-specific characteristics for speech as a type of time-series data. To avoid the immediate integration of encoder and decoder features with large level differences through direct skip connections, the RPs of [26] are added to the skip connections.
Our contributions on AVSE are summarized as follows:
1) We present an AVSE model that can efficiently remove competing speech based on the U-net structure with a single encoder for the early fusion of audio and visual information.
2) The RA blocks in the skip connections and CNN layers are applied to increase the effective receptive fields and to find efficient representations with inherent frequency-specific characteristics for speech, which may result in improved AVSE performance.
3) A model that can work robustly in real-world environments is trained on the LRS2-BBC dataset [25].
The rest of this paper is organized as follows: Section II describes the related works, introducing studies that addressed multi-modal U-net models, AVSE, extensions of the U-net structure, and U-net models in various speech tasks. In Section III, we describe our proposed AVSE model based on the U-net, including detailed explanations of the audio and visual features and their fusion, the RP, the RA, and the loss function used. Section IV includes the experimental setup, results, and discussions. Section V provides the conclusion of this study.

II. RELATED WORKS

A. MULTI-MODAL U-NET STRUCTURE
In [27], [28], studies that applied multi-modal magnetic resonance image data to the U-net structure were presented. In this structure, there were independent encoders for different modalities, and features extracted from the encoders were combined and delivered to the decoder through skip connections. The main purpose of using independent encoders is to disentangle information that otherwise would be fused from an early stage [27].

B. AUDIO-VISUAL SPEECH ENHANCEMENT
Networks related to AVSE were described in [29]-[32]. They used independent encoders consisting of several layers for audio and video, which are different modalities. The outputs of the last layers of the encoders were integrated and used as input to a decoder. In [29], SE was performed by a weighted sum of masks estimated from audio and audio-visual features, where the weight was determined according to the degree of corruption of the audio signal. In [30], audio and visual data were first processed using separate encoders and then fused into a joint network to restore enhanced speech at the output layer of an extended denoising auto-encoder. In [31], a two-stage model was constructed and trained by first enhancing the spectral magnitude component of speech from audio-visual features and then restoring the phase using the enhanced magnitude component and the complex spectrum of the input noisy speech. In [32], a multi-layer audio-visual fusion strategy was proposed that extracted audio and visual features in every encoding layer and fused the audio-visual information in each layer to feed the corresponding decoding layer. However, these models adopted independent encoders for audio and video information, increasing the number of training parameters. Early fusion, processing audio and video data with a single encoder, may provide fused features that are easy to decode for SE, resulting in a reduction of the number of parameters.

C. EXTENDED U-NET STRUCTURE
In [9], an SE model was presented to remove background noise using the deep U-net structure. After explaining the shortcomings of the model that only enhanced the magnitude of the noisy speech input in the STFT domain, the authors presented the U-net model that enhanced the real and imaginary parts of the spectrum by estimating the complex mask.
In addition, a weighted signal-to-distortion ratio (SDR) loss was presented to avoid the problems of the conventional SDR loss, which caused fluctuation in the lower bound of the loss value during learning, produced zero gradients for noise-only data, and provided scale-insensitive loss values.
In [26], [33], [34], the problem of direct skip connections in the U-net [35] structure was addressed. The skip connections connect an encoder that down-samples through max-pooling and a decoder that up-samples through deconvolution and deliver to the decoder the spatial information lost during down-sampling. The features extracted from the encoder and decoder are called low-level and high-level features, which tend to focus on local and global information, respectively. Therefore, since there is a difference, called the semantic gap, between the low-level and high-level features, naive integration of them may cause confusion in model learning.
Various studies in the field of medical image processing suggested approaches to overcome the semantic gap [26], [33], [34]. In [26], the RP was presented in which several convolutional layers were applied to a skip connection to fill the semantic gap. In [33], [34], performance was improved by changing the part that combines low-level and high-level features in the decoder. In these studies, a channel attention mechanism was used. In order to facilitate model learning, each channel was weighted using the Squeeze and Excitation Block [36] to learn the importance of the channel. Introducing the channel attention mechanism might avoid confusion in model learning using both the low-level and high-level features directly. These models are summarized in Table 1.

D. U-NET IN VARIOUS SPEECH TASKS
The U-net structure has been often used in the field of audio source separation in addition to SE. In [37], a model for separating a singing voice and accompaniment from various instruments through a U-net structure that used a waveform as an input was presented. In addition, it was extended to process multi-channel inputs, enabling source separation in stereo data in [37]. In [38], a model for acoustic echo cancellation was studied based on [37]. By extending the U-net structure in [37], a model was developed which applied attention to the far-end and input signals. A model performing separation and localization at the same time was studied in [39]. Based on the U-net structure in [37], a model that applied GRU as well as CNN to each encoding layer was presented.

III. PROPOSED AVSE MODEL
Referring to related studies, we present an AVSE model that can remove competing speech. Considering that video contains the speaker's lip and face movements directly related to speech, we present an efficient multi-modal U-net model that performs early fusion of audio and visual features in a single encoder rather than in independent encoders for different modalities. In addition, we propose an RA block that can provide efficient representations for speech, and apply the RP in [26] to the skip connections in our model to fill the semantic gap between the low-level and high-level features.

FIGURE 1. Overall architecture of the proposed model. In order to reduce the semantic gap in the skip connections, the RP is used, and the RA blocks are applied in both the encoder and the skip connections. One convolutional layer is applied to extract a complex spectral mask from the output of the decoder. Finally, the inverse STFT is performed to obtain an enhanced signal. The dimension of data is represented in [·], and 'C', 'T', and 'F' represent the number of channel, time, and frequency dimensions of the data, respectively. Here, 'C' is two to represent the real and imaginary parts.
In this section, we describe our model architecture in detail. The overall structure is shown in Fig. 1. The basic structure of our model is based on the U-net, which compresses features through 2D convolution and expands features through 2D transposed convolution. The model estimates a complex mask representing the contribution of the target components in the real and imaginary parts of the input spectrum, and the real and imaginary parts are multiplied by the two parts of the complex mask to obtain the enhanced real and imaginary parts, respectively. In the following, the extraction and concatenation of audio and visual features, the RP used, and the newly proposed RA are described in detail, and the criterion function used for model training is explained.
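As a minimal illustration of the masking step (not the authors' implementation; the function and array names are hypothetical), a complex mask can be applied to a real/imaginary spectral tensor as a complex multiplication, following the complex-ratio-mask convention of the U-net models in [8], [9]:

```python
import numpy as np

def apply_complex_mask(spec_ri, mask_ri):
    """Apply a complex mask to a complex spectrogram.

    spec_ri, mask_ri: arrays of shape (2, T, F) holding the real
    (index 0) and imaginary (index 1) parts. Complex multiplication:
    (a + bi)(c + di) = (ac - bd) + (ad + bc)i.
    """
    a, b = spec_ri[0], spec_ri[1]   # noisy real / imaginary parts
    c, d = mask_ri[0], mask_ri[1]   # mask real / imaginary parts
    real = a * c - b * d
    imag = a * d + b * c
    return np.stack([real, imag])   # enhanced (2, T, F) tensor

# An identity mask (1 + 0i) should leave the spectrum unchanged.
spec = np.random.randn(2, 5, 321)
identity = np.stack([np.ones((5, 321)), np.zeros((5, 321))])
out = apply_complex_mask(spec, identity)
```

The enhanced real/imaginary tensor is then passed to the inverse STFT to obtain the waveform.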

A. AUDIO-VISUAL FEATURES

1) AUDIO FEATURES
Noisy speech sampled at a 16-kHz rate is analyzed using the short-time Fourier transform (STFT) with a 40-ms Hamming window every 10 ms to generate a 321-dimensional complex spectral vector. In our proposed method, we use the real and imaginary components of the spectrum of the noisy audio signal, since enhancing the real and imaginary components can provide better performance than applying the phase of the acquired noisy signal after enhancing the spectral magnitude.

FIGURE 2. Our RP architecture. The dimension of data is represented in [·], and 'C', 'T', and 'F' represent the number of channel, time, and frequency dimensions of the data, respectively.
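The described analysis (16-kHz audio, 40-ms Hamming window, 10-ms hop, giving 640/2 + 1 = 321 frequency bins) can be sketched in NumPy as follows; this is a minimal illustration, not the authors' feature-extraction code:

```python
import numpy as np

def stft_features(x, sr=16000, win_ms=40, hop_ms=10):
    """Frame a waveform with a Hamming window and return a
    (2, T, F) real/imaginary feature tensor, F = win/2 + 1 = 321."""
    win = int(sr * win_ms / 1000)            # 640 samples
    hop = int(sr * hop_ms / 1000)            # 160 samples
    window = np.hamming(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)      # (T, 321) complex
    return np.stack([spec.real, spec.imag])  # (2, T, 321)

feat = stft_features(np.random.randn(16000))  # 1 s of audio -> 97 frames
```

One second of audio yields 97 frames here, since the last partial frame is dropped.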

2) VISUAL FEATURES
We use the pre-trained model in [40] to prepare visual features that represent lip movement. Using a network based on the 18-layer ResNet with a 3D convolution layer [41], we obtain a 512-dimensional feature vector for a video frame every 40 ms. Instead of training our model to extract visual features, we use the features obtained by the pre-trained model to show that our SE model performs well enough even with a pre-trained model, which is often used to extract video features.

3) CONCATENATION OF AUDIO AND VISUAL FEATURES
Many DNN models based on audio and visual information encode audio and visual features separately to disentangle their information, assuming that the characteristics of audio and visual information are sufficiently different. However, since the speaker's lip or facial movements are directly related to the speech, our AVSE model processes audio and visual information with a single encoder to obtain fused features from an early stage. Through this early fusion, fused features are effectively generated that are learned to facilitate decoding for SE. Therefore, the number of parameters of the encoder and decoder may be reduced by using a single encoder instead of separate encoders and by reducing the burden on the decoder through the fused features. To this end, before being fed into the encoder, audio and visual features are concatenated by the following procedure.
The audio and visual features described above have the dimensions of 2 × T_a × 321 and T_v × 512, respectively, where the '2' represents the real and imaginary parts of the complex spectral audio feature, and T_a and T_v are the numbers of frames of the audio and visual feature vectors, respectively. Since T_a is 4 times T_v, the temporal dimension of the visual feature is upsampled by a factor of 4 to match that of the audio feature, and one fully connected layer is applied to reduce each 512-dimensional visual feature vector to a 321-dimensional vector corresponding to an audio feature vector. A multimodal feature in R^(3 × T_a × 321) is created by concatenating the visual feature and the real and imaginary parts of the audio feature along the channel axis.
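The fusion procedure above can be sketched as follows. This is an illustrative NumPy sketch: the frame-repetition upsampling and the randomly initialized projection weights `W`, `b` stand in for the model's learned fully connected layer and are assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
T_v = 25                                    # video frames (one per 40 ms)
T_a = 4 * T_v                               # audio frames (one per 10 ms)

audio = rng.standard_normal((2, T_a, 321))  # real/imag spectral features
video = rng.standard_normal((T_v, 512))     # per-frame visual embeddings

# 1) upsample video 4x in time to match the audio frame rate
video_up = np.repeat(video, 4, axis=0)      # (T_a, 512)

# 2) project 512 -> 321 with a (placeholder) fully connected layer
W = rng.standard_normal((512, 321)) * 0.01
b = np.zeros(321)
video_proj = video_up @ W + b               # (T_a, 321)

# 3) concatenate along the channel axis -> (3, T_a, 321)
fused = np.concatenate([audio, video_proj[None]], axis=0)
```

The resulting 3-channel tensor is what the single shared encoder consumes.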

B. RES PATH (RP)
In the U-net, low-level features in the encoder and high-level features in the decoder are integrated through skip connections. Since naive integration may cause confusion in model learning due to the difference, called the semantic gap, between the low-level and high-level features [33], our model adopts the RPs in [26] to facilitate model learning by reducing the semantic gap in the skip connections, as shown in Fig. 1.
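A possible PyTorch sketch of a Res path follows the design in [26]: each stage refines the skip-connection feature with a 3×3 convolution plus a parallel 1×1 shortcut before it meets the decoder's high-level features. The number of stages, channel count, and ReLU activation here are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ResPath(nn.Module):
    """Res-path sketch after [26]: a chain of residual conv stages
    applied along a skip connection to reduce the semantic gap."""
    def __init__(self, channels, n_stages):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.ModuleDict({
                "conv": nn.Conv2d(channels, channels, 3, padding=1),
                "shortcut": nn.Conv2d(channels, channels, 1),
            })
            for _ in range(n_stages)
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # each stage: 3x3 conv + parallel 1x1 shortcut, summed, activated
        for stage in self.stages:
            x = self.act(stage["conv"](x) + stage["shortcut"](x))
        return x

rp = ResPath(channels=16, n_stages=4)
y = rp(torch.randn(1, 16, 50, 40))  # (batch, C, T, F) kept unchanged
```

The spatial and channel dimensions are preserved, so the output can be concatenated with the decoder feature exactly as a plain skip connection would be.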

C. RNN ATTENTION (RA)
Our SE network is based on the U-net [9], which is suitable for signal restoration by integrating low-level features in the encoder and high-level features in the decoder through skip connections. Since the U-net is usually composed of convolutional layers, it easily captures local features but has difficulty capturing relatively long temporal features. Since temporal information is important in speech, the temporal dependency of speech was considered by using one LSTM layer in the bottleneck of the U-net in [10], while in the proposed SE model in Fig. 1, RA blocks are applied after the CNN layers of the encoder and also to the skip connections. In addition, when 2D convolution in the U-net is applied to the spectrum of speech, kernel filters move along the frequency axis as well as the time axis to obtain convolution results. Therefore, the 2D convolution does not use frequency-specific filter values, although the speech spectrum, unlike images, has inherent characteristics for each frequency bin. In contrast, the LSTM of the RA, modeling the temporal dependency of speech, processes every frame to estimate attention weights reflecting the inherent frequency-specific characteristics of the local features at each level, whereas similar spatial attention weights were obtained by a convolutional layer in [14]. To this end, as shown in Fig. 3, the local feature AV at this level is used as an input, AV_gap is obtained by global average pooling (GAP) along the channel axis, and the weight is estimated to provide the relative importance between 0 and 1 through the LSTM and the sigmoid function σ(·). This attention weight is multiplied by the input local feature to obtain the feature AV_ra reflecting the dominance. This process can be expressed as

AV_gap = GAP(AV),
W = σ(LSTM(AV_gap)),
AV_ra = Tile(W) ⊙ AV,

where ⊙ means the element-wise product, and Tile represents the tiling function that duplicates a T × F dimensional matrix C times along the channel axis to generate a C × T × F dimensional tensor.
In addition to being applied to each level of the encoder, it is applied to skip connections after applying the RP.
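The RA steps described above (channel-wise GAP, an LSTM over time producing per-(time, frequency) sigmoid weights, tiled back over channels and applied multiplicatively) can be sketched in PyTorch as follows. The LSTM hidden size equal to the number of frequency bins is an assumption for illustration:

```python
import torch
import torch.nn as nn

class RNNAttention(nn.Module):
    """Sketch of the RA block: AV_gap = GAP(AV), W = sigmoid(LSTM(AV_gap)),
    AV_ra = Tile(W) * AV."""
    def __init__(self, freq_bins):
        super().__init__()
        self.lstm = nn.LSTM(freq_bins, freq_bins, batch_first=True)

    def forward(self, av):                  # av: (B, C, T, F)
        av_gap = av.mean(dim=1)             # GAP over channels -> (B, T, F)
        h, _ = self.lstm(av_gap)            # per-frame recurrent features
        w = torch.sigmoid(h)                # attention weights in (0, 1)
        return av * w.unsqueeze(1)          # tile over C via broadcasting

ra = RNNAttention(freq_bins=40)
out = ra(torch.randn(2, 16, 50, 40))        # shape is preserved
```

Because the weights are estimated per time-frequency bin, the block can emphasize frequency bands differently at each frame, which 2D convolution with shared filters cannot.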

D. CRITERION FUNCTION
Several loss functions have been used in SE tasks. The mean square error (MSE) or L1 loss in the frequency domain, or the MSE or SDR loss in the time domain, have been used based on the difference between the enhanced speech signal at the model output and the corresponding clean target speech signal [8], [10]-[12]. In addition, the losses in the frequency and time domains can be combined [42], and the perceptual evaluation of speech quality (PESQ), which evaluates how similar an enhanced signal is to the target speech in terms of speech quality, has also been used [43]. We use the weighted-SDR loss [9], an SDR-based measure that can preserve the scale of the enhanced speech by considering not only the SDR for speech but also the SDR for noise at the same time. The weighted-SDR loss can be expressed as

loss_wSDR(x, y, ŷ) = α · loss_SDR(y, ŷ) + (1 − α) · loss_SDR(z, ẑ),

where loss_SDR(y, ŷ) = −⟨y, ŷ⟩ / (‖y‖ ‖ŷ‖) and α = ‖y‖² / (‖y‖² + ‖z‖²). Here, x is the input noisy signal with x = y + z, where y and z denote the corresponding clean speech and noise signals, respectively; ŷ and ẑ = x − ŷ are the estimated speech and noise signals, respectively. ‖·‖ and ⟨·, ·⟩ denote the L2 norm and the inner product of vectors, respectively.
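Under these definitions, the loss can be written directly in code; the following NumPy sketch (function names are ours) checks the boundary case that a perfect estimate ŷ = y drives both cosine-similarity terms to their minimum of −1:

```python
import numpy as np

def loss_sdr(y, y_hat, eps=1e-8):
    """-<y, y_hat> / (||y|| ||y_hat||): negative cosine similarity."""
    return -np.dot(y, y_hat) / (np.linalg.norm(y) * np.linalg.norm(y_hat) + eps)

def weighted_sdr_loss(x, y, y_hat, eps=1e-8):
    """Weighted-SDR loss of [9]: mix the speech and noise SDR terms
    with the energy ratio alpha = ||y||^2 / (||y||^2 + ||z||^2)."""
    z = x - y                  # true noise
    z_hat = x - y_hat          # estimated noise
    alpha = np.sum(y**2) / (np.sum(y**2) + np.sum(z**2) + eps)
    return alpha * loss_sdr(y, y_hat, eps) + (1 - alpha) * loss_sdr(z, z_hat, eps)

y = np.random.randn(16000)     # clean speech
z = np.random.randn(16000)     # noise
x = y + z                      # noisy mixture
perfect = weighted_sdr_loss(x, y, y)   # perfect estimate -> about -1
```

Because the noise term (1 − α) · loss_SDR(z, ẑ) stays informative even when the speech energy is small, the loss avoids the zero-gradient problem of the plain SDR loss on noise-only data.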

IV. EXPERIMENTS AND RESULTS
We conducted two experiments to evaluate the proposed model. In the first experiment, an audio-only (AO) model was constructed by removing visual modality in the model described in Section III, and the performance of the model was evaluated for non-speech background noise. In the second experiment, the AO model and the audio-visual (AV) model were evaluated using another speaker's speech as noise.

A. DATASET
In the first experiment on non-speech background noise, we used noise and clean speech from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) [44] and the voice bank corpus [45] downsampled to a rate of 16 kHz, respectively, as in [9]. Each noisy speech utterance was simulated by mixing the corresponding clean speech utterance and a randomly selected noise signal with the same length as the speech. We generated noisy speech utterances with four different SNRs for one speech utterance. The SNRs of the training data were 0 dB, 5 dB, 10 dB, and 15 dB, while the SNRs of the test data were 2.5 dB, 7.5 dB, 12.5 dB, and 17.5 dB. In the second experiment on competing speech noise, we considered the LRS2-BBC dataset [25]. In order to use competing speech as noise, each noisy speech utterance with an SNR of 0 dB in both the training and test data was generated by mixing the corresponding clean speech utterance with another speaker's utterance of a similar length from the same dataset.
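Mixing a clean utterance with noise at a target SNR can be sketched as follows; `mix_at_snr` is a hypothetical helper illustrating the standard procedure, not the authors' data pipeline:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR (dB), then add."""
    p_s = np.mean(speech**2)                 # speech power
    p_n = np.mean(noise**2)                  # noise power before scaling
    target_p_n = p_s / (10 ** (snr_db / 10)) # power the noise should have
    noise_scaled = noise * np.sqrt(target_p_n / p_n)
    return speech + noise_scaled, noise_scaled

rng = np.random.default_rng(1)
s = rng.standard_normal(16000)               # stand-in clean utterance
n = rng.standard_normal(16000)               # stand-in noise / competing speech
mix, n_scaled = mix_at_snr(s, n, snr_db=5.0)
snr = 10 * np.log10(np.mean(s**2) / np.mean(n_scaled**2))  # recovered SNR
```

Setting `snr_db=0.0` reproduces the 0-dB competing-speech condition of the second experiment.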

B. EXPERIMENTAL SETUP
The parameters of the CNNs and transposed CNNs used in the U-net of our model are described in Table 2. After generating a complex mask in the STFT domain by passing the output of the U-net decoder through one CNN layer, the spectrogram of the input audio signal was element-wise multiplied by the complex mask to enhance the audio signal. Then, a waveform of the enhanced signal was obtained by applying the inverse STFT. When training the model, the weighted SDR loss was calculated using the waveforms of the enhanced signal and the corresponding clean speech signal.

TABLE 2. The parameters of CNNs and transposed CNNs used in the U-net (4M), RPU-net, RA_E U-net, RA_ES U-net, and RA_ESD U-net. We used the same number of parameters as in [9].

Ablation studies were conducted to evaluate the performance of the proposed model. Our model was based on the U-net in [9], and Fig. 4(a) shows one layer of its encoder and decoder. Fig. 4(b) shows the RPU-net, a model that applies the RP to the U-net. In addition, Figs. 4(c), 4(d), and 4(e) show the RA_E U-net, RA_ES U-net, and RA_ESD U-net, in which the RA is additionally and cumulatively applied to the encoder, skip connections, and decoder, respectively. To evaluate computational complexity, the FLOPS at each stage of the RA_ESD U-net are summarized in Table 3. The computational cost of the RA blocks proposed in this paper is very small compared to that of the U-net and RP. We also report the results for the U-net with an increased number of CNN filters in the encoder so that the U-net had a number of parameters comparable to the RPU-net, RA_E U-net, and RA_ES U-net. Furthermore, for comparison of audio-visual models in the second experiment, Fig. 5 shows the SEU-net in [33], which applied channel-wise attention between the encoder and the decoder by using SE blocks, and the MMU-net in [27], where encoders for audio and visual features were separated for late fusion.
Finally, a model in [24] consisting of CNN and LSTM was also compared. The models were trained by an Adam optimizer, and the learning rate was 0.0001.

C. EXPERIMENTAL RESULTS AND DISCUSSIONS
SDR, PESQ, and short-term objective intelligibility (STOI) were used as performance measures in our experiments.
In addition, the predicted ratings of speech distortion, background distortion, and overall quality, denoted as CSIG, CBAK, and COVL, respectively, were also used as performance indicators. We also report the number of parameters and FLOPS as metrics for the model size and computational cost, respectively. Table 4 shows the experimental results of the AO models on non-speech background noise. Regardless of the models used, the models estimating complex masks, which could take phase information into account, generally showed slightly better performance than the models enhancing magnitude components only. The RA_ES U-net outperformed the other compared models in all the evaluated measures, which showed that the proper use of RA and RP could benefit SE. However, the RA_ESD U-net showed slightly degraded performance compared to the RA_ES U-net. This seems to be because the RA applied to the decoder performed a function redundant with the RA applied to the encoder and skip connections, increasing the model complexity rather than helping to improve performance.

FIGURE 6. Box plots for U-net and RA_ES U-net models using AO features on human speech as noise using the LRS2-BBC dataset.

Table 5 shows the experimental results of the AO and AV models when competing speech was used as noise. In addition, Fig. 6 displays box plots of SDR, PESQ, and STOI for the U-net and RA_ES U-net models. Unlike the results for non-speech background noise in Table 4, the AO models could not effectively remove competing speech, and even model training was not performed well, as confirmed by the validation loss curve of the criterion function in Fig. 7 despite the decreasing training loss curve. Using non-speech background noise, the models learned noise properties and patterns different from speech, allowing speech to be selectively enhanced.
However, when competing speech was used as noise, SE could not be effectively performed because the target and noise signals could not be effectively distinguished with audio information only. Therefore, another modality had to be added, and we considered AVSE models using visual information additionally.
As shown in Table 5, AV models performed successful SE, unlike AO models. This is because the AV models could distinguish between the target and noise signals by exploiting visual features containing the target speaker's lip and face movements in addition to the start and end times of an utterance. The performance comparison between the models showed a tendency similar to the results in Table 4. The RA_ES U-net achieved superior performance to the other compared models, and the models estimating complex masks outperformed the models estimating magnitude masks.
Our U-net model fusing audio and visual features at an early stage showed better performance than the MMU-net based on late fusion. The video used here was not new information unrelated to the audio, but information about the speaker directly related to the target speech. Therefore, through early fusion processing audio and video information with a single encoder, fused features that were easy to decode for SE could be effectively generated by our model with a reduced number of parameters in the encoder and decoder. In addition, our U-net showed better performance than the SEU-net, which demonstrated that the temporal and spectral attention of the proposed RA was more suitable for SE than the channel-wise attention in the SEU-net. On the other hand, the SDR and PESQ scores of the RA_ES U-net were similar or slightly inferior to those of the model of Afouras et al. [31]. However, considering the difference in the number of parameters by about 20 times, the proposed RA_ES U-net can be considered very efficient. Furthermore, the model of Gogate et al. [24] was evaluated among the AV models estimating magnitude masks. Since this small-sized model with 3M parameters was developed for real-time operation, we repeated the evaluation for an extended version with a number of parameters comparable to the RA_ES U-net and still observed worse SE performance than with the RA_ES U-net. We also reduced the RA_ES U-net, as an AV model estimating complex masks, to 2.2M parameters, fewer than the model of Gogate et al. [24], and obtained comparable enhancement performance with much lower FLOPS.

TABLE 5. Results on the LRS2-BBC dataset corrupted by human speech. AO and AV denote audio-only and audio-visual models, respectively. 'M' and 'R / I' represent the models that generate masks for the magnitude component and for the real and imaginary components, respectively. When using 'M', the phases of the input spectral data were used for restoring an enhanced waveform. 'Params' denotes the number of parameters. The results in [31] were included from the authors' reports.
When the RA blocks are added to obtain the RA E U-net and RA ES U-net models from the U-net model, it should be noted that the increase in the FLOPS (a widely used metric for computational cost) is not proportional to the increase in the number of parameters (a metric for model size), as shown in Tables 4 and 5. Also, in addition to the results for the U-net model with about 4M parameters, the two tables report results for three extended U-net models whose numbers of parameters were comparable to those of the RPU-net, RA E U-net, and RA ES U-net models. For the AV models estimating complex masks in Table 5, for example, the FLOPS of the RA ES U-net model was much smaller than that of the U-net model with 12M parameters. For models with about 9M or more parameters, it is worth noting that adding the RA blocks provided better performance with less computational complexity, rather than a trade-off between performance and computational cost at comparable model complexity, compared to extending the CNN layers.

FIGURE 9. Waveforms of enhanced speech superimposed on the corresponding target speech, and the moving averages of the differences between the enhanced and target speech waveforms over the neighboring 1-ms interval, for a short 30-ms interval of the utterance ''test/6362162543810362236/00006.wav''. The speech was enhanced by the models that generated masks for the real and imaginary components.

FIGURE 11. Magnitude spectra of input and target speech and feature maps after the first layer of the U-net and RA ES U-net encoders using AV features for the LRS2-BBC dataset when competing speech was used as noise. The difference is particularly evident in the red box.
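The non-proportionality between FLOPS and parameter count follows from how the two layer types reuse their weights: a convolution re-applies its kernel at every time-frequency position, whereas a recurrent layer applies its weight matrices once per frame. A rough back-of-envelope cost model illustrates this (the layer shapes below are illustrative, not the paper's actual configuration):

```python
# Rough cost model: FLOPs of a conv layer scale with params * output positions,
# while FLOPs of an LSTM scale with params * number of frames only.
def conv2d_cost(c_in, c_out, k, out_t, out_f):
    params = c_in * c_out * k * k
    flops = 2 * params * out_t * out_f   # one multiply-accumulate per weight per position
    return params, flops

def lstm_cost(d_in, d_hid, frames):
    params = 4 * d_hid * (d_in + d_hid + 1)  # four gates: input, recurrent, bias
    flops = 2 * params * frames              # weights applied once per frame
    return params, flops

# Illustrative shapes: a mid-level conv vs. an LSTM with far more parameters.
p_conv, f_conv = conv2d_cost(64, 128, 3, out_t=100, out_f=64)   # 73,728 params
p_lstm, f_lstm = lstm_cost(256, 256, frames=100)                # 525,312 params
# The LSTM has ~7x the parameters of the conv layer yet ~9x fewer FLOPs,
# which is why adding RA blocks grows the model size faster than its cost.
```

Under this model, adding a recurrent attention block can increase the parameter count substantially while adding comparatively little computation, consistent with the trends in Tables 4 and 5.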
Figs. 8 and 9 display the spectrograms of speech enhanced using the complex masks estimated by the models for an utterance and the waveforms for a short 30-ms interval of it, respectively. Compared with the spectrogram of the target speech in Fig. 4(f), the RA ES U-net effectively removed the noise component at about 2 kHz in the red box and the high-frequency noise in the pink box. In addition, with the RA ES U-net, the harmonics of the target speech became more evident in the low-frequency region of the black box than with the other models. Fig. 9(f) shows the moving averages of the differences between the enhanced and target speech waveforms over the neighboring 1-ms interval. In general, speech enhanced by the RA ES U-net deviated less from the target than that of the other models. This result was consistent with the superior performance in Tables 2 and 3 of the RA ES U-net, which filled the semantic gap between the encoder and decoder features by the RP and efficiently applied the RA, well suited to speech, in the encoder and skip connections.

VOLUME 9, 2021

FIGURE 12. SDRs of enhanced signals at frequency bands for the LRS2-BBC dataset when competing speech was used as noise.

Fig. 10(a) displays the waveforms of the enhanced and target speech signals for a short 30-ms interval of an utterance when the RA ES U-net estimated masks for the magnitude component or for the real and imaginary components. With the mask for the magnitude component, the phase of the input noisy speech was reused for the enhanced speech. Fig. 10(b) shows the moving averages of the differences between the enhanced and target speech waveforms over the neighboring 1-ms interval. The model that estimated masks for the real and imaginary components produced enhanced speech closer to the target than the model that estimated a mask for the magnitude component only. Since the masks for the real and imaginary components could account for the phase by enhancing both components of the spectrum, this result was consistent with the results in Tables 2 and 3 and in [9], which demonstrated that the phase still contributes to SE. Fig. 11 depicts the spectrograms of input and target speech and the feature maps after the first layer of the U-net and RA ES U-net encoders using AV features for the LRS2-BBC dataset when competing speech was used as noise. The RA in the RA ES U-net effectively enhanced the target components in all frequency bands, especially the high-frequency bands. Fig. 12 shows the SDRs of the enhanced signals at each frequency band for the AV models. In the low-frequency band, where the signal power is concentrated, the five tested models performed similarly, while in the other bands, the models using RA outperformed those without RA. In the models without RA, features are extracted at each level by 2D convolution.
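The advantage of real/imaginary masking over magnitude masking can be made concrete: applying a real and an imaginary mask as a complex multiplication modifies both the magnitude and the phase of each spectral bin, whereas a magnitude mask must reuse the noisy phase. The sketch below uses the common complex-ratio-mask convention; this is one standard formulation and not necessarily the exact mask convention of the paper:

```python
import numpy as np

def apply_complex_mask(Y, M_r, M_i):
    """Apply estimated real/imaginary masks to a noisy STFT Y as a
    complex multiplication (complex-ratio-mask style). Because the
    product is complex, both magnitude and phase are enhanced."""
    S_r = M_r * Y.real - M_i * Y.imag
    S_i = M_r * Y.imag + M_i * Y.real
    return S_r + 1j * S_i

# Toy check on a random "spectrogram": with constant masks (0.8, 0.2),
# the result equals (0.8 + 0.2j) * Y, i.e. every bin is scaled AND rotated,
# so the output phase differs from the noisy input phase.
rng = np.random.default_rng(0)
Y = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
S_hat = apply_complex_mask(Y, M_r=np.full(Y.shape, 0.8), M_i=np.full(Y.shape, 0.2))
```

With a magnitude-only mask the same operation would reduce to a real scaling of each bin, leaving the noisy phase untouched, which matches the gap observed between the 'M' and 'R / I' models.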
Although the speech spectrum has inherent characteristics at each frequency bin, 2D convolution, which does not use frequency-specific filter values, might learn filters tuned to the low-frequency components, where the signal power is concentrated, to minimize the loss function, while remaining relatively less suitable for the high-frequency components. However, as the LSTM in RA processed every frame, attention weights reflecting the inherent frequency-specific characteristics could be estimated to restore the high-frequency features in the output of the 2D convolution. This allowed the models with RA to obtain relatively high SDRs in the high-frequency bands compared to the models without RA.
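The mechanism described above, a recurrent network running over frames and emitting frequency-specific gating weights for a conv feature map, can be sketched minimally as follows. This is an illustrative stand-in (a plain tanh RNN with random weights instead of the paper's trained LSTM-based RA block), intended only to show how per-frequency reweighting differs from a shared 2D-convolution filter:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_attention(feat, Wx, Wh, Wo):
    """Sketch of an RNN-attention (RA) block: a recurrent net runs over
    time frames and emits per-(channel, frequency) attention weights,
    so each frequency bin gets its own gate -- unlike a 2D conv filter
    shared across frequency. feat: (C, T, F) conv feature map."""
    C, T, F = feat.shape
    h = np.zeros(Wh.shape[0])
    out = np.empty_like(feat)
    for t in range(T):
        x = feat[:, t, :].reshape(-1)        # flatten (C, F) for this frame
        h = np.tanh(Wx @ x + Wh @ h)         # recurrent update across frames
        a = sigmoid(Wo @ h).reshape(C, F)    # frequency-specific gates in (0, 1)
        out[:, t, :] = feat[:, t, :] * a     # reweight each (channel, freq) bin
    return out

# Toy run with random weights (shapes are illustrative, not the paper's).
rng = np.random.default_rng(1)
C, T, F, H = 4, 10, 8, 16
feat = rng.standard_normal((C, T, F))
Wx = 0.1 * rng.standard_normal((H, C * F))
Wh = 0.1 * rng.standard_normal((H, H))
Wo = 0.1 * rng.standard_normal((C * F, H))
y = rnn_attention(feat, Wx, Wh, Wo)
```

Because the gates lie in (0, 1) and differ per frequency bin, the block can attenuate low-frequency-dominated responses while preserving high-frequency features, which is the behavior the SDR-by-band results attribute to RA.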
We also evaluated the performance on test data at several different audio SNRs using the RA ES U-net model as an AV model estimating complex masks and summarized the results in Table 6. Although the model was trained only on data at 0-dB SNR, it achieved sufficiently high enhancement performance on data at the other SNRs as well as at 0-dB SNR.
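The per-band SDR analysis of Fig. 12 can be reproduced with a simple frequency-domain computation: for each band, compare the target energy to the residual (target minus estimate) energy over the FFT bins in that band. A minimal sketch, with illustrative band edges rather than the paper's exact analysis bands:

```python
import numpy as np

def bandwise_sdr(target, estimate, sr=16000, bands=((0, 2000), (4000, 8000))):
    """Per-band SDR in dB: 10*log10 of target-to-residual energy ratio,
    computed over the FFT bins falling inside each band."""
    T = np.fft.rfft(target)
    E = np.fft.rfft(estimate)
    freqs = np.fft.rfftfreq(len(target), d=1.0 / sr)
    sdrs = []
    for lo, hi in bands:
        sel = (freqs >= lo) & (freqs < hi)
        num = np.sum(np.abs(T[sel]) ** 2)
        den = np.sum(np.abs(T[sel] - E[sel]) ** 2) + 1e-12  # guard near-perfect bands
        sdrs.append(10.0 * np.log10(num / den))
    return sdrs

# Toy check: the target has 1-kHz and 5-kHz tones; the estimate attenuates
# the 5-kHz tone by 10%, so the high band should show about 20 dB SDR while
# the low band stays nearly perfect.
t = np.arange(16000) / 16000.0
s = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 5000 * t)
est = s - 0.1 * np.sin(2 * np.pi * 5000 * t)
out = bandwise_sdr(s, est)
```

A band where the model leaves residual interference thus scores a visibly lower SDR, which is how Fig. 12 separates the RA and non-RA models in the high-frequency bands.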

V. CONCLUSION
We proposed an AVSE model based on the U-net structure that estimates spectral masks for the real and imaginary components and can remove even competing speech. Compared with a model using late fusion, it reduced the number of parameters of the encoder and decoder through early fusion, processing audio and video with a single encoder that effectively generates fused features that are easy to decode for SE. While the RP added to the skip connections resolved the semantic gap between the low-level and high-level features, the RA introduced in the encoder and skip connections found efficient representations with the inherent frequency-specific characteristics of speech as a type of time-series data. The experimental results demonstrated that AV models successfully removed competing speech and that our proposed model efficiently estimated complex spectral masks for SE. In the future, we plan to conduct a study linking the AVSE model to robust speech recognition.
JUNG-WOOK HWANG received the B.S. degree in electronics engineering from Sogang University, Seoul, South Korea, in 2020, where he is currently pursuing the master's degree in electronic engineering. His current research interests include deep learning-based speech enhancement, speech recognition, audio-visual speech enhancement, and audio-visual speech recognition.