On Learning Spectral Masking for Single Channel Speech Enhancement Using Feedforward and Recurrent Neural Networks

Human speech in real-world environments is typically degraded by background noise, which lowers perceptual speech quality and intelligibility and thereby degrades the performance of speech-related technologies such as hearing aids and automatic speech recognition systems. Noise also corrupts the phase of the clean speech, introducing perceptual disturbances that further reduce quality. Speech enhancement must therefore be handled carefully in everyday listening environments. In this article, speech enhancement is performed using supervised learning of spectral masking. Deep neural networks (DNN) and recurrent neural networks (RNN) are trained to learn spectral masks from the magnitude spectrograms of degraded speech. An iterative procedure is adopted as a post-processing step to deal with the noisy phase. Additionally, an intelligibility-improvement filter incorporates critical band importance function weights, where higher weights contribute more towards intelligibility. Systematic experiments demonstrate that the proposed approaches strongly attenuate the background noise and yield large improvements in perceived speech quality and intelligibility, as well as in automatic speech recognition. Experiments use the TIMIT database. STOI improves by 17.6% over the noisy speech, while SDR and PESQ improve by 5.22 dB and 19%, respectively, over the noisy utterances. These comparisons show that the proposed approaches outperform related speech enhancement methods.

Speech enhancement aims to improve the intelligibility and quality of noisy speech. Conventional unsupervised speech enhancement methods improve quality but fail to improve intelligibility in nonstationary background noise. Moreover, most speech enhancement methods reuse the noisy phase when reconstructing the enhanced speech. For many speech-related applications it is vital to design a robust method that improves speech intelligibility and quality while also dealing with the noisy phase. This article is based on supervised learning of spectral masking for speech enhancement using DNN and RNN frameworks. Since the spectral phase affects speech quality, we use a post-processing step to deal with the noisy phase during time-domain speech recovery.

The associate editor coordinating the review of this manuscript and approving it for publication was Yongping Pan.

I. INTRODUCTION
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

The objective of single-channel speech enhancement is to suppress the background noise components and recover the components of clean speech from the noisy version with improved perceptual quality and intelligibility. Speech enhancement algorithms are primarily used to improve voice quality in real-time speech communication systems and pre-recorded multimedia content, and to increase the accuracy of automatic speech recognition (ASR) systems and hearing aids. Many unsupervised speech enhancement methods have been proposed, such as spectral subtraction [1] and its variants [2], [3], Wiener filtering [2] and its variants [4], [5], and the minimum mean square error (MMSE) estimator [6] and its variants [7], [8].
Although the aforesaid methods suit many real-time speech-related applications because of their small computational complexity, their performance remains poor in real-world acoustic environments where they fail to track the power spectral density of highly non-stationary background noise. To surmount this issue, supervised learning-based speech enhancement methods have been adopted and trained with large quantities of training data covering different background noises [9], [10]. Regression, spectral-mapping and spectral masking-based deep neural networks are among the most successful methods in single-channel speech enhancement tasks [11]-[15]. In a DNN-based speech enhancement task, the relation between input and target features is nonlinear; network architectures composed of multiple layers with nonlinear activation functions are therefore more suitable for speech enhancement [12] than shallow neural networks. In addition, to fully capture the temporal dynamics of the speech signal, feedforward DNN and recurrent neural network structures have been adopted. In particular, single-channel speech enhancement based on feedforward and recurrent neural networks has shown considerable performance gains over shallow neural networks and conventional unsupervised methods. Furthermore, the network architecture, training targets and associated objective functions are vital concerns in deep learning-based speech enhancement [16]. Learning approaches for speech enhancement fall into two groups. In the first, training follows a direct spectral-mapping rule and clean spectral features are mapped from the noisy input features; however, the estimated spectra tend to be over-smoothed [11], [12]. The second, more successful group is spectral masking.
A number of learning approaches have recently been proposed for estimating spectral masks, with notable confirmed results [17]-[20]. Some recent related work on supervised speech enhancement is available in [21]-[23].
Spectral masking-based learning approaches map a noisy speech signal to a time-frequency mask whose gain parameters are multiplied with the noisy magnitude spectra to obtain a noise-suppressed enhanced speech signal. Spectral masking usually estimates the ideal binary mask (IBM) [24], where a time-frequency unit is assigned a binary 1 if the signal-to-noise ratio (SNR) within the unit exceeds a local criterion (0 dB), implying speech dominance; otherwise the unit is assigned a binary 0, implying noise dominance. Another popular mask is the ideal ratio mask (IRM) [25], where each time-frequency unit is assigned a ratio of clean and noisy speech energies. The spectral magnitude mask, called the ideal amplitude mask (IAM), is defined on the short-time Fourier transform magnitudes of clean and noisy speech. Unlike the IRM, the IAM is not upper-bounded by 1. To obtain enhanced speech, the estimated IAM is applied to the spectral magnitudes of the noisy speech and the enhanced speech is resynthesized. Gaussian mixture models (GMM) have been used to learn the distributions of speech- and noise-dominant time-frequency units and to build a Bayesian classifier for IBM estimation [26]. A multilayer perceptron with one hidden layer was employed to estimate the IBM and showed encouraging results in reverberant conditions [27]. Support vector machines (SVM) have been used to estimate the time-frequency mask and delivered more accurate classification than GMM-based classifiers [28]. GMMs were first used to compute posterior probabilities of speech dominance in time-frequency units, and SVMs were then trained with novel features to estimate the IBM [29]; the presented approach generalized well to a wide range of SNRs.
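As an illustration (not the paper's exact code), the three masks can be computed from premixed magnitude spectrograms as follows; the array shapes and the helper name `ideal_masks` are our own assumptions:

```python
import numpy as np

def ideal_masks(S, N, Y):
    """S, N, Y: magnitude spectrograms (frames x bins) of the clean speech,
    the noise, and the noisy mixture, respectively."""
    eps = 1e-8
    snr_db = 20.0 * np.log10((S + eps) / (N + eps))
    ibm = (snr_db > 0.0).astype(float)         # 1 where speech dominates (LC = 0 dB)
    irm = np.sqrt(S**2 / (S**2 + N**2 + eps))  # energy ratio, bounded in [0, 1]
    iam = S / (Y + eps)                        # |X|/|Z|, not upper-bounded by 1
    return ibm, irm, iam
```

Note that the IAM can exceed 1 whenever the mixture magnitude dips below the clean magnitude due to phase cancellation.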
Motivated by deep hidden structures with several layers, a DNN was first used for binary classification to separate speech from mixtures [30] and significantly outperformed earlier speech separation methods. A number of training targets have been examined, and the IRM was suggested over the IBM when speech quality matters [16]. DNN and RNN frameworks have been used to minimize the reconstruction loss associated with the spectra of two premixed speakers by embedding the IRM into the loss function [31], [32], an approach named signal approximation [32]; it showed substantial performance gains over NMF-based approaches. Signal approximation was then treated as an optimization objective, and long short-term memory (LSTM) units were introduced into the RNN architecture, outperforming DNN methods [33]. Signal approximation was further extended to the phase-sensitive mask, with LSTM used for speech denoising [33], [34]. The complex ideal ratio mask (cIRM) has been proposed for speech enhancement: a DNN-based cIRM learns the real and imaginary parts of the complex spectra jointly instead of the magnitude spectra only [35], significantly improving perceptual speech quality.
In this article, spectral masking-based learning approaches are used to construct three time-frequency masks, IRM, IBM and IAM, using DNN and RNN architectures. During training, mask approximation is used as the loss function. Critical band importance functions are used during training to further improve the performance of the DNN and RNN architectures in terms of speech intelligibility and perceptual quality. Since background noise degrades the original phase of the clean speech, it introduces perceptual disturbances that negatively affect the speech; to avoid these degradations, an iterative procedure is adopted as a post-processing step. Figure 1 shows the flow diagram of the proposed speech enhancement method. The main contributions of this study are:
i. Spectral masking-based learning methods are developed using DNN and RNN architectures to enhance speech in noisy backgrounds, notably improving perceptual quality and intelligibility. Three time-frequency masks are constructed: IRM, IBM and IAM. The literature contains many DNN-based constructions of these masks; however, very few studies construct them using RNN frameworks.
ii. Critical band importance functions and their weights are used in the training procedure to further improve the perceptual quality and intelligibility of the noisy speech. The weights are applied directly to the clean training data through an intelligibility-improvement filter, and testing revealed enhanced speech benefiting from the filter.
iii. Most speech enhancement methods use the noisy phase for reconstruction of the enhanced speech. We address this vital problem by adopting a widely used iterative procedure, the Griffin-Lim algorithm (GLA), to deal with the noisy phase during time-domain speech reconstruction.
This contribution notably improves the performance of speech enhancement in terms of perceptual quality and intelligibility.
iv. Lower computational complexity and faster convergence are achieved compared to the baseline feedforward-DNN and RNN frameworks. The baseline and proposed feedforward DNNs use the same number of layers and the same numbers of neurons in the hidden and visible layers; similarly, the baseline and proposed RNNs use the same LSTM units. Nevertheless, better speech quality and intelligibility are achieved. The faster convergence (lower loss) stems from the adaptation of critical band weights applied directly to the clean training data.
v. Automatic speech recognition systems are usually tested with magnitude-only spectra. The proposed methods, with both magnitude and phase processing, improve ASR performance in adverse noisy conditions.
The remainder of the paper is organized as follows. Spectral masking-based speech enhancement and loss functions are discussed in Section II. Experiments are presented in Section III. Results and analysis are presented in Section IV. Finally, the discussion and conclusions are presented in Section V.

II. SPECTRAL MASKING-BASED SPEECH ENHANCEMENT AND LOSS FUNCTIONS
In speech enhancement, the noisy speech z(n) is mostly processed in the time-frequency domain via the short-time Fourier transform (STFT). Since the time-domain speech signal is real-valued, only X = X(t, f) ∈ C^{L×(K/2+1)} is considered, where L and K denote the number of frames and the size of the discrete Fourier transform (DFT). In the time-frequency domain, the magnitude spectrum of the enhanced speech signal |X̂(t, f)| is obtained by the following time-frequency masking procedure:

|X̂(t, f)| = M_x(t, f) ⊙ |Z(t, f)|,

where M_x(t, f) denotes the intended time-frequency mask and |Z(t, f)| = |X(t, f)| + |D(t, f)| denotes the magnitude of the noisy speech, taken as the sum of the clean speech and noise magnitudes in the t-th frame and f-th frequency bin. We use 20 ms frames with 75% overlap. The 20 ms frame length follows from the quasi-stationarity assumption: the speech within an analysis frame should be stationary, and with a much longer analysis frame the signal becomes nonstationary. In supervised spectral masking-based learning tasks, the loss function is typically formulated to predict masking parameters that restore the components of clean speech by suppressing the undesired background noise in all time-frequency units. The time-domain enhanced speech is then recovered by applying the inverse STFT (iSTFT), as illustrated in Fig. 2. A different approach, called spectral mapping, directly learns the mapping from the spectral features of noisy speech to those of clean speech. Spectral masking, however, is known to be more successful than spectral mapping because the time-frequency mask typically has a bounded dynamic range and hence yields faster convergence [16]. In deep learning there are many approaches to learning a time-frequency mask, depending on the training target and the optimization domain.
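A minimal sketch of this masking pipeline using SciPy's STFT with the 20 ms / 75%-overlap framing described above (the function name is our own; the noisy phase is reused here, with iterative phase recovery treated as a separate post-processing step):

```python
import numpy as np
from scipy.signal import stft, istft

FS = 16000
NPERSEG = int(0.020 * FS)        # 20 ms analysis frames
NOVERLAP = int(0.75 * NPERSEG)   # 75% overlap

def enhance(noisy, mask_fn):
    """mask_fn maps the noisy magnitude |Z| to a T-F mask of the same shape."""
    _, _, Z = stft(noisy, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP)
    X_mag = mask_fn(np.abs(Z)) * np.abs(Z)     # |X^|(t,f) = M(t,f) . |Z(t,f)|
    _, x_hat = istft(X_mag * np.exp(1j * np.angle(Z)),
                     fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP)
    return x_hat
```

With an all-ones mask the STFT/iSTFT pair reconstructs the input exactly, which is a convenient sanity check on the framing parameters.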
In mask approximation, the time-frequency mask is estimated such that the mean square error (MSE) with respect to the predefined time-frequency mask is minimized:

L_MA = Σ_{t,f} ( M̂_x(t, f) − M_x(t, f) )²,

where M̂_x and M_x denote the estimated and the predefined reference time-frequency masks, respectively. The time-frequency mask can be derived in various forms, as given in Table 1. Signal approximation is an alternative approach introduced in [32]: the time-frequency mask is estimated such that the estimated speech is closest to the reference clean speech. Magnitude spectra approximation [32] is a form of signal approximation in which the optimization is carried out in the magnitude-spectra domain:

L_MSA = Σ_{t,f} ( M̂_x(t, f) ⊙ |Z(t, f)| − |X(t, f)| )²,

where ⊙ denotes the element-wise (Hadamard) product.
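In numpy terms (a sketch, with the network's output stood in by an array), the two objectives are:

```python
import numpy as np

def mask_approximation_loss(M_hat, M_ref):
    """MSE between the estimated and the reference T-F masks."""
    return np.mean((M_hat - M_ref) ** 2)

def magnitude_spectra_approximation_loss(M_hat, Z_mag, X_mag):
    """MSE between the masked noisy magnitudes and the clean magnitudes
    (signal approximation in the magnitude-spectra domain)."""
    return np.mean((M_hat * Z_mag - X_mag) ** 2)
```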

A. ACOUSTIC FEATURES
A set of acoustic features is extracted from the input speech at the frame level, with frame length and frame shift set to 20 ms and 10 ms, respectively [16]-[18]. The features are chosen to reflect the working of the human auditory system. The cochleagram is computed using a 64-channel gammatone filterbank. Additionally, delta and double-delta coefficients are calculated and appended to all raw acoustic features. Acoustic feature extraction is performed with the RASTAMAT toolbox.
In the spectral domain, each frame can be represented as a vector:

y_t = [y_t(1), y_t(2), …, y_t(369)]^T ∈ R^369,

where the 369 dimensions collect all frame-level features. A second-order auto-regressive moving average filter is used to smooth the temporal trajectories of the acoustic features, as this improves speech enhancement performance [16].
To include temporal information, a context window of two prior and two future frames is used, resulting in an 1845-D (369-D × 5) feature vector. The input to the neural network is:

Y_t = [ y_{t−d}^T, …, y_t^T, …, y_{t+d}^T ]^T,   d = 2,

where d denotes the number of neighboring frames on each side and T denotes the transpose operator. Zero-mean, unit-variance normalization is applied to all feature vectors before network training. The acoustic feature extraction procedure is illustrated in Fig. 3.
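The context-window splicing can be sketched as follows (repeating edge frames at the utterance boundaries is an assumption on our part):

```python
import numpy as np

def splice(features, d=2):
    """features: (T, F) frame-level features -> (T, (2*d+1)*F) spliced inputs."""
    T = features.shape[0]
    padded = np.pad(features, ((d, d), (0, 0)), mode='edge')  # repeat edge frames
    # block i holds frame t + i - d for each row t
    return np.hstack([padded[i:i + T] for i in range(2 * d + 1)])
```

With 369-D frames and d = 2 this yields the 1845-D input vectors described above.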

B. NETWORK ARCHITECTURES
In this study, feedforward-DNN and RNN networks are used as spectral-masking learning approaches; hereafter, the feedforward-DNN is denoted simply as DNN. The architectures of both networks are described in this section.
DNNs are selective learning machines and have been shown to perform exceptionally well in speech enhancement tasks [37]-[41]. The DNN architecture consists of five layers: an input layer, three hidden layers, and an output layer. Techniques from [42], [43] are used in training. The adaptive gradient descent algorithm [44] with a momentum parameter γ is used to optimize the DNN, with a batch size of 512 samples. The scaling factor for adaptive gradient descent is set to 0.0010 and the learning rate is reduced linearly from 0.06 to 0.002 over 100 epochs. For the first few epochs γ is fixed at 0.5 and is then increased to 0.8 for the remaining epochs. The MSE loss function with mask approximation is used; the loss optimization curves over 100 epochs are shown in Fig. 4. The rectified linear unit (ReLU) activation converts the weighted sum of a model neuron's inputs to its output. Recent practice shows that a moderately deep network with ReLU can be trained effectively on large training data without unsupervised pretraining; therefore, ReLU is used in all hidden layers and the sigmoid activation is used in the output layer. The sigmoid is chosen for its [0, 1] dynamic range: it is used in models that predict probabilities, since probabilities lie between 0 and 1, and the dynamic range of the T-F masks likewise lies between 0 and 1. The sigmoid function is differentiable, with a slope obtainable at any point; it is monotonic, but its derivative is not. The activation functions are:

ReLU(x) = max(0, x),   σ(x) = 1 / (1 + e^{−x}).

The RNN architecture, in contrast, contains an input layer, three LSTM layers of 256 hidden units each, and a fully connected output layer with 64 sigmoid units. The adaptive gradient descent (AGD) algorithm is adopted for network training, with learning rate, epochs and batch size set to 0.001, 100 and 1024, respectively.
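A plain-numpy sketch of the feedforward pass, three ReLU hidden layers followed by a sigmoid output so the predicted mask stays in (0, 1); the layer sizes in the usage below are illustrative, not the paper's exact dimensions:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """Hidden layers use ReLU; the output layer uses a sigmoid,
    bounding the mask estimate to (0, 1)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return sigmoid(h @ weights[-1] + biases[-1])
```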
The AGD is adopted to minimize the loss function.

C. INTELLIGIBILITY-IMPROVEMENT FILTER
Critical band importance functions follow the American National Standards Institute (ANSI) S3.5 standard [45]; bands with higher importance weights contribute more towards improving the speech intelligibility of the noisy speech. There are 21 frequency bands, given in Table 2, and the band importance values indicate each band's impact on intelligibility. Based on the critical band importance functions, an intelligibility-improvement filter (IIF) is formulated and applied to the clean training waveforms. The weights are multiplied with the training data to further improve intelligibility:

X_t^F(t, f) = α(t) · X_t(t, f),   t = 1, …, T,

where X_t^F(t, f) denotes the filtered speech, T is the total number of speech frames, and α(t) are the filter coefficients derived from the band weights, with λ(M) representing the weight in the M-th frequency band between its lower and upper bounds. The IIF is applied directly to the clean waveforms, and the filtered waveforms are used as training data. In testing, the neural networks generate enhanced speech that incorporates the effect of the IIF.
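The band-weighting idea can be sketched as follows; the band edges and weights below are placeholders for illustration, not the ANSI S3.5 table values:

```python
import numpy as np

def apply_band_weights(X_mag, band_edges, weights):
    """X_mag: (frames, bins) magnitudes; band_edges: len(weights)+1 ascending
    bin indices delimiting the bands; weights: lambda(M) for each band."""
    Y = X_mag.copy()
    for m, w in enumerate(weights):
        Y[:, band_edges[m]:band_edges[m + 1]] *= w  # scale each band by its weight
    return Y
```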

D. ITERATIVE TIME-DOMAIN SPEECH RECOVERY
After the neural network produces the estimate of the magnitude spectra, the time-domain signal is recovered using the inverse STFT (iSTFT). One approach is to apply the iSTFT using the estimated magnitude from the network together with the phase of the noisy time-domain waveform. However, background noise also degrades the phase of the clean speech, and this degradation typically produces perceptual disturbances that harm speech quality. Moreover, the STFT, computed from overlapping speech frames, is a redundant representation of the time-domain signal: the magnitude spectrogram of the recovered time-domain signal may differ from the intended one [46], [47], and this inconsistency matters for network-estimated magnitudes. To reduce the mismatch between the magnitude and the phase from which a time-domain signal is recovered, an iterative procedure [46] is adopted, given as Algorithm 1. The phase is updated at every iteration by replacing it with the phase of the STFT of the current iSTFT, while the network-estimated magnitude always remains fixed. The algorithm takes as input the estimated magnitudes from the DNN/RNN outputs; the phases are unknown and must be solved for in order to reconstruct an estimate of the original signal. The iterations identify the closest achievable consistent spectrogram for the given magnitude spectrogram. GLA is a well-established phase recovery algorithm based on spectrogram consistency [46]: it recovers a complex-valued spectrogram that is consistent and retains the given magnitude through a projection procedure, formulated as an optimization problem [46].
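A compact sketch of the Griffin-Lim iteration in Algorithm 1, using SciPy's STFT pair; the framing parameters and the random phase initialization are our assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, fs=16000, nperseg=320, noverlap=240, n_iter=50):
    """mag: network-estimated magnitude spectrogram (bins x frames).
    The magnitude stays fixed; only the phase is refined iteratively."""
    kw = dict(fs=fs, nperseg=nperseg, noverlap=noverlap)
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))   # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, boundary=False, **kw)  # project to time domain
        _, _, Z = stft(x, boundary=None, padded=False, **kw)
        phase = np.exp(1j * np.angle(Z))                 # keep phase, restore magnitude
    _, x = istft(mag * phase, boundary=False, **kw)
    return x
```

Each pass applies the two projections of GLA: onto the set of consistent spectrograms (iSTFT followed by STFT) and onto the set of spectrograms with the given magnitude.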

III. EXPERIMENTS

A. DATASET
The experiments are performed on a speech dataset derived from the TIMIT database [48]. To assess the performance of the processing methods in various noisy backgrounds, 15 noise types are selected from the NOISEX-92 [49] and Aurora-4 [50] databases, as given in Table 3. Noisy speech signals are created at three signal-to-noise ratio (SNR) levels ranging from −3 dB to 3 dB in 3 dB steps. For training, 2000 speech utterances from 100 speakers of both genders are reproduced for each of the three time-frequency masks at each SNR level and mixed with the 15 noise types; in total, 18000 utterances (about 15 hours of training data) are used. A further 800 utterances from 30 speakers are reserved for testing, and 150 utterances from 16 speakers of both genders are drawn at random to evaluate the processing methods. All noise sources are used in both training and testing, and results are averaged over the 15 noise types.
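Creating a noisy mixture at a target SNR can be sketched as follows (the function name is ours):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db."""
    noise = noise[:len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```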

B. EVALUATION METRICS AND PARAMETERS
Four objective measures are used to quantitatively evaluate the spectral masking-based speech enhancement methods. Short-time objective intelligibility (STOI) and extended STOI (ESTOI) serve as intelligibility indicators, whereas perceptual evaluation of speech quality (PESQ) and signal-to-distortion ratio (SDR) serve as quality indicators. PESQ [51], an ITU-T P.862 recommendation, predicts the perceptual quality of the enhanced speech with an output value ranging from −0.5 to 4.5, where higher values imply better quality. SDR [52] measures speech quality. STOI [53] predicts the intelligibility of the enhanced speech with an output value between 0 and 1, where higher values imply better intelligibility; STOI is based on the correlation between the clean and enhanced speech signals in short overlapping segments. ESTOI [54] likewise predicts intelligibility with an output value between 0 and 1.

C. SYSTEM REPRESENTATION
Based on the time-frequency masks and learning methods, various deep spectral masking-based speech enhancement methods are realized, as given in Table 4. Each method is named by the convention (<Neural Network>-<Mask Type>-<Post-Processing>). For example, ''DNN-IRM'' denotes the feedforward DNN with the IRM as the time-frequency mask, without iterative time-domain speech recovery or the IIF filter; ''RNN-IBM'' denotes the recurrent neural network with the IBM, again without recovery or the IIF filter. In contrast, ''DNN-IRM-iSR'' denotes the feedforward DNN with the IRM using iterative time-domain speech recovery and the IIF filter, and ''RNN-IBM-iSR'' the recurrent network with the IBM using both. The baseline deep networks are denoted DNN_B and RNN_B, respectively. All neural networks are trained on the same training dataset. An Intel Core i7-3210M 3.2 GHz processor and an Nvidia GTX 950 GPU are used for all experiments.

IV. RESULTS AND ANALYSIS
In this section, we discuss the main findings of this study. We first objectively compare the spectral-masking methods and time-frequency masks without the iterative time-domain speech recovery algorithm and intelligibility-improvement filter. Secondly, we compare the proposed RNN, DNN and related speech enhancement methods. Thirdly, we evaluate the speech recognition performance of the proposed method. Finally, we conduct subjective listening tests to further evaluate the proposed method in terms of speech quality and intelligibility.

A. OBJECTIVE EVALUATION
Detailed comparison results for three noise types on the TIMIT database for both genders are reported in Table 5 and Table 6, where mask approximation is the training loss function for all neural network models. The tables show that spectral masking with iterative time-domain speech recovery and the intelligibility-improvement filter performs better for both the RNN and DNN frameworks: the time-frequency masks with speech recovery and the IIF filter improve speech quality and intelligibility over their counterparts and over unprocessed noisy speech. Explicitly, RNN-based spectral masking outperforms DNN-based spectral masking. For example, in Table 5 at −3 dB babble noise, RNN-IRM-iSR improves STOI by 17.6% over the noisy speech and by 1.25% over its RNN-IRM counterpart. Similarly, at −3 dB factory noise, RNN-IAM-iSR improves STOI by 22.38% over the noisy speech and by 1.08% over RNN-IAM, while RNN-IBM-iSR improves ESTOI and SDR by 1.58% and 3.85% over RNN-IBM. Likewise, at −3 dB white noise, RNN-IRM-iSR, RNN-IBM-iSR and RNN-IAM-iSR improve PESQ by factors of 0.85, 0.81 and 0.86 over the noisy speech, and by 2%, 2.04% and 1.01% over RNN-IRM, RNN-IBM and RNN-IAM, respectively. The STOI, ESTOI, SDR and PESQ gains are higher in the nonvocal noisy backgrounds (factory and white noise) than in vocal babble noise. DNN-based spectral masking with the IIF filter and time-domain speech recovery also outperforms its counterparts: in Table 6 at 0 dB babble noise, DNN-IRM-iSR improves STOI, SDR and PESQ by 16.83%, 5.22 dB and 19% over the noisy utterances, and DNN-IRM-iSR, DNN-IBM-iSR and DNN-IAM-iSR improve STOI and PESQ by 1.51%, 1.52% and 1.51% over DNN-IRM, DNN-IBM and DNN-IAM, respectively.
The average improvements in STOI, ESTOI, SDR and PESQ values are given in Figure 6.
We separately compare the performance of the time-frequency masks generated by the neural network-based spectral masking methods. Results for the three time-frequency masks based on RNN and DNN are given in Table 7. In terms of STOI and ESTOI, IAM-iSR performs better than IRM-iSR and IBM-iSR; in terms of SDR and PESQ, IRM-iSR performs better than IAM-iSR and IBM-iSR. For example, RNN-IAM-iSR improves STOI by 16.40% over noisy speech, compared with 16.38% and 15.5% for RNN-IRM-iSR and RNN-IBM-iSR, respectively. In addition, DNN-IRM-iSR improves PESQ by a factor of 0.46 over noisy speech, compared with improvements of 0.39 and 0.42 for RNN-IBM-iSR and RNN-IAM-iSR. A time-varying spectrogram graphically displays the important speech patterns over time in various frequency bands. To visualize and compare enhancement performance for both RNN and DNN, spectrograms of the clean, noisy and enhanced speech signals are plotted in Fig. 7, with the STOI, SDR and PESQ values of the utterances noted above the spectrograms for clarity. It is evident that both RNN-iSR and DNN-iSR successfully reduce the background noise components, and that RNN-iSR recovers a better speech signal than RNN.
Likewise, DNN-iSR recovers a better speech signal than DNN. To visualize the impact of phase recovery and the IIF in the proposed speech enhancement, spectrograms of the clean and noisy speech samples, the phase-recovered-only output, the RNN output with phase recovery, and the RNN output with IIF filter effects are plotted in Fig. 8. Phase recovery and integration of the IIF filter significantly improve speech quality and intelligibility.

B. COMPARISON WITH RELATED METHODS
Additionally, the spectral-masking learning methods are compared with various state-of-the-art speech enhancement methods, including a deep neural network (DNN) [16], a recurrent neural network (RNN) [31], non-negative matrix factorization (NMF) [10], a non-negative dynamical system (NNDS) [55], robust principal component analysis (RPCA) [56], the log-minimum mean square error estimator (LMMSE) [7] and a deep denoising autoencoder (DDAE) [57], in order to confirm the performance of the proposed speech enhancement method. Both learning methods attain significant improvements over the competing methods in PESQ, STOI, ESTOI and SDR.
The intelligibility and quality values of NNDS are consistently greater than those of the NMF- and RPCA-based methods. The results in Table 8 demonstrate that the proposed RNN-iSR and DNN-iSR outscore their RNN and DNN counterparts, as well as the competing methods NMF, NNDS, RPCA, LMMSE and DDAE, by reasonable margins. For example, STOI improves from 67% with NMF at −3 dB noise to 76.5% with RNN-iSR, a 14.18% relative improvement. Similarly, PESQ improves from 1.64 with NNDS at 0 dB noise to 1.89 with DNN-iSR, a 15.24% improvement, and SDR improves from 6.73 with RPCA at 0 dB noise to 10.90 with RNN-iSR, a gain of 4.17 dB. Table 9 shows the results in terms of output SNR (SNR_O), improvement in overall SNR (ΔSNR), and segmental SNR (SSNR); the SSNR measures the residual noise in the enhanced speech signals. The proposed RNN-iSR and DNN-iSR significantly improve SNR_O, and their overall SNRs are higher than those of the competing state-of-the-art methods. Similarly, the consistent SSNR values indicate that the proposed methods significantly reduce the residual noise, which is confirmed by the time-varying spectrograms in Fig. 7.

C. COMPLEXITY AND NETWORK CONVERGENCE
The cost of training a DNN/RNN depends on the network parameters and the forward-backward propagation used for network tuning. In the proposed speech enhancement methods, the network parameters are initialized randomly. Complexity also depends on the number of neurons in the hidden layers and on the weights: the more neurons, the higher the network complexity. As Table 4 shows, all DNNs/RNNs are trained with the same number of layers, the same numbers of neurons in hidden and visible layers, and the same LSTM units, yet the proposed deep networks perform better and converge faster. The reason for the faster convergence (lower loss) is the adaptation of the critical band weights, applied directly to the clean training data: pre-processing the input data with the CBIF weights clearly improves network performance. With an equal number of neurons, the proposed methods achieve lower loss values, as can be observed in Fig. 4, where all proposed methods converge by epoch 35. Based on these convergence results, the number of epochs is fixed at 50 in the proposed speech enhancement methods. The complexity of the DNN/RNN is given in Table 10.

D. ROBUST SPEECH RECOGNITION
The above evaluations showed that DNN- and RNN-based spectral masking significantly attenuated the background noise and produced accurate estimates of the magnitude spectrogram of the clean speech. Since automatic speech recognition systems typically utilize only the magnitude spectrogram, one would expect the DNN and RNN approaches to improve ASR performance in noisy environments. To perform automatic speech recognition, the DNN- and RNN-based spectral masking approaches are used as a front-end to enhance all speech utterances. We used Google ASR [58] in the experiments to evaluate ASR performance in terms of word error rate (WER), and we report average WERs across all background noises and SNR levels. As shown in Table 11, both RNN-iSR and DNN-iSR achieved lower WERs than DNN and RNN in noisy conditions. RNN-iSR- and DNN-iSR-based speech enhancement considerably boosted the ASR performance, with absolute improvements of 28.29% for DNN-iSR and 35.22% for RNN-iSR over the noisy speech utterances. The ASR advantage gradually decreases as the SNR increases, partly because the noise becomes weaker. Note that the ASR experiments aim to show the potential of RNNs and DNNs rather than to achieve state-of-the-art results.
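The WER metric reported in Table 11 is the standard word-level edit distance normalized by the reference length; the Google ASR front-end itself is treated as a black box, so only the scoring step is sketched here:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```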

E. SUBJECTIVE EVALUATION
In addition to the objective evaluation, subjective listening tests are performed to evaluate the perceptual quality and intelligibility of the enhanced speech. Speech utterances with input SNRs of −3dB, 0dB, and 3dB are randomly selected from three noise sources (babble, factory, and white noise). In total, 100 speech utterances are used to compare DNN-iSR and RNN-iSR. Six participants are asked to identify the correctly perceived words in order to measure the speech intelligibility in terms of the word recognition rate (WRR). None of the speech utterances are repeated in the experiments. The tests are performed in an isolated room using high-quality headphones. Figure 9 shows the subjective listening results in terms of subjective intelligibility (WRR). From Fig. 9, we observe that RNN-iSR achieved better results at all input SNRs, while DNN-iSR significantly improved the results at −3dB and 0dB. These results confirm the advantages of the iterative speech recovery and the IIF filter in the proposed speech enhancement methods. To investigate the statistical significance of these results, we performed an analysis of variance (ANOVA).

To measure the speech quality of the enhanced speech subjectively, a total of 13 participants are asked to rate the speech utterances in terms of the mean opinion score (MOS). The biographical data of the listeners who participated in the subjective quality tests is given in Table 12. A total of 200 speech utterances are randomly selected and mixed with the three noise sources (babble, factory, and white noise) at −3dB, 0dB, and 3dB SNR. The processed utterances are used to compare the performance of the proposed DNN-iSR and RNN-iSR. None of the speech utterances are repeated in the experiments. Training sessions are organized to familiarize the listeners with the procedure. The tests are performed in an isolated room using high-quality headphones.
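The one-way ANOVA used for the significance analysis reduces to an F-statistic comparing between-group and within-group variance of the listener scores; a minimal NumPy sketch (the function name and the toy score groups are ours):

```python
import numpy as np

def one_way_anova_f(*groups):
    """F-statistic of a one-way ANOVA: ratio of between-group to
    within-group mean squares over the listener score groups."""
    scores = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = scores.mean()
    k, n = len(groups), len(scores)
    # Between-group sum of squares (group means vs. grand mean).
    ssb = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (scores vs. their own group mean).
    ssw = sum(((np.asarray(g, dtype=float) - np.mean(g)) ** 2).sum() for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))
```

The resulting F value is then compared against the F-distribution with (k−1, n−k) degrees of freedom to obtain a p-value.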
Figures 9 and 10 show the subjective listening test results in terms of MOS for speech quality. Both DNN-iSR and RNN-iSR performed well. The average MOS score at the negative SNR is higher than 2.75 (MOS ≥ 2.75 at −3dB), which indicates a significant improvement, while at SNR ≥ 0dB the average MOS score surpasses 3 (MOS ≥ 3.0 at 0dB and 3dB). The individual MOS scores for all listeners are also depicted in Figs. 9 and 10. An ANOVA was also performed on the MOS scores at −3dB, 0dB, and 3dB.

VI. DISCUSSION AND CONCLUSION
We have proposed supervised spectral masking-based learning approaches to single-channel speech enhancement. RNNs and DNNs are trained to learn the spectral masking between the degraded and clean speech signals; notably, they are trained without unsupervised pre-training. The presented study used an intelligibility improvement filter and an iterative reconstruction method to refine the network outputs and produce a better recovered speech signal. In this study, all acoustic features are concatenations of the raw acoustic features within a window, since temporal dynamics provide valuable information about speech signals. A more principled way to exploit temporal information is the RNN architecture, a natural extension of a feedforward network: it captures long-term temporal dynamics through time-delayed self-connections and is trained sequentially. The RNN architectures trained for spectral masking yielded improvements of 4.76%, 12.59%, 2.75dB, and 8.67% in terms of STOI, ESTOI, SDR, and PESQ, respectively; these improvements are significant. We also trained DNN architectures. To test the generalization of the RNNs and DNNs, we used the TIMIT database, which includes both male and female speakers. The overall SNRs and SSNR for RNN-iSR and DNN-iSR are higher than those of the competing state-of-the-art methods. The listening tests indicate that the RNN-iSR approach achieved better results at all input SNRs, while DNN-iSR improved the results significantly at −3dB and 0dB. In addition, ASR experiments showed that the proposed speech enhancement methods are robust front-ends for the automatic speech recognition task. It is important to mention that we first recovered the time-domain speech signals from the RNN and DNN outputs and then performed automatic speech recognition on the processed signals.
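The iterative speech recovery step can be illustrated by a Griffin-Lim-style iteration that keeps the estimated magnitude fixed and refines the phase, starting from the noisy phase. This is a generic sketch under the assumption that the iterative reconstruction follows this classic scheme; the STFT parameters and iteration count are illustrative, not the paper's settings.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed short-time Fourier transform (frames x freq bins)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, n_fft=256, hop=64):
    """Least-squares inverse STFT via windowed overlap-add."""
    win = np.hanning(n_fft)
    x = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(x)
    for i, spec in enumerate(S):
        x[i * hop:i * hop + n_fft] += np.fft.irfft(spec, n_fft) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def iterative_recovery(mag, phase0, n_iter=30):
    """Keep the estimated magnitude fixed; iteratively replace the phase
    with the phase of the STFT of the resynthesized signal."""
    S = mag * np.exp(1j * phase0)
    for _ in range(n_iter):
        S = mag * np.exp(1j * np.angle(stft(istft(S))))
    return istft(S)
```

Each iteration is guaranteed not to increase the mismatch between the target magnitude and the magnitude of the resynthesized signal, which is why the recovered speech improves over a plain noisy-phase reconstruction.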
We achieved lower computational complexity and faster convergence compared with the baseline DNN/RNN speech enhancement methods. According to our experiments, the iterative speech recovery and the IIF filter improved the predicted speech intelligibility and quality values as well as significantly improving the ASR performance. Comparing Fig. 6(c) with Fig. 6(e), the spectrogram of the RNN-iSR output is cleaner than the spectrogram of the RNN-recovered signal, which suggests the benefits of the iterative speech recovery and the IIF filter. Because the RNN-iSR and DNN-iSR outputs have better spectral representations, they yielded better ASR performance. In summary, we have proposed to use RNNs and DNNs to learn the spectral masking from degraded speech to clean speech for the single-channel speech enhancement task. The proposed supervised learning approaches are conceptually simple, improved performance in terms of predicted speech intelligibility and quality, and boosted the ASR results in various noisy conditions.

NASIR SALEEM received the B.Sc. degree in telecommunication engineering from the University of Engineering and Technology, Peshawar, Pakistan, in 2008, and the M.Sc. degree in electrical engineering (communication) from CECOS University, Peshawar, in 2012. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, University of Engineering and Technology, Peshawar, under the supervision of Dr. Muhammad Irfan Khattak (an Associate Professor). He was a Lecturer with the Institute of Engineering and Technology, Gomal University, Dera Ismail Khan, Pakistan. He is also an Assistant Professor with the Department of Electrical Engineering, Faculty of Engineering and Technology, Gomal University. His research interests include digital signal processing, speech processing and speech enhancement, and machine learning for speech enhancement.
MUHAMMAD IRFAN KHATTAK received the B.Sc. degree in electrical engineering from the University of Engineering and Technology, Peshawar, in 2004, and the Ph.D. degree from Loughborough University, U.K., in 2010. After completing his Ph.D. degree, he was appointed as the Chairman of the Electrical Engineering Department, UET Peshawar Bannu Campus, for five years, where he oversaw the academic and research activities of the department. In 2016, he was appointed as the Campus Coordinator of UET Peshawar Kohat Campus and took administrative control of the campus. He is currently working as an Associate Professor with the Department of Electrical Engineering, University of Engineering and Technology, Peshawar. He also heads the Microwave and Antenna Research Group, where he supervises postgraduate students working on the latest trends in antenna technology, such as 5G and graphene nano-antennas for terahertz, optoelectronic, and plasmonic applications. His research interests include antenna design, on-body communications, anechoic chamber characterization, speech processing, and speech enhancement. Besides his research activities, he is also a certified OBE Expert with the Pakistan Engineering Council for organizing OBE-based accreditation visits. Prior to this, he was an Assistant Professor with the Capital University of Science and Technology, Islamabad, Pakistan. He also worked as an Assistant Professor with CECOS University, Peshawar, Pakistan. His industrial experience includes working for four Fortune 500 companies: IBM, Siemens, Philips Medical Systems, and NXP Semiconductors, in Germany and The Netherlands. His research interests also include wireless networks, antennas, technology policy, innovation, and entrepreneurship.