Speech Enhancement Algorithm Based on Super-Gaussian Modeling and Orthogonal Polynomials

Different types of noise from the surroundings always interfere with speech and produce annoying signals for the human auditory system. To exchange speech information in a noisy environment, speech quality and intelligibility must be maintained, which is a challenging task. In most speech enhancement algorithms, the speech signal is characterized by Gaussian or super-Gaussian models, and noise is characterized by a Gaussian prior. However, these assumptions do not always hold in real-life situations, thereby negatively affecting the estimation and, eventually, the performance of the enhancement algorithm. Accordingly, this paper focuses on deriving an optimum low-distortion estimator with models that fit speech and noise data well. This estimator provides minimum levels of speech distortion and residual noise with additional improvements in perceptual aspects of speech via four key steps. First, a recent transform based on an orthogonal polynomial is used to transform the observation signal into a transform domain. Second, noise classification based on feature extraction is adopted to find accurate and mutable models for noise signals. Third, two stages of nonlinear and linear estimators based on the minimum mean square error (MMSE) and new models for speech and noise are derived to estimate a clean speech signal. Finally, the estimated speech signal in the time domain is determined by applying the inverse of the orthogonal transform. The results show that the average classification accuracy of the proposed approach is 99.43%. In addition, the proposed algorithm significantly outperforms existing speech estimators in terms of quality and intelligibility measures.


I. INTRODUCTION
Speech is the primary means of interaction among human beings. It plays a key role in the current era of communication technology. Speech signals experience several difficult scenarios during transmission, such as interference, reverberation, and additive environmental noise. Additive noise is considered the most influential and most widespread type of noise in a real environment; therefore, speech enhancement algorithms (SEAs) have been developed to deal with noisy signals, restore clean speech signals, improve speech quality and intelligibility, solve the noise pollution problem, and reduce listener fatigue [1], [2]. The process of removing noise without distorting the original speech signal is a challenging task [3]. SEAs are commonly implemented in different applications [3]-[7].
The probability density function (PDF) of speech and noise signals is a crucial point in designing a statistical speech estimator. Most conventional SEAs adopt Gaussian [3], [12], [29], Laplacian [4], [13], [30], or gamma [31] priors to model speech signals, whereas noise is predominantly modeled as a Gaussian random process [3]. The fundamental work can be traced back to the introduction of the short-time spectral amplitude (STSA) estimator for clean speech signals by Ephraim and Malah [12]. This estimator is based on modeling speech and noise Fourier expansion coefficients as statistically independent, zero-mean, Gaussian random variables. It is derived by minimizing the conditional mean squared error (MSE) [8]. Ephraim and Malah extended their work in [32] by using the log spectral amplitude (LSA) to improve agreement with the mechanism of human hearing [23]. This estimator is efficient in reducing the musical noise (MN) phenomenon [33]. Cohen [34] proposed a modified LSA by modifying the gain function of the LSA estimator based on a binary hypothesis model. A combination of MMSE estimators and a spectral subtraction filter was developed in [35]. Different studies have used real transforms, such as DCT [21], [24], DKT and DTT [6], and WT [36], [37], for enhancing noisy signals. These transforms are effective in noise reduction [21], [22]. The attenuation filter is not always suitable for noise interference; thus, Soon and Koh [29] proposed an innovative approach that minimizes the distortion of reconstructed signals by considering two cases of additive noise. This approach is called the low-distortion approach. It minimizes the underlying speech distortion during the speech enhancement process because it identifies whether the background noise is destructive or constructive for a specific sequence. That is, the attenuation filter is used to reverse the process of additive noise; however, the resultant magnitude of the addition of two complex signals (speech and noise) may not always be greater than the original amplitude of speech. Therefore, using an attenuation filter leads to high distortion in the speech signal [29]. Two filters, i.e., the multiplicative dual-gain Wiener (DGW) filter and a subtractive filter, are used in this approach. Real transforms based on an orthogonal polynomial (OP) were first used by Jassim et al. [6] to enhance noisy signals based on the WF approach in the DKT and DTT domains. If speech and noise are modeled as Gaussian priors in the real transform, then the resulting spectral gain becomes a WF, as proven by Wolfe and Godsill [8], [38].
Many SEAs have adopted super-Gaussian functions to model speech signals [31], [39] because super-Gaussian distributions have longer tails and spikier peaks and are thus more appropriate for representing speech signals. Moreover, a Gaussian assumption is asymptotically valid only when the frame duration is longer than the correlation span of the signal under consideration [4], [39], [40]. This assumption may hold for noise components but not for speech components, which are typically estimated using relatively short (20-30 ms) windows [3], [4]. Different SEAs have reinforced this concept [40], [41]. In [40], the capability of Laplacian random variables to describe speech samples during voice activity intervals was proven. The selection of an appropriate PDF is based on a comparison between a speech coefficient histogram obtained from a large dataset and a non-Gaussian distribution [31]. Many researchers have adopted a Laplacian or gamma PDF in their works, such as [4], [13], [30], [31], [39], [42], [43]. Although SEA performance is improved, the optimal points of speech quality and intelligibility have not been achieved because the speech and noise models remain imperfect. Most studies do not account for the different properties of various types of noise [44]. In a single-microphone setting, improving quality and intelligibility attributes is a popular research topic [45].
Conventional SEAs require noise estimation algorithms to perform correctly [46]. Most of these algorithms suffer from residual noise and speech distortion because the details of speech signals are essentially destroyed under a low signal-to-noise ratio (SNR), in addition to the difficulty of processing non-stationary noise [47]. Various SEAs have attempted to address these drawbacks, but their success depends on noise type [46]. Therefore, recent studies that utilize the noise classification process are recommended [37], [44]-[46], [48]. Noise classification is performed first, followed by SEA, which uses optimal parameters based on the selected noise type. However, no method uses noise classification to find the best noise model, which is a significant point in statistical SEAs. Accordingly, the current study proposes novel linear and nonlinear low-distortion estimators that account for constructive and destructive events based on new composite super-Gaussian representations of speech and noise signals. The new model for speech DKTT coefficients is a composite of Laplacian and gamma distributions, whereas the noise DKTT coefficient model is represented by a dual Laplacian prior. In this paper, a new estimator is proposed to avoid high distortion in speech signals in low SNR regions, minimize residual noise (including MN), and concurrently improve the perceptual aspects of quality and intelligibility.
The rest of this paper is organized as follows. Section II describes the stages of the proposed SEA strategy and the basic mathematical aspects of DKTT and the noise classification method. The derivation of the proposed linear and nonlinear estimators is also provided in this section. Section 3 presents the evaluation of the noise classifier and the proposed estimator through a substantial comparison with several existing algorithms. Lastly, the conclusion is discussed in Section 4.

II. THE PROPOSED SEA
The proposed SEA and its specific stages, which embed the requirements for enhancing noisy signals, are presented in the following subsections. For clarity, TABLE 1 lists the notations used. In addition, TABLE 2 lists the abbreviations used in this paper.

A. STAGES OF THE PROPOSED SEA STRATEGY
The design of the proposed SEA is divided into five main phases. The first phase converts noisy speech into the uncorrelated domain using the real transform DKTT, which is based on OP. Second, a noise classification algorithm is adopted to

B. BASIC MATHEMATICAL ASPECTS OF DKTT
DKTT exhibits the following distinctive properties: high energy compaction, good localization [49], [50], and excellent noise suppression performance. These capabilities significantly affect the enhancement process [23], where noise can be suppressed without substantial loss of the original signal information. Moreover, a real transform reduces computational complexity in noisy signal analysis and clean signal synthesis. The additive noisy signal model is defined as follows: let $x(n)$ be the discrete-time speech signal that is degraded by the uncorrelated background noise $d(n)$ (which includes white noise and colored noise), which results in the following noisy signal:
$$y(n) = x(n) + d(n) \qquad (1)$$
Then, $y(n)$ is transformed into the DKTT domain to obtain $X_l(k)$, $Y_l(k)$, and $D_l(k)$, the $k$th transform coefficients of the speech, noisy, and noise signals, respectively.
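The additive model above can be illustrated with a minimal sketch that scales a noise sequence so that the mixture reaches a requested SNR. The function names, the sine tone used as toy "speech", and the Gaussian noise are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def mix_at_snr(speech, noise, target_snr_db):
    """Return (y, scaled_noise) with y(n) = x(n) + g*d(n) at the target SNR."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(d * d for d in noise) / len(noise)
    # Choose g so that 10*log10(p_speech / (g**2 * p_noise)) == target_snr_db.
    g = math.sqrt(p_speech / (p_noise * 10 ** (target_snr_db / 10)))
    scaled = [g * d for d in noise]
    noisy = [s + d for s, d in zip(speech, scaled)]
    return noisy, scaled


def measured_snr_db(speech, noise):
    """SNR in dB computed from the two separate components."""
    return 10 * math.log10(sum(s * s for s in speech) / sum(d * d for d in noise))


rng = random.Random(0)
# Toy stand-ins: a 200 Hz tone sampled at 16 kHz and white Gaussian noise.
x = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(400)]
d = [rng.gauss(0.0, 1.0) for _ in range(400)]
y, d_scaled = mix_at_snr(x, d, -5.0)
print(round(measured_snr_db(x, d_scaled), 6))
```

The 400-sample length matches a 25 ms segment at a 16 kHz sampling rate, which are the values quoted later in the text.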
The Pochhammer symbol $(a)_k$ is defined as
$$(a)_k = a(a+1)(a+2)\cdots(a+k-1) \qquad (5)$$
Meanwhile, $k_i(m; p, N-1)$ is the weighted KP [54], defined for $i, m = 0, 1, 2, \ldots, N-1$, $N > 0$, and $p \in (0, 1)$, where ${}_3F_2$ and ${}_2F_1$ are the hypergeometric functions [55], $N$ represents the frame size, and $p$ is the controlling parameter of KP. $R_m(x)$ is used to transform the noisy signal $y(n)$ into the DKTT domain and obtain $Y_l(k)$. To transform a signal $f(x)$ from the time domain to the transform domain $F(k)$, the following expression is used [56]:
$$F(k) = \sum_{x=0}^{N-1} R_k(x)\, f(x), \quad k = 0, 1, \ldots, N-1 \qquad (7)$$
and to reconstruct the signal from the transform domain $F(k)$ to the time domain $f(x)$, the following formula is used:
$$f(x) = \sum_{k=0}^{N-1} R_k(x)\, F(k), \quad x = 0, 1, \ldots, N-1 \qquad (8)$$
In addition, the matrix forms of equations (7) and (8) are as follows:
$$\mathbf{F} = \mathbf{R}\,\mathbf{f}, \qquad \mathbf{f} = \mathbf{R}^{T}\,\mathbf{F}$$
where $\mathbf{F}$, $\mathbf{f}$, and $\mathbf{R}$ are the matrix forms of $F(k)$, $f(x)$, and $R_k(x)$, respectively, and $(\cdot)^T$ represents the matrix transpose operator. It is noteworthy that the transform domain coefficients (moments) can be used as a shape descriptor for different types of signals [57]. In addition, the basis functions of OPs can be used as an approximate solution for differential equations [75].
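The forward and inverse relations with an orthonormal basis matrix and its transpose can be sketched as follows. Because the DKTT basis itself is not reproduced in this text, the sketch substitutes an orthonormal DCT-II matrix purely as a stand-in for $\mathbf{R}$; any real orthonormal basis behaves the same way in this round trip. The helper names are hypothetical.

```python
import math


def dct_basis(n):
    """Orthonormal DCT-II matrix, used here only as a stand-in for the DKTT matrix R."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
                     for x in range(n)])
    return rows


def forward(R, f):
    """F = R f: project the time-domain frame onto each basis function."""
    return [sum(r * v for r, v in zip(row, f)) for row in R]


def inverse(R, F):
    """f = R^T F: reconstruct the frame from its transform coefficients."""
    n = len(R)
    return [sum(R[k][x] * F[k] for k in range(n)) for x in range(n)]


def pochhammer(a, k):
    """Rising factorial (a)_k = a(a+1)...(a+k-1), with (a)_0 = 1."""
    result = 1.0
    for i in range(k):
        result *= a + i
    return result


R = dct_basis(16)
f = [float((3 * n) % 7) for n in range(16)]
F = forward(R, f)
f_rec = inverse(R, F)
print(max(abs(a - b) for a, b in zip(f, f_rec)))
```

Since the basis is orthonormal, the transform also preserves signal energy, which is why noise power can be analyzed directly in the transform domain.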

C. CONCEPTS OF NOISE CLASSIFICATION ALGORITHM
In order to make the proposed SEA suitable for different noise environments, a noise classification method is introduced. This method is used to find accurate models for noise signals by controlling their statistical characteristics. This process makes the PDF of the input noise signal match the assumed distribution; therefore, noise suppression is optimized. The noise types are classified using support vector machines (SVMs) through a feature extraction process. The SVM models are trained on eleven background noises. SVM is a widely used machine learning technique for data classification [45]; it works well with different feature sets [58] and is derived from statistical learning theory [44]. New significant parameters are determined based on noise classification, as stated in Section II-A. These parameters are defined in the related sections.

1) Feature Extraction
Two sets of features are used in this work: the mean of the normalized power and the mean of the standard deviation. Features are extracted from the normalized sub-band noise. The sub-band power is partitioned into 25 sub-bands, each 16 samples long, which was found experimentally to be sufficient. In total, 50 features are calculated to realize the corresponding noise classification model. According to the noise type, the corresponding DCP are selected. Specifically, DCP control the amplitude and standard deviation of the assumed noise PDF. The power of each frame is calculated first, and the normalized power feature is then obtained from it, where $Y^2_{norm}(l, k)$ is the normalized power in the $k$th moment. From the normalized power, the mean power and the standard deviation are calculated for each sub-band. To find the mean power, the length of each sub-band, $L$, is calculated first, where $J$ is the total number of sub-bands. The average power is then computed for each frame, where $j$ is the sub-band number, $s = (j - 1) \times L + 1$ represents the index of the starting sample, and $e = j \times L$ represents the index of the ending sample.
The first feature is the mean, which is taken over the $N_{LF}$ initial frames. The second feature is the standard deviation. The feature vector is then constructed by concatenating these two features (mean (16) and standard deviation (17)).
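The feature construction described above can be sketched as follows. The exact normalization formula is not reproduced in this text, so the sketch assumes that each frame's power is divided by its total power; that normalization, the function name, and the random input frames are assumptions for illustration only.

```python
import random
import statistics


def subband_features(frames, num_subbands=25):
    """Return 2*num_subbands features from transform-domain frames.

    frames: list of coefficient frames, each of length N (N divisible by J).
    Assumed normalization: power divided by the total power of the frame.
    """
    n = len(frames[0])
    band_len = n // num_subbands  # sub-band length L for J sub-bands
    means = [0.0] * num_subbands
    stds = [0.0] * num_subbands
    for frame in frames:
        power = [c * c for c in frame]
        total = sum(power) or 1.0
        norm = [p / total for p in power]
        for j in range(num_subbands):
            band = norm[j * band_len:(j + 1) * band_len]
            means[j] += statistics.fmean(band)
            stds[j] += statistics.pstdev(band)
    count = len(frames)
    # Average over the initial frames, then concatenate the two feature sets.
    return [m / count for m in means] + [s / count for s in stds]


rng = random.Random(1)
# Six initial frames of 400 coefficients each (25 sub-bands of 16 samples).
frames = [[rng.gauss(0.0, 1.0) for _ in range(400)] for _ in range(6)]
features = subband_features(frames)
print(len(features))
```

With 25 sub-bands, the concatenation gives the 50-element feature vector mentioned in the text.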

2) Training of SVM Model
An SVM classifier is implemented to determine the type of noise from the six initial frames of the speech signal. SVM is designed for binary classification; therefore, a multiclass problem is solved by combining several binary classifiers. In this work, the "one-against-one" approach is used, which is faster to train, seems preferable for problems with a large number of classes [59], and is based on a voting strategy. For a problem with $C$ classes, the total number of classifiers is $C(C - 1)/2$, and each of them is trained on data from two classes [44]. Therefore, in this work, there are 55 classifiers. The six initial frames from each speech segment are used for feature extraction, and the feature vectors are calculated by performing DKTT on the windowed noisy speech. In total, 400 speech files are taken: 100 files for the training phase and 300 files for the testing phase. The speech signals are corrupted by eleven types of noise, which are considered the most dominant noise types in the environment. The length of the training and testing data for each SNR level is about 25 ms to obtain a stationary segment of the speech signal. A total of 5500 segments of noisy speech are used as the training set; this number comes from 11 types of noise, 5 SNR levels, and the 100 speech files used for the training phase. DKTT is used with $p = 0.5$ to provide appropriate localization and symmetry properties that facilitate the mathematical calculations.
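The counts quoted above follow from simple arithmetic, which the short sketch below reproduces. The noise names are the eleven types listed later in the evaluation section; the variable names are illustrative.

```python
from itertools import combinations

NOISE_TYPES = [
    "white", "pink", "F16", "buccaneer", "factory", "babble",
    "engine room", "operation room", "leopard", "M109", "speech-shaped",
]
SNR_LEVELS_DB = [-10, -5, 0, 5, 10]

# One binary classifier per unordered pair of classes: C(C-1)/2.
class_pairs = list(combinations(NOISE_TYPES, 2))
num_classifiers = len(class_pairs)

# Each clean file is mixed with every noise type at every SNR level.
training_segments = len(NOISE_TYPES) * len(SNR_LEVELS_DB) * 100
testing_segments = len(NOISE_TYPES) * len(SNR_LEVELS_DB) * 300

print(num_classifiers, training_segments, testing_segments)
```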

3) Testing of SVM Model
For the testing phase, 300 clean speech files are chosen from the TIMIT dataset [44]. The speech files are those denoted by 'SA1' and 'SA2', for male and female speakers. Eleven types of noise are used in the testing phase at the five SNR levels. Therefore, there are 16500 files in total for the testing phase. Each noise has a different set of features that distinguishes it from the other noise types. The noise is judged during the initial six frames of the noisy speech signal, which are considered noise-only frames. Noise classification is then carried out based on these features. The "one-against-one" approach constructs a classifier for each pair of classes, resulting in multiple classifiers, and a voting strategy combines the 55 classifiers. For a test point, each binary classifier gives one vote to its winning class, and the point is labeled with the class that has the most votes.
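The voting rule described above can be sketched independently of the SVM itself. In the sketch below, `binary_decide` is a hypothetical stand-in for a trained pairwise classifier; a toy nearest-centroid rule on a one-dimensional feature is used only so that the example runs on its own.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, binary_decide, r):
    """Label r with the class that wins the most pairwise decisions."""
    votes = Counter()
    for m, n in combinations(classes, 2):
        votes[binary_decide(m, n, r)] += 1
    return votes.most_common(1)[0][0], votes


# Toy stand-in for trained pairwise SVMs: pick the nearer class centroid.
CENTROIDS = {"white": 0.0, "babble": 5.0, "factory": 10.0}


def nearest_centroid(m, n, r):
    return m if abs(r - CENTROIDS[m]) <= abs(r - CENTROIDS[n]) else n


label, votes = one_vs_one_predict(list(CENTROIDS), nearest_centroid, 4.2)
print(label, dict(votes))
```

With three classes there are three pairwise decisions; with the eleven noise types of this work there would be 55.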
To explain the classification method further, let $m$ and $n$ denote two classes chosen out of the given noise types. The training data for the class pair $mn$, with the corresponding class labels $z$, can then be expressed as in [44], where $M$ is the number of initial frames, which is equal to six. In the decision function for the noise class pair $mn$, $\alpha_i^{mn}$ is obtained from the solution of the quadratic programming problem, $b^{mn}$ represents the optimized bias, and $K$ denotes the kernel function. As mentioned, a voting strategy is applied: each binary classifier gives one vote to its winning class, and the feature vector $r$ is assigned to the class with the most votes, which gives the noise type of the $l$th frame corresponding to $r$.

D. PROPOSED MMSE ESTIMATORS
Linear and nonlinear estimators are proposed in this paper. The nonlinear MMSE estimator (NBSE) is based on the statistical analysis notion, which requires knowledge regarding speech and noise probability distributions [3]. The analytical solution for the proposed estimators is derived in this section. Each of these estimators has two gains, and each gain deals with a constructive or destructive event. Thus, each estimator is considered a bilateral gain.
1) Proposed Non-Linear Bilateral Super-Gaussian Estimator (NBSE)
In this estimator, the models for speech and noise transform coefficients are assumed to be statistically independent super-Gaussian random variables. The main objective is to find a nonlinear estimate of the factors of interest (clean signal) based on a given set of parameters (noisy signal). In the NBSE estimator, the statistical model for speech DKTT components is assumed to be a composite distribution of Laplacian and gamma PDFs (see (28)). Meanwhile, a dual Laplacian distribution is used to model the noise signal (see (32)).
The dual Laplacian distribution is obtained by combining two Laplacian PDFs with different parameters. The probability distribution of speech is exhibited in FIGURE 3a. In this paper, eleven types of noise are used; white noise is presented in FIGURE 3b. FIGURE 3 shows that better fitting is obtained for the assumed speech and noise DKTT PDFs than for the other presented density functions. Evidently, the assumed composite PDF is more accurate and provides a better fit to the DKTT data than the Gaussian, Laplacian, and gamma distributions. The enlarged section shows that the gamma prior has an extremely high value, making it inappropriate for representing DKTT data. The change in the external appearance of the proposed PDFs is controlled by DCP. Thus, noise reduction can be realized without significant loss in intelligibility. In general, significant noise reduction leads to serious degradation in speech intelligibility [60]. The speech signal model has four DCPs. One of them, $A_G = 0.7604$, belongs to the gamma prior and controls the gamma PDF amplitude. The second parameter found based on noise classification is ∝. It is a significant factor in the decision-directed approach, which is used to estimate the a priori SNR [12]. Ideally, ∝ must be small during the transient parts of speech to respond faster to sudden changes in speech signals, whereas it must be large during the steady-state segments of speech to control the level of MN [3]. The optimum values of DCP and ∝ are listed in TABLE 3 according to noise type.
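As a rough illustration of a noise prior built from two Laplacian PDFs, the sketch below forms a convex mixture of two zero-mean Laplacians and compares it with a Gaussian of the same variance. The paper's own dual Laplacian expression and its DCP values are not reproduced in this text, so the mixture form, the weight `W`, and the scales `B1` and `B2` are hypothetical choices made only to show the heavier tails and sharper peak of such a model.

```python
import math


def laplace_pdf(x, b):
    """Zero-mean Laplacian density with scale b (variance 2*b**2)."""
    return math.exp(-abs(x) / b) / (2.0 * b)


def dual_laplace_pdf(x, w, b1, b2):
    """Convex mixture of two zero-mean Laplacians (illustrative form)."""
    return w * laplace_pdf(x, b1) + (1.0 - w) * laplace_pdf(x, b2)


def gaussian_pdf(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, steps):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


W, B1, B2 = 0.6, 0.5, 2.0  # hypothetical weight and scales


def mix(x):
    return dual_laplace_pdf(x, W, B1, B2)


area = integrate(mix, -60.0, 60.0, 12000)
variance = integrate(lambda x: x * x * mix(x), -60.0, 60.0, 12000)
sigma = math.sqrt(variance)
print(area, variance, mix(0.0) > gaussian_pdf(0.0, sigma))
```

For a mixture of this form, the variance is $2\,(w b_1^2 + (1-w) b_2^2)$, which the numerical integral reproduces.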
Here, $e_k$ indicates the MSE and $E\{\cdot\}$ signifies the expectation operator. The analytical solution for NBSE and its gain functions are explained through the computation steps below. The two conditions that summarize the two mutually exclusive events must be defined first [22], [29] as follows:
$E^+$: speech and noise are constructive when $X_l(k)\,D_l(k) \geq 0$
$E^-$: speech and noise are destructive when $X_l(k)\,D_l(k) < 0$
The additive noisy signal model is expressed in (1). Then, the observed signal $y(n)$ is transformed into the DKTT domain as indicated in (2). In NBSE, no linear relation exists between $\hat{X}_k$ and $Y_k$. Therefore, the formula for MSE in (7) must be minimized by resolving the expected value. For readability, the moment index is written as a subscript and the frame index is omitted because only the current frame is processed. The expectation formula is expressed in terms of $P(X_k, Y_k)$, the joint statistics of $X_k$ and $Y_k$. Hereafter, the symbol $\hat{(\cdot)}$ refers to the estimation operation. The joint probability is converted into a conditional probability based on conditional probability theory [60]. To minimize MSE, the inner integral in Equation (9) must be minimized for the observation vector [61] by taking its derivative with respect to $\hat{X}_k$ and setting it equal to zero. The general definition of the conditional expectation is based on conditional probability, and it can be solved using joint and marginal probabilities. Therefore, a priori knowledge regarding the PDFs of speech and noise coefficient distributions is necessary. In the final NBSE output that gives the estimated signal, the polarity estimator parameter, which is denoted as $f_k$, controls the event probability of each condition [22], [29]. $f_k$ is assumed to be ideal in this work. The speech signal model, $F_{Speech}$, is assumed to be a composite of the gamma and Laplacian priors. In the definition of the gamma density used in the proposed work, $\sigma_{x_k} = C_G\,\sigma_{Gx_k}$ is written for readability, and the variance of the gamma PDF is $\sigma^2_{x_k} = \alpha\beta^2$; the resulting gamma function is obtained by considering $\alpha = 0.5$ and $\Gamma(0.5) = \sqrt{\pi}$. In the definition of the Laplacian density, the Laplacian factor is defined as $b$, and the Laplacian variance is defined as $\sigma_L^2 = 2b_{x_k}^2$. The noise model, $F_{Noise}$, is assumed to be a combination of two Laplacian PDFs.
In the mathematical formula for the NBSE estimator in a constructive interference event, $x_k$ and $y_k$ respectively represent the instances of the random processes $X_k$ and $Y_k$. The same equation as (19) is obtained for $E^-$. The joint PDF of two independent random variables can be expressed by multiplying their marginal probabilities, which gives the joint PDF between $x_k$ and $y_k$ [22], [62], where $m_k = \mathrm{sgn}(X_k)$, and $F_{Speech}$ and $F_{Noise}$ are independent with zero mean. After $F_{Speech}$ and $F_{Noise}$ are substituted, the long expression for $E(X_k \mid Y_k, E^+)$ is divided into four parts, i.e., two for the numerator and two for the denominator, which define the conditional expectation operator for constructive interference. The numerator $N_c$ is divided into $N_{c1}$ and $N_{c2}$. The integral for $N_{c1}$ is solved using Theorem (3.381(1)) from [63] and simplified in terms of the a priori and a posteriori SNRs, and the result is expressed in terms of an incomplete gamma function. The same mathematical solution is applied to $N_{c2}$. The denominator $D_c$ is also separated into two terms: the first term, $D_{c1}$, is expressed in terms of an incomplete gamma function, and the second term is $D_{c2}$. These terms give the general form of the speech estimator in a constructive event, $(NBSE)_c$.
At the other extreme, the analytical solution for NBSE in a destructive event is derived. The destructive equations are clearly longer than the constructive equations; therefore, they are divided into eight parts, i.e., four for the numerator and four for the denominator. The first integral in (42) is treated first, and the mathematical solution for its second term, $N_{d2}$, is calculated. Then, the second integral in the numerator is taken, which gives $N_{d4}$. The denominator, $D_d = D_{df} + D_{ds}$, is also separated into two terms, namely, $D_{df}$ and $D_{ds}$. The first term is $D_{df} = D_{d1} + D_{d2}$, and the second term is $D_{ds} = D_{d3} + D_{d4}$; each of the four parts, $D_{d1}$ to $D_{d4}$, is evaluated separately. These terms give the general form of the estimator in a destructive event. Then, Equation (13) of NBSE provides an optimal estimation of the clean signal.
2) The Proposed Linear Bilateral Super-Gaussian Estimator (LBSE)
To improve the performance of the speech enhancement process, the problem of residual noise, including MN, which is highly irritating to the human ear, must be addressed. Therefore, a post-processing filtering technique, i.e., LBSE, is proposed as a second-stage estimator. Moreover, LBSE deals with the over-attenuation problem at low SNR levels.
The linear relation that combines $Y_k$ and $\hat{X}_k$ is expressed as
$$\hat{X}_k = G_k\,Y_k$$
where $G_k$ is the multiplicative LBSE gain. The expression for MSE has been defined previously; substituting the linear relation into it gives the well-known linear MSE expression [22], [29]. The general form of the multiplicative gain is derived by differentiating (41) with respect to the gain function and then equating the result to zero. LBSE has two gains, one for each event. The cross term plays an important role in determining the performance of a linear multiplicative gain filter. Equation (42) can be written in terms of $\xi_k$ and $p_E$. When (45) is substituted into the cross term, the mathematical formulas for LBSE for constructive and destructive events are obtained. Therefore, DCP plays a significant role in determining the optimum $p_E$ value. In FIGURE 5, the percentage value of MSE (equation (64)) is calculated to show the improvement for different $p_E$ values.
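The step of differentiating the linear MSE and equating it to zero can be written out as a short worked derivation. It assumes only the linear relation $\hat{X}_k = G_k Y_k$ and the additive model $Y_k = X_k + D_k$; it is a generic sketch and does not reproduce the paper's own expressions in terms of $p_E$ and DCP.

```latex
\begin{align}
e_k &= E\{(X_k - G_k Y_k)^2\}
     = E\{X_k^2\} - 2 G_k E\{X_k Y_k\} + G_k^2 E\{Y_k^2\}, \\
\frac{\partial e_k}{\partial G_k} &= -2 E\{X_k Y_k\} + 2 G_k E\{Y_k^2\} = 0
  \;\Longrightarrow\;
  G_k = \frac{E\{X_k Y_k\}}{E\{Y_k^2\}}, \\
G_k &= \frac{E\{X_k^2\} + E\{X_k D_k\}}
            {E\{X_k^2\} + 2 E\{X_k D_k\} + E\{D_k^2\}} .
\end{align}
```

When the cross term $E\{X_k D_k\}$ vanishes, the gain reduces to the Wiener form $\xi_k / (1 + \xi_k)$ with $\xi_k = E\{X_k^2\}/E\{D_k^2\}$. Conditioning on an event changes the sign of the cross term: it is non-negative under $E^+$ and negative under $E^-$, which is why two different gains arise.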
Because the term $E[X_k N_k]$ of LBSE is not zero in this work, the MSE formula of LBSE changes accordingly. Meanwhile, $G_k$ covers two events of noise interference, and the two cases are assumed to be equally probable under normal conditions; the general formula of $e_k^{LBSE}$ is obtained under this assumption.

FIGURE 5: MSE Improvement of LBSE
The percentage of MSE improvement is plotted as a function of $\xi_k$ to demonstrate the improvement of the LBSE estimator over WF for different $p_E$ values. Evidently, no improvement occurs when $\xi_k = 0$. Meanwhile, the percentage of improvement, $\delta_e$, increases gradually as $p_E$ increases. The improvement in the proposed estimator reaches nearly 25% at $\xi_k = \pm 30$ dB for $p_E = 0.5$ and 65% for $p_E = 0.8$. After a clean speech signal is estimated, the inverse DKTT is applied to convert the signal back to the time domain. The workflow of the proposed system is presented in FIGURE 6.

III. GAIN CHARACTERISTIC OF LBSE AND NBSE ESTIMATORS
In this section, the characteristics of the two proposed estimators are presented to illustrate their performance in filtering out unwanted components of a noise signal. For a constructive event, an attenuation estimator is required. For a destructive event, an amplification filter is required. The following sections present the characteristics of NBSE and LBSE. Each estimator has two gain formulas, one for each event.

A. GAIN CHARACTERISTIC OF LBSE ESTIMATOR
In FIGURE 7a, various gain curves against $\xi_k$ for different values of $p_E$ are plotted for a destructive case. The gain formula for LBSE is a function of $\xi_k$. $p_E$ has different values because DCP has varying values.
Evidently, the gain value is equal to 0.5 for all values of $p_E$ when $\xi_k = 0$ dB. By contrast, the highest curve in the region $\xi_k > 0$ dB is for $p_E = 0.8$, which amplifies the signal for approximately $\xi_k > 2$ dB. For $\xi_k < 0$ dB, the filter with $p_E = 0.5$ delivers less attenuation than the others. For both estimators, the gain curves tend toward zero gain as $\xi_k$ approaches $\infty$ or $-\infty$. The figure clearly shows that the gains do not always amplify noisy components as predicted for a destructive event. This counter-intuitive phenomenon can be explained if the occurrence of polarity reversal [22] is considered in a destructive event, particularly for regions where the gain has a negative value. In FIGURE 7b, the gain curves are plotted for a constructive case, $G_c^{LBSE}$. The plots are superimposed for better comparison. Evidently, when $\xi_k = 0$ dB, the values of all the gains are equal to 0.5. In addition, for $\xi_k > 0$ dB, the curve for $p_E = 0.8$ provides more attenuation to the signal, which is suitable for a constructive event, and vice versa. For both estimators, the gain curves tend toward zero gain as $\xi_k$ approaches $\infty$ or $-\infty$. Evidently, all gains are attenuation gains and are less than unity in all the regions of $\xi_k$. This property is appropriate for a constructive condition because the noise interference in such a case always tends to increase noisy speech signals.

B. GAIN CHARACTERISTIC OF NBSE ESTIMATOR
NBSE is a nonlinear estimator, the output of which is not linear with its input signal. NBSE is considerably harder to derive than LBSE. The NBSE gain is a function of two parameters, namely, $\xi_k$ and $\gamma_k$. The 2D and 3D plots of the NBSE gain curves are presented. Only white noise is presented due to space limitations. The NBSE gain function, $G_c^{NBSE}$, is plotted in FIGURE 8 as a function of $\xi_k$ for $\gamma_k = -10$ dB and $\gamma_k = 5$ dB.
In general, the attenuation gain curves decrease gradually as $\xi_k$ decreases, which is good for maintaining signal distortion at an appropriate value. The 3D plot clearly shows that the gain of NBSE increases slightly and progressively as $\gamma_k$ increases, thereby increasing the opportunity to improve the enhancement process. For a constructive case, the attenuation is low. Furthermore, it converges rapidly toward a higher gain as the value of $\gamma_k$ increases. NBSE provides an attenuation filtering gain at nearly all gain levels, which is significant for a constructive event.
FIGURE 9 shows the 3D and 2D plots of the parametric NBSE gain function, $G_d^{NBSE}$, when a destructive event is considered. In FIGURE 9a, the 3D plot of the NBSE gain is shown based on the variation of two parameters, $\xi_k$ and $\gamma_k$. The 2D plot in FIGURE 9b shows the parametric gain curves as a function of $\xi_k$ for $\gamma_k = -10$ dB and $\gamma_k = 5$ dB. For small values of $\gamma_k$, the gain becomes higher than unity for the range of $\xi_k > -6$ dB, as it should be for a destructive event. However, when $\gamma_k$ increases, the NBSE filter tends to provide an attenuation gain that is appropriate for the case of polarity reversal, which may occur in this interference case. NBSE amplifies or attenuates each noise component in proportion to the estimated $\xi_k$ when $\gamma_k$ is constant. Interestingly, the gain levels are smaller than one when $\xi_k$ is small, which causes attenuation in a degraded signal. However, the gain value crosses the unity gain (0 dB) as $\xi_k$ increases to provide an amplification gain.

IV. THE EVALUATION OF THE PROPOSED SEA
An assessment of the proposed SEA is presented in the following sections.

A. ACCURACY EVALUATION OF NOISE CLASSIFICATION METHOD
In the noise classification phase, 100 speech files are taken from the well-known TIMIT dataset [44] for the training phase, whereas 300 speech files are taken for the testing phase. The sampling rate is 16 kHz, and a Hamming window is used with 75% overlap. The speech files denoted by SA1 and SA2 are obtained for male and female speakers, respectively: 150 files for "SA1" and 150 files for "SA2". The speech signals are corrupted by eleven selected noise types. Ten of these are selected from the NOISEX-92 dataset [64], and the eleventh is speech-shaped noise [44]. The types of noise are white, pink, F16, buccaneer, factory, babble, engine room noise, operation room noise, leopard, M109, and speech-shaped noise. Moreover, five SNR levels (-10, -5, 0, 5, and 10 dB) are utilized for each noise type. The length of training and testing data for each SNR level is approximately 25 ms. Noise classification is carried out on the first six frames of the noisy speech. The features in the moment domain are obtained directly from the noisy signal, and no other features are required to achieve successful noise classification. Therefore, the computational complexity of the classification process is low. After the feature extraction process, the training stage is performed to obtain the classifier model. This classifier model is used as a pre-stage before the SE process to classify the 11 types of noise. The procedure of the noise classification method can be summarized in the following steps: Step 1: The input is a noisy signal (speech + noise) from the TIMIT [65] and NOISEX-92 [64] databases

The separation boundaries of the different classes in the SVM are determined by choosing an appropriate kernel function. As a reasonable choice, we adopted the polynomial kernel function of degree two (d = 2 and r = 1) because this kernel has the lowest classification error compared with the linear, radial basis function, and sigmoid kernels [66]. Its formula is as follows:

$$K(x_i, x_j) = \left(\gamma\, x_i^{T} x_j + r\right)^{d}$$

Cross-validation and grid search are used to tune the kernel parameter (γ) and the penalty parameter (C), where the cross-validation procedure helps prevent overfitting. In this work, 5-fold cross-validation is applied because of its simplicity. The whole dataset is partitioned into five disjoint, equal-size subsets. The process is repeated five times, each time using four folds for training and the remaining fold for validation, on which the test error is calculated; finally, the validation error rates of the five experiments are averaged. In each run, the best parameters of a classification algorithm for a class pair are explored through 5-fold cross-validation with a grid search mechanism on the training set. The classifier with the least validation error is selected for each class pair.
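The tuning procedure described above can be illustrated with a minimal, standard-library sketch. It assumes the common polynomial kernel form (γ⟨x, y⟩ + r)^d with d = 2 and r = 1; the function names (`poly_kernel`, `k_fold_indices`, `cross_val_error`, `grid_search`) and the `make_error_fn` callback are hypothetical and are not taken from the paper, and the actual SVM training is left to the caller.

```python
import random

def poly_kernel(x, y, gamma, r=1.0, d=2):
    """Polynomial kernel (gamma * <x, y> + r) ** d; d = 2 and r = 1 follow the text."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + r) ** d

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k disjoint, near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_error(n_samples, error_fn, k=5):
    """Average validation error over k runs (train on k-1 folds, validate on the rest)."""
    folds = k_fold_indices(n_samples, k)
    errors = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        # error_fn is assumed to train on train_idx and return the error on val_idx
        errors.append(error_fn(train_idx, val_idx))
    return sum(errors) / k

def grid_search(n_samples, make_error_fn, gammas, costs, k=5):
    """Return (error, gamma, C) for the pair with the lowest mean validation error."""
    best = None
    for g in gammas:
        for c in costs:
            err = cross_val_error(n_samples, make_error_fn(g, c), k)
            if best is None or err < best[0]:
                best = (err, g, c)
    return best
```

Here `make_error_fn(gamma, C)` is expected to return a function that trains a classifier on the training indices and reports its error on the validation indices; any SVM implementation could be plugged in at that point.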
The summary of the testing phase is provided in TABLE 4. The average accuracy of the proposed noise classification method over all eleven types of noise is 99.43%. For example, the accuracy for buccaneer noise (3rd class) is 99.87%, and the percentage of factory noise classified as babble noise is 1.2%. A lower accuracy, i.e., 97.60%, is obtained for babble noise (2nd class). The confusion matrix shows that the proposed noise classification method attains high accuracy in different noise environments.
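As a brief illustration of how such figures are read from a confusion matrix, the sketch below computes per-class accuracy, average accuracy, and a single confusion rate. The 3-class matrix is a hypothetical toy example and does not reproduce the values in TABLE 4; the function names are likewise illustrative.

```python
def per_class_accuracy(confusion):
    """Row i holds the true class i; entry [i][j] counts files predicted as class j."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(confusion)]

def average_accuracy(confusion):
    """Mean of the per-class accuracies, in percent."""
    acc = per_class_accuracy(confusion)
    return sum(acc) / len(acc)

def confusion_rate(confusion, true_cls, pred_cls):
    """Percentage of files of class true_cls that were labeled as pred_cls."""
    row = confusion[true_cls]
    return 100.0 * row[pred_cls] / sum(row)

# Hypothetical counts for three classes (NOT the values reported in the paper)
toy = [[98, 2, 0],
       [1, 97, 2],
       [0, 0, 100]]

print(per_class_accuracy(toy))
print(average_accuracy(toy))
print(confusion_rate(toy, 1, 2))
```

An off-diagonal entry such as `confusion_rate(toy, 1, 2)` corresponds to the kind of statement made in the text about one noise type being labeled as another.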

B. PERFORMANCE EVALUATION OF NLBSE USING QUALITY AND INTELLIGIBILITY MEASURES
This section provides a performance assessment of the proposed SEA compared with several existing methods to establish its capability in suppressing noise. A comparative evaluation is used to assess speech intelligibility and quality. Although listening tests are the gold standard for speech quality evaluation, they are expensive and time-consuming, which limits their application [67]. Accordingly, powerful objective measures are adopted in the present study. The number of speech files used in this experimental test is 64, with different speakers (32 males and 32 females), which are randomly selected from the TIMIT database [65] to make the work complementary with mean opinion scores for hearing quality. The decision-directed approach [12] is implemented to compute the estimated $\xi_k$, with the value of ∝ varied according to the noise type. The tests are performed on the eleven types of noise [64] with SNRs of -10, -5, 0, 5, and 10 dB. Then, five quality measures are used: PESQ [68], composite measures (SIG, BAK, and OVL) [69], [70], and FWSNR [67]. Two intelligibility measures are used, namely, CSII [71] and STOI [72]. A comprehensive assessment is performed on four selected classes of methods: (1) traditional estimators: WF [3] and the nonlinear MMSE estimator [12]; (2) low-distortion methods: dual-gain Wiener (DGW) [29], Laplacian-Gaussian mixture-based dual-gain Wiener filter (LGMDGW) [73], and dual MMSE estimator (DMMSE) [22]; (3) two-stage SEA using OP: two-stage-based DKT estimator [6] (TSDKTE) and two-stage-based DTT estimator [6] (TSDTTE); and (4) a recent method called the optimally modified log-spectral amplitude based on noise classification (COMLSA) [44].
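For readers unfamiliar with the decision-directed approach, the following sketch shows its widely used textbook form for a single transform coefficient. It is an assumption that the paper's variant follows this form; the function name, argument names, and the lower bound `xi_min` are illustrative and not taken from the paper, and the noise-type-dependent values of the smoothing parameter are not reproduced here.

```python
def decision_directed_xi(prev_clean_power, noise_power, noisy_power, alpha, xi_min=1e-3):
    """A priori SNR estimate for one coefficient k in the current frame.

    prev_clean_power : squared magnitude of the enhanced coefficient, previous frame
    noise_power      : estimated noise variance for coefficient k
    noisy_power      : squared magnitude of the noisy coefficient, current frame
    alpha            : smoothing parameter in [0, 1), chosen per noise type
    xi_min           : lower bound on the estimate (an assumption of this sketch)
    """
    gamma_post = noisy_power / noise_power            # a posteriori SNR
    xi = (alpha * prev_clean_power / noise_power
          + (1.0 - alpha) * max(gamma_post - 1.0, 0.0))
    return max(xi, xi_min)

# Example with made-up powers and alpha = 0.98
print(decision_directed_xi(2.0, 1.0, 4.0, 0.98))
```

A larger smoothing parameter weights the previous frame more heavily, which generally reduces fluctuations in the estimate at the cost of slower tracking of speech onsets.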
Each speech signal is divided into frames with a length of 18 ms. The standard Hamming window with 75% overlap is used for the framing process. The optimal value of parameter p in the DKTT transform is set to 0.2. For recombination, the enhanced speech signal in each frame is synthesized via the overlap-add method [74]. White noise is selected as an example, as shown in FIGURE 10, to calculate quality and intelligibility measures. FIGURE 10 shows that NLBSE provides higher measurement values for all noisy conditions, except for SNR = 10 dB in FWSNR and SNR = 5 dB and 10 dB in CSII, where the NLBSE value is comparable with those of the other algorithms. In general, NLBSE provides the best results at low SNR levels for PESQ, SIG, BAK, and OVL. NLBSE is verified to have the highest value compared with the other selected methods.
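The framing and synthesis steps can be sketched as follows. This is a minimal illustration that assumes a 16 kHz sampling rate (so an 18 ms frame is 288 samples and 75% overlap gives a hop of 72 samples) and a symmetric Hamming window; the enhancement applied between analysis and synthesis is omitted, and the window-sum normalization in `overlap_add` is a choice made for this sketch rather than a detail stated in the paper.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(x, frame_len, hop):
    """Split x into windowed frames; a trailing partial frame is dropped."""
    w = hamming(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return [[x[s + i] * w[i] for i in range(frame_len)] for s in starts]

def overlap_add(frames, frame_len, hop):
    """Sum overlapping frames and divide by the summed window to undo its weighting."""
    w = hamming(frame_len)
    n = hop * (len(frames) - 1) + frame_len
    out = [0.0] * n
    norm = [0.0] * n
    for f_idx, frame in enumerate(frames):
        s = f_idx * hop
        for i in range(frame_len):
            out[s + i] += frame[i]
            norm[s + i] += w[i]
    return [o / g if g > 1e-12 else 0.0 for o, g in zip(out, norm)]

FS = 16000                      # assumed sampling rate in Hz
FRAME_LEN = FS * 18 // 1000     # 18 ms frame
HOP = FRAME_LEN // 4            # 75% overlap
```

With no processing between `frame_signal` and `overlap_add`, the input is recovered up to rounding error, which is a convenient sanity check before inserting an estimator between the two steps.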
PESQ is known for its high correlation with OVL measures, which, in turn, exhibit a significant correlation with subjective speech quality [44]. Meanwhile, the results for speech-shaped noise in FIGURE 11 show that NLBSE is better than all the other algorithms. The experimental results for the other types of noise indicate that NLBSE provides the highest values in nearly all noise situations.

The amount of residual noise in the enhanced speech cannot be quantified easily by using only objective measures. However, spectrogram representations of the enhanced speech can provide additional details on the time-frequency distribution. FIGURE 12 shows the spectrogram plots of a speech sentence obtained from the TIMIT dataset that was corrupted by white noise at 0 dB SNR. The sentence used is "She had your dark suit in greasy wash water all year." The spectrogram of the noisy signal is shown in FIGURE 12b, and the spectrograms of NLBSE and the other methods are displayed in the same figure. Clean and noisy spectrograms are also provided for comparison and to confirm the optimal processing of the proposed SEA. NBSE is also presented to demonstrate the capability of LBSE. Evidently, a clean signal is regenerated using NLBSE without noticeable signal distortion and with minimum residual noise; no noise surrounds the original signal in the spectrogram. Moreover, the spectrograms show how the opportunity to enhance a noisy signal is increased by the second, post-processing filtering stage shown in FIGURE 12c. The analysis of the other algorithms starts with DGW and DMMSE. FIGURES 12f and 12g clearly show that residual noise, including MN, surrounds the original signal. For COMLSA, FIGURE 12h shows less residual noise, which appears as isolated peaks in the frequency domain. By contrast, the TSDTTE estimator efficiently removes residual noise; however, speech distortion is clearly visible in FIGURE 12e. Evidently, the spectrogram view reinforces the capability of NLBSE to remove noise with less speech distortion and residual noise, including MN.

V. CONCLUSION AND FUTURE WORK
This paper addresses significant problems of SEA in estimating a clean speech signal under different environments of background noise. The proposed SEA adopts a noise classification method, which is used to search for accurate speech and noise models. A new super-Gaussian composite is assumed and used first in modeling. Two stages of estimators are derived based on the models, namely, NBSE and LBSE, which are distinct from other estimators in terms of their analytical solution. These two estimators are based on a low-distortion approach and the MMSE sense; they are then combined in cascade to realize NLBSE. NLBSE is proposed to minimize distortion under different conditions of the underlying speech signal during the enhancement process without compromising the noise reduction process. It is adopted by considering the interference between clean and noise signals and the type of noise. Only a few algorithms deal with these approaches. The proposed estimators address the polarity reversal issue that occurs when noise components are stronger than signal components. High-performance noise suppression is achieved from the NBSE output, with further enhancement of speech perceptual aspects in addition to a reduced MN effect in LBSE.
The analytical solutions of MMSE for linear and nonlinear estimators are derived. The outcomes of the proposed estimators demonstrate their effectiveness and capability to reduce unwanted noise in terms of different speech quality and intelligibility measures. The simulation results for different noisy conditions clearly show that the proposed work reduces corrupting noise in a degraded signal in a superior manner compared with various existing methods. In the future, the proposed work will be applied to calculate an optimum value for the polarity estimator factor in practical cases of bilateral gain. Furthermore, other types of super-Gaussian prior and noise will be examined.
classify the statistical properties of noise. Then, three different sets of parameters are determined properly: the distribution controlling parameter (DCP), the expectation parameter ($P_E$), and the smoothing parameter (∝). The third phase is the nonlinear bilateral super-Gaussian estimator (NBSE). The fourth phase is the linear bilateral super-Gaussian estimator (LBSE). NBSE and LBSE are two-stage estimators based on the MMSE sense. They are combined in a cascading form to formulate the NLBSE. Finally, the inverse of DKTT, and then an overlap-add technique, are applied to synthesize the original speech signal back to the time domain. The proposed SEA phases are shown in FIGURE 1 and explained in the succeeding subsections.

FIGURE 1: The general scheme of the proposed SEA.

FIGURE 2: Steps used to find the DCPs for the fitting model of (a) the speech signal and (b) the noise signal.

FIGURE 3: The proposed PDF of (a) clean speech and (b) white noise DKTT coefficients versus other PDFs.
which controls the standard deviation. The other two parameters control the Laplacian amplitude and standard deviation, which are $A_L = 0.1839$ and $C_L = 0.03$, respectively. Meanwhile, the distribution of the DKTT noise coefficients has 44 different DCP values because the eleven types of noise have four DCPs each. $B_{d1}$ and $B_{d2}$ control the amplitude of the dual Laplacian PDF. $C_{d1}$ and $C_{d2}$ control the standard deviation of the dual Laplacian PDF.

FIGURE 4: The proposed PDF for different types of noise.
$p_E$ significantly affects noise reduction along with speech distortion. In LBSE, the same PDFs for speech and noise are assumed, and thus, the term $E[|X_K||D_K|]$ must be calculated for these models. The expectation values of the speech signal $E[|X_K|]$ and noise signal $E[|D_K|]$ are [63]:

FIGURE 6: The block diagram of the proposed SEA.

FIGURE 8: Gain curves of NBSE for white noise in the constructive event.

FIGURE 9: Gain curves of NBSE for white noise in the destructive event.
of 400 × 11 × 5 speech files. These files consist of 400 speech files, 11 types of noise, and five SNR levels for each speech file, with a frame size of 400 samples (25 ms).
Step 2: The initial six frames of each noisy file are taken to extract 50 features, which comprise 25 mean power features and 25 mean standard deviation features.
Step 3: The 22,000 speech files are divided into two sets: a training set (5,500) and a test set (16,500). The training set is treated by 5-fold cross-validation.
Step 4: The training set is used to train the multi-class SVM. The parameters of the SVM are adjusted to minimize the average error of the 5-fold cross-validation using grid search.
Step 5: The test dataset is used to analyze the performance of the classifier and to calculate the confusion matrix. If the performance is acceptable, the classifier is output; otherwise, the procedure returns to Step 4 to retrain the parameters of the SVM model.
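The dataset sizes quoted in these steps follow directly from the experimental setup (100 training and 300 testing speech files, 11 noise types, 5 SNR levels, 16 kHz sampling). The short sketch below reproduces the arithmetic; the constant and function names are illustrative only.

```python
N_NOISE_TYPES = 11
N_SNR_LEVELS = 5            # -10, -5, 0, 5, 10 dB
N_TRAIN_SPEECH = 100
N_TEST_SPEECH = 300
FS = 16000                  # sampling rate in Hz
FRAME_SAMPLES = 400
N_FOLDS = 5

def n_noisy_files(n_speech):
    """Each speech file is mixed with every noise type at every SNR level."""
    return n_speech * N_NOISE_TYPES * N_SNR_LEVELS

total_files = n_noisy_files(N_TRAIN_SPEECH + N_TEST_SPEECH)
train_files = n_noisy_files(N_TRAIN_SPEECH)
test_files = n_noisy_files(N_TEST_SPEECH)
fold_size = train_files // N_FOLDS          # files per cross-validation fold
frame_ms = 1000 * FRAME_SAMPLES / FS        # frame duration in milliseconds
n_features = 25 + 25                        # mean power + mean standard deviation

print(total_files, train_files, test_files, fold_size, frame_ms, n_features)
```

Under these numbers, each of the five cross-validation folds holds 1,100 noisy files, so every run trains on 4,400 files and validates on the remaining 1,100.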

FIGURE 10: Comparison test under the white noise condition for seven measurements.

FIGURE 11: Comparison test under the speech-shaped noise condition for seven measurements.

FIGURE 12: Result of the enhancement process for the male utterance "She had your dark suit in greasy wash water all year" taken from the TIMIT database and corrupted by white noise at 0 dB SNR. The spectrogram plots of (a) clean speech and (b) the noisy signal, and of the enhanced signals using (c) NLBSE, (d) NBSE, (e) TSDTTE, (f) DGW, (g) DMMSE, and (h) COMLSA.

TABLE 1: Table of notations
d: gain of the NBSE destructive events
δe: percentage of MSE improvement

distortion estimator with models that fit well with speech and noise data signals to provide minimum levels of speech distortion and residual noise with additional improvements in speech perceptual aspects. The proposed SEA combines the advantages of Laplacian and gamma priors for modeling speech and noise signals in a real transform to provide good enhancement performance.

TABLE 2: Table of abbreviations

TABLE 3: DCP and ∝ for different types of noise

the PDF distribution for the other types of noise, which confirms the accurate mapping of the proposed model. The objective of the proposed NBSE is to find $\hat{X}_k$ by minimizing the MSE between $\hat{X}_k$ and $X_k$. NBSE and LBSE have two gains each, namely, attenuation and amplification, based on the low-distortion approach. The derivation begins with the MSE formula: